Resource Discovery

Edward A. Fox, V. S. Subrahmanian & John M. Carroll

Finding the right information on the Web is a difficult problem, especially in light of the "information explosion" and the rapid growth of the Web. Similarly, finding the right information in a Digital Library [Fox 94; Fox 95] is a hard problem. Even finding the right information in a single large collection is difficult. Though in one sense these three tasks are all embodiments of the generic problem of "information retrieval", they differ in scale, supporting methodologies, availability of supplemental or meta information, complexities arising from the requirement for distributed processing, and performance demands. In the WWW, resource discovery builds upon what was discussed in Section 4.1 regarding Indexing, Semantics & Representation.

For the foreseeable future, work on the problem of information retrieval in large collections will be of value. There are continual improvements in this field, and allied areas (e.g., natural language processing) also continue to yield new tools and techniques that can be tested and integrated with other approaches. Connections to hypertext, browsing, and visualization are all important parts of the solution. Success here is certainly of value in solving similar problems for the WWW.

Recommendation 1: Information Retrieval Research

We recommend ongoing support for research into "information retrieval" and the fields of hypertext, multimedia systems, and human-computer interaction, especially as they relate to the problem of finding information in large collections. Advances in this regard usually can be applied directly to the larger problems of finding information in digital libraries or the WWW, since it seems that many advanced retrieval methods have scaled up after only minor refinement (see experiences in connection with the ARPA-funded TIPSTER and TREC projects). Yet new problems also arise at the next level, that of digital libraries.

Digital libraries are large, and usually are made up of a number of heterogeneous collections. They will typically include at least a moderate amount of multimedia information, which presents new challenges relative to searching text alone. They make it more important to manage variety in item size. In addition to indexing, cataloging is required, usually to handle large items or item collections. Other problems also arise, such as eliminating duplicates or relating multiple versions of the same item.
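The duplicate and version problem mentioned above can be illustrated with a small sketch. It is only one possible approach, assuming plain-text items and using word shingles compared by Jaccard similarity; the function names, the shingle size `k`, and the threshold are illustrative choices, not something this report prescribes.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short items are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=3):
    """docs maps an item id to its text.

    Returns (id1, id2, similarity) for every pair whose shingle sets
    overlap at or above the (assumed) threshold.
    """
    sigs = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sigs)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            sim = jaccard(sigs[first], sigs[second])
            if sim >= threshold:
                pairs.append((first, second, sim))
    return pairs


docs = {
    "a": "Digital libraries are large and heterogeneous collections.",
    "b": "digital libraries are large, and heterogeneous collections",
    "c": "Video archives need domain specific processing.",
}
print(near_duplicates(docs))
```

Pairs scoring just below the threshold are natural candidates for "multiple versions" rather than exact duplicates. The all-pairs comparison here is quadratic in the number of items, so a collection of realistic size would need some form of indexing or sampling on top of this idea.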

Since traditional libraries are often located and planned to serve a range of user needs, such as university education and research, digital libraries will certainly be required to support important user tasks. Part of this means facilitating search, navigation and browsing in and between multiple spaces, such as document space, concept space, keyword space, and others relating to strings, features, or linguistic constructs. Whether one deals with geographic information, video archives, or data from outer space, many of the same general approaches will apply, especially if integrated with domain specific processing.

Adapting these general approaches to particular types of collections, however, warrants empirical validation in a variety of representative situations. It is also necessary to develop a theory and practice of specification of complex collections and their processing, whereby: specialized methods can be applied to each multimedia data type; new approaches can be developed to handle the complex interrelationships of those data items into common structures (e.g., multimedia streams and performances, composite documents, multimodal interaction schema); and distributed retrieval operations can effectively operate upon such large and complex heterogeneous collections (as found in digital libraries).
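As a rough illustration of retrieval over heterogeneous collections, the sketch below merges ranked result lists returned by separate collections, each with its own scoring scale. It assumes each collection returns (item id, score) pairs, and it uses simple min-max normalization as a stand-in for whatever merging strategy a real system would adopt; all names and data are hypothetical.

```python
def normalize(results):
    """Min-max scale the scores of a [(item_id, score), ...] list to [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every item as equally relevant.
        return [(item_id, 1.0) for item_id, _ in results]
    return [(item_id, (score - lo) / (hi - lo)) for item_id, score in results]


def merge_results(per_collection, top_n=10):
    """per_collection maps a collection name to its ranked result list.

    Returns the top_n (collection, item_id, normalized_score) triples,
    ordered by normalized score, with ties broken by name and id.
    """
    merged = []
    for name, results in per_collection.items():
        for item_id, score in normalize(results):
            merged.append((score, name, item_id))
    merged.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(name, item_id, score) for score, name, item_id in merged[:top_n]]


# Hypothetical result lists from a text collection and a video collection.
results = {
    "text": [("t1", 12.0), ("t2", 6.0), ("t3", 0.0)],
    "video": [("v1", 4.0), ("v2", 2.0)],
}
print(merge_results(results, top_n=3))
```

Min-max scaling ignores how selective each collection is, so it can over-promote the best item of a weak collection. That limitation is one reason the merging step itself warrants the empirical validation called for above.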

Digital libraries are possible now in part due to advances in electronic publishing (interpreted in the broad sense). People create most documents with computers, and also capture a large percentage of multimedia objects in digital form. With proper standards, much of what is needed by way of indexing, cataloging and metadata can be automatically recorded in usable forms. Such standards and practices can be applied to the WWW at large, especially if urgently needed research develops tools and techniques that allow domain-specialized standards to be used in general purpose environments (like the Web) without losing their expressive power or requiring labor-intensive coding or translations.

With so much automatic processing, the trend to disintermediation will continue. Thus, an important challenge in developing digital libraries is that of preserving as much as possible of the knowledge and skills that expert intermediaries demonstrate. Library and information science have much to contribute, and knowledge acquisition in this regard is of particular value. In addition, work on models, architectures, prototypes and tests regarding agents must be undertaken to see if such knowledge can be properly applied. Hence:

Recommendation 2: Continued NSF Support for Digital Libraries

We recommend ongoing support by NSF for research on digital libraries. Just as the utility of the ARPA TIPSTER project was greatly enhanced by involving many smaller project groups through the TREC efforts, a significant amount of support should be provided for investigators not funded under the Digital Library Initiative (DLI), since they are likely to make important contributions using the new testbeds that are already under development. It would be helpful if these new projects could build upon the DLI testbeds, accessing part of the corpora, using software libraries, or analyzing sanitized access logs. Important leveraging will result if new investigators can visit the DLI sites in person or electronically, participate as colleagues in special workshops, or directly make use of the developing DLI research facilities.

Another way to think about the future WWW collection of NSF-related information is as a digital library for science and engineering research and education. As such, it deserves serious study, and requires careful coordination and attention. Innumerable important resources will be included, supporting user tasks related to education and research. Program managers have particular information needs(1): managing their own information and also accessing information about the scientific community to find reviewers, panelists, and related activities of other agencies, as well as to contact investigators and task force attendees. The resource discovery needs of investigators operate in reverse, since they must learn about NSF programs, program managers, previous awards, funded proposals, project reports, and a myriad of related documents. They may find it helpful to locate and communicate with investigators funded by NSF, or others who have attended NSF-sponsored workshops. Clearly, it is possible to describe many of the needs of program managers or investigators. Much harder is the task of determining the resource discovery needs of the rest of the nation in terms of NSF-related information. Hence:

Recommendation 3: Perform Needs Analysis of NSF

We recommend that NSF arrange for a careful analysis of the data, information and knowledge needs of its staff, of investigators in the sciences and engineering, and of others in the nation who might benefit from accessing what should evolve into a digital library of NSF-related materials. This resource discovery needs analysis should deal not only with content but also with organization, task specification, access and manipulation capabilities, and user interface characteristics.

With this analysis complete, NSF will have a roadmap to build, through efforts of those it employs or funds, a significant digital library. As this evolves, it will relate to larger and larger portions of the WWW. Thus, it is important for NSF's own operations that it encourage the improvement of resource discovery methods on the WWW. NSF support for such work would also benefit other uses of the WWW. In particular, there is now little funded research dealing with the many scientific problems tied to resource discovery on the WWW. Hence:

Recommendation 4: Targeted Research Areas

We recommend that NSF fund research aimed at discovering useful resources in the WWW:

Some joint workshops involving researchers dealing with document models and searching, object technologies, and various types of scientific databases might stimulate work on the above areas in the context of important applications (e.g., genomics, social and behavioral database research, and digital libraries).

References

Fox E., Akscyn R., Furuta R., & Leggett J. (1995) Guest Editors' Introduction to Digital Libraries. Communications of the ACM, 38(4), (in press).

Fox E., ed. (1993) Sourcebook on Digital Libraries: Report for the National Science Foundation. TR-93-35, VPI&SU Computer Science Dept., Blacksburg, VA. Available by anonymous FTP from directory /pub/DigitalLibrary on fox.cs.vt.edu.


Footnotes

(1) See position statement by J. Hestenes in Appendix D of the on-line version of this report.