
Indexing, Semantics & Representation

Robert C. Berwick & Edward A. Fox

Discovering resources and retrieving information on the WWW (see also Section 5.3) depends on indexing information to capture its underlying semantics, and on representing that information to facilitate search and browsing. Today, the best retrieval systems for the Web rely on programs that walk regions of the Web and build indices for statistical retrieval based on each document's full text. Automatic meta-indexes or "indexes of indexes" such as the Configurable Unified Search Engine (CUSI) <URL:http://web.nexor.co.uk/public/cusi/doc/search.html> provide a partially credible global indexing platform, but because they work from a static snapshot of the Web, they cannot adapt to its dynamic aspects, such as documents generated on the fly. The reasons for the current inadequate state of WWW indexing are straightforward. Because it takes a long time to walk the Web (three months as of March 1994), global indices are out of date before they are compiled, and the time required to traverse the entire Web is increasing, presumably exponentially, in lockstep with the growth of information on the Web. In general, Web walkers cannot index pages that are synthesized on the fly or are gatewayed from other information systems (e.g., searches in relational or text databases).
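The statistical full-text indexing described above can be sketched as an inverted index with a simple TF-IDF ranking. This is a minimal illustration under stated assumptions, not the method of any system named in this chapter; the function names, the weighting formula, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: term_frequency} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, n_docs, query):
    """Rank documents by a summed TF-IDF score over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term occurs in no document
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample documents, standing in for pages fetched by a walker.
docs = {
    "a.html": "Indexing the Web with a Web walker",
    "b.html": "Manual indexing of medical literature",
    "c.html": "Weather report",
}
index = build_index(docs)
ranking = search(index, len(docs), "web indexing")
```

Such an index reflects only the snapshot of pages the walker fetched, which is why it goes stale as the Web changes and why it cannot cover pages synthesized at request time.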

Retrieval of sounds, images, speech, and text, while highly desirable, remains a research goal. In general, weaker indexing methods are used: statistical retrieval, or retrieval using manual indexing. There is currently no commonly agreed-upon method: the WAIS protocol leaves open the search engine to be used, as does GILS (the Government Information Locator Service, OMB's system built on the Z39.50 protocol). There is ample room for improving the effectiveness of all Web-based retrieval systems through better indexing. The following discussion identifies particular problem areas and then offers recommendations to improve indexing on the WWW.

Quantity and dynamics. The amount of networked information is very large and growing very rapidly. This presents a serious problem with updates (see above) that has yet to be adequately solved. It also makes developing "perfect searches" (those that have very high precision) very difficult; users will be forced to develop expertise at searching or spend more of their time looking at imperfectly ranked result sets unless better indexing and search systems are developed. Faster networks and faster computers will help, but like building better highways, they will in all likelihood just increase traffic flow; system administrators have learned this lesson many times about disk space.

Redundancy. The same information often appears in many forms, and needs to be distilled down to a non-redundant account. People use the redundancy of natural language to pack multiple descriptions into a few sentences. In the WWW there is redundancy from mirror sites, multiple formats, multiple media types, and successive versions. Note that minimization of redundancy does not necessarily conflict with the notion of providing multiple views of the same information.
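The simplest case of this redundancy, identical content served from mirror sites, can be detected by fingerprinting page content. The sketch below is illustrative only: the URLs and helper names are hypothetical, and exact-match hashing does not handle multiple formats, media types, or successive versions, which would require similarity measures rather than identity.

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    """Hash the text after collapsing whitespace and case differences."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def group_duplicates(pages):
    """Return sorted lists of URLs whose normalized content is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

# Hypothetical pages: the first two are mirrors differing only in whitespace.
pages = {
    "http://mirror-a.example/report.txt": "Indexing  the Web\n",
    "http://mirror-b.example/report.txt": "indexing the web",
    "http://other.example/notes.txt": "Unrelated notes",
}
duplicates = group_duplicates(pages)
```

An index could then store one entry per group while still listing every location, which keeps multiple views available without indexing the same text repeatedly.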

Indexing not in infrastructure. Although the HTTP protocol provides for typed links, these are not yet widely used to create meta-level assertions about documents. New documents are not added to site or global indexes upon entry. Additionally, the piggybacking of queries into the syntax of URLs does not scale well to large queries.

Recommendation 1: From HTML to HTIML

Just as the simplicity of HTML stimulated the development of the WWW, NSF should encourage the formulation of a HyperText Indexing Markup Language (HTIML). Such an "open standard" should encourage indexing. Paragraph-based indexing engines should be supplied as part of the software available for the WWW (see Recommendation 3 below). We propose that NSF sponsor a workshop to improve indexing of the WWW, considering approaches like HTIML. The HTIML proposal is to make something like an artificial English a "standard" for index markup, because the redundancy of English (or any other natural language) makes it a robust way to specify the queries used to retrieve text.

Recommendation 2: Syntax and Semantics

Research should be encouraged on specifying the linkage between syntax and semantics in documents marked up with SGML, and on how to improve searching with that information. While SGML allows description of document structure, and DSSSL allows description of the mapping from tagged documents to a layout presentation, there is no description of the mapping from tagged documents to a semantic representation. A specification for such mappings would allow documents to be indexed in a fashion that would consider semantic intent. Thus, names or dates in documents identified syntactically could be easily indexed by converting them to a suitable canonical representation. Later, at search time, a "query-by-example" scheme could ask for documents with a given name occurring in the list of authors, or in some bibliographic reference. Eventually, this scheme could become a standard for semantic specification to supplement the SGML suite of standards.
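The idea of converting syntactically identified names and dates to a canonical representation, and then querying by example, can be sketched as follows. This is an illustration under assumptions, not a proposed specification: it assumes the markup has already identified the author and date fields, and the accepted date formats, the name convention, and all sample data are hypothetical. Month names are parsed according to the current locale, assumed here to be English.

```python
from collections import defaultdict
from datetime import datetime

# Assumed input formats; a real mapping would be driven by the markup.
DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d", "%m/%d/%Y")

def canonical_date(raw):
    """Return the date in ISO form (YYYY-MM-DD), or None if unrecognized."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def canonical_name(raw):
    """Reduce 'Given Family' or 'Family, Given' to lowercase 'family, given'."""
    raw = " ".join(raw.split())
    if "," in raw:
        family, given = [part.strip() for part in raw.split(",", 1)]
    else:
        given, _, family = raw.rpartition(" ")
    return f"{family}, {given}".lower() if given else family.lower()

def build_field_index(docs):
    """Index each document under (field, canonical value) keys."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for author in fields.get("authors", []):
            index[("author", canonical_name(author))].add(doc_id)
        date = canonical_date(fields.get("date", ""))
        if date:
            index[("date", date)].add(doc_id)
    return index

# Hypothetical documents whose fields were identified by their markup.
docs = {
    "doc1": {"authors": ["Jane Q. Public"], "date": "March 1, 1994"},
    "doc2": {"authors": ["Public, Jane Q."], "date": "1 March 1994"},
}
index = build_field_index(docs)

# Query by example: the example value is canonicalized the same way.
matches = index[("author", canonical_name("Jane Q. Public"))]
```

Because both the indexed values and the query example pass through the same canonical form, surface variations in how a name or date was written do not affect the match.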

Recommendation 3: Information Brokers

Research should be encouraged to fit indexing, link data, and later retrieval into the distributed environment of the WWW. Traditionally, retrieval systems work with large databases and index all the data they contain. In the WWW this usually means that a server will do this for information it contains, building an index, and providing a search system to access that index and its data. Alternatively, some "worm" may collect a large amount of WWW information, index that, and provide search support for it, along with pointers to the original location. However, there are other scenarios that should be explored. For example, the Harvest approach allows for collection of data from a group of servers, indexing of that data, and searching in parallel against a number of such group servers or information brokers. Similarly, the Whois++ and Uniform Resource Characteristics (URC) approach by the Internet Engineering Task Force enables both global resource location and discovery via a hierarchical infrastructure of index passing and result caching. Other options deal with: where and how data is collected, where and how indexes are stored, and what degree of parallelism and centralization is involved in searching. Further, it is likely that some data will be indexed in several ways: e.g., for different purposes, perhaps with different depth or detail or processing involved. Thus, some systems might index a numeric table as a string of values while another might index it as row-label/column-label/value triples.
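The two ways of indexing a numeric table mentioned above can be contrasted in a short sketch. The table contents and function names are hypothetical, and the table is assumed to carry its labels in the first row and first column.

```python
def table_to_string(rows):
    """Flatten the data cells into one whitespace-separated string of values."""
    return " ".join(str(value) for row in rows for value in row[1:])

def table_to_triples(header, rows):
    """Return (row_label, column_label, value) for every data cell."""
    return [
        (row[0], column, value)
        for row in rows
        for column, value in zip(header[1:], row[1:])
    ]

# Hypothetical table: first column holds row labels, header holds column labels.
header = ["item", "q1", "q2"]
rows = [
    ["alpha", 1, 2],
    ["beta", 3, 4],
]

as_string = table_to_string(rows)
as_triples = table_to_triples(header, rows)
```

The string form supports only matching on values, whereas the triple form lets a search ask for the value at a given row and column, at the cost of a larger index.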

Given the rich link structure of the Web, that data should be reflected in the indexing of documents. Thus, links into and going out from a document characterize it, and should be used as part of the indexing, as is done with citation search systems, or those that include indexing of citation and bibliographic coupling data along with term indexing.
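A minimal sketch of link-based indexing data follows: incoming links are derived by inverting the outgoing links, and bibliographic coupling is measured as the number of link targets two documents share. The page identifiers and function names are hypothetical, and the link graph is assumed to have been extracted already.

```python
from collections import defaultdict

def invert_links(out_links):
    """Build a map from each target to the set of documents linking to it."""
    in_links = defaultdict(set)
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].add(source)
    return in_links

def coupling_strength(out_links, a, b):
    """Bibliographic coupling: number of targets that both a and b link to."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))

# Hypothetical link graph: document -> set of documents it links to.
out_links = {
    "p1": {"x", "y"},
    "p2": {"y", "z"},
    "p3": {"p1"},
}
in_links = invert_links(out_links)
shared = coupling_strength(out_links, "p1", "p2")
```

Stored alongside term postings, these sets would let a search system answer questions such as which documents cite a given page, or which documents are related because they cite the same sources.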

Research should consider how these various collection, selection, analysis, indexing, and retrieval methods can be integrated. Toolkits are needed to help, as are protocols, that would fit into a general scalable architectural framework. This must be related to natural language analysis methods and query interfaces.

Recommendation 4: Intelligent Agents

Indexing and interface methods should be extended to better support intelligent agents. Research is needed to bring together work on intelligent agents with other work on indexing and document analysis, to lead to new architectures, approaches, representations, and methods. Right now there is little interchange between workers adopting these different approaches. Each group addresses separate problems, such as handling knowledge representation and protocols for agents, or handling large index collections. However, considering these problems together can have great synergy. Thus, the Z39.50 standard for information retrieval in client-server situations could be extended to support richer document characterization and easier processing by agents. Many agents look for specific patterns or structures in data, and may use what they find to help with further analysis; this approach could work well with regular search systems handling very large collections.

On the interface side, agents should be developed that have a rich interaction with users. In some cases, indexing can assist, if linguistic structures (e.g., phrases) or syntactic elements (e.g., title, author list) are known to be required during searching. Thus, if an agent works with a user to disambiguate a word, identifying the precise word sense desired, that is of greatest value if indexing was carried out at the sense level. Usability studies (see Section 5.6) are needed to tie in with this research.

Recommendation 5: Knowledge Representation & Interchange Schemes

We recommend research into: knowledge representation and interchange schemes in a distributed environment of autonomous agents; upgrading of speech act theory into a theory of collaboration in large "invisible colleges"; applying knowledge bases to tailor and increase the efficiency of the lifelong learning experience of large numbers of individuals; developing "research environments" (e.g., for a chemist, or a numerical analyst) that integrate a variety of tools with related documentation as well as datasets and information resources; and constructing diagnostic systems that help humans or computers try to determine causes and repairs for various common types of failures (e.g., of equipment, of social groups, of economic enterprises). All of these studies and more will be enabled by the accumulation of vast information stores into a comprehensive information architecture. This type of basic research is needed to enable their efficient utilization by millions of future users of a World-Wide Knowledge Web, with reasonable performance. All of these investigations must occur in the environment of an open network, with methods and standards evolving as required regarding: representation, interchange, protocols, and interfaces. Interoperability is the key, and the WWW provides an environment inherited from the Internet where rapid prototyping and large-scale testing of small inventions is encouraged.

Recommendation 6: Federal Agency Efforts

NSF should work proactively to support other Federal Government efforts that involve or should require more effective indexing and searching, starting with running a workshop on this topic. Many government agencies manage large collections of information, and engage in indexing activities. In the case of the National Library of Medicine, this involves extensive manual indexing as well as a number of automated indexer aids. In the case of DoD, there are the TIPSTER/TREC and MUC activities that include testbed development as well as studies of indexing, message understanding, document analysis, ad hoc retrieval, and routing. For the White House, there are efforts to manage electronic mail. For the Intelligence and Defense Communities there are numerous efforts for filtering, routing, and retrieval. NSF should encourage collaboration by running a cross-agency workshop to consider distributed indexing, the handling of very large collections, semantic searching, intelligent agents, and other approaches in the context of the WWW, informed by other studies and investigations such as those mentioned above, or new efforts with Digital Libraries.


Footnotes

(1) The START natural language system (SynTactic Analysis using Reversible Transformations) consists of two modules which share the same grammar. The understanding module analyzes English text and produces a knowledge base which incorporates information found in the text. Given an appropriate segment of the knowledge base, the generating module produces English sentences. A user can retrieve the information stored in the knowledge base by querying it in English. The system will then produce an English response.