First, as with all Web sites, "Chamber of Commerce" information will be converted into HTML. From now on, NSF policy should dictate that all public information be produced using SGML/HTML compatible software so that it can be posted to the Web when it is "published". This could both better serve the communities to whom the information is directed and perhaps even save money through eventual reductions in printing and mailing costs.
Apart from the existing information about NSF which currently inhabits the NSF Web site, I believe longer documents such as NSF and National Science Council reports should be considered for online access.
Second, the review process can benefit from allowing documents to be submitted in HTML, giving reviewers access to both the original proposal and its accompanying supporting materials (e.g. papers published by the proposal submitters). Some special concerns for privacy and confidentiality may need to be addressed here so that, if materials are located at the submitters' home sites, access to them doesn't reveal the reviewers' identities. Perhaps the materials could be temporarily relocated to NSF for the duration of the review. This seems largely to be a logistics problem, since it is clear what materials are involved: the original proposal, supporting published papers, biographical data on the proposers, and associated previous documents connected via that biographical data or relating to previous work on the project.
Third, once proposals are funded, there is an opportunity for NSF to greatly increase the potential for interactions within the research community and technology transfer by encouraging the grantees to create homepage access to their project and its work. Thus, instead of merely receiving word back from the funded projects in the form of a periodic summary of their results, NSF can provide a jumping off point to each project.
Fourth, the issue now turns to how NSF can use the new capabilities to promote Science and Scientific Literacy (the new goal of the administration). The Web as it currently exists is powerful but unstructured. Unlike other information collections whose access is mediated by the efforts of the library and information science communities, the Web has just grown. This is good and bad. It provides for maximal experimentation, with low start-up costs for individuals everywhere, but it creates a hodge-podge of data, duplicated across many sites, and seems to encourage a geometrically growing effort to index the materials via the lowest common denominator of weighted keyword access.
Two generic factors could greatly improve the situation. One would be to encourage the creation of a universal protocol for access to publications, particularly journal articles. The URL gives us the ability to include a pointer in documents--but one has to discover a path to the pointer to use it. Within the scientific and scholarly literature there are somewhat common means of referring to authors' publications within papers, such as SMITH&JONES-94 for a relatively unique identification of a paper by two authors, Smith and Jones, published in 1994. This scheme could be expanded to include site information and in effect create a virtual identifier for a paper which, when entered into Mosaic, could literally whisk the reader away to the paper online. So, if Smith and Jones worked at The University of Texas at Austin in the Computation Center, the reference might be something as simple as "cc.utexas.edu/docs/SMITH&JONES-94".
This is essentially a call to devise an extension to the Web addressing scheme specifically for publications, one which would allow anyone seeking a paper to construct the hypothetical identifier for any publication worldwide. This identifier could then be included in published papers, and from its logical components readers could hypothesize the identifiers of additional papers and request them.
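As a rough illustration of how such an identifier might be assembled from its logical components and taken apart again, consider the following sketch in Python. The function names, the "docs" path default, and the two-digit year convention are assumptions made for the example; no such protocol is defined here.

```python
import re


def citation_key(surnames, year):
    """Build a key such as SMITH&JONES-94 from author surnames and a year."""
    names = "&".join(s.strip().upper() for s in surnames)
    return f"{names}-{year % 100:02d}"


def publication_identifier(host, surnames, year, path="docs"):
    """Join a site's host name, a document path, and the citation key.

    The "docs" default mirrors the example in the text; it is an assumption,
    not part of any agreed standard.
    """
    return f"{host}/{path}/{citation_key(surnames, year)}"


def parse_identifier(identifier):
    """Split an identifier back into its logical components."""
    match = re.fullmatch(r"([^/]+)/(.+)/([A-Z&]+)-(\d{2})", identifier)
    if match is None:
        raise ValueError(f"not a publication identifier: {identifier!r}")
    host, path, names, yy = match.groups()
    return {"host": host, "path": path, "authors": names.split("&"), "year": yy}


ident = publication_identifier("cc.utexas.edu", ["Smith", "Jones"], 1994)
print(ident)  # cc.utexas.edu/docs/SMITH&JONES-94

# A reader can hypothesize the identifier of a later paper by the same authors.
guess = publication_identifier("cc.utexas.edu", ["Smith", "Jones"], 1995)
print(guess)  # cc.utexas.edu/docs/SMITH&JONES-95
```

A two-digit year and surname-only keys are ambiguous in practice (two papers by the same authors in one year, for instance), so a workable scheme would need some further disambiguating component.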
The second generic factor I see a need for is to promote the assignment of the equivalent of CIP (Cataloguing in Publication) information for all pages of information on the Web. Some time back the library community recognized that distributing cataloguing information for books and journals was getting excessively expensive because every library had to explicitly list the documents they had received in a request to a common supplier of cataloguing information. Eventually it was converted into an electronic interaction, but still the requests had to be submitted. To eliminate this, the contents of the library catalog were generated by the Library of Congress upon receipt of a preliminary copy of the published materials and then printed in the book/journal itself when it was sent out. Today virtually all books contain a catalog card on their copyright page which gives the call numbers in the LC and DDC classifications, subject descriptors, and correct author/title/publisher details for the book. Anyone can tell from a given copy of a book where the subject matter of the book would be found in any library--thus facilitating access to the content matter in a larger context.
This is what the Web lacks. It has neither a global subject scheme in place (and the full classification scheme of the Library of Congress would probably be needed to encompass it today) nor any means for someone to know where a given node lies in the global context of nodes. Efforts to provide index access to text descriptions of sites and their offerings are ever falling behind the contents online.
First, the Web as a structure seems to embody many of the properties of a Semantic Network, as long used in the computational linguistics field. Many properties of the Web are similar to those of semantic nets and hence many of the procedures applied in computational linguistics to semantic networks might be applicable to the Web itself.
Some differences do exist. For one, the Web doesn't employ "arc labels" to describe the nature of the node being connected to via a URL. It would be worthwhile to explore what such arc labels might be and how they could be placed into an expanded protocol. Minimally, for example, arcs ought to indicate the type of information at the node, such as the medium. Maximally, they should deal with the semantic relationship of the information to the current node. Additionally, semantic networks usually assume arc labels to have reverse interpretations and to be reverse accessible; thus the HASPART relationship is the reverse of the ISPART relationship (e.g. VIRGINIA HASPART ARLINGTON, ARLINGTON ISPART VIRGINIA).
In computational linguistics the relationships between nodes relate to linguistic properties of events and entities. Examples include ISA/HASTYPE (is a type of/has type), ISPART/HASPART (is a part of/has as a part), CAUSES/CAUSED-BY (causes/is caused by), and the case arguments of event description (AGENT=is the human agent of the event, THEME=is the main event/entity upon which the event operates, INSTRUMENT=is the means by which the event operates on the theme, SOURCE=is the thing from which the event initiates its action, GOAL=is the thing to which the event transforms or progresses its action, LOCATION=is the physical (geographic) location of the activity of the event, etc.).
On the Web these are at present just an atypical subset of the set of relationships between nodes. Also, the events are obscured because the current arc labels would be composed of an event and an entity, such as "WRITTEN-BY". Strictly speaking this should be a link such as THEME-OF to a WRITING event whose AGENT arc would be the author and which would itself have possible other links.
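To make the idea of reverse-accessible arc labels concrete, here is a minimal sketch in Python. The class name SemanticNet, its methods, and the node names PAPER-1, WRITING-1 and SMITH are hypothetical; only the label pairs come from the discussion above, and the inverse table is illustrative rather than complete.

```python
from collections import defaultdict

# Inverse pairs named in the text; the table is illustrative, not complete.
INVERSE = {
    "ISA": "HASTYPE",
    "ISPART": "HASPART",
    "CAUSES": "CAUSED-BY",
}
# Make the table symmetric so either label of a pair can be looked up.
INVERSE.update({v: k for k, v in list(INVERSE.items())})


class SemanticNet:
    """A labelled graph in which known arcs can be followed in reverse."""

    def __init__(self):
        self.arcs = defaultdict(set)  # (node, label) -> set of target nodes

    def add(self, source, label, target):
        self.arcs[(source, label)].add(target)
        inverse = INVERSE.get(label)
        if inverse is not None:
            # Record the reverse arc so it is reachable from the target.
            self.arcs[(target, inverse)].add(source)

    def follow(self, node, label):
        return sorted(self.arcs.get((node, label), set()))


net = SemanticNet()
net.add("ARLINGTON", "ISPART", "VIRGINIA")
print(net.follow("VIRGINIA", "HASPART"))  # ['ARLINGTON']

# "WRITTEN-BY" decomposed into an explicit WRITING event with case arguments.
net.add("WRITING-1", "ISA", "WRITING")
net.add("WRITING-1", "THEME", "PAPER-1")
net.add("WRITING-1", "AGENT", "SMITH")
print(net.follow("WRITING-1", "AGENT"))   # ['SMITH']
print(net.follow("WRITING", "HASTYPE"))   # ['WRITING-1']
```

Storing the reverse arc at insertion time is one simple design; a real protocol would also have to decide who is allowed to assert an arc into someone else's node.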
Among the tasks to which semantic networks are put is that of natural language generation, i.e. the generation of text based on starting at some node in the network and accessing the set of nodes connected to that point and interpreting what is found by levels of connection and a precedence of relationships, e.g., "This is the Home Page of the National Science Foundation created by Author X on January 19xx and directs the reader to other pages describing the major divisions of NSF and a staff phone/email directory, etc."
It would seem that being able to produce such natural language descriptions of the Web for any node could be useful. Note also that such descriptions vary dependent upon where one starts and that one can start at ANY node.
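A very small sketch of this kind of generation, in Python, might walk the arcs leaving a node in a fixed order of precedence and fill in a phrase template for each. The arc labels, templates, and node name below are hypothetical simplifications chosen for the example; a real generator would be considerably more elaborate.

```python
# Hypothetical arc labels and phrase templates; list order sets the precedence.
TEMPLATES = [
    ("ISA", "is the {}"),
    ("CREATED-BY", "was created by {}"),
    ("LINKS-TO", "directs the reader to {}"),
]


def describe(graph, node):
    """Produce a one-sentence description of `node` from its outgoing arcs."""
    arcs = graph.get(node, {})
    phrases = []
    for label, template in TEMPLATES:
        targets = arcs.get(label, [])
        if targets:
            phrases.append(template.format(" and ".join(targets)))
    if not phrases:
        return f"Nothing is known about {node}."
    if len(phrases) == 1:
        body = phrases[0]
    else:
        body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return f"This {body}."


graph = {
    "nsf-home": {
        "ISA": ["Home Page of the National Science Foundation"],
        "CREATED-BY": ["Author X"],
        "LINKS-TO": [
            "pages describing the major divisions of NSF",
            "a staff phone/email directory",
        ],
    }
}
print(describe(graph, "nsf-home"))
```

Starting from a different node produces a different description, since only the arcs leaving that node are consulted.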
Next... One of the more powerful techniques for creating new sub-networks was exploited by Doug Lenat in the CYC Project at MCC. This was "copy and edit". It involved making an analogy between two arbitrarily complex entities which provided the computer with guidance as to what sub-structures to create and then permitting the reader to EDIT these sub-structures to correct them for actual differences from the existing nodes. This technique would also work on the Web, such that one could, using such an interface creation tool, literally say something as complex as "The NIH is a government agency like the NSF" and then expect the system to generate a copy of ALL the information and structures existing for the NSF and then allow the creator to populate it with corrected values for NIH.
Such a tool could greatly speed data entry for NSF-funded projects.
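The following Python sketch shows the general shape of a "copy and edit" operation on a simple record structure. It is not a description of how CYC implements the technique; the function name, the record layout, and the field names are assumptions made for illustration.

```python
import copy


def copy_and_edit(records, source, target, corrections):
    """Clone the record for `source` under the name `target`, then apply edits.

    Returns the names of fields that were copied verbatim, since these still
    hold the source's values and need review by the creator.
    """
    clone = copy.deepcopy(records[source])  # deep copy: nested lists are not shared
    clone.update(corrections)
    records[target] = clone
    return sorted(key for key in clone if key not in corrections)


records = {
    "NSF": {
        "ISA": "government agency",
        "name": "National Science Foundation",
        "pages": ["home", "divisions", "staff directory"],
    }
}

# "The NIH is a government agency like the NSF"
to_review = copy_and_edit(
    records, "NSF", "NIH", {"name": "National Institutes of Health"}
)
print(records["NIH"]["name"])  # National Institutes of Health
print(to_review)               # ['ISA', 'pages']
```

Returning the list of unedited fields is one way of prompting the creator to correct whatever was inherited from the analogy but does not actually hold for the new entity.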
Another task which the interpretation of the Web as a semantic network of nodes with semantic arcs between them would facilitate is the intelligent use of the Web by software information agents seeking information. It would allow the programming of search strategies as rules and scripts to be followed, e.g. "Find an email address for a Dr. Kathleen Fisher, a psychologist at the University of California at Davis" could translate into: "Find the name FISHER, KATHLEEN in the faculty/staff phone/email directory attached to the homepage for the University of California at Davis." Alternative search strategies might be programmed employing knowledge of geography, etc.
The Web as an entity lacks world knowledge for intelligent agents to use in searching it. This is partly a matter of knowing more about the world in terms of geography and organization than currently exists on the Web itself. I.e., WE know what states contain what major cities, or how to use the ZIP code directory or the telephone area code listings to locate things. WE know that there are major branches of the USA federal government and that the Army is a part of the DoD. The question is how and where should such information exist in the Web for the use of intelligent agents? Access to the Web could be facilitated if simplified function call interfaces to some of these resources were created, along with a listing of the available functions which Web scripts could call. If CITY-OF(ZIP-CODE) was a function, then calling it with a ZIP code would yield the city--handy for a program or a person. Since the data is online, all that is missing is the interface components to make these readily usable by people or programs.
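A minimal sketch of such a function call interface, in Python, could pair a registry of named functions with a way to list and call them. The registry, the decorator, the STATE-OF function, and the two-entry lookup table are all assumptions of this example; a real service would consult an online ZIP code directory rather than a hard-coded table.

```python
# Tiny sample table standing in for an online ZIP code directory.
ZIP_TO_CITY = {
    "78712": ("Austin", "TX"),
    "22230": ("Arlington", "VA"),
}

# Listing of available functions that a Web script could consult.
FUNCTIONS = {}


def web_function(name):
    """Register a Python function under a script-visible name (hypothetical)."""
    def register(func):
        FUNCTIONS[name] = func
        return func
    return register


@web_function("CITY-OF")
def city_of(zip_code):
    city, _state = ZIP_TO_CITY[zip_code]
    return city


@web_function("STATE-OF")
def state_of(zip_code):
    _city, state = ZIP_TO_CITY[zip_code]
    return state


def call(name, *args):
    """Look up a registered function by name and apply it."""
    return FUNCTIONS[name](*args)


print(sorted(FUNCTIONS))         # ['CITY-OF', 'STATE-OF']
print(call("CITY-OF", "78712"))  # Austin
```

The same registry serves a person browsing the list of functions and a program calling one of them by name.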
In my view the problem is to get the research reports onto the net, where they will be immediately reachable by the global community, and such that information discovery and filtration systems can be brought to bear on them. I would suggest that it would be useful for the community to adopt some standard for electronic publishing of research reports. There are a number of suitable systems working now, among them WATERS (from Old Dominion University and Virginia Tech) and the Dienst system (developed as part of the ARPA CSTR project).
More information about WATERS can be obtained from this site. Dienst is available. I will also be presenting a paper on it at the upcoming WWW 94 conference.
As to new capabilities, I think one of the most important is a systematic way of dealing with different levels of privacy. Every project has some working materials they will be happy to make fully public, some that are for trusted colleagues but not general publication, and some that are for internal use only. Good simple standards and ways of managing this will be of help to all.
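One simple way to think about the three levels is as an ordered scale, where a reader cleared for one level may also see everything below it. The Python sketch below assumes exactly that ordering; the level names and document names are hypothetical, and no existing standard is implied.

```python
from enum import IntEnum


class Access(IntEnum):
    """Three privacy levels, ordered from least to most restricted."""
    PUBLIC = 0       # fully public
    COLLEAGUES = 1   # trusted colleagues, not general publication
    INTERNAL = 2     # internal use only


def visible(documents, clearance):
    """Return the documents a reader with the given clearance may see."""
    return [name for name, level in documents.items() if level <= clearance]


documents = {
    "final-report": Access.PUBLIC,
    "draft-paper": Access.COLLEAGUES,
    "budget-notes": Access.INTERNAL,
}

print(visible(documents, Access.PUBLIC))      # ['final-report']
print(visible(documents, Access.COLLEAGUES))  # ['final-report', 'draft-paper']
```

A strictly ordered scale is the simplest case; projects with several distinct groups of trusted colleagues would need something closer to per-group access lists.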