Second Working Conference on Reverse Engineering Summary Report

by James H. Cross II, Alex Quilici, Linda Wills, Philip Newcomb, and Elliot Chikofsky
SIGSoft Software Engineering Notes, Vol. 20, No. 5, December 1995

The Second Working Conference on Reverse Engineering was held July 14-16, 1995 in Toronto in conjunction with CASE'95. The goal of the conference was to bring together researchers who are exploring innovative methods of extracting and using information from existing software systems. Interest in reverse engineering has been growing rapidly, as the need for maintaining, upgrading, and migrating existing software systems has become more apparent and pressing. Several researchers reported on the application of reverse engineering techniques to practical industrial problems of significant economic importance. The software and legacy systems to which their techniques are being applied are large, complex, and diverse. A few examples from this year's papers are a public key encryption program, industrial invoicing systems, the X Window System, and software for analyzing data sent back from space missions.

WCRE is a "working" conference, emphasizing focused discussion interspersed with paper presentations. With about 42% of the presentations from industry and 58% from academia, the conference provided a balanced picture of current trends and challenges in this field. The participants described recent advances in several key areas, including program understanding, redundancy analysis, reuse, transformation, and translation. Data reverse engineering and finding objects in procedural programs continued to be important topics of discussion and active research, in response to the growing need to support migration to new platforms and paradigms. Tools and environments to support reverse engineering activities and for other researchers to build upon were also highlighted. The types of information being extracted from existing software spans a wide range, including specifications, business rules, objects, and, more recently, architectural features. In addition, the sources of information acted upon by reverse engineering systems has broadened to include non-code sources. Researchers are also starting to formalize the methods and theory of the field, a trend that should contribute to the validation and increased precision of reverse engineering technology.

This report attempts to capture the essence of the presentations and discussion among the participants. For brevity, titles of papers and multiple authors are omitted. The complete set of papers is available in the conference proceedings published by IEEE Computer Society Press.

The general chair, Elliot Chikofsky (of the DMR Group and Northeastern University), organized the working conference, along with program co-chairs Philip Newcomb (of Software Revolutions) and Linda Wills (of the Georgia Institute of Technology). The IEEE Computer Society Technical Council on Software Engineering's Subcommittee on Reverse Engineering, ACM SIGSoft, and the Reengineering Forum sponsored this single-track event.

Generating Accessible Documentation from Legacy Systems

The WCRE formally opened with a panel, led by Jacob Slonim, on a problem of growing interest: automatically generating accessible, dynamic documentation from legacy systems. Elliot Chikofsky presented a brief history of PSL/PSA, in which he expressed the importance of consistent formats and query mechanisms, the need for multiple information services, and the value of providing automatically generated, accessible documents. James Cross presented an overview of the GRASP/Ada tool, in which source code is treated as a set of structured documents. A tagger/renderer architecture based on SGML provides for extraction and presentation of a variety of views of software, both graphical and textual, which are important for comprehending large, complex systems. Lewis Johnson presented an on-going research project in which documents are generated on demand. The goal is to automatically tailor documentation to the tasks the user is trying to accomplish. The work is driven by empirical studies of inquiry episodes gathered from newsgroups. Philip Newcomb discussed the automatic generation of documentation to aid a user trying to decide whether reengineering of the business process is appropriate, or whether continued maintenance and other options should be considered. He described the use of the Software Refinery as a means of holding relatively large images (300 MB) of software systems and the flexibility it provides in the kinds of documentation that can be generated.

The general issues identified during the panel discussion included: what formalisms are appropriate for documentation, how well the documentation matches the particular tasks maintainers have to perform, what the limitations of documentation are (particularly documentation derived from source code), and how the documentation can be made more accessible. Source code alone cannot yield everything one needs to know about a system; however, since many seem to be getting along without additional documentation, is it really necessary? A fruitful strategy may be to concentrate on generating only documentation that addresses specific maintenance tasks, rather than generating all possible documentation whether it is needed or not.

Analysis of Non-Code Sources

The next session, chaired by Lewis Johnson, focused on the analysis of non-code sources. Patricia Lutsky presented her efforts in recovering information from textual documents (e.g., manuals and structured requirements), primarily for the purpose of test case generation. Julio Cesar Leite discussed working from structured analysis specifications to recover the domain lexicon and a set of business rules for the application. Peter Grogono considered the problem of extracting formal semantics from diagrammatic sources, in particular dataflow diagrams.

During the discussion of these papers, a basic question was left somewhat unanswered: how accurate is this documentation, and should the source code be treated as the final arbiter? Most felt that it should. However, when the documentation is not consistent with the code, the inconsistency points out where to check the code for correctness. This certainly appears to be a good strategy in generating test cases.

Having any description or view of a software artifact, in addition to the source code, is useful in trying to understand it. Insofar as the documentation describes the code, it can provide a higher level of understanding of the software when trying to recover items such as lexicons and test case information. Leite's lexicon work demonstrates that documentation is meant to describe things that are not in the source code, and in that way it can be an effective bridge between users' descriptions of the program and the programmers' implementation of it. Also, it is difficult to capture information such as design rationale in source code, making non-code sources of this information extremely valuable. Finally, it was noted that success in recovering information from both non-code and code sources hinges on the match between what one expects to find in each source and what is actually found there.

Transformation and Translation

The transformation and translation session, chaired by Mark Wilson, included two papers. Philip Newcomb presented a method for automatic translation of procedural systems into a non-procedural, event-driven, data flow computational architecture. Experience indicates that significant performance optimizations are achieved for many types of computation after the conversion. Frederico Zoufaly discussed RESCUE, a legacy systems translator which demonstrated the theoretical basis required for automatic translation by means of deduction. Deduction is performed using an expression rewriting process which preserves equivalence in each deduction step, thereby guaranteeing translation correctness.
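As a purely illustrative aside (a minimal sketch of equivalence-preserving rewriting in general, not the RESCUE translator or Newcomb's conversion method), the core idea can be shown as a recursive pass over an expression tree in which every rule replaces an expression with a semantically equal one, so correctness is preserved at each step:

    # Illustrative sketch only: equivalence-preserving expression rewriting.
    # Expressions are nested tuples such as ('+', ('*', 'x', 1), 0).
    def rewrite(expr):
        """Recursively simplify an expression using meaning-preserving identities."""
        if not isinstance(expr, tuple):
            return expr                      # variable or constant: nothing to do
        op, left, right = expr
        left, right = rewrite(left), rewrite(right)
        if op == '+' and right == 0:         # x + 0  ==  x
            return left
        if op == '*' and right == 1:         # x * 1  ==  x
            return left
        return (op, left, right)

    print(rewrite(('+', ('*', 'x', 1), 0)))  # prints 'x'

Because each rule is an equivalence, the result of any sequence of rewrites denotes the same value as the original expression, which is the property the deduction-based approach relies on.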

Tools and Environments

A tools and environments session followed, chaired by Hausi Muller. Philip Newcomb discussed a legacy system cataloging facility in which a software modeling and construction tool provides a foundation for analysis and transformation on an enterprise scale. The facility provides for storage of software and information about legacy systems in terms of persistent software models. Discussion focused on how the system was used, who used it, and the possibility of associating a cost model with it. The system has both internal and external customers who use it for system modeling, transformation, and migration. Cost models are not yet available, but meaningful information can be extracted to facilitate a cost model.

Stan Jarzabek presented the design of a generic reverse engineering assistant tool which combines heuristic rules with the active participation of a domain/data expert to extract design information from a variety of software artifacts. In most cases, the domain expert and the data expert are the same individual. Discussion was directed at the level of granularity of the information provided by the pattern matching methods, as well as at the entity-relationship diagrams that are generated.

Larry Markosian described a new approach to developing tools for measuring and documenting source code compliance with design and documentation standards. The Software Refinery provides the basis for the current tool, which has been used to measure compliance for over three million lines of C and Fortran source code. The metric (the number of violations divided by the number of opportunities) is often straightforward to compute, but the violations can be difficult to detect. No quantitative evaluations have been made, but inspections and experience indicate that the metric is useful. Extensions to other languages are often possible, although some peculiarities exist.
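As a minimal sketch of the ratio itself (with hypothetical counts, not data from the tool; the hard part in practice is detecting the violations, not computing the ratio):

    # Minimal sketch of a violations-per-opportunity compliance metric.
    def compliance_metric(violations: int, opportunities: int) -> float:
        """Return the fraction of standards violations per opportunity to violate."""
        if opportunities == 0:
            return 0.0
        return violations / opportunities

    # Hypothetical example: 42 violations out of 1,400 opportunities -> 0.03
    print(compliance_metric(42, 1400))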

Mark Wilson presented ongoing work to integrate reengineering, reuse, and specification tool environments to enable reverse engineering. Transformations are often easy when the source and target languages have compatible semantics; difficulties arise in dealing with bit manipulation, assembly code, and graphical displays. Graphical displays are used at the required level of granularity for visualization purposes only; no graphical manipulations for editing are supported yet.

Detecting Duplication

A session on detecting duplicate fragments of code in large software systems was chaired by Howard Reubenstein. Brenda Baker, whose paper won the conference's Best Paper award, described a tool, called "dup", that can be used to locate instances of duplication or near-duplication in large software systems. The goal of the tool, which is based on exact matching of variable names, function names, and constants, is to reduce the duplication in source code and aid in program restructuring. Kostas Kontogiannis presented results of research in pattern matching for locating pairs of similar code fragments (called "clones"). The pattern matching uses dynamic programming to determine best matches based on a variety of program features, including variable usage, keywords, and software quality and complexity metrics. A unique pattern matching technique is also being developed for design concept localization. It uses Markov models to express the probability that a given abstract pattern corresponds to a given code fragment.
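To make the exact-match end of this spectrum concrete, here is a toy sketch (not Baker's dup or Kontogiannis's matcher, which scale to millions of lines and handle near-duplicates): hash fixed-size windows of whitespace-normalized source lines and report any window that occurs more than once.

    # Toy sketch of exact-match duplicate detection over source lines.
    from collections import defaultdict

    def duplicate_blocks(lines, window=5):
        """Map each repeated `window`-line block (whitespace-normalized)
        to the list of starting line indices at which it occurs."""
        seen = defaultdict(list)
        for i in range(len(lines) - window + 1):
            block = "\n".join(line.strip() for line in lines[i:i + window])
            seen[block].append(i)
        return {block: starts for block, starts in seen.items() if len(starts) > 1}

    # Hypothetical usage on a source file named example.c:
    with open("example.c") as f:
        for block, starts in duplicate_blocks(f.readlines()).items():
            print("block repeated at lines", starts)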

These papers raised issues concerning how to cope with similarities more complex than matching variable names, function names, and constants. For example, is it possible to deduce semantic features of the program? Could semantic knowledge be used to make similarity detection more intelligent? Once similarities have been detected, they could be used in conjunction with program renovation, program understanding, error detection, program maintenance, product redocumentation, quality assessment, and extraction of reusable components.

Learning from the Reverse Engineering Process

Mike Olsem chaired a session on analyzing empirical data from reverse engineering projects, featuring a presentation by Piernicola Fiore. The study reported an analysis of productivity and identified the need for adaptable automated tools. Results indicated that cost is not necessarily related to the number of lines of code, and that the data and the programs require distinct econometric models. All of the tools used in this research were found to be inadequate in some manner and needed support from other tools. The study also concluded that it is difficult for researchers to justify work on a practical problem, since many see it as work, not as research, and thus not intellectually "challenging" enough.

Data Reverse Engineering

Giuseppe Visaggio chaired the session on data reverse engineering. Michael Blaha presented observed idiosyncrasies of relational database designs that he and his colleagues have encountered over the past several years, many of which are in commercial software products! The idiosyncrasies discovered are useful in judging the quality of a database tool, not just its functionality. Blaha concluded that further tool support is needed, and that the catalog of idiosyncrasies may help inform and drive improvements in data reverse engineering technology.

Helen Edwards described deriving a logical data model for a system using the RECAST method, which produces a no-loss representation based on the Structured Systems Analysis and Design Method (SSADM). The case studies presented yielded good results, and the method is well suited to producing accurate documentation of a system. The most difficult part of applying the method was tracking down the necessary information.

Jean-Luc Hainaut discussed general architecture requirements for tools supporting the reverse engineering of data-centered systems. An operational tool developed from these requirements was used to recover a logical schema from the data; doing so required looking at both the data and the procedures related to the data. The application studied, which was sequential rather than concurrent, demonstrated transformation of the data structure at a higher level. Although the transformation set in this case contained only about 30 to 35 transformations, it could in theory grow very large.

Kathi Davis presented a tool for step-by-step data model reverse engineering which uses COBOL record layouts and DB2 data definitions to produce a conceptual data model for the purpose of migrating to new data technology. The tool was used to analyze legacy code to produce entity-relationship models, with a data dictionary as the main repository of information. The tool itself was built using COBOL and Oracle. Users of the data dictionary indicated that it was utilized frequently and that its quality was acceptable.

Program Understanding

Ettore Merlo chaired a session of two papers on program understanding. Alex Quilici described the DECODE reverse-engineering environment, in which the user and system cooperate to extract object-oriented designs from legacy software. Spencer Rugaber and Linda Wills interleaved their presentation to dazzle, entertain, and highlight the interleaving problem in program understanding: multiple strands of computation, each responsible for accomplishing a distinct purpose, may be woven together and overlap in a single contiguous section of code.
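A small hypothetical example (not taken from the paper) makes the problem concrete: two unrelated computations share one loop, so a reader interested in either strand must mentally untangle both.

    # Hypothetical illustration of interleaving: the "total" strand and the
    # "maximum" strand serve distinct purposes but are woven into one loop.
    def summarize(readings):
        total = 0
        maximum = float("-inf")
        for r in readings:
            total += r              # strand 1: accumulate the running total
            if r > maximum:         # strand 2: track the largest reading
                maximum = r
        return total, maximum

    print(summarize([3, 9, 4]))     # (16, 9)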

Several issues were discussed that were common to both papers. One is what kinds of formal design representations should be used as targets for program understanding systems. A related issue is how to handle the different models of design abstractions and accommodate the many different ways of viewing these abstractions that users might require. Another open issue is what mechanisms are appropriate for tying the recovered design abstractions to domain concepts. If the user is responsible for doing this mapping (as in DECODE), for which tasks is this approach appropriate?

In discussing the interleaving problem, it was noted that even though interleaving reduces comprehensibility, it may be unavoidable (e.g., if the available programming language constructs don't provide the right resources) and sometimes it may even be desirable (e.g., due to nonfunctional requirements). An interesting question is what kinds of programming language features are valuable in avoiding interleaving. Object-oriented, pattern-based, and constraint-based notations may be helpful, but they can introduce problems of their own, such as new kinds of delocalized plans. Another key issue discussed was what types of techniques are effective in untangling interleaved code. For example, is slicing a good mechanism? If it is insufficient, then what other techniques, such as pattern recognition, can be used to improve it?

Finally, the participants took a fresh look at the different paradigms for viewing the software understanding process. The notion that programs can be understood by tiling plans onto them was called into question. An improved tiling metaphor would incorporate notions such as overlapping tilings and categorize the different types of compositions that can be performed on plans, giving a richer way of mapping plans onto programs than simple tiling. An alternative metaphor is to consider program understanding as an inverse transformation process: if one can recover the transformations that led to the resulting program, then by inverting them one can arrive at a more abstract representation in which the design elements are more easily separable. Of course, existing program recognition systems already implicitly perform a significant amount of transformation in order to get the program into a more canonical form suitable for matching. It is an open issue what should be learned during the understanding process: the actual plans making up the program, the transformations applied in its construction (e.g., finite differencing or other optimization transformations), or both.

Formal Methods

Jim-Qun Ning chaired a session focusing on the use of formal methods in reverse engineering. Andrea DeLucia described an approach to qualifying reusable functions through symbolic execution. Gerald Gannod presented strongest post-condition semantics as a formal basis for reverse engineering imperative language source code. Indra Tjandra discussed the formal representation of reusable software modules to facilitate semantic-level analysis and reasoning about the software.
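As background for readers unfamiliar with the formalism (these are the textbook definitions, not details specific to Gannod's presentation), the strongest post-condition sp(S, Q) of a statement S with respect to a precondition Q can be written, for assignment and sequencing, as:

    sp(x := e, Q)    =  \exists x_0 .\ (x = e[x_0/x]) \wedge Q[x_0/x]
    sp(S_1; S_2, Q)  =  sp(S_2, sp(S_1, Q))

Intuitively, the first rule says that after the assignment there was some prior value x_0 of x for which Q held and for which the new x equals e evaluated at that prior value; the second rule propagates conditions forward through a sequence.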

Several observations were made concerning these papers. Formal methods research, both in current perception and in the current state of the art, focuses on small programs, raising issues of practicality, feasibility, and scalability. A topic of discussion was how formal methods will be used in conjunction with other approaches; for example, patterns offer a potentially beneficial coupling with symbolic execution. It was observed that there is a rough equivalence in expressive power among the techniques the presenters are employing. An open issue common to all three is undecidability: when a program does not terminate, a false post-condition is indicated. Tjandra suggested that universal algebra may offer a means of escaping undecidability through algebraic transformations. Whether the desired level of comprehension can be reached this way still needs to be verified in practice.

Another interesting question was whether a hybrid approach might be fruitful, combining symbolic execution with concrete execution threads. DeLucia advocated a dual-mode approach: first structurally identifying components, then dealing with their formal qualification. Also, the lack of objects in the papers raised the issue of types being as important as values. Finally, the problem of implementation recognition was raised: the implementation present in the program may not look like the concept one is searching for, making it difficult to identify and reuse. Ward's work was mentioned as relevant to this problem, as is work on analogy and plan recognition.

Finding Objects

Spencer Rugaber chaired a session on finding objects in procedural programs, which included four papers. Harald Gall presented an approach for finding objects as a promising way of reducing the effort of program understanding and thus lowering maintenance costs. Harry Sneed described a method for extracting object-oriented specifications from legacy COBOL systems. Alexander Yeh focused on recovering abstract data types and object instances from procedural languages such as C. Philip Newcomb discussed the general problem of reengineering procedural systems into object-oriented systems.

These papers centered on two focal points: (1) reverse engineering technologies for migrating from the procedural paradigm to the object-oriented paradigm, and (2) the development of object-oriented technology itself. On the first point, human experts currently assume a key role in defining an object; no clear framework exists for evaluating methods that identify objects, and no clear process exists for comparing these different methods. On the second point, using OO technology solves many problems but also introduces some of its own, and there is a risk of confusing the problems of object-oriented technology with the problems of current reverse engineering technologies. How is it possible to identify the right objects in a program not designed with objects in mind? An advantage of objects is that they should be conceptually easier to understand, but is this true of objects automatically obtained from pre-existing programs? To what extent do objects identified in code match objects or concepts in the "real world"? A higher level of description of objects is needed to fully answer that question.

Architectural Discovery and Evaluation

Cordell Green chaired the session on architectural discovery and evaluation, in which David Harris described recognizers for extracting architectural features from source code, and John Gravley discussed the experimental evaluation of subsystem classification recovery techniques. Both papers tried to answer the question "What happens after reengineering?" Harris focused on where the resulting artifacts fit within the overall environment. He described a set of language-independent descriptions that reference a language- or operating-system-specific model, allowing the analyst to query key aspects of the abstract model (e.g., What are the calls to the network system?). Gravley was concerned with the quality of the models after reengineering as compared to what was expected beforehand. He presented a method for measuring the quality of an artifact through a series of equations that describe the deviation between the actual and expected results of the reengineering process.

Acknowledgments

The authors wish to thank each of the individual session rapporteurs for their contributions to this report: Spencer Rugaber, Howard Reubenstein, Ettore Merlo, Jean-Luc Hainaut, Mark Wilson, Julio Cesar Leite, Lewis Johnson, David Eichmann, Gerardo Canfora, and Michael Olsem.