EVALUATING SOFTWARE VISUALIZATION USE FOR COMPUTER SCIENCE EDUCATION

CS7390 WINTER '98

ASHLEY GEORGE TAYLOR

ataylor@cc.gatech.edu

 
 

Introduction

Intuitively, it seems that Software Visualization (SV) systems should be useful for teaching and software development. However, very few studies have shown conclusively that they are effective. We will examine some of the studies done on the use of SV systems for teaching.
 


Index


Evaluation Methodologies

There are two primary evaluation methodologies: experimental or empirical, and naturalistic or observational.
 
 

Empirical Studies

In an empirical study, the researcher makes a hypothesis. Two or more subject groups of the same size and similar background are chosen. One group is provided with the experimental material and the other with 'traditional' material. The latter is called the control group. The same post-test is administered to each group. Each group is given equal time to use the material and to complete the post-test.

The intent is to keep everything constant except for the use of the experimental material. The post-test results are analyzed to see if there are significant differences between the groups, and the results are interpreted to see how they relate to the hypothesis.

There are variations in the approach described above. A pre-test may be given before use of the material. Several experimental groups may be formed, each using different material.
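As an illustration of the analysis step, the comparison of post-test scores between a control group and an experimental group can be sketched as an exact permutation test on the difference in group means. This is only one possible analysis, not the method used in any of the studies discussed here; the function name and the scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(control, treatment):
    """Exact two-sided permutation test on the difference in group means.

    Every way of relabelling the pooled scores into two groups of the
    original sizes is enumerated, so this is only practical for small groups.
    """
    pooled = list(control) + list(treatment)
    n_treat = len(treatment)
    n_ctrl = len(pooled) - n_treat
    total = sum(pooled)
    observed = abs(mean(treatment) - mean(control))

    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_treat):
        t_sum = sum(pooled[i] for i in idx)
        diff = abs(t_sum / n_treat - (total - t_sum) / n_ctrl)
        # Small tolerance so ties with the observed difference are counted.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical post-test scores, for illustration only.
control_scores = [1, 2, 3]
treatment_scores = [4, 5, 6]
print(permutation_p_value(control_scores, treatment_scores))  # 0.1
```

A small p-value suggests that the difference between the groups is unlikely to be due to chance alone. With only three subjects per group, as in this toy example, the smallest achievable p-value is 0.1, which hints at why group size matters.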
 


Observational Studies

Observational studies try to capture what users do. Researchers then attempt to classify and explain user behavior, identifying patterns and speculating about why users behave as they do.

Researchers try to provide a 'natural' environment for the study, i.e. a setting similar to the environment in which a user would normally use the material. In some cases this is impractical. A control group is not required, though comparative groups are frequently used. The observational method is one of several naturalistic methods.
 
 



 
 

Empirically Assessing Algorithm Animations as Learning Aids

Stasko and Lawrence [SL98] describe two empirical studies of Algorithm Animation.
 

Study 1

The first study compared the use of text versus text and animations for self-study. Two groups of graduate computer science students were given a text description of the pairing heap algorithm. In addition, one of the groups had access to an XTango animation using the insert, delete minimum key, decrease key, and delete operations. Neither group was familiar with the material presented. Each group had 10 students.

The text-only group had 45 minutes to read. The animation group also had 45 minutes, but had a maximum of 30 minutes to read the material, and could spend the remaining time (15 minutes or more) on the animation. Students using animation and text did marginally better than the text-only group, and the animation group finished the post-test in slightly less time.

A survey of the animation group found that students wanted to be able to step back through the animation and replay it. This capability was not available in XTango. They also wanted text descriptions of what was happening. Individual students commented that they liked the smooth transitions and speed control, but found it difficult to remember the animation afterwards.

The authors note that "Most of the test items require the ability to accurately carry out the main procedures of the algorithm, and neither presentation seemed likely to give participants that ability."  Given this limitation, it seems difficult to draw firm conclusions from this study about the effectiveness of algorithm animation.
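For readers unfamiliar with the data structure, the kind of procedure the test items asked students to carry out can be sketched as follows. This is a minimal pairing heap covering only insert and delete minimum key, using the common two-pass pairing strategy. It is an illustrative assumption, not the text or the XTango animation used in the study, and the names Node, meld, insert and delete_min are hypothetical.

```python
class Node:
    """One tree node of a min-ordered pairing heap."""

    def __init__(self, key):
        self.key = key
        self.children = []


def meld(a, b):
    """Link two heap roots; the root with the larger key becomes a child."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    return a


def insert(root, key):
    """Insert a key by melding a one-node heap with the existing heap."""
    return meld(root, Node(key))


def delete_min(root):
    """Remove the minimum key; return (key, new_root). Heap must be non-empty."""
    kids = root.children
    # Pass 1: meld the children in pairs, left to right.
    paired = [
        meld(kids[i], kids[i + 1] if i + 1 < len(kids) else None)
        for i in range(0, len(kids), 2)
    ]
    # Pass 2: meld the resulting trees together, right to left.
    new_root = None
    for tree in reversed(paired):
        new_root = meld(tree, new_root)
    return root.key, new_root


# Usage: insert some keys, then remove them in increasing order.
heap = None
for k in [5, 3, 8, 1, 9, 2]:
    heap = insert(heap, k)

ordered = []
while heap is not None:
    smallest, heap = delete_min(heap)
    ordered.append(smallest)
print(ordered)  # [1, 2, 3, 5, 8, 9]
```

Decrease key and delete are omitted here because they need parent links, which would make the sketch considerably longer.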
 

Study 2

The second study compared (A) the use of animations vs. slides in a lecture, and (B) supplementing each lecture format with an animation laboratory. Kruskal's MST algorithm was presented using Polka in the lecture and XTango in the labs. To ensure that all students saw the same presentation, students were shown a videotape of the lecture. The slides showed a series of screen shots of the animation.

Two alternative lab formats were used in part (B), active and passive. Passive lab participants used data sets prepared by the instructor, while those in active labs constructed their own data sets interactively.

This study implemented some of the suggestions from the first study. The animation was annotated with a brief text description which in effect was very high-level pseudocode. At each step of the animation, the relevant text was highlighted. Use of the animation in the lecture provided instructor explanations. The design of the animation had more focused instructional objectives, which were specifically evaluated.
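For reference, the algorithm presented in this study, Kruskal's minimum spanning tree (MST) algorithm, can be sketched as below. This is a generic textbook-style version using a simple union-find structure, offered as an assumption about what the steps look like; it is not the pseudocode shown in the Polka or XTango animations, and the function name and the example graph are hypothetical.

```python
def kruskal(num_vertices, edges):
    """Return (total_weight, chosen_edges) of a minimum spanning forest.

    edges is an iterable of (weight, u, v) with vertices numbered from 0.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    total = 0
    # Consider edges in order of increasing weight.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge joins two different components, so it cannot
            # create a cycle: accept it and merge the components.
            parent[root_u] = root_v
            chosen.append((u, v, weight))
            total += weight
    return total, chosen


# Hypothetical 4-vertex graph, for illustration only.
example_edges = [(1, 0, 1), (3, 0, 2), (2, 1, 2), (4, 2, 3), (5, 1, 3)]
print(kruskal(4, example_edges))
# (7, [(0, 1, 1), (1, 2, 2), (2, 3, 4)])
```

Each pass through the loop corresponds to one step that an animation could highlight: pick the next cheapest edge, then either accept it or reject it because it would close a cycle.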
 

Evaluation and Results

Evaluation of the four experiment combinations (two no-lab, two lab) used fixed response questions to test understanding of specific steps in the algorithm and free response for testing general conceptual understanding.

No significant difference was found between the two no-lab groups in part (A), but animation students did slightly worse on the free response section.

In part (B), students who did the active lab performed significantly better than those who did the passive lab or no lab at all. The difference was larger on the free response section, which the authors suggest indicates an improvement in concept formation.

The second study is one of the most widely referenced algorithm animation studies, and seems to be a good example of experimental design.
 

 



 

A Principled Approach to the Evaluation of SV: A Case Study in Prolog

Mulholland [Mul98] describes a number of empirical studies of SV systems and educational software which have produced contradictory results. He postulates that this is due to faulty design. Researchers go for an absolute proof that something works. They often produce results which are difficult to explain because the studies do not capture relevant data.

"Much of the empirical work ... suffers from the problem of trying to find global generalisations that are not there.... Many of the studies derive performance results without deriving the information necessary to explain them. Observations are usually not made of how the subjects approached the task and what features they found confusing."

The author combines empirical and observational approaches to design a study which captures user data in conjunction with empirical measurement. In addition, the material used was designed to test for understanding of specific cognitive tasks.
 
 
 

Study Overview

Mulholland compares four Prolog SV systems to see if and how they support the learning of specific tasks. The tasks were chosen based on previous cognitive studies of students learning Prolog.

The subjects were four groups of Cognitive Psychology students taking an Artificial Intelligence module of a course. Each group used a different Prolog SV as a learning and debugging environment for a week.

The students already knew Prolog, which raises some concern that differences in prior knowledge could have affected the results. No pre-test is mentioned which would help account for this.
 
 
 

Test Task

The test task was to identify differences between a given program with source code, and a modified program for which only the output was provided. The differences between the programs were of four types:

The programs were designed to check for backtracking misconceptions. Pairs of students worked together to promote discussion. Student discussion and interactions were recorded and analyzed. A limit of 5 minutes per problem was imposed. Students could do as many problems as possible in 15 minutes.

 
 

Prolog SV Systems

Each SV had been designed with specific objectives in mind:
 
Spy: Trace execution using the unification model.

PTP (Prolog Trace Package): Give more execution details than Spy.

TPM (Transparent Prolog Machine): Show execution graphically using depth-first AND/OR trees; provide overview and detailed views.

TTT (Textual Tree Tracer): Trace using a format close to the source code.

 

A common execution environment, the Prolog Program Visualization Laboratory (PPVL), was used for all SV systems. It provided a common interface and recorded user activity.
 



Results

Analysis of the scores showed that there were four main types of misunderstanding: Understanding 'timing' is defined as the ability to decipher what point in execution an output indicated. Total instances of misunderstanding were ranked by SV:
 
Least PTP
Spy/TTT
Most TPM

However, TPM had the fewest misunderstandings for Data Flow. The number of problems solved in the given time (15 minutes) was also ranked by SV:
 
Most PTP
TTT
Spy
Least TPM

The author was able to relate the features of each SV to student results. He found that:

This study seems to be well designed and an example for others. However, one wonders why a group which used no SV was not included.

 
 





Testing Effectiveness of Algorithm Animation

Gurka and Citrin [GC96] explore a wide range of issues related to the use and evaluation of algorithm animations. They wonder if "...it is possible that animations, as they are now created and used, are not particularly effective for teaching algorithms."

Approaches to evaluation are discussed. They find that qualitative evaluation does not usually address effectiveness directly and tends to rely on student perceptions, which are also mixed with usability issues. Quantitative evaluation often has no pedagogical substitute for animation in the control group, so any improvement may be attributed to the extra or alternative methodology used. They suggest using a human tutor for the control group. They also mention the need for a large number of subjects to produce significant findings.
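The point about the number of subjects can be made concrete with a standard normal-approximation formula for comparing the means of two equal-sized groups. The sketch below is an assumption added for illustration and does not come from Gurka and Citrin: the function name is hypothetical, the effect size is the standardized difference in means (Cohen's d), and the defaults of a 5% two-sided significance level and 80% power are conventional choices only.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects needed in each of two equal-sized groups.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2, the normal
    approximation for a two-sided comparison of two group means.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.8, 0.5, 0.2):
    print(d, subjects_per_group(d))
# 0.8 25
# 0.5 63
# 0.2 393
```

Under these assumptions, groups of 10 students, as in the first Stasko and Lawrence study, would be expected to detect only very large effects reliably.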
 



Design Considerations

Gurka and Citrin propose a set of seven animation factors which should be taken into account in design and evaluation:
 
  1. Usability difficulties. This includes providing controls requested by users (e.g. step forward and reverse).
  2. Animation quality, discussed below.
  3. System training, which might promote the independent use of animation systems. This would include tutorials on exploration techniques, not just software training.
  4. Logistics of system availability. They believe it is important for the student to be able to run a software visualization system on almost any platform, and to be able to exchange visualizations with TAs, etc.
  5. Type of animation use, i.e. what is the best place in the curriculum to use animations and why?
  6. Individual differences among learners, particularly exploratory and visual learning ability.
  7. Algorithm complexity. They speculate that algorithms might have to be of a certain level of difficulty for an animation to be useful, while others are so complex they need to be presented in stages.



Animation Design Suggestions

The authors believe that animation quality is hardest to address, and that animation designers should examine instructional issues and graphic design principles. Graphic design principles can be used to design animations which draw the user's focus, use color appropriately, etc. Instructional issues are discussed in detail. They propose that each algorithm be examined to see

Furthermore, they recommend that expert instructors be consulted to advise what aspects of an algorithm are difficult. Pedagogical design principles should be used to present algorithms and highlight difficult aspects. Designers should reinforce an action in more than one way in a presentation. For example, they could use text to accompany and narrate the steps in an animation to guide the "student's emerging mental model". A "sequenced set" of animations could be used to present algorithms in a pedagogical manner, component by component.
 

The authors believe that improved design will yield better results from studies. Some of the proffered suggestions have been used in the other studies discussed, i.e. accompanying text and user interaction in Stasko and Lawrence, and pedagogical design in Mulholland. Both studies produced positive results.
 


Conclusions

There is room for additional work in this area. The work of Stasko, Mulholland and others shows that benefits from the use of SV systems can be achieved with well designed systems, used thoughtfully. The suggestions from Gurka point to directions with further potential. For sound evaluation, there must be careful design of the study as well as the SV systems and pedagogy used.
 


References

 
[Mul98] Paul Mulholland, "A Principled Approach to the Evaluation of SV: A Case Study in Prolog", in Software Visualization, John Stasko, et al., eds. MIT Press, 1998.

[SL98] John Stasko and Andrea Lawrence, "Empirically Assessing Algorithm Animations as Learning Aids", in Software Visualization, John Stasko, et al., eds. MIT Press, 1998.

[GC96] Judith S. Gurka and Wayne Citrin, "Testing Effectiveness of Algorithm Animation", in Proceedings of the 1996 IEEE Symposium on Visual Languages, pages 182-189, Boulder, CO, September 1996.
 
 

Useful Links

Software Visualization at the Graphics Visualization and Usability Centre at Georgia Tech
Software Visualization at the Open University
