CS 1321X - Lecture 13 - September 30, 2003

Data Abstraction


I.  Data abstraction

Now you should be somewhat familiar with the basic ideas 
behind abstraction. And the type of abstraction that we've been 
dealing with mostly is called procedural abstraction. That is, in 
building our procedures, we focus on:

  Postponing worrying about the details

  Decomposition

  Putting as much distance as possible between the high-level 
  conceptual ideas of what we're trying to do and the details of how it 
  gets done

That final point is very important. What it says is that, in the 
procedures we build, we're trying to separate the theory, the design, 
the algorithm, etc., from the low-level implementation stuff as much 
as we can.

Why is this good for us? It aids in designability, maintainability, 
adaptability, readability, debuggability, and all the other 'itys. 
But we also know that it's painful. Why? Let's face it...we'd rather 
just slam-dunk the code, and all this abstraction stuff is hard to 
think about if we're not used to it.

Nevertheless, the positives outweigh the negatives here, so we 
continue abstracting away. We get similar benefits when we abstract 
away the details of our data structures. For example, we already know 
that in Scheme we can build linked-list data structures easily 
without worrying about the details of how those structures are 
implemented. And that makes our programming lives easier. Those same 
simple tasks could be a lot more difficult in other programming
languages; Scheme has done some of the abstraction for us.

So this puts us into another, but related, world of abstraction that's
called data abstraction. We want to be able to write our programs in
ways that focus on the high-level concepts about the data while
abstracting away the implementation details of the data
structures. How important is this data abstraction thing?  Put it this
way: if, for the past 30 or 40 years, programmers had bothered to
abstract away the details of how dates are represented from the
higher-level procedures that used dates in important computations, the
task of fixing all the programs in the entire world by January 1, 2000
would have been a whole lot easier, saving billions of dollars, and
millions of hours of effort. That's how important data abstraction is.

When talking and thinking about programming at the higher levels, it's
easier to work with terms, concepts, and ideas that are neither
language nor computer dependent. Doing this successfully requires the
use of existing abstract data types provided by the language or the
creation of your own abstract data types from existing ones. Remember,
an ADT is a logical (not the way it's physically represented, but
more, um, abstract) description of a data structure plus operations
that are specific to that data structure. Those data-structure-specific 
operations form the interface between whatever uses the data structure and 
the implementation details of the data structure itself. So getting those 
operations right is a large part of doing the data abstraction right.
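
To make the idea of an interface a bit more concrete, here is a minimal
sketch of a date ADT in Scheme. Everything in it is an assumption made
for illustration: the names make-date, date-year, date-month, date-day,
and date-earlier? are invented here, and the three-element list is just
one of many representations we could have picked.

```scheme
;; Hypothetical date ADT (all names are illustrative assumptions).
;; Representation assumed here: a three-element list (year month day).

;; constructor
(define (make-date year month day)
  (list year month day))

;; selectors: the only procedures that know about the representation
(define (date-year d)  (car d))
(define (date-month d) (cadr d))
(define (date-day d)   (caddr d))

;; A higher-level procedure written only in terms of the selectors.
;; Returns #t if date d1 comes strictly before date d2.
(define (date-earlier? d1 d2)
  (cond ((< (date-year d1) (date-year d2)) #t)
        ((> (date-year d1) (date-year d2)) #f)
        ((< (date-month d1) (date-month d2)) #t)
        ((> (date-month d1) (date-month d2)) #f)
        (else (< (date-day d1) (date-day d2)))))
```

If the representation of a date had to change, only make-date and the
three selectors would need to be rewritten; date-earlier? never touches
car or cadr, so it would keep working as is.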

Today we'll be talking about dynamic linked-list data structures. 
They're dynamic, because they can change in size and shape as they're 
being used. That dynamism provides a nice degree of flexibility, and 
that's generally good, but there's a cost: extra information must be 
stored in a dynamic data structure to keep track of how it's supposed 
to be organized, and that takes up extra memory and uses up extra 
time. Other sorts of data structures that we'll talk about in weeks 
to come (with names like arrays and vectors) are static in that they 
don't typically change size and shape, so there's less bookkeeping 
overhead, but there's also less flexibility for the programmer. But 
that's an issue we'll deal with later too.

For now, let's start with a quick introduction to the granddaddy of 
abstractions, the graph.


II.  Graphs -- an abstraction of abstractions

There's an abstraction that computer science types use all the time. 
It's called a graph, and you may have seen graphs before. You just 
didn't know that their technical name is "graph"...it's a math thing. 
A graph is just a collection of vertices and edges. Sometimes the 
vertices are called nodes, and sometimes the edges are called links 
or arcs. Any way you look at it, it's all the same. Here's a 
graphical representation of a graph:

Since we're going to use graphs as a means of describing data structures, it would be nice to have a way to represent stored data in a graph. Computer science folks do this by drawing circles at the vertices and writing the information in those circles. That's another abstraction, because that's not how it really looks in the computer's memory, but it makes it easier for us to think about and play with:

And if we put orientations on the edges, we get something called a directed graph. Those orientations, which we denote with arrowheads, give us more information about relationships between the pieces of data in the vertices, like the order that's imposed by the links between cons cells:

This directed graph notation gives us an abstract representation of a linked list in Scheme, like '(a b c), which as you recall is a previously-defined abstract data type in Scheme. (I hope you recall it...you've been using linked lists for weeks!)

(a b c)

 _______                     _______                 _______
|   |   |                   |   |   |               |   |  /|
| | | --+------------------>| | | --+-------------->| | | / |
|_|_|___|                   |_|_|___|               |_|_|/__|
  |                           |                       |
  |                           |                       |
 \|/                         \|/                     \|/
  a                           b                       c
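
The picture above can be reproduced at the Scheme prompt. As a small
sketch (the name abc is made up for this illustration), each box in the
diagram corresponds to one call to cons, and the slash in the last box
is the empty list:

```scheme
;; Build (a b c) one cons cell at a time; abc is an illustrative name.
(define abc (cons 'a (cons 'b (cons 'c '()))))

(car abc)               ; arrow pointing down from the first box: a
(cdr abc)               ; arrow pointing right from the first box: (b c)
(cdr (cdr (cdr abc)))   ; the slash in the last box: ()
(equal? abc '(a b c))   ; #t, same structure as the quoted list
```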


III. Trees

Computer scientists use a data structure called a "tree" quite often. 
A tree is a hierarchical linked-list structure. In directed-graph 
terms, a tree is a single node (or vertex), called the "root" of the 
tree, with directed links (or edges) to zero or more nodes (or 
vertices), each of which is the root of a tree. There are no cycles 
or loopbacks in trees, so by following the directed links, there is 
at most one path from one node to another in a tree.

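Here's a directed graph abstraction of a tree:

                a
              /   \
             b     c
            / \   / \
           d   e f   g

(Each link points downward, from a node to the root of one of its 
subtrees.)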

This particular example also happens to be an example of a binary 
tree, in which each node has directed links emanating from it to no 
more than two other nodes.

We can represent this tree as a Scheme list:

(a (b (d () ()) (e () ())) (c (f () ()) (g () ())))

At the top level, this tree consists of three elements: the symbol a 
stored at the root, the left subtree, and the right subtree. The left 
subtree in turn consists of three elements: the symbol b, the left 
subtree of b, and the right subtree of b. And so on and so on. When a 
node doesn't have a left subtree or a right subtree, we just put () 
where the subtree would go. You could map this tree onto 
box-and-pointer notation to see the "treeness" of it all, but I'll 
leave that as an exercise for you.
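
To see the three-element structure directly, here is a small sketch
that picks the example tree apart with the ordinary list operations.
The name sample-tree is an assumption made for this illustration.

```scheme
;; The example tree, stored under an illustrative name.
(define sample-tree
  '(a (b (d () ()) (e () ())) (c (f () ()) (g () ()))))

(car sample-tree)                 ; root:          a
(cadr sample-tree)                ; left subtree:  (b (d () ()) (e () ()))
(caddr sample-tree)               ; right subtree: (c (f () ()) (g () ()))
(car (cadr sample-tree))          ; root of the left subtree: b
(cadr (cadr (cadr sample-tree)))  ; left subtree of d, which is empty: ()
```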

Admittedly, this is an unwieldy sort of representation for a tree, 
and if you're thinking ahead, you might have already concluded that 
we might get more flexibility if we mapped this tree onto our a-list 
representation. You'd be right, but we're getting ahead of ourselves. 
This representation will serve us just fine for now.

As mentioned earlier, trees are hierarchical data structures, so we 
use them to organize chunks of information that have hierarchical 
relationships between the chunks. Obviously, we organize this stuff 
so that we can find it when we need it, and there are some standard 
ways of searching trees, some efficient and some not so efficient, 
that we'll see in the near future. Let's start with the inefficient 
ways so we know what they look like, and then we can work to make 
them better.

One way to find something in a tree is to look at every node in the 
tree and see if the thing you're looking for is there. The process of 
looking at every node in a tree is called tree traversal, and there 
are different ways to traverse a tree. One specific way of looking at 
everything is called preorder tree traversal. The high-level 
algorithm, which is recursive, of course, looks like this:


preorder-tree-traversal

  1.  visit the root of the tree and do what you're going to do 
      with it (print it, collect it, stop if it's what you're looking for, 
      whatever)

  2.  call preorder-tree-traversal on the left subtree of the root

  3.  call preorder-tree-traversal on the right subtree of the root


If we do a preorder tree traversal on the tree above, and print every 
root node that we visit in step 1, then we'd see the following 
printed:

abdecfg

Can we write some Scheme function to perform a preorder tree 
traversal on the tree above? Well of course we can! Here's one 
possibility:


(define (print-preorder tree)
    (cond ((null? tree) '())
          (else (write (car tree))               ;print root
                (print-preorder (cadr tree))     ;trav left
                (print-preorder (caddr tree))))) ;trav right


If we run our little program, we'll see this:


>  (print-preorder '(a (b (d () ()) (e () ())) (c (f () ()) (g () ()))))
abdecfg
()
>


Just like we expected. Or at least it's almost what we expected. Why 
do you suppose that empty list is displayed below the line where you 
see "abdecfg"?  That's something worth thinking about, no? Yes. 
(Think about what this function returns as opposed to what it prints.)

One thing you should note right away is that we're using "write" in 
the print-preorder function and printing the nodes as we visit them, 
as opposed to collecting the nodes into a list.  This makes it a 
little bit easier to see how this works---you don't have to think 
about how or when to collect nodes into the list. But this also makes 
us take advantage of something we touched on only briefly before: 
side effects. Recall that a side effect is something that persists 
after the function that caused it ceases to execute.  Printing is a 
side effect. The other thing that's happening here is that we're 
evaluating more than one expression in the action part of a cond 
clause (the else clause, in this case). It's always been possible, 
but up until now it hasn't been useful, because as we said before, we 
haven't been using any side effects.
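
For comparison, here is a sketch of the other approach mentioned above:
collecting the nodes into a list instead of printing them. The name
collect-preorder is an assumption for this illustration, and this is
just one way to write it. Note that it uses no side effects at all.

```scheme
;; Sketch: preorder traversal that returns the visited nodes as a list
;; rather than printing them. collect-preorder is an illustrative name.
(define (collect-preorder tree)
  (cond ((null? tree) '())
        (else (cons (car tree)                                   ;root first
                    (append (collect-preorder (cadr tree))       ;then left
                            (collect-preorder (caddr tree))))))) ;then right
```

Calling (collect-preorder '(a (b (d () ()) (e () ())) (c (f () ()) (g () ()))))
returns the list (a b d e c f g), the same order in which
print-preorder prints the nodes.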

Using side effects takes us out of the purely functional programming 
paradigm, but there's no way to do printing otherwise, so there it 
is. Do not consider this little exposition to be permission to go use 
any and all side effects for whatever reason. For now, we'll tell you 
when it's ok to use a side effect; don't use them if we don't tell 
you it's ok. We'll eventually give you free rein, so be patient. And 
be afraid...be very afraid.

Sadly, "print-preorder" isn't as good as it could be. As it stands 
now, the details of accessing the list structure that represents a 
binary tree are merged with the high-level algorithm for traversing a 
binary tree. While it's not real important for a five-line function, 
if you developed large programs where the data access details were 
intertwingled (it's a technical term) with the higher-level issues, 
you'd be one unhappy camper when somebody came along and insisted on 
changing the data structure specs. Trust me. Just ask anyone who 
dealt with the Y2K cleanup.  Ugh.  In other words, we haven't done a 
very good job of data abstraction here. How do we fix that?
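
One possible direction, sketched here under stated assumptions: hide
the car/cadr/caddr details behind a small set of tree operations and
write the traversal in terms of those. The names empty-tree?, root,
left-subtree, and right-subtree are invented for this illustration and
aren't defined anywhere else in these notes.

```scheme
;; Hypothetical tree interface; only these four know the list layout.
(define (empty-tree? tree)   (null? tree))
(define (root tree)          (car tree))
(define (left-subtree tree)  (cadr tree))
(define (right-subtree tree) (caddr tree))

;; The traversal now reads like the high-level algorithm.
(define (print-preorder tree)
  (cond ((empty-tree? tree) '())
        (else (write (root tree))                      ;visit root
              (print-preorder (left-subtree tree))     ;trav left
              (print-preorder (right-subtree tree))))) ;trav right
```

With this arrangement, a change to how trees are stored would mean
rewriting the four small interface procedures, and print-preorder
itself could stay exactly as it is.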



Copyright (c) 2003 by Kurt Eiselt.  All rights reserved, with 
the exception of stuff that belongs to somebody else.

Last revised: October 2, 2003