I. A brief introduction to search
Now that we have all this new knowledge about representation in trees
(and related hierarchical structures, as we'll see soon), we need some
means for exploring these knowledge structures to get at the
information we want at the time we want it. How do we do this? The
answer is a bunch of techniques which collectively fall under the
heading of "search". We've seen a little bit of this already, when we
use "assoc" to find a key in an association list, or when we do a
preorder tree traversal on a tree (which looks a whole lot like one of
the searches we'll see later today). Search is a concept which
permeates computer science. We'll only touch on a few kinds of
search in this course, but they'll be sufficient to demonstrate the
basic difference between brute-force, exhaustive, or "dumb" search
and heuristic or "intelligent" search.
II. Linear search
You already know how to do a linear search. It's what happens when you
scoot down a linked list in Scheme looking for a particular list
element. It's what Scheme is doing when it looks for a key in an
a-list using the assoc function. (If you've ever seen my office, you
know that the only way I could find something in there is by linear
search: I start at one end of the desk and look at everything until I
find what I'm looking for.) Linear searches take a long time -- O(n),
that kind of time. (Actually, assuming the thing you want is equally
likely to be anywhere in the list, you'll look at about n/2 elements on
average, but that's still O(n), since as you know the constants are
more or less unimportant.) In some contexts, as in the
case of sorting algorithms for example, O(n) is pretty fast. You'll
see more about sorting algorithms in a few weeks. But in the context
of searching, O(n) time is pretty slow, so we'd like to find something
better.
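Here's a minimal sketch of the idea. The name "linear-search" is made up
for this example (it's not built in to Scheme), and it borrows the
convention of the built-in "member" by returning the rest of the list
starting at the match:

```scheme
;; linear-search is a hypothetical name used only for illustration.
;; Returns the sublist starting at the first match, or #f if nothing matches.
(define (linear-search item lst)
  (cond ((null? lst) #f)                         ; ran off the end: not found
        ((equal? item (car lst)) lst)            ; found it
        (else (linear-search item (cdr lst)))))  ; keep scooting down the list

(linear-search 'fred '(wilma fred betty barney))   ; => (fred betty barney)
(linear-search 'dino '(wilma fred betty barney))   ; => #f
```

In the worst case (the item is last, or missing) every element gets
examined, which is where the O(n) comes from.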
III. A little taste of binary search
To get better than O(n) search time complexity, we can impose a
separate indexing scheme on our data structure, so that we can cut
down on some search time. For example, we could apply a binary search
mechanism to look for an employee record in a linear list or file
that's kept sorted by employee name. The
basic principle behind a binary search is that every time we make a
comparison to see where to search next, we're eliminating about half
of the information that remains to be searched. For example, if the
employee's name starts with a letter in the range A-M, we could start
the search at the beginning of the file, but if the name starts with
the letter N-Z, we would start the search at approximately the midway
point in the file. We could continue to divide the big groups into
smaller groups, until eventually the time to find a single record is
governed not by the behavior of the linear search but by the behavior
of the binary search. It turns out that binary search time complexity
is O(log n) instead of O(n). There are other indexing mechanisms that
we could use, such as hashing functions, that would give us different
kinds of advantages. We'll see more about binary searching and maybe
even hashing functions later in the course.
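As a rough sketch of the halving idea, here's a binary search over a
sorted vector of numbers. The name "binary-search" is hypothetical, and
the choice of a vector is an assumption made for this example: binary
search needs to jump straight to the middle element, which a vector
allows and a linked list doesn't.

```scheme
;; binary-search is a hypothetical helper, not a built-in.
;; vec must be a vector of numbers sorted in increasing order.
;; Returns the index of key in vec, or #f if key isn't there.
(define (binary-search key vec)
  (let loop ((low 0)
             (high (- (vector-length vec) 1)))
    (if (> low high)
        #f                                            ; nothing left to search
        (let* ((mid (quotient (+ low high) 2))
               (probe (vector-ref vec mid)))
          (cond ((= key probe) mid)
                ((< key probe) (loop low (- mid 1)))  ; discard the upper half
                (else (loop (+ mid 1) high)))))))     ; discard the lower half

(binary-search 23 (vector 2 5 8 13 21 23 42))   ; => 5
(binary-search 7  (vector 2 5 8 13 21 23 42))   ; => #f
```

Each pass through the loop throws away about half of what's left, so a
vector of n elements needs only about log2(n) comparisons.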
IV. Searching a hierarchical structure
As we discussed previously, we don't always store our stuff in linear
formats. We can also organize knowledge in hierarchies. Zoological
taxonomies and business organizational charts are classic examples of
hierarchical orderings. Genealogical or family trees are also
classic examples, so let's take a look at the oldest known family
tree...the Flintstone Family Tree:
                     Chip      Roxy
                       | (twins) |
                       ___________
                            ^
                           / \
                          /   \
                         /     \
                has-mom /       \ has-dad
                       /         \
                    \/_           _\/
                Pebbles           Bamm-Bamm
                 /   \ has-dad      /   \
        has-mom /     \    has-mom /     \ has-dad
               /       \          /       \
            \/_         _\/    \/_         _\/
           Wilma       Fred   Betty       Barney
In structures like this, as before, we may want to search for useful
information. But structures like this, unlike linear file structures,
make it easier to search for the answers to questions like "What's the
relationship of Barney to Chip?" or "Who is Chip's grandfather on his
mother's side?" So remember, when we're searching a hierarchical
structure like this, we're often interested in more than "is the item
I'm looking for in there?"...we often want to know the answer to "and
how do I get there from the beginning in case I want to do it again?"
or "what is the relationship between the thing I'm looking for and
something else in the data structure?"
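To make the grandfather question concrete, here's a small sketch. It
assumes a nested-list representation in which each node is written as
(name mom-subtree dad-subtree), with () standing for "unknown"; the
accessor names are invented for this example to match the arrows in the
picture.

```scheme
;; Assumed representation: (name mom-subtree dad-subtree), () for "unknown".
(define family
  '(chip (pebbles (wilma () ()) (fred () ()))
         (bamm-bamm (betty () ()) (barney () ()))))

;; Hypothetical accessors named after the arrows in the picture.
(define (name-of tree) (car tree))
(define (has-mom tree) (cadr tree))
(define (has-dad tree) (caddr tree))

;; Chip's grandfather on his mother's side: follow has-mom, then has-dad.
(name-of (has-dad (has-mom family)))   ; => fred
```

The answer falls out of the shape of the structure: the question is
really a path (has-mom, then has-dad) starting at the root.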
V. Depth-first search
The simplest form of search in a hierarchical or network structure is
called "depth-first search". Here's an algorithm for depth-first search
on a binary tree, looking for a specific node in the tree:
    df-search
      1. look at the root
      2. if it's what you're looking for, then return success
      3. if the root has no descendants, then return failure
      4. call df-search on the subtree whose root is the leftmost descendant
         and return success if that search is successful
      5. call df-search on the subtree whose root is the rightmost descendant
         and return success if that search is successful
      6. otherwise, return failure
This algorithm may look somewhat familiar, since it's just a variant of the
preorder tree traversal algorithm you've seen earlier:
    preorder
      1. visit the root
      2. call preorder on the left subtree
      3. call preorder on the right subtree
The big differences between the preorder algorithm and the depth-first
search algorithm are these:
      1. depth-first search stops before searching the whole tree, if it
         finds what it's looking for; preorder traversal always examines the
         entire tree
      2. with depth-first search, searching the right subtree occurs only if
         the search of the left subtree failed to find what was being looked
         for; with preorder traversal, the right subtree is always explored
         (this is sort of a corollary to the first difference listed just above)
How do you implement this in what is quickly becoming your favorite
programming language? That's what's up next.
VI. Implementing depth-first search
Above, we talked about the differences between depth-first search of a
binary tree and preorder traversal of that same tree. These differences
make implementation of depth-first search more complicated than
preorder traversal, but not drastically so. Here's a simple depth-first
search implementation for the Flintstone Family Tree, using a
representation format for trees that we've used occasionally before
(but this isn't the only representation that we could have used). The
tree looks like this in Scheme (and remember that just to make things
simpler, we've eliminated one of the twins):
    '(chip (pebbles (wilma () ()) (fred () ()))
           (bamm-bamm (betty () ()) (barney () ())))
And the Scheme code itself looks like this:
    (define (dfs item tree)
      (cond ((done? tree) #f)
            ((found-item? item (get-root tree)) item)
            (else (or (dfs item (get-left-subtree tree))
                      (dfs item (get-right-subtree tree))))))

    (define (done? tree)
      (null? tree))

    (define (found-item? item tree)
      (equal? item tree))

    (define (get-root tree)
      (car tree))

    (define (get-left-subtree tree)
      (cadr tree))

    (define (get-right-subtree tree)
      (caddr tree))
Again, I've abstracted away the details of accessing the list data
structure that represents the family tree, leaving only a high-level
algorithm description in the main function. In fact, the only built-in
Scheme forms used to describe the high-level algorithm are
"define", "cond", and "or".
If we add some calls to "display" (an output function) at just the
right spot, we can trace the behavior of our "dfs" program as it
searches the Flintstone Family Tree. The program will print the name at
every node in the order that it searches those nodes, until it finds
the desired name in the tree (assuming it's found). Then instead of
printing that node, "dfs" will return the name it finds there. Here's
the top level function with some extra stuff for printing the trace:
    (define (dfs item tree)
      (cond ((done? tree) #f)
            ((found-item? item (get-root tree)) item)
            (else (display (get-root tree))
                  (display " ")
                  (or (dfs item (get-left-subtree tree))
                      (dfs item (get-right-subtree tree))))))
And if we fire up DrScheme one more time and try this out, we'll see
the following:

    > (dfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                          (bamm-bamm (betty () ()) (barney () ()))))
    chip pebbles wilma fred bamm-bamm betty
    barney
    >
Will wonders never cease? That's exactly the order...first Chip, then
Pebbles, then Wilma, then Fred, back up to Bamm-Bamm, down to Betty, and
then finally the finding of Barney...that we'd expect in depth-first
search. That is, the search proceeds downward in the tree as far as it
can before it starts to work its way back up, but as soon as it can search
down some previously unsearched path, it starts to search downward again.
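For comparison, here's a sketch of a preorder traversal over the same
nested-list representation. The name "preorder" and the choice to return
the visited names as a list are assumptions made for this example. Unlike
"dfs", it has no way to stop early, so it always walks the whole tree.

```scheme
;; preorder returns a list of every node name in the order visited.
(define (preorder tree)
  (if (null? tree)
      '()
      (cons (car tree)                           ; visit the root
            (append (preorder (cadr tree))       ; then the left subtree
                    (preorder (caddr tree))))))  ; then the right subtree

(preorder '(chip (pebbles (wilma () ()) (fred () ()))
                 (bamm-bamm (betty () ()) (barney () ()))))
;; => (chip pebbles wilma fred bamm-bamm betty barney)
```

The order matches the depth-first trace above, which is what you'd
expect when the item being searched for happens to be the last node
visited.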
VII. A quick detour through logical predicates
If you noticed the use of the "or" operation above, you may be
wondering what it is. It's one of several logical predicates
(sometimes called Boolean predicates, after the famed mathematician
George Boole) available to you in some form in most programming
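languages. The "or" predicate takes any number of arguments and returns
the value of the first one that evaluates to something other than #f
(in Scheme, anything that isn't #f counts as true). If every argument
evaluates to #f, it returns #f. Here are some examples: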
> (or #t)
#t
> (or #t #t)
#t
> (or #f #t #f)
#t
Another related predicate is the "and" predicate. It works on a list of
arguments too: it returns #f if any argument evaluates to #f, and
otherwise returns the value of its last argument, which is #t in these
examples:
> (and #t)
#t
> (and #f #t)
#f
> (and #t #t #t)
#t
Finally, there's the "not" predicate, which takes one argument and returns
#f if the argument evaluates to anything other than #f, and returns #t if
the argument evaluates to #f:
> (not #t)
#f
> (not #f)
#t
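Strictly speaking, standard Scheme's "or" and "and" hand back the value
of one of their arguments rather than always #t or #f, and they stop
evaluating as soon as the answer is known. A few expressions you could
try at the prompt (the results in the comments are what the R5RS
definitions of these forms give):

```scheme
(or #f 'barney #t)      ; => barney  (the first value that isn't #f)
(or #f #f)              ; => #f
(or)                    ; => #f      (no arguments at all)
(and 1 2 'fred)         ; => fred    (all true, so the last value comes back)
(and 1 #f 'fred)        ; => #f
(and)                   ; => #t
(not 'wilma)            ; => #f      (anything other than #f counts as true)
(or 'found (car '()))   ; => found   (the second argument is never evaluated)
```

The last line would signal an error if "or" evaluated all of its
arguments, since you can't take the car of an empty list.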
The use of "or" in the "dfs" function is an easy way to fulfill the
requirement that the right subtree isn't searched if what we're
looking for is found in the left subtree. It's not an especially
obvious use of "or", which is typically used as a logical predicate,
not as a program control mechanism, but it's not an uncommon use
in Scheme programming. This use of "or" takes advantage of how "or"
goes about its work (i.e., that "or" evaluates its arguments left to
right, stops as soon as it finds an argument which evaluates to
anything other than #f, and returns that value). Usually we would
discourage non-obvious uses of
operations that rely on non-obvious implementation details, but
because this use of "or" is sort of a standard Scheme idiom,
we're ok with it here. Also, since the left-to-right evaluation of
arguments is part of the standard R5RS Scheme specification for "or",
we can count on things happening the way we expect them to here, but
beware of non-standard Scheme systems that might implement an "or" which
evaluates arguments right-to-left, with results that might confuse us
a little bit (although in a purely functional world, the results
wouldn't be catastrophic). Furthermore, this approach assumes that any
given node has a fixed number of children; if you want to cope with a
variable number of children at any node, you might want to code up a
slightly different version of this anyway. For now, we'll leave the
"or" there, but feel free to do something different, and hopefully
better. You can control depth-first search with "or" as we've done
here, but you can also do the same thing using just the "else" in
"cond". If you have access to a copy of the textbook and are
inclined, you'll find an example of how to do this without "or" in
that book.
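Here's one way a version without "or" might look. This is only a
sketch, not necessarily what the textbook does, and the name "dfs-no-or"
is made up for the example. It saves the result of searching the left
subtree with "let" and only searches the right subtree in the "else"
branch, when the left search came back #f.

```scheme
;; dfs-no-or is a hypothetical name; the tree is the same nested-list
;; representation: (root left-subtree right-subtree), with () for empty.
(define (dfs-no-or item tree)
  (cond ((null? tree) #f)
        ((equal? item (car tree)) item)
        (else
         (let ((left-result (dfs-no-or item (cadr tree))))
           (cond (left-result left-result)   ; found on the left: stop here
                 (else (dfs-no-or item (caddr tree))))))))

(dfs-no-or 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                          (bamm-bamm (betty () ()) (barney () ()))))
;; => barney
```

It's a few lines longer than the "or" version, but the control flow is
spelled out rather than tucked inside a logical operator.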
VIII. Getting past yes or no
Sadly, the search function described in the previous chunk of notes
above doesn't tell me much---just whether or not an item I'm looking
for is in the tree. I'd get more information if I could get the search
function to tell me how to get from the root of the tree to the item
I'm looking for, assuming the item I'm looking for is in the tree. That
path from the root to the item would at least be an approximation of
the relationship between those two nodes in the tree; in the case of
the Flintstones, for example, the path
"Chip -has-dad-> Bamm-Bamm -has-dad-> Barney"
tells me something about the relationship between Chip and Barney. How
can I get my depth-first search procedure to return this path, instead
of just the item itself, when it finds the item in the tree? It's pretty
easy. All you do is introduce an additional argument as a sort of
variable to store the path from the root to wherever the procedure is
looking in the tree. You get that additional argument by adding a helping
function, just like in many of those examples of tail recursion. (But note
that the addition of the helper function does not make this depth-first
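search tail recursive.) Then it's just a question of building up the
result as the procedure searches deeper in the tree. Because each node
is consed onto the front of the result, the path comes back in reverse
order, with the item first and the root last: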
    (define (dfs item tree)
      (dfs-helper item tree '()))   ; the empty list must be quoted

    (define (dfs-helper item tree result)
      (cond ((done? tree) #f)
            ((found-item? item (get-root tree))
             (cons item result))
            (else (or (dfs-helper item
                                  (get-left-subtree tree)
                                  (cons (get-root tree) result))
                      (dfs-helper item
                                  (get-right-subtree tree)
                                  (cons (get-root tree) result))))))

    (define (done? tree)
      (null? tree))

    (define (found-item? item tree)
      (equal? item tree))

    (define (get-root tree)
      (car tree))

    (define (get-left-subtree tree)
      (cadr tree))

    (define (get-right-subtree tree)
      (caddr tree))
And note that because I've taken the time to do a great deal of data
abstraction, separating the functions that access the Scheme data
structure from the higher-level algorithm, all I had to do was
make a few changes to the top-level procedure; the lower-level ones,
which are just Scheme data structure accessors, didn't change at all.
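If you'd rather read the path from the root down, one option is to
reverse the result once at the very end. Here's a compact sketch of that
idea; the names "dfs-path" and "dfs-path-helper" are hypothetical, and
the accessors are written inline as car, cadr, and caddr just to keep
the example short.

```scheme
;; dfs-path: same search as above, but the accumulated path is reversed
;; at the end so it reads from the root down to the item.
(define (dfs-path item tree)
  (let ((found (dfs-path-helper item tree '())))
    (if found (reverse found) #f)))

(define (dfs-path-helper item tree result)
  (cond ((null? tree) #f)
        ((equal? item (car tree)) (cons item result))
        (else (or (dfs-path-helper item (cadr tree) (cons (car tree) result))
                  (dfs-path-helper item (caddr tree) (cons (car tree) result))))))

(dfs-path 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                         (bamm-bamm (betty () ()) (barney () ()))))
;; => (chip bamm-bamm barney)
```

Without the call to "reverse", the same search would return
(barney bamm-bamm chip).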
Could I convert the Flintstones Family Tree into an association list,
and then modify the dfs program accordingly? Well of course I could.
But I'll leave that as an exercise for you, the home viewer.
IX. Breadth-first search
Instead of searching as deep down one path from the root as one can
search into the tree before searching down a different path, you
could also write your search program such that it looks at the root
first, then all the children of the root next (i.e., the next level
of nodes), then all the children of those children (i.e., the third
level of nodes), and so on. When a search looks at all the nodes at
one level before looking at the next level, that is when it searches
the breadth of the tree before it looks deeper, that's called
"breadth-first search".
The key to getting breadth-first search right is to employ a data
structure we haven't heard about before: the queue. A queue allows
you to add things to one end and take things off the other end. So a
queue maintains the order in which things arrived (first-in-first-out
or FIFO in geekspeak) instead of reversing the order like a stack
does. In our case, we'll use the queue to keep track of all the
subtrees we need to search, in the order that we'll have to search
them.
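A queue can be sketched with nothing more than a plain list, treating
the car as the front. All of the names below are made up for this
example, and the operations return new lists rather than changing the
old one.

```scheme
;; A queue sketched as a plain list: the front of the queue is the car.
(define (make-queue) '())
(define (queue-empty? q) (null? q))
(define (enqueue q x) (append q (list x)))   ; add at the back (O(n) time)
(define (queue-front q) (car q))             ; peek at the front element
(define (dequeue q) (cdr q))                 ; the queue minus its front element

(define q (enqueue (enqueue (enqueue (make-queue) 'a) 'b) 'c))
q                 ; => (a b c)
(queue-front q)   ; => a   (first in, first out)
(dequeue q)       ; => (b c)
```

Using "append" to add at the back is simple but walks the whole list
each time; that's fine for small examples like the ones in these notes.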
Here's a general algorithm for breadth-first search of a binary tree.
Note that step 0 is an initialization step and only happens once, so
it's not part of the named breadth-first-search procedure
(bf-search). In other words, when you see "call bf-search" below,
step 0 isn't being executed every time the bf-search procedure is
called. Also, the algorithm below wants you to put whole trees and
subtrees on the queue, and that's fine for our nested list
implementation of a tree. But you could also just put pointers to
trees and subtrees on the queue, and that should work just as well,
so long as you remember how to access the information at the ends of
the pointers. So on your homework problems dealing with breadth-first
search where the TAs want you to implement your tree as an
association list, you'll probably want to store just pointers (a.k.a.
keys or indices) in your queue.
      0. initialize the queue by adding the tree that you're
         going to search to the back of the queue and call
         bf-search

    bf-search
      1. if the queue is empty, then return failure (you've
         searched the entire tree and you didn't find what
         you were looking for)
      2. retrieve the tree that's at the front of the queue and
         look at its root
      3. if it's what you're looking for, then return success
      4. if the root has no descendants, then remove that tree
         from the queue and call bf-search
      5. remove the tree from the front of the queue and retain
         the left and right subtrees from that tree
      6. add those left and right subtrees to the back of the
         queue and call bf-search
So for example, we start by putting the whole Flintstone Family Tree
on the queue. We then look at the root of that tree, which is Chip.
If it's what we're looking for, we're done. If it's not what we're
looking for, we'll remove that whole tree from the front of the
queue, but we'll put the left subtree of that tree on the back of the
queue, and then we'll add the right subtree. Now we look at what's
first on the queue. It's the subtree whose root is Chip's mother,
Pebbles. If Pebbles is what we're looking for, then we're done. But
if Pebbles isn't what we're looking for, we'll remove that subtree
from the front of the queue, but we'll put that subtree's subtrees on
the back of the queue. So now what's at the front of the queue? It's
the subtree whose root is Bamm-Bamm. So far, we've looked at Chip,
then Pebbles, then Bamm-Bamm. With depth-first search we would have
looked at Wilma right after Pebbles, because we'd keep going down as
far as we can go. But with breadth-first search, we had to look at
Bamm-Bamm and any other nodes on the same level as Pebbles before
we'd go looking at the next level down.
Here's the breadth-first search program, which I adapted quickly from
the depth-first search program we wrote earlier. We didn't go over
this in class, but I promised I'd provide the program in these notes.
I used a helper function to initialize a queue to be a list whose
elements are trees or subtrees to be searched. The first element is the
initial tree in its entirety. If I included the queue data structure
explicitly in the top-level function, it might seem like a violation of
all the data abstraction principles we've discussed, but keep in mind
that the queue data structure is part of the breadth-first search
algorithm, so it could be argued that it's appropriate to have that
data structure up there. However, it would be nicer to abstract away
the fact that my queue is implemented as a simple list, so I've done
that. Here's my code:
    (define (bfs item tree)
      (bfs-helper item (list tree)))

    (define (bfs-helper item queue)
      (cond ((done? queue) #f)
            ((null? (get-next-tree queue)) (bfs-helper item (rest-of queue)))
            ((found-item? item (get-root (get-next-tree queue))) item)
            (else (display (get-root (get-next-tree queue)))
                  (display " ")
                  (bfs-helper item (add-to-queue
                                    (rest-of queue)
                                    (get-left-subtree
                                     (get-next-tree queue))
                                    (get-right-subtree
                                     (get-next-tree queue)))))))

    (define (add-to-queue queue left right)
      (append queue (list left) (list right)))

    (define (rest-of queue)
      (cdr queue))

    (define (done? queue)
      (null? queue))

    (define (found-item? item root)
      (equal? item root))

    (define (get-root tree)
      (car tree))

    (define (get-left-subtree tree)
      (cadr tree))

    (define (get-right-subtree tree)
      (caddr tree))

    (define (get-next-tree queue)
      (car queue))
And again, when I try this out, it works fine:

    > (bfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                          (bamm-bamm (betty () ()) (barney () ()))))
    chip pebbles bamm-bamm wilma fred betty
    barney
    >
X. Depth-first vs. breadth-first searches
Both depth-first search and breadth-first search are forms of
exhaustive search. In other words, they'll systematically explore the
whole tree, no matter how big it is, until they find what they're
looking for. And if they don't find what they're looking for, they
will in fact search the entire tree. So in that sense, they're not
very smart, but if what they're looking for is in the tree, and if
the tree is finite in size (and yes, it is possible to have an
infinitely large tree, but that's for other courses), either search
is guaranteed to find it. In
fact, depth-first search and breadth-first search are so closely
related that you could turn the breadth-first search program above
into a depth-first search by turning that queue into a stack. That
is, if instead of adding subtrees to the back of the queue you add
those same subtrees to the front of the queue, you end up with
depth-first search instead of breadth-first search. All you'd have to
do in the program above is rewrite the add-to-queue function like
this:
    (define (add-to-queue queue left right)
      (cons left (cons right queue)))
So now we're consing the subtrees onto the front of the data
structure instead of appending them to the back end. The subtree that
is put on there last will now be the next tree to be searched. When
you run this thing again, here's what happens:

    > (bfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                          (bamm-bamm (betty () ()) (barney () ()))))
    chip pebbles wilma fred bamm-bamm betty
    barney
    >
The order in which nodes are visited is now what you'd expect from
depth-first search, not from breadth-first search.
But there are some behavioral differences between the two. For
example, if you were interested in the shortest path from the root to
a goal node, you'd probably prefer to use breadth-first search,
because breadth-first search is guaranteed to find a goal node that's
at least as close to the root as any other, if you measure path
distance in terms of number of edges or links from the root. This
matters when there are multiple nodes scattered throughout the tree
that could work as the goal node (in a tree, after all, there's only
one path from the root to any particular node).
However, if the tree is very broad and bushy (that is, if it's an
N-ary tree where N is kind of big), breadth-first search could end up
searching lots of nodes before it finds the goal node. Also note
that the queue could end up getting very very large, so there's some
additional storage overhead that comes with breadth-first search that
doesn't necessarily come with depth-first search. (Remember that we can
implement, and have implemented, depth-first search without that additional
data structure...the activation stack supplies the only stack we
really need for depth-first search.)
Depth-first search, on the other hand, offers no guarantee of finding
the shortest path to the goal node, but if the goal node is way deep
in the tree and could be found, say, over on the left-hand side of
the tree, just for instance, a depth-first search might find the goal
node sooner than a breadth-first search. Again, this might be
important where there are multiple goal nodes scattered throughout
the tree, but you're more concerned about finding some goal node in a
hurry as opposed to finding the shortest path to a goal node.
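
To see the difference on a tree with more than one goal node, here's a
small experiment. Everything in it is invented for illustration: the
"search-order" procedure, the two "combine" procedures, and the lopsided
tree, which has one node named goal deep on the left and another just
below the root on the right. The only thing that changes between the two
runs is whether new subtrees go on the back of the pending list (a
queue) or on the front (a stack).

```scheme
;; Returns the list of node names examined, in order, up to and including
;; the first node equal to goal, or #f if goal is never found.
(define (search-order goal tree combine)
  (let loop ((pending (list tree)) (seen '()))
    (cond ((null? pending) #f)                                ; searched everything
          ((null? (car pending)) (loop (cdr pending) seen))   ; skip empty subtrees
          ((equal? goal (car (car pending)))
           (reverse (cons goal seen)))
          (else
           (let ((next (car pending)))
             (loop (combine (cdr pending) (cadr next) (caddr next))
                   (cons (car next) seen)))))))

(define (at-back pending left right)    ; queue behavior: breadth-first
  (append pending (list left right)))

(define (at-front pending left right)   ; stack behavior: depth-first
  (cons left (cons right pending)))

(define lopsided
  '(a (b (c (goal () ()) ()) ())
      (goal () ())))

(search-order 'goal lopsided at-back)    ; => (a b goal)
(search-order 'goal lopsided at-front)   ; => (a b c goal)
```

With the queue, the search stops at the goal one link below the root;
with the stack, it dives down the left side and stops at the goal three
links down.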
In short, it depends on the organization of the data structure itself
and what kinds of things you can assume about where the goal node or
nodes might be found. This whole "find the goal node in a tree" idea may
seem pretty obscure right now, but as you'll see in the next couple
of weeks, finding a goal node among a bunch of possible goal nodes in
some data structure is a pretty common metaphor for all kinds of
interesting tasks that we ask computers to do for us. So we'll see
more of this stuff later.
Copyright (c) 2003 by Kurt Eiselt. All rights reserved, with
the exception of stuff that belongs to somebody else.
Last revised: October 7, 2003