CS 1321X - Lecture 15 - October 7, 2003

Search


I.  A brief introduction to search 

Now that we have all this new knowledge about representation in trees
(and related hierarchical structures, as we'll see soon), we need some
means for exploring these knowledge structures to get at the
information we want at the time we want it. How do we do this? The
answer is a bunch of techniques which collectively fall under the
heading of "search". We've seen a little bit of this already, when we
use "assoc" to find a key in an association list, or when we do a
preorder tree traversal on a tree (which looks a whole lot like one of
the searches we'll see later today). Search is a concept which
permeates computer science. We'll only touch on a few kinds of
search in this course, but they'll be sufficient to demonstrate the
basic difference between brute-force, exhaustive, or "dumb" search
and heuristic or "intelligent" search. 


II.  Linear search 

You already know how to do a linear search. It's what happens when you
scoot down a linked list in Scheme looking for a particular list
element. It's what Scheme is doing when it looks for a key in an
a-list using the assoc function. (If you've ever seen my office, you
know that the only way I could find something in there is by linear
search: I start at one end of the desk and look at everything until I
find what I'm looking for.) Linear searches take a long time -- O(n),
that kind of time. (Actually, assuming the thing you want is equally
likely to be anywhere in the list, you'll look at about n/2 elements
on average, but as you know the constants are more or less
unimportant, so it's still O(n).)  In some contexts, as in the
case of sorting algorithms for example, O(n) is pretty fast.  You'll
see more about sorting algorithms in a few weeks.  But in the context
of searching, O(n) time is pretty slow, so we'd like to find something
better.
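
To make that concrete, here's a minimal sketch of a linear search over
a Scheme list. The name "linear-search" is made up for these notes
(it's not a built-in). It returns the position of the first match,
counting from 0, or #f if it runs off the end of the list:

```scheme
;; linear-search is a hypothetical helper, not a built-in function.
;; It walks the list from the front, counting positions as it goes.
(define (linear-search item lst)
  (let loop ((remaining lst) (position 0))
    (cond ((null? remaining) #f)                      ; ran off the end: not found
          ((equal? item (car remaining)) position)    ; found it here
          (else (loop (cdr remaining) (+ position 1))))))

(linear-search 'fred '(wilma fred betty barney))   ; returns 1
(linear-search 'dino '(wilma fred betty barney))   ; returns #f after four looks
```

Notice that the unsuccessful search had to look at every element
before giving up. That's the O(n) behavior we'd like to avoid.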


III.  A little taste of binary search 

To get better than O(n) search time complexity, we can impose a
separate indexing scheme on our data structure, so that we can cut
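down on some search time.  For example, we could apply a binary search
mechanism to look for an employee record in a linear list or file,
as long as the records are kept sorted (by name, say) and we can jump
directly to any position in the file. The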
basic principle behind a binary search is that every time we make a
comparison to see where to search next, we're eliminating about half
of the information that remains to be searched. For example, if the
employee's name starts with a letter in the range A-M, we could start
the search at the beginning of the file, but if the name starts with
the letter N-Z, we would start the search at approximately the midway
point in the file. We could continue to divide the big groups into
smaller groups, until eventually the time to find a single record is
governed not by the behavior of the linear search but by the behavior
of the binary search. It turns out that binary search time complexity
is O(log n) instead of O(n). There are other indexing mechanisms that
we could use, such as hashing functions, that would give us different
kinds of advantages. We'll see more about binary searching and maybe
even hashing functions later in the course.
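
Just for a taste, here's a rough sketch of binary search. It makes a
few assumptions that aren't in the discussion above: the data lives in
a Scheme vector (so we can jump to any position in one step), the
elements are numbers, and they're already sorted in increasing order.
The name "binary-search" is made up for these notes:

```scheme
;; binary-search is a hypothetical helper, not a built-in function.
;; Assumes vec is a vector of numbers sorted in increasing order.
;; Returns the index of key in vec, or #f if key isn't there.
(define (binary-search key vec)
  (let loop ((low 0) (high (- (vector-length vec) 1)))
    (if (> low high)
        #f                                            ; nothing left to search
        (let* ((mid (quotient (+ low high) 2))
               (probe (vector-ref vec mid)))
          (cond ((= key probe) mid)                   ; found it
                ((< key probe) (loop low (- mid 1)))  ; throw away the upper half
                (else (loop (+ mid 1) high)))))))     ; throw away the lower half

(binary-search 23 (vector 2 5 8 13 23 42 77))   ; returns 4
(binary-search 7 (vector 2 5 8 13 23 42 77))    ; returns #f
```

Each trip through the loop cuts the range from low to high roughly in
half, which is where the O(log n) comes from.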


IV.  Searching a hierarchical structure 

As we discussed previously, we don't always store our stuff in linear
formats. We can also organize knowledge in hierarchies. Zoological
taxonomies and business organizational charts are classic examples of
hierarchical orderings.  Genealogical or family trees are also
classic examples, so let's take a look at the oldest known family
tree...the Flintstone Family Tree: 


                     Chip       Roxy
                       | (twins) |
                       ___________
                            ^
                           / \
                          /   \
                         /     \
                has-mom /       \ has-dad
                       /         \
                     \/_         _\/
                  Pebbles       Bamm-Bamm
                  / \ has-dad       / \
         has-mom /   \     has-mom /   \ has-dad
                /     \           /     \
              \/_     _\/       \/_     _\/
             Wilma    Fred     Betty   Barney


In structures like this, as before, we may want to search for useful
information. But structures like this, unlike linear file structures,
make it easier to search for the answers to questions like "What's the
relationship of Barney to Chip?" or "Who is Chip's grandfather on his
mother's side?" So remember, when we're searching a hierarchical
structure like this, we're often interested in more than "is the item
I'm looking for in there?"...we often want to know the answer to "and
how do I get there from the beginning in case I want to do it again?" 
or "what is the relationship between the thing I'm looking for and
something else in the data structure?"


V.  Depth-first search 

The simplest form of search in a hierarchical or network structure is
called "depth-first search". Here's an algorithm for depth-first search
on a binary tree, looking for a specific node in the tree: 

df-search

1.  look at the root

2.  if it's what you're looking for, then return success

3.  if the root has no descendants, then return failure

4.  call df-search on the subtree whose root is the left child
    and return success if that search is successful

5.  call df-search on the subtree whose root is the right child
    and return success if that search is successful; otherwise
    return failure


This algorithm may look somewhat familiar, since it's just a variant of the 
preorder tree traversal algorithm you've seen earlier:

preorder

1.  visit the root

2.  call preorder on the left subtree

3.  call preorder on the right subtree


The big differences between the preorder algorithm and the depth-first 
search algorithm are these:

  depth-first search stops before searching the whole tree, if it
  finds what it's looking for; preorder traversal always examines the
  entire tree 

  with depth-first search, searching the right subtree occurs only if
  the search of the left subtree failed to find what was being looked
  for; with preorder traversal, the right subtree is always explored
  (this is sort of a corollary to the first difference listed just above)

How do you implement this in what is quickly becoming your favorite
programming language? That's what's up next.


VI.  Implementing depth-first search 

Above, we talked about the differences between depth-first search of a
binary tree and preorder traversal of that same tree. These differences
make implementation of depth-first search more complicated than
preorder traversal, but not drastically so. Here's a simple depth-first
search implementation for the Flintstone Family Tree, using a
representation format for trees that we've used occasionally before
(but this isn't the only representation that we could have used). The
tree looks like this in Scheme (and remember that just to make things
simpler, we've eliminated one of the twins): 


'(chip (pebbles (wilma () ()) (fred () ()))
       (bamm-bamm (betty () ()) (barney () ())))


And the Scheme code itself looks like this: 

(define (dfs item tree)
   (cond ((done? tree) #f)
         ((found-item? item (get-root tree)) item)
         (else (or (dfs item (get-left-subtree tree))
                   (dfs item (get-right-subtree tree))))))

(define (done? tree)
   (null? tree))

(define (found-item? item tree)
   (equal? item tree))

(define (get-root tree)
   (car tree))

(define (get-left-subtree tree)
   (cadr tree))

(define (get-right-subtree tree)
   (caddr tree))


Again, I've abstracted away the details of accessing the list data 
structure that represents the family tree, leaving only a high-level 
algorithm description in the main function. In fact, the only built-in 
Scheme operations used to describe the high-level algorithm are 
"define", "cond", and "or" (which, strictly speaking, are special forms 
rather than functions). 

If we add some calls to "display" (an output function) at just the
right spot, we can trace the behavior of our "dfs" program as it
searches the Flintstone Family Tree. The program will print the name at
every node in the order that it searches those nodes, until it finds
the desired name in the tree (assuming it's found).  Then instead of
printing that node, "dfs" will return the name it finds there. Here's
the top level function with some extra stuff for printing the trace: 


(define (dfs item tree)
   (cond ((done? tree) #f)
         ((found-item? item (get-root tree)) item)
         (else (display (get-root tree))
               (display " ")
               (or (dfs item (get-left-subtree tree))
                   (dfs item (get-right-subtree tree))))))


And if we fire up Dr. Scheme one more time and try this out, we'll see 
the following: 

> (dfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                      (bamm-bamm (betty () ()) (barney () ()))))
chip pebbles wilma fred bamm-bamm betty
barney
>


Will wonders never cease?  That's exactly the order...first Chip, then 
Pebbles, then Wilma, then Fred, back up to Bamm-Bamm, down to Betty, and 
then finally the finding of Barney...that we'd expect in depth-first 
search. That is, the search proceeds downward in the tree as far as it 
can before it starts to work its way back up, but as soon as it can search 
down some previously unsearched path way, it starts to search downward again. 
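
For comparison, here's a sketch of the preorder traversal from section V
on the same nested-list representation. The name "preorder-list" is made
up for these notes, and instead of printing, it collects the node names
in the order it visits them. To keep the block standing on its own, it
uses "car", "cadr", and "caddr" directly rather than the accessors
defined above:

```scheme
;; preorder-list is a hypothetical helper, not part of the dfs program.
;; A tree is either () or a list of the form (root left-subtree right-subtree).
(define (preorder-list tree)
  (if (null? tree)
      '()
      (cons (car tree)                               ; visit the root
            (append (preorder-list (cadr tree))      ; then the left subtree
                    (preorder-list (caddr tree)))))) ; then the right subtree

(preorder-list '(chip (pebbles (wilma () ()) (fred () ()))
                      (bamm-bamm (betty () ()) (barney () ()))))
;; returns (chip pebbles wilma fred bamm-bamm betty barney)
```

The visiting order is the same one the trace shows. The difference is
that the traversal always walks the whole tree, while "dfs" quits as
soon as it finds what it's after.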


VII.  A quick detour through logical predicates

If you noticed the use of the "or" operation above, you may be
wondering what it is.  It's one of several logical predicates
(sometimes called Boolean predicates, after the famed mathematician
George Boole) available to you in some form in most programming
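languages.  The "or" predicate takes any number of arguments and
returns the value of the first one that evaluates to something other
than #f; if they all evaluate to #f, it returns #f.  (In Scheme,
anything that isn't #f counts as true.)  Here are some examples: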

> (or #t)
#t

> (or #t #t)
#t

> (or #f #t #f)
#t


Another related predicate is the "and" predicate.  It too takes any 
number of arguments.  It returns #f as soon as one of them evaluates 
to #f; if none of them do, it returns the value of the last argument:

> (and #t)
#t

> (and #f #t)
#f

> (and #t #t #t)
#t


Finally, there's the "not" predicate, which takes one argument and returns 
#t if the argument evaluates to #f, and returns #f if the argument 
evaluates to anything else:

> (not #t)
#f

> (not #f)
#t
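
The examples above only use #t and #f, so here's a small sketch of two
details that matter for the "dfs" function: "or" hands back the actual
value it finds (not necessarily #t), and it doesn't evaluate any
arguments after that one. The symbols used here are just placeholders:

```scheme
(or #f 'barney 'never-reached)   ; returns barney, the first value that isn't #f
(and 'wilma 'fred)               ; returns fred, the value of the last argument
(or #f #f)                       ; returns #f

;; The second argument is never evaluated, so this doesn't blow up
;; even though (car '()) all by itself would be an error.
(or 'found (car '()))            ; returns found
```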

The use of "or" in the "dfs" function is an easy way to fulfill the
requirement that the right subtree isn't searched if what we're
looking for is found in the left subtree. It's not an especially
obvious use of "or", which is typically used as a logical predicate,
not as a program control mechanism, but it's not an uncommon use
in Scheme programming.  This use of "or" takes advantage of a detail 
that's easy to overlook (i.e., that "or" evaluates its arguments left to 
right, and stops as soon as it finds an argument which evaluates to 
something other than #f).  Usually we would discourage non-obvious uses of 
operations that rely on easily overlooked details, but
because this use of "or" is sort of a standard Scheme idiom,
we're ok with it here.  Also, since the left-to-right evaluation of
arguments is part of the standard R5RS Scheme specification for "or", 
we can count on things happening the way we expect them to here, but 
beware of non-standard Scheme systems that might implement an "or" which
evaluates arguments right-to-left, with results that might confuse us
a little bit (although in a purely functional world, the results
wouldn't be catastrophic). Furthermore, this approach assumes that any 
given node has a fixed number of children; if you want to cope with a
variable number of children at any node, you might want to code up a
slightly different version of this anyway. For now, we'll leave the
"or" there, but feel free to do something different, and hopefully
better. You can control depth-first search with "or" as we've done
here, but you can also do the same thing using just the "else" in
"cond".  If you have access to a copy of the textbook and are
inclined, you'll find an example of how to do this without "or" in
that book.
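
Here's one way you might write the search without "or". This is only a
sketch, and it isn't necessarily the way the textbook does it: the name
"dfs-no-or" and the use of "let" and "if" inside the "else" are my own
choices for illustration. It uses "car", "cadr", and "caddr" directly
so that the block stands on its own:

```scheme
;; dfs-no-or is a hypothetical variant of dfs that doesn't use "or".
;; A tree is either () or a list of the form (root left-subtree right-subtree).
(define (dfs-no-or item tree)
  (cond ((null? tree) #f)
        ((equal? item (car tree)) item)
        (else
         ;; search the left subtree first and hang on to the answer
         (let ((left-result (dfs-no-or item (cadr tree))))
           (if left-result
               left-result                          ; found on the left: stop here
               (dfs-no-or item (caddr tree)))))))   ; otherwise try the right

(dfs-no-or 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                          (bamm-bamm (betty () ()) (barney () ()))))
;; returns barney
```

The explicit "if" does the same job "or" was doing: the right subtree
only gets searched when the left search comes back with #f.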


VIII.  Getting past yes or no 

Sadly, the search function described in the previous chunk of notes
above doesn't tell me much---just whether or not an item I'm looking
for is in the tree. I'd get more information if I could get the search
function to tell me how to get from the root of the tree to the item
I'm looking for, assuming the item I'm looking for is in the tree. That
path from the root to the item would at least be an approximation of
the relationship between those two nodes in the tree; in the case of
the Flintstones, for example, the path 

  "Chip -has-dad-> Bamm-Bamm -has-dad-> Barney" 

tells me something about the relationship between Chip and Barney. How 
can I get my depth-first search procedure to return this path, instead 
of just the item itself, when it finds the item in the tree? It's pretty 
easy. All you do is introduce an additional argument as a sort of 
variable to store the path from the root to wherever the procedure is 
looking in the tree.  You get that additional argument by adding a helping 
function, just like in many of those examples of tail recursion. (But note 
that the addition of the helper function does not make this depth-first 
search tail recursive.) Then it's just a question of building up the 
result as the procedure searches deeper in the tree: 

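(define (dfs item tree)
   (dfs-helper item tree '()))   ; start with an empty path

; "result" holds the nodes passed on the way down, most recent first,
; so the answer lists the path from the item back up to the root
(define (dfs-helper item tree result)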
   (cond ((done? tree) #f)
         ((found-item? item (get-root tree))
          (cons item result))
         (else (or (dfs-helper item
                               (get-left-subtree tree)
                               (cons (get-root tree) result))
                   (dfs-helper item
                               (get-right-subtree tree)
                               (cons (get-root tree) result))))))

(define (done? tree)
   (null? tree))

(define (found-item? item tree)
   (equal? item tree))

(define (get-root tree)
    (car tree))

(define (get-left-subtree tree)
   (cadr tree))

(define (get-right-subtree tree)
   (caddr tree))


And note that because I've taken the time to do a great deal of data 
abstraction, separating the functions that access the Scheme data 
structure from the higher-level algorithm, all I had to do was 
make a few changes to the top-level procedure; the lower-level ones, 
the Scheme data structure accessors, didn't change at all. 
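
One thing to watch: because each node gets consed onto the front of the
result on the way down, the path comes back with the item first and the
root last. Here's a sketch of how you might get it root first. The names
"path-helper" and "dfs-path" are made up for these notes; "path-helper"
follows the same logic as the helper above, written with "car", "cadr",
and "caddr" so the block stands on its own:

```scheme
(define flintstones
  '(chip (pebbles (wilma () ()) (fred () ()))
         (bamm-bamm (betty () ()) (barney () ()))))

;; path-helper is a hypothetical stand-in for the helper function above.
(define (path-helper item tree result)
  (cond ((null? tree) #f)
        ((equal? item (car tree)) (cons item result))
        (else (or (path-helper item (cadr tree) (cons (car tree) result))
                  (path-helper item (caddr tree) (cons (car tree) result))))))

;; dfs-path is a hypothetical wrapper: reverse the path, but only if
;; there is one, since (reverse #f) would be an error.
(define (dfs-path item tree)
  (let ((found (path-helper item tree '())))
    (if found (reverse found) #f)))

(path-helper 'barney flintstones '())   ; returns (barney bamm-bamm chip)
(dfs-path 'barney flintstones)          ; returns (chip bamm-bamm barney)
(dfs-path 'dino flintstones)            ; returns #f
```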

Could I convert the Flintstones Family Tree into an association list,
and then modify the dfs program accordingly? Well of course I could.
But I'll leave that as an exercise for you, the home viewer.


IX.  Breadth-first search

Instead of searching as deep down one path from the root as one can 
search into the tree before searching down a different path, you 
could also write your search program such that it looks at the root 
first, then all the children of the root next (i.e., the next level 
of nodes), then all the children of those children (i.e., the third 
level of nodes), and so on. When a search looks at all the nodes at 
one level before looking at the next level, that is, when it searches 
the breadth of the tree before it looks deeper, that's called 
"breadth-first search".

The key to getting breadth-first search right is to employ a data 
structure we haven't heard about before: the queue.  A queue allows 
you to add things to one end and take things off the other end. So a 
queue maintains the order in which things arrived (first-in-first-out 
or FIFO in geekspeak) instead of reversing the order like a stack 
does. In our case, we'll use the queue to keep track of all the 
subtrees we need to search, in the order that we'll have to search 
them.
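
Here's a bare-bones sketch of a queue built on an ordinary Scheme list,
with a stack alongside for contrast. All of the names here
("make-queue", "enqueue", and so on) are made up for these notes; none
of them are built in. Note that adding to the back with "append" has
to walk the whole list, which is fine for small examples like ours:

```scheme
;; A queue as a plain list: the front of the list is the front of the queue.
(define (make-queue) '())
(define (queue-empty? q) (null? q))
(define (enqueue q x) (append q (list x)))   ; add at the back
(define (queue-front q) (car q))             ; look at the front
(define (dequeue q) (cdr q))                 ; everything but the front

;; a, b, and c arrive in that order
(define q (enqueue (enqueue (enqueue (make-queue) 'a) 'b) 'c))

(queue-front q)             ; returns a, the first one in
(queue-front (dequeue q))   ; returns b

;; A stack, for contrast: push with cons, so the last one in is on top.
(define s (cons 'c (cons 'b (cons 'a '()))))
(car s)                     ; returns c, the last one in
```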

Here's a general algorithm for breadth-first search of a binary tree. 
Note that step 0 is an initialization step and only happens once, so 
it's not part of the named breadth-first-search procedure 
(bf-search). In other words, when you see "call bf-search" below, 
step 0 isn't being executed every time the bf-search procedure is 
called. Also, the algorithm below wants you to put whole trees and 
subtrees on the queue, and that's fine for our nested list 
implementation of a tree. But you could also just put pointers to 
trees and subtrees on the queue, and that should work just as well, 
so long as you remember how to access the information at the ends of 
the pointers. So on your homework problems dealing with breadth-first 
search where the TAs want you to implement your tree as an 
association list, you'll probably want to store just pointers (a.k.a. 
keys or indices) in your queue.


0.  initialize the queue by adding the tree that you're
    going to search to the back of the queue and call
    bf-search

bf-search

1.  if the queue is empty, then return failure (you've
    searched the entire tree and you didn't find what
    you were looking for)

2.  retrieve the tree that's at the front of the queue and
    look at its root

3.  if it's what you're looking for, then return success

4.  if the root has no descendants, then remove that tree
    from the queue and call bf-search

5.  remove the tree from the front of the queue and retain
    the left and right subtrees from that tree

6.  add those left and right subtrees to the back of the
    queue and call bf-search


So for example, we start by putting the whole Flintstone Family Tree 
on the queue. We then look at the root of that tree, which is Chip. 
If it's what we're looking for, we're done. If it's not what we're 
looking for, we'll remove that whole tree from the front of the 
queue, but we'll put the left subtree of that tree on the back of the 
queue, and then we'll add the right subtree. Now we look at what's 
first on the queue. It's the subtree whose root is Chip's mother, 
Pebbles. If Pebbles is what we're looking for, then we're done. But 
if Pebbles isn't what we're looking for, we'll remove that subtree 
from the front of the queue, but we'll put that subtree's subtrees on 
the back of the queue. So now what's at the front of the queue? It's 
the subtree whose root is Bamm-Bamm. So far, we've looked at Chip, 
then Pebbles, then Bamm-Bamm. With depth-first search we would have 
looked at Wilma right after Pebbles, because we'd keep going down as 
far as we can go. But with breadth-first search, we had to look at 
Bamm-Bamm and any other nodes on the same level as Pebbles before 
we'd go looking at the next level down.

Here's the breadth-first search program, which I adapted quickly from 
the depth-first search program we wrote earlier.  We didn't go over
this in class, but I promised I'd provide the program in these notes. 
I used a helper function to initialize a queue to be a list whose 
elements are trees or subtrees to be searched. The first element is the 
initial tree in its entirety. If I included the queue data structure 
explicitly in the top-level function, it might seem like a violation of 
all the data abstraction principles we've discussed, but keep in mind 
that the queue data structure is part of the breadth-first search 
algorithm, so it could be argued that it's appropriate to have that 
data structure up there. However, it would be nicer to abstract away 
the fact that my queue is implemented as a simple list, so I've done 
that.  Here's my code:


(define (bfs item tree)
    (bfs-helper item (list tree)))

(define (bfs-helper item queue)
    (cond ((done? queue) #f)
          ((null? (get-next-tree queue)) (bfs-helper item (rest-of queue)))
          ((found-item? item (get-root (get-next-tree queue))) item)
          (else (display (get-root (get-next-tree queue)))
                (display " ")
                (bfs-helper item (add-to-queue
                                   (rest-of queue)
                                   (get-left-subtree
                                     (get-next-tree queue))
                                   (get-right-subtree
                                     (get-next-tree queue)))))))

(define (add-to-queue queue left right)
    (append queue (list left) (list right)))

(define (rest-of queue)
    (cdr queue))

(define (done? queue)
    (null? queue))

(define (found-item? item root)
    (equal? item root))

(define (get-root tree)
    (car tree))

(define (get-left-subtree tree)
    (cadr tree))

(define (get-right-subtree tree)
    (caddr tree))

(define (get-next-tree queue)
    (car queue))


And again, when I try this out, it works fine:


>  (bfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                       (bamm-bamm (betty () ()) (barney () ()))))
chip pebbles bamm-bamm wilma fred betty
barney
>


X.  Depth-first vs. breadth-first searches

Both depth-first search and breadth-first search are forms of 
exhaustive search. In other words, they'll systematically explore the 
whole tree, no matter how big it is, until they find what they're 
looking for. And if they don't find what they're looking for, they 
will in fact search the entire tree.  So in that sense, they're not 
very smart, but if what they're looking for is in the tree, and if 
the tree is finite in size (and yes, it is possible to have an 
infinitely large tree, but that's for other courses), either search 
is guaranteed to find it. In 
fact, depth-first search and breadth-first search are so closely 
related that you could turn the breadth-first search program above 
into a depth-first search by turning that queue into a stack. That 
is, if instead of adding subtrees to the back of the queue you add 
those same subtrees to the front of the queue, you end up with 
depth-first search instead of breadth-first search. All you'd have to 
do in the program above is rewrite the add-to-queue function like 
this:


(define (add-to-queue queue left right)
    (cons left (cons right queue)))


So now we're consing the subtrees onto the front of the data 
structure instead of appending them to the back end. The subtree that 
is put on there last will now be the next tree to be searched. When 
you run this thing again, here's what happens:


>  (bfs 'barney '(chip (pebbles (wilma () ()) (fred () ()))
                       (bamm-bamm (betty () ()) (barney () ()))))
chip pebbles wilma fred bamm-bamm betty
barney
>


The order in which nodes are visited is now what you'd expect from 
depth-first search, not from breadth-first search.

But there are some behavioral differences between the two. For 
example, if you were interested in the shortest path from the root to 
the node you're looking for, you'd probably prefer to use 
breadth-first search, because breadth-first search is guaranteed to 
find the shortest path from the root to the desired node if you 
measure path distance in terms of number of edges or links from the 
root. This could be important, say, when there are multiple nodes 
scattered throughout the tree that could work as the goal node. 
However, if the tree is very broad and bushy (that is, if it's an 
N-ary tree where N is kind of big), breadth-first search could end up 
searching lots of nodes before it finds the goal node. Also note 
that the queue could end up getting very very large, so there's some 
additional storage overhead that comes with breadth-first search that 
doesn't necessarily come with depth-first search. (Remember that we can 
implement, and have implemented, depth-first search without that additional
data structure...the activation stack supplies the only stack we 
really need for depth-first search.)

Depth-first search, on the other hand, offers no guarantee of finding 
the shortest path to the goal node, but if the goal node is way deep 
in the tree and could be found, say, over on the left-hand side of 
the tree, just for instance, a depth-first search might find the goal 
node sooner than a breadth-first search. Again, this might be 
important where there are multiple goal nodes scattered throughout 
the tree, but you're more concerned about finding some goal node in a 
hurry as opposed to finding the shortest path to a goal node.
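
Here's a small sketch of that difference. Everything in it is made up
for illustration: the tree "sample", the goal names g1 and g2, and the
procedures "goal?", "search", "add-to-front", and "add-to-back". The
tree has one goal node deep on the left (g1, three links from the root)
and another one shallow on the right (g2, one link from the root):

```scheme
(define sample
  '(a (b (c (g1 () ()) ()) ())
      (g2 () ())))

(define (goal? name)
  (if (memq name '(g1 g2)) #t #f))

;; frontier is the list of subtrees still waiting to be searched.
;; add-children decides where new subtrees go: on the front for
;; depth-first search, on the back for breadth-first search.
(define (search tree add-children)
  (let loop ((frontier (list tree)))
    (cond ((null? frontier) #f)                       ; nothing left: failure
          ((null? (car frontier)) (loop (cdr frontier)))   ; skip empty subtrees
          ((goal? (car (car frontier))) (car (car frontier)))
          (else (loop (add-children (cdr frontier)
                                    (cadr (car frontier))
                                    (caddr (car frontier))))))))

(define (add-to-front frontier left right)   ; stack behavior
  (cons left (cons right frontier)))

(define (add-to-back frontier left right)    ; queue behavior
  (append frontier (list left right)))

(search sample add-to-front)   ; returns g1, the deep goal on the left
(search sample add-to-back)    ; returns g2, the goal closest to the root
```

Depth-first search dives down the left side and finds g1 after looking
at three other nodes; breadth-first search finds g2 after looking at
only two, and g2 is the goal with the shortest path from the root.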

In short, it depends on the organization of the data structure itself 
and what kinds of things you can assume about where the goal node or 
nodes might be found.  This whole "find the goal node in a tree" may 
seem pretty obscure right now, but as you'll see in the next couple 
of weeks, finding a goal node among a bunch of possible goal nodes in 
some data structure is a pretty common metaphor for all kinds of 
interesting tasks that we ask computers to do for us.  So we'll see 
more of this stuff later.



Copyright (c) 2003 by Kurt Eiselt.  All rights reserved, with 
the exception of stuff that belongs to somebody else.

Last revised: October 7, 2003