CS 4600
Introduction to Intelligent Systems
Project #4
Bayes Nets and Markov Chains

Numbers

Due: November 18, 2003 23:59:59 EST
Please email your code and explanations to Steve Marlowe with the subject line "cs4600 project #4". Also, be certain to include your name in your message.

The assignment is worth 6% of your final grade. There is an opportunity for 3% bonus points.

Why?

The goal of this project is to give you experience using Bayesian networks and Markov chains. First, you are asked to design a simple expert system in the domain of your choice and implement it using existing software. This part of the project does not involve any programming. Second, you are asked to build a very simple one-step Markov Chain Language Learner. This will require programming, but mainly you will be asked to calculate some simple probabilities. Naturally, this will be done in LISP, and that's a good thing because you can rip some of your code from Project 3.

Read everything below carefully!

Bayes Nets

Choose an area you are very familiar with. Some examples include predicting the weather, troubleshooting a car, diagnosing a cold, guessing who will win a football game, or predicting your grade in 4600. These examples are only meant to show you that the area you choose does not have to be a technical one. After you choose an area, design a simple Bayesian network with 8 to 15 nodes. Make sure that your network has an interesting topology (i.e., do NOT design a flat or linear network).

Implement your network using GeNIe, a development environment for building decision-theoretic models. Get familiar with the basic functionality of GeNIe (namely building a network with chance nodes and relations, setting evidence, and querying the network) and implement your network. GeNIe runs on Windows. To use it, you will need to download it from http://www2.sis.pitt.edu/~genie. The software has comprehensive online documentation, but if you have problems feel free to email the TAs.

Perform some queries with your expert system. For each query, set evidence on some nodes and read off the values from other nodes.

Familiarize yourself with other features of GeNIe that we did not cover in class, such as influence diagrams. Show how they can be used for planning, that is, action selection. Provide a brief, concrete example of such an application.

What to turn in

Submit an attachment containing a zip file with four files in it:
  1. The Genie file (.dsl) for your Bayes net.
  2. A concise description of your Bayes net (i.e., the overall purpose of your expert system and the meaning of the nodes).
  3. A description of your queries (at least two): The evidence used and the node(s) queried, as well as the posterior probabilities at these queried nodes. Also, explain the results you obtain. Your work should demonstrate that your network indeed encodes your expertise effectively.
  4. A short writeup of using other features of GeNIe and using action selection (see above).

Additional Notes

GeNIe has been known to crash, so save often. We have played with it on Win2000 and had no problems, but you never know.

Markov Chains

Here, you will write a LISP program that determines the language in which a test sentence has been written, using an exceedingly simple one-place lookahead Markov-chain algorithm. You can give credit (or blame) for this assignment to David Albert, who gave a version of it to his Artificial Intelligence class at the Harvard University Extension School.

Your program will begin by reading a series of sentences in known languages and constructing character-based probability matrices for each language. After "learning" the languages, your program will read in test sentences and report which language each was written in.

You are to write three functions:

  1. reset
  2. learn
  3. predict

The signature of reset is simple:

  (defun reset () ...)

It is just an initialization routine that allows you to (re)set any global state you have.

The signature of learn is:

  (defun learn (language sentence) ...)

...where language is a string naming a language and sentence is a string from that language. The function looks at the sentence and uses it to update its view of the statistics of that language. Example uses would be:

  (learn "SPANISH" "Hola como estas")
  (learn "ENGLISH" "Hello how are you")

The signature of predict is:

  (defun predict (sentence) ...)

...where sentence is a string. You are asked to return the chances of each language being the source of the sentence. Example uses would be:

  (predict "Hola como estas") 
       -> (("SPANISH" .75) ("ENGLISH" .25))
  (predict "Hello how are you")
       -> (("ENGLISH" .75) ("SPANISH" .25))

Note that the languages are returned in order of likelihood, and as a list (language-name likelihood).

Here's what's provided to you:

You will notice that each sentence begins with a space. Also note that we may test you on any language we wish, including something we might just make up on the spot, so don't just hardcode five arrays for the languages listed above (see the discussion of hashtables below).

How to do this

It turns out that this is easy, just:

  1. Create a frequency array for every language.

    Note: Treat upper and lower case letters as identical, and treat any non-letter as a space. Do not worry about handling multiple white spaces in any special way. Each of your frequency arrays will need to be 27x27. Use the "0" row and column for end-of-word markers (spaces or unknown characters) and 1 through 26 for A through Z. All accented characters will have been magically fixed for you in these examples.

  2. When asked to make a prediction, treat your frequency matrices as if they were probability matrices. You will need to keep the frequency matrix around because you may be given more and more sentences even after you've been asked to make predictions.

    Note: A frequency matrix tells you how many occurrences you found of letter A followed by letter B. What you really want to know is the probability that letter A will be followed by letter B. Obviously, this can be computed across each row as simply the frequency in any particular cell divided by the sum of all the entries in that row. Notice that you want to be sure to compute the probabilities by row, not by column. You might want to ask yourself what would happen if you computed probabilities the other way around.

    Special Note: DO NOT COMPUTE ANY PROBABILITIES OF ZERO! Any value of zero in the frequency matrix should be treated as a small positive number (you may use any small value you want for this assignment; 0.01 appears to be a good one) when computing the probability matrix. You might want to ask yourself why this sort of thing matters.

  3. When asked to make a prediction, compute the probability of the sentence based on the probability matrices you have computed. You'll need to keep track of a list of probability values, one for each language. You'll know how to do this because we've gone over it in class.

    Note: In theory, the probabilities for each language generating the sentence should be initialized to 1.0, as they can only go down as you read more characters. When you finish, you should return the list of languages, ordered by how likely it is that the sentence was generated by that language. As you'll notice from above, you are also returning an estimate of the probability that the sentence is in fact in that language as opposed to one of the others. If you think about it, all of the probabilities in your list will be vanishingly small, so those numbers are not in and of themselves interesting: what is interesting is how they compare to each other. For this assignment, you will normalize those numbers (i.e., the probability of a particular language is the ratio of the value for that language to the sum of all the values in your list, and the highest such ratio corresponds to the most likely language).
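Putting the three steps together, a minimal Common Lisp sketch might look like the following. The global *tables*, the helpers char->index, transition-probability, and sentence-probability, and the use of 0.01 as the smoothing constant are all illustrative choices, not requirements (the hashtable helpers in utils.lisp may simplify some of this):

```lisp
(defvar *tables* (make-hash-table :test #'equal)) ; language name -> 27x27 matrix

(defun reset ()
  (clrhash *tables*))

(defun char->index (ch)
  "Map A-Z (either case) to 1-26; anything else counts as a space (0)."
  (let ((c (char-upcase ch)))
    (if (char<= #\A c #\Z)
        (1+ (- (char-code c) (char-code #\A)))
        0)))

(defun learn (language sentence)
  "Bump the frequency count for each adjacent character pair in SENTENCE."
  (let ((table (or (gethash language *tables*)
                   (setf (gethash language *tables*)
                         (make-array '(27 27) :initial-element 0)))))
    (loop for i from 0 below (1- (length sentence))
          do (incf (aref table
                         (char->index (char sentence i))
                         (char->index (char sentence (1+ i))))))))

(defun transition-probability (table i j)
  "Cell (i j) divided by its row sum, treating zeros as 0.01."
  (let ((row-sum (loop for k from 0 below 27
                       sum (max 0.01 (aref table i k)))))
    (/ (max 0.01 (aref table i j)) row-sum)))

(defun sentence-probability (table sentence)
  "Product of the transition probabilities along SENTENCE."
  (let ((p 1.0))
    (loop for i from 0 below (1- (length sentence))
          do (setf p (* p (transition-probability
                           table
                           (char->index (char sentence i))
                           (char->index (char sentence (1+ i)))))))
    p))

(defun predict (sentence)
  "Score every learned language, normalize, and sort by likelihood."
  (let ((scores '()))
    (maphash (lambda (language table)
               (push (list language (sentence-probability table sentence))
                     scores))
             *tables*)
    (let ((total (reduce #'+ scores :key #'second)))
      (sort (mapcar (lambda (entry)
                      (list (first entry) (/ (second entry) total)))
                    scores)
            #'> :key #'second))))
```

Note that the per-language scores underflow toward zero very quickly for long sentences; only the normalized ratios are meaningful, which is exactly why the final division by the total is there.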

Basically, everything you need is available in utils.lisp and in your own experiences. Still, in order to make storing and retrieving frequency tables by name easier, I would suggest something that you may not be familiar with: LISP hashtable functions. A full discussion is in chapter 16 of Steele's book (available online from the resources page), but here are the highlights:
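In brief, the standard Common Lisp hashtable functions you are most likely to need look like this (the variable name *freq* is just for illustration):

```lisp
;; :test #'equal lets strings work as keys (the default #'eql would not).
(defvar *freq* (make-hash-table :test #'equal))

;; (setf (gethash key table) value) stores a value under a key:
(setf (gethash "SPANISH" *freq*) (make-array '(27 27) :initial-element 0))

;; gethash retrieves it; a missing key returns NIL:
(gethash "SPANISH" *freq*)   ; the array stored above
(gethash "KLINGON" *freq*)   ; NIL -- no such language yet

;; maphash calls a function on every key/value pair:
(maphash (lambda (language table)
           (declare (ignore table))
           (format t "learned: ~a~%" language))
         *freq*)

;; clrhash empties the table -- convenient inside reset:
(clrhash *freq*)
```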

Again, utils.lisp contains some functions that make using hashtables easier.

What to turn in

Submit an attachment containing your code. As always, any additional functions or variables you use should have your-gtaccount appended to them.

Bonus Points!

For an extra 3%:

Write a two-place lookahead algorithm (for example, with three-dimensional arrays). If you think about it, this is conceptually no more difficult than the program you've been asked to do already (in fact, with the power of LISP it is pretty easy to do this for n-place lookahead, where n is passed in, but there's no particular reason for you to go this route).
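For what it's worth, the two-place frequency update is the same idea one dimension up: count character triples in a 27x27x27 array, then divide each cell by the sum over its last index (with the same zero-smoothing as before) to get P(next char | previous two chars). A sketch, with illustrative helper names:

```lisp
(defun char->index (ch)
  "Map A-Z (either case) to 1-26; anything else counts as a space (0)."
  (let ((c (char-upcase ch)))
    (if (char<= #\A c #\Z)
        (1+ (- (char-code c) (char-code #\A)))
        0)))

(defun learn2-update (table sentence)
  "TABLE is a 27x27x27 array; bump one cell per adjacent character triple."
  (loop for i from 0 below (- (length sentence) 2)
        do (incf (aref table
                       (char->index (char sentence i))
                       (char->index (char sentence (+ i 1)))
                       (char->index (char sentence (+ i 2))))))
  table)
```

Example use: (learn2-update (make-array '(27 27 27) :initial-element 0) " hola como estas").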

Now, to make this extension interesting, you will need to do a couple more things than just make the code work:

  1. Create much larger learning corpora for each of the five languages. The small files I've given you just don't contain enough data. You might want to ask yourself why this is.

    Feel free to use the internet (all the cool kids do) to grab more text written in each language. Is 50 kilobytes of information per language sufficient? Or 100kb? Start there, and find out if you gain any accuracy or confidence as compared to a one-letter lookahead algorithm.

    Note: Be sure to convert non-English characters to the basic 26-letter alphabet.

  2. Test each sentence against both a one-letter and two-letter lookahead algorithm using the same learning corpus, to see whether there is actually any gain in accuracy using a two-letter lookahead algorithm. Can you find sentences that are NOT properly identified using a one-letter lookahead algorithm, but ARE properly identified with a two-letter lookahead algorithm? If so, provide examples and some thoughtful explanation. If not, why not?

What to turn in

Include your two-place code with your one-place code (using the function names reset2, learn2, and predict2). Also include the thoughtful explanation described immediately above. This should be in two attachments: yourgtaccount.lisp and yourgtaccount.txt. The latter should have your name and gt account at the top.