Some software and data for you to use. I now rarely update this page. Most new code goes to my GitHub page. See also my publications page for data and code associated with specific papers.

Lecture Transcript Corpus

This dataset contains transcriptions of classroom lectures from MIT's intro to artificial intelligence and physics courses. For the physics course, there is also a transcript from the Galaxy automatic speech recognition system. Both datasets contain manually-annotated topic segmentation markers.

OpenFST Phonetic Transliteration

This is a demonstration of how to use OpenFST to manipulate weighted finite-state transducers in a C++ program. It reads files formatted like the CMU pronunciation dictionary and learns a noisy-channel transliteration model. The results aren't great -- individual phoneme-to-letter emissions are probably not the right model for phonetic transcription -- but it should be helpful if you want to learn how to use OpenFST. Requires the Boost C++ libraries.

FSTPronouncer.tgz [March 22, 2009]
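
For readers who haven't seen the input format, here is a minimal Python sketch of reading CMU-dictionary-style entries into (word, phoneme list) pairs, the kind of training pair a phoneme-to-letter model would start from. It is not taken from FSTPronouncer (which is C++). The function name, the sample entries, and the assumptions that comment lines start with `;;;` and that alternate pronunciations are marked like `READ(1)` are illustrative only.

```python
import re

def parse_cmudict_line(line):
    """Parse one CMU-dictionary-style line into (word, phonemes).

    Returns None for blank lines and comment lines (assumed to start with ';;;').
    """
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phonemes = line.split()
    # Alternate pronunciations are assumed to be marked like "READ(1)"; drop the marker.
    word = re.sub(r"\(\d+\)$", "", word)
    # Strip the trailing stress digit from vowels, e.g. "IY1" -> "IY".
    phonemes = [p.rstrip("012") for p in phonemes]
    return word.lower(), phonemes

# Hypothetical sample in the assumed format.
SAMPLE = """\
;;; sample entries in the CMU dictionary format
CAT  K AE1 T
READ  R EH1 D
READ(1)  R IY1 D
"""

entries = [e for e in map(parse_cmudict_line, SAMPLE.splitlines()) if e]
for word, phonemes in entries:
    print(word, phonemes)
```

Each pair gives a letter sequence and a phoneme sequence of possibly different lengths, which is why a per-phoneme emission model needs some notion of alignment between the two.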

Bayesian Unsupervised Topic Segmentation

The code for linear topic segmentation of text can be found on its own GitHub page.

RISO LBFGS Wrapper for Weka

Weka is a machine-learning package. It has its own Limited-Memory BFGS optimization code, but I found that it was very slow when applied to my own custom conditional model. Then I found the RISO code for LBFGS, which looked great, but was less friendly to integrate than the Weka optimization package. My wrapper tries to provide the same sort of interface as the Weka optimization package.


Simple Painless Annotation of Movies

Note: I haven't worked on this package in a long time, and I no longer have time to support it. You'd likely be better served looking for a comparable tool that is actively maintained. However, the jar file is provided here for the historical record. (August 31, 2008)

Get it here.

You will also need:

SPAM allows you to annotate QuickTime-playable movies or audio files. The primary design goal of SPAM is to make annotation quick, relying on keystrokes whenever possible. It's not supported and there's little help or documentation - sorry.

SPAM is (c) MIT 2005. It's free for academic purposes.

Other comparable tools:

Anvil's got lots of features and is probably more stable than SPAM. I found the UI to be a little clunky.
IBM Multimodal Annotation Tool
I've never tried this.


This is a genetic algorithms toolkit for the game of Robocode. I'm not 100% sure about the state of this code, but if you're interested in Robocode you can try it out. Also, I see that Robocode is now open-source -- I have no idea what that means for compatibility with this code, which was written in early 2003.

There are three parts:
This is the actual TableRex interpreter. It extends robocode.AdvancedRobot
This is the external thing that controls the genetic algorithm. This has the main method that you run.
This is the genetic algorithm implementation. Keep in mind that it makes little sense to try to optimize this code for speed, since nearly all the time is spent evaluating the robots in Robocode.
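
To illustrate why the evaluation step dominates, here is a minimal generic genetic algorithm loop in Python. It is a sketch, not the TableRex code: the bit-string genome, the `evolve` function, its parameters, and the use of `sum` as a stand-in fitness function are all assumptions made for the example. In the Robocode setting, each `fitness` call would mean running battles, while the selection, crossover, and mutation steps stay cheap by comparison.

```python
import random

def evolve(fitness, genome_length, population_size=20, generations=30,
           mutation_rate=0.05, rng=None):
    """Evolve bit-string genomes; return (best genome seen, number of fitness calls)."""
    rng = rng or random.Random(0)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    evaluations = 0
    best_score, best_genome = None, None

    for _ in range(generations):
        # Evaluation: the expensive step (stand-in for running robot battles).
        scored = [(fitness(genome), genome) for genome in population]
        evaluations += len(population)
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if best_score is None or scored[0][0] > best_score:
            best_score, best_genome = scored[0][0], scored[0][1][:]

        # Selection: keep the top half as parents.
        parents = [genome for _, genome in scored[:population_size // 2]]

        # Reproduction: carry over the best genome, then fill with mutated crossovers.
        children = [parents[0][:]]
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children

    return best_genome, evaluations

# Toy fitness: count the 1 bits. A real fitness would score a robot in battle.
best, evaluations = evolve(fitness=sum, genome_length=16)
print(best, evaluations)
```

With the defaults, the loop makes 20 x 30 = 600 fitness calls. If each call were a Robocode battle, those calls would account for essentially all of the running time, whatever the speed of the bookkeeping around them.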

Here's a video of a robot that learned a very specialized dodging pattern to beat squigbot. That's what happens when you train against only a single adversary.

If you use this code, please cite the following:

J. Eisenstein. Evolving Robocode Tank Fighters. MIT AI Lab Memo AIM-2003-023.

One last thing -- some other guys took this idea much further than I did. I think they used some of this code, although I'm not totally sure what went into their final version. Check out their paper.

This code is (c) MIT 2005.