Some software and data for you to use.

Lecture Transcript Corpus

This dataset contains transcriptions of classroom lectures from MIT's introductory artificial intelligence and physics courses. For the physics course, there is also a transcript from the Galaxy automatic speech recognition system. Both datasets contain manually annotated topic segmentation markers.

OpenFST Phonetic Transliteration

This is a demonstration of how to use OpenFST to manipulate weighted finite-state transducers in a C++ program. It reads files formatted like the CMU pronunciation dictionary and learns a noisy-channel transliteration model. The results aren't great -- individual phoneme-to-letter emissions are probably not the right model for phonetic transcription -- but it should be helpful if you want to learn how to use OpenFST. Requires the Boost C++ libraries.

FSTPronouncer.tgz [March 22, 2009]


Bayesian Unsupervised Topic Segmentation

The code for linear topic segmentation of text is available on its own page.

Dirichlet Process Mixture Models in Matlab

Dirichlet Process Mixture Models -- also called Infinite Mixture Models -- are a cool way to do clustering when you don't know how many clusters you want. This is a MATLAB implementation of Dirichlet Process Mixture Models with multivariate Gaussian observations. This is the "collapsed" version, meaning that the parameters of the Gaussians are marginalized out, so the sampler only tracks each cluster's sufficient statistics. Some of the code is based on Michael Mandel's earlier implementation -- all of it is GPL'd.
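To make the "collapsed" idea concrete, here is a minimal sketch in Python (not the MATLAB code offered here, and not taken from it). It makes strong simplifying assumptions: one-dimensional data, a known observation variance, and a Normal prior on each cluster mean. The names log_predictive, gibbs_sweep, and cluster, and the hyperparameters mu0, tau2, sigma2, and alpha, are all hypothetical choices for illustration. Each point is reassigned with probability proportional to the Chinese restaurant process prior times the predictive density of the point given the other members of the cluster, with the cluster mean integrated out.

```python
import math
import random


def log_predictive(x, n, s, mu0=0.0, tau2=10.0, sigma2=1.0):
    """Log density of x under a cluster holding n points that sum to s.

    The cluster mean is integrated out (Normal(mu0, tau2) prior, known
    observation variance sigma2), so only n and s are needed.
    """
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + s / sigma2)
    var = post_var + sigma2
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - post_mean) ** 2 / var)


def gibbs_sweep(data, z, alpha, rng):
    """One collapsed Gibbs sweep; z holds cluster labels and is updated in place."""
    for i, x in enumerate(data):
        z[i] = None  # take point i out of its cluster
        # Recompute counts and sums from scratch for clarity; a real
        # implementation would update these statistics incrementally.
        counts, sums = {}, {}
        for label, value in zip(z, data):
            if label is not None:
                counts[label] = counts.get(label, 0) + 1
                sums[label] = sums.get(label, 0.0) + value
        labels = list(counts)
        # Existing clusters: (cluster size) x (predictive density), in log space.
        log_w = [math.log(counts[k]) + log_predictive(x, counts[k], sums[k])
                 for k in labels]
        # New cluster: alpha x (prior predictive density).
        log_w.append(math.log(alpha) + log_predictive(x, 0, 0.0))
        labels.append(max(labels, default=-1) + 1)
        top = max(log_w)
        weights = [math.exp(w - top) for w in log_w]
        z[i] = rng.choices(labels, weights=weights)[0]
    return z


def cluster(data, alpha=1.0, sweeps=50, seed=0):
    """Run several sweeps starting from a single cluster; returns the labels."""
    rng = random.Random(seed)
    z = [0] * len(data)
    for _ in range(sweeps):
        gibbs_sweep(data, z, alpha, rng)
    return z
```

Calling cluster([-5.1, -4.9, -5.0, 5.0, 5.1, 4.9]) returns one label per point; the number of distinct labels is not fixed in advance, which is the point of the model. The multivariate case replaces log_predictive with the corresponding multivariate predictive density.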

To run this code, you'll need to have Tom Minka's Lightspeed Toolbox and FastFit (for simplicity, I removed the dependence on the Bayes Net Toolbox).

Two nice papers that describe the details of the DPMM are: Rasmussen's The Infinite Gaussian Mixture Model, which uses sampling; and Penny's Variational Bayes for d-dimensional Gaussian Mixture Models, which shows how to do variational inference in finite Gaussian mixture models. The extension to DPMMs is well-explained by the combination of this paper and Blei and Jordan's Variational Inference for Dirichlet Process Mixtures.

After I wrote this code, an updated version of Lightspeed renamed the function normpdfln to mvnormpdfln, so this code no longer works out of the box. I plan to take care of this "soon," but in the meantime it should be easy to patch this yourself by updating the calls to use the new name.


RISO LBFGS Wrapper for Weka

Weka is a machine-learning package. It has its own Limited-Memory BFGS optimization code, but I found that it was very slow when applied to my own custom conditional model. Then I found the RISO code for LBFGS, which looked great, but was less friendly to integrate than the Weka optimization package. My wrapper tries to provide the same sort of interface as the Weka optimization package.


CondensationTracker

A few people were interested and had questions about this code; I moved the discussion over here, so that questions could be asked and answered in a more public forum.

SPAM

Simple Painless Annotation of Movies

Note: I haven't worked on this package in a long time, and I no longer have time to support it. You'd likely be better served looking for a comparable tool that is actively maintained. However, the jar file is provided here for the curious. (August 31, 2008)

Get it here.

You will also need:

SPAM allows you to annotate QuickTime-playable movies or audio files. The primary design goal of SPAM is to make annotation fast by using keystrokes rather than the mouse whenever possible. It's not supported and there's little help or documentation - sorry.

SPAM is (c) MIT 2005. It's free for academic purposes.

Other comparable tools:

Anvil
Anvil's got lots of features and is probably more stable than SPAM. I found the UI to be a little clunky.
IBM Multimodal Annotation Tool
I've never tried this.

TableRex

This is a genetic algorithms toolkit for the game of Robocode. I'm not 100% sure about the state of this code, but if you're interested in Robocode you can try it out. Also, I see that Robocode is now open-source -- I have no idea what that means for compatibility with this code, which was written in early 2003.

There are three parts:

SmallBrain.java
This is the actual TableRex interpreter. It extends robocode.AdvancedRobot.
BrainWorld.java
This is the external driver that controls the genetic algorithm. It contains the main method that you run.
GeneticAlgorithm.java
This is the genetic algorithm implementation. Keep in mind that it makes little sense to try to optimize this code for speed, since nearly all the time is spent evaluating the robots in Robocode.
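
For readers new to genetic algorithms, the general shape of such a loop can be sketched as follows. This is a generic illustration in Python, not the code in GeneticAlgorithm.java: the bit-string genome, tournament selection, one-point crossover, elitism, and every name and parameter here (evolve, fitness, mutation_rate, and so on) are assumptions made for the example. In a Robocode setting the fitness function would presumably run battles, which is why the sketch caches scores rather than tuning anything else for speed.

```python
import random


def evolve(fitness, genome_len, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Evolve bit-string genomes (genome_len >= 2); returns the best one found.

    `fitness` maps a tuple of 0/1 bits to a number (higher is better).
    """
    rng = random.Random(seed)
    cache = {}  # fitness is the expensive step, so never score a genome twice

    def score(genome):
        if genome not in cache:
            cache[genome] = fitness(genome)
        return cache[genome]

    def tournament(pop):
        # Pick two genomes at random and keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if score(a) >= score(b) else b

    pop = [tuple(rng.randint(0, 1) for _ in range(genome_len))
           for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(pop, key=score)]  # elitism: keep the best unchanged
        while len(children) < pop_size:
            mom, dad = tournament(pop), tournament(pop)
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = mom[:cut] + dad[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        pop = children
    return max(pop, key=score)
```

For a quick check without Robocode, evolve(sum, 8) searches for the 8-bit genome with the most ones. Because the best genome is always carried over, the best score never decreases from one generation to the next.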

Here's a video of a robot that learned a very specialized dodging pattern to beat squigbot. That's what happens when you train against only a single adversary.

If you use this code, please cite the following:

J. Eisenstein. Evolving Robocode Tank Fighters. MIT AI Lab Memo AIM-2003-023.

One last thing -- some other guys took this idea much further than I did. I think they used some of this code, although I'm not totally sure what went into their final version. Check out their paper.

All code is (c) MIT 2005.