Some software and data for you to use.
Lecture Transcript Corpus
This dataset contains transcriptions of classroom lectures
from MIT's intro to artificial intelligence and physics courses. For the physics course,
there is also a transcript from the Galaxy automatic speech recognition system. Both datasets
contain manually-annotated topic segmentation markers.
OpenFST Phonetic Transliteration
This is a demonstration of how to use OpenFST to manipulate weighted
finite-state transducers in a C++ program. It reads files formatted like the CMU
pronunciation dictionary,
and learns a noisy-channel transliteration model. The results aren't great
-- individual phoneme-to-letter emissions are probably not the right model for
phonetic transcription -- but it should be helpful if you want to learn how
to use OpenFST. Requires the Boost C++
libraries.
FSTPronouncer.tgz [March 22, 2009]
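For readers who have not seen the CMU pronunciation dictionary format, here is a minimal Java sketch of reading one line of it. It is not taken from FSTPronouncer (which is C++); the class and method names are hypothetical, and stripping the stress digits from vowels is an illustrative choice, not necessarily what the package does. The format assumptions are the usual ones: a word, whitespace, then space-separated phonemes, with `;;;` starting a comment line and `(2)`-style suffixes marking alternate pronunciations.

```java
import java.util.ArrayList;
import java.util.List;

public class CmuDictSketch {

    /** One dictionary entry: a word and its phoneme sequence. */
    public static final class Entry {
        public final String word;
        public final List<String> phonemes;

        Entry(String word, List<String> phonemes) {
            this.word = word;
            this.phonemes = phonemes;
        }
    }

    /** Parses one line; returns null for blank lines, comments, and lines with no phonemes. */
    public static Entry parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith(";;;")) {
            return null;
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 2) {
            return null;
        }
        // Alternate pronunciations look like "READ(2)"; drop the variant marker.
        String word = fields[0].replaceAll("\\(\\d+\\)$", "");
        List<String> phonemes = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // Assumption for this sketch: ignore stress, so "AH0" becomes "AH".
            phonemes.add(fields[i].replaceAll("\\d$", ""));
        }
        return new Entry(word, phonemes);
    }

    public static void main(String[] args) {
        Entry e = parseLine("HELLO  HH AH0 L OW1");
        System.out.println(e.word + " -> " + e.phonemes); // HELLO -> [HH, AH, L, OW]
    }
}
```

Each entry gives a (phoneme sequence, letter sequence) pair, which is the kind of training pair a noisy-channel transliteration model would be estimated from.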
Bayesian Unsupervised Topic Segmentation
This code for doing linear topic segmentation on text can be found
at its own page.
RISO LBFGS Wrapper for Weka
Weka is a machine-learning
package. It has its own Limited-Memory BFGS optimization code,
but I found that it was very slow when applied to my own custom conditional
model. Then I found the RISO code for
LBFGS, which looked great, but was less friendly to integrate than the
Weka optimization package. My
wrapper tries to provide the same sort of interface as the Weka
optimization package.
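To illustrate what "the same sort of interface" means, here is a rough Java sketch of that style: you subclass an optimizer and supply the objective and its gradient. The method names `objectiveFunction`, `evaluateGradient`, and `findArgmin` follow Weka's `weka.core.Optimization` as I understand it, but the `Minimizer` class, its parameters, and the solver inside it are hypothetical. In particular, the solver is plain fixed-step gradient descent standing in for L-BFGS so the example stays self-contained; it is not the wrapper's code and does not call RISO.

```java
public class OptimizerInterfaceSketch {

    /** Hypothetical base class in the style of Weka's Optimization class. */
    public abstract static class Minimizer {
        protected abstract double objectiveFunction(double[] x);

        protected abstract double[] evaluateGradient(double[] x);

        /** Stand-in solver: fixed-step gradient descent, NOT L-BFGS. */
        public double[] findArgmin(double[] initX, int maxIterations, double stepSize) {
            double[] x = initX.clone();
            for (int iter = 0; iter < maxIterations; iter++) {
                double[] gradient = evaluateGradient(x);
                for (int i = 0; i < x.length; i++) {
                    x[i] -= stepSize * gradient[i];
                }
            }
            return x;
        }
    }

    /** Toy objective: f(x) = sum_i (x_i - 3)^2, minimized at x_i = 3. */
    public static final class Quadratic extends Minimizer {
        @Override
        protected double objectiveFunction(double[] x) {
            double sum = 0.0;
            for (double xi : x) {
                sum += (xi - 3.0) * (xi - 3.0);
            }
            return sum;
        }

        @Override
        protected double[] evaluateGradient(double[] x) {
            double[] gradient = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                gradient[i] = 2.0 * (x[i] - 3.0);
            }
            return gradient;
        }
    }

    /** Convenience entry point with illustrative settings (200 steps of size 0.1). */
    public static double[] minimizeQuadratic(double[] start) {
        return new Quadratic().findArgmin(start, 200, 0.1);
    }

    public static void main(String[] args) {
        Quadratic problem = new Quadratic();
        double[] x = problem.findArgmin(new double[] {0.0, 10.0}, 200, 0.1);
        System.out.println("f(x) at solution: " + problem.objectiveFunction(x));
    }
}
```

The point of the design is that a custom model only has to implement the two abstract methods; swapping the solver behind `findArgmin` does not change the model code.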
SPAM
Simple Painless Annotation of Movies
Note: I haven't worked on this package in a long time, and I no longer have time to support it. You'd likely be better served looking for a comparable tool that is actively maintained. However, the jar file is provided here for the historical record. (August 31, 2008)
Get it here.
You will also need:
- Quicktime for Java
- JRE 1.4 or later
SPAM lets you annotate QuickTime-playable movies or audio
files. The primary design goal of SPAM is fast annotation,
using keyboard shortcuts wherever possible. It's not supported and
there's little help or documentation - sorry.
SPAM is (c) MIT 2005. It's free for academic purposes.
Other comparable tools:
- Anvil: Anvil's got lots of features and is probably more stable than SPAM.
I found the UI to be a little clunky.
- IBM Multimodal Annotation Tool: I've never tried this.
TableRex
This is a genetic algorithms toolkit for the game Robocode. I'm not 100% sure about the
state of this code, but if you're interested in Robocode you can try it out.
Also, I see that Robocode is now open-source -- I have no idea what that
means for compatibility with this code, which was written in early 2003.
There are three parts:
- SmallBrain.java: This is the actual TableRex interpreter. It extends
robocode.AdvancedRobot.
- BrainWorld.java: This is the external driver that controls the genetic algorithm.
It has the main method that you run.
- GeneticAlgorithm.java: This is the genetic algorithm implementation.
Keep in mind that it makes little sense
to try to optimize this code for speed,
since all the time is spent evaluating
the robots in Robocode.
Here's a video of a robot that
learned a very specialized dodging pattern to beat squigbot. That's
what happens when you train against only a single adversary.
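For anyone unfamiliar with the general technique, here is a minimal Java sketch of a genetic algorithm loop. It is not the code in GeneticAlgorithm.java: the class name, the bit-string genome, the tournament selection, and all parameter values are assumptions made for illustration. The fitness function is passed in from outside, which mirrors the point above that evaluation (in TableRex, running robots in Robocode) is where the time goes; here a toy "count the true bits" fitness stands in for a battle.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class GeneticAlgorithmSketch {

    /** Evolves fixed-length bit strings and returns the best genome that was evaluated. */
    public static boolean[] evolve(ToDoubleFunction<boolean[]> fitness, int genomeLength,
                                   int populationSize, int generations, long seed) {
        Random rng = new Random(seed); // fixed seed makes runs repeatable
        boolean[][] population = new boolean[populationSize][genomeLength];
        for (boolean[] genome : population) {
            for (int i = 0; i < genomeLength; i++) {
                genome[i] = rng.nextBoolean();
            }
        }
        boolean[] best = population[0].clone();
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int gen = 0; gen < generations; gen++) {
            // The expensive step: one fitness call per genome per generation.
            double[] scores = new double[populationSize];
            for (int i = 0; i < populationSize; i++) {
                scores[i] = fitness.applyAsDouble(population[i]);
                if (scores[i] > bestScore) {
                    bestScore = scores[i];
                    best = population[i].clone();
                }
            }
            boolean[][] next = new boolean[populationSize][];
            next[0] = best.clone(); // elitism: carry the best genome forward unchanged
            for (int i = 1; i < populationSize; i++) {
                boolean[] a = tournament(population, scores, rng);
                boolean[] b = tournament(population, scores, rng);
                next[i] = crossoverAndMutate(a, b, rng);
            }
            population = next;
        }
        return best;
    }

    /** Picks two genomes at random and returns the fitter one. */
    private static boolean[] tournament(boolean[][] population, double[] scores, Random rng) {
        int i = rng.nextInt(population.length);
        int j = rng.nextInt(population.length);
        return scores[i] >= scores[j] ? population[i] : population[j];
    }

    /** One-point crossover followed by per-bit mutation with probability 1/length. */
    private static boolean[] crossoverAndMutate(boolean[] a, boolean[] b, Random rng) {
        int cut = rng.nextInt(a.length + 1);
        boolean[] child = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            child[i] = i < cut ? a[i] : b[i];
            if (rng.nextDouble() < 1.0 / a.length) {
                child[i] = !child[i];
            }
        }
        return child;
    }

    /** Toy fitness: number of true bits in the genome. */
    public static double countOnes(boolean[] genome) {
        int count = 0;
        for (boolean bit : genome) {
            if (bit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[] best = evolve(GeneticAlgorithmSketch::countOnes, 20, 30, 40, 42L);
        System.out.println("best fitness: " + countOnes(best));
    }
}
```

Note that a fitness function scored against a single opponent rewards whatever beats that opponent, which is how overly specialized behavior like the dodging pattern above can emerge.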
If you use this code, please cite the following:
J. Eisenstein.
Evolving Robocode Tank Fighters.
MIT AI Lab Memo AIM-2003-023.
One last thing -- some other guys took this idea much further than I did.
I think they used some of this code, although I'm not totally sure what
went into their final version.
Check out their paper.
This code is (c) MIT 2005.