Statistical Models of Spam


Sponsor:
Charles Isbell
isbell@cc.gatech.edu
CRB 380

Area: Machine Learning


Problem: I get a lot of email. And when I say a lot, I mean a lot. Given my mail problems, I and a few friends wrote an email server/client system that handles all my mail for me. It's called Ishmail, and among its many cool features is that it's an extensible research platform.

Unfortunately, a non-trivial amount of the mail I get is spam. I now know how to grow more hair, spice up my sex life, save on mortgages, find millions of dollars in my closet, view the best porn on the net, and do some other things that I'd probably be arrested for telling you about. If only I cared.

I've adopted many strategies for dealing with the spam. Using Ishmail, I've been able to define many useful rules that capture much of the spam, but those guys are really clever and they keep getting cleverer. Recently, there have been some advances in building simple statistical models to classify spam. It's an ongoing war with an ever-changing target, of course, but with a statistical model one can continually update one's statistics as time goes on.
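To make the "continually update one's statistics" idea concrete, here is a minimal sketch of one such model, a naive Bayes classifier over word counts with add-one smoothing. This is only an illustration: the text above does not say which statistical model is meant, the class and method names (NaiveBayesSpamFilter, update, spam_log_odds, is_spam) are made up for this example, and none of it is Ishmail's actual code or API.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Hypothetical incremental spam filter; not part of Ishmail."""

    def __init__(self):
        # Per-class token counts and message counts are the only state,
        # so the model can be updated one message at a time.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: crude lowercase word tokens are good enough for a sketch.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def update(self, text, label):
        """Fold one labeled message ("spam" or "ham") into the statistics."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text) under naive Bayes."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Add-one smoothed class prior.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for never-seen tokens.
            denominator = total_words + len(vocab) + 1
            for token in self.tokenize(text):
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            scores[label] = score
        return scores["spam"] - scores["ham"]

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


spam_filter = NaiveBayesSpamFilter()
spam_filter.update("save on mortgages now", "spam")
spam_filter.update("grow more hair fast", "spam")
spam_filter.update("meeting about the research project", "ham")
spam_filter.update("lunch on friday", "ham")
```

Because the model is nothing more than counts, every newly labeled message (for example, one the user drags into a spam folder) can be folded in with a single update call, with no retraining from scratch. A real filter would need better tokenization, header features, and a decision threshold tuned to keep false positives rare.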

Now, unlike the general problem of coming up with rules for sorting mail (see here for a project on that), I suspect that users don't need a human-readable rule that they "understand" for their spam filter. So I'd like someone to add this functionality to Ishmail.

What needs to be done: A literature review of the machine learning techniques appropriate here, and perhaps an implementation. In this case, I may have some pointers to what has been done so far, and perhaps even some code. The deliverable, then, would be the beginnings of a working prototype.

If this project seems a bit interesting to you, but not quite right, you may want to see the other, non-spam version of this project as well.