Unsupervised learning relies on finding relevant information in large databases. This is possible thanks in part to groundbreaking research by College of Computing Professor Santosh Vempala and his collaborators that was done 20 years ago.
The team's lasting contributions to information retrieval were recently recognized with a test-of-time award. The annual ACM Symposium on Principles of Database Systems (PODS) gave their paper, "Latent Semantic Indexing: A Probabilistic Analysis," its Gems of PODS honor at this year's conference.
The 1998 paper analyzed a popular spectral algorithm and introduced the very first topic model, now a standard in unsupervised learning.
The researchers showed that if a database is viewed as a matrix, a computer algorithm can perform singular-value decomposition, a matrix factorization that extracts the most significant directions in the data. Keeping only those directions not only introduces minimal distortion, but actually yields better retrieval results than the full original matrix.
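A minimal sketch of the idea, using NumPy on a tiny hypothetical term-document matrix (the corpus, term labels, and rank `k` here are illustrative assumptions, not from the paper):

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2, 1, 0, 0],   # "graph"
    [1, 2, 0, 0],   # "matrix"
    [0, 0, 2, 1],   # "recipe"
    [0, 0, 1, 2],   # "flour"
], dtype=float)

# Singular-value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values -- the dominant directions.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense;
# retrieval compares a query against documents in this reduced space.
query = np.array([1, 1, 0, 0], dtype=float)   # query mentions "graph", "matrix"
scores = query @ A_k                          # one similarity score per document
```

Here the first two documents (which share the query's terms) score high while the last two score near zero, even though the query is compared against the reduced matrix rather than the original.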
“This was one of the first provable techniques for automatically extracting information from data,” Vempala said.
The model has been significantly enhanced in the decades since the paper was published. In that time, the work has influenced prominent computing fields such as spectral methods, data mining, machine learning, and deep neural networks.
Vempala wrote the paper as a summer intern at IBM with Prabhakar Raghavan (now VP of Engineering at Google), together with Columbia University Professor Christos Papadimitriou (then at Berkeley), and Meiji University Professor Hisao Tamaki.
As part of the award, Papadimitriou gave a talk during the PODS conference about the paper and its influence.