November 30, 2009
Avrim Blum
Carnegie Mellon University
Title: A new theoretical framework for clustering
Abstract:
Problems of clustering data come up in many different areas throughout science. Theoretical treatments of clustering have generally been of two types: either on algorithms for (approximately) optimizing various distance-based quantities such as k-median, k-means, and min-sum objectives, or on clustering under probabilistic "generative model"
assumptions such as mixtures of Gaussian or related distributions.
In this work we propose a new approach to analyzing the problem of clustering. Building on models used in computational learning theory, we consider the goal of approximately recovering an unknown target clustering from given pairwise similarity information, without making generative-model assumptions about the data. Instead, we ask: what relations between the similarity information and the desired clustering are sufficient to cluster well -- these relations taking the role of the "concept class" in learning theory. We find that if we are willing to relax our goals a bit (for example, allow the algorithm to produce a hierarchical clustering that we will call successful if it contains a pruning that is close to the correct
answer) then this leads to a number of interesting graph-theoretic and game-theoretic properties that are sufficient for an algorithm to succeed. We show we can also produce accurate clusterings under implicit assumptions made when considering approximation algorithms for distance-based objectives such as k-median, k-means, and min-sum.
This talk is based on work joint with Maria-Florina Balcan, Anupam Gupta, and Santosh Vempala.