Back to ARC Colloquium page


November 30, 2009
Avrim Blum
Carnegie Mellon University

Title: A new theoretical framework for clustering

Abstract:

Problems of clustering data come up in many different areas throughout science.  Theoretical treatments of clustering have generally been of two types: either on algorithms for (approximately) optimizing various distance-based quantities such as k-median, k-means, and min-sum objectives, or on clustering under probabilistic "generative model"
assumptions such as mixtures of Gaussian or related distributions.

In this work we propose a new approach to analyzing the problem of clustering.  Building on models used in computational learning theory, we consider the goal of approximately recovering an unknown target clustering from given pairwise similarity information, without making generative-model assumptions about the data.  Instead, we ask: what relations between the similarity information and the desired clustering are sufficient to cluster well -- these relations taking the role of the "concept class" in learning theory.  We find that if we are willing to relax our goals a bit (for example, allow the algorithm to produce a hierarchical clustering that we will call successful if it contains a pruning that is close to the correct answer) then this leads to a number of interesting graph-theoretic and game-theoretic properties that are sufficient for an algorithm to succeed.  We show we can also produce accurate clusterings under implicit assumptions made when considering approximation algorithms for distance-based objectives such as k-median, k-means, and min-sum.

In fact, in this case we can do as well as if we had been able to approximate such objectives to values that are NP-hard to achieve.

This talk is based on work joint with Maria-Florina Balcan, Anupam Gupta, and Santosh Vempala.