Problems of clustering data arise in many different areas of science. Theoretical treatments of clustering have generally been of two types: analyses of algorithms for (approximately) optimizing distance-based objectives such as k-median, k-means, and min-sum, or analyses of clustering under probabilistic (i.e., “generative model”) assumptions such as mixtures of Gaussian or related distributions. In this work we propose a new approach to analyzing the problem of clustering. Building on models used in learning theory, we consider the goal of approximately recovering an unknown target clustering from a given similarity matrix (weighted graph), without making generative-model assumptions about the data. Instead, we ask what relations between the similarity information and the desired clustering are sufficient to cluster well, with these relations playing the role of the “concept class” in learning theory. We find that if we are willing to relax our goals somewhat (for example, by allowing the algorithm to produce a hierarchical clustering that we call successful if it contains a pruning, a partition of the data into clusters appearing as nodes of the tree, that is close to the correct answer), then a number of interesting graph-theoretic and game-theoretic properties suffice for clustering well. We also show that we can produce accurate clusterings under the implicit assumptions made when one applies approximation algorithms to distance-based objectives such as k-median, k-means, and min-sum. In fact, we can do as well as if we had been able to approximate these objectives to values that are NP-hard to achieve.
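To make the final claim concrete, the “implicit assumption” behind seeking a $c$-approximation to an objective $\Phi$ is commonly formalized in this line of work as an approximation-stability condition. The notation below ($\Phi$, $c$, $\varepsilon$, the target clustering $\mathcal{C}_T$, and the distance $\mathrm{dist}$) is introduced here for illustration and sketches the usual form of the definition rather than quoting the paper's exact statement. For two $k$-clusterings $\mathcal{C} = \{C_1,\dots,C_k\}$ and $\mathcal{C}' = \{C'_1,\dots,C'_k\}$ of an $n$-point data set, define
\[
  \mathrm{dist}(\mathcal{C},\mathcal{C}') \;=\; \min_{\sigma \in S_k} \frac{1}{n} \sum_{i=1}^{k} \bigl| C_i \setminus C'_{\sigma(i)} \bigr|,
\]
the fraction of points that must be reassigned to turn $\mathcal{C}$ into $\mathcal{C}'$ under the best matching $\sigma$ of cluster indices. An instance is then $(c,\varepsilon)$-approximation-stable for $\Phi$ if every clustering whose objective value is within a factor $c$ of optimal is $\varepsilon$-close to the target:
\[
  \Phi(\mathcal{C}) \;\le\; c \cdot \Phi(\mathcal{C}^{*}) \;\Longrightarrow\; \mathrm{dist}(\mathcal{C},\mathcal{C}_T) \;\le\; \varepsilon,
\]
where $\mathcal{C}^{*}$ minimizes $\Phi$. On this reading, the last sentence of the abstract says that a clustering close to the target can be recovered even for values of $c$ at which computing a $c$-approximation to $\Phi$ itself is NP-hard.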