Much of modern data is generated by humans and drives decisions in a variety of settings, such as recommendations for online markets, analysis of social networks, and denoising of crowdsourced labels. Due to the complexities of human behavior, the precise data model is often unknown, creating a need for flexible models with minimal assumptions. A minimal property that is natural for many datasets is "exchangeability", i.e., invariance under relabeling of the dataset, which naturally leads to a nonparametric latent variable model. The corresponding inference problem can be formulated as matrix or graphon estimation.

We propose similarity-based inference algorithms for such nonparametric latent variable models, and we provide theoretical guarantees that bound the estimation error. Our method can be computed in a distributed manner, which makes it scalable. As a byproduct, our analysis explains the longstanding mystery of why the collaborative filtering heuristic performs well in practice. While classical collaborative filtering typically requires a dense dataset, we propose a new method that compares larger-radius neighborhoods of data to compute similarities, and we show that the estimate converges even for very sparse datasets, which has implications for sparse graphon estimation. For denoising crowdsourced labels, our algorithm provides guarantees under flexible models that allow for heterogeneity of task and worker types.

*This presentation is based on a collection of joint works with: (a) Yihua Li, Dogyoon Song, and Devavrat Shah; (b) Christian Borgs, Jennifer Chayes, and Devavrat Shah; and (c) Devavrat Shah.