A striking feature of modern supervised machine learning is its pervasive over-parametrization. Deep networks contain millions of parameters, often exceeding the number of data points by orders of magnitude. These networks are trained to (nearly) interpolate the data, driving the training error to zero. Yet, at odds with most theory, they show excellent test performance. In this talk I will show that classical kernel methods do, in fact, exhibit these unusual properties. Moreover, I will present theoretical and empirical results to argue that we are unlikely to make progress on understanding deep learning until we develop a fundamental understanding of classical "shallow" kernel classifiers in the "modern" over-fitted setting. Joint work with Siyuan Ma and Soumik Mandal.
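
As a minimal illustration of the phenomenon the abstract describes (not the speaker's code): a "ridgeless" kernel machine fit with zero regularization interpolates noisy training labels exactly, yet can still predict reasonably on held-out data. The Laplacian kernel, the synthetic dataset, and all parameter choices below are illustrative assumptions, not drawn from the talk.

```python
# Sketch: an interpolating ("over-fitted") kernel regressor on noisy data.
# All data and hyperparameters here are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def laplacian_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||_2)
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-gamma * dists)

# Synthetic regression task with label noise.
d = 10
X_train = rng.normal(size=(200, d))
X_test = rng.normal(size=(1000, d))
f = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1]
y_train = f(X_train) + 0.3 * rng.normal(size=len(X_train))  # noisy labels
y_test = f(X_test)

# Interpolating solution: solve K alpha = y with no regularization term.
K = laplacian_kernel(X_train, X_train)
alpha = np.linalg.solve(K, y_train)

train_pred = K @ alpha
test_pred = laplacian_kernel(X_test, X_train) @ alpha

print("train MSE:", np.mean((train_pred - y_train) ** 2))  # ~0: fits the noise exactly
print("test MSE:", np.mean((test_pred - y_test) ** 2))     # still a sensible predictor
```

The point of the sketch is only that driving training error to zero on noisy labels, the "over-fitted" regime the abstract refers to, does not by itself destroy test performance, even for a classical shallow kernel method.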