Statistical Inference on High-Dimensional Covariance Structure

Tony Cai, University of Pennsylvania

Covariance structure is of fundamental importance in many areas of statistical inference and a wide range of applications, including genomics, fMRI analysis, risk management, and web search problems. In the high-dimensional setting, where the dimension $p$ can be much larger than the sample size $n$, classical methods and results based on fixed $p$ and large $n$ are no longer applicable. In this talk, I will discuss some recent results on optimal and adaptive estimation of large covariance matrices under different settings. Estimation of sparse precision matrices, which has close connections to graphical models, will also be discussed. The results and technical analysis reveal, in some cases, new features that are quite different from those of conventional signal recovery problems.

Something for almost nothing: Advances in sub-linear time algorithms

Ronitt Rubinfeld, MIT and Tel Aviv University

Linear-time algorithms have long been considered the gold standard of computational efficiency. Indeed, it is hard to imagine doing better than that, since for a nontrivial problem, any algorithm must consider all of the input in order to make a decision. However, as extremely large data sets are pervasive, it is natural to wonder what one can do in sub-linear time. Over the past two decades, several surprising advances have been made in designing such algorithms. We will give a non-exhaustive survey of this emerging area, highlighting recent progress and directions for further research.

Machine Learning for Big Data

Carlos Guestrin, Carnegie Mellon

Today, machine learning (ML) methods play a central role in industry and science. The growth of the Web and improvements in sensor data collection technology have been rapidly increasing the magnitude and complexity of the ML tasks we must solve. This growth is driving the need for scalable, parallel ML algorithms that can handle "Big Data." Unfortunately, designing and implementing efficient parallel ML algorithms is challenging. Existing high-level parallel abstractions such as MapReduce and Pregel are insufficiently expressive to achieve the desired performance, while low-level tools such as MPI are difficult to use, leaving ML experts repeatedly solving the same design challenges. In this talk, I will describe the GraphLab framework, which naturally expresses asynchronous, dynamic graph computations that are key for state-of-the-art ML algorithms. When these algorithms are expressed in our higher-level abstraction, GraphLab effectively addresses many of the underlying parallelism challenges, including data distribution, optimized communication, and guaranteeing sequential consistency, a property that is surprisingly important for many ML algorithms. On a variety of large-scale tasks, GraphLab provides 20-100x performance improvements over Hadoop. In recent months, GraphLab has received thousands of downloads, and is being actively used by a number of startups, companies, research labs and universities. This talk represents joint work with Yucheng Low, Joey Gonzalez, Aapo Kyrola, Jay Gu, and Danny Bickson.

Cancer genomics

David Haussler, UC Santa Cruz

Throughout life, the cells in every individual accumulate many changes in the DNA inherited from his or her parents. Certain combinations of changes lead to cancer. During the last decade, the cost of DNA sequencing has been dropping by a factor of 10 every two years, making it now possible to read most of the three-billion-base genome from a patient's cancer tumor, and to try to determine all of the thousands of DNA changes in it. Under the auspices of NCI's Cancer Genome Atlas Project, 10,000 tumors will be sequenced in this manner in the next few years. Soon cancer genome sequencing will be a widespread clinical practice, and millions of tumors will be sequenced. A massive computational problem looms in interpreting these data. First, because we can only read short pieces of DNA, we have the enormous problem of assembling a coherent and reliable representation of the tumor genome from massive amounts of incomplete and error-prone evidence. This is the first challenge. Second, every human genome is unique from birth, and every tumor a unique variant. There is no single route to cancer. We must learn to read the varied signatures of cancer within the tumor genome and associate these with optimal treatments. Already there are hundreds of molecularly targeted treatments for cancer available, each known to be more or less effective depending on specific genetic variants. However, targeting a single gene with one treatment rarely works. The second challenge is to tackle the combinatorics of personalized, targeted, combination therapy in cancer.

Designing Large-scale Nudge Algorithms

Balaji Prabhakar, Stanford University

In many of the challenges faced by the modern world, from overcrowded road networks to overstretched healthcare systems, large benefits for society come about from small changes by very many individuals. We survey the problems and the costs they impose on society. We describe a framework for designing "nudge algorithms" and a series of pilot projects which aim to nudge behavior in networks such as transportation, wellness and recycling. Pilots have been conducted with Infosys Technologies, Bangalore (commuting) and Accenture-USA (wellness), and two are ongoing: in Singapore (public transit congestion) and at Stanford (congestion and parking). Some salient themes are the use of low-cost sensing (RFID, smartphones) and networking technology for sensing individual behavior, and the use of incentives and social norming to nudge that behavior. We present some preliminary results from the pilots.