Complex subsurface Earth structures such as fault zones are imaged using travel times of seismic waves across large arrays of seismic sensors. Such travel time tomography problems are ill-posed, and ray coverage of the imaged environment is often dense but irregular. We propose a 2D travel time tomography method that regularizes the inversion by modeling local groups of phase speed pixels from discrete speed maps, called patches, as sparse linear combinations of atoms from a dictionary. Further, the dictionary atoms are adapted to the data using dictionary learning. In this locally-sparse travel time tomography (LST) method, the local model is integrated with the overall phase speed map, called the global model, which constrains large-scale features using L2-norm regularization. With efficient dictionary learning algorithms, the LST method scales well to tomography problems with large numbers of rays and pixels. We develop a $\mathit{maximum\ a\ posteriori}$ formulation for LST, which is solved with an iterative inversion algorithm. LST performance is demonstrated on both synthetic and real seismic data.
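The local model described above, in which each patch of the speed map is a sparse linear combination of dictionary atoms, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the LST implementation: the helper names `extract_patches` and `omp`, the toy 6-by-6 speed map, the fixed random orthonormal dictionary `D` (the method itself learns the atoms from the data), and the use of orthogonal matching pursuit as the sparse solver, which the abstract does not specify.

```python
import numpy as np


def extract_patches(speed_map, n):
    """Return all overlapping n-by-n patches of a 2D map as matrix columns."""
    rows, cols = speed_map.shape
    patches = [
        speed_map[i:i + n, j:j + n].ravel()
        for i in range(rows - n + 1)
        for j in range(cols - n + 1)
    ]
    return np.array(patches).T  # shape: (n * n, number_of_patches)


def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: approximate y by D @ x with at most
    n_nonzero nonzero coefficients (a stand-in sparse solver)."""
    x = np.zeros(D.shape[1])
    residual = y.copy()
    support = []
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < 1e-12:
            break  # y is already represented exactly
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x


rng = np.random.default_rng(0)
n = 4  # hypothetical patch side length, in pixels

# Hypothetical fixed dictionary: a random orthonormal basis for n*n patches.
D, _ = np.linalg.qr(rng.standard_normal((n * n, n * n)))

# Toy speed map, used only to show the patch layout.
speed_map = rng.uniform(0.3, 0.5, size=(6, 6))
patches = extract_patches(speed_map, n)

# A synthetic patch built from exactly two atoms, then sparse-coded.
y = 2.0 * D[:, 3] - 1.5 * D[:, 7]
x = omp(D, y, n_nonzero=2)
print("selected atoms:", np.flatnonzero(x))
print("reconstruction error:", np.linalg.norm(D @ x - y))
```

In this toy setting the dictionary is orthonormal, so the greedy selection recovers the two atoms used to build `y`. With a learned, overcomplete dictionary the same call would return an approximate sparse code for each column of `patches`.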