A learning algorithm can be viewed as a randomized mapping, or a "channel" in the information-theoretic sense, that takes a training dataset as input and produces a hypothesis (e.g., a classifier) as output. In analogy with channel capacity, we define the "information capacity" of a learning algorithm as the maximum mutual information between its input and output, where the maximization is taken over all distributions of the training data. We derive sharp upper bounds on the generalization error of learning algorithms in terms of their information capacity: the smaller the information capacity, the better the algorithm generalizes, or equivalently, the less it overfits. We also provide theoretical guidelines for striking the right balance between data fit and generalization by controlling the information capacity of a learning algorithm. The results further apply to algorithm design and accuracy analysis in adaptive data analysis, where the same dataset is processed by a sequence of learning algorithms in stages and the algorithm at each stage has access to the outputs of the previous stages. Data reuse in adaptive data analysis has attracted a great deal of attention recently, and our work yields nontrivial improvements over several state-of-the-art results.
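The central quantities of the abstract can be sketched as follows. Here $S = (Z_1, \dots, Z_n)$ denotes the training dataset, $W$ the output hypothesis, and $P_{W|S}$ the learning algorithm viewed as a channel; the notation and the specific bound shown are the standard sub-Gaussian form from the information-theoretic generalization literature, and may differ in detail from the paper's exact statement.

\[
C(P_{W|S}) \;=\; \sup_{P_S} \, I(S; W),
\]

so that, assuming the loss $\ell(w, Z)$ is $\sigma$-sub-Gaussian under the data distribution, the expected generalization gap of any algorithm with capacity $C$ admits a bound of the typical form

\[
\bigl| \mathbb{E}\bigl[ L_\mu(W) - L_S(W) \bigr] \bigr| \;\le\; \sqrt{\frac{2\sigma^2 \, C(P_{W|S})}{n}},
\]

where $L_S(W)$ is the empirical risk on $S$ and $L_\mu(W)$ the population risk. Since $I(S;W) \le C(P_{W|S})$ for every data distribution, a capacity bound controls generalization uniformly over input distributions, which is what makes it a channel-capacity analogue.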