Gossip and consensus algorithms are attractive for distributed function computation, but in networks with poor edge expansion, such as grids and random geometric graphs, their message complexity scales suboptimally. We study a class of distributed averaging algorithms in which each node runs a local linear predictor based on two memory taps. Agents iteratively update their local value with a convex combination of the linear predictor and the usual weighted linear combination of the values received from neighboring nodes. We derive the optimal mixing parameter for the convex combination and show how it can easily be computed in a distributed manner. Surprisingly, using just two memory taps provably leads to significant improvements in the rate of convergence. For example, for an $n$-node two-dimensional grid topology we obtain a factor-of-$\sqrt{n}$ reduction in the number of iterations required to reach a prescribed level of accuracy.
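
To make the update rule concrete, the following minimal NumPy sketch simulates one plausible instance of this scheme on a small cycle graph: each node mixes the usual weighted neighbor combination with a two-tap linear extrapolation through its previous two values. The extrapolation formula, the fixed mixing parameter `alpha`, the cycle topology, and the uniform weight matrix are all illustrative assumptions; in particular, the optimal mixing parameter derived in the paper is not computed here.

```python
import numpy as np

def two_tap_consensus(W, x0, alpha=0.8, num_iters=200):
    """Each node mixes the standard weighted neighbor combination with a
    local two-tap linear predictor (extrapolation through its last two values)."""
    x_prev = x0.copy()   # memory tap 1: value at iteration t-1
    x_curr = x0.copy()   # memory tap 2: value at iteration t
    for _ in range(num_iters):
        x_neigh = W @ x_curr                               # weighted neighbor combination
        x_pred = 2.0 * x_curr - x_prev                     # two-tap linear extrapolation (assumed predictor form)
        x_next = alpha * x_neigh + (1.0 - alpha) * x_pred  # convex combination of the two
        x_prev, x_curr = x_curr, x_next
    return x_curr

# Illustrative setup: uniform weights on a 20-node cycle (assumed topology).
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

x0 = np.random.rand(n)
x_final = two_tap_consensus(W, x0)
print("target average:", x0.mean())
print("max deviation :", np.abs(x_final - x0.mean()).max())
```

Because the weight matrix above is doubly stochastic and the predictor's coefficients sum to one, the convex combination preserves the network-wide average at every iteration, so the iterates converge to the true average for any mixing parameter that keeps the two-tap recursion stable.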