We consider the problem of learning a discrete distribution up to permutation in the large alphabet regime. We characterize the minimax rates under a Wasserstein distance, and show that our estimator satisfies the following properties: 1) it achieves vanishing risk with order k/log k samples, where k is the alphabet size; 2) plugging it into permutation-invariant functionals such as entropy yields statistically optimal estimators of those functionals; 3) it is efficiently computable using linear and conic programming.
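To make the two computational ingredients concrete, a minimal Python sketch follows. It is an illustration under stated assumptions, not the paper's actual estimator: `sorted_wasserstein` evaluates the permutation-invariant loss (for equal-size supports, the Wasserstein distance between the probability multisets reduces to the mean absolute difference of the sorted vectors), and `match_moments_lp` is a simplified, single-region moment-matching linear program under a Poisson sampling model. The function names, the grid range, and the number of matched moments are illustrative choices, not the paper's tuning.

```python
import numpy as np
from scipy.optimize import linprog


def sorted_wasserstein(p, q):
    """Wasserstein-1 distance between the probability multisets {p_i}
    and {q_i}: for equal-size supports it equals the mean absolute
    difference of the sorted probability vectors."""
    return np.mean(np.abs(np.sort(p) - np.sort(q)))


def match_moments_lp(counts, n, num_moments=3, grid_size=50):
    """Simplified moment-matching LP (illustrative, not the paper's
    local estimator, which matches moments interval by interval).

    counts : observed symbol counts, modeled as N_i ~ Poisson(n * p_i).
    n      : expected sample size.
    Returns grid points x and nonnegative weights w: an estimated
    histogram of the small probability masses.
    """
    counts = np.asarray(counts, dtype=float)
    # Unbiased power-sum estimates: under N ~ Poisson(n p),
    # E[N (N-1) ... (N-m+1)] = (n p)^m, so the falling factorial
    # divided by n^m estimates sum_i p_i^m without bias.
    moments = [
        np.sum(np.prod([counts - r for r in range(m)], axis=0)) / n**m
        for m in range(1, num_moments + 1)
    ]
    # Candidate probability values; "small" probabilities live below
    # roughly log(n) / n, hence this (assumed) grid range.
    x = np.linspace(0.0, 2.0 * np.log(n) / n, grid_size)
    # LP variables: grid weights w >= 0 plus one slack t_m per moment.
    # Minimize sum_m t_m subject to |sum_j w_j x_j^m - m_hat| <= t_m.
    c = np.concatenate([np.zeros(grid_size), np.ones(num_moments)])
    A_ub, b_ub = [], []
    for m, m_hat in enumerate(moments, start=1):
        row, e = x**m, np.zeros(num_moments)
        e[m - 1] = 1.0
        A_ub.append(np.concatenate([row, -e]))
        b_ub.append(m_hat)
        A_ub.append(np.concatenate([-row, -e]))
        b_ub.append(-m_hat)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * (grid_size + num_moments))
    return x, res.x[:grid_size]
```

The slack variables make the program always feasible, so the LP returns the nonnegative measure on the grid whose first few power sums come closest to the unbiased empirical moment estimates; the sorted-distribution estimate is then read off from the returned histogram.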