We study how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment, using a multi-armed bandit task. We compare human behavior to a variety of models that differ in representational and computational complexity. Our results show that subjects' trial-to-trial choices are best captured by a 'forgetful' Bayesian iterative learning model combined with a partially myopic decision policy known as the Knowledge Gradient. This model accounts for subjects' choices better than a set of previously proposed models, with the added benefit of coming closer in performance to the optimal Bayesian model than other heuristics of the same computational complexity (all are significantly less complex than the optimal model).
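To make the two components concrete, the following is a minimal sketch of a 'forgetful' Bayesian learner paired with a Knowledge Gradient choice rule for a Bernoulli bandit. It assumes Beta-Bernoulli beliefs, models 'forgetting' as a geometric decay of the belief parameters toward the prior on every trial, and uses the standard one-step Knowledge Gradient computation; the decay rate `gamma`, the prior, and all function names are illustrative placeholders, not the fitted quantities from the study.

```python
import numpy as np

def forgetful_update(alpha, beta, arm, reward, gamma=0.8, prior=(1.0, 1.0)):
    """Beta-Bernoulli update for the chosen arm, then decay toward the prior.

    gamma = 1 recovers standard Bayesian updating; gamma < 1 discounts
    older observations ('forgetting').
    """
    alpha, beta = alpha.copy(), beta.copy()
    alpha[arm] += reward          # count a success
    beta[arm] += 1 - reward       # count a failure
    a0, b0 = prior
    alpha = gamma * alpha + (1 - gamma) * a0
    beta = gamma * beta + (1 - gamma) * b0
    return alpha, beta

def kg_values(alpha, beta):
    """One-step Knowledge Gradient: expected gain in the best posterior mean
    from observing each arm once more."""
    means = alpha / (alpha + beta)
    best = means.max()
    v = np.zeros_like(means)
    for x in range(len(means)):
        a_win, b_win = alpha.copy(), beta.copy()
        a_win[x] += 1             # hypothetical success on arm x
        a_loss, b_loss = alpha.copy(), beta.copy()
        b_loss[x] += 1            # hypothetical failure on arm x
        m_win = (a_win / (a_win + b_win)).max()
        m_loss = (a_loss / (a_loss + b_loss)).max()
        # Average the post-observation best mean over the two outcomes.
        v[x] = means[x] * m_win + (1 - means[x]) * m_loss - best
    return v

def kg_choose(alpha, beta, trials_left=1):
    """Pick the arm maximizing immediate reward plus the KG exploration bonus."""
    means = alpha / (alpha + beta)
    return int(np.argmax(means + trials_left * kg_values(alpha, beta)))
```

The KG bonus is what makes the policy only *partially* myopic: it looks one observation ahead to value information, rather than planning over the full horizon as the optimal Bayesian model does.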