Many potential applications of machine learning, such as online advertising and recommender systems, require repeatedly choosing an action given an input so as to maximize an external reward signal. These problems lack explicitly labeled data indicating which action is best; this type of partially supervised optimization is formalized as the ``contextual bandit'' or ``partially labeled'' problem. Prior solutions require control over which actions are taken during the learning process. In many real-world applications, however, the training data is instead generated by an external policy that is uncontrolled and effectively unknown. For example, when developing a new search engine that learns over time, we face the so-called ``warm start problem'': unless the initial engine performs reasonably well, few users will provide the feedback essential for its improvement. We propose a solution to the problem of learning a policy from logged data in partial-feedback settings which (a) is sound and consistent, and which (b) does \emph{not} require that the logging policy randomize, as is commonly required by prior approaches. This is the first solution with these desirable properties. We empirically verify our solution on a reasonably sized real-world dataset obtained from an online advertising company.
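To make the logged partial-feedback setting concrete, the following is a minimal illustrative sketch of an offline policy-value estimate of the inverse-propensity-scoring style; the symbols $(x_i, a_i, r_i)$, the estimated logging probabilities $\hat{\pi}(a \mid x)$, and the clipping threshold $\tau$ are notation introduced here for exposition only, and the displayed estimator is a generic formulation rather than the specific method developed in this paper:
\[
\hat{V}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{r_i \,\mathbf{1}\{h(x_i) = a_i\}}{\max\bigl(\hat{\pi}(a_i \mid x_i),\, \tau\bigr)},
\]
where each logged record $(x_i, a_i, r_i)$ consists of a context, the action chosen by the (possibly deterministic and unknown) logging policy, and the observed reward; $\hat{\pi}$ is an estimate of the logging policy's action probabilities fit from the log itself; and $\tau > 0$ keeps the estimate bounded when $\hat{\pi}$ is small. Under this kind of formulation, a new policy $h$ can be scored, and hence learned, from the log alone, without requiring the logging policy to have randomized.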