The online multi-armed bandit problem and its generalizations are repeated decision-making problems in which the goal is to select one of several possible decisions in every round and incur the associated cost, so that the total cost incurred over all rounds is close to the cost of the best fixed decision in hindsight. The difference between these costs is known as the {\em regret} of the algorithm. The term {\em bandit} refers to the setting in which the only feedback available in a round is the cost of the decision played, and no other information. Perhaps the most general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some Euclidean space and the cost functions are linear. Only recently was an efficient algorithm attaining $\tilde{O}(\sqrt{T})$ regret discovered for this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which obtains a regret bound of $\tilde{O}(\sqrt{Q})$, where $Q$ is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full-information case, shows that it is possible to incur much less regret in a slowly changing environment, even in the bandit setting. Our algorithm is efficient and applies several new ideas, such as reservoir sampling, to bandit optimization.
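To fix notation for the reader (the precise definitions used in the analysis appear in the body of the paper; the symbols $\mathcal{K}$, $f_t$, $x_t$, and $\mu$ below are illustrative), one standard formalization is the following: the decision set is a convex set $\mathcal{K} \subseteq \mathbb{R}^n$, in round $t$ the algorithm plays $x_t \in \mathcal{K}$ and observes only the scalar cost $f_t^{\top} x_t$ for an adversarially chosen cost vector $f_t$, and the regret after $T$ rounds is
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} f_t^{\top} x_t \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t^{\top} x .
\]
The total variation $Q$ can be taken to be a quantity of the form
\[
  Q \;=\; \sum_{t=1}^{T} \| f_t - \mu \|^2 , \qquad \mu = \frac{1}{T} \sum_{t=1}^{T} f_t ,
\]
so that, for bounded costs, $Q = O(T)$ in the worst case, while $Q \ll T$ when the cost functions change slowly; the $\tilde{O}(\sqrt{Q})$ bound therefore interpolates between these regimes.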