The K-Armed Bandit Problem Notes

Fundamentals of Reinforcement Learning Week1 Notes

First of all, there are some notations need to be determined.

Notation Explanation
number of actions (arms)
discrete time step or play number
true value (expected reward) of action
estimate at time of
number of times action has been selected up prior to time
learned preference for selecting action at time
probability of selecting action at time
estimate at time of the expected reward given


Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, or “one-armed bandit”, except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers.

In short, we need to calculate the following formula:

To explore the best result, there are two types of actions, exploitation and exploration.

  • Exploitation means selecting the greedy action so that you can gain high reward.
  • Exploration means selecting the nongreedy action in which you may improve your estimate value.


Action-value Method

This is the most simple and straight one. We calculate value of action using following formula:

When , converges to . We call this the sample-average. When , we may set . Then, the selection rule is greedy selection, as the following:

For those equal valued , we just arbitrarily select one. This step is called exploiting. To explore other actions which may improve the result, a simple strategy called is used.

  • For each action, with a small probability , we select to explore, to randomly select an action among all actions. With probability , we do exploitation normally.

Say short, the test result shows that method outperformed than pure greedy method, when .

Incremental Implementation

At time , we select one action -th time. The estimate value of the -th action is:

We can store the sum of , add each new reward on it, and then calculate the current value.

We only need to store and . Also we can see it in an iteration format.

To generalize it, the form is:

The is an error in the estimation. And . It has a sense of similarity with the gradient descent.

Finally, a simple bandit algorithm is as the following:

  • Initialize, for a = 1 to k:
  • Loop:

Tracking a Nonstationary Problem

In reinforcement learning, a common situation is that we give more weight to recent rewards than to long-past rewards. When we assign an value . The formula can be transformed to the following:

We call it a weighted average because the sum of the weights is , which is also called exponential recency-weighted average.

Let’s use denoting the step-size parameter, which is a list of sequence with no length limitation. The following is the condition:

The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.


  • Sample-average, , meets all conditions. If , a constant, it will not meet the second condition and will never completely converge but continue to vary in response to the most recently received rewards.
  • The sequence of step-size parameters that meets formular conditions are often very slow. It is often used in theoretical work but not in applications and empirical research.


Optimistic Initial Values

In above formula , the selection of can affect the converge speed and accuracy. We can set , which is wildly optimistic initial value. With such value, each reward of the following value will be “unsatisfied”, so that it will go to explore. We call the technique optimistic initial values.

The technique is quite effective on stationary problems, but not on nonstationary problems. The reason is that it always start from the beginning. If the task changes, it needs to learn from the start and do exploration. Also, we don’t know the maximal reward.

Upper-Confidence-Bound Action Selection

In the former formulas, we all discuss about how to get a better , exploitation, value but ignore how to explore more efficiently. One effective way of doing this is to select actions according to:

denotes the number of times that action a has been selected prior to time t. controls the degree of exploration. If , then a is considered to be a maximizing action. This is called upper confidence bound (UCB).

The experiment results shows that it performs better than -greedy on the problem. However, there are several reasons why it is usually not pratical in other reinforcement learning.

  • It is more complex than -greedy method.
  • It is hard to deal with large state spaces.

In short, it is specific designed for the k-armed bandit problem.