Fundamentals of Reinforcement Learning Week1 Notes
First of all, there are some notations need to be determined.
Notation  Explanation 

number of actions (arms)  
discrete time step or play number  
true value (expected reward) of action 

estimate at time 

number of times action 

learned preference for selecting action 

probability of selecting action 

estimate at time 
Definition
Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
This is the original form of the karmed bandit problem, so named by analogy to a slot machine, or “onearmed bandit”, except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers.
In short, we need to calculate the following formula:
To explore the best result, there are two types of actions, exploitation and exploration.
 Exploitation means selecting the greedy action so that you can gain high reward.
 Exploration means selecting the nongreedy action in which you may improve your estimate value.
Method
Actionvalue Method
This is the most simple and straight one. We calculate value of action
When
For those equal valued
 For each action, with a small probability
, we select to explore, to randomly select an action among all actions. With probability , we do exploitation normally.
Say short, the test result shows that
Incremental Implementation
At time
We can store the sum of
We only need to store
To generalize it, the form is:
The
Finally, a simple bandit algorithm is as the following:
 Initialize, for a = 1 to k:
 Loop:
Tracking a Nonstationary Problem
In reinforcement learning, a common situation is that we give more weight to recent rewards than to longpast rewards. When we assign
We call it a weighted average because the sum of the weights is
Let’s use
The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.
Note:
 Sampleaverage,
, meets all conditions. If , a constant, it will not meet the second condition and will never completely converge but continue to vary in response to the most recently received rewards.  The sequence of stepsize parameters that meets formular
conditions are often very slow. It is often used in theoretical work but not in applications and empirical research.
Optimization
Optimistic Initial Values
In above formula
The technique is quite effective on stationary problems, but not on nonstationary problems. The reason is that it always start from the beginning. If the task changes, it needs to learn from the start and do exploration. Also, we don’t know the maximal reward.
UpperConfidenceBound Action Selection
In the former formulas, we all discuss about how to get a better
The experiment results shows that it performs better than
 It is more complex than
greedy method.  It is hard to deal with large state spaces.
In short, it is specific designed for the karmed bandit problem.