Fundamentals of Reinforcement Learning: Week 1 Notes
First of all, let us fix the notation used throughout.
| Symbol | Meaning |
| --- | --- |
| $k$ | number of actions (arms) |
| $t$ | discrete time step or play number |
| $q_*(a)$ | true value (expected reward) of action $a$ |
| $Q_t(a)$ | estimate at time $t$ of $q_*(a)$ |
| $N_t(a)$ | number of times action $a$ has been selected prior to time $t$ |
| $H_t(a)$ | learned preference for selecting action $a$ at time $t$ |
| $\pi_t(a)$ | probability of selecting action $a$ at time $t$ |
| $\bar{R}_t$ | estimate at time $t$ of the expected reward |
Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, or “one-armed bandit”, except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers.
In short, we need to estimate the following quantity, the expected reward of each action:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a].$$
To find the best action, we must balance two kinds of behavior: exploitation and exploration.
- Exploitation means selecting the greedy action, the one with the highest estimated value, so that you gain high immediate reward.
- Exploration means selecting a nongreedy action, which may improve your estimate of its value.
The sample-average method is the simplest and most straightforward one. We estimate the value of action $a$ as the average of the rewards received when $a$ was selected:
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}},$$
and the greedy rule selects $A_t = \arg\max_a Q_t(a)$. For equal-valued actions, ties are broken randomly.
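As a small sketch (function and variable names are my own, not from the notes), the sample-average estimate and greedy selection with random tie-breaking can be written as:

```python
import random

def greedy_action(rewards_per_action):
    """Pick the action with the highest sample-average estimate.

    rewards_per_action: list where entry a is the list of rewards
    observed so far for action a. Never-selected actions default to 0.
    Ties are broken uniformly at random.
    """
    estimates = [sum(rs) / len(rs) if rs else 0.0 for rs in rewards_per_action]
    best = max(estimates)
    candidates = [a for a, q in enumerate(estimates) if q == best]
    return random.choice(candidates)

# Example: action 1 has the highest average reward (1.5), so it is chosen.
history = [[1.0, 0.0], [2.0, 1.0], []]
print(greedy_action(history))  # -> 1
```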
- At each step, with a small probability $\varepsilon$ we explore, selecting an action uniformly at random among all actions. With probability $1 - \varepsilon$, we exploit as usual.
In short, experiments on the 10-armed testbed show that $\varepsilon$-greedy methods eventually outperform the purely greedy method, because they keep exploring and improving their value estimates.
We could store the sum of rewards and the count for each action, but there is a cheaper incremental implementation. We only need to store $Q_n$ and $n$, and update
$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right].$$
To generalize it, the form is:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right].$$
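A minimal check (names are illustrative) that the incremental rule reproduces the ordinary sample average while storing only the current estimate and the count:

```python
def incremental_mean(rewards):
    """Update the estimate one reward at a time, storing only q and n."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        q += (r - q) / n
    return q

rewards = [1.0, 3.0, 2.0, 6.0]
print(incremental_mean(rewards))    # -> 3.0
print(sum(rewards) / len(rewards))  # -> 3.0, the batch average agrees
```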
Finally, a simple bandit algorithm is the following:
- Initialize, for a = 1 to k: Q(a) ← 0, N(a) ← 0
- Loop forever:
  - A ← argmax_a Q(a) with probability 1 − ε (breaking ties randomly), or a random action with probability ε
  - R ← bandit(A)
  - N(A) ← N(A) + 1
  - Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
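A runnable sketch of this simple bandit algorithm (the Gaussian testbed and the parameter values are assumptions for illustration, not from the notes):

```python
import random

def run_bandit(q_true, epsilon=0.1, steps=1000, seed=0):
    """epsilon-greedy agent on a stationary k-armed Gaussian bandit.

    q_true: list of true action values; rewards are q_true[a] + N(0, 1).
    Returns the final value estimates Q.
    """
    rng = random.Random(seed)
    k = len(q_true)
    Q = [0.0] * k  # value estimates
    N = [0] * k    # action counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)  # explore: uniform random action
        else:
            best = max(Q)         # exploit: greedy with random tie-breaking
            a = rng.choice([i for i, q in enumerate(Q) if q == best])
        r = rng.gauss(q_true[a], 1.0)  # sample a noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
    return Q

Q = run_bandit([0.2, -0.8, 1.5], epsilon=0.1, steps=2000)
print(Q)  # the estimates should approach the true values [0.2, -0.8, 1.5]
```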
In reinforcement learning, a common situation is nonstationarity, where we want to give more weight to recent rewards than to long-past rewards. When we assign a constant step-size parameter $\alpha \in (0, 1]$, the update $Q_{n+1} = Q_n + \alpha \left[R_n - Q_n\right]$ expands to
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i.$$
We call it a weighted average because the sum of the weights is $(1-\alpha)^n + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} = 1$; this is sometimes called an exponential recency-weighted average.
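A quick numerical check (names and values are illustrative) that the constant-$\alpha$ update equals the exponential recency-weighted form, and that the weights sum to 1:

```python
def constant_alpha(q1, rewards, alpha):
    """Apply Q_{n+1} = Q_n + alpha * (R_n - Q_n) for each reward in turn."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q

def weighted_form(q1, rewards, alpha):
    """Closed form: (1-a)^n * Q_1 + sum_i a*(1-a)^(n-i) * R_i."""
    n = len(rewards)
    q = (1 - alpha) ** n * q1
    for i, r in enumerate(rewards, start=1):
        q += alpha * (1 - alpha) ** (n - i) * r
    return q

q1, rewards, alpha = 0.0, [1.0, 2.0, 0.5, 3.0], 0.25
print(constant_alpha(q1, rewards, alpha))  # same value from both forms
print(weighted_form(q1, rewards, alpha))

n = len(rewards)
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
print(sum(weights))  # -> 1.0
```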
A classical result from stochastic approximation theory gives conditions on the step sizes $\alpha_n(a)$ that guarantee convergence with probability 1:
$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty.$$
The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.
The sample-average step size, $\alpha_n(a) = \frac{1}{n}$, meets both conditions. If $\alpha_n(a) = \alpha$, a constant, it does not meet the second condition, so the estimates never completely converge but continue to vary in response to the most recently received rewards, which is actually desirable in a nonstationary environment.
- Sequences of step-size parameters that meet both conditions often converge very slowly. They are common in theoretical work but seldom used in applications and empirical research.
In the formulas above, the initial estimates $Q_1(a)$ are a free parameter. Setting them optimistically (for example, $+5$ when rewards are near 0) encourages exploration: whichever action is selected, its reward falls below the initial estimate, so the estimate drops and the learner switches to other actions, trying every action several times before the estimates settle.
This technique is quite effective on stationary problems, but not on nonstationary problems. The reason is that its drive for exploration is inherently temporary: it only explores at the beginning. If the task changes, creating a renewed need for exploration, this method cannot help. Also, in practice we often do not know the maximal reward, so choosing a suitably optimistic initial value is itself a guess.
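A sketch of optimistic initialization (the testbed, $Q_1 = +5$, and the constant step size are assumptions for illustration): a purely greedy agent still explores early on, because every reward it receives is disappointing relative to the optimistic estimate.

```python
import random

def greedy_optimistic(q_true, q1=5.0, alpha=0.1, steps=500, seed=1):
    """Purely greedy agent with optimistic initial values, constant step size."""
    rng = random.Random(seed)
    k = len(q_true)
    Q = [q1] * k       # all estimates start optimistically high
    tried = set()      # which arms were ever selected
    for _ in range(steps):
        best = max(Q)
        a = rng.choice([i for i, q in enumerate(Q) if q == best])
        tried.add(a)
        r = rng.gauss(q_true[a], 1.0)
        Q[a] += alpha * (r - Q[a])  # reward < q1, so the estimate drops
    return Q, tried

Q, tried = greedy_optimistic([0.2, -0.8, 1.5])
print(sorted(tried))  # with these settings, all three arms get tried early on
```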
The formulas so far all concern how to obtain a better estimate $Q_t(a)$. Upper-Confidence-Bound (UCB) action selection instead changes how actions are chosen, taking into account the uncertainty in each estimate:
$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right].$$
Experiments on the 10-armed testbed show that it often performs better than $\varepsilon$-greedy. However, it has drawbacks:
- It is more complex than $\varepsilon$-greedy.
- It is hard to extend beyond bandits, for example to nonstationary problems and large state spaces.
In short, it is specifically designed for the k-armed bandit problem.
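A sketch of UCB action selection following the formula above ($c = 2$ and the Gaussian testbed are assumptions, not from the notes):

```python
import math
import random

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].

    Actions with N(a) == 0 are treated as maximizing and tried first.
    """
    untried = [a for a, n in enumerate(N) if n == 0]
    if untried:
        return untried[0]
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(Q, N)]
    return scores.index(max(scores))

def run_ucb(q_true, steps=2000, seed=0):
    """UCB agent with sample-average estimates on a Gaussian bandit."""
    rng = random.Random(seed)
    k = len(q_true)
    Q, N = [0.0] * k, [0] * k
    for t in range(1, steps + 1):
        a = ucb_action(Q, N, t)
        r = rng.gauss(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q, N

Q, N = run_ucb([0.2, -0.8, 1.5])
print(N)  # the best arm (index 2) should receive most of the pulls
```

Note how the bonus term shrinks as $N_t(a)$ grows, so rarely tried actions keep getting revisited until their uncertainty is resolved.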