Fundamentals of Reinforcement Learning Week 3 Notes
A policy is a mapping from states to actions. There are two kinds of policies (a small representation sketch follows the list below):
- A deterministic policy selects a single action in each state. It can be represented in the form $\pi(s) = a$.
- A stochastic policy assigns a probability to each action in each state. It can be formalized as $\pi(a \mid s) \doteq \Pr(A_t = a \mid S_t = s)$, where $\sum_a \pi(a \mid s) = 1$ and $\pi(a \mid s) \geq 0$.
- A policy depends only on the current state.
- This is a restriction on the state, which must encode everything the agent needs in order to choose its action, rather than a restriction on the agent.
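As a minimal sketch of these two representations (the 3-state, 2-action setup and all of its numbers are made up for illustration), a deterministic policy can be stored as one action index per state and a stochastic policy as a row-stochastic matrix of action probabilities:

```python
import numpy as np

# Hypothetical setup: 3 states, 2 actions (numbers are illustrative only).
n_states, n_actions = 3, 2

# Deterministic policy: pi(s) = a, stored as one action index per state.
pi_det = np.array([0, 1, 0])

# Stochastic policy: pi(a|s), one probability distribution per state.
# Each row must be non-negative and sum to 1.
pi_stoch = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
assert np.allclose(pi_stoch.sum(axis=1), 1.0)

# Acting: the deterministic policy picks its action directly,
# the stochastic policy samples an action from its distribution.
rng = np.random.default_rng(0)
s = 1
a_det = pi_det[s]
a_stoch = rng.choice(n_actions, p=pi_stoch[s])
```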
A value function evaluates the expected future return from a given state. It aggregates many possible future returns into a single number.
The value function of a state $s$ under a policy $\pi$ is the expected return when starting in $s$ and following $\pi$ thereafter:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$

The value of any terminal state is 0. We call the function $v_\pi$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action $a$ in state $s$ and following $\pi$ thereafter:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

We call the function $q_\pi$ the action-value function for policy $\pi$.
Value functions are crucial in reinforcement learning. They allow an agent to query the quality of its current situation without waiting to observe the long-term outcome. The benefit is twofold:
- The value is available immediately, whereas the actual return is only known after the long-term outcome unfolds.
- The return may be random due to stochasticity in both the policy and the environment dynamics, while the value summarizes all possible futures in a single expectation (see the Monte Carlo sketch below).
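To make the "expected return" reading concrete, here is a small Monte Carlo sketch on a hypothetical policy-induced chain MDP (all transition probabilities and rewards are made up): the return of any single episode is random, but averaging many sampled returns from a state approximates $v_\pi(s)$.

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# Made-up policy-induced dynamics of a 3-state chain:
# P_pi[s, s'] is the transition probability and r_pi[s] the expected
# immediate reward under the policy; state 2 is terminal.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([1.0, 2.0, 0.0])
TERMINAL = 2

def sample_return(s):
    """Roll out one episode from s and accumulate the discounted return."""
    g, discount = 0.0, 1.0
    while s != TERMINAL:
        g += discount * r_pi[s]
        discount *= gamma
        s = rng.choice(3, p=P_pi[s])
    return g

# Averaging many random returns approximates v_pi(0)
# (about 4.79 for these made-up numbers).
print(np.mean([sample_return(0) for _ in range(10_000)]))
```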
The Bellman equation formalizes the connection between the value of a state and the values of its possible successor states:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$

An analogous equation holds for action values:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$$
- The Bellman equations give us a way to solve for the value function by writing one linear equation per state (see the sketch after this list).
- This direct approach can only be used on small MDPs, since the system has as many equations as there are states.
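A minimal sketch of that idea, reusing the made-up chain from the Monte Carlo example: under a fixed policy the Bellman equations form the linear system $(I - \gamma P_\pi)\, v = r_\pi$, one row per state, so a standard solver recovers $v_\pi$ exactly.

```python
import numpy as np

gamma = 0.9

# Same made-up policy-induced chain as above; state 2 is terminal.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([1.0, 2.0, 0.0])

# Bellman: v = r_pi + gamma * P_pi @ v  <=>  (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)  # matches the Monte Carlo estimate; v[2] = 0 at the terminal state
```

A dense solve like this costs roughly cubic time in the number of states, which is one way to see why the direct approach only scales to small MDPs.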
An optimal policy is defined as a policy whose value is at least as high as that of every other policy in all states. At least one optimal policy always exists, and one can be constructed by combining the best parts of multiple policies: in each state, follow whichever policy has the higher value there. Due to the exponential number of possible policies ($|\mathcal{A}|^{|\mathcal{S}|}$ deterministic ones), brute-force search is infeasible in general.
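As a sketch of why this matters, the following enumerates all $|\mathcal{A}|^{|\mathcal{S}|} = 2^3 = 8$ deterministic policies of a made-up 3-state, 2-action MDP and evaluates each one exactly. This only works because the MDP is tiny; the policy count grows exponentially with the number of states.

```python
import itertools
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# Made-up dynamics p[s, a, s'] and expected rewards r[s, a];
# state 2 is absorbing with zero reward.
p = np.zeros((n_states, n_actions, n_states))
p[0, 0] = [0.5, 0.5, 0.0]
p[0, 1] = [0.0, 1.0, 0.0]
p[1, 0] = [0.0, 0.5, 0.5]
p[1, 1] = [0.0, 0.0, 1.0]
p[2, :] = [0.0, 0.0, 1.0]
r = np.array([[1.0, 0.0],
              [2.0, 5.0],
              [0.0, 0.0]])

def evaluate(policy):
    """Exactly evaluate a deterministic policy via the linear Bellman system."""
    P_pi = p[np.arange(n_states), policy]   # policy-induced transition matrix
    r_pi = r[np.arange(n_states), policy]   # policy-induced expected rewards
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# An optimal policy dominates in every state, so it also maximizes the sum.
best = max(itertools.product(range(n_actions), repeat=n_states),
           key=lambda pol: evaluate(np.array(pol)).sum())
print(best, evaluate(np.array(best)))  # (0, 1, 0) for these made-up numbers
```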
We use $v_*$ and $q_*$ to denote the value functions of an optimal policy:

$$v_*(s) \doteq \max_\pi v_\pi(s), \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a)$$

According to the formula $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$, and since the optimal policy can be made of the optimal action in every state, the sum over the policy's actions becomes a max. This yields the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$

Similarly, for $q_*$:

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$

Because $v_*$ and $q_*$ already take into account the reward consequences of all possible future behavior, acting greedily with respect to them leads to the optimal result in the long run.
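A small numeric check of these equations (same made-up MDP as in the brute-force sketch): applying the one-step max backup to $v_*$ returns $v_*$ itself, i.e. $v_*(s) = \max_a q_*(s, a)$ in every state.

```python
import numpy as np

gamma = 0.9

# Same made-up 3-state, 2-action MDP as in the brute-force sketch.
p = np.zeros((3, 2, 3))
p[0, 0] = [0.5, 0.5, 0.0]
p[0, 1] = [0.0, 1.0, 0.0]
p[1, 0] = [0.0, 0.5, 0.5]
p[1, 1] = [0.0, 0.0, 1.0]
p[2, :] = [0.0, 0.0, 1.0]
r = np.array([[1.0, 0.0], [2.0, 5.0], [0.0, 0.0]])

# v_* obtained by evaluating the optimal policy (0, 1, 0) found above.
pi_star = np.array([0, 1, 0])
P_pi = p[np.arange(3), pi_star]
r_pi = r[np.arange(3), pi_star]
v_star = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# q_*(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v_*(s')
q_star = r + gamma * np.einsum('sat,t->sa', p, v_star)

# Bellman optimality: v_*(s) equals the max over actions of q_*(s, a).
assert np.allclose(v_star, q_star.max(axis=1))
print(v_star, q_star.max(axis=1))
```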
- We cannot simply use a linear system solver to find the optimal value function: the Bellman equation for $v_{\pi_*}$ requires the optimal policy $\pi_*$, which is exactly the thing we would like to solve for, and the Bellman optimality equation is nonlinear because of the $\max$ over actions.

If we have $q_*$, an optimal action is simply $\arg\max_a q_*(s, a)$, with no model required. If we only have $v_*$, we can still recover an optimal policy, but it takes a one-step lookahead through the dynamics $p$:

$$\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$
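A closing sketch of the two extraction routes, continuing the same made-up MDP: with $q_*$ a plain argmax suffices, while with only $v_*$ the argmax needs a one-step lookahead through the model $p$.

```python
import numpy as np

gamma = 0.9

# Same made-up MDP; v_star as computed in the previous sketch.
p = np.zeros((3, 2, 3))
p[0, 0] = [0.5, 0.5, 0.0]
p[0, 1] = [0.0, 1.0, 0.0]
p[1, 0] = [0.0, 0.5, 0.5]
p[1, 1] = [0.0, 0.0, 1.0]
p[2, :] = [0.0, 0.0, 1.0]
r = np.array([[1.0, 0.0], [2.0, 5.0], [0.0, 0.0]])
v_star = np.array([5.9091, 5.0, 0.0])

# From v_*: a one-step lookahead, which requires the dynamics model p.
lookahead = r + gamma * np.einsum('sat,t->sa', p, v_star)
pi_from_v = np.argmax(lookahead, axis=1)

# From q_*: a model-free argmax over stored action values
# (mathematically, q_* coincides with the one-step lookahead on v_*).
q_star = lookahead
pi_from_q = np.argmax(q_star, axis=1)

print(pi_from_v, pi_from_q)  # both recover the optimal policy [0 1 0]
```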