Fundamentals of Reinforcement Learning Week 3 Notes

Policies and Value Functions

Policy

A policy is a mapping from states to actions. There are two kinds of policies:

Deterministic Policy

It can be written in the form $\pi(s) = a$. Each state maps to exactly one action, but not every action needs to be selected by some state.

Stochastic Policy

It can be formalized as $\pi(a \mid s)$, the probability of taking action $a$ in state $s$. Since it is a probability distribution, it must satisfy the following rules:

$$\pi(a \mid s) \ge 0 \quad \text{and} \quad \sum_{a} \pi(a \mid s) = 1 \quad \text{for every state } s.$$
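As a minimal sketch (the states, actions, and probability values are hypothetical, chosen only for illustration), a stochastic policy over a finite MDP can be stored as a table and checked against the two distribution rules:

```python
import numpy as np

# A toy stochastic policy pi(a|s) as a table: rows are states, columns are
# actions. The entries are hypothetical probabilities, just for illustration.
pi = np.array([
    [0.5, 0.5, 0.0],   # state 0: equal chance of actions 0 and 1
    [0.1, 0.2, 0.7],   # state 1
    [1.0, 0.0, 0.0],   # state 2: effectively deterministic
])

# Check the distribution rules: non-negative, and each row sums to 1.
assert (pi >= 0).all()
assert np.allclose(pi.sum(axis=1), 1.0)

# Acting under the policy means sampling an action from pi(.|s):
rng = np.random.default_rng(0)
action = rng.choice(3, p=pi[1])
```

A deterministic policy is the special case where each row puts probability 1 on a single action, as in state 2 above.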

Notes:

A policy depends only on the current state.

This is a restriction on the state representation rather than on the agent: the state must contain all the information the agent needs to select its action.

Value Function

A value function is designed to evaluate the expected future return from a given state. It aggregates many possible future returns into a single number.

State-value function

The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

The value of any terminal state is 0. We call the function $v_\pi$ the state-value function for policy $\pi$.

Action-value function

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

We call the function $q_\pi$ the action-value function for policy $\pi$.
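A standard identity (not stated explicitly above, but it follows directly from the two definitions) connects the state-value and action-value functions: the value of a state is the policy-weighted average of the values of the actions available in it,

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a).$$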

Value functions are crucial in reinforcement learning. They allow an agent to query the quality of its current situation without waiting to observe the long-term outcome. The benefit is twofold:

The value estimate is available immediately, whereas the actual return is not observed until much later.

The return may be random due to stochasticity in both the policy and the environment dynamics, and the value function summarizes all of these possibilities in a single expected value.

Bellman Equations

The Bellman equations formalize the connection between the value of a state (or state-action pair) and the values of its possible successor states.

State-value Bellman Equation

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]$$

Action-value Bellman Equation

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right]$$

Summary

The Bellman equations give us a way to solve for the value function by writing one linear equation per state and solving the resulting system.

This direct approach is only practical on small MDPs, since the system has one equation and one unknown per state.
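The linear-system view can be made concrete. In matrix form the state-value Bellman equation reads $v_\pi = r_\pi + \gamma P_\pi v_\pi$, so $(I - \gamma P_\pi)\, v_\pi = r_\pi$. A minimal sketch on a made-up 2-state MDP (the transition probabilities and rewards below are hypothetical, chosen only for illustration):

```python
import numpy as np

gamma = 0.9

# Dynamics already marginalized over a fixed policy pi (hypothetical numbers):
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
# r_pi[s] = expected one-step reward in state s under pi
r_pi = np.array([1.0, 0.0])

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
# => (I - gamma * P_pi) v = r_pi, one linear equation per state.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: v_pi satisfies the Bellman equation.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
```

For $n$ states this is an $n \times n$ solve, which is why the method only scales to small MDPs.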

Optimal Policies and Optimal Value Functions

An optimal policy is defined as a policy whose value function is at least as high as that of every other policy in all states. At least one optimal policy always exists; it can be constructed by combining the best parts of multiple policies. Due to the exponential number of possible policies, brute-force search is not feasible.

We always use $v_*$ and $q_*$ to denote the optimal state-value function and the optimal action-value function, respectively.

Bellman Optimality Equation for $v_*$

According to the Bellman equation applied to an optimal policy $\pi_*$,

$$v_*(s) = \sum_{a} \pi_*(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_*(s')\right].$$

Since an optimal policy can take an optimal action in every state, $\pi_*(a \mid s)$ can be 1 for an action that achieves the highest value and 0 for all others, which means it becomes a deterministic optimal policy. The equation then becomes:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_*(s')\right]$$

Bellman Optimality Equation for $q_*$

Similarly, we can do the same for $q_*$:

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma \max_{a'} q_*(s', a')\right]$$
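Although the notes only derive the equations here, the Bellman optimality equation for $v_*$ can also be applied as a fixed-point update (this is the value iteration idea, sketched below on a made-up 2-state, 2-action MDP with hypothetical transition probabilities and rewards):

```python
import numpy as np

gamma = 0.9
# p[s, a, s'] = transition probability, r[s, a] = expected reward (hypothetical).
p = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality update: v(s) <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]
    v_new = (r + gamma * p @ v).max(axis=1)
    if np.abs(v_new - v).max() < 1e-10:
        break
    v = v_new

# A deterministic optimal policy acts greedily with respect to v_*.
pi_star = (r + gamma * p @ v).argmax(axis=1)
```

The update contracts toward the unique fixed point $v_*$, which is why iterating it converges even though the $\max$ makes the equations nonlinear.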

Notes:

Since $v_*$ and $q_*$ already take into account the reward consequences of all possible future behavior, acting greedily with respect to them ultimately leads to the optimal result.

We cannot simply use a linear system solver to find the optimal value function, because the $\max$ operator makes the Bellman optimality equations nonlinear.