Reinforcement learning algorithms can be divided into many families. In model-free temporal-difference methods like Q-learning/SARSA, we try to learn the action value $Q(s, a)$ for every state-action pair, either by recording (“memorizing”) exact values in a table or by learning a function that approximates them. Under $\epsilon$-greedy action selection, the action chosen at a state $s$ is therefore $\arg\max_a Q(s, a)$ most of the time, with a small constant probability $\epsilon$ of selecting a non-greedy action instead.
Another family is called policy gradient methods, which map states directly to actions. To select an action, they do not need to consult a value table or a value function. Instead, each action is selected with a probability determined by a parameterized policy function $\pi(a|s, \theta)$, where $\theta$ is the vector of policy parameters.
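As a minimal sketch of the contrast (not tied to any particular library; the environment, feature, and function names are made up for illustration), the two action-selection rules look like this:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract max for numerical stability
    return z / z.sum()

def select_action_policy_gradient(theta, s_feat, rng):
    """Sample an action from a parameterized softmax policy pi(a|s, theta)."""
    logits = theta @ s_feat                    # one logit per action
    probs = softmax(logits)
    return rng.choice(len(probs), p=probs)

def select_action_epsilon_greedy(Q, s, epsilon, rng):
    """Epsilon-greedy selection over a tabular action-value function Q[s, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(np.argmax(Q[s]))                # exploit: greedy action

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))                # 3 actions, 4 state features
a = select_action_policy_gradient(theta, rng.normal(size=4), rng)
```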
The advantages of policy gradient methods over Q-learning/SARSA with $\epsilon$-greedy action selection are mainly twofold:
- in some situations the optimal approximate policy must be stochastic. An example from [1]: in card games with imperfect information the optimal play is often to do two different things with specific probabilities, such as when bluffing in Poker. Action-value methods have no natural way of finding stochastic optimal policies.
- problems vary in the complexity of their policies and action-value functions. In some problems, the policy is a much simpler function to approximate than the action-value function, so a policy-based method will learn faster.
The general update form of policy gradient methods is $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$, where $J(\theta)$ is a performance measure as a function of the policy weights $\theta$.
Now, the policy gradient theorem can be briefly stated as follows [5]:
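In the notation of [1], where $\mu(s)$ is the on-policy state distribution and $q_\pi$ is the action-value function under the current policy:

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)
$$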
In the episodic case, the policy gradient theorem can be derived as follows (according to [1]):
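A condensed version of the derivation in [1]: start from the value of the start state $s_0$ and repeatedly unroll $\nabla v_\pi(s)$, where $\Pr(s_0 \to s, k, \pi)$ denotes the probability of reaching state $s$ from $s_0$ in $k$ steps under $\pi$:

$$
\nabla J(\theta) = \nabla v_\pi(s_0)
= \sum_s \Big( \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi) \Big) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)
$$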
The last expression is an exact formula for the gradient of $J(\theta)$. It can also be seen as an expectation of $\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)$ over the probability distribution of landing on state $s$ in $k$ steps. Please note that for any fixed $k$, $\sum_s \Pr(s_0 \to s, k, \pi) = 1$. Therefore, we can rewrite the last expression above in expectation form:
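Following [1], the action sum can then be replaced by sampling $a_t \sim \pi(\cdot|s_t,\theta)$ (the likelihood-ratio trick), and $q_\pi(s_t, a_t)$ by the sampled return $G_t$:

$$
\begin{aligned}
\nabla J(\theta)
&= \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} \sum_a \nabla_\theta \pi(a|s_t,\theta)\, q_\pi(s_t,a)\right] \\
&= \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} \frac{\nabla_\theta \pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta)}\, q_\pi(s_t,a_t)\right] \\
&= \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} \frac{\nabla_\theta \pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta)}\, G_t\right]
\end{aligned}
$$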
where $G_t = r_{t+1} + r_{t+2} + \cdots + r_T$ is the return from step $t$. $G_t$ is an unbiased estimator of $q_\pi(s_t, a_t)$, which justifies the last two steps, since we do not have an estimate of $q_\pi$ itself. We use $\mathbb{E}_{\tau \sim \pi_\theta}$ in place of $\mathbb{E}_\pi$, meaning that the sequence $\tau = (s_0, a_0, s_1, a_1, \dots)$ is generated by following the policy $\pi$ and the environment's transition probabilities. Sometimes we also write $\pi$ as $\pi_\theta$ because the policy is parameterized by $\theta$. Finally, we can write $\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}$ as $\nabla_\theta \log \pi_\theta(a_t|s_t)$. In other words,
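Putting this together in the trajectory notation just described:

$$
\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]
$$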
What we do in reality is use the collected states, actions, and rewards as samples to approximate the exact expectation $\nabla J(\theta)$:
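Concretely, for a single sampled episode this gives the REINFORCE stochastic gradient-ascent update (written without the extra $\gamma^t$ factor that [1] includes in the discounted setting):

$$
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t|s_t), \qquad t = 0, 1, \dots, T-1
$$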
This kind of policy gradient method is called REINFORCE, by Ronald J. Williams from Northeastern University. The original paper [2] is hard to read, in my opinion. It directly tells you what the update rule of $\theta$ is, by construction, and then shows that the expected update points in the same direction as the performance gradient. What I wish it had told me is how he derived the update rule of $\theta$ in the first place.
(Updated 09/18/2017: the same derivation of REINFORCE is illustrated more clearly in [10].)
(Updated 01/31/2020: the derivation of REINFORCE in continuous state/action spaces is illustrated in [15].)
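Here is a minimal sketch of that update for a softmax policy with one parameter per state-action pair (tabular logits). The environment interface (`env.reset()` and `env.step(a)` returning `(next_state, reward, done)`) is assumed for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99, rng=None):
    """Run one episode, then apply theta += alpha * G_t * grad log pi(a_t|s_t) for each step."""
    rng = rng if rng is not None else np.random.default_rng()
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta[s])                # pi(.|s, theta)
        a = rng.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    for t in reversed(range(len(rewards))):      # accumulate returns backwards
        G = rewards[t] + gamma * G
        probs = softmax(theta[states[t]])
        grad_log_pi = -probs                     # d log pi(a_t|s_t) / d theta[s_t, :]
        grad_log_pi[actions[t]] += 1.0           # ... = onehot(a_t) - pi(.|s_t)
        theta[states[t]] += alpha * G * grad_log_pi
    return theta
```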
One important extension of REINFORCE is to offset $G_t$ by a baseline $b(s_t)$, a function of the state $s_t$. Intuitively, what we need is not how good an action is in absolute terms, but how much better it is than the average action. For example, if $G_t$ is uniformly larger than zero for both good and bad actions, $\theta$ will always be updated to encourage both kinds of actions; we need to calibrate the signal so that it differentiates good actions from bad ones. Mathematically, subtracting a baseline can reduce the variance of the gradient estimator while still keeping it unbiased.
First, let’s look at why offsetting $G_t$ by $b(s_t)$ still gives an unbiased estimator (reference: [8]):
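The baseline term has zero expectation because, for any fixed state, the policy's action probabilities sum to one:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[ b(s)\, \nabla_\theta \log \pi_\theta(a|s) \big]
= \sum_a \pi_\theta(a|s)\, b(s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s)
= b(s)\, \nabla_\theta 1 = 0
$$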
To ensure $(G_t - b)\,\nabla_\theta \log \pi_\theta(a_t|s_t)$ remains an unbiased estimate of the gradient, $b$ should only be a function of $s_t$, not of $a_t$. Otherwise $b$ cannot be pulled out of the sum over actions above, and the baseline term is no longer zero.
It is less obvious that subtracting $b(s_t)$ can also reduce the variance of the gradient estimator.
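Writing $X = G_t\,\nabla_\theta \log \pi_\theta(a_t|s_t)$ and $Y = b(s_t)\,\nabla_\theta \log \pi_\theta(a_t|s_t)$ (taken one component at a time so that both are scalars), the standard variance decomposition gives:

$$
\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)
$$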
From here, we can see that if $b(s_t)\,\nabla_\theta \log \pi_\theta(a_t|s_t)$ has a large enough covariance with $G_t\,\nabla_\theta \log \pi_\theta(a_t|s_t)$ to outweigh its own variance, then the overall variance is reduced. In the unrealistic extreme where $b(s_t) = G_t$, the variance would be zero; this is impossible, though, because $b$ is a function of $s_t$ alone and has no magic forecasting ability to match $G_t$ exactly.
(Side note: I cannot follow [8]'s derivation of the variance-reduction part.)
One way is to train a predictor of the return, a state-value estimate $\hat{v}(s_t, w)$ with parameters $w$, and use it as the baseline:
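This is essentially REINFORCE with baseline as presented in [1]: the value weights $w$ are regressed toward the observed returns, and the policy update uses the offset return (per-step discount factors omitted for brevity):

$$
\begin{aligned}
\delta &\leftarrow G_t - \hat{v}(s_t, w) \\
w &\leftarrow w + \alpha_w\, \delta\, \nabla_w \hat{v}(s_t, w) \\
\theta &\leftarrow \theta + \alpha_\theta\, \delta\, \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{aligned}
$$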
From [1]: Note that REINFORCE uses the complete return from time t ($G_t$), which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed.
When we derived the REINFORCE gradient, we used the property that $\mathbb{E}[G_t \mid s_t, a_t] = q_\pi(s_t, a_t)$. However, $G_t$ can have high variance because it sums rewards from step $t$ to the end of the episode, where each reward can be seen as a random variable [13]. An alternative estimator of $q_\pi(s_t, a_t)$ with lower variance but higher bias is to use “bootstrapping”, i.e., to approximate it by the next immediate reward plus a parameterized value function evaluated at the next state, $r_{t+1} + \gamma\, \hat{v}(s_{t+1}, w)$. The one-step actor-critic algorithm is described as follows [1]:
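Below is a minimal tabular sketch of that algorithm, again assuming a hypothetical `env` with `reset()`/`step(a)`, a softmax policy with one logit per state-action pair, and a state-value table as the critic (the $I = \gamma^t$ factor from [1] is included):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def one_step_actor_critic_episode(env, theta, w, alpha_theta=0.01, alpha_w=0.1,
                                  gamma=0.99, rng=None):
    """theta: [n_states, n_actions] policy logits; w: [n_states] state-value table."""
    rng = rng if rng is not None else np.random.default_rng()
    s, done, I = env.reset(), False, 1.0
    while not done:
        probs = softmax(theta[s])
        a = rng.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)

        v_next = 0.0 if done else w[s_next]       # bootstrap target is 0 at terminal states
        delta = r + gamma * v_next - w[s]         # one-step TD error

        w[s] += alpha_w * delta                   # critic update (tabular gradient of v is 1)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * I * delta * grad_log_pi   # actor update

        I *= gamma
        s = s_next
    return theta, w
```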
REINFORCE is an on-policy algorithm because the gradient update depends on $G_t$, the return generated by following the current policy $\pi_\theta$. The specific one-step actor-critic algorithm we just described is also on-policy because the TD error $\delta$ depends on the next state $s_{t+1}$, which is the result of applying $a_t \sim \pi_\theta(\cdot|s_t)$ at the current state $s_t$. There also exist off-policy actor-critics; see [14] for an overview of on-policy and off-policy policy gradient methods.
A more recent advance in baselines is Generalized Advantage Estimation (GAE) [6]. The authors introduce two parameters, $\gamma$ and $\lambda$, in an undiscounted reward problem to help estimate the policy gradient with little introduced bias and reduced variance. (Note how discounted reward problems can be transformed into undiscounted ones, quoting [6]: “But the discounted problem (maximizing $\sum_{t=0}^{\infty} \gamma^t r_t$) can be handled as an instance of the undiscounted problem in which we absorb the discount factor into the reward function, making it time-dependent.”)
They introduce the following notation:
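Written out, following [6]:

$$
\begin{aligned}
V^{\pi,\gamma}(s_t) &= \mathbb{E}\Big[\textstyle\sum_{l=0}^{\infty} \gamma^l r_{t+l} \,\Big|\, s_t\Big], \qquad
Q^{\pi,\gamma}(s_t, a_t) = \mathbb{E}\Big[\textstyle\sum_{l=0}^{\infty} \gamma^l r_{t+l} \,\Big|\, s_t, a_t\Big] \\
A^{\pi,\gamma}(s_t, a_t) &= Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t), \qquad
\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t) \\
g^{\gamma} &= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\Big]
\end{aligned}
$$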
Note that $g^{\gamma}$ is a biased estimator of the true (undiscounted) policy gradient, but, as the authors point out, previous work has used the discount to “reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias”.
The paper’s goal is to find a good estimator of the discounted advantage $A^{\pi,\gamma}(s_t, a_t)$, which is called $\hat{A}_t$.
In other words, if $\hat{A}_t$ is $\gamma$-just, meaning $\mathbb{E}\big[\hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t|s_t)\big] = \mathbb{E}\big[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\big]$ for each $t$, then it helps to construct an unbiased estimator of $g^{\gamma}$. Equation (8) in [6] just uses the property that the expectation of a sum equals the sum of the expectations.
Now, what other properties does a $\gamma$-just $\hat{A}_t$ have? If we know more about $\gamma$-just estimators, it will be easier to find a suitable $\hat{A}_t$. The paper proposes one property (Proposition 1):
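Restated from [6] (Proposition 1), this is a sufficient condition for $\gamma$-justness: suppose $\hat{A}_t$ can be written as

$$
\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})
\qquad\text{with}\qquad
\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \,\mid\, s_t, a_t}\big[ Q_t(s_{t:\infty}, a_{t:\infty}) \big] = Q^{\pi,\gamma}(s_t, a_t)
$$

for all $(s_t, a_t)$; then $\hat{A}_t$ is $\gamma$-just.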
Sketch proof of proposition 1:
First of all, to keep the notation clear, think of $\hat{A}_t$ and $Q_t$ as functions whose input is the whole trajectory $(s_{0:\infty}, a_{0:\infty})$. Similarly, think of $b_t$ as a function whose input is the earlier part of the trajectory, $(s_{0:t}, a_{0:t-1})$.
Now, suppose $\hat{A}_t$ satisfies the condition of Proposition 1, i.e., $\hat{A}_t = Q_t - b_t$ where $\mathbb{E}\big[Q_t \mid s_t, a_t\big] = Q^{\pi,\gamma}(s_t, a_t)$ for all $(s_t, a_t)$; then:
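A condensed version of the argument in [6]: the expectation splits into a $Q_t$ part and a $b_t$ part; the $b_t$ part vanishes because $b_t$ does not depend on $a_t$ (the same reasoning as for any baseline), and the $Q_t$ part collapses to the true discounted advantage:

$$
\begin{aligned}
\mathbb{E}\big[\hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t|s_t)\big]
&= \mathbb{E}\big[Q_t \, \nabla_\theta \log \pi_\theta(a_t|s_t)\big]
 - \mathbb{E}\big[b_t \, \nabla_\theta \log \pi_\theta(a_t|s_t)\big] \\
&= \mathbb{E}_{s_t, a_t}\big[Q^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\big] - 0 \\
&= \mathbb{E}_{s_t, a_t}\big[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\big],
\end{aligned}
$$

where the last step uses $\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\big[V^{\pi,\gamma}(s_t)\,\nabla_\theta \log \pi_\theta(a_t|s_t)\big] = 0$, so subtracting $V^{\pi,\gamma}(s_t)$ from $Q^{\pi,\gamma}(s_t, a_t)$ does not change the expectation. This is exactly the condition for $\hat{A}_t$ to be $\gamma$-just.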
Next, they find a good candidate function, the TD residual $\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)$. Only when $V = V^{\pi,\gamma}$ do we know that $\delta_t^{V}$ is $\gamma$-just; otherwise $\delta_t^{V}$ is a biased estimator of $A^{\pi,\gamma}(s_t, a_t)$. However, if we take a weighted average of its different $k$-step variations (equations 11, 12, 13, 14, 15, and 16 in [6], written out below), we can get a low-bias, low-variance estimator, which is called $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$.
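Written out from [6], the $k$-step estimators and their exponentially weighted average are:

$$
\begin{aligned}
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^{V}
= -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k}) \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= (1-\lambda)\big( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots \big)
= \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^{V}
\end{aligned}
$$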
$\hat{A}_t^{(\infty)}$ is $\gamma$-just regardless of the accuracy of $V$ (again, this is because $-V(s_t)$ only acts as a baseline that depends on the state $s_t$). However, $\hat{A}_t^{(\infty)}$ is believed (I don’t know how to prove this) to have high variance due to the long sum of rewards. At the other extreme, $\hat{A}_t^{(1)} = \delta_t^{V}$ has low variance, but since we are only estimating the value function $V \approx V^{\pi,\gamma}$, it must be a biased estimator of $A^{\pi,\gamma}(s_t, a_t)$. $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$ with $0 < \lambda < 1$ makes a trade-off between variance and bias.
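A minimal sketch of computing $\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}$ for a single finished trajectory, given rewards and value estimates; it uses the backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$ (array names are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one finished trajectory.

    rewards: shape [T]     -- r_0 ... r_{T-1}
    values:  shape [T + 1] -- V(s_0) ... V(s_T), with V(s_T) = 0 if s_T is terminal
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t^V
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages

# toy usage with made-up numbers
adv = gae_advantages(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.4, 0.6, 0.0]))
```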
Update 2018-11-08
Policy gradient methods are illustrated very well in two more recent posts: [11] and [12].
References
[1] Reinforcement learning: An introduction
[2] Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
[4] Asynchronous Methods for Deep Reinforcement Learning
[5] http://www.breloff.com/DeepRL-OnlineGAE/
[6] High-Dimensional Continuous Control Using Generalized Advantage Estimation
[7] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
[9] https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/
[10] https://www.youtube.com/watch?v=kUiR0RLmGCo
[11] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic
[12] https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma
[13] Supplement material of DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
[14] Unifying On-Policy and Off-Policy Learning
[15] http://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf