Policy Gradient

Reinforcement learning algorithms can be divided into several families. In model-free temporal-difference methods such as Q-learning and SARSA, we try to learn the action value Q(s,a) for every state-action pair, either by recording ("memorizing") exact values in a table or by learning a function that approximates them. Under an \epsilon-greedy policy, the action selected at a state is therefore argmax_a Q(s,a), except that with a small constant probability \epsilon a random action is selected instead.
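
To make the \epsilon-greedy selection rule concrete, here is a minimal sketch of my own (not from any particular library); the table shape and the value of \epsilon are arbitrary assumptions:

    import numpy as np

    def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
        """With probability 1 - epsilon take argmax_a Q[state, a]; otherwise take a uniformly random action."""
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))  # explore: any action, uniformly at random
        return int(np.argmax(Q[state]))           # exploit: the greedy action

    # toy usage: a tabular Q with 5 states and 3 actions (numbers are arbitrary)
    Q = np.zeros((5, 3))
    a = epsilon_greedy_action(Q, state=2)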

Another family, policy gradient methods, maps states to actions directly. To select actions, they do not need to consult a value table or a value function. Instead, each action is selected with a probability determined by a parameterized policy function \pi(a|s,\theta), where \theta denotes the policy parameters.
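
As a concrete example of such a parameterized policy (my own illustration, not from the original post), here is a linear-softmax policy \pi(a|s,\theta) over a feature vector \phi(s); the number of actions and features are arbitrary assumptions:

    import numpy as np

    def softmax_policy_probs(theta, phi_s):
        """pi(.|s, theta) for a linear-softmax policy with action preferences h(s, a) = theta[a] . phi(s)."""
        prefs = theta @ phi_s
        prefs = prefs - prefs.max()   # subtract the max for numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def sample_action(theta, phi_s, rng):
        p = softmax_policy_probs(theta, phi_s)
        return int(rng.choice(len(p), p=p))

    # toy usage: 3 actions, 4 state features (shapes are arbitrary)
    rng = np.random.default_rng(0)
    theta = np.zeros((3, 4))
    a = sample_action(theta, rng.normal(size=4), rng)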

Policy gradient methods have two main advantages over Q-learning/SARSA with \epsilon-greedy action selection:

  1. In some situations the optimal approximate policy must be stochastic. An example from [1]: in card games with imperfect information, the optimal play is often to do two different things with specific probabilities, such as when bluffing in poker. Action-value methods have no natural way of finding stochastic optimal policies.
  2. Problems vary in the complexity of their policies and action-value functions. In some problems the policy is a much simpler function to approximate than the action-value function, so it can be learned faster.

The general update form of policy gradient methods is \theta_{t+1} = \theta_t + \alpha \nabla \eta(\theta_t), where \eta(\theta) is a performance measure of the policy with parameters \theta.

Now, the policy gradient theorem can be briefly stated as follows [5]:

[Figure: statement of the policy gradient theorem, reproduced from [5].]

In the episodic case, the policy gradient theorem can be derived as follows (following [1]):

[Figure: derivation of the policy gradient theorem in the episodic case; the LaTeX source is given at the end of this post.]

The last expression is an exact formula for the gradient of \eta(\theta). It can also be seen as an expectation of \gamma^t \sum\limits_{a_t} \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) over the probability distribution of landing on s_t in t steps. Note that for any fixed t, \sum\limits_{s_t} \Pr(s_0 \rightarrow s_t, t, \pi)=1. Therefore, we can rewrite the last expression above in expectation form:

[Figure: the gradient rewritten in expectation form; the LaTeX source is given at the end of this post.]

where \qquad G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots. In the final step, q(s_t, a_t) is replaced by G_t, which is an unbiased estimator of q(s_t, a_t); we use it because we do not have a direct estimate of q(s_t, a_t). We write \mathbb{E}_{\pi} in place of \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t)}, meaning that the sequence S_0, A_0, S_1, A_1, \cdots is generated by following the policy \pi(a_t|s_t) and the transition probability p(s_{t+1}|s_t, a_t). Sometimes we also write \mathbb{E}_{\pi_\theta} because the policy \pi is parameterized by \theta. We can also write \mathbb{E}_{\pi} as \mathbb{E}_{s_{0:T}, a_{0:T}}. In other words, \nabla \eta(\theta) = \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t|s_t)].

In practice, we use the collected states, actions, and rewards as samples to approximate this exact expectation of \nabla \eta(\theta):

[Figure: the REINFORCE update, which approximates the expectation above with sampled returns.]

This kind of policy gradient method is called REINFORCE, due to Ronald J. Williams of Northeastern University. The original paper [2] is hard to read, in my opinion. It directly tells you what the update rule of \theta is, by construction, and then shows that the expected update points in the same direction as the performance gradient. What I wish it told me is how the update rule of \theta was derived in the first place.
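
To make the update concrete, here is a minimal numpy sketch of the episodic REINFORCE update for the linear-softmax policy sketched earlier (this is my own illustration, not Williams' original pseudocode; the learning rate, discount, and episode format are assumptions):

    import numpy as np

    def grad_log_pi(theta, phi_s, a):
        """grad_theta log pi(a|s) for the linear-softmax policy: outer(one_hot(a) - pi(.|s), phi(s))."""
        prefs = theta @ phi_s
        prefs = prefs - prefs.max()
        p = np.exp(prefs) / np.exp(prefs).sum()
        one_hot = np.zeros_like(p)
        one_hot[a] = 1.0
        return np.outer(one_hot - p, phi_s)

    def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
        """One REINFORCE update from a single episode, given as a list of (phi_s, a, r) tuples
        collected by following pi_theta; the returns G_t are computed after the episode ends."""
        T = len(episode)
        G, returns = 0.0, [0.0] * T
        for t in reversed(range(T)):                  # G_t = R_{t+1} + gamma * G_{t+1}
            G = episode[t][2] + gamma * G
            returns[t] = G
        for t, (phi_s, a, _) in enumerate(episode):
            # theta <- theta + alpha * gamma^t * G_t * grad_theta log pi(a_t | s_t)
            theta = theta + alpha * (gamma ** t) * returns[t] * grad_log_pi(theta, phi_s, a)
        return theta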

(Updated 09/18/2017: the same derivation of REINFORCE is illustrated more clearly in [10].)

(Updated 01/31/2020: the derivation of REINFORCE in continuous state/action spaces is illustrated in [15].)

One important extension of REINFORCE is to offset G_t by a baseline b(s_t), a function of s_t. Intuitively, what matters is not how good an action is in absolute terms, but how much better it is than the average action. For example, if G_t is uniformly positive for both good and bad actions, \theta will be updated to encourage both kinds of actions. We want to calibrate G_t so that it better differentiates good actions from bad ones. Mathematically, subtracting a baseline can reduce the variance of the estimate of \nabla \eta(\theta) while keeping it unbiased.

First, let's look at why offsetting G_t by b(s_t) still gives an unbiased estimator of \nabla \eta(\theta) (following [8]):

[Figure: proof that subtracting a baseline keeps the estimator unbiased; the LaTeX source is given at the end of this post.]

To ensure \mathbb{E}_{s_{0:T}, a_{0:T}} [\sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi (a_t|s_t)] is an unbiased estimate of \nabla \eta(\theta), b(s_t) must be a function of s_t only, not of a_t. Otherwise \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi (a_t | s_t)] would not be zero.

It is less obvious that adding the baseline b(s_t) can reduce Var[ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)]:

[Figure: expansion of the variance of the baseline-adjusted estimator into variance and covariance terms.]

From here, we can see that if \sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi(a_t | s_t) has a covariance with \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t) large enough to outweigh its own variance, then the overall variance is reduced. In the unrealistic extreme where b(s_t) = G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots, the variance would be zero; this is impossible in practice because b(s_t) is a function of s_t only and has no magic ability to forecast G_t exactly.
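
For reference, the identity behind this argument is just the variance of a difference of two correlated random variables; writing X = \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t) and Y = \sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi(a_t | s_t) (treated component-wise):

    \mathrm{Var}[X - Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] - 2\,\mathrm{Cov}[X, Y]

so subtracting the baseline helps whenever 2\,\mathrm{Cov}[X, Y] > \mathrm{Var}[Y].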

(Side note: I cannot follow the variance-reduction part of [8]'s derivation.)

One common choice is to train a state-value predictor \hat{v}(S_t, w) with parameters w and use it as the baseline:

[Figure: the REINFORCE-with-baseline algorithm, which uses a learned state-value function \hat{v}(S_t, w) as the baseline.]
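
A minimal sketch of this idea, assuming a linear value function \hat{v}(s, w) = w \cdot \phi(s) and reusing the grad_log_pi helper from the REINFORCE sketch above (hyperparameters are arbitrary):

    import numpy as np

    def reinforce_with_baseline_update(theta, w, episode,
                                       alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
        """One episode of REINFORCE with a learned linear baseline v_hat(s, w) = w . phi(s).
        episode: list of (phi_s, a, r) tuples; assumes grad_log_pi from the earlier sketch."""
        T = len(episode)
        G, returns = 0.0, [0.0] * T
        for t in reversed(range(T)):
            G = episode[t][2] + gamma * G
            returns[t] = G
        for t, (phi_s, a, _) in enumerate(episode):
            delta = returns[t] - w @ phi_s                 # G_t - v_hat(S_t, w)
            w = w + alpha_w * delta * phi_s                # baseline (value-function) update
            theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(theta, phi_s, a)
        return theta, w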

From [1]: Note that REINFORCE uses the complete return from time t (G_t), which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed.

When we derived \nabla \eta(\theta), we used the property that \mathbb{E}[G_t|S_t, A_t]=q_\pi(s_t, a_t). However, G_t can have high variance because it involves returns from step t to T, where each reward can be seen as a random variable [13]. An alternative estimator of q_\pi(s_t, a_t), with lower variance but higher bias, is to use "bootstrapping", i.e., a parameterized value function \hat{v}_w plus the next immediate reward: G_t \approx R + \gamma \hat{v}(S', w). The one-step actor-critic algorithm is described as follows [1]:

[Figure: the one-step actor-critic (episodic) algorithm, from [1].]
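
Here is a minimal sketch of a single online update, again with a linear critic \hat{v}(s, w) = w \cdot \phi(s) and the grad_log_pi helper from the earlier sketch (the discount accumulator I = \gamma^t is carried by the caller; hyperparameters are arbitrary assumptions):

    import numpy as np

    def actor_critic_step(theta, w, I, phi_s, a, r, phi_s_next, done,
                          alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
        """One online step of one-step actor-critic with a linear critic v_hat(s, w) = w . phi(s)."""
        v_next = 0.0 if done else w @ phi_s_next
        delta = r + gamma * v_next - w @ phi_s                                  # TD error: R + gamma*v(S') - v(S)
        w = w + alpha_w * delta * phi_s                                         # critic update
        theta = theta + alpha_theta * I * delta * grad_log_pi(theta, phi_s, a)  # actor update
        return theta, w, I * gamma                                              # I accumulates gamma^t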

REINFORCE is an on-policy algorithm because \delta=G_t - \hat{v}(S_t,w) in the gradient update depends on G_t, the return generated by following the current policy \pi_\theta. The one-step actor-critic algorithm we just described is also on-policy because \delta=R+\gamma \hat{v}(S', w) - \hat{v}(S, w) depends on the next state S', which is the result of applying \pi_\theta at the current state S. There also exist off-policy actor-critic methods; see an overview of on-policy and off-policy policy gradient methods in [14].

A more recent advance in baselines is Generalized Advantage Estimation (GAE) [6]. The authors introduce two parameters, \gamma and \lambda, in an undiscounted-reward problem to help estimate g:=\nabla_\theta \mathbb{E}[\sum_{t=0}^\infty r_t] with little introduced bias and reduced variance. (Note how a discounted-reward problem can be transformed into an undiscounted one: "But the discounted problem (maximizing \sum_{t=0}^\infty \gamma^t r_t) can be handled as an instance of the undiscounted problem in which we absorb the discount factor into the reward function, making it time-dependent.")
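
One way to read this absorption (my notation, not the paper's) is to define a time-dependent reward \tilde{r}_t and observe that the two objectives coincide:

    \tilde{r}_t := \gamma^t r_t, \qquad \sum\limits_{t=0}^\infty \gamma^t r_t = \sum\limits_{t=0}^\infty \tilde{r}_t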

They introduce the following notation:

[Figure: notation from [6], including the value functions V^{\pi,\gamma} and Q^{\pi,\gamma}, the advantage A^{\pi,\gamma}, and the discounted policy gradient g^\gamma.]

Note that g^\gamma is a biased estimator of g, but, as the authors point out, previous works have sought to "reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias".

The paper's goal is to find a good estimator of A^{\pi, \gamma}, which is denoted \hat{A}_t.

[Figure: the definition of a \gamma-just advantage estimator \hat{A}_t, and equation (8), from [6].]

In other words, if \hat{A}_t is \gamma-just, then it helps construct an unbiased estimator of g^\gamma. Equation (8) simply uses the fact that the expectation of a sum equals the sum of the expectations.

Now, what other properties does \hat{A}_t have? Knowing such properties helps us find a suitable \hat{A}_t more easily. The paper proposes one:

[Figure: Proposition 1 of [6], a sufficient condition for \hat{A}_t to be \gamma-just.]

Sketch proof of proposition 1:

First of all, to keep the notation clear, think of \hat{A}_t(s_{0:\infty}, a_{0:\infty}) and Q_t(s_{0:\infty}, a_{0:\infty}) as functions whose input is the whole trajectory s_0, a_0, s_1, a_1, \cdots. Similarly, think of b_t(s_{0:t}, a_{0:t-1}) as a function whose input is the prefix of the trajectory s_0, a_0, s_1, a_1, \cdots, s_t.

Now, suppose \hat{A}_t satisfies \hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1}), where for all (s_t, a_t), \; \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}|s_t,a_t}[Q_t(s_{0:\infty}, a_{0:\infty})] = Q^{\pi,\gamma}(s_t, a_t). Then:

[Figure: the sketch proof of Proposition 1; the LaTeX source is given at the end of this post.]

Next, they find a good candidate for \hat{A}_t: \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t). Only when V = V^{\pi, \gamma} is \delta_t^V \gamma-just; otherwise it yields a biased estimator of g^\gamma. However, if we take a weighted average of different variations of \delta (equations 11, 12, 13, 14, 15, and 16), then we can get a low-bias, low-variance estimator, which is called GAE(\gamma, \lambda).

[Figure: the definitions of \delta_t^V and of GAE(\gamma, \lambda) (equations 11-16 of [6]).]
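
For reference, the resulting estimator in [6] can be written as an exponentially weighted sum of the TD residuals (this is my transcription; please check against the paper):

    \hat{A}_t^{GAE(\gamma, \lambda)} := \sum\limits_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V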

\hat{A}_t:=GAE(\gamma, 1) is \gamma-just regardless of the accuracy of V (again, this is because \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}[V(s_t) \nabla_\theta \log \pi_\theta (a_t | s_t) ] = 0). However, GAE(\gamma, 1) is believed (I do not know how to prove this) to have high variance due to the long sum of rewards. At the other extreme, GAE(\gamma, 0) has low variance, but since V is only an estimate of the value function, GAE(\gamma, 0) must be a biased estimator of g^\gamma. GAE(\gamma, \lambda) with 0<\lambda<1 trades off bias against variance.
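
Here is a minimal sketch of how GAE(\gamma, \lambda) advantages are typically computed for a finite trajectory, using the backward recursion \hat{A}_t = \delta_t^V + \gamma \lambda \hat{A}_{t+1} (the trajectory format and the bootstrap value for the last state are my assumptions):

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
        """GAE(gamma, lambda) advantages for a trajectory of length T.
        rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T), 0 if s_T is terminal."""
        rewards = np.asarray(rewards, dtype=float)
        values = np.append(np.asarray(values, dtype=float), last_value)
        advantages = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t^V
            running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lambda*A_{t+1}
            advantages[t] = running
        return advantages

    # lambda = 1 recovers the discounted Monte Carlo advantage; lambda = 0 gives the one-step TD residual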

Update 2018-11-08

Policy gradient is illustrated better in several more recent posts: see [11] and [12].

Reference

[1] Reinforcement learning: An introduction

[2] Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

[3] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

[4] Asynchronous Methods for Deep Reinforcement Learning

[5] http://www.breloff.com/DeepRL-OnlineGAE/

[6] High-Dimensional Continuous Control Using Generalized Advantage Estimation

[7] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

[8] https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/

[9] https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/

[10] https://www.youtube.com/watch?v=kUiR0RLmGCo

[11] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic

[12] https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma

[13] Supplement material of DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills

[14] Unifying On-Policy and Off-Policy Learning

[15] http://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf

latex for policy gradient theorem:

    \begin{align*} \nabla \eta(\theta) &= \nabla_\theta v_{\pi} (s_0) \quad \quad \text{performance measure is the value of starting state} \\ &= \nabla_\theta \big[ \sum\limits_{a_0} \pi(a_0|s_0) q(s_0,a_0) \big] \\ &=\sum\limits_{a_0} \big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta q(s_0, a_0) \big]  \quad \quad \text{derivative product rule} \\  &= \sum\limits_{a_0} \Big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta \big[ \sum\limits_{s_1,r_0} p(s_1, r_0 |s_0,a_0)(r_0 + \gamma v_\pi(s_1)) \big] \Big] \\ &= \sum\limits_{a_0} \big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \nabla_\theta v_{\pi}(s_1) \big] \qquad r \text{ has nothing to do with regard to } \theta \\  & \qquad \text{up till now, we have a recursion:} \\ & \qquad \nabla_\theta v_\pi(s_t)= \sum\limits_{a_t} \Big[ \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) + \pi(a_t|s_t) \big[ \sum\limits_{s_{t+1}} \gamma p(s_{t+1}|s_t,a_t) \nabla_\theta v_\pi(s_{t+1}) \big] \Big]  \\ &=\sum\limits_{a_0} \Big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \\ & \qquad \qquad  \sum\limits_{a_1} \big[ \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) + \pi(a_1 | s_1)\sum\limits_{s_2} \gamma p(s_2|s_1, a_1) \nabla_\theta v_{\pi} (s_2) \big] \Big]  \\ &=\sum\limits_{a_0} \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) \\ & \qquad +  \gamma \sum\limits_{s_1} \sum\limits_{a_0} p(s_1| s_0, a_0) \pi(a_0 | s_0)  \sum\limits_{a_1}  \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) \\ & \qquad + \gamma^2 \sum\limits_{s_2} \sum\limits_{a_1} \sum\limits_{s_1} \sum\limits_{a_0} p(s_2|s_1, a_1) \pi(a_1 | s_1) p(s_1| s_0, a_0) \pi(a_0 | s_0) \nabla_\theta v_{\pi} (s_2)  \\ &= \cdots \qquad \text{(keep unrolling using the recursion)}\\ &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t)  \qquad  \Pr(s_0 \rightarrow s_t, t, \pi) \text{ is the prob. of } s_0 \text{ to } s_t \text{ in } t \text{ steps}  \end{align*}

latex for expectation form rewritten:

    \begin{align*} \nabla \eta(\theta) &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)}[\sum\limits_{a_t} \gamma^t \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) ] \\  &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)} [\sum\limits_{a_t} \gamma^t \pi(a_t | s_t) q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)} ] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)}] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)]  \qquad \nabla \log(x) = \frac{\nabla x}{x} \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)] \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \end{align*}

latex for why adding a baseline still gives an unbiased estimator:

    \begin{align*} &\mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi(a_t | s_t)] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \mathbb{E}_{s_{0:T}, a_{0:T}}[ \sum\limits_{t=0}^T \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{0:T}, a_{0:T}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ]  \qquad \text{exp. of sum equals to sum of exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{t}, a_{t}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \qquad \text{remove irrelevant variables in each exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t, a_t)  \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t)  \qquad \text{expectation form} \rightarrow \text{discrete sum} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t) \pi(a_t|s_t)  \gamma^t  b(s_t) \frac{\nabla_\theta \pi (a_t | s_t)}{\pi(a_t | s_t) }  \qquad \text{rule of probability} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \sum\limits_{a_{t}} \nabla_\theta \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta \sum\limits_{a_{t}} \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta 1  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \\ =& \nabla \eta(\theta)  \end{align*}

latex for sketch proof of proposition 1:

    \begin{align*} &\mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [\hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t) ] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [(Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1})) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[Q_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{we will first work on the former part} \downarrow \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}} [Q_t(s_{0:\infty}, a_{0:\infty})] ] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{since } Q^{\pi, \gamma}(s_t, a_t) \text{ is a function of input only } s_t \text{ and } a_t \text{, we can change } \mathbb{E}_{s_{0:t} \atop a_{0:t}} \text{ to } \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[V^{\pi, \gamma}(s_t) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad V^{\pi, \gamma}(s_t) \text{ is an instance of } b_t(s_{0:t}, a_{0:t-1})  \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [A^{\pi, \gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t | s_t) ] \end{align*}
