I’d like to talk more about policy gradient [1], which I touched upon in 2017. In common online tutorials, the policy gradient theorem takes a lot of space to prove that the gradient of the policy, in the direction that improves accumulated returns, is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\Big],$$

where $G_t$ is the accumulated return beginning from step $t$, computed from real samples.
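To make this concrete, here is a minimal PyTorch-style sketch of the corresponding surrogate loss (my own illustration, not code from [1]; the `policy` network returning action logits and the per-step `returns` tensor are assumptions):

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient equals the on-policy policy gradient:
    E[ grad_theta log pi_theta(a_t|s_t) * G_t ].  `returns` holds G_t per step
    and is detached, so no gradient flows through it."""
    dist = torch.distributions.Categorical(logits=policy(states))  # pi_theta(.|s_t)
    log_probs = dist.log_prob(actions)                             # log pi_theta(a_t|s_t)
    return -(log_probs * returns.detach()).mean()
```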
Note that the above gradient is only for on-policy policy gradient, because the trajectories $\tau$ are sampled from $\pi_\theta$ itself. How can we derive the gradient for off-policy policy gradient, i.e., when the data is collected by a different behavior policy $\beta$? I think the simplest way to derive it is to connect importance sampling with policy gradient.
A quick introduction to importance sampling [2]: estimating a quantity under a distribution $p$ is equivalent to estimating it under a different distribution $q$ with an importance sampling weight, i.e., $\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\big[\tfrac{p(x)}{q(x)} f(x)\big]$. This comes in handy when you can’t collect data using $p$ but you can still compute $\tfrac{p(x)}{q(x)} f(x)$ when $x$ is sampled from $q$.
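A tiny numerical sanity check of this identity (my own toy example, not from [2]): estimate $\mathbb{E}_{x \sim p}[f(x)]$ using samples drawn from $q$, reweighted by $p(x)/q(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

# target p = N(1, 1), proposal q = N(0, 2)
p = lambda x: np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

x_q = rng.normal(0.0, 2.0, size=200_000)         # samples from q
is_estimate = np.mean(p(x_q) / q(x_q) * f(x_q))  # E_q[(p/q) f] ~= E_p[f]

x_p = rng.normal(1.0, 1.0, size=200_000)         # direct Monte Carlo under p
print(is_estimate, np.mean(f(x_p)))              # both close to E_p[x^2] = 2
```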
In the case of off-policy policy gradient, the objective becomes “the value function of the target policy, averaged over the state distribution of the behavior policy” (from the DPG paper [6]):

$$J_\beta(\pi_\theta) = \int_{\mathcal{S}} \rho^\beta(s)\, V^{\pi}(s)\, ds = \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^\beta(s)\, \pi_\theta(a|s)\, Q^{\pi}(s,a)\, da\, ds$$
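Spelled out (my paraphrase, with $\rho^\beta$ the state distribution under $\beta$ and $Q^{\pi}$ the target policy’s action-value function), importance sampling over actions turns the expectation under $\pi_\theta$ into one under $\beta$, which gives the off-policy policy gradient:

$$\nabla_\theta J_\beta(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\Big[\sum_a \nabla_\theta \pi_\theta(a|s)\, Q^{\pi}(s,a)\Big] = \mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\Big[\frac{\pi_\theta(a|s)}{\beta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi}(s,a)\Big]$$

where the “$\approx$” drops the term involving $\nabla_\theta Q^{\pi}$, as is standard in off-policy actor-critic.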
As we know, adding a state-dependent baseline in on-policy policy gradient reduces the variance of the gradient [4] without introducing bias. Based on my proof, adding a state-dependent baseline in off-policy policy gradient does not introduce bias either:
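A sketch of why the baseline term vanishes (my reconstruction of the argument; $b(s)$ denotes the state-dependent baseline):

$$\mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\Big[\frac{\pi_\theta(a|s)}{\beta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, b(s)\Big] = \mathbb{E}_{s \sim \rho^\beta}\Big[b(s) \sum_a \beta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\beta(a|s)}\Big] = \mathbb{E}_{s \sim \rho^\beta}\Big[b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s)\Big] = \mathbb{E}_{s \sim \rho^\beta}\big[b(s)\, \nabla_\theta 1\big] = 0$$

So subtracting $b(s)$ from the sampled return (or from $Q^{\pi}(s,a)$) changes the variance of the gradient estimate but not its expectation.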
You have probably noticed that in the on-policy policy gradient derivation we treat $G_t$ as non-differentiable with regard to $\theta$. However, $G_t$ is indeed affected by $\theta$, because it is the return of the choices made by $\pi_\theta$. I am not sure whether treating $G_t$ as differentiable would help improve performance, but there are definitely papers doing so in specific applications. One such example is seq2slate [5] (where, in their notation, actions, states, and returns are denoted by different symbols than ours):
If we adapt Eqn. 8 to the off-policy setting, we have (the first three lines would be the same):
If we follow the same rewriting trick on Eqn. 8, we get another form of the gradient in the on-policy setting:
where, in the last few equations, any term not directly acted on by $\nabla_\theta$ is treated as non-differentiable.
Again, I’d like to disclaim that I don’t really know whether we should just make the policy differentiable (policy gradient) or make both the policy and return differentiable (seq2slate Eqn. 8). I’ll report back if I have any empirical findings.
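For concreteness, here is the generic version of that second option in my own notation (an illustration, not seq2slate’s exact Eqn. 8): if the return $G_\theta(a)$ is itself differentiable in $\theta$, the product rule yields the usual score-function term plus an extra pathwise term:

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\big[G_\theta(a)\big] = \nabla_\theta \sum_a \pi_\theta(a)\, G_\theta(a) = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a)\, G_\theta(a)\big] + \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta G_\theta(a)\big]$$

Dropping the second term recovers the standard policy gradient, which treats $G_\theta(a)$ as non-differentiable.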
————————— Update 2020.7 ——————-
Another topic to consider is constrained optimization in off-policy policy gradient. I am using notation closer to my real-world usage. Assume we are optimizing a policy’s parameters $\theta$ using data collected by another behavior policy $\beta$ to maximize some reward function $r_1$ (i.e., one-step optimization), with the constraint that the policy should also achieve at least some amount $C$ of another reward type $r_2$. Letting $J_1(\theta)$ and $J_2(\theta)$ denote the off-policy (importance-weighted) estimates of the expected $r_1$ and $r_2$ under $\pi_\theta$:

$$\max_\theta \; J_1(\theta)$$
$$\text{subject to} \quad J_2(\theta) \ge C$$

Writing it in a standard optimization format, we have:

$$\min_\theta \; -J_1(\theta)$$
$$\text{subject to} \quad C - J_2(\theta) \le 0$$
Now we can solve this optimization problem using a Lagrange multiplier. Suppose there is a non-negative multiplier $\lambda \ge 0$. The augmented objective function becomes the original objective plus an additional penalty, which makes infeasible solutions sub-optimal:

$$L(\theta, \lambda) = -J_1(\theta) + \lambda\,\big(C - J_2(\theta)\big)$$
In simpler problems we can just find the saddle point where $\frac{\partial L}{\partial \theta} = 0$ and $\frac{\partial L}{\partial \lambda} = 0$. However, our problem could be at such a large scale that finding the saddle point directly is not feasible. Note that one cannot use stochastic gradient descent to find the saddle point, because saddle points are not local minima [8].
Following the idea in [7], what we can show is that if we optimize the combined objective $J_1(\theta) + \lambda J_2(\theta)$ under different $\lambda$’s, then the achieved $J_2$ would have a monotonic trend in $\lambda$:

Suppose

$$\theta_1 = \arg\max_\theta\, J_1(\theta) + \lambda_1 J_2(\theta), \qquad \theta_2 = \arg\max_\theta\, J_1(\theta) + \lambda_2 J_2(\theta)$$

For any arbitrary two $\lambda$’s with $\lambda_1 < \lambda_2$, we have, by the optimality of $\theta_1$ and $\theta_2$:

$$J_1(\theta_1) + \lambda_1 J_2(\theta_1) \ge J_1(\theta_2) + \lambda_1 J_2(\theta_2)$$
$$J_1(\theta_2) + \lambda_2 J_2(\theta_2) \ge J_1(\theta_1) + \lambda_2 J_2(\theta_1)$$

Adding both sides keeps the inequality holding:

$$\lambda_1 J_2(\theta_1) + \lambda_2 J_2(\theta_2) \ge \lambda_1 J_2(\theta_2) + \lambda_2 J_2(\theta_1)$$
$$(\lambda_1 - \lambda_2)\, J_2(\theta_1) \ge (\lambda_1 - \lambda_2)\, J_2(\theta_2)$$

Therefore,

$$J_2(\theta_1) \le J_2(\theta_2)$$

(because $\lambda_1 - \lambda_2 < 0$, dividing by it changes the direction of the inequality)
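A toy numerical check of this monotonic trend (my own example, not from [7]): take quadratic $J_1(\theta) = -(\theta - 1)^2$ and $J_2(\theta) = -(\theta - 3)^2$, so the maximizer of $J_1 + \lambda J_2$ has a closed form, and the achieved $J_2$ increases with $\lambda$.

```python
import numpy as np

J_1 = lambda t: -(t - 1.0) ** 2          # main objective
J_2 = lambda t: -(t - 3.0) ** 2          # constrained reward type

for lam in [0.0, 0.5, 1.0, 2.0, 10.0]:
    theta_star = (1.0 + 3.0 * lam) / (1.0 + lam)   # argmax of J_1 + lam * J_2
    print(f"lambda={lam:5.1f}  theta*={theta_star:.3f}  "
          f"J_1={J_1(theta_star):.3f}  J_2={J_2(theta_star):.3f}")
# J_2(theta*) is non-decreasing in lambda, matching the theorem above.
```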
This theorem immediately implies that if we optimize using normal off-policy policy gradient with a reward of the form $r_1 + \lambda r_2$, we are essentially doing constrained optimization. The larger the $\lambda$ we use, the more rigid the constraint is. While using a combined reward like $r_1 + \lambda r_2$ looks like a naive way to balance trade-offs, I never thought it could be linked to constrained optimization in this way!
The Lagrange multiplier optimization procedure above follows [9].
————————— Update 2020.7 part (2) ——————-
“Note that one cannot use stochastic gradient descent to find the saddle point because saddle points are not local minima [8].”
This sentence is actually controversial. First, from [10] Section 3, we can see that we can still use stochastic gradient updates to find the saddle point: for $\theta$ we use gradient ascent, and for $\lambda$ we use gradient descent. Second, we can convert the problem into yet another one: minimizing the magnitude of the gradient of $L$ with respect to both $\theta$ and $\lambda$ [11]. The intuition is that the closer we are to the saddle point, the smaller the magnitude of the gradients with respect to $\theta$ and $\lambda$. Therefore, we can perform stochastic gradient descent on the magnitude of the gradient.
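A minimal sketch of that alternating update (my illustration of the idea in [10], not their exact algorithm; `J_r` and `J_c` are assumed to be differentiable, batch-based estimates of the two reward types under the current policy parameters):

```python
import torch

def primal_dual_step(opt_theta, lam, lam_lr, C, J_r, J_c):
    """One alternating update on L = J_r + lam * (J_c - C):
    gradient ascent on the policy parameters, gradient descent on lam,
    with lam projected back onto lam >= 0 (hypothetical sketch).

    J_r, J_c: scalar torch tensors with gradients, estimated from a batch
              of logged (off-policy) data.
    """
    lagrangian = J_r + lam * (J_c - C)

    opt_theta.zero_grad()
    (-lagrangian).backward()          # ascent on theta == descent on -L
    opt_theta.step()

    # descent on lam: dL/dlam = J_c - C, then project onto lam >= 0
    lam = max(0.0, lam - lam_lr * (J_c.detach().item() - C))
    return lam
```

The multiplier rises whenever the constraint is violated ($J_c < C$), tightening the penalty, and decays toward zero once the constraint is satisfied.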
References
[1] https://czxttkl.com/2017/05/28/policy-gradient/
[2] https://czxttkl.com/2017/03/30/importance-sampling/
[3] On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
[5] Seq2Slate: Re-ranking and Slate Optimization with RNNs. https://arxiv.org/pdf/1810.02019.pdf
[6] Deterministic Policy Gradient Algorithms. http://proceedings.mlr.press/v32/silver14.pdf
[7] Deep Learning with Logged Bandit Feedback
[8] https://stackoverflow.com/a/12284903
[9] https://people.duke.edu/~hpgavin/cee201/LagrangeMultipliers.pdf
[10] Reward Constrained Policy Optimization
[11] https://en.wikipedia.org/wiki/Lagrange_multiplier#Example_4:_Numerical_optimization