In this post, I am sharing my understanding regarding Deterministic Policy Gradient Algorithm (DPG) [1] and its deep-learning version (DDPG) [2].
We have introduced policy gradient theorem in [3, 4]. Here, we briefly recap. The objective function of policy gradient methods is:
![]()
where
represents
,
is the stationary distribution of Markov chain for
,
, and
.
is accumulated rewards since time step
:
.
Policy gradient theorem proves that the gradient of policy parameters with regard to
is:
![]()
More specifically, the policy gradient theorem we are talking about is stochastic policy gradient theorem because at each state the policy outputs a stochastic action distribution
.
However, the policy gradient theorem implies that policy gradient must be calculated on-policy, as
means the state and action distribution is generated by following
.
If the training samples is generated by some other behavioral policy
, then the objective function becomes:
![]()
and we must rely on importance sampling to calculate the policy gradient [7,8]:
![Rendered by QuickLaTeX.com \nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a) + \pi(a|s)\nabla_\theta Q^\pi(s,a)] \newline \approx \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a)] \newline =\mathbb{E}_{\beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)}Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)]](https://czxttkl.com/wp-content/ql-cache/quicklatex.com-45583aee57264f3e0f6b1e30d1398e67_l3.png)
where ![]()
However, using importance sampling
could incur some learning difficulty problems [5]:

What DPG proposes is that we design a policy that outputs actions deterministically:
, thus
would only have integration (or sum) over state distribution and get rid of importance sampling, which makes learning potentially easier:
![]()
Writing the above formula in integration (rather than discrete sum), we get:

And the gradient of
is:
![Rendered by QuickLaTeX.com \nabla_\theta J(\theta) = \int\limits_S d^\beta(s) \nabla_\theta Q^\mu(s, \mu_\theta(s)) ds \approx \mathbb{E}_\beta [\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]](https://czxttkl.com/wp-content/ql-cache/quicklatex.com-95413f31d5976999e5fcb9526f93b1ca_l3.png)
This immediately implies that
can be updated off policy. Because we only need to sample
, then
can be calculated without knowing what action the behavior policy
took (we only need to know what the training policy would take, i.e.,
). We should also note that there is an approximation in the formula of
, because
cannot be applied chain rules easily since even you replace
with
, the
function itself still depends on
.
Updating
is one part of DPG. The other part is fitting a Q-network with parameter
on the optimal Q-function of policy governed by
. This equates to update according to mean-squared Bellman error (MSBE) [10], which is the same as in Q-Learning.
I haven’t read the DDPG paper [2] thoroughly but based on my rough understanding, it adds several tricks when dealing with large inputs (such as images). According to [9]’s summary, DDPG introduced three tricks:
- add batch normalization to normalize “every dimension across samples in one minibatch”
- add noise in the output of DPG such that the algorithm can explore
- soft update target network
References
[3] Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”: https://czxttkl.com/?p=3497
[4] Policy Gradient: https://czxttkl.com/?p=2812
[9] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ddpg
[10] https://spinningup.openai.com/en/latest/algorithms/ddpg.html#the-q-learning-side-of-ddpg