Convergence of Q-learning and SARSA

Here, I am listing some classic proofs regarding the convergence of Q-learning and SARSA in finite MDPs (by definition, in a finite Markov Decision Process the sets of states, actions, and rewards are finite [1]).

The very first Q-learning convergence proof comes from [4]. The proof is based on a very useful theorem:


Note that this theorem is general enough to be applied in multi-dimensional spaces (x can be multi-dimensional).

We can apply Theorem 1 to Q-learning whose update rule is as follows:
Q_{t+1}(s_t, a_t) = (1-\alpha_t(s_t, a_t))Q_t(s_t, a_t) + \alpha_t(s_t, a_t)[r_t +\gamma \max_{b \in \mathcal{A}}Q_t(s_{t+1}, b)]

Subtracting Q^*(s_t, a_t) from both sides and letting \triangle_t(s,a)=Q_t(s,a)-Q^*(s,a), we have:
\triangle_{t+1}(s_t,a_t)\newline=(1-\alpha_t(s_t, a_t))\triangle_t(s_t, a_t) + \alpha_t(s_t,a_t)[r_t + \gamma\max_{b\in \mathcal{A}}Q_t(s_{t+1}, b) - Q^*(s_t, a_t)]\newline=(1-\alpha_t(s_t, a_t))\triangle_t(s_t, a_t) + \alpha_t(s_t, a_t)F_t(s_t,a_t)

We can also let \alpha_t(s,a), the learning rate, be zero for all (s,a) \neq (s_t, a_t). At this point, we have modeled the Q-learning process exactly as the random iterative process stated in Theorem 1. Hence, as long as we can show the 4 assumptions hold, we can prove that Q-learning converges to the optimal Q-values (i.e., \triangle_t(s,a) converges to 0).
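To make the setup concrete, here is a minimal tabular sketch of the update rule on a made-up 2-state, 2-action MDP (the transition matrix P, the rewards R, and the \alpha_t(s_t,a_t)=1/\text{visit count} schedule are all my own illustrative choices, not from [4]); the learned Q is compared against Q^* obtained by value iteration on the known model:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP. alpha_t(s_t, a_t) = 1 / visit_count
# satisfies Assumption 2, and epsilon-greedy exploration keeps every
# (s, a) pair visited infinitely often.
np.random.seed(0)
n_states, n_actions, gamma, eps = 2, 2, 0.5, 0.3
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])       # deterministic r(s, a)

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
s = 0
for t in range(200000):
    a = np.random.randint(n_actions) if np.random.rand() < eps else Q[s].argmax()
    s_next = np.random.choice(n_states, p=P[s, a])
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]               # alpha_t(s_t, a_t); zero elsewhere
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# Q* from value iteration on the known model, for comparison
Q_star = np.zeros_like(Q)
for _ in range(200):
    Q_star = R + gamma * (P @ Q_star.max(axis=1))
```

On this tiny MDP, Q ends up close to Q_star, matching what the convergence theorem predicts.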

Assumption 1: trivially holds since we are focusing on finite MDPs.

Assumption 2: \beta_n(x)=\alpha_n(x)=\text{the Q-learning learning rate}, based on our formulation. If we apply a GLIE learning policy (“greedy in the limit with infinite exploration”) [2], so that every state-action pair is visited infinitely often, then we can make assumption 2 hold:

  

Assumption 3: \|E\{F_t(s,a) | P_n\}\|_w \leq \gamma \|\triangle_t\|_w. Some notation here may not look familiar to a general audience. First, \| x \|_w := \|w \cdot x\|_\infty=\max_i |w_i x_i| is the weighted maximum norm [5]; with w=\boldsymbol{1}, \|\triangle_t\|_w=\max\limits_{s,a} |Q_t(s,a)-Q^*(s,a)|. Second, F_t(s,a) | P_n just means that F_t(s,a) can be estimated using all past interactions, i.e., Q_t in F_t(s,a) is estimated conditional on all past information rather than just (s_t, a_t).

To prove assumption 3 holds, we first look at how we define the optimal Q-function:
Q^*(s,a)=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma \max\limits_{a'\in\mathcal{A}}Q^*(s',a')]

We can show that Q^*(s,a) is a fixed point of a contraction operator \textbf{H}, defined over a generic function q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}:
(\textbf{H}q)(s,a)=\sum\limits_{s'\in\mathcal{S}}P_a(s, s')[r(s,a,s')+\gamma \max\limits_{a' \in \mathcal{A}}q(s',a')]

In fact, \|\textbf{H}q_1 - \textbf{H}q_2\|_\infty \leq \gamma \|q_1-q_2\|_\infty, i.e., \textbf{H} is a \gamma-contraction, because:
\|\textbf{H}q_1 - \textbf{H}q_2\|_\infty \newline=\max\limits_{s,a} \big|\sum\limits_{s'\in \mathcal{S}} P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a' \in \mathcal{A}}q_1(s',a') - r(s,a, s')-\gamma \max\limits_{a'\in\mathcal{A}}q_2(s',a')]\big|\newline=\max\limits_{s,a}\gamma \big|\sum\limits_{s'\in\mathcal{S}}P_a(s,s') [\max\limits_{a' \in \mathcal{A}}q_1(s',a')-\max\limits_{a' \in \mathcal{A}}q_2(s',a')]\big|\newline\text{Triangle inequality, using } P_a(s,s') \geq 0\newline\leq\max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\big|\max\limits_{a' \in \mathcal{A}}q_1(s',a')-\max\limits_{a' \in \mathcal{A}}q_2(s',a')\big|\newline\text{Using } |\max_x f(x)-\max_x g(x)| \leq \max_x|f(x)-g(x)|\newline\leq \max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\max\limits_{a' \in \mathcal{A}}\big|q_1(s',a')-q_2(s',a')\big| \newline \text{Enlarge the domain of the max} \newline\leq \max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\max\limits_{s'' \in \mathcal{S},a'' \in \mathcal{A}}\big|q_1(s'',a'')-q_2(s'',a'')\big|\newline=\max\limits_{s,a}\gamma \sum\limits_{s'\in \mathcal{S}}P_a(s,s')\|q_1-q_2\|_\infty\newline=\gamma \|q_1-q_2\|_\infty
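As a numerical sanity check (not a proof), we can apply \textbf{H} to two arbitrary Q-functions on a randomly generated MDP and verify that the sup-norm distance shrinks by at least a factor of \gamma:

```python
import numpy as np

# Sanity check on a random MDP: H contracts sup-norm distances by gamma.
np.random.seed(0)
n_s, n_a, gamma = 5, 3, 0.9
P = np.random.rand(n_s, n_a, n_s)
P /= P.sum(axis=2, keepdims=True)            # P[s, a, s'] sums to 1 over s'
r = np.random.rand(n_s, n_a, n_s)            # r(s, a, s')

def H(q):
    # (Hq)(s, a) = sum_{s'} P[s,a,s'] * (r(s,a,s') + gamma * max_{a'} q(s',a'))
    return (P * (r + gamma * q.max(axis=1))).sum(axis=2)

q1, q2 = np.random.randn(n_s, n_a), np.random.randn(n_s, n_a)
lhs = np.abs(H(q1) - H(q2)).max()            # ||Hq1 - Hq2||_inf
rhs = gamma * np.abs(q1 - q2).max()          # gamma * ||q1 - q2||_inf
assert lhs <= rhs + 1e-12
```

Iterating H from any starting q would likewise converge to the unique fixed point Q^*, by the Banach fixed-point theorem.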

Note that in a previous note [6], we talked about the intuitive meaning of a \gamma-contraction: repeatedly applying \textbf{H} brings the function q closer and closer to the fixed point q^* at the rate of \gamma. Another intuitive reading of \|\textbf{H}q_1 - \textbf{H}q_2\| \leq \gamma \|q_1-q_2\| (in terms of L2 norm) is that q_1 and q_2 become closer to each other after applying \textbf{H}:

After proving \textbf{H} is a \gamma-contraction, we now prove assumption 3:
E\{F_t(s,a) | P_n\} \newline=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a'\in\mathcal{A}}Q_t(s', a')-Q^*(s,a)]\newline=\sum\limits_{s'\in\mathcal{S}}P_a(s,s')[r(s,a,s')+\gamma\max\limits_{a'\in\mathcal{A}}Q_t(s', a')]-Q^*(s,a)\newline=(\textbf{H}Q_t)(s,a)-Q^*(s,a)\newline\text{Using the fact that }Q^*=\textbf{H}Q^*\newline=(\textbf{H}Q_t)(s,a)-(\textbf{H}Q^*)(s,a)\newline\text{Therefore,}\newline\|E\{F_t(s,a) | P_n\}\|_\infty \leq \|\textbf{H}Q_t-\textbf{H}Q^*\|_\infty\leq \gamma \|Q_t-Q^*\|_\infty=\gamma \|\triangle_t\|_\infty

It is less obvious to me how to prove assumption 4, even after reading [3] and Proposition 5.5 in [8]. But the takeaway of assumption 4 is that the variance of F_t(s,a) should be bounded.

Finally, one can prove the convergence of SARSA [2] in a similar fashion. I hope I'll have time to cover it in the future.

I think the most important takeaway from this post is the pattern for proving convergence of a learning algorithm. Researchers often propose variants of Q-learning (such as soft Q-values in maximum-entropy RL [6], or SlateQ for dealing with combinatorial actions [9]), and they usually need to prove convergence at least in the case of finite MDPs. One can start from the optimal Q-function of the learning algorithm and prove it is a fixed point of a contraction operator. Then, look at the update rule and construct a random iterative process as stated in Theorem 1. Finally, prove the 4 assumptions hold.

 

Reference

[1] https://medium.com/harder-choices/finite-markov-decision-process-8a34f5e571f9

[2] Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf

[3] Convergence of Q-learning: a simple proof http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

[4] Convergence of Stochastic Iterative Dynamic Programming Algorithms https://papers.nips.cc/paper/764-convergence-of-stochastic-iterative-dynamic-programming-algorithms.pdf

[5] https://math.stackexchange.com/questions/182054/is-this-a-norm-triangle-inequality-for-weighted-maximum-norm

[6] https://czxttkl.com/2018/10/30/notes-on-soft-actor-critic-off-policy-maximum-entropy-deep-reinforcement-learning-with-a-stochastic-actor/

[7] norm properties: https://en.wikipedia.org/wiki/Norm_(mathematics)#Properties

[8] Neuro Dynamic Programming https://www.dropbox.com/s/naenlfpy0f9vmvt/neuro-dynamic-programming-optimization-and-neural-.pdf?dl=0

[9] SlateQ: https://arxiv.org/abs/1905.12767

Cross entropy with logits

I keep forgetting the exact formulation of `binary_cross_entropy_with_logits` in PyTorch, so I am writing it down for future reference.

The function binary_cross_entropy_with_logits takes two kinds of inputs: (1) the input, the value right before the probability transformation (sigmoid) layer, whose range is (-infinity, +infinity); (2) the target, whose values are binary.

binary_cross_entropy_with_logits calculates the following loss (i.e., the negative log likelihood), ignoring sample weights:

    \[loss = -[target * log(\sigma(input)) + (1-target) * log(1 - \sigma(input))]\]

>>> import torch
>>> import torch.nn.functional as F
>>> input = torch.tensor([3.0])
>>> target = torch.tensor([1.0])
>>> F.binary_cross_entropy_with_logits(input, target)
tensor(0.0486)
>>> - (target * torch.log(torch.sigmoid(input)) + (1-target)*torch.log(1-torch.sigmoid(input)))
tensor([0.0486])

2019-12-03 Update

Now let’s look at the difference between N-class cross entropy and the KL-divergence loss. They refer to the same thing (https://adventuresinmachinelearning.com/cross-entropy-kl-divergence/) and differ only in I/O format.

import torch.nn as nn
import torch
import torch.nn.functional as F

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
loss(input, target)
>>> tensor(1.6677, grad_fn=<NllLossBackward>)

loss1 = nn.KLDivLoss(reduction="batchmean")
loss1(nn.LogSoftmax(dim=1)(input), F.one_hot(target, 5).float())
>>> tensor(1.6677, grad_fn=<NllLossBackward>)
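The same equivalence can be written with the functional API; this sketch also checks that cross entropy equals LogSoftmax followed by NLLLoss (the seed and shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input = torch.randn(3, 5)
target = torch.randint(5, (3,))

# CrossEntropyLoss = LogSoftmax + NLLLoss in one call
a = F.cross_entropy(input, target)
b = F.nll_loss(F.log_softmax(input, dim=1), target)

# KLDivLoss against a one-hot target gives the same value, because the
# target distribution's entropy term is zero for one-hot targets
c = F.kl_div(F.log_softmax(input, dim=1),
             F.one_hot(target, 5).float(), reduction="batchmean")

assert torch.allclose(a, b) and torch.allclose(a, c)
```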

Reference:

[1] https://pytorch.org/docs/stable/nn.html#binary-cross-entropy-with-logits

[2] https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss

[3] https://stackoverflow.com/questions/34240703/what-is-logits-softmax-and-softmax-cross-entropy-with-logits

mujoco only works with gcc8

pip install mujoco-py only builds with gcc8. On Mac, use ll /usr/local/Cellar/gcc* to list all gcc versions you have installed. Uninstall them and install only gcc@8:

brew uninstall gcc
brew uninstall gcc@7
brew uninstall gcc@9
brew install gcc@8

Another time, I saw the following error when running pip install mujoco-py:

  Building wheel for mujoco-py (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/czxttkl/anaconda2/envs/softlearning/bin/python /Users/czxttkl/anaconda2/envs/softlearning/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/tmp8nhhe2ui
       cwd: /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py
  Complete output (34 lines):
  running bdist_wheel
  running build
  Removing old mujoco_py cext /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/cymj_2.0.2.9_37_macextensionbuilder_37.so
  Compiling /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.pyx because it depends on /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-build-env-phfzl25m/overlay/lib/python3.7/site-packages/Cython/Includes/numpy/__init__.pxd.
  [1/1] Cythonizing /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.pyx
  running build_ext
  building 'mujoco_py.cymj' extension
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py
  creating /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/gl
  /usr/local/bin/gcc-8 -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/czxttkl/anaconda2/envs/softlearning/include -arch x86_64 -I/Users/czxttkl/anaconda2/envs/softlearning/include -arch x86_64 -DONMAC -Imujoco_py -I/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py -I/Users/czxttkl/.mujoco/mujoco200/include -I/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-build-env-phfzl25m/overlay/lib/python3.7/site-packages/numpy/core/include -I/Users/czxttkl/anaconda2/envs/softlearning/include/python3.7m -c /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.c -o /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/generated/_pyxbld_2.0.2.9_37_macextensionbuilder/temp.macosx-10.9-x86_64-3.7/private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.o -fopenmp -w
  In file included from /usr/local/Cellar/gcc@8/8.3.0/lib/gcc/8/gcc/x86_64-apple-darwin18/8.3.0/include-fixed/syslimits.h:7,
                   from /usr/local/Cellar/gcc@8/8.3.0/lib/gcc/8/gcc/x86_64-apple-darwin18/8.3.0/include-fixed/limits.h:34,
                   from /Users/czxttkl/anaconda2/envs/softlearning/include/python3.7m/Python.h:11,
                   from /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.c:58:
  /usr/local/Cellar/gcc@8/8.3.0/lib/gcc/8/gcc/x86_64-apple-darwin18/8.3.0/include-fixed/limits.h:194:61: error: no include path in which to search for limits.h
   #include_next <limits.h>  /* recurse down to the real one */
                                                               ^
  In file included from /Users/czxttkl/anaconda2/envs/softlearning/include/python3.7m/Python.h:25,
                   from /private/var/folders/kt/bt_h1rbx0658mqgh49cmt1xwb9g9_0/T/pip-install-g5afauff/mujoco-py/mujoco_py/cymj.c:58:
  /usr/local/Cellar/gcc@8/8.3.0/lib/gcc/8/gcc/x86_64-apple-darwin18/8.3.0/include-fixed/stdio.h:78:10: fatal error: _stdio.h: No such file or directory
   #include <_stdio.h>
            ^~~~~~~~~~
  compilation terminated.
  error: command '/usr/local/bin/gcc-8' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for mujoco-py
Failed to build mujoco-py
ERROR: Could not build wheels for mujoco-py which use PEP 517 and cannot be installed directly

This error is suspected to be due to a corrupted gcc@8 installation. I solved it by running brew reinstall gcc@8.

However, during the reinstallation, I encountered another error:

Error: Your Xcode (11.0) is outdated.
Please update to Xcode 11.3 (or delete it).
Xcode can be updated from the App Store.

I solved this by running xcode-select --install.

Notes for “Defensive Quantization: When Efficiency Meets Robustness”

I have been reading “Defensive Quantization: When Efficiency Meets Robustness” [2] recently. Neural network quantization is a brand-new topic to me, so I am writing some notes down as I learn.

The first introduction I read is [1], from which I learned that the term “quantization” generally refers to reducing the memory usage of model weights by lowering representation precision. It can mean several related things: (1) low precision: converting FP32 (32-bit floating point) to FP16, INT8, etc.; (2) mixed precision: using FP16 for some weights while keeping FP32 for others (so that accuracy can be maintained); (3) exact quantization: using INT8 for all weights.

The next thing to understand is the basics of fixed-point and floating-point representation:

 

The fixed-point format stores the integer part and the fractional part as separate numbers, so the gap between two consecutive fixed-point numbers is fixed. (For example, fixed-point numbers could be 123.456, 123.457, 123.458, …, with 0.001 between each pair of neighbors.) The floating-point format stores a number as significand \times base^{exponent}, so the significand and the exponent are saved as two separate numbers. The gap between two consecutive floating-point numbers is not fixed but depends on the exponent: for every exponent, the number of representable floating-point numbers is the same, so for a smaller exponent the representable numbers are dense, while for a larger exponent they are sparse. As a result, the closer a real number is to 0, the more accurately it can be represented as a floating-point number. In practice, floating-point numbers are still quite precise even at the largest exponents.
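A quick NumPy illustration of this exponent-dependent precision: np.spacing(x) returns the gap between x and the next representable FP64 number, and above 2^{53} not even every integer is representable:

```python
import numpy as np

# The gap between adjacent representable doubles grows with magnitude.
assert np.spacing(1.0) == np.finfo(np.float64).eps   # machine epsilon at 1.0
assert np.spacing(1e-10) < np.spacing(1.0) < np.spacing(1e10)

# Above 2^53, FP64 cannot represent every integer: 2^53 + 1 rounds back down.
assert 2.0 ** 53 + 1 == 2.0 ** 53
```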

The procedure to quantize FP32 to INT8 is shown in Eqn. 6~9 in [1]. In short, you need to find a scale and a reference point (X_{zero\_point} in the equations) to make sure every FP32 value falls into INT8’s value range [0, 255].

Eqn. 6~9 only describe how to quantize weights; there is no guarantee that arithmetic on quantized weights still falls into the quantized range. For example, suppose an operator adds two FP32 weights. While we can quantize the two weights into INT8, adding the two quantized weights may overflow. Therefore, people have designed dedicated procedures for quantized arithmetic such as multiplication, addition, and subtraction. See Eqn. 10~16 and 17~26 for the examples of multiplication and addition, respectively.

While quantization is one way to reduce the memory usage of models, compression is an alternative. [3] is a very famous work on compressing deep learning models.

 

Since I am an RL guy, I’d also like to introduce another work [4], which uses RL for model compression. They designed an MDP where the state is each layer’s features; the action is the compression ratio for that layer (so the action is one-dimensional with range (0,1)); and the reward is model performance on the validation dataset, evaluated only at the last step of the MDP. The reward can also be tweaked to add resource constraints, such as latency. The take-away result is that accuracy can be almost preserved while the model is compressed to a good ratio.

 

References

[1] Neural Network Quantization Introduction

[2] Defensive Quantization: When Efficiency Meets Robustness

[3] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

[4] AMC: AutoML for Model Compression and Acceleration on Mobile Devices

[5] https://www.geeksforgeeks.org/floating-point-representation-digital-logic/

Gradient and Natural Gradient, Fisher Information Matrix and Hessian

Here I am writing down some notes summarizing my understanding of natural gradient. There are many online materials covering similar topics; I am not adding anything new, just making a personal summary.

Assume we have a model with model parameter \theta and training data x. Then, the Hessian of the log likelihood, log p(x|\theta), is:

    \[       H_{log p(x|\theta)} = \begin{bmatrix} \frac{\partial^2 logp(x|\theta)}{\partial \theta_1^2} & \dots & \frac{\partial^2 logp(x|\theta)}{\partial \theta_1 \partial \theta_n} \\ \dots & \dots & \dots \\ \frac{\partial^2 logp(x|\theta)}{\partial \theta_n \partial \theta_1} & \dots & \frac{\partial^2 logp(x|\theta)}{\partial \theta_n^2} \end{bmatrix} \]

The Fisher information matrix F is defined as the covariance matrix of \nabla_\theta log p(x|\theta) (note that \nabla_\theta log p(x|\theta) is itself a vector). If we define s(\theta) = \nabla_\theta log p(x|\theta), then F=\mathbb{E}_{p(x|\theta)}\left[\left(s(\theta) - mean(s(\theta))\right)\left(s(\theta) - mean(s(\theta))\right)^T\right]=\mathbb{E}_{p(x|\theta)}\left[s(\theta)s(\theta)^T\right]. The last equation holds because mean(s(\theta)) is 0 [1]. It can be shown that the negative of \mathbb{E}_{p(x|\theta)}[H_{log p(x|\theta)}] is the Fisher information matrix [1]. Therefore, the Fisher information matrix can be thought of as the (negative) curvature of the log likelihood. The curvature (Hessian) of a function is its second-order derivative, which depicts how quickly the gradient of the function changes. Computing the Hessian (curvature) takes longer than computing just the gradient, but knowing the Hessian can accelerate convergence: if the curvature is high, the gradient changes quickly, so the gradient update of the parameters should be more cautious (i.e., a smaller step); if the curvature is low, the gradient doesn’t change much, so the update can be more aggressive (i.e., a larger step). Moreover, the eigenvalues of the Hessian determine convergence speed [3]. Therefore, knowing the Fisher information matrix is quite important for optimizing a function.

Suppose we are going to optimize a loss function: \min \mathcal{L}(x, \theta). The ordinary gradient update is \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(x,\theta). The natural gradient update has a slightly different form: \theta \leftarrow \theta - \alpha F^{-1} \nabla_\theta \mathcal{L}(x,\theta). The natural gradient formula is derived from [5]:

    \[ \nabla_\theta^{NAT} \mathcal{L}(x,\theta) = \arg\min_{d} \mathcal{L}(x, \theta+d), \quad s.t. \quad KL\left( p(x|\theta) || p(x|\theta+d)\right) = c \]

This formula shows the intuition behind the natural gradient: it should minimize the loss as much as possible while not radically changing p(x|\theta). Another way to think about the natural gradient is that, since the Fisher information matrix F encodes the curvature of the log likelihood, the natural gradient is the ordinary gradient scaled by the inverse of the curvature: if the log likelihood’s curvature is large, some change in \theta could radically change the likelihood, so we should be conservative in the gradient update.
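A 1-D toy check of the definitions above (the numbers are arbitrary): for x \sim \mathcal{N}(\theta, \sigma^2) the score is (x-\theta)/\sigma^2, so F=1/\sigma^2, and the natural gradient F^{-1}g simply rescales an ordinary gradient g by \sigma^2; the flatter the log likelihood (large \sigma), the larger the step:

```python
import numpy as np

# For x ~ N(theta, sigma^2), score = d/dtheta log p(x|theta) = (x - theta)/sigma^2,
# so the Fisher information F = E[score^2] = 1 / sigma^2.
rng = np.random.RandomState(0)
theta, sigma = 0.0, 3.0
x = rng.normal(theta, sigma, size=1_000_000)   # samples from p(x|theta)
scores = (x - theta) / sigma**2                # per-sample score
F = np.mean(scores**2)                         # empirical Fisher
assert abs(F - 1 / sigma**2) < 1e-3

g = 0.5                                        # some ordinary gradient
natural_g = g / F                              # F^{-1} g in one dimension
assert abs(natural_g - sigma**2 * g) < 0.1     # natural gradient = sigma^2 * g
```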

 

Reference

[1] https://wiseodd.github.io/techblog/2018/03/11/fisher-information/

[2] https://towardsdatascience.com/its-only-natural-an-excessively-deep-dive-into-natural-gradient-optimization-75d464b89dbb

[3] http://mlexplained.com/2018/02/02/an-introduction-to-second-order-optimization-for-deep-learning-practitioners-basic-math-for-deep-learning-part-1/

[4] http://kvfrans.com/a-intuitive-explanation-of-natural-gradient-descent/

[5] https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

 

Stochastic Variational Inference

Introduction

In this post, we introduce a machine learning technique called stochastic variational inference, which is widely used to estimate the posterior distribution of Bayesian models. Suppose in a Bayesian model the model parameters are denoted as a vector z and the observation is denoted as x. According to Bayes' theorem, the posterior distribution of z can be computed as:

p(z|x)=\frac{p(z,x)}{p(x)}

p(x) is the probability of the observation, marginalized over all possible model parameters:

p(x)=\int p(z,x)dz 

p(x) isn’t easy to compute, and is most of the time intractable, because of its integral form. If we cannot compute p(x), then we cannot compute p(z|x), which is what we want to know. Therefore, we need to come up with a way to approximate p(z|x). We denote the approximate posterior as q(z); q(z) is also called the variational distribution, hence the name variational inference.

Stochastic variational inference (SVI) is one such method to approximate p(z|x). From the ICML 2018 tutorial [2], we can see the niche where SVI lies: among all possible ways to approximate p(z|x), there is a group of algorithms that use optimization to minimize the difference between q^*(\cdot) and p(\cdot|x). Those that measure the difference between the two distributions by the Kullback-Leibler divergence are called variational inference. If we further categorize by the family of q^*, there is one particular family, the mean-field variational family, to which variational inference is especially easy to apply. After all levels of categorization, we arrive at some form of objective function that we seek to optimize. SVI is one optimization method for a defined objective function that pushes q^* toward minimizing the KL-divergence with p(z|x).

Objective function

By definition, KL divergence between two continuous distributions H and G is defined as [4]:

KL\left(h(x)||g(x)\right)\newline=\int^\infty_{-\infty}h(x)log\frac{h(x)}{g(x)}dx \newline=\mathbb{E}_h[log\;h(x)]-\mathbb{E}_h[log\;g(x)]

If we are trying to find the best approximated distribution q^*(z) using variational Bayes, we define the following objective function:

q^*(z)=argmin_q KL\left(q(z) || p(z|x) \right )

where KL(q(z)||p(z|x))=\mathbb{E}_q [log\;q(z)] - \mathbb{E}_q [log\;p(z,x)]+log\;p(x) (all expectations are taken with respect to q(z)). Note that if we optimize w.r.t. q(\cdot), then log\;p(x) can be treated as a constant. Thus, minimizing the KL-divergence is equivalent to maximizing:

ELBO(q)=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]

ELBO(q) is a lower bound of log\;p(x) because of the non-negativity of the KL-divergence:

KL(q(z) || p(z|x)) = log p(x) - ELBO(q) \geq 0

Update 2020.4:

The derivation above is also illustrated in [14]:

There are several other ways to understand ELBO.

  1. Based on Jensen’s inequality [13]: for a convex function f and a random variable X, f(\mathbb{E}[X]) \leq \mathbb{E}\left[f(X)\right]; for a concave function g, g(\mathbb{E}[X]) \geq \mathbb{E}\left[g(X)\right]. Since log is concave, we have:

log \; p(x)\newline=log \int_z p(z,x)\newline=log \int_z p(z,x)\frac{q(z)}{q(z)}\newline=log \int_z q(z)\frac{p(z,x)}{q(z)}\newline=log \left(\mathbb{E}_{q(z)}\left[\frac{p(z,x)}{q(z)}\right]\right) \newline \geq \mathbb{E}_{q(z)}\left[log \frac{p(z,x)}{q(z)}\right] \quad\quad \text{by Jensen's inequality} \newline =ELBO(q)

Therefore, ELBO(q) is a lower bound of log \; p(x).

2. By rearranging ELBO(q), we have:

ELBO(q)\newline=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]\newline=\mathbb{E}_q[log\;p(x|z)] + \mathbb{E}_q[log\;p(z)] - \mathbb{E}_q[log\;q(z)] \quad\quad p(z) \text{ is the prior of } z \newline= \mathbb{E}_q[log\;p(x|z)]  - \mathbb{E}_q [log \frac{q(z)}{p(z)}]\newline=\mathbb{E}_q[log\;p(x|z)] - KL\left(q(z) || p(z)\right)

Therefore, the first part of ELBO(q) can be thought of as the so-called “reconstruction error”, which encourages q(z) to put more probability mass on the area with high log\;p(x|z). The second part encourages q(z) to be close to the parameter prior p(z). ELBO(q) is the common objective used in variational autoencoder models.
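These two properties can be checked numerically on a tiny discrete model (the prior and likelihood values below are made up): for any q(z), ELBO(q) \leq log\;p(x), with equality exactly when q(z)=p(z|x):

```python
import numpy as np

# Toy model: z in {0, 1}, one observed x, with hypothetical prior and likelihood.
p_z = np.array([0.4, 0.6])              # prior p(z)
p_x_given_z = np.array([0.9, 0.2])      # p(x | z) for the observed x
p_zx = p_z * p_x_given_z                # joint p(z, x)
log_px = np.log(p_zx.sum())             # log marginal likelihood

def elbo(q):
    # ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
    return np.sum(q * (np.log(p_zx) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
posterior = p_zx / p_zx.sum()           # exact p(z|x)

assert elbo(q_arbitrary) <= log_px + 1e-12      # ELBO is a lower bound
assert abs(elbo(posterior) - log_px) < 1e-12    # tight at the true posterior
```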

How to optimize?

Recall that our objective function is q^*(z)=argmin_q KL(q(z) || p(z|x)). In practice, minimizing with regard to q translates to parameterizing q(z) and then optimizing the objective function with regard to the parameters. One big assumption we can make to facilitate computation is that all latent variables are independent, such that q(z) factorizes into a product of distributions of the individual latent variables. We call such a q(z) a member of the mean-field variational family:

q(z)=\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j)  

From the factorization, you can see that each individual latent variable’s distribution is governed by its own parameter \theta_j. Hence, the objective function to approximate p(z|x) changes from q^*(z)=argmin_q KL(q(z) || p(z|x) ) to:

\theta^* = argmin_{\theta_1, \cdots, \theta_{|z|}} KL(\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j) || p(z|x) )

One simple algorithm to optimize this is coordinate ascent mean-field variational inference (CAVI). Each time, the algorithm optimizes one variational distribution parameter while holding all the others fixed. The algorithm works as follows:

“Set q_j(z_j) \propto exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\}” may seem hard to understand. It means setting the variational distribution parameter \theta_j such that q_j(z_j|\theta_j) follows the distribution that is equivalent to exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\} up to a constant. \mathbb{E}_{-j} means the expectation is taken with regard to the distribution \prod_{\ell \neq j} q_\ell(z_\ell|\theta_\ell).

What to do after knowing q(z)=\prod_j q(z_j|\theta_j)?

After the optimization (using CAVI, for example), we get the variational distribution q(z)=\prod_j q(z_j|\theta_j). We can use the estimated \theta_j to analytically derive the mean of z_j, or to sample z_j from q(z_j|\theta_j). One thing to note is that there is no restriction on the parametric form of the individual variational distributions. For example, you may define q(z_j|\theta_j) to be an exponential distribution: q(z_j|\theta_j)=\theta_j e^{-\theta_j z_j}; then the mean of z_j is 1/\theta_j. If q(z_j|\theta_j) is a normal distribution, then \theta_j actually contains two parameters, the normal distribution’s mean and variance, and the mean of z_j is simply the mean parameter.

Stochastic Variational Inference

One big disadvantage of CAVI is its scalability: each update of \theta_j requires a full sweep of the data. Stochastic variational inference (SVI) kicks in because its updates of \theta_j require only sub-samples of the data. The simple idea is to take the gradient \nabla_\theta ELBO and use it to update \theta. But there is some more detail:

  1. The update formulas become very succinct if we assume the complete conditionals are in the exponential family: p(z_j|z_{-j}, x)=h(z_j) exp\{\eta_j(z_{-j},x)^Tz_j - a(\eta_j(z_{-j}, x))\}, where z_j is its own sufficient statistic and h(\cdot), a(\cdot), and \eta(\cdot) are defined according to the definition of the exponential family [10].
  2. We also categorize latent variables into local variables and global variables.
  3. The gradient is not simply taken in the Euclidean space of the parameters but in the distribution space [11]. In other words, the gradient is transformed in a sensible way such that it is in the steepest-descent direction of the KL-divergence. Also see [12].

All in all, SVI works as follows [8]: 

Reference

[1] https://www.quora.com/Why-and-when-does-mean-field-variational-Bayes-underestimate-variance

[2] ICML 2018 Tutorial Session: Variational Bayes and Beyond: Bayesian Inference for Big Data: https://www.youtube.com/watch?time_continue=2081&v=DYRK0-_K2UU

[3] Variational Inference: A Review for Statisticians

[4] https://towardsdatascience.com/demystifying-kl-divergence-7ebe4317ee68

[5] https://www.zhihu.com/question/31032863

[6] https://blog.csdn.net/aws3217150/article/details/57072827

[7] http://krasserm.github.io/2018/04/03/variational-inference/

[8] NIPS 2016 Tutorial Variational Inference: Foundations and Modern Methods: https://www.youtube.com/watch?v=ogdv_6dbvVQ

[9] https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd

[10] https://jwmi.github.io/BMS/chapter3-expfams-and-conjugacy.pdf

[11] https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

[12] https://czxttkl.com/2019/05/09/gradient-and-natural-gradient-fisher-information-matrix-and-hessian/

[13] https://en.wikipedia.org/wiki/Jensen%27s_inequality

[14] https://github.com/cpark321/uncertainty-deep-learning/blob/master/05.%20Variational%20Inference%20(toy%20example).ipynb

Bayesian linear regression

Ordinary least squares (OLS) linear regression makes a point estimate of the weight vector w by fitting the model Y = Xw + \epsilon through \arg\min_w \left\Vert Y - Xw\right\Vert^2. If we assume normality of the errors, \epsilon \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}) with a fixed point estimate of \sigma^2, we can also carry out analyses of confidence intervals and future predictions (see the discussion at the end of [2]). Instead of making point estimates, Bayesian linear regression assumes w and \sigma^2 are random variables and learns their posterior distribution from data. In my view, Bayesian linear regression is a more flexible method because it supports incorporating prior knowledge about the parameters, and the posterior distributions it provides enable richer uncertainty analysis and facilitate other tasks [3], for example Thompson sampling in contextual bandit problems, which we will cover in the future. However, it is not a panacea: it does not generally improve prediction accuracy unless an informative prior is provided.

Bayesian methods are founded on Bayes' theorem, which states:

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)

Specifically, in Bayesian linear regression, D represents the observed data Y and X, and \theta refers to w and \sigma^2. So the formula above can be rewritten as:

p(w, \sigma^2|Y, X) \propto p(Y|X, w, \sigma^2) p(w, \sigma^2)

The procedure for adopting Bayesian methods is to (1) define the likelihood and proper prior distributions of the parameters of interest; (2) calculate the posterior distribution according to Bayes' theorem; and (3) use the posterior distribution for other tasks, such as predicting Y for new X.

The likelihood function of linear regression is:

p(Y|X, w, \sigma^2)=(2\pi\sigma^2)^{-n/2} exp\big\{-\frac{(Y-Xw)^T(Y-Xw)}{2\sigma^2}\big\}

which is the probability of observing the data under the assumption that Y is normally distributed with mean Xw and covariance \sigma^2\boldsymbol{I}.

To get an analytical expression for the posterior distribution p(w,\sigma^2|Y, X), we usually require the prior to be conjugate to the likelihood, i.e., the posterior stays in the same distribution family as the prior. Here we can treat the likelihood p(Y|X, w, \sigma^2) as a two-dimensional exponential family (the concept is illustrated in chapter 4 of [6]), a distribution with respect to w and \sigma^2. Therefore, the prior p(w, \sigma^2) can be modeled as a Normal-inverse-Gamma (NIG) distribution:

p(w, \sigma^2) =p(w|\sigma^2)p(\sigma^2)=\mathcal{N}(w_0, \sigma^2 V_0) \cdot IG(\alpha_0, \beta_0)

The inverse gamma distribution IG(\cdot) is equivalent to the scaled inverse chi-squared distribution; they differ only in parameterization [9]. We can denote p(w, \sigma^2) as NIG(w_0, V_0, \alpha_0, \beta_0). Note that we write p(w | \sigma^2), meaning that w and \sigma^2 are not independent. For one thing, if we modeled p(w, \sigma^2)=p(w) \cdot p(\sigma^2), we would not get a conjugate prior. For another, if you think of (X, Y) as generated by some process governed by w and \sigma^2, then w and \sigma^2 are dependent conditioned on (X, Y) (see section 3 in [8]).

Now, thanks to conjugacy, the posterior distribution is also a NIG distribution:

p(w, \sigma^2|Y, X) = NIG(w_1, V_1, \alpha_1, \beta_1),

where:

w_1 = (V_0^{-1}+X^TX)^{-1}(V_0^{-1}w_0+X^T Y)

V_1 = (V_0^{-1}+X^TX)^{-1}

\alpha_1 = \alpha_0 + \frac{1}{2}n

\beta_1 = \beta_0 + \frac{1}{2}(w_0^T V_0^{-1}w_0 + Y^T Y - w_1^TV_1^{-1}w_1)
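As a sanity check, the four update equations above can be implemented in a few lines of NumPy (the synthetic data and prior hyperparameters below are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: Y = X w_true + noise, noise sd = 0.5 (sigma^2 = 0.25)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
Y = X @ w_true + rng.normal(0.0, 0.5, size=n)

# Weakly informative NIG prior hyperparameters
w0 = np.zeros(d)
V0 = 100.0 * np.eye(d)
alpha0, beta0 = 1.0, 1.0

# Posterior NIG(w1, V1, alpha1, beta1) via the update equations above
V0_inv = np.linalg.inv(V0)
V1 = np.linalg.inv(V0_inv + X.T @ X)
w1 = V1 @ (V0_inv @ w0 + X.T @ Y)
alpha1 = alpha0 + 0.5 * n
# V1^{-1} = V0^{-1} + X^T X, reused directly for numerical stability
beta1 = beta0 + 0.5 * (w0 @ V0_inv @ w0 + Y @ Y - w1 @ (V0_inv + X.T @ X) @ w1)

# Posterior mean of sigma^2 under IG(alpha1, beta1): beta1 / (alpha1 - 1)
sigma2_mean = beta1 / (alpha1 - 1.0)
```

With enough data, w1 recovers w_true and the posterior mean of \sigma^2 recovers the true noise variance.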

The posterior predictive distribution (for predicting new data) is:

p(Y_{new}|X_{new}, Y, X) = \int \int p(Y_{new}|X_{new}, w, \sigma^2) p(w, \sigma^2|Y, X) dw d\sigma^2

The result should be a Student-t distribution, but the derivation details are quite involved; see section 6 in [10] and section 3 in [8]. I know from other online posts [11,12,13] that practical libraries don't calculate the analytical form of the posterior distribution but rely on sampling techniques like MCMC. However, even when they have the posterior distribution, I don't know how they would implement the posterior predictive distribution. This is worth further investigation in the future.
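One simple option is to approximate the posterior predictive integral by Monte Carlo: draw \sigma^2 from the inverse gamma, draw w given \sigma^2, then draw Y_{new} from the likelihood. Below is a sketch under the NIG posterior; the posterior parameter values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend we already computed NIG posterior parameters (values illustrative)
d = 2
w1 = np.array([1.0, -2.0])
V1 = 0.01 * np.eye(d)
alpha1, beta1 = 50.0, 10.0

def sample_posterior_predictive(x_new, n_samples=10000):
    """Monte Carlo posterior predictive: draw (sigma^2, w), then y_new."""
    # sigma^2 ~ InverseGamma(alpha1, beta1): invert Gamma(alpha1, scale=1/beta1)
    sigma2 = 1.0 / rng.gamma(alpha1, 1.0 / beta1, size=n_samples)
    # w | sigma^2 ~ N(w1, sigma^2 * V1), via the Cholesky factor of V1
    L = np.linalg.cholesky(V1)
    w = w1 + np.sqrt(sigma2)[:, None] * (rng.normal(size=(n_samples, d)) @ L.T)
    # y_new | w, sigma^2 ~ N(x_new^T w, sigma^2)
    return w @ x_new + np.sqrt(sigma2) * rng.normal(size=n_samples)

y_samples = sample_posterior_predictive(np.array([1.0, 1.0]))
```

The empirical distribution of y_samples approximates the Student-t posterior predictive, and its quantiles give prediction intervals without any analytical derivation.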

Side notes

We touched upon Bayesian linear regression when introducing Bayesian Optimization [1]. [4] is also a good resource on Bayesian linear regression. However, [1] and [4] only assume w is a random variable while \sigma^2 is still a fixed point estimate. This post goes fully Bayesian by assuming both w and \sigma^2 are random variables whose joint distribution follows the so-called Normal-inverse-Gamma distribution. There aren't too many resources in the same vein; what I've found so far are [5] and section 2 in [7].

Reference

[1] https://czxttkl.com/?p=3212

[2] https://czxttkl.com/2015/03/22/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f

[3] https://wso2.com/blog/research/part-two-linear-regression

[4] http://fourier.eng.hmc.edu/e176/lectures/ch7/node16.html

[5] A Guide to Bayesian Inference for Regression Problems: https://www.ptb.de/emrp/fileadmin/documents/nmasatue/NEW04/Papers/BPGWP1.pdf

[6] Bolstad, W. M. (2010). Understanding computational Bayesian statistics (Vol. 644). John Wiley & Sons.

[7] Denison, D. G., Holmes, C. C., Mallick, B. K., & Smith, A. F. (2002). Bayesian methods for nonlinear classification and regression (Vol. 386). John Wiley & Sons.

[8] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[9] https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution

[10] Conjugate bayesian analysis of the gaussian distribution

[11] https://wso2.com/blog/research/part-two-linear-regression

[12] https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7

[13] https://towardsdatascience.com/bayesian-linear-regression-in-python-using-machine-learning-to-predict-student-grades-part-2-b72059a8ac7e

Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning could cause the system to regress online. Therefore, the safe thing to do is to evaluate its performance on the offline training data, and decide whether to deploy based on that evaluation. Evaluating policies offline is an ongoing research topic called "counterfactual policy evaluation" (CPE): how would a policy perform when we only have trajectories that were generated by some other policy?

CPE for contextual bandit

We first look at CPE for contextual bandit problems. In a contextual bandit problem, we take action a at each state s, which is drawn from a distribution \mathcal{D}. We can only observe the reward associated with that specific pair of s and a: r(s, a). Rewards associated with actions not taken, r(s,a') for a' \neq a, cannot be observed. The next state, at which we take the next action, is not affected by the current state or action. Essentially, contextual bandit problems are Markov Decision Processes whose horizon (the number of steps) equals one. Suppose we have a dataset \mathcal{S} with N samples, each sample being a tuple (s_i, a_i, r_i), i=1,2,\cdots,N, where a_i is sampled from a behavior policy \pi_0(a|s) and r_i is calculated from an underlying, non-observable reward function: r_i=r(s_i, a_i). For now, we ignore any noise that could exist in reward collection. To simplify the calculation, we will assume that the decisions of \pi_0 are independent across samples; this assumption bypasses the possibility that \pi_0 maintains an internal memory h, a history of observations used to facilitate its future decisions. We will also assume that during data collection we can access the true probability \pi_0(\cdot|s), i.e., the true action propensity of the behavior policy. This is not difficult to achieve in practice because we usually have direct access to the current policy's model. We will call the policy we want to evaluate the target policy, denoted \pi_1.

Given all these notations defined, the value function of \pi_1 is:

V^{\pi_1} = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_1(\cdot|s)}[r(s,a)]

A straightforward way called Direct Method (DM) to estimate V^{\pi_1} is to train an approximated reward function \hat{r} and plug into V^{\pi_1}:

V^{\pi_1}_{DM}=\frac{1}{N}\sum\limits_{s \in \mathcal{S}}\sum\limits_a \hat{r}(s,a)\pi_1(a|s)

The bias of \hat{r} directly determines the bias of V^{\pi_1}_{DM}. The problem is that if \hat{r} is trained only on \mathcal{S}, which was generated by \pi_0, these samples may not sufficiently cover the state/action pairs relevant to V^{\pi_1}, so \hat{r} could be very biased and inaccurate.

The second method is called inverse propensity score (IPS). Its formulation is basically importance sampling on rewards:

V^{\pi_1}_{IPS} = \frac{1}{N} \sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}

We can prove that V^{\pi_1}_{IPS} is an unbiased estimator of V^{\pi_1}. I get some ideas from [4]:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{IPS}] \newline (p_{\pi_0}(\mathcal{S}) \text{ is the joint distribution of all samples generated by }\pi_0) \newline = \frac{1}{N}\int p_{\pi_0}(\mathcal{S})\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} d\mathcal{S} \newline \text{(since each sample is collected independently, we have:)}\newline=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}}\int p_{\pi_0}(s_i, a_i)\frac{r(s_i, a_i)\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}ds_i da_i \newline=\int p_{\pi_0}(s, a)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}ds da \newline=\int p_\mathcal{D}(s) \pi_0(a|s)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}dsda \newline=\int p_{\pi_1}(s,a)r(s,a) dsda\newline=V^{\pi_1}

There is one condition that must be satisfied for the proof to hold: if \pi_0(a|s)=0 then \pi_1(a|s)=0. First, we don't need to worry about any \pi_0(a|s)=0 appearing in the denominator of V^{\pi_1}_{IPS}, because such samples would never be collected in \mathcal{S} in the first place. However, if \pi_0(a|s)=0 but \pi_1(a|s)>0 for some (s,a), there would never be data to evaluate the outcome of taking action a at state s, which makes V^{\pi_1}_{IPS} a biased estimator of V^{\pi_1}. (Part of these ideas comes from [7].)

The intuition behind V^{\pi_1}_{IPS} is that:

  1. If r_i is a large (good) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably better than \pi_0 because \pi_1 may give actions leading to good rewards larger probabilities than \pi_0.
  2. If r_i is a small (bad) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably worse than \pi_0 because \pi_1 may give actions leading to bad rewards larger probabilities than \pi_0.
  3. We can reverse the relationship such that if \pi_1(a_i|s_i) < \pi_0(a_i|s_i), we can guess \pi_1 is better than \pi_0 if r_i itself is a bad reward. \pi_1 probably is a better policy trying to avoid actions leading to bad rewards.
  4. If \pi_1(a_i|s_i) < \pi_0(a_i|s_i) and r_i is a good reward, \pi_0 is probably a better policy trying to take actions leading to good rewards.

We can also relate the intuition of V^{\pi_1}_{IPS} to Inverse Probability of Treatment Weighting (IPTW) in observational studies [5]. In observational studies, researchers want to determine the effect of certain treatment while the treatment and control selection of subjects is not totally randomized. IPTW is one kind of adjustment to observational data for measuring the treatment effect correctly. One intuitive explanation of IPTW is to weight the subject with how much information the subject could reveal [2]. For example, if a subject who has a low probability to receive treatment and has a high probability to stay in the control group happened to receive treatment, his treatment result should be very informative because he represents the characteristics mostly associated with the control group. In a similar way, we can consider the importance sampling ratio \frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} for each r_i as an indicator of how much information that sample reveals. When the behavior policy \pi_0 neglects some action a_i that is instead emphasized by \pi_1 (i.e., \pi_0(a_i|s_i) < \pi_1(a_i|s_i)), the sample (s_i, a_i, r_i) has some valuable counter-factual information hence we need to “amplify” r_i.
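Putting the IPS formula into code is straightforward. Below is a minimal sketch on synthetic logged data; the toy reward function and policies are mine, chosen so that the target policy always takes the rewarded action and its true value is 1.

```python
import numpy as np

rng = np.random.default_rng(3)

n_actions = 3

def true_reward(s, a):                 # hidden; used only to simulate logging
    return float(a == s % n_actions)

def pi0(a, s):                         # behavior policy: uniform random
    return 1.0 / n_actions

def pi1(a, s):                         # target policy: always the rewarded action
    return 1.0 if a == s % n_actions else 0.0

# Log N samples (s_i, a_i, r_i) under pi0
N = 20000
logs = [(s, a, true_reward(s, a))
        for s, a in zip(rng.integers(0, 100, N), rng.integers(0, n_actions, N))]

# IPS estimate of V^{pi1}; the true value is 1.0
v_ips = np.mean([r * pi1(a, s) / pi0(a, s) for s, a, r in logs])
```

Note that the logged average reward under \pi_0 is only about 1/3, yet reweighting by \pi_1/\pi_0 recovers the target policy's value.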

The third method to estimate V^{\pi_1} called Doubly Robust (DR) [1] combines the direct method and IPS method:

V^{\pi_1}_{DR}=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \big[\frac{(r_i - \hat{r}(s_i, a_i))\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} +\int\hat{r}(s_i, a')\pi_1(a'|s_i) da'\big]

Given our assumption that we know the true action propensity \pi_0(\cdot|s), V^{\pi_1}_{DR} is still an unbiased estimator:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{DR}] \newline=\int p_{\pi_0}(s, a) \cdot \big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int \hat{r}(s, a')\pi_1(a'|s) da' \big]ds da \newline=\int p_{\pi_0}(s,a) \cdot\big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int(\hat{r}(s, a') - r(s,a'))\pi_1(a'|s)da' +\int r(s, a')\pi_1(a'|s)da' \big] dsda\newline=\int p_{\pi_1}(s,a) \cdot (-1+1)dsda + \int p_{\pi_1}(s,a) r(s,a) dsda\newline=V^{\pi_1}

However, in the original DR paper [1], the authors do not assume that the true action propensity can be obtained accurately. In that case, V^{\pi_1}_{DR} can be a biased estimator; however, if the biases of V^{\pi_1}_{DM} or V^{\pi_1}_{IPS} are within certain bounds, V^{\pi_1}_{DR} has far lower bias than either of the other two.

Under the same assumption that the true action propensity is accessible, we could get the variances of DM, DR, and IPS CPE (Var[V^{\pi_1}_{DM}], Var[V^{\pi_1}_{DR}], and Var[V^{\pi_1}_{IPS}]) based on the formulas given in Theorem 2 in [1] while setting \delta=0 because we assume we know the true action propensity:

[Figure: variance expressions of the IPS, DR, and DM estimators, from Theorem 2 in [1]]

Under some reasonable assumptions according to [1], Var[V^{\pi_1}_{DR}] should be between Var[V^{\pi_1}_{DM}] and Var[V^{\pi_1}_{IPS}]. Therefore, V^{\pi_1}_{DR} should be overall favored because of its zero bias and intermediate variance.
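Here is a sketch of V^{\pi_1}_{DR} on synthetic logged bandit data with a deliberately biased reward model \hat{r} (the toy policies and reward functions are mine); the importance-weighted residual term compensates for the model's bias, so the estimate still lands on the true value.

```python
import numpy as np

rng = np.random.default_rng(4)

n_actions = 3

def true_reward(s, a):                 # hidden; used only to simulate logging
    return float(a == s % n_actions)

def r_hat(s, a):                       # learned reward model, deliberately biased
    return 0.5 * float(a == s % n_actions)

def pi0(a, s):                         # behavior policy: uniform random
    return 1.0 / n_actions

def pi1(a, s):                         # target policy: always the rewarded action
    return 1.0 if a == s % n_actions else 0.0

N = 20000
logs = [(s, a, true_reward(s, a))
        for s, a in zip(rng.integers(0, 100, N), rng.integers(0, n_actions, N))]

def v_dr(logs):
    total = 0.0
    for s, a, r in logs:
        # importance-weighted residual corrects the reward model's bias
        correction = (r - r_hat(s, a)) * pi1(a, s) / pi0(a, s)
        # direct-method term: expected model reward under pi1
        dm_term = sum(r_hat(s, b) * pi1(b, s) for b in range(n_actions))
        total += correction + dm_term
    return total / len(logs)

estimate = v_dr(logs)                  # close to the true value 1.0
```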

CPE for MDP

We have now introduced several CPE methods for contextual bandit problems. Next, we move our focus to CPE for Markov Decision Processes with horizon larger than one. The value function of a policy is:

V^{\pi} = \mathbb{E}_{s_0 \sim \mathcal{D}, a_t \sim \pi(\cdot|s_t), s_{t+1}\sim P(\cdot|a_t, s_t)}[\sum\limits_{t=1}^H\gamma^{t-1}r_t],

where H is the number of steps in one trajectory. Simplifying the notation of this formula, we get:

V^{\pi} = \mathbb{E}_{\tau \sim (\pi, \mu)}[R(\tau)],

where \tau is a single trajectory, \mu denotes the transition dynamics, and R(\tau)=\sum\limits_{t=1}^H\gamma^{t-1}r_t is the cumulative discounted reward of the trajectory \tau.

Following the idea of IPS, we can have importance-sampling based CPE of \pi_1 based on trajectories generated by \pi_0:

V^{\pi_1}_{trajectory-IS}(\tau) = R(\tau) \prod\limits_{t=1}^H\frac{\pi_1(a_t|s_t)}{\pi_0(a_t|s_t)}

V^{\pi_1}_{per-decision-IS}(\tau)\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \prod\limits_{j=1}^t \frac{\pi_1(a_j|s_j)}{\pi_0(a_j|s_j)}\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \rho_{1:t}

V^{\pi_1}_{trajectory-IS} is an unbiased estimator of V^{\pi_1} under the same condition as for V^{\pi_1}_{IPS}: if \pi_0(a|s)=0 then \pi_1(a|s)=0. V^{\pi_1}_{per-decision-IS} is also an unbiased estimator of V^{\pi_1}, as proved in [6]. However, V^{\pi_1}_{per-decision-IS} has a lower upper bound than V^{\pi_1}_{trajectory-IS}; therefore, in practice V^{\pi_1}_{per-decision-IS} usually enjoys lower variance than V^{\pi_1}_{trajectory-IS} (compare chapters 3.5 and 3.6 in [9]).
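The per-decision-IS formula transcribes directly into code. A sketch for a single trajectory, assuming trajectories are lists of (s_t, a_t, r_t) tuples (the function signature is mine):

```python
def per_decision_is(trajectory, pi1, pi0, gamma=0.99):
    """Per-decision IS estimate for one trajectory of (s_t, a_t, r_t) tuples."""
    v, rho = 0.0, 1.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi1(a, s) / pi0(a, s)   # rho_{1:t}, built up incrementally
        v += gamma ** t * r * rho      # gamma^{t-1} with zero-based t
    return v
```

With pi1 == pi0, every rho_{1:t} equals 1 and the estimate reduces to the plain discounted return, as expected.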

Following the idea of DR, [3] proposed an unbiased DR-based CPE for finite horizons, which [8] later extended to the infinite-horizon case:

V^{\pi_1}_{DR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t\rho_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( \rho_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-\rho_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where \hat{q}^{\pi_1} and \hat{v}^{\pi_1} are the learned Q-value and state-value functions of the target policy \pi_1 (estimated from data collected under the behavior policy \pi_0), \rho_{1:t} = \prod\limits_{i=1}^t\frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}, and we define the corner case \rho_{1:0} = 1.

Based on Theorem 1 in [3], V^{\pi_1}_{DR} should have lower variance than V^{\pi_1}_{per-decision-IS}, after realizing that per-decision-IS CPE is a special case of DR CPE (with \hat{q} and \hat{v} set to zero).

Another CPE method, also proposed in [6], is called Weighted Doubly Robust (WDR). It normalizes the \rho_{1:t} terms in V^{\pi_1}_{DR}:

V^{\pi_1}_{WDR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t w_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( w_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-w_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where w_{1:t} = \frac{\rho_{1:t}}{\sum\limits_\tau \rho_{1:t}} and the sum in the denominator runs over all trajectories in the dataset.

V^{\pi_1}_{WDR} is a biased estimator, but since it truncates the range of the weights in V^{\pi_1}_{DR} to [0,1], WDR can have lower variance than DR when few samples are available. The variance reduction of WDR yields a larger decrease in expected squared error than the additional error incurred by the bias (see chapter 3.8 of [9]). Furthermore, under reasonable assumptions WDR is a consistent estimator, so the bias becomes less of an issue as the sample size increases. (As a side note, an estimator can be any combination of biased/unbiased and inconsistent/consistent [10].)
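Because the weights w_{1:t} are normalized across the dataset, WDR is most naturally computed over all trajectories at once. A sketch (assuming, for simplicity, equal-length trajectories; the function signature, including \hat{q} and \hat{v} passed in as callables, is mine):

```python
import numpy as np

def wdr(trajectories, pi1, pi0, q_hat, v_hat, gamma=0.99):
    """Sum of V^{pi1}_{WDR}(tau) over a set of equal-length trajectories."""
    n, H = len(trajectories), len(trajectories[0])
    # rho[i, t] holds rho_{1:t} for trajectory i; column 0 is rho_{1:0} = 1
    rho = np.ones((n, H + 1))
    for i, traj in enumerate(trajectories):
        for t, (s, a, r) in enumerate(traj):
            rho[i, t + 1] = rho[i, t] * pi1(a, s) / pi0(a, s)
    w = rho / rho.sum(axis=0, keepdims=True)   # normalize across trajectories
    total = 0.0
    for i, traj in enumerate(trajectories):
        for t, (s, a, r) in enumerate(traj):
            total += gamma ** t * (
                r * w[i, t + 1]
                - (w[i, t + 1] * q_hat(s, a) - w[i, t] * v_hat(s)))
    return total
```

Setting q_hat and v_hat to zero and pi1 == pi0 recovers the ordinary weighted importance sampling estimate, which is a useful sanity check.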

Direct Method CPE on trajectories learns to reconstruct the MDP. This means DM needs to learn the transition function and reward function in order to simulate episodes. The biggest concern is whether the collected samples suffice to train a low-bias model. If they do, then DM can actually achieve very good performance. Section 6 of [8] lists a situation where DM can outperform DR and WDR: while DM has low variance and low bias (even no bias), stochasticity in the reward and state transitions produces high variance in DR and WDR.

The last CPE method is called Model and Guided Importance Sampling Combining (MAGIC) [8]. It is considered the best CPE method because it adopts the advantages of both WDR and DM. The motivation given in [8] is that CPE methods involving importance sampling are generally consistent (i.e., as samples increase, the estimate converges to the true value) but still suffer from high variance, while model-based methods (DM) have low variance but are biased. MAGIC combines the ideas of DM and WDR so that it can track the performance of whichever is better between the two. The MAGIC method is mathematically involved; without diving deep into it yet, we only show its pseudocode for now:

[Figure: pseudocode of the MAGIC estimator, from [8]]

Below is a table summarizing this post, showing whether each CPE method (for evaluating trajectories) is biased/unbiased and consistent/inconsistent, together with a rough variance ranking.

Method          | Bias     | Consistency  | Variance rank
----------------|----------|--------------|------------------------------------
DM              | biased   | inconsistent | 1
trajectory-IS   | unbiased | consistent   | 5
per-decision-IS | unbiased | consistent   | 4
DR              | unbiased | consistent   | 3
WDR             | biased   | consistent   | 2
MAGIC           | biased   | consistent   | not sure, possibly between 1 and 2

(Table notes: (1) variance rank 1 means the lowest variance and 5 the highest; (2) the table relies on the reasonable assumptions mentioned above and used in the cited papers.)

[9] Thomas, P. S. (2015). Safe reinforcement learning (Doctoral dissertation, University of Massachusetts Libraries).

[10] https://stats.stackexchange.com/questions/31036/what-is-the-difference-between-a-consistent-estimator-and-an-unbiased-estimator