Analyze DistributedDataParallel (DDP)’s behavior

DistributedDataParallel implements data parallelism at the module level and can run across multiple machines. One process runs on each device, holding one copy of the module. Each process loads its own data, which is non-overlapping with other processes’. At the initialization phase, all copies are synchronized to ensure they start from the same initial weights. The forward pass is executed independently on each device, during which no communication is needed. In the backward pass, the gradients are all-reduced across the devices, ensuring that each device ends up with an identical copy of the gradients/weights, therefore eliminating the need for model syncs at the beginning of each iteration [1].

[2] describes the design in more detail. For example, parameters can be synchronized more efficiently by bucketing: parameters are grouped into several buckets, and in the backward pass, as soon as the gradients of all members of a bucket are ready, synchronization kicks off for that bucket. Thus one does not need to wait for the gradients of ALL parameters to become ready before synchronization starts.

Another detail in [2] is how gradient synchronization is triggered in the backward pass: “DDP uses autograd hooks registered at construction time to trigger gradients synchronizations.”
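To make these two details concrete, here is a toy, framework-free simulation of bucketed gradient synchronization triggered by per-parameter hooks. All names are mine and this is not PyTorch’s actual implementation; it only illustrates the control flow.

```python
# Toy simulation of DDP's two design details: per-parameter hooks that fire
# when a gradient becomes ready, and buckets that kick off their all-reduce
# as soon as all members are ready. (Not PyTorch's actual implementation.)

class Bucket:
    def __init__(self, param_names):
        self.param_names = set(param_names)
        self.ready = set()

    def on_grad_ready(self, name):
        # This plays the role of an autograd hook registered per parameter.
        self.ready.add(name)
        if self.ready == self.param_names:
            return f"all-reduce kicked off for {sorted(self.param_names)}"
        return None

# Bucket the parameters of a 2-layer model; gradients become ready in
# reverse layer order, as in a real backward pass.
last_layer_bucket = Bucket(["layer2.w", "layer2.b"])
first_layer_bucket = Bucket(["layer1.w", "layer1.b"])
bucket_of = {
    "layer2.w": last_layer_bucket, "layer2.b": last_layer_bucket,
    "layer1.w": first_layer_bucket, "layer1.b": first_layer_bucket,
}

events = []
for p in ["layer2.w", "layer2.b", "layer1.w", "layer1.b"]:
    msg = bucket_of[p].on_grad_ready(p)
    if msg:
        events.append(msg)

# The last layer's bucket syncs before the first layer's gradients even exist.
print(events)
```

Note how the sync for the last layer’s bucket overlaps with the rest of the backward pass instead of waiting for all gradients.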

I did a quick experiment to verify these design details. test_multi_gpu is in charge of spawning worker processes, and the real model training happens in the _worker function. initialize_trainer is the code that initializes the model in each process. Check out my comments in the “Logs” section of the pseudocode below for the explanation of DDP’s behavior.

def test_multi_gpu(
    use_gpu: bool,
    num_gpus: int,
    normalization_data_map: Dict[str, NormalizationData],
    reader_options: ReaderOptions,
):
    logger.info(f"Enter test_multi_gpu with reader options {reader_options}")

    # These ENVS are needed by torch.distributed: https://fburl.com/1i86h2yg
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_unused_port())

    manager = mp.Manager()
    result_dict = manager.dict()

    workflow_run_id = flow.get_flow_environ().workflow_run_id
    # The second condition is to avoid collision in unit test.
    # When running unit test, a local DB is used.
    # As a result, the workflow run ID may not be unique.
    if workflow_run_id and workflow_run_id > 10000:
        init_method = f"zeus://{workflow_run_id}"
    else:
        host_name = socket.gethostname()
        process_id = os.getpid()
        init_method = f"zeus://{host_name}_{process_id}"

    backend = "nccl" if use_gpu else "gloo"

    mp.spawn(
        _worker,
        args=(
            use_gpu,
            num_gpus,
            backend,
            init_method,
            reader_options,
            normalization_data_map,
            result_dict,
        ),
        nprocs=num_gpus,
        join=True,
    )
    logger.info("finish spawn")


def _worker(
    rank: int,
    use_gpu: bool,
    world_size: int,
    backend: str,
    init_method: str,
    reader_options: ReaderOptions,
    normalization_data_map: Dict[str, NormalizationData],
    # output_dict is placed before the optional args so that the positional
    # args passed by mp.spawn (ending with result_dict) line up correctly
    output_dict=None,
    reward_options: Optional[RewardOptions] = None,
    warmstart_path: Optional[str] = None,
):
    logger.info(f"rank={rank} with reader options {reader_options}")
    dist.init_process_group(
        backend=backend, init_method=init_method, world_size=world_size, rank=rank
    )
    if use_gpu:
        torch.cuda.set_device(rank)

    model = create_model(...)
    trainer = model.initialize_trainer(
        use_gpu=use_gpu,
        reward_options=reward_options,
        normalization_data_map=normalization_data_map,
        warmstart_path=warmstart_path,
    )
    logger.info(f"rank={rank} finish initialize")

    num_of_data = 0
    data_reader = construct_distributed_data_reader(
        normalization_data_map, reader_options
    )
    for idx, batch in enumerate(data_reader):
        batch = post_data_loader_preprocessor(batch)
        if use_gpu:
            batch = batch.cuda()

        num_of_data += len(batch.training_input.state.float_features)

        logger.info(
            f"rank={rank} batch state={batch.training_input.state.float_features}"
        )
        logger.info(
            f"rank={rank} before train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )
        if rank == 1:
            logger.info(f"rank={rank} wake")
            time.sleep(60)
            logger.info(f"rank={rank} sleep")

        trainer.train(batch)
        logger.info(
            f"rank={rank} after train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )

        break

    logger.info(f"rank={rank} finish reading {num_of_data} data")


def initialize_trainer(self) -> Seq2SlateTrainer:
    seq2slate_net = initialize_model(...)

    if self.use_gpu:
        seq2slate_net = seq2slate_net.cuda()

    logger.info(f"Within manager {print_param(seq2slate_net.seq2slate)}")
    logger.info(
        f"Within manager {next(seq2slate_net.seq2slate.parameters()).device}"
    )
    if self.trainer_param.num_parallel > 1:
        seq2slate_net = _DistributedSeq2SlateNet(seq2slate_net)

    return _initialize_trainer(seq2slate_net)

###############
     Logs
###############
# This is printed within manager.initialize_trainer to show that
# models initially have different parameters
# (see line 66 then line 112~115):
I1003 063214.340 seq2slate_transformer.py:140] Within manager tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.341 seq2slate_transformer.py:142] Within manager cuda:0
I1003 063214.349 seq2slate_transformer.py:140] Within manager tensor([-0.1076, -0.0444,  0.3003, -0.1177,  0.0275, -0.0811,  0.2084,  0.3369,\n ...], device='cuda:1', grad_fn=<CatBackward>)


# Below is printed from line 85 ~ 90
# You can see that each process receives different, non-overlapping data.
# You can also see that at this point, the two models have the same parameters,
# which is ensured by DDP. The parameters come from one particular copy (rank=0).
I1003 063214.531 test_multi_gpu.py:144] rank=0 batch state=tensor([[ 0.0000,  0.0000, -0.0000,  ..., -1.6131, -1.6298, -1.6118],\n ...]],\n       device='cuda:0')
I1003 063214.540 test_multi_gpu.py:147] rank=0 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.544 test_multi_gpu.py:144] rank=1 batch state=tensor([[ 0.0000,  0.0000, -0.7115,  ..., -2.2678, -2.3524, -2.4194],\n  ...]],\n       device='cuda:1')
I1003 063214.553 test_multi_gpu.py:147] rank=1 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n         ..., device='cuda:1', grad_fn=<CatBackward>)


# We deliberately let rank 1 sleep for one minute.
# But rank 0 does not return from its train function earlier,
# because it blocks on backward(), waiting for rank 1's backward() to finish.
# You can see that after the train function, both processes end up with the same parameters again.
I1003 063214.554 test_multi_gpu.py:150] rank=1 wake
I1003 063314.613 test_multi_gpu.py:152] rank=1 sleep
I1003 063315.023 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.706389904022217, clamped_ips_loss=-2.706389904022217, baseline_loss=0.0, max_ips=27.303083419799805, mean_ips=0.6373803615570068, grad_update=True
I1003 063315.033 test_multi_gpu.py:156] rank=0 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063315.033 test_multi_gpu.py:161] rank=0 finish reading 1024 data
I1003 063315.039 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.7534916400909424, clamped_ips_loss=-2.7534916400909424, baseline_loss=0.0, max_ips=272.4482116699219, mean_ips=0.908729612827301, grad_update=True
I1003 063315.050 test_multi_gpu.py:156] rank=1 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:1', grad_fn=<CatBackward>)
I1003 063315.050 test_multi_gpu.py:161] rank=1 finish reading 1024 data

References

[1] https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

[2] https://pytorch.org/docs/stable/notes/ddp.html

Tools needed to build an RL debugging tool

I’ve always had a dream to build a debugging tool that automatically analyzes an observational dataset and tells me whether the dataset is suitable for batch-RL algorithms like DQN/DDPG. As we know, we cannot directly control, or even know, how an observational dataset was collected. An incorrect data collection procedure or a wrong dataset would be costly to the whole model training loop [12]. My idea has been that this tool should first tell whether the process underlying the dataset is a Markov Decision Process (MDP). By definition, in an MDP, s_{t+1} (next state) is a function of s_t (current state) and a_t (action); R_t (reward) is also a function of s_t and a_t. Thus, the goal of the debugging tool would be to make causal inference on the observational dataset: do states and actions cause the changes of rewards and state transitions?

I will start with some Causal 101. I read some primers on causal inference [1, 2, 3, 4], among which [1] has a good, concise summary. The biggest take-away from [1] is that causation is not prediction. The second take-away is that there are two fundamental questions regarding causal effects: (1) estimating the causal effect of a known relationship X->Y; (2) causal relationship discovery, which [1] claims to be statistically impossible, though I don’t understand why. For (1), we can use randomized experiments in which we ensure X is changed independently and then observe how Y changes, or we can account for confounding variables in models. In a special case where we account for all confounding variables in a linear regression model (based on subject matter knowledge), the coefficient for X would be the perfect estimate of the causal effect of X->Y.

Regarding (2) causal relationship discovery, I’ve seen several papers researching this topic. There are generally three ways to discover causal relationships.

The first is to use a noise model. In the additive noise model (a bivariate model), assuming X->Y but not Y->X, if you fit two regression models Y=f(X)+\epsilon_1 and X=g(Y)+\epsilon_2, you will find that \epsilon_1 is independent of X while \epsilon_2 is not independent of Y. See [5, 6]. The noise independence test can be done through a kernel-based test (see the reference in [6]). The only situation where this method cannot distinguish the causal direction is when X and Y have a linear relationship and their noise is Gaussian.

The second method is related to Kolmogorov complexity. Theorem 1 from [7] states that, under some assumptions, if X->Y then K(F_X)+K(F_{Y|X}) \leq K(F_Y) + K(F_{X|Y}), which can be proxied by the so-called pinball loss. Then quantile regression (regression using the pinball loss) can be fitted from X to Y and from Y to X; the direction with the lower pinball loss indicates the causal direction.

The third method is called matching. While there are several methods for matching, here we talk about a widely adopted one called propensity score matching. The idea is that you group events together with similar propensity scores, where the propensity score of an event is the probability that it receives the treatment. Once similarly likely events are grouped together, you can further divide them by whether they actually received the treatment and compute the causal effect. One good example of applying propensity score matching is an NLP work [11] where they probe which word causes positive or negative emotions. Some people criticize propensity score matching for its drawbacks [9, 10], and there are more matching methods to consider [10]. We need to be very careful about the assumptions of these methods, since many of them cannot work when confounding variables are present.
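To make the additive-noise-model idea concrete, here is a minimal sketch of direction detection. This is my own toy setup, not the procedure from [5, 6]: it uses binned nonparametric regression and a crude heteroscedasticity score as a stand-in for a proper kernel independence test such as HSIC.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-2, 2, n)
y = x ** 3 + rng.normal(size=n)  # ground truth: X -> Y with additive noise

def dependence_score(pred, target, bins=20):
    """Regress target on pred with binned means, then measure how much the
    residual variance varies across bins (near 0 = residuals look independent
    of pred). A crude proxy for a kernel independence test."""
    order = np.argsort(pred)
    target = target[order]
    # Variance of target within a quantile bin of pred = residual variance
    # of a binned-mean regression in that bin.
    bin_vars = np.array([chunk.var() for chunk in np.array_split(target, bins)])
    return bin_vars.std() / bin_vars.mean()

forward = dependence_score(x, y)   # residuals of y given x: homoscedastic noise
backward = dependence_score(y, x)  # residuals of x given y: variance depends on y
print(forward, backward)
```

With the true direction X->Y, the forward residuals look like the original noise everywhere, while the backward residuals are much more spread out near y=0 than in the tails, so the backward score comes out larger and the causal direction is recovered.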

Next, I will go with Uncertainty 101. Uncertainty of deep neural networks is a really fascinating topic to me: we can fit almost anything using deep neural networks, but that doesn’t tell us much without uncertainty estimation. There are multiple ways to estimate the uncertainty of a deep NN:

  1. High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach: https://arxiv.org/pdf/1802.07167.pdf. The loss function is based on the idea that high-quality PIs should be as narrow as possible subject to a given coverage proportion, where coverage is the proportion of data whose label y lies within the predicted interval [y_l, y_u].
  2. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning: https://arxiv.org/abs/1506.02142. At test time, dropout is applied on a trained neural network multiple times to get multiple predictions. Prediction intervals are computed empirically from the obtained samples.
  3. Simple and scalable predictive uncertainty estimation using deep ensembles: https://arxiv.org/abs/1612.01474. Conceptually this is the most straightforward idea: an ensemble of models gives you a distribution of predictions. This idea is used in RL model-based planning, e.g., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models: https://arxiv.org/abs/1805.12114.
  4. Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf. Uncertainty estimates by quantile regression. A more detailed version: https://arxiv.org/abs/1811.00908
  5. Conservative Uncertainty Estimation by Fitting Prior Networks: https://openreview.net/pdf?id=BJlahxHYDS. Like Bayesian Optimization, the uncertainty estimate doesn’t rely on data labels.
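Idea 3 (ensembles) can be sketched in a few lines, with simple closed-form linear models standing in for neural networks: train each member on a bootstrap resample of the data and read uncertainty off the spread of their predictions. Everything below is my toy setup, not the paper’s.

```python
import random

random.seed(0)

# Toy data: y = 2x + 1 + noise, observed only on [0, 5].
xs = [i * 0.5 for i in range(11)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]
data = list(zip(xs, ys))

def fit_linear(points):
    # Closed-form least squares for a 1-D linear model (stand-in for a NN).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cov = sum((p[0] - mx) * (p[1] - my) for p in points)
    var = sum((p[0] - mx) ** 2 for p in points)
    slope = cov / var
    return slope, my - slope * mx

# Ensemble: each member sees a bootstrap resample of the data.
members = []
for _ in range(50):
    sample = [random.choice(data) for _ in data]
    members.append(fit_linear(sample))

def predict_with_uncertainty(x):
    preds = [a * x + b for a, b in members]
    mean = sum(preds) / len(preds)
    std = (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5
    return mean, std

_, std_in = predict_with_uncertainty(2.5)    # inside the training range
_, std_out = predict_with_uncertainty(20.0)  # far outside the training range
print(std_in, std_out)
```

As expected, the ensemble members agree where there is data and disagree far away from it, so the predictive std grows with distance from the training range.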

After introducing Causal and Uncertainty 101, let’s dive into the debugging tool in my mind. Since any observational dataset may contain confounding issues and spurious correlations, we cannot really trust an arbitrary observational dataset. Rather, the only trustworthy dataset is one of (s, a, r, s’) tuples collected by a completely random policy.

First, we need to debug whether R_t is a function of s_t and a_t. This is done by computing feature importance for each state feature and each action feature. Feature importance is obtained from the change in reward prediction after masking one particular feature across all logged data, using a trained and fixed reward model. The intuition is that if a feature is important, masking it should significantly perturb the reward prediction. If no state feature is important, then the problem degenerates, at best, to a multi-armed bandit problem when some action feature is important. We can use the uncertainty works surveyed above to get a confidence bound for the estimated feature importance, and we expect the lower confidence bound to be above 0. We call the test of whether the lower confidence bound > 0 in at least p% of logged data the feature importance significance test. (Update 2021-02-10: there is a clever way to do feature importance called Boruta: https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a)
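The masking procedure can be sketched as follows, with a hand-built (hypothetical) reward model standing in for a trained one; indices 0 and 1 are state features and index 2 is the action.

```python
# Sketch of masking-based feature importance on a fixed reward model.
# Toy reward model: reward depends on state feature s0 and the action,
# but not on state feature s1 (all numbers are made up).
def reward_model(s0, s1, a):
    return 3.0 * s0 + 0.0 * s1 + 2.0 * a

logged = [  # (s0, s1, a) triples from logged data
    (0.5, 1.2, 1), (-1.0, 0.3, 0), (2.0, -0.7, 1), (0.1, 2.5, 0),
]

def importance(feature_idx):
    # Mask one feature by replacing it with its mean over the logged data,
    # and measure the average absolute change in the reward prediction.
    mean_val = sum(row[feature_idx] for row in logged) / len(logged)
    changes = []
    for row in logged:
        masked = list(row)
        masked[feature_idx] = mean_val
        changes.append(abs(reward_model(*row) - reward_model(*masked)))
    return sum(changes) / len(changes)

print([importance(i) for i in range(3)])  # s0, s1, action
```

Masking s0 or the action perturbs the prediction, while masking the irrelevant s1 changes nothing, so only s0 and the action would pass the significance test.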

Since our data is randomized over actions, any action feature that passes the feature importance significance test lets us say that the action causes the reward. But if a state feature passes the feature importance significance test, we can only say the state feature is correlated with the reward. This is because we could have one of the following causal relationships; in the latter case, a confounding factor causes both the state feature and the reward.

That’s why even having states that are important for rewards is not enough. More specifically, there are two ways things could go wrong: (1) what if actions cannot exert any influence on the states that are important to rewards? (2) what if actions can influence a state that is considered important for reward prediction, but that state is not the cause of the reward and is merely correlated with it? For (1), we need to debug whether s_{t+1} is a function of s_t and a_t. We train a next-state prediction model for each state feature; by shuffling action values, we can see how the next-state prediction changes, and use the magnitude of that change as a proxy for the action’s influence on that state feature. To account for data variance (aleatoric error), we can fit mixture density distributions on next states.

For (2), “what if actions can influence a state that is considered important for reward prediction, but that state is not the cause of the reward and is merely correlated with it”, here is a good example to illustrate the point. Suppose there is a state feature “ad views of a particular product”, the action is “whether to show more ads about the product”, and the reward is “purchase of the product”. Suppose there is a confounding variable “passion about the product” that actually causes both views and purchases. While the action can increase views, the state feature and the reward are independent once we know whether a person loves the product. In this case, even if some state features pass the reward feature importance significance test and are caused by actions, we will still be unsure whether the problem is meaningful for RL.

Overall, the debugging tool procedure can be summarized as:

—————— this part may not be useful ———————

Suppose there are two users (mdp_id=0, and mdp_id=1). User 0 loves the product thus every ad view would lead to purchase. User 1 hates the product thus ad view has no effect on him. 

mdp_id  time_step  passion (s_t)  views (s_t)  action (a_t)  purchase (r_t)
0       0          1              0            show          1
0       1          1              1            not show      0
0       2          1              1            show          1
0       3          1              2            not show      0
1       0          0              0            show          0
1       1          0              0            not show      0

If passion and views are both state features, and purchase is a function of (passion, action), then passion will be considered important for rewards but not views. Then, for passion, we will find that actions cannot change it, which indicates this problem is not suitable for RL.

If passion and views are both state features, and purchase is a function of (passion, views, action) (e.g., there is additional purchase decay due to accumulated views), then passion and views will both be considered important for rewards. We will find that actions can change views but not passion, which concludes that this dataset is valid for trying RL.

If passion is unobservable and only views is observable, and purchase is a function of (passion, action), then no feature will be considered important for reward prediction. Our tool would raise a flag immediately.

If passion is unobservable and only views is observable, and purchase is a function of (passion, views, action), then views will be recognized as important for rewards. We will find that actions can change views and conclude that this dataset is still valid for trying RL.

When passion is unobservable and only views is observable, and purchase is a function of (passion, action), and passion=1 by nature has far more data than passion=0. 

—————— this part may not be useful ———————

 

References

[1] http://www.stat.cmu.edu/~larry/=stat401/Causal.pdf

[2] causal-inference.org: https://sites.google.com/view/causality-reading-group/resources?authuser=0

[3] Elements of Causal Inference: https://library.oapen.org/bitstream/handle/20.500.12657/26040/11283.pdf?sequence=1&isAllowed=y

[4] Review of Causal Discovery Methods Based on Graphical Models: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full

[5] Nonlinear causal discovery with additive noise models: https://papers.nips.cc/paper/3548-nonlinear-causal-discovery-with-additive-noise-models

[6] On the Identifiability of the Post-Nonlinear Causal Model: https://arxiv.org/pdf/1205.2599.pdf 

[7] Distinguishing Cause from Effect Using Quantiles: Bivariate Quantile Causal Discovery: https://arxiv.org/abs/1801.10579

[8] Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf

[9] https://en.wikipedia.org/wiki/Propensity_score_matching#Advantages_and_disadvantages

[10] Why Propensity Scores Should Not Be Used for Matching: https://www.cambridge.org/core/journals/political-analysis/article/why-propensity-scores-should-not-be-used-for-matching/94DDE7ED8E2A796B693096EB714BE68B

[11] Feature Selection as Causal Inference Experiments with Text Classification: https://www.aclweb.org/anthology/K17-1018/

[12] “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/0d556e45afc54afeb2eb6b51a9bc1827b9961ff4.pdf

 

Distributed RL

This post curates some distributed RL research.

We start from A3C [2], which we briefly covered a few years ago [3]. The core idea is that there is a central global network that is periodically synchronized with workers. Each worker copies the network parameters from the global network, runs through the environment to collect experiences, and computes gradients on its own machine. Then it sends the gradients to the global network, where the parameters actually get updated, and gets back the updated global network parameters [4]. Each worker updates the global network at different “asynchronous” time points, hence the “asynchronous” in A3C.

The problem with A3C is that workers may run with lagged parameters, so the gradients computed by different workers may contradict each other or the global version. Therefore, it is sometimes better to wait for all workers to finish before one global update happens [5]:

This means that at any time in A2C, all workers and the global network have the same parameters. Of course, the drawback of A2C is that the speed of one global gradient update is determined by the slowest worker.

IMPALA [1] overcomes some drawbacks of A2C/A3C. IMPALA is conceptually closer to A3C in that workers also run on different versions of the parameters to collect experience. The global controller can update the global network parameters whenever it receives a message from one worker, independently of other workers’ progress. Therefore, IMPALA (and A3C) is expected to have higher experience collection throughput than A2C. IMPALA solves the issue of worker policies lagging behind the global version with two methods: (1) workers send trajectories, rather than computed gradients as in A3C, to the global controller; the sent trajectories contain the action propensities of the worker network; (2) once the global controller receives trajectories from one worker, it can use V-trace in gradient computation, something similar to importance sampling weighting, to adjust for the discrepancy between the worker policy and the global policy.
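V-trace itself is compact enough to sketch. Below is a minimal single-trajectory version of the V-trace target, following Eqn. 1 of the IMPALA paper [1] (with truncation levels rho_bar and c_bar); the sanity check uses the fact that with on-policy data and no effective clipping, the target reduces to the n-step return.

```python
# Minimal V-trace target computation for a single trajectory, following
# Eqn. 1 of the IMPALA paper. Inputs: rewards r_t (length T), values V(x_t)
# (length T+1, last entry is the bootstrap value), and per-step importance
# ratios pi(a_t|x_t) / mu(a_t|x_t).
def vtrace_targets(rewards, values, ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    T = len(rewards)
    vs = list(values)  # v_T = V(x_T) serves as the bootstrap
    # Backward recursion:
    #   delta_s = rho_s * (r_s + gamma * V(x_{s+1}) - V(x_s))
    #   v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    for s in reversed(range(T)):
        rho = min(rho_bar, ratios[s])
        c = min(c_bar, ratios[s])
        delta = rho * (rewards[s] + gamma * values[s + 1] - values[s])
        vs[s] = values[s] + delta + gamma * c * (vs[s + 1] - values[s + 1])
    return vs[:-1]

# Sanity check: on-policy (all ratios = 1), V-trace equals the n-step return.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.3, 0.1, 0.7]
targets = vtrace_targets(rewards, values, ratios=[1.0, 1.0, 1.0], gamma=0.9)
nstep = rewards[0] + 0.9 * rewards[1] + 0.9 ** 2 * rewards[2] + 0.9 ** 3 * values[3]
print(targets[0], nstep)
```

When the worker policy mu lags behind pi, the ratios deviate from 1 and the truncations rho_bar/c_bar limit how much any single stale step can contribute.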

Seed RL [7] is another attempt to improve over IMPALA, motivated by two observations:
(1) in IMPALA, two things are sent over the network: trajectories (from workers to the global controller) and model parameters (from the global controller to workers). However, model parameters can be many times larger than trajectories, meaning a lot of network bandwidth is wasted on sending the parameters of large models.
(2) workers are usually CPU-based, which makes them slow at performing policies (inference).

Seed RL doesn’t send model parameters over the network. It performs both learning and inference on the global controller, which is usually equipped with GPUs. The global controller only sends inferred actions to workers, and workers send trajectories to the global controller (the same as in IMPALA). Seed RL also uses V-trace, as it is still an asynchronous algorithm in which workers can fall behind the global network.

The difference between Seed RL and IMPALA can be best viewed in the following diagrams [6]:

 

References

[1] IMPALA

[2] A2C / A3C

[3] https://czxttkl.com/2017/06/13/a3c-code-walkthrough/

[4] A3C explanation in Medium

[5] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a2c

[6] https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

[7] https://arxiv.org/pdf/1910.06591.pdf

Noise-Contrastive Estimation

I encountered the noise-contrastive estimation (NCE) technique in several papers recently, so it is a good time to visit this topic. NCE is a commonly used technique under the word2vec umbrella [2]. Previously I talked about several other ways to generate word embeddings in [1] but skipped introducing NCE. In this post, I will introduce it, with notations borrowed from the paper applying it to word2vec [2].

First, let’s look at an example to get familiar with the NLP terms target word and context [3], which are denoted as w and h in [2]:

Using the idea of word2vec, words that frequently appear together should have similar embeddings. Going one step further, a word’s embedding should match the embedding of its context. Often the context embedding is derived from the embeddings of the words in the context; for example, it can be as simple as the mean pooling of the embeddings of all the words in the context.
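As a concrete miniature of the mean-pooling idea (the vectors and vocabulary are entirely made up), the context embedding is the element-wise mean of the context words’ vectors, scored against the target word by a dot product:

```python
# Toy mean-pooling context embedding, scored by a dot product s(w, h).
embeddings = {           # made-up 3-d word vectors
    "the":   [0.1, 0.0, 0.2],
    "cat":   [0.9, 0.8, 0.1],
    "sat":   [0.7, 0.6, 0.3],
    "stock": [-0.5, 0.4, -0.9],
}

def context_embedding(words):
    vecs = [embeddings[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def score(target, context_words):
    h = context_embedding(context_words)
    w = embeddings[target]
    return sum(wi * hi for wi, hi in zip(w, h))

# "cat" should fit the context ("the", _, "sat") better than "stock".
print(score("cat", ["the", "sat"]), score("stock", ["the", "sat"]))
```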

Translating this idea into a loss function, we want to increase the following probability for every observed pair of (w, h):

(1)   \begin{equation*}    P^h_\theta(w) = \frac{exp(s_\theta(w, h))}{\sum_{w' \in V} exp(s_\theta(w', h))}\end{equation*}


where s_\theta(\cdot) is a scoring function that scores the similarity of any pair (w, h), and w' ranges over all other possible words in the entire vocabulary. Obviously, this can be an expensive loss function if the vocabulary is large, which is usually the case.

To avoid the inefficiency of computing pairwise scores over the whole vocabulary, NCE transforms the problem of Eqn. 1 into a binary classification problem. For an observed pair (w, h), we randomly pick k additional words from the vocabulary as negative samples. For each of the k+1 words (w plus the k sampled words), we ask for the probability that it came from the observed pair rather than from noise: a word appears among the k+1 words either because it was drawn from the data distribution P^h_\theta(w) (the word from the observed pair), or because it was drawn from the noise distribution P_n(w)=1/|V| (which happens k times as often). Using Bayes’ theorem (more details in [3]), we can derive the conditional probability of w being from the observed pair given that w is among the k+1 words:

(2)   \begin{equation*}    P^h(D=1|w, \theta) = \frac{P^h_\theta(w)}{P^h_\theta(w) + kP_n(w)}\end{equation*}

The classification objective is then: for the word w from (w, h), P^h(D=1|w, \theta) should be as high as possible; for the words from the negative samples, P^h(D=0|w, \theta) should be as high as possible. That is what Eqn. 8 from [2] is about:

(3)   \begin{equation*}    J^h(\theta) = \mathbb{E}_{w \sim P^h_{data}} \left[ \log P^h(D=1|w,\theta)\right] + k \mathbb{E}_{w \sim P_n} \left[ \log P^h(D=0|w,\theta) \right]\end{equation*}

There exist theoretical reasons, which I am not too sure about, that allow us to relax the definition of P^h(D=1|w, \theta): we can replace P^h_\theta(w) in Eqn. 2 with the unnormalized score exp(s_\theta(w,h)). With this replacement, we no longer need to worry about computing the partition function \sum_{w' \in V} exp(s_\theta(w', h)) as in Eqn. 1.
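Putting Eqns. 2 and 3 together with the unnormalized-score relaxation, a minimal NCE loss for one (w, h) pair might look like the sketch below. The scores and the tiny vocabulary are made up; a real implementation would compute s_\theta(w, h) from the word and context embeddings.

```python
import math

def p_data_given_word(score, k, vocab_size):
    # Eqn. 2, with the unnormalized score exp(s) in place of P^h_theta(w)
    # and a uniform noise distribution P_n(w) = 1/|V|.
    p_model = math.exp(score)
    p_noise = 1.0 / vocab_size
    return p_model / (p_model + k * p_noise)

def nce_loss(positive_score, negative_scores, vocab_size):
    # Negated Eqn. 3 (lower is better): push D=1 for the observed word
    # and D=0 for each of the k noise words.
    k = len(negative_scores)
    loss = -math.log(p_data_given_word(positive_score, k, vocab_size))
    for s in negative_scores:
        loss -= math.log(1.0 - p_data_given_word(s, k, vocab_size))
    return loss

# A model that scores the true word high and noise words low gets a lower loss.
good = nce_loss(positive_score=3.0, negative_scores=[-3.0, -2.5], vocab_size=10)
bad = nce_loss(positive_score=-3.0, negative_scores=[3.0, 2.5], vocab_size=10)
print(good, bad)
```

Note that only k+1 scores are evaluated per pair, instead of a sum over the whole vocabulary as in Eqn. 1.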

There are also variants of the NCE loss. One is called InfoNCE (see Eqn. 4 in [4]) and has been applied to model-based reinforcement learning [5]. [4] claims that its objective function actually optimizes a lower bound on the mutual information between w and h in the word2vec context (or state and next state in the reinforcement learning context). Mutual information can be understood as the reduction in uncertainty about w after observing h [6].

 

References

[1] https://czxttkl.com/2017/01/02/embedding-and-heterogeneous-network-papers/

[2] Learning word embeddings efficiently with noise-contrastive estimation

[3] https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#context-based-skip-gram-model

[4] Representation Learning with Contrastive Predictive Coding

[5] Unsupervised State Representation Learning in Atari

[6] http://www.scholarpedia.org/article/Mutual_information

Control variate using Taylor expansion

We talked about control variates in [4]: when evaluating \mathbb{E}_{p(x)}[f(x)] by Monte Carlo samples, we can instead evaluate \mathbb{E}_{p(x)}[f(x)-h(x)+\theta] with \theta=\mathbb{E}_{p(x)}[h(x)] in order to reduce variance. The requirements for a control variate to work are that h(x) is correlated with f(x) and that the mean of h(x) is known.

In this post we will walk through a classic example of using a control variate, in which h(x) is picked as a Taylor expansion of f(x). We know that f(x)’s Taylor expansion can be expressed as f(x)=f(a) + f'(a)(x-a) + f''(a)/2! * (x-a)^2 + f'''(a)/3! * (x-a)^3 + \cdots. Therefore, if we pick the second order of the Taylor expansion of f(x) and set a=0, we get h(x)=f(0)+f'(0)x + f''(0)/2! * x^2.

Let’s see an example from [6]. Suppose f(u_1, u_2)=exp[(u_1^2 + u_2^2)/2] and we want to evaluate the integral \Psi=\int^1_0 \int^1_0 f(u_1, u_2) du_1 du_2. If we use a Monte Carlo estimator to estimate \Psi with samples U^1, \cdots, U^m, we have:
\frac{\text{area of sampling}}{m}\sum\limits_{k=1}^m f(U^k)=\frac{1}{m}\sum\limits_{k=1}^m f(U^k)

The control variate as the second-order Taylor expansion of f is h(u_1, u_2)=1 + (u_1^2 + u_2^2)/2 (we use the multi-variable Taylor expansion here [7]). We first compute its mean: \theta=\mathbb{E}_{p(x)}[h(x)] = \int^1_0\int^1_0 \left(1 +(u_1^2+u_2^2)/2 \right) du_1 du_2 = 4/3. Thus, our control variate Monte Carlo estimator is:
\frac{1}{m}\sum\limits_{k=1}^{m} z(U^k)\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[f(U^k) - h(U^k) + \theta\right]\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[exp[((U^k_1)^2 + (U^k_2)^2)/2] - 1 - ((U^k_1)^2 + (U^k_2)^2)/2 + 4/3 \right]

And from the results in [6], the sampling variance of z(u_1, u_2) is smaller than that of f(u_1, u_2).
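This example is easy to reproduce numerically. The sketch below, using only the standard library, compares the sampling variance of the plain estimator f with the control-variate estimator z:

```python
import random
from math import exp

random.seed(0)

def f(u1, u2):
    # The integrand from the example above.
    return exp((u1 ** 2 + u2 ** 2) / 2)

def h(u1, u2):
    # Second-order Taylor expansion of f around (0, 0).
    return 1 + (u1 ** 2 + u2 ** 2) / 2

theta = 4.0 / 3.0  # E[h] over the unit square, computed analytically above

samples_f, samples_z = [], []
for _ in range(100000):
    u1, u2 = random.random(), random.random()
    samples_f.append(f(u1, u2))
    samples_z.append(f(u1, u2) - h(u1, u2) + theta)

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

# Both estimators target the same Psi, but z has much lower variance
# because h tracks f closely and its mean theta is known exactly.
print(mean(samples_f), variance(samples_f))
print(mean(samples_z), variance(samples_z))
```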

Now, what [2] proposes is to apply a similar Taylor expansion-based control variate to the REINFORCE policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_{\pi}(\cdot), a_t \sim \pi(\cdot | s_t)} \left[ \nabla_\theta log \pi_\theta (a_t | s_t) R_t\right]

If we use a Monte Carlo estimator, then we essentially use samples \nabla_\theta log \pi_\theta (a_t | s_t) R_t to approximate \nabla_\theta J(\theta), which introduces variance. If we have a control variate h(s_t, a_t)=Q_w(s_t, \bar{a}_t) + \nabla_a Q_w(s_t, a)|_{a=\bar{a}_t} (a_t - \bar{a}_t), where \bar{a}_t can be chosen as any value but a sensible choice is \mu_\theta(s_t), a deterministic version of \pi_\theta(a_t | s_t), then the control variate version of the policy gradient can be written as (Eqn. 7 in [2]):
\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s\sim \rho_\pi, a \sim \pi}\left[ h(s_t, a_t)\right] \newline=\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s \sim \rho_\pi}\left[ \nabla_a Q_w(s_t, a)|_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)\right] 

As you can see, the second term can be computed when Q_w is a critic with a known analytic expression, and it does not require sampled actions from \pi_\theta.

[3] uses the law of total variance (some good examples and intuitions in [8, 9]) to decompose the variance of the control-variate policy gradient into three parts: variance caused by states \Sigma_s, by actions \Sigma_a, and by trajectories \Sigma_\tau. What [3] highlights is that the magnitude of \Sigma_a is usually much smaller than the other two, so the benefit of using a control variate-based policy gradient is very limited.

 

References

[1] MUPROP: UNBIASED BACKPROPAGATION FOR STOCHASTIC NEURAL NETWORKS

[2] Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC

[3] The Mirage of Action-Dependent Baselines in Reinforcement Learning

[4] https://czxttkl.com/2020/03/31/control-variate/

[5] https://czxttkl.com/2017/03/30/importance-sampling/

[6] https://www.value-at-risk.net/variance-reduction-with-control-variates-monte-carlo-simulation/

[7] https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables

[8] http://home.cc.umanitoba.ca/~farhadi/ASPER/Law%20of%20Total%20Variance.pdf

[9] https://math.stackexchange.com/a/3377007/235140

Notes on Ray paper

I am reading Robert Nishihara’s thesis on Ray [1] and have encountered several concepts I am not familiar with, so this post takes some notes on those new concepts.

Ring AllReduce

Ring AllReduce is a technique for communicating among multiple computation nodes to aggregate results. It is a primitive used by many distributed training systems. In the Ray paper, the author analyzes how easily Ring AllReduce can be implemented using Ray’s API.

There are two articles that articulate the idea really well [5, 6]. They point out the network-communication bottleneck when a distributed training system aggregates gradients from worker nodes in a master-slave pattern. Each worker (say there are P-1 workers in total) sends its gradient vector (of size N) to the master, and the master aggregates them and sends the updated model vector (the same size as the gradient vectors) back to each worker for the next iteration, so the data transferred through the master’s network link is as large as 2*(P-1)*N, which scales linearly with the number of workers.

Ring AllReduce has multiple cycles of data transfer among all workers (P workers in total, since no separate master is needed). Each cycle transfers only a small amount of data, so the amount each node transfers, roughly 2N(P-1)/P ≈ 2N, stays nearly constant as the number of workers grows, instead of scaling like the master’s 2*(P-1)*N in the master-slave pattern.

The basic idea of Ring AllReduce is to divide each gradient vector into P chunks on all workers. There are first P-1 cycles of data aggregation, then P-1 cycles of data syncing. In the first cycle, each worker i (i=1,\ldots,P) sends its i-th chunk to the next worker (with index i+1, wrapping around), and receives the (i-1)-th chunk from the previous worker. The received (i-1)-th chunk is aggregated locally with the worker’s own (i-1)-th chunk. In the second cycle, each worker sends its (i-1)-th chunk, which was aggregated in the last cycle, to the next worker, and receives the (i-2)-th chunk from the previous worker. Similarly, each worker can now aggregate its (i-2)-th chunk with the received chunk, which is also indexed i-2. Continuing this pattern, each worker sends its (i-2)-th, (i-3)-th, … chunk in each cycle until the P-1 aggregation cycles are done. At that point, each worker holds one chunk that has been fully aggregated across all workers. Running similar circular cycles P-1 more times then syncs all workers on every fully aggregated chunk.
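To make the two phases concrete, here is a minimal NumPy simulation of the scheme described above (a sketch; `ring_allreduce` and the explicit message-passing loops are my own construction, not from the Ray paper):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring AllReduce: each of the P workers holds a gradient vector;
    afterwards every worker holds the element-wise sum of all the vectors."""
    P = len(grads)
    # Each worker splits its gradient vector into P chunks.
    chunks = [np.array_split(np.asarray(g, dtype=float), P) for g in grads]

    # Phase 1 (aggregation / reduce-scatter): P-1 cycles. In cycle `step`,
    # worker i sends chunk (i - step) % P to worker (i + 1) % P and adds the
    # chunk it receives from worker (i - 1) % P to its own copy of that chunk.
    for step in range(P - 1):
        sends = [chunks[i][(i - step) % P].copy() for i in range(P)]
        for i in range(P):
            c = (i - 1 - step) % P
            chunks[i][c] = chunks[i][c] + sends[(i - 1) % P]

    # Phase 2 (syncing / all-gather): P-1 cycles circulating each fully
    # aggregated chunk around the ring, overwriting stale copies.
    for step in range(P - 1):
        sends = [chunks[i][(i + 1 - step) % P].copy() for i in range(P)]
        for i in range(P):
            chunks[i][(i - step) % P] = sends[(i - 1) % P]

    return [np.concatenate(c) for c in chunks]
```

Counting the messages: each worker sends 2(P-1) chunks of size about N/P, i.e., roughly 2N per worker regardless of P, which is the claimed advantage over the master-slave pattern.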

I draw one simple diagram to illustrate the idea:

 

 

ADMM (Alternating Direction Method of Multipliers)

Another interesting technique introduced in the thesis is ADMM. It is a good opportunity to revisit optimization from scratch. I’m mainly following [4] and [7] for basic concepts.

The definition of directional derivative and gradient:
The gradient of f(x), x\in \mathbb{R}^n is \nabla f(x) = \left\{ \frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_n}\right\}. The gradient is a vector and can only be known fully when given a concrete value x \in \mathbb{R}^n. The directional derivative is the gradient’s projection on another unit vector u \in \mathbb{R}^n: df(x)(u) = <\nabla f(x), u> = |\nabla f(x)| \;|u| \;cos \theta = |\nabla f(x)| \;cos \theta, where <\cdot, \cdot> is inner product. See some introduction in [9] and [10].

The definition of a proper function simply means (Definition 51.5 [7]): f(x) > -\infty for all x \in \mathbb{R}^n and f(x) < +\infty for some x \in \mathbb{R}^n

The definition of a convex and strictly convex function is (Proposition 40.9 [7]):  
The function f is convex on U iff f(v) \geq f(u) + df(u)(v-u) for all u, v \in U. Recall that df(u)(v-u) is the directional derivative of f at u along v-u, i.e., the first-order change of f caused by moving from u toward v. f is a strictly convex function iff f(v) > f(u) + df(u)(v-u) for all u \neq v.

This definition actually leads to a very common technique to check if a multivariate quadratic function is a convex function: if f(u)=\frac{1}{2}u^T A u - u^T b, u \in \mathbb{R}^n, as long as A is positive semidefinite, then f(u) is convex. See Example 40.1 [7].
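As a quick numerical sanity check of that criterion (a sketch; `is_convex_quadratic` is my own helper name), positive semidefiniteness can be tested via the eigenvalues of the symmetric part of A:

```python
import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    """f(u) = (1/2) u^T A u - u^T b is convex iff the symmetric part of A
    is positive semidefinite, i.e., all of its eigenvalues are >= 0."""
    S = (A + A.T) / 2
    return bool(np.all(np.linalg.eigvalsh(S) >= -tol))
```

For example, the identity matrix passes the check, while diag(1, -1) fails it (the corresponding quadratic is a saddle, not convex).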

The definition of affine hull: given a set of vectors S=(x_1, \cdots, x_m) with x_i \in \mathbb{R}^n, aff(S) is all affine combinations of the vectors in S, i.e., aff(S)=\{\lambda_1 x_1 + \cdots + \lambda_m x_m | \lambda_1 + \cdots + \lambda_m = 1\}.

The definition of relint (Definition 51.9 [7]):
Suppose C is a subset of \mathbb{R}^n. The relative interior of C is the set: relint(C)=\{x \in C | B_\epsilon(x) \cap aff(C) \subseteq C \;\; \text{for some } \epsilon > 0\}. There is a good example in [8] which explains the difference between the interior and the relative interior.

The most important thing is to understand the high level of the typical optimization procedure, which I also covered in [3]. Suppose we want to optimize the primal problem:
minimize  J(v)
subject to  \varphi_i(v) \leq 0, \quad \quad i=1,\ldots,m
                  \psi_j(v)=0, \quad\quad j = 1,\ldots, p
with dom(J) = \Omega \subset \mathbb{R}^n

The Lagrangian L(v,\mu, \upsilon) for the primal problem is defined as:
L(v, \mu, \upsilon)=J(v)+\sum\limits_{i=1}^m \mu_i \varphi_i(v) + \sum\limits_{j=1}^p \upsilon_j \psi_j(v), where \mu \in \mathbb{R}^m_+ (i.e., \mu \geq 0) and \upsilon \in \mathbb{R}^p.

The KKT conditions are necessary conditions: if v^* is a local minimum of the primal problem (under suitable regularity conditions), then v^* must satisfy the following [11]:
1. stationarity: \nabla J(v^*) + \sum\limits_{i=1}^m \mu_i \nabla \varphi_i(v^*) + \sum\limits_{j=1}^p \upsilon_j \nabla \psi_j(v^*) = 0
2. primal feasibility: \varphi_i(v^*) \leq 0, \; i=1, \ldots, m and \psi_j(v^*)=0, \; j=1, \ldots, p
3. dual feasibility: \mu_i \geq 0, \; i=1,\ldots, m
4. complementary slackness: \mu_i \varphi_i(v^*)=0, \; i=1, \ldots, m

In other words, if we already know v^*, we know it must satisfy the KKT conditions. However, not all solutions that satisfy the KKT conditions are local minima of the primal problem. The KKT conditions can be used to find the global minimum v^* if additional conditions are satisfied. One such condition is the “second-order sufficient conditions” [12], which I also talked about in [3]. Another is convexity: if the constraints, the domain dom(J), and the objective function J are convex, then the KKT conditions also imply v^* is a global minimum (Theorem 50.6 [7]).

While some problems have characteristics that allow using the KKT conditions as sufficient conditions to find the global solution, others are easier to solve or approximate by working with the dual problem. By maximizing another function, called the dual function, we can get the exact solution or a lower bound of the primal problem.

The dual function G: \mathbb{R}^m_+ \times \mathbb{R}^p \rightarrow \mathbb{R} is defined as:
G(\mu, \upsilon) = inf\limits_{v \in \Omega} L(v, \mu, \upsilon)  \quad \quad \mu \in \mathbb{R}^m_+ \text{ and } \upsilon \in \mathbb{R}^p

The dual problem is defined as:
maximize G(\mu, \upsilon)
subject to \mu \in \mathbb{R}^m_+, \upsilon \in \mathbb{R}^p

From [7] page 1697, one of the main advantages of the dual problem over the primal problem is that it is a convex optimization problem, since we wish to maximize a concave objective function G (thus minimize −G, a convex function), and the constraints \mu \geq 0 are convex. In a number of practical situations, the dual function G can indeed be computed.

Let the maximum of the dual problem be d^* and the minimum of the primal problem be p^*. Weak duality always holds: d^* \leq p^*. Strong duality holds when the duality gap is zero, which requires certain conditions to hold, for example Slater’s condition [14]. We can find the maximum (\mu^*, \upsilon^*) of the dual problem by a special form of gradient ascent called sequential minimal optimization (SMO) [13], because special treatment is needed for the constraints involved in the dual problem.

[7] provides two ways on how to do constrained optimization on SVM (Section 50.6 and 50.10): one is to use the KKT conditions, the other is to solve the dual problem. 

The dual problem is simplified when there are only affine equality constraints in the primal problem:
Primal problem:
minimize J(u)
subject to  Au=b

Dual problem:
maximize G(\lambda)=inf\limits_{u \in \mathbb{R}^n} L(u, \lambda)=inf\limits_{u \in \mathbb{R}^n}J(u)+\lambda^T(Au-b)
subject to \lambda \in \mathbb{R}^p

Since the dual problem is an unconstrained optimization problem, we can use dual ascent to solve this problem easily:
u^{k+1} = argmin_u L(u, \lambda^k)
\lambda^{k+1} = \lambda^k + \alpha^k(Au^{k+1}-b),
where \alpha^k>0 is a step size.
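Here is a minimal sketch of dual ascent for a quadratic J (my own toy example; when J is quadratic and strongly convex, the u-update has a closed form):

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.1, iters=500):
    """Dual ascent for: minimize (1/2) u^T P u + q^T u  subject to  A u = b."""
    lam = np.zeros(A.shape[0])
    u = np.zeros(P.shape[0])
    for _ in range(iters):
        # u^{k+1} = argmin_u L(u, lam): set P u + q + A^T lam = 0 and solve.
        u = np.linalg.solve(P, -(q + A.T @ lam))
        # lam^{k+1} = lam^k + alpha * (A u^{k+1} - b)
        lam = lam + alpha * (A @ u - b)
    return u, lam
```

For instance, minimizing u_1^2 + u_2^2 subject to u_1 + u_2 = 1 converges to u = (0.5, 0.5) with \lambda^* = -1.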

The main flaw of dual ascent is that, unless certain conditions hold (e.g., strict convexity of J), the updates may converge slowly or diverge. The augmented Lagrangian method augments the Lagrangian function with a penalty term:
L_{p}(u, \lambda)=J(u) + \lambda^T(Au-b)+ (p/2)\|Au-b\|^2_2

With the penalty term, the augmented Lagrangian is strongly convex in u under much milder assumptions on J, which gives the method better convergence properties.

The method of multipliers is the dual ascent applied on the augmented Lagrangian:
u^{k+1} = argmin_u L_p(u, \lambda^k)
\lambda^{k+1} = \lambda^k + p(Au^{k+1}-b)

If u can be separated into two parts such that J(u)=J(x,z)=f(x)+g(z), then we can iteratively update x, z, and \lambda separately, and such a method is called Alternating Direction Method of Multipliers (ADMM):
Primal problem:
minimize f(x)+g(z) 
subject to Ax+Bz=c

The augmented Lagrangian:
L_p(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-c)+(p/2)\|Ax+Bz-c\|^2_2

Updates (note we are not doing (x^{k+1}, z^{k+1})=argmin_{x,z} L_p(x,z,\lambda) as in the method of multipliers):
x^{k+1}=argmin_x L_p(x, z^k, \lambda^k)
z^{k+1}=argmin_z L_p(x^{k+1}, z, \lambda^k)
\lambda^{k+1} = \lambda^k + p(Ax^{k+1}+Bz^{k+1}-c)

If we define \mu=(1/p)\lambda and r=Ax+Bz-c, then the updates can be simplified as:
x^{k+1}=argmin_x \left( f(x) + (p/2) \|Ax+Bz^k-c+\mu^k\|^2_2 \right)
z^{k+1}=argmin_z \left( g(z) + (p/2)\|Ax^{k+1} + Bz - c + \mu^k \|^2_2 \right)
\mu^{k+1}=\mu^k + Ax^{k+1} + Bz^{k+1} - c

Structures in f, g, A, and B can often be exploited to carry out x-minimization and z-minimization more efficiently. We now look at several examples.

Example 1: if A=I is an identity matrix, v=-Bz^k+c-\mu^k, and f(\cdot) is the indicator function of a closed nonempty convex set \mathcal{C}, then x^{k+1}=argmin_x \left( f(x) + (p/2) \|x-v\|^2_2 \right)=\Pi_{\mathcal{C}}(v), which means x^{k+1} is the projection of v onto \mathcal{C}.

Example 2: if f(x)=(1/2)x^T P x + q^T x + r and v=-Bz^k+c-\mu^k, then x^{k+1}=argmin_x \left( f(x) + (p/2) \|Ax-v\|^2_2 \right) becomes an unconstrained quadratic programming with the analytic solution at x^*=\left(P+pA^T A\right)^{-1}(pA^Tv-q).

Example 3: if f(x)=(1/2)x^T P x + q^T x + r, dom(f)=\{x \mid Ax=b\}, and the solution must also satisfy x\geq 0, then the ADMM primal problem becomes:
minimize f(x) + g(z)  (where g(z) is an indicator function of the nonnegative orthant \mathbb{R}_+^n)
subject to x-z=0

Then x^{k+1}=argmin_x \left( f(x) + (p/2) \|x-z^k+u^k\|^2_2 \right). This is a constrained quadratic programming. We have to use KKT conditions to solve it. Suppose the Lagrangian multiplier is v\in\mathbb{R}^n, then the KKT conditions state that:
\nabla f(x) +\nabla\left((p/2) \|x-z^k+u^k\|^2_2 \right) + A^T v \newline= \nabla f(x) + p (x - z^k + u^k)+ A^T v = 0
Ax-b = 0,
which is exactly given by a linear system in matrix form in Section 5.2 [4]:

(As a side note, solving a constrained quadratic program usually relies on the KKT conditions; it can be converted to solving a linear system in matrix form: https://www.math.uh.edu/~rohop/fall_06/Chapter3.pdf)
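For an equality-constrained QP, this amounts to one linear solve of the stacked KKT system (a sketch; `solve_eq_qp` is my own function name):

```python
import numpy as np

def solve_eq_qp(P, q, A, b):
    """minimize (1/2) x^T P x + q^T x  subject to  A x = b.
    Stationarity (P x + q + A^T v = 0) and primal feasibility (A x = b)
    stack into the KKT linear system [[P, A^T], [A, 0]] [x; v] = [-q; b]."""
    n, m = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], sol[n:]   # primal solution x, multiplier v
```

The same toy problem (minimize x_1^2 + x_2^2 subject to x_1 + x_2 = 1) is solved in one step this way, without any iteration.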

Example 4 (Section 4.4.3 in [4]): if f(x)=\lambda \|x\|_1, v=-Bz^k+c-\mu^k, and A=I, then x^*=argmin_x(\lambda \|x\|_1 + (p/2)\|x-v\|^2_2)=S_{\lambda/p}(v) where S is the soft thresholding operator. 
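The soft thresholding operator itself is a one-liner (a sketch; it shrinks each component of v toward zero by \kappa and truncates at zero):

```python
import numpy as np

def soft_threshold(v, kappa):
    """S_kappa(v)_i = sign(v_i) * max(|v_i| - kappa, 0): the proximal operator
    of kappa * ||.||_1, used as the x-update in ADMM for lasso-type problems."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)
```

For example, S_1 maps (3, -0.5, 1) to (2, 0, 0): large components shrink, small ones are zeroed out, which is where the sparsity of the L1 penalty comes from.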

 

References

[1] On Systems and Algorithms for Distributed Machine Learning: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-30.pdf

[2] Compare between parameter servers and ring allreduce: https://harvest.usask.ca/bitstream/handle/10388/12390/CHOWDHURY-THESIS-2019.pdf?sequence=1&isAllowed=y

[3] Overview of optimization: https://czxttkl.com/2016/02/22/optimization-overview/

[4] ADMM tutorial: https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[5] https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/

[6] https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da

[7] Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning https://www.cis.upenn.edu/~jean/math-deep.pdf

[8] https://math.stackexchange.com/a/2774285/235140

[9] http://sites.science.oregonstate.edu/math/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/grad/grad.html

[10] https://math.stackexchange.com/a/661220/235140

[11] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Necessary_conditions

[12] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Sufficient_conditions

[13] http://cs229.stanford.edu/notes/cs229-notes3.pdf

[14] https://en.wikipedia.org/wiki/Slater%27s_condition

Understand Fourier Transform of Images

We’ve covered the Fourier Transform in [1] and [2], but only with 1D examples. In this post we will see what the 2D Fourier Transform looks like.

First, we look at a 2D image with one direction sinusoid waves (left) and its Fourier Transform (right).

I added coordinates to help you understand the illustration. Let’s assume the x-axis in the image corresponds to the i-axis in the frequency domain, and the y-axis in the image corresponds to the j-axis in the frequency domain. Because there is no fluctuation along the x-axis, there should only be the i=0 frequency. However, along the y-axis there is a regular sinusoid change, so there should be a j \neq 0 frequency. As a combination, we see two dots (two Dirac delta functions) on the line i=0.
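This claim can be checked numerically with `np.fft.fft2` (a sketch; the synthetic image below is my own stand-in for the illustration, a sinusoid that varies only along the y-axis):

```python
import numpy as np

n = 64
y = np.arange(n)
# Image varies sinusoidally along y (4 cycles) and is constant along x.
img = np.sin(2 * np.pi * 4 * y / n)[:, None] * np.ones((1, n))

mag = np.abs(np.fft.fft2(img))

# The two dominant coefficients (the two "dots") sit on the zero-horizontal-
# frequency line, at vertical frequencies +4 and -4 (rows 4 and n - 4).
rows, cols = np.unravel_index(np.argsort(mag.ravel())[-2:], mag.shape)
```

The two largest magnitudes indeed land at column 0 (the i=0 line) and at rows 4 and n-4, matching the two symmetric dots in the figure.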

To reinforce our understanding, let’s look at three more pictures. The first two have vertical bars with different spatial frequencies. As you might guess, the first picture, with sparser bars, has a lower frequency than the second picture. Their frequencies lie only on the line j=0 because there is no signal variation along the y-axis.

The third picture is generated by sin(x+y) (https://www.wolframalpha.com/input/?i=+plot+sin%28x%2By%29), and it has regular wave variations along both the x- and y-axes. Thus, the two white dots in the frequency domain lie on a diagonal.

Next, we examine a real-world photo, a famous picture titled “cameraman”. We mark one leg of the tripod in the image and its corresponding frequency band. This leg is almost perpendicular to the x-axis. Therefore, it introduces high-frequency variation (sharp edges) along the x-axis, while introducing less variation along the y-axis. In the frequency domain, you can see the corresponding line has a slope leaning more toward the i-axis than the j-axis.

At last, let’s revisit how images are constituted. In essence, images are formed by 2D waves, just like 1D functions are formed by 1D waves:

 

 

References

[1] https://czxttkl.com/2018/10/07/eulers-formula/

[2] https://czxttkl.com/2020/04/27/revisit-gaussian-kernel/ 

Image examples come from:

[3] https://www.youtube.com/watch?v=YYGltoYEmKo

[4] https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

[5] https://www.youtube.com/watch?v=gwaYwRwY6PU

A website that is helpful in visualizing 2d function:

[6] http://al-roomi.org/3DPlot/index.html

Revisit Gaussian kernel

This post is mainly about the Gaussian kernel (aka radial basis function kernel), a commonly used technique in machine learning. We’ve touched on the concept of the Gaussian kernel in [1] and [2], but with fewer details.

Definition of Gaussian Kernel

The intuitive understanding of a kernel is that it maps points in a data space, where those points may be difficult to classify, to a higher dimensional space, where the corresponding mapped points are easier to separate.

Gaussian kernel is defined as:

    \[K(\mathbf{x}, \mathbf{y})=exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right),\]

where \mathbf{x} and \mathbf{y} are two arbitrary points in the data space, and \sigma is a hyperparameter of the Gaussian kernel that controls the sensitivity to the difference between \mathbf{x} and \mathbf{y}. Imagine \sigma goes to infinity: then no matter how different \mathbf{x} and \mathbf{y} are, K(\mathbf{x}, \mathbf{y}) will be 1. The formula of the Gaussian kernel has the same functional form as the PDF of a normal distribution, up to a normalization constant.

K(\mathbf{x}, \mathbf{y}) can be understood as the similarity between the corresponding points of \mathbf{x} and \mathbf{y} in the mapped higher dimensional space (denoted as \phi(\mathbf{x}) and \phi(\mathbf{y})). The higher dimensional space is actually an infinite dimensional space, and the similarity is measured by the dot product. That is, K(\mathbf{x}, \mathbf{y}) = <\phi(\mathbf{x}), \phi(\mathbf{y})>.

So what is \phi(\mathbf{x}) then? According to ([3] page 11), we have:

Application of Gaussian Kernels

Using Gaussian kernels, we can design classification/regression algorithms whose ideas are similar to nearest-neighbor algorithms. Here is how it works [4]. Suppose we have N data points \mathbf{x}_1, \cdots, \mathbf{x}_N with their labels y_1, \cdots, y_N. Each of the N data points contributes some amount of information w_n to the prediction h(\mathbf{x}) of a new data point.

Intuitively, those \mathbf{x}_n that are close to \mathbf{x} should contribute more information, and those \mathbf{x}_n that are far away from \mathbf{x} should contribute less. Therefore, we use a Gaussian kernel K(\mathbf{x}, \mathbf{x}_n)=exp\left(-\gamma \|\mathbf{x}-\mathbf{x}_n\|^2\right) to weight the contribution w_n of \mathbf{x}_n to h(\mathbf{x}). This screenshot comes from [4], where \gamma = \frac{1}{2\sigma^2}.

The way to learn w_n is to require zero in-sample error, i.e., h(\mathbf{x}_n)=y_n \text{ for } n=1, \cdots, N. Therefore, we can solve the following matrix-vector equation to obtain \mathbf{w}:

Once we have \mathbf{w}, we have full information of our regression/classification model h(\mathbf{x}), and we can use it to make prediction to unseen data.
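A minimal sketch of this exact-interpolation RBF model (the function names `rbf_fit`/`rbf_predict` are my own):

```python
import numpy as np

def rbf_fit(X, y, gamma=1.0):
    """Solve K w = y, where K_nm = exp(-gamma * ||x_n - x_m||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.linalg.solve(np.exp(-gamma * d2), y)

def rbf_predict(X_train, w, X_new, gamma=1.0):
    """h(x) = sum_n w_n * exp(-gamma * ||x - x_n||^2)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2) @ w
```

By construction, h reproduces the labels exactly on the training points (K is invertible for distinct points since the Gaussian kernel is positive definite), and interpolates smoothly in between.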

We can also use a Gaussian kernel to perform smoothing. In [11], the author details some examples of applying a Gaussian kernel for smoothing. In the example below, the author generates some noisy data in a 1D array (left). By applying a Gaussian kernel, that is, averaging values mostly from the neighborhood, we get a much smoother data distribution (right).
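The 1D smoothing example can be reproduced in a few lines (a sketch with made-up data, not the data from [11]):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 200))
noisy = signal + rng.normal(0, 0.5, 200)

# Build a normalized Gaussian kernel (sigma = 3 samples) and apply it
# by convolution: each output value is a weighted neighborhood average.
t = np.arange(-10, 11)
kernel = np.exp(-t**2 / (2 * 3.0**2))
kernel /= kernel.sum()
smoothed = np.convolve(noisy, kernel, mode="same")
```

The smoothed array tracks the underlying sine wave much more closely than the noisy one, since the averaging suppresses the independent per-sample noise.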

Magically, applying a suitable Gaussian kernel can mostly recover clean data from noisy data that seems random. Below, we can see that a clean data distribution (upper left), after noise is added (upper right), can be mostly recovered if the Gaussian kernel “matches” certain characteristics of the noise (bottom).

Another application of Gaussian kernels is kernel density estimation [5], a technique to depict a smooth probability distribution from finite data samples. The smoothness of the estimated distribution is determined by the hyperparameter \sigma. When \sigma is small, the estimate looks like a histogram because each kernel only averages over nearby data. A larger \sigma averages over a wider range of data and gives a smoother curve.
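A kernel density estimate is just an average of Gaussian bumps, one per sample (a minimal sketch; `kde` is my own helper):

```python
import numpy as np

def kde(samples, grid, sigma):
    """Gaussian kernel density estimate evaluated at each grid point:
    the average of a normalized Gaussian bump centered at every sample."""
    diffs = grid[:, None] - samples[None, :]
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)
```

Because each bump integrates to 1, the estimate integrates to (approximately) 1 over a grid that covers the data; shrinking `sigma` makes it spikier, growing `sigma` makes it smoother.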

 

Convolution and Fourier Transform

Gaussian kernels can be used in the setting of convolution and Fourier transform. We first quickly review what convolution and Fourier transform are and their relationships. 

A convolution of two functions is defined as:

    \[f(t)=(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau\]

For a function that is on the time domain f(t), its frequency domain function \hat{f}(\xi) is defined as:

    \[\hat{f}(\xi)=\int^{\infty}_{-\infty}f(t)e^{-2\pi i t \xi} dt\]


And the inverse Fourier transform is defined as:

    \[f(t)=\int^\infty_{-\infty}\hat{f}(\xi)e^{2\pi i t \xi} d\xi\]

If f(t) itself is a Gaussian function, then \hat{f}(\xi) would also be a Gaussian function but with different shapes. The more spread f(t) is, the sharper \hat{f}(\xi) is [9]. This means, a spread f(t) changes slowly so it has less high-frequency components.

Now let’s prove the convolution theorem, which states that the Fourier transform of the convolution of two functions is equal to the product of the Fourier transform of each function. The proof comes from the last page of [10]:

Suppose we have f(t)=(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau, and suppose \hat{f}(\xi), \hat{g}(\xi), and \hat{h}(\xi) are the corresponding Fourier transformed functions. From the inverse Fourier transform we have g(t)=\int^\infty_{-\infty} \hat{g}(\xi) e^{2\pi i t \xi}d\xi and h(t)=\int^\infty_{-\infty} \hat{h}(\xi) e^{2\pi i t \xi}d\xi.

By the definition of convolution:

(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau \newline=\int^\infty_{-\infty} g(\tau)\left[\int^\infty_{-\infty} \hat{h}(\xi) e^{2\pi i (t-\tau) \xi}d\xi\right]d\tau \newline=\int^\infty_{-\infty} \hat{h}(\xi) \left[\int^\infty_{-\infty} g(\tau) e^{2\pi i (t-\tau) \xi}d\tau\right]d\xi \quad \quad \text{swap the order of integration}\newline=\int^\infty_{-\infty} \hat{h}(\xi) \left[\int^\infty_{-\infty} g(\tau) e^{-2\pi i \tau \xi} d\tau\right] e^{2\pi i t \xi} d\xi\newline=\int^\infty_{-\infty} \hat{h}(\xi) \hat{g}(\xi) e^{2\pi i t \xi} d\xi \newline = IFT\left[\hat{h}(\xi) \hat{g}(\xi)\right] \quad\quad Q.E.D.
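The discrete analogue of the convolution theorem can be verified directly with NumPy (a sketch using circular convolution, which is what the DFT version of the theorem applies to):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=64)
h = rng.normal(size=64)

# Circular convolution computed from the definition...
direct = np.array([sum(g[k] * h[(t - k) % 64] for k in range(64))
                   for t in range(64)])

# ...equals the inverse DFT of the product of the DFTs.
via_fft = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)).real
```

The two arrays agree to machine precision, mirroring the continuous proof above.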

I think [6] summarizes well the relationship between convolution and Fourier Transform [7]. 

Now let’s understand the last sentence from the last slide: convolution with a Gaussian will preserve low-frequency components while reducing high-frequency components.

Suppose we have a rectangle function h(t) =\sqcap(t/2) with width 2 and height 1. 
We know that its fourier transform is (https://www.wolframalpha.com/input/?i=+rect%28t%2F2%29+from+-2+to+2):

After applying Gaussian kernel smoothing (e.g., with g(t) being the PDF of \mathcal{N}(0, 1)), as we introduced in the application section, we probably get a smoother function, roughly like a triangle function f(t)=(g*h)(t)=\Lambda(t) (https://www.wolframalpha.com/input/?i=triangle%28t%29):

\Lambda(t)‘s fourier transform is (https://www.wolframalpha.com/input/?i=fourier+transform+%28triangle%28t%29%29):

Apparently, we can see that the high frequency components of \Lambda(t) are much smaller than those of \sqcap(t/2).

From the point of view of the convolution theorem, the Fourier transform of (g*h)(t) is equal to \hat{g}(\xi) \hat{h}(\xi). We know that \hat{h}(\xi) is a sinc function and \hat{g}(\xi) is a Gaussian function, therefore the high frequency components of \hat{g}(\xi)\hat{h}(\xi) are attenuated.

 

References

[1] https://czxttkl.com/2018/01/20/my-understanding-in-bayesian-optimization/

[2] https://czxttkl.com/2017/12/28/revisit-support-vector-machine/

[3] https://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf

[4] https://www.youtube.com/watch?v=O8CfrnOPtLc

[5] http://www.surajrampure.com/resources/ds100/KDE.html

[6] https://web.stanford.edu/class/cs279/lectures/lecture9.pdf

[7] https://czxttkl.com/2018/10/07/eulers-formula/

[8] https://en.wikipedia.org/wiki/Fourier_transform#Other_conventions

[9] http://pages.stat.wisc.edu/~mchung/teaching/MIA/reading/diffusion.gaussian.kernel.pdf.pdf

[10] http://ugastro.berkeley.edu/infrared09/PDF-2009/convolution2.pdf

[11] https://matthew-brett.github.io/teaching/smoothing_intro.html

Optimization with discrete random variables

In this post, I’m going to talk about common techniques that enable us to optimize a loss function w.r.t. discrete random variables. I may go off on a tangent on various models (e.g., variational auto-encoder and reinforcement learning) because these techniques come from many different areas so please bear with me.

We start from variational auto encoder (VAE) with continuous latent variables. Then we will look at VAE with discrete latent variables.

VAE

With generative modeling in mind, we suppose every data point x is generated by some latent variable z, whose posterior distribution is p(z|x). We’d like to use a parameterized distribution q_\theta(z|x) to approximate p(z|x) (this is the encoder of VAE). Usually, q_\theta(z|x) is parameterized as a Gaussian distribution: q_\theta(z|x) \sim \mathcal{N}\left(\mu_\theta(x), \Sigma_\theta(x)\right). To make p(z|x) and q_\theta(z|x) close to each other, we minimize the KL-divergence between them:

\theta^* = argmin_\theta\; KL\left(q_\theta(z|x) || p(z|x) \right ) \newline= argmin_\theta \; \mathbb{E}_{q_\theta} [log\;q_\theta(z|x)] - \mathbb{E}_{q_\theta} [log\;p(z,x)]+log\;p(x)

In the formula above, \mathbb{E}_{q_\theta}[f(z)] means \int q_\theta(z|x) f(z) dz, and log \; p(x) is a constant w.r.t. \theta and thus can be omitted. With simple re-arrangement, we can equivalently maximize the so-called ELBO function in order to find \theta^*:

ELBO(\theta) \newline= \mathbb{E}_{q_\theta} [log\;p(z,x)] - \mathbb{E}_{q_\theta} [log\;q_\theta(z|x)]\newline=\mathbb{E}_{q_\theta}[log\;p(x|z)] + \mathbb{E}_{q_\theta}[log\;p(z)] - \mathbb{E}_{q_\theta}[log\;q_\theta(z|x)] \quad\quad p(z) \text{ is the prior of } z \newline= \mathbb{E}_{q_\theta}[log\;p(x|z)]  - \mathbb{E}_{q_\theta} [log \frac{q_{\theta}(z|x)}{p(z)}]\newline=\mathbb{E}_{q_\theta}[log\;p(x|z)] - KL\left(q_\theta(z|x) || p(z)\right) 

Practically, p(x|z) is also fitted by a parameterized function p_\phi(x|z) (this is the decoder of VAE). So the ultimate objective function we have for fitting a VAE is:

(1)   \begin{equation*} argmax_{\theta, \phi} \; \mathbb{E}_{x\sim D} \left[\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right]\end{equation*}

We can interpret the objective function in the following way: \mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] can be thought of as the so-called “reconstruction error”, which encourages the reconstructed x to be as close to the original x as possible. KL\left(q_\theta(z|x) || p(z)\right) encourages q_\theta(z|x) to be close to the prior of z. For more about ELBO and variational inference, please refer to an older post [8].

——————– Update 09/16/2021 ———————

A good post overviewing VAE: https://stats.stackexchange.com/a/315613/80635

Optimize VAE

Optimizing Eq.1 requires additional care. If using stochastic gradient descent, you might want to first sample an x from the data distribution D, then sample z from q_\theta(z|x), to compute:

(2)   \begin{equation*} log\;p_\phi(x|z) - KL\left(q_\theta(z|x) || p(z)\right)\end{equation*}

However, the gradient of Eq.2 w.r.t. \theta is not a Monte Carlo sample of the real gradient. This is because \mathbb{E}_{q_\theta} in Eq.1 itself depends on the learning parameter \theta. Now we introduce two general ways to optimize Eq.1 [1].

The first method is called score function estimator. From now on, we will ignore the part \mathbb{E}_{x\sim D}. The gradient of Eq.1 w.r.t. \theta can be written as:

(3)   \begin{align*}&\nabla_\theta \left\{\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right\}\\=&\nabla_\theta \left\{\mathbb{E}_{q_\theta}\left[log\;p_\phi(x, z) - \log q_\theta(z|x)\right] \right\} \quad\quad  \text{ rewrite KL divergence} \\ =&\nabla_\theta \; \int q_\theta(z|x) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]dz  \\=& \int \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]\nabla_\theta q_\theta(z|x) dz + \int q_\theta(z|x) \nabla_\theta \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] dz \\=& \mathbb{E}_{q_\theta}\left[ \left(log\;p_\phi(x, z) - \log q_\theta(z|x) \right) \nabla_\theta \log q_\theta(z|x) \right] + \mathbb{E}_{q_\theta}\left[\nabla_\theta \log p_\phi(x, z)\right] - \mathbb{E}_{q_\theta}\left[ \nabla_\theta \log q_\theta(z|x) \right] \\&\text{---  The second term is zero because there is no }\theta \text{ in } \log p_\phi(x,z) \\&\text{---  The third term being zero is a common trick. See Eqn. 5 in [1]} \\=& \mathbb{E}_{q_\theta}\left[ \left(log\;p_\phi(x, z) - \log q_\theta(z|x) \right) \nabla_\theta \log q_\theta(z|x) \right]\end{align*}

Now we’ve moved the derivative inside the expectation so we can sample z from q_\theta(z|x) to get Monte Carlo sampling of the gradient.  

The second method to optimize Eqn. 1 is called the pathwise gradient estimator, which uses the reparameterization trick. We’ve seen the trick used in Soft Actor Critic [9]. Here is how Eqn. 1 can be rewritten under the assumption that z = f_\theta(x) + \epsilon, where \epsilon \sim p(\epsilon) is an independent source of noise:

(4)   \begin{align*}&\nabla_\theta \left\{\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right\}\\=&\nabla_\theta \left\{\mathbb{E}_{q_\theta}\left[log\;p_\phi(x, z) - \log q_\theta(z|x)\right] \right\} \quad\quad  \text{ rewrite KL divergence} \\=&\nabla_\theta \; \int q_\theta(z|x) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]dz  \\=&\nabla_\theta \; \int p(\epsilon) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]d\epsilon \quad\quad \\&\text{--- Above uses the property of changing variables in probability density functions.} \\&\text{--- See discussion in [10, 11]} \\=& \int p(\epsilon) \nabla_\theta \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]d\epsilon \\=& \int p(\epsilon) \nabla_z \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] \nabla_\theta z d\epsilon \\=& \mathbb{E}_{p(\epsilon)} \left[ \nabla_z \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] \nabla_\theta f_\theta(x) \right]\end{align*}

I think [1] summarizes really well that the pathwise estimator generally has lower variance in practice:
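A tiny illustration on a problem with a known answer (my own toy example, not from [1]): estimating \nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(\theta, 1)}[z^2] = 2\theta with both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000
z = theta + rng.normal(size=n)   # reparameterized: z = theta + eps

# Score-function estimator: f(z) * d/dtheta log N(z; theta, 1) = z^2 * (z - theta)
score = z**2 * (z - theta)
# Pathwise estimator: df/dz * dz/dtheta = 2z * 1
pathwise = 2 * z

# Both are unbiased for 2 * theta = 3, but the pathwise samples
# have far lower variance, so their mean converges much faster.
```

This matches the summary from [1]: the score-function estimator only needs log-density gradients but pays for it in variance; the pathwise estimator exploits the derivative of f itself.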

In the end, I’d like to share a course which covers more topics on optimization with discrete random variables [6]. 

Optimize Discrete VAE

The idea of discrete VAE is that the hidden variables are represented as discrete random variables, so that they can be better understood by humans as a clustering of the data. The question is how to perform the reparameterization trick when z is a discrete random variable. The answer relies on the Gumbel-Softmax trick [12].

Suppose a discrete random variable z has k classes with class probabilities \pi_1, \pi_2, \cdots, \pi_k, which can be parameterized by a neural network consuming the raw data. There are several steps to perform the Gumbel-Softmax trick. First, z is represented as a one-hot encoding vector. The active component is the one with the highest value of the following quantity:

(5)   \begin{align*}\arg\max_i \left[ g_i + log \pi_i \right]\end{align*}

  
The beauty of Eqn. 5 is that it is equivalent to drawing a sample from the categorical distribution \pi_1, \pi_2, \cdots, \pi_k.

Second, the one-hot encoding vector is relaxed to a continuous vector y \in \mathbb{R}^k, with each component as:

(6)   \begin{align*}y_i = \frac{exp\left( (log \pi_i + g_i) / \tau \right)}{\sum^k_{j=1} exp \left( (log\pi_j + g_j) / \tau \right)},\end{align*}


where g_1, \cdots, g_k are i.i.d. samples drawn from the Gumbel(0,1) distribution, and \tau is annealed throughout training so that y gets closer and closer to a one-hot encoding vector.
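Both steps can be sketched in NumPy (my own helper; the class probabilities and `tau` below are illustrative):

```python
import numpy as np

def gumbel_softmax(pi, tau, rng):
    """Draw a relaxed one-hot sample y (Eqn. 6) from class probabilities pi."""
    g = -np.log(-np.log(rng.uniform(size=len(pi))))   # i.i.d. Gumbel(0, 1)
    logits = (np.log(pi) + g) / tau
    y = np.exp(logits - logits.max())                  # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])

# Eqn. 5: argmax_i (g_i + log pi_i) is an exact sample from Categorical(pi).
draws = [int(np.argmax(np.log(pi) - np.log(-np.log(rng.uniform(size=3)))))
         for _ in range(20_000)]

# Eqn. 6: a relaxed (differentiable) sample on the simplex.
y = gumbel_softmax(pi, tau=0.5, rng=rng)
```

Over many draws, the argmax frequencies match \pi, while y is a point on the probability simplex that concentrates on one vertex as \tau shrinks.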

Finally, depending on specific problems, we will forward y as some differentiable quantity to downstream models (discussed in Section 2.2 in [12]). If the problem is about learning hidden representations, we can have fully connected layers on top of the whole y. If the problem requires sampling discrete actions as in reinforcement learning, in the forward pass we would need to sample an action by taking \arg\max on y, while in the backward pass we would use the Straight-Through operator, approximating \nabla_\theta z \approx \nabla_\theta y.

Here is an example of a Gumbel-Softmax trick-based discrete VAE, which I adapted from [13]. In the code, we create 30 10-category random variables as the hidden representation z; as a result, z becomes a 300-dim continuous vector during training. In testing (where we create images in data/sample_), we randomly create 30 one-hot encoding vectors of dim 10 as the hidden representation, and ask the decoder to generate images for us.

An earlier attempt to train discrete VAEs can be seen in [5] and [3], which use embedding table look-up + straight-through gradients to train such models.

References

[1] Lecture Notes. Part III: Black-Box Variational Inference

[2] http://edwardlib.org/tutorials/klqp 

[3] Reproducing Neural Discrete Representation Learning

[4] https://github.com/ritheshkumar95/pytorch-vqvae

[5] Neural Discrete Representation Learning

[6] https://duvenaud.github.io/learn-discrete/

[7] Discrete Variational Autoencoders

[8] https://czxttkl.com/2019/05/04/stochastic-variational-inference/

[9] https://czxttkl.com/2018/10/30/notes-on-soft-actor-critic-off-policy-maximum-entropy-deep-reinforcement-learning-with-a-stochastic-actor/

[10] https://math.stackexchange.com/questions/930931/explanation-of-how-probability-density-functions-transform-under-the-change-of-v

[11] https://en.wikipedia.org/wiki/Probability_density_function#Function_of_random_variables_and_change_of_variables_in_the_probability_density_function

[12] Categorical Reparameterization with Gumbel-Softmax

[13] https://github.com/YongfeiYan/Gumbel_Softmax_VAE

Control Variate

In this post, let’s talk about a variance reduction method called “control variates”, yet another technique besides importance sampling [1].

We start from a simple example: we want to estimate \mathbb{E}_{p(x)}[f(x)], the expectation of a function f applied to a random variable x sampled from p(x). Analytically, \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx. You can also think of \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx as \mathbb{E}_{q(x')}[x'] = \int q(x')x'dx', the expectation of another random variable x'=f(x) (with density q) without any function applied to it. Since I like the form \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx better, I’ll keep using it throughout the post. In our setting, we can evaluate f(x) for any given x but do not know the analytic form of f(x) in advance. Using Monte Carlo naively, we can draw N samples from p(x) and average their values under f: \mathbb{E}_{p(x)}[f(x)]\approx \frac{1}{N}\sum\limits_{i=1}^N f(x_i)

We can use a control variate to reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)]. In the control variate method, we introduce another function h(x), which we can evaluate at any x and whose expectation \theta=\mathbb{E}_{p(x)}[h(x)] we know exactly (note: we don’t know that for f(x)).

The control variate method says that the following estimator z(x), which can itself be seen as a random variable, is an unbiased estimator of \mathbb{E}_{p(x)}[f(x)]:

z(x) = f(x) - h(x) + \theta

It is easy to see that z(x) is an unbiased estimator:

\mathbb{E}_{p(x)}[z(x)]\newline=\mathbb{E}_{p(x)}[f(x)]-\mathbb{E}_{p(x)}[h(x)] + \theta\newline=\mathbb{E}_{p(x)}[f(x)]-\theta + \theta\newline=\mathbb{E}_{p(x)}[f(x)]

Now let’s look at the variance of z(x):

Var_{p(x)}[z(x)] \newline = Var_{p(x)}\left[f(x) - h(x)+\theta\right] \newline = Var_{p(x)}\left[f(x)-h(x)\right]  \quad \quad \text{constant doesn't contribute to variance}\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-h(x)-\mathbb{E}_{p(x)}\left[f(x)-h(x)\right] \right)^2\right] \quad\quad Var(x)=\mathbb{E}[(x-\mathbb{E}(x))^2] \newline=\mathbb{E}_{p(x)}\left[\left( f(x)-\mathbb{E}_{p(x)}[f(x)] - \left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right) \right)^2\right]\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)^2\right] + \mathbb{E}_{p(x)}\left[\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)^2\right] \newline - 2 * \mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)\right]\newline=Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]

Thus, as long as we introduce h(x) such that Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]< 0, we will reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)].

A more advanced method is called regression sampling [2]. We can have a tunable parameter c in z(x):

z(x) = f(x) + c\cdot (h(x) - \theta)

In the basic control variate method, c=-1. When c is tunable, the variance of z(x) becomes:

Var_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right]+c^2 \cdot Var_{p(x)}\left[h(x)\right] + 2c * Cov_{p(x)}\left[f(x), h(x)\right],

which is a quadratic function of c. Therefore, the optimal coefficient is:

c^*=-\frac{Cov_{p(x)}[f(x),h(x)]}{Var_{p(x)}[h(x)]}

With c^*, the optimal variance of z(x) can be achieved at:

Var^*_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right] - \frac{\left(Cov_{p(x)}\left[f(x), h(x)\right] \right)^2}{Var_{p(x)}[h(x)]}\newline=\left(1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2\right)Var_{p(x)}\left[f(x)\right] \quad\quad Corr(x,y)=Cov(x,y)/\left(stdev(x)\cdot stdev(y)\right)

This shows that the more correlated f(x) and h(x) are with each other, the more variance we can potentially reduce.

Now, let’s do an experiment to verify the control variate method indeed reduces variance. We let p(x) be a uniform distribution in range [0, 10], f(x)=x^2 and h(x)=x. Statistical quantities can be calculated as:

\mathbb{E}_{p(x)}[f(x)]=\frac{100}{3}
Var_{p(x)}[f(x)]=\frac{8000}{9}
\mathbb{E}_{p(x)}[h(x)]=5
Var_{p(x)}[h(x)]=\frac{25}{3}
Cov_{p(x)}\left[f(x), h(x)\right]=\frac{250}{3}

Therefore, according to the derived theory, we have:

Var_{p(x)}[z(x)] \newline =Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]\newline =\frac{8000}{9} + \frac{25}{3}-\frac{2\cdot 250}{3}\newline=\frac{6575}{9}<Var_{p(x)}[f(x)].

In other words, we will expect to see around 82.2% of the original variance.

Here is the code:

import numpy

T = 20000  # num of rounds
N = 10   # num of samples per round
a, b = 0., 10.   # lower bound, upper bound
expct_hx = 5.

# monte carlo estimation
mc_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    mc_res.append(numpy.mean(x**2))

# control variate estimation using h(x)
cv_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    cv_res.append(numpy.mean(x ** 2 - x + expct_hx))

print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Control variate estimation (mean/std): {0}/{1}".format(numpy.mean(cv_res), numpy.std(cv_res)))
print(f"Variance reduction percentage {(numpy.std(cv_res)/numpy.std(mc_res))**2}")

Here’s the result and indeed we see around 82% variance as we use the control variate:

MC estimation (mean/std): 33.292873225286186/9.426090731496881
Control variate estimation (mean/std): 33.37154726826849/8.61508822016471
Variance reduction percentage 0.8353264371911822
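As a sketch of the regression sampling idea, we can extend the experiment above with the optimal coefficient. In this setup (f(x)=x^2, h(x)=x, uniform over [0, 10]), c^* = -(250/3)/(25/3) = -10 analytically; estimating c^* from a pilot batch of samples is my own illustrative choice, not a step from [2]:

```python
import numpy

numpy.random.seed(0)

a, b = 0., 10.   # lower bound, upper bound
expct_hx = 5.    # theta = E[h(x)] for h(x) = x under Uniform(0, 10)
f = lambda x: x ** 2
h = lambda x: x

# Estimate c* = -Cov[f(x), h(x)] / Var[h(x)] from a pilot batch
pilot = numpy.random.uniform(low=a, high=b, size=100000)
c_star = -numpy.cov(f(pilot), h(pilot))[0, 1] / numpy.var(h(pilot), ddof=1)
print(c_star)  # analytically, c* = -(250/3) / (25/3) = -10

T, N = 20000, 10
cv_res, opt_res = [], []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    cv_res.append(numpy.mean(f(x) - h(x) + expct_hx))              # c = -1
    opt_res.append(numpy.mean(f(x) + c_star * (h(x) - expct_hx)))  # c = c*

# With c*, the remaining variance is (1 - Corr^2) * Var[f] -- much smaller
print(numpy.var(cv_res) / numpy.var(opt_res))
```

Since Corr^2 is about 0.94 here, the c^* estimator keeps only about 6% of the original variance, far better than the 82% achieved by c=-1.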

In the end, I want to mention that we have been analyzing Var_{p(x)}[z(x)] under the assumption that z(x) is computed from a single sample. If we always collect N samples and use their mean as the estimate of \mathbb{E}_{p(x)}[f(x)], then Var_{p(x)}[\bar{z}(x)]=Var_{p(x)}\left[\frac{\sum\limits_{i=1}^N z(x_i)}{N}\right]=N \cdot Var_{p(x)}\left[\frac{z(x)}{N}\right]=\frac{Var_{p(x)}\left[z(x)\right]}{N}. This matches our intuition that the more samples you average over, the less variance you have. However, the relative fraction of variance you can save by introducing h(x) remains the same no matter how large N is; with the optimal c^*, the remaining fraction is 1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2.
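The 1/N scaling can be checked quickly (a sketch reusing the same uniform distribution and f(x)=x^2 as above):

```python
import numpy

numpy.random.seed(0)

T = 20000  # number of estimation rounds

def var_of_mean(N):
    # Variance (across T rounds) of the N-sample Monte Carlo mean of f(x) = x^2
    estimates = [numpy.mean(numpy.random.uniform(0., 10., size=N) ** 2)
                 for _ in range(T)]
    return numpy.var(estimates)

v10, v40 = var_of_mean(10), var_of_mean(40)
print(v10 / v40)  # close to 4: averaging 4x more samples cuts variance by ~4x
```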

I also want to mention that there are other variance reduction techniques applicable in other scenarios, such as importance sampling and the reparameterization trick. People have shown empirically that the reparameterization trick can also reduce variance [4], but so far I haven’t found a theoretical proof.

References

[1] https://czxttkl.com/2017/03/30/importance-sampling/

[2] https://en.wikipedia.org/wiki/Control_variates

[3] https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf

[4] https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb