Mode collapse is real for generative models

I am very curious to see whether generative models like GANs and VAEs can fit multimodal data. [1] gives an overview of different generative models, mentioning that VAEs have a clear probabilistic objective function and are more efficient.

[2] showed that diffusion models (score-based generative models) can fit multimodal distributions better than VAEs and GANs. [4] mentions that VAEs do not suffer from mode collapse the way GANs do, but they have a similar problem: the decoder may ignore the latent representation and generate things arbitrarily, which InfoVAE [5] can mitigate. [6] also suggests that beta-VAE and adversarial autoencoders [7] achieve better performance than the vanilla VAE.

I did a hands-on toy experiment, where I generated 3-mode distributions and let a vanilla VAE fit them. The results are not very satisfying, given that this is only a very simple toy problem. This suggests we may need the more advanced VAE variants described above.

First, when the three modes are well separated and well balanced (I generated 33% of the data at -10, 33% at 0, and the other 33% at 10):

However, when the three modes are -1, 0, and 1, generated data are not well separated:

When the three modes are -0.1, 0, and 0.1, generated data only concentrate on one mode (mode collapse!):

I also tried giving the three modes imbalanced amounts of data, and the VAE did not perform well either:

 

The toy experiment’s code can be found in `train_VAE_simple.py` in https://www.dropbox.com/sh/n8lbdzvrdvahme7/AADFgs_JLz8BhHzOaUCcJ1MZa?dl=0.
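For readers who want a self-contained snippet, here is a minimal sketch of the toy setup (a vanilla VAE on 1-D data with three modes). It is not the actual train_VAE_simple.py; the architecture and hyperparameters are my own choices for illustration.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, latent_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

# Three well-separated, balanced modes at -10, 0, and 10 (the first setting above).
modes = torch.tensor([-10.0, 0.0, 10.0])
x = modes[torch.randint(0, 3, (3000,))].unsqueeze(1) + 0.1 * torch.randn(3000, 1)

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    recon, mu, logvar = model(x)
    recon_loss = ((recon - x) ** 2).mean()
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    loss = recon_loss + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate samples by decoding from the prior and inspect how many modes are covered.
with torch.no_grad():
    samples = model.decoder(torch.randn(1000, 2))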

Lastly, I'd like to share that [3] seems to be a good tutorial introducing diffusion models.

 

References

[1] Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning: https://arxiv.org/pdf/1705.00470.pdf

[2] Can Push-forward Generative Models Fit Multimodal Distributions?: https://openreview.net/pdf?id=Tsy9WCO_fK1

[3] https://www.youtube.com/watch?v=cS6JQpEY9cs

[4] https://ai.stackexchange.com/questions/8879/why-doesnt-vae-suffer-mode-collapse

[5] https://ermongroup.github.io/blog/a-tutorial-on-mmd-variational-autoencoders/

[6] https://ameroyer.github.io/reading-notes/representation%20learning/2019/05/02/infovae.html

[7] https://medium.com/vitrox-publication/adversarial-auto-encoder-aae-a3fc86f71758

Causal Inference in Recommendation Systems

We have briefly touched on some concepts of causal inference in [1, 2]. This post introduces some more specific works that apply causal inference to recommendation systems. Some of these works require background on backdoor and frontdoor adjustments, so we will introduce those first.

Backdoor and frontdoor adjustment 

Suppose we have a causal graph like the one below and we want to know the causal effect of X on Y. This can be translated into computing P(y|do(X=x)).

(Figure: the original causal graph.)

P(y|do(X=x)) can be seen as the post-intervention distribution on a modified causal graph, in which we intervene on X by setting it to a specific value x. The parents of X no longer affect its value, which corresponds to removing all edges from the parents of X in the original causal graph:

(Figure: the modified causal graph with incoming edges to X removed.)

When the backdoor or frontdoor criterion is satisfied, we can apply the backdoor or frontdoor adjustment to express P(y|do(X=x)) using only the observational distribution on the original causal graph:

Backdoor:

    \[P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u)\]

Frontdoor:

    \[P(y|do(X=x)) = \sum\limits_z P(z|x) \sum\limits_{x'} P(y|x', z)P(x')\]

The difference between the two is that the backdoor adjustment assumes all confounding factors are observed (note that u appears in the backdoor adjustment formula but not in the frontdoor adjustment formula).

I found two resources that shed light on how the backdoor and frontdoor adjustment formulas are derived. Let's first look at the derivation of the backdoor adjustment formula, following [3]:

    \[P(Y|do(X=x)) := P^*(Y|x) \newline \text{(where }P^*\text{ is the post-intervention distribution in the modified causal graph)} \newline = \sum\limits_u P^*(Y|x,u)P^*(u|x) \newline= \sum\limits_u P^*(Y|x,u)P^*(u) \newline(\text{because }P^*(U|X)=P^*(U) \text{ in the modified causal graph}) \newline =\sum\limits_u P(Y|x,u)P(u) \newline (\text{because the observational and post-intervention distributions have } \newline \text{the same causal structure for } Y \text{ and } U)\]

Note that the backdoor adjustment assumes we can observe the confounder. That is why you see the summation over u in P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u). The frontdoor adjustment assumes we cannot observe u. Therefore, we have to use other observational distributions to decompose P(y|do(X=x)). While [3] also explains how to derive the frontdoor adjustment for P(y|do(X=x)), I feel [4] does a better job of explaining it:

    \[P(Y|do(X=x)) \newline= \sum\limits_z P(Y|z, do(X=x))P(z|do(X=x)) \newline =\sum\limits_z P(Y|z, do(X=x))P(z|x) \newline (P(z|do(X=x))=P(z|x) \text{ because there is no confounder between X and Z}) \newline (\text{However, we can't change }P(Y|z, do(X=x)) \text{ to }P(Y|z, x) \text{ because} \newline \text{there is an (unobserved) confounder between X and Y} )\newline=\sum\limits_z P(Y|do(Z=z), do(X=x))P(z|x) \newline \text{(a trick to change from } P(Y|z, do(X=x))\text{ to } P(Y|do(Z=z), do(X=x))\text{)} \newline = \sum\limits_z P(Y|do(Z=z))P(z|x) \newline (do(X=x)\text{'s effect is completely blocked by }do(Z=z)\text{,}\newline\text{hence }P(Y|do(Z=z), do(X=x))=P(Y|do(Z=z))) \newline=\sum\limits_z P(z|x) \left[\sum\limits_{x'}P(Y|z, x')P(x')\right] \newline \text{(applying the backdoor adjustment between } do(Z=z) \text{ and } Y\text{)}\]

Note that in the last equation, P(Y|z, x') cannot be further simplified to P(Y|z) because of the presence of U which can affect both X and Y [5].
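To make the backdoor formula concrete, here is a tiny numerical example on a fully observed discrete model U -> X, U -> Y, X -> Y. All the probability tables are made up; the point is only that adjusting by P(u) differs from naively conditioning on X.

import numpy as np

p_u = np.array([0.4, 0.6])                      # P(U=0), P(U=1)
p_x_given_u = np.array([[0.8, 0.2],             # P(X=x | U=u), rows indexed by u, cols by x
                        [0.2, 0.8]])
p_y1_given_xu = np.array([[0.1, 0.5],           # P(Y=1 | X=x, U=u), rows u, cols x
                          [0.4, 0.9]])

# Naive conditioning: P(Y=1 | X=1) = sum_u P(Y=1|X=1,u) P(u|X=1)
p_x1 = (p_u * p_x_given_u[:, 1]).sum()
p_u_given_x1 = p_u * p_x_given_u[:, 1] / p_x1
naive = (p_y1_given_xu[:, 1] * p_u_given_x1).sum()

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_u P(Y=1|X=1,u) P(u)
backdoor = (p_y1_given_xu[:, 1] * p_u).sum()

print(f"P(Y=1|X=1)     = {naive:.3f}")      # biased by the confounder (0.843 here)
print(f"P(Y=1|do(X=1)) = {backdoor:.3f}")   # the interventional quantity (0.740 here)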

Backdoor adjustment in real-world papers

After introducing the backdoor/frontdoor adjustments, let's examine how they are used in practice. I came across two papers [9, 10], both of which apply the backdoor adjustment.

[9] tries to account explicitly for popularity bias, which many other models fail to take into account. In the diagram below, I represents items, U represents users, C represents user interaction with specific items, and Z represents popularity. Popularity is a confounder for items because more popular items tend to get exposed more often, creating a feedback loop that keeps biasing the model. Popularity is also a cause of user interaction because users tend to have a "herd mentality" and follow the majority in consuming popular items.

To really know the intrinsic value of an item to a user, we have to estimate P(C|do(U,I)). By applying the backdoor adjustment, we have P(C|do(U,I))=\sum\limits_z P(C|U, I, z) P(z). P(C|U, I, z) can be estimated by any deep learning model that takes user-side features, item-side features, and popularity as input. But the paper has a nice trick: it decomposes P(C|U, I, z) = intrinsic\_value_{\theta}(U,I) \cdot pop\_adjust(z), where \theta denotes model parameters. They can then use intrinsic\_value_{\theta}(U,I) to rank items by their real intrinsic value.
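Here is a hedged sketch of that factorized-scoring idea, not the exact model of [9]: the embedding sizes, the softplus choice, and the z^\gamma popularity term are my assumptions for illustration.

import torch
import torch.nn as nn

class DeconfoundedScorer(nn.Module):
    def __init__(self, num_users, num_items, dim=32, gamma=0.1):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.gamma = gamma  # strength of the popularity adjustment (assumed hyperparameter)

    def intrinsic_value(self, u, i):
        # Intrinsic user-item affinity, kept non-negative with softplus.
        return nn.functional.softplus((self.user_emb(u) * self.item_emb(i)).sum(-1))

    def forward(self, u, i, popularity):
        # Training-time score: intrinsic value times a popularity term z^gamma.
        return self.intrinsic_value(u, i) * popularity.pow(self.gamma)

scorer = DeconfoundedScorer(num_users=100, num_items=500)
u = torch.randint(0, 100, (8,))
i = torch.randint(0, 500, (8,))
pop = torch.rand(8)                         # observed item popularity in [0, 1]
train_score = scorer(u, i, pop)             # fit this against clicks during training
serve_score = scorer.intrinsic_value(u, i)  # rank by intrinsic value at serving time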

[10] provides another application of the backdoor adjustment. It tackles the video-length bias commonly seen in modern video recommendation systems. The bias exists because video watch time is often a topline metric that video recommendation systems try to optimize. As a result, long videos tend to be overexposed, which amplifies the video-length bias in a feedback loop. The causal graph the authors model is shown below, where U, V, D, and W represent user, video, duration, and watch time, respectively:

The backdoor adjustment to estimate the real intrinsic value E[W|do(U,V)] is 

Basically, the formula says that we group videos into different duration buckets. The real intrinsic value h(u, v) is an estimator of the watch-time percentile within the video's duration bucket; h(u, v) can be a single shared model that predicts the watch-time percentile across buckets.

It is interesting to see that during inference, they still rank by predicted watch time, not by the intrinsic value h(u,v):

A natural question is why we could not simply use a deep learning model that takes user and video features as well as the confounder (duration) as input. This goes back to the motivation of the paper itself [10]: such a model tends to predict high watch time for long videos simply because they are over-exposed (i.e., many of them are placed in top positions, so long videos naturally attract more watch time). By bucketing durations and changing the labels to percentiles within each bucket, the model's predictions become less biased.
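A minimal sketch of this label construction might look like the following; the number of buckets, the column names, and the use of equal-frequency buckets are my assumptions rather than the paper's exact recipe.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration": rng.uniform(10, 600, size=10_000),    # video length in seconds
    "watch_time": rng.uniform(0, 600, size=10_000),   # observed watch time
})

# Equal-frequency duration buckets (the confounder D in the causal graph).
df["duration_bucket"] = pd.qcut(df["duration"], q=10, labels=False)

# Within each bucket, the regression label is the watch-time percentile,
# which removes the mechanical advantage of long videos.
df["watch_percentile"] = df.groupby("duration_bucket")["watch_time"].rank(pct=True)

# A model h(u, v) would then be trained to predict `watch_percentile`
# given user and video features plus the bucket id.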

Causal Discovery

The paper [7] introduces a cool way to trim user history in user sequence modeling. The context is that in sequential recommendation problems, the objective function can be formulated as follows:

    \[min -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k} log f(\vec{v}_k^j | \mathbf{H}_k^j),\]

where \vec{v}_k^j is user k's j-th interacted item, represented as a one-hot vector, and \mathbf{H}_k^j is the user's interaction history before the j-th item. f(\cdot) can be a softmax so that minimizing the objective encourages the model to accurately predict which item the user is most likely to interact with next, given the history.

However, user history may contain many irrelevant items, which makes prediction hard. [7] proposes using causal discovery to find causally related items so that only a purified history is used to predict the next items users may interact with. The sequential model then works with a more focused history and can potentially be more accurate.

Denoting all users' history as \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}, the objective becomes:

    \[min \; - \ell(\mathbf{V}, W) + \lambda ||W||_1 \newline = min -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k}\sum\limits_{b=1}^{|\mathcal{V}|} log f(\left[\vec{v}_k^j\right]_b | \mathbf{\tilde{H}}_{kb}^j(W)) + \lambda ||W||_1 \newline s.t. \quad trace(e^{W \odot W})=\vert\mathcal{V}\vert,\]

where W \in \mathbb{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the discovered causal graph between items, and \mathbf{\tilde{H}}_{kb}^j(W) is the purified history for each potential item b at the time of prediction. W is a real-valued matrix that approximates the actual binarized causal DAG; i.e., binarize(W_{ij}) = 1 means that a change in item i causes a change in item j.

Note that the form of min - \ell(\mathbf{V}, W) + \lambda ||W||_1 \; s.t. \; trace(e^{W \odot W})=\vert\mathcal{V}\vert is from NOTEARS [8], the work that introduced differentiable causal discovery. Originally in [8], \mathbf{V} \in \{0,1\}^{N\times \vert \mathcal{V} \vert}, which can be thought of as a matrix recording all interacted items for each user without temporal order. \ell(\mathbf{V}, W)=\sum\limits_{k=1}^N\sum\limits_{b=1}^{\vert\mathcal{V}\vert}(\mathbf{V}_k^b - \mathbf{V}_k^{T} W_{\cdot b})^2 is a least-squares error, which intuitively means that each observed interacted item should be explained by all other observed interacted items together with the causal graph W. [8] shows that, if B \in \{0, 1\}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the binarized version of W (the actual causal directed acyclic graph), then trace\left( e^{B} \right) = \vert\mathcal{V}\vert. W, as a relaxed version of B, can be constrained in a similar way, but the constraint needs to be slightly modified to trace(e^{W \odot W})=\vert\mathcal{V}\vert.
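As a quick sanity check of the acyclicity constraint, here is a small numerical sketch of h(W) = trace(e^{W \odot W}) - \vert\mathcal{V}\vert, which is zero exactly when W corresponds to a DAG; the example matrices are arbitrary.

import numpy as np
from scipy.linalg import expm

def notears_h(W: np.ndarray) -> float:
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # W * W is the elementwise square

dag = np.array([[0.0, 1.3, 0.0],
                [0.0, 0.0, 0.7],
                [0.0, 0.0, 0.0]])      # strictly upper triangular => acyclic
cyclic = dag.copy()
cyclic[2, 0] = 0.5                     # adds the cycle 0 -> 1 -> 2 -> 0

print(notears_h(dag))     # ~0.0: the constraint is satisfied
print(notears_h(cyclic))  # > 0: the constraint is violated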

The novelty of [7] is that it puts the causal discovery framework proposed in [8] in a sequential recommendation context such that temporal information is taken into account and \mathbf{V} is no longer just a matrix of all interacted items for each user. Instead, \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}.

On a side note, [7] also introduces a differentiable clustering method, which is pretty cool.

 

References

[1] Causal Inference https://czxttkl.com/2020/10/13/causal-inference/

[2] Tools needed to build an RL debugging tool: https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[3] https://stats.stackexchange.com/questions/312992/causal-effect-by-back-door-and-front-door-adjustments/313127#313127

[4] http://www.degeneratestate.org/posts/2018/Sep/03/causal-inference-with-python-part-3-frontdoor-adjustment/

[5] https://stats.stackexchange.com/questions/448004/intuition-behind-conditioning-y-on-x-in-the-front-door-adjustment-formula/

[6] https://stats.stackexchange.com/questions/538477/causal-inference-difference-between-blocking-a-backdoor-path-and-adding-a-vari

[7] Sequential Recommendation with Causal Behavior Discovery: https://arxiv.org/abs/2204.00216

[8] DAGs with NO TEARS Continuous Optimization for Structure Learning: https://arxiv.org/abs/1803.01422

[9] Causal Intervention for Leveraging Popularity Bias in Recommendation: https://arxiv.org/pdf/2102.11107.pdf

[10] Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation: https://arxiv.org/abs/2206.06003

Dollar cost average on TQQQ vs QQQ [Simulation]

This post runs a simple simulation of using the dollar-cost-averaging (DCA) strategy to invest in QQQ vs. TQQQ, its 3x-leveraged ETF.

In the simulation, QQQ plunges about 34% over 20 rounds. One round is a small up-and-down cycle: the index first moves up 1% and then down 3%, repeated until it is roughly 34% below the top. After reaching the bottom, I simulate another 42 rounds so that QQQ almost crawls back to its previous top; this time, within each round, the index first moves up 2% and then down 1%. In each round, we invest a fixed amount of money twice, once after the up move and once after the down move.

I simulate the same process for TQQQ. The only difference is that in each round, the ups and downs are 3x those of QQQ.

The script and results can be found below. We can see that DCA in TQQQ still gives a better return. Also, as expected, when QQQ has almost climbed back to its former price (99.94), TQQQ is still not back to its peak (88.07). However, that does not contradict the fact that DCA earns more on TQQQ than on QQQ overall.

DOWN_ROUNDS = 20   # rounds in the drawdown phase (index falls ~34% from the top)
UP_ROUNDS = 42     # rounds in the recovery phase (index crawls back near the top)
stock_price = 100  # starting index price
amounts = 0        # total shares accumulated by dollar-cost averaging
DAC = 100          # fixed dollar amount invested at each buy
SCALE = 1  # 1 for QQQ and 3 for TQQQ
for i in range(DOWN_ROUNDS):
    amounts += DAC / stock_price
    stock_price = stock_price * (1 + 0.01 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")
    amounts += DAC / stock_price
    stock_price = stock_price * (1 - 0.03 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")

for i in range(UP_ROUNDS):
    amounts += DAC / stock_price
    stock_price = stock_price * (1 + 0.02 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")
    amounts += DAC / stock_price
    stock_price = stock_price * (1 - 0.01 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")

print(f"final value={amounts * stock_price}")

# QQQ (scale=1) final value=15197.210360810055
# The last print: round=41, amount: 152.06046946267062, price=99.9418876879163

# TQQQ (scale=3) final value=22578.30484235393
# The last print: round=41, amount: 256.36563654519176, price=88.07071472846891

GATO and related AGI research

Policy Generalist

DeepMind has recently published a work named Gato. I find it interesting because Gato learns a single multi-modal, multi-task policy for many tasks such as robot arm manipulation, playing Atari, and image captioning. The original paper [2] does not include every implementation detail, but I'll do my best to summarize what I understand. I will also try to survey literature pertinent to Gato because it can be very useful on the way toward the north star of general AI.

First of all, Gato adopts the most common architecture these days: transformers. The core idea is to turn data of different modalities into sequences of tokens. Tokens are then passed through embedding tables or some model-based transformation to generate embeddings of the same dimension.

Here is how different modalities are turned into sequences. Text data are natural sequences of words; each word (or phrase) is indexed with a unique integer. Image data are split into non-overlapping 16×16 patches in raster order. Reinforcement learning tasks generate sequences of (observation, action) tuples. Rewards are not included in the sequences because (I guess) Gato mainly uses behavior cloning/imitation learning: it picks top-performing expert demonstration sequences and uses supervised learning to fit the expert actions. Continuous actions are discretized into discrete actions. For image patches, the authors apply a ResNet to generate embeddings; text and action tokens are passed through learnable embedding tables to generate theirs.

If you pay attention, you will notice that all these different kinds of tasks end up with a discrete action space. As mentioned earlier, reinforcement learning tasks have either discrete actions or continuous actions that are discretized. For chatbot/image-captioning tasks, the action space is basically the vocabulary. So the training loss used by Gato can simply be an autoregressive cross-entropy:
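I do not have the original figure with the exact equation here, but from my reading the loss is the standard masked autoregressive objective, roughly:

    \[\mathcal{L}(\theta, \mathcal{B}) = -\sum\limits_{b=1}^{|\mathcal{B}|} \sum\limits_{l=1}^{L} m(b, l) \, log \, p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \cdots, s_{l-1}^{(b)}\right),\]

where \mathcal{B} is a batch of token sequences and m(b, l) is a mask so that (if I recall correctly) only text and action tokens contribute to the loss.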

Algorithm Distillation

Algorithm distillation [3] was a novel idea to me at first glance, but I can relate it to work from a few years ago about learning to learn with RNNs [4, 5]. The idea of [3] is that when we solve one or more reinforcement learning tasks with an algorithm of interest, we can record (state, action, reward) sequences that span multiple episodes. When these multi-episode sequences are long enough, they contain information about how the algorithm improves the policy over time and how the improving policy interacted with the environment(s). We can train a sequence model (e.g., a transformer) to directly mimic these multi-episode sequences. As a result, the RL algorithm itself is distilled into the sequence model. We can then apply the sequence model to a new task and ask it to simply unroll itself. The sequence model will magically produce actions that gradually improve returns over time, just as if we had run the original RL algorithm on the new task!
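A rough sketch of the data construction is below; the episode format, the window length, and the chunking scheme are illustrative assumptions, not the exact recipe of [3].

from typing import List, Tuple

Transition = Tuple[list, int, float]            # (observation, action, reward)

def build_training_windows(
    learning_history: List[List[Transition]],   # episodes in the order they were generated
    context_len: int = 512,
) -> List[List[Transition]]:
    # Flatten episodes in training order so each window can span episode
    # boundaries and therefore contain evidence of policy improvement.
    flat: List[Transition] = [t for episode in learning_history for t in episode]
    return [flat[i:i + context_len]
            for i in range(0, len(flat) - context_len + 1, context_len)]

# A causal sequence model is then trained to predict the action of each transition
# given all preceding (observation, action, reward) tokens in the window.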

 

 

References

[1] DeepMind Gato Explained – One AI Model for 600 tasks https://www.youtube.com/watch?v=EEcxkv60nxM

[2] A Generalist Agent https://arxiv.org/abs/2205.06175

[3] In-context Reinforcement Learning with Algorithm Distillation: https://arxiv.org/abs/2210.14215

[4] Learning to Learn without Gradient Descent by Gradient Descent: https://arxiv.org/abs/1611.03824

[5] Learning to learn by gradient descent by gradient descent: https://proceedings.neurips.cc/paper/2016/file/fb87582825f9d28a8d42c5e5e5e8b23d-Paper.pdf

 

Some latest recsys papers

Seven years ago I posted a tutorial about recommendation systems. Now it is 2022 and there have been many more advancements. This post will overview several of the latest ideas.

CTR models

Google's RecSys 2022 paper [1] shares many practical details of their CTR models. First, to reduce training cost, there are three effective techniques: bottleneck layers (a layer's weight matrix is decomposed into two low-rank matrices), weight-sharing neural architecture search, and data sampling. The contribution of each technique to the reduction of training cost is listed below; data sampling appears to be particularly effective.

Second, to improve model performance, they add a rank loss (a pairwise ranking loss) on top of the pointwise cross-entropy training loss, apply distillation, use a second-order optimizer, and add a so-called Deep & Cross Network to learn feature interactions. The contribution of these techniques to accuracy is listed below:

 

Graph models 

Let's look at Pixie, Pinterest's real-time random-walk system for recommending pins [2]. The graph in this paper has pins and boards (collections of pins on similar topics) as nodes. If a user has saved pin p in board b, then there is an edge between nodes p and b. Given a query pin, a random walk can surface well-connected pins on the graph. The idea of random walks was introduced in a previous post.

Pixie makes several customizations to the random walk to boost its business value. First, it biases the walk toward user-relevant nodes (e.g., favoring pins in a language the user is predicted to know). Second, the number of random-walk steps per starting pin is allocated by a specific formula (Eqn. 2); the authors claim "the distribution gave us the desired property that more steps are allocated to starting pins with high degrees and pins with low degrees also receive sufficient number of steps". Third, pins that are reached from multiple starting pins are boosted, since this indicates they are more relevant than others. Fourth, early stopping and graph pruning make the random walk more efficient.

Interestingly, we can see that Pinterest loads the whole graph onto single machines so that random walks can run without cross-machine communication:

 

Retrieval

PinnerSage [3] outlines a solution for retrieving related pins at Pinterest. The most interesting part is that PinnerSage learns a variable number of embeddings per user, so retrieval is not constrained to pins around a single "mean" user preference.

PinnerSage uses a pre-trained embedding for each watched pin and clusters each user's pins watched over the past 90 days into a dynamic number of clusters. The clustering method, Ward, supports automatically determining the number of clusters. The user's embeddings are the medoids of his/her clusters. PinnerSage also computes an importance score for each cluster, which is used at serving time: only the embeddings of the top 3 most important clusters are used for retrieving new pins. Serving uses a typical approximate KNN searcher to retrieve new pins. The paper tried several common ANN indexing schemes (LSH Orthoplex, Product Quantization, and HNSW) and found HNSW performed best on cost, latency, and recall.
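Here is a hedged sketch of that pipeline using scipy's Ward linkage; the distance cutoff and the cluster-size importance heuristic are my assumptions, not the paper's exact settings.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def user_medoids(pin_embeddings: np.ndarray, distance_threshold: float = 5.0):
    # Ward linkage; cutting the dendrogram at a distance threshold lets the
    # number of clusters vary per user.
    Z = linkage(pin_embeddings, method="ward")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")

    medoids, importances = [], []
    for c in np.unique(labels):
        members = pin_embeddings[labels == c]
        # Medoid: the member minimizing total distance to the rest of the cluster.
        dist = cdist(members, members)
        medoids.append(members[dist.sum(axis=1).argmin()])
        importances.append(len(members))   # cluster size as a crude importance proxy
    return np.stack(medoids), np.array(importances)

pins = np.random.randn(200, 64)            # a user's recent pin embeddings
medoids, imp = user_medoids(pins)
top3 = medoids[np.argsort(-imp)[:3]]       # only the top clusters are used for retrieval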

Sequence Modeling

User history contains abundant information about user preference. PinnerFormer [4] details an industrial sequence model used at Pinterest. User history is represented as a sequence of pins; passing the sequence through a transformer generates an embedding per position. The loss function (named the Dense All Action Loss) trains each position's embedding to predict the pins the user will positively engage with in the future.

For model serving, the paper describes a daily workflow that updates the embeddings of users who had any type of engagement in the past day; users inactive in the past day keep their previously generated embeddings. This also implies that the transformer model itself is not updated frequently; otherwise, all user embeddings would need to be recomputed whenever the model changes.

There are several earlier sequence modeling works:

  1. Deep Interest Network [6]. In essence, DIN is a pioneering work that learns attention between the current ad and all previously interacted ads. Attention scores can be understood as the intensity of user interest connecting the ads. The only difference from other attention-based models is that the attention scores are not required to sum to exactly 1 (see the explanation in their Section 4.3).
  2. BERT4Rec [7]. Vanilla sequential recommendation models, like the vanilla Transformer- or RNN-based models referenced in [7], run through items from left to right and predict the next item sequentially; they are therefore "unidirectional". BERT4Rec applies a bidirectional transformer to learn inter-relationships between items. Its objective is a Cloze task: some items in the input sequence are randomly removed (i.e., replaced with a special "[mask]" token) and the model predicts the ids of the masked items from their surrounding context (a minimal masking sketch follows this list).
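As referenced in item 2 above, here is a minimal sketch of the Cloze-style masking; the mask probability and the special token id are assumptions.

import torch

MASK_ID = 0          # reserve id 0 for the special "[mask]" token
MASK_PROB = 0.2

def cloze_mask(item_seq: torch.Tensor):
    # item_seq: (batch, seq_len) of item ids (all >= 1)
    mask = torch.rand_like(item_seq, dtype=torch.float) < MASK_PROB
    inputs = item_seq.masked_fill(mask, MASK_ID)
    # Labels keep the original ids only at masked positions; -100 is ignored by
    # torch.nn.CrossEntropyLoss(ignore_index=-100).
    labels = item_seq.masked_fill(~mask, -100)
    return inputs, labels

seq = torch.randint(1, 1000, (4, 20))
inputs, labels = cloze_mask(seq)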


 

On device

Alibaba published EdgeRec [5], which processes fine-grained user events in real time and performs contextual reranking on device. User events are categorized into two types: item exposure actions and item page-view actions. Item exposure actions capture user scroll behavior and exposure duration, while item page-view actions capture more engaged behavior like "add to cart" or "click customer service". As you can guess, item page-view actions are much sparser than item exposure actions, so EdgeRec processes the two types of actions as two separate sequences. EdgeRec uses GRUs to process the two sequences and generate a hidden representation for each action. At the re-ranking stage, each candidate gets a re-ranking representation that comes from (1) the context of the other candidates and (2) attention over the item exposure and item page-view action sequences. The context of the other candidates is generated by another GRU that runs through the candidates in the order produced by the previous ranker. The re-ranking representation is then passed through a multi-layer perceptron (MLP) and finally a cross-entropy loss for learning.

It is important to pay attention to the serving details of EdgeRec (Section 2.2.2). First, although EdgeRec is trained as a whole, the GRUs that process action sequences and the GRU for re-ranking run separately on device during serving. The action-sequence GRUs run for each new incoming user action (O(1) time complexity per action, as claimed in the paper) and store the generated embeddings in an on-device database; the re-ranking GRU fetches those embeddings from the database and performs inference on top of them. Another engineering detail is that since the model may use sparse features whose embedding tables have high cardinality, the embedding tables are stored in the cloud and only the relevant rows are sent to the device. There are also versioning details that ensure the edge and cloud models and embedding tables match each other.

 

References

[1] On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models: https://arxiv.org/abs/2209.05310

[2] Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time: https://arxiv.org/abs/1711.07601 

[3] PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest: https://arxiv.org/abs/2007.03634

[4] PinnerFormer- Sequence Modeling for User Representation at Pinterest: https://arxiv.org/abs/2205.04507

[5] EdgeRec: Recommender System on Edge in Mobile Taobao: https://arxiv.org/abs/2005.08416

[6] Deep Interest Network for Click-Through Rate Prediction: https://arxiv.org/abs/1706.06978

[7] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer: https://arxiv.org/pdf/1904.06690.pdf

 

New Model Architectures

There have been many advancements in model architectures in the AI domain. Let me overview some of them in this post.

 

Linear Compression Embedding

LCE [1] simply uses a matrix to project one embedding matrix to another: AW = B, where W \in \mathbb{R}^{M \times N}, A \in \mathbb{R}^{B \times M}, and B \in \mathbb{R}^{B \times N} (here the B in the shapes denotes the batch size).

(Related architectures to compare against LCE: pyramid networks, the Inception network, and DHEN.)

 

Perceiver and Perceiver IO

Perceiver-based architectures [5, 6] solve a pain point of traditional transformers: computation that is quadratic in sequence length when computing QKV attention, where Q/K/V all have shape batch_size x seq_len x emb_dim. Perceiver-based architectures instead create a learnable latent array of fixed size (used as the queries in cross-attention) that is independent of the input sequence length, reducing the attention computation to linear in sequence length. Thanks to the linear attention cost, within the same computation budget they can also stack more attention blocks to improve performance. Three web tutorials are helpful: [2, 3, 4].
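Here is a hedged sketch of the latent cross-attention idea (the sizes are arbitrary and this is not the full Perceiver/Perceiver IO stack): a fixed number of learned latents attend to an arbitrarily long input, so the attention cost is O(num_latents * seq_len) rather than O(seq_len^2).

import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, num_latents=64, dim=128, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # independent of seq_len
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):                 # inputs: (batch, seq_len, dim)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.attn(query=q, key=inputs, value=inputs)
        return out                             # (batch, num_latents, dim)

x = torch.randn(2, 10_000, 128)                # a long input sequence
print(LatentCrossAttention()(x).shape)         # torch.Size([2, 64, 128])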

Since Perceiver does a good job of reducing attention computation to linear time (in terms of sequence length), I was wondering why recent LLMs have not, as far as I know, adopted this architecture. A Reddit discussion sheds some light on this [7].

 

Mamba

Mamba [9] is a new sequence-modeling architecture that also achieves linear time, based on a so-called "selective structured state space model". Currently, it achieves SOTA on many tasks, and the official implementation is available online. There is not much material that clearly articulates the theory behind Mamba; I have only found pieces of information here and there [8, 10]. The way I understand Mamba after watching [8] is:

  1. Mamba enjoys the best of both worlds of RNNs and Transformers:
      • RNN: training is O(N) but cannot be parallelized because states need to roll out sequentially; inference is O(1) per token.
      • Transformer: training is O(N^2) but can be parallelized; inference is O(N) per token (see the time analysis in [12]).
      • Mamba: training is O(N) and can be parallelized; inference is O(1) per token.
  2. According to the basic definition of structured state space sequence (S4) models [10], Mamba represents the state transition as differential equations:
    h'(t)=\mathbf{A}h(t)+\mathbf{B}x(t)
    y(t)=\mathbf{C}h(t)
    Note, h'(t) is the change (i.e., derivative) of the hidden state.
  3. The differential equations describe a continuous-time system, but in sequence modeling we are interested in sequences with discrete time steps (i.e., tokens). "Discretization" in the paper means that we use approximation methods to derive new matrices \bar{\mathbf{A}} and \bar{\mathbf{B}} when the input function x(t) is replaced by the sequence \{x(t\Delta)\}_{t=0,1,\cdots}, where \Delta is the step size.
    There are two methods to do the discretization. [8] introduces Euler's method, which is simpler for exposition; the paper uses the zero-order hold, which is explained in more detail in [13]. We use Euler's method for illustration, following [8] (a small numerical sketch of the resulting recurrence follows this list):
    1. By the definition of the derivative, we know that h(t+\Delta) \approx \Delta h'(t) + h(t).
    2. Substitute the state space model's equations into step 1:
      h(t+\Delta) \newline \approx \Delta \left( \mathbf{A}h(t)+\mathbf{B}x(t) \right) + h(t)\newline=\Delta \mathbf{A} h(t)+\Delta \mathbf{B}x(t) + h(t) \newline= \left(\mathbf{I}+\Delta\mathbf{A} \right)h(t) + \Delta \mathbf{B} x(t) \newline = \bar{\mathbf{A}} h(t) + \bar{\mathbf{B}} x(t)
  4. State transitions during training can be parallelized without waiting for a sequential rollout, because the state transition model can be converted into a convolutional formulation that depends only on the input sequence.
  5. An S4 model is used per input dimension. So if a token has an embedding size of 512, there are 512 S4 models running in parallel. In comparison, transformers use attention heads, where each head consumes all dimensions (or a group of dimensions) of a token.
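As referenced in item 3 above, here is a tiny numerical sketch of the Euler-discretized recurrence h_{t+1} = (\mathbf{I}+\Delta\mathbf{A})h_t + \Delta\mathbf{B}x_t, y_t = \mathbf{C}h_t; the matrices and step size are arbitrary.

import numpy as np

A = np.array([[-0.5, 0.0],
              [0.0, -1.0]])
B = np.array([[1.0],
              [0.5]])
C = np.array([[1.0, -1.0]])
delta = 0.1

A_bar = np.eye(2) + delta * A      # discretized state matrix (Euler)
B_bar = delta * B                  # discretized input matrix

x = np.sin(np.linspace(0, 3, 30))  # a toy 1-D input sequence
h = np.zeros((2, 1))
ys = []
for x_t in x:                      # sequential recurrence (what the conv form parallelizes)
    h = A_bar @ h + B_bar * x_t
    ys.append(float(C @ h))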

 

Reference

  1. LCE explanation: https://fb.workplace.com/groups/watchrankingexperiments/permalink/2853044271454166/ (internal) and https://www.dropbox.com/scl/fi/0ygejam1mg2jhocfmgoba/watch-_-chaining-ranking-experiment-review-_-group-_-workplace.pdf?rlkey=4euhhg6f0h838ythg1na1amz8&dl=0 (external)
  2. https://medium.com/@curttigges/the-annotated-perceiver-74752113eefb
  3. https://medium.com/ml-summaries/perceiver-paper-s-summary-3c589ec74238
  4. https://medium.com/ml-summaries/perceiver-io-paper-summary-e8f28e451d21
  5. Perceiver: General Perception with Iterative Attention: https://arxiv.org/abs/2103.03206
  6. Perceiver IO: A General Architecture for Structured Inputs & Outputs: https://arxiv.org/abs/2107.14795
  7. https://www.reddit.com/r/MachineLearning/comments/tx7e34/d_why_arent_new_llms_using_the_perceiver/
  8. https://www.youtube.com/watch?v=8Q_tqwpTpVU
  9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces: https://arxiv.org/abs/2312.00752
  10. https://srush.github.io/annotated-s4/?ref=blog.oxen.ai
  11. https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad
  12. Llama code anatomy: https://czxttkl.com/2024/01/17/llama-code-anatomy/
  13. https://en.wikipedia.org/wiki/Discretization#Derivation

 

Simulation on the ads supply problem

I have started to feel the importance of simulating any practical problem before deploying an RL policy. If you cannot implement a reasonable simulator on your own, you are not clear about your environment and your model. To me, it is pure gambling to train an RL policy offline and go directly to online testing without evaluating it in a reasonable simulator. Sure, you might see good online results once in a while, but you might not even be able to reproduce the success if you were to do it again.

Recently, I faced a new problem in ads supply. Think of a typical setting of a modern recommendation system where we need to insert ads in between organic content. If we insert ads too frequently, user engagement with organic content drops; if we insert too few ads, the system earns low ads revenue. We assume that in each user session, organic content and ads are loaded in units of "pages": when the user is close to finishing one page, the next page is formed with a specific arrangement of organic content and ads. I think an RL policy for deciding the best ad insertion positions on each page is promising because, as a business, we care about accumulated value (i.e., accumulated engagement or ads revenue). If we look at user trajectories at the page level, there may exist a sequence of ad insertion decisions that optimizes the accumulated value better than any supervised learning-based policy that only optimizes immediate reward.

Here is a hypothetical example of why RL can work better than a supervised learning-based policy:

Suppose a user will have a session of 4 pages. A supervised learning model will insert 6 ads on each of the first 3 pages; but on the last page, because the user has seen 3 pages with 6 ads each and reached fatigue, we can only insert 1 ad. The supervised learning model therefore leads to 19 ads inserted in total. However, a reinforcement learning model will know that, in order to maximize the accumulated number of ads inserted, it can place 5 ads on each of the 4 pages. The RL model will eventually insert 20 ads, one more than the SL model.

I designed an ads supply environment to test whether an RL policy works well. In this environment, every user's behavior is measured within a 7-day time window. Users have a default session continuation probability of 0.8 after seeing each page and a default next-session return time of 24 hours. Each page can have 1, 2, or 3 ads inserted. Whenever a user sees two consecutive pages with an average ad load > 1.5, the page continuation probability is shrunk by a factor of 0.2 and the next-session return time increases by 24 hours. The state features include the last two pages' ad loads, the number of sessions and pages so far, etc.
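Here is a hedged sketch of those dynamics; the actual data_generator.py in the code folder below may differ in details, and the random policy here stands in for whatever data collection policy is used.

import random

def simulate_user(policy, horizon_hours=7 * 24):
    cont_prob, return_hours, t = 0.8, 24, 0.0
    last_two_loads, total_ads, sessions = [], 0, 0
    while t < horizon_hours:
        sessions += 1
        while True:
            ad_load = policy(last_two_loads, sessions)     # choose 1, 2, or 3 ads
            total_ads += ad_load
            last_two_loads = (last_two_loads + [ad_load])[-2:]
            # Fatigue: two consecutive heavy pages shrink continuation and delay return.
            if len(last_two_loads) == 2 and sum(last_two_loads) / 2 > 1.5:
                cont_prob *= 0.2
                return_hours += 24
            if random.random() > cont_prob:
                break                                      # session ends
        t += return_hours                                  # user comes back later
    return sessions, total_ads

sessions, reward = simulate_user(lambda hist, s: random.choice([1, 2, 3]))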

We collect data using the designed simulator and a random data collection policy. The data consists of 1M episodes (equivalent to 10M (S, A, R, S', A') tuples). We compare Q-learning, SARSA, and Monte Carlo value regression. Surprisingly, MC performs noticeably better than the other two and the random data collection policy. Here are the results:

  • Random data collection policy: Avg sessions: 2.966372, avg reward: 20.859195
  • Q-Learning: Test avg sessions: 4.3327, avg reward: 33.8197
  • SARSA: Test avg sessions: 3.808, avg reward: 31.5479
  • Monte-Carlo Value Regression: Test avg sessions: 4.6173, avg reward: 43.2075

From this simulation, I have become more inclined to always start with simple methods like Monte Carlo value regression. Advanced RL algorithms may not be the best choice in a noisy real-world environment.

 

How to run the code: 

  1. Download the code folder: https://www.dropbox.com/s/p2mfowz9m63zwn7/ads_supply.zip?dl=0
  2. Run data_generator.py to generate training data, which is to be saved to offline_train_data.pickle
  3. Run mc_learning.py, q_learning.py, or sarsa.py to train a model
  4. Run eval.py to evaluate model performance. Remember to change MODEL_SAVE_PATH properly.

GFlowNet

GFlowNet is a recently developed technique for solving combinatorial optimization problems [1]. I've prepared a series of deep-dive slides for it (see GFlowNet deep dive). In this post, I just list a few more references.

References

[1] https://yoshuabengio.org/2022/03/05/generative-flow-networks/

[2] https://towardsdatascience.com/the-what-why-and-how-of-generative-flow-networks-4fb3cd309af0

[3] https://neurips.cc/media/neurips-2021/Slides/26729.pdf

[4] https://www.youtube.com/watch?v=7W69-ffTs48

[5] https://milayb.notion.site/GFlowNet-Tutorial-919dcf0a0f0c4e978916a2f509938b00#afe03e54d6db43468f8dee3a3350f98a

[6] http://folinoid.com/w/gflownet/

How does Metropolis-Hastings algorithm work?

I learned a little bit about Markov Chain Monte Carlo (MCMC) algorithms during my PhD, but I did not record my thoughts back then. In this post, I revisit the core concept of MCMC, focusing in particular on illustrating the Metropolis-Hastings (MH) algorithm.

What is the motivation of MCMC?

Suppose you have observed some data x. You would like to model the distribution of x with some distribution parameterized by \theta, and you have a prior p(\theta). You would like to know the posterior distribution p(\theta|x). By Bayes' theorem,

    \[p(\theta|x)=\frac{p(\theta) p(x|\theta)}{p(x)}=\frac{p(\theta) p(x|\theta)}{\int_\theta p(\theta) p(x|\theta) d\theta}\]

 

The denominator \int_\theta p(\theta) p(x|\theta) d\theta is often hard to compute. Therefore, it is hard to know exact statistics of the posterior distribution, such as its mean. MCMC provides a mechanism to sample points in the space of \theta such that, when a sufficient number of points has been sampled, they approximate the true posterior distribution p(\theta|x); you can then compute empirical statistics such as the empirical mean.

Let's use a concrete example (adapted from [2]). You are a teacher who wants to know the score distribution of your students. You assume the score distribution is a Normal distribution \mathcal{N}(\mu, \sigma^2), where \sigma^2=4^2=16 is a fixed variance. Following basic conjugate-prior mathematics [4], you place a prior p(\mu|\mu_0, \sigma_0^2) = \mathcal{N}(\mu_0, \sigma_0^2) on \mu, which is also a Normal distribution. You have observed 3 students' scores, so you can express the likelihood function as p(x_1, x_2, x_3|\mu, \sigma^2) \propto \frac{1}{\sigma^n} exp\left( -\frac{1}{2\sigma^2}\sum\limits_{i=1}^3 (x_i - \mu)^2 \right). In this simple example, there is an analytical expression for the posterior p(\mu| x_1, x_2, x_3), which is also a Normal distribution. But for pedagogical purposes, let's pretend the posterior p(\mu|x_1, x_2, x_3) is impossible to compute and we have to use MCMC to infer posterior statistics such as the posterior mean.

How does MCMC work? (specifically, the MH algorithm)

We focus on the MH algorithm, which is basically a loop that samples a sequence of \theta values for estimating the posterior distribution. The following steps are adapted from [2]:

  1. Begin with a plausible starting value; Let’s say you guess \mu=110 at first.

  2. Generate a new proposal by taking the last sample (110) and adding some random noise. This random noise is generated from a proposal distribution, which should be symmetric and centered on zero. In our example we will use a proposal distribution that is normal with zero mean and standard deviation of 5. This means the new proposal is 110 (the last sample) plus a random sample from N(0,5). Suppose this results in a proposal of 108.
    Note that, if the proposal distribution is not symmetric, the following sample acceptance criteria (step 4&5) will need to be adjusted. See how g(\cdot) is defined in [1].

  3. Compare the likelihood value at the new proposal (i.e., 108) against the likelihood value at the most recent sample (i.e., 110). In other words, you compare p(\mu=108|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=108, \sigma^2=16) and p(\mu=110|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=110, \sigma^2=16).
    Note: in many other online tutorials, such as [2] and [5], people interchangeably say that we compare the posterior values of the new proposal and the most recent sample. Since the quantity above differs from the posterior only by a multiplicative constant, comparing it is equivalent to comparing the posterior.

  4. If the new proposal has a higher likelihood value than the most recent sample, then accept the new proposal.

  5. If the new proposal has a lower likelihood value than the most recent sample, then randomly choose to accept or reject the new proposal, with a probability equal to the ratio of the two likelihood values. For example, if the likelihood value at the new proposal value is one-fifth as high as the likelihood of the most recent sample, then accept the new proposal with 20% probability.

  6. If the new proposal is accepted, it becomes the next sample in the MCMC chain, otherwise the next sample in the MCMC chain is just a copy of the most recent sample.

  7. This completes one iteration of MCMC. The next iteration is completed by returning to step 2.

  8. Stop when there are enough samples (e.g., 500). Deciding when one has enough samples is a separate issue. Now you can compute any empirical statistics as you like, such as the empirical posterior mean. 

The above 8 steps illustrate the core formula of MH, the acceptance criterion [1, 3]:

    \[\alpha(\theta^{new}, \theta^{old})=min \left(1, \frac{p(\theta^{new}|x)g(\theta^{old}|\theta^{new})}{p(\theta^{old}|x) g(\theta^{new}|\theta^{old})}\right),\]
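where g(\cdot|\cdot) is the proposal distribution, which cancels out when it is symmetric. To make the steps concrete, here is a short sketch of the MH loop for the teacher example, working in log space for numerical stability; the prior parameters, the observed scores, and the number of iterations are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
scores = np.array([95.0, 110.0, 118.0])
sigma = 4.0
mu0, sigma0 = 100.0, 15.0           # assumed prior mean / std on mu

def log_unnormalized_posterior(mu):
    log_prior = -0.5 * ((mu - mu0) / sigma0) ** 2
    log_lik = -0.5 * np.sum(((scores - mu) / sigma) ** 2)
    return log_prior + log_lik

samples = [110.0]                   # step 1: a plausible starting value
for _ in range(5000):
    proposal = samples[-1] + rng.normal(0.0, 5.0)            # step 2: symmetric N(0, 5) noise
    log_alpha = (log_unnormalized_posterior(proposal)
                 - log_unnormalized_posterior(samples[-1]))
    if np.log(rng.uniform()) < log_alpha:                    # steps 4-5, in log space
        samples.append(proposal)                             # accept the proposal
    else:
        samples.append(samples[-1])                          # step 6: copy the last sample
print(np.mean(samples[500:]))       # empirical posterior mean after burn-in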

 

References:

[1] https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm

[2] https://link.springer.com/article/10.3758/s13423-016-1015-8

[3] https://stats.stackexchange.com/a/497060/80635

[4] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[5] https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50 (the most liked comment is very useful for understanding)