Mode collapse is real for generative models

I am very curious to see whether generative models like GANs and VAEs can fit multimodal data. [1] gives an overview of different generative models, mentioning that VAEs have a clear probabilistic objective function and are more efficient.

[2] showed that diffusion models (score-based generative models) can fit multimodal distributions better than VAEs and GANs. [4] mentions that VAEs do not suffer from mode collapse the way GANs do, but they have a similar problem: the decoder may ignore the latent representation and generate things arbitrarily, which InfoVAE [5] can mitigate. [6] also suggests that beta-VAE and adversarial autoencoders [7] achieve better performance than the vanilla VAE.

I did a hands-on toy experiment, where I generated 3-mode distributions and let a vanilla VAE fit them. The results are not very satisfying, given that this is only a very simple toy problem. This suggests we may need the more advanced VAE variants described above.

First, when the three modes are well separated and well balanced (I generated 33% of the data at -10, 33% at 0, and the other 33% at 10):

However, when the three modes are -1, 0, and 1, generated data are not well separated:

When the three modes are -0.1, 0, and 0.1, generated data only concentrate on one mode (mode collapse!):

I also tried giving the three modes imbalanced amounts of data, and the VAE did not perform well either:

 

The toy experiment’s code can be found in `train_VAE_simple.py` in https://www.dropbox.com/sh/n8lbdzvrdvahme7/AADFgs_JLz8BhHzOaUCcJ1MZa?dl=0.
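For readers who want a self-contained snippet, here is a minimal sketch of the toy setup (a vanilla VAE on 1-D data with three modes). It is not the actual train_VAE_simple.py; the architecture and hyperparameters are my own choices for illustration.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, latent_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

# Three well-separated, balanced modes at -10, 0, and 10 (the first setting above).
modes = torch.tensor([-10.0, 0.0, 10.0])
x = modes[torch.randint(0, 3, (3000,))].unsqueeze(1) + 0.1 * torch.randn(3000, 1)

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    recon, mu, logvar = model(x)
    recon_loss = ((recon - x) ** 2).mean()
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    loss = recon_loss + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate samples by decoding from the prior and inspect how many modes are covered.
with torch.no_grad():
    samples = model.decoder(torch.randn(1000, 2))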

Lastly, I'd like to share that [3] seems to be a good tutorial introducing diffusion models.

 

References

[1] Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning: https://arxiv.org/pdf/1705.00470.pdf

[2] Can Push-forward Generative Models Fit Multimodal Distributions?: https://openreview.net/pdf?id=Tsy9WCO_fK1

[3] https://www.youtube.com/watch?v=cS6JQpEY9cs

[4] https://ai.stackexchange.com/questions/8879/why-doesnt-vae-suffer-mode-collapse

[5] https://ermongroup.github.io/blog/a-tutorial-on-mmd-variational-autoencoders/

[6] https://ameroyer.github.io/reading-notes/representation%20learning/2019/05/02/infovae.html

[7] https://medium.com/vitrox-publication/adversarial-auto-encoder-aae-a3fc86f71758

Causal Inference in Recommendation Systems

We have briefly touched on some concepts of causal inference in [1, 2]. This post introduces some more specific works that apply causal inference to recommendation systems. Some of these works require background on backdoor and frontdoor adjustments, so we will introduce those first.

Backdoor and frontdoor adjustment 

Suppose we have a causal graph like the one below and we want to know the causal effect of X on Y. This can be translated into computing P(y|do(X=x)).

(Figure: the original causal graph.)

P(y|do(X=x)) can be seen as the post-intervention distribution on a modified causal graph, in which we intervene on X by setting it to a specific value x. The parents of X no longer affect its value, which corresponds to removing all edges from the parents of X in the original causal graph:

(Figure: the modified causal graph with incoming edges to X removed.)

When the backdoor or frontdoor criterion is satisfied, we can apply the backdoor or frontdoor adjustment to express P(y|do(X=x)) using only the observational distribution on the original causal graph:

Backdoor:

    \[P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u)\]

Frontdoor:

    \[P(y|do(X=x)) = \sum\limits_z P(z|x) \sum\limits_{x'} P(y|x', z)P(x')\]

The difference between the two is that the backdoor adjustment assumes all confounding factors are observed (note that u appears in the backdoor adjustment formula but not in the frontdoor adjustment formula).

I found two resources that shed light on how the backdoor and frontdoor adjustment formulas are derived. Let's first look at the derivation of the backdoor adjustment formula, following [3]:

    \[P(Y|do(X=x)) := P^*(Y|x) \newline \text{(where }P^*\text{ is the post-intervention distribution in the modified causal graph)} \newline = \sum\limits_u P^*(Y|x,u)P^*(u|x) \newline= \sum\limits_u P^*(Y|x,u)P^*(u) \newline(\text{because }P^*(U|X)=P^*(U) \text{ in the modified causal graph}) \newline =\sum\limits_u P(Y|x,u)P(u) \newline (\text{because the observational and post-intervention distributions have } \newline \text{the same causal structure for } Y \text{ and } U)\]

Note that the backdoor adjustment assumes we can observe the confounder. That is why you see the summation over u in P(y|do(X=x)) = \sum\limits_u P(y|x,u)P(u). The frontdoor adjustment assumes we cannot observe u. Therefore, we have to use other observational distributions to decompose P(y|do(X=x)). While [3] also explains how to derive the frontdoor adjustment for P(y|do(X=x)), I feel [4] does a better job of explaining it:

    \[P(Y|do(X=x)) \newline= \sum\limits_z P(Y|z, do(X=x))P(z|do(X=x)) \newline =\sum\limits_z P(Y|z, do(X=x))P(z|x) \newline (P(z|do(X=x))=P(z|x) \text{ because there is no confounder between X and Z}) \newline (\text{However, we can't change }P(Y|z, do(X=x)) \text{ to }P(Y|z, x) \text{ because} \newline \text{there is an (unobserved) confounder between X and Y} )\newline=\sum\limits_z P(Y|do(Z=z), do(X=x))P(z|x) \newline \text{(a trick to change from } P(Y|z, do(X=x))\text{ to } P(Y|do(Z=z), do(X=x))\text{)} \newline = \sum\limits_z P(Y|do(Z=z))P(z|x) \newline (do(X=x)\text{'s effect is completely blocked by }do(Z=z)\text{,}\newline\text{hence }P(Y|do(Z=z), do(X=x))=P(Y|do(Z=z))) \newline=\sum\limits_z P(z|x) \left[\sum\limits_{x'}P(Y|z, x')P(x')\right] \newline \text{(applying the backdoor adjustment between } do(Z=z) \text{ and } Y\text{)}\]

Note that in the last equation, P(Y|z, x') cannot be further simplified to P(Y|z) because of the presence of U which can affect both X and Y [5].
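To make the backdoor formula concrete, here is a tiny numerical example on a fully observed discrete model U -> X, U -> Y, X -> Y. All the probability tables are made up; the point is only that adjusting by P(u) differs from naively conditioning on X.

import numpy as np

p_u = np.array([0.4, 0.6])                      # P(U=0), P(U=1)
p_x_given_u = np.array([[0.8, 0.2],             # P(X=x | U=u), rows indexed by u, cols by x
                        [0.2, 0.8]])
p_y1_given_xu = np.array([[0.1, 0.5],           # P(Y=1 | X=x, U=u), rows u, cols x
                          [0.4, 0.9]])

# Naive conditioning: P(Y=1 | X=1) = sum_u P(Y=1|X=1,u) P(u|X=1)
p_x1 = (p_u * p_x_given_u[:, 1]).sum()
p_u_given_x1 = p_u * p_x_given_u[:, 1] / p_x1
naive = (p_y1_given_xu[:, 1] * p_u_given_x1).sum()

# Backdoor adjustment: P(Y=1 | do(X=1)) = sum_u P(Y=1|X=1,u) P(u)
backdoor = (p_y1_given_xu[:, 1] * p_u).sum()

print(f"P(Y=1|X=1)     = {naive:.3f}")      # biased by the confounder (0.843 here)
print(f"P(Y=1|do(X=1)) = {backdoor:.3f}")   # the interventional quantity (0.740 here)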

Backdoor adjustment in real-world papers

After introducing the backdoor/frontdoor adjustments, let's examine how they are used in practice. I came across two papers [9, 10], both of which apply the backdoor adjustment.

[9] tries to account explicitly for popularity bias, which many other models fail to take into account. In the diagram below, I represents items, U represents users, C represents user interaction with specific items, and Z represents popularity. Popularity is a confounder for items because more popular items tend to get exposed more often, creating a feedback loop that keeps biasing the model. Popularity is also a cause of user interaction because users tend to have a "herd mentality" and follow the majority in consuming popular items.

To really know the intrinsic value of an item to a user, we have to estimate P(C|do(U,I)). By applying the backdoor adjustment, we have P(C|do(U,I))=\sum\limits_z P(C|U, I, z) P(z). P(C|U, I, z) can be estimated by any deep learning model that takes user-side features, item-side features, and popularity as input. But the paper has a nice trick: it decomposes P(C|U, I, z) = intrinsic\_value_{\theta}(U,I) \cdot pop\_adjust(z), where \theta denotes model parameters. They can then use intrinsic\_value_{\theta}(U,I) to rank items by their real intrinsic value.
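Here is a hedged sketch of that factorized-scoring idea, not the exact model of [9]: the embedding sizes, the softplus choice, and the z^\gamma popularity term are my assumptions for illustration.

import torch
import torch.nn as nn

class DeconfoundedScorer(nn.Module):
    def __init__(self, num_users, num_items, dim=32, gamma=0.1):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.gamma = gamma  # strength of the popularity adjustment (assumed hyperparameter)

    def intrinsic_value(self, u, i):
        # Intrinsic user-item affinity, kept non-negative with softplus.
        return nn.functional.softplus((self.user_emb(u) * self.item_emb(i)).sum(-1))

    def forward(self, u, i, popularity):
        # Training-time score: intrinsic value times a popularity term z^gamma.
        return self.intrinsic_value(u, i) * popularity.pow(self.gamma)

scorer = DeconfoundedScorer(num_users=100, num_items=500)
u = torch.randint(0, 100, (8,))
i = torch.randint(0, 500, (8,))
pop = torch.rand(8)                         # observed item popularity in [0, 1]
train_score = scorer(u, i, pop)             # fit this against clicks during training
serve_score = scorer.intrinsic_value(u, i)  # rank by intrinsic value at serving time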

[10] provides another application of the backdoor adjustment. It tackles the video-length bias commonly seen in modern video recommendation systems. The bias exists because video watch time is often a topline metric that video recommendation systems try to optimize. As a result, long videos tend to be overexposed, which amplifies the video-length bias in a feedback loop. The causal graph the authors model is shown below, where U, V, D, and W represent user, video, duration, and watch time, respectively:

The backdoor adjustment to estimate the real intrinsic value E[W|do(U,V)] is 

Basically, the formula says that we group videos into different duration buckets. The real intrinsic value h(u, v) is an estimator of the watch-time percentile within the video's duration bucket; h(u, v) can be a single shared model that predicts the watch-time percentile across buckets.

It is interesting to see that during inference, they still rank by predicted watch time, not by the intrinsic value h(u,v):

A natural question is why we could not simply use a deep learning model that takes user and video features as well as the confounder (duration) as input. This goes back to the motivation of the paper itself [10]: such a model tends to predict high watch time for long videos simply because they are over-exposed (i.e., many of them are placed in top positions, so long videos naturally attract more watch time). By bucketing durations and changing the labels to percentiles within each bucket, the model's predictions become less biased.
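A minimal sketch of this label construction might look like the following; the number of buckets, the column names, and the use of equal-frequency buckets are my assumptions rather than the paper's exact recipe.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration": rng.uniform(10, 600, size=10_000),    # video length in seconds
    "watch_time": rng.uniform(0, 600, size=10_000),   # observed watch time
})

# Equal-frequency duration buckets (the confounder D in the causal graph).
df["duration_bucket"] = pd.qcut(df["duration"], q=10, labels=False)

# Within each bucket, the regression label is the watch-time percentile,
# which removes the mechanical advantage of long videos.
df["watch_percentile"] = df.groupby("duration_bucket")["watch_time"].rank(pct=True)

# A model h(u, v) would then be trained to predict `watch_percentile`
# given user and video features plus the bucket id.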

Causal Discovery

The paper [7] introduces a cool way to trim user history in user sequence modeling. The context is that in sequential recommendation problems, the objective function can be formulated as follows:

    \[min -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k} log f(\vec{v}_k^j | \mathbf{H}_k^j),\]

where \vec{v}_k^j is user k's j-th interacted item, represented as a one-hot vector, and \mathbf{H}_k^j is the user's interaction history before the j-th item. f(\cdot) can be a softmax so that minimizing the objective encourages the model to accurately predict which item the user is most likely to interact with next, given the history.

However, user history may contain many irrelevant items, which makes prediction hard. [7] proposes using causal discovery to find causally related items so that only a purified history is used to predict the next items users may interact with. The sequential model then works with a more focused history and can potentially be more accurate.

Denoting all users' history as \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}, the objective becomes:

    \[min \; - \ell(\mathbf{V}, W) + \lambda ||W||_1 \newline = min -\sum\limits_{k=1}^N\sum\limits_{j=1}^{l_k}\sum\limits_{b=1}^{|\mathcal{V}|} log f(\left[\vec{v}_k^j\right]_b | \mathbf{\tilde{H}}_{kb}^j(W)) + \lambda ||W||_1 \newline s.t. \quad trace(e^{W \odot W})=\vert\mathcal{V}\vert,\]

where W \in \mathbb{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the discovered causal graph between items, and \mathbf{\tilde{H}}_{kb}^j(W) is the purified history for each potential item b at the time of prediction. W is a real-valued matrix that approximates the actual binarized causal DAG; i.e., binarize(W_{ij}) = 1 means that a change in item i causes a change in item j.

Note that the form of min - \ell(\mathbf{V}, W) + \lambda ||W||_1 \; s.t. \; trace(e^{W \odot W})=\vert\mathcal{V}\vert is from NOTEARS [8], the work that introduced differentiable causal discovery. Originally in [8], \mathbf{V} \in \{0,1\}^{N\times \vert \mathcal{V} \vert}, which can be thought of as a matrix recording all interacted items for each user without temporal order. \ell(\mathbf{V}, W)=\sum\limits_{k=1}^N\sum\limits_{b=1}^{\vert\mathcal{V}\vert}(\mathbf{V}_k^b - \mathbf{V}_k^{T} W_{\cdot b})^2 is a least-squares error, which intuitively means that each observed interacted item should be explained by all other observed interacted items together with the causal graph W. [8] shows that, if B \in \{0, 1\}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert} is the binarized version of W (the actual causal directed acyclic graph), then trace\left( e^{B} \right) = \vert\mathcal{V}\vert. W, as a relaxed version of B, can be constrained in a similar way, but the constraint needs to be slightly modified to trace(e^{W \odot W})=\vert\mathcal{V}\vert.
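As a quick sanity check of the acyclicity constraint, here is a small numerical sketch of h(W) = trace(e^{W \odot W}) - \vert\mathcal{V}\vert, which is zero exactly when W corresponds to a DAG; the example matrices are arbitrary.

import numpy as np
from scipy.linalg import expm

def notears_h(W: np.ndarray) -> float:
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # W * W is the elementwise square

dag = np.array([[0.0, 1.3, 0.0],
                [0.0, 0.0, 0.7],
                [0.0, 0.0, 0.0]])      # strictly upper triangular => acyclic
cyclic = dag.copy()
cyclic[2, 0] = 0.5                     # adds the cycle 0 -> 1 -> 2 -> 0

print(notears_h(dag))     # ~0.0: the constraint is satisfied
print(notears_h(cyclic))  # > 0: the constraint is violated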

The novelty of [7] is that it puts the causal discovery framework proposed in [8] in a sequential recommendation context such that temporal information is taken into account and \mathbf{V} is no longer just a matrix of all interacted items for each user. Instead, \mathbf{V}=\{\vec{v}_k^j\}_{k=1\cdots N,j=1\cdots l_k}.

On a side note, [7] also introduces a differentiable clustering method, which is pretty cool.

 

References

[1] Causal Inference https://czxttkl.com/2020/10/13/causal-inference/

[2] Tools needed to build an RL debugging tool: https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[3] https://stats.stackexchange.com/questions/312992/causal-effect-by-back-door-and-front-door-adjustments/313127#313127

[4] http://www.degeneratestate.org/posts/2018/Sep/03/causal-inference-with-python-part-3-frontdoor-adjustment/

[5] https://stats.stackexchange.com/questions/448004/intuition-behind-conditioning-y-on-x-in-the-front-door-adjustment-formula/

[6] https://stats.stackexchange.com/questions/538477/causal-inference-difference-between-blocking-a-backdoor-path-and-adding-a-vari

[7] Sequential Recommendation with Causal Behavior Discovery: https://arxiv.org/abs/2204.00216

[8] DAGs with NO TEARS Continuous Optimization for Structure Learning: https://arxiv.org/abs/1803.01422

[9] Causal Intervention for Leveraging Popularity Bias in Recommendation: https://arxiv.org/pdf/2102.11107.pdf

[10] Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation: https://arxiv.org/abs/2206.06003

Dollar cost average on TQQQ vs QQQ [Simulation]

This post runs a simple simulation of using the dollar-cost-averaging (DCA) strategy to invest in QQQ vs. TQQQ, its 3x-leveraged ETF.

In the simulation, QQQ plunges about 34% over 20 rounds. One round is a small up-and-down cycle: the index first moves up 1% and then down 3%, repeated until it is roughly 34% below the top. After reaching the bottom, I simulate another 42 rounds so that QQQ almost crawls back to its previous top; this time, within each round, the index first moves up 2% and then down 1%. In each round, we invest a fixed amount of money twice, once after the up move and once after the down move.

I simulate the same process for TQQQ. The only difference is that in each round, the ups and downs are 3x those of QQQ.

The script and results can be found below. We can see that DCA in TQQQ still gives a better return. Also, as expected, when QQQ has almost climbed back to its former price (99.94), TQQQ is still not back to its peak (88.07). However, that does not contradict the fact that DCA earns more on TQQQ than on QQQ overall.

DOWN_ROUNDS = 20   # rounds in the drawdown phase (index falls ~34% from the top)
UP_ROUNDS = 42     # rounds in the recovery phase (index crawls back near the top)
stock_price = 100  # starting index price
amounts = 0        # total shares accumulated by dollar-cost averaging
DAC = 100          # fixed dollar amount invested at each buy
SCALE = 1  # 1 for QQQ and 3 for TQQQ
for i in range(DOWN_ROUNDS):
    amounts += DAC / stock_price
    stock_price = stock_price * (1 + 0.01 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")
    amounts += DAC / stock_price
    stock_price = stock_price * (1 - 0.03 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")

for i in range(UP_ROUNDS):
    amounts += DAC / stock_price
    stock_price = stock_price * (1 + 0.02 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")
    amounts += DAC / stock_price
    stock_price = stock_price * (1 - 0.01 * SCALE)
    print(f"round={i}, amount: {amounts}, price={stock_price}")

print(f"final value={amounts * stock_price}")

# QQQ (scale=1) final value=15197.210360810055
# The last print: round=41, amount: 152.06046946267062, price=99.9418876879163

# TQQQ (scale=3) final value=22578.30484235393
# The last print: round=41, amount: 256.36563654519176, price=88.07071472846891

GATO and related AGI research

Policy Generalist

DeepMind has recently published a work named Gato. I find it interesting because Gato learns a single multi-modal, multi-task policy for many tasks such as robot arm manipulation, playing Atari, and image captioning. The original paper [2] does not include every implementation detail, but I'll do my best to summarize what I understand. I will also try to survey literature pertinent to Gato because it can be very useful on the way toward the north star of general AI.

First of all, Gato adopts the most common architecture these days: transformers. The core idea is to turn data of different modalities into sequences of tokens. Tokens are then passed through embedding tables or some model-based transformation to generate embeddings of the same dimension.

Here is how different modalities are turned into sequences. Text data are natural sequences of words; each word (or phrase) is indexed with a unique integer. Image data are split into non-overlapping 16×16 patches in raster order. Reinforcement learning tasks generate sequences of (observation, action) tuples. Rewards are not included in the sequences because (I guess) Gato mainly uses behavior cloning/imitation learning: it picks top-performing expert demonstration sequences and uses supervised learning to fit the expert actions. Continuous actions are discretized into discrete actions. For image patches, the authors apply a ResNet to generate embeddings; text and action tokens are passed through learnable embedding tables to generate theirs.

If you pay attention, you will notice that all these different kinds of tasks end up with a discrete action space. As mentioned earlier, reinforcement learning tasks have either discrete actions or continuous actions that are discretized. For chatbot/image-captioning tasks, the action space is basically the vocabulary. So the training loss used by Gato can simply be an autoregressive cross-entropy:
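I do not have the original figure with the exact equation here, but from my reading the loss is the standard masked autoregressive objective, roughly:

    \[\mathcal{L}(\theta, \mathcal{B}) = -\sum\limits_{b=1}^{|\mathcal{B}|} \sum\limits_{l=1}^{L} m(b, l) \, log \, p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \cdots, s_{l-1}^{(b)}\right),\]

where \mathcal{B} is a batch of token sequences and m(b, l) is a mask so that (if I recall correctly) only text and action tokens contribute to the loss.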

Algorithm Distillation

Algorithm distillation [3] was a novel idea to me at first glance, but I can relate it to work from a few years ago about learning to learn with RNNs [4, 5]. The idea of [3] is that when we solve one or more reinforcement learning tasks with an algorithm of interest, we can record (state, action, reward) sequences that span multiple episodes. When these multi-episode sequences are long enough, they contain information about how the algorithm improves the policy over time and how the improving policy interacted with the environment(s). We can train a sequence model (e.g., a transformer) to directly mimic these multi-episode sequences. As a result, the RL algorithm itself is distilled into the sequence model. We can then apply the sequence model to a new task and ask it to simply unroll itself. The sequence model will magically produce actions that gradually improve returns over time, just as if we had run the original RL algorithm on the new task!
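A rough sketch of the data construction is below; the episode format, the window length, and the chunking scheme are illustrative assumptions, not the exact recipe of [3].

from typing import List, Tuple

Transition = Tuple[list, int, float]            # (observation, action, reward)

def build_training_windows(
    learning_history: List[List[Transition]],   # episodes in the order they were generated
    context_len: int = 512,
) -> List[List[Transition]]:
    # Flatten episodes in training order so each window can span episode
    # boundaries and therefore contain evidence of policy improvement.
    flat: List[Transition] = [t for episode in learning_history for t in episode]
    return [flat[i:i + context_len]
            for i in range(0, len(flat) - context_len + 1, context_len)]

# A causal sequence model is then trained to predict the action of each transition
# given all preceding (observation, action, reward) tokens in the window.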

 

 

References

[1] DeepMind Gato Explained – One AI Model for 600 tasks https://www.youtube.com/watch?v=EEcxkv60nxM

[2] A Generalist Agent https://arxiv.org/abs/2205.06175

[3] In-context Reinforcement Learning with Algorithm Distillation: https://arxiv.org/abs/2210.14215

[4] Learning to Learn without Gradient Descent by Gradient Descent: https://arxiv.org/abs/1611.03824

[5] Learning to learn by gradient descent by gradient descent: https://proceedings.neurips.cc/paper/2016/file/fb87582825f9d28a8d42c5e5e5e8b23d-Paper.pdf

 

Some latest recsys papers

Seven years ago I posted a tutorial about recommendation systems. Now it is 2022 and there have been many more advancements. This post will overview several of the latest ideas.

CTR models

Google's RecSys 2022 paper [1] shares many practical details of their CTR models. First, to reduce training cost, there are three effective techniques: bottleneck layers (a layer's weight matrix is decomposed into two low-rank matrices), weight-sharing neural architecture search, and data sampling. The contribution of each technique to the reduction of training cost is listed below; data sampling appears to be particularly effective.

Second, to improve model performance, they add a rank loss (a pairwise ranking loss) on top of the pointwise cross-entropy training loss, apply distillation, use a second-order optimizer, and add a so-called Deep & Cross Network to learn feature interactions. The contribution of these techniques to accuracy is listed below:

 

Graph models 

Let's look at Pixie, Pinterest's real-time random-walk system for recommending pins [2]. The graph in this paper has pins and boards (collections of pins on similar topics) as nodes. If a user has saved pin p in board b, then there is an edge between nodes p and b. Given a query pin, a random walk can surface well-connected pins on the graph. The idea of random walks was introduced in a previous post.

Pixie makes several customizations to the random walk to boost its business value. First, it biases the walk toward user-relevant nodes (e.g., favoring pins in a language the user is predicted to know). Second, the number of random-walk steps per starting pin is allocated by a specific formula (Eqn. 2); the authors claim "the distribution gave us the desired property that more steps are allocated to starting pins with high degrees and pins with low degrees also receive sufficient number of steps". Third, pins that are reached from multiple starting pins are boosted, since this indicates they are more relevant than others. Fourth, early stopping and graph pruning make the random walk more efficient.

Interestingly, we can see that Pinterest loads the whole graph onto single machines so that random walks can run without cross-machine communication:

 

Retrieval

PinnerSage [3] outlines a solution for retrieving related pins at Pinterest. The most interesting part is that PinnerSage learns a variable number of embeddings per user, so retrieval is not constrained to pins around a single "mean" user preference.

PinnerSage uses a pre-trained embedding for each watched pin and clusters each user's pins watched over the past 90 days into a dynamic number of clusters. The clustering method, Ward, supports automatically determining the number of clusters. The user's embeddings are the medoids of his/her clusters. PinnerSage also computes an importance score for each cluster, which is used at serving time: only the embeddings of the top 3 most important clusters are used for retrieving new pins. Serving uses a typical approximate KNN searcher to retrieve new pins. The paper tried several common ANN indexing schemes (LSH Orthoplex, Product Quantization, and HNSW) and found HNSW performed best on cost, latency, and recall.
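Here is a hedged sketch of that pipeline using scipy's Ward linkage; the distance cutoff and the cluster-size importance heuristic are my assumptions, not the paper's exact settings.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def user_medoids(pin_embeddings: np.ndarray, distance_threshold: float = 5.0):
    # Ward linkage; cutting the dendrogram at a distance threshold lets the
    # number of clusters vary per user.
    Z = linkage(pin_embeddings, method="ward")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")

    medoids, importances = [], []
    for c in np.unique(labels):
        members = pin_embeddings[labels == c]
        # Medoid: the member minimizing total distance to the rest of the cluster.
        dist = cdist(members, members)
        medoids.append(members[dist.sum(axis=1).argmin()])
        importances.append(len(members))   # cluster size as a crude importance proxy
    return np.stack(medoids), np.array(importances)

pins = np.random.randn(200, 64)            # a user's recent pin embeddings
medoids, imp = user_medoids(pins)
top3 = medoids[np.argsort(-imp)[:3]]       # only the top clusters are used for retrieval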

Sequence Modeling

User history contains abundant information about user preference. PinnerFormer [4] details an industrial sequence model used at Pinterest. User history is represented as a sequence of pins; passing the sequence through a transformer generates an embedding per position. The loss function (named the Dense All Action Loss) trains each position's embedding to predict the pins the user will positively engage with in the future.

For model serving, the paper describes a daily workflow that updates the embeddings of users who had any type of engagement in the past day; users inactive in the past day keep their previously generated embeddings. This also implies that the transformer model itself is not updated frequently; otherwise, all user embeddings would need to be recomputed whenever the model changes.

There are several earlier sequence modeling works:

  1. Deep Interest Network [6]. In essence, DIN is a pioneering work that learns attention between the current ad and all previously interacted ads. Attention scores can be understood as the intensity of user interest connecting the ads. The only difference from other attention-based models is that the attention scores are not required to sum to exactly 1 (see the explanation in their Section 4.3).
  2. BERT4Rec [7]. Vanilla sequential recommendation models, like the vanilla Transformer- or RNN-based models referenced in [7], run through items from left to right and predict the next item sequentially; they are therefore "unidirectional". BERT4Rec applies a bidirectional transformer to learn inter-relationships between items. Its objective is a Cloze task: some items in the input sequence are randomly removed (i.e., replaced with a special "[mask]" token) and the model predicts the ids of the masked items from their surrounding context (a minimal masking sketch follows this list).
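As referenced in item 2 above, here is a minimal sketch of the Cloze-style masking; the mask probability and the special token id are assumptions.

import torch

MASK_ID = 0          # reserve id 0 for the special "[mask]" token
MASK_PROB = 0.2

def cloze_mask(item_seq: torch.Tensor):
    # item_seq: (batch, seq_len) of item ids (all >= 1)
    mask = torch.rand_like(item_seq, dtype=torch.float) < MASK_PROB
    inputs = item_seq.masked_fill(mask, MASK_ID)
    # Labels keep the original ids only at masked positions; -100 is ignored by
    # torch.nn.CrossEntropyLoss(ignore_index=-100).
    labels = item_seq.masked_fill(~mask, -100)
    return inputs, labels

seq = torch.randint(1, 1000, (4, 20))
inputs, labels = cloze_mask(seq)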


 

On device

Alibaba published EdgeRec [5], which processes fine-grained user events in real time and performs contextual reranking on device. User events are categorized into two types: item exposure actions and item page-view actions. Item exposure actions capture user scroll behavior and exposure duration, while item page-view actions capture more engaged behavior like "add to cart" or "click customer service". As you can guess, item page-view actions are much sparser than item exposure actions, so EdgeRec processes the two types of actions as two separate sequences. EdgeRec uses GRUs to process the two sequences and generate a hidden representation for each action. At the re-ranking stage, each candidate gets a re-ranking representation that comes from (1) the context of the other candidates and (2) attention over the item exposure and item page-view action sequences. The context of the other candidates is generated by another GRU that runs through the candidates in the order produced by the previous ranker. The re-ranking representation is then passed through a multi-layer perceptron (MLP) and finally a cross-entropy loss for learning.

It is important to pay attention to the serving details of EdgeRec (Section 2.2.2). First, although EdgeRec is trained as a whole, the GRUs that process action sequences and the GRU for re-ranking run separately on device during serving. The action-sequence GRUs run for each new incoming user action (O(1) time complexity per action, as claimed in the paper) and store the generated embeddings in an on-device database; the re-ranking GRU fetches those embeddings from the database and performs inference on top of them. Another engineering detail is that since the model may use sparse features whose embedding tables have high cardinality, the embedding tables are stored in the cloud and only the relevant rows are sent to the device. There are also versioning details that ensure the edge and cloud models and embedding tables match each other.

 

References

[1] On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models: https://arxiv.org/abs/2209.05310

[2] Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time: https://arxiv.org/abs/1711.07601 

[3] PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest: https://arxiv.org/abs/2007.03634

[4] PinnerFormer- Sequence Modeling for User Representation at Pinterest: https://arxiv.org/abs/2205.04507

[5] EdgeRec: Recommender System on Edge in Mobile Taobao: https://arxiv.org/abs/2005.08416

[6] Deep Interest Network for Click-Through Rate Prediction: https://arxiv.org/abs/1706.06978

[7] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer: https://arxiv.org/pdf/1904.06690.pdf

 

New Model Architectures

There have been many advancements in model architectures in the AI domain. Let me overview some of them in this post.

 

Linear Compression Embedding

LCE [1] simply uses a matrix to project one embedding matrix to another: AW = B, where W \in \mathbb{R}^{M \times N}, A \in \mathbb{R}^{B \times M}, and B \in \mathbb{R}^{B \times N} (here the B in the shapes denotes the batch size).

(Related architectures to compare against LCE: pyramid networks, the Inception network, and DHEN.)

 

Perceiver and Perceiver IO

Perceiver-based architectures [5, 6] solve a pain point of traditional transformers: computation that is quadratic in sequence length when computing QKV attention, where Q/K/V all have shape batch_size x seq_len x emb_dim. Perceiver-based architectures instead create a learnable latent array of fixed size (used as the queries in cross-attention) that is independent of the input sequence length, reducing the attention computation to linear in sequence length. Thanks to the linear attention cost, within the same computation budget they can also stack more attention blocks to improve performance. Three web tutorials are helpful: [2, 3, 4].
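Here is a hedged sketch of the latent cross-attention idea (the sizes are arbitrary and this is not the full Perceiver/Perceiver IO stack): a fixed number of learned latents attend to an arbitrarily long input, so the attention cost is O(num_latents * seq_len) rather than O(seq_len^2).

import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, num_latents=64, dim=128, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # independent of seq_len
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):                 # inputs: (batch, seq_len, dim)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.attn(query=q, key=inputs, value=inputs)
        return out                             # (batch, num_latents, dim)

x = torch.randn(2, 10_000, 128)                # a long input sequence
print(LatentCrossAttention()(x).shape)         # torch.Size([2, 64, 128])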

Since Perceiver does a good job of reducing attention computation to linear time (in terms of sequence length), I was wondering why recent LLMs have not, as far as I know, adopted this architecture. A Reddit discussion sheds some light on this [7].

 

Mamba

Mamba [9] is a new sequence-modeling architecture that also achieves linear time, based on a so-called "selective structured state space model". Currently, it achieves SOTA on many tasks, and the official implementation is available online. There is not much material that clearly articulates the theory behind Mamba; I have only found pieces of information here and there [8, 10]. The way I understand Mamba after watching [8] is:

  1. Mamba enjoys the best of both worlds of RNNs and Transformers:
      • RNN: training is O(N) but cannot be parallelized because states need to roll out sequentially; inference is O(1) per token.
      • Transformer: training is O(N^2) but can be parallelized; inference is O(N) per token (see the time analysis in [12]).
      • Mamba: training is O(N) and can be parallelized; inference is O(1) per token.
  2. According to the basic definition of structured state space sequence (S4) models [10], Mamba represents the state transition as differential equations:
    h'(t)=\mathbf{A}h(t)+\mathbf{B}x(t)
    y(t)=\mathbf{C}h(t)
    Note, h'(t) is the change (i.e., derivative) of the hidden state.
  3. The differential equations describe a continuous-time system, but in sequence modeling we are interested in sequences with discrete time steps (i.e., tokens). "Discretization" in the paper means that we use approximation methods to derive new matrices \bar{\mathbf{A}} and \bar{\mathbf{B}} when the input function x(t) is replaced by the sequence \{x(t\Delta)\}_{t=0,1,\cdots}, where \Delta is the step size.
    There are two methods to do the discretization. [8] introduces Euler's method, which is simpler for exposition; the paper uses the zero-order hold, which is explained in more detail in [13]. We use Euler's method for illustration, following [8] (a small numerical sketch of the resulting recurrence follows this list):
    1. By the definition of the derivative, we know that h(t+\Delta) \approx \Delta h'(t) + h(t).
    2. Substitute the state space model's equations into step 1:
      h(t+\Delta) \newline \approx \Delta \left( \mathbf{A}h(t)+\mathbf{B}x(t) \right) + h(t)\newline=\Delta \mathbf{A} h(t)+\Delta \mathbf{B}x(t) + h(t) \newline= \left(\mathbf{I}+\Delta\mathbf{A} \right)h(t) + \Delta \mathbf{B} x(t) \newline = \bar{\mathbf{A}} h(t) + \bar{\mathbf{B}} x(t)
  4. State transitions during training can be parallelized without waiting for a sequential rollout, because the state transition model can be converted into a convolutional formulation that depends only on the input sequence.
  5. An S4 model is used per input dimension. So if a token has an embedding size of 512, there are 512 S4 models running in parallel. In comparison, transformers use attention heads, where each head consumes all dimensions (or a group of dimensions) of a token.
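As referenced in item 3 above, here is a tiny numerical sketch of the Euler-discretized recurrence h_{t+1} = (\mathbf{I}+\Delta\mathbf{A})h_t + \Delta\mathbf{B}x_t, y_t = \mathbf{C}h_t; the matrices and step size are arbitrary.

import numpy as np

A = np.array([[-0.5, 0.0],
              [0.0, -1.0]])
B = np.array([[1.0],
              [0.5]])
C = np.array([[1.0, -1.0]])
delta = 0.1

A_bar = np.eye(2) + delta * A      # discretized state matrix (Euler)
B_bar = delta * B                  # discretized input matrix

x = np.sin(np.linspace(0, 3, 30))  # a toy 1-D input sequence
h = np.zeros((2, 1))
ys = []
for x_t in x:                      # sequential recurrence (what the conv form parallelizes)
    h = A_bar @ h + B_bar * x_t
    ys.append(float(C @ h))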

 

Reference

  1. LCE explanation: https://fb.workplace.com/groups/watchrankingexperiments/permalink/2853044271454166/ (internal) and https://www.dropbox.com/scl/fi/0ygejam1mg2jhocfmgoba/watch-_-chaining-ranking-experiment-review-_-group-_-workplace.pdf?rlkey=4euhhg6f0h838ythg1na1amz8&dl=0 (external)
  2. https://medium.com/@curttigges/the-annotated-perceiver-74752113eefb
  3. https://medium.com/ml-summaries/perceiver-paper-s-summary-3c589ec74238
  4. https://medium.com/ml-summaries/perceiver-io-paper-summary-e8f28e451d21
  5. Perceiver: General Perception with Iterative Attention: https://arxiv.org/abs/2103.03206
  6. Perceiver IO: A General Architecture for Structured Inputs & Outputs: https://arxiv.org/abs/2107.14795
  7. https://www.reddit.com/r/MachineLearning/comments/tx7e34/d_why_arent_new_llms_using_the_perceiver/
  8. https://www.youtube.com/watch?v=8Q_tqwpTpVU
  9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces: https://arxiv.org/abs/2312.00752
  10. https://srush.github.io/annotated-s4/?ref=blog.oxen.ai
  11. https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad
  12. Llama code anatomy: https://czxttkl.com/2024/01/17/llama-code-anatomy/
  13. https://en.wikipedia.org/wiki/Discretization#Derivation

 

Simulation on the ads supply problem

I have started to feel the importance of simulating any practical problem before deploying an RL policy. If you cannot implement a reasonable simulator on your own, you are not clear about your environment and your model. To me, it is pure gambling to train an RL policy offline and go directly to online testing without evaluating it in a reasonable simulator. Sure, you might see good online results once in a while, but you might not even be able to reproduce the success if you were to do it again.

Recently, I faced a new problem in ads supply. Think of a typical setting of a modern recommendation system where we need to insert ads in between organic content. If we insert ads too frequently, user engagement with organic content drops; if we insert too few ads, the system earns low ads revenue. We assume that in each user session, organic content and ads are loaded in units of "pages": when the user is close to finishing one page, the next page is formed with a specific arrangement of organic content and ads. I think an RL policy for deciding the best ad insertion positions on each page is promising because, as a business, we care about accumulated value (i.e., accumulated engagement or ads revenue). If we look at user trajectories at the page level, there may exist a sequence of ad insertion decisions that optimizes the accumulated value better than any supervised learning-based policy that only optimizes immediate reward.

Here is a hypothetical example of why RL can work better than a supervised learning-based policy:

Suppose a user will have a session of 4 pages. A supervised learning model will insert 6 ads on each of the first 3 pages; but on the last page, because the user has seen 3 pages with 6 ads each and reached fatigue, we can only insert 1 ad. The supervised learning model therefore leads to 19 ads inserted in total. However, a reinforcement learning model will know that, in order to maximize the accumulated number of ads inserted, it can place 5 ads on each of the 4 pages. The RL model will eventually insert 20 ads, one more than the SL model.

I designed an ads supply environment to test whether an RL policy works well. In this environment, every user's behavior is measured within a 7-day time window. Users have a default session continuation probability of 0.8 after seeing each page and a default next-session return time of 24 hours. Each page can have 1, 2, or 3 ads inserted. Whenever a user sees two consecutive pages with an average ad load > 1.5, the page continuation probability is shrunk by a factor of 0.2 and the next-session return time increases by 24 hours. The state features include the last two pages' ad loads, the number of sessions and pages so far, etc.
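Here is a hedged sketch of those dynamics; the actual data_generator.py in the code folder below may differ in details, and the random policy here stands in for whatever data collection policy is used.

import random

def simulate_user(policy, horizon_hours=7 * 24):
    cont_prob, return_hours, t = 0.8, 24, 0.0
    last_two_loads, total_ads, sessions = [], 0, 0
    while t < horizon_hours:
        sessions += 1
        while True:
            ad_load = policy(last_two_loads, sessions)     # choose 1, 2, or 3 ads
            total_ads += ad_load
            last_two_loads = (last_two_loads + [ad_load])[-2:]
            # Fatigue: two consecutive heavy pages shrink continuation and delay return.
            if len(last_two_loads) == 2 and sum(last_two_loads) / 2 > 1.5:
                cont_prob *= 0.2
                return_hours += 24
            if random.random() > cont_prob:
                break                                      # session ends
        t += return_hours                                  # user comes back later
    return sessions, total_ads

sessions, reward = simulate_user(lambda hist, s: random.choice([1, 2, 3]))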

We collect data using the designed simulator and a random data collection policy. The data consists of 1M episodes (equivalent to 10M (S, A, R, S', A') tuples). We compare Q-learning, SARSA, and Monte Carlo value regression. Surprisingly, MC performs noticeably better than the other two and the random data collection policy. Here are the results:

  • Random data collection policy: Avg sessions: 2.966372, avg reward: 20.859195
  • Q-Learning: Test avg sessions: 4.3327, avg reward: 33.8197
  • SARSA: Test avg sessions: 3.808, avg reward: 31.5479
  • Monte-Carlo Value Regression: Test avg sessions: 4.6173, avg reward: 43.2075

From this simulation, I have become more inclined to always start with simple methods like Monte Carlo value regression. Advanced RL algorithms may not be the best choice in a noisy real-world environment.

 

How to run the code: 

  1. Download the code folder: https://www.dropbox.com/s/p2mfowz9m63zwn7/ads_supply.zip?dl=0
  2. Run data_generator.py to generate training data, which is to be saved to offline_train_data.pickle
  3. Run mc_learning.py, q_learning.py, or sarsa.py to train a model
  4. Run eval.py to evaluate model performance. Remember to change MODEL_SAVE_PATH properly.

GFlowNet

GFlowNet is a recently developed technique for solving combinatorial optimization problems [1]. I've prepared a series of deep-dive slides for it (see GFlowNet deep dive). In this post, I just list a few more references.

References

[1] https://yoshuabengio.org/2022/03/05/generative-flow-networks/

[2] https://towardsdatascience.com/the-what-why-and-how-of-generative-flow-networks-4fb3cd309af0

[3] https://neurips.cc/media/neurips-2021/Slides/26729.pdf

[4] https://www.youtube.com/watch?v=7W69-ffTs48

[5] https://milayb.notion.site/GFlowNet-Tutorial-919dcf0a0f0c4e978916a2f509938b00#afe03e54d6db43468f8dee3a3350f98a

[6] http://folinoid.com/w/gflownet/

How does Metropolis-Hastings algorithm work?

I learned a little bit about Markov Chain Monte Carlo (MCMC) algorithms during my PhD, but I did not record my thoughts back then. In this post, I revisit the core concept of MCMC, focusing in particular on illustrating the Metropolis-Hastings (MH) algorithm.

What is the motivation of MCMC?

Suppose you have observed some data x. You would like to model the distribution of x with some distribution parameterized by \theta, and you have a prior p(\theta). You would like to know the posterior distribution p(\theta|x). By Bayes' theorem,

    \[p(\theta|x)=\frac{p(\theta) p(x|\theta)}{p(x)}=\frac{p(\theta) p(x|\theta)}{\int_\theta p(\theta) p(x|\theta) d\theta}\]

 

The denominator \int_\theta p(\theta) p(x|\theta) d\theta is often hard to compute. Therefore, it is hard to know exact statistics of the posterior distribution, such as its mean. MCMC provides a mechanism to sample points in the space of \theta such that, when a sufficient number of points has been sampled, they approximate the true posterior distribution p(\theta|x); you can then compute empirical statistics such as the empirical mean.

Let's use a concrete example (adapted from [2]). You are a teacher who wants to know the score distribution of your students. You assume the score distribution is a Normal distribution \mathcal{N}(\mu, \sigma^2), where \sigma^2=4^2=16 is a fixed variance. Following basic conjugate-prior mathematics [4], you place a prior p(\mu|\mu_0, \sigma_0^2) = \mathcal{N}(\mu_0, \sigma_0^2) on \mu, which is also a Normal distribution. You have observed 3 students' scores, so you can express the likelihood function as p(x_1, x_2, x_3|\mu, \sigma^2) \propto \frac{1}{\sigma^n} exp\left( -\frac{1}{2\sigma^2}\sum\limits_{i=1}^3 (x_i - \mu)^2 \right). In this simple example, there is an analytical expression for the posterior p(\mu| x_1, x_2, x_3), which is also a Normal distribution. But for pedagogical purposes, let's pretend the posterior p(\mu|x_1, x_2, x_3) is impossible to compute and we have to use MCMC to infer posterior statistics such as the posterior mean.

How does MCMC work? (specifically, the MH algorithm)

We focus on the MH algorithm, which is basically a loop that samples a sequence of \theta values for estimating the posterior distribution. The following steps are adapted from [2]:

  1. Begin with a plausible starting value; Let’s say you guess \mu=110 at first.

  2. Generate a new proposal by taking the last sample (110) and adding some random noise. This random noise is generated from a proposal distribution, which should be symmetric and centered on zero. In our example we will use a proposal distribution that is normal with zero mean and standard deviation of 5. This means the new proposal is 110 (the last sample) plus a random sample from N(0,5). Suppose this results in a proposal of 108.
    Note that, if the proposal distribution is not symmetric, the following sample acceptance criteria (step 4&5) will need to be adjusted. See how g(\cdot) is defined in [1].

  3. Compare the likelihood value at the new proposal (i.e., 108) against the likelihood value at the most recent sample (i.e., 110). In other words, you compare p(\mu=108|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=108, \sigma^2=16) and p(\mu=110|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=110, \sigma^2=16).
    Note: in many other online tutorials, such as [2] and [5], people interchangeably say that we compare the posterior values of the new proposal and the most recent sample. Since the quantity above differs from the posterior only by a multiplicative constant, comparing it is equivalent to comparing the posterior.

  4. If the new proposal has a higher likelihood value than the most recent sample, then accept the new proposal.

  5. If the new proposal has a lower likelihood value than the most recent sample, then randomly choose to accept or reject the new proposal, with a probability equal to the ratio of the two likelihood values. For example, if the likelihood value at the new proposal value is one-fifth as high as the likelihood of the most recent sample, then accept the new proposal with 20% probability.

  6. If the new proposal is accepted, it becomes the next sample in the MCMC chain, otherwise the next sample in the MCMC chain is just a copy of the most recent sample.

  7. This completes one iteration of MCMC. The next iteration is completed by returning to step 2.

  8. Stop when there are enough samples (e.g., 500). Deciding when one has enough samples is a separate issue. Now you can compute any empirical statistics as you like, such as the empirical posterior mean. 

The above 8 steps illustrate the core formula of MH, the acceptance criterion [1, 3]:

    \[\alpha(\theta^{new}, \theta^{old})=min \left(1, \frac{p(\theta^{new}|x)g(\theta^{old}|\theta^{new})}{p(\theta^{old}|x) g(\theta^{new}|\theta^{old})}\right),\]
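where g(\cdot|\cdot) is the proposal distribution, which cancels out when it is symmetric. To make the steps concrete, here is a short sketch of the MH loop for the teacher example, working in log space for numerical stability; the prior parameters, the observed scores, and the number of iterations are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
scores = np.array([95.0, 110.0, 118.0])
sigma = 4.0
mu0, sigma0 = 100.0, 15.0           # assumed prior mean / std on mu

def log_unnormalized_posterior(mu):
    log_prior = -0.5 * ((mu - mu0) / sigma0) ** 2
    log_lik = -0.5 * np.sum(((scores - mu) / sigma) ** 2)
    return log_prior + log_lik

samples = [110.0]                   # step 1: a plausible starting value
for _ in range(5000):
    proposal = samples[-1] + rng.normal(0.0, 5.0)            # step 2: symmetric N(0, 5) noise
    log_alpha = (log_unnormalized_posterior(proposal)
                 - log_unnormalized_posterior(samples[-1]))
    if np.log(rng.uniform()) < log_alpha:                    # steps 4-5, in log space
        samples.append(proposal)                             # accept the proposal
    else:
        samples.append(samples[-1])                          # step 6: copy the last sample
print(np.mean(samples[500:]))       # empirical posterior mean after burn-in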

 

References:

[1] https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm

[2] https://link.springer.com/article/10.3758/s13423-016-1015-8

[3] https://stats.stackexchange.com/a/497060/80635

[4] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[5] https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50 (the most liked comment is very useful for understanding)