GATO and related AGI research

Policy Generalist

DeepMind recently published a work named Gato. I find it interesting because Gato learns a single multi-modal, multi-task policy for many tasks such as robot arm manipulation, playing Atari, and image captioning. The original paper [2] does not spell out every implementation detail, but I'll summarize what I understand as best I can. I will also survey literature pertinent to Gato, because it can be very useful toward the north star of general AI.

First of all, Gato adopts the most common architecture these days – transformers. The core idea is to transform data of different modalities into sequences of tokens. The tokens are then passed through embedding tables or a model-based transformation to generate embeddings of the same embedding dimension.

Here is how the different modalities are sequentialized. Text data are natural sequences of words; each word (or subword) is indexed with a unique integer. Images are split into non-overlapping 16×16 patches in raster order. Reinforcement learning tasks generate sequences of (observation, action) tuples. Rewards are not included in the sequences because (I guess) they mainly use behavior cloning / imitation learning: they pick top-performing expert demonstration sequences and use supervised learning to fit the expert actions. Continuous actions are discretized. Image patches are passed through a ResNet to generate their embeddings, while text and action tokens are passed through learnable embedding tables.
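
As a rough illustration of the action/value tokenization, here is a minimal sketch based on my reading of [2]; the mu-law constants, bin count, and vocabulary offset below are assumptions for illustration, not necessarily the paper's exact values.

```python
import numpy as np

def mu_law_encode(x, mu=100, m=256):
    # Mu-law companding squashes large-magnitude values into a bounded range.
    # The constants are illustrative assumptions.
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(x, num_bins=1024, offset=32000):
    """Discretize continuous observations/actions into integer tokens.

    `offset` shifts action/value tokens past the text vocabulary; the exact
    offset and bin count are illustrative assumptions.
    """
    x = np.clip(mu_law_encode(np.asarray(x, dtype=np.float64)), -1.0, 1.0)
    bins = np.floor((x + 1.0) / 2.0 * (num_bins - 1)).astype(np.int64)
    return bins + offset

print(tokenize_continuous([-0.5, 0.0, 2.3]))  # three integer token ids
```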

If you pay attention, you will find that all these tasks end up with a discrete action space. As mentioned earlier, reinforcement learning tasks have either discrete actions or continuous actions that are discretized. For chatbot / image captioning tasks, the action space is basically the vocabulary. So the training loss used by Gato can simply be an autoregressive cross-entropy:
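
From my reading of [2], the objective is (up to notation) a masked autoregressive log-likelihood over token sequences, roughly:

    \[\mathcal{L}(\theta) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l)\, \log p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right),\]

where s_l^{(b)} is the l-th token of sequence b and the mask m(b, l) is 1 only at positions whose targets should be predicted (e.g., text and action tokens rather than observation tokens).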

Algorithm Distillation

Algorithm distillation was a novel idea to me at first glance [3], though I can relate it to works from a few years ago on learning to learn using RNNs [4, 5]. The idea of [3] is that when we solve one or multiple RL tasks with an algorithm of interest, we can record (state, action, reward) sequences that span multiple episodes. When these multi-episode sequences are long enough, they contain information about how the algorithm improves the policy over time and how the policy being improved interacted with the environment(s). We can train a sequence model (e.g., a transformer) to directly mimic the multi-episode sequences. As a result, the RL algorithm itself is distilled into the sequence model. We can then run the sequence model on a new task and ask it to simply unroll itself. The sequence model will magically unroll actions that gradually improve returns over time, just as if we were running the original RL algorithm on the new task!
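
Here is a minimal, hypothetical PyTorch sketch of the training loop implied by [3]: tokenize whole multi-episode learning histories and fit a causal sequence model with next-token cross-entropy. The tiny transformer and token layout are placeholders, not the authors' setup.

```python
import torch
import torch.nn as nn

class CausalSequenceModel(nn.Module):
    """Stand-in for a decoder-only transformer; any causal sequence model works."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # Additive causal mask: -inf above the diagonal forbids attending to the future.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

# Each training example is a multi-episode history: (s, a, r, s, a, r, ...) tokens
# spanning the *whole learning run* of the source RL algorithm, so the model can
# pick up how actions improve over time, not just what a fixed policy does.
model = CausalSequenceModel(vocab_size=512)
history = torch.randint(0, 512, (8, 256))          # fake tokenized histories
logits = model(history[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 512), history[:, 1:].reshape(-1)
)
loss.backward()
```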

 

 

References

[1] DeepMind Gato Explained – One AI Model for 600 Tasks: https://www.youtube.com/watch?v=EEcxkv60nxM

[2] A Generalist Agent: https://arxiv.org/abs/2205.06175

[3] In-context Reinforcement Learning with Algorithm Distillation: https://arxiv.org/abs/2210.14215

[4] Learning to Learn without Gradient Descent by Gradient Descent: https://arxiv.org/abs/1611.03824

[5] Learning to learn by gradient descent by gradient descent: https://proceedings.neurips.cc/paper/2016/file/fb87582825f9d28a8d42c5e5e5e8b23d-Paper.pdf

 

Some recent recsys papers

Seven years ago I posted a tutorial about recommendation systems. Now it is 2022 and there have been many more advancements. This post overviews several of the latest ideas.

CTR models

Google’s RecSys 2022 paper [1] shares many practical details of their CTR models. First, to reduce training cost, there are three effective techniques: bottleneck layers (a layer’s weight matrix is factorized into two low-rank matrices), weight-sharing neural architecture search, and data sampling. The paper lists each technique’s contribution to the training-cost reduction; data sampling appears to be especially effective.

Second, to improve model performance, they add a rank loss (a pairwise ranking loss) on top of the pointwise cross-entropy training loss, apply distillation, use a second-order optimizer, and add a so-called Deep & Cross Network to learn feature interactions. The paper also lists each technique’s contribution to accuracy.

 

Graph models 

Let’s look at Pixie, Pinterest’s real-time random walk system for recommending pins [2]. The graph in this paper has pins and boards (a board is a collection of pins on a similar topic) as nodes. If a user has saved pin p to board b, then there is an edge between nodes p and b. Given a query pin, we can use random walks to surface the most connected pins on the graph. The idea of random walks was introduced in a previous post.

Pixie customizes the random walk in several ways to boost its business value. First, it biases the random walk toward user-relevant nodes (e.g., favoring pins in the same language the user is predicted to know). Second, it allocates the number of random walk steps per query pin based on a specific formula (Eqn. 2); the authors claim “the distribution gave us the desired property that more steps are allocated to starting pins with high degrees and pins with low degrees also receive sufficient number of steps”. Third, it boosts pins that are reached from multiple query pins, which indicates those pins are more relevant than others. Fourth, it applies early stopping and graph pruning to make the random walk more efficient.
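
To make the mechanics concrete, here is a toy sketch of the plain random walk with restart that Pixie builds on; the user-specific biasing, per-pin step allocation, multi-query boosting, and early stopping from [2] are deliberately left out.

```python
import random
from collections import Counter, defaultdict

def pixie_like_random_walk(graph, query_pin, num_steps=10000, alpha=0.5):
    """Toy random walk with restart on a bipartite pin-board graph.

    `graph` maps each node to its neighbor list (pins <-> boards). Real Pixie
    additionally biases edge selection toward user-relevant nodes, allocates
    steps per query pin by node degree, boosts pins hit from multiple query
    pins, and early-stops; none of that is reproduced here.
    """
    visit_counts = Counter()
    current = query_pin
    for _ in range(num_steps):
        if random.random() < alpha:            # restart at the query pin
            current = query_pin
        board = random.choice(graph[current])  # pin -> board
        current = random.choice(graph[board])  # board -> pin
        visit_counts[current] += 1
    return visit_counts.most_common(10)

# Tiny example graph: pins p1..p4, boards b1..b2.
graph = defaultdict(list)
for pin, board in [("p1", "b1"), ("p2", "b1"), ("p2", "b2"), ("p3", "b2"), ("p4", "b2")]:
    graph[pin].append(board)
    graph[board].append(pin)

print(pixie_like_random_walk(graph, "p1", num_steps=1000))
```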

Interestingly, Pinterest loads the whole graph onto a single machine so that random walks can happen without cross-machine communication.

 

Retrieval

PinnerSage [3] outlines a solution for retrieving related pins at Pinterest. The most interesting part is that PinnerSage learns a fluid number of embeddings per user, so retrieval is not constrained to only the pins around the “mean” user preference.

PinnerSage uses pre-trained embeddings for each watched pin and clusters each user’s pins watched in the past 90 days into a dynamic number of clusters. The clustering method, which automatically determines the number of clusters, is Ward. The user’s embeddings are the medoids of their clusters. PinnerSage also computes an importance score for each cluster; at serving time only the top 3 most important clusters’ embeddings are used to retrieve new pins. Serving uses a typical approximate KNN searcher. The paper tried several common ANN indexing schemes (LSH Orthoplex, Product Quantization, and HNSW) and found HNSW performed best on cost, latency, and recall.
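
Below is a minimal sketch of the clustering step, assuming pre-trained pin embeddings are already available; scipy's Ward linkage, the distance threshold, and a size-based importance score stand in for PinnerSage's own implementation details.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

def user_embeddings_from_pins(pin_embs: np.ndarray, distance_threshold: float = 5.0):
    """Cluster one user's watched-pin embeddings with Ward linkage and return
    (medoid embedding, cluster size) per cluster. The distance threshold that
    decides the number of clusters is an illustrative assumption; PinnerSage
    also weights cluster importance by more than raw size.
    """
    labels = fcluster(linkage(pin_embs, method="ward"), t=distance_threshold,
                      criterion="distance")
    results = []
    for c in np.unique(labels):
        members = pin_embs[labels == c]
        # Medoid = the member minimizing total distance to the rest of the cluster.
        medoid = members[cdist(members, members).sum(axis=1).argmin()]
        results.append((medoid, len(members)))
    # Sort clusters by an importance proxy (size); keep the top 3 for retrieval.
    return sorted(results, key=lambda r: r[1], reverse=True)[:3]

pins = np.random.randn(200, 32)      # fake embeddings of pins watched in 90 days
for medoid, size in user_embeddings_from_pins(pins):
    print(size, medoid[:4])
```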

Sequence Modeling

User history contains abundant information about user preference. PinnerFormer [4] details an industrial sequence model used at Pinterest. User history is represented as a pin sequence; passing the sequence through a transformer generates an embedding per position. The loss function (named Dense All Action Loss) trains the embedding at each position to predict the user’s future positively engaged pins.

For model serving, the paper describes a daily workflow that updates the embeddings of users who had any type of engagement in the past day; users inactive in the past day keep their previously generated embeddings. However, this also implies the transformer model itself is not updated frequently, otherwise all user embeddings would need to be recomputed whenever the model changes.

There are several earlier sequence modeling works:

  1. Deep Interest Network [6]. In essence, DIN is a pioneering work on learning attention between the candidate ad and all of the user’s previous ads. Attention scores can be understood as the intensity of user interest connecting the ads. The main difference between DIN and other attention-based models is that its attention scores are not constrained to sum exactly to 1 (see their explanation in section 4.3).
  2. Bert4Rec [7]. Vanilla sequential recommendation models, like the vanilla Transformer- or RNN-based models referenced in [7], run through items from left to right and predict the next item sequentially; they are therefore considered “unidirectional”. Bert4Rec applies bidirectional transformer models to learn inter-relationships between items. It adopts the Cloze task as its objective: randomly replace some items in the input sequence with a special “[mask]” token and predict the ids of the masked items from their surrounding context (a small masking sketch follows this list).
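
A small sketch of the Cloze masking step; the 15% mask rate and the ignore-index convention are assumptions borrowed from BERT-style training, not necessarily [7]'s exact settings.

```python
import torch

MASK_ID = 0  # reserved "[mask]" token id; real vocabularies reserve a dedicated id

def cloze_mask(item_seq: torch.Tensor, mask_prob: float = 0.15):
    """Randomly replace items with [mask] and return (corrupted_seq, labels).

    Labels are -100 (ignored by cross-entropy) at unmasked positions, so the
    bidirectional encoder is only trained to recover the masked item ids.
    """
    mask = torch.rand(item_seq.shape) < mask_prob
    labels = torch.where(mask, item_seq, torch.full_like(item_seq, -100))
    corrupted = torch.where(mask, torch.full_like(item_seq, MASK_ID), item_seq)
    return corrupted, labels

seq = torch.randint(1, 1000, (4, 20))      # batch of item-id sequences
corrupted, labels = cloze_mask(seq)
# A bidirectional transformer encoder would take `corrupted` (with no causal mask)
# and be trained with cross_entropy(logits.view(-1, vocab_size), labels.view(-1)).
```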


 

On device

Alibaba published EdgeRec [5], which can process fine-grained user events in real time and perform contextual re-ranking. They categorize user events into two types: item exposure actions and item page-view actions. Item exposure actions describe user scroll behavior and exposure duration, while item page-view actions describe more engaged behavior like “add to cart” or “click customer service”. As you can guess, item page-view actions are much sparser than item exposure actions, so EdgeRec processes the two types of actions as two separate sequences. EdgeRec uses GRUs to process the two sequences and generate a hidden representation for each action. At the re-ranking stage, each candidate gets a re-ranking representation that comes from (1) the context of the other candidates and (2) attention over the item exposure and item page-view action sequences. The context of the other candidates is generated by another GRU that runs through the candidates in the order produced by the previous ranker. The re-ranking representation is then passed through a multi-layer perceptron (MLP) and finally a cross-entropy loss for learning.

It is important to pay attention to EdgeRec’s serving details (section 2.2.2). First, although the whole EdgeRec model is trained jointly, the GRUs that process action sequences and the re-ranking GRU run separately on device during serving. The action-sequence GRUs run for each new incoming user action (O(1) time complexity, as claimed in the paper) and store the generated embeddings in an on-device database. The re-ranking GRU fetches those embeddings from the on-device database and then performs model inference based on them. Another engineering detail is that, since the model may use sparse features whose embedding tables have high cardinality, the embedding tables are stored in the cloud and only the relevant rows are sent to the device. There are also versioning details which ensure the edge and cloud models and embedding tables match each other.

 

References

[1] On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models: https://arxiv.org/abs/2209.05310

[2] Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time: https://arxiv.org/abs/1711.07601 

[3] PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest: https://arxiv.org/abs/2007.03634

[4] PinnerFormer- Sequence Modeling for User Representation at Pinterest: https://arxiv.org/abs/2205.04507

[5] EdgeRec: Recommender System on Edge in Mobile Taobao: https://arxiv.org/abs/2005.08416

[6] Deep Interest Network for Click-Through Rate Prediction: https://arxiv.org/abs/1706.06978

[7] BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer: https://arxiv.org/pdf/1904.06690.pdf

 

New Model Architectures

There have been many advancements in model architectures in the AI domain. This post overviews several of them.

 

Linear Compression Embedding

LCE [1] simply uses a matrix to project one embedding matrix to another: AW = B, where A \in \mathbb{R}^{n \times M} holds n input embeddings, W \in \mathbb{R}^{M \times N} is the learnable projection, and B \in \mathbb{R}^{n \times N} holds the compressed embeddings.
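
In code, LCE is just a learned linear projection of an embedding matrix; a trivial NumPy illustration:

```python
import numpy as np

n, m_dim, n_dim = 4, 256, 64
A = np.random.randn(n, m_dim)       # original embeddings
W = np.random.randn(m_dim, n_dim)   # compression matrix (learned jointly with the task)
B = A @ W                           # compressed embeddings, shape (n, n_dim)
print(B.shape)
```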

Pyramid networks, inception network, dhen, lce

 

Perceiver and Perceiver IO

Perceiver-based architectures [5, 6] solve a pain point of traditional transformers: computation that is quadratic in sequence length when computing QKV attention, where Q/K/V all have shape batch_size x seq_len x emb_dim. Perceiver-based architectures introduce a learnable latent array whose size is independent of the input sequence length and use it as the queries in cross-attention over the inputs, which reduces the attention computation to linear in sequence length. Thanks to the linear attention cost, within the same computation budget they can also stack more attention blocks to improve performance. Three web tutorials are helpful: [2, 3, 4].
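
Here is a single-head sketch of the latent cross-attention that gives Perceiver its linear cost in sequence length; the projection sizes and the missing output projection are simplifications, not the exact architecture.

```python
import torch
import torch.nn.functional as F

def latent_cross_attention(inputs, latents, wq, wk, wv):
    """One Perceiver-style cross-attention step (single head, no output projection).

    inputs:  (batch, seq_len, dim)   -- the raw token/byte array
    latents: (batch, n_latents, dim) -- learned array, n_latents << seq_len
    Cost is O(n_latents * seq_len), i.e. linear in sequence length, because the
    queries come from the fixed-size latent array rather than from the inputs.
    """
    q = latents @ wq                       # (batch, n_latents, dim)
    k = inputs @ wk                        # (batch, seq_len, dim)
    v = inputs @ wv                        # (batch, seq_len, dim)
    attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v                        # (batch, n_latents, dim)

batch, seq_len, n_latents, dim = 2, 10_000, 256, 64
inputs = torch.randn(batch, seq_len, dim)
latents = torch.randn(batch, n_latents, dim)           # an nn.Parameter in practice
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
print(latent_cross_attention(inputs, latents, wq, wk, wv).shape)  # (2, 256, 64)
```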

Since Perceiver does a good job of reducing attention computation to linear time (in sequence length), I was wondering why recent LLMs have not, as far as I know, adopted this architecture. A Reddit discussion sheds some light [7].

 

Mamba

Mamba [9] is a new sequence modeling architecture that also achieves linear time, based on a so-called “selective structured state space model”. It currently achieves SOTA on many tasks. The official implementation can be found here. There is not much material that clearly articulates the theory behind Mamba; I have only found pieces of information here and there [8, 10]. My understanding of Mamba after watching [8]:

  1. Mamba enjoys the best of both worlds of RNNs and Transformers:

     | Model | Training | Inference |
     |-------|----------|-----------|
     | RNN | O(N), but cannot be parallelized because states must be rolled out sequentially | O(1) per token |
     | Transformer | O(N^2), but can be parallelized | O(N) per token (see the time analysis in [12]) |
     | Mamba | O(N), and can be parallelized | O(1) per token |
  2. According to the basic definition of structured state space sequence (S4) models [10], Mamba represents the state transition as differential equations:
    h'(t)=\mathbf{A}h(t)+\mathbf{B}x(t)
    y(t)=\mathbf{C}h(t)
    Note that h'(t) is the change (i.e., derivative) of the hidden state.
  3. The differential equations describe a continuous-time system, but in sequence modeling we are interested in sequences with discrete time steps (i.e., tokens). “Discretization” in the paper means using approximation methods to derive new matrices \bar{\mathbf{A}} and \bar{\mathbf{B}} when we assume the input function x(t) becomes the sequence \{x(t\Delta)\}_{t=0,1,\cdots}, where \Delta is the step size.
    There are two methods for the discretization: [8] introduces Euler’s method, which is simple for exposition, while the paper uses the zero-order hold, explained in more detail in [13]. Following [8], we use Euler’s method for illustration:
    1. By the definition of the derivative, we know that h(t+\Delta) \approx \Delta h'(t) + h(t)
    2. Substitute the state space model’s equations into step 1:

       \[\begin{aligned} h(t+\Delta) &\approx \Delta \left( \mathbf{A}h(t)+\mathbf{B}x(t) \right) + h(t)\\ &= \Delta \mathbf{A} h(t)+\Delta \mathbf{B}x(t) + h(t) \\ &= \left(\mathbf{I}+\Delta\mathbf{A} \right)h(t) + \Delta \mathbf{B} x(t) \\ &= \bar{\mathbf{A}} h(t) + \bar{\mathbf{B}} x(t), \end{aligned}\]

       where \bar{\mathbf{A}} = \mathbf{I}+\Delta\mathbf{A} and \bar{\mathbf{B}} = \Delta \mathbf{B}.
  4. State transitions during training can be parallelized, without waiting for a sequential rollout, because a (time-invariant) state space model can be converted into a convolutional formulation that depends only on the input sequence (see the formula right after this list).
  5. An S4 model is used per input dimension. So if a token has an embedding size of 512, there will be 512 S4 models running in parallel. In comparison, transformers use attention heads where each attention head consumes all dimensions (or a group of dimensions) of a token.
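
For a time-invariant S4 model, the convolutional view (written here from my reading of [10], so treat the exact indexing as a sketch) precomputes a kernel from the discretized matrices and convolves it with the input:

    \[\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \ldots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \qquad y = x * \bar{\mathbf{K}},\]

i.e., y_k = \sum_{j=0}^{k} \mathbf{C}\bar{\mathbf{A}}^{j}\bar{\mathbf{B}}\, x_{k-j}. Note that Mamba’s selective variant makes \bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C} input-dependent, so it relies on a parallel scan rather than this fixed convolution, but the intuition for why training can be parallelized is similar.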

 

Reference

  1. LCE explanation: https://fb.workplace.com/groups/watchrankingexperiments/permalink/2853044271454166/ (internal) and https://www.dropbox.com/scl/fi/0ygejam1mg2jhocfmgoba/watch-_-chaining-ranking-experiment-review-_-group-_-workplace.pdf?rlkey=4euhhg6f0h838ythg1na1amz8&dl=0 (external)
  2. https://medium.com/@curttigges/the-annotated-perceiver-74752113eefb
  3. https://medium.com/ml-summaries/perceiver-paper-s-summary-3c589ec74238
  4. https://medium.com/ml-summaries/perceiver-io-paper-summary-e8f28e451d21
  5. Perceiver: General Perception with Iterative Attention: https://arxiv.org/abs/2103.03206
  6. Perceiver IO: A General Architecture for Structured Inputs & Outputs: https://arxiv.org/abs/2107.14795
  7. https://www.reddit.com/r/MachineLearning/comments/tx7e34/d_why_arent_new_llms_using_the_perceiver/
  8. https://www.youtube.com/watch?v=8Q_tqwpTpVU
  9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces: https://arxiv.org/abs/2312.00752
  10. https://srush.github.io/annotated-s4/?ref=blog.oxen.ai
  11. https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad
  12. Llama code anatomy: https://czxttkl.com/2024/01/17/llama-code-anatomy/
  13. https://en.wikipedia.org/wiki/Discretization#Derivation

 

Simulation on the ads supply problem

I have started to feel the importance of simulating any practical problem before deploying an RL policy. If you cannot implement a reasonable simulator on your own, you are not clear about your environment or your model. To me, it is pure gambling to train an RL policy offline and go directly to online testing without first validating it in a reasonable simulator. Sure, you might see good online results once in a while, but you might not even be able to reproduce the success if you were to do it again.

Recently, I faced a new problem in ads supply. Think about a typical modern recommendation system where we need to insert ads in between organic content. If we insert ads too frequently, user engagement with organic content will drop; if we insert too few ads, the system will have low ads revenue. We assume that in each user session, organic content and ads are loaded in units of “pages”: when the user is close to finishing one page, the next page is formed with a specific ranking of organic content and ads. I think an RL policy that decides the best ad insertion positions on each page is promising because, as a business, we care about accumulated value (i.e., accumulated engagement or ads revenue). If we look at user trajectories in units of pages, there might exist a sequence of ad insertion decisions that optimizes the accumulated value better than any supervised learning-based policy which only optimizes for immediate reward.

Here is a hypothetical example of why RL can work better than a supervised learning-based policy:

Suppose a user will have a session of 4 pages. A supervised learning model inserts 6 ads on each of the first 3 pages; but on the last page, because the user has seen 3 pages with 6 ads each and reached fatigue, we can only insert 1 ad. The supervised learning model thus inserts 19 ads in total. A reinforcement learning model, however, will know that to maximize the accumulated number of ads inserted, it can place 5 ads on each of the 4 pages. The RL model will eventually insert 20 ads, one more than the SL model.

I designed an ads supply environment for testing whether an RL policy works well. In this environment, every user’s behavior is measured within a time window of 7 days. Users have a default session continuation probability of 0.8 after seeing each page and a default next-session return time of 24 hours. Each page can have 1, 2, or 3 ads inserted. Whenever a user sees two consecutive pages with an average ad load > 1.5, the session continuation probability is shrunk by a factor of 0.2 and the next-session return time increases by 24 hours. The state features include the last two pages’ ad loads, the number of sessions and pages so far, etc.
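
Below is a compressed, hypothetical sketch of the dynamics just described; the real implementation is in the code folder linked at the end of this post, and the state features and reward here are simplified, so do not treat this as the actual environment code.

```python
import random

class AdsSupplySim:
    """Toy version of the ads supply environment described above.

    Reward here is simply the number of ads shown on the page; the real
    environment may define states and rewards differently.
    """
    HORIZON_HOURS = 7 * 24

    def reset(self):
        self.cont_prob, self.return_hours, self.clock = 0.8, 24, 0.0
        self.ad_loads = []
        return self._state()

    def _state(self):
        # Simplified state: last two ad loads plus elapsed time.
        return tuple(self.ad_loads[-2:]) + (self.clock,)

    def step(self, num_ads):                      # action: insert 1, 2, or 3 ads
        assert num_ads in (1, 2, 3)
        self.ad_loads.append(num_ads)
        if len(self.ad_loads) >= 2 and sum(self.ad_loads[-2:]) / 2 > 1.5:
            self.cont_prob *= 0.2                 # fatigue: user stays less
            self.return_hours += 24               # ...and returns later
        reward = num_ads
        if random.random() > self.cont_prob:      # session ends
            self.clock += self.return_hours
        done = self.clock >= self.HORIZON_HOURS
        return self._state(), reward, done

env = AdsSupplySim()
state, done, total = env.reset(), False, 0
while not done:
    state, r, done = env.step(random.choice([1, 2, 3]))  # random collection policy
    total += r
print("ads shown in 7 days:", total)
```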

We collect data using the designed simulator and a random data collection policy. The data consists of 1M episodes (equivalent to 10M (S, A, R, S', A') tuples). We compare Q-learning, SARSA, and Monte Carlo value regression. Surprisingly, MC performs considerably better than the other two and the random data collection policy. Here are the results:

  • Random data collection policy: Avg sessions: 2.966372, avg reward: 20.859195
  • Q-Learning: Test avg sessions: 4.3327, avg reward: 33.8197
  • SARSA: Test avg sessions: 3.808, avg reward: 31.5479
  • Monte-Carlo Value Regression: Test avg sessions: 4.6173, avg reward: 43.2075

From this simulation, I have become more inclined to always start with simple methods like Monte Carlo value regression. Advanced RL algorithms may not be the best in a noisy real-world environment.

 

How to run the code: 

  1. Download the code folder: https://www.dropbox.com/s/p2mfowz9m63zwn7/ads_supply.zip?dl=0
  2. Run data_generator.py to generate training data, which is to be saved to offline_train_data.pickle
  3. Run mc_learning.py, q_learning.py, or sarsa.py to train a model
  4. Run eval.py to evaluate model performance. Remember to change MODEL_SAVE_PATH properly.

GFlowNet

GFlowNet is a recently proposed technique for solving combinatorial optimization problems [1]. I’ve prepared a series of deep-dive slides on it (see GFlowNet deep dive). In this post, I just list a few more references.

References

[1] https://yoshuabengio.org/2022/03/05/generative-flow-networks/

[2] https://towardsdatascience.com/the-what-why-and-how-of-generative-flow-networks-4fb3cd309af0

[3] https://neurips.cc/media/neurips-2021/Slides/26729.pdf

[4] https://www.youtube.com/watch?v=7W69-ffTs48

[5] https://milayb.notion.site/GFlowNet-Tutorial-919dcf0a0f0c4e978916a2f509938b00#afe03e54d6db43468f8dee3a3350f98a

[6] http://folinoid.com/w/gflownet/

How does the Metropolis-Hastings algorithm work?

I learned a bit about Markov Chain Monte Carlo (MCMC) algorithms during my PhD but did not record my thoughts back then. In this post, I revisit the core concepts of MCMC, focusing on illustrating the Metropolis-Hastings (MH) algorithm.

What is the motivation for MCMC?

Suppose you have observed some data x. You would like to express the distribution of x as some distribution parameterized by \theta. You have some prior p(\theta) and you would like to know the posterior distribution p(\theta|x). Based on Bayes’ theorem,

    \[p(\theta|x)=\frac{p(\theta) p(x|\theta)}{p(x)}=\frac{p(\theta) p(x|\theta)}{\int_\theta p(\theta) p(x|\theta) d\theta}\]

 

The denominator \int_\theta p(\theta) p(x|\theta) d\theta is often hard to compute, so it is hard to know exact statistics of the posterior distribution, such as its mean. MCMC provides a mechanism to sample points in the space of \theta such that, once a sufficient number of points has been sampled, they approximate the true posterior distribution p(\theta|x); you can then compute empirical statistics such as the empirical mean.

Let’s use one concrete example (adapted from [2]). You are a teacher who wants to know the score distribution of your students. You assume the scores follow a normal distribution \mathcal{N}(\mu, \sigma^2), where \sigma^2=4^2=16 is a fixed, known variance. Following basic conjugate prior mathematics [4], you have a prior distribution p(\mu|\mu_0, \sigma_0^2) = \mathcal{N}(\mu_0, \sigma_0^2), which is also a normal distribution. You have observed 3 students’ scores, so you can express the likelihood function as p(x_1, x_2, x_3|\mu, \sigma^2) \propto \frac{1}{\sigma^n} \exp\left( -\frac{1}{2\sigma^2}\sum\limits_{i=1}^3 (x_i - \mu)^2 \right). In this simple example there is an analytical expression for the posterior p(\mu| x_1, x_2, x_3), which is also a normal distribution. But for pedagogical purposes, let’s pretend the posterior p(\mu|x_1, x_2, x_3) is impossible to compute and we have to use MCMC to infer posterior statistics such as the posterior mean.

How does MCMC work? (specifically, the MH algorithm)

We focus on the MH algorithm, which is basically a loop that samples values of \theta to estimate the posterior distribution. The following steps are adapted from [2]:

  1. Begin with a plausible starting value; Let’s say you guess \mu=110 at first.

  2. Generate a new proposal by taking the last sample (110) and adding some random noise. This random noise is generated from a proposal distribution, which should be symmetric and centered on zero. In our example we will use a proposal distribution that is normal with zero mean and standard deviation of 5. This means the new proposal is 110 (the last sample) plus a random sample from N(0,5). Suppose this results in a proposal of 108.
    Note that, if the proposal distribution is not symmetric, the following sample acceptance criteria (step 4&5) will need to be adjusted. See how g(\cdot) is defined in [1].

  3. Compare the likelihood value at the new proposal (i.e., 108) against the likelihood value at the most recent sample (i.e., 110). In other words, you compare p(\mu=108|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=108, \sigma^2=16) with p(\mu=110|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=110, \sigma^2=16).
    Note: in many other online tutorials such as [2] and [5], people also interchangeably say that we compare the posterior values of the new proposal and the most recent sample. Since this prior-times-likelihood value differs from the posterior only by a multiplicative constant, comparing it is equivalent to comparing the posterior.

  4. If the new proposal has a higher likelihood value than the most recent sample, then accept the new proposal.

  5. If the new proposal has a lower likelihood value than the most recent sample, then randomly choose to accept or reject the new proposal, with a probability equal to the ratio of the two likelihood values. For example, if the likelihood value at the new proposal value is one-fifth as high as the likelihood of the most recent sample, then accept the new proposal with 20% probability.

  6. If the new proposal is accepted, it becomes the next sample in the MCMC chain, otherwise the next sample in the MCMC chain is just a copy of the most recent sample.

  7. This completes one iteration of MCMC. The next iteration is completed by returning to step 2.

  8. Stop when there are enough samples (e.g., 500). Deciding when one has enough samples is a separate issue. Now you can compute any empirical statistics as you like, such as the empirical posterior mean. 

The above 8 steps illustrate the core formula of MH, the acceptance criterion [1, 3]:

    \[\alpha(\theta^{new}, \theta^{old})=\min \left(1, \frac{p(\theta^{new}|x)\, g(\theta^{old}|\theta^{new})}{p(\theta^{old}|x)\, g(\theta^{new}|\theta^{old})}\right)\]
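
To make the loop concrete, here is a minimal NumPy implementation of MH for the teacher example; the prior parameters and the three observed scores are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([92.0, 105.0, 110.0])          # observed student scores (made up)
sigma = 4.0                                      # known likelihood std
mu0, sigma0 = 100.0, 15.0                        # prior N(mu0, sigma0^2), assumed

def log_unnormalized_posterior(mu):
    log_prior = -0.5 * ((mu - mu0) / sigma0) ** 2
    log_lik = -0.5 * np.sum(((scores - mu) / sigma) ** 2)
    return log_prior + log_lik

samples, mu = [], 110.0                          # step 1: plausible starting value
for _ in range(5000):
    proposal = mu + rng.normal(0.0, 5.0)         # step 2: symmetric proposal
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(mu)
    if np.log(rng.uniform()) < log_ratio:        # steps 3-5: accept with prob min(1, ratio)
        mu = proposal
    samples.append(mu)                           # step 6: copy the old sample if rejected

print("posterior mean ~", np.mean(samples[500:]))  # drop burn-in samples
```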

 

References:

[1] https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm

[2] https://link.springer.com/article/10.3758/s13423-016-1015-8

[3] https://stats.stackexchange.com/a/497060/80635

[4] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[5] https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50 (the most liked comment is very useful for understanding)

 

 

Terms you need to know when working on Ads

There are several terms often used in the Ads ML domain. Let’s dive into them.

total bid = ecpm bid + quality bid

ecpm bid = paced bid * ecvr

  • ecvr is “expected conversion rate”. Different ads can have different definitions of conversion [8] – e.g., link click, app install, or offsite conversion. ecvr = p(conversion | impression) * position_discount
  • paced bid = bid by advertiser * pacer, where pacer controls how fast the advertiser’s budget is spent

advertiser value is the same thing as ecpm bid [15]

position_discount is a function of the position, the page type, and the ad’s conversion type. Ads in the same request share the same page type. Note that position_discount may not decrease monotonically as the slot index increases.

ecpm price = 2nd ecpm bid + 2nd quality bid – 1st quality bid (we charge the runner-up’s ecpm bid plus the difference between quality bids). My interpretation: ecpm price is what we, the platform (e.g., Meta), actually charge, whereas ecpm bid is what advertisers bid.

Price-to-value ratio (PVR) = ecpm price / ecpm bid

Billing Based Revenue BBR = sum over all impressions (ecpm price) 

Note: the summation is over all impressions because most of the advertisers (> 95%) are billed by impression.

Event Based Revenue EBR

  • EBR = sum over all conversions (paced bid * PVR)
  • EBR = sum over all conversions (paced bid * ecpm price / ecpm bid)
  • EBR = sum over all conversions (ecpm price / ECVR)
  • Difference between EBR vs BBR [4]: note that BBR is summing over all impressions while EBR is summing over all conversions. Since most advertisers are billed by impressions, BBR is closer to real revenue (see “which is close to our actual business operation in practice” in [4]) while EBR is closer to what in expectation should be charged. As [4] shows, EBR and BBR are the same when ECVR is well calibrated.

Estimated Ads Value = sum over all impressions (ecpm bid)

Realized Ads Value = sum over all conversions (paced bid)

Ads Score = Realized Ads Value + Quality Score

User Dollar Value (UDV) = P70(Aggregate of top eCPM bids of user’s last week’s requests)   [2]

Quality bid is also known as organic bid. It contains several components, such as Report Bid, Xout Bid, and Repetition Penalty Bid [3]. [14] has a more detailed walkthrough of how each component of the quality bid is computed. Let’s use one example, the XOUT-Repetitive Bid.

xout_repetitive_bid = -udv * xout_normalization_factor * xout_repetitive_prediction, where xout_normalization_factor = 1 / avg(user's estimated xout rate on ads impressions)

When we report Ads Score, we need to compute the quality score. The Xout-repetitive quality score differs from xout_repetitive_bid only in whether the last term is the prediction or the realized value:

Xout-repetitive quality score = -udv * xout_normalization_factor * 1(user hides the ad with reason repetition)

 

Auction process

When a page request comes in, the backend system needs to assemble a number of organic contents and ads. For ads, each page request involves only one set of ads (or, equivalently, one set of advertisers, assuming each ad comes from a distinct advertiser). There are two ways to fill each slot with one of the ads:

  1. Slot-first greedy. Traverse the slots in order. For each slot, pick the ad with the highest total bid. Note that in the total bid, ecvr = p(conversion | impression) * position_discount.
  2. Bidder-slot pair greedy. For the same set of ads, compute their total bids for all possible slots. Because position_discount may not decrease monotonically with the slot, the highest total bid may fill a slot other than the next smallest-index slot. Whenever we fill a slot, we remove the total bids of the chosen advertiser as well as the other advertisers’ total bids on that slot. We then continue until all slots are filled (see the sketch below).
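
A toy sketch contrasting the two filling strategies; the bids and the non-monotonic position discount are fabricated so the two strategies disagree.

```python
def slot_first_greedy(total_bid, num_slots):
    """For each slot in order, pick the remaining ad with the highest total bid."""
    remaining, assignment = set(total_bid), {}
    for slot in range(num_slots):
        best = max(remaining, key=lambda ad: total_bid[ad][slot])
        assignment[slot] = best
        remaining.remove(best)
    return assignment

def bidder_slot_pair_greedy(total_bid, num_slots):
    """Repeatedly pick the globally best remaining (ad, slot) pair."""
    remaining_ads, remaining_slots, assignment = set(total_bid), set(range(num_slots)), {}
    while remaining_slots and remaining_ads:
        ad, slot = max(
            ((a, s) for a in remaining_ads for s in remaining_slots),
            key=lambda pair: total_bid[pair[0]][pair[1]],
        )
        assignment[slot] = ad
        remaining_ads.remove(ad)    # drop this advertiser's other bids
        remaining_slots.remove(slot)  # drop other advertisers' bids on this slot
    return assignment

# total_bid[ad] = total bid per slot (position discount already folded in);
# ad "y" prefers slot 1 because of a non-monotonic discount, so the strategies differ.
total_bid = {"x": [1.0, 0.9], "y": [1.2, 1.5]}
print(slot_first_greedy(total_bid, 2))        # {0: 'y', 1: 'x'}
print(bidder_slot_pair_greedy(total_bid, 2))  # {1: 'y', 0: 'x'}
```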

 

Inline vs. non-inline conversion

Inline: specific actions performed on ad units

non-inline: actions performed outside ad units

https://www.internalfb.com/intern/wiki/Ad-event-types/overview/

there is also a good answer here.

You can check real-time ads traffic of different ads types here.

 

l0, l1, l2, l3, l4, l5 meaning

l1 = ads, l2=adset, l3=campaign, l4=ad account, l5=business. See [10,11]

 

What is dilution / backtest/ default v0 / explicit v0?

My colleague explained to me how dilution arises in the context of ads but I don’t have an answer for default/explicit v0:

At any given time, there are many QRT universes running for different testing purposes, so one user can belong to multiple QRT universes at the same time. It is possible that multiple QRT universes test changes on the same feature / model type. The question is then which universe’s treatment a user receives if they happen to belong to multiple QRTs testing the same feature / model type. In reality, there is always some deterministic priority ordering among QRT universes, so even if a user belongs to universe A, they may always receive the treatment of universe B. In this case, universe B dilutes the results of universe A. Readings of universe A should be amplified by (1 – dilution factor). In practice, a dilution factor > 20% is considered too high and the readings are treated as invalid.

I have seen dilution defined in contexts outside of ads [12]. In that case, suppose there are X users allocated to the test group but only Z% of them actually saw the test feature; any metrics read from the test group then only represent X * Z% users.

 

Ads optimization goal vs. conv type vs. AdsEventTrait

From [17]: AdOptimizationGoal is customer-facing – when an advertiser creates an AdSet (L2), they pick an optimization method. ConversionType is an internal field used to determine the correct models to use. Non-UOF systems use ConversionType directly. Systems already migrated to UOF use AdEventTraits (AET), which is computed from ConversionType, AdOptimizationGoal, and other things. In a nutshell, AET is a newer concept similar to ConversionType, but it captures more information, like attribution information. In the old system, ads could only be optimized using a single event (ConversionType). With UOF, ads can be optimized using multiple events (AETs). Impressions and clicks are the actual events.

 

Unified Optimization Framework [16]

The Unified Optimization Framework (UOF) provides a consistent and principled development flow for ad product / quality bids and covers most of the current end-to-end development process; see [16] for details.

 

 

Generate ad model baselines

At big leading companies, ad models are iterated through a well-defined pipeline. In each iteration, a strong baseline is generated as the reference point for new technical proposals; baselines need to be built rigorously so they can be used to judge other proposals. How do we build baselines? We divide the process into three steps [18]:

  • B1 [Pure Refresh]: SmartClone the v0 job but only remove hard-deprecated features; it’s the closest trainable baseline to v0. Because v0 uses different training dates / settings, we use this job as a representation of v0 to compare training metrics against.
  • B2 [Launchable Refresh]: SmartClone the v0 job and remove/replace all features that block the launch; it’s the launchable baseline that we can directly use for production. This job usually is the fallback and lower bound of our baseline jobs; only when other jobs are better than this B2 job will they be picked as final baseline job.
  • B3 [Feature Refresh]: AMR Clone the v0 job and replace the features with new feature importance results using AMR; it’s the baseline that contains new features. We use this job to harvest the gain of new features / feature importance techniques.

 

Calibration [19]

There are other variations of NE (normalized entropy) in use, such as Coarse NE (an S/L-resilient offline metric) and calibration-free NE (a metric resilient to the frequent high eval-NE variance caused by miscalibration); see [19] for more.

Currently, we adopt model-level calibration (MLC) instead of ads-event-level calibration [20]. [20] also mentions that MBC will be deprecated completely once MLC is fully online. The differences between MLC and MBC are described in [21].

 

References

[1] Ads metrics min delivery camp https://docs.google.com/presentation/d/19spfxiPCQ-MG1yn1ufhNdC4EOWgb7DpAKxICvMXL1cA/edit?usp=sharing 

[2] UDV Wiki: https://fburl.com/wiki/ryaxbcet 

[3] Organic bid wiki: https://www.internalfb.com/intern/wiki/FAM_Quality/Concepts/Organic_bids/

[4] BBR vs. EBR: https://www.internalfb.com/intern/wiki/BCR_Delivery/How_is_BBR_and_EBR_different/

[5] Ads Score Primer:  https://www.internalfb.com/intern/wiki/Ads/Delivery/AdsRanking/Ads_Core_ML_Ranking_Overview/Ad_Score_Explained_Primer/

[6] Auction / bidding https://docs.google.com/presentation/d/1B843o2SE0ERXBWmo7HLtodBCanHRY4OEJFYhvRl0Prc/edit#slide=id.g9fd629c270_1_192

[7] Bidder Slot Pair Greedy Example: https://docs.google.com/spreadsheets/d/1FbIT1I3buy3x36S7CDR8zlTo85hlSdaTz-MnMn4bPUE/edit#gid=0

[8] Ad event types: https://www.internalfb.com/intern/wiki/Ad-event-types/

[9] Ads Score wiki: https://www.internalfb.com/intern/wiki/Ads/Delivery/AdsRanking/Ranking/Metrics/AdsScore/

 
[11] The Difference Between Facebook Ads Campaigns, Ad-Sets & Ads: https://www.k6agency.com/facebook-campaign-adset-ad/
[12] Dilution in Quick Experiment: https://www.internalfb.com/intern/wiki/Qe2/Dilution/ 

Tools needed to facilitate long-term value optimization

[This post is inspired by one of my previous posts [4], which organized my thoughts on how to build a validation tool for RL problems. Here, I am trying to organize my thoughts on building tools to facilitate long-term value optimization.]

There is one important topic in applied RL: long-term user value optimization. Usually, this means we want to recommend items that can trigger users’ long-term engagement. However, there are two big hurdles: (1) is the long-term effect measurable/learnable? (2) how do we evaluate an algorithm offline?

Measure long-term effect

My current thought is to train a powerful sequence model that predicts a user’s immediate engagement from their historical consumption sequence. With this model, we can mask out a single item in the input sequence and observe how much the prediction of the current item’s engagement changes due to the mask-out. If there is indeed a long-term effect, the change in the current item’s engagement prediction should be statistically significant. I wrote a piece of pseudo-code for realizing this idea in ads long-term value optimization; a compressed sketch of the idea follows.
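
This sketch assumes a hypothetical trained model `predict_engagement(seq)` that scores the engagement of the last item in a sequence; it illustrates the masking idea rather than reproducing the original pseudo-code.

```python
import numpy as np

def long_term_effect(predict_engagement, user_seq, mask_token="[MASK]", pos=-10):
    """Estimate how much the item at position `pos` influences the predicted
    engagement of the current (last) item, by masking it out and measuring
    the shift in the prediction.

    `predict_engagement` is a hypothetical trained sequence model: it takes a
    list of historical items and returns the predicted engagement of the last one.
    """
    base = predict_engagement(user_seq)
    masked_seq = list(user_seq)
    masked_seq[pos] = mask_token
    return base - predict_engagement(masked_seq)

# Stand-in model for illustration only; a real model would be the trained
# sequence model described above.
dummy_model = lambda seq: float(np.random.rand())

# Across many users, run a paired test (e.g., a t-test) on these deltas; a
# statistically significant shift is evidence of a long-term effect.
deltas = [long_term_effect(dummy_model, ["item"] * 20) for _ in range(100)]
print(np.mean(deltas))
```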

 

Google/YouTube recently published an interesting data science project on identifying proxy mid-term metrics to optimize for long-term metrics [6]. The motivation is that long-term metrics are usually sparse and noisy, so it can be more robust to optimize for mid-term metrics. The paper analyzes a large number of users’ 20-week video watch sequences. For each user, they divide the 20 weeks into 10 time buckets of 2 weeks and compute several diversity and consumption metrics per bucket; each user thus has a sequence of metrics, one set per 2-week bucket. They then train a classifier to predict whether low-activeness users will progress to power users by the end of the 20 weeks. The input features to the classifier are the metrics in the 2nd time bucket and the difference in metric values between the 1st and 2nd time buckets. They use a random forest as the model class, which allows easy feature importance computation after training. Metrics that are consistently found to be important across different user cohorts are considered good proxy reward functions.

 

Evaluate an algorithm offline

Now assume we already have an LTV model that can re-rank items differently from a supervised learning model; how do we evaluate it so that we know whether the ultimate metric we care about will improve? Although importance sampling-based approaches exist, I think the most feasible approach is still model-based, because of the complexity of real-world problems. Long-term value optimization often happens in the context of slate ranking, so we first need to decide how to simulate slates once we apply LTV re-ranking. We have two options:

1. Simulate a series of slates a user could watch. My current thought is to use a contrastive learning or generative adversarial idea: when we extract a pool of possible items that could be placed into the next simulated slate, certain items are more likely to belong to the slate than others. We can train a model to differentiate items from real slate data from items that are randomly sampled; the model can then be used to simulate (sequences of) slates.

update 09/01/2022:

There is another paper [7], from Meta, that can inspire how to simulate slates. They represent each user by their sequence of past consumed posts. Using transformers, they obtain a user embedding at each time step t when the user consumes a post. Note that when the transformers process the post sequences, they use pre-trained post embeddings. The training target is that the user embedding at time t should have high cosine similarity with the post embedding at time t+1 while having low cosine similarity with posts at other time steps. We can adapt this loss function for slate generation such that the user embedding at time t has high cosine similarity with any post in the next logged slate while having low cosine similarity with posts in other slates and from other users.
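
Here is a sketch of the in-batch contrastive objective described above; the temperature value and the use of in-batch rows as negatives are assumptions on my part rather than details confirmed from [7].

```python
import torch
import torch.nn.functional as F

def next_post_contrastive_loss(user_embs, next_post_embs, temperature=0.1):
    """user_embs[i] is the user embedding at time t for example i; next_post_embs[i]
    is the embedding of the post actually consumed at t+1. Other rows in the batch
    serve as negatives. For slate simulation, next_post_embs could instead be any
    post from the next logged slate.
    """
    u = F.normalize(user_embs, dim=-1)
    p = F.normalize(next_post_embs, dim=-1)
    logits = (u @ p.T) / temperature            # pairwise cosine similarities
    targets = torch.arange(u.size(0))           # the positive is the matching row
    return F.cross_entropy(logits, targets)

loss = next_post_contrastive_loss(torch.randn(32, 64), torch.randn(32, 64))
print(float(loss))
```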

2. Reuse logged slates and only simulate the re-ranking order within them. This approach implies a big assumption: LTV re-ranking does not affect how slates are generated. The assumption is not totally unreasonable for offline evaluation purposes; after all, simulating how slates are generated as in option 1 is really a difficult problem.

Once we can simulate slates (or reuse logged slates), we can train another sequence model to simulate how much the LTV model lifts the metric we care about.

Of course, for both ideas (“measure long-term effect” and “evaluate an algorithm offline”), we can apply uncertainty estimation to make results more reliable. I discussed how to conduct uncertainty estimation for deep neural networks in [4], and this post [3] is also helpful.

I also stumbled upon another paper on counterfactual evaluation [5]. It is applied to a different but related problem: whether an ad impression is causal for a later ad conversion. We train a sequential model with an autoregressive decoder. Then, we can simulate user behavior after the ad impression (call it sequence w1) and without the ad impression (call it sequence w2). Suppose we simulate a fixed number of steps in w1 and w2, ending at ad embeddings e1 and e2 respectively. Then we compare the distances d(e1, e_logged) and d(e2, e_logged), where e_logged is the embedding of the converted ad. If, on average, d(e1, e_logged) < d(e2, e_logged), that is a signal that the ad impression is causal for the conversion.

 

References

[1] Positive Unlabeled learning with non-negative risk estimator

[2] Pure: positive unlabeled recommendation with generative adversarial network

[3] Deep ensemble: https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/

[4] https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[5] Shopping in the Multiverse- A Counterfactual Approach to In-Session Attribution: https://arxiv.org/abs/2007.10087

[6] Surrogate for Long-Term User Experience in Recommender Systems: https://dl.acm.org/doi/abs/10.1145/3534678.3539073

[7] NxtPost: User to Post Recommendations in Facebook Groups: https://arxiv.org/pdf/2202.03645.pdf

 

Some SOTA Model-based RL

Model-based RL has always intrigued me more than model-free RL, because the former converts RL problems into supervised learning problems that can employ SOTA deep learning techniques. In this post, I introduce several of the latest developments in model-based RL, categorized into planning-based and non-planning-based methods.

Planning

The first one is a paper I am reviewing, so I will refer to it only by its short name, MOPP. This model-based planning algorithm contains three components: (1) guided trajectory rollout; (2) trajectory pruning; (3) trajectory optimization:

  • In (1), we first train an ensemble of behavior cloning models f_b^l(s_t) for l = 1, \cdots, K_2. Each behavior cloning model outputs the action as a normal distribution with a predicted mean and standard deviation. Each time we want to sample an action, we first randomly pick a behavior cloning model and then sample the action multiple times from the predicted mean and standard deviation, with two adaptations. First, the standard deviation is scaled up so that action sampling is more diverse and aggressive. Second, to counter sampling bad actions, we learn a Q-function from the logged data and use it to select the final action to continue the trajectory simulation. Although the Q-function estimates the Q-value of the behavior policy rather than the policy being learned, the authors argue that it “provides a conservative but relatively reliable long-term prior information”. The two points are manifested in Eqn. 3 of the paper.
  • In (2), we prune trajectories whose overall uncertainty exceeds a threshold L. The uncertainty of a trajectory is the sum of each step’s uncertainty, defined as disc(s,a)=\max_{i,j} \| f_m^i(s,a) - f_m^j(s,a)\|, i,j\in 1,\ldots,K, the largest disagreement between any two dynamics models in the ensemble.
  • In (3), the final action to perform after planning is a weighted combination of actions, with weights coming from the accumulated predicted returns of the trajectories remaining after pruning in (2) (see the sketch after this list). Note one detail at line 14: if we simulate H steps, we get H predicted rewards, but we won’t know the accumulated return after H steps; so the learned Q-function and the behavior cloning model are used to estimate the value beyond the H steps.
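
Here is a sketch of the return-weighted action aggregation, written in the exponential-weighting style MBOP uses (I believe MOPP's final-action formula is similar, but treat this as an approximation rather than the paper's exact equation).

```python
import numpy as np

def weighted_first_action(first_actions, returns, kappa=1.0):
    """Combine the first actions of the surviving trajectories, weighted by the
    exponentiated predicted return of each trajectory; kappa is a temperature
    hyperparameter controlling how greedy the aggregation is.
    """
    weights = np.exp(kappa * (returns - np.max(returns)))   # subtract max for stability
    return (weights[:, None] * first_actions).sum(axis=0) / weights.sum()

actions = np.random.randn(50, 4)       # first action of each of 50 surviving rollouts
returns = np.random.randn(50)          # predicted return of each rollout
print(weighted_first_action(actions, returns))
```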

The MOPP paper compares against other existing algorithms. The high-level takeaway is that MOPP is strong when the dataset (medium and med-expert) is less diverse, and hence harder for a vanilla supervised learning model to generalize from.

The MOPP paper actually motivated me to review other SOTA model-based methods. As the paper shows, MOPP and MBOP belong to model-based offline planning methods, which need a planning mechanism, while model-based offline RL methods such as MBPO and MOPO don’t require planning. I’ll introduce MBOP first as another model-based planning algorithm and then move to non-planning methods.

MBOP [3] is essentially a simplified version of MOPP. Compared to MOPP, MBOP:

  1. has no value estimate after the H-step rollouts;
  2. does not scale the standard deviation, so sampled actions mostly stay within the behavior policy’s standard deviation;
  3. has no uncertainty-based trajectory pruning.

Non-Planning

Non-planning algorithms use learned dynamics models in tandem with standard policy optimization methods (like TD3, SAC, etc.). One example is MOPO [1], which adds an uncertainty estimate as a penalty to the reward function and then uses a model-free algorithm to learn a policy on the penalized MDP.

Concretely, the reward function is penalized by \lambda \max_{i=1}^N \| \Sigma^i (s_j, a_j)\|_F, the maximum standard deviation of the learned models in the ensemble. As the authors argue, “empirically the learned variance often captures both aleatoric and epistemic uncertainty, even for learning deterministic functions (where only epistemic uncertainty exists). We use the maximum of the learned variance in the ensemble to be more conservative and robust.”
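
A sketch of that reward adjustment, assuming each dynamics model in the ensemble outputs a predicted covariance/standard-deviation matrix \Sigma^i(s, a) for its Gaussian next-state prediction.

```python
import numpy as np

def penalized_reward(reward, ensemble_sigmas, lam=1.0):
    """MOPO-style reward penalty: subtract lambda times the largest Frobenius
    norm of the predicted Sigma across the ensemble, a proxy for model
    uncertainty at this state-action. `lam` is a tunable hyperparameter.
    """
    penalty = max(np.linalg.norm(sigma, ord="fro") for sigma in ensemble_sigmas)
    return reward - lam * penalty

sigmas = [np.eye(3) * s for s in (0.1, 0.5, 0.2)]   # fake per-model predictions
print(penalized_reward(1.0, sigmas, lam=0.5))
```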

Another work that uses a model-based method to adjust the learning problem is MOReL [4]. It constructs a pessimistic MDP on which standard policy optimization approaches can be applied.

MBPO [2] learns a world model and uses it to generate k-step rollouts; policy optimization then trains on these generated rollouts.

 

 

References

[1] MOPO: Model-based Offline Policy Optimization: https://arxiv.org/abs/2005.13239

[2] When to Trust Your Model: Model-Based Policy Optimization: https://openreview.net/pdf?id=BJg8cHBxUS

[3] Model-Based Offline Planning: https://arxiv.org/abs/2008.05556

[4] MOReL: Model-Based Offline Reinforcement Learning: https://arxiv.org/abs/2005.05951