czxttkl – Page 3

How does Metropolis-Hastings algorithm work?

I learned about Markov Chain Monte Carlo (MCMC) algorithm a little bit during my phd but I did not record my thoughts back then. In this post, I revisit the core concept of MCMC, particularly focusing on illustrating the Metropolis-Hastings (MH) algorithm.

What is the motivation of MCMC?

Suppose you have observed some data $x$ . You would like to express the distribution of $x$ as some distribution which is parameterized by $\theta$ . You have some prior $p(\theta)$ . You would like to know the distribution of the posterior distribution $p(\theta|x)$ . Based on Bayes’ theorem,

$p(\theta|x)=\frac{p(\theta) p(x|\theta)}{p(x)}=\frac{p(\theta) p(x|\theta)}{\int_\theta p(\theta) p(x|\theta) d\theta}$

The denominator $\int_\theta p(\theta) p(x|\theta) d\theta$ is often hard to compute. Therefore, it is hard to know the exact statistics of the posterior distribution, such as the mean of the posterior distribution. MCMC has some mechanism to allow you sample points on the space of $\theta$ such that when a sufficient amount of points of $\theta$ is sampled, they approximate the true posterior distribution $p(\theta|x)$ hence you will be able to compute the empirical statistics such as the empirical mean.

Let’s use one concrete example (taken reference from [2]). You are a teacher who wants to know the score distribution of your students. You assume the score distribution is Normal distribution $\mathcal{N}(\mu, \sigma^2)$ while $\sigma^2=4^2=16$ is a fixed variance. According to basic conjugate prior mathematics [4], you have a prior distribution $p(\mu|\mu_0, \sigma_0^2) = \mathcal{N}(\mu_0, \sigma_0^2)$ which is also a Normal distribution. You have observed 3 students’ scores therefore you can express the likelihood function as: $p(x_1, x_2, x_3|\mu, \sigma^2) \propto \frac{1}{\sigma^n} exp\left( -\frac{1}{2\sigma^2}\sum\limits_{i=1}^3 (x_i - \mu)^2 \right)$ . In this simple example, there is an analytical expression for the posterior distribution $p(\mu| x_1, x_2, x_3)$ which is also a Normal distribution. But for our pedagogical purpose, let’s assume the posterior distribution $p(\mu|x_1, x_2, x_3)$ is impossible to compute and we have to use MCMC to infer posterior statistics such as the posterior mean.

How MCMC works? (specifically MH algorithm)

We focus on the MH algorithm, which basically has a loop of sampling a number of $\theta$ for estimating the posterior distribution. The following steps are adapted from [2]:

Begin with a plausible starting value; Let’s say you guess $\mu=110$ at first.
Generate a new proposal by taking the last sample (110) and adding some random noise. This random noise is generated from a proposal distribution, which should be symmetric and centered on zero. In our example we will use a proposal distribution that is normal with zero mean and standard deviation of 5. This means the new proposal is 110 (the last sample) plus a random sample from N(0,5). Suppose this results in a proposal of 108.
Note that, if the proposal distribution is not symmetric, the following sample acceptance criteria (step 4&5) will need to be adjusted. See how $g(\cdot)$ is defined in [1].
Compare the likelihood value of the new proposal (i.e., 108) against the likelihood value of the posterior at the most recent sample (i.e., 110). In other words, you will compare $p(\mu=108|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=108, \sigma^2=16)$ and $p(\mu=110|\mu_0, \sigma_0^2) \cdot p(x_1, x_2, x_3|\mu=110, \sigma^2=16)$ .
Note: in many other online tutorials such as [2] and [5], people also interchangeably say that we compare the posterior value of the new proposal and the most recent sample. Since the likelihood value has only a multiplicative constant difference than the posterior, comparing the likelihood value is equivalent to comparing the posterior.
If the new proposal has a higher likelihood value than the most recent sample, then accept the new proposal.
If the new proposal has a lower likelihood value than the most recent sample, then randomly choose to accept or reject the new proposal, with a probability equal to the ratio of the two likelihood values. For example, if the likelihood value at the new proposal value is one-fifth as high as the likelihood of the most recent sample, then accept the new proposal with 20% probability.
If the new proposal is accepted, it becomes the next sample in the MCMC chain, otherwise the next sample in the MCMC chain is just a copy of the most recent sample.
This completes one iteration of MCMC. The next iteration is completed by returning to step 2.
Stop when there are enough samples (e.g., 500). Deciding when one has enough samples is a separate issue. Now you can compute any empirical statistics as you like, such as the empirical posterior mean.

The above 8 steps illustrates the core formula of MH, the acceptance criterion function [1, 3]:

$\alpha(\theta^{new}, \theta^{old})=min \left(1, \frac{p(\theta^{old}|x)g(\theta^{new}|\theta^{old})}{p(\theta^{new}|x) g(\theta^{old}|\theta^{new})}\right)$

References:

[1] https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm

[2] https://link.springer.com/article/10.3758/s13423-016-1015-8

[3] https://stats.stackexchange.com/a/497060/80635

[4] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[5] https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50 (the most liked comment is very useful for understanding)

Terms you need to know when working for Ads

There are several terms often used in the Ads ML domain. Let’s dive into them.

total bid = ecpm bid + quality bid

ecpm bid = paced bid * ecvr

ecvr is “expected conversion rate”. Different ads can have different definitions of conversion [8] – e.g., link click, app install, or offsite conversion. ecvr = p(conversion | impression) * position_discount
paced bid = bid by advertiser * pacer, where pacer controls how fast the advertiser’s budget is spent

advertiser value is the same thing as ecpm bid [15]

position_discount is a function of position, page type, and ad’s conversion type. Ads in the same request have the same page type. Note that position_discount may not be monotonically decreasing with the slot increasing.

ecpm price = 2nd ecpm bid + 2nd quality bid – 1st quality bid (we charge the runner-up’s ecpm bid plus the difference between quality bids). My interpretation of ecpm price is what we, the platform like Meta, actually charge, compared to ecpm bid, which is what advertisers bid.

Price-to-value ratio (PVR) = ecpm price / ecpm bid

Billing Based Revenue BBR = sum over all impressions (ecpm price)

Note: the summation is over all impressions because most of the advertisers (> 95%) are billed by impression.

Event Based Revenue EBR

EBR = sum over all conversions (paced bid * PVR)
EBR = sum over all conversions (paced bid * ecpm price / ecpm bid)
EBR = sum over all conversions (ecpm price / ECVR)
Difference between EBR vs BBR [4]: note that BBR is summing over all impressions while EBR is summing over all conversions. Since most advertisers are billed by impressions, BBR is closer to real revenue (see “which is close to our actual business operation in practice” in [4]) while EBR is closer to what in expectation should be charged. As [4] shows, EBR and BBR are the same when ECVR is well calibrated.

Estimated Ads Value = sum over all impressions (ecpm bid)

Realized Ads Value = sum over all conversions (paced bid)

Ads Score = Realized Ads Value + Quality Score

User Dollar Value (UDV) = P70(Aggregate of top eCPM bids of user’s last week’s requests) [2]

Quality bid is also known as organic bid. It contains several components like Report Bid, Xout Bid, Repetition Penalty bid, etc. [3]. [14] has a more detailed walkthrough of how each component is computed in quality bid. Let’s use one example, XOUT-Repetitive Bid.

xout_repetitive_bid = -udv * xout_normalization_factor * xout_repetitive_prediction<br />, where xout_normalization_factor = 1 / avg(user's estimated xout rate on ads impressions)

When we report Ads Score, we need to compute quality score. Xout-repetitive quality score is different than xout_repetitive_bid in whether the last term is prediction or realized value.

Xout-repetitive quality score = -udv * xout_normalization_factor * 1(user hides the ad with reason repetition)

Auction process

When a page request comes, the backend system needs to assemble a number of organic contents and ads. For ads, each page request only involves one set of ads (or, one set of advertisers, assuming each ad comes from a distinct advertiser). There are two ways to fill each slot with one of the ads:

slot first greedy. Traverse each slot. For each slot, pick the ad with the highest total bid. Note that in the total bid, ecvr = p(conversion | impression) * position_discount
bidder-slot pair greedy. For the same set of ads, compute their total bids on all possible slots. Because position_discount may not monotonically decrease with the slot, the highest total bid may fill a slot not on the next slot with the smallest index. Whenever we fill a slot, we remove the computed total bids belonging to the associated advertiser and other advertisers’ total bids on the slot. We then continue to fill the next slot until all slots are filled.

Inline vs. non-inline conversion

Inline: specific actions performed on ad units

non-inline: actions performed outside ad units

https://www.internalfb.com/intern/wiki/Ad-event-types/overview/

there is also a good answer here.

You can check real-time ads traffic of different ads types here.

l0, l1, l2, l3, l4, l5 meaning

l1 = ads, l2=adset, l3=campaign, l4=ad account, l5=business. See [10,11]

What is dilution / backtest/ default v0 / explicit v0?

My colleague explained to me how dilution arises in the context of ads but I don’t have an answer for default/explicit v0:

At any given time, there are many QRT universes going on for different testing purposes. As a result, one user can belong to multiple QRT universes at the same time. It is possible that multiple QRT universes will test changes on the same feature / model type. Then the question is which QRT universe’s treatment a user will receive if he or she happens to belong multiple QRTs that are testing the same feature / model type. In reality, there is always some deterministic priority ordering among QRT universes. So even one user belongs to universe A, he or she may always receive the treatment of universe B. In this case, universe B dilutes the results of universe A. Readings of universe A should be amplified by (1 – dilute factor). In practice, dilute factor > 20% is too high that we consider readings invalid.

I have seen dilution defined in the context out of ads [12]. In that case, suppose there are X users allocated to the test group but only Z% of them actually saw the test feature. So any metrics read from the test group only represent X * Z% users.

Ads optimization goal vs. conv type vs. AdsEventTrait

From [17]: AdOptimizationGoal is customer facing – when our advertiser creates an AdSet (L2), they pick an optimization method. ConversionType is an internal field, which is used to determine the correct models to use. Non-UOF systems use ConversionType directly. Those already migrated to UOF use AdEventTraits (AET), which is computed from ConversionType, AdOptimizationGoal and other things. In a nutshell, AET is a newer concept similar to ConversionType, but it captures a lot more information like the attribution information. In the old system, Ads can only be optimized using a single event (ConversionType). With UOF, Ads can be optimized using multiple events (AETs). Impressions and clicks are the actual events.

Unified Optimization Framework [16]

Unified Optimization Framework (UOF) provides the consistent and principled development flow for ad product / quality bid. It covers most parts of the current end-to-end development process. If you want to know anything about UOF, this is the right place for you.

Ads optimization goal vs. conv type vs. AdsEventTrait

Generate ad model baselines

At big leading companies, ad models are iterated in a designed streamline. In each iteration, a strong baseline is generated as the reference point for new technical proposals. Baselines need to be rigorously built to be used to judge other proposals. How do we build baselines? We divide the process into three steps [18]:

B1 [Pure Refresh]: SmartClone the v0 job but only remove hard-deprecated features; it’s the closest trainable baseline to v0. Because v0 uses different training dates / settings, we use this job as a representation of v0 to compare training metrics against.
B2 [Launchable Refresh]: SmartClone the v0 job and remove/replace all features that block the launch; it’s the launchable baseline that we can directly use for production. This job usually is the fallback and lower bound of our baseline jobs; only when other jobs are better than this B2 job will they be picked as final baseline job.
B3 [Feature Refresh]: AMR Clone the v0 job and replace the features with new feature importance results using AMR; it’s the baseline that contains new features. We use this job to harvest the gain of new features / feature importance techniques.

Calibration [19]

There are other variations of NE that are also used such as Coarse NE (a S/L resilient offline metric) and cali-free NE (metric which is resilient to the frequent high eval NE variance due to mis calibration). (See more introduction in [19]).

Currently, we adopt model-level calibration (MLC) instead of adsevent level calibration [20]. In [20], they also mentioned that MBC will be deprecated completely once MLC is fully online. The difference between MLC and MBC is listed below [21]:

References

[1] Ads metrics min delivery camp https://docs.google.com/presentation/d/19spfxiPCQ-MG1yn1ufhNdC4EOWgb7DpAKxICvMXL1cA/edit?usp=sharing

[2] UDV Wiki: https://fburl.com/wiki/ryaxbcet

[3] Organic bid wiki: https://www.internalfb.com/intern/wiki/FAM_Quality/Concepts/Organic_bids/

[4] BBR vs. EBR: https://www.internalfb.com/intern/wiki/BCR_Delivery/How_is_BBR_and_EBR_different/

[5] Ads Score Primer: https://www.internalfb.com/intern/wiki/Ads/Delivery/AdsRanking/Ads_Core_ML_Ranking_Overview/Ad_Score_Explained_Primer/

[6] Auction / bidding https://docs.google.com/presentation/d/1B843o2SE0ERXBWmo7HLtodBCanHRY4OEJFYhvRl0Prc/edit#slide=id.g9fd629c270_1_192

[7] Bidder Slot Pair Greedy Example: https://docs.google.com/spreadsheets/d/1FbIT1I3buy3x36S7CDR8zlTo85hlSdaTz-MnMn4bPUE/edit#gid=0

[8] Ad event types: https://www.internalfb.com/intern/wiki/Ad-event-types/

[9] Ads Score wiki: https://www.internalfb.com/intern/wiki/Ads/Delivery/AdsRanking/Ranking/Metrics/AdsScore/

[10] Ads Data Model Concept: https://www.internalfb.com/intern/wiki/Ads-data-model/Onboarding/core-objects/

[11] The Difference Between Facebook Ads Campaigns, Ad-Sets & Ads: https://www.k6agency.com/facebook-campaign-adset-ad/

[12] Dilution in Quick Experiment: https://www.internalfb.com/intern/wiki/Qe2/Dilution/

[13] Backtest basics: https://www.internalfb.com/intern/wiki/FASTI_Experimentation_Best_Practices/Backtests/

[14] Anatomy of an Organic Bid: https://www.internalfb.com/intern/wiki/FAM_Quality/Archived/Guide-to-ads-organic-bid/anatomy-of-an-organic-bid/

[15] https://www.internalfb.com/intern/wiki/Lin_Playground/Ads_101/

[16] https://www.internalfb.com/intern/wiki/Unified_Optimization_Framework/

[17] https://fb.workplace.com/groups/145709299400760/posts/1005576500080698/?comment_id=1009972666307748

[18] https://www.internalfb.com/intern/wiki/AdsInfra/ML_Automation/Other_Resources/Deprecated/Auto_B3/Overal_Precedure/

[19] https://www.internalfb.com/intern/wiki/Ads/Delivery/DeliveryProducts/Segment_Experiences/App_Ads/App_Ads_Delivery/General_ML_Engineering/model_metrics/

Tools needed to facilitate long-term value optimization

[This post is inspired by one of my previous posts [4] which combs my thoughts on how to build a validation tool for RL problems. Now, I am trying to comb my thoughts for building tools to facilitate long-term value optimization.]

There is one important topic in applied RL: long-term user value optimization. Usually, this means we want to recommend items that can trigger users’ long-term engagement. However, there are two big hurdles: 1. is the long-term effect measurable/learnable? 2. how do we evaluate an algorithm offline?

Measure long-term effect

My current thought is to train a powerful sequence model which predicts a user immediate engagement from historical sequence consumptions. With this model, we can mask out a single item in the input sequence and observe how much variation of the current item’s engagement occurred due to the mask-out. If there indeed exists long-term effect, the variation of the current item’s engagement prediction should be statistically significant. I wrote a piece of pseudo-code for realizing this idea in ads long-term value optimization:

Google/Youtube recently proposed an interesting data science project for identifying proxy mid-term metrics to optimize for long-term metrics [6]. The motivation is that long-term metrics are usually sparse and noisy so it can be more robust to optimize for mid-term metrics. The paper analyzes a large amount of users’ 20-week video watch sequences. For each user, they divide the 20 weeks into 10 time buckets of 2 weeks. For each time bucket, they compute several diversity and consumption metrics. Therefore, each user will essentially have a sequence of different metrics computed for each 2-weeks time gap. Now, they want to train a classifier for predicting whether low-activeness users will progress to power users at the end of 20 weeks. The input features to the classifier are the metrics in the 2nd time bucket and the difference of metric values between the 1st and 2nd time buckets. They use random forest as the model class, which allows easy feature importance computation after training. Metrics that are commonly found in all different user cohorts are considered good reward functions.

Evaluate an algorithm offline

Now that assume we already have a LTV model which can re-rank items differently from a supervised-learning model, how do we evaluate it such that we know whether the ultimate metric we care about can be improved? Although there exist importance sampling-based approaches, I think the most feasible approach is still model-based because of the complex of real-world problems. Long-term value optimization often happens in the context of slate ranking. We first need to decide how we want to simulate slates once we apply LTV re-ranking. We can have two options:

1. simulate a series of slates a user could watch. My current thought is to utilize some contrastive learning or generative adversarial idea: when we extract a pool of possible items that could be placed into the next simulated slate, certain items are more likely to belong to the slate than others. We can train a model to differentiate items from real slate data and items that are randomly sampled. The model can then be used to simulate (sequences of) slates.

update 09/01/2022:

There is another paper [7] which can inspire how to simulate slates. This paper is from Meta. The architecture is best summarized in the diagram below. They represent each user by their past consumed post sequences. Using Transformers, they can obtain a user embedding at each time step t when the user consumes a post. Note that, when Transformers process the post sequences, they use pre-trained post embeddings. The optimization target is that the user embedding at time t should have high cosine similarity with the post embedding at time t+1 while having low cosine similarity with posts at other time steps. We can adapt this loss function for slate generation such that user embedding at time t should have high cosine similarity with any post in the next logged slate while having low cosine similarity with posts in other slates and from other users.

2. reuse logged slates. Only simulate the re-ranking order within logged slates. This approach implies a big assumption: LTV reranking would not affect how slates are generated. The assumption is not totally unreasonable for offline evaluation purpose. After all, simulating how slates are generated as in option 1 is really a difficult problem.

Once we can simulate slates (or reuse logged slates), we can train another sequence model to simulate how the LTV model helps lift the metric we care about. See the screenshot below for my sketch of idea:

Of course, for both ideas (“measure long-term effect” and “evaluate an algorithm offline”), we can apply uncertainty estimation to make results more reliable. I mentioned how to conduct uncertainty estimation for deep neural networks in [4] and also this post [3] is helpful.

I also stumble upon another paper for counterfactual evaluation [5]. It is applied to a different but related problem: if an ad impression is causal for a later ad conversion. We train a sequential model with an autoregressive decoder. Then, we can simulate user behavior after the ad impression (we call it sequence w1) and before the ad impression (we call it sequence w2). Suppose we simulate a fixed number of steps in w1 and w2 and end at an ad embedding at e1 and e2 respectively. Then, we can compare the distance between d(e1, e_logged) and d(e2, e_logged), where e_logged is the embedding of the ad conversion. If, on average, d(e1, e_logged) < d(e2, e_logged), then that is a signal telling that the ad impression is causal for the conversion.

References

[1] Positive Unlabeled learning with non-negative risk estimator

[2] Pure: positive unlabeled recommendation with generative adversarial network

[3] Deep ensemble: https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/

[4] https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[5] Shopping in the Multiverse- A Counterfactual Approach to In-Session Attribution: https://arxiv.org/abs/2007.10087

[6] Surrogate for Long-Term User Experience in Recommender Systems: https://dl.acm.org/doi/abs/10.1145/3534678.3539073

[7] NxtPost: User to Post Recommendations in Facebook Groups: https://arxiv.org/pdf/2202.03645.pdf

Some SOTA Model-based RL

Model-based RL has always intrigued me more than model-free RL. Because the former converts RL problems into supervised learning problems which can always employ SOTA deep learning techniques. In this post, I am introducing several latest developments of model-based RL. I categorize them into planning and non planning-based.

Planning

This is one I am reviewing so I am only referencing it with a short name MOPP. The model-based planning algorithm works as follows:

From the paper, this algorithm contains three components: (1) guided trajectory rollout; (2) trajectory pruning; (3) trajectory optimization:

In (1), we first train an ensemble of behavior cloning models $f_b^l(s_t)$ for $l = 1, \cdots, K_2$ . Each behavior cloning model models the output action as a normal distribution with a predicted mean and standard deviation. Each time when we want to sample an action, we first randomly pick a behavior cloning model and then sample the action multiple times according to the predicted mean and standard deviation. However, there is some additional adaptation for how to sample from the predicted mean and standard deviation. First, the standard deviation is scaled larger so that action sampling can be more diverse and aggressive. Second, to counter sampling bad actions, we learn a Q-function from logging data and use it to finally select the action to proceed the trajectory simulation. Although the Q-function estimates the Q-value for the behavior policy instead of the policy being learned, the authors argue that it “provides a conservative but relatively reliable long-term prior information”. The two points are manifested in Eqn. 3:
In (2), we use a threshold to prune trajectories with overall uncertainty larger than that. The way to quantify uncertainty of a trajectory is to look at the sum of each step’s uncertainty defined as , the largest disagreement between any two dynamic models in the ensemble.
- In (3), the final action to perform after planning is some weighted action. The weights come from accumulated predicted returns from the remaining trajectories after pruning in (2):
  Note one detail at line 14: if we simulate $H$ steps, we can get $H$ predicted rewards. However, we won’t know what is accumulated return after $H$ steps. So we would use the learned Q-function and the behavior cloning model to estimate the value after the $H$ steps.

The MOPP paper has the following comparisons with other existing algorithms. The high-level takeaway is that MOPP is strong when the dataset (medium and med-expert) is less diverse (hence harder for a vanilla supervised learning model to generalize well).

The MOPP paper actually motivates me to review other SOTA model-based methods. As shown in the table, MOPP and MBOP belong to model-based offline planning methods which needs some planning mechanisms, while model-based offline RL methods include MBPO and MOPO which don’t require planning. I’ll introduce MBOP first as another model-based planning algorithm and then move to non-planning based methods.

MBOP [3] is actually a simplified version of MOPP. Compared to MOPP, MBOP lacks:

No value estimate after $H$ -step rollouts
No standard deviation scaling so sampled actions are mostly confined to the behavior policy’s standard deviation.
No uncertainty-based trajectory pruning.

Non-Planning

Non-planning algorithms rely on model-based algorithms in tandem with standard policy optimization methods (like TD3, SAC, etc). One example is called MOPO [1], which adds uncertainty estimate as a penalty to the reward function. Then, they still use model-free algorithms to learn a policy. MOPO works as follows:

As you can see, on line 9, reward function is adjusted by $\lambda max_{i=1}^N \| \Sigma^i (s_j, a_j)\|_F$ , the maximum standard deviation of the learned models in the ensemble. As the authors argue, “empirically the learned variance often captures both aleatoric and epistemic uncertainty, even for learning deterministic functions (where only epistemic uncertainty exists). We use the maximum of the learned variance in the ensemble to be more conservative and robust.”

Another work to use model-based method to adjust the reward function is MORel [4]. It constructs a pessimistic MDP on which standard policy optimization approaches can be applied.

MBPO [2] learns a world model and uses it to generate K-steps rollouts. Policy optimization then trains on these generated rollouts.

References

[1] MOPO: Model-based Offline Policy Optimization: https://arxiv.org/abs/2005.13239

[2] When to Trust Your Model: Model-Based Policy Optimization: https://openreview.net/pdf?id=BJg8cHBxUS

[3] Model-Based Offline Planning: https://arxiv.org/abs/2008.05556

[4] MOReL: Model-Based Offline Reinforcement Learning: https://arxiv.org/abs/2005.05951

Laplacian Approximation and Bayesian Logistic Regression

Recently, I am studying a basic contextual bandit algorithm called Logistic Regression Bandit, which is also known as Bayesian Logistic Regression. It all starts from the paper “An Empirical Evaluation of Thompson Sampling” [1]. In Algorithm 3 in the paper, the authors gives the pseudo-code for how to train the Logistic Regression Bandit, assuming its weights are from a Gaussian distribution with a diagonal covariance matrix:

When I first read it, my biggest confusion is Laplace approximation – why is it used in training a Bayesian logistic regression? Now, let’s dig into how Laplace approximation comes into play.

Let’s first review the set up of a Bayesian logistic regression. Suppose we have a feature space $\mathbf{x} \in \mathbb{R}^d$ . We have a dataset $\mathcal{D}$ which contains tuples of $(\mathbf{x}_i, y_i)$ for $i=1, \cdots, N$ . These are observed features and labels. If we apply a Bayesian logistic regression on such data, the logistic regression will have a weight vector of the same dimension as the features: $\mathbf{w} \in \mathbb{R}^d$ . The prediction function is a sigmoid function: $p(\mathbf{x}) = \left( 1+exp(-\mathbf{w}^T\mathbf{x})\right)^{-1}$ .

Because it is a Bayesian logistic regression, we need to estimate a posterior probability distribution for $\mathbf{w}$ given the observation $\mathcal{D}$ , $p(\mathbf{w}|\mathcal{D})$ . We assume that $\mathbf{w}$ come from a multi-dimensional Gaussian distribution with a diagonal covariance matrix. This is the same as what Algorithm 3’s notation implies: $w_i \sim \mathcal{N}(m_i, q_i^{-1})$ .

We can also get the likelihood function $p(\mathcal{D}|\mathbf{w})=\prod\limits_{j=1}^N p(\mathbf{x}_j)^{y_j} \left(1-p(\mathbf{x}_j)\right)^{1-y_j}$

Using the Bayesian Theorem, we can derive (the notations adapted from [2] (slides page 9~12) to the notations in this post):

$p(\mathbf{w}|\mathcal{D}) \propto p(\mathbf{w})p(\mathcal{D}|\mathbf{w})\newline\quad\qquad=\prod\limits_{i=1}^d \mathcal{N}(m_i, q_i^{-1}) \prod\limits_{j=1}^N p(\mathbf{x}_j)^{y_j} \left(1-p(\mathbf{x}_j)\right)^{1-y_j}$

Now, here is how Lapalace Approximation [3] comes into play. The goal of Lapalace Approximation is to learn a Gaussian distribution to approximate $p(\mathbf{w}|\mathcal{D})$ , which is not a Gaussian distribution in its exact expression. In other words, if $p(\mathbf{w}|\mathcal{D}) \approx \mathcal{N}(m_i', q_i'^{-1})$ , what would be $m_i'$ and $q_i'$ ?

First, we will take the logarithm on both sides, leading to:

$\log p(\mathbf{w}|\mathcal{D}) = -\left( \frac{1}{2}\sum\limits_{i=1}^d q_i (w_i - m_i)^2 + \sum\limits_{j=1}^n \log(1 + exp\left( -y_j \mathbf{w}^T \mathbf{x}_j\right)) \right) + const \newline= f(\mathbf{w}) + const$

Suppose $f(\mathbf{w})$ achieves maximum $\mathbf{w}_0$ , which is able to be found using gradient ascent or expressed in a closed formula (I think $f(\mathbf{w})$ is a concave function but not totally sure). Based on the 2nd-order Taylor expansion at point $\mathbf{w}_0$ , we have: $f(\mathbf{w}) \approx f(\mathbf{w}_0) + f'(\mathbf{w}_0)(\mathbf{w}-\mathbf{w}_0)+\frac{1}{2}f''(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0)^2$ . Because $f'(\mathbf{w}_0)=0$ , we can further simplify to $f(\mathbf{w}) \approx f(\mathbf{w}_0) +\frac{1}{2}f''(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0)^2$ .

$f(\mathbf{w}_0) +\frac{1}{2}f''(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0)^2$ can actually be seen as the logarithm of another Gaussian distribution $\mathcal{N}(m_i', q_i'^{-1})$ , whose mean is at $\mathbf{w}_0$ and variance as $-f''(\mathbf{w}_0)$ , and $f(\mathbf{w}_0)$ absorbs any remaining constants. This is why Algorithm 3 updates $m_i = w_i$ because $w_i$ is essentially from $\mathbf{w}_0$ that maximizes $f(\mathbf{w})$ , and $q_i=q_i + \sum\limits_{j=1}^n x^2_{ij}p_j(1-p_j)$ since this is essentially $f''(\mathbf{w}_0)$ .

To summarize, Laplace Approximation finds a new Gaussian distribution to approximate $p(\mathbf{w}|\mathcal{D})$ !

Reference

[1] An Empirical Evaluation of Thompson Sampling: https://papers.nips.cc/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc-Abstract.html

[2] https://cs.iupui.edu/~mdundar/CSCIMachineLearning/Lecture12.pdf

[3] https://bookdown.org/rdpeng/advstatcomp/laplace-approximation.html

[4] https://ufal.mff.cuni.cz/~jurcicek/NPFL108-BI-2014LS/04-approximate-inference-laplace-approximation.pdf

Markov Chain and Markov Decision Process on Graphs

It is a cool idea that we can formulate data of many problems as graphs. It is even cooler that we can improve graph-based algorithms with Reinforcement Learning (RL). In this post, I am going to overview several related ideas.

Warm up on graphs – GCN

Let’s first warm up with some contexts on graphs. Graphs are everywhere; many data can be represented in a graph format. Graphs provide richer information than simple feature engineering because each node’s neighborhood information may be very valuable for characterizing a node thus being very valuable for prediction tasks. One famous graph model is called Graph Convolutional Networks (GCN) [3] ([4, 5, 6] are good tutorials).

In GCN, nodes are represented as vectors, which we denote as $H^L \in \mathbb{R}^{|V|\times d^L}$ , with $|V|$ being the number of total nodes and $d^L$ is the vector dimension. We hope to learn these vector representations using $L$ layers of neural networks. A simple learning rule (in [3] it is called propagation rule) is:

$H^{l+1} = f(H^l, A) = \sigma(AH^lW^l), \text{for}\;\; \forall l =0, \cdots, L-1$ ,

where $A$ is the adjacency matrix, $H^l \in \mathbb{R}^{|V| \times d^l}$ is the node representation from the previous layer, $\sigma(\cdot)$ is an activation function, and $W^l$ is the parameters of this layer. $H^0$ is some basic feature representation of nodes, which can be as simple as an identity matrix, or as complex as shortest path distance to other nodes.

The problem of the simple propagation rule is that: (1) the computation of $AH^l W^l$ does not involve each node’s own representation; (2) nodes with large degrees will tend to have large values in their feature representation while nodes with small degrees will have small values. Therefore, a better propagation rule is:

$H^{l+1} = f(H^l, A) = \sigma(D^{-1}\hat{A}H^lW^l)$ ,

where $\hat{A}=A+I$ , $D^{-1}$ is the inverse degree matrix of $\hat{A}$ as a way to normalize the following value by the degree of each node.

In [3], the authors further proposes $H^{l+1} = f(H^l, A) = \sigma(D^{-0.5}\hat{A}D^{-0.5}H^lW^l)$ , which has some theoretical connection to “localized spectral filters”. This form of propagation rule, according to [5], not only takes into considerations the degree of the source node but also the target node for each node pair.

It is important to note that node representation $H^L$ can be learned by unsupervised learning or semi-supervised learning manner. What we describe above is the unsupervised learning approach, in which node embeddings are computed without knowing any node labels. A more interesting point of [3] is that the propagation rule can be used to learn node representation in a semi-supervised learning setting, where only some nodes’ labels are revealed but the representations of all the nodes in the graphs can be learned. This is because if we back-propagate on the prediction loss function, it can eventually back-propagate to all other nodes due to that the nodes with labels have representations as the aggregation from other nodes’.

Knowledge Graphs

Knowledge graphs can be seen as a special form of graphs: graphs with heterogenous types of edges. Using knowledge graphs can help on many downstream tasks. I happen to know two works which use knowledge graph information to aid recommendation system tasks.

The first work is called Deep Knowledge-Aware Network (DKN) [7]. It solves the problem of news recommendation. If we don’t use knowledge graphs, for each news candidate and for each historical read news article from a specific user, we can only featurize it by text information. The authors propose to map entities mentioned in news titles to a pre-built knowledge graph and create the entity representation in the knowledge graph. Then they can augment each news article’s feature representation with the entity representation from the entities mentioned in the news title. With the augmented feature representation, retrieval/ranking tasks can be more accurate.

While [7] can be thought of as a content-based method to aid recommendation, [8] is a structure-based method, i.e., it uses structure information of nodes in a graph neural network to help predicting CTR/engagement. Node embeddings are learned from a GCN-based approach on a pre-built knowledge graph, as GCN is a suitable algorithm for distilling neighborhood structures. (Note [7] and [8] are from the same group of authors who rely on Microsoft Satori knowledge graph system). Also, to avoid huge memory consumption, they don’t use the adjacency matrix in the computation, but instead sampling a fixed number of neighborhoods when aggregating each node’s representation from its neighbor.

There are many ways to learn node embeddings in knowledge graphs. Top famous ones are (the screenshot is from [7], which gives a good overview):

Markov Chains on graphs

Now, let’s look at some problems where Markov Chains are formulated on graphs. In a paper from Spotify [1], the authors assume that users’ interactions with different music genres are a Markov chain:

$\mathbf{\pi}_i^{t+1} = (1-\gamma) \mathbf{\pi}_i^{t-1} + \gamma \mathbf{\pi}_i^t$

where $\mathbf{\pi}_i^t$ is the distribution of user $i$ ‘s played genres at time $t$ . Hence, $\mathbf{\pi}_i^t$ can be represented as play counts on each genre $\mathbf{n}^t_i=[n^t_{i1}, \cdots, n^t_{iN}]$ normalized by total play counts $\xi_i^t$ : $\mathbf{\pi}_i^t=\mathbf{n}^t_i / \xi_i^t$ . The transitions between genres at different time steps are considered as a graph.

Then, they use some traditional graph models to formulate the generative process of $\xi_i^t$ and $n^t_i$ :

$\xi_i^{t+1} \sim Poisson(\xi_i^t)$

$\mathbf{n}^{t+1}_i | \xi_i^{t+1}, \mathbf{\pi}_i^{t+1} \sim Multinomial(\xi_i^{t+1}, \mathbf{\pi}_i^t A)$

The core idea is that:
1. The total play count at time step $t+1$ , $\xi_i^{t+1}$ , is sampled from a Poisson distribution determined by $\xi_i^t$ .

2. The play count on each category on the next timestep, $\mathbf{n}^{t+1}_i$ , will then be determined by both the total play count $\xi_i^{t+1}$ and the genre distribution $\mathbf{\pi}_i^{t+1}$ . $A$ is a $N \times N$ transition matrix to be learned denoting how user genre preferences change in two consecutive steps. The paper chooses to use some two-step maximal likelihood method to learn $A$ and $\gamma$ .

A much older paper [2] bears the same idea, where it learns item-to-item transitions instead of genre transitions.

Random Walk

Random walk is a powerful algorithm. With proper parameterization, random walk is essentially the same as personalized page-rank [16]. Many graph learning techniques also rely on random walk as an important component.

DeepWalk [9] uses node visit sequences from random walk to learn node representation. Its idea can be illustrated in the diagram below. Suppose from node 3, a random walker can reach to 1, 5, 1, …. We can use a fixed-size moving window to examine each node: for example, when we are at node 1 whose neighbors in the random walk sequence is node 3 and 5, we wish to predict the context node 3 and 5 given the representation of node 1. The model structure will be similar to SkipGrams which is widely used in NLP. The paper has some computation optimization on the prediction layer (softmax layer), which turns normal softmax into hierarchical softmax.

Node2Vec [11] is another similar work to DeepWalk. The only difference is that DeepWalk uses the pure random walk to generate node visitation sequence, whereas Node2Vec has two hyper-parameters to control the random walk tendency to favor breadth-first search v.s. depth first search [10].

Markov Decision Process on Graphs

Applying RL on graphs means we need to formulate a Markov Decision Process. The common idea is to learn to navigate in the graph such as to reach a desired node. Here is a knowledge graph-based application from MINERVA [13].

The idea of MINERVA is that if we want to query a knowledge graph by (source node, relationship), we want to land on a target node which could give us the answer. For example, source node=Obama, relationship=nationality, we want to learn an agent to traverse the graph and eventually land on the node=United States. The sequence of node navigation can be seen as a sequential decision problem: the agent, based on its current state and available outgoing relationships, needs to decide which next node to go to. So this agent can be naturally learned by reinforcement learning. Since the knowledge graph query task usually has ground truth data {(source node relationship, target node)} , we can design the reward function to be +1 if the agent eventually lands on the target node, or -1 if not.

DeepPath [12] is similar to MINERVA but simpler. DeepPath learns to return an efficient reasoning path that connects the source and target node (hence its each state always contains source and target node and the query relationship), whereas MINERVA always only contains source node and the query relationship. There is also a similar work [14] which improves the sample efficiency of RL-based walkers.

References

[1] Where To Next? A Dynamic Model of User Preferences: https://dl.acm.org/doi/10.1145/3442381.3450028

[2] Factorizing Personalized Markov Chains for Next-Basket Recommendation: https://dl.acm.org/doi/abs/10.1145/1772690.1772773

[3] Semi-Supervised Classification with Graph Convolutional Networks: https://arxiv.org/abs/1609.02907

[4] https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780

[5] https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-62acf5b143d0

[6] https://csustan.csustan.edu/~tom/Clustering/GraphLaplacian-tutorial.pdf

[7] DKN: Deep Knowledge-Aware Network for News Recommendation: https://arxiv.org/abs/1801.08284

[8] Knowledge Graph Convolutional Networks for Recommender Systems: https://arxiv.org/abs/1904.12575

[9] DeepWalk: Online Learning of Social Representations: https://arxiv.org/abs/1403.6652

[10] https://antonsruberts.github.io/graph/deepwalk/

[11] Node2Vec: https://arxiv.org/abs/1607.00653

[12] DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning: https://arxiv.org/abs/1707.06690

[13] Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning: https://arxiv.org/abs/1711.05851

[15] M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search: https://arxiv.org/abs/1802.04394

[16] https://www.r-bloggers.com/2014/04/from-random-walks-to-personalized-pagerank/

Check object memory in Python

We can use Pympler (https://pympler.readthedocs.io/en/latest/) to inspect an object’s memory in Python. It can get an object’s memory including their references.

Here is an example:

>>> from pympler import asizeof 
>>> obj = [1, 2, (3, 4), 'text'] 
>>> asizeof.asizeof(obj) 
176 
>>> print(asizeof.asized(obj, detail=1).format()) 
[1, 2, (3, 4), 'text'] size=176 flat=48 (3, 4) size=64 flat=32 'text' size=32 flat=32 1 size=16 flat=16 2 size=16 flat=16

Normalizing Flows

Some update before we dive into today’s topic: I have not updated this blog for about 2 months, which is considered a long time : ). This is because I have picked up more tech lead work for setting up the team’s planning. I sincerely hope that our team will steer towards a good direction from 2022 and beyond.

Now, let’s go to today’s topic: normalizing flows. Normalizing flows is a type of generative models. Generative models can be best summarized using the following objective function:

$\text{maximize}_{\theta} \quad \mathbb{E}_{x \sim P_{data}}\left[ P_{\theta}(x)\right],$
i.e., we want to find a model $\theta$ which has the highest likelihood for the data generated from the data distribution as such to “approximate” the data distribution.

According to [1] (starting 11:00 min), we can categorize generative models as:

Explicit with tractable density. i.e., $P_\theta(x)$ has an analytical form. Examples include Normalizing Flows, PixelCNN, PixelRNN, and WaveNet. The latter three examples are autoregressive models which generate elements (e.g., pixels, audio clips) sequentially however are computationally heavy due to the nature of autoregressive. Normalizing Flows can generate the whole image instantaneously thus has a computational advantage over others. (For more pros/cons of Normalizing Flows, please refer to [4])
Explicit with approximate density, i.e., we are only optimizing for some bounds of $P_\theta(x)$ . One example is Variational Encoder Decoder [2], in which we optimize ELBO.
Implicit with no modeling of density. One example is Generative Adversarial Network (GAN), where we generate images just based on the generator network with a noise input.

Normalizing Flows is based on a theory called “change of variables” [3]. Suppose $Z$ and $X$ are random variables both with dimension $n$ . Also, suppose there is a mapping function $f: \mathbb{R}^n \rightarrow \mathbb{R}^n$ such that $Z=f(X)$ and $X=f^{-1}(Z)$ . Then the density functions of $Z$ and $X$ have the following relationship:

$p_X(x) =p_Z(f(x))\left| det\left( \frac{\partial f(x)}{\partial x}\right) \right|=p_Z(f(x))\left| det \left( \frac{\partial f^{-1}(z)}{\partial z }\right) \right|^{-1}$ ,
where the last two equations are due to $det(A^{-1})=det(A)^{-1}$ . From now on, we assume that $X$ denotes data while $Z$ denotes a random variable in a latent space.

In Normalizing Flows, the mapping is parameterized by $\theta$ (i.e., $f \rightarrow f_\theta$ and $f^{-1} \rightarrow f^{-1}_\theta$ ), which is what we try to learn. The objective function for learning $\theta$ becomes:

$\text{maximize}_\theta \quad p_X(x|\theta) = \text{maximize}_\theta\quad p_Z(f_\theta(x))\left| det \left( \frac{\partial f_\theta(x)}{\partial x}\right) \right|$

As you can see, our goal is to learn to map complex data distribution $X$ into a simpler latent variable distribution $Z$ (usually $Z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ ). The reason that we want $Z$ to follow a simple distribution is that: first, we need to compute $p_Z(f_\theta(x))$ in the objective function so ideally $p_Z(\cdot)$ should be easy to compute; second, once we learn $\theta$ , we know the mapping $f_\theta$ and $f^{-1}_\theta$ , and then we can easily sample $z \sim Z$ and apply $f^{-1}_\theta(z)$ to generate new data (e.g., new images). The requirement for a valid and practical $f_\theta$ is that: (1) it has an invertible $f^{-1}_\theta$ ; (2) $\left| det\left( \frac{\partial f_\theta(x)}{\partial x}\right) \right|$ or $\left| det \left( \frac{\partial f^{-1}_\theta(z)}{\partial z }\right) \right|$ is efficient to compute.

One nice property of Normalizing Flows is that you can chain multiple transformation $f$ to form a new transformation. Suppose $f(x) = f_1 \circ \cdots \circ f_L(x)$ with each $f_i$ having a tractable inverse and a tractable Jacobian determinant. Then:

$p_X(x) =p_Z(f(x))\prod\limits_{i=1}^L\left|det\left( \frac{\partial f_i}{\partial (f_{i-1}\circ\cdots \circ f_0(x))}\right) \right|$ , where $f_0(x)=x$ .

In practice, we usually pick $p_Z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and optimize for log likelihood. Therefore, our objective becomes (as can be seen in Eqn. 1 from the Flow++ paper (one SOTA flow model) [5]):

$\text{maximize}_\theta \quad \log p_X(x|\theta) \newline= \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right| \newline=\log \mathcal{N}(f_\theta(x); \mathbf{0}, \mathbf{I}) + \sum\limits_{i=1}^L \log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|$

[6] provides a good tutorial for getting started in Normalizing Flows, while [7] has a more in-depth explanation and it will help a lot in understanding more advanced implementation such as Flow++ [8]. For now, I am going to introduce [6].

[6] introduces a flow called Planar Flow. It has relatively a straightforward transformation (linear + activation) and easy-to-compute Jacobian determinant:

Planar Flow is defined in Python as below:

Once it is defined, we can instantiate an arbitrary Planar Flow and see how it transforms from a 2D Normal distribution:

Now, suppose we want to learn what is the Planar Flow that could transform a 2D normal distribution to a target distribution defined as:

The objective function, as we already introduced above, is $\text{maximize}_\theta \quad \log p_X(x|\theta) = \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|$ . Here, $x$ is samples from a 2D normal distribution, $p_Z()$ is the target density_ring distribution. Therefore, the loss function (to be minimized) is defined as:

The overall training loop is:

Last, I highly recommend watching this ECCV tutorial: https://www.youtube.com/watch?v=u3vVyFVU_lI

TODO:

dequantization:

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html#Dequantization

https://arxiv.org/pdf/1511.01844.pdf

References

[1] PixelCNN, Wavenet & Variational Autoencoders – Santiago Pascual – UPC 2017 https://www.youtube.com/watch?v=FeJT8ejgsL0

[2] Optimization with discrete random variables: https://czxttkl.com/2020/04/06/optimization-with-discrete-random-variables/

[3] Normalizing Flows note: https://deepgenerativemodels.github.io/notes/flow/

[4] Introduction to Normalizing Flows: https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

[5] Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design: https://arxiv.org/abs/1902.00275

[6] Normalizing Flows in PyTorch: https://github.com/acids-ircam/pytorch_flows/blob/master/flows_01.ipynb

[7] UvA DL Notebook: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html

[8] Flow++ Github implementation: https://github.com/chrischute/flowplusplus

Leetcode 695. Max Area of Island

You are given an m x n binary matrix grid. An island is a group of 1‘s (representing land) connected 4-directionally (horizontal or vertical.) You may assume all four edges of the grid are surrounded by water.

The area of an island is the number of cells with a value 1 in the island.

Return the maximum area of an island in grid. If there is no island, return 0.

Example 1:

Input: grid = [[0,0,1,0,0,0,0,1,0,0,0,0,0],[0,0,0,0,0,0,0,1,1,1,0,0,0],[0,1,1,0,1,0,0,0,0,0,0,0,0],[0,1,0,0,1,1,0,0,1,0,1,0,0],[0,1,0,0,1,1,0,0,1,1,1,0,0],[0,0,0,0,0,0,0,0,0,0,1,0,0],[0,0,0,0,0,0,0,1,1,1,0,0,0],[0,0,0,0,0,0,0,1,1,0,0,0,0]]
Output: 6
Explanation: The answer is not 11, because the island must be connected 4-directionally.

Example 2:

Input: grid = [[0,0,0,0,0,0,0,0]]
Output: 0

Constraints:

m == grid.length
n == grid[i].length
1 <= m, n <= 50
grid[i][j] is either 0 or 1.

BFS/DFS

Ideas

Traverse the grid. For every cell with value 1, you can use DFS or BFS to further traverse its neighbors and count the number of nodes visited as the value of the connected area. At the end of the traversal, you can just take the maximum of all the areas associated to each value-1 cell.

class Solution(object):
    def maxAreaOfIsland(self, grid):
        """
        :type grid: List[List[int]]
        :rtype: int
        """
        height, width = len(grid), len(grid[0])
        
        def dfs(i, j):
            if i < 0 or i >= height or j < 0 or j >= width:
                return 0
            if grid[i][j] == 1:
                grid[i][j] = 0
                return 1 + dfs(i-1, j) + dfs(i+1, j) + dfs(i, j-1) + dfs(i, j+1)
            else:
                return 0
            
        def bfs(i, j):
            if grid[i][j] == 0:
                return 0
            
            grid[i][j] = 0
            nodes = [(i, j)]
            area = 0
            while nodes:
                x, y = nodes.pop(0)
                area += 1
                for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
                    x_, y_ = x + dx, y + dy
                    if x_ < 0 or x_ >= height or y_ < 0 or y_ >= width:
                        continue
                    if grid[x_][y_] == 1:
                        grid[x_][y_] = 0
                        nodes.append((x_, y_))
            return area
    
        max_area = 0
        for x in range(height):
            for y in range(width):
                # area = dfs(x,y)
                area = bfs(x, y)
                max_area = max(area, max_area)
    
        return max_area

Union Find

I think the clearest online reference is from this post:

https://leetcode.com/problems/max-area-of-island/discuss/729303/Python-BFS-and-Union-Find

Idea:

Union find is an algorithm in which we maintain each node’s parent as an array. The node parent is initialized such that each node’s parent is itself. While we traverse nodes, we will iteratively update the node parent array until connected nodes’ parents converge to the same node. We also maintain a size array, which keeps the record of the size of connected nodes. We have used union find for other leetcode problems. One good and succinct example is https://czxttkl.com/2015/10/19/leetcode-261-graph-valid-tree/.

class Solution(object):
    
    def maxAreaOfIsland(self, grid):
        """
        :type grid: List[List[int]]
        :rtype: int
        """
        height, width = len(grid), len(grid[0])
        # initialization for union find
        size = [[0 for j in range(width)] for i in range(height)]
        parent = [[(None, None) for j in range(width)] for i in range(height)]
        for i in range(height):
            for j in range(width):
                parent[i][j] = (i, j)
                if grid[i][j] == 1:
                    size[i][j] = 1
        
        def union_find(p, q):
            if parent[p][q] == (p, q):
                return p, q
            else:
                return union_find(*parent[p][q])
            
        for i in range(height):
            for j in range(width):
                if grid[i][j] == 1:
                    for next_i, next_j in ((i, j+1), (i+1, j)):
                        if next_j >= width:
                            continue
                        if next_i >= height:
                            continue
                        if grid[next_i][next_j] != 1:
                            continue
                        
                        parent_ij = union_find(i, j)
                        parent_ij1 = union_find(next_i, next_j) 
                        
                        if parent_ij == parent_ij1:
                            continue

                        x, y = parent_ij[0], parent_ij[1]
                        w, z = parent_ij1[0], parent_ij1[1]
                        if size[x][y] < size[w][z]:
                            parent[x][y] = parent[w][z]
                            size[w][z] += size[x][y] 
                        else:
                            parent[w][z] = parent[x][y]
                            size[x][y] += size[w][z]
        
        max_area = 0
        for i in range(height):
            for j in range(width):
                max_area = max(max_area, size[i][j])
        
        return max_area

Data Parallelism and Model Parallelism

In this post, we review the concept of data parallelism, model parallelism, and more in between. We will illustrate ideas using SOTA ML system designs.

Data Parallelism

Data parallelism means that there are multiple training workers fed with different parts of the full data, while the model parameters are hosted in a central place. There are two mainstream approaches of doing data parallelism: parameter servers and AllReduce.

Parameter Servers host model parameters in a server node group. Its basic form is illustrated in the pseudocode below. We can see that workers do not hold any model parameter. They only push local gradients to servers and fetch weights from servers afterwards, while there is no communication between workers.

I covered Ring AllReduce in my previous post [1] (when I surveyed the Ray paper). In short, Ring AllReduce aggregates the gradients of the model parameters between all training nodes after every round of training (i.e., one minibatch on each trainer node). In PyTorch, it is implemented in an API called DistributedDataParallel [2]. Each training node will have a full copy of the model and receive a subset of data for training. Once each node has one forward pass and the corresponding backward pass, the model parameters’ gradients will be synced using the AllReduce primitive. Such behaviors are studied in my previous post [3].

[2] makes a distinction between AllReduce and Parameter Servers: As one AllReduce operation cannot start until all processes join, it is considered to be a synchronized communication, as opposed to the P2P communication used in parameter servers. Hence, from this distinction, you can see that the communication cost of parameter servers grow linearly with the number of trainer nodes in the system. On the other hand, Ring AllReduce is an algorithm for which the communication cost is constant and independent of the number of trainer nodes in the system, and is determined solely by the slowest connection between trainer nodes in the system [15]. There is an empirical report comparing Ring AllReduce and Parameter Servers [14]. The result is consistent with our analysis: “it is easy to see that ring all-reduce (horovod) scales better on all models”.

Model Parallelism

The vanilla model parallelism is just to hold different parts of a model on different devices. It is useful when you have a model too large to be held on one node.

Pipeline parallelism [4] is a more advanced method of model parallelism which can have great speed-up over the vanilla model parallelism. First, the layers of a model are partitioned into a series of “cells”. Each cell is placed on a different node. Then, a mini-batch is split into smaller batches, called micro-batch, which get fed to cells in a sequential way such that each cell is operating on a micro-batch at one time. Therefore, the time for waiting for data to be processed on other nodes is minimized. The idea of pipeline parallelism is best illustrated in the diagram below (subplot c):

To put data parallelism and pipeline parallelism into a more concrete context, let’s study the paper PipeTransformer [5], which employs a hybrid of data parallelism and model parallelism. PipeTransformer employs synchronous pipeline parallel within one machine and data parallel across machines. It gradually freezes the stack of layers in order to reduce the number of active parameters, and spawns more processes on freed resources to increase data parallel width. PipeTransformer combines four components: Freeze Algorithm, AutoPipe, AutoDP, and AutoCache.

Freeze Algorithm, which is responsible for determining until which layer from the bottom should be frozen. Freezing layers comes from the recent study that parameters in neural networks usually converge from the bottom-up. After a few iterations, the bottom layers usually start to the less actively changed through the end of the training.
AutoPipe, which will create cell partitions among the unfrozen layers. The size of a cell will be adjusted smartly for best speed. For example, pipelines can be compressed (i.e., cells can be merged) to reduce the bubble size, which is the speed bottleneck for a pipeline. Because of Freeze Algorithm and techniques in AutoPipe, as the training goes on and more layers are frozen, one pipeline will occupy less nodes (e.g., GPUs).
AutoDP, which can create data parallelism either on the same machine or across machines. Data parallelism across machines is easy to understand because it naturally increases the throughput of training. On the other hand, data parallelism on the same machine is possible because AutoPipe will dynamically reduce the size of pipelines on one machine so there could be more replicas of pipelines even on the same machine.
AutoCache is more specific to the freezing training paradigm because caching frozen layers’ outputs could skip the same computation over and over again.

I find the diagram below is useful for illustrating AutoPipe and AutoDP:

While this post cannot survey all relevant papers, I find the table from [2] is comprehensive (at least as of 2021):

Practical Examples

There are some more practical lessons I learned from Pytorch official document.

CUDA semantics [6]

Each cuda device maintains a default stream. Within the same stream, operations are executed sequentially. But across different streams, operations are executed asynchronously. So if you modify a tensor on cuda:0 and then modify another tensor on cuda:1, do not assume that you can see the change to the tensor on cuda:0 if you have seen the change to the tensor on cuda:1.

Single-Machine Model Parallelism [7]

On a single machine, the vanilla model parallelism is to place different parts of a model to different cuda devices. As an example:

To use the pipeline parallelism, one needs to further split mini-batches into micro-batches and feed different micro-batches into different parts of the model. As an example:

As we can see, the vanilla model parallelism introduces more communication overhead. When a model can just fit into one gpu, the vanilla model parallelism actually has a worse speed than not using it. Pipeline parallelism can speed up things but the speed up is only sub-linear because of extra overhead.

Later, I found that Pytorch has a wrapped class called Pipe to handle pipeline parallelism automatically. Here is an example of using Pipe to train a Transformer model [9].

Use rpc to implement parameter servers

I find there are three good examples for how rpc can be flexible to implement a parameter server.

In the first example [10], they create a parameter server in one process. The parameter server holds a nn.Module. This module has different partitions on different gpu devices. There are two trainer processes. In each of them, a TrainerNet is created for passing input to the module on the parameter server process through rpc remote call. As you can see, the forward function of TrainerNet is purely a rpc remote call. In other words, there is no linear algebra computation happening on the trainer processes in this example. The overall training loop (run on each trainer process) is very simple. A DistributedOptimizer is used to optimize model parameters (hosted remotely on the parameter server process).

In the second example [11], they implement a training script that support both a parameter server and data parallelism. There are four processes: one master process, one parameter server process, and two trainer processes. It is best to read the training script from the master process. The master process first creates a RemoteModule on the parameter server process. Then it uses the rpc_async API to start _run_trainer method on two trainer processes. The master process passes the reference to the RemoteModule to _run_trainer. Therefore, each trainer process can use the reference to create a HybridModel, which holds both the reference to the RemoteModule and some local model synced by DistributedDataParallel. Things become clear if you look at the forward function of HybridModel: data are passed to the RemoteModule first and then to the local model to generate model outputs.

The third example [12] is simpler than the previous two. There is one parameter server process and one trainer process. It is easier to understand the script by starting from the trainer process. The trainer process first initializes an RNN model, which holds some local LSTM module and the remote reference to an EmbeddingTable and a Decoder. The EmbeddingTable and Decoder are created on the parameter server process through rpc remote call.

Use rpc to implement agent-observer reinforcement learning paradigm

This pedagogical example [13] is interesting because it shows how rpc remote calls happen between several processes back and forth. First of all, there is an agent process and several observer processes. The agent holds the remote reference to Observer class, which is created on every observer process. When the agent needs to collect rl training data, it kicks off run_episode method of Observer on the observer processes. While in the run_episode method of Observer, we call the agent’s select_action and report_reward methods through rpc remote calls. Hence you can see that there are multiple rpc remote calls between the agent process and observer processes.

——————————- Update 2022-01-03 ——————————-

I want to take down some notes about the ZERO paper [16] because it introduces some basic concepts about data and model parallelism. We start from recalling DistributedDataParallel which implements All-Reduce (see Data Parallel and DistributedDataParallel in “Data Parallelism” section; see introduction of All-Reduce in [1]). Typically, each process in an All-Reduce group will hold memory as much as all the parameters and other optimizer states would take, even though at each round of All-Reduce each process is only responsible for storing one shard of all the parameters correctly.

Therefore, the basic idea of ZERO (which is also called Fully Sharded Data Parallel (FSDP) at Meta [17]) is to only keep one shard of parameters in the memory; when forward and backward computation need to know the values of other shards, the all-gather operator is used to construct necessary parameters on the fly. I think the pseudo-code from [17] best describes the ZERO or FSDP:

The ZERO paper has some good explanation on memory usage during the lifecycle of model training. Based on Section 3 (Where did all the memory go?), I summarize three main memory usages: (1) parameters & gradients; (2) optimizer states, such as momentum/variance as required by Adam; (3) residual states, such as activations.

If we apply FSDP on parameters, gradients, and optimizer states in a setting of mixed-precision training with Adam optimizer, then ZERO would have ~60x reduction in memory if we use $N_d=64$ GPUs, as shown in the table below, where $P_os$ , $P_g$ , and $P_p$ means we apply ZERO only on optimizer states, gradients, and parameters, respectively:

For activations, we need to first illustrate why activations occupy memory in model training. This is because during backprogagtion, we need to both activation values and model parameter values to compute derivatives. If we do not store activations computed in the forward pass, we then need to recompute them in the backward pass. Therefore, a straightforward method is to save all activations computed in the forward pass for later usage in the backward pass. As you can see, this would require GPU memory as much as all activations. Only when backpropagation has progressed long enough to have all dependencies of an activation computed can the activation be discarded.

It is a usual practice to apply activation checkpoints, a technique to reduce memory footprints of activations (see [18] for a good tutorial). Activation checkpointing store only a subset of activations, allowing activations to be recomputed but not too often. The most memory-efficient way to checkpoint activations is to store every sqrt(n) activations, where n is the number of total layers. ZERO applies FSDP on activations to have them sharded on different GPUs.

As you may see from the pseudocode and explanations above, ZERO/FSDP can reduce the memory optimizer states and activations by sharding to different GPUs, which is not done in DistributedDataParallel. However, the peak memory will still depend on the parameters/activations after all-gather at some points in forward/backward. Note that, all-gather usually only gather parameters/activations per layer. So ZERO/FSDP will enjoy the most memory reduction if you can break down your models into many layers.

In 04/2021, Microsoft released a new blog post about ZERO-infinity, which is based on ZERO plus several enhancement: https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/

——————————- Update 2022-01-03 ——————————-

todo: https://thegradient.pub/systems-for-machine-learning/

References

[1] https://czxttkl.com/2020/05/20/notes-on-ray-pap/

[2] PyTorch Distributed: Experiences on Accelerating Data Parallel Training

[3] Analyze DistributedDataParallel (DPP)’s behavior: https://czxttkl.com/2020/10/03/analyze-distributeddataparallels-behavior/

[4] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

[5] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

[6] CUDA semantics: https://pytorch.org/docs/stable/notes/cuda.html

[7] Single Machine Model Parallelism: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html#speed-up-by-pipelining-inputs

[8] Pytorch RPC Framework: https://pytorch.org/docs/stable/rpc.html

[9] Train Transformers using Pipeline: https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html

[10] IMPLEMENTING A PARAMETER SERVER USING DISTRIBUTED RPC FRAMEWORK: https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html, code: https://github.com/pytorch/examples/blob/master/distributed/rpc/parameter_server/rpc_parameter_server.py

[11] COMBINING DISTRIBUTED DATAPARALLEL WITH DISTRIBUTED RPC FRAMEWORK: https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html, code: https://github.com/pytorch/examples/blob/master/distributed/rpc/ddp_rpc/main.py

[12] Distributed RNN using Distributed Autograd and Distributed Optimizer: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html#distributed-rnn-using-distributed-autograd-and-distributed-optimizer, code: https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn

[13] Distributed Reinforcement Learning using RPC and RRef: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html#distributed-reinforcement-learning-using-rpc-and-rref, code: https://github.com/pytorch/examples/tree/cedca7729fef11c91e28099a0e45d7e98d03b66d/distributed/rpc/rl

[14] Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment: https://www.diva-portal.org/smash/get/diva2:1224181/FULLTEXT01.pdf

[15] https://xzhu0027.gitbook.io/blog/ml-system/sys-ml-index/parameter-servers

[16] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/pdf/1910.02054.pdf

[17] Fully Sharded Data Parallel: faster AI training with fewer GPUs: https://engineering.fb.com/2021/07/15/open-source/fsdp/

[18] https://github.com/cybertronai/gradient-checkpointing