PyTorch Lightning template

Back in the old days, I studied how to implement highly efficient PyTorch pipelines for multi-GPU training [1]. DistributedDataParallel is the way to go, but it is cumbersome that we need boilerplate for spawning workers and constructing data readers.

Now, PyTorch Lightning offers a clean API for setting up multi-GPU training easily. Here is a template I designed, which I will stick to for prototyping models for the rest of my life : )

import os
from typing import List, Any

from dataclasses import dataclass
import torch
import torch.distributed as dist
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, IterableDataset
import pytorch_lightning as pl


TOTAL_NUM_BATCHES = 320
BATCH_SIZE = 32
STATE_DIM = 5
NUM_GPUS = 2
NUM_WORKERS = 2
UPDATE_FREQ = 1
WEIGHTS = torch.tensor([2.0, 3.1, 2.1, -1.5, -1.7])

@dataclass
class PolicyGradientInput:
    pid: int
    state: torch.Tensor
    reward: torch.Tensor

    def __len__(self):
        return self.state.shape[0]

    @classmethod
    def from_batch(cls, x):
        return cls(
            pid=x[0][0].item(),
            state=x[1][0].reshape(BATCH_SIZE, STATE_DIM),
            reward=x[2][0].reshape(BATCH_SIZE, 1),
        )

class EpisodicDataset(IterableDataset):
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        pid = os.getpid()
        if worker_info is None:
            total_num_batches = TOTAL_NUM_BATCHES // NUM_GPUS
            worker_id = -1
        else:
            total_workers = worker_info.num_workers
            total_num_batches = int(TOTAL_NUM_BATCHES // NUM_GPUS // total_workers)
            worker_id = worker_info.id
        # You will see that we have an EpisodicDataset on each of NUM_GPUS processes
        print(f"{worker_info}, pid={pid}, total_num_batches={total_num_batches}")

        for _ in range(total_num_batches):
            state = torch.randn(BATCH_SIZE, STATE_DIM)
            reward = torch.sum(state * WEIGHTS, dim=1)
            yield (pid, state, reward)


class PPO(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1)
        )
        self.traj_buffer = []
        self.update_freq = UPDATE_FREQ
        self.step = 0

    def training_step(self, batch: List[Any], batch_idx):
        batch: PolicyGradientInput = PolicyGradientInput.from_batch(batch)
        self.traj_buffer.append(batch)
        self.step += 1
        rank = dist.get_rank()
        # use the first three trajectories' pids as a signature to quickly check whether all trainers share the same data
        # the answer: each trainer maintains different trajectories
        traj_buffer_signature = ','.join([str(traj.pid) for traj in self.traj_buffer[:3]])
        print(f"rank={rank}, traj_buffer_len={len(self.traj_buffer)}, step={self.step}, signature={traj_buffer_signature}")
        if self.step % self.update_freq == 0:
            model_params = list(self.model.parameters())[0][0].detach().cpu().numpy()
            print(f"before {self.step} step training: rank={rank}, model_params={model_params}")
            return self.update_model()

    def configure_optimizers(self):
        # Somehow, Adam doesn't work for linear regression
        # optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        optimizer = torch.optim.SGD(self.parameters(), lr=1e-2)
        return optimizer

    def update_model(self):
        traj = self.traj_buffer[-1]
        loss = F.mse_loss(self.model(traj.state), traj.reward)
        rank = dist.get_rank()
        print(f"rank={rank}, step={self.step}, loss={loss}")
        return loss


ds = EpisodicDataset()
dl = DataLoader(ds, batch_size=1, num_workers=NUM_WORKERS, pin_memory=True)
ppo = PPO()
trainer = pl.Trainer(gpus=NUM_GPUS, max_epochs=1, progress_bar_refresh_rate=1, accelerator='ddp')
trainer.fit(ppo, dl)

As you can see, you only need to define a Dataset, a DataLoader with an appropriate NUM_WORKERS, and a pytorch-lightning Trainer in which you specify the number of GPUs. Each of the NUM_GPUS GPUs will then use NUM_WORKERS processes for reading data and one main process for training the model.

The example shows that each trainer on a GPU maintains its own list of trajectories, which is not shared with the trainers on other GPUs. However, I believe model parameters stay synced after every loss computation, because the underlying mechanism is still DistributedDataParallel, which all-reduces gradients on every backward pass.


References

[1] https://czxttkl.com/2020/10/03/analyze-distributeddataparallels-behavior/ 

Precision Recall Curve vs. ROC curve

While the ROC (receiver operating characteristic) curve is ubiquitous in model reporting, the Precision-Recall curve is less often reported. However, the latter is especially useful when we have imbalanced data. Let's review the pertinent concepts.

True Positive = TP = you predict positive and the actual label is positive

False Positive = FP = you predict positive but the actual label is negative

True Negative = TN = you predict negative and the actual label is negative

False Negative = FN = you predict negative and the actual label is positive

TPR = True Positive Rate = Sensitivity = Recall = TP / (TP + FN)

FPR = False Positive Rate = FP / (FP + TN). Among all negative samples (FP + TN), how many of them are erroneously identified as positive (FP). FPR will increase if you blindly predict more samples as positive (i.e., lower the decision threshold for predicting positive). In an extreme case, you can just predict every sample as positive and FPR becomes 1.

Precision = TP / (TP + FP). Among all the predicted positive samples (TP + FP), how many of them actually have positive labels. 

For an ROC curve, the x-axis is FPR and the y-axis is TPR.

For a Precision-Recall curve, the x-axis is Recall and the y-axis is Precision.

Now, let’s look at an example with imbalanced data, in which case the Precision-Recall curve is more informative than the ROC curve (https://www.kaggle.com/general/7517). 

Suppose there are 1 million documents, among which 100 are relevant, and we have two methods to retrieve them. At some decision threshold for method 1 and some other decision threshold for method 2, we have:

  • Method 1 (@some decision threshold): 100 retrieved documents, 90 relevant
    • TP=90, FP=10, TN=1M-100-10, FN=10
    • FPR = FP / (FP + TN) = 10 / (1M – 100) =0.000010001, TPR = TP / (TP + FN) = 90 / (90 + 10) = 0.9, 
    • Recall = TPR = 0.9, Precision = TP / (TP + FP) = 90 / (90 + 10) = 0.9
  • Method 2 (@some decision threshold): 2000 retrieved documents, 90 relevant
    • TP=90, FP=1910, TN=1M-2000-10, FN=10
    • FPR = FP / (FP + TN) = 1910 / (1910 + 1M – 2000 – 10) = 0.00191019101, TPR = TP / (TP + FN) = 90 / (90 + 10) = 0.9
    • Recall = TPR = 0.9, Precision = TP / (TP + FP) = 90 / (90 + 1910) = 0.045

Note that this calculation gives just one point on each ROC / Precision-Recall curve. We can already see that the two ROC curves would not look too different (at the same TPR, the FPRs are 0.000010001 vs. 0.00191019101), but the PR curves would differ much more, since at the same recall, the precision is 0.9 vs. 0.045.
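
To double check the arithmetic, here is a small script (my own sketch; the counts are hard-coded from the scenario above):

def curve_point(retrieved, relevant_retrieved, total=1_000_000, total_relevant=100):
    tp = relevant_retrieved
    fp = retrieved - relevant_retrieved
    fn = total_relevant - relevant_retrieved
    tn = total - retrieved - fn
    fpr = fp / (fp + tn)
    recall = tp / (tp + fn)      # = TPR
    precision = tp / (tp + fp)
    return fpr, recall, precision

print(curve_point(100, 90))    # method 1: FPR ~ 0.00001, recall 0.9, precision 0.9
print(curve_point(2000, 90))   # method 2: FPR ~ 0.0019,  recall 0.9, precision 0.045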


Deep Learning-based Sorting

Here, I am talking about a few techniques for using deep neural networks to accomplish sorting/ranking tasks.

Reinforcement Learning – policy gradient paradigm

Using policy gradient to solve combinatorial optimization problems such as the Traveling Salesman Problem is not new. Ranking K out of N candidates is also a combinatorial optimization problem and thus can be solved by policy gradient. The only remaining question is how you parameterize the ranking policy. You can parameterize it as a sequence model (considering item interactions) [2] or as a Plackett-Luce distribution (if there is assumed to be an optimal pointwise relevance score). Additionally, as we discussed in [1], you can treat the ranking reward as non-differentiable (such as logged impressions, clicks, etc.) or differentiable (some loss similar to NDCG but differentiable).

soft rank 

References

[1] https://czxttkl.com/2020/02/18/practical-considerations-of-off-policy-policy-gradient/

[2] https://arxiv.org/abs/1810.02019

GAN (Generative Adversarial Network)

Here, I am taking some notes down while following the GAN online course (https://www.deeplearning.ai/generative-adversarial-networks-specialization/).

The first thing I want to point out is that one should be very careful about the computation graph during the training of GANs. To maximize efficiency, we can call the generator only once per iteration, using the generator output to train both the discriminator and the generator itself. While training the discriminator, we need to call generator_output.detach(), so that when the discriminator loss is backward-ed, the generator's part of the computation graph is left untouched and can still be used for training the generator. Another option is to call `discriminator_loss.backward(retain_graph=True)` to keep the graph alive. You can see both methods in https://github.com/pytorch/examples/blob/a60bd4e261afc091004ea3cf582d0ad3b2e01259/dcgan/main.py#L230 and https://zhuanlan.zhihu.com/p/43843694.
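
To make this concrete, here is a minimal sketch of one training iteration following the first option (a toy generator/discriminator of my own; not the exact code from the course or the linked repos):

import torch
from torch import nn

G = nn.Linear(10, 784)                               # toy generator
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # toy discriminator
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 784)
fake = G(torch.randn(32, 10))        # call the generator only once

# Discriminator step: detach() so backward() does not touch G's graph
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: reuse the same fake; its graph is still intact
opt_g.zero_grad()
g_loss = bce(D(fake), torch.ones(32, 1))
g_loss.backward()
opt_g.step()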

Next, we move to DCGAN (Deep Convolutional GAN) [1]. It is interesting to see that a generator/discriminator equipped with image-processing architectures like deconvolution/convolution layers leads to more realistic generated images. The deconvolutional layer is something I was not familiar with before. To understand it, we need to think of the convolution/deconvolution operation as matrix multiplication [2]. Suppose you have an input image (4×4) and a kernel (3×3), which is expected to generate a 2×2 output:

Equivalently, we can create a convolution matrix (4 x 16), flatten the input to (16 x 1), then multiply the two to get (4 x 1) output, and reshape to (2 x 2):


The deconvolution can be seen as the reverse of the convolution process: a 16×4 deconvolution matrix multiplies the flattened convolution output (2×2 → 4×1), producing a 16×1 vector that is reshaped to 4×4.
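
Here is a small sketch (my own, in the spirit of [2]) that builds the convolution matrix explicitly and verifies both directions against PyTorch's built-in ops:

import torch
import torch.nn.functional as F

x = torch.randn(4, 4)     # input image
k = torch.randn(3, 3)     # kernel

# Build the 4 x 16 convolution matrix: each row lays the 3x3 kernel
# over one of the four output positions of the flattened 4x4 input.
C = torch.zeros(4, 16)
for i in range(2):
    for j in range(2):
        for di in range(3):
            for dj in range(3):
                C[i * 2 + j, (i + di) * 4 + (j + dj)] = k[di, dj]

# Convolution as matrix multiplication: (4x16) @ (16x1) -> (4x1) -> (2x2)
out = (C @ x.reshape(16, 1)).reshape(2, 2)
print(torch.allclose(out, F.conv2d(x.view(1, 1, 4, 4), k.view(1, 1, 3, 3)).view(2, 2)))

# Deconvolution as the transpose: (16x4) @ (4x1) -> (16x1) -> (4x4)
y = torch.randn(2, 2)
up = (C.t() @ y.reshape(4, 1)).reshape(4, 4)
print(torch.allclose(up, F.conv_transpose2d(y.view(1, 1, 2, 2), k.view(1, 1, 3, 3)).view(4, 4)))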


The next topic is Wasserstein GAN [3]. The motivation behind it is that the original GAN's generator loss \max_g -\left[\mathbb{E}\left(\log d(x)\right) +\mathbb{E}\left(\log\left(1 - d(g(z))\right) \right) \right] provides little useful gradient when the discriminator is perfect at separating real and generated outputs. (There is much theory to dig into, but [4] gives a very good explanation of how it is superior in preventing mode collapse and vanishing gradients.) The Wasserstein loss approximates a better distance function than BCE loss, the Earth Mover's Distance, which provides useful gradient even when the real and generated outputs are already separable.
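
For reference, the Wasserstein objective (as I understand it from [3]) replaces the BCE-based value function with

\[\min_g \max_{d \,:\, 1\text{-Lipschitz}} \mathbb{E}\left[d(x)\right] - \mathbb{E}\left[d(g(z))\right],\]

where the critic d is no longer a probability-outputting discriminator.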

However, when training GANs using the W-loss, the critic must satisfy a special condition: it has to be 1-Lipschitz continuous. Conceptually, it is easy to check whether a function is 1-Lipschitz continuous: you check that the function never grows or drops with a slope of magnitude greater than 1 at any point:

It is interesting to see how 1-Lipschitz continuity (i.e., the gradient norm of the discriminator/critic is no larger than 1) is soft-enforced by adding a penalty term.

Instead of checking that the critic is 1-Lipschitz continuous for every input, we sample an intermediate image by interpolating between a real and a fake image. We then encourage the gradient norm at the intermediate image to stay close to 1.
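
A minimal sketch of this gradient penalty (assuming a generic critic nn.Module over 4-D image batches; hyperparameters omitted):

import torch

def gradient_penalty(critic, real, fake):
    # interpolate between real and fake images
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)
    scores = critic(mixed)
    # gradient of the critic's scores w.r.t. the interpolated images
    grad = torch.autograd.grad(scores.sum(), mixed, create_graph=True)[0]
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    # penalize deviation of the gradient norm from 1
    return ((grad_norm - 1) ** 2).mean()

The returned penalty is added to the critic's W-loss with some weight (commonly denoted \lambda).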


It is also intriguing to see how we can control GANs to generate images containing specific features. The first method is to utilize a pre-trained classifier, which scores how much a desired feature is present in a generated image.

A slightly improved method is disentanglement, which makes sure other features do not change simultaneously. 

The second method is to augment the generator and discriminator's input with a one-hot class vector. So you can ask the generator which class to generate, and you can ask the discriminator how relevant a generated image is to a given class, as shown below. Instead of requiring a pre-trained classifier as in the first method, the second method requires knowing the labels of the training data.

The third method is InfoGAN [5], which requires neither pre-trained models nor data labels. InfoGAN tries to learn latent codes that can control the output of the generator. The generator G takes as input a noise vector z (as in the original GAN) and, additionally, a latent code c that learns semantic categories by unsupervised learning. A clever idea is proposed: the mutual information between G(z, c) and c, I(c;G(z,c)), should be maximized. In other words, G(z,c) should reveal as much information as possible about c, which means we force the generator to generate images as relevant to c as possible. A similar idea is mentioned in our previous post on the InfoNCE loss [6]. This notebook (C1W4_(Optional_Notebook)_InfoGAN) walks through the implementation of InfoGAN.

The lecture then moves on to evaluation metrics for GANs. One metric is based on the Fréchet distance. We use a pre-trained image model such as Inception-v3 to process images and get embeddings, usually the output of one hidden layer. If we process N real images and N fake images, we would have N real images' embeddings and N fake images' embeddings. We can quantitatively compare the two embedding distributions by the Fréchet distance, which has the following form:
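
The slide itself is not reproduced here, but the form I noted down is the Fréchet distance between two Gaussians fitted to the real and fake embeddings:

\[d^2 = \Vert \mu_{real} - \mu_{fake} \Vert_2^2 + \text{Tr}\left(\Sigma_{real} + \Sigma_{fake} - 2\left(\Sigma_{real}\Sigma_{fake}\right)^{1/2}\right)\]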

Another evaluation metric is from [7] called “Learned Perceptual Image Patch Similarity”. They also extract image features from L layers of a pre-trained network. As shown in Eqn. 1 below, the distance between two images is a weighted sum of per-layer similarity. There is a learnable parameter w_l in per-layer similarity, which is learned to optimize for differentiating more similar image pairs from less similar pairs. Eventually, LPIPS can be used to measure the distance between real and fake images in GAN.

Finally, the lecture discusses one state-of-the-art GAN called StyleGAN [8]. There are three main components of StyleGAN: (1) progressive growing, (2) noise mapping network, and (3) adaptive instance normalization. 

StyleGAN supports two ways of style variation. The first is style mixing, which feeds different w vectors to different layers of the generator:

The second way to do style variation is to sample noise and add it between adaptive instance normalization operations.


References

[1] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

[2] Up-sampling with Transposed Convolution

[3] https://arxiv.org/pdf/1701.07875.pdf

[4] https://zhuanlan.zhihu.com/p/25071913

[5] InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

[6] Noise-Contrastive Estimation https://czxttkl.com/2020/07/30/noise-contrastive-estimation/

[7] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric https://arxiv.org/pdf/1801.03924.pdf

[8] A Style-Based Generator Architecture for Generative Adversarial Networks https://arxiv.org/abs/1812.04948

Projected Gradient Descent

I am reading “Determinantal point processes for machine learning”, which uses projected gradient descent in Eqn. 212. More broadly, such problems have this general form:
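
The display did not survive the formatting, but from the surrounding text the problem is the Euclidean projection onto the probability simplex:

\[\min_{\vec{x}} \frac{1}{2}\Vert \vec{x} - \vec{y} \Vert_2^2 \quad \text{s.t.} \quad \sum_i x_i = 1, \; x_i \geq 0,\]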

where we want to map \vec{y} to \vec{x} on the probability simplex. Since we often encounter problems with the sum-to-1 constraint, I think it is worth listing the solution in this post.

The algorithm is simple, as detailed in https://eng.ucmerced.edu/people/wwang5/papers/SimplexProj.pdf:
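
A short sketch of it in Python (my transcription of the paper's pseudocode; refer to the paper for the authoritative version):

import numpy as np

def project_to_simplex(y):
    u = np.sort(y)[::-1]                  # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    # rho: the largest index with u_j + (1 - sum_{i<=j} u_i) / j > 0
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + lam, 0.0)

x = project_to_simplex(np.array([0.5, 1.2, -0.3]))
print(x, x.sum())   # [0.15 0.85 0.  ] 1.0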

We omit the proof, but I do appreciate the elegance of this algorithm.

Many ways towards recommendation diversity

Diversity of recommendations keeps users engaged and prevents boredom [1]. In this post, I will introduce several machine learning-based methods to achieve diverse recommendations. The literature in this post is mostly retrieved from the overview paper “Recent Advances in Diversified Recommendation” [6]. 

Determinantal Point Process

Let's first review what the determinant of a matrix is [2]. The determinant of a matrix A can be understood as the (signed) volume of the parallelepiped obtained by transforming an n-dimensional unit cube by A. The parallelepiped is defined by vertices at the coordinates of A's column vectors. Let's look at a 2D example:

(1)   \begin{align*}|A| = \begin{vmatrix} a & b\\c & d \end{vmatrix}=ad - bc \end{align*}


A transforms a unit square on the 2D plane to a parallelogram characterized by two vertices (a,c) and (b,d) (corresponding to A‘s first column and second column). Therefore, according to basic math [3], this parallelogram’s area can be computed as ad - bc, which is the definition of the determinant of A.

A point process \mathcal{P} on an N-cardinality set S is a probability distribution over the power set of S (which has 2^N elements). A determinantal point process represents a family of probability distributions such that for any subset s \subseteq S, the probability density has the closed form:

(2)   \begin{align*}P(s \subseteq S)=\frac{det(L_s)}{det(L+I)} \propto det(L_s),\end{align*}


where L is a constructed N\times N matrix indexed by the elements of S, I is an identity matrix of the same size, and L_s is a sub-matrix of L: L_s \equiv [L_{ij}]_{i,j \in s}.

L can be constructed in a way that gives the determinantal point process a beautiful interpretation balancing individual items' quality and inter-item diversity. First, we can rewrite L as a Gram matrix: L=B^T B, where B has size K \times N. Then, each column of B can be represented as B_i=q_i \cdot \phi_i, i=1,\dots,N, which can be thought of as the normalized feature vector of one item (\Vert\phi_i \Vert_2=1) scaled by a quality scalar q_i. Next, since P(s \subseteq S) \propto det(L_s) and det(AB)=det(A)det(B), we have det(L_s)=det^2(B_s)=Vol\_of\_Parallelepiped^2(\{B_i\}_{i\in s}). The volume of the parallelepiped characterized by \{B_i\}_{i\in s} is affected by both q_i and \phi_i: the larger q_i is, the larger the volume; and the larger <\phi_i, \phi_j> (for any i,j\in s) is, the more similar the two items are, and consequently the smaller the volume. This is best understood when |s|=2 or 3.
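
A tiny numeric check of this intuition for |s|=2 (my own example): take two items with quality q_1 = q_2 = 1 and vary their feature similarity <\phi_1, \phi_2>; the determinant 1 - similarity^2 shrinks as the items get more alike.

import numpy as np

def det_two_items(sim, q1=1.0, q2=1.0):
    # L_s for two items with qualities q1, q2 and feature similarity sim
    L = np.array([[q1 * q1, q1 * q2 * sim],
                  [q1 * q2 * sim, q2 * q2]])
    return np.linalg.det(L)

for sim in [0.0, 0.5, 0.9, 1.0]:
    print(sim, round(det_two_items(sim), 4))
# 0.0 -> 1.0, 0.5 -> 0.75, 0.9 -> 0.19, 1.0 -> 0.0 (identical items get zero probability)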

We denote by s^* the best subset of S, i.e., the one achieving the highest P(s \subseteq S). Now we introduce a concept: P(s \subseteq S) is log-submodular (Section 2.2 in [7]). Suppose f(s)=\log P(s \subseteq S). Then f(s) is submodular because f(s \cup i) - f(s) \geq f(s' \cup i) - f(s') for any s \subseteq s' \subseteq S \setminus \{i\}. Intuitively, adding elements to s yields diminishing returns in f(s) as s gets large. As [5] reveals (around Eqn. 74), f(s) is also non-monotone, as the addition of even a single element can decrease the probability. Submodular maximization corresponds to finding the s^* with the highest P(s \subseteq S) (or \log P(s \subseteq S)). Submodular maximization is NP-hard (i.e., every problem in NP can be reduced to it in polynomial time). (But interestingly, submodular minimization can be solved exactly in polynomial time; see the introduction in Section 1.2.1 of one thesis I came across.) Therefore, in practice, we turn to approximate methods for maximizing P(s \subseteq S). [4] introduces one simple greedy algorithm (Eqn. 15).
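
A naive version of such a greedy algorithm looks like the sketch below (my own sketch, not the exact Eqn. 15 of [4]; [7] gives a much faster implementation):

import numpy as np

def greedy_dpp_map(L, k):
    # greedily add the item that most increases det(L_s)
    selected = []
    for _ in range(k):
        best_i, best_det = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best_i, best_det = i, d
        selected.append(best_i)
    return selected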

Simple Submodular Diversification

Before DPP was introduced to solve the diversification problem, Amazon proposed a simpler idea to model diversification [9], which is also a kind of submodular maximization.

Here, \mathcal{A}_k is an item candidate set, \mathbf{a}_i is each item's category (a one-hot vector), s(\mathbf{a}_i) is the score of each item, and \mathbf{w} is a utility vector governing the total utility of the final selected item set. The optimal item set maximizes \rho, and in practice it is found by a greedy method. \mathbf{w} is the only learnable parameter, with the same dimension as the number of categories. If we want to learn a globally optimal \mathbf{w}, we need to estimate each category's CTR by Bayesian probability. If we want to learn a personalized \mathbf{w}, we need to estimate each user's CTR on each category.

Calibrate recommendation to user history 

[8] introduces another idea: calibration. Its tenet is that the recommended list should have a topic distribution similar to the user's watch history. You can look at Section 3 of [8] for the exact formula. Basically, they use KL-divergence to measure the difference between the recommendations and the user history.

References

[1] https://about.instagram.com/blog/engineering/on-the-value-of-diversified-recommendations

[2] https://en.wikipedia.org/wiki/Determinant#Geometric_meaning

[3] https://socratic.org/questions/how-do-you-find-the-area-of-a-parallelogram-with-vertices

[4] Practical Diversified Recommendations on YouTube with Determinantal Point Processes

[5] Determinantal point processes for machine learning

[6] Recent Advances in Diversified Recommendation

[7] Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity

[8] Calibrated Recommendations

[9] Adaptive, Personalized Diversity for Visual Discovery

Someone just saved this website: WordPress backup and Crayon Syntax Highlighter

It turned out that my old syntax highlighter “Crayon Syntax Highlighter” does not work with newer PHP versions (7.4+) and the plugin is no longer officially maintained. Luckily, someone updated it and provides a zip file: https://www.axew3.com/w3/forum/?coding=dmlld3RvcGljLnBocD9mPTEwJnQ9MTU1NQ==. I also backed up the updated plugin on my Dropbox: https://www.dropbox.com/s/e0u8jb2oqfagv9c/crayon-syntax-highlighter.zip?dl=0

BTW, I also backed up the whole website using the “All-in-One WP Migration” plugin: https://www.dropbox.com/s/unpeslbs404ujof/czxttkl.com-20201207-055211-640.wpress?dl=0


Correct font for subroutines in LaTeX

Today I learned that LaTeX supports many different typefaces: https://www.overleaf.com/learn/latex/font_typefaces and http://www.personal.ceu.hu/tex/typeface.htm

I've seen papers like this one (https://arxiv.org/pdf/2006.15704.pdf) use a different font for subroutines (like function names or APIs). One example is their reference to the “CrossEntropyLoss” function:

Now, I try by myself different typefaces and see which one it uses:

{\rm CrossEntropyLoss rm}

{\sl CrossEntropyLoss sl} 

{\sf CrossEntropyLoss sf} 

{\tt CrossEntropyLoss tt}  

{\sc CrossEntropyLoss sc}

{\fontfamily{qcr}\selectfont
CrossEntropyLoss qcr
}

{\fontfamily{bch}\selectfont
CrossEntropyLoss bch
}

{\fontfamily{lmtt}\selectfont
CrossEntropyLoss lmtt
}

I think the closest font is lmtt.

Focal loss for classification and regression

I haven't learned a new loss function for a long time. Today I am going to learn one: focal loss, which was introduced in [1].

Let's start from a typical classification task. For a data point (\vec{x}, y), where \vec{x} is the feature vector and y is a binary label, a model predicts p(y=1|\vec{x}) = p. Then the cross entropy function can be expressed as:

    \[CE(p,y)=-y\log(p)-(1-y)\log(1-p)\]

If you set p_t = p when y=1 and p_t=1-p when y=0, then CE(p, y)=CE(p_t)=-\log(p_t), as shown in the blue curve below:

When there are many easy examples, i.e., those with p_t > 0.6 (the prediction leans toward the ground-truth label), each individual loss is small. However, if a dataset has an overwhelmingly large number of such easy examples, as in the CV object detection domain, all these small losses add up and overwhelm the hard examples, i.e., those with p_t < 0.6.

Focal loss's idea is to down-weight the losses of all examples, easy or hard, but the easy examples get down-weighted much more. Focal loss is defined as FL(p_t)=-(1-p_t)^\gamma \log(p_t). By tuning different \gamma's, you can see that the loss curve gets modulated differently and easy examples' losses become less and less important.
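
A quick implementation to play with (a sketch of the binary case; [1] also adds a class-balancing weight \alpha_t that I omit here):

import torch

def focal_loss(p, y, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t)
    p_t = torch.where(y == 1, p, 1 - p)
    return (-(1 - p_t) ** gamma * torch.log(p_t)).mean()

p = torch.tensor([0.9, 0.6, 0.3])      # predicted p(y=1|x)
y = torch.tensor([1, 1, 1])
for gamma in [0.0, 1.0, 2.0, 5.0]:
    print(gamma, focal_loss(p, y, gamma).item())
# gamma=0 recovers cross entropy; larger gamma suppresses the easy example (p_t=0.9) the most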

In the regression domain, we can define the MSE (L2) loss as below, with l being the absolute prediction error:

    \[L_2 = | p - y |^2 = l^2\]


Applying the focal loss’s spirit on L_2, you get:

    \[FL\_L_2=l^\gamma \cdot l^2 = l^{\gamma+2}\]

So the focal loss in the regression domain is just a higher-order loss. 

[2] takes the FL\_L_2 idea a step further by making the l^\gamma penalty effective almost only on the easy examples but not on the hard ones. They call it the shrinkage loss:

    \[L_S = \frac{l^2}{1 + exp\left(a\cdot \left( c-l \right)\right)}\]

References

[1] Focal Loss for Dense Object Detection: https://arxiv.org/pdf/1708.02002.pdf

[2] Deep Regression Tracking with Shrinkage Loss: https://openaccess.thecvf.com/content_ECCV_2018/papers/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.pdf

Causal Inference

I’ve given a causal inference 101 in [7]. Since then I have been learning more and more about causal inference. I find Jonas Peters’ series of lectures quite good [1~4]. In this post, I am going to take some notes as I watch his lectures, and I am going to share some classic papers in causal inference.

At 11:27 in [2], Jonas talked about two-stage regression. The problem is as follows: you want to know the causal effect of X on Y, but there is some unobserved confounding factor that impedes your estimation (i.e., you can't directly run the regression Y = \alpha X + N_y). So we introduce an instrumental variable I which, by assumption, does not directly cause Y. In the example he gave, X is smoking, Y is lung cancer, and H is some hidden confounding factor that may affect both X and Y, for example, stress level. I is the tax on tobacco: intuitively, we can assume the tax on tobacco does not directly cause changes in stress level or lung cancer.

The system must be linear for two-stage regression to work. We assume the underlying system can be described by the following linear equations:

Y=\alpha X + \gamma H + N_y
X=\beta H + \sigma I + N_x

Therefore, by substituting X's equation into Y's, we have:
Y=(\alpha \beta + \gamma) H + \alpha \sigma I + \alpha N_x + N_y

From the above equation, it becomes clear how to estimate \alpha without depending on H. First, we regress X on I to obtain \sigma. Then, using the fitted value of X, \hat{X}=\sigma I, we regress Y on \hat{X}. The learned coefficient associated with \hat{X} will be roughly \alpha, because the remaining components in Y, (\alpha \beta + \gamma) H + \alpha N_x + N_y, are independent of \sigma I.
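
A quick simulation (my own sketch with made-up coefficients) shows that naive OLS is biased by H while the two-stage estimate recovers \alpha:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, gamma, sigma = 100_000, 1.5, 2.0, 3.0, 0.8

H = rng.normal(size=n)                     # hidden confounder (e.g., stress)
I = rng.normal(size=n)                     # instrument (e.g., tobacco tax)
X = beta * H + sigma * I + rng.normal(size=n)
Y = alpha * X + gamma * H + rng.normal(size=n)

ols = (X @ Y) / (X @ X)                    # naive regression of Y on X, biased by H
sigma_hat = (I @ X) / (I @ I)              # stage 1: regress X on I
X_hat = sigma_hat * I
two_stage = (X_hat @ Y) / (X_hat @ X_hat)  # stage 2: regress Y on X_hat
print(f"true alpha={alpha}, OLS={ols:.3f}, two-stage={two_stage:.3f}")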

However, this method would not work if the system were non-linear, because then it would be hard to separate \sigma I from the remaining components; \alpha would be intertwined across components and could not be learned easily.

A real-world application of two-stage regression is [5], where X is time spent on different video types, Y is overall user satisfaction, and H is some hidden user preference that may cause both users' time spent on videos and their satisfaction. To understand the magnitude of the causal effect of video watch time on user satisfaction, they look at data from many online A/B test groups, with I being a one-hot encoding of the test group a data point belongs to. Following the same reasoning as in two-stage regression, we can argue that I does not cause either Y or H directly. The contribution of [5] is that it reduces the bias when estimating the causal effect \alpha.

At 28:00 in [2], Jonas introduced the counterfactual distribution. He gave an example which I think best illustrates the difference between the intervention distribution and the counterfactual distribution. See the notation in “Notation and Terminology” in [6].

The example goes as follows.

Suppose T is a treatment for a disease. R is the recovery result, which depends on whether one receives the treatment and on a noise N_R following a Bernoulli distribution. According to the provided structural causal model (SCM), the intervention distribution P_R(\text{do}(T:=1)) is determined by N_R. In natural language, a patient will recover with probability 0.99 if they receive the treatment.

However, if we observe a specific patient who received the treatment but did not recover, then we know this patient's N_R was sampled at value 0. So his counterfactual distribution, had he not received the treatment, is P_R(N_R=0;\text{do}(T:=0)) = 1. Note that this specific patient has a new SCM: T=1, R=1-T. The counterfactual distribution P_R(N_R=0;\text{do}(T:=0)) on the old SCM is the same as the intervention distribution on the new SCM.

I talked about causal discovery using regression and a noise independence test in [7]. Here is one more example. In a CV paper [8], the authors want to determine the arrow of time given video frames. They model the time series' next value as a linear function of the past two values plus additive independent noise. The final result shows that the forward-time series has a higher independence-test p-value than the backward-time series. (A high p-value means we cannot reject the null hypothesis that the noise is independent of the input.) They only keep data as valid time series if the noise in the regression model for at least one direction is non-Gaussian, determined by a normality test. The noise independence test is “based on an estimate of the Hilbert-Schmidt norm of the cross-covariance operator between two reproducing kernel Hilbert spaces associated with the two variables whose independence we are testing”.

Another interesting idea in causal discovery is causal invariant prediction. Its goal is to find, among all possible features \{X_1, X_2, \cdots, X_p\}, all the parents that cause a variable of interest Y. The core assumption of causal invariant prediction is that in an SCM and any interventional distribution based on it, P(Y|PA_Y) remains invariant if the structural equation for Y does not change. Therefore, based on data from different SCMs over the same set of variables (which can be observational data or interventional experiments), we can fit a linear regression Y \leftarrow \sum\limits_{k\in S}\gamma_k X_k + \epsilon_Y for every feature subset S \subset \{X_1, X_2, \cdots, X_p\} on each SCM. Across different SCMs, we then find all S's which lead to the same estimates of \gamma_k, k \in S. The parents of Y are given by the intersection of all such filtered S's, because the intersected features' relationship to Y remains invariant all the time, satisfying the assumption of causal invariant prediction. [9] applies causal invariant prediction to gene perturbation experiments, and I find the example in Table 1 very illustrative.

References 

[1] Lecture 1: https://www.youtube.com/watch?v=zvrcyqcN9Wo&t=2642s

[2] Lecture 2: https://www.youtube.com/watch?v=bHOGP5o3Vu0

[3] Lecture 3: https://www.youtube.com/watch?v=Jp4UcgpVA2I&t=4051s

[4] Lecture 4: https://www.youtube.com/watch?v=ytnr_2dyyMU

[5] Learning causal effects from many randomized experiments using regularized instrumental variables:
https://arxiv.org/abs/1701.01140

[6] Elements of Causal Inference: https://library.oapen.org/bitstream/handle/20.500.12657/26040/11283.pdf?sequence=1&isAllowed=y

[7] https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[8] Seeing the Arrow of Time: https://www.robots.ox.ac.uk/~vgg/publications/2014/Pickup14/pickup14.pdf

[9] Methods for causal inference from gene perturbation experiments and validation: https://www.pnas.org/content/113/27/7361