Many ways towards recommendation diversity

Diversity of recommendations keeps users engaged and prevents boredom [1]. In this post, I will introduce several machine learning-based methods to achieve diverse recommendations. The literature in this post is mostly retrieved from the overview paper “Recent Advances in Diversified Recommendation” [6]. 

Determinantal Point Process

Let's first review what the determinant of a matrix is [2]. The determinant of a matrix A can be understood as the (signed) volume of the parallelepiped obtained by transforming an n-dimensional unit cube by A. The parallelepiped's edges are given by A's column vectors. Let's look at a 2D example:

(1)   \begin{align*}|A| = \begin{vmatrix} a & b\\c & d \end{vmatrix}=ad - bc \end{align*}

 

A transforms a unit square on the 2D plane to a parallelogram characterized by two vertices (a,c) and (b,d) (corresponding to A‘s first column and second column). Therefore, according to basic math [3], this parallelogram’s area can be computed as ad - bc, which is the definition of the determinant of A.

A point process \mathcal{P} on an N-cardinality set S is a probability distribution over the power set of S (which has 2^{N} elements). A determinantal point process is a family of such distributions in which, for any subset s \subseteq S, the probability has the closed form:

(2)   \begin{align*}P(s \subseteq S)=\frac{det(L_s)}{det(L+I)} \propto det(L_s),\end{align*}


where L is a constructed N\times N matrix indexed by the elements of S, I is an identity matrix of the same size, and L_s is a sub-matrix of L: L_s \equiv [L_{ij}]_{i,j \in s}.

L can be constructed in a way that gives the determinantal point process a beautiful interpretation of balancing individual items' quality and inter-item diversity. First, we can rewrite L as a Gram matrix: L=B^T B, where B has size K \times N. Each column of B can be represented as B_i=q_i \cdot \phi_i, i=1,\dots,N, which can be thought of as the normalized feature vector of one item (\Vert\phi_i \Vert_2=1) scaled by a quality scalar q_i. Next, since P(s \subseteq S) \propto det(L_s) and det(AB)=det(A)det(B), we have det(L_s)=det(B_s^T B_s)=Vol^2(\{B_i\}_{i\in s}), where B_s consists of the columns of B indexed by s and Vol(\{B_i\}_{i\in s}) is the volume of the parallelepiped spanned by those columns. This volume is affected by both q_i and \phi_i: the larger q_i is, the larger the volume; the larger <\phi_i, \phi_j> is (for any i,j\in s), the more similar the two items are and the smaller the volume. This is easiest to visualize when |s|=2 or 3.

We denote by s^* the subset of S that achieves the highest P(s \subseteq S). Now we introduce a concept: P(s \subseteq S) is log-submodular (Section 2.2 in [7]). Suppose f(s)=log P(s \subseteq S). Then f(s) is submodular because f(s \cup \{i\}) - f(s) \geq f(s' \cup \{i\}) - f(s') for any s \subseteq s' \subseteq S \setminus \{i\}. Intuitively, adding elements to s yields diminishing returns in f(s) as s gets larger. As [5] reveals (around Eqn. 74), f(s) is also non-monotone, as the addition of even a single element can decrease the probability. Submodular maximization corresponds to finding the s^* with the highest P(s \subseteq S) (or log P(s \subseteq S)). Submodular maximization is NP-hard (i.e., every problem in NP can be reduced to it in polynomial time). (But interestingly, submodular minimization can be solved exactly in polynomial time; see the introduction in Section 1.2.1 of one PhD thesis I came across.) Therefore, in practice, we turn to approximate methods for maximizing P(s \subseteq S). [4] introduces one simple greedy algorithm (Eqn. 15).
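
To make this concrete, below is a minimal numpy sketch of the greedy idea: build L=B^T B from made-up quality scores and unit-norm feature vectors, then repeatedly add the item that most increases det(L_s). This is the naive recompute-the-determinant version, not the fast incremental algorithm of [7]; the function names and the toy data are my own.

import numpy as np

def greedy_dpp_map(L, k):
    """Greedily pick k items approximately maximizing det(L_s)."""
    N = L.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for j in range(N):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = j, logdet
        if best_item is None:  # no remaining item keeps the determinant positive
            break
        selected.append(best_item)
    return selected

# toy example: 5 items with 3-dim unit feature vectors scaled by quality
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 5))
phi /= np.linalg.norm(phi, axis=0, keepdims=True)   # ||phi_i||_2 = 1
q = rng.uniform(0.5, 1.5, size=5)                   # quality scalars
B = q * phi                                         # B_i = q_i * phi_i
L = B.T @ B                                         # Gram matrix
print(greedy_dpp_map(L, k=3))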

Simple Submodular Diversification

Before DPPs were introduced to solve the diversification problem, Amazon proposed a simpler idea for modeling diversification [9], which also boils down to submodular optimization.

Here, \mathcal{A}_k is the item candidate set, \mathbf{a}_i is each item's category (a one-hot vector), s(\mathbf{a}_i) is each item's score, and \mathbf{w} is a utility vector governing the total utility of the final selected item set. The optimal item set is the one that maximizes this utility objective, and in practice it is found by a greedy method. \mathbf{w} is the only learnable parameter and has the same dimension as the number of categories. If we want to learn a globally optimal \mathbf{w}, we need to estimate each category's CTR by Bayesian probability. If we want to learn a personalized \mathbf{w}, we need to estimate each user's CTR on each category.

Calibrate recommendations to user history

[8] introduces another idea: calibration. Its tenet is that the recommended list should have a topic distribution similar to that of the user's watch history. See Section 3 of [8] for the exact formula. Basically, they use a KL divergence to measure the difference between the recommendations and the user history.
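
As a rough sketch of the idea (not the exact formula in [8]): compute the genre/topic distribution p of the user's watch history and q of the recommended list, then measure the KL divergence between them, with a little smoothing so that q never assigns zero probability where p is positive. The item-to-genre mapping and the smoothing constant below are made up for illustration.

import numpy as np

def genre_dist(items, item_genres, num_genres):
    """Empirical genre distribution of an item list (uniform item weights)."""
    p = np.zeros(num_genres)
    for it in items:
        p[item_genres[it]] += 1.0
    return p / max(p.sum(), 1.0)

def calibration_kl(history, recs, item_genres, num_genres, alpha=0.01):
    """KL(p || q_smoothed); smaller means the recommendations are better calibrated."""
    p = genre_dist(history, item_genres, num_genres)
    q = genre_dist(recs, item_genres, num_genres)
    q_smoothed = (1 - alpha) * q + alpha * p       # avoid dividing by zero
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q_smoothed[mask])))

# toy example: items 0-5 tagged with genre ids 0/1/2
item_genres = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(calibration_kl(history=[0, 1, 2, 4], recs=[0, 1, 3],
                     item_genres=item_genres, num_genres=3))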

References

[1] https://about.instagram.com/blog/engineering/on-the-value-of-diversified-recommendations

[2] https://en.wikipedia.org/wiki/Determinant#Geometric_meaning

[3] https://socratic.org/questions/how-do-you-find-the-area-of-a-parallelogram-with-vertices

[4] Practical Diversified Recommendations on YouTube with Determinantal Point Processes

[5] Determinantal point processes for machine learning

[6] Recent Advances in Diversified Recommendation

[7] Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity

[8] Calibrated Recommendations

[9] Adaptive, Personalized Diversity for Visual Discovery

Someone just saved this website: WordPress backup and Crayon Syntax Highlighter

It turned out that my old syntax highlighter "Crayon Syntax Highlighter" does not work with newer PHP versions (7.4+), and the plugin is no longer officially maintained. Luckily, someone updated it and provided a zip file: https://www.axew3.com/w3/forum/?coding=dmlld3RvcGljLnBocD9mPTEwJnQ9MTU1NQ==. I also backed up the updated plugin on my Dropbox: https://www.dropbox.com/s/e0u8jb2oqfagv9c/crayon-syntax-highlighter.zip?dl=0

BTW, I also backed up the whole website using the "All-in-One WP Migration" plugin: https://www.dropbox.com/s/unpeslbs404ujof/czxttkl.com-20201207-055211-640.wpress?dl=0

 

 

Correct font for subroutines in LaTeX

Today I learned that LaTeX supports many different typefaces: https://www.overleaf.com/learn/latex/font_typefaces and http://www.personal.ceu.hu/tex/typeface.htm

I've seen papers like this one (https://arxiv.org/pdf/2006.15704.pdf) use a different font for subroutines (like function names or APIs). One example is their reference to the "CrossEntropyLoss" function:

Now, I try different typefaces myself to see which one it uses:

{\rm CrossEntropyLoss rm}

{\sl CrossEntropyLoss sl} 

{\sf CrossEntropyLoss sf} 

{\tt CrossEntropyLoss tt}  

{\sc CrossEntropyLoss sc}

{\fontfamily{qcr}\selectfont
CrossEntropyLoss qcr
}

{\fontfamily{bch}\selectfont
CrossEntropyLoss bch
}

{\fontfamily{lmtt}\selectfont
CrossEntropyLoss lmtt
}

I think the closest font is lmtt.

Focal loss for classification and regression

I haven’t learnt any new loss function for a long time. Today I am going to learn one new loss function, focal loss, which was introduced in 2018 [1].

Let's start from a typical classification task. For a data point (\mathbf{x}, y), where \mathbf{x} is the feature vector and y is a binary label, a model predicts p(y=1|\mathbf{x}) = p. Then the cross entropy loss can be expressed as:

    \[CE(p,y)=-ylog(p)-(1-y)log(1-p)\]

 
If you set p_t = p if y=1 or p_t=1-p if y=0, then CE(p, y)=CE(p_t)=-log(p_t), as shown in the blue curve below:

For easy examples, i.e., those with p_t > 0.6 (the prediction leans toward the ground-truth label), the loss is small. However, if a dataset has an overwhelmingly large number of such easy examples, as in the CV object detection domain, all these small losses add up and overwhelm the losses of the hard examples, i.e., those with p_t < 0.6.

Focal loss's idea is to down-weight the loss of all examples, easy or hard, but easy examples get down-weighted much more. Focal loss is defined as FL(p_t)=-(1-p_t)^\gamma log(p_t). By tuning \gamma, you can see that the loss curve gets modulated differently and easy examples' losses become less and less important.
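
Here is a minimal numpy sketch of the binary focal loss (the alpha-balancing factor from [1] is omitted):

import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t) for binary labels y in {0, 1}.

    gamma=0 recovers plain cross entropy; a larger gamma down-weights
    easy examples (p_t close to 1) more aggressively.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.6, 0.1])        # predictions for three positive examples
y = np.array([1, 1, 1])
print(focal_loss(p, y, gamma=0.0))   # plain cross entropy
print(focal_loss(p, y, gamma=2.0))   # the easy example (p=0.9) shrinks the most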

In the regression domain, we can define the MSE (L2) loss as below, with l being the absolute prediction error:

    \[L_2 = | p - y |^2 = l^2\]


Applying the focal loss’s spirit on L_2, you get:

    \[FL\_L_2=l^\gamma \cdot l^2 = l^{\gamma+2}\]

So the focal loss in the regression domain is just a higher-order loss. 

[2] takes FL\_L_2 a step further by making the down-weighting effective almost only on the easy examples but not on the hard examples. They call it the shrinkage loss:

    \[L_S = \frac{l^2}{1 + exp\left(a\cdot \left( c-l \right)\right)}\]

Reference

[1] Focal Loss for Dense Object Detection: https://arxiv.org/pdf/1708.02002.pdf

[2] Deep Regression Tracking with Shrinkage Loss: https://openaccess.thecvf.com/content_ECCV_2018/papers/Xiankai_Lu_Deep_Regression_Tracking_ECCV_2018_paper.pdf

Causal Inference

I’ve given a causal inference 101 in [7]. Since then I have been learning more and more about causal inference. I find Jonas Peters’ series of lectures quite good [1~4]. In this post, I am going to take some notes as I watch his lectures, and I am going to share some classic papers in causal inference.

At 11:27 in [2], Jonas talked about two-stage regression. The problem is as follows. You want to know the causal effect of X on Y, but there is some unobserved confounding factor that impedes your estimation (i.e., you can't directly fit the regression Y = \alpha X + N_y). So we introduce an instrumental variable I which, by assumption, does not directly cause Y. In the example he gave, X is smoking, Y is lung cancer, and H is some hidden confounding factor that may affect both X and Y, for example, stress level. I is the tax on tobacco. Intuitively, we assume the tax on tobacco does not change the stress level and does not directly cause lung cancer.

The system must be linear for two-stage regression to work. We assume the underlying system can be described by the following linear equations:

Y=\alpha X + \gamma H + N_y
X=\beta H + \sigma I + N_x

Therefore, by substituting X's equation into Y's, we have:
Y=(\alpha \beta + \gamma) H + \alpha \sigma I + \alpha N_x + N_y

From the above equation, it becomes clear how to estimate \alpha without depending on H. First, we regress X on I to obtain an estimate of \sigma. Then, using the fitted value \hat{X}=\sigma I, we regress Y on \hat{X}. The learned coefficient associated with \hat{X} will be roughly \alpha, because the remaining components (\alpha \beta + \gamma) H + \alpha N_x + N_y in Y are independent of \sigma I.

However, this method would not work if the system were non-linear, because then it would be hard to separate \sigma I from the remaining components; \alpha would be intertwined with different components and could not be learned easily.
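
Here is a small simulation sketch of the two-stage regression, assuming the linear system above with made-up coefficients; it shows that the naive regression of Y on X is biased by H while the two-stage estimate recovers \alpha:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, beta, gamma, sigma = 2.0, 1.5, 3.0, 0.8

# simulate the assumed linear system: H and I are exogenous
H = rng.normal(size=n)                      # hidden confounder (e.g., stress)
I = rng.normal(size=n)                      # instrument (e.g., tobacco tax)
X = beta * H + sigma * I + rng.normal(size=n)
Y = alpha * X + gamma * H + rng.normal(size=n)

# naive regression of Y on X is biased because of H
naive = np.linalg.lstsq(np.c_[X, np.ones(n)], Y, rcond=None)[0][0]

# stage 1: regress X on I to estimate sigma, giving X_hat = sigma_hat * I
sigma_hat = np.linalg.lstsq(np.c_[I, np.ones(n)], X, rcond=None)[0][0]
X_hat = sigma_hat * I
# stage 2: regress Y on X_hat; the coefficient estimates alpha
alpha_hat = np.linalg.lstsq(np.c_[X_hat, np.ones(n)], Y, rcond=None)[0][0]

print("naive estimate:", round(naive, 2))          # biased away from alpha=2
print("two-stage estimate:", round(alpha_hat, 2))  # close to alpha=2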

A real-world application of two-stage regression is [5], where X is the time spent on different video types, Y is overall user satisfaction, and H is some hidden user preference that may cause both users spending time on videos and their satisfaction. To understand the magnitude of the causal effect of video watch time on user satisfaction, they look at data from different online A/B test groups, with I being a one-hot encoding of the test group a data point belongs to. Following the same reasoning as in two-stage regression, we can argue that I does not directly cause either Y or H. The contribution of [5] is that it reduces the bias when estimating the causal effect \alpha.

At 28:00 in [2], Jonas introduced the counterfactual distribution. He gave an example which I think best illustrates the difference between the intervention distribution and the counterfactual distribution. See the notation in "Notation and Terminology" in [6].

The example goes as follows.

Suppose T is a treatment for a disease and R is the recovery result, which depends on whether one receives the treatment and on a noise N_R following a Bernoulli distribution. According to the provided structural causal model (SCM), the intervention distribution P_R(\text{do}(T:=1)) is the distribution of N_R. In natural language, a patient has probability 0.99 of recovering if they receive the treatment.

However, if we observe a specific patient who received the treatment but did not recover, then we know this patient's N_R was sampled at value 0. So his counterfactual distribution, had he not received the treatment, is P_R(N_R=0;\text{do}(T:=0)) = 1. Note that this specific patient has a new SCM: T=1, R=1-T. The counterfactual distribution P_R(N_R=0;\text{do}(T:=0)) under the old SCM is the same as the intervention distribution under the new SCM.
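
The SCM itself is on the lecture slide and not reproduced above, so the sketch below assumes an SCM of the form R := T * N_R + (1 - T) * (1 - N_R) with N_R ~ Bernoulli(0.99), which matches the described behavior; it contrasts the intervention distribution with the counterfactual for the observed patient:

import numpy as np

rng = np.random.default_rng(0)

def scm_R(T, n_R):
    """Assumed SCM for recovery: R := T * N_R + (1 - T) * (1 - N_R)."""
    return T * n_R + (1 - T) * (1 - n_R)

# intervention distribution P_R(do(T:=1)): average over the noise N_R ~ Ber(0.99)
n_R = rng.binomial(1, 0.99, size=100_000)
print("P(R=1 | do(T:=1)) ~", scm_R(1, n_R).mean())            # about 0.99

# counterfactual: we observed T=1, R=0 for one patient, so his N_R must be 0;
# plugging N_R=0 back in and intervening do(T:=0) gives R=1 deterministically
print("counterfactual R with do(T:=0), N_R=0:", scm_R(0, 0))  # 1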

I've talked about causal discovery using regression and a noise independence test in [7]. Here is one more example. In a CV paper [8], the authors want to determine the arrow of time from video frames. They model the time series' next value as a linear function of the past two values plus additive independent noise. The final result shows that the forward-time series has a higher independence-test p-value than the backward-time series. (A high p-value means we cannot reject the null hypothesis that the noise is independent of the input.) They only keep a clip as a valid time series if the noise in the regression model is non-Gaussian in at least one direction, as determined by a normality test. The noise independence test is "based on an estimate of the Hilbert-Schmidt norm of the cross-covariance operator between two reproducing kernel Hilbert spaces associated with the two variables whose independence we are testing".

Another interesting idea in causal discovery is causal invariant prediction. The goal of causal invariant prediction is to find all parents that cause a variable of interest Y among all possible features \{X_1, X_2, \cdots, X_p\}. The core assumption is that in an SCM, and in any interventional distribution based on it, P(Y|PA_Y) remains invariant as long as the structural equation for Y does not change. Therefore, based on data from different SCMs over the same set of variables (which can be observational data or interventional experiments), we can fit a linear regression Y \leftarrow \sum\limits_{k\in S}\gamma_k X_k + \epsilon_Y for every feature subset S \subseteq \{X_1, X_2, \cdots, X_p\} in each SCM. Across the different SCMs, we then find all S's which lead to the same estimate of \gamma_k, k \in S. The parents of Y are estimated as the intersection of all these accepted S's, because the true parents' relationship to Y remains invariant all the time, satisfying the assumption of causal invariant prediction. [9] applies causal invariant prediction to gene perturbation experiments, and I find the example in Table 1 very illustrative.
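
Below is a toy sketch of the causal invariant prediction idea, not the actual statistical test used in [9]: for every feature subset it fits a per-environment linear regression and "accepts" the subset if the coefficients are (crudely) the same across environments, then intersects the accepted subsets. The data-generating process and the tolerance threshold are made up.

import numpy as np
from itertools import chain, combinations

def fit_coefs(X, Y):
    # ordinary least squares without intercept, for simplicity
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def invariant_sets(envs, p, tol=0.1):
    # accept a subset S if its per-environment regression coefficients agree
    # (a crude stand-in for the invariance test of causal invariant prediction)
    accepted = []
    all_subsets = chain.from_iterable(combinations(range(p), r) for r in range(1, p + 1))
    for S in all_subsets:
        cols = list(S)
        coefs = [fit_coefs(X[:, cols], Y) for X, Y in envs]
        if max(np.abs(c - coefs[0]).max() for c in coefs) < tol:
            accepted.append(set(S))
    return accepted

# two environments; the true structural equation is Y = 2*X1 + noise in both.
# environment 2 intervenes on X2 and X3 (mean shifts); X3 is a child of Y.
rng = np.random.default_rng(0)
envs = []
for shift in [0.0, 3.0]:
    n = 20_000
    X1 = rng.normal(size=n)
    X2 = 0.5 * X1 + rng.normal(size=n) + shift
    Y = 2.0 * X1 + rng.normal(size=n)
    X3 = Y + rng.normal(size=n) + shift
    envs.append((np.c_[X1, X2, X3], Y))

accepted = invariant_sets(envs, p=3)
print("accepted subsets:", accepted)
parents = set.intersection(*accepted) if accepted else set()
print("estimated parents of Y:", parents)   # ideally {0}, i.e. X1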

References 

[1] Lecture 1: https://www.youtube.com/watch?v=zvrcyqcN9Wo&t=2642s

[2] Lecture 2: https://www.youtube.com/watch?v=bHOGP5o3Vu0

[3] Lecture 3: https://www.youtube.com/watch?v=Jp4UcgpVA2I&t=4051s

[4] Lecture 4: https://www.youtube.com/watch?v=ytnr_2dyyMU

[5] Learning causal effects from many randomized experiments using regularized instrumental variables:
https://arxiv.org/abs/1701.01140

[6] Elements of Causal Inference: https://library.oapen.org/bitstream/handle/20.500.12657/26040/11283.pdf?sequence=1&isAllowed=y

[7] https://czxttkl.com/2020/09/08/tools-needed-to-build-an-rl-debugging-tool/

[8] Seeing the Arrow of Time: https://www.robots.ox.ac.uk/~vgg/publications/2014/Pickup14/pickup14.pdf

[9] Methods for causal inference from gene perturbation experiments and validation: https://www.pnas.org/content/113/27/7361

Analyze DistributedDataParallel (DDP)'s behavior

DistributedDataParallel implements data parallelism at the module level and can run across different machines. There is one process running on each device, where one copy of the module is held. Each process loads its own data, which does not overlap with other processes'. At the initialization phase, all copies are synchronized to ensure they start from the same initial weights. The forward pass is executed independently on each device, during which no communication is needed. In the backward pass, the gradients are all-reduced across the devices, ensuring that each device ends up with an identical copy of the gradients/weights, thereby eliminating the need for model syncs at the beginning of each iteration [1].

[2] describes some design details. For example, gradients can be synchronized more efficiently by bucketing: parameters are grouped into several buckets, and in the backward phase, whenever the gradients of all members of a bucket are ready, synchronization kicks off for that bucket. Thus one does not need to wait for the gradients of ALL parameters to become ready before synchronization starts.

Another detail in [2] is how gradient synchronization is started in the backward phase. It says that “DDP uses autograd hooks registered at construction time to trigger gradients synchronizations. “

I did a quick experiment to verify these design details. test_multi_gpu is in charge of spawning worker processes, and the real model training happens in the _worker function. initialize_trainer is the piece of code for initializing models in each process. Check out my comments in the "Logs" section of the pseudocode below for an explanation of DDP's behavior.

def test_multi_gpu(
    use_gpu: bool,
    num_gpus: int,
    normalization_data_map: Dict[str, NormalizationData],
    reader_options: ReaderOptions,
):
    logger.info(f"Enter test_multi_gpu with reader options {reader_options}")

    # These ENVS are needed by torch.distributed: https://fburl.com/1i86h2yg
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_unused_port())

    manager = mp.Manager()
    result_dict = manager.dict()

    workflow_run_id = flow.get_flow_environ().workflow_run_id
    # The second condition is to avoid collision in unit test.
    # When running unit test, a local DB is used.
    # As a result, the workflow run ID may not be unique.
    if workflow_run_id and workflow_run_id > 10000:
        init_method = f"zeus://{workflow_run_id}"
    else:
        host_name = socket.gethostname()
        process_id = os.getpid()
        init_method = f"zeus://{host_name}_{process_id}"

    backend = "nccl" if use_gpu else "gloo"

    mp.spawn(
        _worker,
        args=(
            use_gpu,
            num_gpus,
            backend,
            init_method,
            reader_options,
            normalization_data_map,
            result_dict,
        ),
        nprocs=num_gpus,
        join=True,
    )
    logger.info("finish spawn")


def _worker(
    rank: int,
    use_gpu: bool,
    world_size: int,
    backend: str,
    init_method: str,
    reader_options: ReaderOptions,
    normalization_data_map: Dict[str, NormalizationData],
    # output_dict comes right after normalization_data_map so that it lines up
    # with result_dict in the args tuple passed to mp.spawn above
    output_dict=None,
    reward_options: Optional[RewardOptions] = None,
    warmstart_path: Optional[str] = None,
):
    logger.info(f"rank={rank} with reader options {reader_options}")
    dist.init_process_group(
        backend=backend, init_method=init_method, world_size=world_size, rank=rank
    )
    if use_gpu:
        torch.cuda.set_device(rank)

    model = create_model(...)
    trainer = model.initialize_trainer(
        use_gpu=use_gpu,
        reward_options=reward_options,
        normalization_data_map=normalization_data_map,
        warmstart_path=warmstart_path,
    )
    logger.info(f"rank={rank} finish initialize")

    num_of_data = 0
    data_reader = construct_distributed_data_reader(
        normalization_data_map, reader_options
    )
    for idx, batch in enumerate(data_reader):
        batch = post_data_loader_preprocessor(batch)
        if use_gpu:
            batch = batch.cuda()

        num_of_data += len(batch.training_input.state.float_features)

        logger.info(
            f"rank={rank} batch state={batch.training_input.state.float_features}"
        )
        logger.info(
            f"rank={rank} before train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )
        if rank == 1:
            logger.info(f"rank={rank} wake")
            time.sleep(60)
            logger.info(f"rank={rank} sleep")

        trainer.train(batch)
        logger.info(
            f"rank={rank} after train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )

        break

    logger.info(f"rank={rank} finish reading {num_of_data} data")


def initialize_trainer(self) -> Seq2SlateTrainer:
    seq2slate_net = initialize_model(...)

    if self.use_gpu:
        seq2slate_net = seq2slate_net.cuda()

    logger.info(f"Within manager {print_param(seq2slate_net.seq2slate)}")
    logger.info(
        f"Within manager {next(seq2slate_net.seq2slate.parameters()).device}"
    )
    if self.trainer_param.num_parallel > 1:
        seq2slate_net = _DistributedSeq2SlateNet(seq2slate_net)

    return _initialize_trainer(seq2slate_net)

###############
     Logs
###############
# This is printed within initialize_trainer to show that
# the model copies initially have different parameters
# (see the logger calls in initialize_trainer and the log lines right below):
I1003 063214.340 seq2slate_transformer.py:140] Within manager tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.341 seq2slate_transformer.py:142] Within manager cuda:0
I1003 063214.349 seq2slate_transformer.py:140] Within manager tensor([-0.1076, -0.0444,  0.3003, -0.1177,  0.0275, -0.0811,  0.2084,  0.3369,\n ...], device='cuda:1', grad_fn=<CatBackward>)


# Below is printed by the logging statements in _worker's batch loop.
# You can see that each process receives different, non-overlapping data.
# You can also see that at this point, the two models have the same parameters,
# which is ensured by DDP. The parameters come from one particular copy (rank=0).
I1003 063214.531 test_multi_gpu.py:144] rank=0 batch state=tensor([[ 0.0000,  0.0000, -0.0000,  ..., -1.6131, -1.6298, -1.6118],\n ...]],\n       device='cuda:0')
I1003 063214.540 test_multi_gpu.py:147] rank=0 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n         
 ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.544 test_multi_gpu.py:144] rank=1 batch state=tensor([[ 0.0000,  0.0000, -0.7115,  ..., -2.2678, -2.3524, -2.4194],\n  ...]],\n       device='cuda:1')
I1003 063214.553 test_multi_gpu.py:147] rank=1 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n         ..., device='cuda:1', grad_fn=<CatBackward>)


# We deliberately let rank 1 sleep for one minute.
# But you can see that rank 0 does not return from its train function earlier,
# because it blocks on the .backward call, waiting for rank 1's backward() to finish.
# You can see that after the .train function, both processes end up with the same parameters again.
I1003 063214.554 test_multi_gpu.py:150] rank=1 wake
I1003 063314.613 test_multi_gpu.py:152] rank=1 sleep
I1003 063315.023 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.706389904022217, clamped_ips_loss=-2.706389904022217, baseline_loss=0.0, max_ips=27.303083419799805, mean_ips=0.6373803615570068, grad_update=True
I1003 063315.033 test_multi_gpu.py:156] rank=0 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063315.033 test_multi_gpu.py:161] rank=0 finish reading 1024 data
I1003 063315.039 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.7534916400909424, clamped_ips_loss=-2.7534916400909424, baseline_loss=0.0, max_ips=272.4482116699219, mean_ips=0.908729612827301, grad_update=True
I1003 063315.050 test_multi_gpu.py:156] rank=1 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:1', grad_fn=<CatBackward>)
I1003 063315.050 test_multi_gpu.py:161] rank=1 finish reading 1024 data

References

[1] https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

[2] https://pytorch.org/docs/stable/notes/ddp.html

Tools needed to build an RL debugging tool

I've always had a dream to build a debugging tool that automatically analyzes an observational dataset and tells me whether the dataset is suitable for applying batch-RL algorithms like DQN/DDPG. As we know, we cannot directly control, or even know, how an observational dataset was collected. An incorrect data collection procedure or a wrong dataset would be costly to the whole model training loop [12]. My idea has been that this tool should first tell whether the process underlying the dataset is a Markov Decision Process (MDP). By definition, in an MDP, s_{t+1} (the next state) is a function of s_t (the current state) and a_t (the action); R_t (the reward) is also a function of s_t and a_t. Thus, the goal of the debugging tool would be to make causal inferences from the observational dataset about whether states and actions cause the changes in rewards and state transitions.

I will start with some Causal 101. I read some primers on causal inference [1, 2, 3, 4], among which [1] has a lot of good, concise summaries. The biggest take-away from [1] is that causation is not prediction. The second take-away is that there are two fundamental questions regarding causal effects: (1) estimating the causal effect of a known relationship X->Y, and (2) causal relationship discovery, which [1] claims to be statistically impossible, though I don't understand why. For (1), we can use randomized experiments in which we ensure X is changed independently and then observe how Y changes, or we can account for confounding variables in models. In the special case where we account for all confounding variables in a linear regression model (based on subject-matter knowledge), the coefficient for X would be the perfect estimate of the causal effect of X->Y.

Regarding (2), causal relationship discovery, I've seen several papers researching this topic. There are generally three ways to discover causal relationships. The first is to use a noise model. In the additive noise model (a bivariate model), under the assumption that X->Y but not Y->X, if you train two regression models Y=f(X)+\epsilon_1 and X=g(Y)+\epsilon_2, you will find that \epsilon_1 is independent of X while \epsilon_2 is not independent of Y. See [5, 6]. The noise independence test can be done through a kernel-based test (see the reference in [6]). The only situation where this method cannot distinguish the causal direction is when X and Y have a linear relationship and their noise is Gaussian. The second method is related to Kolmogorov complexity. Theorem 1 from [7] states that, under some assumptions, if X->Y then K(F_X)+K(F_{Y|X}) \leq K(F_Y) + K(F_{X|Y}), which can be proxied by the so-called pinball loss. Quantile regression (regression using the pinball loss) can then be fitted from X to Y and from Y to X, and the direction with the lower pinball loss indicates the causal direction. The third method is called matching. While there are several matching methods, we focus on a widely adopted one called propensity score matching. The idea is that you group events with similar propensity scores together, where the propensity score of an event is how likely it is to happen. Once similarly likely events are grouped together, you can further divide them by whether they received some "treatment" and compute the causal effect. One good example of applying propensity score matching is an NLP work [11] where they probe which words cause positive or negative emotions. Some people criticize propensity score matching for its drawbacks [9, 10], and there are more matching methods to consider [10]. We need to be very careful about the assumptions of these methods, since many of them do not work when confounding variables are present.
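
As a small sketch of the additive noise model idea: fit a regression in both directions and check whether the residual is independent of the input. The proper independence test is kernel-based (see [6]); here I use a crude proxy, the correlation between squared residuals and squared inputs, on a made-up nonlinear, non-Gaussian example.

import numpy as np

def fit_residual(x, y, deg=3):
    """Polynomial regression of y on x; return the residuals."""
    coefs = np.polyfit(x, y, deg)
    return y - np.polyval(coefs, x)

def dependence_proxy(x, resid):
    """Crude stand-in for a kernel independence test: correlation between
    squared residuals and squared inputs (near 0 suggests independence)."""
    return abs(np.corrcoef(x ** 2, resid ** 2)[0, 1])

rng = np.random.default_rng(0)
n = 50_000
X = rng.uniform(-1, 1, size=n)
N = rng.uniform(-0.3, 0.3, size=n)          # non-Gaussian additive noise
Y = X ** 3 + N                              # true direction: X -> Y

print("X -> Y:", dependence_proxy(X, fit_residual(X, Y)))  # close to 0
print("Y -> X:", dependence_proxy(Y, fit_residual(Y, X)))  # clearly larger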

Next, I will go over uncertainty 101. Uncertainty of deep neural networks is a really fascinating topic to me: we can fit almost anything using deep neural networks, but that doesn't tell us much without uncertainty estimation. There are multiple ways to obtain uncertainty estimates from a deep NN:

  1. High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach: https://arxiv.org/pdf/1802.07167.pdf. The loss function is based on the idea that high-quality PIs should be as narrow as possible subject to a given coverage proportion, where coverage means that the proportion of data with label y lies within the predicted interval [y_l, y_u].
  2. Dropout as a bayesian approximation: Representing model uncertainty in deep learning: https://arxiv.org/abs/1506.02142. At testing time, dropout is applied on a trained neural network multiple times to get multiple predictions. Predictive intervals are computed empirically from the obtained samples. 
  3. Simple and scalable predictive uncertainty estimation using deep ensembles: https://arxiv.org/abs/1612.01474. Conceptually this is the most straightforward idea: an ensemble of models gives you a distribution of predictions (a minimal sketch follows this list). This idea is used in RL model-based planning, e.g., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models: https://arxiv.org/abs/1805.12114.
  4. Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf. Uncertainty estimate by quantile regression. A more detailed version: https://arxiv.org/abs/1811.00908
  5. Conservative Uncertainty Estimation by Fitting Prior Networks: https://openreview.net/pdf?id=BJlahxHYDS. Like Bayesian Optimization, uncertainty actually doesn’t rely on data labels.
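
As promised in item 3, here is a minimal sketch of the ensemble idea, using bootstrapped polynomial regressors as a cheap stand-in for deep nets trained with different random seeds: the spread of the ensemble's predictions serves as the uncertainty estimate, and it blows up outside the training data.

import numpy as np

rng = np.random.default_rng(0)

# training data only covers x in [-1, 1]
x_train = rng.uniform(-1, 1, size=200)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=200)

# "ensemble": models fit on bootstrap resamples (deep nets with different
# random seeds in the original papers; cheap degree-5 polynomial fits here)
ensemble = []
for _ in range(20):
    idx = rng.integers(0, len(x_train), size=len(x_train))
    ensemble.append(np.polyfit(x_train[idx], y_train[idx], deg=5))

x_test = np.array([0.0, 0.9, 2.0])          # 2.0 is far outside the training range
preds = np.stack([np.polyval(c, x_test) for c in ensemble])
mean, std = preds.mean(axis=0), preds.std(axis=0)
for x, m, s in zip(x_test, mean, std):
    print(f"x={x:+.1f}  prediction={m:+.2f}  ensemble std={s:.2f}")
# the ensemble disagrees much more at x=2.0, i.e. higher uncertainty
# where there is no training data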

After introducing causal and uncertainty 101, let's dive into the debugging tool I have in mind. Since any observational dataset may contain confounding issues and spurious correlations, we cannot really trust it. Rather, the only trustworthy dataset is one whose (s, a, r, s') tuples are collected using a completely random policy.

First, we need to debug whether R_t is a function of s_t and a_t. This is done by computing the feature importance of each state feature and each action feature. Feature importance is obtained as the change in reward prediction after masking one particular feature for all logged data, on a trained and fixed reward model. The intuition is that if a feature is important, masking it should significantly perturb the reward prediction. If no state feature is important, then this problem degenerates, at best, into a multi-armed bandit problem (when action features are important). We can use the uncertainty work surveyed above to get a confidence bound on the estimated feature importance, and we expect the lower confidence bound to be above 0. We call the test of whether the lower confidence bound is above 0 for at least p% of the logged data the feature importance significance test. (Update 2021-02-10: there is a clever way to do feature importance called Boruta: https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a)
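
Below is a sketch of the masking-based feature importance with a bootstrap confidence interval (a confidence interval on the mean importance rather than the per-example test described above); the "trained reward model" and the data are stand-ins.

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 4
X = rng.normal(size=(n, d))

def reward_model(X):
    # stand-in for a trained, fixed reward model: the reward really
    # depends only on features 0 and 1
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] ** 2

def masked_importance(X, j):
    """Absolute change in predicted reward when feature j is masked
    (replaced by its dataset mean), per logged example."""
    X_masked = X.copy()
    X_masked[:, j] = X[:, j].mean()
    return np.abs(reward_model(X) - reward_model(X_masked))

for j in range(d):
    deltas = masked_importance(X, j)
    # bootstrap a confidence interval for the mean importance
    boot = [deltas[rng.integers(0, n, size=n)].mean() for _ in range(200)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"feature {j}: mean importance in [{lo:.3f}, {hi:.3f}]")
# features 2 and 3 get intervals stuck at 0, so they fail the
# "lower confidence bound > 0" significance test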

Since our data is randomized on actions, an action feature passing the feature importance significance test is equivalent to saying that the action causes the reward. But if a state feature passes the feature importance significance test, we can only say the state feature is correlated with the reward. This is because we could have either of the following causal relationships; in the latter case, a confounding factor causes both the state feature and the reward.

That's why even having states that are important for rewards is not enough. More specifically, there are two concerns where things could go wrong: (1) what if actions cannot exert any influence on the states that are important to rewards? (2) what if actions can influence a state that is considered important for reward prediction, but that state is not a cause of the reward and is merely correlated with it? For (1), we need to debug whether s_{t+1} is a function of s_t and a_t. We train next-state prediction models for each state feature, and by shuffling action values we can see how the next-state prediction changes; the magnitude of this change is used as a proxy for the action's influence on that state feature. To account for data variance (aleatoric error), we can fit mixture density distributions on next states.

For (2), "what if actions can influence a state that is considered important for reward prediction, but that state is not a cause of the reward and is merely correlated with it", I think here is a good example to illustrate the point. Suppose there is a state feature "ad views of a particular product", the action is "whether to show more ads about the product", and the reward is "purchase of the product". Suppose there is a confounding variable "passion about the product" that actually causes both views and purchases. While the action can increase views, the state feature and the reward are independent once we know whether a person loves the product. In this case, even if some state features pass the reward feature importance significance test and are caused by actions, we will still be unsure whether the problem is meaningful for applying RL.

Overall, the debugging tool procedure can be summarized as:

—————— this part may not be useful ———————

Suppose there are two users (mdp_id=0 and mdp_id=1). User 0 loves the product, thus every ad view leads to a purchase. User 1 hates the product, thus ad views have no effect on him.

mdp_id  time_step  passion (s_t)  views (s_t)  action (a_t)  purchase (r_t)
0       0          1              0            show          1
0       1          1              1            not show      0
0       2          1              1            show          1
0       3          1              2            not show      0
1       0          0              0            show          0
1       1          0              0            not show      0

If passion and views are both state features, and purchase is a function of (passion, action), then passion will be considered important for rewards but views will not. Then, for passion, we will find that actions cannot change it, which indicates this problem is not suitable for RL.

If passion and views are both state features, and purchase is a function of (passion, views, action) (e.g., there is additional purchase decay due to accumulated views), then passion and views will both be considered important for rewards. We will find that actions can change views but not passion, which indicates that this dataset is valid for trying RL.

If passion is unobservable and only views is observable, and purchase is a function of (passion, action), then no feature will be considered important for reward prediction, and our tool would raise a flag immediately.

If passion is unobservable and only views is observable, and purchase is a function of (passion, views, action), then views will be recognized as important for rewards. We will find that actions can change views and conclude that this data is still valid for trying RL.

When passion is unobservable and only views is observable, and purchase is a function of (passion, action), and passion=1 by nature has far more data than passion=0. 

—————— this part may not be useful ———————

 

References

[1] http://www.stat.cmu.edu/~larry/=stat401/Causal.pdf

[2] causal-inference.org: https://sites.google.com/view/causality-reading-group/resources?authuser=0

[3] Elements of Causal Inference: https://library.oapen.org/bitstream/handle/20.500.12657/26040/11283.pdf?sequence=1&isAllowed=y

[4] Review of Causal Discovery Methods Based on Graphical Models: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full

[5] Nonlinear causal discovery with additive noise models: https://papers.nips.cc/paper/3548-nonlinear-causal-discovery-with-additive-noise-models

[6] On the Identifiability of the Post-Nonlinear Causal Model: https://arxiv.org/pdf/1205.2599.pdf 

[7] Distinguishing Cause from Effect Using Quantiles: Bivariate Quantile Causal Discovery: https://arxiv.org/abs/1801.10579

[8] Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf

[9] https://en.wikipedia.org/wiki/Propensity_score_matching#Advantages_and_disadvantages

[10] Why Propensity Scores Should Not Be Used for Matching: https://www.cambridge.org/core/journals/political-analysis/article/why-propensity-scores-should-not-be-used-for-matching/94DDE7ED8E2A796B693096EB714BE68B

[11] Feature Selection as Causal Inference Experiments with Text Classification: https://www.aclweb.org/anthology/K17-1018/

[12] "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/0d556e45afc54afeb2eb6b51a9bc1827b9961ff4.pdf

 

Distributed RL

This post curates some distributed RL research.

We start from A3C [2], which we briefly covered a few years ago [3]. The core idea is that there is a central global network which is periodically synchronized with workers. Each worker copies the network parameters from the global network, runs through the environment and collects experiences, and computes gradients on its own machine. Then, it sends the gradients to the global network, where the parameters actually get updated, and gets back the updated global network parameters [4]. Workers update the global network at different, "asynchronous" time points, hence the "asynchronous" in A3C.

The problem with A3C is that workers may run with lagged parameters, so the gradients computed by different workers may contradict each other or the global version. Therefore, it is sometimes better to wait for all workers to finish before one global update happens, which leads to A2C [5]:

This means that in A2C, all workers and the global network have the same parameters at any point in time. Of course, the drawback of A2C is that the speed of one global gradient update is determined by the slowest worker.

IMPALA [1] overcomes some drawbacks of A2C/A3C. IMPALA is conceptually closer to A3C in that workers also run with different versions of the parameters to collect experience. The global controller can update the global network parameters whenever it receives a message from one worker, independently of other workers' progress. Therefore, IMPALA (and A3C) is expected to have higher experience-collection throughput than A2C. IMPALA solves the issue of worker policies lagging behind the global version with two methods: (1) workers send trajectories, rather than computed gradients as in A3C, to the global controller; the sent trajectories contain the action propensities of the worker network; (2) once the global controller receives trajectories from one worker, it can use V-trace in gradient computation, which is similar to importance sampling weighting, to adjust for the discrepancy between the worker policy and the global policy.

Seed RL [7] is another attempt to improve over IMPALA, with two motivating observations:
(1) in IMPALA there are two things sent over the network: trajectories (from workers to the global controller) and model parameters (from the global controller to workers). However, model parameters can be many times larger than trajectories, meaning a lot of network bandwidth is wasted on sending the parameters of large models.
(2) workers are usually CPU-based, which makes them slow at performing policy inference.

Seed RL doesn't send model parameters over the network. It performs both learning and inference on the global controller, which is usually equipped with GPUs. The global controller only sends inferred actions to workers, and workers send trajectories to the global controller (the same as in IMPALA). Seed RL also uses V-trace, as it is still an asynchronous algorithm in which workers can fall behind the global version.

The difference between Seed RL and IMPALA can be best viewed in the following diagrams [6]:

 

References

[1] IMPALA

[2] A2C / A3C

[3] https://czxttkl.com/2017/06/13/a3c-code-walkthrough/

[4] A3C explanation in Medium

[5] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a2c

[6] https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

[7] https://arxiv.org/pdf/1910.06591.pdf

Noise-Contrastive Estimation

I encountered the noise-contrastive estimation (NCE) technique in several papers recently. So it is a good time to visit this topic. NCE is a commonly used technique under the word2vec umbrella [2]. Previously I talked about several other ways to generate word embeddings in [1] but skipped introducing NCE. In this post, I will introduce it, with notations borrowed from the very paper applying it to word2vec [2].

First, let’s look at an example to get familiar with the NLP terms, target word and context [3], which are denoted as w and h in [2]: 

Using the idea of word2vec, words that appear together frequently should have similar embeddings. One step further, a word's embedding should match the embedding of its context. The context embedding is often derived from the embeddings of the words in the context; for example, it can be as simple as the mean pooling of those embeddings.

Translating this idea into a loss function, we want to increase the following probability for every observed pair of (w, h):

(1)   \begin{equation*}    P^h_\theta(w) = \frac{exp(s_\theta(w, h))}{\sum_{w' \in V} exp(s_\theta(w', h))}\end{equation*}


where s_\theta(\cdot) is a scoring function to score the similarity of any pair (w, h), and w' comes from all other possible words in the entire vocabulary. Obviously, this can be an expensive loss function if the vocabulary size is large, which is usually the case. 

To avoid the inefficiency of computing pairwise scores for all the words, NCE transforms the problem of Eqn. 1 into a binary classification problem. For an observed pair (w, h), we randomly pick k additional words from the vocabulary to form a negative sample set. Among the k+1 words (w and the k sampled words), we look at each word w and compute the probability that it comes from the original observed pair (w, h). A word w appears among the k+1 words either because it was sampled from the distribution P^h_\theta(w) (if it is the word from the observed pair (w, h)), or because it was sampled from a uniform noise distribution P_n(w)=1/|V| in one of the k draws. Using Bayes' theorem (more details in [3]), we can derive the conditional probability that w comes from the observed pair given that w is among the k+1 words:

(2)   \begin{equation*}    P^h(D=1|w, \theta) = \frac{P^h_\theta(w)}{P^h_\theta(w) + kP_n(w)}\end{equation*}

The classification objective then becomes that for the word w from (w, h), P^h(D=1|w, \theta) should be as high as possible; for the word w from the negative samples, P^h(D=0|w, \theta) should be as high as possible. That is what Eqn. 8 from [2] is about:

(3)   \begin{equation*}    J^h(\theta) = \mathbb{E}_{w \in (w,h)} \left[ log P^h(D=1|w,\theta)\right] + k \mathbb{E}_{w \not\in (w,h)} \left[ log P^h(D=0|w,\theta) \right]\end{equation*}

There exist some theoretical reasons, which I am not too sure about, that can be taken advantage of to relax the definition of P^h(D=1|w, \theta): we replace P^h_\theta(w) in Eqn. 2 with the unnormalized score exp(s_\theta(w,h)). With this replacement, we won't need to worry about computing the partition function \sum_{w' \in V} exp(s_\theta(w', h)) as in Eqn. 1.
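
Here is a toy numpy sketch of the resulting NCE objective for one observed (w, h) pair, with a dot-product scoring function, a uniform noise distribution P_n(w)=1/|V|, and exp(s_\theta(w,h)) used unnormalized as discussed; all sizes and parameters are made up.

import numpy as np

rng = np.random.default_rng(0)
V, dim, k = 1000, 16, 5                    # vocab size, embedding dim, negatives

# toy parameters: word and context embeddings; the scoring function s_theta
# is just their dot product
word_emb = 0.1 * rng.normal(size=(V, dim))
ctx_emb = 0.1 * rng.normal(size=(V, dim))

def nce_loss(w, h):
    """Binary-classification NCE loss for one observed (w, h) pair."""
    p_noise = 1.0 / V                                  # uniform P_n(w)
    negatives = rng.integers(0, V, size=k)             # k sampled noise words

    def p_d1(words):                                   # P(D=1 | word, h)
        s = word_emb[words] @ ctx_emb[h]
        return np.exp(s) / (np.exp(s) + k * p_noise)

    loss = -np.log(p_d1(w))                            # observed word: D=1
    loss += -np.sum(np.log(1.0 - p_d1(negatives)))     # noise words: D=0
    return loss

print(nce_loss(w=42, h=7))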

There are also variants of the NCE loss. One is called InfoNCE (see Eqn. 4 in [4]) and has been applied in reinforcement learning [5]. [4] claims that its objective function actually optimizes a lower bound on the mutual information between w and h (in the word2vec context), or between the state and the next state (in the reinforcement learning context). Mutual information can be understood as the reduction in uncertainty about w after observing h [6].

 

References

[1] https://czxttkl.com/2017/01/02/embedding-and-heterogeneous-network-papers/

[2] Learning word embeddings efficiently with noise-contrastive estimation

[3] https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#context-based-skip-gram-model

[4] Representation Learning with Contrastive Predictive Coding

[5] Unsupervised State Representation Learning in Atari

[6] http://www.scholarpedia.org/article/Mutual_information

Control variate using Taylor expansion

We talked about control variates in [4]: when evaluating \mathbb{E}_{p(x)}[f(x)] with Monte Carlo samples, we can instead evaluate \mathbb{E}_{p(x)}[f(x)-h(x)+\theta] with \theta=\mathbb{E}_{p(x)}[h(x)] in order to reduce variance. The requirements for a control variate to work are that h(x) is correlated with f(x) and that the mean of h(x) is known.

In this post we will walk through a classic example of using a control variate, in which h(x) is chosen as a Taylor expansion of f(x). Recall that f(x)'s Taylor expansion around a can be expressed as f(x)=f(a) + f'(a)(x-a) + f''(a)/2! \cdot (x-a)^2 + f'''(a)/3! \cdot (x-a)^3 + \cdots. Therefore, if we take the second-order Taylor expansion of f(x) around a=0, we get h(x)=f(0)+f'(0)x + f''(0)/2! \cdot x^2.

Let's see an example from [6]. Suppose f(u_1, u_2)=exp[(u_1^2 + u_2^2)/2] and we want to evaluate the integral \Psi=\int^1_0 \int^1_0 f(u_1, u_2) du_1 du_2. If we use a Monte Carlo estimator to estimate \Psi with samples U^1, \cdots, U^m drawn uniformly from the unit square, we have:
\frac{\text{area of sampling}}{m}\sum\limits_{k=1}^m f(U^k)=\frac{1}{m}\sum\limits_{k=1}^m f(U^k)

The control variate is the second-order Taylor expansion of f: h(u_1, u_2)=1 + (u_1^2 + u_2^2)/2 (we use the multi-variable Taylor expansion here [7]). We first compute its mean: \theta=\mathbb{E}_{p(x)}[h(x)] = \int^1_0\int^1_0 \left(1 +(u_1^2+u_2^2)/2 \right) du_1 du_2 = 4/3. Thus, our control variate Monte Carlo estimator is:
\frac{1}{m}\sum\limits_{k=1}^{m} z(u_1, u_2)\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[f(u_1, u_2) -h(u_1, u_2) + \theta\right]\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[exp[(u_1^2 + u_2^2)/2] - 1 - (u_1^2 + u_2^2)/2 + 4/3 \right].

And from the results in [6], the sampling variance of z(u_1, u_2) is smaller than that of f(u_1, u_2).
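
The example is easy to check numerically; here is a short sketch comparing the plain Monte Carlo estimator with the control-variate estimator z:

import numpy as np

rng = np.random.default_rng(0)
m = 100_000
u = rng.uniform(0, 1, size=(m, 2))

f = np.exp((u[:, 0] ** 2 + u[:, 1] ** 2) / 2)     # integrand
h = 1 + (u[:, 0] ** 2 + u[:, 1] ** 2) / 2         # second-order Taylor expansion
theta = 4.0 / 3.0                                 # known mean of h on the unit square
z = f - h + theta                                 # control-variate estimator

print("plain MC estimate:", f.mean(), " sample variance:", f.var())
print("control variate estimate:", z.mean(), " sample variance:", z.var())
# both estimate the same integral, but z has a much smaller sampling variance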

Now, what [2] proposes is to apply a similar Taylor expansion-based control variate to the REINFORCE policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_{\pi}(\cdot), a_t \sim \pi(\cdot | s_t)} \left[ \nabla_\theta log \pi_\theta (a_t | s_t) R_t\right]

If we use the Monte Carlo estimator, then we essentially use samples \nabla_\theta log \pi_\theta (a_t | s_t) R_t to approximate \nabla_\theta J(\theta), which causes variance. If we have a control variate h(s_t, a_t)=Q_w(s_t, \bar{a}_t) + \nabla_a Q_w(s_t, a)|_{a=\bar{a}_t} (a_t - \bar{a}_t), where \bar{a}_t can be chosen to be any value but a sensible choice is \mu_\theta(s_t), a deterministic version of \pi_\theta(a_t | s_t), then the control variate version of the policy gradient can be written as (Eqn. 7 in [2]):
\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s\sim \rho_\pi, a \sim \pi}\left[ h(s_t, a_t)\right] \newline=\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s \sim \rho_\pi}\left[ \nabla_a Q_w(s_t, a)|_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)\right] 

As you can see, the expectation term can be computed analytically when Q_w is a critic with a known analytic expression, and it does not depend on \pi_\theta.

[3] uses the law of total variance (some good examples and intuitions in [8, 9]) to decompose the variance of the control-variate policy gradient into three parts: variance caused by the variance of states \Sigma_s, the variance of actions \Sigma_a, and the variance of trajectories \Sigma_\tau. What [3] highlights is that the magnitude of \Sigma_a is usually much smaller than the other two, so the benefit of using a control variate-based policy gradient is very limited.

 

References

[1] MUPROP: UNBIASED BACKPROPAGATION FOR STOCHASTIC NEURAL NETWORKS

[2] Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC

[3] The Mirage of Action-Dependent Baselines in Reinforcement Learning

[4] https://czxttkl.com/2020/03/31/control-variate/

[5] https://czxttkl.com/2017/03/30/importance-sampling/

[6] https://www.value-at-risk.net/variance-reduction-with-control-variates-monte-carlo-simulation/

[7] https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables

[8] http://home.cc.umanitoba.ca/~farhadi/ASPER/Law%20of%20Total%20Variance.pdf

[9] https://math.stackexchange.com/a/3377007/235140