Recent advances in Batch RL

I’ll introduce some recent papers advancing batch RL.

The first paper is Critic Regularized Regression [1]. It starts from a general form of the actor-critic policy gradient objective, where Q_\theta is a learned critic:

Setting f(Q_\theta, \pi, s, a) = 1 recovers plain behavior cloning, where the critic is ignored. However, we can do much more than that choice:

The CRR paper tested the first two choices, while another well-known algorithm, the Advantage-weighted Behavior Model (ABM) from "Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning" [2], used the third choice.
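
To make the weighted behavior cloning objective concrete, here is a minimal PyTorch sketch of a CRR-style actor loss for discrete actions. The function and argument names are mine, the advantage is estimated as Q(s,a) minus the policy-averaged Q (one of several estimators one could use), and the binary/exponential weights correspond to the two choices of f that CRR tests.

import torch
import torch.nn.functional as F

def crr_actor_loss(policy_logits, q_values, actions, beta=1.0, mode="exp"):
    """Weighted behavior cloning: maximize f(Q, pi, s, a) * log pi(a|s).

    policy_logits: (B, num_actions) logits of the learned policy pi
    q_values:      (B, num_actions) critic estimates Q_theta(s, .)
    actions:       (B,) actions logged by the behavior policy
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a|s)

    with torch.no_grad():
        # advantage estimate A(s,a) = Q(s,a) - V(s), with V(s) ~= E_{a~pi}[Q(s,a)]
        v = (log_pi.exp() * q_values).sum(dim=1)
        q_a = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
        adv = q_a - v
        if mode == "binary":
            weight = (adv > 0).float()                   # f = indicator[A(s,a) > 0]
        else:
            weight = (adv / beta).exp().clamp(max=20.0)  # f = exp(A(s,a) / beta), clipped

    return -(weight * log_pi_a).mean()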

 

CRR can be connected to a popular family of policy gradient algorithms called Maximum a Posteriori Policy Optimisation (MPO) [3], which I think is better summarized in [2]. From [2], we can see that MPO solves problems in the following form:

The optimal solution of MPO has a closed-form expression: \pi(a|s) \propto \pi_{prior}(a|s) \exp(\hat{Q}^{\pi}(s,a) / \eta)

In CRR, \pi_{prior} is equal to the logging policy \mu_\beta. So the optimal solution is supposed to be \pi(a|s) \propto \mu_{\beta}(a|s) \exp(\hat{Q}^{\pi}(s,a) / \eta). The CRR paper points out that it is equivalent to subtracting the state value V(s) from \hat{Q}^{\pi}(s,a), leading to \pi(a|s) \propto \exp(\hat{A}^{\pi}(s,a) / \eta). (I am not sure how \mu_{\beta}(a|s) is omitted.)

The overall learning loop of CRR is very straightforward:

 

Let’s look at how MPO is used in the ABM paper [2]. While \pi_{prior} in CRR is the logging policy \mu_\beta, \pi_{prior} in ABM could also be a behavior cloning policy mimicking the logging policy, or the advantage-based behavior cloning policy.

As argued in [2], the reason not to simply use a behavior cloning prior is to avoid imitating a very mediocre logging policy:

Once we obtain \pi_{prior}, we can learn the RL policy \pi using Eqn. 2. [2] provides two ways of optimization: (1) Expectation-Maximization, alternating between \pi and the Lagrangian multiplier \alpha; (2) stochastic gradient descent with the reparameterization trick, because there is an \mathbb{E}_{a \sim \pi(\cdot|s)} term in the objective.

 

The next paper is Conservative Q-Learning (CQL) [4]. The idea is simple, though the proofs in the paper are quite dense. I am following https://towardsdatascience.com/the-power-of-offline-reinforcement-learning-5e3d3942421c to illustrate its idea. In standard Q-learning, our loss function is:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right]\\&=\mathbb{E}_{s,a\sim D} \left[ \left\| Q(s,a) - \left(r(s,a) + \gamma \max_{a'}Q(s',a') \right) \right\|^2 \right] \end{split}\end{align*}

A very conservative Q-learning loss function is:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right] + \alpha \cdot \mathbb{E}_{s\sim D, a \sim \pi} \left[ Q(s,a)\right],\end{split}\end{align*}


which can be understood intuitively as follows: we always drag down the Q-values of whichever actions the learned policy selects.

However, the authors find that the Q-values learned this way are too conservative. Therefore, they propose a relaxed version of the loss:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right] + \alpha \cdot \left( \mathbb{E}_{s\sim D, a \sim \pi} \left[ Q(s,a)\right] - \mathbb{E}_{s,a \sim D} \left[ Q(s,a)\right] \right),\end{split}\end{align*}

This objective can be understood as follows: we not only drag down the Q-values of the learned policy’s actions, but also pull up the Q-values of the logged actions. As a balance, the resultant Q-function is more conservative on actions not seen in the logged data, but less conservative on actions that are seen.
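
Here is a minimal PyTorch sketch of the relaxed objective above for discrete actions. The names (q_net, target_q_net, policy, batch layout) are my assumptions; also note that the full paper replaces the sampled \mathbb{E}_{a\sim\pi}[Q(s,a)] term with a logsumexp-based variant, which I omit here.

import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, policy, batch, gamma=0.99, alpha=1.0):
    """Standard TD error plus a penalty that pushes down Q on policy actions
    and pulls up Q on logged actions (the relaxed objective above)."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset D

    # delta(s, a): squared TD error with a target network
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_sa, target)

    # E_{s~D, a~pi}[Q(s,a)] - E_{s,a~D}[Q(s,a)]
    with torch.no_grad():
        pi_actions = torch.distributions.Categorical(logits=policy(s)).sample()
    q_pi = q_net(s).gather(1, pi_actions.unsqueeze(1)).squeeze(1)
    penalty = q_pi.mean() - q_sa.mean()

    return td_loss + alpha * penalty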

 

Different batch RL methods can all be boiled down to how they extrapolate Q-values on unobserved or rarely observed (state, action) pairs. [5] provides a principled way to zero out those “uncertain” Q-values. Here are some basic notations for introducing their idea:

There are two paradigms of learning a policy from data: policy iteration and value iteration. [5]’s methodology can be plugged into either paradigm.

For policy iteration (they call MBS-PI):

For value iteration (they call MBS-QI):

You can see that their methods all come down to a filter \varepsilon, which simply zeros out the Q-values of (state, action) pairs whose observed frequency is below a threshold (a minimal sketch follows the list below):

It is interesting to see how we empirically compute \hat{\mu}(s,a), the count/probability distribution of encountering (state, action) pairs.

  • For discrete action problems, they discretize state spaces by binning each state feature dimension. Therefore, there are a finite number of (state, action) pairs for which they count frequencies.
  • For continuous action problems, we can train a VAE whose objective, ELBO(s,a), is a lower bound of \log\hat{\mu}(s,a). Once the VAE is learned, we can compute \varepsilon(s,a) by \varepsilon(s,a) = \mathbb{I}(ELBO(s,a) > \text{thres}), where \mathbb{I}(\cdot) is an indicator function. 
  • For continuous action problems, another methodology, which is the one used in the paper, is to integrate with BCQ [7]. BCQ trains a conditional VAE so that only actions similar to those of the data collection policy are sampled in the Bellman update. One drawback of BCQ is that it only restricts itself to likely actions, even if the state itself is rarely observed. In comparison, [5] takes into account the observation frequency of both states and actions. Since BCQ already prevents Q-values from backing up on infrequent actions (given any state), they only need to apply \varepsilon(s)=\mathbb{I}(ELBO(s) > \text{thres}_2) on BCQ-updated Q-values to further zero out Q-values at rarely observed states. ELBO(s) comes from a VAE trained only on state data.
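
Below is a minimal tabular sketch of the MBS-QI idea with a count-based filter, just to make the mechanics concrete; the paper's exact operator, thresholds, and function-approximation details differ.

import numpy as np

def mbs_q_iteration(transitions, counts, n_states, n_actions,
                    gamma=0.99, threshold=5, n_iters=100):
    """Q-iteration where Q-values of rarely observed (state, action) pairs
    are zeroed out by the filter eps(s, a) = 1[count(s, a) >= threshold]."""
    eps = (counts >= threshold).astype(float)      # counts: (n_states, n_actions)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        filtered_Q = eps * Q                       # only back up through trusted values
        targets = {}
        for (s, a, r, s_next) in transitions:
            targets.setdefault((s, a), []).append(r + gamma * filtered_Q[s_next].max())
        new_Q = np.zeros_like(Q)
        for (s, a), vals in targets.items():
            new_Q[s, a] = np.mean(vals)
        Q = new_Q
    return eps * Q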

Let’s recap how a VAE works, mainly based on my previous posts Stochastic Variational Inference and Optimization with discrete random variables, and a nice online post https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/. Suppose a VAE contains an encoder \mathcal{E} (encoding input X into a latent vector z) and a generator \mathcal{G} (reconstructing the input based on z). \mathcal{E}(z|X) and \mathcal{G}(X|z) denote the probability densities of the latent vector and the reconstructed input. Usually, \mathcal{E}(z|X) \sim \mathcal{N}(\mu, \sigma), where \mu and \sigma are outputs of the encoder. A VAE optimizes an objective function called the ELBO (Evidence Lower Bound):

ELBO(X) = \mathbb{E}_{z\sim\mathcal{E}(z|X)}\left[\log \mathcal{G}(X|z) \right] - D_{KL} \left[ \mathcal{E}(z|X) || P(z) \right] \leq \log\left[P(X)\right].

Now, if you just let your VAE learn on states s (as X) collected by a behavior policy \pi_b, then once this VAE is learned, you can compute the ELBO for any test state s_{test}, which can be interpreted as a lower bound on how likely s_{test} would appear under the same behavior policy \pi_b.
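
Here is a minimal PyTorch sketch of such a state-density VAE, assuming a unit-variance Gaussian decoder (so the reconstruction log-likelihood reduces to a squared error up to a constant); the class and head names are mine.

import torch
from torch import nn

class StateVAE(nn.Module):
    """Tiny VAE over states s; elbo(s) lower-bounds log p(s) under the data."""
    def __init__(self, state_dim, latent_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def elbo(self, s):
        h = self.enc(s)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.dec(z)
        log_px_z = -((s - recon) ** 2).sum(dim=1)              # Gaussian decoder, up to a constant
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)  # KL(E(z|X) || N(0, I))
        return log_px_z - kl                                   # per-sample ELBO(s)

# training: loss = -vae.elbo(s_batch).mean()
# scoring:  vae.elbo(s_test) estimates how likely s_test is under the behavior policy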

A conditional VAE [6] only requires small modifications to normal VAEs, where C denotes a condition vector:

ELBO(X|C) = \mathbb{E}_{z\sim\mathcal{E}(z|X, C)}\left[\log \mathcal{G}(X|z, C) \right] - D_{KL} \left[ \mathcal{E}(z|X, C) || P(z) \right] \leq \log\left[P(X|C)\right]

In a real implementation, we just need to concatenate C to the latent vector z and the input X, and keep the rest of the VAE training code unchanged.
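
Building on the StateVAE sketch above, the conditional variant only changes where C is concatenated:

class ConditionalVAE(StateVAE):
    """Same as StateVAE, but the condition C is concatenated to the encoder
    input and to the latent code before decoding."""
    def __init__(self, x_dim, cond_dim, latent_dim=8, hidden=64):
        super().__init__(x_dim + cond_dim, latent_dim, hidden)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def elbo(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.dec(torch.cat([z, c], dim=1))
        log_px_zc = -((x - recon) ** 2).sum(dim=1)
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)
        return log_px_zc - kl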

 

As I said, different batch RL methods can all be boiled down to how they extrapolate Q-values on unobserved or rarely observed (state, action) pairs. BEAR [8] is another example. While BCQ, MBS-PI, and MBS-QI constrain the learned policy to stay close to the behavior policy implicitly (i.e., they prevent Q-value back-ups on rarely observed state/action pairs), BEAR adds a more explicit constraint to its objective function:

where MMD is the maximum mean discrepancy, which measures the distance between two distributions using empirical samples:

MMD is illustrated very well in this post: https://stats.stackexchange.com/questions/276497/maximum-mean-discrepancy-distance-distribution. It is also used in InfoVAE to replace the KL divergence term in the loss function of a traditional VAE (recall that the ELBO objective consists of two terms: a reconstruction error and a KL divergence that encourages the latent code \mathcal{E}(z|X) to be close to the prior). As argued in https://ermongroup.github.io/blog/a-tutorial-on-mmd-variational-autoencoders/, the KL divergence pushes \mathcal{E}(z|X) toward the prior for every single input, which may result in a VAE that ends up learning an uninformative latent code, while an MMD-based VAE only encourages the latent distribution to match the prior in expectation.
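
For concreteness, here is a minimal empirical MMD^2 estimator with an RBF kernel; the kernel choice and bandwidth are illustrative (BEAR uses a similar kernelized estimator).

import torch

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed pairwise
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of MMD^2 between samples x ~ P and y ~ Q."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

# e.g., BEAR-style constraint: keep mmd2(policy_actions, logged_actions) below a threshold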

Overall, the algorithm’s pseudocode works as below:

  

References

[1] Critic Regularized Regression: https://arxiv.org/abs/2006.15134

[2] Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning: https://arxiv.org/abs/2002.08396

[3] Maximum a Posteriori Policy Optimisation: https://arxiv.org/abs/1806.06920

[4] Conservative Q-Learning for Offline Reinforcement Learning: https://arxiv.org/abs/2006.04779

[5] Provably Good Batch Reinforcement Learning Without Great Exploration: https://arxiv.org/abs/2007.08202

[6] Conditional Generative Adversarial Nets: https://arxiv.org/abs/1411.1784. Nice post: https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/

[7] Off-Policy Deep Reinforcement Learning without Exploration: https://arxiv.org/abs/1812.02900

[8] Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction: https://arxiv.org/abs/1906.00949

Some classical methodologies in applied products

I am reading two papers which use very classical methodologies for optimizing metrics in real-world applications.

The first is constrained optimization for ranking, from “The NodeHopper: Enabling Low Latency Ranking with Constraints via a Fast Dual Solver”. The paper performs per-slate constrained optimization:

Here, c_i is item i‘s primary metric value, r_i is item i‘s position after ranking r, and a[r_i] is the attention strength received by item i when it is placed at position r_i. Similarly, M is the constraint matrix. As you can see, they define the slate reward/constraint as the sum of per-item rewards/constraints, which may or may not hold for a given application. 

If there are n items, there are n! possible rankings, so optimizing Eqn. 1 directly is hard. They skillfully convert the problem into something more manageable. First, they rewrite Eqn. 1 as Eqn. 2:

R is a permutation matrix (each row and column has exactly one 1). 

Then, they relax R to a probabilistic matrix P, an “expected” permutation matrix. (In 3d there is a typo; it should be P=P_\alpha=\sum^{n!}_{j=1} \alpha_j R^j.) Here, \alpha is a distribution over all possible permutations.

Now, we can just optimize w.r.t. P, which has only n^2 entries:

Finally, Eqn. 4 can be solved by the Lagrangian method

The rest of the paper is quite involved and hard to follow, but it is still solving this same constrained optimization problem.
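
To make the relaxed problem concrete, here is a toy sketch that solves it with an off-the-shelf LP solver instead of the paper's fast dual solver. The exact constraint structure (per-metric constraints of the form sum_{i,k} P[i,k] M[j,i] a[k] >= b[j]) is my assumption based on the description above.

import numpy as np
from scipy.optimize import linprog

def solve_expected_permutation(c, a, M, b):
    """Maximize sum_{i,k} P[i,k] * c[i] * a[k] over doubly stochastic P,
    subject to sum_{i,k} P[i,k] * M[j,i] * a[k] >= b[j] for each constraint j."""
    n = len(c)
    obj = -np.outer(c, a).ravel()              # linprog minimizes, so negate

    # doubly stochastic: every row and every column of P sums to 1
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros((n, n)); row[i, :] = 1.0
        col = np.zeros((n, n)); col[:, i] = 1.0
        A_eq += [row.ravel(), col.ravel()]
        b_eq += [1.0, 1.0]

    # ">= b[j]" written as "-lhs <= -b[j]" for linprog
    A_ub = [-np.outer(M[j], a).ravel() for j in range(len(b))]

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=-np.array(b),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0.0, 1.0), method="highs")
    return res.x.reshape(n, n)                 # the relaxed "expected permutation" P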

The second paper I read is “Automated Creative Optimization for E-Commerce Advertising” (https://arxiv.org/abs/2103.00436). The background of this paper is that in online advertising, each candidate ad is a combination of creative elements, such as templates, fonts, and backgrounds.

An ad’s CTR can be naturally predicted using a Factorization Machine (Eqn. 3 in the paper): 

The explanation for Eqn. 3 is that there are L elements in an ad x_c. e_i and e_j are the embeddings of a pair of elements, and the interaction score between the two embeddings can be computed using one of the operators Concat, Multiply, Plus, Max, or Min (mentioned in Section 4.2).

The problem the paper tries to solve is: when the system has many ad candidates, how can it pick the candidate believed to have the highest CTR while still exploring the element space? They use Thompson Sampling with a Bayesian contextual bandit. The Bayesian part comes from the fact that all embeddings (e_i, e_j, …) are modeled as Gaussian random variables. For each incoming request, they sample embedding values from the current posterior, pick the best ad, observe the reward, and then update the posterior distribution. 

How do we update the embedding estimates \Theta \sim \mathcal{N}(\mu, \Sigma)? We use stochastic variational inference (https://czxttkl.com/2019/05/04/stochastic-variational-inference/). We can optimize the ELBO with gradient-based methods; it contains only a likelihood term (given a sampled \Theta, how likely is the current dataset?) and a KL divergence between the current Gaussian posterior and a Gaussian prior, both of which have analytical expressions. 
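
Here is a minimal PyTorch sketch of that update, with a mean-field Gaussian posterior over the element embeddings; ctr_model and the batch keys are hypothetical placeholders for the paper's FM-style predictor and data format.

import torch
import torch.nn.functional as F

class BayesianEmbeddings(torch.nn.Module):
    """Mean-field Gaussian posterior over the element embeddings Theta."""
    def __init__(self, num_elements, dim):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(num_elements, dim))
        self.log_sigma = torch.nn.Parameter(torch.zeros(num_elements, dim))

    def sample(self):
        # reparameterized sample Theta ~ N(mu, sigma^2)
        return self.mu + torch.randn_like(self.mu) * self.log_sigma.exp()

    def kl_to_prior(self):
        # KL( N(mu, sigma^2) || N(0, I) ), closed form, summed over all entries
        var = (2 * self.log_sigma).exp()
        return 0.5 * (self.mu ** 2 + var - 1.0 - 2 * self.log_sigma).sum()

def svi_step(posterior, ctr_model, batch, optimizer):
    """One gradient step on the negative ELBO = -log-likelihood + KL."""
    theta = posterior.sample()
    logits = ctr_model(theta, batch["ad_elements"])   # predicted CTR logits (hypothetical interface)
    nll = F.binary_cross_entropy_with_logits(logits, batch["clicks"], reduction="sum")
    loss = nll + posterior.kl_to_prior()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Thompson sampling: per request, draw theta = posterior.sample(), score all
# candidate ads with ctr_model(theta, ...), and show the argmax.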

This paper is a classical example of stochastic variational inference and could be applied to many real-world problems. 

Reward/return decomposition

In reinforcement learning (RL), it is common that a task only reveals rewards sparsely, e.g., at the end of an episode. This prevents RL algorithms from learning efficiently, especially when the task horizon is long. There has been some research on how to distribute sparse rewards to more preceding steps.

One simple but interesting work is [1]. Its loss is defined as:
The loss can be understood as follows: each step's reward r_t is composed of a contribution from the current state, b(s_t), and contributions from all previous states, c(s_k), \forall k=0,\cdots,t-1, gated by g(s_t).
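
Based purely on the description above (the paper's exact parameterization may differ), a minimal sketch of this per-step regression could look like the following; b_net, c_net, and g_net are hypothetical networks.

import torch

def synthetic_return_loss(b_net, c_net, g_net, states, rewards):
    """Regress each step reward r_t onto b(s_t) plus a gated sum of
    contributions c(s_k) from earlier states."""
    c = c_net(states).squeeze(-1)                         # c(s_k), shape (T,)
    b = b_net(states).squeeze(-1)                         # b(s_t), shape (T,)
    g = torch.sigmoid(g_net(states)).squeeze(-1)          # gate g(s_t) in [0, 1]

    # prefix[t] = sum_{k < t} c(s_k)
    prefix = torch.cat([torch.zeros(1, device=c.device), torch.cumsum(c, dim=0)[:-1]])
    pred = b + g * prefix
    return ((rewards - pred) ** 2).mean()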

[7] uses Transformers (or any attention-based sequence model) to decompose the episode reward into step rewards. The loss function is:
You can think of \alpha in the subscripts of s and a as 0,\cdots,t, i.e., the Transformer takes into account all steps up to the current one when predicting the current step's reward. The differences from [1] are: (a) [7] uses previous steps' information more effectively via a sequence model; (b) [7] only decomposes the total episode reward R(\tau), while [1] decomposes every step's reward.

Another research is Hindsight Experience Replay (HER) [4], in which we create spurious but easier-to-achieve rewards. This repository has a clear implementation [5]. 

HER is best explained using an example, a bit-flipping environment, as described below:
As you can see, this environment has 2^n states and only one state gives a positive reward. It would be almost infeasible for a normal RL model to explore and learn effectively. What confused me at first (but is clear to me now) is that the state fed to the RL model has two parts: the current n bits and the target n bits. This is manifested in the agent’s learn function in [5].

The crux of HER is as follows. Even when an episode terminates and fails, we augment the replay buffer with some new tuples, pretending that the last state we arrived at was the target goal.

Because these intermediate, easier goals are attached to the state, the RL model has more non-sparse experience to learn from on its way to the goal specified in its state.
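
Here is a minimal sketch of the hindsight relabeling step, using the "future" goal-sampling strategy; replay_buffer.add and reward_fn are hypothetical interfaces.

import numpy as np

def her_augment(replay_buffer, episode, reward_fn, k=4):
    """Store the original tuples, plus copies whose goal is relabeled to a
    state that was actually reached later in the same episode."""
    T = len(episode)
    for t, (state, action, reward, next_state, goal) in enumerate(episode):
        replay_buffer.add(state, action, reward, next_state, goal)
        for _ in range(k):
            future = np.random.randint(t, T)              # "future" goal-sampling strategy
            new_goal = episode[future][3]                 # a next_state we actually reached
            new_reward = reward_fn(next_state, new_goal)  # e.g., 0 if achieved, -1 otherwise
            replay_buffer.add(state, action, new_reward, next_state, new_goal)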

Finally, there is work called Temporal Value Transport [8], with a blog [9] explaining it. I think this work is very heuristic-driven. Suppose one episode has T steps; they use a sequence model with an attention mechanism to predict each pair of steps (i,j)‘s influence on reward (i<j). Specifically, the sequence model is a classifier that outputs “the logit predicting whether the rewards-to-go from a given observation are below or above a threshold” (from [9]). 

Then, the reward is re-distributed by:

 

Below are two works that I do not fully understand.

The first is [6]. It converts the normal policy gradient update:

to:

where \Phi_t contains the future part of the trajectory after time step t. The intuition is that in off-policy learning, we can utilize the steps after the current step to make better value estimations. However, I do not understand how \Phi_t is computed (the paper says it comes from an RNN-based classifier) and what \mathbb{P}(A_t|X_t, \Phi_t) represents.

The second is RUDDER [2], with a web-view blog [3] explaining it. However, I still do not get how the LSTM in RUDDER is used to distribute rewards. [1]’s related work sheds some light on how RUDDER works, though:

 

References

[1] Synthetic Returns for Long-Term Credit Assignment: https://arxiv.org/pdf/2102.12425.pdf

[2] RUDDER: Return Decomposition for Delayed Rewards https://arxiv.org/pdf/1806.07857.pdf

[3] https://ml-jku.github.io/rudder/

[4] Hindsight Experience Replay: https://arxiv.org/pdf/1707.01495.pdf

[5] https://github.com/hemilpanchiwala/Hindsight-Experience-Replay

[6] COUNTERFACTUAL CREDIT ASSIGNMENT IN MODEL-FREE REINFORCEMENT LEARNING: https://arxiv.org/pdf/2011.09464.pdf

[7] Sequence modeling of temporal credit assignment for episodic reinforcement learning: https://arxiv.org/pdf/1905.13420.pdf

[8] Optimizing agent behavior over long time scales by transporting value: https://www.nature.com/articles/s41467-019-13073-w.pdf

[9] https://www.efavdb.com/ltca

Self-Supervised Learning Tricks

I am reading some self-supervised learning papers. Some of them have interesting tricks to create self-supervised learning signals. This post is dedicated to those tricks.

The first paper I read is SwAV (Swapping Assignments between multiple Views of the same image) [1]. The high-level idea is that we create K clusters with cluster centers \{c_1, \dots, c_K\}. These cluster centers are trainable parameters and are shared across every training batch. In each batch, we distort (augment) each image into two views. Each view is mapped to an embedding z, and the batch of embeddings is assigned equally across the K clusters. As you might guess, the two distorted views of the same original image should belong to the same cluster.

The clustering-based method has an advantage over contrastive learning directly on image features because the former operates on a smaller input space when doing pairwise comparisons:

The interesting trick in SwAV is to partition images equally to the K clusters. If we denote C^T Z to be the image-cluster center similarity matrix, then the problem is converted to finding Q such that:

The constraints enforce that on average each cluster is associated with \text{batch size} / K data points. As illustrated in [1], they find the continuous code Q^* using the iterative Sinkhorn-Knopp algorithm.
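
Here is a minimal sketch of the Sinkhorn-Knopp normalization, closely following the procedure described in the SwAV paper; the temperature eps and the number of iterations are typical values, not prescriptions.

import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a (batch x K) similarity matrix Z^T C into soft assignments Q whose
    columns use the K clusters equally, via alternating row/column normalization."""
    Q = torch.exp(scores / eps).t()          # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # normalize each cluster row ...
        Q /= K                               # ... so each cluster holds B/K mass
        Q /= Q.sum(dim=0, keepdim=True)      # normalize each sample column
        Q /= B
    return (Q * B).t()                       # B x K, each row is a soft assignment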

SwAV has been used to train on 1 billion random Instagram images. The resulting models (called SEER) achieve SOTA top-1 accuracy on ImageNet after fine-tuning [2].

References

[1] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments: https://arxiv.org/pdf/2006.09882.pdf 

[2] Self-supervised Pretraining of Visual Features in the Wild: https://arxiv.org/abs/2103.01988

 

Run a specific parent’s method from a child class

This is an example of how to run a specific parent’s method from a child class in Python.

class A(object):
    def foo(self):
        print('A.foo()')
        self.run()

    def run(self):
        print("A run")


class B(object):
    def foo(self):
        print('B.foo()')
        self.run()

    def run(self):
        print("B run")


class C(A, B):
    def foo(self):
        print('C.foo()')
        A.foo(self)
        B.foo(self)

c = C()
c.foo()

 

Results:

C.foo()
A.foo()
A run
B.foo()
A run

Note that “A run” is printed both times: self.run() is resolved through C’s MRO (C → A → B), so A.run is always the one that gets called, even from inside B.foo.

PyTorch Lightning template

Back in the old days, I studied how to implement highly efficient PyTorch pipelines for multi-GPU training [1]. DistributedDataParallel is the way to go, but it is cumbersome that we need boilerplate for spawning workers and constructing data readers.

Now, PyTorch Lightning offers a clean API for setting up multi-GPU training easily. Here is a template I designed, which I will stick to for prototyping models for the rest of my life : )

import os
from typing import List, Any

from dataclasses import dataclass
import torch
import torch.distributed as dist
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, IterableDataset
import pytorch_lightning as pl


TOTAL_NUM_BATCHES = 320
BATCH_SIZE = 32
STATE_DIM = 5
NUM_GPUS = 2
NUM_WORKERS = 2
UPDATE_FREQ = 1
WEIGHTS = torch.tensor([2.0, 3.1, 2.1, -1.5, -1.7])

@dataclass
class PolicyGradientInput:
    pid: int
    state: torch.Tensor
    reward: torch.Tensor

    def __len__(self):
        return self.state.shape[0]

    @classmethod
    def from_batch(cls, x):
        return cls(
            pid=x[0][0].item(),
            state=x[1][0].reshape(BATCH_SIZE, STATE_DIM),
            reward=x[2][0].reshape(BATCH_SIZE, 1),
        )

class EpisodicDataset(IterableDataset):
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        pid = os.getpid()
        if worker_info is None:
            total_num_batches = TOTAL_NUM_BATCHES // NUM_GPUS
            worker_id = -1
        else:
            total_workers = worker_info.num_workers
            total_num_batches = int(TOTAL_NUM_BATCHES // NUM_GPUS // total_workers)
            worker_id = worker_info.id
        # You will see that we have an EpisodicDataset on each of NUM_GPUS processes
        print(f"{worker_info}, pid={pid}, total_num_batches={total_num_batches}")

        for _ in range(total_num_batches):
            state = torch.randn(BATCH_SIZE, STATE_DIM)
            reward = torch.sum(state * WEIGHTS, dim=1)
            yield (pid, state, reward)


class PPO(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1)
        )
        self.traj_buffer = []
        self.update_freq = UPDATE_FREQ
        self.step = 0

    def training_step(self, batch: List[Any], batch_idx):
        batch: PolicyGradientInput = PolicyGradientInput.from_batch(batch)
        self.traj_buffer.append(batch)
        self.step += 1
        rank = dist.get_rank()
        # use the first three trajectories' pids as a signature to quickly check if all trainers share the same data
        # the answer is each trainer maintains different trajectories
        traj_buffer_signature = ','.join([str(traj.pid) for traj in self.traj_buffer[:3]])
        print(f"rank={rank}, traj_buffer_len={len(self.traj_buffer)}, step={self.step}, signature={traj_buffer_signature}")
        if self.step % self.update_freq == 0:
            model_params = list(self.model.parameters())[0][0].detach().cpu().numpy()
            print(f"before {self.step} step training: rank={rank}, model_params={model_params}")
            return self.update_model()

    def configure_optimizers(self):
        # Somehow, Adam doesn't work for linear regression
        # optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        optimizer = torch.optim.SGD(self.parameters(), lr=1e-2)
        return optimizer

    def update_model(self):
        traj = self.traj_buffer[-1]
        loss = F.mse_loss(self.model(traj.state), traj.reward)
        rank = dist.get_rank()
        print(f"rank={rank}, step={self.step}, loss={loss}")
        return loss


ds = EpisodicDataset()
dl = DataLoader(ds, batch_size=1, num_workers=NUM_WORKERS, pin_memory=True)
ppo = PPO()
trainer = pl.Trainer(gpus=NUM_GPUS, max_epochs=1, progress_bar_refresh_rate=1, accelerator='ddp')
trainer.fit(ppo, dl)

As you can see, you only need to define a Dataset, a DataLoader with an appropriate NUM_WORKERS, and a pytorch-lightning Trainer in which you specify the number of GPUs. Each of the NUM_GPUS GPUs will then use NUM_WORKERS processes for reading data and one main process for training the model.

The example shows that each trainer on a GPU maintains its own list of trajectories, which is not shared with the trainers on other GPUs. However, model parameters stay in sync across GPUs because gradients are all-reduced during every backward pass; the underlying mechanism is still DistributedDataParallel.

 

References

[1] https://czxttkl.com/2020/10/03/analyze-distributeddataparallels-behavior/ 

Precision Recall Curve vs. ROC curve

While the ROC (receiver operating characteristic) curve is ubiquitous in model reporting, the Precision-Recall curve is reported less often. However, the latter is especially useful when we have imbalanced data. Let’s review the pertinent concepts.

True Positive = TP = you predict positive and the actual label is positive

False Positive = FP = you predict positive but the actual label is negative

True Negative = TN = you predict negative and the actual label is negative

False Negative = FN = you predict negative and the actual label is positive

TPR = True Positive Rate = Sensitivity = Recall = TP / (TP + FN)

FPR = False Positive Rate = FP / (FP + TN). Among all negative samples (FP + TN), how many are erroneously identified as positive (FP). FPR will increase if you blindly lower the decision threshold so that more samples are predicted positive; in the extreme case, you just predict every sample as positive.    

Precision = TP / (TP + FP). Among all the predicted positive samples (TP + FP), how many of them actually have positive labels. 

For an ROC curve, the x-axis is FPR and the y-axis is TPR.

For a Precision-Recall curve, the x-axis is Recall and the y-axis is Precision.

Now, let’s look at an example with imbalanced data, in which case the Precision-Recall curve is more informative than the ROC curve (https://www.kaggle.com/general/7517). 

Suppose there are 1 million documents, among which 100 are relevant, and we have two methods to retrieve them. At some decision threshold for method 1 and some other decision threshold for method 2, we have:

  • Method 1 (@some decision threshold): 100 retrieved documents, 90 relevant
    • TP=90, FP=10, TN=1M-100-10, FN=10
    • FPR = FP / (FP + TN) = 10 / (1M – 100) =0.000010001, TPR = TP / (TP + FN) = 90 / (90 + 10) = 0.9, 
    • Recall = TPR = 0.9, Precision = TP / (TP + FP) = 90 / (90 + 10) = 0.9
  • Method 2 (@some decision threshold): 2000 retrieved documents, 90 relevant
    • TP=90, FP=1910, TN=1M-2000-10, FN=10
    • FPR = FP / (FP + TN) = 1910 / (1910 + 1M – 2000 – 10) = 0.00191019101, TPR = TP / (TP + FN) = 90 / (90 + 10) = 0.9
    • Recall = TPR = 0.9, Precision = TP / (TP + FP) = 90 / (90 + 1910) = 0.045

Note that this calculation gives just one point on each ROC / Precision-Recall curve. We can already see that the two ROC curves would not look very different (at the same TPR, FPR is 0.000010001 vs. 0.00191019101), but the PR curves would differ much more, since at the same recall the precisions are 0.9 vs. 0.045.
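
A quick script to double-check the arithmetic above:

def rates(tp, fp, tn, fn):
    return {"TPR/Recall": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "Precision": tp / (tp + fp)}

print(rates(tp=90, fp=10, tn=1_000_000 - 110, fn=10))     # method 1
print(rates(tp=90, fp=1910, tn=1_000_000 - 2010, fn=10))  # method 2
# FPR barely moves (1e-5 vs. 1.9e-3) while precision drops from 0.9 to 0.045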

 

Deep Learning-based Sorting

Here, I am talking about a few techniques for using deep neural networks to accomplish sorting/ranking tasks.

Reinforcement Learning – policy gradient paradigm

Using policy gradient to solve combinatorial optimization problems such as the Traveling Salesman Problem is not new. Ranking K out of N candidates is also a combinatorial optimization problem and thus can be solved by policy gradient. The only question remaining is how you parameterize the ranking policy. You can parameterize it as a sequence model (considering item interactions) [2] or as a Plackett-Luce distribution (if an optimal pointwise relevance score is assumed to exist). Additionally, as we discussed in [1], you can treat the ranking reward as non-differentiable (such as logged impressions, clicks, etc.) or as differentiable (some loss similar to NDCG but differentiable). 
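
As a concrete example of the Plackett-Luce parameterization with a non-differentiable reward, here is a minimal REINFORCE-style sketch; the per-item scores would normally come from a neural network, and the reward here is a placeholder.

import torch

def sample_ranking(scores):
    """Sample a full ranking from a Plackett-Luce policy defined by per-item
    scores; return the permutation and its log-probability."""
    n = scores.shape[0]
    remaining = torch.arange(n)
    perm, log_prob = [], 0.0
    for _ in range(n):
        probs = torch.softmax(scores[remaining], dim=0)
        idx = torch.multinomial(probs, 1).item()
        log_prob = log_prob + torch.log(probs[idx])
        perm.append(remaining[idx].item())
        remaining = torch.cat([remaining[:idx], remaining[idx + 1:]])
    return perm, log_prob

scores = torch.randn(5, requires_grad=True)   # per-item scores (normally from a network)
perm, log_prob = sample_ranking(scores)
reward = 1.0                                  # placeholder non-differentiable slate reward
loss = -reward * log_prob                     # REINFORCE: ascend reward * grad log pi
loss.backward()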

soft rank 

References

[1] https://czxttkl.com/2020/02/18/practical-considerations-of-off-policy-policy-gradient/

[2] https://arxiv.org/abs/1810.02019

GAN (Generative Adversarial Network)

Here, I am taking some notes down while following the GAN online course (https://www.deeplearning.ai/generative-adversarial-networks-specialization/).

The first thing I want to point out is that one should be very careful about the computation graph during the training of GANs. To maximize efficiency, in one iteration we can call the generator only once and use the generator output to train both the discriminator and the generator itself. While training the discriminator, we need to call generator_output.detach(), so that after the discriminator loss is backward-ed, the generator’s part of the computation graph is still kept for training the generator. Another option is to call `discriminator_loss.backward(retain_graph=True)` to keep the graph. You can see both methods in https://github.com/pytorch/examples/blob/a60bd4e261afc091004ea3cf582d0ad3b2e01259/dcgan/main.py#L230 and https://zhuanlan.zhihu.com/p/43843694.

Next, we move to DCGAN (Deep Convolutional GAN) [1]. It is interesting to see that a generator/discriminator equipped with image-processing architectures like deconvolution/convolution layers leads to more realistic generated images. The deconvolutional (transposed convolution) layer is something I was not familiar with before. To understand it, we need to think of the convolution/deconvolution operation as matrix multiplication [2]. Suppose you have an input image (4×4) and a kernel (3×3), which is expected to generate a 2×2 output:

Equivalently, we can create a convolution matrix (4 x 16), flatten the input to (16 x 1), multiply the two to get a (4 x 1) output, and reshape it to (2 x 2):


 

The deconvolution can be seen as the reverse of the convolution process: a 16×4 deconvolution matrix multiplies the flattened convolution output (2 x 2 → 4 x 1), producing a 16 x 1 vector that is reshaped to 4 x 4. 
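
A quick shape check in PyTorch makes the correspondence concrete (the transposed convolution recovers the spatial size, not the original values):

import torch
from torch import nn

x = torch.randn(1, 1, 4, 4)                       # a 4x4 single-channel "image"
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)
y = conv(x)
print(y.shape)                                    # torch.Size([1, 1, 2, 2])

deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, bias=False)
print(deconv(y).shape)                            # torch.Size([1, 1, 4, 4])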

 

The next topic is Wasserstein GAN [3]. The motivation behind it is that the original GAN’s generator loss (\min_g \left[\mathbb{E}\left(\log d(x)\right) +\mathbb{E}\left(\log(1 - d(g(z))) \right) \right]) provides little useful gradient when the discriminator becomes perfect at separating real and generated outputs. (There is much theory to dig into, but [4] gives a very good explanation of how it helps prevent mode collapse and vanishing gradients.) The Wasserstein loss approximates a better distance function than the BCE loss, the Earth Mover’s Distance, which provides useful gradients even when the real and generated distributions are already separable.

However, when training GANs using the W-loss, the critic has a special condition: it has to be 1-Lipschitz continuous. Conceptually, it is easy to check whether a function is 1-Lipschitz continuous: you check that the function never grows or drops with slope greater than 1 at any point:

It is interesting to see how 1-Lipschitz continuity (i.e., the gradient norm of the discriminator/critic is no larger than 1) is soft-enforced by adding a penalty term.

Instead of checking that the critic is 1-Lipschitz continuous for every input, we sample an intermediate image by interpolating between a real and a fake image, and then encourage the gradient norm at the intermediate image to stay within 1.
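
A minimal sketch of this gradient penalty (the critic interface and image shapes are assumed):

import torch

def gradient_penalty(critic, real, fake):
    """Penalize the critic when its gradient norm at points interpolated
    between real and fake images deviates from 1."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(outputs=scores, inputs=mixed,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()

# critic_loss = critic(fake).mean() - critic(real).mean() + lambda_gp * gradient_penalty(...)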

  

It is also intriguing to see how we can control GANs to generate images containing specific features. The first method is to utilize a pre-trained classifier, which scores how much a desired feature is present in a generated image.

A slightly improved method is disentanglement, which makes sure other features do not change simultaneously. 

The second method is to augment the generator and discriminator’s input with a one-hot class vector. So you can ask the generator what class to generate, and you can ask the discriminator how relevant a generated image is to a given class, as shown below. Instead of requiring a pre-trained classifier as in the first method, the second method requires knowing the labels of the training data. 

The third method is InfoGAN [5], which requires neither pre-trained models nor data labels. InfoGAN tries to learn latent codes that can control the output of the generator. The generator G takes as input a noise vector z (as in the original GAN) and, additionally, a latent code c that learns semantic categories by unsupervised learning. The clever idea is that the mutual information between G(z, c) and c, I(c;G(z,c)), should be maximized. In other words, G(z,c) should reveal as much information as possible about c, which means we force the generator to generate images that are as relevant to c as possible. A similar idea is mentioned in our previous post when talking about the InfoNCE loss [6]. This notebook (C1W4_(Optional_Notebook)_InfoGAN) walks through the implementation of InfoGAN.

The lecture then moves on to evaluation metrics for GANs. One metric is based on the Fréchet distance. We use a pre-trained image model such as Inception-v3 to process images and obtain embeddings, usually the output of one hidden layer. If we process N real images and N fake images, we get N real embeddings and N fake embeddings. We can quantitatively compare the two embedding distributions by the Fréchet distance, which has the following form:
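
A minimal sketch of the Fréchet distance between two sets of embeddings (e.g., Inception-v3 features), assuming each set is an (N x d) NumPy array:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2)) for Gaussians
    fitted to the two embedding sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):              # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(sigma_r + sigma_f - 2 * covmean))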

Another evaluation metric, from [7], is called “Learned Perceptual Image Patch Similarity” (LPIPS). It also extracts image features from L layers of a pre-trained network. As shown in Eqn. 1 below, the distance between two images is a weighted sum of per-layer similarities. There is a learnable parameter w_l in the per-layer similarity, which is trained to differentiate more similar image pairs from less similar ones. Eventually, LPIPS can be used to measure the distance between real and fake images in a GAN.

Finally, the lecture discusses one state-of-the-art GAN called StyleGAN [8]. There are three main components of StyleGAN: (1) progressive growing, (2) noise mapping network, and (3) adaptive instance normalization. 

StyleGAN supports two ways of varying style. The first is style mixing, which feeds different w vectors to different layers of the generator:

The second way to vary style is to sample noise and add it between adaptive instance normalization operations.

 

 

References

[1] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

[2] Up-sampling with Transposed Convolution

[3] https://arxiv.org/pdf/1701.07875.pdf

[4] https://zhuanlan.zhihu.com/p/25071913

[5] InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

[6] Noise-Contrastive Estimation https://czxttkl.com/2020/07/30/noise-contrastive-estimation/

[7] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric https://arxiv.org/pdf/1801.03924.pdf

[8] A Style-Based Generator Architecture for Generative Adversarial Networks https://arxiv.org/abs/1812.04948

Projected Gradient Descent

I am reading “Determinantal Point Processes for Machine Learning”, which uses projected gradient descent in Eqn. 212. More broadly, such problems have this general form:

where we want to map \vec{y} to \vec{x} on the simplex. Since we often encounter problems with a sum-to-1 constraint, I think it is worth listing the solution in this post.

The algorithm is simple, as detailed in https://eng.ucmerced.edu/people/wwang5/papers/SimplexProj.pdf:

We omit the proof, but I do appreciate the elegance of this algorithm.
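
A minimal NumPy implementation of that projection, following the sort-and-threshold algorithm from the reference above:

import numpy as np

def project_to_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(y)[::-1]                                   # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)                   # shift so the result sums to 1
    return np.maximum(y + lam, 0.0)

print(project_to_simplex(np.array([0.4, 0.3, 0.6])))       # [0.3 0.2 0.5]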