Check object memory in Python

We can use Pympler (https://pympler.readthedocs.io/en/latest/) to inspect an object’s memory usage in Python. It measures an object’s size including the objects it references.

Here is an example:

>>> from pympler import asizeof 
>>> obj = [1, 2, (3, 4), 'text'] 
>>> asizeof.asizeof(obj) 
176 
>>> print(asizeof.asized(obj, detail=1).format()) 
[1, 2, (3, 4), 'text'] size=176 flat=48
    (3, 4) size=64 flat=32
    'text' size=32 flat=32
    1 size=16 flat=16
    2 size=16 flat=16

Normalizing Flows

Some updates before we dive into today’s topic: I have not updated this blog for about two months, which is considered a long time : ). This is because I have picked up more tech lead work, setting up the team’s planning. I sincerely hope that our team will steer in a good direction from 2022 and beyond.

Now, let’s go to today’s topic: normalizing flows. Normalizing flows are a type of generative model. Generative models can be best summarized using the following objective function:

\text{maximize}_{\theta} \quad \mathbb{E}_{x \sim P_{data}}\left[ P_{\theta}(x)\right],
i.e., we want to find a model \theta that assigns the highest likelihood to data drawn from the data distribution, so as to “approximate” the data distribution.

According to [1] (starting at 11:00), we can categorize generative models as:

  1. Explicit with tractable density, i.e., P_\theta(x) has an analytical form. Examples include Normalizing Flows, PixelCNN, PixelRNN, and WaveNet. The latter three are autoregressive models which generate elements (e.g., pixels, audio samples) sequentially but are computationally heavy due to their autoregressive nature. Normalizing Flows can generate a whole image in one pass and thus have a computational advantage. (For more pros/cons of Normalizing Flows, please refer to [4].)
  2. Explicit with approximate density, i.e., we only optimize some bound of P_\theta(x). One example is the Variational Encoder-Decoder [2], in which we optimize the ELBO.
  3. Implicit with no modeling of density. One example is the Generative Adversarial Network (GAN), where we generate images from a generator network fed with a noise input.

Normalizing Flows are based on the “change of variables” theorem [3]. Suppose Z and X are random variables, both of dimension n. Also suppose there is a mapping f: \mathbb{R}^n \rightarrow \mathbb{R}^n such that Z=f(X) and X=f^{-1}(Z). Then the density functions of Z and X have the following relationship:

p_X(x) =p_Z(f(x))\left| det\left( \frac{\partial f(x)}{\partial x}\right) \right|=p_Z(f(x))\left| det \left( \frac{\partial f^{-1}(z)}{\partial z }\right) \right|^{-1},
where the last equality follows from the inverse function theorem and det(A^{-1})=det(A)^{-1}. From now on, we assume that X denotes data while Z denotes a random variable in a latent space.

In Normalizing Flows, the mapping is parameterized by \theta (i.e., f \rightarrow f_\theta and f^{-1} \rightarrow f^{-1}_\theta), which is what we try to learn. The objective function for learning \theta becomes:

\text{maximize}_\theta \quad p_X(x|\theta) = \text{maximize}_\theta\quad p_Z(f_\theta(x))\left| det \left( \frac{\partial f_\theta(x)}{\partial x}\right) \right|

As you can see, our goal is to learn to map the complex data distribution X into a simpler latent distribution Z (usually Z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})). The reason we want Z to follow a simple distribution is twofold: first, we need to compute p_Z(f_\theta(x)) in the objective function, so ideally p_Z(\cdot) should be easy to evaluate; second, once we learn \theta, we know the mappings f_\theta and f^{-1}_\theta, and then we can easily sample z \sim Z and apply f^{-1}_\theta(z) to generate new data (e.g., new images). The requirements for a valid and practical f_\theta are: (1) it is invertible, i.e., f^{-1}_\theta exists; (2) \left| det\left( \frac{\partial f_\theta(x)}{\partial x}\right) \right| or \left| det \left( \frac{\partial f^{-1}_\theta(z)}{\partial z }\right) \right| is efficient to compute.

One nice property of Normalizing Flows is that you can chain multiple transformations to form a new transformation. Suppose f(x) = f_L \circ \cdots \circ f_1(x) with each f_i having a tractable inverse and a tractable Jacobian determinant. Then:

p_X(x) =p_Z(f(x))\prod\limits_{i=1}^L\left|det\left( \frac{\partial f_i}{\partial (f_{i-1}\circ\cdots \circ f_0(x))}\right) \right|, where f_0(x)=x

In practice, we usually pick p_Z = \mathcal{N}(\mathbf{0}, \mathbf{I}) and optimize the log likelihood. Therefore, our objective becomes (as can be seen in Eqn. 1 of the Flow++ paper, one SOTA flow model [5]):

\text{maximize}_\theta \quad \log p_X(x|\theta) \newline= \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right| \newline=\log \mathcal{N}(f_\theta(x); \mathbf{0}, \mathbf{I}) + \sum\limits_{i=1}^L \log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|

[6] provides a good tutorial for getting started with Normalizing Flows, while [7] has a more in-depth explanation that helps a lot in understanding more advanced implementations such as Flow++ [8]. For now, I am going to introduce [6].

[6] introduces a flow called Planar Flow. It has a relatively straightforward transformation (linear + activation), f(z) = z + u \, h(w^T z + b), and an easy-to-compute Jacobian determinant, \left|det \frac{\partial f}{\partial z}\right| = \left|1 + u^T \psi(z)\right|, where \psi(z) = h'(w^T z + b) w and h is a smooth activation such as tanh.

Planar Flow is defined in Python as below:
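Here is a minimal PyTorch sketch of a Planar Flow, assuming a tanh activation (the exact code in [6] differs in details such as initialization and the invertibility constraint on u):

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    # f(z) = z + u * tanh(w^T z + b)
    # log|det df/dz| = log|1 + u^T psi(z)|, psi(z) = (1 - tanh^2(w^T z + b)) * w
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.w = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w.t() + self.b                  # (batch, 1)
        f_z = z + self.u * torch.tanh(lin)             # (batch, dim)
        psi = (1 - torch.tanh(lin) ** 2) * self.w      # (batch, dim)
        log_det = torch.log((1 + psi @ self.u.t()).abs() + 1e-8)
        return f_z, log_det.squeeze(-1)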

Once it is defined, we can instantiate an arbitrary Planar Flow and see how it transforms from a 2D Normal distribution:
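Continuing the sketch above (the plotting part is left out):

flow = PlanarFlow(dim=2)
z = torch.randn(512, 2)      # samples from a 2D standard normal
x, log_det = flow(z)         # transformed samples and per-sample log|det J|
# scatter-plotting x against z shows how the flow warps the plane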

Now, suppose we want to learn what is the Planar Flow that could transform a 2D normal distribution to a target distribution defined as:
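The actual density_ring in [6] is more elaborate; the unnormalized ring-shaped density below is an assumption for illustration:

def density_ring(z):
    # unnormalized density concentrated on a ring of radius 2
    r = torch.norm(z, dim=-1)
    return torch.exp(-0.5 * ((r - 2.0) / 0.3) ** 2)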

The objective function, as we already introduced above, is \text{maximize}_\theta \quad \log p_X(x|\theta) = \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|. Here, x denotes samples from a 2D normal distribution, and p_Z(\cdot) is the target density_ring distribution. Therefore, the loss function (to be minimized) is defined as:
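A minimal sketch of that loss, where the flow maps base samples x to f_\theta(x) and density_ring plays the role of p_Z (the small epsilon guarding the log is an implementation detail):

def loss_fn(target_density, transformed, log_det):
    # minimize the negative of: log p_Z(f_theta(x)) + sum_i log|det J_i|
    return -(torch.log(target_density(transformed) + 1e-9) + log_det).mean()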

The overall training loop is:
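A hedged sketch of such a loop (in [6], several planar flows are stacked; a single flow is shown here for brevity):

flow = PlanarFlow(dim=2)
opt = torch.optim.Adam(flow.parameters(), lr=1e-2)
for step in range(5000):
    x = torch.randn(256, 2)                 # samples from the base 2D normal
    z, log_det = flow(x)
    loss = loss_fn(density_ring, z, log_det)
    opt.zero_grad()
    loss.backward()
    opt.step()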

 

Lastly, I highly recommend watching this ECCV tutorial: https://www.youtube.com/watch?v=u3vVyFVU_lI

 

TODO:

dequantization:

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html#Dequantization

https://arxiv.org/pdf/1511.01844.pdf


References

[1] PixelCNN, Wavenet & Variational Autoencoders – Santiago Pascual – UPC 2017 https://www.youtube.com/watch?v=FeJT8ejgsL0

[2] Optimization with discrete random variables: https://czxttkl.com/2020/04/06/optimization-with-discrete-random-variables/

[3] Normalizing Flows note: https://deepgenerativemodels.github.io/notes/flow/

[4] Introduction to Normalizing Flows: https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

[5] Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design: https://arxiv.org/abs/1902.00275

[6] Normalizing Flows in PyTorch: https://github.com/acids-ircam/pytorch_flows/blob/master/flows_01.ipynb

[7] UvA DL Notebook: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html

[8] Flow++ Github implementation: https://github.com/chrischute/flowplusplus


Leetcode 695. Max Area of Island

You are given an m x n binary matrix grid. An island is a group of 1’s (representing land) connected 4-directionally (horizontal or vertical). You may assume all four edges of the grid are surrounded by water.

The area of an island is the number of cells with a value 1 in the island.

Return the maximum area of an island in grid. If there is no island, return 0.

Example 1:

Input: grid = [[0,0,1,0,0,0,0,1,0,0,0,0,0],[0,0,0,0,0,0,0,1,1,1,0,0,0],[0,1,1,0,1,0,0,0,0,0,0,0,0],[0,1,0,0,1,1,0,0,1,0,1,0,0],[0,1,0,0,1,1,0,0,1,1,1,0,0],[0,0,0,0,0,0,0,0,0,0,1,0,0],[0,0,0,0,0,0,0,1,1,1,0,0,0],[0,0,0,0,0,0,0,1,1,0,0,0,0]]
Output: 6
Explanation: The answer is not 11, because the island must be connected 4-directionally.

Example 2:

Input: grid = [[0,0,0,0,0,0,0,0]]
Output: 0

Constraints:

  • m == grid.length
  • n == grid[i].length
  • 1 <= m, n <= 50
  • grid[i][j] is either 0 or 1.

BFS/DFS

Ideas

Traverse the grid. For every cell with value 1, use DFS or BFS to traverse its connected neighbors, counting the number of cells visited as the area of that island. At the end of the traversal, take the maximum over all the areas found.

from collections import deque


class Solution(object):
    def maxAreaOfIsland(self, grid):
        """
        :type grid: List[List[int]]
        :rtype: int
        """
        height, width = len(grid), len(grid[0])
        
        def dfs(i, j):
            if i < 0 or i >= height or j < 0 or j >= width:
                return 0
            if grid[i][j] == 1:
                grid[i][j] = 0
                return 1 + dfs(i-1, j) + dfs(i+1, j) + dfs(i, j-1) + dfs(i, j+1)
            else:
                return 0
            
        def bfs(i, j):
            if grid[i][j] == 0:
                return 0
            
            grid[i][j] = 0
            nodes = deque([(i, j)])
            area = 0
            while nodes:
                x, y = nodes.popleft()  # deque gives O(1) pops from the left
                area += 1
                for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
                    x_, y_ = x + dx, y + dy
                    if x_ < 0 or x_ >= height or y_ < 0 or y_ >= width:
                        continue
                    if grid[x_][y_] == 1:
                        grid[x_][y_] = 0
                        nodes.append((x_, y_))
            return area
    
        max_area = 0
        for x in range(height):
            for y in range(width):
                # area = dfs(x,y)
                area = bfs(x, y)
                max_area = max(area, max_area)
    
        return max_area

 

Union Find

I think the clearest online reference is from this post:  

https://leetcode.com/problems/max-area-of-island/discuss/729303/Python-BFS-and-Union-Find

Idea: 

Union find is an algorithm in which we maintain each node’s parent in an array, initialized so that each node’s parent is itself. As we traverse nodes, we iteratively update the parent array until connected nodes’ parents converge to the same root. We also maintain a size array, which records the size of each connected component. We have used union find for other leetcode problems; one good and succinct example is https://czxttkl.com/2015/10/19/leetcode-261-graph-valid-tree/

class Solution(object):
    
    def maxAreaOfIsland(self, grid):
        """
        :type grid: List[List[int]]
        :rtype: int
        """
        height, width = len(grid), len(grid[0])
        # initialization for union find
        size = [[0 for j in range(width)] for i in range(height)]
        parent = [[(None, None) for j in range(width)] for i in range(height)]
        for i in range(height):
            for j in range(width):
                parent[i][j] = (i, j)
                if grid[i][j] == 1:
                    size[i][j] = 1
        
        def union_find(p, q):
            if parent[p][q] == (p, q):
                return p, q
            else:
                return union_find(*parent[p][q])
            
        for i in range(height):
            for j in range(width):
                if grid[i][j] == 1:
                    for next_i, next_j in ((i, j+1), (i+1, j)):
                        if next_j >= width:
                            continue
                        if next_i >= height:
                            continue
                        if grid[next_i][next_j] != 1:
                            continue
                        
                        parent_ij = union_find(i, j)
                        parent_ij1 = union_find(next_i, next_j) 
                        
                        if parent_ij == parent_ij1:
                            continue

                        x, y = parent_ij[0], parent_ij[1]
                        w, z = parent_ij1[0], parent_ij1[1]
                        if size[x][y] < size[w][z]:
                            parent[x][y] = parent[w][z]
                            size[w][z] += size[x][y] 
                        else:
                            parent[w][z] = parent[x][y]
                            size[x][y] += size[w][z]
        
        max_area = 0
        for i in range(height):
            for j in range(width):
                max_area = max(max_area, size[i][j])
        
        return max_area

Data Parallelism and Model Parallelism

In this post, we review the concepts of data parallelism, model parallelism, and hybrids in between. We will illustrate the ideas using SOTA ML system designs.

Data Parallelism

Data parallelism means that multiple training workers are fed with different parts of the full data while jointly updating one set of model parameters. There are two mainstream approaches to data parallelism: parameter servers and AllReduce.

Parameter Servers host model parameters on a server node group. The basic form is illustrated in the pseudocode below. Workers do not hold any model parameters; they only push local gradients to the servers and fetch updated weights from the servers afterwards. There is no communication between workers.
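To make the pattern concrete, here is a toy single-process simulation of the push/pull protocol (illustrative only; real parameter servers communicate over RPC and shard parameters across multiple server nodes):

import numpy as np

class ParamServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, grad):      # a worker pushes its local gradient
        self.w -= self.lr * grad

    def pull(self):            # a worker fetches the latest weights
        return self.w.copy()

def worker_step(server, x, y):
    w = server.pull()
    grad = 2 * x * (np.dot(w, x) - y)  # gradient of a squared error
    server.push(grad)                  # note: no worker-to-worker communication

server = ParamServer(dim=3)
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
for _ in range(200):
    x = rng.normal(size=3)
    worker_step(server, x, np.dot(true_w, x))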

I covered Ring AllReduce in my previous post [1] (when I surveyed the Ray paper). In short, Ring AllReduce aggregates the gradients of the model parameters across all training nodes after every round of training (i.e., one minibatch on each trainer node). In PyTorch, it is implemented in an API called DistributedDataParallel [2]. Each training node has a full copy of the model and receives a subset of the data for training. Once each node finishes a forward pass and the corresponding backward pass, the model parameters’ gradients are synced using the AllReduce primitive. Such behaviors are studied in my previous post [3].
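For reference, a minimal DistributedDataParallel setup looks roughly like this (assuming each process is launched by something like torchrun, which sets the rank/world-size environment variables):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
device = dist.get_rank() % torch.cuda.device_count()
model = torch.nn.Linear(10, 10).to(device)
# gradients are synced with AllReduce during backward()
ddp_model = DDP(model, device_ids=[device])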

[2] makes a distinction between AllReduce and Parameter Servers: as one AllReduce operation cannot start until all processes join, it is considered a synchronized communication, as opposed to the P2P communication used in parameter servers. From this distinction, you can see that the communication cost of parameter servers grows linearly with the number of trainer nodes in the system. Ring AllReduce, on the other hand, is an algorithm whose communication cost is constant and independent of the number of trainer nodes, determined solely by the slowest connection between trainer nodes [15]. There is an empirical report comparing Ring AllReduce and Parameter Servers [14]; the result is consistent with this analysis: “it is easy to see that ring all-reduce (horovod) scales better on all models”.

Model Parallelism

The vanilla model parallelism simply holds different parts of a model on different devices. It is useful when a model is too large to be held on one node.

Pipeline parallelism [4] is a more advanced form of model parallelism which can achieve great speed-ups over the vanilla version. First, the layers of a model are partitioned into a series of “cells”. Each cell is placed on a different node. Then, a mini-batch is split into smaller batches, called micro-batches, which are fed to the cells sequentially such that each cell operates on one micro-batch at a time. Therefore, the time spent waiting for data to be processed on other nodes is minimized. The idea of pipeline parallelism is best illustrated in the diagram below (subplot c):

To put data parallelism and pipeline parallelism into a more concrete context, let’s study the paper PipeTransformer [5], which employs a hybrid of data parallelism and model parallelism: synchronous pipeline parallelism within one machine and data parallelism across machines. It gradually freezes the stack of layers to reduce the number of active parameters, and spawns more processes on the freed resources to increase the data-parallel width. PipeTransformer combines four components: Freeze Algorithm, AutoPipe, AutoDP, and AutoCache.

  1. Freeze Algorithm, which is responsible for determining up to which layer from the bottom should be frozen. Freezing layers is motivated by the recent finding that parameters in neural networks usually converge bottom-up. After a few iterations, the bottom layers usually become less actively changed through the end of training.
  2. AutoPipe, which creates cell partitions among the unfrozen layers. The size of a cell is adjusted smartly for the best speed. For example, pipelines can be compressed (i.e., cells can be merged) to reduce the bubble size, which is the speed bottleneck of a pipeline. Because of the Freeze Algorithm and the techniques in AutoPipe, as training goes on and more layers are frozen, one pipeline will occupy fewer nodes (e.g., GPUs).
  3. AutoDP, which can create data parallelism either on the same machine or across machines. Data parallelism across machines is easy to understand because it naturally increases the throughput of training. Data parallelism on the same machine is possible because AutoPipe dynamically shrinks pipelines, so there can be more pipeline replicas even on the same machine.
  4. AutoCache, which is more specific to the freeze-training paradigm: caching frozen layers’ outputs skips the same computation over and over again.

I find the diagram below useful for illustrating AutoPipe and AutoDP:

While this post cannot survey all relevant papers, I find the table from [2] comprehensive (at least as of 2021):

Practical Examples

There are some more practical lessons I learned from the official PyTorch documentation.

CUDA semantics [6]

Each CUDA device maintains a default stream. Within the same stream, operations are executed sequentially; across different streams, operations are executed asynchronously. So if you modify a tensor on cuda:0 and then modify another tensor on cuda:1, do not assume that you will see the change to the tensor on cuda:0 just because you have seen the change to the tensor on cuda:1.

Single-Machine Model Parallelism [7]

On a single machine, the vanilla model parallelism places different parts of a model on different CUDA devices. As an example:
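A minimal sketch in the spirit of [7] (two GPUs assumed):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(128, 64).to('cuda:0')
        self.part2 = nn.Linear(64, 10).to('cuda:1')

    def forward(self, x):
        # activations are explicitly moved between devices
        x = torch.relu(self.part1(x.to('cuda:0')))
        return self.part2(x.to('cuda:1'))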

To use the pipeline parallelism, one needs to further split mini-batches into micro-batches and feed different micro-batches into different parts of the model. As an example:
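A hedged sketch following the pattern in [7]: while part2 works on micro-batch i on cuda:1, part1 can already process micro-batch i+1 on cuda:0:

def pipelined_forward(model, batch, n_micro=4):
    splits = iter(batch.split(batch.size(0) // n_micro))
    # prime the pipeline with the first micro-batch
    s_prev = torch.relu(model.part1(next(splits).to('cuda:0'))).to('cuda:1')
    outputs = []
    for s in splits:
        outputs.append(model.part2(s_prev))  # runs on cuda:1
        # meanwhile the next micro-batch goes through part1 on cuda:0
        s_prev = torch.relu(model.part1(s.to('cuda:0'))).to('cuda:1')
    outputs.append(model.part2(s_prev))
    return torch.cat(outputs)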

As we can see, the vanilla model parallelism introduces more communication overhead. When a model can fit into one GPU, the vanilla model parallelism is actually slower than not using it. Pipeline parallelism can speed things up, but the speed-up is only sub-linear because of the extra overhead.

Later, I found that PyTorch has a wrapper class called Pipe to handle pipeline parallelism automatically. Here is an example of using Pipe to train a Transformer model [9].
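The gist of the API looks roughly like the following (a sketch; Pipe requires torch.distributed.rpc to be initialized first, and the API may change across PyTorch versions):

import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe

fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
# Pipe splits each mini-batch into `chunks` micro-batches automatically
model = Pipe(nn.Sequential(fc1, fc2), chunks=4)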

Use rpc to implement parameter servers

I find there are three good examples of how flexibly rpc can be used to implement a parameter server.

In the first example [10], a parameter server is created in one process. The parameter server holds an nn.Module whose partitions live on different GPU devices. There are two trainer processes; in each of them, a TrainerNet is created for passing input to the module on the parameter-server process through an rpc remote call. As you can see, the forward function of TrainerNet is purely an rpc remote call; in other words, no linear algebra computation happens on the trainer processes in this example. The overall training loop (run on each trainer process) is very simple. A DistributedOptimizer is used to optimize the model parameters (hosted remotely on the parameter-server process).

In the second example [11], they implement a training script that supports both a parameter server and data parallelism. There are four processes: one master process, one parameter-server process, and two trainer processes. It is best to read the training script starting from the master process. The master process first creates a RemoteModule on the parameter-server process. Then it uses the rpc_async API to start the _run_trainer method on the two trainer processes, passing the reference to the RemoteModule to _run_trainer. Each trainer process can therefore use that reference to create a HybridModel, which holds both the reference to the RemoteModule and a local model synced by DistributedDataParallel. Things become clear if you look at the forward function of HybridModel: data are passed to the RemoteModule first and then to the local model to generate the model outputs.

The third example [12] is simpler than the previous two. There is one parameter-server process and one trainer process. It is easiest to understand the script by starting from the trainer process, which first initializes an RNN model that holds a local LSTM module and remote references to an EmbeddingTable and a Decoder. The EmbeddingTable and Decoder are created on the parameter-server process through rpc remote calls.

Use rpc to implement agent-observer reinforcement learning paradigm

This pedagogical example [13] is interesting because it shows how rpc remote calls happen back and forth between several processes. There is an agent process and several observer processes. The agent holds remote references to the Observer instances created on every observer process. When the agent needs to collect RL training data, it kicks off the run_episode method of Observer on the observer processes. Within run_episode, the observers call the agent’s select_action and report_reward methods through rpc remote calls. Hence there are multiple rpc remote calls flowing between the agent process and the observer processes.

 

——————————- Update 2022-01-03 ——————————- 

I want to take down some notes about the ZeRO paper [16] because it introduces some basic concepts about data and model parallelism. We start by recalling DistributedDataParallel, which implements All-Reduce (see Data Parallel and DistributedDataParallel in the “Data Parallelism” section; see the introduction of All-Reduce in [1]). Typically, each process in an All-Reduce group holds as much memory as all the parameters and other optimizer states would take, even though at each round of All-Reduce each process is only responsible for storing one shard of all the parameters correctly.

Therefore, the basic idea of ZeRO (which is also called Fully Sharded Data Parallel (FSDP) at Meta [17]) is to keep only one shard of the parameters in memory; when the forward and backward computation needs the values of other shards, the all-gather operator is used to construct the necessary parameters on the fly. I think the pseudo-code from [17] best describes ZeRO/FSDP:
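Paraphrasing the structure of that pseudo-code from memory (see [17] for the exact version):

# FSDP forward pass:
#     for each layer_i:
#         all-gather the full weights of layer_i from all ranks
#         run the forward pass of layer_i
#         discard the weight shards this rank does not own
# FSDP backward pass:
#     for each layer_i (in reverse):
#         all-gather the full weights of layer_i from all ranks
#         run the backward pass of layer_i
#         reduce-scatter the gradients (each rank keeps only its shard)
#         discard the weight shards this rank does not own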

The ZeRO paper has a good explanation of memory usage during the lifecycle of model training. Based on Section 3 (“Where did all the memory go?”), I summarize three main memory usages: (1) parameters & gradients; (2) optimizer states, such as the momentum/variance required by Adam; (3) residual states, such as activations.

If we apply sharding to parameters, gradients, and optimizer states in a mixed-precision training setting with the Adam optimizer, then ZeRO achieves a ~60x reduction in memory with N_d=64 GPUs, as shown in the table below, where P_{os}, P_g, and P_p mean we apply ZeRO to optimizer states, gradients, and parameters, respectively:

For activations, we first need to illustrate why activations occupy memory in model training. During backpropagation, we need both activation values and model parameter values to compute derivatives. If we do not store the activations computed in the forward pass, we have to recompute them in the backward pass. Therefore, a straightforward method is to save all activations computed in the forward pass for later use in the backward pass. As you can see, this requires as much GPU memory as all activations take. An activation can only be discarded when backpropagation has progressed far enough that all of its dependents have been computed.

It is a usual practice to apply activation checkpointing, a technique to reduce the memory footprint of activations (see [18] for a good tutorial). Activation checkpointing stores only a subset of activations, allowing the others to be recomputed, but not too often. The most memory-efficient way is to checkpoint every sqrt(n) layers, where n is the total number of layers. ZeRO additionally shards activations across different GPUs.
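In PyTorch, this can be tried with torch.utils.checkpoint; a minimal example (the 8-layer toy model and the 2-segment split are arbitrary choices):

import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(16, 64, requires_grad=True)
# only the activations at segment boundaries are stored; the rest are
# recomputed during the backward pass
out = checkpoint_sequential(model, 2, x)
out.sum().backward()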

As you may see from the pseudocode and explanations above, ZeRO/FSDP can reduce the memory of parameters, optimizer states, and activations by sharding them across GPUs, which is not done in DistributedDataParallel. However, the peak memory still depends on the parameters/activations after all-gather at certain points in the forward/backward pass. Note that all-gather usually only gathers parameters/activations per layer, so ZeRO/FSDP enjoys the most memory reduction if you can break your model into many layers.

In 04/2021, Microsoft released a new blog post about ZeRO-Infinity, which is ZeRO plus several enhancements: https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/

——————————- Update 2022-01-03 ——————————- 


todo: https://thegradient.pub/systems-for-machine-learning/

 

References

[1] https://czxttkl.com/2020/05/20/notes-on-ray-pap/

[2] PyTorch Distributed: Experiences on Accelerating Data Parallel Training 

[3] Analyze DistributedDataParallel (DDP)’s behavior: https://czxttkl.com/2020/10/03/analyze-distributeddataparallels-behavior/

[4] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

[5] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

[6] CUDA semantics: https://pytorch.org/docs/stable/notes/cuda.html

[7] Single Machine Model Parallelism: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html#speed-up-by-pipelining-inputs

[8] Pytorch RPC Framework: https://pytorch.org/docs/stable/rpc.html

[9] Train Transformers using Pipeline: https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html

[10] IMPLEMENTING A PARAMETER SERVER USING DISTRIBUTED RPC FRAMEWORK: https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html, code: https://github.com/pytorch/examples/blob/master/distributed/rpc/parameter_server/rpc_parameter_server.py

[11] COMBINING DISTRIBUTED DATAPARALLEL WITH DISTRIBUTED RPC FRAMEWORK:  https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html, code: https://github.com/pytorch/examples/blob/master/distributed/rpc/ddp_rpc/main.py

[12] Distributed RNN using Distributed Autograd and Distributed Optimizer: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html#distributed-rnn-using-distributed-autograd-and-distributed-optimizer, code: https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn

[13] Distributed Reinforcement Learning using RPC and RRef: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html#distributed-reinforcement-learning-using-rpc-and-rref, code: https://github.com/pytorch/examples/tree/cedca7729fef11c91e28099a0e45d7e98d03b66d/distributed/rpc/rl

[14] Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment: https://www.diva-portal.org/smash/get/diva2:1224181/FULLTEXT01.pdf

[15] https://xzhu0027.gitbook.io/blog/ml-system/sys-ml-index/parameter-servers

[16] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/pdf/1910.02054.pdf

[17] Fully Sharded Data Parallel: faster AI training with fewer GPUs: https://engineering.fb.com/2021/07/15/open-source/fsdp/

[18] https://github.com/cybertronai/gradient-checkpointing

Recent advances in Neural Architecture Search

 

It has been some time since I last touched neural architecture search (NAS), back in my PhD when I tried to get ideas for solving a combinatorial optimization problem for collectible card games’ deck recommendation. My memory of NAS mainly stays with one of the most classic NAS papers, “Neural architecture search with reinforcement learning” [1], which uses policy gradient to search for better architectures. Apparently, things have advanced rapidly.

We can start from the most ambitious NAS variant, AutoML-Zero [2]. It does not simply evolve an architecture; it evolves programs, literally. A program is defined by three main functions, setup(), learn(), and predict(). The method uses evolutionary algorithms to search over all possible implementations of the three functions built from elementary operations. The objective of the evolutionary search is the Evaluate() function shown below:

The evolutionary process is best illustrated in the following diagram:

After seeing one of the most advanced NAS developments, we can look at simpler ones. EfficientNet [3] shows that deep neural networks improve the most when depth, width, and resolution are scaled up simultaneously. This is a million-dollar-worth finding! Hence, they propose compound scaling, where they simply search over one dimension \phi:
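Concretely, compound scaling sets depth d = \alpha^\phi, width w = \beta^\phi, and resolution r = \gamma^\phi subject to \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, so FLOPs grow roughly as 2^\phi. A quick numeric illustration using the constants reported in [3] for EfficientNet-B0:

# alpha, beta, gamma as reported in the EfficientNet paper [3]
alpha, beta, gamma = 1.2, 1.1, 1.15
for phi in range(1, 4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")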

They show that scaling up existing models like MobileNet and ResNet this way leads to better performance than scaling one dimension (width, depth, or resolution) alone.

Next, they scale up a baseline model, EfficientNet-B0, a model found by MnasNet [4]. As we can see, scaling up MnasNet leads to steady improvement in performance. MnasNet is also a policy-gradient-based NAS algorithm, similar to earlier works like [1] (which might not be sample efficient), but it has two main improvements: (1) the reward adds a latency penalty; (2) each block repeats one sub-architecture, and different blocks can have different sub-architectures; in comparison, previous works repeat a single sub-architecture across all blocks.

Besides EfficientNet, another simple yet effective method is random search over a predictor trained on different architectures to predict their performance (Neural Predictor [5]). The predictor uses graph neural networks (specifically, a graph convolutional network) to consume neural architectures represented as graphs. Most graph neural networks learn to represent node embeddings; however, this work needs a regression model mapping the overall graph structure (architecture) to a scalar (performance), so the node embeddings are averaged to represent the overall graph’s embedding as the regression model’s input.

The ancestor of Neural Predictor is PNAS [9], which progressively expands architectures. At each expansion step, it queries a trained predictor for which block to expand. The predictor is trained on (architecture of N blocks, performance) pairs and then used to predict the performance of architectures with N+1 blocks. PNAS cannot be fully parallelized because of its progressive nature; that is one downside compared to Neural Predictor.

Now, let’s move on to one-shot NAS, which mostly revolves around the weight-sharing technique. We use One-Shot [10], DARTS [11], and ProxylessNAS [8] to introduce it. Weight sharing trains one much bigger architecture that covers all the architectures we want to consider. If there are N candidate operations \mathcal{O}=\{o_i\}, \;i=1 \cdots N, then the outputs of the One-Shot net and DARTS are (as summarized in ProxylessNAS [8]):

where \alpha_i in m_{\mathcal{O}}^{DARTS} is a softmax distribution over the N operations. To be clear, from the equation above, computing m_{\mathcal{O}}^{One-Shot} or m_{\mathcal{O}}^{DARTS} requires computing all N operations’ outputs.
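A DARTS-style mixed operation in PyTorch makes this explicit (a sketch; the candidate ops and the cell wiring are omitted):

import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture params

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        # every candidate op is evaluated and held in memory
        return sum(wi * op(x) for wi, op in zip(w, self.ops))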

The problem with One-Shot and DARTS is that memory needs to hold all N operations and hence can easily blow up. ProxylessNAS proposes a technique called BinaryConnect, which is very close to DARTS but subtly different. BinaryConnect means that m_{\mathcal{O}}^{ProxylessNAS} = o_i(x), where o_i is sampled from a softmax distribution. The difference from DARTS is that m_{\mathcal{O}}^{ProxylessNAS} is strictly one operation’s output, rather than a weighted sum of N outputs. To take gradients of m_{\mathcal{O}}^{ProxylessNAS}, they propose a trick that samples two operations every time the gradient is computed:

ProxylessNAS also proposes to train a neural network to predict latency, so that both model prediction accuracy and latency are differentiable w.r.t. the architecture parameters.

ENAS [12] is actually the originator of the weight-sharing technique. One hurdle ENAS has compared to ProxylessNAS is that it requires an additional RNN controller to decide which architecture to sample; ProxylessNAS only needs to learn the softmax distribution parameters \alpha_i.

TuNAS [6] can be seen as an improved version of ProxylessNAS. There are mainly two improvements:

(1) more aggressive weight sharing, which they called operation collapsing.

(2) a new reward function, which they claim leads to more robust results:

In TuNAS, policy gradient samples potential architectures while weight sharing updates the sampled architectures’ weights. As the RL policy samples more and more architectures, which together cover all possible weights, all the weights in the biggest architecture get learned well. The overall learning loop is described as:

However, one problem arises from the weight-sharing technique, as pointed out in the abstract of the BigNAS paper [7]: “existing methods assume that the weights must be retrained, fine-tuned, or otherwise post-processed after the search is completed”. BigNAS proposes several empirical tricks to make sure that architectures found by weight sharing achieve good performance without retraining or post-processing. First, they introduce the sandwich rule, which samples the smallest child model, the biggest full model, and N randomly sampled child models; the gradients are aggregated from all the sampled models before the relevant weights get updated. As [7] hypothesized, “The motivation is to improve all child models in our search space simultaneously, by pushing up both the performance lower bound (the smallest child model) and the performance upper bound (the biggest child model) across all child models.” Second, they introduce inplace distillation, which means the biggest full model’s prediction is used to supervise all child models throughout the whole training process. (I am actually not sure why using the ground-truth labels for child models would be inferior to the teacher network’s predictions.) Third, they design a learning rate schedule that ends at a constant rather than simply decaying exponentially. Moreover, they dedicate Section 3.2 to a coarse-to-fine architecture selection paradigm applied after the full model is trained.

In parallel, few-shot NAS tries to address the problem that evaluations of subnets sampled from a one-shot supernet usually have imperfect correlation with evaluations of those subnets retrained from scratch. In few-shot NAS, multiple supernets are trained so that sampled subnets have better evaluation correlation. With less weight sharing than one-shot NAS but still lower computation than no weight sharing at all, few-shot NAS strikes a good balance. Please check [13] for more details.

Lastly, I want to spend some time walking through some details of DARTS [11], SNAS [14], and DSNAS [15], which involve several advanced gradient-based methods for one-shot NAS.

In DARTS, as we surveyed when introducing ProxylessNAS, the output for an input x is a softmax-weighted sum over a set of candidate operators o(x):

\alpha is then a learnable vector determining which operator is more preferable. Suppose the network’s own weights are w; then the objective function of DARTS is essentially:
Note that \alpha is optimized on the validation dataset while w is optimized on the training dataset. As the paper suggests, this is a bi-level optimization problem. Analytically, the optimal solution is obtained when \nabla_\alpha \mathcal{L}_{val}(w^*(\alpha), \alpha)=0 and \nabla_w \mathcal{L}_{train}(w,\alpha)=0, where w^*(\alpha)=argmin_w \mathcal{L}_{train}(w,\alpha).

The paper proposes to replace w^*(\alpha), i.e., the expensive inner minimization argmin_w \mathcal{L}_{train}(w,\alpha), with a single gradient step w-\xi\nabla_w \mathcal{L}_{train}(w,\alpha).

However, now w-\xi\nabla_w \mathcal{L}_{train}(w,\alpha) and \alpha, the first and second arguments of \mathcal{L}_{val}(\cdot, \cdot), are both functions of \alpha, so we need the chain rule for multivariable functions to compute \nabla_\alpha \mathcal{L}_{val}(w-\xi \nabla_w \mathcal{L}_{train}(w,\alpha), \alpha). I refer to https://math.libretexts.org/Bookshelves/Calculus/Book%3A_Calculus_(OpenStax)/14%3A_Differentiation_of_Functions_of_Several_Variables/14.5%3A_The_Chain_Rule_for_Multivariable_Functions and https://www.programmersought.com/article/14295360685/#67_176 to really understand this computation.

We use a simpler notation to represent \nabla_\alpha f\left(g_1(\alpha), g_2(\alpha) \right) := \nabla_\alpha \mathcal{L}_{val}(w-\xi \nabla_w \mathcal{L}_{train}(w,\alpha), \alpha):= \nabla_\alpha \mathcal{L}_{val}(w', \alpha), where f\left(\cdot, \cdot \right) = \mathcal{L}_{val}(\cdot, \cdot), g_1(\alpha)=w-\xi \nabla_w \mathcal{L}_{train}(w,\alpha)=w', g_2(\alpha)=\alpha.

From the chain rule for multivariable functions, we have:

\nabla_\alpha f\left(g_1(\alpha), g_2(\alpha) \right) = \frac{d}{dg_1} f\left(g_1(\alpha), g_2(\alpha) \right) \frac{dg_1}{d\alpha} + \frac{d}{dg_2} f\left(g_1(\alpha), g_2(\alpha) \right) \frac{dg_2}{d\alpha}

So we finally compute \nabla_\alpha \mathcal{L}_{val}(w-\xi \nabla_w \mathcal{L}_{train}(w,\alpha), \alpha) as: 

        -\xi \nabla^2_{\alpha, w} \mathcal{L}_{train}(w,\alpha)\nabla_{w'}\mathcal{L}_{val}(w',\alpha) + \nabla_\alpha \mathcal{L}_{val}(w', \alpha)

As you may notice, \nabla^2_{\alpha, w}\mathcal{L}_{train}(w,\alpha) \nabla_{w'}\mathcal{L}_{val}(w', \alpha) is an expensive matrix-vector product, so the paper proposes the finite difference approximation:

\nabla^2_{\alpha, w}\mathcal{L}_{train}(w,\alpha) \nabla_{w'}\mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_\alpha \mathcal{L}_{train}(w^+, \alpha) - \nabla_\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon},
where w^\pm = w \pm \epsilon \nabla_{w'}\mathcal{L}_{val}(w', \alpha) 

The finite difference approximation is actually based on Taylor expansion:

The problem with DARTS is that after learning both the architecture selection parameters \alpha and the supernet’s own weights w, it derives the best child network by o^{(i,j)}=argmax_{o \in \mathcal{O}} \alpha_o^{(i,j)}. As pointed out by SNAS [14], DARTS thus has “the inconsistency between the performance of derived child networks and converged parent networks”. SNAS uses the concrete distribution as a more principled way to learn the architecture selection parameters \alpha. The concrete distribution has a temperature parameter \lambda that is annealed toward zero, so when training finishes, the architecture selection parameters converge to discrete variables (clearly denoting which operator connects which pair of nodes).

The notation in the SNAS paper [14] is a bit chaotic; I hope I can comb it out more cleanly here. The idea of SNAS is that the architecture search parameter is a binary tensor \mathbf{Z}, where \mathbf{Z}^{k}_{i,j} denotes that node i is connected to node j with operator k. Therefore, \mathbf{Z}_{i,j} is a one-hot vector of length K, i.e., the selection of one of the K possible operators connecting nodes i and j. Similarly, \mathbf{O}_{i,j}(x) is also a vector of length K, denoting the outputs of the K possible operators.

Therefore, the objective function of SNAS is:

    \begin{align*}\mathbb{E}_{\mathbf{Z} \sim p_\alpha(\mathbf{Z})}\left[\mathcal{L}_\theta(\mathbf{Z})\right],\end{align*}


where \mathcal{L}_\theta(\cdot) is the model’s own loss.

Now, \mathbf{Z} is parameterized by the concrete distribution, which consists of the architecture selection parameter \mathbf{\alpha} of the same size as \mathbf{Z} (i.e., |I| \times |J| \times |K|), the Gumbel random variables G_{i,j}^k, and the annealing parameter \lambda. Specifically,

    \begin{align*}\begin{split}\mathbf{Z}_{i,j}^k &= f_{\alpha_{i,j}}\left( \mathbf{G}^{k}_{i,j} \right) \\&=\frac{exp\left( \left( \log \alpha_{i,j}^k + G_{i,j}^k \right)/\lambda \right)}{\sum^{K}_{l=1} exp\left( \left( \log \alpha_{i,j}^l  + G_{i,j}^l \right) / \lambda \right)}\end{split}\end{align*}

The authors prove that computing \mathbb{E}_{\mathbf{Z} \sim p_\alpha(\mathbf{Z})}\left[\frac{\partial \mathcal{L}_\theta(\mathbf{Z})}{\partial \alpha_{i,j}^k}\right] is essentially doing policy gradient.

First, let us make sure we are clear on how node j’s input x_j is formed from the nodes that connect to it (a small modification of equation 20 in [14]):

    \begin{align*}x_j = \sum_{i<j} \sum_{k=1}^K \mathbf{Z}_{i,j}^k \mathbf{O}_{i,j}^k(x_i)\end{align*}

Now, using the chain rule of derivatives, we have

    \begin{align*}\begin{split}\frac{\partial \mathcal{L}_\theta(\mathbf{Z})}{\partial \alpha_{i,j}^k} &=\sum_{k'=1}^{K} \frac{\partial \mathcal{L}_\theta(\mathbf{Z})}{\partial x_j} \cdot \frac{\partial x_j}{\partial \mathbf{Z}_{i,j}^{k'}} \cdot \frac{\partial \mathbf{Z}_{i,j}^{k'}}{\partial \alpha_{i,j}^k} \\&=\sum_{k'=1}^{K} \frac{\partial \mathcal{L}_\theta(\mathbf{Z})}{\partial x_j} \cdot \mathbf{O}_{i,j}^{k'}(x_i) \cdot \frac{\partial \mathbf{Z}_{i,j}^{k'}}{\partial \alpha_{i,j}^k} \qquad\qquad (\text{replace } \frac{\partial x_j}{\partial \mathbf{Z}_{i,j}^{k'}}) \\&=\sum_{k'=1}^{K} \frac{\partial \mathcal{L}_\theta(\mathbf{Z})}{\partial x_j} \cdot \mathbf{O}_{i,j}^{k'}(x_i) \cdot \left(\left(\delta(k'-k)-\mathbf{Z}_{i,j}^{k}\right)\mathbf{Z}_{i,j}^{k'}\frac{1}{\lambda \alpha_{i,j}^k}\right) \qquad(\text{replace } \frac{\partial \mathbf{Z}_{i,j}^{k'}}{\partial \alpha_{i,j}^k}\text{ based on the softmax derivative trick [16]})\end{split}\end{align*}

Finally, Appendix D shows that:

That’s the form of policy gradient. That’s why it is equivalent to say that SNAS is trained with policy gradient with reward \frac{\partial \mathcal{L}}{\partial x_j} \tilde{\mathbf{O}}_{i,j}(x_i).

I’ll dig into DSNAS [15] once it gets more attention.

——————– Update 2021/09 ——————–

I want to mention two more classic ideas in NAS.

  1. Regularized evolution [17], in which every mutated solution has a certain survival period. The only way a good architecture can remain in the population is by being passed down from parents to children through the generations. “Regularized” means every solution has an age; this encourages more diversity and exploration and avoids getting stuck at spuriously promising solutions.
  2. Bayesian Optimization [18], in which there are five most important components: encoding, neural predictor, uncertainty estimate, acquisition function, and acquisition optimization. [18] uses empirical experiments to find the optimal combination of them: path encoding (a specific feature-engineering method they propose), an ensemble of 5 feedforward neural networks as the uncertainty estimate, independent Thompson sampling as the acquisition function, and mutation-based acquisition optimization.

 

References (arxiv submitted time)

[1] NEURAL ARCHITECTURE SEARCH WITH REINFORCEMENT LEARNING (2016.11)

[2] AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (2020.3)

[3] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (2019.5)

[4] MnasNet: Platform-aware neural architecture search for mobile (2018.7)

[5] Neural Predictor for Neural Architecture Search (2019.12)

[6] Can weight sharing outperform random architecture search? An investigation with TuNAS (2020.8)

[7] BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models (2020.3)

[8] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (2018.12)

[9] Progressive Neural Architecture Search (2017.12)

[10] Understanding and Simplifying One-Shot Architecture Search (2018)

[11] DARTS: Differentiable Architecture Search (2018.6)

[12] Efficient Neural Architecture Search via Parameter Sharing (2018.2)

[13] Few-shot Neural Architecture Search (2020.6) 

[14] SNAS: Stochastic Neural Architecture Search (2018.12)

[15] DSNAS: Direct Neural Architecture Search without Parameter Retraining (2020.2)

[16] softmax derivative trick https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

[17] Regularized Evolution for Image Classifier Architecture Search (2019.2)

[18] BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search (2020.11)

Recent advances in Batch RL

I’ll introduce some recent papers advancing batch RL.

The first paper is Critic Regularized Regression (CRR) [1]. It starts from a general form of the actor-critic policy gradient objective function, where Q_\theta is a learned critic function:

For a behavior cloning method, f(Q_\theta, \pi, s, a) = Q_\theta(s, a). However, we can do much more than that choice:

The CRR paper tested the first two choices, while another famous algorithm, Advantage Weighted Behavior Model (ABM), from the paper “Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning” [2], used the third choice.

 

CRR can be connected to a popular family of policy gradient algorithms called Maximum a Posteriori Policy Optimisation (MPO) [3], which I think is better summarized in [2]. From [2], we can see that MPO solves problems of the following form:

The optimal solution of MPO has a closed-form expression: \pi(a|s) \propto \pi_{prior}(a|s) \exp(\hat{Q}^{\pi}(s,a) / \eta).

In CRR, \pi_{prior} is equal to the logging policy \mu_\beta, so the optimal solution is supposed to be \pi(a|s) \propto \mu_{\beta}(a|s) \exp(\hat{Q}^{\pi}(s,a) / \eta). The CRR paper points out that this is equivalent to subtracting the state value V(s) from \hat{Q}^{\pi}(s,a), leading to \pi(a|s) \propto \exp(\hat{A}^{\pi}(s,a) / \eta). (I am not sure how \mu_{\beta}(a|s) is omitted.)

The overall learning loop of CRR is very straightforward:

 

Let’s look at how MPO is used in the ABM paper [2]. While \pi_{prior} in CRR is the logging policy \mu_\beta, \pi_{prior} in ABM can be either a behavior cloning policy mimicking the logging policy or an advantage-based behavior cloning policy.

As argued in [2], the reason not to simply use a behavior cloning prior is to avoid learning from a very mediocre logging policy:

Once we obtain \pi_{prior}, we can learn the RL policy \pi using Eqn. 2. [2] provides two ways of optimization: (1) Expectation-Maximization, alternating between \pi and the Lagrangian multiplier \alpha; (2) stochastic gradient descent with the reparameterization trick, because the objective contains \mathbb{E}_{a \sim \pi(\cdot|s)}.

 

The next paper is Conservative Q-Learning (CQL) [4]. The idea is simple, though the proof in the paper is quite dense. I follow https://towardsdatascience.com/the-power-of-offline-reinforcement-learning-5e3d3942421c to illustrate the idea. In standard Q-learning, our loss function is:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right]\\&=\mathbb{E}_{s,a\sim D} \left[ \left\| Q(s,a) - \left(r(s,a) + \gamma max_{a'}Q(s',a') \right) \right\|^2 \right] \end{split}\end{align*}

A very conservative Q-learning loss function is:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right] + \alpha \cdot \mathbb{E}_{s\sim D, a \sim \pi} \left[ Q(s,a)\right],\end{split}\end{align*}


which can be understood intuitively as always dragging down the Q-values of whichever actions the learned policy selects.

However, the authors find that this way the Q-values are learned too conservatively. Therefore, they propose a relaxed version of the loss:

    \begin{align*}\begin{split}L&=\mathbb{E}_{s,a\sim D}\left[\delta(s,a)\right] + \alpha \cdot \left( \mathbb{E}_{s\sim D, a \sim \pi} \left[ Q(s,a)\right] - \mathbb{E}_{s,a \sim D} \left[ Q(s,a)\right] \right),\end{split}\end{align*}

This objective can be understood as follows: we not only drag down the Q-values of the learned policy’s actions, but also pull up the Q-values of the logged actions. As a balance, the resulting Q-function is more conservative on actions not seen in the logged data and less conservative on actions seen in it.
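For discrete actions, one common instantiation of this penalty (the CQL(H) variant, which uses logsumexp as a soft maximum over the policy’s actions) can be sketched as:

import torch

def cql_penalty(q_values, logged_actions):
    # q_values: (batch, n_actions); logged_actions: (batch,) long tensor
    pushed_down = torch.logsumexp(q_values, dim=1)  # soft max over policy actions
    pulled_up = q_values.gather(1, logged_actions.unsqueeze(1)).squeeze(1)
    return (pushed_down - pulled_up).mean()  # added to the TD loss with weight alpha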

 

Different batch RL methods can all be boiled down to how they extrapolate Q-values on unobserved or less-observed (state, action) pairs. [5] provides a principled way to zero out those “uncertain” Q-values. Here are some basic notations for introducing their idea:

There are two paradigms of learning a policy from data: policy iteration and value iteration. [5]’s methodology can be plugged into either paradigm.

For policy iteration (they call MBS-PI):

For value iteration (they call MBS-QI):

You can see that their methods all come down to a filter \varepsilon, which simply zeros out the Q-values of (state, action) pairs whose frequency is below a threshold:

It is interesting to see how \hat{\mu}(s,a), the count/probability distribution of encountering (state, action) pairs, is computed empirically.

  • For discrete action problems, they discretize state spaces by binning each state feature dimension. Therefore, there are a finite number of (state, action) pairs for which they count frequencies.
  • For continuous action problems, we can train a VAE with ELBO(s,a), a lower bound of \log \hat{\mu}(s,a), as the objective function. Once the VAE is learned, we can compute \varepsilon(s,a) as \varepsilon(s,a) = \mathbb{I}(ELBO(s,a) > \text{thres}), where \mathbb{I}(\cdot) is an indicator function.
  • For continuous action problems, another methodology, which is used in the paper, is to integrate with BCQ [7]. BCQ trains a conditional VAE so that the Bellman update only samples actions similar to those of the data collection policy. One drawback of BCQ is that it only considers sampling likely actions, even if the state itself is less observed. In comparison, [5] takes into consideration the observation frequency of both states and actions. Since BCQ already prevents Q-values from backing up on infrequent actions (given any state), they only need to apply \varepsilon(s)=\mathbb{I}(ELBO(s) > \text{thres}_2) on the BCQ-updated Q-values to further zero out Q-values of less-observed states. ELBO(s) comes from a VAE trained only on state data.

Let’s recap how VAE works, mainly based on my previous posts Stochastic Variational Inference and Optimization with discrete random variables, and a nice online post https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/. Suppose a VAE contains an encoder \mathcal{E} (encoding input X to a latent vector z) and a generator \mathcal{G} (reconstructing the input based on z). \mathcal{E}(z|X) and \mathcal{G}(X|z) entail the probability densities of the latent vector and the reconstructed input. Usually, \mathcal{E}(z|X) = \mathcal{N}(\mu, \sigma), where \mu and \sigma are the outputs of the encoder. VAE optimizes an objective function called the ELBO (Evidence Lower Bound):

ELBO(X) = \mathbb{E}_{z\sim\mathcal{E}(z|X)}\left[\log \mathcal{G}(X|z) \right] - D_{KL} \left[ \mathcal{E}(z|X) || P(z) \right] \leq \log\left[P(X)\right].

Now, if you let your VAE learn on states s (as X) collected by a behavior policy \pi_b, then once this VAE is learned, you can compute the ELBO for any test state s_{test}, which can be interpreted as a lower bound on how likely s_{test} would appear under the same behavior policy \pi_b.

A conditional VAE [6] requires only small modifications to a normal VAE, where C denotes a condition vector:

ELBO(X|C) = \mathbb{E}_{z\sim\mathcal{E}(z|X, C)}\left[\log \mathcal{G}(X|z, C) \right] - D_{KL} \left[ \mathcal{E}(z|X, C) || P(z) \right] \leq \log\left[P(X|C)\right]

In a real implementation, we just need to concatenate C to the latent vector z and to the input X, and keep the rest of the VAE training code unchanged.
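A minimal sketch of that trick (single linear layers stand in for the real encoder/decoder networks):

import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim, c_dim, z_dim):
        super().__init__()
        self.enc = nn.Linear(x_dim + c_dim, 2 * z_dim)  # outputs mu, log_var
        self.dec = nn.Linear(z_dim + c_dim, x_dim)

    def forward(self, x, c):
        # the condition C is concatenated to both the encoder input and the latent
        mu, log_var = self.enc(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        return self.dec(torch.cat([z, c], dim=-1)), mu, log_var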

 

As I said, different batch RL methods can all be boiled down to how they extrapolate Q-values on unobserved/less-observed (state, action) pairs. BEAR [8] is another example. While BCQ, MBS-PI, and MBS-QI impose implicit constraints on staying close to the behavior policy (i.e., they prevent Q-value back-ups on less-observed state/action pairs), BEAR has a more explicit constraint in its objective function:

where MMD is the maximum mean discrepancy, which measures the distance between two distributions using empirical samples:

MMD is illustrated very well in this post: https://stats.stackexchange.com/questions/276497/maximum-mean-discrepancy-distance-distribution. It is also used in InfoVAE to replace the KL divergence term in the loss function of the traditional VAE (recall that the ELBO objective consists of two terms, a reconstruction error and a KL divergence encouraging the posterior \mathcal{E}(z|X) to be close to the prior). As argued in https://ermongroup.github.io/blog/a-tutorial-on-mmd-variational-autoencoders/, the KL divergence pushes \mathcal{E}(z|X) toward the prior for every input, which may result in a VAE that ends up learning an uninformative latent code, while an MMD-based VAE only encourages closeness to the prior in expectation.
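An empirical estimate of MMD^2 with an RBF kernel looks like this (the kernel choice and bandwidth are assumptions; this simple estimator keeps the diagonal terms and is therefore slightly biased):

import torch

def mmd2(x, y, sigma=1.0):
    # x: (n, d) samples from one distribution; y: (m, d) from the other
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()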

Overall, the algorithm’s pseudocode works as below:

  

References

[1] Critic Regularized Regression: https://arxiv.org/abs/2006.15134

[2] Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning: https://arxiv.org/abs/2002.08396

[3] Maximum a Posteriori Policy Optimisation (https://arxiv.org/abs/1806.06920)

[4] Conservative Q-Learning for Offline Reinforcement Learning: https://arxiv.org/abs/2006.04779

[5] Provably Good Batch Reinforcement Learning Without Great Exploration: https://arxiv.org/abs/2007.08202

[6] Conditional Generative Adversarial Nets: https://arxiv.org/abs/1411.1784. Nice post: https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/

[7] Off-Policy Deep Reinforcement Learning without Exploration: https://arxiv.org/abs/1812.02900

[8] Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction: https://arxiv.org/abs/1906.00949

Some classical methodologies in applied products

I am reading two papers which use very classical methodologies for optimizing metrics in real-world applications.

The first is constrained optimization for ranking, from The NodeHopper: Enabling Low Latency Ranking with Constraints via a Fast Dual Solver. The paper performs per-slate constrained optimization:

Here, c_i is item i’s primary metric value, r_i is item i’s position under ranking r, and a[r_i] is the attention strength an item receives when it is ranked at position r_i. Similarly, M is the constraint matrix. As you can see, they define the slate reward/constraint as the sum of per-item rewards/constraints, which may or may not hold for a given application.

If there are n items, there are n! possible rankings; therefore, optimizing Eqn. 1 directly is hard. They skillfully convert the problem into something more manageable. First, they rewrite Eqn. 1 into Eqn. 2:

R is a permutation matrix (each row and column has exactly one 1). 

Then, they relax R to a probabilistic matrix P, an “expected” permutation matrix. (In 3d, there is a typo; it should be P=P_\alpha=\sum^{n!}_{j=1} \alpha_j R^j.) Here, \alpha is a distribution over all possible permutations.

Now, we can just optimize w.r.t. P, which has only n^2 entries:

Finally, Eqn. 4 can be solved by the Lagrangian method.

The rest of the paper is quite involved and hard to follow, but it is solving the same constrained optimization problem.

The second paper I read is “Automated Creative Optimization for E-Commerce Advertising” (https://arxiv.org/abs/2103.00436). The background of this paper is that in online advertising, each candidate ad contains a set of interacting elements as a combination, such as templates, fonts, and backgrounds.

An ad’s CTR can be naturally predicted using a Factorization Machine (from Eqn. 3 in the paper):

The explanation of Eqn. 3 is that there are L elements in an ad x_c; e_i and e_j are the embeddings of a pair of elements, and the interaction score between two embeddings can be computed using one of the operators Concat, Multiply, Plus, Max, or Min (mentioned in Section 4.2).

The problem the paper tries to solve is: when the system has many ad candidates, how can it pick the candidate believed to have the highest CTR while balancing the need to explore the element space? They use Thompson sampling with a Bayesian contextual bandit. The Bayesian part comes from the fact that all embeddings (e_i, e_j, …) are Bayesian estimates with Gaussian distributions. For each new ad request, they sample embedding values from the current posterior, pick the best ad under the sampled values, observe the reward, and then update the posterior distribution.

How do we update the embedding estimates \Theta \sim \mathcal{N}(\mu, \Sigma)? We use stochastic variational inference (https://czxttkl.com/2019/05/04/stochastic-variational-inference/). We can optimize the ELBO with gradient-based methods; the ELBO contains only a likelihood term (given a sampled \Theta, how likely is it that we observe the current dataset?) and a KL divergence between the current Gaussian distribution and a prior Gaussian distribution, both of which have analytical expressions.
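Putting the two pieces together, a hypothetical Thompson-sampling loop with a diagonal Gaussian posterior over the flattened embedding parameters \Theta might look like the sketch below; update_posterior stands in for the stochastic-variational-inference (ELBO) step, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def select_ad(candidates, mu, sigma, predict_ctr):
    theta = rng.normal(mu, sigma)             # sample Theta from the posterior
    scores = [predict_ctr(ad, theta) for ad in candidates]
    return int(np.argmax(scores))             # best ad under the sampled model

# Per request:
#   best = select_ad(candidates, mu, sigma, predict_ctr)
#   reward = serve_and_observe(candidates[best])              # click / no click
#   mu, sigma = update_posterior(mu, sigma, candidates[best], reward)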

This paper is a classical example of stochastic variational inference and could be applied to many real-world problems. 

Reward/return decomposition

In reinforcement learning (RL), it is common that a task only reveals rewards sparsely, e.g., at the end of an episode. This prevents RL algorithms from learning efficiently, especially when the task horizon is long. There has been some research on how to redistribute sparse rewards to preceding steps.

One simple but interesting work is [1]. Its loss is defined as:
The loss can be understood as follows: each step’s reward r_t is composed of a contribution b(s_t) from the current state and contributions c(s_k), \forall k=0,\cdots,t-1, from all previous states, gated by g(s_t).
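Under my reading of [1], a minimal PyTorch sketch of this loss could look as follows, where b, c, g are small scalar-output networks (names hypothetical) and the gate multiplies the summed past contributions.

import torch

def synthetic_return_loss(states, rewards, b, c, g):
    # states: (T, state_dim); rewards: (T,); b, c, g map states to scalars
    T = states.shape[0]
    losses = []
    for t in range(T):
        current = b(states[t]).squeeze()               # contribution of s_t
        past = c(states[:t]).sum() if t > 0 else 0.0   # contributions of s_0..s_{t-1}
        gate = torch.sigmoid(g(states[t])).squeeze()   # gate in (0, 1)
        losses.append((rewards[t] - current - gate * past) ** 2)
    return torch.stack(losses).mean()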

[7] uses Transformers (or any attention-based sequence model) to decompose an episode’s return into step rewards. The loss function is:
You can think of \alpha in the subscripts of s and a as ranging over 0,\cdots,t, i.e., the Transformer takes into account all steps up to the current one when predicting the current step’s reward. The differences from [1] are: (a) [7] uses previous steps’ information more effectively via a sequence model; (b) [7] only decomposes the total episode reward R(\tau), while [1] decomposes every step’s reward.
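A minimal sketch of this objective, assuming a causal sequence model step_reward_model (a hypothetical name) that attends over steps 0..t to output a predicted reward per step:

import torch

def episode_return_loss(step_reward_model, states, actions, episode_return):
    # states: (T, state_dim), actions: (T, action_dim)
    r_hat = step_reward_model(states, actions)   # (T,) predicted step rewards
    # the only supervision: predicted step rewards must sum to R(tau)
    return (episode_return - r_hat.sum()) ** 2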

Another line of research is Hindsight Experience Replay (HER) [4], in which we create spurious but easier-to-achieve goals, and hence rewards. This repository [5] has a clear implementation.

HER is best explained using an example, a bit-flipping environment, as described below:
As you can see, this environment has 2^n states and only one state gives a positive reward. It would be almost infeasible for a normal RL model to explore and learn effectively. What initially confused me (but is now clear) is that the state of this environment fed to an RL model has two parts: the current n bits and the target n bits. This is manifested in the agent’s learn function in [5].

The crux of HER is as follows: even when an episode terminates in failure, we augment the replay buffer with new tuples, pretending that the last state we reached was the target goal.

Because these easier intermediate goals are attached to the state, the RL model has more non-sparse experience to learn from for reaching whatever goal is specified in its state.
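Here is a minimal sketch of HER relabeling for the bit-flipping environment, based on my understanding of [4] and [5]; the transition fields and buffer API are hypothetical.

def her_augment(replay_buffer, episode):
    # episode: list of (bits, action, next_bits, goal_bits); bits are tuples of 0/1
    new_goal = episode[-1][2]   # pretend the last achieved state was the goal
    for bits, action, next_bits, _ in episode:
        reward = 0.0 if next_bits == new_goal else -1.0   # sparse reward w.r.t. the new goal
        done = next_bits == new_goal
        # the agent's state concatenates the current bits with the goal bits
        replay_buffer.add((bits, new_goal), action, reward, (next_bits, new_goal), done)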

Finally, there is work called Temporal Value Transport [8], with a blog post [9] explaining it. I think this work is very heuristic-driven. Suppose one episode has T steps; they use an attention-based sequence model to predict each pair of steps (i,j)‘s influence on reward (i<j). Specifically, the sequence model is a classifier that outputs “the logit predicting whether the rewards-to-go from a given observation are below or above a threshold” (from [9]).


Then, the reward is re-distributed by:


Below are two works that I do not fully understand.

The first is [6]. It converts the normal policy gradient update:

to:

where \Phi_t contains the future part of the trajectory after time step t. The intuition is that in off-policy learning, we can utilize the steps after the current step to make better value estimates. However, I do not understand how \Phi_t is computed (the paper says it comes from an RNN-based classifier) or what \mathbb{P}(A_t|X_t, \Phi_t) represents.

The second is RUDDER [2], with a web-view blog [3] explaining it. However, I still do not get how the LSTM is used in RUDDER to distribute rewards. [1]’s related-work section sheds some light on how RUDDER works, though:


References

[1] Synthetic Returns for Long-Term Credit Assignment: https://arxiv.org/pdf/2102.12425.pdf

[2] RUDDER: Return Decomposition for Delayed Rewards: https://arxiv.org/pdf/1806.07857.pdf

[3] https://ml-jku.github.io/rudder/

[4] Hindsight Experience Replay: https://arxiv.org/pdf/1707.01495.pdf

[5] https://github.com/hemilpanchiwala/Hindsight-Experience-Replay

[6] Counterfactual Credit Assignment in Model-Free Reinforcement Learning: https://arxiv.org/pdf/2011.09464.pdf

[7] Sequence modeling of temporal credit assignment for episodic reinforcement learning: https://arxiv.org/pdf/1905.13420.pdf

[8] Optimizing agent behavior over long time scales by transporting value: https://www.nature.com/articles/s41467-019-13073-w.pdf

[9] https://www.efavdb.com/ltca

Self-Supervised Learning Tricks

I am reading some self-supervised learning papers. Some of them use interesting tricks to create self-supervised learning signals. This post is dedicated to those tricks.

The first paper I read is SwAV (Swapping Assignments between multiple Views of the same image) [1]. The high-level idea is that we create K clusters with trainable cluster centers \{c_1, \dots, c_K\}, which, importantly, are shared across all training batches. In each batch, we distort (augment) each image into two views. Each view is mapped to an embedding z and assigned to the K clusters, with the batch partitioned equally among the clusters. As you might guess, the two distorted views of the same original image should belong to the same cluster.

The clustering-based method has an advantage over contrastive learning directly on image features, because the former compares images through a much smaller set of K cluster centers rather than through exhaustive pairwise feature comparisons:

The interesting trick in SwAV is to partition images equally across the K clusters. If we denote by C^T Z the image-to-cluster-center similarity matrix, then the problem becomes finding Q such that:

The constraints enforce that on average each cluster is associated with \text{batch size} / K data points. As illustrated in [1], they find the continuous code Q^* using the iterative Sinkhorn-Knopp algorithm.
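For reference, here is a minimal numpy sketch of that Sinkhorn-Knopp iteration, adapted from my reading of [1]'s pseudocode; epsilon and the iteration count are illustrative values.

import numpy as np

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    # scores: (K, B) prototype-to-sample similarities (i.e., C^T Z)
    Q = np.exp(scores / epsilon)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K   # each prototype gets 1/K of the total mass
        Q /= Q.sum(axis=0, keepdims=True); Q /= B   # each sample gets 1/B of the total mass
    return Q * B   # rescale so each column is a distribution over clusters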

SwAV has been used to train on 1 billion random Instagram images. The resulting models (called SEER) achieve SOTA top-1 accuracy on ImageNet after fine-tuning [2].

References

[1] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments: https://arxiv.org/pdf/2006.09882.pdf 

[2] Self-supervised Pretraining of Visual Features in the Wild: https://arxiv.org/abs/2103.01988

 

Run a specific parent’s method from a child class

This is an example of how to run a specific parent’s method from a child class in Python.

class A(object):
    def foo(self):
        print('A.foo()')
        self.run()  # resolved via the instance's MRO, not necessarily A.run

    def run(self):
        print("A run")


class B(object):
    def foo(self):
        print('B.foo()')
        self.run()  # also resolved via the instance's MRO

    def run(self):
        print("B run")


class C(A, B):
    def foo(self):
        print('C.foo()')
        A.foo(self)  # explicitly call A's foo, passing the instance
        B.foo(self)  # explicitly call B's foo

c = C()
c.foo()

 

Results:

A.foo()
A run
B.foo()
A run

Note that B.foo() also prints “A run”: self.run() is looked up through C’s method resolution order (C, A, B, object), so A.run shadows B.run even when the call originates inside B.foo.