Analyze DistributedDataParallel (DDP)’s behavior

DistributedDataParallel implements data parallelism at the module level and can run across multiple machines. One process runs on each device, holding one copy of the module. Each process loads its own data, which is non-overlapping with other processes’. At the initialization phase, all copies are synchronized to ensure they start from the same initial weights. The forward pass is executed independently on each device, during which no communication is needed. In the backward pass, the gradients are all-reduced across the devices, ensuring that each device ends up with an identical copy of the gradients/weights, therefore eliminating the need for model syncs at the beginning of each iteration [1].

[2] describes the design in more detail. For example, parameters can be synchronized more efficiently by bucketing: parameters are grouped into several buckets, and in the backward pass, as soon as the gradients of all members of a bucket are ready, synchronization kicks off for that bucket. Thus one does not need to wait for the gradients of ALL parameters to become ready before synchronization starts.

Another detail in [2] is how gradient synchronization is triggered in the backward pass: “DDP uses autograd hooks registered at construction time to trigger gradients synchronizations.”
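To make these two details concrete, here is a toy, framework-free simulation of bucketed gradient synchronization triggered by per-parameter hooks. All names are mine and this is not PyTorch’s actual implementation; it only illustrates the control flow.

```python
# Toy simulation of DDP's two design details: per-parameter hooks that fire
# when a gradient becomes ready, and buckets that kick off their all-reduce
# as soon as all members are ready. (Not PyTorch's actual implementation.)

class Bucket:
    def __init__(self, param_names):
        self.param_names = set(param_names)
        self.ready = set()

    def on_grad_ready(self, name):
        # This plays the role of an autograd hook registered per parameter.
        self.ready.add(name)
        if self.ready == self.param_names:
            return f"all-reduce kicked off for {sorted(self.param_names)}"
        return None

# Bucket the parameters of a 2-layer model; gradients become ready in
# reverse layer order, as in a real backward pass.
last_layer_bucket = Bucket(["layer2.w", "layer2.b"])
first_layer_bucket = Bucket(["layer1.w", "layer1.b"])
bucket_of = {
    "layer2.w": last_layer_bucket, "layer2.b": last_layer_bucket,
    "layer1.w": first_layer_bucket, "layer1.b": first_layer_bucket,
}

events = []
for p in ["layer2.w", "layer2.b", "layer1.w", "layer1.b"]:
    msg = bucket_of[p].on_grad_ready(p)
    if msg:
        events.append(msg)

# The last layer's bucket syncs before the first layer's gradients even exist.
print(events)
```

Note how the sync for the last layer’s bucket overlaps with the rest of the backward pass instead of waiting for all gradients.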

I did a quick experiment to verify these design details. test_multi_gpu is in charge of spawning worker processes, and the real model training happens in the _worker function. initialize_trainer is the code that initializes the model in each process. Check out my comments in the “Logs” section of the pseudocode below for the explanation of DDP’s behavior.

def test_multi_gpu(
    use_gpu: bool,
    num_gpus: int,
    normalization_data_map: Dict[str, NormalizationData],
    reader_options: ReaderOptions,
):
    logger.info(f"Enter test_multi_gpu with reader options {reader_options}")

    # These ENVS are needed by torch.distributed: https://fburl.com/1i86h2yg
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_unused_port())

    manager = mp.Manager()
    result_dict = manager.dict()

    workflow_run_id = flow.get_flow_environ().workflow_run_id
    # The second condition is to avoid collision in unit test.
    # When running unit test, a local DB is used.
    # As a result, the workflow run ID may not be unique.
    if workflow_run_id and workflow_run_id > 10000:
        init_method = f"zeus://{workflow_run_id}"
    else:
        host_name = socket.gethostname()
        process_id = os.getpid()
        init_method = f"zeus://{host_name}_{process_id}"

    backend = "nccl" if use_gpu else "gloo"

    mp.spawn(
        _worker,
        args=(
            use_gpu,
            num_gpus,
            backend,
            init_method,
            reader_options,
            normalization_data_map,
            result_dict,
        ),
        nprocs=num_gpus,
        join=True,
    )
    logger.info("finish spawn")


def _worker(
    rank: int,
    use_gpu: bool,
    world_size: int,
    backend: str,
    init_method: str,
    reader_options: ReaderOptions,
    normalization_data_map: Dict[str, NormalizationData],
    # output_dict is placed before the optional args so that the positional
    # args passed by mp.spawn (ending with result_dict) line up correctly
    output_dict=None,
    reward_options: Optional[RewardOptions] = None,
    warmstart_path: Optional[str] = None,
):
    logger.info(f"rank={rank} with reader options {reader_options}")
    dist.init_process_group(
        backend=backend, init_method=init_method, world_size=world_size, rank=rank
    )
    if use_gpu:
        torch.cuda.set_device(rank)

    model = create_model(...)
    trainer = model.initialize_trainer(
        use_gpu=use_gpu,
        reward_options=reward_options,
        normalization_data_map=normalization_data_map,
        warmstart_path=warmstart_path,
    )
    logger.info(f"rank={rank} finish initialize")

    num_of_data = 0
    data_reader = construct_distributed_data_reader(
        normalization_data_map, reader_options
    )
    for idx, batch in enumerate(data_reader):
        batch = post_data_loader_preprocessor(batch)
        if use_gpu:
            batch = batch.cuda()

        num_of_data += len(batch.training_input.state.float_features)

        logger.info(
            f"rank={rank} batch state={batch.training_input.state.float_features}"
        )
        logger.info(
            f"rank={rank} before train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )
        if rank == 1:
            logger.info(f"rank={rank} wake")
            time.sleep(60)
            logger.info(f"rank={rank} sleep")

        trainer.train(batch)
        logger.info(
            f"rank={rank} after train seq2slate param={print_param(trainer.seq2slate_net.seq2slate_net.seq2slate)}"
        )

        break

    logger.info(f"rank={rank} finish reading {num_of_data} data")


def initialize_trainer(self) -> Seq2SlateTrainer:
    seq2slate_net = initialize_model(...)

    if self.use_gpu:
        seq2slate_net = seq2slate_net.cuda()

    logger.info(f"Within manager {print_param(seq2slate_net.seq2slate)}")
    logger.info(
        f"Within manager {next(seq2slate_net.seq2slate.parameters()).device}"
    )
    if self.trainer_param.num_parallel > 1:
        seq2slate_net = _DistributedSeq2SlateNet(seq2slate_net)

    return _initialize_trainer(seq2slate_net)

###############
     Logs
###############
# This is printed within manager.initialize_trainer to show that
# models initially have different parameters
# (see line 66 then line 112~115):
I1003 063214.340 seq2slate_transformer.py:140] Within manager tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.341 seq2slate_transformer.py:142] Within manager cuda:0
I1003 063214.349 seq2slate_transformer.py:140] Within manager tensor([-0.1076, -0.0444,  0.3003, -0.1177,  0.0275, -0.0811,  0.2084,  0.3369,\n ...], device='cuda:1', grad_fn=<CatBackward>)


# Below is printed from line 85 ~ 90
# You can see that each process receives different, non-overlapping data.
# You can also see that at this point, the two models have the same parameters,
# which is ensured by DDP. The parameters come from one particular copy (rank=0).
I1003 063214.531 test_multi_gpu.py:144] rank=0 batch state=tensor([[ 0.0000,  0.0000, -0.0000,  ..., -1.6131, -1.6298, -1.6118],\n ...]],\n       device='cuda:0')
I1003 063214.540 test_multi_gpu.py:147] rank=0 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063214.544 test_multi_gpu.py:144] rank=1 batch state=tensor([[ 0.0000,  0.0000, -0.7115,  ..., -2.2678, -2.3524, -2.4194],\n  ...]],\n       device='cuda:1')
I1003 063214.553 test_multi_gpu.py:147] rank=1 before train seq2slate param=tensor([-0.2585,  0.3716, -0.1077, -0.2114,  0.1636,  0.1398, -0.2960, -0.1204,\n         ..., device='cuda:1', grad_fn=<CatBackward>)


# We deliberately let rank 1 sleep for one minute.
# But rank 0 does not return from its train function earlier,
# because it blocks on backward(), waiting for rank 1's backward() to finish.
# You can see that after the train function, both processes end up with the same parameters again.
I1003 063214.554 test_multi_gpu.py:150] rank=1 wake
I1003 063314.613 test_multi_gpu.py:152] rank=1 sleep
I1003 063315.023 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.706389904022217, clamped_ips_loss=-2.706389904022217, baseline_loss=0.0, max_ips=27.303083419799805, mean_ips=0.6373803615570068, grad_update=True
I1003 063315.033 test_multi_gpu.py:156] rank=0 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:0', grad_fn=<CatBackward>)
I1003 063315.033 test_multi_gpu.py:161] rank=0 finish reading 1024 data
I1003 063315.039 seq2slate_trainer.py:181] 1 batch: ips_loss=-2.7534916400909424, clamped_ips_loss=-2.7534916400909424, baseline_loss=0.0, max_ips=272.4482116699219, mean_ips=0.908729612827301, grad_update=True
I1003 063315.050 test_multi_gpu.py:156] rank=1 after train seq2slate param=tensor([-0.2485,  0.3616, -0.0977, -0.2214,  0.1736,  0.1298, -0.3060, -0.1304,\n         ...], device='cuda:1', grad_fn=<CatBackward>)
I1003 063315.050 test_multi_gpu.py:161] rank=1 finish reading 1024 data

References

[1] https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

[2] https://pytorch.org/docs/stable/notes/ddp.html

Tools needed to build an RL debugging tool

I’ve always had a dream to build a debugging tool that automatically analyzes an observational dataset and tells me whether the dataset is suitable for batch-RL algorithms like DQN/DDPG. As we know, we cannot directly control, or even know, how an observational dataset was collected. An incorrect data collection procedure or a wrong dataset would be costly to the whole model training loop [12]. My idea has been that this tool should first tell whether the process underlying the dataset is a Markov Decision Process (MDP). By definition, in an MDP, s_{t+1} (next state) is a function of s_t (current state) and a_t (action); R_t (reward) is also a function of s_t and a_t. Thus, the goal of the debugging tool would be to make causal inference on the observational dataset: do states and actions cause the changes of rewards and state transitions?

I will start with some Causal 101. I read some primers on causal inference [1, 2, 3, 4], among which [1] has a good, concise summary. The biggest take-away from [1] is that causation is not prediction. The second take-away is that there are two fundamental questions regarding causal effects: (1) estimating the causal effect of a known relationship X->Y; (2) causal relationship discovery, which [1] claims to be statistically impossible, though I don’t understand why. For (1), we can use randomized experiments in which we ensure X is changed independently and then observe how Y changes, or we can account for confounding variables in models. In a special case where we account for all confounding variables in a linear regression model (based on subject matter knowledge), the coefficient for X would be the perfect estimate of the causal effect of X->Y.

Regarding (2) causal relationship discovery, I’ve seen several papers researching this topic. There are generally three ways to discover causal relationships.

The first is to use a noise model. In the additive noise model (a bivariate model), assuming X->Y but not Y->X, if you fit two regression models Y=f(X)+\epsilon_1 and X=g(Y)+\epsilon_2, you will find that \epsilon_1 is independent of X while \epsilon_2 is not independent of Y. See [5, 6]. The noise independence test can be done through a kernel-based test (see the reference in [6]). The only situation where this method cannot distinguish the causal direction is when X and Y have a linear relationship and their noise is Gaussian.

The second method is related to Kolmogorov complexity. Theorem 1 from [7] states that, under some assumptions, if X->Y then K(F_X)+K(F_{Y|X}) \leq K(F_Y) + K(F_{X|Y}), which can be proxied by the so-called pinball loss. Then quantile regression (regression using the pinball loss) can be fitted from X to Y and from Y to X; the direction with the lower pinball loss indicates the causal direction.

The third method is called matching. While there are several methods for matching, here we talk about a widely adopted one called propensity score matching. The idea is that you group events together with similar propensity scores, where the propensity score of an event is the probability that it receives the treatment. Once similarly likely events are grouped together, you can further divide them by whether they actually received the treatment and compute the causal effect. One good example of applying propensity score matching is an NLP work [11] where they probe which word causes positive or negative emotions. Some people criticize propensity score matching for its drawbacks [9, 10], and there are more matching methods to consider [10]. We need to be very careful about the assumptions of these methods, since many of them cannot work when confounding variables are present.
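To make the additive-noise-model idea concrete, here is a minimal sketch of direction detection. This is my own toy setup, not the procedure from [5, 6]: it uses binned nonparametric regression and a crude heteroscedasticity score as a stand-in for a proper kernel independence test such as HSIC.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-2, 2, n)
y = x ** 3 + rng.normal(size=n)  # ground truth: X -> Y with additive noise

def dependence_score(pred, target, bins=20):
    """Regress target on pred with binned means, then measure how much the
    residual variance varies across bins (near 0 = residuals look independent
    of pred). A crude proxy for a kernel independence test."""
    order = np.argsort(pred)
    target = target[order]
    # Variance of target within a quantile bin of pred = residual variance
    # of a binned-mean regression in that bin.
    bin_vars = np.array([chunk.var() for chunk in np.array_split(target, bins)])
    return bin_vars.std() / bin_vars.mean()

forward = dependence_score(x, y)   # residuals of y given x: homoscedastic noise
backward = dependence_score(y, x)  # residuals of x given y: variance depends on y
print(forward, backward)
```

With the true direction X->Y, the forward residuals look like the original noise everywhere, while the backward residuals are much more spread out near y=0 than in the tails, so the backward score comes out larger and the causal direction is recovered.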

Next, I will go with Uncertainty 101. Uncertainty of deep neural networks is a really fascinating topic to me: we can fit almost anything using deep neural networks, but that doesn’t tell us much without uncertainty estimation. There are multiple ways to estimate the uncertainty of a deep NN:

  1. High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach: https://arxiv.org/pdf/1802.07167.pdf. The loss function is based on the idea that high-quality PIs should be as narrow as possible subject to a given coverage proportion, where coverage is the proportion of data whose label y lies within the predicted interval [y_l, y_u].
  2. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning: https://arxiv.org/abs/1506.02142. At test time, dropout is applied on a trained neural network multiple times to get multiple predictions. Prediction intervals are computed empirically from the obtained samples.
  3. Simple and scalable predictive uncertainty estimation using deep ensembles: https://arxiv.org/abs/1612.01474. Conceptually this is the most straightforward idea: an ensemble of models gives you a distribution of predictions. This idea is used in RL model-based planning, e.g., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models: https://arxiv.org/abs/1805.12114.
  4. Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf. Uncertainty estimates by quantile regression. A more detailed version: https://arxiv.org/abs/1811.00908
  5. Conservative Uncertainty Estimation by Fitting Prior Networks: https://openreview.net/pdf?id=BJlahxHYDS. Like Bayesian Optimization, the uncertainty estimate doesn’t rely on data labels.
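Idea 3 (ensembles) can be sketched in a few lines, with simple closed-form linear models standing in for neural networks: train each member on a bootstrap resample of the data and read uncertainty off the spread of their predictions. Everything below is my toy setup, not the paper’s.

```python
import random

random.seed(0)

# Toy data: y = 2x + 1 + noise, observed only on [0, 5].
xs = [i * 0.5 for i in range(11)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]
data = list(zip(xs, ys))

def fit_linear(points):
    # Closed-form least squares for a 1-D linear model (stand-in for a NN).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cov = sum((p[0] - mx) * (p[1] - my) for p in points)
    var = sum((p[0] - mx) ** 2 for p in points)
    slope = cov / var
    return slope, my - slope * mx

# Ensemble: each member sees a bootstrap resample of the data.
members = []
for _ in range(50):
    sample = [random.choice(data) for _ in data]
    members.append(fit_linear(sample))

def predict_with_uncertainty(x):
    preds = [a * x + b for a, b in members]
    mean = sum(preds) / len(preds)
    std = (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5
    return mean, std

_, std_in = predict_with_uncertainty(2.5)    # inside the training range
_, std_out = predict_with_uncertainty(20.0)  # far outside the training range
print(std_in, std_out)
```

As expected, the ensemble members agree where there is data and disagree far away from it, so the predictive std grows with distance from the training range.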

After introducing Causal and Uncertainty 101, let’s dive into the debugging tool in my mind. Since any observational dataset may contain confounding issues and spurious correlations, we cannot really trust an arbitrary observational dataset. Rather, the only trustworthy dataset is one of (s, a, r, s’) tuples collected by a completely random policy.

First, we need to debug whether R_t is a function of s_t and a_t. This is done by computing feature importance for each state feature and each action feature. Feature importance is obtained from the change in reward prediction after masking one particular feature across all logged data, using a trained and fixed reward model. The intuition is that if a feature is important, masking it should significantly perturb the reward prediction. If no state feature is important, then the problem degenerates, at best, to a multi-armed bandit problem when some action feature is important. We can use the uncertainty works surveyed above to get a confidence bound for the estimated feature importance, and we expect the lower confidence bound to be above 0. We call the test of whether the lower confidence bound > 0 in at least p% of logged data the feature importance significance test. (Update 2021-02-10: there is a clever way to do feature importance called Boruta: https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a)
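The masking procedure can be sketched as follows, with a hand-built (hypothetical) reward model standing in for a trained one; indices 0 and 1 are state features and index 2 is the action.

```python
# Sketch of masking-based feature importance on a fixed reward model.
# Toy reward model: reward depends on state feature s0 and the action,
# but not on state feature s1 (all numbers are made up).
def reward_model(s0, s1, a):
    return 3.0 * s0 + 0.0 * s1 + 2.0 * a

logged = [  # (s0, s1, a) triples from logged data
    (0.5, 1.2, 1), (-1.0, 0.3, 0), (2.0, -0.7, 1), (0.1, 2.5, 0),
]

def importance(feature_idx):
    # Mask one feature by replacing it with its mean over the logged data,
    # and measure the average absolute change in the reward prediction.
    mean_val = sum(row[feature_idx] for row in logged) / len(logged)
    changes = []
    for row in logged:
        masked = list(row)
        masked[feature_idx] = mean_val
        changes.append(abs(reward_model(*row) - reward_model(*masked)))
    return sum(changes) / len(changes)

print([importance(i) for i in range(3)])  # s0, s1, action
```

Masking s0 or the action perturbs the prediction, while masking the irrelevant s1 changes nothing, so only s0 and the action would pass the significance test.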

Since our data is randomized over actions, any action feature that passes the feature importance significance test lets us say that the action causes the reward. But if a state feature passes the feature importance significance test, we can only say the state feature is correlated with the reward. This is because we could have one of the following causal relationships; in the latter case, a confounding factor causes both the state feature and the reward.

That’s why even having states that are important for rewards is not enough. More specifically, there are two ways things could go wrong: (1) what if actions cannot exert any influence on the states that are important to rewards? (2) what if actions can influence a state that is considered important for reward prediction, but that state is not the cause of the reward and is merely correlated with it? For (1), we need to debug whether s_{t+1} is a function of s_t and a_t. We train a next-state prediction model for each state feature; by shuffling action values, we can see how the next-state prediction changes, and use the magnitude of that change as a proxy for the action’s influence on that state feature. To account for data variance (aleatoric error), we can fit mixture density distributions on next states.

For (2), “what if actions can influence a state that is considered important for reward prediction, but that state is not the cause of the reward and is merely correlated with it”, here is a good example to illustrate the point. Suppose there is a state feature “ad views of a particular product”, the action is “whether to show more ads about the product”, and the reward is “purchase of the product”. Suppose there is a confounding variable “passion about the product” that actually causes both views and purchases. While the action can increase views, the state feature and the reward are independent once we know whether a person loves the product. In this case, even if some state features pass the reward feature importance significance test and are caused by actions, we will still be unsure whether the problem is meaningful for RL.

Overall, the debugging tool procedure can be summarized as:

—————— this part may not be useful ———————

Suppose there are two users (mdp_id=0, and mdp_id=1). User 0 loves the product thus every ad view would lead to purchase. User 1 hates the product thus ad view has no effect on him. 

mdp_id  time_step  passion (s_t)  views (s_t)  action (a_t)  purchase (r_t)
0       0          1              0            show          1
0       1          1              1            not show      0
0       2          1              1            show          1
0       3          1              2            not show      0
1       0          0              0            show          0
1       1          0              0            not show      0

If passion and views are both state features, and purchase is a function of (passion, action), then passion will be considered important for rewards but not views. Then, for passion, we will find that actions cannot change it, which indicates this problem is not suitable for RL.

If passion and views are both state features, and purchase is a function of (passion, views, action) (e.g., there is additional purchase decay due to accumulated views), then passion and views will both be considered important for rewards. We will find that actions can change views but not passion, which concludes that this dataset is valid for trying RL.

If passion is unobservable and only views is observable, and purchase is a function of (passion, action), then no feature will be considered important for reward prediction. Our tool would raise a flag immediately.

If passion is unobservable and only views is observable, and purchase is a function of (passion, views, action), then views will be recognized as important for rewards. We will find that actions can change views and conclude that this dataset is still valid for trying RL.

When passion is unobservable and only views is observable, and purchase is a function of (passion, action), and passion=1 by nature has far more data than passion=0. 

—————— this part may not be useful ———————

 

References

[1] http://www.stat.cmu.edu/~larry/=stat401/Causal.pdf

[2] causal-inference.org: https://sites.google.com/view/causality-reading-group/resources?authuser=0

[3] Elements of Causal Inference: https://library.oapen.org/bitstream/handle/20.500.12657/26040/11283.pdf?sequence=1&isAllowed=y

[4] Review of Causal Discovery Methods Based on Graphical Models: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full

[5] Nonlinear causal discovery with additive noise models: https://papers.nips.cc/paper/3548-nonlinear-causal-discovery-with-additive-noise-models

[6] On the Identifiability of the Post-Nonlinear Causal Model: https://arxiv.org/pdf/1205.2599.pdf 

[7] Distinguishing Cause from Effect Using Quantiles: Bivariate Quantile Causal Discovery: https://arxiv.org/abs/1801.10579

[8] Frequentist Uncertainty Estimates for Deep Learning: http://bayesiandeeplearning.org/2018/papers/31.pdf

[9] https://en.wikipedia.org/wiki/Propensity_score_matching#Advantages_and_disadvantages

[10] Why Propensity Scores Should Not Be Used for Matching: https://www.cambridge.org/core/journals/political-analysis/article/why-propensity-scores-should-not-be-used-for-matching/94DDE7ED8E2A796B693096EB714BE68B

[11] Feature Selection as Causal Inference Experiments with Text Classification: https://www.aclweb.org/anthology/K17-1018/

[12] “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/0d556e45afc54afeb2eb6b51a9bc1827b9961ff4.pdf

 

Distributed RL

This post curates some distributed RL research.

We start from A3C [2], which we briefly covered a few years ago [3]. The core idea is that there is a central global network that is periodically synchronized with workers. Each worker copies the network parameters from the global network, runs through the environment to collect experiences, and computes gradients on its own machine. Then it sends the gradients to the global network, where the parameters actually get updated, and gets back the updated global network parameters [4]. Each worker updates the global network at different “asynchronous” time points, hence the “asynchronous” in A3C.

The problem with A3C is that workers may run with lagged parameters, so the gradients computed by different workers may contradict each other or the global version. Therefore, it is sometimes better to wait for all workers to finish before one global update happens [5]:

This means that at any time in A2C, all workers and the global network have the same parameters. Of course, the drawback of A2C is that the speed of one global gradient update is determined by the slowest worker.

IMPALA [1] overcomes some drawbacks of A2C/A3C. IMPALA is conceptually closer to A3C in that workers also run on different versions of the parameters to collect experience. The global controller can update the global network parameters whenever it receives a message from one worker, independently of other workers’ progress. Therefore, IMPALA (and A3C) is expected to have higher experience collection throughput than A2C. IMPALA solves the issue of worker policies lagging behind the global version with two methods: (1) workers send trajectories, rather than computed gradients as in A3C, to the global controller; the sent trajectories contain the action propensities of the worker network; (2) once the global controller receives trajectories from one worker, it can use V-trace in gradient computation, something similar to importance sampling weighting, to adjust for the discrepancy between the worker policy and the global policy.
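V-trace itself is compact enough to sketch. Below is a minimal single-trajectory version of the V-trace target, following Eqn. 1 of the IMPALA paper [1] (with truncation levels rho_bar and c_bar); the sanity check uses the fact that with on-policy data and no effective clipping, the target reduces to the n-step return.

```python
# Minimal V-trace target computation for a single trajectory, following
# Eqn. 1 of the IMPALA paper. Inputs: rewards r_t (length T), values V(x_t)
# (length T+1, last entry is the bootstrap value), and per-step importance
# ratios pi(a_t|x_t) / mu(a_t|x_t).
def vtrace_targets(rewards, values, ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    T = len(rewards)
    vs = list(values)  # v_T = V(x_T) serves as the bootstrap
    # Backward recursion:
    #   delta_s = rho_s * (r_s + gamma * V(x_{s+1}) - V(x_s))
    #   v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    for s in reversed(range(T)):
        rho = min(rho_bar, ratios[s])
        c = min(c_bar, ratios[s])
        delta = rho * (rewards[s] + gamma * values[s + 1] - values[s])
        vs[s] = values[s] + delta + gamma * c * (vs[s + 1] - values[s + 1])
    return vs[:-1]

# Sanity check: on-policy (all ratios = 1), V-trace equals the n-step return.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.3, 0.1, 0.7]
targets = vtrace_targets(rewards, values, ratios=[1.0, 1.0, 1.0], gamma=0.9)
nstep = rewards[0] + 0.9 * rewards[1] + 0.9 ** 2 * rewards[2] + 0.9 ** 3 * values[3]
print(targets[0], nstep)
```

When the worker policy mu lags behind pi, the ratios deviate from 1 and the truncations rho_bar/c_bar limit how much any single stale step can contribute.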

Seed RL [7] is another attempt to improve over IMPALA, motivated by two observations:
(1) in IMPALA, two things are sent over the network: trajectories (from workers to the global controller) and model parameters (from the global controller to workers). However, model parameters can be many times larger than trajectories, meaning a lot of network bandwidth is wasted on sending the parameters of large models.
(2) workers are usually CPU-based, which makes them slow at performing policies (inference).

Seed RL doesn’t send model parameters over the network. It performs both learning and inference on the global controller, which is usually equipped with GPUs. The global controller only sends inferred actions to workers, and workers send trajectories to the global controller (the same as in IMPALA). Seed RL also uses V-trace, as it is still an asynchronous algorithm in which workers can fall behind the global network.

The difference between Seed RL and IMPALA can be best viewed in the following diagrams [6]:

 

References

[1] IMPALA

[2] A2C / A3C

[3] https://czxttkl.com/2017/06/13/a3c-code-walkthrough/

[4] A3C explanation in Medium

[5] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a2c

[6] https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

[7] https://arxiv.org/pdf/1910.06591.pdf

Noise-Contrastive Estimation

I encountered the noise-contrastive estimation (NCE) technique in several papers recently, so it is a good time to visit this topic. NCE is a commonly used technique under the word2vec umbrella [2]. Previously I talked about several other ways to generate word embeddings in [1] but skipped introducing NCE. In this post, I will introduce it, with notations borrowed from the paper applying it to word2vec [2].

First, let’s look at an example to get familiar with the NLP terms target word and context [3], which are denoted as w and h in [2]:

Using the idea of word2vec, words that frequently appear together should have similar embeddings. Going one step further, a word’s embedding should match the embedding of its context. Often the context embedding is derived from the embeddings of the words in the context; for example, it can be as simple as the mean pooling of the embeddings of all the words in the context.
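As a concrete miniature of the mean-pooling idea (the vectors and vocabulary are entirely made up), the context embedding is the element-wise mean of the context words’ vectors, scored against the target word by a dot product:

```python
# Toy mean-pooling context embedding, scored by a dot product s(w, h).
embeddings = {           # made-up 3-d word vectors
    "the":   [0.1, 0.0, 0.2],
    "cat":   [0.9, 0.8, 0.1],
    "sat":   [0.7, 0.6, 0.3],
    "stock": [-0.5, 0.4, -0.9],
}

def context_embedding(words):
    vecs = [embeddings[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def score(target, context_words):
    h = context_embedding(context_words)
    w = embeddings[target]
    return sum(wi * hi for wi, hi in zip(w, h))

# "cat" should fit the context ("the", _, "sat") better than "stock".
print(score("cat", ["the", "sat"]), score("stock", ["the", "sat"]))
```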

Translating this idea into a loss function, we want to increase the following probability for every observed pair of (w, h):

(1)   \begin{equation*}    P^h_\theta(w) = \frac{exp(s_\theta(w, h))}{\sum_{w' \in V} exp(s_\theta(w', h))}\end{equation*}


where s_\theta(\cdot) is a scoring function that scores the similarity of any pair (w, h), and w' ranges over all other possible words in the entire vocabulary. Obviously, this can be an expensive loss function if the vocabulary is large, which is usually the case.

To avoid the inefficiency of computing pairwise scores over the whole vocabulary, NCE transforms the problem of Eqn. 1 into a binary classification problem. For an observed pair (w, h), we randomly pick k additional words from the vocabulary as negative samples. For each of the k+1 words (w plus the k sampled words), we ask for the probability that it came from the observed pair rather than from noise: a word appears among the k+1 words either because it was drawn from the data distribution P^h_\theta(w) (the word from the observed pair), or because it was drawn from the noise distribution P_n(w)=1/|V| (which happens k times as often). Using Bayes’ theorem (more details in [3]), we can derive the conditional probability of w being from the observed pair given that w is among the k+1 words:

(2)   \begin{equation*}    P^h(D=1|w, \theta) = \frac{P^h_\theta(w)}{P^h_\theta(w) + kP_n(w)}\end{equation*}

The classification objective is then: for the word w from (w, h), P^h(D=1|w, \theta) should be as high as possible; for the words from the negative samples, P^h(D=0|w, \theta) should be as high as possible. That is what Eqn. 8 from [2] is about:

(3)   \begin{equation*}    J^h(\theta) = \mathbb{E}_{w \sim P^h_{data}} \left[ \log P^h(D=1|w,\theta)\right] + k \mathbb{E}_{w \sim P_n} \left[ \log P^h(D=0|w,\theta) \right]\end{equation*}

There exist theoretical reasons, which I am not too sure about, that allow us to relax the definition of P^h(D=1|w, \theta): we can replace P^h_\theta(w) in Eqn. 2 with the unnormalized score exp(s_\theta(w,h)). With this replacement, we no longer need to worry about computing the partition function \sum_{w' \in V} exp(s_\theta(w', h)) as in Eqn. 1.
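Putting Eqns. 2 and 3 together with the unnormalized-score relaxation, a minimal NCE loss for one (w, h) pair might look like the sketch below. The scores and the tiny vocabulary are made up; a real implementation would compute s_\theta(w, h) from the word and context embeddings.

```python
import math

def p_data_given_word(score, k, vocab_size):
    # Eqn. 2, with the unnormalized score exp(s) in place of P^h_theta(w)
    # and a uniform noise distribution P_n(w) = 1/|V|.
    p_model = math.exp(score)
    p_noise = 1.0 / vocab_size
    return p_model / (p_model + k * p_noise)

def nce_loss(positive_score, negative_scores, vocab_size):
    # Negated Eqn. 3 (lower is better): push D=1 for the observed word
    # and D=0 for each of the k noise words.
    k = len(negative_scores)
    loss = -math.log(p_data_given_word(positive_score, k, vocab_size))
    for s in negative_scores:
        loss -= math.log(1.0 - p_data_given_word(s, k, vocab_size))
    return loss

# A model that scores the true word high and noise words low gets a lower loss.
good = nce_loss(positive_score=3.0, negative_scores=[-3.0, -2.5], vocab_size=10)
bad = nce_loss(positive_score=-3.0, negative_scores=[3.0, 2.5], vocab_size=10)
print(good, bad)
```

Note that only k+1 scores are evaluated per pair, instead of a sum over the whole vocabulary as in Eqn. 1.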

There are also variants of the NCE loss. One is called InfoNCE (see Eqn. 4 in [4]) and has been applied to model-based reinforcement learning [5]. [4] claims that its objective function actually optimizes a lower bound on the mutual information between w and h in the word2vec context (or state and next state in the reinforcement learning context). Mutual information can be understood as the reduction in uncertainty about w after observing h [6].

 

References

[1] https://czxttkl.com/2017/01/02/embedding-and-heterogeneous-network-papers/

[2] Learning word embeddings efficiently with noise-contrastive estimation

[3] https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#context-based-skip-gram-model

[4] Representation Learning with Contrastive Predictive Coding

[5] Unsupervised State Representation Learning in Atari

[6] http://www.scholarpedia.org/article/Mutual_information

Control variate using Taylor expansion

We talked about control variates in [4]: when evaluating \mathbb{E}_{p(x)}[f(x)] by Monte Carlo samples, we can instead evaluate \mathbb{E}_{p(x)}[f(x)-h(x)+\theta] with \theta=\mathbb{E}_{p(x)}[h(x)] in order to reduce variance. The requirements for a control variate to work are that h(x) is correlated with f(x) and that the mean of h(x) is known.

In this post we will walk through a classic example of using a control variate, in which h(x) is picked as a Taylor expansion of f(x). We know that f(x)’s Taylor expansion can be expressed as f(x)=f(a) + f'(a)(x-a) + f''(a)/2! * (x-a)^2 + f'''(a)/3! * (x-a)^3 + \cdots. Therefore, if we pick the second order of the Taylor expansion of f(x) and set a=0, we get h(x)=f(0)+f'(0)x + f''(0)/2! * x^2.

Let’s see an example from [6]. Suppose f(u_1, u_2)=exp[(u_1^2 + u_2^2)/2] and we want to evaluate the integral \Psi=\int^1_0 \int^1_0 f(u_1, u_2) du_1 du_2. If we use a Monte Carlo estimator to estimate \Psi with samples U^1, \cdots, U^m, we have:
\frac{\text{area of sampling}}{m}\sum\limits_{k=1}^m f(U^k)=\frac{1}{m}\sum\limits_{k=1}^m f(U^k)

The control variate as the second-order Taylor expansion of f is h(u_1, u_2)=1 + (u_1^2 + u_2^2)/2 (we use the multi-variable Taylor expansion here [7]). We first compute its mean: \theta=\mathbb{E}_{p(x)}[h(x)] = \int^1_0\int^1_0 \left(1 +(u_1^2+u_2^2)/2 \right) du_1 du_2 = 4/3. Thus, our control variate Monte Carlo estimator is:
\frac{1}{m}\sum\limits_{k=1}^{m} z(U^k)\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[f(U^k) - h(U^k) + \theta\right]\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[exp[((U^k_1)^2 + (U^k_2)^2)/2] - 1 - ((U^k_1)^2 + (U^k_2)^2)/2 + 4/3 \right]

And from the results in [6], the sampling variance of z(u_1, u_2) is smaller than that of f(u_1, u_2).
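This example is easy to reproduce numerically. The sketch below, using only the standard library, compares the sampling variance of the plain estimator f with the control-variate estimator z:

```python
import random
from math import exp

random.seed(0)

def f(u1, u2):
    # The integrand from the example above.
    return exp((u1 ** 2 + u2 ** 2) / 2)

def h(u1, u2):
    # Second-order Taylor expansion of f around (0, 0).
    return 1 + (u1 ** 2 + u2 ** 2) / 2

theta = 4.0 / 3.0  # E[h] over the unit square, computed analytically above

samples_f, samples_z = [], []
for _ in range(100000):
    u1, u2 = random.random(), random.random()
    samples_f.append(f(u1, u2))
    samples_z.append(f(u1, u2) - h(u1, u2) + theta)

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

# Both estimators target the same Psi, but z has much lower variance
# because h tracks f closely and its mean theta is known exactly.
print(mean(samples_f), variance(samples_f))
print(mean(samples_z), variance(samples_z))
```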

Now, what [2] proposes is to apply a similar Taylor expansion-based control variate to the REINFORCE policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_{\pi}(\cdot), a_t \sim \pi(\cdot | s_t)} \left[ \nabla_\theta log \pi_\theta (a_t | s_t) R_t\right]

If we use a Monte Carlo estimator, then we essentially use samples \nabla_\theta log \pi_\theta (a_t | s_t) R_t to approximate \nabla_\theta J(\theta), which introduces variance. If we have a control variate h(s_t, a_t)=Q_w(s_t, \bar{a}_t) + \nabla_a Q_w(s_t, a)|_{a=\bar{a}_t} (a_t - \bar{a}_t), where \bar{a}_t can be chosen as any value but a sensible choice is \mu_\theta(s_t), a deterministic version of \pi_\theta(a_t | s_t), then the control variate version of the policy gradient can be written as (Eqn. 7 in [2]):
\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s\sim \rho_\pi, a \sim \pi}\left[ h(s_t, a_t)\right] \newline=\nabla_\theta log \pi_\theta(a_t | s_t) R_t - h(s_t, a_t) + \mathbb{E}_{s \sim \rho_\pi}\left[ \nabla_a Q_w(s_t, a)|_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)\right] 

As you can see, the second term can be computed when Q_w is a critic with a known analytic expression, and it does not require sampled actions from \pi_\theta.

[3] uses the law of total variance (some good examples and intuitions in [8, 9]) to decompose the variance of the control-variate policy gradient into three parts: variance caused by states \Sigma_s, by actions \Sigma_a, and by trajectories \Sigma_\tau. What [3] highlights is that the magnitude of \Sigma_a is usually much smaller than the other two, so the benefit of using a control variate-based policy gradient is very limited.

 

References

[1] MUPROP: UNBIASED BACKPROPAGATION FOR STOCHASTIC NEURAL NETWORKS

[2] Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC

[3] The Mirage of Action-Dependent Baselines in Reinforcement Learning

[4] https://czxttkl.com/2020/03/31/control-variate/

[5] https://czxttkl.com/2017/03/30/importance-sampling/

[6] https://www.value-at-risk.net/variance-reduction-with-control-variates-monte-carlo-simulation/

[7] https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables

[8] http://home.cc.umanitoba.ca/~farhadi/ASPER/Law%20of%20Total%20Variance.pdf

[9] https://math.stackexchange.com/a/3377007/235140

Notes on Ray paper

I am reading Robert Nishihara’s thesis on Ray [1] and have encountered several concepts I am not familiar with, so this post takes some notes on those new concepts.

Ring AllReduce

Ring AllReduce is a technique for communicating among multiple computation nodes to aggregate results. It is a primitive used by many distributed training systems. In the Ray paper, the author analyzes how easily Ring AllReduce can be implemented using Ray’s API.

There are two articles that articulate the idea really well [5, 6]. They point out the network-communication bottleneck when a distributed training system aggregates gradients from worker nodes in a master-slave pattern. Each worker (say there are P-1 workers in total) sends its gradient vector (of size N) to the master, and the master aggregates them and sends the updated model vector (the same size as the gradient vectors) back to each worker for the next iteration, so the data transferred through the master’s network link is as large as 2*(P-1)*N, which scales linearly with the number of workers.

Ring AllReduce has multiple cycles of data transfer among all workers (P workers in total, since no separate master is needed). Each cycle transfers only a small amount of data, so the amount each node transfers, roughly 2N(P-1)/P ≈ 2N, stays nearly constant as the number of workers grows, instead of scaling like the master’s 2*(P-1)*N in the master-slave pattern.

The basic idea of Ring AllReduce is to divide each gradient vector into P chunks on all workers. There are first P-1 cycles of data aggregation, then P-1 cycles of data syncing. In the first cycle, each worker i (i=1,\ldots,P) sends its i-th chunk to the next worker (with index i+1, wrapping around), and receives the (i-1)-th chunk from the previous worker. The received (i-1)-th chunk is aggregated locally with the worker’s own (i-1)-th chunk. In the second cycle, each worker sends its (i-1)-th chunk, which was aggregated in the last cycle, to the next worker, and receives the (i-2)-th chunk from the previous worker. Similarly, each worker can now aggregate its (i-2)-th chunk with the received chunk, which is also indexed i-2. Continuing this pattern, each worker sends its (i-2)-th, (i-3)-th, … chunk in each cycle until the P-1 aggregation cycles are done. At that point, each worker holds one chunk that has been fully aggregated across all workers. Running similar circular cycles P-1 more times then syncs all workers on every fully aggregated chunk.
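To make the two phases concrete, here is a minimal NumPy simulation of the scheme described above (a sketch; `ring_allreduce` and the explicit message-passing loops are my own construction, not from the Ray paper):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate Ring AllReduce: each of the P workers holds a gradient vector;
    afterwards every worker holds the element-wise sum of all the vectors."""
    P = len(grads)
    # Each worker splits its gradient vector into P chunks.
    chunks = [np.array_split(np.asarray(g, dtype=float), P) for g in grads]

    # Phase 1 (aggregation / reduce-scatter): P-1 cycles. In cycle `step`,
    # worker i sends chunk (i - step) % P to worker (i + 1) % P and adds the
    # chunk it receives from worker (i - 1) % P to its own copy of that chunk.
    for step in range(P - 1):
        sends = [chunks[i][(i - step) % P].copy() for i in range(P)]
        for i in range(P):
            c = (i - 1 - step) % P
            chunks[i][c] = chunks[i][c] + sends[(i - 1) % P]

    # Phase 2 (syncing / all-gather): P-1 cycles circulating each fully
    # aggregated chunk around the ring, overwriting stale copies.
    for step in range(P - 1):
        sends = [chunks[i][(i + 1 - step) % P].copy() for i in range(P)]
        for i in range(P):
            chunks[i][(i - step) % P] = sends[(i - 1) % P]

    return [np.concatenate(c) for c in chunks]
```

Counting the messages: each worker sends 2(P-1) chunks of size about N/P, i.e., roughly 2N per worker regardless of P, which is the claimed advantage over the master-slave pattern.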

I draw one simple diagram to illustrate the idea:

 

 

ADMM (Alternating Direction Method of Multipliers)

Another interesting technique introduced in the thesis is ADMM. It is a good opportunity to revisit optimization from scratch. I’m mainly following [4] and [7] for basic concepts.

The definition of directional derivative and gradient:
The gradient of f(x), x\in \mathbb{R}^n is \nabla f(x) = \left\{ \frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_n}\right\}. The gradient is a vector and can only be known fully when given a concrete value x \in \mathbb{R}^n. The directional derivative is the gradient’s projection on another unit vector u \in \mathbb{R}^n: df(x)(u) = <\nabla f(x), u> = |\nabla f(x)| \;|u| \;cos \theta = |\nabla f(x)| \;cos \theta, where <\cdot, \cdot> is inner product. See some introduction in [9] and [10].

The definition of a proper function simply means (Definition 51.5 [7]): f(x) > -\infty for all x \in \mathbb{R}^n and f(x) < +\infty for some x \in \mathbb{R}^n

The definition of a convex and strictly convex function is (Proposition 40.9 [7]):  
The function f is convex on U iff f(v) \geq f(u) + df(u)(v-u) for all u, v \in U. Recall that df(u)(v-u) is the directional derivative of f at u along v-u, i.e., the first-order change of f caused by moving from u toward v. f is a strictly convex function iff f(v) > f(u) + df(u)(v-u) for all u \neq v.

This definition actually leads to a very common technique to check if a multivariate quadratic function is a convex function: if f(u)=\frac{1}{2}u^T A u - u^T b, u \in \mathbb{R}^n, as long as A is positive semidefinite, then f(u) is convex. See Example 40.1 [7].
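As a quick numerical sanity check of that criterion (a sketch; `is_convex_quadratic` is my own helper name), positive semidefiniteness can be tested via the eigenvalues of the symmetric part of A:

```python
import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    """f(u) = (1/2) u^T A u - u^T b is convex iff the symmetric part of A
    is positive semidefinite, i.e., all of its eigenvalues are >= 0."""
    S = (A + A.T) / 2
    return bool(np.all(np.linalg.eigvalsh(S) >= -tol))
```

For example, the identity matrix passes the check, while diag(1, -1) fails it (the corresponding quadratic is a saddle, not convex).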

The definition of affine hull: given a set of vectors S=(x_1, \cdots, x_m) with x_i \in \mathbb{R}^n, aff(S) is all affine combinations of the vectors in S, i.e., aff(S)=\{\lambda_1 x_1 + \cdots + \lambda_m x_m | \lambda_1 + \cdots + \lambda_m = 1\}.

The definition of relint (Definition 51.9 [7]):
Suppose C is a subset of \mathbb{R}^n. The relative interior of C is the set: relint(C)=\{x \in C | B_\epsilon(x) \cap aff(C) \subseteq C \;\; \text{for some } \epsilon > 0\}. There is a good example in [8] which explains the difference between the interior and the relative interior.

The most important thing is to understand the high level of the typical optimization procedure, which I also covered in [3]. Suppose we want to optimize the primal problem:
minimize  J(v)
subject to  \varphi_i(v) \leq 0, \quad \quad i=1,\ldots,m
                  \psi_j(v)=0, \quad\quad j = 1,\ldots, p
with dom(J) = \Omega \subset \mathbb{R}^n

The Lagrangian L(v,\mu, \upsilon) for the primal problem is defined as:
L(v, \mu, \upsilon)=J(v)+\sum\limits_{i=1}^m \mu_i \varphi_i(v) + \sum\limits_{j=1}^p \upsilon_j \psi_j(v), where \mu \in \mathbb{R}^m_+ (i.e., \mu \geq 0) and \upsilon \in \mathbb{R}^p.

The KKT conditions are necessary conditions: if v^* is a local minimum of the primal problem (under suitable regularity conditions), then v^* must satisfy the following [11]:
1. stationarity: \nabla J(v^*) + \sum\limits_{i=1}^m \mu_i \nabla \varphi_i(v^*) + \sum\limits_{j=1}^p \upsilon_j \nabla \psi_j(v^*) = 0
2. primal feasibility: \varphi_i(v^*) \leq 0, \; i=1, \ldots, m and \psi_j(v^*)=0, \; j=1, \ldots, p
3. dual feasibility: \mu_i \geq 0, \; i=1,\ldots, m
4. complementary slackness: \mu_i \varphi_i(v^*)=0, \; i=1, \ldots, m

In other words, if we already know v^*, we know it must satisfy the KKT conditions. However, not all solutions that satisfy the KKT conditions are local minima of the primal problem. The KKT conditions can be used to find the global minimum v^* if additional conditions are satisfied. One such condition is the “second-order sufficient conditions” [12], which I also talked about in [3]. Another is convexity: if the constraints, the domain dom(J), and the objective function J are convex, then the KKT conditions also imply v^* is a global minimum (Theorem 50.6 [7]).

While some problems have characteristics that allow using the KKT conditions as sufficient conditions to find the global solution, others are easier to solve or approximate by working with the dual problem. By maximizing another function, called the dual function, we can get the exact solution or a lower bound of the primal problem.

The dual function G: \mathbb{R}^m_+ \times \mathbb{R}^p \rightarrow \mathbb{R} is defined as:
G(\mu, \upsilon) = inf\limits_{v \in \Omega} L(v, \mu, \upsilon)  \quad \quad \mu \in \mathbb{R}^m_+ \text{ and } \upsilon \in \mathbb{R}^p

The dual problem is defined as:
maximize G(\mu, \upsilon)
subject to \mu \in \mathbb{R}^m_+, \upsilon \in \mathbb{R}^p

From [7] page 1697, one of the main advantages of the dual problem over the primal problem is that it is a convex optimization problem, since we wish to maximize a concave objective function G (thus minimize −G, a convex function), and the constraints \mu \geq 0 are convex. In a number of practical situations, the dual function G can indeed be computed.

Let the maximum of the dual problem be d^* and the minimum of the primal problem be p^*. Weak duality always holds: d^* \leq p^*. Strong duality holds when the duality gap is zero, which requires certain conditions to hold, for example Slater’s condition [14]. We can find the maximum (\mu^*, \upsilon^*) of the dual problem by a special form of gradient ascent called sequential minimal optimization (SMO) [13], because special treatment is needed for the constraints involved in the dual problem.

[7] provides two ways on how to do constrained optimization on SVM (Section 50.6 and 50.10): one is to use the KKT conditions, the other is to solve the dual problem. 

The dual problem is simplified when there are only affine equality constraints in the primal problem:
Primal problem:
minimize J(u)
subject to  Au=b

Dual problem:
maximize G(\lambda)=inf\limits_{u \in \mathbb{R}^n} L(u, \lambda)=inf\limits_{u \in \mathbb{R}^n}J(u)+\lambda^T(Au-b)
subject to \lambda \in \mathbb{R}^p

Since the dual problem is an unconstrained optimization problem, we can use dual ascent to solve this problem easily:
u^{k+1} = argmin_u L(u, \lambda^k)
\lambda^{k+1} = \lambda^k + \alpha^k(Au^{k+1}-b),
where \alpha^k>0 is a step size.
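Here is a minimal sketch of dual ascent for a quadratic J (my own toy example; when J is quadratic and strongly convex, the u-update has a closed form):

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.1, iters=500):
    """Dual ascent for: minimize (1/2) u^T P u + q^T u  subject to  A u = b."""
    lam = np.zeros(A.shape[0])
    u = np.zeros(P.shape[0])
    for _ in range(iters):
        # u^{k+1} = argmin_u L(u, lam): set P u + q + A^T lam = 0 and solve.
        u = np.linalg.solve(P, -(q + A.T @ lam))
        # lam^{k+1} = lam^k + alpha * (A u^{k+1} - b)
        lam = lam + alpha * (A @ u - b)
    return u, lam
```

For instance, minimizing u_1^2 + u_2^2 subject to u_1 + u_2 = 1 converges to u = (0.5, 0.5) with \lambda^* = -1.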

The main flaw of dual ascent is that, unless certain conditions hold (e.g., strict convexity of J), the updates may converge slowly or diverge. The augmented Lagrangian method augments the Lagrangian function with a penalty term:
L_{p}(u, \lambda)=J(u) + \lambda^T(Au-b)+ (p/2)\|Au-b\|^2_2

With the penalty term, the augmented Lagrangian is strongly convex in u under much milder assumptions on J, which gives the method better convergence properties.

The method of multipliers is the dual ascent applied on the augmented Lagrangian:
u^{k+1} = argmin_u L_p(u, \lambda^k)
\lambda^{k+1} = \lambda^k + p(Au^{k+1}-b)

If u can be separated into two parts such that J(u)=J(x,z)=f(x)+g(z), then we can iteratively update x, z, and \lambda separately, and such a method is called Alternating Direction Method of Multipliers (ADMM):
Primal problem:
minimize f(x)+g(z) 
subject to Ax+Bz=c

The augmented Lagrangian:
L_p(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-c)+(p/2)\|Ax+Bz-c\|^2_2

Updates (note we are not doing (x^{k+1}, z^{k+1})=argmin_{x,z} L_p(x,z,\lambda) as in the method of multipliers):
x^{k+1}=argmin_x L_p(x, z^k, \lambda^k)
z^{k+1}=argmin_z L_p(x^{k+1}, z, \lambda^k)
\lambda^{k+1} = \lambda^k + p(Ax^{k+1}+Bz^{k+1}-c)

If we define \mu=(1/p)\lambda and r=Ax+Bz-c, then the updates can be simplified as:
x^{k+1}=argmin_x \left( f(x) + (p/2) \|Ax+Bz^k-c+\mu^k\|^2_2 \right)
z^{k+1}=argmin_z \left( g(z) + (p/2)\|Ax^{k+1} + Bz - c + \mu^k \|^2_2 \right)
\mu^{k+1}=\mu^k + Ax^{k+1} + Bz^{k+1} - c

Structures in f, g, A, and B can often be exploited to carry out x-minimization and z-minimization more efficiently. We now look at several examples.

Example 1: if A=I is an identity matrix, v=-Bz^k+c-\mu^k, and f(\cdot) is the indicator function of a closed nonempty convex set \mathcal{C}, then x^{k+1}=argmin_x \left( f(x) + (p/2) \|x-v\|^2_2 \right)=\Pi_{\mathcal{C}}(v), which means x^{k+1} is the projection of v onto \mathcal{C}.

Example 2: if f(x)=(1/2)x^T P x + q^T x + r and v=-Bz^k+c-\mu^k, then x^{k+1}=argmin_x \left( f(x) + (p/2) \|Ax-v\|^2_2 \right) becomes an unconstrained quadratic programming with the analytic solution at x^*=\left(P+pA^T A\right)^{-1}(pA^Tv-q).

Example 3: if f(x)=(1/2)x^T P x + q^T x + r, dom(f)=\{x \mid Ax=b\}, and the solution must also satisfy x\geq 0, then the ADMM primal problem becomes:
minimize f(x) + g(z)  (where g(z) is an indicator function of the nonnegative orthant \mathbb{R}_+^n)
subject to x-z=0

Then x^{k+1}=argmin_x \left( f(x) + (p/2) \|x-z^k+u^k\|^2_2 \right). This is a constrained quadratic programming. We have to use KKT conditions to solve it. Suppose the Lagrangian multiplier is v\in\mathbb{R}^n, then the KKT conditions state that:
\nabla f(x) +\nabla\left((p/2) \|x-z^k+u^k\|^2_2 \right) + A^T v \newline= \nabla f(x) + p (x - z^k + u^k)+ A^T v = 0
Ax-b = 0,
which is exactly given by a linear system in matrix form in Section 5.2 [4]:

(As a side note, solving a constrained quadratic program usually relies on the KKT conditions; it can be converted to solving a linear system in matrix form: https://www.math.uh.edu/~rohop/fall_06/Chapter3.pdf)
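For an equality-constrained QP, this amounts to one linear solve of the stacked KKT system (a sketch; `solve_eq_qp` is my own function name):

```python
import numpy as np

def solve_eq_qp(P, q, A, b):
    """minimize (1/2) x^T P x + q^T x  subject to  A x = b.
    Stationarity (P x + q + A^T v = 0) and primal feasibility (A x = b)
    stack into the KKT linear system [[P, A^T], [A, 0]] [x; v] = [-q; b]."""
    n, m = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], sol[n:]   # primal solution x, multiplier v
```

The same toy problem (minimize x_1^2 + x_2^2 subject to x_1 + x_2 = 1) is solved in one step this way, without any iteration.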

Example 4 (Section 4.4.3 in [4]): if f(x)=\lambda \|x\|_1, v=-Bz^k+c-\mu^k, and A=I, then x^*=argmin_x(\lambda \|x\|_1 + (p/2)\|x-v\|^2_2)=S_{\lambda/p}(v) where S is the soft thresholding operator. 
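The soft thresholding operator itself is a one-liner (a sketch; it shrinks each component of v toward zero by \kappa and truncates at zero):

```python
import numpy as np

def soft_threshold(v, kappa):
    """S_kappa(v)_i = sign(v_i) * max(|v_i| - kappa, 0): the proximal operator
    of kappa * ||.||_1, used as the x-update in ADMM for lasso-type problems."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)
```

For example, S_1 maps (3, -0.5, 1) to (2, 0, 0): large components shrink, small ones are zeroed out, which is where the sparsity of the L1 penalty comes from.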

 

References

[1] On Systems and Algorithms for Distributed Machine Learning: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-30.pdf

[2] Compare between parameter servers and ring allreduce: https://harvest.usask.ca/bitstream/handle/10388/12390/CHOWDHURY-THESIS-2019.pdf?sequence=1&isAllowed=y

[3] Overview of optimization: https://czxttkl.com/2016/02/22/optimization-overview/

[4] ADMM tutorial: https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[5] https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/

[6] https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da

[7] Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning https://www.cis.upenn.edu/~jean/math-deep.pdf

[8] https://math.stackexchange.com/a/2774285/235140

[9] http://sites.science.oregonstate.edu/math/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/grad/grad.html

[10] https://math.stackexchange.com/a/661220/235140

[11] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Necessary_conditions

[12] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Sufficient_conditions

[13] http://cs229.stanford.edu/notes/cs229-notes3.pdf

[14] https://en.wikipedia.org/wiki/Slater%27s_condition

Understand Fourier Transform of Images

We’ve covered the Fourier Transform in [1] and [2], but only with 1D examples. In this post we will see what the 2D Fourier Transform looks like.

First, we look at a 2D image with one direction sinusoid waves (left) and its Fourier Transform (right).

I added coordinates to help you understand the illustration. Let’s assume the x-axis in the image corresponds to the i-axis in the frequency domain, and the y-axis in the image corresponds to the j-axis in the frequency domain. Because there is no fluctuation along the x-axis, there should only be the i=0 frequency. However, along the y-axis there is a regular sinusoid change, so there should be a j \neq 0 frequency. As a combination, we see two dots (two Dirac delta functions) on the line i=0.
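This claim can be checked numerically with `np.fft.fft2` (a sketch; the synthetic image below is my own stand-in for the illustration, a sinusoid that varies only along the y-axis):

```python
import numpy as np

n = 64
y = np.arange(n)
# Image varies sinusoidally along y (4 cycles) and is constant along x.
img = np.sin(2 * np.pi * 4 * y / n)[:, None] * np.ones((1, n))

mag = np.abs(np.fft.fft2(img))

# The two dominant coefficients (the two "dots") sit on the zero-horizontal-
# frequency line, at vertical frequencies +4 and -4 (rows 4 and n - 4).
rows, cols = np.unravel_index(np.argsort(mag.ravel())[-2:], mag.shape)
```

The two largest magnitudes indeed land at column 0 (the i=0 line) and at rows 4 and n-4, matching the two symmetric dots in the figure.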

To reinforce our understanding, let’s look at three more pictures. The first two have vertical bars with different spatial frequencies. As you might guess, the first picture, with sparser bars, has a lower frequency than the second picture. Their frequencies lie only on the line j=0 because there is no signal variation along the y-axis.

The third picture is generated by sin(x+y) (https://www.wolframalpha.com/input/?i=+plot+sin%28x%2By%29), and it has regular wave variations along both the x- and y-axes. Thus, the two white dots in the frequency domain lie on a diagonal.

Next, we examine a real-world photo, a famous picture titled “cameraman”. We mark one leg of the tripod in the image and its corresponding frequency band. This leg is almost perpendicular to the x-axis. Therefore, it introduces high-frequency variation (sharp edges) along the x-axis, while introducing less variation along the y-axis. In the frequency domain, you can see the corresponding line has a slope leaning more toward the i-axis than the j-axis.

At last, let’s revisit how images are constituted. In essence, images are formed by 2D waves, just like 1D functions are formed by 1D waves:

 

 

References

[1] https://czxttkl.com/2018/10/07/eulers-formula/

[2] https://czxttkl.com/2020/04/27/revisit-gaussian-kernel/ 

Image examples come from:

[3] https://www.youtube.com/watch?v=YYGltoYEmKo

[4] https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

[5] https://www.youtube.com/watch?v=gwaYwRwY6PU

A website that is helpful in visualizing 2d function:

[6] http://al-roomi.org/3DPlot/index.html

Revisit Gaussian kernel

This post is mainly about the Gaussian kernel (aka radial basis function kernel), a commonly used technique in machine learning. We’ve touched on the concept of the Gaussian kernel in [1] and [2], but with fewer details.

Definition of Gaussian Kernel

The intuitive understanding of a kernel is that it maps points in a data space, where those points may be difficult to classify, to a higher dimensional space, where the corresponding mapped points are easier to separate.

Gaussian kernel is defined as:

    \[K(\mathbf{x}, \mathbf{y})=exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right),\]

where \mathbf{x} and \mathbf{y} are two arbitrary points in the data space, and \sigma is a hyperparameter of the Gaussian kernel that controls the sensitivity to the difference between \mathbf{x} and \mathbf{y}. Imagine \sigma goes to infinity: then no matter how different \mathbf{x} and \mathbf{y} are, K(\mathbf{x}, \mathbf{y}) will be 1. The formula of the Gaussian kernel has the same functional form as the PDF of a normal distribution, up to a normalization constant.

K(\mathbf{x}, \mathbf{y}) can be understood as the similarity between the corresponding points of \mathbf{x} and \mathbf{y} in the mapped higher dimensional space (denoted as \phi(\mathbf{x}) and \phi(\mathbf{y})). The higher dimensional space is actually an infinite dimensional space, and the similarity is measured by the dot product. That is, K(\mathbf{x}, \mathbf{y}) = <\phi(\mathbf{x}), \phi(\mathbf{y})>.

So what is \phi(\mathbf{x}) then? According to ([3] page 11), we have:

Application of Gaussian Kernels

Using Gaussian kernels, we can design classification/regression algorithms whose ideas are similar to nearest-neighbor algorithms. Here is how it works [4]. Suppose we have N data points \mathbf{x}_1, \cdots, \mathbf{x}_N with their labels y_1, \cdots, y_N. Each of the N data points contributes some amount of information w_n to the prediction h(\mathbf{x}) of a new data point.

Intuitively, those \mathbf{x}_n that are close to \mathbf{x} should contribute more information, and those \mathbf{x}_n that are far away from \mathbf{x} should contribute less. Therefore, we use a Gaussian kernel K(\mathbf{x}, \mathbf{x}_n)=exp\left(-\gamma \|\mathbf{x}-\mathbf{x}_n\|^2\right) to weight the contribution w_n of \mathbf{x}_n to h(\mathbf{x}). This screenshot comes from [4], where \gamma = \frac{1}{2\sigma^2}.

The way to learn w_n is to require zero in-sample error, i.e., h(\mathbf{x}_n)=y_n \text{ for } n=1, \cdots, N. Therefore, we can solve the following matrix-vector equation to obtain \mathbf{w}:

Once we have \mathbf{w}, we have full information of our regression/classification model h(\mathbf{x}), and we can use it to make prediction to unseen data.
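A minimal sketch of this exact-interpolation RBF model (the function names `rbf_fit`/`rbf_predict` are my own):

```python
import numpy as np

def rbf_fit(X, y, gamma=1.0):
    """Solve K w = y, where K_nm = exp(-gamma * ||x_n - x_m||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.linalg.solve(np.exp(-gamma * d2), y)

def rbf_predict(X_train, w, X_new, gamma=1.0):
    """h(x) = sum_n w_n * exp(-gamma * ||x - x_n||^2)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2) @ w
```

By construction, h reproduces the labels exactly on the training points (K is invertible for distinct points since the Gaussian kernel is positive definite), and interpolates smoothly in between.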

We can also use a Gaussian kernel to perform smoothing. In [11], the author details some examples of applying a Gaussian kernel for smoothing. In the example below, the author generates some noisy data in a 1D array (left). By applying a Gaussian kernel, that is, averaging values mostly from the neighborhood, we get a much smoother data distribution (right).
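The 1D smoothing example can be reproduced in a few lines (a sketch with made-up data, not the data from [11]):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 200))
noisy = signal + rng.normal(0, 0.5, 200)

# Build a normalized Gaussian kernel (sigma = 3 samples) and apply it
# by convolution: each output value is a weighted neighborhood average.
t = np.arange(-10, 11)
kernel = np.exp(-t**2 / (2 * 3.0**2))
kernel /= kernel.sum()
smoothed = np.convolve(noisy, kernel, mode="same")
```

The smoothed array tracks the underlying sine wave much more closely than the noisy one, since the averaging suppresses the independent per-sample noise.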

Magically, applying a suitable Gaussian kernel can mostly recover clean data from noisy data that seems random. Below, we can see that a clean data distribution (upper left), after noise is added (upper right), can be mostly recovered if the Gaussian kernel “matches” certain characteristics of the noise (bottom).

Another application of Gaussian kernels is kernel density estimation [5], a technique to depict a smooth probability distribution from finite data samples. The smoothness of the estimated distribution is determined by the hyperparameter \sigma. When \sigma is small, the estimate looks like a histogram because each kernel only averages over nearby data. A larger \sigma averages over a wider range of data and gives a smoother curve.
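A kernel density estimate is just an average of Gaussian bumps, one per sample (a minimal sketch; `kde` is my own helper):

```python
import numpy as np

def kde(samples, grid, sigma):
    """Gaussian kernel density estimate evaluated at each grid point:
    the average of a normalized Gaussian bump centered at every sample."""
    diffs = grid[:, None] - samples[None, :]
    bumps = np.exp(-diffs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)
```

Because each bump integrates to 1, the estimate integrates to (approximately) 1 over a grid that covers the data; shrinking `sigma` makes it spikier, growing `sigma` makes it smoother.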

 

Convolution and Fourier Transform

Gaussian kernels can be used in the setting of convolution and Fourier transform. We first quickly review what convolution and Fourier transform are and their relationships. 

A convolution of two functions is defined as:

    \[f(t)=(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau\]

For a function that is on the time domain f(t), its frequency domain function \hat{f}(\xi) is defined as:

    \[\hat{f}(\xi)=\int^{\infty}_{-\infty}f(t)e^{-2\pi i t \xi} dt\]


And the inverse Fourier transform is defined as:

    \[f(t)=\int^\infty_{-\infty}\hat{f}(\xi)e^{2\pi i t \xi} d\xi\]

If f(t) itself is a Gaussian function, then \hat{f}(\xi) would also be a Gaussian function but with different shapes. The more spread f(t) is, the sharper \hat{f}(\xi) is [9]. This means, a spread f(t) changes slowly so it has less high-frequency components.

Now let’s prove the convolution theorem, which states that the Fourier transform of the convolution of two functions is equal to the product of the Fourier transform of each function. The proof comes from the last page of [10]:

Suppose we have f(t)=(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau, and suppose \hat{f}(\xi), \hat{g}(\xi), and \hat{h}(\xi) are the corresponding Fourier transformed functions. From the inverse Fourier transform we have g(t)=\int^\infty_{-\infty} \hat{g}(\xi) e^{2\pi i t \xi}d\xi and h(t)=\int^\infty_{-\infty} \hat{h}(\xi) e^{2\pi i t \xi}d\xi.

By the definition of convolution:

(g*h)(t)=\int^\infty_{-\infty}g(\tau)h(t-\tau)d\tau \newline=\int^\infty_{-\infty} g(\tau)\left[\int^\infty_{-\infty} \hat{h}(\xi) e^{2\pi i (t-\tau) \xi}d\xi\right]d\tau \newline=\int^\infty_{-\infty} \hat{h}(\xi) \left[\int^\infty_{-\infty} g(\tau) e^{2\pi i (t-\tau) \xi}d\tau\right]d\xi \quad \quad \text{swap the order of integration}\newline=\int^\infty_{-\infty} \hat{h}(\xi) \left[\int^\infty_{-\infty} g(\tau) e^{-2\pi i \tau \xi} d\tau\right] e^{2\pi i t \xi} d\xi\newline=\int^\infty_{-\infty} \hat{h}(\xi) \hat{g}(\xi) e^{2\pi i t \xi} d\xi \newline = IFT\left[\hat{h}(\xi) \hat{g}(\xi)\right] \quad\quad Q.E.D.
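The discrete analogue of the convolution theorem can be verified directly with NumPy (a sketch using circular convolution, which is what the DFT version of the theorem applies to):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=64)
h = rng.normal(size=64)

# Circular convolution computed from the definition...
direct = np.array([sum(g[k] * h[(t - k) % 64] for k in range(64))
                   for t in range(64)])

# ...equals the inverse DFT of the product of the DFTs.
via_fft = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)).real
```

The two arrays agree to machine precision, mirroring the continuous proof above.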

I think [6] summarizes well the relationship between convolution and Fourier Transform [7]. 

Now let’s understand the last sentence from the last slide: convolution with a Gaussian will preserve low-frequency components while reducing high-frequency components.

Suppose we have a rectangle function h(t) =\sqcap(t/2) with width 2 and height 1. 
We know that its fourier transform is (https://www.wolframalpha.com/input/?i=+rect%28t%2F2%29+from+-2+to+2):

After applying Gaussian kernel smoothing (e.g., with g(t) being the PDF of \mathcal{N}(0, 1)), as we introduced in the application section, we probably get a smoother function, roughly like a triangle function f(t)=(g*h)(t)=\Lambda(t) (https://www.wolframalpha.com/input/?i=triangle%28t%29):

\Lambda(t)‘s fourier transform is (https://www.wolframalpha.com/input/?i=fourier+transform+%28triangle%28t%29%29):

Apparently, we can see that the high frequency components of \Lambda(t) are much smaller than those of \sqcap(t/2).

From the point of view of the convolution theorem, the Fourier transform of (g*h)(t) is equal to \hat{g}(\xi) \hat{h}(\xi). We know that \hat{h}(\xi) is a sinc function and \hat{g}(\xi) is a Gaussian function, therefore the high frequency components of \hat{g}(\xi)\hat{h}(\xi) are attenuated.

 

References

[1] https://czxttkl.com/2018/01/20/my-understanding-in-bayesian-optimization/

[2] https://czxttkl.com/2017/12/28/revisit-support-vector-machine/

[3] https://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf

[4] https://www.youtube.com/watch?v=O8CfrnOPtLc

[5] http://www.surajrampure.com/resources/ds100/KDE.html

[6] https://web.stanford.edu/class/cs279/lectures/lecture9.pdf

[7] https://czxttkl.com/2018/10/07/eulers-formula/

[8] https://en.wikipedia.org/wiki/Fourier_transform#Other_conventions

[9] http://pages.stat.wisc.edu/~mchung/teaching/MIA/reading/diffusion.gaussian.kernel.pdf.pdf

[10] http://ugastro.berkeley.edu/infrared09/PDF-2009/convolution2.pdf

[11] https://matthew-brett.github.io/teaching/smoothing_intro.html

Optimization with discrete random variables

In this post, I’m going to talk about common techniques that enable us to optimize a loss function w.r.t. discrete random variables. I may go off on a tangent on various models (e.g., variational auto-encoder and reinforcement learning) because these techniques come from many different areas so please bear with me.

We start from variational auto encoder (VAE) with continuous latent variables. Then we will look at VAE with discrete latent variables.

VAE

With generative modeling in mind, we suppose every data point x is generated by some latent variable z, whose posterior distribution is p(z|x). We’d like to use a parameterized distribution q_\theta(z|x) to approximate p(z|x) (this is the encoder of VAE). Usually, q_\theta(z|x) is parameterized as a Gaussian distribution: q_\theta(z|x) \sim \mathcal{N}\left(\mu_\theta(x), \Sigma_\theta(x)\right). To make p(z|x) and q_\theta(z|x) close to each other, we minimize the KL-divergence between them:

\theta^* = argmin_\theta\; KL\left(q_\theta(z|x) || p(z|x) \right ) \newline= argmin_\theta \; \mathbb{E}_{q_\theta} [log\;q_\theta(z|x)] - \mathbb{E}_{q_\theta} [log\;p(z,x)]+log\;p(x)

In the formula above, \mathbb{E}_{q_\theta}[f(z)] means \int q_\theta(z|x) f(z) dz, and log \; p(x) is a constant w.r.t. \theta and thus can be omitted. With simple re-arrangement, we can equivalently maximize the so-called ELBO function in order to find \theta^*:

ELBO(\theta) \newline= \mathbb{E}_{q_\theta} [log\;p(z,x)] - \mathbb{E}_{q_\theta} [log\;q_\theta(z|x)]\newline=\mathbb{E}_{q_\theta}[log\;p(x|z)] + \mathbb{E}_{q_\theta}[log\;p(z)] - \mathbb{E}_{q_\theta}[log\;q_\theta(z|x)] \quad\quad p(z) \text{ is the prior of } z \newline= \mathbb{E}_{q_\theta}[log\;p(x|z)]  - \mathbb{E}_{q_\theta} [log \frac{q_{\theta}(z|x)}{p(z)}]\newline=\mathbb{E}_{q_\theta}[log\;p(x|z)] - KL\left(q_\theta(z|x) || p(z)\right) 

Practically, p(x|z) is also fitted by a parameterized function p_\phi(x|z) (this is the decoder of VAE). So the ultimate objective function we have for fitting a VAE is:

(1)   \begin{equation*} argmax_{\theta, \phi} \; \mathbb{E}_{x\sim D} \left[\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right]\end{equation*}

We can interpret the objective function in the following way: \mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] can be thought of as the so-called “reconstruction error”, which encourages the reconstructed x to be as close to the original x as possible. KL\left(q_\theta(z|x) || p(z)\right) encourages q_\theta(z|x) to be close to the prior of z. For more about ELBO and variational inference, please refer to an older post [8].

——————– Update 09/16/2021 ———————

A good post overviewing VAE: https://stats.stackexchange.com/a/315613/80635

Optimize VAE

Optimizing Eq.1 requires additional care. If using stochastic gradient descent, you might want to first sample an x from the data distribution D, then sample z from q_\theta(z|x), to compute:

(2)   \begin{equation*} log\;p_\phi(x|z) - KL\left(q_\theta(z|x) || p(z)\right)\end{equation*}

However, the gradient of Eq.2 w.r.t. \theta is not a Monte Carlo sample of the real gradient. This is because \mathbb{E}_{q_\theta} in Eq.1 itself depends on the learning parameter \theta. Now we introduce two general ways to optimize Eq.1 [1].

The first method is called score function estimator. From now on, we will ignore the part \mathbb{E}_{x\sim D}. The gradient of Eq.1 w.r.t. \theta can be written as:

(3)   \begin{align*}&\nabla_\theta \left\{\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right\}\\=&\nabla_\theta \left\{\mathbb{E}_{q_\theta}\left[log\;p_\phi(x, z) - \log q_\theta(z|x)\right] \right\} \quad\quad  \text{ rewrite KL divergence} \\ =&\nabla_\theta \; \int q_\theta(z|x) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]dz  \\=& \int \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]\nabla_\theta q_\theta(z|x) dz + \int q_\theta(z|x) \nabla_\theta \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] dz \\=& \mathbb{E}_{q_\theta}\left[ \left(log\;p_\phi(x, z) - \log q_\theta(z|x) \right) \nabla_\theta \log q_\theta(z|x) \right] + \mathbb{E}_{q_\theta}\left[\nabla_\theta \log p_\phi(x, z)\right] - \mathbb{E}_{q_\theta}\left[ \nabla_\theta \log q_\theta(z|x) \right] \\&\text{---  The second term is zero because there is no }\theta \text{ in } \log p_\phi(x,z) \\&\text{---  The third term being zero is a common trick. See Eqn. 5 in [1]} \\=& \mathbb{E}_{q_\theta}\left[ \left(log\;p_\phi(x, z) - \log q_\theta(z|x) \right) \nabla_\theta \log q_\theta(z|x) \right]\end{align*}

Now we’ve moved the derivative inside the expectation so we can sample z from q_\theta(z|x) to get Monte Carlo sampling of the gradient.  

The second method to optimize Eqn. 1 is called the pathwise gradient estimator, which uses the reparameterization trick. We’ve seen the trick used in Soft Actor Critic [9]. Here is how Eqn. 1 can be rewritten under the assumption that z = f_\theta(x) + \epsilon, where \epsilon \sim p(\epsilon) is an independent source of noise:

(4)   \begin{align*}&\nabla_\theta \left\{\mathbb{E}_{q_\theta}[log\;p_\phi(x|z)] - KL\left(q_\theta(z|x) || p(z)\right)\right\}\\=&\nabla_\theta \left\{\mathbb{E}_{q_\theta}\left[log\;p_\phi(x, z) - \log q_\theta(z|x)\right] \right\} \quad\quad  \text{ rewrite KL divergence} \\=&\nabla_\theta \; \int q_\theta(z|x) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]dz  \\=&\nabla_\theta \; \int p(\epsilon) \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]d\epsilon \quad\quad \\&\text{--- Above uses the property of changing variables in probability density functions.} \\&\text{--- See discussion in [10, 11]} \\=& \int p(\epsilon) \nabla_\theta \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right]d\epsilon \\=& \int p(\epsilon) \nabla_z \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] \nabla_\theta z d\epsilon \\=& \mathbb{E}_{p(\epsilon)} \left[ \nabla_z \left[log\;p_\phi(x, z) - \log q_\theta(z|x) \right] \nabla_\theta f_\theta(x) \right]\end{align*}

I think [1] summarizes really well that the pathwise estimator generally has lower variance in practice:
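A tiny illustration on a problem with a known answer (my own toy example, not from [1]): estimating \nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(\theta, 1)}[z^2] = 2\theta with both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000
z = theta + rng.normal(size=n)   # reparameterized: z = theta + eps

# Score-function estimator: f(z) * d/dtheta log N(z; theta, 1) = z^2 * (z - theta)
score = z**2 * (z - theta)
# Pathwise estimator: df/dz * dz/dtheta = 2z * 1
pathwise = 2 * z

# Both are unbiased for 2 * theta = 3, but the pathwise samples
# have far lower variance, so their mean converges much faster.
```

This matches the summary from [1]: the score-function estimator only needs log-density gradients but pays for it in variance; the pathwise estimator exploits the derivative of f itself.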

In the end, I’d like to share a course which covers more topics on optimization with discrete random variables [6]. 

Optimize Discrete VAE

The idea of discrete VAE is that the hidden variables are represented as discrete random variables, so that they can be better understood by humans as a clustering of the data. The question is how to perform the reparameterization trick when z is a discrete random variable. The answer relies on the Gumbel-Softmax trick [12].

Suppose a discrete random variable z has k classes with class probabilities \pi_1, \pi_2, \cdots, \pi_k, which can be parameterized by a neural network consuming the raw data. There are several steps to perform the Gumbel-Softmax trick. First, z is represented as a one-hot encoding vector. The active component is the one with the highest value of the following quantity:

(5)   \begin{align*}\arg\max_i \left[ g_i + log \pi_i \right]\end{align*}

  
The beauty of Eqn. 5 is that it is equivalent to drawing a sample from the categorical distribution \pi_1, \pi_2, \cdots, \pi_k.

Second, the one-hot encoding vector is relaxed to a continuous vector y \in \mathbb{R}^k, with each component as:

(6)   \begin{align*}y_i = \frac{exp\left( (log \pi_i + g_i) / \tau \right)}{\sum^k_{j=1} exp \left( (log\pi_j + g_j) / \tau \right)},\end{align*}


where g_1, \cdots, g_k are i.i.d. samples drawn from the Gumbel(0,1) distribution, and \tau is annealed throughout training so that y gets closer and closer to a one-hot encoding vector.
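Both steps can be sketched in NumPy (my own helper; the class probabilities and `tau` below are illustrative):

```python
import numpy as np

def gumbel_softmax(pi, tau, rng):
    """Draw a relaxed one-hot sample y (Eqn. 6) from class probabilities pi."""
    g = -np.log(-np.log(rng.uniform(size=len(pi))))   # i.i.d. Gumbel(0, 1)
    logits = (np.log(pi) + g) / tau
    y = np.exp(logits - logits.max())                  # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])

# Eqn. 5: argmax_i (g_i + log pi_i) is an exact sample from Categorical(pi).
draws = [int(np.argmax(np.log(pi) - np.log(-np.log(rng.uniform(size=3)))))
         for _ in range(20_000)]

# Eqn. 6: a relaxed (differentiable) sample on the simplex.
y = gumbel_softmax(pi, tau=0.5, rng=rng)
```

Over many draws, the argmax frequencies match \pi, while y is a point on the probability simplex that concentrates on one vertex as \tau shrinks.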

Finally, depending on specific problems, we will forward y as some differentiable quantity to downstream models (discussed in Section 2.2 in [12]). If the problem is about learning hidden representations, we can have fully connected layers on top of the whole y. If the problem requires sampling discrete actions as in reinforcement learning, in the forward pass we would need to sample an action by taking \arg\max on y, while in the backward pass we would use the Straight-Through operator, approximating \nabla_\theta z \approx \nabla_\theta y.

Here is an example of a Gumbel-Softmax trick-based discrete VAE, which I adapted from [13]. In the code, we create 30 10-category random variables as the hidden representation z; as a result, z becomes a 300-dim continuous vector during training. In testing (where we create images in data/sample_), we randomly create 30 one-hot encoding vectors of dim 10 as the hidden representation, and ask the decoder to generate images for us.

An earlier attempt to train discrete VAEs can be seen in [5] and [3], which use embedding table look-up + straight-through gradients to train such models.

References

[1] Lecture Notes. Part III: Black-Box Variational Inference

[2] http://edwardlib.org/tutorials/klqp 

[3] Reproducing Neural Discrete Representation Learning

[4] https://github.com/ritheshkumar95/pytorch-vqvae

[5] Neural Discrete Representation Learning

[6] https://duvenaud.github.io/learn-discrete/

[7] Discrete Variational Autoencoders

[8] https://czxttkl.com/2019/05/04/stochastic-variational-inference/

[9] https://czxttkl.com/2018/10/30/notes-on-soft-actor-critic-off-policy-maximum-entropy-deep-reinforcement-learning-with-a-stochastic-actor/

[10] https://math.stackexchange.com/questions/930931/explanation-of-how-probability-density-functions-transform-under-the-change-of-v

[11] https://en.wikipedia.org/wiki/Probability_density_function#Function_of_random_variables_and_change_of_variables_in_the_probability_density_function

[12] Categorical Reparameterization with Gumbel-Softmax

[13] https://github.com/YongfeiYan/Gumbel_Softmax_VAE

Control Variate

In this post, let’s talk about a variance reduction method called “control variates”, yet another technique besides importance sampling [1].

We start from a simple example: we want to estimate \mathbb{E}_{p(x)}[f(x)], the expectation of a function f applied to a random variable x sampled from p(x). Analytically, \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx. You can also think of \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx as \mathbb{E}_{q(x')}[x'] = \int q(x')x'dx', the expectation of another random variable x'=f(x) (with density q) without any function applied to it. Since I like the form \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx better, I’ll keep using it throughout the post. In our setting, we can evaluate f(x) for any given x but do not know the analytic form of f(x) in advance. Using Monte Carlo naively, we can draw N samples from p(x) and average their values under f: \mathbb{E}_{p(x)}[f(x)]\approx \frac{1}{N}\sum\limits_{i=1}^N f(x_i)

We can use a control variate to reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)]. In the control variate method, we introduce another function h(x), which we can evaluate at any x and whose expectation \theta=\mathbb{E}_{p(x)}[h(x)] we know exactly (note: we don’t know that for f(x)).

The control variate method says that the following estimator z(x), which can itself be seen as a random variable, is an unbiased estimator of \mathbb{E}_{p(x)}[f(x)]:

z(x) = f(x) - h(x) + \theta

It is easy to see that z(x) is an unbiased estimator:

\mathbb{E}_{p(x)}[z(x)]\newline=\mathbb{E}_{p(x)}[f(x)]-\mathbb{E}_{p(x)}[h(x)] + \theta\newline=\mathbb{E}_{p(x)}[f(x)]-\theta + \theta\newline=\mathbb{E}_{p(x)}[f(x)]

Now let’s look at the variance of z(x):

Var_{p(x)}[z(x)] \newline = Var_{p(x)}\left[f(x) - h(x)+\theta\right] \newline = Var_{p(x)}\left[f(x)-h(x)\right]  \quad \quad \text{constant doesn't contribute to variance}\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-h(x)-\mathbb{E}_{p(x)}\left[f(x)-h(x)\right] \right)^2\right] \quad\quad Var(x)=\mathbb{E}[(x-\mathbb{E}(x))^2] \newline=\mathbb{E}_{p(x)}\left[\left( f(x)-\mathbb{E}_{p(x)}[f(x)] - \left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right) \right)^2\right]\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)^2\right] + \mathbb{E}_{p(x)}\left[\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)^2\right] \newline - 2 * \mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)\right]\newline=Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]

Thus, as long as we introduce h(x) such that Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]< 0, we will reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)].

A more advanced method is called regression sampling [2]. We can have a tunable parameter c in z(x):

z(x) = f(x) + c\cdot (h(x) - \theta)

In the basic control variate method, c=-1. When c is tunable, the variance of z(x) becomes:

Var_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right]+c^2 \cdot Var_{p(x)}\left[h(x)\right] + 2c * Cov_{p(x)}\left[f(x), h(x)\right],

which is a quadratic function of c. Therefore, the optimal coefficient is:

c^*=-\frac{Cov_{p(x)}[f(x),h(x)]}{Var_{p(x)}[h(x)]}

With c^*, the optimal variance of z(x) can be achieved at:

Var^*_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right] - \frac{\left(Cov_{p(x)}\left[f(x), h(x)\right] \right)^2}{Var_{p(x)}[h(x)]}\newline=\left(1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2\right)Var_{p(x)}\left[f(x)\right] \quad\quad Corr(x,y)=Cov(x,y)/\left(stdev(x)\cdot stdev(y)\right)

This shows that the more correlated f(x) and h(x) are with each other, the more variance we can potentially reduce.

Now, let’s do an experiment to verify the control variate method indeed reduces variance. We let p(x) be a uniform distribution in range [0, 10], f(x)=x^2 and h(x)=x. Statistical quantities can be calculated as:

\mathbb{E}_{p(x)}[f(x)]=\frac{100}{3}
Var_{p(x)}[f(x)]=\frac{8000}{9}
\mathbb{E}_{p(x)}[h(x)]=5
Var_{p(x)}[h(x)]=\frac{25}{3}
Cov_{p(x)}\left[f(x), h(x)\right]=\frac{250}{3}

Therefore, according to the derived theory, we have:

Var_{p(x)}[z(x)] \newline =Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 * Cov_{p(x)}\left[f(x), h(x)\right]\newline =\frac{8000}{9} + \frac{25}{3}-\frac{2\cdot 250}{3}\newline=\frac{6575}{9}<Var_{p(x)}[f(x)].

In other words, we will expect to see around 82.2% of the original variance.

Here is the code:

import numpy

T = 20000  # num of rounds
N = 10   # num of samples per round
a, b = 0., 10.   # lower bound, upper bound
expct_hx = 5.

# monte carlo estimation
mc_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    mc_res.append(numpy.mean(x**2))

# control variate estimation using h(x)
cv_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    cv_res.append(numpy.mean(x ** 2 - x + expct_hx))

print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Control variate estimation (mean/std): {0}/{1}".format(numpy.mean(cv_res), numpy.std(cv_res)))
print(f"Variance reduction percentage {(numpy.std(cv_res)/numpy.std(mc_res))**2}")

Here’s the result and indeed we see around 82% variance as we use the control variate:

MC estimation (mean/std): 33.292873225286186/9.426090731496881
Control variate estimation (mean/std): 33.37154726826849/8.61508822016471
Variance reduction percentage 0.8353264371911822
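As a sketch of the regression sampling idea, we can extend the experiment above with the optimal coefficient. In this setup (f(x)=x^2, h(x)=x, uniform over [0, 10]), c^* = -(250/3)/(25/3) = -10 analytically; estimating c^* from a pilot batch of samples is my own illustrative choice, not a step from [2]:

```python
import numpy

numpy.random.seed(0)

a, b = 0., 10.   # lower bound, upper bound
expct_hx = 5.    # theta = E[h(x)] for h(x) = x under Uniform(0, 10)
f = lambda x: x ** 2
h = lambda x: x

# Estimate c* = -Cov[f(x), h(x)] / Var[h(x)] from a pilot batch
pilot = numpy.random.uniform(low=a, high=b, size=100000)
c_star = -numpy.cov(f(pilot), h(pilot))[0, 1] / numpy.var(h(pilot), ddof=1)
print(c_star)  # analytically, c* = -(250/3) / (25/3) = -10

T, N = 20000, 10
cv_res, opt_res = [], []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    cv_res.append(numpy.mean(f(x) - h(x) + expct_hx))              # c = -1
    opt_res.append(numpy.mean(f(x) + c_star * (h(x) - expct_hx)))  # c = c*

# With c*, the remaining variance is (1 - Corr^2) * Var[f] -- much smaller
print(numpy.var(cv_res) / numpy.var(opt_res))
```

Since Corr^2 is about 0.94 here, the c^* estimator keeps only about 6% of the original variance, far better than the 82% achieved by c=-1.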

In the end, I want to mention that we have been analyzing Var_{p(x)}[z(x)] under the assumption that z(x) is computed from a single sample. If we always collect N samples and use their mean as the estimate of \mathbb{E}_{p(x)}[f(x)], then Var_{p(x)}[\bar{z}(x)]=Var_{p(x)}\left[\frac{\sum\limits_{i=1}^N z(x_i)}{N}\right]=N \cdot Var_{p(x)}\left[\frac{z(x)}{N}\right]=\frac{Var_{p(x)}\left[z(x)\right]}{N}. This matches our intuition that the more samples you average over, the less variance you have. However, the relative fraction of variance you can save by introducing h(x) remains the same no matter how large N is; with the optimal c^*, the remaining fraction is 1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2.
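The 1/N scaling can be checked quickly (a sketch reusing the same uniform distribution and f(x)=x^2 as above):

```python
import numpy

numpy.random.seed(0)

T = 20000  # number of estimation rounds

def var_of_mean(N):
    # Variance (across T rounds) of the N-sample Monte Carlo mean of f(x) = x^2
    estimates = [numpy.mean(numpy.random.uniform(0., 10., size=N) ** 2)
                 for _ in range(T)]
    return numpy.var(estimates)

v10, v40 = var_of_mean(10), var_of_mean(40)
print(v10 / v40)  # close to 4: averaging 4x more samples cuts variance by ~4x
```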

I also want to mention that there are other variance reduction techniques applicable in other scenarios, such as importance sampling and the reparameterization trick. People have shown empirically that the reparameterization trick can also reduce variance [4], but so far I haven’t found a theoretical proof.

References

[1] https://czxttkl.com/2017/03/30/importance-sampling/

[2] https://en.wikipedia.org/wiki/Control_variates

[3] https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf

[4] https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb