Normalizing Flows

A quick update before we dive into today's topic: I have not updated this blog for about two months, which is quite a long time : ). This is because I have picked up more tech lead work setting up the team's planning. I sincerely hope that our team steers in a good direction from 2022 and beyond.

Now, let's get to today's topic: normalizing flows. Normalizing flows are a type of generative model. Generative models can be summarized by the following objective function:

\text{maximize}_{\theta} \quad \mathbb{E}_{x \sim P_{data}}\left[ P_{\theta}(x)\right],
i.e., we want to find model parameters \theta that assign the highest likelihood to data drawn from the data distribution, so that the model "approximates" the data distribution.

According to [1] (starting 11:00 min), we can categorize generative models as:

  1. Explicit with tractable density, i.e., P_\theta(x) has an analytical form. Examples include Normalizing Flows, PixelCNN, PixelRNN, and WaveNet. The latter three are autoregressive models, which generate elements (e.g., pixels, audio samples) sequentially and are therefore computationally heavy at generation time. Normalizing Flows can generate a whole image in a single pass and thus have a computational advantage. (For more pros/cons of Normalizing Flows, please refer to [4].)
  2. Explicit with approximate density, i.e., we only optimize a bound on P_\theta(x). One example is the Variational Autoencoder (VAE) [2], in which we optimize the ELBO.
  3. Implicit, with no explicit modeling of the density. One example is the Generative Adversarial Network (GAN), where we generate images directly from a generator network fed with a noise input.

Normalizing Flows is based on the change-of-variables formula [3]. Suppose Z and X are random variables, both of dimension n, and suppose there is an invertible mapping f: \mathbb{R}^n \rightarrow \mathbb{R}^n such that Z=f(X) and X=f^{-1}(Z). Then the density functions of Z and X are related by:

p_X(x) =p_Z(f(x))\left| det\left( \frac{\partial f(x)}{\partial x}\right) \right|=p_Z(f(x))\left| det \left( \frac{\partial f^{-1}(z)}{\partial z }\right) \right|^{-1},
where the last equality follows from det(A^{-1})=det(A)^{-1} (here z=f(x)). From now on, we assume that X denotes the data while Z denotes a random variable in a latent space.
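To see the formula in action, here is a simple one-dimensional sanity check (my own example, not from [3]): let X \sim \mathcal{N}(\mu, \sigma^2) and f(x) = (x-\mu)/\sigma, so that Z = f(X) \sim \mathcal{N}(0, 1). Then

p_X(x) = p_Z\left(\frac{x-\mu}{\sigma}\right)\left| \frac{\partial f(x)}{\partial x}\right| = \frac{1}{\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\cdot \frac{1}{\sigma},

which is exactly the \mathcal{N}(\mu, \sigma^2) density, as expected.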

In Normalizing Flows, the mapping is parameterized by \theta (i.e., f \rightarrow f_\theta and f^{-1} \rightarrow f^{-1}_\theta), which is what we try to learn. The objective function for learning \theta becomes:

\text{maximize}_\theta \quad p_X(x|\theta) = \text{maximize}_\theta\quad p_Z(f_\theta(x))\left| det \left( \frac{\partial f_\theta(x)}{\partial x}\right) \right|

As you can see, our goal is to learn to map the complex data distribution X into a simpler latent distribution Z (usually Z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})). There are two reasons for wanting Z to follow a simple distribution: first, we need to compute p_Z(f_\theta(x)) in the objective function, so p_Z(\cdot) should be easy to evaluate; second, once we learn \theta, we know the mappings f_\theta and f^{-1}_\theta, so we can easily sample z \sim Z and apply f^{-1}_\theta(z) to generate new data (e.g., new images). The requirements for a valid and practical f_\theta are: (1) it is invertible, i.e., f^{-1}_\theta exists; (2) \left| det\left( \frac{\partial f_\theta(x)}{\partial x}\right) \right| or \left| det \left( \frac{\partial f^{-1}_\theta(z)}{\partial z }\right) \right| is efficient to compute.

One nice property of Normalizing Flows is that you can chain multiple transformations to form a new transformation. Suppose f = f_L \circ \cdots \circ f_1 (i.e., f_1 is applied to x first), with each f_i having a tractable inverse and a tractable Jacobian determinant. Then:

p_X(x) =p_Z(f(x))\prod\limits_{i=1}^L\left|det\left( \frac{\partial f_i}{\partial (f_{i-1}\circ\cdots \circ f_0(x))}\right) \right|, where f_0(x)=x

In practice, we usually pick p_Z = \mathcal{N}(\mathbf{0}, \mathbf{I}) and optimize the log likelihood. Therefore, our objective becomes (as can be seen in Eqn. 1 of the Flow++ paper, a SOTA flow model [5]):

\text{maximize}_\theta \quad \log p_X(x|\theta) \newline= \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right| \newline=\log \mathcal{N}(f_\theta(x); \mathbf{0}, \mathbf{I}) + \sum\limits_{i=1}^L \log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|
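As a minimal sketch of this objective in PyTorch (my own illustration rather than code from [5] or [8]; it assumes a list of flow modules, each returning its transformed output together with its log-absolute-determinant):

```python
import torch
from torch.distributions import MultivariateNormal

def log_likelihood(x, flows, dim=2):
    """log p_X(x) = log N(f(x); 0, I) + sum_i log|det J_i| for a chain of flows.

    `flows` is a list of invertible modules; calling flow(z) is assumed to
    return (transformed z, log|det Jacobian|) for that layer.
    """
    base = MultivariateNormal(torch.zeros(dim), torch.eye(dim))
    z = x
    sum_log_det = torch.zeros(x.shape[0])
    for flow in flows:
        z, log_det = flow(z)
        sum_log_det = sum_log_det + log_det
    return base.log_prob(z) + sum_log_det
```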

[6] provides a good tutorial for getting started with Normalizing Flows, while [7] has a more in-depth explanation that will help a lot in understanding more advanced implementations such as Flow++ [8]. For now, I am going to introduce [6].

[6] introduces a flow called Planar Flow. It has a relatively straightforward transformation (linear + activation) and an easy-to-compute Jacobian determinant:
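Concretely, the Planar Flow of Rezende and Mohamed (which [6] follows) transforms an input x as

f_\theta(x) = x + u\, h(w^\top x + b), \qquad \left|det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right| = \left|1 + h'(w^\top x + b)\, u^\top w\right|,

where u, w \in \mathbb{R}^n and b \in \mathbb{R} are the learnable parameters \theta, and h is a smooth nonlinearity such as \tanh. The determinant follows from the matrix determinant lemma, since the Jacobian is a rank-one update of the identity.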

Planar Flow is defined in Python as below:
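Here is a minimal PyTorch sketch in the spirit of [6] (not the notebook's exact code; the initialization scale and the tanh nonlinearity are my choices, and the invertibility constraint on u is omitted for brevity):

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """f(x) = x + u * tanh(w^T x + b), with log|det J| = log|1 + h'(w^T x + b) u^T w|."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.w = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # linear part w^T x + b, shape (batch, 1)
        lin = x @ self.w.t() + self.b
        out = x + self.u * torch.tanh(lin)
        # psi(x) = h'(w^T x + b) * w, shape (batch, dim)
        psi = (1.0 - torch.tanh(lin) ** 2) * self.w
        # |det J| = |1 + psi(x)^T u|; small constant added for numerical stability
        log_det = torch.log(torch.abs(1.0 + psi @ self.u.t()) + 1e-8).squeeze(-1)
        return out, log_det
```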

Once it is defined, we can instantiate an arbitrary Planar Flow and see how it transforms samples from a 2D Normal distribution:
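For example (a sketch; the exact plotting code in [6] differs), assuming the PlanarFlow class above:

```python
import matplotlib.pyplot as plt
import torch
from torch.distributions import MultivariateNormal

base = MultivariateNormal(torch.zeros(2), torch.eye(2))
flow = PlanarFlow(dim=2)

z = base.sample((2000,))
with torch.no_grad():
    x, _ = flow(z)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(z[:, 0], z[:, 1], s=3)
axes[0].set_title("samples from N(0, I)")
axes[1].scatter(x[:, 0], x[:, 1], s=3)
axes[1].set_title("after one PlanarFlow")
plt.show()
```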

Now, suppose we want to learn a Planar Flow that transforms a 2D normal distribution into a target distribution defined as:
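Here is a sketch of such a ring-shaped target (an unnormalized density concentrated around a circle; the exact density_ring used in [6] differs in its details):

```python
import torch

def density_ring(z, radius=2.0, sigma=0.4):
    """Unnormalized ring-shaped density: high where ||z|| is close to `radius`."""
    norm = z.norm(dim=-1)
    return torch.exp(-0.5 * ((norm - radius) / sigma) ** 2)
```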

The objective function, as introduced above, is \text{maximize}_\theta \quad \log p_X(x|\theta) = \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|. Here, x denotes samples from the 2D normal distribution and p_Z(\cdot) is the target density_ring distribution. Therefore, the loss function (to be minimized) is defined as:
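A sketch of that loss in PyTorch (assuming the PlanarFlow and density_ring sketches above; a small constant is added inside the log for numerical stability):

```python
import torch

def flow_loss(x, flows):
    """Negative of [log p_Z(f(x)) + sum_i log|det J_i|], averaged over the batch.

    x: samples from the 2D base normal; flows: a list of PlanarFlow modules.
    p_Z here is the (unnormalized) target density_ring.
    """
    z = x
    sum_log_det = torch.zeros(x.shape[0])
    for flow in flows:
        z, log_det = flow(z)
        sum_log_det = sum_log_det + log_det
    log_p_target = torch.log(density_ring(z) + 1e-8)
    return -(log_p_target + sum_log_det).mean()
```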

The overall training loop is:
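A sketch of the loop (the number of chained flows, learning rate, batch size, and number of steps are my own choices, not those of [6]):

```python
import torch
from torch.distributions import MultivariateNormal

flows = [PlanarFlow(dim=2) for _ in range(16)]
params = [p for f in flows for p in f.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-2)

base = MultivariateNormal(torch.zeros(2), torch.eye(2))
for step in range(5000):
    x = base.sample((512,))        # samples from the 2D base normal
    loss = flow_loss(x, flows)     # negative log-likelihood-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss = {loss.item():.3f}")
```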

 

Lastly, I highly recommend watching this ECCV tutorial: https://www.youtube.com/watch?v=u3vVyFVU_lI

 

TODO:

dequantization:

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html#Dequantization

https://arxiv.org/pdf/1511.01844.pdf

 

 

References

[1] PixelCNN, Wavenet & Variational Autoencoders – Santiago Pascual – UPC 2017 https://www.youtube.com/watch?v=FeJT8ejgsL0

[2] Optimization with discrete random variables: https://czxttkl.com/2020/04/06/optimization-with-discrete-random-variables/

[3] Normalizing Flows note: https://deepgenerativemodels.github.io/notes/flow/

[4] Introduction to Normalizing Flows: https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

[5] Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design: https://arxiv.org/abs/1902.00275

[6] Normalizing Flows in PyTorch: https://github.com/acids-ircam/pytorch_flows/blob/master/flows_01.ipynb

[7] UvA DL Notebook: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html

[8] Flow++ Github implementation: https://github.com/chrischute/flowplusplus