Normalizing Flows

A quick update before we dive into today's topic: I have not updated this blog for about two months, which is quite a long time : ). This is because I have picked up more tech lead work setting up the team's planning. I sincerely hope that our team steers in a good direction from 2022 and beyond.

Now, let's get to today's topic: normalizing flows. Normalizing flows are a type of generative model. Generative models can be summarized by the following objective function:

\text{maximize}_{\theta} \quad \mathbb{E}_{x \sim P_{data}}\left[ P_{\theta}(x)\right],
i.e., we want to find model parameters \theta that assign the highest likelihood to data drawn from the data distribution, so that the model "approximates" the data distribution.

According to [1] (starting 11:00 min), we can categorize generative models as:

  1. Explicit with tractable density, i.e., P_\theta(x) has an analytical form. Examples include Normalizing Flows, PixelCNN, PixelRNN, and WaveNet. The latter three are autoregressive models, which generate elements (e.g., pixels, audio samples) sequentially and are therefore computationally heavy at generation time. Normalizing Flows can generate a whole image in a single pass and thus have a computational advantage. (For more pros/cons of Normalizing Flows, please refer to [4].)
  2. Explicit with approximate density, i.e., we only optimize a bound on P_\theta(x). One example is the Variational Autoencoder (VAE) [2], in which we optimize the ELBO.
  3. Implicit, with no explicit modeling of the density. One example is the Generative Adversarial Network (GAN), where we generate images directly from a generator network fed with a noise input.

Normalizing Flows is based on the change-of-variables formula [3]. Suppose Z and X are random variables, both of dimension n, and suppose there is an invertible mapping f: \mathbb{R}^n \rightarrow \mathbb{R}^n such that Z=f(X) and X=f^{-1}(Z). Then the density functions of Z and X are related by:

p_X(x) =p_Z(f(x))\left| det\left( \frac{\partial f(x)}{\partial x}\right) \right|=p_Z(f(x))\left| det \left( \frac{\partial f^{-1}(z)}{\partial z }\right) \right|^{-1},
where the last equality follows from det(A^{-1})=det(A)^{-1} (here z=f(x)). From now on, we assume that X denotes the data while Z denotes a random variable in a latent space.
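To see the formula in action, here is a simple one-dimensional sanity check (my own example, not from [3]): let X \sim \mathcal{N}(\mu, \sigma^2) and f(x) = (x-\mu)/\sigma, so that Z = f(X) \sim \mathcal{N}(0, 1). Then

p_X(x) = p_Z\left(\frac{x-\mu}{\sigma}\right)\left| \frac{\partial f(x)}{\partial x}\right| = \frac{1}{\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\cdot \frac{1}{\sigma},

which is exactly the \mathcal{N}(\mu, \sigma^2) density, as expected.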

In Normalizing Flows, the mapping is parameterized by \theta (i.e., f \rightarrow f_\theta and f^{-1} \rightarrow f^{-1}_\theta), which is what we try to learn. The objective function for learning \theta becomes:

\text{maximize}_\theta \quad p_X(x|\theta) = \text{maximize}_\theta\quad p_Z(f_\theta(x))\left| det \left( \frac{\partial f_\theta(x)}{\partial x}\right) \right|

As you can see, our goal is to learn to map the complex data distribution X into a simpler latent distribution Z (usually Z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})). There are two reasons for wanting Z to follow a simple distribution: first, we need to compute p_Z(f_\theta(x)) in the objective function, so p_Z(\cdot) should be easy to evaluate; second, once we learn \theta, we know the mappings f_\theta and f^{-1}_\theta, so we can easily sample z \sim Z and apply f^{-1}_\theta(z) to generate new data (e.g., new images). The requirements for a valid and practical f_\theta are: (1) it is invertible, i.e., f^{-1}_\theta exists; (2) \left| det\left( \frac{\partial f_\theta(x)}{\partial x}\right) \right| or \left| det \left( \frac{\partial f^{-1}_\theta(z)}{\partial z }\right) \right| is efficient to compute.

One nice property of Normalizing Flows is that you can chain multiple transformations to form a new transformation. Suppose f = f_L \circ \cdots \circ f_1 (i.e., f_1 is applied to x first), with each f_i having a tractable inverse and a tractable Jacobian determinant. Then:

p_X(x) =p_Z(f(x))\prod\limits_{i=1}^L\left|det\left( \frac{\partial f_i}{\partial (f_{i-1}\circ\cdots \circ f_0(x))}\right) \right|, where f_0(x)=x

In practice, we usually pick p_Z = \mathcal{N}(\mathbf{0}, \mathbf{I}) and optimize the log likelihood. Therefore, our objective becomes (as can be seen in Eqn. 1 of the Flow++ paper, a SOTA flow model [5]):

\text{maximize}_\theta \quad \log p_X(x|\theta) \newline= \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right| \newline=\log \mathcal{N}(f_\theta(x); \mathbf{0}, \mathbf{I}) + \sum\limits_{i=1}^L \log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|
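As a minimal sketch of this objective in PyTorch (my own illustration rather than code from [5] or [8]; it assumes a list of flow modules, each returning its transformed output together with its log-absolute-determinant):

```python
import torch
from torch.distributions import MultivariateNormal

def log_likelihood(x, flows, dim=2):
    """log p_X(x) = log N(f(x); 0, I) + sum_i log|det J_i| for a chain of flows.

    `flows` is a list of invertible modules; calling flow(z) is assumed to
    return (transformed z, log|det Jacobian|) for that layer.
    """
    base = MultivariateNormal(torch.zeros(dim), torch.eye(dim))
    z = x
    sum_log_det = torch.zeros(x.shape[0])
    for flow in flows:
        z, log_det = flow(z)
        sum_log_det = sum_log_det + log_det
    return base.log_prob(z) + sum_log_det
```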

[6] provides a good tutorial for getting started with Normalizing Flows, while [7] has a more in-depth explanation that will help a lot in understanding more advanced implementations such as Flow++ [8]. For now, I am going to introduce [6].

[6] introduces a flow called Planar Flow. It has a relatively straightforward transformation (linear + activation) and an easy-to-compute Jacobian determinant:
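Concretely, the Planar Flow of Rezende and Mohamed (which [6] follows) transforms an input x as

f_\theta(x) = x + u\, h(w^\top x + b), \qquad \left|det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right| = \left|1 + h'(w^\top x + b)\, u^\top w\right|,

where u, w \in \mathbb{R}^n and b \in \mathbb{R} are the learnable parameters \theta, and h is a smooth nonlinearity such as \tanh. The determinant follows from the matrix determinant lemma, since the Jacobian is a rank-one update of the identity.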

Planar Flow is defined in Python as below:
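Here is a minimal PyTorch sketch in the spirit of [6] (not the notebook's exact code; the initialization scale and the tanh nonlinearity are my choices, and the invertibility constraint on u is omitted for brevity):

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """f(x) = x + u * tanh(w^T x + b), with log|det J| = log|1 + h'(w^T x + b) u^T w|."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.w = nn.Parameter(torch.randn(1, dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # linear part w^T x + b, shape (batch, 1)
        lin = x @ self.w.t() + self.b
        out = x + self.u * torch.tanh(lin)
        # psi(x) = h'(w^T x + b) * w, shape (batch, dim)
        psi = (1.0 - torch.tanh(lin) ** 2) * self.w
        # |det J| = |1 + psi(x)^T u|; small constant added for numerical stability
        log_det = torch.log(torch.abs(1.0 + psi @ self.u.t()) + 1e-8).squeeze(-1)
        return out, log_det
```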

Once it is defined, we can instantiate an arbitrary Planar Flow and see how it transforms samples from a 2D Normal distribution:
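For example (a sketch; the exact plotting code in [6] differs), assuming the PlanarFlow class above:

```python
import matplotlib.pyplot as plt
import torch
from torch.distributions import MultivariateNormal

base = MultivariateNormal(torch.zeros(2), torch.eye(2))
flow = PlanarFlow(dim=2)

z = base.sample((2000,))
with torch.no_grad():
    x, _ = flow(z)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(z[:, 0], z[:, 1], s=3)
axes[0].set_title("samples from N(0, I)")
axes[1].scatter(x[:, 0], x[:, 1], s=3)
axes[1].set_title("after one PlanarFlow")
plt.show()
```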

Now, suppose we want to learn a Planar Flow that transforms a 2D normal distribution into a target distribution defined as:
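Here is a sketch of such a ring-shaped target (an unnormalized density concentrated around a circle; the exact density_ring used in [6] differs in its details):

```python
import torch

def density_ring(z, radius=2.0, sigma=0.4):
    """Unnormalized ring-shaped density: high where ||z|| is close to `radius`."""
    norm = z.norm(dim=-1)
    return torch.exp(-0.5 * ((norm - radius) / sigma) ** 2)
```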

The objective function, as introduced above, is \text{maximize}_\theta \quad \log p_X(x|\theta) = \log p_Z(f_\theta(x)) + \sum\limits_{i=1}^L\log \left|det \frac{\partial f_{\theta_i}}{\partial f_{\theta_{i-1}}} \right|. Here, x denotes samples from the 2D normal distribution and p_Z(\cdot) is the target density_ring distribution. Therefore, the loss function (to be minimized) is defined as:
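A sketch of that loss in PyTorch (assuming the PlanarFlow and density_ring sketches above; a small constant is added inside the log for numerical stability):

```python
import torch

def flow_loss(x, flows):
    """Negative of [log p_Z(f(x)) + sum_i log|det J_i|], averaged over the batch.

    x: samples from the 2D base normal; flows: a list of PlanarFlow modules.
    p_Z here is the (unnormalized) target density_ring.
    """
    z = x
    sum_log_det = torch.zeros(x.shape[0])
    for flow in flows:
        z, log_det = flow(z)
        sum_log_det = sum_log_det + log_det
    log_p_target = torch.log(density_ring(z) + 1e-8)
    return -(log_p_target + sum_log_det).mean()
```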

The overall training loop is:
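A sketch of the loop (the number of chained flows, learning rate, batch size, and number of steps are my own choices, not those of [6]):

```python
import torch
from torch.distributions import MultivariateNormal

flows = [PlanarFlow(dim=2) for _ in range(16)]
params = [p for f in flows for p in f.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-2)

base = MultivariateNormal(torch.zeros(2), torch.eye(2))
for step in range(5000):
    x = base.sample((512,))        # samples from the 2D base normal
    loss = flow_loss(x, flows)     # negative log-likelihood-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss = {loss.item():.3f}")
```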

 

Lastly, I highly recommend watching this ECCV tutorial: https://www.youtube.com/watch?v=u3vVyFVU_lI

 

TODO:

dequantization:

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html#Dequantization

https://arxiv.org/pdf/1511.01844.pdf

 

 

References

[1] PixelCNN, Wavenet & Variational Autoencoders – Santiago Pascual – UPC 2017 https://www.youtube.com/watch?v=FeJT8ejgsL0

[2] Optimization with discrete random variables: https://czxttkl.com/2020/04/06/optimization-with-discrete-random-variables/

[3] Normalizing Flows note: https://deepgenerativemodels.github.io/notes/flow/

[4] Introduction to Normalizing Flows: https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b

[5] Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design: https://arxiv.org/abs/1902.00275

[6] Normalizing Flows in PyTorch: https://github.com/acids-ircam/pytorch_flows/blob/master/flows_01.ipynb

[7] UvA DL Notebook: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html

[8] Flow++ Github implementation: https://github.com/chrischute/flowplusplus