Diffusion models

Diffusion models are popular these days. This blog [1] summarizes how diffusion models compare with other families of generative models.

Before we go into the technical details, I want to use my own words to summarize my understanding of diffusion models. A diffusion model has two subprocesses: a forward process and a backward process. The forward process is non-learnable and the backward process is learnable. For every training sample (e.g., an image) \mathbf{x}_0, the forward process adds Gaussian noise \boldsymbol{\epsilon}_t over T steps until \mathbf{x}_T is (or is approximately) an isotropic Gaussian. The backward process tries to recover \mathbf{x}_0 in T steps, starting from an isotropic Gaussian \mathbf{x}_T. Each backward step samples \mathbf{x}_{t-1} from \mathbf{x}_t with the probability p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)). The eventual goal is that, given a training sample, p_\theta(\mathbf{x}_0) should be as high as possible, where p_\theta(\mathbf{x}_0) is the marginal of the joint p_\theta(\mathbf{x}_{0:T})=p(\mathbf{x}_T)\prod\limits_{t=T}^1 p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}). It turns out that maximizing p_\theta(\mathbf{x}_0) is equivalent to optimizing an ELBO objective, which in turn amounts to making p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}) as close as possible to the posterior q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0). Because in the forward process we have recorded \mathbf{x}_t and \boldsymbol{\epsilon}_t for all t=1,\cdots,T, q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0) can be written in closed form. Therefore, we can use a loss function (the KL divergence between two Gaussians) to train \theta by fitting p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}) against q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0).
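To make the forward process concrete, below is a minimal PyTorch sketch of it (my own illustration, not code from the notebook discussed later). The linear schedule with \beta ranging from 10^{-4} to 0.02 over T=1000 steps is a common DDPM default and is an assumption here.

    import torch

    # Noise schedule (assumed linear, a common DDPM default).
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)       # beta_t: noise variance at step t
    alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)   # abar_t = prod_{s<=t} alpha_s

    def q_sample(x0, t, eps):
        """Sample x_t ~ q(x_t | x_0) in one shot:
        x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
        abar = alpha_bars[t]
        return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

The one-shot form is what makes training cheap: we never have to simulate all T steps just to obtain one training pair (\mathbf{x}_t, \boldsymbol{\epsilon}_t).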

 

More technical details

We start from the objective: the log-likelihood of a training sample \mathbf{x}_0 under a diffusion model \theta should be maximized, i.e., maximize \; \log p_\theta(\mathbf{x}_0). Similar to stochastic variational inference, we can derive a lower bound and maximize the lower bound instead:

(1)   \begin{equation*} \begin{split} &\log p_\theta(\mathbf{x}_0) \\  & \geq \log p_\theta(\mathbf{x}_0) - \underbrace{D_{KL}\left( q\left( \mathbf{x}_{1:T} | \mathbf{x}_0 \right) || p_\theta\left( \mathbf{x}_{1:T} | \mathbf{x}_0 \right) \right)}_\text{KL divergence is non-negative} \\  &=\log p_\theta(\mathbf{x}_0) - \mathbb{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0) } \left[ \log \underbrace{\frac{q\left(\mathbf{x}_{1:T}|\mathbf{x}_0 \right)}{p_\theta\left( \mathbf{x}_{0:T}\right) / p_\theta \left( \mathbf{x}_0\right)}}_\text{denominator equals $p_\theta\left( \mathbf{x}_{1:T} | \mathbf{x}_0 \right)$} \right] \\ &=\log p_\theta(\mathbf{x}_0) - \mathbb{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0) } \left[ \log \frac{q\left( \mathbf{x}_{1:T} | \mathbf{x}_0 \right)}{p_\theta \left( \mathbf{x}_{0:T}\right) } + \log p_\theta\left(\mathbf{x}_0 \right) \right] \\ &=- \mathbb{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0) } \left[ \log \frac{q\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right) }{p_\theta\left( \mathbf{x}_{0:T}\right)} \right] \\ &=-\mathbb{E}_{q}\biggl[ \\ &\quad \underbrace{D_{KL}\left( q( \mathbf{x}_T | \mathbf{x}_0) || p_\theta(\mathbf{x}_T) \right)}_\text{$L_T$} \\ &\quad + \sum\limits_{t=2}^T \underbrace{D_{KL}\left( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) || p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \right)}_\text{$L_{t-1}$} \\ &\quad \underbrace{- \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}_\text{$L_{0}$} \\ &\biggr] \end{split} \end{equation*}
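The last equality packs in the most algebra, so here is a compact sketch of it. Because the forward process is Markovian, q(\mathbf{x}_t|\mathbf{x}_{t-1}) = q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0), and Bayes' rule gives q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\, q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}. Substituting this for every forward factor with t \geq 2 and letting the q(\mathbf{x}_t|\mathbf{x}_0)/q(\mathbf{x}_{t-1}|\mathbf{x}_0) ratios telescope,

\begin{equation*} \begin{split} \log \frac{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} &= \log \frac{\prod_{t=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \\ &= \log \frac{q(\mathbf{x}_T|\mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum\limits_{t=2}^T \log \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0|\mathbf{x}_1), \end{split} \end{equation*}

and taking \mathbb{E}_q of each term yields L_T, L_{t-1}, and L_0 above.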

 

We now focus on L_{t-1} for t=2, \cdots, T, because L_T is non-learnable (with a fixed forward schedule and a fixed isotropic Gaussian prior, it contains no trainable parameters) and L_0 can be handled separately. With some algebra, we have

(2)   \begin{equation*} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}) \end{equation*}

and

(3)   \begin{equation*} \begin{split} \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) &=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0 \\ &= \frac{1}{\sqrt{\alpha_t}}\left( \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t} } \boldsymbol{\epsilon}_t\right),  \end{split} \end{equation*}

where \alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{s=1}^t \alpha_s, and \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t; here \beta_t is the variance of the Gaussian noise added at forward step t, i.e., the noise schedule.
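As a concrete companion to Eqs. (2)-(3), here is a sketch of the posterior computation, reusing the schedule tensors (betas, alphas, alpha_bars) from the forward-process snippet above; indices are 0-based, so code step t plays the role of paper step t+1.

    # Closed-form posterior q(x_{t-1} | x_t, x_0) from Eqs. (2)-(3).
    def q_posterior(x0, xt, t):
        abar_t = alpha_bars[t]
        abar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        mu = (alphas[t].sqrt() * (1 - abar_prev) / (1 - abar_t)) * xt \
             + (abar_prev.sqrt() * betas[t] / (1 - abar_t)) * x0
        var = (1 - abar_prev) / (1 - abar_t) * betas[t]   # tilde beta_t
        return mu, var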

 

Now, the other part of L_{t-1} is p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t), which can be parameterized as

(4)   \begin{equation*} \begin{split} &p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \\ &= \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) ) \\ &= \mathcal{N}(\mathbf{x}_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \underbrace{\frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}_\text{predict $\boldsymbol{\epsilon}_t$ from $\mathbf{x}_t$ and $t$} \Big), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \end{split} \end{equation*}
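In practice, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) is just a neural network. For 2D toy data like the example later in this post, a small MLP is enough; the sketch below is a hypothetical stand-in (image DDPMs use a U-Net instead), with the timestep fed in as a normalized extra feature.

    import torch.nn as nn

    # Hypothetical eps_theta(x_t, t) for 2-D data: a small MLP.
    class EpsModel(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 + 1, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),
            )

        def forward(self, x, t):
            # Append t / T as an extra input so the network knows the step.
            t_feat = t.float()[:, None] / T
            return self.net(torch.cat([x, t_feat], dim=-1))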

Because the KL divergence between two Gaussians P=\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) and Q=\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2) in n dimensions [5] has the closed form \mathrm{KL}[P\,||\,Q] = \frac{1}{2} \left[ (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \Sigma_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) - \ln \frac{|\Sigma_1|}{|\Sigma_2|} - n \right], L_{t-1} (i.e., the KL divergence between q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) and p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)) can be expressed analytically and fed into autograd frameworks for optimization.
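Since both Gaussians here have isotropic covariances, the general formula collapses to a per-dimension expression. Below is a sketch (again my own, with var1/var2 assumed to be scalar tensors such as \tilde{\beta}_t):

    # Analytic KL( N(mu1, var1*I) || N(mu2, var2*I) ), summed over dimensions.
    def gaussian_kl(mu1, var1, mu2, var2):
        return 0.5 * (
            (mu1 - mu2).pow(2) / var2       # mean term
            + var1 / var2                   # trace term
            - torch.log(var1 / var2)        # log-determinant term
            - 1.0                           # dimension term
        ).sum(dim=-1)

In particular, when \boldsymbol{\Sigma}_\theta is fixed to \tilde{\beta}_t \mathbf{I} (a common DDPM choice), only the mean term survives, so L_{t-1} \propto \| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \|^2, which is why DDPM training reduces to regressing \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) onto the recorded noise \boldsymbol{\epsilon}_t.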

Code Example

The exact code example I was reading is https://colab.research.google.com/github/JeongJiHeon/ScoreDiffusionModel/blob/main/DDPM/DDPM_example.ipynb, which is easy to follow.

Our data are just samples from a mixture of two 2D Gaussian distributions. One component is sampled more often (with probability 0.8) than the other; a data generator along these lines is sketched below.
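In this sketch, the component means and the 0.3 standard deviation are illustrative values of mine, not necessarily those used in the notebook.

    # Toy dataset: a 2-component 2D Gaussian mixture with weights 0.8 / 0.2.
    def sample_data(n):
        which = torch.rand(n) < 0.8                  # choose component, p = 0.8
        means = torch.where(which[:, None],
                            torch.tensor([ 2.0,  2.0]),
                            torch.tensor([-2.0, -2.0]))
        return means + 0.3 * torch.randn(n, 2)       # add within-component noise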

And after 1000 training iterations, here is what the inference process looks like: we start with N data points that are pure Gaussian noise; p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) is now learned such that sampling from it step by step recovers the original data distribution (although I feel the two modes are not in an 8:2 ratio in quantity).
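The inference process itself is a short loop. Here is a sketch of ancestral sampling with a trained noise predictor (a hypothetical model(x, t) returning \boldsymbol{\epsilon}_\theta; schedule tensors reused from above, and the reverse variance fixed to \beta_t, one of the two choices discussed in the DDPM paper):

    # Ancestral sampling: start from pure noise and denoise for T steps.
    @torch.no_grad()
    def p_sample_loop(model, n):
        x = torch.randn(n, 2)                        # x_T ~ N(0, I)
        for t in reversed(range(T)):
            eps_hat = model(x, torch.full((n,), t))  # predict the noise
            mu = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) \
                 / alphas[t].sqrt()
            # Add noise at every step except the last (t = 0 yields x_0).
            x = mu + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mu
        return x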

