Diffusion models are popular these days. The blog post [1] summarizes how diffusion models compare with other generative models.
Before we go into the technical details, I want to summarize my understanding of diffusion models in my own words. Diffusion models have two subprocesses: a forward process and a backward process. The forward process is non-learnable and the backward process is learnable. For every training sample (e.g., an image) $x_0$, the forward process adds Gaussian noise in $T$ steps until $x_T$ is (or is approximately) an isotropic Gaussian. The backward process tries to recover $x_0$ in $T$ steps, starting from an isotropic Gaussian $x_T$. Each backward step samples $x_{t-1}$ from $x_t$ with the probability $p_\theta(x_{t-1} \mid x_t)$.
The eventual goal is that, given a training sample $x_0$, we want $p_\theta(x_0)$ to be as high as possible, where $p_\theta(x_0) = \int p_\theta(x_{0:T}) \, dx_{1:T}$. Since $p_\theta(x_0)$ is hard to maximize directly, we optimize an ELBO objective instead, which amounts to making $p_\theta(x_{t-1} \mid x_t)$ as close as possible to the distribution $q(x_{t-1} \mid x_t, x_0)$. Because in the forward process we have recorded $x_t$ and $x_0$ for all $t$, $q(x_{t-1} \mid x_t, x_0)$ can be written in closed form. Therefore, we can use a loss function (i.e., the KL divergence between two Gaussians) to train $p_\theta(x_{t-1} \mid x_t)$ by fitting it against $q(x_{t-1} \mid x_t, x_0)$.
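To make the forward process concrete, here is a minimal NumPy sketch. It assumes the standard DDPM forward step $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$ and a linear schedule from 1e-4 to 0.02 over $T = 1000$ steps; neither choice is stated above, and the function name `forward_diffuse` and the toy data are made up for illustration.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Noise x0 step by step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    x = x0.copy()
    for beta in betas:
        eps = rng.standard_normal(x.shape)  # fresh Gaussian noise at every step
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule, T = 1000
x0 = np.full((5000, 2), 3.0)           # toy "data": every point sits at (3, 3)
xT = forward_diffuse(x0, betas, rng)

# After T steps the points should look like draws from N(0, I).
print("mean:", xT.mean(axis=0), "std:", xT.std(axis=0))
```

Nothing here is learned: the forward process only depends on the schedule, which is why it is called non-learnable.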
More technical details
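We start from the objective that the likelihood of the data $x_0$ under a diffusion model $p_\theta$ is maximized: $\max_\theta \log p_\theta(x_0)$. Similar to stochastic variational inference, we can derive a lower bound on the log-likelihood and maximize the lower bound instead (equivalently, minimize an upper bound on the negative log-likelihood):
$$-\log p_\theta(x_0) \le \mathbb{E}_q\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 \mid x_1)}_{L_0}\Big] \tag{1}$$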
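We now focus on the terms $L_{t-1} = D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)$ for $2 \le t \le T$, because $L_T$ is non-learnable and $L_0$ is trivially handled. With some mathematical computation, we have
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big) \tag{2}$$
and
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 \tag{3}$$
where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$ are terms involving the noise schedule $\beta_1, \dots, \beta_T$.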
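As a sanity check on the closed-form posterior $q(x_{t-1} \mid x_t, x_0)$, the sketch below computes its mean and variance with NumPy. It assumes the standard DDPM expressions (mean as a weighted sum of $x_t$ and $x_0$, variance $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$) and a linear schedule; the helper name `posterior_params` is hypothetical.

```python
import numpy as np

def posterior_params(x_t, x_0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a 1-indexed step t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a_t = alphas[t - 1]
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # convention: alpha_bar_0 = 1
    beta_t = betas[t - 1]
    # Weighted combination of the noisy sample x_t and the clean sample x_0.
    mean = (np.sqrt(a_t) * (1.0 - ab_prev) * x_t
            + np.sqrt(ab_prev) * beta_t * x_0) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
x_0 = np.array([1.0, -2.0])
x_t = np.array([0.3, 0.7])

mean_1, var_1 = posterior_params(x_t, x_0, 1, betas)        # t = 1
mean_500, var_500 = posterior_params(x_t, x_0, 500, betas)  # a middle step
print(mean_1, var_1)
print(mean_500, var_500)
```

At $t = 1$ the posterior collapses onto $x_0$ (zero variance), which makes sense: once $x_0$ is known, there is nothing left to infer about it.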
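Now, the other part of $L_{t-1}$ is $p_\theta(x_{t-1} \mid x_t)$, which can be parameterized as
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \tag{4}$$
Because the KL divergence between two $d$-dimensional Gaussians [5] can be represented as $D_{KL}\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\Big[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{tr}\big(\Sigma_2^{-1}\Sigma_1\big) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)\Big]$, $L_{t-1}$ (i.e., the KL divergence between $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$) can be expressed analytically and fed into autograd frameworks for optimization.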
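The closed-form KL divergence is short enough to write out. The sketch below handles the diagonal-covariance case only, which is my simplifying assumption (it covers isotropic covariances like $\tilde{\beta}_t I$); the function name `kl_diag_gaussians` is made up. In a real training loop the same expression would be written with an autograd framework's tensor operations rather than NumPy.

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), summed over dimensions."""
    mu1, var1, mu2, var2 = (np.asarray(a, dtype=float) for a in (mu1, var1, mu2, var2))
    per_dim = np.log(var2 / var1) - 1.0 + var1 / var2 + (mu2 - mu1) ** 2 / var2
    return 0.5 * np.sum(per_dim)

# Identical Gaussians have zero divergence.
kl_same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Unit-variance 2D Gaussians whose means differ by (1, 1): 0.5 * ||mu2 - mu1||^2.
kl_shift = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
print(kl_same, kl_shift)
```

When both variances are fixed and equal, the KL divergence reduces to a scaled squared distance between the means, so the loss becomes a regression on the mean.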
Code Example
The exact code example I was reading is https://colab.research.google.com/github/JeongJiHeon/ScoreDiffusionModel/blob/main/DDPM/DDPM_example.ipynb, which is easy enough to follow.
Our data is just two 2D Gaussian distributions. One distribution will be sampled more often (prob=0.8) than the other.
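A dataset of this kind can be generated in a few lines. The sketch below is not the notebook's code: the component means, the standard deviation, and the name `sample_mixture` are placeholders I picked for illustration; only the 0.8 mixing probability comes from the description above.

```python
import numpy as np

def sample_mixture(n, rng, means=((-2.0, -2.0), (2.0, 2.0)), std=0.5, p_first=0.8):
    """Draw n points from a two-component 2D Gaussian mixture."""
    means = np.asarray(means, dtype=float)
    # Component 0 is chosen with probability p_first, component 1 otherwise.
    comp = (rng.random(n) >= p_first).astype(int)
    x = means[comp] + std * rng.standard_normal((n, 2))
    return x, comp

x, comp = sample_mixture(10000, np.random.default_rng(0))
print("fraction from the first Gaussian:", np.mean(comp == 0))
```

Keeping the component labels around makes it easy to check later whether generated samples reproduce the 8:2 split.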
After 1000 training iterations, the inference process works as follows: we start with N data points that are pure Gaussian noise, and the transitions $p_\theta(x_{t-1} \mid x_t)$ are now learned such that sampling from them step by step can recover the original data distribution (although I feel the two recovered distributions are not in an 8:2 ratio).
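The inference loop itself can be sketched without training anything if we swap the neural network for a case where the optimal noise predictor is known analytically. Everything below is an assumption for illustration rather than the notebook's code: the data is a single Gaussian $\mathcal{N}(m, s^2 I)$ instead of the two-Gaussian mixture, `eps_hat` is the exact $\mathbb{E}[\epsilon \mid x_t]$ for that data standing in for a trained network, the mean uses the standard DDPM noise-prediction form $\frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}\big)$, and the sampling variance is set to $\beta_t$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

m, s = np.array([2.0, -1.0]), 0.5   # hypothetical data distribution N(m, s^2 I)

def eps_hat(x_t, t):
    """Analytic E[eps | x_t] for Gaussian data; a stand-in for a trained network."""
    ab = alpha_bars[t]
    var_t = ab * s**2 + (1.0 - ab)  # marginal variance of x_t
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * m) / var_t

def sample(n, rng):
    x = rng.standard_normal((n, 2))  # start from pure noise, x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):   # 0-indexed steps T-1, ..., 0
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat(x, t)) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

samples = sample(5000, np.random.default_rng(0))
print("mean:", samples.mean(axis=0), "std:", samples.std(axis=0))
```

With a trained model, `eps_hat` would be replaced by the network's prediction; the loop structure stays the same.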
References
([1] and [2] were good learning materials for writing this post; [3] and [4] are good coding examples.)
[1] https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
[2] https://aman.ai/primers/ai/diffusion-models/
[3] https://www.youtube.com/watch?v=a4Yfz2FxXiY
[4] https://github.com/JeongJiHeon/ScoreDiffusionModel#content–tutorial–blog-kr-