This post is mainly about the Gaussian kernel (also known as the radial basis function kernel), a commonly used technique in machine learning. We touched on the concept of the Gaussian kernel in [1] and [2], but in less detail.
Definition of Gaussian Kernel
The intuitive understanding of a kernel is that it maps points from a data space, where those points may be difficult to classify, to a higher dimensional space, where the corresponding mapped points are easier to separate.
The Gaussian kernel is defined as:

$$K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$

where $x$ and $x'$ are two arbitrary points in the data space, and $\sigma$ is a hyperparameter of the Gaussian kernel that controls how sensitive the kernel is to the difference between $x$ and $x'$. Imagine $\sigma$ goes to infinity: then no matter how different $x$ and $x'$ are, $K(x, x')$ will be 1. The formula of the Gaussian kernel is identical to the PDF of a normal distribution.
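To make the definition concrete, here is a minimal numpy sketch (my own illustration, not code from the references) that computes the Gaussian kernel matrix between two sets of points; `sigma` plays the role of $\sigma$ above.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 * sigma^2))."""
    # Pairwise squared Euclidean distances via broadcasting
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
print(gaussian_kernel(X, X, sigma=1.0))
# Diagonal entries are 1 (zero distance); off-diagonal entries shrink as points
# move apart, and every entry approaches 1 as sigma grows very large.
```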
$K(x, x')$ can be understood as a similarity between the corresponding points of $x$ and $x'$ in the mapped higher dimensional space (denoted as $\phi(x)$ and $\phi(x')$). The higher dimensional space is actually an infinite dimensional space, and the similarity is measured by the dot product. That is, $K(x, x') = \phi(x) \cdot \phi(x')$.
So what is $\phi(x)$ then? According to ([3], page 11), for a one-dimensional $x$ and the kernel written as $K(x, x') = e^{-\gamma (x - x')^2}$ (i.e., $\gamma = \tfrac{1}{2\sigma^2}$), we have:

$$\phi(x) = e^{-\gamma x^2}\left[1,\ \sqrt{\tfrac{2\gamma}{1!}}\,x,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x^2,\ \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x^3,\ \cdots\right]^{\top},$$

an infinite dimensional feature vector.
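As a quick numerical check of this expansion (one-dimensional case; the helper `phi` and the truncation length are my own choices), the truncated feature-space dot product matches the kernel value closely:

```python
import numpy as np
from math import factorial

gamma = 0.7
x, x_prime = 0.9, -0.4

def phi(z, n_terms=20):
    """Truncated feature map of the 1-D Gaussian kernel exp(-gamma * (x - x')^2)."""
    return np.array([np.exp(-gamma * z ** 2) * np.sqrt((2 * gamma) ** k / factorial(k)) * z ** k
                     for k in range(n_terms)])

exact = np.exp(-gamma * (x - x_prime) ** 2)   # K(x, x')
approx = phi(x) @ phi(x_prime)                # dot product in (truncated) feature space
print(exact, approx)                          # nearly identical with 20 terms
```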
Application of Gaussian Kernels
Using Gaussian kernels, we can design classification/regression algorithms whose ideas are similar to nearest neighbor algorithms. Here is how [4]. Suppose we have $n$ data points $x_1, \dots, x_n$ with their labels $y_1, \dots, y_n$. Each of the $n$ data points will contribute some information to the prediction at a new data point $x$.

Intuitively, those $x_i$ that are close to $x$ should contribute more information; those $x_i$ that are far away from $x$ should contribute less information. Therefore, we use a Gaussian kernel $K(x_i, x)$ to weight the degree of information that $x_i$ contributes to the prediction at $x$. Following [4], the model is a weighted combination of kernel values:

$$f(x) = \sum_{i=1}^{n} w_i\, K(x_i, x)$$

The way to learn $w = (w_1, \dots, w_n)^\top$ is that we want $f(x_i) = y_i$ for every training point. Therefore, we can solve the following matrix-vector equation to obtain $w$:

$$K w = y, \quad \text{where } K_{ij} = K(x_i, x_j) \text{ and } y = (y_1, \dots, y_n)^\top$$

Once we have $w$, we have full information of our regression/classification model $f(x)$, and we can use it to make predictions on unseen data.
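Here is a minimal sketch of this procedure (my own paraphrase of the idea in [4]; the small ridge term `lam` is an extra assumption added only for numerical stability):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def fit(X, y, sigma=1.0, lam=1e-8):
    """Solve (K + lam * I) w = y so that f(x_i) is close to y_i on the training data."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_new, X_train, w, sigma=1.0):
    """f(x) = sum_i w_i * K(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ w

# Toy 1-D regression: noisy sine curve
rng = np.random.default_rng(0)
X_train = np.linspace(0, 2 * np.pi, 20).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(20)
w = fit(X_train, y_train, sigma=0.5)
print(predict(np.array([[np.pi / 2]]), X_train, w, sigma=0.5))  # close to sin(pi/2) = 1
```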
We can also use a Gaussian kernel to perform smoothing. In [11], the author detailed some examples of applying a Gaussian kernel for smoothing. In the example below, the author generated some noisy data in a 1D array (left). By applying a Gaussian kernel, that is, averaging each value mostly with its neighborhood, we get a much smoother data distribution (right).
Magically, applying the right Gaussian kernel can recover structure from noisy data that seems random. Below, we can see that a clean data distribution (top left), after noise is added (top right), can be mostly recovered if the Gaussian kernel “matches” certain characteristics of the noise (bottom).
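A small sketch in the spirit of [11] (not the author's exact code): build a discrete, normalized Gaussian kernel and convolve it with the noisy array.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Discrete Gaussian kernel, normalized so its weights sum to 1."""
    if radius is None:
        radius = int(3 * sigma)                  # cover roughly 3 standard deviations
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    return k / k.sum()

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 200))              # clean 1-D signal
noisy = signal + 0.5 * rng.standard_normal(signal.shape)     # signal plus noise
smoothed = np.convolve(noisy, gaussian_kernel_1d(sigma=3.0), mode="same")

# Each smoothed value is a Gaussian-weighted average of its neighbors,
# so `smoothed` tracks `signal` much more closely than `noisy` does.
print(np.abs(noisy - signal).mean(), np.abs(smoothed - signal).mean())
```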
Another application of Gaussian kernels is kernel density estimation [5]. It is a technique to depict a smooth probability distribution from finite data samples. The smoothness of the estimated distribution is determined by the hyperparameter $\sigma$ (the bandwidth). When $\sigma$ is small, the estimate will look like a histogram because the kernel only averages data over near neighbors. A larger $\sigma$ spreads the focus over a wider range of data and produces a smoother estimate.
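A minimal sketch of the idea (my own illustration, not code from [5]): each sample contributes a Gaussian bump, and the density estimate is the average of the bumps.

```python
import numpy as np

def kde(samples, grid, sigma=0.5):
    """Kernel density estimate: average of Gaussian PDFs centered at each sample."""
    diffs = grid[:, None] - samples[None, :]
    bumps = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100)
grid = np.linspace(-4, 4, 9)
print(kde(samples, grid, sigma=0.2))   # small sigma: bumpy, histogram-like estimate
print(kde(samples, grid, sigma=1.5))   # large sigma: much smoother estimate
```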
Convolution and Fourier Transform
Gaussian kernels can also be used in the setting of convolution and the Fourier transform. We first quickly review what convolution and the Fourier transform are and how they are related.
A convolution of two functions is defined as:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

For a function $f(t)$ on the time domain, its frequency domain function (Fourier transform) $\hat{f}(\omega)$ is defined as (using one common convention [8]):

$$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$$

And the inverse Fourier transform is defined as:

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{i\omega t}\, d\omega$$
If $f(t)$ itself is a Gaussian function, then $\hat{f}(\omega)$ would also be a Gaussian function, but with a different shape. The more spread out $f(t)$ is, the sharper $\hat{f}(\omega)$ is [9]. This means a spread-out $f(t)$ changes slowly, so it has fewer high-frequency components.
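Concretely, under the Fourier transform convention above, for a Gaussian $f(t) = e^{-t^2 / (2s^2)}$ with width $s$ we have

$$\hat{f}(\omega) = \int_{-\infty}^{\infty} e^{-t^2/(2s^2)}\, e^{-i\omega t}\, dt = s\sqrt{2\pi}\; e^{-s^2 \omega^2 / 2},$$

a Gaussian of width $1/s$ in the frequency domain: the wider the Gaussian in time, the narrower its spectrum.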
Now let’s prove the convolution theorem, which states that the Fourier transform of the convolution of two functions is equal to the product of the Fourier transforms of the two functions. The proof comes from the last page of [10]:
Suppose we have $h = f * g$, and suppose $\hat{f}(\omega)$, $\hat{g}(\omega)$, and $\hat{h}(\omega)$ are the corresponding Fourier transformed functions. From the inverse Fourier transform we have $f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{i\omega t}\, d\omega$ and $g(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{g}(\omega)\, e^{i\omega t}\, d\omega$.

By the definition of convolution:

$$h(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau = \int_{-\infty}^{\infty} \left[\frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{i\omega \tau}\, d\omega\right] g(t-\tau)\, d\tau = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega) \left[\int_{-\infty}^{\infty} g(t-\tau)\, e^{i\omega \tau}\, d\tau\right] d\omega$$

Substituting $u = t - \tau$ in the inner integral gives $\int_{-\infty}^{\infty} g(u)\, e^{i\omega (t-u)}\, du = e^{i\omega t}\, \hat{g}(\omega)$, so

$$h(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{f}(\omega)\, \hat{g}(\omega)\, e^{i\omega t}\, d\omega$$

Comparing this with the inverse Fourier transform of $h(t)$, we conclude that $\hat{h}(\omega) = \hat{f}(\omega)\, \hat{g}(\omega)$.
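We can sanity-check the theorem numerically with the discrete Fourier transform (a sketch of mine, not from [10]); zero-padding the FFTs makes the DFT product correspond to the linear convolution computed by `np.convolve`:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(64)

h = np.convolve(f, g)        # linear convolution, length 64 + 64 - 1 = 127
n = len(h)
F = np.fft.fft(f, n)         # zero-pad f and g to length n before transforming
G = np.fft.fft(g, n)
H = np.fft.fft(h)

print(np.allclose(H, F * G))  # True: FT(f * g) == FT(f) * FT(g)
```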
I think [6] summarizes the relationship between convolution and the Fourier transform [7] well.
Now let’s understand the last sentence from the last slide: convolution with a Gaussian will preserve low-frequency components while reducing high-frequency components.
Suppose we have a rectangle function $\mathrm{rect}(t/2)$ with width 2 and height 1 (https://www.wolframalpha.com/input/?i=+rect%28t%2F2%29+from+-2+to+2).

We know that its Fourier transform is a sinc-shaped function, $\widehat{\mathrm{rect}}(\omega) = \int_{-1}^{1} e^{-i\omega t}\, dt = \frac{2\sin(\omega)}{\omega}$.

After applying Gaussian kernel smoothing (i.e., convolving with a Gaussian kernel, as we introduced in the application section), we get a smoother function that looks roughly like a triangle function $\mathrm{tri}(t)$ (https://www.wolframalpha.com/input/?i=triangle%28t%29).

$\mathrm{tri}(t)$’s Fourier transform is a squared-sinc function (https://www.wolframalpha.com/input/?i=fourier+transform+%28triangle%28t%29%29), and we can see that the high-frequency components of $\mathrm{tri}(t)$ are much smaller than those of $\mathrm{rect}(t/2)$: the squared sinc decays much faster than the sinc.
We can also understand this phenomenon through the convolution theorem: the Fourier transform of the smoothed function $\mathrm{rect}(t/2) * G(t)$, where $G$ is the Gaussian kernel, is equal to $\widehat{\mathrm{rect}}(\omega)\, \hat{G}(\omega)$. We know that $\widehat{\mathrm{rect}}(\omega)$ is a sinc function and $\hat{G}(\omega)$ is a Gaussian function, so multiplying by the Gaussian, which decays quickly away from zero frequency, attenuates the high-frequency components of the result.
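A numerical illustration of this attenuation (a sketch under my own discretization choices; the cutoff defining “high frequency” is arbitrary):

```python
import numpy as np

t = np.linspace(-8, 8, 1024)
dt = t[1] - t[0]

rect = ((t >= -1) & (t <= 1)).astype(float)        # rect(t/2): width 2, height 1
gauss = np.exp(-t ** 2 / (2 * 0.5 ** 2))
gauss /= gauss.sum()                                # normalized Gaussian kernel
smoothed = np.convolve(rect, gauss, mode="same")    # Gaussian-smoothed rectangle

freqs = np.fft.rfftfreq(len(t), d=dt)
rect_spec = np.abs(np.fft.rfft(rect))
smooth_spec = np.abs(np.fft.rfft(smoothed))

high = freqs > 2.0                                  # an arbitrary "high-frequency" band
# Low frequencies are nearly unchanged, while the high-frequency energy
# of the smoothed signal is far smaller than that of the original rectangle.
print(rect_spec[high].sum(), smooth_spec[high].sum())
```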
References
[1] https://czxttkl.com/2018/01/20/my-understanding-in-bayesian-optimization/
[2] https://czxttkl.com/2017/12/28/revisit-support-vector-machine/
[3] https://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf
[4] https://www.youtube.com/watch?v=O8CfrnOPtLc
[5] http://www.surajrampure.com/resources/ds100/KDE.html
[6] https://web.stanford.edu/class/cs279/lectures/lecture9.pdf
[7] https://czxttkl.com/2018/10/07/eulers-formula/
[8] https://en.wikipedia.org/wiki/Fourier_transform#Other_conventions
[9] http://pages.stat.wisc.edu/~mchung/teaching/MIA/reading/diffusion.gaussian.kernel.pdf.pdf
[10] http://ugastro.berkeley.edu/infrared09/PDF-2009/convolution2.pdf
[11] https://matthew-brett.github.io/teaching/smoothing_intro.html