Gradient and Natural Gradient, Fisher Information Matrix and Hessian

Here I am writing down some notes summarizing my understanding of the natural gradient. There are many online materials covering similar topics; I am not adding anything new, just making a personal summary. Assume we have a model with model parameter $latex \theta$ and training data $latex X$. Then the Hessian of the log likelihood, $latex \nabla^2_\theta \log p(X \mid \theta)$, is: …
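For context, the standard identity connecting these objects: under regularity conditions the Fisher information matrix is both the covariance of the score and the negative expected Hessian of the log likelihood, and the natural gradient preconditions the ordinary gradient by its inverse (standard definitions, not a quote from the post):

$latex F(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\left[\nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top\right] = -\mathbb{E}_{x \sim p(x \mid \theta)}\left[\nabla^2_\theta \log p(x \mid \theta)\right]$

$latex \tilde{\nabla}_\theta L = F(\theta)^{-1} \nabla_\theta L$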

Stochastic Variational Inference

Introduction In this post, we introduce a machine learning technique called stochastic variational inference that is widely used to estimate the posterior distribution of Bayesian models. Suppose that in a Bayesian model the model parameters are denoted as a vector $latex \theta$ and the observations are denoted as $latex x$. According to Bayes’ theorem, the posterior distribution of $latex \theta$ can be …
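For orientation, variational inference approximates the intractable posterior with a family $latex q_\phi(\theta)$ by maximizing the evidence lower bound (ELBO); the “stochastic” part refers to estimating the ELBO’s gradient from minibatches or Monte Carlo samples. The standard bound, in conventional notation rather than necessarily the post’s:

$latex \log p(x) \geq \mathbb{E}_{q_\phi(\theta)}\left[\log p(x, \theta) - \log q_\phi(\theta)\right]$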

Bayesian linear regression

Ordinary least squares (OLS) linear regression gives a point estimate of the weight vector $latex w$ that fits the model $latex y = Xw + \epsilon$. If we assume normality of the errors, $latex \epsilon \sim \mathcal{N}(0, \sigma^2 I)$, with a fixed point estimate of $latex \sigma^2$, we can also analyze confidence intervals and future predictions (see the discussion at the end of [2]). Instead of point estimates, Bayesian linear …
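As a concrete anchor for what replacing point estimates leads to: with a Gaussian prior $latex w \sim \mathcal{N}(0, \tau^2 I)$ and known noise variance $latex \sigma^2$, the posterior over the weights is Gaussian in closed form (a standard result; the prior here is my assumption, not necessarily the post’s):

$latex p(w \mid X, y) = \mathcal{N}(w \mid \mu, \Sigma), \quad \Sigma = \left(\sigma^{-2} X^\top X + \tau^{-2} I\right)^{-1}, \quad \mu = \sigma^{-2} \Sigma X^\top y$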

Resources about “Attention Is All You Need”

There are several online posts [1][2] that illustrate the idea of the Transformer, the model introduced in the paper “Attention Is All You Need” [4]. Based on [1] and [2], I am sharing a short tutorial on implementing a Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., letting a Transformer learn to output the …
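To make the core building block concrete, here is a minimal NumPy sketch of the scaled dot-product attention defined in [4]; the function and variable names are mine, not the tutorial’s:

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_k) arrays; the attention weights are
    # softmax(Q K^T / sqrt(d_k)), as in "Attention Is All You Need".
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # (seq_len, d_k)

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)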

Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”

I am reading this paper (https://arxiv.org/abs/1801.01290) and want to take down some notes about it. Introduction Soft Actor-Critic is a special version of Actor-Critic algorithms. Actor-Critic algorithms are one kind of policy gradient method. Policy gradient methods are different from value-based methods (like Q-learning), where you learn Q-values and then infer the best action to …
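For reference, the “maximum entropy” in the title means the agent maximizes expected return plus the entropy of its policy, weighted by a temperature $latex \alpha$; this is the objective stated in the paper:

$latex J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right]$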

Euler’s Formula and Fourier Transform

Euler’s formula states that $latex e^{ix} = \cos{x} + i \sin{x}$. When $latex x = \pi$, the formula becomes $latex e^{i\pi} = -1$, known as Euler’s identity. An easy derivation of Euler’s formula is given in [3] and [5]. According to the Maclaurin series (a special case of the Taylor expansion $latex f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots$ when $latex a=0$), $latex e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots &s=2$ …
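Substituting $latex ix$ into that series completes the derivation: the even-power terms collect into the cosine series and the odd-power terms into $latex i$ times the sine series:

$latex e^{ix} = 1 + ix - \frac{x^2}{2!} - i\frac{x^3}{3!} + \frac{x^4}{4!} + \cdots = \left(1-\frac{x^2}{2!}+\frac{x^4}{4!}-\cdots\right) + i\left(x-\frac{x^3}{3!}+\cdots\right) = \cos{x} + i\sin{x}$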

How to conduct grid search

I have always had some doubts about grid search. I am not sure how I should conduct grid search for hyperparameter tuning and then report the model’s generalization performance in a scientific paper. There are three possible ways: 1) Split the data into 10 folds. Repeat the following 10 times: pick 9 folds as training data, …
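One widely used resolution of this question is nested cross-validation: an inner loop tunes hyperparameters, an outer loop reports generalization. A minimal scikit-learn sketch, where the dataset, estimator, and parameter grid are placeholders rather than anything from the post:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: grid search picks hyperparameters on each training split.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer loop: scores the tuned model on folds it never saw during tuning.
scores = cross_val_score(inner, X, y, cv=10)
print(scores.mean(), scores.std())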

Monte Carlo Tree Search Overview

Monte Carlo Tree Search (MCTS) has been successfully applied in complex games such as Go [1]. In this post, I am going to introduce some basic concepts of MCTS and its application. MCTS is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a …
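The random sampling in the tree is typically guided by the UCT selection rule, which trades off a node’s average reward against how rarely it has been visited (standard formulation, with exploration constant $latex c$):

$latex \mathrm{UCT}(v) = \frac{Q(v)}{N(v)} + c\sqrt{\frac{\ln N(\mathrm{parent}(v))}{N(v)}}$

where $latex Q(v)$ is the total reward accumulated at node $latex v$ and $latex N(v)$ is its visit count.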