Category Archives: Algorithm
Gradient and Natural Gradient, Fisher Information Matrix and Hessian
Stochastic Variational Inference
Bayesian linear regression
Ordinary least squares (OLS) linear regression has a point estimate of the weight vector given by the closed-form solution $latex \hat{w}=(X^TX)^{-1}X^Ty$. If we assume normality of the errors, $latex \epsilon \sim \mathcal{N}(0, \sigma^2 I)$, with a fixed point estimate of $latex \sigma^2$, we can also analyze confidence intervals and future predictions (see the discussion at the end of [2]). Instead of point estimates, Bayesian linear …
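As a side note on the closed form above, here is a minimal numpy sketch (my own illustration, not code from the post; the prior precision alpha and noise precision beta are assumed names) comparing the OLS estimate with the posterior mean of Bayesian linear regression under a Gaussian prior:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy targets

# OLS point estimate from the normal equations: (X^T X) w = X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Bayesian linear regression with prior w ~ N(0, alpha^{-1} I) and noise
# precision beta: the posterior over w is Gaussian with the moments below.
alpha, beta = 1.0, 100.0
S_inv = alpha * np.eye(3) + beta * X.T @ X        # posterior precision
w_mean = beta * np.linalg.solve(S_inv, X.T @ y)   # posterior mean

As alpha goes to 0 (a flat prior), the posterior mean reduces to exactly the OLS solution.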
Resources about Attention is all you need
There are several online posts [1][2] that illustrate the idea of the Transformer, the model introduced in the paper “Attention Is All You Need” [4]. Based on [1] and [2], I am sharing a short tutorial on implementing a Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let a Transformer learn to output the …
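For the flavor of the task, here is a minimal PyTorch sketch (my own illustration, not the tutorial's code [3]; the vocabulary size, dimensions, and training loop are made up) that trains nn.Transformer to copy its input sequence:

import torch
import torch.nn as nn

VOCAB, DIM = 12, 32
model = nn.Transformer(d_model=DIM, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
embed = nn.Embedding(VOCAB, DIM)
proj = nn.Linear(DIM, VOCAB)
params = list(model.parameters()) + list(embed.parameters()) + list(proj.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    src = torch.randint(1, VOCAB, (64, 10))     # random source sequences
    tgt_in, tgt_out = src[:, :-1], src[:, 1:]   # teacher forcing: shift by one
    mask = model.generate_square_subsequent_mask(tgt_in.size(1))
    out = model(embed(src), embed(tgt_in), tgt_mask=mask)
    loss = nn.functional.cross_entropy(proj(out).reshape(-1, VOCAB),
                                       tgt_out.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

The causal mask keeps the decoder from peeking at future target tokens, so the model must actually learn to copy from the encoder side.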
Notes from Introduction to Calculus and Analysis
Cauchy-Schwarz inequality: $latex (a_1b_1 + a_2b_2 + \cdots + a_nb_n)^2 \leq (a_1^2 + a_2^2 + \cdots + a_n^2)(b_1^2+b_2^2 + \cdots + b_n^2)$. Setting $latex a_1=\sqrt{x}, a_2=\sqrt{y}, b_1=\sqrt{y}, b_2=\sqrt{x}$ gives $latex (2\sqrt{xy})^2\leq (x+y)^2$
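One consequence worth noting (my remark, not necessarily the post's): taking square roots of both sides, which are nonnegative, gives $latex 2\sqrt{xy} \leq x + y$ for $latex x, y \geq 0$, i.e., $latex \sqrt{xy} \leq \frac{x+y}{2}$, the two-variable AM-GM inequality.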
Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”
I am reading this paper (https://arxiv.org/abs/1801.01290) and wanted to take down some notes about it. Introduction: Soft Actor-Critic is a particular actor-critic algorithm. Actor-critic algorithms are one kind of policy gradient method. Policy gradient methods differ from value-based methods (like Q-learning), where you learn Q-values and then infer the best action to …
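For reference, the maximum entropy objective at the heart of the paper (as I recall it, with $latex \alpha$ the temperature parameter): $latex J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]$, i.e., the expected return augmented with the entropy of the policy at each visited state, which encourages exploration.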
Notes on Glicko paper
This weekend I read the Glicko skill rating paper [1] again, but I found some parts of it not very clear. I'd like to make some notes, some based on my guesses, and I hope to sort them out completely in the future. First, Glicko models game outcomes with the Bradley-Terry model, meaning that the win …
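For context (my gloss of the Bradley-Terry model, under the usual rating parameterization $latex \gamma_i = 10^{r_i/400}$): the probability that player $latex i$ beats player $latex j$ is $latex P(i \text{ beats } j) = \frac{\gamma_i}{\gamma_i + \gamma_j} = \frac{1}{1 + 10^{-(r_i - r_j)/400}}$, so only the rating difference matters.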
Euler’s Formula and Fourier Transform
Euler’s formula states that $latex e^{ix} =\cos{x}+ i \sin{x}$. When $latex x = \pi$, the formula becomes $latex e^{i\pi} = -1$, known as Euler’s identity. An easy derivation of Euler’s formula is given in [3] and [5]. According to the Maclaurin series (a special case of the Taylor expansion $latex f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots$ when $latex a=0$), $latex e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots &s=2$ …
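Filling in the step the excerpt cuts off (my completion of the standard derivation): substituting $latex ix$ into the series gives $latex e^{ix} = 1 + ix - \frac{x^2}{2!} - i\frac{x^3}{3!} + \frac{x^4}{4!} + \cdots = \left(1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots\right) + i\left(x - \frac{x^3}{3!} + \cdots\right) = \cos{x} + i\sin{x}$, since the two grouped series are exactly the Maclaurin series of cosine and sine.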
How to conduct grid search
I have always had some doubts about grid search. I am not sure how I should conduct grid search for hyperparameter tuning of a model and report the model's generalization performance in a scientific paper. There are three possible ways: 1) Split the data into 10 folds. Repeat the following 10 times: pick 9 folds as training data, …
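Option 1 reads like nested cross-validation; here is a minimal scikit-learn sketch of it (my own illustration, not the post's code; the dataset, model, grid, and 5-fold inner split are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Inner loop: grid search tunes hyperparameters on the training folds only.
# Outer loop: 10-fold CV estimates generalization on held-out folds that
# the tuning never sees.
inner_search = GridSearchCV(SVC(), param_grid, cv=5)
outer_scores = cross_val_score(inner_search, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())

The key point is that each outer test fold stays untouched during tuning, so the reported score is not biased by the hyperparameter search.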