Focal loss for classification and regression

I haven’t learnt any new loss function for a long time. Today I am going to learn a new one, focal loss, which was introduced in 2018 [1]. Let’s start from a typical classification task. For a data point $(x, y)$, where $x$ is the feature vector and $y \in \{0, 1\}$ is a binary label, a model predicts a probability $p = P(y = 1 \mid x)$. Then …
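The excerpt is cut off here, but the binary focal loss itself is compact enough to sketch. A minimal NumPy version (the function name and the toy probabilities below are my own, not from the post):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: binary label in {0, 1}.
    """
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# The (1 - p_t)^gamma factor down-weights well-classified examples,
# so training focuses on hard, misclassified ones.
easy = focal_loss(np.array([0.95]), np.array([1]))  # confident and correct
hard = focal_loss(np.array([0.30]), np.array([1]))  # badly wrong
```

With `gamma=0` and `alpha=1` this reduces to the ordinary cross-entropy loss, which is a handy sanity check.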

Tools needed to build an RL debugging tool

I’ve always had a dream to build a debugging tool that automatically analyzes an observational dataset and tells me whether it is suitable for batch-RL algorithms like DQN/DDPG. As we know, we cannot directly control, or even know, how an observational dataset is collected. An incorrect data collection procedure or a wrong dataset would …

Noise-Contrastive Estimation

I encountered the noise-contrastive estimation (NCE) technique in several papers recently. So it is a good time to visit this topic. NCE is a commonly used technique under the word2vec umbrella [2]. Previously I talked about several other ways to generate word embeddings in [1] but skipped introducing NCE. In this post, I will introduce …
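Before the excerpt cuts off: the core of NCE is turning density estimation into a binary classification between observed data and samples from a known noise distribution. A minimal sketch of that objective (the score values and the uniform noise distribution below are illustrative assumptions, not from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(s_data, s_noise, q_data, q_noise, k):
    """NCE objective: classify one observed sample against k noise samples.

    s_*: model scores (unnormalized log-probabilities); q_*: probabilities
    under the noise distribution. The classifier logit is s(w) - log(k * q(w)).
    """
    logit_data = s_data - np.log(k * q_data)
    logit_noise = s_noise - np.log(k * q_noise)
    # -log P(D=1 | observed word) - sum of log P(D=0 | noise words)
    return -np.log(sigmoid(logit_data)) - np.sum(np.log(1 - sigmoid(logit_noise)))

# Toy example: one observed word, k = 2 noise words from a uniform q.
loss = nce_loss(s_data=2.0, s_noise=np.array([-1.0, 0.5]),
                q_data=0.1, q_noise=np.array([0.1, 0.1]), k=2)
```

Raising the model score of the observed word lowers the loss, which is what pushes the scores toward the data distribution without ever computing a normalizing constant.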

Control variate using Taylor expansion

We talked about control variates in [4]: when evaluating $\mathbb{E}[f(x)]$ by Monte Carlo samples, we can instead evaluate $f(x) - c\,(g(x) - \mathbb{E}[g(x)])$ in order to reduce variance. The requirements for a control variate to work are that $g(x)$ is correlated with $f(x)$ and that the mean of $g(x)$ is known. In this post we will walk through a classic example of using a control variate, …
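As a sketch of the idea before the truncation: a classic exercise is estimating $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$, using the first-order Taylor term $g(U) = U$ (whose mean $1/2$ is known exactly) as the control variate. This may not be the exact example the post walks through:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.uniform(0.0, 1.0, n)

# Plain Monte Carlo estimate of E[e^U]; the true value is e - 1.
f = np.exp(u)
plain = f.mean()

# Control variate: g(U) = U is the first-order Taylor term of e^U, so it is
# highly correlated with f(U), and E[U] = 1/2 is known in closed form.
g = u
c = np.cov(f, g)[0, 1] / np.var(g)   # optimal coefficient c* = Cov(f,g) / Var(g)
cv = (f - c * (g - 0.5)).mean()
```

Both estimators are unbiased, but the per-sample variance of `f - c*(g - 0.5)` is far smaller than that of `f`, so `cv` converges much faster.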

Notes on Ray paper

I am reading Robert Nishihara’s thesis on Ray [1] and keep encountering concepts I am not familiar with, so this post takes some notes on those new concepts.

Ring AllReduce

Ring AllReduce is a technique for communicating among multiple computation nodes in order to aggregate results. It is a primitive in many distributed training systems. In the …
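As a rough illustration of the mechanics, here is a single-process simulation of ring all-reduce (a sketch for intuition, not how a real distributed system implements the communication):

```python
import numpy as np

def ring_allreduce(node_arrays):
    """Simulate ring all-reduce (sum) over N nodes.

    Each node's array is split into N chunks. In N - 1 scatter-reduce steps,
    every node sends one chunk to its right neighbor, which adds it to its own
    copy; in N - 1 all-gather steps the fully reduced chunks circulate around
    the ring until every node holds all of them.
    """
    n = len(node_arrays)
    chunks = [np.array_split(a.astype(float), n) for a in node_arrays]

    # Scatter-reduce: afterwards node i holds the fully summed chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                  # chunk node i sends this step
            chunks[(i + 1) % n][c] += chunks[i][c]

    # All-gather: forward the finished chunks so every node collects them all.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n              # finished chunk node i forwards
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(ch) for ch in chunks]

arrays = [np.arange(6, dtype=float) + i for i in range(3)]
result = ring_allreduce(arrays)   # every node ends with the elementwise sum
```

The appeal of the ring layout is bandwidth: each node transmits roughly 2(N-1)/N of the array size in total, independent of the number of nodes.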

Understand Fourier Transform of Images

We’ve covered the Fourier Transform in [1] and [2], but only with 1D examples. In this post we are going to see what the 2D Fourier Transform looks like. First, we look at a 2D image with sinusoid waves along one direction (left) and its Fourier Transform (right). I added coordinates to help you understand the …
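The images themselves are not reproduced in this excerpt, but the experiment is easy to recreate with NumPy (the image size and the frequency of 8 cycles are my own choices):

```python
import numpy as np

# A 2D image whose intensity is a sinusoid along one direction:
# 8 cycles across the width, constant along the height.
N = 64
x = np.arange(N)
img = np.tile(np.sin(2 * np.pi * 8 * x / N), (N, 1))

# Its 2D Fourier transform concentrates all the energy on two symmetric
# points on the horizontal frequency axis, at +/- 8 cycles.
F = np.fft.fftshift(np.fft.fft2(img))
mag = np.abs(F)
peaks = np.argwhere(mag > mag.max() / 2) - N // 2   # (freq_y, freq_x) pairs
```

Because the image varies along only one direction, the spectrum is zero everywhere except those two conjugate-symmetric spikes, which is exactly the pattern the left/right figure pair illustrates.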

Revisit Gaussian kernel

This post is mainly about the Gaussian kernel (aka the radial basis function kernel), a commonly used technique in machine learning. We’ve touched on the Gaussian kernel in [1] and [2], but with fewer details.

Definition of Gaussian Kernel

The intuitive understanding of a kernel is that it maps points in a data space, where those …
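For reference, the definition is short enough to write down directly. A minimal sketch (the helper name and toy points are mine):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])

# Similarity is 1 for identical points and decays smoothly with distance;
# sigma controls how quickly the similarity falls off.
same = gaussian_kernel(a, a)            # 1.0
far = gaussian_kernel(a, b, sigma=1.0)  # exp(-1)
```

The kernel value depends only on the distance between the two points, which is why it is called a radial basis function.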

Optimization with discrete random variables

In this post, I’m going to talk about common techniques that enable us to optimize a loss function with respect to discrete random variables. I may go off on a tangent about various models (e.g., variational auto-encoders and reinforcement learning) because these techniques come from many different areas, so please bear with me. We start from variational …
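One of those common techniques, the score-function (REINFORCE) estimator, can be sketched before the post continues. Sampling a discrete $z$ is not differentiable, but $\nabla_\theta \mathbb{E}_{z \sim p_\theta}[f(z)] = \mathbb{E}[f(z)\,\nabla_\theta \log p_\theta(z)]$, which we can estimate from samples. The toy $f$ and logits below are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_grad(logits, f, n_samples=50_000):
    """REINFORCE estimator for d/dlogits of E_{z ~ Categorical(p)}[f(z)]."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    z = rng.choice(len(p), size=n_samples, p=p)
    # For a softmax parameterization, d log p(z) / dlogits = one_hot(z) - p.
    grad_logp = np.eye(len(p))[z] - p
    return (f(z)[:, None] * grad_logp).mean(axis=0)

# Toy check: E[f(z)] = sum_k p_k f(k) has the exact gradient p_j (f_j - E).
logits = np.array([0.5, -0.2, 0.1])
f = lambda z: np.asarray(z, dtype=float) ** 2   # f(0)=0, f(1)=1, f(2)=4
est = score_function_grad(logits, f)
```

The estimator is unbiased but high-variance, which is precisely why the techniques in the post (control variates, relaxations like Gumbel-softmax, and so on) exist.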