View LLMs as compressors + Scaling laws

Viewing LLMs as compressors is a fascinating perspective. Today, we introduce the basic idea behind it. We first describe what compression does in layman's terms. Compression can be seen as representing a stream of bits with a shorter stream of bits. It is based on assumption …

Causal Inference 102

In my blog, I have covered several topics in causal inference. Causal Inference: we talked about (a) two-stage regression for estimating the causal effect between X and Y even when there is a confounder between them; (b) causal invariant prediction. Tools needed to build an RL debugging tool: we talked about 3 main …

Reinforcement Learning in LLMs

In this post, we give an overview of Reinforcement Learning (RL) techniques used in LLMs and of alternative techniques that are often compared with them. PPO: The PPO-based approach is the most famous RL approach. A detailed derivation of PPO and its implementation tricks are covered thoroughly in [2]. In particular, we want to call out their recommended implementation tricks: SLiC-HF SLiC-HF …

Llama code anatomy

This is the first time I have read the Llama 2 code. Many things are still similar to the original transformer code, but there are also some new things. I am documenting some findings. Where is the Llama 2 code? Modeling (training) code is hosted here: https://github.com/facebookresearch/llama/blob/main/llama/model.py Inference code is hosted here: https://github.com/facebookresearch/llama/blob/main/llama/generation.py Annotations: There are two online annotations …

Diffusion models

Diffusion models are popular these days. This blog [1] compares diffusion models with other generative models. Before we go into the technical details, I want to summarize my understanding of diffusion models in my own words. Diffusion models have two subprocesses: a forward process and a backward process. The forward process is non-learnable …

Mode collapse is real for generative models

I am very curious to see whether generative models like GANs and VAEs can fit data with multiple modes. [1] gives an overview of different generative models, mentioning that VAEs have a clear probabilistic objective function and are more efficient. [2] showed that diffusion models (score-based generative models) can fit multimodal distributions better than VAEs and …

Causal Inference in Recommendation Systems

We have briefly touched on some concepts of causal inference in [1, 2]. This post introduces some more specific works that apply causal inference to recommendation systems. Some of these works require background on backdoor and frontdoor adjustments, so we will introduce them first. Backdoor and frontdoor adjustment: Suppose we have a causal graph like …

GATO and related AGI research

Policy Generalist: DeepMind has recently published a work named Gato. I find it interesting because Gato learns a multi-modal, multi-task policy for many tasks such as robot arm manipulation, playing Atari, and image captioning. I don’t think the original paper [2] covers every implementation detail, but I’ll try my best to summarize what I understand. …