View LLMs as compressors + Scaling laws

Viewing LLMs as compressors is a fascinating perspective. In this post, we introduce the basic idea behind it.

We first describe, in layman's terms, what compression does. Compression can be seen as representing a stream of bits with a shorter stream of bits. It relies on the assumption that the original stream contains repetitive patterns, so those patterns can be represented with shorter codes. For example, if the original bit stream is “00001 00011 00001 00011…”, we can create a codebook where “00001” is represented as “0” and “00011” is represented as “1”. Then we can use “0101…” plus the codebook to represent the original bit stream. Anyone receiving the coded stream can recover the original bit stream as long as they also receive the codebook.
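To make this concrete, here is a minimal sketch of the codebook idea in Python. The codebook and stream are just the toy ones from the example above, not a real compression scheme:

```python
# Toy codebook from the example above: each 5-bit block maps to a 1-bit code.
codebook = {"00001": "0", "00011": "1"}

def compress(stream: str) -> str:
    """Replace each space-separated block with its shorter code."""
    return "".join(codebook[block] for block in stream.split())

def decompress(coded: str) -> str:
    """Invert the codebook to recover the original blocks."""
    inverse = {code: block for block, code in codebook.items()}
    return " ".join(inverse[bit] for bit in coded)

original = "00001 00011 00001 00011"
coded = compress(original)            # -> "0101"
assert decompress(coded) == original  # lossless: coded stream + codebook suffice
```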

There exist many compression algorithms. One of them is arithmetic coding. It represents a bit stream by a number in the interval [0, 1), and the compressed stream is the binary expansion of that number. Arithmetic coding connects naturally to LLMs because the encoder uses the conditional distribution p(x_k | x_{<k}), which is exactly the next-token prediction distribution.

We use the example in the paper [1] to illustrate how arithmetic coding works.

Suppose we have 3 tokens in the vocabulary (A, X, and I). To encode AIXI, we look at the next-token predictions from a sequence predictor (an LLM or any other probabilistic model): arithmetic coding successively narrows the interval [0, 1) according to those predictions, and the final interval for AIXI determines its binary code. Following the worked example in Appendix A of [1]:

As the example shows, we need 0101010 (7 bits) to represent AIXI. Other plausible sequences also need several bits each, so on average the arithmetic code is longer than 1 bit.
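For readers who want to see the mechanics, below is a minimal sketch of an arithmetic encoder in Python. The `toy_model` distribution is made up for illustration (it is not the one from Appendix A of [1]), so the resulting bits differ from the 0101010 above; the point is only that the code length tracks how much probability the model assigns to the sequence.

```python
from math import ceil, log2

def arithmetic_encode(sequence, next_token_probs):
    """Encode `sequence` with arithmetic coding.

    `next_token_probs(prefix)` returns a dict of token -> probability,
    playing the role of the LLM's next-token prediction p(x_k | x_{<k}).
    """
    low, high = 0.0, 1.0
    for k, token in enumerate(sequence):
        probs = next_token_probs(sequence[:k])
        span = high - low
        cum = 0.0
        # Narrow [low, high) to the sub-interval assigned to `token`,
        # visiting the vocabulary in a fixed (sorted) order.
        for t in sorted(probs):
            if t == token:
                low, high = low + span * cum, low + span * (cum + probs[t])
                break
            cum += probs[t]
    # Emit the binary expansion of a number inside the final interval;
    # its length is roughly -log2 P(sequence) plus a bit or two of overhead.
    n_bits = ceil(-log2(high - low)) + 1
    point = int(low * 2**n_bits) + 1  # a dyadic point strictly inside (low, high)
    return format(point, f"0{n_bits}b")

# Hypothetical next-token model over the vocabulary {A, I, X} (context-free
# here for simplicity; a real LLM would condition on the prefix).
def toy_model(prefix):
    return {"A": 0.5, "I": 0.25, "X": 0.25}

print(arithmetic_encode("AIXI", toy_model))  # 8 bits, since P(AIXI) = 2**-7
```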

Now, consider a very hypothetical setting where the LLM makes more confident predictions: P(AIXI) = 0.5, P(AIXX) = 0.5, and every other sequence has zero likelihood. In this case we only need 1 bit to distinguish the two plausible sequences. Therefore, we can conclude that if an LLM predicts sequences more accurately, it compresses data into shorter arithmetic codes.
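The general relation behind this conclusion: an arithmetic code for a sequence needs roughly -log2 P(sequence) bits, which is the sum of the model's per-token surprisals, i.e., its log loss measured in bits. A quick check with made-up per-token probabilities (chosen only to illustrate the 7-bit and 1-bit figures above, not taken from [1]):

```python
from math import log2

def ideal_code_length_bits(per_token_probs):
    """-log2 P(sequence), i.e., the sum of per-token surprisals in bits."""
    return sum(-log2(p) for p in per_token_probs)

# An unsure model: P(sequence) = 0.5 * 0.25**3 = 2**-7  -> 7 bits.
print(ideal_code_length_bits([0.5, 0.25, 0.25, 0.25]))
# A confident model: certain about the first three tokens, 50/50 on the last
# (AIXI vs. AIXX) -> P(sequence) = 0.5 -> 1 bit.
print(ideal_code_length_bits([1.0, 1.0, 1.0, 0.5]))
```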

 

References:

[1] Language Modeling Is Compression: https://arxiv.org/abs/2309.10668

[2] LLMZip: Lossless Text Compression using Large Language Models: https://arxiv.org/abs/2306.04050

[3] How Language Models Beat PNG and FLAC Compression & What It Means: https://blog.codingconfessions.com/p/language-modeling-is-compression
