LLM Long Context

In this post, let’s look at how modern LLMs encode positional information. We start from the most famous paper in this domain [1] and dive into some key details.

Why we need positional encoding

LLMs need positional encodings to distinguish different semantic roles of the same word. Consider the motivational example from [2], a sentence along the lines of “The dog chased another dog”:

The two “dog”s refer to different entities. Without any positional information, the output of a (multi-headed) self-attention operation is identical for the same token at different positions.

Preliminaries

Now, let’s settle on notation. We start from the classic Transformer and its core mechanism, self-attention. Suppose the input sequence is S=\{w_i\}^{N}_{i=1}, a length-N word sequence with w_i being the i-th element. Each word has a corresponding embedding in E=\{\mathbf{x}_i\}^{N}_{i=1}, where \mathbf{x}_i \in \mathbb{R}^d is a d-dimensional vector. At position m, the word w_m’s output is a weighted sum of the values of all words in the sequence, where the weights are determined by the self-attention mechanism:
\mathbf{q}_m = f_q(\mathbf{x}_m, m)
\mathbf{k}_n = f_k(\mathbf{x}_n, n)
\mathbf{v}_n = f_v(\mathbf{x}_n, n)
a_{m,n} = \frac{\exp\left( \frac{\mathbf{q}_m^T \mathbf{k}_n}{\sqrt{d}}\right)}{\sum\limits^N_{j=1}\exp \left(\frac{\mathbf{q}_m^T \mathbf{k}_j}{\sqrt{d}} \right)}
\mathbf{o}_m=\sum\limits^N_{n=1} a_{m,n} \mathbf{v}_n
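To make the set-like nature of self-attention concrete, here is a toy NumPy sketch (made-up random weights, no positional information) showing that the same token at two different positions receives an identical output:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    # X: (N, d) token embeddings with no positional information
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(d)                       # (N, N)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                  # softmax over keys
    return A @ V

# the same "dog" embedding appears at positions 0 and 3
dog, the, bites, man = (rng.standard_normal(d) for _ in range(4))
X = np.stack([dog, the, bites, dog, man])

out = self_attention(X)
print(np.allclose(out[0], out[3]))  # True
```

Because q_0 = q_3 for identical inputs, the attention weights (and hence the outputs) at the two positions coincide, which is exactly the problem positional encodings solve.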

Sinusoidal Absolute Position Encoding

This is what is used in the original Transformer paper [6]. It is defined as:
\mathbf{q}_i = f_q(\mathbf{x}_i, i) = \mathbf{W}_q (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{k}_i = f_k(\mathbf{x}_i, i) = \mathbf{W}_k (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{v}_i = f_v(\mathbf{x}_i, i) = \mathbf{W}_v (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{p}_{i, 2t} = \sin(i/10000^{2t/d})
\mathbf{p}_{i, 2t+1} = \cos(i/10000^{2t/d})
Note that i is the position index (0 \leq i < N); 2t and 2t+1 are the dimension indices of the positional encoding, hence 0 \leq t < d/2.
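These formulas can be sketched in a few lines of NumPy (a minimal illustration; the values of N and d are arbitrary):

```python
import numpy as np

def sinusoidal_pe(N, d, base=10000.0):
    # p[i, 2t] = sin(i / base^(2t/d)), p[i, 2t+1] = cos(i / base^(2t/d))
    i = np.arange(N)[:, None]             # positions 0 .. N-1
    t = np.arange(d // 2)[None, :]        # dimension-pair indices
    angles = i / base ** (2 * t / d)      # shape (N, d/2)
    pe = np.empty((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(N=16, d=8)
print(pe[0])  # position 0: all sin entries are 0, all cos entries are 1
```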

One drawback of the additive sinusoidal PE is that adding \mathbf{p}_i makes \mathbf{x}_i + \mathbf{p}_i somewhat chaotic. In the motivational example from [4], suppose \mathbf{x}_i = (1,1); then across positions 0 through 7, \mathbf{x}_i + \mathbf{p}_i can land anywhere around \mathbf{x}_i, making it hard for LLMs to generalize.

Research has shown that the perplexity of models trained with sinusoidal absolute position embeddings explodes past the training length [7].

RoPE (Rotary Positional Embeddings)

RoPE uses a multiplicative form for positional embeddings. Recall that multiplying a vector by a rotation matrix rotates that vector by some angle.

\mathbf{q}_j = f_q(\mathbf{x}_j, j) = \mathbf{R}^d_{\Theta, j}(\mathbf{W}_q \mathbf{x}_j)
\mathbf{k}_j = f_k(\mathbf{x}_j, j) = \mathbf{R}^d_{\Theta, j}(\mathbf{W}_k \mathbf{x}_j)
\mathbf{v}_j = f_v(\mathbf{x}_j, j) = \mathbf{W}_v \mathbf{x}_j
Equivalently, in complex form, each 2D pair of dimensions (2t, 2t+1) of \mathbf{W}_q \mathbf{x}_j and \mathbf{W}_k \mathbf{x}_j is multiplied by e^{i j \theta_t}. Note that the values are not rotated; only queries and keys carry positional information.
\mathbf{R}_{\theta_t, j} = \begin{pmatrix}\cos(j\theta_t) & -\sin(j\theta_t) \\ \sin(j\theta_t) & \cos(j\theta_t)\end{pmatrix}
\mathbf{R}^{d}_{\Theta, j} = \begin{pmatrix}\mathbf{R}_{\theta_0, j} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \mathbf{R}_{\theta_{d/2-1}, j} \end{pmatrix}
\theta_t = \beta^{-2t/d}
where \beta=10000 by default.

(Some clarification about the notation: in e^{i j \theta_t}, i is the imaginary unit, j is the position in the sequence, and t is the dimension-pair index of the embeddings, hence 0 \leq t < d/2.)

As we can see, with RoPE the transformed vectors are simply rotated as the position changes, and its context-window extrapolation performance is much better than that of sinusoidal positional embeddings.
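A minimal NumPy sketch of the rotation applied to a single query or key vector (an illustration of the formulas above, not a production implementation):

```python
import numpy as np

def rope_rotate(x, j, base=10000.0):
    """Rotate each 2D pair (2t, 2t+1) of x by the angle j * theta_t."""
    d = x.shape[-1]
    t = np.arange(d // 2)
    theta = base ** (-2 * t / d)          # theta_t = beta^(-2t/d)
    cos, sin = np.cos(j * theta), np.sin(j * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.ones(8)
print(rope_rotate(x, j=0))  # position 0: all angles are zero, x is unchanged
```

Since each block is a pure rotation, the vector’s norm is preserved at every position, unlike the additive sinusoidal scheme that shifts \mathbf{x}_i around.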

RoPE has many desirable properties of an ideal positional embedding. For any two positions t and s, the (unnormalized) attention score depends on the positions only through the difference t-s. As |t-s| increases, the attention score decays (with everything else kept the same), which is proved in Section 3.4.3 of [1].
\mathbf{q}_t^T \mathbf{k}_s = \left(\mathbf{R}^{d}_{\Theta, t} \mathbf{W}_q \mathbf{x}_t \right)^T \left(\mathbf{R}^{d}_{\Theta, s} \mathbf{W}_k \mathbf{x}_s \right) \newline\qquad =\mathbf{x}^T_t \mathbf{W}^T_q \mathbf{R}^{d}_{\Theta, s-t} \mathbf{W}_k \mathbf{x}_s \newline\qquad = Re\left[\sum\limits_{i=0}^{d/2-1} \mathbf{q}_{[2i:2i+1]} \mathbf{k}^*_{[2i:2i+1]} e^{i(t-s)\theta_{i}} \right]\newline \qquad \text{whose magnitude decays as } |t-s| \text{ increases}

How to choose RoPE base

In practice, we often pre-train a model with a context window T_{pre-train}, post-train with a context window T_{post-train}, and then need to run inference with a longer context window T_{test}. We assume T_{pre-train} \leq T_{post-train} \leq T_{test}. Without any remedy, perplexity shoots up beyond T_{post-train} (even though RoPE already fares better than sinusoidal embeddings). [11] shows that we can improve RoPE's extrapolation ability by either increasing or decreasing the base and post-training on longer context lengths.

In their Figure 1(b), [11] shows that, for Llama2 13B with T_{pre-train}=4k and \beta=10K, post-training RoPE with T_{post-train}=16k and \beta=1M has the best extrapolation performance, followed by T_{post-train}=16k and \beta=500, then T_{post-train}=4k and \beta=1M, and finally T_{post-train}=4k and \beta=500.

Let us first explain why decreasing \beta can improve extrapolation. As introduced above, when computing attention scores, RoPE essentially rotates embeddings by different angles: for any two positions t and s in the sequence, it rotates embeddings by e^{i(t-s)\theta_i}, where \theta_i = \beta^{-2i/d}, 0 \leq i < d/2. The larger the base \beta, the smaller the rotation angle. Rotations are periodic, meaning that after a full cycle, a rotated embedding returns to where it started. Just like the trigonometric functions sine and cosine [12], the periods of RoPE are determined by P_i = \frac{2\pi}{\theta_i} = \frac{2\pi}{\beta^{-2i/d}} = 2\pi \cdot \beta^{2i/d}, 0 \leq i < d/2. So some dimensions have shorter periods while others have longer ones, all depending on i.

To have minimal extrapolation error, we would want T_{pre-train}=T_{post-train}=T_{test}, so that the model has learned a representation for every relative position difference t-s that can occur during testing. When we can't do that (i.e., T_{pre-train}=T_{post-train} < T_{test}), the best compromise in post-training is to let as many embedding dimensions as possible have small periods (P_i \leq T_{post-train}). The model will then see full rotation cycles of those dimensions within T_{post-train}, learn a better understanding/representation of the rotations, and have a better chance to extrapolate well to longer context lengths.

Let’s use an example. In Llama2, the per-head embedding dimension is d=128, \beta=10000, and T_{post-train}=4096. When i = 46, P_i = 2\pi \cdot \beta^{2i/d} \approx 4711 > T_{post-train}. Therefore, 92 dimensions (i = 0, \ldots, 45) have periods that fit into the 4k context length, while the remaining 36 dimensions' periods are longer than 4k. If we change \beta to 500, every dimension's period fits in the 4k context length. That's why [11] found that \beta=500 can lead to good extrapolation performance.
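The counting above can be verified in a few lines of NumPy:

```python
import numpy as np

d, beta, T = 128, 10000.0, 4096
i = np.arange(d // 2)
periods = 2 * np.pi * beta ** (2 * i / d)     # P_i = 2*pi * beta^(2i/d)

fits = periods <= T                           # pairs completing a full cycle within T
print(2 * fits.sum(), 2 * (~fits).sum())      # 92 36

# with beta = 500, every dimension pair's period fits into the 4k window
periods_500 = 2 * np.pi * 500.0 ** (2 * i / d)
print((periods_500 <= T).all())               # True
```

Each index i covers a pair of embedding dimensions, hence the factor of 2 when converting pair counts to dimension counts.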

Now we explain why increasing \beta can also help extrapolation. Increasing \beta in post-training means the rotation speed is smaller (i.e., \theta_i = \beta^{-2i/d} decreases) and the period is longer (i.e., P_i = 2\pi \cdot \beta^{2i/d} increases). Therefore, at test time, even if we see a relative position difference t-s larger than T_{post-train}, the rotation pattern e^{i(t-s)\theta_i} may already have been seen during pre-training or post-training. The role of post-training with an increased \beta is to bridge the model's understanding between the rotation patterns observed in pre-/post-training and those that can appear at test time. Increasing \beta in RoPE is used in well-cited work [13, 14]. Another technique in this vein, Position Interpolation [10], is similar: it scales the large rotations (t-s)\theta_i that occur at large T_{test} down to the range already learned in post-training.
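A sketch of the Position Interpolation idea from [10] (illustrative numbers; `scale` denotes the interpolation factor T_{post-train}/T_{test}):

```python
import numpy as np

d, T_train, T_test = 128, 4096, 16384
scale = T_train / T_test                # interpolation factor, here 1/4

def rope_angles(pos, base=10000.0, scale=1.0):
    # per-pair rotation angles (pos * scale) * theta_i
    theta = base ** (-2 * np.arange(d // 2) / d)
    return (pos * scale) * theta

# without interpolation, test-time angles exceed anything seen in training
raw = rope_angles(T_test - 1)
# with interpolation, every angle is squeezed back into the trained range
interp = rope_angles(T_test - 1, scale=scale)
trained_max = rope_angles(T_train)

print((interp <= trained_max).all())    # True
print((raw <= trained_max).all())       # False
```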

Advanced Topics

  1. The NoPE paper claims that we do not even need explicit positional encodings [16]. Whether it will become mainstream remains to be seen [17].
  2. Extending to infinite context requires some memory mechanism. [15] proposes one, in which we first chunk a sequence into N segments. Within each segment, attention is computed with an additional compressive memory matrix M_s, which contains compressed information from all previous segments and is updated after each segment to carry new information over. In theory, the model can thereby extend to infinite context.

Reference

[1] RoFormer: Enhanced Transformer with Rotary Position Embedding: https://arxiv.org/pdf/2104.09864

[2] https://huggingface.co/blog/designing-positional-encoding

[3] https://www.gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models

[4] https://www.youtube.com/watch?v=GQPOtyITy54

[5] https://cedricchee.com/blog/rope_embeddings/

[6] https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[7] The Impact of Positional Encoding on Length Generalization in Transformers: https://arxiv.org/pdf/2305.19466

[8] https://czxttkl.com/2018/10/07/eulers-formula/

[9] https://www.youtube.com/watch?v=C6rV8BsrrCc

[10] Extending Context Window of Large Language Models via Positional Interpolation

[11] Scaling Laws of RoPE-based Extrapolation

[12] https://math.libretexts.org/Bookshelves/Applied_Mathematics/Mathematics_for_Game_Developers_(Burzynski)/05%3A_Some_Basic_Trigonometry/5.05%3A_Amplitude_and_Period_of_the_Sine_and_Cosine_Functions

[13] Effective Long-Context Scaling of Foundation Models: https://arxiv.org/abs/2309.16039

[14] Code Llama: https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/

[15] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention: https://arxiv.org/abs/2404.07143

[16] The Impact of Positional Encoding on Length Generalization in Transformers: https://arxiv.org/abs/2305.19466

[17] https://www.reddit.com/r/MachineLearning/comments/1dfay95/d_what_do_you_think_of_nope_on_small_models_at/
