LLM Long Context

In this post, let’s look at how modern LLMs encode positional information. We start from the most famous paper in this domain [1] and dive into some key details.

Why we need positional encoding

LLMs need positional encodings to distinguish different semantic roles of the same word. Consider the motivational example from [2], a sentence along the lines of “The dog chased another dog”:

The two “dog”s refer to different entities. Without any positional information, the output of a (multi-headed) self-attention operation is identical for the same token at different positions.

Preliminaries

Now, let’s settle on notation. We start from the classic Transformer and its core mechanism, self-attention. Suppose the input sequence is S=\{w_i\}^{N}_{i=1}, a length-N word sequence with w_i being the i-th element. Each word has a corresponding embedding in E=\{\mathbf{x}_i\}^{N}_{i=1}, where \mathbf{x}_i \in \mathbb{R}^d is a d-dimensional vector. At position m, the word w_m’s output is a weighted sum of the values of all words in the sequence, where the weights are determined by the self-attention mechanism:
\mathbf{q}_m = f_q(\mathbf{x}_m, m)
\mathbf{k}_n = f_k(\mathbf{x}_n, n)
\mathbf{v}_n = f_v(\mathbf{x}_n, n)
a_{m,n} = \frac{\exp\left( \frac{\mathbf{q}_m^T \mathbf{k}_n}{\sqrt{d}}\right)}{\sum\limits^N_{j=1}\exp \left(\frac{\mathbf{q}_m^T \mathbf{k}_j}{\sqrt{d}} \right)}
\mathbf{o}_m=\sum\limits^N_{n=1} a_{m,n} \mathbf{v}_n
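To make the set-like nature of self-attention concrete, here is a toy NumPy sketch (made-up random weights, no positional information) showing that the same token at two different positions receives an identical output:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    # X: (N, d) token embeddings with no positional information
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(d)                       # (N, N)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                  # softmax over keys
    return A @ V

# the same "dog" embedding appears at positions 0 and 3
dog, the, bites, man = (rng.standard_normal(d) for _ in range(4))
X = np.stack([dog, the, bites, dog, man])

out = self_attention(X)
print(np.allclose(out[0], out[3]))  # True
```

Because q_0 = q_3 for identical inputs, the attention weights (and hence the outputs) at the two positions coincide, which is exactly the problem positional encodings solve.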

Sinusoidal Absolute Position Encoding

This is what is used in the original Transformer paper [6]. It is defined as:
\mathbf{q}_i = f_q(\mathbf{x}_i, i) = \mathbf{W}_q (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{k}_i = f_k(\mathbf{x}_i, i) = \mathbf{W}_k (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{v}_i = f_v(\mathbf{x}_i, i) = \mathbf{W}_v (\mathbf{x}_i + \mathbf{p}_i)
\mathbf{p}_{i, 2t} = \sin(i/10000^{2t/d})
\mathbf{p}_{i, 2t+1} = \cos(i/10000^{2t/d})
Note that i is the position index (0 \leq i < N); 2t and 2t+1 are the dimension indices of the positional encoding, hence 0 \leq t < d/2.
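These formulas can be sketched in a few lines of NumPy (a minimal illustration; the values of N and d are arbitrary):

```python
import numpy as np

def sinusoidal_pe(N, d, base=10000.0):
    # p[i, 2t] = sin(i / base^(2t/d)), p[i, 2t+1] = cos(i / base^(2t/d))
    i = np.arange(N)[:, None]             # positions 0 .. N-1
    t = np.arange(d // 2)[None, :]        # dimension-pair indices
    angles = i / base ** (2 * t / d)      # shape (N, d/2)
    pe = np.empty((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(N=16, d=8)
print(pe[0])  # position 0: all sin entries are 0, all cos entries are 1
```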

One drawback of the additive sinusoidal PE is that adding \mathbf{p}_i makes \mathbf{x}_i + \mathbf{p}_i somewhat chaotic. In the motivational example from [4], suppose \mathbf{x}_i = (1,1); then across positions 0 through 7, \mathbf{x}_i + \mathbf{p}_i can land anywhere around \mathbf{x}_i, making it hard for LLMs to generalize.

Research has shown that the perplexity of models trained with sinusoidal absolute position embeddings explodes past the training length [7].

RoPE (Rotary Positional Embeddings)

RoPE uses a multiplicative form for positional embeddings. Recall that multiplying a vector by a rotation matrix rotates that vector by some angle.

\mathbf{q}_j = f_q(\mathbf{x}_j, j) = \mathbf{R}^d_{\Theta, j}(\mathbf{W}_q \mathbf{x}_j)
\mathbf{k}_j = f_k(\mathbf{x}_j, j) = \mathbf{R}^d_{\Theta, j}(\mathbf{W}_k \mathbf{x}_j)
\mathbf{v}_j = f_v(\mathbf{x}_j, j) = \mathbf{W}_v \mathbf{x}_j
Equivalently, in complex form, each 2D pair of dimensions (2t, 2t+1) of \mathbf{W}_q \mathbf{x}_j and \mathbf{W}_k \mathbf{x}_j is multiplied by e^{i j \theta_t}. Note that the values are not rotated; only queries and keys carry positional information.
\mathbf{R}_{\theta_t, j} = \begin{pmatrix}\cos(j\theta_t) & -\sin(j\theta_t) \\ \sin(j\theta_t) & \cos(j\theta_t)\end{pmatrix}
\mathbf{R}^{d}_{\Theta, j} = \begin{pmatrix}\mathbf{R}_{\theta_0, j} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \mathbf{R}_{\theta_{d/2-1}, j} \end{pmatrix}
\theta_t = \beta^{-2t/d}
where \beta=10000 by default.

(Some clarification about the notation: in e^{i j \theta_t}, i is the imaginary unit, j is the position in the sequence, and t is the dimension-pair index of the embeddings, hence 0 \leq t < d/2.)

As we can see, with RoPE the transformed vectors are simply rotated as the position changes, and its context-window extrapolation performance is much better than that of sinusoidal positional embeddings.
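A minimal NumPy sketch of the rotation applied to a single query or key vector (an illustration of the formulas above, not a production implementation):

```python
import numpy as np

def rope_rotate(x, j, base=10000.0):
    """Rotate each 2D pair (2t, 2t+1) of x by the angle j * theta_t."""
    d = x.shape[-1]
    t = np.arange(d // 2)
    theta = base ** (-2 * t / d)          # theta_t = beta^(-2t/d)
    cos, sin = np.cos(j * theta), np.sin(j * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.ones(8)
print(rope_rotate(x, j=0))  # position 0: all angles are zero, x is unchanged
```

Since each block is a pure rotation, the vector’s norm is preserved at every position, unlike the additive sinusoidal scheme that shifts \mathbf{x}_i around.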

RoPE has many desirable properties of an ideal positional embedding. For any two positions t and s, the (unnormalized) attention score depends on the positions only through the difference t-s. As |t-s| increases, the attention score decays (with everything else kept the same), which is proved in Section 3.4.3 of [1].
\mathbf{q}_t^T \mathbf{k}_s = \left(\mathbf{R}^{d}_{\Theta, t} \mathbf{W}_q \mathbf{x}_t \right)^T \left(\mathbf{R}^{d}_{\Theta, s} \mathbf{W}_k \mathbf{x}_s \right) \newline\qquad =\mathbf{x}^T_t \mathbf{W}^T_q \mathbf{R}^{d}_{\Theta, s-t} \mathbf{W}_k \mathbf{x}_s \newline\qquad = Re\left[\sum\limits_{i=0}^{d/2-1} \mathbf{q}_{[2i:2i+1]} \mathbf{k}^*_{[2i:2i+1]} e^{i(t-s)\theta_{i}} \right]\newline \qquad \text{whose magnitude decays as } |t-s| \text{ increases}

How to choose RoPE base

In practice, we often pre-train a model with a context window T_{pre-train}, post-train with a context window T_{post-train}, and then need to run inference with a longer context window T_{test}. We assume T_{pre-train} \leq T_{post-train} \leq T_{test}. Without any remedy, perplexity shoots up beyond T_{post-train} (even though RoPE already fares better than sinusoidal embeddings). [11] shows that we can improve RoPE's extrapolation ability by either increasing or decreasing the base and post-training on longer context lengths.

In their Figure 1(b), [11] shows that, for Llama2 13B with T_{pre-train}=4k and \beta=10K, post-training RoPE with T_{post-train}=16k and \beta=1M has the best extrapolation performance, followed by T_{post-train}=16k and \beta=500, then T_{post-train}=4k and \beta=1M, and finally T_{post-train}=4k and \beta=500.

Let us first explain why decreasing \beta can improve extrapolation. As introduced above, when computing attention scores, RoPE essentially rotates embeddings by different angles: for any two positions t and s in the sequence, it rotates embeddings by e^{i(t-s)\theta_i}, where \theta_i = \beta^{-2i/d}, 0 \leq i < d/2. The larger the base \beta, the smaller the rotation angle. Rotations are periodic, meaning that after a full cycle, a rotated embedding returns to where it started. Just like the trigonometric functions sine and cosine [12], the periods of RoPE are determined by P_i = \frac{2\pi}{\theta_i} = \frac{2\pi}{\beta^{-2i/d}} = 2\pi \cdot \beta^{2i/d}, 0 \leq i < d/2. So some dimensions have shorter periods while others have longer ones, all depending on i.

To have minimal extrapolation error, we would want T_{pre-train}=T_{post-train}=T_{test}, so that the model has learned a representation for every relative position difference t-s that can occur during testing. When we can't do that (i.e., T_{pre-train}=T_{post-train} < T_{test}), the best compromise in post-training is to let as many embedding dimensions as possible have small periods (P_i \leq T_{post-train}). The model will then see full rotation cycles of those dimensions within T_{post-train}, learn a better understanding/representation of the rotations, and have a better chance to extrapolate well to longer context lengths.

Let’s use an example. In Llama2, the per-head embedding dimension is d=128, \beta=10000, and T_{post-train}=4096. When i = 46, P_i = 2\pi \cdot \beta^{2i/d} \approx 4711 > T_{post-train}. Therefore, 92 dimensions (i = 0, \ldots, 45) have periods that fit into the 4k context length, while the remaining 36 dimensions' periods are longer than 4k. If we change \beta to 500, every dimension's period fits in the 4k context length. That's why [11] found that \beta=500 can lead to good extrapolation performance.
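The counting above can be verified in a few lines of NumPy:

```python
import numpy as np

d, beta, T = 128, 10000.0, 4096
i = np.arange(d // 2)
periods = 2 * np.pi * beta ** (2 * i / d)     # P_i = 2*pi * beta^(2i/d)

fits = periods <= T                           # pairs completing a full cycle within T
print(2 * fits.sum(), 2 * (~fits).sum())      # 92 36

# with beta = 500, every dimension pair's period fits into the 4k window
periods_500 = 2 * np.pi * 500.0 ** (2 * i / d)
print((periods_500 <= T).all())               # True
```

Each index i covers a pair of embedding dimensions, hence the factor of 2 when converting pair counts to dimension counts.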

Now we explain why increasing \beta can also help extrapolation. Increasing \beta in post-training means the rotation speed is smaller (i.e., \theta_i = \beta^{-2i/d} decreases) and the period is longer (i.e., P_i = 2\pi \cdot \beta^{2i/d} increases). Therefore, at test time, even if we see a relative position difference t-s larger than T_{post-train}, the rotation pattern e^{i(t-s)\theta_i} may already have been seen during pre-training or post-training. The role of post-training with an increased \beta is to bridge the model's understanding between the rotation patterns observed in pre-/post-training and those that can appear at test time. Increasing \beta in RoPE is used in well-cited work [13, 14]. Another technique in this vein, Position Interpolation [10], is similar: it scales the large rotations (t-s)\theta_i that occur at large T_{test} down to the range already learned in post-training.
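A sketch of the Position Interpolation idea from [10] (illustrative numbers; `scale` denotes the interpolation factor T_{post-train}/T_{test}):

```python
import numpy as np

d, T_train, T_test = 128, 4096, 16384
scale = T_train / T_test                # interpolation factor, here 1/4

def rope_angles(pos, base=10000.0, scale=1.0):
    # per-pair rotation angles (pos * scale) * theta_i
    theta = base ** (-2 * np.arange(d // 2) / d)
    return (pos * scale) * theta

# without interpolation, test-time angles exceed anything seen in training
raw = rope_angles(T_test - 1)
# with interpolation, every angle is squeezed back into the trained range
interp = rope_angles(T_test - 1, scale=scale)
trained_max = rope_angles(T_train)

print((interp <= trained_max).all())    # True
print((raw <= trained_max).all())       # False
```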

Advanced Topics

  1. The NoPE paper claims that we do not even need explicit positional encodings [16]. Whether it will become mainstream remains to be seen [17].
  2. Extending to infinite context requires some memory mechanism. [15] proposes one, in which we first chunk a sequence into N segments. Within each segment, attention is computed with an additional compressive memory matrix M_s, which contains compressed information from all previous segments and is updated after each segment to carry new information over. In theory, the model can thereby extend to infinite context.

Reference

[1] RoFormer: Enhanced Transformer with Rotary Position Embedding: https://arxiv.org/pdf/2104.09864

[2] https://huggingface.co/blog/designing-positional-encoding

[3] https://www.gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models

[4] https://www.youtube.com/watch?v=GQPOtyITy54

[5] https://cedricchee.com/blog/rope_embeddings/

[6] https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[7] The Impact of Positional Encoding on Length Generalization in Transformers: https://arxiv.org/pdf/2305.19466

[8] https://czxttkl.com/2018/10/07/eulers-formula/

[9] https://www.youtube.com/watch?v=C6rV8BsrrCc

[10] Extending Context Window of Large Language Models via Positional Interpolation

[11] Scaling Laws of RoPE-based Extrapolation

[12] https://math.libretexts.org/Bookshelves/Applied_Mathematics/Mathematics_for_Game_Developers_(Burzynski)/05%3A_Some_Basic_Trigonometry/5.05%3A_Amplitude_and_Period_of_the_Sine_and_Cosine_Functions

[13] Effective Long-Context Scaling of Foundation Models: https://arxiv.org/abs/2309.16039

[14] Code Llama: https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/

[15] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention: https://arxiv.org/abs/2404.07143

[16] The Impact of Positional Encoding on Length Generalization in Transformers: https://arxiv.org/abs/2305.19466

[17] https://www.reddit.com/r/MachineLearning/comments/1dfay95/d_what_do_you_think_of_nope_on_small_models_at/
