In this post, let's look at how modern LLMs encode positional information. We start from the most famous paper in this domain [1] and dive into some of its key details.
Why we need positional encoding
LLMs need positional encoding to distinguish different occurrences of the same word, which can carry different semantic roles. We borrow the motivational example from [2]: a sentence such as "The dog chased another dog."
The two "dogs" refer to different entities, yet without any positional information the output of a (multi-headed) self-attention operation is identical for the same token at different positions.
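To make this concrete, here is a minimal NumPy sketch of a single attention head with no positional encoding (the toy sentence, vocabulary, and random weight matrices are illustrative assumptions, not taken from [1] or [2]). It shows that the head produces numerically identical outputs for the two occurrences of "dog".

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding / head dimension

# Toy embeddings: every occurrence of a word reuses the same vector.
tokens = ["the", "dog", "chased", "another", "dog"]
vocab = {w: rng.normal(size=d) for w in set(tokens)}
X = np.stack([vocab[w] for w in tokens])           # (seq_len, d)

# Single-head self-attention, no positional information anywhere.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                       # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
out = weights @ V                                   # (seq_len, d)

# Positions 1 and 4 both hold "dog": their outputs match exactly.
print(np.allclose(out[1], out[4]))  # True
```

If we injected a positional signal before computing the scores, for example by rotating Q and K with RoPE, the two rows would no longer match; that is exactly the property the rest of this post builds toward.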
Preliminaries
References
[1] RoFormer: Enhanced Transformer with Rotary Position Embedding: https://arxiv.org/pdf/2104.09864
[2] https://huggingface.co/blog/designing-positional-encoding
[3] https://www.gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
[4] https://www.youtube.com/watch?v=GQPOtyITy54
[5] https://cedricchee.com/blog/rope_embeddings/
[6] https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html