
Context Window

Definition

The maximum number of tokens (roughly: word fragments) a language model can process in a single request, covering the input prompt, any retrieved documents, and the generated output. In 2026, values range from around 8,000 tokens (small models) to over two million tokens (frontier models).
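To get a feel for the numbers, a common rule of thumb is roughly four characters of English text per token. This is a sketch under that assumption; exact counts require the model's own tokenizer:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token
    heuristic for English text. Only for sizing requests; real
    counts come from the model's own tokenizer."""
    return max(1, round(len(text) / chars_per_token))

doc = "word " * 2000           # ~10,000 characters of input
print(estimate_tokens(doc))    # ~2,500 tokens by this heuristic
```

The heuristic breaks down for code, non-English text, and unusual formatting, where tokens-per-character ratios differ noticeably.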

Noise — Signal

Larger context windows are marketed as "the model can now read all of our documentation". Technically true. In practice, the model does not use a long context evenly: content near the beginning and end of the prompt is processed more reliably than content in the middle (the "lost in the middle" effect). Inference cost also scales linearly or super-linearly with context length, which can make very large requests uneconomic.
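The cost point is easy to make concrete. A minimal sketch with a hypothetical per-token price (not any vendor's actual rate) shows how per-request cost grows with context length:

```python
# Hypothetical rate for illustration only; real prices vary by
# vendor and model. Cost per request scales linearly with the
# tokens billed, so filling the window on every request
# multiplies cost even when the model ignores most of the input.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # USD, assumed

def request_cost(input_tokens: int) -> float:
    """Input-side cost of a single request at the assumed rate."""
    return input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ${request_cost(tokens):.2f}")
```

At the assumed rate, a million-token request costs over a hundred times more than an 8,000-token one, on every single call.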

The right question

Not: "Do we need the model with the largest context window?" But: "What is the typical input size of our use cases, how do we measure whether the model actually finds the relevant passages, and when is a RAG (retrieval-augmented generation) architecture more economical than a larger context window?"
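The economics question can be answered with a back-of-the-envelope comparison. All numbers here are hypothetical: a 500,000-token corpus sent whole with every request, versus a RAG setup that retrieves only the top five chunks of 800 tokens each:

```python
PRICE_PER_1K_TOKENS = 0.003  # hypothetical input price in USD

def cost(tokens: int) -> float:
    """Input-side cost of one request at the assumed rate."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS

corpus_tokens = 500_000           # whole corpus in the prompt
rag_tokens = 5 * 800 + 1_000      # retrieved chunks + question

full_context = cost(corpus_tokens)
rag = cost(rag_tokens)
print(f"full context: ${full_context:.2f} per request")
print(f"RAG:          ${rag:.4f} per request")
print(f"ratio:        {full_context / rag:.0f}x")
```

Under these assumptions, RAG is two orders of magnitude cheaper per request. The comparison ignores retrieval infrastructure and recall quality, which is exactly why the measurement question above has to be answered alongside the cost question.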
