
Expanding the context window from 64K to 128K tokens sounds like a straightforward infrastructure change. It is not. The challenges are distributed across memory, attention, and evaluation — and some of them only become visible once you start testing on the kinds of documents customers actually send.
Memory and Compute
Attention complexity scales quadratically with sequence length. Doubling the context window quadruples the attention computation at peak. We addressed this through a combination of sparse attention patterns for long-range dependencies and hardware-level optimizations that reduce memory bandwidth pressure during the attention pass.
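The scaling argument above can be sketched with a back-of-envelope cost model. This is illustrative only: it assumes dense attention score computation grows as n², and models the sparse pattern as a simple sliding window of fixed width (the actual sparse pattern and window size used here are not specified in this post).

```python
# Toy cost model: dense self-attention scales as n^2 in sequence length,
# while a sliding-window sparse pattern scales as n * w for window size w.
# The 4,096-token window below is an illustrative assumption.

def dense_attention_ops(n: int) -> int:
    """Pairwise score computations for full self-attention over n tokens."""
    return n * n

def windowed_attention_ops(n: int, window: int) -> int:
    """Each token attends to at most `window` tokens instead of all n."""
    return n * min(window, n)

n_old, n_new = 64_000, 128_000

# Dense: doubling the sequence length quadruples the work.
print(dense_attention_ops(n_new) / dense_attention_ops(n_old))  # 4.0

# Windowed: the same doubling only doubles the work.
print(windowed_attention_ops(n_new, 4_096) / windowed_attention_ops(n_old, 4_096))  # 2.0
```

The contrast between the two ratios is the whole motivation for sparse patterns at long range: dense cost grows with the square of the window, windowed cost grows linearly.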
The Lost-in-the-Middle Problem
Research has shown that models trained on long contexts tend to underweight information in the middle of the input, focusing disproportionately on the beginning and end. We targeted this directly in fine-tuning, using training examples that required retrieving critical information from the middle of long documents.
Our evaluation on retrieval tasks at different positions within a 128K context shows consistent performance across the full length, with less than 3 percentage points of accuracy difference between start, middle, and end positions.
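A position-sweep retrieval evaluation of this kind can be sketched as follows. This is a minimal harness, not the evaluation used here: the filler text, question, and `stub_model` are all stand-ins, and in practice `query_model` would call the actual model.

```python
def make_context(needle: str, position: float, n_filler: int = 1000) -> str:
    """Embed `needle` at a relative position (0.0 = start, 1.0 = end)
    among filler sentences."""
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    idx = int(position * n_filler)
    return " ".join(filler[:idx] + [needle] + filler[idx:])

def retrieval_accuracy(query_model, needle: str, answer: str,
                       positions, trials: int = 5) -> dict:
    """Fraction of trials at each position where the model's reply
    contains the expected answer."""
    results = {}
    for pos in positions:
        hits = sum(
            answer in query_model(make_context(needle, pos),
                                  "What is the secret code?")
            for _ in range(trials)
        )
        results[pos] = hits / trials
    return results

# Stub model for demonstration only; it "retrieves" by substring search.
def stub_model(context: str, question: str) -> str:
    return "XYZZY" if "XYZZY" in context else "unknown"

acc = retrieval_accuracy(stub_model, "The secret code is XYZZY.",
                         "XYZZY", positions=[0.0, 0.5, 1.0])
print(acc)  # {0.0: 1.0, 0.5: 1.0, 1.0: 1.0}
```

Sweeping the needle position and comparing accuracy at the start, middle, and end is exactly how a lost-in-the-middle gap shows up as a dip at `0.5`.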
What 128K Actually Unlocks
The practical threshold is roughly 80-100 pages of dense text. That covers the majority of legal contracts, full financial reports, and most regulatory filings. Users no longer need to chunk these documents before sending them, which both simplifies their integration and improves output quality.
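The 80-100 page figure can be checked with simple arithmetic. The tokens-per-page range below is an assumption for illustration (dense, single-spaced legal or financial pages; actual density varies by document and tokenizer).

```python
# Back-of-envelope check of the 80-100 page threshold.
# Assumption: a dense page holds roughly 1,300-1,600 tokens.

CONTEXT_TOKENS = 128_000

for tokens_per_page in (1_600, 1_300):
    pages = CONTEXT_TOKENS // tokens_per_page
    print(f"{tokens_per_page} tokens/page -> ~{pages} pages")
# 1600 tokens/page -> ~80 pages
# 1300 tokens/page -> ~98 pages
```

Under those density assumptions, the window spans roughly 80-98 dense pages, consistent with the stated threshold.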