Designing for latency: the engineering behind Fast Mode

Engineering

Fast Mode exists because not every request needs deep reasoning. Classifying a document type, extracting a date, routing a query — these tasks do not benefit from multi-step inference. Forcing them through the same pipeline as complex analysis adds latency and cost with no accuracy gain.

The engineering challenge was building a mode that is genuinely fast without feeling like a downgrade.

What We Optimized

Three things drive latency in a model API: time to first token, throughput, and queue depth. Fast Mode addresses all three.

Smaller Inference Graph

Fast Mode uses a pruned version of the crucible-1 inference graph. We identified the layers that contribute most to reasoning depth and removed them from the fast path. Accuracy on simple tasks is essentially unchanged. On complex tasks, the accuracy gap is real, which is why Fast Mode is not the default.
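The idea of running a pruned subgraph on the fast path can be sketched as follows. This is a toy illustration, not crucible-1 internals; the layers and the choice of which one to skip are invented for the example.

```python
# Toy sketch: a "graph" as an ordered list of layer functions, with a fast
# path that skips layers attributed to deep reasoning. Purely illustrative.

def run_layers(layers, x):
    """Run the input through each layer in order."""
    for layer in layers:
        x = layer(x)
    return x

# Full graph: every layer runs.
full_graph = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# Fast path: the middle layer (imagined as reasoning-heavy) is pruned out.
fast_graph = [full_graph[0], full_graph[2]]

print(run_layers(full_graph, 5))  # full depth -> 9
print(run_layers(fast_graph, 5))  # fewer layers, lower latency -> 3
```

The trade-off in the example mirrors the text: the fast path does strictly less work per request, so it can only match full-graph quality on inputs that did not need the pruned layers.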

Priority Queuing

Fast Mode requests are routed to a separate queue with lower depth limits. This means Fast Mode requests never wait behind a Deep Mode job that is processing a 200-page document.
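A minimal sketch of per-mode queues with separate depth limits is below. The queue names, limits, and load-shedding behavior are assumptions for illustration, not production values.

```python
from collections import deque

# Hypothetical per-mode queues. A shallow fast queue bounds worst-case wait;
# the depth limits here are made up for the sketch.
QUEUES = {
    "fast": {"q": deque(), "max_depth": 8},
    "deep": {"q": deque(), "max_depth": 256},
}

def enqueue(mode, request):
    """Admit a request to its mode's queue, or shed it at the depth limit."""
    entry = QUEUES[mode]
    if len(entry["q"]) >= entry["max_depth"]:
        return False  # reject rather than grow a long latency tail
    entry["q"].append(request)
    return True

# A backlog of deep jobs never sits in front of a fast request,
# because the two modes never share a queue:
for i in range(3):
    enqueue("deep", f"200-page-doc-{i}")
assert enqueue("fast", "classify-doc")  # admitted immediately
```

Keeping the fast queue shallow is what makes the latency bound meaningful: once it fills, new requests fail fast instead of waiting.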

Prefill Optimization

For short prompts, most of the latency is in the prefill phase. We cache common prompt structures and batch prefill across concurrent requests where possible, which cuts median time to first token by roughly 40%.

When to Use It

Fast Mode is the right choice for high-volume pipelines where the per-request task is bounded and well-defined. If you are unsure, start with Standard and profile your latency. The mode switch is a single parameter change.
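The single-parameter switch might look like the request payload below. The field name `mode` and its values are assumptions based on this post, not a documented API schema.

```python
import json

# Hypothetical request body illustrating the one-parameter mode switch.
request = {
    "model": "crucible-1",
    "mode": "fast",  # switch back to "standard" to restore the full pipeline
    "input": "Classify this document: invoice or receipt?",
}
print(json.dumps(request, indent=2))
```

Because the change is confined to one field, a pipeline can A/B the two modes and compare latency profiles without touching any other request logic.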
