
Streaming feels simple from the outside. You send a request and tokens appear one by one. Under the hood, making that reliable at scale across different reasoning modes required solving a set of problems that are not obvious until you are in production.
Why Streaming Matters for Reasoning Models
For a standard language model, streaming is mostly a UX concern. For a reasoning model, it is a latency problem. Deep reasoning mode can take several seconds before producing the first token. Without streaming, users stare at a blank screen. With it, they see output begin almost immediately.
The Architecture
We use server-sent events (SSE) over a persistent HTTP connection. Each chunk contains a partial output object with a delta field. The client accumulates deltas and renders them progressively.
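The accumulation loop can be sketched in a few lines of Python. This is a minimal illustration, not our production client: the exact payload schema is not spelled out here, so the chunk shape (a JSON object with a delta field) and the [DONE] terminator are assumptions for the example.

```python
import json

def accumulate_deltas(sse_lines):
    """Accumulate delta fields from SSE data lines into the full text.

    Assumes a hypothetical chunk shape {"delta": "..."} and a
    "[DONE]" terminator; the real payload schema may differ.
    """
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, id: lines, and blank keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        parts.append(chunk.get("delta", ""))
    return "".join(parts)
```

A renderer would call this incrementally rather than at the end, appending each delta to the DOM (or terminal) as it arrives; the buffering version above just makes the accumulation logic easy to see.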
Connection Management
One of the harder problems was handling dropped connections gracefully. SSE connections time out, especially on mobile networks. We lean on SSE's built-in last-event-id mechanism: each event carries an id, and a reconnecting client sends the last id it saw so the server can resume mid-stream without losing tokens.
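The server side of resumption amounts to keeping a short history of (id, event) pairs and replaying whatever the client missed. A minimal sketch, with illustrative names and an arbitrary history size; the production implementation is not shown in this post:

```python
from collections import deque

class ResumableStream:
    """Keep recent events with monotonically increasing ids so a client
    that reconnects with a last-event-id can be sent only the events
    it missed. Hypothetical sketch, not the production API."""

    def __init__(self, history=256):
        self._events = deque(maxlen=history)  # bounded replay buffer
        self._next_id = 0

    def publish(self, data):
        """Record an event and return the id to emit as its SSE id field."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return event_id

    def resume_after(self, last_event_id):
        """Events the client has not yet seen (id > last_event_id)."""
        return [(i, d) for i, d in self._events if i > last_event_id]
```

The bounded deque is the trade-off to notice: if a client stays disconnected long enough that its last id has been evicted, the server can no longer resume and must signal a full restart.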
Backpressure Handling
When the model generates tokens faster than the client can consume them, we buffer on the server side with a fixed-size queue. If the queue fills, we slow token dispatch rather than drop events. This keeps the stream intact even on slow connections.
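A blocking bounded queue gives exactly this behavior: when the buffer is full, the producer's put blocks until the consumer catches up, so dispatch slows instead of events being dropped. A sketch using Python's standard library (queue sizes and function names are illustrative):

```python
import queue
import threading

def produce(tokens, buf):
    """Dispatch tokens into a fixed-size buffer.

    put() blocks when the queue is full, which throttles the producer
    to the consumer's pace rather than dropping events.
    """
    for tok in tokens:
        buf.put(tok)   # blocks until the consumer makes room
    buf.put(None)      # sentinel: end of stream

def consume(buf, out):
    """Drain the buffer until the end-of-stream sentinel."""
    while True:
        tok = buf.get()
        if tok is None:
            break
        out.append(tok)

# Fixed-size server-side buffer; 4 is an illustrative capacity.
buf = queue.Queue(maxsize=4)
```

In a real server the consumer side is the network write to the client, so a slow connection naturally backpressures generation dispatch through the same blocking put.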
What We Do Not Stream Yet
The reasoning trace in deep mode is not streamed. The trace is assembled during generation and is only complete at the end. We are working on a way to stream trace segments incrementally, which is on the roadmap for later this year.
