How we built Crucible's streaming pipeline

Engineering

Streaming feels simple from the outside. You send a request and tokens appear one by one. Under the hood, making that reliable at scale across different reasoning modes required solving a set of problems that are not obvious until you are in production.

Why Streaming Matters for Reasoning Models

For a standard language model, streaming is mostly a UX concern. For a reasoning model, it is a latency problem. Deep reasoning mode can take several seconds before producing the first token. Without streaming, users stare at a blank screen. With it, they see output begin almost immediately.

The Architecture

We use server-sent events (SSE) over a persistent HTTP connection. Each chunk contains a partial output object with a delta field. The client accumulates deltas and renders them progressively.
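Client-side, accumulating deltas is a small fold over the stream. Here is a minimal sketch, assuming each SSE `data:` payload is a JSON object with a `delta` field as described above (the function name and payload shape beyond that field are illustrative, not our actual SDK):

```python
import json

def accumulate(payloads):
    """Fold a sequence of SSE data payloads into the full output.

    Each payload is the JSON body of one `data:` line. Chunks carry a
    partial output object with a `delta` field; we concatenate deltas
    in arrival order to reconstruct the text progressively.
    """
    parts = []
    for payload in payloads:
        chunk = json.loads(payload)
        delta = chunk.get("delta")
        if delta:
            parts.append(delta)  # render incrementally in a real client
    return "".join(parts)
```

In a real client you would render after each delta rather than only at the end; the fold itself is the same.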

Connection Management

One of the harder problems was handling dropped connections gracefully. SSE connections time out, especially on mobile networks. We use SSE's Last-Event-ID mechanism: each event carries an id, and a reconnecting client sends the id of the last event it received, so the server can resume the stream from that point without losing tokens.
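The server side of resumption reduces to replaying the buffered tail of the stream. A minimal sketch, assuming event ids are monotonically increasing integers and the server keeps an in-memory buffer of `(event_id, data)` pairs for each in-flight stream (both assumptions of this sketch, not a description of our storage layer):

```python
def resume_from(buffered, last_event_id):
    """Return the events a reconnecting client still needs.

    `buffered` is the server-side list of (event_id, data) pairs for an
    in-flight stream; `last_event_id` is the integer the client sends in
    the Last-Event-ID header on reconnect, or None on a fresh connect.
    """
    if last_event_id is None:
        return list(buffered)  # fresh connection: send everything
    # Replay only events the client has not yet acknowledged.
    return [(i, d) for i, d in buffered if i > last_event_id]
```

Because ids are monotonic, the filter is a simple comparison; a production implementation would also bound the buffer and reject ids that have already been evicted.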

Backpressure Handling

When the model generates tokens faster than the client can consume them, we buffer on the server side with a fixed-size queue. If the queue fills, we slow token dispatch rather than drop events. This keeps the stream intact even on slow connections.
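The key property is that a full queue slows the producer rather than dropping events. With a bounded `asyncio.Queue`, this falls out for free: `put` suspends until the consumer drains a slot. A sketch of the idea (queue size and the simulated slow consumer are illustrative, not our production values):

```python
import asyncio

async def stream_tokens(tokens, queue_size=64):
    """Push tokens through a fixed-size queue to a consumer.

    When the queue is full, `put` awaits instead of discarding, so a
    slow client throttles token dispatch and the stream stays intact.
    """
    queue = asyncio.Queue(maxsize=queue_size)
    received = []

    async def producer():
        for tok in tokens:
            await queue.put(tok)   # suspends here when the queue is full
        await queue.put(None)      # sentinel: end of stream

    async def consumer():
        while True:
            tok = await queue.get()
            if tok is None:
                break
            received.append(tok)
            await asyncio.sleep(0)  # stand-in for a slow network write

    await asyncio.gather(producer(), consumer())
    return received
```

Every token arrives in order regardless of how slow the consumer is; the cost is that generation throughput is capped by the client's drain rate once the buffer fills.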

What We Do Not Stream Yet

The reasoning trace in deep mode is not streamed. The trace is assembled during generation and only complete at the end. We are working on a way to stream trace segments incrementally, which is on the roadmap for later this year.
