How we reduced hallucination rates by 12% in Crucible Pro 2.1

Engineering

The Problem

Hallucinations in reasoning models are not random. They follow patterns — and once you understand those patterns, you can target them directly. In Crucible Pro 2.0, we saw a consistent failure mode on tasks requiring the model to synthesize information across multiple long documents. The model would occasionally introduce confident-sounding claims that had no grounding in the source material.

What We Changed

The fix came in three parts.

Grounding Verification Pass

We introduced a post-generation verification step in deep reasoning mode. Before the model finalizes its output, it runs a lightweight grounding check — tracing each claim back to a passage in the input. Any claim that fails the trace is flagged and regenerated.
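The idea can be illustrated with a minimal sketch, assuming a simple lexical-overlap trace; `split_claims`, `is_grounded`, and the 0.5 threshold are illustrative stand-ins, not Crucible's actual verifier.

```python
# Sketch of a post-generation grounding check: split output into claims,
# trace each claim against the source passages, flag the ones that fail.

def split_claims(output: str) -> list[str]:
    """Break generated output into claim-sized sentences."""
    return [s.strip() for s in output.split(".") if s.strip()]

def is_grounded(claim: str, passages: list[str], threshold: float = 0.5) -> bool:
    """A claim passes if enough of its content words appear in one source passage."""
    words = {w.lower() for w in claim.split() if len(w) > 3}
    if not words:
        return True  # nothing substantive to verify
    for passage in passages:
        passage_words = {w.lower() for w in passage.split()}
        if len(words & passage_words) / len(words) >= threshold:
            return True
    return False

def flag_ungrounded(output: str, passages: list[str]) -> list[str]:
    """Return the claims that fail the trace, i.e. candidates for regeneration."""
    return [c for c in split_claims(output) if not is_grounded(c, passages)]
```

A production verifier would use semantic matching rather than word overlap, but the control flow is the same: every claim must trace back to a passage, or it gets regenerated.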

Calibrated Confidence Scoring

Models tend to hallucinate most when operating near the edge of their knowledge. We retrained the confidence scoring layer to be more conservative in ambiguous contexts, so the model now qualifies uncertain claims rather than stating them as fact.
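To make the behavior concrete, here is a hedged sketch using temperature scaling, a standard calibration technique, as a stand-in for the retrained scoring layer; the threshold, function names, and hedge wording are illustrative, not Crucible's actual implementation.

```python
import math

def calibrated_confidence(logits: list[float], temperature: float = 1.5) -> float:
    """Softmax confidence with temperature scaling: T > 1 flattens the
    distribution, so ambiguous inputs report lower confidence."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    return max(exps) / sum(exps)

def render_claim(claim: str, confidence: float, threshold: float = 0.8) -> str:
    """Below the threshold, qualify the claim instead of stating it as fact."""
    if confidence >= threshold:
        return claim
    return ("The sources suggest, though do not clearly state, that "
            + claim[0].lower() + claim[1:])
```

The design choice is that a conservative scorer trades a little fluency for honesty: borderline claims survive, but they arrive qualified instead of asserted.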

Adversarial Fine-Tuning

We expanded our fine-tuning dataset to include adversarial examples specifically designed to trigger hallucination in previous model versions. Exposing the model to these failure modes during training reduced their frequency in production.
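The shape of such a dataset can be sketched as preference-style pairs: a prompt that made the previous model hallucinate, the ungrounded output to penalize, and a grounded correction to prefer. The field names and helper below are hypothetical, not Crucible's actual pipeline.

```python
# Hypothetical construction of an adversarial fine-tuning set from
# logged hallucination failures of the previous model version.

def build_adversarial_set(failure_logs: list[dict], correct) -> list[dict]:
    """Turn logged hallucination failures into preference-style training pairs."""
    examples = []
    for record in failure_logs:
        examples.append({
            "prompt": record["prompt"],
            "rejected": record["hallucinated_output"],  # what the old model said
            "chosen": correct(record["prompt"], record["sources"]),  # grounded answer
        })
    return examples
```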

The Result

Crucible Pro 2.1 achieves a 94.1% accuracy score on our internal DOCLENS-v2 benchmark, up from 82.9% in version 2.0. Hallucination rate on multi-document synthesis tasks dropped by 12 percentage points. These gains hold across legal, financial, and research document types.
