
The Problem
Hallucinations in reasoning models are not random. They follow patterns — and once you understand those patterns, you can target them directly. In Crucible Pro 2.0, we saw a consistent failure mode on tasks requiring the model to synthesize information across multiple long documents. The model would occasionally introduce confident-sounding claims that had no grounding in the source material.
What We Changed
The fix came in three parts.
Grounding Verification Pass
We introduced a post-generation verification step in deep reasoning mode. Before the model finalizes its output, it runs a lightweight grounding check — tracing each claim back to a passage in the input. Any claim that fails the trace is flagged and regenerated.
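The tracing step can be sketched as follows. This is a minimal illustration, not the production system: the function names (`overlap_score`, `grounding_check`) and the token-overlap heuristic are stand-ins for whatever learned tracing model the verification pass actually uses.

```python
def overlap_score(claim: str, passage: str) -> float:
    """Fraction of claim tokens present in the passage -- a crude
    proxy for a learned claim-to-passage tracing model."""
    claim_tokens = set(claim.lower().split())
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)

def grounding_check(claims, passages, threshold=0.6):
    """Split claims into (grounded, flagged): a claim is grounded if
    it traces to at least one input passage above the threshold;
    flagged claims would be regenerated."""
    grounded, flagged = [], []
    for claim in claims:
        best = max(overlap_score(claim, p) for p in passages)
        (grounded if best >= threshold else flagged).append(claim)
    return grounded, flagged
```

The key design point is that the check runs before the output is finalized, so flagged claims can be regenerated rather than shipped.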
Calibrated Confidence Scoring
Models tend to hallucinate most when they are operating near the edge of their knowledge. We retrained the confidence scoring layer to be more conservative in ambiguous contexts, which caused the model to qualify uncertain claims rather than state them as fact.
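One common way to make a confidence layer more conservative is temperature scaling, where temperatures above 1 pull scores toward 0.5. The sketch below assumes that approach; the temperature value, the 0.9 assertion threshold, and the hedging template are all illustrative, not published details of the model.

```python
import math

def calibrated_confidence(logit: float, temperature: float = 2.0) -> float:
    """Sigmoid over a temperature-scaled logit. Temperature > 1
    flattens the curve, making scores more conservative."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def render_claim(text: str, logit: float, assert_above: float = 0.9) -> str:
    """State the claim outright only at high calibrated confidence;
    otherwise qualify it instead of asserting it as fact."""
    if calibrated_confidence(logit) >= assert_above:
        return text
    return f"The sources suggest, though do not confirm, that {text}"
```

With temperature 2.0, a raw logit of 1.0 yields a calibrated score near 0.62, so the claim is rendered with a qualifier rather than stated flatly.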
Adversarial Fine-Tuning
We expanded our fine-tuning dataset to include adversarial examples specifically designed to trigger hallucination in previous model versions. Exposing the model to these failure modes during training reduced their frequency in production.
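Mechanically, this kind of dataset expansion amounts to mixing a fixed fraction of adversarial examples into the fine-tuning corpus. The sketch below shows one way to do that; the 15% fraction and the function name are assumptions for illustration, not figures from the actual training recipe.

```python
import random

def mix_adversarial(base_examples, adversarial_examples,
                    adv_fraction=0.15, seed=0):
    """Build a fine-tuning mixture containing a fixed fraction of
    adversarial (hallucination-triggering) examples, sampled with
    replacement, then shuffled so they are interleaved with the
    base data rather than clustered."""
    rng = random.Random(seed)
    n_adv = int(len(base_examples) * adv_fraction)
    sampled = [adversarial_examples[rng.randrange(len(adversarial_examples))]
               for _ in range(n_adv)]
    mixture = list(base_examples) + sampled
    rng.shuffle(mixture)
    return mixture
```

Sampling with replacement lets a small pool of hand-built adversarial cases fill whatever fraction of the mixture the recipe calls for.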
The Result
Crucible Pro 2.1 scores 94.1% on our internal DOCLENS-v2 benchmark, up 11.2 points from 82.9% in version 2.0. The hallucination rate on multi-document synthesis tasks dropped by 12 percentage points. These gains hold across legal, financial, and research document types.
