Benchmarks
The following benchmarks reflect internal evaluations run by the Crucible research team. All tasks were evaluated on held-out test sets not included in model training.
Document reasoning (DOCLENS-v2)
| Model | Score |
| --- | --- |
| crucible-1 | 91.4 |
| crucible-1-mini | 86.7 |
| crucible-2-preview | 94.8 |
Multi-step inference (MSI-bench)
| Model | Score |
| --- | --- |
| crucible-1 | 88.2 |
| crucible-1-mini | 81.5 |
| crucible-2-preview | 93.1 |
Structured extraction (EXTRACT-100)
| Model | Precision | Recall |
| --- | --- | --- |
| crucible-1 | 94.1% | 92.6% |
| crucible-1-mini | 89.3% | 87.4% |
| crucible-2-preview | 96.2% | 95.7% |
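The precision and recall columns above follow the standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of reference items that were extracted. A minimal sketch with hypothetical counts (not taken from the evaluation data):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that are correct: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of reference items that were extracted: tp / (tp + fn)."""
    return tp / (tp + fn)

# Hypothetical counts for illustration: 941 correct extractions,
# 59 spurious extractions, 75 missed reference items.
print(round(precision(941, 59), 3))  # 0.941
print(round(recall(941, 75), 3))     # 0.926
```

Note that precision and recall trade off against each other, which is why both columns are reported rather than a single score.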
Benchmark results are updated with each major model release. For third-party evaluations, see the research section of the Crucible blog.