Benchmarks
The following benchmarks reflect internal evaluations run by the Crucible research team. All tasks were evaluated on held-out test sets not included in model training.
Document reasoning (DOCLENS-v2)
| Model | Score |
| --- | --- |
| crucible-1 | 91.4 |
| crucible-1-mini | 86.7 |
| crucible-2-preview | 94.8 |
Multi-step inference (MSI-bench)
| Model | Score |
| --- | --- |
| crucible-1 | 88.2 |
| crucible-1-mini | 81.5 |
| crucible-2-preview | 93.1 |
Structured extraction (EXTRACT-100)
| Model | Precision | Recall |
| --- | --- | --- |
| crucible-1 | 94.1% | 92.6% |
| crucible-1-mini | 89.3% | 87.4% |
| crucible-2-preview | 96.2% | 95.7% |
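The precision and recall columns above follow the standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of reference items that were extracted. A minimal sketch with hypothetical counts (not taken from the evaluation data):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that are correct: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of reference items that were extracted: tp / (tp + fn)."""
    return tp / (tp + fn)

# Hypothetical counts for illustration: 941 correct extractions,
# 59 spurious extractions, 75 missed reference items.
print(round(precision(941, 59), 3))  # 0.941
print(round(recall(941, 75), 3))     # 0.926
```

Note that precision and recall trade off against each other, which is why both columns are reported rather than a single score.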
Benchmark results are updated with each major model release. For third-party evaluations, see the research section of the Crucible blog.