Benchmarks

The following benchmarks reflect internal evaluations run by the Crucible research team. Each task was scored on a held-out test set that was not included in model training.

Document reasoning (DOCLENS-v2)

Model               Score
crucible-1          91.4
crucible-1-mini     86.7
crucible-2-preview  94.8

Multi-step inference (MSI-bench)

Model               Score
crucible-1          88.2
crucible-1-mini     81.5
crucible-2-preview  93.1

Structured extraction (EXTRACT-100)

Model               Precision  Recall
crucible-1          94.1%      92.6%
crucible-1-mini     89.3%      87.4%
crucible-2-preview  96.2%      95.7%
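
For readers who want a single number per model on EXTRACT-100, precision and recall can be combined into an F1 score (their harmonic mean). The sketch below is illustrative only: the Crucible team does not report F1 here, and the function name `f1_score` and the dictionary `extract_100` are hypothetical names introduced for this example. It assumes the published precision and recall were computed over the same set of predictions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the EXTRACT-100 results above.
# F1 is derived here for illustration; it is not a figure reported by the source.
extract_100 = {
    "crucible-1": (94.1, 92.6),
    "crucible-1-mini": (89.3, 87.4),
    "crucible-2-preview": (96.2, 95.7),
}

for model, (p, r) in extract_100.items():
    print(f"{model}: F1 = {f1_score(p, r):.1f}%")
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot earn a high F1 by trading recall away for precision or the reverse.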




Benchmark results are updated with each major model release. For third-party evaluations, see the research section of the Crucible blog.
