
Structured extraction is one of the most common things people do with Crucible. You have a document. You want specific fields pulled out in a consistent format. The task sounds simple and it mostly is — until you are running it on thousands of documents per day and edge cases start to compound.
This post is about how we measure extraction accuracy and what we have learned from doing it at scale.
Precision vs Recall
Precision and recall pull in opposite directions on extraction tasks. A model optimized for precision will only extract a field when it is confident, missing some real values. A model optimized for recall will extract aggressively, pulling in some noise. The right balance depends on your use case. For financial data where a missed figure is costly, recall matters more. For legal clauses where a wrong extraction creates liability, precision matters more.
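To make the trade-off concrete, here is a minimal sketch of field-level precision and recall, assuming extractions are represented as flat `{field_name: value}` dicts and a value only counts as correct on an exact match (the function and data are illustrative, not part of any Crucible tooling):

```python
def field_metrics(predicted: dict, gold: dict) -> tuple[float, float]:
    """Compare one predicted extraction against a gold annotation."""
    # A prediction counts as correct only if the field exists in the
    # gold annotation and the values match exactly.
    correct = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall

pred = {"invoice_number": "INV-001", "total": "99.50", "currency": "USD"}
gold = {"invoice_number": "INV-001", "total": "99.50", "due_date": "2024-07-01"}
p, r = field_metrics(pred, gold)
# Two of three predicted fields are correct (precision 2/3), and two of
# three gold fields were recovered (recall 2/3).
```

A model tuned for precision would have skipped `currency` and raised precision at the cost of recall; one tuned for recall would also have guessed `due_date` and risked pulling in noise.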
Crucible lets you adjust this balance through temperature and by specifying confidence thresholds in your JSON schema.
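As a sketch of what a precision-leaning request might look like, here is an illustrative payload. The exact request shape, the `temperature` field, and the `x-crucible-confidence` schema key are assumptions for illustration, not the documented Crucible API:

```python
# Illustrative only: field names below are assumptions, not the real API.
request = {
    "temperature": 0.1,  # lower temperature biases toward conservative output
    "schema": {
        "type": "object",
        "properties": {
            "total": {
                "type": "string",
                # Hypothetical schema extension: only emit this field when
                # the model's confidence clears the threshold.
                "x-crucible-confidence": 0.9,
            },
        },
    },
}
```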
Where Errors Come From
In our analysis of extraction failures across a sample of production requests, three sources account for most errors:
Ambiguous field boundaries. When a document uses inconsistent formatting, the model sometimes disagrees with human annotators about where one field ends and another begins.
Implicit values. Some fields are implied by context rather than stated directly. Models handle these less reliably than explicit values.
Schema mismatch. When the provided schema does not match the structure of the document, extraction quality degrades quickly.
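Schema mismatch in particular is cheap to detect before it compounds. One approach, sketched below under the assumption that you can run extraction over a small sample first: fields that come back empty across the whole sample often signal that the schema does not fit the documents (the helper and sample data here are hypothetical):

```python
from collections import Counter

def missing_field_rates(samples: list[dict], schema_fields: list[str]) -> dict:
    """Fraction of sampled extractions in which each schema field was absent."""
    counts = Counter()
    for extraction in samples:
        for field in schema_fields:
            if extraction.get(field) is None:
                counts[field] += 1
    return {f: counts[f] / len(samples) for f in schema_fields}

samples = [
    {"vendor": "Acme", "total": "10.00", "po_number": None},
    {"vendor": "Brite", "total": "22.40", "po_number": None},
]
rates = missing_field_rates(samples, ["vendor", "total", "po_number"])
# po_number is never found in this sample, which suggests the schema
# expects a field these documents do not actually contain.
```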
Improving Your Results
The single most effective intervention is a well-defined schema with field descriptions. Telling the model what a field means and what format the value should take is more valuable than any prompt engineering around it.
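Here is a sketch of what a field-described schema might look like. The field names, formats, and descriptions are illustrative invoice fields, not taken from a real Crucible schema:

```python
import json

# Each field spells out what it means and what format the value takes,
# which is the kind of guidance the model can act on directly.
schema = {
    "type": "object",
    "properties": {
        "total": {
            "type": "string",
            "description": (
                "Final invoice amount after tax, as a decimal string with "
                "two fraction digits and no currency symbol, e.g. '99.50'."
            ),
        },
        "due_date": {
            "type": "string",
            "description": "Payment due date in ISO 8601 format (YYYY-MM-DD).",
        },
    },
    "required": ["total"],
}
print(json.dumps(schema, indent=2))
```

Compare this to a bare `{"total": {"type": "string"}}`: the description resolves, up front, exactly the ambiguities (tax-inclusive or not, currency symbol or not) that would otherwise surface as inconsistent extractions.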