
How to Validate AI Document Extraction Accuracy in Production
Confidence scores, dual-model evaluation, and the correction loop that makes extraction reliable at scale

Every AI extraction demo looks great. The model pulls clean JSON from a messy PDF. The audience nods. Someone asks about accuracy. The vendor says 95%. Everyone moves on.
Then you get to production. A brokerage statement has a transaction date that the model inferred from the statement period instead of extracting from the actual transaction. A loss run report marks a deductible as zero because the field was blank and the model assumed a reasonable default. An invoice from a new vendor uses terminology the model hasn't seen before, and it maps a line item to the wrong category with high confidence.
None of these are hallucinations in the dramatic sense. The model didn't make something up from nothing. But they're wrong. And in finance, insurance, healthcare, and supply chain, wrong is expensive.
The question every technical team should be asking isn't "what's your accuracy?" It's "how do you know when you're wrong, and what do you do about it?"
Why a single confidence score is not enough
Most extraction tools that offer any accuracy metric give you a single number: a confidence score for the overall extraction, or maybe per field. The number is usually high. 90%, 95%, 97%. And it tells you almost nothing.
A confidence score without an explanation is a black box output from a black box system. You don't know why it's 95% instead of 99%. You don't know which fields are pulling the score down. You don't know whether the model is uncertain because the input was ambiguous or because it made a systematic error that will repeat on every similar document.
In production, what you need is per-field confidence with reasoning. Not just "this field is 87% confident" but "this field is 87% confident because the year was inferred from the statement period rather than explicitly present in the transaction record." That distinction is the difference between a number you ignore and a signal you can act on.
How we validate at bem
We built our validation architecture around a simple principle: the model that extracts the data should not be the model that evaluates the data. That's like asking someone to grade their own exam.
Here's how it works in practice.
Layer 1: Log probability extraction. Every foundation model produces log probabilities as it generates tokens: the model's raw statistical confidence at each step of the extraction. We extract these probabilities and compute a weighted average on a per-field basis. This gives you the model's own assessment of its certainty, before any external validation.
Most extraction tools don't surface this. They either don't extract log probs at all, or they aggregate them into a single number that obscures the field-level detail. We pass the raw probabilities through to you so your system can set thresholds on a per-field basis.
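The aggregation step can be sketched in a few lines. This is an illustrative example, not bem's actual implementation: the `field_token_spans` mapping (field name to token indices in the generated output) is a hypothetical input, and it uses a plain mean of log probabilities rather than any particular weighting scheme.

```python
import math

def field_confidence(token_logprobs, field_token_spans):
    """Aggregate token log probabilities into a per-field confidence score.

    token_logprobs: one log probability per generated token.
    field_token_spans: hypothetical mapping of field name -> (start, end)
    token indices within the generated output.
    """
    scores = {}
    for field, (start, end) in field_token_spans.items():
        # Mean log prob is the geometric mean of token probabilities, so a
        # single low-probability token drags the whole field score down.
        mean_lp = sum(token_logprobs[start:end]) / (end - start)
        scores[field] = math.exp(mean_lp)
    return scores
```

The geometric mean is deliberately pessimistic: a field where the model hesitated on even one token scores visibly lower than a field generated with uniform confidence, which is exactly the signal you want for thresholding.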
Layer 2: Independent evaluation model. We run a separate, larger model that evaluates the extraction output against the original input. This model is specifically trained for evaluation, not extraction. It looks at the input document, looks at the extracted output, and produces a detailed assessment of every field.
This is where you get the reasoning. The evaluation model doesn't just say "87% confident." It says "the extracted year is 2022, but the statement period covers January to December 2024. The year appears to be inferred rather than directly extracted. This is likely incorrect."
The separation matters. An extraction model optimized for speed and accuracy will sometimes make confident mistakes. An evaluation model optimized for skepticism catches those mistakes precisely because it's looking at the problem from a different angle.
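To make the shape of this concrete, here is a minimal sketch of what an independent evaluator's interface might look like. The prompt wording, the `FieldAssessment` structure, and the function name are all illustrative assumptions, not bem's API; the point is that the evaluator receives both the source document and the extraction, and returns per-field reasoning rather than a single score.

```python
from dataclasses import dataclass

@dataclass
class FieldAssessment:
    """Hypothetical per-field verdict from an evaluation model."""
    field: str
    confidence: float   # 0.0 to 1.0
    reasoning: str      # why the evaluator believes or doubts the value
    likely_correct: bool

def build_eval_prompt(document_text: str, extraction: dict) -> str:
    """Construct a prompt for a separate, skeptical evaluation model."""
    return (
        "You are a skeptical reviewer. For each extracted field, judge "
        "whether the value is directly supported by the document or merely "
        "inferred from context.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Extraction:\n{extraction}\n\n"
        "Return per field: confidence (0-1), reasoning, likely_correct."
    )
```

Framing the evaluator as a reviewer, rather than re-running the extraction, is what surfaces distinctions like "inferred from the statement period" versus "present in the transaction record."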
Layer 3: Schematization. Before the data reaches you, we apply a deterministic validation layer. This is the non-AI layer: type checking, constraint validation, enum matching, and required field verification. If your schema says a date must be in ISO format and the model returned a natural-language date, the schematization layer catches it. If your schema says a field must be one of five values and the model returned something else, it flags it.
This layer pairs statistical AI with deterministic validation. The models handle the semantic understanding. The schema enforcement handles the structural guarantees. Together, they catch errors that either layer alone would miss.
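A minimal sketch of that deterministic layer, using only the standard library. The field names and the allowed category set are hypothetical examples, not a real bem schema; production schemas would be customer-defined.

```python
from datetime import date

# Hypothetical enum for illustration; real schemas are customer-defined.
ALLOWED_CATEGORIES = {"equity", "bond", "cash", "fund", "other"}

def schematize(record: dict) -> list[str]:
    """Deterministic post-extraction checks: required fields, types, enums.

    Returns a list of human-readable violations; empty means the record
    passes the structural checks.
    """
    errors = []
    for required in ("transaction_date", "amount", "category"):
        if record.get(required) in (None, ""):
            errors.append(f"missing required field: {required}")
    try:
        # Rejects natural-language dates like "March 5, 2024".
        date.fromisoformat(str(record.get("transaction_date", "")))
    except ValueError:
        errors.append("transaction_date is not ISO format")
    if record.get("category") not in ALLOWED_CATEGORIES:
        errors.append(f"category not in allowed enum: {record.get('category')!r}")
    return errors
```

Nothing here involves a model: every check is exact and repeatable, which is what lets this layer make structural guarantees the statistical layers cannot.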
The correction-to-training loop
Validation by itself is necessary but not sufficient. What makes the system actually get better over time is closing the loop between corrections and training.
Here's the workflow. A document comes through the pipeline. The extraction runs. The validation layers produce confidence scores with reasoning. Any field below a configurable threshold gets flagged.
When a human operator reviews a flagged field and makes a correction, that correction doesn't just fix the immediate output. It feeds back into a fine-tuning pipeline that trains the model specifically for your data. Over time, the model learns the patterns unique to your documents: the terminology your vendors use, the layouts your carriers prefer, the edge cases in your industry.
We see this play out consistently across customers. A new customer starts at 93 to 97% accuracy, depending on the complexity of their documents. Within weeks of production use, as corrections accumulate and fine-tuning runs, accuracy climbs into the high 99s for their specific document types. The human review volume drops proportionally.
The key insight is that this loop is automatic. You don't need a data science team to manage fine-tuning jobs. Corrections submitted through the API or through the review UI trigger retraining automatically. The system versions every model, runs regression tests, and rolls back if a new model performs worse than the previous one.
Threshold-based routing in production
In a production pipeline, you don't want to send every document to human review. You want to route intelligently based on confidence.
We support threshold-based routing out of the box. You configure a confidence threshold, say 95%, and any extraction that falls below it gets routed to a different queue. That queue might go to a human operator, or it might go to a BPO partner, or it might trigger a different extraction strategy.
The threshold is configurable per field, not just per document. You might accept 90% confidence on a vendor name but demand 99% on a transaction amount. The routing logic respects that granularity.
For one insurance customer, we set up a workflow where high-confidence extractions go straight to production, medium-confidence extractions go to an internal review queue, and low-confidence extractions route to their existing BPO for manual processing. Over time, as the model improved, the volume flowing to the BPO dropped from 30% of documents to less than 5%.
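The per-field thresholds and tiered queues described above can be sketched as a small routing function. The threshold values, field names, and queue labels are illustrative assumptions, not bem's configuration format.

```python
DEFAULT_THRESHOLD = 0.95

# Hypothetical per-field overrides: stricter on amounts, looser on names.
FIELD_THRESHOLDS = {
    "transaction_amount": 0.99,
    "vendor_name": 0.90,
}

def route(field_scores: dict[str, float]) -> str:
    """Route a document to a queue based on per-field confidence.

    Returns one of: "auto_approve", "review_queue", "manual_queue".
    """
    flagged = [
        field for field, score in field_scores.items()
        if score < FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
    ]
    if not flagged:
        return "auto_approve"          # every field cleared its threshold
    # Badly failing fields go to manual processing; near-misses to review.
    worst = min(field_scores[f] for f in flagged)
    return "manual_queue" if worst < 0.70 else "review_queue"
```

The same structure supports the three-tier setup described above: high confidence straight to production, near-misses to an internal review queue, and low-confidence documents out to a BPO.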
What to ask any extraction vendor
When you're evaluating extraction tools for production use, here are the questions that separate real validation from marketing claims.
Do you provide per-field confidence scores, or only document-level?
Is the evaluation model independent from the extraction model?
Can I get the reasoning behind a confidence score, not just the number?
Does the system support a correction-to-training loop? Is fine-tuning automatic or manual?
Can I set confidence thresholds per field for routing?
What happens when the model encounters a document format it hasn't seen before?
That last question is the most revealing. A system that returns high confidence on unfamiliar documents is more dangerous than one that returns low confidence. The ability to say "I'm not sure" is a feature, not a bug.
We built bem for industries where getting it wrong has real consequences: finance, insurance, healthcare, supply chain. The extraction is the easy part. The hard part is knowing when to trust the output and building a system that gets better every time it's wrong.

Written by
Antonio Bustamante
Apr 7, 2026 · Engineering


Ready to see it in action?
Talk to our team to walk through how bem can work inside your stack.