Document parsing accuracy benchmark

130 pages from public datasets and public documents, including CORD, SROIE, arXiv, and SEC 10-Ks. Each page was human-verified region by region and run through 6 providers across seven document types: forms and invoices, tables and financial docs, multi-column pages, multilingual docs, image-heavy docs, presentations, and scanned or degraded pages.

text accuracy
Extract
81.9%
Extend
74.8%
Reducto
70.1%
LlamaParse
69.1%
AWS Textract
60.7%
Unstructured
59.1%
0%25%50%75%100%
results

Accuracy columns are gold-set means with 95% confidence intervals, sorted by text accuracy.

providertext accuracyword-overlap F1grounded accuracylayout IoU
Extract
81.9%
78.2%85.3%
84.5%
81.5%87.4%
88.6%
86.1%91.1%
65.3%
61.4%69.1%
Extend
74.8%
70.0%79.2%
74.0%
70.1%77.6%
94.2%
91.2%96.6%
69.2%
64.7%73.4%
Reducto
70.1%
66.0%73.9%
73.3%
68.7%77.7%
85.3%
80.9%89.4%
66.1%
61.5%70.7%
LlamaParse
69.1%
63.9%74.6%
62.4%
57.5%67.5%
61.3%
54.9%67.3%
52.0%
46.6%57.4%
AWS TextractLAYOUT
60.7%
55.3%66.1%
70.5%
65.0%75.7%
81.3%
76.1%86.3%
68.8%
65.3%72.5%
Unstructured
59.1%
53.3%65.1%
52.8%
46.1%59.4%
63.9%
57.4%70.4%
68.6%
63.9%73.2%
at scale

Speed and coverage across 12k pages

The larger consensus benchmark asks a different question: across 106 business PDFs and 11,448 pages, who stays fast while remaining competitive with the provider-agreement reference? Corpus mix: financial reports, compliance audits, legal contracts, healthcare and insurance packets, government forms, technical specs, manuals, slides, image-heavy pages, multilingual docs, and scanned/OCR cases.

Extract p50
11.5s
consensus F1
94.3%

Consensus F1 is provider agreement against a majority-provider pseudo-reference, not human-labeled accuracy. The gold-set table above is the primary accuracy claim; this section shows speed and coverage depth on the broad corpus.View the full 107-document breakdown →

what we measured
Grounded accuracy

For each labeled region we gather the predicted boxes overlapping its location (page-sized dumps are excluded), stitch their text in reading order, and measure how much of the region's text is recoverable in that local context — 11 when the text is contained, otherwise the best-matching window scored by 1NED1-\mathrm{NED}. The metric averages this recall 1Nrrecr\frac{1}{N}\sum_{r}\mathrm{rec}_r over all NN regions, with a region that got no overlapping prediction scoring 0; extra neighboring text from a coarse box is not penalized, so word-, line-, and paragraph-box outputs are scored fairly.

Text accuracy

Character fidelity over the whole page in each provider's emitted reading order: 1NED1-\mathrm{NED}, where NED=edit distancemax(p,g)\mathrm{NED}=\frac{\text{edit distance}}{\max(|p|,\,|g|)} is the Levenshtein distance between prediction pp and ground truth gg over the longer length. Markdown, case, and Textract LAYOUT table scaffolding are normalized first.

Word-overlap F1

Bag-of-words overlap, independent of order: the harmonic mean of word precision PP and recall RR, F1=2PRP+RF_1=\frac{2PR}{P+R}. It surfaces which words were captured versus dropped or hallucinated.

Layout IoU

How well the predicted boxes cover the real content blocks: intersection-over-union of the predicted (AA) and verified (BB) box masks rasterized on a 1000×1000 grid, ABAB\frac{|A\cap B|}{|A\cup B|}, independent of the text inside.

methodology

All 130 pages are real documents spanning seven types: forms & invoices, tables & financial, multi-column, multilingual, image-heavy, presentations, and scanned & degraded. They are drawn from public OCR datasets (CORD, SROIE), arXiv pages, lecture slides, SEC 10-K filings, and synthetic multilingual renders. Every page was human-verified region by region against its source. Each score is a document-level mean with a 95% confidence interval from bootstrap resampling over pages. Test pages are held out from training; synthetic renders in the test set are generated separately from training data.