Document parsing accuracy benchmark

The gold set contains 130 pages from public datasets and public documents, including CORD, SROIE, arXiv, and SEC 10-Ks. Each page was human-verified region by region and run through six providers. The pages span seven document types: forms and invoices, tables and financial docs, multi-column pages, multilingual docs, image-heavy docs, presentations, and scanned or degraded pages.

text accuracy

Bar chart of text accuracy by provider (axis 0% to 100%): Extract 81.9%, Extend 74.8%, Reducto 70.1%, LlamaParse 69.1%, AWS Textract 60.7%, Unstructured 59.1%.

results

Accuracy columns are gold-set means with 95% confidence intervals, sorted by text accuracy.

provider	text accuracy (95% CI)	word-overlap F1 (95% CI)	grounded accuracy (95% CI)	layout IoU (95% CI)
Extract	81.9% (78.2%–85.3%)	84.5% (81.5%–87.4%)	88.6% (86.1%–91.1%)	65.3% (61.4%–69.1%)
Extend	74.8% (70.0%–79.2%)	74.0% (70.1%–77.6%)	94.2% (91.2%–96.6%)	69.2% (64.7%–73.4%)
Reducto	70.1% (66.0%–73.9%)	73.3% (68.7%–77.7%)	85.3% (80.9%–89.4%)	66.1% (61.5%–70.7%)
LlamaParse	69.1% (63.9%–74.6%)	62.4% (57.5%–67.5%)	61.3% (54.9%–67.3%)	52.0% (46.6%–57.4%)
AWS Textract (LAYOUT)	60.7% (55.3%–66.1%)	70.5% (65.0%–75.7%)	81.3% (76.1%–86.3%)	68.8% (65.3%–72.5%)
Unstructured	59.1% (53.3%–65.1%)	52.8% (46.1%–59.4%)	63.9% (57.4%–70.4%)	68.6% (63.9%–73.2%)

at scale

Speed and coverage across 12k pages

The larger consensus benchmark asks a different question: across 106 business PDFs and 11,448 pages, who stays fast while remaining competitive with the provider-agreement reference? Corpus mix: financial reports, compliance audits, legal contracts, healthcare and insurance packets, government forms, technical specs, manuals, slides, image-heavy pages, multilingual docs, and scanned/OCR cases.

Extract p50: 11.5s

consensus F1: 94.3%

Consensus F1 is provider agreement against a majority-provider pseudo-reference, not human-labeled accuracy. The gold-set table above is the primary accuracy claim; this section shows speed and coverage depth on the broad corpus.

what we measured

Grounded accuracy

For each labeled region we gather the predicted boxes overlapping its location (page-sized dumps are excluded), stitch their text in reading order, and measure how much of the region's text is recoverable in that local context: $1$ when the text is contained, otherwise the best-matching window scored by $1-\mathrm{NED}$, with $\mathrm{NED}$ the normalized edit distance defined under Text accuracy. The metric averages this recall, $\frac{1}{N}\sum_{r}\mathrm{rec}_r$, over all $N$ regions; a region with no overlapping prediction scores 0. Extra neighboring text from a coarse box is not penalized, so word-, line-, and paragraph-box outputs are scored fairly.
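The procedure above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the benchmark's actual code. Several details are assumptions: boxes are `(x0, y0, x1, y1)` in normalized page coordinates, a "page-sized dump" is any box covering more than a hypothetical `max_page_fraction` of the page, the sliding window has the same length as the region text, and no text normalization is applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def overlaps(a, b) -> bool:
    """True when two (x0, y0, x1, y1) boxes share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def area(box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def region_recall(gold_text: str, context: str) -> float:
    """1.0 if the region text is contained, else the best window's 1 - NED."""
    if not context:
        return 0.0  # no overlapping prediction
    if gold_text in context:
        return 1.0
    n = len(gold_text)
    best = 0.0
    # Assumption: slide a window of the region's own length over the context.
    for start in range(max(1, len(context) - n + 1)):
        window = context[start:start + n]
        ned = levenshtein(window, gold_text) / max(len(window), n)
        best = max(best, 1.0 - ned)
    return best


def grounded_accuracy(regions, predictions, max_page_fraction=0.9) -> float:
    """Mean recall over regions; both inputs are lists of (box, text).

    `predictions` is assumed to already be in the provider's reading order.
    `max_page_fraction` is a hypothetical cutoff for page-sized dumps.
    """
    scores = []
    for g_box, g_text in regions:
        parts = [text for box, text in predictions
                 if area(box) < max_page_fraction and overlaps(box, g_box)]
        scores.append(region_recall(g_text, " ".join(parts)))
    return sum(scores) / len(scores) if scores else 0.0
```

Because only recall inside the local context is scored, a line-level box that also carries neighboring text still earns full credit for the region it covers.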

Text accuracy

Character fidelity over the whole page in each provider's emitted reading order: $1-\mathrm{NED}$, where $\mathrm{NED}=\frac{\text{edit distance}}{\max(|p|,\,|g|)}$ is the Levenshtein distance between prediction $p$ and ground truth $g$ divided by the length of the longer string. Markdown, case, and Textract LAYOUT table scaffolding are normalized first.
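A minimal Python sketch of this score follows. The `normalize` helper is a stand-in that only lowercases and collapses whitespace; the benchmark's real normalization (Markdown and Textract LAYOUT scaffolding) is not reproduced here, and treating two empty strings as a perfect match is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumption: simplified normalization (case and whitespace only).
    return " ".join(text.lower().split())


def text_accuracy(pred: str, gold: str) -> float:
    """1 - NED, with NED = edit distance / max(|p|, |g|)."""
    p, g = normalize(pred), normalize(gold)
    longest = max(len(p), len(g))
    if longest == 0:
        return 1.0  # assumption: two empty pages count as identical
    return 1.0 - levenshtein(p, g) / longest
```

For example, `text_accuracy("kitten", "sitting")` needs three edits over a longer length of seven, giving 1 - 3/7, about 0.571. Since the comparison runs over the whole page string, a provider that emits the right characters in a different reading order is penalized here but not in word-overlap F1.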

Word-overlap F1

Bag-of-words overlap, independent of order: the harmonic mean of word precision $P$ and recall $R$, $F_1=\frac{2PR}{P+R}$. It surfaces which words were captured versus dropped or hallucinated.
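As a rough illustration, the score can be computed with multiset counts. Whitespace tokenization, lowercasing, and counting repeated words with multiplicity are assumptions of this sketch rather than details confirmed by the benchmark.

```python
from collections import Counter


def word_f1(pred: str, gold: str) -> float:
    """Order-independent F1 between predicted and ground-truth words."""
    p_words = Counter(pred.lower().split())
    g_words = Counter(gold.lower().split())
    # Multiset intersection: each word counts up to its smaller frequency.
    overlap = sum((p_words & g_words).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p_words.values())  # penalizes hallucinated words
    recall = overlap / sum(g_words.values())     # penalizes dropped words
    return 2 * precision * recall / (precision + recall)
```

With `pred = "total due 42"` and `gold = "42 total"`, two words overlap, so $P = 2/3$, $R = 1$, and $F_1 = 0.8$; the word order plays no role.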

Layout IoU

How well the predicted boxes cover the real content blocks: intersection-over-union of the predicted ($A$) and verified ($B$) box masks rasterized on a 1000×1000 grid, $\frac{|A\cap B|}{|A\cup B|}$, independent of the text inside.
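The mask-based IoU can be sketched in pure Python as below. Boxes are assumed to be `(x0, y0, x1, y1)` in normalized `[0, 1]` page coordinates, and returning 1.0 when both masks are empty is an assumption; neither detail comes from the benchmark itself.

```python
def rasterize(boxes, grid: int = 1000) -> bytearray:
    """Paint boxes onto a grid x grid mask (1 = covered, 0 = empty)."""
    mask = bytearray(grid * grid)
    for x0, y0, x1, y1 in boxes:
        c0, c1 = max(0, round(x0 * grid)), min(grid, round(x1 * grid))
        r0, r1 = max(0, round(y0 * grid)), min(grid, round(y1 * grid))
        if c1 <= c0:
            continue  # degenerate box, nothing to paint
        row_fill = b"\x01" * (c1 - c0)
        for r in range(r0, r1):
            mask[r * grid + c0:r * grid + c1] = row_fill
    return mask


def layout_iou(pred_boxes, gold_boxes, grid: int = 1000) -> float:
    """|A ∩ B| / |A ∪ B| over the two rasterized masks."""
    a = rasterize(pred_boxes, grid)
    b = rasterize(gold_boxes, grid)
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0  # assumption: empty vs empty = 1.0
```

Rasterizing to a union mask first means overlapping boxes from the same provider are not double-counted, so a provider emitting many small word boxes and one emitting a few paragraph boxes are compared on covered area alone.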

methodology

All 130 pages are real documents spanning seven types: forms & invoices, tables & financial, multi-column, multilingual, image-heavy, presentations, and scanned & degraded. They are drawn from public OCR datasets (CORD, SROIE), arXiv pages, lecture slides, SEC 10-K filings, and synthetic multilingual renders. Every page was human-verified region by region against its source. Each score is a document-level mean with a 95% confidence interval from bootstrap resampling over pages. Test pages are held out from training; synthetic renders in the test set are generated separately from training data.
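One common way to obtain such an interval is the percentile bootstrap, sketched below. The source does not say which bootstrap variant, resample count, or seed was used, so `n_resamples`, `seed`, and the percentile method itself are assumptions of this example.

```python
import random


def bootstrap_ci(page_scores, n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0):
    """Return (mean, lower, upper) using a percentile bootstrap over pages."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(page_scores)
    resampled_means = sorted(
        # Draw n pages with replacement and take their mean score.
        sum(rng.choices(page_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(page_scores) / n, lower, upper
```

Calling `bootstrap_ci(scores)` on a list of per-page scores returns the point estimate together with the bounds of a 95% interval. Resampling whole pages keeps all regions of a page together, so the interval reflects page-to-page variation.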