Document parsing accuracy benchmark
130 pages from public datasets and public documents, including CORD, SROIE, arXiv, and SEC 10-Ks. Each page was human-verified region by region and run through six providers across seven document types: forms and invoices, tables and financial docs, multi-column pages, multilingual docs, image-heavy docs, presentations, and scanned or degraded pages.
Accuracy columns are gold-set means with 95% confidence intervals, sorted by text accuracy.
| provider | text accuracy | word-overlap F1 | grounded accuracy | layout IoU |
|---|---|---|---|---|
| Extract | 81.9% (78.2%–85.3%) | 84.5% (81.5%–87.4%) | 88.6% (86.1%–91.1%) | 65.3% (61.4%–69.1%) |
| Extend | 74.8% (70.0%–79.2%) | 74.0% (70.1%–77.6%) | 94.2% (91.2%–96.6%) | 69.2% (64.7%–73.4%) |
| Reducto | 70.1% (66.0%–73.9%) | 73.3% (68.7%–77.7%) | 85.3% (80.9%–89.4%) | 66.1% (61.5%–70.7%) |
| LlamaParse | 69.1% (63.9%–74.6%) | 62.4% (57.5%–67.5%) | 61.3% (54.9%–67.3%) | 52.0% (46.6%–57.4%) |
| AWS Textract LAYOUT | 60.7% (55.3%–66.1%) | 70.5% (65.0%–75.7%) | 81.3% (76.1%–86.3%) | 68.8% (65.3%–72.5%) |
| Unstructured | 59.1% (53.3%–65.1%) | 52.8% (46.1%–59.4%) | 63.9% (57.4%–70.4%) | 68.6% (63.9%–73.2%) |
Speed and coverage across 12k pages
The larger consensus benchmark asks a different question: across 106 business PDFs and 11,448 pages, who stays fast while remaining competitive with the provider-agreement reference? Corpus mix: financial reports, compliance audits, legal contracts, healthcare and insurance packets, government forms, technical specs, manuals, slides, image-heavy pages, multilingual docs, and scanned/OCR cases.
Consensus F1 is provider agreement against a majority-provider pseudo-reference, not human-labeled accuracy. The gold-set table above is the primary accuracy claim; this section shows speed and coverage depth on the broad corpus.
Grounded accuracy. For each labeled region we gather the predicted boxes that overlap its location (page-sized dumps are excluded), stitch their text in reading order, and measure how much of the region's text is recoverable in that local context: full credit when the text is contained, otherwise the score of the best-matching window. The metric averages this recall over all regions, and a region with no overlapping prediction scores 0. Extra neighboring text from a coarse box is not penalized, so word-, line-, and paragraph-box outputs are scored fairly.
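The region-level recall described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's scorer, and several details are assumptions: boxes are page-normalized `(x0, y0, x1, y1)` tuples, a "page-sized dump" is any box covering at least 90% of the page (the hypothetical `max_area` parameter), reading order is approximated by sorting top-to-bottom then left-to-right, and the best-matching window is scored with `difflib.SequenceMatcher.ratio()` as a stand-in for whatever similarity the benchmark actually uses.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), page-normalized
Item = Tuple[Box, str]                   # a box and the text inside it


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlaps(a: Box, b: Box) -> bool:
    # True when the two rectangles share positive area.
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def region_recall(region: Item, preds: List[Item], max_area: float = 0.9) -> float:
    region_box, region_text = region
    # Keep overlapping predictions, dropping page-sized dumps (assumed threshold).
    local = [p for p in preds if area(p[0]) < max_area and overlaps(p[0], region_box)]
    if not local:
        return 0.0  # no overlapping prediction scores 0
    # Approximate reading order: top-to-bottom, then left-to-right.
    local.sort(key=lambda p: (p[0][1], p[0][0]))
    context = " ".join(text for _, text in local)
    if region_text in context:
        return 1.0  # contained: extra neighboring text is not penalized
    # Otherwise slide a window of the region's length over the context
    # and keep the best similarity (assumed scoring function).
    width = len(region_text)
    best = 0.0
    for start in range(max(1, len(context) - width + 1)):
        window = context[start:start + width]
        score = SequenceMatcher(None, region_text, window, autojunk=False).ratio()
        best = max(best, score)
    return best


def grounded_accuracy(regions: List[Item], preds: List[Item]) -> float:
    # Mean recall over all labeled regions on a page.
    return sum(region_recall(r, preds) for r in regions) / len(regions)
```

Because containment earns full credit regardless of surrounding text, a provider that emits one paragraph-level box is not penalized relative to one that emits word-level boxes, which matches the fairness goal stated above.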
Text accuracy. Character fidelity over the whole page in each provider's emitted reading order: $1 - \mathrm{NED}$, where $\mathrm{NED}$ is the Levenshtein distance between prediction and ground truth divided by the longer of the two lengths. Markdown, case, and Textract LAYOUT table scaffolding are normalized first.
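The character-fidelity score described above, one minus the Levenshtein distance over the longer length, could be computed along these lines. This is a sketch only: the function names are illustrative, and the normalization step (markdown, case, table scaffolding) is left out, so inputs are assumed to be normalized already.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def text_accuracy(pred: str, truth: str) -> float:
    """1 - distance / longer length; two empty strings count as a perfect match."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest
```

For example, `text_accuracy("kitten", "sitting")` has an edit distance of 3 over a longer length of 7, giving roughly 0.571. Dividing by the longer length keeps the score in the range 0 to 1 even when a provider emits far more text than the ground truth contains.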
Word-overlap F1. Bag-of-words overlap, independent of order: the harmonic mean of word precision $P$ and recall $R$, $F_1 = 2PR / (P + R)$. It surfaces which words were captured versus dropped or hallucinated.
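The bag-of-words score above can be illustrated with a short sketch. Tokenization here is an assumption (lowercasing and whitespace splitting); the benchmark's own tokenizer and normalization may differ.

```python
from collections import Counter


def word_f1(pred: str, truth: str) -> float:
    """Harmonic mean of word precision and recall over multisets of words."""
    pred_counts = Counter(pred.lower().split())
    truth_counts = Counter(truth.lower().split())
    # Multiset intersection: each word counts at most as often as it
    # appears on both sides, so repeated hallucinations are not rewarded.
    overlap = sum((pred_counts & truth_counts).values())
    if overlap == 0:
        return 0.0  # also covers the case where either side is empty
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(truth_counts.values())
    return 2 * precision * recall / (precision + recall)
```

With a prediction of `"total due 42"` against a ground truth of `"total amount due 42"`, precision is 3/3 and recall is 3/4, so the score is about 0.857: the dropped word lowers recall while precision stays perfect.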
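Layout IoU. How well the predicted boxes cover the real content blocks: intersection-over-union of the predicted ($M_{\text{pred}}$) and verified ($M_{\text{ver}}$) box masks rasterized on a 1000×1000 grid, $\mathrm{IoU} = |M_{\text{pred}} \cap M_{\text{ver}}| / |M_{\text{pred}} \cup M_{\text{ver}}|$, independent of the text inside.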
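A rasterized mask IoU of the kind described above might look like the following sketch. It assumes boxes are given as `(x0, y0, x1, y1)` tuples normalized to the page (values in 0 to 1), rounds box edges to the nearest grid cell, and returns 1.0 when both masks are empty; all three choices are assumptions rather than details from the benchmark.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), page-normalized


def rasterize(boxes: Iterable[Box], grid: int = 1000) -> bytearray:
    """Paint the union of the boxes onto a grid x grid binary mask."""
    def to_cell(v: float) -> int:
        return max(0, min(grid, round(v * grid)))

    mask = bytearray(grid * grid)
    for x0, y0, x1, y1 in boxes:
        c0, c1 = to_cell(x0), to_cell(x1)
        r0, r1 = to_cell(y0), to_cell(y1)
        if c1 <= c0:
            continue  # box is narrower than one cell
        for r in range(r0, r1):
            mask[r * grid + c0 : r * grid + c1] = b"\x01" * (c1 - c0)
    return mask


def layout_iou(pred: Iterable[Box], verified: Iterable[Box], grid: int = 1000) -> float:
    p = rasterize(pred, grid)
    v = rasterize(verified, grid)
    inter = sum(a & b for a, b in zip(p, v))
    union = sum(a | b for a, b in zip(p, v))
    return inter / union if union else 1.0  # assumed convention for empty pages
```

Rasterizing the union of boxes first means overlapping predictions are not double-counted, so a provider that emits many small nested boxes is compared on covered area alone.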
All 130 pages are real documents spanning seven types: forms & invoices, tables & financial, multi-column, multilingual, image-heavy, presentations, and scanned & degraded. They are drawn from public OCR datasets (CORD, SROIE), arXiv pages, lecture slides, SEC 10-K filings, and synthetic multilingual renders. Every page was human-verified region by region against its source. Each score is a document-level mean with a 95% confidence interval from bootstrap resampling over pages. Test pages are held out from training; synthetic renders in the test set are generated separately from training data.
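The confidence intervals above come from bootstrap resampling over pages. One common way to do that is the percentile bootstrap, sketched below; the choice of the percentile method, the number of resamples, and the fixed seed are assumptions, since the text does not say which variant was used.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    page_scores: Sequence[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over pages."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(page_scores)
    # Resample pages with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(page_scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return mean(page_scores), resampled_means[lo_idx], resampled_means[hi_idx]
```

Resampling whole pages, rather than individual regions, keeps the regions of one page together, so the interval reflects page-to-page variation in difficulty.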