
Full 107-document corpus breakdown

Supporting data for extract-bench across 107 business PDFs and 12,432 pages. This page measures speed, coverage, and provider-agreement consensus; consensus F1 is not human-labeled accuracy, so the 130-page gold set remains the primary accuracy claim.

Extract latency: 11.5s p50 wall-clock (rank #1)

Extract coverage: 97.2%, 103/106 documents in selected tags (rank #1)

speed vs consensus

Upper-left is better: faster latency with higher consensus against the majority-provider reference.

[Scatter plot. Horizontal axis: latency, faster to slower. Legend: Extract, other qualified provider.]
provider table

Exact measurements for the selected documents.

provider | mode | success | consensus F1 | p50 latency | pages/sec | bbox
Extract | default | 103/106 | 94.3% | 11.5s | 1.4 | 100%
AWS Textract | LAYOUT + TABLES | 95/106 | 99.5% | 18.7s | 0.8 | 100%
LlamaParse | premium | 95/106 | 92.1% | 25.2s | 1.4 | 95.8%
Extend | Parse 2.0 | 94/106 | 96.1% | 28.1s | 0.7 | 100%
Azure DI | prebuilt-layout | 95/106 | 99.3% | 29.1s | 0.9 | 100%
Reducto | standard parse | 95/106 | 98.9% | 38s | 0.6 | 100%
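
The qualification rule from the methodology (at least 85% document success and non-zero bbox coverage) can be checked against the table rows with a short script. This is a minimal sketch, not the benchmark's own code: the `ProviderRow` type, the function names, and the choice to order qualifying providers by p50 latency are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderRow:
    """One row of the provider table (hypothetical container type)."""
    name: str
    succeeded: int        # documents parsed successfully
    total: int            # documents in the selected view
    p50_latency_s: float  # median wall-clock latency in seconds
    bbox_coverage: float  # fraction of output with source bboxes, 0..1

    @property
    def success_rate(self) -> float:
        return self.succeeded / self.total


# Values copied from the provider table above.
ROWS = [
    ProviderRow("Extract", 103, 106, 11.5, 1.0),
    ProviderRow("AWS Textract", 95, 106, 18.7, 1.0),
    ProviderRow("LlamaParse", 95, 106, 25.2, 0.958),
    ProviderRow("Extend", 94, 106, 28.1, 1.0),
    ProviderRow("Azure DI", 95, 106, 29.1, 1.0),
    ProviderRow("Reducto", 95, 106, 38.0, 1.0),
]

MIN_SUCCESS_RATE = 0.85  # threshold stated in the methodology


def qualifies(row: ProviderRow) -> bool:
    """Apply the stated rule: >= 85% success and non-zero bbox coverage."""
    return row.success_rate >= MIN_SUCCESS_RATE and row.bbox_coverage > 0.0


def default_ranking(rows: list) -> list:
    """Qualifying rows, fastest first (the sort key is an assumption)."""
    return sorted((r for r in rows if qualifies(r)),
                  key=lambda r: r.p50_latency_s)


if __name__ == "__main__":
    for row in default_ranking(ROWS):
        print(f"{row.name}: {row.success_rate:.1%} success, "
              f"{row.p50_latency_s}s p50")
```

With these inputs every listed provider clears the 85% bar, and the 103/106 success count for Extract rounds to the 97.2% coverage figure shown at the top of the page.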
documents in this view (106 docs)

document | pages | class | tags
Executive memo | 3 | tiny | core, tiny, born-digital, synthetic
Cover letter | 1 | tiny | tiny, synthetic, forms
Invoice (1pg) | 1 | tiny | tiny, synthetic, forms, tables
Thermal receipt | 1 | tiny | rasterized, tiny, scanned, synthetic
Business card | 1 | tiny | rasterized, tiny, scanned, synthetic
USPS shipping label | 1 | tiny | rasterized, tiny, synthetic, forms
Meeting notes scan | 2 | tiny | rasterized, tiny, scanned, synthetic
Event flyer | 1 | tiny | tiny, synthetic, magazines
Attention Is All You Need | 15 | academic | core, tables, two-column, public
Curiosity-driven exploration | 12 | academic | core, two-column, public
Segment Anything | 30 | academic | public, two-column
COVID-19 paper (CORD-19 / arxiv) | 12 | academic | public, tables
Oncology / drug-discovery preprint | 26 | academic | public, tables
Long-form CS preprint (LoRA / arxiv) | 26 | academic | public, long-form
CLIP preprint (CS, two-column) | 48 | academic | public, two-column
math.AG preprint | 14 | academic | public, math
Physics preprint | 24 | academic | public, two-column
Open chemistry textbook chapter | 1393 | academic | public, long-form
96-page financial PDF | 96 | financial | core, large, tables
Mixed compliance packet (mid) | 123 | financial | core, large, tables
Pfizer 10-K | 121 | financial | large, tables, public
HPE 10-K | 176 | financial | large, tables, public
Apple FY2025 10-K | 95 | financial | large, tables, public
JPMorgan Chase FY2025 10-K | 658 | financial | large, tables, public
Tesla Q1 10-Q | 62 | financial | tables, public
Walmart Q3 10-Q | 62 | financial | tables, public
Berkshire 2024 annual letter | 15 | financial | public
Microsoft FY25 proxy statement | 199 | financial | large, tables, public
Coinbase S-1 (2021) | 335 | financial | large, tables, public
NIST Cybersecurity Framework v2.0 | 32 | compliance | public
HHS HIPAA Privacy Rule | 25 | compliance | public
GDPR official text | 88 | compliance | public, long-form
NIST SP 800-66r2 HIPAA Security Rule (PCI-DSS substitute) | 122 | compliance | large, public
FRA safety regulation | 4 | compliance | public
SaaS Master Service Agreement | 40 | legal | forms, long-form, synthetic
Mutual NDA | 5 | legal | synthetic
Executive employment agreement | 25 | legal | synthetic, long-form
Mastodon instance ToS | 1 | legal | public
Synthetic ToS | 25 | legal | synthetic, long-form
SCOTUS slip opinion | 114 | legal | public, long-form
Congressional bill | 973 | legal | public, long-form
Sample warranty deed | 8 | legal | forms, synthetic
Clinical trial protocol | 89 | healthcare | long-form
FDA drug label | 7 | healthcare | public, tables
FDA warning letter | 16 | healthcare | public
CMS-1500 medical claim form | 4 | healthcare | public, forms
Sanitized EHR export | 20 | healthcare | synthetic, tables
PMC OA clinical trial report | 2 | healthcare | public, tables
CDC MMWR public-health bulletin | 5 | healthcare | public, tables
CMS Medicare Provider Manual chapter | 323 | healthcare | public, long-form
Clinical lab result | 4 | healthcare | synthetic, tables
Auto insurance policy | 40 | insurance | synthetic, forms, long-form
Homeowners declaration page | 6 | insurance | synthetic, forms
Life insurance contract | 50 | insurance | synthetic, long-form
Auto claim form (filled) | 4 | insurance | synthetic, forms
Dental EOB | 2 | insurance | synthetic, tables
Pharmacy EOB | 2 | insurance | synthetic, tables
Long-term disability claim packet | 30 | insurance | synthetic, forms
Medicare benefits explanation | 4 | insurance | synthetic, tables
IRS 1040 instructions scan | 123 | government | core, rasterized, scanned, large, tables
IRS 1040 1990 form | 64 | government | rasterized, scanned, large
IRS W-2 | 11 | government | public, forms
USCIS I-9 | 4 | government | public, forms
IRS 1099-MISC | 6 | government | public, forms
IRS Schedule C | 2 | government | public, forms
FBI FOIA response packet | 60 | government | rasterized, public, scanned, long-form
USCIS I-130 petition | 12 | government | public, forms
USCIS Request for Evidence | 10 | government | synthetic, forms
FCC rulemaking notice | 4 | government | public, long-form
RFC 8446 (TLS 1.3) | 160 | technical | large, public
RFC 7540 (HTTP/2) | 96 | technical | large, public
RFC 5415 (CAPWAP) | 155 | technical | large, public
RFC 9000 (QUIC transport protocol) | 151 | technical | large, public
PostgreSQL 16 manual chapter | 3033 | technical | public, long-form
NVMe public spec excerpt | 458 | technical | large, public
Eaton UPS manual | 74 | manuals | long-form
FAA airworthiness directive packet | 1 | manuals | public
OSHA safety bulletin | 35 | manuals | public
FAA Airplane Flying Handbook excerpt | 406 | manuals | public, long-form
Lecture slide handout | 38 | presentations | core, slides, tables
Scanned lecture handout | 38 | presentations | core, rasterized, slides, scanned
KubeCon keynote slides | 60 | presentations | public, slides
Board meeting deck | 25 | presentations | synthetic, slides
Sales enablement deck | 30 | presentations | synthetic, slides
Corporate training deck | 50 | presentations | synthetic, slides, long-form
Research conference poster | 3 | presentations | public, slides, image-heavy
Public conference keynote | 25 | presentations | public, slides
Image-heavy PDF | 77 | image-heavy | core, image-heavy, large
NASA Earth Observatory feature | 26 | image-heavy | public, image-heavy, magazines
Marketing brochure | 12 | image-heavy | synthetic, image-heavy, magazines
Real-estate brochure (mockup) | 20 | image-heavy | synthetic, image-heavy, magazines
Smithsonian Open Access catalog | 80 | image-heavy | public, image-heavy, magazines
NPS Yellowstone brochure | 100 | image-heavy | public, image-heavy, magazines, large
Library of Congress photography collection | 30 | image-heavy | public, image-heavy
USGS scientific publication | 54 | image-heavy | public, image-heavy, tables
German employment contract | 40 | multilingual | multilingual, synthetic
IRS Form 1040(SP) Spanish | 162 | multilingual | multilingual, public, long-form
Arabic news magazine layout (RTL) | 12 | multilingual | multilingual, synthetic, magazines
Unicode edge-case PDF | 20 | regression | core, unicode, pathological
Synthetic regression case | 10 | regression | core, synthetic, pathological
Mixed-rotation text | 8 | regression | synthetic, pathological
Malformed xref | 6 | regression | synthetic, pathological
Tiny-font footnote stress | 12 | regression | synthetic, pathological
Vertical-text CJK | 6 | regression | synthetic, pathological, multilingual
Multi-column rotated | 8 | regression | synthetic, pathological
Low-contrast text | 8 | regression | synthetic, pathological

Rasterized documents ship as page-image PDFs with no text layer. They test the OCR / layout path rather than text-layer extraction, so they are treated as a sibling class to the scanned-document set, not as equivalent to born-digital business PDFs.

methodology
  • 106 documents across representative classes.
  • Default rankings require at least 85% document success and non-zero source bbox coverage.
  • The corpus spans tiny born-digital, academic, financial filings, compliance / regulatory, legal, healthcare, insurance, government forms, technical specs, manuals, slides, image-heavy / magazines, multilingual, regression cases, and DocLayNet rasterized layouts.
  • DocLayNet documents are stitched from rasterized page images (no text layer), so they test the OCR / layout path, not text-layer extraction. The page tags them is_rasterized.
  • Consensus F1 is computed against tokens emitted by a majority of successful providers for each document.
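
The consensus F1 definition in the last bullet can be sketched as follows. This is a rough illustration under stated assumptions, not the benchmark's implementation: tokenization is taken to be lowercase whitespace splitting, tokens are compared as sets rather than counted multisets, and the provider being scored is assumed to take part in the majority vote. All function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, deduplicate."""
    return set(text.lower().split())


def majority_reference(outputs):
    """Tokens emitted by more than half of the successful providers.

    `outputs` maps provider name -> extracted text for one document and
    should contain only providers that succeeded on that document.
    """
    votes = Counter()
    for text in outputs.values():
        votes.update(tokenize(text))  # each provider votes once per token
    return {tok for tok, n in votes.items() if n > len(outputs) / 2}


def consensus_f1(provider_tokens, reference):
    """Harmonic mean of precision and recall against the reference set."""
    overlap = len(provider_tokens & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(provider_tokens)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with three hypothetical providers on one document.
outputs = {
    "a": "Total due 42",
    "b": "total due 42",
    "c": "total dve 42 page",
}
reference = majority_reference(outputs)
scores = {name: consensus_f1(tokenize(text), reference)
          for name, text in outputs.items()}
```

In the toy example the reference is {total, due, 42}. Provider "c" shares the garbled token "dve" with no one, so it is scored against a reference it only partly matches. The same mechanism explains why consensus F1 is not human-labeled accuracy: a token that only one provider extracts correctly is excluded from the reference and counts against that provider's precision.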