# Extract

Parse documents into structured data

Text, tables, and figures in one call. Best text accuracy, at least 2x faster than other parsers, and $3 per 1,000 pages.

## Base URL

`https://api.extract.page`

## Auth

Send your key on every request:

```
X-API-KEY: <your_key>
```

Grab one from the dashboard at https://extract.page/dashboard after signup. Free tier is 1,000 pages on signup with no card required.

## Endpoints

### POST /v1/extract (hosted URL)

JSON body with a `url` pointing at a document already on the public internet.

```bash
$ curl https://api.extract.page/v1/extract \
    -H "X-API-KEY: $EXTRACT_KEY" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://cdn.extract.page/demo/overview-of-computer-science.pdf"}'

{
  "chunks": [
    { "page_content": "Attention Is All You Need", "page_no": 1, "bbox": [90.0, 94.0, 505.2, 118.4] },
    { "page_content": "Ashish Vaswani",            "page_no": 1, "bbox": [108.0, 132.0, 198.3, 143.1] },
    { "page_content": "Noam Shazeer",              "page_no": 1, "bbox": [210.0, 132.0, 292.1, 143.1] }
  ]
}
```

### POST /v1/extract/file (upload)

Multipart upload when the bytes are in memory or on disk.

```bash
$ curl https://api.extract.page/v1/extract/file \
    -H "X-API-KEY: $EXTRACT_KEY" \
    -F "file=@paper.pdf"

{
  "chunks": [
    { "page_content": "Attention Is All You Need", "page_no": 1, "bbox": [90.0, 94.0, 505.2, 118.4] },
    { "page_content": "Ashish Vaswani",            "page_no": 1, "bbox": [108.0, 132.0, 198.3, 143.1] },
    { "page_content": "Noam Shazeer",              "page_no": 1, "bbox": [210.0, 132.0, 292.1, 143.1] }
  ]
}
```

### POST /v1/extract/schema (structured fields)

Pass a JSON schema alongside a `url` (or use `POST /v1/extract/schema/file` for uploads) to pull typed fields — invoice totals, claim numbers, form values — straight out of a document instead of post-processing chunks yourself. See the schema-extraction guide at https://docs.extract.page.
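
As a rough sketch of what a schema request body might look like: the JSON Schema below is standard, but the invoice field names are made up for illustration, and the `"schema"` key in the request body is an assumption rather than a documented name. The schema-extraction guide has the real contract.

```python
import json

# Standard JSON Schema describing the fields to pull from an invoice.
# These field names are illustrative; they are not part of the Extract API.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
    },
    "required": ["invoice_number", "total"],
}


def build_schema_request(url: str, schema: dict) -> dict:
    """Build a JSON body for POST /v1/extract/schema.

    ASSUMPTION: the schema travels under a "schema" key next to "url".
    Confirm the actual key name in the schema-extraction guide.
    """
    return {"url": url, "schema": schema}


body = build_schema_request("https://example.com/invoice.pdf", INVOICE_SCHEMA)
payload = json.dumps(body)  # string to send as the request body
```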

### Async batch (large jobs)

For bulk workloads, reserve file slots with `POST /v1/files`, submit them as one job with `POST /v1/batches`, then poll `GET /v1/batches/{id}`. Handles up to 1,000,000 pages with webhook delivery on completion. See the batch guide at https://docs.extract.page.
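
The polling step could be sketched as below. The response shape of `GET /v1/batches/{id}` is not documented here, so the `"status"` field and its terminal values are assumptions, and `fetch_status` is a hypothetical callable you supply (for example, a function that issues the GET and returns the parsed JSON).

```python
import time
from typing import Callable

# ASSUMPTION: terminal values of the batch "status" field; check the batch guide.
TERMINAL_STATES = {"completed", "failed"}


def poll_batch(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the batch reaches a terminal state.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        batch = fetch_status()
        if batch.get("status") in TERMINAL_STATES:
            return batch
        if time.monotonic() >= deadline:
            raise TimeoutError("batch did not finish before the deadline")
        sleep(interval_s)
```

Since batches also deliver a webhook on completion, polling like this is mainly useful as a fallback or in scripts that cannot receive webhooks.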

## Quickstart

### Python (URL)

```python
import os

import requests

EXTRACT_KEY = os.environ["EXTRACT_KEY"]  # API key, read from the environment

res = requests.post(
    "https://api.extract.page/v1/extract",
    headers={"X-API-KEY": EXTRACT_KEY},
    json={"url": "https://cdn.extract.page/demo/overview-of-computer-science.pdf"},
).json()

# res["chunks"][0]
# { "page_content": "Attention Is All You Need", "page_no": 1, "bbox": [90.0, 94.0, 505.2, 118.4] }
```

### Python (upload)

```python
import os

import requests

EXTRACT_KEY = os.environ["EXTRACT_KEY"]  # API key, read from the environment

with open("paper.pdf", "rb") as f:
    res = requests.post(
        "https://api.extract.page/v1/extract/file",
        headers={"X-API-KEY": EXTRACT_KEY},
        files={"file": f},
    ).json()

# res["chunks"][0]
# { "page_content": "Attention Is All You Need", "page_no": 1, "bbox": [90.0, 94.0, 505.2, 118.4] }
```

### TypeScript (URL)

```ts
const res = await fetch("https://api.extract.page/v1/extract", {
  method: "POST",
  headers: {
    "X-API-KEY": process.env.EXTRACT_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://cdn.extract.page/demo/overview-of-computer-science.pdf" }),
}).then((r) => r.json());

// res.chunks[0]
// { page_content: "Attention Is All You Need", page_no: 1, bbox: [90.0, 94.0, 505.2, 118.4] }
```

### TypeScript (upload)

```ts
import { readFile } from "node:fs/promises";

const form = new FormData();
form.append("file", new Blob([await readFile("paper.pdf")]), "paper.pdf");

const res = await fetch("https://api.extract.page/v1/extract/file", {
  method: "POST",
  headers: { "X-API-KEY": process.env.EXTRACT_KEY! },
  body: form,
}).then((r) => r.json());

// res.chunks[0]
// { page_content: "Attention Is All You Need", page_no: 1, bbox: [90.0, 94.0, 505.2, 118.4] }
```

## Response shape

A list of chunks. Each chunk carries:

- `page_content` — the extracted text span
- `page_no` — 1-indexed page number
- `bbox` — [x0, y0, x1, y1] in PDF points
- `image_url` — present on image chunks, points at our object store
- `confidence` — per-span confidence score
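
A minimal sketch of a typed wrapper for one chunk, using the fields listed above. The `Chunk` class and `bbox_pixels` helper are illustrative names, not part of any SDK. `confidence` and `image_url` are treated as optional because the sample responses in this document omit them. The pixel conversion relies only on the fact that a PDF point is 1/72 inch; it does not assume where the coordinate origin sits.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    page_content: str
    page_no: int  # 1-indexed
    bbox: Tuple[float, float, float, float]  # [x0, y0, x1, y1] in PDF points
    confidence: Optional[float] = None
    image_url: Optional[str] = None

    @classmethod
    def from_dict(cls, d: dict) -> "Chunk":
        """Build a Chunk from one element of the response's "chunks" list."""
        return cls(
            page_content=d["page_content"],
            page_no=d["page_no"],
            bbox=tuple(d["bbox"]),
            confidence=d.get("confidence"),
            image_url=d.get("image_url"),
        )

    def bbox_pixels(self, dpi: int = 144) -> Tuple[int, ...]:
        """Scale the bbox from PDF points (1/72 inch) to pixels at the given DPI."""
        scale = dpi / 72.0
        return tuple(round(v * scale) for v in self.bbox)


chunk = Chunk.from_dict(
    {"page_content": "Attention Is All You Need", "page_no": 1, "bbox": [90.0, 94.0, 505.2, 118.4]}
)
```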

## Limits (synchronous)

- 500 pages per request
- 150 MB per request

For larger jobs, the async endpoint handles batch jobs up to 1M pages with webhook delivery on completion. Email hello@extract.page or book a call for access.
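
A small pre-flight sketch for staying inside the synchronous limits. The helper names are hypothetical, and reading "150 MB" as 150 × 10^6 bytes is an assumption (it is the stricter of the two common interpretations).

```python
from typing import List, Tuple

MAX_PAGES = 500
MAX_BYTES = 150 * 1000 * 1000  # ASSUMPTION: "MB" = 10^6 bytes, the stricter reading


def fits_sync(size_bytes: int, page_count: int) -> bool:
    """True if a document fits in a single synchronous request."""
    return size_bytes <= MAX_BYTES and page_count <= MAX_PAGES


def page_ranges(total_pages: int, limit: int = MAX_PAGES) -> List[Tuple[int, int]]:
    """Split 1..total_pages into 1-indexed inclusive ranges of at most `limit` pages."""
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]
```

For example, `page_ranges(1200)` yields three ranges: pages 1 to 500, 501 to 1000, and 1001 to 1200. Actually cutting the file into those ranges requires a PDF library, which is outside this sketch.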

## Typical latency

A 15-page born-digital PDF returns in ~400ms, a 96-page financial report in ~13s, and a 492-page technical spec in ~22s. Scanned pages take longer because OCR runs inline. See the Benchmarks section for the full suite.

## Trust and data handling

| category   | answer              |
|------------|---------------------|
| compliance | HIPAA + BAA         |
| training   | Never               |
| retention  | Dropped on response |

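We don't train on customer data. Source documents are processed in memory and dropped as soon as the response returns. Extracted images are the exception: they are uploaded to our object store so you can fetch them via `image_url`.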

## Proven at scale

70,000,000+ pages processed. Built from [YouLearn](https://youlearn.ai)'s production document pipeline before becoming an API.

## Errors

Every error returns JSON in the shape `{ "detail": "<message>" }`. Correlate with support via the `X-Request-Id` response header.

| status | meaning                                    | retry?         |
|--------|--------------------------------------------|----------------|
| 400    | malformed request body                     | no, fix input  |
| 401    | missing or invalid API key                 | no             |
| 402    | balance exhausted — top up in dashboard    | no, top up     |
| 413    | payload exceeds 150 MB                     | no, split      |
| 422    | unprocessable (e.g., unreadable file)      | no             |
| 429    | rate limited                               | yes, backoff   |
| 500    | server error                               | yes, backoff   |
| 503    | temporarily unavailable                    | yes, backoff   |

Retries are safe for correctness because extraction is stateless. Billing is the caveat: every successful request counts as usage, so a retry after a 5xx may double-bill if the first request actually succeeded server-side. Check your usage dashboard if your bill looks higher than expected after retries.
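
The retry column above can be sketched as exponential backoff with jitter. This is one reasonable client-side policy, not a prescribed one: the attempt count and delays are arbitrary defaults, and `send` is a hypothetical callable that performs one request and returns `(status_code, parsed_json)`.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500, 503}  # statuses the table marks "yes, backoff"


class ExtractError(Exception):
    """Raised for non-retryable errors or once retries are exhausted."""

    def __init__(self, status: int, detail: str):
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run send() until it succeeds, retrying only retryable statuses."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE or last_attempt:
            # Error bodies have the shape {"detail": "<message>"}.
            raise ExtractError(status, body.get("detail", ""))
        delay = base_delay_s * (2 ** attempt)
        sleep(delay + random.uniform(0, delay))  # exponential backoff plus jitter
    raise ValueError("max_attempts must be at least 1")
```

With `requests`, `send` can be a small function that posts to `/v1/extract` and returns `(res.status_code, res.json())`. Keeping `max_attempts` low limits the double-billing exposure described above.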

## Pricing

### Free — $0

1,000 pages · no card · lifetime.

- Full API access, no rate gates
- Self-serve dashboard with usage
- Email support
- $3 / 1,000 pages after free credit

### Custom — Enterprise

For teams with higher workloads · volume discounts · SLAs.

- Dedicated region + private networking
- HIPAA + BAA available
- Slack channel with engineering
- Production SLAs and priority queues
- Async batch jobs up to 1M pages with webhook delivery

Email hello@extract.page.

Top up from the dashboard in $10, $30, $100, or $500 increments, or any custom amount. Existing keys work again the moment a top-up lands.
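
For budgeting, the $3 per 1,000 pages rate and the 1,000-page free credit translate into a simple estimate. The sketch below assumes usage is billed pro rata per page rather than in whole 1,000-page blocks, which this document does not state; `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

FREE_PAGES = 1_000
PRICE_PER_1000_PAGES = Decimal("3.00")  # USD


def estimate_cost(pages: int, free_remaining: int = FREE_PAGES) -> Decimal:
    """Estimated USD cost for `pages`, after any remaining free credit.

    ASSUMPTION: billing is pro rata per page.
    """
    billable = max(0, pages - free_remaining)
    return (PRICE_PER_1000_PAGES * billable / 1000).quantize(Decimal("0.01"))
```

Under that assumption, a first run of 25,000 pages on a new account comes to 24,000 billable pages, or $72.00.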

## Benchmarks

The landing page presents three benchmark views: a capability matrix, an accuracy benchmark, and a speed benchmark.

### Capability matrix

Where Extract leads on completeness: source-grounded spans, bounding boxes, per-span confidence, and mixed-format (PDF, PPTX, DOCX) input in one call.

| capability          | Extract | aws textract | llamaparse   | reducto |
|---------------------|---------|--------------|--------------|---------|
| text extraction     | yes     | yes          | yes          | yes     |
| text accuracy       | 81.9%   | 60.7%        | 69.1%        | 70.1%   |
| per-span bbox       | yes     | yes          | no           | yes     |
| per-span confidence | yes     | no           | no           | yes     |
| OCR                 | yes     | yes          | premium only | yes     |
| pptx / docx input   | yes     | no           | no           | no      |
| markdown output     | no      | no           | yes          | yes     |

LlamaParse OCR caveat: only in premium mode, with higher latency and credit cost than fast mode.

### Accuracy benchmark

81.9% text accuracy on 130 human-labeled gold pages across 7 document types, scored character-level — the highest of six providers. Extract also leads on word F1 (84.5%).

| provider     | text accuracy | word F1 |
|--------------|---------------|---------|
| Extract      | 81.9%         | 84.5%   |
| extend       | 74.8%         | 74.0%   |
| reducto      | 70.1%         | 73.3%   |
| llamaparse   | 69.1%         | 62.4%   |
| aws textract | 60.7%         | 70.5%   |
| unstructured | 59.1%         | 52.8%   |
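
For readers unfamiliar with the metric, here is one common bag-of-words definition of word F1. It is offered only as an illustration: the exact tokenization and normalization behind the numbers above are not specified in this document, so this sketch will not necessarily reproduce them.

```python
from collections import Counter


def word_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of word-level precision and recall (whitespace tokens)."""
    pred_tokens = predicted.split()
    gold_tokens = gold.split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a perfect match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: each gold word can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```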

### Speed benchmark

Lowest median per-document latency of every hosted provider tested, measured across the speed & coverage corpus (median of three runs each). Tested on born-digital papers, financial reports, scanned forms, image-heavy decks, multi-column layouts, large technical specs, and adversarial layouts. See "Typical latency" above for concrete per-document numbers.

## Confidence review

Review the uncertain. Skip the rest.

Sorting by per-span confidence puts uncertain OCR text first, so even on a 200k-page run reviewers see only the spans that need attention. The landing page demo shows a synthetic claim form with low-confidence OCR spans flagged first, including each field's bbox citation.
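
A minimal sketch of that review queue over the `chunks` list. It assumes `confidence` is a number where higher means more certain; the 0.9 threshold is an arbitrary example, and treating a missing score as 0.0 (review first) is a choice made here, not documented behavior.

```python
from typing import Dict, List


def review_queue(chunks: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Return chunks below `threshold`, least confident first."""

    def score(chunk: Dict) -> float:
        # ASSUMPTION: a chunk without a confidence score should be reviewed first.
        return chunk.get("confidence", 0.0)

    flagged = [c for c in chunks if score(c) < threshold]
    return sorted(flagged, key=score)


sample = [
    {"page_content": "Claim no. 48213", "page_no": 1, "confidence": 0.99},
    {"page_content": "Cla1m date", "page_no": 1, "confidence": 0.42},
    {"page_content": "Policy holder", "page_no": 2, "confidence": 0.80},
]
queue = review_queue(sample)
```

Each entry in the queue still carries its `page_no` and `bbox`, so a reviewer can jump straight to the region on the source page.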

## Custom benchmark

Send us your docs. We'll show you how it performs on yours, not ours.

Book a benchmark call to run the same eval on a representative corpus.

## FAQ

**Do you offer a BAA and HIPAA compliance?**
Yes. HIPAA + BAA is available on request. We have signed BAAs with healthcare customers in production. Talk to us about your compliance requirements.

**Do you store my documents?**
Not beyond the request. We don't train on your data, and source documents are processed in memory and dropped as soon as the response returns. The one exception is extracted images, which are uploaded to our object store so you can fetch them via `image_url`. The custom tier supports customer-managed encryption, configurable retention, and dedicated regions.

**What file types can I send?**
PDF, PPTX, and DOCX. Scanned PDFs are handled automatically. OCR runs inline with no separate surcharge.

**What does the response look like?**
A list of chunks. Each chunk carries `page_content`, `page_no`, a `bbox` in PDF points, and a per-span `confidence` score. Image chunks include an `image_url` to the rendered region.

**Can I run a benchmark on my own documents?**
Yes. Send us 20-50 representative documents, or bring them to your benchmark call and we'll run the eval live. Results will be back within a few days. Healthcare and other regulated docs are handled under BAA on a private pipeline. Book a benchmark call.

**How does your pricing compare to AWS Textract or Reducto?**
Our pricing is $3 per 1,000 pages, all-in. No credits, no seat fees, no monthly minimums. Compared to providers that price by credit (Reducto, LlamaParse) or by operation type (Textract, Azure DI), most teams find we're cheaper in total spend once you account for tables, forms, and OCR. We're happy to scope your monthly volume on a benchmark call.

**What are the hard limits per request?**
500 pages and 150 MB per synchronous request. For larger jobs, we have an async endpoint that handles batch jobs up to 1M pages with webhook delivery on completion. Email hello@extract.page or book a call for access.

**What throughput can you handle? Is there a rate limit?**
Pages within a request are parsed concurrently, so latency grows much more slowly than page count: a 96-page report returns in ~13s and a 492-page document in ~22s. Across requests the platform autoscales to absorb bursts. Standard accounts have a default request rate that we raise for high-volume customers; the custom tier adds dedicated capacity with committed throughput and an uptime SLA. We'll size your target throughput on a benchmark call.

**Can I self-host or deploy in a VPC?**
Available on the custom tier. Dedicated regions, private networking, and on-prem options are available for teams with strict security or data residency requirements.

**Do you offer SLAs or dedicated capacity?**
Yes. Available on the custom tier: negotiated rate per page, dedicated regions, private networking, production SLAs, and a Slack channel with the engineering team.

**What happens when I run out of balance?**
The API returns 402 when your balance is exhausted. Top up from the dashboard in $10, $30, $100, or $500 increments, or any custom amount. Existing keys work again the moment a top-up lands.

## Contact

- General: hello@extract.page
- Docs: https://docs.extract.page
- Dashboard: https://extract.page/dashboard