Extract text from documents and images with Datalab Marker and OCR
Replicate Blog
Posted October 21, 2025 by andreasjansson
Datalab's state-of-the-art document parsing and text extraction models are now on Replicate.
Marker turns PDF, DOCX, PPTX, images (and more!) into markdown or JSON. It formats tables, math, and code, extracts images, and can pull specific fields when you pass a JSON Schema.
OCR detects text in ninety languages from images and documents, and returns reading order and table grids.
The Marker model is based on the popular open source Marker project (29k GitHub stars) and OCR is based on Surya (19k GitHub stars).
Run Marker and OCR on Replicate:
Marker: replicate.com/datalab-to/marker
OCR: replicate.com/datalab-to/ocr
Run Marker
import replicate

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": open("report.pdf", "rb"),
        "mode": "balanced",  # fast / balanced / accurate
        "include_metadata": True,  # return page-level JSON metadata
    },
)
print(output["markdown"][:400])
Run OCR
import replicate

output = replicate.run(
    "datalab-to/ocr",
    input={
        "file": open("receipt.jpg", "rb"),
        "visualize": True,  # return the input image with red polygons around detected text
        "return_pages": True,  # return layout data
    },
)
print(output["text"][:200])
Visit the models on Replicate for code snippets in other languages.
These models are both fast and accurate, outperforming established tools like Tesseract while keeping processing times short. Marker processes a page in about 0.18 seconds and can reach 120 pages per second when batched.
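To get a feel for what those figures mean for a real workload, here is a rough back-of-the-envelope sketch that turns them into wall-clock estimates. The function names are hypothetical, and the constants are just the headline numbers quoted above, not measurements of any particular run on Replicate.

```python
# Headline figures quoted in this post; real throughput will vary by document and hardware.
SECONDS_PER_PAGE = 0.18          # approximate single-page processing time
BATCHED_PAGES_PER_SECOND = 120   # approximate throughput when batched


def sequential_seconds(pages: int) -> float:
    """Estimated time if pages are processed one at a time."""
    return pages * SECONDS_PER_PAGE


def batched_seconds(pages: int) -> float:
    """Estimated time at the quoted batched throughput."""
    return pages / BATCHED_PAGES_PER_SECOND


pages = 10_000
print(f"sequential: ~{sequential_seconds(pages) / 60:.1f} min")
print(f"batched:    ~{batched_seconds(pages) / 60:.1f} min")
```

These are estimates of model processing time only; they ignore upload time, queueing, and cold starts.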
Structured extraction
One particularly powerful feature of Marker is structured extraction. For example, you can extract specific fields from an invoice:
import json
import replicate

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": "https://multimedia-example-files.replicate.dev/replicator-invoice.1page.pdf",
        "page_schema": json.dumps(schema),
    },
)
structured_data = json.loads(output["extraction_schema_json"])
print(structured_data)
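Since the extracted fields come back as a JSON string, it can be worth checking the parsed result against your schema before passing it downstream. Below is a minimal sketch in plain Python that makes no Replicate call. `check_fields` is a hypothetical helper, the sample values are made up for illustration, and it only handles the flat `string` and `number` properties used in the invoice schema, not JSON Schema in general.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Map the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float)}


def check_fields(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means every field matched."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


# Made-up stand-in for the JSON string a structured extraction might return.
sample = json.loads(
    '{"vendor": "Example Co", "invoice_number": "INV-001", '
    '"date": "2025-10-21", "total": 42.5}'
)
print(check_fields(sample, schema))
```

For anything beyond flat schemas, a full JSON Schema validator is likely a better fit than a hand-rolled check like this.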
Performance
Marker performance was evaluated using the olmOCR-Bench benchmark, a dataset of 1,403 PDF files with 7,010 unit test cases that evaluate the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information.
In balanced mode, Marker has the highest overall score of all models tested, including GPT-4o, Deepseek OCR, Mistral OCR, and olmOCR.
| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Datalab Marker (Balanced mode) | 81.4 | 80.3 | 89.4 | 50.0 | 88.3 | 81.0 | 91.6 | 99.9 | 82.7 ± 0.9 |
| Datalab Marker (Fast mode) | 83.8 | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
| Deepseek OCR | 75.2 | 67.9 | 79.1 | 32.9 | 96.1 | 66.3 | 78.5 | 97.7 | 74.2 ± 1.0 |
| Nanonets OCR | 67.0 | 68.6 | 77.7 | 39.5 | 40.7 | 69.9 | 53.4 | 99.3 | 64.5 ± 1.1 |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
| Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 |
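One way to read the table: the Overall column is numerically consistent with an unweighted mean of the eight category scores. That is an observation about the published numbers, not a statement of how the benchmark officially defines its overall score. The sketch below recomputes the mean for each row and picks the top model by reported Overall; `category_mean` is a hypothetical helper.

```python
# Category scores in table order:
# ArXiv, Old Scans Math, Tables, Old Scans, Headers and Footers,
# Multi column, Long tiny text, Base. The second item is the reported Overall.
rows = {
    "Datalab Marker (Balanced mode)": ([81.4, 80.3, 89.4, 50.0, 88.3, 81.0, 91.6, 99.9], 82.7),
    "Datalab Marker (Fast mode)": ([83.8, 69.7, 74.8, 32.3, 86.6, 79.4, 85.7, 99.6], 76.5),
    "Mistral OCR API": ([77.2, 67.5, 60.6, 29.3, 93.6, 71.3, 77.1, 99.4], 72.0),
    "Deepseek OCR": ([75.2, 67.9, 79.1, 32.9, 96.1, 66.3, 78.5, 97.7], 74.2),
    "Nanonets OCR": ([67.0, 68.6, 77.7, 39.5, 40.7, 69.9, 53.4, 99.3], 64.5),
    "GPT-4o (Anchored)": ([53.5, 74.5, 70.0, 40.7, 93.8, 69.3, 60.6, 96.8], 69.9),
    "Gemini Flash 2 (Anchored)": ([54.5, 56.1, 72.1, 34.2, 64.7, 61.5, 71.5, 95.6], 63.8),
    "Qwen 2.5 VL (No Anchor)": ([63.1, 65.7, 67.3, 38.6, 73.6, 68.3, 49.1, 98.3], 65.5),
    "olmOCR v0.3.0": ([78.6, 79.9, 72.9, 43.9, 95.1, 77.3, 81.2, 98.9], 78.5),
}


def category_mean(scores: list[float]) -> float:
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


for model, (scores, overall) in rows.items():
    print(f"{model}: mean {category_mean(scores):.2f}, reported {overall}")

# Model with the highest reported Overall score.
best = max(rows, key=lambda model: rows[model][1])
print("highest overall:", best)
```

Note that the ranking is by Overall only: several other models score higher than Marker on individual categories such as Headers and Footers.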
Pricing
Marker costs:
$4 per 1000 pages without page_schema in fast and balanced modes.
$6 per 1000 pages when doing structured extraction with page_schema.
$6 per 1000 pages in accurate mode.
OCR costs $2 per 1000 pages.
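To budget a job, the rates above reduce to simple arithmetic. The sketch below is a hypothetical estimator, not an official pricing API. It assumes cost scales linearly with page count, and it assumes accurate mode combined with page_schema is also $6 per 1000 pages, which the pricing list does not spell out.

```python
MODES = {"fast", "balanced", "accurate"}


def marker_cost_usd(pages: int, mode: str, structured: bool = False) -> float:
    """Estimate Marker cost in USD from the per-1000-page rates listed above.

    `structured` means a page_schema is passed. Assumption: accurate mode
    with page_schema is billed at the same $6 rate.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    rate_per_1000 = 6.0 if (structured or mode == "accurate") else 4.0
    return pages * rate_per_1000 / 1000


def ocr_cost_usd(pages: int) -> float:
    """Estimate OCR cost in USD at $2 per 1000 pages."""
    return pages * 2.0 / 1000


print(marker_cost_usd(2500, "balanced"))
print(marker_cost_usd(2500, "fast", structured=True))
print(ocr_cost_usd(2500))
```

Check the model pages on Replicate for current pricing before relying on these numbers.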