OCR on scanned PDF: practical guide for firms and freelancers

You receive a 30-page contract faxed by an old-school client. You can't search it for a keyword or copy-paste a paragraph. Without help, you'd have to retype the whole thing.

This is exactly what OCR (Optical Character Recognition) solves. In 2026, recognition quality is high enough that a properly OCR'd clean scan behaves much like a native PDF for search, copy-paste, archiving, and indexing.

Here's how it works, when to use it, and how to choose between the many solutions.

Why OCR has become indispensable

If you work in any of these fields, you probably already rely on OCR without knowing it:

Law firm: find a specific clause in old contracts
Accounting: index 10 years of scanned invoices per client
HR: make CVs received as non-editable PDFs searchable
Notary: digitize old deeds kept in archive
Academic research: cite passages from scanned books

Without OCR, these documents are images. They contain no "text" from a computer's standpoint. With OCR, they become text — indexable, copyable, translatable, editable.

How modern OCR works

Older OCR (until the 2010s) relied on shape recognition: each character was compared against a library of known shapes. Accuracy was mediocre, especially on slightly tilted scans or unusual fonts.

Modern OCR (Tesseract 5+, Google Vision, AWS Textract, etc.) uses neural networks trained on very large collections of text images. Tesseract's engine, for example, has been LSTM-based since version 4. The model no longer recognizes isolated characters but takes context into account: an ambiguous "rn" shape is read as "rn" when the word is "morning" but as "m" when it's "men."
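To make the idea of context concrete, here is a deliberately simplified Python sketch. Neural OCR models learn this kind of disambiguation implicitly from training data. The explicit word list and the `candidates` and `resolve` helpers below are illustrative assumptions, not how any particular engine is implemented.

```python
def candidates(word):
    """Yield the word itself, then variants with one 'rn' <-> 'm' swap."""
    yield word
    for seen, alternative in (("rn", "m"), ("m", "rn")):
        start = word.find(seen)
        while start != -1:
            yield word[:start] + alternative + word[start + len(seen):]
            start = word.find(seen, start + 1)


def resolve(word, vocabulary):
    """Return the first candidate that is a known word, else the raw reading."""
    for candidate in candidates(word):
        if candidate in vocabulary:
            return candidate
    return word


# Hypothetical vocabulary; a real system would rely on a language model.
VOCABULARY = {"men", "morning"}

print(resolve("rnen", VOCABULARY))     # raw reading with "rn" where "m" was meant
print(resolve("moming", VOCABULARY))   # raw reading with "m" where "rn" was meant
print(resolve("morning", VOCABULARY))  # already a known word, left unchanged
```

The point is only that the surrounding word decides how an ambiguous shape is read. Production models reach the same outcome without an explicit lookup table.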

Typical accuracy on a clean A4 scan at 300 DPI:

Typed documents in French/English: 99.5%+
Neat handwriting: 85-95% (with specific training)
Free handwriting: 60-80% (still OCR's Achilles heel)
Decorative fonts: variable, sometimes poor
Complex tables: structure often lost, text generally good
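These percentages are easier to interpret with a concrete measure. Assuming they are character-level figures, a 3,000-character page at 99.5% still contains about 15 wrong characters. The sketch below shows one common way to compute such a score: compare the OCR output to a hand-checked reference using edit distance. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters the OCR got right (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


reference = "The tenant shall pay rent monthly."
ocr_output = "The tenant shall pay rent rnonthly."
print(edit_distance(reference, ocr_output))  # 2: "m" misread as "rn"
print(f"{char_accuracy(reference, ocr_output):.1%}")
```

Measuring a handful of representative pages this way is a quick check on whether a tool's advertised accuracy holds for your own documents.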

Local vs server: choosing architecture

For a firm or SMB, the OCR architecture choice has real implications.

Browser OCR (Tesseract.js)

Pros:

No data leaves the machine
Maximum confidentiality
No recurring cost

Cons:

Slow: 3-10 seconds per page on a modern machine
Limited to lightweight models (~95% accuracy instead of 99%)
Unsuitable for PDFs over 50 pages
Drains battery (intensive computation)

Relevant for: occasional OCR on 1-5 pages, very sensitive data.

Server OCR (cloud models)

Pros:

Very fast: 0.2-1 second per page
Top-tier models (99.5%+ accuracy)
Multilingual (60+ languages)
Structured table recognition
No load on the client machine

Cons:

Files transit through a server
Recurring service cost
Connection-dependent latency

Relevant for: bulk processing, high accuracy requirement, multilingual.
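A quick calculation shows what the per-page speeds quoted above mean for a whole document. This is a back-of-the-envelope sketch using only those figures. It ignores upload time and queueing, which a real server workflow would add.

```python
# Per-page processing times quoted in this guide, in seconds (low, high).
BROWSER_SECONDS_PER_PAGE = (3.0, 10.0)
SERVER_SECONDS_PER_PAGE = (0.2, 1.0)


def batch_seconds(pages, seconds_per_page):
    """Return the (best case, worst case) processing time for a document."""
    low, high = seconds_per_page
    return pages * low, pages * high


def format_range(seconds_range):
    low, high = seconds_range
    return f"{low:.0f}-{high:.0f} s"


pages = 30  # the faxed contract from the introduction
print("browser:", format_range(batch_seconds(pages, BROWSER_SECONDS_PER_PAGE)))
print("server: ", format_range(batch_seconds(pages, SERVER_SECONDS_PER_PAGE)))
```

For the 30-page contract, that is roughly 90 to 300 seconds in the browser against 6 to 30 seconds on a server. The gap matters little for one document and a great deal for an archive.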

What's PDFly's choice?

PDFly made the following choice: Premium-only OCR, server processing, files deleted immediately after processing, EU-hosted.

Why not in-browser? Because Tesseract.js accuracy (~95%) isn't enough for professional uses where every error costs proofreading time. Server models (99.5%+) eliminate that cost.

Why in Europe? For the same reasons our Cloud Act article details: legal contracts, HR files, and invoices have no business on US servers.

How to OCR a PDF with PDFly

Go to pdfly.eu/en/tools/ocr-pdf
Drop your scanned PDF (up to 500 MB on Premium)
Choose the main language (French, English, multilingual)
Run OCR
Download the PDF with invisible text layer — visually identical to the original but searchable

The resulting PDF typically weighs 5-15% more (the text layer adds weight). If size matters, compress the result afterward.

Common pitfalls and how to avoid them

Scan pitfalls

Tilted scan: text just 5° off-axis can drop OCR accuracy from around 99% to around 70%. Many tools auto-correct the tilt ("deskew"), but not all. Check that yours does.
Low-resolution scan (<200 DPI): degraded accuracy. Re-scan at 300 DPI minimum.
Scan with gray/yellow background (old photocopies): apply a white threshold before OCR, so that light background pixels become pure white and only the ink remains.
Black on blue/green background: insufficient contrast. Convert to grayscale first.
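The two cleanup steps mentioned above (grayscale conversion and white threshold) are simple per-pixel operations. The sketch below shows them in plain Python on a tiny hand-made image. In practice an imaging library would do this on the real scan. The threshold of 160 is an assumed value to tune per document, and the luma weights are the standard ITU-R BT.601 coefficients.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to one brightness value from 0 to 255."""
    red, green, blue = pixel
    return round(0.299 * red + 0.587 * green + 0.114 * blue)


def whiten_background(gray_rows, threshold=160):
    """Push every pixel at or above the threshold to pure white."""
    return [[255 if value >= threshold else value for value in row]
            for row in gray_rows]


# A 1x3 "image": yellowed paper, dark ink, yellowed paper.
yellowed_scan = [[(230, 220, 170), (40, 40, 40), (230, 220, 170)]]

gray = [[to_gray(pixel) for pixel in row] for row in yellowed_scan]
clean = whiten_background(gray)
print(gray)   # [[217, 40, 217]]
print(clean)  # [[255, 40, 255]]
```

The ink value is untouched while the tinted paper becomes uniform white, which is the contrast the recognition step needs.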

PDF pitfalls

Protected PDF: OCR often blocked. Unlock with the Unlock tool if authorized.
Already partially OCR'd PDF: some pages have text, others don't. Re-OCR'ing the entire document is simpler than processing only some pages.
PDF with complex columns: reading order can be incorrect (reads column 1 line 1 then column 2 line 1 instead of all of column 1). For scanned newspapers and magazines, expect some manual rework.
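The column problem is easy to reproduce. OCR engines return words with their positions on the page, and sorting them naively top-to-bottom interleaves the columns. The sketch below assumes a hypothetical `Word` record and a known column boundary. Real layout analysis has to detect that boundary itself, which is where errors creep in.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # top edge, growing downward


def naive_order(words):
    """Read strictly line by line across the full page width."""
    return sorted(words, key=lambda w: (w.y, w.x))


def column_order(words, column_split):
    """Read the whole left column first, then the right one."""
    def by_row(word):
        return (word.y, word.x)

    left = [w for w in words if w.x < column_split]
    right = [w for w in words if w.x >= column_split]
    return sorted(left, key=by_row) + sorted(right, key=by_row)


page = [
    Word("A1", x=50, y=100), Word("B1", x=350, y=100),
    Word("A2", x=50, y=120), Word("B2", x=350, y=120),
]

print([w.text for w in naive_order(page)])              # ['A1', 'B1', 'A2', 'B2']
print([w.text for w in column_order(page, 300)])        # ['A1', 'A2', 'B1', 'B2']
```

The first output is the garbled order described above. The second is what a reader expects.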

Legal pitfalls

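OCR on a signed document: the text layer doesn't modify the page image, so a scanned handwritten signature looks exactly as before. A cryptographic electronic signature is another matter: it covers the file's contents, so re-processing the PDF after signing can invalidate it. Some courts also refuse a PDF re-processed after signing. Check the local legal context and keep the signed original.
Probative archiving: if you archive an OCR'd version for evidentiary value, also keep the original non-OCR'd file so the transformation stays traceable. Check what eIDAS and your local archiving rules require in your case.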
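One simple way to keep a transformation traceable is to record a fingerprint of both files when the OCR runs. The sketch below is an illustration under assumptions: the record fields and the `transformation_record` helper are hypothetical, and nothing here states what a given regulation requires.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Fingerprint of a file's exact bytes."""
    return hashlib.sha256(data).hexdigest()


def transformation_record(original: bytes, ocr_version: bytes, tool: str) -> dict:
    """Link the untouched scan to the OCR'd file derived from it."""
    return {
        "original_sha256": sha256_hex(original),
        "ocr_sha256": sha256_hex(ocr_version),
        "tool": tool,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Stand-in bytes; in practice, read both PDF files in binary mode.
original_bytes = b"%PDF-1.7 scanned pages only"
ocr_bytes = b"%PDF-1.7 scanned pages plus text layer"

record = transformation_record(original_bytes, ocr_bytes, tool="example-ocr")
print(json.dumps(record, indent=2))
```

Stored next to the two files, such a record lets anyone verify later that neither file has changed since the OCR was performed.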

Concrete use cases

Law firm

Index 10 years of case law received as scanned PDFs. Bulk OCR, then full-text search in the firm's DMS. Gain: find a specific passage in 5 seconds instead of 3 hours.

Accounting firm

Digitize client invoices received on paper or low-quality scans. OCR with table extraction, then import into accounting software (Cegid, Sage, etc.). Gain: automate journal entries.

HR

Make a database of 5,000 CVs received over 5 years searchable. OCR + Elasticsearch indexing. Gain: find "Java + 5 years + Bordeaux" in 1 click.
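What the search engine does with OCR'd text can be sketched in a few lines. The example below builds a minimal in-memory inverted index as a stand-in for Elasticsearch. The sample CVs, identifiers, and function names are invented for illustration, and a real engine adds ranking, stemming, and fuzzy matching on top.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a set of simple tokens."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents containing every term."""
    matches = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*matches) if matches else set()


# Hypothetical OCR output for three scanned CVs.
ocr_texts = {
    "cv_001": "Java developer, 5 years of experience, Bordeaux",
    "cv_002": "Python developer, 3 years of experience, Lyon",
    "cv_003": "Java architect, 12 years of experience, Paris",
}

index = build_index(ocr_texts)
print(sorted(search(index, "java", "bordeaux")))  # ['cv_001']
print(sorted(search(index, "java")))              # ['cv_001', 'cv_003']
```

None of this works on the raw scans: without the OCR step there is no text to tokenize, which is why indexing always comes second.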

Academic research

Digitize an out-of-copyright rare book for citation. Multilingual OCR, export to TXT, copy-paste into thesis with citations.

In summary

Modern OCR has become a reliable technical commodity. The tool choice mostly depends on required confidentiality and volume.

For daily firm/accounting/HR documents: EU server OCR (PDFly Premium or equivalent) offers the best trade-off between accuracy, speed, and GDPR compliance.

OCR a PDF with PDFly Premium — files processed in Europe, deleted immediately after.