You receive a 30-page contract faxed by an old-school client. You can't search it for a keyword or copy-paste a paragraph; you'd have to retype the whole thing.
This is exactly the problem OCR (Optical Character Recognition) solves. In 2026, recognition quality is high enough that a properly OCR'd clean scan behaves almost like a native PDF for search, copy-paste, archiving, and indexing.
Here's how it works, when to use it, and how to choose between the many solutions.
Why OCR has become indispensable
If you work in any of these fields, you probably already rely on OCR, perhaps without knowing it:
- Law firm: find a specific clause in old contracts
- Accounting: index 10 years of scanned invoices per client
- HR: make CVs received as non-editable PDFs searchable
- Notary: digitize old deeds kept in archive
- Academic research: cite passages from scanned books
Without OCR, these documents are images. They contain no "text" from a computer's standpoint. With OCR, they become text — indexable, copyable, translatable, editable.
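To make "no text from a computer's standpoint" concrete, here is a minimal sketch that flags the pages of a PDF that still need OCR. It assumes you have already extracted the text of each page with whatever PDF library you use; the function name `pages_needing_ocr`, the `min_chars` cutoff, and the sample strings are all hypothetical.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (almost) empty.

    page_texts: one string per page, as returned by a PDF text extractor.
    min_chars: assumed cutoff below which a page is treated as a bare image.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Hypothetical extraction result for a 4-page PDF: pages 2 and 4 are bare images.
extracted = [
    "This agreement is entered into by and between the parties",
    "",
    "Article 3. The supplier shall deliver the goods within 30 days",
    "  \n",
]
print(pages_needing_ocr(extracted))  # [2, 4]
```

A scanned page returns an empty or whitespace-only string because it holds only pixels, which is why search and copy-paste find nothing on it.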
How modern OCR works
Older OCR (until the 2010s) relied on shape recognition: each character was compared against a library of known shapes. Accuracy was mediocre, especially on slightly tilted scans or unusual fonts.
Modern OCR (Tesseract 4 and later, Google Vision, AWS Textract, etc.) uses neural networks trained on large collections of page images. Tesseract's current engine, for instance, is built on LSTM recurrent networks, and other engines combine recurrent and convolutional layers. The model no longer recognizes isolated characters but takes context into account: an ambiguous "rn" shape is read as "rn" if the word is "morning" but as "m" if it's "men."
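The "rn" versus "m" example can be illustrated with a deliberately simplified toy. This is not how a neural engine works internally (it learns context from data rather than from a word list); it only shows why context resolves a visually ambiguous shape. The vocabulary, the confusion pairs, and the function names are assumptions made for the illustration.

```python
# Toy vocabulary standing in for the language knowledge a real model learns.
VOCABULARY = {"morning", "men", "modern", "corner"}

# Glyph sequences that look alike on a scan, in both directions.
CONFUSIONS = [("rn", "m"), ("m", "rn")]


def candidates(word):
    """Yield the raw reading, then every variant with one confusable sequence swapped."""
    yield word
    for old, new in CONFUSIONS:
        start = word.find(old)
        while start != -1:
            yield word[:start] + new + word[start + len(old):]
            start = word.find(old, start + 1)


def resolve(word):
    """Return the first candidate that is a known word, else the raw reading."""
    for candidate in candidates(word):
        if candidate in VOCABULARY:
            return candidate
    return word


print(resolve("moming"))  # morning
print(resolve("rnen"))    # men
```

A character-by-character recognizer has no way to choose between the two readings; anything that looks at the whole word does.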
Typical accuracy on a clean A4 scan at 300 DPI:
- Typed documents in French/English: 99.5%+
- Neat handwriting: 85-95% (with specific training)
- Free handwriting: 60-80% (still OCR's Achilles heel)
- Decorative fonts: variable, sometimes poor
- Complex tables: structure often lost, text generally good
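Accuracy figures like these are usually computed per character: count the edits needed to turn the OCR output into the true text, then divide by the length of the true text. The sketch below is one common way to do it (Levenshtein distance); the function names and the sample sentence are illustrative, and vendors may measure accuracy differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of reference characters recovered correctly, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


truth = "Good morning, gentlemen."
ocr_output = "Good moming, gentlernen."
print(edit_distance(truth, ocr_output))                 # 4
print(f"{character_accuracy(truth, ocr_output):.1%}")   # 83.3%
```

The same arithmetic explains why small percentage gaps matter: on a page of 2,000 characters (an assumed figure), 99.5% accuracy leaves about 10 errors to proofread, while 95% leaves about 100.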
Local vs server: choosing architecture
For a firm or SMB, the OCR architecture choice has real implications.
Browser OCR (Tesseract.js)
Pros:
- No data leaves the machine
- Maximum confidentiality
- No recurring cost
Cons:
- Slow: 3-10 seconds per page on a modern machine
- Limited to lightweight models (~95% accuracy versus 99.5%+ for server models)
- Unsuitable for PDFs over 50 pages
- Drains battery (intensive computation)
Relevant for: occasional OCR on 1-5 pages, very sensitive data.
Server OCR (cloud models)
Pros:
- Very fast: 0.2-1 second per page
- Top-tier models (99.5%+ accuracy)
- Multilingual (60+ languages)
- Structured table recognition
- No load on the client machine
Cons:
- Files transit through a server
- Recurring service cost
- Connection-dependent latency
Relevant for: bulk processing, high accuracy requirement, multilingual.
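To see what the per-page speeds above mean for a real document, here is a back-of-the-envelope estimate. It uses only the ranges quoted in this article and ignores upload and download time for the server case, which depends on your connection; the names `SECONDS_PER_PAGE` and `estimated_minutes` are made up for the sketch.

```python
# Per-page processing times quoted above, in seconds: (fastest, slowest).
SECONDS_PER_PAGE = {
    "browser": (3.0, 10.0),
    "server": (0.2, 1.0),
}


def estimated_minutes(pages, architecture):
    """Return (best case, worst case) processing time in minutes."""
    fastest, slowest = SECONDS_PER_PAGE[architecture]
    return (pages * fastest / 60, pages * slowest / 60)


for architecture in SECONDS_PER_PAGE:
    low, high = estimated_minutes(30, architecture)
    print(f"{architecture}: {low:.1f} to {high:.1f} min for 30 pages")
```

For the 30-page contract from the introduction, that is roughly 1.5 to 5 minutes in the browser versus well under a minute of server processing. At 5,000 pages, the browser range grows to several hours.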
What's PDFly's choice?
PDFly's choice: OCR is a Premium-only feature, processed on EU-hosted servers, with files deleted immediately after processing.
Why not in-browser? Because Tesseract.js accuracy (~95%) isn't enough for professional uses where every error costs proofreading time. Server models (99.5%+) remove most of that cost.
Why in Europe? For the reasons detailed in our Cloud Act article: legal contracts, HR files, and invoices have no business on US servers.
How to OCR a PDF with PDFly
- Go to pdfly.eu/en/tools/ocr-pdf
- Drop your scanned PDF (up to 500 MB on Premium)
- Choose the main language (French, English, multilingual)
- Run OCR
- Download the PDF with its invisible text layer: visually identical to the original, but searchable
The resulting PDF is typically 5-15% larger, because the text layer adds data. If size matters, compress the result afterward.
Common pitfalls and how to avoid them
Scan pitfalls
- Tilted scan: text 5° off-axis can drop OCR accuracy from around 99% to around 70%. Many tools auto-correct the tilt ("deskew"), but not all. Check that yours does.
- Low-resolution scan (<200 DPI): degraded accuracy. Re-scan at 300 DPI minimum.
- Scan with gray/yellow background (old photocopies): binarize the image with a threshold before OCR, so the background turns pure white and the text pure black.
- Black on blue/green background: insufficient contrast. Convert to grayscale first.
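Three of the checks above (resolution, grayscale conversion, thresholding) are easy to sketch in plain Python. This is a simplified illustration on a tiny in-memory image, not a production pipeline: the fixed threshold of 128 is an assumption (real tools often pick it adaptively, for example with Otsu's method), and the function names are hypothetical.

```python
A4_WIDTH_INCHES = 8.27  # 210 mm


def effective_dpi(pixel_width, paper_width_inches=A4_WIDTH_INCHES):
    """Estimate scan resolution from the image width and the paper width."""
    return pixel_width / paper_width_inches


def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) pixels to luminance using the usual BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray_rows]


# A yellowed background pixel next to a dark ink pixel.
scan = [[(230, 220, 170), (40, 40, 60)]]
gray = to_grayscale(scan)
print(gray)            # [[217, 42]]
print(binarize(gray))  # [[255, 0]]

# An A4 page scanned 1654 pixels wide is only about 200 DPI: worth re-scanning.
print(round(effective_dpi(1654)))  # 200
```

After thresholding, the yellowed paper becomes pure white and the ink pure black, which is the contrast an OCR engine works best with.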
PDF pitfalls
- Protected PDF: OCR is often blocked. Unlock the file with the Unlock tool first, if you are authorized to.
- Already partially OCR'd PDF: some pages have text, others don't. Re-running OCR on the entire document is simpler than processing only the missing pages.
- PDF with complex columns: reading order can be incorrect (reads column 1 line 1 then column 2 line 1 instead of all of column 1). For scanned newspapers and magazines, expect some manual rework.
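The column problem is easier to see with coordinates. The sketch below assumes the OCR engine returns each word with the position of its top-left corner, and that you know where the gutter between two columns sits; the `Word` type, `reading_order`, and the `column_boundary` parameter are hypothetical names, and real layouts need proper column detection.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word, growing downward


def naive_order(words):
    """Read strictly top to bottom, left to right: mixes the two columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def reading_order(words, column_boundary):
    """Read all of the left column first, then all of the right column."""
    def by_position(word):
        return (word.y, word.x)

    left = sorted((w for w in words if w.x < column_boundary), key=by_position)
    right = sorted((w for w in words if w.x >= column_boundary), key=by_position)
    return [w.text for w in left + right]


page = [
    Word("B1", 350, 100), Word("A1", 50, 100),
    Word("A2", 50, 120), Word("B2", 350, 120),
]
print(naive_order(page))         # ['A1', 'B1', 'A2', 'B2']
print(reading_order(page, 300))  # ['A1', 'A2', 'B1', 'B2']
```

The naive order is what you get when an engine treats the page as a single block; the second order is what a reader expects.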
Legal pitfalls
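- OCR on a signed document: the text layer doesn't modify the page image, so a scanned handwritten signature is visually unchanged. A cryptographic digital signature is another matter: rewriting the file after signing generally invalidates it, and some courts refuse a PDF that was re-processed after signing. Check the local legal context.
- Probative archiving: if you archive an OCR'd version for evidentiary value, also keep the original non-OCR'd file so the transformation stays traceable. Check what eIDAS and your national archiving rules require in your case.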
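One simple way to keep the transformation traceable is to record a fingerprint of both files at the moment of OCR. The sketch below is only an illustration of that idea, not a statement of what any regulation requires; the record layout and the function names are assumptions. It works on bytes so it stays self-contained; in practice you would read them from the original and OCR'd files.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 fingerprint of a file's content as hexadecimal."""
    return hashlib.sha256(data).hexdigest()


def transformation_record(original: bytes, ocr_version: bytes, tool: str) -> dict:
    """Link the untouched original to its OCR'd derivative."""
    return {
        "original_sha256": sha256_hex(original),
        "ocr_sha256": sha256_hex(ocr_version),
        "tool": tool,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder contents standing in for the two PDF files.
record = transformation_record(b"scanned pages", b"scanned pages + text layer", "OCR")
print(record["original_sha256"] != record["ocr_sha256"])  # True
```

Stored alongside both files, such a record lets you show later which original an OCR'd copy came from and that neither file has changed since.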
Concrete use cases
Law firm
Index 10 years of case law and contracts received as scanned PDFs. Bulk OCR, then full-text search in the firm's DMS. Gain: find a specific clause in 5 seconds instead of 3 hours.
Accounting firm
Digitize client invoices received on paper or low-quality scans. OCR with table extraction, then import into accounting software (Cegid, Sage, etc.). Gain: automate journal entries.
HR
Make a database of 5,000 CVs received over 5 years searchable. OCR + Elasticsearch indexing. Gain: find "Java + 5 years + Bordeaux" in a single query.
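Once the CVs are text, search is a matter of indexing. The toy below shows the principle behind a full-text engine such as Elasticsearch (an inverted index mapping each word to the documents containing it) in a few lines of standard Python. It is a sketch only: the sample CVs and function names are invented, the tokenizer keeps plain ASCII letters and digits, and "5 years" is matched as two literal words rather than as a range.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical OCR output for three CVs.
cvs = {
    "cv1": "Java developer, 5 years of experience, based in Bordeaux",
    "cv2": "Python developer, 5 years of experience, based in Lyon",
    "cv3": "Java architect, 8 years of experience, based in Bordeaux",
}
index = build_index(cvs)
print(search(index, "Java + 5 years + Bordeaux"))  # {'cv1'}
```

OCR errors hurt here directly: a CV where "Bordeaux" was read as "Bordeaox" would simply never match, which is why recognition accuracy matters for search.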
Academic research
Digitize an out-of-copyright rare book for citation. Multilingual OCR, export to TXT, copy-paste into thesis with citations.
In summary
Modern OCR has become a reliable commodity technology. The choice of tool mostly depends on the confidentiality you need and the volume you process.
For daily firm/accounting/HR documents: EU server OCR (PDFly Premium or equivalent) offers the best trade-off between accuracy, speed, and GDPR compliance.
OCR a PDF with PDFly Premium — files processed in Europe, deleted immediately after.