How to Extract Text from an Image or Scanned PDF with OCR, Right in Your Browser
You have a photo of a document, a screenshot, or a scanned PDF, and you need the words inside it as actual text you can edit, search, or paste somewhere. That job is called OCR, optical character recognition, and you no longer need desktop software or an upload-your-file website to do it: modern browsers can run a full OCR engine locally. But OCR is not magic. On a clean scan it gets nearly every character right; on a blurry phone photo it produces junk, and the difference is almost entirely about the input you feed it. This guide explains what OCR actually does, how to tell when you do not even need it, what breaks it, why the language setting matters more than people think, and exactly what an in-browser tool does and does not send over the network.
OCR in one paragraph: pixels in, guesses out
A scanned page contains no text. To a computer it is just a grid of colored pixels, the same as a photo of a beach. OCR is software that scans that grid, finds shapes that look like letters, and outputs its best guess as real characters. The word 'guess' matters: the engine matches pixel patterns against what it learned printed text looks like, so the output is a statistical reconstruction, not a copy.
That is why the same tool can be near-perfect on one file and useless on another. Give it a flat, sharp, high-contrast scan of printed text and a good engine reads it with only occasional slips. Give it a skewed, dim phone photo of small text and every stage of the pipeline degrades: it finds lines where there are none, splits characters in the wrong places, and confidently outputs the wrong letters. Treat OCR output as a fast draft that needs a proofread, not as ground truth, especially for numbers.
First, check whether you need OCR at all
There are two kinds of PDFs, and they look identical on screen. A born-digital PDF, one exported from Word, Google Docs, or a website, stores its text as characters. Open it in any PDF viewer and you can select the text with your cursor, search it, and copy it out. That copy is instant, and because it reads the stored characters directly instead of guessing them, it is more accurate than any OCR pass can be. A scanned PDF is different: each page is a photograph of paper, and there are no characters inside to select.
So before you OCR a PDF, try to select a word in it or search for one. If that works, just copy the text directly and skip OCR entirely. Running OCR on a born-digital PDF means rasterizing perfectly good text into an image and then guessing it back, which is slower and strictly less accurate than the text that was already there. OCR is for the cases where selection fails: scans, faxes, photos of paper, and screenshots.
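If you want to automate the select-or-search check, the decision itself is simple. The sketch below assumes you have already pulled the text layer of each page into a list of strings with a PDF library of your choice (that step is not shown). The function name and the 20-character threshold are illustrative assumptions, not part of any tool.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose text layer is effectively empty.

    page_texts holds whatever text a PDF library extracted for each page.
    min_chars is an ASSUMED threshold: a page with fewer non-whitespace
    characters than this is treated as a scan with no usable text layer.
    """
    needs_ocr = []
    for page_number, text in enumerate(page_texts, start=1):
        visible_chars = len("".join(text.split()))  # ignore spaces and newlines
        if visible_chars < min_chars:
            needs_ocr.append(page_number)
    return needs_ocr


# Page 1 has real text; pages 2 and 3 are empty or whitespace only.
pages = ["Hello world, this is a born-digital page.", "", "  \n "]
scan_pages = pages_needing_ocr(pages)  # [2, 3]
```

An empty result means every page already has selectable text, so copying it out directly is the better route.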
What makes OCR accurate, and what quietly breaks it
OCR engines are trained on clean printed text, so accuracy falls off with distance from that ideal: a flat page, sharp focus, dark text on a light background, a normal font. The single biggest factor is resolution. The long-standing rule of thumb is to scan at 300 DPI, which for a photo or screenshot translates to letters roughly 20 pixels tall or more. Below that, the strokes that distinguish an 8 from a B or an e from a c are only a pixel or two wide, and the information the engine needs simply is not in the image. Upscaling a low-resolution image afterward does not bring it back.
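The link between DPI and pixel height is plain arithmetic, sketched below. One typographic point is 1/72 of an inch, so a font's size in pixels is its point size divided by 72, times the DPI. The sketch reads "letter height" as the height of a lowercase letter and assumes that is about half the point size; that ratio is an assumption that varies by typeface, and the function names are illustrative.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def em_height_px(point_size: float, dpi: float) -> float:
    """Pixel height of the font's full em box at a given resolution."""
    return point_size / POINTS_PER_INCH * dpi


def x_height_px(point_size: float, dpi: float, x_height_ratio: float = 0.5) -> float:
    """Approximate pixel height of a lowercase letter such as 'x'.

    x_height_ratio is an ASSUMPTION (about half the em box for many
    text fonts); real typefaces differ.
    """
    return em_height_px(point_size, dpi) * x_height_ratio


def meets_floor(point_size: float, dpi: float, floor_px: float = 20.0) -> bool:
    """True if lowercase letters reach the roughly-20-pixel floor."""
    return x_height_px(point_size, dpi) >= floor_px


x_height_px(10, 300)  # about 20.8: 10 pt text scanned at 300 DPI clears the floor
x_height_px(10, 96)   # about 6.7: the same text at 96 pixels per inch does not
```

Under these assumptions, ordinary 10-point body text scanned at 300 DPI lands just above the 20-pixel mark, which is consistent with the rule of thumb.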
The common killers are worth knowing by name, because each one is avoidable at capture time:
- Blur and camera shake: soft edges make letter shapes ambiguous.
- Skew: text photographed or scanned at an angle breaks line detection.
- Tiny text: screenshots of small UI text or fine print fall under the ~20-pixel floor.
- Low contrast: gray text on gray, or text over photos and colored backgrounds.
- Glare and shadows: a phone photo with your own shadow across the page loses whole regions.
- Handwriting: standard OCR engines are built for print; cursive comes out as noise.
- Decorative fonts: heavily stylized or condensed type strays too far from the training data.
- Complex layouts: multi-column pages and tables can come back with the reading order scrambled.
How to capture a source that OCR can actually read
Five minutes of care at capture time saves thirty minutes of fixing garbled output. If you have a scanner, scan text documents at 300 DPI; grayscale is fine and keeps files smaller than color. If you are using your phone, put the page on a flat surface in even light, hold the camera directly above it rather than at an angle, fill the frame with the page, and make sure your own shadow is not across the text. Tap to focus on the text before shooting.
Then crop. If the text you want shares the frame with a table edge, a hand, or a busy background, crop the image down to just the text region before running OCR. Everything else in the frame is something the engine can misread as text. Each of these steps targets one of the failure modes above: low resolution, skew, blur, shadows, and background clutter.
Languages: the setting people skip
OCR engines use a separate trained model per language, because the model encodes both the letter shapes and the statistics of the language, which character sequences are plausible words. Pick the wrong language and you do not get slightly worse results, you get gibberish: an English model looking at Chinese text will dutifully output the Latin letters those strokes most resemble. If your OCR output looks like random characters, the language setting is the first thing to check, before you blame image quality.
FileTinker's OCR offers 17 languages: English, Traditional and Simplified Chinese, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Japanese, Korean, Arabic, Hindi, Polish, Turkish, and Vietnamese, plus a combined English + Chinese mode for documents that genuinely mix the two, which is common in receipts and forms from Taiwan and Hong Kong. Each language's data is downloaded only when you first use it. Use the combined mode only for genuinely mixed pages: giving the engine two scripts to consider on a single-language page just adds ways to be wrong.
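For readers who end up driving Tesseract themselves, here is a rough sketch of how those 17 languages map onto the engine's standard trained-data names, and how a combined mode is usually expressed by joining names with a plus sign. The codes are Tesseract's own. How FileTinker names them internally is not stated here, and pairing English with Traditional Chinese for the combined mode is an assumption. The function name is illustrative.

```python
# Standard Tesseract trained-data names for the 17 languages listed above.
TESSERACT_CODES = {
    "English": "eng",
    "Traditional Chinese": "chi_tra",
    "Simplified Chinese": "chi_sim",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Italian": "ita",
    "Portuguese": "por",
    "Dutch": "nld",
    "Russian": "rus",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
    "Hindi": "hin",
    "Polish": "pol",
    "Turkish": "tur",
    "Vietnamese": "vie",
}


def language_argument(*names: str) -> str:
    """Build a Tesseract language string such as 'eng' or 'eng+chi_tra'."""
    if not names:
        raise ValueError("pick at least one language")
    unknown = [name for name in names if name not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[name] for name in names)


single = language_argument("English")                        # "eng"
# ASSUMPTION: the combined mode pairs English with Traditional Chinese.
mixed = language_argument("English", "Traditional Chinese")  # "eng+chi_tra"
```

Rejecting unknown names up front matters more than it looks: a silently wrong language is exactly the failure that produces gibberish.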
What an in-browser OCR tool actually does with your file
FileTinker's Image to Text tool runs the open-source Tesseract engine, compiled to WebAssembly, inside a worker in your browser tab. You pick a language, drop in a file, and watch a progress bar while it reads. It accepts common image formats, JPG, PNG, WebP and others, plus PDFs. Images go straight to the engine.
Scanned PDFs take an extra step: the tool first renders each page to an image at double its nominal resolution using pdf.js, then feeds the pages to the engine one at a time, showing 'page 2 of 5' style progress as it goes. The recognized pages are joined with a blank line between them into one result. Note that this happens to every PDF, including born-digital ones, which is exactly why the select-and-copy check from earlier is worth doing first.
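The bookkeeping in that pipeline can be sketched in a few lines. This is not FileTinker's code, only an illustration of the arithmetic and the joining step. It assumes "nominal resolution" means one pixel per PDF point (72 per inch), so a scale of 2 doubles both dimensions; all function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def rendered_size(width_pt: float, height_pt: float, scale: float = 2.0) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at `scale` pixels per point.

    ASSUMPTION: scale 1.0 means one pixel per PDF point.
    """
    return round(width_pt * scale), round(height_pt * scale)


def effective_dpi(scale: float = 2.0) -> float:
    """Pixels per inch of the rendered page under the same assumption."""
    return POINTS_PER_INCH * scale


def progress_label(page_index: int, page_count: int) -> str:
    """Human-readable progress for a zero-based page index."""
    return f"page {page_index + 1} of {page_count}"


def join_pages(page_texts: list[str]) -> str:
    """Join per-page OCR results with one blank line between pages."""
    return "\n\n".join(text.strip() for text in page_texts)


rendered_size(612, 792)         # US Letter page: (1224, 1584) pixels
progress_label(1, 5)            # "page 2 of 5"
join_pages(["one\n", "two"])    # "one\n\ntwo"
```

Under that assumption a double-scale render works out to 144 pixels per inch.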
The result lands in an editable text box, so you can fix recognition errors on the spot, then copy it or download it as a .txt file. If the engine finds nothing, it says so rather than inventing text. And if you need the cleaned-up text as a document again, FileTinker's text-to-PDF tool will turn the .txt into a real, selectable-text PDF, which is usually the end goal of scanning in the first place.
The privacy fine print, stated plainly
'Runs in your browser' is true here, but it does not mean 'no network activity at all,' and it is worth being precise about the difference. Your image or PDF is never uploaded: it is decoded, rendered, and recognized entirely on your own machine, and no server ever receives it. What does travel over the network is the software. On first use, the tool downloads the OCR engine (the worker and its WebAssembly core, served from the jsDelivr CDN) and the trained data for the language you picked (served from the standard Tesseract data host). That is a one-time download: a couple of megabytes for the engine, plus the language model, which ranges from a few megabytes for English to tens of megabytes for languages like Chinese or Japanese. Your browser caches both, so later runs start much faster.
In short: code comes down, your file never goes up. That distinction is what makes in-browser OCR reasonable for sensitive material like contracts, IDs, and medical letters, where sending the document itself to a stranger's server is the thing you are trying to avoid. A network observer could tell that you fetched OCR software; they could not see what you ran it on.
Cleaning up what OCR gives you
Even a good result needs a pass. OCR has classic confusions: lowercase l, the digit 1, and capital I; O and 0; 'rn' read as 'm'. Proofread numbers with extra care, since a wrong digit in a phone number or invoice total is both likely and invisible to a spell check. Structural quirks are normal too: you will often get a line break at the end of every printed line rather than every paragraph, and words hyphen-split across lines stay split.
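Both kinds of cleanup can be partly automated if you are processing many pages. The sketch below is one possible approach, not something the tool does for you: it rejoins hyphen-split words, merges line breaks inside paragraphs, and flags number-like tokens that contain l, I, or O so you know where to look. The function names and regular expressions are illustrative.

```python
import re


def reflow(text: str) -> str:
    """Rejoin hyphen-split words and merge line breaks inside paragraphs."""
    # "recog-\nnition" -> "recognition". Limitation: a genuine compound
    # broken at its hyphen ("well-\nknown") is also joined, so proofread.
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # A single newline becomes a space; blank lines (paragraph breaks) stay.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def suspicious_numbers(text: str) -> list[str]:
    """Return number-like tokens containing l, I, or O, e.g. '4O2'."""
    # Candidate tokens are made only of digits and the three confusable
    # letters, and must contain at least one real digit.
    candidates = re.findall(r"\b[0-9lIO]*[0-9][0-9lIO]*\b", text)
    return [token for token in candidates if any(c in "lIO" for c in token)]


raw = "recog-\nnition works\nwell.\n\nNext para"
clean = reflow(raw)  # "recognition works well.\n\nNext para"

flags = suspicious_numbers("Total 4O2.50 due, call 555-01l2, ref 1234")
# flags == ["4O2", "01l2"]
```

The flagging step only points at likely trouble. It cannot tell you which digit was meant, so the final check against the original image is still yours.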
Since the output box is editable, do this cleanup before you copy or download. For a one-page letter it takes a minute. That minute, plus a decent capture, is the whole difference between OCR being a party trick and being the fastest way to get paper into a document you can actually use.