Scanning, OCR and PDF
A cheatsheet for scanning and OCRing documents on Linux with SANE and Tesseract
Under Linux one can scan an image with the scanimage command provided by the SANE framework.
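A minimal sketch of such a scan, wrapped in a hypothetical helper called scan_page. The --mode and --resolution options are backend-specific, so treat the values as assumptions and check scanimage --help for your device.

```shell
# scan_page OUTFILE [RESOLUTION]
# Scans a single greyscale page and writes it as TIFF to OUTFILE.
# Assumption: the scanner backend accepts "--mode Gray" and "--resolution";
# both are backend-specific options.
scan_page() {
    scanimage --format=tiff --mode Gray --resolution "${2:-300}" > "$1"
}
```

For example, scan_page page-01.tiff scans at the default 300 dpi and scan_page page-01.tiff 600 at 600 dpi.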
Unpaper (manpage) is a scan post-processing tool that can remove paper and scan artefacts and straighten carelessly scanned pages, which is great when preparing them for OCR.
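A minimal sketch of an unpaper run, assuming single-page scans that are already in a PNM format (for example .pgm). The helper name clean_scan is made up for this example.

```shell
# clean_scan INPUT.pgm OUTPUT.pgm
# Cleans up and deskews one scanned page with unpaper.
# Assumptions: the input is a PNM file with one page per image;
# --overwrite replaces an existing output file.
clean_scan() {
    unpaper --layout single --overwrite "$1" "$2"
}
```

For scans with two pages side by side, unpaper also has a double layout.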
Thanks to a friend of mine for mentioning it!
OCR using Tesseract and imagemagick
Tesseract is both a library and a command line tool for Optical Character Recognition (OCR). Image in, text from the image out.
It works best with greyscale images that have low background noise and high contrast. Getting a scan closer to that can be done with unpaper or, for scans of known quality, with imagemagick using its built-in contrast filter and level clipping, which is faster than any fully automatic tool.
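The imagemagick route could look roughly like this. The helper name boost_contrast and the 15%/85% clipping points are assumptions for illustration; suitable values depend on the scanner and the paper.

```shell
# boost_contrast INPUT OUTPUT [BLACK_PERCENT WHITE_PERCENT]
# Converts to greyscale, clips levels below BLACK_PERCENT to black and above
# WHITE_PERCENT to white, then applies imagemagick's contrast filter.
# Assumption: 15 and 85 are placeholder defaults, tune them per scanner.
boost_contrast() {
    convert "$1" -colorspace Gray -level "${3:-15}%,${4:-85}%" -contrast "$2"
}
```

On ImageMagick 7 the command may be called magick instead of convert.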
To explain the most important tesseract options:
To set the used language or script. Accepts multiple values separated by a
- To set the mode of operation, meaning whether to use the LSTM-based approach or the original tesseract OCR engine. At the time of writing I had the best results with option 2, which enables both.
- Tells tesseract what kind of layout analysis to run. I prefer to do a separate OSD-only (Orientation and Script Detection) run in mode 0, rotate the image accordingly, and then use the fully automatic layout mode 3 without OSD for the actual OCR.
- To give tesseract a text file with a list of words that are likely to occur in the document, which helps avoid the OCR equivalent of a typo. There is also a pattern file (note that these are not regexes!). More on word and pattern lists in the tesseract manpage.
Note: Tesseract assumes the images were scanned at 300 dpi. If you are using a different resolution, make sure to tell tesseract using the
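Put together, the two-pass workflow described above might be sketched like this. The helper names osd_rotation and ocr_page are hypothetical. The flags used are -l, --oem, --psm, --dpi and --user-words as found in Tesseract 4 and later. That the "Rotate:" line of the OSD output can be passed directly to imagemagick's -rotate is an assumption worth checking on a test page.

```shell
# osd_rotation IMAGE
# Runs an OSD-only pass (--psm 0) and prints the suggested rotation in degrees.
# Assumption: the OSD output contains a line like "Rotate: 90".
osd_rotation() {
    tesseract "$1" stdout --psm 0 2>/dev/null | awk -F': ' '/^Rotate:/ { print $2 }'
}

# ocr_page IMAGE OUTBASE LANGS [DPI] [WORDLIST]
# Rotates IMAGE as suggested by OSD, then runs OCR and writes OUTBASE.txt.
# LANGS may combine several languages, e.g. "eng+deu".
# Assumption: --oem 2 needs traineddata that still includes the legacy engine.
ocr_page() {
    img="$1"; base="$2"; langs="$3"; dpi="${4:-300}"; words="${5:-}"
    rot="$(osd_rotation "$img")"
    convert "$img" -rotate "${rot:-0}" "$base.rotated.png"
    # Build the tesseract argument list, adding the word list only if given.
    set -- "$base.rotated.png" "$base" -l "$langs" --oem 2 --psm 3 --dpi "$dpi"
    if [ -n "$words" ]; then
        set -- "$@" --user-words "$words"
    fi
    tesseract "$@"
}
```

For example, ocr_page page-01.png page-01 eng+deu 600 words.txt would write page-01.txt from a 600 dpi scan.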
PDF creation and manipulation
Creating a PDF from images
To create a PDF from images without OCR one can use imagemagick.
Also see the imagemagick documentation on the -page option.
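As a rough sketch, a hypothetical helper images_to_pdf could bundle several images into one PDF. The 300 dpi density is an assumption and should match the scan resolution. The -page option is left out here.

```shell
# images_to_pdf OUTPUT.pdf IMAGE...
# Combines the given images into one PDF, one image per page.
# Assumption: the images were scanned at 300 dpi; the density determines
# how large the pages end up in the PDF.
images_to_pdf() {
    out="$1"; shift
    convert "$@" -units PixelsPerInch -density 300 "$out"
}
```

For example: images_to_pdf scan.pdf page-01.png page-02.png. Some distributions ship an ImageMagick security policy that blocks PDF output, in which case the command fails with a policy error.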
OCR using OCRmyPDF
OCRmyPDF is a Python wrapper for tesseract that makes it pretty painless to run OCR on a PDF. It can also turn PDFs into standards-compliant PDF/A and optimize them to reduce file size.
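A minimal sketch of a typical call, wrapped in a hypothetical helper ocr_pdf. The particular option mix (page rotation, deskewing, PDF/A output, light optimization) is my assumption of a reasonable default, not something OCRmyPDF requires.

```shell
# ocr_pdf INPUT.pdf OUTPUT.pdf [LANGS]
# Adds an OCR text layer to INPUT.pdf and writes a PDF/A to OUTPUT.pdf.
# Assumption: English as the default language; pass e.g. "eng+deu" otherwise.
ocr_pdf() {
    ocrmypdf -l "${3:-eng}" --rotate-pages --deskew \
        --output-type pdfa --optimize 1 "$1" "$2"
}
```

For example: ocr_pdf scan.pdf scan-ocr.pdf eng+deu.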
Note: The package is called ocrmypdf for all major Linux distributions (AUR on Arch).
Thank you (again) to the friend who found this tool.
Creating PDFs with OCR using tesseract and poppler
To create PDFs with an OCR overlay one can delegate the image to PDF step to tesseract using its
Note: These OCR overlays, while pretty good, are intended for making the documents searchable. They are not a replacement for proper semantic and accessible documents.
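One way this could look: let tesseract write one searchable PDF per image, then merge them with poppler's pdfunite. The helper name images_to_searchable_pdf and the temporary file naming are assumptions made for this sketch.

```shell
# images_to_searchable_pdf OUTPUT.pdf LANGS IMAGE...
# Runs tesseract with the "pdf" output config on every image and merges the
# resulting single-page PDFs in the order the images were given.
images_to_searchable_pdf() {
    out="$1"; langs="$2"; shift 2
    tmpdir="$(mktemp -d)"
    n=0
    for img in "$@"; do
        n=$((n + 1))
        # Zero-padded names keep the shell glob below in page order.
        tesseract "$img" "$tmpdir/$(printf 'page-%04d' "$n")" -l "$langs" pdf
    done
    pdfunite "$tmpdir"/page-*.pdf "$out"
    rm -rf "$tmpdir"
}
```

For example: images_to_searchable_pdf scan.pdf eng page-01.png page-02.png.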
Getting text out of PDF files
To get text out of a PDF one can use the
pdftohtml command (which come with poppler).
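A small sketch of both directions, using pdftotext (which also ships with poppler) for plain text and pdftohtml for HTML. The helper names and the choice of the -layout and -noframes options are assumptions for illustration.

```shell
# pdf_to_text INPUT.pdf [OUTPUT.txt]
# Extracts plain text while trying to keep the page layout.
# Without OUTPUT.txt the text goes to stdout ("-").
pdf_to_text() {
    pdftotext -layout "$1" "${2:--}"
}

# pdf_to_html INPUT.pdf OUTPUT.html
# Converts the PDF into a single HTML file without a frameset.
pdf_to_html() {
    pdftohtml -noframes "$1" "$2"
}
```

For example, pdf_to_text scan-ocr.pdf | grep -i invoice searches the extracted text directly.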
To get the commandline utilities on some distributions one should also install the
Imagemagick is sometimes just called
Tesseract usually comes without the needed data files. These probably come in packages with names like tesseract-data-<lang>. If you just want everything, there is probably a
The pdfto* commands are either directly in the poppler package or in