Scanning, OCR and PDF
A cheatsheet for scanning and OCRing documents on Linux with SANE and Tesseract
Scanning
Under Linux one can scan an image with the scanimage command provided by the SANE framework.
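A minimal sketch of such a scan, wrapped in a hypothetical helper called scan_page. It assumes the default device accepts the common --mode and --resolution backend options and that the installed scanimage supports PNG output; both vary by scanner and SANE version, and scanimage --help lists what a given device offers.

```shell
# scan_page OUTFILE [DPI]
# Scan one greyscale page from the default SANE device into a PNG file.
# --mode and --resolution are backend options (assumed here); the values
# a scanner accepts are listed by `scanimage --help`.
scan_page() {
    out=$1
    dpi=${2:-300}
    scanimage --format=png --mode Gray --resolution "$dpi" > "$out"
}
```

Calling scan_page page-01.png would then scan at 300 dpi, and scan_page page-01.png 600 at 600 dpi.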
Unpaper
Unpaper (manpage) is a scan post-processing tool that removes paper and scan artefacts and straightens carelessly scanned pages, which is great when preparing them for OCR.
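As a rough sketch, a single-page cleanup could look like the following. The helper name clean_scan is made up for illustration, and the example assumes PNM input and output, the format unpaper is built around.

```shell
# clean_scan IN.pnm OUT.pnm
# Deskew and clean one scanned page with unpaper's default filters.
# --layout single: one page per sheet; --overwrite: replace OUT if it exists.
clean_scan() {
    unpaper --layout single --overwrite "$1" "$2"
}
```

For scans of two facing pages, --layout double is the corresponding setting.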
Thanks to a friend of mine for mentioning it!
OCR using Tesseract and imagemagick
Tesseract is both a library and a command line tool for Optical Character Recognition (OCR): image in, the text in the image out.
It works best with greyscale images with low background noise and high contrast. Getting a scan closer to that can be done using unpaper or, for scans of known quality, imagemagick with its builtin contrast filter and level clipping, which is faster than any fully automatic tool.
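The imagemagick route might be sketched like this. The helper name boost_contrast and the default clipping points of 15% and 85% are assumptions for illustration; good values depend on the scanner and the paper.

```shell
# boost_contrast IN OUT [BLACK%] [WHITE%]
# Convert to greyscale, then clip levels: everything darker than BLACK%
# becomes black, everything brighter than WHITE% becomes white.
# -contrast then applies ImageMagick's builtin contrast enhancement.
boost_contrast() {
    black=${3:-15}
    white=${4:-85}
    convert "$1" -colorspace Gray -level "${black}%,${white}%" -contrast "$2"
}
```

On installations that only ship the newer entry point, magick takes the place of convert.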
To explain the most important tesseract options:
- -l
- To set the language or script to use. Accepts multiple values separated by +, e.g. -l deu+eng
- --oem
- To set the mode of operation, meaning whether or not to use the LSTM based approach or the original tesseract OCR. At the time of writing I had the best results with option 2, enabling both.
- --psm
- Tells tesseract what kind of layout analysis to run. I prefer to do a separate OSD-only (Orientation and Script Detection) run in mode 0, rotate the image accordingly, and then do the actual OCR in the fully automatic layout mode 3, which runs without OSD.
- --user-words
- To give tesseract a text file with a list of words that are likely to occur in the document. This helps avoid the OCR equivalent of a typo. There is also a pattern file option, --user-patterns (note that these are not regexes!). More on word and pattern lists in the tesseract manpage.
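The separate OSD pass described for --psm could be sketched as below. The helper names parse_rotation and auto_rotate are hypothetical. The sketch assumes that the OSD report contains a line of the form "Rotate: 90", that this value is a clockwise angle (which is how ImageMagick's -rotate interprets it), and that the osd data file is installed. It is worth verifying the direction on a test page.

```shell
# parse_rotation: read a tesseract OSD report on stdin and print the
# rotation in degrees; prints 0 if no "Rotate:" line is found.
parse_rotation() {
    awk -F': *' '$1 == "Rotate" { print $2; found = 1; exit }
                 END { if (!found) print 0 }'
}

# auto_rotate IN OUT
# First pass: OSD only (--psm 0), report written to stdout.
# Then rotate the image by the reported angle with ImageMagick.
auto_rotate() {
    angle=$(tesseract "$1" stdout --psm 0 | parse_rotation)
    convert "$1" -rotate "$angle" "$2"
}
```

The rotated image can then go through the actual OCR run with --psm 3.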
Note for Void-Linux users: The package name and command for calling tesseract is tesseract-ocr on Void.
Note on DPI settings: Tesseract assumes the images to be scanned at 300dpi. If you are using a different resolution, make sure to tell tesseract using the --dpi option.
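Putting the options above together, an OCR run might look like this sketch. The helper name ocr_page and the WORDLIST variable are made up for illustration, and the defaults (deu+eng, 300 dpi) are only examples. Note that --oem 2 needs language data that still includes the legacy engine model.

```shell
# ocr_page IMAGE OUTBASE [LANGS] [DPI]
# Writes the recognised text to OUTBASE.txt.
# --oem 2: legacy engine plus LSTM; --psm 3: fully automatic layout, no OSD.
# If the (hypothetical) variable WORDLIST names a file, it is passed on
# via --user-words.
ocr_page() {
    langs=${3:-deu+eng}
    dpi=${4:-300}
    tesseract "$1" "$2" -l "$langs" --oem 2 --psm 3 --dpi "$dpi" \
        ${WORDLIST:+--user-words "$WORDLIST"}
}
```

For example, ocr_page page-01.png page-01 would write page-01.txt, and WORDLIST=words.txt ocr_page page-01.png page-01 eng 600 would add a word list and override language and resolution.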
PDF creation and manipulation
Creating a PDF from images
To create a PDF from images without OCR one can use imagemagick.
Also see the imagemagick documentation on the -page option.
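A small sketch of that conversion, using a hypothetical helper images_to_pdf. The A4 page size is an assumption; how -page interacts with the image resolution is best checked against the documentation and a test file.

```shell
# images_to_pdf OUT.pdf IMAGE...
# Write one PDF page per input image, in argument order.
# -page is a setting, so it goes before the images it should apply to.
images_to_pdf() {
    out=$1
    shift
    convert -page A4 "$@" "$out"
}
```

Something like images_to_pdf scan.pdf page-01.png page-02.png would then produce a two-page PDF without any text layer.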
OCR using OCRmyPDF
OCRmyPDF is a Python wrapper for tesseract that makes it pretty painless to run OCR on a PDF, turn PDFs into standards-compliant PDF/A, and optimize PDFs to reduce file size.
Package Note: The package is called ocrmypdf for all major Linux distributions (AUR on Arch).
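A typical invocation might look like the sketch below. The helper name ocr_pdf and the chosen settings (deskewing, lossless optimization level 1, German plus English) are only an example selection, not a recommendation from the tool's authors.

```shell
# ocr_pdf IN.pdf OUT.pdf [LANGS]
# Add an OCR text layer, straighten pages, optimize losslessly and
# write the result as PDF/A.
ocr_pdf() {
    ocrmypdf -l "${3:-deu+eng}" --deskew --optimize 1 --output-type pdfa "$1" "$2"
}
```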
Thank you (again) to the friend who found this tool.
Creating PDFs with OCR using tesseract and poppler
To create PDFs with an OCR overlay, one can delegate the image-to-PDF step to tesseract using its pdf preset and, after the OCR step, join the resulting documents with the poppler pdfunite tool.
Accessibility Note: These OCR overlays, while pretty good, are intended for making the documents searchable. They are not a replacement for proper semantic and accessible documents.
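The two steps could be combined roughly as follows. The helper name ocr_to_pdf, the temporary directory layout and the deu+eng language setting are assumptions made for this sketch.

```shell
# ocr_to_pdf OUT.pdf IMAGE...
# Run tesseract with the pdf preset on each image (one PDF per page),
# then join the pages in argument order with pdfunite.
ocr_to_pdf() {
    out=$1
    shift
    workdir=$(mktemp -d) || return 1
    n=0
    for img in "$@"; do
        n=$((n + 1))
        # Zero-padded names keep the glob below in page order.
        base=$(printf '%s/page-%04d' "$workdir" "$n")
        tesseract "$img" "$base" -l deu+eng pdf || { rm -rf "$workdir"; return 1; }
    done
    pdfunite "$workdir"/page-*.pdf "$out"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

For example, ocr_to_pdf scan.pdf page-01.png page-02.png would yield a searchable two-page PDF.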
Getting text out of PDF files
To get text out of a PDF one can use the pdftotext and pdftohtml commands (which come with poppler).
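Two small wrappers as a sketch of typical usage. The names pdf_text and pdf_html are hypothetical, and -layout is just one option that is often useful.

```shell
# pdf_text IN.pdf
# Print the text of a PDF on stdout ("-" as output file), trying to
# keep the physical layout of the page.
pdf_text() {
    pdftotext -layout "$1" -
}

# pdf_html IN.pdf OUTBASE
# Convert a PDF to HTML files whose names start with OUTBASE.
pdf_html() {
    pdftohtml "$1" "$2"
}
```

With these, pdf_text scan.pdf | grep -i invoice would search the text layer of a scanned document.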
Packages
To get the commandline utilities on some distributions one should also install the <packagename>-bin package.
Imagemagick is sometimes just called magick.
Tesseract usually comes without the needed data files. These probably come in packages named like tesseract-data-<lang>; if you just want everything, there is probably a tesseract-data-all metapackage.
The pdfunite and pdfto* commands are either directly in the poppler package or in poppler-utils.