Scanning, OCR and PDF

by Slatian

A cheatsheet for scanning and OCRing documents on Linux with SANE and tesseract.

Scanning

Under Linux one can scan an image with the scanimage command provided by the SANE framework.

A helper function that scans an image from the default scanner to the given path, with settings suitable for OCR:
scan_image_to() {
	printf "Scanning Image …\n"
	scanimage --mode Gray --format png --resolution 300 -o "$1" -p
}

Note: The --mode and --resolution options are scanner specific and may work differently for you. See scanimage --help.
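
For multi-page documents the same scanimage call can be wrapped in a loop. The following is only a sketch: scan_pages_to is a hypothetical helper name, the page_NNN.png naming scheme is an assumption made for this example, and the scanner options are the same ones used above, so they may need adjusting for your device.

```shell
# Sketch: scan several pages into zero-padded, numbered PNG files.
# scan_pages_to is a hypothetical helper, not something SANE provides.
# Usage: scan_pages_to <directory> <number of pages>
scan_pages_to() {
	dir="$1"
	count="$2"
	mkdir -p "$dir" || return 1
	n=1
	while [ "$n" -le "$count" ]; do
		# Zero-padded names keep the pages in order when globbing later.
		out="$(printf '%s/page_%03d.png' "$dir" "$n")"
		printf 'Insert page %d of %d and press Enter …' "$n" "$count" >&2
		# Wait for the user before each page; stop if stdin is closed.
		read -r reply || return 1
		scanimage --mode Gray --format png --resolution 300 -o "$out" -p || return 1
		n=$((n + 1))
	done
}
```

Called as scan_pages_to scans 3, this waits for Enter before each page and writes scans/page_001.png to scans/page_003.png.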

Unpaper

Unpaper (manpage) is a scan post-processing tool that can remove paper and scan artefacts and straighten carelessly scanned pages, which is great when preparing them for OCR.

Thanks to a friend of mine for mentioning it!
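
A possible way to slot unpaper into the workflow, as a rough sketch only: clean_scan is a hypothetical helper name, and the round trip through PNM files is an assumption for unpaper builds that only read and write PNM. Check the unpaper manpage for the options your version supports.

```shell
# Sketch: clean a scan with unpaper before OCR.
# clean_scan is a hypothetical helper; the PNM round trip is an assumption
# for unpaper builds that only handle PNM input and output.
# Usage: clean_scan <input image> <output image>
clean_scan() {
	tmpdir="$(mktemp -d)" || return 1
	# 1. Convert the scan to PNM with imagemagick.
	# 2. Let unpaper clean it, treating each image as a single page.
	# 3. Convert the result to whatever format the output name asks for.
	if convert "$1" "$tmpdir/in.pnm" &&
		unpaper --layout single "$tmpdir/in.pnm" "$tmpdir/out.pnm" &&
		convert "$tmpdir/out.pnm" "$2"
	then
		status=0
	else
		status=1
	fi
	rm -rf "$tmpdir"
	return "$status"
}
```

For example, clean_scan page_1.png page_1_clean.png leaves the original scan untouched and writes the cleaned copy next to it.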

OCR using Tesseract and imagemagick

Tesseract is both a library and a command line tool for Optical Character Recognition (OCR). Image in, the text in the image out.

It works best with greyscale images that have low background noise and high contrast. Getting a scan closer to that can be done with unpaper or, for scans of known quality, with imagemagick's builtin contrast filter and level clipping, which is faster than any fully automatic tool.

A function that OCRs a given pre-rotated image. It does some preprocessing with imagemagick to reduce noise (the -level option) and improve contrast, then writes the recognized text as plaintext to stdout.
# Usage: ocr_image <image>
ocr_image() {
	convert "$1" -colorspace Gray -contrast -level 20%,100% -contrast -contrast - | \
	tesseract -l deu+eng --oem 2 --psm 3 - -
}

To explain the most important tesseract options:

-l
Sets the language or script to use. Accepts multiple values separated by a +, e.g. -l deu+eng
--oem
Sets the mode of operation: whether to use the LSTM based engine, the original tesseract OCR engine, or both. At the time of writing I had the best results with option 2, enabling both.
--psm
Tells tesseract what kind of layout analysis to run. I prefer a separate OSD-only (Orientation and Script Detection) run in mode 0 to rotate the image, followed by the fully automatic layout mode 3 without OSD for the actual OCR.
--user-words
Gives tesseract a text file with a list of words that are likely to occur in the document, which helps to avoid the OCR equivalent of a typo. There is also a pattern file option, --user-patterns (note that these are not regexes!). More on word and pattern lists in the tesseract manpage.
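
To illustrate the --user-words option, here is a small sketch that writes the given words to a temporary file, one word per line, and passes that file to tesseract. ocr_with_words is a hypothetical helper name, and the language and --psm values are simply copied from the example above. How much the word list helps probably depends on the engine and the document.

```shell
# Sketch: OCR an image with a list of expected words.
# ocr_with_words is a hypothetical helper name.
# Usage: ocr_with_words <image> <word>...
ocr_with_words() {
	img="$1"
	shift
	words_file="$(mktemp)" || return 1
	# The user words file is plain text with one word per line.
	printf '%s\n' "$@" > "$words_file"
	# "-" as output base makes tesseract print the text to stdout.
	tesseract "$img" - -l deu+eng --psm 3 --user-words "$words_file"
	status=$?
	rm -f "$words_file"
	return "$status"
}
```

A call could look like ocr_with_words page_1.png Slatian OCRmyPDF pdfunite.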

Note for Void-Linux users: The package name and command for calling tesseract is tesseract-ocr on Void.

This function uses tesseract in psm 0 mode to correctly orient a given image file that is assumed to contain German or English text. The file is overwritten in place.
# Usage: rotate_scanned_image <image>
rotate_scanned_image() {
	# Replace or leave out the -l option here if you work with other languages/scripts
	PAGE_OSD="$(tesseract -l deu+eng --oem 2 --psm 0 "$1" - 2>&1)"
	# If DEBUG_OSD is set, print some debug output.
	[ -n "$DEBUG_OSD" ] && printf "PAGE OSD =====\n%s\n==============\n" "$PAGE_OSD" >&2
	APPLY_ROTATION="$( printf "%s\n" "$PAGE_OSD" | awk '/^Rotate:/{ print $2 }' )"
	convert "$1" -rotate "${APPLY_ROTATION:-0}" "$1"
}
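
The OSD output also contains an orientation confidence, which can be used to avoid rotating pages on a weak guess. The sketch below only parses the text that tesseract prints in psm 0 mode. osd_rotation is a hypothetical helper name and the default threshold of 2 is an arbitrary assumption for this example, not a recommended value.

```shell
# Sketch: read "tesseract --psm 0" output on stdin and print the rotation
# angle, or 0 when the orientation confidence is below a threshold.
# osd_rotation and the default threshold of 2 are assumptions for this example.
# Usage: tesseract --psm 0 <image> - 2>&1 | osd_rotation [min_confidence]
osd_rotation() {
	awk -v min="${1:-2}" '
		/^Rotate:/ { rotate = $2 }
		/^Orientation confidence:/ { conf = $3 }
		END {
			# No Rotate line or low confidence: leave the page as it is.
			if (rotate == "" || conf + 0 < min + 0) rotate = 0
			print rotate
		}
	'
}
```

The printed value can be handed to convert -rotate in the same way as APPLY_ROTATION above.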

Note on DPI settings: Tesseract assumes the images were scanned at 300 dpi. If you are using a different resolution, make sure to tell tesseract using the --dpi option.

PDF creation and manipulation

Creating a PDF from images

To create a PDF from images without OCR one can use imagemagick.

Example that uses the convert command to convert two PNG files to a PDF with a page size of DIN A4:
convert -page a4 page_1.png page_2.png document.pdf

Also see the imagemagick documentation on the -page option.

OCR using OCRmyPDF

OCRmyPDF is a Python wrapper for tesseract that makes it pretty painless to run OCR on a PDF, turn PDFs into standards-compliant PDF/A, and optimize PDFs to reduce file size.

An example of OCRing a PDF file that contains German and English text as images:
ocrmypdf -l deu+eng input.pdf output.pdf
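
The same command can be run over a whole directory of scanned PDFs. This is a sketch under a few assumptions: ocr_pdf_dir is a hypothetical helper name, the language setting is copied from the example above, and the extra ocrmypdf options (--rotate-pages, --deskew, --skip-text) are just one plausible selection.

```shell
# Sketch: run ocrmypdf on every PDF in a directory.
# ocr_pdf_dir is a hypothetical helper name.
# Usage: ocr_pdf_dir <input directory> <output directory>
ocr_pdf_dir() {
	in_dir="$1"
	out_dir="$2"
	mkdir -p "$out_dir" || return 1
	for pdf in "$in_dir"/*.pdf; do
		# Skip the unexpanded pattern when the directory has no PDFs.
		[ -e "$pdf" ] || continue
		# --rotate-pages: fix page orientation, --deskew: straighten pages,
		# --skip-text: leave pages that already contain text alone.
		ocrmypdf -l deu+eng --rotate-pages --deskew --skip-text \
			"$pdf" "$out_dir/$(basename "$pdf")" || return 1
	done
}
```

Writing to a separate output directory keeps the original scans around in case the OCR result is not good enough.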

Package Note: The package is called ocrmypdf for all major Linux distributions (AUR on Arch).

Thank you (again) to the friend who found this tool.

Creating PDFs with OCR using tesseract and poppler

To create PDFs with an OCR overlay one can delegate the image-to-PDF step to tesseract using its pdf preset, and then join the resulting documents after the OCR step with the poppler pdfunite tool.

Tesseract command to generate a PDF with OCR overlay for a single page, with the options left out (they are the same as with the plaintext preset):
tesseract page_n.png page_n [options...] pdf

pdfunite command to join multiple PDFs together. Btw. these can be any PDF documents:
pdfunite page_*.pdf document.pdf
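
Both steps can be combined into one helper. One detail to watch: a glob like page_*.pdf expands in lexical order, so page_10.pdf would come before page_2.pdf. The sketch below avoids that by writing zero-padded intermediate names. images_to_searchable_pdf is a hypothetical helper name, and the tesseract options are the ones used earlier in this cheatsheet.

```shell
# Sketch: OCR each image into a one-page PDF, then join the pages.
# images_to_searchable_pdf is a hypothetical helper name.
# Usage: images_to_searchable_pdf <output.pdf> <image>...
images_to_searchable_pdf() {
	out_pdf="$1"
	shift
	workdir="$(mktemp -d)" || return 1
	n=0
	status=0
	for img in "$@"; do
		n=$((n + 1))
		# Zero-padded names make the glob below expand in page order.
		base="$(printf '%s/page_%04d' "$workdir" "$n")"
		# tesseract appends .pdf to the output base itself.
		tesseract "$img" "$base" -l deu+eng --psm 3 pdf || { status=1; break; }
	done
	if [ "$status" -eq 0 ]; then
		pdfunite "$workdir"/page_*.pdf "$out_pdf" || status=1
	fi
	rm -rf "$workdir"
	return "$status"
}
```

The pages end up in the order the images were given on the command line, for example images_to_searchable_pdf document.pdf page_1.png page_2.png.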

Accessibility Note: These OCR overlays, while pretty good, are intended for making documents searchable. They are not a replacement for proper semantic and accessible documents.

Getting text out of PDF files

To get text out of a PDF one can use the pdftotext and pdftohtml commands (which come with poppler).

pdftotext file.pdf
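
Without a second argument pdftotext writes next to the input file (file.txt here). Passing - as the output file sends the text to stdout instead, which makes it easy to search a stack of PDFs. A minimal sketch, where grep_pdfs is a hypothetical helper name and -layout is an optional choice that tries to keep the original physical layout:

```shell
# Sketch: case-insensitive search in the text of several PDFs.
# grep_pdfs is a hypothetical helper name.
# Usage: grep_pdfs <pattern> <file.pdf>...
grep_pdfs() {
	pattern="$1"
	shift
	for pdf in "$@"; do
		# "-" as the output file makes pdftotext write to stdout.
		pdftotext -layout "$pdf" - | grep -i -- "$pattern" |
			while IFS= read -r line; do
				# Prefix every matching line with the file it came from.
				printf '%s: %s\n' "$pdf" "$line"
			done
	done
}
```

This only finds something in scanned documents that already have a text layer, for example after running them through OCRmyPDF.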

Packages

To get the commandline utilities on some distributions one should also install the <packagename>-bin package.

Imagemagick is sometimes just called magick.

Tesseract usually comes without the needed data files. These probably come in packages named like tesseract-data-<lang>; if you just want everything, there is probably a tesseract-data-all metapackage.

The pdfunite and pdfto* commands are either directly in the poppler package or in poppler-utils.