Scanning, OCR and PDF

A cheatsheet for scanning and OCRing documents on Linux with SANE and Tesseract

Scanning

Under Linux one can scan an image with the scanimage command provided by the SANE framework.

A helper function that scans an image from the default scanner to the given path with settings suitable for OCR
scan_image_to() {
	printf "Scanning Image …\n"
	scanimage --mode Gray --format png --resolution 300 -o "$1" -p
}
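
For multi-page documents it helps to scan into zero-padded file names, so that a shell glob such as page_*.png later expands in page order. A minimal sketch of that idea: the function names page_name and scan_pages_to and the page_NNN.png naming scheme are my own illustrative assumptions, not something SANE provides.

```shell
# Print a zero-padded file name for page number $1, e.g. page_007.png.
# (Hypothetical helper; the naming scheme is an assumption.)
page_name() {
	printf 'page_%03d.png' "$1"
}

# Usage: scan_pages_to <directory> <number of pages>
# Scans the pages one by one and waits for Enter before each page,
# so the sheet on a flatbed scanner can be swapped.
scan_pages_to() {
	i=1
	while [ "$i" -le "$2" ]; do
		printf 'Insert page %d and press Enter … ' "$i" >&2
		read -r answer
		scanimage --mode Gray --format png --resolution 300 \
			-o "$1/$(page_name "$i")" -p || return 1
		i=$((i + 1))
	done
}
```

Calling scan_pages_to scans 3 would then produce scans/page_001.png to scans/page_003.png, assuming the scans directory already exists.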

OCR using Tesseract and imagemagick

Tesseract works best with greyscale images with low background noise and high contrast. A scan can be brought closer to that with imagemagick, using its built-in contrast filter and the level clipping tool.

Note: There is also the unpaper scan post-processing tool, which can figure out some good settings itself and also align carelessly scanned pages, but it is slower than the imagemagick filters. Thanks to a colleague of mine for mentioning it! (unpaper manpage)

A function that OCRs a given prerotated image. It does some preprocessing with imagemagick to reduce noise (the -level option) and increase contrast for scans. The output is the recognized text as plaintext on stdout.
# Usage: ocr_image <image>
ocr_image() {
	convert "$1" -colorspace Gray -contrast -level 20%,100% -contrast -contrast - | \
	tesseract -l deu+eng --oem 2 --psm 3 - -
}

To explain the most important tesseract options:

-l
To set the used language or script. Accepts multiple values separated by a +, e.g. -l deu+eng
--oem
To set the mode of operation, meaning whether or not to use the LSTM based approach or the original tesseract OCR. At the time of writing I had the best results with option 2, enabling both.
--psm
Tells tesseract what kind of layout analysis to run. I prefer to do a separate OSD-only (Orientation and Script Detection) run in mode 0, rotate the image accordingly, and then use the fully automatic layout mode 3 without OSD to do the actual OCR.
--user-words
To give tesseract a text file with a list of words that are likely to occur in the document, which helps avoid the OCR equivalent of a typo. There is also a pattern file, given with --user-patterns (note that these are not regexes!). More on word and pattern lists in the tesseract manpage.
This function uses tesseract in psm 0 mode to correctly orient a given image file that is assumed to contain German or English text. The file is then overwritten with the rotated version.
# Usage: rotate_scanned_image <image>
rotate_scanned_image() {
	# Replace or leave out the -l option here if you work with other languages/scripts
	PAGE_OSD="$(tesseract -l deu+eng --oem 2 --psm 0 "$1" - 2>&1)"
	# If DEBUG_OSD is set, print some debug output.
	[ -n "$DEBUG_OSD" ] && printf "PAGE OSD =====\n%s\n==============\n" "$PAGE_OSD" >&2
	APPLY_ROTATION="$( printf "%s\n" "$PAGE_OSD" | awk '/^Rotate:/{ print $2 }' )"
	convert "$1" -rotate "${APPLY_ROTATION:-0}" "$1"
}
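
The OSD output also contains an "Orientation confidence" line, so one could refuse to rotate when tesseract is unsure. A sketch of that variation, under my own assumptions: the function name osd_rotation is made up and the default threshold of 1.0 is an arbitrary guess, not a value recommended by tesseract.

```shell
# Reads the output of a `tesseract --psm 0` run on stdin and prints the
# rotation in degrees, or 0 if the orientation confidence is below the
# threshold given as $1 (default 1.0, an arbitrary assumption).
osd_rotation() {
	awk -v min="${1:-1.0}" '
		/^Rotate:/                 { rotate = $2 }
		/^Orientation confidence:/ { conf = $3 }
		END { print ((conf + 0 >= min + 0) ? rotate + 0 : 0) }
	'
}

# Possible use, rotating only pages with a confident detection:
#   convert "$1" -rotate "$(tesseract "$1" - --psm 0 2>&1 | osd_rotation 2)" "$1"
```

Keeping the parsing in its own function makes it easy to try different thresholds against saved OSD output without rescanning anything.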

Note: Tesseract assumes the images to be scanned at 300dpi. If you are using a different resolution, make sure to tell tesseract using the --dpi option.
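
One way to keep scanner and OCR in agreement is to pass the same resolution to both commands. A small sketch, where the function name scan_and_ocr_at is a hypothetical helper of mine and the remaining options are simply copied from the snippets above:

```shell
# Usage: scan_and_ocr_at <dpi> <image>
# Scans to <image> at the given resolution and hands the same value to
# tesseract via --dpi. The OCR result is printed to stdout.
scan_and_ocr_at() {
	scanimage --mode Gray --format png --resolution "$1" -o "$2" -p &&
	tesseract "$2" - -l deu+eng --psm 3 --dpi "$1"
}
```

For example, scan_and_ocr_at 600 page.png would scan at 600dpi and run the OCR with --dpi 600.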

PDF creation and manipulation

Creating a PDF from images

To create a PDF from images without OCR one can use imagemagick.

convert page_1.png page_2.png document.pdf

Creating PDFs with OCR using tesseract and poppler

To create PDFs with an OCR overlay one can delegate the image to PDF step to tesseract using its pdf preset and join the resulting documents after the OCR step with the poppler pdfunite tool.

Tesseract command to generate a PDF with OCR overlay for a single page with the options left out (they are the same as with the plaintext preset)
tesseract page_n.png page_n [options...] pdf
pdfunite command to join multiple PDFs together. Btw. these can be any PDF documents.
pdfunite page_*.pdf document.pdf
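
Both steps can be wrapped into one function. This is only a sketch: the name ocr_pages_to_pdf is made up, and the -l deu+eng option is an assumption carried over from the plaintext examples. The pages are joined in the order they are given on the command line.

```shell
# Usage: ocr_pages_to_pdf <output.pdf> <page image>...
ocr_pages_to_pdf() {
	out="$1"; shift
	n=$#
	while [ "$n" -gt 0 ]; do
		page="$1"; shift
		# tesseract writes ${page%.*}.pdf next to the image.
		tesseract "$page" "${page%.*}" -l deu+eng pdf || return 1
		# Append the PDF name; after the loop "$@" holds only the PDFs.
		set -- "$@" "${page%.*}.pdf"
		n=$((n - 1))
	done
	pdfunite "$@" "$out"
}
```

One thing to keep in mind when passing a glob like page_*.png: the shell sorts it lexically, so page_10.png comes before page_2.png unless the page numbers are zero-padded.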

Note: These OCR overlays, while pretty good, are intended for making the documents searchable. They are not a replacement for proper semantic and accessible documents (which shouldn't be a PDF anyway).

Packages

To get the commandline utilities on some distributions one should also install the <packagename>-bin package.

Imagemagick is sometimes just called magick.

Tesseract usually comes without the needed data files. These probably come in packages named something like tesseract-data-<lang>. If you just want everything, there is probably a tesseract-data-all metapackage.
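
To check whether the data for a language is actually installed, one can look at the output of tesseract --list-langs. A minimal sketch, where the function name has_tesseract_lang is a hypothetical helper of mine:

```shell
# Usage: has_tesseract_lang <lang>
# Succeeds if <lang> (e.g. deu) shows up as its own line in the list of
# installed languages. stderr is included because, as far as I know,
# older tesseract versions print the list there.
has_tesseract_lang() {
	tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

This could be used as has_tesseract_lang deu || printf 'German data files are missing\n' >&2 before starting a longer OCR run.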