Scanning, OCR and PDF
A cheatsheet for scanning and OCRing documents on Linux with SANE and tesseract
Scanning
Under Linux one can scan an image with the scanimage command provided by the SANE framework.
# Usage: scan_image_to <output.png>
scan_image_to() {
    printf "Scanning Image …\n"
    # -o sets the output file, -p prints a progress indicator
    scanimage --mode Gray --format png --resolution 300 -o "$1" -p
}
Note: The --mode and --resolution options are scanner-specific and may work differently for you. See scanimage --help.
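To scan several pages in a row, the same scanimage call can be wrapped in a loop. This is only a sketch: page_filename and scan_pages are hypothetical helper names, and the scanner options are the ones from above, which may need adjusting for your device. The zero-padded numbers keep shell globs like page_*.png in scan order.

```shell
# Hypothetical helper: build a zero-padded file name so that globs
# such as page_*.png list the pages in scan order.
page_filename() {
    printf 'page_%03d.png\n' "$1"
}

# Hypothetical helper: scan <count> pages, waiting for Enter before each one.
# Usage: scan_pages <count>
scan_pages() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'Insert page %d and press Enter ' "$i" >&2
        read -r reply
        # Same options as above; --mode and --resolution are scanner-specific
        scanimage --mode Gray --format png --resolution 300 \
            -o "$(page_filename "$i")" -p || return 1
        i=$((i + 1))
    done
}
```

Calling scan_pages 3 would then produce page_001.png to page_003.png in the current directory.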
Unpaper
Unpaper (manpage) is a scan post-processing tool that can remove paper and scan artefacts and also align carelessly scanned pages, which is great when preparing them for OCR.
Thanks to a friend of mine for mentioning it!
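A minimal sketch of how unpaper could be slotted in between scanning and OCR. It assumes that your unpaper version wants PNM input (hence the detour through a temporary PGM file with imagemagick) and that its default settings are good enough. unpaper_png is a hypothetical helper name, see the unpaper manpage for the many tuning options.

```shell
# Hypothetical helper: clean up a PNG scan with unpaper's default settings.
# Usage: unpaper_png <input.png> <output.png>
unpaper_png() {
    tmpdir="$(mktemp -d)" || return 1
    # PNG -> PGM, run unpaper, PGM -> PNG; stop at the first failing step
    convert "$1" "$tmpdir/in.pgm" &&
        unpaper "$tmpdir/in.pgm" "$tmpdir/out.pgm" &&
        convert "$tmpdir/out.pgm" "$2"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```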
OCR using Tesseract and imagemagick
Tesseract is both a library and a command line tool for Optical Character Recognition (OCR): image in, text out.
It works best with greyscale images that have low background noise and high contrast. Getting a scan closer to that can be done with unpaper or, for scans of known quality, with imagemagick using its built-in contrast filter and level clipping, which is faster than any fully automatic tool.
# Usage: ocr_image <image>
ocr_image() {
    # Boost the contrast and clip the levels, then pipe the image to tesseract
    convert "$1" -colorspace Gray -contrast -level 20%,100% -contrast -contrast - |
        tesseract -l deu+eng --oem 2 --psm 3 - -
}
To explain the most important tesseract options:
- -l
- To set the language or script to use. Accepts multiple values separated by a +, e.g. -l deu+eng
- --oem
- To set the mode of operation, meaning whether to use the LSTM-based approach or the original tesseract OCR engine. At the time of writing I had the best results with option 2, which enables both.
- --psm
- Tells tesseract what kind of layout analysis to run. I prefer to first do a separate OSD-only (Orientation and Script Detection) run in mode 0 and rotate the image accordingly, and then use the fully automatic layout mode 3 (without OSD) for the actual OCR.
- --user-words
- To give tesseract a text file with a list of words that are likely to occur in the document. This helps avoid the OCR equivalent of a typo. There is also a pattern file (note that the patterns are not regexes!). More on word and pattern lists in the tesseract manpage.
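As a small sketch of the --user-words option from the list above: the word list is a plain text file with one word per line. make_wordlist and ocr_with_wordlist are hypothetical helper names, and the languages are just the ones used throughout this page.

```shell
# Hypothetical helper: write one word per line into a word list file.
# Usage: make_wordlist <file> <word>...
make_wordlist() {
    out="$1"
    shift
    printf '%s\n' "$@" > "$out"
}

# Hypothetical helper: OCR an image to stdout with a user word list.
# Usage: ocr_with_wordlist <image> <wordlist>
ocr_with_wordlist() {
    tesseract -l deu+eng --psm 3 --user-words "$2" "$1" -
}
```

For example, make_wordlist words.txt Rechnungsnummer Steuernummer followed by ocr_with_wordlist page_1.png words.txt.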
Note for Void-Linux users: The package name and command for calling tesseract is tesseract-ocr on Void.
# Usage: rotate_scanned_image <image>
rotate_scanned_image() {
    # Replace or leave out the -l option here if you work with other languages/scripts
    PAGE_OSD="$(tesseract -l deu+eng --oem 2 --psm 0 "$1" - 2>&1)"
    # If DEBUG_OSD is set, print some debug output.
    [ -n "$DEBUG_OSD" ] && printf "PAGE OSD =====\n%s\n==============\n" "$PAGE_OSD" >&2
    # Pick the suggested rotation (in degrees) out of the OSD report
    APPLY_ROTATION="$( printf "%s\n" "$PAGE_OSD" | awk '/^Rotate:/{ print $2 }' )"
    # Rotate the image in place, falling back to 0 if no rotation was reported
    convert "$1" -rotate "${APPLY_ROTATION:-0}" "$1"
}
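To illustrate what the awk filter in the function above works on, here is a sketch with a made-up OSD report. The sample values are invented for illustration, and osd_rotation is a hypothetical helper that does the same extraction with an explicit fallback of 0.

```shell
# Hypothetical helper: read an OSD report on stdin and print the suggested
# rotation in degrees, or 0 if the report has no "Rotate:" line.
osd_rotation() {
    awk '/^Rotate:/ { print $2; found = 1; exit }
         END { if (!found) print 0 }'
}

# Made-up sample report, shaped like the output of tesseract --psm 0
SAMPLE_OSD='Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 5.52
Script: Latin
Script confidence: 2.50'

printf '%s\n' "$SAMPLE_OSD" | osd_rotation   # prints 90
```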
Note on DPI settings: Tesseract assumes that the images were scanned at 300 dpi. If you are using a different resolution, make sure to tell tesseract using the --dpi option.
PDF creation and manipulation
Creating a PDF from images
To create a PDF from images without OCR one can use imagemagick.
convert command to convert two PNG files to a PDF with a page size of DIN A4:
convert -page a4 page_1.png page_2.png document.pdf
Also see the imagemagick documentation on the -page option.
OCR using OCRmyPDF
OCRmyPDF is a Python wrapper for tesseract that makes it pretty painless to run OCR on a PDF, turn PDFs into standards-compliant PDF/A and optimize PDFs to reduce file size.
ocrmypdf -l deu+eng input.pdf output.pdf
Package Note: The package is called ocrmypdf for all major Linux distributions (AUR on Arch).
Thank you (again) to the friend who found this tool.
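For a whole folder of scanned PDFs, the call can be wrapped in a loop. A sketch only: ocr_pdf_dir is a hypothetical helper name and the ocr/ output directory is an arbitrary choice. The --rotate-pages and --deskew options ask OCRmyPDF to fix page orientation and skew before running OCR.

```shell
# Hypothetical helper: OCR every PDF in <dir> and write the results to <dir>/ocr/.
# Usage: ocr_pdf_dir <dir>
ocr_pdf_dir() {
    mkdir -p "$1/ocr" || return 1
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue   # the glob matched nothing
        ocrmypdf -l deu+eng --rotate-pages --deskew \
            "$pdf" "$1/ocr/$(basename "$pdf")" || return 1
    done
}
```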
Creating PDFs with OCR using tesseract and poppler
To create PDFs with an OCR overlay, one can delegate the image-to-PDF step to tesseract using its pdf preset and, after the OCR step, join the resulting documents with the poppler pdfunite tool.
tesseract page_n.png page_n [options...] pdf
pdfunite page_*.pdf document.pdf
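The two commands above can be combined into one function for any number of pages. This is a sketch under a few assumptions: ocr_pages_to_pdf is a hypothetical helper name, the languages are the ones used elsewhere on this page, and the per-page PDFs are written next to the images. Passing the pages explicitly, instead of relying on a glob, keeps the page order under your control.

```shell
# Hypothetical helper: OCR each image into a one-page PDF, then join them.
# Usage: ocr_pages_to_pdf <output.pdf> <page image>...
ocr_pages_to_pdf() {
    out="$1"
    shift
    for img in "$@"; do
        # tesseract appends .pdf to the given output base name
        tesseract "$img" "${img%.*}" -l deu+eng pdf || return 1
    done
    # Swap each image name in the argument list for its PDF, keeping the order
    for img in "$@"; do
        set -- "$@" "${img%.*}.pdf"
        shift
    done
    pdfunite "$@" "$out"
}
```

For example: ocr_pages_to_pdf document.pdf page_001.png page_002.png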
Accessibility Note: These OCR overlays, while pretty good, are intended for making documents searchable. They are not a replacement for proper semantic and accessible documents.
Getting text out of PDF files
To get text out of a PDF one can use the pdftotext and pdftohtml commands (which come with poppler).
pdftotext file.pdf
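Without a second argument pdftotext writes the text to a .txt file next to the PDF. With - as the output name it prints to stdout instead, which is handy for searching. A small sketch, where pdf_grep is a hypothetical helper name:

```shell
# Hypothetical helper: search the text of a PDF and print matching lines
# with their line numbers.
# Usage: pdf_grep <pattern> <file.pdf>
pdf_grep() {
    # -layout tries to keep the original physical layout of the text
    pdftotext -layout "$2" - | grep -n -- "$1"
}
```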
Packages
To get the commandline utilities on some distributions one should also install the <packagename>-bin package.
Imagemagick is sometimes just called magick.
Tesseract usually comes without the needed data files. These probably come in packages named like tesseract-data-<lang>. If you just want everything, there is probably a tesseract-data-all metapackage.
The pdfunite and pdfto* commands are either directly in the poppler package or in poppler-utils.