Scanning, OCR and PDF
A cheatsheet for scanning and OCRing documents on Linux with sane and tesseract
Scanning
Under Linux one can scan an image with the scanimage command provided by the SANE framework.
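As a minimal sketch of such a scan, assuming a backend that supports the common --resolution and --mode options (these are device specific, so the values below are illustrative only, and scan_page is a hypothetical helper name):

```shell
# Sketch only: --resolution and --mode are backend-specific options,
# so the values here are assumptions; check `scanimage --help` for your device.
scan_page() {
    # $1: output file, for example page_1.png
    scanimage --format=png --resolution 300 --mode Gray > "$1"
}
```

Calling scan_page page_1.png would write one greyscale page; scanimage -L lists the devices SANE has detected.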
OCR using Tesseract and imagemagick
Tesseract works best with greyscale images with low background noise and high contrast. A scan can be brought closer to that with imagemagick, using its builtin contrast filter and the level clipping tool.
Note: There is also the unpaper scan post-processing tool, which can figure out some good settings itself and also align carelessly scanned pages, but it is slower than the imagemagick filters. Thanks to a colleague of mine for mentioning it! (unpaper manpage)
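The imagemagick cleanup described above might look roughly like this. It is a sketch, not a tuned recipe: prepare_for_ocr is a hypothetical helper name, and the 10%/90% clipping points are assumed starting values that will need adjusting per scanner and paper.

```shell
# Sketch only: the 10%/90% clipping points are assumed starting values.
prepare_for_ocr() {
    # $1: scanned input image, $2: cleaned output image
    # Convert to greyscale, boost contrast, then clip dark and light levels.
    convert "$1" -colorspace Gray -contrast -level '10%,90%' "$2"
}
```

For example, prepare_for_ocr page_1.png page_1_clean.png keeps the original scan and writes the cleaned copy next to it.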
# Usage: ocr_image <image>
To explain the most important tesseract options:
- -l
- To set the language or script to use. Accepts multiple values separated by a +, e.g. -l deu+eng
- --oem
- To set the mode of operation, meaning whether or not to use the LSTM based approach or the original tesseract OCR. At the time of writing I had the best results with option 2, enabling both.
- --psm
- Tells tesseract what kind of layout analysis to run. I prefer to do a separate OSD (Orientation and Script Detection) only run in mode 0, rotate the image accordingly, and then use the fully automatic layout mode 3 without OSD to do the actual OCR.
- --user-words
- To give tesseract a text file with a list of words that are likely to occur in the document. This helps to avoid the OCR equivalent of a typo. There is also a pattern file, passed with --user-patterns (note that these are not regexes!). More on word and pattern lists in the tesseract manpage.
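Put together, the options above could be combined as in the following sketch. The function name ocr_page, the language pair and the words.txt file are assumptions for illustration; also note that --oem 2 only works with language data that still ships the legacy engine alongside the LSTM one.

```shell
# Sketch only: the language pair and words.txt are assumed example values.
ocr_page() {
    # $1: input image, $2: output base name (tesseract appends .txt)
    tesseract "$1" "$2" -l deu+eng --oem 2 --psm 3 --user-words words.txt
}
```

ocr_page page_1.png page_1 would then write the recognized text to page_1.txt.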
# Usage: rotate_scanned_image <image>
Note: Tesseract assumes that images were scanned at 300dpi. If you are using a different resolution, make sure to tell tesseract using the --dpi option.
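A rotation helper along the lines of the OSD-only run described earlier could be sketched as follows. This is a reconstruction under assumptions, not the original script: it assumes the OSD data file (osd.traineddata) is installed, that the OSD report contains a "Rotate: <degrees>" line, and it uses imagemagick's mogrify to rotate the file in place.

```shell
# Sketch only: assumes osd.traineddata is installed and that
# `tesseract --psm 0` prints a "Rotate: <degrees>" line.
rotate_scanned_image() {
    # $1: image to rotate in place
    angle="$(tesseract "$1" stdout --psm 0 --dpi 300 | awk '/^Rotate:/ { print $2 }')"
    # Leave the file untouched if no rotation is needed or detection failed.
    if [ -n "$angle" ] && [ "$angle" != "0" ]; then
        mogrify -rotate "$angle" "$1"
    fi
}
```

The --dpi value here matches the 300dpi default mentioned in the note; change it if the scan used another resolution.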
PDF creation and manipulation
Creating a PDF from images
To create a PDF from images without OCR one can use imagemagick.
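A minimal sketch of that step, with images_to_pdf as a hypothetical helper name; on ImageMagick 7 installations the command may be magick instead of convert:

```shell
# Sketch only: bundles the given images into one PDF, one image per page.
images_to_pdf() {
    # $1: output PDF, remaining arguments: input images in page order
    out="$1"
    shift
    convert "$@" "$out"
}
```

For example, images_to_pdf scan.pdf page_1.png page_2.png produces a two-page PDF without any text layer.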
Creating PDFs with OCR using tesseract and poppler
To create PDFs with an OCR overlay, one can delegate the image-to-PDF step to tesseract using its pdf preset and join the resulting documents after the OCR step with the poppler pdfunite tool.
tesseract page_n.png page_n [options...] pdf
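The whole OCR-then-join workflow might be wrapped up like this. It is a sketch with a hypothetical function name (ocr_to_pdf); tesseract options such as -l are left out and would go before the pdf argument.

```shell
# Sketch only: OCRs each image into a one-page PDF, then joins them.
ocr_to_pdf() {
    # $1: final PDF, remaining arguments: input images in page order
    out="$1"
    shift
    count=$#
    for img in "$@"; do
        base="${img%.*}"
        # Writes "$base.pdf" with the OCR text layer.
        tesseract "$img" "$base" pdf || return 1
        # Append the per-page PDF to the argument list.
        set -- "$@" "$base.pdf"
    done
    # Drop the original image names, keeping only the PDFs.
    shift "$count"
    pdfunite "$@" "$out"
}
```

For example, ocr_to_pdf document.pdf page_1.png page_2.png leaves page_1.pdf and page_2.pdf next to the images and joins them into document.pdf.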
Note: These OCR overlays, while pretty good, are intended for making the documents searchable. They are not a replacement for proper semantic and accessible documents (which shouldn't be a PDF anyway).
Packages
To get the commandline utilities on some distributions one should also install the <packagename>-bin package.
Imagemagick is sometimes just called magick.
Tesseract usually comes without the needed data files. These probably come in packages named something like tesseract-data-<lang>; if you just want everything, there is probably a tesseract-data-all metapackage.
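Since package names differ between distributions, a quick way to see what is still missing is to check for the commands themselves. The helper below is a hypothetical sketch, not part of any of the tools:

```shell
# Sketch only: prints the name of every given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools scanimage convert tesseract pdfunite unpaper prints nothing when everything is installed. Once tesseract is present, tesseract --list-langs shows which language data files it can find.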