While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I’ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative.
I remembered seeing a few years ago that HP had open-sourced their OCR engine, Tesseract, development of which has now been taken over by Google. Tesseract is supposedly very good at what it does, namely, recognizing characters in images.
Tesseract does not, however, have many essential features found in modern OCR software, including document layout analysis and output formatting. That’s where OCRopus comes in. I think of it as a wrapper around Tesseract, capable of doing the layout analysis and providing formatted output. In truth, it can do much more than that, and different OCR engines and other components can be plugged into OCRopus, but the preceding simplification works for my purposes.
Use OCRopus with a simple call from the command line:
$ ocroscript recognize /path/to/file.png > /path/to/output.html
OCRopus will work its magic on file.png and give you an hOCR file. hOCR uses
title attributes in an otherwise simple HTML file to embed layout information into the recognized text. I hope soon to create a script to transform the hOCR into a PDF; I’ll post more when it’s ready.
The trickiest part of using OCRopus is the installation. There are quite a few dependencies and some inaccurate documentation, so I made a few wrong turns along the way. Fortunately, I remembered to document what I was doing as I went. The instructions below represent the necessary steps to have an operable installation of OCRopus on Linux Mint as of 2009-03-27. For the record, I’m starting in