OCR with OCRopus and Tesseract
While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I’ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative.
I remembered seeing a few years ago that HP had open-sourced their OCR engine, Tesseract, development of which has now been taken over by Google. Tesseract is supposedly very good at what it does, namely, recognizing characters in images.
Tesseract does not, however, have many essential features found in modern OCR software, including document layout analysis and output formatting. That’s where OCRopus comes in. I think of it as a wrapper around Tesseract, capable of doing the layout analysis and providing formatted output. In truth, it can do much more than that, and different OCR engines and other components can be plugged into OCRopus, but the preceding simplification works for my purposes.
Usage
Use OCRopus with a simple call from the command line:
$ ocroscript recognize /path/to/file.png > /path/to/output.html
OCRopus will work its magic on file.png and give you an hOCR file. hOCR uses class and title attributes in an otherwise simple HTML file to embed layout information into the recognized text. I hope soon to create a script to transform the hOCR into a PDF; I’ll post more when it’s ready.
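To give a rough idea of what that layout information looks like, here is a minimal sketch that pulls line bounding boxes out of an hOCR file with sed. The sample markup is hand-written for illustration, not actual OCRopus output, and hocr_lines is a hypothetical helper. It assumes each ocr_line element sits on a single line with single-quoted class and title attributes in that order, which real output may not honor.

```shell
#!/bin/sh
# Build a tiny hand-written hOCR sample (illustrative only; not real OCRopus output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class='ocr_page' title='bbox 0 0 800 1000'>
<span class='ocr_line' title='bbox 50 40 750 80'>First line of text</span>
<span class='ocr_line' title='bbox 50 90 700 130'>Second line of text</span>
</div>
EOF

# Print "x0 y0 x1 y1: text" for every ocr_line element.
# Reads the files given as arguments, or standard input if there are none.
# Assumption: one element per line, single-quoted attributes, class before title.
hocr_lines() {
    sed -n "s/.*class='ocr_line' title='bbox \([0-9 ]*\)'>\([^<]*\)<.*/\1: \2/p" "$@"
}

hocr_lines "$sample"
# prints:
#   50 40 750 80: First line of text
#   50 90 700 130: Second line of text

rm -f "$sample"
```

A real converter would want a proper HTML parser rather than a regular expression, but this is enough to eyeball the coordinates that the title attributes carry.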
Installation
The trickiest part of using OCRopus is the installation. There are quite a few dependencies and some inaccurate documentation, so I made a few wrong turns along the way. Fortunately, I remembered to document what I was doing as I went. The instructions below represent the necessary steps to have an operable installation of OCRopus on Linux Mint as of 2009-03-27. For the record, I’m starting in /var/tmp.
Install Tesseract
As mentioned above, Tesseract is the OCR engine that powers OCRopus.
$ svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
$ cd tesseract-ocr-read-only
$ ./configure
$ make
$ sudo make install
$ cd ..
Install iulib
iulib provides some basic image processing libraries used by OCRopus.
$ svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
$ cd iulib
$ sudo apt-get install scons
$ sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libavcodec-dev libavformat-dev libsdl-gfx1.2-dev libsdl-image1.2-dev
$ sudo apt-get install imagemagick
$ scons
$ sudo scons install
$ cd ..
Install Leptonica
Leptonica provides more image processing and layout analysis capabilities.
$ wget http://leptonica.googlecode.com/files/leptonlib-1.60.tar.gz
$ tar xvzf leptonlib-1.60.tar.gz
$ cd leptonlib-1.60
$ ./configure
$ make
$ sudo make install
$ cd ..
Install OpenFST
OpenFST provides language modeling code to OCRopus. Note that this takes a while (a couple of hours for me) to compile.
$ wget http://mohri-lt.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.1.tar.gz
$ tar xvzf openfst-1.1.tar.gz
$ cd openfst-1.1
$ ./configure
$ make
$ sudo make install
$ cd ..
Install OCRopus
We now have all our dependencies installed, so it’s time to install OCRopus.
$ sudo apt-get install libeditline-dev
$ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
$ cd ocropus
$ ./configure
$ make
$ sudo make install
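Once everything is installed, a quick sanity check is to confirm that the binaries are on the PATH. This is a small sketch rather than part of the official instructions: check_tools is a hypothetical helper, and I'm assuming the installs provide commands named tesseract and ocroscript (the latter being the one used in the Usage section).

```shell
#!/bin/sh
# Report whether each named command can be found on the PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names are assumptions; adjust them if your build installs others.
check_tools tesseract ocroscript || echo "At least one tool is not on the PATH."
```

If something is reported missing even though "sudo make install" succeeded, the install prefix (typically /usr/local/bin) may simply not be on the PATH.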
Update (2009-04-01): OCRopus is still young and has many bugs. One particularly annoying bug is also quite easy to fix: the doctype declaration in the hOCR file is missing some quotes, which makes the XHTML invalid. I’ve submitted a patch. In the meantime, here are slightly revised installation instructions, picking up in the ocropus directory:
$ wget http://xplus3.net/downloads/fix_ocropus_doctype.diff
$ patch -p0 -i fix_ocropus_doctype.diff
$ ./configure
$ make
$ sudo make install
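To check whether a given hOCR file has the problem, something like the sketch below should do. It assumes the doctype sits on a single line and follows the usual XHTML form, with a quoted public identifier followed by a quoted DTD URL. doctype_is_quoted is a hypothetical helper, and the two sample lines are written by hand rather than copied from OCRopus output.

```shell
#!/bin/sh
# Succeeds only if the input contains a DOCTYPE whose public identifier and
# DTD URL are both wrapped in double quotes (single-line DOCTYPE assumed).
# Reads the files given as arguments, or standard input if there are none.
doctype_is_quoted() {
    grep -Eq '<!DOCTYPE[[:space:]]+html[[:space:]]+PUBLIC[[:space:]]+"[^"]+"[[:space:]]+"[^"]+"[[:space:]]*>' "$@"
}

# Hand-written examples: one correctly quoted, one with the quotes missing.
good='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
bad='<!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd>'

for line in "$good" "$bad"; do
    if printf '%s\n' "$line" | doctype_is_quoted; then
        echo "quoted"
    else
        echo "not quoted"
    fi
done
```

On a real file the call would be along the lines of "doctype_is_quoted /path/to/output.html", with the exit status telling you whether the quotes are in place.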
Did you really get ocropus to build successfully using openfst version 1.1?
Are you sure you didn’t run ‘svn checkout http://ocropus.googlecode.com/svn/external ocropus-external’, which downloads the patched openfst beta that ocropus 0.3 is built against (and which is about a year older than the current 1.1)?
Robert Waters | 2009-04-14 00:00
Could you explain how to build ocropus on Windows XP, if possible?
sriranga(76yrsold) | 2009-04-14 01:37
@Robert Waters: Heh, I had no idea I was supposed to use a different version of OpenFST. I really did use version 1.1, using the exact commands I indicated above. No one told me it wasn’t supposed to work, and I didn’t get any error messages.
Jonathan Brinley | 2009-04-14 08:43
Thanks. I am really not sure what is going on; I am waiting for a response from the ocropus devs.
I, and others, have been getting a compile error when building ocropus with openfst support; apparently the current build of ocropus (svn) uses code that doesn’t exist in openfst 1.1 (it only exists in the old openfst beta). I wish that I knew why yours worked!
Robert Waters | 2009-04-14 11:10
Hi, I followed your steps and got the following:
checking /usr/local/include/leptonica/allheaders.h usability... no
checking /usr/local/include/leptonica/allheaders.h presence... no
checking for /usr/local/include/leptonica/allheaders.h... no
checking /usr/local/include/liblept/allheaders.h usability... yes
checking /usr/local/include/liblept/allheaders.h presence... no
configure: WARNING: /usr/local/include/liblept/allheaders.h: accepted by the compiler, rejected by the preprocessor!
configure: WARNING: /usr/local/include/liblept/allheaders.h: proceeding with the compiler's result
checking for /usr/local/include/liblept/allheaders.h... yes
checking for pixCreate in -llept... no
configure: error: leptonica not found! Choose --without-leptonica if you don't want to use it.
I don’t know what went wrong. Could anyone please help? Thanks!
Ed
Edward Wong | 2009-04-21 17:16
Oh, by the way, this error happened when I ran ./configure during the build of ocropus.
Edward Wong | 2009-04-21 17:18
Thank you very much; building and installing worked fine for me under Ubuntu 9.04 (Jaunty).
But there seem to be some problems with openfst (is it disabled completely? I am using 1.1 as proposed):
When I run “make check”, there are some errors.
Here are the first ones (could they have to do with the current state of ocropus?):
# ocroscript test-alignment.lua
./ocr-bpnet/grouping.cc:553 FAILED ASSERT (WARNING) basepoint + xheight + ascender_rise < image.dim(1)
# ocroscript test-bpnet.lua
./ocr-bpnet/grouping.cc:553 FAILED ASSERT (WARNING) basepoint + xheight + ascender_rise < image.dim(1)
OpenFST is disabled, we can’t test it.
Markus
BTW: There is a little typo in the second line of the code (tesseract - tessaract) ;-).
Markus Lutz | 2009-05-23 08:02
@Markus: Typo fixed. Thanks. I haven’t tried to install since the new version came out a couple of weeks ago, so there might be new problems. These instructions were accurate two months ago, but seem to have quickly become obsolete.
Jonathan Brinley | 2009-05-23 08:43
Could you explain how to build ocropus on Windows XP, if possible?
PB | 2009-05-28 11:58
I got the same error as Edward Wong: http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/#comment-547
I’m trying to build OCRopus on OS X 10.5.7.
Any help regarding this error would be appreciated!
Arjan | 2009-06-07 17:37
Edward Wong: I had a similar problem and tried this build (without OpenFST, due to build errors):
./configure --with-tesseract=/usr/local --without-fst --with-leptonica=/usr/local LIBS="-lpng -lgif -ltiff -ljpeg $LIBS"
The LIBS="-lpng -lgif -ltiff -ljpeg $LIBS" bit will fix your problem.
At least it did with ocropus-0.3; I still haven’t managed to get ocropus-0.4 to build.
aususer | 2009-06-24 03:29
Thanks for giving it a try; keep in mind that OCRopus is still in alpha release, so there are going to be lots of rough spots. Beta is planned for October.
Tom | 2009-07-04 19:33
Really, I am very much impressed with the above article. However, why not try the latest OCRopus version, 0.4 (alpha), along the same lines? It would be appreciated for the benefit of users, and would also serve as a tutorial for newbies.
It would also be nice to have a build for the Windows platform, along the lines of the one built by Lakmesha.G.V, who uploaded a Windows version of OCRopus 1.1 on Dec 30 2007 in the file section of the forum.
Wishing you all the best of luck.
sriranga(76yrsold) | 2009-07-04 20:56