Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? It is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed, as well as the path to that image on my machine. From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.

That information, along with a copy of the original image (or a sufficiently similar image) is enough to create a PDF of the image with selectable text.
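As a rough illustration of how those bounding boxes could be mapped onto a PDF page, here is a minimal sketch in shell and awk. It is not the conversion script itself. The function name hocr_to_points is hypothetical, the 300 dpi default is an assumption (the sample hOCR does not record a resolution), and the line-by-line matching only works when each div and span tag sits on one line, as in the example above. The one thing it relies on beyond the sample is that hOCR boxes are measured in pixels from the top-left corner, while PDF pages are measured in points (1/72 inch) from the bottom-left corner.

```shell
# Hypothetical helper: print each ocr_line box in PDF points, then its text.
# Usage: hocr_to_points FILE [DPI]
# Assumptions: 300 dpi unless told otherwise, and every div/span tag
# sits on a single line, as in the sample document.
hocr_to_points() {
  awk -v dpi="${2:-300}" '
    # Page element: remember the page height in pixels (4th bbox value).
    /class="ocr_page"/ && match($0, /bbox [0-9]+ [0-9]+ [0-9]+ [0-9]+/) {
      split(substr($0, RSTART + 5, RLENGTH - 5), page, " ")
      page_h = page[4]
    }
    # Line element: b[1..4] holds x0 y0 x1 y1 in pixels.
    /class="ocr_line"/ && match($0, /bbox [0-9]+ [0-9]+ [0-9]+ [0-9]+/) {
      split(substr($0, RSTART + 5, RLENGTH - 5), b, " ")
      text = $0
      sub(/^[^>]*>/, "", text)    # drop the opening tag
      sub(/<.*$/, "", text)       # drop the closing tag
      s = 72 / dpi                # pixels to points
      # Flip y: hOCR counts down from the top, PDF counts up from the bottom.
      printf "%.2f %.2f %.2f %.2f %s\n",
        b[1] * s, (page_h - b[4]) * s, (b[3] - b[1]) * s, (b[4] - b[2]) * s, text
    }
  ' "$1"
}
```

Each output line is x, y, width, height (all in points) followed by the recognized text. At the assumed 300 dpi, the sample page works out to roughly 8.5 by 11 inches, and its first line would come out as "158.16 749.52 48.96 8.16 Some Text". A real converter would want a proper XML parser rather than pattern matching.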

Find Files by Size

Find all TIFFs in a directory smaller than 90 MB:

$ find /dir/to/search -name '*.tif' -size -90M -exec ls -lh {} \;

Get just the size and path and write to a file:

$ find /dir/to/search -name '*.tif' -size -90M -exec ls -lh {} \; | awk '{print $5 , $8}' > output.txt

Useful for finding images that might have been scanned at the wrong resolution/bit depth/etc.
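One caveat with the awk version: which field holds the path depends on the date format ls uses on your system (on many systems it is $9 rather than $8), and paths containing spaces get split. A possible alternative, sketched below, asks find to print the size and path itself. This assumes GNU find, since -printf is a GNU extension; the function name small_tiffs is hypothetical. Also note that GNU find rounds sizes up to whole units, so -size -90M matches files of at most 89 MiB.

```shell
# Hypothetical helper: list TIFFs under a directory that fall below a size
# limit, one "bytes<TAB>path" record per line.
# Usage: small_tiffs DIR [LIMIT]    (LIMIT defaults to 90M)
# Assumes GNU find, because -printf is not available in every find.
small_tiffs() {
  dir=$1
  limit=${2:-90M}
  # %s is the size in bytes, %p is the path as find reached it.
  find "$dir" -type f -name '*.tif' -size "-$limit" -printf '%s\t%p\n'
}
```

For example, small_tiffs /dir/to/search > output.txt writes the same kind of report as the command above, with sizes in bytes instead of the human-readable form.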

OCR with OCRopus and Tesseract

While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I’ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative.

I remembered seeing a few years ago that HP had open-sourced their OCR engine, Tesseract, development of which has now been taken over by Google. Tesseract is supposedly very good at what it does, namely, recognizing characters in images.

Tesseract does not, however, have many essential features found in modern OCR software, including document layout analysis and output formatting. That’s where OCRopus comes in. I think of it as a wrapper around Tesseract, capable of doing the layout analysis and providing formatted output. In truth, it can do much more than that, and different OCR engines and other components can be plugged into OCRopus, but the preceding simplification works for my purposes.

Usage

Use OCRopus with a simple call from the command line:

$ ocroscript recognize /path/to/file.png > /path/to/output.html

OCRopus will work its magic on file.png and give you an hOCR file. hOCR uses class and title attributes in an otherwise simple HTML file to embed layout information into the recognized text. I hope soon to create a script to transform the hOCR into a PDF; I’ll post more when it’s ready.
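For a whole batch of pages, the same command can be wrapped in a loop. The sketch below is only one way to do it: the function name ocr_batch and the directory layout are hypothetical, and it assumes ocroscript is on your PATH and that the input images are PNGs. It uses nothing from OCRopus beyond the recognize call shown above.

```shell
# Hypothetical batch wrapper around "ocroscript recognize".
# Usage: ocr_batch INPUT_DIR OUTPUT_DIR
# Assumes ocroscript is on PATH and that one hOCR file per PNG is wanted.
ocr_batch() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for img in "$in_dir"/*.png; do
    [ -e "$img" ] || continue        # no PNGs: the glob stays literal, skip it
    base=$(basename "$img" .png)
    # Keep going if one page fails, but report it on stderr.
    if ! ocroscript recognize "$img" > "$out_dir/$base.html"; then
      echo "OCR failed: $img" >&2
    fi
  done
}
```

Calling ocr_batch /path/to/scans /path/to/hocr would leave one .html file per page in the output directory. A failed page still leaves a (possibly empty) output file behind, so check stderr before trusting the results.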

Installation

The trickiest part of using OCRopus is the installation. There are quite a few dependencies and some inaccurate documentation, so I made a few wrong turns along the way. Fortunately, I remembered to document what I was doing as I went. The instructions below represent the necessary steps to have an operable installation of OCRopus on Linux Mint as of 2009-03-27. For the record, I’m starting in /var/tmp.

Measuring the Value of a Book

How do you measure the value of a book? One might ask several questions when determining what a book is worth: How meaningful is the content? Is it enjoyable to read? Can you learn from it? Does it have historical significance? The list can go on indefinitely, and everyone will weigh the various factors differently, depending on their reason for wanting a particular book.

I don’t want to say that any particular metric is necessarily wrong, but I find it inconceivable that people will buy books based on how thick they are, with nary a thought for aught else. But apparently this happens. How else could the Strand Book Store sell books by the foot?

Sure, they mention some legitimate uses, film and theatre sets, for example. But to say “we will custom design a library that is sure to be a perfect match for any home or office space” implies that these books have no value beyond the visual appeal of their spines. Maybe it’s just the penny-pincher in me talking, but can’t you get the same visual effect with wallpaper? Leave the books for snobs like me who think books are for reading.