Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? It is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed, as well as the path to that image on my machine. From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.
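
Here is a minimal sketch of pulling that information out with Python's standard library. The function names (parse_title, parse_bbox, read_hocr) and the shortened sample document are made up for illustration, and the sketch assumes the hOCR file is well-formed XHTML like the example above; real-world hOCR output may need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document, used only for illustration.
SAMPLE_HOCR = """<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>"""


def parse_title(title):
    """Split a title attribute such as 'bbox 0 0 10 10; image x.png'
    into a dict mapping each property name to its value string."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if part:
            name, _, value = part.partition(" ")
            props[name] = value.strip()
    return props


def parse_bbox(value):
    """Turn the string 'x0 y0 x1 y1' into a tuple of four integers."""
    return tuple(int(number) for number in value.split())


def read_hocr(hocr_text):
    """Return a list of pages; each page holds its bbox, its image path,
    and a list of (bbox, text) pairs for the recognized lines."""
    root = ET.fromstring(hocr_text)
    pages = []
    # iter() walks the tree in document order, so every ocr_line
    # is attached to the most recently seen ocr_page.
    for element in root.iter():
        classes = element.get("class", "").split()
        if "ocr_page" in classes:
            props = parse_title(element.get("title", ""))
            pages.append({
                "bbox": parse_bbox(props["bbox"]),
                "image": props.get("image"),
                "lines": [],
            })
        elif "ocr_line" in classes and pages:
            props = parse_title(element.get("title", ""))
            text = "".join(element.itertext()).strip()
            pages[-1]["lines"].append((parse_bbox(props["bbox"]), text))
    return pages
```

Calling read_hocr(SAMPLE_HOCR) gives one page with two lines, each paired with its bounding box.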

That information, along with a copy of the original image (or a sufficiently similar image), is enough to create a PDF of the image with selectable text.
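
One detail worth spelling out: hOCR bounding boxes are given in pixels measured from the top-left corner of the image, while PDF coordinates are in points (1/72 inch) measured from the bottom-left corner of the page. Below is a minimal sketch of that conversion. The helper name bbox_to_pdf_rect is hypothetical, and the 300 dpi default is an assumption; the hOCR example does not state the scan resolution.

```python
def bbox_to_pdf_rect(bbox, page_height_px, dpi=300):
    """Convert an hOCR bbox (x0, y0, x1, y1), in pixels from the top-left,
    into a PDF rectangle (left, bottom, width, height), in points from the
    bottom-left. The dpi value is an assumed scan resolution."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # points per pixel
    left = x0 * scale
    # Flip the vertical axis: the bottom edge of the box (y1) is measured
    # down from the top in hOCR but up from the bottom in PDF.
    bottom = (page_height_px - y1) * scale
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return left, bottom, width, height
```

With the rectangle in PDF units, a PDF library can draw the image across the full page and place each line of text, in an invisible rendering mode, inside its rectangle.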

Command Line PDF Editing

As I’ve mentioned before, Acrobat’s JavaScript API lags far behind other Adobe applications. Its limitations turned a seemingly simple project I was working on into an exercise in futility.

Overview

I have a collection of a little over 5,000 PDF files, the output of an OCR job. Each file contains one page of a newspaper. Four pages put together would make one issue. My goal, then, is to take four PDF files (e.g., 1896-09-24_001.pdf, 1896-09-24_002.pdf, 1896-09-24_003.pdf, and 1896-09-24_004.pdf) and merge them together into one file (1896-09-24.pdf). Then repeat 1,300 times or so to get the rest of the issues.

Sounds like an ideal job for a small script. Unfortunately, Acrobat only gives JavaScript access to the file system for opening and saving files. It has no way to read a directory for a list of files, which is rather fundamental to the task at hand.

PyPdf Almost Works

At Jay Luker’s suggestion, I tried out PyPdf. It seemed to do everything I needed, and indeed it would, except that it can’t read my PDF files. It does its job just fine with other files, but not with the ones that OmniPage created. These files seem to be missing an attribute that PyPdf looks for, so I end up with a KeyError.

pdftk to the Rescue

After much gnashing of teeth on my part, Jason Ronallo suggested I try pdftk, a simple command-line tool that can merge and split PDFs, among other capabilities. Merging the issue noted above takes just one line:

pdftk 1896-09-24_001.pdf 1896-09-24_002.pdf 1896-09-24_003.pdf 1896-09-24_004.pdf cat output 1896-09-24.pdf

A simple Python script can just call pdftk repeatedly to take care of the whole collection.
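
A minimal sketch of what such a script might look like, assuming every page file is named like 1896-09-24_001.pdf (a date, an underscore, and a three-digit page number) and that pdftk is on the PATH. The function names and the dry_run flag are made up for illustration; only the pdftk invocation itself comes from the one-liner above.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Matches page files such as 1896-09-24_001.pdf. Merged output files
# (1896-09-24.pdf) do not match, so rerunning the script ignores them.
PAGE_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{3})\.pdf$")


def group_pages_by_issue(filenames):
    """Map each issue date to its page files, sorted by page number."""
    issues = defaultdict(list)
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match:
            issues[match.group(1)].append(name)
    # Page numbers are zero-padded, so a plain string sort orders them.
    return {date: sorted(pages) for date, pages in sorted(issues.items())}


def build_pdftk_command(pages, output):
    """Build the argument list for: pdftk <pages...> cat output <output>"""
    return ["pdftk", *pages, "cat", "output", output]


def merge_all(directory, dry_run=True):
    """Merge the pages of every issue found in the given directory.
    With dry_run=True the commands are only printed, not executed."""
    directory = Path(directory)
    names = [path.name for path in directory.iterdir() if path.is_file()]
    for date, pages in group_pages_by_issue(names).items():
        command = build_pdftk_command(pages, f"{date}.pdf")
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the run if pdftk reports an error.
            subprocess.run(command, cwd=directory, check=True)
```

Calling merge_all on the directory of page files first with dry_run=True prints the commands for review; passing dry_run=False then runs pdftk once per issue.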

Many thanks, pdftk and #code4lib.