Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? hOCR is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed, as well as the path to that image on my machine. From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.
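To make that concrete, here is a minimal sketch of pulling the page box, image path, and line boxes out of an hOCR document using only Python's standard library. The HocrParser class and its attribute names are my own invention for illustration, not part of OCRopus or the hOCR specification. It also assumes the simple layout of the example, where ocr_line spans contain plain text and no nested spans.

```python
from html.parser import HTMLParser
import re

# Matches the "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrParser(HTMLParser):
    """Hypothetical helper: collects page and line boxes from basic hOCR."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None    # (x0, y0, x1, y1) of the page, in pixels
        self.image_path = None   # path recorded in the ocr_page title
        self.lines = []          # list of ((x0, y0, x1, y1), text)
        self._current_bbox = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        title = attrs.get("title") or ""
        match = BBOX_RE.search(title)
        bbox = tuple(int(n) for n in match.groups()) if match else None
        if css_class == "ocr_page":
            self.page_bbox = bbox
            image = re.search(r"image ([^;]+)", title)
            self.image_path = image.group(1).strip() if image else None
        elif css_class == "ocr_line":
            # Lines without a bbox are skipped, since they cannot be placed.
            self._current_bbox = bbox
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open ocr_line span.
        if self._current_bbox is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside an ocr_line (true for the example).
        if tag == "span" and self._current_bbox is not None:
            text = "".join(self._text_parts).strip()
            self.lines.append((self._current_bbox, text))
            self._current_bbox = None


SAMPLE = """
<div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
  <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
  <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
</div>
"""

parser = HocrParser()
parser.feed(SAMPLE)
parser.close()

print(parser.page_bbox, parser.image_path)
for bbox, text in parser.lines:
    print(bbox, text)
```

I used html.parser rather than an XML parser here so that the sketch tolerates hOCR files that are not strictly well-formed.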

That information, along with a copy of the original image (or a sufficiently similar image), is enough to create a PDF of the image with selectable text.
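One step any such conversion needs is mapping hOCR's pixel boxes into PDF coordinates. hOCR bounding boxes are measured in pixels from the top-left corner of the image, while PDF user space defaults to points (1/72 inch) measured from the bottom-left corner of the page. The sketch below shows only that arithmetic, not the PDF generation itself. The function name is hypothetical, and the 300 dpi default is an assumption about the scan resolution, which the hOCR example does not record.

```python
def bbox_to_pdf_points(bbox, page_height_px, dpi=300):
    """Convert an hOCR pixel bbox to (left, bottom, width, height) in points.

    bbox is (x0, y0, x1, y1) with the origin at the top-left of the image.
    The dpi value is an assumption; use the real scan resolution if known.
    """
    scale = 72.0 / dpi                       # points per pixel
    x0, y0, x1, y1 = bbox
    left = x0 * scale
    bottom = (page_height_px - y1) * scale   # flip the vertical axis
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return left, bottom, width, height


# The first ocr_line of the example, on the 2548 x 3300 pixel page.
left, bottom, width, height = bbox_to_pdf_points((659, 143, 863, 177), 3300)
print(left, bottom, width, height)
```

A PDF library could then place the image at full page size and draw each line's text, invisibly, at the converted position.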

How to Install rdflib

At the Linked Data preconference at Code4Lib 2009 a couple of weeks ago, I learned about the rdflib Python library. Naturally, I wanted to install the library so I could mess with it on my own. That proved to be a little problematic, though. Following the installation instructions, I typed:

$ easy_install -U "rdflib>=2.4,<=3.0a"

And it very helpfully told me (among other things):

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -O2 -Wall -Wstrict-prototypes
-fPIC -I/usr/include/python2.5 -c src/bison/SPARQLParser.c -o
src/bison/SPARQLParser.c:7:20: error: Python.h: No such file or directory

As it turns out, I should have had python-dev and build-essential already installed. Being new to Linux, I did not know. Thanks, BenO, for helping me out with this.
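A quick way to check for this problem before building a C extension is to ask Python where it expects its headers and see whether Python.h is there. This is a minimal sketch for a current Python, using the standard sysconfig module, which the Python 2.5 shown in the error output does not have. The function name is my own.

```python
import os
import sysconfig


def python_header_available():
    """Return True if Python.h exists in this interpreter's include directory."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))


if python_header_available():
    print("Python.h found")
else:
    print("Python.h missing; install the Python development package")
```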

The complete installation instructions:

$ apt-get install python-dev
$ apt-get install build-essential
$ easy_install "rdflib==2.4.0"