Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? hOCR is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed (as well as the path to that image on my machine). From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.
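
As a rough sketch of how that information can be pulled out of an hOCR file, the following uses only the Python standard library. The function names parse_bbox and read_hocr are made up for this illustration and are not necessarily how the HocrConverter script does it; the sample document is a trimmed copy of the example above.

```python
import re
import xml.etree.ElementTree as ET


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def read_hocr(source):
    """Return (page_bbox, lines); lines is a list of (bbox, text) pairs."""
    root = ET.fromstring(source)
    page_bbox = None
    lines = []
    # Match on the class attribute only, so the XHTML namespace can be ignored.
    for element in root.iter():
        css_class = element.get("class", "")
        title = element.get("title", "")
        if css_class == "ocr_page" and page_bbox is None:
            page_bbox = parse_bbox(title)
        elif css_class == "ocr_line":
            text = "".join(element.itertext()).strip()
            lines.append((parse_bbox(title), text))
    return page_bbox, lines


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>"""

page_bbox, lines = read_hocr(SAMPLE)
print(page_bbox)
for bbox, text in lines:
    print(bbox, text)
```

The page bounding box gives the pixel dimensions of the OCRed image, and each line carries its own box plus the recognized text.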

That information, along with a copy of the original image (or a sufficiently similar image) is enough to create a PDF of the image with selectable text.

Creating a PDF

A while back, Florian Hackenberger created a basic hOCR to PDF converter in Java. It does its job reasonably well, but it’s somewhat rough around the edges. Since I’m more familiar with Python than Java, it seemed like a good idea to rewrite the code in Python, so I could more comfortably hack it.

And so I present to you, dear reader, the hOCR Converter Python script (note: this script depends on the ReportLab PDF Library, which, in turn, depends on the FreeType 2 Font Engine). Right now, it has two primary functions:

  1. Given an hOCR file, create a text-only (i.e., no HTML) document
  2. Given an hOCR file and an image, create a PDF

Usage is pretty simple:

from HocrConverter import HocrConverter
hocr = HocrConverter("myHocrFile.html")
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

You end up with a text file at output.txt containing the contents of the body of the hOCR document and a PDF file at output.pdf containing your image superimposed on top of the text.

Note that the image you use to create the PDF need not be the same image used for the OCR. If, for example, you used a 300 or 400 DPI image for the OCR, but you want a smaller file for the PDF, you can create a 72 DPI version of the image and feed that through the script instead of the original.
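
The reason a different image works is that the text positions only need to be scaled from OCR pixel coordinates to the size of the PDF page. Here is a minimal sketch of that mapping, assuming the PDF page size is given in points (1/72 inch) and that the hOCR origin is the top-left corner while the PDF origin is the bottom-left. The function name ocr_box_to_pdf and the 300 DPI figure for the example scan are assumptions for illustration, not taken from the script.

```python
def ocr_box_to_pdf(bbox, ocr_size, page_size):
    """Map an hOCR bbox (pixels, origin top-left) to PDF points (origin bottom-left).

    Returns (left, bottom, width, height) in points.
    """
    x0, y0, x1, y1 = bbox
    ocr_width, ocr_height = ocr_size
    page_width, page_height = page_size
    if ocr_width <= 0 or ocr_height <= 0:
        raise ValueError("OCR page has no usable dimensions")
    scale_x = page_width / ocr_width
    scale_y = page_height / ocr_height
    left = x0 * scale_x
    # Flip the vertical axis: the bottom of the box is y1 in hOCR coordinates.
    bottom = page_height - y1 * scale_y
    width = (x1 - x0) * scale_x
    height = (y1 - y0) * scale_y
    return left, bottom, width, height


# Assumption: the 2548 x 3300 pixel scan from the example was made at 300 DPI,
# so the physical page is 2548/300 by 3300/300 inches, at 72 points per inch.
page_size = (2548 / 300 * 72, 3300 / 300 * 72)
print(ocr_box_to_pdf((659, 143, 863, 177), (2548, 3300), page_size))
```

Because only the ratio between the OCR dimensions and the page dimensions matters, the pixel size of the image placed on the page does not enter into the calculation.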

With a simple script that iterates through a directory of images, calling OCRopus and this script for each image (maybe with some ImageMagick and pdftk thrown in for good measure), you can quickly OCR a large batch of images and convert them to searchable PDF documents.
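
One possible shape for such a batch script is sketched below. The helper names plan_batch and run_batch, the list of image suffixes, and the output naming scheme are all assumptions for illustration; the OCR step and the PDF step are passed in as callables so that the exact OCRopus commands stay up to you.

```python
from pathlib import Path

# Assumed set of image types; adjust to whatever your scanner produces.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg"}


def plan_batch(image_dir, output_dir):
    """Return a sorted list of (image, hocr, pdf) paths, one per image found."""
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    jobs = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        hocr = output_dir / (image.stem + ".html")
        pdf = output_dir / (image.stem + ".pdf")
        jobs.append((image, hocr, pdf))
    return jobs


def run_batch(jobs, run_ocr, make_pdf):
    """Call run_ocr(image, hocr), then make_pdf(hocr, image, pdf), for each job."""
    finished = []
    for image, hocr, pdf in jobs:
        run_ocr(image, hocr)
        make_pdf(hocr, image, pdf)
        finished.append(pdf)
    return finished
```

Here make_pdf could be a small wrapper around HocrConverter(str(hocr)).to_pdf(str(image), str(pdf)), and run_ocr a wrapper that invokes OCRopus and writes its hOCR output to the given path.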

Future Development

I’d like to add the capability to convert the output of various OCR programs into hOCR format. For example, OmniPage, which is much better than OCRopus when it comes to layout analysis for complex images, can output documents in its proprietary XML schema. I should be able to transform that into an hOCR document and then use this script to create a PDF from there.
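
I have not worked out the OmniPage side yet, but the hOCR side of such a transformation is simple. As a sketch, assuming the other program's output has already been reduced to a page size and a list of (bounding box, text) pairs, a minimal hOCR document like the example above could be written out as follows. The function name build_hocr is hypothetical, and only a subset of the head metadata from the example is emitted.

```python
from html import escape


def build_hocr(page_width, page_height, image_path, lines):
    """Return a minimal hOCR document as a string.

    lines is a list of ((x0, y0, x1, y1), text) pairs in pixel coordinates.
    """
    spans = []
    for (x0, y0, x1, y1), text in lines:
        spans.append(
            '      <span class="ocr_line" title="bbox %d %d %d %d">%s</span>'
            % (x0, y0, x1, y1, escape(text))
        )
    # Escape the title too, in case the image path contains special characters.
    page_title = escape(
        "bbox 0 0 %d %d; image %s" % (page_width, page_height, image_path)
    )
    return "\n".join(
        [
            '<html xmlns="http://www.w3.org/1999/xhtml">',
            "  <head>",
            '    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>',
            "    <title>OCR Output</title>",
            "  </head>",
            "  <body>",
            '    <div class="ocr_page" title="%s">' % page_title,
        ]
        + spans
        + ["    </div>", "  </body>", "</html>"]
    )


print(build_hocr(2548, 3300, "/path/to/scanned/image.png",
                 [((659, 143, 863, 177), "Some Text")]))
```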

I would welcome any other suggestions or contributions.

Download the HocrConverter script

12 Responses to "Convert hOCR to PDF"

  1. Super cool!

    Jonathan Rochkind | 2009-04-02 17:48 | Permalink

  2. Hi,
    Interesting. Do you know how to apply a data extractor? I would like to use Ocropus to extract information which is in determined places inside the document.

    Greetings

    Pablo | 2009-04-21 06:44 | Permalink

  3. Hi,

    This might be a stupid question, but why include the original scanned image in the result pdf? If there are photos in the document I’d understand…

    cputter | 2009-05-26 17:00 | Permalink

  4. @cputter: I’m typically not interested in the actual text except insofar as it is an aid to discovery. I’m usually working on cultural heritage digitization projects (see http://libx.bsu.edu). The goal is to provide a reasonably accurate digital surrogate for the original physical item, so people can see not just what the item said, but what it looked like (which will, often, include images or marginalia). OCRing the text and embedding it with the scanned image in a PDF provides a representation of the item that can be discovered by searching the full OCRed text.

    Jonathan Brinley | 2009-05-26 19:14 | Permalink

  5. If you replace text.textLine by text.textLines, no more “s” is prepended to each line (due to the “\n” character added at beginning and end of each line of text in output of `ocropus buildhtml mydir`).

    I use a modified version of your script, because resolution and image size were not properly found in my images. I now type something like: python HocrConverter b.hocr c.pdf 300 300 2488 3507, with the syntax python HocrConverter.py inputHocrFile outputPdfFile imagedpix imagedpiy imagepixelwidth imagepixelheight

    And I include the image with pdftk image.pdf background text.pdf output result.pdf

    d | 2009-08-18 08:35 | Permalink

  6. Very beautiful job

    I work on a simple script using scanadf+unpaper+ocropus+HocrConverter+gs for a simple and complete chain

    I have two questions:
    1)
    Do you know how/why, in a multicolumn document, the PDF viewer could avoid treating contiguous line fragments as part of the same text?
    In the hOCR file the columns are presented in the right order.

    2)
    Have you considered Cuneiform as the OCR engine (with hocr2pdf)?

    Best regards

    François Elie | 2009-08-20 20:04 | Permalink

  7. François Elie, your bug comes from gs: I do not have your problem in my xpdf or my acroread. Try pdftops or pdf2ps instead of gs ?

    d | 2009-09-03 10:04 | Permalink

  8. xplus3, do you allow me to leave my modified version of your HocrConverter available for free download at http://bugs.gentoo.org/show_bug.cgi?id=185810#c32 ?

    Can you set your HocrConverter version on GPL ?

    d | 2009-09-03 10:08 | Permalink

  9. @d: HocrConverter is © 2009 Jonathan Brinley and available under the MIT License. I really should add a copyright statement to the file. Maybe next week…

    Jonathan Brinley | 2009-09-03 10:21 | Permalink

  10. Can I run your HocrConverter script from bash as a oneliner?

    HOCR="temp.hocr"; TEXT="temp.text"; IMAGE="test.tiff"; OUT="test.pdf"

    python -c "from HocrConverter import HocrConverter; hocr = HocrConverter(\""$HOCR"\"); hocr.to_text(\""$TEXT"\"); hocr.to_pdf(\""$IMAGE"\", \""$OUT"\");"

    How do I make python find your HocrConverter module?

    I tried the Gentoo ebuild, but their hocrtopdf creates a new pdf with the found
    ocr text written in Helvetica (or Courier) with reportlab. I just want the
    original unmodified scanned image overlayed with the hidden ocr text.

    Jeremy | 2009-12-02 14:35 | Permalink

  11. I got confused between Gentoo’s modified python script hocrtopdf and
    your original HocrConverter. The scripts take different commandline arguments:

    $ python /usr/local/bin/HocrConverter
    Usage: python HocrConverter.py inputHocrFile inputImageFile outputPdfFile

    $ /usr/local/bin/hocrtopdf
    Usage: hocrtopdf inputHocrFile fontHelveticaORCourier outputPdfFile imagedpix imagedpiy imagepixelwidth imagepixelheight oneifisotropic

    The only problem I’m having with HocrConverter, is that I’m getting a division
    by zero exception because my hocr file is missing ocrwidth/ocrheight
    info:

    # get dimensions of the OCR, which may not match the image
    if self.hocr is not None:
      for div in self.hocr.findall(".//%sdiv"%(self.xmlns)):
        if div.attrib['class'] == 'ocr_page':
          coords = self.element_coordinates(div)
          ocrwidth = coords[2]-coords[0]
          ocrheight = coords[3]-coords[1]

    which results in the division by zero error:

    HOCR="pg_0001.hocr"; IMAGE="image.png"; OUT="pg_0001-new.pdf"

    $ HocrConverter $HOCR $IMAGE $OUT
    DEBUG: ocrwidth= 0 , ocrheight= 0
    Traceback (most recent call last):
      File "/usr/local/bin/HocrConverter", line 179, in <module>
        hocr.to_pdf(sys.argv[2], sys.argv[3])
      File "/usr/local/bin/HocrConverter", line 126, in to_pdf
        ocr_dpi = (ocrwidth/width, ocrheight/height)
    ZeroDivisionError: integer division or modulo by zero

    I generated my $HOCR with ocropus:
    $ ocropus book2pages $DIR $IMAGE
    $ ocropus pages2lines $DIR
    $ ocropus lines2fsts $DIR
    $ ocropus fsts2bestpaths $DIR
    $ ocropus fsts2text $DIR
    $ ocropus buildhtml $DIR \
    | tr 'A12' '12A' \
    | sed 's% Transitional//ENA http://www%Transitional//EN"A "http://www%' \
    | tr 'A12' '12A' \
    > $HOCR
    $ sed -e '2s:$:":' -e '3s:h:"h:' -i $HOCR

    Jeremy | 2009-12-02 23:36 | Permalink

  12. I re-read Reply #5.
    Somehow I missed the part about using pdftk to add the hocr-text.pdf
    as a background to the original image.pdf

    hocrtopdf works like a charm!. Thanks.

    Jeremy | 2009-12-03 11:25 | Permalink
