Convert hOCR to PDF

As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? It is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat that uses the class and title attributes of div and span tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="ocr_line ocr_page" name="ocr-capabilities"/>
    <meta content="en" name="ocr-langs"/>
    <meta content="Latn" name="ocr-scripts"/>
    <meta content="" name="ocr-microformats"/>
    <title>OCR Output</title>
  </head>
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>

For more details, see Thomas Breuel’s complete hOCR specification draft. The example, though, shows all that I need to know. From the div[@class='ocr_page'], I can learn the dimensions (in pixels) of the image that was OCRed (as well as the path to that image on my machine). From a span[@class='ocr_line'], I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.

That information, along with a copy of the original image (or a sufficiently similar image), is enough to create a PDF of the image with selectable text.
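To make the bounding-box idea concrete, here is a minimal sketch (not the HocrConverter code itself) that reads the page box and the line boxes out of a trimmed copy of the example document. It is written for Python 3 with only the standard library; the names `parse_bbox` and `read_hocr` are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches "bbox x0 y0 x1 y1" inside an hOCR title attribute.
BBOX = re.compile(r"bbox((\s+\d+){4})")

# Trimmed copy of the example document (DOCTYPE and head left out).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 2548 3300; image /path/to/scanned/image.png">
      <span class="ocr_line" title="bbox 659 143 863 177">Some Text</span>
      <span class="ocr_line" title="bbox 723 275 916 324">More Text</span>
    </div>
  </body>
</html>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title attribute, or None if absent."""
    match = BBOX.search(title or "")
    if match is None:
        return None
    return tuple(int(n) for n in match.group(1).split())


def read_hocr(source):
    """Return the page bbox and a list of (bbox, text) pairs, one per line."""
    root = ET.fromstring(source)
    page_box = None
    lines = []
    # Checking the class attribute on every element sidesteps namespace prefixes.
    for elem in root.iter():
        cls = elem.get("class")
        if cls == "ocr_page" and page_box is None:
            page_box = parse_bbox(elem.get("title"))
        elif cls == "ocr_line":
            text = "".join(elem.itertext()).strip()
            lines.append((parse_bbox(elem.get("title")), text))
    return page_box, lines


page_box, lines = read_hocr(SAMPLE)
print(page_box)
for box, text in lines:
    print(box, text)
```

The coordinates are pixel positions measured from the top-left corner of the OCRed image, which matters when placing text in a PDF.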

Creating a PDF

A while back, Florian Hackenberger created a basic hOCR to PDF converter in Java. It does its job reasonably well, but it’s somewhat rough around the edges. Since I’m more familiar with Python than Java, it seemed like a good idea to rewrite the code in Python, so I could more comfortably hack it.

And so I present to you, dear reader, the hOCR Converter Python script (note: this script depends on the ReportLab PDF Library, which, in turn, depends on the FreeType 2 Font Engine). Right now, it has two primary functions:

  1. Given an hOCR file, create a text-only (i.e., no HTML) document
  2. Given an hOCR file and an image, create a PDF

Usage is pretty simple:

from HocrConverter import HocrConverter
hocr = HocrConverter("myHocrFile.html")
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

You end up with a text file at output.txt containing the contents of the body of the hOCR document and a PDF file at output.pdf containing your image superimposed on top of the text.

Note that the image you use to create the PDF need not be the same image used for the OCR. If, for example, you used a 300 or 400 DPI image for the OCR, but you want a smaller file for the PDF, you can create a 72 DPI version of the image and feed that through the script instead of the original.
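The reason a different image works is that text is placed in physical units rather than image pixels. The following sketch shows the arithmetic under the assumption that the OCR was run at 300 DPI (the `ocr_dpi` default here is an assumption, and the function names are invented for the illustration): pixel coordinates from the hOCR file are divided by the OCR resolution and multiplied by 72 to get PDF points, and the vertical axis is flipped because PDF measures from the bottom-left corner.

```python
POINTS_PER_INCH = 72.0


def page_size_points(ocr_width_px, ocr_height_px, ocr_dpi=300.0):
    """Size of the PDF page in points for an OCR page box given in pixels."""
    scale = POINTS_PER_INCH / ocr_dpi
    return (ocr_width_px * scale, ocr_height_px * scale)


def bbox_to_points(bbox, page_height_px, ocr_dpi=300.0):
    """Convert an hOCR bbox (top-left origin, pixels) to PDF points.

    Returns (left, bottom, right, top) measured from the bottom-left corner.
    """
    x0, y0, x1, y1 = bbox
    scale = POINTS_PER_INCH / ocr_dpi
    return (
        x0 * scale,
        (page_height_px - y1) * scale,
        x1 * scale,
        (page_height_px - y0) * scale,
    )


# The page and first line from the example hOCR document.
print(page_size_points(2548, 3300))
print(bbox_to_points((659, 143, 863, 177), page_height_px=3300))
```

Whatever image is drawn on the page is simply stretched to the same page size, so its own pixel dimensions never enter the text placement.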

With a simple script that iterates through a directory of images, calling OCRopus and this script for each image (maybe with some ImageMagick and pdftk thrown in for good measure), you can quickly OCR a large batch of images and convert them to searchable PDF documents.
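A batch driver along those lines might look like the sketch below. It covers only the bookkeeping: it assumes the OCR step has already written an hOCR file next to each image with the same base name, and it takes the conversion step as a callable so nothing about OCRopus, ImageMagick, or pdftk is hard-coded. The names `plan_batch` and `convert_all`, and the `.html` suffix, are assumptions made for this example.

```python
import os

IMAGE_EXTENSIONS = (".png", ".tif", ".tiff", ".jpg", ".jpeg")


def plan_batch(directory, hocr_suffix=".html"):
    """Return (image, hocr, pdf) path triples for images with a matching hOCR file."""
    jobs = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTENSIONS:
            continue
        hocr_path = os.path.join(directory, stem + hocr_suffix)
        if os.path.exists(hocr_path):
            jobs.append((
                os.path.join(directory, name),
                hocr_path,
                os.path.join(directory, stem + ".pdf"),
            ))
    return jobs


def convert_all(jobs, convert):
    """Call convert(hocr_path, image_path, pdf_path) for every planned job."""
    for image_path, hocr_path, pdf_path in jobs:
        convert(hocr_path, image_path, pdf_path)


# With the script from this post, the callable could be:
#
#     from HocrConverter import HocrConverter
#
#     def convert(hocr_path, image_path, pdf_path):
#         HocrConverter(hocr_path).to_pdf(image_path, pdf_path)
#
#     convert_all(plan_batch("scans"), convert)
```

Images without a matching hOCR file are skipped rather than converted to image-only PDFs; that is a design choice of the sketch, not of the script.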

Future Development

I’d like to add the capability to convert the output of various OCR programs into hOCR format. For example, OmniPage, which is much better than OCRopus when it comes to layout analysis for complex images, can output documents in its proprietary XML schema. I should be able to transform that into an hOCR document and then use this script to create a PDF from there.

I would welcome any other suggestions or contributions.

Download the HocrConverter script from github

26 Responses to Convert hOCR to PDF

  1. Pablo says:

    Hi,
    Interesting. Do you know how to apply a data extractor? I would like to use Ocropus to extract information which is in determined places inside the document.

    Greetings

  2. cputter says:

    Hi,

    This might be a stupid question, but why include the original scanned image in the result pdf? If there are photos in the document I’d understand…

  3. @cputter: I’m typically not interested in the actual text except insofar as it is an aid to discovery. I’m usually working on cultural heritage digitization projects (see http://libx.bsu.edu). The goal is to provide a reasonably accurate digital surrogate for the original physical item, so people can see not just what the item said, but what it looked like (which will, often, include images or marginalia). OCRing the text and embedding it with the scanned image in a PDF provides a representation of the item that can be discovered by searching the full OCRed text.

  4. d says:

    If you replace text.textLine by text.textLines, no more “s” is prepended to each line (due to the “\n” character added at beginning and end of each line of text in output of `ocropus buildhtml mydir`).

    I use a modified version of your script, because resolution and image size were not properly found in my images. I now type something like: python HocrConverter b.hocr c.pdf 300 300 2488 3507, with the syntax python HocrConverter.py inputHocrFile outputPdfFile imagedpix imagedpiy imagepixelwidth imagepixelheight

    And i include the image with pdftk image.pdf background text.pdf output result.pdf

  5. François Elie says:

    Very beautiful job

    I work on a simple script using scanadf+unpaper+ocropus+HocrConverter+gs for a simple and complete chain

    I have two question:
    1)
    do you know how/why, in a multicolumn document, the pdf browser could avoid consider contigous lines fragments as from the same text ?
    in the hOcr file the columns are presented in the right order

    2)
    do you consider cuneiform as ocr (with hocr2pdf)

    Best regards

  6. d says:

    François Elie, your bug comes from gs: I do not have your problem in my xpdf or my acroread. Try pdftops or pdf2ps instead of gs ?

  7. d says:

    xplus3, do you allow me to leave my modified version of your HocrConverter available for free download at http://bugs.gentoo.org/show_bug.cgi?id=185810#c32 ?

    Can you set your HocrConverter version on GPL ?

  8. @d: HocrConverter is © 2009 Jonathan Brinley and available under the MIT License. I really should add a copyright statement to the file. Maybe next week…

  9. Jeremy says:

    Can I run your HocrConverter script from bash as a oneliner?

    HOCR="temp.hocr"; TEXT="temp.text"; IMAGE="test.tiff"; OUT="test.pdf"

    python -c "from HocrConverter import HocrConverter; hocr = HocrConverter(\""$HOCR"\"); hocr.to_text(\""$TEXT"\"); hocr.to_pdf(\""$IMAGE"\", \""$OUT"\");"

    How do I make python find your HocrConverter module?

    I tried the Gentoo ebuild, but their hocrtopdf creates a new pdf with the found
    ocr text written in Helvetica (or Courier) with reportlab. I just want the
    original unmodified scanned image overlayed with the hidden ocr text.

  10. Jeremy says:

    I got confused between Gentoo’s modified python script hocrtopdf and
    your original HocrConverter. The scripts take different commandline arguments:

    $ python /usr/local/bin/HocrConverter
    Usage: python HocrConverter.py inputHocrFile inputImageFile outputPdfFile

    $ /usr/local/bin/hocrtopdf
    Usage: hocrtopdf inputHocrFile fontHelveticaORCourier outputPdfFile imagedpix imagedpiy imagepixelwidth imagepixelheight oneifisotropic

    The only problem I’m having with HocrConverter, is that I’m getting a division
    by zero exception because my hocr file is missing ocrwidth/ocrheight
    info:

    # get dimensions of the OCR, which may not match the image
    if self.hocr is not None:
        for div in self.hocr.findall(".//%sdiv"%(self.xmlns)):
            if div.attrib['class'] == 'ocr_page':
                coords = self.element_coordinates(div)
                ocrwidth = coords[2]-coords[0]
                ocrheight = coords[3]-coords[1]

    which results in the division by zero error:

    HOCR="pg_0001.hocr"; IMAGE="image.png"; OUT="pg_0001-new.pdf"

    $ HocrConverter $HOCR $IMAGE $OUT
    DEBUG: ocrwidth= 0 , ocrheight= 0
    Traceback (most recent call last):
      File "/usr/local/bin/HocrConverter", line 179, in <module>
        hocr.to_pdf(sys.argv[2], sys.argv[3])
      File "/usr/local/bin/HocrConverter", line 126, in to_pdf
        ocr_dpi = (ocrwidth/width, ocrheight/height)
    ZeroDivisionError: integer division or modulo by zero

    I generated my $HOCR with ocropus:
    $ ocropus book2pages $DIR $IMAGE
    $ ocropus pages2lines $DIR
    $ ocropus lines2fsts $DIR
    $ ocropus fsts2bestpaths $DIR
    $ ocropus fsts2text $DIR
    $ ocropus buildhtml $DIR \
    | tr ‘A12′ ’12A’ \
    | sed ‘s% Transitional//ENA http://www%Transitional//EN“A “http://www%’ \
    | tr ‘A12′ ’12A’ \
    > $HOCR
    $ sed -e '2s:$:":' -e '3s:h:"h:' -i $HOCR
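The traceback above comes from a page box of width and height zero. One way to guard against it is sketched below, under two assumptions that are not part of the original script: when the hOCR page box is missing or empty, the OCR is taken to have been run on the supplied image itself, and when the image carries no DPI information, 300 DPI is assumed. `page_geometry` is a hypothetical helper, not a function of HocrConverter.

```python
def page_geometry(image_size_px, image_dpi=None, ocr_page_bbox=None, assumed_dpi=300.0):
    """Return ((width_in, height_in), (ocr_dpi_x, ocr_dpi_y)).

    image_size_px must be a positive (width, height) pair in pixels.
    """
    img_w, img_h = image_size_px

    ocr_w = ocr_h = 0
    if ocr_page_bbox is not None:
        ocr_w = ocr_page_bbox[2] - ocr_page_bbox[0]
        ocr_h = ocr_page_bbox[3] - ocr_page_bbox[1]
    if ocr_w <= 0 or ocr_h <= 0:
        # Assumption: no usable page box means the OCR saw this very image.
        ocr_w, ocr_h = img_w, img_h

    if image_dpi is not None and image_dpi[0] > 0 and image_dpi[1] > 0:
        width_in = img_w / float(image_dpi[0])
        height_in = img_h / float(image_dpi[1])
    else:
        # Assumption: no DPI in the image, so treat the OCR as done at assumed_dpi.
        width_in = ocr_w / float(assumed_dpi)
        height_in = ocr_h / float(assumed_dpi)

    # Float division throughout, so the result is never truncated to zero.
    return (width_in, height_in), (ocr_w / width_in, ocr_h / height_in)


# An empty page box, as in the DEBUG line above, no longer divides by zero.
print(page_geometry((2548, 3300), image_dpi=None, ocr_page_bbox=(0, 0, 0, 0)))
```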

  11. Jeremy says:

    I re-read Reply #5.
    Somehow I missed the part about using pdftk to add the hocr-text.pdf
    as a background to the original image.pdf

    hocrtopdf works like a charm!. Thanks.

  12. d says:

    @Jeremy, in #15: you only need one sed command, theoretically, just to address a bug of ocropus.

  13. ziegi says:

    Is the HocrConverter script still available ?
    The download link gives me a 404 page ;-(
    Thanks

  14. Jason says:

    I’d like to try it out, but I’m also getting a 404.

  15. Apologies for the downtime. HocrConverter is now available from github: http://github.com/jbrinley/HocrConverter

  16. SRG says:

    I tried, and i can’t use it with a .hocr file produced by cuneiform (which works way better than tesseract / ocropus for me).
    The problem is that the html output produced by cuneiform is not very well formed (missing “” around the parameter of the src tag, missing last / in meta element, and so on).
    And as python is (in my opinion) one of the biggest crap ever made (whatever the tool is, if it is writen in python, i’ll have a lot of troubles having everything working), i can’t use your tool (i get a lot of “xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 25″, and so on). So bad.
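For hOCR that is not well-formed XML, one possible workaround is a tolerant HTML parser instead of an XML parser. The sketch below uses Python 3's standard `html.parser` module to collect line boxes and text. The sample markup is invented to imitate the kinds of problems described above (unquoted attribute values, unclosed elements); it is not actual cuneiform output, and `HocrLineCollector` is a hypothetical class rather than part of HocrConverter.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox((\s+\d+){4})")

# Invented example of sloppy markup: unquoted values, unclosed meta/img/body.
SAMPLE = """<html><head><meta name=ocr-system content=example>
<body><div class=ocr_page title="bbox 0 0 2548 3300"><img src=page.png>
<span class='ocr_line' title="bbox 659 143 863 177">Some <b>Text</b></span>
<span class=ocr_line title="bbox 723 275 916 324">More Text</span>
</div>"""


class HocrLineCollector(HTMLParser):
    """Collect (bbox, text) for every ocr_line span."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._bbox = None
        self._parts = []
        self._depth = 0  # span nesting inside the current line; 0 means outside

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A nested span (e.g. a word) inside the line being collected.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            self._bbox = tuple(int(n) for n in match.group(1).split()) if match else None
            self._parts = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._parts).split())
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


collector = HocrLineCollector()
collector.feed(SAMPLE)
collector.close()
for bbox, text in collector.lines:
    print(bbox, text)
```

Running the file through HTML Tidy first, as suggested in a reply further down, is another route to the same result.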

  17. Manjunath says:

    Hi,

    Thanks for the great script. It works in Ubuntu 9.10, but in 10.04 I get the following error.

    Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from HocrConverter import HocrConverter
    >>> hocr = HocrConverter("/tmp/Mytest-1.html")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.6/HocrConverter.py", line 31, in __init__
        self.parse_hocr(hocrFileName)
      File "/usr/lib/python2.6/HocrConverter.py", line 76, in parse_hocr
        self.hocr.parse(hocrFileName)
      File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 587, in parse
        self._root = parser.close()
      File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1254, in close
        self._parser.Parse("", 1) # end of data
    xml.parsers.expat.ExpatError: no element found: line 1, column 0

    Ubuntu version is 10.04 ( 64 Bit ).

    Any help would be appreciated.

    Regards,
    Manjunath

    • Sam says:

      @Manjunath:

      Not sure, but it is possible your hocr is not in clean xhtml? I was able to work around the same issue by preceding execution of hocrconverter with a “tidy -imu -asxml my_hocr.html”.

      Unfortunately, I ran into another problem:


      File "/Users/someUser/somePath/HocrConverter.py", line 152, in to_pdf
      text.setHorizScale((((float(coords[2])/ocr_dpi[0]*inch)-(float(coords[0])/ocr_dpi[0]*inch))/pdf.stringWidth(line.text.rstrip(), fontname, fontsize))*100)
      AttributeError: 'NoneType' object has no attribute 'rstrip'

      The text element apparently can be of “NoneType” (not a python dude yet :) which causes the script to trip up at “rstrip()” because I assume a String is expected.
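In ElementTree, an element's `text` is `None` when the element starts with a child element instead of character data, which is what happens when each line wraps its words in nested spans. A minimal sketch of a None-safe way to get the line's text follows; `line_text` is a hypothetical helper and the sample markup is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def line_text(line_elem):
    """Return all text inside the element, nested children included.

    Whitespace is collapsed, and an element with no text yields ''.
    """
    return " ".join("".join(line_elem.itertext()).split())


# A line whose words live in nested spans: line.text itself is None here.
NESTED = ET.fromstring(
    '<span class="ocr_line" title="bbox 659 143 863 177">'
    '<span class="ocr_word"><span class="xocr_word">Some</span></span> '
    '<span class="ocr_word"><span class="xocr_word">Text</span></span>'
    '</span>'
)
EMPTY = ET.fromstring('<span class="ocr_line" title="bbox 0 0 10 10"/>')

print(NESTED.text is None)
print(repr(line_text(NESTED)))
print(repr(line_text(EMPTY)))
```

A caller would still want to skip lines whose text comes back empty, since the horizontal scaling step divides by the rendered width of the string.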

  18. Josh says:

    Thanks! Great tool. I had to modify it slightly to get it to work with hocr produced by tesseract. At least in my hocr files, the text of each line is nested within sub elements of the one currently referenced by the to_pdf method, one sub-element for each word (a span element with a class attribute of ‘ocr_word,’ which has a child span element with a class attribute of ‘ocrx_word,’ where the actual text of the word is held), so the calls to line.text would yield “none,” as the text was in a child element. Here’s the tweak I used:
    1) add as a sub-function of the to_pdf method the following function, which gathers the text from each of the sub-elements for the line element into a single string:
    def get_full_line(line_elem):
        result = ""
        for word_unit in line_elem.findall('span[@class="ocr_word"]'):
            for word in word_unit.findall('span[@class="ocrx_word"]'):
                result = result + word.text + ' '
        return result

    2) change the following portion of the to_pdf method as indicated in order to call the above function and use the returned line text, rather than try to retrieve this text directly from the element corresponding to the entire line:

    if self.hocr is not None:
        for line in self.hocr.findall(".//%sspan"%(self.xmlns)):
            if line.attrib['class'] == 'ocr_line':
                line_text = get_full_line(line)
                coords = self.element_coordinates(line)
                text = pdf.beginText()
                text.setFont(fontname, fontsize)
                text.setTextRenderMode(3) # invisible

                # set cursor to bottom left corner of line bbox (adjust for dpi)
                text.setTextOrigin((float(coords[0])/ocr_dpi[0])*inch, (height*inch)-(float(coords[3])/ocr_dpi[1])*inch)

                # scale the width of the text to fill the width of the line's bbox

                text.setHorizScale((((float(coords[2])/ocr_dpi[0]*inch)-(float(coords[0])/ocr_dpi[0]*inch))/pdf.stringWidth(line_text, fontname, fontsize))*100)
                # write the text to the page
                text.textLine(line_text)
                pdf.drawText(text)

    After making these two minor changes, it worked great. Thanks for sharing this. It saved me a lot of time.

  19. Jason says:

    Josh, thanks for your changes. I tried them but had problems, as I am using Cygwin, which uses Python 2.6 and a version of ElementTree (1.2) that doesn’t support attributes in XPath selectors (e.g. span[@class="ocrx_word"]). Strangely enough, tesseract also produces hOCR HTML that contains spans with classes of “xocr_word” (not “ocrx_word” as you mentioned).

    Hence I’ve had to amend your script to swap in lxml (which supports and extends the ElementTree api ).

    from reportlab.pdfgen.canvas import Canvas
    from reportlab.lib.units import inch
    #http://stackoverflow.com/questions/7122461/updating-python-elementtree-to-overcome-xpath-selector-issue
    from lxml import etree as ElementTree
    #from xml.etree.ElementTree import ElementTree
    from PIL import Image
    import re
    import sys
    
    
    class HocrConverter():
        """
        A class for converting documents to/from the hOCR format.
    
        For details of the hOCR format, see:
    
          http://docs.google.com/View?docid=dfxcv4vc_67g844kf
    
        See also:
    
          http://code.google.com/p/hocr-tools/
    
        Basic usage:
    
        Create a PDF from an hOCR file and an image:
    
          hocr = HocrConverter("path/to/hOCR/file")
          hocr.to_pdf("path/to/image/file", "path/to/output/file")
    
        """
        def __init__(self, hocrFileName=None):
            self.hocr = None
            self.xmlns = ''
            self.boxPattern = re.compile('bbox((\s+\d+){4})')
            if hocrFileName is not None:
                self.parse_hocr(hocrFileName)
    
        def __str__(self):
            """
            Return the textual content of the HTML body
            """
            if self.hocr is None:
                return ''
            body = self.hocr.find(".//%sbody" % (self.xmlns))
            if body is not None:
                return self._get_element_text(body).encode('utf-8')  # XML gives unicode
            else:
                return ''
    
        def _get_element_text(self, element):
            """
            Return the textual content of the element and its children
            """
            text = ''
            if element.text is not None:
                text = text + element.text
            for child in element.getchildren():
                text = text + self._get_element_text(child)
            if element.tail is not None:
                text = text + element.tail
            return text
    
        def element_coordinates(self, element):
            """
            Returns a tuple containing the coordinates of the bounding box around
            an element
            """
            out = (0, 0, 0, 0)
            if 'title' in element.attrib:
                matches = self.boxPattern.search(element.attrib['title'])
                if matches:
                    coords = matches.group(1).split()
                    out = (int(coords[0]), int(coords[1]), int(coords[2]), int(coords[3]))
            return out
    
        def parse_hocr(self, hocrFileName):
            """
            Reads an XML/XHTML file into an ElementTree object
            """
            self.hocr = ElementTree.ElementTree()
            self.hocr.parse(hocrFileName)
    
            # if the hOCR file has a namespace, ElementTree requires its use to find elements
            matches = re.match('({.*})html', self.hocr.getroot().tag)
            if matches:
                self.xmlns = matches.group(1)
            else:
                self.xmlns = ''
    
        def to_pdf(self, imageFileName, outFileName, fontname="Courier", fontsize=8):
            """
            Creates a PDF file with an image superimposed on top of the text.
    
            Text is positioned according to the bounding box of the lines in
            the hOCR file.
    
            The image need not be identical to the image used to create the hOCR file.
            It can be scaled, have a lower resolution, different color mode, etc.
            """
            # http://xplus3.net/2009/04/02/convert-hocr-to-pdf/comment-page-1/#comment-2853
            def get_full_line(line_elem):
                result = ''
                for word_unit in line_elem.findall("%sspan[@class='ocr_word']" % (self.xmlns)):
                    for word in word_unit.findall('%sspan[@class="xocr_word"]' % (self.xmlns)):
                        result = result + word.text + ' '
                return result
    
            if self.hocr is None:
                # warn that no text will be embedded in the output PDF
                print "Warning: No hOCR file specified. PDF will be image-only."
    
            im = Image.open(imageFileName)
            imwidthpx, imheightpx = im.size
            if 'dpi' in im.info:
                width = float(im.size[0]) / im.info['dpi'][0]
                height = float(im.size[1]) / im.info['dpi'][1]
            else:
                # we have to make a reasonable guess
                # set to None for now and try again using info from hOCR file
                width = height = None
    
            ocr_dpi = (300, 300)  # a default, in case we can't find it
    
            # get dimensions of the OCR, which may not match the image
            if self.hocr is not None:
                for div in self.hocr.findall(".//%sdiv" % (self.xmlns)):
                    if div.attrib['class'] == 'ocr_page':
                        coords = self.element_coordinates(div)
                        ocrwidth = coords[2] - coords[0]
                        ocrheight = coords[3] - coords[1]
                        if width is None:
                            # no dpi info with the image
                            # assume OCR was done at 300 dpi
                            # (divide by a float: integer division would truncate the size in Python 2)
                            width = ocrwidth / 300.0
                            height = ocrheight / 300.0
                        ocr_dpi = (ocrwidth / width, ocrheight / height)
                        break  # there shouldn't be more than one, and if there is, we don't want it
    
            if width is None:
                # no dpi info with the image, and no help from the hOCR file either
                # this will probably end up looking awful, so issue a warning
                print "Warning: DPI unavailable for image %s. Assuming 96 DPI." % (imageFileName)
                width = float(im.size[0]) / 96
                height = float(im.size[1]) / 96
    
            # create the PDF file
            pdf = Canvas(outFileName, pagesize=(width * inch, height * inch), pageCompression=1)  # page size in points (1/72 in.)
    
            # put the image on the page, scaled to fill the page
            pdf.drawInlineImage(im, 0, 0, width=width * inch, height=height * inch)
    
            if self.hocr is not None:
                for line in self.hocr.findall(".//%sspan" % (self.xmlns)):
                    if line.attrib['class'] == 'ocr_line':
                        line_text = get_full_line(line)
                        coords = self.element_coordinates(line)
                        text = pdf.beginText()
                        text.setFont(fontname, fontsize)
                        text.setTextRenderMode(3)  # invisible
    
                        # set cursor to bottom left corner of line bbox (adjust for dpi)
                        text.setTextOrigin((float(coords[0]) / ocr_dpi[0]) * inch, (height * inch) - (float(coords[3]) / ocr_dpi[1]) * inch)
    
                        # scale the width of the text to fill the width of the line's bbox
                        text.setHorizScale((((float(coords[2]) / ocr_dpi[0] * inch) - (float(coords[0]) / ocr_dpi[0] * inch)) / pdf.stringWidth(line_text, fontname, fontsize)) * 100)
    
                        # write the text to the page
                        text.textLine(line_text)
                        pdf.drawText(text)
    
            # finish up the page and save it
            pdf.showPage()
            pdf.save()
    
        def to_text(self, outFileName):
            """
            Writes the textual content of the hOCR body to a file.
            """
            f = open(outFileName, "w")
            f.write(self.__str__())
            f.close()
    
    if __name__ == "__main__":
        if len(sys.argv) < 4:
            print 'Usage: python HocrConverter.py inputHocrFile inputImageFile outputPdfFile'
            sys.exit(1)
        hocr = HocrConverter(sys.argv[1])
        hocr.to_pdf(sys.argv[2], sys.argv[3])
    
  20. I made some edits to your script. It can now read the hocr format returned by tesseract 3.02. It also does better layout, taking into account the font descent. Words that are on the same line will now appear to be on the same line. There is also some conversion for special unicode characters that cause trouble (e.g., fi, fl).


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    '''
    1 - Convert PDF to PNG files
    2 - Tesseract PNG files to create HOCR
    3 - Run this program on each individual PNG/HOCR file
    4 - Combine PDF pages

    '''

    import logging
    import math
    import codecs
    from reportlab.pdfgen.canvas import Canvas
    from reportlab.lib.units import inch
    from reportlab.pdfbase.pdfmetrics import getDescent, getFont
    import xml.etree.cElementTree as ET
    import Image, re, sys

    draw_image = False
    hide_text = False

    draw_line_rect = False
    draw_word_rect = False
    bottom_by_line = True

    class HocrConverter():
        """
        A class for converting documents to/from the hOCR format.

        For details of the hOCR format, see:

          http://docs.google.com/View?docid=dfxcv4vc_67g844kf

        See also:

          http://code.google.com/p/hocr-tools/

        Basic usage:

        Create a PDF from an hOCR file and an image:

          hocr = HocrConverter("path/to/hOCR/file")
          hocr.to_pdf("path/to/image/file", "path/to/output/file")

        """
        def __init__(self, hocrFileName = None):
            self.hocr = None
            self.xmlns = ''
            self.boxPattern = re.compile('bbox((\s+\d+){4})')
            if hocrFileName is not None:
                self.parse_hocr(hocrFileName)

        def __str__(self):
            """
            Return the textual content of the HTML body
            """
            if self.hocr is None:
                return ''
            body = self.hocr.find(".//%sbody"%(self.xmlns))
            if body:
                return self._get_element_text(body).encode('utf-8') # XML gives unicode
            else:
                return ''

        def _get_element_text(self, element):
            """
            Return the textual content of the element and its children
            """
            text = ''
            if element.text is not None:
                text = text + element.text
            for child in element.getchildren():
                text = text + self._get_element_text(child)
            if element.tail is not None:
                text = text + element.tail
            return text

        def element_coordinates(self, element):
            """
            Returns a tuple containing the coordinates of the bounding box around
            an element
            """
            if 'title' in element.attrib:
                matches = self.boxPattern.search(element.attrib['title'])
                if matches:
                    coords = matches.group(1).split()
                    return (int(coords[0]),int(coords[1]),int(coords[2]),int(coords[3]))
            return None

        def parse_hocr(self, hocrFileName):
            """
            Reads an XML/XHTML file into an ElementTree object
            """
            self.hocr = ET.fromstring(open(hocrFileName, 'r').read())
            #self.hocr.parse(hocrFileName)

            # if the hOCR file has a namespace, ElementTree requires its use to find elements
            matches = re.match('({.*})html', self.hocr.tag)
            if matches:
                self.xmlns = matches.group(1)
            else:
                self.xmlns = ''

        def to_pdf(self, imageFileName, outFileName, fontname="Times-Roman", fontsize=8):
            """
            Creates a PDF file with an image superimposed on top of the text.

            Text is positioned according to the bounding box of the lines in
            the hOCR file.

            The image need not be identical to the image used to create the hOCR file.
            It can be scaled, have a lower resolution, different color mode, etc.
            """
            if self.hocr is None:
                # warn that no text will be embedded in the output PDF
                print "Warning: No hOCR file specified. PDF will be image-only."

            im = Image.open(imageFileName)
            imwidthpx, imheightpx = im.size
            if 'dpi' in im.info:
                width = float(im.size[0])/im.info['dpi'][0]
                height = float(im.size[1])/im.info['dpi'][1]
            else:
                # we have to make a reasonable guess
                # set to None for now and try again using info from hOCR file
                logging.info("No Image DPI Info, get from hOCR File")
                width = height = None

            ocr_dpi = (300, 300) # a default, in case we can't find it

            # get dimensions of the OCR, which may not match the image
            if self.hocr is not None:
                for div in self.hocr.findall(".//%sdiv"%(self.xmlns)):
                    if div.attrib['class'] == 'ocr_page':
                        coords = self.element_coordinates(div)
                        ocrwidth = coords[2]-coords[0]
                        ocrheight = coords[3]-coords[1]
                        if width is None:
                            # no dpi info with the image
                            # assume OCR was done at 300 dpi
                            width = ocrwidth/300.0
                            height = ocrheight/300.0
                        ocr_dpi = (ocrwidth/width, ocrheight/height)
                        break # there shouldn't be more than one, and if there is, we don't want it

            if width is None:
                # no dpi info with the image, and no help from the hOCR file either
                # this will probably end up looking awful, so issue a warning
                logging.error("DPI unavailable for image %s. Assuming 96 DPI."%(imageFileName))
                width = float(im.size[0])/96
                height = float(im.size[1])/96

            # create the PDF file
            pdf = Canvas(outFileName, pagesize=(width*inch, height*inch), pageCompression=1) # page size in points (1/72 in.)

            logging.info((width, height))
            # put the image on the page, scaled to fill the page
            if draw_image:
                pdf.drawInlineImage(im, 0, 0, width=width*inch, height=height*inch)

            if self.hocr is not None:
                for line in self.hocr.findall(".//%sspan"%(self.xmlns)):
                    if line.attrib['class'] in {'ocr_line', 'ocrx_line'}:
                        # Set the top and bottom of the bounding box for each line
                        coords = self.element_coordinates(line)
                        if not coords:
                            continue
                        bottom = inch*(height - float(coords[3])/ocr_dpi[1])
                        top = inch*(height - float(coords[1])/ocr_dpi[1])
                        box_height = top - bottom
                        metrics = getDescent(fontname)
                        # First guess the fontsize based on box height
                        fontsize = max(8, box_height)
                        # Adjust the bottom text but the descent amount
                        bottom -= getDescent(fontname, fontsize)
                        # Now get a more accurate fontsize taking into account the descent
                        fontsize = max(8, top - bottom)
                        if draw_line_rect:
                            right_line = inch*float(coords[2])/ocr_dpi[0]
                            left_line = inch*float(coords[0])/ocr_dpi[0]
                            pdf.rect(left_line, bottom, right_line-left_line, top-bottom)

                    if line.attrib['class'] in {'ocrx_word', 'ocr_word'}:
                        # print line_text.encode('ascii','ignore')
                        coords = self.element_coordinates(line)
                        if not coords:
                            continue
                        line_text = (u"".join(list(line.itertext()))
                                     ).replace(u"\uFB01",u"fi").replace(u"\uFB02",u"fl")
                        #print "".join(list(line.itertext()))
                        text = pdf.beginText()
                        if hide_text:
                            text.setTextRenderMode(3) # invisible

                        bot_text = inch*(height - float(coords[3])/ocr_dpi[1])
                        top_text = inch*(height - float(coords[1])/ocr_dpi[1])

                        right_text = inch*float(coords[2])/ocr_dpi[0]
                        left_text = inch*float(coords[0])/ocr_dpi[0]

                        # set cursor to bottom left corner of line bbox (adjust for dpi)
                        if bottom_by_line:
                            text.setTextOrigin(left_text, bottom-(top-bottom-fontsize))
                        else:
                            fontsize = max(8, math.ceil(top-bottom))
                            text.setTextOrigin(left_text, bot_text)

                        text.setFont(fontname, fontsize)

                        # scale the width of the text to fill the width of the line's bbox
                        stringwidth = pdf.stringWidth(line_text, fontname, fontsize)
                        if not stringwidth: stringwidth = 1

                        horiz_scale= 100.0*(right_text-left_text)/stringwidth
                        # print horiz_scale, stringwidth, right_text-left_text
                        text.setHorizScale(horiz_scale)

                        # write the text to the page
                        text.textLine(line_text)
                        pdf.drawText(text)

                        if draw_word_rect:

                            pdf.setStrokeColorRGB(0.6,0.4,0.9)
                            pdf.roundRect(left_text, bot_text, right_text-left_text, top_text-bot_text, radius=3)
                            pdf.setStrokeColorRGB(0.0,0.0,0.0)

            # finish up the page and save it
            pdf.showPage()
            pdf.save()

        def to_text(self, outFileName):
            """
            Writes the textual content of the hOCR body to a file.
            """
            f = open(outFileName, "w")
            f.write(self.__str__())
            f.close()

    if __name__ == "__main__":
        logging.basicConfig(
            level = logging.DEBUG,
            format = '%(asctime)s %(levelname)s %(message)s',
        )
        if len(sys.argv) < 4:
            print 'Usage: python HocrConverter.py inputHocrFile inputImageFile outputPdfFile'
            sys.exit(1)
        hocr = HocrConverter(sys.argv[1])
        hocr.to_pdf(sys.argv[2], sys.argv[3])
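The ligature conversion mentioned in this comment can be isolated into a small helper. The sketch below is written for Python 3; the script above handles only the "fi" (U+FB01) and "fl" (U+FB02) ligatures, while the other three entries in the table are an extension added here as an assumption about what an OCR engine might emit. `expand_ligatures` is a hypothetical name.

```python
# Latin ligature code points and their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",   # handled by the script above
    "\ufb02": "fl",   # handled by the script above
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace single-character ligatures with their separate letters."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


print(expand_ligatures("\ufb01nal \ufb02ow"))
```

Expanding the ligatures before writing the hidden text layer means a search for an ordinary word will match it even though the OCR engine returned a single ligature character.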

    • ThorX89 says:

      Hi, would you be willing to repost it verbatim or post a link to the code? The formatting must have gotten lost when you posted it.

    • Andy E says:

      Since there hasn’t been any input for 2 months, I thought I’d make an attempt. Although I got the script running without errors, there might still be logic errors (e. g. that originally two if-blocks were nested and now they’re not nested anymore)

      I’ve just fixed it from my (programmer’s) common sense I’ve learned from other languages. I’m a total rookie in Python, but finally I know now that indenting is extremely mandatory in this language, since you can easily change the logic with only one space less.

      http://pastebin.com/4URWRNPu

  21. dhartford says:

    Future Development

    I’d like to add the capability to convert the output of various OCR programs into hOCR format. For example, OmniPage, which is much better than OCRopus when it comes to layout analysis for complex images, can output documents in its proprietary XML schema. I should be able to transform that into an hOCR document and then use this script to create a PDF from there.

    I would welcome any other suggestions or contributions.

    I too have a strong desire to see more OCR engines supported, or to have their output converted to hOCR format. That would mean not only easier integration for PDF conversion, as you demonstrated, but also more possibilities for OCR voting.

  22. Anuj says:

    thanku soo much it helped me a lot :)
