<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>x + 3 &#187; hOCR</title>
	<atom:link href="http://xplus3.net/tag/hocr/feed/" rel="self" type="application/rss+xml" />
	<link>http://xplus3.net</link>
	<description></description>
	<lastBuildDate>Fri, 19 Aug 2011 01:05:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>Convert hOCR to PDF</title>
		<link>http://xplus3.net/2009/04/02/convert-hocr-to-pdf/</link>
		<comments>http://xplus3.net/2009/04/02/convert-hocr-to-pdf/#comments</comments>
		<pubDate>Thu, 02 Apr 2009 20:05:46 +0000</pubDate>
		<dc:creator>Jonathan Brinley</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[hOCR]]></category>
		<category><![CDATA[HocrConverter]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[OCRopus]]></category>
		<category><![CDATA[PDF]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://xplus3.net/?p=207</guid>
		<description><![CDATA[As I mentioned recently, the OCRopus OCR software outputs an hOCR file. What is hOCR? hOCR is an open standard for representing OCR results in an HTML document (not to be confused with HOCR). It is basically a microformat using div and span tags&#8217; class and title attributes to convey information from the OCR process, as you can see from this &#8230; <a href="http://xplus3.net/2009/04/02/convert-hocr-to-pdf/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>As <a href="http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/">I mentioned recently</a>, the OCRopus OCR software outputs an hOCR file. What is hOCR? hOCR is an open standard for representing OCR results in an HTML document (not to be confused with <a href="http://en.wikipedia.org/wiki/HOCR_(software)"><abbr title="Hebrew Optical Character Recognition">HOCR</abbr></a>). It is basically a microformat that uses the <code>class</code> and <code>title</code> attributes of <code>div</code> and <code>span</code> tags to convey information from the OCR process, as you can see from this example of a basic hOCR document:</p>


<div class="wp-geshi-highlight-wrap5"><div class="wp-geshi-highlight-wrap4"><div class="wp-geshi-highlight-wrap3"><div class="wp-geshi-highlight-wrap2"><div class="wp-geshi-highlight-wrap"><div class="wp-geshi-highlight"><div class="xml"><pre class="de1"><span class="sc0">&lt;!DOCTYPE html</span>
<span class="sc0">  PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot;</span>
<span class="sc0">  &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;</span>
<span class="sc3"><span class="re1">&lt;html</span> <span class="re0">xmlns</span>=<span class="st0">&quot;http://www.w3.org/1999/xhtml&quot;</span><span class="re2">&gt;</span></span>
  <span class="sc3"><span class="re1">&lt;head<span class="re2">&gt;</span></span></span>
    <span class="sc3"><span class="re1">&lt;meta</span> <span class="re0">content</span>=<span class="st0">&quot;ocr_line ocr_page&quot;</span> <span class="re0">name</span>=<span class="st0">&quot;ocr-capabilities&quot;</span><span class="re2">/&gt;</span></span>
    <span class="sc3"><span class="re1">&lt;meta</span> <span class="re0">content</span>=<span class="st0">&quot;en&quot;</span> <span class="re0">name</span>=<span class="st0">&quot;ocr-langs&quot;</span><span class="re2">/&gt;</span></span>
    <span class="sc3"><span class="re1">&lt;meta</span> <span class="re0">content</span>=<span class="st0">&quot;Latn&quot;</span> <span class="re0">name</span>=<span class="st0">&quot;ocr-scripts&quot;</span><span class="re2">/&gt;</span></span>
    <span class="sc3"><span class="re1">&lt;meta</span> <span class="re0">content</span>=<span class="st0">&quot;&quot;</span> <span class="re0">name</span>=<span class="st0">&quot;ocr-microformats&quot;</span><span class="re2">/&gt;</span></span>
    <span class="sc3"><span class="re1">&lt;title<span class="re2">&gt;</span></span></span>OCR Output<span class="sc3"><span class="re1">&lt;/title<span class="re2">&gt;</span></span></span>
  <span class="sc3"><span class="re1">&lt;/head<span class="re2">&gt;</span></span></span>
  <span class="sc3"><span class="re1">&lt;body<span class="re2">&gt;</span></span></span>
    <span class="sc3"><span class="re1">&lt;div</span> <span class="re0">class</span>=<span class="st0">&quot;ocr_page&quot;</span> <span class="re0">title</span>=<span class="st0">&quot;bbox 0 0 2548 3300; image /path/to/scanned/image.png&quot;</span><span class="re2">&gt;</span></span>
      <span class="sc3"><span class="re1">&lt;span</span> <span class="re0">class</span>=<span class="st0">&quot;ocr_line&quot;</span> <span class="re0">title</span>=<span class="st0">&quot;bbox 659 143 863 177&quot;</span><span class="re2">&gt;</span></span>Some Text<span class="sc3"><span class="re1">&lt;/span<span class="re2">&gt;</span></span></span>
      <span class="sc3"><span class="re1">&lt;span</span> <span class="re0">class</span>=<span class="st0">&quot;ocr_line&quot;</span> <span class="re0">title</span>=<span class="st0">&quot;bbox 723 275 916 324&quot;</span><span class="re2">&gt;</span></span>More Text<span class="sc3"><span class="re1">&lt;/span<span class="re2">&gt;</span></span></span>
    <span class="sc3"><span class="re1">&lt;/div<span class="re2">&gt;</span></span></span>
  <span class="sc3"><span class="re1">&lt;/body<span class="re2">&gt;</span></span></span>
<span class="sc3"><span class="re1">&lt;/html<span class="re2">&gt;</span></span></span></pre></div></div></div></div></div></div></div>


<p>For more details, see <a href="http://docs.google.com/View?docid=dfxcv4vc_67g844kf">Thomas Breuel&#8217;s complete hOCR specification draft</a>. The example, though, shows all that I need to know. From the <code>div[@class='ocr_page']</code>, I can learn the dimensions (in pixels) of the image that was OCRed (as well as the path to that image on my machine). From a <code>span[@class='ocr_line']</code>, I can learn the location and dimensions of the bounding box around a line of recognized text, as well as the content of that line.</p>
<p>That information, along with a copy of the original image (or a sufficiently similar image), is enough to create a PDF of the image with selectable text.<br />
<span id="more-207"></span></p>
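<p>To make that concrete, here is a minimal Python sketch of pulling those values out of a <code>title</code> attribute. It is an illustration only, not code from any existing tool: the helper names <code>parse_hocr_title</code> and <code>bbox_of</code> are made up for this example, and it assumes the <code>bbox x0 y0 x1 y1</code> layout (pixels, measured from the top-left corner) seen in the sample document above.</p>

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a dict of property name to value string.

    Hypothetical helper for illustration; properties are separated by ';'
    and each one is a name followed by its space-separated values.
    """
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def bbox_of(title):
    """Return (x0, y0, x1, y1) in pixels from a title attribute."""
    x0, y0, x1, y1 = (int(n) for n in parse_hocr_title(title)["bbox"].split())
    return x0, y0, x1, y1


# Title attributes copied from the sample hOCR document.
page_title = "bbox 0 0 2548 3300; image /path/to/scanned/image.png"
line_title = "bbox 659 143 863 177"

px0, py0, px1, py1 = bbox_of(page_title)
page_width, page_height = px1 - px0, py1 - py0    # size of the scanned image
image_path = parse_hocr_title(page_title)["image"]

lx0, ly0, lx1, ly1 = bbox_of(line_title)
line_width, line_height = lx1 - lx0, ly1 - ly0    # size of the text line box

print(page_width, page_height, image_path)
print(line_width, line_height)
```

<p>A real converter would read these attributes with an HTML or XML parser rather than from hard-coded strings; the string handling is the same either way.</p>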
<h3>Creating a PDF</h3>
<p>A while back, Florian Hackenberger created a <a href="http://groups.google.com/group/ocropus/browse_thread/thread/3cf464bda5807952">basic hOCR to PDF converter in Java</a>. It does its job reasonably well, but it&#8217;s somewhat rough around the edges. Since I&#8217;m more familiar with Python than Java, it seemed like a good idea to rewrite the code in Python, so I could more comfortably hack on it.</p>
<p>And so I present to you, dear reader, the <a href="http://github.com/jbrinley/HocrConverter">hOCR Converter Python script</a> (note: this script depends on the <a href="http://www.reportlab.org/downloads.html#reportlab">ReportLab PDF Library</a>, which, in turn, depends on the <a href="http://www.freetype.org/">FreeType 2 Font Engine</a>). Right now, it has two primary functions:</p>
<ol>
<li>Given an hOCR file, create a text-only (<em>i.e.</em>, no HTML) document</li>
<li>Given an hOCR file and an image, create a PDF</li>
</ol>
<p>Usage is pretty simple:</p>


<div class="wp-geshi-highlight-wrap5"><div class="wp-geshi-highlight-wrap4"><div class="wp-geshi-highlight-wrap3"><div class="wp-geshi-highlight-wrap2"><div class="wp-geshi-highlight-wrap"><div class="wp-geshi-highlight"><div class="python"><pre class="de1"><span class="kw1">from</span> HocrConverter <span class="kw1">import</span> HocrConverter
hocr <span class="sy0">=</span> HocrConverter<span class="br0">&#40;</span><span class="st0">&quot;myHocrFile.html&quot;</span><span class="br0">&#41;</span>
hocr.<span class="me1">to_text</span><span class="br0">&#40;</span><span class="st0">&quot;output.txt&quot;</span><span class="br0">&#41;</span>
hocr.<span class="me1">to_pdf</span><span class="br0">&#40;</span><span class="st0">&quot;myImageFile.png&quot;</span><span class="sy0">,</span> <span class="st0">&quot;output.pdf&quot;</span><span class="br0">&#41;</span></pre></div></div></div></div></div></div></div>


<p>You end up with a text file at <code>output.txt</code> containing the contents of the <code>body</code> of the hOCR document and a PDF file at <code>output.pdf</code> containing your image superimposed on top of the text.</p>
<p>Note that the image you use to create the PDF need not be the same image used for the OCR. If, for example, you used a 300 or 400 DPI image for the OCR, but you want a smaller file for the PDF, you can create a 72 DPI version of the image and feed that through the script instead of the original.</p>
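<p>One way to see why a lower-resolution image can stand in: a PDF page is measured in points (72 to the inch) with the origin at the bottom-left corner, while hOCR boxes are in pixels from the top-left. The sketch below shows that arithmetic. It is my own illustration of the general technique, not a description of what the HocrConverter script does internally; the function names are hypothetical, and the 300 DPI scan resolution is an assumption about the sample page.</p>

```python
POINTS_PER_INCH = 72.0


def px_to_pt(pixels, dpi):
    """Convert a length in pixels at the given DPI to PDF points."""
    return pixels * POINTS_PER_INCH / dpi


def line_box_in_points(line_bbox, page_bbox, ocr_dpi):
    """Map an hOCR line bbox to (x, y, width, height) in PDF points.

    hOCR measures y downward from the top of the page; PDF measures y
    upward from the bottom, so the vertical position is flipped.
    The returned y is the bottom edge of the line box.
    """
    x0, y0, x1, y1 = line_bbox
    page_height_px = page_bbox[3] - page_bbox[1]
    x = px_to_pt(x0, ocr_dpi)
    y = px_to_pt(page_height_px - y1, ocr_dpi)
    width = px_to_pt(x1 - x0, ocr_dpi)
    height = px_to_pt(y1 - y0, ocr_dpi)
    return x, y, width, height


# Boxes from the sample hOCR document; 300 DPI is an assumed scan resolution.
page = (0, 0, 2548, 3300)
line = (659, 143, 863, 177)
ocr_dpi = 300

page_size_pt = (px_to_pt(page[2], ocr_dpi), px_to_pt(page[3], ocr_dpi))
text_box_pt = line_box_in_points(line, page, ocr_dpi)

print(page_size_pt)   # close to US Letter, which is 612 x 792 points
print(text_box_pt)
```

<p>Because the page size and text positions come from the hOCR coordinates and the OCR resolution, any image of the same page can be stretched to fill that rectangle, whatever its own pixel dimensions.</p>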
<p>With a simple script that iterates through a directory of images, calling OCRopus and this script for each image (maybe with some ImageMagick and <a href="http://xplus3.net/2008/10/09/command-line-pdf-editing/">pdftk</a> thrown in for good measure), you can quickly OCR a large batch of images and convert them to searchable PDF documents.</p>
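<p>Such a batch script might look roughly like the following. Treat it as a sketch under a few assumptions: <code>ocroscript</code> is on your <code>PATH</code>, the <code>HocrConverter</code> module is importable from the Python you run this with, and the images are PNG files in a single directory. The function names <code>plan_batch</code> and <code>run_batch</code> are made up for this example.</p>

```python
import subprocess
from pathlib import Path


def plan_batch(image_dir, pattern="*.png"):
    """List (image, hocr, pdf) paths for every matching image, sorted by name."""
    jobs = []
    for image in sorted(Path(image_dir).glob(pattern)):
        jobs.append((image, image.with_suffix(".html"), image.with_suffix(".pdf")))
    return jobs


def run_batch(image_dir):
    """OCR every image in image_dir and write a PDF next to each one."""
    # Imported here so plan_batch can be used without HocrConverter installed.
    from HocrConverter import HocrConverter

    for image, hocr_path, pdf_path in plan_batch(image_dir):
        # Run "ocroscript recognize" and send its standard output to the hOCR file.
        with open(hocr_path, "wb") as out:
            subprocess.run(
                ["ocroscript", "recognize", str(image)],
                stdout=out,
                check=True,  # stop the batch if OCRopus reports an error
            )
        # Same calls as in the usage example: hOCR file in, image plus PDF path out.
        HocrConverter(str(hocr_path)).to_pdf(str(image), str(pdf_path))
```

<p>Calling <code>run_batch("/path/to/scans")</code> would then leave an <code>.html</code> and a <code>.pdf</code> file beside each image. Downsampling with ImageMagick or merging the PDFs with pdftk could be added as extra steps in the loop.</p>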
<h3>Future Development</h3>
<p>I&#8217;d like to add the capability to convert the output of various OCR programs into hOCR format. For example, OmniPage, which is much better than OCRopus when it comes to layout analysis for complex images, can output documents in its proprietary XML schema. I should be able to transform that into an hOCR document and then use this script to create a PDF from there.</p>
<p>I would welcome any other suggestions or contributions.</p>
<p><strong><a href="http://github.com/jbrinley/HocrConverter">Download the HocrConverter script from github</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://xplus3.net/2009/04/02/convert-hocr-to-pdf/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>OCR with OCRopus and Tesseract</title>
		<link>http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/</link>
		<comments>http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 18:20:38 +0000</pubDate>
		<dc:creator>Jonathan Brinley</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[hOCR]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[OCRopus]]></category>
		<category><![CDATA[Tesseract]]></category>

		<guid isPermaLink="false">http://xplus3.net/?p=193</guid>
		<description><![CDATA[While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I&#8217;ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative. &#8230; <a href="http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>While OCRing a batch of images through OmniPage the other day, I was silently cursing my computer. I had about 1,500 pages, and OmniPage was crashing after every second or third image. I&#8217;ve used versions 13-16 of the software, and this problem seems to just get worse with each new release. Fed up, I decided to look for an alternative.</p>
<p>I remembered seeing a few years ago that HP had open-sourced their OCR engine, <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a>, development of which has now been taken over by Google. Tesseract is supposedly very good at what it does, namely, recognizing characters in images.</p>
<p>Tesseract does not, however, have many essential features found in modern OCR software, including document layout analysis and output formatting. That&#8217;s where <a href="http://sites.google.com/site/ocropus/">OCRopus</a> comes in. I think of it as a wrapper around Tesseract, capable of doing the layout analysis and providing formatted output. In truth, it can do much more than that, and different OCR engines and other components can be plugged into OCRopus, but the preceding simplification works for my purposes.</p>
<h3>Usage</h3>
<p>Use OCRopus with a simple call from the command line:</p>
<pre>$ ocroscript recognize /path/to/file.png &gt; /path/to/output.html</pre>
<p>OCRopus will work its magic on file.png and give you an hOCR file. hOCR uses <code>class</code> and <code>title</code> attributes in an otherwise simple HTML file to embed layout information into the recognized text. I hope soon to create a script to transform the hOCR into a PDF; I&#8217;ll post more when it&#8217;s ready.</p>
<h3>Installation</h3>
<p>The trickiest part of using OCRopus is the installation. There are quite a few dependencies and some inaccurate documentation, so I made a few wrong turns along the way. Fortunately, I remembered to document what I was doing as I went. The instructions below represent the necessary steps to have an operable installation of OCRopus on Linux Mint as of 2009-03-27. For the record, I&#8217;m starting in <code>/var/tmp</code>.<br />
<span id="more-193"></span></p>
<h4>Install Tesseract</h4>
<p>As mentioned above, <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a> is the OCR engine that powers OCRopus.</p>
<pre>$ svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
$ cd tesseract-ocr-read-only
$ ./configure
$ make
$ sudo make install
$ cd ..</pre>
<h4>Install iulib</h4>
<p><a href="http://code.google.com/p/iulib/">iulib</a> provides some basic image processing libraries used by OCRopus.</p>
<pre>$ svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
$ cd iulib
$ sudo apt-get install scons
$ sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libavcodec-dev libavformat-dev libsdl-gfx1.2-dev libsdl-image1.2-dev
$ sudo apt-get install imagemagick
$ scons
$ sudo scons install
$ cd ..</pre>
<h4>Install Leptonica</h4>
<p><a href="http://code.google.com/p/leptonica/">Leptonica</a> provides more image processing and layout analysis capabilities.</p>
<pre>$ wget http://leptonica.googlecode.com/files/leptonlib-1.60.tar.gz
$ tar xvzf leptonlib-1.60.tar.gz
$ cd leptonlib-1.60
$ ./configure
$ make
$ sudo make install
$ cd ..</pre>
<h4>Install OpenFST</h4>
<p><a href="http://www.openfst.org/">OpenFST</a> provides language modeling code to OCRopus. Note that this takes a while (a couple of hours for me) to compile.</p>
<pre>$ wget http://mohri-lt.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.1.tar.gz
$ tar xvzf openfst-1.1.tar.gz
$ cd openfst-1.1
$ ./configure
$ make
$ sudo make install
$ cd ..</pre>
<h4>Install OCRopus</h4>
<p>We now have all our dependencies installed, so it&#8217;s time to install <a href="http://code.google.com/p/ocropus/">OCRopus</a>.</p>
<pre>$ sudo apt-get install libeditline-dev
$ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
$ cd ocropus
<del>$ ./configure
$ make
$ sudo make install</del></pre>
<p><strong>Update (2009-04-01):</strong> OCRopus is still young and has many bugs. One particularly annoying bug is also quite easy to fix: the Doctype declaration for the hOCR file was missing some quotes, rendering the XHTML invalid. I&#8217;ve submitted a patch. So, here are some slightly revised installation instructions, picking up in the <code>ocropus</code> directory:</p>
<pre>$ wget http://xplus3.net/downloads/fix_ocropus_doctype.diff
$ patch -p0 -i fix_ocropus_doctype.diff
$ ./configure
$ make
$ sudo make install</pre>
]]></content:encoded>
			<wfw:commentRss>http://xplus3.net/2009/03/31/ocr-with-ocropus-and-tesseract/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
	</channel>
</rss>

