Creating a Searchable PDF File with OCRed Text

Patent librarian and all-around good guy Jim Miller lamented the lack of OCR text in a Spanish-language Espacenet patent. I’ve OCRed documents to produce text files that became a searchable online database, but I hadn’t tried producing searchable PDFs. I offered to try to help anyway. In googling around, I found the June 4th comment at the bottom of this page that says that tesseract, an open-source OCR program, can produce PDF files with embedded text.

In my prior OCRing, I had installed the Windows v4.00.00alpha version of tesseract on my otherwise ordinary laptop, but now I needed to download and install the Spanish language training file, spa.traineddata, from this page. Spanish is one of more than a hundred supported languages.
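If you want to confirm that the Spanish training file was picked up before starting a long OCR run, tesseract can list the languages it knows about with its --list-langs option. Here is a minimal Python sketch of that check. The function names are my own invention for illustration, and I'm assuming the listing is a header line ending in a colon followed by one language code per line. Some builds print the listing to stderr instead of stdout, so the sketch reads both.

```python
import subprocess


def parse_langs(listing: str) -> list[str]:
    """Extract language codes from text printed by `tesseract --list-langs`.

    Assumes a header line ending in ':' followed by one code per line.
    """
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract_exe: str = "tesseract") -> list[str]:
    """Run tesseract and return the language codes it reports.

    `tesseract_exe` is a hypothetical parameter; pass a full path if
    tesseract is not on your PATH.
    """
    result = subprocess.run(
        [tesseract_exe, "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Depending on the build, the listing may arrive on stderr, so read both.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

With tesseract installed, checking that "spa" is in the list returned by installed_langs() should tell you whether spa.traineddata landed in the right folder.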

Tesseract can OCR either a TIFF or PNG file, but here we’re starting with a PDF. Luckily, convert, which is part of the open-source project ImageMagick, can create a TIFF file from a PDF file or do other image manipulation you may need. Simply running “convert MX2016010808A.pdf MX2016010808A.tif” produces a TIFF file, but tesseract throws a number of errors on it and doesn’t produce any output. Googling the error messages landed me on an incrementally helpful page at Stack Overflow, a wonderful resource that Google ranks highly in searches of this nature, though it still didn’t resolve all the errors. Further down in Google’s search results was a page¹ that lists the convert parameters its author used to produce an OCR-able TIFF. As the author states, the library tesseract is built upon can be rather finicky about the internals of the source TIFF file. See the author’s first footnote by following the link in my one and only footnote. That page also has information on OCRing using Mac OS X, which is something I have no experience doing.

Putting all of the above together, I was able to produce the PDF file Jim dreamt of! Below are the commands that worked when entered at a DOS command prompt. I’ll spare you the iterations that weren’t successful.

convert -density 300 MX2016010808A.pdf -type Grayscale -compress lzw -background white -alpha Off -depth 32 MX2016010808A.tif
tesseract -l spa MX2016010808A.tif OcrMX2016010808A pdf
(Note the space before the final pdf: it is a separate argument, not a file extension.)

Here the 1 MB file MX2016010808A.pdf was the input, and the 20 MB, 28-page OcrMX2016010808A.pdf was produced in nine and a half minutes. The first command created the OCR-friendly MX2016010808A.tif. The second command OCRed MX2016010808A.tif, specifying Spanish (-l spa) and PDF output. Had we not specified these parameters, tesseract would have defaulted to English and produced a text file as output. GUI front ends are available for tesseract, but I prefer to use the command line version of it. I think that ImageMagick is primarily a command-line tool. For completeness I’ll include the version of ImageMagick I used: 7.0.6-9 Q16 x64 2017-08-21. Also, because the ratio is so outrageous, I’ll mention that I’ve used convert and tesseract to produce one searchable PDF and slightly over one million text files.
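If you have more than one PDF to process, the two commands can be wrapped in a small script. The following is a rough Python sketch, not something I ran for this post. The function names and the "Ocr" output prefix convention are assumptions for illustration, while the command-line options are copied from the two commands that worked for me.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf: str, tif: str) -> list[str]:
    """Build the ImageMagick command that makes an OCR-friendly TIFF."""
    # Same options, in the same order, as the convert command in the post.
    return ["convert", "-density", "300", pdf, "-type", "Grayscale",
            "-compress", "lzw", "-background", "white", "-alpha", "Off",
            "-depth", "32", tif]


def tesseract_cmd(tif: str, out_base: str, lang: str = "spa") -> list[str]:
    """Build the tesseract command that writes out_base + '.pdf'."""
    # The trailing "pdf" is its own argument, which is why the command
    # line has a space before it.
    return ["tesseract", "-l", lang, tif, out_base, "pdf"]


def make_searchable_pdf(pdf: str, lang: str = "spa") -> Path:
    """Run both steps; requires convert and tesseract on the PATH."""
    src = Path(pdf)
    tif = src.with_suffix(".tif")
    # Hypothetical naming convention: prefix the output with "Ocr".
    out_base = src.with_name("Ocr" + src.stem)
    subprocess.run(convert_cmd(str(src), str(tif)), check=True)
    subprocess.run(tesseract_cmd(str(tif), str(out_base), lang), check=True)
    return Path(str(out_base) + ".pdf")
```

Calling make_searchable_pdf("MX2016010808A.pdf") would run the same two steps as above. One caveat: Windows ships its own unrelated convert utility, so make sure the name resolves to ImageMagick's on your PATH. ImageMagick 7 also installs a magick command you could swap in.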

Footnote

¹Baumann, Ryan. “Command-Line OCR with Tesseract on Mac OS X.” Ryan Baumann – /etc (blog), 13 Nov 2014, https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html (accessed 15 Jul 2018).
