OCRing USAMark Tiffs

OCR = Optical Character Recognition, software that attempts to figure out what text is contained in an image, generally reading an image and producing a text file.

On the patent office’s web site there are a couple of date-based differences in what data is available. For patents granted before 1976 there isn’t much associated data, just images of the patents themselves. A long time ago, Google Patents burst onto the scene. They OCRed the images of the pre-1976 patents, and then, suddenly, as if by magic, the unsearchable became searchable. It was absolutely mind-blowing at the time.

Some time later, circa 2010, the patent office made bulk data available. I sat around waiting for someone to OCR the registration certificates, which were originally released as USMark on 247 DVDs and once sold by the patent office for around $2,000. I even tried suggesting to Google that they branch out and OCR the registration certificates, as a summer intern project, etc., to no avail.

For trademarks there is a date-based data division similar to the 1976 one just mentioned for patents. TESS, the fickle online application that lets you search for existing trademarks, generally has data for trademarks that were active in 1984 and beyond. (Trademarks can be renewed indefinitely through use and payment of applicable fees, so TESS has data for active trademarks that are now more than a hundred years old, if they were active in 1984.) Anecdotally, there are around 600,000 long-dead trademarks, the ones not active in 1984, that are not in TESS, which is why I refer to her as fickle. Images of most of the registration certificates not in TESS are available online¹, but they are not searchable.

After no one else OCRed the registration certificates, I decided to do it myself, on my otherwise ordinary desktop computer. The question you might have is, how do you OCR about a million TIFFs of registration certificates? The answer is simple: by OCRing 20,000 registration certificates per day, so after five days you’ve done 100,000 of them. Just repeat that effort nine more times and after about 50 days you’ll have OCRed a million registration certificates! I could have stopped after I OCRed the 600,000 not in TESS, but it was fun so I kept going until I had a little over a million of them.
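For anyone curious what a daily batch like that might look like, here is a minimal sketch in Python. It is not the script I actually ran. It assumes the open-source tesseract program is installed and on the PATH, and the directory names and the helper functions (days_needed, tesseract_text_command, ocr_batch) are made up for illustration.

```python
import math
import subprocess
from pathlib import Path


def days_needed(total_images: int, per_day: int) -> int:
    """Whole days required to OCR total_images at per_day images per day."""
    return math.ceil(total_images / per_day)


def tesseract_text_command(tiff: Path, out_dir: Path) -> list[str]:
    """Build the command that OCRs one TIFF into out_dir/<stem>.txt.

    Tesseract appends ".txt" to the output base name on its own.
    """
    return ["tesseract", str(tiff), str(out_dir / tiff.stem)]


def ocr_batch(tiff_dir: Path, out_dir: Path, limit: int) -> int:
    """OCR up to `limit` not-yet-processed TIFFs; return how many were done."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for tiff in sorted(tiff_dir.glob("*.tif")):
        if done >= limit:
            break
        if (out_dir / f"{tiff.stem}.txt").exists():
            continue  # already OCRed on an earlier day
        # Assumption: the tesseract executable is installed and on the PATH.
        subprocess.run(tesseract_text_command(tiff, out_dir), check=True)
        done += 1
    return done
```

Running something like ocr_batch(Path("tiffs"), Path("text"), limit=20_000) once a day picks up where the previous day left off, because any TIFF that already has a text file is skipped.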

I created a search page for the OCRed registration certificates, though I had to spread them across three databases to appease my ISP. Then, suddenly, the unsearchable became searchable!

¹ There are around 11,000 TIFFs on the USMark DVDs that are not in the patent office’s TSDR (Trademark Status and Document Retrieval) but they are here on my site, as PDFs. Links to them are returned in the search results, when applicable.

OCRed Plant Patents

Bulk data became a thing a few years ago, so I downloaded the USMark trademark zip files, all 43,000 of them. I then OCRed a little over a million of the registration certificates in the zip files and put them into a searchable database. Then I did the same thing for plant patents, which began in 1931. I OCRed PP1 through PP3,986, searchable here. PP3,987 is the first fully searchable plant patent on the USPTO’s web site, so I didn’t need to OCR any further.

Creating a Searchable PDF File with OCRed Text

Patent Librarian and all-around good guy Jim Miller lamented the lack of OCR text in an Espacenet patent that was in Spanish. I’ve OCRed images to produce text files that became a searchable online database, but I hadn’t tried producing searchable PDFs. I offered to try to help anyway. In googling around, I found the June 4th comment at the bottom of this page that says that tesseract, an open-source OCR program, can produce PDF files with embedded text.
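As a rough illustration of that tesseract feature, here is a minimal Python sketch that wraps the command line. It assumes tesseract is installed with the Spanish language data ("spa"), and that the patent page has already been saved as an image, since tesseract reads images rather than PDFs. The file names and the helper functions (searchable_pdf_command, make_searchable_pdf) are hypothetical.

```python
import subprocess
from pathlib import Path


def searchable_pdf_command(image: Path, out_base: Path, lang: str = "spa") -> list[str]:
    """Build a tesseract command that writes <out_base>.pdf.

    The trailing "pdf" argument selects tesseract's PDF output, which keeps
    the page image and adds the recognized text as a searchable layer.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def make_searchable_pdf(image: Path, out_base: Path, lang: str = "spa") -> Path:
    """Run tesseract and return the path of the PDF it should have written."""
    # Assumption: tesseract and the chosen language data are installed.
    subprocess.run(searchable_pdf_command(image, out_base, lang), check=True)
    return Path(str(out_base) + ".pdf")
```

For example, make_searchable_pdf(Path("page.tif"), Path("patent")) would be expected to leave a patent.pdf next to the script, with text that can be selected and searched.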