OCRing USAMark Tiffs

OCR = Optical Character Recognition: software that attempts to figure out what text an image contains, generally by reading an image and producing a text file.

On the patent office’s web site, what data is available depends on the date. For patents granted before 1976 there isn’t much associated data, just images of the patents themselves. A long time ago, Google Patents burst onto the scene. They OCRed the images of the pre-1976 patents, and then, suddenly, as if by magic, the unsearchable became searchable. It was absolutely mind-blowing at the time.

Some time later, circa 2010, the patent office made bulk data available. I sat around waiting for someone to OCR the registration certificates that were originally released as USAMark on 247 DVDs, once sold by the patent office for around $2,000. I even tried suggesting to Google that they branch out and OCR the registration certificates, say as a summer intern project, to no avail.

For trademarks there is a date-based data division similar to the 1976 one just mentioned for patents. TESS, the fickle online application that lets you search for existing trademarks, generally has data for trademarks that were active in 1984 and beyond. (Trademarks can be renewed indefinitely through use and payment of applicable fees, so TESS has data for active trademarks that are now more than a hundred years old, if they were active in 1984.) Anecdotally, there are around 600,000 long-dead trademarks, the ones not active in 1984, that are not in TESS, which is why I refer to her as fickle. Images of most of the registration certificates not in TESS are available online (see note 1), but they are not searchable.

After no one else OCRed the registration certificates, I decided to do it myself, on my otherwise ordinary desktop computer. The question you might have is: how do you OCR about a million TIFFs of registration certificates? The answer is simple: OCR 20,000 registration certificates per day, and after five days you’ve done 100,000 of them. Repeat that effort nine more times, and after about 50 days you’ll have OCRed a million registration certificates! I could have stopped after OCRing the 600,000 not in TESS, but it was fun, so I kept going until I had a little over a million of them.
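For anyone who wants to check the arithmetic, here is a minimal Python sketch of the pacing described above. The 20,000-per-day rate and the totals come from the post. The function names and the assumption that the machine runs around the clock are mine, and no particular OCR program is implied.

```python
import math

def days_needed(total_images: int, per_day: int) -> int:
    """Whole days required to get through total_images at per_day images a day."""
    return math.ceil(total_images / per_day)

def seconds_per_image(per_day: int, hours_per_day: float = 24.0) -> float:
    """Time budget per image, assuming (hypothetically) the job runs hours_per_day."""
    return hours_per_day * 3600 / per_day

def daily_batches(paths, per_day):
    """Split a list of file paths into consecutive one-day chunks."""
    for start in range(0, len(paths), per_day):
        yield paths[start:start + per_day]

if __name__ == "__main__":
    print(days_needed(1_000_000, 20_000))  # 50 days for the full million
    print(days_needed(600_000, 20_000))    # 30 days for just the ones not in TESS
    print(seconds_per_image(20_000))       # 4.32 seconds per certificate
```

In other words, keeping up that pace means the desktop has a bit over four seconds to spend on each certificate if it never rests, and less if it only runs part of the day.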

I created a search page for the OCRed registration certificates, though I had to spread them across three databases to appease my ISP. Then, suddenly, the unsearchable became searchable!
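Spreading records over three databases and still searching them as one can be sketched roughly like this. This is not the code behind the search page. The table layout, the modulo routing rule, the in-memory SQLite databases, and the sample texts are all assumptions made up for illustration.

```python
import sqlite3

NUM_SHARDS = 3  # the post mentions three databases; everything else here is assumed

def shard_for(reg_number: int) -> int:
    """Pick a database for a registration number (hypothetical modulo rule)."""
    return reg_number % NUM_SHARDS

def open_shards():
    """Create three in-memory databases, each with the same one-table layout."""
    shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
    for db in shards:
        db.execute(
            "CREATE TABLE certs (reg_number INTEGER PRIMARY KEY, ocr_text TEXT)"
        )
    return shards

def add_cert(shards, reg_number: int, ocr_text: str) -> None:
    """Store one certificate's OCR text in the database its number maps to."""
    db = shards[shard_for(reg_number)]
    db.execute("INSERT INTO certs VALUES (?, ?)", (reg_number, ocr_text))

def search(shards, term: str):
    """Run the same substring query on every database and merge the hits."""
    hits = []
    for db in shards:
        rows = db.execute(
            "SELECT reg_number FROM certs WHERE ocr_text LIKE ?", (f"%{term}%",)
        )
        hits.extend(row[0] for row in rows)
    return sorted(hits)

if __name__ == "__main__":
    shards = open_shards()
    add_cert(shards, 1, "ACME SOAP, for toilet soap")      # invented sample text
    add_cert(shards, 2, "STAR BRAND, for wheat flour")     # invented sample text
    add_cert(shards, 3, "SUNRISE, for laundry soap")       # invented sample text
    print(search(shards, "soap"))  # [1, 3]
```

The point of the sketch is only that the person searching never needs to know which database a certificate landed in, because every query is sent to all three and the results are combined.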

1. There are around 11,000 TIFFs on the USAMark DVDs that are not in the patent office’s TSDR (Trademark Status and Document Retrieval), but they are here on my site, as PDFs. Links to them are returned in the search results, when applicable.
