OCRing USAMark Tiffs

OCR = Optical Character Recognition, software that attempts to figure out what text is contained in an image, generally reading an image and producing a text file.

On the patent office’s web site there are a couple of date-based differences in what data is available. For patents, there isn’t a lot of data associated with ones granted before 1976; there are just images of the patents themselves. A long time ago, Google Patents burst onto the scene. They OCRed the images of the pre-1976 patents, and then, suddenly, as if by magic, the unsearchable became searchable. It was absolutely mind-blowing at the time.

Some time later, circa 2010, the patent office made bulk data available. I sat around waiting for someone to OCR the registration certificates that were originally released as USMark on 247 DVDs, and once sold by the patent office for around $2,000. I even tried suggesting to Google that they branch out and OCR the registration certificates, as a summer intern project etc., to no avail.

For trademarks there is a date-based data division similar to the 1976 one just mentioned for patents. TESS, the fickle online application that lets you search for existing trademarks, generally has data for trademarks that were active in 1984 and beyond. (Trademarks can be renewed indefinitely through use and payment of applicable fees, so TESS has data for active trademarks that are now more than a hundred years old, if they were active in 1984.) Anecdotally, there are around 600,000 long dead trademarks, the ones not active in 1984, that are not in TESS, which is why I refer to her as fickle. Images of most of the registration certificates not in TESS are available online¹, but they are not searchable.

After no one else OCRed the registration certificates, I decided to do it myself, on my otherwise ordinary desktop computer. The question you might have is how do you OCR about a million tiffs of registration certificates? The answer is simple, by OCRing 20,000 registration certificates per day, so after five days you’ve done 100,000 of them. Just repeat that effort nine more times and after about 50 days you’ll have OCRed a million registration certificates! I could have stopped after I OCRed the 600,000 not in TESS, but it was fun so I kept going until I had a little over a million of them.
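The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the batching idea, not the program I ran: the `ocr` argument is a stand-in for whatever OCR routine you use, and the function names are made up for this example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def days_needed(total: int, per_day: int = 20_000) -> int:
    """Ceiling division: how many days to get through `total` images."""
    return -(-total // per_day)


def daily_batches(paths: Iterable[str], per_day: int = 20_000) -> Iterator[List[str]]:
    """Yield the file paths in chunks of at most `per_day` items."""
    it = iter(paths)
    while True:
        batch = list(islice(it, per_day))
        if not batch:
            return
        yield batch


def ocr_in_batches(paths: Iterable[str], ocr: Callable[[str], object],
                   per_day: int = 20_000) -> int:
    """Call `ocr` (a hypothetical stand-in) on every path, one batch per 'day'.

    Returns the number of batches that were processed.
    """
    days = 0
    for batch in daily_batches(paths, per_day):
        for path in batch:
            ocr(path)
        days += 1
    return days
```

With the numbers from this post, `days_needed(1_000_000)` works out to the 50 days mentioned above.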

I created a search page for the OCRed registration certificates, though I had to spread them across three databases to appease my ISP. Then, suddenly, the unsearchable became searchable!

¹ There are around 11,000 tiffs on the USMark DVDs that are not in the patent office’s TSDR (Trademark Status and Document Retrieval) but they are here on my site, as PDFs. Links to them are returned in the search results, when applicable.

OCRed Plant Patents

Bulk data became a thing a few years ago, so I downloaded the USMark trademark zip files, all 43,000 of them. I then OCRed a little over a million of the registration certificates in the zip files and put them into a searchable database. Then I did the same thing for plant patents, which began in 1931. I OCRed PP1 through PP3,986, searchable here. PP3,987 is the first fully searchable plant patent on the USPTO’s web site, so I didn’t need to OCR any further.

The Python Wrapper

This has nothing to do with a snake in a hoodie laying down rhythmic rhymes; that would be Python The Rapper 🙂 The same people who wrote the Patentsview api also wrote a Python wrapper that produces a csv file for you. All you need to do is download the code and dependencies (instructions provided in the link) and write a configuration file for your query or queries (more than one query can be specified in a configuration file). I realized that the queries it makes can be chained together, where the output of one becomes the input of another. I posted about it here, in the Patentsview forum.
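Here is a rough sketch of what chaining means, outside of the wrapper itself. It pulls one column out of the csv produced by a first query and turns it into the criteria for a second one. The column name, the sample data and the `{field: [values]}` query shape are assumptions made for illustration; the wrapper's real configuration file format is described in its own instructions and is not reproduced here.

```python
import csv
import io
import json


def column_values(csv_text: str, column: str) -> list[str]:
    """Collect the distinct, non-empty values of one column, in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(dict.fromkeys(row[column] for row in reader if row[column]))


def chained_query(field: str, values: list[str]) -> str:
    """Build the JSON criteria for the follow-up query.

    Assumption: a field mapped to a list of values means "match any of these".
    """
    return json.dumps({field: values})


if __name__ == "__main__":
    # Hypothetical output of a first query (made-up ids, not real data).
    first_output = "patent_number,assignee_id\n1,a1\n2,a2\n3,a1\n"
    ids = column_values(first_output, "assignee_id")
    print(chained_query("assignee_id", ids))
```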

Reissued Patents

I’m one of the developers of datamp.org, an online database of tool patents. I’m also one of the handful of people who enter tool patents there. I was surprised to find a reissued patent on the USPTO’s web site for a utility patent I had entered into datamp. If you find a reissued patent on the USPTO’s site, it references the original patent. The opposite is not true, however. If you look up a patent at uspto.gov, you aren’t shown whether it was subsequently reissued. If the patent was issued in or after 1976, you can do an advanced search on the USPTO’s site to see if the patent in question was reissued (reis/XXXXXXXX, where the X’s are the eight-digit patent number, padded with leading zeroes if necessary). Most of the patents in datamp were issued before 1976, so the advanced search wasn’t an option for finding the reissued patent I accidentally found while looking for another patent.
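The zero padding is easy to get wrong by hand, so here is a tiny Python helper that builds the search term. The function name is made up for this example, and the patent number in the usage note is just a placeholder.

```python
def reissue_search_term(patent_number) -> str:
    """Build a reis/XXXXXXXX term: digits only, left-padded with zeroes to eight."""
    digits = "".join(ch for ch in str(patent_number) if ch.isdigit())
    if not digits or len(digits) > 8:
        raise ValueError(f"not a usable patent number: {patent_number!r}")
    return "reis/" + digits.zfill(8)
```

For example, `reissue_search_term("4,123,456")` gives `reis/04123456`.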

Take a look at the patent I accidentally found: it’s RE22,908, issued August 12, 1947. It shows that it’s a reissue of 2,314,915, issued March 30, 1943. However, when you view 2,314,915 it doesn’t show that it was reissued.

I created a database table for reissued patents using some post-1975 bulk patent xml files I had downloaded. It occurred to me that it would be cool if reissued patent data was made available in bulk. It would save me from downloading giant xml files and running a program to look for reissued patent data. Check out my reissued patent page. Wouldn’t it be really cool if the pre-1976 reissue data was also available?

Patents not in bulk xml files

I nearly fell out of my chair when the patent office announced that they would be giving away their data for free! How cool is that? Free data, where is the catch? It turned out there isn’t one, well maybe a very small one, percentagewise that is.

Each quarter the Patentsview api team processes all the bulk grant xml files the patent office makes available, something like a zillion of them (actually closer to 2,000 at the time of this writing). That’s how they create the database that their api uses. After processing the files and updating their database, they then make their data available for download, and get this, it’s also free!

Separately, the patent office constantly updates a list of withdrawn patents. I thought hey, I wonder if there are any missing patents in Patentsview’s patent.tsv file? I’m that sort of guy, you know. Often interesting things can be found when you examine the gaps and overlaps of related data files or data sets. In theory, any patent numbers not in their patent file should correspond to withdrawn patents, right? It turned out to be only partially true, which was really unexpected.

First, I found that there are around 8,000 patents in both files, patents that were issued and whose data was included in bulk grant xml files but that were later withdrawn! This is problematic for the Patentsview database, as I tried to point out to them. They really should exclude withdrawn patents from their database but they don’t see it that way. In other words, the results of a search using their api could contain patents that have been withdrawn. As a user of the api I find this unacceptable.

Second, I did find unexplained gaps! There are 306 patent numbers not in the Patentsview file that do not correspond to withdrawn patents. I double checked a few and the patents are indeed missing from the bulk xml files. Percentagewise it’s minuscule, 306 patents out of 7.5M patents (about 0.004%), but shouldn’t it be 0%? Again, this is problematic for the Patentsview database. Searches using their api cannot include these patents as they were not part of their underlying source. That’s another serious flaw with their api, even though it isn’t their fault that the patents are not part of the bulk xml files. I pointed this out to them but an alternate source of this data has not been found.
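The gaps-and-overlaps check boils down to a couple of set operations. Below is a minimal Python sketch of the idea using toy numbers; it is not the program I actually ran, and it assumes you have already read the patent numbers from patent.tsv and the withdrawn list into plain integers.

```python
def gaps_and_overlaps(issued, withdrawn, first, last):
    """Compare two lists of patent numbers over the range first..last.

    Returns (overlap, unexplained):
      overlap     - numbers that appear in both the issued and withdrawn lists
      unexplained - numbers in the range that appear in neither list
    """
    issued, withdrawn = set(issued), set(withdrawn)
    overlap = issued & withdrawn
    unexplained = set(range(first, last + 1)) - issued - withdrawn
    return sorted(overlap), sorted(unexplained)


if __name__ == "__main__":
    # Toy data only; real patent numbers run into the millions.
    overlap, unexplained = gaps_and_overlaps(
        issued=[1, 2, 3, 5, 8], withdrawn=[3, 4, 6], first=1, last=8)
    print("issued but withdrawn:", overlap)
    print("missing and not withdrawn:", unexplained)
```

In the toy data, 3 plays the part of the roughly 8,000 patents found in both files, and 7 plays the part of the 306 unexplained gaps.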

I think the patent office should produce a “catchup” xml file containing the 306 valid patents that are not in the bulk xml grant files. The Patentsview people could then add them to their database and other bulk data consumers could do whatever they want to do with them. If the Patentsview people also excluded withdrawn patents, I’d be a lot happier as a consumer of their api. There are other flaws with the api, but this would be a step in the right direction.

(Some of the missing patents are listed on this page, and some listed on this page, for a total of 306 patents.)

TSDR and the api key

The T in USPTO (United States Patent and Trademark Office), you should have just learned, stands for Trademark. Their TSDR (Trademark Status & Document Retrieval) api deals with trademarks. If you’ve used it recently, you probably noticed that they now require an api key, which you can get by registering with them. Their Swagger-UI page is at https://developer.uspto.gov/swagger/tsdr-api-v1 but it doesn’t allow you to enter your now-required api key and it has a number of other omissions (listed in my GitHub repo). Further, their api does not accept browser requests coming from domains other than their own (they’re blocked by CORS policy), which is why the Swagger-UI page I created does not work (though the generated curl commands work and my modified swagger object can be imported into Postman from https://mustberuss.github.io/TSDR-Swagger/myswagger_v1_tsdr_uspto.json). I emailed them to point out these problems but they have not updated their Swagger-UI page or allowed CORS requests.
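Since the CORS restriction only affects browsers, a script can still call the api with a key. Here is a hedged Python sketch that builds such a request without sending it. The base URL, the path layout and the `USPTO-API-KEY` header name are my assumptions for illustration, so check them against the swagger object linked above before relying on them.

```python
import urllib.request

# Assumption: base URL and path layout of the TSDR api; verify before use.
TSDR_BASE = "https://tsdrapi.uspto.gov/ts/cd"


def build_status_request(serial_number: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a case status request for a serial number."""
    url = f"{TSDR_BASE}/casestatus/sn{serial_number}/info.json"
    # Assumption: the key is passed in a header named USPTO-API-KEY.
    # urllib stores header names capitalized ("Uspto-api-key"); HTTP header
    # names are case-insensitive, so this is still the same header.
    return urllib.request.Request(url, headers={"USPTO-API-KEY": api_key})
```

To actually send it you would pass the result to `urllib.request.urlopen`, using the key you got when you registered.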

I’ve been down this road before: I cannot get the Patentsview people to adopt the Swagger-UI page I created for their api. Using an api’s Swagger-UI page can be a great way to learn the ins and outs of an api, but it takes a little cooperation from the api provider! By contrast, the Swagger-UI page for the USPTO’s PEDS (Patent Examination Data System) api works as it should, without my involvement.

A Rookie Mistake

There are around 11,000 registration certificates that are not online. They correspond to dead trademarks that have no legal standing. I was researching tool companies that held some of these missing trademarks. I requested copies from the patent office through a patent librarian I know. On one request I mixed up the serial number and registration number of the trademark I was interested in. The former number is not that useful and the latter is all important. On the other end of my request was an intern at the patent office. When I met her in person, I related the tale of how she schooled me on my rookie mistake. The group around us burst into laughter; it turned out she had been a teacher at some point and schooling people was nothing new to her!

Mike White’s excellent US Trademark number guide succinctly explains the trademark numbering peculiarities brought about by the Trademark Act of 1946. When my son was younger, he liked the cartoon Ben 10, which has a Null Void, somewhere in dimensional space that you don’t want to wind up. One of the side effects of the 1946 Trademark Act was that it created a Null Void of registration numbers (for reasons too complex to explain in parentheses, registration numbers 444,812 through 500,000 were never issued). The request I was schooled on was a request for a “registration number” 493,259, which was never issued. The registration certificate is online, where its serial and registration numbers can be seen if you don’t interchange them. (Some of the above is excerpted from this article.)
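A quick sanity check would have saved me from my rookie mistake. This little Python sketch only encodes the range given above; the names are made up for this example.

```python
# Registration numbers 444,812 through 500,000 were never issued.
NULL_VOID = range(444_812, 500_001)


def was_never_issued(registration_number: int) -> bool:
    """True if the number falls inside the gap left by the 1946 Trademark Act."""
    return registration_number in NULL_VOID
```

Calling `was_never_issued(493_259)` returns True, which is a strong hint that 493,259 was really a serial number.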