Patents not in bulk xml files

I nearly fell out of my chair when the patent office announced that they would be giving away their data for free! How cool is that? Free data, so where is the catch? It turned out there isn’t one. Well, maybe a very small one, percentagewise that is.

Each quarter the Patentsview api team processes all the bulk grant xml files the patent office makes available, something like a zillion of them (actually closer to 2,000 at the time of this writing). That’s how they create the database that their api uses. After processing the files and updating their database, they then make their data available for download, and get this, it’s also free!

Separately, the patent office constantly updates a list of withdrawn patents. I thought hey, I wonder if there are any missing patents in Patentsview’s patent.tsv file? I’m that sort of guy, you know. Often interesting things can be found when you examine the gaps and overlaps of related data files or data sets. In theory, any patent numbers not in their patent file should correspond to withdrawn patents, right? It turned out to be only partially true, which was really unexpected.

First, I found that there are around 8,000 patents in both files: patents that were issued, and whose data was included in the bulk grant xml files, but that were later withdrawn! This is problematic for the Patentsview database, as I tried to point out to them. They really should exclude withdrawn patents from their database but they don’t see it that way. In other words, the results of a search using their api could contain patents that have been withdrawn. As a user of the api I find this unacceptable.

Second, I did find unexplained gaps! There are 306 patent numbers not in the Patentsview file that do not correspond to withdrawn patents. I double checked a few and the patents are indeed missing from the bulk xml files. Percentagewise it’s minuscule, 306 patents out of 7.5M patents, but shouldn’t it be 0%? Again, this is problematic for the Patentsview database. Searches using their api cannot return these patents, as they were never part of the underlying source. That’s another serious flaw with their api, even though it isn’t their fault that the patents are not part of the bulk xml files. I pointed this out to them, but an alternate source of this data has not been found.
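For anyone who wants to try the same kind of comparison, here is a minimal Python sketch of the two checks described above. It is not the exact script I ran. The function name `compare_patent_sets` and its parameters are made up for illustration, and it assumes you have already loaded the patent numbers as plain integers and that numbers in the range you check are meant to be consecutive. Prefixed numbers such as design or reissue patents would need their own handling.

```python
def compare_patent_sets(granted, withdrawn, first, last):
    """Compare granted patent numbers against the withdrawn list.

    granted:   patent numbers found in the bulk-derived file (ints)
    withdrawn: patent numbers on the withdrawn list (ints)
    first, last: inclusive range of numbers expected to exist
                 (assumption: numbering is consecutive in this range)
    Returns (in_both, unexplained) as two sets.
    """
    granted = set(granted)
    withdrawn = set(withdrawn)

    # Patents that were published in the bulk data and later withdrawn.
    in_both = granted & withdrawn

    # Numbers in the expected range that never show up in the granted file.
    missing = set(range(first, last + 1)) - granted

    # Gaps that the withdrawn list does not account for.
    unexplained = missing - withdrawn
    return in_both, unexplained


# Toy data, not real patent numbers.
granted = [1, 2, 3, 5, 6, 8]
withdrawn = [3, 4, 9]

in_both, unexplained = compare_patent_sets(granted, withdrawn, first=1, last=8)
print(sorted(in_both))      # [3]
print(sorted(unexplained))  # [7]

# The size of the real gap as a share of the file, using the counts above.
print(f"{306 / 7_500_000:.4%}")  # 0.0041%
```

In the toy data, 3 is in both lists (issued, then withdrawn), 4 is missing but explained by the withdrawn list, and 7 is the unexplained gap.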

I think the patent office should produce a “catchup” xml file containing the 306 valid patents that are not in the bulk xml grant files. The Patentsview people could then add them to their database and other bulk data consumers could do whatever they want to do with them. If the Patentsview people also excluded withdrawn patents, I’d be a lot happier as a consumer of their api. There are other flaws with the api, but this would be a step in the right direction.

(Some of the missing patents are listed on this page, and some listed on this page, for a total of 306 patents.)
