Bulk Data Problems

The USPTO (United States Patent and Trademark Office) has made some, but not all, of its data available to anyone who wants to download it. The first thing to be aware of is that some of the patents in the bulk grant XML files were subsequently withdrawn. The last time I checked, there were around 8,000 withdrawn patents in the bulk grant XML files. The second thing to be aware of is that there are around 300 granted patents whose data is inexplicably absent from the bulk grant files. Percentage-wise it’s a minuscule oversight, but shouldn’t 100% of the granted patents be present in the bulk grant files? Hey USPTO, how about producing a catch-up file that contains the missing granted patents?
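Both discrepancies boil down to set differences. Here is a minimal Python sketch, assuming you have already collected three lists of patent numbers yourself. The function name and its three parameters are hypothetical, and how each list is obtained is left open; nothing here is a USPTO-provided API.

```python
def audit_bulk_grants(bulk_numbers, withdrawn_numbers, granted_numbers):
    """Compare the patent numbers found in the bulk grant files with reference lists.

    All three arguments are iterables of patent-number strings. How they are
    collected is up to the reader (an assumption of this sketch).
    """
    bulk = set(bulk_numbers)
    withdrawn = set(withdrawn_numbers)
    granted = set(granted_numbers)

    # Patents present in the bulk files that were later withdrawn.
    withdrawn_in_bulk = bulk & withdrawn
    # Granted, non-withdrawn patents that never made it into the bulk files.
    missing_from_bulk = (granted - withdrawn) - bulk
    return sorted(withdrawn_in_bulk), sorted(missing_from_bulk)


# Made-up placeholder numbers, purely for illustration.
in_bulk, missing = audit_bulk_grants(
    bulk_numbers=["0000001", "0000002", "0000003"],
    withdrawn_numbers=["0000002"],
    granted_numbers=["0000001", "0000003", "0000004"],
)
print(in_bulk)   # ['0000002']
print(missing)   # ['0000004']
```

With real inputs, the first list would correspond to the roughly 8,000 withdrawn patents and the second to the roughly 300 missing ones.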

The other flaws I have noticed have to do with the bulk classification files. The patent office stopped producing the bulk United States Patent Classification (USPC) file even though USPCs are still being assigned. It is true that since June 2015 utility patents are no longer assigned USPCs, but plant patents and design patents still receive them (as do reissued plant and design patents). The last bulk USPC file produced ended with assignments for PP29260 and D816289, both issued April 24, 2018. (See the image above.)

The other classification in use is the Cooperative Patent Classification (CPC). There is a bulk CPC file, but it only contains CPC assignments for utility patents. CPC assignments for reissued patents and plant patents are not included in the bulk data file. It seems to be a well-guarded secret, or at least not widely publicized, that plant patents receive CPC assignments. I’ve found that roughly half of all plant patents have received one or more CPC assignments, as shown on this page of mine.

The real problem is that the PatentsView API builds its database from the bulk data files. From the above, this means that it is missing the ~300 patents that are not present in the bulk grant files; plant patents after PP29260 and design patents after D816289 don’t have USPC assignments; and approximately half of the 30,000 plant patents are missing their CPC assignments, as are all reissued patents. By choice, they load all the data from the bulk grant XML files, so their database contains around 8,000 withdrawn patents (see my previous post on withdrawn patents). Take a look at the Data Collection Phase, which shows how they process the bulk files. The diagram above shows corrections that should be made to their loading process.

If you have been paying close attention, you will have noticed that in the PatentsView database plant patents after PP29260 are only searchable by their at-issue International Patent Classification (IPC). It’s the only classification system they have data for, as they lack a bulk post-PP29260 USPC file and a non-utility-patent bulk CPC file. There are lots of things you can do with PatentsView, but one thing you cannot do is effectively search plant patents by USPC or CPC.
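If you are working with data derived from the bulk files, it can help to flag which plant and design patents fall after the last bulk USPC file. Here is a rough Python sketch using only the two cutoff numbers quoted in this post. The function name is hypothetical, and a number at or below the cutoff is not a guarantee that the patent actually appears in the file.

```python
import re

# Last patents with assignments in the final bulk USPC file, per this post.
LAST_BULK_USPC = {"PP": 29260, "D": 816289}


def within_last_bulk_uspc(patent_number):
    """Return True/False for plant (PP) and design (D) patents, None otherwise.

    None means this sketch cannot tell: utility and reissue numbers are not
    covered by the two cutoffs above.
    """
    cleaned = patent_number.strip().replace(",", "").upper()
    match = re.fullmatch(r"(PP|D)0*(\d+)", cleaned)
    if match is None:
        return None
    prefix, digits = match.groups()
    return int(digits) <= LAST_BULK_USPC[prefix]


print(within_last_bulk_uspc("PP29260"))   # True
print(within_last_bulk_uspc("PP29261"))   # False
print(within_last_bulk_uspc("D816,290"))  # False
print(within_last_bulk_uspc("6287179"))   # None
```

Anything that comes back False is a patent for which only the at-issue IPC will be available in a database built from the bulk files.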

The USPTO’s online search page, patft, has all of the above, with the exception of four granted, non-withdrawn patents that are also missing from the bulk grant files (6,287,179; 6,392,191; 6,394,333; and 6,558,580). patft has CPC assignments for plant and reissued patents, and USPC assignments for plant and design patents after PP29260 and D816289, but the USPTO does not make this data available as bulk download files.

Bulk data is great, depending on what you do with it, but the USPTO still has a monopoly on some of its data. I have been unsuccessful in my efforts to have this rectified and would appreciate any help!

Notations in the diagram above

  1. PatentsView should not load withdrawn patents into its database
  2. The USPTO should resume producing the bulk USPC file
  3. The bulk CPC file should contain all patents, not just utility patents
