Bulk Data Problems

The USPTO (United States Patent and Trademark Office) made some, but not all, of its data available to anyone who wanted to download it. The first thing to be aware of is that some of the patents in the bulk grant xml files were subsequently withdrawn. The last time I checked there were around 8,000 withdrawn patents in the bulk grant xml files. The second thing to be aware of is that there are around 300 granted patents whose data is inexplicably absent from the bulk grant files. Percentagewise, it’s a minuscule oversight, but shouldn’t 100% of the granted patents be present in the bulk grant files? Hey USPTO, how about producing a catchup file that contains the missing granted patents?

The other flaws I have noticed have to do with the bulk classification files. The patent office stopped producing the bulk United States Patent Classification (USPC) file even though USPCs are still in use. It is true that since June 2015 utility patents are not assigned USPCs, but plant patents and design patents still receive them (as do reissued plant and design patents). The last bulk USPC file produced ended with assignments for PP29260 and D816289, both issued April 24, 2018. (See the image above.)
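
Here is a minimal sketch of how those two cutoffs could be checked in code. The function name and the number parsing are my own illustration, not anything the USPTO provides, and it only handles plant (PP) and design (D) numbers; it returns None for everything else rather than guess.

```python
import re

# Last plant and design patents covered by the final bulk USPC file,
# as stated above (both issued April 24, 2018).
LAST_BULK_USPC = {"PP": 29260, "D": 816289}


def in_last_bulk_uspc(patent_number):
    """Return True/False for plant and design patents, None for other types.

    None means "not determined by this sketch" (utility, reissue, etc.).
    """
    match = re.fullmatch(r"(PP|D)(\d+)", patent_number.strip().upper())
    if match is None:
        return None
    prefix, digits = match.groups()
    return int(digits) <= LAST_BULK_USPC[prefix]


print(in_last_bulk_uspc("PP29260"))   # True
print(in_last_bulk_uspc("PP29261"))   # False
print(in_last_bulk_uspc("D816290"))   # False
print(in_last_bulk_uspc("10000000"))  # None
```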

The other classification in use is the Cooperative Patent Classification (CPC). There is a bulk CPC file, but it only contains CPC assignments for utility patents. CPC assignments for reissued patents and plant patents are not included in the bulk data file. It seems to be a well-guarded secret, or at least not widely publicized, that plant patents receive CPC assignments. I’ve found that roughly half of all plant patents have received one or more CPC assignments, as shown on this page of mine.

The real problem is that the patentsview api team uses the bulk data files to build their database. From the above, this means that they are missing the ~300 patents that are not present in the bulk grant files; plant patents after PP29260 and design patents after D816289 don’t have USPC assignments; and approximately half of the 30,000 plant patents are missing their CPC assignments, as are all reissued patents. By choice, they load all the data from the bulk grant xml files, so their database contains around 8,000 withdrawn patents (see my previous post on withdrawn patents). Take a look at the Data Collection Phase that shows how they process the bulk files. The diagram above shows corrections that should be made to their loading process.

If you have been paying particular attention, you will have noticed that in the patentsview database plant patents after PP29260 are only searchable by their at-issue International Patent Classification (IPC). It’s the only classification system they have data for, as they lack a bulk post-PP29260 USPC file and a non-utility-patent bulk CPC file. There are lots of things you could do with patentsview, but one thing you cannot do is effectively search plant patents by USPC or CPC.

The USPTO’s patft, its online search page, has all of the above with the exception of four granted, non-withdrawn patents that are also missing from the bulk grant files (6,287,179; 6,392,191; 6,394,333; and 6,558,580). patft has CPC assignments for plant and reissued patents, and USPC assignments for plant and design patents after PP29260 and D816289, but the USPTO does not make this data available as bulk download files.

Bulk data is great, depending on what you do with it, but the USPTO still has a monopoly on some of its data. I have been unsuccessful in my efforts to have this rectified and would appreciate any help!

Notations in the diagram above

  1. Patentsview should not load withdrawn patents into their database
  2. The USPTO should resume producing the bulk USPC file
  3. The bulk CPC file should contain all patents, not just utility patents

When apis fail you

Sometimes there isn’t a way other than screen scraping to get the data you want, which is unfortunate. I’d like to programmatically retrieve classification fields for the plant patents issued each Tuesday. I can’t use the patentsview api since its data lags behind: it’s updated roughly quarterly while the patent office’s site is updated each Tuesday. Plus, the api does not return uspc classifications on newer plant patents, as the patent office has stopped producing the bulk file of them (the last file produced stopped with PP29260, issued April 24, 2018). The api also does not return the cpcs that are now coming back on about half of the plant patents, as there is no bulk source of them (the bulk cpc file only contains utility patents; fans of reissued patents are also out of luck). See this page if you don’t believe me that some plant patents do get cpc assignments!

Similarly, I could try to use the PEDS (Patent Examination Data System) api, but it only returns one uspc classification per patent when multiples are allowed [1], and it also does not return cpcs. So, having no other free option, you can’t blame a guy if he makes requests weekly to patft and scrapes the page of data that is returned!

[1] If you want to check for yourself, these plant patents each had 4 uspc assignments when I scraped them: PP23484, PP23723, PP23924, PP24080, PP24201, PP24521, PP24634, PP24828. Compare peds and patentsview to patft to see the disparity.

The Python Wrapper

This has nothing to do with a snake in a hoodie, laying down rhythmic rhymes, that would be Python The Rapper 🙂 The same people who wrote the Patentsview api also wrote a python wrapper that produces a csv file for you. All you need to do is download the code and dependencies (instructions provided in the link) and write a configuration file for your query or queries (more than one query can be specified in a configuration file). I realized that the queries it makes can be chained together, where the output of one becomes the input of another. I posted about it here, in the patentsview forum.
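
To make the chaining idea concrete, here is a rough sketch that does it by hand rather than through the wrapper’s configuration files, so it is not the wrapper’s own code. It assumes the api takes a JSON query in a q parameter and a JSON field list in an f parameter, that results come back under a "patents" key, and that a list of values means "any of these". The base url is also an assumption, so point it at whatever endpoint you actually use.

```python
import json
from urllib.parse import urlencode

# Assumption: a /patents/query endpoint at this host; adjust as needed.
BASE_URL = "https://api.patentsview.org/patents/query"


def build_query_url(criteria, fields):
    """Encode the search criteria (q) and the field list (f) as a GET url."""
    params = {"q": json.dumps(criteria), "f": json.dumps(fields)}
    return BASE_URL + "?" + urlencode(params)


def chain_criteria(first_response, field="patent_number"):
    """Turn the patents returned by one query into the criteria for the next."""
    rows = first_response.get("patents") or []  # tolerate a missing or null list
    return {field: [row[field] for row in rows]}


# A hand-written stand-in for the first query's response (no request is made).
first_response = {"patents": [{"patent_number": "PP23484"},
                              {"patent_number": "PP23723"}]}

second_criteria = chain_criteria(first_response)
print(second_criteria)
print(build_query_url(second_criteria, ["patent_number", "patent_title"]))
```

The output of the first query becomes the q of the second, which is all the chaining amounts to.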

Patents not in bulk xml files

I nearly fell out of my chair when the patent office announced that they would be giving away their data for free! How cool is that? Free data, where is the catch? It turned out there isn’t one, well maybe a very small one, percentagewise that is.

Each quarter the Patentsview api team processes all the bulk grant xml files the patent office makes available, something like a zillion of them (actually closer to 2,000 at the time of this writing). That’s how they create the database that their api uses. After processing the files and updating their database, they then make their data available for download, and get this, it’s also free!

Separately, the patent office constantly updates a list of withdrawn patents. I thought hey, I wonder if there are any missing patents in Patentsview’s patent.tsv file? I’m that sort of guy, you know. Often interesting things can be found when you examine the gaps and overlaps of related data files or data sets. In theory, any patent numbers not in their patent file should correspond to withdrawn patents, right? It turned out to be only partially true, which was really unexpected.

First, I found that there are around 8,000 patents in both files: patents that were issued and whose data was included in the bulk grant xml files, but that were later withdrawn! This is problematic for the Patentsview database, as I tried to point out to them. They really should exclude withdrawn patents from their database, but they don’t see it that way. In other words, the results of a search using their api could contain patents that have been withdrawn. As a user of the api I find this unacceptable.

Second, I did find unexplained gaps! There are 306 patent numbers not in the Patentsview file that do not correspond to withdrawn patents. I double checked a few and the patents are indeed missing from the bulk xml files. Percentagewise it’s minuscule, 306 patents out of 7.5M patents, but shouldn’t it be 0%? Again, this is problematic for the Patentsview database. Searches using their api cannot include these patents as they were not part of their underlying source. That’s another serious flaw with their api, even though it isn’t their fault that the patents are not part of the bulk xml files. I pointed this out to them, but an alternate source of this data has not been found.
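
The gaps-and-overlaps check boils down to a few set operations. Here is a minimal sketch, assuming the patent numbers have already been read from the two files into plain integers (utility patents only); the function name and the tiny sample numbers are made up purely to show the mechanics.

```python
def compare_with_withdrawn(present, withdrawn, first, last):
    """Compare the numbers present in a patent file against the withdrawn list.

    Returns (overlap, unexplained) for the inclusive range first..last.
    """
    present = set(present)
    withdrawn = set(withdrawn)
    expected = set(range(first, last + 1))

    overlap = sorted(present & withdrawn)      # in the file, but withdrawn
    missing = expected - present               # numbers absent from the file
    unexplained = sorted(missing - withdrawn)  # absent and not withdrawn
    return overlap, unexplained


# Made-up numbers, not real patents.
overlap, unexplained = compare_with_withdrawn(
    present=[1, 2, 3, 5, 8], withdrawn=[3, 4, 6], first=1, last=8
)
print(overlap)      # [3]
print(unexplained)  # [7]
```

In the real comparison, the first list corresponds to the roughly 8,000 withdrawn patents that are in the database and the second to the 306 unexplained gaps.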

I think the patent office should produce a “catchup” xml file containing the 306 valid patents that are not in the bulk xml grant files. The Patentsview people could then add them to their database and other bulk data consumers could do whatever they want to do with them. If the Patentsview people also excluded withdrawn patents, I’d be a lot happier as a consumer of their api. There are other flaws with the api, but this would be a step in the right direction.

(Some of the missing patents are listed on this page, and some listed on this page, for a total of 306 patents.)

TSDR and the api key

The T in USPTO (United States Patent and Trademark Office), as you should have just learned, stands for Trademark. Their TSDR (Trademark Status & Document Retrieval) api deals with trademarks. If you’ve used it recently, you probably noticed that they now require an api key, which you can get by registering with them. Their Swagger-UI page is at https://developer.uspto.gov/swagger/tsdr-api-v1 but it doesn’t allow you to enter your now-required api key and it has a number of other omissions (listed in my github repo). Further, their api does not accept browser requests coming from domains other than their own (they’re blocked by CORS policy), which is why the Swagger-UI page I created does not work (though the generated curl commands work and my modified swagger object can be imported into postman from https://mustberuss.github.io/TSDR-Swagger/myswagger_v1_tsdr_uspto.json). I emailed them to point out these problems, but they have not updated their Swagger-UI page or allowed CORS requests.
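
Outside the browser, CORS doesn’t apply, so a script only has to attach the key. Here is a minimal sketch that builds (but does not send) such a request. The header name USPTO-API-KEY and the url pattern are both assumptions written from memory, so check them against the api’s documentation before relying on them.

```python
from urllib.request import Request

# Assumptions: the header the api expects and the shape of a status url.
API_KEY_HEADER = "USPTO-API-KEY"
STATUS_URL = "https://tsdrapi.uspto.gov/ts/cd/casestatus/sn{serial}/info.json"


def build_status_request(serial_number, api_key):
    """Build a GET request for one trademark serial number, key attached."""
    request = Request(STATUS_URL.format(serial=serial_number))
    request.add_header(API_KEY_HEADER, api_key)
    return request


request = build_status_request("78787878", "your-api-key-here")
print(request.full_url)
print(request.header_items())
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a key.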

I’ve been down this road before: I cannot get the patentsview people to adopt the Swagger-UI page I created for their api. Using an api’s Swagger-UI page can be a great way to learn the ins and outs of an api, but it takes a little cooperation from the api provider! By contrast, the Swagger-UI page for the USPTO’s PEDS (Patent Examination Data System) api works as it should, without my involvement.

Developer Candy: Swagger UI for the patentsview api

Russ Allen, developer and patent enthusiast, created a Swagger UI json object and explains its usefulness.

Pretend for a moment that you are a developer working on something cool that needs to call a web service. If you are lucky, the web service provider will have made a Swagger UI web page available for you. It’s an opensource project that generates a web page that lets users make calls to the web service by filling in form fields. It’s similar to Postman with a lot of setup work done for you. At the heart of Swagger UI is a json object that specifies all the api does or will do if you play by its rules (properly use its verbs and endpoints by passing what it expects in the formats it accepts). All an api provider needs to do is to create the json object and plug it into the Swagger UI package they’ve downloaded from swagger.io. That’s nothing more than copying the json object and dist folder to their web site. Then they just need to update the index.html file with the url of the file containing their json object and boom, their web site has a Swagger UI web page for the whole world to use!
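
For a sense of what such a json object looks like, here is a deliberately tiny Swagger 2.0 sketch built as a Python dict and printed as json. It is an illustration only, not the object Russ wrote: the host is hypothetical, and the single endpoint with its q and f query parameters is an assumed shape for a patent search api.

```python
import json

swagger = {
    "swagger": "2.0",
    "info": {"title": "Example patents api", "version": "1.0"},
    "host": "api.example.org",  # hypothetical host
    "schemes": ["https"],
    "paths": {
        "/patents/query": {
            "get": {
                "summary": "Search patents",
                "produces": ["application/json"],
                "parameters": [
                    {"name": "q", "in": "query", "type": "string",
                     "required": True,
                     "description": "JSON-formatted search criteria"},
                    {"name": "f", "in": "query", "type": "string",
                     "required": False,
                     "description": "JSON-formatted list of fields to return"},
                ],
                "responses": {
                    "200": {"description": "Matching patents"},
                    "400": {"description": "Bad Request"},
                },
            }
        }
    },
}

# This printed text is what would be saved to a file and served to Swagger UI.
print(json.dumps(swagger, indent=2))
```

Swagger UI turns each entry under parameters into a form field, which is where the fill-in-the-blanks page comes from.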

Russ noticed that the patentsview api did not have a Swagger UI web page, so he created the necessary json object. Below is an example that shows both the power of the patentsview api and of Swagger UI. We start by filling in the Swagger UI web page’s form fields, which will issue a get to patentsview’s /patents/query endpoint. We intentionally made a mistake; perhaps you’ll be able to spot it.

When we press the Execute button in the Swagger UI web page, the response is added to the UI page.

It seems we’ve made the api mad by requesting a field in the f parameter that it doesn’t yet support. Fortunately for us, the patentsview api developers thoughtfully return an x-status-reason response header explaining exactly why it returned a 400 or Bad Request response code. How cool is that? (Note that not all api providers go to this extent to be helpful.) If we correct our request and press the Execute button again, we are rewarded with the api’s data returned nicely formatted.
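
The same header is just as handy in code as it is in the UI. Below is a minimal sketch of reading it on a failed call, using only Python’s standard library. The function names and the fallback message are assumptions made for illustration, and the sample header value in the usage line is made up.

```python
import urllib.error
import urllib.request


def explain_failure(status, headers):
    """Return a readable explanation for a failed call, or None on success."""
    if status < 400:
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    reason = lowered.get("x-status-reason", "no reason supplied")
    return f"HTTP {status}: {reason}"


def fetch(url):
    """Fetch a url, returning (body, explanation); one of the two is None."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8"), None
    except urllib.error.HTTPError as err:
        return None, explain_failure(err.code, err.headers)


# No request is made here; the header value is a made-up example.
print(explain_failure(400, {"X-Status-Reason": "bad field"}))  # HTTP 400: bad field
```

Logging the explanation instead of a bare 400 saves a trip back to the documentation.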


The Swagger UI web page and this api’s x-status-reason become a powerful tool for developers and interacting with the api is a great way to quickly learn the ins and outs of the api before writing any code. Try this very demo here!

The json object can also open doors in the opensource community. Several opensource projects use the json object as input and convert it to other formats or generate tests etc. Like many things in life, there are two standards. There’s the Swagger 2.0 specification and the newer Swagger 3.0, also known as the OpenAPI specification. Russ used one of these opensource projects to convert the Swagger 2.0 json object into its corresponding Swagger 3.0/OpenAPI object. Having both versions maximizes usefulness. Some opensource projects accept either version but there are ones looking for a specific version. There’s a nicely formatted list of these projects and which version(s) they accept at http://openapi.tools. Oh, and if Russ hasn’t sold you on the power of the json object, he suggests that you try importing it into Postman to see what happens: Boom, nicely loaded Postman page just itching to hit the patentsview api endpoints for you! In Postman:
File -> Import -> Import From Link [Swagger (v1/v2)] http://patentsview.historicip.com/patentsview.json

Russ would like to contribute the two json objects to the patentsview api project if its developers would care to host a Swagger UI page at patentsview.org. Otherwise, the patentsview Swagger 2.0 UI page is available at http://patentsview.historicip.com/swagger/ and the Swagger 3.0/OpenAPI version is at http://patentsview.historicip.com/swagger3/. The UI pages look the same, but the underlying json objects are distinct and correspond to their respective Swagger versions.

Note: Currently the X-Status-Reason header is not being displayed in either version of the UI (in chrome at least). I’ve opened an issue to address this.