Bulk Data Problems

The USPTO (United States Patent and Trademark Office) made some, but not all, of its data available to anyone who wanted to download it. The first thing to be aware of is that some of the patents in the bulk grant xml files were subsequently withdrawn. The last time I checked there were around 8,000 withdrawn patents in the bulk grant xml files. The second thing to be aware of is that there are around 300 granted patents whose data is inexplicably absent from the bulk grant files. Percentagewise, it’s a minuscule oversight, but shouldn’t 100% of the granted patents be present in the bulk grant files? Hey USPTO, how about producing a catchup file that contains the missing granted patents?

The other flaws I have noticed have to do with the bulk classification files. The patent office stopped producing the bulk United States Patent Classification (USPC) file even though USPCs are still being assigned. It is true that since June 2015 utility patents are not assigned USPCs, but plant patents and design patents (including reissued plant or design patents) still receive them. The last bulk USPC file produced ended with assignments for PP29260 and D816289, both issued April 24, 2018. (See the image above.)

The other classification in use is the Cooperative Patent Classification (CPC). There is a bulk CPC file but it only contains CPC assignments for utility patents. CPC assignments for reissued patents and plant patents are not included in the bulk data file. It seems to be a well-guarded secret, or at least not widely publicized, that plant patents receive CPC assignments. I’ve found that roughly half of all plant patents have received one or more CPC assignments, as shown on this page of mine.

The real problem is that the patentsview api uses the bulk data files to build their database. From the above, this means that they are missing the ~300 patents that are not present in the bulk grant files; plant patents after PP29260 and design patents after D816289 don’t have USPC assignments; and approximately half of the 30,000 plant patents are missing their CPC assignments, as are all reissued patents. By choice, they load all the data from the bulk grant xml files, so their database contains around 8,000 withdrawn patents (see my previous post on withdrawn patents). Take a look at the Data Collection Phase that shows how they process the bulk files. The diagram above shows corrections that should be made to their loading process.

If you have been paying particular attention, you will have noticed that in the patentsview database plant patents after PP29260 are only searchable by their at-issue International Patent Classification (IPC). It’s the only classification system they have data for, as they lack a bulk post-PP29260 USPC file and a non-utility-patent bulk CPC file. There are lots of things you could do with patentsview, but one thing you cannot do is effectively search plant patents by USPC or CPC.

The USPTO’s patft, its online search page, has all of the above with the exception of four granted, non-withdrawn patents that are also missing from the bulk grant files (6,287,179; 6,392,191; 6,394,333; and 6,558,580). patft has CPC assignments for plant and reissued patents, and USPC assignments for plant and design patents after PP29260 and D816289, but the USPTO does not make this data available as bulk download files.

Bulk data is great, depending on what you do with it, but the USPTO still has a monopoly on some of its data. I have been unsuccessful in my efforts to have this rectified and would appreciate any help!

Notations in the diagram above

  1. Patentsview should not load withdrawn patents into their database
  2. The USPTO should resume producing the bulk USPC file
  3. The bulk CPC file should contain all patents, not just utility patents

Withdrawn Patents

Patents get withdrawn from time to time. Some are never issued but some are withdrawn after being issued. In the latter case, data for the withdrawn patent can be found in the wild. The patent office maintains a list of withdrawn patents at http://www.uspto.gov/patents-application-process/patent-search/withdrawn-patent-number. Separately, the patentsview api team processes the bulk grant patent xml files and makes their files available for download. If one compares the patentsview patent.tsv file to the patent office’s withdrawn patent list, one finds (or found at the time this was written) 7,930 patents in both files. The patent office removes withdrawn patents from its web site, so they are not returned by searches there. This is not the case with the patentsview api. It will return withdrawn patents, which is pretty bizarre. I don’t know of another patent platform that does that. I raised a git issue to point this out to the otherwise fine patentsview folks but nothing has changed. (Two take-aways here: one is that there is data for withdrawn patents in the grant xml files, and the other is that patentsview loads them into their database.)

Another source of data for withdrawn patents is the USPat dvds once produced by the patent office. The data is available for download as thousands of zip files containing tiff images of patents, both withdrawn ones and ones that were not withdrawn. In the zip files I have analyzed, I have found 5,191 withdrawn patents among the millions of patents that have not been withdrawn.

The last source that I know of for data on withdrawn patents is the Official Gazettes (OGs) produced by the patent office each week. Some patents appear in the OGs that are subsequently withdrawn. An example would be PP31,892, which would have been issued on June 23, 2020. That patent wasn’t in the grant xml for the patents granted on June 23, 2020 but it did appear in the OG for that date. It is also listed on the patent office’s withdrawn patent page. Interestingly, PP31,893 was also withdrawn but it is not present in the xml file for June 23, 2020 and the OG says “Patent Not Issued For This Number”. Above is an image that shows the OG entries for these two withdrawn patents.

A possible source that I haven’t fully investigated is Hathi Trust. They have scanned many of the OGs that were physically published. The last printed OG was September 24, 2002; more recent ones are only published electronically.

So if you are interested in withdrawn patents, they are out there! (That is, there may be xml data, tiffs and/or OG html and images available.) Oh, and another trick to finding which patents are withdrawn is to do a search in patft for ccl/WITHDRAWN, slightly nonsensical syntax but it works!

NYPL/UMD Plant Patent Project

One of the more surprising elements of plant patents is that their online images are in black and white! Patent and Trademark Resource Centers (PTRC) scattered across the US receive color copies of them but the online community is left guessing what each patented plant looks like in color. A few years ago, Ken Johnson at the PTRC in New York City’s Public Library (NYPL) began scanning the color copies they received. He put them online with the giant caveat that they cannot be used for legal purposes; only the official color copies can be used legally. One of the libraries at the University of Maryland (UMD) is also a PTRC and they have taken up scanning plant patents not scanned by the New York Public Library. So, if you are wondering what a particular plant patent looks like in color, head over to https://www.lib.umd.edu/plantpatents or http://www.nypl.org/collections/nypl-recommendations/guides/plant-patents-2012. Not all of the nearly 33,000 plant patents have been scanned, but they are working on it. Be sure to check out the UMD project’s credits page; I might be mentioned on it. Oh, and if you are curious what the rose plant above looks like, unofficially of course, it’s here.


DATAMP = Directory of American Tool and Machinery Patents

If you are looking for the patent associated with an antique tool, head over to datamp.org. It’s quite possible you’ll be able to find the tool patent you are looking for among the 70,000 or so patents there. I’m a developer of the site and one of the data stewards who enter patent data, so I highly recommend the site!

Here’s the most recent patent that was entered into datamp:

  • US Patent: 404,057
    Micrometer Caliper
    Patentee: Morris F. Smith - New Haven CT
    Granted: 1889-05-28
    Manufactured by R.H. Brown & Co. - New Haven, New Haven County CT

    This 6 inch Micrometer was #128 in the Starrett catalog. This is a 0-1" micrometer mounted on a bar with a pin and a set of precision bushings. Of the 6 possible positions in the example shown (image #2), only one was .001" off. The rest were less than that.

    My invention relates to that class of micrometer-calipers, which measures distances greater than one inch by means of moving a slide, which carries one of the caliper-points. The objects of my invention are accurate and expeditious adjustment of the slide and improved means for taking up the wear of the micrometer-screw. To enable others to make and use my improved calipers, I will give a description of the same in detail, reference being had to the drawings hereto annexed. The beam A, Fig. 1, has the two parts a and b. The part a is rectangular and of equal size throughout its length and the part b integral with the part a is bent at a right angle to it and terminates in the cylindrical enlargement c, which is tapped to receive the screw B. The part a is perforated in two lines parallel and near to the sides of the beam, into which hardened-steel bushings d are forced. These bushed holes are arranged to operate in conjunction with the bushed holes in the slide, so that when bushings having like numbers are brought in line the micrometer-screw being in the position shown in Fig. 1, the distance between caliper points C and D will be as many inches or units of measure as the number on the coincident bushings indicates. Thus in Fig. 1 bushings 3 are in line, and the distance between points C and D is three units. Fractional measurements are obtained in the same manner as in the ordinary micrometers. The screw B is reduced in size where it protrudes from the beam, and is chamfered at the end C, and forms one of the caliper-points.

    The slide E is a rectangular box fitted to the beam, except on the lower side, where room enough is left for the shoe or gib g to fill. On its lower side a circular boss is raised, which is tapped to receive the thumb-screw which passes through the lower side of the slide and comes against the shoe g. The shoe g is turned up at each end at right angles to hold it in place when the screw is released. On the upper side of the slide extends upward, the arm e, terminating in the cylindrical enlargement , which is tapped to receive the part F. When the slide is in place on the beam A, the axes of the threaded parts c and are in line. The wider sides, h, of the slide are perforated in two parallel lines, which are in the same planes as the lines of the holes in the beam, and the holes are equal in number. Into these holes, bushings i are forced, and are ground out, as is hereinafter explained. The pin G, with the knurled head, is fitted to these bushings and holds the slide from endwise movement when the tool is in use.

This is one of 70,708 patents currently in the database at datamp.org

When apis fail you

Sometimes there isn’t a way other than screen scraping to get the data you want, which is unfortunate. I’d like to programmatically retrieve classification fields for the plant patents issued each Tuesday. I can’t use the patentsview api since its data lags behind: it’s updated roughly quarterly while the patent office’s site is updated each Tuesday. Plus the api does not return uspc classifications on newer plant patents, as the patent office has stopped producing the bulk file of them (the last file produced stopped with PP29260, issued April 24, 2018). The api also does not return the cpcs that are now coming back on about half of the plant patents, as there is no bulk source of them (the bulk cpc file only contains utility patents, so fans of reissued patents are also out of luck). See this page if you don’t believe me that some plant patents do get cpc assignments!

Similarly, I could try to use the PEDS (Patent Examination Data System) api, but it only returns one uspc classification per patent when multiples are allowed [1], and it also does not return cpcs. So, having no other free option, you can’t blame a guy if he makes requests weekly to patft and scrapes the page of data that is returned!

[1] If you want to check for yourself, these plant patents each had 4 uspc assignments when I scraped them: PP23484, PP23723, PP23924, PP24080, PP24201, PP24521, PP24634, PP24828. Compare peds and patentsview to patft to see the disparity.

OCRed Plant Patents

Bulk data became a thing a few years ago, so I downloaded the USMark trademark zip files, all 43,000 of them. I then OCR’ed a little over a million of the registration certificates in the zip files and put them into a searchable database. Then I did the same thing for plant patents, which began in 1931. I OCR’ed PP1 through PP3,986, searchable here. PP3,987 is the first fully searchable plant patent on the uspto’s web site, so I didn’t need to OCR any further.

The Python Wrapper

This has nothing to do with a snake in a hoodie, laying down rhythmic rhymes, that would be Python The Rapper 🙂 The same people who wrote the Patentsview api also wrote a python wrapper that produces a csv file for you. All you need to do is download the code and dependencies (instructions provided in the link) and write a configuration file for your query or queries (more than one query can be specified in a configuration file). I realized that the queries it makes can be chained together, where the output of one becomes the input of another. I posted about it here, in the patentsview forum.

Reissued Patents

I’m one of the developers of datamp.org, an online database of tool patents. I’m also one of the handful of people who enter tool patents there. I was surprised to find a reissued patent on the uspto’s web site for a utility patent I had entered into datamp. If you find a reissued patent on the uspto’s site, it references the original patent. The opposite is not true, however. If you look up a patent at uspto.gov, you don’t necessarily know, that is, you aren’t shown, if it was subsequently reissued. If the patent was issued in or after 1976, you can do an advanced search on the uspto’s site to see if the patent in question was reissued (reis/XXXXXXXX, where the X’s are the eight-digit patent number, padded with leading zeroes if necessary). Most of the patents in datamp were issued before 1976, so the advanced search wasn’t an option to find the reissued patent I accidentally found while looking for another patent.
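The padding rule for the reis/ search term is simple enough to do by hand, but here is a minimal Python sketch of it. The function name reissue_query is made up for illustration and is not something the patent office provides. It only formats the search term; it doesn’t contact any web site.

```python
def reissue_query(patent_number):
    """Build a 'reis/' advanced-search term for a utility patent number.

    reissue_query is a hypothetical helper name. Commas are dropped and
    the number is left-padded with zeroes to eight digits.
    """
    digits = patent_number.replace(",", "").strip()
    if not digits.isdigit() or len(digits) > 8:
        raise ValueError("expected a numeric patent number of at most 8 digits")
    return "reis/" + digits.zfill(8)


# Made-up patent number, just to show the padding.
print(reissue_query("4,123,456"))  # reis/04123456
```

Anything that isn’t a plain number (a PP or D prefix, for instance) is rejected rather than guessed at, since I haven’t checked how the search treats those.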

Take a look at the patent I accidentally found: it’s RE22,908, issued August 12, 1947. It shows that it’s a reissue of 2,314,915, issued March 30, 1943. However, when you view 2,314,915 it doesn’t show that it was reissued.

I created a database table for reissued patents using some post-1975 bulk patent xml files I had downloaded. It occurred to me that it would be cool if reissued patent data was made available in bulk. It would save me from downloading giant xml files and running a program to look for reissued patent data. Check out my reissued patent page. Wouldn’t it be really cool if the pre-1976 reissue data was also available?

Patents not in bulk xml files

I nearly fell out of my chair when the patent office announced that they would be giving away their data for free! How cool is that? Free data, where is the catch? It turned out there isn’t one, well maybe a very small one, percentagewise that is.

Each quarter the Patentsview api team processes all the bulk grant xml files the patent office makes available, something like a zillion of them (actually closer to 2,000 at the time of this writing). That’s how they create the database that their api uses. After processing the files and updating their database, they then make their data available for download, and get this, it’s also free!

Separately, the patent office constantly updates a list of withdrawn patents. I thought hey, I wonder if there are any missing patents in Patentsview’s patent.tsv file? I’m that sort of guy, you know. Often interesting things can be found when you examine the gaps and overlaps of related data files or data sets. In theory, any patent numbers not in their patent file should correspond to withdrawn patents, right? It turned out to be only partially true, which was really unexpected.

First, I found that there are around 8,000 patents in both files, patents that were issued and whose data was included in bulk grant xml files but that were later withdrawn! This is problematic for the Patentsview database, as I tried to point out to them. They really should exclude withdrawn patents from their database but they don’t see it that way. In other words, a search using their api could return patents that have been withdrawn. As a user of the api I find this unacceptable.

Second, I did find unexplained gaps! There are 306 patent numbers not in the Patentsview file that do not correspond to withdrawn patents. I double checked a few and the patents are indeed missing from the bulk xml files. Percentagewise it’s minuscule, 306 patents out of 7.5M patents, but shouldn’t it be 0%? Again, this is problematic for the Patentsview database. Searches using their api cannot include these patents as they were not part of their underlying source. That’s another serious flaw with their api, even though it isn’t their fault that the patents are not part of the bulk xml files. I pointed this out to them but an alternate source of this data has not been found.
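The gaps-and-overlaps comparison boils down to a little set arithmetic. Below is a minimal Python sketch of the idea, not the program I actually ran. It assumes the patent numbers have already been pulled out of the Patentsview patent file and the withdrawn list into plain lists of strings (loading them depends on the file layouts, which aren’t shown here). The function names are hypothetical, and it only looks at plain numeric utility patent numbers in a range you choose.

```python
def normalize(number):
    """Drop commas and surrounding spaces so '6,287,179' matches '6287179'."""
    return number.replace(",", "").strip()


def compare_patent_lists(bulk_numbers, withdrawn_numbers, first, last):
    """Return (in_both, unexplained) for numeric patent numbers first..last.

    in_both:     numbers present in the bulk data AND on the withdrawn list
    unexplained: numbers in the range that are in neither list
    """
    bulk = {normalize(n) for n in bulk_numbers}
    withdrawn = {normalize(n) for n in withdrawn_numbers}
    expected = {str(n) for n in range(first, last + 1)}

    in_both = bulk & withdrawn & expected
    unexplained = expected - bulk - withdrawn
    return in_both, unexplained


# Tiny made-up example, not real patent data.
in_both, unexplained = compare_patent_lists(
    ["100", "101", "103"], ["101", "104"], 100, 105)
print(sorted(in_both))      # ['101']
print(sorted(unexplained))  # ['102', '105']
```

In the made-up example, 101 plays the part of a withdrawn patent that still has bulk data, while 102 and 105 are the unexplained gaps. For real data, first and last would be the first and last patent numbers the bulk files are supposed to cover.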

I think the patent office should produce a “catchup” xml file containing the 306 valid patents that are not in the bulk xml grant files. The Patentsview people could then add them to their database and other bulk data consumers could do whatever they want to do with them. If the Patentsview people also excluded withdrawn patents, I’d be a lot happier as a consumer of their api. There are other flaws with the api, but this would be a step in the right direction.

(Some of the missing patents are listed on this page, and some listed on this page, for a total of 306 patents.)

My Reference Database

Searching by inventor’s name works at the patent office’s web site back to 1920. ex: in/Edison in the 1790-present database would find patents where one of the inventors’ last names is Edison, the famous guy plus anyone else with that surname. But what if you are interested in patents before 1920, as say someone who is reading something on a site called historicip.com? The patent office doesn’t provide a direct way to do this, but my friend Jeff found a way to do just that! It’s there, hiding in plain sight, on the advanced search’s help page. I’ll give you a second to try to spot it.

Jeff’s trick is to search the modern, 1976 and up, database for references by name! ex: Ref/Edison. That will return modern patents that reference patents by someone with the surname Edison. The reference can be to any patent, including pre-1920 patents! Suddenly there is a way to find pre-1920 patents for an inventor. It’s an absolutely brilliant, albeit clunky, way to cheat the system, to do something that could not otherwise be done. It’s clunky in that you cannot control the date range of the references. The reference could be to a 1920 and up patent you could have found simply by doing an inventor name search (in/Edison). You’d have to look at a lot of false positives, but ah, before google patents was a thing, you would occasionally find a pre-1920 patent by the inventor you were searching for! The unsearchable suddenly became searchable.

Flaws with the trick include finding the same pre-1920 patent referenced again and again by modern patents. The first one is pure gold to its searcher, but the rest become false positives. Another problem is that not every pre-1920 patent gets referenced by a modern patent. So it’s not a perfect trick, but that doesn’t diminish its brilliance any. It’s so clever I would be really proud if I had thought it up myself.

Inspired by Jeff’s trick, I created a searchable database of pre-1920 inventors. I did this by scanning the bulk granted patent xml files, looking for pre-1920 references that I hadn’t already found. Bam, no more false positives! Well, at least not as many as using a straight up reference search, as it turns out. The problem is that the reference data is not completely clean. I found typos in the inventors’ names, dyslexic permutations of the referenced patent number, missing D’s of design patents etc. I also added pre-1920 inventors gleaned from other sources, which I credit. Seems it pays to be a developer of an antique tool patent web site, chock full of pre-1920 inventors’ names. It came into existence through the efforts of my friends Jeff and Ralph and others.