When APIs fail you

Sometimes, unfortunately, there isn’t a way other than screen scraping to get the data you want. Screen scraping is like an arms race: if the USPTO changes its web site, I have to change my script. I’d like to programmatically retrieve classification fields for the plant patents issued each Tuesday. I can’t use the PatentsView API since its data lags behind; it’s updated roughly quarterly, while the patent office’s site is updated each Tuesday. Plus, the API does not return USPC classifications on newer plant patents, as the patent office has stopped producing the bulk file of them (the last file produced stopped with PP29260, issued April 24, 2018). The API also does not return the CPCs that are now coming back on about half of the plant patents, as there is no bulk source of them (the bulk CPC file only contains utility patents, so fans of reissued patents are also out of luck). See this page if you don’t believe me that some plant patents do get CPC assignments!

Similarly, I could try to use the PEDS (Patent Examination Data System) API, but it only returns one USPC classification per patent when multiples are allowed[1], and it also does not return CPCs. Another unappealing option would be to download the weekly XML files of issued patents and extract the plant patent data. The weekly files are huge since they contain all the patents issued that week, so it’s a lot of data I don’t need! As a case in point, today’s XML file, as I write this, has data for 6 plant patents along with 7,552 utility, design and reissued patents.
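For anyone who does want to go the weekly file route, here is a rough sketch of what pulling out just the plant patents might look like. It is not the script I use. It assumes the weekly file is a series of XML documents concatenated together, each starting with its own `<?xml` declaration, that the patent number sits at `publication-reference/document-id/doc-number`, and that plant patent numbers start with `PP`. The function names are made up for illustration, so check all of this against a real file before relying on it.

```python
import xml.etree.ElementTree as ET


def split_documents(text):
    """Split a concatenated weekly file into individual XML documents.

    Assumption: every patent document starts with its own '<?xml ' declaration.
    """
    chunks = text.split("<?xml ")
    return ["<?xml " + chunk for chunk in chunks if chunk.strip()]


def plant_patent_numbers(text):
    """Return the document numbers that look like plant patents.

    Assumptions: the number is found at
    publication-reference/document-id/doc-number and plant patent
    numbers begin with 'PP'.
    """
    numbers = []
    for doc in split_documents(text):
        # Parse bytes so the encoding declaration in each document is honored.
        root = ET.fromstring(doc.encode("utf-8"))
        number = root.findtext(".//publication-reference/document-id/doc-number")
        if number and number.strip().upper().startswith("PP"):
            numbers.append(number.strip())
    return numbers
```

Parsing the whole week this way still means downloading and walking through thousands of patents to find a handful, which is exactly why I find this option unappealing.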

So, having no other free option, you can’t blame a guy if he makes requests weekly to ppubs and scrapes data from the pages returned!
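Since patents issue on Tuesdays, a weekly scraper needs to know which Tuesday to ask for. A minimal sketch of that date arithmetic follows; the function name is my own invention for illustration, and it does not account for any week where the issue schedule might differ.

```python
from datetime import date, timedelta


def latest_issue_tuesday(today):
    """Return the most recent Tuesday on or before the given date."""
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_since_tuesday = (today.weekday() - 1) % 7
    return today - timedelta(days=days_since_tuesday)


# April 24, 2018 was a Tuesday, so a Thursday that week maps back to it.
print(latest_issue_tuesday(date(2018, 4, 26)))  # 2018-04-24
```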

A note on my script: it’s specifically for my use case of pulling down newly issued plant patents and extracting data. There is at least one open source Python project that I know of that programmers can use to make their own requests, easing the screen scraping burden. The PatentsView API has an open source Python wrapper that only requires a configuration file (no coding), but it suffers from the same problems already mentioned about the underlying PatentsView API.

[1] If you want to check for yourself, these plant patents each had 4 USPC assignments when I scraped them: PP23484, PP23723, PP23924, PP24080, PP24201, PP24521, PP24634, PP24828. Compare PEDS and PatentsView to ppubs to see the disparity. This link will bring up these patents in ppubs.
