OCR (Optical Character Recognition) is software that attempts to work out what text an image contains, generally by reading an image and producing a text file. On the patent office’s website, there are a couple of date-based differences in what data is available. For patents, there isn’t a lot of data associated … Continue reading “OCRing USAMark Tiffs”
Category: Optical Character Recognition (OCR)
These posts are about things I’ve OCRed and what I’ve done with the text.
OCRed Plant Patents
Bulk data became a thing a few years ago, so I downloaded the USAMark trademark zip files, all 43,000 of them. I then OCRed a little over a million of the registration certificates in the zip files and put them into a searchable database. Then I did the same thing for plant patents, which began … Continue reading “OCRed Plant Patents”
Creating a Searchable PDF File with OCRed Text
Patent Librarian and all-around good guy Jim Miller lamented the lack of OCR text in an Espacenet patent that was in Spanish. I’ve used OCR to produce text files that became a searchable online database, but I hadn’t tried producing searchable PDFs. I offered to try to help anyway. In googling around, I found the June 4th comment at the … Continue reading “Creating a Searchable PDF File with OCRed Text”
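For a rough idea of what producing a searchable PDF can involve, here is a minimal sketch. It assumes the open-source Tesseract OCR engine is installed with Spanish language data; the excerpt above does not say which tool was actually used, so this is one possible approach and not necessarily the one in the post. The function names and the file name in the usage note are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, lang="spa"):
    """Return the argv list for a Tesseract run that writes <output_base>.pdf."""
    # The trailing "pdf" selects Tesseract's PDF output, which adds an
    # invisible text layer to the page image so the result is searchable.
    # Assumption: the "spa" (Spanish) language data is installed.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, lang="spa"):
    """OCR one page image into a searchable PDF saved next to the image."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    # Drop the image extension to get the output base: page.tif -> page
    output_base = Path(image_path).with_suffix("")
    command = build_tesseract_pdf_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    # Tesseract appends ".pdf" to the output base it was given.
    return Path(str(output_base) + ".pdf")
```

Calling `make_searchable_pdf("patent_page.tif")` would, under those assumptions, write `patent_page.pdf` alongside the image. A multi-page patent would need each page processed (or a multi-page TIFF supplied), which this sketch does not handle.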