Gathering Publicly Available Information with an API

by Keno Catabay and Rachel Smart

This is a post for anyone interested in using web APIs to gather content, or who simply has questions about how to begin interacting with them. Keno Catabay and I, Rachel Smart, both work in the Office of Digital Research and Scholarship on various content-gathering projects. Keno has been our Graduate Assistant since Fall 2017, pursuing data and digital pedagogy interests as well as teaching Python workshops. I am the manager of the Research Repository, DigiNole, and am responsible for content recruitment, gathering, and management, along with all the offshoot projects.

Earlier this summer, we embarked on a project to help FSU’s Office of Commercialization archive approved patent documents that university-affiliated researchers have filed with the United States Patent and Trademark Office (USPTO) since the 1980s. These patents are being uploaded into DigiNole, our institutional repository, to increase their discoverability: the USPTO Patent Full-Text and Image Database (PatFT) is difficult to navigate, while DigiNole is indexed by Google Scholar.

This project was accomplished, in part, using the Patent-Harvest software developed by the Virginia Tech libraries. The software includes a Java program that retrieves patent metadata and PDF files from PatFT through the new API the USPTO is developing for its database, currently in beta. While the Virginia Tech Patent-Harvest was an excellent starting point (thank you, VTech!), we decided that communicating directly with the USPTO API would be more beneficial for our project in the long term, since we could manipulate the metadata more freely. For now, however, we still rely on the VTech script to retrieve the PDF files.

If you are harvesting data from an API, you will have to familiarize yourself with the site’s unique API query language. The USPTO API query language is documented on the PatentsView site, patentsview.org. We also had to make sure we were communicating with the correct endpoint, the URL that represents the objects we were looking to harvest; in our case, we were querying the Patents Endpoint.

Communicating with an API can be difficult for the uninitiated. Even with a cursory understanding of IT and coding, you may run into roadblocks, especially when trying to talk to the API directly from your computer’s command line/terminal. There are two main HTTP requests you can make to the server: GET and POST. GET requests appear to be the preferred standard, unless the parameters of your request exceed 2,000 characters, in which case you would make a POST request.
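The GET-versus-POST rule above can be sketched as a small helper. This is a hypothetical illustration, not part of our actual workflow; the exact format PatentsView expects for a POST body should be checked against its documentation.

```python
# Hypothetical helper illustrating the rule of thumb above: if the
# assembled URL exceeds ~2,000 characters, fall back to a POST request.
import json
import urllib.parse

BASE_URL = "http://www.patentsview.org/api/patents/query"  # Patents Endpoint

def build_request(q, f=None, o=None, max_url_len=2000):
    """Return ("GET", url) or ("POST", url, body) based on URL length."""
    params = {"q": json.dumps(q)}
    if f is not None:
        params["f"] = json.dumps(f)  # list of fields to return
    if o is not None:
        params["o"] = json.dumps(o)  # options, e.g. {"per_page": 350}
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    if len(url) <= max_url_len:
        return ("GET", url)
    # Too long for GET: send the same parameters in the request body instead.
    return ("POST", BASE_URL, json.dumps(params))
```

A short query comes back as a GET, while a query listing hundreds of patent numbers would be promoted to a POST.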

Snapshot of Postman’s interface during a query

Keno chose to use Postman, a free application, to send the HTTP requests without having to install packages from the command line. Depending on server traffic, Postman can retrieve the metadata for us in a few minutes.

Instructions for writing the parameters, that is, the data we wanted from USPTO, are clearly provided on the API Query Language site, patentsview.org. In our case, we wanted our metadata to include specific fields, which are listed in the following GET request.

GET http://www.patentsview.org/api/patents/query?q={"assignee_organization":"Florida State University Research Foundation, Inc"}&f=["patent_number","patent_date","patent_num_cited_by_us_patents","app_date","patent_title","inventor_first_name","inventor_last_name","patent_abstract","patent_type","inventor_id","assignee_id"]&o={"per_page":350}

Note that the request defaults to 25 results per page, so o={"per_page":350} was added to the parameters, since we expected around 200 results for that particular assignee.
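For readers who prefer a script to Postman, the same request can be assembled in a few lines of Python using only the standard library. The field names below are the ones from our GET request above; the URL-encoding of the JSON-style q, f, and o parameters is handled automatically.

```python
# Sketch of building the GET request above in Python instead of Postman.
import json
import urllib.parse

q = {"assignee_organization": "Florida State University Research Foundation, Inc"}
f = ["patent_number", "patent_date", "patent_num_cited_by_us_patents",
     "app_date", "patent_title", "inventor_first_name", "inventor_last_name",
     "patent_abstract", "patent_type", "inventor_id", "assignee_id"]
o = {"per_page": 350}  # raise the default page size of 25

params = urllib.parse.urlencode({"q": json.dumps(q),
                                 "f": json.dumps(f),
                                 "o": json.dumps(o)})
url = "http://www.patentsview.org/api/patents/query?" + params
# urllib.request.urlopen(url) would then return the JSON response.
```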

USPTO returns the data in JSON, an easy-to-read key/value pair format. However, this data needs to be transformed into the XML-based MODS metadata format before the patent objects (paired metadata and PDF files) can be deposited into the research repository. A PHP script already used to transform metadata for the repository was repurposed for this task, though significant changes were needed. Once debugging was complete, the PHP script was executed from the command line with the JSON file as an argument, and 465 new well-formed, valid MODS records were born!
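Our production script is PHP, but the core JSON-to-MODS idea can be sketched in a few lines of Python. The element names below follow the MODS schema; the exact mapping (and the inventor sub-record layout) in our DigiNole script differs, so treat this as an illustration of the transformation, not our actual code.

```python
# Minimal sketch of a JSON-to-MODS transformation (the real script is PHP).
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def patent_to_mods(patent):
    """Build a MODS record (as a string) from one patent dict."""
    mods = ET.Element("{%s}mods" % MODS_NS)
    title_info = ET.SubElement(mods, "{%s}titleInfo" % MODS_NS)
    ET.SubElement(title_info, "{%s}title" % MODS_NS).text = patent["patent_title"]
    # Each inventor becomes a personal-name entry.
    for inv in patent.get("inventors", []):
        name = ET.SubElement(mods, "{%s}name" % MODS_NS, type="personal")
        ET.SubElement(name, "{%s}namePart" % MODS_NS).text = (
            inv["inventor_first_name"] + " " + inv["inventor_last_name"])
    origin = ET.SubElement(mods, "{%s}originInfo" % MODS_NS)
    ET.SubElement(origin, "{%s}dateIssued" % MODS_NS).text = patent["patent_date"]
    ET.SubElement(mods, "{%s}abstract" % MODS_NS).text = patent.get("patent_abstract", "")
    return ET.tostring(mods, encoding="unicode")

# Hypothetical sample record in the shape of a PatentsView response entry.
sample = {"patent_title": "Example Invention", "patent_date": "2018-01-02",
          "patent_abstract": "An example.",
          "inventors": [{"inventor_first_name": "Ada",
                         "inventor_last_name": "Lovelace"}]}
record = patent_to_mods(sample)
```

Looping the same function over every entry in the API response would yield one MODS record per patent, ready for validation and deposit.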

Snippet of the JSON to MODS transformation script

This project took about three weeks to complete. For those curious about the kinds of inventions FSU researchers are patenting, the patents are housed in the Florida State University Patent collection in DigiNole. How often the collection will be updated with new patents is still undecided, but we currently intend to run the script twice a year to net recently approved patents.