2012-12-12

Querying Freebase to Find Public Domain Authors

In October I wrote about querying DBpedia to find out which authors died in 1942, so that their works would (probably) enter the public domain in 2013. In a comment on that post, Tom Morris pointed out that a simple Freebase query easily returns more results than the ever-growing SPARQL query I had provided for DBpedia (thanks, Tom). In the end I used Freebase to get a list of public domain authors, as querying DBpedia for this purpose turned out to be impractical. (Maybe in the future libraries will provide data and tools to learn about works entering the public domain...)

Missing class hierarchy in DBpedia

Why is querying DBpedia impractical? Following my blog post, Jindřich Mynarz helped me improve the SPARQL query on this etherpad. We soon realized that you would have to build a UNION query with hundreds of classes in order to get all people who died in 1942 and published something during their lifetime. The reason is that DBpedia has very little class hierarchy. There seems to be some hierarchy in the YAGO ontology that we would have liked to exploit, but unfortunately typos in the ontology (rdfs:suBClassOf, see e.g. http://dbpedia.org/class/yago/Essayist110064405) make this impossible.

Querying Freebase

As already mentioned, I ended up querying Freebase. I modified the query provided by Tom and finally got a list of 481 authors who died in 1942, along with their exact date of death, profession, nationality and published works. I got there by trial and error rather than by understanding the details of MQL (Metaweb Query Language). It resulted in this query:
{
    "type": "/book/author",
    "name": [],
    "/people/deceased_person/date_of_death": null,
    "mid": null,
    "/people/person/nationality" : [],
    "/people/person/profession" : [],
    "works_written": [],
    "d2:/people/deceased_person/date_of_death<": "1942-12-31",
    "/people/deceased_person/date_of_death>": "1941-12-31",
    "limit": 500
}
Unfortunately, a query like this with a limit of 500 results would time out. It took me some time to search through the documentation and find out how to use the cursor in a Freebase query to page through the results. Finally I came up with this query (now as a URL), which worked fine for my purpose:

https://www.googleapis.com/freebase/v1/mqlread?&query=[{"type":"/book/author","name":[],"/people/deceased_person/date_of_death":null,"mid":null,"/people/person/nationality":[],"/people/person/profession":[],"works_written":[],"d2:/people/deceased_person/date_of_death<":"1942-12-31","/people/deceased_person/date_of_death>":"1941-12-31","limit":75}]&cursor
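
For readers who would rather script the paging than edit the URL by hand, here is a minimal Python sketch of the cursor loop. It assumes that each response is a JSON object with a "result" list and a "cursor" value, that the first request sends an empty cursor, and that a false or missing cursor marks the last page. The names fetch_all and http_fetch_page are my own, not part of any Freebase library, and the endpoint may no longer be reachable.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the URL above; it may no longer be reachable.
MQLREAD_URL = "https://www.googleapis.com/freebase/v1/mqlread"

# The same query as in the URL above, written as a Python structure.
QUERY = [{
    "type": "/book/author",
    "name": [],
    "/people/deceased_person/date_of_death": None,
    "mid": None,
    "/people/person/nationality": [],
    "/people/person/profession": [],
    "works_written": [],
    "d2:/people/deceased_person/date_of_death<": "1942-12-31",
    "/people/deceased_person/date_of_death>": "1941-12-31",
    "limit": 75,
}]


def http_fetch_page(query, cursor):
    """Request one page of results; an empty cursor asks for the first page."""
    params = urllib.parse.urlencode(
        {"query": json.dumps(query), "cursor": cursor}
    )
    with urllib.request.urlopen(MQLREAD_URL + "?" + params) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(query, fetch_page=http_fetch_page):
    """Collect results page by page until no further cursor comes back."""
    results = []
    cursor = ""
    while True:
        page = fetch_page(query, cursor)
        results.extend(page.get("result", []))
        cursor = page.get("cursor")
        if not cursor:  # assumption: false or missing cursor means last page
            break
    return results
```

Calling fetch_all(QUERY) would then return all pages as one list. The fetch_page argument exists only so that the loop can be tried out with a stand-in function instead of a live request.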

Conversion to Spreadsheet with Google Refine

Finally, I needed to convert the JSON files Freebase was providing into CSV or a similar format to be able to upload them to a Google Spreadsheet. I used Google Refine (in transition to Open Refine), a tool that I had been wanting to try out for quite some time. It was a logical choice for my purposes anyway, as it originates from the same people who developed Freebase...
Google Refine was easy to install. It was also easy to load the JSON and make some adjustments (mainly moving and renaming columns), and then I could upload the result directly to this Google spreadsheet.
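
I did the conversion in Google Refine, but the same step can be sketched in a few lines of Python. This is an alternative, not what I actually used. It assumes the records look like the results of the query above, joins multi-valued fields with "; ", and the function names and the sample record are made up for illustration.

```python
import csv
import io

# Column order for the spreadsheet; keys match the query above.
COLUMNS = [
    "name",
    "mid",
    "/people/deceased_person/date_of_death",
    "/people/person/nationality",
    "/people/person/profession",
    "works_written",
]


def flatten(record):
    """Turn one result record into a flat row of strings."""
    row = {}
    for key in COLUMNS:
        value = record.get(key)
        if isinstance(value, list):
            # Multi-valued fields become a single "; "-separated cell.
            value = "; ".join(str(item) for item in value)
        row[key] = "" if value is None else value
    return row


def to_csv(records):
    """Render a list of result records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))
    return buffer.getvalue()


# Hypothetical record, for illustration only (not real Freebase data).
sample = [{
    "name": ["Example Author"],
    "mid": "/m/0example",
    "/people/deceased_person/date_of_death": "1942-02-22",
    "/people/person/nationality": ["Austria"],
    "/people/person/profession": ["Novelist", "Playwright"],
    "works_written": [],
}]
print(to_csv(sample))
```

The resulting text can be saved to a file and imported into a spreadsheet. Renaming the long property paths to friendlier column headers would be a small extra step.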

Caveats

I hope the list of authors and possible public domain works will be useful to some people. It assumes that works enter the public domain 70 years after the author's death, which is true for many countries. Of course, this list has to be taken with some care, so you might want to check the individual case before digitizing works and publishing them on the internet. The list also includes many translations of original works, which will probably not enter the public domain in 2013, as translators usually hold a copyright in their translations. IANAL.