2012-10-30

Querying DBpedia to find Public Domain Authors


Yesterday, Sam Leon asked for help to populate a list of authors whose work will enter the public domain in 2013. My first thought was: This is a perfect use case for querying DBpedia's SPARQL endpoint! So I tried some queries.

Unfortunately, I had problems with the xsd datatypes when I built my query on the properties dbpedia-owl:deathYear and/or dbpedia-owl:deathDate. Doing a quick search on the web, I noticed that problems with xsd:date aren't new to DBpedia. It didn't work out to write a query guided by the workarounds provided in [1] and [2]. Perhaps somebody else can tell me, how you can solve these problems...

I decided to query based on the wikipeda category 1942 deaths. With this kind of query i had no problems, for example:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?person {
    ?person a dbpedia-owl:Writer .
    ?person dct:subject <http://dbpedia.org/resource/Category:1942_deaths> .
}
(See the result for the previous query here.)

This query only delivers persons that are typed as dbpedia-owl:Writer. Franz Boas for example isn't covered. One would have to do more queries with other categories of people that publish written works:
One can combine queries of multiple classes in SPARQL with a UNION query. To only list those people once that are members of multiple classes, one should add a DISTINCT to the SELECT query, for example:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?person {
    { ?person a dbpedia-owl:Scientist }
    UNION
    { ?person a dbpedia-owl:Writer }
    UNION
    { ?person a dbpedia-owl:Philosopher }
    UNION
    { ?person a <http://dbpedia.org/class/yago/AmericanAnthropologists> }
    UNION
    { ?person dct:subject <http://dbpedia.org/resource/Category:German_poets>  }
    ?person dct:subject <http://dbpedia.org/resource/Category:1942_deaths> . 
}
(result)

Note, that you can also use relevant categories connected to the person by dcterms:subject. This SPARQL query already delivers 102 persons who died in 1942 and most probably all have published at least one written work. The query needs to be extended to cover most of the people in Wikipedia/DBpedia that have published written works.

As one can see, these queries aren't as simple as you would like them to be. That's because you have to adjust to the underlying data which - like all data on the web - is kind of messy. The good thing is: If you have worked out a useful SPARQL query that includes most of the categories and subject classes for people who publish stuff, you can easily re-use the query for upcoming lists of public domain material in coming years.


[1] http://answers.semanticweb.com/questions/947/dbpedia-sparql-endpoint-xsddate-comparison-weirdness

[2] http://pablomendes.wordpress.com/2011/05/19/sparql-xsddate-weirdness/