2012-10-30

Querying DBpedia to find Public Domain Authors


Yesterday, Sam Leon asked for help to populate a list of authors whose work will enter the public domain in 2013. My first thought was: This is a perfect use case for querying DBpedia's SPARQL endpoint! So I tried some queries.

Unfortunately, I ran into problems with the xsd datatypes when I built my query on the properties dbpedia-owl:deathYear and/or dbpedia-owl:deathDate. A quick web search showed that problems with xsd:date aren't new to DBpedia. I couldn't get a query to work by following the workarounds described in [1] and [2]. Perhaps somebody else can tell me how to solve these problems...

I decided to query based on the Wikipedia category 1942 deaths instead. With this kind of query I had no problems, for example:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?person {
    ?person a dbpedia-owl:Writer .
    ?person dct:subject <http://dbpedia.org/resource/Category:1942_deaths> .
}
(See the result for the previous query here.)
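
To run such a query from a script rather than in the browser, something like the following could work. It is only a sketch: the endpoint address http://dbpedia.org/sparql and the `format` parameter are assumptions about the public DBpedia service, and the helper names (`request_url`, `person_uris`, `run`) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public DBpedia SPARQL endpoint (not named in the post).
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?person {
    ?person a dbpedia-owl:Writer .
    ?person dct:subject <http://dbpedia.org/resource/Category:1942_deaths> .
}
"""


def request_url(query, endpoint=ENDPOINT):
    """Build a GET URL that passes the query in the `query` parameter."""
    # Assumption: the endpoint accepts `format` to select JSON results.
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def person_uris(results):
    """Pull the ?person URIs out of a parsed SPARQL JSON results document."""
    return [row["person"]["value"] for row in results["results"]["bindings"]]


def run(query):
    """Send the query over the network and return the matching URIs."""
    with urlopen(request_url(query)) as response:
        return person_uris(json.load(response))
```

Calling `run(QUERY)` needs network access and returns whatever the endpoint holds at that moment, so the counts will differ from the ones mentioned in this post.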

This query only returns persons that are typed as dbpedia-owl:Writer. Franz Boas, for example, isn't covered. One would have to run more queries for other categories of people who publish written works.
In SPARQL, one can combine queries over multiple classes with UNION. To list people who are members of several classes only once, add DISTINCT to the SELECT, for example:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?person {
    { ?person a dbpedia-owl:Scientist }
    UNION
    { ?person a dbpedia-owl:Writer }
    UNION
    { ?person a dbpedia-owl:Philosopher }
    UNION
    { ?person a <http://dbpedia.org/class/yago/AmericanAnthropologists> }
    UNION
    { ?person dct:subject <http://dbpedia.org/resource/Category:German_poets>  }
    ?person dct:subject <http://dbpedia.org/resource/Category:1942_deaths> . 
}
(result)

Note that you can also use relevant categories connected to the person by dct:subject. This SPARQL query already returns 102 persons who died in 1942, all of whom have most probably published at least one written work. The query needs to be extended to cover most of the people in Wikipedia/DBpedia who have published written works.

As one can see, these queries aren't as simple as you would like them to be. That's because you have to adjust to the underlying data which - like all data on the web - is kind of messy. The good thing is: If you have worked out a useful SPARQL query that includes most of the categories and subject classes for people who publish stuff, you can easily re-use the query for upcoming lists of public domain material in coming years.
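
Re-using the query for another year could look roughly like this. It is a sketch, not a finished tool: the function names are invented, the pattern list simply copies the UNION branches from the query above, and the 70-year term is an assumption that fits this post (1942 deaths, public domain in 2013) but does not hold in every jurisdiction.

```python
PREFIXES = (
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
)

# The UNION branches from the query above; extend this list to widen coverage.
AUTHOR_PATTERNS = [
    "?person a dbpedia-owl:Scientist",
    "?person a dbpedia-owl:Writer",
    "?person a dbpedia-owl:Philosopher",
    "?person a <http://dbpedia.org/class/yago/AmericanAnthropologists>",
    "?person dct:subject <http://dbpedia.org/resource/Category:German_poets>",
]


def death_year_for(public_domain_year, term=70):
    """Death year of authors whose works become free in the given year.

    Assumption: protection ends on 1 January after `term` full calendar
    years following the year of death (e.g. 2013 -> 1942).
    """
    return public_domain_year - term - 1


def build_query(death_year, patterns=AUTHOR_PATTERNS):
    """Assemble the SELECT DISTINCT / UNION query for one death year."""
    union = "\n    UNION\n".join("    { %s }" % p for p in patterns)
    category = "<http://dbpedia.org/resource/Category:%d_deaths>" % death_year
    return (
        PREFIXES
        + "\nSELECT DISTINCT ?person {\n"
        + union
        + "\n    ?person dct:subject %s .\n}\n" % category
    )


if __name__ == "__main__":
    # Query for the authors entering the public domain in 2014.
    print(build_query(death_year_for(2014)))
```

Keeping the patterns in a plain list means that new classes or categories can be added in one place without touching the query template.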


[1] http://answers.semanticweb.com/questions/947/dbpedia-sparql-endpoint-xsddate-comparison-weirdness

[2] http://pablomendes.wordpress.com/2011/05/19/sparql-xsddate-weirdness/

Comments:

_ said…

Hi Adrian,

I've also recently encountered this issue, where you need to filter out erroneous literals so that they don't raise errors when cast to a datatype (e.g., by using xsd:dateTime()). The key here is to understand the three logical values in FILTER evaluation (true, false and error) and how logical expressions containing them are evaluated (http://www.w3.org/TR/rdf-sparql-query/#evaluation).

In this example, it is possible to "pre-screen" the data by a regular expression:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person {
    ?person a dbpedia-owl:Writer ;
            dbpedia-owl:deathDate ?date .
    FILTER (regex(str(?date), "\\d{4}-\\d{2}-\\d{2}") && (year(now()) - year(xsd:dateTime(?date)) = 70))
}

This way, if regex() evaluates to false and the second part of the expression raises an error, the FILTER clause evaluates to false and therefore filters out the erroneous data. If you cast ?date to xsd:dateTime in the FILTER without this check, you'd get an error back. However, written this way, the query loses dates that aren't valid xsd:dates (e.g., 1942). In total, you get 44 writers, fewer than the 50 you get by using the http://dbpedia.org/resource/Category:1942_deaths category. The reasons for the smaller count vary:
- Using http://dbpedia.org/property/deathDate instead of http://dbpedia.org/ontology/deathDate.
- Incorrect categorization (e.g., http://dbpedia.org/resource/Nikola_Vaptsarov apparently died in 1943)
- Using a value that is an invalid xsd:date (e.g., "May 1942")
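
The three-valued evaluation described above can be mimicked in a few lines of Python. This is only an illustration of the truth table for && from the SPARQL specification, not of how any particular endpoint behaves; `ERROR`, `sparql_and`, `died_in` and `keep` are hypothetical names, and `date.fromisoformat` stands in for the xsd:dateTime cast.

```python
import re
from datetime import date

ERROR = "error"  # stand-in for a SPARQL evaluation error

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def sparql_and(left, right):
    """Logical-and over true/false/error: false wins over error."""
    if left is False or right is False:
        return False
    if left == ERROR or right == ERROR:
        return ERROR
    return True


def looks_like_date(text):
    """Counterpart of the regex() pre-screen in the query."""
    return DATE_PATTERN.search(text) is not None


def died_in(text, year):
    """Counterpart of the cast and comparison; a failed cast yields ERROR."""
    try:
        return date.fromisoformat(text).year == year
    except ValueError:
        return ERROR


def keep(text, year=1942):
    """A FILTER keeps a solution only if the whole expression is true."""
    return sparql_and(looks_like_date(text), died_in(text, year)) is True


if __name__ == "__main__":
    for literal in ["1942-02-22", "May 1942", "1942", "1943-01-01"]:
        print(literal, keep(literal))
```

Literals such as "May 1942" fail the pre-screen, so the combined expression is false rather than an error, which matches the point made in this comment.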

Adrian Pohl said…

Thanks for the explanation Jindřich.

The main reason for the smaller count is that you didn't cover the other dbpedia-owl classes and dct:subject categories with a UNION query. I added that to your query and got 93 results, so the difference actually isn't that big.

With regard to Nikola Vaptsarov: The Wikipedia article text and the info box contained different death years. After checking viaf.org (it says he died in 1942), I corrected the info box.

Tom Morris said…

Running this query against Freebase will give you 475 authors who died in 1942. That might be easier than trying to construct such a complex query.

q = {
    "/people/deceased_person/date_of_death": None,
    "mid": None,
    "name": None,
    "type": "/book/author",
    "d2:/people/deceased_person/date_of_death<": "1942-12-31",
    "/people/deceased_person/date_of_death>": "1941-12-31",
    "key": [{"namespace": "/wikipedia/en_id", "value": None, "optional": True}]
}
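
To reuse the same query for other years, the two date bounds could be filled in from a parameter. The following is a sketch under that assumption: `build_mql` is a made-up helper, the keys are copied unchanged from the query above, and nothing here contacts the Freebase API; it only builds and serializes the query.

```python
import json


def build_mql(year):
    """Return the query above with its date bounds set for `year`."""
    return {
        "/people/deceased_person/date_of_death": None,
        "mid": None,
        "name": None,
        "type": "/book/author",
        # Same bounds as in the original query, shifted to the chosen year.
        "d2:/people/deceased_person/date_of_death<": "%d-12-31" % year,
        "/people/deceased_person/date_of_death>": "%d-12-31" % (year - 1),
        "key": [
            {"namespace": "/wikipedia/en_id", "value": None, "optional": True}
        ],
    }


if __name__ == "__main__":
    # Wrapping the query in a list asks for all matches, not just one.
    print(json.dumps([build_mql(1943)], indent=2))
```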

Adrian Pohl said…

Thanks, Tom. This query indeed looks much simpler and gives me some motivation to play around with the Freebase API.

Adrian Pohl said…

If anyone is still interested in querying DBpedia for PD authors or in improving the query, take a look at this etherpad, where Jindřich and I tried to improve the query: http://okfnpad.org/pd-sparql.
