2012-06-22

Linked Data in worldcat.org

Two days ago OCLC announced that linked data has been added to worldcat.org. I took a quick look at it and just want to share some notes on this.

OCLC goes open, finally

I am very happy that OCLC - with using the ODC-BY license - finally managed to choose open licensing for WorldCat. Quite a change of attitude when you recall the attempt in 2008 to sneak in a restrictive viral copyright license as part of a WorldCat record policy (for more information see the code4lib wikipage on the policy change or my German article about it). Certainly, it were not at last the blogging librarians and library tech people, the open access/open data proponents etc. who didn't stop to push OCLC towards openness, who made this possible. Thank you all!

Of course, this is only the beginning. One thing is, that dumps of this WorldCat data aren't available yet (see follow-up addendum here), thus, making it necessary to crawl the whole WorldCat to get hold of the data. Another thing is, that there probably is a whole lot of useful information in WorldCat that isn't part of the linked data in worldcat.org yet .

schema.org in RDFa and microdata

What information is actually encoded as linked data in worldcat.org? And how did OCLC add RDF to worldcat.org? It used the schema.org vocabulary to add semantic markup to the HTML. This markup is both added as microdata - the native choice fo schema.org vocab - as well as in RDFa. schema.org lets people choose how to use the vocabulary, on the schema.org blog it recently said: "Our approach is "Microdata and more". As implementations and services begin to consume RDFa 1.1, publishers with an interest in mixing schema.org with additional vocabularies, or who are using tools like Drupal 7, may find RDFa well worth exploring."

Let's take a look at a description of a bibliographic resource in worldcat.org, e.g. http://www.worldcat.org/title/linked-data-evolving-the-web-into-a-global-data-space/oclc/704257552The part of the HTML source containing the semantic markup is marked as "Microdata Section" (although it does also contain RDFa). As the HTML source isn't really readable for humans, we need to get hold of the RDF in a readable form first to have a look at it. I prefer the turtle syntax for looking at RDF. One can get the RDF contained in the HTML out using the RDFa distiller provided by the W3C. More precisely you have to use the distiller that supports RDFa 1.1 as schema.org supports RDFa 1.1 and, thus, worldcat.org is enriched according to the RDFa 1.1 standard.

However, using the distiller on the example resource I can get back a turtle document that contains the following triples:

1:  @prefix library: <http://purl.org/library/> .  
2:  @prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .  
3:  @prefix owl: <http://www.w3.org/2002/07/owl#> .  
4:  @prefix schema: <http://schema.org/> .  
5:  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .  
6:  <http://www.worldcat.org/oclc/707877350> a schema:Book;  
7:    library:holdingsCount "1"@en;  
8:    library:oclcnum "707877350"@en;  
9:    library:placeOfPublication [ a schema:Place;  
10:        schema:name "San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) :"@en ];  
11:    schema:about [ a skos:Concept;  
12:        schema:name "Web site development."@en;  
13:        madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh98004795> ],  
14:      [ a skos:Concept;  
15:        schema:name "Semantic Web."@en;  
16:        madsrdf:isIdentifiedByAuthority <http://id.loc.gov/authorities/subjects/sh2002000569> ],  
17:      <http://dewey.info/class/025/e22/>,  
18:      <http://id.worldcat.org/fast/1112076>,  
19:      <http://id.worldcat.org/fast/1173243>;  
20:    schema:author <http://viaf.org/viaf/38278185>;  
21:    schema:bookFormat schema:EBook;  
22:    schema:contributor <http://viaf.org/viaf/171087834>;  
23:    schema:copyrightYear "2011"@en;  
24:    schema:description "1. Introduction -- The data deluge -- The rationale for linked data -- Structure enables sophisticated processing -- Hyperlinks connect distributed data -- From data islands to a global data space -- Introducing Big Lynx productions --"@en,  
25:      "The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study."@en;  
26:    schema:inLanguage "en"@en;  
27:    schema:isbn "1608454312"@en,  
28:      "9781608454310"@en;  
29:    schema:name "Linked data evolving the web into a global data space"@en;  
30:    schema:publisher [ a schema:Organization;  
31:        schema:name "Morgan & Claypool"@en ];  
32:    owl:sameAs <http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001> .  

This looks quite nice to me. You see, how schema.org let's you easily convey the most relevant information and the property names are well-chosen to make it easy for humans to read the RDF (in contrast e.g. to the ISBD vocabulary which uses numbers in the property URIs following the library tradition :-/).

The example also shows the current shortcomings of schema.org and where the library community might put some effort in to extending it, as OCLC has already been doing for this release with the experimental "library" extension vocabulary for use with Schema.org. E.g., there are no seperate schema.org properties for a table of content and an abstract so that they are both put into one string using ther schema:description property.

Links to other linked data sources

There are links to several other data sources: LoC authorities (lines 13, 16, 41, 44) , dewey.info (17), the linked data FAST headings (18,19), viaf.org (20,22) and an owl:sameAs link to the HTTP-DOI identifier (32). As most of these services are already run by OCLC and as the connections probably all were already existent in the data, creating these links wasn't hard work, which of course doesn't make them less useful.

Copyright information

What I found very interesting is the schema:copyrightYear property used in some descriptions in worldcat.org. I don't know how much resources are covered with the indication of a copyright year and how accurate the data is, but this seems a useful source to me for projects like publicdomainworks.net.

Missing URIs

As with other preceding publications of linked bibliographic data there are some URIs missing for things we might want to link to instead of only serving the name string of the respecting entity: I am talking about places and publishers. Until now, AFAIK URIs for publishers don't exist, hopefully someone (OCLC perhaps?) is already working on a LOD registry for publishers. For places, we have geonames but it is not that trivial to generate the right links. It's not a great surprise that a lot of work has to be done to build the global data space.