My previous post talked about how to install 4store as a triplestore, and how to use the Ruby library RDF.rb to process RDF extracted from that store. This was a response to Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge, which asks for documentation of how to install a triplestore, load data into it, retrieve that data using SPARQL, and access the results as native structures using Ruby, Python or PHP.
I quite enjoyed writing the last one, so I thought I’d try again. As before, I am on Mac OS X, but this time I’m going to use Python, which I have not programmed in before. I like a challenge. You might not like the results!
Updated to include some of Arto Bendiken’s recommendations.
This post is a response to Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge. In it, he asks for documentation of the following steps (a rough Python sketch covering them follows the list):
- Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.
- Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.
- Programmatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.
- Programmatically retrieve data from the datastore with SPARQL using either PHP, Ruby or Python.
- Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).
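To make those steps concrete, here’s a minimal sketch in Python. It’s hedged rather than definitive: it assumes a 4store instance is already running locally with its HTTP server on port 8000, relies on the requests and SPARQLWrapper libraries, and the file name, graph URI and FOAF-based query are placeholders for whatever real-world data you actually load.

```python
# Hedged sketch: assumes 4store's HTTP server is running on localhost:8000;
# the file name, graph URI and query below are placeholders.
import requests                                  # pip install requests
from SPARQLWrapper import SPARQLWrapper, JSON    # pip install SPARQLWrapper

STORE = 'http://localhost:8000'

# 1. Load some real-world RDF (here, a local Turtle file) into the store
#    via 4store's HTTP interface; the trailing path names the graph.
with open('mps.ttl', 'rb') as f:
    requests.put(STORE + '/data/http://example.org/graph/mps',
                 data=f, headers={'Content-Type': 'application/x-turtle'})

# 2. Retrieve data with a SPARQL query, asking for JSON results.
sparql = SPARQLWrapper(STORE + '/sparql/')
sparql.setQuery("""
    SELECT ?mp ?name WHERE {
      ?mp <http://xmlns.com/foaf/0.1/name> ?name .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# 3. Convert the SPARQL JSON results into a plain Python dictionary.
mps = {row['mp']['value']: row['name']['value']
       for row in results['results']['bindings']}
print(mps)
```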
I’ve been told so many times how RDF sucks for mainstream developers that it was the main point of my TPAC talk late last year. I think that this is a great motivating challenge for improving not only the documentation of how to use RDF stores and libraries but also their general installability and usability for developers.
Anyway, I thought I’d try to get as far as I could to see just how bad things really are. I am on Mac OS X, and I’m going to use Ruby (although I don’t really know it all that well, so please forgive my mistakes). I’ll breeze on through as if everything is hunky dory, but there are some caveats at the end.
One of the biggest selling points of linked data is that it’s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.
Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people’s blogs or twitter streams when I mention them.
And with everyone using the same terms to talk about the same things, there’s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we’d never have found otherwise.
What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It’s a bit sketchy, and I’d welcome comments.
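To make the tension concrete, the brute-force answer is to pull each published document into one local graph and run SPARQL over that merged graph, which is effectively a tiny, ad hoc central store. Here’s a hedged rdflib sketch: the document URLs are placeholders, and running SPARQL this way needs a reasonably recent rdflib (older versions needed the rdfextras plugin).

```python
# Hedged sketch: merge a handful of distributed RDF documents into one local
# graph and query it with SPARQL. The URLs below are placeholders.
from rdflib import Graph

g = Graph()
for url in ('http://example.org/alice.rdf',
            'http://example.org/bob.rdf'):
    g.parse(url)   # fetch each published document and merge its triples

# Query the merged graph locally -- in effect, a tiny central triplestore.
results = g.query("""
    SELECT ?person ?name WHERE {
      ?person <http://xmlns.com/foaf/0.1/name> ?name .
    }
""")
for person, name in results:
    print(person, name)
```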
Today, I’m going to moan about the lack of features in SPARQL that are necessary for many kinds of data analysis and visualisation. Going from raw data, held in RDF, to the kind of summarised figures behind the graph below cannot be done with SPARQL on its own. These calculations involve aggregation, grouping and projection, which are planned for SPARQL vNext but aren’t here yet (at least, not in any standard way or in every triplestore).
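In the meantime, the grouping and summing has to happen on the client. Here’s a hedged sketch of what that looks like in Python: the endpoint URL and the predicate URIs are placeholders, not the real vocabulary used in the store.

```python
# Hedged sketch: fetch raw per-MP figures with SPARQL, then group and sum
# client-side -- the step SPARQL vNext's aggregates would do in the store.
from collections import defaultdict
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://example.org/sparql')   # placeholder endpoint
sparql.setQuery("""
    SELECT ?party ?claim WHERE {
      ?mp <http://example.org/def/party> ?party ;
          <http://example.org/def/totalClaim> ?claim .
    }
""")
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()['results']['bindings']

totals = defaultdict(float)
for row in rows:
    totals[row['party']['value']] += float(row['claim']['value'])

for party, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(party, total)
```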
Here’s the pretty graph to illustrate today’s rant:
It’s not really XML, I suppose, but it certainly covers a bunch of interesting and timely topics. I particularly hope that we’ll get some public sector people in the room so that we can discuss some of the challenges and opportunities in that area.
I’ll start with the problem. To create the graphs I showed in my last post, I wanted to split MPs into groups based on their party affiliation. Ideally, I wanted the Google Visualisation query to look like:
select mp, additionalCosts, totalTravel, totalBasic where party = 'Conservative' order by totalClaim desc limit 25
because it is reasonably easy for a developer to understand and to create, without having to know any magic URIs.
The party affiliation for an MP is given in the RDF supplied within the Talis store as a pointer to one of the DBpedia party resources, such as http://dbpedia.org/resource/Conservative_Party_(UK) or http://dbpedia.org/resource/Liberal_Democrats.
Now, if you visit http://dbpedia.org/resource/Conservative_Party_(UK) then you’ll see precious few properties, and none of them give you access to the string ‘Conservative’. If you look at http://dbpedia.org/resource/Liberal_Democrats, you’ll see plenty of properties, one of which is dbpprop:partyName. But trying to query on dbpprop:partyName within the Talis data store gives me nothing, because that information hasn’t been imported into the particular store that this SPARQL query is running on.
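This is what knowing ‘magic URIs’ means in practice. The sketch below is hedged: the expenses-store endpoint and the expense predicates are placeholders I’ve invented, and only the DBpedia URIs come from above. Filtering by party means writing the full DBpedia resource URI into the query, and getting a human-readable label back means a second query against DBpedia’s own endpoint.

```python
# Hedged sketch: the store endpoint and expense predicates are placeholders;
# only the DBpedia URIs are taken from the post.
from SPARQLWrapper import SPARQLWrapper, JSON

# 1. Filter MPs by party using the full resource URI, not a readable name.
store = SPARQLWrapper('http://api.talis.com/stores/example/services/sparql')
store.setQuery("""
    SELECT ?mp ?totalClaim WHERE {
      ?mp <http://example.org/def/party>
            <http://dbpedia.org/resource/Conservative_Party_(UK)> ;
          <http://example.org/def/totalClaim> ?totalClaim .
    } ORDER BY DESC(?totalClaim) LIMIT 25
""")
store.setReturnFormat(JSON)
conservatives = store.query().convert()['results']['bindings']

# 2. The readable label lives at DBpedia, not in the expenses store, so it
#    takes a separate query against DBpedia's endpoint to fetch it.
dbpedia = SPARQLWrapper('http://dbpedia.org/sparql')
dbpedia.setQuery("""
    SELECT ?name WHERE {
      <http://dbpedia.org/resource/Liberal_Democrats>
          <http://dbpedia.org/property/partyName> ?name .
    }
""")
dbpedia.setReturnFormat(JSON)
print(dbpedia.query().convert()['results']['bindings'])
```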
My last post showed a visualisation of the Guardian’s MPs’ expenses data, ported into a Talis triplestore. Here’s a screenshot of another one (follow the link for the interactive version). The files that are used to create it are attached to this post.
There are several things that are frustrating about creating these visualisations, which I want to discuss because I think they lead to some lessons about what data publishers and members of the semantic web community should do to make these things easy. The first thing I want to talk about is datatyping.