When we encourage people to put their data on the web as linked data, the biggest question is “How?”. There are so many “How?” questions to answer:
and, of course:
I’m aware I’ve been quiet for the past few months. This isn’t because nothing interesting has been going on — rather the opposite. It’s been difficult to get a chance to sit down and write about the work I’ve been doing, when actually doing the work has been taking up so much time.
Most of my time has been spent on the new legislation.gov.uk website and its underlying API. There’s so much to say about this project that I hardly know where to start, so I’ll just try to do an overview and we can take it from there. Let me know what you’re interested in.
One of the biggest selling points of linked data is that it’s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.
Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, to Wikipedia to define the concepts I mention, and to people’s blogs or Twitter streams when I mention them.
And with everyone using the same terms to talk about the same things, there’s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we’d never have found otherwise.
What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
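To make the tension concrete, here is a minimal sketch in Python. It is not SPARQL and not any real triplestore's API: triples are plain tuples, the query is a hand-written pattern match, and every URI, dataset name and value is invented for illustration. It assumes two publishers who each maintain their own data but happen to use the same URI for a district.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical property URIs, invented for this sketch.
NAME = "http://example.org/def/name"
IN_DISTRICT = "http://example.org/def/inDistrict"
POPULATION = "http://example.org/def/population"

# Data published independently at two (hypothetical) sites.
SCHOOLS: Set[Triple] = {
    ("http://example.org/school/1", NAME, "Hilltop Primary"),
    ("http://example.org/school/1", IN_DISTRICT, "http://example.org/district/A"),
}
DISTRICTS: Set[Triple] = {
    ("http://example.org/district/A", POPULATION, "52000"),
}


def match(store: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def school_district_population(store: Set[Triple]) -> List[Tuple[str, str]]:
    """Join school names to the population of the district they sit in."""
    results = []
    for school, _, district in match(store, p=IN_DISTRICT):
        for _, _, population in match(store, s=district, p=POPULATION):
            for _, _, name in match(store, s=school, p=NAME):
                results.append((name, population))
    return sorted(results)


# The "central triplestore": a copy of everything, gathered in one place.
central = SCHOOLS | DISTRICTS

print(school_district_population(SCHOOLS))   # one source alone cannot answer
print(school_district_population(central))   # the merged copy can
```

The join only works because both sources use the same district URI, and only once both sets of triples sit in the same store. That is the centralising step that seems to sit awkwardly with distributed publication.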
This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It’s a bit sketchy, and I’d welcome comments.
As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as XML and a text-based form, that are supported by existing tool chains.
In contrast, if there is existing RDF in that domain area, it has usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user than on those of the professionals in the area.
As you probably know, I’ve been working quite a lot recently on the UK government’s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I’ve touched on this topic before, but things have now progressed to the stage where we have to make some real, practical recommendations.