One of the biggest selling points of linked data is that it’s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.
Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people’s blogs or Twitter streams when I mention them.
And with everyone using the same terms to talk about the same things, there’s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we’d never have found otherwise.
What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It’s a bit sketchy, and I’d welcome comments.
First, let’s look at the publication of data. We publish data at the moment in all kinds of ways: embedded tables within PDFs, CSV database dumps, Excel spreadsheets, Word documents, XML, JSON, N3 and so on and on. Each of these documents contains a set of information: a dataset.
Each dataset contains information about a whole load of things, usually real-world things. This is easy to see when you have datasets that contain lots of things of the same type: a spreadsheet might contain information about lots of different local authorities, a database dump about a bunch of schools. In FOAF terms, we’d say that the dataset has each of these things as a topic.
Even datasets that are really about one thing (have, in FOAF terms, a primary topic) contain information about lots of other things. For example, a web page about a hospital might include some level of information about the different departments within the hospital, the strategic health authority that it belongs to, the chief executive and so on. Information that is just about one thing is rarely useful; at the very least, you will want to know the labels of things that it’s related to.
If we move to thinking about linked data, each thing is assigned an HTTP URI. There is then one particular dataset that stands above all the other datasets that contain information about that thing: the dataset in the document that you get when you resolve its URI. The fact that there is this dataset doesn’t alter the fact that there are many many other datasets out there that contain information about the thing. But the dataset that you get at the URI for the thing obviously has a special role.
These datasets — the ones you get at the end of a resource’s URI — are the way in which an organisation can exercise control over the use of URIs minted within their domain. The organisation that controls the URI for a thing determines whether that URI resolves, and what is at the end of the URI. If fifteen different websites all published information about a school consistently using the same URI for that school, anyone could pull that information together into something potentially useful. But if the URI for the school doesn’t actually resolve, then you would have to wonder whether the school actually exists, or if it’s just a figment of the imagination of those fifteen websites: a spoof school.
Also, you’d expect the information that you find at the end of the URI to be correct and up to date. You’d expect it to be reasonably complete as well: to return a bunch of information about the school and pointers to more information about the school. This information is likely to come from a bunch of trusted sources: an integrated view over a collection of other datasets.
We’ve established that
And so on to querying. Linked data can be useful without explicit querying — you can navigate around related sets of information by following links, and pull together information gleaned from different sites — but querying of some kind provides much more potential power and, with a linked data API, the opportunity to provide an easy-to-use web-based API for the data.
SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.
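To make the default graph and named graph distinction concrete, here is a minimal Python sketch that assembles a query with an explicit dataset. The helper name `build_query` and the example.org graph URIs are illustrative assumptions, not part of any particular store's API.

```python
def build_query(default_graphs, named_graphs, pattern):
    """Return a SPARQL SELECT query that declares its dataset explicitly."""
    lines = ["SELECT *"]
    # Graphs listed with FROM are merged to form the query's default graph.
    lines += [f"FROM <{uri}>" for uri in default_graphs]
    # Graphs listed with FROM NAMED stay separate; GRAPH patterns address them.
    lines += [f"FROM NAMED <{uri}>" for uri in named_graphs]
    lines.append("WHERE { " + pattern + " }")
    return "\n".join(lines)


# Hypothetical graph URIs, used purely for illustration.
query = build_query(
    default_graphs=["http://example.org/graph/schools"],
    named_graphs=["http://example.org/graph/districts"],
    pattern="?school ?p ?o . GRAPH ?g { ?district ?q ?r }",
)
print(query)
```

Whether a store fetches these graphs on demand or expects them to be loaded already varies between implementations, which is one reason the data tends to end up in a single triplestore.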
And so we have a quandary. To support queries, we need all the data we might want to query to be pulled into a single triplestore. Given that all data is linked, and all links are potentially interesting, the only answer seems to be to have the whole web of data in a single store. And that kind of centralised solution seems impractical, both in terms of the sheer size of store you’d need and the obvious impact on efficiency of doing so.
I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that’s out there. What subset is useful for a given triplestore is a design question that should be informed by the potential users of that particular service. Larger subsets are likely to locate more cross-connections, but have a performance penalty.
For example, a service that was oriented towards helping local authorities plan their schooling provision might include all the current data about nursery, primary and secondary schools (but not universities or versioned data), information about their administrative district and the district that they appear in (but no extra information about census areas), and those neighbourhood statistics, including historic data, that relate to children and schooling (but not those that relate to care of the elderly, for example).
Another service might include all historic information about schools and universities and historic information about all associated administrative geography, but not include neighbourhood statistics.
In the scenario painted above, each triplestore will include different datasets, brought together for a particular purpose. Imagine a huge warehouse full of boxes, each of which is a particular dataset. Each triplestore will fit together a different set of those boxes. What’s neat about the linked data approach is that the boxes are really easy to bring together: creating a triplestore should just be a matter of selecting which datasets you want to use with little or no hand-crafting of links between them or resolution of naming conflicts.
The challenge from the side of the data publisher is to enable these triplestores to be both created and kept up to date. A data publisher has to:
A lot of these problems are solved.
VoiD’s purpose in life is to describe datasets and how they link to each other, and it provides a void:dataDump property that points to a dump of the data. VoiD can describe datasets that are supersets of other datasets, which enables datasets to be grouped together into potentially useful bundles.
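As a rough sketch of how a bundle described in voiD might be expanded into the dumps that need loading, the Python below walks subset links and collects dump locations. The dictionary is a hypothetical in-memory stand-in for a parsed voiD description (its keys mirror void:subset and void:dataDump), and all URIs are invented for illustration.

```python
def collect_dumps(dataset_uri, descriptions, seen=None):
    """Return the dump URLs for a dataset and, recursively, its subsets."""
    seen = set() if seen is None else seen
    if dataset_uri in seen:  # guard against datasets listed more than once
        return []
    seen.add(dataset_uri)
    description = descriptions.get(dataset_uri, {})
    dumps = list(description.get("dataDump", []))
    for subset_uri in description.get("subset", []):
        dumps.extend(collect_dumps(subset_uri, descriptions, seen))
    return dumps


# Hypothetical descriptions: one bundle grouping two datasets.
descriptions = {
    "http://example.org/void/education": {
        "subset": [
            "http://example.org/void/schools",
            "http://example.org/void/districts",
        ],
    },
    "http://example.org/void/schools": {
        "dataDump": ["http://example.org/dumps/schools.nt"],
    },
    "http://example.org/void/districts": {
        "dataDump": ["http://example.org/dumps/districts.nt"],
    },
}

print(collect_dumps("http://example.org/void/education", descriptions))
```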
Where information needs to be kept up to date, we can use feeds. We need to keep up to date information about the datasets that a publisher makes available, and information about the content of a particular dataset. This can be achieved through a single Atom feed in which each dataset is recorded as an entry, with an <updated> element indicating its last update. Datasets that are removed can be indicated through a deleted-entry element. There is some ongoing work that suggests how to augment voiD with a pointer to such a feed.
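A consumer of such a feed could work out what to reload along these lines. This is a sketch only: the sample feed, the dataset ids and the `plan_refresh` helper are assumptions made for illustration, and the deleted-entry element is taken to be the Atom tombstone form with `ref` and `when` attributes.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
TOMBSTONE = "{http://purl.org/atompub/tombstones/1.0}"

# Hypothetical feed: two current datasets and one removed dataset.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:at="http://purl.org/atompub/tombstones/1.0">
  <title>Datasets</title>
  <entry>
    <id>http://example.org/dataset/schools</id>
    <updated>2010-03-01T12:00:00Z</updated>
  </entry>
  <entry>
    <id>http://example.org/dataset/districts</id>
    <updated>2010-01-15T09:30:00Z</updated>
  </entry>
  <at:deleted-entry ref="http://example.org/dataset/old-wards"
                    when="2010-02-20T00:00:00Z"/>
</feed>"""


def plan_refresh(feed_xml, last_seen):
    """Return (ids to reload, ids to drop) given previously loaded timestamps.

    last_seen maps a dataset id to the <updated> value that was last loaded.
    """
    root = ET.fromstring(feed_xml)
    to_reload = []
    for entry in root.findall(ATOM + "entry"):
        dataset_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        # Timestamps in the same UTC format compare correctly as strings.
        if last_seen.get(dataset_id, "") < updated:
            to_reload.append(dataset_id)
    to_drop = [d.get("ref") for d in root.findall(TOMBSTONE + "deleted-entry")]
    return to_reload, to_drop


last_seen = {
    "http://example.org/dataset/schools": "2010-02-01T00:00:00Z",
    "http://example.org/dataset/districts": "2010-01-15T09:30:00Z",
}
print(plan_refresh(SAMPLE_FEED, last_seen))
```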
As well as pointing to a dataset, and indicating that it has been updated, the Atom feed could contain information about the change itself, represented as a changeset. This could be included as part of the information provided about the new version of the dataset, described in terms of its provenance.
Feeds that were provided in this way could be provided using the normal model, whereby any interested triplestores would regularly check the feed for updates, or using PubSubHubbub in order to push notifications to triplestores. The latter would require triplestore providers to support a service that accepted such notifications, of course.
A triplestore should expose which datasets (and which versions of those datasets) are used within the triplestore. This can be gathered through a SPARQL query to list the available graphs and their metadata, so long as that information is included within the named graphs themselves.
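One possible shape for that query is sketched below in Python, together with the GET request URL used to send it to an endpoint. It assumes each named graph contains a dcterms:modified statement about itself; the endpoint address is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: every named graph carries its own dcterms:modified statement.
LIST_GRAPHS_QUERY = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?graph ?modified
WHERE {
  GRAPH ?graph { ?graph dcterms:modified ?modified }
}
ORDER BY ?graph
"""


def graphs_request_url(endpoint):
    """Build a SPARQL protocol GET URL that asks which graphs are loaded."""
    return endpoint + "?" + urlencode({"query": LIST_GRAPHS_QUERY})


# Hypothetical endpoint address.
print(graphs_request_url("http://example.org/sparql"))
```

If the metadata were kept somewhere other than inside the graphs themselves, the GRAPH pattern would need to change accordingly.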
How does all this translate into what guidelines we should put into place for UK government publishers and what tools we should provide centrally?
First, we need to recognise the responsibility that comes with the ownership of a URI. Within the UK, we are encouraging people to use URIs of the form:
http://{sector}.data.gov.uk/id/{concept}/{identifier}
to name things like schools and hospitals, with the recognition that information about those things might come from many different public bodies. Someone has to be in charge of that domain: they have to determine which URIs within a particular URI set are resolvable, and what information is provided at the end of each URI. These same sector owners should support easy-to-use APIs based around the particular URI sets that they are responsible for.
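The URI pattern above is regular enough to mint and parse mechanically. The following Python sketch shows one way to do that; the function names and the sample sector, concept and identifier values are illustrative assumptions rather than an official scheme.

```python
import re
from urllib.parse import quote

URI_PATTERN = re.compile(
    r"^http://(?P<sector>[a-z0-9-]+)\.data\.gov\.uk"
    r"/id/(?P<concept>[^/]+)/(?P<identifier>[^/]+)$"
)


def mint_uri(sector, concept, identifier):
    """Build an identifier URI of the form shown above."""
    # Percent-encode path segments so that each stays a single segment.
    return (
        f"http://{sector}.data.gov.uk/id/"
        f"{quote(concept, safe='')}/{quote(identifier, safe='')}"
    )


def parse_uri(uri):
    """Split a URI into sector, concept and identifier, or return None."""
    match = URI_PATTERN.match(uri)
    return match.groupdict() if match else None


# Illustrative values only.
uri = mint_uri("education", "school", "100869")
print(uri)
print(parse_uri(uri))
```

Being able to recover the sector from a URI in this way is what lets a consumer work out which sector owner is responsible for it.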
The easiest route to supporting the pages, an easy-to-use API, and a SPARQL endpoint for deeper querying is going to be to create a curated triplestore with a linked data API layer over the top. This triplestore will need to be populated with data from multiple datasets, both as separate named graphs (to provide traceability back to the original data) and merged into a default graph that reflects the current state of the world.
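The relationship between the named graphs and the merged default graph can be sketched with plain Python sets standing in for a real triplestore. The dataset URIs and triples here are invented for illustration.

```python
def build_store(datasets):
    """Return (default_graph, named_graphs) from a mapping of URI to triples."""
    # Keep each source dataset as its own named graph for traceability.
    named = {uri: set(triples) for uri, triples in datasets.items()}
    # The default graph is the union of everything that was loaded.
    default = set().union(*named.values())
    return default, named


def sources_of(triple, named):
    """List the named graphs that assert a given triple."""
    return sorted(uri for uri, triples in named.items() if triple in triples)


# Hypothetical datasets that overlap on one statement.
school_type = ("school:1", "rdf:type", "School")
datasets = {
    "http://example.org/dataset/edubase": [
        school_type,
        ("school:1", "label", "Example Primary"),
    ],
    "http://example.org/dataset/inspections": [
        school_type,
        ("school:1", "rating", "Good"),
    ],
}

default, named = build_store(datasets)
print(len(default))
print(sources_of(school_type, named))
```

Queries against the default graph see the merged view, while `sources_of` shows how the named graphs preserve which dataset said what.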
The precise datasets that are included within the triplestore will depend on the judgement of the sector owners about both the trustworthiness of the available datasets and their utility. For example, it’s likely that a lot of triplestores will want to include information about administrative geography and perhaps some information about time, simply because everything happens somewhere and sometime.
Second, we need to make this process really easy, through guidelines and tooling.
We encourage the data owners themselves (which are individual public bodies) to publish, along with the datasets themselves:
Data owners should be able to split up the datasets that they provide into different groups based on their knowledge of the domain, with the possibility of individual datasets belonging to more than one group.
We then create tooling that can:
To facilitate PubSubHubbub use, which supports timely updating of triplestores, we’d need a PubSubHubbub hub. Data owners can inform this hub of updates to their feeds and sector owners can register interest in particular feeds.
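The two interactions with the hub amount to small form-encoded POST bodies. The Python sketch below builds them using the parameter names from the PubSubHubbub specification as best understood here; some versions of the specification require further parameters (such as a verification mode), and the feed and callback URLs are hypothetical.

```python
from urllib.parse import urlencode


def subscription_body(topic_url, callback_url):
    """Body a sector owner's service would POST to the hub to subscribe."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the data owner's Atom feed
        "hub.callback": callback_url,  # where the hub delivers notifications
    })


def publish_ping_body(topic_url):
    """Body a data owner would POST to tell the hub that a feed has changed."""
    return urlencode({"hub.mode": "publish", "hub.url": topic_url})


# Hypothetical feed and callback addresses.
feed = "http://example.org/feeds/datasets.atom"
print(subscription_body(feed, "http://store.example.org/notify"))
print(publish_ping_body(feed))
```

The callback is the service mentioned earlier that a triplestore provider would need to run in order to accept notifications.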
These guidelines and tooling are not just useful for sector owners: they are useful for anyone who wants to pull together linked data published in a distributed way across the web. We should expect and encourage multiple stores offering different combinations of datasets and different levels of service. The ones offered centrally, by sector owners, are certainly not the be-all and end-all — in fact we should look on them as a basic level of service, to be superseded by the community.
Comments
Re: Distributed Publication and Querying
From the perspective of data.gov.uk, I think the curated approach you outline should work well, and I can see the reasons why you’d want to go down that route.
From my own struggles to open up existing data as Linked Data, I’d argue for a slightly different emphasis, though. In my experiments, using ready-merged data sets (dumps, if you like) as named graphs in SPARQL FROM statements works really well with most tools now. Because of their web-like nature they’re pretty simple to deal with on both provider and consumer ends, and the approach has a degree of resilience built in by balancing the levels of effort required at deployment and query time from both the provider and consumer.
Which leads me to think that the data dumps should be mandatory, and the metadata about the named graphs that they embody optional. Not least because a lot of the really crucial trust stuff about a particular graph can already be deduced from its URI, as you outlined.
With regard to tooling, the one thing I really miss using the dump-as-named-graph approach is an easy way to expose the live contents of an RDBMS as premerged or serialised named graphs. Creating a static dump is easy, but they can get stale quickly. Using SPARQL’s CONSTRUCT will do the trick (if the consumer’s own SPARQL client takes the enormous URIs), but exerts potentially huge loads on the server. What’s required is something that intelligently caches the named graph data sets. Perhaps common or garden web caches will do the trick?
Hope this makes sense
Cheers,
Wilbert
Re: Distributed Publication and Querying
Hi Jeni - I know little about Linked Data, so the wise thing for me to do is just keep my mouth shut. But I do have a reaction to this post that may help, coming as it does from outside. My gut reaction is that the idea will founder. It’s a top-down managed approach to a problem that will be defied by all the many gotchas that pop up at the data level. It is devilishly difficult to merge data from different sources in a way that maintains the integrity of each data set. And once you’ve done that it is even more difficult to live in the data long enough to understand what it says. Curation would help but the magnitude of the effort needed seems huge. … Regards, Gary
Re: Distributed Publication and Querying
Hey Jeni,
Great post once again, thanks! A few pointers:
You may want to have a look at my slides about "Querying Linked Data with SPARQL". These slides list different options for querying Linked Data, including the pros and cons. This slideset is from the Consuming Linked Data tutorial we gave at last year's ISWC. At the WWW conference in April we will give a similar tutorial again.
The central approach you propose is not the only solution. It is indeed possible to execute SPARQL queries over the Web of Linked Data (at least over connected parts of it). I'm working on a novel query execution paradigm called link traversal based query execution. The idea is to discover data relevant for answering a query by following specific links during the query execution itself. All that is required from the publishers is adherence to the Linked Data principles. This, in particular, includes the 4th principle: adding links to data from other data sources (you know, only this makes the Web of data a real Web). You can read about link traversal based query execution in my ISWC'09 paper "Executing SPARQL Queries over the Web of Linked Data". Recently, I finished another text that provides a complete formalization of the approach. If you're interested I can send it. A query engine that implements the query approach is part of the Semantic Web Client Library. On top of this library I implemented SQUIN which provides the functionality of the query engine as a simple Web service that can be accessed like an ordinary SPARQL endpoint.
Greetings,
Olaf
Re: Distributed Publication and Querying
I was really pleased to see this, as it tackles issues that have bothered me for a while, and complements this recent post from Wilbert Kraan on querying across two data stores. I'm still in the process of getting my head around linked data, so you'll have to forgive me if some of the questions/points I have are basic or completely off beam.
You say "Given that all data is linked, and all links are potentially interesting, the only answer seems to be to have the whole web of data in a single store. And that kind of centralised solution seems impractical, both in terms of the sheer size of store you’d need and the obvious impact on efficiency of doing so." I can clearly see the problem, but I do wonder if we rewind 10 years or so, we might have said the same about the web - and yet this was the model that was adopted because (I guess) the advantages of doing so were worth the cost. It also seems to me that some aspects of 'trust' can only be assessed by getting the whole picture (or as much of it as you can) - again, this is clearly how PageRank works.
I can see how for well defined data sets with clear provenance you could go with the curated model, but not if you want to know how many people are saying something about a resource (e.g. publishing a rating for a school, let's say).
Re: Distributed Publication and Querying
I guess that my thoughts here are grounded in what is possible now, with the state of technology as it is. I think it’s very possible that in the future there will be some grand web-spanning endpoint — a Google equivalent for the web of data — that executes SPARQL queries over distributed stores of some kind using, I don’t know, a map/reduce paradigm or something.
But I don’t think we’re there yet. Currently, federated queries are non-standard and slow and I don’t think that they will give the kind of level of service that third-party developers are going to want from a RESTful API over data distributed across multiple government sources. In the future, maybe, but not right now.
Re: Distributed Publication and Querying
Great discussion Jeni.
A lot of the linked data work up to now has aimed at a more exploratory style of federation where you automatically build a cache of relevant results from multiple sources through some sort of spreading-activation link following - e.g. SQUIN. That exploratory, open-ended style will still be needed but I think you are right that for many of the Gov data applications we need curated views over relevant trusted data subsets targeted at specific classes of use. Your vision of a network of automatically updated views through feeds/notification makes good sense in that situation.
It would be great to think through what we could do at the Linked Data API level to allow us to express the set of trusted sources to integrate, separate from the mechanics of how that integration is kept up to date. It should be fairly easy to support multiple SPARQL endpoints for the select and/or view phases. The Java implementation already does some of that. Then an API implementation can choose whether to dynamically query the federated sources, maintain caches of select/view results or maintain a complete integrated triple store copy to query.
Dave
Re: Distributed Publication and Querying
Sounds really cool. The use of PubSubHubbub for notifying about data updates sounds great to me. This is one of the best use-cases I hope the protocol can address going forward. Thanks for writing out your thoughts!
-Brett