There’s a big push within the UK government right now, helped along by the appointment of Tim Berners-Lee, to publish their data using Linked Data principles.
One of the challenges is how to publish Linked Data in a world that sometimes, even frequently, changes. Cool URIs don’t change, but departmental domain names do, as departments are split and merged and rebranded. So the URIs that are minted for things like schools and roads need to be detached from the departments that have responsibility for them, neutralised into general domains such as education.data.gov.uk and transport.data.gov.uk.
But that’s the least of the problems. Because schools and roads themselves don’t remain static either. They are split and merged and rebranded. They are resources that change over time. What should their URIs look like?
Some choices are (more or less) obvious. If we have an identifier for a school such as:
http://education.data.gov.uk/id/school/109812
then if the school is merged, and a new school is created, this URI can redirect (301 Moved Permanently) to the URI for the new school. If the school is shut down, the URI can respond with a 410 Gone.
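The merge and closure cases above can be sketched as a small decision function. This is only an illustration: the `SchoolStatus` record, its field names, and the successor URI are hypothetical and not part of any real data.gov.uk service.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SchoolStatus:
    """Hypothetical registry entry; the field names are illustrative."""
    closed: bool = False
    merged_into: Optional[str] = None  # identifier URI of the successor school


def lifecycle_response(status: SchoolStatus) -> Optional[Tuple[int, Optional[str]]]:
    """Return (HTTP status code, Location header) for a merged or closed school.

    Returns None when the school still exists and normal handling applies.
    """
    if status.merged_into is not None:
        return 301, status.merged_into  # Moved Permanently to the new school
    if status.closed:
        return 410, None  # Gone
    return None


# Hypothetical successor URI, used only for this example.
successor = "http://education.data.gov.uk/id/school/900001"
merged = lifecycle_response(SchoolStatus(merged_into=successor))
closed = lifecycle_response(SchoolStatus(closed=True))
```

Checking the merge before the closure flag is a design choice: a school that closed because it was merged should still lead clients to its successor.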
But what if the school’s name changes? If we have a triple like this:
<http://education.data.gov.uk/id/school/109812>
ed:name "Broadmoor Primary School" .
that is true up to 1st September 2009, and:
<http://education.data.gov.uk/id/school/109812>
ed:name "Wildmoor Heath School" .
that is true from 1st September 2009? Are there precautions we should be taking to ensure that these URIs will still work — that the statements we make now about the school will remain valid — in the face of these changes?
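I can think of two ways of handling this: use dated URIs for the versions of the school, or keep undated identifier URIs and put metadata about currency in the document that describes them. (But I’m sure there are others.)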
Let me flesh these out a bit.
On the web, it’s a common pattern to have URIs without dates in them meaning the “current” version of the resource, and URIs with dates to be used for versions on particular dates.
So http://education.data.gov.uk/id/school/19081 could be used for the school as it is now, whereas http://education.data.gov.uk/id/school/19081/2008-09-01 is used for the school as it was on 1st September 2008. If you requested http://education.data.gov.uk/id/school/19081 on 1st April 2009 you’d get a 307 Temporary Redirect response pointing you at http://education.data.gov.uk/id/school/19081/2008-09-01, since that was the last date that the school was updated.
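As a rough sketch of how a server might choose the target of that 307 redirect: pick the latest dated version on or before the request date. The `VERSIONS` list and the `redirect_for` function are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import date

BASE = "http://education.data.gov.uk/id/school/19081"

# Dates on which the school's description changed, kept sorted (assumed data).
VERSIONS = [date(2008, 9, 1), date(2009, 9, 1)]


def redirect_for(request_date, versions=VERSIONS, base=BASE):
    """Return (307, dated URI) for the version current on request_date.

    Returns None if the request predates every known version.
    """
    index = bisect_right(versions, request_date)
    if index == 0:
        return None
    return 307, f"{base}/{versions[index - 1].isoformat()}"


april = redirect_for(date(2009, 4, 1))
september = redirect_for(date(2009, 9, 1))
```

Here `april` points at the 2008-09-01 version and `september` at the 2009-09-01 version, matching the behaviour described above.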
In this scheme, the eventual response you’d get on 1st September 2009 would be something like:
<http://education.data.gov.uk/id/school/19081/2009-09-01>
ed:name "Wildmoor Heath School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2009-09-01> .
<http://education.data.gov.uk/school/19081/2009-09-01>
dc:modified "2009-09-01"^^xsd:date ;
dct:hasVersion <http://education.data.gov.uk/school/19081/2008-09-01> .
and the same request, to http://education.data.gov.uk/id/school/19081, the day before would have given you:
<http://education.data.gov.uk/id/school/19081/2008-09-01>
ed:name "Broadmoor Primary School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2008-09-01> .
<http://education.data.gov.uk/school/19081/2008-09-01>
dc:modified "2008-09-01"^^xsd:date .
I think there could be another set of triples in both graphs about the timeless http://education.data.gov.uk/id/school/19081 and its relationship to the two versions. But I don’t know what relationship should be used between them (dct:hasVersion?).
The benefit of this approach is that the two sets of triples can be combined without the two versions of the school being merged together (although you still have to make sure you pick the most recent version when you’re doing a query).
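To illustrate the "pick the most recent version" step, here is a minimal sketch that treats triples as plain Python tuples rather than using a real triplestore. The `latest_value` helper is hypothetical.

```python
# Triples as (subject, predicate, object) tuples, merged from both dated documents.
BASE = "http://education.data.gov.uk/id/school/19081"
merged = {
    (BASE + "/2008-09-01", "ed:name", "Broadmoor Primary School"),
    (BASE + "/2009-09-01", "ed:name", "Wildmoor Heath School"),
}


def latest_value(triples, base, predicate):
    """Return the object attached to the most recently dated version of base."""
    prefix = base + "/"
    dated = [
        (s[len(prefix):], o)
        for s, p, o in triples
        if p == predicate and s.startswith(prefix)
    ]
    # ISO 8601 dates sort chronologically as plain strings.
    return max(dated)[1] if dated else None


current_name = latest_value(merged, BASE, "ed:name")
```

The two names never collide because they hang off different subjects, but every query has to do this extra selection.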
The downside is that as time goes on and changes pile on changes, you get more and more repeated triples with differently dated subjects, because many of the properties of the school won’t change. It would be possible to define triples about the undated http://education.data.gov.uk/id/school/19081, but
Another problem with this approach is that it makes it harder for people to make assertions externally about the resource. If I’m running my own site and want to say something about this school, I could make the statement:
<http://education.data.gov.uk/id/school/19081> my:rating 5 .
but this would apply forevermore. Or I could make the statement:
<http://education.data.gov.uk/id/school/19081/2008-09-01> my:rating 5 .
which would only apply to the school as on 2008-09-01. I would have to keep track of the new versions of the school as they became available, which is a lot of effort.
We could use undated identifier URIs but include metadata about the document containing the RDF in that document which indicates its currency. This is along the lines of what is shown in the Linked Data Tutorial.
Under this approach, the identifier URI http://education.data.gov.uk/id/school/19081 would give a 303 See Other redirection. On 1st April 2009, it would redirect to http://education.data.gov.uk/school/19081/2008-09-01 and would return:
<http://education.data.gov.uk/id/school/19081>
ed:name "Broadmoor Primary School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2008-09-01> .
<http://education.data.gov.uk/school/19081/2008-09-01>
dc:modified "2008-09-01"^^xsd:date .
On 1st September 2009, the identifier URI http://education.data.gov.uk/id/school/19081 would redirect to http://education.data.gov.uk/school/19081/2009-09-01 and the RDF returned would include triples like:
<http://education.data.gov.uk/id/school/19081>
ed:name "Wildmoor Heath School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2009-09-01> .
<http://education.data.gov.uk/school/19081/2009-09-01>
dc:modified "2009-09-01"^^xsd:date ;
dct:hasVersion <http://education.data.gov.uk/school/19081/2008-09-01> .
The data available about the school at any particular time would always be current, and the metadata about that data can indicate when it was last changed.
A Linked-Data-aware triplestore that regularly scraped the site would create named graphs like:
<http://education.data.gov.uk/school/19081/2008-09-01> {
<http://education.data.gov.uk/id/school/19081>
ed:name "Broadmoor Primary School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2008-09-01> .
<http://education.data.gov.uk/school/19081/2008-09-01>
dc:modified "2008-09-01"^^xsd:date .
}
<http://education.data.gov.uk/school/19081/2009-09-01> {
<http://education.data.gov.uk/id/school/19081>
ed:name "Wildmoor Heath School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2009-09-01> .
<http://education.data.gov.uk/school/19081/2009-09-01>
dc:modified "2009-09-01"^^xsd:date ;
dct:hasVersion <http://education.data.gov.uk/school/19081/2008-09-01> .
}
and it would then be possible to use SPARQL to query the graphs either individually or in combination.
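A toy model of that store, with each named graph held as a set of tuples keyed by its graph URI, shows how the graph boundary keeps each name tied to its modification date. This is a sketch in plain Python rather than SPARQL; the `graph` and `names_by_modified` helpers are hypothetical, and the dct:hasVersion triple is left out for brevity.

```python
ID = "http://education.data.gov.uk/id/school/19081"
DOC = "http://education.data.gov.uk/school/19081/"


def graph(name, day):
    """Build (graph URI, triples) for the document published on the given day."""
    doc = DOC + day
    return doc, {
        (ID, "ed:name", name),
        (ID, "rdfs:isDefinedBy", doc),
        (doc, "dc:modified", day),
    }


# The store maps each graph URI to the triples scraped from that document.
STORE = dict([
    graph("Broadmoor Primary School", "2008-09-01"),
    graph("Wildmoor Heath School", "2009-09-01"),
])


def names_by_modified(store, subject):
    """Map each graph's dc:modified date to the name that graph gives subject."""
    result = {}
    for graph_uri, triples in store.items():
        modified = next(
            o for s, p, o in triples if s == graph_uri and p == "dc:modified"
        )
        for s, p, o in triples:
            if s == subject and p == "ed:name":
                result[modified] = o
    return result


history = names_by_modified(STORE, ID)
current_name = history[max(history)]  # name from the most recently modified graph
```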
What bothers me about this approach is that anything scraping the data needs to understand the interaction between the date of requesting http://education.data.gov.uk/id/school/19081 and the value of dc:modified in the resource linked to through rdfs:isDefinedBy to tell the difference between information that is true and information that was true. A naive aggregator that regularly visited the site could easily end up with just:
<http://education.data.gov.uk/id/school/19081>
ed:name "Broadmoor Primary School" ;
ed:name "Wildmoor Heath School" ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2009-09-01> ;
rdfs:isDefinedBy <http://education.data.gov.uk/school/19081/2008-09-01> .
<http://education.data.gov.uk/school/19081/2009-09-01>
dc:modified "2009-09-01"^^xsd:date .
<http://education.data.gov.uk/school/19081/2008-09-01>
dc:modified "2008-09-01"^^xsd:date .
with no means of telling which name is associated with which modification date.
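The loss of information is easy to reproduce. In this sketch (plain tuples again, with a hypothetical `document` helper) the aggregator simply takes the union of the two documents:

```python
ID = "http://education.data.gov.uk/id/school/19081"
DOC = "http://education.data.gov.uk/school/19081/"


def document(name, day):
    """Triples served for the school in the document published on the given day."""
    doc = DOC + day
    return {
        (ID, "ed:name", name),
        (ID, "rdfs:isDefinedBy", doc),
        (doc, "dc:modified", day),
    }


# A naive aggregator merges everything it fetches into one graph.
merged = document("Broadmoor Primary School", "2008-09-01") | document(
    "Wildmoor Heath School", "2009-09-01"
)

names = sorted(o for s, p, o in merged if s == ID and p == "ed:name")
docs = sorted(o for s, p, o in merged if s == ID and p == "rdfs:isDefinedBy")
```

After the merge, `names` holds two names and `docs` holds two documents, but no triple connects a particular name to a particular document, so the pairing cannot be recovered from the merged graph alone.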
Aside: What really bothers me about named graphs is that there’s no real standard. The closest is in SPARQL, which standardises how to query over a set of graphs but doesn’t really say how these graphs could be created. There’s nothing that I know of that says that named graphs should be created as I’ve described above. The syntaxes suggested for expressing named graphs are drafts and notes and proposals.
Without standards that define how named graphs should be created and expressed, it’s hard to work out how exactly they should be used.
So: what options have I missed? How should we be publishing Linked Data in a changing world?
Comments
Re: Linked Open Data in a Changing World
I think that it is important to separate internal representation of temporal information and the way we communicate state of ‘the world’ and changes in ‘the world’ through various representations.
I typically use a custom semantic database with built-in support for a temporal dimension (additional context/scope) and basic temporal reasoning. This approach makes it possible to keep track of changes in ‘the world’ and to generate a ‘world snapshot’ at any moment in time. It is also possible to generate a representation of the history of any property related to a subject (time series, collection of observations).
A complementary approach is to use ‘current’ and ‘historical’ properties.
For example, ‘type’ - corresponds to current type(s) of a subject, ‘type-history’ - types in history, ‘name’ - corresponds to a current name of a subject, ‘name-history’ - corresponds to a set of names that subject had at some moment in time.
Historical properties provide a very nice simplified view on a history of a subject.
If our semantic database also supports simplified reification then we can add additional assertions about time of validity for property values.
In this case, http://education.data.gov.uk/id/school/109812 corresponds to a combination of ‘current’ and ‘historical’ properties with temporal annotations.
http://education.data.gov.uk/id/school/19081/2009-09-01 can be used to reference the state of ‘the world’ on 2009-09-01 (which may also include current-at-that-moment and historical-at-that-moment properties).
Temporal semantic databases can calculate ‘world snapshots’ more or less efficiently without assertion duplication.
In terms of inferences, this approach is more consistent with incremental inferences and truth maintenance.
Re: Linked Open Data in a Changing World
Thanks for this post - interesting and useful.
I think I prefer the “undated identifier URIs but include metadata about the document containing the RDF in that document which indicates its currency” option. This is similar to the way OS data currently works. Metadata for changes is included in the GML files we serve OS MasterMap in.
This is a problem I’m battling with because I’m currently updating the RDF for the administrative areas of GB, and these administrative areas change a couple of times a year (at least).
Re: Linked Open Data in a Changing World
I think the only way you’ll begin to resolve this is to scope every triple to a date range when it is valid. The “current” URI could be a resource referencing all the triples known to be valid for the resource, now; other date-specific resources could be introduced to represent the state of the resource at other points in time.
I think these “container” resources would have to be computed at the point of query, rather than pre-computed: it would be computationally expensive to continually group statements within larger resources and keep the relationships up to date as time passes. Or perhaps you’d leave the client to do that, and provide an extremely minimal, “permanent” resource which points to types of relationships which it holds to other “permanent” resources. Though this whole idea of what constitutes the permanent core of a resource is always going to be sticky itself (for example, will schools always have names?).
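A minimal sketch of that idea, assuming each triple carries a validity interval and the "container" for a given date is computed at query time. The `ScopedTriple` type and `snapshot` function are hypothetical, and since the post does not say when the first name came into use, `date.min` stands in for an unknown start.

```python
from datetime import date
from typing import NamedTuple, Optional

ID = "http://education.data.gov.uk/id/school/19081"


class ScopedTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date]  # exclusive end; None means still valid


TRIPLES = [
    # Start of the first name is unknown, so date.min is a placeholder.
    ScopedTriple(ID, "ed:name", "Broadmoor Primary School", date.min, date(2009, 9, 1)),
    ScopedTriple(ID, "ed:name", "Wildmoor Heath School", date(2009, 9, 1), None),
]


def snapshot(triples, on):
    """Compute, at query time, the plain triples that hold on the given date."""
    return {
        (t.subject, t.predicate, t.obj)
        for t in triples
        if t.valid_from <= on and (t.valid_to is None or on < t.valid_to)
    }


before = snapshot(TRIPLES, date(2009, 4, 1))
after = snapshot(TRIPLES, date(2009, 9, 1))
```

Nothing is duplicated when a property stays the same, at the cost of filtering on every query.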
There are some interesting thoughts on this in the computational linguistics literature (I recall this one - http://www.aclweb.org/anthology/P/P88/P88-1009.pdf - was pretty interesting); and there are going to be interesting cases where you can’t be definite about dates/times when triples could/would/should etc. hold. So, a thorny issue which definitely needs addressing.
Re: Linked Open Data in a Changing World
What I have found with RDF is that if you don’t fit your modeling in with the standard vocabularies, you are back in the realm of custom semantics: in which case you may as well be using vanilla XML. To put it another way, RDF has a great lack of standardized properties etc and this makes it useless for lots of purposes where you would think it would be easy: there are not enough standard verbs for it to be convenient.
Perhaps your use-case is big enough that you can push through a new set of organization-related verbs for RDF. That would be great.
I had to do a little government-related RDF job recently, my first. I ended up having to make it pretty much like a Topic Map (though the ISO Topic Maps standard has the same problem: not enough ‘Published Subject Indicators’.) The only way I could figure out how to do something useful was 1) to clearly distinguish between concepts and instances and 2) to remember that the job was not modeling information per se but being able to label/enable more semantic understanding of a web of resources.
First I made URLs (URNs actually, I used LSIDs) for all basic concepts (in your case, a ‘School’ and a ‘School name’ and ‘School term’) and labelled them as owl:concepts with RDF triples. Then I made triples for all instances of the concepts (a particular school). Then I could use skos:isSubjectOf to link from the particular instance to web home pages (as well as foaf:page etc for related pages.)
In your case, I would say that perhaps it would be a mistake to treat a webpage for a school at a certain time as if it were a concept. Figure out your concepts, then your instances, then link from these to existing pages. There is a school concept. There are actual schools. Schools have terms. During school terms they have names. (Or whatever you come up with.) The webpage for a school at a certain time is XXX. At another time it is YYY. (It doesn’t matter if you end up with multiple URLs or URNs for things, you can use owl:sameAs if the same resource has different identifiers.)
In my PRESTO idea, my point was that all significant pieces of information (resources) at every level of granularity should have a unique, persistent, friendly URL regardless of whether each item was on the web or not. I think the Linked Data re-branding of RDF can fit in quite well with this. But some things may be better solved by URL resolvers using regexes (such as the Tuckey resolver) rather than with RDF.
Cheers Rick Jelliffe
Re: Linked Open Data in a Changing World
Hi Jeni,
While you could see a school that changed its name as a new entity I wouldn't go for your first proposal. It just makes things too complicated: to be consistent we would have to adopt this approach to *every* URI we use, we would have to communicate to people why we use this mass of URIs (and see the current discussions on the semweb and LOD mailing lists how our URI usage is too complicated already, apparently) and it would just make data integration very hard because you never know which URI to use. And in some areas data changes very often so you end up with loads of URIs. This just can't be the solution, it has to be a lot simpler.
You talk about the risks of an aggregator merging the data. I don't see this risk. Every aggregator has to assume that documents change all the time so whenever it retrieves data from a resource it knows it accessed before then it can either replace its current data or store the new data under a different graph URI that reflects the time and all other circumstances of access. If it does the latter then yes, you have to be careful when querying the store of that aggregator.
A more efficient alternative to always reloading the whole document is the Talis Changeset Protocol: http://n2.talis.com/wiki/Changeset_Protocol - I think this needs to see wider adoption. Although I would probably replace the reification they rely on with two named graphs, one holding all the triples being added and one holding all the triples being removed.
Apart from graphs which contain the changesets you can also have one main graph which reflects the current state of all of the data with all changesets applied (and you can do that both on the publisher and on the aggregator side).
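The two-graph variant suggested here (one set of triples to remove, one to add) could be sketched as follows. This is not the Talis Changeset vocabulary itself, just an illustration with plain Python sets and a hypothetical `apply_changeset` helper.

```python
ID = "http://education.data.gov.uk/id/school/19081"


def apply_changeset(graph, removals, additions):
    """Return a new graph with removals taken out and additions put in.

    Raises ValueError if the changeset removes a triple the graph lacks,
    which suggests the changeset was built against a different state.
    """
    missing = removals - graph
    if missing:
        raise ValueError(f"triples to remove are not in the graph: {sorted(missing)}")
    return (graph - removals) | additions


# Main graph reflecting the state before the name change.
main = {(ID, "ed:name", "Broadmoor Primary School")}

# The changeset for 1st September 2009, as two separate sets of triples.
removed = {(ID, "ed:name", "Broadmoor Primary School")}
added = {(ID, "ed:name", "Wildmoor Heath School")}

updated = apply_changeset(main, removed, added)
```

Both the publisher and an aggregator can run the same function to keep their main graph current, while the `removed` and `added` sets are stored as the history.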
Regards,
Simon
Re: Linked Open Data in a Changing World
i guess this is the curse of the specific architectural approach of the semantic web, which assumes that all URIs are HTTP URIs, and you get descriptions about the identified entities by accessing the resource URI itself. on http://dret.typepad.com/dretblog/2009/07/the-last-uri-scheme-youll-ever-need.html i have argued that such an approach might actually be harmful, and i think your case is another example of where this approach shows some unfortunate side-effects. in plain web architecture, URIs are completely opaque, so if you’re afraid that domains might go away (or you just want to be able to handle that case), you use non-HTTP URIs, let’s assume in a very simple case something like tag:jenischool98764321986432874. this of course is not a URI scheme by itself, but you get the idea… by doing this, you decouple resource identity (the URI you’re using to use an entity of interest) from resource access (how do i get any information about this school?). this of course means that your apps must be aware of those URIs and must know how to resolve them, should you want to access the resource itself or get a description of it. but that’s the price you have to pay to be able to deal with unpredictable changes on the DNS level.
to me, this looks like a great example of the semantic web trade-off: by baking HTTP into the foundation, some things can be magically simplified (the httpRange-14 trick), but if that trick does not work anymore because DNS names change, for example, you are starting to pay the price for it.
Re: Linked Open Data in a Changing World
Jeni,
Isn’t it possible to make the splits and the merges part of your model/vocabulary?
The entities as a result of a merge or split event get new identifiers.
For name changes you can use something along these lines, changing name or label from a datatype property to an object property, where the name has its own PIT properties.
Re: Linked Open Data in a Changing World
Jeni, interesting post and an important issue.
First, it’s not really about named graphs. Named graphs can be used to store and query either approach. It’s about whether you just version your documents, or whether you also version your resource identifiers.
I think that it’s better to keep the same version identifier. To make this work well with versioned documents, some additional vocabulary would be useful: “This document is no longer valid and superseded by that one over there.” “This document is valid from date X through date Y.” The main question is what would be a good venue to standardize such a vocabulary. I don’t know the answer.
About dret’s comment: We tried this in the RDF world (before Linked Data) and I didn’t like it. Each time you hear about some new interesting data, you have to start up your dev tools and code a resolver for their funky identifier scheme. Which will usually end up being “http://whatever.com/resolver?id=…”, re-introducing the HTTP URIs they tried to avoid. You know, with proper foresight and vigilance, you can keep HTTP URIs stable for as long as someone cares enough to pony up a couple 100 quid every year. A trustworthy PURL service can also help.
Re: Linked Open Data in a Changing World
If you treat every document containing RDF data as an OWL ontology (which just happens to only contain instance data) then you can use some of OWL’s vocab for exactly that. And it will make the DL reasoners happy I guess. ;-) For example owl:priorVersion and its sub-properties might be useful. And in OWL 2 you can even keep the ontology URI the same and just change its version IRI (a new property) so it’s all connected through the ontology URI. It doesn’t have anything for deprecating ontologies though. To express the temporal validity of a document you can use dcterms:valid although it doesn’t specify a format for how to express such a range. I guess you’d do that with OWL Time.
Re: Linked Open Data in a Changing World
I think that dcterms:valid is the right thing to use. Its range is a literal, so I don’t think we could use OWL Time, but Dublin Core defines a method for encoding periods. Something like:
Dublin Core also has useful properties like dcterms:replaces and dcterms:isReplacedBy.
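For reference, a value in the DCMI Period encoding mentioned above is a string of `key=value;` components such as `start`, `end` and `scheme`. The sketch below parses one such literal. The dates are assumptions chosen to match the school example (the post only says the old name held up to 1st September 2009), and `parse_period` is a hypothetical helper, not part of any Dublin Core library.

```python
def parse_period(value):
    """Split a DCMI Period literal into a dict of its components."""
    parts = {}
    for component in value.split(";"):
        if "=" in component:
            key, _, val = component.partition("=")
            parts[key.strip()] = val.strip()
    return parts


# Assumed validity period for the document describing the school's old name.
literal = "start=2008-09-01; end=2009-08-31; scheme=W3C-DTF;"
period = parse_period(literal)
```

A document carrying such a literal as its dcterms:valid value would tell an aggregator both when its statements started to hold and when they stopped.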