If you’ve hung around in linked data circles for any amount of time, you’ll probably have come across the httpRange-14 issue. This was an issue placed before the W3C TAG years and years ago which has become a permathread on semantic web and linked data mailing lists. The basic question (or my interpretation of it) is:
Given that URIs can sometimes be used to name things that aren’t on the web (eg the novel Moby Dick) and sometimes things that are (eg the Wikipedia page about Moby Dick), how can you tell, for a given URI, how it’s being used so that you can work out what a statement (say, about its author) means?
One answer is to use a hash URI whenever you want to refer to something that doesn’t live on the web, with the base URI providing information about that thing. For example:
http://en.wikipedia.org/wiki/Moby-Dick is the URI for the Wikipedia page
http://en.wikipedia.org/wiki/Moby-Dick#thing is a URI for the novel itself
The problem some people (including me) have with this is that hash URIs are primarily used to indicate portions of a web page, and using them for things that aren’t page fragments overloads them. It’s also an inflexible method, because the server isn’t told what the fragment identifier is, and therefore it can’t be used as the basis for a redirection, for example.
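To make the point about fragments concrete, here is a minimal Python sketch using the standard library’s urldefrag function. The #thing fragment is the made-up one from the example above, and request_target is a hypothetical helper name. It shows the part of the URI that a client would actually send in a request, which is why a server has nothing to base a redirection on.

```python
from urllib.parse import urldefrag

def request_target(uri: str) -> str:
    """Return the URI a client sends to the server (fragment removed)."""
    base, _fragment = urldefrag(uri)
    return base

page = "http://en.wikipedia.org/wiki/Moby-Dick"
novel = "http://en.wikipedia.org/wiki/Moby-Dick#thing"

# Both URIs lead to the same request, so the server cannot tell them apart.
same_request = request_target(page) == request_target(novel)
```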
The 2005 TAG resolution for people who wanted to use separate non-hash URIs, such as [warning, made-up URIs]
http://en.wikipedia.org/wiki/Moby-Dick is the URI for the Wikipedia page
http://wikipedia.org/thing/Moby-Dick is the URI for the novel itself
was that if you get a
2XX response when you request a URI, that URI refers to a document (the document that you get back)
303 response when you request a URI, that URI could refer to anything, and the resource you get by following the redirection describes that thing (hence if a URI should refer to something that isn’t on the web then requests to it should respond with a 303)
4XX response when you request a URI, that URI could represent anything
This leads to the 303 pattern described for example within Cool URIs for the Semantic Web; in the example here, the response to http://wikipedia.org/thing/Moby-Dick would be a 303 redirection to the URI for the Wikipedia page given above.
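As a rough illustration of how a client might apply the resolution, the sketch below maps a status code to what it lets you conclude about the requested URI. The function name and the returned labels are my own, and the "unspecified" fallback for codes the summary above doesn’t mention is an assumption rather than part of the resolution.

```python
def interpret_response(status: int) -> str:
    """Say what the requested URI may refer to, given the response status."""
    if 200 <= status < 300:
        # The URI names the document that came back.
        return "document"
    if status == 303:
        # The URI could name anything; the redirect target describes it.
        return "anything, described by the redirect target"
    if 400 <= status < 500:
        # Nothing can be concluded about what the URI names.
        return "anything"
    # Assumption: other codes are outside the rules summarised above.
    return "unspecified"
```

A client would call this after the first response, before following any redirection, because it is the initial status code that carries the signal.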
Six years later, we have a lot of experience with this technique of distinguishing between things that are or are not on the web, and it has a bunch of practical limitations.
303 redirections) that make life difficult for people without that level of access
303 redirection, which adds server load and slows things down (this is made worse as 303 responses can’t be cached, an oversight in the HTTP spec that I gather is fixed in HTTPbis)
the 303 pattern requires a level of knowledge and understanding that is beyond most web developers, particularly if they get no benefit from taking care over their use of URIs (for example, Facebook, schema.org and so on all encourage the use of URIs for non-web things without a word about
303 redirection is required
Basically, while the web architectural principles behind the use of 303 redirections are (arguably!) sound, the collective experience of the past six years indicates that many publishers will not use it because they don’t know to, because they don’t care to, because they make mistakes or because they simply can’t while meeting the other practical constraints of their project.
A number of other approaches have been suggested, before and after the TAG decision, many of which are documented within the draft TAG finding Providing and discovering definitions of URIs.
The first observation that I want to make is that many of the objections to the 303 pattern are about the practicalities of publishers using it. Therefore, any suggestions to provide an alternative technique that involve Content-Location or other specialist headers are not going to be widely used for exactly the same reason. I’m not at all persuaded that it’s worth spending time developing them.
My second observation is that there are three questions that are being conflated and we might make more progress if we separated them:
Both the hash URI pattern and the 303 pattern make the assumption that you need to have separate URIs for things that are not on the web (eg books) and documents on the web about them (eg pages about books). This is useful because it enables people to make separate statements about the author of a book:
<http://wikipedia.org/thing/Moby-Dick> dct:creator <http://wikipedia.org/thing/Herman_Melville> ; .
from the authors of the Wikipedia page about that book:
<http://en.wikipedia.org/wiki/Moby-Dick> dct:creator <http://wikipedia.org/user/Aristophanes68> , <http://wikipedia.org/user/SporkBot> , <http://wikipedia.org/user/Curb_Chain> , ... .
If we only have the URI http://en.wikipedia.org/wiki/Moby-Dick then we run into difficulties interpreting statements made about that URI, and indeed different people might use the URI in different ways, or make some statements that use the URI to mean the novel and some to mean the Wikipedia page.
So there are good reasons to have two separate URIs in these cases.
But the fact is that many publishers currently have a one-URI-fits-all policy. And even if they don’t, people reusing those URIs will often make mistakes and use the wrong one. It would be nice if we could make the world see that this leads to all sorts of logical problems for the Semantic Web, but I just can’t see that happening.
Web: traditional design (hypertext) + URIs - link integrity = viral growth
Semantic Web: traditional design (logic/database) + URIs - ? = viral growth
Are there parts of traditional logic and databases that, if we set them aside, will result in viral growth of the Semantic Web?
(By the way, in case my replication of this slide is interpreted incorrectly: I’m certainly not implying that viral growth of the Semantic Web is an end in itself, though I would like to see viral growth in data sharing.)
Dropping the requirement for link integrity, coping with the fact that sometimes links would break, was what made the web work. It would have been simply impossible to build the web as a decentralised system if there had been a requirement for links to always work.
Of course that doesn’t mean that we like it when links get broken. There’s oodles of best practice advice out there on making sure that you retain support for old URIs if you change your web space; we have backup systems in place in the form of web archives so we can work out what was once at the end of a particular URI; and the resolvability of links is something a linter will check about your website.
So it’s not that when he developed the web TimBL rejected entirely the very concept of link integrity, it’s that he recognised that we have to work with the imperfection of the real world. Links break. HTTP copes. Browsers cope. People cope.
The imperfection of the real world as it applies to linked data is that URIs will be used in ambiguous ways. We might not like it; we might write best practice documents that encourage people to have separate URIs for web-thing and non-web-thing, develop tools that help people detect when they’ve used the wrong URI, and so on. But it will still happen, and in my opinion we need to work out how to cope.
In fact, ambiguity in URIs goes much further than just a confusion between the Wikipedia page about Moby Dick and the novel Moby Dick itself. URIs are names, and names are used by different people to mean different things. The same URI might end up meaning:
Even if the publisher provides a clear and unambiguous definition about what the URI http://en.wikipedia.org/wiki/Moby-Dick means, other people will use it to mean something different because it’s close enough for what they want to say.
So I think the answer to the first question I posed — “Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?” — has to be “No, though it is good practice to.” We can fight against ambiguity, but we have to accept that we cannot win.
As discussed above, in a perfect world, we would have separate URIs for things-on-the-web and non-web-things and any data that we published about Moby Dick would use the URI for the Wikipedia page to talk about things like the licence for that information, or how the information was created (its provenance), and the URI for the novel to talk about things like the licence for the novel and what characters appeared in it.
But the world is not perfect, and we are going to end up with situations where the same URI is used to refer to a whole range of different things. How do we cope?
Well, first let me say that I don’t see people merging data together willy-nilly and hoping to get something useful out of it. URIs give us connection points and RDF gives us a flexible data model, which means that merging data can be easier than the kinds of custom merging that you have to do with CSV and JSON, but I don’t think it can ever remove entirely the requirement for curation. We want to ensure that the need for intervention in merging two datasets is kept to a minimum, but we can’t expect it to be entirely removed.
So with that in mind, there are at least three techniques that can be used to get useful data out of a world in which the same URI is used to mean different things.
The first technique is to interpret particular properties as describing a one-or-more-step-removed relationship between a resource and a value. For example, the bib:author and dct:creator properties would be defined such that the RDF statements
<http://en.wikipedia.org/wiki/Moby-Dick> bib:author <http://en.wikipedia.org/wiki/Herman_Melville> ; dct:creator <http://en.wikipedia.org/wiki/User:Aristophanes68> ; .
would be interpreted as saying
The topic of the page http://en.wikipedia.org/wiki/Moby-Dick was authored by the topic of the page http://en.wikipedia.org/wiki/Herman_Melville. The creator of the page http://en.wikipedia.org/wiki/Moby-Dick is the topic of the page http://en.wikipedia.org/wiki/User:Aristophanes68.
The biggest problem with the global application of this approach is that there are a lot of existing properties defined in vocabularies such as FOAF or Dublin Core that aren’t defined as one-step-removed properties. One publisher might use dct:creator to link to “a page describing the creator of this page” and another might use it to point directly to a (non-web-thing) URI for the creator of the page. So practically, this approach requires the interpretation of properties to be done on a dataset-by-dataset basis. Which leads onto the next approach.
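One way to picture the dataset-by-dataset interpretation is as a small lookup table that a consumer maintains for each dataset. The sketch below is only an illustration: the READINGS table, its "page" and "topic" labels and the function names are all hypothetical, and the entries simply encode the reading of bib:author and dct:creator given in the example above.

```python
# Hypothetical per-dataset table: for each property, whether the subject and
# the object should be read as the page itself or as the topic of that page.
READINGS = {
    "bib:author": ("topic", "topic"),
    "dct:creator": ("page", "topic"),
}

def describe(uri: str, mode: str) -> str:
    """Spell out which of the two readings of a page URI is intended."""
    return f"the topic of the page {uri}" if mode == "topic" else f"the page {uri}"

def interpret(subject: str, prop: str, obj: str) -> str:
    """Render one statement in words; unknown properties are read literally."""
    subject_mode, object_mode = READINGS.get(prop, ("page", "page"))
    return f"{describe(subject, subject_mode)} {prop} {describe(obj, object_mode)}"

MOBY = "http://en.wikipedia.org/wiki/Moby-Dick"
authored = interpret(MOBY, "bib:author", "http://en.wikipedia.org/wiki/Herman_Melville")
created = interpret(MOBY, "dct:creator", "http://en.wikipedia.org/wiki/User:Aristophanes68")
```

A second dataset that uses dct:creator to point directly at a non-web-thing URI would need its own table, which is exactly the per-dataset cost described above.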
A second technique would be to make the assumption that within a single dataset, a single URI has a single meaning, but that the meaning may differ between datasets. I suspect that this is true even when publishers attempt to take care about which URI they use, because, like names, the meaning of a URI is slightly different depending on its use.
Re-users of data need to work out whether the way URIs are used in one dataset is close enough to the way they are used in another dataset, to ascertain whether it’s appropriate to simply merge the datasets or whether something slightly more complicated needs to be done to bring the datasets together.
The problem with this approach is that it raises the barrier to joining together graphs: you can’t just bung the data into a triplestore and perform queries on it, you have to work out some kind of mapping between the datasets up front.
The final technique that I’ll talk about here is to say that different applications need to access different properties, and can ignore any properties that don’t fit with how they want to use the data. It is relatively rarely useful to have generic RDF viewers; people (generally) build applications to answer questions and perform tasks, not to just browse around data.
For example, if a single dataset were to contain:
<http://en.wikipedia.org/wiki/Moby-Dick> a bib:Book ; bib:author <http://en.wikipedia.org/wiki/Herman_Melville> ; a foaf:Document ; dct:creator <http://en.wikipedia.org/wiki/User:Aristophanes68> ; .
then an application that was interested in gathering data about books would only care about the fact that http://en.wikipedia.org/wiki/Moby-Dick was a book with an author of http://en.wikipedia.org/wiki/Herman_Melville and wouldn’t care about the FOAF or Dublin Core classes or properties associated with the URI. An application that was interested in gathering information about the authorship of documents on the web, on the other hand, might look for the foaf:Document class and Dublin Core properties and ignore everything else.
To me, this approach seems the most promising way of retaining the core benefits of RDF. It seems more robust in the face of user error than the idea of defining one-step-removed properties, and retains the ease of mashing together data from different sources in a way that you wouldn’t get if you had to think about the URI usage within each of the datasets that you want to bring together.
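A minimal sketch of this third technique, assuming triples are held as plain Python tuples with prefixed names (a real application would more likely use an RDF library). The view function and the choice to filter on vocabulary prefixes are my own illustration, not an established API.

```python
MOBY = "http://en.wikipedia.org/wiki/Moby-Dick"
MELVILLE = "http://en.wikipedia.org/wiki/Herman_Melville"
USER = "http://en.wikipedia.org/wiki/User:Aristophanes68"

# The example dataset from above, as (subject, property, object) tuples.
DATA = [
    (MOBY, "rdf:type", "bib:Book"),
    (MOBY, "bib:author", MELVILLE),
    (MOBY, "rdf:type", "foaf:Document"),
    (MOBY, "dct:creator", USER),
]

def view(triples, prefixes):
    """Keep only statements whose property (or class, for rdf:type)
    comes from a vocabulary the application cares about."""
    kept = []
    for s, p, o in triples:
        # For type statements the class decides; otherwise the property does.
        term = o if p == "rdf:type" else p
        if term.startswith(prefixes):
            kept.append((s, p, o))
    return kept

book_view = view(DATA, ("bib:",))
document_view = view(DATA, ("foaf:", "dct:"))
```

Each application sees a consistent picture of the URI without either of them having to resolve the ambiguity in the merged data.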
And so we get to the final question: how should people be able to get from a URI to information about whatever the URI refers to?
I’ve discussed above how I think distinguishing between things-on-the-web and non-web-things has to be seen as a best practice. I think we should continue to recommend the 303 or hash URI methods as the best practice for accessing data from a URI. My reason for this is that introducing yet another method will just make it harder for publishers to know which method to use when, plus I don’t want to see people who have adopted these techniques in good faith being told that they were doing the wrong thing all along. What I’d like to aim to do is to find a way of fitting these methods into a larger approach.
I also recognise the argument that articulating the relationships between on-the-web and not-on-the-web resources purely through HTTP responses isn’t ideal. It’s useful to have explicit links between resources within the data itself. Within the linked data work that I’ve done for data.gov.uk I’ve tried to adopt a pattern of explicitly using foaf:page to link together the different resources. Other people have suggested the wdrs:describedby property for pointers from a resource to information about that resource; rdfs:isDefinedBy performs a similar function for classes and properties within RDFS.
It would be nice to have one defined property or set of properties to describe these relationships, but we have to recognise that not everyone will use them, so the approach we take has to work when these links aren’t present. The majority of people and sites are going to start off by publishing data about something at a single URI, and simply return data about that thing (a 200 response) when the URI is requested. If they then progress to wanting to have separate URIs for that thing and the page about the thing, or indeed to disambiguate the URI that they’ve used in some other way, we need to make it easy for them to do so.
I think we need two properties:
eg:describedBy describes the link between a resource (of any type) and a document that describes it;
eg:couldBe is a disambiguation link that points from a URI to other possible, more precise, URIs.
Then I think we need some rules along the lines of (I don’t pretend these are entirely worked out):
if you get a 303 response redirecting to U' when you fetch a URI U then behave as if the response from U' included the triple U eg:describedBy U'
if U is a hash URI whose base URI is U' then behave as if the response from U' included the triple U eg:describedBy U'
if you get a 2XX response in response to a URI
U eg:describedBy ?page then assume that the document you have is
eg:couldBe any of the
U eg:describedBy ?page then assume that the document that you have is ?page and it is about U (along with other things, possibly); statements about ?page might include information about the licence or provenance of the returned document
?thing eg:describedBy U then assume that the document you have is U and it is about (possibly multiple) ?thing
U eg:describedBy U; in this case, U is being used in an ambiguous way
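The first two rules, for 303 responses and hash URIs, are mechanical enough to sketch in code. The version below is only an illustration under some assumptions: implied_triples is a hypothetical helper, the status code and redirect target are passed in by the caller rather than fetched, and the 2XX rules are left out because they depend on inspecting the returned data.

```python
from urllib.parse import urldefrag

DESCRIBED_BY = "eg:describedBy"

def implied_triples(uri, status=None, location=None):
    """Return the eg:describedBy triples implied by how `uri` was resolved.

    `status` and `location` are the status code and redirect target seen
    when fetching `uri`; both are supplied by the caller.
    """
    triples = []
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI rule: the base document describes the thing.
        triples.append((uri, DESCRIBED_BY, base))
    elif status == 303 and location:
        # 303 rule: the redirect target describes the thing.
        triples.append((uri, DESCRIBED_BY, location))
    return triples

hash_case = implied_triples("http://en.wikipedia.org/wiki/Moby-Dick#thing")
redirect_case = implied_triples(
    "http://wikipedia.org/thing/Moby-Dick",
    status=303,
    location="http://en.wikipedia.org/wiki/Moby-Dick",
)
```

Treating both existing patterns as shorthand for an explicit triple means a consumer can handle them with the same code path as data that states eg:describedBy directly.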
We could go further and say:
U eg:couldBe ?page . ?thing eg:describedBy ?page then assume that the document you have is ?page and it is about ?thing
U eg:couldBe ?thing . ?thing eg:describedBy ?page then assume that the document you have is ?page and it is about ?thing
This way, if someone starts off using U in an ambiguous way, or to mean only the page or only the thing, they can later add eg:couldBe statements to disambiguate and add information about the page or thing the page describes.
It’s worth bearing in mind that we shouldn’t just be concerned about locating information about things that aren’t on the web, but about things that are on the web but that cannot have metadata embedded within them. For example, how do we discover the licence associated with a particular image? Although there are methods of embedding metadata within image and other binary formats, such as XMP, it’s still useful to be able to locate metadata about images based on their URI.
With a scheme such as that described above, publishers that used content negotiation to return some data about the image in another format could use eg:describedBy to indicate that the returned document is about the image (or set of images in different formats).
The summary of my thinking is:
303s and hash URIs and provide an easy upgrade path as publishers choose to add more specificity
The key will be how we find practical ways to cope with the real, imperfect, fuzzy web of data while providing an evolutionary path to greater clarity and specificity that publishers can take when they see the benefit of doing so.