What Do URIs Mean Anyway?

If you’ve hung around in linked data circles for any amount of time, you’ll probably have come across the httpRange-14 issue. This was an issue placed before the W3C TAG years and years ago which has become a permathread on semantic web and linked data mailing lists. The basic question (or my interpretation of it) is:

Given that URIs can sometimes be used to name things that aren’t on the web (eg the novel Moby Dick) and sometimes things that are (eg the Wikipedia page about Moby Dick), how can you tell, for a given URI, how it’s being used so that you can work out what a statement (say, about its author) means?

One answer is to use a hash URI whenever you want to refer to something that doesn’t live on the web, with the base URI providing information about that thing. For example:

  • http://en.wikipedia.org/wiki/Moby-Dick is the URI for the Wikipedia page
  • http://en.wikipedia.org/wiki/Moby-Dick#thing is a URI for the novel itself

The problem some people (including me) have with this is that hash URIs are primarily used to indicate portions of a web page, and using them for things that aren’t page fragments overloads them. It’s also an inflexible method, because the server isn’t told what the fragment identifier is, and therefore it can’t be used as the basis for a redirection, for example.
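The inflexibility can be seen from the client side. What follows is a minimal sketch, assuming Python's standard urllib.parse module; the helper name split_hash_uri is made up for illustration. It shows which part of a hash URI goes into the HTTP request and which part stays with the client.

```python
from typing import Tuple
from urllib.parse import urldefrag


def split_hash_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into the part that is requested and the fragment that is not.

    Hypothetical helper for illustration: HTTP clients strip the fragment
    before sending a request, so the server only ever sees the base URI.
    """
    base, fragment = urldefrag(uri)
    return base, fragment


base, fragment = split_hash_uri("http://en.wikipedia.org/wiki/Moby-Dick#thing")
print("requested from the server:", base)
print("kept by the client:", fragment)
```

Because the server only ever receives the base URI, it has nothing to key a redirection on, which is the limitation described above.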

The 2005 TAG resolution for people who wanted to use separate non-hash URIs, such as [warning, made-up URIs]

  • http://en.wikipedia.org/wiki/Moby-Dick is the URI for the Wikipedia page
  • http://wikipedia.org/thing/Moby-Dick is the URI for the novel itself

was:

  1. if you get a 2XX response when you request a URI, that URI refers to a document (the document that you get back)
  2. if you get a 303 response when you request a URI, that URI could refer to anything, and the resource you get by following the redirection describes that thing (hence if a URI should refer to something that isn’t on the web then requests to it should respond with a 303)
  3. if you get a 4XX response when you request a URI, that URI could represent anything

This leads to the 303 pattern described for example within Cool URIs for the Semantic Web; in the example here, the response to http://wikipedia.org/thing/Moby-Dick would be a 303 redirection to http://en.wikipedia.org/wiki/Moby-Dick.
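The three rules can be written down as a small function. This is only a sketch of how a client might apply them: the function name, the returned labels and the handling of status codes that the resolution does not mention are all assumptions made for illustration.

```python
def interpret_response(uri, status, location=None):
    """Apply the three rules of the 2005 TAG resolution to one HTTP response.

    Returns a pair (what the URI refers to, where a description can be found).
    The labels "document", "anything" and "unknown" are illustrative only.
    """
    if 200 <= status < 300:
        # Rule 1: the URI refers to the document that came back.
        return ("document", uri)
    if status == 303:
        # Rule 2: the URI could refer to anything; the redirect target describes it.
        return ("anything", location)
    if 400 <= status < 500:
        # Rule 3: the URI could represent anything, and nothing describes it.
        return ("anything", None)
    # Assumption: other codes are not covered by the resolution.
    return ("unknown", None)


print(interpret_response("http://wikipedia.org/thing/Moby-Dick", 303,
                         "http://en.wikipedia.org/wiki/Moby-Dick"))
print(interpret_response("http://en.wikipedia.org/wiki/Moby-Dick", 200))
```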

Six years later, we have a lot of experience with this technique of distinguishing between things that are or are not on the web, and it has a bunch of practical limitations.

  • it requires access to web server configuration (to add 303 redirections), which makes life difficult for people without that level of access
  • URIs for things that aren’t on the web always require two round-trips to get hold of information, as the first always responds with a 303 redirection, which adds server load and slows things down (this is made worse as 303 responses can’t be cached — an oversight in the HTTP spec that I gather is fixed in HTTPbis)
  • using the 303 pattern requires a level of knowledge and understanding that is beyond most web developers, particularly if they get no benefit from taking care over their use of URIs (for example, Facebook, schema.org and so on all encourage the use of URIs for non-web things without a word about 303 redirections)
  • even people who do have this knowledge and understanding sometimes find it hard to work out whether a particular thing that they want to talk about is a thing-on-the-web or not and therefore whether the use of a 303 redirection is required
  • even people who do try to take care in their use of URIs easily make mistakes because we interact with URIs by copy-and-pasting them from browser address bars, and the only URIs that appear there are URIs for things on the web

Basically, while the web architectural principles behind the use of 303 redirections are (arguably!) sound, the collective experience of the past six years indicates that many publishers will not use it because they don’t know to, because they don’t care to, because they make mistakes or because they simply can’t while meeting the other practical constraints of their project.

A number of other approaches have been suggested, before and after the TAG decision, many of which are documented within the draft TAG finding Providing and discovering definitions of URIs.

The first observation that I want to make is that many of the objections to the 303 pattern are about the practicalities of publishers using it. Therefore, any suggestions to provide an alternative technique that involves

  • introducing new URI schemes (eg tdb)
  • introducing new HTTP methods (eg MGET)
  • introducing new HTTP status codes (eg 209)
  • using particular HTTP headers (eg Link or Content-Location or other specialist headers)

are not going to be widely used for exactly the same reason. I’m not at all persuaded that it’s worth spending time developing them.

My second observation is that there are three questions that are being conflated and we might make more progress if we separated them:

  • Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?
  • How can you tell what a reference to a particular URI within a piece of data (eg an RDF statement) means?
  • How can you get from a URI to information about whatever that URI refers to?

Ambiguity in URIs

Both the hash URI pattern and the 303 pattern make the assumption that you need to have separate URIs for things that are not on the web (eg books) and documents on the web about them (eg pages about books). This is useful because it enables people to make separate statements about the author of a book:

<http://wikipedia.org/thing/Moby-Dick> 
  dct:creator <http://wikipedia.org/thing/Herman_Melville> ;
  .

from the authors of the Wikipedia page about that book:

<http://en.wikipedia.org/wiki/Moby-Dick>
  dct:creator 
    <http://wikipedia.org/user/Aristophanes68> ,
    <http://wikipedia.org/user/SporkBot> ,
    <http://wikipedia.org/user/Curb_Chain> ,
    ...
  .

If we only have the URI http://en.wikipedia.org/wiki/Moby-Dick then we run into difficulties interpreting statements made about that URI, and indeed different people might use the URI in different ways, or make some statements that use the URI to mean the novel and some to mean the Wikipedia page.

So there are good reasons to have two separate URIs in these cases.

But the fact is that many publishers currently have a one-URI-fits-all policy. And even if they don’t, people reusing those URIs will often make mistakes and use the wrong one. It would be nice if we could make the world see that this leads to all sorts of logical problems for the Semantic Web, but I just can’t see that happening.

This situation reminds me of one of the central innovations that the web had over previous hypertext systems. There is a great slide by Dan Connolly which roughly looks like:

                    Web              Semantic Web
Traditional Design  hypertext        logic/database
+                   URIs             URIs
-                   link integrity   ?
=                   viral growth     viral growth
Are there parts of traditional logic and databases that, if we set them aside, will result in viral growth of the Semantic Web?

(By the way, in case my replication of this slide is interpreted incorrectly: I’m certainly not implying that viral growth of the Semantic Web is an end in itself, though I would like to see viral growth in data sharing.)

Dropping the requirement for link integrity, coping with the fact that sometimes links would break, was what made the web work. It would have been simply impossible to build the web as a decentralised system if there had been a requirement for links to always work.

Of course that doesn’t mean that we like it when links get broken. There’s oodles of best practice advice out there on making sure that you retain support for old URIs if you change your web space; we have backup systems in place in the form of web archives so we can work out what was once at the end of a particular URI; and the resolvability of links is something a linter will check about your website.

So it’s not that when he developed the web TimBL rejected entirely the very concept of link integrity, it’s that he recognised that we have to work with the imperfection of the real world. Links break. HTTP copes. Browsers cope. People cope.

The imperfection of the real world as it applies to linked data is that URIs will be used in ambiguous ways. We might not like it; we might write best practice documents that encourage people to have separate URIs for web-thing and non-web-thing, develop tools that help people detect when they’ve used the wrong URI, and so on. But it will still happen, and in my opinion we need to work out how to cope.

In fact, ambiguity in URIs goes much further than just a confusion between the Wikipedia page about Moby Dick and the novel Moby Dick itself. URIs are names, and names are used by different people to mean different things. The same URI might end up meaning:

  • the Wikipedia page about Moby Dick
  • the novel Moby Dick
  • the whale Moby Dick
  • the story Moby Dick (originally a novel but later adapted as a film)
  • and so on

Even if the publisher provides a clear and unambiguous definition about what the URI http://en.wikipedia.org/wiki/Moby-Dick means, other people will use it to mean something different because it’s close enough for what they want to say.

So I think the answer to the first question I posed — “Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?” — has to be “No, though it is good practice to.” We can fight against ambiguity, but we have to accept that we cannot win.

Disambiguating Statements

As discussed above, in a perfect world, we would have separate URIs for things-on-the-web and non-web-things and any data that we published about Moby Dick would use the URI for the Wikipedia page to talk about things like the licence for that information, or how the information was created (its provenance), and the URI for the novel to talk about things like the licence for the novel and what characters appeared in it.

But the world is not perfect, and we are going to end up with situations where the same URI is used to refer to a whole range of different things. How do we cope?

Well, first let me say that I don’t see people merging data together willy-nilly and hoping to get something useful out of it. URIs give us connection points and RDF gives us a flexible data model, which means that merging data can be easier than the kinds of custom merging that you have to do with CSV and JSON, but I don’t think it can ever remove entirely the requirement for curation. We want to ensure that the need for intervention in merging two datasets is kept to a minimum, but we can’t expect it to be entirely removed.

So with that in mind, there are at least three techniques that can be used to get useful data out of a world in which the same URI is used to mean different things.

One-Step-Removed Properties

The first technique is to interpret particular properties as describing a one-or-more-step-removed relationship between a resource and a value. For example, the bib:author and dct:creator properties would be defined such that the RDF statements

<http://en.wikipedia.org/wiki/Moby-Dick>
  bib:author <http://en.wikipedia.org/wiki/Herman_Melville> ;
  dct:creator <http://en.wikipedia.org/wiki/User:Aristophanes68> ;
  .

would be interpreted as saying

The topic of the page http://en.wikipedia.org/wiki/Moby-Dick was authored by the topic of the page http://en.wikipedia.org/wiki/Herman_Melville. The creator of the page http://en.wikipedia.org/wiki/Moby-Dick is the topic of the page http://en.wikipedia.org/wiki/User:Aristophanes68.

The biggest problem with the global application of this approach is that there are a lot of existing properties defined in vocabularies such as FOAF or Dublin Core that aren’t defined as one-step-removed properties. One publisher might use dct:creator to link to “a page describing the creator of this page” and another might use it to point directly to a (non-web-thing) URI for the creator of the page. So practically, this approach requires the interpretation of properties to be done on a dataset-by-dataset basis. Which leads onto the next approach.
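To make the dataset-by-dataset interpretation concrete, here is a rough sketch in Python. The ONE_STEP_REMOVED table and the function names are hypothetical; the table stands for whatever a consumer decides about how one particular dataset uses each property.

```python
# Hypothetical per-dataset configuration: how to read the subject and object
# of each property. "topic" means "the topic of the page", "page" means the
# page itself; anything not listed is read as-is.
ONE_STEP_REMOVED = {
    "bib:author": ("topic", "topic"),
    "dct:creator": ("page", "topic"),
}


def read_as(role, uri):
    """Spell out what a URI is taken to mean in a given role."""
    if role == "topic":
        return "the topic of the page " + uri
    if role == "page":
        return "the page " + uri
    return uri


def interpret(subject, prop, obj):
    """Return the statement with subject and object read one step removed."""
    s_role, o_role = ONE_STEP_REMOVED.get(prop, ("as-is", "as-is"))
    return (read_as(s_role, subject), prop, read_as(o_role, obj))


print(interpret("http://en.wikipedia.org/wiki/Moby-Dick", "bib:author",
                "http://en.wikipedia.org/wiki/Herman_Melville"))
print(interpret("http://en.wikipedia.org/wiki/Moby-Dick", "dct:creator",
                "http://en.wikipedia.org/wiki/User:Aristophanes68"))
```

A different dataset that uses dct:creator to point directly at a non-web-thing URI would need a different table, which is exactly the cost described above.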

Named Graphs

A second technique would be to make the assumption that within a single dataset, a single URI has a single meaning, but that the meaning may differ between datasets. I suspect that this is true even when publishers attempt to take care about which URI they use, because, like names, the meaning of a URI is slightly different depending on its use.

Re-users of data need to work out whether the way URIs are used in one dataset is close enough to the way they are used in another dataset, to ascertain whether it’s appropriate to simply merge the datasets or whether something slightly more complicated needs to be done to bring the datasets together.

The problem with this approach is that it raises the barrier to joining together graphs: you can’t just bung the data into a triplestore and perform queries on it, you have to work out some kind of mapping between the datasets up front.
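As an illustration of what that up-front mapping might involve, here is a minimal Python sketch. The dataset names ("library", "webstats"), the ex:pageViews property and the merge function are invented for the example; triples are plain tuples rather than a real triplestore.

```python
PAGE = "http://en.wikipedia.org/wiki/Moby-Dick"
NOVEL = "http://wikipedia.org/thing/Moby-Dick"  # made-up URI, as earlier in the post

# Two hypothetical datasets that use the same URI with different meanings.
datasets = {
    "library": [(PAGE, "bib:author", "http://en.wikipedia.org/wiki/Herman_Melville")],
    "webstats": [(PAGE, "ex:pageViews", "12345")],
}

# The mapping a re-user has to work out: in "library" the URI means the novel.
mappings = {
    "library": {PAGE: NOVEL},
}


def merge(datasets, mappings):
    """Merge triples from several datasets, rewriting URIs per dataset first."""
    merged = set()
    for name, triples in datasets.items():
        rename = mappings.get(name, {})
        for s, p, o in triples:
            merged.add((rename.get(s, s), p, rename.get(o, o)))
    return merged


for triple in sorted(merge(datasets, mappings)):
    print(triple)
```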

Duck Typing

The final technique that I’ll talk about here is to say that different applications need to access different properties, and can ignore any properties that don’t fit with how they want to use the data. It is relatively rarely useful to have generic RDF viewers; people (generally) build applications to answer questions and perform tasks, not to just browse around data.

For example, if a single dataset were to contain:

<http://en.wikipedia.org/wiki/Moby-Dick>
  a bib:Book ;
  bib:author <http://en.wikipedia.org/wiki/Herman_Melville> ;
  a foaf:Document ;
  dct:creator <http://en.wikipedia.org/wiki/User:Aristophanes68> ;
  .

then an application that was interested in gathering data about books would only care about the fact that http://en.wikipedia.org/wiki/Moby-Dick was a book with an author of http://en.wikipedia.org/wiki/Herman_Melville and wouldn’t care about the FOAF or Dublin Core classes or properties associated with the URI. An application that was interested in gathering information about the authorship of documents on the web, on the other hand, might look for the foaf:Document class and Dublin Core properties and ignore everything else.
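A sketch of the two applications, assuming the triples above have been loaded as plain Python tuples; the view function and its parameters are hypothetical names for illustration.

```python
MOBY = "http://en.wikipedia.org/wiki/Moby-Dick"
MELVILLE = "http://en.wikipedia.org/wiki/Herman_Melville"
USER = "http://en.wikipedia.org/wiki/User:Aristophanes68"

# The example dataset, with "a" written out as rdf:type.
TRIPLES = [
    (MOBY, "rdf:type", "bib:Book"),
    (MOBY, "bib:author", MELVILLE),
    (MOBY, "rdf:type", "foaf:Document"),
    (MOBY, "dct:creator", USER),
]


def view(triples, wanted_type, wanted_props):
    """Keep only the subjects of the wanted type and the properties the app cares about."""
    subjects = {s for s, p, o in triples if p == "rdf:type" and o == wanted_type}
    result = {}
    for s, p, o in triples:
        if s in subjects and p in wanted_props:
            result.setdefault(s, {}).setdefault(p, []).append(o)
    return result


books = view(TRIPLES, "bib:Book", {"bib:author"})          # the book application
documents = view(TRIPLES, "foaf:Document", {"dct:creator"})  # the document application
print(books)
print(documents)
```

Each application sees a consistent picture because it never looks at the statements that belong to the other reading of the URI.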

To me, this approach seems the most promising way of retaining the core benefits of RDF. It seems more robust in the face of user error than the idea of defining one-step-removed properties, and retains the ease of mashing together data from different sources in a way that you wouldn’t get if you had to think about the URI usage within each of the datasets that you want to bring together.

Locating Data From URIs

And so we get to the final question: how should people be able to get from a URI to information about whatever the URI refers to?

I’ve discussed above how I think distinguishing between things-on-the-web and non-web-things has to be seen as a best practice. I think we should continue to recommend the 303 or hash URI methods as the best practice for accessing data from a URI. My reason for this is that introducing yet another method will just make it harder for publishers to know which method to use when, plus I don’t want to see people who have adopted these techniques in good faith being told that they were doing the wrong thing all along. What I’d like to aim to do is to find a way of fitting these methods into a larger approach.

I also recognise the argument that articulating the relationships between on-the-web and not-on-the-web resources purely through HTTP responses isn’t ideal. It’s useful to have explicit links between resources within the data itself. Within the linked data work that I’ve done for data.gov.uk I’ve tried to adopt a pattern of explicitly using foaf:primaryTopic, foaf:primaryTopicOf and foaf:page to link together the different resources. Other people have suggested the wdrs:describedby property for pointers from a resource to information about that resource; rdfs:isDefinedBy performs a similar function for classes and properties within RDFS.

It would be nice to have one defined property or set of properties to describe these relationships, but we have to recognise that not everyone will use them, so the approach we take has to work when these links aren’t present. The majority of people and sites are going to start off by publishing data about something at a single URI, and simply return data about that thing (a 200 response) when the URI is requested. If they then progress to wanting to have separate URIs for that thing and the page about the thing, or indeed to disambiguate the URI that they’ve used in some other way, we need to make it easy for them to do so.

I think we need two properties: eg:describedBy and eg:couldBe. eg:describedBy describes the link between a resource (of any type) and a document that describes it; eg:couldBe is a disambiguation link that points from a URI to other possible, more precise, URIs.

Then I think we need some rules along the lines of (I don’t pretend these are entirely worked out):

  • if you get a 303 response redirecting to U' when you fetch a URI U then behave as if the response from U' included the triple U eg:describedBy U'
  • if the URI U is a hash URI whose base URI is U' then behave as if the response from U' included the triple U eg:describedBy U'
  • if you get a 2XX response in response to a URI U then:
    • if there are multiple triples that match the pattern U eg:describedBy ?page then assume that the document you have is U' where U' eg:couldBe any of the ?pages
    • otherwise, if there is a single triple that matches the pattern U eg:describedBy ?page then assume that the document that you have is ?page and it is about U (along with other things, possibly); statements about ?page might include information about the licence or provenance of the returned document
    • if there are any triples that match the pattern ?thing eg:describedBy U then assume that the document you have is U and it is about (possibly multiple) ?things
    • otherwise, behave as if there is a triple U eg:describedBy U; in this case, U is being used in an ambiguous way
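One possible reading of these rules, sketched in Python. Since the rules are not entirely worked out, the order in which the 2XX cases are checked, the function names and the shape of the returned dictionary are all assumptions; eg:describedBy is the placeholder property from above, and triples are plain tuples.

```python
from urllib.parse import urldefrag

DESCRIBED_BY = "eg:describedBy"


def implied_triple(uri, status=None, location=None):
    """First two rules: a hash URI or a 303 redirect implies a describedBy triple."""
    base, fragment = urldefrag(uri)
    if fragment:
        return (uri, DESCRIBED_BY, base)
    if status == 303 and location:
        return (uri, DESCRIBED_BY, location)
    return None


def interpret_2xx(uri, triples):
    """Third rule: work out what the document returned for `uri` is, and is about."""
    pages = [o for s, p, o in triples if s == uri and p == DESCRIBED_BY]
    things = [s for s, p, o in triples if o == uri and p == DESCRIBED_BY]
    if len(pages) > 1:
        # The document is some U' that could be any of the pages.
        return {"document": None, "could_be": pages, "about": [uri], "ambiguous": False}
    if len(pages) == 1:
        return {"document": pages[0], "could_be": [], "about": [uri], "ambiguous": False}
    if things:
        return {"document": uri, "could_be": [], "about": things, "ambiguous": False}
    # Fallback: behave as if U eg:describedBy U, so U is used ambiguously.
    return {"document": uri, "could_be": [], "about": [uri], "ambiguous": True}


PAGE = "http://en.wikipedia.org/wiki/Moby-Dick"
THING = "http://wikipedia.org/thing/Moby-Dick"
print(implied_triple(THING, 303, PAGE))
print(implied_triple(PAGE + "#thing"))
print(interpret_2xx(PAGE, [(THING, DESCRIBED_BY, PAGE)]))
print(interpret_2xx(PAGE, []))
```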

We could go further and say:

  • if there are two triples that match the pattern U eg:couldBe ?page . ?thing eg:describedBy ?page then assume that the document you have is ?page and it is about ?thing
  • if there are two triples that match the pattern U eg:couldBe ?thing . ?thing eg:describedBy ?page then assume that the document you have is ?page and it is about ?thing

This way, if someone starts off using U in an ambiguous way, or to mean only the page or only the thing, they can later add eg:describedBy and eg:couldBe statements to disambiguate and add information about the page or thing the page describes.

It’s worth bearing in mind that we shouldn’t just be concerned about locating information about things that aren’t on the web, but about things that are on the web but that cannot have metadata embedded within them. For example, how do we discover the licence associated with a particular image? Although there are methods of embedding metadata within image and other binary formats, such as XMP, it’s still useful to be able to locate metadata about images based on their URI.

With a scheme such as that described above, publishers that used content negotiation to return some data about the image in another format could use eg:describedBy to indicate that the returned document is about the image (or set of images in different formats).

Summary

The summary of my thinking is:

  • we should learn to cope with ambiguity in URIs
  • we should not constrain how applications manage that ambiguity, though duck typing seems the most promising approach to me
  • we should define some specific properties that can be used to disambiguate URIs, describe their defaults with 303s and hash URIs and provide an easy upgrade path as publishers choose to add more specificity

The key will be how we find practical ways to cope with the real, imperfect, fuzzy web of data while providing an evolutionary path to greater clarity and specificity that publishers can take when they see the benefit of doing so.

Comments

Re: What Do URIs Mean Anyway?

Disclaimer: This stuff fascinates me, but it’s not my area of expertise. ☺

So, technically “URI” encompasses both URNs and URLs. Is it possible to use both a URL and URN in tandem?

Re: What Do URIs Mean Anyway?

“we should learn to cope with ambiguity in URIs”

Yep. That about covers it. There is no plausible future in which the world at large uses URIs exclusively in unambiguous ways.

Re: What Do URIs Mean Anyway?

Hi Jeni,

Two points about this that I think weren’t discussed so far:

What happens if there is no web access and you are stuck with statements but no way to resolve any of them to determine their extra-triple-303-realworldness? Triples need to be self-explanatory, or at least not interfere with common sense.

Why is it so difficult to accept that there are, and always will be, more than one way of looking at the same thing without having to rename it each time? It is a shame somewhat that when people discover the technical difficulty here, it is based on the fact that intelligent reasoners are nothing more than extra triple generators. Their main purpose is to add triples that people possibly forgot to add themselves. The semantic in semantic web comes solely from users, not producers, in my opinion. No matter what you do as a publisher, you can’t stop someone removing a triple or two from your description if they disagree with it.

I learnt about the debate 6 years ago, and I have been working continually in the field ever since without following the crowd. I implemented 303 redirections the other day in case someone wanted them, but no one has been particularly interested so far; content negotiation works fine without them anyway! I once had an academic journal article reviewer whose comment was mostly that I had not implemented 303, and hence all of the things were pages, and nothing more, ever.

Flexibility is useful in the real world. It is a valid goal to create an ontology that describes a variety of ways to relate real-world things based on triple relationships, but it won’t affect the way most people deal with triples. One exception would be if they are in the very small group that deals with provenance as a research goal. Document processors will ignore the real world triples as necessary anyway, and real world scientists are proactive enough to delete extraneous document metadata before doing their work.

Theoretically, rdfs:range and rdfs:domain contain enough entropy to be used on their own to generate triple-specific annotations about the possible reasoned types for a subject or object URI, and then we could segregate duck typing into at least 3 hierarchical categories: triple-specific-reasoned-type, URI-reasoned-type, declared-type (possibly in that order).

Hopefully it attracts some new attention from other inventive, broad-minded, reasoning folks that have not been downtrodden so far by the 303 hash evangelists.

Re: What Do URIs Mean Anyway?

Hi Jeni, I’ve been reading the heated debates about httpRange-14. Thanks for this helpful summary of the problem. It seems to me though that the problem has arisen from the arguably strange decision to use http URIs to refer to real-world or abstract ‘things’ other than web-addressable documents. The use of the same http prefix for the ID of a real-world thing as for the protocol to use to retrieve the document about that thing is particularly odd.

What if I actually want to retrieve the thing itself? Maybe in the far future we’ll have teleportation and we’ll be able to do a 303 redirect to it, in which case, the decision to use http URIs will seem like a brilliant foresight after all.

Re: What Do URIs Mean Anyway?

Fortunately, in the future, we will still be using URIs to identify namespaces, so when the teleporter malfunctions (as they always do) and the 303 redirect returns both a duck and an evil duck, we will be able to perform disambiguation. Right?

Re: What Do URIs Mean Anyway?

We’d probably just get the evil duck: http://en.wikipedia.org/wiki/Mirror,_Mirror_(Star_Trek:_The_Original_Series). Our future Spock wouldn’t need to bother with a mind-meld, he’d be able to make use of evil duck’s rdfs:range and rdfs:domain.

Re: What Do URIs Mean Anyway?

I want to thank you; it is a really great design and great information.

Re: What Do URIs Mean Anyway?

Jeni, a nice and levelheaded summary.

Regarding the downsides of the 303 and hash URI approaches, I’d mention two more.

First, in existing HTML, hyperlinks are expressed with plain old document URLs. When adding RDFa markup to such pages, adding @rel="ex:worksFor" to a hyperlink is simple. Anything involving hash URIs or 303 URIs is going to be much more complicated; at the least it’s going to require @about="#this" and @resource="elsewhere.com/#that". This ruins an opportunity to get some really cheap structured data onto the web.

Second, in RESTful APIs that properly do “hypertext as the engine of application state”, we already have URIs for domain entities, and we have typed links between them, and sometimes we even already have content negotiation between XML and JSON. So adding RDF is surely just a question of adding a third representation format? No, because the RDF crowd insists that those existing URIs are really document URIs and not domain entity URIs as the REST developer thought, and that they ought to add hashes or other 303-responding URIs before they can add RDF representations to their resources. Again, it ruins an opportunity to get more RDF adoption.

I’m not quite sure that I understand your characterization of the “one-step-removed” approach. On the one hand it sounds exactly like what I’ve done previously with http://vocab.sindice.com/xfn . But on the other hand I don’t understand how the problem of inconsistent use of DC relates to that. Since DC isn’t defined as one-step-removed, it seems clear that DC statements would apply to the “page” and not the “thing”. On the other hand, doesn’t the duck typing proposal suffer from exactly this problem?

(I think what I advocate is two-class duck typing: Each URI can identify exactly two things, a web resource, and a duck.)

Regarding the proposed eg:describedBy logic:

  1. Make it the other way round, eg:about or eg:topic or something like that. It serializes more naturally.

  2. The logic only really works when you assume that the object of eg:describedBy is a web resource rather than any kind of document. I think it’s better to define it so that Ahab is not eg:describedBy the book “Moby Dick”, because things can only be eg:describedBy web resources.

Re: What Do URIs Mean Anyway?

I am interested in your style, and I see that any efforts and work that give any kind of help and useful information to people anywhere must be thanked and respected, whether small or big, from one person or a group; your information must also be thanked.

Re: What Do URIs Mean Anyway?

Hi Richard,

I agree on your points about 303s preventing adoption.

The one-step-removed properties are pretty much what you did with http://vocab.sindice.com/xfn. If I have:

<http://en.wikipedia.org/wiki/Moby_Dick> 
  dct:title "Moby-Dick - Wikipedia, the free encyclopedia"@en .

this tells you the title of the Wikipedia page. I can invent a new property called bib:title whose semantic is that it provides the title of the book that the page is about. Then I can do:

<http://en.wikipedia.org/wiki/Moby_Dick> bib:title "Moby Dick"@en .

and it gives me a property whose subject (in RDF terms) is the page but that actually provides information about the book. In the same way, the xfn:*-hyperlink properties are used in statements that give information that is actually about the people behind the pages that are their subject and object.

My point about Dublin Core and FOAF and most every other vocabulary was that the properties are not defined in this way. The domain of foaf:familyName is directly foaf:Person, not a-page-about-a-person. So if we want to use one-step-removed properties as a method of disambiguation, we have to invent whole swathes of mirror properties for each property in these vocabularies. (I think my mention of Dublin Core was confusing because the existing properties often do apply directly to pages; I was thinking in the context of Dublin Core properties about novels such as Moby Dick.)

I’m not claiming that duck typing is a perfect solution, particularly when it comes to pages about novels where Dublin Core properties can reasonably apply to either; I just think it’s the best chance that a consuming application has to identify data that it wants to use.

I think we will find URIs being used to mean more than two things; for example, http://www.surreycc.gov.uk/ might be used to mean Surrey County Council or the Surrey County Council website as a whole, or the specific home page.

I’m easy about eg:describedBy vs eg:about (I used eg:describedBy primarily because it was near to wdrs:describedby which various people had mentioned). I agree with your comments about its semantics.

Jeni

Re: What Do URIs Mean Anyway?

eg:describes would make things more explicit, and perhaps encourage people to mint URIs for the off-web things too

Re: What Do URIs Mean Anyway?

+1 to all this. Thanks for the clarifications.

I find it helpful when thinking about these discussions to pretend that Dublin Core doesn’t exist. It’s the worst case of ambiguity, because its properties are so widely applicable. My impression is that the vast majority of other terms are actually quite safe for duck typing. There are actually only so many things that one might want to say about web resources; most properties apply only to other kinds of things. I guess this is an argument that supports the duck typing approach.

Re: What Do URIs Mean Anyway?

Yes, another comment.

You identify 3 questions that are commonly conflated. I thought I would try to answer them from my back to basics point of view:

  • Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?
    Publishers should provide separate URIs for anything they wish to write metadata about or refer to individually.
  • How can you tell what a reference to a particular URI within a piece of data (eg an RDF statement) means?
    The URI always refers to the resource, never to the representation that you receive.
  • How can you get from a URI to information about whatever that URI refers to?
    Just dereference the URI and inspect the HTTP response that you get.

I am working on a longer blog post about ambiguity but this comment didn’t fit within that post.

Re: What Do URIs Mean Anyway?

I picked up this thread as John Erickson had referenced our late ’90s “indecs” work. Excellent summary Jeni.

Ian - on a detailed point I think your second statement isn’t strictly true: “The URI always refers to the resource, never to the representation that you receive.” Just as there are reasons (for example, for rights management and curation) for identifying local physical copies, there can be analogous reasons for referencing local digital copies. Of course this is not the norm, and I can’t give you a specific example offhand, but I would avoid making your statement absolute if you are re-blogging on the issue. People can denote anything they want with a URI, and one of my own observed principles is that any kind of thing that can happen in metadata, does.

Re: What Do URIs Mean Anyway?

As the others have said, a great exposition of the issue, Jeni!

I feel like some of this has been discussed before --- as in way before, during the <indecs> project, ca. 1998-2000. See esp. Godfrey Rust and Mark Bide. The <indecs> metadata framework: Principles, model and data dictionary (June 2000).

The <indecs> vocabulary focuses on describing the creation, use and exchange of "stuff" --- specifically creative works. It acknowledges that some entities --- e.g. works --- exist only in the abstract, that entities will likely have multiple forms that may need to be disambiguated, and especially that some entities will have rich and varied relationships with other entities (e.g. multiple manifestations of works, events that affect works, etc.) in the intellectual property domain.

<indecs> (I think wisely) avoids addressing the relationship between the syntax of identifiers; although Mark and Godfrey were active at the time in the DOI universe, their models anticipated the more URI-oriented world of RDF and have since been "ported" there. Since publishers were all over the map at the time about whether identifiers should be "smart" or "dumb," I'm sure Mark and Godfrey were ambivalent on the issue. ;)

Unfortunately, the indecs.org website is currently down, but the DOI website has a useful fact sheet page which links to the relevant documentation.

Re: What Do URIs Mean Anyway?

Mark and I are very much still active in the DOI universe, and also with RDF and linked data these days, as these worlds are starting to interoperate - DOI agencies are just beginning to create linked data, and we are looking at issuing a set of properties in the doi: namespace for dealing with some of the relationship issues which Jeni highlights.

Jeni later references the FRBR work, which mirrors indecs in most of its basic structures.

An observation about the wider context: first, many of the identification issues begin pre-URI, in the other standard IDs (like ISBN, ISRC, DOI and emerging ones like ISTC and ISNI), or in proprietary internal IDs issued within corporate databases. Larger publishers are having to wrestle with the relationships issues, which in some areas are nightmarish because of the explosion of relationships brought about by digital representation and granularity. In Jeni and others’ comments only a few are mentioned - it can get extremely complicated - an image of a painting of Herman Melville in a chapter of a digital representation of a physical edition of a book about a particular edition of the book “Moby Dick” is a commonplace set of relationships now - and for a publisher there may be rights management issues relating to any of those distinct items, so in principle they may all need to be identified at some point in their processes. There will never be universal or fully competent resolutions to these issues, but large publishers are having to take some steps now, for their own survival and to enable re-purposing of content. One of the things that is likely to emerge from that is some more robust semantics for relationships. My underlying point here is that ambiguity often begins with non-URI identifiers, and is then exported in URIs to the web, and by that stage the damage is done and it is normally too late to disambiguate with web tools. The suggestions being proposed in this thread would help, but they are part of a bigger identity issue.

Re: What Do URIs Mean Anyway?

I like the duck typing approach. It seems saner than anything else I’ve seen.

One thing I never quite figured out is why the recommendation was to use 303s (which are hard to configure for most people) instead of an application/real-world-thing media type. Yes, it’s a hack, but at least people can most of the time set that up much more easily (especially if it’s IETF registered and therefore gets shipped in standard server configurations). There is such a thing as excessive HTTP orthodoxy.

Re: What Do URIs Mean Anyway?

Can I nitpick a little in the spirit of shaping this into something that could be used to define a way forward on this issue?

In your paraphrasing of the httprange decision you say: “if you get a 303 response when you request a URI, that URI could refer to anything, and the resource you get by following the redirection describes that thing”. The actual resolution says nothing about the relationship between the original resource and the one you are redirected to. Also it introduces the terminology of “description” which I think is problematic (when is a representation a description? when is it a definition? is a definition a description? etc.)

You use the term “on the web” but don’t define it. I’ve seen this phrase used before and generally it’s felt to be synonymous with information resource. However information resource was a term invented purely for the httprange decision, i.e. to name the class of things that respond with a 200 status code.

Here’s my thought experiment: what would your post look like if you avoided using the terms “description”, “on the web” and “information resource” or their synonyms?

Re: What Do URIs Mean Anyway?

Great post, as always Jeni. You captured the issue brilliantly clearly and your separation of the three questions is spot on.

In terms of options for handling ambiguity I would add “use the type statements, trust those you get by dereferencing the URI”, which I think is a key part of IanD’s “Back to basics”. So if (made up URIs) U = http://www.epimorphics.com/thing/epimorphics returns RDF saying U rdf:type org:Organization then I know U is not a web page (disjoint with foaf:Document which I take to include web pages). If there’s no 303 and I haven’t included any eg:describedBy then I have simply given you no way to talk about the page separately from the organization. Tough. This seems cleaner to me, and simpler to handle, than trying to treat this as a punned case.
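
A minimal sketch of that rule, assuming triples are plain Python tuples of prefixed names and that the set of classes disjoint with foaf:Document is supplied by hand (here just org:Organization, as in the example above). The function name is hypothetical and no real RDF library is involved.

```python
# Assumption for illustration: classes we take to be disjoint with foaf:Document.
DISJOINT_WITH_DOCUMENT = {"org:Organization"}


def denotes_web_page(uri, triples):
    """Decide from the publisher's own type statements whether `uri` is a web page.

    Returns False if a type disjoint with foaf:Document is asserted,
    True if foaf:Document is asserted, and None if nothing relevant is said.
    Contradictory typing is not handled in this sketch.
    """
    types = {o for s, p, o in triples if s == uri and p == "rdf:type"}
    if types & DISJOINT_WITH_DOCUMENT:
        return False
    if "foaf:Document" in types:
        return True
    return None


U = "http://www.epimorphics.com/thing/epimorphics"
data = [(U, "rdf:type", "org:Organization")]
print(denotes_web_page(U, data))
```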

It would be useful to have agreed classes for “thing on the web” and for “thing not on the web” so that people who want to state that their URI is one or the other can do so.

Dave

Re: What Do URIs Mean Anyway?

Hi Dave, this approach—a URI is not a web page if the publisher says it’s an organization—is the worst of all.

It flatly denies reality. The Epimorphics URI you describe is a web resource, as evidenced by the fact that it can be opened in a web browser. Why would an rdf:type statement change that?

You are saying that when an RDFa triple is added to a web page, then it’s suddenly no longer a web page but now is a company? And the web page that I can still open in my browser under the same URI just like before actually has ceased to have a URI, and disappeared from the web? The web page can no longer be talked about, despite the fact I can still look at it exactly like before?

This doesn’t make any sense to me. I’d take ambiguity over that approach any time…

Re: What Do URIs Mean Anyway?

Hi Richard,

No I’m not saying that the page ceases to have a URI and disappears from the web! I’m talking about what the URI denotes. What its interpretation is.

Just as we’ve been saying that, say, http://dbpedia.org/resource/Berlin denotes the concept of Berlin not a web page. If I point a web browser at it I do seem to get a page of info. If I’m a specialist I notice the redirection going on under the covers but as far as most people can see you do a GET on it and some bytes come back. We currently wouldn’t expect to also say that http://dbpedia.org/resource/Berlin is a foaf:Document (w3c:InformationResource or whatever). Of course someone might say that and then we would then deal with the ambiguity somehow but at the moment we don’t force everyone to say that. The question here is supposing dbpedia didn’t do the 303 redirection, can we still just go on treating http://dbpedia.org/resource/Berlin as denoting a dbpedia-owl:City without any fuss or do we also now have to treat it as if it were also a foaf:Document? As it happens that page has no foaf:primaryTopic or x:describedBy links to http://dbpedia.org/page/Berlin so apart from the network round trip I don’t see any use of the page URI and don’t see what would get broken if it didn’t exist.

Dave

Re: What Do URIs Mean Anyway?

Dave,

“No I’m not saying that the page ceases to have a URI and disappears from the web!”

Yes that is exactly what you are saying. You are saying that a web page stops having a URI when I add @typeof=”foaf:Organization” to its HTML markup. Something that doesn’t have a URI isn’t on the Web.

The 303 thing is a hack. Yes you can point your browser to dbpedia.org/resource/Berlin and get a page, but that relies on the fact that the page has its own separate URI. The page still has a URI, unlike in your proposal. The empirical fact that non-experts get confused about these two separate URIs doesn’t mean that we can do away with the web page URI; it means that we should do away with the separation.

Here’s what would get broken. Let’s say I add microdata markup to a web page saying that it’s actually a duck. A microdata-capable client would correctly recognize it as a duck. An RDFa-capable client would not recognize that typing, and hence would be licensed to infer that it’s a web page (just a normal web page, like those billions of others). And I presume you still buy into the belief that web pages and ducks are disjoint. (The duck typing and one-step-removed proposals don’t have this problem because the URI would still identify a web page even if it also identifies a duck through some punning mechanism. Some agents may not recognize it as a duck, but that doesn’t yield a contradiction.)

I like to think of the web as a universal medium for exchanging representations. RDF is just one format among many. Web architecture has to work regardless of whether you understand RDF triples, or a particular RDF syntax. RDF conventions must not contradict web architecture, and cannot override it. I don’t think that your proposal can be reconciled with these views.

A chair is still a chair, even if you write on it in Spanish that it’s a duck. That’s just common sense. Some people might find it useful for some application to treat chairs with “pato” written on them as ducks. That’s fine, as long as they don’t ask the world to globally buy into that.

Re: What Do URIs Mean Anyway?

Thinking more about this, our opinions might not be as far apart as I initially thought. What I’m saying is:

“If it quacks like a duck, treat it as a duck. If it smells like coffee, treat it as coffee. Treating the same thing as a duck and as coffee is kinda weird, but it’s ok.”

What you are saying is:

“If it smells like coffee, treat it as coffee. Why the heck do you keep going on about ducks?”

I guess that’s because I’m trying to make sense of this thing on the scale of the entire web, while you’re more interested in a local convention that works for a particular dataset and where one doesn’t have to be concerned about, say, non-RDF documents.

Re: What Do URIs Mean Anyway?

Jeni, I agree with Dave that this is a great post and one of the best expositions of the problem I’ve seen. I’m largely in agreement with you and I think Dave’s comment addresses the missing part for me: trust what the publisher says at that URI and rely on our tools to detect inconsistency.

Dave, someone just needs to mint an InformationResource class and its complement, both subclasses of rdfs:Resource. Any of us could do it, but my feeling has been that it would be better in w3.org uri space.

On a more philosophical point, I don’t think the constraint to relax on the semweb is ambiguity. I think the thing we should/are relaxing is consistency, i.e. we don’t need to wait for the entire world to agree in order to do useful things. We tolerate inconsistency by either rejecting data we don’t like (safe in the knowledge that the web is large enough that we will find more data later) or by relaxing our other constraints (switching schemas, adding local rules). That’s why I’m much more sanguine about owl:sameAs than I used to be. Yes most owl:sameAs relationships are inconsistent, but in practice it doesn’t matter that much (after all, what is identity really anyway, lots of ratholes and paradoxes in there that we should just turn away from, e.g. Ship of Theseus).

Re: What Do URIs Mean Anyway?

I found the following class that could be used, but it’s tied up in an ontology containing other related classes that I know nothing about:

http://www.w3.org/2006/gen/ont#InformationResource

Re: What Do URIs Mean Anyway?

Dave,

Yes, I meant (but didn’t state explicitly) type statements to be part of the input that you use when Duck Typing: if the data we get back when we dereference U says U a org:Organization then a consuming application can infer it’s an organisation and render information about it as an organisation. I’m also absolutely in favour of regarding the information that you get back when you request U to be held as more trustworthy than any other data you might find about U.

What we have to deal with is that there might also be a statement, even within the very same trustworthy page, that explicitly states U a foaf:Document; or U might have some other property that, through RDFS reasoning, implies that U a foaf:Document. I’m not saying it’s good practice, but it will happen, because publishers do not always follow good practice.

The question is what do we do about it? Which statement do we believe? Do we ignore all the data? Write angry mails to the publisher? We need to work out what we do in the face of these logical errors; we need a coping strategy.
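
Whatever the coping strategy ends up being, the first step is noticing the clash. A small sketch of that check, assuming triples are plain tuples of prefixed names and that the disjoint pairs are listed by hand (the single pair below follows Dave's example; the function name is made up):

```python
# Assumption for illustration: pairs of classes treated as disjoint.
DISJOINT_PAIRS = [frozenset({"org:Organization", "foaf:Document"})]


def type_conflicts(uri, triples):
    """Return the disjoint class pairs that are both asserted as types of `uri`."""
    types = {o for s, p, o in triples if s == uri and p == "rdf:type"}
    return sorted(tuple(sorted(pair)) for pair in DISJOINT_PAIRS if pair <= types)


U = "http://example.org/thing/acme"  # made-up URI
page = [
    (U, "rdf:type", "org:Organization"),
    (U, "rdf:type", "foaf:Document"),
]
print(type_conflicts(U, page))
```

What an application does with a non-empty result (drop one statement, drop both, flag the source) is exactly the open question.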

Jeni

Re: What Do URIs Mean Anyway?

Hi Jeni,

Sure, I wasn’t disagreeing that there will be ambiguous data and we have to deal with it. That’s true anyway, irrespective of httprange-14.

I was reading your rules in “Locating Data From URIs” as being stronger than that. The last rule says that unless I explicitly give you a second URI for the web page (via an explicit or implicit eg:describedBy) then you must treat U as being ambiguous and introduce “U eg:describedBy U”. Whereas I’m saying that if U says it is something which is known to be disjoint from a web page (true of org:Organization) then I don’t need to introduce the “U eg:describedBy U” punning statement. U denotes the organization, does not denote the web page, there is no U’ denoting the web page.

Dave

Re: What Do URIs Mean Anyway?

Dave,

I don’t think I’ve been clear enough (in my own mind or the post), so thanks for pulling this out.

The section about Locating Data from URIs is about understanding how the data that you get back when dereferencing the URI relates to the URI that you used to get it, and where (within that data or through dereferencing other URIs) you might find more information about related things. The only implication that you can get from a eg:describedBy statement is that the object of such a statement is (can be viewed as) something on the web. It doesn’t stop that object from being used to mean other things too (such as an organisation).

The section on Disambiguating Statements on the other hand, is all about the various tactics consumers might use to cope with the fact that URIs are ambiguous. One of the clues that you might use when Duck Typing is whether U is the subject of a eg:describedBy statement, but at that point it’s a clue on a par with any other.

Does that make sense?

Jeni

Re: What Do URIs Mean Anyway?

Jeni,

It does make sense and that was the way I was reading it, I’m just pushing back :)

You are saying that the U you dereference is always viewed as something-on-the-web, but is allowed to also denote a non-web-thing and then you deal with the ambiguity.

I’m saying that if U explicitly says it is a non-web-thing then it is not a something-on-the-web and I don’t have to “introduce” the U eg:describedBy U triple with its eg:describedBy rdfs:range eg:WebThing and consequent ambiguity. There simply is no URI which denotes some web page for U. I agree that it is possible that the dataset at U or other datasets I trust will still introduce ambiguity anyway.

If you really want to state the relation between the URI as a network address and the concept I’m using it to denote then you could use U eg:describedBy "U"^^xsd:anyURI

Does that make sense?

Dave

Re: What Do URIs Mean Anyway?

Dave,

Yes, that makes sense. You’re concerned that adding the U ex:describedBy U triple makes U ambiguous in cases where U is only used for, say, an organisation.

I like the consistency of always having a ex:describedBy relationship whatever happens when you fetch U. But what about if there was no range constraint on ex:describedBy — it would then simply be a relation that says that its object describes its subject with no implication about what the object might be.
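
To illustrate the difference the range constraint makes, here is a sketch of the standard RDFS range rule (if p rdfs:range C and s p o, then o rdf:type C) applied to the punning triple, with and without a range on ex:describedBy. Triples are plain tuples; ex:describedBy and ex:WebThing are the made-up terms from this thread.

```python
def apply_range_rule(triples):
    """Return the rdf:type triples entailed by the RDFS range rule."""
    ranges = [(s, o) for s, p, o in triples if p == "rdfs:range"]
    inferred = set()
    for prop, cls in ranges:
        for s, p, o in triples:
            if p == prop:
                # The object of a property with a declared range gets that type.
                inferred.add((o, "rdf:type", cls))
    return inferred


pun = ("U", "ex:describedBy", "U")
with_range = [pun, ("ex:describedBy", "rdfs:range", "ex:WebThing")]
without_range = [pun]

print(apply_range_rule(with_range))
print(apply_range_rule(without_range))
```

With the range declared, U is inferred to be an ex:WebThing; without it, the same triple entails nothing about what U is.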

Of course applications would still be free to use whatever rules they wanted to disambiguate if that was necessary, which might include using a ex:describedBy triple, but that’s up to them and what their users find useful.

Would that work for you?

Jeni

Re: What Do URIs Mean Anyway?

Jeni,

Yes, that would be OK. I would prefer to not auto-add an ex:describedBy triple when there is no indirection going on (or to use the xsd:anyURI literal option) but if there’s no range constraint (or moral equivalent) then it at least does no harm from my POV.

Dave

Re: What Do URIs Mean Anyway?

eg:describedBy rdfs:range eg:Fetchable. eg:describedBy rdfs:range eg:Representable. eg:describedBy rdfs:range eg:WebAccesibleResource.

Something along these lines.

Re: What Do URIs Mean Anyway?

+1

For a lot of cases I guess you’d be more interested in the ‘thing’ than the ‘document’. If you want to provide important metadata could you not just use named graphs and no 303s?

Punning is used a lot in OWL2, and I think it gives rise to a lot of confusion.

Re: What Do URIs Mean Anyway?

Nice work, as always, Jeni. What you’ve said highlights that the data sets on the web generally will have to be assumed to be open-world data, as opposed to closed-world. In fact, using your plan of ignoring properties one doesn’t care about, a single subject could have a lot of contradictory properties that would be impossible in a closed world system. Even just the ambiguities you talk about would be impossible.

So a lot of DL inferences will just not be possible. That’s good in some ways, because things will be faster, and truth maintenance will be much easier.

Re: What Do URIs Mean Anyway?

Interesting post - thanks! I’m glad, in particular, that you’ve identified the fact that many (I’d say most) publishers aren’t motivated to use a 303. This may in itself not be much of an impediment, as publishers in general are (unsurprisingly) not much concerned about the niceties of disambiguation.

As an aside, any search marketer worth their salt knows the difference between 301 and 302 redirects, but few have ever heard of a 303 (I hadn’t encountered it until I began mucking into the semantic web). Certainly none would ever use a 303 for an exposed link (that is a clickable, followed, link as specified by href) because that redirect would throttle the search engine value of a link. Though SEO does concern itself with a type of disambiguation, namely canonicalization (having only one URI represent one web page) - achieved at the server level by employing 301s.

The big disadvantage to me of hash URIs is that this is precisely how internal anchors are marked up, and a hash URI may or may not refer to a page location.

Re: What Do URIs Mean Anyway?

Wow. Really, wow. Great post, Jeni! Very thought provoking for a young hacker like me.

This post raised a question for me. You talk about the relationship of a URI to a Wikipedia page about “Moby Dick” and a URI to ‘the thing’ “Moby Dick”.

What is the relationship between the following things: 1) A URI to a page about a photo on flickr 2) A URI to the .jpg file of that photo 3) A data URI (http://en.wikipedia.org/wiki/DataURIscheme) containing the data of that image. Especially 2 and 3 make me wonder.

Or, what is the relationship between a URI to ‘the thing’ “Moby Dick” and a data URI containing the entire contents of “Moby Dick”?

Re: What Do URIs Mean Anyway?

Hi @bengo,

I think that there isn’t an absolute answer to your question — it’s a modelling question and the relationships that you assert should depend on the kind of queries that you want to perform over the data.

For the flickr example, you could say that there is a photo (which should have its own URI) which is described by the HTML page about the photo and manifested in both the .jpg file and the data encoded within the data URI.

Similarly, a data URI containing the entire contents of ‘Moby Dick’ would be a manifestation of (a particular version of the novel) Moby Dick. You might want to look at FRBR if you’re interested in modelling the relationships between these kinds of resources.
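
To make the "manifestation" point concrete, here is a small sketch showing that a base64 data URI carries the same bytes as the file it was built from. The helper names are made up, and the four sample bytes merely stand in for the start of a JPEG file.

```python
import base64


def to_data_uri(data: bytes, media_type: str = "image/jpeg") -> str:
    """Wrap raw bytes in a base64-encoded data URI."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{media_type};base64,{payload}"


def from_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64-encoded data URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


jpeg_bytes = b"\xff\xd8\xff\xe0"  # placeholder bytes, not a complete image
uri = to_data_uri(jpeg_bytes)
print(uri)
print(from_data_uri(uri) == jpeg_bytes)
```

The bytes are identical either way; what differs is the URI you use to refer to them, which is why the relationship you assert between the two is a modelling decision.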

Cheers,

Jeni

Re: What Do URIs Mean Anyway?

You pose questions 1-3 about relations between URIs. You also speak about the things to which they refer (a page about a photo on flickr, the .jpg version of the photo, and (maybe) a page of image metadata). The relationships expressed in RDF are not between URIs (in fact making URIs themselves the subject of RDF statements is not straightforward) - the relationships are between the things to which URIs refer (or at least are intended to refer).

So a literal answer to your questions 1-3 is that in general there are no prescribed relationships between the URIs you mention (though it is common for publishers to follow syntactic patterns, one such being the ../id/.. -> ../doc/.. pattern used on data.gov.uk that Jeni mentions). And as to the relationships between the things to which those URIs refer… you stated them in framing the question.
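
As an illustration of such a syntactic pattern, here is a sketch of the ../id/.. -> ../doc/.. rewrite. The function name and the example.org URI are invented, and it only shows the string convention, not how any particular publisher implements it.

```python
def doc_uri_for(id_uri: str) -> str:
    """Map an identifier URI to its document URI using the /id/ -> /doc/ convention."""
    if "/id/" not in id_uri:
        raise ValueError("URI does not follow the /id/ pattern")
    # Replace only the first /id/ segment so later path segments are left alone.
    return id_uri.replace("/id/", "/doc/", 1)


print(doc_uri_for("http://example.org/id/school/123"))
```

A consumer should still not rely on the pattern alone; it is a publisher's convention rather than something the URIs themselves guarantee.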

I think the important thing to strive for (in the face of inevitable ambiguity) is “different URIs for different things” - people, places, events, webpages, documents etc. are all things.

I think that the deep question that people ‘disagree’ about, or at least a forcing function for the permathread, is whether an indicator is required to distinguish between URIs that reference things on and off the web, either by: a priori inspection of a URI; protocol interaction status codes; or explicit statements from a source that you are prepared to believe.

The first of these motivated an early position that hashless HTTP scheme URIs always referenced documents/web resources (aka IRs), which was set aside by the TAG resolution, to be replaced by an apparent need to be able to make a partial distinction on the basis of a protocol interaction (hence 200 => document/web resource). For others only explicit statements (that you’re prepared to believe!) are necessary if you really want to know.