Readers who have followed the TAG or public-lod mailing lists over the last couple of weeks cannot have failed to notice a large number of posts on a theme that recurs on roughly a 9-monthly cycle within these communities: httpRange-14.
The reason for this particular recurrence was a Call for Change Proposals on the resolution. The TAG meets on Monday, and discussion of this issue is one of the first items on our agenda. These are my thoughts going into that discussion.
The recent discussion on the lists has, I think, helped to refine the questions that lie at the core of the httpRange-14 issue. They are:
Knowing whether the response to a URI provides the content of the resource identified by that URI is important because when you have data about the thing identified by a URI, such as its author or the license that it is provided under, you need to know what information is actually being referred to so that you can tell what information you can reuse and whom you have to attribute.
For example, the GOV.UK website has a license at the bottom of each page:
<p>
Much of the information on this website is available for reuse under the
<a href="http://www.nationalarchives.gov.uk/doc/open-government-licence/"
rel="licence">Open Government Licence</a>
</p>
Seeing this, an application that knows the Open Government Licence enables free reuse can tell that it can lift content out of the page and use it on its own site. For example, an application could automatically scrape and republish the first paragraph of the news stories provided on this site, and on any others published under this license.
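As a rough illustration of the first step such an application would take, here is a minimal Python sketch that pulls license links out of a page. The class and function names are my own inventions for illustration, and accepting both the `licence` and `license` spellings of the link relation is an assumption rather than anything the page itself promises.

```python
from html.parser import HTMLParser


class LicenceLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> whose rel names a licence."""

    def __init__(self):
        super().__init__()
        self.licence_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel is a space-separated list of link types; accept both spellings.
        rel_values = (attributes.get("rel") or "").lower().split()
        has_licence_rel = "licence" in rel_values or "license" in rel_values
        if has_licence_rel and attributes.get("href"):
            self.licence_hrefs.append(attributes["href"])


def find_licences(html_text):
    """Return the licence URIs linked from the given HTML text."""
    parser = LicenceLinkParser()
    parser.feed(html_text)
    return parser.licence_hrefs


page = """
<p>
Much of the information on this website is available for reuse under the
<a href="http://www.nationalarchives.gov.uk/doc/open-government-licence/"
rel="licence">Open Government Licence</a>
</p>
"""
print(find_licences(page))
```

Note that this only finds the license link. It says nothing about which piece of content the license applies to, which is exactly the question at issue in the rest of this post.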
There are vocal disagreements, particularly about the first of the two questions I outlined above. What’s become clear to me is that the arguments stem from a difference in world view about what kind of resources are available on the web.
Under the web of data view, the web consists of data, and all the resources on the web are information resources, defined as those resources whose essential characteristics can be conveyed in a message. Data, in other words.
URIs can still be used to name other resources, which are not on the web either because they are not information resources (such as a Person) or because they are not available yet (such as unscanned books). Under this world view, however, giving a successful HTTP response for such a resource is simply wrong, because these resources aren’t on the web.
The problem that this world view therefore needs to address is how to create URIs to identify resources that aren’t on the web. There are two answers:
Hash URIs have the benefit that there is a direct relationship between the hash URI which identifies the resource and a resource on the web that describes it. An HTTP client naturally strips the fragment identifier from the URI in order to make the request to a server, which then delivers the description of the resource.
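To make the fragment-stripping step concrete, here is a small Python sketch using the standard library's `urllib.parse.urldefrag`. The URI is a made-up example, not one taken from any real dataset.

```python
from urllib.parse import urldefrag

# Hypothetical hash URI naming a person, who is not "on the web".
thing_uri = "http://example.org/people#alice"

# A client drops the fragment before making the HTTP request, so the
# server only ever sees the URI of the describing document.
document_uri, fragment = urldefrag(thing_uri)

print(document_uri)  # the URI actually requested over HTTP
print(fragment)      # kept client-side to pick the thing out of the description
```

The server never needs to know about the fragment at all, which is a large part of why hash URIs are cheap to publish.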
If you identify a resource that isn’t on the web using an HTTP URI that is not a hash URI, you cannot get a successful response back because the resource you have asked for is, by definition in this world view, not on the web. The workaround is for the publisher to use the 303 See Other status code to point from the resource that you requested to its description on the web. (This is the essence of the httpRange-14 resolution.)
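Seen from the client side, the web of data reading of a response can be sketched roughly as follows. This is a simplification of the httpRange-14 resolution as I understand it (a 2xx response means the resource is an information resource, a 303 means it could be anything, a 4xx tells you nothing about its nature); the function name and the labels it returns are hypothetical.

```python
def interpret_response(status, location=None):
    """Classify what a response tells a client about the requested URI.

    Returns a (label, description_uri) pair. The labels are illustrative
    only and are not defined by any specification.
    """
    if 200 <= status < 300:
        # The body is taken to be the content of an information resource.
        return ("information-resource", None)
    if status == 303:
        # The resource could be anything; Location points at a description.
        return ("see-description", location)
    # Anything else (for example a 404) says nothing about the resource.
    return ("unknown", None)


print(interpret_response(200))
print(interpret_response(303, "http://example.org/doc/alice"))
print(interpret_response(404))
```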
Under the web of things view, the web consists of things, and resources on the web could be anything: documents, people, films, teapots and so on. When a client makes an HTTP request for a resource, the response must reflect the state of the resource, but that state could be its content (if it’s an information resource) or it could be a description of the resource.
Under this world view, giving a successful HTTP response for a resource that isn’t an information resource is absolutely fine: the description of the resource is still a reflection of its state.
The problem that needs to be addressed when you have this world view becomes apparent when you think back to the licensing example above. Given an application knows that the resource identified by a given URI can be reused, how does it know whether the representation of that resource is reusable? It could be that the identified resource (for example an out-of-copyright book) has an open license, but that the representation of the resource holds only a description of that resource (some metadata about the book), and that description has a much more restrictive license. Or vice versa.
So to address this use case, you need some other mechanism to enable an application to tell that the representation is the content of the resource, rather than merely a description of it.
To generalise, the linked data community operates within the web of data world view and the larger web community operates within the web of things world view.
What is happening increasingly, however, is that these two world views are rubbing up against each other, and while both are internally coherent, switching between the world views causes not only a cognitive disconnect for developers but practical problems when transforming or moving data published under one world view into the other world view.
In addition, publication of data on the web through APIs is growing all the time, particularly REST APIs supporting the Hypermedia as the Engine of Application State (HATEOAS) principle. As we share more data on the web, and we use URIs in our APIs, the question of what those URIs mean, and how we associate licenses and provenance information with data, will only become more important.
We have an obligation, therefore, to reflect on the experience from the linked data community over the last few years and how that experience might spread to the larger web community.
Discussions within the linked data community over the httpRange-14 resolution centre on two problems that people have encountered:
A bit of a side-point here. I think that the questions I posed at the start of this post are general questions about web architecture, so it puzzles me that the only people who seem to really care about them, and who debate them endlessly, are the linked data community. This is partly because the linked data community use URIs extensively to identify the things about which they provide data, but I think it’s also about the fundamental attitude of those within the community, which was characterised in a recent post by Hugh Glaser:
Personally, I never did agree with the solution [to httpRange-14], but have always aimed to carry out the implications of it in the systems I construct.
This is for two reasons:
a) as a member of a small community, it is destructive to do otherwise;
b) as a professional engineer, my ethical obligations require me to do so.
It is this second, the ethical obligations that are the most significant.
I should not digress from the standards, or even Best Practice, in my work.
The linked data community is jam-packed with people who feel an ethical obligation to adhere to standards and best practices. We try to do what we are told is the Right Thing by individuals and standards organisations even when we don’t agree that it is the Right Thing and even if it turns out to be impractical.
In the larger web community, people who don’t agree with a standard or best practice, or who find it too impractical to implement, simply ignore it. There is no need to endlessly debate something that you can just ignore. And the httpRange-14 resolution is ignorable by the larger web community because so far it has had very little impact on any implementations at all, let alone widely-deployed implementations that work over the non-linked-data web.
Going into the TAG meeting about this on Monday, the main decision that I see is whether to continue to assume a web of data world view. In the web of data world view, it is impossible for a URI to return a description of a resource, whereas in the web of things world view it is fine. Personally, I would prefer to design around the web of things world view as I think this would ease some of the disconnects between linked data and the wider web, but there are others on the TAG who adhere strongly to the web of data view, so I think that change is unlikely.
If we stick with the web of data view, the main issues are how to alleviate the current practical difficulties that people are encountering with its implementation and explanation. I think there are three measures that would help:
Determine a conventional syntax for fragment identifiers that are used to identify things that are not on the web, as opposed to fragments of content. I’m thinking something like hash-bang URIs: using a character after the hash character that just gives a quick indication that the fragment identifier is being used in a special way, to refer to something that isn’t on the web rather than a fragment of a document, for example #*.
Change to recommending a single best practice of using hash URIs for resources that aren’t on the web, and in particular recommending having a one-to-one correspondence between resources on the web and those not on the web, using one particular conventional hash URI. For example, http://www.whitehouse.gov/#* would identify the resource that http://www.whitehouse.gov/ is about: the White House. This ensures that new publishers of data won’t run into the problems with publishing using 303 redirections, because they won’t use that method of publication. It also removes choice, which helps adopters who can otherwise get overwhelmed with options and the trade-offs between them.
Allow publishers who are currently using 303 redirections to publish descriptions of resources identified using non-hash URIs to switch to providing a representation using a 200 status code, along with a method of indicating that the representation is the description of the resource rather than its content. This indicator could be:
<link rel="describedby"> element in HTML)
If we did move to a web of things view, the main question would be how to provide an indicator that the representation of a particular resource is the content of that resource as opposed to being a description. It would help ease transition if this was a natural consequence of the current pattern of publication on httpRange-14-compliant sites, so for example, you’d want to consider the representation of a resource the content of the resource if you got to it:
as well as if there was an explicit indicator within the representation that said the resource was an information resource.
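Returning to the conventional hash URI suggested earlier for the web of data view, the one-to-one correspondence between a document and the thing it is about could be sketched like this. The `#*` marker is only the example floated in this post, not an agreed convention, and the function names are hypothetical.

```python
from urllib.parse import urldefrag

# Assumption: "*" is the conventional fragment; nothing has standardised this.
THING_FRAGMENT = "*"


def thing_uri_for(document_uri):
    """Return the URI for the thing that the given document is about."""
    base, _ = urldefrag(document_uri)
    return base + "#" + THING_FRAGMENT


def document_uri_for(thing_uri):
    """Return the URI of the document describing the given thing."""
    base, fragment = urldefrag(thing_uri)
    if fragment != THING_FRAGMENT:
        raise ValueError("not a conventional thing URI: " + thing_uri)
    return base


print(thing_uri_for("http://www.whitehouse.gov/"))
print(document_uri_for("http://www.whitehouse.gov/#*"))
```

Because the mapping is purely syntactic, a client can move between the two URIs without making any request at all.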
Whichever decisions are made, I would personally like to see the concrete requirements on client behaviour that arise from these different publication practices (for example, enabling a reuser to associate a license with a particular piece of content, or a crawler to create RDF statements about URIs encountered on the web), so that whatever is decided is brought down to earth and made less ignorable.
Comments
Re: Content and Descriptions of Web Resources
Disclaimer: I have absolutely no experience in the Linked Data community, so the input here may be pure nonsense :-)
If the linked data community insists on being able to differentiate between URIs of things and URIs that can be dereferenced and describe things by the use of magic strings in URIs - then why not simply create a new protocol identifier such that we don’t start overloading the existing HTTP protocol with stuff that should not be in it? Let’s call it “http-id”.
An application that sees “http-id://www.whitehouse.gov/” knows that this is an ID that has an online description on the web which will be placed at “http://www.whitehouse.gov/”.
It ain’t much different from using hashes, it does not introduce magic strings in existing protocols, and it avoids the problem of 303-redirects where you won’t be able to tell a “wifi login” redirect from a semantic redirect.
But besides that I would rather see link-rels used as described here: http://dret.typepad.com/dretblog/2012/04/describing-resources-the-web-way.html as this seems to fit the existing web architecture better.
/Jørn
Re: Content and Descriptions of Web Resources
In the course of designing a vocabulary service, we proposed another URI pattern which makes it very clear that the resource identified is a description of another resource:
e.g. http://def.seegrid.csiro.au/sissvoc/isc2010/resource?uri=http://resource.geosciml.org/classifier/ics/ischart/Ludlow
The description is by definition a document (or graph), while the resource being described could be from the internet of data, or from the internet of things. Furthermore, as the example shows, there is no requirement that the resource being described shares the domain of the resource doing the describing. I’ve written a short blog about it here:
http://thismodel.posterous.com/a-new-cool-uri-pattern
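A minimal Python sketch of this pattern, assuming the service simply appends the described resource's URI as a `uri` query parameter, as in the example above. The function names are hypothetical, and leaving `:` and `/` unescaped is a choice made only to reproduce the shape of that example; a stricter client might percent-encode the whole value.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Endpoint taken from the example URI in the comment above.
DESCRIPTION_ENDPOINT = "http://def.seegrid.csiro.au/sissvoc/isc2010/resource"


def description_uri_for(resource_uri, endpoint=DESCRIPTION_ENDPOINT):
    """Build the URI of the document describing resource_uri."""
    # Keep ":" and "/" readable so the result matches the example's shape.
    return endpoint + "?uri=" + quote(resource_uri, safe=":/")


def described_resource(description_uri):
    """Recover the described resource's URI, or None if there is none."""
    values = parse_qs(urlsplit(description_uri).query).get("uri")
    return values[0] if values else None


ludlow = "http://resource.geosciml.org/classifier/ics/ischart/Ludlow"
description = description_uri_for(ludlow)
print(description)
print(described_resource(description))
```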
Re: Content and Descriptions of Web Resources
Good post, thanks. However, the current draft of HTTP 1.1 allows 303s to be cached, but your point still stands, as many clients today still don’t.
http://tools.ietf.org/html/draft-ietf-httpbis-p2-semantics-19#section-7.3.4
Re: Content and Descriptions of Web Resources
Hi Jeni,
Great summary.
Agreed that the fundamental disconnect is between Web of Data and Web of Things view.
Your measure (2) for sticking with Web of Data would be rather unpalatable. Many ontologies are published as simple documents with hash URIs for the concepts. Not only is this an easy, low cost way to publish but as a consumer it’s really handy getting the whole ontology back, not just an isolated class definition. At least when the ontology is a manageable size. Yes you could split the ontology into a large bunch of different resources, give each an rdfs:isDefinedBy and also publish the aggregate; but that’s a lot more work for someone starting out on this road. Furthermore, as discussed on the LOD list, that would mean anyone using an ontology defined this way would need a separate namespace declaration for each property (otherwise you can’t serialize in RDF/XML). I know you know all this (!) but just wanted to emphasise that if the TAG goes this route it’ll need some get-out for vocabulary publishers, if not for other “just put up a document” publishers.
Dave
Re: Content and Descriptions of Web Resources
Hi Dave,
Oh I agree: the intention of 2. was not to be “everything must be published like this” but rather that “if you want to have separate documents for separate things, do it like this”. I think a recommended convention is better than a series of trade-offs and decisions for people who don’t want to get into all the details.
Jeni
Re: Content and Descriptions of Web Resources
as usual, a great discussion of the problem and the options! maybe one way of looking at options would be to take a step back and look at what web architecture allows, once you step away from linked data constraints. web architecture allows to identify resources by non-HTTP URIs, so if that makes sense for my application, i could identify books by (a fictional) isbn: URI scheme and in such a design, identifying a book by the URI isbn:1590593243 is perfectly permissible. if i want to describe that book somewhere, what i can do is create metadata about the book and link it to the book’s identifier by using a link relation. whether you use “describedby” or “describes” (or maybe that latter one could be “about”?) is just a matter of taste (i’d prefer the latter because the described resource does not necessarily need to know what it is described by, whereas the describing resource necessarily needs to know what it is describing), and “describedby” is already registered in http://www.iana.org/assignments/link-relations/link-relations.xml by POWDER. after that, REST and web architecture allow us to solve the problem in a pretty straightforward way, and both resource identifiers and description identifiers can happily use HTTP or non-HTTP URI schemes, because the conceptual connection is established through a well-defined link relation. we can also easily conceptualize discovery services which in essence would be nothing but linkbases providing access to managed sets of describedby/describes links. in such an approach, we don’t have to rely on URI or HTTP conventions, and all of our intentions are clearly expressed with fundamental building blocks of web architecture. 
using hash URIs or 303 redirects would just become conventions how to express describedby/describes links in environments with additional constraints (all URIs are HTTP URIs), but there would be a well-defined way of how those conventions could be translated to the RESTful world of typed links and opaque URIs. and whether such typed links would be communicated inline (embedded in described or description resources) or out-of-line (in HTTP Link: headers) would be an implementation choice and also one depending on the involved URI schemes and media types.
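To illustrate the out-of-line case, here is a rough Python sketch that reads "describedby" links from an HTTP Link header value. It is a deliberately simple parser and not a complete implementation of the Link header grammar (it would mishandle quoted parameter values containing commas or semicolons); the function names and example URIs are hypothetical.

```python
import re


def parse_link_header(value):
    """Parse a Link header value into (target, params) pairs.

    Simplified: assumes parameter values contain no "<", "," or ";".
    """
    links = []
    for match in re.finditer(r"<([^>]*)>([^<]*)", value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for part in raw_params.split(";"):
            name, sep, val = part.strip().rstrip(",").partition("=")
            if sep:
                params[name.strip().lower()] = val.strip().strip('"')
        links.append((target, params))
    return links


def descriptions_from(link_header):
    """Return the targets of all links whose rel includes describedby."""
    return [
        target
        for target, params in parse_link_header(link_header)
        if "describedby" in params.get("rel", "").lower().split()
    ]


header = (
    '<http://example.org/about/isbn-1590593243>; rel="describedby", '
    '<http://example.org/other>; rel="alternate"'
)
print(descriptions_from(header))
```

A client holding an identifier in any URI scheme could use links like these to find a description, without relying on hash URIs or 303 redirects.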
Re: Content and Descriptions of Web Resources
Hi Erik,
Yes, I’ve been gone for a long time. I go away for ~5 years and I see the debates and the suggested solutions are still the same, i.e. using non-dereferencable URIs in this case. Ah, I really hope not because it’s just so impractical to me, but unlike 5 years ago I don’t have the energy to debate at the same level as before. So I’ll just have to hope for the best. :)
-Mike
Re: Content and Descriptions of Web Resources
I’m having real trouble with the various proposals which have been discussed over the last couple of weeks, and that’s because the situation boils down to this:
Here’s the rub: the perceived problem is technical, so people are looking for a technical solution — but the actual problem isn’t technical, it's one of communication. People publishing data haven’t used fragments (which are easily the quickest and simplest way to do this disambiguation, even if some people — though I’ve never actually met them — deem them “ugly”) nor 303s (a solution which seems to exist purely to provide an alternative to those people who opted not to use fragments) because they don’t understand why they’re necessary.
It’s not even that the disambiguation problem is particularly difficult to explain to people coming at this fresh, it’s just that the linked data community — being a bunch of geeks focussed on linked data — has traditionally been pretty bad at explaining things in terms people outside of that community understand.
In my experience of talking to people about precisely this ambiguity, you can go from an understanding of “you use URIs to identify things, not just web pages” to “…and that’s how you differentiate between URIs for things and URLs of documents describing them” in about ten minutes with somebody of reasonable intelligence. Sure, it's easier face-to-face than it is written down, but this is a community which deeply understands web architecture and is highly motivated to see the widespread adoption of a web of data.
The amount of effort which has been poured into coming up with, and then discussing, yet another alternative way to differentiate URIs-and-URLs for people at large to still not see the point of is, to be honest, incredibly frustrating.
My recommendation is that “we” (imagine me hand-waving a bit, here) do this:
Re: Content and Descriptions of Web Resources
Hi Mo:
I agree that minting yet another syntax is probably not the best solution, and I agree that it's probably more of a social problem than a technical problem. But the web is a web of things because that's the only one of the two that can be assumed. Plus, to support a web of data there need to be three (3) states represented: 1.) Definitely a data resource, 2.) Definitely a thing resource, and 3.) Not specified, where the latter will probably continue to be 99% of the web. You can't assume perfection from a web publisher; that violates Postel's prescient law, and that's why XHTML was stillborn.
Anyway, to a current professional WordPress plugin developer (who is hopefully above average, but maybe I'm just being wishful), just what exactly are you proposing as a solution? Your solution was unclear to me.
-Mike
P.S. BTW, I spent about 6 months back in 2006 surfing the various lists related to WebArch and REST and reading everything else related I could find and, try as I might, I could never understand RDF, triples and all. Often I'm the guy explaining things to other people[1], but with RDF I've come up empty every time.
[1] http://wordpress.stackexchange.com/users/89/mikeschinkel
Re: Content and Descriptions of Web Resources
Hi Mike,
We already have those solutions, that’s the point. Use fragments. Use 303s. Use non-HTTP URIs if the capability to dereference them (e.g., uri.arpa) becomes widespread enough.
(Personally, I lean towards the former).
Coming up with new ways to do it just muddies the waters further when the problem is that people find them too damned murky already.
DNS
Thanks for this!
Just one thought for now. Hopefully a relatively fresh one.
A very common experience for modern Web users is to visit a Web site and get a non-authoritative response, because the requesting machine is being re-routed. This is typical with wifi in hotels, cafes, airports, stations etc. This is not a corner case, but big business happening countless times daily.
Are there any practices from that (rather different) situation that may help us untangle http-range-14?
If I go to http://www.jenitennison.com/ and I get given (without 30x redirection) a Wifi login page, …
…how are we to understand this? Not that it is guaranteed by webarch that this content is an authoritative rendering of http://www.jenitennison.com/ … even https: doesn’t guarantee that, for various fallible-human reasons.
Since someone fetching http://www.jenitennison.com/ has no way to know that intermediaries aren’t interfering, therefore we should be cautious in our inferences, and take other information into account. Perhaps some secure DNS tricks allow the owner of the associated domain name to pass additional information …?
In other words, Tabulator etc. is already overstepping the mark, if it assumes a 200 received from ‘the Web’ when sending a GET to http://www.jenitennison.com/ is enough to imply that the authoritative server is asserting something about what the URI denotes. It might as well be a wifi captive portal. Client code should be more skeptical.
I haven’t studied http://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions adequately yet. But I suspect it might be a safer place to find authoritative information than HTTP headers from an unverified HTTP transaction. In which case, perhaps we can consider this as an alternate channel of information from publishers to indicate e.g. domains whose 200 codes indicate successful descriptive responses rather than serializations?