Opaque URIs != Unreadable URIs

I’ve been talking about URIs a lot recently. One of the things that has bothered me about some of the conversations is the conflation of the concepts of “opaque URIs” and “non-human-readable URIs”. This is my argument for keeping the concepts separate.

The opacity of URIs is an important axiom in web architecture. It states that web applications must not try to pick apart URIs in order to work out information from them. Applications must not, for example, use the fact that a URI has .html at the end to infer that it resolves to an HTML document. It’s closely related to hypertext as the engine of application state, in that opaque URIs should not be generated by web applications either: they must be discovered through links and the submission of forms.
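To make the .html point concrete, here is a minimal sketch of a client deciding whether it has been given HTML. The function names media_type and is_html are my own, invented for illustration, and the headers are assumed to arrive as a plain dictionary. The decision rests on the Content-Type header alone and never on the URI.

```python
def media_type(headers):
    """Return the media type the server declared in Content-Type, or None."""
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP.
        if name.lower() == "content-type":
            # Drop parameters such as "; charset=utf-8".
            return value.split(";", 1)[0].strip().lower()
    return None


def is_html(uri, headers):
    """Decide whether a response is HTML without inspecting the URI."""
    # The uri argument is deliberately ignored: a trailing ".html",
    # or the lack of one, says nothing reliable about the response.
    return media_type(headers) == "text/html"


# A URI ending in .html that actually serves a PDF.
print(is_html("http://example.org/report.html",
              {"Content-Type": "application/pdf"}))
# A URI with no extension that serves HTML.
print(is_html("http://id.example.org/house/0aef0218",
              {"Content-Type": "text/html; charset=utf-8"}))
```

The first call reports that the response is not HTML despite the .html suffix, and the second reports that it is HTML despite there being no suffix at all.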

But this has nothing to do with readability or hackability, both of which are extremely important for human users. Readable URIs help human users understand something about the resource that the URI is pointing to. Hackable URIs (by which I mean ones that people might manipulate by altering or removing portions of the path or query) enable human users to locate other resources that they might be interested in.

Before I go further, a couple of caveats:

I am not saying that every URI must contain a natural language identifier. An example is the URI for a school, which could include:

  • the name of the school
  • the unique reference number for the school
  • the record number for the school in the database that is being published on the web

Using the name of the school, as I’ve discussed, is probably a bad idea because of its lack of longevity. Using the record number for the school within the particular database that’s being published is entirely non-human-readable because there is simply no way of finding out what that would be for a given school. The unique reference number for the school, on the other hand, may be an obscure series of digits, but it is a meaningful one which renders the URI readable and hackable.

There are also times when uniquely identifying a resource using natural identifiers within the URI leads to incredibly long and complex URIs, in which case the ‘human readable’ version isn’t actually human readable. Introducing non-human-readable components is then the only option.

Back to my argument:

Why should URIs support humans doing things that applications must not? Because humans are intelligent. When humans hack a URI, they are aware that they are making a guess, taking a chance and might or might not end up at something useful. If they get a 404, or even more importantly if they get to information about something that they weren’t expecting, they are intelligent enough to recognise that the chance they took didn’t pay off. Applications aren’t intelligent. They can’t tell the difference between a right guess and a wrong guess, so it’s best not to let them guess at all.

Let me give an example. Let’s say that I’m creating a URI for a particular house. Here are two possible URIs:

http://id.example.org/house/NG9_3HZ/4
http://id.example.org/house/0aef0218

The first is readable and hackable. A human could change the house number or the postcode. They could remove the house number and expect a list of houses within the postcode. The second is not readable or hackable: there is no way to know what you would get if you changed the identifier within the URI.
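For anyone who wants to see what that kind of hacking amounts to mechanically, here is a rough sketch of the "remove the last bit of the path" move. The helper name drop_last_segment is hypothetical. The result is only ever a guess: nothing in the code, or in the URI, guarantees that the shortened URI resolves to anything, which is exactly why this is something for a curious human to try rather than for an application to rely on.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_last_segment(uri):
    """Return the URI with its final path segment removed (a guess, not a fact)."""
    parts = urlsplit(uri)
    segments = parts.path.rstrip("/").split("/")
    parent_path = "/".join(segments[:-1]) or "/"
    # Query and fragment are discarded along with the last segment.
    return urlunsplit((parts.scheme, parts.netloc, parent_path, "", ""))


print(drop_last_segment("http://id.example.org/house/NG9_3HZ/4"))
print(drop_last_segment("http://id.example.org/house/0aef0218"))
```

With the first URI the guess is something a person can reason about (perhaps the houses in that postcode). With the second, all that is left is the bare /house path, and the identifier itself offered nothing to reason with.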

Now it is true that an application accessing a site that used URIs like the first could create those URIs programmatically whereas it couldn’t (perhaps) create a URI like the second. But if it did create the URIs programmatically it would be the fault of the application, not the fault of the URI.

As publishers, it is our responsibility to provide humans with URIs that are meaningful and hackable, and to provide applications with the means of creating or identifying these URIs through forms and links. But it is not our responsibility to prevent applications from doing things that they should not do by deliberately obfuscating our URIs.
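As a sketch of what "identifying these URIs through links" might look like from the application's side, the following reads the links out of a page and picks the ones it wants by their rel value, rather than building URIs itself. Everything specific here is an assumption made for the example: the page markup, the use of rel="item" and rel="up", and the names LinkCollector and discover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> or <link> element, grouped by rel value."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = {}  # rel token -> list of absolute URIs

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if href and rel:
            for token in rel.split():
                target = urljoin(self.base_uri, href)
                self.links.setdefault(token, []).append(target)


def discover(html, base_uri, rel):
    """Return the URIs the publisher has linked with the given rel value."""
    collector = LinkCollector(base_uri)
    collector.feed(html)
    return collector.links.get(rel, [])


# Hypothetical page listing the houses in a postcode.
page = (
    '<ul>'
    '<li><a rel="item" href="/house/NG9_3HZ/4">4</a></li>'
    '<li><a rel="item" href="/house/NG9_3HZ/6">6</a></li>'
    '</ul>'
    '<a rel="up" href="/house">All houses</a>'
)
print(discover(page, "http://id.example.org/house/NG9_3HZ", "item"))
```

The application ends up with the same readable URIs a human would see, but it got them from the publisher rather than from its own assumptions about how the path is put together. If the publisher later switched to URIs like the second form, this code would carry on working unchanged.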

Comments

Re: Opaque URIs != Unreadable URIs

Very interesting topic! Well, URLs should definitely be readable and hackable. But for URIs, I don't think that is necessary. For example, a person who is not familiar with the convention of the UK post code will not know that "NG9_3HZ" indicates a post code and "4" indicates a house number. For him or her, "NG9_3HZ/4" is no different from "0aef0218". However, "NG9_3HZ/4" really does give users a chance to hack it because it contains one more layer than "0aef0218". Otherwise, I can't see the difference between "NG9_3HZ/4" and "0aef0218/X".

Re: Opaque URIs != Unreadable URIs

Great observation and important issue! But I disagree with the school example. I think that a good URI (subject identifier) for a school should include a type/class identifier ‘school’, school name, city/…, state/province/… , and country. I think that it is OK for URI-based subject identifiers to evolve. We just need to accept it as a ‘way of life’ and build semantics-based applications accordingly. Subjects can have multiple identifiers (including current identifiers and deprecated identifiers). Of course, subject identifier evolution is totally different from providing subject identifiers (URIs) for subjects at a specific moment in time.

Re: Opaque URIs != Unreadable URIs

There’s no possible way of constructing URIs that are completely resistant to change. If you use things like postcodes, you run the risk of postcodes being reassigned. Even if you use random object identifiers, you will get questions about identity: if a comprehensive school is closed and a city academy opens in the same premises with a new name, you have to make a decision about whether it is the same school. And of course you can have an infant school and a junior school on the same site that for some purposes are regarded as one school and for other purposes as two. All I’m saying is: don’t imagine there will ever be a right answer to this question. In the end you have to be pragmatic.

Re: Opaque URIs != Unreadable URIs

The problem I find with treating URIs as opaque is that we have close to two decades of implementation conventions to overcome, or longer if you count the history of hierarchical file systems.

I suspect that the bulk of the Web development community does not understand that a URI does not necessarily map 1:1 to a file in a Web server's document root. Despite the fact that a Request-URI is considered opaque by the HTTP protocol, it is quickly given meaning by an HTTP server's core request handler, quicker than most would care to discern. Furthermore, the execution models of PHP, ASP, JSP, ColdFusion and server-side includes do little to reinforce this distinction. The fact that the Content-Type header supersedes the vestige of a file extension seems to be lost completely on the bulk of the population.

As for the URIs themselves, I have adopted a new policy: I use the form http://authori.ty/uuiduuid-uuid-uuid-uuid-uuiduuiduuid until I come up with a path structure I am comfortable with. It also helps with the Cool URI problem by preventing collisions and I can come in later with a 301 to tidy it up. I am also working on a dynamic faceted taxonomy based on set intersection to help generate URI paths.

Re: Opaque URIs != Unreadable URIs

Agreed, wholeheartedly! I think people confuse expecting “templating” URIs, possibly a bad, unRESTful thing, with beautiful, carefully architected, hackable, human-readable URIs which are definitely a good and useful thing. Tom Coates talked about this nicely, from a non-academic web architecture POV whilst confessing to being a URL Fetishist: http://www.plasticbag.org/archives/2006/02/myfutureofwebapps_slides/

Re: Opaque URIs != Unreadable URIs

I agree too.

It’s just the identifiers (or ‘slugs’) that have this debate about them though - the rest of the path is equally important. Even people who argue for non-human-readable IDs (for persistency) still mostly advocate having some words in the path to identify the type of thing. Otherwise every page on a website would have the same URL structure consisting of a random string.

It’s also not just about word-based ids vs alphanumeric ids. Human-readability applies to the latter, too. Shorter is better. A check digit can sometimes be helpful. ISBNs even have a structure to them (you can work out a publisher from just the ISBN - there are even books containing lookup tables of these which are used by bookshops). If it includes alpha characters, then do these include vowels? Are they case sensitive? These are all issues to consider.

Frankie