Versioning URIs


Yesterday I went along to a workshop on developing URI guidelines for the UK public sector. Because of the current drive to get more UK public sector information online, and the fact that we have Tim Berners-Lee on board, there’s a growing recognition of the fact that we need URIs for the real-world and conceptual things that we talk about in the public sector: schools, roads, hospitals, services, councils, and so on.

One of the particular points of contention at the meeting was whether URIs for non-information resources (ie for real-world and conceptual things) should contain dates or version numbers, or not.

Let’s get some of the argument out of the way first. We are not talking about documents here. Documents will almost always have multiple versions, and if you care at all about maintaining a historical record you will want to refer to the previous version of a document. So dates or version numbers within URIs that refer to documents are often a really good idea. Even better if you have one URI without a date that consistently redirects (through a 307 Temporary Redirect) to the current version of the document.

Documents (that people read) are just one form of “information resource”: things that are information and therefore can be transmitted electronically. Other things in the world are “non-information resources”: things that are more than simple information and therefore cannot be transmitted electronically, such as schools, roads, hospitals and so on. A lot of things that we want to talk about (make RDF assertions about) are non-information resources. We give them URIs to name them, so that we can talk about them unambiguously, and we give them HTTP URIs so that we have a way of finding information resources (documents) that give us information about them.

Does the information that you get when you resolve a non-information resource URI change? Absolutely. A request to a non-information resource URI gets a 303 See Other response that redirects to an information resource (probably without a version number), which itself redirects (307 Temporary Redirect) to a URI for a particular version of information about the resource. For example an identifier that means a particular school such as:

http://id.example.org/education/school/78

can 303 redirect to the current version of a document that contains information about that school, such as:

http://www.example.org/education/school/78

which will 307 redirect to a particular version of information about that school, such as:

http://www.example.org/education/school/78/2008-09-01

The date is in the URI for the information resource (the information about the school), and therefore it doesn’t need to be in the URI for the non-information resource (the school).
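The chain of redirects above can be sketched in a few lines of Python. This is only an illustration, not a real server: the `REDIRECTS` table and the `follow` function are made-up names, and the table simply hard-codes the three example URIs.

```python
# Hypothetical redirect table holding the three example URIs from the text.
REDIRECTS = {
    # The school itself (non-information resource): 303 See Other.
    "http://id.example.org/education/school/78":
        (303, "http://www.example.org/education/school/78"),
    # The undated document: 307 Temporary Redirect to the current version.
    "http://www.example.org/education/school/78":
        (307, "http://www.example.org/education/school/78/2008-09-01"),
}


def follow(uri, max_hops=5):
    """Return the (status, uri) hops, ending with a 200 for the final document."""
    hops = []
    for _ in range(max_hops):
        if uri not in REDIRECTS:
            # No redirect registered: treat this as the document that is served.
            hops.append((200, uri))
            return hops
        status, target = REDIRECTS[uri]
        hops.append((status, uri))
        uri = target
    raise RuntimeError("too many redirects")


chain = follow("http://id.example.org/education/school/78")
for status, uri in chain:
    print(status, uri)
```

Only the last URI in the chain carries a date. When a new version of the document is published, only the target of the 307 entry changes; the school's own identifier is untouched.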

OK, but say that the identifier for a school changes over time. Let’s say that you’ve designed your URIs for schools like:

http://id.example.org/school/bracknell-forest/broadmoor-primary

and the name of the school changes. Now the above identifier isn’t applicable any more, and any RDF statements out there on the web that have used this identifier are now talking about something that no longer exists. How do you deal with this?

Well, the first rule is that non-information resource URIs must not include information that is likely to change. That’s why a lot of URIs contain numbers rather than names. So we shouldn’t have included the name of the school in the URI? OK, we’ll use a number instead:

http://id.example.org/school/bracknell-forest/78

Hang on. Bracknell Forest is a council, and historically it’s been known for councils to change, whether in their boundaries (which would mean that a school would move council) or in their names, or through mergers, or… well, there are lots of things that could happen to a council. So in the face of all these possibilities, and given that we no longer need the council name to disambiguate the school name (because we have a number instead), we can employ a second rule: non-information resource URIs must not include unnecessary hierarchy. We can eliminate part of the path and still identify the school:

http://id.example.org/school/78

And so we come to the final thing that could change: “school”. Now surely, you might say, the concept of a school cannot change. And maybe you’re right, maybe it won’t. On the other hand, in the UK we have in the past had things called polytechnics, which are now known as universities, so the types of educational establishments that we have do change over time.

We could do a bunch of things to help prevent a conceptual change like this from requiring a change to the URI:

  • we keep the number of concepts named within the URI to a minimum (eg don’t have both ‘education’ and ‘school’)
  • we use wide terms rather than narrow terms (eg use a generic ‘school’ rather than having separate ‘grammar-school’, ‘primary-school’ and so on)
  • we could change the term ‘school’ to a code (eg use ‘C3X0’ instead of ‘school’), but I don’t think this will help: you’ll still have problems if ‘C3X0’ and ‘F9R2’ mean the same thing in the future, whatever they’re called.
  • we could eliminate the concept term from the URI altogether, and label everything under one flat naming scheme, using something that has billions and billions of possible combinations. I know, a UUID! No, I’m not serious.

And so we come to the question of versioning the URIs themselves. This is what Tim Berners-Lee says in Cool URIs don’t change:

I’ll go into this danger in more detail as it is one of the more difficult things to avoid. Typically, topics end up in URIs when you classify your documents according to a breakdown of the work you are doing. That breakdown will change. Names for areas will change. At W3C we wanted to change “MarkUp” to “Markup” and then to “HTML” to reflect the actual content of the section. Also, beware that this is often a flat name space. In 100 years are you sure you won’t want to reuse anything? We wanted to reuse “History” and “Stylesheets” for example in our short life.

This is a tempting way of organizing a web site - and indeed a tempting way of organizing anything, including the whole web. It is a great medium term solution but has serious drawbacks in the long term

Part of the reasons for this lie in the philosophy of meaning. every term in the language it a potential clustering subject, and each person can have a different idea of what it means. Because the relationships between subjects are web-like rather than tree-like, even for people who agree on a web may pick a different tree representation. These are my (oft repeated) general comments on the dangers of hierarchical classification as a general solution.

Effectively, when you use a topic name in a URI you are binding yourself to some classification. You may in the future prefer a different one. Then, the URI will be liable to break.

A reason for using a topic area as part of the URI is that responsibility for sub-parts of a URI space is typically delegated, and then you need a name for the organizational body - the subdivision or group or whatever - which has responsibility for that sub-space. This is binding your URIs to the organizational structure. It is typically safe only when protected by a date further up the URI (to the left of it): 1998/pics can be taken to mean for your server “what we meant in 1998 by pics”, rather than “what in 1998 we did with what we now refer to as pics.”

Let’s spell out the danger with some examples. Let’s say that in 20 years’ time, nurseries and primary schools merge into ‘schools’ and secondary schools, sixth-form colleges and universities merge into ‘academies’. A particular primary school currently known as:

http://id.example.org/school/78

will continue to be known by that URI. A particular university currently known as:

http://id.example.org/university/307

is now known as:

http://id.example.org/academy/79

To support these changes, we have to set up some 301 Moved Permanently redirects; http://id.example.org/university/307 has to redirect to http://id.example.org/academy/79. The RDF found at the end of the new URIs has to include owl:sameAs triples that link the new URIs back to the old ones, to indicate they are talking about the same institution:

<http://id.example.org/academy/79> owl:sameAs <http://id.example.org/university/307>

or this would be derived from the 301 response.
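As a rough sketch of the bookkeeping this implies, the following Python keeps one table of renamed URIs and derives both the 301 responses and the owl:sameAs statements from it. The `RENAMED` table and both function names are invented for illustration; the triples are written out in N-Triples form with the full OWL property URI.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical rename table: old URI -> new URI.
RENAMED = {
    "http://id.example.org/university/307": "http://id.example.org/academy/79",
}


def redirect_for(uri):
    """Return (301, new_uri) if the URI has been renamed, otherwise None."""
    new_uri = RENAMED.get(uri)
    return (301, new_uri) if new_uri else None


def same_as_triples():
    """Return one N-Triples line per rename, linking the new URI back to the old one."""
    return [f"<{new}> <{OWL_SAME_AS}> <{old}> ." for old, new in RENAMED.items()]


print(redirect_for("http://id.example.org/university/307"))
for line in same_as_triples():
    print(line)
```

Keeping a single table as the source for both outputs means the redirects and the published RDF cannot drift apart.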

Similar changes may or may not happen within the RDF hosted elsewhere that talks about these institutions. Since it can be discovered that they are identical, there’s no real reason for anyone to start using the new URIs unless they want to.

Then 30 years later, the government of the time decide to create a new kind of institution which they call a ‘university’. The university of 50 years hence isn’t actually the same as the ‘university’ as we mean it — they are virtual meeting places for independent researchers, each centered on a particular topic of study rather than a physical location — but they need URIs. And since they are called ‘university’ that is the name that should be used in the URI. Now someone mints the URI:

http://id.example.org/university/307

But disaster! This University 307 is not at all the same as the old University 307, now known as Academy 79. The same URI has been used for two different things. Redirections halt, graphs are smushed, distinctions are lost and fallacies haunt the web.

TimBL’s solution to this possibility is for every URI that includes a topic to include the year in which the topic was minted. So we would have:

http://id.example.org/2009/school/78

that remains the same, and then:

http://id.example.org/2009/university/307

redirecting to:

http://id.example.org/2029/academy/79

and the introduction of:

http://id.example.org/2059/university/307

which can be guaranteed to be distinct from http://id.example.org/2009/university/307.
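To make the difference concrete, here is a small Python sketch of a registry that remembers every URI it has minted. The `Registry` class and its `mint` method are hypothetical, and the descriptions of the two universities are placeholders. Without a year the second ‘university 307’ clashes with the first; with a year in the path the two URIs are distinct.

```python
BASE = "http://id.example.org"


class Registry:
    """Remembers every URI ever minted and what it was minted for."""

    def __init__(self):
        self._minted = {}

    def mint(self, term, local_id, thing, year=None):
        # Build base[/year]/term/id, including the year only when one is given.
        parts = [BASE]
        if year is not None:
            parts.append(str(year))
        parts += [term, str(local_id)]
        uri = "/".join(parts)
        if uri in self._minted and self._minted[uri] != thing:
            raise ValueError(f"{uri} already identifies {self._minted[uri]!r}")
        self._minted[uri] = thing
        return uri


# Undated scheme: the 2059 university collides with the 2009 one.
undated = Registry()
undated.mint("university", 307, "university as meant in 2009")
try:
    undated.mint("university", 307, "virtual university of 2059")
    clash = False
except ValueError:
    clash = True

# Dated scheme: the same term and number give two different URIs.
dated = Registry()
old_uri = dated.mint("university", 307, "university as meant in 2009", year=2009)
new_uri = dated.mint("university", 307, "virtual university of 2059", year=2059)
print(clash, old_uri, new_uri)
```

Note that the undated registry only catches the clash because it still holds a record of the old URI. Without that record, nothing would stop the reuse.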

This, to me, is the crux of the argument for including a version inside the URIs that you use for non-information resources. It means that you can reuse old terms with new meanings within URIs without breaking the web.

On the other hand, many people, myself among them, really dislike the use of years or version numbers within URIs for non-information resources (unless, I should say, they are used as part of the identification of the resource). I think there are several main reasons:

  • they are additional cruft that adds to the length of a URI but provides no information about the thing being identified
  • they can give a misleading impression about the relevance of a concept; for example FOAF is stuck at version 0.1 (http://xmlns.com/foaf/0.1/) despite being widely used, while http://www.w3.org/1998/Math/MathML is feeling distinctly old (in internet time) despite being under active development
  • it leads to a proliferation of URIs and creates additional work for people who want to keep their URIs up to date, even when the concepts themselves don’t change (such as for the primary school’s URI above)

In essence, the likelihood of a term being reused with a different meaning seems low enough that the cost (in readability, understandability and maintainability) of supporting URIs that contain versions or years doesn’t seem worthwhile. We can keep the likelihood low by using terms that are unlikely to change their meaning (particularly avoiding those that have more than one meaning) and by disambiguating them (for example by using ‘train-station’ rather than just ‘station’).

There is also, perhaps, a middle way here that can keep the majority of URIs clean without leading to overlapping names. That’s to start with a URI scheme that does not include a version number or year, and only to start introducing them when it becomes necessary due to the reuse of previous terms. In the example above, in 2059 we might have:

http://id.example.org/school/78
http://id.example.org/academy/79
http://id.example.org/university2.0/307

In other words, we make a decision now that our future selves will have to act upon. All we have to worry about is our future selves caring as much about persisting historical URIs as we do about persisting our current ones.
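One way to picture this middle way is a function that hands out the path segment for a term and only adds a version once the term has been given a second meaning. This is a sketch under my own assumptions: `segment_for_new_meaning` is a made-up name, it is called once per new meaning (not once per URI), and the ‘2.0’ suffix just mirrors the ‘university2.0’ example above.

```python
def segment_for_new_meaning(term, meanings_so_far):
    """Return the URI path segment for a newly defined meaning of `term`.

    `meanings_so_far` maps each term to how many meanings it has already had.
    The first meaning gets the bare term; later ones get a version suffix.
    """
    count = meanings_so_far.get(term, 0) + 1
    meanings_so_far[term] = count
    return term if count == 1 else f"{term}{count}.0"


meanings = {}
first = segment_for_new_meaning("university", meanings)   # meaning as of 2009
school = segment_for_new_meaning("school", meanings)
second = segment_for_new_meaning("university", meanings)  # meaning as of 2059
print(first, school, second)
```

The catch is the same as in the prose: the function is only as good as the `meanings_so_far` record, which our future selves have to keep.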

What do you think? Should versioning be avoided in URIs at all costs, or always be included just in case? Are there other arguments for or against including versions or years in URIs? What other design considerations are there that help prevent changes to URIs over (long periods of) time?

Comments

Re: Versioning URIs

Hello Jeni,

Sorry to be late to the party. Great article, thanks. I was going to comment before going on vacation, then decided to think again and post later once I’d thought some more… so we’ll see if I actually hit the “post comment” button.

Firstly, the article is titled “Versioning URIs” however I can’t help but think that what is being versioned is/are the resources to which the URIs refer. Of course that’s an anal quibble really, but at the same time it reveals that what you are trying to model is the changing state of the world - and that you are wanting to refer to things as they were in a particular state (or time/interval), not necessarily as they are right ‘now’. I think that’s quite a big and open field at the moment. I see a new post from Ian Davis [2] on the topic.

Secondly, the article says:

“Even better if you have one URI without a date that consistently redirects (through a 307 Temporary Redirect) to the current version of the document.”

My initial reaction was “Woa… I’m not so sure that you do want to be doing that.” The current version (undated) and specific (dated) URI refer to different resources - at least in as much as that by design, the resource referred to by the dated URI is immutable. The 307 suggests that the undated resource has moved to a temporary location. Roy Fielding casts a resource as a mapping function from time to sets of equivalent representations [1] (actually sets of equivalent representations or URI - the latter I take to cover redirection cases). So my initial reaction arose from it being the case that, by design, the resource that ‘persists’ at the dated reference has a very different future from the resource at the undated reference. Indeed if anything, it is representations of the dated resource that are temporarily available via the undated URI. Anyway… the point is that whilst I’m quite comfortable with the notion that two (or more) resources may share (sets of available) representations over some period of time, they are nevertheless different resources and the ‘hint’ that one reference is merely a temporary reference for the other seems to me misleading… and perhaps shouldn’t be encouraged. I think that whilst the two referenced resources are related in some way, I’d want to call that out explicitly in a statement rather than leverage it from a 307 response.

[1] http://www.ics.uci.edu/~fielding/pubs/webarch_icse2000.pdf
[2] http://iandavis.com/blog/2009/08/time-in-rdf-1

Re: Versioning URIs

I am relatively new (a couple of years) to the semantic web, RDF, etc, but do have over 20 years of experience in data (particularly, that which is relationally organized). The issue of URIs is not very different from that of how one would identify any data element (regardless of storage construct). As such, there is always an inclination to add business (i.e. meaningful) information into data identifiers. However, doing so almost always creates problems when changes occur (many of which have been described by Jeni in this blog post).

Given the problems I’ve encountered with identifiers that contain any meaning, my recommendation would be to utilize completely meaningless, randomly assigned identifiers (i.e. GUIDs) for all URIs. In addition to the problems meaningful identifiers create regarding change, most would argue that a minimal amount of information should be included in the URI to reduce the problems caused by the very issues discussed in this blog. Further, unless the URI contained a significant amount of information (which would be quite impractical), anyone wanting to gather information about the resource will need to see the properties that describe it; they are far more relevant than the URI itself. Quite often, many of the URIs are cryptic anyway.

So, to me, the best solution would be to utilize completely meaningless identifiers (URIs) for resources and let the properties/attributes assigned to them provide the meaningful information that is desired.

Re: Versioning URIs

I don't think it's worth getting too worked up about this stuff; URIs are supposed to be opaque, after all.

However, I think the best guide for deciding how much versioning you need in URIs is the length of time you are happy to ensure that the URIs are not accidentally reused. If you are willing to record all URIs you ever mint "forever" then no explicit versioning is required. Remember, though, that this needs to cover the case of domain transfer, as John Cowan points out above. In the case of a .gov.uk domain, for example, this might not be a problem (even if a department disappears and then is recreated with the same name) but for many users, particularly those using personal domains, this needs to be considered.

On the other hand, if you just want to "mint and forget" then putting a date in is a very good idea.

I like the way that tag: URIs contain a date and that it is a specific violation of protocol/etiquette/manners to mint a URI for a domain for a date when you don't/didn't own it. I think it's a pity there's no equivalent for HTTP: e.g., a fairly strong recommendation that if the top part of the path looks like year[/month[/day]] then it should actually be such for a period when the minter owned the domain.

Captcha: "within 1968". Is that versioned, then?

Re: Versioning URIs

Is there some parallel to the DNS here? www.example.net is really 123.456.789.123, right now. It might be some other IP later. So id.example.org/124/78 might be found from id.example.org/school/78 right now, but from id.example.org/academy/78 some other day. Since /124/78 is meaningless, there’s no reason ever to change it. What it is doesn’t change, but the context we assign it certainly will.

If we wish to put human-readable meaning in the URI, then it should be a temporary condition. Versioning should be built in, but in a meaningless way. /124/78 might be /school/10Nov2009 right now, but /academy/archival_only/10Nov2009 some day.

Re: Versioning URIs

I also meant to add that the claim that meaningless identifiers are inherently more stable just isn’t borne out by the facts. The city of Roma (in Italy) has been known by exactly the same name for the last 2762 years. The name of Dimashq (Damascus in English), allowing for changes in language (Akkadian > Aramaic > Arabic), has been in use for at least 3341 years, as far as records go, and quite likely for 5000 years or longer.

By comparison the oldest arbitrary identifiers I know of, house numbers in Paris, are only 546 years old — and they are by no means stable over the centuries.

Re: Versioning URIs

Jeni,

an excellent dissection and discussion of the issues on both sides of the argument. With some very insightful comments, look forward to the follow up article (?) weighing up the path you chose and your experiences.

Re: Versioning URIs

Really clear laying out of the issues Jeni. However I got a sense of deja vu, as this is an issue that pre-dates the web.

In my very first computing job (COBOL programming for Cumbria County Council) many many years ago, I read an article in Computer Weekly about choice of keys (I think for ISAM not even relational DBs). The article argued that keys should NEVER contain anything informational as it is bound to change. The author gave an example of standard maritime identifiers for a ship’s journey (rather like a flight number) that were based on destination port and supposed to never change … except when the ship maybe moved to a different route. There is always an ‘except’, so, the author argued, keys should be non-informational.

Just a short while after reading this I was working on a personnel system for the Education Dept. and was told emphatically that every teacher had a DES code given to them by government and that this code never changed. I believed them … they were my clients. However, sure enough, after several rounds of testing and demoing when they were happy with everything I tried a first mass import from the council’s main payroll file. Validations failed on a number of the DES numbers. It turned out that every teacher had a DES number except for new teachers where the Education Dept. then issued a sort of ‘pretend’ one … and of course the DES number never changed except when the real number came through. Of course, the uniqueness of the key was core to lots of the system … major rewrite :-/

The same issues occurred in many relational DBs where the spirit (rather like RDF triples) was that the record was defined by values, not by identity … but look at most SQL DBs today and everywhere you see unique but arbitrary identifying ids. DOIs, ISBNs, the BBC programme ids - we relearn the old lessons.

Unfortunately, once one leaves the engineered world of databases or SemWeb, neither arbitrary ids nor versioned ones entirely solve things as many real world entities tend to evolve rather than metamorphose, so for many purposes http://persons.org/2009/AlanDix is the same as http://persons.org/1969/AlanDix, but for others different: ‘nearly same as’ only has limited transitivity!

Thu Jul 23 11:16:21 BST 2009 Re: Versioning URIs

Versioning should only be introduced to disambiguate. However, ambiguity should be checked for before a URI is created and, when detected, should be avoided by using a different term in preference to versioning.

Re: Versioning URIs

These guidelines cover the topic well, and I hope that we will adopt something close to them. An important piece of guidance will be deciding when a non-information resource has been revised in some way, or has changed enough to warrant being thought of as a new thing that has morphed from the old thing.

So perhaps, if the M5 which currently ends at Exeter, were to be extended down to Plymouth, that might require that we create a new version of the M5 URI. However, if a secondary school were to add on a 6th form college, that sounds like a new school to me and should have a new URI where the RDF would give you a relationship to the previous one.

So, it would be good to give some design guidance about when to version vs when to create a new URI.

Re: Versioning URIs

One of the things that was said at the workshop was that we should be creating the most “natural” URI for a thing. In very many cases there will be an existing identifier scheme (or schemes), in which case the aim should be to “http’ise” that as elegantly as possible, rather than create a parallel or alternative set of identifiers, or introduce any additional elements. One of the interesting things about government is that it (the department or agency) is often the entity responsible for creating those existing identifiers and maintaining them over time – so the government identifier for a thing is likely to be useful for very many other people also seeking to identify that thing. It is also likely to be the identifier other people are already using.

The existing identifier schemes will have the merit of having been in place for some time (potentially several decades), and will have evolved to a point where they are both sufficient as identifiers and reasonably well maintained. If those schemes need versioning, then, in all probability they will already contain a version number as part of the scheme, using an approach to versioning that makes sense in the context of the things that the scheme identifies. If those schemes have not needed versioning hitherto, we can reasonably assume they don’t need versioning now. Either way, a global approach to versioning the httpised URI is not required. The dis-benefit of introducing a general versioning approach to the URI is that it introduces a needless additional element to the http’ised version of the existing identifier – and potentially an element that is inconsistent with any versioning inherent in the existing identifier.

So long as identifiers in an existing (non-http) scheme are not re-used (an important criterion), then they are good candidates for http’ising – and that is what we should do. This is the “minimal disruption” route to linked data - http’ise as many of the existing unique identifiers that the government controls as we possibly can, as the basis for our http URIs. If they come with versioning, great, problem solved; if not, then versioning is probably not required.

Of course, all of this presupposes that situations where we are making entirely new identifiers (ie URIs that are not based on existing ids) for a thing are probably quite rare. That’s my hunch, right now. We’ll only know by trying to publish more linked data.

Re: Versioning URIs

My 2 cents…

I would avoid minor version numbers in URIs for the reasons you mentioned. Just bump an integer number each time the meaning is changed. So you get:

http://id.example.org/school/78
http://id.example.org/academy/79
http://id.example.org/university2/307

Aesthetically university2 doesn’t look good (and maybe politically undesirable also), so I’d suggest:

http://id.example.org/school/78
http://id.example.org/academy/79
http://id.example.org/2/university/307

And while I’m here, I prefer small version numbers to dates as I think they are easier to remember. Chances are a reassignment of a category will only happen 2 maybe 3 times. It’s easier to remember that this is the third incarnation of universities and the second incarnation of academies rather than remembering that universities changed in 2025 and 2036, while academies changed in 2026, the meaning of lamp posts changed in 2018 and bollards in 2017!

Re: Versioning URIs

I believe that every non-informational URI (notably including namespace names, which are “information” in one sense but don’t correspond to any document) should have a date in it, for the very good reason that people rent domain names, they don’t own them. We are increasingly going to get the situation of non-informational URIs that embed the domain names of dead organizations, and worse yet of domain names that have moved from a dead organization to a living one. Hack.com, for example, has had four owners that I know of.

Consistently inserting a date into a non-informational URI, and interpreting the date as some year in which the assigning organization controlled the domain name (not necessarily the year in which the URI was minted), prevents this problem. RFC 3085 defines the urn:newsml: scheme, which was meant to support this very idea. urn:newsml: URIs contain a domain name, a date as specified above, an owner-specific string, and a version number representing multiple versions of the denoted resource. Newsml URNs were meant for news stories, but in principle could be used for anything, definitely including non-information resources. RFC 4151 describes the tag: URI scheme, whose URIs contain the same components, except for the version number; they are explicitly meant for non-information URIs, and allow an email address in place of a domain name, for people who don’t own any domain names. Only http: scheme names are fashionable nowadays, but the same principles remain.
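As a rough illustration of the tag: syntax described above, a tag URI is put together as `tag:` authority `,` date `:` specific, where the date may be a year, a year and month, or a full date. The helper below is only a sketch: `tag_uri` is a made-up name, the school example is borrowed from the post, and the check covers the shape of the date only, not the full RFC 4151 grammar.

```python
import re

# Date forms allowed in a tag URI: YYYY, YYYY-MM or YYYY-MM-DD.
_TAG_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def tag_uri(authority, date, specific):
    """Build a tag: URI from a domain name or email address, a date, and a local string.

    The date should fall in a period when the minter controlled the authority.
    """
    if not _TAG_DATE.fullmatch(date):
        raise ValueError(f"date must look like YYYY, YYYY-MM or YYYY-MM-DD: {date!r}")
    return f"tag:{authority},{date}:{specific}"


school = tag_uri("example.org", "2009", "school/78")
print(school)
```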

Re: Versioning URIs

I think that there are several approaches that can help. Firstly, it is helpful to establish ‘Subject URI repositories’. Such repositories collect information about URIs which are used to identify subjects (with subject names and human readable subject descriptions)

One existing example:

http://subj3ct.com/

Second important concept, I think, is the idea of ‘deprecated identifiers’. Deprecated identifiers continue to identify the same subjects, but they are not recommended for future usage.

For example, current identifier:


http://psi.ontopedia.net/Media_Campus_Villa_Ida_Leipzig_Germany 

has deprecated identifier (recorded in Subject URI repository):


http://psi.ontopedia.net/Campus_Villa_Ida_Leipzig_Germany

(These descriptions are optimized for Topic Maps, but the same idea can be used for RDF.)

Subject URI repositories + support for ‘deprecated subject identifiers’ can help to have ‘Cool URIs’ which can evolve.

Re: Versioning URIs

Thanks for the insight Jeni. We went through all of this at the BBC regarding programme identifiers, as I’m sure you’ve already heard from Tom Scott et al. It led to our concept of a “PIP code”, an alphanumeric programme identifier that gives very little away about the contents of a programme, but is also very opaque and not very hackable. So of course one of the first things that happened was that the iPlayer guys added some redundant but readable text to the end of the unique URLs for “google juice” and readability purposes (eg http://www.bbc.co.uk/iplayer/episode/b00llg8k/GettingOnEpisode_1/). Does this defeat the purpose? You decide. It has certainly led to ideological battles between the purist /programmes people and the pragmatic (populist?) iPlayer people, with some of us watching bemusedly from the sidelines…

But I wonder whether there isn’t an important point being lost in the detail here. When a URI would need to change according to your examples, it would be changing for a reason. http://id.example.org/school/78 isn’t really the same as http://id.example.org/academy/79, they might be two institutions that share the same physical location, staff, students etc but they’re not the same thing. If I graduated from Smith High School I wouldn’t go around telling people that I was an alumnus of Jones Academy (although I might explain that I am an alumnus of Smith High School which has been replaced by Jones Academy). This is a subtlety that 303 See Other and owl:sameAs don’t really express. But we could surely create some new types of redirects that allow for citation-like links such as “superseded by” and wikipedia-like disambiguation pages for universities that happen to use the same URI over time.

They’re still Cool URIs, I’m not saying the URIs should change, you should always be able to access http://id.example.org/school/78 after it is turned into an academy. But what you see when you visit it is most likely just a link with some explanation of what has happened to that resource (perhaps along with some historical information about what was once the proud and noble Smith High School).

Of course if you take this approach then you don’t even need id.example.org/academy/79, you could just say id.example.org/academy/StTriniansBerkshire and let your linking rules handle the rest. A bit more ambiguous, a bit more in the style of avoiding premature optimisation, and maybe a bit more like the web…?

Re: Versioning URIs

There was a time when I copied the W3C approach and used date-versioned URIs for everything, but I eventually gave that up because of the problems that it caused for users. As you noted, when a dated URI has been in use for a number of years, it becomes difficult for people to remember which year is the correct one, and they get anxious about whether it has been superseded or not by a URI that is identical except for the year.

The dated URI approach works well for the W3C, which insists on using its website as a showcase for a minimalist approach to website management, but deals with little on its website other than (dated) specifications. By contrast, the UK public sector deals with a much broader range of content than the W3C does, so copying the W3C probably isn’t “good practice” in this circumstance.

Cheers, Tony.