In my last post about RDFa and HTML I talked about how one of the gulfs that separates the HTML5 and Semantic Web communities is the attitude to the resolvability of property (and class) URIs.
I’m currently experimenting with adding to rdfQuery the ability to automatically locate information about properties and other resources that are referenced within triples. So now is a good time, as far as I’m concerned, to look more closely at what the ability to resolve properties gives you, and at how to avoid problems if the property URI is (temporarily or permanently) unresolvable or resolves to something new.
I’m going to attempt to answer:
We can divide applications using online data into three general categories:
Most mashups are data-specific applications. When you, as a developer, create a data-specific application, the thing that you need to know most of all is what information the dataset contains. Part of that is working out the meaning of a particular property (or class). What the data publisher needs to do is make sure that the data they publish is documented.
There are three ways of locating the documentation about a particular property or class:
For a developer, it’s very useful to find out about a property by bunging its URI into a browser and hitting return. Want to know what http://xmlns.com/foaf/0.1/name means? Look up that URI. By comparison, if you want to know what a vevent is, your best bet is a search engine. In the results I get from Google, the microformat definition of vevent is currently second on the list. (The Microdata definition of vevent doesn’t even feature.) Even if a property isn’t available at its URI, its URI gives a more unique identifier to search for than a short term: you’re more likely to find relevant information if you search for http://xmlns.com/foaf/0.1/name than if you search for name.
But there’s no requirement for data-specific applications to use computer-readable information about properties or classes. If you know the data that’s available in a dataset, you can find out the semantics of the properties and classes it contains and hard-code those within your application. Most applications that reuse data are currently of this type, and it tends to be the only kind that non-Semantic Web people think about.
Vocabulary-specific and generic applications will have some vocabularies built in but may also operate with unknown vocabularies. For example, an application that cares about FOAF profiles is almost certainly going to want to hard-code information about FOAF rather than download its schema every time it’s used.
There are three reasons for building-in knowledge about particular vocabularies:
It’s worth noting that applications increasingly do rely on the availability of networked resources in order to operate — that’s what cloud computing is all about — but the resources are usually ones that the application developers have some kind of control over.
It helps to use URIs for properties and classes for well-known vocabularies only in as much as it means that property and class names from different vocabularies won’t clash, so you don’t have to worry about your application confusing http://xmlns.com/foaf/0.1/title with http://purl.org/dc/terms/title.
On the other hand, if data uses an unknown vocabulary, vocabulary-specific and generic applications would like to get hold of extra information. This falls into three categories:
If http://people.example.org/ontology/fullName is defined as a sub-property of http://xmlns.com/foaf/0.1/name then the application can use or display the value of http://people.example.org/ontology/fullName in exactly the same way as the value of http://xmlns.com/foaf/0.1/name.
If http://people.example.org/ontology/fullName has a domain of http://xmlns.com/foaf/0.1/Person then anything that has the property http://people.example.org/ontology/fullName must be a http://xmlns.com/foaf/0.1/Person.
These are in descending order of priority: many applications will want to interact with the user in some way, in which case human-readable information is vital. Applications that have built-in knowledge about one or more vocabularies are likely to have special handling for those vocabularies, so being able to map unknown properties and classes into those known vocabularies will enhance the behaviour of the application, although it adds a bit of complexity in the implementation to do so. Further reasoning has the potential to increase the value of sparse data but again increases the complexity of implementation.
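To make the sub-property and domain examples concrete, here is a minimal sketch in Python. It assumes triples are held as plain tuples, that the two schema statements have already been obtained somehow, and that a single pass of each rule is enough; the `infer` function and the `alice` resource are hypothetical names used only for illustration.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://people.example.org/ontology/"

# Schema statements the application is assumed to have found for the unknown vocabulary.
schema = {
    (EX + "fullName", RDFS + "subPropertyOf", FOAF + "name"),
    (EX + "fullName", RDFS + "domain", FOAF + "Person"),
}


def infer(data, schema):
    """Return the data plus triples entailed by one pass of the subPropertyOf and domain rules."""
    inferred = set(data)
    for s, p, o in data:
        for term, relation, target in schema:
            if term != p:
                continue
            if relation == RDFS + "subPropertyOf":
                # The value can be treated as a value of the known super-property.
                inferred.add((s, target, o))
            elif relation == RDFS + "domain":
                # Anything with this property is an instance of the domain class.
                inferred.add((s, RDF_TYPE, target))
    return inferred


alice = "http://people.example.org/id/alice"  # hypothetical resource
data = {(alice, EX + "fullName", "Alice Example")}

for triple in sorted(infer(data, schema)):
    print(triple)
```

This does not follow chains of sub-properties or apply any other rules; a real reasoner would do rather more, which is where the extra implementation complexity comes from.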
Using URIs for classes and properties provides a mechanism for applications to get hold of this extra information about unknown vocabularies. They might try four tactics, in order of priority:
Robust applications will not break if they don’t manage to locate the definition of a property or class. They can certainly continue to parse any data that they come across. To create a human-readable label, they might use the part of the URI after the last # or /. It’s no loss (to the application) if they cannot perform other reasoning: they might display the data in some default way or simply ignore it.
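A minimal sketch of that fallback label, in Python. Taking the part of the URI after the last # or / is the approach described above; splitting camelCase and replacing underscores and hyphens are extra assumptions of mine, and `fallback_label` is a hypothetical name.

```python
import re


def fallback_label(uri):
    """Derive a rough display label from the part of a URI after the last # or /."""
    local = re.split(r"[#/]", uri.rstrip("#/"))[-1]
    # Assumption: split camelCase ("fullName" -> "full Name") for readability.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local)
    # Assumption: treat underscores and hyphens as word separators.
    return re.sub(r"[_\-]+", " ", spaced).lower()


print(fallback_label("http://people.example.org/ontology/fullName"))  # full name
print(fallback_label("http://cricket.example.org/ontology#name"))     # name
```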
It’s worth noting, because of the fear of DDoS attacks that some people have, that the majority of applications won’t need to actually GET property or class URIs, either because they are data-specific applications or because they only work with vocabularies that are hard-coded into them. Applications that are good web citizens will avoid DDoS attacks on popular vocabularies by hard-coding knowledge about those vocabularies and/or maintaining a cache, either locally or in the cloud, of vocabularies that have already been resolved.
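One way an application might behave as a good web citizen is sketched below in Python. It assumes the application supplies its own `fetch` function (hypothetical here, and replaced by a fake in the usage example), lists the namespaces it has hard-coded, and is happy with an in-memory cache; the class and method names are illustrative only.

```python
class VocabularyCache:
    """Resolve each vocabulary document at most once, skipping built-in vocabularies."""

    def __init__(self, fetch, built_in=()):
        self._fetch = fetch               # callable: document URI -> description
        self._built_in = tuple(built_in)  # namespaces the application has hard-coded
        self._cache = {}

    def describe(self, term_uri):
        # Terms from hard-coded vocabularies never trigger a request.
        if term_uri.startswith(self._built_in):
            return None
        # The fragment is not part of the request, so key the cache on the document.
        doc_uri = term_uri.split("#", 1)[0]
        if doc_uri not in self._cache:
            try:
                self._cache[doc_uri] = self._fetch(doc_uri)
            except OSError:
                # Remember failures too, so an unreachable vocabulary is not retried per triple.
                self._cache[doc_uri] = None
        return self._cache[doc_uri]


calls = []


def fake_fetch(uri):
    """Stand-in for a real HTTP request; records what would have been fetched."""
    calls.append(uri)
    return {"source": uri}


cache = VocabularyCache(fake_fetch, built_in=["http://xmlns.com/foaf/0.1/"])
cache.describe("http://cricket.example.org/ontology#name")
cache.describe("http://cricket.example.org/ontology#team")  # hypothetical second term
cache.describe("http://xmlns.com/foaf/0.1/name")
print(calls)  # ['http://cricket.example.org/ontology']
```

Three lookups result in a single request: the FOAF term is built in, and both cricket terms live in the same document.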
With what I’ve said above in mind, what can publishers do to help applications to understand the data that they provide?
If a publisher is only concerned about data-specific, point-to-point mashups, all they have to provide is the data itself. It will help developers if there is some documentation of the dataset and the properties and classes used within it. But data publishers who only want their data to be discoverable by people can rely on human intelligence for locating information, and for them using URIs for properties and classes may seem like overkill.
But in a linked data world, publishers should really support their data being discovered automatically via the links from other data. Here we’re talking about making life easier for vocabulary-specific and generic applications to use the data that you provide.
The vocabularies that you use within your data fall into three general categories:
As a data publisher, the first thing you can do is to use well-known vocabularies in your data wherever possible, even if you also use local or reused vocabularies to express the same properties or classes.
For example, say you have some data describing a cricket team and use http://cricket.example.org/ontology#name for the name of a member of a team, and that you mean it to be a sub-property of http://xmlns.com/foaf/0.1/name (which is itself a sub-property of http://www.w3.org/2000/01/rdf-schema#label). If you just publish the http://cricket.example.org/ontology#name property then the only way that a generic application can know that http://cricket.example.org/ontology#name can be used as a label for a resource (which is a person) is by attempting to resolve http://cricket.example.org/ontology and reasoning based on what it finds. On the other hand, if you also provide http://xmlns.com/foaf/0.1/name and http://www.w3.org/2000/01/rdf-schema#label properties, applications are no longer dependent on the network, nor on having the ability to reason, to use that information.
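On the publishing side, the extra properties could be generated rather than written by hand. The following Python sketch assumes the publisher keeps a simple table of super-properties and holds triples as tuples; `with_well_known` and the player URI are hypothetical.

```python
# Each property mapped to the super-property the publisher intends it to specialise.
SUPER_PROPERTIES = {
    "http://cricket.example.org/ontology#name": "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/name": "http://www.w3.org/2000/01/rdf-schema#label",
}


def with_well_known(triples, super_properties=SUPER_PROPERTIES):
    """Yield each triple plus a copy for every super-property up the chain."""
    for s, p, o in triples:
        seen = set()  # guards against accidental cycles in the table
        while p is not None and p not in seen:
            seen.add(p)
            yield (s, p, o)
            p = super_properties.get(p)


player = "http://cricket.example.org/players/1"  # hypothetical resource
published = list(with_well_known([
    (player, "http://cricket.example.org/ontology#name", "A. Player"),
]))
for triple in published:
    print(triple)
```

The output contains the original triple followed by the same value under http://xmlns.com/foaf/0.1/name and http://www.w3.org/2000/01/rdf-schema#label, so a consumer needs neither the network nor a reasoner to find a label.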
You could also provide mappings onto any reused vocabularies that you specialise, but this is less worthwhile given that vocabulary-specific and generic applications are unlikely to understand them either.
The second thing you can do is to include information about the properties that you use within the data that you publish. This isn’t important for well-known vocabularies (because they’re… uh… well-known) and it’s only useful for local vocabularies if you’re not publishing those vocabularies, because if someone can access your data, odds are they’re able to access your local vocabulary’s property URIs as well. But it is useful for reused vocabularies, where you can’t guarantee access, in just the same way as it’s useful to provide basic labelling information about any resources you reference.
If you’re publishing your data embedded within a web page, as well as marking up the data, you can mark up the labels that you use for those values, which more than likely appear as headings in a table or something similar.
If you are publishing a schema or ontology that describes your properties and types, there are also things that you can do to help applications. The most important thing is to assist caches in their caching of the ontology, which will reduce the number of times that it needs to be accessed directly and help you avoid DDoS attacks: see Mark Nottingham’s Caching Tutorial. You can also reduce the number of hits on your server by using hash URIs for your property and class names (clients strip the fragment before making the request, so every term is served from a single cacheable document), and you can use standard load-balancing techniques to manage the traffic.
If you’re referring to reused vocabularies within your own, you can also embed information about the relevant properties and classes from those vocabularies within your own ontology. This can save applications an extra hop, and lessens the risk of the reused vocabulary disappearing (perhaps forever).
If you want to help people who might reuse your ontology, you can make the process of copying it easier by publishing it as a single file, or broken up into segments that are likely to be reused individually. At a non-technical level, it’s also a good idea to provide an announcement mailing list or a feed so that people who reuse your vocabulary can be kept up to date with any changes you make to it.
Bearing all this in mind, what should I (and other framework developers) do to support the reusers of data? I think I need to make it easy for application developers to:
In other words, I need to make it easy for people to use a range of strategies for getting hold of information about a property or class, aside from simply trying to access it at its URI. I think that means that it’s better to provide a lightweight solution, giving developers the opportunity to be in control of which URIs get resolved rather than automatically downloading extra information from the URI that’s actually used for the property or class. It also means I need to provide hooks in the code that they can use to trigger that resolution.
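To illustrate the kind of hooks I mean, here is a rough sketch in Python. It is not rdfQuery’s API: `TermResolver`, `register` and `resolve` are hypothetical names, and the two lookup tables stand in for whatever built-in knowledge and local copies a developer might have.

```python
class TermResolver:
    """Try developer-supplied lookup strategies in order; fetch nothing on its own."""

    def __init__(self):
        self._strategies = []

    def register(self, strategy):
        """Add a hook: a callable taking a term URI and returning a description or None."""
        self._strategies.append(strategy)

    def resolve(self, term_uri):
        for strategy in self._strategies:
            description = strategy(term_uri)
            if description is not None:
                return description
        return None  # unresolved: the caller falls back to default handling


# Hypothetical built-in knowledge and a hypothetical local copy of a reused vocabulary.
BUILT_IN = {"http://xmlns.com/foaf/0.1/name": {"label": "name"}}
LOCAL_COPY = {"http://people.example.org/ontology/fullName": {"label": "full name"}}

resolver = TermResolver()
resolver.register(BUILT_IN.get)
resolver.register(LOCAL_COPY.get)
# A network lookup would be registered last, and only if the developer wants one.

print(resolver.resolve("http://people.example.org/ontology/fullName"))  # {'label': 'full name'}
print(resolver.resolve("http://unknown.example.org/term"))              # None
```

The point of the design is that nothing is downloaded unless the developer has registered a strategy that does so, which keeps them in control of which URIs get resolved.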
It would also be useful, of course, for developers to be able to use information about properties and classes easily, in particular to reason with it. That kind of support is something I’ve been working on for rdfQuery. It’s not quite ready yet.
My (somewhat contentious) view is that we place too much emphasis on the resolvability of property and class names, and that this can put people off the idea of the Semantic Web. You can do useful things with data without resolving properties or classes. And for a large number of useful applications, being able to actually reason over the data you get at the end of a property URI would have a high implementation cost without providing a great deal of functional benefit.
Further, for data publishers, the requirement to enable the resolution of every property and class URI you use within your data just adds to the publishing burden, especially if you’re made to feel it has to resolve to some kind of grand OWL ontology.
There’s a concept in psychology of the zone of proximal development. The idea is that if someone is operating at a particular level then as a teacher you should help them to achieve something slightly above that level, rather than trying to get them to do everything straight away.
The same is true here. We need to help publishers make the small steps that they can make, one at a time, to gradually get them to full Semantic Web goodness:
rdfs:subPropertyOf/rdfs:subClassOf mappings from your properties/types to well-known properties/types within your data, so that it can be displayed in custom ways
The biggest leap, the one that requires the most persuasion and the most justification, is probably from simply publishing the data in a machine-readable format to using the RDF model with URIs for properties and types. But if you remove the cost of having to provide anything at the end of the URI and factor in the potential benefits you may reap in the future (as you step further up that ladder), the question becomes less “why?” and more “why not?”.
Comments
Re: On Resolvability
I had some thoughts about resolvability for classes in my blog a while back (admittedly a brain dump before heading to the pub :))
http://johngoodwin225.wordpress.com/2009/04/01/linked-ontology-web/
There is a lot of work going on at the moment (see my comments in my blog) around modularising OWL ontologies. I agree that you might not always want to import the whole ontology into an application, but for various applications it will be very important that they use the correct definition of a class. In the past, assumptions that two classes mean the same thing because they have the same name have led to data integration disasters. The modularity work is interesting because it allows you (based on certain conditions) to extract just the right amount of information from an ontology for a given class. For example, if I want to reuse the class Zoo from an ontology about buildings, how do I know I have pulled out all the right axioms so that I have kept the intended meaning of the class Zoo? For large OWL ontologies I was thinking that maybe we could use these modularisation tools (in, say, Pellet) to give just the right information when dereferencing a class URI.
Re: On Resolvability
I agree with the notion that there are many small steps between “make your data available to the public in some way” to “publish state-of-the-art, full-featured linked data”. And I agree that if any of the steps in between are too high for a publisher, for whatever reason (technical, political, available resources, perceived lack of benefit, etc etc), then we should cheer them on for taking the initial steps, rather than criticize that they don’t go all the way. Most of the value is in the first steps anyway, and the value of the last few steps is still mostly theoretical at this point in time.
I’m certainly guilty of having minted a lot of class and property URIs that do not resolve. I fully intend to provide schema information for all of them at some point, time permitting, but getting the data out there was more important than necessarily doing The Right Thing in every detail.
I kicked off the Neologism vocabulary editor project two years ago to make steps 5-8 of your list much easier; we didn’t make as much progress as I hoped unfortunately, but I think that with an easy-to-use web based vocabulary editing and hosting app, doing resolvable vocabulary terms should become a no-brainer because it’s actually the easiest way of managing your vocabulary terms.
Re: On Resolvability
Have you seen Web Protege:
http://protegewiki.stanford.edu/index.php/WebProtege
Re: On Resolvability
Hi Jeni,
Another really good post, and I think I generally agree with your viewpoint about resolvability. A lot of the semantic web applications I’ve built have not relied on resolving URIs at all. They’ve generally been working within a known schema, or working with a known subset of some data. These fall into the “data-specific” classification, although they were more about managing document metadata, workflows, and a generic publishing framework than visualisations. A lot of different kinds of apps fall into that category.
I think this is mostly a function of how I’ve approached RDF technologies: cautiously applying them to address specific problems, one step at a time, with the initial emphasis on wanting a standards based semi-structured data model with flexibility greater than XML.
Anyway, there are a few specific points I wanted to make:
Firstly, as you’re well aware, the issue of handling dependencies on external URIs is found not just in web applications but in XML workflows too. In that technology stack, using URI Resolvers and XML Catalogs is a best practice (although still not as widely deployed as it should be); caching proxies are a useful additional layer on top of that.
The same should be true of RDF applications, but I don’t think that most frameworks commonly implement support for XML catalogs or something like it. Jena has the notion of a LocationMapping which can be used to achieve exactly the same effect.
I think this is another area that framework developers should address.
Secondly, I’ve also found it useful to annotate RDF schemas with additional data that describes how a property or class should be handled by an application. An obvious one is associating a property with some application-specific processing rules. But it may just be providing a useful label for a property, without having to hard-code it into the application. (And RDF is great for internationalisation, as multiple language qualified literals for labels are easy)
So I think it’s important that frameworks should also allow external schemas to be supplemented locally with custom annotations. The nice aspect of RDF Schemas is that they’re just RDF, so the merging is easy.
More generally, we should also recognize that sometimes an application may want to make some local assumptions about how properties or classes can be related. E.g. within my application I might want to treat two properties as equivalent even if the rest of the world disagrees. This could be hard-coded into the application, but why not externalise it so it can be declared within a schema?
Finally, a more general point about resolvability:
Why is it that using URIs to discover human and machine-readable documentation and data isn’t more prevalent? After all (X)HTML has its profiles which are URI based; and XML has Namespaces. Both of these have been around for some time but I’ve never really seen wide usage of dereferencing and resolvability in those contexts. Certainly not for driving application processing.
And yet there has always been interest in exploring the possibilities that this offers, even if only to do things like find documentation or a schema to validate a document.
There are many answers to that question, some of which you’ve addressed in an earlier posting on prefixes and URIs. But I think one issue is that previous approaches have failed to standardise what you will actually get when you dereference, say, an XML namespace or an HTML profile. Technologies like RDDL tried to address that but without much success as far as I can see.
One of the elegant features of RDF, IMHO, is that both the data and the schema use the same model and that both are fully grounded in the web. So, what you expect to receive when dereferencing a property URI is RDF. Ideally with data in the RDFS and OWL ontologies, but perhaps a whole lot more.
I think this offers a lot of fruitful ground to explore which I don’t think we’ve previously been able to do with other approaches that don’t have the same uniformity of model.
So I’m cautiously optimistic about the importance of resolvability in the future, but agree that in the short term the pragmatic approach is to get the data out there, and then at least publish some useful labels and documentation.