When Leigh Dodds presented about Linked Data at the XML Summer School this year, one of the things he suggested was that when you have a controlled vocabulary, you should define resources for the terms in that vocabulary rather than having a fixed set of literal values.
For example, if you’re saying that the topic of a page is elephants, you should use a triple like:
<> dc:subject <http://example.com/id/concept/animal/elephant> .
rather than one like:
<> dc:subject "elephant"^^xsd:token .
There are two advantages of using a resource here rather than a literal value:
1. You can associate other metadata with the term. For example, you can use SKOS to describe it and its associations with terms from other controlled vocabularies, giving it multiple labels in different languages, providing a description and examples and so on.
2. You can more accurately and easily associate multiple documents that use the same term. Of course you could always use string-based matching to pull together documents using the same subject, but that’s much more prone to error, and would leave you with less certainty about the results of your query.
What struck me, though, was that these arguments apply just as well to other typed values that we use within RDF. For example, say I wanted to describe the colour of my eyes. I could say something like:
<#me> eg:eyeColour "#695C3E"^^eg:colour .
But wouldn’t it be better to use:
<#me> eg:eyeColour <http://example.com/id/concept/colour/rgb/695C3E> .
The colour resource could have properties associated with it:
<http://example.com/id/concept/colour/rgb/695C3E>
  a eg:Colour ;
  skos:prefLabel "#695C3E" ;
  eg:red "69"^^eg:hex ;
  eg:green "5C"^^eg:hex ;
  eg:blue "3E"^^eg:hex ;
  eg:hue 42 ;
  eg:saturation 41 ;
  eg:brightness 41 ;
  eg:pantone ... ;
  ... .
and so on. And if other people pointed to the same resource, a semantic search engine could give you a list of things of that colour.
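Most of that description never needs to be stored, because it follows from the hex code itself. Here is a minimal Python sketch of the idea. The `describe_colour` helper and the dictionary layout are hypothetical, and the keys simply reuse the made-up `eg:` property names from the example above.

```python
import colorsys


def describe_colour(hex_code):
    """Derive the example's colour properties from a six-digit RGB hex code."""
    digits = hex_code.lstrip("#").upper()
    if len(digits) != 6:
        raise ValueError("expected six hex digits, e.g. 695C3E")
    # Split into the red, green and blue channels (0-255 each).
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    # colorsys works on floats in the range 0.0-1.0.
    hue, saturation, brightness = colorsys.rgb_to_hsv(
        red / 255, green / 255, blue / 255
    )
    return {
        "skos:prefLabel": "#" + digits,
        "eg:red": digits[0:2],
        "eg:green": digits[2:4],
        "eg:blue": digits[4:6],
        "eg:hue": round(hue * 360),                # degrees
        "eg:saturation": round(saturation * 100),  # percent
        "eg:brightness": round(brightness * 100),  # percent
    }


description = describe_colour("#695C3E")
```

For `695C3E` this gives a hue of 42, a saturation of 41 and a brightness of 41, which matches the values in the description above.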
And what about those numbers? Would it be better if I said:
<http://example.com/id/concept/colour/rgb/695C3E>
  eg:red <http://example.com/id/concept/number/hex/69> .
and there was information about that resource:
<http://example.com/id/concept/number/hex/69>
  owl:sameAs <http://example.com/id/concept/number/105> .

<http://example.com/id/concept/number/105>
  a eg:Integer ;
  rdf:value 105 ;
  eg:divisor
    <http://example.com/id/concept/number/3> ,
    <http://example.com/id/concept/number/5> ,
    <http://example.com/id/concept/number/7> ;
  ... .
and so on.
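A description like this could be computed when someone asks for it rather than stored. The following is only a sketch: `describe_number` is a hypothetical helper, the URI patterns are the made-up ones from the example, and it assumes that 1 and the number itself count as divisors, which the example leaves open.

```python
BASE = "http://example.com/id/concept/number/"


def describe_number(n):
    """Return a description of a positive integer, computed on demand."""
    if n < 1:
        raise ValueError("this sketch only handles positive integers")
    # Trial division over 1..n is slow, but good enough for a sketch.
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    return {
        "uri": BASE + str(n),
        # The hex-based URI identifies the same number.
        "owl:sameAs": BASE + "hex/" + format(n, "X"),
        "rdf:value": n,
        "eg:divisor": [BASE + str(d) for d in divisors],
    }


description = describe_number(105)
```

Here `description["owl:sameAs"]` ends in `hex/69`, and the divisor list includes the resources for 3, 5 and 7 shown above.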
In fact, we already have identifiers for some of these resources. DBpedia inherits from Wikipedia information about (many) numbers and (some) dates. For example, check out what it says about the number 720 or the rather less helpful page on the year 1914.
What we lose is a certain level of ease of querying because the values that can be compared by SPARQL (say) are an extra step away. But it’s still doable so long as the resource has an rdf:value property holding the primitive literal for the type (one recognised by SPARQL). If I wanted to find married couples where the husband is younger than the wife, I could do something like:
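SELECT ?husband ?wife
WHERE {
  # eg:wife is an assumed property linking the couple; without some such
  # pattern the query would compare every pair of people
  ?husband eg:wife ?wife .
  ?husband eg:age ?husbandAgeResource .
  ?husbandAgeResource rdf:value ?husbandAge .
  ?wife eg:age ?wifeAgeResource .
  ?wifeAgeResource rdf:value ?wifeAge .
  FILTER (?husbandAge < ?wifeAge)
}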
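To make the extra hop concrete, here is a rough Python sketch that runs the same kind of query over a handful of in-memory triples. Everything in it is made up for illustration: the people, the shortened `number/...` identifiers, and the `eg:wife` property used to link each couple.

```python
# Hypothetical data: each age is a resource that carries an rdf:value.
TRIPLES = [
    ("#bob", "eg:wife", "#alice"),
    ("#bob", "eg:age", "number/34"),
    ("#alice", "eg:age", "number/36"),
    ("#dan", "eg:wife", "#carol"),
    ("#dan", "eg:age", "number/50"),
    ("#carol", "eg:age", "number/45"),
    ("number/34", "rdf:value", 34),
    ("number/36", "rdf:value", 36),
    ("number/45", "rdf:value", 45),
    ("number/50", "rdf:value", 50),
]


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def literal_age(person):
    """Follow person -> age resource -> rdf:value (the extra hop)."""
    for age_resource in objects(person, "eg:age"):
        for value in objects(age_resource, "rdf:value"):
            return value
    return None


def younger_husbands():
    """Couples where the husband's age literal is less than the wife's."""
    results = []
    for husband, predicate, wife in TRIPLES:
        if predicate != "eg:wife":
            continue
        husband_age, wife_age = literal_age(husband), literal_age(wife)
        if None not in (husband_age, wife_age) and husband_age < wife_age:
            results.append((husband, wife))
    return results


couples = younger_husbands()
```

With this data the only match is `("#bob", "#alice")`. The comparison itself is unchanged; the cost is one more lookup per value.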
One interesting aspect of these kinds of resources (and something Leigh promised to blog about too) is that they’re either infinite or have a large enough value space that it would be impractical to store all the information about them within a traditional triplestore. They could be made available as linked data easily enough since much of the interesting information about a colour or number would be derivable. But it might be difficult to provide a SPARQL endpoint for them. For example, consider:
SELECT ?number
WHERE {
  ?number eg:divisor <http://example.com/id/concept/number/3> .
}
ORDER BY ?number
LIMIT 10
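A store that only looks up triples can't answer this, but a service that understands the eg:divisor property could compute the answer lazily. The sketch below shows the idea in Python for this one query shape. The `numbers_with_divisor` function is hypothetical, and it assumes that only positive integers have resources and that a number counts as its own divisor.

```python
from itertools import count, islice

BASE = "http://example.com/id/concept/number/"


def numbers_with_divisor(divisor_uri):
    """Lazily yield, in ascending order, URIs of numbers with this divisor."""
    divisor = int(divisor_uri.rsplit("/", 1)[1])
    if divisor < 1:
        raise ValueError("this sketch only handles positive divisors")
    # The multiples of the divisor are exactly the numbers it divides,
    # and generating them in order gives us ORDER BY for free.
    for k in count(1):
        yield BASE + str(divisor * k)


# LIMIT 10: take the first ten results and stop generating.
first_ten = list(islice(numbers_with_divisor(BASE + "3"), 10))
```

This yields the resources for 3, 6, 9 and so on up to 30. It works because the query has a natural order and a limit; without the LIMIT the result set is infinite, and more general query patterns would need their own computation strategy.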
There are already linked data spaces a bit like this floating around. The URIs defined by LinkedGeoData are infinite, given that it accepts any number of decimal places for latitude and longitude (technically it defines resources for circular areas rather than points). The RDF/XML that we’re producing for UK Legislation is generated on demand based on a date which, for each item, can be any date between 1st February 1991 and the current date.
What do you think? Is it mad to use resources instead of literal values? Where do you stop? How can queries be carried out over these infinite (or extremely large) sets of resources?
Comments
Re: Resources for Values
This discussion sounds familiar from discussions about object-oriented design: When should I use primitive types, and when should I create a class that wraps the primitive type or an enumeration type?
The answer is that enumerations are usually a good idea, but all the plumbing for wrapping a primitive into another object just to be able to attach some interesting methods that return derived values is rarely worth the effort.
Back to RDF: With controlled vocabularies, it’s relatively easy for the party defining the schema to provide an exhaustive set of URIs for all the terms in the controlled vocabulary.
In your color example, there’s more work involved: Someone has to design, implement, deploy, and operate a service that allows resolution of all the URIs in this very large color URI space. A simple static file with a few dozen resources will no longer do.
And when that is the case, then it’s probably more cost-effective to just go with a literal.
Information about an entity that can easily be derived should be derived by the client. Expressing it as linked data, as in the divisor example you gave, is a bit silly, I think.
Here’s a hunch for which I don’t have a good justification; I’ll just put it out there. You mentioned the case where you want “dynamic” URI spaces where some part of the URI, such as a date or geographic coordinate, is used in some computation. This gives rise to a URI space containing an infinite number of URIs. I think those should only be URIs of documents. For example, let’s say you want a service that, based on latitude and longitude, gives you nearby places of interest. Don’t build the service like LinkedGeoData so that it has a URI for each point on the surface of the Earth. Build the service so that there is a document describing each point on the surface of the Earth, containing descriptions of the nearby places (which are a finite set coming from some database).
Re: Resources for Values
I think this is a delicate balancing act. While non-literal resources are generally more useful than literals, trying to eliminate literals from a dataset can end up fairly horrible. Where does it end?
That way madness lies. Vocab designers need to decide precisely where it becomes useful to use literals, and indicate this appropriately in their schemas (using rdfs:range).
Where to do this depends a lot on exactly what they hope to achieve from their vocabulary. If you want to model, say, products for sale, colour may be a small, incidental feature, and you can get away with using literals. If you are writing a tool to convert, say, CSS to RDF for a smart agent to reason with and determine, say, whether a particular design is accessible to the colour blind, then colours are going to be much more important to you, and there may be benefits in minting URIs for them.
And, for what it's worth...
http://ontologi.es/colour/F0A000