Establishing Trust by Describing Provenance

Oct 24, 2009

Update 2009-11-08: The developers of the Provenance Vocabulary tell me that the pattern I used below isn’t correct, and there doesn’t currently seem to be a method of describing what I want to describe using that vocabulary. But it’s still under development, so hopefully it will become usable soon.

One of my favourite tweets from Rob McKinnon (aka @delineator) is this one:

feeling upset RDF enthusiasts oversell RDF, ignoring creation, provenance, ambiguity, subjectivity + versioning problems #linkeddata #london

because it’s one of the things that bugs me on occasion too, and because the issues he mentions are so vitally important when we’re talking about public sector information but (because they’re the hard issues) are easy to de-prioritise in the rush to make data available.

Expressing Statistics with RDF

Oct 23, 2009

Update: If you’re interested in expressing statistics in RDF, I’d encourage you to join the publishing statistical data group and take a look at the documentation for ‘SDMX-RDF’ described there.

One of the things that we’ve been discussing over on the UK Government Data Developers mailing list is how best to represent, in RDF, the vast quantities of statistical data that the government produces. This is what we’ve come up with.

hmg.gov.uk/data and What We Can Do

Oct 3, 2009

This week, the Cabinet Office went live with a preview version of hmg.gov.uk/data, available only to those who subscribe to the UK Government Data Developers Google Group. Harry Metcalfe has written a great review, or of course you can check it out yourselves.

Already, though, there are discussions starting on the mailing list about how the data is being made available, and I’m worried that these might distract us from getting things done.

Resources for Values

Sep 26, 2009

When Leigh Dodds presented on Linked Data at the XML Summer School this year, one of the things he suggested was that when you have a controlled vocabulary, you should define resources for the terms in that vocabulary rather than having a fixed set of literal values.

For example, if you’re saying that the topic of a page is elephants, you should use a triple like:

<> dc:subject <http://example.com/id/concept/animal/elephant> .

rather than one like:

<> dc:subject "elephant"^^xsd:token .

There are two advantages of using a resource here rather than a literal value:

  • You can associate other metadata with the term. For example, you can use SKOS to describe it and its associations with terms from other controlled vocabularies, giving it multiple labels in different languages, providing a description and examples, and so on (there’s a sketch of this just after this list).

  • You can more accurately and easily associate multiple documents that use the same term. Of course you could always use string-based matching to pull together documents using the same subject, but that’s much more prone to error, and would leave you with less certainty about the results of your query.
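
As a rough sketch, such a SKOS description might look like this in Turtle (the labels, definition and matched concept URI here are invented for illustration):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://example.com/id/concept/animal/elephant>
  a skos:Concept ;
  skos:prefLabel "elephant"@en ;
  skos:prefLabel "éléphant"@fr ;
  skos:altLabel "pachyderm"@en ;
  skos:definition "A large herbivorous mammal of the family Elephantidae."@en ;
  skos:exactMatch <http://vocab.example.org/animals/Elephant> .

None of this extra information can hang off a plain literal like "elephant".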

What struck me, though, was that these arguments apply just as well to other typed values that we use within RDF. For example, say I wanted to describe the colour of my eyes. I could say something like:

<#me> eg:eyeColour "#695C3E"^^eg:colour .

But wouldn’t it be better to use:

<#me> eg:eyeColour <http://example.com/id/concept/colour/rgb/695C3E> .

The colour resource could have properties associated with it:

<http://example.com/id/concept/colour/rgb/695C3E>
  a eg:Colour ;
  skos:prefLabel "#695C3E" ;
  eg:red "69"^^eg:hex ;
  eg:green "5C"^^eg:hex ;
  eg:blue "3E"^^eg:hex ;
  eg:hue 42 ;
  eg:saturation 41 ;
  eg:brightness 41 ;
  eg:pantone ... ;
  ... .

and so on. And if other people pointed to the same resource, a semantic search engine could give you a list of things of that colour.
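
A sketch of the kind of query that makes possible (the URI is the example colour resource from above):

# find everything that references this particular colour, whatever the property
SELECT ?thing ?property
WHERE {
  ?thing ?property <http://example.com/id/concept/colour/rgb/695C3E> .
}

The equivalent query over literal values would have to match the exact lexical form "#695C3E", datatype and all.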

And what about those numbers? Would it be better if I said:

<http://example.com/id/concept/colour/rgb/695C3E>
  eg:red <http://example.com/id/concept/number/hex/69> .

and there was information about that resource:

<http://example.com/id/concept/number/hex/69>
  owl:sameAs <http://example.com/id/concept/number/105> .
  
<http://example.com/id/concept/number/105>
  a eg:Integer ;
  rdf:value 105 ;
  eg:divisor
    <http://example.com/id/concept/number/3> ,
    <http://example.com/id/concept/number/5> ,
    <http://example.com/id/concept/number/7> ;
  ... .

and so on.

In fact, we already have identifiers for some of these resources. DBpedia inherits from Wikipedia information about (many) numbers and (some) dates. For example, check out what it says about the number 720 or the rather less helpful page on the year 1914.

What we lose is a certain level of ease of querying, because the values that can be compared by SPARQL (say) are an extra step away. But it’s still doable so long as the resource has an rdf:value property holding the primitive literal for the type (one recognised by SPARQL). If I wanted to find married couples where the husband is younger than the wife, I could do something like:

SELECT ?husband ?wife
WHERE {
  ?husband eg:marriedTo ?wife .   # some property linking the couple
  ?husband eg:age ?husbandAgeResource .
  ?husbandAgeResource rdf:value ?husbandAge .
  ?wife eg:age ?wifeAgeResource .
  ?wifeAgeResource rdf:value ?wifeAge .
  FILTER (?husbandAge < ?wifeAge)
}
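
For that query to match anything, the data needs to follow the same pattern; something like this (the couple, the eg:marriedTo property and the eg: namespace URI are made up for the example):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix eg:  <http://example.com/ns#> .  # hypothetical example vocabulary

<#john> eg:marriedTo <#joan> ;
  eg:age <http://example.com/id/concept/number/34> .
<#joan> eg:age <http://example.com/id/concept/number/36> .

# the number resources carry the primitive literals that SPARQL can compare
<http://example.com/id/concept/number/34> rdf:value 34 .
<http://example.com/id/concept/number/36> rdf:value 36 .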

One interesting aspect of these kinds of resources (and something Leigh promised to blog about too) is that they’re either infinite or have a value space large enough that it would be impractical to store all the information about them within a traditional triplestore. They could be made available as linked data easily enough, since much of the interesting information about a colour or number can be derived on demand. But it might be difficult to provide a SPARQL endpoint for them. For example, consider:

SELECT ?number
WHERE {
  ?number eg:divisor <http://example.com/id/concept/number/3> .
}
ORDER BY ?number
LIMIT 10

There are already linked data spaces a bit like this floating around. The set of URIs defined by LinkedGeoData is infinite, given that it accepts any number of decimal places for latitude and longitude (technically it defines resources for circular areas rather than points). The RDF/XML that we’re producing for UK Legislation is generated on demand based on a date which, for each item, can be any date between 1st February 1991 and the current date.

What do you think? Is it mad to use resources instead of literal values? Where do you stop? How can queries be carried out over these infinite (or extremely large) sets of resources?

The HTML5 DOM and RDFa

Sep 24, 2009

One of the fundamental disconnects between HTML5 and previous versions of HTML is the way in which you answer the question “what is the structure of this page?”. Things that make use of that structure, such as RDFa, need to take this into account.

An example is the document:

<html>
  <head><title>HTML example</title></head>
  <body>
    <table>
      <span>Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>

There are two different ways in which you might interpret the structure of this document. First, you might take the structure to be as written, with the <span> element as a child of the <table> element, giving a tree that looks like:

+- html
   +- head
   | +- title
   +- body
      +- table
         +- span
         +- tr
            +- td 

Second, you might view the structure of the page to be the DOM as it is constructed by an HTML5 processor, which will move the <span> out from the table due to foster parenting, giving the result:

+- html
   +- head
   | +- title
   +- body
      +- span
      +- table
         +- tr
            +- td 
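
You can see the foster parenting for yourself by loading the example document in an HTML5-parsing browser and running a line of Javascript in the console (just a quick check, nothing to do with any spec):

// after HTML5 parsing, the <span> is a child of <body>, not <table>
var span = document.getElementsByTagName('span')[0];
alert(span.parentNode.nodeName); // "BODY", not "TABLE"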

Which way you view it doesn’t really matter at this point, but it does matter when you start to introduce markup whose interpretation depends on the structure of the page, such as RDFa. Let me introduce some RDFa markup into the document:

<html xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head><title>HTML+RDFa example</title></head>
  <body>
    <table about="http://example.com">
      <span property="dc:title">Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>

Now, if you view the structure to be as written, the <span> element is within the <table> element, and is therefore viewed as talking about whatever it is that the <table> element is talking about, namely http://example.com. So the RDF that you will glean from this page is:

<http://example.com> dc:title "Example title" .

On the other hand, if you view the structure to be that constructed by an HTML5 processor, the <span> element is not within the <table> element, and is therefore viewed as talking about whatever the document is talking about, namely the document itself. So the RDF that you will glean from the page is:

<> dc:title "Example title" .

This isn’t exactly a new problem. There has always been the possibility of Javascript embedded within a page changing the page by moving or inserting elements, making the page that a non-browser sees fundamentally different from the one that a browser sees. This has been used by SEO people and spam merchants to get search engines to direct people to pages which mutated into something different when they were actually visited by a browser. And this eventually led those people who cared about interpreting meaning from the structure of pages (ie search engines) to at least go some way towards evaluating the Javascript within the page in order to “see” the page as a human would.

So it’s not a new problem, but it’s still a problem.

For those people trying to define how RDFa is interpreted in HTML5, there are several unpleasant alternatives:

  1. Define RDFa as operating over an HTML5 DOM. This would make things easy for Javascript implementations in as much as they can rely on being used with HTML5 DOMs, ie in HTML5 browsers. But it raises the implementation burden for other implementations, such as those based on XSLT or a simple tidy-then-interpret-as-XML approach: essentially every implementation will need to include an HTML5 parsing library.

  2. Define RDFa as operating over a DOM, but leave the creation of that DOM as implementation-defined. This effectively passes the buck (“it’s not our fault that HTML5 processors will construct a different DOM from XML processors”) but makes it hard to test implementation conformance and for authors to know exactly how their page will be interpreted. For example, an implementation that constructed a DOM with randomly rearranged elements would be entirely conformant despite producing completely different triples from one that took the elements in the original order.

  3. Define RDFa as operating over a serialisation, with precise wording that describes how that serialisation is mapped into a tree structure that is walked to process the RDFa within the page. This approach will prevent implementations that use other methods of constructing trees from being conformant; depending on how it’s defined that might include XSLT implementations and/or Javascript implementations and/or implementations that use standard (XML-based) libraries for parsing the documents.

Personally I lean towards the second of these: defining RDFa as operating over a DOM but placing no constraints on how that DOM is created. It leaves Javascript implementations free to work on the DOM they see, which may be radically different from the one seen by other processors, due both to HTML5’s reordering of elements and to dynamic modification of the page through Javascript. (Several people use rdfQuery to do before-and-after parsing of RDFa within a page, turning browsers into semantic editors, for example.) But it also lets conformant implementations be constructed in other ways for implementation ease or user needs, supporting the use of XSLT through GRDDL and the static crawling of content with minimal processing.

Perhaps the set of permissible methods of DOM creation could be listed to prevent completely random processing, but I expect that it will be effectively limited through social and technological pressures. Implementations that build DOMs in random ways aren’t going to be as useful (to their users) as those that build them in expected ways; it’s also going to be far easier to implement RDFa processors using standard parsing libraries.

The approach is not without its downsides, of course. XSLT is similarly defined as operating over a tree model, with the question of how that tree model is constructed left to the implementation. Most processors decided to construct the tree using standard XML parsing, but famously MSXML would strip certain whitespace-only text nodes from the tree (unless you specified a parsing flag telling it not to), leading to incompatibilities and user confusion.
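
A minimal illustration of the difference:

<root>
  <child/>
</root>

With standard XML parsing, <root> has three children: a whitespace-only text node, the <child> element and another whitespace-only text node; with MSXML’s default stripping it has just one, so an XPath expression like count(/root/node()) returns 3 in one processor and 1 in the other.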

My guess is that the same kind of thing will happen with RDFa processors. It could very well be the case that an author will:

  • check their RDFa in an RDFa validator that constructs a static HTML5 DOM, revealing one set of triples
  • be confused when they then use a Javascript RDFa library within their page and get a slightly different set of triples because of some Javascript embedded in the page that changes its structure
  • be further confused when a search engine that uses a tidy-and-interpret-as-XML approach gleans yet another slightly different set of triples and displays it in the search result

So if this approach were chosen, I would expect wording in the specification that required implementations to state the method they used to create the DOM (ie it should be implementation-defined rather than implementation-dependent) and that warned authors of the most likely causes of differences between implementations (such as tree modifications performed by HTML5 processors and Javascript within the page). I’d also like to see tools that take an HTML page and indicate the triples that it generates under different common DOM construction methods, so that authors can see the variation in how their documents might be interpreted.