<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.jenitennison.com/blog" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>linked data</title>
 <link>http://www.jenitennison.com/blog/taxonomy/term/46</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>What Do URIs Mean Anyway?</title>
 <link>http://www.jenitennison.com/blog/node/159</link>
 <description>&lt;p&gt;If you&amp;#8217;ve hung around in linked data circles for any amount of time, you&amp;#8217;ll probably have come across the &lt;a href=&quot;http://www.w3.org/wiki/HttpRange14Webography&quot;&gt;httpRange-14 issue&lt;/a&gt;. This was an issue placed before the &lt;a href=&quot;http://www.w3.org/2001/tag/&quot;&gt;W3C TAG&lt;/a&gt; years and years ago which has become a &lt;a href=&quot;http://en.wiktionary.org/wiki/permathread&quot;&gt;permathread&lt;/a&gt; on semantic web and linked data mailing lists. The basic question (or my interpretation of it) is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Given that URIs can sometimes be used to name things that aren&amp;#8217;t on the web (eg the novel Moby Dick) and sometimes things that are (eg the &lt;a href=&quot;http://en.wikipedia.org/wiki/Moby-Dick&quot;&gt;Wikipedia page about Moby Dick&lt;/a&gt;), how can you tell, for a given URI, how it&amp;#8217;s being used so that you can work out what a statement (say, about its author) means?&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;One answer is to use a &lt;a href=&quot;http://www.jenitennison.com/blog/node/154&quot;&gt;hash URI&lt;/a&gt; whenever you want to refer to something that doesn&amp;#8217;t live on the web, with the base URI providing information about that thing. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; is the URI for the Wikipedia page&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick#thing&lt;/code&gt; is a URI for the novel itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem some people (including me) have with this is that hash URIs are primarily used to indicate portions of a web page, and using them for things that aren&amp;#8217;t page fragments overloads them. It&amp;#8217;s also an inflexible method, because the server isn&amp;#8217;t told what the fragment identifier is, and therefore it can&amp;#8217;t be used as the basis for a redirection, for example.&lt;/p&gt;
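
&lt;p&gt;To make the second point concrete, here is a minimal Python sketch of what a client does with a hash URI before it makes a request. It relies on the standard library function &lt;code&gt;urllib.parse.urldefrag&lt;/code&gt;; the helper name &lt;code&gt;request_target&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
from urllib.parse import urldefrag


def request_target(uri):
    """Split a URI into the part sent to the server and the fragment kept by the client."""
    base, fragment = urldefrag(uri)
    return base, fragment


novel = "http://en.wikipedia.org/wiki/Moby-Dick#thing"
page = "http://en.wikipedia.org/wiki/Moby-Dick"

# Both URIs lead to a request for the same document, so the server
# cannot respond differently (for example with a redirection) for the novel.
print(request_target(novel))
print(request_target(page))
```

&lt;p&gt;Because the fragment never leaves the client, anything that distinguishes the novel from the page has to happen after the document has been retrieved.&lt;/p&gt;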

&lt;p&gt;The &lt;a href=&quot;http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html&quot;&gt;2005 TAG resolution&lt;/a&gt; for people who wanted to use separate non-hash URIs, such as [&lt;em&gt;warning, made-up URIs&lt;/em&gt;]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; is the URI for the Wikipedia page&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http://wikipedia.org/thing/Moby-Dick&lt;/code&gt; is the URI for the novel itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;if you get a &lt;code&gt;2XX&lt;/code&gt; response when you request a URI, that URI refers to a document (the document that you get back)&lt;/li&gt;
&lt;li&gt;if you get a &lt;code&gt;303&lt;/code&gt; response when you request a URI, that URI could refer to anything, and the resource you get by following the redirection describes that thing (hence if a URI should refer to something that isn&amp;#8217;t on the web then requests to it should respond with a 303)&lt;/li&gt;
&lt;li&gt;if you get a &lt;code&gt;4XX&lt;/code&gt; response when you request a URI, that URI could represent anything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This leads to the &lt;code&gt;303&lt;/code&gt; pattern described for example within &lt;a href=&quot;http://www.w3.org/TR/cooluris/#r303gendocument&quot;&gt;Cool URIs for the Semantic Web&lt;/a&gt;; in the example here, the response to &lt;code&gt;http://wikipedia.org/thing/Moby-Dick&lt;/code&gt; would be a 303 redirection to &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt;.&lt;/p&gt;
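
&lt;p&gt;The three rules can be sketched roughly in Python as follows. This is only an illustration of my reading of the resolution: the function name &lt;code&gt;interpret_response&lt;/code&gt; and the dictionary keys are invented here, and no real HTTP request is made.&lt;/p&gt;

```python
def interpret_response(status, location=None):
    """Apply the 2005 TAG resolution to the status code returned for a URI."""
    if status // 100 == 2:
        # 2XX: the URI names the document that came back.
        return {"refers_to": "document", "description": None}
    if status == 303:
        # 303: the URI could name anything; the redirect target describes it.
        return {"refers_to": "anything", "description": location}
    if status // 100 == 4:
        # 4XX: the URI could name anything, and this response offers no description.
        return {"refers_to": "anything", "description": None}
    # The resolution says nothing about other status codes.
    return {"refers_to": "unspecified", "description": None}


# The made-up novel URI responds with a 303 pointing at the Wikipedia page.
print(interpret_response(303, "http://en.wikipedia.org/wiki/Moby-Dick"))
print(interpret_response(200))
```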

&lt;p&gt;Six years later, we have a lot of experience with this technique of distinguishing between things that are or are not on the web, and it has a bunch of practical limitations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it requires access to web server configuration (to add &lt;code&gt;303&lt;/code&gt; redirections), which makes life difficult for people without that level of access&lt;/li&gt;
&lt;li&gt;URIs for things that aren&amp;#8217;t on the web always require two round-trips to get hold of information, as the first always responds with a &lt;code&gt;303&lt;/code&gt; redirection, which adds server load and slows things down (this is made worse as &lt;code&gt;303&lt;/code&gt; responses can&amp;#8217;t be cached &amp;#8212; an oversight in the HTTP spec that I gather is fixed in &lt;a href=&quot;http://tools.ietf.org/wg/httpbis/&quot;&gt;HTTPbis&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;using the &lt;code&gt;303&lt;/code&gt; pattern requires a level of knowledge and understanding that is beyond most web developers, particularly if they get no benefit from taking care over their use of URIs (for example, Facebook, schema.org and so on all encourage the use of URIs for non-web things without a word about &lt;code&gt;303&lt;/code&gt; redirections)&lt;/li&gt;
&lt;li&gt;even people who do have this knowledge and understanding sometimes find it hard to work out whether a particular thing that they want to talk about is a thing-on-the-web or not and therefore whether the use of a &lt;code&gt;303&lt;/code&gt; redirection is required&lt;/li&gt;
&lt;li&gt;even people who &lt;em&gt;do&lt;/em&gt; try to take care in their use of URIs easily make mistakes because we interact with URIs by copy-and-pasting them from browser address bars, and the only URIs that appear there are URIs for things on the web&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, while the web architectural principles behind the use of &lt;code&gt;303&lt;/code&gt; redirections are (arguably!) sound, the collective experience of the past six years indicates that many publishers will not use it because they don&amp;#8217;t know to, because they don&amp;#8217;t care to, because they make mistakes or because they simply can&amp;#8217;t while meeting the other practical constraints of their project.&lt;/p&gt;

&lt;p&gt;A number of other approaches have been suggested, before and after the TAG decision, many of which are documented within the draft TAG finding &lt;a href=&quot;http://www.w3.org/2001/tag/awwsw/issue57/latest/&quot;&gt;Providing and discovering definitions of URIs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first observation that I want to make is that many of the objections to the &lt;code&gt;303&lt;/code&gt; pattern are about the practicalities of publishers using it. Therefore, any suggestions to provide an alternative technique that involves&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;introducing new URI schemes (eg &lt;code&gt;tdb&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;introducing new HTTP methods (eg &lt;code&gt;MGET&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;introducing new HTTP status codes (eg &lt;code&gt;209&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;using particular HTTP headers (eg &lt;code&gt;Link&lt;/code&gt; or &lt;code&gt;Content-Location&lt;/code&gt; or other specialist headers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are not going to be widely used for exactly the same reason. I&amp;#8217;m not at all persuaded that it&amp;#8217;s worth spending time developing them.&lt;/p&gt;

&lt;p&gt;My second observation is that there are three questions that are being conflated and we might make more progress if we separated them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?&lt;/li&gt;
&lt;li&gt;How can you tell what a reference to a particular URI within a piece of data (eg an RDF statement) means?&lt;/li&gt;
&lt;li&gt;How can you get from a URI to information about whatever that URI refers to?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Ambiguity in URIs&lt;/h2&gt;

&lt;p&gt;Both the hash URI pattern and the &lt;code&gt;303&lt;/code&gt; pattern make the assumption that you need to have separate URIs for things that are not on the web (eg books) and documents on the web about them (eg pages about books). This is useful because it enables people to make separate statements about the author of a book:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://wikipedia.org/thing/Moby-Dick&amp;gt; 
  dct:creator &amp;lt;http://wikipedia.org/thing/Herman_Melville&amp;gt; ;
  .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;from the authors of the Wikipedia page about that book:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://en.wikipedia.org/wiki/Moby-Dick&amp;gt;
  dct:creator 
    &amp;lt;http://wikipedia.org/user/Aristophanes68&amp;gt; ,
    &amp;lt;http://wikipedia.org/user/SporkBot&amp;gt; ,
    &amp;lt;http://wikipedia.org/user/Curb_Chain&amp;gt; ,
    ...
  .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we only have the URI &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; then we run into difficulties interpreting statements made about that URI, and indeed different people might use the URI in different ways, or make some statements that use the URI to mean the novel and some to mean the Wikipedia page.&lt;/p&gt;

&lt;p&gt;So there are good reasons to have two separate URIs in these cases.&lt;/p&gt;

&lt;p&gt;But the fact is that many publishers currently have a one-URI-fits-all policy. And even if they don&amp;#8217;t, people reusing those URIs will often make mistakes and use the wrong one. It would be nice if we could make the world see that this leads to all sorts of logical problems for the Semantic Web, but I just can&amp;#8217;t see that happening.&lt;/p&gt;

&lt;p&gt;This situation reminds me of one of the central innovations that the web had over previous hypertext systems. There is a &lt;a href=&quot;http://www.w3.org/2006/09dc-aus/swpf#(7)&quot;&gt;great slide&lt;/a&gt; by &lt;a href=&quot;http://en.wikipedia.org/wiki/Dan_Connolly&quot;&gt;Dan Connolly&lt;/a&gt; which roughly looks like:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;table border=&quot;1&quot;&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Web&lt;/th&gt;
      &lt;th&gt;Semantic Web&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Traditional Design&lt;/th&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;hypertext&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;logic/database&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;+&lt;/th&gt;
      &lt;td colspan=&quot;2&quot; style=&quot;text-align: center&quot;&gt;URIs&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;-&lt;/th&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;link integrity&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;?&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;=&lt;/th&gt;
      &lt;td colspan=&quot;2&quot; style=&quot;text-align: center&quot;&gt;viral growth&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/table&gt;
  &lt;p&gt;Are there parts of traditional logic and databases that, if we set them aside, will result in viral growth of the Semantic Web?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(By the way, in case my replication of this slide is interpreted incorrectly: I&amp;#8217;m certainly not implying that viral growth of the Semantic Web is an end in itself, though I would like to see viral growth in data sharing.)&lt;/p&gt;

&lt;p&gt;Dropping the requirement for link integrity, coping with the fact that sometimes links would break, was what made the web work. It would have been simply impossible to build the web as a decentralised system if there had been a requirement for links to always work.&lt;/p&gt;

&lt;p&gt;Of course that doesn&amp;#8217;t mean that we &lt;em&gt;like it&lt;/em&gt; when links get broken. There&amp;#8217;s oodles of best practice advice out there on making sure that you retain support for old URIs if you change your web space; we have backup systems in place in the form of web archives so we can work out what was once at the end of a particular URI; and the resolvability of links is something a linter will check about your website.&lt;/p&gt;

&lt;p&gt;So it&amp;#8217;s not that when he developed the web TimBL rejected entirely the very concept of link integrity, it&amp;#8217;s that he recognised that we have to work with the imperfection of the real world. Links break. HTTP copes. Browsers cope. People cope.&lt;/p&gt;

&lt;p&gt;The imperfection of the real world as it applies to linked data is that &lt;a href=&quot;http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html&quot;&gt;URIs will be used in ambiguous ways&lt;/a&gt;. We might not like it; we might write best practice documents that encourage people to have separate URIs for web-thing and non-web-thing, develop tools that help people detect when they&amp;#8217;ve used the wrong URI, and so on. But it will still happen, and in my opinion we need to work out how to cope.&lt;/p&gt;

&lt;p&gt;In fact, ambiguity in URIs goes much further than just a confusion between the Wikipedia page about Moby Dick and the novel Moby Dick itself. URIs are names, and names are used by different people to mean different things. The same URI might end up meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Wikipedia page about Moby Dick&lt;/li&gt;
&lt;li&gt;the novel Moby Dick&lt;/li&gt;
&lt;li&gt;the whale Moby Dick&lt;/li&gt;
&lt;li&gt;the story Moby Dick (originally a novel but later adapted as a film)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;and so on&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if the publisher provides a clear and unambiguous definition about what the URI &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; means, other people will use it to mean something different because it&amp;#8217;s close enough for what they want to say.&lt;/p&gt;

&lt;p&gt;So I think the answer to the first question I posed &amp;#8212; &amp;#8220;Must publishers provide separate URIs for things-on-the-web and the non-web-things that they describe?&amp;#8221; &amp;#8212; has to be &amp;#8220;No, though it is good practice to.&amp;#8221; We can fight against ambiguity, but we have to accept that we cannot win.&lt;/p&gt;

&lt;h2&gt;Disambiguating Statements&lt;/h2&gt;

&lt;p&gt;As discussed above, in a perfect world, we would have separate URIs for things-on-the-web and non-web-things and any data that we published about Moby Dick would use the URI for the Wikipedia page to talk about things like the licence for that information, or how the information was created (its provenance), and the URI for the novel to talk about things like the licence for the novel and what characters appeared in it.&lt;/p&gt;

&lt;p&gt;But the world is not perfect, and we are going to end up with situations where the same URI is used to refer to a whole range of different things. How do we cope?&lt;/p&gt;

&lt;p&gt;Well, first let me say that I don&amp;#8217;t see people merging data together willy-nilly and hoping to get something useful out of it. URIs give us connection points and RDF gives us a flexible data model, which means that merging data can be easier than the kinds of custom merging that you have to do with CSV and JSON, but I don&amp;#8217;t think it can ever remove entirely the requirement for curation. We want to ensure that the need for intervention in merging two datasets is kept to a minimum, but we can&amp;#8217;t expect it to be entirely removed.&lt;/p&gt;

&lt;p&gt;So with that in mind, there are at least three techniques that can be used to get useful data out of a world in which the same URI is used to mean different things.&lt;/p&gt;

&lt;h3&gt;One-Step-Removed Properties&lt;/h3&gt;

&lt;p&gt;The first technique is to interpret particular properties as describing a one-or-more-step-removed relationship between a resource and a value. For example, the &lt;code&gt;bib:author&lt;/code&gt; and &lt;code&gt;dct:creator&lt;/code&gt; properties would be defined such that the RDF statements&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://en.wikipedia.org/wiki/Moby-Dick&amp;gt;
  bib:author &amp;lt;http://en.wikipedia.org/wiki/Herman_Melville&amp;gt; ;
  dct:creator &amp;lt;http://en.wikipedia.org/wiki/User:Aristophanes68&amp;gt; ;
  .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;would be interpreted as saying&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The &lt;strong&gt;topic of the page&lt;/strong&gt; &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; was authored by the &lt;strong&gt;topic of the page&lt;/strong&gt; &lt;code&gt;http://en.wikipedia.org/wiki/Herman_Melville&lt;/code&gt;. The creator of the page &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; is the &lt;strong&gt;topic of the page&lt;/strong&gt; &lt;code&gt;http://en.wikipedia.org/wiki/User:Aristophanes68&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The biggest problem with the global application of this approach is that there are a lot of existing properties defined in vocabularies such as FOAF or Dublin Core that aren&amp;#8217;t defined as one-step-removed properties. One publisher might use &lt;code&gt;dct:creator&lt;/code&gt; to link to &amp;#8220;a page describing the creator of this page&amp;#8221; and another might use it to point directly to a (non-web-thing) URI for the creator of the page. So practically, this approach requires the interpretation of properties to be done on a dataset-by-dataset basis. Which leads onto the next approach.&lt;/p&gt;

&lt;h3&gt;Named Graphs&lt;/h3&gt;

&lt;p&gt;A second technique would be to make the assumption that within a single dataset, a single URI has a single meaning, but that the meaning may differ between datasets. I suspect that this is true even when publishers attempt to take care about which URI they use, because, like names, the meaning of a URI is slightly different depending on its use.&lt;/p&gt;

&lt;p&gt;Re-users of data need to work out whether the way URIs are used in one dataset is close enough to the way they are used in another dataset, to ascertain whether it&amp;#8217;s appropriate to simply merge the datasets or whether something slightly more complicated needs to be done to bring the datasets together.&lt;/p&gt;

&lt;p&gt;The problem with this approach is that it raises the barrier to joining together graphs: you can&amp;#8217;t just bung the data into a triplestore and perform queries on it, you have to work out some kind of mapping between the datasets up front.&lt;/p&gt;

&lt;h3&gt;Duck Typing&lt;/h3&gt;

&lt;p&gt;The final technique that I&amp;#8217;ll talk about here is to say that different applications need to access different properties, and can ignore any properties that don&amp;#8217;t fit with how they want to use the data. It is relatively rarely useful to have generic RDF viewers; people (generally) build applications to answer questions and perform tasks, not to just browse around data.&lt;/p&gt;

&lt;p&gt;For example, if a single dataset were to contain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://en.wikipedia.org/wiki/Moby-Dick&amp;gt;
  a bib:Book ;
  bib:author &amp;lt;http://en.wikipedia.org/wiki/Herman_Melville&amp;gt; ;
  a foaf:Document ;
  dct:creator &amp;lt;http://en.wikipedia.org/wiki/User:Aristophanes68&amp;gt; ;
  .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then an application that was interested in gathering data about books would only care about the fact that &lt;code&gt;http://en.wikipedia.org/wiki/Moby-Dick&lt;/code&gt; was a book with an author of &lt;code&gt;http://en.wikipedia.org/wiki/Herman_Melville&lt;/code&gt; and wouldn&amp;#8217;t care about the FOAF or Dublin Core classes or properties associated with the URI. An application that was interested in gathering information about the authorship of documents on the web, on the other hand, might look for the &lt;code&gt;foaf:Document&lt;/code&gt; class and Dublin Core properties and ignore everything else.&lt;/p&gt;
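
&lt;p&gt;A minimal Python sketch of the idea, with the triples above written as plain tuples rather than parsed from Turtle. The function name &lt;code&gt;view&lt;/code&gt; and the tuple representation are assumptions made for illustration, not part of any RDF library.&lt;/p&gt;

```python
MOBY = "http://en.wikipedia.org/wiki/Moby-Dick"
MELVILLE = "http://en.wikipedia.org/wiki/Herman_Melville"
EDITOR = "http://en.wikipedia.org/wiki/User:Aristophanes68"

# The example dataset as (subject, property, value) tuples.
TRIPLES = [
    (MOBY, "rdf:type", "bib:Book"),
    (MOBY, "bib:author", MELVILLE),
    (MOBY, "rdf:type", "foaf:Document"),
    (MOBY, "dct:creator", EDITOR),
]


def view(triples, wanted_type, wanted_properties):
    """Keep only the properties an application cares about, for resources of one type."""
    subjects = {s for s, p, o in triples if p == "rdf:type" and o == wanted_type}
    result = {}
    for s, p, o in triples:
        if s in subjects and p in wanted_properties:
            result.setdefault(s, {}).setdefault(p, []).append(o)
    return result


# A book application and a document application read the same URI differently.
books = view(TRIPLES, "bib:Book", {"bib:author"})
documents = view(TRIPLES, "foaf:Document", {"dct:creator"})
print(books)
print(documents)
```

&lt;p&gt;Neither application has to decide what the URI really means; each simply ignores the statements that do not fit its own use of the data.&lt;/p&gt;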

&lt;p&gt;To me, this approach seems the most promising way of retaining the core benefits of RDF. It seems more robust in the face of user error than the idea of defining one-step-removed properties, and retains the ease of mashing together data from different sources in a way that you wouldn&amp;#8217;t get if you had to think about the URI usage within each of the datasets that you want to bring together.&lt;/p&gt;

&lt;h2&gt;Locating Data From URIs&lt;/h2&gt;

&lt;p&gt;And so we get to the final question: how should people be able to get from a URI to information about whatever the URI refers to?&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve discussed above how I think distinguishing between things-on-the-web and non-web-things has to be seen as a best practice. I think we should continue to recommend the &lt;code&gt;303&lt;/code&gt; or hash URI methods as the best practice for accessing data from a URI. My reason for this is that introducing yet another method will just make it harder for publishers to know which method to use when, plus I don&amp;#8217;t want to see people who have adopted these techniques in good faith being told that they were doing the wrong thing all along. What I&amp;#8217;d like to aim to do is to find a way of fitting these methods into a larger approach.&lt;/p&gt;

&lt;p&gt;I also recognise the argument that articulating the relationships between on-the-web and not-on-the-web resources purely through HTTP responses isn&amp;#8217;t ideal. It&amp;#8217;s useful to have explicit links between resources within the data itself. Within the linked data work that I&amp;#8217;ve done for &lt;code&gt;data.gov.uk&lt;/code&gt; I&amp;#8217;ve tried to adopt a pattern of explicitly using &lt;code&gt;foaf:primaryTopic&lt;/code&gt;, &lt;code&gt;foaf:primaryTopicOf&lt;/code&gt; and &lt;code&gt;foaf:page&lt;/code&gt; to link together the different resources. Other people have suggested the &lt;a href=&quot;http://www.w3.org/2007/05/powder-s#describedby&quot;&gt;&lt;code&gt;wdrs:describedby&lt;/code&gt;&lt;/a&gt; property for pointers from a resource to information about that resource; &lt;code&gt;rdfs:isDefinedBy&lt;/code&gt; performs a similar function for classes and properties within RDFS.&lt;/p&gt;

&lt;p&gt;It would be nice to have one defined property or set of properties to describe these relationships, but we have to recognise that not everyone will use them, so the approach we take has to work when these links aren&amp;#8217;t present. The majority of people and sites are going to start off by publishing data about something at a single URI, and simply return data about that thing (a &lt;code&gt;200&lt;/code&gt; response) when the URI is requested. If they then progress to wanting to have separate URIs for that thing and the page about the thing, or indeed to disambiguate the URI that they&amp;#8217;ve used in some other way, we need to make it easy for them to do so.&lt;/p&gt;

&lt;p&gt;I think we need two properties: &lt;code&gt;eg:describedBy&lt;/code&gt; and &lt;code&gt;eg:couldBe&lt;/code&gt;. &lt;code&gt;eg:describedBy&lt;/code&gt; describes the link between a resource (of any type) and a document that describes it; &lt;code&gt;eg:couldBe&lt;/code&gt; is a disambiguation link that points from a URI to other possible, more precise, URIs.&lt;/p&gt;

&lt;p&gt;Then I think we need some rules along the lines of (I don&amp;#8217;t pretend these are entirely worked out):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if you get a &lt;code&gt;303&lt;/code&gt; response redirecting to &lt;code&gt;U&#039;&lt;/code&gt; when you fetch a URI &lt;code&gt;U&lt;/code&gt; then behave as if the response from &lt;code&gt;U&#039;&lt;/code&gt; included the triple &lt;code&gt;U eg:describedBy U&#039;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;if the URI &lt;code&gt;U&lt;/code&gt; is a hash URI whose base URI is &lt;code&gt;U&#039;&lt;/code&gt; then behave as if the response from &lt;code&gt;U&#039;&lt;/code&gt; included the triple &lt;code&gt;U eg:describedBy U&#039;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;if you get a &lt;code&gt;2XX&lt;/code&gt; response in response to a URI &lt;code&gt;U&lt;/code&gt; then:
&lt;ul&gt;&lt;li&gt;if there are multiple triples that match the pattern &lt;code&gt;U eg:describedBy ?page&lt;/code&gt; then assume that the document you have is &lt;code&gt;U&#039;&lt;/code&gt; where &lt;code&gt;U&#039;&lt;/code&gt; &lt;code&gt;eg:couldBe&lt;/code&gt; any of the &lt;code&gt;?page&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;otherwise, if there is a single triple that matches the pattern &lt;code&gt;U eg:describedBy ?page&lt;/code&gt; then assume that the document that you have is &lt;code&gt;?page&lt;/code&gt; and it is about &lt;code&gt;U&lt;/code&gt; (along with other things, possibly); statements about &lt;code&gt;?page&lt;/code&gt; might include information about the licence or provenance of the returned document&lt;/li&gt;
&lt;li&gt;if there are any triples that match the pattern &lt;code&gt;?thing eg:describedBy U&lt;/code&gt; then assume that the document you have is &lt;code&gt;U&lt;/code&gt; and it is about (possibly multiple) &lt;code&gt;?thing&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;otherwise, behave as if there is a triple &lt;code&gt;U eg:describedBy U&lt;/code&gt;; in this case, &lt;code&gt;U&lt;/code&gt; is being used in an ambiguous way&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
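
&lt;p&gt;As a rough illustration only, part of the rules above could be sketched in Python like this. The function name &lt;code&gt;implied_description&lt;/code&gt; and the tuple representation of triples are assumptions, and the sketch deliberately leaves out the multiple-page &lt;code&gt;eg:couldBe&lt;/code&gt; case: it just returns whichever &lt;code&gt;eg:describedBy&lt;/code&gt; triples a client should assume.&lt;/p&gt;

```python
from urllib.parse import urldefrag

DESCRIBED_BY = "eg:describedBy"


def implied_description(uri, status, triples, location=None):
    """Return the eg:describedBy triples to assume after requesting uri."""
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the base URI is taken to describe it.
        return [(uri, DESCRIBED_BY, base)]
    if status == 303 and location:
        # 303: the redirect target is taken to describe the URI.
        return [(uri, DESCRIBED_BY, location)]
    if status // 100 == 2:
        # 2XX: prefer explicit statements that mention the URI on either side.
        explicit = [t for t in triples if t[1] == DESCRIBED_BY and uri in (t[0], t[2])]
        if explicit:
            return explicit
        # No explicit statement: the URI is being used ambiguously.
        return [(uri, DESCRIBED_BY, uri)]
    return []


thing = "http://wikipedia.org/thing/Moby-Dick"
page = "http://en.wikipedia.org/wiki/Moby-Dick"
print(implied_description(thing, 303, [], page))
print(implied_description(page, 200, []))
```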

&lt;p&gt;We could go further and say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if there are two triples that match the pattern &lt;code&gt;U eg:couldBe ?page . ?thing eg:describedBy ?page&lt;/code&gt; then assume that the document you have is &lt;code&gt;?page&lt;/code&gt; and it is about &lt;code&gt;?thing&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;if there are two triples that match the pattern &lt;code&gt;U eg:couldBe ?thing . ?thing eg:describedBy ?page&lt;/code&gt; then assume that the document you have is &lt;code&gt;?page&lt;/code&gt; and it is about &lt;code&gt;?thing&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, if someone starts off using &lt;code&gt;U&lt;/code&gt; in an ambiguous way, or to mean only the page or only the thing, they can later add &lt;code&gt;eg:describedBy&lt;/code&gt; and &lt;code&gt;eg:couldBe&lt;/code&gt; statements to disambiguate and add information about the page or thing the page describes.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s worth bearing in mind that we shouldn&amp;#8217;t just be concerned about locating information about things that aren&amp;#8217;t on the web, but about things that &lt;em&gt;are&lt;/em&gt; on the web but that cannot have metadata embedded within them. For example, how do we discover the licence associated with a particular image? Although there are methods of embedding metadata within image and other binary formats, such as &lt;a href=&quot;http://en.wikipedia.org/wiki/Extensible_Metadata_Platform&quot;&gt;XMP&lt;/a&gt;, it&amp;#8217;s still useful to be able to locate metadata about images based on their URI.&lt;/p&gt;

&lt;p&gt;With a scheme such as that described above, publishers that used content negotiation to return some data about the image in another format could use &lt;code&gt;eg:describedBy&lt;/code&gt; to indicate that the returned document is about the image (or set of images in different formats).&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;The summary of my thinking is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we should learn to cope with ambiguity in URIs&lt;/li&gt;
&lt;li&gt;we should not constrain how applications manage that ambiguity, though duck typing seems the most promising approach to me&lt;/li&gt;
&lt;li&gt;we should define some specific properties that can be used to disambiguate URIs, describe their defaults with &lt;code&gt;303&lt;/code&gt;s and hash URIs and provide an easy upgrade path as publishers choose to add more specificity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key will be how we find practical ways to cope with the real, imperfect, fuzzy web of data while providing an evolutionary path to greater clarity and specificity that publishers can take when they see the benefit of doing so.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/159#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/69">uris</category>
 <pubDate>Tue, 05 Jul 2011 22:06:33 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">159 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>TAG F2F, June 2011</title>
 <link>http://www.jenitennison.com/blog/node/158</link>
 <description>&lt;p&gt;As you may know, I accepted an appointment to the &lt;a href=&quot;http://www.w3.org/2001/tag/&quot;&gt;W3C&amp;#8217;s Technical Architecture Group&lt;/a&gt; earlier this year. Last week was the first face-to-face meeting that I attended, hosted in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Ray_and_Maria_Stata_Center&quot;&gt;Stata Center&lt;/a&gt; at MIT. As you can tell from the &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda&quot;&gt;agenda&lt;/a&gt; (which was in fact revised as we went along) it was a packed three days.&lt;/p&gt;

&lt;p&gt;What I intend to do here is to briefly report on the major areas that we discussed and give a tiny bit of my own personal take on them. In no way should any of what I write here be judged as revealing the official opinion of the TAG, it&amp;#8217;s just me saying what I think, and I&amp;#8217;m not going to go into anything in depth because they&amp;#8217;re all incredibly gnarly and contentious topics and I&amp;#8217;d not only be here all year but also end up in a tar pit.&lt;/p&gt;


&lt;h2&gt;Role of the TAG&lt;/h2&gt;

&lt;p&gt;Usefully for me as a newcomer, our first session was about the ongoing role of the TAG. The TAG occupies a unique position within the W3C. According to its &lt;a href=&quot;http://www.w3.org/2004/10/27-tag-charter.html&quot;&gt;charter&lt;/a&gt; it was set up&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To improve the effectiveness of Working Groups, to reduce misunderstandings and overlapping work, and to improve the consistency of Web technologies developed inside and outside W3C&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The TAG ultimately has three routes to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;by providing specific advice on issues that are brought to its attention&lt;/li&gt;
&lt;li&gt;by writing documents on basic web architecture principles that go through community review, particularly the general review of the W3C standards track, and become Recommendations&lt;/li&gt;
&lt;li&gt;by advising the W3C Director (Tim Berners-Lee) about what he should do on the extremely rare occasions when there are issues that he is supposed to adjudicate on&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In none of these cases is there anything that binds the people receiving the advice of the TAG, or reading Findings or Recommendations made by the TAG, to accept them or do anything about them. The power and authority of the TAG depends solely on the quality and utility of its arguments, which is how it should be in my opinion.&lt;/p&gt;

&lt;h2&gt;Client-Side Application State&lt;/h2&gt;

&lt;p&gt;The first technical session was about &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda#clientState&quot;&gt;client-side application state&lt;/a&gt; and was a review of the &lt;a href=&quot;http://www.w3.org/2001/tag/doc/IdentifyingApplicationState-20110515.html&quot;&gt;Identifying Application State draft&lt;/a&gt; that &lt;a href=&quot;http://en.wikipedia.org/wiki/T._V._Raman&quot;&gt;T.V. Raman&lt;/a&gt; began before he left the TAG and that &lt;a href=&quot;http://www.linkedin.com/pub/ashok-malhotra/4/675/6a2&quot;&gt;Ashok Malhotra&lt;/a&gt; has been working on since. This should in the next few months or so be published as a TAG Finding (though it is currently on the Recommendation track).&lt;/p&gt;

&lt;p&gt;This work is essentially about documenting the different ways in which you can identify application state within a URI, why that&amp;#8217;s a useful thing to do, and some of the pitfalls of using &lt;a href=&quot;http://www.jenitennison.com/blog/node/154&quot;&gt;hash URIs&lt;/a&gt; to do so. Most of the discussion was about details of wording within the document. One thing I thought particularly interesting was the point that URI-based application state is relevant in all &amp;#8216;active content&amp;#8217;, not just in HTML; for example, scripting in SVG or in PDFs brings the same considerations.&lt;/p&gt;
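
&lt;p&gt;As a rough illustration of why hash URIs suit client-side state, here is a minimal Python sketch. The URI and the function name are made up for the example and do not come from the draft. It splits a URI into the part a browser would request from the server and the fragment that stays on the client:&lt;/p&gt;

```python
try:
    from urlparse import urldefrag  # Python 2
except ImportError:
    from urllib.parse import urldefrag  # Python 3


def split_application_state(uri):
    """Split a URI into (part requested from the server, client-side fragment)."""
    requested, fragment = urldefrag(uri)
    return requested, fragment


# Hypothetical URI: a photo viewer that keeps its current view in the fragment.
requested, state = split_application_state('http://example.org/viewer#photo=42')
```

&lt;p&gt;Only &lt;code&gt;requested&lt;/code&gt; goes over the wire; it is up to script running in the page to interpret &lt;code&gt;state&lt;/code&gt;.&lt;/p&gt;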

&lt;h2&gt;Buffer Bloat&lt;/h2&gt;

&lt;p&gt;Over lunch on Monday we listened to and discussed a presentation by &lt;a href=&quot;http://en.wikipedia.org/wiki/Jim_Gettys&quot;&gt;Jim Gettys&lt;/a&gt; on &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda#bufferbloat&quot;&gt;buffer bloat&lt;/a&gt;. Basically (and all the errors here are introduced by me), TCP/IP is designed to route around network blockages, but it can only do so if it detects them quickly. When you have big buffers in place, as in the case of all modern operating systems and hardware, blockages aren&amp;#8217;t detected quickly; they&amp;#8217;re only detected when the buffers fill up. Then buffers empty and the data has to be sent again. The net result is that connections get really slow, not just for upload or download but for both, not just for you but for everyone using the network.&lt;/p&gt;

&lt;p&gt;Jim talked about how this is exacerbated by the large amount of web traffic and the design of HTTP, particularly the lack of use of HTTP pipelining (whereby several HTTP requests and responses are sent over one long-term connection), because it leads to lots of small messages which can&amp;#8217;t be handled effectively. There&amp;#8217;s lots more about this &lt;a href=&quot;http://gettys.wordpress.com/&quot;&gt;on his blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Jim also talked about the failure of certificate authorities and how we should be looking at distributed protocols using digitally signed data, pointing us in particular to &lt;a href=&quot;http://www.ccnx.org/&quot;&gt;CCNx&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Fragment ID Semantics&lt;/h2&gt;

&lt;p&gt;First thing Tuesday was a session that I led on &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda.html#mimefrag&quot;&gt;fragids&lt;/a&gt;, in particular the problems that arise when the fragid semantics in the MIME type registration of +xml types (&lt;a href=&quot;http://www.w3.org/2006/02/son-of-3023/draft-murata-kohn-lilley-xml-04.html#frag&quot;&gt;3023bis&lt;/a&gt;) clash with those that are used for, say, &lt;a href=&quot;http://www.w3.org/TR/2011/WD-media-frags-20110317/&quot;&gt;images&lt;/a&gt;, and what happens when these come together in something like SVG.&lt;/p&gt;

&lt;p&gt;The same issues arise whenever you have documents with types that &amp;#8216;inherit&amp;#8217; fragid semantics from two directions. For example, XHTML documents are XML documents, so constraints on +xml mean you shouldn&amp;#8217;t use interpreted fragids (eg hash-bangs) on them, but they are also &amp;#8216;active content&amp;#8217; which makes interpreted fragids useful. Similarly, in linked data you shouldn&amp;#8217;t really use a hash URI to mean a Person with a primary resource that provides as a response an XML document with embedded RDFa, because according to XML fragid semantics, such a URI should point to an XML element.&lt;/p&gt;

&lt;p&gt;Basically the use of fragids has grown markedly outside their original scope and these situations aren&amp;#8217;t really covered in the specs. I am now tasked to create a document that describes the issues and suggests ways forward. So that will be fun.&lt;/p&gt;

&lt;h2&gt;Telcon with IAB&lt;/h2&gt;

&lt;p&gt;The second session on Tuesday was a telcon with the &lt;a href=&quot;http://www.iab.org/&quot;&gt;IAB&lt;/a&gt;, which has a role within the &lt;a href=&quot;http://www.ietf.org/&quot;&gt;IETF&lt;/a&gt; similar to the one the TAG has within the W3C. This was a bit of a &amp;#8216;getting to know you&amp;#8217; session, covering the work of the two groups on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioning and extensibility&lt;/li&gt;
&lt;li&gt;security&lt;/li&gt;
&lt;li&gt;privacy, including Do Not Track&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and talking about opportunities to meet and work together on various topics like these.&lt;/p&gt;

&lt;h2&gt;URI Definition Discovery and Metadata Architecture&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda#metadata&quot;&gt;afternoon session on Tuesday&lt;/a&gt; was spent on &lt;a href=&quot;http://mumble.net/~jar/&quot;&gt;Jonathan Rees&amp;#8217;s&lt;/a&gt; work on the &lt;a href=&quot;http://www.w3.org/wiki/AwwswHome&quot;&gt;Architecture of the World Wide Semantic Web&lt;/a&gt;, which covers, amongst other things, what people in semantic web circles call &lt;a href=&quot;http://www.w3.org/wiki/HttpRange14Webography&quot;&gt;httpRange-14&lt;/a&gt;. At core, this is about the kinds of URIs we can use to refer to real-world things, what the response to HTTP requests on those URIs should be, and how we find out information about these resources.&lt;/p&gt;

&lt;p&gt;Jonathan has put together a document called &lt;a href=&quot;http://www.w3.org/2001/tag/awwsw/issue57/20110531/&quot;&gt;Providing and discovering definitions of URIs&lt;/a&gt; which covers the various ways that have been suggested over time, including the 303 method that was &lt;a href=&quot;http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039&quot;&gt;recommended by the TAG in 2005&lt;/a&gt; and methods that have been suggested by various people since that time.&lt;/p&gt;
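
&lt;p&gt;For anyone who has not met it, the 303 method can be sketched in a few lines of Python. This is only an illustration: the &lt;code&gt;/id/&lt;/code&gt; and &lt;code&gt;/doc/&lt;/code&gt; path prefixes and the function name are assumptions made for the example, not anything the TAG or the draft specifies.&lt;/p&gt;

```python
def respond(path):
    """Return (status, headers) for a request path under the 303 pattern."""
    if path.startswith('/id/'):
        # The URI names a thing, not a document: redirect with 303 See Other
        # to the document that describes the thing.
        return 303, {'Location': '/doc/' + path[len('/id/'):]}
    if path.startswith('/doc/'):
        # The URI names an ordinary web document: serve it directly.
        return 200, {'Content-Type': 'text/html'}
    return 404, {}


# Hypothetical request for the URI of the novel itself.
status, headers = respond('/id/moby-dick')
```

&lt;p&gt;The cost of the pattern is visible even here: every lookup of a thing takes two requests, and the publisher has to control the server&amp;#8217;s responses.&lt;/p&gt;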

&lt;p&gt;It&amp;#8217;s clear that the 303 method has lots of practical shortcomings for people deploying linked data, and isn&amp;#8217;t the way in which URIs are commonly used by Facebook and schema.org, who don&amp;#8217;t currently care about using separate URIs for documents and the things those documents are about. We discussed these alongside concerns that we continue to support people who want to do things like describe the license or provenance of a document (as well as the facts that it contains) and don&amp;#8217;t introduce anything that is incompatible with the ways in which people who have been following recommended practice are publishing their linked data. The general mood was that we need to support some kind of &amp;#8216;punning&amp;#8217;, whereby a single URI could be used to refer to both a document and a real-world thing, with different properties being assigned to different &amp;#8216;views&amp;#8217; of that resource.&lt;/p&gt;

&lt;p&gt;Jonathan is going to continue to work on the draft, incorporating some other possible approaches. It&amp;#8217;s a &lt;a href=&quot;http://lists.w3.org/Archives/Public/public-lod/2011Jun/0186.html&quot;&gt;very contentious topic within the linked data community&lt;/a&gt;. My opinion is that while we need to provide some &amp;#8216;good practice&amp;#8217; guides for linked data publishers, we can&amp;#8217;t just stick to a theoretical ideal that experience has shown not to be practical. What I&amp;#8217;d hope is that the TAG can help to pull together the various arguments for and against different options, and document whatever approach the wider community supports.&lt;/p&gt;

&lt;h2&gt;Can publication of hyperlinks cause copyright infringement?&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda.html#linkcopyright&quot;&gt;first session on Wednesday&lt;/a&gt; was another session that I led, discussing the &lt;a href=&quot;http://www.w3.org/2001/tag/doc/publishingAndLinkingOnTheWeb-2011-05-28&quot;&gt;Publishing and Linking on the Web draft&lt;/a&gt; that &lt;a href=&quot;http://torgo.com/blog/&quot;&gt;Dan Appelquist&lt;/a&gt; and I have been working on.&lt;/p&gt;

&lt;p&gt;The aim of this document is to explain the tensions between terms that are commonly used in legal documents such as &amp;#8220;possession&amp;#8221;, &amp;#8220;adaptation&amp;#8221; and &amp;#8220;distribution&amp;#8221; and the way that publication works on the web, in which multiple servers may have copies of the same document (because they cache copies to make the &amp;#8216;net go faster), automated agents may make changes to those documents (such as compressing or resizing documents, or merging Javascript) and people may refer others to those documents through linking.&lt;/p&gt;

&lt;p&gt;We&amp;#8217;re particularly keen to argue that linking to something is not the same thing as distributing it. The web&amp;#8217;s power arises through its links, so it&amp;#8217;s important that people are able to link to something without being worried about what happens when/if the domain they link to is taken over by something illegal.&lt;/p&gt;

&lt;p&gt;Dan and I are going to continue to work on this document in response to various suggestions around organisation and terminology, with a view to getting some &amp;#8216;friendly legal experts&amp;#8217; to look it over and then seeking wider review. The intention is for it to eventually become a Recommendation as this will give greater weight to it as a document for a legal audience.&lt;/p&gt;

&lt;h2&gt;API Minimisation and Client-Side Storage&lt;/h2&gt;

&lt;p&gt;There were then a couple of short sessions.&lt;/p&gt;

&lt;p&gt;Dan talked about &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda.html#apis&quot;&gt;API Minimisation&lt;/a&gt;, which is the design principle that to increase privacy we should design APIs that enable people requesting information to say exactly what information they need, and return only that rather than everything known about a thing. Dan&amp;#8217;s put together a &lt;a href=&quot;http://www.w3.org/2001/tag/doc/APIMinimization-20100605.html&quot;&gt;draft&lt;/a&gt; and should be calling for review of it soon.&lt;/p&gt;
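
&lt;p&gt;To make the principle concrete, here is a minimal Python sketch of a minimised lookup. The record, the field names and the function are all invented for illustration and are not taken from the draft.&lt;/p&gt;

```python
def minimal_response(record, requested_fields):
    """Return only the requested fields that the record actually holds."""
    return dict((name, record[name])
                for name in requested_fields
                if name in record)


# Hypothetical record held by a service about one person.
profile = {'name': 'Alice', 'city': 'Oxford', 'email': 'alice@example.org'}

# A caller that only needs the city gets the city and nothing else.
answer = minimal_response(profile, ['city'])
```

&lt;p&gt;The caller has to name what it wants, so nothing is disclosed by default.&lt;/p&gt;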

&lt;p&gt;Ashok then led discussion on &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda.html#webAppStorage&quot;&gt;client-side storage&lt;/a&gt; and we brainstormed around some of the architectural/design issues about which we might want to write if we were to put together a document. This work is at a very early stage.&lt;/p&gt;

&lt;h2&gt;TAG Priorities&lt;/h2&gt;

&lt;p&gt;After lunch, we had a &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda#priorities&quot;&gt;session on TAG priorities&lt;/a&gt; where we discussed which of the various pieces of work that we&amp;#8217;re doing should receive the most attention and had a quick review of who is doing what within the TAG.&lt;/p&gt;

&lt;p&gt;Our basic problem is that a lot of this stuff feels quite urgent, and we want to be responsive, but with only 5-6 of us &amp;#8220;actively involved&amp;#8221; (which means 1 day/week) in drafting documents, and other TAG duties taking up our time, it feels like we have taken on too much work. Our focus for the next little while is going to be on responding to issues where our lack of response might either hold people up or cause longer term problems (for example the publication of contradictory mime type definitions), which means things like the document on publishing and linking on the web will need to bubble in the background rather than being the focus of activity.&lt;/p&gt;

&lt;h2&gt;HTML5 Last Call&lt;/h2&gt;

&lt;p&gt;Our &lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-agenda.html#htmlreview&quot;&gt;final session&lt;/a&gt;, for which we were joined by &lt;a href=&quot;http://www.w3.org/People/LeHegaret/&quot;&gt;Philippe Le Hégaret&lt;/a&gt;, was on the HTML5 Last Call documents. The TAG has raised various issues over the course of HTML5 development and wants to follow up on how those issues have been addressed in the documents. Our role means that we&amp;#8217;re responsible for making sure there&amp;#8217;s consistency with other specifications, and that there isn&amp;#8217;t anything that seems like it&amp;#8217;s going to cause problems in the long term.&lt;/p&gt;

&lt;p&gt;The part that we spent most discussion time on was the relationship between &lt;a href=&quot;http://www.w3.org/TR/2011/WD-microdata-20110525/&quot;&gt;Microdata&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/TR/2011/WD-rdfa-in-html-20110525/&quot;&gt;RDFa&lt;/a&gt;. We talked about the precedents for having two specifications that do very similar things but with different approaches, such as CSS and XSL, and how this isn&amp;#8217;t necessarily a bad thing so long as they don&amp;#8217;t contradict each other and people can move between them easily (because they have the same conceptual foundations).&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m going to save my opinion on this topic for another post. Suffice it to say that microdata and RDFa as currently specified don&amp;#8217;t work well with each other but it&amp;#8217;s not at all clear what the best path forward is. The TAG decided to recommend that the W3C set up a Task Force to look at what the best way forward might be.&lt;/p&gt;

&lt;h2&gt;Final Words&lt;/h2&gt;

&lt;p&gt;If you want links to the minutes of the TAG F2F, they&amp;#8217;re available within the agenda or on separate pages for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/06-minutes&quot;&gt;Monday 6th June&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/07-minutes&quot;&gt;Tuesday 7th June&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.w3.org/2001/tag/2011/06/08-minutes&quot;&gt;Wednesday 8th June&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have anything to say on any of these topics, please send email to the &lt;a href=&quot;mailto:www-tag@w3.org&quot;&gt;TAG mailing list&lt;/a&gt;. Or you could comment here or &lt;a href=&quot;mailto:jeni@jenitennison.com&quot;&gt;email me directly&lt;/a&gt; if you like. Which leads me on to talking about what I&amp;#8217;d like to do in the TAG.&lt;/p&gt;

&lt;p&gt;One of the guidance notes for new members to the TAG says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;TAG members are elected or appointed not to represent their individual member organizations, but the Web community as a whole. We try to take that responsibility very seriously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I do take that responsibility seriously. Web architecture has to be a combination of practice and theory, balancing approaches that work right now with a desire to not break anything long term. I do practical work developing web applications with HTML, CSS, Javascript, XML, RDF, XSLT, XQuery and so on and so on every day, but I know I don&amp;#8217;t see all the difficult corners of the open web standard space: no one person can.&lt;/p&gt;

&lt;p&gt;I can listen though, so that&amp;#8217;s what I will try to do: listen, digest, reflect and act.&lt;/p&gt;

&lt;p&gt;But I have limited resources. Unlike most of the members of the TAG, I am not employed by a large organisation that pays me for time I take on the work that I do for the TAG. The W3C kindly paid for my flights to and from F2Fs, but not hotels or expenses. I wouldn&amp;#8217;t have taken this on if I wasn&amp;#8217;t prepared to shoulder the financial burden, but if there is anyone out there who might sponsor my participation, I&amp;#8217;d love to hear from you.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/158#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/44">html5</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/71">microdata</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/42">rdfa</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/73">tag</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/69">uris</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/12">web</category>
 <pubDate>Fri, 17 Jun 2011 10:44:12 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">158 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Getting Started with RDF and SPARQL Using Sesame and Python</title>
 <link>http://www.jenitennison.com/blog/node/153</link>
 <description>&lt;p&gt;My &lt;a href=&quot;http://www.jenitennison.com/blog/node/152&quot;&gt;previous post&lt;/a&gt; talked about how to install &lt;a href=&quot;http://4store.org/&quot;&gt;4store&lt;/a&gt; as a triplestore, and use the Ruby library &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt; in order to process RDF extracted from that store. This was a response to Richard Pope&amp;#8217;s &lt;a href=&quot;http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/&quot;&gt;Linked Data/RDF/SPARQL Documentation Challenge&lt;/a&gt; which asks for documentation of how to install a triplestore, load data into it, retrieve it using SPARQL and access the results as native structures using Ruby, Python or PHP.&lt;/p&gt;

&lt;p&gt;I quite enjoyed writing the last one, so I thought I&amp;#8217;d try again. As before, I am on Mac OS X, but this time I&amp;#8217;m going to use Python, which I have not programmed in before. I like a challenge. You might not like the results!&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;Sesame&lt;/h2&gt;

&lt;p&gt;This time, I&amp;#8217;m going to use &lt;a href=&quot;http://www.openrdf.org/&quot;&gt;Sesame&lt;/a&gt;, as I was told by &lt;a href=&quot;http://twitter.com/johnlsheridan&quot;&gt;John Sheridan&lt;/a&gt; that it was so easy to install that even he, a civil servant, could do it!&lt;/p&gt;

&lt;p&gt;Sesame needs a Java servlet container. I&amp;#8217;m using &lt;a href=&quot;http://tomcat.apache.org/&quot;&gt;Tomcat&lt;/a&gt; because I have some experience with it, but you could use something like &lt;a href=&quot;http://jetty.codehaus.org/jetty/&quot;&gt;Jetty&lt;/a&gt; if you prefer. I had a bit of trouble getting Tomcat 6 to install, but that might just have been because it has a lot of dependencies and I wasn&amp;#8217;t patient enough. It might be worth upgrading your existing ports and getting verbose output so you know there&amp;#8217;s activity as you install Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port upgrade outdated
$ sudo port -v install tomcat6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This installs Tomcat 6 in &lt;code&gt;/opt/local/share/java/tomcat6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;While that&amp;#8217;s happening, get Sesame from its &lt;a href=&quot;http://sourceforge.net/projects/sesame/files/Sesame%202/&quot;&gt;download page&lt;/a&gt;. I got hold of &lt;code&gt;openrdf-sesame-2.3.2-sdk.tar.gz&lt;/code&gt;. The files we actually need are the &lt;code&gt;.war&lt;/code&gt;s so we can just extract them and put them in the &lt;code&gt;webapps&lt;/code&gt; directory within Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ tar -zxvf openrdf-sesame-2.3.2-sdk.tar.gz openrdf-sesame-2.3.2/war/*.war
$ sudo cp openrdf-sesame-2.3.2/war/*.war /opt/local/share/java/tomcat6/webapps/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then start up Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl start
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All being well, you should see Tomcat doing some initial setup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;conf_setup.sh: file conf/catalina.policy is missing; copying conf/catalina.policy.sample to its place.
conf_setup.sh: file conf/catalina.properties is missing; copying conf/catalina.properties.sample to its place.
conf_setup.sh: file conf/server.xml is missing; copying conf/server.xml.sample to its place.
conf_setup.sh: file conf/tomcat-users.xml is missing; copying conf/tomcat-users.xml.sample to its place.
conf_setup.sh: file conf/web.xml is missing; copying conf/web.xml.sample to its place.
conf_setup.sh: file conf/setenv.local is missing; copying conf/setenv.local.sample to its place.
Starting Tomcat.... started. (pid 20064)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now have a look at &lt;code&gt;http://localhost:8080/openrdf-sesame&lt;/code&gt;. If you&amp;#8217;re like me, you&amp;#8217;ll get some error messages because the user that Tomcat is running under (&lt;code&gt;www&lt;/code&gt;) isn&amp;#8217;t able to create or write to a logging directory that it wants to create, in my case &lt;code&gt;/Users/Jeni/Library/Application Support/Aduna/OpenRDF Sesame/logs&lt;/code&gt;. This turns out to be partly caused by permissions issues and partly caused by the spaces in the filename. To get around it, create a data directory for Aduna that doesn&amp;#8217;t have spaces in the filename, and change its ownership to &lt;code&gt;www&lt;/code&gt;. In my case, I chose &lt;code&gt;/opt/local/var/aduna&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo mkdir -p /opt/local/var/aduna
$ sudo chown www:www /opt/local/var/aduna
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then edit Tomcat&amp;#8217;s &lt;code&gt;setenv.local&lt;/code&gt; file which in my environment is at &lt;code&gt;/opt/local/share/java/tomcat6/conf&lt;/code&gt; and add a line that sets the &lt;code&gt;info.aduna.platform.appdata.basedir&lt;/code&gt; to that directory, like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;export JAVA_OPTS=&#039;-Dinfo.aduna.platform.appdata.basedir=/opt/local/var/aduna&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Restart Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl restart
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then navigate again to &lt;a href=&quot;http://localhost:8080/openrdf-sesame&quot;&gt;http://localhost:8080/openrdf-sesame&lt;/a&gt; and you should see the Welcome page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-welcome.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As you can see, this recommends using the Workbench for managing the repositories. Open that, at &lt;a href=&quot;http://localhost:8080/openrdf-workbench&quot;&gt;http://localhost:8080/openrdf-workbench&lt;/a&gt;, and you should see its home page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-home.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;We&amp;#8217;ll use this Workbench to create a new repository for our data, which I&amp;#8217;ll call &lt;code&gt;reference&lt;/code&gt;. Click on &lt;code&gt;New Repository&lt;/code&gt; from the left hand navigation and fill in the form. I&amp;#8217;m just going to use the default in-memory RDF store because I&amp;#8217;m only using a little data; the other options (using MySQL or PostgreSQL stores) would be useful if I were creating something more permanent. See &lt;a href=&quot;http://www.openrdf.org/doc/sesame2/users/ch07.html#section-rdbms-store-config&quot;&gt;the Sesame User Guide&lt;/a&gt; for information about those.&lt;/p&gt;

&lt;p&gt;So fill in the form to create a new repository with the id &lt;code&gt;reference&lt;/code&gt; and whatever title you like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Next&lt;/code&gt; and there will be a couple more options to select; I just used the default for these:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository-2.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Create&lt;/code&gt; and you will see a summary of the new repository that you&amp;#8217;ve created:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository-3.jpg&quot; /&gt;
&lt;/p&gt;

&lt;h2&gt;Loading Data&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;m going to use the same data as I did before:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&quot;&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can add data to a Sesame repository in a browser through the Workbench by uploading a file, pointing Sesame at a URL or pasting in some RDF that you want to load. There are also Java bindings for adding data to Sesame. But neither the Workbench nor the Java bindings is any good to us, as we need programmatic access from Python.&lt;/p&gt;

&lt;p&gt;So we will use the &lt;a href=&quot;http://www.openrdf.org/doc/sesame2/system/ch08.html#d0e304&quot;&gt;HTTP method&lt;/a&gt;. I want to add some statements to the &lt;code&gt;reference&lt;/code&gt; repository in the graph (what Sesame calls &amp;#8220;context&amp;#8221;) &lt;code&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&lt;/code&gt;, which amounts to an HTTP PUT on the repository&amp;#8217;s statements with that context.&lt;/p&gt;

&lt;p&gt;Now I don&amp;#8217;t know much at all about Python, but it looks as though the built-in library &lt;code&gt;urllib2&lt;/code&gt; doesn&amp;#8217;t support &lt;code&gt;PUT&lt;/code&gt; out of the box and there&amp;#8217;s a better HTTP library available in &lt;a href=&quot;http://code.google.com/p/httplib2/&quot;&gt;&lt;code&gt;httplib2&lt;/code&gt;&lt;/a&gt;. MacPorts provides separate &lt;code&gt;httplib2&lt;/code&gt; packages for different versions of Python. The only MacPorts package for rdflib, which we&amp;#8217;ll use later, seems to be the one for Python 2.6, so we&amp;#8217;ll go for &lt;code&gt;py26-httplib2&lt;/code&gt;, which will bring in Python 2.6 with it just in case.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install py26-httplib2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After running this, if you want to actually use it you will need to install the &lt;code&gt;python_select&lt;/code&gt; port and choose Python 2.6:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install python_select
$ sudo python_select python26
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then open up another Terminal window or tab (because the change won&amp;#8217;t have affected your old one) and check what version of Python you&amp;#8217;re running:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python --version
Python 2.6.6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the &lt;code&gt;httplib2&lt;/code&gt; library in place, it&amp;#8217;s time for a Python script (&lt;code&gt;load-rdf-into-sesame.py&lt;/code&gt;) to do the PUTting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2

repository = &#039;reference&#039;
graph      = &#039;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&#039;
filename   = &#039;/Users/Jeni/Downloads/index.rdf&#039;

print &quot;Loading %s into %s in Sesame&quot; % (filename, graph)
params = { &#039;context&#039;: &#039;&amp;lt;&#039; + graph + &#039;&amp;gt;&#039; }
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s/statements?%s&quot; % (repository, urllib.urlencode(params))
# Read the RDF/XML to upload, closing the file handle afterwards
with open(filename, &#039;r&#039;) as rdf_file:
    data = rdf_file.read()
(response, content) = httplib2.Http().request(endpoint, &#039;PUT&#039;, body=data, headers={ &#039;content-type&#039;: &#039;application/rdf+xml&#039; })
print &quot;Response %s&quot; % response.status
print content
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run the script from the command line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python load-rdf-into-sesame.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you should just get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loading /Users/Jeni/Downloads/index.rdf into http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 in Sesame
Response 204
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which isn&amp;#8217;t particularly helpful (well, the &lt;code&gt;204&lt;/code&gt; response tells us it worked), but you can then check &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/contexts&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/contexts&lt;/a&gt; and you should see that there is a new context of &lt;code&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&lt;/code&gt;:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-contexts.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click on the context and it will take you to a list of (some of) the triples in that graph:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-context.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;One of the nice things about Sesame is that the Workbench provides you with ways of exploring the data that you have loaded. On the left navigation bar there are ways of listing the types of the entities described in the data:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-types.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;from which you can find instances of that type, for example of &lt;code&gt;org:Organization&lt;/code&gt;:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-organization.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and then the statements about a particular instance, for example DirectGov:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-directgov.jpg&quot; /&gt;
&lt;/p&gt;

&lt;h2&gt;Running a Query&lt;/h2&gt;

&lt;p&gt;On to running a query directly. This is done on Sesame in exactly the same way as it was done on 4store in my last walkthrough: by HTTP POSTing a query to the SPARQL endpoint. Sesame&amp;#8217;s page for testing queries on the &lt;code&gt;reference&lt;/code&gt; repository is at &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/query&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/query&lt;/a&gt; and we&amp;#8217;ll use the basic one that lists types of things that are described within the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type 
WHERE { 
  ?thing a ?type .
} 
ORDER BY ?type
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Paste that into the textarea that&amp;#8217;s provided on &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/query&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/query&lt;/a&gt; so it looks like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and you get an HTML page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query-result.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s nice for humans, but not so good for computers. When we request the results of this query programmatically, we&amp;#8217;ll want to make sure that we specifically ask for the query results in either &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-XMLres/&quot;&gt;XML&lt;/a&gt; or &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-json-res/&quot;&gt;JSON&lt;/a&gt;.&lt;/p&gt;
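
&lt;p&gt;As a sketch of what the JSON form looks like, the snippet below parses a small hand-written document in the SPARQL JSON results layout, with &lt;code&gt;head.vars&lt;/code&gt; naming the variables and &lt;code&gt;results.bindings&lt;/code&gt; holding one object per solution. The two &lt;code&gt;example.org&lt;/code&gt; URIs and the &lt;code&gt;values_for&lt;/code&gt; helper are invented for illustration; they are not real output from Sesame.&lt;/p&gt;

```python
import json

# Hand-written sample in the SPARQL JSON results layout (not real Sesame output).
sample = '''
{
  "head": { "vars": ["type"] },
  "results": {
    "bindings": [
      { "type": { "type": "uri", "value": "http://example.org/def/Post" } },
      { "type": { "type": "uri", "value": "http://example.org/def/Unit" } }
    ]
  }
}
'''


def values_for(results, variable):
    """List the value bound to `variable` in each solution that binds it."""
    return [binding[variable]['value']
            for binding in results['results']['bindings']
            if variable in binding]


types = values_for(json.loads(sample), 'type')
```

&lt;p&gt;Each binding also carries a &lt;code&gt;type&lt;/code&gt; key saying whether the value is a URI, a literal or a blank node, which is worth checking once queries return more than URIs.&lt;/p&gt;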

&lt;p&gt;I went the XML route last time, so let&amp;#8217;s mix it up a bit and try processing the JSON results of a SPARQL query this time, as it&amp;#8217;s really easy to access using the &lt;code&gt;json&lt;/code&gt; module in Python. So, we need to &lt;code&gt;POST&lt;/code&gt; the query, ensuring that we set the &lt;code&gt;Accept&lt;/code&gt; header to &lt;code&gt;application/sparql-results+json&lt;/code&gt;, and then process the results as JSON. Here is &lt;a href=&quot;/blog/files/find-rdf-types.py&quot;&gt;&lt;code&gt;find-rdf-types.py&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2
import json

query = &#039;SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type&#039;
repository = &#039;reference&#039;
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s&quot; % (repository)

print &quot;POSTing SPARQL query to %s&quot; % (endpoint)
params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;application/sparql-results+json&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)

print &quot;Response %s&quot; % response.status
results = json.loads(content)
print &quot;\n&quot;.join([result[&#039;type&#039;][&#039;value&#039;] for result in results[&#039;results&#039;][&#039;bindings&#039;]])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python find-rdf-types.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is the same set of types as that given through the HTML browse interface. Note that the JSON results themselves look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  &quot;head&quot;: {
    &quot;vars&quot;: [ &quot;type&quot; ]
  }, 
  &quot;results&quot;: {
    &quot;bindings&quot;: [
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#DataSet&quot; }
      }, 
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#DataStructureDefinition&quot; }
      }, 
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#Observation&quot; }
      }, 
      ...
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each of the items within the &lt;code&gt;bindings&lt;/code&gt; array contains a set of bindings for the variables in the SPARQL query. This closely matches the XML format.&lt;/p&gt;
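
&lt;p&gt;As a rough sketch of working with that structure, the bindings can be flattened into plain Python dictionaries. This version is written for Python 3 using only the standard library (not the Python 2.6 used in the scripts here); the function name &lt;code&gt;bindings_to_rows&lt;/code&gt; and the sample data are made up for illustration, and a variable that is unbound in a solution is assumed to be simply absent from it:&lt;/p&gt;

```python
import json


def bindings_to_rows(content):
    """Turn SPARQL JSON results into a list of plain dictionaries."""
    results = json.loads(content)
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        row = {}
        for name in variables:
            # An unbound variable is missing from the binding, so use None.
            row[name] = binding[name]["value"] if name in binding else None
        rows.append(row)
    return rows


# Hypothetical sample in the same shape as the results shown above.
sample = """{
  "head": { "vars": [ "type", "label" ] },
  "results": { "bindings": [
    { "type": { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Person" },
      "label": { "type": "literal", "xml:lang": "en", "value": "Person" } },
    { "type": { "type": "uri", "value": "http://www.w3.org/ns/org#Organization" } }
  ] }
}"""

rows = bindings_to_rows(sample)
for row in rows:
    print(row["type"], row["label"])
```

&lt;p&gt;This throws away the &lt;code&gt;type&lt;/code&gt;, language and datatype information attached to each value, which is fine for listing URIs but probably not what you want if you need to tell a resource from a literal.&lt;/p&gt;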

&lt;h2&gt;Processing RDF&lt;/h2&gt;

&lt;p&gt;Now we get onto the part of this where we look at specific libraries for RDF support in Python. The most popular library is &lt;a href=&quot;http://www.rdflib.net/&quot;&gt;rdflib&lt;/a&gt;, which you can install using MacPorts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install py26-rdflib
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The SPARQL query we&amp;#8217;ll try this time is a CONSTRUCT query, which creates RDF, rather than a SELECT query, whose results, as we&amp;#8217;ve seen, can be returned as either XML or JSON. For example, try the query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gets all the information in the data about the individuals for whom names have been supplied in the data, as RDF. Again, Sesame will display this as HTML when you try doing it, but you can choose a different format from the drop-down menu at the top of the Query Result display:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query-result-rdf.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;When you&amp;#8217;re not accessing using a browser, by default Sesame serves up its results in &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/TriG/Spec/&quot;&gt;TriG format&lt;/a&gt;, which isn&amp;#8217;t particularly appropriate for the results of CONSTRUCT queries as we don&amp;#8217;t need multiple graphs. We&amp;#8217;ll request &lt;a href=&quot;http://www.w3.org/TR/rdf-testcases/#ntriples&quot;&gt;N-Triples&lt;/a&gt; as that&amp;#8217;s something rdflib can understand. Sesame 2 uses the content type &lt;code&gt;text/plain&lt;/code&gt; for N-Triples, so we can request this type by setting the &lt;code&gt;Accept&lt;/code&gt; header:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;text/plain&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We then need to parse this N-Triples response into a &lt;a href=&quot;http://www.rdflib.net/rdflib-2.4.0/html/public/rdflib.Graph.Graph-class.html&quot;&gt;&lt;code&gt;rdflib.Graph&lt;/code&gt;&lt;/a&gt; object:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format=&quot;nt&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We then need to get information out of that graph, which rdflib isn&amp;#8217;t particularly good at. So let&amp;#8217;s use &lt;a href=&quot;http://www.openvest.com/trac/wiki/RDFAlchemy&quot;&gt;RDFAlchemy&lt;/a&gt; instead. That can be installed using &lt;a href=&quot;http://packages.python.org/distribute/easy_install.html&quot;&gt;easy_install&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo easy_install-2.6 rdfalchemy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;RDFAlchemy can be used to map RDF graphs onto Python object structures in a fairly straightforward manner. Basically, you define the namespaces of the vocabularies that you want to use, then some classes for the kinds of things that you have in the data, with properties that map onto properties in the RDF. Then you set &lt;code&gt;rdfSubject.db&lt;/code&gt; to the source of the data (which can be an rdflib Graph as above) and can query on it. Here&amp;#8217;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FOAF = rdflib.Namespace(&#039;http://xmlns.com/foaf/0.1/&#039;)
RDF = rdflib.Namespace(&#039;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#039;)

class Person(rdfalchemy.rdfSubject):
  rdf_type = FOAF.Person
  name = rdfalchemy.rdfSingle(FOAF.name)
  mbox = rdfalchemy.rdfSingle(FOAF.mbox)

rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name=&#039;Andrew Stott&#039;)
print &quot;Andrew Stott&#039;s email address: %s&quot; % stott.mbox.n3()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;RDFAlchemy adds both &lt;code&gt;get_by()&lt;/code&gt; and &lt;code&gt;filter_by()&lt;/code&gt; methods on the descriptor classes that you define, to get a single item that matches a query or a list of items, respectively.&lt;/p&gt;

&lt;p&gt;The full script for &lt;a href=&quot;/blog/files/get-names-and-emails.py&quot;&gt;&amp;#8216;get-names-and-emails.py&amp;#8217;&lt;/a&gt; is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2
import rdflib
import rdfalchemy

query = &quot;&quot;&quot;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE {
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}&quot;&quot;&quot;
repository = &#039;reference&#039;
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s&quot; % repository

print &quot;POSTing SPARQL query to %s&quot; % endpoint
params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;text/plain&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)
print &quot;Response %s&quot; % response.status

graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format=&quot;nt&quot;)

print &quot;Loaded %d triples&quot; % len(graph)

FOAF = rdflib.Namespace(&#039;http://xmlns.com/foaf/0.1/&#039;)
RDF = rdflib.Namespace(&#039;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#039;)

class Person(rdfalchemy.rdfSubject):
  rdf_type = FOAF.Person
  name = rdfalchemy.rdfSingle(FOAF.name)
  mbox = rdfalchemy.rdfSingle(FOAF.mbox)

rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name=&#039;Andrew Stott&#039;)
print &quot;Andrew Stott&#039;s email address: %s&quot; % stott.mbox.n3()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run this script with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python get-names-and-emails.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;No handlers could be found for logger &quot;rdflib.Literal&quot;
POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
Loaded 459 triples
Andrew Stott&#039;s email address: &amp;lt;mailto:andrew.stott@cabinet-office.gsi.gov.uk&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line is apparently a &lt;a href=&quot;http://groups.google.com/group/rdfalchemy-dev/browse_thread/thread/44a94ec27c4c0b85&quot;&gt;side-effect of rdflib/RDFAlchemy weirdness&lt;/a&gt; which you don&amp;#8217;t need to worry about. The rest shows that the search was successful; the &lt;code&gt;.n3()&lt;/code&gt; call on the email address is only necessary because it is a resource rather than a literal, and therefore doesn&amp;#8217;t get converted to a particularly readable string otherwise.&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;So there you have it, another walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries, this time using Sesame and Python rather than 4store and Ruby.&lt;/p&gt;

&lt;p&gt;This walkthrough took me a fair bit longer to do than the previous one, for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I&amp;#8217;ve done almost no previous programming with Python (as you can probably tell), so every little thing took ages to work out &amp;#8212; you know you&amp;#8217;re in trouble when you&amp;#8217;re Googling for string concatenation code! I&amp;#8217;m very happy to accept corrections and improvements, which I&amp;#8217;ll include in the above.&lt;/li&gt;
&lt;li&gt;I spent a lot of time faffing around with different Python versions, opting for the latest and then finding that the libraries that I wanted to use weren&amp;#8217;t available for that version and so on. I eventually ended up with Python 2.6; the code above may or may not work with any other versions.&lt;/li&gt;
&lt;li&gt;Setting up Sesame 2 was pretty frustrating: first Tomcat wouldn&amp;#8217;t work, then Jetty wouldn&amp;#8217;t work, and finally I did get Tomcat working and then had the issue with the log directory, as I described above. Once I&amp;#8217;d changed the data directory things worked very smoothly.&lt;/li&gt;
&lt;li&gt;I thought rdflib was going to be enough to work with RDF in Python, but really it isn&amp;#8217;t (if you want to get data &lt;em&gt;out&lt;/em&gt; as well as put data &lt;em&gt;in&lt;/em&gt;), so I had to find something else.&lt;/li&gt;
&lt;li&gt;The documentation for rdflib and RDFAlchemy isn&amp;#8217;t as comprehensive as the documentation for RDF.rb, especially if you&amp;#8217;re not familiar with Python, so it took me a bit longer to work out how to do things with those particular libraries.&lt;/li&gt;
&lt;li&gt;I took a lot more screenshots!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, I haven&amp;#8217;t followed Richard&amp;#8217;s steps to the letter; in particular I haven&amp;#8217;t used a package to get data out of (or into) Sesame: I&amp;#8217;ve just done it through HTTP calls. I did it this way deliberately because I think it&amp;#8217;s a really important feature of triplestores that you can query them through a common interface: SPARQL. It means that you can take the Python code here and use it against 4store or another triplestore with only a change to the value of the endpoint variable, and similarly take the Ruby code from my previous walkthrough and use it against Sesame. Your code is not tied to a particular implementation or API; you &amp;#8220;only&amp;#8221; have to learn SPARQL and you&amp;#8217;re away.&lt;/p&gt;
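
&lt;p&gt;To make that point concrete, here is a minimal sketch in which the only thing that differs between the two stores is the endpoint URL. It is written for Python 3 with the standard library&amp;#8217;s &lt;code&gt;urllib&lt;/code&gt; rather than the Python 2.6 and &lt;code&gt;httplib2&lt;/code&gt; used in the scripts here, the helper name &lt;code&gt;build_sparql_request&lt;/code&gt; is made up, and it only builds the requests rather than sending them. Whether a particular store honours a particular &lt;code&gt;Accept&lt;/code&gt; header is something you&amp;#8217;d need to check for yourself:&lt;/p&gt;

```python
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query, accept):
    """Build (but do not send) a POST request for a SPARQL endpoint."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": accept,
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


query = "SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type"

# The two endpoints used in these walkthroughs; only this value changes.
endpoints = {
    "sesame": "http://localhost:8080/openrdf-sesame/repositories/reference",
    "4store": "http://localhost:8000/sparql/",
}

prepared = {}
for store, endpoint in endpoints.items():
    prepared[store] = build_sparql_request(endpoint, query, "application/sparql-results+json")
    print(store, prepared[store].get_method(), prepared[store].full_url)

# To actually run a query (the store needs to be running):
# content = urllib.request.urlopen(prepared["sesame"]).read()
```

&lt;p&gt;For a CONSTRUCT query against Sesame 2 you would pass &lt;code&gt;text/plain&lt;/code&gt; as the &lt;code&gt;accept&lt;/code&gt; argument instead, as described earlier.&lt;/p&gt;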

&lt;p&gt;If you prefer something a little more tightly bound, however, RDFAlchemy does have some targeted &lt;a href=&quot;http://www.openvest.com/trac/wiki/RDFAlchemy#Sesame&quot;&gt;Sesame support&lt;/a&gt;, as does &lt;a href=&quot;http://rdf.rubyforge.org/sesame/&quot;&gt;RDF.rb&lt;/a&gt; for that matter. These can help with the management of the data within the repository as well as querying it.&lt;/p&gt;

&lt;p&gt;Another thing that&amp;#8217;s worth pointing out is that 4store and Sesame have completely different (HTTP-based) interfaces for getting data into stores, and that rdflib/RDFAlchemy and RDF.rb have completely different ways of loading data into in-memory graphs, querying it, and getting information from the results, quite aside from the obvious language-based differences that you&amp;#8217;d expect.&lt;/p&gt;

&lt;p&gt;On the SPARQL side, there are some efforts within the W3C to define a &lt;a href=&quot;http://www.w3.org/TR/sparql11-http-rdf-update/&quot;&gt;uniform HTTP protocol for managing RDF graphs&lt;/a&gt; and of course there&amp;#8217;s &lt;a href=&quot;http://www.w3.org/TR/sparql11-update/&quot;&gt;SPARQL 1.1 Update&lt;/a&gt;. There are glimmers of hope for a &lt;a href=&quot;http://www.w3.org/QA/2010/12/new_rdf_working_group_rdfjson.html&quot;&gt;standard RDF API&lt;/a&gt;, as &lt;a href=&quot;http://www.jenitennison.com/blog/node/150&quot;&gt;I&amp;#8217;ve argued for recently&lt;/a&gt;, but I gather that this effort will be focused on client-side developers, ie that it is really a standard RDF API &lt;em&gt;for Javascript&lt;/em&gt;, which I think is a wasted opportunity: I would have been faster in this task if I&amp;#8217;d been able to use familiar methods, and I wouldn&amp;#8217;t have been so dependent on the documentation provided by the author of a particular library.&lt;/p&gt;

&lt;p&gt;Anyway, hopefully my tramping this path will make it easier for those who follow.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/153#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/65">python</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/66">rdfalchemy</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/64">rdflib</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/67">sesame</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <enclosure url="http://www.jenitennison.com/blog/files/load-rdf-into-sesame.py.txt" length="615" type="text/plain" />
 <pubDate>Tue, 25 Jan 2011 17:27:24 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">153 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Getting Started with RDF and SPARQL Using 4store and RDF.rb</title>
 <link>http://www.jenitennison.com/blog/node/152</link>
 <description>&lt;p&gt;&lt;strong&gt;Updated&lt;/strong&gt; to include some of &lt;a href=&quot;http://www.jenitennison.com/blog/node/152#comment-10579&quot;&gt;Arto Bendicken&amp;#8217;s recommendations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is a response to Richard Pope&amp;#8217;s &lt;a href=&quot;http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/&quot;&gt;Linked Data/RDF/SPARQL Documentation Challenge&lt;/a&gt;. In it, he asks for documentation of the following steps:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
  &lt;li&gt;Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.&lt;/li&gt;
  &lt;li&gt;Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Programatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Programatically retrieve data from the datastore with SPARQL using using either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;I&amp;#8217;ve been told so many times how RDF sucks for mainstream developers that it was the main point of my &lt;a href=&quot;http://www.w3.org/2010/11/TPAC/RDF-SW-velocity.pdf&quot;&gt;TPAC talk&lt;/a&gt; late last year. I think that this is a great motivating challenge for improving not only the documentation of how to use RDF stores and libraries but also their general installability and usability for developers.&lt;/p&gt;

&lt;p&gt;Anyway, I thought I&amp;#8217;d try to get as far as I could to see just how bad things really are. I am on Mac OS X, and I&amp;#8217;m going to use Ruby (although I don&amp;#8217;t really know it all that well, so please forgive my mistakes). I&amp;#8217;ll breeze on through as if everything is hunky dory, but there are some caveats at the end.&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;4store&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;m going to use &lt;a href=&quot;http://4store.org&quot;&gt;4store&lt;/a&gt; because it&amp;#8217;s really easy to install on the Mac. If you want to install it on Ubuntu, &lt;a href=&quot;http://blog.dbtune.org/post/2009/08/14/4Store-stuff&quot;&gt;there&amp;#8217;s a package available&lt;/a&gt;. For a Mac, it&amp;#8217;s a matter of going to the &lt;a href=&quot;http://4store.org/download/macosx/&quot;&gt;list of Mac downloads&lt;/a&gt;, downloading the most recent version, opening the &lt;code&gt;.dmg&lt;/code&gt; and installing the 4store application by dragging it into your Applications folder.&lt;/p&gt;

&lt;p&gt;When you run the 4store application you get a command line prompt. To set up and start a triplestore called &amp;#8216;reference&amp;#8217; with a SPARQL endpoint running on port 8000, type the following commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ 4s-backend-setup reference
$ 4s-backend reference
$ 4s-httpd -p 8000 reference
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you then navigate to &lt;a href=&quot;http://localhost:8000/&quot;&gt;http://localhost:8000/&lt;/a&gt; you should see the following:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-homepage.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Don&amp;#8217;t let the title &amp;#8216;Not found&amp;#8217; put you off. The fact you get a response means that it&amp;#8217;s working.&lt;/p&gt;

&lt;h2&gt;Loading Data&lt;/h2&gt;

&lt;p&gt;First, find some data to load. A good place for government RDF data is &lt;a href=&quot;http://source.data.gov.uk/data/&quot;&gt;http://source.data.gov.uk/data/&lt;/a&gt; for example. I downloaded&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&quot;&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are several ways of &lt;a href=&quot;http://4store.org/trac/wiki/ImportData&quot;&gt;importing data into 4store using the command line&lt;/a&gt;. Yves Raimond has created a &lt;a href=&quot;https://github.com/moustaki/4store-ruby&quot;&gt;Ruby gem&lt;/a&gt; for doing so programmatically. There&amp;#8217;s also &lt;a href=&quot;https://github.com/fumi/rdf-4store&quot;&gt;rdf-4store&lt;/a&gt; from Fumihiro Kato which ties into the &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt; library which I&amp;#8217;ll use later on.&lt;/p&gt;

&lt;p&gt;However, if you use the &lt;a href=&quot;http://4store.org/trac/wiki/SparqlServer&quot;&gt;SPARQL server&lt;/a&gt; then loading data is just an HTTP PUT call, which of course you can do in any language you like (every language has support for making HTTP requests, right?) without the need to install any store-specific packages. Since we&amp;#8217;ll be doing a lot of HTTP requests, it&amp;#8217;s useful to have a library that can make them simple. There are &lt;a href=&quot;http://ruby-toolbox.com/categories/http_clients.html&quot;&gt;plenty to choose from for Ruby&lt;/a&gt;. I chose &lt;a href=&quot;https://github.com/archiloque/rest-client&quot;&gt;rest-client&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo gem install rest-client
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With that, I wrote the following little Ruby script called &lt;a href=&quot;/blog/files/load-data-into-4store_0.rb&quot;&gt;&amp;#8216;load-data-into-4store.rb&amp;#8217;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;

filename = &#039;/Users/Jeni/Downloads/index.rdf&#039;
graph    = &#039;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&#039;
endpoint = &#039;http://localhost:8000/data/&#039;

puts &quot;Loading #{filename} into #{graph} in 4store&quot;
response = RestClient.put endpoint + graph, File.read(filename), :content_type =&amp;gt; &#039;application/rdf+xml&#039;
puts &quot;Response #{response.code}: 
#{response.to_str}&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run the script from the command line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby load-data-into-4store.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you should get the response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loading /Users/Jeni/Downloads/index.rdf into http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 in 4store
Response 201: 
&amp;lt;!DOCTYPE HTML PUBLIC &quot;-//IETF//DTD HTML 2.0//EN&quot;&amp;gt;
&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;201 imported successfully&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;&amp;lt;h1&amp;gt;201 imported successfully&amp;lt;/h1&amp;gt;
&amp;lt;p&amp;gt;This is a 4store SPARQL server.&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;4store v1.0.5&amp;lt;/p&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then check &lt;a href=&quot;http://localhost:8000/status/size/&quot;&gt;http://localhost:8000/status/size/&lt;/a&gt; and you should see that there are now some triples in the store:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-size.jpg&quot; /&gt;
&lt;/p&gt;
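
&lt;p&gt;Since the load is just an HTTP PUT, the same request can be put together in any language. As a hedged illustration, here is roughly what it might look like in Python 3 using only the standard library. The helper name &lt;code&gt;build_put_request&lt;/code&gt; and the placeholder payload are made up, and the sketch only builds the request rather than sending it:&lt;/p&gt;

```python
import urllib.request


def build_put_request(endpoint, graph, rdf_xml):
    """Build (but do not send) the PUT request that loads RDF/XML into a graph."""
    return urllib.request.Request(
        endpoint + graph,
        data=rdf_xml,
        headers={"Content-Type": "application/rdf+xml"},
        method="PUT",
    )


endpoint = "http://localhost:8000/data/"
graph = "http://source.data.gov.uk/data/reference/organogram-co/2010-06-30"

# Placeholder bytes; in practice this would be the contents of index.rdf,
# for example open(filename, "rb").read().
payload = b"placeholder for the RDF/XML document"

put_request = build_put_request(endpoint, graph, payload)
print(put_request.get_method(), put_request.full_url)
```

&lt;p&gt;Passing &lt;code&gt;put_request&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen()&lt;/code&gt; would send it, assuming 4store is running on port 8000 as set up earlier.&lt;/p&gt;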

&lt;h2&gt;Running a Query&lt;/h2&gt;

&lt;p&gt;The next step is to query that data using SPARQL. Running SPARQL queries is just a matter of HTTP POSTing a query to the SPARQL endpoint. 4store provides a page that you can use to test out queries at &lt;a href=&quot;http://localhost:8000/test/&quot;&gt;http://localhost:8000/test/&lt;/a&gt; so perhaps we should do that before diving into the Ruby code. The easy one to start with is just one that returns a list of the types of things that are described within the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type 
WHERE { 
  ?thing a ?type .
} 
ORDER BY ?type
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Paste that into the textarea that&amp;#8217;s provided on &lt;a href=&quot;http://localhost:8000/test/&quot;&gt;http://localhost:8000/test/&lt;/a&gt; so it looks like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-test-query.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and you get some XML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;
&amp;lt;sparql xmlns=&quot;http://www.w3.org/2005/sparql-results#&quot;&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;variable name=&quot;type&quot;/&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;results&amp;gt;
    &amp;lt;result&amp;gt;
      &amp;lt;binding name=&quot;type&quot;&amp;gt;&amp;lt;uri&amp;gt;http://purl.org/linked-data/cube#DataSet&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;
    &amp;lt;/result&amp;gt;
    &amp;lt;result&amp;gt;
      &amp;lt;binding name=&quot;type&quot;&amp;gt;&amp;lt;uri&amp;gt;http://purl.org/linked-data/cube#DataStructureDefinition&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;
    &amp;lt;/result&amp;gt;
    ...
  &amp;lt;/results&amp;gt;
&amp;lt;/sparql&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;SELECT queries like this one (which are the most common kind of query to run to simply extract data) return &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-XMLres/&quot;&gt;SPARQL Query Results XML Format&lt;/a&gt; by default, so there&amp;#8217;s no need to get hold of a specialised library for processing the results: you just need something to process XML.&lt;/p&gt;

&lt;p&gt;For Ruby, I&amp;#8217;m choosing &lt;a href=&quot;http://nokogiri.org/&quot;&gt;Nokogiri&lt;/a&gt; as I&amp;#8217;ve heard good things about it. To install:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install libxml2 libxslt
$ sudo gem install nokogiri
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So now we just need a script that will run this query, process the results as XML, and do something with them. Call it &lt;a href=&quot;/blog/files/find-rdf-types_0.rb&quot;&gt;&amp;#8216;find-rdf-types.rb&amp;#8217;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;
require &#039;nokogiri&#039;

query = &#039;SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type&#039;
endpoint = &#039;http://localhost:8000/sparql/&#039;

puts &quot;POSTing SPARQL query to #{endpoint}&quot;
response = RestClient.post endpoint, :query =&amp;gt; query
puts &quot;Response #{response.code}&quot;
xml = Nokogiri::XML(response.to_str)

xml.xpath(&#039;//sparql:binding[@name = &quot;type&quot;]/sparql:uri&#039;, &#039;sparql&#039; =&amp;gt; &#039;http://www.w3.org/2005/sparql-results#&#039;).each do |type|
  puts type.content
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby find-rdf-types.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8000/sparql/
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So we can see that the dataset includes statistical data using the &lt;a href=&quot;http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html&quot;&gt;data cube&lt;/a&gt; vocabulary, provenance information using &lt;a href=&quot;http://code.google.com/p/opmv/&quot;&gt;OPMV (Open Provenance Model Vocabulary)&lt;/a&gt;, some information about organisations using &lt;a href=&quot;http://www.epimorphics.com/public/vocabulary/org.html&quot;&gt;org&lt;/a&gt;, some data.gov.uk-specific vocabulary, and people using &lt;a href=&quot;http://xmlns.com/foaf/spec/&quot;&gt;FOAF&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Processing RDF&lt;/h2&gt;

&lt;p&gt;Sometimes it can be useful to get non-tabular data out of SPARQL. At that point, rather than using SELECT queries, you will want to use a CONSTRUCT query, which creates RDF. For example, try the query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gets all the information in the data about the individuals for whom names have been supplied in the data, as RDF.&lt;/p&gt;

&lt;p&gt;Although the response is RDF/XML, you definitely &lt;em&gt;do not&lt;/em&gt; want to process it as XML. Instead, you need a proper RDF library. Fortunately, there&amp;#8217;s a good one for Ruby in &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt;. You can install it and a bunch of extra plugins that make it easy to deal with RDF in all its guises using:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo gem install linkeddata
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This lets us pick out an appropriate parser based on the &lt;code&gt;Content-Type&lt;/code&gt; of the response, and load the results of the SPARQL query into an  in-memory &lt;a href=&quot;http://rdf.rubyforge.org/RDF/Graph.html&quot;&gt;&lt;code&gt;RDF::Graph&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;response = RestClient.post endpoint, :query =&amp;gt; query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts &quot;Response #{response.code} type #{content_type}&quot;

graph = RDF::Graph.new
graph &amp;lt;&amp;lt; RDF::Reader.for(:content_type =&amp;gt; content_type).new(response.to_str)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can perform subsequent queries over that graph, for example just to extract names and email addresses and put them into a dictionary:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;query = RDF::Query.new({
  :person =&amp;gt; {
    RDF.type  =&amp;gt; FOAF.Person,
    FOAF.name =&amp;gt; :name,
    FOAF.mbox =&amp;gt; :email,
  }
})

people = {}
query.execute(graph).each do |person|
  people[person.name.to_s] = person.email.to_s
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s worth noting that the constants &lt;code&gt;RDF&lt;/code&gt; and &lt;code&gt;FOAF&lt;/code&gt; are pre-declared by including &lt;code&gt;RDF&lt;/code&gt;, and the values that you get back from a query are RDF values, which can be URIs or have datatypes or languages. In the above code I&amp;#8217;ve converted them into strings for insertion into the Ruby dictionary.&lt;/p&gt;
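
&lt;p&gt;If it helps to see what that pattern match amounts to without any RDF library at all, here is a rough sketch in Python 3 that treats a graph as a plain list of (subject, predicate, object) tuples. The triples, the &lt;code&gt;example.org&lt;/code&gt; URIs and the function name &lt;code&gt;directory&lt;/code&gt; are all made up for illustration, and it assumes each person has at most one name and one mailbox:&lt;/p&gt;

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical triples standing in for the graph loaded from the store.
triples = [
    ("http://example.org/person/1", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/person/1", FOAF + "name", "Example Person"),
    ("http://example.org/person/1", FOAF + "mbox", "mailto:person@example.org"),
    ("http://example.org/person/2", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/person/2", FOAF + "name", "No Email"),
]


def directory(triples):
    """Map each name to a mailbox, keeping only subjects that match all three patterns."""
    persons = set(s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person")
    names = dict((s, o) for s, p, o in triples if p == FOAF + "name")
    mboxes = dict((s, o) for s, p, o in triples if p == FOAF + "mbox")
    return dict((names[s], mboxes[s]) for s in persons if s in names and s in mboxes)


people = directory(triples)
print(people)
```

&lt;p&gt;Anyone missing a name or a mailbox drops out of the result, because every pattern in the query has to match for a solution to be returned.&lt;/p&gt;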

&lt;p&gt;The full script for &lt;a href=&quot;/blog/files/get-names-and-emails_0.rb&quot;&gt;&amp;#8216;get-names-and-emails.rb&amp;#8217;&lt;/a&gt; is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;
require &#039;linkeddata&#039;

include RDF

query = &quot;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}&quot;
endpoint = &#039;http://localhost:8000/sparql/&#039;

puts &quot;POSTing SPARQL query to #{endpoint}&quot;
response = RestClient.post endpoint, :query =&amp;gt; query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts &quot;Response #{response.code} type #{content_type}&quot;

graph = RDF::Graph.new
graph &amp;lt;&amp;lt; RDF::Reader.for(:content_type =&amp;gt; content_type).new(response.to_str)

puts &quot;\nLoaded #{graph.count} triples\n&quot;

query = RDF::Query.new({
  :person =&amp;gt; {
    RDF.type  =&amp;gt; FOAF.Person,
    FOAF.name =&amp;gt; :name,
    FOAF.mbox =&amp;gt; :email,
  }
})

people = {}
query.execute(graph).each do |person|
  people[person.name.to_s] = person.email.to_s
end
puts &quot;\nCreating directory of #{people.length} people&quot;

stott_email = people[&#039;Andrew Stott&#039;]
puts &quot;\nAndrew Stott&#039;s email address: #{stott_email}&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run this script with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby get-names-and-emails.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8000/sparql/
Response 200 type application/rdf+xml

Loaded 459 triples

Creating directory of 75 people

Andrew Stott&#039;s email address: mailto:andrew.stott@cabinet-office.gsi.gov.uk
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Conclusions and Caveats&lt;/h2&gt;

&lt;p&gt;So there you have it, a walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries.&lt;/p&gt;

&lt;p&gt;Now for some caveats. First, you&amp;#8217;re bound to have noticed that I haven&amp;#8217;t followed Richard&amp;#8217;s steps to the letter.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;4store wasn&amp;#8217;t installed from a package management system. The only packaged triplestore I could locate on &lt;a href=&quot;http://www.macports.org/&quot;&gt;MacPorts&lt;/a&gt; was &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot;&gt;Virtuoso&lt;/a&gt; (which I&amp;#8217;ll come to in a second). I hope that 4store&amp;#8217;s installation is simple enough for this slight deviation from the rules not to matter.&lt;/li&gt;
&lt;li&gt;I didn&amp;#8217;t install a package for specifically talking to 4store in order to load in data, just used HTTP requests. There are &lt;a href=&quot;http://4store.org/trac/wiki/ClientLibraries&quot;&gt;client libraries&lt;/a&gt; for 4store, but I figure that the HTTP requests are easy enough, and the resulting code more portable into other environments, so I prefer not to use them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Second, there are a couple of dead ends that I went down that I haven&amp;#8217;t written up in the above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I did spend some time yesterday evening trying to get &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot;&gt;Virtuoso&lt;/a&gt; set up. I managed to get it installed, but loading data into it seemed to require some magic which I couldn&amp;#8217;t figure out. So I went to bed instead.&lt;/li&gt;
&lt;li&gt;I tried to install and use &lt;a href=&quot;http://rdf.rubyforge.org/raptor/&quot;&gt;rdf-raptor&lt;/a&gt; in order to parse the RDF/XML that naturally comes out of 4store CONSTRUCT queries, but got a &lt;code&gt;Could not open library &#039;libraptor&#039;&lt;/code&gt; error. I couldn&amp;#8217;t find an immediate fix for that, so decided to keep things simple instead and just use plain RDF.rb.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Third, I want to reiterate that there may be better ways of using 4store, rest_client, Nokogiri and RDF.rb, as well as Ruby generally, than those shown above. I don&amp;#8217;t claim to be an expert in any of these technologies. If you have suggestions and corrections, I&amp;#8217;d encourage you to add a comment and I&amp;#8217;ll incorporate them in the text to improve it.&lt;/p&gt;

&lt;p&gt;Finally, some general points, because the strong binding of &amp;#8216;linked data&amp;#8217; and &amp;#8216;SPARQL&amp;#8217; in Richard&amp;#8217;s post bothers me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It&amp;#8217;s not necessary to have a SPARQL endpoint when publishing linked data, nor to run your own triplestore. If you already have a website, you are probably better off generating N-Triples or RDF/XML or Turtle in the same way as you generate HTML or XML or JSON.&lt;/li&gt;
&lt;li&gt;It&amp;#8217;s not necessary to learn SPARQL to access and use linked data: the whole point is that the data in linked data is available through HTTP access in standard (RDF-based) formats, so you can scrape them using a follow-your-nose approach and store the results however you like.&lt;/li&gt;
&lt;/ul&gt;
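
&lt;p&gt;As a rough illustration of that follow-your-nose approach, here is a minimal sketch that uses only Ruby&amp;#8217;s standard &lt;code&gt;net/http&lt;/code&gt; library. The &lt;code&gt;rdf_request&lt;/code&gt; and &lt;code&gt;fetch_rdf&lt;/code&gt; helpers and the default &lt;code&gt;Accept&lt;/code&gt; value are assumptions made for illustration, not part of any library, and a real crawler would also need to follow redirects and parse what comes back:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'net/http'
require 'uri'

# Build a GET request that asks for RDF in preference to HTML.
# The default Accept value is an assumption; adjust it to the formats you can parse.
def rdf_request(url, accept = "text/turtle, application/rdf+xml;q=0.9")
  request = Net::HTTP::Get.new(URI.parse(url))
  request["Accept"] = accept
  request
end

# Send the request over plain HTTP and return the Content-Type and body.
# Redirects are not followed in this sketch.
def fetch_rdf(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(rdf_request(url))
  end
  [response["Content-Type"], response.body]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;fetch_rdf&lt;/code&gt; with the URI of something you are interested in should hand back a media type and a document; any URIs mentioned in that document can then be fetched in the same way and the results stored however you like.&lt;/p&gt;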

&lt;p&gt;Having said the above, if you&amp;#8217;re collecting linked data from multiple sources with unpredictable content and want to query across it, having a local triplestore is very useful.&lt;/p&gt;

&lt;p&gt;I also want to point out that within the &lt;a href=&quot;http://data.gov.uk/linked-data&quot;&gt;linked data we&amp;#8217;ve published on data.gov.uk&lt;/a&gt;, we&amp;#8217;ve made a big effort to make the data available in multiple formats such as JSON, XML and CSV, and through a RESTful, URI-parameter-driven API, precisely to lower the barrier for developers who want to use that information but understandably don&amp;#8217;t want to take the time or make the effort to learn the linked data technologies that underlie the sites. For those that do, the RDF/XML and Turtle are there as well, and the SPARQL queries that are used to create each page are available to look at, tweak, and learn from. Our hope is that the &lt;a href=&quot;http://code.google.com/p/linked-data-api/&quot;&gt;linked data API&lt;/a&gt; that provides access to lists of &lt;a href=&quot;http://education.data.gov.uk/doc/school&quot;&gt;schools&lt;/a&gt;, &lt;a href=&quot;http://reference.data.gov.uk/doc/department&quot;&gt;departments&lt;/a&gt; and &lt;a href=&quot;http://transport.data.gov.uk/doc/station&quot;&gt;railway stations&lt;/a&gt; can make the linked data learning curve a little less steep.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/152#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/61">4store</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/62">rdf.rb</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/63">ruby</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <enclosure url="http://www.jenitennison.com/blog/files/load-rdf-into-4store_0.rb" length="437" type="text/x-ruby-script" />
 <pubDate>Sat, 15 Jan 2011 19:17:57 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">152 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Priorities for RDF</title>
 <link>http://www.jenitennison.com/blog/node/149</link>
 <description>&lt;p&gt;A couple of weeks ago I did a talk at the &lt;a href=&quot;http://www.w3.org/2010/11/TPAC/PlenaryAgenda&quot;&gt;TPAC Plenary Day&lt;/a&gt; about why RDF hasn&amp;#8217;t had the uptake that it might and what could be done about it.&lt;/p&gt;

&lt;p&gt;I felt quite uncomfortable about doing this for many reasons. The predominant one is that I&amp;#8217;m well aware that the world is made by the people who turn up. It is far far easier to snipe from the sidelines than it is to put in the effort to attend telcons and face-to-face meetings, to engage on mailing lists, to write specifications and implementations and tutorials.&lt;/p&gt;

&lt;p&gt;On the other hand, what I hope is that the perspective of someone who is outside that process, someone who tries to understand and interpret and &lt;em&gt;use&lt;/em&gt; the results of that process, might be valuable. And so I aimed to provide that honestly.&lt;/p&gt;

&lt;p&gt;In that spirit, I&amp;#8217;m going to put my stake in the ground and say that there are three areas where I think W3C should be concentrating its efforts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;standardising (something like) TriG &amp;#8212; Turtle plus named graphs&lt;/li&gt;
&lt;li&gt;standardising an API for the RDF data model&lt;/li&gt;
&lt;li&gt;standardising a path language for RDF that can be used by that API and others for easy access&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;and that it should specifically &lt;em&gt;not&lt;/em&gt; put its efforts into standardising another syntax for RDF based on JSON.&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;Format Standardisation&lt;/h2&gt;

&lt;p&gt;The first point is that I think we need to decide on a single recommended format for RDF. &lt;/p&gt;

&lt;p&gt;Fundamentally, unlike XML or JSON, RDF is defined first and foremost as a model rather than as a syntax. That means it can be expressed in a number of syntaxes, the most common of which are &lt;a href=&quot;http://www.w3.org/TR/REC-rdf-syntax/&quot;&gt;RDF/XML&lt;/a&gt;, &lt;a href=&quot;http://www.w3.org/TeamSubmission/turtle/&quot;&gt;Turtle&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/TR/rdf-testcases/#ntriples&quot;&gt;N-Triples&lt;/a&gt; though of course there&amp;#8217;s also &lt;a href=&quot;http://www.w3.org/TR/xhtml-rdfa-primer/&quot;&gt;RDFa&lt;/a&gt;, &lt;a href=&quot;http://n2.talis.com/wiki/RDF_JSON_Specification&quot;&gt;RDF/JSON&lt;/a&gt;, &lt;a href=&quot;http://json-ld.org/&quot;&gt;JSON-LD&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/DesignIssues/Notation3&quot;&gt;N3&lt;/a&gt; and if you start factoring in named graphs you can add &lt;a href=&quot;http://www.hpl.hp.com/techreports/2004/HPL-2004-56.html&quot;&gt;TriX&lt;/a&gt;, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/TriG/&quot;&gt;TriG&lt;/a&gt; and &lt;a href=&quot;http://sw.deri.org/2008/07/n-quads/&quot;&gt;N-Quads&lt;/a&gt; to the list.&lt;/p&gt;

&lt;p&gt;Except for a few corner cases, it would be perfectly possible to express the same RDF model in any of these syntaxes. Why is this so bad? Surely having choice is a good thing, because publishers can choose an option that fits with their workflows? And aren&amp;#8217;t all these formats generated automatically anyway, such that the same data can be provided in many ways with no overhead?&lt;/p&gt;

&lt;p&gt;Well, no, there are actually two ways in which &lt;strong&gt;having multiple syntaxes makes adoption harder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;publishers aren&amp;#8217;t always generating data automatically&lt;/strong&gt;; in a number of cases (which I think and hope will grow) RDF data is being generated just like CSV files are, as static documents which are simply published in the same way as other static documents. In these cases, publishers either have to do the research and make a decision about which format to use, or produce the data in multiple formats. This is a particular challenge when people aren&amp;#8217;t convinced they want to generate RDF anyway.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;toolsets have to handle producing or consuming multiple formats&lt;/strong&gt;. That means more code, more testing and more maintenance on both the production and consumption sides of the equation, all of which raise the implementation burden.&lt;/p&gt;

&lt;p&gt;Of course it&amp;#8217;s natural that during the initial stages of the use of a technology we should see a variety of patterns of use: there need to be innovations and experiments so that we can find what works and what doesn&amp;#8217;t. But as that technology matures, we need to start bedding down some basics. There need to be agreed foundations that, &lt;em&gt;even if imperfect&lt;/em&gt;, are solid enough for the majority of us to build upon. And we need to exercise some self-restraint to concentrate on doing that building rather than revisiting those decisions.&lt;/p&gt;

&lt;p&gt;We have a number of years of experience now about what formats are easy to understand, to pass around, to create and to process. It is time, I think, to pick one, to get it standardised, to deprecate others and to provide a much cleaner and clearer picture to publishers and consumers.&lt;/p&gt;

&lt;p&gt;Of the formats that we have, the one that fits best with the RDF data model and is simplest for humans to understand is Turtle. But it needs to support named graphs, so that it&amp;#8217;s possible to share the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-query/#rdfDataset&quot;&gt;RDF datasets&lt;/a&gt; that are exposed within a SPARQL endpoint, which is why I say W3C should standardise something like TriG.&lt;/p&gt;

&lt;h2&gt;RDF APIs&lt;/h2&gt;

&lt;p&gt;The second point is to work on standardising the APIs that are available for developers who work with RDF. Why standardise APIs? Because it would make accessing RDF easier and more predictable for web developers, who often work across multiple languages and platforms. Developers don&amp;#8217;t really care about syntax &amp;#8212; although having something readable is useful for debugging &amp;#8212; they care about the way in which they get to interact with in-memory structures that hold the data.&lt;/p&gt;

&lt;p&gt;RDF needs an API that exposes its internal model (of literals and resources and triples and graphs and datasets) in a way that isn&amp;#8217;t too onerous for people to use. There are lots and lots of RDF APIs about, within the various parsers that are available for different platforms; the only one that&amp;#8217;s approaching a standard is the one embedded within the &lt;a href=&quot;http://www.w3.org/TR/rdfa-api/&quot;&gt;RDFa API specification&lt;/a&gt;. I would like to see that disentangled from RDFa and for it, or something like it, to gain traction amongst the writers of RDF libraries such as the &lt;a href=&quot;http://librdf.org/&quot;&gt;Redland RDF libraries&lt;/a&gt;, &lt;a href=&quot;http://www.rdflib.net/&quot;&gt;RDFLib&lt;/a&gt;, &lt;a href=&quot;http://code.google.com/p/moriarty/&quot;&gt;Moriarty&lt;/a&gt;, &lt;a href=&quot;https://github.com/tommorris/reddy&quot;&gt;Reddy&lt;/a&gt;, &lt;a href=&quot;http://code.google.com/p/rdfquery/&quot;&gt;rdfQuery&lt;/a&gt; and so on and on.&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;having an API for RDF&amp;#8217;s data model is not enough&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I think there is a lot that we can learn from XML&amp;#8217;s experience here. James Clark&amp;#8217;s recent blog post about &lt;a href=&quot;http://blog.jclark.com/2010/11/xml-vs-web_24.html&quot;&gt;XML and the web&lt;/a&gt; describes what it&amp;#8217;s like for developers working with XML compared to JSON:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The fundamental problem is the mismatch between programming language data structures and the XML element/attribute data model of elements. This leaves the developer with three choices, all unappetising:&lt;/p&gt;
  
  &lt;ul&gt;
  &lt;li&gt;live with an inconvenient element/attribute representation of the data;&lt;/li&gt;
  &lt;li&gt;descend into XML Schema hell in the company of your favourite data binding tool;&lt;/li&gt;
  &lt;li&gt;write reams of code to convert the XML into a convenient data structure.&lt;/li&gt;
  &lt;/ul&gt;
  
  &lt;p&gt;By contrast with JSON, especially with a dynamic programming language, you can get a reasonable in-memory representation just by calling a library function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So JSON is popular because accessing information within the JSON is really easy. And that&amp;#8217;s for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;it&amp;#8217;s parsed with a single simple function call in a common library&lt;/li&gt;
&lt;li&gt;the result of parsing is simple to navigate; typically you can do so using native methods such as dot-notation paths&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first of these is a simple matter of winning hearts and minds. The second is the important one: it&amp;#8217;s easy to use because the underlying JSON model fits neatly onto the object-oriented programming paradigm that most developers use.&lt;/p&gt;

&lt;p&gt;XML isn&amp;#8217;t so popular among web developers because its underlying model doesn&amp;#8217;t fit well into most programming languages: it has attributes and mixed content and a whole bunch of other things that don&amp;#8217;t map straight-forwardly onto objects-with-properties. Navigating through XML (or HTML) structures using a DOM is tedious and automatic binding mostly doesn&amp;#8217;t work.&lt;/p&gt;

&lt;p&gt;What about RDF? On the face of it, RDF is a good fit with object-oriented models; they both follow a basic entity-attribute-value approach. However, there are (at least) three things in RDF that do not fit with the object-oriented model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Properties in RDF are identified through URIs rather than simple names (ie those containing just letters and numbers and underscores). Some programming languages, such as Javascript, let you have properties that aren&amp;#8217;t simple names, but you then have to access them through the relatively clunky &lt;code&gt;[]&lt;/code&gt; notation rather than dot-notation paths. Properties are first-class objects in RDF with things like labels and ranges and inverses; fitting with standard programming languages here means using &lt;a href=&quot;http://en.wikipedia.org/wiki/Reflection_(computer_science)&quot;&gt;reflection&lt;/a&gt; and having the ability to annotate fields, and everything gets a bit mind-bending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Values in RDF often have datatypes or languages associated with them, and the set of datatypes that you can use is completely extensible (and of course datatypes are first-class objects, with their own properties, too). This wouldn&amp;#8217;t be so bad except that making every value an object means comparisons with basic strings or numbers won&amp;#8217;t generally work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In RDF, a property can have more than one value (an unordered bag of values, if you like) or it can have a value that is an &lt;code&gt;rdf:List&lt;/code&gt; (an ordered sequence of values); it can even have many values which are &lt;code&gt;rdf:List&lt;/code&gt;s. On the other hand, the object-oriented model generally supports values that are arrays (and of course you can have arrays within arrays), which are always ordered. So there is always a choice to be made when mapping from an object-oriented model to RDF, about whether the values should be at the same level or be &lt;code&gt;rdf:List&lt;/code&gt;s.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, just as with XML, &lt;strong&gt;there is no straight-forward mapping from RDF to an object structure that developers can immediately use&lt;/strong&gt;. That doesn&amp;#8217;t stop us trying, of course:&lt;/p&gt;

&lt;p&gt;There&amp;#8217;s the approach that we use within the &lt;a href=&quot;http://code.google.com/p/linked-data-api/&quot;&gt;linked data API&lt;/a&gt; and elsewhere, which is to make it easy for data publishers to create simple JSON versions of the RDF they publish. A website-specific configuration file determines what the mapping looks like. URIs of properties get turned into readable names (and provide the map back to the original property URI, so that it&amp;#8217;s possible to get more information about the property). Datatypes and languages are ignored by default (but mapped onto structured values if so configured). And the distinction between properties-with-multiple-values and properties-whose-value-is-a-List is ignored. We purposefully lose some of RDF&amp;#8217;s expressivity and power in order to gain usability. You can see the result in action at &lt;a href=&quot;http://education.data.gov.uk/doc/school&quot;&gt;http://education.data.gov.uk/doc/school&lt;/a&gt; and &lt;a href=&quot;http://education.data.gov.uk/doc/department&quot;&gt;http://education.data.gov.uk/doc/department&lt;/a&gt; for example.&lt;/p&gt;
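
&lt;p&gt;To make that trade-off concrete, here is a minimal sketch of this kind of mapping in plain Ruby. It is not the linked data API&amp;#8217;s implementation: the &lt;code&gt;simplify&lt;/code&gt; function, the &lt;code&gt;_about&lt;/code&gt; key, the short names and the sample triples are all assumptions made for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'json'

# Hypothetical site-specific configuration: property URIs and their short names.
short_names = [
  ["http://xmlns.com/foaf/0.1/name", "name"],
  ["http://xmlns.com/foaf/0.1/mbox", "email"],
].to_h

# Made-up triples: [subject URI, property URI, value with optional language].
triples = [
  ["http://example.org/id/alice", "http://xmlns.com/foaf/0.1/name", { value: "Alice", lang: "en" }],
  ["http://example.org/id/alice", "http://xmlns.com/foaf/0.1/mbox", { value: "mailto:alice@example.org" }],
  ["http://example.org/id/alice", "http://xmlns.com/foaf/0.1/mbox", { value: "mailto:a.smith@example.org" }],
]

# Group triples by subject into plain hashes that serialise to simple JSON.
def simplify(triples, short_names)
  items = {}
  triples.each do |subject, property, object|
    item = (items[subject] ||= {})
    item["_about"] = subject
    name = short_names.fetch(property, property)  # fall back to the full URI
    value = object[:value]                        # language and datatype are dropped
    # A repeated property becomes an array; the multiple-values/List distinction is lost.
    item[name] = item.key?(name) ? [item[name], value].flatten : value
  end
  items.values
end

puts JSON.pretty_generate(simplify(triples, short_names))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The result is easy to navigate as ordinary JSON, but the language tag on the name has gone and there is no way to tell from the output whether the two email addresses were separate values or members of a list.&lt;/p&gt;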

&lt;p&gt;There&amp;#8217;s the approach that Nathan has taken within &lt;a href=&quot;https://github.com/webr3/js3&quot;&gt;js3&lt;/a&gt; which is to create libraries that, with a bit of work on the client side, give a way of mapping RDF into an object-oriented structure which is easy to manipulate (or vice versa, create RDF from OO structures). It&amp;#8217;s the same basic principle as helping publishers to generate JSON, but the interpretation and mapping is done by the client rather than the publisher. The work that Nathan&amp;#8217;s done to manage this in Javascript is very impressive; I don&amp;#8217;t know whether the same approach can be mapped to other languages.&lt;/p&gt;

&lt;p&gt;But as James intimated, data binding is hairy and scary. &lt;strong&gt;Mappings between different data models are always imperfect, lossy, sensitive to what seem like small changes and therefore hard to maintain&lt;/strong&gt;. I remember nodding along as Mike Kay talked about this at the XML Summer School in relation to the use of XML: the horrors of working with systems in which there are three-way maps between relational and object-oriented and XML structures, and the relief that comes with working with an XML-only architecture. I suspect this same observation is one of the drivers behind the growth of JSON databases.&lt;/p&gt;

&lt;p&gt;On the other hand, you know, perhaps RDF is close enough to the object-oriented model that it won&amp;#8217;t be so bad. Perhaps we could find a way to standardise on a method of configuring applications that do the mapping, such as defining short names for properties, describing how to handle objects with datatypes and languages and so on. We have &lt;a href=&quot;http://code.google.com/p/linked-data-api/wiki/JSONFormats&quot;&gt;a body of experience&lt;/a&gt; that can be learnt from, including the ones above, and perhaps it can be tied into the &lt;a href=&quot;http://www.w3.org/TR/r2rml/&quot;&gt;RDB-to-RDF&lt;/a&gt; work too. The biggest challenge, I suspect, will be to create something round-trippable.&lt;/p&gt;

&lt;h2&gt;Path Languages&lt;/h2&gt;

&lt;p&gt;The other option that James didn&amp;#8217;t mention but that I touched on in my TPAC talk is to learn from how working with HTML and XML has been made easier in libraries such as &lt;a href=&quot;http://jquery.org/&quot;&gt;jQuery&lt;/a&gt; or &lt;a href=&quot;http://hpricot.com/&quot;&gt;hpricot&lt;/a&gt;. These libraries still allow the HTML and XML to be accessed through a DOM, rather than mapping HTML or XML into object structures, but &lt;strong&gt;make the lives of developers simpler by supporting querying of the HTML/XML using path languages that are &lt;em&gt;designed&lt;/em&gt; to be used to query those kinds of structures&lt;/strong&gt;. For HTML, that&amp;#8217;s CSS; for XML that&amp;#8217;s XPath. (It&amp;#8217;s the same approach as is used for strings: we use regular expressions for many operations rather than working with them at the character level.) Path languages work over the native model; all that&amp;#8217;s offered in the library are functions that take strings (holding the path language) and return objects or values as appropriate.&lt;/p&gt;

&lt;p&gt;I don&amp;#8217;t know exactly what it looks like, and it might already be out there (the world moves fast and I know I&amp;#8217;m not aware of everything), but what I think we need is a path language for navigating around RDF, probably based on &lt;a href=&quot;http://www.w3.org/TR/sparql11-query/#propertypaths&quot;&gt;SPARQL property paths&lt;/a&gt; or the &lt;a href=&quot;http://www.w3.org/2005/04/fresnel-info/fsl/&quot;&gt;FRESNEL selector language&lt;/a&gt; and an API (or APIs) that uses it. For example, something that lets developers use code like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;graph.find(&quot;*[foaf:nick = &#039;web3r&#039;]/foaf:name&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;to pick values out of an in-memory graph. In my opinion, something like this would be much more likely to bring benefits than a data binding approach.&lt;/p&gt;
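
&lt;p&gt;To give a feel for how little such a library function needs to expose, here is a toy sketch in plain Ruby. It handles only paths of the single shape shown above, over an array of made-up triples; the &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;expand&lt;/code&gt; functions, the prefix table and the data are all hypothetical, and a real path language would need a proper grammar and a real RDF graph behind it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical prefix bindings for the path language.
prefixes = [["foaf", "http://xmlns.com/foaf/0.1/"]].to_h

# Made-up in-memory graph: each triple is [subject, property, value].
graph = [
  ["http://example.org/id/nathan", "http://xmlns.com/foaf/0.1/nick", "web3r"],
  ["http://example.org/id/nathan", "http://xmlns.com/foaf/0.1/name", "Nathan"],
  ["http://example.org/id/jeni",   "http://xmlns.com/foaf/0.1/nick", "JeniT"],
  ["http://example.org/id/jeni",   "http://xmlns.com/foaf/0.1/name", "Jeni"],
]

# Expand a prefixed name such as foaf:name into a full property URI.
def expand(name, prefixes)
  prefix, local = name.split(":", 2)
  prefixes.fetch(prefix) + local
end

# Evaluate paths of exactly one shape: *[prop = 'value']/prop
def find(triples, path, prefixes)
  match = path.match(/\A\*\[([\w:]+)\s*=\s*'([^']*)'\]\/([\w:]+)\z/)
  raise ArgumentError, "unsupported path: #{path}" unless match
  filter = expand(match[1], prefixes)
  step = expand(match[3], prefixes)
  # Subjects that have the filter property with the given value...
  subjects = triples.select { |_, prop, value| prop == filter and value == match[2] }
                    .map { |subject, _, _| subject }
  # ...and the values of the step property on those subjects.
  triples.select { |subject, prop, _| subjects.include?(subject) and prop == step }
         .map { |_, _, value| value }
end

puts find(graph, "*[foaf:nick = 'web3r']/foaf:name", prefixes).inspect
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The caller passes in a string and gets back ordinary Ruby values, without ever having to map the graph onto objects first.&lt;/p&gt;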

&lt;h2&gt;Why Not RDF in JSON?&lt;/h2&gt;

&lt;p&gt;What I&amp;#8217;ve tried to explain above is firstly that we already have too many syntaxes for RDF, and secondly that the main barrier to developers using RDF is the way in which they are forced to interact with that RDF once they have hold of it, not the syntax itself. The syntax that we use for RDF really doesn&amp;#8217;t matter, because developers interact with the in-memory dataset, not directly with the syntax.&lt;/p&gt;

&lt;p&gt;Nathan&amp;#8217;s recent post on &lt;a href=&quot;http://webr3.org/blog/linked-data/opening-linked-data/&quot;&gt;Opening Linked Data&lt;/a&gt;, which is worth reading in its entirety, captures the essence of the issue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You can&amp;#8217;t shoe horn RDF in to JSON, no matter how hard you try - well, you can, but you loose all the benefits of JSON in the first place, because the data is RDF, triples and not objects, rdf nodes and not simple values&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, &lt;strong&gt;using JSON as the basis for an RDF syntax doesn&amp;#8217;t actually win you anything in terms of the ease of processing of that RDF&lt;/strong&gt;. In fact, I&amp;#8217;ll go further and say it has exactly the same bad qualities as RDF/XML.&lt;/p&gt;

&lt;p&gt;One of the bad things about RDF/XML is that it misleads people into thinking they can use normal XML tooling to process RDF, but XML tooling exposes the XML tree, not the RDF graph that they need. It&amp;#8217;s good enough in some circumstances, of course, but it&amp;#8217;s not working with RDF as RDF. Similarly, using XML tools doesn&amp;#8217;t make RDF/XML easy to generate; it&amp;#8217;s a lot safer to generate RDF/XML from an in-memory graph, in the same way that generating XML using string manipulation is harder work than it first appears.&lt;/p&gt;

&lt;p&gt;In exactly the same way, I think that a JSON-based syntax for RDF will mislead developers into thinking that they can interpret and generate that JSON like they can normal JSON, and interact with it at that level, when this simply isn&amp;#8217;t the case.&lt;/p&gt;

&lt;p&gt;The only advantage that I can see for a JSON-based RDF syntax is equivalent to the only advantage of RDF/XML: it is easier to store for people who use JSON databases, just as RDF/XML is easier to store for people who use XML databases. I am not sure that benefit is worth the cost of an additional RDF syntax; isn&amp;#8217;t RDF better stored in a triplestore?&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;So to reiterate, as far as I&amp;#8217;m concerned, W3C and the RDF community should be concentrating on a syntax for RDF that doesn&amp;#8217;t come saddled with those kinds of assumptions, which I think is Turtle + graphs; something like TriG. They should be concentrating on developing a standard API for RDF access that has a chance of adoption among the developers of RDF libraries, and on working out what parts of SPARQL and FRESNEL could be used to create a path language that could be reused in several contexts, including within such an API. And these should be done in preference to an RDF syntax in JSON, which doesn&amp;#8217;t solve the core problems, and in fact just adds another syntax to the mix.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/149#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/60">tpac2010</category>
 <pubDate>Sun, 28 Nov 2010 21:44:52 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">149 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Using Freebase Gridworks to Create Linked Data</title>
 <link>http://www.jenitennison.com/blog/node/145</link>
 <description>&lt;p&gt;When we encourage people to put their data on the web as linked data, the biggest question is &amp;#8220;How?&amp;#8221;. There are so many &amp;#8220;How?&amp;#8221; questions to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how do we choose what URIs to use for things?&lt;/li&gt;
&lt;li&gt;how do we choose what vocabularies to use?&lt;/li&gt;
&lt;li&gt;how do we handle changing data?&lt;/li&gt;
&lt;li&gt;how do we tell people how the data was created?&lt;/li&gt;
&lt;li&gt;how do we publish it?&lt;/li&gt;
&lt;li&gt;how will other people know about it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and, of course:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how do we create it?&lt;/li&gt;
&lt;/ul&gt;

&lt;!--break--&gt;

&lt;p&gt;Our goal within the linked data part of data.gov.uk (and I know we haven&amp;#8217;t achieved it yet) is to both answer these questions and to make the answers as simple as possible. The answers to the questions &lt;em&gt;cannot&lt;/em&gt; either require up-front knowledge of all possible types of data that might be published or depend on the availability of linked data for all the things we want to talk about. It &lt;em&gt;cannot&lt;/em&gt; require registration at centralised services. It &lt;em&gt;cannot&lt;/em&gt; require everyone to do everything in the same way or at the same pace.&lt;/p&gt;

&lt;p&gt;We must adopt an approach that encourages people to make their data available in forms that are easier for other people to pick up and use &lt;strong&gt;because they see the benefits for them&lt;/strong&gt; and their stakeholders and because the effort of doing so is not too high to bear. We must grow, adapt and evolve incrementally. If linked data eventually wins, it will be due to its benefits, not to faith.&lt;/p&gt;

&lt;p&gt;Anyway, enough rant. The point of this blog post is to talk about one of the answers to the &amp;#8216;How do we create it?&amp;#8217; question: using &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/&quot;&gt;Freebase Gridworks&lt;/a&gt;. For those who haven&amp;#8217;t encountered it, Gridworks is an incredibly useful application that enables you to easily analyse, clean and manipulate tabular data. In a few steps, it can be used to generate linked datasets which can then be published on the web just like any other file, ready for other people to reuse without jumping through hoops. I&amp;#8217;m going to assume that you can &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/wiki/Downloads?tm=2&quot;&gt;download it&lt;/a&gt; and &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/wiki/GettingStarted&quot;&gt;install it&lt;/a&gt; following the instructions provided on the Gridworks site.&lt;/p&gt;

&lt;p&gt;In this post, I&amp;#8217;m going to talk about how to use Gridworks to generate linked data, using an example of local government spending data from &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;Windsor and Maidenhead council&lt;/a&gt;. Like a good train journey, there&amp;#8217;s quite a lot to see along the way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Many thanks to Dave Reynolds for his work on this data and comments on an earlier version of this post.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Importing Data&lt;/h2&gt;

&lt;p&gt;The first step is to import the data into Gridworks. If you just take the Windsor &amp;amp; Maidenhead data and import it directly, you&amp;#8217;ll get a single not-very-useful column as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/bad-import.jpg&quot; title=&quot;Bad import into Gridworks&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you look at the spreadsheet in a normal spreadsheet programme then you&amp;#8217;ll see why. Like a lot of spreadsheets created by normal people, who want to create something readable by human beings rather than computers, it has some extra lines at the top to explain what the spreadsheet contains, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/spreadsheet.jpg&quot; title=&quot;Original spreadsheet&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, Gridworks lets us easily skip over these first few lines. When you import the data, put the number &lt;code&gt;1&lt;/code&gt; in the box for &amp;#8220;Ignore X initial non-blank lines&amp;#8221;, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/import-dialog.jpg&quot; title=&quot;Import dialog&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;(You need the number &lt;code&gt;1&lt;/code&gt; because although there are three lines before the table really starts, the second two of those are blank.)&lt;/p&gt;

&lt;p&gt;That done, the data should look a lot more useful, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/good-import.jpg&quot; title=&quot;Good import into Gridworks&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;h2&gt;Cleaning Data&lt;/h2&gt;

&lt;p&gt;The next thing to do is to explore the data a bit to get a handle on what&amp;#8217;s there and work out whether any cleaning or rationalisation is necessary to improve its quality.&lt;/p&gt;

&lt;p&gt;With columns that hold names, such as &amp;#8216;Directorate&amp;#8217;, &amp;#8216;Service&amp;#8217; or &amp;#8216;Supplier Name&amp;#8217;, you&amp;#8217;re looking for slight misspellings caused by bad data entry. Gridworks helps you find these by creating a list of the distinct values for a particular column and telling you how many instances there are of each. Use the arrow at the side of the column name to pull down the menu, then choose &lt;code&gt;Facet &amp;gt; Text Facet&lt;/code&gt; to create this list, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/facet-menu.jpg&quot; title=&quot;Choosing from the facet menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once you&amp;#8217;ve chosen &lt;code&gt;Text Facet&lt;/code&gt;, the list pops up on the left hand side of the window. You can click on any of the values to filter the table to just the rows that have that value in that column, but you can also scan through the list to spot anything that looks like a typo, or two entries that should really be the same. For example, the Services list holds both &amp;#8216;Libraries &amp;amp; Information Services&amp;#8217; and &amp;#8216;Library &amp;amp; Information Services&amp;#8217;, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/services-list.jpg&quot; title=&quot;Repetition in the Services list&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s unlikely that there are really two distinct services with such similar names, so we&amp;#8217;d like to clean up this data by standardising on one name or another. You can quickly change all occurrences of one value to another using the &lt;code&gt;edit&lt;/code&gt; option that appears just to the right of the value when you hover over it. This brings up a dialog that enables you to change all of those values to something else, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-value-dialog.jpg&quot; title=&quot;Editing a value across the spreadsheet&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can do something similar with numeric columns, such as the &amp;#8216;Amount excl vat £&amp;#8217; column. This time choose &lt;code&gt;Numeric Facet&lt;/code&gt; rather than &lt;code&gt;Text Facet&lt;/code&gt; and you&amp;#8217;ll get a histogram up as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/amount-facet.jpg&quot; title=&quot;Amount histogram&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is useful for identifying outliers. If you grab the handle on the left of the histogram and move it to the centre, the rows will get filtered to only those that have an amount within that range. For example, moving it to only show rows between £500,000 and £1,500,000 shows that there are three payments of this size, all made by Children&amp;#8217;s Services to Wilmott Dixon Construction Limited, as shown in this screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/high-value-transactions.jpg&quot; title=&quot;High value transactions&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Although these values are much higher than most of the others in the spreadsheet, they don&amp;#8217;t seem to be errors &amp;#8212; I guess a new school was being built or something &amp;#8212; so there&amp;#8217;s nothing to correct here, but it shows how numeric facets can be used to explore the data.&lt;/p&gt;

&lt;p&gt;Another approach to exploring and cleaning the data is to use the clustering algorithms that are built into Gridworks to identify duplicates. To do this, pull down the column menu and this time choose &lt;code&gt;Edit Cells... &amp;gt; Cluster and Edit&lt;/code&gt;, as shown in the following screenshot, this time for the &amp;#8216;Supplier Name&amp;#8217; column:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-cells-menu.jpg&quot; title=&quot;Choosing from the Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This brings up a dialog that groups together values that look similar. In this case, &amp;#8216;Siemens plc&amp;#8217; and &amp;#8216;Siemens PLC&amp;#8217;, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/cluster-dialog.jpg&quot; title=&quot;Clustering values in a column&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can use this dialog to change all the similar values to a standard one. Check the &lt;code&gt;Merge&lt;/code&gt; checkbox for the clusters of values that should be merged, edit the &lt;code&gt;New Cell Value&lt;/code&gt; field to whatever standard value you want to adopt, and choose &lt;code&gt;Apply &amp;amp; Re-cluster&lt;/code&gt; or simply &lt;code&gt;Apply &amp;amp; Close&lt;/code&gt; to make the change.&lt;/p&gt;

&lt;p&gt;You will often find that the default clustering algorithm (key collision/fingerprint) doesn&amp;#8217;t come up with any clusters as it&amp;#8217;s fairly conservative. It&amp;#8217;s worth playing around a bit with different algorithms to look for other duplicates by selecting other possibilities from the drop-down menus. For example, choosing the &amp;#8216;nearest neighbour&amp;#8217; method with the Levenshtein distance function and a radius of 2 (edits) results in four possible duplicates within the Suppliers list, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/levenstein-cluster.jpg&quot; title=&quot;Clustering values with Levenshtein distance&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re not sure about whether the cluster is due to a typo or not, hover over the row and click on the &lt;code&gt;Browse this cluster&lt;/code&gt; link that appears. That will bring up a separate window that will show you just the rows in the cluster, from which you should be able to make a judgement. For example, it&amp;#8217;s not clear whether &amp;#8216;Academia Ltd&amp;#8217; is a typo for &amp;#8216;Academics Ltd&amp;#8217; but browsing the cluster shows that the Cost Centre codes and the Types of the transactions are completely different for the two Suppliers, so they are probably different.&lt;/p&gt;
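
&lt;p&gt;To give a feel for what the two clustering methods mentioned above are doing, here is a rough Python sketch. It is not the Gridworks implementation: the &lt;code&gt;fingerprint&lt;/code&gt; function is my simplified assumption of how a fingerprint key is built (trim, lower-case, strip punctuation, then sort and de-duplicate the words), and &lt;code&gt;levenshtein&lt;/code&gt; is the textbook edit distance.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import string

def fingerprint(value):
    """Key for key-collision clustering: case, punctuation and word order are ignored."""
    cleaned = value.strip().lower()
    cleaned = "".join(ch for ch in cleaned if ch not in string.punctuation)
    return " ".join(sorted(set(cleaned.split())))

def levenshtein(a, b):
    """Count the single-character insertions, deletions or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Values that share a key collide, and so end up in the same cluster
same_key = fingerprint("Siemens plc") == fingerprint("Siemens PLC")

# Nearest-neighbour clustering groups values whose distance is within the radius
distance = levenshtein("Academia Ltd", "Academics Ltd")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this sketch the two Siemens spellings produce the same key, while the two Academia/Academics names are two edits apart, which is why they only show up as a candidate cluster once the radius is set to 2.&lt;/p&gt;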

&lt;h2&gt;Deriving Data&lt;/h2&gt;

&lt;p&gt;The next step is to derive some data from what we have within the spreadsheet. Since our goal is to produce linked data, the kind of derived data that we&amp;#8217;re interested in are URIs.&lt;/p&gt;

&lt;p&gt;At this point we need to start making decisions about what URIs to use. If you look at the &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;list of spending data from Windsor and Maidenhead&lt;/a&gt;, you&amp;#8217;ll see that there are a whole bunch of these spreadsheets. It would be really useful if we could tie these spreadsheets together by using the same URIs for the same things across the datasets. For that reason, the only URI that&amp;#8217;s going to be local to the dataset is the URI for each line (or data point if you like) itself. On the other hand, most of the things that are named here are going to be local to Windsor &amp;amp; Maidenhead: &amp;#8216;Abba Cars&amp;#8217; may be sufficient to identify a single company within Windsor &amp;amp; Maidenhead, but certainly wouldn&amp;#8217;t be nationwide. So the URIs I&amp;#8217;m going to create here are mostly going to be within the &lt;code&gt;www.rbwm.gov.uk&lt;/code&gt; domain.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the table of the columns and the associated URIs that I&amp;#8217;m going to use. I should stress that this is just for example purposes, but I&amp;#8217;ve used the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URIs for datasets are just like URIs for any other web document, but shouldn&amp;#8217;t have an extension because the data itself should be available in many formats&lt;/li&gt;
&lt;li&gt;URIs for real-world things should have &lt;code&gt;/id&lt;/code&gt; at the start of the path, and URIs for conceptual things should have &lt;code&gt;/def&lt;/code&gt; at the start of their paths; both should result in a 303 redirection to a suitable web page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what we&amp;#8217;re doing within data.gov.uk, but it&amp;#8217;s an important principle of the web that different councils might well choose their own URI schemes, depending on the kind of technology support that they have, without any bad side-effects on the interpretation of the data.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Column&lt;/th&gt;
      &lt;th&gt;URI pattern&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;(Dataset)&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;(Row/ExpenditureLine)&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#{row-number}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;(Council)&lt;/th&gt;
      &lt;td&gt;http://statistics.data.gov.uk/id/local-authority/00ME&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Directorate&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/directorate/{directorate-slug}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Updated&lt;/th&gt;
      &lt;td&gt;http://reference.data.gov.uk/id/day/{date}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;TransNo/Payment&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/transaction/{transaction-number}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Service&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/service/{service-slug}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Cost Centre&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/def/cost-centre/{cost-centre-code}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Supplier Name&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/supplier/{supplier-slug}&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, those of the columns that contain text fields have, as part of their URI, a &lt;a href=&quot;http://en.wikipedia.org/wiki/Slug_(production)&quot;&gt;&amp;#8216;slug&amp;#8217;&lt;/a&gt;. This is a shortened, normalised value suitable for putting in a URI: basically ensuring that the string doesn&amp;#8217;t contain any punctuation or spaces. For example, &amp;#8216;Adult &amp;amp; Community Services&amp;#8217; would turn into &amp;#8216;adult-community-services&amp;#8217;.&lt;/p&gt;

&lt;p&gt;Our first task will be to create these slugs. To do this, we&amp;#8217;ll create a new column based on the existing ones by choosing &lt;code&gt;Edit Column &amp;gt; Add Column Based on This Column ...&lt;/code&gt; from the drop-down menu on the appropriate column:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-column-menu.jpg&quot; title=&quot;Edit Column menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Selecting this will bring up a dialog which will ask you to name the new column and then enter a formula to calculate the new value, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/create-slug.jpg&quot; title=&quot;Edit Column menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The default language for this formula is Gridworks&amp;#8217; own, though there are other options available. To create the slug, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;turn the value to lower case&lt;/li&gt;
&lt;li&gt;replace all spaces with hyphens&lt;/li&gt;
&lt;li&gt;remove anything that isn&amp;#8217;t a letter, number, or hyphen&lt;/li&gt;
&lt;li&gt;replace all sequences of two hyphens with a single hyphen&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is done in two stages. The first three steps in the list can be done with a single formula when creating the new column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;replace(replace(toLowercase(value), &#039; &#039;, &#039;-&#039;), /[^-a-z0-9]/, &#039;&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Gridworks helps by listing the original and resulting values for the first several rows of the spreadsheet, so that you can see whether it&amp;#8217;s working as expected. When you&amp;#8217;re happy, hitting &lt;code&gt;OK&lt;/code&gt; creates the new column.&lt;/p&gt;

&lt;p&gt;The last step (replacing all sequences of two hyphens with a single hyphen) can be done by editing the cells in the new column. Bring up the &lt;code&gt;Edit Cells... &amp;gt; Transform...&lt;/code&gt; dialog using the menu:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-cells-menu-2.jpg&quot; title=&quot;Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;and use the formula:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;replace(value, &#039;--&#039;, &#039;-&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then check the &lt;code&gt;Re-transform until no change&lt;/code&gt; checkbox so that any pairs of hyphens are repeatedly replaced with single hyphens, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/transform.jpg&quot; title=&quot;Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;
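
&lt;p&gt;If it helps to see the whole slug recipe in one place, here is a minimal Python sketch of the same four steps. The function name &lt;code&gt;slugify&lt;/code&gt; is just for illustration; within Gridworks itself you&amp;#8217;d still use the two formulae above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

def slugify(value):
    """Turn a name into a slug following the four steps listed above."""
    slug = value.lower()                    # 1. lower case
    slug = slug.replace(" ", "-")           # 2. spaces become hyphens
    slug = re.sub(r"[^-a-z0-9]", "", slug)  # 3. drop anything but letters, digits, hyphens
    while "--" in slug:                     # 4. repeat until no double hyphens remain
        slug = slug.replace("--", "-")
    return slug

example = slugify("1st Choice - D B Driveways Limited")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The loop in the last step plays the same role as the &lt;code&gt;Re-transform until no change&lt;/code&gt; checkbox: a run of three hyphens needs more than one pass to collapse into one.&lt;/p&gt;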

&lt;p&gt;The other tabs in the new column and edit cells dialogs are really helpful. The &lt;code&gt;History&lt;/code&gt; tab lets you choose formulae that you&amp;#8217;ve used before to use again. This is useful here because we want to create the slugs for the Service and Supplier Name in the same way. The &lt;code&gt;Help&lt;/code&gt; tab lists all the functions that you can use within the formula.&lt;/p&gt;

&lt;p&gt;Creating the URIs for the columns proceeds in the same way, except this time the formulae are more like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;http://www.rbwm.gov.uk/id/directorate/&#039; + value
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are two that are slightly different. First, there&amp;#8217;s the URI for the date, which needs to be constructed from the date/time value held by Gridworks. We can do this in two stages, starting by constructing a new column called &amp;#8216;Date&amp;#8217; to hold the formatted date:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datePart(value, &#039;year&#039;) + &#039;-&#039; + 
if (datePart(value, &#039;month&#039;) &amp;lt; 9, &#039;0&#039;, &#039;&#039;) + replace(datePart(value, &#039;month&#039;) + 1, &#039;.0&#039;, &#039;&#039;) + &#039;-&#039; + 
if (datePart(value, &#039;day&#039;) &amp;lt; 10, &#039;0&#039;, &#039;&#039;) + datePart(value, &#039;day&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(note that the &lt;code&gt;datePart()&lt;/code&gt; function returns a 0-based count for the month) and then to create the Date URI column based on this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;http://reference.data.gov.uk/id/day/&#039; + value
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Second, there&amp;#8217;s the URI for the row (an expenditure line) itself, which needs to be constructed using the row number. It&amp;#8217;s useful to construct it as a local URI (ie just the fragment) as this means the same code can be used to construct the column across different datasets, so it&amp;#8217;s just:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;#&#039; + rowIndex
&lt;/code&gt;&lt;/pre&gt;
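
&lt;p&gt;For comparison, here is a small Python sketch of the same two derivations. The function names are made up for illustration. Note that, unlike the Gridworks &lt;code&gt;datePart()&lt;/code&gt; function, Python&amp;#8217;s &lt;code&gt;date.month&lt;/code&gt; is already 1-based, so there is no need to add one.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import date

DAY_URI_PREFIX = "http://reference.data.gov.uk/id/day/"

def day_uri(value):
    """Build the day URI, zero-padding month and day to two digits."""
    formatted = "{:04d}-{:02d}-{:02d}".format(value.year, value.month, value.day)
    return DAY_URI_PREFIX + formatted

def line_fragment(row_index):
    """Build the local (fragment-only) URI for an expenditure line."""
    return "#" + str(row_index)

example_day = day_uri(date(2010, 4, 9))
example_line = line_fragment(0)
&lt;/code&gt;&lt;/pre&gt;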

&lt;h2&gt;Exporting Data&lt;/h2&gt;

&lt;p&gt;Once the extra columns have been made, it&amp;#8217;s time to export data from Gridworks. While Gridworks makes it easy to export to CSV or into Freebase, it&amp;#8217;s also possible to export in any format you want using templates. Use the &lt;code&gt;Project&lt;/code&gt; menu and choose &lt;code&gt;Export Filtered Rows &amp;gt; Templating ...&lt;/code&gt;, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/project-menu.jpg&quot; title=&quot;Project menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that this will only export the rows that you currently have selected, so if you want to export everything, make sure that you deselect any facets that you&amp;#8217;ve currently got selected.&lt;/p&gt;

&lt;p&gt;Choosing the &lt;code&gt;Templating ...&lt;/code&gt; option will open up a dialog that you can use to create whatever format you want. The default, as shown in the following screenshot, is JSON.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/template-dialog-json.jpg&quot; title=&quot;Templating dialog to create JSON&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;On the left are four fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prefix&lt;/strong&gt; is content that&amp;#8217;s put at the top of the exported data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Template&lt;/strong&gt; is content that&amp;#8217;s generated for each row&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Separator&lt;/strong&gt; is content that&amp;#8217;s put between each row&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Suffix&lt;/strong&gt; is content that&amp;#8217;s put at the bottom of the exported data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing to be extremely careful of here is that any changes you make to the fields on the left here &lt;strong&gt;will not be saved&lt;/strong&gt; when the dialog is closed. For that reason, it&amp;#8217;s a good idea to create your templates in a separate text file and copy and paste them in. Also note that the sample data on the right is only for the first set of rows, not for the whole spreadsheet.&lt;/p&gt;
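
&lt;p&gt;The way the four fields fit together can be sketched in a couple of lines of Python. This is only my reading of the descriptions above, not Gridworks code, and the function name is hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def assemble_export(prefix, rendered_rows, row_separator, suffix):
    """Prefix first, then the rows joined by the separator, then the suffix."""
    return prefix + row_separator.join(rendered_rows) + suffix

# Two already-rendered rows wrapped up as a JSON array
output = assemble_export("[\n", ['{"n": 1}', '{"n": 2}'], ",\n", "\n]")
&lt;/code&gt;&lt;/pre&gt;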

&lt;p&gt;We&amp;#8217;re going to generate Turtle using the template, so the next stage is to work out precisely what Turtle to generate. We&amp;#8217;ve been working on a small vocabulary for payment data based on the &lt;a href=&quot;http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html&quot;&gt;Data Cube vocabulary&lt;/a&gt; and that&amp;#8217;s what I&amp;#8217;ll use here, although it isn&amp;#8217;t yet as complete or as widely available as it will be. We&amp;#8217;ll start at the bottom, with the individual rows, and then add extra surrounding information as we go.&lt;/p&gt;

&lt;h3&gt;Row Template&lt;/h3&gt;

&lt;p&gt;Within this data, each row corresponds to a &lt;code&gt;payment:ExpenditureLine&lt;/code&gt; within the dataset. The expenditure lines can be organised into groups based on the &lt;code&gt;payment:Payment&lt;/code&gt; that they&amp;#8217;re associated with, which is indicated through the &amp;#8216;TransNo&amp;#8217; column in the spreadsheet. Within the payment vocabulary we&amp;#8217;re using, we can assign individual expenditure lines to the payment using the &lt;code&gt;payment:expenditureLine&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;payment:payer&lt;/code&gt; of each &lt;code&gt;payment:Payment&lt;/code&gt; is Windsor &amp;amp; Maidenhead council. The &lt;code&gt;payment:payee&lt;/code&gt; is the &amp;#8216;Supplier&amp;#8217; listed in the spreadsheet. The &lt;code&gt;payment:date&lt;/code&gt; is the &amp;#8216;Updated&amp;#8217; date.&lt;/p&gt;

&lt;p&gt;Each individual line in the spreadsheet is a &lt;code&gt;payment:ExpenditureLine&lt;/code&gt; which is associated with one of these payments. The &lt;code&gt;payment:expenditureCode&lt;/code&gt; is the &amp;#8216;Cost Centre&amp;#8217; and the actual &lt;code&gt;payment:amountExcludingVAT&lt;/code&gt; is the &amp;#8216;Amount excl vat £&amp;#8217; value. Some example Turtle for the first line is thus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  qb:slice &amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt;
  a payment:Payment , qb:Slice ;
  rdfs:label &quot;Transaction 2650750&quot;@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference &quot;2650750&quot; ;
  payment:payer &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  payment:payee &amp;lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&amp;gt; ;
  payment:date &amp;lt;http://reference.data.gov.uk/id/day/2010-04-09&amp;gt; ;
  payment:expenditureLine &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label &quot;Expenditure Line 0&quot;@en ;
  qb:dataSet &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  payment:expenditureCode &amp;lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&amp;gt; ;
  payment:amountExcludingVAT 1875.00 .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s the basic data for each line, but there&amp;#8217;s also some other information which should be brought out for each line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the name of the payee&lt;/li&gt;
&lt;li&gt;the date, year, month and day-of-month for the payment, which may help further analysis of the data&lt;/li&gt;
&lt;li&gt;the meaning of the expenditure code (particularly its association to a particular service)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each of these cases, pulling the information out from each line is going to lead to a lot of repetition, because the same payee, date and so on will be described in multiple lines, but we don&amp;#8217;t have any choice and we can tidy it up by removing duplicates afterwards. The Turtle for the first line will look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&amp;gt;
  a org:Organization ;
  rdfs:label &quot;1st Choice - D B Driveways Limited&quot;@en .

&amp;lt;http://reference.data.gov.uk/id/day/2010-04-09&amp;gt;
  a interval:CalendarDay ;
  rdfs:label &quot;2010-04-09&quot; ;
  time:hasBeginning &amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&amp;gt; ;
  interval:ordinalYear 2010 ;
  interval:ordinalMonthOfYear 4 ;
  interval:ordinalDayOfMonth 9 .

&amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&amp;gt;
  a time:Instant ;
  time:inXSDDateTime &quot;2010-04-09T00:00:00&quot;^^xsd:dateTime .

&amp;lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&amp;gt;
  a rbwm:CostCentre , skos:Concept ;
  rdfs:label &quot;Cost Centre LM05&quot;@en ;
  rbwm:costCentreCode &quot;LM05&quot;^^rbwm:CostCentreCode ;
  rbwm:service &amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt;
  a rbwm:Service ;
  rdfs:label &quot;Magnet Leisure Centre&quot;@en ;
  rbwm:providedBy &amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt;
  a rbwm:Directorate ;
  rdfs:label &quot;Adult &amp;amp; Community Services&quot;@en ;
  org:unitOf &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  rbwm:provides &amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt; .

&amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt;
  org:hasUnit &amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You&amp;#8217;ll see that in the last part of this I&amp;#8217;ve introduced some properties and classes with a &lt;code&gt;rbwm:&lt;/code&gt; prefix. These are for classes and properties that are here in this data, but aren&amp;#8217;t part of the payment vocabulary. The basic schema is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rbwm:CostCentre a rdfs:Class ;
  rdfs:label &quot;Cost Centre&quot;@en ;
  rdfs:comment &quot;A cost centre.&quot;@en .

rbwm:Service a rdfs:Class ;
  rdfs:label &quot;Service&quot;@en ;
  rdfs:comment &quot;A service provided by the council.&quot;@en .

rbwm:Directorate a rdfs:Class ;
  rdfs:label &quot;Directorate&quot;@en ;
  rdfs:comment &quot;A directorate within the council&quot;@en .

rbwm:service a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Service&quot;@en ;
  rdfs:comment &quot;The service associated with a particular cost centre.&quot;@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:Service .

rbwm:providedBy a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Provided By&quot;@en ;
  rdfs:comment &quot;The directorate that provides this service.&quot;@en ;
  rdfs:domain rbwm:Service ;
  rdfs:range rbwm:Directorate .

rbwm:provides a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Provides&quot;@en ;
  rdfs:comment &quot;A service provided by this directorate.&quot;@en ;
  rdfs:domain rbwm:Directorate ;
  rdfs:range rbwm:Service .

rbwm:costCentreCode a rdf:Property , owl:DatatypeProperty ;
  rdfs:label &quot;Cost Centre Code&quot;@en ;
  rdfs:comment &quot;The code of this cost centre.&quot;@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:CostCentreCode .

rbwm:CostCentreCode a rdfs:Datatype ;
  rdfs:label &quot;Cost Centre Code&quot;@en ;
  rdfs:comment &quot;A cost centre code consisting of two capital letters followed by two digits.&quot;@en .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This illustrates how individual councils might extend the information that they make available in RDF without having to seek any kind of prior agreement from anyone else. If, later on, a third party starts to make available ontologies for cost centres, services and directorates, Windsor &amp;amp; Maidenhead could start to link up their RDF with those more widely standardised classes and properties, with appropriate use of &lt;code&gt;rdfs:subClassOf&lt;/code&gt; or &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have an idea about what data we can extract for a single row, we can turn this into a Gridworks template. The templates are fairly straightforward. Wherever you want to insert a value from a particular column, you use the syntax &lt;code&gt;${Column Name}&lt;/code&gt;. If you want to do any further processing, you can use the syntax &lt;code&gt;{{Formula}}&lt;/code&gt; to insert the result of a calculation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  qb:slice &amp;lt;${Transaction URI}&amp;gt; .

&amp;lt;${Transaction URI}&amp;gt;
  a payment:Payment , qb:Slice ;
  rdfs:label &quot;Transaction ${TransNo}&quot;@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference &quot;${TransNo}&quot; ;
  payment:payer &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  payment:payee &amp;lt;${Supplier URI}&amp;gt; ;
  payment:date &amp;lt;${Date URI}&amp;gt; ;
  payment:expenditureLine &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&amp;gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label &quot;Expenditure Line {{rowIndex}}&quot;@en ;
  qb:dataSet &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  payment:expenditureCode &amp;lt;${Cost Centre URI}&amp;gt; ;
  payment:amountExcludingVAT {{cells[&#039;Amount excl vat £&#039;].value + 0}} .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that the last line here uses the expression &lt;code&gt;cells[&#039;Amount excl vat £&#039;].value + 0&lt;/code&gt; in order to ensure that every figure has a decimal place, which makes them into &lt;code&gt;xsd:decimal&lt;/code&gt; values within the resulting RDF.&lt;/p&gt;
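
&lt;p&gt;If you want to test a row template outside Gridworks, the &lt;code&gt;${Column Name}&lt;/code&gt; substitution is easy to imitate. Here is a minimal Python sketch, assuming each row is a dictionary keyed by column name. It only handles the &lt;code&gt;${...}&lt;/code&gt; placeholders, not &lt;code&gt;{{Formula}}&lt;/code&gt; expressions, and &lt;code&gt;render_row&lt;/code&gt; is a name I&amp;#8217;ve made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Matches ${...} and captures the column name, which may contain spaces
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def render_row(template, row):
    """Replace every ${Column Name} in the template with that column's value."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)

row = {"TransNo": "2650750", "Cost Centre": "LM05"}
transaction_label = render_row('rdfs:label "Transaction ${TransNo}"@en ;', row)
cost_centre_label = render_row('rdfs:label "Cost Centre ${Cost Centre}"@en ;', row)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A column that is missing from the row raises a &lt;code&gt;KeyError&lt;/code&gt; in this sketch, which is a handy way of catching misspelt column names in a template.&lt;/p&gt;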

&lt;p&gt;I won&amp;#8217;t do the rest of the row template here, though it&amp;#8217;s &lt;a href=&quot;/blog/files/finance_supplier_payments_2010_q2_provenance.ttl&quot;&gt;available in full in a separate file&lt;/a&gt;.&lt;/p&gt;
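
&lt;p&gt;Because the full row template describes the payee, date, cost centre and so on for every row, the exported file will contain the repetition mentioned earlier. One way of removing duplicates afterwards is sketched here in Python. It assumes that each description in the export is separated from the next by a blank line and that repeated descriptions are textually identical; loading the file into a proper RDF parser would be more robust.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def drop_duplicate_blocks(turtle_text):
    """Keep the first copy of each blank-line-separated block, preserving order."""
    seen = set()
    kept = []
    for block in turtle_text.split("\n\n"):
        key = block.strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return "\n\n".join(kept)

sample = "ex:a a ex:Thing .\n\nex:b a ex:Thing .\n\nex:a a ex:Thing ."
tidied = drop_duplicate_blocks(sample)
&lt;/code&gt;&lt;/pre&gt;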

&lt;p&gt;The other parts of the template are easier to complete. The prefix needs to contain any namespace prefixes that are used within the RDF. It&amp;#8217;s also useful to put a base URI here and describe the dataset itself. The RDF for the dataset should contain a number of properties about the dataset as a whole. There are a number of levels at which the dataset can be described:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;basic metadata such as its title and the license that it&amp;#8217;s available under&lt;/li&gt;
&lt;li&gt;statistical metadata including what dimensions it has and how it&amp;#8217;s sliced&lt;/li&gt;
&lt;li&gt;linked data metadata such as how this dataset links out to other linked datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Turtle for this description is shown here:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments&amp;gt;
  a void:Dataset ;
  void:subset &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  a payment:PaymentDataset , void:Dataset ;
  # basic metadata
  rdfs:label &quot;Windsor &amp;amp; Maidenhead Supplier Payments where charge to specific cost centre is &amp;gt;= £500 for period April 2010 - June 2010&quot;@en ;
  dct:license &amp;lt;http://data.gov.uk/id/licence&amp;gt; ;
  dct:temporal [
    # this time is retrieved from the Last-Modified date on the original spreadsheet
    time:hasBeginning &amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-08-02T08:37:02&amp;gt;
  ] ;

  # statistical metadata
  qb:structure payment:payments-with-expenditure-structure ;
  qb:sliceKey payment:payment-slice ;
  payment:currency &amp;lt;http://dbpedia.org/resource/Pound_sterling&amp;gt; ;

  # linked data metadata
  void:exampleResource
    &amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt; ,
    &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt; ;
  void:vocabulary payment: , qb: , rbwm: ;
  void:subset [
    a void:Linkset ;
    void:linkPredicate qb:slice ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payer ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payee ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/supplier&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:date ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://reference.data.gov.uk/id/day&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureLine ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureCode ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/def/cost-centre&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:service ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/def/cost-centre&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:providedBy ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:provides ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:hasUnit ;
    void:subjectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:unitOf ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
    void:objectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
  ] .
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Provenance&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;ve described here, verbally, exactly what I&amp;#8217;ve done in terms of the cleaning of the data, deriving new columns, and the template that I&amp;#8217;ve used to create a Turtle rendition of the data in this spreadsheet. One of the things that we&amp;#8217;ve worked hard on within data.gov.uk is finding ways of expressing this provenance information in RDF. There are two reasons for this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Providing provenance increases transparency and enables you to check the processing that the data has been through, increasing your trust in the data.&lt;/li&gt;
&lt;li&gt;Describing the process in sufficient detail for you to replicate it enables you to modify and repeat the process, which lets you both add value and apply the same processing to your own situation, thus spreading best practice.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The basic provenance vocabulary that we&amp;#8217;re using within data.gov.uk is the &lt;a href=&quot;http://code.google.com/p/opmv/&quot;&gt;Open Provenance Model Vocabulary&lt;/a&gt;. This vocabulary talks about Artifacts, Processes that create and use them, and Agents that control those processes. We&amp;#8217;ve created an extension of this vocabulary specifically to help describe this kind of scenario, where a spreadsheet is processed using Gridworks and then exported using a template. I&amp;#8217;ll put this provenance information in a separate file simply because embedding provenance information, which includes a template, in the template itself gets us into nasty recursion issues.&lt;/p&gt;

&lt;p&gt;As well as the template, there are two supplementary artifacts that we need to record the provenance of this data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Gridworks project itself&lt;/li&gt;
&lt;li&gt;the JSON description of the set of operations performed by Gridworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first can be exported using the &lt;code&gt;Project&lt;/code&gt; menu. The second is accessed through the &lt;code&gt;Undo/Redo&lt;/code&gt; tab as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/undo-redo.jpg&quot; title=&quot;Undo/Redo tab&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This tab shows the actions that have been carried out on the data, and enables you to undo them in sequence. The &lt;code&gt;extract&lt;/code&gt; link at the bottom opens up the dialog shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/extract-dialog.jpg&quot; title=&quot;Extract Operations dialog&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You have to manually copy and paste the JSON description from the right of this dialog into a separate file in order to save it.&lt;/p&gt;

&lt;p&gt;We can then start describing the provenance of the RDF; this needs to go in the Turtle file itself. We start by saying that the RDF that we&amp;#8217;ve created was created from the Gridworks project and through an extraction operation. A simple link to the spreadsheet that was used as the source of the data also provides a quick link back to the original data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  a opmv:Artifact ;
  dct:source &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  gridworks:wasExportedBy &amp;lt;finance_supplier_payments_2010_q2_provenance#gridworks-export&amp;gt; ;
  gridworks:wasExportedFrom &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The provenance information then needs to describe the export process:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;#gridworks-export&amp;gt;
  a gridworks:ExportUsingTemplate , opmv:Process ;
  rdfs:label &quot;Process for Exporting Windsor &amp;amp; Maidenhead data as Turtle&quot; ;
  gridworks:project &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; ;
  gridworks:template &amp;lt;#gridworks-template&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The project itself was created from the original Excel spreadsheet. It was generated through an import that ignored a single non-blank header row, followed by the set of operations described by the JSON.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt;
  a gridworks:Project , opmv:Artifact ;
  rdfs:label &quot;Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 Gridworks Project&quot;@en ;
  gridworks:wasCreatedFrom &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  opmv:wasGeneratedBy &amp;lt;#gridworks-processing&amp;gt; .

&amp;lt;#gridworks-processing&amp;gt;
  a gridworks:Process , opmv:Process ;
  rdfs:label &quot;Processing on the Gridworks Project&quot;@en ;
  common:usedData &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  gridworks:ignore 1 ;
  gridworks:operationDescription &amp;lt;finance_supplier_payments_2010_q2_operations.json&amp;gt; .

&amp;lt;finance_supplier_payments_2010_q2_operations.json&amp;gt;
  a gridworks:OperationDescription , opmv:Artifact ;
  rdfs:label &quot;Dump of the Processing carried out by Gridworks on Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 data&quot;@en ;
  gridworks:wasExportedFrom &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; ;
  gridworks:wasExportedBy &amp;lt;#gridworks-operation-description-extraction&amp;gt; .

&amp;lt;#gridworks-operation-description-extraction&amp;gt;
  a gridworks:ExtractOperationDescription , opmv:Process ;
  rdfs:label &quot;Extraction of the operation description from the Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 Project from Gridworks&quot;@en ;
  gridworks:project &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The template is described in terms of the separate parts; in fact it&amp;#8217;s useful to use this provenance file as the record of the template that you use, given that Gridworks won&amp;#8217;t save the template in the project itself.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;#gridworks-template&amp;gt;
  a gridworks:Template , opmv:Artifact ;
  gridworks:prefix &quot;&quot;&quot;
...
&quot;&quot;&quot;^^xsd:string ;
  gridworks:rowTemplate &quot;&quot;&quot;
...
&quot;&quot;&quot;^^xsd:string .
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Rinse and Repeat&lt;/h2&gt;

&lt;p&gt;Gridworks makes it easy to repeat a given set of operations on another spreadsheet that follows the same structure. If you download the &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;Windsor and Maidenhead spending data from 2009 Q4&lt;/a&gt; and import it into Gridworks, you&amp;#8217;ll see that it uses the same set of columns as the 2010 Q2 data that we&amp;#8217;ve been looking at. (Strangely enough, the 2010 Q1 data doesn&amp;#8217;t quite follow the same structure as it doesn&amp;#8217;t include the &amp;#8216;TransNo&amp;#8217; column.)&lt;/p&gt;

&lt;p&gt;There are a couple of differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &amp;#8216;Updated&amp;#8217; column isn&amp;#8217;t recognised as holding dates on import; you can use &lt;code&gt;Edit Cells... &amp;gt; Transform&lt;/code&gt; to change these values into dates using the &lt;code&gt;toDate(value)&lt;/code&gt; formula&lt;/li&gt;
&lt;li&gt;the &amp;#8216;Amount excl vat £&amp;#8217; column isn&amp;#8217;t recognised as holding numbers on import because the values have commas in them; you can use the formula &lt;code&gt;toNumber(replace(value, &#039;,&#039;, &#039;&#039;))&lt;/code&gt; to rectify this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might want to do some more cleaning, for example to check for duplicates, but once that is done, you can use the &lt;code&gt;apply&lt;/code&gt; link at the bottom of the &lt;code&gt;Undo/Redo&lt;/code&gt; tab to apply the JSON operation description that you extracted from the previous project to this one. The templates require only a little tweaking to give different filenames and labels, but otherwise can be used as-is.&lt;/p&gt;

&lt;p&gt;So while the process of cleaning data, deriving values and creating a template for exporting as Turtle is a bit of effort, the likelihood is that you will be able to repeat the same operations on similar data with a minimal amount of work.&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Gridworks is a simply amazing tool for data cleansing, analysis and, as we&amp;#8217;ve seen, transformation. It&amp;#8217;s set to become more so for our purposes in the near future, as it comes to support the mapping of names for things to URIs using configurable reconciliation services (which might allow it to automatically map Government Department names to URIs, for example), and the creation of RDF using a more intuitive and user-friendly approach than the templates that I&amp;#8217;ve illustrated here.&lt;/p&gt;

&lt;p&gt;Of course there are issues, particularly for UK civil servants who typically have to operate on locked-down machines running IE7 (if they&amp;#8217;re lucky). Gridworks also only deals with the fairly simple cases of data that fits in a spreadsheet-like structure, without the complexities of annotations on rows, columns or individual cells that we often see in government data.&lt;/p&gt;

&lt;p&gt;Nevertheless, there&amp;#8217;s huge potential here to provide a fairly easy route to the publication of linked data for people who are familiar with spreadsheets, in particular one that can be tweaked and extended to allow for the variety and complexity of real-world data.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/145#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/59">gridworks</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/58">provenance</category>
 <enclosure url="http://www.jenitennison.com/blog/files/finance_supplier_payments_2010_q2_project.tar.gz" length="458733" type="application/x-gzip" />
 <pubDate>Sun, 22 Aug 2010 22:23:32 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">145 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Distributed Publication and Querying</title>
 <link>http://www.jenitennison.com/blog/node/143</link>
 <description>&lt;p&gt;One of the biggest selling points of linked data is that it&amp;#8217;s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.&lt;/p&gt;

&lt;p&gt;Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people&amp;#8217;s blogs or twitter streams when I mention them.&lt;/p&gt;

&lt;p&gt;And with everyone using the same terms to talk about the same things, there&amp;#8217;s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we&amp;#8217;d never have found otherwise.&lt;/p&gt;

&lt;p&gt;What&amp;#8217;s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn&amp;#8217;t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.&lt;/p&gt;

&lt;p&gt;This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It&amp;#8217;s a bit sketchy, and I&amp;#8217;d welcome comments.&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;Publishing Datasets&lt;/h2&gt;

&lt;p&gt;First, let&amp;#8217;s look at the publication of data. We publish data at the moment in all kinds of ways: embedded tables within PDFs, CSV database dumps, Excel spreadsheets, Word documents, XML, JSON, N3 and so on and on. Each of these documents contains a set of information: a dataset.&lt;/p&gt;

&lt;p&gt;Each dataset contains information about a whole load of &lt;em&gt;things&lt;/em&gt;, usually real-world things. This is easy to see when you have datasets that contain lots of things of the same type: a spreadsheet might contain information about lots of different local authorities, a database dump about a bunch of schools. In FOAF terms, we&amp;#8217;d say that the dataset has each of these things as a &lt;em&gt;topic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Even datasets that are really about one &lt;em&gt;thing&lt;/em&gt; (have, in FOAF terms, a &lt;em&gt;primary topic&lt;/em&gt;) contain information about lots of other things. For example, a web page about a hospital might include some level of information about the different departments within the hospital, the strategic health authority that it belongs to, the chief executive and so on. Information that is just about one thing is rarely useful; at the very least, you will want to know the labels of things that it&amp;#8217;s related to.&lt;/p&gt;

&lt;p&gt;If we move to thinking about linked data, each &lt;em&gt;thing&lt;/em&gt; is assigned an HTTP URI. There is then one particular dataset that stands above all the other datasets that contain information about that &lt;em&gt;thing&lt;/em&gt;: the dataset in the document that you get when you resolve its URI. The fact that there is this dataset doesn&amp;#8217;t alter the fact that there are many many other datasets out there that contain information about the &lt;em&gt;thing&lt;/em&gt;. But the dataset that you get at the URI for the thing obviously has a special role.&lt;/p&gt;

&lt;p&gt;These datasets &amp;#8212; the ones you get at the end of a resource&amp;#8217;s URI &amp;#8212; are &lt;em&gt;the&lt;/em&gt; way in which an organisation can exercise control over the use of URIs minted within their domain. The organisation that controls the URI for a &lt;em&gt;thing&lt;/em&gt; determines whether that URI resolves, and what is at the end of the URI. If fifteen different websites all published information about a school consistently using the same URI for that school, anyone could pull that information together into something potentially useful. But if the URI for the school doesn&amp;#8217;t actually resolve, then you would have to wonder whether the school actually exists, or if it&amp;#8217;s just a figment of the imagination of those fifteen websites: a spoof school.&lt;/p&gt;

&lt;p&gt;Also, you&amp;#8217;d expect the information that you find at the end of the URI to be correct and up to date. You&amp;#8217;d expect it to be reasonably complete as well: to return a bunch of information about the school and pointers to more information about the school. This information is likely to come from a bunch of trusted sources: an integrated view over a collection of other datasets.&lt;/p&gt;

&lt;h2&gt;Providing SPARQL Endpoints&lt;/h2&gt;

&lt;p&gt;We&amp;#8217;ve established that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;anyone can publish information about anything they choose, but that people will have different levels of trust in different sources of information&lt;/li&gt;
&lt;li&gt;information about any one &lt;em&gt;thing&lt;/em&gt; is seldom useful on its own; the power of the linked data web is the ability to make connections between things&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on to querying. Linked data can be useful without explicit querying &amp;#8212; you can navigate around related sets of information by following links, and pull together information gleaned from different sites &amp;#8212; but querying of some kind provides much more potential power and, with a &lt;a href=&quot;http://purl.org/linked-data/api/spec&quot;&gt;linked data API&lt;/a&gt;, the opportunity to provide an easy-to-use web-based API for the data.&lt;/p&gt;

&lt;p&gt;SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.&lt;/p&gt;

&lt;p&gt;And so we have a quandary. To support queries, we need all the data we might want to query to be pulled into a single triplestore. Given that all data is linked, and all links are potentially interesting, the only answer seems to be to have the whole web of data in a single store. And that kind of centralised solution seems impractical, both in terms of the sheer size of store you&amp;#8217;d need and the obvious impact on efficiency of doing so.&lt;/p&gt;

&lt;h2&gt;Curated Triplestores&lt;/h2&gt;

&lt;p&gt;I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that&amp;#8217;s out there. What subset is useful for a given triplestore is a design question that should be informed by the potential users of that particular service. Larger subsets are likely to locate more cross-connections, but have a performance penalty.&lt;/p&gt;

&lt;p&gt;For example, a service that was oriented towards helping local authorities plan their schooling provision might include all the current data about nursery, primary and secondary schools (but not universities or versioned data), information about their administrative district and the district that they appear in (but no extra information about census areas), and those neighbourhood statistics, including historic data, that relate to children and schooling (but not those that relate to care of the elderly, for example).&lt;/p&gt;

&lt;p&gt;Another service might include all historic information about schools and universities and historic information about all associated administrative geography, but not include neighbourhood statistics.&lt;/p&gt;

&lt;h2&gt;Supporting On-Demand Triplestores&lt;/h2&gt;

&lt;p&gt;In the scenario painted above, each triplestore will include different datasets, brought together for a particular purpose. Imagine a huge warehouse full of boxes, each of which is a particular dataset. Each triplestore will fit together a different set of those boxes. What&amp;#8217;s neat about the linked data approach is that the boxes are really easy to bring together: creating a triplestore should just be a matter of selecting which datasets you want to use with little or no hand-crafting of links between them or resolution of naming conflicts.&lt;/p&gt;

&lt;p&gt;The challenge from the side of the data publisher is to enable these triplestores to be both created and kept up to date. A data publisher has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;describe what datasets are available&lt;/li&gt;
&lt;li&gt;describe how these link to other potentially interesting datasets, to give hints about where connections might be made&lt;/li&gt;
&lt;li&gt;provide a mechanism for getting the current state of all the available datasets (which can obviously be through crawling but could alternatively be through a dump or set of dumps)&lt;/li&gt;
&lt;li&gt;provide a mechanism for informing interested parties about new datasets being made available (which could be through routine crawling or through a feed)&lt;/li&gt;
&lt;li&gt;provide a mechanism for informing interested parties about when a dataset changes (which could also be through routine crawling or through a feed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of these problems are solved.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://rdfs.org/ns/void/&quot;&gt;VoiD&lt;/a&gt;&amp;#8217;s purpose in life is to describe datasets and how they link to each other, and it provides a &lt;code&gt;void:dataDump&lt;/code&gt; property that points to a dump of the data. VoiD can describe datasets that are supersets of other datasets, which enables datasets to be grouped together into potentially useful bundles.&lt;/p&gt;
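
&lt;p&gt;As a rough sketch of what such a description might look like, a publisher could bundle two datasets into a group and point to a dump of each. The &lt;code&gt;example.org&lt;/code&gt; URIs and the titles here are made up purely for illustration; only the &lt;code&gt;void:&lt;/code&gt; and &lt;code&gt;dct:&lt;/code&gt; terms come from the vocabularies themselves:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@prefix void: &amp;lt;http://rdfs.org/ns/void#&amp;gt; .
@prefix dct:  &amp;lt;http://purl.org/dc/terms/&amp;gt; .

# hypothetical bundle of datasets offered by one publisher
&amp;lt;http://example.org/dataset/schools&amp;gt;
  a void:Dataset ;
  dct:title &quot;Schools data (illustrative)&quot;@en ;
  void:subset
    &amp;lt;http://example.org/dataset/schools/primary&amp;gt; ,
    &amp;lt;http://example.org/dataset/schools/secondary&amp;gt; .

# each member dataset points to a dump of its current state
&amp;lt;http://example.org/dataset/schools/primary&amp;gt;
  a void:Dataset ;
  dct:title &quot;Primary schools (illustrative)&quot;@en ;
  void:dataDump &amp;lt;http://example.org/dumps/schools-primary.nt&amp;gt; .

&amp;lt;http://example.org/dataset/schools/secondary&amp;gt;
  a void:Dataset ;
  dct:title &quot;Secondary schools (illustrative)&quot;@en ;
  void:dataDump &amp;lt;http://example.org/dumps/schools-secondary.nt&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Someone building a triplestore could then pick up either the whole bundle or just one of its subsets, depending on what their users need.&lt;/p&gt;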

&lt;p&gt;Where information needs to be kept up to date, we can use feeds. Two things need to be kept current: the list of datasets that a publisher makes available, and the content of each particular dataset. This can be achieved through a single Atom feed in which each dataset is recorded as an entry, with an &lt;code&gt;&amp;lt;updated&amp;gt;&lt;/code&gt; element indicating its last update. Datasets that are removed can be indicated through a &lt;a href=&quot;http://tools.ietf.org/html/draft-snell-atompub-tombstones-06&quot;&gt;&lt;code&gt;deleted-entry&lt;/code&gt; element&lt;/a&gt;. There is some ongoing work that suggests how to &lt;a href=&quot;http://groups.google.com/group/dataset-dynamics/web/components-vocabularies-protocols-formats&quot;&gt;augment voiD with a pointer to such a feed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As well as pointing to a dataset, and indicating that it has been updated, the Atom feed could contain information about the change itself, represented as a &lt;a href=&quot;http://vocab.org/changeset/schema.html&quot;&gt;changeset&lt;/a&gt;. This could be included as part of the information provided about the new version of the dataset, described in terms of its &lt;a href=&quot;http://www.jenitennison.com/blog/node/142&quot;&gt;provenance&lt;/a&gt;.&lt;/p&gt;
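
&lt;p&gt;To give a feel for what a changeset might contain, here is a minimal sketch of one that records a change of label. The &lt;code&gt;example.org&lt;/code&gt; URIs, the date and the labels are all invented for illustration, and the exact use of the &lt;code&gt;cs:&lt;/code&gt; terms should be checked against the changeset schema rather than taken from this example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@prefix cs:   &amp;lt;http://purl.org/vocab/changeset/schema#&amp;gt; .
@prefix rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .

# hypothetical changeset: one statement removed, one added
&amp;lt;http://example.org/changes/1&amp;gt;
  a cs:ChangeSet ;
  cs:subjectOfChange &amp;lt;http://example.org/id/school/123&amp;gt; ;
  cs:createdDate &quot;2010-03-22T12:00:00Z&quot; ;
  cs:changeReason &quot;School renamed (illustrative)&quot; ;
  cs:removal [
    a rdf:Statement ;
    rdf:subject &amp;lt;http://example.org/id/school/123&amp;gt; ;
    rdf:predicate rdfs:label ;
    rdf:object &quot;Old School Name&quot;@en
  ] ;
  cs:addition [
    a rdf:Statement ;
    rdf:subject &amp;lt;http://example.org/id/school/123&amp;gt; ;
    rdf:predicate rdfs:label ;
    rdf:object &quot;New School Name&quot;@en
  ] .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A triplestore that already holds the previous version of the dataset would only need to apply the removals and additions, rather than reload the whole dataset.&lt;/p&gt;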

&lt;p&gt;Feeds that were provided in this way could be provided using the normal model, whereby any interested triplestores would regularly check the feed for updates, or using &lt;a href=&quot;http://code.google.com/p/pubsubhubbub/&quot;&gt;PubSubHubbub&lt;/a&gt; in order to push notifications to triplestores. The latter would require triplestore providers to support a service that accepted such notifications, of course.&lt;/p&gt;

&lt;p&gt;A triplestore should expose which datasets (and which versions of those datasets) are used within the triplestore. This can be gathered through a SPARQL query to list the available graphs and their metadata, so long as that information is included within the named graphs themselves.&lt;/p&gt;

&lt;h2&gt;What Should We Do?&lt;/h2&gt;

&lt;p&gt;How does all this translate into what guidelines we should put into place for UK government publishers and what tools we should provide centrally?&lt;/p&gt;

&lt;p&gt;First, we need to recognise the responsibility that comes with the ownership of a URI. Within the UK, we are encouraging people to use URIs of the form:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;http://{sector}.data.gov.uk/id/{concept}/{identifier}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;to name things like schools and hospitals, with the recognition that information about those things might come from many different public bodies. &lt;em&gt;Someone&lt;/em&gt; has to be in charge of that domain: they have to determine which URIs within a particular URI set are resolvable, and what information is provided at the end of each URI. These same sector owners should support easy-to-use APIs based around the particular URI sets that they are responsible for.&lt;/p&gt;

&lt;p&gt;The easiest route to supporting the pages, an easy-to-use API, and a SPARQL endpoint for deeper querying is going to be to create a curated triplestore with a &lt;a href=&quot;http://purl.org/linked-data/api/spec&quot;&gt;linked data API&lt;/a&gt; layer over the top. This triplestore will need to be populated with data from multiple datasets, both as separate named graphs (to provide traceability back to the original data) and merged into a default graph that reflects the current state of the world.&lt;/p&gt;

&lt;p&gt;The precise datasets that are included within the triplestore will depend on the judgement of the sector owners about both the trustworthiness of the available datasets and their utility. For example, it&amp;#8217;s likely that a lot of triplestores will want to include information about administrative geography and perhaps some information about time, simply because everything happens somewhere and sometime.&lt;/p&gt;

&lt;p&gt;Second, we need to make this process really easy, through guidelines and tooling.&lt;/p&gt;

&lt;p&gt;We encourage the data owners themselves (which are individual public bodies) to publish, along with the datasets themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;voiD descriptions of the groups of datasets that they publish&lt;/li&gt;
&lt;li&gt;metadata about the individual datasets that they publish (within each dataset itself)&lt;/li&gt;
&lt;li&gt;Atom feeds that are updated each time datasets are added, removed or altered, preferably including changeset information&lt;/li&gt;
&lt;li&gt;(optionally) dumps of groups of datasets, in NQuads format&lt;/li&gt;
&lt;li&gt;(optionally) notifications of changes to the Atom feed to a PubSubHubbub hub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data owners should be able to split up the datasets that they provide into different groups based on their knowledge of the domain, with the possibility of individual datasets belonging to more than one group.&lt;/p&gt;

&lt;p&gt;We then create tooling that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enable the sector owners to quickly and easily put together a list of trusted sites from which datasets can be gathered&lt;/li&gt;
&lt;li&gt;collect datasets from these sites, either through NQuads dumps or through crawling&lt;/li&gt;
&lt;li&gt;merge datasets to create a default current view&lt;/li&gt;
&lt;li&gt;put these datasets into a triplestore&lt;/li&gt;
&lt;li&gt;keep the triplestore up to date, either through polling feeds or by accepting PubSubHubbub notifications to identify changes, applying those changes, and merging data as required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To facilitate PubSubHubbub use, which supports timely updating of triplestores, we&amp;#8217;d need a PubSubHubbub hub. Data owners can inform this hub of updates to their feeds and sector owners can register interest in particular feeds.&lt;/p&gt;

&lt;p&gt;These guidelines and tooling are not just useful for sector owners: they are useful for anyone who wants to pull together linked data published in a distributed way across the web. We should expect and encourage multiple stores offering different combinations of datasets and different levels of service. The ones offered centrally, by sector owners, are certainly not the be-all and end-all &amp;#8212; in fact we should look on them as a basic level of service, to be superseded by the community.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/143#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <pubDate>Mon, 22 Mar 2010 21:26:53 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">143 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Translating Existing Models to RDF</title>
 <link>http://www.jenitennison.com/blog/node/142</link>
 <description>&lt;p&gt;As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as in XML and a text-based form, that are supported by existing tool chains.&lt;/p&gt;

&lt;p&gt;In contrast, if there is existing RDF in that domain area, it&amp;#8217;s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user than on those of the professionals in the area.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;To give an example, the international statistics community uses &lt;a href=&quot;http://sdmx.org&quot;&gt;SDMX&lt;/a&gt; for representing and exchanging statistics (and a lot more besides; it&amp;#8217;s a huge standard). SDMX includes a well-thought-through model for statistical datasets and the observations within them, as well as standard concepts for things like gender, age, unit multipliers and so on. By comparison, &lt;a href=&quot;http://sw.joanneum.at/scovo/schema.html&quot;&gt;SCOVO&lt;/a&gt;, the main RDF model for representing statistics, barely scratches the surface.&lt;/p&gt;

&lt;p&gt;This isn&amp;#8217;t the only example: the &lt;a href=&quot;http://inspire.jrc.ec.europa.eu/&quot;&gt;INSPIRE Directive&lt;/a&gt; defines how geographic information must be made available. &lt;a href=&quot;http://www.gigateway.org.uk/metadata/standards.html&quot;&gt;GEMINI&lt;/a&gt; defines the kind of geospatial metadata that that community cares about. The &lt;a href=&quot;http://openprovenance.org/&quot;&gt;Open Provenance Model&lt;/a&gt; is the result of many contributors from multiple fields, and again has a number of serialisations.&lt;/p&gt;

&lt;p&gt;You could view this as a challenge: experts in their domains already have models and serialisations for the data that they care about; how can we persuade them to adopt an RDF model and serialisations instead?&lt;/p&gt;

&lt;p&gt;But that&amp;#8217;s totally the wrong question. Linked data doesn&amp;#8217;t, can&amp;#8217;t and won&amp;#8217;t replace existing ways of handling data. But it has got some interesting features that can bring great benefit to people who want to publish their data, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;web-scale addresses&lt;/strong&gt; &amp;#8212; being able to name and refer to things like individual observations in a statistical hypercube, a particular road junction, or the particular process that led to something being created&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;annotation&lt;/strong&gt; &amp;#8212; the ability to record metadata about everything that you can name, which is everything!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;distributed publication&lt;/strong&gt; &amp;#8212; enabling multiple publishers to control the publication of their data without having to upload it to a central location&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;links&lt;/strong&gt; &amp;#8212; the joining of information to other information, providing more context, supporting more queries and reducing the requirement for duplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question is really about how to enable people to reap these benefits; the answer, because HTTP-based addressing and typed linkage are hard to introduce into existing formats, is usually to publish data using an RDF-based model alongside existing formats. This might be done by generating an RDF-based format (such as RDF/XML or Turtle) as an alternative to the standard XML or HTML, accessible via content negotiation, or by providing a &lt;a href=&quot;http://www.w3.org/TR/grddl/&quot;&gt;GRDDL&lt;/a&gt; transformation that maps an XML format into RDF/XML.&lt;/p&gt;

&lt;p&gt;Either way, the underlying model needs to be mapped into RDF. We&amp;#8217;re furthest down this road with &lt;a href=&quot;http://groups.google.com/group/publishing-statistical-data&quot;&gt;statistical data&lt;/a&gt;. I wanted to explore here what it might look like for the Open Provenance Model, building on lessons learned from the statistical domain.&lt;/p&gt;

&lt;h2&gt;Open Provenance Model&lt;/h2&gt;

&lt;p&gt;The Open Provenance Model talks about three main &lt;strong&gt;nodes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;artifacts&lt;/strong&gt;, which are the things that are produced or used by processes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;processes&lt;/strong&gt;, which are actions that are performed using or producing artifacts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;agents&lt;/strong&gt;, which are the people or systems that perform actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and five kinds of &lt;strong&gt;edges&lt;/strong&gt; that can be defined between them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;process A &lt;strong&gt;used&lt;/strong&gt; artifact B&lt;/li&gt;
&lt;li&gt;artifact A &lt;strong&gt;was generated by&lt;/strong&gt; process B&lt;/li&gt;
&lt;li&gt;process A &lt;strong&gt;was controlled by&lt;/strong&gt; agent B&lt;/li&gt;
&lt;li&gt;process A &lt;strong&gt;was triggered by&lt;/strong&gt; process B&lt;/li&gt;
&lt;li&gt;artifact A &lt;strong&gt;was derived from&lt;/strong&gt; artifact B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then things start getting more complicated. OPM indicates that each artifact and agent plays a different &lt;strong&gt;role&lt;/strong&gt; when it is used by, generated by or controls a process. What&amp;#8217;s more, each artifact and agent might be involved in the process at different &lt;strong&gt;times&lt;/strong&gt; (though timing information is optional within OPM). And a given provenance graph may contain several &lt;strong&gt;accounts&lt;/strong&gt; of how artifacts, processes and agents fit together.&lt;/p&gt;

&lt;h2&gt;Existing Mapping to RDF&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://openprovenance.org/model/opm.owl&quot;&gt;OWL ontology for OPM&lt;/a&gt; is a very literal mapping of OPM into RDF. Each of the types of nodes is a separate class, and each of the types of edges is a separate class. Thus, it introduces a lot of n-ary relationships. Take a really simple example of an XML file being transformed into HTML using XSLT. With the OPM ontology, the RDF would look something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:transformation a opm:Process .
&amp;lt;doc.html&amp;gt; a opm:Artifact .
&amp;lt;doc.xml&amp;gt; a opm:Artifact .
&amp;lt;doc.xsl&amp;gt; a opm:Artifact .
_:processor a opm:Agent .
_:Jeni a opm:Agent .

_:sourceLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &amp;lt;doc.xml&amp;gt; ;
  opm:role xslt:source .

_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &amp;lt;doc.xsl&amp;gt; ;
  opm:role xslt:stylesheet .

_:resultLink a opm:WasGeneratedBy ;
  opm:effect &amp;lt;doc.html&amp;gt; ;
  opm:cause _:transformation ;
  opm:role xslt:result .

_:processorLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:processor ;
  opm:role xslt:processor .

_:userLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni ;
  opm:role xslt:user .

_:derivation a opm:WasDerivedFrom ;
  opm:effect &amp;lt;doc.html&amp;gt; ;
  opm:cause &amp;lt;doc.xml&amp;gt; .

xslt:source a opm:Role ;
  opm:value &quot;source&quot; .

xslt:stylesheet a opm:Role ;
  opm:value &quot;stylesheet&quot; .

xslt:result a opm:Role ;
  opm:value &quot;result&quot; .

xslt:processor a opm:Role ;
  opm:value &quot;processor&quot; .

xslt:user a opm:Role ;
  opm:value &quot;user&quot; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To give you an idea of what this mapping means, if I wanted to work out who created &lt;code&gt;doc.html&lt;/code&gt;, I would have to do a query like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT ?who
WHERE {
  ?generatedBy 
    opm:effect &amp;lt;doc.html&amp;gt; ;
    opm:role xslt:result ;
    opm:cause ?transformation .
  ?controlledBy
    opm:effect ?transformation ;
    opm:role xslt:user ;
    opm:cause ?who .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Some Observations&lt;/h2&gt;

&lt;p&gt;There are two things that I want to pull out about the RDF mapping described above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it&amp;#8217;s incredibly literal; every entity type within the model is mapped onto an RDF class, including the edges, the roles and the accounts (which I didn&amp;#8217;t show above)&lt;/li&gt;
&lt;li&gt;it doesn&amp;#8217;t reuse any existing vocabularies, even when they might help (such as for the &amp;#8216;value&amp;#8217; of a role, which is really a label)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It reminds me of the mapping of object-oriented or relational data models into each other or into XML, which often results in a god-awful mess and people swearing that technology X is goddamned ugly.&lt;/p&gt;

&lt;p&gt;The fact is that elegant uses of each modelling paradigm &amp;#8212; ones that are easy to understand and efficient to query &amp;#8212; always take advantage of the unique features of that paradigm. For example, good XML vocabularies take advantage of the distinctions between attributes and elements, of nesting and hierarchies, and of the ability to hold mixed content.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s the same with RDF. There are four features of RDF that I think good vocabularies will take suitable advantage of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;existing vocabularies&lt;/li&gt;
&lt;li&gt;inheritance&lt;/li&gt;
&lt;li&gt;shortcuts and reasoning&lt;/li&gt;
&lt;li&gt;named graphs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reusing existing vocabularies&lt;/strong&gt; takes advantage of the ease of bringing together diverse domains within RDF, and it makes data more reusable. For example, an OPM mapping that encourages the reuse of FOAF for people and organisations saves time and effort for the developers of the OPM RDF vocabulary, that they would otherwise have spent modelling the details of agents; and it means that any agents that are described within the description of a piece of provenance are automatically available as agents in the wider FOAF cloud. The same goes for using DOAP to describe software.&lt;/p&gt;

&lt;p&gt;By reusing vocabularies, the data isn&amp;#8217;t isolated any more, locked within a single context designed for a single use. This is a huge benefit of the linked data approach and it makes sense to leverage it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using inheritance&lt;/strong&gt; means creating general purpose classes and properties and encouraging other people to use &lt;code&gt;rdfs:subClassOf&lt;/code&gt; or &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; to specialise them according to their own requirements. Within OPM, the different roles that artifacts and agents might play in a process are a natural fit with either sub-properties or sub-classes, depending on how the edges in the model are represented. For example, rather than&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &amp;lt;doc.xsl&amp;gt; ;
  opm:role xslt:stylesheet .

xslt:stylesheet a opm:Role ;
  opm:value &quot;stylesheet&quot; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;you could generate data that looked like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:stylesheetLink a xslt:Stylesheet ;
  opm:effect _:transformation ;
  opm:cause &amp;lt;doc.xsl&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;where &lt;code&gt;xslt:Stylesheet&lt;/code&gt; is defined as a subclass of &lt;code&gt;opm:Used&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Inheritance is a basic form of &lt;strong&gt;reasoning&lt;/strong&gt;. In the case of the subclass relationship outlined above, the reasoning is that anything that is a &lt;code&gt;xslt:Stylesheet&lt;/code&gt; is also a &lt;code&gt;opm:Used&lt;/code&gt;, and thus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:stylesheetLink a xslt:Stylesheet .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;implies&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:stylesheetLink a opm:Used .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Taking the scenario where you&amp;#8217;re doing native linked data publishing &amp;#8212; storing data in a triplestore and then publishing it out from there &amp;#8212; you have two choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you can store just the basic data, and let the application retrieving it carry out whatever reasoning is necessary to derive the information they need; this limits the size of the triplestore, but can place a large burden on people using it &amp;#8212; either they have to be very familiar with the exact choices made in modelling the basic data, or they have to construct complex SPARQL queries that take account of the fact that the data might be modelled in many different ways&lt;/li&gt;
&lt;li&gt;you can store not only the basic data but also anything that can be derived from it; this increases the number of triples you have to store, but means that people can query it without having to perform any reasoning themselves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter is obviously the more user-friendly approach. (And a triplestore could make it easy by understanding and applying schemas, ontologies and rules as data is loaded in.)&lt;/p&gt;
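
&lt;p&gt;As a rough sketch of how those derived triples might be generated, the subclass reasoning could be expressed as a &lt;code&gt;CONSTRUCT&lt;/code&gt; query along these lines. It assumes a store that supports SPARQL 1.1 property paths and that holds the schema triples (such as &lt;code&gt;xslt:Stylesheet rdfs:subClassOf opm:Used&lt;/code&gt;) alongside the data; the constructed triples would then be loaded back into the store:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

# For every typed resource, also assert each superclass of its type
CONSTRUCT {
  ?thing a ?superclass .
}
WHERE {
  ?thing a ?class .
  ?class rdfs:subClassOf+ ?superclass .
}
&lt;/code&gt;&lt;/pre&gt;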

&lt;p&gt;To take a more complex example, provenance could be modelled in a much more direct way, such as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;doc.html&amp;gt; a opm:Artifact ;
  opm:derivedFrom &amp;lt;doc.xml&amp;gt; ;
  opm:generatedBy [
    xslt:source &amp;lt;doc.xml&amp;gt; ;
    xslt:stylesheet &amp;lt;doc.xsl&amp;gt; ;
    xslt:processor _:processor ;
    xslt:user _:Jeni ;
  ] .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;where &lt;code&gt;xslt:source&lt;/code&gt; and &lt;code&gt;xslt:stylesheet&lt;/code&gt; are sub-properties of a property called &lt;code&gt;opm:used&lt;/code&gt;, and &lt;code&gt;xslt:processor&lt;/code&gt; and &lt;code&gt;xslt:user&lt;/code&gt; are sub-properties of &lt;code&gt;opm:controlledBy&lt;/code&gt;. This removes the n-ary relations, which (given the use of inheritance to represent roles) are only actually needed if the model needs to capture the timing of the involvement of particular artifacts or agents within a process, and makes the provenance information much easier to query than before:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT ?who
WHERE {
  &amp;lt;doc.html&amp;gt; opm:generatedBy ?transformation .
  ?transformation xslt:user ?who .
}
&lt;/code&gt;&lt;/pre&gt;
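
&lt;p&gt;The sub-property relationships also help people who know nothing about the XSLT-specific vocabulary. If the store does not apply the inference itself, a query can follow &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; explicitly. This is only a sketch: it assumes SPARQL 1.1 property paths, that the sub-property declarations are loaded with the data, and placeholder base and &lt;code&gt;opm:&lt;/code&gt; namespace URIs; &lt;code&gt;opm:generatedBy&lt;/code&gt; and &lt;code&gt;opm:controlledBy&lt;/code&gt; are the hypothetical properties from the example above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;BASE &amp;lt;http://example.org/&amp;gt;             # placeholder base URI
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;
PREFIX opm: &amp;lt;http://example.org/opm#&amp;gt;   # placeholder namespace

# Find every agent that controlled the process, whatever role it played
SELECT ?who
WHERE {
  &amp;lt;doc.html&amp;gt; opm:generatedBy ?transformation .
  ?transformation ?role ?who .
  ?role rdfs:subPropertyOf* opm:controlledBy .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Against the example data this should match both the processor and the user, because each is attached through a sub-property of &lt;code&gt;opm:controlledBy&lt;/code&gt;.&lt;/p&gt;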

&lt;p&gt;But what if we also want to support the more complex, n-ary-relation-based models? We would need to assert, somehow, a rule that said that the presence of a &lt;code&gt;opm:controlledBy&lt;/code&gt; relationship from a process to an agent was equivalent to having a &lt;code&gt;opm:WasControlledBy&lt;/code&gt; instance with a &lt;code&gt;opm:cause&lt;/code&gt; pointing to the agent and an &lt;code&gt;opm:effect&lt;/code&gt; pointing to the process. Combine this with &lt;code&gt;xslt:user&lt;/code&gt; being a sub-property of &lt;code&gt;opm:controlledBy&lt;/code&gt; and you have the statement:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:transformation xslt:user _:Jeni .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;implying:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_:transformation opm:controlledBy _:Jeni .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which in turn implies:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[] a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same reasoning could be applied in the opposite direction, of course. Part of the definition of the use of OPM in RDF could be that the presence of a &lt;code&gt;opm:WasControlledBy&lt;/code&gt; with a &lt;code&gt;opm:cause&lt;/code&gt; pointing to an agent and &lt;code&gt;opm:effect&lt;/code&gt; pointing to a process implies a &lt;code&gt;opm:controlledBy&lt;/code&gt; link between the &lt;code&gt;opm:effect&lt;/code&gt; and the &lt;code&gt;opm:cause&lt;/code&gt;. Whichever was used in the initial modelling of the data, the same query could be used to query the data (accepting some loss of precision along the way, but if you&amp;#8217;re not interested in timing information then why should you suffer the cost of querying through n-ary relations?).&lt;/p&gt;
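
&lt;p&gt;One way to write such rules down, purely as a sketch, is as a pair of SPARQL &lt;code&gt;CONSTRUCT&lt;/code&gt; queries whose results are added back to the store. The &lt;code&gt;opm:&lt;/code&gt; namespace URI here is a placeholder and &lt;code&gt;opm:controlledBy&lt;/code&gt; is the hypothetical shortcut property discussed above. Going from the n-ary form to the shortcut:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX opm: &amp;lt;http://example.org/opm#&amp;gt;   # placeholder namespace

# Collapse each WasControlledBy edge into a direct link
CONSTRUCT {
  ?process opm:controlledBy ?agent .
}
WHERE {
  ?edge a opm:WasControlledBy ;
    opm:effect ?process ;
    opm:cause ?agent .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and going from the shortcut back to the n-ary form, where the blank node in the template is created afresh for each match:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX opm: &amp;lt;http://example.org/opm#&amp;gt;   # placeholder namespace

# Expand each direct link into a WasControlledBy edge
CONSTRUCT {
  _:edge a opm:WasControlledBy ;
    opm:effect ?process ;
    opm:cause ?agent .
}
WHERE {
  ?process opm:controlledBy ?agent .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running both repeatedly over the same store would keep minting new edge nodes, so in practice you would apply only the direction you need, or guard the second query against edges that already exist.&lt;/p&gt;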

&lt;p&gt;The final thing that I mentioned above that mappings from existing models to RDF should take advantage of is &lt;strong&gt;named graphs&lt;/strong&gt;. In OPM, the obvious way that named graphs could play a role is in providing support for the different &lt;em&gt;accounts&lt;/em&gt; of provenance. Separate named graphs could be used to represent separate accounts, referencing the same artifacts, agents and processes where appropriate. Individually, the graphs can remain simple; together, you have the full power of OPM.&lt;/p&gt;
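
&lt;p&gt;As an illustration, if each account were loaded into its own named graph, a query could report who each account says was responsible for a document. This sketch assumes the simpler, direct properties described earlier, with placeholder base and &lt;code&gt;opm:&lt;/code&gt; namespace URIs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;BASE &amp;lt;http://example.org/&amp;gt;             # placeholder base URI
PREFIX opm: &amp;lt;http://example.org/opm#&amp;gt;   # placeholder namespace

# One row per account (named graph) and controlling agent
SELECT ?account ?who
WHERE {
  GRAPH ?account {
    &amp;lt;doc.html&amp;gt; opm:generatedBy ?transformation .
    ?transformation opm:controlledBy ?who .
  }
}
&lt;/code&gt;&lt;/pre&gt;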

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Modelling is a complex design activity, and you&amp;#8217;re best off avoiding doing it if you can. That means reusing conceptual models that have been built up for a domain as much as possible and reusing existing vocabularies wherever you can. But you can&amp;#8217;t and shouldn&amp;#8217;t try to avoid doing design when mapping from a conceptual model to a particular modelling paradigm such as a relational, object-oriented, XML or RDF model.&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re mapping to RDF, remember to take advantage of what it&amp;#8217;s good at such as web-scale addressing and extensibility, and always bear in mind how easy or difficult your data will be to query. There is no point publishing linked data if it is unusable.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/142#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/57">modelling</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/58">provenance</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <pubDate>Sat, 13 Mar 2010 20:35:46 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">142 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Versioning (UK Government) Linked Data</title>
 <link>http://www.jenitennison.com/blog/node/141</link>
 <description>&lt;p&gt;As you probably know, I&amp;#8217;ve been working quite a lot recently on the UK government&amp;#8217;s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I&amp;#8217;ve &lt;a href=&quot;http://www.jenitennison.com/blog/node/108&quot;&gt;touched on this topic before&lt;/a&gt; but things have progressed now to the stage where we have to make some real, practical, recommendations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: the contents of this post have been greatly informed through discussions with &lt;a href=&quot;http://www.ldodds.com/blog/&quot;&gt;Leigh Dodds&lt;/a&gt;, &lt;a href=&quot;http://twitter.com/skwlilac&quot;&gt;Stuart Williams&lt;/a&gt;, &lt;a href=&quot;http://www.amberdown.net/&quot;&gt;Dave Reynolds&lt;/a&gt;, &lt;a href=&quot;http://iandavis.com/&quot;&gt;Ian Davis&lt;/a&gt; and John Sheridan. Ian Davis&amp;#8217; series on &lt;a href=&quot;http://blog.iandavis.com/2009/08/time-in-rdf-1&quot;&gt;representing time in RDF&lt;/a&gt; is also well worth a look for a comparison of alternative approaches.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve split this into two parts: versioned information resources (which are pretty easy) and versioned non-information resources (which are pretty hard). For both, we need to&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provide some guidance about what the RDF should look like&lt;/li&gt;
&lt;li&gt;mint or adopt properties to support that model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Versioned Information Resources&lt;/h2&gt;

&lt;p&gt;Easy things first. Some of the things that we talk about, such as legislation, are information resources (web documents), and these have different versions. The relevant level of precision for legislation is a day, but this will be different for different kinds of documents &amp;#8212; some might change every second, for others an incrementally increasing version number might be more appropriate than a date. A generic pattern for the URIs, based on the &lt;a href=&quot;http://writetoreply.org/ukgovurisets/&quot;&gt;design of URI sets for the UK public sector report&lt;/a&gt; would be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;http://{sector}.data.gov.uk/doc/{concept}/{identifier}/{version}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For example, the OFSTED report for a particular school based on an inspection carried out in 2009 might be something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;http://education.data.gov.uk/doc/inspection-report/12345/2009
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(There might be sub-versions too, if the inspection report itself goes through a revision process.) The RDF for this document should include links to the previous reports that it replaces, and dates that indicate when it was created and so on:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&amp;gt;
  rdfs:label &quot;2009 Inspection Report for Such-and-Such School&quot;@en ;
  dct:created &quot;2009-10-18&quot;^^xsd:date ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s also useful to have a URI for the unversioned document; this is the same as for the versioned document, but without the version:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;http://{sector}.data.gov.uk/doc/{concept}/{identifier}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This document acts as a hub for the various concrete versions of the document:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/doc/inspection-report/12345&amp;gt;
  rdfs:label &quot;Inspection Report for Such-and-Such School&quot;@en ;
  dct:hasVersion
    &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&amp;gt; ,
    ... .

&amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&amp;gt;
  rdfs:label &quot;2009 Inspection Report for Such-and-Such School&quot;@en ;
  dct:created &quot;2009-10-18&quot;^^xsd:date ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&amp;gt; ;
  dct:isVersionOf &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345&amp;gt; .

&amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&amp;gt;
  rdfs:label &quot;2006 Inspection Report for Such-and-Such School&quot;@en ;
  dct:created &quot;2006-11-23&quot;^^xsd:date ;
  dct:isReplacedBy &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&amp;gt; ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&amp;gt; ;
  dct:isVersionOf &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It would be expected that people linking to the document would either point to a particular (dated) resource or to the unversioned (hub) document. For example, if someone were talking specifically about the 2006 OFSTED inspection, they would point to the 2006 inspection report; if they were referring to whatever inspection report is current, they&amp;#8217;d use the unversioned URI.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Note: Although &lt;code&gt;dct:hasVersion&lt;/code&gt; and &lt;code&gt;dct:isVersionOf&lt;/code&gt; are sort-of OK here, having a property that points to the current (ie most recent) version of a resource would be very helpful.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
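
&lt;p&gt;Without such a property, the current version has to be worked out as the one that nothing else replaces. A sketch of that query, assuming every newer version carries a &lt;code&gt;dct:replaces&lt;/code&gt; link to its predecessor as in the example data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX dct: &amp;lt;http://purl.org/dc/terms/&amp;gt;

# The current version is the one that no other version replaces
SELECT ?current
WHERE {
  &amp;lt;http://education.data.gov.uk/doc/inspection-report/12345&amp;gt;
    dct:hasVersion ?current .
  OPTIONAL {
    ?newer dct:replaces ?current .
  }
  FILTER ( !bound(?newer) )
}
&lt;/code&gt;&lt;/pre&gt;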

&lt;h2&gt;Versioned Non-Information Resources&lt;/h2&gt;

&lt;p&gt;The harder problem is how we handle changes to non-information resources over time. For example, how do we handle the fact that a school often changes head, sometimes changes name, regularly changes class sizes, rarely changes address and so on? How do we handle the fact that we have legacy statistics about local authorities as they were in 2008, prior to the 2009 reorganisation, and that it&amp;#8217;s very likely that these kinds of changes will continue to take place regularly in the future?&lt;/p&gt;

&lt;p&gt;Our requirements are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;that the data is easily usable by people who only care about the current state of a resource&lt;/li&gt;
&lt;li&gt;that the (current) data remains easily queryable at a SPARQL endpoint&lt;/li&gt;
&lt;li&gt;that it&amp;#8217;s &lt;em&gt;possible&lt;/em&gt; (not necessarily easy) to query historic data&lt;/li&gt;
&lt;li&gt;that historic data can be moderately easily retrieved and navigated&lt;/li&gt;
&lt;li&gt;that it can represent historical states even when the precise time period is not known&lt;/li&gt;
&lt;li&gt;that it can distinguish between a change in the concept and a change in our record of it (e.g. changing the name of a school, versus correcting a typo in the database entry for the school)&lt;/li&gt;
&lt;li&gt;that it can trace what the nature or cause of the change was (e.g. redrawing of local authority boundaries)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Statistical Data&lt;/h3&gt;

&lt;p&gt;To begin our discussion, let&amp;#8217;s look at statistical data. Statistical data is data that&amp;#8217;s usually numeric and for which we have values that are categorised along multiple dimensions as well as time. School census information is statistical data, for example, because each value is associated with not only the school and the date at which the census was taken but also the age (and gender, but to simplify I&amp;#8217;ll pretend just age) of the children being counted. This gives us a set of observations which might each look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;/data/edubase/census/12345/age/11/2009&amp;gt; 
  a sdmx:Observation ;
  sdmx:dataset &amp;lt;/data/edubase&amp;gt; ;
  dct:replaces &amp;lt;/data/edubase/census/12345/age/11/2008&amp;gt; ;
  rdf:value 85 ;
  edu:school &amp;lt;/id/school/12345&amp;gt; ;
  edu:schoolYear &amp;lt;/id/school-year/2009&amp;gt; ;
  sdmx:age 11 .
&lt;/code&gt;&lt;/pre&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: This is indicative of the vocabulary we might use for statistics; don&amp;#8217;t rely on it. If you&amp;#8217;re interested in the progress we&amp;#8217;re making on modelling statistical datasets using RDF, come and join &lt;a href=&quot;http://groups.google.com/group/publishing-statistical-data&quot;&gt;the publishing statistical data Google Group&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These statistical observations point to the interval that they apply to as a property, with the &lt;code&gt;rdf:value&lt;/code&gt; property holding the actual value. The observation won&amp;#8217;t change over time (unless it is corrected, which I will come back to), and &lt;strong&gt;observations from different times can all remain present within the graph without interacting badly with each other&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is great because it means that we can make queries that give us time series views over the data. For example, we could define a series for children aged 11 at this particular school over time something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;/data/edubase/census/12345/age/11&amp;gt;
  a sdmx:TimeSeries ;
  edu:school &amp;lt;/id/school/12345&amp;gt; ;
  sdmx:age 11 ;
  sdmx:observation
    &amp;lt;/data/edubase/census/12345/age/11/2009&amp;gt; ,
    &amp;lt;/data/edubase/census/12345/age/11/2008&amp;gt; ,
    &amp;lt;/data/edubase/census/12345/age/11/2007&amp;gt; ,
    ... .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and associate this with the school through a specialised property:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;/id/school/12345&amp;gt; edu:age11 &amp;lt;/data/edubase/census/12345/age/11&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The fly in the ointment is that data that is purely represented in this way is really hard to query if all you&amp;#8217;re actually interested in is the &lt;em&gt;current&lt;/em&gt; value for the particular statistic. For example, say that you&amp;#8217;ve just moved to an area and are trying desperately to find a school that might have room for your 11-year-old. Given that class sizes are capped at 30, you could look for schools that have a number of 11-year-olds that is not a multiple of 30. If you want to know how many 11-year-olds are &lt;em&gt;currently&lt;/em&gt; in a school (according to the most recent measurement), you need a query like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT ?age11
WHERE {
  &amp;lt;/id/school/12345&amp;gt; edu:age11 [
    sdmx:observation ?currentObservation ;
  ]
  OPTIONAL {
    ?futureObservation dct:replaces ?currentObservation .
  }
  FILTER ( !bound(?futureObservation) ) .
  ?currentObservation rdf:value ?age11 .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(it&amp;#8217;s even more complicated if you don&amp;#8217;t have the &lt;code&gt;dct:replaces&lt;/code&gt; links!).&lt;/p&gt;

&lt;p&gt;How much simpler it would be for people if there was a property that just indicated the current state of the world:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;/id/school/12345&amp;gt; edu:currentAge11 85 .
&lt;/code&gt;&lt;/pre&gt;
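
&lt;p&gt;A publisher could, in principle, generate that shortcut from the observations rather than maintain it by hand. The following is only a sketch: the &lt;code&gt;edu:&lt;/code&gt; namespace URI is a placeholder, &lt;code&gt;sdmx:&lt;/code&gt; stands in for whatever statistical vocabulary is eventually used, and &lt;code&gt;edu:currentAge11&lt;/code&gt; is the hypothetical property shown above. The derived triple would also need regenerating whenever a new observation is published:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX dct: &amp;lt;http://purl.org/dc/terms/&amp;gt;
PREFIX edu: &amp;lt;http://example.org/edu#&amp;gt;     # placeholder namespace
PREFIX sdmx: &amp;lt;http://example.org/sdmx#&amp;gt;   # placeholder namespace

# Copy the value of the latest observation onto the school itself
CONSTRUCT {
  ?school edu:currentAge11 ?value .
}
WHERE {
  ?school edu:age11 ?series .
  ?series sdmx:observation ?observation .
  ?observation rdf:value ?value .
  OPTIONAL {
    ?newer dct:replaces ?observation .
  }
  FILTER ( !bound(?newer) )
}
&lt;/code&gt;&lt;/pre&gt;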

&lt;p&gt;The same argument applies even more strongly for values that we would categorise as &lt;strong&gt;reference data&lt;/strong&gt;, such as the name of a school. Although it would be possible to model all this information using the kind of n-ary relation approach we have to use for statistical observations, it would be both incredibly hard to query and incredibly verbose to do so. Even if n-ary relations are the &amp;#8220;correct&amp;#8221; way of modelling the changing data, they are impractical for querying.&lt;/p&gt;

&lt;p&gt;And, as I hinted, we have to have some way of managing the possibility of statistics themselves being versioned (for example if an error is detected within the statistics). Using n-ary relations to provide the value of an observation gets very complicated very quickly.&lt;/p&gt;

&lt;p&gt;So, we have made the decision to use named graphs.&lt;/p&gt;

&lt;h3&gt;Named Graphs&lt;/h3&gt;

&lt;p&gt;Named graphs can be used in two ways which are related but need to be thought about slightly differently.&lt;/p&gt;

&lt;p&gt;First, we can use a named-graph approach to the &lt;strong&gt;publication&lt;/strong&gt; of RDF. We can describe the same &lt;em&gt;thing&lt;/em&gt; within multiple documents; each document can contain different (and contradictory) information, but also metadata about the document that indicates precisely when the information it contains is valid.&lt;/p&gt;

&lt;p&gt;Second, we can use a named-graph approach to the &lt;strong&gt;representation&lt;/strong&gt; of RDF within a triple- (or more accurately quad-) store. We can collect together statements that are made at the same time, from the same source, and with the same level of authority into a named graph. These graphs can then be loaded into the store, with the metadata about each graph made explicitly available so that relevant graphs can be selected and queried.&lt;/p&gt;

&lt;p&gt;There are two things that are worth noting about this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Publishing named graphs is relevant however RDF is published. For example, in some linked data publication set-ups, RDF/XML or RDFa might be generated on demand based on an underlying database of some description. In this case, the named graphs for representing data aren&amp;#8217;t relevant (the database will presumably capture some provenance and validity information itself that can be exposed within the RDF).&lt;/li&gt;
&lt;li&gt;In the case where linked data is published natively (ie stored in a triplestore and exposed as linked data through an API), the two uses of named graphs don&amp;#8217;t precisely align with each other. The named graphs that we create when we convert or load data within a triplestore are not (necessarily) the same as the named graphs that we expose when we publish data. What&amp;#8217;s important here is
&lt;ul&gt;&lt;li&gt;that the named graphs that we have within the triplestore can feasibly be used (by a publication framework such as the &lt;a href=&quot;http://purl.org/linked-data/api/spec&quot;&gt;linked data API&lt;/a&gt; we&amp;#8217;re working on) to create the publication-based named graphs&lt;/li&gt;
&lt;li&gt;that the SPARQL endpoint offered by the triplestore has a default graph which reflects the current state of affairs&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let&amp;#8217;s look at these two uses of named graphs in more detail.&lt;/p&gt;

&lt;h3&gt;Publication of Named Graphs&lt;/h3&gt;

&lt;p&gt;Our intention is to publish different information about the same resource within different documents (aka named graphs). This approach hooks into the approach for versioning information resources outlined above. A resource is described in a document, and many documents may describe the same resource.&lt;/p&gt;

&lt;p&gt;For example, if a school changes its name from &amp;#8220;Broadmoor Primary School&amp;#8221; to &amp;#8220;Wildmoor Heath School&amp;#8221; on 1st September 2009, then after 1st September 2009, requesting information about the school at &lt;code&gt;http://education.data.gov.uk/id/school/12345&lt;/code&gt; would result in a &lt;code&gt;303 See Other&lt;/code&gt; redirection to &lt;code&gt;http://education.data.gov.uk/doc/school/12345&lt;/code&gt; which would contain information about the school that is currently relevant:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Information about the school that is currently relevant
&amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
  rdfs:label &quot;Wildmoor Heath School&quot;@en ;
  foaf:isPrimaryTopicOf 
    &amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&amp;gt; ,
    ... .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;as well as metadata about the document that&amp;#8217;s been returned and the &amp;#8220;hub&amp;#8221; document that lists the alternative versions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt;
  rdfs:label &quot;Information about School 12345&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt; ;
  dct:hasVersion
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&amp;gt; ,
    ... .

&amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt;
  rdfs:label &quot;Information about Wildmoor Heath School from 1st Sept 2009&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt; ;
  dct:created &quot;2009-09-01&quot;^^xsd:date ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt; ;
  dct:isVersionOf &amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A request to the replaced document &lt;code&gt;http://education.data.gov.uk/doc/school/12345/2001-09-01&lt;/code&gt; would result in the information that was valid about the school on the 1st September 2001:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
  rdfs:label &quot;Broadmoor Primary School&quot;@en ;
  foaf:isPrimaryTopicOf 
    &amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&amp;gt; ,
    ... .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and, again, metadata about the document that&amp;#8217;s been returned and the &amp;#8220;hub&amp;#8221; document that lists the alternative versions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt;
  rdfs:label &quot;Information about School 12345&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt; ;
  dct:hasVersion
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&amp;gt; ,
    ... .

&amp;lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&amp;gt;
  rdfs:label &quot;Information about Broadmoor Primary School (2001-2008)&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt; ;
  dct:created &quot;2001-09-01&quot;^^xsd:date ;
  dct:isReplacedBy &amp;lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&amp;gt; ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&amp;gt; ;
  dct:isVersionOf &amp;lt;http://education.data.gov.uk/doc/school/12345&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The statements about &lt;code&gt;http://education.data.gov.uk/id/school/12345&lt;/code&gt; in this second document are inconsistent with the statements retrieved from &lt;code&gt;http://education.data.gov.uk/doc/school/12345&lt;/code&gt; but because they are published within different documents, they should be considered (by anyone retrieving this data) to be different graphs and therefore are allowed to provide different views of the world.&lt;/p&gt;

&lt;p&gt;The statements about the named graphs &lt;code&gt;http://education.data.gov.uk/doc/school/12345/2009-09-01&lt;/code&gt; and &lt;code&gt;http://education.data.gov.uk/doc/school/12345/2001-09-01&lt;/code&gt; can include information about the interval during which the content of the document is valid. (We haven&amp;#8217;t worked out exactly how to indicate this yet; &lt;code&gt;dct:valid&lt;/code&gt; is no good; see later.)&lt;/p&gt;

&lt;h4&gt;Associated Resources&lt;/h4&gt;

&lt;p&gt;This story seems fine until you start to look at linked resources. For example, schools may link out to separate resources, particularly when different aspects of a school are likely to change at different rates or come from different sources. A school is unlikely to change its name in the middle of a school year, but may well change some of its staff, and the number of pupils it has, during a year. It&amp;#8217;s likely that these separate sets of information will be represented as different resources.&lt;/p&gt;

&lt;p&gt;The document published about the school for a particular date will not necessarily include all the details of the linked resource at that point in time. This can make it hard to navigate to the particular version of the linked resource. For example, if a client wants to look at the information about a school at 1st September 2001, they would locate the graph at &lt;code&gt;http://education.data.gov.uk/doc/school/12345/2001-09-01&lt;/code&gt;. This might contain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
  rdfs:label &quot;Broadmoor Primary School&quot;@en ;
  edu:staffing &amp;lt;http://education.data.gov.uk/id/school/12345/staff&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A request to &lt;code&gt;http://education.data.gov.uk/id/school/12345/staff&lt;/code&gt; will result in a &lt;code&gt;303 See Other&lt;/code&gt; redirection to &lt;code&gt;http://education.data.gov.uk/doc/school/12345/staff&lt;/code&gt;. This document holds &lt;em&gt;current&lt;/em&gt; information about the staffing, and will include:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345/staff&amp;gt;
  rdfs:label &quot;Staffing of Wildmoor Heath School&quot;@en ;
  edu:school &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt; ;
  edu:head ... ;
  edu:deputy ... ;
  ... ;
  foaf:isPrimaryTopicOf
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&amp;gt; ,
    ... .

&amp;lt;http://education.data.gov.uk/doc/school/12345/staff&amp;gt;
  rdfs:label &quot;Information about Staffing at School 12345&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345/staff&amp;gt; ;
  dct:hasVersion
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&amp;gt; ,
    &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&amp;gt; ,
    ... .

&amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&amp;gt;
  rdfs:label &quot;Staffing of Wildmoor Heath School in Autumn Term, 2009&quot;@en ;
  foaf:primaryTopic &amp;lt;http://education.data.gov.uk/id/school/12345/staff&amp;gt; ;
  dct:created &quot;2009-09-01&quot;^^xsd:date ;
  dct:isVersionOf &amp;lt;http://education.data.gov.uk/doc/school/12345/staff&amp;gt; ;
  dct:replaces &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The client then has to work out which of the possible versions of the graph about &lt;code&gt;http://education.data.gov.uk/id/school/12345/staff&lt;/code&gt; it should look at to navigate back to the information that&amp;#8217;s relevant at 1st September 2001.&lt;/p&gt;

&lt;p&gt;There are two techniques that we might use to help address this. One is for the information that&amp;#8217;s retrieved at &lt;code&gt;http://education.data.gov.uk/doc/school/12345/2001-09-01&lt;/code&gt; to include some basic information about the linked resource that includes &lt;code&gt;foaf:isPrimaryTopicOf&lt;/code&gt; links directly to the relevant versioned document about the linked resource. For example, that document should contain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345/staff&amp;gt;
  rdfs:label &quot;Staffing of Broadmoor Primary School&quot;@en ;
  foaf:isPrimaryTopicOf &amp;lt;http://education.data.gov.uk/doc/school/12345/staff/2001-09-01&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These links will have to be generated by the publication framework since they are calculated based on the date of the requested resource.&lt;/p&gt;

&lt;p&gt;The other technique is to use HTTP headers to request the applicable date, as suggested by the &lt;a href=&quot;http://www.mementoweb.org/&quot;&gt;Memento Experiment&lt;/a&gt;. Even with this technique, it&amp;#8217;s still useful to have distinct URIs for the individual documents so that they can be pointed to and talked about.&lt;/p&gt;

&lt;h3&gt;Representation of Data in Named Graphs&lt;/h3&gt;

&lt;p&gt;Let&amp;#8217;s turn to looking at the use of named graphs within a triplestore. In the government case, we&amp;#8217;re expecting that information about schools going into a single triplestore is likely to come from multiple sources. Each source may release information at different intervals, with different temporal validity. The data from a single source will override earlier information from that source over time, but equally data from different sources may overlap and contradict each other.&lt;/p&gt;

&lt;p&gt;To manage this, we split up triples into named graphs based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;their source&lt;/li&gt;
&lt;li&gt;their temporal validity (and their temporal relationship with other graphs)&lt;/li&gt;
&lt;li&gt;their authoritativeness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This metadata about the named graph is recorded within the named graph itself, using &lt;code&gt;voiD&lt;/code&gt; and other vocabularies.&lt;/p&gt;

&lt;p&gt;In more detail:&lt;/p&gt;

&lt;h4&gt;Named Graphs over Time&lt;/h4&gt;

&lt;p&gt;Named graphs are expected to occur within a series over time. The triples within one graph will be completely replaced by the triples within another graph. The most recent graph is one that has not yet been replaced. To record this, the graphs should have associated with them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the dates when the data in the graph is valid (only the start date is really required)&lt;/li&gt;
&lt;li&gt;the graph(s) that the graph replaces&lt;/li&gt;
&lt;li&gt;the graph(s) that the graph is replaced by&lt;/li&gt;
&lt;li&gt;the date when the data in the graph was created&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid repetition of data within multiple graphs, graphs should be split up at the level that updates are likely to occur within the source of the data. For example, Edubase holds a database of schools. If the linked data for schools is generated based on dumps of the entire Edubase database, then there would be a separate named graph for each dump of the database. If the linked data is created more dynamically, based on updates at the level of an individual school, say, then there should be a separate series of named graphs for each school. If the updates can occur at an even finer level of granularity (eg at each record within each table within the database), then there can be separate named graphs at that level.&lt;/p&gt;

&lt;h4&gt;Named Graphs from Different Sources&lt;/h4&gt;

&lt;p&gt;Information about the same resources will come from different sources, and have gone through different levels of processing to become linked data. To allow us to provide information about the provenance of different triples, separate named graphs should be used for data from different sources. The metadata about a graph should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the source of the data (through &lt;code&gt;dct:source&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;the provenance of the data (through something more complex, yet to be finalised)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Much of the information about a particular resource will only come from one source. For example, Edubase contains the pupil census for a school while Ofsted provides inspection reports. However, there will be overlaps between the information available from different sources, such as the name and address of the school.&lt;/p&gt;

&lt;p&gt;For any given property of a resource (such as the name of the school), there should be one source that is the authoritative source of that information; other sources are considered supplementary. Each source should therefore usually provide two series of named graphs: one of information for which they are considered the authority, and one of information for which they are not. The metadata about the graph should include a property that indicates whether the information it contains is authoritative or not.&lt;/p&gt;

&lt;h4&gt;Constructing a Graph for a Given Date&lt;/h4&gt;

&lt;p&gt;It&amp;#8217;s extremely useful to be able to create snapshots that contain information that&amp;#8217;s current at a particular point in time. The most useful of these is the &lt;em&gt;current&lt;/em&gt; graph, which is the one that should be exposed as the default graph in the SPARQL endpoint offered by the triplestore.&lt;/p&gt;

&lt;p&gt;The graph can be created by combining:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;all the triples from authoritative graphs that are valid at that point in time (ie graphs that have a validity date before that point in time and that are not replaced by a graph whose validity date is also before that point in time)&lt;/li&gt;
&lt;li&gt;those triples from supplementary graphs for which there is no existing triple in the graph with the same subject and property&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, there may be information available about a school from Edubase and from OFSTED, as follows (in TriG syntax):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# graph containing data from Edubase from 2008-09-01
&amp;lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&amp;gt; {
  &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
    rdfs:label &quot;Broadmoor Primary School&quot;@en ;
    edu:census &amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt; .

  &amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt;
    ... .

  &amp;lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&amp;gt;
    a void:Dataset ;
    dct:created &quot;2008-09-01&quot;^^xsd:date ;
    dct:replaces &amp;lt;http://education.data.gov.uk/data/edubase/12345/2007-09-01/authoritative&amp;gt; ;
    dct:isReplacedBy &amp;lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&amp;gt; ;
    dct:source &amp;lt;http://www.edubase.gov.uk/&amp;gt; ;
    :authoritative true .

  &amp;lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01&amp;gt;
    a void:Dataset ;
    void:subset &amp;lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&amp;gt; ;
    ... .
}

# graph containing data from Edubase from 2009-09-01; the name of the school 
# has changed (as have the details of the census)
&amp;lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&amp;gt; {
  &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
    rdfs:label &quot;Wildmoor Heath School&quot;@en ;
    edu:census &amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt; .

  &amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt;
    ... .

  ... metadata about this graph ...
}

# graph containing authoritative data from Ofsted from 2008-03-01
# note that this doesn&#039;t include the name of the school
&amp;lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/authoritative&amp;gt; {
  &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
    edu:inspection &amp;lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&amp;gt; .

  &amp;lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&amp;gt;
    ... .

  ... metadata about this graph ...
}

# graph containing supplementary data from Ofsted from 2008-03-01
# this includes the name of the school (at the time of the inspection)
&amp;lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary&amp;gt; {
  &amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
    rdfs:label &quot;Broadmoor Primary School&quot;@en .

  ... metadata about this graph ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that metadata about each graph is embedded in the graph itself.&lt;/p&gt;

&lt;p&gt;In the example above, a graph for 2010-01-01 would contain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
  rdfs:label &quot;Wildmoor Heath School&quot;@en ;
  edu:census &amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt; ;
  edu:inspection &amp;lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&amp;gt; .

&amp;lt;http://education.data.gov.uk/id/school/12345/census&amp;gt;
  ... .

&amp;lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&amp;gt;
  ... .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It would not contain the triple:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://education.data.gov.uk/id/school/12345&amp;gt;
  rdfs:label &quot;Broadmoor Primary School&quot;@en .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;because this triple is only present in an authoritative form within &lt;code&gt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&lt;/code&gt;, which is replaced by &lt;code&gt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&lt;/code&gt;, or within &lt;code&gt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary&lt;/code&gt;, which is a supplementary graph and can&amp;#8217;t override the label provided by the authoritative graph.&lt;/p&gt;

&lt;h2&gt;Unanswered Questions&lt;/h2&gt;

&lt;p&gt;There are three gaps within this that need plugging.&lt;/p&gt;

&lt;p&gt;First, how should we represent the interval during which a graph is valid? As I&amp;#8217;ve indicated above, &lt;code&gt;dct:valid&lt;/code&gt; doesn&amp;#8217;t cut it because it can&amp;#8217;t represent an interval very well (there is a &lt;a href=&quot;http://dublincore.org/documents/dcmi-period/&quot;&gt;Dublin Core recommended format for representing periods&lt;/a&gt; but it&amp;#8217;s not going to be easy for people to process). We have work ongoing on defining intervals (by Stuart Williams) and will probably have to mint our own property to indicate the temporal validity of a named graph, given that &lt;code&gt;dct:valid&lt;/code&gt; takes a literal rather than a resource.&lt;/p&gt;

&lt;p&gt;Second, how should we indicate whether a graph is authoritative or not? Should this be a simple boolean switch (which will make the logic for combining graphs easier, and probably be easiest to assess) or a kind of confidence level, which might allow for missing data better?&lt;/p&gt;

&lt;p&gt;Third, how should we represent the events that cause the replacement of one named graph with another? I think that we should be able to use the provenance vocabulary that we end up using to represent these changes, so that it&amp;#8217;s possible to indicate whether the new information is the correction of a clerical error or an actual change to the real world thing.&lt;/p&gt;

&lt;p&gt;And, we have to try this out. While it looks as if it might work, I won&amp;#8217;t be confident until we&amp;#8217;ve tried it out with some real data and some real queries. I&amp;#8217;m also concerned that while keeping data in separate, annotated, named graphs seems like our best chance of managing versions and tracking provenance, it adds a hurdle onto the generation of linked data that might be too high, particularly for people who are just starting out.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/141#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/56">named graphs</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/55">versioning</category>
 <pubDate>Sat, 27 Feb 2010 22:15:40 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">141 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Why Linked Data for data.gov.uk?</title>
 <link>http://www.jenitennison.com/blog/node/140</link>
 <description>&lt;p&gt;&lt;a href=&quot;http://data.gov.uk/&quot;&gt;data.gov.uk&lt;/a&gt; was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&amp;#8217;s been in for the last few months). It&amp;#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&amp;#8217;s been made available and the website itself, which (&lt;a href=&quot;http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html&quot;&gt;unlike a lot of UK government IT&lt;/a&gt;) was developed rapidly by a small team based on open source software (and at low cost).&lt;/p&gt;

&lt;p&gt;This is a first step on a long road.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;One of the features of the UK Government&amp;#8217;s approach to freeing data is the emphasis on using &lt;a href=&quot;http://www.data.gov.uk/wiki/Linked_Data&quot;&gt;linked data&lt;/a&gt;. What I don&amp;#8217;t think has really been articulated is either what that means or why we should take this approach. From what I&amp;#8217;ve seen, developers seem to think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;linked data is a synonym for turning everything into RDF and putting it in one big triplestore, equivalent to making one big database of government data and therefore prone to exactly the same, well-known and understood problems that government has with creating huge databases&lt;/li&gt;
&lt;li&gt;linked data requires everyone to agree to the same model and vocabulary, which means huge efforts in standardisation and ends up with something that suits no one&lt;/li&gt;
&lt;li&gt;the UK government will be releasing all their data as linked data immediately, and in no other way&lt;/li&gt;
&lt;li&gt;the UK government has been seduced into using linked data by academics who don&amp;#8217;t understand anything about how the web or the real world works&lt;/li&gt;
&lt;li&gt;the UK government has been seduced into using linked data by big businesses who stand to make a pretty penny providing services to departments that are forced to publish their data in this way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are true. In fact, the UK government is committed to publishing data as linked data because they are convinced it is the &lt;strong&gt;best approach available for publishing data in a hugely diverse and distributed environment, in a gradual and sustainable way&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because linked data is just a term for how to publish data on the web while working &lt;em&gt;with&lt;/em&gt; the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re a web developer, you already know that the best APIs are &lt;a href=&quot;http://en.wikipedia.org/wiki/Representational_State_Transfer&quot;&gt;RESTful APIs&lt;/a&gt;. That argument has been won. It means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using (HTTP) URIs to identify resources: naming &lt;em&gt;things&lt;/em&gt; with URIs rather than actions on those things (which are carried out using the standard set of HTTP verbs)&lt;/li&gt;
&lt;li&gt;recognising the distinction between resources and representations of those resources: the same URI might return a different representation of the resource, such as HTML or XML or JSON&lt;/li&gt;
&lt;li&gt;returning self-descriptive messages: being able to process representations in a manner that is obvious from the mime type&lt;/li&gt;
&lt;li&gt;hypermedia as the engine of application state: being able to locate additional resources through the use of (typed) links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Linked data is about following these rules for publishing data. It is about using URIs to identify things, providing information at the end of those URIs that is self-descriptive, and linking those things to other things through typed links.&lt;/p&gt;

&lt;p&gt;One of the features of this approach is that it doesn&amp;#8217;t require any big bangs. No one planned the web: sat down and mapped out each page and its precise relations to every other page, in advance. It grew, and evolved, and continues to grow and evolve every day. It grows through individuals and institutions publishing information for their own reasons and linking to other people who have published information for their own reasons, and, because we have some fundamental standards that clients and servers understand, it All Just Works.&lt;/p&gt;

&lt;h2&gt;Standards&lt;/h2&gt;

&lt;p&gt;Did you notice how I slipped in the &amp;#8220;because we have some fundamental standards that clients and servers understand&amp;#8221;? One standard is obviously HTTP, which controls how clients and servers can talk to each other: it allows clients to request pages and servers to respond. Another standard is HTML, which enables browsers to display information in ways that people can understand, and (crucially) has a known set of semantics that browsers can use to tell when something is a link, which people can navigate to find more information.&lt;/p&gt;

&lt;p&gt;For linked data, there are two crucial standards: RDF and SPARQL. Yes, I know what you&amp;#8217;re thinking, because believe me two years ago that would have been my reaction too, but let me explain why.&lt;/p&gt;

&lt;p&gt;There&amp;#8217;s one way in which publishing data isn&amp;#8217;t like publishing documents: its model. Documents are made up of paragraphs and headings and lists and tables and so on. Data is made up of&amp;#8230; what? Well, at its most basic, it&amp;#8217;s &lt;em&gt;things&lt;/em&gt; that have &lt;em&gt;properties&lt;/em&gt; which have &lt;em&gt;values&lt;/em&gt;. We might call the things &lt;em&gt;objects&lt;/em&gt; or &lt;em&gt;entities&lt;/em&gt;, and call some of the properties &lt;em&gt;relations&lt;/em&gt;. We might even call them &lt;em&gt;records&lt;/em&gt; with &lt;em&gt;columns&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt; and &lt;em&gt;foreign keys&lt;/em&gt;. But however you term them, for better or worse, we do tend to think about data in this way: &lt;em&gt;thing&lt;/em&gt;, &lt;em&gt;property&lt;/em&gt;, &lt;em&gt;value&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So if we are going to publish data on the web, we need a standard way of expressing the data so that a client receiving the data can work out what&amp;#8217;s a &lt;em&gt;thing&lt;/em&gt;, what&amp;#8217;s a &lt;em&gt;property&lt;/em&gt;, what&amp;#8217;s a &lt;em&gt;value&lt;/em&gt;. &lt;strong&gt;And, because this is the web, what&amp;#8217;s a &lt;em&gt;link&lt;/em&gt;&lt;/strong&gt;. This is the fundamental standard we need, and this is what RDF gives.&lt;/p&gt;

&lt;p&gt;RDF is actually a model rather than a syntax. It&amp;#8217;s a bit like the split between the DOM and HTML or XHTML. The DOM tells the browser how to render the page: the HTML or XHTML is just a syntax which the browser is able to convert into a DOM that it displays. We could imagine browsers converting wiki syntax into a DOM. Or creating a DOM based on XML and XSLT, which of course they all do.&lt;/p&gt;

&lt;p&gt;So, RDF is like the DOM, with varying representations of RDF (XML-based, text-based, JSON-based, even HTML-based) that can be used to pass to the client the underlying model of &lt;em&gt;things&lt;/em&gt; and &lt;em&gt;properties&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt; (some of which are &lt;em&gt;links&lt;/em&gt;). What the client does then is its business: clients that retrieve data aren&amp;#8217;t browsers &amp;#8212; they&amp;#8217;re not all going to display the data, use the same parts of the data, or otherwise process it in the same way &amp;#8212; but they can pull out the &lt;em&gt;things&lt;/em&gt;, &lt;em&gt;properties&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt;, and know which are &lt;em&gt;links&lt;/em&gt;, and this data structure will often, with a good RDF library, map on to some natural structure within whatever programming language is being used, and make the programmer&amp;#8217;s job easier.&lt;/p&gt;
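
&lt;p&gt;To make the &lt;em&gt;thing&lt;/em&gt;, &lt;em&gt;property&lt;/em&gt;, &lt;em&gt;value&lt;/em&gt; model concrete, here is a minimal sketch in Turtle, one of the text-based syntaxes for RDF. Everything in it is an assumption made for illustration: the &lt;code&gt;example.org&lt;/code&gt; URIs, the &lt;code&gt;ex:&lt;/code&gt; vocabulary and the values are hypothetical, and only &lt;code&gt;rdfs:label&lt;/code&gt; is a real, established property.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix ex:   &amp;lt;http://example.org/def/school#&amp;gt; .   # hypothetical vocabulary

# the thing, named with a (hypothetical) URI
&amp;lt;http://example.org/id/school/12345&amp;gt;
  # a property whose value is a plain literal
  rdfs:label &quot;Wildmoor Heath School&quot;@en ;
  # a property whose value is another URI, ie a link
  ex:localAuthority &amp;lt;http://example.org/id/local-authority/867&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A client doesn&amp;#8217;t need to know anything about schools to see that the first value is a literal and the second is a link that it could follow to find out more.&lt;/p&gt;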

&lt;h2&gt;Vocabularies&lt;/h2&gt;

&lt;p&gt;What we don&amp;#8217;t want to have to define are standard ways of expressing &lt;em&gt;particular&lt;/em&gt; data (such as data about a school) because different individuals and organisations will have completely different ways of thinking about a particular thing. A school itself will have information about uniform and open days; &lt;a href=&quot;http://www.ofsted.gov.uk/&quot;&gt;OFSTED&lt;/a&gt; about performance; &lt;a href=&quot;http://www.edubase.gov.uk/&quot;&gt;Edubase&lt;/a&gt; about administration and pupil numbers; the PTA about after-school activities. Expecting everyone to adopt a particular standard vocabulary for describing a school is as futile as expecting everyone to adopt exactly the same page layout within their web pages, and exactly the same class names in their CSS.&lt;/p&gt;

&lt;p&gt;But we don&amp;#8217;t want to rule out opportunistic alignments where individuals or organisations, for whatever reason, &lt;em&gt;do&lt;/em&gt; want to use the same vocabularies. Look at what&amp;#8217;s happened with classes in HTML. There is absolutely no constraint on what classes people use in their HTML. But there are clusters of web pages that use some of the same classes. Websites that use &lt;a href=&quot;http://microformats.org/&quot;&gt;microformats&lt;/a&gt;. Websites that adopt a particular &lt;a href=&quot;http://en.wikipedia.org/wiki/CSS_framework&quot;&gt;CSS framework&lt;/a&gt;. Importantly, though, even where some classes are shared, it doesn&amp;#8217;t mean that &lt;em&gt;all&lt;/em&gt; classes are shared: adoption of a particular microformat or CSS framework doesn&amp;#8217;t limit the rest of the page.&lt;/p&gt;

&lt;p&gt;RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we &lt;em&gt;really&lt;/em&gt; want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.&lt;/p&gt;

&lt;p&gt;The other thing about RDF that really helps here is that it&amp;#8217;s easy to align vocabularies if you want to, post-hoc. &lt;a href=&quot;http://www.w3.org/TR/rdf-schema/&quot;&gt;RDFS&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/TR/owl-overview/&quot;&gt;OWL&lt;/a&gt; define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.&lt;/p&gt;
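
&lt;p&gt;As a rough sketch of what such post-hoc alignment might look like, the Turtle below uses &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt; and &lt;code&gt;owl:equivalentProperty&lt;/code&gt;, which are defined by RDFS and OWL respectively. The &lt;code&gt;ex:&lt;/code&gt; and &lt;code&gt;other:&lt;/code&gt; vocabularies and their property names are hypothetical, invented here only to show the pattern.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@prefix rdfs:  &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix owl:   &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix ex:    &amp;lt;http://example.org/def/school#&amp;gt; .        # our own (hypothetical) vocabulary
@prefix other: &amp;lt;http://example.com/vocab/education#&amp;gt; .  # someone else&#039;s (hypothetical) vocabulary

# a specialised version of an established property:
# anything with an ex:schoolName has that same value as its rdfs:label
ex:schoolName rdfs:subPropertyOf rdfs:label .

# this property is really the same as that property
ex:pupilCount owl:equivalentProperty other:numberOfPupils .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Statements like these can be published long after the data itself, and neither vocabulary has to change for the alignment to take effect in clients that understand RDFS and OWL.&lt;/p&gt;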

&lt;p&gt;So the linked data web is designed to grow and evolve in exactly the same way as the human web has grown and evolved. It grows through people adding links to existing data. It grows through people creating their own vocabularies. And it evolves as links break and reform, and vocabularies combine and diverge. It is complex and messy and self-organising.&lt;/p&gt;

&lt;h2&gt;Layers&lt;/h2&gt;

&lt;p&gt;The cornerstone of the great, messy, web is the URI. URIs have two important roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;they identify things&lt;/strong&gt; - If two sets of data use the same URI then it&amp;#8217;s dead easy to work out when they are talking about the same thing, for example to bring together the information published by a school with its OFSTED report and its pupil census. Spread this around to five, ten, twenty datasets from different places all using the same identifier for the school, and you have a huge pool of information. And the great thing about RDF (because it also uses URIs to identify properties) is that those datasets can be combined automatically without worrying about clashes, rather than through painstaking developer effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;they provide somewhere to look for information&lt;/strong&gt; - This is the point of using HTTP URIs, because that look-up is as simple as retrieving a document from the web. This enables programmatic, on-demand, access to the information. Developers don&amp;#8217;t have to download huge database dumps when all they are interested in is a small fraction of that data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
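
&lt;p&gt;As a rough illustration of the first of those roles, here is a small Python sketch that combines datasets simply by pooling their triples. Triples are modelled as plain tuples, and the &lt;code&gt;merge&lt;/code&gt; and &lt;code&gt;describe&lt;/code&gt; functions are hypothetical helpers, not part of any library.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
# Triples are (subject, predicate, object) tuples of strings.
# Function names and any URIs passed in are illustrative assumptions.


def merge(*datasets):
    """Pool the triples from every dataset into one set."""
    merged = set()
    for dataset in datasets:
        merged |= set(dataset)
    return merged


def describe(triples, subject):
    """Collect everything said about one URI, grouped by property URI."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, set()).add(o)
    return description
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because both the school and the properties are named with full URIs, pooling the triples needs no renaming step: two publishers who happen to use the same short property name in different namespaces cannot collide.&lt;/p&gt;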

&lt;p&gt;But we know that of course sometimes developers &lt;em&gt;do&lt;/em&gt; want to download huge database dumps. So we need URIs for those dumps, and ways to associate metadata with them, and ways to search them. Adopting linked data doesn&amp;#8217;t preclude providing sets of data in larger lumps. In fact, what&amp;#8217;s needed are ways of creating those larger datasets by bringing together the more granular linked data into lists and graphs; this is essentially what SPARQL does.&lt;/p&gt;

&lt;p&gt;We also know that there&amp;#8217;s a trade-off to be made between the power of URIs and the simplicity of using short, unqualified names, particularly when it comes to naming schema-level entities such as properties or classes. Most mashups that we see at the moment bring together just a few datasets, making it easy for developers to scan for naming clashes, or examine values to work out whether a particular property contains a link or not. This is the 80% of the use of data on the web that can be addressed by the 20% solution of the kind of JSON and plain old XML you see in most APIs.&lt;/p&gt;

&lt;p&gt;But publishing with RDF can be the basis of these kinds of simple APIs, and still address the hard 20% that we will encounter quickly as we mash more data together. Any data munger knows that the main challenge of making data available in an easily accessible way is cleaning, tidying, modelling and restructuring. If that&amp;#8217;s done into RDF then creating simple JSON, XML and even CSV is really easy. Creating middle-ware that will make the creation of these basic APIs really easy must be the top priority of this linked data effort.&lt;/p&gt;
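
&lt;p&gt;Here is a sketch of the kind of thing I mean by creating simple JSON from RDF, again using plain Python tuples for triples. The &lt;code&gt;local_name&lt;/code&gt; and &lt;code&gt;to_simple_json&lt;/code&gt; functions and the &lt;code&gt;_about&lt;/code&gt; key are my own assumptions rather than an established convention. Note that shortening property URIs to local names reintroduces the possibility of naming clashes, which is exactly the trade-off described above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import json


def local_name(uri):
    """Return the part of a URI after its last '#' or '/'."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def to_simple_json(triples, subject):
    """Serialise the triples about one subject as a flat JSON object.

    Property URIs are shortened to local names, so two properties with
    the same local name in different namespaces are merged under one key.
    """
    values = {}
    for s, p, o in sorted(triples):
        if s == subject:
            values.setdefault(local_name(p), []).append(o)
    # "_about" is a hypothetical key holding the URI of the thing described;
    # a property whose local name is "_about" would overwrite it.
    record = {"_about": subject}
    record.update(values)
    return json.dumps(record, sort_keys=True)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Every value is emitted as a list, even when there is only one, so that consumers see the same shape regardless of how many values a property has.&lt;/p&gt;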

&lt;h2&gt;Reality Check&lt;/h2&gt;

&lt;p&gt;So it&amp;#8217;s all good, right?&lt;/p&gt;

&lt;p&gt;No, of course it&amp;#8217;s not all good. Just as in the early days of the human web, we face huge challenges simply getting tooling to a level where it&amp;#8217;s easy (really easy) for government departments and local authorities to publish data as RDF and for the consumers of the data to use it. We have some patterns for publishing linked data, but, as in the early days of the human web, there&amp;#8217;s still a lot we don&amp;#8217;t know about the best way to make data usable by third parties.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s worth noting that the main challenges we face are ones that are common to all attempts to make data both open and reusable. How do we easily create structured and reusable data from presentation-oriented Excel or (worse) PDFs? How do we handle changes over time, and record the provenance of the information that we provide? How do we represent statistical hypercubes? Or location information? These are things that we will only learn by trying things out.&lt;/p&gt;

&lt;p&gt;In the end, though, the best evidence we have for how the web of linked data will progress is the evidence of how things were for the human web. It is hard to be an early adopter, both for social reasons and technological reasons. Nothing will happen overnight, but gradually there will be network effects: more shared URIs, more shared vocabularies, making it both easier to adopt and more beneficial for everyone.&lt;/p&gt;

&lt;p&gt;Is this a kind of faith? Maybe. I believe in the web.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/140#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <pubDate>Tue, 26 Jan 2010 13:10:58 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">140 at http://www.jenitennison.com/blog</guid>
</item>
</channel>
</rss>

