<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Jeni's Musings</title>
  <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog"/>
  <link rel="self" type="application/atom+xml" href="http://www.jenitennison.com/blog/atom/feed"/>
  <id>http://www.jenitennison.com/blog/atom/feed</id>
  <updated>2009-12-07T11:00:47+00:00</updated>
  <entry>
    <title>Versioning (UK Government) Linked Data</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/141" />
    <id>http://www.jenitennison.com/blog/node/141</id>
    <published>2010-02-27T22:15:40+00:00</published>
    <updated>2010-02-27T22:15:40+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="named graphs" />
    <category term="versioning" />
    <summary type="html"><![CDATA[<p>As you probably know, I&#8217;ve been working quite a lot recently on the UK government&#8217;s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I&#8217;ve <a href="http://www.jenitennison.com/blog/node/108">touched on this topic before</a> but things have progressed now to the stage where we have to make some real, practical, recommendations.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>As you probably know, I&#8217;ve been working quite a lot recently on the UK government&#8217;s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I&#8217;ve <a href="http://www.jenitennison.com/blog/node/108">touched on this topic before</a> but things have progressed now to the stage where we have to make some real, practical, recommendations.</p>

<p><em>Note: the contents of this post have been greatly informed through discussions with <a href="http://www.ldodds.com/blog/">Leigh Dodds</a>, <a href="http://twitter.com/skwlilac">Stuart Williams</a>, <a href="http://www.amberdown.net/">Dave Reynolds</a>, <a href="http://iandavis.com/">Ian Davis</a> and John Sheridan. Ian Davis&#8217; series on <a href="http://blog.iandavis.com/2009/08/time-in-rdf-1">representing time in RDF</a> is also well worth a look for a comparison of alternative approaches.</em></p>

<p>I&#8217;ve split this into two parts: versioned information resources (which are pretty easy) and versioned non-information resources (which are pretty hard). For both, we need to</p>

<ul>
<li>provide some guidance about what the RDF should look like</li>
<li>mint or adopt properties to support that model</li>
</ul>

<h2>Versioned Information Resources</h2>

<p>Easy things first. Some of the things that we talk about, such as legislation, are information resources (web documents), and these have different versions. The relevant level of precision for legislation is a day, but this will be different for different kinds of documents &#8212; some might change every second, for others an incrementally increasing version number might be more appropriate than a date. A generic pattern for the URIs, based on the <a href="http://writetoreply.org/ukgovurisets/">design of URI sets for the UK public sector report</a> would be:</p>

<pre><code>http://{sector}.data.gov.uk/doc/{concept}/{identifier}/{version}
</code></pre>

<p>For example, the OFSTED report for a particular school based on an inspection carried out in 2009 might be something like:</p>

<pre><code>http://education.data.gov.uk/doc/inspection-report/12345/2009
</code></pre>

<p>(There might be sub-versions too, if the inspection report itself goes through a revision process.) The RDF for this document should include links to the previous reports that it replaces, and dates that indicate when it was created and so on:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt;
  rdfs:label "2009 Inspection Report for Such-and-Such School"@en ;
  dct:created "2009-10-18"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; .
</code></pre>

<p>It&#8217;s also useful to have a URI for the unversioned document; this is the same as for the versioned document, but without the version:</p>

<pre><code>http://{sector}.data.gov.uk/doc/{concept}/{identifier}
</code></pre>

<p>This document acts as a hub for the various concrete versions of the document:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/inspection-report/12345&gt;
  rdfs:label "Inspection Report for Such-and-Such School"@en ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt; ,
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; ,
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt;
  rdfs:label "2009 Inspection Report for Such-and-Such School"@en ;
  dct:created "2009-10-18"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/inspection-report/12345&gt; .

&lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt;
  rdfs:label "2006 Inspection Report for Such-and-Such School"@en ;
  dct:created "2006-11-23"^^xsd:date ;
  dct:isReplacedBy &lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/inspection-report/12345&gt; .
</code></pre>

<p>It would be expected that people linking to the document would either point to a particular (dated) resource or to the unversioned (hub) document. For example, if someone were talking specifically about the 2006 OFSTED inspection, they would point to the 2006 inspection report; if they were referring to whatever inspection report is current, they&#8217;d use the unversioned URI.</p>

<blockquote>
  <p><em>Note: Although <code>dct:hasVersion</code> and <code>dct:isVersionOf</code> are sort-of OK here, having a property that points to the current (ie most recent) version of a resource would be very helpful.</em></p>
</blockquote>
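
<p>As a rough illustration of the URI pattern above, here is a minimal Python sketch that builds both the versioned and the unversioned (hub) document URIs. The function name <code>doc_uri</code> and its parameters are made up for this example; only the pattern itself comes from the post.</p>

```python
def doc_uri(sector, concept, identifier, version=None):
    """Build a document URI; leave out `version` to get the hub document."""
    parts = [f"http://{sector}.data.gov.uk/doc", concept, str(identifier)]
    if version is not None:
        parts.append(str(version))
    return "/".join(parts)


# The 2009 inspection report and its unversioned hub document
print(doc_uri("education", "inspection-report", 12345, 2009))
print(doc_uri("education", "inspection-report", 12345))
```

<p>Keeping the version as the final path segment means the hub URI can always be derived from a versioned URI by dropping that last segment.</p>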

<h2>Versioned Non-Information Resources</h2>

<p>The harder problem is how we handle changes to non-information resources over time. For example, how do we handle the fact that a school often changes head, sometimes changes name, regularly changes class sizes, rarely changes address and so on? How do we handle the fact that we have legacy statistics about local authorities as they were in 2008, prior to the 2009 reorganisation, and that it&#8217;s very likely that these kinds of changes will continue to take place regularly in the future?</p>

<p>Our requirements are:</p>

<ul>
<li>that the data is easily usable by people who only care about the current state of a resource</li>
<li>that the (current) data remains easily queryable at a SPARQL endpoint</li>
<li>that it&#8217;s <em>possible</em> (not necessarily easy) to query historic data</li>
<li>that historic data can be moderately easily retrieved and navigated</li>
<li>that it can represent historical states even when the precise time period is not known</li>
<li>that it can distinguish between a change in the concept and a change in our record of it (e.g. changing the name of a school, versus correcting a typo in the database entry for the school)</li>
<li>that it can trace what the nature or cause of the change was (e.g. redrawing of local authority boundaries)</li>
</ul>

<h3>Statistical Data</h3>

<p>To begin our discussion, let&#8217;s look at statistical data. Statistical data is data that&#8217;s usually numeric and for which we have values that are categorised along multiple dimensions as well as time. School census information is statistical data, for example, because each value is associated with not only the school and the date at which the census was taken but also the age (and gender, but to simplify I&#8217;ll pretend just age) of the children being counted. This gives us a set of observations which might each look like:</p>

<pre><code>&lt;/data/edubase/census/12345/age/11/2009&gt; 
  a sdmx:Observation ;
  sdmx:dataset &lt;/data/edubase&gt; ;
  dct:replaces &lt;/data/edubase/census/12345/age/11/2008&gt; ;
  rdf:value 85 ;
  edu:school &lt;/id/school/12345&gt; ;
  edu:schoolYear &lt;/id/school-year/2009&gt; ;
  sdmx:age 11 .
</code></pre>

<blockquote>
  <p>Note: This is indicative of the vocabulary we might use for statistics; don&#8217;t rely on it. If you&#8217;re interested in the progress we&#8217;re making on modelling statistical datasets using RDF, come and join <a href="http://groups.google.com/group/publishing-statistical-data">the publishing statistical data Google Group</a>.</p>
</blockquote>

<p>These statistical observations point to the interval that they apply to as a property, with the <code>rdf:value</code> property holding the actual value. The observation won&#8217;t change over time (unless it is corrected, which I will come back to), and <strong>observations from different times can all remain present within the graph without interacting badly with each other</strong>.</p>

<p>This is great because it means that we can make queries that give us time series views over the data. For example, we could define a series for children aged 11 at this particular school over time something like this:</p>

<pre><code>&lt;/data/edubase/census/12345/age/11&gt;
  a sdmx:TimeSeries ;
  edu:school &lt;/id/school/12345&gt; ;
  sdmx:age 11 ;
  sdmx:observation
    &lt;/data/edubase/census/12345/age/11/2009&gt; ,
    &lt;/data/edubase/census/12345/age/11/2008&gt; ,
    &lt;/data/edubase/census/12345/age/11/2007&gt; ,
    ... .
</code></pre>

<p>and associate this with the school through a specialised property:</p>

<pre><code>&lt;/id/school/12345&gt; edu:age11 &lt;/data/edubase/census/12345/age/11&gt; .
</code></pre>

<p>The fly in the ointment is that data represented purely in this way is really hard to query if all you&#8217;re actually interested in is the <em>current</em> value for the particular statistic. For example, say that you&#8217;ve just moved to an area and are trying desperately to find a school that might have room for your 11-year-old. Given that class sizes are capped at 30, you could look for schools that have a number of 11-year-olds that is not a multiple of 30. If you want to know how many 11-year-olds are <em>currently</em> in a school (according to the most recent measurement), you need a query like:</p>

<pre><code>SELECT ?age11
WHERE {
  &lt;/id/school/12345&gt; edu:age11 [
    sdmx:observation ?currentObservation ;
  ]
  OPTIONAL {
    ?futureObservation dct:replaces ?currentObservation .
  }
  FILTER ( !bound(?futureObservation) ) .
  ?currentObservation rdf:value ?age11 .
}
</code></pre>

<p>(it&#8217;s even more complicated if you don&#8217;t have the <code>dct:replaces</code> links!).</p>
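
<p>To make the logic of that query concrete, here is a small Python sketch of the same idea: the current observation is the one that no other observation replaces. The function name and the dictionary-based data structures are assumptions made for illustration; only the 2009 value of 85 comes from the example above.</p>

```python
def current_items(items, replaces):
    """Return the items that nothing else replaces.

    `replaces` maps a newer item to the older item it replaces,
    mirroring the pattern `?newer dct:replaces ?older`.
    """
    replaced = set(replaces.values())
    return [item for item in items if item not in replaced]


series = [
    "/data/edubase/census/12345/age/11/2009",
    "/data/edubase/census/12345/age/11/2008",
    "/data/edubase/census/12345/age/11/2007",
]
# 2009 replaces 2008, and 2008 replaces 2007
replaces = {series[0]: series[1], series[1]: series[2]}
values = {series[0]: 85}  # only the 2009 value appears in the post

current = current_items(series, replaces)
print([values[obs] for obs in current])
```

<p>This is the same &#8220;optional match, then filter on unbound&#8221; trick as in the SPARQL, just written as a set difference.</p>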

<p>How much simpler it would be for people if there was a property that just indicated the current state of the world:</p>

<pre><code>&lt;/id/school/12345&gt; edu:currentAge11 85 .
</code></pre>

<p>The same argument applies even more strongly for values that we would categorise as <strong>reference data</strong>, such as the name of a school. Although it would be possible to model all this information using the kind of n-ary relation approach we have to use for statistical observations, it would be both incredibly hard to query and incredibly verbose to do so. Even if n-ary relations are the &#8220;correct&#8221; way of modelling the changing data, they are impractical for querying.</p>

<p>And, as I hinted, we have to have some way of managing the possibility of statistics themselves being versioned (for example if an error is detected within the statistics). Using n-ary relations to provide the value of an observation gets very complicated very quickly.</p>

<p>So, we have made the decision to use named graphs.</p>

<h3>Named Graphs</h3>

<p>Named graphs can be used in two ways which are related but need to be thought about slightly differently.</p>

<p>First, we can use a named-graph approach to the <strong>publication</strong> of RDF. We can describe the same <em>thing</em> within multiple documents; each document can contain different (and contradictory) information, but also metadata about the document that indicates precisely when the information it contains is valid.</p>

<p>Second, we can use a named-graph approach to the <strong>representation</strong> of RDF within a triple- (or more accurately quad-) store. We can collect together statements that are made at the same time, from the same source, and with the same level of authority into a named graph. These graphs can then be loaded into the store, with the metadata about each graph made explicitly available so that relevant graphs can be selected and queried.</p>

<p>There are two things that are worth noting about this:</p>

<ol>
<li>Publishing named graphs is relevant however RDF is published. For example, in some linked data publication set-ups, RDF/XML or RDFa might be generated on demand based on an underlying database of some description. In this case, the named graphs for representing data aren&#8217;t relevant (the database will presumably capture some provenance and validity information itself that can be exposed within the RDF).</li>
<li>In the case where linked data is published natively (ie stored in a triplestore and exposed as linked data through an API), the two uses of named graphs don&#8217;t precisely align with each other. The named graphs that we create when we convert or load data within a triplestore are not (necessarily) the same as the named graphs that we expose when we publish data. What&#8217;s important here is
<ul><li>that the named graphs that we have within the triplestore can feasibly be used (by a publication framework such as the <a href="http://purl.org/linked-data/api/spec">linked data API</a> we&#8217;re working on) to create the publication-based named graphs</li>
<li>that the SPARQL endpoint offered by the triplestore has a default graph which reflects the current state of affairs</li></ul></li>
</ol>

<p>Let&#8217;s look at these two uses of named graphs in more detail.</p>

<h3>Publication of Named Graphs</h3>

<p>Our intention is to publish different information about the same resource within different documents (aka named graphs). This approach hooks into the approach for versioning information resources outlined above. A resource is described in a document, and many documents may describe the same resource.</p>

<p>For example, if a school changes its name from &#8220;Broadmoor Primary School&#8221; to &#8220;Wildmoor Heath School&#8221; on 1st September 2009, then after 1st September 2009, requesting information about the school at <code>http://education.data.gov.uk/id/school/12345</code> would result in a <code>303 See Other</code> redirection to <code>http://education.data.gov.uk/doc/school/12345</code> which would contain information about the school that is currently relevant:</p>

<pre><code># Information about the school that is currently relevant
&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Wildmoor Heath School"@en ;
  foaf:isPrimaryTopicOf 
    &lt;http://education.data.gov.uk/doc/school/12345&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .
</code></pre>

<p>as well as metadata about the document that&#8217;s been returned and the &#8220;hub&#8221; document that lists the alternative versions:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/school/12345&gt;
  rdfs:label "Information about School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt;
  rdfs:label "Information about Wildmoor Heath School from 1st Sept 2009"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:created "2009-09-01"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345&gt; .
</code></pre>

<p>A request to the replaced document <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> would result in the information that was valid about the school on the 1st September 2001:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en ;
  foaf:isPrimaryTopicOf 
    &lt;http://education.data.gov.uk/doc/school/12345&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .
</code></pre>

<p>and, again, metadata about the document that&#8217;s been returned and the &#8220;hub&#8221; document that lists the alternative versions:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/school/12345&gt;
  rdfs:label "Information about School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt;
  rdfs:label "Information about Broadmoor Primary School (2001-2008)"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:created "2001-09-01"^^xsd:date ;
  dct:isReplacedBy &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345&gt; .
</code></pre>

<p>The statements about <code>http://education.data.gov.uk/id/school/12345</code> in this second document are inconsistent with the statements retrieved from <code>http://education.data.gov.uk/doc/school/12345</code> but because they are published within different documents, they should be considered (by anyone retrieving this data) to be different graphs and therefore are allowed to provide different views of the world.</p>

<p>The statements about the named graphs <code>http://education.data.gov.uk/doc/school/12345/2009-09-01</code> and <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> can include information about the interval during which the content of the document is valid. (We haven&#8217;t worked out exactly how to indicate this yet; <code>dct:valid</code> is no good; see later.)</p>

<h4>Associated Resources</h4>

<p>This story seems fine until you start to look at linked resources. For example, schools may link out to separate resources, particularly when different aspects of a school are likely to change at different rates or come from different sources. A school is unlikely to change its name in the middle of a school year, but may well change some of its staff, and the number of pupils it has, during a year. It&#8217;s likely that these separate sets of information will be represented as different resources.</p>

<p>The document published about the school for a particular date will not necessarily include all the details of the linked resource at that point in time. This can make it hard to navigate to the particular version of the linked resource. For example, if a client wants to look at the information about a school at 1st September 2001, they would locate the graph at <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code>. This might contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en ;
  edu:staffing &lt;http://education.data.gov.uk/id/school/12345/staff&gt; .
</code></pre>

<p>A request to <code>http://education.data.gov.uk/id/school/12345/staff</code> will result in a <code>303 See Other</code> redirection to <code>http://education.data.gov.uk/doc/school/12345/staff</code>. This contains <em>current</em> information about the staffing, and will include:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345/staff&gt;
  rdfs:label "Staffing of Wildmoor Heath School"@en ;
  edu:school &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  edu:head ... ;
  edu:deputy ... ;
  ... ;
  foaf:isPrimaryTopicOf
    &lt;http://education.data.gov.uk/doc/school/12345/staff&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/staff&gt;
  rdfs:label "Information about Staffing at School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345/staff&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt;
  rdfs:label "Staffing of Wildmoor Heath School in Autumn Term, 2009"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345/staff&gt; ;
  dct:created "2009-09-01"^^xsd:date ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345/staff&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; .
</code></pre>

<p>The client then has to work out which of the possible versions of the graph about <code>http://education.data.gov.uk/id/school/12345/staff</code> it should look at to navigate back to the information that&#8217;s relevant at 1st September 2001.</p>

<p>There are two techniques that we might use to help address this. One is for the information that&#8217;s retrieved at <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> to include some basic information about the linked resource that includes <code>foaf:isPrimaryTopicOf</code> links directly to the relevant versioned document about the linked resource. For example, that document should contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345/staff&gt;
  rdfs:label "Staffing of Broadmoor Primary School"@en ;
  foaf:isPrimaryTopicOf &lt;http://education.data.gov.uk/doc/school/12345/staff/2001-09-01&gt; .
</code></pre>

<p>These links will have to be generated by the publication framework since they are calculated based on the date of the requested resource.</p>
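
<p>A publication framework could calculate such a link roughly as follows. This is only a sketch in Python: the function names are hypothetical, and it assumes that each versioned document is identified by the date from which it applies and stays applicable until the next version starts.</p>

```python
from bisect import bisect_right
from datetime import date


def version_at(version_dates, when):
    """Return the latest version date on or before `when`, or None."""
    ordered = sorted(version_dates)
    index = bisect_right(ordered, when)
    return ordered[index - 1] if index else None


def versioned_doc(hub_uri, version_dates, when):
    """Return the URI of the versioned document that applies at `when`."""
    chosen = version_at(version_dates, when)
    return None if chosen is None else f"{hub_uri}/{chosen.isoformat()}"


staff_hub = "http://education.data.gov.uk/doc/school/12345/staff"
staff_versions = [
    date(2009, 9, 1),
    date(2009, 4, 23),
    date(2009, 1, 1),
    date(2001, 9, 1),
]
print(versioned_doc(staff_hub, staff_versions, date(2001, 9, 1)))
print(versioned_doc(staff_hub, staff_versions, date(2009, 6, 1)))
```

<p>If the requested date is earlier than every known version, the sketch returns <code>None</code> rather than guessing.</p>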

<p>The other technique is to use HTTP headers to request the applicable date, as suggested by the <a href="http://www.mementoweb.org/">Memento Experiment</a>. Even with this technique, it&#8217;s still useful to have distinct URIs for the individual documents so that they can be pointed to and talked about.</p>
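
<p>For a feel of what the header-based approach looks like from the client side, here is a Python sketch that builds (but does not send) such a request. It uses the <code>Accept-Datetime</code> header name from RFC 7089, the later standardisation of Memento; whether any data.gov.uk server would honour that header is purely an assumption, and <code>request_as_of</code> is a made-up helper name.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def request_as_of(uri, when):
    """Build (but do not send) a request for the state of `uri` at `when`."""
    # HTTP dates are expressed in GMT, so normalise to UTC first
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(uri, headers={"Accept-Datetime": stamp})


req = request_as_of(
    "http://education.data.gov.uk/id/school/12345/staff",
    datetime(2001, 9, 1, tzinfo=timezone.utc),
)
# urllib stores header names capitalised as "Accept-datetime"
print(req.get_header("Accept-datetime"))
```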

<h3>Representation of Data in Named Graphs</h3>

<p>Let&#8217;s turn to looking at the use of named graphs within a triplestore. In the government case, we&#8217;re expecting that information about schools going into a single triplestore is likely to come from multiple sources. Each source may release information at different intervals, with different temporal validity. The data from a single source will over-ride other information from that source over time, but equally data from different sources will be overlapping and contradictory.</p>

<p>To manage this, we split up triples into named graphs based on:</p>

<ul>
<li>their source</li>
<li>their temporal validity (and their temporal relationship with other graphs)</li>
<li>their authoritativeness</li>
</ul>

<p>This metadata about the named graph is recorded within the named graph itself, using <code>voiD</code> and other vocabularies.</p>

<p>In more detail:</p>

<h4>Named Graphs over Time</h4>

<p>Named graphs are expected to occur within a series over time. The triples within one graph will be completely replaced by the triples within another graph. The most recent graph is one that has not yet been replaced. To record this, the graphs should have associated with them:</p>

<ul>
<li>the dates when the data in the graph is valid (only the start date is really required)</li>
<li>the graph(s) that the graph replaces</li>
<li>the graph(s) that the graph is replaced by</li>
<li>the date when the data in the graph was created</li>
</ul>

<p>To avoid repetition of data within multiple graphs, graphs should be split up at the level that updates are likely to occur within the source of the data. For example, Edubase holds a database of schools. If the linked data for schools is generated based on dumps of the entire Edubase database, then there would be a separate named graph for each dump of the database. If the linked data is created more dynamically, based on updates at the level of an individual school, say, then there should be a separate series of named graphs for each school. If the updates can occur at an even finer level of granularity (eg at each record within each table within the database), then there can be separate named graphs at that level.</p>

<h4>Named Graphs from Different Sources</h4>

<p>Information about the same resources will come from different sources, and have gone through different levels of processing to become linked data. To allow us to provide information about the provenance of different triples, separate named graphs should be used for data from different sources. The metadata about a graph should include:</p>

<ul>
<li>the source of the data (through <code>dct:source</code>)</li>
<li>the provenance of the data (through something more complex, yet to be finalised)</li>
</ul>

<p>Much of the information about a particular resource will only come from one source. For example, Edubase contains the pupil census for a school while Ofsted provides inspection reports. However, there will be overlaps between the information available from different sources, such as the name and address of the school.</p>

<p>For any given property of a resource (such as the name of the school), there should be one source that is the authoritative source of that information; other sources are considered supplementary. Each source should therefore usually provide two series of named graphs: one of information for which they are considered the authority, and one of information for which they are not. The metadata about the graph should include a property that indicates whether the information it contains is authoritative or not.</p>

<h4>Constructing a Graph for a Given Date</h4>

<p>It&#8217;s extremely useful to be able to create snapshots that contain information that&#8217;s current at a particular point in time. The most useful of these is the <em>current</em> graph, which is the one that should be exposed as the default graph in the SPARQL endpoint offered by the triplestore.</p>

<p>The graph can be created by combining:</p>

<ol>
<li>all the triples from authoritative graphs that are valid at that point in time (eg have a validity date before that point in time, and that are not replaced by a graph whose validity date is also before that point in time)</li>
<li>those triples from supplementary graphs for which there is no existing triple in the graph with the same subject and property</li>
</ol>

<p>For example, there may be information available about a school from Edubase and from OFSTED, as follows (in TriG syntax):</p>

<pre><code># graph containing data from Edubase from 2008-09-01
&lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Broadmoor Primary School"@en ;
    edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; .

  &lt;http://education.data.gov.uk/id/school/12345/census&gt;
    ... .

  &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt;
    a void:Dataset ;
    dct:created "2008-09-01"^^xsd:date ;
    dct:replaces &lt;http://education.data.gov.uk/data/edubase/12345/2007-09-01/authoritative&gt; ;
    dct:isReplacedBy &lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&gt; ;
    dct:source &lt;http://www.edubase.gov.uk/&gt; ;
    :authoritative true .

  &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01&gt;
    a void:Dataset ;
    void:subset &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt; ;
    ... .
}

# graph containing data from Edubase from 2009-09-01; the name of the school 
# has changed (as have the details of the census)
&lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Wildmoor Heath School"@en ;
    edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; .

  &lt;http://education.data.gov.uk/id/school/12345/census&gt;
    ... .

  ... metadata about this graph ...
}

# graph containing authoritative data from Ofsted from 2008-03-01
# note that this doesn't include the name of the school
&lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    edu:inspection &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt; .

  &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt;
    ... .

  ... metadata about this graph ...
}

# graph containing supplementary data from Ofsted from 2008-03-01
# this includes the name of the school (at the time of the inspection)
&lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Broadmoor Primary School"@en .

  ... metadata about this graph ...
}
</code></pre>

<p>Note that metadata about each graph is embedded in the graph itself.</p>

<p>In the example above, a graph for 2010-01-01 would contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Wildmoor Heath School"@en ;
  edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; ;
  edu:inspection &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt; .

&lt;http://education.data.gov.uk/id/school/12345/census&gt;
  ... .

&lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt;
  ... .
</code></pre>

<p>It would not contain the triple:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en .
</code></pre>

<p>because this triple is present only in <code>http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative</code>, which has been replaced by <code>http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative</code>, and in <code>http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary</code>, which is a supplementary graph and so can&#8217;t override the label provided by the authoritative graph.</p>
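
<p>The two combination steps can be sketched in Python using the example graphs above. Everything about the data model here is an assumption for illustration: the <code>NamedGraph</code> class, triples as plain tuples of strings, and reading &#8220;before that point in time&#8221; as &#8220;on or before&#8221;. The sketch also checks supplementary triples only against the authoritative ones, so two supplementary graphs that disagree would both contribute.</p>

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class NamedGraph:
    name: str
    valid_from: date
    authoritative: bool
    triples: set = field(default_factory=set)  # (subject, property, object)
    replaced_by: Optional[str] = None  # name of the replacing graph, if any


def is_valid_at(graph, graphs_by_name, when):
    """True if the graph has started and is not yet replaced at `when`."""
    if graph.valid_from > when:
        return False
    replacement = graphs_by_name.get(graph.replaced_by)
    return replacement is None or replacement.valid_from > when


def snapshot(graphs, when):
    """Take all valid authoritative triples, then fill gaps from supplementary ones."""
    by_name = {g.name: g for g in graphs}
    valid = [g for g in graphs if is_valid_at(g, by_name, when)]
    result = set()
    for g in valid:
        if g.authoritative:
            result |= g.triples
    # subject/property pairs already answered by an authoritative graph
    covered = {(s, p) for (s, p, _) in result}
    for g in valid:
        if not g.authoritative:
            result |= {t for t in g.triples if (t[0], t[1]) not in covered}
    return result


SCHOOL = "http://education.data.gov.uk/id/school/12345"
BASE = "http://education.data.gov.uk/data/"
INSPECTION = "http://education.data.gov.uk/doc/school/12345/inspection/2008"
graphs = [
    NamedGraph(BASE + "edubase/12345/2008-09-01/authoritative", date(2008, 9, 1), True,
               {(SCHOOL, "rdfs:label", "Broadmoor Primary School")},
               replaced_by=BASE + "edubase/12345/2009-09-01/authoritative"),
    NamedGraph(BASE + "edubase/12345/2009-09-01/authoritative", date(2009, 9, 1), True,
               {(SCHOOL, "rdfs:label", "Wildmoor Heath School")}),
    NamedGraph(BASE + "ofsted/12345/2008-03-01/authoritative", date(2008, 3, 1), True,
               {(SCHOOL, "edu:inspection", INSPECTION)}),
    NamedGraph(BASE + "ofsted/12345/2008-03-01/supplementary", date(2008, 3, 1), False,
               {(SCHOOL, "rdfs:label", "Broadmoor Primary School")}),
]

labels = sorted(o for (s, p, o) in snapshot(graphs, date(2010, 1, 1)) if p == "rdfs:label")
print(labels)
```

<p>Running the same function for a date in early 2009 gives the old name instead, because the 2008 Edubase graph has not yet been replaced at that point.</p>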

<h2>Unanswered Questions</h2>

<p>There are three gaps within this that need plugging.</p>

<p>First, how should we represent the interval during which a graph is valid? As I&#8217;ve indicated above, <code>dct:valid</code> doesn&#8217;t cut it because it can&#8217;t represent an interval very well (there is a <a href="http://dublincore.org/documents/dcmi-period/">Dublin Core recommended format for representing periods</a> but it&#8217;s not going to be easy for people to process). We have work ongoing on defining intervals (by Stuart Williams) and will probably have to mint our own property to indicate the temporal validity of a named graph, given that <code>dct:valid</code> takes a literal rather than a resource.</p>

<p>Second, how should we indicate whether a graph is authoritative or not? Should this be a simple boolean switch (which will make the logic for combining graphs easier, and probably be easiest to assess) or a kind of confidence level, which might allow for missing data better?</p>

<p>Third, how should we represent the events that cause the replacement of one named graph with another? I think that we should be able to use the provenance vocabulary that we end up using to represent these changes, so that it&#8217;s possible to indicate whether the new information is the correction of a clerical error or an actual change to the real world thing.</p>

<p>And, we have to try this out. While it looks as if it might work, I won&#8217;t be confident until we&#8217;ve tried it out with some real data and some real queries. I&#8217;m also concerned that while keeping data in separate, annotated, named graphs seems like our best chance of managing versions and tracking provenance, it adds a hurdle to the generation of linked data that might be too high, particularly for people who are just starting out.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Why Linked Data for data.gov.uk?</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/140" />
    <id>http://www.jenitennison.com/blog/node/140</id>
    <published>2010-01-26T13:10:58+00:00</published>
    <updated>2010-01-26T13:59:00+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="datagovuk" />
    <category term="linked data" />
    <summary type="html"><![CDATA[<p><a href="http://data.gov.uk/">data.gov.uk</a> was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&#8217;s been in for the last few months). It&#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&#8217;s been made available and the website itself, which (<a href="http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html">unlike a lot of UK government IT</a>) was developed rapidly by a small team based on open source software (and at low cost).</p>

<p>This is a first step on a long road.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p><a href="http://data.gov.uk/">data.gov.uk</a> was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&#8217;s been in for the last few months). It&#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&#8217;s been made available and the website itself, which (<a href="http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html">unlike a lot of UK government IT</a>) was developed rapidly by a small team based on open source software (and at low cost).</p>

<p>This is a first step on a long road.</p>

<!--break-->

<p>One of the features of the UK Government&#8217;s approach to freeing data is the emphasis on using <a href="http://www.data.gov.uk/wiki/Linked_Data">linked data</a>. What I don&#8217;t think has really been articulated is either what that means or why we should take this approach. From what I&#8217;ve seen, developers seem to think:</p>

<ul>
<li>linked data is a synonym for turning everything into RDF and putting it in one big triplestore, equivalent to making one big database of government data and therefore prone to exactly the same, well-known and understood problems that government has with creating huge databases</li>
<li>linked data requires everyone to agree to the same model and vocabulary, which means huge efforts in standardisation and ends up with something that suits no one</li>
<li>the UK government will be releasing all their data as linked data immediately, and in no other way</li>
<li>the UK government has been seduced into using linked data by academics who don&#8217;t understand anything about how the web or the real world works</li>
<li>the UK government has been seduced into using linked data by big businesses who stand to make a pretty penny providing services to departments that are forced to publish their data in this way</li>
</ul>

<p>None of these are true. In fact, the UK government is committed to publishing data as linked data because they are convinced it is the <strong>best approach available for publishing data in a hugely diverse and distributed environment, in a gradual and sustainable way</strong>.</p>

<p>Why?</p>

<p>Because linked data is just a term for how to publish data on the web while working <em>with</em> the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.</p>

<p>If you&#8217;re a web developer, you already know that the best APIs are <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">RESTful APIs</a>. That argument has been won. It means:</p>

<ul>
<li>using (HTTP) URIs to identify resources: naming <em>things</em> with URIs rather than actions on those things (which are carried out using the standard set of HTTP verbs)</li>
<li>recognising the distinction between resources and representations of those resources: the same URI might return a different representation of the resource, such as HTML or XML or JSON</li>
<li>returning self-descriptive messages: being able to process representations in a manner that is obvious from the mime type</li>
<li>hypermedia as the engine of application state: being able to locate additional resources through the use of (typed) links</li>
</ul>

<p>Linked data is about following these rules for publishing data. It is about using URIs to identify things, providing information at the end of those URIs that is self-descriptive, and linking those things to other things through typed links.</p>
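<p>To make the resource/representation distinction concrete, here is a small Python sketch that builds (but does not send) two requests for the same URI, each asking for a different representation through the <code>Accept</code> header. The <code>request_for</code> helper is hypothetical, and whether any particular server honours these media types is an assumption.</p>

```python
from urllib.request import Request

# A school URI following the data.gov.uk pattern; used here only as an example.
SCHOOL_URI = "http://education.data.gov.uk/id/school/12345"


def request_for(uri, media_type):
    """Build (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Same resource, two different representations requested.
html_request = request_for(SCHOOL_URI, "text/html")
rdf_request = request_for(SCHOOL_URI, "application/rdf+xml")

for req in (html_request, rdf_request):
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

<p>Passing either request to <code>urllib.request.urlopen</code> would actually send it; the point is only that the URI names the thing, and the headers say which representation of it is wanted.</p>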

<p>One of the features of this approach is that it doesn&#8217;t require any big bangs. No one planned the web: sat down and mapped out each page and its precise relations to every other page, in advance. It grew, and evolved, and continues to grow and evolve every day. It grows through individuals and institutions publishing information for their own reasons and linking to other people who have published information for their own reasons, and, because we have some fundamental standards that clients and servers understand, it All Just Works.</p>

<h2>Standards</h2>

<p>Did you notice how I slipped in the &#8220;because we have some fundamental standards that clients and servers understand&#8221;? One standard is obviously HTTP, which controls how clients and servers can talk to each other: it allows clients to request pages and servers to respond. Another standard is HTML, which enables browsers to display information in ways that people can understand, and (crucially) has a known set of semantics that browsers can use to tell when something is a link, which people can navigate to find more information.</p>

<p>For linked data, there are two crucial standards: RDF and SPARQL. Yes, I know what you&#8217;re thinking, because believe me two years ago that would have been my reaction too, but let me explain why.</p>

<p>There&#8217;s one way in which publishing data isn&#8217;t like publishing documents: its model. Documents are made up of paragraphs and headings and lists and tables and so on. Data is made up of&#8230; what? Well, at its most basic, it&#8217;s <em>things</em> that have <em>properties</em> which have <em>values</em>. We might call the things <em>objects</em> or <em>entities</em>, and call some of the properties <em>relations</em>. We might even call them <em>records</em> with <em>columns</em> and <em>values</em> and <em>foreign keys</em>. But however you term them, for better or worse, we do tend to think about data in this way: <em>thing</em>, <em>property</em>, <em>value</em>.</p>

<p>So if we are going to publish data on the web, we need a standard way of expressing the data so that a client receiving the data can work out what&#8217;s a <em>thing</em>, what&#8217;s a <em>property</em>, what&#8217;s a <em>value</em>. <strong>And, because this is the web, what&#8217;s a <em>link</em></strong>. This is the fundamental standard we need, and this is what RDF gives.</p>

<p>RDF is actually a model rather than a syntax. It&#8217;s a bit like the split between the DOM and HTML or XHTML. The DOM tells the browser how to render the page: the HTML or XHTML is just a syntax which the browser is able to convert into a DOM that it displays. We could imagine browsers converting wiki syntax into a DOM. Or creating a DOM based on XML and XSLT, which of course they all do.</p>

<p>So, RDF is like the DOM, with varying representations of RDF (XML-based, text-based, JSON-based, even HTML-based) that can be used to pass to the client the underlying model of <em>things</em> and <em>properties</em> and <em>values</em> (some of which are <em>links</em>). What the client does then is its business: clients that retrieve data aren&#8217;t browsers &#8212; they&#8217;re not all going to display the data, use the same parts of the data, or otherwise process it in the same way &#8212; but they can pull out the <em>things</em>, <em>properties</em> and <em>values</em>, and know which are <em>links</em>, and this data structure will often, with a good RDF library, map on to some natural structure within whatever programming language is being used, and make the programmer&#8217;s job easier.</p>

<h2>Vocabularies</h2>

<p>What we don&#8217;t want to have to define are standard ways of expressing <em>particular</em> data (such as data about a school) because different individuals and organisations will have completely different ways of thinking about a particular thing. A school itself will have information about uniform and open days; <a href="http://www.ofsted.gov.uk/">OFSTED</a> about performance; <a href="http://www.edubase.gov.uk/">Edubase</a> about administration and pupil numbers; the PTA about after-school activities. Expecting everyone to adopt a particular standard vocabulary for describing a school is as futile as expecting everyone to adopt exactly the same page layout within their web pages, and exactly the same class names in their CSS.</p>

<p>But we don&#8217;t want to rule out opportunistic alignments where individuals or organisations, for whatever reason, <em>do</em> want to use the same vocabularies. Look at what&#8217;s happened with classes in HTML. There is absolutely no constraint on what classes people use in their HTML. But there are clusters of web pages that use some of the same classes. Websites that use <a href="http://microformats.org/">microformats</a>. Websites that adopt a particular <a href="http://en.wikipedia.org/wiki/CSS_framework">CSS framework</a>. Importantly, though, even where some classes are shared, it doesn&#8217;t mean that <em>all</em> classes are shared: adoption of a particular microformat or CSS framework doesn&#8217;t limit the rest of the page.</p>

<p>RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we <em>really</em> want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.</p>

<p>The other thing about RDF that really helps here is that it&#8217;s easy to align vocabularies if you want to, post-hoc. <a href="http://www.w3.org/TR/rdf-schema/">RDFS</a> and <a href="http://www.w3.org/TR/owl-overview/">OWL</a> define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.</p>

<p>So the linked data web is designed to grow and evolve in exactly the same way as the human web has grown and evolved. It grows through people adding links to existing data. It grows through people creating their own vocabularies. And it evolves as links break and reform, and vocabularies combine and diverge. It is complex and messy and self-organising.</p>

<h2>Layers</h2>

<p>The cornerstone of the great, messy, web is the URI. URIs have two important roles:</p>

<ul>
<li><p><strong>they identify things</strong> - If two sets of data use the same URI then it&#8217;s dead easy to work out when they are talking about the same thing, for example to bring together the information published by a school with its OFSTED report and its pupil census. Spread this around to five, ten, twenty datasets from different places all using the same identifier for the school, and you have a huge pool of information. And the great thing about RDF (because it also uses URIs to identify properties) is that those datasets can be combined automatically without worrying about clashes, rather than through painstaking developer effort.</p></li>
<li><p><strong>they provide somewhere to look for information</strong> - This is the point of using HTTP URIs, because that look-up is as simple as retrieving a document from the web. This enables programmatic, on-demand, access to the information. Developers don&#8217;t have to download huge database dumps when all they are interested in is a small fraction of that data.</p></li>
</ul>
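<p>As a rough illustration of the first point, here is a Python sketch in which three datasets, modelled as plain sets of (thing, property, value) triples, are combined with a set union. The property URIs and the values are made up for the example; only the school URI follows the pattern used elsewhere on this blog.</p>

```python
SCHOOL = "http://education.data.gov.uk/id/school/12345"

# Each dataset is a set of (thing, property, value) triples.
# The property URIs and values below are invented for this example.
school_site = {
    (SCHOOL, "http://example.org/def/school/uniformColour", "green"),
}
ofsted = {
    (SCHOOL, "http://example.org/def/ofsted/rating", "good"),
}
census = {
    (SCHOOL, "http://example.org/def/census/pupils", 210),
}

# Because things and properties are both named with URIs, combining the
# datasets is a plain set union: nothing needs renaming to avoid clashes.
merged = school_site | ofsted | census


def describe(triples, thing):
    """Collect everything said about one thing into a property-to-values map."""
    description = {}
    for subject, prop, value in triples:
        if subject == thing:
            description.setdefault(prop, []).append(value)
    return description


about_school = describe(merged, SCHOOL)
for prop in sorted(about_school):
    print(prop, about_school[prop])
```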

<p>But we know that of course sometimes developers <em>do</em> want to download huge database dumps. So we need URIs for those dumps, and ways to associate metadata with them, and ways to search them. Adopting linked data doesn&#8217;t preclude providing sets of data in larger lumps. In fact, what&#8217;s needed are ways of creating those larger datasets by bringing together the more granular linked data into lists and graphs; this is essentially what SPARQL does.</p>

<p>We also know that there&#8217;s a trade-off to be made between the power of URIs and the simplicity of using short, unqualified names, particularly when it comes to naming schema-level entities such as properties or classes. Most mashups that we see at the moment bring together just a few datasets, making it easy for developers to scan for naming clashes, or examine values to work out whether a particular property contains a link or not. This is the 80% of the use of data on the web that can be addressed by the 20% solution of the kind of JSON and plain old XML you see in most APIs.</p>

<p>But publishing with RDF can be the basis of these kinds of simple APIs, and still address the hard 20% that we will encounter quickly as we mash more data together. Any data munger knows that the main challenge of making data available in an easily accessible way is cleaning, tidying, modelling and restructuring. If that&#8217;s done into RDF then creating simple JSON, XML and even CSV is really easy. Creating middle-ware that will make the creation of these basic APIs really easy must be the top priority of this linked data effort.</p>
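<p>Here is a hedged sketch of what that last step could look like once the data is already in RDF: flattening the triples about one thing into the kind of simple JSON that most APIs return. The <code>short_name</code> and <code>to_simple_json</code> helpers are hypothetical, as is the inspection property URI; a real implementation would also have to decide what to do when two properties share a short name.</p>

```python
import json

SCHOOL = "http://education.data.gov.uk/id/school/12345"

# Triples as (thing, property, value). The inspection property URI is
# illustrative; the label property is the standard rdfs:label.
triples = [
    (SCHOOL, "http://www.w3.org/2000/01/rdf-schema#label",
     "Wildmoor Heath School"),
    (SCHOOL, "http://example.org/def/school/inspection",
     "http://education.data.gov.uk/doc/school/12345/inspection/2008"),
]


def short_name(uri):
    """Use the last path or fragment segment of a property URI as a JSON key."""
    return uri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def to_simple_json(triples, thing):
    """Flatten the triples about `thing` into a JSON object with short keys."""
    record = {"uri": thing}
    for subject, prop, value in triples:
        if subject == thing:
            record[short_name(prop)] = value
    return json.dumps(record, indent=2, sort_keys=True)


print(to_simple_json(triples, SCHOOL))
```

<p>The short keys are exactly the trade-off described above: easier to use, but only safe while the number of datasets being mixed is small enough to check for clashes by eye.</p>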

<h2>Reality Check</h2>

<p>So it&#8217;s all good, right?</p>

<p>No, of course it&#8217;s not all good. Just as in the early days of the human web, we face huge challenges simply getting tooling to a level where it&#8217;s easy (really easy) for government departments and local authorities to publish data as RDF and for the consumers of the data to use it. We have some patterns for publishing linked data, but, as in the early days of the human web, there&#8217;s still a lot we don&#8217;t know about the best way to make data usable by third parties.</p>

<p>It&#8217;s worth noting that the main challenges we face are ones that are common to all attempts to make data both open and reusable. How do we easily create structured and reusable data from presentation-oriented Excel or (worse) PDFs? How do we handle changes over time, and record the provenance of the information that we provide? How do we represent statistical hypercubes? Or location information? These are things that we will only learn by trying things out.</p>

<p>In the end, though, the best evidence we have for how the web of linked data will progress is the evidence of how things were for the human web. It is hard to be an early adopter, both for social reasons and technological reasons. Nothing will happen overnight, but gradually there will be network effects: more shared URIs, more shared vocabularies, making it both easier to adopt and more beneficial for everyone.</p>

<p>Is this a kind of faith? Maybe. I believe in the web.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part V: Finishing Touches</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/139" />
    <id>http://www.jenitennison.com/blog/node/139</id>
    <published>2009-12-05T08:50:28+00:00</published>
    <updated>2009-12-05T08:50:28+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>This is the fifth part in this series about creating linked data. I&#8217;ve talked previously about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a>, <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a> and <a href="http://www.jenitennison.com/blog/node/138">defining a vocabulary</a>. In this instalment I&#8217;ll talk about the finishing touches that can make linked data easier to browse, query, locate and trust.</p>

<p>Note that we don&#8217;t <em>have</em> to do any of these things; they&#8217;re not part of the core data. We shouldn&#8217;t beat ourselves up if we don&#8217;t have time to do them right now, because we can always add them later, and it might be that you just don&#8217;t agree that they should be done. But many of them don&#8217;t take a lot of time and can enhance the user&#8217;s experience of the data.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the fifth part in this series about creating linked data. I&#8217;ve talked previously about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a>, <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a> and <a href="http://www.jenitennison.com/blog/node/138">defining a vocabulary</a>. In this instalment I&#8217;ll talk about the finishing touches that can make linked data easier to browse, query, locate and trust.</p>

<p>Note that we don&#8217;t <em>have</em> to do any of these things; they&#8217;re not part of the core data. We shouldn&#8217;t beat ourselves up if we don&#8217;t have time to do them right now, because we can always add them later, and it might be that you just don&#8217;t agree that they should be done. But many of them don&#8217;t take a lot of time and can enhance the user&#8217;s experience of the data.</p>

<!--break-->

<h2>Labels</h2>

<p>Every resource should have a label, even blank nodes. Adding labels makes it easier for people to generate HTML views from the data. Sometimes we have resources that have an obvious label (like the name of a local authority); at other times, the label needs to be constructed based on the other information that&#8217;s available about the resource.</p>

<p>I talked in the last instalment about <code>skos:prefLabel</code> (preferred label), <code>skos:altLabel</code> (alternative label) and <code>rdfs:label</code>. Technically, <code>skos:prefLabel</code> and <code>skos:altLabel</code> are sub properties of <code>rdfs:label</code>, which means that if a resource has a <code>skos:prefLabel</code> it also has a <code>rdfs:label</code> with that value. However, drawing that conclusion requires either built-in knowledge of SKOS or the ability to both automatically get hold of the SKOS ontology and reason with it, which is feasible (this is one of the advantages of RDF, after all), but adds an extra hurdle for people wanting to use your data.</p>

<p>So it&#8217;s best to give everything a <code>rdfs:label</code>, even if they already have a <code>skos:prefLabel</code> or <code>skos:altLabel</code>. It&#8217;s also good to try to imagine that label in the context of having no other information about the thing that it&#8217;s labelling, such as in the title of a page. For example, if you&#8217;re looking at the observation <code>http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle</code> in the context of a traffic count, it may seem sensible to label it just &#8220;bicycle&#8221; (as I did in the first iteration of turning this traffic count data into RDF). But without that context, it makes no sense. Better to label it &#8220;Bicycles - 8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH&#8221; and provide an even more descriptive <code>rdfs:comment</code> like &#8220;Number of bicycles counted travelling East at Salterton Road, EAST OF DINAN WAY, EXMOUTH on 8 October 2001 between 17:00 and 18:00.&#8221;.</p>
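<p>Constructing a context-free label like this is mostly string formatting. A minimal Python sketch, assuming the vehicle type, hour, direction and location have already been pulled out of the data (the <code>observation_label</code> function and its parameters are hypothetical):</p>

```python
from datetime import datetime, timedelta

# Month abbreviations are spelled out so the result doesn't depend on locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def observation_label(vehicle_plural, hour_start, direction, location):
    """Build a label that still makes sense with no surrounding context."""
    hour_end = hour_start + timedelta(hours=1)
    day = f"{hour_start.day} {MONTHS[hour_start.month - 1]} {hour_start.year}"
    period = f"{hour_start:%H:%M}-{hour_end:%H:%M}"
    return f"{vehicle_plural} - {day} {period} - {direction} - {location}"


label = observation_label(
    "Bicycles",
    datetime(2001, 10, 8, 17, 0),
    "East",
    "Salterton Road, EAST OF DINAN WAY, EXMOUTH",
)
print(label)
```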

<h2>Datasets</h2>

<p>There are two kinds of datasets that are applicable to this particular &#8230;err&#8230; set of data &#8230; and that we should describe within the RDF. They are:</p>

<ul>
<li>datasets that are sets of statistical data items (such as the observations in the traffic count data); these are best described using <a href="http://sw.joanneum.at/scovo/schema.html">SCOVO</a></li>
<li>datasets that are general descriptions of particular sets of linked data (such as roads or local authorities); these are best described using <a href="http://semanticweb.org/wiki/VoiD">voiD</a></li>
</ul>

<p>Both kinds of datasets can be identified for UK government data using URIs in the form:</p>

<pre><code>http://{sector}.data.gov.uk/set/{dataset}/
</code></pre>

<h3>SCOVO Datasets</h3>

<p>Every <code>scovo:Item</code> should be part of a <code>scovo:Dataset</code>, associated through a <code>scovo:dataset</code> (and a reverse <code>scovo:datasetOf</code>). A <code>scovo:Dataset</code> is pretty simple: all you really need to do is give it an identifier and, of course, a label. In this case, something like:</p>

<pre><code>http://transport.data.gov.uk/set/traffic-count/2001-2008/
</code></pre>

<p>This is an identifier that the various <code>scovo:Item</code>s should use to indicate where the data comes from:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  scovo:dataset &lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>

<p>It&#8217;s also an identifier that we can attach some metadata to. Obviously it needs a label, but we can also attach other metadata, such as the <a href="http://www.jenitennison.com/blog/node/133">provenance of the dataset</a>.</p>

<pre><code>&lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt;
  a scovo:Dataset ;
  a prv:DataItem ;
  rdfs:label "Traffic counts between 2001 and 2008"@en ;
  prv:createdBy [
    a prv:DataCreation ;
    prv:performedAt ... ;
    prv:performedBy ... ;
    prv:usedData ... ;
    prv:usedGuideline ... ;
  ] .
</code></pre>

<h3>VoiD Datasets</h3>

<p>VoiD is designed to be used to describe sets of linked data, their contents, their provenance and their relationships with each other. There are many ways of dividing up the data that we&#8217;ve been looking at into datasets. We can start with a simple example: the dataset containing linked data about countries:</p>

<pre><code>&lt;http://statistics.data.gov.uk/set/country/&gt;
  a void:Dataset ;
  rdfs:label "Countries"@en ;
  foaf:homepage &lt;http://statistics.data.gov.uk/set/country&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Country&gt; ;
  cc:license [
    a cc:License ;
    rdfs:label "data.gov.uk Licence"@en ;
    foaf:homepage &lt;http://data.hmg.gov.uk/terms-privacy&gt; ;
    cc:permits cc:DerivativeWorks, cc:Distribution, cc:Reproduction ;
    cc:requires cc:Attribution ;
  ] ;
  void:exampleResource &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  void:sparqlEndpoint &lt;http://services.data.gov.uk/statistics/sparql&gt; ;
  void:uriRegexPattern "http://statistics.data.gov.uk/id/country\\?name=.+"^^xs:string ;
  void:vocabulary &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; ;
  void:vocabulary &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
</code></pre>

<p>This provides a link to a home page for the dataset, which should contain information about the dataset itself. (Accessing the URI for the dataset should also redirect users to this home page.) I&#8217;ve used the same URI as the dataset URI but without the slash at the end. (This is probably too subtle a difference between URIs; we don&#8217;t currently have official guidance for URIs for documents-about-datasets or documents-about-definitions.)</p>

<p>The <code>void:exampleResource</code> property can be used to point to resources that can act as starting points for exploring the data, and the <code>void:sparqlEndpoint</code> property points at a SPARQL endpoint that can be used for deeper querying. The <code>void:uriRegexPattern</code> property provides a regular expression for the URIs that are used to identify the resources that the dataset is about. <code>void:vocabulary</code> points to the vocabularies that the dataset uses.</p>
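<p>As a sketch of how a consumer might use such a pattern, the Python below checks whether a URI belongs to the country dataset. Two things here are my assumptions rather than anything voiD mandates: that the whole URI has to match the pattern, and the <code>in_dataset</code> helper itself. Note that in a regular expression an unescaped <code>?</code> is a quantifier, so the pattern escapes it (and the dots) to match those characters literally.</p>

```python
import re

# Characters with a special meaning in regular expressions (here "." and "?")
# are escaped so that they match literally.
COUNTRY_PATTERN = r"http://statistics\.data\.gov\.uk/id/country\?name=.+"

# The same pattern with nothing escaped, for comparison.
UNESCAPED_PATTERN = "http://statistics.data.gov.uk/id/country?name=.+"

ENGLAND = "http://statistics.data.gov.uk/id/country?name=England"


def in_dataset(uri, pattern=COUNTRY_PATTERN):
    """True if the whole URI matches the dataset's URI pattern (an assumption)."""
    return re.fullmatch(pattern, uri) is not None


print(in_dataset(ENGLAND))
print(in_dataset("http://statistics.data.gov.uk/id/region?name=London"))
print(in_dataset(ENGLAND, UNESCAPED_PATTERN))
```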

<p>Various <a href="http://dublincore.org/documents/dcmi-terms/">Dublin Core</a> properties can be used to provide metadata about the dataset, such as its subject matter. The <a href="http://creativecommons.org/ns">Creative Commons schema</a> provides a way of indicating the licence that the dataset is made available under, which is essential information to enable reuse. (I&#8217;ve derived some RDF about the licence here from the one <a href="http://data.hmg.gov.uk/terms-privacy">described on the data.hmg.gov.uk pages</a>; there should be an official version some time soon.)</p>

<p>The data that we can produce from this traffic count dataset is actually a <em>subset</em> of the dataset of all countries, and we can indicate this through a <code>void:subset</code> relationship:</p>

<pre><code>&lt;http://statistics.data.gov.uk/set/country/&gt;
  ...
  void:subset [
    a void:Dataset ;
    a prv:DataItem ;
    rdfs:label "Country data from the DfT traffic count dataset 2001-2008"@en ;
    prv:createdBy [
      a prv:DataCreation ;
      prv:performedAt ... ;
      prv:performedBy ... ;
      prv:usedData ... ;
      prv:usedGuideline ... ;
    ] ;
  ] .
</code></pre>

<p>The other kind of subset that we should describe is the link set. Link sets are datasets that contain links between datasets. The country dataset doesn&#8217;t (currently) contain any links to other datasets, but the count dataset does:</p>

<pre><code>&lt;http://transport.data.gov.uk/set/traffic-count/&gt;
  a void:Dataset ;
  rdfs:label "Traffic Counts"@en ;
  foaf:homepage &lt;http://transport.data.gov.uk/set/traffic-count&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Traffic&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Counting&gt; ;
  cc:license [
    a cc:License ;
    rdfs:label "data.gov.uk Licence"@en ;
    foaf:homepage &lt;http://data.hmg.gov.uk/terms-privacy&gt; ;
    cc:permits cc:DerivativeWorks, cc:Distribution, cc:Reproduction ;
    cc:requires cc:Attribution ;
  ] ;
  void:exampleResource &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  void:uriRegexPattern "http://transport.data.gov.uk/id/traffic-count-point/[0-9]+/direction/[NSEW]/hour/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:00:00"^^xs:string ;
  void:sparqlEndpoint &lt;http://services.data.gov.uk/transport/sparql&gt; ;
  void:vocabulary &lt;http://transport.data.gov.uk/def/traffic/&gt; ;
  void:subset [
    a void:Dataset ;
    rdfs:label "Traffic Counts from the DfT traffic count dataset 2001-2008"@en ;
    prv:createdBy ...
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / count point links"@en ;
    rdfs:comment "Links from a traffic count to the count point at which the count was taken."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/count&gt; ;
    void:objectsTarget &lt;http://transport.data.gov.uk/set/traffic-count-point/&gt; ;
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / cardinal direction"@en ;
    rdfs:comment "Links from a traffic count to the direction in which the traffic was going."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/direction&gt; ;
    void:objectsTarget &lt;http://dbpedia.org/void/Dataset&gt; ;
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / hour"@en ;
    rdfs:comment "Links from a traffic count to the hour when the traffic was being monitored."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/hour&gt; ;
    void:objectsTarget [
      a void:Dataset ;
      rdfs:label "URIs for places and times" ;
      foaf:homepage &lt;http://placetime.com/&gt; ;
    ] ;
  ] .
</code></pre>

<p><code>scovo:Dataset</code>s are often subsets of <code>void:Dataset</code>s. In the case of the traffic count data, the observations described by the <code>scovo:Dataset</code> above are a subset of the <code>void:Dataset</code> that is the set of <em>all</em> such observations (including ones from other years).</p>

<h2>Derivable Data</h2>

<p>The discussion about <code>rdfs:label</code> above touched on another set of information that should be included within the RDF data we produce: data that is automatically derivable from the data we provide. There are three main reasons for including derivable data within what we publish:</p>

<ol>
<li><p>Given the current adoption of RDF-aware technologies, the consumers of our data are pretty unlikely to be able to (or to want to) use schemas, ontologies or rule sets to help them to reason over the data and draw conclusions. The consumers of this data <em>might</em> include semantic search engines and people scraping the data into their own triplestores, but they&#8217;re far more likely to be developers who really don&#8217;t care about RDF at all. It would be a shame to publish the data and then have no one use it.</p></li>
<li><p>Computing derivable data once saves overall effort. We calculate it once, centrally, and it means that the people using the data don&#8217;t have to spend processing time doing it themselves. (There&#8217;s a classic time/space trade-off here, of course; the down side of including data that isn&#8217;t strictly necessary is that the documents will end up larger.)</p></li>
<li><p>If we provide information that people are likely to need within the document that they get when they request a given resource, they&#8217;re less likely to need to resort to (harder to construct and more intensive to process) SPARQL queries to get what they need.</p></li>
</ol>

<p>The overriding principle that we can use to help us decide what to include is to consider what we would like to see if we visited a page about the particular thing.</p>

<p>How we manage to provide the derived data depends on how we publish the data. I&#8217;m not talking here about how to do the publishing, but rather about what the consumers of the data should expect to see eventually. So, for example, if we publish the data as static files then we&#8217;re going to have to include all this data in those files. If we generate the RDF dynamically, we just have to make sure that the generated RDF includes the derived data; we might be able to set up rules in a triplestore, or a transformation of the data that it naturally produces, to include the derivable data.</p>

<h3>Superclasses and Super-properties</h3>

<p>One set of derived data is that inferred from the superclasses and super-properties that are defined with the RDF vocabularies we use in our data. Basically, if a resource has a type that is a subclass of another type, then the resource should have that superclass as a type as well. Similarly, if a triple includes a property that has a super-property, then there ought also to be a triple that links the subject and object of the original triple with the super-property as well.</p>
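
<p>As a rough sketch of how these triples could be generated, assuming the data and the vocabulary definitions are loaded into the same store and that the store supports SPARQL 1.1 property paths (both assumptions of mine), a pair of <code>CONSTRUCT</code> queries would do it. The variable names are purely illustrative.</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;

# Give every resource each superclass of its declared types
CONSTRUCT { ?thing a ?superclass . }
WHERE {
  ?thing a ?class .
  ?class rdfs:subClassOf+ ?superclass .
}
</code></pre>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;

# Repeat each statement using every super-property of its property
CONSTRUCT { ?s ?superproperty ?o . }
WHERE {
  ?s ?property ?o .
  ?property rdfs:subPropertyOf+ ?superproperty .
}
</code></pre>

<p>These queries produce everything that is derivable; in practice you would probably want to filter the results rather than publish every triple they generate.</p>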

<p>To understand when it&#8217;s important to include this kind of derived data, we need to be aware of the kinds of applications that will use the data. Some applications will be targeting just this dataset about traffic counts, and will be written to use whatever properties and classes we&#8217;ve made available. Other applications will be targeted at specific vocabularies at a more general-purpose level. There might be applications that can be used to visualise SKOS hierarchies as a tree, for example, or applications that can plot any <code>geo:lat</code>/<code>geo:long</code> coordinates on a map, or any OWL-Time intervals and instants on a timeline. Still other applications, such as viewers like Tabulator, will be used with any old RDF. We need to provide enough information to make the data easily usable by these more generic applications.</p>

<p>As an example, in the last instalment we introduced classes for <code>traffic:VehicleType</code> and <code>traffic:RoadCategory</code> which were subclasses of <code>skos:Concept</code>. If we want generic SKOS visualisers to be able to display the vehicle type and road category concept schemes, we should try to make it easy for them to work out which things are concepts, by indicating that they are concepts as well. Bearing in mind what I&#8217;ve said above about labels, that means that the original RDF:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>

<p>should include a reference to <code>skos:Concept</code> and an <code>rdfs:label</code>:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  a skos:Concept ;
  rdfs:label "Motorway"@en ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>

<p>Note that I haven&#8217;t included the results of <em>all</em> the reasoning that we could anticipate. The property <code>skos:scopeNote</code> is a sub-property of <code>skos:note</code>, for example, but I haven&#8217;t included a <code>skos:note</code> explicitly because any SKOS-aware processor should have that kind of knowledge built in. The rule of thumb is that <strong>if the result of the reasoning involves a resource from another vocabulary, then we should include it</strong>.</p>

<h3>Derivable Values</h3>

<p>There are other kinds of derivable data in this data set. In particular, there are eastings and northings, but not latitudes and longitudes. When there&#8217;s useful derivable data, especially when it&#8217;s not trivial to derive, it makes sense to make that available explicitly, otherwise everyone else will have to go through the effort of deriving it themselves.</p>

<p>We&#8217;ve already done this with the information about the hours of the traffic counts, by pulling out the year and hour of the count rather than having them tucked away within an <code>xsd:dateTime</code> literal. The same should be true of the eastings and northings. For small numbers of values, you can use the <a href="http://gps.ordnancesurvey.co.uk/convert.asp">Ordnance Survey&#8217;s online converter</a>; for larger numbers of values you can download the (Windows only and very dated) software or try one of the various converters you can find with a <a href="http://www.google.com/search?q=easting+northing+latitude+longitude+conversion+UK">Google search</a>.</p>

<p>Latitudes and longitudes for points should, of course, be expressed using the <code>geo:lat</code> and <code>geo:long</code> properties from the <a href="http://www.w3.org/2003/01/geo/">http://www.w3.org/2003/01/geo/wgs84_pos#</a> vocabulary.</p>

<h3>Inverses</h3>

<p>Statements in RDF link two things. For example, you can view the statement:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; .
</code></pre>

<p>as saying that traffic count point 13 is on the B3178 <em>and</em> that the B3178 has a count point on it that is traffic count point 13.</p>

<p>So it&#8217;s always possible, when creating a query about or a representation of the road to include the &#8216;backward links&#8217; &#8212; the statements in which the road features as an <em>object</em> as well as those in which it features as a <em>subject</em>. This has caused some people to argue that <a href="http://dowhatimean.net/2006/06/an-rdf-design-pattern-inverse-property-labels">relationships should only be defined in one direction</a>.</p>
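
<p>To make that concrete, here is a minimal sketch of a query that collects both directions for the B3178. It uses nothing beyond the road&#8217;s URI from the example above; the variable names are illustrative.</p>

<pre><code>CONSTRUCT {
  &lt;http://transport.data.gov.uk/id/road/B3178&gt; ?p ?o .
  ?s ?q &lt;http://transport.data.gov.uk/id/road/B3178&gt; .
}
WHERE {
  # Statements in which the road is the subject...
  { &lt;http://transport.data.gov.uk/id/road/B3178&gt; ?p ?o . }
  UNION
  # ...and the backward links, in which it is the object
  { ?s ?q &lt;http://transport.data.gov.uk/id/road/B3178&gt; . }
}
</code></pre>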

<p>Personally, I don&#8217;t agree, for two reasons.</p>

<ol>
<li><p>Although it&#8217;s <em>possible</em> to create queries and representations that include backward links, it often doesn&#8217;t happen like that. It varies between triplestores, but the result of a SPARQL <code>DESCRIBE</code> query commonly includes only triples in which the thing being described is the subject, not the object. Also, when constructing queries, it seems more natural to always &#8220;travel forward&#8221; through the graph. For example:</p>

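<pre><code>PREFIX area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt;
PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

SELECT ?count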
WHERE {
  ?point
    area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/43UC&gt; ;
    traffic:count ?count .
}
</code></pre>

<p>rather than:</p>

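<pre><code>PREFIX area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt;
PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

SELECT ?count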
WHERE {
  ?point
    area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/43UC&gt; .
  ?count
    traffic:countPoint ?point .
}
</code></pre>

<p>So although it introduces redundancy, I think that including inverse relationships in RDF aids usability and navigability.</p></li>
<li><p>Sometimes both directions of a relationship contain meaningful information. For example, it&#8217;s not enough to include a <code>gen:mother</code> relationship from a person to their mother because the implied reverse relationship is simply that the person is a child of their mother &#8212; you need to include a <code>gen:son</code> or <code>gen:daughter</code> relationship as well to tell which type of child.</p></li>
</ol>

<p>So in this dataset, I&#8217;m going to include inverse relationships where appropriate:</p>

<ul>
<li>from countries to regions</li>
<li>from regions to local authority districts</li>
<li>from roads to count points</li>
<li>from count points to counts</li>
<li>from counts to observations</li>
</ul>
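
<p>If the data is already in a triplestore, one way to generate an inverse is with a <code>CONSTRUCT</code> query. This is only a sketch for the count point to count link, and it assumes the counts are typed as <code>traffic:Count</code> and already point at their count point with <code>traffic:countPoint</code>:</p>

<pre><code>PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

# For every count that points at its count point, add the link back
CONSTRUCT { ?point traffic:count ?count . }
WHERE {
  ?count a traffic:Count ;
    traffic:countPoint ?point .
}
</code></pre>

<p>The type check is there because other kinds of resource may also carry <code>traffic:countPoint</code> links, and we only want counts on the other end of <code>traffic:count</code>.</p>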

<h3>Shortcuts</h3>

<p>Another thing that can aid the navigability of a set of RDF data is to provide &#8220;shortcuts&#8221;. For example, at the moment we have links that say which country a region belongs to and which region a local authority district belongs to, but we don&#8217;t have a link that says which country a local authority district belongs to. These kinds of links can make it easier to navigate through (and to query) a dataset, so they can be worth adding so long as they don&#8217;t clutter up the data too much.</p>

<p>Just think of what you&#8217;d like to know about a particular <em>thing</em> when you visit its page. If you&#8217;re looking at transport in a local authority district, it would be useful to know what region and country it belongs to and about what roads and traffic count points it contains. But it would be too much to have a list of all the counts and observations that have been taken on those count points.</p>

<p>For this dataset, I&#8217;m going to add shortcuts from:</p>

<ul>
<li>countries to local authority districts (and vice versa)</li>
<li>count points to regions and countries</li>
<li>roads to local authority districts (and vice versa)</li>
<li>roads to regions and countries</li>
<li>roads to road categories and road names</li>
<li>roads to counts (and vice versa)</li>
<li>observations to count points, roads, directions and count hours</li>
</ul>

<p>These are all judgement calls &#8212; there are no hard and fast rules &#8212; and as you can see I&#8217;m not adding inverses everywhere here because to do so would lead to unnecessarily large sets of RDF in some cases.</p>
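
<p>Shortcuts like these can be derived by following two existing links and collapsing them into one. As a sketch, assuming districts link to their region with <code>area:region</code> and regions link to their country with <code>area:country</code>, the country / local authority district shortcut (in both directions) could be generated like this:</p>

<pre><code>PREFIX area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt;

CONSTRUCT {
  ?district area:country ?country .
  ?country area:district ?district .
}
WHERE {
  # Two hops: district to region, then region to country
  ?district a area:LocalAuthorityDistrict ;
    area:region ?region .
  ?region area:country ?country .
}
</code></pre>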

<hr />

<p>That&#8217;s the end of this instalment. I had been intending to make this the final one, but there are a couple of things still left to talk about: the publication of RDF, and the supplementary documents that we need to provide (including RDF about those supplementary documents). I&#8217;ve also had a request to talk about OWL ontologies, so I&#8217;ll probably do that, and there are things to say about how to manage things changing over time. So this may end up being an eight-part series!</p>

<p>To keep us up to date, with all the extra derived information added, the RDF looks as follows:</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  rdfs:label "Devon"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  rdfs:label "B3178" ;
  rdfs:label "Salterton Road"@en ;
  skos:prefLabel "B3178" ;
  skos:altLabel "Salterton Road"@en ;
  skos:notation "B3178"^^traffic:RoadNumber ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:label "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  geo:lat 50.6294 ;
  geo:long -3.3784 ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  rdfs:label "8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; ;
  traffic:observation &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt; .

&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  rdfs:label "8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part IV: Developing RDF Schemas</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/138" />
    <id>http://www.jenitennison.com/blog/node/138</id>
    <published>2009-11-26T10:35:32+00:00</published>
    <updated>2009-11-28T22:03:32+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>This is the fourth instalment in a series about turning an existing dataset into some linked data. I&#8217;ve previously talked about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a> and <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a>. In this instalment, we&#8217;ll look at developing a schema in which we define the classes, properties and datatypes that we want to use in the RDF that describes the <em>things</em> in our dataset.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the fourth instalment in a series about turning an existing dataset into some linked data. I&#8217;ve previously talked about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a> and <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a>. In this instalment, we&#8217;ll look at developing a schema in which we define the classes, properties and datatypes that we want to use in the RDF that describes the <em>things</em> in our dataset.</p>

<p>We&#8217;ll start by writing out some RDF for our record, using Turtle here for readability, and use unprefixed names to indicate classes, properties and datatypes, just so we can see what we need. Then we&#8217;ll see how those requirements match up to existing vocabularies and ontologies that we can reuse. Anything that&#8217;s left over we&#8217;re going to have to put in our own vocabulary. We&#8217;ll call this</p>

<pre><code>http://transport.data.gov.uk/def/traffic/
</code></pre>

<p>All the classes, properties and datatypes that we define will eventually use that namespace.</p>

<p>Let&#8217;s focus on this record; I find it easiest to use an actual example rather than talk in abstract:</p>

<pre><code>"England","South West","K",1115.00,"18","Devon County Council",
13,"B3178",,"B Urban","Salterton Road",
"Salterton Road, EAST OF DINAN WAY, EXMOUTH",302600,81984,
8/10/2001 00:00:00,"E",17,2,2,400,5,41,0,2,0,0,0,0,2,450
</code></pre>

<p>We&#8217;ll put this into RDF bit by bit.</p>

<h2>Areas</h2>

<p>First, let&#8217;s look at the areas and local authorities. The kind of RDF that we want to have looks like:</p>

<pre><code>&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a :Country ;
  :name "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a :GovernmentOfficeRegion ;
  :name "South West"@en ;
  :code "K"^^:ONScode ;
  :containedBy &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a :LocalAuthorityDistrict ;
  :code "18"^^:ONScode ;
  :code "1115"^^:DfTLAcode ;
  :localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  :containedBy &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  :containedBy &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  :sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a :LocalAuthority ;
  :name "Devon County Council"@en ;
  :code "18"^^:ONSLAcode ;
  :code "1115"^^:DfTLAcode ;
  :localAuthorityDistrict &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1115&gt;
  :sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .
</code></pre>

<p>To work out what we need to put in our schema, we should first look at what existing vocabularies there are that could help. These areas are already defined elsewhere, so we can just use the same vocabulary for countries, regions, local authority districts and local authorities as is used there. The vocabularies that are useful here are:</p>

<ul>
<li><code>http://statistics.data.gov.uk/def/administrative-geography/</code> which defines classes and properties related to administrative areas and local authorities (as described by the <a href="http://www.statistics.gov.uk/">Office for National Statistics</a>)</li>
<li><code>http://data.ordnancesurvey.co.uk/ontology/admingeo/</code> which also defines classes and properties related to administrative areas (as described by the <a href="http://www.ordnancesurvey.co.uk/">Ordnance Survey</a>)</li>
<li><code>http://data.ordnancesurvey.co.uk/ontology/spatialrelations/</code>, also developed by John Goodwin at the Ordnance Survey, which defines spatial relationships between areas</li>
</ul>

<p>There are other commonly used vocabularies that it&#8217;s helpful to know about:</p>

<ul>
<li>RDFS is designed for representing RDF schemas, but it has a few general-purpose properties that are good to know, namely <code>rdfs:label</code> (the label for a thing) and <code>rdfs:comment</code> (a comment or description about the thing).</li>
<li>SKOS is designed for representing concept schemes, but again it has a few properties that can be used with any set of linked data, in particular <code>skos:prefLabel</code> (the preferred label for a thing), <code>skos:altLabel</code> (an alternative label for a thing) and <code>skos:notation</code> (a code for the thing).</li>
<li>OWL is designed for representing ontologies, but it has one very important property that you should know about &#8212; <code>owl:sameAs</code> &#8212; which is used to link two things that are the same thing.</li>
<li>XML Schema datatypes can be used within RDF, which is useful for things like dates, times, integers and so on.</li>
<li>For our purposes here, OWL-Time is going to prove useful, as it has a bunch of properties that are used to represent instants and durations.</li>
</ul>

<p>If we look through the RDF above, the only thing that <em>isn&#8217;t</em> covered by these vocabularies is the <code>DfTLAcode</code> datatype. If we use the <code>http://transport.data.gov.uk/def/traffic/</code> namespace, there&#8217;s not really any need to indicate that this is a transport-related code, so we can just call it <code>LAcode</code>. Let&#8217;s define that datatype:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/LAcode&gt;
  a rdfs:Datatype ;
  rdfs:label "Local Authority Code"@en .
</code></pre>

<p>That&#8217;s it. Now here&#8217;s the Turtle for the areas with the relevant namespaces added, and property names changed where appropriate:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .
</code></pre>

<h2>Roads</h2>

<p>Here&#8217;s the kind of RDF we want to create for roads:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a :Road ;
  :code "B3178"^^:RoadNumber .
</code></pre>

<p>Obviously, we need a class for roads:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/Road&gt;
  a rdfs:Class ;
  rdfs:label "Road"@en .
</code></pre>

<p>Wherever there&#8217;s a code, I like to reuse <code>skos:notation</code>. But it&#8217;s important to define a datatype for the values used with that notation because (as we saw with local authorities above) there may be several different coding schemes that apply to the same Thing, and we need to be able to distinguish between them in case they clash. So:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/RoadNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Road Number"@en .
</code></pre>

<p>That&#8217;s all we have to define for roads; now the RDF can look like:</p>

<pre><code>@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  skos:notation "B3178"^^traffic:RoadNumber .
</code></pre>
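
<p>One practical benefit of typing the notation is that a query can ask for the thing with a particular code in a particular coding scheme. A minimal sketch, assuming the road data above has been loaded into a SPARQL endpoint:</p>

<pre><code>PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

# Matches only notations typed as road numbers, not any other
# coding scheme that happens to use the string "B3178"
SELECT ?road
WHERE {
  ?road skos:notation "B3178"^^traffic:RoadNumber .
}
</code></pre>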

<h2>Count Points</h2>

<p>On to count points. Here&#8217;s the sketch of the RDF we want to create:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a :TrafficCountPoint ;
  :description "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  :code "13"^^:CountPointNumber ;
  :road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  :roadName "Salterton Road"@en ;
  :roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  :easting 302600 ;
  :northing 81984 ;
  :localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  :localAuthorityDistrict &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .
</code></pre>

<p>Of these, the description could be done with <code>rdfs:comment</code>. The code can be held by a <code>skos:notation</code> (provided we define a datatype for the count point number):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/CountPointNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Traffic Count Point Number"@en .
</code></pre>

<p>Properties for easting and northing are actually defined by the OS&#8217;s spatial relations ontology (although unfortunately neither the ontology nor the property is currently resolvable; the only way you&#8217;d know this is through looking at their use in the conversion of the edubase data). Links to local authorities and local authority districts can be done using the ONS-based administrative geography ontology, which again is currently only guessable at by looking at the online data.</p>

<p>That leaves us with a <code>traffic:CountPoint</code> class (no point calling it <code>TrafficCountPoint</code> if the namespace provides sufficient disambiguation):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count Point"@en .
</code></pre>

<p>A road property to point to a road:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/road&gt;
  a rdf:Property ;
  rdfs:label "road"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Road&gt; .
</code></pre>

<p>Note that properties are by convention named with a lowercase first letter, whereas classes are named with an uppercase first letter. It&#8217;s a good idea to follow that convention. Note also that I&#8217;ve defined an <code>rdfs:range</code> for this property, which means that anything that&#8217;s the <em>object</em> in an RDF statement that involves this property must be a <code>traffic:Road</code>.</p>
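
<p>In RDFS terms, a range works as an inference rather than as a validation check: a reasoner that sees the property in use concludes that the object is a member of the range class. As a sketch only, assuming the schema and the data sit in the same store, the effect is roughly that of this query:</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;

# Type the object of each statement with the range of its property
CONSTRUCT { ?object a ?class . }
WHERE {
  ?subject ?property ?object .
  ?property rdfs:range ?class .
  # Literals can't be the subject of a triple, so skip them
  FILTER ( !isLiteral(?object) )
}
</code></pre>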

<p>We need a road name property to give the name of the road at the count point.</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadName&gt;
  a rdf:Property ;
  rdfs:label "road name"@en ;
  rdfs:range rdfs:Literal .
</code></pre>

<p>We also need a road category property to point to the categor(ies) of the road at the count point:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en .
</code></pre>

<p>You&#8217;ll remember that we defined different road categories using SKOS, such that each road category is a <code>skos:Concept</code>. But to give a range to the <code>traffic:roadCategory</code> property, we need to create a class for all the things that are categories of road. These are all <code>skos:Concept</code>s, and we can indicate that through an <code>rdfs:subClassOf</code> property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Road Category"@en .
</code></pre>

<p>We can then use this as the range of the <code>traffic:roadCategory</code> property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt; .
</code></pre>

<p>and amend the concept scheme we created to include references to this new class, for example:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>
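
<p>Because <code>traffic:RoadCategory</code> is a subclass of <code>skos:Concept</code>, typing a category with the new class doesn&#8217;t lose anything. As a rough sketch of what a reasoner that understands RDFS could derive (the final statement doesn&#8217;t need to be published explicitly):</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .
@base &lt;http://transport.data.gov.uk/def/road-category/&gt; .

# schema and data as asserted
traffic:RoadCategory rdfs:subClassOf skos:Concept .
&lt;motorway&gt; a traffic:RoadCategory .

# inferred (RDFS entailment rule rdfs9): the category is still a SKOS concept
&lt;motorway&gt; a skos:Concept .
</code></pre>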

<p>So here is the RDF with the relevant properties properly defined:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .
</code></pre>

<h2>Traffic Counts</h2>

<p>On to traffic counts. The un-namespaced RDF should look like:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a :TrafficCount ;
  :countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  :direction &lt;http://dbpedia.org/resource/East&gt; ;
  :hour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .
</code></pre>

<p>So for that we need a class for traffic counts:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/Count&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count"@en .
</code></pre>

<p>a property that can link the traffic count to the count point where the count is taken:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/countPoint&gt;
  a rdf:Property ;
  rdfs:label "traffic count point"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt; .
</code></pre>

<p>a property to link to the direction the traffic is flowing in (we can&#8217;t put a range on this one because the DBPedia resources we&#8217;re using don&#8217;t have a common type):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/direction&gt;
  a rdf:Property ;
  rdfs:label "traffic direction"@en .
</code></pre>

<p>and finally a property to link to the hour during which the measurement was taken. This last one is a very common thing to need to do, so we&#8217;d imagine that there might be an existing property defined somewhere that we could use. <a href="http://sdmx.org/">SDMX</a>, which includes a standard for representing statistical information in XML, defines a <code>REF_PERIOD</code> field which would seem to suit our purposes, but we don&#8217;t yet have a proper mapping of SDMX into RDF (I&#8217;ve had an initial cut, but it needs some input from statisticians).</p>

<p>So for now, we&#8217;ll use a specific property in our own namespace; we can always indicate that it&#8217;s a sub-property of a future SDMX property at a later date. I&#8217;m going to call it <code>countHour</code> and give it a domain of <code>traffic:Count</code> to indicate that the property has a pretty specific use for providing the count for an hour. We could just give its range as a generic <code>time:Interval</code>, but the kind of hours that are traffic count hours are kinda special intervals: they&#8217;re obviously an hour long, but are also restricted to start and end on the hour, cover an hour between 7am and 7pm, and don&#8217;t occur in winter. So it feels like we should have a special kind of interval for that purpose:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/countHour&gt;
  a rdf:Property ;
  rdfs:label "hour of count"@en ;
  rdfs:domain &lt;http://transport.data.gov.uk/def/traffic/Count&gt; ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountHour&gt; .

&lt;http://transport.data.gov.uk/def/traffic/CountHour&gt;
  a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:label "Count Hour"@en .
</code></pre>
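
<p>If an RDF mapping of SDMX does appear, linking our property to it should be a one-line addition. The sketch below is purely illustrative: the <code>sdmx:</code> namespace URI and the <code>sdmx:refPeriod</code> property are placeholders I&#8217;ve made up, not terms that exist today:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .
# hypothetical namespace and property, for illustration only
@prefix sdmx: &lt;http://example.org/sdmx#&gt; .

traffic:countHour rdfs:subPropertyOf sdmx:refPeriod .
</code></pre>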

<p>All those properties were in the traffic namespace, so here&#8217;s the RDF with it added:</p>

<pre><code>@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .
</code></pre>

<h2>Cardinal Directions</h2>

<p>As I discussed in the last instalment, we&#8217;re not actually going to mint URIs for cardinal directions, but that doesn&#8217;t mean we can&#8217;t make statements about them in the RDF we generate. As I&#8217;ll discuss in more depth in the next instalment, it&#8217;s always good to provide a label at the very least:</p>

<pre><code>&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .
</code></pre>

<h2>Intervals and Instants</h2>

<p>Let&#8217;s look now at the RDF we want to generate about the hour during which the count was taken. As I&#8217;ve said above, these hours are a special kind of interval, and we&#8217;ve already created a class for them. I also discussed earlier that the things about this interval that are really useful for the purposes of querying are the year during which the count was taken and the hour at which it was taken, so we should pull out at least those pieces of information. Time-based data can be represented in RDF using the <a href="http://www.w3.org/2006/time">OWL-Time ontology</a>.</p>

<p>Unfortunately, expressing time very specifically gets verbose. This is what the statements we want to make look like using OWL-Time:</p>

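<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .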

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .
</code></pre>

<h2>Observations</h2>

<p>Finally we&#8217;re on to the observations themselves. The un-namespaced RDF looks like:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a :Observation ;
  :count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  :vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  :value 2 .
</code></pre>

<p>The <a href="http://purl.org/NET/scovo">SCOVO</a> vocabulary exists to represent statistical information like this. In SCOVO, observations are called <code>scovo:Item</code>s, the value of the statistical measure itself (the count in this case) should be held in the <code>rdf:value</code> property, and any other properties should be sub-properties of <code>scovo:dimension</code>, which has a range of <code>scovo:Dimension</code>.</p>

<p>To fit in with SCOVO, then, we need to have the pointer to the count that this observation belongs to as a property that is a sub-property of <code>scovo:dimension</code>:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/count&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "count"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Count&gt; .
</code></pre>

<p>We might be tempted to indicate that the type of thing pointed to by the <code>traffic:count</code> property is a subclass of <code>scovo:Dimension</code>, but this is unnecessary and probably untrue: there might exist some traffic counts that <em>aren&#8217;t</em> dimensions, and the ones that are linked to by the <code>traffic:count</code> property can be inferred to be dimensions.</p>
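
<p>Here&#8217;s a sketch of that inference, assuming (as I read the SCOVO schema) that <code>scovo:dimension</code> has a range of <code>scovo:Dimension</code>. Only the first two groups of statements are asserted; the last group is what an RDFS reasoner could work out for itself:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

# schema
traffic:count rdfs:subPropertyOf scovo:dimension .
# assumption: this is how I read the SCOVO schema
scovo:dimension rdfs:range scovo:Dimension .

# data
&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

# inferred (RDFS entailment rules rdfs7, then rdfs3)
&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  scovo:dimension &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a scovo:Dimension .
</code></pre>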

<p>Similarly, the property that provides the pointer to the vehicle type should be a sub-property of <code>scovo:dimension</code> and we need a class for those various vehicle types in order to restrict the range of that property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/vehicleType&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "vehicle type"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt; .

&lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Vehicle Type"@en .
</code></pre>

<p>Of course all the concepts that we created for the vehicle types need to be designated as instances of this new <code>traffic:VehicleType</code> class:</p>

<pre><code>&lt;bicycle&gt; a traffic:VehicleType ;
  ... .
</code></pre>

<p>So, the RDF with the proper namespaces is:</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>

<hr />

<p>That concludes our initial walkthrough of the data to create a vocabulary. I&#8217;ve duplicated the schema and the example data below so that it&#8217;s all in one place. But it&#8217;s not quite done. In the next instalment, I&#8217;ll look at adding some finishing touches that make the RDF easier to use.</p>

<hr />

<h2>Schema</h2>

<p>This is the full schema. It contains just six classes, seven properties and three datatypes at the moment, so it&#8217;s pretty small as vocabularies go. We&#8217;ve been able to reuse a lot of classes, properties and datatypes that have already been defined elsewhere in the RDF itself, so this vocabulary is pretty focused on just what we need to describe traffic counts.</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .

# Classes #

&lt;http://transport.data.gov.uk/def/traffic/Road&gt;
  a rdfs:Class ;
  rdfs:label "Road"@en .

&lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count Point"@en .

&lt;http://transport.data.gov.uk/def/traffic/Count&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count"@en .

&lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Road Category"@en .

&lt;http://transport.data.gov.uk/def/traffic/CountHour&gt;
  a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:label "Count Hour"@en .

&lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Vehicle Type"@en .

# Properties #

&lt;http://transport.data.gov.uk/def/traffic/road&gt;
  a rdf:Property ;
  rdfs:label "road"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Road&gt; .

&lt;http://transport.data.gov.uk/def/traffic/countPoint&gt;
  a rdf:Property ;
  rdfs:label "traffic count point"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt; .

&lt;http://transport.data.gov.uk/def/traffic/count&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "count"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Count&gt; .

&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt; .

&lt;http://transport.data.gov.uk/def/traffic/direction&gt;
  a rdf:Property ;
  rdfs:label "traffic direction"@en .

&lt;http://transport.data.gov.uk/def/traffic/countHour&gt;
  a rdf:Property ;
  rdfs:label "hour of count"@en ;
  rdfs:domain &lt;http://transport.data.gov.uk/def/traffic/Count&gt; ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountHour&gt; .

&lt;http://transport.data.gov.uk/def/traffic/vehicleType&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "vehicle type"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt; .

# Datatypes #

&lt;http://transport.data.gov.uk/def/traffic/LAcode&gt;
  a rdfs:Datatype ;
  rdfs:label "Local Authority Code"@en .

&lt;http://transport.data.gov.uk/def/traffic/RoadNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Road Number"@en .

&lt;http://transport.data.gov.uk/def/traffic/CountPointNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Traffic Count Point Number"@en .
</code></pre>

<hr />

<h2>RDF Data</h2>

<p>Here&#8217;s a sample set of data. It looks like rather a lot to simply describe the number of bicycles at a particular point on a road (and it doesn&#8217;t even include the SKOS concept schemes that we did last time), but (a) it all provides valuable context for that measurement and (b) most of it will be reused by a lot of other measurements.</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  skos:notation "B3178"^^traffic:RoadNumber .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .

&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>
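
<p>To see that reuse in practice, here&#8217;s a sketch of a second observation hanging off the same traffic count. The car concept URI and the value are invented for illustration (they aren&#8217;t taken from the dataset); everything the observation points to has already been described above, so only these few statements are new:</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

# hypothetical second observation: the car URI and the value 120 are made up
&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/car&gt;
  a scovo:Item ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/car&gt; ;
  rdf:value 120 .
</code></pre>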
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part III: Defining Concept Schemes</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/137" />
    <id>http://www.jenitennison.com/blog/node/137</id>
    <published>2009-11-22T21:04:41+00:00</published>
    <updated>2009-11-22T21:04:41+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <category term="skos" />
    <summary type="html"><![CDATA[<p>This is the third instalment in a series that I&#8217;m writing about turning data into linked data. I&#8217;m using traffic count data as the example, since that&#8217;s a dataset that I&#8217;m currently working on. In the last two instalments, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling the data</a> and about <a href="http://www.jenitennison.com/blog/node/136">designing URIs</a> for the <em>things</em> in that model.</p>

<p>Within the model, there are three sets of things that are <strong>concepts</strong>:</p>

<ul>
<li>road categories</li>
<li>vehicle types</li>
<li>cardinal directions</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the third instalment in a series that I&#8217;m writing about turning data into linked data. I&#8217;m using traffic count data as the example, since that&#8217;s a dataset that I&#8217;m currently working on. In the last two instalments, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling the data</a> and about <a href="http://www.jenitennison.com/blog/node/136">designing URIs</a> for the <em>things</em> in that model.</p>

<p>Within the model, there are three sets of things that are <strong>concepts</strong>:</p>

<ul>
<li>road categories</li>
<li>vehicle types</li>
<li>cardinal directions</li>
</ul>

<p>As I discussed last time, cardinal directions have URIs defined within DBPedia which are good enough for our purposes. The categorisation of roads and vehicles, on the other hand, is something specific to UK transport data, so they are up to us to define.</p>

<p>There&#8217;s a really useful RDF vocabulary called <a href="http://www.w3.org/TR/skos-primer/">SKOS</a> which is designed precisely for defining the kind of concept schemes that we want to use here. SKOS provides classes for concepts, concept schemes and collections (groupings of concepts within a scheme), and properties for linking them and providing labels, codes, definitions and so forth. Many of the SKOS properties can be used outside concept schemes &#8212; for example <code>skos:prefLabel</code> can be used anywhere you want to indicate the preferred label for a thing &#8212; so it&#8217;s good to get to know them.</p>

<h2>Vehicle Types</h2>

<p>Before we dive into RDF, let&#8217;s take some time to understand the classification that we need to model. We&#8217;re modelling vehicle types because counts are made of each different type of vehicle passing a traffic count point over a particular hour. Within the CSV data, the relevant column headings are:</p>

<ul>
<li><code>Pedal cycles</code></li>
<li><code>Two wheeled motor vehicles</code></li>
<li><code>Cars and taxis</code></li>
<li><code>Buses and coaches</code></li>
<li><code>Light vans</code></li>
<li><code>HGVr2</code></li>
<li><code>HGVr3</code></li>
<li><code>HGVr4+</code></li>
<li><code>HGVa3/4</code></li>
<li><code>HGVa5</code></li>
<li><code>HGVa6</code></li>
<li><code>All HGV</code></li>
<li><code>All motor vehicles</code></li>
</ul>

<p>These classifications are detailed in the <a href="http://www.dft.gov.uk/matrix/forms/definitions.aspx">Department for Transport documentation of the dataset</a>. It&#8217;s clear that it&#8217;s not a flat classification, but can be arranged into a hierarchy as follows:</p>

<pre><code>+- Pedal cycles
+- All motor vehicles
   +- Two wheeled motor vehicles
   +- Cars and taxis
   +- Buses and coaches
   +- Light vans
   +- All HGV
      +- Rigid HGV
      |  +- HGVr2
      |  +- HGVr3
      |  +- HGVr4+
      +- Articulated HGV
         +- HGVa3/4
         +- HGVa5
         +- HGVa6
</code></pre>

<p>So all we have to do is define that in SKOS. We&#8217;ve already decided that the URIs will look like:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/{type}
</code></pre>

<p>so for URI-hackability reasons we&#8217;ll call the concept scheme:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/
</code></pre>

<p>It&#8217;s probably easiest to just show what the concept scheme looks like. This is in <a href="http://www.w3.org/TeamSubmission/turtle/">Turtle</a>.</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/vehicle-category/&gt; .

&lt;&gt; a skos:ConceptScheme ;
  skos:prefLabel "Vehicle Types"@en ;
  skos:hasTopConcept &lt;bicycle&gt; ;
  skos:hasTopConcept &lt;motor-vehicle&gt; .
...
&lt;motor-vehicle&gt; a skos:Concept ;
  skos:prefLabel "Motor Vehicle"@en ;
  skos:topConceptOf &lt;&gt; ;
  skos:narrower &lt;motorbike&gt; ;
  skos:narrower &lt;car&gt; ;
  skos:narrower &lt;bus&gt; ;
  skos:narrower &lt;van&gt; ;
  skos:narrower &lt;HGV&gt; .
...
&lt;HGV&gt; a skos:Concept ;
  skos:prefLabel "Heavy Goods Vehicle"@en ;
  skos:altLabel "HGV"@en ;
  skos:definition "Goods vehicles over 3,500 kgs gross vehicle weight."@en ;
  skos:scopeNote "Includes tractors (without trailers), road rollers, box vans and similar large vans. A two axle motor tractive unit without trailer is also included."@en ;
  skos:broader &lt;motor-vehicle&gt; ;
  skos:narrower &lt;HGVr&gt; ;
  skos:narrower &lt;HGVa&gt; ;
  skos:inScheme &lt;&gt; .
...
</code></pre>

<p>The properties shown here are:</p>

<ul>
<li><code>skos:prefLabel</code> - the preferred label for something; there can only be one in any given language</li>
<li><code>skos:altLabel</code> - an alternative label for the thing; there can be any number</li>
<li><code>skos:definition</code> - provides a definition of the term</li>
<li><code>skos:scopeNote</code> - provides information about the scope of the term (eg what&#8217;s included or excluded)</li>
<li><code>skos:broader</code>/<code>skos:narrower</code> - link together concepts into a hierarchy</li>
<li><code>skos:hasTopConcept</code>/<code>skos:topConceptOf</code> - links together the concept schemes and the concepts at the top of the concept hierarchy defined within the scheme</li>
<li><code>skos:inScheme</code> - points from a concept to the concept scheme it&#8217;s defined in; it&#8217;s necessary to use either this or <code>skos:topConceptOf</code> on every <code>skos:Concept</code>, otherwise it&#8217;s not clear which concept scheme they belong to</li>
</ul>
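
<p>As a sketch of how those properties fit together on one of the concepts I elided above, a top-level concept with no narrower concepts might look like this. The labels are my guess based on the <code>Pedal cycles</code> column heading rather than anything official:</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/vehicle-category/&gt; .

# labels are illustrative assumptions
&lt;bicycle&gt; a skos:Concept ;
  skos:prefLabel "Pedal Cycle"@en ;
  skos:altLabel "Bicycle"@en ;
  skos:topConceptOf &lt;&gt; .
</code></pre>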

<p>Note that in the RDF I&#8217;ve assigned every string a language (English). That&#8217;s good practice when values are textual; a Welsh translation could be provided for each one as well, for example.</p>
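
<p>For example, a second preferred label in Welsh can sit alongside the English one, because the one-per-language rule for <code>skos:prefLabel</code> is applied separately to each language tag. The Welsh wording here is my own attempt at a translation and would need checking by a Welsh speaker:</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/vehicle-category/&gt; .

# the @cy label is an unverified, illustrative translation
&lt;HGV&gt; skos:prefLabel
    "Heavy Goods Vehicle"@en ,
    "Cerbyd Nwyddau Trwm"@cy .
</code></pre>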

<h2>Road Categories</h2>

<p>Road categories are also described within the documentation for this dataset. The hierarchy is shown in the documentation as:</p>

<pre><code>+- Major Roads
|  +- Motorways
|  |  +- Trunk
|  |  +- Principal
|  +- A Roads
|     +- Trunk
|     |  +- Urban
|     |  +- Rural
|     +- Principal
|        +- Urban
|        +- Rural
+- Minor Roads
   +- B Roads
   |  +- Urban
   |  +- Rural
   +- C Roads
   |  +- Urban
   |  +- Rural
   +- Unclassified Roads
      +- Urban
      +- Rural
</code></pre>

<p>But this is actually the result of three sets of overlapping concepts:</p>

<ul>
<li>roads by classification (major/minor, motorway/A/B/C/unclassified)</li>
<li>roads by locale (urban/rural)</li>
<li>major roads by maintenance responsibility (trunk/principal)</li>
</ul>

<p>These kinds of subdivisions of concepts can be managed in SKOS through <code>skos:Collection</code>s, which group together concepts without being broader than those concepts. Here&#8217;s a snippet from the concept scheme that shows how this works.</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/road-category/&gt; .

&lt;&gt; a skos:ConceptScheme ;
  skos:prefLabel "Road Categories"@en ;
  skos:hasTopConcept &lt;major&gt; ;
  skos:hasTopConcept &lt;minor&gt; ;
  skos:hasTopConcept &lt;urban&gt; ;
  skos:hasTopConcept &lt;rural&gt; .

&lt;classification&gt; a skos:Collection ;
  skos:prefLabel "Road by Classification"@en ;
  skos:member &lt;major&gt; ;
  skos:member &lt;minor&gt; .

&lt;maintenance&gt; a skos:Collection ;
  skos:prefLabel "Major Road by Maintenance Responsibility"@en ;
  skos:member &lt;principal&gt; ;
  skos:member &lt;trunk&gt; .

&lt;major&gt; a skos:Concept ;
  skos:prefLabel "Major Road"@en ;
  skos:altLabel "Major"@en ;
  skos:scopeNote "Include motorways and A roads. These roads usually have high traffic flows and are often the main arteries to major destinations."@en ;
  skos:narrower &lt;motorway&gt; ;
  skos:narrower &lt;a&gt; ;
  skos:narrower &lt;principal&gt; ;
  skos:narrower &lt;trunk&gt; ;
  skos:topConceptOf &lt;&gt; .

&lt;motorway&gt; a skos:Concept ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
...
&lt;trunk&gt; a skos:Concept ;
  skos:prefLabel "Trunk Road"@en ;
  skos:altLabel "Trunk"@en ;
  skos:scopeNote "Most motorways and many of the long distance rural A roads are trunk roads."@en ;
  skos:note "The responsibility for the maintenance of trunk roads lies with the Secretary of State and they are managed by the Highways Agency in England, the National Assembly of Wales in Wales and the Scottish Executive in Scotland (National Through Routes)."@en ;
  skos:broader &lt;major&gt; ;
  skos:inScheme &lt;&gt; .
...
</code></pre>

<p>In a hierarchy, these multiple overlapping concepts can be shown as:</p>

<pre><code>+- &lt;Road by Classification&gt;
|  +- Major Road
|  |  +- &lt;Major Road by Classification&gt;
|  |  |  +- Motorway
|  |  |  +- A Road
|  |  +- &lt;Major Road by Maintenance Responsibility&gt;
|  |     +- Principal Road
|  |     +- Trunk Road
|  +- Minor Road
|     +- B Road
|     +- C Road
|     +- Unclassified Road
+- &lt;Road by Locale&gt;
   +- Urban Road
   +- Rural Road
</code></pre>
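
<p>The <code>&lt;Road by Locale&gt;</code> grouping in that hierarchy would be another <code>skos:Collection</code>, defined in the same way as the two in the snippet above. As a sketch, assuming we call the collection <code>&lt;locale&gt;</code> (a name I&#8217;ve picked for illustration):</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/road-category/&gt; .

# the collection URI is an assumption
&lt;locale&gt; a skos:Collection ;
  skos:prefLabel "Road by Locale"@en ;
  skos:member &lt;urban&gt; ;
  skos:member &lt;rural&gt; .
</code></pre>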

<p>That&#8217;s our concept schemes done. Next it will be time to turn to defining a vocabulary for the particular <em>things</em> that we want to describe from this dataset.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part II: Defining URIs</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/136" />
    <id>http://www.jenitennison.com/blog/node/136</id>
    <published>2009-11-22T17:23:34+00:00</published>
    <updated>2009-11-23T13:36:00+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="uri" />
    <summary type="html"><![CDATA[<p>This is the second instalment in a series of posts about how to create linked data from existing data sets, using traffic count data as an example. In the last instalment, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling data</a>. This instalment discusses the creation of URIs for the various <em>things</em> that have been identified within the model.</p>

<p>This part of the process is the same as what you&#8217;d do if you were simply creating a RESTful API to a website. The principle is that everything has a URI, and if you resolve that URI you get information about the thing.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the second instalment in a series of posts about how to create linked data from existing data sets, using traffic count data as an example. In the last instalment, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling data</a>. This instalment discusses the creation of URIs for the various <em>things</em> that have been identified within the model.</p>

<p>This part of the process is the same as what you&#8217;d do if you were simply creating a RESTful API to a website. The principle is that everything has a URI, and if you resolve that URI you get information about the thing.</p>

<!--break-->

<p>For the data.gov.uk site, we now have some <a href="http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf">guidelines about the design of URIs for the UK public sector</a>. Basically, URIs for <em>things</em> should look like:</p>

<pre><code>http://{sector}.data.gov.uk/id/{type of thing}/{thing identifier}
</code></pre>

<p>There&#8217;ll be plenty of examples in what follows.</p>

<h2>Areas</h2>

<p>Some of the things that we&#8217;ve identified as being part of the traffic count dataset already have centrally-defined identifiers. As part of other data.gov.uk work, we&#8217;ve defined URIs for administrative areas like countries, regions, local authority districts and local authorities. The templates for these URIs are:</p>

<pre><code>http://statistics.data.gov.uk/id/country/{ONS code}
http://statistics.data.gov.uk/id/government-office-region/{ONS code}
http://statistics.data.gov.uk/id/local-authority-district/{ONS code}
http://statistics.data.gov.uk/id/local-authority/{ONS code}
</code></pre>

<p>We can use these identifiers directly for the regions, districts and local authorities. But there&#8217;s a problem with the country URI: we don&#8217;t have the ONS code for the country, only the name of the country. Fortunately, we&#8217;ve also defined URIs with this pattern:</p>

<pre><code>http://statistics.data.gov.uk/id/country?name={country name}
http://statistics.data.gov.uk/id/government-office-region?name={region name}
http://statistics.data.gov.uk/id/local-authority-district?name={district name}
http://statistics.data.gov.uk/id/local-authority?name={authority name}
</code></pre>

<p>so in this situation we can use the name-based country URI and we&#8217;ll get redirected to the canonical, code-based URI.</p>

<p>Local authorities actually have two codes within the dataset that we have: the ONS code and a DfT code. I can well imagine that other datasets from the Department for Transport will only reference the DfT code, so it&#8217;s a good idea to create URIs that are based on these codes; later on, we can state that the two identifiers actually mean exactly the same thing.</p>

<pre><code>http://transport.data.gov.uk/id/local-authority-district/{DfT code}
http://transport.data.gov.uk/id/local-authority/{DfT code}
</code></pre>

<p>So given the record:</p>

<pre><code>"England","North West","B",4315.00,"00BZ","St.Helens Metropolitan Borough Council",
4,"U",,"Unclassified Urban",,
,352100,398200,
7/6/2001 00:00:00,"N",7,1,0,5,1,0,0,0,0,0,0,0,0,6
</code></pre>

<p>the URIs we&#8217;ve defined so far are:</p>

<pre><code>http://statistics.data.gov.uk/id/country?name=England
http://statistics.data.gov.uk/id/government-office-region/B
http://statistics.data.gov.uk/id/local-authority-district/00BZ
http://statistics.data.gov.uk/id/local-authority/00BZ
http://transport.data.gov.uk/id/local-authority-district/4315
http://transport.data.gov.uk/id/local-authority/4315
</code></pre>
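<p>As a rough sketch, the templates above can be filled in mechanically from the fields of a record. The Python below is only an illustration: the function name <code>area_uris</code> is hypothetical, and it assumes that the DfT code should be used without its trailing <code>.00</code>, as in the URIs listed above.</p>

```python
from urllib.parse import quote

STATS = "http://statistics.data.gov.uk/id"
TRANSPORT = "http://transport.data.gov.uk/id"


def area_uris(country_name, ons_region, ons_la, dft_la):
    """Build the area URIs for one record from the templates above."""
    # Assumption: the DfT code appears as 4315.00 in the CSV and only
    # the integer part belongs in the URI.
    dft_code = str(int(float(dft_la)))
    return [
        # Name-based URI, which should redirect to the code-based one.
        f"{STATS}/country?name={quote(country_name)}",
        f"{STATS}/government-office-region/{ons_region}",
        f"{STATS}/local-authority-district/{ons_la}",
        f"{STATS}/local-authority/{ons_la}",
        f"{TRANSPORT}/local-authority-district/{dft_code}",
        f"{TRANSPORT}/local-authority/{dft_code}",
    ]


for uri in area_uris("England", "B", "00BZ", "4315.00"):
    print(uri)
```

<p>Percent-encoding the country name matters for names that contain spaces, which would otherwise produce an invalid URI.</p>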

<h2>Roads</h2>

<p>Now we&#8217;re onto things that aren&#8217;t defined already. First is roads. If there&#8217;s a road number, the obvious thing to use is that road number; something like:</p>

<pre><code>http://transport.data.gov.uk/id/road/{road number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/road/B3178
</code></pre>

<p>If there isn&#8217;t a road number, we&#8217;ll have to construct a URI. Since each count point is on one particular road, we can use the identifier of the count point to identify the road, so:</p>

<pre><code>http://transport.data.gov.uk/id/road/{class}-{count point number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/road/U-4
</code></pre>
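<p>The two road patterns can be combined into a single function. This is a minimal sketch that rests on one assumption not stated above: that a purely alphabetic value in the <code>Road Number</code> field (such as <code>U</code>) is a road class rather than a real road number. The name <code>road_uri</code> is hypothetical.</p>

```python
ROAD_BASE = "http://transport.data.gov.uk/id/road"


def road_uri(road_number, count_point_number):
    """Return a road URI, falling back to the count point when unnumbered."""
    # Assumption: real road numbers contain digits (B3178), whereas a
    # bare class letter (U) means the road has no number of its own.
    if road_number.isalpha():
        return f"{ROAD_BASE}/{road_number}-{count_point_number}"
    return f"{ROAD_BASE}/{road_number}"


print(road_uri("B3178", 13))
print(road_uri("U", 4))
```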

<h2>Count Points</h2>

<p>Count points can be identified through their number, so it makes sense to use that in the URI:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4
</code></pre>

<h2>Counts</h2>

<p>The counts themselves don&#8217;t have their own identifiers, but they can be identified through a combination of the count point that they&#8217;re associated with, the direction of travel of the traffic that&#8217;s being counted, and the date and time that the count is made. So we can create a URI that combines these things. To aid hackability, I&#8217;m going to build on top of the traffic count point URI that we&#8217;ve already defined:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}/direction/{direction}/hour/{time}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4/direction/N/hour/2001-06-07T07:00:00
</code></pre>

<h2>Observations</h2>

<p>Again, observations build on top of the counts by adding a vehicle type to the mix, so we can construct URIs that reflect that:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}/direction/{direction}/hour/{time}/type/{vehicle type}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4/direction/N/hour/2001-06-07T07:00:00/type/motor-vehicle
</code></pre>
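<p>Here is a sketch of how the count and observation URIs might be generated from the <code>Date of count</code>, <code>Hour of count</code> and <code>Direction of flow</code> fields. It assumes that the dates in the CSV are day/month/year (which is what the example URI above implies), and the function names are hypothetical.</p>

```python
from datetime import datetime, timedelta

POINT_BASE = "http://transport.data.gov.uk/id/traffic-count-point"


def count_uri(count_point, direction, date_of_count, hour):
    """Build the URI for the count at one point, direction and hour."""
    # Assumption: dates look like "7/6/2001 00:00:00", meaning 7th June 2001.
    day = datetime.strptime(date_of_count, "%d/%m/%Y %H:%M:%S")
    start = day + timedelta(hours=int(hour))
    return (f"{POINT_BASE}/{count_point}/direction/{direction}"
            f"/hour/{start.isoformat()}")


def observation_uri(count, vehicle_type):
    """Extend a count URI with the vehicle type being observed."""
    return f"{count}/type/{vehicle_type}"


count = count_uri(4, "N", "7/6/2001 00:00:00", 7)
print(count)
print(observation_uri(count, "motor-vehicle"))
```

<p>Because the observation URI is built by appending to the count URI, trimming the last two path segments from an observation gives back its count, which is the hackability mentioned above.</p>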

<h2>Road Categories</h2>

<p>Road categories are a bit different from the kinds of things that we&#8217;ve been talking about so far: they are concepts. For these URIs we use a slightly different pattern from the URIs above: <code>/def/</code> rather than <code>/id/</code>. For road categories we can use:</p>

<pre><code>http://transport.data.gov.uk/def/road-category/{category}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/def/road-category/motorway
</code></pre>

<h2>Vehicle Types</h2>

<p>Vehicle types are also concepts, so have similar URIs:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/{type}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/HGVa5
</code></pre>

<h2>Cardinal Directions</h2>

<p>Cardinal directions are also concepts, but really they are global concepts, not specific to transport, or even to the UK. So it feels a bit strange to use URIs for them that imply that they somehow belong to data.gov.uk.</p>

<p>Fortunately, for this kind of general concept we can use URIs defined by <a href="http://dbpedia.org">DBPedia</a>. DBPedia is a linked data view on Wikipedia, so it has URIs for everything that Wikipedia has a page about, making it an excellent general purpose resource. The relevant URIs for the cardinal directions are:</p>

<pre><code>http://dbpedia.org/resource/North
http://dbpedia.org/resource/South
http://dbpedia.org/resource/East
http://dbpedia.org/resource/West
</code></pre>

<p>so that&#8217;s what we&#8217;ll use.</p>
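<p>Mapping the direction codes in the data onto these URIs is then a simple lookup. In this sketch the codes <code>N</code> and <code>E</code> come from the example records; <code>S</code> and <code>W</code> are assumed by analogy, and <code>direction_uri</code> is a hypothetical name.</p>

```python
# N and E appear in the example records; S and W are assumed by analogy.
DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West"}


def direction_uri(code):
    """Return the DBpedia URI for a one-letter direction code."""
    return f"http://dbpedia.org/resource/{DIRECTIONS[code]}"


print(direction_uri("N"))
```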

<h2>Dates, Times and Periods</h2>

<p>For dates, times and periods, we can use the URIs provided by another general-purpose linked data resource: <a href="http://www.placetime.com/">placetime.com</a>. URIs for instants have the pattern:</p>

<pre><code>http://placetime.com/instant/gregorian/{dateTime}
</code></pre>

<p>while periods have the pattern:</p>

<pre><code>http://placetime.com/interval/gregorian/{dateTime}/{duration}
</code></pre>

<p>So the hour from 7-8am on 7th June 2001 would be:</p>

<pre><code>http://placetime.com/interval/gregorian/2001-06-07T07:00:00/PT1H
</code></pre>

<p>and the year 2001 would be:</p>

<pre><code>http://placetime.com/interval/gregorian/2001-01-01T00:00:00/P1Y
</code></pre>

<p>The thing is that the latter isn&#8217;t particularly approachable. Calendar years are used all over the place, so it would be nice to have a set of URIs for them that we use consistently. Again, DBPedia provides URIs for every year, such as:</p>

<pre><code>http://dbpedia.org/resource/2001
</code></pre>

<p>so where we need to refer to a calendar year, it would be good to reuse that.</p>
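<p>Put together, the time-related URIs for a count could be generated along these lines. This is only a sketch: the function names are hypothetical, and it assumes that every count period lasts exactly one hour (<code>PT1H</code>).</p>

```python
from datetime import datetime

PLACETIME = "http://placetime.com"


def instant_uri(moment):
    """URI for a single instant, following the placetime.com pattern."""
    return f"{PLACETIME}/instant/gregorian/{moment.isoformat()}"


def hour_uri(start):
    """URI for the hour-long interval beginning at start."""
    return f"{PLACETIME}/interval/gregorian/{start.isoformat()}/PT1H"


def year_uri(year):
    """DBpedia URI for a calendar year."""
    return f"http://dbpedia.org/resource/{year}"


start = datetime(2001, 6, 7, 7)
print(instant_uri(start))
print(hour_uri(start))
print(year_uri(start.year))
```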

<hr />

<p>And that completes the sets of URIs that we need for this data. Stay tuned.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part I: Analysing and Modelling</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/135" />
    <id>http://www.jenitennison.com/blog/node/135</id>
    <published>2009-11-22T16:58:17+00:00</published>
    <updated>2009-11-22T16:58:17+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <summary type="html"><![CDATA[<p>One of the goals of the government&#8217;s Data Project is to equip the people who own data with the capability to publish it as linked data. There&#8217;s an overwhelming amount of work to do here from providing tool support to changing a culture that makes it hard to publish data. But we can start by taking some baby steps that simply explain what&#8217;s involved in turning existing data into linked data.</p>

<p>I&#8217;m currently reworking the traffic count linked data that I first transformed back in September, and I thought it would be helpful to talk through that process for several reasons:</p>

<ul>
<li>to give people using the traffic count data more insight into how it fits together</li>
<li>so that other people can follow it as they transform their own data</li>
<li>so that tool providers can spot some of the places where tools might help</li>
</ul>

<p>Rather than creating one massive blog post, I&#8217;m going to break it down into several steps. These are:</p>

<ol>
<li>analysing and modelling</li>
<li>defining URIs</li>
<li>defining concept schemes</li>
<li>defining classes, properties and datatypes</li>
<li>adding finishing touches</li>
</ol>

<p>This is the first instalment.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>One of the goals of the government&#8217;s Data Project is to equip the people who own data with the capability to publish it as linked data. There&#8217;s an overwhelming amount of work to do here from providing tool support to changing a culture that makes it hard to publish data. But we can start by taking some baby steps that simply explain what&#8217;s involved in turning existing data into linked data.</p>

<p>I&#8217;m currently reworking the traffic count linked data that I first transformed back in September, and I thought it would be helpful to talk through that process for several reasons:</p>

<ul>
<li>to give people using the traffic count data more insight into how it fits together</li>
<li>so that other people can follow it as they transform their own data</li>
<li>so that tool providers can spot some of the places where tools might help</li>
</ul>

<p>Rather than creating one massive blog post, I&#8217;m going to break it down into several steps. These are:</p>

<ol>
<li>analysing and modelling</li>
<li>defining URIs</li>
<li>defining concept schemes</li>
<li>defining classes, properties and datatypes</li>
<li>adding finishing touches</li>
</ol>

<p>This is the first instalment.</p>

<!--break-->

<hr />

<p>The first thing to do is to work out what <em>things</em> the data we have contains information about. One way of thinking about this is to try to identify anything in the data that you can imagine wanting to get information about. This process is exactly the same as the process you would go through when designing a database or an XML format.</p>

<p>It&#8217;s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.</p>

<p>The column headers in the traffic count dataset are:</p>

<pre><code>"Country Name","Region Name","ONS Region Code","DfT LA Code","ONS LA Code",
"Local Authority Name","Count Point Number","Road Number","Road Sequence",
"Road Category","Road Name at CP","CP Location","Site coordinate easting",
"Site coordinate northin","Date of count","Direction of flow","Hour of count",
"Pedal cycles","Two wheeled motor vehicles","Cars and taxis","Buses and coaches",
"Light vans","HGVr2","HGVr3","HGVr4+","HGVa3/4","HGVa5","HGVa6","All HGV",
"All motor vehicles"
</code></pre>

<p>A couple of example records are:</p>

<pre><code>"England","North West","B",4315.00,"00BZ","St.Helens Metropolitan Borough Council",
4,"U",,"Unclassified Urban",,
,352100,398200,
7/6/2001 00:00:00,"N",7,1,0,5,1,0,0,0,0,0,0,0,0,6
</code></pre>

<p>and:</p>

<pre><code>"England","South West","K",1115.00,"18","Devon County Council",
13,"B3178",,"B Urban","Salterton Road",
"Salterton Road, EAST OF DINAN WAY, EXMOUTH",302600,81984,
8/10/2001 00:00:00,"E",17,2,2,400,5,41,0,2,0,0,0,0,2,450
</code></pre>

<p>After a bit of reading <a href="http://www.dft.gov.uk/matrix/forms/definitions.aspx">documentation</a> and poking around in the data, it emerges that we can group these together as follows:</p>

<ul>
<li>fields about <strong>countries</strong>
<ul><li><code>Country Name</code></li></ul></li>
<li>fields about <strong>regions</strong>
<ul><li><code>Region Name</code></li>
<li><code>ONS Region Code</code> (ONS = Office for National Statistics)</li></ul></li>
<li>fields about <strong>local authorities</strong>
<ul><li><code>DfT LA Code</code> (DfT = Department for Transport)</li>
<li><code>ONS LA Code</code></li>
<li><code>Local Authority Name</code></li></ul></li>
<li>fields about <strong>roads</strong>
<ul><li><code>Road Number</code> (not available for unclassified roads)</li></ul></li>
<li>fields about <strong>count points</strong>
<ul><li><code>Count Point Number</code></li>
<li><code>Road Sequence</code> (indicates the order of count points along a particular road; only applies to count points on long roads)</li>
<li><code>Road Category</code> (roads can have different categories at different points; for example the A1 is sometimes &#8220;Trunk Motorway&#8221; and sometimes &#8220;Principal A Urban&#8221;)</li>
<li><code>Road Name at CP</code> (roads can have different names at different points; for example the A240 is sometimes &#8220;Surbiton Hill Road&#8221;, sometimes &#8220;Reigate Road&#8221;, sometimes something else)</li>
<li><code>CP Location</code> (a description of the location, sometimes missing)</li>
<li><code>Site coordinate easting</code></li>
<li><code>Site coordinate northing</code></li></ul></li>
<li>fields about <strong>counts</strong>
<ul><li><code>Date of count</code></li>
<li><code>Direction of flow</code> (one of the cardinal directions; observations at a particular count point will be made in both directions the road goes in)</li>
<li><code>Hour of count</code> (observations each cover an hour of traffic, from 7am to 7pm)</li></ul></li>
<li>fields that are counts
<ul><li><code>Pedal cycles</code></li>
<li><code>Two wheeled motor vehicles</code></li>
<li><code>Cars and taxis</code></li>
<li><code>Buses and coaches</code></li>
<li><code>Light vans</code></li>
<li><code>HGVr2</code> (rigid HGVs with two axles)</li>
<li><code>HGVr3</code></li>
<li><code>HGVr4+</code></li>
<li><code>HGVa3/4</code> (articulated HGVs with three or four axles)</li>
<li><code>HGVa5</code></li>
<li><code>HGVa6</code></li>
<li><code>All HGV</code></li>
<li><code>All motor vehicles</code></li></ul></li>
</ul>
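<p>To make the grouping concrete, here is a sketch that parses the first example record and sorts its fields into the groups listed above. The dictionary keys are hypothetical names chosen for illustration, and the record has been joined onto a single line.</p>

```python
import csv

# The first example record, joined onto a single line.
RECORD = (
    '"England","North West","B",4315.00,"00BZ","St.Helens Metropolitan Borough Council",'
    '4,"U",,"Unclassified Urban",,'
    ',352100,398200,'
    '7/6/2001 00:00:00,"N",7,1,0,5,1,0,0,0,0,0,0,0,0,6'
)

VEHICLE_TYPES = [
    "Pedal cycles", "Two wheeled motor vehicles", "Cars and taxis",
    "Buses and coaches", "Light vans", "HGVr2", "HGVr3", "HGVr4+",
    "HGVa3/4", "HGVa5", "HGVa6", "All HGV", "All motor vehicles",
]

fields = next(csv.reader([RECORD]))

record = {
    "country": fields[0],
    "region": {"name": fields[1], "ons_code": fields[2]},
    "local_authority": {
        "dft_code": fields[3], "ons_code": fields[4], "name": fields[5],
    },
    "count_point": {
        "number": fields[6], "easting": fields[12], "northing": fields[13],
    },
    "road": {"number": fields[7]},
    "count": {"date": fields[14], "direction": fields[15], "hour": fields[16]},
    # The remaining 13 columns are the counts, one per vehicle type.
    "observations": dict(zip(VEHICLE_TYPES, map(int, fields[17:]))),
}

print(record["count"])
print(record["observations"]["All motor vehicles"])
```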

<p>Some of the fields contain information about things (countries, regions etc), and some of them contain the actual data themselves, with the field names telling us about their type (ie the counts of various sorts). The <code>Road Category</code> field actually contains a whitespace-separated list of road categories about which we can imagine wanting to have more information (like what, exactly, is a &#8216;principal&#8217; road?). So as a first cut, the things in the data are:</p>

<ul>
<li>countries</li>
<li>regions</li>
<li>local authorities</li>
<li>roads</li>
<li>road categories</li>
<li>count points</li>
<li>counts</li>
<li>vehicle types</li>
</ul>
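<p>Splitting the <code>Road Category</code> field into its individual categories is straightforward. In this sketch the lower-casing is an assumption, made so that the results resemble short identifiers, and <code>road_categories</code> is a hypothetical name.</p>

```python
def road_categories(value):
    """Split a Road Category value into its individual categories."""
    # Assumption: categories are lower-cased so they can act as identifiers.
    return [part.lower() for part in value.split()]


print(road_categories("Principal A Urban"))
print(road_categories("Trunk Motorway"))
```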

<p>There are also implied relationships between the various things that are described within the dataset, that can be identified through the co-occurrence of things within the same record. For example, all records that contain <code>"North West"</code> as a value for <code>Region Name</code> have <code>"England"</code> as the value for <code>Country Name</code>, so we can tell from the data that England contains the North West.</p>
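<p>This kind of co-occurrence check can be sketched in a few lines. The rows here are a tiny hand-made sample rather than the real dataset, and <code>cooccurrences</code> is a hypothetical helper.</p>

```python
from collections import defaultdict


def cooccurrences(rows, child, parent):
    """Map each child value to the set of parent values it appears with."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[child]].add(row[parent])
    return dict(seen)


rows = [
    {"Country Name": "England", "Region Name": "North West"},
    {"Country Name": "England", "Region Name": "South West"},
    {"Country Name": "England", "Region Name": "North West"},
]

regions = cooccurrences(rows, "Region Name", "Country Name")

# A region that only ever appears with one country can be treated as
# being contained by that country.
contained_in = {
    region: next(iter(countries))
    for region, countries in regions.items()
    if len(countries) == 1
}
print(contained_in)
```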

<p>There&#8217;s obviously some kind of relationship between local authorities and regions (eg a <code>Local Authority Name</code> of <code>"Surrey County Council"</code> implies a <code>Region Name</code> of <code>"South East"</code>), but it&#8217;s hard to put a name to it. The relationship becomes more obvious if we introduce a new type of thing: a <strong>local authority district</strong>. Then we can say that a local authority covers a local authority district which is within a region.</p>

<p>So a first cut at a model is:</p>

<pre><code>                    +---- country
                    |        |
                    |        | contains
           contains |        v
                    |     region
                    |        |
                    |        | contains
                    v        v           covers
              local authority district &lt;--------- local authority
                         |
                         | contains
            on           v          category
  road &lt;----------- count point -----------------&gt; road category
                         ^
                         | at
                         |            of
                       count --------------------&gt; vehicle type
</code></pre>

<p>This is how I modelled it the first time round. It&#8217;s a pretty pure conceptual model: I haven&#8217;t taken into consideration how the model&#8217;s going to be represented (you could use the model to create some database tables, some XML or some RDF) or how it&#8217;s going to be queried.</p>

<p>But the fact is that it&#8217;s going to be represented as linked data, using RDF and queried with SPARQL. Having had experience of querying the model as represented above, I&#8217;m going to make three changes:</p>

<ol>
<li><p>During a given hour at a given count point, the counts of all the different types of traffic are made by the same observer (be it human or electronic). It feels as if that set of counts (which are all represented within a single record in the CSV) belong together in some way. It also feels like it would be useful to be able to talk about that set of counts as a set, because they&#8217;re all going to be affected by the same factors (eg faults in the machine recording the counts; traffic jams; roadworks), and even though it&#8217;s not present in this dataset, we might want somewhere to hang that information. Pulling that duplicated data out into a separate <em>thing</em> will also help reduce the repetition and the number of triples needed in the RDF, which will speed up searching.</p></li>
<li><p>The data itself records the actual date on which the observation was taken, and the hour in which it was made. This could be represented just by a start date/time like <code>2001-06-07T07:00:00</code>, and that&#8217;s how I did it initially. However, when you&#8217;re trying to analyse the data, the things that really matter about the observation, most of the time, are the year and the hour. SPARQL isn&#8217;t very good at doing date/time processing, and anyway there are all sorts of things about a particular date that might be interesting (what day of the week is it? is it during the school summer holiday?) so it makes sense to pull these hour-long intervals out as separate <em>things</em> that we can talk about.</p></li>
<li><p>In a similar way, although the cardinal direction associated with a particular count could be represented as a simple string, whenever there&#8217;s a set of enumerated values it&#8217;s a good idea to consider turning them into <em>things</em>, because doing so enables you to associate extra information with them. For example, it would let us say that the English word for North is North, and that North is the opposite direction to South, and so on.</p></li>
</ol>

<p>So here&#8217;s what the revised model looks like:</p>

<pre><code>                   +---- country
                   |        |
                   |        | contains
          contains |        v
                   |     region
                   |        |
                   |        | contains
                   v        v            covers
             local authority district &lt;----------- local authority
                        | 
                        | contains
            on          v           category
 road &lt;----------- count point -------------------&gt; road category
                        ^
                        | at
               in       |     at           during
direction &lt;---------- count -----&gt; period --------&gt; year
                        ^            +-----start--&gt; instant
                        | at         +-----end----&gt; instant
                        |              of
                   observation -------------------&gt; vehicle type
</code></pre>

<p>and here&#8217;s a list of the things with a rough set of properties; properties marked with a question mark (?) are ones that don&#8217;t always have values in the data. Properties marked with an arrow (->) are ones that are pointers to other things.</p>

<ul>
<li>country
<ul><li>name</li></ul></li>
<li>region
<ul><li>name</li>
<li>ONS code</li>
<li>country -></li></ul></li>
<li>local authority district
<ul><li>name</li>
<li>ONS code</li>
<li>DfT code</li>
<li>local authority -></li>
<li>region -></li>
<li>country -></li></ul></li>
<li>local authority
<ul><li>name</li>
<li>ONS code</li>
<li>DfT code</li>
<li>local authority district -></li></ul></li>
<li>road
<ul><li>number ?</li></ul></li>
<li>count point
<ul><li>number</li>
<li>road sequence ?</li>
<li>road name ?</li>
<li>description ?</li>
<li>easting</li>
<li>northing</li>
<li>road -></li>
<li>road category -></li>
<li>local authority -></li>
<li>local authority district -></li></ul></li>
<li>road category
<ul><li>name</li>
<li>related categories -></li></ul></li>
<li>count
<ul><li>direction -></li>
<li>count point -></li>
<li>period -></li></ul></li>
<li>direction
<ul><li>name</li>
<li>related directions -></li></ul></li>
<li>observation
<ul><li>value</li>
<li>count -></li>
<li>vehicle type -></li></ul></li>
<li>vehicle type
<ul><li>name</li>
<li>related types -></li></ul></li>
<li>period
<ul><li>start (instant) -></li>
<li>end (instant) -></li>
<li>year (period) -></li></ul></li>
<li>instant
<ul><li>year, month, day, hour, minute, second etc.</li></ul></li>
</ul>
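<p>To illustrate the split between counts and observations, here is a sketch of those two things as Python data classes. The class and field names are hypothetical, and the period is reduced to its start time for brevity.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Count:
    count_point: int
    direction: str
    period_start: str  # start of the hour, e.g. "2001-06-07T07:00:00"


@dataclass(frozen=True)
class Observation:
    count: Count  # pointer back to the shared count
    vehicle_type: str
    value: int


def observations_from_record(count_point, direction, period_start, values):
    """Create one Count and an Observation per vehicle type pointing at it."""
    count = Count(count_point, direction, period_start)
    return [Observation(count, vehicle_type, value)
            for vehicle_type, value in values.items()]


observations = observations_from_record(
    4, "N", "2001-06-07T07:00:00",
    {"Cars and taxis": 5, "Buses and coaches": 1},
)
for observation in observations:
    print(observation.vehicle_type, observation.value)
```

<p>Every observation from one record shares a single <code>Count</code>, so anything that affects the whole set (such as a faulty counter) would only need to be recorded once.</p>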

<p>This is the first part of the process done. More soon.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Publishing Information About Inward Links</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/134" />
    <id>http://www.jenitennison.com/blog/node/134</id>
    <published>2009-11-08T19:30:20+00:00</published>
    <updated>2009-11-08T19:30:20+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="legislation" />
    <category term="linked data" />
    <summary type="html"><![CDATA[<p>In the Linked Data world, we talk a lot about having URIs that are identifiers for things, and making them HTTP URIs so that they can be dereferenced and people can find <em>more</em> information about those things.</p>

<p>This raises the question of &#8220;what information should you publish?&#8221; Let&#8217;s make this concrete by using a real example: <a href="http://www.opsi.gov.uk/legislation-api/">UK Legislation</a>, which 
<a href="http://www.tso.co.uk/">TSO</a> is publishing for <a href="http://www.opsi.gov.uk/">OPSI</a> as Linked Data.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>In the Linked Data world, we talk a lot about having URIs that are identifiers for things, and making them HTTP URIs so that they can be dereferenced and people can find <em>more</em> information about those things.</p>

<p>This raises the question of &#8220;what information should you publish?&#8221; Let&#8217;s make this concrete by using a real example: <a href="http://www.opsi.gov.uk/legislation-api/">UK Legislation</a>, which 
<a href="http://www.tso.co.uk/">TSO</a> is publishing for <a href="http://www.opsi.gov.uk/">OPSI</a> as Linked Data.</p>

<p>UK Legislation now has a set of URIs that are explicitly intended to be used as unique identifiers for items of legislation and parts, sections, subsections and so on within them. If you request one of these URIs, requesting RDF/XML, you will get some information about that bit of legislation, such as:</p>

<ul>
<li>bibliographic metadata such as its title, publisher, created date and so on</li>
<li>links to other related sections or items of legislation</li>
<li>links to particular versions of that bit of legislation</li>
</ul>

<p>So we provide some basic information, and the links we know about, ie those within UK Legislation.</p>

<p>It turns out that lots of things aside from UK Legislation reference legislation, and that when you publish information about them it&#8217;s helpful to be able to point to the relevant legislation. For example:</p>

<ul>
<li>the Home Office relate offences to sections of legislation that state that a particular activity is illegal and has a certain maximum penalty</li>
<li>local authorities are bound to provide certain services by law, so there&#8217;s a natural pointer from the definition of a service to that law</li>
<li>administrative areas such as counties and local authorities are defined by law, so when the Ordnance Survey publish information about those areas, it helps to point to the law in which their names are legally defined as the authority on which their statements are based</li>
<li>the publication of notices posted within the London Gazette is enforced by legislation, and the text of the notices usually indicates which piece of legislation caused the notice to be published</li>
</ul>

<p>These are all inward pointers. As we publish information about UK Legislation, we won&#8217;t know about all these links to the information we publish. But people who access information about UK Legislation might well want to know about those links. Wouldn&#8217;t it be useful to know &#8212; given an item of legislation &#8212; what it makes illegal, what it compels local authorities to do, which administrative areas it defines, which notices it has caused to be published?</p>

<p>We were discussing the same issue the other day in respect of spatial objects. The Ordnance Survey, or other organisations peddling spatial data, may define spatial objects, but other people define the things that those spatial objects represent, such as schools, roads, parks and so on. It&#8217;s obviously useful to go from a school to the spatial objects that represent its buildings, but it would also be useful to go from a spatial object that is a school building to the school.</p>

<p>So what should we, as publishers, do about the inward links (that we know about)? When we publish information about something should we also try to publish information about the things that (we know) reference that thing? I think the answer&#8217;s &#8220;yes,&#8221; at the very least in any human-readable access we give to the information. And from that come two further thoughts:</p>

<ul>
<li><p>If you are publishing data with outward links, it would be a good idea to provide feeds or other mechanisms that enable people to pull in basic information about the things that you&#8217;re publishing that link to something they&#8217;re publishing. SPARQL queries would do, but something a bit less general purpose and more approachable &#8212; I&#8217;m thinking a URL like <code>http://example.org/links?url=http://example.net/linked/resource</code> &#8212; would be better.</p></li>
<li><p>Information from another source is going to have different provenance/trust etc characteristics than the primary information you publish. That needs to be clearly indicated <em>somehow</em>; sounds to me like a requirement for named graphs.</p></li>
</ul>
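<p>As a sketch of what a client of such a links service might do, the following builds the lookup URL for a given resource. The service address is the hypothetical <code>example.org</code> one from above; the only difference is that the resource URI is percent-encoded so that it can&#8217;t be confused with the query string of the service itself.</p>

```python
from urllib.parse import urlencode


def inward_links_url(service, resource):
    """Build the URL that asks a links service about one resource."""
    # Percent-encode the resource URI so that its own punctuation does
    # not interfere with the query string.
    query = urlencode({"url": resource})
    return f"{service}?{query}"


print(inward_links_url("http://example.org/links",
                       "http://example.net/linked/resource"))
```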
    ]]></content>
  </entry>
  <entry>
    <title>Establishing Trust by Describing Provenance</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/133" />
    <id>http://www.jenitennison.com/blog/node/133</id>
    <published>2009-10-24T20:16:05+00:00</published>
    <updated>2009-11-08T10:04:01+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p><em>Update 2009-11-08: The developers of the Provenance Vocabulary tell me that the pattern I used below isn&#8217;t correct, and there doesn&#8217;t currently seem to be a method of describing what I want to describe using that vocabulary. But it&#8217;s still under development, so hopefully it will become usable soon.</em></p>

<p>One of my favourite tweets from Rob McKinnon (aka <a href="http://www.twitter.com/delineator">@delineator</a>) is this one:</p>

<p><img src="/blog/files/delineatorquote.jpg" alt="feeling upset RDF enthusiasts oversell RDF, ignoring creation, provenance, ambiguity, subjectivity + versioning problems #linkeddata #london" style="width: 80%; margin-left: 10%; margin-right: 10%" /></p>

<p>because it&#8217;s one of the things that bugs me on occasion too, and because the issues he mentions are so vitally important when we&#8217;re talking about public sector information but (because they&#8217;re the hard issues) are easy to de-prioritise in the rush to make data available.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p><em>Update 2009-11-08: The developers of the Provenance Vocabulary tell me that the pattern I used below isn&#8217;t correct, and there doesn&#8217;t currently seem to be a method of describing what I want to describe using that vocabulary. But it&#8217;s still under development, so hopefully it will become usable soon.</em></p>

<p>One of my favourite tweets from Rob McKinnon (aka <a href="http://www.twitter.com/delineator">@delineator</a>) is this one:</p>

<p><img src="/blog/files/delineatorquote.jpg" alt="feeling upset RDF enthusiasts oversell RDF, ignoring creation, provenance, ambiguity, subjectivity + versioning problems #linkeddata #london" style="width: 80%; margin-left: 10%; margin-right: 10%" /></p>

<p>because it&#8217;s one of the things that bugs me on occasion too, and because the issues he mentions are so vitally important when we&#8217;re talking about public sector information but (because they&#8217;re the hard issues) are easy to de-prioritise in the rush to make data available.</p>

<!--break-->

<p>Let&#8217;s go back to basics: <strong>How do you know whether you can trust a piece of information?</strong> Think of an infographic in your daily newspaper showing the results of a survey. You could just decide based on your trust in the newspaper itself. But if you&#8217;re feeling suspicious, there are any number of things that you need to trust:</p>

<ul>
<li>the designer who created the infographic, that they didn&#8217;t skew the graphic to make it imply something that the data doesn&#8217;t warrant</li>
<li>the data munger who cleaned up the data and supplied it to the designer, that they didn&#8217;t introduce errors into the data while cleaning it up</li>
<li>the organisation who published the data, that they aggregated it accurately and published all the results</li>
<li>the organisation who conducted the survey, that they surveyed a large enough and representative enough collection of people and collated the results accurately</li>
<li>the people who responded to the survey, that they didn&#8217;t lie</li>
</ul>

<p>What enables us to determine how much to trust the results of each of these steps? Well, if the organisation who conducted the survey published enough information such that anyone could replicate the survey if they wanted to, we&#8217;re more likely to trust their results. That&#8217;s at least partly because in order to publish sufficient information for someone to replicate the study, they have to go into the kind of detail that necessarily exposes biases the study might have, but also because they&#8217;re unlikely to be that open about what they did if they were trying to cover something up. That&#8217;s why <a href="http://en.wikipedia.org/wiki/Reproducibility">reproducibility</a> is one of the fundamental principles in the <a href="http://en.wikipedia.org/wiki/Scientific_method">scientific method</a>.</p>

<p>That same principle applies at the data-processing end of the chain of processes that led to the infographic. We will trust the infographic more if we can get hold of</p>

<ol>
<li>the raw survey results (<strong>open data</strong>)</li>
<li>details about all the programs that aggregated, cleaned up or otherwise transformed the data, including their source code (<strong>open source</strong>)</li>
</ol>

<p>because with these details we could, if we chose, replicate the resulting data and create our own visualisation of it.</p>

<p>The question is, given some RDF, how do we provide enough detail about how it was generated to enable others to work out whether to trust it or not?</p>

<p>The answer should come as no surprise: &#8220;With RDF!&#8221;</p>

<p>I&#8217;ve been looking at a couple of vocabularies for recording provenance: the <a href="http://openprovenance.org/">Open Provenance Model</a> and the <a href="http://sourceforge.net/apps/mediawiki/trdf/index.php?title=Guide_to_the_Provenance_Vocabulary">Provenance Vocabulary</a>.</p>

<p>The Open Provenance Model is a general purpose model that can be expressed in RDF. It splits the world into three main things:</p>

<ul>
<li><strong>Artifacts</strong> which are things that you might want to record the provenance of</li>
<li><strong>Processes</strong> which are things that happen to artifacts</li>
<li><strong>Agents</strong> who initiate processes</li>
</ul>

<p>These three types of things interact with each other in three main ways:</p>

<ul>
<li>artifacts are generated by processes</li>
<li>processes use artifacts</li>
<li>processes are controlled by agents</li>
</ul>

<p>and two subsidiary ones which occur as a result of these:</p>

<ul>
<li>artifacts are derived from other artifacts</li>
<li>processes trigger other processes</li>
</ul>

<p>So far so good. But then it starts getting complicated. A given process might use and generate several artifacts (for example, an XSLT transformation might use a source document and a stylesheet, and generate an index and a number of pages), so each of the three main relationships above is qualified through the use of a particular <strong>role</strong>.</p>

<p>Further, different artifacts might be used or created at different times in the process, so each of the relationships is also qualified with a timestamp. (The Open Provenance Model is built to describe processes that might take days, or even longer, with different bits of information coming into play at different times.)</p>

<p>And then to add one more twist, each provenance graph is just one possible <strong>account</strong> of the history of an item; in particular a different account might break down the processes into subprocesses, or aggregate the processes differently.</p>

<p>The complications of having timestamps on relationships, and having multiple accounts, mean that describing provenance with the Open Provenance Model is a little tedious, especially when you&#8217;re mostly concerned about the provenance of data (as opposed to, say, a car).</p>
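
<p>To make that concrete, here&#8217;s a rough sketch of what a single &#8220;process used artifact&#8221; relationship ends up looking like once it has to carry a role, a timestamp and an account. The <code>ex:</code> terms are made up purely for illustration; they are not the Open Provenance Model&#8217;s own RDF vocabulary.</p>

<pre><code>@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix ex: &lt;http://example.org/opm-sketch#&gt; .

# hypothetical terms throughout: ex: is not a real OPM namespace
ex:transformation a ex:Process .
ex:sourceDocument a ex:Artifact .

# the "used" relationship becomes a node of its own, so that
# the role, time and account have something to hang off
[] a ex:Used ;
  ex:process ex:transformation ;
  ex:artifact ex:sourceDocument ;
  ex:role "source document" ;
  ex:time "2009-10-24T12:20:00Z"^^xsd:dateTime ;
  ex:account ex:myAccount .
</code></pre>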

<p>The Provenance Vocabulary has the same basic types, which they call Artifacts, Executions and Actors, but the different roles that artifacts play are indicated through the properties <code>prv:usedData</code>, <code>prv:usedGuideline</code> and its sub-properties, and timestamps are associated directly with executions and artifacts. It&#8217;s also specifically oriented towards the kinds of operations that we typically need to do when transforming and publishing linked data.</p>

<p>To give you an idea about how it might work, here&#8217;s an illustration of how we might use the provenance vocabulary to describe the construction of some RDF from a CSV file:</p>

<p>Here are the prefix bindings. Note the reuse of the <a href="http://www.w3.org/TR/HTTP-in-RDF/">HTTP vocabulary</a>, <a href="http://xmlns.com/foaf/spec/">FOAF</a>, <a href="http://dublincore.org/documents/dcmi-terms/">Dublin Core</a> and <a href="http://semanticweb.org/wiki/VoiD">VoID</a>.</p>

<pre><code>@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix prv: &lt;http://purl.org/net/provenance/ns#&gt; .
@prefix prvTypes: &lt;http://purl.org/net/provenance/types#&gt; .
@prefix http: &lt;http://www.w3.org/2006/http#&gt; .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
@prefix dct: &lt;http://purl.org/dc/terms/&gt; .
@prefix void: &lt;http://rdfs.org/ns/void#&gt; .
</code></pre>

<p>The provenance record itself and caches of relevant documents need to live somewhere web accessible; a <code>log</code> subdirectory on my website seems as good a place as any.</p>

<pre><code>@base &lt;http://www.jenitennison.com/log/2009-10-24/&gt; .
</code></pre>

<p>Now the meat of the provenance information. I&#8217;m defining a dataset identified as <code>http://statistics.data.gov.uk/id/region</code>, which is a dataset of Government Office Regions within the UK. The resources described by the dataset have URIs of the form <code>http://statistics.data.gov.uk/id/region/{regionCode}</code>, where <em>regionCode</em> is a single capital letter (assigned by <a href="http://www.statistics.gov.uk/">ONS</a>).</p>

<pre><code>&lt;http://statistics.data.gov.uk/id/region&gt; a void:Dataset ;
  dct:title "Government Office Regions" ;
  foaf:homepage &lt;http://statistics.data.gov.uk/doc/region&gt; ;
  void:exampleResource &lt;http://statistics.data.gov.uk/id/region/H&gt; ;
  void:uriRegexPattern "http://statistics.data.gov.uk/id/region/[A-Z]" ;
  void:subset _:GORRDF .
</code></pre>

<p>I&#8217;m not describing the provenance of this entire dataset here: the dataset of information about Government Office Regions will likely contain information from many different sources, created in many different ways at many different times. I&#8217;m just describing a particular subset of that information (identified here by a blank node with the ID <code>_:GORRDF</code>).</p>

<p>This dataset is captured as a dump in the web-accessible cache in which I&#8217;m keeping all the provenance-related information. Here we start seeing the provenance-related properties. The dataset was created by a <code>prv:DataCreation</code> event performed at 12:20 today by me. The creation used data from a CSV document that is also in the web-accessible cache, using the &#8220;guideline&#8221; (in this case an XSLT transformation) that is again in the web-accessible cache. I&#8217;ve also provided provenance information about that XSLT transformation (that it was created by me at 12:10 today). These times are made up, by the way! :)</p>

<pre><code>_:GORRDF a void:Dataset ;
  a prv:DataItem ;
  prv:containedBy &lt;cache/GOR_DEC_2008_EN_NC.rdf&gt; ;
  void:dataDump &lt;cache/GOR_DEC_2008_EN_NC.rdf&gt; ;
  prv:createdBy [
    a prv:DataCreation ;
    prv:performedAt "2009-10-24T12:20:00Z"^^xsd:dateTime ;
    prv:performedBy _:Jeni ;
    prv:usedData [
      a prv:DataItem ;
      prv:containedBy &lt;cache/GOR_DEC_2008_EN_NC.csv&gt; ;
    ] ;
    prv:usedGuideline [
      a prv:DataItem ;
      prv:containedBy &lt;cache/region.xsl&gt; ;
      prv:createdBy [
        a prv:DataCreation ;
        prv:performedAt "2009-10-24T12:10:00Z"^^xsd:dateTime ;
        prv:performedBy _:Jeni ;
      ]
    ] ;
  ] .
</code></pre>

<p>Now we have some descriptions of those cached documents:</p>

<pre><code>&lt;cache/GOR_DEC_2008_EN_NC.rdf&gt; a prv:Document ;
  dct:format &lt;http://www.iana.org/assignments/media-types/application/rdf+xml&gt; ;
  rdfs:label "GOR_DEC_2008_EN_NC.rdf" .

&lt;cache/region.xsl&gt; a prv:Document ;
  dct:format &lt;http://www.iana.org/assignments/media-types/application/xslt+xml&gt; ;
  rdfs:label "region.xsl" .

&lt;cache/GOR_DEC_2008_EN_NC.csv&gt; a prv:Document ;
  dct:format &lt;http://www.iana.org/assignments/media-types/text/csv&gt; ;
  rdfs:label "GOR_DEC_2008_EN_NC.csv" ;
  dct:isPartOf &lt;cache/government-office-regions.zip&gt; .
</code></pre>

<p>This last file &#8212; the CSV that contained the data &#8212; was part of a zip file. The zip file was retrieved via HTTP at 12:00 today through a GET request to the URI <code>http://www.ons.gov.uk/about-statistics/geography/products/geog-products-area/names-codes/administrative/government-office-regions.zip</code>, but I also make it available in the cache in case that original file either disappears or gets changed at a later date.</p>

<pre><code>&lt;cache/government-office-regions.zip&gt;
  a prv:Document ;
  rdfs:label "government-office-regions.zip" ;
  dct:format &lt;http://www.iana.org/assignments/media-types/application/zip&gt; ;
  dct:hasPart &lt;cache/GOR_DEC_2008_EN_NC.csv&gt; ;
  prv:retrievedBy [ 
    a prvTypes:HTTPBasedDataAccess ;
    prv:performedAt "2009-10-24T12:00:21Z"^^xsd:dateTime ;
    prvTypes:exchangedHTTPMessage [
      a http:GetRequest ;
      http:requestURI "http://www.ons.gov.uk/about-statistics/geography/products/geog-products-area/names-codes/administrative/government-office-regions.zip"^^xsd:anyURI
    ] ;
  ] .
</code></pre>

<p>The final bits and pieces provide extra information about the resources that have been referenced above, including the provenance of the file that is providing this provenance information!</p>

<pre><code>_:Jeni a foaf:Person ;
  foaf:name "Jeni Tennison" ;
  foaf:homepage &lt;http://www.jenitennison.com/&gt; .

&lt;http://www.jenitennison.com/log/&gt; a void:Dataset ;
  dct:title "Jeni's Activity Log" ;
  foaf:homepage &lt;http://www.jenitennison.com/log/&gt; ;
  void:uriRegexPattern "http://www.jenitennison.com/log/(.+)" ;
  void:subset &lt;&gt; .

&lt;&gt; a void:Dataset ;
  dct:title "Jeni's log for 24th October 2009" ;
  foaf:homepage &lt;&gt; ;
  void:exampleResource &lt;log.ttl&gt; ;
  void:uriRegexPattern "http://www.jenitennison.com/log/2009-10-24/(.+)" ;
  void:subset [
    prv:containedBy &lt;log.ttl&gt; ;
    void:dataDump &lt;log.ttl&gt; ;
    prv:createdBy [
      a prv:DataCreation ;
      prv:performedAt "2009-10-24T18:57:00"^^xsd:dateTime ;
      prv:performedBy _:Jeni 
    ] ;
  ] .

&lt;log.ttl&gt; a prv:Document ;
  dct:format &lt;http://www.iana.org/assignments/media-types/text/turtle&gt; .

&lt;http://www.iana.org/assignments/media-types/application/xslt+xml&gt;
  rdf:value "application/xslt+xml" ;
  rdfs:label "XSLT" .

&lt;http://www.iana.org/assignments/media-types/application/rdf+xml&gt;
  rdf:value "application/rdf+xml" ;
  rdfs:label "RDF/XML" .

&lt;http://www.iana.org/assignments/media-types/text/turtle&gt;
  rdf:value "text/turtle" ;
  rdfs:label "Turtle" .

&lt;http://www.iana.org/assignments/media-types/application/zip&gt;
  rdf:value "application/zip" ;
  rdfs:label "Zip" .

&lt;http://www.iana.org/assignments/media-types/text/csv&gt;
  rdf:value "text/csv" ;
  rdfs:label "CSV" .
</code></pre>

<p>This pattern for providing provenance information isn&#8217;t a complete answer because it doesn&#8217;t address how you might assess the provenance of a particular <em>statement</em>. If I went to <code>http://statistics.data.gov.uk/id/region/H</code> the only way I could establish that the <code>rdfs:label</code> (say) for the region was generated through the process described above would be to match the URI to the <code>void:uriRegexPattern</code> above, get hold of the original RDF from the cache and work out whether it contains the <code>rdfs:label</code> statement that I&#8217;m interested in.</p>
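
<p>The first step of that matching could be done with a query over the provenance log, something like the sketch below. It assumes the log above has been loaded into a store that supports SPARQL, and it only finds candidate dumps: you would still have to fetch each one and look for the statement yourself. Note that the patterns are used as they stand, so they aren&#8217;t anchored to the start or end of the URI.</p>

<pre><code>PREFIX void: &lt;http://rdfs.org/ns/void#&gt;

# find the dumps of any dataset subset whose URI pattern
# matches the resource we are interested in
SELECT ?subset ?dump
WHERE {
  ?dataset void:uriRegexPattern ?pattern ;
           void:subset ?subset .
  ?subset void:dataDump ?dump .
  FILTER regex("http://statistics.data.gov.uk/id/region/H", ?pattern)
}
</code></pre>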

<p>I have a hunch that this would be more viable with named graphs: if statements with different provenance were actually placed in different graphs, then it would be possible with a SPARQL query to identify the graph(s) in which a statement was made, and their provenance. I <em>think</em> that the <code>_:GORRDF</code> blank node in the above could be a <code>trix:Graph</code>, for example.</p>
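
<p>As a sketch of the kind of query I have in mind: this assumes that each named graph is described using <code>prv:createdBy</code> in the same way as <code>_:GORRDF</code> above, with the graph&#8217;s URI as the subject, and that those descriptions sit in the default graph. Neither assumption has been tested against a real store.</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX prv: &lt;http://purl.org/net/provenance/ns#&gt;

# which graph(s) state a label for region H,
# and when and by whom was each graph created?
SELECT ?graph ?label ?when ?who
WHERE {
  GRAPH ?graph {
    &lt;http://statistics.data.gov.uk/id/region/H&gt; rdfs:label ?label .
  }
  ?graph prv:createdBy ?creation .
  ?creation prv:performedAt ?when ;
            prv:performedBy ?who .
}
</code></pre>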

<p>Regardless, the Provenance Vocabulary that I&#8217;ve used above seems to do the job reasonably well. I&#8217;m intending to try this approach out on a few datasets and see how it stands up to real-world complexities. Comments and suggestions appreciated.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Expressing Statistics with RDF</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/132" />
    <id>http://www.jenitennison.com/blog/node/132</id>
    <published>2009-10-23T22:07:50+00:00</published>
    <updated>2009-12-07T11:00:47+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="psi" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>One of the things that we&#8217;ve been discussing over on the <a href="http://groups.google.com/group/uk-government-data-developers">UK Government Data Developers mailing list</a> is how best to represent the vast quantities of statistical data that the government produces, in RDF. This is what we&#8217;ve come up with.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>One of the things that we&#8217;ve been discussing over on the <a href="http://groups.google.com/group/uk-government-data-developers">UK Government Data Developers mailing list</a> is how best to represent the vast quantities of statistical data that the government produces, in RDF. This is what we&#8217;ve come up with.</p>

<!--break-->

<ol>
<li><p>We&#8217;ll use <a href="http://sw.joanneum.at/scovo/schema.html">SCOVO</a> as our main vocabulary.</p></li>
<li><p>Dimensions (the things a statistic is about) should be instances of specialised classes such as &#8216;Hospital&#8217; or &#8216;School&#8217;; these will often be <a href="http://www.w3.org/TR/skos-primer/">SKOS</a> concepts. We will try to reuse these as much as possible across datasets (see below).</p></li>
<li><p>We will create subproperties of <code>scv:dimension</code> that have appropriate names and different subclasses of <code>scv:Dimension</code>s as ranges. We will try to reuse these as much as possible across datasets (see below).</p></li>
<li><p>The <code>scv:Item</code>s we use (representing individual statistics) should not be blank nodes (because giving them URIs allows us to attach other information to them); they will each have a <code>scv:dataset</code> property that points to the <code>scv:Dataset</code> they belong to (which will probably also be a <code>void:Dataset</code>).</p></li>
<li><p>Every <code>scv:Item</code> will also be the object of at least one triple that involves one of its dimensions; this will usually be the real-world thing that the statistic is associated with (eg the school or hospital).</p></li>
<li><p>Most statistics are provided for a particular time period; for these, we will use relationships from <a href="http://www.w3.org/TR/owl-time/">OWL-Time</a> to link to <a href="http://www.placetime.com/">placetime.com</a> resources, but will also use appropriately datatyped literals where possible to make querying easier.</p></li>
</ol>

<p>Here&#8217;s an example of what this looks like:</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix scv: &lt;http://purl.org/NET/scovo#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix dct: &lt;http://purl.org/dc/terms/&gt; .
@prefix void: &lt;http://rdfs.org/ns/void#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix sdmx: &lt;http://proxy.data.gov.uk/sdmx.org/def/sdmx/&gt; .
@prefix pop: &lt;http://statistics.data.gov.uk/def/population/&gt; .
@prefix year: &lt;http://statistics.data.gov.uk/def/census-year/&gt; .

# The statistics themselves

&lt;http://statistics.data.gov.uk/id/local-authority-district/00HE&gt;
  rdfs:label "Cornwall" ;
  pop:totalPopulation &lt;http://statistics.data.gov.uk/id/local-authority-district/00HE/population/total/year/2001&gt; ;
  pop:ruralPopulation &lt;http://statistics.data.gov.uk/id/local-authority-district/00HE/population/rural/year/2001&gt; ;
  ... .

&lt;http://statistics.data.gov.uk/id/local-authority-district/00HE/population/total/year/2001&gt;
  a scv:Item ;
  rdf:value "499399"^^xsd:integer ;
  scv:dataset &lt;http://statistics.data.gov.uk/doc/local-authority-district/*/population/*/year/2001&gt; ;
  sdmx:refArea &lt;http://statistics.data.gov.uk/id/local-authority-district/00HE&gt; ;
  pop:populationType pop:total ;
  sdmx:timePeriod &lt;http://statistics.data.gov.uk/def/census-year/2001&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/00HE/population/rural/year/2001&gt;
  a scv:Item ;
  rdf:value "127904"^^xsd:integer ;
  scv:dataset &lt;http://statistics.data.gov.uk/doc/local-authority-district/*/population/*/year/2001&gt; ;
  sdmx:refArea &lt;http://statistics.data.gov.uk/id/local-authority-district/00HE&gt; ;
  pop:populationType pop:rural ;
  sdmx:timePeriod &lt;http://statistics.data.gov.uk/def/census-year/2001&gt; .

...

# Datasets

&lt;http://statistics.data.gov.uk/doc/local-authority-district/*/population/*/year/2001&gt;
  a scv:Dataset ;
  a void:Dataset ;
  dct:title "Populations of Local Authority Districts" ;
  ... .

# Common definitions for the dataset

pop:totalPopulation a rdf:Property ;
  rdfs:label "total population" ;
  rdfs:range scv:Item .
pop:ruralPopulation a rdf:Property ;
  rdfs:label "rural population" ;
  rdfs:range scv:Item .
...

pop:populationType rdfs:subPropertyOf scv:dimension ;
  rdfs:label "population type" ;
  rdfs:domain scv:Item ;
  rdfs:range pop:Population .

pop:Population a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:subClassOf scv:Dimension ;
  rdfs:label "population type" .

pop:populationScheme a skos:ConceptScheme ;
  skos:prefLabel "Population Types" ;
  skos:hasTopConcept pop:total .

pop:total a pop:Population ;
  skos:prefLabel "total population" ;
  skos:topConceptOf pop:populationScheme ;
  skos:narrower pop:rural ;
  ... .

pop:rural a pop:Population ;
  skos:prefLabel "rural population" ;
  skos:inScheme pop:populationScheme ;
  skos:broader pop:total ;
  ... .

year:Year a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:subClassOf scv:Dimension .

&lt;http://statistics.data.gov.uk/def/census-year/2001&gt;
  rdfs:label "mid-2001" ;
  time:intervalDuring &lt;http://www.placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .

&lt;http://www.placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear .
</code></pre>
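
<p>Point 5 in the list above is there to make queries like the following easy: because the district points directly at its <code>scv:Item</code>s, you can get from the place to the number in a couple of steps. This is just a sketch against the example data, using only the properties shown there; I haven&#8217;t run it against a real endpoint.</p>

<pre><code>PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX sdmx: &lt;http://proxy.data.gov.uk/sdmx.org/def/sdmx/&gt;
PREFIX pop: &lt;http://statistics.data.gov.uk/def/population/&gt;

# total population of Cornwall for the 2001 census year
SELECT ?value
WHERE {
  &lt;http://statistics.data.gov.uk/id/local-authority-district/00HE&gt;
    pop:totalPopulation ?item .
  ?item rdf:value ?value ;
        sdmx:timePeriod &lt;http://statistics.data.gov.uk/def/census-year/2001&gt; .
}
</code></pre>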

<p>One source of sub-properties of <code>scv:dimension</code> (and subtypes of <code>scv:Dimension</code>) is <a href="http://sdmx.org/">SDMX</a> (Statistical Data and Metadata eXchange). This provides standard ways of indicating things like the area and time that a statistic applies to. I&#8217;ve made an <a href="/blog/files/sdmx.ttl">initial mapping into some RDFS properties</a> and <a href="/blog/files/codelists.ttl">SKOS schemes</a> as an indication of the kind of thing that would work here, but expect it to change.</p>

<p>We&#8217;re currently working on providing identifiers for the areas that statistics are likely to be about (such as local authority districts, MSOAs or wards). They are of the form:</p>

<pre><code>http://statistics.data.gov.uk/id/{area-type}/{ONS-area-code}
</code></pre>

<p>and they tie into the <a href="http://data.ordnancesurvey.co.uk/">newly released OS data</a>. I hope we&#8217;ll have them available as Linked Data soon.</p>

<p>One issue that hasn&#8217;t been resolved is how to handle the huge amount of repetition that is inherent in this method of representing statistical data. For example, in the data above, all the <code>scv:Item</code>s in the <code>scv:Dataset</code> <code>http://statistics.data.gov.uk/doc/local-authority-district/*/population/*/year/2001</code> are from 2001. Rather than indicating the year of each individual <code>scv:Item</code>, it would be nice if we could have a property on the dataset that indicated that <em>all</em> the items in that dataset had the same value for a particular dimension. If this were called <code>scv:itemDimension</code>, for example, then we could do:</p>

<pre><code>&lt;http://statistics.data.gov.uk/doc/local-authority-district/*/population/*/year/2001&gt;
  a scv:Dataset ;
  a void:Dataset ;
  dct:title "Populations of Local Authority Districts" ;
  sdmx:itemTimePeriod &lt;http://statistics.data.gov.uk/def/census-year/2001&gt; ;
  ... .

sdmx:itemTimePeriod rdfs:subPropertyOf scv:itemDimension ;
  rdfs:label "time period of items in the dataset" ;
  rdfs:domain scv:Dataset .
</code></pre>

<p>and the individual <code>scv:Item</code>s would not have to have any <code>sdmx:timePeriod</code> properties explicitly. Perhaps this is something that the people behind SCOVO might consider, or we might create the property ourselves.</p>
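
<p>The cost of this is that anyone querying the data has to look in two places for the time period. A sketch of what that might look like follows; it assumes the hypothetical <code>sdmx:itemTimePeriod</code> property suggested above, and a store that supports <code>BIND</code> and <code>COALESCE</code> from SPARQL 1.1.</p>

<pre><code>PREFIX scv: &lt;http://purl.org/NET/scovo#&gt;
PREFIX sdmx: &lt;http://proxy.data.gov.uk/sdmx.org/def/sdmx/&gt;

# use the item's own time period if it has one,
# otherwise fall back to the one declared on its dataset
SELECT ?item ?period
WHERE {
  ?item a scv:Item ;
        scv:dataset ?dataset .
  OPTIONAL { ?item sdmx:timePeriod ?direct }
  OPTIONAL { ?dataset sdmx:itemTimePeriod ?inherited }
  BIND (COALESCE(?direct, ?inherited) AS ?period)
}
</code></pre>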

<p>As far as I know, this pattern for representing statistics has yet to be used &#8220;in anger&#8221;, but I hope that we&#8217;ll have some illustrations soon which will help us assess whether it&#8217;s viable. Any comments and suggestions would, of course, be very welcome!</p>
    ]]></content>
  </entry>
</feed>
