<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Jeni's Musings</title>
  <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog"/>
  <link rel="self" type="application/atom+xml" href="http://www.jenitennison.com/blog/atom/feed"/>
  <id>http://www.jenitennison.com/blog/atom/feed</id>
  <updated>2010-07-31T21:51:16+01:00</updated>
  <entry>
    <title>Using Freebase Gridworks to Create Linked Data</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/145" />
    <id>http://www.jenitennison.com/blog/node/145</id>
    <published>2010-08-22T23:23:32+01:00</published>
    <updated>2010-08-22T23:23:32+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="datagovuk" />
    <category term="gridworks" />
    <category term="linked data" />
    <category term="provenance" />
    <summary type="html"><![CDATA[<p>When we encourage people to put their data on the web as linked data, the biggest question is &#8220;How?&#8221;. There are so many &#8220;How?&#8221; questions to answer:</p>

<ul>
<li>how do we choose what URIs to use for things?</li>
<li>how do we choose what vocabularies to use?</li>
<li>how do we handle changing data?</li>
<li>how do we tell people how the data was created?</li>
<li>how do we publish it?</li>
<li>how will other people know about it?</li>
</ul>

<p>and, of course:</p>

<ul>
<li>how do we create it?</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>When we encourage people to put their data on the web as linked data, the biggest question is &#8220;How?&#8221;. There are so many &#8220;How?&#8221; questions to answer:</p>

<ul>
<li>how do we choose what URIs to use for things?</li>
<li>how do we choose what vocabularies to use?</li>
<li>how do we handle changing data?</li>
<li>how do we tell people how the data was created?</li>
<li>how do we publish it?</li>
<li>how will other people know about it?</li>
</ul>

<p>and, of course:</p>

<ul>
<li>how do we create it?</li>
</ul>

<!--break-->

<p>Our goal within the linked data part of data.gov.uk (and I know we haven&#8217;t achieved it yet) is both to answer these questions and to make the answers as simple as possible. The answers to the questions <em>cannot</em> either require up-front knowledge of all possible types of data that might be published or depend on the availability of linked data for all the things we want to talk about. They <em>cannot</em> require registration at centralised services. They <em>cannot</em> require everyone to do everything in the same way or at the same pace.</p>

<p>We must adopt an approach that encourages people to make their data available in forms that are easier for other people to pick up and use <strong>because they see the benefits for them</strong> and their stakeholders and because the effort of doing so is not too high to bear. We must grow, adapt and evolve incrementally. If linked data eventually wins, it will be due to its benefits, not to faith.</p>

<p>Anyway, enough rant. The point of this blog post is to talk about one of the answers to the &#8216;How do we create it?&#8217; question: using <a href="http://code.google.com/p/freebase-gridworks/">Freebase Gridworks</a>. For those who haven&#8217;t encountered it, Gridworks is an incredibly useful application that enables you to easily analyse, clean and manipulate tabular data. In a few steps, it can be used to generate linked datasets which can then be published on the web just like any other file, ready for other people to reuse without jumping through hoops. I&#8217;m going to assume that you can <a href="http://code.google.com/p/freebase-gridworks/wiki/Downloads?tm=2">download it</a> and <a href="http://code.google.com/p/freebase-gridworks/wiki/GettingStarted">install it</a> following the instructions provided on the Gridworks site.</p>

<p>In this post, I&#8217;m going to talk about how to use Gridworks to generate linked data, using an example of local government spending data from <a href="http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm">Windsor and Maidenhead council</a>. Like a good train journey, there&#8217;s quite a lot to see along the way.</p>

<p><em>Note: Many thanks to Dave Reynolds for his work on this data and comments on an earlier version of this post.</em></p>

<h2>Importing Data</h2>

<p>The first step is to import the data into Gridworks. If you just take the Windsor &amp; Maidenhead data and import it directly, you&#8217;ll get a single not-very-useful column as shown in the following screenshot:</p>

<p><img src="/blog/files/bad-import.jpg" title="Bad import into Gridworks" style="width: 100%" /></p>

<p>If you look at the spreadsheet in a normal spreadsheet programme then you&#8217;ll see why. Like a lot of spreadsheets created by normal people, who want to create something readable by human beings rather than computers, it has some extra lines at the top to explain what the spreadsheet contains, as shown in the following screenshot:</p>

<p><img src="/blog/files/spreadsheet.jpg" title="Original spreadsheet" style="width: 100%" /></p>

<p>Fortunately, Gridworks lets us easily skip over these first few lines. When you import the data, put the number <code>1</code> in the box for &#8220;Ignore X initial non-blank lines&#8221;, as shown here:</p>

<p><img src="/blog/files/import-dialog.jpg" title="Import dialog" style="text-align: center" /></p>

<p>(You need the number <code>1</code> because although there are three lines before the table really starts, the second two of those are blank.)</p>

<p>That done, the data should look a lot more useful, as shown in the following screenshot:</p>

<p><img src="/blog/files/good-import.jpg" title="Good import into Gridworks" style="width: 100%" /></p>
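
<p>For anyone who wants to do the same pre-processing outside Gridworks, here is a minimal Python sketch of what that import setting appears to do. It assumes that blank lines (or rows of empty cells) are ignored rather than counted, which is how the dialog behaved for this spreadsheet; the function name <code>read_table</code> and the sample text are my own inventions, not part of Gridworks.</p>

```python
import csv
import io


def read_table(text, skip_non_blank=1):
    """Drop leading note lines, then parse the rest as CSV with a header row."""
    lines = text.splitlines()
    skipped = 0
    for index, line in enumerate(lines):
        if not line.strip(", \t"):
            # Blank line or a row of empty cells: ignored, not counted.
            continue
        if skipped == skip_non_blank:
            # First non-blank line after the skipped ones: the header row.
            body = "\n".join(lines[index:])
            return list(csv.DictReader(io.StringIO(body)))
        skipped += 1
    return []


# Made-up sample: one title line, two blank lines, then the real table.
sample = "Payments to suppliers\n,,\n\nSupplier Name,Amount\nExample Ltd,600\n"
print(read_table(sample))
```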

<h2>Cleaning Data</h2>

<p>The next thing to do is to explore the data a bit to get a handle on what&#8217;s there and work out whether any cleaning or rationalisation is necessary to improve its quality.</p>

<p>With columns that hold names, such as &#8216;Directorate&#8217;, &#8216;Service&#8217; or &#8216;Supplier Name&#8217;, you&#8217;re looking for slight misspellings caused by bad data entry. Gridworks helps you find these by creating a list of the distinct values for a particular column and telling you how many instances there are of each. Use the arrow at the side of the column name to pull down the menu, then choose <code>Facet &gt; Text Facet</code> to create this list, as shown here:</p>

<p><img src="/blog/files/facet-menu.jpg" title="Choosing from the facet menu" style="text-align: center" /></p>

<p>Once you&#8217;ve chosen <code>Text Facet</code>, the list pops up on the left hand side of the window. You can click on any of these values to filter the table to just the rows that have that value in that column, but you can also scan through the list to spot any places where there looks to be a typo or two entries that should really be the same. For example, the Services list holds both &#8216;Libraries &amp; Information Services&#8217; and &#8216;Library &amp; Information Services&#8217;, as shown here:</p>

<p><img src="/blog/files/services-list.jpg" title="Repetition in the Services list" style="text-align: center" /></p>

<p>It&#8217;s unlikely that there are really two distinct services with such similar names, so we&#8217;d like to clean up this data by standardising on one name or another. You can quickly change all occurrences of one value to another using the <code>edit</code> option that appears just to the right of the value when you hover over it. This brings up a dialog that enables you to change all of those values to something else, as shown here:</p>

<p><img src="/blog/files/edit-value-dialog.jpg" title="Editing a value across the spreadsheet" style="text-align: center" /></p>
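
<p>If it helps to think of this in code, a text facet is essentially a count of distinct values, and the <code>edit</code> option is a bulk replace. The following Python sketch shows the idea; the function names and the three sample rows are made up for illustration and say nothing about how Gridworks implements it.</p>

```python
from collections import Counter


def text_facet(rows, column):
    """Count how many rows hold each distinct value in a column."""
    return Counter(row[column] for row in rows)


def edit_value(rows, column, old, new):
    """Change every occurrence of one value in a column to another."""
    for row in rows:
        if row[column] == old:
            row[column] = new


# Illustrative rows only; the real spreadsheet has many more.
rows = [
    {"Service": "Libraries & Information Services"},
    {"Service": "Library & Information Services"},
    {"Service": "Library & Information Services"},
]
print(text_facet(rows, "Service"))
edit_value(rows, "Service",
           "Libraries & Information Services",
           "Library & Information Services")
print(text_facet(rows, "Service"))
```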

<p>You can do something similar with numeric columns, such as the &#8216;Amount excl vat £&#8217; column. This time choose <code>Numeric Facet</code> rather than <code>Text Facet</code> and you&#8217;ll get a histogram up as shown here:</p>

<p><img src="/blog/files/amount-facet.jpg" title="Amount histogram" style="text-align: center" /></p>

<p>This is useful for identifying outliers. If you grab the handle on the left of the histogram and move it to the centre, the rows will get filtered to only those that have an amount within that range. For example, moving it to only show rows between £500,000 and £1,500,000 shows that there are three payments of this size, all made by Children&#8217;s Services to Wilmott Dixon Construction Limited, as shown in this screenshot:</p>

<p><img src="/blog/files/high-value-transactions.jpg" title="High value transactions" style="width: 100%" /></p>

<p>Although these values are much higher than most of the others in the spreadsheet, they don&#8217;t seem to be errors &#8212; I guess a new school was being built or something &#8212; so there&#8217;s nothing to correct here, but it shows how numeric facets can be used to explore the data.</p>
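
<p>The filtering half of a numeric facet is easy to picture as code. This is only a sketch: the function name is mine and the amounts below are invented, not taken from the council&#8217;s spreadsheet.</p>

```python
def numeric_facet(rows, column, low, high):
    """Keep only the rows whose value in the column lies within the range."""
    return [row for row in rows if low <= row[column] <= high]


# Invented amounts, purely to show the filter at work.
rows = [
    {"Amount excl vat £": 750.0},
    {"Amount excl vat £": 620000.0},
    {"Amount excl vat £": 1875.0},
]
print(numeric_facet(rows, "Amount excl vat £", 500000, 1500000))
```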

<p>Another approach to exploring and cleaning the data is to use the clustering algorithms that are built into Gridworks to identify duplicates. To do this, pull down the column menu and this time choose <code>Edit Cells... &gt; Cluster and Edit</code>, as shown in the following screenshot, this time for the &#8216;Supplier Name&#8217; column:</p>

<p><img src="/blog/files/edit-cells-menu.jpg" title="Choosing from the Edit Cells menu" style="text-align: center" /></p>

<p>This brings up a dialog that groups together values that look similar. In this case, &#8216;Siemens plc&#8217; and &#8216;Siemens PLC&#8217;, as shown in the following screenshot:</p>

<p><img src="/blog/files/cluster-dialog.jpg" title="Clustering values in a column" style="width: 100%" /></p>

<p>You can use this dialog to change all the similar values to a standard one. Check the <code>Merge</code> checkbox for the clusters of values that should be merged, edit the <code>New Cell Value</code> field to whatever standard value you want to adopt, and choose <code>Apply &amp; Re-cluster</code> or simply <code>Apply &amp; Close</code> to make the change.</p>
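
<p>To give a feel for how a key-collision method can group values like these, here is a simplified Python sketch. It reduces each value to a &#8216;fingerprint&#8217; (trimmed, lower-cased, punctuation removed, words sorted and de-duplicated) and groups values that share one. This is my approximation of the idea, not Gridworks&#8217; actual code, and the function names are hypothetical.</p>

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a normalised key so near-identical spellings collide."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values that share a fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(cluster(["Siemens plc", "Siemens PLC", "Abba Cars"]))
```

<p>Because the key ignores case, punctuation and word order, this approach is conservative: it only catches values that differ in those ways.</p>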

<p>You will often find that the default clustering algorithm (key collision/fingerprint) doesn&#8217;t come up with any clusters as it&#8217;s fairly conservative. It&#8217;s worth playing around a bit with different algorithms to look for other duplicates by selecting other possibilities from the drop-down menus. For example, choosing the &#8216;nearest neighbour&#8217; method with the Levenshtein distance function and a radius of 2 (edits) results in four possible duplicates within the Suppliers list, as shown here:</p>

<p><img src="/blog/files/levenstein-cluster.jpg" title="Clustering values with Levenshtein distance" style="width: 100%" /></p>

<p>If you&#8217;re not sure about whether the cluster is due to a typo or not, hover over the row and click on the <code>Browse this cluster</code> link that appears. That will bring up a separate window that will show you just the rows in the cluster, from which you should be able to make a judgement. For example, it&#8217;s not clear whether &#8216;Academia Ltd&#8217; is a typo for &#8216;Academics Ltd&#8217; but browsing the cluster shows that the Cost Centre codes and the Types of the transactions are completely different for the two Suppliers, so they are probably different.</p>
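
<p>The &#8216;Academia Ltd&#8217; and &#8216;Academics Ltd&#8217; pair is a good illustration of what an edit-distance radius of 2 means. The Python sketch below computes the Levenshtein distance with the standard dynamic-programming method and lists pairs within a given radius. It compares every pair of values, which is fine for a sketch but is unlikely to be how Gridworks does it; the function names are my own.</p>

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def near_pairs(values, radius=2):
    """List pairs of distinct values that are within the given edit radius."""
    unique = sorted(set(values))
    return [
        (first, second)
        for index, first in enumerate(unique)
        for second in unique[index + 1:]
        if levenshtein(first, second) in range(1, radius + 1)
    ]


print(levenshtein("Academia Ltd", "Academics Ltd"))
print(near_pairs(["Academia Ltd", "Academics Ltd", "Abba Cars"]))
```

<p>A small distance only says that two names look alike, which is why browsing the cluster before merging is still worthwhile.</p>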

<h2>Deriving Data</h2>

<p>The next step is to derive some data from what we have within the spreadsheet. Since our goal is to produce linked data, the kind of derived data that we&#8217;re interested in are URIs.</p>

<p>At this point we need to start making decisions about what URIs to use. If you look at the <a href="http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm">list of spending data from Windsor and Maidenhead</a>, you&#8217;ll see that there are a whole bunch of these spreadsheets. It would be really useful if we could tie these spreadsheets together by using the same URIs for the same things across the datasets. For that reason, the only URI that&#8217;s going to be local to the dataset is the URI for each line (or data point if you like) itself. On the other hand, most of the things that are named here are going to be local to Windsor &amp; Maidenhead: &#8216;Abba Cars&#8217; may be sufficient to identify a single company within Windsor &amp; Maidenhead, but certainly wouldn&#8217;t be nationwide. So the URIs I&#8217;m going to create here are mostly going to be within the <code>www.rbwm.gov.uk</code> domain.</p>

<p>Here&#8217;s the table of the columns and the associated URIs that I&#8217;m going to use. I should stress that this is just for example purposes, but I&#8217;ve used the following principles:</p>

<ul>
<li>URIs for datasets are just like URIs for any other web document, but shouldn&#8217;t have an extension because the data itself should be available in many formats</li>
<li>URIs for real-world things should have <code>/id</code> at the start of the path, and URIs for conceptual things should have <code>/def</code> at the start of their paths; both should result in a 303 redirection to a suitable web page</li>
</ul>

<p>This is what we&#8217;re doing within data.gov.uk, but it&#8217;s an important principle of the web that different councils might well choose their own URI schemes, depending on the kind of technology support that they have, without any bad side-effects on the interpretation of the data.</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>URI pattern</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>(Dataset)</th>
      <td>http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2</td>
    </tr>
    <tr>
      <th>(Row/ExpenditureLine)</th>
      <td>http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#{row-number}</td>
    </tr>
    <tr>
      <th>(Council)</th>
      <td>http://statistics.data.gov.uk/id/local-authority/00ME</td>
    </tr>
    <tr>
      <th>Directorate</th>
      <td>http://www.rbwm.gov.uk/id/directorate/{directorate-slug}</td>
    </tr>
    <tr>
      <th>Updated</th>
      <td>http://reference.data.gov.uk/id/day/{date}</td>
    </tr>
    <tr>
      <th>TransNo/Payment</th>
      <td>http://www.rbwm.gov.uk/id/transaction/{transaction-number}</td>
    </tr>
    <tr>
      <th>Service</th>
      <td>http://www.rbwm.gov.uk/id/service/{service-slug}</td>
    </tr>
    <tr>
      <th>Cost Centre</th>
      <td>http://www.rbwm.gov.uk/def/cost-centre/{cost-centre-code}</td>
    </tr>
    <tr>
      <th>Supplier Name</th>
      <td>http://www.rbwm.gov.uk/id/supplier/{supplier-slug}</td>
    </tr>
  </tbody>
</table>

<p>As you can see, those of the columns that contain text fields have, as part of their URI, a <a href="http://en.wikipedia.org/wiki/Slug_(production)">&#8216;slug&#8217;</a>. This is a shortened, normalised value suitable for putting in a URI: basically ensuring that the string doesn&#8217;t contain any punctuation or spaces. For example, &#8216;Adult &amp; Community Services&#8217; would turn into &#8216;adult-community-services&#8217;.</p>

<p>Our first task will be to create these slugs. To do this, we&#8217;ll create a new column based on the existing ones by choosing <code>Edit Column &gt; Add Column Based on This Column ...</code> from the drop-down menu on the appropriate column:</p>

<p><img src="/blog/files/edit-column-menu.jpg" title="Edit Column menu" style="text-align: center" /></p>

<p>Selecting this will bring up a dialog which will ask you to name the new column and then enter a formula to calculate the new value, as shown here:</p>

<p><img src="/blog/files/create-slug.jpg" title="Edit Column menu" style="text-align: center" /></p>

<p>The default language for this formula is Gridworks&#8217; own, though there are other options available. To create the slug, we need to:</p>

<ol>
<li>turn the value to lower case</li>
<li>replace all spaces with hyphens</li>
<li>remove anything that isn&#8217;t a letter, number, or hyphen</li>
<li>replace all sequences of two hyphens with a single hyphen</li>
</ol>

<p>This is done in two stages. The first three steps can be done using the formula:</p>

<pre><code>replace(replace(toLowercase(value), ' ', '-'), /[^-a-z0-9]/, '')
</code></pre>

<p>Gridworks helps by listing the original and resulting values for the first several rows of the spreadsheet, so that you can see whether it&#8217;s working as expected. When you&#8217;re happy, hitting <code>OK</code> creates the new column.</p>

<p>The last step (replacing all sequences of two hyphens with a single hyphen) can be done by editing the cells in the new column. Bring up the <code>Edit Cells... &gt; Transform...</code> dialog using the menu:</p>

<p><img src="/blog/files/edit-cells-menu-2.jpg" title="Edit Cells menu" style="text-align: center" /></p>

<p>and use the formula:</p>

<pre><code>replace(value, '--', '-')
</code></pre>

<p>then check the <code>Re-transform until no change</code> checkbox so that any pairs of hyphens are repeatedly replaced with single hyphens, as shown here:</p>

<p><img src="/blog/files/transform.jpg" title="Edit Cells menu" style="text-align: center" /></p>
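
<p>For comparison, here are the same four steps as a small Python function. It&#8217;s a sketch that mirrors the two Gridworks formulae above rather than anything Gridworks itself runs, and the name <code>slugify</code> is my own.</p>

```python
import re


def slugify(value):
    """Turn a label into a lower-case, hyphen-separated slug."""
    slug = value.lower()                    # 1. lower case
    slug = slug.replace(" ", "-")           # 2. spaces to hyphens
    slug = re.sub(r"[^-a-z0-9]", "", slug)  # 3. drop everything else
    while "--" in slug:                     # 4. repeat until no change
        slug = slug.replace("--", "-")
    return slug


print(slugify("Adult & Community Services"))
print(slugify("1st Choice - D B Driveways Limited"))
```

<p>The loop in the last step plays the same role as the <code>Re-transform until no change</code> checkbox: a run of three hyphens needs two passes to collapse to one.</p>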

<p>The other tabs in the new column and edit cells dialogs are really helpful. The <code>History</code> tab lets you choose formulae that you&#8217;ve used before to use again. This is useful here because we want to create the slugs for the Service and Supplier Name in the same way. The <code>Help</code> tab lists all the functions that you can use within the formula.</p>

<p>Creating the URIs for the columns proceeds in the same way, except this time the formulae are more like:</p>

<pre><code>'http://www.rbwm.gov.uk/id/directorate/' + value
</code></pre>

<p>There are two that are slightly different. First, there&#8217;s the URI for the date, which needs to be constructed from the date/time value held by Gridworks. We can do this in two stages. The first is to construct a new column called &#8216;Date&#8217; to hold the formatted date:</p>

<pre><code>datePart(value, 'year') + '-' + 
if (datePart(value, 'month') &lt; 9, '0', '') + replace(datePart(value, 'month') + 1, '.0', '') + '-' + 
if (datePart(value, 'day') &lt; 10, '0', '') + datePart(value, 'day')
</code></pre>

<p>(note that the <code>datePart()</code> function returns a 0-based count for the month) and then to create the Date URI column based on this:</p>

<pre><code>'http://reference.data.gov.uk/id/day/' + value
</code></pre>
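
<p>The zero-padding and the off-by-one month are the fiddly parts, so here is the same logic as a Python sketch. The parameter <code>month_index</code> stands in for the 0-based month that <code>datePart()</code> returns; the function names are hypothetical.</p>

```python
DAY_URI_BASE = "http://reference.data.gov.uk/id/day/"


def format_day(year, month_index, day):
    """Format a date as YYYY-MM-DD from a 0-based month index."""
    return "{:04d}-{:02d}-{:02d}".format(year, month_index + 1, day)


def day_uri(year, month_index, day):
    """Build the day URI by appending the formatted date to the base."""
    return DAY_URI_BASE + format_day(year, month_index, day)


# April is month index 3 when counting from zero.
print(day_uri(2010, 3, 9))
```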

<p>Second, there&#8217;s the URI for the row (an expenditure line) itself, which needs to be constructed using the row number. It&#8217;s useful to construct it as a local URI (ie just the fragment) as this means the same code can be used to construct the column across different datasets, so it&#8217;s just:</p>

<pre><code>'#' + rowIndex
</code></pre>

<h2>Exporting Data</h2>

<p>Once the extra columns have been made, it&#8217;s time to export data from Gridworks. While Gridworks makes it easy to export to CSV or into Freebase, it&#8217;s also possible to export in any format you want using templates. Use the <code>Project</code> menu and choose <code>Export Filtered Rows &gt; Templating ...</code>, as shown in the following screenshot:</p>

<p><img src="/blog/files/project-menu.jpg" title="Project menu" style="text-align: center" /></p>

<p>Note that this will only export the rows that you currently have selected, so if you want to export everything, make sure that you deselect any facets that you&#8217;ve currently got selected.</p>

<p>Choosing the <code>Templating ...</code> option will open up a dialog that you can use to create whatever format you want. The default, as shown in the following screenshot, is JSON.</p>

<p><img src="/blog/files/template-dialog-json.jpg" title="Templating dialog to create JSON" style="width: 100%" /></p>

<p>On the left are four fields:</p>

<ul>
<li><strong>Prefix</strong> is content that&#8217;s put at the top of the exported data</li>
<li><strong>Row Template</strong> is content that&#8217;s generated for each row</li>
<li><strong>Row Separator</strong> is content that&#8217;s put between each row</li>
<li><strong>Suffix</strong> is content that&#8217;s put at the bottom of the exported data</li>
</ul>

<p>One thing to be extremely careful of here is that any changes you make to the fields on the left here <strong>will not be saved</strong> when the dialog is closed. For that reason, it&#8217;s a good idea to create your templates in a separate text file and copy and paste them in. Also note that the sample data on the right is only for the first set of rows, not for the whole spreadsheet.</p>
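
<p>As I understand them, the four fields fit together in a very simple way, which this Python sketch tries to capture. Here the row template is passed in as a function that renders one row; that, and the name <code>export</code>, are my own simplifications.</p>

```python
def export(rows, prefix, row_template, row_separator, suffix):
    """Combine the four template fields into one exported document."""
    rendered = [row_template(row) for row in rows]
    return prefix + row_separator.join(rendered) + suffix


# A tiny JSON-like export of two made-up rows.
rows = [{"TransNo": 1}, {"TransNo": 2}]
print(export(rows, "[", lambda row: str(row["TransNo"]), ", ", "]"))
```

<p>Note that the separator only appears <em>between</em> rows, which is what makes formats like JSON arrays straightforward to produce.</p>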

<p>We&#8217;re going to generate Turtle using the template, so the next stage is to work out precisely what Turtle to generate. We&#8217;ve been working on a small vocabulary for payment data based on the <a href="http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html">Data Cube vocabulary</a> and that&#8217;s what I&#8217;ll use here, although it isn&#8217;t yet complete or publicly available. We&#8217;ll start at the bottom, with the individual rows, and then add extra surrounding information as we go.</p>

<h3>Row Template</h3>

<p>Within this data, each row corresponds to a <code>payment:ExpenditureLine</code> within the dataset. The expenditure lines can be organised into groups based on the <code>payment:Payment</code> that they&#8217;re associated with, which is indicated through the &#8216;TransNo&#8217; column in the spreadsheet. Within the payment vocabulary we&#8217;re using, we can assign individual expenditure lines to the payment using the <code>payment:expenditureLine</code> property.</p>

<p>The <code>payment:payer</code> of each <code>payment:Payment</code> is Windsor &amp; Maidenhead council. The <code>payment:payee</code> is the &#8216;Supplier&#8217; listed in the spreadsheet. The <code>payment:date</code> is the &#8216;Updated&#8217; date.</p>

<p>Each individual line in the spreadsheet is a <code>payment:ExpenditureLine</code> which is associated with one of these payments. The <code>payment:expenditureCode</code> is the &#8216;Cost Centre&#8217; and the actual <code>payment:amountExcludingVAT</code> is the &#8216;Amount excl vat £&#8217; value. Some example Turtle for the first line is thus:</p>

<pre><code>&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt;
  qb:slice &lt;http://www.rbwm.gov.uk/id/transaction/2650750&gt; .

&lt;http://www.rbwm.gov.uk/id/transaction/2650750&gt;
  a payment:Payment , qb:Slice ;
  rdfs:label "Transaction 2650750"@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference "2650750" ;
  payment:payer &lt;http://statistics.data.gov.uk/id/local-authority/00ME&gt; ;
  payment:payee &lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&gt; ;
  payment:date &lt;http://reference.data.gov.uk/id/day/2010-04-09&gt; ;
  payment:expenditureLine &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&gt; .

&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label "Expenditure Line 0"@en ;
  qb:dataSet &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; ;
  payment:expenditureCode &lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&gt; ;
  payment:amountExcludingVAT 1875.00 .
</code></pre>

<p>That&#8217;s the basic data for each line, but there&#8217;s also some other information which should be brought out for each line:</p>

<ul>
<li>the name of the payee</li>
<li>the date, year, month and day-of-month for the payment, which may help further analysis of the data</li>
<li>the meaning of the expenditure code (particularly its association to a particular service)</li>
</ul>

<p>In each of these cases, pulling the information out from each line is going to lead to a lot of repetition, because the same payee, date and so on will be described in multiple lines, but we don&#8217;t have any choice and we can tidy it up by removing duplicates afterwards. The Turtle for the first line will look like:</p>

<pre><code>&lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&gt;
  a org:Organization ;
  rdfs:label "1st Choice - D B Driveways Limited"@en .

&lt;http://reference.data.gov.uk/id/day/2010-04-09&gt;
  a interval:CalendarDay ;
  rdfs:label "2010-04-09" ;
  time:hasBeginning &lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&gt; ;
  interval:ordinalYear 2010 ;
  interval:ordinalMonthOfYear 4 ;
  interval:ordinalDayOfMonth 9 .

&lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&gt;
  a time:Instant ;
  time:inXSDDateTime "2010-04-09T00:00:00"^^xsd:dateTime .

&lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&gt;
  a rbwm:CostCentre , skos:Concept ;
  rdfs:label "Cost Centre LM05"@en ;
  rbwm:costCentreCode "LM05"^^rbwm:CostCentreCode ;
  rbwm:service &lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&gt; .

&lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&gt;
  a rbwm:Service ;
  rdfs:label "Magnet Leisure Centre"@en ;
  rbwm:providedBy &lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&gt; .

&lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&gt;
  a rbwm:Directorate ;
  rdfs:label "Adult &amp; Community Services"@en ;
  org:unitOf &lt;http://statistics.data.gov.uk/id/local-authority/00ME&gt; ;
  rbwm:provides &lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/00ME&gt;
  org:hasUnit &lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&gt; .
</code></pre>
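
<p>As a rough idea of what removing the duplicates afterwards could look like, here is a text-level Python sketch. It assumes that each resource description is separated from the next by a blank line and that repeated descriptions are character-for-character identical, which is true of template output but not of Turtle in general; a proper tidy-up would load the file with an RDF parser instead. The function name and the <code>ex:</code> sample are made up.</p>

```python
def drop_duplicate_blocks(turtle_text):
    """Keep the first copy of each blank-line-separated block, in order."""
    blocks = [block.strip() for block in turtle_text.split("\n\n")]
    unique = dict.fromkeys(block for block in blocks if block)
    return "\n\n".join(unique) + "\n"


sample = (
    'ex:a a ex:Thing ;\n  rdfs:label "A" .\n\n'
    "ex:b a ex:Thing .\n\n"
    'ex:a a ex:Thing ;\n  rdfs:label "A" .\n'
)
print(drop_duplicate_blocks(sample))
```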

<p>You&#8217;ll see that in the last part of this I&#8217;ve introduced some properties and classes with a <code>rbwm:</code> prefix. These are for classes and properties that are here in this data, but aren&#8217;t part of the payment vocabulary. The basic schema is:</p>

<pre><code>rbwm:CostCentre a rdfs:Class ;
  rdfs:label "Cost Centre"@en ;
  rdfs:comment "A cost centre."@en .

rbwm:Service a rdfs:Class ;
  rdfs:label "Service"@en ;
  rdfs:comment "A service provided by the council."@en .

rbwm:Directorate a rdfs:Class ;
  rdfs:label "Directorate"@en ;
  rdfs:comment "A directorate within the council"@en .

rbwm:service a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Service"@en ;
  rdfs:comment "The service associated with a particular cost centre."@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:Service .

rbwm:providedBy a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Provided By"@en ;
  rdfs:comment "The directorate that provides this service."@en ;
  rdfs:domain rbwm:Service ;
  rdfs:range rbwm:Directorate .

rbwm:provides a rdf:Property , owl:ObjectProperty ;
  rdfs:label "Provides"@en ;
  rdfs:comment "A service provided by this directorate."@en ;
  rdfs:domain rbwm:Directorate ;
  rdfs:range rbwm:Service .

rbwm:costCentreCode a rdf:Property , owl:DatatypeProperty ;
  rdfs:label "Cost Centre Code"@en ;
  rdfs:comment "The code of this cost centre."@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:CostCentreCode .

rbwm:CostCentreCode a rdfs:Datatype ;
  rdfs:label "Cost Centre Code"@en ;
  rdfs:comment "A cost centre code consisting of two capital letters followed by two digits."@en .
</code></pre>

<p>This illustrates how individual councils might extend the information that they make available in RDF without having to seek any kind of prior agreement from anyone else. If, later on, a third party starts to make available ontologies for cost centres, services and directorates, Windsor &amp; Maidenhead could start to link up their RDF with those more widely standardised classes and properties, with appropriate use of <code>rdfs:subClassOf</code> or <code>rdfs:subPropertyOf</code>.</p>

<p>Now that we have an idea about what data we can extract for a single row, we can turn this into a Gridworks template. The templates are fairly straightforward. Wherever you want to insert a value from a particular column, you use the syntax <code>${Column Name}</code>. If you want to do any further processing, you can use the syntax <code>{{Formula}}</code> to insert the result of a calculation.</p>

<pre><code>&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt;
  qb:slice &lt;${Transaction URI}&gt; .

&lt;${Transaction URI}&gt;
  a payment:Payment , qb:Slice ;
  rdfs:label "Transaction ${TransNo}"@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference "${TransNo}" ;
  payment:payer &lt;http://statistics.data.gov.uk/id/local-authority/00ME&gt; ;
  payment:payee &lt;${Supplier URI}&gt; ;
  payment:date &lt;${Date URI}&gt; ;
  payment:expenditureLine &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&gt; .

&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label "Expenditure Line {{rowIndex}}"@en ;
  qb:dataSet &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; ;
  payment:expenditureCode &lt;${Cost Centre URI}&gt; ;
  payment:amountExcludingVAT {{cells['Amount excl vat £'].value + 0}} .
</code></pre>

<p>Note that the last line here uses the expression <code>cells['Amount excl vat £'].value + 0</code> in order to ensure that every figure has a decimal place, which makes them into <code>xsd:decimal</code> values within the resulting RDF.</p>
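
<p>If you want to test a row template outside Gridworks before pasting it in, the <code>${Column Name}</code> substitution is easy to imitate. This Python sketch handles only that syntax and leaves any <code>{{Formula}}</code> expressions untouched; the name <code>render</code> is hypothetical and the row is a plain dictionary of column values.</p>

```python
import re

# Matches ${Column Name} and captures the column name.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(template, row):
    """Replace each ${Column Name} with that column's value for the row."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


template = 'rdfs:label "Transaction ${TransNo}"@en ;'
print(render(template, {"TransNo": "2650750"}))
```

<p>A missing column raises a <code>KeyError</code> here, which is a useful early warning that the template and the column names have drifted apart.</p>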

<p>I won&#8217;t do the rest of the row template here, though it&#8217;s <a href="/blog/files/finance_supplier_payments_2010_q2_provenance.ttl">available in full in a separate file</a>.</p>

<p>The other parts of the template are easier to complete. The prefix needs to contain any namespace prefixes that are used within the RDF. It&#8217;s also useful to put a base URI here and describe the dataset itself. The RDF for the dataset should contain a number of properties about the dataset as a whole. There are a number of levels at which the dataset can be described:</p>

<ul>
<li>basic metadata such as its title and the license that it&#8217;s available under</li>
<li>statistical metadata including what dimensions it has and how it&#8217;s sliced</li>
<li>linked data metadata such as how this dataset links out to other linked datasets</li>
</ul>

<p>The Turtle for this description is shown here:</p>

<pre><code>&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments&gt;
  a void:Dataset ;
  void:subset &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; .

&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt;
  a payment:PaymentDataset , void:Dataset ;
  # basic metadata
  rdfs:label "Windsor &amp; Maidenhead Supplier Payments where charge to specific cost centre is &gt;= £500 for period April 2010 - June 2010"@en ;
  dct:license &lt;http://data.gov.uk/id/licence&gt; ;
  dct:temporal [
    # this time is retrieved from the Last-Modified date on the original spreadsheet
    time:hasBeginning &lt;http://reference.data.gov.uk/id/gregorian-instant/2010-08-02T08:37:02&gt;
  ] ;

  # statistical metadata
  qb:structure payment:payments-with-expenditure-structure ;
  qb:sliceKey payment:payment-slice ;
  payment:currency &lt;http://dbpedia.org/resource/Pound_sterling&gt; ;

  # linked data metadata
  void:exampleResource
    &lt;http://www.rbwm.gov.uk/id/transaction/2650750&gt; ,
    &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&gt; ;
  void:vocabulary payment: , qb: , rbwm: ;
  void:subset [
    a void:Linkset ;
    void:linkPredicate qb:slice ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/transaction&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payer ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/transaction&gt; ;
    void:objectsTarget &lt;http://statistics.data.gov.uk/id/local-authority&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payee ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/transaction&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/supplier&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:date ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/transaction&gt; ;
    void:objectsTarget &lt;http://reference.data.gov.uk/id/day&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureLine ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/transaction&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureCode ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/def/cost-centre&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:service ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/def/cost-centre&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/service&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:providedBy ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/service&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/directorate&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:provides ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/directorate&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/service&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:hasUnit ;
    void:subjectsTarget &lt;http://statistics.data.gov.uk/id/local-authority&gt; ;
    void:objectsTarget &lt;http://www.rbwm.gov.uk/id/directorate&gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:unitOf ;
    void:subjectsTarget &lt;http://www.rbwm.gov.uk/id/directorate&gt; ;
    void:objectsTarget &lt;http://statistics.data.gov.uk/id/local-authority&gt; ;
  ] .
</code></pre>
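
<p>As a minimal sketch, the prefix part of the template might open with declarations along these lines. The <code>rdfs:</code>, <code>dct:</code>, <code>void:</code>, <code>qb:</code>, <code>time:</code> and <code>org:</code> namespaces are the published ones; the base URI and the <code>payment:</code>, <code>rbwm:</code>, <code>opmv:</code> and <code>gridworks:</code> namespace URIs are assumptions for illustration and should be checked against the vocabularies you actually use:</p>

<pre><code># assumed base URI, chosen so that &lt;#0&gt; resolves against the dataset URI
@base &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt; .

# well-known vocabularies
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix dct:  &lt;http://purl.org/dc/terms/&gt; .
@prefix void: &lt;http://rdfs.org/ns/void#&gt; .
@prefix qb:   &lt;http://purl.org/linked-data/cube#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix org:  &lt;http://www.w3.org/ns/org#&gt; .

# assumed namespace URIs: verify before use
@prefix payment:   &lt;http://reference.data.gov.uk/def/payment#&gt; .
@prefix opmv:      &lt;http://purl.org/net/opmv/ns#&gt; .
@prefix gridworks: &lt;http://purl.org/net/opmv/types/gridworks#&gt; .

# hypothetical local vocabulary namespace
@prefix rbwm: &lt;http://www.rbwm.gov.uk/def/&gt; .
</code></pre>

<p>Every prefix used anywhere in the dataset description or the row template needs a matching declaration here, otherwise the exported Turtle will not parse.</p>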

<h2>Provenance</h2>

<p>I&#8217;ve described here, verbally, exactly what I&#8217;ve done in terms of the cleaning of the data, deriving new columns, and the template that I&#8217;ve used to create a Turtle rendition of the data in this spreadsheet. One of the things that we&#8217;ve worked hard on within data.gov.uk is finding ways of expressing this provenance information in RDF. There are two reasons for this:</p>

<ol>
<li>Providing provenance increases transparency and enables you to check the processing that the data has been through, increasing your trust in the data.</li>
<li>Describing the process in sufficient detail for you to replicate it enables you to modify and repeat the process, which enables you both to add value and to apply the same processing to your own situation, thus spreading best practice.</li>
</ol>

<p>The basic provenance vocabulary that we&#8217;re using within data.gov.uk is the <a href="http://code.google.com/p/opmv/">Open Provenance Model Vocabulary</a>. This vocabulary talks about Artifacts, Processes that create and use them, and Agents that control those processes. We&#8217;ve created an extension of this vocabulary specifically to help describe this kind of scenario, where a spreadsheet is processed using Gridworks and then exported using a template. I&#8217;ll put this provenance information in a separate file simply because embedding provenance information, which includes a template, in the template itself gets us into nasty recursion issues.</p>

<p>As well as the template, there are two supplementary artifacts that we need to record the provenance of this data:</p>

<ul>
<li>the Gridworks project itself</li>
<li>the JSON description of the set of operations performed by Gridworks</li>
</ul>

<p>The first can be exported using the <code>Project</code> menu. The second is accessed through the <code>Undo/Redo</code> tab as shown in the following screenshot:</p>

<p><img src="/blog/files/undo-redo.jpg" title="Undo/Redo tab" style="text-align: center" /></p>

<p>This tab shows the actions that have been carried out on the data, and enables you to undo them in sequence. The <code>extract</code> link at the bottom opens up the dialog shown in the following screenshot:</p>

<p><img src="/blog/files/extract-dialog.jpg" title="Extract Operations dialog" style="width: 100%" /></p>

<p>You have to manually copy and paste the JSON description from the right of this dialog into a separate file in order to save it.</p>

<p>We can then start describing the provenance of the RDF; this needs to go in the Turtle file itself. We start by saying that the RDF that we&#8217;ve created was created from the Gridworks project and through an extraction operation. A simple link to the spreadsheet that was used as the source of the data also provides a quick link back to the original data:</p>

<pre><code>&lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&gt;
  a opmv:Artifact ;
  dct:source &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&gt; ;
  gridworks:wasExportedBy &lt;finance_supplier_payments_2010_q2_provenance#gridworks-export&gt; ;
  gridworks:wasExportedFrom &lt;finance_supplier_payments_2010_q2_project.tar.gz&gt; .
</code></pre>

<p>The provenance information then needs to describe the export process:</p>

<pre><code>&lt;#gridworks-export&gt;
  a gridworks:ExportUsingTemplate , opmv:Process ;
  rdfs:label "Process for Exporting Windsor &amp; Maidenhead data as Turtle" ;
  gridworks:project &lt;finance_supplier_payments_2010_q2_project.tar.gz&gt; ;
  gridworks:template &lt;#gridworks-template&gt; .
</code></pre>

<p>The project itself was created from the original Excel spreadsheet. It was generated by an import that ignored a single non-blank header row, followed by the set of operations described by the JSON.</p>

<pre><code>&lt;finance_supplier_payments_2010_q2_project.tar.gz&gt;
  a gridworks:Project , opmv:Artifact ;
  rdfs:label "Windsor &amp; Maidenhead Supplier Payments April 2010 - June 2010 Gridworks Project"@en ;
  gridworks:wasCreatedFrom &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&gt; ;
  opmv:wasGeneratedBy &lt;#gridworks-processing&gt; .

&lt;#gridworks-processing&gt;
  a gridworks:Process , opmv:Process ;
  rdfs:label "Processing on the Gridworks Project"@en ;
  common:usedData &lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&gt; ;
  gridworks:ignore 1 ;
  gridworks:operationDescription &lt;finance_supplier_payments_2010_q2_operations.json&gt; .

&lt;finance_supplier_payments_2010_q2_operations.json&gt;
  a gridworks:OperationDescription , opmv:Artifact ;
  rdfs:label "Dump of the Processing carried out by Gridworks on Windsor &amp; Maidenhead Supplier Payments April 2010 - June 2010 data"@en ;
  gridworks:wasExportedFrom &lt;finance_supplier_payments_2010_q2_project.tar.gz&gt; ;
  gridworks:wasExportedBy &lt;#gridworks-operation-description-extraction&gt; .

&lt;#gridworks-operation-description-extraction&gt;
  a gridworks:ExtractOperationDescription , opmv:Process ;
  rdfs:label "Extraction of the operation description from the Windsor &amp; Maidenhead Supplier Payments April 2010 - June 2010 Project from Gridworks"@en ;
  gridworks:project &lt;finance_supplier_payments_2010_q2_project.tar.gz&gt; .
</code></pre>

<p>The template is described in terms of the separate parts; in fact it&#8217;s useful to use this provenance file as the record of the template that you use, given that Gridworks won&#8217;t save the template in the project itself.</p>

<pre><code>&lt;#gridworks-template&gt;
  a gridworks:Template , opmv:Artifact ;
  gridworks:prefix """
...
"""^^xsd:string ;
  gridworks:rowTemplate """
...
"""^^xsd:string .
</code></pre>
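
<p>None of the descriptions above mention an Agent, although the Open Provenance Model Vocabulary provides for them. As a hedged sketch, assuming that <code>opmv:wasControlledBy</code> is the appropriate property and using a made-up <code>&lt;#publisher&gt;</code> identifier, the export process could be tied to whoever ran it like this:</p>

<pre><code># hypothetical: &lt;#publisher&gt; is a placeholder, not part of the original provenance file
&lt;#gridworks-export&gt;
  opmv:wasControlledBy &lt;#publisher&gt; .

&lt;#publisher&gt;
  a opmv:Agent ;
  rdfs:label "Person or organisation who ran the Gridworks export (placeholder)"@en .
</code></pre>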

<h2>Rinse and Repeat</h2>

<p>Gridworks makes it easy to repeat a given set of operations on another spreadsheet that follows the same structure. If you download the <a href="http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm">Windsor and Maidenhead spending data from 2009 Q4</a> and import it into Gridworks, you&#8217;ll see that it uses the same set of columns as the 2010 Q2 data that we&#8217;ve been looking at. (Strangely enough, the 2010 Q1 data doesn&#8217;t quite follow the same structure as it doesn&#8217;t include the &#8216;TransNo&#8217; column.)</p>

<p>There are a couple of differences:</p>

<ul>
<li>the &#8216;Updated&#8217; column isn&#8217;t recognised as holding dates on import; you can use <code>Edit Cells... &gt; Transform</code> to change these values into dates using the <code>toDate(value)</code> formula</li>
<li>the &#8216;Amount excl vat £&#8217; column isn&#8217;t recognised as holding numbers on import because the values have commas in them; you can use the formula <code>toNumber(replace(value, ',', ''))</code> to rectify this</li>
</ul>

<p>You might want to do some more cleaning, for example to check for duplicates, but once that is done, you can use the <code>apply</code> link at the bottom of the <code>Undo/Redo</code> tab to apply the JSON operation description that you extracted for the previous spreadsheet to this one. The templates require only a little tweaking to give different filenames and labels, but otherwise can be used as-is.</p>

<p>So while the process of cleaning data, deriving values and creating a template for exporting as Turtle is a bit of effort, the likelihood is that you will be able to repeat the same operations on similar data with a minimal amount of work.</p>

<h2>Conclusions</h2>

<p>Gridworks is a simply amazing tool for data cleansing, analysis and, as we&#8217;ve seen, transformation. It&#8217;s set to become more so for our purposes in the near future, as it comes to support the mapping of names for things to URIs using configurable reconciliation services (which might allow it to automatically map Government Department names to URIs, for example), and the creation of RDF using a more intuitive and user-friendly approach than the templates that I&#8217;ve illustrated here.</p>

<p>Of course there are issues, particularly for UK civil servants who typically have to operate on locked-down machines running IE7 (if they&#8217;re lucky). Gridworks also only deals with the fairly simple cases of data that fits in a spreadsheet-like structure, without the complexities of annotations on rows, columns or individual cells that we often see in government data.</p>

<p>Nevertheless, there&#8217;s huge potential here to provide a fairly easy route to the publication of linked data for people who are familiar with spreadsheets, in particular one that can be tweaked and extended to allow for the variety and complexity of real-world data.</p>
    ]]></content>
  </entry>
  <entry>
    <title>legislation.gov.uk: Credit Where it&#039;s Due</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/144" />
    <id>http://www.jenitennison.com/blog/node/144</id>
    <published>2010-08-14T12:18:38+01:00</published>
    <updated>2010-08-23T09:01:32+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="legislation" />
    <category term="opendata" />
    <category term="psi" />
    <summary type="html"><![CDATA[<p>I&#8217;m aware I&#8217;ve been quiet for the past few months. This isn&#8217;t because nothing interesting has been going on &#8212; rather the opposite. It&#8217;s been difficult to get a chance to sit down and write about the work I&#8217;ve been doing, when actually doing the work has been taking up so much time.</p>

<p>Most of my time has been spent on the new <a href="http://www.legislation.gov.uk">legislation.gov.uk</a> website and its underlying API. There&#8217;s so much to say about this project that I hardly know where to start, so I&#8217;ll just try to do an overview and we can take it from there. Let me know what you&#8217;re interested in.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>I&#8217;m aware I&#8217;ve been quiet for the past few months. This isn&#8217;t because nothing interesting has been going on &#8212; rather the opposite. It&#8217;s been difficult to get a chance to sit down and write about the work I&#8217;ve been doing, when actually doing the work has been taking up so much time.</p>

<p>Most of my time has been spent on the new <a href="http://www.legislation.gov.uk">legislation.gov.uk</a> website and its underlying API. There&#8217;s so much to say about this project that I hardly know where to start, so I&#8217;ll just try to do an overview and we can take it from there. Let me know what you&#8217;re interested in.</p>

<!--break-->

<p>legislation.gov.uk is a government website built on the principles of transparency and open data, including ideas laid out in the <a href="http://webarchive.nationalarchives.gov.uk/20100413152047/http://poit.cabinetoffice.gov.uk/poit/2009/02/modernising-information-publishing-final/">Power of Information Taskforce Report</a>. We have a lovely user interface which helps end-users find and understand legislation, but it&#8217;s layered over the top of an API that <a href="http://www.legislation.gov.uk/licence">anyone is free to use</a> to construct their own websites based on the same data.</p>

<p>In fact, we built the API first, and it&#8217;s been around (though not in a particularly stable state) for about a year. However, it turned out that building the user interface really helped in two ways. First, it helped the legislation experts who were looking at the documents to spot errors in a way that they unsurprisingly struggled to do when presented with raw XML. Second, it helped to identify things that the API needed to do to support a useful website, such as always providing links to the table of contents for an item of legislation or providing a search based on modification date.</p>

<p>Now, if you&#8217;ve been reading <a href="http://seanmcgrath.blogspot.com/2010/05/kliss-first-things-first-what-is.html">Sean McGrath&#8217;s blog</a> you&#8217;ll know that as far as content goes, legislation is about as tough as you can get. For a start, Acts and Statutory Instruments are <em>semi-structured documents</em>, not tabular data. It&#8217;s not a simple matter of storing and extracting rows in a database: we need to be able to address portions of an item of legislation such as &#8220;Local Government Act 1988 (c. 9, SIF 81:2), Sch. 3 para. 13(1)(b)(2)&#8221; (this is an <a href="http://www.legislation.gov.uk/ukpga/1975/30/section/24/2000-09-08#commentary-c1075749">actual citation</a>! I am not making this up!).</p>

<p>The content itself is complex. For legislation.gov.uk, the main challenge is not to do with faithfully reconstructing page and line breaks (fortunately!) but how to represent complex, annotated changes to legislation over time, and then how to present them. Much of this had already been done (in terms of technology) within the <a href="http://www.statutelaw.gov.uk">Statute Law</a> and <a href="http://www.opsi.gov.uk/legislation">OPSI</a> websites, although the data comes from a variety of sources over time, each with its own set of peculiarities to be navigated. The larger challenge here was to provide a mechanism for navigating through the content that made clear the distinctions between the various versions of legislation that people can look at and warned them about their status without overwhelming them with information.</p>

<p>We also have a lot of documents, some of which are very large. There are nearly 60,000 items of legislation on the site. The largest and most complex of them has hundreds of sections and about a hundred distinct versions. When you consider all the versions of all the possible fragments of all the items of legislation, you&#8217;re talking about 6.5 million distinct documents, each of which is available in HTML, XML, PDF and for which there is some RDF metadata.</p>

<p>On top of this, the content is constantly changing. New legislation is published every working day, first as PDFs, then as HTML (and XML), and then come various associated documents, the most important of which are Explanatory Notes, again first in PDF and then in HTML/XML form. Old legislation changes too; the legislation.gov.uk editorial team is constantly working through a backlog of changes to existing legislation brought about by new legislation. Simply hooking up the site to keep up to date with these changes has been an enormous challenge.</p>

<p>The content also changes because we intend to add features to the site over time. The site has already seen bug fixes and tweaks to address problems that we&#8217;ve encountered post-launch, and there are a number of new features in the pipeline to bring the site up to the level of completeness where it can fully replace the existing OPSI and Statute Law websites.</p>

<p>Then we needed something that was reasonably fast and robust in the face of moderately heavy traffic. Providing fast access to ever changing content, especially when the changes themselves are unpredictable, is an ongoing challenge.</p>

<p>All of this has only been possible by having an excellent team of experts and developers. One of the things that made this project quite different from the majority of government projects of this size was that it was much closer to Agile than Prince2: clients and providers working closely in the same team, chatting on daily calls, working side-by-side. From the developer perspective, it gave us direct access to the people who both had the expertise about the content and knew what they wanted. From the customer side, I hope and believe that it gave them as close involvement in the development of the site as they could want and a far deeper level of understanding about exactly how it works (and therefore what is easy and what is hard, and where compromises are best made) than they would have had otherwise.</p>

<p>So here are some credits. First, from <a href="http://www.tso.co.uk/">TSO</a>, where I work:</p>

<ul>
<li><strong><a href="http://twitter.com/careyfarrell">Carey Farrell</a></strong> ran the project, keeping track of the many and various bits and pieces that needed doing and finding the people to get them done; he has been the project&#8217;s backbone</li>
<li><strong><a href="http://twitter.com/pauldappleby">Paul Appleby</a></strong> may have moved on to better things part way through, but made his mark early on in its design and architecture, and much much earlier in the design of the XML schema, and in many of the stylesheets that underlie the HTML and PDF views of this data</li>
<li><strong>Lee Goodby</strong> put together the system infrastructure, arranging more machines and memory and disk space to satisfy our endless demands</li>
<li><strong>Chunyu Cong</strong> performed many a thankless data wrangling task without complaint</li>
<li><strong>Griff Chamberlain</strong> has worked doggedly on this project (with occasional pauses for beer) since he came aboard, among many <em>many</em> other things working on the generation of PDF (via XSL-FO) from the XML source and dealing with the difficulties of usable next/previous navigation</li>
<li><strong><a href="http://www.menteithconsulting.com/wiki/People/TonyGraham">Tony Graham</a></strong> made the publication of Tables of Effects his own, as well as constantly improving our build processes</li>
<li><strong>Gavin Mannings</strong> achieved quite remarkable things with a combination of HTML, CSS and Javascript. If you don&#8217;t believe me, take a look at the source underlying <a href="http://www.legislation.gov.uk/ukpga">the histograms on the browse pages</a> or <a href="http://www.legislation.gov.uk/ukpga/1985/67/section/6?timeline=true">the timelines for legislation content</a></li>
<li><strong>Faiz Muhammad</strong> quickly got to grips with a whole set of complex and unfamiliar content and technologies, to create the UI from the API data</li>
<li><strong>Paul Harvey</strong> furnished us with data, warned us of bear pits, and remained astonishingly uncomplaining of the changes we were putting him through</li>
<li><strong>Marc Sturman</strong> brought all his expertise to bear in managing the publication of legislation from the SLD editorial system into the new website and pulled our fat from the fire both on deployment and in the creation of the larger PDFs</li>
<li><strong>Vinod Sathyamoorthy</strong> worked on all aspects of the infrastructure: scaling out the environment, testing it, configuring it and so on, to make it into a site that more than a few people could access at a time</li>
<li><strong><a href="http://twitter.com/RobBullen">Rob Bullen</a></strong> brought a little more order to a kind of controlled chaos, in the way a good project manager should</li>
<li><strong>Terry Blake</strong> had the clout and the clear-sighted vision to get things done, as well as (and perhaps secretly enjoying) getting his hands dirty on occasion</li>
</ul>

<p>From <a href="http://www.bunnyfoot.com/">Bunnyfoot</a>:</p>

<ul>
<li><strong>Mark Pierce</strong> designed the look and feel of the site, having to get to grips with the complexities of legislative content as well as treading the fine line between making the site look modern yet authoritative, appealing whilst not detracting from the content</li>
<li><strong>Rebecca Gill</strong> provided clear eyes, analysis and insight to help us understand how to improve the site for our users</li>
</ul>

<p>And from <a href="http://www.nationalarchives.gov.uk/">The National Archives</a>:</p>

<ul>
<li><strong><a href="http://twitter.com/crallison">Clare Allison</a></strong> has devoted her life to ensuring that the content on the site is as accurate and meaningful as it can be, working with the astonishing complexities of the legislation content with an amazing depth of knowledge and expertise</li>
<li><strong><a href="http://twitter.com/clairelait">Claire Lait</a></strong> has poured her soul into providing a meaningful and useful experience for the end users of the site with insight, intelligence and unparalleled openness and enthusiasm</li>
<li><strong>Catherine Tabone</strong> has dealt with the traumas of the ups and downs of deployment with fortitude and good humour</li>
</ul>

<p>And finally, none of this would have happened without <strong><a href="http://twitter.com/johnlsheridan">John Sheridan</a></strong> having the ambition and the vision for how legislation should be published on the web, creating the environment that enabled this project to be done, setting a positive tone and providing support, encouragement and a gently guiding hand throughout the process.</p>

<p>This isn&#8217;t everyone who has been involved in the project: there are system administrators and testers and beta users and a whole cloud of other support particularly from <a href="http://www.marklogic.com/">MarkLogic</a>, <a href="http://orbeon.com/">Orbeon</a> and <a href="http://www.akamai.com/">Akamai</a>. But these are the people who let it consume their lives for at least a while. Every one of them was vitally important to the project, bringing their own expertise and skills and personality. I admire them all hugely. 
No project of this size is completely plain sailing, and I am convinced that we would be in a very different position today if the project hadn&#8217;t been built on mutual respect and trust. I&#8217;ve sketched some of the challenges that we faced. If it all looks easy, it&#8217;s only because this group of people did their jobs incredibly well. This is my public thanks to them for all their work.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Distributed Publication and Querying</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/143" />
    <id>http://www.jenitennison.com/blog/node/143</id>
    <published>2010-03-22T21:26:53+00:00</published>
    <updated>2010-07-31T22:05:51+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="sparql" />
    <summary type="html"><![CDATA[<p>One of the biggest selling points of linked data is that it&#8217;s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.</p>

<p>Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people&#8217;s blogs or twitter streams when I mention them.</p>

<p>And with everyone using the same terms to talk about the same things, there&#8217;s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we&#8217;d never have found otherwise.</p>

<p>What&#8217;s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn&#8217;t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.</p>

<p>This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It&#8217;s a bit sketchy, and I&#8217;d welcome comments.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>One of the biggest selling points of linked data is that it&#8217;s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.</p>

<p>Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people&#8217;s blogs or twitter streams when I mention them.</p>

<p>And with everyone using the same terms to talk about the same things, there&#8217;s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we&#8217;d never have found otherwise.</p>

<p>What&#8217;s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn&#8217;t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.</p>

<p>This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It&#8217;s a bit sketchy, and I&#8217;d welcome comments.</p>

<!--break-->

<h2>Publishing Datasets</h2>

<p>First, let&#8217;s look at the publication of data. We publish data at the moment in all kinds of ways: embedded tables within PDFs, CSV database dumps, Excel spreadsheets, Word documents, XML, JSON, N3 and so on and on. Each of these documents contains a set of information: a dataset.</p>

<p>Each dataset contains information about a whole load of <em>things</em>, usually real-world things. This is easy to see when you have datasets that contain lots of things of the same type: a spreadsheet might contain information about lots of different local authorities, a database dump about a bunch of schools. In FOAF terms, we&#8217;d say that the dataset has each of these things as a <em>topic</em>.</p>

<p>Even datasets that are really about one <em>thing</em> (have, in FOAF terms, a <em>primary topic</em>) contain information about lots of other things. For example, a web page about a hospital might include some level of information about the different departments within the hospital, the strategic health authority that it belongs to, the chief executive and so on. Information that is just about one thing is rarely useful; at the very least, you will want to know the labels of things that it&#8217;s related to.</p>

<p>If we move to thinking about linked data, each <em>thing</em> is assigned an HTTP URI. There is then one particular dataset that stands above all the other datasets that contain information about that <em>thing</em>: the dataset in the document that you get when you resolve its URI. The fact that there is this dataset doesn&#8217;t alter the fact that there are many many other datasets out there that contain information about the <em>thing</em>. But the dataset that you get at the URI for the thing obviously has a special role.</p>

<p>These datasets &#8212; the ones you get at the end of a resource&#8217;s URI &#8212; are <em>the</em> way in which an organisation can exercise control over the use of URIs minted within their domain. The organisation that controls the URI for a <em>thing</em> determines whether that URI resolves, and what is at the end of the URI. If fifteen different websites all published information about a school consistently using the same URI for that school, anyone could pull that information together into something potentially useful. But if the URI for the school doesn&#8217;t actually resolve, then you would have to wonder whether the school actually exists, or if it&#8217;s just a figment of the imagination of those fifteen websites: a spoof school.</p>

<p>Also, you&#8217;d expect the information that you find at the end of the URI to be correct and up to date. You&#8217;d expect it to be reasonably complete as well: to return a bunch of information about the school and pointers to more information about the school. This information is likely to come from a bunch of trusted sources: an integrated view over a collection of other datasets.</p>

<h2>Providing SPARQL Endpoints</h2>

<p>We&#8217;ve established that</p>

<ul>
<li>anyone can publish information about anything they choose, but that people will have different levels of trust in different sources of information</li>
<li>information about any one <em>thing</em> is seldom useful on its own; the power of the linked data web is the ability to make connections between things</li>
</ul>

<p>And so on to querying. Linked data can be useful without explicit querying &#8212; you can navigate around related sets of information by following links, and pull together information gleaned from different sites &#8212; but querying of some kind provides much more potential power and, with a <a href="http://purl.org/linked-data/api/spec">linked data API</a>, the opportunity to provide an easy-to-use web-based API for the data.</p>

<p>SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.</p>

<p>And so we have a quandary. To support queries, we need all the data we might want to query to be pulled into a single triplestore. Given that all data is linked, and all links are potentially interesting, the only answer seems to be to have the whole web of data in a single store. And that kind of centralised solution seems impractical, both in terms of the sheer size of store you&#8217;d need and the obvious impact on efficiency of doing so.</p>

<h2>Curated Triplestores</h2>

<p>I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that&#8217;s out there. What subset is useful for a given triplestore is a design question that should be informed by the potential users of that particular service. Larger subsets are likely to locate more cross-connections, but have a performance penalty.</p>

<p>For example, a service that was oriented towards helping local authorities plan their schooling provision might include all the current data about nursery, primary and secondary schools (but not universities or versioned data), information about their administrative district and the district that they appear in (but no extra information about census areas), and those neighbourhood statistics, including historic data, that relate to children and schooling (but not those that relate to care of the elderly, for example).</p>

<p>Another service might include all historic information about schools and universities and historic information about all associated administrative geography, but not include neighbourhood statistics.</p>

<h2>Supporting On-Demand Triplestores</h2>

<p>In the scenario painted above, each triplestore will include different datasets, brought together for a particular purpose. Imagine a huge warehouse full of boxes, each of which is a particular dataset. Each triplestore will fit together a different set of those boxes. What&#8217;s neat about the linked data approach is that the boxes are really easy to bring together: creating a triplestore should just be a matter of selecting which datasets you want to use with little or no hand-crafting of links between them or resolution of naming conflicts.</p>

<p>The challenge from the side of the data publisher is to enable these triplestores to be both created and kept up to date. A data publisher has to:</p>

<ul>
<li>describe what datasets are available</li>
<li>describe how these link to other potentially interesting datasets, to give hints about where connections might be made</li>
<li>provide a mechanism for getting the current state of all the available datasets (which can obviously be through crawling but could alternatively be through a dump or set of dumps)</li>
<li>provide a mechanism for informing interested parties about new datasets being made available (which could be through routine crawling or through a feed)</li>
<li>provide a mechanism for informing interested parties about when a dataset changes (which could also be through routine crawling or through a feed)</li>
</ul>

<p>A lot of these problems are solved.</p>

<p><a href="http://rdfs.org/ns/void/">voiD</a>&#8217;s purpose in life is to describe datasets and how they link to each other, and it provides a <code>void:dataDump</code> property that points to a dump of the data. voiD can describe datasets that are supersets of other datasets, which enables datasets to be grouped together into potentially useful bundles.</p>
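
<p>As a rough sketch of how a triplestore builder might use such a description, the following query lists the datasets within each bundle and where to fetch their dumps. It assumes that the publisher&#8217;s voiD description has been loaded into the store being queried and that bundles list their members with <code>void:subset</code>; a particular publisher may well structure their description differently.</p>

<pre><code>PREFIX void: &lt;http://rdfs.org/ns/void#&gt;

# Assumption: a bundle is a void:Dataset that lists its members with void:subset
SELECT ?bundle ?dataset ?dump
WHERE {
  ?bundle a void:Dataset ;
    void:subset ?dataset .
  ?dataset void:dataDump ?dump .
}
ORDER BY ?bundle ?dataset
</code></pre>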

<p>Where information needs to be kept up to date, we can use feeds. We need to keep up to date information about the datasets that a publisher makes available, and information about the content of a particular dataset. This can be achieved through a single Atom feed in which each dataset is recorded as an entry, with an <code>&lt;updated&gt;</code> element indicating its last update. Datasets that are removed can be indicated through a <a href="http://tools.ietf.org/html/draft-snell-atompub-tombstones-06"><code>deleted-entry</code> element</a>. There is some ongoing work that suggests how to <a href="http://groups.google.com/group/dataset-dynamics/web/components-vocabularies-protocols-formats">augment voiD with a pointer to such a feed</a>.</p>

<p>As well as pointing to a dataset, and indicating that it has been updated, the Atom feed could contain information about the change itself, represented as a <a href="http://vocab.org/changeset/schema.html">changeset</a>. This could be included as part of the information provided about the new version of the dataset, described in terms of its <a href="http://www.jenitennison.com/blog/node/142">provenance</a>.</p>

<p>Feeds that were provided in this way could be provided using the normal model, whereby any interested triplestores would regularly check the feed for updates, or using <a href="http://code.google.com/p/pubsubhubbub/">PubSubHubbub</a> in order to push notifications to triplestores. The latter would require triplestore providers to support a service that accepted such notifications, of course.</p>

<p>A triplestore should expose which datasets (and which versions of those datasets) are used within the triplestore. This can be gathered through a SPARQL query to list the available graphs and their metadata, so long as that information is included within the named graphs themselves.</p>
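
<p>Such a query might look something like the following. This is only a sketch: it assumes that each named graph contains statements about itself, with the graph&#8217;s own URI as the subject, and that the metadata uses <code>dct:source</code> and <code>dct:modified</code>. Those property choices are illustrative rather than anything agreed.</p>

<pre><code>PREFIX dct: &lt;http://purl.org/dc/terms/&gt;

# Lists every named graph that describes itself, with where it came from
# and (if recorded) when it was last modified
SELECT ?graph ?source ?modified
WHERE {
  GRAPH ?graph {
    ?graph dct:source ?source .
    OPTIONAL { ?graph dct:modified ?modified . }
  }
}
ORDER BY ?graph
</code></pre>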

<h2>What Should We Do?</h2>

<p>How does all this translate into what guidelines we should put into place for UK government publishers and what tools we should provide centrally?</p>

<p>First, we need to recognise the responsibility that comes with the ownership of a URI. Within the UK, we are encouraging people to use URIs of the form:</p>

<pre><code>http://{sector}.data.gov.uk/id/{concept}/{identifier}
</code></pre>

<p>to name things like schools and hospitals, with the recognition that information about those things might come from many different public bodies. <em>Someone</em> has to be in charge of that domain: they have to determine which URIs within a particular URI set are resolvable, and what information is provided at the end of each URI. These same sector owners should support easy-to-use APIs based around the particular URI sets that they are responsible for.</p>

<p>The easiest route to supporting the pages, an easy-to-use API, and a SPARQL endpoint for deeper querying is going to be to create a curated triplestore with a <a href="http://purl.org/linked-data/api/spec">linked data API</a> layer over the top. This triplestore will need to be populated with data from multiple datasets, both as separate named graphs (to provide traceability back to the original data) and merged into a default graph that reflects the current state of the world.</p>
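
<p>To illustrate why keeping both is useful, here is a sketch of a query that reads the merged view of a school from the default graph and then reports which named graph or graphs each statement came from. The school URI is made up, following the pattern above, and the query assumes that the default graph really is a merge of the named graphs.</p>

<pre><code># Hypothetical school URI, for illustration only
SELECT ?property ?value ?graph
WHERE {
  # the current, merged view in the default graph
  &lt;http://education.data.gov.uk/id/school/12345&gt; ?property ?value .
  # the original dataset(s) that made the same statement
  GRAPH ?graph {
    &lt;http://education.data.gov.uk/id/school/12345&gt; ?property ?value .
  }
}
</code></pre>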

<p>The precise datasets that are included within the triplestore will depend on the judgement of the sector owners about both the trustworthiness of the available datasets and their utility. For example, it&#8217;s likely that a lot of triplestores will want to include information about administrative geography and perhaps some information about time, simply because everything happens somewhere and sometime.</p>

<p>Second, we need to make this process really easy, through guidelines and tooling.</p>

<p>We encourage the data owners themselves (which are individual public bodies) to publish, along with the datasets themselves:</p>

<ul>
<li>voiD descriptions of the groups of datasets that they publish</li>
<li>metadata about the individual datasets that they publish (within each dataset itself)</li>
<li>Atom feeds that are updated each time datasets are added, removed or altered, preferably including changeset information</li>
<li>(optionally) dumps of groups of datasets, in NQuads format</li>
<li>(optionally) notifications of changes to the Atom feed to a PubSubHubbub hub</li>
</ul>

<p>Data owners should be able to split up the datasets that they provide into different groups based on their knowledge of the domain, with the possibility of individual datasets belonging to more than one group.</p>

<p>We then create tooling that can:</p>

<ul>
<li>enable the sector owners to quickly and easily put together a list of trusted sites from which datasets can be gathered</li>
<li>collect datasets from these sites, either through NQuads dumps or through crawling</li>
<li>merge datasets to create a default current view</li>
<li>put these datasets into a triplestore</li>
<li>keep the triplestore up to date, either through polling feeds or by accepting PubSubHubbub notifications to identify changes, applying those changes, and merging data as required</li>
</ul>

<p>To facilitate PubSubHubbub use, which supports timely updating of triplestores, we&#8217;d need a PubSubHubbub hub. Data owners can inform this hub of updates to their feeds and sector owners can register interest in particular feeds.</p>

<p>These guidelines and tooling are not just useful for sector owners: they are useful for anyone who wants to pull together linked data published in a distributed way across the web. We should expect and encourage multiple stores offering different combinations of datasets and different levels of service. The ones offered centrally, by sector owners, are certainly not the be-all and end-all &#8212; in fact we should look on them as a basic level of service, to be superseded by the community.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Translating Existing Models to RDF</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/142" />
    <id>http://www.jenitennison.com/blog/node/142</id>
    <published>2010-03-13T20:35:46+00:00</published>
    <updated>2010-07-31T22:03:39+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="modelling" />
    <category term="provenance" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as in XML and a text-based form, that are supported by existing tool chains.</p>

<p>In contrast, if there is existing RDF in that domain area, it&#8217;s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user rather than the professionals in the area.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as in XML and a text-based form, that are supported by existing tool chains.</p>

<p>In contrast, if there is existing RDF in that domain area, it&#8217;s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user rather than the professionals in the area.</p>

<!--break-->

<p>To give an example, the international statistics community uses <a href="http://sdmx.org">SDMX</a> for representing and exchanging statistics (and a lot more besides; it&#8217;s a huge standard). SDMX includes a well-thought-through model for statistical datasets and the observations within them, as well as standard concepts for things like gender, age, unit multipliers and so on. By comparison, <a href="http://sw.joanneum.at/scovo/schema.html">SCOVO</a>, the main RDF model for representing statistics, barely scratches the surface.</p>

<p>This isn&#8217;t the only example: the <a href="http://inspire.jrc.ec.europa.eu/">INSPIRE Directive</a> defines how geographic information must be made available. <a href="http://www.gigateway.org.uk/metadata/standards.html">GEMINI</a> defines the kind of geospatial metadata that that community cares about. The <a href="http://openprovenance.org/">Open Provenance Model</a> is the result of many contributors from multiple fields, and again has a number of serialisations.</p>

<p>You could view this as a challenge: experts in their domains already have models and serialisations for the data that they care about; how can we persuade them to adopt an RDF model and serialisations instead?</p>

<p>But that&#8217;s totally the wrong question. Linked data doesn&#8217;t, can&#8217;t and won&#8217;t replace existing ways of handling data. But it has got some interesting features that can bring great benefit to people who want to publish their data, namely:</p>

<ul>
<li><strong>web-scale addresses</strong> &#8212; being able to name and refer to things like individual observations in a statistical hypercube, a particular road junction, or the particular process that led to something being created</li>
<li><strong>annotation</strong> &#8212; the ability to record metadata about everything that you can name, which is everything!</li>
<li><strong>distributed publication</strong> &#8212; enabling multiple publishers to control the publication of their data without having to upload it to a central location</li>
<li><strong>links</strong> &#8212; the joining of information to other information, providing more context, supporting more queries and reducing the requirement for duplication</li>
</ul>

<p>The question is really about how to enable people to reap these benefits; the answer, because HTTP-based addressing and typed linkage is usually hard to introduce into existing formats, is usually to publish data using an RDF-based model alongside existing formats. This might be done by generating an RDF-based format (such as RDF/XML or Turtle) as an alternative to the standard XML or HTML, accessible via content negotiation, or by providing a <a href="http://www.w3.org/TR/grddl/">GRDDL</a> transformation that maps an XML format into RDF/XML.</p>

<p>Either way, the underlying model needs to be mapped into RDF. We&#8217;re furthest down this road with <a href="http://groups.google.com/group/publishing-statistical-data">statistical data</a>. I wanted to explore here what it might look like for the Open Provenance Model, building on lessons learned from the statistical domain.</p>

<h2>Open Provenance Model</h2>

<p>The Open Provenance Model talks about three main <strong>nodes</strong>:</p>

<ul>
<li><strong>artifacts</strong>, which are the things that are produced or used by processes</li>
<li><strong>processes</strong>, which are actions that are performed using or producing artifacts</li>
<li><strong>agents</strong>, which are the people or systems that perform actions</li>
</ul>

<p>and five kinds of <strong>edges</strong> that can be defined between them:</p>

<ul>
<li>process A <strong>used</strong> artifact B</li>
<li>artifact A <strong>was generated by</strong> process B</li>
<li>process A <strong>was controlled by</strong> agent B</li>
<li>process A <strong>was triggered by</strong> process B</li>
<li>artifact A <strong>was derived from</strong> artifact B</li>
</ul>

<p>Then things start getting more complicated. OPM indicates that each artifact and agent plays a different <strong>role</strong> when it is used by, generated by or controls a process. What&#8217;s more, each artifact and agent might be involved in the process at different <strong>times</strong> (though timing information is optional within OPM). And a given provenance graph may contain several <strong>accounts</strong> of how artifacts, processes and agents fit together.</p>

<h2>Existing Mapping to RDF</h2>

<p>The <a href="http://openprovenance.org/model/opm.owl">OWL ontology for OPM</a> is a very literal mapping of OPM into RDF. Each of the types of nodes is a separate class, and each of the types of edges is a separate class. Thus, it introduces a lot of n-ary relationships. Take a really simple example of an XML file being transformed into HTML using XSLT. With the OPM ontology, the RDF would look something like:</p>

<pre><code>_:transformation a opm:Process .
&lt;doc.html&gt; a opm:Artifact .
&lt;doc.xml&gt; a opm:Artifact .
&lt;doc.xsl&gt; a opm:Artifact .
_:processor a opm:Agent .
_:Jeni a opm:Agent .

_:sourceLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &lt;doc.xml&gt; ;
  opm:role xslt:source .

_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &lt;doc.xsl&gt; ;
  opm:role xslt:stylesheet .

_:resultLink a opm:WasGeneratedBy ;
  opm:effect &lt;doc.html&gt; ;
  opm:cause _:transformation ;
  opm:role xslt:result .

_:processorLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:processor ;
  opm:role xslt:processor .

_:userLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni ;
  opm:role xslt:user .

_:derivation a opm:WasDerivedFrom ;
  opm:effect &lt;doc.html&gt; ;
  opm:cause &lt;doc.xml&gt; .

xslt:source a opm:Role ;
  opm:value "source" .

xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .

xslt:result a opm:Role ;
  opm:value "result" .

xslt:processor a opm:Role ;
  opm:value "processor" .

xslt:user a opm:Role ;
  opm:value "user" .
</code></pre>

<p>To give you an idea of what this mapping means, if I wanted to work out who created <code>doc.html</code>, I would have to do a query like:</p>

<pre><code>SELECT ?who
WHERE {
  ?generatedBy
    opm:effect &lt;doc.html&gt; ;
    opm:role xslt:result ;
    opm:cause ?transformation .
  ?controlledBy
    opm:effect ?transformation ;
    opm:role xslt:user ;
    opm:cause ?who .
}
</code></pre>

<h2>Some Observations</h2>

<p>There are two things that I want to pull out about the RDF mapping described above.</p>

<ul>
<li>it&#8217;s incredibly literal; every entity type within the model is mapped onto an RDF class, including the edges, the roles and the accounts (which I didn&#8217;t show above)</li>
<li>it doesn&#8217;t reuse any existing vocabularies, even when they might help (such as for the &#8216;value&#8217; of a role, which is really a label)</li>
</ul>

<p>It reminds me of the mapping of object-oriented or relational data models into each other or into XML, which often results in a god awful mess and people swearing that technology X is goddamned ugly.</p>

<p>The fact is that elegant uses of each modelling paradigm &#8212; ones that are easy to understand and efficient to query &#8212; always take advantage of the unique features of that paradigm. For example, good XML vocabularies take advantage of the distinctions between attributes and elements, of nesting and hierarchies, and of the ability to hold mixed content.</p>

<p>It&#8217;s the same with RDF. There are four features of RDF that I think good vocabularies will take suitable advantage of:</p>

<ul>
<li>existing vocabularies</li>
<li>inheritance</li>
<li>shortcuts and reasoning</li>
<li>named graphs</li>
</ul>

<p><strong>Reusing existing vocabularies</strong> takes advantage of the ease of bringing together diverse domains within RDF, and it makes data more reusable. For example, an OPM mapping that encourages the reuse of FOAF for people and organisations saves the developers of the OPM RDF vocabulary the time and effort that they would otherwise have spent modelling the details of agents; and it means that any agents that are described within the description of a piece of provenance are automatically available as agents in the wider FOAF cloud. The same goes for using DOAP to describe software.</p>

<p>By reusing vocabularies, the data isn&#8217;t isolated any more, locked within a single context designed for a single use. This is a huge benefit of the linked data approach and it makes sense to leverage it.</p>

<p><strong>Using inheritance</strong> means creating general purpose classes and properties and encouraging other people to use <code>rdfs:subClassOf</code> or <code>rdfs:subPropertyOf</code> to specialise them according to their own requirements. Within OPM, the different roles that artifacts and agents might play in a process are a natural fit with either sub-properties or sub-classes, depending on how the edges in the model are represented. For example, rather than</p>

<pre><code>_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause &lt;doc.xsl&gt; ;
  opm:role xslt:stylesheet .

xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .
</code></pre>

<p>you could generate data that looked like:</p>

<pre><code>_:stylesheetLink a xslt:Stylesheet ;
  opm:effect _:transformation ;
  opm:cause &lt;doc.xsl&gt; .
</code></pre>

<p>where <code>xslt:Stylesheet</code> is defined as a subclass of <code>opm:Used</code>.</p>

<p>Inheritance is a basic form of <strong>reasoning</strong>. In the case of the subclass relationship outlined above, the reasoning is that anything that is an <code>xslt:Stylesheet</code> is also an <code>opm:Used</code>, and thus:</p>

<pre><code>_:stylesheetLink a xslt:Stylesheet .
</code></pre>

<p>implies</p>

<pre><code>_:stylesheetLink a opm:Used .
</code></pre>

<p>Taking the scenario where you&#8217;re doing native linked data publishing &#8212; storing data in a triplestore and then publishing it out from there &#8212; you have two choices:</p>

<ul>
<li>you can store just the basic data, and let the application retrieving it carry out whatever reasoning is necessary to derive the information they need; this limits the size of the triplestore, but can place a large burden on people using it &#8212; either they have to be very familiar with the exact choices made in modelling the basic data, or they have to construct complex SPARQL queries that take account of the fact that the data might be modelled in many different ways</li>
<li>you can store not only the basic data but also anything that can be derived from it; this increases the number of triples you have to store, but means that people can query it without having to perform any reasoning themselves</li>
</ul>

<p>The latter is obviously the more user-friendly approach. (And a triplestore could make it easy by understanding and applying schemas, ontologies and rules as data is loaded in.)</p>
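
<p>Where a triplestore doesn&#8217;t do this for you, the subclass reasoning can be approximated at load time with a query. The sketch below assumes that the schema (the <code>rdfs:subClassOf</code> statements) is loaded into the same store as the data, and that the resulting triples are added back into the store. It only follows one step of the class hierarchy, so it would need to be run repeatedly, until it produces nothing new, to cope with deeper hierarchies.</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;

# For every typed resource, also assert membership of the direct superclass
CONSTRUCT {
  ?thing a ?superclass .
}
WHERE {
  ?thing a ?class .
  ?class rdfs:subClassOf ?superclass .
}
</code></pre>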

<p>To take a more complex example, provenance could be modelled in a much more direct way, such as:</p>

<pre><code>&lt;doc.html&gt; a opm:Artifact ;
  opm:derivedFrom &lt;doc.xml&gt; ;
  opm:generatedBy [
    xslt:source &lt;doc.xml&gt; ;
    xslt:stylesheet &lt;doc.xsl&gt; ;
    xslt:processor _:processor ;
    xslt:user _:Jeni ;
  ] .
</code></pre>

<p>where <code>xslt:source</code> and <code>xslt:stylesheet</code> are sub-properties of a property called <code>opm:used</code>, and <code>xslt:processor</code> and <code>xslt:user</code> are sub-properties of <code>opm:controlledBy</code>. This removes the n-ary properties, which (given the use of inheritance to represent roles) are only actually needed if the model needs to capture the timing of the involvement of particular artifacts or agents within a process, and makes the provenance information much easier to query than before:</p>

<pre><code>SELECT ?who
WHERE {
  &lt;doc.html&gt; opm:generatedBy ?transformation .
  ?transformation xslt:user ?who .
}
</code></pre>

<p>But what if we also want to support the more complex, n-ary-relation-based models? We would need to assert, somehow, a rule that said that the presence of an <code>opm:controlledBy</code> relationship from a process to an agent was equivalent to having an <code>opm:WasControlledBy</code> instance with an <code>opm:cause</code> pointing to the agent and an <code>opm:effect</code> pointing to the process. Combine this with <code>xslt:user</code> being a sub-property of <code>opm:controlledBy</code> and you have the statement:</p>

<pre><code>_:transformation xslt:user _:Jeni .
</code></pre>

<p>implying:</p>

<pre><code>_:transformation opm:controlledBy _:Jeni .
</code></pre>

<p>which in turn implies:</p>

<pre><code>[] a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni .
</code></pre>
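
<p>In the absence of a rule language, one way to sketch that second step is as a <code>CONSTRUCT</code> query whose results are added back into the store. The <code>opm:</code> namespace below is a placeholder to be replaced with whatever namespace the data actually uses, and <code>opm:controlledBy</code> is the shortcut property suggested above.</p>

<pre><code># Placeholder namespace: substitute the one used in your data
PREFIX opm: &lt;http://example.org/opm#&gt;

# Each direct link gives rise to a fresh blank node for the n-ary edge
CONSTRUCT {
  [] a opm:WasControlledBy ;
    opm:effect ?process ;
    opm:cause ?agent .
}
WHERE {
  ?process opm:controlledBy ?agent .
}
</code></pre>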

<p>The same reasoning could be applied in the opposite direction, of course. Part of the definition of the use of OPM in RDF could be that the presence of an <code>opm:WasControlledBy</code> with an <code>opm:cause</code> pointing to an agent and an <code>opm:effect</code> pointing to a process implies an <code>opm:controlledBy</code> link between the <code>opm:effect</code> and the <code>opm:cause</code>. Whichever was used in the initial modelling of the data, the same query could be used to query the data (accepting some loss of precision along the way, but if you&#8217;re not interested in timing information then why should you suffer the cost of querying through n-ary relations?).</p>
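
<p>As a sketch of that opposite direction, again with a placeholder <code>opm:</code> namespace and the suggested <code>opm:controlledBy</code> shortcut, a store could derive the direct links from existing n-ary data with something like:</p>

<pre><code># Placeholder namespace: substitute the one used in your data
PREFIX opm: &lt;http://example.org/opm#&gt;

# Collapse each n-ary edge into a direct process-to-agent link;
# role and timing details on the edge are deliberately dropped
CONSTRUCT {
  ?process opm:controlledBy ?agent .
}
WHERE {
  ?edge a opm:WasControlledBy ;
    opm:effect ?process ;
    opm:cause ?agent .
}
</code></pre>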

<p>The final thing that I mentioned above that mappings from existing models to RDF should take advantage of is <strong>named graphs</strong>. In OPM, the obvious way that named graphs could play a role is in providing support for the different <em>accounts</em> of provenance. Separate named graphs could be used to represent separate accounts, referencing the same artifacts, agents and processes where appropriate. Individually, the graphs can remain simple; together, you have the full power of OPM.</p>
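
<p>To make that a little more concrete, suppose each account is stored as its own named graph and uses the direct modelling shown earlier. A query along these lines would then report who each account says created <code>doc.html</code>. The base URI and both namespaces are placeholders, and the one-graph-per-account layout is an assumption of this sketch.</p>

<pre><code># Placeholder base URI and namespaces: substitute those used in your data
BASE &lt;http://example.org/&gt;
PREFIX opm: &lt;http://example.org/opm#&gt;
PREFIX xslt: &lt;http://example.org/xslt#&gt;

# One row per account (named graph) that names a creator for the document
SELECT ?account ?who
WHERE {
  GRAPH ?account {
    &lt;doc.html&gt; opm:generatedBy ?transformation .
    ?transformation xslt:user ?who .
  }
}
</code></pre>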

<h2>Conclusions</h2>

<p>Modelling is a complex design activity, and you&#8217;re best off avoiding doing it if you can. That means reusing conceptual models that have been built up for a domain as much as possible and reusing existing vocabularies wherever you can. But you can&#8217;t and shouldn&#8217;t try to avoid doing design when mapping from a conceptual model to a particular modelling paradigm such as a relational, object-oriented, XML or RDF model.</p>

<p>If you&#8217;re mapping to RDF, remember to take advantage of what it&#8217;s good at, such as web-scale addressing and extensibility, and always bear in mind how easy or difficult your data will be to query. There is no point publishing linked data if it is unusable.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Versioning (UK Government) Linked Data</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/141" />
    <id>http://www.jenitennison.com/blog/node/141</id>
    <published>2010-02-27T22:15:40+00:00</published>
    <updated>2010-07-31T22:01:37+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="named graphs" />
    <category term="versioning" />
    <summary type="html"><![CDATA[<p>As you probably know, I&#8217;ve been working quite a lot recently on the UK government&#8217;s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I&#8217;ve <a href="http://www.jenitennison.com/blog/node/108">touched on this topic before</a> but things have progressed now to the stage where we have to make some real, practical, recommendations.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>As you probably know, I&#8217;ve been working quite a lot recently on the UK government&#8217;s use of linked data, and in particular on providing guidance for people who want to publish their data as linked data. One of the things that we need to provide guidance about is how to publish linked data that changes over time. I&#8217;ve <a href="http://www.jenitennison.com/blog/node/108">touched on this topic before</a> but things have progressed now to the stage where we have to make some real, practical, recommendations.</p>

<p><em>Note: the contents of this post have been greatly informed through discussions with <a href="http://www.ldodds.com/blog/">Leigh Dodds</a>, <a href="http://twitter.com/skwlilac">Stuart Williams</a>, <a href="http://www.amberdown.net/">Dave Reynolds</a>, <a href="http://iandavis.com/">Ian Davis</a> and John Sheridan. Ian Davis&#8217; series on <a href="http://blog.iandavis.com/2009/08/time-in-rdf-1">representing time in RDF</a> is also well worth a look for a comparison of alternative approaches.</em></p>

<p>I&#8217;ve split this into two parts: versioned information resources (which are pretty easy) and versioned non-information resources (which are pretty hard). For both, we need to</p>

<ul>
<li>provide some guidance about what the RDF should look like</li>
<li>mint or adopt properties to support that model</li>
</ul>

<h2>Versioned Information Resources</h2>

<p>Easy things first. Some of the things that we talk about, such as legislation, are information resources (web documents), and these have different versions. The relevant level of precision for legislation is a day, but this will be different for different kinds of documents &#8212; some might change every second, for others an incrementally increasing version number might be more appropriate than a date. A generic pattern for the URIs, based on the <a href="http://writetoreply.org/ukgovurisets/">design of URI sets for the UK public sector report</a> would be:</p>

<pre><code>http://{sector}.data.gov.uk/doc/{concept}/{identifier}/{version}
</code></pre>

<p>For example, the OFSTED report for a particular school based on an inspection carried out in 2009 might be something like:</p>

<pre><code>http://education.data.gov.uk/doc/inspection-report/12345/2009
</code></pre>

<p>(There might be sub-versions too, if the inspection report itself goes through a revision process.) The RDF for this document should include links to the previous reports that it replaces, and dates that indicate when it was created and so on:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt;
  rdfs:label "2009 Inspection Report for Such-and-Such School"@en ;
  dct:created "2009-10-18"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; .
</code></pre>

<p>It&#8217;s also useful to have a URI for the unversioned document; this is the same as for the versioned document, but without the version:</p>

<pre><code>http://{sector}.data.gov.uk/doc/{concept}/{identifier}
</code></pre>

<p>This document acts as a hub for the various concrete versions of the document:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/inspection-report/12345&gt;
  rdfs:label "Inspection Report for Such-and-Such School"@en ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt; ,
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; ,
    &lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt;
  rdfs:label "2009 Inspection Report for Such-and-Such School"@en ;
  dct:created "2009-10-18"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/inspection-report/12345&gt; .

&lt;http://education.data.gov.uk/doc/inspection-report/12345/2006&gt;
  rdfs:label "2006 Inspection Report for Such-and-Such School"@en ;
  dct:created "2006-11-23"^^xsd:date ;
  dct:isReplacedBy &lt;http://education.data.gov.uk/doc/inspection-report/12345/2009&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/inspection-report/12345/2003&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/inspection-report/12345&gt; .
</code></pre>

<p>It would be expected that people linking to the document would either point to a particular (dated) resource or to the unversioned (hub) document. For example, if someone were talking specifically about the 2006 OFSTED inspection, they would point to the 2006 inspection report; if they were referring to whatever inspection report is current, they&#8217;d use the unversioned URI.</p>

<blockquote>
  <p><em>Note: Although <code>dct:hasVersion</code> and <code>dct:isVersionOf</code> are sort-of OK here, having a property that points to the current (i.e. most recent) version of a resource would be very helpful.</em></p>
</blockquote>
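
<p>In the meantime, the current version can be worked out with a query. The sketch below picks the version of the hub document that nothing has replaced. It relies on every superseded version carrying a <code>dct:isReplacedBy</code> link, as the 2006 report does above; if that can&#8217;t be guaranteed, ordering by <code>dct:created</code> and taking the latest would be an alternative.</p>

<pre><code>PREFIX dct: &lt;http://purl.org/dc/terms/&gt;

# Finds the version(s) of the hub document that have not been replaced
SELECT ?current
WHERE {
  &lt;http://education.data.gov.uk/doc/inspection-report/12345&gt;
    dct:hasVersion ?current .
  OPTIONAL { ?current dct:isReplacedBy ?newer . }
  FILTER (!bound(?newer))
}
</code></pre>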

<h2>Versioned Non-Information Resources</h2>

<p>The harder problem is how we handle changes to non-information resources over time. For example, how do we handle the fact that a school often changes head, sometimes changes name, regularly changes class sizes, rarely changes address and so on? How do we handle the fact that we have legacy statistics about local authorities as they were in 2008, prior to the 2009 reorganisation, and that it&#8217;s very likely that these kinds of changes will continue to take place regularly in the future?</p>

<p>Our requirements are:</p>

<ul>
<li>that the data is easily usable by people who only care about the current state of a resource</li>
<li>that the (current) data remains easily queryable at a SPARQL endpoint</li>
<li>that it&#8217;s <em>possible</em> (not necessarily easy) to query historic data</li>
<li>that historic data can be moderately easily retrieved and navigated</li>
<li>that it can represent historical states even when the precise time period is not known</li>
<li>that it can distinguish between a change in the concept and a change in our record of it (e.g. changing the name of a school, versus correcting a typo in the database entry for the school)</li>
<li>that it can trace what the nature or cause of the change was (e.g. redrawing of local authority boundaries)</li>
</ul>

<h3>Statistical Data</h3>

<p>To begin our discussion, let&#8217;s look at statistical data. Statistical data is data that&#8217;s usually numeric and for which we have values that are categorised along multiple dimensions as well as time. School census information is statistical data, for example, because each value is associated with not only the school and the date at which the census was taken but also the age (and gender, but to simplify I&#8217;ll pretend just age) of the children being counted. This gives us a set of observations which might each look like:</p>

<pre><code>&lt;/data/edubase/census/12345/age/11/2009&gt; 
  a sdmx:Observation ;
  sdmx:dataset &lt;/data/edubase&gt; ;
  dct:replaces &lt;/data/edubase/census/12345/age/11/2008&gt; ;
  rdf:value 85 ;
  edu:school &lt;/id/school/12345&gt; ;
  edu:schoolYear &lt;/id/school-year/2009&gt; ;
  sdmx:age 11 .
</code></pre>

<blockquote>
  <p>Note: This is indicative of the vocabulary we might use for statistics; don&#8217;t rely on it. If you&#8217;re interested in the progress we&#8217;re making on modelling statistical datasets using RDF, come and join <a href="http://groups.google.com/group/publishing-statistical-data">the publishing statistical data Google Group</a>.</p>
</blockquote>

<p>These statistical observations point to the interval that they apply to as a property, with the <code>rdf:value</code> property holding the actual value. The observation won&#8217;t change over time (unless it is corrected, which I will come back to), and <strong>observations from different times can all remain present within the graph without interacting badly with each other</strong>.</p>

<p>This is great because it means that we can make queries that give us time series views over the data. For example, we could define a series for children aged 11 at this particular school over time something like this:</p>

<pre><code>&lt;/data/edubase/census/12345/age/11&gt;
  a sdmx:TimeSeries ;
  edu:school &lt;/id/school/12345&gt; ;
  sdmx:age 11 ;
  sdmx:observation
    &lt;/data/edubase/census/12345/age/11/2009&gt; ,
    &lt;/data/edubase/census/12345/age/11/2008&gt; ,
    &lt;/data/edubase/census/12345/age/11/2007&gt; ,
    ... .
</code></pre>

<p>and associate this with the school through a specialised property:</p>

<pre><code>&lt;/id/school/12345&gt; edu:age11 &lt;/data/edubase/census/12345/age/11&gt; .
</code></pre>
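
<p>As a sketch of the kind of time series query this allows, the counts for each school year could be listed as follows. The property names come from the indicative vocabulary above, the <code>edu:</code> and <code>sdmx:</code> namespaces are placeholders I&#8217;ve made up for the example, and <code>BASE</code> is set so that the relative URIs resolve:</p>

<pre><code>BASE &lt;http://education.data.gov.uk/&gt;
PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX edu:  &lt;http://example.org/edu#&gt;     # placeholder namespace
PREFIX sdmx: &lt;http://example.org/sdmx#&gt;    # placeholder namespace

SELECT ?year ?count
WHERE {
  # follow the specialised property from the school to its series
  &lt;/id/school/12345&gt; edu:age11 ?series .
  ?series sdmx:observation ?observation .
  # each observation carries its school year and its value
  ?observation edu:schoolYear ?year ;
               rdf:value ?count .
}
ORDER BY ?year
</code></pre>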

<p>The fly in the ointment is that data represented purely in this way is really hard to query if all you&#8217;re actually interested in is the <em>current</em> value for a particular statistic. For example, say that you&#8217;ve just moved to an area and are trying desperately to find a school that might have room for your 11-year-old. Given that class sizes are capped at 30, you could look for schools that have a number of 11-year-olds that is not a multiple of 30. If you want to know how many 11-year-olds are <em>currently</em> in a school (according to the most recent measurement), you need a query like:</p>

<pre><code>SELECT ?age11
WHERE {
  &lt;/id/school/12345&gt; edu:age11 [
    sdmx:observation ?currentObservation ;
  ]
  OPTIONAL {
    ?futureObservation dct:replaces ?currentObservation .
  }
  FILTER ( !bound(?futureObservation) ) .
  ?currentObservation rdf:value ?age11 .
}
</code></pre>

<p>(it&#8217;s even more complicated if you don&#8217;t have the <code>dct:replaces</code> links!).</p>

<p>How much simpler it would be for people if there were a property that just indicated the current state of the world:</p>

<pre><code>&lt;/id/school/12345&gt; edu:currentAge11 85 .
</code></pre>
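
<p>With a property like that, the query for the current number of 11-year-olds collapses to a single triple pattern. A sketch, in which <code>edu:currentAge11</code> is illustrative and the <code>edu:</code> namespace is a placeholder:</p>

<pre><code>BASE &lt;http://education.data.gov.uk/&gt;
PREFIX edu: &lt;http://example.org/edu#&gt;    # placeholder namespace

SELECT ?age11
WHERE {
  # no series, observations or replacement links to navigate
  &lt;/id/school/12345&gt; edu:currentAge11 ?age11 .
}
</code></pre>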

<p>The same argument applies even more strongly for values that we would categorise as <strong>reference data</strong>, such as the name of a school. Although it would be possible to model all this information using the kind of n-ary relation approach we have to use for statistical observations, it would be both incredibly hard to query and incredibly verbose to do so. Even if n-ary relations are the &#8220;correct&#8221; way of modelling the changing data, they are impractical for querying.</p>

<p>And, as I hinted, we have to have some way of managing the possibility of statistics themselves being versioned (for example if an error is detected within the statistics). Using n-ary relations to provide the value of an observation gets very complicated very quickly.</p>

<p>So, we have made the decision to use named graphs.</p>

<h3>Named Graphs</h3>

<p>Named graphs can be used in two ways which are related but need to be thought about slightly differently.</p>

<p>First, we can use a named-graph approach to the <strong>publication</strong> of RDF. We can describe the same <em>thing</em> within multiple documents; each document can contain different (and contradictory) information, but also metadata about the document that indicates precisely when the information it contains is valid.</p>

<p>Second, we can use a named-graph approach to the <strong>representation</strong> of RDF within a triple- (or more accurately quad-) store. We can collect together statements that are made at the same time, from the same source, and with the same level of authority into a named graph. These graphs can then be loaded into the store, with the metadata about each graph made explicitly available so that relevant graphs can be selected and queried.</p>

<p>There are two things that are worth noting about this:</p>

<ol>
<li>Publishing named graphs is relevant however RDF is published. For example, in some linked data publication set-ups, RDF/XML or RDFa might be generated on demand based on an underlying database of some description. In this case, the named graphs for representing data aren&#8217;t relevant (the database will presumably capture some provenance and validity information itself that can be exposed within the RDF).</li>
<li>In the case where linked data is published natively (ie stored in a triplestore and exposed as linked data through an API), the two uses of named graphs don&#8217;t precisely align with each other. The named graphs that we create when we convert or load data within a triplestore are not (necessarily) the same as the named graphs that we expose when we publish data. What&#8217;s important here is:
<ul><li>that the named graphs that we have within the triplestore can feasibly be used (by a publication framework such as the <a href="http://purl.org/linked-data/api/spec">linked data API</a> we&#8217;re working on) to create the publication-based named graphs</li>
<li>that the SPARQL endpoint offered by the triplestore has a default graph which reflects the current state of affairs</li></ul></li>
</ol>

<p>Let&#8217;s look at these two uses of named graphs in more detail.</p>

<h3>Publication of Named Graphs</h3>

<p>Our intention is to publish different information about the same resource within different documents (aka named graphs). This approach hooks into the approach for versioning information resources outlined above. A resource is described in a document, and many documents may describe the same resource.</p>

<p>For example, if a school changes its name from &#8220;Broadmoor Primary School&#8221; to &#8220;Wildmoor Heath School&#8221; on 1st September 2009, then after 1st September 2009, requesting information about the school at <code>http://education.data.gov.uk/id/school/12345</code> would result in a <code>303 See Other</code> redirection to <code>http://education.data.gov.uk/doc/school/12345</code> which would contain information about the school that is currently relevant:</p>

<pre><code># Information about the school that is currently relevant
&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Wildmoor Heath School"@en ;
  foaf:isPrimaryTopicOf 
    &lt;http://education.data.gov.uk/doc/school/12345&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .
</code></pre>

<p>as well as metadata about the document that&#8217;s been returned and the &#8220;hub&#8221; document that lists the alternative versions:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/school/12345&gt;
  rdfs:label "Information about School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt;
  rdfs:label "Information about Wildmoor Heath School from 1st Sept 2009"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:created "2009-09-01"^^xsd:date ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345&gt; .
</code></pre>

<p>A request to the replaced document <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> would result in the information that was valid about the school on the 1st September 2001:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en ;
  foaf:isPrimaryTopicOf 
    &lt;http://education.data.gov.uk/doc/school/12345&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .
</code></pre>

<p>and, again, metadata about the document that&#8217;s been returned and the &#8220;hub&#8221; document that lists the alternative versions:</p>

<pre><code>&lt;http://education.data.gov.uk/doc/school/12345&gt;
  rdfs:label "Information about School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/2001-09-01&gt;
  rdfs:label "Information about Broadmoor Primary School (2001-2008)"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  dct:created "2001-09-01"^^xsd:date ;
  dct:isReplacedBy &lt;http://education.data.gov.uk/doc/school/12345/2009-09-01&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/1996-09-01&gt; ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345&gt; .
</code></pre>

<p>The statements about <code>http://education.data.gov.uk/id/school/12345</code> in this second document are inconsistent with the statements retrieved from <code>http://education.data.gov.uk/doc/school/12345</code> but because they are published within different documents, they should be considered (by anyone retrieving this data) to be different graphs and therefore are allowed to provide different views of the world.</p>

<p>The statements about the named graphs <code>http://education.data.gov.uk/doc/school/12345/2009-09-01</code> and <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> can include information about the interval during which the content of the document is valid. (We haven&#8217;t worked out exactly how to indicate this yet; <code>dct:valid</code> is no good; see later.)</p>

<h4>Associated Resources</h4>

<p>This story seems fine until you start to look at linked resources. For example, schools may link out to separate resources, particularly when different aspects of a school are likely to change at different rates or come from different sources. A school is unlikely to change its name in the middle of a school year, but may well change some of its staff, and the number of pupils it has, during a year. It&#8217;s likely that these separate sets of information will be represented as different resources.</p>

<p>The document published about the school for a particular date will not necessarily include all the details of the linked resource at that point in time. This can make it hard to navigate to the particular version of the linked resource. For example, if a client wants to look at the information about a school at 1st September 2001, they would locate the graph at <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code>. This might contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en ;
  edu:staffing &lt;http://education.data.gov.uk/id/school/12345/staff&gt; .
</code></pre>

<p>A request to <code>http://education.data.gov.uk/id/school/12345/staff</code> will result in a <code>303 See Other</code> redirection to <code>http://education.data.gov.uk/doc/school/12345/staff</code>. This document holds <em>current</em> information about the staffing, and will include:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345/staff&gt;
  rdfs:label "Staffing of Wildmoor Heath School"@en ;
  edu:school &lt;http://education.data.gov.uk/id/school/12345&gt; ;
  edu:head ... ;
  edu:deputy ... ;
  ... ;
  foaf:isPrimaryTopicOf
    &lt;http://education.data.gov.uk/doc/school/12345/staff&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/staff&gt;
  rdfs:label "Information about Staffing at School 12345"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345/staff&gt; ;
  dct:hasVersion
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; ,
    &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-01-01&gt; ,
    ... .

&lt;http://education.data.gov.uk/doc/school/12345/staff/2009-09-01&gt;
  rdfs:label "Staffing of Wildmoor Heath School in Autumn Term, 2009"@en ;
  foaf:primaryTopic &lt;http://education.data.gov.uk/id/school/12345/staff&gt; ;
  dct:created "2009-09-01"^^xsd:date ;
  dct:isVersionOf &lt;http://education.data.gov.uk/doc/school/12345/staff&gt; ;
  dct:replaces &lt;http://education.data.gov.uk/doc/school/12345/staff/2009-04-23&gt; .
</code></pre>

<p>The client then has to work out which of the possible versions of the graph about <code>http://education.data.gov.uk/id/school/12345/staff</code> it should look at to navigate back to the information that&#8217;s relevant at 1st September 2001.</p>
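
<p>To make the problem concrete, working this out amounts to something like the following query over the version metadata. This is only a sketch: it assumes that the metadata is available to query, that <code>dct:created</code> can stand in for the start of the period of validity (which we haven&#8217;t settled), and that the store can compare <code>xsd:date</code> values, which SPARQL itself doesn&#8217;t require:</p>

<pre><code>PREFIX dct: &lt;http://purl.org/dc/terms/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;

SELECT ?version
WHERE {
  &lt;http://education.data.gov.uk/doc/school/12345/staff&gt;
    dct:hasVersion ?version .
  ?version dct:created ?created .
  # ignore versions created after the date we are interested in
  FILTER ( ?created &lt;= "2001-09-01"^^xsd:date )
}
# the most recent remaining version is the one in force on that date
ORDER BY DESC(?created)
LIMIT 1
</code></pre>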

<p>There are two techniques that we might use to help address this. One is for the information that&#8217;s retrieved at <code>http://education.data.gov.uk/doc/school/12345/2001-09-01</code> to include some basic information about the linked resource that includes <code>foaf:isPrimaryTopicOf</code> links directly to the relevant versioned document about the linked resource. For example, that document should contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345/staff&gt;
  rdfs:label "Staffing of Broadmoor Primary School"@en ;
  foaf:isPrimaryTopicOf &lt;http://education.data.gov.uk/doc/school/12345/staff/2001-09-01&gt; .
</code></pre>

<p>These links will have to be generated by the publication framework since they are calculated based on the date of the requested resource.</p>

<p>The other technique is to use HTTP headers to request the applicable date, as suggested by the <a href="http://www.mementoweb.org/">Memento Experiment</a>. Even with this technique, it&#8217;s still useful to have distinct URIs for the individual documents so that they can be pointed to and talked about.</p>

<h3>Representation of Data in Named Graphs</h3>

<p>Let&#8217;s turn to looking at the use of named graphs within a triplestore. In the government case, we&#8217;re expecting that information about schools going into a single triplestore is likely to come from multiple sources. Each source may release information at different intervals, with different temporal validity. The data from a single source will override earlier information from that source over time, but equally data from different sources will overlap and may be contradictory.</p>

<p>To manage this, we split up triples into named graphs based on:</p>

<ul>
<li>their source</li>
<li>their temporal validity (and their temporal relationship with other graphs)</li>
<li>their authoritativeness</li>
</ul>

<p>This metadata about the named graph is recorded within the named graph itself, using <code>voiD</code> and other vocabularies.</p>

<p>In more detail:</p>

<h4>Named Graphs over Time</h4>

<p>Named graphs are expected to occur within a series over time. The triples within one graph will be completely replaced by the triples within another graph. The most recent graph is one that has not yet been replaced. To record this, the graphs should have associated with them:</p>

<ul>
<li>the dates when the data in the graph is valid (only the start date is really required)</li>
<li>the graph(s) that the graph replaces</li>
<li>the graph(s) that the graph is replaced by</li>
<li>the date when the data in the graph was created</li>
</ul>

<p>To avoid repetition of data within multiple graphs, graphs should be split up at the level that updates are likely to occur within the source of the data. For example, Edubase holds a database of schools. If the linked data for schools is generated based on dumps of the entire Edubase database, then there would be a separate named graph for each dump of the database. If the linked data is created more dynamically, based on updates at the level of an individual school, say, then there should be a separate series of named graphs for each school. If the updates can occur at an even finer level of granularity (eg at each record within each table within the database), then there can be separate named graphs at that level.</p>

<h4>Named Graphs from Different Sources</h4>

<p>Information about the same resources will come from different sources, and have gone through different levels of processing to become linked data. To allow us to provide information about the provenance of different triples, separate named graphs should be used for data from different sources. The metadata about a graph should include:</p>

<ul>
<li>the source of the data (through <code>dct:source</code>)</li>
<li>the provenance of the data (through something more complex, yet to be finalised)</li>
</ul>

<p>Much of the information about a particular resource will only come from one source. For example, Edubase contains the pupil census for a school while Ofsted provides inspection reports. However, there will be overlaps between the information available from different sources, such as the name and address of the school.</p>

<p>For any given property of a resource (such as the name of the school), there should be one source that is the authoritative source of that information; other sources are considered supplementary. Each source should therefore usually provide two series of named graphs: one of information for which they are considered the authority, and one of information for which they are not. The metadata about the graph should include a property that indicates whether the information it contains is authoritative or not.</p>
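
<p>To give a feel for how that metadata might be used, here is a sketch of a query that lists every name recorded for a school, along with the graph that asserts it, the source of that graph and whether the graph is authoritative. It assumes that the metadata is held inside each graph; the <code>:authoritative</code> property and its namespace are placeholders, since we haven&#8217;t decided how to represent authority:</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX dct:  &lt;http://purl.org/dc/terms/&gt;
PREFIX :     &lt;http://example.org/graph-metadata#&gt;    # placeholder namespace

SELECT ?label ?graph ?source ?authoritative
WHERE {
  GRAPH ?graph {
    # the statement we are interested in
    &lt;http://education.data.gov.uk/id/school/12345&gt; rdfs:label ?label .
    # metadata that the graph records about itself
    ?graph dct:source ?source ;
           :authoritative ?authoritative .
  }
}
</code></pre>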

<h4>Constructing a Graph for a Given Date</h4>

<p>It&#8217;s extremely useful to be able to create snapshots that contain information that&#8217;s current at a particular point in time. The most useful of these is the <em>current</em> graph, which is the one that should be exposed as the default graph in the SPARQL endpoint offered by the triplestore.</p>

<p>The graph can be created by combining:</p>

<ol>
<li>all the triples from authoritative graphs that are valid at that point in time (ie graphs that have a validity date before that point in time and that are not replaced by a graph whose validity date is also before that point in time)</li>
<li>those triples from supplementary graphs for which there is no existing triple in the graph with the same subject and property</li>
</ol>
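
<p>The first of these steps can be sketched in SPARQL. This is an illustration rather than a tested implementation: it assumes that the metadata about each graph is held inside the graph, that <code>dct:created</code> serves as the validity date, that the store can compare <code>xsd:date</code> values, and that authority is recorded with a boolean placeholder property <code>:authoritative</code>. The date 2010-01-01 is just an example:</p>

<pre><code>PREFIX dct: &lt;http://purl.org/dc/terms/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX :    &lt;http://example.org/graph-metadata#&gt;    # placeholder namespace

CONSTRUCT { ?s ?p ?o }
WHERE {
  # an authoritative graph whose validity starts on or before the date
  GRAPH ?graph {
    ?graph :authoritative true ;
           dct:created ?start .
    ?s ?p ?o .
  }
  FILTER ( ?start &lt;= "2010-01-01"^^xsd:date )
  # look for a replacement graph that is also valid by that date
  OPTIONAL {
    GRAPH ?later {
      ?later dct:replaces ?graph ;
             dct:created ?laterStart .
    }
    FILTER ( ?laterStart &lt;= "2010-01-01"^^xsd:date )
  }
  # keep the graph only if no such replacement exists
  FILTER ( !bound(?later) )
}
</code></pre>

<p>The result also includes the graphs&#8217; own metadata triples, which would need filtering out. The second step, adding supplementary triples only where there is no triple with the same subject and property, needs a further pass over this result.</p>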

<p>For example, there may be information available about a school from Edubase and from Ofsted, as follows (in TriG syntax):</p>

<pre><code># graph containing data from Edubase from 2008-09-01
&lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Broadmoor Primary School"@en ;
    edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; .

  &lt;http://education.data.gov.uk/id/school/12345/census&gt;
    ... .

  &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt;
    a void:Dataset ;
    dct:created "2008-09-01"^^xsd:date ;
    dct:replaces &lt;http://education.data.gov.uk/data/edubase/12345/2007-09-01/authoritative&gt; ;
    dct:isReplacedBy &lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&gt; ;
    dct:source &lt;http://www.edubase.gov.uk/&gt; ;
    :authoritative true .

  &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01&gt;
    a void:Dataset ;
    void:subset &lt;http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative&gt; ;
    ... .
}

# graph containing data from Edubase from 2009-09-01; the name of the school
# has changed (as have the details of the census)
&lt;http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Wildmoor Heath School"@en ;
    edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; .

  &lt;http://education.data.gov.uk/id/school/12345/census&gt;
    ... .

  ... metadata about this graph ...
}

# graph containing authoritative data from Ofsted from 2008-03-01
# note that this doesn't include the name of the school
&lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/authoritative&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    edu:inspection &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt; .

  &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt;
    ... .

  ... metadata about this graph ...
}

# graph containing supplementary data from Ofsted from 2008-03-01
# this includes the name of the school (at the time of the inspection)
&lt;http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary&gt; {
  &lt;http://education.data.gov.uk/id/school/12345&gt;
    rdfs:label "Broadmoor Primary School"@en .

  ... metadata about this graph ...
}
</code></pre>

<p>Note that metadata about each graph is embedded in the graph itself.</p>

<p>In the example above, a graph for 2010-01-01 would contain:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Wildmoor Heath School"@en ;
  edu:census &lt;http://education.data.gov.uk/id/school/12345/census&gt; ;
  edu:inspection &lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt; .

&lt;http://education.data.gov.uk/id/school/12345/census&gt;
  ... .

&lt;http://education.data.gov.uk/doc/school/12345/inspection/2008&gt;
  ... .
</code></pre>

<p>It would not contain the triple:</p>

<pre><code>&lt;http://education.data.gov.uk/id/school/12345&gt;
  rdfs:label "Broadmoor Primary School"@en .
</code></pre>

<p>because this triple is only present in an authoritative form within <code>http://education.data.gov.uk/data/edubase/12345/2008-09-01/authoritative</code>, which is replaced by <code>http://education.data.gov.uk/data/edubase/12345/2009-09-01/authoritative</code>, and otherwise only within <code>http://education.data.gov.uk/data/ofsted/12345/2008-03-01/supplementary</code>, which is a supplementary graph and can&#8217;t override the label provided by the authoritative graph.</p>

<h2>Unanswered Questions</h2>

<p>There are three gaps within this that need plugging.</p>

<p>First, how should we represent the interval during which a graph is valid? As I&#8217;ve indicated above, <code>dct:valid</code> doesn&#8217;t cut it because it can&#8217;t represent an interval very well (there is a <a href="http://dublincore.org/documents/dcmi-period/">Dublin Core recommended format for representing periods</a> but it&#8217;s not going to be easy for people to process). We have work ongoing on defining intervals (by Stuart Williams) and will probably have to mint our own property to indicate the temporal validity of a named graph, given that <code>dct:valid</code> takes a literal rather than a resource.</p>

<p>Second, how should we indicate whether a graph is authoritative or not? Should this be a simple boolean switch (which will make the logic for combining graphs easier, and probably be easiest to assess) or a kind of confidence level, which might allow for missing data better?</p>

<p>Third, how should we represent the events that cause the replacement of one named graph with another? I think that we should be able to use the provenance vocabulary that we end up using to represent these changes, so that it&#8217;s possible to indicate whether the new information is the correction of a clerical error or an actual change to the real world thing.</p>

<p>And, we have to try this out. While it looks as if it might work, I won&#8217;t be confident until we&#8217;ve tried it out with some real data and some real queries. I&#8217;m also concerned that while keeping data in separate, annotated, named graphs seems like our best chance of managing versions and tracking provenance, it adds a hurdle onto the generation of linked data that might be too high, particularly for people who are just starting out.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Why Linked Data for data.gov.uk?</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/140" />
    <id>http://www.jenitennison.com/blog/node/140</id>
    <published>2010-01-26T13:10:58+00:00</published>
    <updated>2010-07-31T21:59:03+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="datagovuk" />
    <category term="linked data" />
    <summary type="html"><![CDATA[<p><a href="http://data.gov.uk/">data.gov.uk</a> was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&#8217;s been in for the last few months). It&#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&#8217;s been made available and the website itself, which (<a href="http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html">unlike a lot of UK government IT</a>) was developed rapidly by a small team based on open source software (and at low cost).</p>

<p>This is a first step on a long road.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p><a href="http://data.gov.uk/">data.gov.uk</a> was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&#8217;s been in for the last few months). It&#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&#8217;s been made available and the website itself, which (<a href="http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html">unlike a lot of UK government IT</a>) was developed rapidly by a small team based on open source software (and at low cost).</p>

<p>This is a first step on a long road.</p>

<!--break-->

<p>One of the features of the UK Government&#8217;s approach to freeing data is the emphasis on using <a href="http://www.data.gov.uk/wiki/Linked_Data">linked data</a>. What I don&#8217;t think has really been articulated is either what that means or why we should take this approach. From what I&#8217;ve seen, developers seem to think:</p>

<ul>
<li>linked data is a synonym for turning everything into RDF and putting it in one big triplestore, equivalent to making one big database of government data and therefore prone to exactly the same, well-known and understood problems that government has with creating huge databases</li>
<li>linked data requires everyone to agree to the same model and vocabulary, which means huge efforts in standardisation and ends up with something that suits no one</li>
<li>the UK government will be releasing all their data as linked data immediately, and in no other way</li>
<li>the UK government has been seduced into using linked data by academics who don&#8217;t understand anything about how the web or the real world works</li>
<li>the UK government has been seduced into using linked data by big businesses who stand to make a pretty penny providing services to departments that are forced to publish their data in this way</li>
</ul>

<p>None of these are true. In fact, the UK government is committed to publishing data as linked data because they are convinced it is the <strong>best approach available for publishing data in a hugely diverse and distributed environment, in a gradual and sustainable way</strong>.</p>

<p>Why?</p>

<p>Because linked data is just a term for how to publish data on the web while working <em>with</em> the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.</p>

<p>If you&#8217;re a web developer, you already know that the best APIs are <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">RESTful APIs</a>. That argument has been won. It means:</p>

<ul>
<li>using (HTTP) URIs to identify resources: naming <em>things</em> with URIs rather than actions on those things (which are carried out using the standard set of HTTP verbs)</li>
<li>recognising the distinction between resources and representations of those resources: the same URI might return a different representation of the resource, such as HTML or XML or JSON</li>
<li>returning self-descriptive messages: being able to process representations in a manner that is obvious from the mime type</li>
<li>hypermedia as the engine of application state: being able to locate additional resources through the use of (typed) links</li>
</ul>

<p>Linked data is about following these rules for publishing data. It is about using URIs to identify things, providing information at the end of those URIs that is self-descriptive, and linking those things to other things through typed links.</p>

<p>One of the features of this approach is that it doesn&#8217;t require any big bangs. No one planned the web: sat down and mapped out each page and its precise relations to every other page, in advance. It grew, and evolved, and continues to grow and evolve every day. It grows through individuals and institutions publishing information for their own reasons and linking to other people who have published information for their own reasons, and, because we have some fundamental standards that clients and servers understand, it All Just Works.</p>

<h2>Standards</h2>

<p>Did you notice how I slipped in the &#8220;because we have some fundamental standards that clients and servers understand&#8221;? One standard is obviously HTTP, which controls how clients and servers can talk to each other: it allows clients to request pages and servers to respond. Another standard is HTML: that enables browsers to display information in ways that people can understand, and (crucially) has a known set of semantics that browsers can use to tell when something is a link, which people can navigate to find more information.</p>

<p>For linked data, there are two crucial standards: RDF and SPARQL. Yes, I know what you&#8217;re thinking, because believe me two years ago that would have been my reaction too, but let me explain why.</p>

<p>There&#8217;s one way in which publishing data isn&#8217;t like publishing documents: its model. Documents are made up of paragraphs and headings and lists and tables and so on. Data is made up of&#8230; what? Well, at its most basic, it&#8217;s <em>things</em> that have <em>properties</em> which have <em>values</em>. We might call the things <em>objects</em> or <em>entities</em>, and call some of the properties <em>relations</em>. We might even call them <em>records</em> with <em>columns</em> and <em>values</em> and <em>foreign keys</em>. But however you term them, for better or worse, we do tend to think about data in this way: <em>thing</em>, <em>property</em>, <em>value</em>.</p>

<p>So if we are going to publish data on the web, we need a standard way of expressing the data so that a client receiving the data can work out what&#8217;s a <em>thing</em>, what&#8217;s a <em>property</em>, what&#8217;s a <em>value</em>. <strong>And, because this is the web, what&#8217;s a <em>link</em></strong>. This is the fundamental standard we need, and this is what RDF gives.</p>
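
<p>As a rough sketch of that model (the URIs and the <code>ex:</code> properties here are made up purely for illustration, and the notation is Turtle, one of the syntaxes for RDF), a description of a school might look something like this:</p>

<pre><code>@prefix ex: &lt;http://example.org/def/school/&gt; .

# the thing: a school, named with a URI (hypothetical)
&lt;http://example.org/id/school/123&gt;
  # a property with a plain value
  ex:name "Example Primary School" ;
  # a property whose value is a link to another thing
  ex:localAuthority &lt;http://example.org/id/local-authority/456&gt; .
</code></pre>

<p>The first property has a literal value; the second has a URI as its value, which is what tells a client that it is a link it can follow.</p>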

<p>RDF is actually a model rather than a syntax. It&#8217;s a bit like the split between the DOM and HTML or XHTML. The DOM tells the browser how to render the page: the HTML or XHTML is just a syntax which the browser is able to convert into a DOM that it displays. We could imagine browsers converting wiki syntax into a DOM. Or creating a DOM based on XML and XSLT, which of course they all do.</p>

<p>So, RDF is like the DOM, with varying representations of RDF (XML-based, text-based, JSON-based, even HTML-based) that can be used to pass to the client the underlying model of <em>things</em> and <em>properties</em> and <em>values</em> (some of which are <em>links</em>). What the client does then is its business: clients that retrieve data aren&#8217;t browsers &#8212; they&#8217;re not all going to display the data, use the same parts of the data, or otherwise process it in the same way &#8212; but they can pull out the <em>things</em>, <em>properties</em> and <em>values</em>, and know which are <em>links</em>, and this data structure will often, with a good RDF library, map on to some natural structure within whatever programming language is being used, and make the programmer&#8217;s job easier.</p>

<h2>Vocabularies</h2>

<p>What we don&#8217;t want to have to define are standard ways of expressing <em>particular</em> data (such as data about a school) because different individuals and organisations will have completely different ways of thinking about a particular thing. A school itself will have information about uniform and open days; <a href="http://www.ofsted.gov.uk/">OFSTED</a> about performance; <a href="http://www.edubase.gov.uk/">Edubase</a> about administration and pupil numbers; the PTA about after-school activities. Expecting everyone to adopt a particular standard vocabulary for describing a school is as futile as expecting everyone to adopt exactly the same page layout within their web pages, and exactly the same class names in their CSS.</p>

<p>But we don&#8217;t want to rule out opportunistic alignments where individuals or organisations, for whatever reason, <em>do</em> want to use the same vocabularies. Look at what&#8217;s happened with classes in HTML. There is absolutely no constraint on what classes people use in their HTML. But there are clusters of web pages that use some of the same classes. Websites that use <a href="http://microformats.org/">microformats</a>. Websites that adopt a particular <a href="http://en.wikipedia.org/wiki/CSS_framework">CSS framework</a>. Importantly, though, even where some classes are shared, it doesn&#8217;t mean that <em>all</em> classes are shared: adoption of a particular microformat or CSS framework doesn&#8217;t limit the rest of the page.</p>

<p>RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we <em>really</em> want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.</p>

<p>The other thing about RDF that really helps here is that it&#8217;s easy to align vocabularies if you want to, post-hoc. <a href="http://www.w3.org/TR/rdf-schema/">RDFS</a> and <a href="http://www.w3.org/TR/owl-overview/">OWL</a> define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.</p>
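
<p>As an illustration of what that kind of post-hoc alignment can look like (both vocabularies and all four property names below are hypothetical), a publisher might state in Turtle:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix mine: &lt;http://example.org/def/my-school/&gt; .
@prefix new:  &lt;http://example.org/def/shared-school/&gt; .

# this property is really the same as that property
mine:pupilCount owl:equivalentProperty new:numberOfPupils .

# a specialised version of an established property: anything with a
# value for mine:headTeacher has the same value for new:staffMember
mine:headTeacher rdfs:subPropertyOf new:staffMember .
</code></pre>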

<p>So the linked data web is designed to grow and evolve in exactly the same way as the human web has grown and evolved. It grows through people adding links to existing data. It grows through people creating their own vocabularies. And it evolves as links break and reform, and vocabularies combine and diverge. It is complex and messy and self-organising.</p>

<h2>Layers</h2>

<p>The cornerstone of the great, messy, web is the URI. URIs have two important roles:</p>

<ul>
<li><p><strong>they identify things</strong> - If two sets of data use the same URI then it&#8217;s dead easy to work out when they are talking about the same thing, for example to bring together the information published by a school with its OFSTED report and its pupil census. Spread this around to five, ten, twenty datasets from different places all using the same identifier for the school, and you have a huge pool of information. And the great thing about RDF datasets (because they also use URIs to identify properties) is that they can be combined automatically without worrying about clashes, rather than through painstaking developer effort.</p></li>
<li><p><strong>they provide somewhere to look for information</strong> - This is the point of using HTTP URIs, because that look-up is as simple as retrieving a document from the web. This enables programmatic, on-demand, access to the information. Developers don&#8217;t have to download huge database dumps when all they are interested in is a small fraction of that data.</p></li>
</ul>

<p>But we know that of course sometimes developers <em>do</em> want to download huge database dumps. So we need URIs for those dumps, and ways to associate metadata with them, and ways to search them. Adopting linked data doesn&#8217;t preclude providing sets of data in larger lumps. In fact, what&#8217;s needed are ways of creating those larger datasets by bringing together the more granular linked data into lists and graphs; this is essentially what SPARQL does.</p>

<p>We also know that there&#8217;s a trade-off to be made between the power of URIs and the simplicity of using short, unqualified names, particularly when it comes to naming schema-level entities such as properties or classes. Most mashups that we see at the moment bring together just a few datasets, making it easy for developers to scan for naming clashes, or examine values to work out whether a particular property contains a link or not. This is the 80% of the use of data on the web that can be addressed by the 20% solution of the kind of JSON and plain old XML you see in most APIs.</p>

<p>But publishing with RDF can be the basis of these kinds of simple APIs, and still address the hard 20% that we will encounter quickly as we mash more data together. Any data munger knows that the main challenge of making data available in an easily accessible way is cleaning, tidying, modelling and restructuring. If that&#8217;s done into RDF then creating simple JSON, XML and even CSV is really easy. Creating middle-ware that will make the creation of these basic APIs really easy must be the top priority of this linked data effort.</p>

<h2>Reality Check</h2>

<p>So it&#8217;s all good, right?</p>

<p>No, of course it&#8217;s not all good. Just as in the early days of the human web, we face huge challenges simply getting tooling to a level where it&#8217;s easy (really easy) for government departments and local authorities to publish data as RDF and for the consumers of the data to use it. We have some patterns for publishing linked data, but, as in the early days of the human web, there&#8217;s still a lot we don&#8217;t know about the best way to make data usable by third parties.</p>

<p>It&#8217;s worth noting that the main challenges we face are ones that are common to all attempts to make data both open and reusable. How do we easily create structured and reusable data from presentation-oriented Excel or (worse) PDFs? How do we handle changes over time, and record the provenance of the information that we provide? How do we represent statistical hypercubes? Or location information? These are things that we will only learn by trying things out.</p>

<p>In the end, though, the best evidence we have for how the web of linked data will progress is the evidence of how things were for the human web. It is hard to be an early adopter, both for social reasons and technological reasons. Nothing will happen overnight, but gradually there will be network effects: more shared URIs, more shared vocabularies, making it both easier to adopt and more beneficial for everyone.</p>

<p>Is this a kind of faith? Maybe. I believe in the web.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part V: Finishing Touches</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/139" />
    <id>http://www.jenitennison.com/blog/node/139</id>
    <published>2009-12-05T08:50:28+00:00</published>
    <updated>2010-07-31T21:56:22+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>This is the fifth part in this series about creating linked data. I&#8217;ve talked previously about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a>, <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a> and <a href="http://www.jenitennison.com/blog/node/138">defining a vocabulary</a>. In this instalment I&#8217;ll talk about the finishing touches that can make linked data easier to browse, query, locate and trust.</p>

<p>Note that we don&#8217;t <em>have</em> to do any of these things; they&#8217;re not part of the core data. We shouldn&#8217;t beat ourselves up if we don&#8217;t have time to do them right now, because we can always add them later, and it might be that you just don&#8217;t agree that they should be done. But many of them don&#8217;t take a lot of time and can enhance the user&#8217;s experience of the data.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the fifth part in this series about creating linked data. I&#8217;ve talked previously about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a>, <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a> and <a href="http://www.jenitennison.com/blog/node/138">defining a vocabulary</a>. In this instalment I&#8217;ll talk about the finishing touches that can make linked data easier to browse, query, locate and trust.</p>

<p>Note that we don&#8217;t <em>have</em> to do any of these things; they&#8217;re not part of the core data. We shouldn&#8217;t beat ourselves up if we don&#8217;t have time to do them right now, because we can always add them later, and it might be that you just don&#8217;t agree that they should be done. But many of them don&#8217;t take a lot of time and can enhance the user&#8217;s experience of the data.</p>

<!--break-->

<h2>Labels</h2>

<p>Every resource should have a label, even blank nodes. Adding labels makes it easier for people to generate HTML views from the data. Sometimes we have resources that have an obvious label (like the name of a local authority); at other times, the label needs to be constructed based on the other information that&#8217;s available about the resource.</p>

<p>I talked in the last instalment about <code>skos:prefLabel</code> (preferred label), <code>skos:altLabel</code> (alternative label) and <code>rdfs:label</code>. Technically, <code>skos:prefLabel</code> and <code>skos:altLabel</code> are sub-properties of <code>rdfs:label</code>, which means that if a resource has a <code>skos:prefLabel</code> it also has a <code>rdfs:label</code> with that value. However, drawing that conclusion requires either built-in knowledge of SKOS or the ability to both automatically get hold of the SKOS ontology and reason with it. That is feasible (this is one of the advantages of RDF, after all), but it adds an extra hurdle for people wanting to use your data.</p>

<p>So it&#8217;s best to give everything a <code>rdfs:label</code>, even if they already have a <code>skos:prefLabel</code> or <code>skos:altLabel</code>. It&#8217;s also good to try to imagine that label in the context of having no other information about the thing that it&#8217;s labelling, such as in the title of a page. For example, if you&#8217;re looking at the observation <code>http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle</code> in the context of a traffic count, it may seem sensible to label it just &#8220;bicycle&#8221; (as I did in the first iteration of turning this traffic count data into RDF). But without that context, it makes no sense. Better to label it &#8220;Bicycles - 8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH&#8221; and provide an even more descriptive <code>rdfs:comment</code> like &#8220;Number of bicycles counted travelling East at Salterton Road, EAST OF DINAN WAY, EXMOUTH on 8 October 2001 between 17:00 and 18:00.&#8221;.</p>
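
<p>In Turtle, and assuming the usual <code>rdfs:</code> prefix is declared, that labelling would look something like:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  rdfs:label "Bicycles - 8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  rdfs:comment "Number of bicycles counted travelling East at Salterton Road, EAST OF DINAN WAY, EXMOUTH on 8 October 2001 between 17:00 and 18:00."@en .
</code></pre>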

<h2>Datasets</h2>

<p>There are two kinds of datasets that are applicable to this particular &#8230;err&#8230; set of data &#8230; and that we should describe within the RDF. They are:</p>

<ul>
<li>datasets that are sets of statistical data items (such as the observations in the traffic count data); these are best described using <a href="http://sw.joanneum.at/scovo/schema.html">SCOVO</a></li>
<li>datasets that are general descriptions of particular sets of linked data (such as roads or local authorities); these are best described using <a href="http://semanticweb.org/wiki/VoiD">voiD</a></li>
</ul>

<p>Both kinds of datasets can be identified for UK government data using URIs in the form:</p>

<pre><code>http://{sector}.data.gov.uk/set/{dataset}/
</code></pre>

<h3>SCOVO Datasets</h3>

<p>Every <code>scovo:Item</code> should be part of a <code>scovo:Dataset</code>, associated through a <code>scovo:dataset</code> (and a reverse <code>scovo:datasetOf</code>). A <code>scovo:Dataset</code> is pretty simple: all you really need to do is give it an identifier and, of course, a label. In this case, something like:</p>

<pre><code>http://transport.data.gov.uk/set/traffic-count/2001-2008/
</code></pre>

<p>This is an identifier that the various <code>scovo:Item</code>s should use to indicate where the data comes from:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  scovo:dataset &lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>

<p>It&#8217;s also an identifier that we can attach some metadata to. Obviously it needs a label, but we can also attach other metadata, such as the <a href="http://www.jenitennison.com/blog/node/133">provenance of the dataset</a>.</p>

<pre><code>&lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt;
  a scovo:Dataset ;
  a prv:DataItem ;
  rdfs:label "Traffic counts between 2001 and 2008"@en ;
  prv:createdBy [
    a prv:DataCreation ;
    prv:performedAt ... ;
    prv:performedBy ... ;
    prv:usedData ... ;
    prv:usedGuideline ... ;
  ] .
</code></pre>

<h3>VoiD Datasets</h3>

<p>VoiD is designed to be used to describe sets of linked data, their contents, their provenance and their relationships with each other. There are many ways of dividing up the data that we&#8217;ve been looking at into datasets. We can start with a simple example: the dataset containing linked data about countries:</p>

<pre><code>&lt;http://statistics.data.gov.uk/set/country/&gt;
  a void:Dataset ;
  rdfs:label "Countries"@en ;
  foaf:homepage &lt;http://statistics.data.gov.uk/set/country&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Country&gt; ;
  cc:license [
    a cc:License ;
    rdfs:label "data.gov.uk Licence"@en ;
    foaf:homepage &lt;http://data.hmg.gov.uk/terms-privacy&gt; ;
    cc:permits cc:DerivativeWorks, cc:Distribution, cc:Reproduction ;
    cc:requires cc:Attribution ;
  ] ;
  void:exampleResource &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  void:sparqlEndpoint &lt;http://services.data.gov.uk/statistics/sparql&gt; ;
  void:uriRegexPattern "http://statistics.data.gov.uk/id/country\\?name=.+"^^xs:string ;
  void:vocabulary &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; ;
  void:vocabulary &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
</code></pre>

<p>This provides a link to a home page for the dataset, which should contain information about the dataset itself. (Accessing the URI for the dataset should also redirect users to this home page.) I&#8217;ve used the same URI as the dataset URI but without the slash at the end. (This is probably too subtle a difference between URIs; we don&#8217;t currently have official guidance for URIs for documents-about-datasets or documents-about-definitions.)</p>

<p>The <code>void:exampleResource</code> property can be used to point to resources that can act as starting points for exploring the data, and the <code>void:sparqlEndpoint</code> property points at a SPARQL endpoint that can be used for deeper querying. The <code>void:uriRegexPattern</code> property provides a regular expression for the URIs that are used to identify the resources that the dataset is about. <code>void:vocabulary</code> points to the vocabularies that the dataset uses.</p>

<p>Various <a href="http://dublincore.org/documents/dcmi-terms/">Dublin Core</a> properties can be used to provide metadata about the dataset, such as its subject matter. The <a href="http://creativecommons.org/ns">Creative Commons schema</a> provides a way of indicating the licence that the dataset is made available under, which is essential information to enable reuse. (I&#8217;ve derived some RDF about the licence here from the one <a href="http://data.hmg.gov.uk/terms-privacy">described on the data.hmg.gov.uk pages</a>; there should be an official version some time soon.)</p>

<p>The country data that we can produce from this traffic count dataset is only a <em>subset</em> of the dataset of all countries, and we can indicate this through a <code>void:subset</code> relationship:</p>

<pre><code>&lt;http://statistics.data.gov.uk/set/country/&gt;
  ...
  void:subset [
    a void:Dataset ;
    a prv:DataItem ;
    rdfs:label "Country data from the DfT traffic count dataset 2001-2008"@en ;
    prv:createdBy [
      a prv:DataCreation ;
      prv:performedAt ... ;
      prv:performedBy ... ;
      prv:usedData ... ;
      prv:usedGuideline ... ;
    ] ;
  ] .
</code></pre>

<p>The other kind of subset that we should describe are link sets. Link sets are datasets that contain links between datasets. The country dataset doesn&#8217;t (currently) contain any links to other datasets, but the count dataset does:</p>

<pre><code>&lt;http://transport.data.gov.uk/set/traffic-count/&gt;
  a void:Dataset ;
  rdfs:label "Traffic Counts"@en ;
  foaf:homepage &lt;http://transport.data.gov.uk/set/traffic-count&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Traffic&gt; ;
  dct:subject &lt;http://dbpedia.org/resource/Counting&gt; ;
  cc:license [
    a cc:License ;
    rdfs:label "data.gov.uk Licence"@en ;
    foaf:homepage &lt;http://data.hmg.gov.uk/terms-privacy&gt; ;
    cc:permits cc:DerivativeWorks, cc:Distribution, cc:Reproduction ;
    cc:requires cc:Attribution ;
  ] ;
  void:exampleResource &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  void:uriRegexPattern "http://transport.data.gov.uk/id/traffic-count-point/[0-9]+/direction/[NSEW]/hour/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:00:00"^^xs:string ;
  void:sparqlEndpoint &lt;http://services.data.gov.uk/transport/sparql&gt; ;
  void:vocabulary &lt;http://transport.data.gov.uk/def/traffic/&gt; ;
  void:subset [
    a void:Dataset ;
    rdfs:label "Traffic Counts from the DfT traffic count dataset 2001-2008"@en ;
    prv:createdBy ...
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / count point links"@en ;
    rdfs:comment "Links from a traffic count to the count point at which the count was taken."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/count&gt; ;
    void:objectsTarget &lt;http://transport.data.gov.uk/set/traffic-count-point/&gt; ;
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / cardinal direction"@en ;
    rdfs:comment "Links from a traffic count to the direction in which the traffic was going."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/direction&gt; ;
    void:objectsTarget &lt;http://dbpedia.org/void/Dataset&gt; ;
  ] ;
  void:subset [
    a void:Linkset ;
    rdfs:label "Traffic count / hour"@en ;
    rdfs:comment "Links from a traffic count to the hour when the traffic was being monitored."@en ;
    void:subjectsTarget &lt;http://transport.data.gov.uk/set/traffic-count/&gt; ;
    void:linkPredicate &lt;http://transport.data.gov.uk/def/hour&gt; ;
    void:objectsTarget [
      a void:Dataset ;
      rdfs:label "URIs for places and times" ;
      foaf:homepage &lt;http://placetime.com/&gt; ;
    ] ;
  ] .
</code></pre>

<p><code>scovo:Dataset</code>s are often subsets of <code>void:Dataset</code>s. In the case of the traffic count data, the observations described by the <code>scovo:Dataset</code> above are a subset of the <code>void:Dataset</code> that is the set of <em>all</em> such observations (including ones from other years).</p>
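
<p>A minimal sketch of that relationship, reusing the dataset URIs from the examples above. Typing the 2001-2008 dataset as both a <code>scovo:Dataset</code> and a <code>void:Dataset</code> is an assumption here, one possible arrangement rather than something either vocabulary requires:</p>

<pre><code># the set of all traffic count observations
&lt;http://transport.data.gov.uk/set/traffic-count/&gt;
  a void:Dataset ;
  void:subset &lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt; .

# the statistical dataset covering 2001 to 2008 (dual typing is an assumption)
&lt;http://transport.data.gov.uk/set/traffic-count/2001-2008/&gt;
  a scovo:Dataset ;
  a void:Dataset .
</code></pre>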

<h2>Derivable Data</h2>

<p>The discussion about <code>rdfs:label</code> above touched on another set of information that should be included within the RDF data we produce: data that is automatically derivable from the data we provide. There are three main reasons for including derivable data within what we publish:</p>

<ol>
<li><p>Given the current adoption of RDF-aware technologies, the consumers of our data are pretty unlikely to be able to (or to want to) use schemas, ontologies or rule sets to help them to reason over the data and draw conclusions. The consumers of this data <em>might</em> include semantic search engines and people scraping the data into their own triplestores, but they&#8217;re far more likely to be developers who really don&#8217;t care about RDF at all. It would be a shame to publish the data and then have no one use it.</p></li>
<li><p>Computing derivable data once saves overall effort. We calculate it once, centrally, and it means that the people using the data don&#8217;t have to spend processing time doing it themselves. (There&#8217;s a classic time/space trade-off here, of course; the down side of including data that isn&#8217;t strictly necessary is that the documents will end up larger.)</p></li>
<li><p>If we provide information that people are likely to need within the document that they get when they request a given resource, they&#8217;re less likely to need to resort to (harder to construct and more intensive to process) SPARQL queries to get what they need.</p></li>
</ol>

<p>The overriding principle that we can use to help us decide what to include is to consider what we would like to see if we visited a page about the particular thing.</p>

<p>How we manage to provide the derived data depends on how we publish the data. I&#8217;m not talking here about how to do the publishing, but rather about what the consumers of the data should expect to see eventually. So, for example, if we publish the data as static files then we&#8217;re going to have to include all this data in those files. If we generate the RDF dynamically, we just have to make sure that the generated RDF includes the derived data; we might be able to set up rules in a triplestore, or a transformation of the data that it naturally produces, to include the derivable data.</p>

<h3>Superclasses and Super-properties</h3>

<p>One set of derived data is that inferred from the superclasses and super-properties that are defined with the RDF vocabularies we use in our data. Basically, if a resource has a type that is a subclass of another type, then the resource should have that superclass as a type as well. Similarly, if a triple includes a property that has a super-property, then there ought also to be a triple that links the subject and object of the original triple with the super-property as well.</p>

<p>To understand when it&#8217;s important to include this kind of derived data, we need to be aware of the kind of applications that will use the data. Some applications will be targeting just this dataset about traffic counts, and will be written to use whatever properties and classes we&#8217;ve made available. Other applications will be targeted at specific vocabularies at a more general-purpose level. There might be applications that can be used to visualise SKOS hierarchies as a tree, for example, or applications that can plot any <code>geo:lat</code>/<code>geo:long</code> coordinates on a map, or any OWL-Time intervals and instants on a timeline. Still other applications, such as viewers like Tabulator, will be used with any old RDF. We need to provide enough information to make the data easily usable by these more generic applications.</p>

<p>As an example, in the last instalment we introduced classes for <code>traffic:VehicleType</code> and <code>traffic:RoadCategory</code> which were subclasses of <code>skos:Concept</code>. If we want generic SKOS visualisers to be able to display the vehicle type and road category concept schemes, we should try to make it easy for them to work out which things are concepts, by indicating that they are concepts as well. Bearing in mind what I&#8217;ve said above about labels, that means that the original RDF:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>

<p>should include a reference to <code>skos:Concept</code> and a <code>rdfs:label</code>:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  a skos:Concept ;
  rdfs:label "Motorway"@en ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>

<p>Note that I haven&#8217;t included the results of <em>all</em> the reasoning that we could anticipate. The property <code>skos:scopeNote</code> is a sub-property of <code>skos:note</code>, for example, but I haven&#8217;t included a <code>skos:note</code> explicitly because any SKOS-aware processor should have that kind of knowledge built in. The rule of thumb is that <strong>if the result of the reasoning involves a resource from another vocabulary, then we should include it</strong>.</p>

<h3>Derivable Values</h3>

<p>There are other kinds of derivable data in this data set. In particular, there are eastings and northings, but not latitudes and longitudes. When there&#8217;s useful derivable data, especially when it&#8217;s not trivial to derive, it makes sense to make that available explicitly, otherwise everyone else will have to go through the effort of deriving it themselves.</p>

<p>We&#8217;ve already done this with the information about the hours of the traffic counts, by pulling out the year and hour of the count rather than having them tucked away within a <code>xs:dateTime</code> literal. The same should be true of the eastings and northings. For small numbers of values, you can use the <a href="http://gps.ordnancesurvey.co.uk/convert.asp">Ordnance Survey&#8217;s online converter</a>; for larger numbers of values you can download the (Windows only and very dated) software or try one of the various converters you can find with a <a href="http://www.google.com/search?q=easting+northing+latitude+longitude+conversion+UK">Google search</a>.</p>

<p>Latitudes and longitudes for points should, of course, be expressed using the <code>geo:lat</code> and <code>geo:long</code> properties from the <a href="http://www.w3.org/2003/01/geo/">http://www.w3.org/2003/01/geo/wgs84_pos#</a> vocabulary.</p>
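
<p>For a count point, that might look like the following sketch. The coordinate values are placeholders for illustration only, not the real converted position of this count point:</p>

<pre><code>@prefix geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  # placeholder values, not derived from the actual easting and northing
  geo:lat 50.6 ;
  geo:long -3.4 .
</code></pre>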

<h3>Inverses</h3>

<p>Statements in RDF link two things. For example, you can view the statement:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; .
</code></pre>

<p>as saying that traffic count point 13 is on the B3178 <em>and</em> that the B3178 has a count point on it that is traffic count point 13.</p>

<p>So it&#8217;s always possible, when creating a query about or a representation of the road, to include the &#8216;backward links&#8217;: the statements in which the road features as an <em>object</em> as well as those in which it features as a <em>subject</em>. This has caused some people to argue that <a href="http://dowhatimean.net/2006/06/an-rdf-design-pattern-inverse-property-labels">relationships should only be defined in one direction</a>.</p>

<p>Personally, I don&#8217;t agree, for two reasons.</p>

<ol>
<li><p>Although it&#8217;s <em>possible</em> to create queries and representations that include backward links, it often doesn&#8217;t happen like that. It varies between triplestores, but the result of a <code>DESCRIBE</code> SPARQL query commonly only includes triples in which the thing being described is the subject, not the object. Also, when constructing queries, it seems more natural to always &#8220;travel forward&#8221; through the graph. For example:</p>

<pre><code>SELECT ?count
WHERE {
  ?point
    area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/43UC&gt; ;
    traffic:count ?count .
}
</code></pre>

<p>rather than:</p>

<pre><code>SELECT ?count
WHERE {
  ?point
    area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/43UC&gt; .
  ?count
    traffic:countPoint ?point .
}
</code></pre>

<p>So although it introduces redundancy, I think that including inverse relationships in RDF aids usability and navigability.</p></li>
<li><p>Sometimes both directions of a relationship contain meaningful information. For example, it&#8217;s not enough to include a <code>gen:mother</code> relationship from a person to their mother because the implied reverse relationship is simply that the person is a child of their mother &#8212; you need to include a <code>gen:son</code> or <code>gen:daughter</code> relationship as well to tell which type of child.</p></li>
</ol>

<p>So in this dataset, I&#8217;m going to include inverse relationships where appropriate:</p>

<ul>
<li>from countries to regions</li>
<li>from regions to local authority districts</li>
<li>from roads to count points</li>
<li>from count points to counts</li>
<li>from counts to observations</li>
</ul>
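
<p>If the forward links are already loaded into a triplestore, one way to generate the inverses is with a SPARQL <code>CONSTRUCT</code> query. This is only a sketch: it assumes the property names used in the full listing at the end of this instalment, where a count points to its count point with <code>traffic:countPoint</code> and the count point points back with <code>traffic:count</code>.</p>

<pre><code>PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

# For every count that links forward to its count point,
# generate the backward link from the count point to the count.
CONSTRUCT { ?point traffic:count ?count }
WHERE {
  ?count
    a traffic:Count ;
    traffic:countPoint ?point .
}
</code></pre>

<p>The restriction to <code>traffic:Count</code> is there because observations also carry a <code>traffic:countPoint</code> link, and we don&#8217;t want count points to list every observation as a count.</p>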

<h3>Shortcuts</h3>

<p>Another thing that can aid the navigability of a set of RDF data is to provide &#8220;shortcuts&#8221;. For example, at the moment we have links that say which country a region belongs to and which region a local authority district belongs to, but we don&#8217;t have a link that says which country a local authority district belongs to. These kinds of links can make it easier to navigate through (and to query) a dataset, so they can be worth adding so long as they don&#8217;t clutter up the data too much.</p>

<p>Just think of what you&#8217;d like to know about a particular <em>thing</em> when you visit its page. If you&#8217;re looking at transport in a local authority district, it would be useful to know what region and country it belongs to and about what roads and traffic count points it contains. But it would be too much to have a list of all the counts and observations that have been taken on those count points.</p>

<p>For this dataset, I&#8217;m going to add shortcuts from:</p>

<ul>
<li>countries to local authority districts (and vice versa)</li>
<li>count points to regions and countries</li>
<li>roads to local authority districts (and vice versa)</li>
<li>roads to regions and countries</li>
<li>roads to road categories and road names</li>
<li>roads to counts (and vice versa)</li>
<li>observations to count points, roads, directions and count hours</li>
</ul>

<p>These are all judgement calls &#8212; there are no hard and fast rules &#8212; and as you can see I&#8217;m not adding inverses everywhere here because to do so would lead to unnecessarily large sets of RDF in some cases.</p>
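
<p>Shortcuts can be derived in much the same way as inverses. As a sketch, assuming the <code>area:region</code> and <code>area:country</code> links are already in a triplestore, this <code>CONSTRUCT</code> query follows the two-step path from a local authority district to its country and emits the direct link:</p>

<pre><code>PREFIX area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt;

# district -> region -> country becomes district -> country
CONSTRUCT { ?district area:country ?country }
WHERE {
  ?district
    a area:LocalAuthorityDistrict ;
    area:region ?region .
  ?region
    area:country ?country .
}
</code></pre>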

<hr />

<p>That&#8217;s the end of this instalment. I had been intending to make this the final one, but there are a couple of things still left to talk about: the publication of RDF, and the supplementary documents that we need to provide (including RDF about those supplementary documents). I&#8217;ve also had a request to talk about OWL ontologies, so I&#8217;ll probably do that, and there are things to say about how to manage things changing over time. So this may end up being an eight-part series!</p>

<p>To keep us up to date, with all the extra derived information added, the RDF looks as follows:</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  rdfs:label "Devon"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1116&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  rdfs:label "B3178" ;
  rdfs:label "Salterton Road"@en ;
  skos:prefLabel "B3178" ;
  skos:altLabel "Salterton Road"@en ;
  skos:notation "B3178"^^traffic:RoadNumber ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:label "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  geo:lat 50.6294 ;
  geo:long -3.3784 ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  rdfs:label "8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; ;
  traffic:observation &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt; .

&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  rdfs:label "8 Oct 2001 17:00-18:00 - East - Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part IV: Developing RDF Schemas</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/138" />
    <id>http://www.jenitennison.com/blog/node/138</id>
    <published>2009-11-26T10:35:32+00:00</published>
    <updated>2010-07-31T21:54:37+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>This is the fourth instalment in a series about turning an existing dataset into some linked data. I&#8217;ve previously talked about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a> and <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a>. In this instalment, we&#8217;ll look at developing a schema in which we define the classes, properties and datatypes that we want to use in the RDF that describes the <em>things</em> in our dataset.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the fourth instalment in a series about turning an existing dataset into some linked data. I&#8217;ve previously talked about <a href="http://www.jenitennison.com/blog/node/135">analysis and modelling</a>, <a href="http://www.jenitennison.com/blog/node/136">defining URIs</a> and <a href="http://www.jenitennison.com/blog/node/137">defining concept schemes</a>. In this instalment, we&#8217;ll look at developing a schema in which we define the classes, properties and datatypes that we want to use in the RDF that describes the <em>things</em> in our dataset.</p>

<p>We&#8217;ll start by writing out some RDF for our record, using Turtle here for readability, and use unprefixed names to indicate classes, properties and datatypes, just so we can see what we need. Then we&#8217;ll see how those requirements match up to existing vocabularies and ontologies that we can reuse. Anything that&#8217;s left over we&#8217;re going to have to put in our own vocabulary. We&#8217;ll call this</p>

<pre><code>http://transport.data.gov.uk/def/traffic/
</code></pre>

<p>All the classes, properties and datatypes that we define will eventually use that namespace.</p>

<p>Let&#8217;s focus on this record; I find it easiest to use an actual example rather than talk in abstract:</p>

<pre><code>"England","South West","K",1115.00,"18","Devon County Council",
13,"B3178",,"B Urban","Salterton Road",
"Salterton Road, EAST OF DINAN WAY, EXMOUTH",302600,81984,
8/10/2001 00:00:00,"E",17,2,2,400,5,41,0,2,0,0,0,0,2,450
</code></pre>

<p>We&#8217;ll put this into RDF bit by bit.</p>

<h2>Areas</h2>

<p>First, let&#8217;s look at the areas and local authorities. The kind of RDF that we want to have looks like:</p>

<pre><code>&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a :Country ;
  :name "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a :GovernmentOfficeRegion ;
  :name "South West"@en ;
  :code "K"^^:ONScode ;
  :containedBy &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a :LocalAuthorityDistrict ;
  :code "18"^^:ONScode ;
  :code "1115"^^:DfTLAcode ;
  :localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  :containedBy &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  :containedBy &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  :sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a :LocalAuthority ;
  :name "Devon County Council"@en ;
  :code "18"^^:ONSLAcode ;
  :code "1115"^^:DfTLAcode ;
  :localAuthorityDistrict &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1116&gt;
  :sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .
</code></pre>

<p>To work out what we need to put in our schema, we should first look at what existing vocabularies there are that could help. These areas are already defined elsewhere, so we can just use the same vocabulary for countries, regions, local authority districts and local authorities as is used there. The vocabularies that are useful here are:</p>

<ul>
<li><code>http://statistics.data.gov.uk/def/administrative-geography/</code> which defines classes and properties related to administrative areas and local authorities (as described by the <a href="http://www.statistics.gov.uk/">Office for National Statistics</a>)</li>
<li><code>http://data.ordnancesurvey.co.uk/ontology/admingeo/</code> which also defines classes and properties related to administrative areas (as described by the <a href="http://www.ordnancesurvey.co.uk/">Ordnance Survey</a>)</li>
<li><code>http://data.ordnancesurvey.co.uk/ontology/spatialrelations/</code>, also developed by John Goodwin at the Ordnance Survey, which defines spatial relationships between areas</li>
</ul>

<p>There are other commonly used vocabularies that it&#8217;s helpful to know about:</p>

<ul>
<li>RDFS is designed for representing RDF schemas, but it has a few general-purpose properties that are good to know, namely <code>rdfs:label</code> (the label for a thing) and <code>rdfs:comment</code> (a comment or description about the thing).</li>
<li>SKOS is designed for representing concept schemes, but again it has a few properties that can be used with any set of linked data, in particular <code>skos:prefLabel</code> (the preferred label for a thing), <code>skos:altLabel</code> (an alternative label for a thing) and <code>skos:notation</code> (a code for the thing).</li>
<li>OWL is designed for representing ontologies, but it has one very important property that you should know about &#8212; <code>owl:sameAs</code> &#8212; which is used to link two things that are the same thing.</li>
<li>XML Schema datatypes can be used within RDF, which is useful for things like dates, times, integers and so on.</li>
<li>For our purposes here, OWL-Time is going to prove useful, as it has a bunch of properties that are used to represent instants and durations.</li>
</ul>

<p>If we look through the RDF above, the only thing that <em>isn&#8217;t</em> covered by these vocabularies is the <code>DfTLAcode</code> datatype. If we use the <code>http://transport.data.gov.uk/def/traffic/</code> namespace, there&#8217;s not really any need to indicate that this is a transport-related code, so we can just call it <code>LAcode</code>. Let&#8217;s define that datatype:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/LAcode&gt;
  a rdfs:Datatype ;
  rdfs:label "Local Authority Code"@en .
</code></pre>

<p>That&#8217;s it. Now here&#8217;s the Turtle for the areas with the relevant namespaces added, and property names changed where appropriate:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1116&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .
</code></pre>

<h2>Roads</h2>

<p>Here&#8217;s the kind of RDF we want to create for roads:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a :Road ;
  :code "B3178"^^:RoadNumber .
</code></pre>

<p>Obviously, we need a class for roads:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/Road&gt;
  a rdfs:Class ;
  rdfs:label "Road"@en .
</code></pre>

<p>Wherever there&#8217;s a code, I like to reuse <code>skos:notation</code>. But it&#8217;s important to define a datatype for the values used with that notation because (as we saw with local authorities above) there may be several different coding schemes that apply to the same Thing, and we need to be able to distinguish between them in case they clash. So:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/RoadNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Road Number"@en .
</code></pre>

<p>That&#8217;s all we have to define for roads; now the RDF can look like:</p>

<pre><code>@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  skos:notation "B3178"^^traffic:RoadNumber .
</code></pre>
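
<p>To see why the datatype on a <code>skos:notation</code> matters, here&#8217;s a sketch of a query that picks out only the codes from one coding scheme. It assumes the area data shown earlier is loaded, where the same thing carries both an <code>area:StandardCode</code> and a <code>traffic:LAcode</code> notation:</p>

<pre><code>PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

# List things together with their DfT local authority code,
# ignoring notations that come from any other coding scheme.
SELECT ?thing ?code
WHERE {
  ?thing skos:notation ?code .
  FILTER (datatype(?code) = traffic:LAcode)
}
</code></pre>

<p>Without the datatypes, there would be no reliable way to tell which scheme a given code belongs to.</p>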

<h2>Count Points</h2>

<p>On to count points. Here&#8217;s the sketch of the RDF we want to create:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a :TrafficCountPoint ;
  :description "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  :code "13"^^:CountPointNumber ;
  :road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  :roadName "Salterton Road"@en ;
  :roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  :easting 302600 ;
  :northing 81984 ;
  :localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  :localAuthorityDistrict &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .
</code></pre>

<p>Of these, the description could be done with <code>rdfs:comment</code>. The code can be held by a <code>skos:notation</code> (provided we define a datatype for the count point number):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/CountPointNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Traffic Count Point Number"@en .
</code></pre>

<p>Properties for easting and northing are actually defined by the OS&#8217;s spatial relations ontology (although unfortunately neither the ontology nor the property is currently resolvable; the only way you&#8217;d know this is through looking at their use in the conversion of the edubase data). Links to local authorities and local authority districts can be done using the ONS-based administrative geography ontology, which again is currently only guessable at by looking at the online data.</p>

<p>That leaves us with a <code>traffic:CountPoint</code> class (no point calling it <code>TrafficCountPoint</code> if the namespace provides sufficient disambiguation):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count Point"@en .
</code></pre>

<p>A road property to point to a road:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/road&gt;
  a rdf:Property ;
  rdfs:label "road"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Road&gt; .
</code></pre>

<p>Note that properties are by convention named with a lowercase first letter, whereas classes are named with an uppercase first letter. It&#8217;s a good idea to follow that convention. Note also that I&#8217;ve defined an <code>rdfs:range</code> for this property, which means that anything that&#8217;s the <em>object</em> in an RDF statement that involves this property must be a <code>traffic:Road</code>.</p>
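
<p>Another way to read a range is as a licence to infer types. As a sketch of what an RDFS-aware processor would conclude from the range on <code>traffic:road</code>, written here as a SPARQL <code>CONSTRUCT</code> query purely for illustration:</p>

<pre><code>PREFIX traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt;

# Anything used as the object of traffic:road is a traffic:Road,
# whether or not the data says so explicitly.
CONSTRUCT { ?thing a traffic:Road }
WHERE { ?subject traffic:road ?thing }
</code></pre>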

<p>We need a road name property to give the name of the road at the count point.</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadName&gt;
  a rdf:Property ;
  rdfs:label "road name"@en .
</code></pre>

<p>We also need a road category property to point to the categor(ies) of the road at the count point:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en .
</code></pre>

<p>You&#8217;ll remember that we defined different road categories using SKOS, such that each road category is a <code>skos:Concept</code>. But to give a range to the <code>traffic:roadCategory</code> property, we need to create a class for all the things that are categories of road. These are all <code>skos:Concept</code>s, and we can indicate that through an <code>rdfs:subClassOf</code> property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Road Category"@en .
</code></pre>

<p>use this as the range of the <code>traffic:roadCategory</code> property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt; .
</code></pre>

<p>and amend the concept scheme we created to include references to this new class, for example:</p>

<pre><code>&lt;motorway&gt; a traffic:RoadCategory ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
</code></pre>
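
<p>Because <code>traffic:RoadCategory</code> is a sub-class of <code>skos:Concept</code>, the road categories should still be found by anything looking for concepts. Here&#8217;s a sketch of a query that relies on that, assuming the schema and the concept scheme are loaded into the same store and that the store supports SPARQL 1.1 property paths (neither of which is a given):</p>

<pre><code>PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;

# Find everything typed as skos:Concept, either directly or through
# a chain of rdfs:subClassOf links such as traffic:RoadCategory.
SELECT ?concept
WHERE {
  ?concept a/rdfs:subClassOf* skos:Concept .
}
</code></pre>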

<p>So here is the RDF with the relevant properties properly defined:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .
</code></pre>

<h2>Traffic Counts</h2>

<p>On to traffic counts. The un-namespaced RDF should look like:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a :TrafficCount ;
  :countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  :direction &lt;http://dbpedia.org/resource/East&gt; ;
  :hour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .
</code></pre>

<p>So for that we need a class for traffic counts:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/Count&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count"@en .
</code></pre>

<p>a property that can link the traffic count to the count point where the count is taken:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/countPoint&gt;
  a rdf:Property ;
  rdfs:label "traffic count point"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt; .
</code></pre>

<p>a property to link to the direction the traffic is flowing in (we can&#8217;t put a range on this one because the DBPedia resources we&#8217;re using don&#8217;t have a common type):</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/direction&gt;
  a rdf:Property ;
  rdfs:label "traffic direction"@en .
</code></pre>

<p>and finally a property to link to the hour during which the measurement was taken. This last one is a very common thing to need to do, so we&#8217;d imagine that there might be an existing property defined somewhere that we could use. <a href="http://sdmx.org/">SDMX</a>, which includes a standard for representing statistical information in XML, defines a <code>REF_PERIOD</code> field which would seem to suit our purposes, but we don&#8217;t yet have a proper mapping of SDMX into RDF (I&#8217;ve had an initial cut, but it needs some input from statisticians).</p>

<p>So for now, we&#8217;ll use a specific property in our own namespace; we can always indicate that it&#8217;s a sub-property of a future SDMX property at a later date. I&#8217;m going to call it <code>countHour</code> and give it a domain of <code>traffic:Count</code> to indicate that the property has a pretty specific use for providing the count for an hour. We could just give its range as a generic <code>time:Interval</code>, but the kind of hours that are traffic count hours are kinda special intervals: they&#8217;re obviously an hour long, but are also restricted to start and end on the hour, cover an hour between 7am and 7pm, and don&#8217;t occur in winter. So it feels like we should have a special kind of interval for that purpose:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/countHour&gt;
  a rdf:Property ;
  rdfs:label "hour of count"@en ;
  rdfs:domain &lt;http://transport.data.gov.uk/def/traffic/Count&gt; ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountHour&gt; .

&lt;http://transport.data.gov.uk/def/traffic/CountHour&gt;
  a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:label "Count Hour"@en .
</code></pre>

<p>All those properties were in the traffic namespace, so here&#8217;s the RDF with it added:</p>

<pre><code>@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .
</code></pre>

<h2>Cardinal Directions</h2>

<p>As I discussed in the last instalment, we&#8217;re not actually going to mint URIs for cardinal directions, but that doesn&#8217;t mean we can&#8217;t make statements about them in the RDF we generate. As I&#8217;ll discuss in more depth in the next instalment, it&#8217;s always good to provide a label at the very least:</p>

<pre><code>&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .
</code></pre>

<h2>Intervals and Instants</h2>

<p>Let&#8217;s look now at the RDF we want to generate about the hour during which the count was taken. As I&#8217;ve said above, these hours are a special kind of interval, and we&#8217;ve already created a class for them. I also discussed earlier that the things about this interval that are really useful for the purposes of querying are the year during which the count was taken and the hour at which it was taken, so we should pull out at least those pieces of information. Time-based data can be represented in RDF using the <a href="http://www.w3.org/2006/time">OWL-Time ontology</a>.</p>

<p>Unfortunately, expressing time very specifically gets verbose. This is what the statements we want to make look like using OWL-Time:</p>

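<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .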

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .
</code></pre>

<h2>Observations</h2>

<p>Finally we&#8217;re on to the observations themselves. The un-namespaced RDF looks like:</p>

<pre><code>&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a :Observation ;
  :count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  :vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  :value 2 .
</code></pre>

<p>The <a href="http://purl.org/NET/scovo">SCOVO</a> vocabulary exists to represent statistical information like this. In SCOVO, observations are called <code>scovo:Item</code>s, the value of the statistical measure itself (the count in this case) should be held in the <code>rdf:value</code> property, and any other properties should be sub-properties of <code>scovo:dimension</code>, which has a range of <code>scovo:Dimension</code>.</p>

<p>To fit in with SCOVO, then, we need to have the pointer to the count that this observation belongs to as a property that is a sub-property of <code>scovo:dimension</code>:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/count&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "count"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Count&gt; .
</code></pre>

<p>We might be tempted to indicate that the type of thing pointed to by the <code>traffic:count</code> property is a subclass of <code>scovo:Dimension</code>, but this is unnecessary and probably untrue: there might exist some traffic counts that <em>aren&#8217;t</em> dimensions, and the ones that are linked to by the <code>traffic:count</code> property can be inferred to be dimensions anyway.</p>
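
<p>To make that inference concrete, here&#8217;s a rough sketch of what an RDFS reasoner could derive for the observation we&#8217;ve been working with. It assumes that SCOVO declares <code>scovo:dimension</code> with a range of <code>scovo:Dimension</code>; the last two statements are the derived ones, and we never have to write them ourselves:</p>

<pre><code>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

# Assumed to be stated by SCOVO itself:
scovo:dimension rdfs:range scovo:Dimension .

# Stated in our schema and our data:
traffic:count rdfs:subPropertyOf scovo:dimension .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

# Derivable under RDFS entailment (sub-property, then range):
&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  scovo:dimension &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a scovo:Dimension .
</code></pre>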

<p>Similarly, the property that provides the pointer to the vehicle type should be a sub-property of <code>scovo:dimension</code> and we need a class for those various vehicle types in order to restrict the range of that property:</p>

<pre><code>&lt;http://transport.data.gov.uk/def/traffic/vehicleType&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "vehicle type"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt; .

&lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Vehicle Type"@en .
</code></pre>

<p>Of course all the concepts that we created for the vehicle types need to be designated as instances of this new <code>traffic:VehicleType</code> class:</p>

<pre><code>&lt;bicycle&gt; a traffic:VehicleType ;
  ... .
</code></pre>

<p>So, the RDF with the proper namespaces is:</p>

<pre><code>@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>

<hr />

<p>That concludes our initial walkthrough of the data to create a vocabulary. I&#8217;ve duplicated the schema and the example data below so that it&#8217;s all in one place. But it&#8217;s not quite done. In the next instalment, I&#8217;ll look at adding some finishing touches that make the RDF easier to use.</p>

<hr />

<h2>Schema</h2>

<p>This is the full schema. It contains just six classes, seven properties and three datatypes at the moment, so it&#8217;s pretty small as vocabularies go. We&#8217;ve been able to reuse a lot of classes, properties and datatypes that have already been defined elsewhere in the RDF itself, so this vocabulary is pretty focused on just what we need to describe traffic counts.</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .

# Classes #

&lt;http://transport.data.gov.uk/def/traffic/Road&gt;
  a rdfs:Class ;
  rdfs:label "Road"@en .

&lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count Point"@en .

&lt;http://transport.data.gov.uk/def/traffic/Count&gt;
  a rdfs:Class ;
  rdfs:label "Traffic Count"@en .

&lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Road Category"@en .    

&lt;http://transport.data.gov.uk/def/traffic/CountHour&gt;
  a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:label "Count Hour"@en .

&lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt;
  a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:label "Vehicle Type"@en .

# Properties #

&lt;http://transport.data.gov.uk/def/traffic/road&gt;
  a rdf:Property ;
  rdfs:label "road name"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Road&gt; .

&lt;http://transport.data.gov.uk/def/traffic/countPoint&gt;
  a rdf:Property ;
  rdfs:label "traffic count point"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountPoint&gt; .

&lt;http://transport.data.gov.uk/def/traffic/count&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "count"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/Count&gt; .

&lt;http://transport.data.gov.uk/def/traffic/roadCategory&gt;
  a rdf:Property ;
  rdfs:label "road category"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/RoadCategory&gt; .

&lt;http://transport.data.gov.uk/def/traffic/direction&gt;
  a rdf:Property ;
  rdfs:label "traffic direction"@en .

&lt;http://transport.data.gov.uk/def/traffic/countHour&gt;
  a rdf:Property ;
  rdfs:label "hour of count"@en ;
  rdfs:domain &lt;http://transport.data.gov.uk/def/traffic/Count&gt; ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/CountHour&gt; .

&lt;http://transport.data.gov.uk/def/traffic/vehicleType&gt;
  a rdf:Property ;
  rdfs:subPropertyOf scovo:dimension ;
  rdfs:label "vehicle type"@en ;
  rdfs:range &lt;http://transport.data.gov.uk/def/traffic/VehicleType&gt; .

# Datatypes #

&lt;http://transport.data.gov.uk/def/traffic/LAcode&gt;
  a rdfs:Datatype ;
  rdfs:label "Local Authority Code"@en .

&lt;http://transport.data.gov.uk/def/traffic/RoadNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Road Number"@en .

&lt;http://transport.data.gov.uk/def/traffic/CountPointNumber&gt;
  a rdfs:Datatype ;
  rdfs:label "Traffic Count Point Number"@en .
</code></pre>

<hr />

<h2>RDF Data</h2>

<p>Here&#8217;s a sample set of data. It looks like rather a lot to simply describe the number of bicycles at a particular point on a road (and it doesn&#8217;t even include the SKOS concept schemes that we did last time), but (a) it all provides valuable context for that measurement and (b) most of it will be reused by a lot of other measurements.</p>

<pre><code>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@prefix time: &lt;http://www.w3.org/2006/time#&gt; .
@prefix scovo: &lt;http://purl.org/NET/scovo#&gt; .
@prefix area: &lt;http://statistics.data.gov.uk/def/administrative-geography/&gt; .
@prefix admingeo: &lt;http://data.ordnancesurvey.co.uk/ontology/admingeo/&gt; .
@prefix space: &lt;http://data.ordnancesurvey.co.uk/ontology/spatialrelations/&gt; .
@prefix traffic: &lt;http://transport.data.gov.uk/def/traffic/&gt; .

&lt;http://statistics.data.gov.uk/id/country?name=England&gt;
  a area:Country ;
  rdfs:label "England"@en .

&lt;http://statistics.data.gov.uk/id/government-office-region/K&gt;
  a admingeo:GovernmentOfficeRegion ;
  rdfs:label "South West"@en ;
  skos:notation "K"^^area:StandardCode ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt;
  a area:LocalAuthorityDistrict ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:country &lt;http://statistics.data.gov.uk/id/country?name=England&gt; ;
  area:region &lt;http://statistics.data.gov.uk/id/government-office-region/K&gt; .

&lt;http://transport.data.gov.uk/id/local-authority-district/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://statistics.data.gov.uk/id/local-authority/18&gt;
  a area:LocalAuthority ;
  rdfs:label "Devon County Council"@en ;
  skos:notation "18"^^area:StandardCode ;
  skos:notation "1115"^^traffic:LAcode ;
  area:coverage &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/1115&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; .

&lt;http://transport.data.gov.uk/id/road/B3178&gt;
  a traffic:Road ;
  skos:notation "B3178"^^traffic:RoadNumber .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt;
  a traffic:CountPoint ;
  rdfs:comment "Salterton Road, EAST OF DINAN WAY, EXMOUTH"@en ;
  skos:notation "13"^^traffic:CountPointNumber ;
  traffic:road &lt;http://transport.data.gov.uk/id/road/B3178&gt; ;
  traffic:roadName "Salterton Road"@en ;
  traffic:roadCategory 
    &lt;http://transport.data.gov.uk/def/road-category/b&gt; ,
    &lt;http://transport.data.gov.uk/def/road-category/urban&gt; ;
  space:easting 302600 ;
  space:northing 81984 ;
  area:localAuthority &lt;http://statistics.data.gov.uk/id/local-authority/18&gt; ;
  area:district &lt;http://statistics.data.gov.uk/id/local-authority-district/18&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt;
  a traffic:Count ;
  traffic:countPoint &lt;http://transport.data.gov.uk/id/traffic-count-point/13&gt; ;
  traffic:direction &lt;http://dbpedia.org/resource/East&gt; ;
  traffic:countHour &lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt; .

&lt;http://dbpedia.org/resource/East&gt;
  rdfs:label "East"@en .

&lt;http://placetime.com/interval/gregorian/2001-10-08T17:00:00Z/PT1H&gt;
  a traffic:CountHour ;
  rdfs:label "8 Oct 2001, 17:00-18:00"@en ;
  time:hasBeginning &lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt; ;
  time:hasEnd &lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt; ;
  time:hasDurationDescription _:OneHour ;
  time:intervalDuring &lt;http://dbpedia.org/resource/2001&gt; .

_:OneHour a time:DurationDescription ;
  rdfs:label "one hour"@en ;
  time:years 0 ;
  time:months 0 ;
  time:days 0 ;
  time:hours 1 ;
  time:minutes 0 ;
  time:seconds 0 .

&lt;http://placetime.com/instant/gregorian/2001-10-08T17:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 17:00"@en ;
  time:inXSDDateTime "2001-10-08T17:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 17 ;
  ] .

&lt;http://placetime.com/instant/gregorian/2001-10-08T18:00:00Z&gt;
  a time:Instant ;
  rdfs:label "8 Oct 2001, 18:00"@en ;
  time:inXSDDateTime "2001-10-08T18:00:00Z"^^xsd:dateTime ;
  time:inDateTime [
    a time:DateTimeDescription ;
    time:unitType time:unitHour ;
    time:year "2001"^^xsd:gYear ;
    time:month "--10"^^xsd:gMonth ;
    time:day "---08"^^xsd:gDay ;
    time:hour 18 ;
  ] .

&lt;http://dbpedia.org/resource/2001&gt;
  a time:Interval ;
  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear ;
  time:intervalEquals &lt;http://placetime.com/interval/gregorian/2001-01-01T00:00:00Z/P1Y&gt; .

&lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00/type/bicycle&gt;
  a scovo:Item ;
  traffic:count &lt;http://transport.data.gov.uk/id/traffic-count-point/13/direction/E/hour/2001-10-08T17:00:00&gt; ;
  traffic:vehicleType &lt;http://transport.data.gov.uk/def/vehicle/bicycle&gt; ;
  rdf:value 2 .
</code></pre>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part III: Defining Concept Schemes</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/137" />
    <id>http://www.jenitennison.com/blog/node/137</id>
    <published>2009-11-22T21:04:41+00:00</published>
    <updated>2010-07-31T21:52:48+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="rdf" />
    <category term="skos" />
    <summary type="html"><![CDATA[<p>This is the third instalment in a series that I&#8217;m writing about turning data into linked data. I&#8217;m using traffic count data as the example, since that&#8217;s a dataset that I&#8217;m currently working on. In the last two instalments, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling the data</a> and about <a href="http://www.jenitennison.com/blog/node/136">designing URIs</a> for the <em>things</em> in that model.</p>

<p>Within the model, there are three sets of things that are <strong>concepts</strong>:</p>

<ul>
<li>road categories</li>
<li>vehicle types</li>
<li>cardinal directions</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the third instalment in a series that I&#8217;m writing about turning data into linked data. I&#8217;m using traffic count data as the example, since that&#8217;s a dataset that I&#8217;m currently working on. In the last two instalments, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling the data</a> and about <a href="http://www.jenitennison.com/blog/node/136">designing URIs</a> for the <em>things</em> in that model.</p>

<p>Within the model, there are three sets of things that are <strong>concepts</strong>:</p>

<ul>
<li>road categories</li>
<li>vehicle types</li>
<li>cardinal directions</li>
</ul>

<p>As I discussed last time, cardinal directions have URIs defined within DBPedia which are good enough for our purposes. The categorisation of roads and vehicles, on the other hand, is something specific to UK transport data, so they are up to us to define.</p>

<p>There&#8217;s a really useful RDF vocabulary called <a href="http://www.w3.org/TR/skos-primer/">SKOS</a> which is designed precisely for defining the kind of concept schemes that we want to use here. SKOS provides classes for concepts, concept schemes and collections (groupings of concepts within a scheme), and properties for linking them and providing labels, codes, definitions and so forth. Many of the SKOS properties can be used outside concept schemes &#8212; for example <code>skos:prefLabel</code> can be used anywhere you want to indicate the preferred label for a thing &#8212; so it&#8217;s good to get to know them.</p>

<h2>Vehicle Types</h2>

<p>Before we dive into RDF, let&#8217;s take some time to understand the classification that we need to model. We&#8217;re modelling vehicle types because counts are made of each different type of vehicle passing a traffic count point over a particular hour. Within the CSV data, the relevant column headings are:</p>

<ul>
<li><code>Pedal cycles</code></li>
<li><code>Two wheeled motor vehicles</code></li>
<li><code>Cars and taxis</code></li>
<li><code>Buses and coaches</code></li>
<li><code>Light vans</code></li>
<li><code>HGVr2</code></li>
<li><code>HGVr3</code></li>
<li><code>HGVr4+</code></li>
<li><code>HGVa3/4</code></li>
<li><code>HGVa5</code></li>
<li><code>HGVa6</code></li>
<li><code>All HGV</code></li>
<li><code>All motor vehicles</code></li>
</ul>

<p>These classifications are detailed in the <a href="http://www.dft.gov.uk/matrix/forms/definitions.aspx">Department for Transport documentation of the dataset</a>. It&#8217;s clear that it&#8217;s not a flat classification, but can be arranged into a hierarchy as follows:</p>

<pre><code>+- Pedal cycles
+- All motor vehicles
   +- Two wheeled motor vehicles
   +- Cars and taxis
   +- Buses and coaches
   +- Light vans
   +- All HGV
      +- Rigid HGV
      |  +- HGVr2
      |  +- HGVr3
      |  +- HGVr4+
      +- Articulated HGV
         +- HGVa3/4
         +- HGVa5
         +- HGVa6
</code></pre>

<p>So all we have to do is define that in SKOS. We&#8217;ve already decided that the URIs will look like:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/{type}
</code></pre>

<p>so for URI-hackability reasons we&#8217;ll call the concept scheme:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/
</code></pre>

<p>It&#8217;s probably easiest to just show what the concept scheme looks like. This is in <a href="http://www.w3.org/TeamSubmission/turtle/">Turtle</a>.</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/vehicle-category/&gt; .

&lt;&gt; a skos:ConceptScheme ;
  skos:prefLabel "Vehicle Types"@en ;
  skos:hasTopConcept &lt;bicycle&gt; ;
  skos:hasTopConcept &lt;motor-vehicle&gt; .
...
&lt;motor-vehicle&gt; a skos:Concept ;
  skos:prefLabel "Motor Vehicle"@en ;
  skos:topConceptOf &lt;&gt; ;
  skos:narrower &lt;motorbike&gt; ;
  skos:narrower &lt;car&gt; ;
  skos:narrower &lt;bus&gt; ;
  skos:narrower &lt;van&gt; ;
  skos:narrower &lt;HGV&gt; .
...
&lt;HGV&gt; a skos:Concept ;
  skos:prefLabel "Heavy Goods Vehicle"@en ;
  skos:altLabel "HGV"@en ;
  skos:definition "Goods vehicles over 3,500 kgs gross vehicle weight."@en ;
  skos:scopeNote "Includes tractors (without trailers), road rollers, box vans and similar large vans. A two axle motor tractive unit without trailer is also included."@en ;
  skos:broader &lt;motor-vehicle&gt; ;
  skos:narrower &lt;HGVr&gt; ;
  skos:narrower &lt;HGVa&gt; ;
  skos:inScheme &lt;&gt; .
...
</code></pre>

<p>The properties shown here are:</p>

<ul>
<li><code>skos:prefLabel</code> - the preferred label for something; there can only be one in any given language</li>
<li><code>skos:altLabel</code> - an alternative label for the thing; there can be any number</li>
<li><code>skos:definition</code> - provides a definition of the term</li>
<li><code>skos:scopeNote</code> - provides information about the scope of the term (eg what&#8217;s included or excluded)</li>
<li><code>skos:broader</code>/<code>skos:narrower</code> - link together concepts into a hierarchy</li>
<li><code>skos:hasTopConcept</code>/<code>skos:topConceptOf</code> - links together the concept schemes and the concepts at the top of the concept hierarchy defined within the scheme</li>
<li><code>skos:inScheme</code> - points from a concept to the concept scheme it&#8217;s defined in; it&#8217;s necessary to use either this or <code>skos:topConceptOf</code> on every <code>skos:Concept</code>, otherwise it&#8217;s not clear which concept scheme they belong to</li>
</ul>

<p>Note that in the RDF I&#8217;ve assigned every string a language (English). That&#8217;s good practice when values are textual; a Welsh translation could be provided for each one as well, for example.</p>
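
<p>As a rough illustration of how that might look, here&#8217;s a sketch of a top-level concept carrying one preferred label per language. Both the English label and the Welsh translation are guesses made for the sake of the example rather than anything taken from the dataset documentation:</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/vehicle-category/&gt; .

&lt;bicycle&gt; a skos:Concept ;
  skos:prefLabel "Bicycle"@en ;   # illustrative English label
  skos:prefLabel "Beic"@cy ;      # illustrative Welsh translation
  skos:topConceptOf &lt;&gt; .
</code></pre>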

<h2>Road Categories</h2>

<p>Road categories are also described within the documentation for this dataset. The hierarchy is shown in the documentation as:</p>

<pre><code>+- Major Roads
|  +- Motorways
|  |  +- Trunk
|  |  +- Principal
|  +- A Roads
|     +- Trunk
|     |  +- Urban
|     |  +- Rural
|     +- Principal
|        +- Urban
|        +- Rural
+- Minor Roads
   +- B Roads
   |  +- Urban
   |  +- Rural
   +- C Roads
   |  +- Urban
   |  +- Rural
   +- Unclassified Roads
      +- Urban
      +- Rural
</code></pre>

<p>But this is actually the result of three sets of overlapping concepts:</p>

<ul>
<li>roads by classification (major/minor, motorway/A/B/C/unclassified)</li>
<li>roads by locale (urban/rural)</li>
<li>major roads by maintenance responsibility (trunk/principal)</li>
</ul>

<p>These kinds of subdivisions of concepts can be managed in SKOS through <code>skos:Collection</code>s, which group together concepts without being broader than those concepts. Here&#8217;s a snippet from the concept scheme that shows how this works.</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/road-category/&gt; .

&lt;&gt; a skos:ConceptScheme ;
  skos:prefLabel "Road Categories"@en ;
  skos:hasTopConcept &lt;major&gt; ;
  skos:hasTopConcept &lt;minor&gt; ;
  skos:hasTopConcept &lt;urban&gt; ;
  skos:hasTopConcept &lt;rural&gt; .

&lt;classification&gt; a skos:Collection ;
  skos:prefLabel "Road by Classification"@en ;
  skos:member &lt;major&gt; ;
  skos:member &lt;minor&gt; .

&lt;maintenance&gt; a skos:Collection ;
  skos:prefLabel "Major Road by Maintenance Responsibility"@en ;
  skos:member &lt;principal&gt; ;
  skos:member &lt;trunk&gt; .

&lt;major&gt; a skos:Concept ;
  skos:prefLabel "Major Road"@en ;
  skos:altLabel "Major"@en ;
  skos:scopeNote "Include motorways and A roads. These roads usually have high traffic flows and are often the main arteries to major destinations."@en ;
  skos:narrower &lt;motorway&gt; ;
  skos:narrower &lt;a&gt; ;
  skos:narrower &lt;principal&gt; ;
  skos:narrower &lt;trunk&gt; ;
  skos:topConceptOf &lt;&gt; .

&lt;motorway&gt; a skos:Concept ;
  skos:prefLabel "Motorway"@en ;
  skos:broader &lt;major&gt; ;
  skos:scopeNote "Major roads often used for long distance travel. They are usually three or more lanes in each direction and generally have the maximum speed limit of 70mph."@en ;
  skos:inScheme &lt;&gt; .
...
&lt;trunk&gt; a skos:Concept ;
  skos:prefLabel "Trunk Road"@en ;
  skos:altLabel "Trunk"@en ;
  skos:scopeNote "Most motorways and many of the long distance rural A roads are trunk roads."@en ;
  skos:note "The responsibility for the maintenance of trunk roads lies with the Secretary of State and they are managed by the Highways Agency in England, the National Assembly of Wales in Wales and the Scottish Executive in Scotland (National Through Routes)."@en ;
  skos:broader &lt;major&gt; ;
  skos:inScheme &lt;&gt; .
...
</code></pre>

<p>In a hierarchy, these multiple overlapping concepts can be shown as:</p>

<pre><code>+- &lt;Road by Classification&gt;
|  +- Major Road
|  |  +- &lt;Major Road by Classification&gt;
|  |  |  +- Motorway
|  |  |  +- A Road
|  |  +- &lt;Major Road by Maintenance Responsibility&gt;
|  |     +- Principal Road
|  |     +- Trunk Road
|  +- Minor Road
|     +- B Road
|     +- C Road
|     +- Unclassified Road
+- &lt;Road by Locale&gt;
   +- Urban Road
   +- Rural Road
</code></pre>
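
<p>The locale grouping isn&#8217;t shown in the concept scheme snippet above, but following the same pattern as the other collections it might look something like this. The <code>&lt;locale&gt;</code> name is an assumption made for the sake of the example:</p>

<pre><code>@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
@base &lt;http://transport.data.gov.uk/def/road-category/&gt; .

# Hypothetical collection grouping the two locale concepts
&lt;locale&gt; a skos:Collection ;
  skos:prefLabel "Road by Locale"@en ;
  skos:member &lt;urban&gt; ;
  skos:member &lt;rural&gt; .
</code></pre>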

<p>That&#8217;s our concept schemes done. Next it will be time to turn to defining a vocabulary for the particular <em>things</em> that we want to describe from this dataset.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Creating Linked Data - Part II: Defining URIs</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/136" />
    <id>http://www.jenitennison.com/blog/node/136</id>
    <published>2009-11-22T17:23:34+00:00</published>
    <updated>2010-07-31T21:51:16+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="linked data" />
    <category term="uri" />
    <summary type="html"><![CDATA[<p>This is the second instalment in a series of posts about how to create linked data from existing data sets, using traffic count data as an example. In the last instalment, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling data</a>. This instalment discusses the creation of URIs for the various <em>things</em> that have been identified within the model.</p>

<p>This part of the process is the same as what you&#8217;d do if you were simply creating a RESTful API to a website. The principle is that everything has a URI, and if you resolve that URI you get information about the thing.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This is the second instalment in a series of posts about how to create linked data from existing data sets, using traffic count data as an example. In the last instalment, I talked about <a href="http://www.jenitennison.com/blog/node/135">analysing and modelling data</a>. This instalment discusses the creation of URIs for the various <em>things</em> that have been identified within the model.</p>

<p>This part of the process is the same as what you&#8217;d do if you were simply creating a RESTful API to a website. The principle is that everything has a URI, and if you resolve that URI you get information about the thing.</p>

<!--break-->

<p>For the data.gov.uk site, we now have some <a href="http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf">guidelines about the design of URIs for the UK public sector</a>. Basically, URIs for <em>things</em> should look like:</p>

<pre><code>http://{sector}.data.gov.uk/id/{type of thing}/{thing identifier}
</code></pre>

<p>There&#8217;ll be plenty of examples in what follows.</p>

<h2>Areas</h2>

<p>Some of the things that we&#8217;ve identified as being part of the traffic count dataset already have centrally-defined identifiers. As part of other data.gov.uk work, we&#8217;ve defined URIs for administrative areas like countries, regions, local authority districts and local authorities. The templates for these URIs are:</p>

<pre><code>http://statistics.data.gov.uk/id/country/{ONS code}
http://statistics.data.gov.uk/id/government-office-region/{ONS code}
http://statistics.data.gov.uk/id/local-authority-district/{ONS code}
http://statistics.data.gov.uk/id/local-authority/{ONS code}
</code></pre>

<p>We can use these identifiers directly for the regions, districts and local authorities. But there&#8217;s a problem with the country URI: we don&#8217;t have the ONS code for the country, only the name of the country. Fortunately, we&#8217;ve also defined URIs with this pattern:</p>

<pre><code>http://statistics.data.gov.uk/id/country?name={country name}
http://statistics.data.gov.uk/id/government-office-region?name={region name}
http://statistics.data.gov.uk/id/local-authority-district?name={district name}
http://statistics.data.gov.uk/id/local-authority?name={authority name}
</code></pre>

<p>so in this situation we can use the name-based country URI and we&#8217;ll get redirected to the canonical, code-based URI.</p>

<p>Local authorities actually have two codes within the dataset that we have: the ONS code and a DfT code. I can well imagine that other datasets from the Department for Transport will only reference the DfT code, so it&#8217;s a good idea to create URIs that are based on these codes; later on, we can state that the two identifiers actually mean exactly the same thing.</p>

<pre><code>http://transport.data.gov.uk/id/local-authority-district/{DfT code}
http://transport.data.gov.uk/id/local-authority/{DfT code}
</code></pre>

<p>So given the record:</p>

<pre><code>"England","North West","B",4315.00,"00BZ","St.Helens Metropolitan Borough Council",
4,"U",,"Unclassified Urban",,
,352100,398200,
7/6/2001 00:00:00,"N",7,1,0,5,1,0,0,0,0,0,0,0,0,6
</code></pre>

<p>the URIs we&#8217;ve defined so far are:</p>

<pre><code>http://statistics.data.gov.uk/id/country?name=England
http://statistics.data.gov.uk/id/government-office-region/B
http://statistics.data.gov.uk/id/local-authority-district/00BZ
http://statistics.data.gov.uk/id/local-authority/00BZ
http://transport.data.gov.uk/id/local-authority-district/4315
http://transport.data.gov.uk/id/local-authority/4315
</code></pre>
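
<p>When we do get round to stating that the ONS-based and DfT-based identifiers mean exactly the same thing, one option is <code>owl:sameAs</code>. This is only a sketch of how that could look for the record above, written in Turtle:</p>

<pre><code>@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; .

# The DfT-based URIs identify the same things as the ONS-based ones
&lt;http://transport.data.gov.uk/id/local-authority-district/4315&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority-district/00BZ&gt; .

&lt;http://transport.data.gov.uk/id/local-authority/4315&gt;
  owl:sameAs &lt;http://statistics.data.gov.uk/id/local-authority/00BZ&gt; .
</code></pre>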

<h2>Roads</h2>

<p>Now we&#8217;re onto things that aren&#8217;t defined already. First is roads. If there&#8217;s a road number, the obvious thing to use is that road number; something like:</p>

<pre><code>http://transport.data.gov.uk/id/road/{road number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/road/B3178
</code></pre>

<p>If there isn&#8217;t a road number, we&#8217;ll have to construct a URI. Since each count point is on one particular road, we can use the identifier of the count point to identify the road, so:</p>

<pre><code>http://transport.data.gov.uk/id/road/{class}-{count point number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/road/U-4
</code></pre>

<h2>Count Points</h2>

<p>Count points can be identified through their number, so it makes sense to use that in the URI:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4
</code></pre>

<h2>Counts</h2>

<p>The counts themselves don&#8217;t have their own identifiers, but they can be identified through a combination of the count point that they&#8217;re associated with, the direction of travel of the traffic that&#8217;s being counted, and the date and time that the count is made. So we can create a URI that combines these things. To aid hackability, I&#8217;m going to build on top of the traffic count point URI that we&#8217;ve already defined:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}/direction/{direction}/hour/{time}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4/direction/N/hour/2001-06-07T07:00:00
</code></pre>

<h2>Observations</h2>

<p>Again, observations build on top of the counts by adding a vehicle type to the mix, so we can construct URIs that reflect that:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/{count point number}/direction/{direction}/hour/{time}/type/{vehicle type}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/id/traffic-count-point/4/direction/N/hour/2001-06-07T07:00:00/type/motor-vehicle
</code></pre>
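
<p>To show how the count and observation URIs stack on top of the count point URI, here is a minimal Python sketch. The function names are hypothetical, and it assumes that dates in the source data are day-first (so <code>7/6/2001</code> is 7th June) and that the hour arrives as a separate whole number giving the start of the hour:</p>

```python
from datetime import datetime, timedelta

POINT_BASE = "http://transport.data.gov.uk/id/traffic-count-point"


def count_uri(count_point, direction, date_text, hour):
    """Build the URI for one hourly count in one direction of travel."""
    # Assumption: day/month/year order, as in "7/6/2001 00:00:00".
    day = datetime.strptime(date_text, "%d/%m/%Y %H:%M:%S")
    start = day + timedelta(hours=hour)
    return (
        f"{POINT_BASE}/{count_point}"
        f"/direction/{direction}/hour/{start.isoformat()}"
    )


def observation_uri(count_point, direction, date_text, hour, vehicle_type):
    """Extend the count URI with the type of vehicle being observed."""
    base = count_uri(count_point, direction, date_text, hour)
    return f"{base}/type/{vehicle_type}"


print(count_uri(4, "N", "7/6/2001 00:00:00", 7))
print(observation_uri(4, "N", "7/6/2001 00:00:00", 7, "motor-vehicle"))
```

<p>Because each function reuses the URI from the level above, trimming segments off the end of an observation URI leaves the URI of its count, and then of its count point.</p>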

<h2>Road Categories</h2>

<p>Road categories are a bit different from the kinds of things that we&#8217;ve been talking about so far: they are concepts. For these URIs we use a slightly different pattern from the URIs above: <code>/def/</code> rather than <code>/id/</code>. For road categories we can use:</p>

<pre><code>http://transport.data.gov.uk/def/road-category/{category}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/def/road-category/motorway
</code></pre>

<h2>Vehicle Types</h2>

<p>Vehicle types are also concepts, so have similar URIs:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/{type}
</code></pre>

<p>For example:</p>

<pre><code>http://transport.data.gov.uk/def/vehicle-category/HGVa5
</code></pre>

<h2>Cardinal Directions</h2>

<p>Cardinal directions are also concepts, but really they are global concepts, not specific to transport, or even to the UK. So it feels a bit strange to use URIs for them that imply that they somehow belong to data.gov.uk.</p>

<p>Fortunately, for this kind of general concept we can use URIs defined by <a href="http://dbpedia.org">DBpedia</a>. DBpedia is a linked data view on Wikipedia, so it has URIs for everything that Wikipedia has a page about, making it an excellent general-purpose resource. The relevant URIs for the cardinal directions are:</p>

<pre><code>http://dbpedia.org/resource/North
http://dbpedia.org/resource/South
http://dbpedia.org/resource/East
http://dbpedia.org/resource/West
</code></pre>

<p>so that&#8217;s what we&#8217;ll use.</p>
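
<p>In practice this amounts to a small lookup table. The sketch below assumes that the data records directions as the single letters N, S, E and W; the name <code>direction_uri</code> is invented for the example:</p>

```python
# Assumption: the source data uses single-letter direction codes.
DIRECTION_URIS = {
    "N": "http://dbpedia.org/resource/North",
    "S": "http://dbpedia.org/resource/South",
    "E": "http://dbpedia.org/resource/East",
    "W": "http://dbpedia.org/resource/West",
}


def direction_uri(code):
    """Map a direction code to its DBpedia URI; unknown codes raise KeyError."""
    return DIRECTION_URIS[code.strip().upper()]


print(direction_uri("N"))
```

<p>Raising an error for an unrecognised code, rather than guessing at a URI, makes unexpected values in the data show up early.</p>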

<h2>Dates, Times and Periods</h2>

<p>For dates, times and periods, we can use the URIs provided by another general-purpose linked data resource: <a href="http://www.placetime.com/">placetime.com</a>. URIs for instants have the pattern:</p>

<pre><code>http://placetime.com/instant/gregorian/{dateTime}
</code></pre>

<p>while periods have the pattern:</p>

<pre><code>http://placetime.com/interval/gregorian/{dateTime}/{duration}
</code></pre>

<p>So the hour from 7-8am on 7th June 2001 would be:</p>

<pre><code>http://placetime.com/interval/gregorian/2001-06-07T07:00:00/PT1H
</code></pre>

<p>and the year 2001 would be:</p>

<pre><code>http://placetime.com/interval/gregorian/2001-01-01T00:00:00/P1Y
</code></pre>

<p>The thing is that the latter isn&#8217;t particularly approachable. Calendar years are used all over the place, so it would be nice to have a set of URIs for them that we use consistently. Again, DBpedia provides URIs for every year, such as:</p>

<pre><code>http://dbpedia.org/resource/2001
</code></pre>

<p>so where we need to refer to a calendar year, it would be good to reuse that.</p>
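
<p>Here is a short Python sketch of both options for time periods: the placetime.com interval pattern for hours, and the DBpedia pattern for calendar years. The function names are hypothetical, and the duration is passed in as a ready-made ISO 8601 string such as <code>PT1H</code> rather than being calculated:</p>

```python
from datetime import datetime


def interval_uri(start, duration):
    """Build a placetime.com interval URI from a start time and a duration.

    start is a datetime; duration is an ISO 8601 duration string
    such as "PT1H" (one hour) or "P1Y" (one year).
    """
    return (
        "http://placetime.com/interval/gregorian/"
        f"{start.isoformat()}/{duration}"
    )


def year_uri(year):
    """Build the DBpedia URI for a calendar year."""
    return f"http://dbpedia.org/resource/{year}"


print(interval_uri(datetime(2001, 6, 7, 7), "PT1H"))
print(interval_uri(datetime(2001, 1, 1), "P1Y"))
print(year_uri(2001))
```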

<hr />

<p>And that completes the sets of URIs that we need for this data. Stay tuned.</p>
    ]]></content>
  </entry>
</feed>
