The Real Deal: data.gov.uk

I’m sure that you’ve noticed that my recent posts have been somewhat obsessed with publishing and using public sector information. It’s because I’ve somehow been sucked into the work going on within the UK government, with Tim Berners-Lee and Nigel Shadbolt advising, to publish its data as linked data.

My recent blog posts about publishing data using Talis have actually been a front for much more complex work that I’ve been doing with a different data set.

As an early demonstration of how existing government data sets might be turned into linked data, a few weeks ago I was given a CSV file containing road traffic counts; the raw data that lies behind the traffic flow information available on the Department for Transport website. The data is really interesting and ripe for visualisations and analysis. For each hour of particular days each year, at particular points on many roads within the UK, the Department for Transport measures the number of bicycles, motorbikes, cars, vans, buses and HGVs of various types that roll past in each direction. The data contains information about:

  • the count of each of the various classes of traffic that pass the point in a particular direction on a particular hour of a particular day
  • the points at which these measurements were taken
  • the roads on which the points are situated
  • the areas in which the points are situated
  • the local authority that is in charge of these areas
  • the region that the area is in
  • the country that the region is in

The challenge was to turn the 386Mb CSV file into linked data. The result is up and available for you to look at; a good starting point is http://geo.data.gov.uk/0/country. Just follow the links from there.

With a few false starts and mis-steps, this is the process that I went through:

  1. Tidied the CSV file so that it could be processed using awk. That meant replacing the commas that were delimiters with |s. It also meant removing a couple of weird ^M characters that had snuck into the file.
  2. Examined the data and came up with an informal ontology and prototype URI scheme.
  3. Created a bunch of awk scripts to extract different data from the files and create RDF/XML from it.
  4. Ran the scripts to create RDF/XML.
  5. Uploaded the data into a Talis store.
  6. Created appropriate PHP for the data and put it into a proxy server.

Some of this has been covered by my recent posts, so I’m just going to talk about a few of these steps in a bit more detail.

First, the URIs. Frankly, they’re an experiment to see how it plays. The templates are:

The subdomains are one way of subdividing the vast set of public sector information into vague categories that might be handled by different departments, without using the (highly changeable) department names in the URI. The /0 portion of each URI is a version number: these URIs are experimental and liable to be unsupported in the future so they’re marked with a version 0. The /id portion of each URI indicates that these are URIs for non-information resources; the response is a 303 See Other redirect to the same URIs but without the /id.

After the /id, the URIs follow a common pattern of naming a class of resource, followed by an appropriate identifier for that resource. The identifiers themselves are designed to be unique, unlikely to change, and human readable.

The ontologies, well, actually they don’t exist as yet except in my head. It’s been more important to make the data available than to provide ontologies for it. Triplestores and SPARQL queries work without ontologies; indeed you have to go out of your way to find applications that actually reason with them. Like schemas for XML documents, they’re not absolutely essential, but useful for documentation purposes and potentially useful for applications.

There are, though, a couple of SKOS schemes for categorising roads and vehicle types. These are available via:

  • http://transport.data.gov.uk/0/category/road
  • http://transport.data.gov.uk/0/category/vehicle

They were informed by the British Roads FAQ and the data definitions from the Department for Transport. I heartily recommend a read; it’s scintillating stuff!

Anyway, with this size of file, and the kind of processing that needed to be done with it, the simple XSLT that I talked about previously for extracting data out of CSV files just wasn’t going to cut it. Awk, on the other hand, is designed for this kind of processing. Most of the RDF/XML could be generated by collecting unique values from the file. For example, to generate the RDF/XML for the regions I used:

BEGIN { 
  FS = "|";
  print "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"";
  print "  xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"";
  print "  xmlns:g=\"http://geo.data.gov.uk/0/ontology/geo#\">";
}
FNR > 1 {
  countries[$2] = substr($1, 2, length($1) - 2);
  regions[$2] = substr($2, 2, length($2) - 2);
  codes[$2] = substr($3, 2, length($3) - 2);
}
END { 
  for (region in regions) {
    country = countries[region];
    name = regions[region];
    code = codes[region];
    path = tolower(name);
    gsub(" ", "-", path);
    print "<g:Region rdf:about=\"http://geo.data.gov.uk/0/id/region/" path "\">";
    print "  <rdfs:label>" name "</rdfs:label>";
    print "  <g:isInCountry>";
    print "    <g:Country rdf:about=\"http://geo.data.gov.uk/0/id/country/" tolower(country) "\">";
    print "      <g:hasRegion rdf:resource=\"http://geo.data.gov.uk/0/id/region/" path "\" />";
    print "    </g:Country>";
    print "  </g:isInCountry>";
    if (code != "") {
      print "  <g:ONScode rdf:datatype=\"http://www.w3.org/2001/XMLSchema#NCName\">" code "</g:ONScode>";
    }
    print "</g:Region>";
  }
  print "</rdf:RDF>"; 
}

This generated RDF/XML that looks like:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:g="http://geo.data.gov.uk/0/ontology/geo#">
<g:Region rdf:about="http://geo.data.gov.uk/0/id/region/london">
  <rdfs:label>London</rdfs:label>
  <g:isInCountry>
    <g:Country rdf:about="http://geo.data.gov.uk/0/id/country/england">
      <g:hasRegion rdf:resource="http://geo.data.gov.uk/0/id/region/london" />
    </g:Country>
  </g:isInCountry>
  <g:ONScode rdf:datatype="http://www.w3.org/2001/XMLSchema#NCName">H</g:ONScode>
</g:Region>
<g:Region rdf:about="http://geo.data.gov.uk/0/id/region/yorkshire-and-the-humber">
  <rdfs:label>Yorkshire and The Humber</rdfs:label>
  <g:isInCountry>
    <g:Country rdf:about="http://geo.data.gov.uk/0/id/country/england">
      <g:hasRegion rdf:resource="http://geo.data.gov.uk/0/id/region/yorkshire-and-the-humber" />
    </g:Country>
  </g:isInCountry>
  <g:ONScode rdf:datatype="http://www.w3.org/2001/XMLSchema#NCName">D</g:ONScode>
</g:Region>
...
</rdf:RDF>

In other cases, I needed to split up the RDF/XML that was generated into several files. Uploads to Talis of more than about 2Mb cause the upload to fail. The traffic count point RDF/XML needed to be split into 13 separate files. The traffic counts themselves… well, I haven’t managed to do it all yet but to give you an idea, the 2008 data alone generated 1800 RDF/XML files, each about 1.6Mb in size and each taking about a minute to upload. What’s there now is all the 2008 data, and the overall motor vehicle counts from all the years. More will be added gradually.

The awk script that generates the count data in separate files is:

BEGIN { 
  FS = "|";
  fileCount = 0;
  countCount = 99999;
  curlFile = "traffic-counts.curl.sh";
}
FNR > 1 && $15 ~ /\/2008 / {
  countCount += 1;
  if (countCount > 200) {
    if (fileCount != 0) {
      print "</rdf:RDF>" > fileName; 
      close(fileName);
    }
    countCount = 0;
    fileCount += 1;
    fileName = "traffic-counts/traffic-counts." fileCount ".rdf";
    print "creating", fileName;
    print "echo loading", fileName > curlFile;
    print "curl -H \"Content-type: application/rdf+xml\" -o progress.txt --digest -u username:password --data-binary @" fileName " http://api.talis.com/stores/transport/meta" > curlFile;

    print "<?xml version=\"1.0\" encoding=\"ASCII\"?>" > fileName;
    print "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"" > fileName;
    print "  xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"" > fileName;
    print "  xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"" > fileName;
    print "  xmlns:t=\"http://transport.data.gov.uk/0/ontology/traffic#\"" > fileName;
    print "  xml:base=\"http://transport.data.gov.uk/0/id/traffic-count/\">" > fileName;
  }

  cp = $7;
  date = $15;
  direction = substr($16, 2, length($16) - 2);
  split(date, dateFields, " ");
  date = dateFields[1];
  split(date, dateFields, "/");
  date = sprintf("%04d-%02d-%02d", dateFields[3], dateFields[2], dateFields[1]);
  hour = sprintf("%02d:00:00", $17);
  base = "http://transport.data.gov.uk/0/id/traffic-count/" cp "/" direction "/" date "/" hour;

  cycles = $18;
  motorbikes = $19;
  ...

  print "<t:Count rdf:about=\"" base "/cycle\">" > fileName;
  print "  <t:point>" > fileName;
  print "    <t:CountPoint rdf:about=\"http://transport.data.gov.uk/0/id/traffic-count-point/" cp "\">" > fileName;
  print "      <t:count rdf:resource=\"" base "/cycle\" />" > fileName;
  print "    </t:CountPoint>" > fileName;
  print "  </t:point>" > fileName;
  print "  <t:hour rdf:datatype=\"http://www.w3.org/2001/XMLSchema#dateTime\">" date "T" hour "</t:hour>" > fileName;
  print "  <t:direction>" direction "</t:direction>" > fileName;
  print "  <t:category rdf:resource=\"http://transport.data.gov.uk/0/category/bicycle\" />" > fileName;
  print "  <rdf:value  rdf:datatype=\"http://www.w3.org/2001/XMLSchema#integer\">" cycles "</rdf:value>" > fileName;
  print "</t:Count>" > fileName;
  print "<t:Count rdf:about=\"" base "/motorbike\">" > fileName;
  print "  <t:point>" > fileName;
  print "    <t:CountPoint rdf:about=\"http://transport.data.gov.uk/0/id/traffic-count-point/" cp "\">" > fileName;
  print "      <t:count rdf:resource=\"" base "/motorbike\" />" > fileName;
  print "    </t:CountPoint>" > fileName;
  print "  </t:point>" > fileName;
  print "  <t:hour rdf:datatype=\"http://www.w3.org/2001/XMLSchema#dateTime\">" date "T" hour "</t:hour>" > fileName;
  print "  <t:direction>" direction "</t:direction>" > fileName;
  print "  <t:category rdf:resource=\"http://transport.data.gov.uk/0/category/motorbike\" />" > fileName;
  print "  <rdf:value  rdf:datatype=\"http://www.w3.org/2001/XMLSchema#integer\">" motorbikes "</rdf:value>" > fileName;
  print "</t:Count>" > fileName;
  ...
}
END {
  print "</rdf:RDF>" > fileName; 
  close(fileName);
}

This also generates a shall script that includes the curl instructions to upload the files.

The original data contained easing/northing information about each point when generally latitude/longitude is easier for mapping. So I extracted the easting/northings, used the free (Windows only) software available via the Ordnance Survey to turn these into latitude/longitude — there is a web service to do the same, but you can only do 200 coordinates at a time — converted those into decimals, then RDF, and uploaded them.

The PHP scripts that serve the data as linked data are exactly what I’ve shown before. I amended the .htaccess file to redirect to an appropriate PHP script like this:

<IfModule mod_rewrite.c>
  RewriteEngine on
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d

  RewriteRule ^id/(.+)$  id.php [L]

  RewriteCond %{REQUEST_URI} !\.php
  RewriteRule ^([^/]+)(/.+)? $1.php$2 [L,QSA]
</IfModule>

and created PHP scripts for each of the types of data being published. For example, region.php is:

<?php
  include "utils.php";
  proxy('http://geo.data.gov.uk/0/ontology/geo#Region', 50);
?>

And there we have it. Linked traffic count data on the web.

(And because this is all published through Talis, there’s also a SPARQL endpoint that you could use to run queries and create visualisations. Knock yourself out.)

Please take a look and comment on what we’ve done. What’s your opinion of the URI scheme? Is it useful to be able to access the data as linked data? Which other formats would you like to see?

Comments

Re: The Real Deal: data.gov.uk

Interesting post again Jeni.

I tend to follow a similar approach when making my RDF. Typically I clean up the original data (though usually in a text editor like editplus if I’m honest) and then I run python scripts to produce the RDF.

The URIs for our data (when we’ve fixed a URI scheme) will look something like

http://…/id/7000000000000123

OS data (well OS MasterMap) gives unique (16 digit) IDs to all features - these are called TOIDs. Our URIs will be based on these TOIDs.

I’m coming close to finishing the administrative geography RDF so that should soon give you something to “owl:sameAs” to.

Re: The Real Deal: data.gov.uk

Jeni,

Nice work!

Here is what the URIBurner Service [1] that we offer, delivering alternative presentations of your data based on processing the only data representation (RDF/XML) currently offered by your Linked Data Server:

Example:

  1. http://linkeddata.uriburner.com/about/html/http://geo.data.gov.uk/0/country — An HTML representation of the description of the resource (information resource): http://geo.data.gov.uk/0/country

By the way, what’s happening re. the handling of negotiated representations of other RDF model based data representations? Also, I think your current responses are confusing re. user agents requests for alternative formats as per these CURL dumps:

kidehen$ curl -I -H “Accept: text/turtle” http://geo.data.gov.uk/0/country HTTP/1.1 200 OK Date: Sun, 26 Jul 2009 23:47:24 GMT Server: Apache/2.2.11 (Win32) PHP/5.2.5 (x64) X-Powered-By: PHP/5.2.5 (x64) Content-Location: /0/country.rdf Content-Type: application/rdf+xml

kidehen$ curl -I -H “Accept: text/n3” http://geo.data.gov.uk/0/country HTTP/1.1 200 OK Date: Sun, 26 Jul 2009 23:47:50 GMT Server: Apache/2.2.11 (Win32) PHP/5.2.5 (x64) X-Powered-By: PHP/5.2.5 (x64) Content-Location: /0/country.rdf

In both cases, I think a 406 with suggested formats presented based on what the server currently is capable of serving up would be more user agent friendly :-)

Links:

  1. http://linkeddata.uriburner.com — service that produces Linked Data meme compliant proxy URIs for a variety of Web Addressable Resources (why may or may not serve up RDF based data by default).

Kingsley

Re: The Real Deal: data.gov.uk

  1. the (cleaned up) data could be interesting for other uses -are there plans to put it up online.

  2. conversion from OSGB to LatLong is fairly straightforward; I have the C++ code from 1995 somewhere and its probably trivial to update to other languages.

  3. CSV is, of course, the ideal format for MapReduce work, and 300+MB the datasize where it makes sense to start playing with it.

-Steve

OSGB grid to lat/long conversion

SteveL: conversion from OSGB to LatLong is fairly straightforward; I have the C++ code from 1995 somewhere and its probably trivial to update to other languages.

Careful though, that probably converts to the Airy ellipsoid (on which the Ordnance Survey grid was originally based - I've got some C code to do it, too). For modern work you would want lat/longs based on the WGS-84 ellipsoid for, for example, use with OpenStreetMap and most people's GPS setup. The differences can be of the order of a hundred metres or more: certainly enough to confuse one road with the next one along.

Re: OSGB grid to lat/long conversion

My code is to go from GPS (WGS-84) and OSGB, so it should get things right: Source code.

The data I'd really like to get hold of is a week or month's worth of partially anonymised ANPR data, replacing number plate with a unique ID (but retaining brand/model/type info for each vehicle ID). Rather than rely on DfT averages, you could see which cars drove the most, fastest, common routes, etc. That would be interesting, and with Hadoop clusters, the data would be easy to handle and work with.

Re: OSGB grid to lat/long conversion

My code is to go from GPS (WGS-84) and OSGB, so it should get things right:

Are you sure? I don't understand the maths properly but at a quick glance the code didn't show the full transformation so I tried it out: made up a little stdafx.h to compile on Linux and a test harness (rather grand name) to convert SU 000000. The result, lat: 50.798996, long: -2.000000. Longitude is a nice round number as square SU abuts the true central meridian of the OSGB transverse mercator projection; this is was the second thing which made me think these are Ordnance Survey lat/longs.

I then set up my GPS (Garmin 12 XL) for OSGB coordinates (Position Frmt: British Grid, Map Datum: Ord Srvy GB) and entered a waypoint for this position (name SU0000, position SU 00000 00000). As a sanity check, the GPS shows it on a bearing of 3964 mils (about 223°), range 125 km from my home in High Wycombe which seems about right; it should be somewhere near Bournemouth.

I then set the GPS to show OSGB lat/longs (Position Frmt: hddd.ddddd°, Map Datum: Ord Srvy GB) and looked again at the waypoint. The coordinates were now N50.799, W002.00000 which is about 0.5 metres from the point your code calculates - i.e., easily close enough.

With the GPS set for WGS 84 (Position Frmt: hddd.ddddd°, Map Datum: WGS 84) the coordinates were N50.79950, W002.00139. This is about 50 metres different in latitude and 90 metres different in longitude (using 1° latitude is about 111 km) so about 103 metres out in total.

I'm therefore pretty sure that your code computes Ordnance Survey lat/longs, not WGS 84.

This is a pity as it would be nice to have to some open-source code to go between OS grid and WGS 84 even if it's only approximate but, significantly better than 100 metres out. I don't think you can get open-source exact conversions because the latest incarnations of the OS grid (it changed in 2002, IIRC) uses tweaks on the 100 km grid squares which are regarded as proprietary by the OS. Still, it's been a couple of years since I've escaped from dealing with this stuff and the OS has relaxed a bit of late, I understand.

Re: OSGB grid to lat/long conversion

Just like to add that using the Ordnance Survey converter to convert easting 400000, northing 100000 (i.e., SU000000) with an elevation of zero to ETRS89 (which is the same as WGS84 apart from the modelled effect of continental drift, about 20mm per year, IIRC) gives N 50.79955144, W 2.00136020 which is about 5.4 metres different from what my GPS gives for WGS 84.

This difference is probably the result of a combination of calculation and display accuracy and the GPS calculation probably pre-dating OSTN02.

Re: OSGB grid to lat/long conversion

It’s been a long time since I looked at that code, so if it doesn’t work right, it probably is broken. Its from an activeX control that can be used to get position from a couple of types of GPS receiver, data that could then be picked up in javascript.

Being 8+ years since I looked at it, the maths is beyond me now, I don’t see anything in the codebase do go between Datums (e.g WGS84 and NAS-1929), which is really what you need: code to convert LLs from the OSGB36 datum to WGs84; the grid ref to LL conversion should be independent.

-steve

Re: OSGB grid to lat/long conversion

Not broken, just differently specified :-). As far as I can see it works fine for OSGB lat/longs but these are different from WGS84 which might confuse the unwary.

Re: The Real Deal: data.gov.uk

I believe that the raw data was made available to the Open Knowledge Foundation and that it’s being put on CKAN, but I may have got the wrong end of the stick. The clean ups that I did were largely to enable me to processing it with awk; I’m happy to put it up somewhere but will have to consult about where to put it.

Personally, I think it would be great if the data was made available as widely as possible in as raw a form as it comes, precisely to enable people to play with it in whatever way they want.

Re: The Real Deal: data.gov.uk

I guess it would go within http://www.ckan.net/package/read/dft-road-traffic-counts - would be great to get the CSV file.

Re: The Real Deal: data.gov.uk

Re:CSV

problems may arise in the EU with the separator “,” since it is often used the decimal point in numeric value. So it’s not such a universal format

Re: The Real Deal: data.gov.uk

In this CSV at least any values that have commas within them have quotes around them. So it’s absolutely fine to have values with commas in them; I just changed to using |s because I couldn’t figure out how to make awk ignore the commas that were within quoted values (it’s not a true CSV processor).