This post was imported from my old Drupal blog. To see the full thing, including comments, it's best to visit the Internet Archive.

Update: If you’re interested in expressing statistics in RDF, I’d encourage you to join the publishing statistical data group and take a look at the documentation for ‘SDMX-RDF’ described there.

One of the things that we’ve been discussing over on the UK Government Data Developers mailing list is how best to represent in RDF the vast quantities of statistical data that the government produces. This is what we’ve come up with.

  1. We’ll use SCOVO as our main vocabulary.

  2. Dimensions (the things a statistic is about) should be instances of specialised classes such as ‘Hospital’ or ‘School’; these will often be SKOS concepts. We will try to reuse these as much as possible across datasets (see below).

  3. We will create subproperties of scv:dimension that have appropriate names and have different subclasses of scv:Dimension as their ranges. We will try to reuse these as much as possible across datasets (see below).

  4. The scv:Items we use (representing individual statistics) should not be blank nodes (because giving them URIs allows us to attach other information to them); they will each have a scv:dataset property that points to the scv:Dataset they belong to (which will probably also be a void:Dataset).

  5. Every scv:Item will also be the object of at least one triple whose subject is one of its dimensions; this will usually be the real-world thing that the statistic is associated with (e.g. the school or hospital).

  6. Most statistics are provided for a particular time period; for these, we will use OWL-Time relationships to resources that represent the period, but will also use appropriately datatyped literals where possible to make querying easier.

Here’s an example of what this looks like:

@prefix rdf: <> .
@prefix rdfs: <> .
@prefix xsd: <> .
@prefix scv: <> .
@prefix skos: <> .
@prefix dct: <> .
@prefix void: <> .
@prefix time: <> .
@prefix sdmx: <> .
@prefix pop: <> .
@prefix year: <> .

# The statistics themselves

  rdfs:label "Cornwall" ;
  pop:totalPopulation <> ;
  pop:ruralPopulation <> ;
  ... .
  a scv:Item ;
  rdf:value "499399"^^xsd:integer ;
  scv:dataset <*/population> ;
  sdmx:refArea <> ;
  pop:populationType pop:total ;
  sdmx:timePeriod <> .
  a scv:Item ;
  rdf:value "127904"^^xsd:integer ;
  scv:dataset <*/population> ;
  sdmx:refArea <> ;
  pop:populationType pop:rural ;
  sdmx:timePeriod <> .

# Datasets

  a scv:Dataset ;
  a void:Dataset ;
  dct:title "Populations of Local Authority Districts" ;
  ... .

# Common definitions for the dataset

pop:totalPopulation a rdf:Property ;
  rdfs:label "total population" ;
  rdfs:range scv:Item .
pop:ruralPopulation a rdf:Property ;
  rdfs:label "rural population" ;
  rdfs:range scv:Item .

pop:populationType rdfs:subPropertyOf scv:dimension ;
  rdfs:label "population type" ;
  rdfs:domain scv:Item ;
  rdfs:range pop:Population .

pop:Population a rdfs:Class ;
  rdfs:subClassOf skos:Concept ;
  rdfs:subClassOf scv:Dimension ;
  rdfs:label "population type" .

pop:populationScheme a skos:ConceptScheme ;
  skos:prefLabel "Population Types" ;
  skos:hasTopConcept pop:total .

pop:total a pop:Population ;
  skos:prefLabel "total population" ;
  skos:topConceptOf pop:populationScheme ;
  skos:narrower pop:rural ;
  ... .

pop:rural a pop:Population ;
  skos:prefLabel "rural population" ;
  skos:inScheme pop:populationScheme ;
  skos:broader pop:total ;
  ... .

year:Year a rdfs:Class ;
  rdfs:subClassOf time:Interval ;
  rdfs:subClassOf scv:Dimension .

  rdfs:label "mid-2001" ;
  time:intervalDuring <> .

  rdfs:label "2001" ;
  rdf:value "2001"^^xsd:gYear .
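
To give a feel for how data in this shape might be navigated, here is a minimal sketch in plain Python that treats the example as a set of (subject, predicate, object) tuples. It is only an illustration of the pattern: the `http://example.org/` URIs and the helper functions `objects` and `items_with` are hypothetical and are not part of SCOVO or of the real data.

```python
# All URIs here are hypothetical placeholders standing in for the real identifiers.
EX = "http://example.org/"
CORNWALL = EX + "area/cornwall"
TOTAL = EX + "stat/cornwall/total"
RURAL = EX + "stat/cornwall/rural"
DATASET = EX + "dataset/population"
MID_2001 = EX + "period/mid-2001"

# The statistics from the example, as (subject, predicate, object) tuples.
triples = {
    (CORNWALL, "rdfs:label", "Cornwall"),
    (CORNWALL, "pop:totalPopulation", TOTAL),
    (CORNWALL, "pop:ruralPopulation", RURAL),
    (TOTAL, "rdf:type", "scv:Item"),
    (TOTAL, "rdf:value", 499399),
    (TOTAL, "scv:dataset", DATASET),
    (TOTAL, "sdmx:refArea", CORNWALL),
    (TOTAL, "pop:populationType", "pop:total"),
    (TOTAL, "sdmx:timePeriod", MID_2001),
    (RURAL, "rdf:type", "scv:Item"),
    (RURAL, "rdf:value", 127904),
    (RURAL, "scv:dataset", DATASET),
    (RURAL, "sdmx:refArea", CORNWALL),
    (RURAL, "pop:populationType", "pop:rural"),
    (RURAL, "sdmx:timePeriod", MID_2001),
}


def objects(triples, subject, predicate):
    """Return every object of the given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def items_with(triples, dimensions):
    """Return the subjects that carry every predicate/object pair in `dimensions`."""
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for p, o in dimensions.items())
    )


# Route 1: start from the real-world thing and follow its named property (point 5).
total_item = objects(triples, CORNWALL, "pop:totalPopulation")[0]
cornwall_total = objects(triples, total_item, "rdf:value")[0]

# Route 2: start from the dimension values and find the matching item (point 3).
rural_items = items_with(
    triples,
    {"sdmx:refArea": CORNWALL, "pop:populationType": "pop:rural"},
)
```

Both routes arrive at the same scv:Items, which is the reason for linking each item from its real-world subject as well as giving it dimension properties.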

One source of sub-properties of scv:dimension (and subtypes of scv:Dimension) is SDMX (Statistical Data and Metadata eXchange). This provides standard ways of indicating things like the area and time that a statistic applies to. I’ve made an initial mapping into some RDFS properties and SKOS schemes as an indication of the kind of thing that would work here, but expect it to change.

We’re currently working on providing identifiers for the areas that statistics are likely to be about (such as local authority districts, MSOAs or wards). They are of the form:

{area-type}/{ONS-area-code}

and they tie into the newly released OS data. I hope we’ll have them available as Linked Data soon.
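
As a rough sketch of how identifiers following that template could be generated, the function below fills in the two slots. The base URI `http://example.org/id/` is a placeholder I have made up for illustration, as are the function name `area_uri` and the sample area type and code; the real identifiers may differ.

```python
from urllib.parse import quote

# Hypothetical base URI, used only for illustration.
BASE = "http://example.org/id/"


def area_uri(area_type, ons_code, base=BASE):
    """Build an identifier of the form {base}{area-type}/{ONS-area-code}."""
    # Percent-encode each segment so that stray characters cannot break the path.
    return f"{base}{quote(area_type, safe='')}/{quote(ons_code, safe='')}"


# The area type and code here are illustrative values only.
example = area_uri("local-authority-district", "00AA")
```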

One issue that hasn’t been resolved is how to handle the huge amount of repetition that is inherent in this method of representing statistical data. For example, in the data above, all the scv:Items in the scv:Dataset */population/*/year/2001 are from 2001. Rather than indicating the year of each individual scv:Item, it would be nice if we could have a property on the dataset that indicated that all the items in that dataset had the same value for a particular dimension. If this were called scv:itemDimension, for example, then we could do:

  a scv:Dataset ;
  a void:Dataset ;
  dct:title "Populations of Local Authority Districts" ;
  sdmx:itemTimePeriod <> ;
  ... .

sdmx:itemTimePeriod rdfs:subPropertyOf scv:itemDimension ;
  rdfs:label "time period of items in the dataset" ;
  rdfs:domain scv:Dataset .

and the individual scv:Items would not have to have any sdmx:timePeriod properties explicitly. Perhaps this is something that the people behind SCOVO might consider, or we might create the property ourselves.
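
A consumer of the data would then need to push the dataset-level value back down to the items. Here is a minimal sketch of that expansion in plain Python, again using (subject, predicate, object) tuples. Note the assumptions: scv:itemDimension is only a proposal, the mapping from sdmx:itemTimePeriod to sdmx:timePeriod is something I have written out by hand in `ITEM_LEVEL`, and all the `http://example.org/` URIs are placeholders.

```python
# Hypothetical mapping from each dataset-level property to the item-level
# property it stands in for; nothing in SCOVO defines this link.
ITEM_LEVEL = {"sdmx:itemTimePeriod": "sdmx:timePeriod"}


def expand_item_dimensions(triples, mapping=ITEM_LEVEL):
    """Copy dataset-level dimension values onto every item in that dataset."""
    inferred = set(triples)
    for dataset, dataset_prop, value in triples:
        item_prop = mapping.get(dataset_prop)
        if item_prop is None:
            continue  # not a dataset-level dimension
        for item, predicate, obj in triples:
            if predicate == "scv:dataset" and obj == dataset:
                inferred.add((item, item_prop, value))
    return inferred


# Placeholder URIs, for illustration only.
EX = "http://example.org/"
DATASET = EX + "dataset/population-2001"
Y2001 = EX + "period/2001"
ITEM_A = EX + "stat/a"
ITEM_B = EX + "stat/b"

compact = {
    (DATASET, "rdf:type", "scv:Dataset"),
    (DATASET, "sdmx:itemTimePeriod", Y2001),
    (ITEM_A, "scv:dataset", DATASET),
    (ITEM_B, "scv:dataset", DATASET),
}

expanded = expand_item_dimensions(compact)
```

After expansion each item carries its own sdmx:timePeriod, so queries written against the verbose form would still work against the compact one.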

As far as I know, this pattern for representing statistics has yet to be used “in anger”, but I hope that we’ll have some illustrations soon which will help us assess whether it’s viable. Any comments and suggestions would, of course, be very welcome!