
As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data they care about. There are existing models, often with multiple serialisations (such as XML and a text-based format), that are supported by existing tool chains.

In contrast, if there is existing RDF in that domain area, it’s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally focused more on the goals of the typical casual data re-user than on those of the professionals in the area.

To give an example, the international statistics community uses SDMX for representing and exchanging statistics (and a lot more besides; it’s a huge standard). SDMX includes a well-thought-through model for statistical datasets and the observations within them, as well as standard concepts for things like gender, age, unit multipliers and so on. SCOVO, the main RDF model for representing statistics, barely scratches the surface in comparison.

This isn’t the only example: the INSPIRE Directive defines how geographic information must be made available. GEMINI defines the kind of geospatial metadata that that community cares about. The Open Provenance Model is the result of many contributors from multiple fields, and again has a number of serialisations.

You could view this as a challenge: experts in their domains already have models and serialisations for the data that they care about; how can we persuade them to adopt an RDF model and serialisations instead?

But that’s totally the wrong question. Linked data doesn’t, can’t and won’t replace existing ways of handling data. But it has got some interesting features that can bring great benefit to people who want to publish their data, namely:

  • web-scale addresses – being able to name and refer to things like individual observations in a statistical hypercube, a particular road junction, or the particular process that led to something being created
  • annotation – the ability to record metadata about everything that you can name, which is everything!
  • distributed publication – enabling multiple publishers to control the publication of their data without having to upload it to a central location
  • links – the joining of information to other information, providing more context, supporting more queries and reducing the requirement for duplication

The question is really about how to enable people to reap these benefits. Because HTTP-based addressing and typed linkage are usually hard to introduce into existing formats, the answer is generally to publish data using an RDF-based model alongside those formats. This might be done by generating an RDF-based format (such as RDF/XML or Turtle) as an alternative to the standard XML or HTML, accessible via content negotiation, or by providing a GRDDL transformation that maps an XML format into RDF/XML.

Either way, the underlying model needs to be mapped into RDF. We’re furthest down this road with statistical data. I wanted to explore here what it might look like for the Open Provenance Model, building on lessons learned from the statistical domain.

Open Provenance Model

The Open Provenance Model talks about three main nodes:

  • artifacts, which are the things that are produced or used by processes
  • processes, which are actions that are performed using or producing artifacts
  • agents, which are the people or systems that perform actions

and five kinds of edges that can be defined between them:

  • process A used artifact B
  • artifact A was generated by process B
  • process A was controlled by agent B
  • process A was triggered by process B
  • artifact A was derived from artifact B

Then things start getting more complicated. OPM indicates that each artifact and agent plays a different role when it is used by, generated by or controls a process. What’s more, each artifact and agent might be involved in the process at different times (though timing information is optional within OPM). And a given provenance graph may contain several accounts of how artifacts, processes and agents fit together.

Existing Mapping to RDF

The OWL ontology for OPM is a very literal mapping of OPM into RDF. Each of the types of nodes is a separate class, and each of the types of edges is a separate class. Thus, it introduces a lot of n-ary relationships. Take a really simple example of an XML file being transformed into HTML using XSLT. With the OPM ontology, the RDF would look something like:

_:transformation a opm:Process .
<doc.html> a opm:Artifact .
<doc.xml> a opm:Artifact .
<doc.xsl> a opm:Artifact .
_:processor a opm:Agent .
_:Jeni a opm:Agent .

_:sourceLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xml> ;
  opm:role xslt:source .

_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> ;
  opm:role xslt:stylesheet .

_:resultLink a opm:WasGeneratedBy ;
  opm:effect <doc.html> ;
  opm:cause _:transformation ;
  opm:role xslt:result .

_:processorLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:processor ;
  opm:role xslt:processor .

_:userLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni ;
  opm:role xslt:user .

_:derivation a opm:WasDerivedFrom ;
  opm:effect <doc.html> ;
  opm:cause <doc.xml> .

xslt:source a opm:Role ;
  opm:value "source" .

xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .

xslt:result a opm:Role ;
  opm:value "result" .

xslt:processor a opm:Role ;
  opm:value "processor" .

xslt:user a opm:Role ;
  opm:value "user" .

To give you an idea of what this mapping means, if I wanted to work out who created doc.html, I would have to do a query like:

SELECT ?who
WHERE {
  ?generatedBy 
    opm:effect <doc.html> ;
    opm:role xslt:result ;
    opm:cause ?transformation .
  ?controlledBy
    opm:effect ?transformation ;
    opm:role xslt:user ;
    opm:cause ?who .
}

Some Observations

There are two things that I want to pull out about the RDF mapping described above.

  • it’s incredibly literal; every entity type within the model is mapped onto an RDF class, including the edges, the roles and the accounts (which I didn’t show above)
  • it doesn’t reuse any existing vocabularies, even when they might help (such as for the ‘value’ of a role, which is really a label)

It reminds me of mappings of object-oriented or relational data models into each other or into XML, which often result in a god-awful mess and leave people swearing that technology X is goddamned ugly.

The fact is that elegant uses of each modelling paradigm – ones that are easy to understand and efficient to query – always take advantage of the unique features of that paradigm. For example, good XML vocabularies take advantage of the distinctions between attributes and elements, of nesting and hierarchies, and of the ability to hold mixed content.

It’s the same with RDF. There are four features of RDF that I think good vocabularies will take suitable advantage of:

  • existing vocabularies
  • inheritance
  • shortcuts and reasoning
  • named graphs

Reusing existing vocabularies takes advantage of the ease of bringing together diverse domains within RDF, and it makes data more reusable. For example, an OPM mapping that encourages the reuse of FOAF for people and organisations saves the developers of the OPM RDF vocabulary the time and effort they would otherwise have spent modelling the details of agents; and it means that any agents described within a piece of provenance are automatically available as agents in the wider FOAF cloud. The same goes for using DOAP to describe software.
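To make this concrete, the agents from the example above might be described something like this (just a sketch, using the usual foaf: and doap: prefixes; the names themselves are purely illustrative):

# the person who ran the transformation, also described as a FOAF person
_:Jeni a opm:Agent, foaf:Person ;
  foaf:name "Jeni" .

# the XSLT processor, described as a piece of software using DOAP
_:processor a opm:Agent, doap:Project ;
  doap:name "SomeXSLTProcessor" .

Any FOAF- or DOAP-aware tool can then pick these agents up without knowing anything about OPM at all.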

By reusing vocabularies, the data isn’t isolated any more, locked within a single context designed for a single use. This is a huge benefit of the linked data approach and it makes sense to leverage it.

Using inheritance means creating general purpose classes and properties and encouraging other people to use rdfs:subClassOf or rdfs:subPropertyOf to specialise them according to their own requirements. Within OPM, the different roles that artifacts and agents might play in a process are a natural fit for either sub-properties or sub-classes, depending on how the edges in the model are represented. For example, rather than

_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> ;
  opm:role xslt:stylesheet .

xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .

you could generate data that looked like:

_:stylesheetLink a xslt:Stylesheet ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> .

where xslt:Stylesheet is defined as a subclass of opm:Used.
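The definition itself is then just a statement or two within the vocabulary; note that rdfs:label can do the job that opm:value was doing for roles (a sketch, using the made-up xslt: terms from this example):

# a stylesheet link is a specialised kind of Used edge
xslt:Stylesheet rdfs:subClassOf opm:Used ;
  rdfs:label "stylesheet" .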

Inheritance is a basic form of reasoning. In the case of the subclass relationship outlined above, the reasoning is that anything that is an xslt:Stylesheet is also an opm:Used, and thus:

_:stylesheetLink a xslt:Stylesheet .

implies

_:stylesheetLink a opm:Used .

Taking the scenario where you’re doing native linked data publishing – storing data in a triplestore and then publishing it out from there – you have two choices:

  • you can store just the basic data, and let the application retrieving it carry out whatever reasoning is necessary to derive the information they need; this limits the size of the triplestore, but can place a large burden on people using it – either they have to be very familiar with the exact choices made in modelling the basic data, or they have to construct complex SPARQL queries that take account of the fact that the data might be modelled in many different ways
  • you can store not only the basic data but also anything that can be derived from it; this increases the number of triples you have to store, but means that people can query it without having to perform any reasoning themselves

The latter is obviously the more user-friendly approach. (And a triplestore could make it easy by understanding and applying schemas, ontologies and rules as data is loaded in.)
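For example, a store (or a loading process sitting in front of it) might materialise the subclass entailments from the example above with something like the following SPARQL 1.1 update; this is only a sketch of the idea, and a store with RDFS reasoning switched on would do the equivalent automatically:

# every xslt:Stylesheet link is also an opm:Used link
INSERT { ?link a opm:Used }
WHERE  { ?link a xslt:Stylesheet }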

To take a more complex example, provenance could be modelled in a much more direct way, such as:

<doc.html> a opm:Artifact ;
  opm:derivedFrom <doc.xml> ;
  opm:generatedBy [
    xslt:source <doc.xml> ;
    xslt:stylesheet <doc.xsl> ;
    xslt:processor _:processor ;
    xslt:user _:Jeni ;
  ] .

where xslt:source and xslt:stylesheet are sub-properties of a property called opm:used, and xslt:processor and xslt:user are sub-properties of opm:controlledBy. This removes the n-ary relations, which (given the use of inheritance to represent roles) are only actually needed if the model has to capture the timing of the involvement of particular artifacts or agents within a process, and makes the provenance information much easier to query than before:

SELECT ?who
WHERE {
  <doc.html> opm:generatedBy ?transformation .
  ?transformation xslt:user ?who .
}
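The vocabulary declarations behind this shortcut are tiny (again a sketch; opm:used and opm:controlledBy are the hypothetical shortcut properties described above, not terms taken from the existing OPM ontology):

xslt:source rdfs:subPropertyOf opm:used .
xslt:stylesheet rdfs:subPropertyOf opm:used .
xslt:processor rdfs:subPropertyOf opm:controlledBy .
xslt:user rdfs:subPropertyOf opm:controlledBy .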

But what if we also want to support the more complex, n-ary-relation-based models? We would need to assert, somehow, a rule that said that the presence of an opm:controlledBy relationship from a process to an agent was equivalent to having an opm:WasControlledBy instance with an opm:cause pointing to the agent and an opm:effect pointing to the process. Combine this with xslt:user being a sub-property of opm:controlledBy and you have the statement:

_:transformation xslt:user _:Jeni .

implying:

_:transformation opm:controlledBy _:Jeni .

which in turn implies:

[] a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni .

The same reasoning could be applied in the opposite direction, of course. Part of the definition of the use of OPM in RDF could be that the presence of an opm:WasControlledBy with an opm:cause pointing to an agent and an opm:effect pointing to a process implies an opm:controlledBy link between the opm:effect and the opm:cause. Whichever was used in the initial modelling of the data, the same queries could be used over it (accepting some loss of precision along the way, but if you’re not interested in timing information then why should you suffer the cost of querying through n-ary relations?).
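One concrete way of capturing such a rule would be a SPARQL CONSTRUCT (or an equivalent rule in whatever rule language your store supports); here’s a sketch of the n-ary-to-shortcut direction:

# derive the simple controlledBy link from the n-ary WasControlledBy node
CONSTRUCT { ?process opm:controlledBy ?agent }
WHERE {
  ?edge a opm:WasControlledBy ;
    opm:effect ?process ;
    opm:cause ?agent .
}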

The final feature I listed above, which mappings from existing models into RDF should take advantage of, is named graphs. In OPM, the obvious way that named graphs could play a role is in providing support for the different accounts of provenance. Separate named graphs could be used to represent separate accounts, referencing the same artifacts, agents and processes where appropriate. Individually, the graphs can remain simple; together, you have the full power of OPM.
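As a sketch of what that might look like in TriG (which adds named graph syntax to Turtle), with purely illustrative graph names:

# one account of how doc.html came to be
<account-1> {
  <doc.html> opm:generatedBy [
    xslt:stylesheet <doc.xsl> ;
    xslt:user _:Jeni
  ] .
}

# a second, independent account of the same artifact
<account-2> {
  <doc.html> opm:derivedFrom <doc.xml> .
}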

Conclusions

Modelling is a complex design activity, and you’re best off avoiding doing it if you can. That means reusing conceptual models that have been built up for a domain as much as possible and reusing existing vocabularies wherever you can. But you can’t and shouldn’t try to avoid doing design when mapping from a conceptual model to a particular modelling paradigm such as a relational, object-oriented, XML or RDF model.

If you’re mapping to RDF, remember to take advantage of what it’s good at, such as web-scale addressing and extensibility, and always bear in mind how easy or difficult your data will be to query. There is no point publishing linked data if it is unusable.