As we encourage linked data adoption within the UK public sector, something we run into again and again is that (unsurprisingly) particular domain areas have pre-existing standard ways of thinking about the data that they care about. There are existing models, often with multiple serialisations, such as in XML and a text-based form, that are supported by existing tool chains.
In contrast, if there is existing RDF in that domain area, it’s usually been designed by people who are more interested in the RDF than in the domain area, and is thus generally more focused on the goals of the typical casual data re-user rather than the professionals in the area.
To give an example, the international statistics community uses SDMX for representing and exchanging statistics (and a lot more besides; it’s a huge standard). SDMX includes a well-thought-through model for statistical datasets and the observations within them, as well as standard concepts for things like gender, age, unit multipliers and so on. By comparison, SCOVO, the main RDF model for representing statistics, barely scratches the surface.
This isn’t the only example: the INSPIRE Directive defines how geographic information must be made available. GEMINI defines the kind of geospatial metadata that that community cares about. The Open Provenance Model is the result of many contributors from multiple fields, and again has a number of serialisations.
You could view this as a challenge: experts in their domains already have models and serialisations for the data that they care about; how can we persuade them to adopt an RDF model and serialisations instead?
But that’s totally the wrong question. Linked data doesn’t, can’t and won’t replace existing ways of handling data. But it has got some interesting features that can bring great benefit to people who want to publish their data, namely:
The question is really about how to enable people to reap these benefits; the answer, because HTTP-based addressing and typed linkage is usually hard to introduce into existing formats, is usually to publish data using an RDF-based model alongside existing formats. This might be done by generating an RDF-based format (such as RDF/XML or Turtle) as an alternative to the standard XML or HTML, accessible via content negotiation, or by providing a GRDDL transformation that maps an XML format into RDF/XML.
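To make the content negotiation option a bit more concrete, here is a rough sketch of how a server might pick a serialisation from the HTTP Accept header. The names `OFFERED`, `parse_accept`, `quality` and `choose_format` are made up for illustration, and the header parsing is deliberately simplified; a real deployment would normally lean on the negotiation support in its web server or framework.

```python
# Hypothetical sketch: choose which representation of a resource to serve,
# based on the client's Accept header. All names here are illustrative.

# Representations the publisher can generate, in the server's preferred order.
OFFERED = ["text/html", "application/xml", "application/rdf+xml", "text/turtle"]


def parse_accept(header):
    """Return {media_type: q} from a (simplified) Accept header."""
    prefs = {}
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media range without an explicit q counts as q=1
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def quality(candidate, prefs):
    """Most specific match wins: exact type, then type/*, then */*."""
    if candidate in prefs:
        return prefs[candidate]
    type_wildcard = candidate.split("/")[0] + "/*"
    if type_wildcard in prefs:
        return prefs[type_wildcard]
    return prefs.get("*/*", 0.0)


def choose_format(accept_header, offered=OFFERED):
    """Pick the offered media type the client prefers most.

    With no Accept header the first offered type is served; None means
    nothing on offer is acceptable to the client.
    """
    prefs = parse_accept(accept_header)
    if not prefs:
        return offered[0]
    best, best_q = None, 0.0
    for candidate in offered:
        q = quality(candidate, prefs)
        if q > best_q:  # ties keep the earlier (server-preferred) type
            best, best_q = candidate, q
    return best
```

With this sketch, a browser sending `text/html` keeps getting the existing HTML view, while a linked data client sending `text/turtle` gets the RDF-based format from the same URI.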
Either way, the underlying model needs to be mapped into RDF. We’re furthest down this road with statistical data. I wanted to explore here what it might look like for the Open Provenance Model, building on lessons learned from the statistical domain.
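The Open Provenance Model talks about three main nodes: artifacts, processes and agents,
and five kinds of edges that can be defined between them: used, wasGeneratedBy, wasControlledBy, wasTriggeredBy and wasDerivedFrom.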
Then things start getting more complicated. OPM indicates that each artifact and agent plays a different role when it is used by, generated by or controls a process. What’s more, each artifact and agent might be involved in the process at different times (though timing information is optional within OPM). And a given provenance graph may contain several accounts of how artifacts, processes and agents fit together.
The OWL ontology for OPM is a very literal mapping of OPM into RDF. Each of the types of nodes is a separate class, and each of the types of edges is a separate class. Thus, it introduces a lot of n-ary relationships. Take a really simple example of an XML file being transformed into HTML using XSLT. With the OPM ontology, the RDF would look something like:
_:transformation a opm:Process .
<doc.html> a opm:Artifact .
<doc.xml> a opm:Artifact .
<doc.xsl> a opm:Artifact .
_:processor a opm:Agent .
_:Jeni a opm:Agent .
_:sourceLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xml> ;
  opm:role xslt:source .
_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> ;
  opm:role xslt:stylesheet .
_:resultLink a opm:WasGeneratedBy ;
  opm:effect <doc.html> ;
  opm:cause _:transformation ;
  opm:role xslt:result .
_:processorLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:processor ;
  opm:role xslt:processor .
_:userLink a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni ;
  opm:role xslt:user .
_:derivation a opm:WasDerivedFrom ;
  opm:effect <doc.html> ;
  opm:cause <doc.xml> .
xslt:source a opm:Role ;
  opm:value "source" .
xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .
xslt:result a opm:Role ;
  opm:value "result" .
xslt:processor a opm:Role ;
  opm:value "processor" .
xslt:user a opm:Role ;
  opm:value "user" .
To give you an idea of what this mapping means, if I wanted to work out who created doc.html, I would have to do a query like:
SELECT ?who
WHERE {
  ?generatedBy
    opm:effect <doc.html> ;
    opm:role xslt:result ;
    opm:cause ?transformation .
  ?controlledBy
    opm:effect ?transformation ;
    opm:role xslt:user ;
    opm:cause ?who .
}
There are two things that I want to pull out about the RDF mapping described above.
It reminds me of the mapping of object-oriented or relational data models into each other or into XML, which often results in a god-awful mess and people swearing that technology X is goddamned ugly.
The fact is that elegant uses of each modelling paradigm — ones that are easy to understand and efficient to query — always take advantage of the unique features of that paradigm. For example, good XML vocabularies take advantage of the distinctions between attributes and elements, of nesting and hierarchies, and of the ability to hold mixed content.
It’s the same with RDF. There are four features of RDF that I think good vocabularies will take suitable advantage of: reuse of existing vocabularies, inheritance, reasoning and named graphs.
Reusing existing vocabularies takes advantage of the ease of bringing together diverse domains within RDF, and it makes data more reusable. For example, an OPM mapping that encourages the reuse of FOAF for people and organisations saves time and effort for the developers of the OPM RDF vocabulary, that they would otherwise have spent modelling the details of agents; and it means that any agents that are described within the description of a piece of provenance are automatically available as agents in the wider FOAF cloud. The same goes for using DOAP to describe software.
By reusing vocabularies, the data isn’t isolated any more, locked within a single context designed for a single use. This is a huge benefit of the linked data approach and it makes sense to leverage it.
Using inheritance means creating general purpose classes and properties and encouraging other people to use rdfs:subClassOf or rdfs:subPropertyOf to specialise them according to their own requirements. Within OPM, the different roles that artifacts and agents might play in a process are a natural fit with either sub-properties or sub-classes, depending on how the edges in the model are represented. For example, rather than
_:stylesheetLink a opm:Used ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> ;
  opm:role xslt:stylesheet .
xslt:stylesheet a opm:Role ;
  opm:value "stylesheet" .
you could generate data that looked like:
_:stylesheetLink a xslt:Stylesheet ;
  opm:effect _:transformation ;
  opm:cause <doc.xsl> .
where xslt:Stylesheet is defined as a subclass of opm:Used.
Inheritance is a basic form of reasoning. In the case of the subclass relationship outlined above, the reasoning is that anything that is an xslt:Stylesheet is also an opm:Used, and thus:
_:stylesheetLink a xslt:Stylesheet .
implies
_:stylesheetLink a opm:Used .
Taking the scenario where you’re doing native linked data publishing — storing data in a triplestore and then publishing it out from there — you have two choices:
The latter is obviously the more user-friendly approach. (And a triplestore could make it easy by understanding and applying schemas, ontologies and rules as data is loaded in.)
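As a minimal sketch of what applying a schema as data is loaded might involve, the Python below materialises the types implied by rdfs:subClassOf. It assumes triples are plain tuples of strings rather than anything held in a real triplestore, and `superclasses` and `materialise_types` are hypothetical helper names, not part of any library.

```python
# Hypothetical sketch: add the rdf:type triples implied by rdfs:subClassOf
# while loading data. Triples are (subject, predicate, object) string tuples.

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(cls, triples):
    """Return every class reachable from cls by following rdfs:subClassOf."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                todo.append(o)
    return found


def materialise_types(triples):
    """Return the input triples plus every inferred rdf:type triple."""
    triples = set(triples)
    inferred = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            for parent in superclasses(o, triples):
                inferred.add((s, RDF_TYPE, parent))
    return triples | inferred


data = {
    ("xslt:Stylesheet", SUBCLASS_OF, "opm:Used"),
    ("_:stylesheetLink", RDF_TYPE, "xslt:Stylesheet"),
    ("_:stylesheetLink", "opm:effect", "_:transformation"),
    ("_:stylesheetLink", "opm:cause", "doc.xsl"),
}

loaded = materialise_types(data)
```

After loading, anyone asking for things of type opm:Used finds `_:stylesheetLink` without needing to know that the xslt: vocabulary exists.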
To take a more complex example, provenance could be modelled in a much more direct way, such as:
<doc.html> a opm:Artifact ;
  opm:derivedFrom <doc.xml> ;
  opm:generatedBy [
    xslt:source <doc.xml> ;
    xslt:stylesheet <doc.xsl> ;
    xslt:processor _:processor ;
    xslt:user _:Jeni ;
  ] .
where xslt:source and xslt:stylesheet are sub-properties of a property called opm:used, and xslt:processor and xslt:user are sub-properties of opm:controlledBy. This removes the n-ary properties, which (given the use of inheritance to represent roles) are only actually needed if the model needs to capture the timing of the involvement of particular artifacts or agents within a process, and makes the provenance information much easier to query than before:
SELECT ?who
WHERE {
  <doc.html> opm:generatedBy ?transformation .
  ?transformation xslt:user ?who .
}
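To show what the sub-property declarations buy you, here is a small Python sketch over the direct model. It assumes triples are plain tuples of strings, and `with_superproperties`, `objects` and `controllers_of` are hypothetical helpers invented for illustration; a real system would use a triplestore with RDFS support rather than hand-rolled code.

```python
# Hypothetical sketch: apply rdfs:subPropertyOf so that data written with the
# specialised xslt: properties can also be queried with the general opm: ones.

SUBPROPERTY_OF = "rdfs:subPropertyOf"

schema = {
    ("xslt:source", SUBPROPERTY_OF, "opm:used"),
    ("xslt:stylesheet", SUBPROPERTY_OF, "opm:used"),
    ("xslt:processor", SUBPROPERTY_OF, "opm:controlledBy"),
    ("xslt:user", SUBPROPERTY_OF, "opm:controlledBy"),
}

data = {
    ("doc.html", "opm:derivedFrom", "doc.xml"),
    ("doc.html", "opm:generatedBy", "_:transformation"),
    ("_:transformation", "xslt:source", "doc.xml"),
    ("_:transformation", "xslt:stylesheet", "doc.xsl"),
    ("_:transformation", "xslt:processor", "_:processor"),
    ("_:transformation", "xslt:user", "_:Jeni"),
}


def with_superproperties(data, schema):
    """Add (s, parent, o) for each (s, p, o) where p is a sub-property of parent."""
    parents = {}
    for s, p, o in schema:
        if p == SUBPROPERTY_OF:
            parents.setdefault(s, set()).add(o)
    result = set(data)
    todo = list(data)
    while todo:  # newly inferred triples are re-checked, so chains are followed
        s, p, o = todo.pop()
        for parent in parents.get(p, ()):
            triple = (s, parent, o)
            if triple not in result:
                result.add(triple)
                todo.append(triple)
    return result


def objects(triples, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def controllers_of(artifact, triples):
    """Everything that controlled a process that generated the artifact."""
    return {
        who
        for process in objects(triples, artifact, "opm:generatedBy")
        for who in objects(triples, process, "opm:controlledBy")
    }


graph = with_superproperties(data, schema)
```

The specific question (who was the xslt:user?) and the general one (who or what controlled the process?) are both two-step lookups, with no intermediate link nodes to navigate.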
But what if we also want to support the more complex, n-ary-relation-based models? We would need to assert, somehow, a rule that said that the presence of an opm:controlledBy relationship from a process to an agent was equivalent to having an opm:WasControlledBy instance with an opm:cause pointing to the agent and an opm:effect pointing to the process. Combine this with xslt:user being a sub-property of opm:controlledBy and you have the statement:
_:transformation xslt:user _:Jeni .
implying:
_:transformation opm:controlledBy _:Jeni .
which in turn implies:
[] a opm:WasControlledBy ;
  opm:effect _:transformation ;
  opm:cause _:Jeni .
The same reasoning could be applied in the opposite direction, of course. Part of the definition of the use of OPM in RDF could be that the presence of an opm:WasControlledBy with an opm:cause pointing to an agent and an opm:effect pointing to a process implies an opm:controlledBy link from the opm:effect to the opm:cause. Whichever was used in the initial modelling of the data, the same query could be used to query the data (accepting some loss of precision along the way, but if you’re not interested in timing information then why should you suffer the cost of querying through n-ary relations?).
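The two directions of that rule could be sketched roughly as follows. This is only an illustration under some assumptions: triples are plain Python tuples of strings, the function names are invented, and the generated blank node labels (`_:controlledBy1` and so on) are assumed not to clash with labels already in the data.

```python
# Hypothetical sketch of the rule relating the direct opm:controlledBy
# property to n-ary opm:WasControlledBy nodes, in both directions.
import itertools

RDF_TYPE = "rdf:type"


def expand_controlled_by(triples):
    """Direct to n-ary: add an opm:WasControlledBy node per opm:controlledBy link."""
    result = set(triples)
    counter = itertools.count(1)
    for process, p, agent in sorted(triples):  # sorted for stable node labels
        if p == "opm:controlledBy":
            node = f"_:controlledBy{next(counter)}"
            result.add((node, RDF_TYPE, "opm:WasControlledBy"))
            result.add((node, "opm:effect", process))
            result.add((node, "opm:cause", agent))
    return result


def collapse_was_controlled_by(triples):
    """N-ary to direct: add opm:controlledBy from each effect to each cause."""
    result = set(triples)
    for node, p, o in triples:
        if p == RDF_TYPE and o == "opm:WasControlledBy":
            effects = {e for s, q, e in triples if s == node and q == "opm:effect"}
            causes = {c for s, q, c in triples if s == node and q == "opm:cause"}
            for process in effects:
                for agent in causes:
                    result.add((process, "opm:controlledBy", agent))
    return result


simple = {("_:transformation", "opm:controlledBy", "_:Jeni")}
expanded = expand_controlled_by(simple)

# Starting from only the n-ary form, the direct link can be recovered.
nary_only = expanded - simple
recovered = collapse_was_controlled_by(nary_only)
```

Any role or timing information hanging off the n-ary node is simply ignored by the collapse, which is the loss of precision mentioned above.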
The final thing that I mentioned above that mappings from existing models to RDF should take advantage of is named graphs. In OPM, the obvious way that named graphs could play a role is in providing support for the different accounts of provenance. Separate named graphs could be used to represent separate accounts, referencing the same artifacts, agents and processes where appropriate. Individually, the graphs can remain simple; together, you have the full power of OPM.
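One way to picture accounts as named graphs is a mapping from graph name to a set of triples. The sketch below is purely illustrative: the account names, the `eg:` identifiers and the helper functions are all made up, and a real system would keep the graphs in a quad store and select them with SPARQL's GRAPH keyword.

```python
# Hypothetical sketch: two accounts of the same provenance, each held as a
# named graph (graph name -> set of triples) that shares resource identifiers.

accounts = {
    "eg:accountA": {
        ("doc.html", "opm:generatedBy", "eg:transformation"),
        ("eg:transformation", "opm:controlledBy", "eg:Jeni"),
    },
    "eg:accountB": {
        ("doc.html", "opm:generatedBy", "eg:transformation"),
        ("eg:transformation", "opm:controlledBy", "eg:processor"),
    },
}


def merged(accounts, names=None):
    """Union of the triples in the chosen accounts (all accounts by default)."""
    names = list(accounts) if names is None else names
    result = set()
    for name in names:
        result |= accounts[name]
    return result


def accounts_asserting(accounts, triple):
    """Names of the accounts in which the given triple appears."""
    return sorted(name for name, triples in accounts.items() if triple in triples)


everything = merged(accounts)
who_says_jeni = accounts_asserting(
    accounts, ("eg:transformation", "opm:controlledBy", "eg:Jeni")
)
```

Each account stays as simple as the direct model, while the graph name records which account a statement belongs to.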
Modelling is a complex design activity, and you’re best off avoiding doing it if you can. That means reusing conceptual models that have been built up for a domain as much as possible and reusing existing vocabularies wherever you can. But you can’t and shouldn’t try to avoid doing design when mapping from a conceptual model to a particular modelling paradigm such as a relational, object-oriented, XML or RDF model.
If you’re mapping to RDF, remember to take advantage of what it’s good at, such as web-scale addressing and extensibility, and always bear in mind how easy or difficult your data will be to query. There is no point publishing linked data if it is unusable.
Comments
Re: Translating Existing Models to RDF
Hi Jeni,
Interesting post! I would suggest looking at the Provenir upper-level provenance ontology [1] that defines provenance concepts derived from the Basic Formal Ontology (BFO) and properties adapted from the Relation Ontology (RO).
We have already extended Provenir to create domain-specific provenance ontologies in biomedicine and oceanography. Note that these domain-specific provenance ontologies re-use existing ontology concepts, especially from the 166 ontologies listed at the National Center for Biomedical Ontologies (NCBO).
— “There is no point publishing linked data if it is unusable.”
I strongly agree with this. A paper by one of my lab colleagues provides an interesting perspective on this and is being presented at the AAAI Spring Symposium “Linked Data Meets Artificial Intelligence” [2].
Best,
Satya Sahoo
[1] http://wiki.knoesis.org/index.php/Provenir_Ontology
[2] http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf
Re: Translating Existing Models to RDF
Nice post: Two things leapt out at me, with which I heartily agree:
“Linked data doesn’t, can’t and won’t replace existing ways of handling data.”
and “But it has got some interesting features that can bring great benefit to people who want to publish their data […] The question is really about how to enable people to reap these benefits” - exactly the right question, I think (for some value of “these benefits” - yours is reasonable, I might select some others).
Re: Translating Existing Models to RDF
That’s a very interesting set of points that helped me along in my thinking.
The one point I’m not sure about is your final one: I think it is definitely worth publishing ‘ugly’ or hard to use Linked Data.
One of the aha-erlebnisse I had about LD is that it allows people to figure out who will do the hard work on a case by case basis. That is, the provider of data may not have the resources to provide nice RDF, but it may still be worth some consumer’s while to come up with the killer SPARQL queries that get the info they need. As you just demonstrated with some virtuosity: if there’s a will, there’s a way.
It may even be the case that some third party sees value in providing a service that makes it easier for consumers to use the ‘ugly’ RDF: most likely by making their own models and hosting their own graphs that use the original URIs.
But in any such case, and for providers, consumers and third-party service people alike, the bottom line is the availability of the data in the first place.
Ugly Linked Data is always better than no Linked Data.
Re: Translating Existing Models to RDF
You’re dead right, of course. I would love for data owners to just publish as much data as they can, in whatever way they can. Any data is better than none.
My perspective here, and in the wider work that I’m involved in, is about making recommendations for those people who want to (and can) publish properly and responsibly. We’re making recommendations about how many many publishers across the UK government should be publishing information about, say, provenance or statistics or geography — things that are common across many many domains.
We can’t in good faith recommend that people use a vocabulary that we can see would make it hard for others to get at the data. So this post is about encouraging the designers of those cross-domain vocabularies, who aren’t data owners themselves, not to be lazy in their design, and to think about the consumption of information that uses their vocabularies.
Re: Translating Existing Models to RDF
Hey Jeni,
Interesting post. I agree with the points you make. Please notice, even if more specific compared to OPM, our Provenance Vocabulary for Linked Data follows your suggestions. We applied the concepts of inheritance and shortcuts (e.g. by using OWL2 property chains). Additionally, our vocabulary definition includes schema-level links to other vocabularies. However, what I consider most important from a user point of view: We developed the vocabulary with understandability and, in particular, the need to make query formulation easy in mind. BTW, we want to provide a mapping of the Provenance Vocabulary to OPM in the near future.
Greetings,
Olaf
Re: Translating Existing Models to RDF
Olaf,
Yes! You’ll know that in my previous post on provenance I ended up trying to use the Provenance Vocabulary for precisely those reasons.
The main reason that I want to look at OPM is because it provides a model that’s a bit wider than the focus of the Provenance Vocabulary, in particular for modelling the provenance of legislation (ie which parliament created the legislation — not the XML file but the legislation itself — how has it been amended over time and so on). I think that a well-designed OPM vocabulary could fit neatly underneath the Provenance Vocabulary, and I’d really like to see that.
Jeni