SPARQL & Visualisation Frustrations: RDF Datatyping

My last post showed a visualisation of the Guardian’s MPs’ expenses data, ported into a Talis triplestore. Here’s a screenshot of another one (follow the link for the interactive version). The files used to create it are attached to this post.

Graphs of highest 25 expense claims in each party

There are several things that are frustrating about creating these visualisations, which I want to discuss because I think they lead to some lessons about what data publishers and members of the semantic web community should do to make these things easy. The first thing I want to talk about is datatyping.

In RDF, literal values can be plain literals, in which case they may have an associated language; XML literals, in which case they have structure; or typed literals, which have a particular datatype, usually one of the ones defined by XML Schema.

The easiest kinds of literals to create, especially in RDF/XML, are plain literals. Indeed some formats don’t even support the creation of typed literals. So RDF often contains values that are actually numbers or dates, but that are plain literals rather than being typed with an appropriate datatype.

In the RDF for the MPs’ expenses data, many of the figures are typed as xsd:int but some (such as salary and total claim) are untyped, which means that:

  • sorting on them within the SPARQL query is done alphabetically rather than numerically
  • automated conversions into, say, JSON, will usually convert them into strings rather than numbers, or have to take a stab in the dark and assume that they are numeric based on their format

When I created the visualisation shown above, for example, I did a sort on the total-claim property to get the top 25 claimants, but that wasn’t what I actually got because I wasn’t sorting on a number.
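
The effect is easy to reproduce outside SPARQL. Here is a minimal sketch in Python (not a SPARQL query), using invented figures rather than values from the dataset, of what happens when numbers held as plain strings are sorted and converted to JSON:

```python
import json

# Hypothetical claim totals, held as plain strings the way an untyped
# RDF literal would arrive. The figures are invented for illustration.
plain_claims = ["9800", "152000", "23083", "140500"]

# Sorting the strings compares them character by character.
as_strings = sorted(plain_claims, reverse=True)

# Sorting on the converted value compares them as numbers.
as_numbers = sorted(plain_claims, key=int, reverse=True)

# A naive JSON conversion keeps the untyped value as a string.
as_json = json.dumps({"untyped": plain_claims[0], "typed": int(plain_claims[0])})

print(as_strings)
print(as_numbers)
print(as_json)
```

In the string sort, "9800" comes out on top because "9" sorts after "1" and "2", which is the same kind of surprise I got from the total-claim property.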

Now the question of whether an element’s value intrinsically has a particular type or is merely given a type for the purposes of processing is something that has caused religious wars within the XML community. And in those wars I have always come down firmly on the side of typing being a matter of interpretation.

But with RDF I think it’s different, for two reasons:

First, unless I’m mistaken (and excepting extensions that may have been made by individual processors) the main mechanism that we have for processing RDF — SPARQL — does not support casting a plain literal into a typed literal. So there is simply no way of sorting numerically based on a plain literal. This could be viewed as a deficiency of SPARQL which might be addressed in a future version.

Second, one of the much-cited advantages of RDF is that it is self-describing. You can make requests to the URIs used for properties and classes to find out more information about them. But self-describing should apply to literal values too. If a value is a date, it should be labelled as a date; if it’s a number it should be labelled as a number.

So how about these as guidelines for creating RDF that would make processing RDF easier:

  • if the literal is XML, it should be an XML literal (obviously)
  • if the literal is in a particular language (such as a description or a name), it should be a plain literal with that language
  • otherwise it should be given an appropriate datatype
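
As a rough sketch of how a conversion script might follow the last two guidelines, here is a small Python helper that writes a value as an N-Triples literal. The function names and the sample values are hypothetical, not taken from the expenses dataset or any particular library, and XML literals are left out to keep it short.

```python
from datetime import date, datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

def escape(text):
    """Escape backslashes and double quotes for an N-Triples string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def to_literal(value, lang=None):
    """Hypothetical helper: pick a literal form based on the Python value."""
    if lang is not None:
        # Human-language text: a plain literal with a language tag.
        return f'"{escape(str(value))}"@{lang}'
    # Everything else gets a datatype. bool is checked before int, and
    # datetime before date, because each is a subclass of the other.
    if isinstance(value, bool):
        lexical, datatype = str(value).lower(), "boolean"
    elif isinstance(value, int):
        lexical, datatype = str(value), "integer"
    elif isinstance(value, datetime):
        lexical, datatype = value.isoformat(), "dateTime"
    elif isinstance(value, date):
        lexical, datatype = value.isoformat(), "date"
    else:
        lexical, datatype = escape(str(value)), "string"
    return f'"{lexical}"^^<{XSD}{datatype}>'

print(to_literal("Member of Parliament", lang="en"))
print(to_literal(64766))
print(to_literal(date(2009, 6, 18)))
```

The point of the sketch is only that the type decision is made once, when the data is published, rather than by every consumer afterwards.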

Comments

Re: SPARQL & Visualisation Frustrations: RDF Datatyping

It’s a pity that the type can’t be deduced from the property. E.g., using something of the general form:

guardian-mps:salary rdfs:range xs:integer .

I seem to remember reading some discussion of this ages ago but the whole idea was rejected. I can’t remember why.

Re: SPARQL & Visualisation Frustrations: RDF Datatyping

Hi Jeni,

I agree with you about typing of literals; it's part of the "say everything" approach I normally adopt.

The reason I didn't type some of the literals in the dataset is that the source data wasn't as clean as I'd hoped. For example, there are salaries listed in the Guardian spreadsheet as "12345 (2004)" or similar. As I wanted to process the data from source without correcting it, the best bet seemed to be to use an untyped literal. That simplified my data conversion.

It is possible to perform some limited casting in a SPARQL query when filtering and sorting values.

E.g: FILTER ( xsd:int(?salary) > 140000 )

Or ORDER BY DESC(xsd:int(?salary))

These are specified in section 11.5, Constructor Functions, of the SPARQL specification. The current limitation is that you can't cast a value so that the converted result is included in the query results.

Here are some SPARQL queries that show that:

You can test these out here.

Re: SPARQL & Visualisation Frustrations: RDF Datatyping

Thanks for the pointer, Leigh. I had looked at those casting functions before, but in the context of converting a dateTime into a date or a time in order to query the traffic flow data. Unfortunately that conversion isn’t supported, even though the equivalent casts in XPath are.

It’s not clear to me what happens if you try to perform a cast that isn’t allowed, such as casting the literal ‘12345 (2004)’ to an integer.

I’m not trying to pick on the MPs’ expenses data (which is really great stuff!); I’m just using it as an illustration of some of the subtleties of publishing data as RDF.

Re: SPARQL & Visualisation Frustrations: RDF Datatyping

Hi,

Yes, I think there are some obvious omissions in the flexibility of the casting operators. Might be a good time to ping some suggestions to the Working Group?

My understanding (which is largely limited to the internals of the ARQ query engine) is that an invalid cast raises an error internally and the query solution is dropped. So I don’t think you’ll see an error reported, just less data in the results.
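
As a rough analogy only, here is that behaviour mimicked in Python. This is not how ARQ is implemented, and Python's int() is not identical to the xsd:int cast; the rows are invented apart from the "12345 (2004)" shape mentioned earlier in the thread.

```python
# Hypothetical rows standing in for query solutions.
rows = [
    {"mp": "A", "salary": "141000"},
    {"mp": "B", "salary": "12345 (2004)"},
    {"mp": "C", "salary": "150000"},
    {"mp": "D", "salary": "60000"},
]

def passes_filter(row, threshold=140000):
    """Mimic FILTER ( xsd:int(?salary) > 140000 ) for a single row."""
    try:
        return int(row["salary"]) > threshold
    except ValueError:
        # The cast failed, so the row is rejected; no error reaches the caller.
        return False

kept = [row["mp"] for row in rows if passes_filter(row)]
print(kept)
```

Row B disappears from the output silently, in the same way that it fails the threshold test for row D, which is the "less data in the results" effect described above.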