Today, I’m going to moan about the lack of features in SPARQL that are necessary to do many kinds of data analysis and visualisation. Going from raw data, held in RDF, to data like
cannot be done with SPARQL on its own. These calculations involve aggregation, grouping and projection which are planned for SPARQL vNext, but not here yet (at least, not in any standard way or in every triplestore).
Here’s the pretty graph to illustrate today’s rant:

The graph shows the number of notices of certain types placed in the London Gazette each day. The notices it summarises are those related to companies being liquidated, indicated by:
The graph is a version of:

with each data point averaged over 20 days. (The raw data spikes every Wednesday, presumably due to notices building up over the weekend and taking two days to appear in the Gazette.) It shows how the number of creditors’ voluntary liquidations (indicating companies that go insolvent and are unable to pay their creditors) doubled from around 30/day in May 2008 to around 60/day in the Spring of this year, but seems to be falling again (as far as we can tell; the data is not up-to-date).
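The 20-day averaging can be sketched as follows. This is only an illustration: it assumes a trailing window (the mean of the current day and the days before it), which the post does not specify, and the daily counts are made-up values, not Gazette data.

```python
from collections import deque


def trailing_mean(counts, window=20):
    """Yield the mean of the most recent `window` values (fewer at the start)."""
    recent = deque(maxlen=window)
    total = 0
    for value in counts:
        if len(recent) == window:
            total -= recent[0]  # this value is about to drop out of the window
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical daily counts with one spike in every five-day week.
daily = [10, 10, 40, 10, 10] * 4
smoothed = list(trailing_mean(daily, window=5))
```

Once the window is full, the weekly spike is spread evenly across it, which is why the smoothed line hides the Wednesday peaks.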
This data is brought to you by the RDFa embedded by TSO in the notices on the London Gazette website and the scraping of said data into the datagovuk datastore held on the Talis platform, for both of which we have OPSI to thank.
The visualisation is brought to you by a touch of experimental “AJAR” in rdfQuery and the graphing power of Flot. Here are the lengths I have to go to to get the pretty graph:
First, I use rdfQuery to request a list of London Gazette issues since 1st May 2008. The SPARQL for the request is:
PREFIX corp-insolvency: <http://www.gazettes-online.co.uk/ontology/corp-insolvency#>
PREFIX g: <http://www.gazettes-online.co.uk/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
?issue a g:Issue .
?issue g:hasPublicationDate ?date .
}
WHERE {
?issue a g:Issue .
?issue g:hasPublicationDate ?date .
FILTER ( ?date > "2008-05-01"^^xsd:date ) .
}
This is a CONSTRUCT request because the resulting RDF/XML can be loaded into rdfQuery for querying. I could do a SELECT query and request JSON as the output format, but I’m doing a kind of end-to-end RDF thing here. So I use rdfQuery to make the request, load the result into an rdfQuery object, query it, and iterate over the results.
For each of the returned issues (all 293 of them), I make a separate request for all the relevant notices within that issue. The SPARQL looks like this:
PREFIX corp-insolvency: <http://www.gazettes-online.co.uk/ontology/corp-insolvency#>
PREFIX g: <http://www.gazettes-online.co.uk/ontology#>
CONSTRUCT {
?notice a ?type
}
WHERE {
?notice g:isInIssue $issue .
{ ?notice a corp-insolvency:MembersResolutionsForWindingUpNotice } UNION
{ ?notice a corp-insolvency:CreditorsResolutionsForWindingUpNotice } UNION
{ ?notice a corp-insolvency:AppointmentOfAdministratorNotice } UNION
{ ?notice a corp-insolvency:PetitionsToWindUpCompaniesNotice } .
?notice a ?type .
}
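The `$issue` placeholder has to be replaced with each issue's URI before the request is sent. The post doesn't say how that substitution is done, so the following is just one possible sketch using plain string templating; `query_for_issue` and the example URI are hypothetical names, not part of rdfQuery or the Talis platform.

```python
from string import Template

# The per-issue query from above, with $issue left as a template placeholder.
NOTICES_QUERY = Template("""\
PREFIX corp-insolvency: <http://www.gazettes-online.co.uk/ontology/corp-insolvency#>
PREFIX g: <http://www.gazettes-online.co.uk/ontology#>
CONSTRUCT { ?notice a ?type }
WHERE {
  ?notice g:isInIssue $issue .
  { ?notice a corp-insolvency:MembersResolutionsForWindingUpNotice } UNION
  { ?notice a corp-insolvency:CreditorsResolutionsForWindingUpNotice } UNION
  { ?notice a corp-insolvency:AppointmentOfAdministratorNotice } UNION
  { ?notice a corp-insolvency:PetitionsToWindUpCompaniesNotice } .
  ?notice a ?type .
}
""")

# Characters that must not appear inside a SPARQL IRI reference.
FORBIDDEN = set('<>"{}|^`\\')


def query_for_issue(issue_uri):
    """Return the notices query for one issue, with the URI wrapped in <...>."""
    if any(ch in FORBIDDEN or ch.isspace() for ch in issue_uri):
        raise ValueError(f"not a usable IRI: {issue_uri!r}")
    return NOTICES_QUERY.substitute(issue=f"<{issue_uri}>")


# Hypothetical issue URI, for illustration only.
query = query_for_issue("http://example.org/issue/1")
```

Rejecting angle brackets and whitespace keeps a malformed URI from changing the shape of the query.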
Once I’ve got the RDF for those notices, I can use rdfQuery to select just those of a particular type, then count how many there are and use the result to plot the graph.
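The counting step is the part that SPARQL can't do for me. As a rough sketch of what happens on the client, assuming the CONSTRUCT result has already been parsed into plain (subject, predicate, object) tuples (the notice URIs below are invented, and this is not rdfQuery's API):

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CI = "http://www.gazettes-online.co.uk/ontology/corp-insolvency#"


def count_by_type(triples):
    """Count the distinct subjects that have each rdf:type object."""
    # A set removes duplicate (subject, type) pairs before counting.
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    return Counter(o for _, o in typed)


# Hypothetical notices from a single issue.
triples = [
    ("http://example.org/notice/1", RDF_TYPE, CI + "CreditorsResolutionsForWindingUpNotice"),
    ("http://example.org/notice/2", RDF_TYPE, CI + "CreditorsResolutionsForWindingUpNotice"),
    ("http://example.org/notice/2", RDF_TYPE, CI + "CreditorsResolutionsForWindingUpNotice"),
    ("http://example.org/notice/3", RDF_TYPE, CI + "AppointmentOfAdministratorNotice"),
]
counts = count_by_type(triples)
```

Each issue contributes one such set of counts, which become the y-values for that issue's publication date.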
Creating the graph involves 294 requests to the Talis store (one for the list of issues, then one per issue) via the proxy that I’m using to get around the browser’s restrictions on cross-site requests, each of which takes (in my experience) between 200ms and 4s. So it’s pretty server-intensive for both the Talis servers and my proxy server (which is why I’m not actually going to make the page available generally). It’s also slow.
What I want to do is to be able to make four SPARQL requests that return RDF that summarises the number of notices of each of the different types on each date (or in each issue). I want to write SPARQL queries that look something like:
PREFIX corp-insolvency: <http://www.gazettes-online.co.uk/ontology/corp-insolvency#>
PREFIX g: <http://www.gazettes-online.co.uk/ontology#>
CONSTRUCT {
?issue a g:Issue .
?issue g:hasPublicationDate ?date .
?issue corp-insolvency:membersResolutionsForWindingUpNotices COUNT(?notice) .
}
WHERE {
?issue a g:Issue .
?issue g:hasPublicationDate ?date .
?notice g:isInIssue ?issue .
?notice a corp-insolvency:MembersResolutionsForWindingUpNotice .
}
GROUP BY ?issue
Four requests would be so much better than 294.
The thing of it is that this kind of facility is available as standard in SQL, the Google Visualisation API’s simple query language, or in the “reduce” part of map/reduce. If we’re to think of triplestores as a serious alternative to either relational or non-relational databases, and SPARQL as a serious alternative to either SQL or NoSQL, then it really must support these operations. And Real Soon.
In the meantime, I think the lesson for the publishers of linked data is to provide aggregated values for the obvious kinds of aggregations that people might want to do over your data. In the London Gazette data, that would be the counts of the various kinds of notices it contains. In the traffic flow data it would be the average, minimum and maximum traffic flow over each of the measured days, at each hour over the known dates and overall for each point.
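As an illustration of the kind of pre-computed summary a publisher could provide, here is a minimal sketch that groups readings and derives the minimum, maximum and mean for each group. The (point, day, flow) shape and the sample values are assumptions made for the example, not the structure of any real traffic dataset.

```python
from collections import defaultdict
from statistics import mean


def summarise(readings):
    """Group (point, day, flow) readings and summarise each (point, day)."""
    groups = defaultdict(list)
    for point, day, flow in readings:
        groups[(point, day)].append(flow)
    return {
        key: {"min": min(flows), "max": max(flows), "mean": mean(flows)}
        for key, flows in groups.items()
    }


# Hypothetical hourly readings for one measurement point.
readings = [
    ("A", "2009-06-01", 100),
    ("A", "2009-06-01", 300),
    ("A", "2009-06-02", 50),
]
summary = summarise(readings)
```

Publishing values like these alongside the raw data lets a consumer fetch one summary per group instead of every reading.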
On a more philosophical note, it strikes me that the concept of aggregation contradicts the Open World assumption. I can only know that the number of members’ winding-up order notices was exactly 30 if I know that I know of all the members’ winding-up order notices that exist. Pragmatically, in many cases this is going to be just fine, because we know that the datasets that we’re using are complete (our World is Closed), but it does slightly concern me that it’s impossible to do much useful data analysis without contradicting one of the fundamental tenets of the Semantic Web.
Comments
Re: SPARQL & Visualisation Frustrations: Aggregation and Project
Hey Jeni. I’m quite a fan of your blog and have been tracking rdfQuery closely lately and hope to use it with FuXi (at the server) to investigate a parallel paradigm for my current approach of ‘XForms in the client, XML on the wire, and RDF in the store (via transformation)’, i.e., one where RDF is bound to markup via javascript and dispatched to inference services (which is where FuXi comes in) and RDF datasets. More on that later (hopefully I can get around to blogging about it).
Anyways, I wanted to respond to your philosophical note:
On a more philosophical note, it strikes me that the concept of aggregation contradicts the Open World assumption. I can only know that the number of members’ winding-up order notices was exactly 30 if I know that I know of all the members’ winding-up order notices that exist. Pragmatically, in many cases this is going to be just fine, because we know that the datasets that we’re using are complete (our World is Closed), but it does slightly concern me that it’s impossible to do much useful data analysis without contradicting one of the fundamental tenets of the Semantic Web.
I think the OWA and its association with the Semantic Web is overstated and actually myopic. I personally don’t think the Semantic Web will ever be able to break through a certain threshold of adoption and pragmatic usage without support for some form of CWA-based querying and inference. Note that it basically already has this, in the form of using OPTIONAL/FILTER/!BOUND for the equivalent of negation as failure (which follows the CWA: if a statement is not in the dataset/base then it is considered false).
Common-sense semantics are more in line with the CWA. The problem with the OWA (and the reason why I think weaving it into the philosophy of the SW has done more harm than good) is that it assumes universal truths (despite the fact that the motivation for using the OWA is the avoidance of universal truths). Consider the following two English statements:
Consider the class of people who are guilty that is considered the opposite of the class of people that are considered innocent (where ‘opposite’ has a logical sense and a ‘common’ sense).
Take a scenario where negation is used with the OWA. The intuition for this kind of negation (classical negation) is often described as the law of the excluded middle. In this case the class of people who are guilty is defined precisely as everyone minus those who are members of the class of people who are innocent. It is an absolute definition; however, as we know, innocence, guilt, etc. are imprecise classifications in real life.
Even with an OWL axiom that states that the class of people who are innocent is disjoint with the class of people who are guilty, we cannot infer a person is guilty without this ‘negative information’. And even with this negative information we need to rule out their innocence in a purely logical/mathematical way.
But, as we know, although judicial systems consider the application of logic as the ideal way to prosecute/defend, inevitably they can only rely on known evidence to help a jury come to a particular conclusion. People are convicted of crimes even without irrefutable evidence.
This is the difference between negation via the CWA and classical negation. In non-monotonic reasoning (which is often considered the common-sense interpretation of negation), if there is no statement P in your database, then you can conclude not P. The emphasis shifts from applying a purely mathematical process to derive complementary information, to the administrator of the database attempting to collect complete information about P, or at least annotating the database to the effect that he/she cannot do so.
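The closed-world reading described above can be sketched in a few lines. This is an illustration only: the triples, the `ex:` names and the `not_known` helper are all hypothetical, and it stands in for what an OPTIONAL/FILTER/!BOUND query does rather than reproducing one.

```python
def not_known(triples, subjects, predicate, obj):
    """Closed-world check: subjects with no (s, predicate, obj) statement."""
    asserted = {s for s, p, o in triples if p == predicate and o == obj}
    return [s for s in subjects if s not in asserted]


# Hypothetical database: only Alice is stated to be innocent.
triples = [
    ("ex:alice", "ex:verdict", "ex:Innocent"),
    ("ex:bob", "ex:name", "Bob"),
]
people = ["ex:alice", "ex:bob"]

# Under the CWA, the absence of a statement about Bob is read as 'not innocent'.
# Under the OWA, nothing could be concluded about Bob from the same data.
not_innocent = not_known(triples, people, "ex:verdict", "ex:Innocent")
```

The result is only as trustworthy as the completeness of the database, which is the administrator's responsibility in the shift of emphasis just described.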
There are many practical situations where we do have complete information about a particular predicate (consider banking, for instance). A knowledge representation that doesn’t support both forms of reasoning about negation will forever appeal only to logicians and not ‘engineers’.
If you are interested in this counter argument, I’d suggest looking at “Negation and Negative Information in the W3C Resource Description Framework” (at least the first part that lays out arguments about the futility of a system that adheres to only the OWA):
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.6232 , we cannot infer a person is guilty unless
Re: SPARQL & Visualisation Frustrations: Aggregation and Project
Thanks for this JeniT, been trying to figure out the best method for doing these kinds of visualisation of Gov’t data off a triplestore. Been looking at Garlik’s 4store for a SPARQL interface, but looks like your method here might be what we need? Must admit I still feel that it is difficult as a single dev to dip a toe in Linked Data waters. But posts like this make it real and much more pragmatic. Cheers. @dfflanders
Re: SPARQL & Visualisation Frustrations: Aggregation and Pro...
My understanding is that the open world is the reason that aggregates never made it into SPARQL the first time around. But as you say, that’s being rectified now. Many existing SPARQL implementations support some form of aggregates, so rather than get frustrated with what I’d suggest (clearly with a bias) is a very healthy and relatively quick standards process, I’d suggest you ask the implementors of your SPARQL engine of choice if they can add aggregate support. Most existing implementations are pretty close to one another in terms of both semantics and syntax, so there should be a pretty low cost to adopting this feature before all the paint has dried on this round of standardization.
Lee
Re: SPARQL & Visualisation Frustrations: Aggregation and Pro...
It’s fine in this particular case for me to ask Leigh to get Talis to extend their SPARQL implementation to include support for aggregates, but in general I shouldn’t have to know which implementation a particular SPARQL endpoint has behind the scenes in order to know what flavour of SPARQL is supported. So while I understand what you’re saying, I’m still frustrated :)
It’s interesting that you view the standards process as very healthy and relatively quick. The SPARQL WG charter does indeed make it look as if the next version of SPARQL will be finalised soon: Recommendation in June 2010! Great!
But then the first public Working Draft of the use cases/requirements was supposed to be published in March (actual: July), and the first public Working Draft of the query language, protocol and return XML format in April (actual: not yet).
I am absolutely not criticising the work that you’re doing. I know (believe me, I know) how long it can take to develop a standard within the W3C, and it’s not as if the W3C Working Groups that I’ve been involved with have met the timelines specified in their charters either. Working Groups continue to be given aggressive deadlines despite consistent evidence that they cannot be met. Those that are small struggle to find resources; those that are large have their time eaten up in arguments.
But I wonder whether there’s anything that could be done to mitigate the problem. As you suggest, there are already implementations of many of the features under consideration for SPARQL vNext. What if the Working Group did something similar to the WHATWG’s HTML5 draft and indicated, for each section:
This would allow implementers and users to know which parts of the spec are ready to be implemented/used as well as helping the members of the Working Group know where to focus their efforts.
I’d also suggest concentrating on one section at a time. It’s better for users and implementers to have a single feature finalised in 3 months than all the features in a first draft state.
In other words, take some of the lessons learned about Agile software development and apply them to standards development.
Re: SPARQL & Visualisation Frustrations: Aggregation and Pro...
Due to IP concerns with the group’s original charter, things were actually shuffled around a couple of months ago. As a result, the group is/will shortly be operating under a new “Phase II” charter with a schedule adjusted to meet reality: http://www.w3.org/2009/05/sparql-phase-II-charter (not sure if that’s public or member visible).
You and other readers of this post may be interested in a few slides I threw together recently summarizing the state of the WG’s work: http://www.slideshare.net/LeeFeigenbaum/sparql2-status
I’m actually going to disagree with the general suggestion of hardening one feature at a time. My experience with that in the past has been that the group ends up “finishing” one feature only to find that others impact completed design decisions in a negative way, resulting in having to reopen the completed features and a longer time to standardization than if the features are specified in parallel. Instead, I prefer efforts such as the slide deck above to educate the SPARQL user community about the working group’s progress and the relative completeness of each of the features.
In any case, it’s not one of my desires to reform the W3C process, so I’d rather do the best I can working within it. Given that, my core message is that it’s ok (& healthy) for implementations to extend the standard and for users to make use of those extensions. It’s also ok & healthy for users to request and advocate for extensions to be codified as standards. I just don’t think it’s cool when people decry the overall technology based on the current state of the iterative standards process.
Lee
Re: SPARQL & Visualisation Frustrations: Aggregation and Pro...
I can see that my comments have upset you, and I’m sorry for that. SPARQL does a lot already, and I’m sure that you and the other members of the SPARQL Working Group are working as hard as you can to move the process along. No doubt it feels like I’m just sniping from the sidelines without making any contribution to that effort.
My comments arise from my perception of the current state of the world, in which Linked Data/Semantic Web advocates seem to think that all you have to do is put your data online in RDF, preferably with a SPARQL endpoint, and that data is easily usable by everyone else. My recent posts have been about testing that assertion, and while it’s true that you can do a lot, and I’ve tried to show several patterns that work, it’s not true that you can do everything that you might want to do easily with the state of standards and implementations that we have now.
I think there’s a real danger that if people start using Linked Data simply due to the hype around it, then discover that actually it doesn’t meet their (reuser’s) needs, they’ll go off the idea entirely. On the other hand, if the community owns up to the fact that there are still these (fairly large) gaps, that we know about and are working hard on, I believe that people will be more inclined to be patient and forgiving.
Regarding your specific points about hardening one feature at a time: I agree with you during the initial development of a standard, but if you are incrementally building on an existing standard then you should be able to do it… incrementally.
And I wouldn’t expect you to work outside the W3C standards development process, but I don’t think that including extra information within the specification (or outside it if that really feels too uncomfortable) about the status of each feature is reforming that process. It’s just doing what you’re currently doing on Slideshare in a more structured and findable way.