Hosting Gridworks Instances

I’ve written previously about how wonderful Freebase Gridworks (shortly to be “Google Refine”) is for cleaning and converting data. Within the UK public sector, however, there are two big barriers to its use:

  1. Public sector workers typically can’t install software on their computers.
  2. They’re also typically stuck with IE7 (or even, if they’re really unlucky, IE6).

On Standards

I’m beginning to think that ‘to recommend’ is an irregular verb like those that appeared every so often in Yes, Minister:

Bernard: It’s one of those irregular verbs, isn’t it: I have an independent mind; you are an eccentric; he is round the twist.

Something like: I recommend, you tell people what to do, he engages in premature standardisation.

Using Freebase Gridworks to Create Linked Data

When we encourage people to put their data on the web as linked data, the biggest question is “How?”. There are so many “How?” questions to answer:

  • how do we choose what URIs to use for things?
  • how do we choose what vocabularies to use?
  • how do we handle changing data?
  • how do we tell people how the data was created?
  • how do we publish it?
  • how will other people know about it?

and, of course:

  • how do we create it?

legislation.gov.uk: Credit Where it's Due

I’m aware I’ve been quiet for the past few months. This isn’t because nothing interesting has been going on — rather the opposite. It’s been difficult to get a chance to sit down and write about the work I’ve been doing, when actually doing the work has been taking up so much time.

Most of my time has been spent on the new legislation.gov.uk website and its underlying API. There’s so much to say about this project that I hardly know where to start, so I’ll just try to do an overview and we can take it from there. Let me know what you’re interested in.

Distributed Publication and Querying

One of the biggest selling points of linked data is that it’s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.

Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people’s blogs or Twitter streams when I mention them.

And with everyone using the same terms to talk about the same things, there’s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we’d never have found otherwise.

What’s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn’t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.
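To make that concrete, here is a rough sketch in plain Python. It is not SPARQL and not a real triplestore: triples are just tuples, the two publishers and all the example.org URIs are invented for illustration, and `match` is a hypothetical helper standing in for a single basic query pattern. It only shows why a query spanning both publishers needs their data gathered in one place first.

```python
# Hypothetical illustration: triples are (subject, predicate, object) tuples.
# Every URI below is invented for the example.

BOOK = "http://example.org/book/1"

# Publisher A, on its own site, knows the book's title.
publisher_a = {
    (BOOK, "http://example.org/vocab/title", "An Example Book"),
}

# Publisher B, on a separate site, knows who wrote it.
publisher_b = {
    (BOOK, "http://example.org/vocab/author", "http://example.org/person/jane"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples in `store` that fit the pattern; None is a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Centralised query: copy everything into one store, then ask the question.
central_store = publisher_a | publisher_b
about_book = match(central_store, subject=BOOK)
```

Asking either publisher on its own yields one triple about the book; only the merged store yields both, and that works only because both publishers used the same URI for it. The cost is that somebody has to build and refresh that central copy.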

This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It’s a bit sketchy, and I’d welcome comments.