<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.jenitennison.com/blog" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>sparql</title>
 <link>http://www.jenitennison.com/blog/taxonomy/term/51</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>Getting Started with RDF and SPARQL Using Sesame and Python</title>
 <link>http://www.jenitennison.com/blog/node/153</link>
 <description>&lt;p&gt;My &lt;a href=&quot;http://www.jenitennison.com/blog/node/152&quot;&gt;previous post&lt;/a&gt; talked about how to install &lt;a href=&quot;http://4store.org/&quot;&gt;4store&lt;/a&gt; as a triplestore, and use the Ruby library &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt; in order to process RDF extracted from that store. This was a response to Richard Pope&amp;#8217;s &lt;a href=&quot;http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/&quot;&gt;Linked Data/RDF/SPARQL Documentation Challenge&lt;/a&gt; which asks for documentation of how to install a triplestore, load data into it, retrieve it using SPARQL and access the results as native structures using Ruby, Python or PHP.&lt;/p&gt;

&lt;p&gt;I quite enjoyed writing the last one, so I thought I&amp;#8217;d try again. As before, I am on Mac OS X, but this time I&amp;#8217;m going to use Python, which I have not programmed in before. I like a challenge. You might not like the results!&lt;/p&gt;

&lt;h2&gt;Sesame&lt;/h2&gt;

&lt;p&gt;This time, I&amp;#8217;m going to use &lt;a href=&quot;http://www.openrdf.org/&quot;&gt;Sesame&lt;/a&gt;, as I was told by &lt;a href=&quot;http://twitter.com/johnlsheridan&quot;&gt;John Sheridan&lt;/a&gt; that it was so easy to install that even he, a civil servant, could do it!&lt;/p&gt;

&lt;p&gt;Sesame needs a Java servlet container. I&amp;#8217;m using &lt;a href=&quot;http://tomcat.apache.org/&quot;&gt;Tomcat&lt;/a&gt; because I have some experience with it, but you could use something like &lt;a href=&quot;http://jetty.codehaus.org/jetty/&quot;&gt;Jetty&lt;/a&gt; if you prefer. I had a bit of trouble getting Tomcat 6 to install, but that might just have been because it has a lot of dependencies and I wasn&amp;#8217;t patient enough. It might be worth upgrading your existing ports and getting verbose output so you know there&amp;#8217;s activity as you install Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port upgrade outdated
$ sudo port -v install tomcat6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This installs Tomcat 6 in &lt;code&gt;/opt/local/share/java/tomcat6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;While that&amp;#8217;s happening, get Sesame from its &lt;a href=&quot;http://sourceforge.net/projects/sesame/files/Sesame%202/&quot;&gt;download page&lt;/a&gt;. I got hold of &lt;code&gt;openrdf-sesame-2.3.2-sdk.tar.gz&lt;/code&gt;. The files we actually need are the &lt;code&gt;.war&lt;/code&gt;s so we can just extract them and put them in the &lt;code&gt;webapps&lt;/code&gt; directory within Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ tar -zxvf openrdf-sesame-2.3.2-sdk.tar.gz openrdf-sesame-2.3.2/war/*.war
$ sudo cp openrdf-sesame-2.3.2/war/*.war /opt/local/share/java/tomcat6/webapps/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then start up Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl start
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All being well, you should see Tomcat doing some initial setup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;conf_setup.sh: file conf/catalina.policy is missing; copying conf/catalina.policy.sample to its place.
conf_setup.sh: file conf/catalina.properties is missing; copying conf/catalina.properties.sample to its place.
conf_setup.sh: file conf/server.xml is missing; copying conf/server.xml.sample to its place.
conf_setup.sh: file conf/tomcat-users.xml is missing; copying conf/tomcat-users.xml.sample to its place.
conf_setup.sh: file conf/web.xml is missing; copying conf/web.xml.sample to its place.
conf_setup.sh: file conf/setenv.local is missing; copying conf/setenv.local.sample to its place.
Starting Tomcat.... started. (pid 20064)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now have a look at &lt;code&gt;http://localhost:8080/openrdf-sesame&lt;/code&gt;. If you&amp;#8217;re like me, you&amp;#8217;ll get some error messages because the user that Tomcat is running under (&lt;code&gt;www&lt;/code&gt;) isn&amp;#8217;t able to create or write to a logging directory that it wants to create, in my case &lt;code&gt;/Users/Jeni/Library/Application Support/Aduna/OpenRDF Sesame/logs&lt;/code&gt;. This turns out to be partly caused by permissions issues and partly caused by the spaces in the filename. To get around it, create a data directory for Aduna that doesn&amp;#8217;t have spaces in the filename, and change its ownership to &lt;code&gt;www&lt;/code&gt;. In my case, I chose &lt;code&gt;/opt/local/var/aduna&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo mkdir -p /opt/local/var/aduna
$ sudo chown www:www /opt/local/var/aduna
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then edit Tomcat&amp;#8217;s &lt;code&gt;setenv.local&lt;/code&gt; file, which in my environment is in &lt;code&gt;/opt/local/share/java/tomcat6/conf&lt;/code&gt;, and add a line that sets the &lt;code&gt;info.aduna.platform.appdata.basedir&lt;/code&gt; system property to that directory, like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;export JAVA_OPTS=&#039;-Dinfo.aduna.platform.appdata.basedir=/opt/local/var/aduna&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Restart Tomcat:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl restart
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then navigate again to &lt;a href=&quot;http://localhost:8080/openrdf-sesame&quot;&gt;http://localhost:8080/openrdf-sesame&lt;/a&gt; and you should see the Welcome page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-welcome.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As you can see, this recommends using the Workbench for managing the repositories. Open that at &lt;a href=&quot;http://localhost:8080/openrdf-workbench&quot;&gt;http://localhost:8080/openrdf-workbench&lt;/a&gt; and you should see its home page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-home.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;We&amp;#8217;ll use this Workbench to create a new repository for our data, which I&amp;#8217;ll call &lt;code&gt;reference&lt;/code&gt;. Click on &lt;code&gt;New Repository&lt;/code&gt; from the left hand navigation and fill in the form. I&amp;#8217;m just going to use the default in-memory RDF store because I&amp;#8217;m only using a little data; the other options (using MySQL or PostgreSQL stores) would be useful if I were creating something more permanent. See &lt;a href=&quot;http://www.openrdf.org/doc/sesame2/users/ch07.html#section-rdbms-store-config&quot;&gt;the Sesame User Guide&lt;/a&gt; for information about those.&lt;/p&gt;

&lt;p&gt;So fill in the form to create a new repository with the id &lt;code&gt;reference&lt;/code&gt; and whatever title you like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Next&lt;/code&gt; and there will be a couple more options to select; I just used the default for these:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository-2.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Create&lt;/code&gt; and you will see a summary of the new repository that you&amp;#8217;ve created:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-new-repository-3.jpg&quot; /&gt;
&lt;/p&gt;

&lt;h2&gt;Loading Data&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;m going to use the same data as I did before:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&quot;&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can add data to a Sesame repository in a browser through the Workbench by uploading a file, pointing Sesame at a URL or pasting in some RDF that you want to load. There are also Java bindings for adding data to Sesame. But neither of those is any good to us, as we need programmatic access from Python.&lt;/p&gt;

&lt;p&gt;So we will use the &lt;a href=&quot;http://www.openrdf.org/doc/sesame2/system/ch08.html#d0e304&quot;&gt;HTTP method&lt;/a&gt;. I want to add some statements to the &lt;code&gt;reference&lt;/code&gt; repository in the graph (what Sesame calls a &amp;#8220;context&amp;#8221;) &lt;code&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&lt;/code&gt;, which amounts to an HTTP PUT on the repository&amp;#8217;s statements with that context.&lt;/p&gt;

&lt;p&gt;Now I don&amp;#8217;t know much at all about Python, but it looks as though the built-in library &lt;code&gt;urllib2&lt;/code&gt; doesn&amp;#8217;t support &lt;code&gt;PUT&lt;/code&gt; out of the box and there&amp;#8217;s a better HTTP library available in &lt;a href=&quot;http://code.google.com/p/httplib2/&quot;&gt;&lt;code&gt;httplib2&lt;/code&gt;&lt;/a&gt;. MacPorts provides separate &lt;code&gt;httplib2&lt;/code&gt; packages for different versions of Python. The only MacPorts package I could find for rdflib, which we&amp;#8217;ll use later, is for Python 2.6, so we&amp;#8217;ll go for &lt;code&gt;py26-httplib2&lt;/code&gt;, which will bring in Python 2.6 with it if you don&amp;#8217;t already have it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install py26-httplib2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After running this, if you want to actually use it you will need to install the &lt;code&gt;python_select&lt;/code&gt; port and choose Python 2.6:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install python_select
$ sudo python_select python26
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then open up another Terminal window or tab (because the change won&amp;#8217;t have affected your old one) and check what version of Python you&amp;#8217;re running:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python --version
Python 2.6.6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the &lt;code&gt;httplib2&lt;/code&gt; library in place, it&amp;#8217;s time for a Python script (&lt;code&gt;load-rdf-into-sesame.py&lt;/code&gt;) to do the PUTting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2

repository = &#039;reference&#039;
graph      = &#039;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&#039;
filename   = &#039;/Users/Jeni/Downloads/index.rdf&#039;

print &quot;Loading %s into %s in Sesame&quot; % (filename, graph)
params = { &#039;context&#039;: &#039;&amp;lt;&#039; + graph + &#039;&amp;gt;&#039; }
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s/statements?%s&quot; % (repository, urllib.urlencode(params))
data = open(filename, &#039;r&#039;).read()
(response, content) = httplib2.Http().request(endpoint, &#039;PUT&#039;, body=data, headers={ &#039;content-type&#039;: &#039;application/rdf+xml&#039; })
print &quot;Response %s&quot; % response.status
print content
&lt;/code&gt;&lt;/pre&gt;
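
&lt;p&gt;The &lt;code&gt;context&lt;/code&gt; parameter in that script is just the graph URI wrapped in angle brackets and then URL-encoded. As a rough sketch of what ends up on the wire (the &lt;code&gt;statements_url&lt;/code&gt; helper is a made-up name, not part of Sesame or &lt;code&gt;httplib2&lt;/code&gt;, and the import fallback is only there so that the snippet also runs on Python 3):&lt;/p&gt;

```python
try:
    from urllib import quote        # Python 2, as used in this post
except ImportError:
    from urllib.parse import quote  # Python 3

def statements_url(base, repository, graph):
    """Build the URL that the RDF is PUT to for one named graph (context)."""
    # %3C and %3E are the URL-encoded angle brackets that go around the
    # context URI; safe="" makes quote() encode ":" and "/" as well.
    context = "%3C" + quote(graph, safe="") + "%3E"
    return "%s/repositories/%s/statements?context=%s" % (base, repository, context)

url = statements_url("http://localhost:8080/openrdf-sesame", "reference",
                     "http://source.data.gov.uk/data/reference/organogram-co/2010-06-30")
print(url)
```

&lt;p&gt;This should give the same URL as the &lt;code&gt;urllib.urlencode()&lt;/code&gt; call in the script; it is only spelled out here to make the encoding visible.&lt;/p&gt;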

&lt;p&gt;Run the script from the command line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python load-rdf-into-sesame.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you should just get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loading /Users/Jeni/Downloads/index.rdf into http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 in Sesame
Response 204
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which isn&amp;#8217;t particularly helpful (well, the &lt;code&gt;204&lt;/code&gt; response tells us it worked), but you can then check &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/contexts&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/contexts&lt;/a&gt; and you should see that there is a new context of &lt;code&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&lt;/code&gt;:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-contexts.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Click on the context and it will take you to a list of (some of) the triples in that graph:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-context.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;One of the nice things about Sesame is that the Workbench provides you with ways of exploring the data that you have loaded. On the left navigation bar there are ways of listing the types of the entities described in the data:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-types.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;from which you can find instances of that type, for example of &lt;code&gt;org:Organization&lt;/code&gt;:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-organization.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and then the statements about a particular instance, for example DirectGov:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-explore-directgov.jpg&quot; /&gt;
&lt;/p&gt;
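
&lt;p&gt;Going back to &lt;code&gt;load-rdf-into-sesame.py&lt;/code&gt; for a moment: the script prints the status and leaves it to you to notice whether it was &lt;code&gt;204&lt;/code&gt;. If you would rather have it fail loudly, something along these lines might do. The &lt;code&gt;check_load&lt;/code&gt; name is hypothetical, and treating every status other than &lt;code&gt;204&lt;/code&gt; as a failure is an assumption based on the response shown above rather than on anything in the Sesame documentation:&lt;/p&gt;

```python
def check_load(status, content):
    """Return True if the PUT was acknowledged, otherwise raise an error."""
    # 204 (No Content) is what the script above got back on success;
    # treating every other status as a failure is an assumption.
    if status != 204:
        raise RuntimeError("Sesame returned %s: %s" % (status, content))
    return True

# In the loading script this would be: check_load(response.status, content)
print(check_load(204, ""))
```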

&lt;h2&gt;Running a Query&lt;/h2&gt;

&lt;p&gt;On to running a query directly. This is done on Sesame in exactly the same way as it was done on 4store in my last walkthrough: by HTTP POSTing a query to the SPARQL endpoint. Sesame&amp;#8217;s page for testing queries on the &lt;code&gt;reference&lt;/code&gt; repository is at &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/query&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/query&lt;/a&gt; and we&amp;#8217;ll use the basic one that lists types of things that are described within the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type 
WHERE { 
  ?thing a ?type .
} 
ORDER BY ?type
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Paste that into the textarea that&amp;#8217;s provided on &lt;a href=&quot;http://localhost:8080/openrdf-workbench/repositories/reference/query&quot;&gt;http://localhost:8080/openrdf-workbench/repositories/reference/query&lt;/a&gt; so it looks like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and you get an HTML page:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query-result.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s nice for humans, but not so good for computers. When we request the results of this query programmatically, we&amp;#8217;ll want to make sure that we specifically ask for the query results in either &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-XMLres/&quot;&gt;XML&lt;/a&gt; or &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-json-res/&quot;&gt;JSON&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I went the XML route last time, so let&amp;#8217;s mix it up a bit and try processing the JSON results of a SPARQL query this time, as it&amp;#8217;s really easy to access using the &lt;code&gt;json&lt;/code&gt; module in Python. So, we need to &lt;code&gt;POST&lt;/code&gt; the query, ensuring that we set the &lt;code&gt;Accept&lt;/code&gt; header to &lt;code&gt;application/sparql-results+json&lt;/code&gt;, and then process the results as JSON. Here is &lt;a href=&quot;/blog/files/find-rdf-types.py&quot;&gt;&lt;code&gt;find-rdf-types.py&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2
import json

query = &#039;SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type&#039;
repository = &#039;reference&#039;
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s&quot; % (repository)

print &quot;POSTing SPARQL query to %s&quot; % (endpoint)
params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;application/sparql-results+json&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)

print &quot;Response %s&quot; % response.status
results = json.loads(content)
print &quot;\n&quot;.join([result[&#039;type&#039;][&#039;value&#039;] for result in results[&#039;results&#039;][&#039;bindings&#039;]])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python find-rdf-types.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is the same set of types as that given through the HTML browse interface. Note that the JSON results themselves look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  &quot;head&quot;: {
    &quot;vars&quot;: [ &quot;type&quot; ]
  }, 
  &quot;results&quot;: {
    &quot;bindings&quot;: [
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#DataSet&quot; }
      }, 
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#DataStructureDefinition&quot; }
      }, 
      {
        &quot;type&quot;: { &quot;type&quot;: &quot;uri&quot;, &quot;value&quot;: &quot;http:\/\/purl.org\/linked-data\/cube#Observation&quot; }
      }, 
      ...
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each of the items within the &lt;code&gt;bindings&lt;/code&gt; array contains a set of bindings for the variables in the SPARQL query. This closely matches the XML format.&lt;/p&gt;
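
&lt;p&gt;If you want something closer to a native Python structure than the raw JSON, one option is to flatten each binding into a plain dictionary keyed by variable name. This is only a sketch: &lt;code&gt;bindings_to_rows&lt;/code&gt; is a name invented for the example, the sample is a cut-down copy of the results above, and it throws away the &lt;code&gt;type&lt;/code&gt; information (URI or literal) that you might want to keep:&lt;/p&gt;

```python
import json

# A cut-down copy of the JSON results shown above, with two bindings.
content = """{
  "head": { "vars": [ "type" ] },
  "results": { "bindings": [
    { "type": { "type": "uri", "value": "http://purl.org/linked-data/cube#DataSet" } },
    { "type": { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Person" } }
  ] }
}"""

def bindings_to_rows(content):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    results = json.loads(content)
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # A variable can be unbound in a given row, so fall back to None.
        rows.append(dict((var, binding.get(var, {}).get("value"))
                         for var in variables))
    return rows

for row in bindings_to_rows(content):
    print(row["type"])
```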

&lt;h2&gt;Processing RDF&lt;/h2&gt;

&lt;p&gt;Now we get onto the part of this where we look at specific libraries for RDF support in Python. The most popular library is &lt;a href=&quot;http://www.rdflib.net/&quot;&gt;rdflib&lt;/a&gt;, which you can install using MacPorts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install py26-rdflib
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The SPARQL query we&amp;#8217;ll try this time uses a CONSTRUCT query, which creates RDF, rather than a SELECT query, which as we&amp;#8217;ve seen can create either XML or JSON. For example, try the query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gets all the information in the data about the individuals for whom names have been supplied in the data, as RDF. Again, Sesame will display this as HTML when you try doing it, but you can choose a different format from the drop-down menu at the top of the Query Result display:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/sesame-workbench-query-result-rdf.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;When you&amp;#8217;re not accessing using a browser, by default Sesame serves up its results in &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/TriG/Spec/&quot;&gt;TriG format&lt;/a&gt;, which isn&amp;#8217;t particularly appropriate for the results of CONSTRUCT queries as we don&amp;#8217;t need multiple graphs. We&amp;#8217;ll request &lt;a href=&quot;http://www.w3.org/TR/rdf-testcases/#ntriples&quot;&gt;N-Triples&lt;/a&gt; as that&amp;#8217;s something rdflib can understand. Sesame 2 uses the content type &lt;code&gt;text/plain&lt;/code&gt; for N-Triples, so we can request this type by setting the &lt;code&gt;Accept&lt;/code&gt; header:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;text/plain&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)
&lt;/code&gt;&lt;/pre&gt;
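
&lt;p&gt;As an aside, since the right &lt;code&gt;Accept&lt;/code&gt; header depends on the kind of query, you could pick it automatically. The sketch below is deliberately naive: it looks for the first &lt;code&gt;SELECT&lt;/code&gt; or &lt;code&gt;CONSTRUCT&lt;/code&gt; keyword and would be fooled by a prefix URI that happened to contain one of those words. &lt;code&gt;ACCEPT_BY_FORM&lt;/code&gt; and &lt;code&gt;accept_for&lt;/code&gt; are names made up for the example, and the content types are the two used in this post:&lt;/p&gt;

```python
import re

# The content types this post asks Sesame for, by kind of query.
ACCEPT_BY_FORM = {
    "SELECT": "application/sparql-results+json",  # variable bindings
    "CONSTRUCT": "text/plain",                    # N-Triples in Sesame 2
}

def accept_for(query):
    """Guess a suitable Accept header from the first query keyword found."""
    match = re.search(r"\b(SELECT|CONSTRUCT)\b", query, re.IGNORECASE)
    if match is None:
        raise ValueError("only SELECT and CONSTRUCT are handled in this sketch")
    return ACCEPT_BY_FORM[match.group(1).upper()]

print(accept_for("SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type"))
```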

&lt;p&gt;We then need to parse this N-Triples response into a &lt;a href=&quot;http://www.rdflib.net/rdflib-2.4.0/html/public/rdflib.Graph.Graph-class.html&quot;&gt;&lt;code&gt;rdflib.Graph&lt;/code&gt;&lt;/a&gt; object:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format=&quot;nt&quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We then need to get information out of that graph, which rdflib isn&amp;#8217;t particularly good at. So let&amp;#8217;s use &lt;a href=&quot;http://www.openvest.com/trac/wiki/RDFAlchemy&quot;&gt;RDFAlchemy&lt;/a&gt; instead. That can be installed using &lt;a href=&quot;http://packages.python.org/distribute/easy_install.html&quot;&gt;easy_install&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo easy_install-2.6 rdfalchemy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;RDFAlchemy can be used to map RDF graphs onto Python object structures in a fairly straightforward manner. Basically, you define the namespaces of the vocabularies that you want to use, then some classes for the kinds of things that you have in the data, with properties that map onto properties in the RDF. Then you set the &lt;code&gt;rdfSubject.db&lt;/code&gt; to the source of the data (which can be an rdflib Graph as above) and can query on it. Here&amp;#8217;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FOAF = rdflib.Namespace(&#039;http://xmlns.com/foaf/0.1/&#039;)
RDF = rdflib.Namespace(&#039;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#039;)

class Person(rdfalchemy.rdfSubject):
  rdf_type = FOAF.Person
  name = rdfalchemy.rdfSingle(FOAF.name)
  mbox = rdfalchemy.rdfSingle(FOAF.mbox)

rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name=&#039;Andrew Stott&#039;)
print &quot;Andrew Stott&#039;s email address: %s&quot; % stott.mbox.n3()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;RDFAlchemy adds both &lt;code&gt;get_by()&lt;/code&gt; and &lt;code&gt;filter_by()&lt;/code&gt; methods on the descriptor classes that you define, to get a single item that matches a query or a list of items, respectively.&lt;/p&gt;

&lt;p&gt;The full script for &lt;a href=&quot;/blog/files/get-names-and-emails.py&quot;&gt;&amp;#8216;get-names-and-emails.py&amp;#8217;&lt;/a&gt; is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import urllib
import httplib2
import rdflib
import rdfalchemy

query = &quot;&quot;&quot;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE {
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}&quot;&quot;&quot;
repository = &#039;reference&#039;
endpoint = &quot;http://localhost:8080/openrdf-sesame/repositories/%s&quot; % repository

print &quot;POSTing SPARQL query to %s&quot; % endpoint
params = { &#039;query&#039;: query }
headers = { 
  &#039;content-type&#039;: &#039;application/x-www-form-urlencoded&#039;, 
  &#039;accept&#039;: &#039;text/plain&#039; 
}
(response, content) = httplib2.Http().request(endpoint, &#039;POST&#039;, urllib.urlencode(params), headers=headers)
print &quot;Response %s&quot; % response.status

graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format=&quot;nt&quot;)

print &quot;Loaded %d triples&quot; % len(graph)

FOAF = rdflib.Namespace(&#039;http://xmlns.com/foaf/0.1/&#039;)
RDF = rdflib.Namespace(&#039;http://www.w3.org/1999/02/22-rdf-syntax-ns#&#039;)

class Person(rdfalchemy.rdfSubject):
  rdf_type = FOAF.Person
  name = rdfalchemy.rdfSingle(FOAF.name)
  mbox = rdfalchemy.rdfSingle(FOAF.mbox)

rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name=&#039;Andrew Stott&#039;)
print &quot;Andrew Stott&#039;s email address: %s&quot; % stott.mbox.n3()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run this script with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python get-names-and-emails.py
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;No handlers could be found for logger &quot;rdflib.Literal&quot;
POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
Loaded 459 triples
Andrew Stott&#039;s email address: &amp;lt;mailto:andrew.stott@cabinet-office.gsi.gov.uk&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line is apparently a &lt;a href=&quot;http://groups.google.com/group/rdfalchemy-dev/browse_thread/thread/44a94ec27c4c0b85&quot;&gt;side-effect of rdflib/RDFAlchemy weirdness&lt;/a&gt; which you don&amp;#8217;t need to worry about. The rest shows that the search was successful; the call to &lt;code&gt;.n3()&lt;/code&gt; on the email address is only necessary because it is a resource rather than a literal, and therefore doesn&amp;#8217;t get converted to a particularly readable string otherwise.&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;So there you have it, another walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries, this time using Sesame and Python rather than 4store and Ruby.&lt;/p&gt;

&lt;p&gt;This walkthrough took me a fair bit longer to do than the previous one, for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I&amp;#8217;ve done almost no previous programming with Python (as you can probably tell), so every little thing took ages to work out &amp;#8212; you know you&amp;#8217;re in trouble when you&amp;#8217;re Googling for string concatenation code! I&amp;#8217;m very happy to accept corrections and improvements, which I&amp;#8217;ll include in the above.&lt;/li&gt;
&lt;li&gt;I spent a lot of time faffing around with different Python versions, opting for the latest and then finding that the libraries that I wanted to use weren&amp;#8217;t available for that version and so on. I eventually ended up with Python 2.6; the code above may or may not work with any other versions.&lt;/li&gt;
&lt;li&gt;Setting up Sesame 2 was pretty frustrating: first Tomcat wouldn&amp;#8217;t work, then Jetty wouldn&amp;#8217;t work, and finally I did get Tomcat working and then had the issue with the log directory, as I described above. Once I&amp;#8217;d changed the data directory things worked very smoothly.&lt;/li&gt;
&lt;li&gt;I thought rdflib was going to be enough to work with RDF in Python, but really it isn&amp;#8217;t (if you want to get data &lt;em&gt;out&lt;/em&gt; as well as put data &lt;em&gt;in&lt;/em&gt;), so I had to find something else.&lt;/li&gt;
&lt;li&gt;The documentation for rdflib and RDFAlchemy isn&amp;#8217;t as comprehensive as the documentation for RDF.rb, especially if you&amp;#8217;re not familiar with Python, so it took me a bit longer to work out how to do things with those particular libraries.&lt;/li&gt;
&lt;li&gt;I took a lot more screenshots!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, I haven&amp;#8217;t followed Richard&amp;#8217;s steps to the letter; in particular I haven&amp;#8217;t used a package to get data out of (or into) Sesame: I&amp;#8217;ve just done it through HTTP calls. I did it this way deliberately because I think it&amp;#8217;s a really important feature of triplestores that you can query them through a common interface: SPARQL. It means that you can take the Python code here and use it against 4store or another triplestore with only a change to the value of the endpoint variable, and similarly take the Ruby code from my previous walkthrough and use it against Sesame. Your code is not tied to a particular implementation or API; you &amp;#8220;only&amp;#8221; have to learn SPARQL and you&amp;#8217;re away.&lt;/p&gt;
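
&lt;p&gt;To make that point concrete, here is a sketch in which the endpoint is the only thing that varies. The &lt;code&gt;select_request&lt;/code&gt; helper is a name invented for the example, and the second URL is a placeholder for wherever another store exposes its SPARQL endpoint, not something taken from this walkthrough. The snippet only builds the requests, so it runs without a triplestore or &lt;code&gt;httplib2&lt;/code&gt; to hand:&lt;/p&gt;

```python
try:
    from urllib import urlencode        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlencode  # Python 3

def select_request(endpoint, query):
    """Return the (url, method, body, headers) for POSTing a SELECT query."""
    headers = {
        "content-type": "application/x-www-form-urlencoded",
        "accept": "application/sparql-results+json",
    }
    return (endpoint, "POST", urlencode({"query": query}), headers)

query = "SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type"
# Only the first URL comes from this post; the second is a placeholder.
for endpoint in ("http://localhost:8080/openrdf-sesame/repositories/reference",
                 "http://localhost:8000/sparql/"):
    url, method, body, headers = select_request(endpoint, query)
    print("%s %s" % (method, url))
    # With httplib2 installed, the request itself would then be sent with:
    # httplib2.Http().request(url, method, body, headers=headers)
```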

&lt;p&gt;If you prefer something a little more tightly bound, however, RDFAlchemy does have some targeted &lt;a href=&quot;http://www.openvest.com/trac/wiki/RDFAlchemy#Sesame&quot;&gt;Sesame support&lt;/a&gt;, as does &lt;a href=&quot;http://rdf.rubyforge.org/sesame/&quot;&gt;RDF.rb&lt;/a&gt; for that matter. These can help with the management of the data within the repository as well as querying it.&lt;/p&gt;

&lt;p&gt;Another thing that&amp;#8217;s worth pointing out is that 4store and Sesame have completely different (HTTP-based) interfaces for getting data into stores, and that rdflib/RDFAlchemy and RDF.rb have completely different ways of loading data into in-memory graphs, querying it, and getting information from the results, quite aside from the obvious language-based differences that you&amp;#8217;d expect.&lt;/p&gt;

&lt;p&gt;On the SPARQL side, there are some efforts within the W3C to define a &lt;a href=&quot;http://www.w3.org/TR/sparql11-http-rdf-update/&quot;&gt;uniform HTTP protocol for managing RDF graphs&lt;/a&gt; and of course there&amp;#8217;s &lt;a href=&quot;http://www.w3.org/TR/sparql11-update/&quot;&gt;SPARQL 1.1 Update&lt;/a&gt;. There are glimmers of hope for a &lt;a href=&quot;http://www.w3.org/QA/2010/12/new_rdf_working_group_rdfjson.html&quot;&gt;standard RDF API&lt;/a&gt;, as &lt;a href=&quot;http://www.jenitennison.com/blog/node/150&quot;&gt;I&amp;#8217;ve argued for recently&lt;/a&gt;, but I gather that this effort will be focused on client-side developers, i.e. that it is really a standard RDF API &lt;em&gt;for JavaScript&lt;/em&gt;, which I think is a wasted opportunity: I would have been faster in this task if I&amp;#8217;d been able to use familiar methods, and I wouldn&amp;#8217;t have been so dependent on the documentation provided by the author of a particular library.&lt;/p&gt;

&lt;p&gt;Anyway, hopefully my tramping this path will make it easier for those who follow.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/153#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/65">python</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/66">rdfalchemy</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/64">rdflib</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/67">sesame</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <enclosure url="http://www.jenitennison.com/blog/files/load-rdf-into-sesame.py.txt" length="615" type="text/plain" />
 <pubDate>Tue, 25 Jan 2011 17:27:24 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">153 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Getting Started with RDF and SPARQL Using 4store and RDF.rb</title>
 <link>http://www.jenitennison.com/blog/node/152</link>
 <description>&lt;p&gt;&lt;strong&gt;Updated&lt;/strong&gt; to include some of &lt;a href=&quot;http://www.jenitennison.com/blog/node/152#comment-10579&quot;&gt;Arto Bendiken&amp;#8217;s recommendations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is a response to Richard Pope&amp;#8217;s &lt;a href=&quot;http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/&quot;&gt;Linked Data/RDF/SPARQL Documentation Challenge&lt;/a&gt;. In it, he asks for documentation of the following steps:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
  &lt;li&gt;Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.&lt;/li&gt;
  &lt;li&gt;Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Programatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Programatically retrieve data from the datastore with SPARQL using using either PHP, Ruby or Python.&lt;/li&gt;
  &lt;li&gt;Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;I&amp;#8217;ve been told so many times how RDF sucks for mainstream developers that it was the main point of my &lt;a href=&quot;http://www.w3.org/2010/11/TPAC/RDF-SW-velocity.pdf&quot;&gt;TPAC talk&lt;/a&gt; late last year. I think that this is a great motivating challenge for improving not only the documentation of how to use RDF stores and libraries but also their general installability and usability for developers.&lt;/p&gt;

&lt;p&gt;Anyway, I thought I&amp;#8217;d try to get as far as I could to see just how bad things really are. I am on Mac OS X, and I&amp;#8217;m going to use Ruby (although I don&amp;#8217;t really know it all that well, so please forgive my mistakes). I&amp;#8217;ll breeze on through as if everything is hunky dory, but there are some caveats at the end.&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;4store&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;m going to use &lt;a href=&quot;http://4store.org&quot;&gt;4store&lt;/a&gt; because it&amp;#8217;s really easy to install on the Mac. If you want to install it on Ubuntu, &lt;a href=&quot;http://blog.dbtune.org/post/2009/08/14/4Store-stuff&quot;&gt;there&amp;#8217;s a package available&lt;/a&gt;. For a Mac, it&amp;#8217;s a matter of going to the &lt;a href=&quot;http://4store.org/download/macosx/&quot;&gt;list of Mac downloads&lt;/a&gt;, downloading the most recent version, opening the &lt;code&gt;.dmg&lt;/code&gt; and installing the 4store application by dragging it into your Applications folder.&lt;/p&gt;

&lt;p&gt;When you run the 4store application you get a command line prompt. To set up and start a triplestore called &amp;#8216;reference&amp;#8217; with a SPARQL endpoint running on port 8000, type the following commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ 4s-backend-setup reference
$ 4s-backend reference
$ 4s-httpd -p 8000 reference
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you then navigate to &lt;a href=&quot;http://localhost:8000/&quot;&gt;http://localhost:8000/&lt;/a&gt; you should see the following:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-homepage.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Don&amp;#8217;t let the title &amp;#8216;Not found&amp;#8217; put you off. The fact you get a response means that it&amp;#8217;s working.&lt;/p&gt;

&lt;h2&gt;Loading Data&lt;/h2&gt;

&lt;p&gt;First, find some data to load. A good place for government RDF data is &lt;a href=&quot;http://source.data.gov.uk/data/&quot;&gt;http://source.data.gov.uk/data/&lt;/a&gt; for example. I downloaded&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&quot;&gt;http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are several ways of &lt;a href=&quot;http://4store.org/trac/wiki/ImportData&quot;&gt;importing data into 4store using the command line&lt;/a&gt;. Yves Raimond has created a &lt;a href=&quot;https://github.com/moustaki/4store-ruby&quot;&gt;Ruby gem&lt;/a&gt; for doing so programmatically. There&amp;#8217;s also &lt;a href=&quot;https://github.com/fumi/rdf-4store&quot;&gt;rdf-4store&lt;/a&gt; from Fumihiro Kato which ties into the &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt; library which I&amp;#8217;ll use later on.&lt;/p&gt;

&lt;p&gt;However, if you use the &lt;a href=&quot;http://4store.org/trac/wiki/SparqlServer&quot;&gt;SPARQL server&lt;/a&gt; then it&amp;#8217;s just an HTTP PUT call, which of course you can do in any language you like (every language has support for making HTTP requests, right?) without the need to install any store-specific packages. Still, since we&amp;#8217;ll be doing a lot of HTTP requests, it&amp;#8217;s useful to have a library that makes them simple. There are &lt;a href=&quot;http://ruby-toolbox.com/categories/http_clients.html&quot;&gt;plenty to choose from for Ruby&lt;/a&gt;. I chose &lt;a href=&quot;https://github.com/archiloque/rest-client&quot;&gt;rest-client&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo gem install rest-client
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With that, I wrote the following little Ruby script called &lt;a href=&quot;/blog/files/load-data-into-4store_0.rb&quot;&gt;&amp;#8216;load-rdf-into-4store.rb&amp;#8217;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;

filename = &#039;/Users/Jeni/Downloads/index.rdf&#039;
graph    = &#039;http://source.data.gov.uk/data/reference/organogram-co/2010-06-30&#039;
endpoint = &#039;http://localhost:8000/data/&#039;

puts &quot;Loading #{filename} into #{graph} in 4store&quot;
response = RestClient.put endpoint + graph, File.read(filename), :content_type =&amp;gt; &#039;application/rdf+xml&#039;
puts &quot;Response #{response.code}: 
#{response.to_str}&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run the script from the command line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby load-rdf-into-4store.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you should get the response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loading /Users/Jeni/Downloads/index.rdf into http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 in 4store
Response 201: 
&amp;lt;!DOCTYPE HTML PUBLIC &quot;-//IETF//DTD HTML 2.0//EN&quot;&amp;gt;
&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;201 imported successfully&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;&amp;lt;h1&amp;gt;201 imported successfully&amp;lt;/h1&amp;gt;
&amp;lt;p&amp;gt;This is a 4store SPARQL server.&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;4store v1.0.5&amp;lt;/p&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then check &lt;a href=&quot;http://localhost:8000/status/size/&quot;&gt;http://localhost:8000/status/size/&lt;/a&gt; and you should see that there are now some triples in the store:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-size.jpg&quot; /&gt;
&lt;/p&gt;
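
&lt;p&gt;If you would rather not install a gem at all, the same PUT can be put together with Ruby&amp;#8217;s standard library. What follows is only a sketch: the graph URI and the body are placeholders made up for illustration, and the request is built but not sent, so it runs without a store.&lt;/p&gt;

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a PUT request that would store RDF/XML in a named
# graph, by appending the graph URI to the /data/ endpoint as the script does.
def build_put_request(endpoint, graph, rdf_xml)
  uri = URI.parse(endpoint + graph)
  request = Net::HTTP::Put.new(uri.request_uri)
  request['Content-Type'] = 'application/rdf+xml'
  request.body = rdf_xml
  [uri, request]
end

endpoint = 'http://localhost:8000/data/'
graph    = 'http://example.org/graph/demo'       # hypothetical graph URI
rdf_xml  = 'contents of index.rdf would go here' # placeholder body

uri, request = build_put_request(endpoint, graph, rdf_xml)
puts "PUT #{request.path} on #{uri.host}:#{uri.port}"

# To send it for real (needs 4store running on port 8000):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

&lt;p&gt;Keeping the building of the request separate from the sending makes it easy to check what will go over the wire before pointing it at a real store.&lt;/p&gt;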

&lt;h2&gt;Running a Query&lt;/h2&gt;

&lt;p&gt;The next step is to query that data using SPARQL. Running SPARQL queries is just a matter of HTTP POSTing a query to the SPARQL endpoint. 4store provides a page that you can use to test out queries at &lt;a href=&quot;http://localhost:8000/test/&quot;&gt;http://localhost:8000/test/&lt;/a&gt; so perhaps we should do that before diving into the Ruby code. An easy query to start with is one that returns a list of the types of things that are described within the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT DISTINCT ?type 
WHERE { 
  ?thing a ?type .
} 
ORDER BY ?type
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Paste that into the textarea that&amp;#8217;s provided on &lt;a href=&quot;http://localhost:8000/test/&quot;&gt;http://localhost:8000/test/&lt;/a&gt; so it looks like:&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/blog/files/4store-test-query.jpg&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;and you get some XML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;
&amp;lt;sparql xmlns=&quot;http://www.w3.org/2005/sparql-results#&quot;&amp;gt;
  &amp;lt;head&amp;gt;
    &amp;lt;variable name=&quot;type&quot;/&amp;gt;
  &amp;lt;/head&amp;gt;
  &amp;lt;results&amp;gt;
    &amp;lt;result&amp;gt;
      &amp;lt;binding name=&quot;type&quot;&amp;gt;&amp;lt;uri&amp;gt;http://purl.org/linked-data/cube#DataSet&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;
    &amp;lt;/result&amp;gt;
    &amp;lt;result&amp;gt;
      &amp;lt;binding name=&quot;type&quot;&amp;gt;&amp;lt;uri&amp;gt;http://purl.org/linked-data/cube#DataStructureDefinition&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;
    &amp;lt;/result&amp;gt;
    ...
  &amp;lt;/results&amp;gt;
&amp;lt;/sparql&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;SELECT queries like this one (which are the most common kind of query to run to simply extract data) return &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-XMLres/&quot;&gt;SPARQL Query Results XML Format&lt;/a&gt; by default, so there&amp;#8217;s no need to get hold of a specialised library for processing the results: you just need something to process XML.&lt;/p&gt;

&lt;p&gt;For Ruby, I&amp;#8217;m choosing &lt;a href=&quot;http://nokogiri.org/&quot;&gt;Nokogiri&lt;/a&gt; as I&amp;#8217;ve heard good things about it. To install:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo port install libxml2 libxslt
$ sudo gem install nokogiri
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So now we just need a script that will run this query, process the results as XML, and do something with them. Call it &lt;a href=&quot;/blog/files/find-rdf-types_0.rb&quot;&gt;&amp;#8216;find-rdf-types.rb&amp;#8217;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;
require &#039;nokogiri&#039;

query = &#039;SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type&#039;
endpoint = &#039;http://localhost:8000/sparql/&#039;

puts &quot;POSTing SPARQL query to #{endpoint}&quot;
response = RestClient.post endpoint, :query =&amp;gt; query
puts &quot;Response #{response.code}&quot;
xml = Nokogiri::XML(response.to_str)

xml.xpath(&#039;//sparql:binding[@name = &quot;type&quot;]/sparql:uri&#039;, &#039;sparql&#039; =&amp;gt; &#039;http://www.w3.org/2005/sparql-results#&#039;).each do |type|
  puts type.content
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby find-rdf-types.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8000/sparql/
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So we can see that the dataset contains information that includes statistical data using the &lt;a href=&quot;http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html&quot;&gt;data cube&lt;/a&gt; vocabulary, provenance information using &lt;a href=&quot;http://code.google.com/p/opmv/&quot;&gt;OPMV (Open Provenance Model Vocabulary)&lt;/a&gt;, some information about organisations using &lt;a href=&quot;http://www.epimorphics.com/public/vocabulary/org.html&quot;&gt;org&lt;/a&gt;, some data.gov.uk-specific vocabulary, and people using &lt;a href=&quot;http://xmlns.com/foaf/spec/&quot;&gt;FOAF&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Processing RDF&lt;/h2&gt;

&lt;p&gt;Sometimes it can be useful to get non-tabular data out of SPARQL. At that point, rather than using SELECT queries, you will want to use a CONSTRUCT query, which creates RDF. For example, try the query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gets, as RDF, all the information in the data about the individuals for whom names have been supplied.&lt;/p&gt;

&lt;p&gt;Although the response is RDF/XML, you definitely &lt;em&gt;do not&lt;/em&gt; want to process it as XML. Instead, you need a proper RDF library. Fortunately, there&amp;#8217;s a good one for Ruby in &lt;a href=&quot;http://rdf.rubyforge.org/&quot;&gt;RDF.rb&lt;/a&gt;. You can install it and a bunch of extra plugins that make it easy to deal with RDF in all its guises using:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo gem install linkeddata
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This lets us pick out an appropriate parser based on the &lt;code&gt;Content-Type&lt;/code&gt; of the response, and load the results of the SPARQL query into an in-memory &lt;a href=&quot;http://rdf.rubyforge.org/RDF/Graph.html&quot;&gt;&lt;code&gt;RDF::Graph&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;response = RestClient.post endpoint, :query =&amp;gt; query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts &quot;Response #{response.code} type #{content_type}&quot;

graph = RDF::Graph.new
graph &amp;lt;&amp;lt; RDF::Reader.for(:content_type =&amp;gt; content_type).new(response.to_str)
&lt;/code&gt;&lt;/pre&gt;
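
&lt;p&gt;The regular expression on the second line is there because a &lt;code&gt;Content-Type&lt;/code&gt; header can carry parameters after the media type, and the reader lookup only wants the media type itself. A small standalone sketch of what it keeps (the header values here are made-up examples, not captured from 4store):&lt;/p&gt;

```ruby
# Keep everything up to the first space or semicolon, which drops
# parameters such as "; charset=utf-8" from a Content-Type header value.
def media_type(header_value)
  header_value[/^[^ ;]+/]
end

puts media_type('application/rdf+xml; charset=utf-8')
puts media_type('text/turtle')
```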

&lt;p&gt;We can perform subsequent queries over that graph, for example just to extract names and email addresses and put them into a dictionary:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;query = RDF::Query.new({
  :person =&amp;gt; {
    RDF.type  =&amp;gt; FOAF.Person,
    FOAF.name =&amp;gt; :name,
    FOAF.mbox =&amp;gt; :email,
  }
})

people = {}
query.execute(graph).each do |person|
  people[person.name.to_s] = person.email.to_s
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;#8217;s worth noting that the constants &lt;code&gt;RDF&lt;/code&gt; and &lt;code&gt;FOAF&lt;/code&gt; are pre-declared by including &lt;code&gt;RDF&lt;/code&gt;, and the values that you get back from a query are RDF values, which can be URIs or have datatypes or languages. In the above code I&amp;#8217;ve converted them into strings for insertion into the Ruby dictionary.&lt;/p&gt;

&lt;p&gt;The full script for &lt;a href=&quot;/blog/files/get-names-and-emails_0.rb&quot;&gt;&amp;#8216;get-names-and-emails.rb&amp;#8217;&lt;/a&gt; is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
require &#039;rubygems&#039;
require &#039;rest_client&#039;
require &#039;linkeddata&#039;

include RDF

query = &quot;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;

CONSTRUCT {
  ?person 
    a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
} WHERE { 
  ?person a foaf:Person ;
    foaf:name ?name ;
    ?prop ?value .
}&quot;
endpoint = &#039;http://localhost:8000/sparql/&#039;

puts &quot;POSTing SPARQL query to #{endpoint}&quot;
response = RestClient.post endpoint, :query =&amp;gt; query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts &quot;Response #{response.code} type #{content_type}&quot;

graph = RDF::Graph.new
graph &amp;lt;&amp;lt; RDF::Reader.for(:content_type =&amp;gt; content_type).new(response.to_str)

puts &quot;\nLoaded #{graph.count} triples\n&quot;

query = RDF::Query.new({
  :person =&amp;gt; {
    RDF.type  =&amp;gt; FOAF.Person,
    FOAF.name =&amp;gt; :name,
    FOAF.mbox =&amp;gt; :email,
  }
})

people = {}
query.execute(graph).each do |person|
  people[person.name.to_s] = person.email.to_s
end
puts &quot;\nCreating directory of #{people.length} people&quot;

stott_email = people[&#039;Andrew Stott&#039;]
puts &quot;\nAndrew Stott&#039;s email address: #{stott_email}&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run this script with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby get-names-and-emails.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and you get the result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;POSTing SPARQL query to http://localhost:8000/sparql/
Response 200 type application/rdf+xml

Loaded 459 triples

Creating directory of 75 people

Andrew Stott&#039;s email address: mailto:andrew.stott@cabinet-office.gsi.gov.uk
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Conclusions and Caveats&lt;/h2&gt;

&lt;p&gt;So there you have it, a walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries.&lt;/p&gt;

&lt;p&gt;Now for some caveats. First, you&amp;#8217;re bound to have noticed that I haven&amp;#8217;t followed Richard&amp;#8217;s steps to the letter.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;4store wasn&amp;#8217;t installed from a package management system. The only packaged triplestore I could locate on &lt;a href=&quot;http://www.macports.org/&quot;&gt;MacPorts&lt;/a&gt; was &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot;&gt;Virtuoso&lt;/a&gt; (which I&amp;#8217;ll come to in a second). I hope that 4store&amp;#8217;s installation is simple enough for this slight deviation from the rules not to matter.&lt;/li&gt;
&lt;li&gt;I didn&amp;#8217;t install a package for specifically talking to 4store in order to load in data, just used HTTP requests. There are &lt;a href=&quot;http://4store.org/trac/wiki/ClientLibraries&quot;&gt;client libraries&lt;/a&gt; for 4store, but I figure that the HTTP requests are easy enough, and the resulting code more portable into other environments, so I prefer not to use them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Second, there are a couple of dead ends that I went down that I haven&amp;#8217;t written up in the above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I did spend some time yesterday evening trying to get &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot;&gt;Virtuoso&lt;/a&gt; set up. I managed to get it installed, but loading data into it seemed to require some magic which I couldn&amp;#8217;t figure out. So I went to bed instead.&lt;/li&gt;
&lt;li&gt;I tried to install and use &lt;a href=&quot;http://rdf.rubyforge.org/raptor/&quot;&gt;rdf-raptor&lt;/a&gt; in order to parse the RDF/XML that naturally comes out of 4store CONSTRUCT queries, but got a &lt;code&gt;Could not open library &#039;libraptor&#039;&lt;/code&gt; error. I couldn&amp;#8217;t find an immediate fix for that, so decided to keep things simple instead and just use plain RDF.rb.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Third, I want to reiterate that there may be better ways of using 4store, rest_client, Nokogiri and RDF.rb, as well as Ruby generally, than those shown above. I don&amp;#8217;t claim to be an expert in any of these technologies. If you have suggestions and corrections, I&amp;#8217;d encourage you to add a comment and I&amp;#8217;ll incorporate them in the text to improve it.&lt;/p&gt;

&lt;p&gt;Finally, some general points, because the strong binding of &amp;#8216;linked data&amp;#8217; and &amp;#8216;SPARQL&amp;#8217; in Richard&amp;#8217;s post bothers me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It&amp;#8217;s not necessary to have a SPARQL endpoint when publishing linked data, nor to run your own triplestore. If you already have a website, you are probably better off generating N-Triples or RDF/XML or Turtle in the same way as you generate HTML or XML or JSON.&lt;/li&gt;
&lt;li&gt;It&amp;#8217;s not necessary to learn SPARQL to access and use linked data: the whole point is that the data in linked data is available through HTTP access in standard (RDF-based) formats, so you can scrape them using a follow-your-nose approach and store the results however you like.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having said the above, if you&amp;#8217;re collecting linked data from multiple sources with unpredictable content and want to query across it, having a local triplestore is very useful.&lt;/p&gt;

&lt;p&gt;I also want to point out that within the &lt;a href=&quot;http://data.gov.uk/linked-data&quot;&gt;linked data we&amp;#8217;ve published on data.gov.uk&lt;/a&gt;, we&amp;#8217;ve made a big effort to make the data available in multiple formats such as JSON, XML and CSV, and through a RESTful, URI-parameter-driven API, precisely to lower the barrier for developers who want to use that information but understandably don&amp;#8217;t want to take the time or make the effort to learn the linked data technologies that underlie the sites. For those that do, the RDF/XML and Turtle are there as well, and the SPARQL queries that are used to create each page are available to look at, tweak, and learn from. Our hope is that the &lt;a href=&quot;http://code.google.com/p/linked-data-api/&quot;&gt;linked data API&lt;/a&gt; that provides access to lists of &lt;a href=&quot;http://education.data.gov.uk/doc/school&quot;&gt;schools&lt;/a&gt;, &lt;a href=&quot;http://reference.data.gov.uk/doc/department&quot;&gt;departments&lt;/a&gt; and &lt;a href=&quot;http://transport.data.gov.uk/doc/station&quot;&gt;railway stations&lt;/a&gt; can make the linked data learning curve a little less steep.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/152#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/61">4store</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/62">rdf.rb</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/63">ruby</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <enclosure url="http://www.jenitennison.com/blog/files/load-rdf-into-4store_0.rb" length="437" type="text/x-ruby-script" />
 <pubDate>Sat, 15 Jan 2011 19:17:57 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">152 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Distributed Publication and Querying</title>
 <link>http://www.jenitennison.com/blog/node/143</link>
 <description>&lt;p&gt;One of the biggest selling points of linked data is that it&amp;#8217;s supposed to facilitate web-scale distributed publication of data. Just as with the human web, anyone can publish data at their local site without having to go through any kind of central authority.&lt;/p&gt;

&lt;p&gt;Just as with the human web, convergence on particular sets of URIs for particular kinds of things can happen in an evolutionary way: in a blog post I might point to Amazon when I want to talk about a particular book, Wikipedia to define the concepts I mention, people&amp;#8217;s blogs or twitter streams when I mention them.&lt;/p&gt;

&lt;p&gt;And with everyone using the same terms to talk about the same things, there&amp;#8217;s the prospect of being able to easily pull together information from completely different sources to find connections and patterns that we&amp;#8217;d never have found otherwise.&lt;/p&gt;

&lt;p&gt;What&amp;#8217;s been very unclear to me is how this distributed publication of data can be married with the use of SPARQL for querying. After all, SPARQL doesn&amp;#8217;t (in its present form) support federated search, so to use SPARQL over all this distributed linked data, it sounds like you really need a central triplestore that contains everything you might want to query.&lt;/p&gt;

&lt;p&gt;This post is an attempt to explore this tension, between distributed publication and centralised query, and to try to find a pattern that we might use within the UK government (and potentially more widely, of course) to publish and expose linked data in a queryable way. It&amp;#8217;s a bit sketchy, and I&amp;#8217;d welcome comments.&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;Publishing Datasets&lt;/h2&gt;

&lt;p&gt;First, let&amp;#8217;s look at the publication of data. We publish data at the moment in all kinds of ways: embedded tables within PDFs, CSV database dumps, Excel spreadsheets, Word documents, XML, JSON, N3 and so on and on. Each of these documents contains a set of information: a dataset.&lt;/p&gt;

&lt;p&gt;Each dataset contains information about a whole load of &lt;em&gt;things&lt;/em&gt;, usually real-world things. This is easy to see when you have datasets that contain lots of things of the same type: a spreadsheet might contain information about lots of different local authorities, a database dump about a bunch of schools. In FOAF terms, we&amp;#8217;d say that the dataset has each of these things as a &lt;em&gt;topic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Even datasets that are really about one &lt;em&gt;thing&lt;/em&gt; (have, in FOAF terms, a &lt;em&gt;primary topic&lt;/em&gt;) contain information about lots of other things. For example, a web page about a hospital might include some level of information about the different departments within the hospital, the strategic health authority that it belongs to, the chief executive and so on. Information that is just about one thing is rarely useful; at the very least, you will want to know the labels of things that it&amp;#8217;s related to.&lt;/p&gt;

&lt;p&gt;If we move to thinking about linked data, each &lt;em&gt;thing&lt;/em&gt; is assigned an HTTP URI. There is then one particular dataset that stands above all the other datasets that contain information about that &lt;em&gt;thing&lt;/em&gt;: the dataset in the document that you get when you resolve its URI. The fact that there is this dataset doesn&amp;#8217;t alter the fact that there are many many other datasets out there that contain information about the &lt;em&gt;thing&lt;/em&gt;. But the dataset that you get at the URI for the thing obviously has a special role.&lt;/p&gt;

&lt;p&gt;These datasets &amp;#8212; the ones you get at the end of a resource&amp;#8217;s URI &amp;#8212; are &lt;em&gt;the&lt;/em&gt; way in which an organisation can exercise control over the use of URIs minted within their domain. The organisation that controls the URI for a &lt;em&gt;thing&lt;/em&gt; determines whether that URI resolves, and what is at the end of the URI. If fifteen different websites all published information about a school consistently using the same URI for that school, anyone could pull that information together into something potentially useful. But if the URI for the school doesn&amp;#8217;t actually resolve, then you would have to wonder whether the school actually exists, or if it&amp;#8217;s just a figment of the imagination of those fifteen websites: a spoof school.&lt;/p&gt;

&lt;p&gt;Also, you&amp;#8217;d expect the information that you find at the end of the URI to be correct and up to date. You&amp;#8217;d expect it to be reasonably complete as well: to return a bunch of information about the school and pointers to more information about the school. This information is likely to come from a bunch of trusted sources: an integrated view over a collection of other datasets.&lt;/p&gt;

&lt;h2&gt;Providing SPARQL Endpoints&lt;/h2&gt;

&lt;p&gt;We&amp;#8217;ve established that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;anyone can publish information about anything they choose, but that people will have different levels of trust in different sources of information&lt;/li&gt;
&lt;li&gt;information about any one &lt;em&gt;thing&lt;/em&gt; is seldom useful on its own; the power of the linked data web is the ability to make connections between things&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on to querying. Linked data can be useful without explicit querying &amp;#8212; you can navigate around related sets of information by following links, and pull together information gleaned from different sites &amp;#8212; but querying of some kind provides much more potential power and, with a &lt;a href=&quot;http://purl.org/linked-data/api/spec&quot;&gt;linked data API&lt;/a&gt;, the opportunity to provide an easy-to-use web-based API for the data.&lt;/p&gt;

&lt;p&gt;SPARQL queries operate over a default graph (or dataset) and a set of supplementary named graphs. For efficiency, these need to be pulled into a single triplestore.&lt;/p&gt;
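
&lt;p&gt;At the protocol level, the graphs that make up the dataset for a query can be named in the request itself, using the &lt;code&gt;default-graph-uri&lt;/code&gt; and &lt;code&gt;named-graph-uri&lt;/code&gt; parameters defined by the SPARQL Protocol. As a rough sketch in Ruby (the graph URIs are hypothetical, and not every store honours these parameters):&lt;/p&gt;

```ruby
require 'uri'

# Encode a SPARQL query plus the graphs it should run over as a
# form-encoded string, suitable for a GET query string or a POST body.
def sparql_query_string(query, default_graphs: [], named_graphs: [])
  params = [['query', query]]
  default_graphs.each { |g| params.push(['default-graph-uri', g]) }
  named_graphs.each   { |g| params.push(['named-graph-uri', g]) }
  URI.encode_www_form(params)
end

query_string = sparql_query_string(
  'SELECT * WHERE { ?s ?p ?o }',
  default_graphs: ['http://example.org/graph/schools'],  # hypothetical
  named_graphs:   ['http://example.org/graph/districts'] # hypothetical
)
puts query_string
```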

&lt;p&gt;And so we have a quandary. To support queries, we need all the data we might want to query to be pulled into a single triplestore. Given that all data is linked, and all links are potentially interesting, the only answer seems to be to have the whole web of data in a single store. And that kind of centralised solution seems impractical, both in terms of the sheer size of store you&amp;#8217;d need and the obvious impact on efficiency of doing so.&lt;/p&gt;

&lt;h2&gt;Curated Triplestores&lt;/h2&gt;

&lt;p&gt;I think the answer (for the moment at least) is to forget about querying the entire web of linked data and focus on supporting the easy creation of targeted, curated, triplestores that each incorporate a useful subset of the linked data that&amp;#8217;s out there. What subset is useful for a given triplestore is a design question that should be informed by the potential users of that particular service. Larger subsets are likely to locate more cross-connections, but have a performance penalty.&lt;/p&gt;

&lt;p&gt;For example, a service that was oriented towards helping local authorities plan their schooling provision might include all the current data about nursery, primary and secondary schools (but not universities or versioned data), information about their administrative district and the district that they appear in (but no extra information about census areas), and those neighbourhood statistics, including historic data, that relate to children and schooling (but not those that relate to care of the elderly, for example).&lt;/p&gt;

&lt;p&gt;Another service might include all historic information about schools and universities and historic information about all associated administrative geography, but not include neighbourhood statistics.&lt;/p&gt;

&lt;h2&gt;Supporting On-Demand Triplestores&lt;/h2&gt;

&lt;p&gt;In the scenario painted above, each triplestore will include different datasets, brought together for a particular purpose. Imagine a huge warehouse full of boxes, each of which is a particular dataset. Each triplestore will fit together a different set of those boxes. What&amp;#8217;s neat about the linked data approach is that the boxes are really easy to bring together: creating a triplestore should just be a matter of selecting which datasets you want to use with little or no hand-crafting of links between them or resolution of naming conflicts.&lt;/p&gt;

&lt;p&gt;The challenge from the side of the data publisher is to enable these triplestores to be both created and kept up to date. A data publisher has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;describe what datasets are available&lt;/li&gt;
&lt;li&gt;describe how these link to other potentially interesting datasets, to give hints about where connections might be made&lt;/li&gt;
&lt;li&gt;provide a mechanism for getting the current state of all the available datasets (which can obviously be through crawling but could alternatively be through a dump or set of dumps)&lt;/li&gt;
&lt;li&gt;provide a mechanism for informing interested parties about new datasets being made available (which could be through routine crawling or through a feed)&lt;/li&gt;
&lt;li&gt;provide a mechanism for informing interested parties about when a dataset changes (which could also be through routine crawling or through a feed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of these problems are solved.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://rdfs.org/ns/void/&quot;&gt;VoiD&lt;/a&gt;&amp;#8217;s purpose in life is to describe datasets and how they link to each other, and it provides a &lt;code&gt;void:dataDump&lt;/code&gt; property that points to a dump of the data. VoiD can describe datasets that are supersets of other datasets, which enables datasets to be grouped together into potentially useful bundles.&lt;/p&gt;

&lt;p&gt;Where information needs to be kept up to date, we can use feeds. We need to keep two things up to date: information about the datasets that a publisher makes available, and information about the content of a particular dataset. This can be achieved through a single Atom feed in which each dataset is recorded as an entry, with an &lt;code&gt;&amp;lt;updated&amp;gt;&lt;/code&gt; element indicating its last update. Datasets that are removed can be indicated through a &lt;a href=&quot;http://tools.ietf.org/html/draft-snell-atompub-tombstones-06&quot;&gt;&lt;code&gt;deleted-entry&lt;/code&gt; element&lt;/a&gt;. There is some ongoing work that suggests how to &lt;a href=&quot;http://groups.google.com/group/dataset-dynamics/web/components-vocabularies-protocols-formats&quot;&gt;augment voiD with a pointer to such a feed&lt;/a&gt;.&lt;/p&gt;
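
&lt;p&gt;As a rough sketch of the polling side of this, here is how a triplestore maintainer might work out which datasets need re-fetching, by comparing the &lt;code&gt;&amp;lt;updated&amp;gt;&lt;/code&gt; timestamps in the feed with the ones recorded on the previous poll. It is written in Python using only the standard library. The feed is built in memory so that the example runs without a network, and the dataset URIs and the helper names (&lt;code&gt;build_feed&lt;/code&gt;, &lt;code&gt;read_state&lt;/code&gt;, &lt;code&gt;changed_datasets&lt;/code&gt;) are invented for illustration. A real feed would carry more metadata, and removed datasets would need separate handling.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def build_feed(entries):
    """Build a tiny Atom feed in memory; entries maps dataset URI to timestamp."""
    feed = ET.Element(ATOM + "feed")
    for dataset_uri, updated in entries.items():
        entry = ET.SubElement(feed, ATOM + "entry")
        ET.SubElement(entry, ATOM + "id").text = dataset_uri
        ET.SubElement(entry, ATOM + "updated").text = updated
    return feed


def read_state(feed):
    """Return {dataset URI: updated timestamp} for every entry in the feed."""
    return {
        entry.findtext(ATOM + "id"): entry.findtext(ATOM + "updated")
        for entry in feed.iter(ATOM + "entry")
    }


def changed_datasets(previous, current):
    """Datasets that are new, or whose timestamp differs from the last poll."""
    return sorted(uri for uri, updated in current.items()
                  if previous.get(uri) != updated)


# Hypothetical dataset URIs and timestamps, purely for illustration.
last_poll = {
    "http://example.org/dataset/schools": "2010-03-01T09:00:00Z",
    "http://example.org/dataset/districts": "2010-02-14T12:00:00Z",
}
feed = build_feed({
    "http://example.org/dataset/schools": "2010-03-20T09:00:00Z",    # altered
    "http://example.org/dataset/districts": "2010-02-14T12:00:00Z",  # unchanged
    "http://example.org/dataset/statistics": "2010-03-21T08:30:00Z", # new
})

to_refresh = changed_datasets(last_poll, read_state(feed))
print(to_refresh)
```

&lt;p&gt;The sketch treats any difference in timestamp as a change rather than checking which one is later, which keeps it simple and also copes with a publisher republishing an older version.&lt;/p&gt;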

&lt;p&gt;As well as pointing to a dataset, and indicating that it has been updated, the Atom feed could contain information about the change itself, represented as a &lt;a href=&quot;http://vocab.org/changeset/schema.html&quot;&gt;changeset&lt;/a&gt;. This could be included as part of the information provided about the new version of the dataset, described in terms of its &lt;a href=&quot;http://www.jenitennison.com/blog/node/142&quot;&gt;provenance&lt;/a&gt;.&lt;/p&gt;
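
&lt;p&gt;To make the idea of a changeset concrete, here is a minimal Python sketch of applying one to a local copy of a dataset. It assumes the additions and removals have already been read out of the changeset into plain tuples; in the changeset vocabulary itself they are described as reified RDF statements. The school URI, the property names and the &lt;code&gt;apply_changeset&lt;/code&gt; helper are all hypothetical.&lt;/p&gt;

```python
def apply_changeset(triples, removals, additions):
    """Return a new set of triples with removals taken out and additions put in."""
    return (set(triples) - set(removals)) | set(additions)


# Hypothetical data: a school whose recorded pupil count changes.
school = "http://example.org/id/school/100000"
local_copy = {
    (school, "label", "Example Primary School"),
    (school, "pupils", "210"),
}
removals = {(school, "pupils", "210")}
additions = {(school, "pupils", "225")}

updated_copy = apply_changeset(local_copy, removals, additions)
print(sorted(updated_copy))
```

&lt;p&gt;Removals are applied before additions, so a statement that appears in both ends up present in the result.&lt;/p&gt;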

&lt;p&gt;Feeds that were provided in this way could be provided using the normal model, whereby any interested triplestores would regularly check the feed for updates, or using &lt;a href=&quot;http://code.google.com/p/pubsubhubbub/&quot;&gt;PubSubHubbub&lt;/a&gt; in order to push notifications to triplestores. The latter would require triplestore providers to support a service that accepted such notifications, of course.&lt;/p&gt;

&lt;p&gt;A triplestore should expose which datasets (and which versions of those datasets) are used within the triplestore. This can be gathered through a SPARQL query to list the available graphs and their metadata, so long as that information is included within the named graphs themselves.&lt;/p&gt;

&lt;h2&gt;What Should We Do?&lt;/h2&gt;

&lt;p&gt;How does all this translate into what guidelines we should put into place for UK government publishers and what tools we should provide centrally?&lt;/p&gt;

&lt;p&gt;First, we need to recognise the responsibility that comes with the ownership of a URI. Within the UK, we are encouraging people to use URIs of the form:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;http://{sector}.data.gov.uk/id/{concept}/{identifier}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;to name things like schools and hospitals, with the recognition that information about those things might come from many different public bodies. &lt;em&gt;Someone&lt;/em&gt; has to be in charge of that domain: they have to determine which URIs within a particular URI set are resolvable, and what information is provided at the end of each URI. These same sector owners should support easy-to-use APIs based around the particular URI sets that they are responsible for.&lt;/p&gt;

&lt;p&gt;The easiest route to supporting the pages, an easy-to-use API, and a SPARQL endpoint for deeper querying is going to be to create a curated triplestore with a &lt;a href=&quot;http://purl.org/linked-data/api/spec&quot;&gt;linked data API&lt;/a&gt; layer over the top. This triplestore will need to be populated with data from multiple datasets, both as separate named graphs (to provide traceability back to the original data) and merged into a default graph that reflects the current state of the world.&lt;/p&gt;
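
&lt;p&gt;The relationship between the named graphs and the default graph can be sketched in a few lines of Python. This is only an illustration of the idea, using the simplest possible merge (a plain union of the triples); a real store would do this internally, and deciding what counts as the current state of the world may take more than a union. The graph URIs, the triples and the &lt;code&gt;sources_of&lt;/code&gt; helper are made up.&lt;/p&gt;

```python
# Each named graph holds the triples from one source dataset (made-up data).
named_graphs = {
    "http://example.org/graph/schools": {
        ("school/1", "label", "Example Primary School"),
        ("school/1", "district", "district/7"),
    },
    "http://example.org/graph/geography": {
        ("district/7", "label", "Example District"),
        ("school/1", "district", "district/7"),
    },
}

# The default graph is the merged view that most queries would run against.
default_graph = set().union(*named_graphs.values())


def sources_of(triple):
    """Trace a triple in the default graph back to the graphs that assert it."""
    return sorted(uri for uri, graph in named_graphs.items() if triple in graph)


print(len(default_graph))
print(sources_of(("school/1", "district", "district/7")))
```

&lt;p&gt;Keeping the named graphs alongside the merged view is what makes it possible to answer the question of who said a particular thing.&lt;/p&gt;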

&lt;p&gt;The precise datasets that are included within the triplestore will depend on the judgement of the sector owners about both the trustworthiness of the available datasets and their utility. For example, it&amp;#8217;s likely that a lot of triplestores will want to include information about administrative geography and perhaps some information about time, simply because everything happens somewhere and sometime.&lt;/p&gt;

&lt;p&gt;Second, we need to make this process really easy, through guidelines and tooling.&lt;/p&gt;

&lt;p&gt;We encourage the data owners (the individual public bodies) to publish, alongside the datasets themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;voiD descriptions of the groups of datasets that they publish&lt;/li&gt;
&lt;li&gt;metadata about the individual datasets that they publish (within each dataset itself)&lt;/li&gt;
&lt;li&gt;Atom feeds that are updated each time datasets are added, removed or altered, preferably including changeset information&lt;/li&gt;
&lt;li&gt;(optionally) dumps of groups of datasets, in N-Quads format&lt;/li&gt;
&lt;li&gt;(optionally) notifications of changes to the Atom feed to a PubSubHubbub hub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data owners should be able to split up the datasets that they provide into different groups based on their knowledge of the domain, with the possibility of individual datasets belonging to more than one group.&lt;/p&gt;

&lt;p&gt;We then create tooling that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enable the sector owners to quickly and easily put together a list of trusted sites from which datasets can be gathered&lt;/li&gt;
&lt;li&gt;collect datasets from these sites, either through N-Quads dumps or through crawling&lt;/li&gt;
&lt;li&gt;merge datasets to create a default current view&lt;/li&gt;
&lt;li&gt;put these datasets into a triplestore&lt;/li&gt;
&lt;li&gt;keep the triplestore up to date, either through polling feeds or by accepting PubSubHubbub notifications to identify changes, applying those changes, and merging data as required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To facilitate PubSubHubbub use, which supports timely updating of triplestores, we&amp;#8217;d need a PubSubHubbub hub. Data owners can inform this hub of updates to their feeds and sector owners can register interest in particular feeds.&lt;/p&gt;

&lt;p&gt;These guidelines and tooling are not just useful for sector owners: they are useful for anyone who wants to pull together linked data published in a distributed way across the web. We should expect and encourage multiple stores offering different combinations of datasets and different levels of service. The ones offered centrally, by sector owners, are certainly not the be-all and end-all &amp;#8212; in fact we should look on them as a basic level of service, to be superseded by the community.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/143#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <pubDate>Mon, 22 Mar 2010 21:26:53 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">143 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>SPARQL &amp; Visualisation Frustrations: Aggregation and Projection</title>
 <link>http://www.jenitennison.com/blog/node/127</link>
 <description>&lt;p&gt;Today, I&amp;#8217;m going to moan about the lack of features in SPARQL that are necessary to do many kinds of data analysis and visualisation. Going from raw data, held in RDF, to data like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;em&gt;average&lt;/em&gt; traffic flow along the M5&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;total&lt;/em&gt; amount claimed by each MP&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;number of&lt;/em&gt; corporate insolvency notices published each day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;cannot be done with SPARQL on its own. These calculations involve &lt;a href=&quot;http://www.w3.org/TR/sparql-features/#Aggregates&quot;&gt;aggregation, grouping&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/TR/sparql-features/#Project_expressions&quot;&gt;projection&lt;/a&gt; which are planned for SPARQL vNext, but not here yet (at least, not in any standard way or in every triplestore).&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the pretty graph to illustrate today&amp;#8217;s rant:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/insolvency-smooth.jpg&quot; alt=&quot;Corporate insolvency notices per day from the London Gazette since 1st May 2008, averaged over 20 days&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;The graph shows the number of notices of certain types placed in the &lt;a href=&quot;http://www.london-gazette.co.uk/&quot;&gt;London Gazette&lt;/a&gt; each day. The notices it summarises are those related to &lt;a href=&quot;http://www.insolvency.gov.uk/compulsoryliquidation/whatiscompulsoryliquidation.htm&quot;&gt;companies being liquidated&lt;/a&gt;, indicated by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.london-gazette.co.uk/issues/recent/10/corp-insolvency-winding-up-court/petitions-companies/start=1&quot;&gt;winding-up petitions&lt;/a&gt; indicating compulsory liquidation&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.london-gazette.co.uk/issues/recent/10/corp-insolvency-winding-up-members/resolution/start=1&quot;&gt;members resolutions for winding-up&lt;/a&gt; indicating members&amp;#8217; voluntary liquidation&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.london-gazette.co.uk/issues/recent/10/corp-insolvency-winding-up-creditors/resolution/start=1&quot;&gt;creditors resolutions for winding-up&lt;/a&gt; indicating creditors&amp;#8217; voluntary liquidation&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.london-gazette.co.uk/issues/recent/10/corp-insolvency-administration/appointments/start=1&quot;&gt;appointment of administrators&lt;/a&gt; indicating companies going into administration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The graph is a version of:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/insolvency-raw.jpg&quot; alt=&quot;Corporate insolvency notices per day from the London Gazette since 1st May 2008&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;with each data point averaged over 20 days. (The raw data spikes every Wednesday, presumably due to notices building up over the weekend and taking two days to appear in the Gazette.) It shows how the number of creditors&amp;#8217; voluntary liquidations (indicating companies that go insolvent and are unable to pay their creditors) doubled from around 30/day in May 2008 to around 60/day in the spring of this year, but seems to be falling again (as far as we can tell; the data is not up to date).&lt;/p&gt;
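
&lt;p&gt;For anyone who wants to reproduce the smoothing, here is a minimal Python sketch of a moving average. It assumes a trailing window (each point is averaged with the points before it), which is only one way of doing it, and it uses made-up counts and a window of 3 rather than 20 to keep the example short. The &lt;code&gt;moving_average&lt;/code&gt; function is illustrative, not the code behind the graph.&lt;/p&gt;

```python
def moving_average(values, window):
    """Average each point with up to (window - 1) points before it."""
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Made-up daily counts with a weekly spike, purely for illustration.
daily_counts = [10, 20, 60, 20, 10, 0, 0, 10, 20, 60]
smoothed = moving_average(daily_counts, 3)
print(smoothed)
```

&lt;p&gt;The first few points are averaged over fewer values than the window, so the start of the smoothed line is less reliable than the rest.&lt;/p&gt;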

&lt;p&gt;This data is brought to you by the RDFa embedded by &lt;a href=&quot;http://www.tso.co.uk/&quot;&gt;TSO&lt;/a&gt; in the notices on the London Gazette website and the scraping of said data into the &lt;a href=&quot;http://api.talis.com/stores/datagovuk&quot;&gt;datagovuk datastore&lt;/a&gt; held on the &lt;a href=&quot;http://www.talis.com/platform/&quot;&gt;Talis platform&lt;/a&gt;, for both of which we have &lt;a href=&quot;http://www.opsi.gov.uk/&quot;&gt;OPSI&lt;/a&gt; to thank.&lt;/p&gt;

&lt;p&gt;The visualisation is brought to you by a touch of experimental &amp;#8220;AJAR&amp;#8221; in &lt;a href=&quot;http://code.google.com/p/rdfquery&quot;&gt;rdfQuery&lt;/a&gt; and the graphing power of &lt;a href=&quot;http://code.google.com/p/flot/&quot;&gt;Flot&lt;/a&gt;. Here are the lengths I have to go to to get the pretty graph:&lt;/p&gt;

&lt;p&gt;First, I use rdfQuery to request a list of London Gazette issues since 1st May 2008. The SPARQL for the request is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX corp-insolvency: &amp;lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#&amp;gt;
PREFIX g: &amp;lt;http://www.gazettes-online.co.uk/ontology#&amp;gt;
PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

CONSTRUCT {
  ?issue a g:Issue .
  ?issue g:hasPublicationDate ?date .
}
WHERE {
  ?issue a g:Issue .
  ?issue g:hasPublicationDate ?date .
  FILTER ( ?date &amp;gt; &quot;2008-05-01&quot;^^xsd:date ) .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a &lt;code&gt;CONSTRUCT&lt;/code&gt; request because the resulting RDF/XML can be loaded into rdfQuery for querying. I &lt;em&gt;could&lt;/em&gt; do a &lt;code&gt;SELECT&lt;/code&gt; query and request JSON as the output format, but I&amp;#8217;m doing a kind of end-to-end RDF thing here. So I use rdfQuery to make the request, load the result into an rdfQuery object, query it, and iterate over the results.&lt;/p&gt;

&lt;p&gt;For each of the returned issues (all 293 of them), I make a &lt;em&gt;separate&lt;/em&gt; request for all the relevant notices within that issue. The SPARQL looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX corp-insolvency: &amp;lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#&amp;gt;
PREFIX g: &amp;lt;http://www.gazettes-online.co.uk/ontology#&amp;gt;

CONSTRUCT {
  ?notice a ?type
}
WHERE {
  ?notice g:isInIssue $issue .
  { ?notice a corp-insolvency:MembersResolutionsForWindingUpNotice } UNION
  { ?notice a corp-insolvency:CreditorsResolutionsForWindingUpNotice } UNION
  { ?notice a corp-insolvency:AppointmentOfAdministratorNotice } UNION
  { ?notice a corp-insolvency:PetitionsToWindUpCompaniesNotice } .
  ?notice a ?type .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once I&amp;#8217;ve got the RDF for those notices, I can use rdfQuery to select just those of a particular type, then count how many there are and use the result to plot the graph.&lt;/p&gt;
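
&lt;p&gt;The counting step amounts to a group-and-count on the client. Here is a small sketch of that logic in Python (not the rdfQuery code used for the graph), assuming the results of the per-issue requests have been flattened into pairs of issue and notice type. The issue identifiers are made up and &lt;code&gt;series_for&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from collections import Counter

# (issue, notice type) pairs standing in for the triples returned by the
# per-issue CONSTRUCT requests; the issue identifiers are made up.
notices = [
    ("issue/1001", "MembersResolutionsForWindingUpNotice"),
    ("issue/1001", "CreditorsResolutionsForWindingUpNotice"),
    ("issue/1001", "CreditorsResolutionsForWindingUpNotice"),
    ("issue/1002", "CreditorsResolutionsForWindingUpNotice"),
    ("issue/1002", "AppointmentOfAdministratorNotice"),
]

# Group by issue and type, counting the notices in each group.
counts = Counter(notices)


def series_for(notice_type):
    """Points for one line on the graph: (issue, count) pairs in issue order."""
    issues = sorted({issue for issue, _ in notices})
    return [(issue, counts[(issue, notice_type)]) for issue in issues]


print(series_for("CreditorsResolutionsForWindingUpNotice"))
```

&lt;p&gt;Issues with no notices of a given type still get a point, with a count of zero, so each line on the graph covers the same set of issues.&lt;/p&gt;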

&lt;p&gt;Creating the graph involves 294 requests to the Talis store via the proxy that I&amp;#8217;m using to get around the browser&amp;#8217;s restrictions on cross-site requests, each of which takes (in my experience) between 200ms and 4s. So it&amp;#8217;s pretty server-intensive for both the Talis servers and my proxy server (which is why I&amp;#8217;m not actually going to make the page available generally). It&amp;#8217;s also slow.&lt;/p&gt;

&lt;p&gt;What I &lt;em&gt;want&lt;/em&gt; to do is to be able to make four SPARQL requests that return RDF that summarises the number of notices of each of the different types on each date (or in each issue). I &lt;em&gt;want&lt;/em&gt; to write SPARQL queries that look something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX corp-insolvency: &amp;lt;http://www.gazettes-online.co.uk/ontology/corp-insolvency#&amp;gt;
PREFIX g: &amp;lt;http://www.gazettes-online.co.uk/ontology#&amp;gt;

CONSTRUCT {
  ?issue a g:Issue .
  ?issue g:hasPublicationDate ?date .
  ?issue corp-insolvency:membersResolutionsForWindingUpNotices COUNT(?notice) .
}
WHERE {
  ?issue a g:Issue .
  ?issue g:hasPublicationDate ?date .
  ?notice g:isInIssue ?issue .
  ?notice a corp-insolvency:MembersResolutionsForWindingUpNotice .
}
GROUP BY ?issue
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Four requests would be &lt;em&gt;so&lt;/em&gt; much better than 294.&lt;/p&gt;

&lt;p&gt;The thing of it is that this kind of facility is available as standard in SQL, the &lt;a href=&quot;http://code.google.com/apis/visualization/documentation/querylanguage.html#Group_By&quot;&gt;Google Visualisation API&amp;#8217;s simple query language&lt;/a&gt;, or in the &amp;#8220;reduce&amp;#8221; part of &lt;a href=&quot;http://en.wikipedia.org/wiki/MapReduce&quot;&gt;map/reduce&lt;/a&gt;. If we&amp;#8217;re to think of triplestores as a serious alternative to either relational or non-relational databases, and SPARQL as a serious alternative to either SQL or &lt;a href=&quot;http://en.wikipedia.org/wiki/Nosql&quot;&gt;NoSQL&lt;/a&gt;, then it really must support these operations. And Real Soon.&lt;/p&gt;

&lt;p&gt;In the meantime, I think the lesson for the publishers of linked data is to provide aggregated values for the obvious kinds of aggregations that people might want to do over your data. In the London Gazette data, that would be the counts of the various kinds of notices it contains. In the traffic flow data it would be the average, minimum and maximum traffic flow over each of the measured days, at each hour over the known dates and overall for each point.&lt;/p&gt;

&lt;p&gt;On a more philosophical note, it strikes me that the concept of aggregation contradicts the Open World assumption. I can only know that the number of members&amp;#8217; winding-up order notices was exactly 30 if I know that I know of &lt;em&gt;all&lt;/em&gt; the members&amp;#8217; winding-up order notices that exist. Pragmatically, in many cases this is going to be just fine, because we know that the datasets that we&amp;#8217;re using are complete (our World is Closed), but it does slightly concern me that it&amp;#8217;s impossible to do much useful data analysis without contradicting one of the fundamental tenets of the Semantic Web.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/127#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/49">visualisation</category>
 <pubDate>Sat, 12 Sep 2009 22:42:45 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">127 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Semantic Technologies at the XML Summer School</title>
 <link>http://www.jenitennison.com/blog/node/122</link>
 <description>&lt;p&gt;I &lt;a href=&quot;http://www.jenitennison.com/blog/node/107&quot;&gt;posted before&lt;/a&gt; about the joys of the &lt;a href=&quot;http://www.xmlsummerschool.com/&quot;&gt;XML Summer School&lt;/a&gt;: the learning, the punting, the drinking! Now the &lt;a href=&quot;http://xmlsummerschool.com/curriculum2009/semantic-technologies/&quot;&gt;Semantic Technologies&lt;/a&gt; track has been fleshed out to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bob DuCharme giving an overview on the semantic web&lt;/li&gt;
&lt;li&gt;Leigh Dodds talking about publishing linked data&lt;/li&gt;
&lt;li&gt;Andy Seaborne talking about SPARQL&lt;/li&gt;
&lt;li&gt;Duncan Hall talking about creating ontologies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It&amp;#8217;s not really XML, I suppose, but it&amp;#8217;s certainly a bunch of interesting and timely topics. I particularly hope that we&amp;#8217;ll get some public sector people in the room so that we can discuss some of the challenges and opportunities in that area.&lt;/p&gt;

&lt;!--break--&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/122#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/45">xmlsummerschool09</category>
 <pubDate>Wed, 05 Aug 2009 20:21:53 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">122 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>SPARQL &amp; Visualisation Frustrations: Linked Data</title>
 <link>http://www.jenitennison.com/blog/node/121</link>
 <description>&lt;p&gt;I&amp;#8217;ll start with the problem. To create the graphs I showed in &lt;a href=&quot;http://www.jenitennison.com/blog/node/120&quot;&gt;my last post&lt;/a&gt;, I wanted to split MPs into groups based on their party affiliation. Ideally, I wanted the Google Visualisation query to look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select mp, additionalCosts, totalTravel, totalBasic 
where party = &#039;Conservative&#039; 
order by totalClaim desc 
limit 25
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;because this is reasonably easy to understand and for a developer to create without having to know any magic URIs.&lt;/p&gt;

&lt;p&gt;The party affiliation for an MP is given in the RDF supplied within the &lt;a href=&quot;http://guardian.dataincubator.org/&quot;&gt;Talis store&lt;/a&gt; as a pointer to one of the resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http://dbpedia.org/resource/Labour_Party_(UK)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http://dbpedia.org/resource/Conservative_Party_(UK)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http://dbpedia.org/resource/Liberal_Democrats&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, if you visit &lt;a href=&quot;http://dbpedia.org/resource/Conservative_Party_(UK)&quot;&gt;http://dbpedia.org/resource/Conservative_Party_(UK)&lt;/a&gt; then you&amp;#8217;ll see precious few properties and none of them give you access to the string &amp;#8216;Conservative&amp;#8217;. If you look at &lt;a href=&quot;http://dbpedia.org/resource/Liberal_Democrats&quot;&gt;http://dbpedia.org/resource/Liberal_Democrats&lt;/a&gt;, you&amp;#8217;ll see plenty of properties, one of which is &lt;code&gt;dbpprop:partyName&lt;/code&gt;. But trying to query on &lt;code&gt;dbpprop:partyName&lt;/code&gt; within the Talis data store gives me nothing, because that information hasn&amp;#8217;t been imported into the particular store that this SPARQL query is running on.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;What I did in &lt;code&gt;utils.php&lt;/code&gt; was extend the parsing of the &lt;code&gt;tq&lt;/code&gt; parameter, which is supposed to be in the Google Visualisation query language, to understand &lt;code&gt;&amp;lt;URI&amp;gt;&lt;/code&gt; as a reference to a resource. In other words, you can create a query like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select mp, additionalCosts, totalTravel, totalBasic 
where rParty = &amp;lt;http://dbpedia.org/resource/Conservative_Party_(UK)&amp;gt; 
order by totalClaim desc 
limit 25
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and this will be mapped to a SPARQL query that looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT ?mp ?additionalCosts ?totalTravel ?totalBasic 
WHERE {
  ...
  FILTER (?rParty = &amp;lt;http://dbpedia.org/resource/Conservative_Party_(UK)&amp;gt;)
}
ORDER BY desc(?totalClaim)
LIMIT 25
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I don&amp;#8217;t like having done this, because I don&amp;#8217;t want Data Sources that happen to be SPARQL queries to look any different from other Data Sources. Introducing a new syntax for URI literals isn&amp;#8217;t really on.&lt;/p&gt;

&lt;p&gt;The superficial fix is to &lt;strong&gt;always provide basic labelling information for the resources referenced within a triplestore&lt;/strong&gt;. In this case, Leigh actually did include an &lt;code&gt;rdfs:label&lt;/code&gt; property for each of the party URIs within the Guardian store, so it was possible to use the query I wanted to use after all (though it took some experimentation to find this out).&lt;/p&gt;

&lt;p&gt;But underlying this is a bigger issue. Much is made of linked data &amp;#8212; that you can find out more about a particular thing by resolving the link to that thing &amp;#8212; but the best illustrations of the power and benefits of the semantic web tend to revolve around analysis and visualisations of moderately large amounts of data using SPARQL. And SPARQL (as yet) only runs on individual triplestores, which do not contain the entire semantic web. Every SPARQL query is limited by what has been loaded into the particular triplestore that is queried.&lt;/p&gt;

&lt;p&gt;Now, one of the &amp;#8220;time-permitting&amp;#8221; requirements for SPARQL 1.1 is &lt;a href=&quot;http://www.w3.org/TR/sparql-features/#Basic_federated_query&quot;&gt;Federated Queries&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Federated query is the ability to take a query and provide solutions based on information from many different sources. It is a hard problem in its most general form and is the subject of continuing (and continuous) research. A building block is the ability to have one query be able to issue a query on another SPARQL endpoint during query execution.&lt;/p&gt;
  
  &lt;p&gt;Time-permitting, the SPARQL Working Group will define the syntax and semantics for handling a basic class of federated queries in which the SPARQL endpoints to use in executing portions of the query are explicitly given by the query author.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That&amp;#8217;s certainly &amp;#8220;a building block&amp;#8221;, but it can&amp;#8217;t be the only method. For many data publishers, it&amp;#8217;s going to be far far simpler to publish their data as linked data in RDF/XML than it is to provide a SPARQL endpoint for that data. We can ask organisations like &lt;a href=&quot;http://www.talis.com/platform&quot;&gt;Talis&lt;/a&gt; to crawl our data and provide a SPARQL endpoint for it, and hope that the SPARQL Working Group have time to address federated search, but really we need tools that make it easy to aggregate, analyse and visualise linked data directly rather than through a triplestore silo.&lt;/p&gt;

&lt;p&gt;So how about it?&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/121#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/47">Talis</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/49">visualisation</category>
 <pubDate>Mon, 03 Aug 2009 20:36:34 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">121 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>SPARQL &amp; Visualisation Frustrations: RDF Datatyping</title>
 <link>http://www.jenitennison.com/blog/node/120</link>
 <description>&lt;p&gt;My &lt;a href=&quot;http://www.jenitennison.com/blog/node/119&quot;&gt;last post&lt;/a&gt; showed a visualisation of the &lt;a href=&quot;http://mps-expenses.guardian.co.uk/&quot;&gt;Guardian&amp;#8217;s MP&amp;#8217;s Expenses data&lt;/a&gt;, ported into a &lt;a href=&quot;http://guardian.dataincubator.org/&quot;&gt;Talis triplestore&lt;/a&gt;. Here&amp;#8217;s a screenshot of &lt;a href=&quot;/visualisation/mp-expenses.html&quot;&gt;another one&lt;/a&gt; (follow the link for the interactive version). The files that are used to create it are attached to this post.&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Graphs of highest 25 expense claims in each party&quot; src=&quot;/blog/files/mps-expenses.jpg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There are several things that are frustrating about creating these visualisations, which I want to discuss because I think they lead to some lessons about what data publishers and members of the semantic web community should do to make these things easy. The first thing I want to talk about is datatyping.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;In RDF, literal values can be plain literals, in which case they may have an associated language; XML literals, in which case they have structure; or typed literals, which have a particular datatype, usually one of the ones &lt;a href=&quot;http://www.w3.org/TR/xmlschema-2/&quot;&gt;defined by XML Schema&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The easiest kinds of literals to create, especially in RDF/XML, are plain literals. Indeed &lt;a href=&quot;http://www.jenitennison.com/blog/node/103&quot;&gt;some formats&lt;/a&gt; don&amp;#8217;t even support the creation of typed literals. So RDF often contains values that are &lt;em&gt;actually&lt;/em&gt; numbers or dates, but that are plain literals rather than being typed with an appropriate datatype.&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;http://guardian.dataincubator.org/person/charles-kennedy&quot;&gt;RDF for the MP&amp;#8217;s expenses data&lt;/a&gt;, many of the figures are typed as &lt;code&gt;xsd:int&lt;/code&gt; but some (such as salary and total claim) are untyped. Which means that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sorting on them within the SPARQL query is done alphabetically rather than numerically&lt;/li&gt;
&lt;li&gt;automated conversions into, say, JSON, will usually convert them into strings rather than numbers, or have to take a stab in the dark and assume that they are numeric based on their format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I created the visualisation shown above, for example, I did a sort on the total-claim property to get the top 25 claimants, but that wasn&amp;#8217;t what I actually got because I wasn&amp;#8217;t sorting on a number.&lt;/p&gt;
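
&lt;p&gt;The effect is easy to reproduce outside SPARQL. This short Python sketch sorts the same made-up totals twice, once as the strings that untyped literals effectively are and once as numbers. The figures are invented for illustration.&lt;/p&gt;

```python
# Untyped literals reach the client as strings; these totals are made up.
total_claims = ["9500", "120000", "87000", "23000", "150000"]

# Sorting the strings compares them character by character.
as_strings = sorted(total_claims, reverse=True)

# Sorting on the numeric value gives the order that was actually wanted.
as_numbers = sorted(total_claims, key=int, reverse=True)

print(as_strings)
print(as_numbers)
```

&lt;p&gt;In the string ordering the smallest claim comes out on top, simply because it starts with a 9.&lt;/p&gt;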

&lt;p&gt;Now the question of whether an element&amp;#8217;s value intrinsically has a particular type or is merely given a type for the purposes of processing is something that has caused religious wars within the XML community. And in those wars I have always come down firmly on the side of typing being a matter of interpretation.&lt;/p&gt;

&lt;p&gt;But with RDF I think it&amp;#8217;s different, for two reasons:&lt;/p&gt;

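&lt;p&gt;First, the main mechanism that we have for processing RDF, SPARQL, treats a plain literal as a string unless the query says otherwise. It does provide constructor functions such as &lt;code&gt;xsd:integer()&lt;/code&gt; that can cast a plain literal inside a &lt;code&gt;FILTER&lt;/code&gt; or &lt;code&gt;ORDER BY&lt;/code&gt; expression, but using them means that the author of every query has to know in advance which values are really numbers, and a value that does not parse as a number produces an error rather than a result. So sorting numerically based on a plain literal is only possible with extra work in each and every query. This could be viewed as a deficiency of SPARQL which might be addressed in a &lt;a href=&quot;http://www.w3.org/TR/sparql-features/&quot;&gt;future version&lt;/a&gt;.&lt;/p&gt;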

&lt;p&gt;Second, one of the much-cited advantages of RDF is that it is &lt;em&gt;self-describing&lt;/em&gt;. You can make requests to the URIs used for properties and classes to find out more information about them. But self-describing should apply to literal values too. If a value is a date, it should be labelled as a date; if it&amp;#8217;s a number it should be labelled as a number.&lt;/p&gt;

&lt;p&gt;So how about these as guidelines for creating RDF that would make processing RDF easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if the literal is XML, it should be an XML literal (obviously)&lt;/li&gt;
&lt;li&gt;if the literal is in a particular language (such as a description or a name), it should be a plain literal with that language&lt;/li&gt;
&lt;li&gt;otherwise it should be given an appropriate datatype&lt;/li&gt;
&lt;/ul&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/120#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/31">rdf</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/51">sparql</category>
 <enclosure url="http://www.jenitennison.com/blog/files/mp-expenses.html" length="3075" type="text/html" />
 <pubDate>Sun, 02 Aug 2009 20:00:22 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">120 at http://www.jenitennison.com/blog</guid>
</item>
</channel>
</rss>

