My previous post talked about how to install 4store as a triplestore, and use the Ruby library RDF.rb in order to process RDF extracted from that store. This was a response to Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge which asks for documentation of how to install a triplestore, load data into it, retrieve it using SPARQL and access the results as native structures using Ruby, Python or PHP.
I quite enjoyed writing the last one, so I thought I’d try again. As before, I am on Mac OS X, but this time I’m going to use Python, which I have not programmed in before. I like a challenge. You might not like the results!
This time, I’m going to use Sesame, as I was told by John Sheridan that it was so easy to install that even he, a civil servant, could do it!
Sesame needs a Java servlet container. I’m using Tomcat because I have some experience with it, but you could use something like Jetty if you prefer. I had a bit of trouble getting Tomcat 6 to install, but that might just have been because it has a lot of dependencies and I wasn’t patient enough. It might be worth upgrading your existing ports and getting verbose output so you know there’s activity as you install Tomcat:
$ sudo port upgrade outdated
$ sudo port -v install tomcat6
This installs Tomcat 6 in /opt/local/share/java/tomcat6.
While that’s happening, get Sesame from its download page. I got hold of openrdf-sesame-2.3.2-sdk.tar.gz. The files we actually need are the .wars so we can just extract them and put them in the webapps directory within Tomcat:
$ tar -zxvf openrdf-sesame-2.3.2-sdk.tar.gz openrdf-sesame-2.3.2/war/*.war
$ sudo cp openrdf-sesame-2.3.2/war/*.war /opt/local/share/java/tomcat6/webapps/
Then start up Tomcat:
$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl start
All being well, you should see Tomcat doing some initial setup:
conf_setup.sh: file conf/catalina.policy is missing; copying conf/catalina.policy.sample to its place.
conf_setup.sh: file conf/catalina.properties is missing; copying conf/catalina.properties.sample to its place.
conf_setup.sh: file conf/server.xml is missing; copying conf/server.xml.sample to its place.
conf_setup.sh: file conf/tomcat-users.xml is missing; copying conf/tomcat-users.xml.sample to its place.
conf_setup.sh: file conf/web.xml is missing; copying conf/web.xml.sample to its place.
conf_setup.sh: file conf/setenv.local is missing; copying conf/setenv.local.sample to its place.
Starting Tomcat.... started. (pid 20064)
Now have a look at http://localhost:8080/openrdf-sesame. If you’re like me, you’ll get some error messages because the user that Tomcat is running under (www) isn’t able to create or write to a logging directory that it wants to create, in my case /Users/Jeni/Library/Application Support/Aduna/OpenRDF Sesame/logs. This turns out to be partly caused by permissions issues and partly caused by the spaces in the filename. To get around it, create a data directory for Aduna that doesn’t have spaces in the filename, and change its ownership to www. In my case, I chose /opt/local/var/aduna.
$ sudo mkdir -p /opt/local/var/aduna
$ sudo chown www:www /opt/local/var/aduna
Then edit Tomcat’s setenv.local file which in my environment is at /opt/local/share/java/tomcat6/conf and add a line that sets the info.aduna.platform.appdata.basedir to that directory, like this:
export JAVA_OPTS='-Dinfo.aduna.platform.appdata.basedir=/opt/local/var/aduna'
Restart Tomcat:
$ sudo /opt/local/share/java/tomcat6/bin/tomcatctl restart
Then navigate again to http://localhost:8080/openrdf-sesame and you should see the Welcome page:
As you can see, this recommends using the Workbench for managing the repositories, which you can open at http://localhost:8080/openrdf-workbench.
We’ll use this Workbench to create a new repository for our data, which I’ll call reference. Click on New Repository from the left hand navigation and fill in the form. I’m just going to use the default in-memory RDF store because I’m only using a little data; the other options (using MySQL or PostgreSQL stores) would be useful if I were creating something more permanent. See the Sesame User Guide for information about those.
So fill in the form to create a new repository with the id reference and whatever title you like:
Click Next and there will be a couple more options to select; I just used the default for these:
Click Create and you will see a summary of the new repository that you’ve created:
I’m going to use the same data as I did before:
http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf
You can add data to a Sesame repository in a browser through the Workbench by uploading a file, pointing Sesame at a URL or pasting in some RDF that you want to load. There are also Java bindings for adding data to Sesame. But neither approach is any good to us, as we need programmatic access from Python.
So we will use the HTTP method. I want to add some statements to the reference repository in the graph (what Sesame calls “context”) http://source.data.gov.uk/data/reference/organogram-co/2010-06-30, which amounts to an HTTP PUT on the repository’s statements with that context.
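To make the shape of that request concrete, here is a minimal sketch of how the target URL is assembled. The helper name statements_url and its base argument are my own, purely for illustration; the only things taken from Sesame are the /repositories/<id>/statements path and the context parameter, whose value is the graph URI wrapped in angle brackets. The import dance is there so that the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

SESAME = 'http://localhost:8080/openrdf-sesame'


def statements_url(repository, graph, base=SESAME):
    """Return the URL to PUT RDF to in order to load it into one context."""
    # The context is passed as the graph URI wrapped in angle brackets.
    params = {'context': '<' + graph + '>'}
    return '%s/repositories/%s/statements?%s' % (base, repository, urlencode(params))


print(statements_url(
    'reference',
    'http://source.data.gov.uk/data/reference/organogram-co/2010-06-30'))
```

The angle brackets, colons and slashes in the graph URI come out percent-encoded (%3C, %3A, %2F and so on), which is why it is safer to leave the encoding to urlencode than to paste the URI into the query string by hand.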
Now I don’t know much at all about Python, but it looks as though the built-in library urllib2 doesn’t support PUT and there’s a better HTTP library available in httplib2. MacPorts provides httplib2 packages for several versions of Python. The only package for rdflib, which we’ll use later, seems to be for Python 2.6, so we’ll go for py26-httplib2, which will bring in Python 2.6 with it just in case.
$ sudo port install py26-httplib2
After running this, if you want to actually use it you will need to install the python_select port and choose Python 2.6:
$ sudo port install python_select
$ sudo python_select python26
Then open up another Terminal window or tab (because the change won’t have affected your old one) and check what version of Python you’re running:
$ python --version
Python 2.6.6
With the httplib2 library in place, it’s time for a Python script (load-rdf-into-sesame.py) to do the PUTting:
import urllib
import httplib2
repository = 'reference'
graph = 'http://source.data.gov.uk/data/reference/organogram-co/2010-06-30'
filename = '/Users/Jeni/Downloads/index.rdf'
print "Loading %s into %s in Sesame" % (filename, graph)
params = { 'context': '<' + graph + '>' }
endpoint = "http://localhost:8080/openrdf-sesame/repositories/%s/statements?%s" % (repository, urllib.urlencode(params))
data = open(filename, 'r').read()
(response, content) = httplib2.Http().request(endpoint, 'PUT', body=data, headers={ 'content-type': 'application/rdf+xml' })
print "Response %s" % response.status
print content
Run the script from the command line:
$ python load-rdf-into-sesame.py
and you should just get:
Loading /Users/Jeni/Downloads/index.rdf into http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 in Sesame
Response 204
which isn’t particularly helpful (well, the 204 response tells us it worked), but you can then check http://localhost:8080/openrdf-workbench/repositories/reference/contexts and you should see that there is a new context of http://source.data.gov.uk/data/reference/organogram-co/2010-06-30:
Click on the context and it will take you to a list of (some of) the triples in that graph:
One of the nice things about Sesame is that the Workbench provides you with ways of exploring the data that you have loaded. On the left navigation bar there are ways of listing the types of the entities described in the data:
from which you can find instances of that type, for example of org:Organization:
and then the statements about a particular instance, for example DirectGov:
Onto running a query directly. This is done on Sesame in exactly the same way as it was done on 4store in my last walkthrough: by HTTP POSTing a query to the SPARQL endpoint. Sesame’s page for testing queries on the reference repository is at http://localhost:8080/openrdf-workbench/repositories/reference/query and we’ll use the basic one that lists types of things that are described within the data:
SELECT DISTINCT ?type
WHERE {
?thing a ?type .
}
ORDER BY ?type
Paste that into the textarea that’s provided on http://localhost:8080/openrdf-workbench/repositories/reference/query so it looks like:
and you get an HTML page:
That’s nice for humans, but not so good for computers. When we request the results of this query programmatically, we’ll want to make sure that we specifically ask for the query results in either XML or JSON.
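As a rough sketch of what “specifically asking” means, the request differs only in its Accept header. The two media types below are the standard ones for SPARQL SELECT results; the function name select_request and the fmt argument are hypothetical names of my own. The snippet only builds the body and headers and does not send anything, so it runs without Sesame (and on either Python 2 or Python 3).

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Media types for SPARQL SELECT results.
RESULT_TYPES = {
    'xml': 'application/sparql-results+xml',
    'json': 'application/sparql-results+json',
}


def select_request(query, fmt='json'):
    """Return (body, headers) for POSTing a SELECT query to a SPARQL endpoint."""
    headers = {
        'content-type': 'application/x-www-form-urlencoded',
        'accept': RESULT_TYPES[fmt],
    }
    # The query travels as a form-encoded 'query' parameter.
    return urlencode({'query': query}), headers


body, headers = select_request('SELECT DISTINCT ?type WHERE { ?thing a ?type . }', 'xml')
print(headers['accept'])
```

Passing the returned body and headers to an HTTP library's POST call is all that is left to do.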
I went the XML route last time, so let’s mix it up a bit and try processing the JSON results of a SPARQL query this time, as it’s really easy to access using the json module in Python. So, we need to POST the query, ensuring that we set the Accept header to application/sparql-results+json, and then process the results as JSON. Here is find-rdf-types.py:
import urllib
import httplib2
import json
query = 'SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type'
repository = 'reference'
endpoint = "http://localhost:8080/openrdf-sesame/repositories/%s" % (repository)
print "POSTing SPARQL query to %s" % (endpoint)
params = { 'query': query }
headers = {
'content-type': 'application/x-www-form-urlencoded',
'accept': 'application/sparql-results+json'
}
(response, content) = httplib2.Http().request(endpoint, 'POST', urllib.urlencode(params), headers=headers)
print "Response %s" % response.status
results = json.loads(content)
print "\n".join([result['type']['value'] for result in results['results']['bindings']])
Run it:
$ python find-rdf-types.py
and you get:
POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
This is the same set of types as that given through the HTML browse interface. Note that the JSON results themselves look like:
{
"head": {
"vars": [ "type" ]
},
"results": {
"bindings": [
{
"type": { "type": "uri", "value": "http:\/\/purl.org\/linked-data\/cube#DataSet" }
},
{
"type": { "type": "uri", "value": "http:\/\/purl.org\/linked-data\/cube#DataStructureDefinition" }
},
{
"type": { "type": "uri", "value": "http:\/\/purl.org\/linked-data\/cube#Observation" }
},
...
]
}
}
Each of the items within the bindings array contains a set of bindings for the variables in the SPARQL query. This closely matches the XML format.
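If you would rather work with plain Python dictionaries than dig through the nested structure each time, something like the following sketch would do. The function name binding_rows is my own invention, and the sample is a cut-down copy of the results shown above; it assumes you only care about each binding's value, not whether it is a URI or a literal.

```python
import json


def binding_rows(content):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    results = json.loads(content)
    variables = results['head']['vars']
    rows = []
    for binding in results['results']['bindings']:
        # A variable may be unbound in a given solution, hence the .get() calls.
        rows.append(dict((var, binding.get(var, {}).get('value')) for var in variables))
    return rows


# Raw string so that the \/ escapes reach the JSON parser untouched.
sample = r"""{
  "head": { "vars": [ "type" ] },
  "results": { "bindings": [
    { "type": { "type": "uri", "value": "http:\/\/purl.org\/linked-data\/cube#DataSet" } },
    { "type": { "type": "uri", "value": "http:\/\/purl.org\/linked-data\/cube#Observation" } }
  ] }
}"""

for row in binding_rows(sample):
    print(row['type'])
```

Note that the \/ sequences in the raw JSON are just escaped slashes, so the values come back as ordinary URIs.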
Now we get onto the part of this where we look at specific libraries for RDF support in Python. The most popular library is rdflib, which you can install using MacPorts:
$ sudo port install py26-rdflib
The SPARQL query we’ll try this time uses a CONSTRUCT query, which creates RDF, rather than a SELECT query, which as we’ve seen can create either XML or JSON. For example, try the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
?person
a foaf:Person ;
foaf:name ?name ;
?prop ?value .
} WHERE {
?person a foaf:Person ;
foaf:name ?name ;
?prop ?value .
}
This gets all the information in the data about the individuals for whom names have been supplied in the data, as RDF. Again, Sesame will display this as HTML when you try doing it, but you can choose a different format from the drop-down menu at the top of the Query Result display:
When you’re not accessing using a browser, by default Sesame serves up its results in TriG format, which isn’t particularly appropriate for the results of CONSTRUCT queries as we don’t need multiple graphs. We’ll request N-Triples as that’s something rdflib can understand. Sesame 2 uses the content type text/plain for N-Triples, so we can request this type by setting the Accept header:
params = { 'query': query }
headers = {
'content-type': 'application/x-www-form-urlencoded',
'accept': 'text/plain'
}
(response, content) = httplib2.Http().request(endpoint, 'POST', urllib.urlencode(params), headers=headers)
We then need to parse this N-Triples response into an rdflib graph object:
graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format="nt")
We then need to get information out of that graph, which rdflib isn’t particularly good at. So let’s use RDFAlchemy instead. That can be installed using easy_install:
$ sudo easy_install-2.6 rdfalchemy
RDFAlchemy can be used to map RDF graphs onto Python object structures in a fairly straight-forward manner. Basically, you define the namespaces of the vocabularies that you want to use, then some classes for the kinds of things that you have in the data, with properties that map onto properties in the RDF. Then you set the rdfSubject.db to the source of the data (which can be an rdflib Graph as above) and can query on it. Here’s an example:
FOAF = rdflib.Namespace('http://xmlns.com/foaf/0.1/')
RDF = rdflib.Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#')
class Person(rdfalchemy.rdfSubject):
    rdf_type = FOAF.Person
    name = rdfalchemy.rdfSingle(FOAF.name)
    mbox = rdfalchemy.rdfSingle(FOAF.mbox)
rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name='Andrew Stott')
print "Andrew Stott's email address: %s" % stott.mbox.n3()
RDFAlchemy adds both get_by() and filter_by() methods on the descriptor classes that you define, to get a single item that matches a query or a list of items, respectively.
The full script for ‘get-names-and-emails.py’ is:
import urllib
import httplib2
import rdflib
import rdfalchemy
query = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
?person
a foaf:Person ;
foaf:name ?name ;
?prop ?value .
} WHERE {
?person a foaf:Person ;
foaf:name ?name ;
?prop ?value .
}"""
repository = 'reference'
endpoint = "http://localhost:8080/openrdf-sesame/repositories/%s" % repository
print "POSTing SPARQL query to %s" % endpoint
params = { 'query': query }
headers = {
'content-type': 'application/x-www-form-urlencoded',
'accept': 'text/plain'
}
(response, content) = httplib2.Http().request(endpoint, 'POST', urllib.urlencode(params), headers=headers)
print "Response %s" % response.status
graph = rdflib.ConjunctiveGraph()
graph.parse(rdflib.StringInputSource(content), format="nt")
print "Loaded %d triples" % len(graph)
FOAF = rdflib.Namespace('http://xmlns.com/foaf/0.1/')
RDF = rdflib.Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#')
class Person(rdfalchemy.rdfSubject):
    rdf_type = FOAF.Person
    name = rdfalchemy.rdfSingle(FOAF.name)
    mbox = rdfalchemy.rdfSingle(FOAF.mbox)
rdfalchemy.rdfSubject.db = graph
stott = Person.get_by(name='Andrew Stott')
print "Andrew Stott's email address: %s" % stott.mbox.n3()
Run this script with:
$ python get-names-and-emails.py
and you get the result:
No handlers could be found for logger "rdflib.Literal"
POSTing SPARQL query to http://localhost:8080/openrdf-sesame/repositories/reference
Response 200
Loaded 459 triples
Andrew Stott's email address: <mailto:andrew.stott@cabinet-office.gsi.gov.uk>
The first line is apparently a side-effect of rdflib/RDFAlchemy weirdness which you don’t need to worry about. The rest shows that the search was successful; the call to .n3() on the email address is only necessary because it is a resource rather than a literal, and therefore doesn’t get converted to a particularly readable string otherwise.
So there you have it, another walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries, this time using Sesame and Python rather than 4store and Ruby.
This walkthrough took me a fair bit longer to do than the previous one, for several reasons:
Again, I haven’t followed Richard’s steps to the letter; in particular I haven’t used a package to get data out of (or into) Sesame: I’ve just done it through HTTP calls. I did it this way deliberately because I think it’s a really important feature of triplestores that you can query them through a common interface: SPARQL. It means that you can take the Python code here and use it against 4store or another triplestore with only a change to the value of the endpoint variable, and similarly take the Ruby code from my previous walkthrough and use it against Sesame. Your code is not tied to a particular implementation or API; you “only” have to learn SPARQL and you’re away.
If you prefer something a little more tightly bound, however, RDFAlchemy does have some targeted Sesame support, as does RDF.rb for that matter. These can help with the management of the data within the repository as well as querying it.
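To illustrate the point about swapping endpoints, here is a small sketch in which the same query is prepared for two stores and only the URL differs. The Sesame URL is the one used throughout this walkthrough; the 4store URL is an assumption for illustration only, so substitute wherever your own 4store HTTP server is listening. The function name sparql_request is hypothetical, and nothing is actually sent, so the snippet runs without either store (on Python 2 or Python 3).

```python
try:  # Python 2
    from urllib import urlencode
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request


def sparql_request(endpoint, query, accept='application/sparql-results+json'):
    """Build (but do not send) a POST request carrying a SPARQL query."""
    body = urlencode({'query': query}).encode('ascii')
    headers = {
        'Content-type': 'application/x-www-form-urlencoded',
        'Accept': accept,
    }
    return Request(endpoint, data=body, headers=headers)


query = 'SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type'
sesame = sparql_request('http://localhost:8080/openrdf-sesame/repositories/reference', query)
# Assumed 4store endpoint, for illustration only.
fourstore = sparql_request('http://localhost:8000/sparql/', query)

for request in (sesame, fourstore):
    print("%s %s" % (request.get_method(), request.get_full_url()))
```

Passing either request object to urlopen (from urllib2 or urllib.request) would send it; everything after that point, such as decoding the JSON, is the same whichever store answered.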
Another thing that’s worth pointing out is that 4store and Sesame have completely different (HTTP-based) interfaces for getting data into stores, and that rdflib/RDFAlchemy and RDF.rb have completely different ways of loading data into in-memory graphs, querying it, and getting information from the results, quite aside from the obvious language-based differences that you’d expect.
On the SPARQL side, there are some efforts within the W3C to define a uniform HTTP protocol for managing RDF graphs and of course there’s SPARQL 1.1 Update. There are glimmers of hope for a standard RDF API, as I’ve argued for recently, but I gather that this effort will be focused on client-side developers, i.e. that it is really a standard RDF API for JavaScript, which I think is a wasted opportunity: I would have been faster in this task if I’d been able to use familiar methods, and I wouldn’t have been so dependent on the documentation provided by the author of a particular library.
Anyway, hopefully my tramping this path will make it easier for those who follow.
Comments
Re: Getting Started with RDF and SPARQL Using Sesame and Python
Thanks Jeni, great stuff. Just some add-on:
If the only thing one wants to do is to get to a SPARQL endpoint and get "back" to a familiar Python world as soon as possible, another approach is to use the SPARQL Wrapper: http://sparql-wrapper.sourceforge.net/. As its name says, it is just a wrapper around a SPARQL query, to send it through HTTP to wherever the endpoint is, and return the results.
(Yes, I wrote the original version, but the credit for it should really go to Sergio Fernandez, who took it up to sourceforge and maintains it. I was inspired by Lee's similar Javascript package (http://thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.ht...) when I wrote it...)
Re: Getting Started with RDF and SPARQL Using Sesame and Python
Thanks for another useful post Jeni. I tried doing something similar the other day on Windoze, but was having problems installing RDFalchemy (but think I know how to fix that now). Sesame is a good choice as it’s very easy to set up.
I (or you are welcome to :)) might try a similar thing with TDB as that is very easy to set up in a Unix environment. Not so easy in windows unless you use cygwin.
Re: Getting Started with RDF and SPARQL Using Sesame and Python
Nice writeup. I've done a bunch of sesame+python myself. Here are some notes:
If port works, that's great, but some users may want to run tomcat without a full install. I have always had success just expanding the latest tomcat tarball in my home dir, copying in the .wars, and running "bin/catalina.sh". I use catalina.sh instead of startup.sh because startup.sh puts itself in the background.
httplib2 looks reasonable, but I would have picked 'restkit' instead. Its API is slightly more compact because it does the urlencode(params) for you. It can also do http/1.1 pipelining; I have no idea if httplib2 can.
import restkit
repo = restkit.Resource(
    "http://localhost:8080/openrdf-sesame/repositories/%s" % repository)
# (the form-urlencoded content type is automatic)
response = repo.post(query="SELECT ...", headers={"accept":"..."})
print response.body_string()
repo.get("statements") etc
I think this style is more readable than your one-liner:
for result in results['results']['bindings']:
    print result['type']['value']
I've never found rdflib hard to work with for querying, to the point that I've never even been motivated to try RDFAlchemy. I'm sure it's great. For comparison, I believe the raw rdflib version of what you're doing is this (untested):
matches = list(graph.queryd("""
    SELECT DISTINCT ?person ?mbox WHERE {
      ?person a foaf:Person; foaf:name ?name; foaf:mbox ?mbox
    }""",
    initBindings={"name" : Literal("Andrew Stott")},
    initNs=dict(foaf=FOAF)))
if len(matches) != 1:
    raise ValueError("found %s matches for that name" % len(matches))
print "Andrew's email address: %s" % matches[0]['mbox']
# or, if you don't need the rdf:type check or the count checking, you
# can use the low-level graph methods. (next() pulls the first value
# from the iterator returned by subjects())
mbox = graph.value(
    graph.subjects(FOAF.name, Literal("Andrew Stott")).next(),
    FOAF.mbox)
print "Andrew's email address: %s" % mbox
BTW, I'm not sure what you were seeing with the URIRef output format. True, the python repr of that object wants you to know exactly what you're getting, but the str output looks fine to me:
>>> print "repr: %r" % mbox
repr: rdflib.URIRef('mailto:andrew.stott@cabinet-office.gsi.gov.uk')
>>> print "str: %s" % mbox
str: mailto:andrew.stott@cabinet-office.gsi.gov.uk
Finally, I'm the author of http://projects.bigasterisk.com/sparqlhttp/ which is my big bloated lib for having rdflib talking to sparql endpoints (tested only on sesame :). It does things like store the prefixes you want to use up front so you don't have to repeat them per-request, present simpler APIs for add() and remove(), and some other even more experimental stuff from years ago.
Re: Getting Started with RDF and SPARQL Using Sesame and Python
Hi, thanks for that.
One of the stipulations in Richard’s original post was that applications should be installable through a package management system, which is why I’ve tried to do that where possible rather than through the download-and-install method that I usually use and you suggest above.
Thank you for the comparison code for doing the querying through rdflib. From what I could tell from the documentation, the support for SPARQL querying over the in-memory graph was only available in a separate module, which didn’t seem to be installable through MacPorts. I might well have that wrong; I’m not sure I tried. (I also quite like RDFAlchemy’s method of mapping RDF resources to Python objects; I thought that would make it more approachable to people who weren’t used to writing RDF.)
I think the issue I was having with the string output from mbox resource was to do with RDFAlchemy: it was coming back as something like
rdfalchemy.rdfSubject(mailto:andrew.stott@cabinet-office.gsi.gov.uk)
if I remember rightly. It may well be fine direct from rdflib.