Updated to include some of Arto Bendiken’s recommendations.
This post is a response to Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge. In it, he asks for documentation of the following steps:
- Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.
- Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.
- Programmatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.
- Programmatically retrieve data from the datastore with SPARQL using either PHP, Ruby or Python.
- Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).
I’ve been told so many times how RDF sucks for mainstream developers that it was the main point of my TPAC talk late last year. I think that this is a great motivating challenge for improving not only the documentation of how to use RDF stores and libraries but also their general installability and usability for developers.
Anyway, I thought I’d try to get as far as I could to see just how bad things really are. I am on Mac OS X, and I’m going to use Ruby (although I don’t really know it all that well, so please forgive my mistakes). I’ll breeze on through as if everything is hunky dory, but there are some caveats at the end.
I’m going to use 4store because it’s really easy to install on the Mac. If you want to install it on Ubuntu, there’s a package available. For a Mac, it’s a matter of going to the list of Mac downloads, downloading the most recent version, opening the .dmg and installing the 4store application by dragging it into your Applications folder.
When you run the 4store application you get a command line prompt. To set up and start a triplestore called ‘reference’ with a SPARQL endpoint running on port 8000, type the following commands:
$ 4s-backend-setup reference
$ 4s-backend reference
$ 4s-httpd -p 8000 reference
If you then navigate to http://localhost:8000/ you should see the following:
Don’t let the title ‘Not found’ put you off. The fact you get a response means that it’s working.
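The same check can be made from Ruby. This is a minimal sketch using only the standard library; the method name endpoint_alive? and the timeout values are my own choices for illustration. It treats any HTTP response, including a 404, as a sign that the server is up.

```ruby
require 'net/http'
require 'uri'

# Returns true if anything answers HTTP at the given URL, whatever the
# status code, and false if no response could be obtained.
def endpoint_alive?(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = 2 # seconds to wait for the connection (arbitrary choice)
  http.read_timeout = 5 # seconds to wait for the response (arbitrary choice)
  response = http.get(uri.request_uri)
  puts "#{url} answered with HTTP #{response.code}"
  true
rescue StandardError => e
  puts "#{url} is not responding (#{e.class})"
  false
end

endpoint_alive?('http://localhost:8000/')
```

If this reports that the server is not responding, the first thing to check is that the 4s-httpd command above is still running.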
First, find some data to load. A good place for government RDF data is, for example, http://source.data.gov.uk/data/. I downloaded:
http://source.data.gov.uk/data/reference/organogram-co/2010-10-31/index.rdf
There are several ways of importing data into 4store using the command line. Yves Raimond has created a Ruby gem for doing so programmatically. There’s also rdf-4store from Fumihiro Kato which ties into the RDF.rb library which I’ll use later on.
However, if you use the SPARQL server then it’s just an HTTP PUT call, which of course you can do in any language you like (every language has support for making HTTP requests, right?) without the need to install any store-specific packages. Since we’ll be doing a lot of HTTP requests, though, it’s useful to have a library that makes them simple. There are plenty to choose from for Ruby. I chose rest-client:
$ sudo gem install rest-client
With that, I wrote the following little Ruby script called ‘load-data-into-4store.rb’:
#!/usr/bin/env ruby
require 'rubygems'
require 'rest_client'
filename = '/Users/Jeni/Downloads/index.rdf'
graph = 'http://source.data.gov.uk/data/reference/organogram-co/2010-06-30'
endpoint = 'http://localhost:8000/data/'
puts "Loading #{filename} into #{graph} in 4store"
response = RestClient.put endpoint + graph, File.read(filename), :content_type => 'application/rdf+xml'
puts "Response #{response.code}:
#{response.to_str}"
Run the script from the command line:
$ ruby load-data-into-4store.rb
and you should get the response:
Sending PUT /data/http://source.data.gov.uk/data/reference/organogram-co/2010-06-30 to localhost:8000
Response 201:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head><title>201 imported successfully</title></head>
<body><h1>201 imported successfully</h1>
<p>This is a 4store SPARQL server.</p><p>4store v1.0.5</p></body></html>
You can then check http://localhost:8000/status/size/ and you should see that there are now some triples in the store.
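Since loading really is just an HTTP PUT, it can also be done without rest-client. The following is a rough sketch with Ruby's standard Net::HTTP; the method names build_put_request and put_graph are made up for this example, and the tiny RDF/XML string stands in for the contents of the downloaded file. It only builds the request and prints what would be sent; calling put_graph would actually send it.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) the PUT request that loads RDF/XML into a graph.
# As in the script above, the graph URI is simply appended to the endpoint.
def build_put_request(endpoint, graph, rdf_xml)
  uri = URI.parse(endpoint + graph)
  request = Net::HTTP::Put.new(uri.request_uri)
  request['Content-Type'] = 'application/rdf+xml'
  request.body = rdf_xml
  [uri, request]
end

# Sends the request and returns the Net::HTTPResponse from the server.
def put_graph(endpoint, graph, rdf_xml)
  uri, request = build_put_request(endpoint, graph, rdf_xml)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Placeholder document; in practice this would be File.read(filename).
placeholder_rdf = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'

uri, request = build_put_request(
  'http://localhost:8000/data/',
  'http://source.data.gov.uk/data/reference/organogram-co/2010-06-30',
  placeholder_rdf
)
puts "#{request.method} #{request.path} to #{uri.host}:#{uri.port}"
```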
The next step is to query that data using SPARQL. Running SPARQL queries is just a matter of HTTP POSTing a query to the SPARQL endpoint. 4store provides a page that you can use to test out queries at http://localhost:8000/test/ so perhaps we should do that before diving into the Ruby code. The easy one to start with is just one that returns a list of the types of things that are described within the data:
SELECT DISTINCT ?type
WHERE {
?thing a ?type .
}
ORDER BY ?type
Paste that into the textarea that’s provided on http://localhost:8000/test/, run it, and you get some XML:
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="type"/>
</head>
<results>
<result>
<binding name="type"><uri>http://purl.org/linked-data/cube#DataSet</uri></binding>
</result>
<result>
<binding name="type"><uri>http://purl.org/linked-data/cube#DataStructureDefinition</uri></binding>
</result>
...
</results>
</sparql>
SELECT queries like this one (which are the most common kind of query to run to simply extract data) return SPARQL Query Results XML Format by default, so there’s no need to get hold of a specialised library for processing the results: you just need something to process XML.
For Ruby, I’m choosing Nokogiri as I’ve heard good things about it. To install:
$ sudo port install libxml2 libxslt
$ sudo gem install nokogiri
So now we just need a script that will run this query, process the results as XML, and do something with them. Call it ‘find-rdf-types.rb’:
#!/usr/bin/env ruby
require 'rubygems'
require 'rest_client'
require 'nokogiri'
query = 'SELECT DISTINCT ?type WHERE { ?thing a ?type . } ORDER BY ?type'
endpoint = 'http://localhost:8000/sparql/'
puts "POSTing SPARQL query to #{endpoint}"
response = RestClient.post endpoint, :query => query
puts "Response #{response.code}"
xml = Nokogiri::XML(response.to_str)
xml.xpath('//sparql:binding[@name = "type"]/sparql:uri', 'sparql' => 'http://www.w3.org/2005/sparql-results#').each do |type|
puts type.content
end
Run it:
$ ruby find-rdf-types.rb
and you get:
POSTing SPARQL query to http://localhost:8000/sparql/
Response 200
http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#Observation
http://purl.org/net/opmv/ns#Artifact
http://purl.org/net/opmv/ns#Process
http://purl.org/net/opmv/types/google-refine#OperationDescription
http://purl.org/net/opmv/types/google-refine#Process
http://purl.org/net/opmv/types/google-refine#Project
http://rdfs.org/ns/void#Dataset
http://reference.data.gov.uk/def/central-government/AssistantParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/CivilServicePost
http://reference.data.gov.uk/def/central-government/Department
http://reference.data.gov.uk/def/central-government/DeputyDirector
http://reference.data.gov.uk/def/central-government/DeputyParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/Director
http://reference.data.gov.uk/def/central-government/DirectorGeneral
http://reference.data.gov.uk/def/central-government/ParliamentaryCounsel
http://reference.data.gov.uk/def/central-government/PermanentSecretary
http://reference.data.gov.uk/def/central-government/PublicBody
http://reference.data.gov.uk/def/central-government/SeniorAssistantParliamentaryCounsel
http://reference.data.gov.uk/def/intervals/CalendarDay
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/ns/org#Organization
http://www.w3.org/ns/org#OrganizationalUnit
http://xmlns.com/foaf/0.1/Person
So we can see that the dataset contains information that includes statistical data using the data cube vocabulary, provenance information using OPMV (Open Provenance Model Vocabulary), some information about organisations using org, some data.gov.uk-specific vocabulary, and people using FOAF.
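If you would rather end up with ordinary Ruby data structures than printed lines, the same XML can be turned into an array of hashes, one per result row. This is only a sketch: it uses REXML from Ruby's standard distribution instead of Nokogiri so that it runs without extra gems, the method name sparql_results_to_hashes is invented for the example, and the sample XML is a shortened copy of the response shown earlier.

```ruby
require 'rexml/document'

# Prefix mapping for the SPARQL Query Results XML namespace.
SPARQL_NS = { 's' => 'http://www.w3.org/2005/sparql-results#' }

# Turns SPARQL Query Results XML into an array of hashes, one per result row,
# mapping each variable name to the text of its bound value.
def sparql_results_to_hashes(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//s:result', SPARQL_NS).map do |result|
    row = {}
    REXML::XPath.each(result, 's:binding', SPARQL_NS) do |bound|
      value = bound.elements[1] # the <uri>, <literal> or <bnode> child
      row[bound.attributes['name']] = value.text
    end
    row
  end
end

sample = <<~XML
  <?xml version="1.0"?>
  <sparql xmlns="http://www.w3.org/2005/sparql-results#">
    <head><variable name="type"/></head>
    <results>
      <result>
        <binding name="type"><uri>http://purl.org/linked-data/cube#DataSet</uri></binding>
      </result>
      <result>
        <binding name="type"><uri>http://purl.org/linked-data/cube#DataStructureDefinition</uri></binding>
      </result>
    </results>
  </sparql>
XML

rows = sparql_results_to_hashes(sample)
rows.each { |row| puts row['type'] }
```

Note that this flattens every value to a string, so the distinction between URIs, literals and blank nodes is lost; that is fine for a listing like this one but may not be for other queries.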
Sometimes it can be useful to get non-tabular data out of SPARQL. At that point, rather than using SELECT queries, you will want to use a CONSTRUCT query, which creates RDF. For example, try the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
?person
a foaf:Person ;
foaf:name ?name ;
?prop ?value .
} WHERE {
?person a foaf:Person ;
foaf:name ?name ;
?prop ?value .
}
This gets all the information in the data about the individuals for whom names have been supplied in the data, as RDF.
Although the response is RDF/XML, you definitely do not want to process it as XML. Instead, you need a proper RDF library. Fortunately, there’s a good one for Ruby in RDF.rb. You can install it and a bunch of extra plugins that make it easy to deal with RDF in all its guises using:
$ sudo gem install linkeddata
This lets us pick out an appropriate parser based on the Content-Type of the response, and load the results of the SPARQL query into an in-memory RDF::Graph:
response = RestClient.post endpoint, :query => query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts "Response #{response.code} type #{content_type}"
graph = RDF::Graph.new
graph << RDF::Reader.for(:content_type => content_type).new(response.to_str)
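The regular expression in the second line is there because a Content-Type header can carry parameters after the media type. Here is a small illustration of what it does; the method name media_type and the header value with a charset parameter are assumptions for the example, not something 4store necessarily sends.

```ruby
# Returns the media type part of a Content-Type header value, dropping any
# parameters such as "; charset=utf-8".
def media_type(content_type_header)
  content_type_header[/^[^ ;]+/]
end

puts media_type('application/rdf+xml; charset=utf-8') # hypothetical header
puts media_type('application/rdf+xml')
```

Both calls print the bare media type, which is the form used to look up a reader with RDF::Reader.for above.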
We can perform subsequent queries over that graph, for example just to extract names and email addresses and put them into a dictionary:
query = RDF::Query.new({
:person => {
RDF.type => FOAF.Person,
FOAF.name => :name,
FOAF.mbox => :email,
}
})
people = {}
query.execute(graph).each do |person|
people[person.name.to_s] = person.email.to_s
end
It’s worth noting that the constants RDF and FOAF are pre-declared by including RDF, and the values that you get back from a query are RDF values, which can be URIs or have datatypes or languages. In the above code I’ve converted them into strings for insertion into the Ruby dictionary.
The full script for ‘get-names-and-emails.rb’ is:
#!/usr/bin/env ruby
require 'rubygems'
require 'rest_client'
require 'linkeddata'
include RDF
query = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT {
?person
a foaf:Person ;
foaf:name ?name ;
?prop ?value .
} WHERE {
?person a foaf:Person ;
foaf:name ?name ;
?prop ?value .
}"
endpoint = 'http://localhost:8000/sparql/'
puts "POSTing SPARQL query to #{endpoint}"
response = RestClient.post endpoint, :query => query
content_type = response.headers[:content_type][/^[^ ;]+/]
puts "Response #{response.code} type #{content_type}"
graph = RDF::Graph.new
graph << RDF::Reader.for(:content_type => content_type).new(response.to_str)
puts "\nLoaded #{graph.count} triples\n"
query = RDF::Query.new({
:person => {
RDF.type => FOAF.Person,
FOAF.name => :name,
FOAF.mbox => :email,
}
})
people = {}
query.execute(graph).each do |person|
people[person.name.to_s] = person.email.to_s
end
puts "\nCreating directory of #{people.length} people"
stott_email = people['Andrew Stott']
puts "\nAndrew Stott's email address: #{stott_email}"
Run this script with:
$ ruby get-names-and-emails.rb
and you get the result:
POSTing SPARQL query to http://localhost:8000/sparql/
Response 200 type application/rdf+xml
Loaded 459 triples
Creating directory of 75 people
Andrew Stott's email address: mailto:andrew.stott@cabinet-office.gsi.gov.uk
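The email address comes back with its mailto: prefix because foaf:mbox values are URIs rather than plain strings. If you want bare addresses in the dictionary, one possible approach is sketched below; strip_mailto is a helper name I've made up, and the hash is a one-entry stand-in for the people dictionary built by the script.

```ruby
# Drops a leading "mailto:" scheme, leaving any other string untouched.
def strip_mailto(mbox)
  mbox.to_s.sub(/\Amailto:/, '')
end

# Stand-in for the dictionary built by get-names-and-emails.rb.
people = { 'Andrew Stott' => 'mailto:andrew.stott@cabinet-office.gsi.gov.uk' }

addresses = people.transform_values { |mbox| strip_mailto(mbox) }
puts addresses['Andrew Stott']
```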
So there you have it, a walkthrough of setting up a local triplestore, loading in data and accessing that data programmatically using SPARQL queries.
Now for some caveats. First, you’re bound to have noticed that I haven’t followed Richard’s steps to the letter.
Second, there are a couple of dead ends that I went down that I haven’t written up in the above:
Could not open library 'libraptor' error. I couldn’t find an immediate fix for that, so decided to keep things simple instead and just use plain RDF.rb.
Third, I want to reiterate that there may be better ways of using 4store, rest_client, Nokogiri and RDF.rb, as well as Ruby generally, than those shown above. I don’t claim to be an expert in any of these technologies. If you have suggestions and corrections, I’d encourage you to add a comment and I’ll incorporate them in the text to improve it.
Finally, some general points, because the strong binding of ‘linked data’ and ‘SPARQL’ in Richard’s post bothers me:
Having said the above, if you’re collecting linked data from multiple sources with unpredictable content and want to query across it, having a local triplestore is very useful.
I also want to point out that within the linked data we’ve published on data.gov.uk, we’ve made a big effort to make the data available in multiple formats such as JSON, XML and CSV, and through a RESTful, URI-parameter-driven API, precisely to lower the barrier for developers who want to use that information but understandably don’t want to take the time or make the effort to learn the linked data technologies that underlie the sites. For those that do, the RDF/XML and Turtle are there as well, and the SPARQL queries that are used to create each page are available to look at, tweak, and learn from. Our hope is that the linked data API that provides access to lists of schools, departments and railway stations can make the linked data learning curve a little less steep.
Comments
Re: Getting Started with RDF and SPARQL Using 4store and RDF.rb
Hi, I don’t know if you said it but I had a problem accessing localhost:8000 with Chrome. I had to use Safari. Thanks for this tutorial!
Re: Getting Started with RDF and SPARQL Using 4store and RDF.rb
Hi Jeni,
thanks for the above post.
One small hiccup I had with 4store when running “ruby find-rdf-types.rb” on Mac OS X 10.6.6 (ruby 1.9.2) was
Library not loaded: /opt/local/lib/libiconv.2.dylib (LoadError)
Referenced from: /Users/richardhancock/.rvm/gems/ruby-1.9.2-p136/gems/nokogiri-1.4.4/lib/nokogiri/nokogiri.bundle
Reason: Incompatible library version: nokogiri.bundle requires version 8.0.0 or later, but libiconv.2.dylib provides version 7.0.0
Turns out that the shell I was running “ruby find-rdf-types.rb” in was also the shell that I had opened to run the 4store 4s-* commands and the $PATH now started with /Applications/4store.app/Contents/MacOS/bin. This contains Applications/4store.app/Contents/MacOS/lib/libiconv.2.dylib which is at version 7.0.0.
The simple fix was to open a new shell, which wasn’t prefixed with /Applications/4store.app/Contents/MacOS/bin.
“ruby find-rdf-types.rb” then ran fine.
Some improvements and simplifications
The load-data-into-4store.rb script could probably be simplified to the point of being no more than a couple of lines of code, using one of the better HTTP clients listed at:
http://ruby-toolbox.com/categories/http_clients.html
For example, the httparty and rest-client gems are popular, and both are very simple to use:
https://github.com/jnunemaker/httparty/blob/master/examples/twitter.rb
https://github.com/archiloque/rest-client
The find-rdf-types.rb script could be rewritten and simplified as follows, based on the RDF.rb-compatible sparql-client gem:
The get-names-and-emails.rb script could be similarly simplified based on sparql-client. See the README for code examples to that effect:
http://sparql.rubyforge.org/client/
Even if not using sparql-client, wherever you have a code snippet like the following:
…those can in all cases be simplified to this shorter but equivalent form:
Also, RDF.rb provides helpers for all the most important RDF vocabularies built in, so instead of the following:
…just say this instead:
Since you already found the linkeddata gem that pulls in Gregg Kellogg’s Ruby-native RDF/XML parser (the rdf-rdfxml gem), solving the rdf-raptor problem you were having is probably not a priority.
However, that problem is more than likely simply due to library load paths. If you have the Raptor libraries installed from MacPorts, they’ll be in /opt/local/lib instead of /usr/local/lib, necessitating setting a DYLD_FALLBACK_LIBRARY_PATH, or similar, environment variable.
I don’t recall how to do all that out of hand, but within the scope of rdf-raptor there are also a couple of easy workarounds (that I should document better in the README). See the following for details on environment variables you can set for rdf-raptor:
http://rdf.rubyforge.org/raptor/RDF/Raptor.html
I should also mention that Fumihiro Kato has been working on a 4store storage adapter for RDF.rb:
https://github.com/fumi/rdf-4store
I don’t know the current status of rdf-4store, but when it is ready, most everything in this blog post will be further simplified. For example, importing data will work something like the following:
Many thanks for writing this thorough tutorial; I will add a link to it in RDF.rb’s README file.
Re: Getting Started with RDF and SPARQL Using 4store and RDF.rb
Thanks Jeni for the great post! I only wish I was a MacHead...
While the 4Store package for Ubuntu is theoretically available and installable from Synaptic, there appears to be a dependency fail for recent versions (esp. on Maverick):
The following packages have unmet dependencies:
4store : Depends: librasqal1 (>= 0.9.16) but it is not installable
E: Broken packages
Virtuoso data loading
[Posted on behalf of Kingsley Idehen]
Jeni,
Demystification of Virtuoso re. data loading goes as follows — for anyone that has successfully completed installation (Open Source or Commercial Editions) :
/sparql
Note: the data source URL doesn’t even have to be RDF based (which is where the Sponger Middleware comes into play), thus you can use any resource URL leaving the Sponger to do the following:
If you have very large data sources like DBpedia etc. from CKAN, simply use our bulk loader:
I hope this demystifies Virtuoso data inserts. Our guides could be better as I’ve been through this loop with others, and each time the result is the same: Wow! I didn’t know it was so simple re. Virtuoso :-)
Nice post!
Re: Virtuoso data loading
Hi guys, new to Virtuoso products. I’ve installed Virtuoso Open Source on a Windows 2003 server. I have successfully loaded RDF files using Conductor’s RDF Store upload functionality. I had some trouble with a couple of files but worked out that they were too big to load. After splitting the files and loading them individually all was well.
Now I’m trying to load some more files in the same way. The files are in the same format as before and much smaller than ones I have loaded before, but they just refuse to load properly. An error message is displayed saying they are the wrong format, but that’s not true. If I go to the graph tab in Conductor I actually see the graph listed. But searching the data shows that only a fraction of the data has been loaded.
Could this be something to do with characters in the file? Any help gratefully received. Cheers pash