<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.jenitennison.com/blog" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>datagovuk</title>
 <link>http://www.jenitennison.com/blog/taxonomy/term/54</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>Government Should Do its Own Data Homework</title>
 <link>http://www.jenitennison.com/blog/node/148</link>
 <description>&lt;p&gt;I&amp;#8217;ve been reflecting a little since &lt;a href=&quot;http://www.ukuug.org/events/opentech2010/&quot;&gt;OpenTech&lt;/a&gt; on the relationship between the developer community and government.&lt;/p&gt;

&lt;p&gt;Let me set out my perspective first. My goal is to help ensure that the public sector publishes reusable data in the long term.&lt;/p&gt;

&lt;p&gt;To do that, data publication needs to be sustainable. It needs to be embedded within the day-to-day activity of the public sector, something that seems as natural as the generation of PDF reports seems today. It also needs to be useful. It needs to be easy for anyone to understand and reuse the data, with minimal effort. It cannot be the case, long term, that you need to be an expert hacker to reuse government data.&lt;/p&gt;

&lt;p&gt;To get there, we need to work towards a virtuous cycle in which the public sector is rewarded for publishing useful data well. The reward may come from financial savings, from increasing data quality, from better delivery of its remit, or simply from kudos. It doesn&amp;#8217;t matter how, but there needs to be some reward, or it just won&amp;#8217;t happen.&lt;/p&gt;

&lt;p&gt;Over the last few years, government has had to be persuaded that it&amp;#8217;s a good idea to release its data at all. The message from the developer community has been &amp;#8220;give us your data and we&amp;#8217;ll show you what we can do with it!&amp;#8221; Through hack days and various similar activities, developers have excited, wowed and dazzled officials and politicians, opening their eyes to what could be done. Through sustained argument and political pressure, developers have set out the economic and moral case that releasing data not only &lt;em&gt;could&lt;/em&gt;, but &lt;em&gt;should&lt;/em&gt; happen.&lt;/p&gt;

&lt;p&gt;They have been incredibly successful. We have &lt;a href=&quot;http://data.gov.uk/&quot;&gt;data.gov.uk&lt;/a&gt;, &lt;a href=&quot;http://www.ordnancesurvey.co.uk/oswebsite/opendata/&quot;&gt;open data from Ordnance Survey&lt;/a&gt;, strong commitments to open data within the &lt;a href=&quot;http://programmeforgovernment.hmg.gov.uk/government-transparency/&quot;&gt;Coalition Agreement&lt;/a&gt;, and the &lt;a href=&quot;http://data.gov.uk/blog/new-public-sector-transparency-board-and-public-data-transparency-principles&quot;&gt;Public Sector Transparency Board&lt;/a&gt; who are now applying that pressure, with authority, at the heart of government.&lt;/p&gt;

&lt;p&gt;My perception is that the argument that government should open up its data has basically been won. The questions within the public sector are now about &lt;em&gt;how&lt;/em&gt;, not &lt;em&gt;whether&lt;/em&gt;. And as a result, in this changed environment, I&amp;#8217;m growing slightly uneasy about the core developer message of &amp;#8220;give us your data and we&amp;#8217;ll show you what we can do with it!&amp;#8221;&lt;/p&gt;

&lt;p&gt;There are two things about that message that concern me. First, it implies government is doing it all wrong. Second, it implies that government doesn&amp;#8217;t &lt;em&gt;need&lt;/em&gt; to do any better, because the developer community can take up all the slack and fill in all the gaps. It&amp;#8217;s like getting fed up with a child struggling with their homework, and saying &amp;#8220;oh, just give it here and I&amp;#8217;ll do it!&amp;#8221; It&amp;#8217;s a narrative that simultaneously undermines the best efforts of those within government and removes from them the motivation and opportunity to learn to do better.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;Of course there is a tricky balance here. We don&amp;#8217;t want to let up pressure on the government to release important information. We don&amp;#8217;t want government to feel that they have to get their data perfect before releasing it. And we can&amp;#8217;t always wait for government, which can be slow-moving as an organisation, to provide everything we need right now.&lt;/p&gt;

&lt;p&gt;However, there are certain things that only the owners of data &amp;#8212; those within the public sector &amp;#8212; can do. People who own data understand it so much better than third parties: what codes mean, what values are used to indicate missing data, what gets included and what gets left out, which columns aren&amp;#8217;t really used any more, which interpretations are safe and which are meaningless. Data owners can be trusted in a way that no one outside could be; when data publication becomes a sustainable part of their activity, they are much better placed to provide a steady, reliable, flow of data than a third-party API that could disappear or get out of date whenever the volunteer behind it moves on to something new.&lt;/p&gt;

&lt;p&gt;People in government must be given the responsibility to publish their data well. And there are three core ways in which I think developers could help them.&lt;/p&gt;

&lt;p&gt;First, while there are many more technically savvy people within government than is sometimes made out, the average civil servant lacks both know-how and tooling. I think developers could help a huge amount here. What about hack days where developers sit side by side with civil servants to help them clean and publish their data? What about engaging with the owners of a particular data set to help &lt;em&gt;them&lt;/em&gt; to publish it in a way that was reusable and sustainable? What about writing services, accessible through the locked-down IT systems that civil servants have to use, that enabled them to convert their data into multiple formats, and to link up the ways they refer to things with the way other people do?&lt;/p&gt;

&lt;p&gt;Second, while government needs to be responsible for publishing its data, it can&amp;#8217;t be responsible for building everything that end-users need based on that data. Developers have the facility to create applications that bring together data from diverse parts of the public sector, and combine it with data from outside. This has always been a feature of hack days, of course; all I&amp;#8217;m arguing for is a focus on applications that the public sector &lt;em&gt;shouldn&amp;#8217;t&lt;/em&gt; be doing itself.&lt;/p&gt;

&lt;p&gt;Third, we need to build the virtuous cycle that I talked about above. Government needs to hear about what works for developers, as well as what doesn&amp;#8217;t. What data releases have been helpful and why? Who are the stars? Who should be rewarded and emulated? We need ways of feeding back in a constructive way to public sector workers who are trying their best with the resources they have &amp;#8212; often extensive subject-matter expertise but little time, locked-down technology and contracting finances.&lt;/p&gt;

&lt;p&gt;The vitality and engagement of the developer community has played a massively important role in the open government data initiative within the UK, and I&amp;#8217;m sure it will continue to do so. We are incredibly lucky, here, to have a collection of talented and motivated developers who volunteer their time to work with government data. My hope is simply that the relationship between government and developers can grow into one that is more encouraging and supportive, that understands the constraints and concerns of those within government, and that provides practical help to overcome them.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/148#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/50">psi</category>
 <pubDate>Sun, 26 Sep 2010 21:41:30 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">148 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Hosting Gridworks Instances</title>
 <link>http://www.jenitennison.com/blog/node/147</link>
 <description>&lt;p&gt;I&amp;#8217;ve &lt;a href=&quot;http://www.jenitennison.com/blog/node/145&quot;&gt;written previously&lt;/a&gt; about how wonderful &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/&quot;&gt;Freebase Gridworks&lt;/a&gt; (&lt;a href=&quot;http://groups.google.com/group/freebase-gridworks/browse_thread/thread/f58390cd729c35fe/636f8332b44fbb00#636f8332b44fbb00&quot;&gt;shortly to be &amp;#8220;Google Refine&amp;#8221;&lt;/a&gt;) is for cleaning and converting data. Within the UK public sector, there are two big barriers to its use, however:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Public sector workers typically can&amp;#8217;t install software on their computers.&lt;/li&gt;
&lt;li&gt;They&amp;#8217;re also typically stuck with IE7 (or even, if they&amp;#8217;re really unlucky, IE6).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We&amp;#8217;ve got around the first of these issues by installing Gridworks as a hosted (password-protected) instance on &lt;code&gt;http://source.data.gov.uk/gridworks&lt;/code&gt;. Now, this isn&amp;#8217;t perfect of course: Gridworks wasn&amp;#8217;t designed to be used as a shared instance, so it doesn&amp;#8217;t have support for multiple users operating on the same project at the same time, let alone things like user accounts or access control. So we&amp;#8217;re operating on trust here &amp;#8212; hoping that people won&amp;#8217;t delete or edit each other&amp;#8217;s projects &amp;#8212; but it&amp;#8217;s worth the risk.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s also not particularly pretty in that the links that Gridworks uses all assume that it&amp;#8217;s running at the root of a web server. Fortunately, source.data.gov.uk doesn&amp;#8217;t need to have a home page, so it&amp;#8217;s possible to have Gridworks available at the root (although in hope of something better in the future, I&amp;#8217;ve made the main point of entry &lt;code&gt;/gridworks&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;I got this working by installing Gridworks normally on the server and using Apache as a proxy, with the following configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Gridworks support
RewriteRule &quot;^/$&quot; &quot;/gridworks&quot; [R,L]
RewriteRule &quot;^/gridworks(.*)$&quot; &quot;http://localhost:3333$1&quot; [P,L]
RewriteRule &quot;^/(.*)$&quot; &quot;http://localhost:3333/$1&quot; [P,L]
ProxyPass /gridworks/ http://localhost:3333/
ProxyPassReverse /gridworks/ http://localhost:3333/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s it.&lt;/p&gt;
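
&lt;p&gt;To see how those three rewrite rules divide up incoming requests, here is a rough Python sketch that mimics their order of evaluation. It only illustrates the pattern matching, not how Apache works internally; the &lt;code&gt;route&lt;/code&gt; function and the sample paths (including &lt;code&gt;/command/get-rows&lt;/code&gt;) are made up for the example.&lt;/p&gt;

```python
import re

BACKEND = "http://localhost:3333"  # where Gridworks listens, as in the config above

def route(path):
    """Return (action, target) for a request path, trying the rules in order."""
    # Rule 1: the bare root is redirected to the /gridworks entry point.
    if re.match(r"^/$", path):
        return ("redirect", "/gridworks")
    # Rule 2: /gridworks... is proxied with the prefix stripped.
    m = re.match(r"^/gridworks(.*)$", path)
    if m:
        return ("proxy", BACKEND + m.group(1))
    # Rule 3: everything else is proxied unchanged, because the links that
    # Gridworks generates assume it lives at the root of the server.
    m = re.match(r"^/(.*)$", path)
    if m:
        return ("proxy", BACKEND + "/" + m.group(1))
    return ("none", path)

# Hypothetical request paths, purely for illustration.
for p in ["/", "/gridworks", "/gridworks/project", "/command/get-rows"]:
    print(p, "->", route(p))
```

&lt;p&gt;The third rule is the one that makes the root-relative links work, which is why Gridworks ends up effectively owning the whole host.&lt;/p&gt;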

&lt;p&gt;The IE7 problem will take a bit longer to solve, I imagine.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/147#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/59">gridworks</category>
 <pubDate>Sun, 19 Sep 2010 18:56:50 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">147 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>On Standards</title>
 <link>http://www.jenitennison.com/blog/node/146</link>
 <description>&lt;p&gt;I&amp;#8217;m beginning to think that &amp;#8216;to recommend&amp;#8217; is an irregular verb like those that appeared every so often in &lt;a href=&quot;http://en.wikiquote.org/wiki/Yes,_Minister&quot;&gt;Yes, Minister&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Bernard:&lt;/strong&gt; It&amp;#8217;s one of those irregular verbs, isn&amp;#8217;t it: I have an independent mind; you are an eccentric; he is round the twist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Something like: I recommend, you tell people what to do, he engages in premature standardisation.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;Our own recommendations are much more reasonable than those made by other people. &lt;em&gt;We&lt;/em&gt; understand the requirements, whereas &lt;em&gt;they&lt;/em&gt; haven&amp;#8217;t talked to anyone. &lt;em&gt;We&lt;/em&gt; are issuing them as guidance and are open to feedback, whereas &lt;em&gt;they&lt;/em&gt; are ramming them down people&amp;#8217;s throats.&lt;/p&gt;

&lt;p&gt;Of course without guidance, recommendations and standards of some description, it becomes near impossible to do anything useful. Take a look at the wide variety of &lt;a href=&quot;https://spreadsheets.google.com/ccc?key=0AhOqra7su40fdEgtaG4yVFZGVjdYREVIWmprX2dENkE&amp;amp;hl=en_GB&quot;&gt;information released by different councils to meet their commitment to publish spending data&lt;/a&gt;. Many use different formats but even amongst those that use Excel or CSV, the column names are different. Look closer and you see that they actually report at different levels of granularity as well. Some report each transaction, some each invoice item, some the ways these items are assigned to different cost centres. Some stick to the £500 limit, some report everything. Some include VAT in the amounts they quote, some don&amp;#8217;t. Some provide the dates of each transaction, some just the period that it occurred in. If you are clever and committed, &lt;a href=&quot;http://openlylocal.com/councils/spending&quot;&gt;you can find some wood in the trees&lt;/a&gt; but it&amp;#8217;s hard work.&lt;/p&gt;

&lt;p&gt;This variety is not due to pigheadedness or stupidity on the part of the councils. It&amp;#8217;s down to the very different technical and political constraints and approaches, and the fact that there was little guidance at all, &lt;a href=&quot;http://data.gov.uk/blog/local-spending-data-guidance&quot;&gt;up until this week&lt;/a&gt;, about what was expected of them.&lt;/p&gt;

&lt;p&gt;The point I&amp;#8217;m making is that people in different circumstances will naturally do things differently; common practice does not appear overnight by magic.&lt;/p&gt;

&lt;p&gt;Should councils have held off publishing their data until there was some kind of guidance in place? &lt;strong&gt;Absolutely 100% No!&lt;/strong&gt; It is far better to have the data in some form than to not have it at all, and it&amp;#8217;s only by making real data available that they and we get to start informed discussions about what kind of guidance is necessary.&lt;/p&gt;

&lt;p&gt;Should they be working towards publishing something better? &lt;strong&gt;Hell Yeah!&lt;/strong&gt; Data is not really open if the people who consume it have to put in hours or days of effort to understand it, map it, merge it, to be able to do something useful with it.&lt;/p&gt;

&lt;p&gt;What that &amp;#8216;something better&amp;#8217; looks like, I really don&amp;#8217;t know. My prediction is that councils will converge gradually, over time, into a handful of different approaches (rather than the basketful that we have now). Some will converge by choosing to use particular publishers for their data. Others will converge because they want to take advantage of particular tools for analysing or visualising the data that they produce, which will require certain formats. Still others will converge through an interest in &amp;#8220;doing what&amp;#8217;s right&amp;#8221;, based on guidance from groups and organisations that they trust.&lt;/p&gt;

&lt;p&gt;From chaos will come order, eventually. But this is a process that is led by politics &amp;#8212; negotiation, persuasion, socialisation and cultural change &amp;#8212; not by technology. It&amp;#8217;s only to be expected that there will be differences in approaches along the way, because we need to try, to learn, and we need for there to be choice, to evolve.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/146#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/52">opendata</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/50">psi</category>
 <pubDate>Sun, 19 Sep 2010 15:54:51 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">146 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Using Freebase Gridworks to Create Linked Data</title>
 <link>http://www.jenitennison.com/blog/node/145</link>
 <description>&lt;p&gt;When we encourage people to put their data on the web as linked data, the biggest question is &amp;#8220;How?&amp;#8221;. There are so many &amp;#8220;How?&amp;#8221; questions to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how do we choose what URIs to use for things?&lt;/li&gt;
&lt;li&gt;how do we choose what vocabularies to use?&lt;/li&gt;
&lt;li&gt;how do we handle changing data?&lt;/li&gt;
&lt;li&gt;how do we tell people how the data was created?&lt;/li&gt;
&lt;li&gt;how do we publish it?&lt;/li&gt;
&lt;li&gt;how will other people know about it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and, of course:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how do we create it?&lt;/li&gt;
&lt;/ul&gt;

&lt;!--break--&gt;

&lt;p&gt;Our goal within the linked data part of data.gov.uk (and I know we haven&amp;#8217;t achieved it yet) is both to answer these questions and to make the answers as simple as possible. The answers &lt;em&gt;cannot&lt;/em&gt; require up-front knowledge of all possible types of data that might be published, or depend on the availability of linked data for all the things we want to talk about. They &lt;em&gt;cannot&lt;/em&gt; require registration at centralised services. They &lt;em&gt;cannot&lt;/em&gt; require everyone to do everything in the same way or at the same pace.&lt;/p&gt;

&lt;p&gt;We must adopt an approach that encourages people to make their data available in forms that are easier for other people to pick up and use &lt;strong&gt;because they see the benefits for them&lt;/strong&gt; and their stakeholders and because the effort of doing so is not too high to bear. We must grow, adapt and evolve incrementally. If linked data eventually wins, it will be due to its benefits, not to faith.&lt;/p&gt;

&lt;p&gt;Anyway, enough rant. The point of this blog post is to talk about one of the answers to the &amp;#8216;How do we create it?&amp;#8217; question: using &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/&quot;&gt;Freebase Gridworks&lt;/a&gt;. For those who haven&amp;#8217;t encountered it, Gridworks is an incredibly useful application that enables you to easily analyse, clean and manipulate tabular data. In a few steps, it can be used to generate linked datasets which can then be published on the web just like any other file, ready for other people to reuse without jumping through hoops. I&amp;#8217;m going to assume that you can &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/wiki/Downloads?tm=2&quot;&gt;download it&lt;/a&gt; and &lt;a href=&quot;http://code.google.com/p/freebase-gridworks/wiki/GettingStarted&quot;&gt;install it&lt;/a&gt; following the instructions provided on the Gridworks site.&lt;/p&gt;

&lt;p&gt;In this post, I&amp;#8217;m going to talk about how to use Gridworks to generate linked data, using an example of local government spending data from &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;Windsor and Maidenhead council&lt;/a&gt;. Like a good train journey, there&amp;#8217;s quite a lot to see along the way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Many thanks to Dave Reynolds for his work on this data and comments on an earlier version of this post.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Importing Data&lt;/h2&gt;

&lt;p&gt;The first step is to import the data into Gridworks. If you just take the Windsor &amp;amp; Maidenhead data and import it directly, you&amp;#8217;ll get a single not-very-useful column as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/bad-import.jpg&quot; title=&quot;Bad import into Gridworks&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you look at the spreadsheet in a normal spreadsheet programme then you&amp;#8217;ll see why. Like a lot of spreadsheets created by normal people, who want to create something readable by human beings rather than computers, it has some extra lines at the top to explain what the spreadsheet contains, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/spreadsheet.jpg&quot; title=&quot;Original spreadsheet&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, Gridworks lets us easily skip over these first few lines. When you import the data, put the number &lt;code&gt;1&lt;/code&gt; in the box for &amp;#8220;Ignore X initial non-blank lines&amp;#8221;, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/import-dialog.jpg&quot; title=&quot;Import dialog&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;(You need the number &lt;code&gt;1&lt;/code&gt; because although there are three lines before the table really starts, the second two of those are blank.)&lt;/p&gt;

&lt;p&gt;That done, the data should look a lot more useful, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/good-import.jpg&quot; title=&quot;Good import into Gridworks&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
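
&lt;p&gt;If you would rather script the same skip-the-preamble import, a minimal Python sketch might look like the following. The sample text is an invented stand-in for a CSV export of the spreadsheet, the &lt;code&gt;read_table&lt;/code&gt; helper is hypothetical, and treating a line of bare commas as blank is my assumption rather than a statement about what Gridworks does.&lt;/p&gt;

```python
import csv
import io

# Hypothetical stand-in for a CSV export of the spreadsheet: one title line,
# two blank lines, then the real header row and the data.
RAW = """Payments to suppliers (illustrative title line),,
,,
,,
Directorate,Supplier Name,Amount
Adult Services,Example Supplier Ltd,750.00
Children's Services,Another Supplier Ltd,1200.50
"""

def read_table(text, ignore_non_blank=1):
    """Drop blank lines, skip the first N non-blank ones, parse the rest."""
    # Assumption: a line counts as blank if it holds only commas and whitespace.
    lines = [ln for ln in text.splitlines() if ln.strip(", \t")]
    kept = "\n".join(lines[ignore_non_blank:])
    return list(csv.DictReader(io.StringIO(kept)))

rows = read_table(RAW)
print(len(rows), "rows; first supplier:", rows[0]["Supplier Name"])
```

&lt;p&gt;As in the import dialog, the count is 1 rather than 3 because only the title line is non-blank.&lt;/p&gt;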

&lt;h2&gt;Cleaning Data&lt;/h2&gt;

&lt;p&gt;The next thing to do is to explore the data a bit to get a handle on what&amp;#8217;s there and work out whether any cleaning or rationalisation is necessary to improve its quality.&lt;/p&gt;

&lt;p&gt;With columns that hold names, such as &amp;#8216;Directorate&amp;#8217;, &amp;#8216;Service&amp;#8217; or &amp;#8216;Supplier Name&amp;#8217;, you&amp;#8217;re looking for slight misspellings caused by bad data entry. Gridworks helps you find these by creating a list of the distinct values for a particular column and telling you how many instances there are of each. Use the arrow at the side of the column name to pull down the menu, then choose &lt;code&gt;Facet &amp;gt; Text Facet&lt;/code&gt; to create this list, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/facet-menu.jpg&quot; title=&quot;Choosing from the facet menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once you&amp;#8217;ve chosen &lt;code&gt;Text Facet&lt;/code&gt;, the list pops up on the left hand side of the window. You can click on any value to filter the table down to the rows that have that value in that column, but you can also scan through the list to spot places where there looks to be a typo, or two entries that should really be the same. For example, the Services list holds both &amp;#8216;Libraries &amp;amp; Information Services&amp;#8217; and &amp;#8216;Library &amp;amp; Information Services&amp;#8217;, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/services-list.jpg&quot; title=&quot;Repetition in the Services list&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s unlikely that there are really two distinct services with such similar names, so we&amp;#8217;d like to clean up this data by standardising on one name or another. You can quickly change all occurrences of one value to another using the &lt;code&gt;edit&lt;/code&gt; option that appears just to the right of the value when you hover over it. This brings up a dialog that enables you to change all of those values to something else, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-value-dialog.jpg&quot; title=&quot;Editing a value across the spreadsheet&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;
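
&lt;p&gt;Conceptually, a text facet is just a count of distinct values, and the &lt;code&gt;edit&lt;/code&gt; link is a bulk replace. Here is a small Python sketch of both ideas. The column contents are invented (with the ampersand in the library names spelled out as &amp;#8216;and&amp;#8217; to keep the sample plain) and &lt;code&gt;edit_value&lt;/code&gt; is a hypothetical helper, not anything from Gridworks.&lt;/p&gt;

```python
from collections import Counter

# Made-up 'Service' column. The two library names echo the example above,
# with the ampersand spelled out as 'and' to keep the sample plain.
services = [
    "Libraries and Information Services",
    "Library and Information Services",
    "Libraries and Information Services",
    "Highways",
    "Libraries and Information Services",
]

# The text facet: each distinct value with the number of rows that hold it.
facet = Counter(services)
for value, count in sorted(facet.items()):
    print(count, value)

def edit_value(values, old, new):
    """The 'edit' link: rewrite every occurrence of one value to another."""
    return [new if v == old else v for v in values]

cleaned = edit_value(services,
                     "Library and Information Services",
                     "Libraries and Information Services")
print(Counter(cleaned))
```

&lt;p&gt;A value that appears only once next to a near-identical value that appears many times is usually the one to suspect.&lt;/p&gt;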

&lt;p&gt;You can do something similar with numeric columns, such as the &amp;#8216;Amount excl vat £&amp;#8217; column. This time choose &lt;code&gt;Numeric Facet&lt;/code&gt; rather than &lt;code&gt;Text Facet&lt;/code&gt; and you&amp;#8217;ll get a histogram up as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/amount-facet.jpg&quot; title=&quot;Amount histogram&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is useful for identifying outliers. If you grab the handle on the left of the histogram and move it to the centre, the rows will get filtered to only those that have an amount within that range. For example, moving it to only show rows between £500,000 and £1,500,000 shows that there are three payments of this size, all made by Children&amp;#8217;s Services to Wilmott Dixon Construction Limited, as shown in this screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/high-value-transactions.jpg&quot; title=&quot;High value transactions&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Although these values are much higher than most of the others in the spreadsheet, they don&amp;#8217;t seem to be errors &amp;#8212; I guess a new school was being built or something &amp;#8212; so there&amp;#8217;s nothing to correct here, but it shows how numeric facets can be used to explore the data.&lt;/p&gt;
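
&lt;p&gt;The numeric facet boils down to two operations: bucketing values into a histogram and filtering rows to a range. The Python sketch below shows both on invented amounts. The bin width and the helper names are my own choices, and Gridworks may well scale its histogram differently.&lt;/p&gt;

```python
# Made-up amounts in pounds; the real values live in the 'Amount excl vat' column.
amounts = [512.40, 780.00, 1250.99, 24000.00, 650000.00, 900000.00, 1400000.00]

def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    counts = {}
    for v in values:
        edge = int(v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return counts

def in_range(values, low, high):
    """Keep the values that lie between low and high, both ends included."""
    return [v for v in values if v >= low and high >= v]

print(histogram(amounts, 500000))
print(in_range(amounts, 500000, 1500000))
```

&lt;p&gt;With this sample, dragging the range to between 500,000 and 1,500,000 leaves three rows, which is the kind of short list that is worth eyeballing by hand.&lt;/p&gt;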

&lt;p&gt;Another approach to exploring and cleaning the data is to use the clustering algorithms that are built into Gridworks to identify duplicates. To do this, pull down the column menu and this time choose &lt;code&gt;Edit Cells... &amp;gt; Cluster and Edit&lt;/code&gt;, as shown in the following screenshot, this time for the &amp;#8216;Supplier Name&amp;#8217; column:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-cells-menu.jpg&quot; title=&quot;Choosing from the Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This brings up a dialog that groups together values that look similar. In this case, &amp;#8216;Siemens plc&amp;#8217; and &amp;#8216;Siemens PLC&amp;#8217;, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/cluster-dialog.jpg&quot; title=&quot;Clustering values in a column&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can use this dialog to change all the similar values to a standard one. Check the &lt;code&gt;Merge&lt;/code&gt; checkbox for the clusters of values that should be merged, edit the &lt;code&gt;New Cell Value&lt;/code&gt; field to whatever standard value you want to adopt, and choose &lt;code&gt;Apply &amp;amp; Re-cluster&lt;/code&gt; or simply &lt;code&gt;Apply &amp;amp; Close&lt;/code&gt; to make the change.&lt;/p&gt;

&lt;p&gt;You will often find that the default clustering algorithm (key collision/fingerprint) doesn&amp;#8217;t come up with any clusters as it&amp;#8217;s fairly conservative. It&amp;#8217;s worth playing around a bit with different algorithms to look for other duplicates by selecting other possibilities from the drop-down menus. For example, choosing the &amp;#8216;nearest neighbour&amp;#8217; method with the Levenshtein distance function and a radius of 2 (edits) results in four possible duplicates within the Suppliers list, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/levenstein-cluster.jpg&quot; title=&quot;Clustering values with Levenshtein distance&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re not sure about whether the cluster is due to a typo or not, hover over the row and click on the &lt;code&gt;Browse this cluster&lt;/code&gt; link that appears. That will bring up a separate window that will show you just the rows in the cluster, from which you should be able to make a judgement. For example, it&amp;#8217;s not clear whether &amp;#8216;Academia Ltd&amp;#8217; is a typo for &amp;#8216;Academics Ltd&amp;#8217; but browsing the cluster shows that the Cost Centre codes and the Types of the transactions are completely different for the two Suppliers, so they are probably different.&lt;/p&gt;
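
&lt;p&gt;To give a feel for why the two methods behave so differently, here is a simplified Python sketch of each. The &lt;code&gt;fingerprint&lt;/code&gt; function is my approximation of a key-collision key (lower-case, strip punctuation, sort the unique words) and shouldn&amp;#8217;t be read as the exact Gridworks implementation; &lt;code&gt;levenshtein&lt;/code&gt; is the textbook edit-distance calculation.&lt;/p&gt;

```python
import string

def fingerprint(value):
    """Approximate key-collision key: lower-case, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = "".join(ch for ch in cleaned if ch not in string.punctuation)
    return " ".join(sorted(set(cleaned.split())))

def levenshtein(a, b):
    """Fewest single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Same key, so key collision puts these two in one cluster.
print(fingerprint("Siemens plc") == fingerprint("Siemens PLC"))
# Different keys, but close enough for nearest neighbour with a radius of 2.
print(fingerprint("Academia Ltd") == fingerprint("Academics Ltd"))
print(levenshtein("Academia Ltd", "Academics Ltd"))
```

&lt;p&gt;Key collision only catches values that normalise to the same key, which is why it is conservative; edit distance casts a wider net and so needs the human check described above.&lt;/p&gt;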

&lt;h2&gt;Deriving Data&lt;/h2&gt;

&lt;p&gt;The next step is to derive some data from what we have within the spreadsheet. Since our goal is to produce linked data, the derived data that we&amp;#8217;re interested in consists of URIs.&lt;/p&gt;

&lt;p&gt;At this point we need to start making decisions about what URIs to use. If you look at the &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;list of spending data from Windsor and Maidenhead&lt;/a&gt;, you&amp;#8217;ll see that there are a whole bunch of these spreadsheets. It would be really useful if we could tie these spreadsheets together by using the same URIs for the same things across the datasets. For that reason, the only URI that&amp;#8217;s going to be local to the dataset is the URI for each line (or data point if you like) itself. On the other hand, most of the things that are named here are going to be local to Windsor &amp;amp; Maidenhead: &amp;#8216;Abba Cars&amp;#8217; may be sufficient to identify a single company within Windsor &amp;amp; Maidenhead, but certainly wouldn&amp;#8217;t be nationwide. So the URIs I&amp;#8217;m going to create here are mostly going to be within the &lt;code&gt;www.rbwm.gov.uk&lt;/code&gt; domain.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the table of the columns and the associated URIs that I&amp;#8217;m going to use. I should stress that this is just for example purposes, but I&amp;#8217;ve used the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URIs for datasets are just like URIs for any other web document, but shouldn&amp;#8217;t have an extension because the data itself should be available in many formats&lt;/li&gt;
&lt;li&gt;URIs for real-world things should have &lt;code&gt;/id&lt;/code&gt; at the start of the path, and URIs for conceptual things should have &lt;code&gt;/def&lt;/code&gt; at the start of their paths; both should result in a 303 redirection to a suitable web page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what we&amp;#8217;re doing within data.gov.uk, but it&amp;#8217;s an important principle of the web that different councils might well choose their own URI schemes, depending on the kind of technology support that they have, without any bad side-effects on the interpretation of the data.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Column&lt;/th&gt;
      &lt;th&gt;URI pattern&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;(Dataset)&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;(Row/ExpenditureLine)&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#{row-number}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;(Council)&lt;/th&gt;
      &lt;td&gt;http://statistics.data.gov.uk/id/local-authority/00ME&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Directorate&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/directorate/{directorate-slug}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Updated&lt;/th&gt;
      &lt;td&gt;http://reference.data.gov.uk/id/day/{date}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;TransNo/Payment&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/transaction/{transaction-number}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Service&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/service/{service-slug}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Cost Centre&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/def/cost-centre/{cost-centre-code}&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Supplier Name&lt;/th&gt;
      &lt;td&gt;http://www.rbwm.gov.uk/id/supplier/{supplier-slug}&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, those of the columns that contain text fields have, as part of their URI, a &lt;a href=&quot;http://en.wikipedia.org/wiki/Slug_(production)&quot;&gt;&amp;#8216;slug&amp;#8217;&lt;/a&gt;. This is a shortened, normalised value suitable for putting in a URI: basically ensuring that the string doesn&amp;#8217;t contain any punctuation or spaces. For example, &amp;#8216;Adult &amp;amp; Community Services&amp;#8217; would turn into &amp;#8216;adult-community-services&amp;#8217;.&lt;/p&gt;

&lt;p&gt;Our first task will be to create these slugs. To do this, we&amp;#8217;ll create a new column based on the existing ones by choosing &lt;code&gt;Edit Column &amp;gt; Add Column Based on This Column ...&lt;/code&gt; from the drop-down menu on the appropriate column:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-column-menu.jpg&quot; title=&quot;Edit Column menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Selecting this will bring up a dialog which will ask you to name the new column and then enter a formula to calculate the new value, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/create-slug.jpg&quot; title=&quot;Edit Column menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The default language for this formula is Gridworks&amp;#8217; own, though there are other options available. To create the slug, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;turn the value to lower case&lt;/li&gt;
&lt;li&gt;replace all spaces with hyphens&lt;/li&gt;
&lt;li&gt;remove anything that isn&amp;#8217;t a letter, number, or hyphen&lt;/li&gt;
&lt;li&gt;replace all sequences of two hyphens with a single hyphen&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is done in two stages. The first three steps can be done when the new column is created, using the formula:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;replace(replace(toLowercase(value), &#039; &#039;, &#039;-&#039;), /[^-a-z0-9]/, &#039;&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Gridworks helps by listing the original and resulting values for the first several rows of the spreadsheet, so that you can see whether it&amp;#8217;s working as expected. When you&amp;#8217;re happy, hitting &lt;code&gt;OK&lt;/code&gt; creates the new column.&lt;/p&gt;

&lt;p&gt;The last step (replacing all sequences of two hyphens with a single hyphen) can be done by editing the cells in the new column. Bring up the &lt;code&gt;Edit Cells... &amp;gt; Transform...&lt;/code&gt; dialog using the menu:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/edit-cells-menu-2.jpg&quot; title=&quot;Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;and use the formula:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;replace(value, &#039;--&#039;, &#039;-&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then check the &lt;code&gt;Re-transform until no change&lt;/code&gt; checkbox so that any pairs of hyphens are repeatedly replaced with single hyphens, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/transform.jpg&quot; title=&quot;Edit Cells menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;
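
&lt;p&gt;To make the four steps concrete, here is a hand-worked trace of the example directorate name from earlier passing through them. The intermediate values are written out by hand rather than copied from Gridworks, so treat them as an illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;original   &#039;Adult &amp;amp; Community Services&#039;
step 1     &#039;adult &amp;amp; community services&#039;
step 2     &#039;adult-&amp;amp;-community-services&#039;
step 3     &#039;adult--community-services&#039;
step 4     &#039;adult-community-services&#039;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The ampersand disappears in the third step, leaving two adjacent hyphens behind, which is why the final clean-up is needed.&lt;/p&gt;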

&lt;p&gt;The other tabs in the new column and edit cells dialogs are really helpful. The &lt;code&gt;History&lt;/code&gt; tab lets you choose formulae that you&amp;#8217;ve used before to use again. This is useful here because we want to create the slugs for the Service and Supplier Name in the same way. The &lt;code&gt;Help&lt;/code&gt; tab lists all the functions that you can use within the formula.&lt;/p&gt;

&lt;p&gt;Creating the URIs for the columns proceeds in the same way, except this time the formulae are more like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;http://www.rbwm.gov.uk/id/directorate/&#039; + value
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are two that are slightly different. First, there&amp;#8217;s the URI for the date, which needs to be constructed from the date/time value held by Gridworks. We can do this in two stages. The first is to construct a new column called &amp;#8216;Date&amp;#8217; to hold the formatted date:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datePart(value, &#039;year&#039;) + &#039;-&#039; + 
if (datePart(value, &#039;month&#039;) &amp;lt; 9, &#039;0&#039;, &#039;&#039;) + replace(datePart(value, &#039;month&#039;) + 1, &#039;.0&#039;, &#039;&#039;) + &#039;-&#039; + 
if (datePart(value, &#039;day&#039;) &amp;lt; 10, &#039;0&#039;, &#039;&#039;) + datePart(value, &#039;day&#039;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(note that the &lt;code&gt;datePart()&lt;/code&gt; function returns a 0-based count for the month) and then to create the Date URI column based on this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;http://reference.data.gov.uk/id/day/&#039; + value
&lt;/code&gt;&lt;/pre&gt;
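
&lt;p&gt;As a hand-worked check, assume an &amp;#8216;Updated&amp;#8217; value of 9 April 2010. The values below are what I would expect each part to produce; they are written out by hand, not copied from Gridworks:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datePart(value, &#039;year&#039;)    2010
datePart(value, &#039;month&#039;)   3
datePart(value, &#039;day&#039;)     9
Date                       2010-04-09
Date URI                   http://reference.data.gov.uk/id/day/2010-04-09
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The month is 3 rather than 4 because of the 0-based count. That is why the formula tests for a month less than 9 (not 10) before adding the leading zero, and then adds 1 to the month itself.&lt;/p&gt;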

&lt;p&gt;Second, there&amp;#8217;s the URI for the row (an expenditure line) itself, which needs to be constructed using the row number. It&amp;#8217;s useful to construct it as a local URI (i.e. just the fragment), as this means the same formula can be used to construct the column across different datasets, so it&amp;#8217;s just:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&#039;#&#039; + rowIndex
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Exporting Data&lt;/h2&gt;

&lt;p&gt;Once the extra columns have been made, it&amp;#8217;s time to export data from Gridworks. While Gridworks makes it easy to export to CSV or into Freebase, it&amp;#8217;s also possible to export in any format you want using templates. Use the &lt;code&gt;Project&lt;/code&gt; menu and choose &lt;code&gt;Export Filtered Rows &amp;gt; Templating ...&lt;/code&gt;, as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/project-menu.jpg&quot; title=&quot;Project menu&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that this will only export the rows that you currently have selected, so if you want to export everything, make sure that you deselect any facets that you&amp;#8217;ve currently got selected.&lt;/p&gt;

&lt;p&gt;Choosing the &lt;code&gt;Templating ...&lt;/code&gt; option will open up a dialog that you can use to create whatever format you want. The default, as shown in the following screenshot, is JSON.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/template-dialog-json.jpg&quot; title=&quot;Templating dialog to create JSON&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;On the left are four fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prefix&lt;/strong&gt; is content that&amp;#8217;s put at the top of the exported data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Template&lt;/strong&gt; is content that&amp;#8217;s generated for each row&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Separator&lt;/strong&gt; is content that&amp;#8217;s put between each row&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Suffix&lt;/strong&gt; is content that&amp;#8217;s put at the bottom of the exported data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing to be extremely careful of here is that any changes you make to the fields on the left &lt;strong&gt;will not be saved&lt;/strong&gt; when the dialog is closed. For that reason, it&amp;#8217;s a good idea to create your templates in a separate text file and copy and paste them in. Also note that the sample data on the right is only for the first set of rows, not for the whole spreadsheet.&lt;/p&gt;

&lt;p&gt;We&amp;#8217;re going to generate Turtle using the template, so the next stage is to work out precisely what Turtle to generate. We&amp;#8217;ve been working on a small vocabulary for payment data based on the &lt;a href=&quot;http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html&quot;&gt;Data Cube vocabulary&lt;/a&gt;, and that&amp;#8217;s what I&amp;#8217;ll use here, although it isn&amp;#8217;t yet complete or published in its final form. We&amp;#8217;ll start at the bottom, with the individual rows, and then add extra surrounding information as we go.&lt;/p&gt;

&lt;h3&gt;Row Template&lt;/h3&gt;

&lt;p&gt;Within this data, each row corresponds to a &lt;code&gt;payment:ExpenditureLine&lt;/code&gt; within the dataset. The expenditure lines can be organised into groups based on the &lt;code&gt;payment:Payment&lt;/code&gt; that they&amp;#8217;re associated with, which is indicated through the &amp;#8216;TransNo&amp;#8217; column in the spreadsheet. Within the payment vocabulary we&amp;#8217;re using, we can assign individual expenditure lines to the payment using the &lt;code&gt;payment:expenditureLine&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;payment:payer&lt;/code&gt; of each &lt;code&gt;payment:Payment&lt;/code&gt; is Windsor &amp;amp; Maidenhead council. The &lt;code&gt;payment:payee&lt;/code&gt; is the &amp;#8216;Supplier&amp;#8217; listed in the spreadsheet. The &lt;code&gt;payment:date&lt;/code&gt; is the &amp;#8216;Updated&amp;#8217; date.&lt;/p&gt;

&lt;p&gt;Each individual line in the spreadsheet is a &lt;code&gt;payment:ExpenditureLine&lt;/code&gt; which is associated with one of these payments. The &lt;code&gt;payment:expenditureCode&lt;/code&gt; is the &amp;#8216;Cost Centre&amp;#8217; and the actual &lt;code&gt;payment:amountExcludingVAT&lt;/code&gt; is the &amp;#8216;Amount excl vat £&amp;#8217; value. Some example Turtle for the first line is thus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  qb:slice &amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt;
  a payment:Payment , qb:Slice ;
  rdfs:label &quot;Transaction 2650750&quot;@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference &quot;2650750&quot; ;
  payment:payer &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  payment:payee &amp;lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&amp;gt; ;
  payment:date &amp;lt;http://reference.data.gov.uk/id/day/2010-04-09&amp;gt; ;
  payment:expenditureLine &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label &quot;Expenditure Line 0&quot;@en ;
  qb:dataSet &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  payment:expenditureCode &amp;lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&amp;gt; ;
  payment:amountExcludingVAT 1875.00 .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s the basic data for each line, but there&amp;#8217;s also some other information which should be brought out for each line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the name of the payee&lt;/li&gt;
&lt;li&gt;the date, year, month and day-of-month for the payment, which may help further analysis of the data&lt;/li&gt;
&lt;li&gt;the meaning of the expenditure code (particularly its association to a particular service)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each of these cases, pulling the information out from each line is going to lead to a lot of repetition, because the same payee, date and so on will be described in multiple lines. With a row-by-row template we don&amp;#8217;t have any choice, but we can tidy it up by removing duplicates afterwards. The Turtle for the first line will look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/id/supplier/1st-choice-d-b-driveways-limited&amp;gt;
  a org:Organization ;
  rdfs:label &quot;1st Choice - D B Driveways Limited&quot;@en .

&amp;lt;http://reference.data.gov.uk/id/day/2010-04-09&amp;gt;
  a interval:CalendarDay ;
  rdfs:label &quot;2010-04-09&quot; ;
  time:hasBeginning &amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&amp;gt; ;
  interval:ordinalYear 2010 ;
  interval:ordinalMonthOfYear 4 ;
  interval:ordinalDayOfMonth 9 .

&amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-04-09T00:00:00&amp;gt;
  a time:Instant ;
  time:inXSDDateTime &quot;2010-04-09T00:00:00&quot;^^xsd:dateTime .

&amp;lt;http://www.rbwm.gov.uk/def/cost-centre/LM05&amp;gt;
  a rbwm:CostCentre , skos:Concept ;
  rdfs:label &quot;Cost Centre LM05&quot;@en ;
  rbwm:costCentreCode &quot;LM05&quot;^^rbwm:CostCentreCode ;
  rbwm:service &amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt;
  a rbwm:Service ;
  rdfs:label &quot;Magnet Leisure Centre&quot;@en ;
  rbwm:providedBy &amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt;
  a rbwm:Directorate ;
  rdfs:label &quot;Adult &amp;amp; Community Services&quot;@en ;
  org:unitOf &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  rbwm:provides &amp;lt;http://www.rbwm.gov.uk/id/service/magnet-leisure-centre&amp;gt; .

&amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt;
  org:hasUnit &amp;lt;http://www.rbwm.gov.uk/id/directorate/adult-community-services&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You&amp;#8217;ll see that in the last part of this I&amp;#8217;ve introduced some properties and classes with a &lt;code&gt;rbwm:&lt;/code&gt; prefix. These are for classes and properties that are here in this data, but aren&amp;#8217;t part of the payment vocabulary. The basic schema is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rbwm:CostCentre a rdfs:Class ;
  rdfs:label &quot;Cost Centre&quot;@en ;
  rdfs:comment &quot;A cost centre.&quot;@en .

rbwm:Service a rdfs:Class ;
  rdfs:label &quot;Service&quot;@en ;
  rdfs:comment &quot;A service provided by the council.&quot;@en .

rbwm:Directorate a rdfs:Class ;
  rdfs:label &quot;Directorate&quot;@en ;
  rdfs:comment &quot;A directorate within the council&quot;@en .

rbwm:service a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Service&quot;@en ;
  rdfs:comment &quot;The service associated with a particular cost centre.&quot;@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:Service .

rbwm:providedBy a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Provided By&quot;@en ;
  rdfs:comment &quot;The directorate that provides this service.&quot;@en ;
  rdfs:domain rbwm:Service ;
  rdfs:range rbwm:Directorate .

rbwm:provides a rdf:Property , owl:ObjectProperty ;
  rdfs:label &quot;Provides&quot;@en ;
  rdfs:comment &quot;A service provided by this directorate.&quot;@en ;
  rdfs:domain rbwm:Directorate ;
  rdfs:range rbwm:Service .

rbwm:costCentreCode a rdf:Property , owl:DatatypeProperty ;
  rdfs:label &quot;Cost Centre Code&quot;@en ;
  rdfs:comment &quot;The code of this cost centre.&quot;@en ;
  rdfs:domain rbwm:CostCentre ;
  rdfs:range rbwm:CostCentreCode .

rbwm:CostCentreCode a rdfs:Datatype ;
  rdfs:label &quot;Cost Centre Code&quot;@en ;
  rdfs:comment &quot;A cost centre code consisting of two capital letters followed by two digits.&quot;@en .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This illustrates how individual councils might extend the information that they make available in RDF without having to seek any kind of prior agreement from anyone else. If, later on, a third party starts to make available ontologies for cost centres, services and directorates, Windsor &amp;amp; Maidenhead could start to link up their RDF with those more widely standardised classes and properties, with appropriate use of &lt;code&gt;rdfs:subClassOf&lt;/code&gt; or &lt;code&gt;rdfs:subPropertyOf&lt;/code&gt;.&lt;/p&gt;
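
&lt;p&gt;As a sketch of what that linking might look like, suppose a third party published an ontology under the &lt;code&gt;ex:&lt;/code&gt; prefix. Both the namespace and the &lt;code&gt;ex:&lt;/code&gt; terms below are hypothetical, invented purely for illustration, and the &lt;code&gt;rbwm:&lt;/code&gt; and &lt;code&gt;rdfs:&lt;/code&gt; prefixes are assumed to be declared already:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical third-party ontology; no such vocabulary is implied to exist
@prefix ex: &amp;lt;http://example.org/def/local-government#&amp;gt; .

rbwm:Directorate rdfs:subClassOf ex:Directorate .
rbwm:Service rdfs:subClassOf ex:Service .
rbwm:providedBy rdfs:subPropertyOf ex:providedBy .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The existing data doesn&amp;#8217;t need to change: the council&amp;#8217;s own classes and properties stay as they are, and these few extra statements are enough for anyone who understands the wider ontology to interpret them.&lt;/p&gt;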

&lt;p&gt;Now that we have an idea of what data we can extract for a single row, we can turn this into a Gridworks template. The templates are fairly straightforward. Wherever you want to insert a value from a particular column, you use the syntax &lt;code&gt;${Column Name}&lt;/code&gt;. If you want to do any further processing, you can use the syntax &lt;code&gt;{{Formula}}&lt;/code&gt; to insert the result of a calculation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  qb:slice &amp;lt;${Transaction URI}&amp;gt; .

&amp;lt;${Transaction URI}&amp;gt;
  a payment:Payment , qb:Slice ;
  rdfs:label &quot;Transaction ${TransNo}&quot;@en ;
  qb:sliceStructure payment:payment-slice ;
  payment:transactionReference &quot;${TransNo}&quot; ;
  payment:payer &amp;lt;http://statistics.data.gov.uk/id/local-authority/00ME&amp;gt; ;
  payment:payee &amp;lt;${Supplier URI}&amp;gt; ;
  payment:date &amp;lt;${Date URI}&amp;gt; ;
  payment:expenditureLine &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2${Line URI}&amp;gt;
  a payment:ExpenditureLine , qb:Observation ;
  rdfs:label &quot;Expenditure Line {{rowIndex}}&quot;@en ;
  qb:dataSet &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  payment:expenditureCode &amp;lt;${Cost Centre URI}&amp;gt; ;
  payment:amountExcludingVAT {{cells[&#039;Amount excl vat £&#039;].value + 0}} .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that the last line here uses the expression &lt;code&gt;cells[&#039;Amount excl vat £&#039;].value + 0&lt;/code&gt; in order to ensure that every figure has a decimal place, which makes them into &lt;code&gt;xsd:decimal&lt;/code&gt; values within the resulting RDF.&lt;/p&gt;

&lt;p&gt;I won&amp;#8217;t do the rest of the row template here, though it&amp;#8217;s &lt;a href=&quot;/blog/files/finance_supplier_payments_2010_q2_provenance.ttl&quot;&gt;available in full in a separate file&lt;/a&gt;.&lt;/p&gt;
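
&lt;p&gt;To give a flavour of it, a sketch of the part of the row template that describes the payee and the cost centre might look like the following. This is an illustration rather than a copy of the linked file: it assumes the columns are named &amp;#8216;Supplier Name&amp;#8217;, &amp;#8216;Supplier URI&amp;#8217;, &amp;#8216;Cost Centre&amp;#8217; and &amp;#8216;Cost Centre URI&amp;#8217;, and that a &amp;#8216;Service URI&amp;#8217; column has been derived in the same way as the others.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;${Supplier URI}&amp;gt;
  a org:Organization ;
  rdfs:label &quot;${Supplier Name}&quot;@en .

&amp;lt;${Cost Centre URI}&amp;gt;
  a rbwm:CostCentre , skos:Concept ;
  rdfs:label &quot;Cost Centre ${Cost Centre}&quot;@en ;
  rbwm:costCentreCode &quot;${Cost Centre}&quot;^^rbwm:CostCentreCode ;
  rbwm:service &amp;lt;${Service URI}&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One thing to watch for with a sketch like this is that a supplier name containing a double quote would produce invalid Turtle, so it&amp;#8217;s worth checking the names for quotes before exporting.&lt;/p&gt;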

&lt;p&gt;The other parts of the template are easier to complete. The prefix needs to contain any namespace prefixes that are used within the RDF. It&amp;#8217;s also useful to put a base URI here and describe the dataset itself. The RDF for the dataset should contain a number of properties about the dataset as a whole. There are a number of levels at which the dataset can be described:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;basic metadata such as its title and the license that it&amp;#8217;s available under&lt;/li&gt;
&lt;li&gt;statistical metadata including what dimensions it has and how it&amp;#8217;s sliced&lt;/li&gt;
&lt;li&gt;linked data metadata such as how this dataset links out to other linked datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Turtle for this description is shown here:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments&amp;gt;
  a void:Dataset ;
  void:subset &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; .

&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  a payment:PaymentDataset , void:Dataset ;
  # basic metadata
  rdfs:label &quot;Windsor &amp;amp; Maidenhead Supplier Payments where charge to specific cost centre is &amp;gt;= £500 for period April 2010 - June 2010&quot;@en ;
  dct:license &amp;lt;http://data.gov.uk/id/licence&amp;gt; ;
  dct:temporal [
    # this time is retrieved from the Last-Modified date on the original spreadsheet
    time:hasBeginning &amp;lt;http://reference.data.gov.uk/id/gregorian-instant/2010-08-02T08:37:02&amp;gt;
  ] ;

  # statistical metadata
  qb:structure payment:payments-with-expenditure-structure ;
  qb:sliceKey payment:payment-slice ;
  payment:currency &amp;lt;http://dbpedia.org/resource/Pound_sterling&amp;gt; ;

  # linked data metadata
  void:exampleResource
    &amp;lt;http://www.rbwm.gov.uk/id/transaction/2650750&amp;gt; ,
    &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2#0&amp;gt; ;
  void:vocabulary payment: , qb: , rbwm: ;
  void:subset [
    a void:Linkset ;
    void:linkPredicate qb:slice ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payer ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:payee ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/supplier&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:date ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://reference.data.gov.uk/id/day&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureLine ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/transaction&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate payment:expenditureCode ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/def/cost-centre&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:service ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/def/cost-centre&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:providedBy ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate rbwm:provides ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/service&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:hasUnit ;
    void:subjectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
    void:objectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
  ] , [
    a void:Linkset ;
    void:linkPredicate org:unitOf ;
    void:subjectsTarget &amp;lt;http://www.rbwm.gov.uk/id/directorate&amp;gt; ;
    void:objectsTarget &amp;lt;http://statistics.data.gov.uk/id/local-authority&amp;gt; ;
  ] .
&lt;/code&gt;&lt;/pre&gt;
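
&lt;p&gt;The namespace declarations at the top of the prefix might be sketched as follows. The URIs listed are the standard ones for these widely used vocabularies; using the dataset URI as the base is my assumption here, and the remaining prefixes are left as a comment because their namespace URIs aren&amp;#8217;t given in this post.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# assumption: the dataset URI is used as the base URI
@base &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt; .

@prefix rdf: &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt; .
@prefix rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt; .
@prefix owl: &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
@prefix xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; .
@prefix skos: &amp;lt;http://www.w3.org/2004/02/skos/core#&amp;gt; .
@prefix dct: &amp;lt;http://purl.org/dc/terms/&amp;gt; .
@prefix void: &amp;lt;http://rdfs.org/ns/void#&amp;gt; .
@prefix qb: &amp;lt;http://purl.org/linked-data/cube#&amp;gt; .
@prefix time: &amp;lt;http://www.w3.org/2006/time#&amp;gt; .

# payment:, rbwm:, org:, interval:, opmv: and gridworks: also need declaring;
# take their namespace URIs from each vocabulary&#039;s own documentation
&lt;/code&gt;&lt;/pre&gt;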

&lt;h2&gt;Provenance&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;ve described here, verbally, exactly what I&amp;#8217;ve done in terms of the cleaning of the data, deriving new columns, and the template that I&amp;#8217;ve used to create a Turtle rendition of the data in this spreadsheet. One of the things that we&amp;#8217;ve worked hard on within data.gov.uk is finding ways of expressing this provenance information in RDF. There are two reasons for this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Providing provenance increases transparency and enables you to check the processing that the data has been through, increasing your trust in the data.&lt;/li&gt;
&lt;li&gt;Describing the process in sufficient detail for you to replicate it enables you to modify and repeat the process, which lets you both add value and apply the same processing to your own situation, thus spreading best practice.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The basic provenance vocabulary that we&amp;#8217;re using within data.gov.uk is the &lt;a href=&quot;http://code.google.com/p/opmv/&quot;&gt;Open Provenance Model Vocabulary&lt;/a&gt;. This vocabulary talks about Artifacts, Processes that create and use them, and Agents that control those processes. We&amp;#8217;ve created an extension of this vocabulary specifically to help describe this kind of scenario, where a spreadsheet is processed using Gridworks and then exported using a template. I&amp;#8217;ll put this provenance information in a separate file simply because embedding provenance information, which includes a template, in the template itself gets us into nasty recursion issues.&lt;/p&gt;

&lt;p&gt;As well as the template, there are two supplementary artifacts that we need to record the provenance of this data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Gridworks project itself&lt;/li&gt;
&lt;li&gt;the JSON description of the set of operations performed by Gridworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first can be exported using the &lt;code&gt;Project&lt;/code&gt; menu. The second is accessed through the &lt;code&gt;Undo/Redo&lt;/code&gt; tab as shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/undo-redo.jpg&quot; title=&quot;Undo/Redo tab&quot; style=&quot;text-align: center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This tab shows the actions that have been carried out on the data, and enables you to undo them in sequence. The &lt;code&gt;extract&lt;/code&gt; link at the bottom opens up the dialog shown in the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/files/extract-dialog.jpg&quot; title=&quot;Extract Operations dialog&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You have to manually copy and paste the JSON description from the right of this dialog into a separate file in order to save it.&lt;/p&gt;

&lt;p&gt;We can then start describing the provenance of the RDF. This first description needs to go in the Turtle file itself. It says that the RDF we&amp;#8217;ve created was exported from the Gridworks project by an export process. A simple link to the spreadsheet that was used as the source of the data also provides a quick route back to the original data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2&amp;gt;
  a opmv:Artifact ;
  dct:source &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  gridworks:wasExportedBy &amp;lt;finance_supplier_payments_2010_q2_provenance#gridworks-export&amp;gt; ;
  gridworks:wasExportedFrom &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The provenance information then needs to describe the export process:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;#gridworks-export&amp;gt;
  a gridworks:ExportUsingTemplate , opmv:Process ;
  rdfs:label &quot;Process for Exporting Windsor &amp;amp; Maidenhead data as Turtle&quot; ;
  gridworks:project &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; ;
  gridworks:template &amp;lt;#gridworks-template&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The project itself was created from the original Excel spreadsheet. It was generated by an import that ignored a single non-blank header row, followed by the set of operations described by the JSON.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt;
  a gridworks:Project , opmv:Artifact ;
  rdfs:label &quot;Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 Gridworks Project&quot;@en ;
  gridworks:wasCreatedFrom &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  opmv:wasGeneratedBy &amp;lt;#gridworks-processing&amp;gt; .

&amp;lt;#gridworks-processing&amp;gt;
  a gridworks:Process , opmv:Process ;
  rdfs:label &quot;Processing on the Gridworks Project&quot;@en ;
  common:usedData &amp;lt;http://www.rbwm.gov.uk/public/finance_supplier_payments_2010_q2.xls&amp;gt; ;
  gridworks:ignore 1 ;
  gridworks:operationDescription &amp;lt;finance_supplier_payments_2010_q2_operations.json&amp;gt; .

&amp;lt;finance_supplier_payments_2010_q2_operations.json&amp;gt;
  a gridworks:OperationDescription , opmv:Artifact ;
  rdfs:label &quot;Dump of the Processing carried out by Gridworks on Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 data&quot;@en ;
  gridworks:wasExportedFrom &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; ;
  gridworks:wasExportedBy &amp;lt;#gridworks-operation-description-extraction&amp;gt; .

&amp;lt;#gridworks-operation-description-extraction&amp;gt;
  a gridworks:ExtractOperationDescription , opmv:Process ;
  rdfs:label &quot;Extraction of the operation description from the Windsor &amp;amp; Maidenhead Supplier Payments April 2010 - June 2010 Project from Gridworks&quot;@en ;
  gridworks:project &amp;lt;finance_supplier_payments_2010_q2_project.tar.gz&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The template is described in terms of the separate parts; in fact it&amp;#8217;s useful to use this provenance file as the record of the template that you use, given that Gridworks won&amp;#8217;t save the template in the project itself.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;#gridworks-template&amp;gt;
  a gridworks:Template , opmv:Artifact ;
  gridworks:prefix &quot;&quot;&quot;
...
&quot;&quot;&quot;^^xsd:string ;
  gridworks:rowTemplate &quot;&quot;&quot;
...
&quot;&quot;&quot;^^xsd:string .
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Rinse and Repeat&lt;/h2&gt;

&lt;p&gt;Gridworks makes it easy to repeat a given set of operations on another spreadsheet that follows the same structure. If you download the &lt;a href=&quot;http://www.rbwm.gov.uk/web/finance_payments_to_suppliers.htm&quot;&gt;Windsor and Maidenhead spending data from 2009 Q4&lt;/a&gt; and import it into Gridworks, you&amp;#8217;ll see that it uses the same set of columns as the 2010 Q2 data that we&amp;#8217;ve been looking at. (Strangely enough, the 2010 Q1 data doesn&amp;#8217;t quite follow the same structure as it doesn&amp;#8217;t include the &amp;#8216;TransNo&amp;#8217; column.)&lt;/p&gt;

&lt;p&gt;There are a couple of differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &amp;#8216;Updated&amp;#8217; column isn&amp;#8217;t recognised as holding dates on import; you can use &lt;code&gt;Edit Cells... &amp;gt; Transform...&lt;/code&gt; to change these values into dates using the &lt;code&gt;toDate(value)&lt;/code&gt; formula&lt;/li&gt;
&lt;li&gt;the &amp;#8216;Amount excl vat £&amp;#8217; column isn&amp;#8217;t recognised as holding numbers on import because the values have commas in them; you can use the formula &lt;code&gt;toNumber(replace(value, &#039;,&#039;, &#039;&#039;))&lt;/code&gt; to rectify this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might want to do some more cleaning, for example to check for duplicates, but once that is done, you can use the &lt;code&gt;apply&lt;/code&gt; link at the bottom of the &lt;code&gt;Undo/Redo&lt;/code&gt; tab to apply the JSON operation description that you extracted from the previous project to this one. The templates require only a little tweaking to give different filenames and labels, but otherwise can be used as-is.&lt;/p&gt;

&lt;p&gt;So while the process of cleaning data, deriving values and creating a template for exporting as Turtle is a bit of effort, the likelihood is that you will be able to repeat the same operations on similar data with a minimal amount of work.&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Gridworks is a simply amazing tool for data cleansing, analysis and, as we&amp;#8217;ve seen, transformation. It&amp;#8217;s set to become more so for our purposes in the near future, as it comes to support the mapping of names for things to URIs using configurable reconciliation services (which might allow it to automatically map Government Department names to URIs, for example), and the creation of RDF using a more intuitive and user-friendly approach than the templates that I&amp;#8217;ve illustrated here.&lt;/p&gt;

&lt;p&gt;Of course there are issues, particularly for UK civil servants who typically have to operate on locked-down machines running IE7 (if they&amp;#8217;re lucky). Gridworks also only deals with the fairly simple cases of data that fits in a spreadsheet-like structure, without the complexities of annotations on rows, columns or individual cells that we often see in government data.&lt;/p&gt;

&lt;p&gt;Nevertheless, there&amp;#8217;s huge potential here to provide a fairly easy route to the publication of linked data for people who are familiar with spreadsheets, in particular one that can be tweaked and extended to allow for the variety and complexity of real-world data.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/145#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/59">gridworks</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/58">provenance</category>
 <enclosure url="http://www.jenitennison.com/blog/files/finance_supplier_payments_2010_q2_project.tar.gz" length="458733" type="application/x-gzip" />
 <pubDate>Sun, 22 Aug 2010 22:23:32 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">145 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Why Linked Data for data.gov.uk?</title>
 <link>http://www.jenitennison.com/blog/node/140</link>
 <description>&lt;p&gt;&lt;a href=&quot;http://data.gov.uk/&quot;&gt;data.gov.uk&lt;/a&gt; was finally launched to the public last week (still in beta, but now a more public beta than the beta that it&amp;#8217;s been in for the last few months). It&amp;#8217;s a great step forward, and everyone involved should be proud of both the amount of data that&amp;#8217;s been made available and the website itself, which (&lt;a href=&quot;http://www.independent.co.uk/news/uk/politics/labours-computer-blunders-cost-16326bn-1871967.html&quot;&gt;unlike a lot of UK government IT&lt;/a&gt;) was developed rapidly by a small team based on open source software (and at low cost).&lt;/p&gt;

&lt;p&gt;This is a first step on a long road.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;One of the features of the UK Government&amp;#8217;s approach to freeing data is the emphasis on using &lt;a href=&quot;http://www.data.gov.uk/wiki/Linked_Data&quot;&gt;linked data&lt;/a&gt;. What I don&amp;#8217;t think has really been articulated is either what that means or why we should take this approach. From what I&amp;#8217;ve seen, developers seem to think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;linked data is a synonym for turning everything into RDF and putting it in one big triplestore, equivalent to making one big database of government data and therefore prone to exactly the same, well-known and understood problems that government has with creating huge databases&lt;/li&gt;
&lt;li&gt;linked data requires everyone to agree to the same model and vocabulary, which means huge efforts in standardisation and ends up with something that suits no one&lt;/li&gt;
&lt;li&gt;the UK government will be releasing all their data as linked data immediately, and in no other way&lt;/li&gt;
&lt;li&gt;the UK government has been seduced into using linked data by academics who don&amp;#8217;t understand anything about how the web or the real world works&lt;/li&gt;
&lt;li&gt;the UK government has been seduced into using linked data by big businesses who stand to make a pretty penny providing services to departments that are forced to publish their data in this way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are true. In fact, the UK government is committed to publishing data as linked data because they are convinced it is the &lt;strong&gt;best approach available for publishing data in a hugely diverse and distributed environment, in a gradual and sustainable way&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because linked data is just a term for how to publish data on the web while working &lt;em&gt;with&lt;/em&gt; the web. And the web is the best architecture we know for publishing information in a hugely diverse and distributed environment, in a gradual and sustainable way.&lt;/p&gt;

&lt;p&gt;If you&amp;#8217;re a web developer, you already know that the best APIs are &lt;a href=&quot;http://en.wikipedia.org/wiki/Representational_State_Transfer&quot;&gt;RESTful APIs&lt;/a&gt;. That argument has been won. It means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using (HTTP) URIs to identify resources: naming &lt;em&gt;things&lt;/em&gt; with URIs rather than actions on those things (which are carried out using the standard set of HTTP verbs)&lt;/li&gt;
&lt;li&gt;recognising the distinction between resources and representations of those resources: the same URI might return a different representation of the resource, such as HTML or XML or JSON&lt;/li&gt;
&lt;li&gt;returning self-descriptive messages: being able to process representations in a manner that is obvious from the mime type&lt;/li&gt;
&lt;li&gt;hypermedia as the engine of application state: being able to locate additional resources through the use of (typed) links&lt;/li&gt;
&lt;/ul&gt;
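
&lt;p&gt;As a rough sketch of the resource/representation distinction, the Python below returns two representations of one resource depending on the requested media type. The URI, the fields and the &lt;code&gt;represent&lt;/code&gt; function are all hypothetical, and JSON and CSV are used only to keep the example short; a real server would choose based on the HTTP &lt;code&gt;Accept&lt;/code&gt; header.&lt;/p&gt;

```python
import json

# Hypothetical resource: the URI and fields are made up for illustration.
SCHOOL = {
    "uri": "http://example.org/id/school/123",
    "name": "Example Primary School",
}


def represent(resource, media_type):
    """Return one representation of the same resource, chosen by media type."""
    if media_type == "application/json":
        return json.dumps(resource, sort_keys=True)
    if media_type == "text/csv":
        return "uri,name\n{uri},{name}\n".format(**resource)
    raise ValueError("unsupported media type: " + media_type)


# Same resource, same URI, two different representations.
as_json = represent(SCHOOL, "application/json")
as_csv = represent(SCHOOL, "text/csv")
```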

&lt;p&gt;Linked data is about following these rules for publishing data. It is about using URIs to identify things, providing information at the end of those URIs that is self-descriptive, and linking those things to other things through typed links.&lt;/p&gt;

&lt;p&gt;One of the features of this approach is that it doesn&amp;#8217;t require any big bangs. No one planned the web; no one sat down and mapped out each page and its precise relations to every other page in advance. It grew, and evolved, and continues to grow and evolve every day. It grows through individuals and institutions publishing information for their own reasons and linking to other people who have published information for their own reasons, and, because we have some fundamental standards that clients and servers understand, it All Just Works.&lt;/p&gt;

&lt;h2&gt;Standards&lt;/h2&gt;

&lt;p&gt;Did you notice how I slipped in the &amp;#8220;because we have some fundamental standards that clients and servers understand&amp;#8221;? One standard is obviously HTTP, which controls how clients and servers talk to each other: it allows clients to request pages and servers to respond. Another standard is HTML, which enables browsers to display information in ways that people can understand and (crucially) has a known set of semantics that browsers can use to tell when something is a link, which people can navigate to find more information.&lt;/p&gt;

&lt;p&gt;For linked data, there are two crucial standards: RDF and SPARQL. Yes, I know what you&amp;#8217;re thinking, because, believe me, two years ago that would have been my reaction too, but let me explain why.&lt;/p&gt;

&lt;p&gt;There&amp;#8217;s one way in which publishing data isn&amp;#8217;t like publishing documents: its model. Documents are made up of paragraphs and headings and lists and tables and so on. Data is made up of&amp;#8230; what? Well, at its most basic, it&amp;#8217;s &lt;em&gt;things&lt;/em&gt; that have &lt;em&gt;properties&lt;/em&gt; which have &lt;em&gt;values&lt;/em&gt;. We might call the things &lt;em&gt;objects&lt;/em&gt; or &lt;em&gt;entities&lt;/em&gt;, and call some of the properties &lt;em&gt;relations&lt;/em&gt;. We might even call them &lt;em&gt;records&lt;/em&gt; with &lt;em&gt;columns&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt; and &lt;em&gt;foreign keys&lt;/em&gt;. But however you term them, for better or worse, we do tend to think about data in this way: &lt;em&gt;thing&lt;/em&gt;, &lt;em&gt;property&lt;/em&gt;, &lt;em&gt;value&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So if we are going to publish data on the web, we need a standard way of expressing the data so that a client receiving the data can work out what&amp;#8217;s a &lt;em&gt;thing&lt;/em&gt;, what&amp;#8217;s a &lt;em&gt;property&lt;/em&gt;, what&amp;#8217;s a &lt;em&gt;value&lt;/em&gt;. &lt;strong&gt;And, because this is the web, what&amp;#8217;s a &lt;em&gt;link&lt;/em&gt;&lt;/strong&gt;. This is the fundamental standard we need, and this is what RDF gives.&lt;/p&gt;
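
&lt;p&gt;To make the &lt;em&gt;thing&lt;/em&gt;, &lt;em&gt;property&lt;/em&gt;, &lt;em&gt;value&lt;/em&gt; model concrete, here is a minimal Python sketch that holds statements as three-part tuples and marks which values are links. It does not use a real RDF library; the &lt;code&gt;Link&lt;/code&gt; and &lt;code&gt;Literal&lt;/code&gt; types, the function name and every URI are assumptions made up for illustration.&lt;/p&gt;

```python
from collections import namedtuple

# Two kinds of value: a link to another thing, or a plain literal.
Link = namedtuple("Link", ["uri"])
Literal = namedtuple("Literal", ["text"])

# Hypothetical URIs, used only for illustration.
SCHOOL = "http://example.org/id/school/123"
EX = "http://example.org/def/"

# Each statement is (thing, property, value).
statements = [
    (SCHOOL, EX + "name", Literal("Example Primary School")),
    (SCHOOL, EX + "pupils", Literal("240")),
    (SCHOOL, EX + "localAuthority", Link("http://example.org/id/authority/7")),
]


def links_from(thing, data):
    """Return the URIs that a thing links to."""
    return [v.uri for (s, p, v) in data if s == thing and isinstance(v, Link)]
```

&lt;p&gt;A client that receives these statements can tell, without guessing from the shape of a string, that the local authority is something it could follow for more information.&lt;/p&gt;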

&lt;p&gt;RDF is actually a model rather than a syntax. It&amp;#8217;s a bit like the split between the DOM and HTML or XHTML. The DOM tells the browser how to render the page: the HTML or XHTML is just a syntax which the browser is able to convert into a DOM that it displays. We could imagine browsers converting wiki syntax into a DOM. Or creating a DOM based on XML and XSLT, which of course they all do.&lt;/p&gt;

&lt;p&gt;So, RDF is like the DOM, with varying representations of RDF (XML-based, text-based, JSON-based, even HTML-based) that can be used to pass to the client the underlying model of &lt;em&gt;things&lt;/em&gt; and &lt;em&gt;properties&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt; (some of which are &lt;em&gt;links&lt;/em&gt;). What the client does then is its business: clients that retrieve data aren&amp;#8217;t browsers &amp;#8212; they&amp;#8217;re not all going to display the data, use the same parts of the data, or otherwise process it in the same way &amp;#8212; but they can pull out the &lt;em&gt;things&lt;/em&gt;, &lt;em&gt;properties&lt;/em&gt; and &lt;em&gt;values&lt;/em&gt;, and know which are &lt;em&gt;links&lt;/em&gt;, and this data structure will often, with a good RDF library, map on to some natural structure within whatever programming language is being used, and make the programmer&amp;#8217;s job easier.&lt;/p&gt;

&lt;h2&gt;Vocabularies&lt;/h2&gt;

&lt;p&gt;What we don&amp;#8217;t want to have to define are standard ways of expressing &lt;em&gt;particular&lt;/em&gt; data (such as data about a school) because different individuals and organisations will have completely different ways of thinking about a particular thing. A school itself will have information about uniform and open days; &lt;a href=&quot;http://www.ofsted.gov.uk/&quot;&gt;OFSTED&lt;/a&gt; about performance; &lt;a href=&quot;http://www.edubase.gov.uk/&quot;&gt;Edubase&lt;/a&gt; about administration and pupil numbers; the PTA about after-school activities. Expecting everyone to adopt a particular standard vocabulary for describing a school is as futile as expecting everyone to adopt exactly the same page layout within their web pages, and exactly the same class names in their CSS.&lt;/p&gt;

&lt;p&gt;But we don&amp;#8217;t want to rule out opportunistic alignments where individuals or organisations, for whatever reason, &lt;em&gt;do&lt;/em&gt; want to use the same vocabularies. Look at what&amp;#8217;s happened with classes in HTML. There is absolutely no constraint on what classes people use in their HTML. But there are clusters of web pages that use some of the same classes. Websites that use &lt;a href=&quot;http://microformats.org/&quot;&gt;microformats&lt;/a&gt;. Websites that adopt a particular &lt;a href=&quot;http://en.wikipedia.org/wiki/CSS_framework&quot;&gt;CSS framework&lt;/a&gt;. Importantly, though, even where some classes are shared, it doesn&amp;#8217;t mean that &lt;em&gt;all&lt;/em&gt; classes are shared: adoption of a particular microformat or CSS framework doesn&amp;#8217;t limit the rest of the page.&lt;/p&gt;

&lt;p&gt;RDF has this balance between allowing individuals and organisations complete freedom in how they describe their information and the opportunity to share and reuse parts of vocabularies in a mix-and-match way. This is so important in a government context because (with all due respect to civil servants) we &lt;em&gt;really&lt;/em&gt; want to avoid a situation where we have to get lots of civil servants from multiple agencies into the same room to come up with the single government-approved way of describing a school. We can all imagine how long that would take.&lt;/p&gt;

&lt;p&gt;The other thing about RDF that really helps here is that it&amp;#8217;s easy to align vocabularies if you want to, post-hoc. &lt;a href=&quot;http://www.w3.org/TR/rdf-schema/&quot;&gt;RDFS&lt;/a&gt; and &lt;a href=&quot;http://www.w3.org/TR/owl-overview/&quot;&gt;OWL&lt;/a&gt; define properties that you can use to assert that this property is really the same as that property, or that anything with a value for this property has the same value for that other property. This lowers the risk for organisations who are starting to publish using RDF, because it means that if a new vocabulary comes along they can opportunistically match their existing vocabulary with the new one. It enables organisations to tweak existing vocabularies to suit their purposes, by creating specialised versions of established properties.&lt;/p&gt;
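
&lt;p&gt;The post-hoc alignment described above can be sketched in a few lines of Python. This is not an OWL reasoner: it handles only &lt;code&gt;owl:equivalentProperty&lt;/code&gt;, in a single pass, over plain tuples. The two property URIs, the school URI and the &lt;code&gt;apply_alignment&lt;/code&gt; function are hypothetical.&lt;/p&gt;

```python
OWL_EQUIVALENT = "http://www.w3.org/2002/07/owl#equivalentProperty"

# Hypothetical property URIs from two different publishers.
OLD_NAME = "http://example.org/def/schoolName"
NEW_NAME = "http://example.com/vocab/name"
SCHOOL = "http://example.org/id/school/123"

data = [(SCHOOL, OLD_NAME, "Example Primary School")]

# One extra statement, added after the fact, aligns the two vocabularies.
alignment = [(OLD_NAME, OWL_EQUIVALENT, NEW_NAME)]


def apply_alignment(triples, mappings):
    """Add a copy of each statement under every equivalent property."""
    equivalents = {}
    for a, prop, b in mappings:
        if prop == OWL_EQUIVALENT:
            # Equivalence holds in both directions.
            equivalents.setdefault(a, set()).add(b)
            equivalents.setdefault(b, set()).add(a)
    result = set(triples)
    for s, p, o in triples:
        for other in equivalents.get(p, ()):
            result.add((s, other, o))
    return result


aligned = apply_alignment(data, alignment)
```

&lt;p&gt;The original data is untouched; the publisher only adds one alignment statement, and consumers who know the new vocabulary can then find the name under the property they expect.&lt;/p&gt;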

&lt;p&gt;So the linked data web is designed to grow and evolve in exactly the same way as the human web has grown and evolved. It grows through people adding links to existing data. It grows through people creating their own vocabularies. And it evolves as links break and reform, and vocabularies combine and diverge. It is complex and messy and self-organising.&lt;/p&gt;

&lt;h2&gt;Layers&lt;/h2&gt;

&lt;p&gt;The cornerstone of the great, messy, web is the URI. URIs have two important roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;they identify things&lt;/strong&gt; - If two sets of data use the same URI then it&amp;#8217;s dead easy to work out when they are talking about the same thing, for example to bring together the information published by a school with its OFSTED report and its pupil census. Spread this around to five, ten, twenty datasets from different places all using the same identifier for the school, and you have a huge pool of information. And the great thing about RDF (because it also uses URIs to identify properties) is that those datasets can be combined automatically without worrying about clashes, rather than through painstaking developer effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;they provide somewhere to look for information&lt;/strong&gt; - This is the point of using HTTP URIs, because that look-up is as simple as retrieving a document from the web. This enables programmatic, on-demand, access to the information. Developers don&amp;#8217;t have to download huge database dumps when all they are interested in is a small fraction of that data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
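
&lt;p&gt;Here is a small Python sketch of why shared URIs make combining data close to automatic. The three datasets, their property URIs and the &lt;code&gt;merge&lt;/code&gt; function are invented for illustration; the point is only that the merge needs no knowledge of any individual dataset.&lt;/p&gt;

```python
SCHOOL = "http://example.org/id/school/123"

# Three hypothetical datasets from different publishers, each a list of
# (thing, property, value) statements about the same school URI.
from_school = [(SCHOOL, "http://example.org/school/openDay", "2010-03-12")]
from_inspector = [(SCHOOL, "http://example.org/inspection/rating", "Good")]
from_census = [(SCHOOL, "http://example.org/census/pupils", "240")]


def merge(*datasets):
    """Combine datasets into one description per thing."""
    merged = {}
    for dataset in datasets:
        for thing, prop, value in dataset:
            merged.setdefault(thing, {}).setdefault(prop, set()).add(value)
    return merged


combined = merge(from_school, from_inspector, from_census)
```

&lt;p&gt;Because properties are full URIs rather than short names, two publishers can never clash by accident, so the merge is a plain union with no per-dataset mapping code.&lt;/p&gt;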

&lt;p&gt;But we know that of course sometimes developers &lt;em&gt;do&lt;/em&gt; want to download huge database dumps. So we need URIs for those dumps, and ways to associate metadata with them, and ways to search them. Adopting linked data doesn&amp;#8217;t preclude providing sets of data in larger lumps. In fact, what&amp;#8217;s needed are ways of creating those larger datasets by bringing together the more granular linked data into lists and graphs; this is essentially what SPARQL does.&lt;/p&gt;
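
&lt;p&gt;The following Python is not SPARQL, but it sketches the underlying idea: matching patterns against granular statements to assemble a list. The data, the URIs under example.org and the &lt;code&gt;match&lt;/code&gt; function are assumptions for illustration; a real SPARQL engine does far more (joins, filters, optional patterns and so on).&lt;/p&gt;

```python
SCHOOL_TYPE = "http://example.org/def/School"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NAME = "http://example.org/def/name"

# Hypothetical statements gathered from granular linked data.
triples = [
    ("http://example.org/id/school/1", RDF_TYPE, SCHOOL_TYPE),
    ("http://example.org/id/school/1", NAME, "Example Primary School"),
    ("http://example.org/id/school/2", RDF_TYPE, SCHOOL_TYPE),
    ("http://example.org/id/school/2", NAME, "Sample High School"),
    ("http://example.org/id/authority/7", NAME, "Example Council"),
]


def match(data, s=None, p=None, o=None):
    """Return statements matching a pattern; None acts as a wildcard."""
    return [
        t for t in data
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Build a list of school names: find every school, then look up its name.
schools = [t[0] for t in match(triples, p=RDF_TYPE, o=SCHOOL_TYPE)]
names = [t[2] for school in schools for t in match(triples, s=school, p=NAME)]
```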

&lt;p&gt;We also know that there&amp;#8217;s a trade-off to be made between the power of URIs and the simplicity of using short, unqualified names, particularly when it comes to naming schema-level entities such as properties or classes. Most mashups that we see at the moment bring together just a few datasets, making it easy for developers to scan for naming clashes, or examine values to work out whether a particular property contains a link or not. This is the 80% of the use of data on the web that can be addressed by the 20% solution of the kind of JSON and plain old XML you see in most APIs.&lt;/p&gt;

&lt;p&gt;But publishing with RDF can be the basis of these kinds of simple APIs, and still address the hard 20% that we will encounter quickly as we mash more data together. Any data munger knows that the main challenge of making data available in an easily accessible way is cleaning, tidying, modelling and restructuring. If that&amp;#8217;s done into RDF then creating simple JSON, XML and even CSV is really easy. Creating middleware that will make the creation of these basic APIs really easy must be the top priority of this linked data effort.&lt;/p&gt;
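
&lt;p&gt;As a hedged illustration of how little work that last step can be, the Python below flattens statements into simple JSON and CSV. It assumes each thing has at most one value per property and that the last segment of a property URI makes a usable short name; both are simplifications, and the data and function names are made up.&lt;/p&gt;

```python
import csv
import io
import json

EX = "http://example.org/def/"

# Hypothetical statements, already cleaned and modelled.
triples = [
    ("http://example.org/id/school/1", EX + "name", "Example Primary School"),
    ("http://example.org/id/school/1", EX + "pupils", "240"),
    ("http://example.org/id/school/2", EX + "name", "Sample High School"),
    ("http://example.org/id/school/2", EX + "pupils", "910"),
]


def short_name(uri):
    """Use the last path segment or fragment of a property URI as a key."""
    return uri.replace("#", "/").rstrip("/").split("/")[-1]


def to_records(data):
    """Group statements by thing into flat dictionaries with short keys."""
    records = {}
    for thing, prop, value in data:
        records.setdefault(thing, {"uri": thing})[short_name(prop)] = value
    return list(records.values())


records = to_records(triples)
as_json = json.dumps(records, indent=2)

# The same records written out as CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["uri", "name", "pupils"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()
```

&lt;p&gt;The short names are exactly the trade-off discussed above: convenient for a single dataset, but the full URIs stay available underneath for when datasets are combined.&lt;/p&gt;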

&lt;h2&gt;Reality Check&lt;/h2&gt;

&lt;p&gt;So it&amp;#8217;s all good, right?&lt;/p&gt;

&lt;p&gt;No, of course it&amp;#8217;s not all good. Just as in the early days of the human web, we face huge challenges simply getting tooling to a level where it&amp;#8217;s easy (really easy) for government departments and local authorities to publish data as RDF and for the consumers of the data to use it. We have some patterns for publishing linked data, but, as in the early days of the human web, there&amp;#8217;s still a lot we don&amp;#8217;t know about the best way to make data usable by third parties.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s worth noting that the main challenges we face are ones that are common to all attempts to make data both open and reusable. How do we easily create structured and reusable data from presentation-oriented Excel or (worse) PDFs? How do we handle changes over time, and record the provenance of the information that we provide? How do we represent statistical hypercubes? Or location information? These are things that we will only learn by trying things out.&lt;/p&gt;

&lt;p&gt;In the end, though, the best evidence we have for how the web of linked data will progress is the evidence of how things were for the human web. It is hard to be an early adopter, both for social reasons and technological reasons. Nothing will happen overnight, but gradually there will be network effects: more shared URIs, more shared vocabularies, making it both easier to adopt and more beneficial for everyone.&lt;/p&gt;

&lt;p&gt;Is this a kind of faith? Maybe. I believe in the web.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/140#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/54">datagovuk</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/46">linked data</category>
 <pubDate>Tue, 26 Jan 2010 13:10:58 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">140 at http://www.jenitennison.com/blog</guid>
</item>
</channel>
</rss>

