<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Jeni's Musings</title>
  <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog"/>
  <link rel="self" type="application/atom+xml" href="http://www.jenitennison.com/blog/atom/feed"/>
  <id>http://www.jenitennison.com/blog/atom/feed</id>
  <updated>2008-03-06T14:59:03+00:00</updated>
  <entry>
    <title>The Distributed Web</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/90" />
    <id>http://www.jenitennison.com/blog/node/90</id>
    <published>2008-05-11T22:07:29+01:00</published>
    <updated>2008-05-11T22:07:29+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="atom" />
    <category term="web" />
    <category term="xtech2008" />
    <summary type="html"><![CDATA[<p>XTech was subtitled &#8220;the mobile web&#8221;, but one of the major themes for me was that of <strong>the distributed web</strong>. The <a href="http://assets.expectnation.com/15/event/3/Why%20%22open%22%20matters%20—%20from%20innovation%20to%20commoditisation%20Paper%201.pdf" title="XTech 2008: Why &quot;open&quot; matters — from innovation to commoditisation">first keynote</a>, by <a href="http://www.gardeviance.org/about-me" title="Simon Wardley">Simon Wardley</a>, gave a vision of a future in which hardware, frameworks and applications are services in the cloud rather than products on machines we own: where we use <a href="http://www.flickr.com/" title="flickr">flickr</a> to store our photographs, <a href="http://code.google.com/appengine/" title="Google App Engine">Google App Engine</a> to host our applications, and <a href="http://www.amazon.com/gp/browse.html?node=16427261" title="Amazon Simple Storage Service">Amazon S3</a> to store our data. In <a href="http://www.davidrecordon.com/" title="David Recordon">David Recordon</a>&#8217;s keynote (<a href="http://adactio.com/journal/1461/" title="Adactio: David Recordon’s XTech keynote">written up by Jeremy Keith</a>), he talked about small, specific services provided by sites that aren&#8217;t &#8220;destination sites&#8221;. The same picture was painted by <a href="http://morethanseven.net/" title="Gareth Rushgrove">Gareth Rushgrove</a> in his talk on <a href="http://2008.xtech.org/public/schedule/detail/549" title="XTech 2008: Design Strategies for a Distributed Web">Design Strategies for a Distributed Web</a>.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>XTech was subtitled &#8220;the mobile web&#8221;, but one of the major themes for me was that of <strong>the distributed web</strong>. The <a href="http://assets.expectnation.com/15/event/3/Why%20%22open%22%20matters%20—%20from%20innovation%20to%20commoditisation%20Paper%201.pdf" title="XTech 2008: Why &quot;open&quot; matters — from innovation to commoditisation">first keynote</a>, by <a href="http://www.gardeviance.org/about-me" title="Simon Wardley">Simon Wardley</a>, gave a vision of a future in which hardware, frameworks and applications are services in the cloud rather than products on machines we own: where we use <a href="http://www.flickr.com/" title="flickr">flickr</a> to store our photographs, <a href="http://code.google.com/appengine/" title="Google App Engine">Google App Engine</a> to host our applications, and <a href="http://www.amazon.com/gp/browse.html?node=16427261" title="Amazon Simple Storage Service">Amazon S3</a> to store our data. In <a href="http://www.davidrecordon.com/" title="David Recordon">David Recordon</a>&#8217;s keynote (<a href="http://adactio.com/journal/1461/" title="Adactio: David Recordon’s XTech keynote">written up by Jeremy Keith</a>), he talked about small, specific services provided by sites that aren&#8217;t &#8220;destination sites&#8221;. The same picture was painted by <a href="http://morethanseven.net/" title="Gareth Rushgrove">Gareth Rushgrove</a> in his talk on <a href="http://2008.xtech.org/public/schedule/detail/549" title="XTech 2008: Design Strategies for a Distributed Web">Design Strategies for a Distributed Web</a>.</p>

<!--break-->

<p>So I was surprised at how contentious <a href="http://www.cwi.nl/~steven/" title="Steven Pemberton">Steven Pemberton</a>&#8217;s talk on <a href="http://2008.xtech.org/public/schedule/detail/545" title="XTech 2008: Why you should have a Website">Why you should have a Website</a> (thankfully again <a href="http://adactio.com/journal/1468/" title="Adactio: Why you should have a Website">documented by Jeremy Keith</a>) proved to be. Because to me it seemed to be the logical extension to the distribution of hardware, frameworks and applications: the distribution of data. In fact, I&#8217;ve <a href="http://www.jenitennison.com/blog/node/60" title="Jeni's Musings: A sketch: personal APP servers and feed-based web apps">written about the same idea myself</a>, <a href="http://www.ldodds.com/blog/archives/000330.html" title="Lost Boy: Google AppEngine for Personal Web Presence?">as has Leigh Dodds</a>, more recently.</p>

<p>From the session, the main question seems to be &#8220;how could we do flickr without them holding our data?&#8221; I don&#8217;t want to particularly pick on flickr, especially because it&#8217;s not one of the worst offenders, but the problem of serving and sharing images does illustrate a whole range of issues, so I will use it as an example. I could just as easily be talking about ancestry.com. The way I see it, you need three levels:</p>

<ul>
<li><strong>providers</strong> which make information available in known formats</li>
<li><strong>user interfaces</strong> which provide the end-user with a way to access and manipulate the information</li>
<li><strong>brokers</strong> which locate information on the web and provide an aggregated interface</li>
</ul>

<p>(It occurs to me that this is similar to a model/view/controller architecture: the providers give the model, the user interfaces give the views and the brokers control the flow between the two.)</p>

<p>Where flickr is at the moment is a conglomeration of the three: to have your photo appear on flickr, and to gain the advantages that it gives you in terms of tag-based aggregations and social networking, you have to upload it. They are then the provider of the image+metadata (perhaps the only place it is located on the web), the user interface on the image+metadata (the interface through which the image is annotated), and the broker (they provide keyword-based retrieval, for example).</p>

<p>What would it look like to separate those functions?</p>

<p>First, you, as the owner of the image+metadata, could put your data anywhere: on a home wireless network box, on a webserver hosted by an ISP of your choice, on a site specifically designed for hosting photos. Your data is exposed to the larger web through a standard read/write protocol (I&#8217;m betting on <a href="http://tools.ietf.org/html/rfc5023" title="RFC 5023: The Atom Publishing Protocol">AtomPub</a>) that allows you to provide metadata both about resources and the links between resources. The point of it being read/write is that it allows other people to add metadata to or links from your resource to others, such as adding a comment on your image.</p>

<p>Second, an information broker would locate your photos by crawling for them (or perhaps by you submitting the URL somewhere, but mostly that shouldn&#8217;t be necessary). There are already information brokers around: Google provides a <a href="http://code.google.com/apis/ajaxsearch/documentation/#fonje" title="Google AJAX Search API">RESTful API for general search results</a>, <a href="http://developer.yahoo.com/search/" title="Yahoo Search Web Services">as does Yahoo!</a>; at XTech, <a href="http://dowhatimean.net/" title="Richard Cyganiak">Richard Cyganiak</a> talked about <a href="http://sindice.com/" title="Sindice">Sindice</a>, and <a href="http://sw.deri.org/~aidanh/" title="Aidan Hogan">Aidan Hogan</a> about the <a href="http://www.swse.org/" title="Semantic Web Search Engine">Semantic Web Search Engine</a>, both of which crawl for RDF triples and provide an API for querying the results. In an AtomPub-based environment, you&#8217;d want an information broker that located Atom feeds and resources, indexed them, and provided an AtomPub-based API for publishers to use.</p>

<p>Third, a user interface would provide an attractive and usable front-end that brought together many different sets of information. For example, flickr might combine your friends feed with an image search to provide a view of images recently made available by your friends. There&#8217;s no requirement for your friends to use flickr for this to work: flickr queries a broker for a list of your friends, then queries a broker for images by a particular person; the broker searches its index and points the application to the original resources provided by your friends.</p>

<p>A user interface has another role, though: to add to the web. Flickr wants to make it easy to add tags to photos, to create sets and collections that help you navigate your photos, for others to add comments and so on and on. And that&#8217;s fine, because AtomPub is a read/write API. To add a tag to a photo, flickr simply edits the resource with PUT. To add a comment, it locates the comment feed (which would be referenced from the entry for the particular image) and POSTs to create a new resource. And everyone can see those changes &#8212; the added value that you get from a social network.</p>
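<p>As a rough illustration of those two operations, here is a minimal Python sketch that only <em>builds</em> the HTTP requests a user interface might send; nothing goes over the network. The function names, the <code>example.org</code> URLs and the pared-down entry (a real Atom entry also needs an id, an updated date and an author) are assumptions made for the example, not anything flickr or AtomPub prescribes beyond the PUT/POST pattern itself.</p>

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ATOM_ENTRY_TYPE = "application/atom+xml;type=entry"


def build_entry(title, tags=(), content=None):
    """Serialise a pared-down Atom entry (title, categories, content) as bytes."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    for tag in tags:
        # Each tag becomes an atom:category element on the entry.
        ET.SubElement(entry, f"{{{ATOM_NS}}}category", {"term": tag})
    if content is not None:
        ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"}).text = content
    return ET.tostring(entry, encoding="utf-8")


def tag_photo_request(entry_edit_url, title, tags):
    """Editing a resource: PUT the whole updated entry back to its edit URL."""
    return urllib.request.Request(
        entry_edit_url,
        data=build_entry(title, tags=tags),
        method="PUT",
        headers={"Content-Type": ATOM_ENTRY_TYPE},
    )


def add_comment_request(comment_feed_url, title, text):
    """Creating a resource: POST a new entry to the comment feed."""
    return urllib.request.Request(
        comment_feed_url,
        data=build_entry(title, content=text),
        method="POST",
        headers={"Content-Type": ATOM_ENTRY_TYPE},
    )


# Hypothetical URLs, for illustration only.
put = tag_photo_request("http://example.org/photos/42", "Sunset", ["beach", "holiday"])
post = add_comment_request("http://example.org/photos/42/comments", "Nice shot", "Lovely light")
print(put.get_method(), put.full_url)
print(post.get_method(), post.full_url)
```

<p>Passing either request to <code>urllib.request.urlopen</code> would send it; whether the provider accepts it is where the questions of authentication and trust come in.</p>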

<p>None of this is to say that a single application can&#8217;t act as provider, broker and user interface at the same time, but I&#8217;m certain that users will favour those applications that do <em>all</em> of each role: provide to the whole web, broker the whole web, provide a user interface to the whole web. Flickr is almost there, but it doesn&#8217;t do the whole brokering job because it only brokers the data it provides, and therefore it doesn&#8217;t do the whole user interface job.</p>

<p>This distributed web is a clear win, particularly for users, over walled gardens. They can switch from user interface to user interface, even use more than one at a time (perhaps one application is good for browsing while another is good for categorising), without any cost. They can choose who to use to serve their information on the basis of things that matter when you&#8217;re serving information (low downtime, backups, security, etc.) rather than on how pretty an interface looks or how much functionality it gives you. On the other side of the equation, applications get to do one thing and do it well.</p>

<p>It seems to me that this is simply how the web works, and the questions we should be asking are about privacy and trust and licensing and revenue models and standards development.</p>
    ]]></content>
  </entry>
  <entry>
    <title>XTech 2008</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/89" />
    <id>http://www.jenitennison.com/blog/node/89</id>
    <published>2008-05-11T11:25:40+01:00</published>
    <updated>2008-05-11T11:25:40+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="web" />
    <category term="xtech2008" />
    <summary type="html"><![CDATA[<p>I finally have some time to write about <a href="http://2008.xtech.org/" title="XTech 2008">XTech</a>. What a great conference! I know that <a href="http://times.usefulinc.com/" title="Edd Dumbill's blog">Edd</a> would like it bigger, but its modest size gives it a family feel. Like a family gathering, there are pontificating oldsters whose wisdom goes largely unappreciated by young upstarts who themselves bring energy and innovation to the crowd. And a bunch in the middle trying to translate across the gap: to explain the vision to the old and the reality to the new.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>I finally have some time to write about <a href="http://2008.xtech.org/" title="XTech 2008">XTech</a>. What a great conference! I know that <a href="http://times.usefulinc.com/" title="Edd Dumbill's blog">Edd</a> would like it bigger, but its modest size gives it a family feel. Like a family gathering, there are pontificating oldsters whose wisdom goes largely unappreciated by young upstarts who themselves bring energy and innovation to the crowd. And a bunch in the middle trying to translate across the gap: to explain the vision to the old and the reality to the new.</p>

<!--break-->

<p>Another way of putting it is the divide between the XML crowd and the Web 2.0 crowd. In <a href="http://seanmcgrath.blogspot.com/" title="Sean McGrath's blog">Sean McGrath</a>&#8217;s <a href="http://2008.xtech.org/public/schedule/detail/647" title="Orangutans, Oxen and Ogham stones. Mulling the movable Web">closing keynote</a>, thankfully <a href="http://adactio.com/journal/1469/" title="Adactio: Orangutans, Oxen and Ogham stones">written up by Jeremy Keith</a>, he talked about navigating the path between the document web and the programmable web, and the danger of tipping the balance too far one way or the other. XTech provides a great service in providing that balance, and in giving those of us with feet in both camps a home.</p>

<p>Particularly encouraging for me was to see some of the principles of the programmable web filtering into sites such as the <a href="http://2008.xtech.org/public/schedule/detail/577" title="XTech 2008: Rebuilding guardian.co.uk">Guardian</a> and the <a href="http://2008.xtech.org/public/schedule/detail/536" title="XTech 2008: Here's one I prepared earlier: the BBC's Tech Refresh project">BBC</a> which aren&#8217;t part of the Web 2.0 vowel-deprived clique. It gives me hope for the <a href="http://www.opsi.gov.uk/" title="Office of Public Sector Information">public information sector</a>, in which I happily find myself.</p>

<p>So many many thanks to Edd and to everyone who supplied me with alcohol, deprived me of sleep, and talked to me about tech. I can&#8217;t wait until next year.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Women at XTech 2008</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/88" />
    <id>http://www.jenitennison.com/blog/node/88</id>
    <published>2008-04-30T21:20:46+01:00</published>
    <updated>2008-04-30T21:20:46+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="equality" />
    <category term="xtech2008" />
    <summary type="html"><![CDATA[<p>It&#8217;s XTech 2008 next week. I&#8217;ll be there to talk about the work we  at <a href="http://www.tso.co.uk/" title="The Stationery Office">TSO</a> have been doing with <a href="http://www.opsi.gov.uk/" title="Office of Public Sector Information">OPSI</a> to add semantic information to the <a href="http://www.london-gazette.gov.uk/" title="The London Gazette">London Gazette</a> using RDFa. It&#8217;s really interesting and timely work on all sorts of levels; you can <a href="http://2008.xtech.org/public/schedule/detail/528" title="XTech 2008: SemWebbing the London Gazette">read the abstract of the talk</a> to get a taster and of course it&#8217;ll be published afterwards.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>It&#8217;s XTech 2008 next week. I&#8217;ll be there to talk about the work we  at <a href="http://www.tso.co.uk/" title="The Stationery Office">TSO</a> have been doing with <a href="http://www.opsi.gov.uk/" title="Office of Public Sector Information">OPSI</a> to add semantic information to the <a href="http://www.london-gazette.gov.uk/" title="The London Gazette">London Gazette</a> using RDFa. It&#8217;s really interesting and timely work on all sorts of levels; you can <a href="http://2008.xtech.org/public/schedule/detail/528" title="XTech 2008: SemWebbing the London Gazette">read the abstract of the talk</a> to get a taster and of course it&#8217;ll be published afterwards.</p>

<!--break-->

<p>Anyway, I was just browsing through the schedule and it struck me how few women there were speaking. Looking at the <a href="http://2008.xtech.org/public/schedule/speakers" title="XTech 2008: Speakers">speaker list</a>, out of the 64 speakers, just <strong>three</strong> are women. Three! That&#8217;s not even 5%!</p>

<p>Looking back at last year, it was a little better, at nine out of 94, which is getting towards 10%. It wasn&#8217;t much better at XML 2007, where nine of the 82 speakers (11%) were female. At Extreme 2007, eight of the 60 speakers (13%) were women.</p>

<p>I wonder whether there is a low proportion of women attending these conferences generally, or whether women attend in higher proportions but don&#8217;t submit papers, or whether they submit papers but a smaller proportion are accepted.</p>

<p>Anyway, if you&#8217;re a woman who&#8217;s going to XTech 2008 and you want to get together to <a href="http://uk.youtube.com/watch?v=MMb8Csll9Ws" title="YouTube: Women, Know Your Limits!">talk about kittens</a>, drop me a line.</p>
    ]]></content>
  </entry>
  <entry>
    <title>UK-based XML/XSLT job</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/87" />
    <id>http://www.jenitennison.com/blog/node/87</id>
    <published>2008-04-17T21:02:16+01:00</published>
    <updated>2008-04-17T21:02:16+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="xslt" />
    <category term="work" />
    <summary type="html"><![CDATA[<p>I&#8217;ve been asked if I could advertise the following vacancy. Any interested parties should contact <a href="mailto:gfuller@peopleworks.co.uk" title="Email Graham Fuller">Graham Fuller</a> from <a href="http://www.peopleworks.co.uk" title="Peopleworks">Peopleworks</a> (but say you saw it here; I&#8217;ll get a reward!).</p>

<blockquote>
  <p>Developer * XSLT * XML * Schemas * JavaScript * XHTML * CSS.</p>

  <p>Global retail organisation and household name is looking for 2 (two) Front-End/User Interface Developers to work on a major consumer e-commerce portal.</p>
</blockquote>
    ]]></summary>
    <content type="html"><![CDATA[<p>I&#8217;ve been asked if I could advertise the following vacancy. Any interested parties should contact <a href="mailto:gfuller@peopleworks.co.uk" title="Email Graham Fuller">Graham Fuller</a> from <a href="http://www.peopleworks.co.uk" title="Peopleworks">Peopleworks</a> (but say you saw it here; I&#8217;ll get a reward!).</p>

<blockquote>
  <p>Developer * XSLT * XML * Schemas * JavaScript * XHTML * CSS.</p>
  
  <p>Global retail organisation and household name is looking for 2 (two) Front-End/User Interface Developers to work on a major consumer e-commerce portal.</p>
</blockquote>

<!--break-->

<blockquote>
  <p>MAIN TASKS/REQUIREMENTS:</p>
  
  <ul>
  <li>Development of enterprise solutions</li>
  <li>Development of Consumer driven applications</li>
  <li>Adherence to Software Development Methodology</li>
  </ul>
  
  <p>ESSENTIAL SKILLS:</p>
  
  <ul>
  <li>XSLT &amp; XML</li>
  <li>XML schemas</li>
  <li>JavaScript</li>
  <li>XHTML</li>
  <li>Cross browser and platform CSS positioning</li>
  <li>Accessibility</li>
  </ul>
  
  <p>DESIRABLE (not essential) SKILLS:</p>
  
  <ul>
  <li>Understanding of web design</li>
  <li>JavaScript, including AJAX &amp; DHTML</li>
  <li>OO JavaScript</li>
  </ul>
  
  <p>These roles represent the opportunity to consult for a global multi billion global organisation.</p>
  
  <p>The roles will pay £400 to £450 per day and they are 3 to 6 months contracts.</p>
  
  <p>The role is based in Welwyn Garden City in Hertfordshire.</p>
  
  <p>It is a 15 to 20 minute walk from the train station and there is a company service bus every 15 minutes at peak times from the station to the campus.</p>
  
  <p>naturally there is loads of parking space for cars.</p>
</blockquote>
    ]]></content>
  </entry>
  <entry>
    <title>APECKS, ten years on</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/86" />
    <id>http://www.jenitennison.com/blog/node/86</id>
    <published>2008-04-16T21:34:43+01:00</published>
    <updated>2008-04-16T21:34:43+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="ontologies" />
    <category term="web" />
    <category term="life" />
    <summary type="html"><![CDATA[<p>Roughly ten years ago, I was attending <a href="http://ksi.cpsc.ucalgary.ca/KAW/KAW98/KAW98Proc.html" title="Proceedings of KAW'98">KAW&#8217;98</a>. I remember that conference as one of the best weeks of my life. I had <a href="http://users.ecs.soton.ac.uk/nrs/" title="University of Southampton: Nigel Shadbolt">great</a> <a href="http://www.louisecrow.com/blog/" title="Louise Crow">company</a>. I saw <a href="http://en.wikipedia.org/wiki/Lake_Louise,_Alberta" title="Lake Louise">scenery like I&#8217;d never seen before</a>. I presented <a href="http://ksi.cpsc.ucalgary.ca/KAW/KAW98/tennison/" title="KAW'98: APECKS: A Tool to Support Living Ontologies">my PhD work</a> for the first time to people who were (at least politely) interested in it. And I learned a lot, both from the presentations and less formal discussions.</p>

<p>(I remember driving back to Nottingham when we returned; a rainbow appeared in front of us, seeming to arch over our destination in a perfect finale.)</p>

<p>Looking back at that paper is like looking back at my past generally: much of it makes me cringe, but parts of it are surprisingly good. What&#8217;s interesting is that if you swap a few terms for modern buzzwords, it&#8217;s still a pretty neat idea. It&#8217;s also amazing how far we&#8217;ve come &#8212; how much has become common-place &#8212; in just ten years.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>Roughly ten years ago, I was attending <a href="http://ksi.cpsc.ucalgary.ca/KAW/KAW98/KAW98Proc.html" title="Proceedings of KAW'98">KAW&#8217;98</a>. I remember that conference as one of the best weeks of my life. I had <a href="http://users.ecs.soton.ac.uk/nrs/" title="University of Southampton: Nigel Shadbolt">great</a> <a href="http://www.louisecrow.com/blog/" title="Louise Crow">company</a>. I saw <a href="http://en.wikipedia.org/wiki/Lake_Louise,_Alberta" title="Lake Louise">scenery like I&#8217;d never seen before</a>. I presented <a href="http://ksi.cpsc.ucalgary.ca/KAW/KAW98/tennison/" title="KAW'98: APECKS: A Tool to Support Living Ontologies">my PhD work</a> for the first time to people who were (at least politely) interested in it. And I learned a lot, both from the presentations and less formal discussions.</p>

<p>(I remember driving back to Nottingham when we returned; a rainbow appeared in front of us, seeming to arch over our destination in a perfect finale.)</p>

<p>Looking back at that paper is like looking back at my past generally: much of it makes me cringe, but parts of it are surprisingly good. What&#8217;s interesting is that if you swap a few terms for modern buzzwords, it&#8217;s still a pretty neat idea. It&#8217;s also amazing how far we&#8217;ve come &#8212; how much has become common-place &#8212; in just ten years.</p>

<!--break-->

<p>In modern terms, what I did was develop web-based <a href="http://en.wikipedia.org/wiki/Social_software" title="Wikipedia: Social software">social software</a>, called <acronym title="Adaptive Presentation Environment for Collaborative Knowledge Structuring">APECKS</acronym>, for ontology creation. The idea was that people would create their own ontologies (either from scratch or based on others), and the system would find similarities and differences between them, with the aim of starting conversations about and sharing knowledge.</p>

<p>APECKS was built on top of a <a href="http://en.wikipedia.org/wiki/Web_application_framework" title="Wikipedia: Web application framework">web application framework</a> written in a <a href="http://en.wikipedia.org/wiki/Dynamic_programming_language" title="Wikipedia: Dynamic programming language">dynamic programming language</a>. We didn&#8217;t have <a href="http://en.wikipedia.org/wiki/Ruby_on_Rails" title="Wikipedia: Ruby on Rails">Ruby on Rails</a> in those days: I turned a MOO (a text-based virtual reality) into an HTTP server (with caching and everything!) and that formed the basis of the application.</p>

<p>APECKS was designed to use (lowercase) web services. It used <a href="http://tiger.cpsc.ucalgary.ca/" title="WebGrid III">one</a> for some of the complex ontology comparison that it needed to do. <a href="http://www.w3.org/TR/1998/WD-rdf-syntax-19980216/" title="W3C: RDF Working Draft from February 1998">RDF was nowhere near done</a>; OWL not even a twinkle in its parents&#8217; eyes: nowadays, you&#8217;d build around those formats, which fit fairly well onto the <a href="http://en.wikipedia.org/wiki/Knowledge_Interchange_Format" title="Wikipedia: Knowledge Interchange Format">KIF</a>-based formalism that APECKS used. (The lack of a standard way to make the captured knowledge available was one of the reasons I got interested in XML &#8212; we&#8217;ve just celebrated <em>that</em> 10-year anniversary too.)</p>

<p>APECKS captured change history and design rationale as well as supporting unstructured communication between users. It didn&#8217;t provide feeds because, guess what, <a href="http://en.wikipedia.org/wiki/RSS_(file_format)" title="Wikipedia: RSS">feeds hadn&#8217;t been invented yet</a>. If I were doing it today, they would be a major feature.</p>

<p>APECKS didn&#8217;t do <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer" title="Wikipedia: Representational State Transfer">REST</a> properly, but that concept wasn&#8217;t around either! APECKS was also rather formal and uninventive in getting knowledge out of people (although it did use those knowledge-acquisition techniques that are automatable, such as card sorts). Now, you could make the interface so much better, because now we have <a href="http://en.wikipedia.org/wiki/AJAX" title="Wikipedia: AJAX">AJAX</a>.</p>

<p>Part of me wants to update it. The semantic web is going to happen, and we&#8217;re going to need tools that help people share and link together the ontologies that they create. Tools that help people create ontologies without being semantic-web experts. </p>

<p>But I&#8217;ve been there, and done that, and anyway I&#8217;m sure that today&#8217;s students are creating applications that are much more innovative.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Metadata about RDF triples: reification and Linked Data</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/85" />
    <id>http://www.jenitennison.com/blog/node/85</id>
    <published>2008-04-11T21:59:07+01:00</published>
    <updated>2008-04-11T21:59:07+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="genealogy" />
    <category term="rdf" />
    <summary type="html"><![CDATA[<p>Those of you who have been following this blog will know that I&#8217;ve been thinking recently about <a href="http://www.jenitennison.com/blog/node/67#comment-4512" title="Jeni's Musings: Web 2.0 project: RDF and uncertainty">how to handle uncertainty related to RDF triples</a> (specifically in the context of a genealogical web app). Certainty isn&#8217;t the only kind of metadata-about-triples that you&#8217;d want to keep in an app like this. We need to know things like:</p>

<ul>
<li>who made the statement</li>
<li>when the statement was made</li>
<li>what evidence led to the statement being made</li>
<li>licensing information about the reuse of the statement</li>
<li>(if we go with the rating idea) what ratings the statement has been given</li>
<li>(if we allow editing of statements) what changes have been made to the statement over time</li>
</ul>

<p>and so on. In short, all the metadata that you&#8217;d want to associate with <em>resources</em> you&#8217;d also want to associate with <em>statements</em>.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>Those of you who have been following this blog will know that I&#8217;ve been thinking recently about <a href="http://www.jenitennison.com/blog/node/67#comment-4512" title="Jeni's Musings: Web 2.0 project: RDF and uncertainty">how to handle uncertainty related to RDF triples</a> (specifically in the context of a genealogical web app). Certainty isn&#8217;t the only kind of metadata-about-triples that you&#8217;d want to keep in an app like this. We need to know things like:</p>

<ul>
<li>who made the statement</li>
<li>when the statement was made</li>
<li>what evidence led to the statement being made</li>
<li>licensing information about the reuse of the statement</li>
<li>(if we go with the rating idea) what ratings the statement has been given</li>
<li>(if we allow editing of statements) what changes have been made to the statement over time</li>
</ul>

<p>and so on. In short, all the metadata that you&#8217;d want to associate with <em>resources</em> you&#8217;d also want to associate with <em>statements</em>.</p>

<!--break-->

<p>I&#8217;d anticipated using <a href="http://www.w3.org/TR/rdf-primer/#reification" title="W3C: RDF Primer: Reification">reification</a> to associate metadata with statements. Something like this</p>

<pre><code>&lt;rdf:Statement rdf:about="#statement1"&gt;
  &lt;rdf:subject rdf:resource="/people/CharlesDarwin" /&gt;
  &lt;rdf:predicate rdf:resource="/ontology/event-roles/passenger" /&gt;
  &lt;rdf:object rdf:resource="/events/BeagleVoyage" /&gt;
  &lt;dc:creator rdf:resource="/users/JeniT" /&gt;
  &lt;dc:date rdf:datatype="xsd:date"&gt;2008-04-11&lt;/dc:date&gt;
  &lt;g:certainty rdf:datatype="xsd:decimal"&gt;1.0&lt;/g:certainty&gt;
  ...
&lt;/rdf:Statement&gt;
</code></pre>

<p>or <a href="http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/#section-Syntax-reifying" title="W3C: RDF/XML Syntax Specification: Reifying Statements: rdf:ID">using <code>rdf:ID</code></a>, although this does limit the URI of our statements to hash-URIs:</p>

<pre><code>&lt;rdf:Description rdf:about="/people/CharlesDarwin"&gt;
  &lt;r:passenger rdf:ID="statement1" rdf:resource="/events/BeagleVoyage" /&gt;
&lt;/rdf:Description&gt;
&lt;rdf:Description rdf:about="#statement1"&gt;
  &lt;dc:creator rdf:resource="/users/JeniT" /&gt;
  &lt;dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#date"&gt;2008-04-11&lt;/dc:date&gt;
  &lt;g:certainty rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal"&gt;1.0&lt;/g:certainty&gt;
&lt;/rdf:Description&gt;
</code></pre>

<p>(Please feel free to correct my RDF, RDF-folks!)</p>
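<p>Setting the syntax aside, the reified statement boils down to a handful of triples, all hanging off the statement&#8217;s own URI. The Python sketch below spells them out as plain tuples; the <code>reify</code> helper is made up for this post, and I&#8217;m assuming <code>dc:</code> is bound to the Dublin Core elements namespace. It doesn&#8217;t use any RDF library, so treat it as a way of counting triples rather than as a real implementation.</p>

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # assumed binding for the dc: prefix


def reify(statement_uri, subject, predicate, obj, metadata=None):
    """Return the triples describing one reified statement.

    Each triple is a (subject, predicate, object) tuple of strings.
    `metadata` maps property URIs to values attached to the statement itself.
    """
    triples = [
        (statement_uri, RDF + "type", RDF + "Statement"),
        (statement_uri, RDF + "subject", subject),
        (statement_uri, RDF + "predicate", predicate),
        (statement_uri, RDF + "object", obj),
    ]
    for prop, value in (metadata or {}).items():
        triples.append((statement_uri, prop, value))
    return triples


triples = reify(
    "#statement1",
    "/people/CharlesDarwin",
    "/ontology/event-roles/passenger",
    "/events/BeagleVoyage",
    {DC + "creator": "/users/JeniT", DC + "date": "2008-04-11"},
)
for triple in triples:
    print(triple)

# The reification describes the statement but does not assert it:
# the Darwin/passenger/Beagle triple itself is not in the list.
original = ("/people/CharlesDarwin", "/ontology/event-roles/passenger", "/events/BeagleVoyage")
print(original in triples)
```

<p>Four triples to describe the statement, plus one per piece of metadata, and the statement itself still has to be asserted separately.</p>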

<p>We can embed this information into our web pages using RDFa:</p>

<pre><code>&lt;div about="#statement1" instanceof="rdf:Statement"&gt;
  &lt;p class="statement"&gt;
    &lt;a rel="rdf:subject" href="/people/CharlesDarwin"&gt;
      Charles Darwin
    &lt;/a&gt;
    was a
    &lt;a rel="rdf:predicate" href="/ontologies/event-roles/passenger"&gt;
      passenger
    &lt;/a&gt;
    on the
    &lt;a rel="rdf:object" href="/events/BeagleVoyage"&gt;
      &lt;span about="/people/CharlesDarwin" 
            rel="r:passenger" 
            resource="/events/BeagleVoyage"&gt;
        Beagle Voyage
      &lt;/span&gt;
    &lt;/a&gt;
  &lt;/p&gt;
  &lt;dl class="metadata"&gt;
    &lt;dt&gt;Author:&lt;/dt&gt;
    &lt;dd&gt;
      &lt;a rel="dc:creator" href="/users/JeniT"&gt;
        Jeni Tennison
      &lt;/a&gt;
    &lt;/dd&gt;
    &lt;dt&gt;Date:&lt;/dt&gt;
    &lt;dd property="dc:date" datatype="xsd:date" 
        content="2008-04-11"&gt;
      11 Apr, 2008
    &lt;/dd&gt;
    &lt;dt&gt;Certainty:&lt;/dt&gt;
    &lt;dd property="b:certainty" datatype="xsd:decimal"
        content="1.0"&gt;
      &lt;img src="stars5.gif" alt="five stars" /&gt;
    &lt;/dd&gt;
  &lt;/dl&gt;
&lt;/div&gt;
</code></pre>

<p>Note that I&#8217;ve incorporated both the reified statement and the statement itself into the RDFa. If I&#8217;m correct in my mental parsing of RDFa, I think this leads to the set of triples from the RDF/XML in the above examples plus the triple:</p>

<pre><code>&lt;/people/CharlesDarwin&gt; r:passenger &lt;/events/BeagleVoyage&gt; .
</code></pre>

<p>But then the other day, I was reading the tutorial <a href="http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/" title="How to publish Linked Data on the Web">How to publish Linked Data on the Web</a>, which says</p>

<blockquote>
  <p>We discourage the use of RDF reification as the semantics of reification are unclear and as reified statements are rather cumbersome to query with the SPARQL query language. Metadata can be attached to the information resource instead, as explained in Section 5.</p>
</blockquote>

<p>Jumping to Section 5, I find</p>

<blockquote>
  <p><strong>Metadata:</strong> The representation should contain any metadata you want to attach to your published data, such as a URI identifying the author and licensing information. These should be recorded as RDF descriptions of the information resource that describes a non-information resource; that is, the subject of the RDF triples should be the URI of the information resource. Attaching meta-information to that information resource, rather than attaching it to the described resource itself or to specific RDF statements about the resource (as with RDF reification) plays nicely together with using Named Graphs and the SPARQL query language in Linked Data client applications&#8230;</p>
</blockquote>

<p>There are some examples of what this looks like within the tutorial. The first is an &#8220;<strong>authoritative description</strong>&#8221; found at <code>http://dbpedia.org/data/Alec_Empire</code> after a 303 redirection from <code>http://dbpedia.org/resource/Alec_Empire</code>.</p>

<pre><code># Metadata and Licensing Information
&lt;http://dbpedia.org/data/Alec_Empire&gt;
    rdfs:label "RDF description of Alec Empire" ;
    rdf:type foaf:Document ;
    dc:publisher &lt;http://dbpedia.org/resource/DBpedia&gt; ;
    dc:date "2007-07-13"^^xsd:date ;
    dc:rights &lt;http://en.wikipedia.org/wiki/WP:GFDL&gt; .

# The description
&lt;http://dbpedia.org/resource/Alec_Empire&gt; 
    foaf:name "Empire, Alec" ;
    rdf:type foaf:Person ;
    rdf:type &lt;http://dbpedia.org/class/yago/musician&gt; ;
    rdfs:comment
        "Alec Empire (born May 2, 1972) is a German musician who is ..."@en ;
    rdfs:comment
        "Alec Empire (eigentlich Alexander Wilke) ist ein deutscher Musiker. ..."@de ;
    dbpedia:genre &lt;http://dbpedia.org/resource/Techno&gt; ;
    dbpedia:associatedActs 
      &lt;http://dbpedia.org/resource/Atari_Teenage_Riot&gt; ;
    foaf:page &lt;http://en.wikipedia.org/wiki/Alec_Empire&gt; ;
    foaf:page &lt;http://dbpedia.org/page/Alec_Empire&gt; ; 
    rdfs:isDefinedBy &lt;http://dbpedia.org/data/Alec_Empire&gt; ;
    owl:sameAs &lt;http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b&gt; .
</code></pre>

<p>The second is a <strong>non-authoritative description</strong> found at <code>http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/ChrisAboutRichard</code>:</p>

<pre><code># Metadata and Licensing Information
&lt;&gt;
    rdf:type foaf:Document ;
    dc:author &lt;http://www.bizer.de#chris&gt; ;
    dc:date "2007-07-13"^^xsd:date ;
    cc:license &lt;http://web.resource.org/cc/PublicDomain&gt; .

# The description
&lt;http://richard.cyganiak.de/foaf.rdf#cygri&gt; 
    foaf:name "Richard Cyganiak" ;
    foaf:topic_interest &lt;http://dbpedia.org/resource/Category:Databases&gt; ;
    foaf:topic_interest &lt;http://dbpedia.org/resource/MacBook_Pro&gt; ;
    rdfs:isDefinedBy &lt;http://richard.cyganiak.de/foaf.rdf&gt; ;
    rdfs:seeAlso &lt;&gt; .
</code></pre>

<p>Note that <code>rdfs:isDefinedBy</code> does not necessarily point to the data you get when you retrieve the resource, but to an authoritative description of the resource (presumably there can be more than one). It&#8217;s also associated with a particular <em>resource</em> rather than a particular <em>statement</em>.</p>

<p>To know which metadata applies to a particular statement, an application must know where it got the statement from. In effect, a statement here has <em>four</em> parts: subject, property, object and location (with the possibility that multiple statements with the same subject, property and object might have different locations and therefore different metadata). This is similar to assigning an ID to a statement, as with <code>rdf:ID</code>, but restricts the statement&#8217;s identifier to being the location where it was found.</p>
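
<p>To make the four-part idea concrete, here is a minimal Python sketch of a store that keeps the source document alongside each triple and looks metadata up by that source. The <code>Quad</code> type, the <code>metadata_for</code> function, the second user and the second date are all invented for illustration; none of this comes from the Linked Data tutorial.</p>

```python
from collections import namedtuple

# A statement plus the document it was retrieved from (illustrative names).
Quad = namedtuple("Quad", ["subject", "predicate", "obj", "source"])

# Metadata is keyed by the source document, not by the individual statement.
# The second entry is made-up sample data.
document_metadata = {
    "/people/CharlesDarwin": {"dc:creator": "/users/JeniT", "dc:date": "2008-04-11"},
    "/events/BeagleVoyage": {"dc:creator": "/users/SomeoneElse", "dc:date": "2008-05-01"},
}

# The same subject, property and object, harvested from two different pages.
quads = [
    Quad("/people/CharlesDarwin", "r:passenger", "/events/BeagleVoyage",
         "/people/CharlesDarwin"),
    Quad("/people/CharlesDarwin", "r:passenger", "/events/BeagleVoyage",
         "/events/BeagleVoyage"),
]


def metadata_for(subject, predicate, obj):
    """Return the metadata of every document in which the triple was found."""
    return [
        document_metadata[quad.source]
        for quad in quads
        if (quad.subject, quad.predicate, quad.obj) == (subject, predicate, obj)
    ]
```

<p>Asking for the metadata of the Darwin statement returns two records, one per document it was harvested from, so the same statement can carry different metadata depending on where it was found.</p>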

<p>So what does that mean for the genealogical web app? Well, in the app we&#8217;re going to find any given statement by a particular user quoted on lots of pages. I was intending to RDFa them all but that would mean lots of duplicate statements from different locations, potentially bloating applications that were harvesting the data.</p>

<p>I can&#8217;t work out whether I like or loathe the Linked Data concept of associating metadata with the document in which you find triples. In some ways it seems very natural &#8212; look for information about a resource at the URI for the resource &#8212; but the metadata mechanisms restrict where you can place statements on the web (or at least assign semantics to their location which aren&#8217;t necessarily intended), and that seems like a Bad Thing. On the other hand, perhaps I&#8217;m just being overly influenced by the desire to use RDFa, which does lead one to want to mark up data wherever it appears.</p>

<p>I&#8217;d welcome any advice.</p>
    ]]></content>
  </entry>
  <entry>
    <title>XSLT Q&amp;A: Refactoring templates</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/84" />
    <id>http://www.jenitennison.com/blog/node/84</id>
    <published>2008-04-06T20:20:00+01:00</published>
    <updated>2008-04-06T20:59:55+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="xslt" />
    <summary type="html"><![CDATA[<p>A question about how to refactor some repetitive templates.</p>

<blockquote>
  <p>The issue is in creating XHTML headings.  </p>

  <p>For a small docbook article, I have the following templates in one of my stylesheets:</p>
</blockquote>
    ]]></summary>
    <content type="html"><![CDATA[<p>A question about how to refactor some repetitive templates.</p>

<blockquote>
  <p>The issue is in creating XHTML headings.  </p>
  
  <p>For a small docbook article, I have the following templates in one of my stylesheets:</p>
</blockquote>

<!--break-->

<pre><code>&lt;xsl:template match="article/title | article/info/title"&gt;
  &lt;h1&gt;&lt;xsl:apply-templates /&gt;&lt;/h1&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="article/section/title"&gt;
  &lt;h2&gt;&lt;xsl:apply-templates /&gt;&lt;/h2&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="article/section/section/title"&gt;
  &lt;h3&gt;&lt;xsl:apply-templates /&gt;&lt;/h3&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="article/section/section/section/title"&gt;
  &lt;h4&gt;&lt;xsl:apply-templates /&gt;&lt;/h4&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="article/section/section/section/section/title"&gt;
  &lt;h5&gt;&lt;xsl:apply-templates /&gt;&lt;/h5&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="article/section/section/section/section/section/title"&gt;
  &lt;h6&gt;&lt;xsl:apply-templates /&gt;&lt;/h6&gt;
&lt;/xsl:template&gt;
</code></pre>

<blockquote>
  <p>Obviously this was a quick and (VERY) dirty way to achieve the output I wanted.</p>
  
  <p>So, I know you can do something similar with an <code>&lt;xsl:choose&gt;</code> and some cases, but I have a feeling there&#8217;s a more automatic way.</p>
</blockquote>

<p>Seek out the similarities. The last five of these templates all match <code>&lt;title&gt;</code> elements within a <code>&lt;section&gt;</code> element. They all create an XHTML heading element and apply templates to the content of the <code>&lt;title&gt;</code> to get the content of the heading.</p>

<p>Identify the differences. They&#8217;re different in the level of heading that they create and in the number of ancestor <code>&lt;section&gt;</code> elements the <code>&lt;title&gt;</code> has.</p>

<p>Find the algorithm. Here&#8217;s the mapping:</p>

<table>
  <thead>
    <tr>
      <th>number of <code>&lt;section&gt;</code> ancestors</th>
      <th>required heading</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <td>2</td>
      <td>3</td>
    </tr>
    <tr>
      <td>3</td>
      <td>4</td>
    </tr>
    <tr>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <td>5</td>
      <td>6</td>
    </tr>
  </tbody>
</table>

<p>So the level of the heading is the number of ancestor <code>&lt;section&gt;</code> elements plus one.</p>

<p>Put it together. Get the number of ancestor <code>&lt;section&gt;</code> elements with <code>count(ancestor::section)</code>. Build the name of the heading element with an attribute value template in the <code>name</code> attribute of <code>&lt;xsl:element&gt;</code>.</p>

<pre><code>&lt;xsl:template match="section/title"&gt;
  &lt;xsl:variable name="nAncestorSections"
    select="count(ancestor::section)" /&gt;
  &lt;xsl:variable name="headingLevel"
    select="$nAncestorSections + 1" /&gt;
  &lt;xsl:element name="h{$headingLevel}"&gt;
    &lt;xsl:apply-templates /&gt;
  &lt;/xsl:element&gt;
&lt;/xsl:template&gt;
</code></pre>

<p>Of course there <em>are</em> differences between this refactored code and the original. In particular, this template deals improperly with the case where there are more than five nested sections, because it creates an <code>&lt;h7&gt;</code> element, which isn&#8217;t legal. If you thought that was likely to occur, you could change how <code>$headingLevel</code> is calculated to:</p>

<pre><code>&lt;xsl:variable name="headingLevel"&gt;
  &lt;xsl:choose&gt;
    &lt;xsl:when test="$nAncestorSections &gt;= 5"&gt;6&lt;/xsl:when&gt;
    &lt;xsl:otherwise&gt;
      &lt;xsl:value-of select="$nAncestorSections + 1" /&gt;
    &lt;/xsl:otherwise&gt;
  &lt;/xsl:choose&gt;
&lt;/xsl:variable&gt;
</code></pre>

<p>or:</p>

<pre><code>&lt;xsl:variable name="headingLevel"
  select="if ($nAncestorSections &gt;= 5)
          then 6 else $nAncestorSections + 1" /&gt;
</code></pre>

<p>in XSLT 2.0.</p>
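
<p>The same mapping is easy to sanity-check outside XSLT. The following Python sketch is not part of the original answer, and the function name <code>heading_levels</code> is invented here. It walks nested <code>&lt;section&gt;</code> elements with ElementTree, counting section ancestors as it descends and capping the heading level at 6. Like the original templates, it only follows sections nested directly inside the article or inside other sections.</p>

```python
import xml.etree.ElementTree as ET


def heading_levels(article):
    """Yield (title text, heading level) for each section title in the article."""

    def walk(element, ancestor_sections):
        for child in element:
            if child.tag == "section":
                # Number of section ancestors that this section's title has.
                count = ancestor_sections + 1
                title = child.find("title")
                if title is not None:
                    # Heading level is the ancestor count plus one, capped at h6.
                    yield title.text, min(count + 1, 6)
                yield from walk(child, count)

    yield from walk(article, 0)
```

<p>Given an article parsed with <code>ET.fromstring</code> that has seven sections nested inside each other, this reports levels 2, 3, 4, 5 and then 6 for each of the three deepest titles.</p>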

<p>The other problem is that the template deals differently with <code>&lt;title&gt;</code> elements that appear within a <code>&lt;section&gt;</code> whose parent is neither <code>&lt;article&gt;</code> nor another <code>&lt;section&gt;</code> (which aren&#8217;t matched by the original templates). There are other possible parents for <code>&lt;section&gt;</code>, namely <code>&lt;appendix&gt;</code>, <code>&lt;chapter&gt;</code>, <code>&lt;partintro&gt;</code> and <code>&lt;preface&gt;</code>, so if these elements are likely to appear in the subset of DocBook you&#8217;re using and you want the code to behave differently, you need to either add more templates or add some extra conditions to this one.</p>
    ]]></content>
  </entry>
  <entry>
    <title>Free Our Bills</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/83" />
    <id>http://www.jenitennison.com/blog/node/83</id>
    <published>2008-03-31T20:10:14+01:00</published>
    <updated>2008-03-31T20:10:14+01:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="xml" />
    <category term="xslt" />
    <category term="legislation" />
    <summary type="html"><![CDATA[<p>The <a href="http://www.theyworkforyou.com/freeourbills/" title="TheyWorkForYou.com: Free Our Bills">Free Our Bills</a> campaign was launched recently in the UK. <a href="http://www.theregister.co.uk/2008/03/26/mysociety_xml_bills_cameron/comments/#c_185029" title="The Register: Comments on UK.gov urged to adopt web-friendly legislation format">Some of the comments I&#8217;ve seen</a> about the campaign make me think that it might be helpful if people understood more about how Bills and legislation get published in the UK. I thought I&#8217;d offer a bit of background based on my experience (though there are many people with more intimate knowledge of the processes involved; perhaps they&#8217;ll correct me when I get it wrong).</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>The <a href="http://www.theyworkforyou.com/freeourbills/" title="TheyWorkForYou.com: Free Our Bills">Free Our Bills</a> campaign was launched recently in the UK. <a href="http://www.theregister.co.uk/2008/03/26/mysociety_xml_bills_cameron/comments/#c_185029" title="The Register: Comments on UK.gov urged to adopt web-friendly legislation format">Some of the comments I&#8217;ve seen</a> about the campaign make me think that it might be helpful if people understood more about how Bills and legislation get published in the UK. I thought I&#8217;d offer a bit of background based on my experience (though there are many people with more intimate knowledge of the processes involved; perhaps they&#8217;ll correct me when I get it wrong).</p>

<!--break-->

<ul>
<li><p>Bills are draft legislation that is under discussion within the House of Commons or House of Lords. A Bill becomes law (legislation) when it is enacted.</p></li>
<li><p>Bills are published by Parliament and are available on the <a href="http://services.parliament.uk/bills/" title="UK Parliament: Bills Before Parliament">Parliament website</a>. Legislation is published by <a href="http://www.tso.co.uk/" title="The Stationery Office">The Stationery Office (TSO)</a> under contract to the Office of Public Sector Information (OPSI) on the <a href="http://www.opsi.gov.uk/legislation" title="OPSI: Legislation">OPSI website</a>.</p></li>
<li><p>Bills are changed (amended) as they progress through the Houses of Parliament. People are mostly interested in the most recent version of a Bill. Legislation can be changed (amended) by other legislation; the version of a piece of legislation with all the changes applied to it is known as consolidated legislation. Consolidated legislation is published in the <a href="http://www.statutelaw.gov.uk" title="Statute Law Database">Statute Law Database</a> as well as (to a more limited extent) on the <a href="http://www.opsi.gov.uk/legislation/revised" title="OPSI: Revised Legislation">OPSI website</a>.</p></li>
<li><p>Bills are edited by a dedicated team of Parliament employees who must reflect the amendments that the MPs say they want to make. They use a WYSIWYG XML editor. As is usual in an environment that has only been concerned about printed copies for centuries, they tend to focus on appearance rather than semantics, even when the XML supports the semantics.</p></li>
<li><p>The Free Our Bills campaign is not about making Bills (or legislation) easier for humans to read and understand; it&#8217;s about making it easier to extract information from a Bill so that people can be notified when a new Bill comes along on a subject they care about, or an old Bill is redrafted, and so on.</p></li>
<li><p>Bills are already available for the public to view on the web, in PDF and HTML forms. The problem is that the HTML is Really Really Bad (<a href="http://www.publications.parliament.uk/pa/ld200708/ldbills/044/08044.i-v.html" title="Parliament: Climate Change Bill">View Source to see</a>) and that makes it Really Really Hard to extract useful information from them.</p></li>
<li><p>There are reasons for the Bills HTML being Really Really Bad:</p>

<ul><li>The HTML must look <em>exactly</em> like it does in printed form, otherwise Members of Parliament (MPs) would get Really Really Confused.</li>
<li>MPs refer to pieces of a Bill (which they might want to change) by page and line number, not by the semantic structure of the Bill, so the HTML must have page and line numbers in it or MPs would get Really Really Confused. </li>
<li>Although the formatting of Bills is pretty consistent, there&#8217;s always the chance that a piece will need to be formatted specially. It might be safe to assume a particular presentation for a particular semantic 99% of the time, but if that 1% isn&#8217;t formatted differently, MPs would be Really Really Confused.</li>
<li>The code that creates the Bill HTML was written several years ago, when browser support for CSS was Really Really Bad.</li></ul></li>
<li><p>The picture for legislation is rather better because a strategic decision was made to focus on semantics rather than presentation. When a Bill is enacted, it gets converted into <a href="http://www.opsi.gov.uk/legislation/schema/" title="OPSI: Legislation schema">reasonably good semantic XML</a>, which forms the basis of all the HTML views. It also helps that this HTML was designed fairly recently, for modern browsers; it makes heavy use of CSS so there&#8217;s relatively little obfuscation of the content.</p></li>
</ul>

<p>I think there are interesting general lessons here:</p>

<ul>
<li><p><strong>Different user communities have different requirements.</strong> MPs&#8217; requirements of Bills differ from those of the general public, who don&#8217;t care (as) much about line or page numbers. On the other hand, you need to actually consult with users about what they need rather than make assumptions about it: are MPs really likely to get Really Really Confused if the HTML presentation of a Bill looks slightly different from the PDF print version? I don&#8217;t know.</p></li>
<li><p><strong>Authors don&#8217;t care about what they don&#8217;t use.</strong> When the only way of using a Bill is to print it, it&#8217;s natural that authors and publishers only care about how it looks when it&#8217;s printed. Training people to care about semantic markup is really hard, and it&#8217;s made harder by WYSIWYG tools that allow them to override the semantic style. If a difference isn&#8217;t visible, then in the author&#8217;s eyes it doesn&#8217;t exist.</p></li>
<li><p><strong>You have to positively decide to ignore appearance.</strong> When transforming from a WYSIWYG view, replicating appearance is the obvious thing to do. But it&#8217;s worthwhile in the long run to focus on extracting the semantics, because the resulting documents are so much more reusable.</p></li>
<li><p><strong>HTML, XML and XSLT are not inherently good.</strong> Parliament wanted Bills in HTML so that they were more accessible on the web. But the HTML is dreadfully inaccessible because of the other requirements placed on it. Similarly, XML can be incredibly obfuscated, or entirely about presentation, as formats such as OOXML illustrate. And just because your code is written in XSLT does not make it inherently easier to maintain than (say) a SAX transformation. It&#8217;s easy to misuse a technology.</p></li>
<li><p><strong>Developers who produce atrocious HTML aren&#8217;t necessarily ignorant.</strong> Unfortunately, there&#8217;s sometimes a limit to how much you can argue with your customers.</p></li>
</ul>
    ]]></content>
  </entry>
  <entry>
    <title>PRESTO and the limits of XPath-based URLs</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/80" />
    <id>http://www.jenitennison.com/blog/node/80</id>
    <published>2008-03-13T20:18:56+00:00</published>
    <updated>2008-03-13T20:18:56+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="web" />
    <category term="xpath" />
    <summary type="html"><![CDATA[<p>Rick Jelliffe has been writing recently about <a href="http://www.oreillynet.com/xml/blog/2008/02/presto_a_www_information_archi.html" title="PRESTO - A WWW Information Architecture for Legislation and Public Information systems">PRESTO</a>, most recently about the <a href="http://www.oreillynet.com/xml/blog/2008/03/presto_urls_as_xpaths_to_views.html" title="PRESTO: URLs as XPaths to views of information">design of URLs</a> based on the PRESTO system. In his latest post, Rick talks about using XPath as the basis of a URL scheme:</p>

<blockquote>
  <p>The Xpath for accessing a particular part’s title would be /law/part[2]/title so the PRESTO URLs would need some kind of convention.</p>

  <p>[snip]</p>

  <p>Now, I am not sure I understand the issues well enough to say which system for indexing is absolutely best. But I think the advantage of <code>http://www.eg.com/law/part2/title</code> over  <code>http://www.eg.com/law/part2/title</code> is that it is probably a more common case that your system is interested in <code>/law/part[2]/title</code> rather than all titles of parts <code>/law/part/title</code>. But it is a matter of the particular use case and the consequent virtual schema.</p>
</blockquote>
    ]]></summary>
    <content type="html"><![CDATA[<p>Rick Jelliffe has been writing recently about <a href="http://www.oreillynet.com/xml/blog/2008/02/presto_a_www_information_archi.html" title="PRESTO - A WWW Information Architecture for Legislation and Public Information systems">PRESTO</a>, most recently about the <a href="http://www.oreillynet.com/xml/blog/2008/03/presto_urls_as_xpaths_to_views.html" title="PRESTO: URLs as XPaths to views of information">design of URLs</a> based on the PRESTO system. In his latest post, Rick talks about using XPath as the basis of a URL scheme:</p>

<blockquote>
  <p>The Xpath for accessing a particular part’s title would be /law/part[2]/title so the PRESTO URLs would need some kind of convention.</p>
  
  <p>[snip]</p>
  
  <p>Now, I am not sure I understand the issues well enough to say which system for indexing is absolutely best. But I think the advantage of <code>http://www.eg.com/law/part2/title</code> over  <code>http://www.eg.com/law/part2/title</code> is that it is probably a more common case that your system is interested in <code>/law/part[2]/title</code> rather than all titles of parts <code>/law/part/title</code>. But it is a matter of the particular use case and the consequent virtual schema.</p>
</blockquote>

<!--break-->

<p>This is of particular interest to me because I&#8217;ve recently been involved in putting some of the <a href="http://www.opsi.gov.uk/legislation/about_legislation.htm" title="OPSI: Legislation">UK&#8217;s legislation online</a>. We don&#8217;t expose the parts/sections and so on as individual documents at the moment (although this <em>is</em> something that you get with the <a href="http://www.statutelaw.gov.uk/" title="The Statute Law Database">Statute Law Database</a>, albeit with an awful URL scheme).</p>

<p>Anyway, we do have <em>anchors</em> for parts/sections within the main legislation which follow a similar scheme to the one that Rick suggests here. But they have a drawback: at least for consolidated legislation (which reflects the &#8220;current state&#8221; of legislation that has been amended by later legislation), the anchors don&#8217;t reflect the semantics of the numbering scheme used by the document. For example, see <a href="http://www.opsi.gov.uk/RevisedStatutes/Acts/ukpga/1977/cukpga_19770042_en_2#pt1-pb2-l1g6" title="OPSI: Legislation: Revised: Rent Act 1977: Section 5A">Section 5A of the Rent Act 1977</a>, whose URL is:</p>

<pre><code>http://www.opsi.gov.uk/RevisedStatutes/Acts/ukpga/1977/cukpga_19770042_en_2#pt1-pb2-l1g6
</code></pre>

<p>As you can see, the URL ends in a 6 rather than 5A because it&#8217;s the 6th Section that appears in the document.</p>

<p>The thing is that generic, position-based XPaths into content are seldom the ones that make most sense semantically. A friendly XPath to Section 5A would look like:</p>

<pre><code>/part[1]/group[2]/section[3]
</code></pre>

<p>and even if you just counted sections it would be:</p>

<pre><code>//section[6]
</code></pre>

<p>when what you really want is the equivalent of:</p>

<pre><code>//section[number = '5A']
</code></pre>

<p>Given this, I wonder if a &#8220;striped&#8221; URL scheme would be better, by which I mean something that follows the general pattern <code>/name/identifier/name/identifier</code>. For example:</p>

<pre><code>/part/I/section/5A
</code></pre>

<p>There are several advantages to this. The resulting URLs are more semantically meaningful than those based on positions. They are more robust to changes in the document (which naturally happen with consolidated legislation). They also provide a neat method of returning <em>all</em> the sections in a particular part, such as:</p>

<pre><code>/part/I/section
</code></pre>

<p>(though you could get this advantage with a position-based scheme as well, depending on how you map from XPath to URL).</p>

<p>The main disadvantage is that you have to provide a custom mapping from XPath to URL, because it&#8217;s not immediately obvious what identifier to use for a given element: it might be a <code>&lt;number&gt;</code> element child for one kind of element, but an <code>id</code> attribute for another element, and the position of the child for some other element. Of course you could add annotations to your schema to indicate what acts as the identifier for that particular element type, but it does raise the implementation barrier.</p>
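
<p>As a rough illustration of that custom mapping, here is a minimal Python sketch that turns a striped path into an XPath. The <code>IDENTIFIERS</code> table stands in for the schema annotations mentioned above; its contents (including the <code>schedule</code> entry), the function name and the fall-back to position are all assumptions made for the example, not part of PRESTO or the OPSI site.</p>

```python
from itertools import zip_longest

# Hypothetical table saying what identifies each element type: a child
# element name, or an attribute written as @name. Unlisted element types
# fall back to their position among their siblings.
IDENTIFIERS = {
    "part": "number",
    "section": "number",
    "schedule": "@id",
}


def striped_url_to_xpath(path):
    """Translate /name/identifier/name/identifier into an XPath expression."""
    segments = [segment for segment in path.split("/") if segment]
    steps = []
    # Even segments are element names, odd segments are their identifiers.
    for name, identifier in zip_longest(segments[0::2], segments[1::2]):
        if identifier is None:
            # A trailing name with no identifier selects all such elements.
            steps.append(name)
            continue
        if "'" in identifier:
            raise ValueError("identifier cannot be quoted safely: " + identifier)
        key = IDENTIFIERS.get(name)
        if key is not None:
            steps.append(f"{name}[{key} = '{identifier}']")
        elif identifier.isdigit():
            steps.append(f"{name}[{identifier}]")
        else:
            raise ValueError("no identifier declared for element type: " + name)
    return "//" + "/".join(steps)
```

<p>With this table, <code>/part/I/section/5A</code> becomes <code>//part[number = 'I']/section[number = '5A']</code>, and <code>/part/I/section</code> selects every section in Part I.</p>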
    ]]></content>
  </entry>
  <entry>
    <title>RELAX NG for matching</title>
    <link rel="alternate" type="text/html" href="http://www.jenitennison.com/blog/node/79" />
    <id>http://www.jenitennison.com/blog/node/79</id>
    <published>2008-03-06T14:59:03+00:00</published>
    <updated>2008-03-06T14:59:03+00:00</updated>
    <author>
      <name>Jeni</name>
    </author>
    <category term="pipelines" />
    <category term="schema" />
    <summary type="html"><![CDATA[<p>I&#8217;m still thinking about doing <a href="http://www.jenitennison.com/blog/node/76" title="Jeni's Musings: Automatic markup and XML pipelines">automatic markup with XML pipelines</a>, and the kind of components that you might need in such a pipeline. These are the useful ones (list inspired by the components offered by <a href="http://www.gate.ac.uk/" title="General Architecture for Text Engineering">GATE</a>):</p>

<ul>
<li>a <strong>tokeniser</strong> that uses regular expressions to add markup to plain text</li>
<li>a <strong>gazetteer</strong> that uses a lookup to add markup to plain text</li>
<li>an <strong>annotater</strong> that adds attributes to existing elements based on their context/content</li>
<li>a <strong>grouper</strong> that adds markup around sequences of existing markup</li>
<li>a <strong>stripper</strong> that removes markup</li>
<li>a general purpose <strong>transformer</strong> that uses XSLT to do just about everything else</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>I&#8217;m still thinking about doing <a href="http://www.jenitennison.com/blog/node/76" title="Jeni's Musings: Automatic markup and XML pipelines">automatic markup with XML pipelines</a>, and the kind of components that you might need in such a pipeline. These are the useful ones (list inspired by the components offered by <a href="http://www.gate.ac.uk/" title="General Architecture for Text Engineering">GATE</a>):</p>

<ul>
<li>a <strong>tokeniser</strong> that uses regular expressions to add markup to plain text</li>
<li>a <strong>gazetteer</strong> that uses a lookup to add markup to plain text</li>
<li>an <strong>annotater</strong> that adds attributes to existing elements based on their context/content</li>
<li>a <strong>grouper</strong> that adds markup around sequences of existing markup</li>
<li>a <strong>stripper</strong> that removes markup</li>
<li>a general purpose <strong>transformer</strong> that uses XSLT to do just about everything else</li>
</ul>

<!--break-->

<p>The &#8220;grouper&#8221; is the most interesting and difficult of these. It needs to act like a tokeniser, but with regular expressions over markup rather than over text. For example, say I had:</p>

<pre><code>&lt;number&gt;06&lt;/number&gt;&lt;punc&gt;/&lt;/punc&gt;&lt;number&gt;03&lt;/number&gt;&lt;punc&gt;/&lt;/punc&gt;&lt;number&gt;08&lt;/number&gt;
</code></pre>

<p>I want to be able to create a rule that says &#8220;any sequence that looks like a number element that contains a two-digit number between 1 and 31, followed by a punc element that contains a slash, followed by another two-digit number between 1 and 12, followed by a punc element that contains a slash, followed by another two-digit number should be wrapped in a date element&#8221;.</p>

<p>Now this is something that XPath is really bad at. Try writing an expression that selects, from a sequence of elements that may contain other <code>&lt;number&gt;</code> and <code>&lt;punc&gt;</code> elements as well as other elements, only those sequences of elements that match the pattern I just described. It&#8217;s something like:</p>

<pre><code>number[. &gt;= 1 and . &lt;= 31 and string-length(.) = 2]
      [following-sibling::*[1]/self::punc = '/']
      [following-sibling::*[2]/self::number[. &gt;= 1 and . &lt;= 12 and string-length(.) = 2]]
      [following-sibling::*[3]/self::punc = '/']
      [following-sibling::*[4]/self::number[string-length(.) = 2]]
  /(self::number, following-sibling::*[position() &lt;= 4])
</code></pre>

<p>which is fiddly and messy and only works in this particular example because I know precisely how many elements there are supposed to be in the group.</p>

<p>In fact, it&#8217;s even difficult to do this kind of grouping using XSLT, even with <code>&lt;xsl:for-each-group&gt;</code> because the grouping is designed around elements either returning the same value or starting or ending with a particular kind of element, rather than grouping together a sequence that has a particular internal structure.</p>

<p>The language that <em>is</em> designed to describe sequences of elements is RELAX NG. Obviously RELAX NG is really useful as a schema language, but it&#8217;s really all to do with defining regular expressions over XML structures. We can use RELAX NG to describe the pattern of elements we want to match:</p>

<pre><code>&lt;group&gt;
  &lt;element name="number"&gt;
    &lt;data type="integer"&gt;
      &lt;param name="minInclusive"&gt;1&lt;/param&gt;
      &lt;param name="maxInclusive"&gt;31&lt;/param&gt;
      &lt;param name="pattern"&gt;[0-9]{2}&lt;/param&gt;
    &lt;/data&gt;
  &lt;/element&gt;
  &lt;element name="punc"&gt;
    &lt;value&gt;/&lt;/value&gt;
  &lt;/element&gt;
  &lt;element name="number"&gt;
    &lt;data type="integer"&gt;
      &lt;param name="minInclusive"&gt;1&lt;/param&gt;
      &lt;param name="maxInclusive"&gt;12&lt;/param&gt;
      &lt;param name="pattern"&gt;[0-9]{2}&lt;/param&gt;
    &lt;/data&gt;
  &lt;/element&gt;
  &lt;element name="punc"&gt;
    &lt;value&gt;/&lt;/value&gt;
  &lt;/element&gt;
  &lt;element name="number"&gt;
    &lt;data type="integer"&gt;
      &lt;param name="pattern"&gt;[0-9]{2}&lt;/param&gt;
    &lt;/data&gt;
  &lt;/element&gt;
&lt;/group&gt;
</code></pre>

<p>or, in compact syntax:</p>

<pre><code>element number { 
  xs:integer { minInclusive = "1" maxInclusive = "31" pattern = "[0-9]{2}" }
},
element punc { "/" },
element number { 
  xs:integer { minInclusive = "1" maxInclusive = "12" pattern = "[0-9]{2}" }
},
element punc { "/" },
element number { 
  xs:integer { pattern = "[0-9]{2}" }
}
</code></pre>

<p>As a language, RELAX NG is really good at this. You could even imagine adding attributes to name subexpressions which you could then do things with (in the same way as you can get the substring matching a subexpression when you use a regular expression over text).</p>

<p>So I think a &#8220;grouper&#8221; component should use RELAX NG to identify sequences to be marked up. But I have no idea if there are RELAX NG libraries out there that can be used in this way: to identify and extract matching sequences rather than to validate entire documents.</p>
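
<p>In the absence of such a library, the behaviour I have in mind can at least be sketched by hand. The Python below does not use RELAX NG at all: the pattern is a plain list of predicate functions, one per element, so it only handles fixed-length sequences with no choice or repetition. All the names (<code>number_between</code>, <code>slash</code>, <code>DATE_PATTERN</code>, <code>group_matches</code>) are invented for the example, and it ignores any text that sits between the elements.</p>

```python
import re
import xml.etree.ElementTree as ET


def number_between(low, high):
    """Build a test for a number element holding a two-digit value in [low, high]."""
    def matches(element):
        text = element.text or ""
        return (element.tag == "number"
                and re.fullmatch(r"[0-9]{2}", text) is not None
                and int(text) in range(low, high + 1))
    return matches


def slash(element):
    """Test for a punc element containing a slash."""
    return element.tag == "punc" and element.text == "/"


# Day, slash, month, slash, any two-digit year.
DATE_PATTERN = [number_between(1, 31), slash, number_between(1, 12), slash,
                number_between(0, 99)]


def group_matches(parent, pattern, wrapper_tag):
    """Wrap each non-overlapping run of children that matches pattern."""
    remaining = list(parent)
    result = []
    while remaining:
        window = remaining[:len(pattern)]
        if len(window) == len(pattern) and all(
                test(element) for test, element in zip(pattern, window)):
            # The whole window matches: move it inside a new wrapper element.
            wrapper = ET.Element(wrapper_tag)
            wrapper.extend(window)
            result.append(wrapper)
            remaining = remaining[len(pattern):]
        else:
            # No match starting here: keep this child and move on by one.
            result.append(remaining.pop(0))
    parent[:] = result
    return parent
```

<p>Calling <code>group_matches(parent, DATE_PATTERN, "date")</code> on the parent of the five elements shown earlier leaves it with a single <code>date</code> child containing them. A real grouper would want RELAX NG&#8217;s choice, repetition and interleaving, which is exactly what this sketch lacks.</p>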
    ]]></content>
  </entry>
</feed>
