<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.jenitennison.com/blog" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>markup</title>
 <link>http://www.jenitennison.com/blog/taxonomy/term/16</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>Automatic markup and XML pipelines</title>
 <link>http://www.jenitennison.com/blog/node/76</link>
 <description>&lt;p&gt;The project I&amp;#8217;m working on at the moment aims to use RDFa (in XHTML) to expose some of the semantics in some natural-language text. We&amp;#8217;re aiming moderately low &amp;#8212; marking up dates, addresses, people&amp;#8217;s names, and various other more domain-specific things &amp;#8212; at least at the moment.&lt;/p&gt;

&lt;p&gt;The problem we&amp;#8217;re getting into now is how to get that information marked up. Because the information comes from various pretty unregulated sources, there&amp;#8217;s no way we can force the authors to do the mark up. And the scope for making it &amp;#8220;worth their while&amp;#8221; (in terms of making their authoring job easier or more effective or even offering financial rewards) is very low.&lt;/p&gt;

&lt;p&gt;So we&amp;#8217;re taking a look at the technologies we might use for automating the markup, specifically &lt;a href=&quot;http://www.gate.ac.uk/&quot; title=&quot;GATE: A General Architecture for Text Engineering&quot;&gt;GATE&lt;/a&gt; and &lt;a href=&quot;http://incubator.apache.org/uima/&quot; title=&quot;Apache UIMA: Unstructured Information Management Applications&quot;&gt;UIMA&lt;/a&gt;.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;These technologies basically use pipelines of components which each add some (out of line) annotations to the text. The annotations are done out of line because they might overlap, but you can (usually) serialize them into XML, which is what we want to do.&lt;/p&gt;

&lt;p&gt;I find these technologies frustrating for a number of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;any configuration we do will be specific to that particular application; it&amp;#8217;ll be hard for us to change to another implementation later on, and reuse by others will be limited to those who use the same implementation&lt;/li&gt;
&lt;li&gt;they involve a fair bit of proper coding (by which I mean Java or C++)&lt;/li&gt;
&lt;li&gt;where components can be configured through declarative means (such as keyword lists), there&amp;#8217;s no way to reuse (XML/RDF) resources that we already have; we&amp;#8217;ll have to manage transformations from them into the accepted formats through some external means, and I just &lt;em&gt;know&lt;/em&gt; they&amp;#8217;ll get out of sync&lt;/li&gt;
&lt;li&gt;their user documentation is dreadful; it seems like you need to have a good understanding of natural language processing to have a hope of even getting started&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It strikes me that the really powerful part of each of these technologies is the pipelining. The pipelining allows you to string together relatively simple operations (tokenising text, extrapolating sentences, marking up keywords, resolving ambiguities based on context etc.) which together give you something reasonably sophisticated.&lt;/p&gt;

&lt;p&gt;Using &lt;a href=&quot;http://www.w3.org/TR/xproc/&quot; title=&quot;W3C Working Draft: XProc: An XML Pipeline Language&quot;&gt;XProc&lt;/a&gt; to coordinate the pipeline would alleviate many of my frustrations. XProc can and will be implemented on many platforms, in many languages, so it&amp;#8217;ll be possible to move the pipeline from place to place (assuming that the components of the pipelines are similarly generic). It&amp;#8217;s declarative, so no &amp;#8220;proper coding&amp;#8221;. We&amp;#8217;ll be able to incorporate any transformations from existing XML/RDF data to the required configuration formats right into the pipeline. And&amp;#8230; OK, it won&amp;#8217;t automatically give us great user documentation or GUIs, but they&amp;#8217;ll come.&lt;/p&gt;

&lt;p&gt;The big problem is that XProc is still a Working Draft and the XProc ecosystem isn&amp;#8217;t well-developed. If we were one or two years down the line, XProc would be a Recommendation, there&amp;#8217;d be a .NET implementation readily available, and even perhaps extension XProc step types for tokenising, grouping and the other things we&amp;#8217;d need to do; anything that was missing we could pull together using XSLT.&lt;/p&gt;

&lt;p&gt;As it is, we&amp;#8217;re in that annoying in-between-time when the Right technology isn&amp;#8217;t ready and it looks like we&amp;#8217;re going to have to put effort into working with what feels like the Wrong technology just to get things done. But perhaps I&amp;#8217;m overlooking something in GATE or UIMA, or have missed another technology that would help us. Anyone out there got some experience that could help guide us?&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/76#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/16">markup</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/6">pipelines</category>
 <pubDate>Mon, 25 Feb 2008 22:02:57 +0000</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">76 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Whitespace in markup languages</title>
 <link>http://www.jenitennison.com/blog/node/43</link>
 <description>&lt;p&gt;I &lt;a href=&quot;http://www.jenitennison.com/blog/node/41&quot; title=&quot;Things that make me scream: xml:space=preserve in WordML&quot;&gt;wrote previously&lt;/a&gt; about the, to my mind, wrong-headed use of &lt;code&gt;xml:space&lt;/code&gt; in WordML (and &lt;a href=&quot;http://en.wikipedia.org/wiki/Office_Open_XML&quot; title=&quot;Office Open XML&quot;&gt;OOXML&lt;/a&gt;), and promised something a bit more positive about how whitespace &lt;em&gt;should&lt;/em&gt; be handled in markup languages. So here it is.&lt;/p&gt;

&lt;p&gt;A bit of a disclaimer up front: my attitude on this topic is highly skewed by the fact I use XSLT all the time, and it has particular ways of dealing with whitespace. I happen to think that the way XSLT deals with whitespace is pretty solid, but that might just be because it&amp;#8217;s what I&amp;#8217;m used to.&lt;/p&gt;

&lt;p&gt;The aim of this post is to answer the following question: &amp;#8220;when designing a markup language, what should I say about whitespace processing?&amp;#8221;&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;Markup languages are data formats, not applications. Different applications may add or remove whitespace from documents in a given markup language. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic indentation algorithms may change whitespace to make the document easier to read&lt;/li&gt;
&lt;li&gt;programs querying documents may normalise whitespace when doing (text-based) searches or comparisons&lt;/li&gt;
&lt;li&gt;renderers may ignore, collapse, change and add whitespace when rendering the document; for example browsers generally collapse whitespace and wrap the text when they display XML with a default (CSS) stylesheet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a markup language designer, you need to describe how whitespace should be handled in your markup language. You need to answer the questions of people generating documents in your markup language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can/should/must I add whitespace here?&lt;/li&gt;
&lt;li&gt;if I add whitespace here, will it make a difference to my target application?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and the questions of the people processing the documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can/should/must I ignore this whitespace?&lt;/li&gt;
&lt;li&gt;does this whitespace change the meaning of this value?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Whitespace in Element-Only Content&lt;/h2&gt;

&lt;p&gt;Whitespace in element-only content can be ignored without changing the meaning of an XML document. &amp;#8220;Element-only content&amp;#8221; means that the element can only ever contain elements. That&amp;#8217;s something like&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;element name { element given { text },
               element family { text } }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;in compact RELAX NG. It does &lt;em&gt;not&lt;/em&gt; mean element instances that happen to contain only elements and whitespace. If you declare &lt;code&gt;&amp;lt;name&amp;gt;&lt;/code&gt; as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;element name { text &amp;amp;
               element given { text } &amp;amp;
               element family { text } }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then the whitespace in&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;name&amp;gt;&amp;lt;given&amp;gt;Jeni&amp;lt;/given&amp;gt; &amp;lt;family&amp;gt;Tennison&amp;lt;/family&amp;gt;&amp;lt;/name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;does &lt;em&gt;not&lt;/em&gt; count as whitespace in element-only content. Parsers that come from the Microsoft stable have an annoying tendency to think you can just get rid of this whitespace, which is why you should only ever use them with &lt;code&gt;preserveWhiteSpace&lt;/code&gt; set. (This has the unfortunate side-effect of keeping &lt;em&gt;all&lt;/em&gt; your whitespace, but it&amp;#8217;s better to have whitespace that you don&amp;#8217;t need than to not have whitespace you do need.)&lt;/p&gt;

&lt;p&gt;The XML spec requires parsers to pass all characters on to the application, although validating parsers can indicate if a character is a whitespace character that appears in element-only content according to the DTD associated with the document. The &lt;strong&gt;element content whitespace&lt;/strong&gt; &lt;a href=&quot;http://www.w3.org/TR/xml-infoset&quot; title=&quot;W3C: XML Infoset (Second Edition)&quot;&gt;infoset&lt;/a&gt; property of these characters has the value true.&lt;/p&gt;

&lt;p&gt;In practice, since MSXML doesn&amp;#8217;t do it right, since you can&amp;#8217;t rely on DTDs being accessible, and since applications don&amp;#8217;t tend to strip element content whitespace automatically anyway, people processing documents generally have to ignore this whitespace manually. In XSLT, for example, you should use &lt;code&gt;&amp;lt;xsl:strip-space&amp;gt;&lt;/code&gt; to get rid of whitespace from the elements you list. This reduces the size of the tree you&amp;#8217;re dealing with and prevents you from counting whitespace-only text nodes or outputting them and thus getting screwy indentation in the output.&lt;/p&gt;
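
&lt;p&gt;As a rough sketch of what that looks like, assuming the element-only declaration of &lt;code&gt;&amp;lt;name&amp;gt;&lt;/code&gt; given above (the template is only there to show the effect, and is not part of any real stylesheet):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&quot;1.0&quot;
                xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&amp;gt;
  &amp;lt;!-- name has element-only content, so the whitespace-only
       text nodes directly inside it are dropped from the tree --&amp;gt;
  &amp;lt;xsl:strip-space elements=&quot;name&quot; /&amp;gt;
  &amp;lt;!-- counts the child nodes of name; indentation in the
       source document no longer contributes text nodes --&amp;gt;
  &amp;lt;xsl:template match=&quot;name&quot;&amp;gt;
    &amp;lt;xsl:value-of select=&quot;count(node())&quot; /&amp;gt;
  &amp;lt;/xsl:template&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Without the &lt;code&gt;&amp;lt;xsl:strip-space&amp;gt;&lt;/code&gt; declaration, the count would also include a whitespace-only text node for every gap between the child elements.&lt;/p&gt;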

&lt;p&gt;As a markup language designer, it doesn&amp;#8217;t hurt to clarify matters by making a global statement like &amp;#8220;whitespace that appears in element-only content can be ignored by processing applications&amp;#8221; and then listing which elements this applies to. But mostly people will assume this to be the case anyway.&lt;/p&gt;

&lt;h2&gt;Whitespace in Data Content&lt;/h2&gt;

&lt;p&gt;Next up is whitespace that appears in elements or attributes that contain data. Whitespace rules here usually come into play when testing values. For example, you&amp;#8217;ll usually want &lt;code&gt;date = &#039;2007-07-12&#039;&lt;/code&gt; to be true for both&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;date&amp;gt;
  2007-07-12
&amp;lt;/date&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;date&amp;gt;2007-07-12&amp;lt;/date&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are three standard kinds of whitespace normalisation. These are defined most formally in &lt;a href=&quot;http://www.w3.org/TR/xmlschema-1/#d0e1654&quot; title=&quot;W3C: XML Schema Part 1: White Space Normalisation During Validation&quot;&gt;XML Schema&lt;/a&gt;, but actually arise from the whitespace normalisation done to attribute values in basic XML:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;preserve&lt;/strong&gt;: all whitespace is preserved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;replace&lt;/strong&gt;: every whitespace character is replaced with a space character&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;collapse&lt;/strong&gt;: all runs of whitespace are replaced by a single space; leading and trailing whitespace is stripped&lt;/li&gt;
&lt;/ul&gt;
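
&lt;p&gt;For what it&amp;#8217;s worth, all three can be approximated in XPath 1.0. This is only a sketch: the &lt;code&gt;&amp;lt;preserved&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;replaced&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;collapsed&amp;gt;&lt;/code&gt; wrapper elements are made up for illustration, and the template is assumed to sit in a stylesheet with the usual &lt;code&gt;xsl&lt;/code&gt; prefix declared:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:template match=&quot;date&quot;&amp;gt;
  &amp;lt;!-- preserve: the string value exactly as the parser reported it --&amp;gt;
  &amp;lt;preserved&amp;gt;&amp;lt;xsl:value-of select=&quot;.&quot; /&amp;gt;&amp;lt;/preserved&amp;gt;
  &amp;lt;!-- replace: each tab, line feed and carriage return becomes a space --&amp;gt;
  &amp;lt;replaced&amp;gt;&amp;lt;xsl:value-of
    select=&quot;translate(., &#039;&amp;amp;#x9;&amp;amp;#xA;&amp;amp;#xD;&#039;, &#039;   &#039;)&quot; /&amp;gt;&amp;lt;/replaced&amp;gt;
  &amp;lt;!-- collapse: runs become one space, leading and trailing whitespace goes --&amp;gt;
  &amp;lt;collapsed&amp;gt;&amp;lt;xsl:value-of select=&quot;normalize-space(.)&quot; /&amp;gt;&amp;lt;/collapsed&amp;gt;
&amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the collapsed form, a test such as &lt;code&gt;normalize-space(date) = &#039;2007-07-12&#039;&lt;/code&gt; is true for both of the &lt;code&gt;&amp;lt;date&amp;gt;&lt;/code&gt; examples above. Note that the whitespace characters in the &lt;code&gt;translate()&lt;/code&gt; call have to be written as character references, because they sit inside an attribute value.&lt;/p&gt;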

&lt;p&gt;Attribute values are always either replaced (the default if there&amp;#8217;s no DTD) or collapsed (if they are typed as something other than CDATA in a DTD) during the parsing of the document. Like other text, element values are never touched during normal parsing.&lt;/p&gt;
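
&lt;p&gt;To make that concrete, here is a small made-up example (the &lt;code&gt;type&lt;/code&gt; attribute isn&amp;#8217;t from any real schema), with no DTD in play:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;window type=&quot;single
glazed&quot; /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A conformant parser reports the value of &lt;code&gt;type&lt;/code&gt; as &lt;code&gt;single glazed&lt;/code&gt;, with the line break replaced by a single space. If the line break were written as the character reference &lt;code&gt;&amp;amp;#xA;&lt;/code&gt; instead, it would reach the application as a line feed, because character references aren&amp;#8217;t subject to this replacement.&lt;/p&gt;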

&lt;p&gt;The types that you use in a DTD or schema should indicate to people writing or processing documents in your markup language which kind of whitespace processing is going to be done. Here&amp;#8217;s how it goes for the XML Schema datatypes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if you define the type as &lt;code&gt;xs:string&lt;/code&gt;, it means the whitespace should be preserved (although this won&amp;#8217;t happen for attribute values, since their whitespace gets replaced automatically)&lt;/li&gt;
&lt;li&gt;if you define the type as &lt;code&gt;xs:normalizedString&lt;/code&gt;, it means the whitespace should be replaced&lt;/li&gt;
&lt;li&gt;otherwise, (including &lt;code&gt;xs:token&lt;/code&gt;) it means the whitespace should be collapsed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It&amp;#8217;s worth thinking carefully about which of &lt;code&gt;xs:string&lt;/code&gt;, &lt;code&gt;xs:normalizedString&lt;/code&gt; or &lt;code&gt;xs:token&lt;/code&gt; should be used when defining enumerations. If you base an enumeration on &lt;code&gt;xs:string&lt;/code&gt; as in&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xs:simpleType name=&quot;windowType&quot;&amp;gt;
  &amp;lt;xs:restriction base=&quot;xs:string&quot;&amp;gt;
    &amp;lt;xs:enumeration value=&quot;single glazed&quot; /&amp;gt;
    &amp;lt;xs:enumeration value=&quot;double glazed&quot; /&amp;gt;
  &amp;lt;/xs:restriction&amp;gt;
&amp;lt;/xs:simpleType&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then no whitespace processing is done before the value is assessed against the enumerated values; the only values that are allowed are &lt;code&gt;&quot;single glazed&quot;&lt;/code&gt; and &lt;code&gt;&quot;double glazed&quot;&lt;/code&gt;. For example&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;window&amp;gt;
  single
  glazed
&amp;lt;/window&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;would be invalid. On the other hand, if you base an enumeration on &lt;code&gt;xs:token&lt;/code&gt; as in&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xs:simpleType name=&quot;windowType&quot;&amp;gt;
  &amp;lt;xs:restriction base=&quot;xs:token&quot;&amp;gt;
    &amp;lt;xs:enumeration value=&quot;single glazed&quot; /&amp;gt;
    &amp;lt;xs:enumeration value=&quot;double glazed&quot; /&amp;gt;
  &amp;lt;/xs:restriction&amp;gt;
&amp;lt;/xs:simpleType&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;then whitespace is collapsed before the value is assessed against the enumerated values, so the example above would be valid.&lt;/p&gt;
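
&lt;p&gt;The choice of base type is really a choice of value for the &lt;code&gt;whiteSpace&lt;/code&gt; facet, so a third option, sketched here, is to keep &lt;code&gt;xs:string&lt;/code&gt; as the base and set the facet explicitly. As far as whitespace goes, this should behave in the same way as the &lt;code&gt;xs:token&lt;/code&gt; version:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xs:simpleType name=&quot;windowType&quot;&amp;gt;
  &amp;lt;xs:restriction base=&quot;xs:string&quot;&amp;gt;
    &amp;lt;!-- collapse whitespace before comparing against the enumerated values --&amp;gt;
    &amp;lt;xs:whiteSpace value=&quot;collapse&quot; /&amp;gt;
    &amp;lt;xs:enumeration value=&quot;single glazed&quot; /&amp;gt;
    &amp;lt;xs:enumeration value=&quot;double glazed&quot; /&amp;gt;
  &amp;lt;/xs:restriction&amp;gt;
&amp;lt;/xs:simpleType&amp;gt;
&lt;/code&gt;&lt;/pre&gt;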

&lt;p&gt;Generally, when you enumerate values you do want to collapse whitespace, so you should base the type on &lt;code&gt;xs:token&lt;/code&gt;. In RELAX NG, this is the default, and doing&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;choice&amp;gt;
  &amp;lt;value&amp;gt;single glazed&amp;lt;/value&amp;gt;
  &amp;lt;value&amp;gt;double glazed&amp;lt;/value&amp;gt;
&amp;lt;/choice&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;will result in the same behaviour as basing a simple type on &lt;code&gt;xs:token&lt;/code&gt;. If you don&amp;#8217;t want to strip whitespace, then you can use the &lt;code&gt;type&lt;/code&gt; attribute on the &lt;code&gt;&amp;lt;value&amp;gt;&lt;/code&gt; element to specify that the values are strings, not tokens:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;choice&amp;gt;
  &amp;lt;value type=&quot;string&quot;&amp;gt;single glazed&amp;lt;/value&amp;gt;
  &amp;lt;value type=&quot;string&quot;&amp;gt;double glazed&amp;lt;/value&amp;gt;
&amp;lt;/choice&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course one time you&amp;#8217;ll &lt;em&gt;really&lt;/em&gt; want enumerated values to be based on strings is if they can consist purely of whitespace.&lt;/p&gt;

&lt;h2&gt;Whitespace in Document Content&lt;/h2&gt;

&lt;p&gt;Once you have document content (content that is targeted at human consumption, which is usually mixed content), the only applications that should touch whitespace are rendering applications. Markup languages should be designed so that processors don&amp;#8217;t have to add (or remove) whitespace to get something human readable.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve been dealing with a markup language recently where this rule is broken in two ways. First, within a &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; element, processing applications have to add a space before any processing instruction or element that it contains. For example, look at:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;text&amp;gt;The proviso to section 6(2) of the&amp;lt;?change id=&quot;797826&quot; 
  type=&quot;commentary&quot;?&amp;gt;Statutory Orders (Special Procedure) 
  Act 1945 (power to withdraw an order or submit it to 
  Parliament for further consideration by means of a Bill for its 
  confirmation) shall have effect in relation to compensation 
  orders as if for the words&amp;lt;quotation class=&quot;double&quot;&amp;gt;&quot;may 
  by notice given in the prescribed manner, withdraw the 
  order or may&quot;&amp;lt;/quotation&amp;gt; there were substituted the 
  word&amp;lt;quotation class=&quot;double&quot;&amp;gt;&quot;shall&quot;&amp;lt;/quotation&amp;gt;.&amp;lt;/text&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See how there&amp;#8217;s no space before the &lt;code&gt;&amp;lt;?change?&amp;gt;&lt;/code&gt; PI or the &lt;code&gt;&amp;lt;quotation&amp;gt;&lt;/code&gt; elements? In this case, spaces need to be added before them. On the other hand, if the &lt;code&gt;&amp;lt;?change?&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;quotation&amp;gt;&lt;/code&gt; element happens to start after certain kinds of punctuation, such as quotation marks or brackets, then whitespace shouldn&amp;#8217;t be added.&lt;/p&gt;

&lt;p&gt;Second, in this badly designed markup language, if a &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; element has element-only content, processing applications have to ignore the whitespace around those elements. For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;commentarycontent&amp;gt;&amp;lt;text&amp;gt;S. 39(5)(&amp;lt;/text&amp;gt;
&amp;lt;text&amp;gt;
&amp;lt;font class=&quot;italic&quot;&amp;gt;b&amp;lt;/font&amp;gt;
&amp;lt;/text&amp;gt;
&amp;lt;text&amp;gt;) repealed by Industry Act 1980 (c. 33, SIF 64), 
  Sch. 2&amp;lt;/text&amp;gt;&amp;lt;/commentarycontent&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case, the whitespace in the second &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; element can be ignored without any problems, but of course if the first &lt;code&gt;&amp;lt;text&amp;gt;&lt;/code&gt; didn&amp;#8217;t end with a bracket, but instead with a comma or a letter, the whitespace would have to be preserved (or at least turned into a space).&lt;/p&gt;

&lt;p&gt;Whitespace processing of the kind illustrated here is time-consuming, hard to specify and inaccurate. The only people who really know what whitespace is needed in document-oriented content are the authors who create it, so it&amp;#8217;s really important that they have the right, and responsibility, to determine where whitespace appears.&lt;/p&gt;

&lt;p&gt;Rendering applications are a special case, because they have to munge whitespace to make text more readable. That often includes normalising whitespace away and adding line breaks (and hyphens) in the rendered view of the document. Different presentation-oriented markup languages use different algorithms to do this normalisation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In HTML, whitespace within a block gets collapsed, while whitespace at the beginning or end of a block gets stripped away, except in &lt;code&gt;&amp;lt;pre&amp;gt;&lt;/code&gt; elements where all whitespace is preserved.&lt;/li&gt;
&lt;li&gt;In XSL-FO, whitespace processing depends on the &lt;code&gt;linefeed-treatment&lt;/code&gt;, &lt;code&gt;white-space-treatment&lt;/code&gt; and &lt;code&gt;white-space-collapse&lt;/code&gt; properties, which provide practically any kind of behaviour; the default is the HTML rules.&lt;/li&gt;
&lt;li&gt;In WordML, whitespace is replaced (not collapsed); whitespace at the beginning or end of a &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; element is stripped away unless &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; for that &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; element.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because every piece of human-readable text eventually makes its way into HTML, I think it&amp;#8217;s best to try to make your markup language follow the HTML rules. That means defining what the blocks are (the equivalent of paragraphs) and which ones have significant whitespace in them.&lt;/p&gt;

&lt;h2&gt;On &lt;code&gt;xml:space&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;As the above discussion has shown, there are at least three different kinds of whitespace normalisation that can be used on an XML document. But &lt;code&gt;xml:space&lt;/code&gt; has just two values: &lt;code&gt;default&lt;/code&gt; and &lt;code&gt;preserve&lt;/code&gt;. So one problem with using &lt;code&gt;xml:space&lt;/code&gt; is that it doesn&amp;#8217;t have any predefined semantics: in one application it might identify the distinction between collapsing and preserving, in another the distinction between replacing and preserving, in another (OOXML) the distinction between replacing-with-leading-and-trailing-whitespace-stripped and replacing.&lt;/p&gt;

&lt;p&gt;So my advice would be: don&amp;#8217;t use it. Instead, take the time to define explicitly how whitespace should be handled in the schema and documentation for your markup language.&lt;/p&gt;

&lt;p&gt;I could just about be argued into providing &lt;code&gt;xml:space&lt;/code&gt; as an optional attribute on elements where the user is likely to want to change the way whitespace is processed on a case-by-case basis. But I can&amp;#8217;t actually think of any example where that might happen.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The biggest problem with whitespace handling is not that it can&amp;#8217;t be defined, but that so many applications do it wrong. I&amp;#8217;m sure that the bad whitespace use I described above, where PIs and elements implicitly added whitespace, arose not because the markup language designer decided that it would be a good idea but because the applications the authors used either added whitespace in their WYSIWYG displays or stripped it out when it was saved. Likewise, the abysmal whitespace-stripping behaviour of MSXML has led to many a strange use of markup, like David Carlisle adding &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; to his &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; elements.&lt;/p&gt;

&lt;p&gt;So it&amp;#8217;s the responsibility of markup language designers to specify how whitespace should be used, but it&amp;#8217;s equally the responsibility of processors to honour those specifications.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/43#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/16">markup</category>
 <pubDate>Sun, 22 Jul 2007 09:32:25 +0100</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">43 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>Things that make me scream: xml:space=&quot;preserve&quot; in WordML</title>
 <link>http://www.jenitennison.com/blog/node/41</link>
 <description>&lt;p&gt;I intend to do a series of &amp;#8220;things that make me scream&amp;#8221; posts. Many of them will be about WordML (as in the markup language used by Word 2003) because that&amp;#8217;s what I&amp;#8217;m struggling with at the moment and because it&amp;#8217;s so goddam awful. I don&amp;#8217;t want to get into the whole &lt;a href=&quot;http://en.wikipedia.org/wiki/OpenDocument&quot; title=&quot;Open Document Format&quot;&gt;ODF&lt;/a&gt; vs &lt;a href=&quot;http://en.wikipedia.org/wiki/Office_Open_XML&quot; title=&quot;Office Open XML&quot;&gt;OOXML&lt;/a&gt; open standard-or-not debate. My problems with WordML (and OOXML) are mainly about aesthetics rather than process: I look at it and&amp;#8230; well, it makes me want to scream. Examining what it is about the language (or implementation thereof) that prompts this visceral reaction might help in designing better languages.&lt;/p&gt;

&lt;p&gt;So: did you know that Word 2003 puts an &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; attribute on the &lt;code&gt;&amp;lt;w:wordDocument&amp;gt;&lt;/code&gt; document element of the XML that it produces and doesn&amp;#8217;t indent its output? This is a nightmare if you ever have to actually look at the documents: auto-indentation programs (like the one in &lt;a href=&quot;http://www.oxygenxml.com/&quot; title=&quot;&amp;lt;oXygen/&amp;gt; XML Editor&quot;&gt;&amp;lt;oXygen/&amp;gt;&lt;/a&gt;) quite rightly won&amp;#8217;t add whitespace to elements that are in the scope of an &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; attribute, which means you can&amp;#8217;t use these programs to indent XML automatically.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;In fact, &amp;lt;oXygen/&amp;gt; has syntax-highlighting-related problems when you open a document that has very long lines (like over 5000 characters; it doesn&amp;#8217;t actually crash, but it eats all your CPU until you kill it, which I suppose is a kind of assisted suicide). This is usually mitigated by the fact &amp;lt;oXygen/&amp;gt; now prompts you to auto-indent when it detects such a document. But that doesn&amp;#8217;t help with these WordML documents, because the auto-indent can&amp;#8217;t actually reduce the line size. (Bear in mind that even the shortest WordML document &amp;#8212; one with no actual content &amp;#8212; created in Word 2003 is 4kb in size; 3926 characters in one line.)&lt;/p&gt;

&lt;p&gt;So my regular experience debugging these WordML stylesheets I&amp;#8217;m working on is to edit something in Word, save as XML, open in WordPad, remove &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt;, hit Save, remember that I can&amp;#8217;t save it in WordPad while it&amp;#8217;s still open in Word, close it in Word, go back to WordPad, hit Save, open in &amp;lt;oXygen/&amp;gt;, auto-indent, look around and debug the code. And repeat. Argh.&lt;/p&gt;

&lt;p&gt;I could write an XSLT output filter that removed the &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt;, but really I shouldn&amp;#8217;t have to. What on earth is &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; doing on the document element? It&amp;#8217;s meant to be used on elements that really do contain significant whitespace that really must be preserved. The examples in the &lt;a href=&quot;http://www.w3.org/TR/xml11/#sec-white-space&quot; title=&quot;W3C: XML 1.1 Recommendation&quot;&gt;XML 1.1 Recommendation&lt;/a&gt; are of &lt;code&gt;&amp;lt;poem&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;pre&amp;gt;&lt;/code&gt; where you&amp;#8217;d want to see the line breaks, tabs and spaces when you viewed the content of the element in some kind of default viewer. In other words, the examples are elements whose content should be displayed with &lt;code&gt;white-space: pre&lt;/code&gt; in CSS, or &lt;code&gt;white-space-treatment=&quot;preserve&quot;&lt;/code&gt; in XSL-FO. That just isn&amp;#8217;t the case for the &lt;code&gt;&amp;lt;w:wordDocument&amp;gt;&lt;/code&gt; element. Far from it.&lt;/p&gt;
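
&lt;p&gt;A filter along those lines would probably be little more than an identity transform. Here is a minimal XSLT 1.0 sketch (not taken from the stylesheets mentioned above) that only touches the attribute on the document element and leaves any &lt;code&gt;xml:space&lt;/code&gt; attributes deeper in the document alone:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;xsl:stylesheet version=&quot;1.0&quot;
                xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&amp;gt;
  &amp;lt;!-- identity template: copy every node and attribute through unchanged --&amp;gt;
  &amp;lt;xsl:template match=&quot;@*|node()&quot;&amp;gt;
    &amp;lt;xsl:copy&amp;gt;
      &amp;lt;xsl:apply-templates select=&quot;@*|node()&quot; /&amp;gt;
    &amp;lt;/xsl:copy&amp;gt;
  &amp;lt;/xsl:template&amp;gt;
  &amp;lt;!-- empty template: drop xml:space from the document element only --&amp;gt;
  &amp;lt;xsl:template match=&quot;/*/@xml:space&quot; /&amp;gt;
&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The second template wins over the identity template for that one attribute because its match pattern has a higher default priority. The &lt;code&gt;xml&lt;/code&gt; prefix doesn&amp;#8217;t need declaring, since it&amp;#8217;s always bound.&lt;/p&gt;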

&lt;p&gt;In fact, in Word 2003, whitespace is only significant in terms of the appearance of the document in a handful of elements, the most common being &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt;, which holds text inside a run. I also observe the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;line breaks are done with &lt;code&gt;&amp;lt;w:br&amp;gt;&lt;/code&gt;, carriage returns with &lt;code&gt;&amp;lt;w:cr&amp;gt;&lt;/code&gt; and tabs with &lt;code&gt;&amp;lt;w:tab&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;any non-space whitespace within a &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; always gets converted to a space, however &lt;code&gt;xml:space&lt;/code&gt; is set&lt;/li&gt;
&lt;li&gt;any runs of spaces between words within a &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; get preserved, however &lt;code&gt;xml:space&lt;/code&gt; is set&lt;/li&gt;
&lt;li&gt;runs of leading and trailing spaces get stripped if &lt;code&gt;xml:space&lt;/code&gt; isn&amp;#8217;t set to &lt;code&gt;preserve&lt;/code&gt;, and are preserved otherwise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don&amp;#8217;t know if the same thing happens in Word 2007, because I haven&amp;#8217;t got a copy, but I note that in the &lt;a href=&quot;http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%20Part%204%20(PDF).zip&quot; title=&quot;Zipped Office Open XML Part 4: Markup Language Reference PDF&quot;&gt;OOXML spec&lt;/a&gt;, all the examples have &lt;code&gt;xml:space=&quot;preserve&quot;&lt;/code&gt; on all &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; elements, and it says (in section 2.3.1, page 34):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating [sic] specifying that their whitespace is significant via the xml:space element [sic].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This seems to be a&amp;#8230; umm&amp;#8230; misinterpretation of the XML spec. Whitespace is always reported to XML applications (by any &lt;em&gt;conformant&lt;/em&gt; parser, anyway), and the application gets to decide what to do with it. The default whitespace handling is that the application should use its default whitespace handling &lt;strong&gt;whatever that means&lt;/strong&gt;. So I reckon that OOXML could just specify that whitespace is generally ignored except in &lt;code&gt;&amp;lt;w:t&amp;gt;&lt;/code&gt; (and a few other) elements which are normalized strings in the XML Schema sense (&lt;code&gt;xs:normalizedString&lt;/code&gt;s have all whitespace characters replaced by a space). To be honest, I really don&amp;#8217;t see the point of &lt;code&gt;xml:space&lt;/code&gt; here at all.&lt;/p&gt;

&lt;p&gt;Whitespace handling is one of the hardest things to get right in any markup language or application, and there&amp;#8217;s no single right way to do it, but WordML&amp;#8217;s nowhere near right in my opinion. I&amp;#8217;m gonna have to put my thoughts about how it &lt;em&gt;should&lt;/em&gt; be done in a separate post, or just let the markup design experts out there have their say in comments on this one.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/41#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/16">markup</category>
 <pubDate>Fri, 13 Jul 2007 19:56:00 +0100</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">41 at http://www.jenitennison.com/blog</guid>
</item>
<item>
 <title>XTech 2007: Wednesday 16th May Afternoon</title>
 <link>http://www.jenitennison.com/blog/node/19</link>
 <description>&lt;p&gt;Yes, I&amp;#8217;m determined to write up every talk I attended at XTech 2007, so that &lt;em&gt;I&lt;/em&gt; have a record of it if nothing else. On Wednesday afternoon, I attended sessions on microformats, internationalisation and NVDL (as well as giving my own talk, of course).&lt;/p&gt;

&lt;!--break--&gt;

&lt;h2&gt;&lt;a href=&quot;http://2007.xtech.org/public/schedule/paper/41&quot; title=&quot;Microformats: the nanotechnology of the semantic web&quot;&gt;Microformats: the nanotechnology of the semantic web&lt;/a&gt;&lt;/h2&gt;

&lt;h3&gt;&lt;a href=&quot;http://adactio.com/&quot; title=&quot;Jeremy Keith&#039;s Website&quot;&gt;Jeremy Keith&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;This was a supremely well-put-together presentation on &lt;a href=&quot;http://microformats.org/&quot; title=&quot;Microformats Website&quot;&gt;microformats&lt;/a&gt;: beautiful slides, drama and humour, and a reference to &lt;a href=&quot;http://en.wikipedia.org/wiki/Neal_Stephenson&quot; title=&quot;Wikipedia: Neal Stephenson&quot;&gt;Neal Stephenson&amp;#8217;s&lt;/a&gt; &lt;a href=&quot;http://www.amazon.com/Diamond-Age-Illustrated-Primer-Spectra/dp/0553380966&quot; title=&quot;Amazon: Diamond Age&quot;&gt;Diamond Age&lt;/a&gt; (was I really one of only three people in the packed room to have read it?). There was a lot about what microformats are, how they&amp;#8217;re designed, what their niche is (Jeremy was very up-front about the fact they don&amp;#8217;t solve every problem), and how they&amp;#8217;re developed. But there weren&amp;#8217;t any demonstrations of microformat-based applications, which I would have really liked to see. The other thing I thought was worth noting was that Jeremy talked about the dangers of &amp;#8220;grey goo&amp;#8221; (he was using a nanotechnology metaphor): the proliferation of microformats. He expressed the strong desire that the set of microformats be kept small, and even said (I paraphrase) &amp;#8220;Do use semantic class names in your HTML, but don&amp;#8217;t call them microformats [unless they&amp;#8217;ve been through the microformats standardisation process]!&amp;#8221;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.holoweb.net/~liam/&quot; title=&quot;Liam Quin&#039;s Website&quot;&gt;Liam Quin&lt;/a&gt; gave a paper entitled &lt;a href=&quot;http://www.idealliance.org/papers/extreme/proceedings/html/2006/Quin01/EML2006Quin01.html&quot; title=&quot;Microformats: Contaminants or Ingredients&quot;&gt;Microformats: Contaminants or Ingredients&lt;/a&gt; at &lt;a href=&quot;http://www.extrememarkup.com/&quot; title=&quot;Extreme Markup Languages&quot;&gt;Extreme&lt;/a&gt; last year, asking what we, as traditional markup geeks, should do about them. Some were very sceptical, saying something along the lines of &amp;#8220;They&amp;#8217;re headed for a trainwreck; and we should sit back, watch it happen, and pick up the pieces.&amp;#8221; Others wanted to celebrate: the fact that tagging has become understood is really good news for the semantic web, open data and all that jazz. &lt;/p&gt;

&lt;p&gt;Both the traditional markup and the microformats communities have the same goals: they want to make information easier to search for, to query, to integrate and so on. The microformats approach is to minimise the cost to those supplying information, and to target just a few, very common, kinds of data such as contact information, events and social networks. Traditional markup, on the other hand, aims to cover every single kind of information you might want to make available, and has to worry about issues like validating, styling, and distinguishing between tag sets.&lt;/p&gt;

&lt;p&gt;It seems that a fundamental problem is that the benefits of including semantic markup aren&amp;#8217;t immediately obvious to the supplier. Whether you use semantic class names in HTML or use elements in known namespaces, it&amp;#8217;s purely a matter of faith that this will make your information easier to locate or use. You can&amp;#8217;t know that search engines will include that information in their weighting algorithms, or that people reading your page will have the screen-scraping software necessary to pull anything out. With so little (obvious) benefit, authors will only supply semantic data if the cost is low. Adding class names to existing HTML elements is easy whether a web page is generated by hand or automatically. Adding namespaces and authoring special CSS might not be that much more costly to do, but it&amp;#8217;s much more costly to grok.&lt;/p&gt;

&lt;p&gt;So if we want authors to start putting elements in their own namespaces in their web pages, we need an application that immediately cranks up the benefit of doing so. I have no idea what that is.&lt;/p&gt;

&lt;h2&gt;&lt;a href=&quot;http://2007.xtech.org/public/schedule/paper/50&quot; title=&quot;Applying the Internationalization Tag Set&quot;&gt;Applying the Internationalization Tag Set&lt;/a&gt;&lt;/h2&gt;

&lt;h3&gt;&lt;a href=&quot;http://www.translate.com/&quot; title=&quot;Yves Savourel&#039;s Website&quot;&gt;Yves Savourel&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;This was a good introduction to a standard I only knew about vaguely. It&amp;#8217;s definitely worth knowing about the &lt;code&gt;its:*&lt;/code&gt; attributes for defining i18n features such as indicating which content should be translated, which are terms, providing comments for localisation and so on, just in case you need to build those into new markup languages.&lt;/p&gt;

&lt;p&gt;I also have much admiration for how the ITS standard doesn&amp;#8217;t expect people to completely rework their markup languages to incorporate ITS data. Instead of using the ITS attributes directly in a document, you can use global rules embedded in the document itself, referenced from the document, or embedded in the schema for the document. I think this approach will prove useful in the development of &lt;a href=&quot;http://www.lmnlwiki.org/index.php/Talk:ECLIX#LIX&quot; title=&quot;LMNL in XML&quot;&gt;LIX&lt;/a&gt;, when we get around to formalising it.&lt;/p&gt;
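
&lt;p&gt;A minimal sketch of the two styles, assuming the ITS 1.0 namespace and rule names. The &lt;code&gt;doc&lt;/code&gt;, &lt;code&gt;para&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt; and &lt;code&gt;dfn&lt;/code&gt; elements are hypothetical and not taken from the talk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;doc xmlns:its=&quot;http://www.w3.org/2005/11/its&quot;&amp;gt;
  &amp;lt;!-- Global rules embedded in the document: the content itself carries no its:* attributes --&amp;gt;
  &amp;lt;its:rules version=&quot;1.0&quot;&amp;gt;
    &amp;lt;its:translateRule selector=&quot;//code&quot; translate=&quot;no&quot;/&amp;gt;
    &amp;lt;its:termRule selector=&quot;//dfn&quot; term=&quot;yes&quot;/&amp;gt;
  &amp;lt;/its:rules&amp;gt;
  &amp;lt;para&amp;gt;Run &amp;lt;code&amp;gt;make install&amp;lt;/code&amp;gt; to set up the &amp;lt;dfn&amp;gt;pipeline&amp;lt;/dfn&amp;gt;.&amp;lt;/para&amp;gt;
  &amp;lt;!-- The same kind of information supplied locally, directly on an element --&amp;gt;
  &amp;lt;para its:translate=&quot;no&quot;&amp;gt;ACME Widgets Ltd&amp;lt;/para&amp;gt;
&amp;lt;/doc&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The global rules select nodes with XPath, which is what lets an existing vocabulary pick up ITS information without its content model changing.&lt;/p&gt;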

&lt;h2&gt;&lt;a href=&quot;http://2007.xtech.org/public/schedule/detail/48&quot; title=&quot;NVDL - a breath of fresh air for compound document validation&quot;&gt;NVDL - a breath of fresh air for compound document validation&lt;/a&gt;&lt;/h2&gt;

&lt;h3&gt;&lt;a href=&quot;http://xmlguru.cz/&quot; title=&quot;Jirka Kosek&#039;s Website&quot;&gt;Jirka Kosek&lt;/a&gt; &amp;amp; &lt;a href=&quot;http://nalevka.com/&quot; title=&quot;Petr Nálevka&#039;s Website&quot;&gt;Petr Nálevka&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://www.nvdl.org/&quot; title=&quot;Namespace-based Validation Dispatching Language&quot;&gt;NVDL&lt;/a&gt; is Part 4 of &lt;a href=&quot;http://www.dsdl.org/&quot; title=&quot;Document Schema Definition Languages&quot;&gt;DSDL&lt;/a&gt;, specifically targeted at organising the validation of documents that incorporate multiple namespaces, such as XHTML documents containing islands of SVG, RDF and MathML. NVDL&amp;#8217;s approach is to identify subtrees within the document that need to be validated against a particular schema. The subtrees don&amp;#8217;t need to hold only one namespace, but often that will be the case.&lt;/p&gt;
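
&lt;p&gt;As a rough illustration of that dispatching, a minimal NVDL script might look something like the sketch below. The schema file names (&lt;code&gt;xhtml.rng&lt;/code&gt;, &lt;code&gt;svg.rng&lt;/code&gt;) are hypothetical, and this is not an example from the talk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;rules xmlns=&quot;http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0&quot;&amp;gt;
  &amp;lt;!-- Sections in the XHTML namespace go to the XHTML schema --&amp;gt;
  &amp;lt;namespace ns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;
    &amp;lt;validate schema=&quot;xhtml.rng&quot;/&amp;gt;
  &amp;lt;/namespace&amp;gt;
  &amp;lt;!-- SVG islands are dispatched to their own schema --&amp;gt;
  &amp;lt;namespace ns=&quot;http://www.w3.org/2000/svg&quot;&amp;gt;
    &amp;lt;validate schema=&quot;svg.rng&quot;/&amp;gt;
  &amp;lt;/namespace&amp;gt;
  &amp;lt;!-- Anything in an unrecognised namespace is rejected --&amp;gt;
  &amp;lt;anyNamespace&amp;gt;
    &amp;lt;reject/&amp;gt;
  &amp;lt;/anyNamespace&amp;gt;
&amp;lt;/rules&amp;gt;&lt;/code&gt;&lt;/pre&gt;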

&lt;p&gt;The XML Schema wonks in the room (Henry Thompson and Michael Sperberg-McQueen) were a bit befuddled, I think, because with XML Schema you just supply a whole bunch of schema documents to the processor, for different namespaces, and as long as the schemas contain wildcards they&amp;#8217;ll do the right thing. The concept of supplying multiple schemas to a validator isn&amp;#8217;t part of RELAX NG&amp;#8217;s validation approach, so you need something like NVDL if you don&amp;#8217;t want to rework your schema for every combination of namespaces.&lt;/p&gt;

&lt;p&gt;Henry and Michael were particularly concerned about the fact that NVDL means you can override the original schema, allowing elements from foreign namespaces in situations where the original schema hasn&amp;#8217;t allowed them. But as Henry said, it just means that the primary schema you use to define what&amp;#8217;s allowed where is actually an NVDL schema: it&amp;#8217;s not auxiliary validation like Schematron is, but a language for the primary schema you use.&lt;/p&gt;

&lt;p&gt;Later, I wondered whether the &lt;a href=&quot;http://www.w3.org/TR/xproc&quot; title=&quot;XProc: An XML Pipeline Language&quot;&gt;XProc&lt;/a&gt; work would render NVDL irrelevant. After all, XProc can invoke validation of subtrees against multiple external schemas. On the other hand, NVDL&amp;#8217;s syntax is going to be easier to use if that&amp;#8217;s all you want to do. Perhaps someone will write a tool to convert NVDL schemas to XProc pipelines&amp;#8230;&lt;/p&gt;
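
&lt;p&gt;A hedged sketch of that kind of subtree validation in a pipeline, assuming XProc 1.0 syntax (which may differ from the working draft linked above) and a hypothetical &lt;code&gt;svg.rng&lt;/code&gt; schema file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;p:pipeline xmlns:p=&quot;http://www.w3.org/ns/xproc&quot;
            xmlns:svg=&quot;http://www.w3.org/2000/svg&quot;
            version=&quot;1.0&quot;&amp;gt;
  &amp;lt;!-- Each matched SVG subtree is handed to the validation step on its own --&amp;gt;
  &amp;lt;p:viewport match=&quot;svg:svg&quot;&amp;gt;
    &amp;lt;p:validate-with-relax-ng&amp;gt;
      &amp;lt;p:input port=&quot;schema&quot;&amp;gt;
        &amp;lt;p:document href=&quot;svg.rng&quot;/&amp;gt;
      &amp;lt;/p:input&amp;gt;
    &amp;lt;/p:validate-with-relax-ng&amp;gt;
  &amp;lt;/p:viewport&amp;gt;
&amp;lt;/p:pipeline&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Covering several namespaces this way would presumably take one viewport per namespace, which is where the NVDL version stays shorter.&lt;/p&gt;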

&lt;p&gt;Actually, Jirka &amp;amp; Petr&amp;#8217;s experience with &lt;a href=&quot;http://sourceforge.net/projects/jnvdl/&quot; title=&quot;Java implementation of NVDL&quot;&gt;JNVDL&lt;/a&gt; is interesting from the XProc viewpoint, in particular the problems that they had with reporting meaningful line numbers when validating subtrees. Something that XProc implementers might want to look at in regard to error reporting with &lt;code&gt;&amp;lt;p:viewport&amp;gt;&lt;/code&gt;.&lt;/p&gt;
</description>
 <comments>http://www.jenitennison.com/blog/node/19#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/16">markup</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/6">pipelines</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/8">schema</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/4">xtech</category>
 <pubDate>Sun, 20 May 2007 22:52:14 +0100</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">19 at http://www.jenitennison.com/blog</guid>
</item>
</channel>
</rss>
