Things that make me scream: xml:space="preserve" in WordML

I intend to do a series of “things that make me scream” posts. Many of them will be about WordML (as in the markup language used by Word 2003) because that’s what I’m struggling with at the moment and because it’s so goddam awful. I don’t want to get into the whole ODF vs OOXML open standard-or-not debate. My problems with WordML (and OOXML) are mainly about aesthetics rather than process: I look at it and… well, it makes me want to scream. Examining what it is about the language (or implementation thereof) that prompts this visceral reaction might help in designing better languages.

So: did you know that Word 2003 puts an xml:space="preserve" attribute on the <w:wordDocument> document element of the XML that it produces and doesn’t indent its output? This is a nightmare if you ever have to actually look at the documents: auto-indentation programs (like the one in <oXygen/>) quite rightly won’t add whitespace to elements that are in the scope of an xml:space="preserve" attribute, which means you can’t use these programs to indent XML automatically.

In fact, <oXygen/> has syntax-highlighting-related problems when you open a document that has very long lines (like over 5000 characters; it doesn’t actually crash, but it eats all your CPU until you kill it, which I suppose is a kind of assisted suicide). This is usually mitigated by the fact that <oXygen/> now prompts you to auto-indent when it detects such a document. But that doesn’t help with these WordML documents, because the auto-indent can’t actually reduce the line length. (Bear in mind that even the shortest WordML document created in Word 2003, one with no actual content, is 4 KB in size: 3926 characters on one line.)

So my regular experience debugging these WordML stylesheets I’m working on is to edit something in Word, save as XML, open in WordPad, remove xml:space="preserve", hit Save, remember that I can’t save it in WordPad while it’s still open in Word, close it in Word, go back to WordPad, hit Save, open in <oXygen/>, auto-indent, look around and debug the code. And repeat. Argh.

I could write an XSLT output filter that removed the xml:space="preserve", but really I shouldn’t have to. What on earth is xml:space="preserve" doing on the document element? It’s meant to be used on elements that really do contain significant whitespace that really must be preserved. The examples in the XML 1.1 Recommendation are of <poem> and <pre> where you’d want to see the line breaks, tabs and spaces when you viewed the content of the element in some kind of default viewer. In other words, the examples are elements whose content should be displayed with white-space: pre in CSS, or white-space-treatment="preserve" in XSL-FO. That just isn’t the case for the <w:wordDocument> element. Far from it.
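As a rough illustration of the kind of filter I mean, here is a minimal sketch in Python (rather than XSLT, purely so that it is easy to run). It assumes Python 3.9 or later for ET.indent, and the function name indent_wordml is made up for this example. It drops xml:space from the document element, moves it onto the <w:t> text elements so that leading and trailing spaces in text runs keep their meaning, and then indents the result.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordML documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
# ElementTree's expanded name for the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def indent_wordml(xml_text: str) -> str:
    """Return an indented copy of a WordML document (hypothetical helper)."""
    # Keep the w: prefix on output instead of a generated ns0: prefix.
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(xml_text)

    # Remove the blanket xml:space="preserve" from the document element.
    root.attrib.pop(XML_SPACE, None)

    # Assumption: only <w:t> needs it, so mark each text element instead.
    for t in root.iter(f"{{{W_NS}}}t"):
        t.set(XML_SPACE, "preserve")

    # ET.indent only rewrites whitespace-only text and tails around child
    # elements; the text of leaf elements such as <w:t> is left untouched.
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")
```

This is meant for viewing rather than round-tripping: anything before the document element (such as processing instructions) is dropped by ET.fromstring, and namespaces other than w: will come out with generated prefixes unless they are registered too.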

In fact, in Word 2003, whitespace is only significant in terms of the appearance of the document in a handful of elements, the most common being <w:t>, which holds text inside a run. I also observe the following:

  • line breaks are done with <w:br>, carriage returns with <w:cr> and tabs with <w:tab>
  • any non-space whitespace within a <w:t> always gets converted to a space, however xml:space is set
  • any runs of spaces between words within a <w:t> get preserved, however xml:space is set
  • runs of leading and trailing spaces get stripped if xml:space isn’t set to preserve, and are preserved otherwise
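The observations above can be summarised as a small model. This is only a sketch of the behaviour I observed, not anything taken from Word itself; the function name, and the choice of tab, carriage return and line feed as the “non-space whitespace” characters, are my assumptions.

```python
import re


def word2003_text(raw: str, preserve: bool) -> str:
    """Model how the content of a <w:t> appears to be treated (hypothetical)."""
    # Non-space whitespace becomes a space, however xml:space is set.
    text = re.sub(r"[\t\r\n]", " ", raw)
    # Runs of spaces between words are kept either way; leading and
    # trailing spaces survive only under xml:space="preserve".
    return text if preserve else text.strip(" ")
```

So the only thing xml:space actually changes in this model is the final strip.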

I don’t know if the same thing happens in Word 2007, because I haven’t got a copy, but I note that in the OOXML spec, all the examples have xml:space="preserve" on all <w:t> elements, and it says (in section 2.3.1, page 34):

It is also notable that since leading and trailing whitespace is not normally significant in XML; some runs require a designating [sic] specifying that their whitespace is significant via the xml:space element [sic].

This seems to be a… umm… misinterpretation of the XML spec. Whitespace is always reported to XML applications (by any conformant parser, anyway), and the application gets to decide what to do with it. The default whitespace handling is that the application should use its default whitespace handling, whatever that means. So I reckon that OOXML could just specify that whitespace is generally ignored except in <w:t> (and a few other) elements, which are normalized strings in the XML Schema sense (xs:normalizedStrings have all whitespace characters replaced by a space). To be honest, I really don’t see the point of xml:space here at all.
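As a quick check of the claim that whitespace reaches the application regardless of xml:space, here is a small sketch using Python’s xml.etree.ElementTree as a stand-in for “an XML application”. The element name t and the helper name are just for illustration.

```python
import xml.etree.ElementTree as ET


def reported_text(xml_text: str) -> str:
    """Return the text the parser hands over for the document element."""
    return ET.fromstring(xml_text).text


# The same content, with and without the attribute.
without_attr = reported_text("<t>  leading and trailing  </t>")
with_attr = reported_text('<t xml:space="preserve">  leading and trailing  </t>')
```

Both values come back with their leading and trailing spaces intact; what happens to them afterwards is up to the application.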

Whitespace handling is one of the hardest things to get right in any markup language or application, and there’s no single right way to do it, but WordML’s nowhere near right in my opinion. I’m gonna have to put my thoughts about how it should be done in a separate post, or just let the markup design experts out there have their say in comments on this one.

Comments

Re: Things that make me scream: xml:space="preserve" in WordML

I'm just getting my feet wet with wordml. I threw together some simple xslts to make it a bit better. They could easily be combined into one:

Re: Things that make me scream: xml:space="preserve" in WordML

Hi Jeni, I’m having a bit of a problem and don’t know if it’s XSL or VS.Net. I’m running a transform in .NET 2.0 and basically processing all elements: if it is a ‘special’ element I do something, but if it is a ‘word element’ I basically just try to replicate it. I can’t use an xsl:copy-of because I need to process its sub-elements to see if there are any special elements. Anyway, I come across something like this when I’ve pasted a big blob of WordML:

<w:t> </w:t>

I am using basically templates like this:

<xsl:template match="*" mode="replace-merge-fields">
  <xsl:element name="{name()}">
    <xsl:apply-templates select="@* | node()" mode="replace-merge-fields"/>
  </xsl:element>
</xsl:template>

<xsl:template match="@*" mode="replace-merge-fields">
  <xsl:attribute name="{name()}"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>

<xsl:template match="text()" mode="replace-merge-fields">
  <xsl:choose>
    <xsl:when test=".='«DBACC»'">
      <xsl:text>TerryAney:DBACC</xsl:text>
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
  </xsl:choose>
</xsl:template>

The problem is for that <w:t> </w:t> that would have put a space between some words is now converted to <w:t /> and I lose the space. Is this something that is .NET specific or xsl specific?

Any suggestions on how to preserve the actual open and close element would be greatly appreciated.

Re: Things that make me scream: xml:space="preserve" in WordML

It looks like you need to preserve the whitespace in the source WordML. I don’t know why you’re losing it in the first place, but if you just copy and paste to transform a part of WordML, then I guess you might lose the xml:space="preserve" that would otherwise preserve it. Just after you create the XmlDocument, before you actually load the XML, you need to set the PreserveWhitespace property on the XmlDocument object to true. Something like:

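using System.Xml;

// Keep whitespace-only text nodes, such as the space in <w:t> </w:t>, when loading.
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.Load(filename);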

Re: Things that make me scream: xml:space="preserve" in WordML

Can I email you a sample file and describe my process a bit more? This isn’t exactly the case. I’ll send it along and you can do what you wish (I don’t want to gobble up all your time).

Re: Things that make me scream: xml:space="preserve" in WordML

Some time ago, I used to process SGML, and spaces were significant most of the time. Maybe we should ask for a “weak indent mode” where the indentation is done in a way that causes no problems whatsoever.

Re: Things that make me scream: xml:space="preserve" in WordML

arf…

<root att="value" xml:space="preserve"
  >  <child position="1"
    > Significant  space   content </child
  > <child position="2"
    >Significant   space content   </child
  ></root>

Re: Things that make me scream: xml:space="preserve" in WordML

For all its flaws, XML Spy does cope with long lines better than Oxygen. It also has a useful bug in that it ignores xml:space="preserve" when doing pretty printing. So, until recently, I tended to use XML Spy for opening WordML files.

However, I’ve found the most recent version of Oxygen much more performant with long lines than it used to be: you are using 8.2?

I actually raised a feature request against Oxygen some time ago to have an option to ignore xml:space, but with no response; understandable, but still annoying.

Oh, and I can confirm that OOXML doesn’t have an xml:space preserve on the root element anymore.

Re: Things that make me scream: xml:space="preserve" in WordML

From the few samples that I have lying around, it seems that Word 2007 still puts everything on one line (in the zipped file) but does now put the xml:space attributes just on the text run elements, not on the document element. It’s not that surprising that MS-generated XML puts xml:space everywhere, even in places where a reading of the XML spec would suggest that it isn’t needed, as their XML parsers have a well-known “feature” that they strip white space by default, whatever the XML spec says about preserving it.

Perhaps it’ll make you scream, but I tend to stick xml:space="preserve" on the top-level html element of XHTML files, to give IE a slight hint that inter-word spaces perhaps ought to be rendered as spaces rather than being helpfully optimised away…

David

Re: Things that make me scream: xml:space="preserve" in WordML

I’d definitely agree that whitespace handling is hard. The SketchPath project I’ve been working on certainly has issues with this, but doesn’t have the same dilemma as Oxygen because it’s not an XML editor. SketchPath mercilessly (without regard to xml:space) removes whitespace if this is likely to impact on auto-indenting and preserves it otherwise. In practice, the affected whitespace characters are consecutive linefeeds (one’s ok) or any number of tabs; these are replaced by a single space character.
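One reading of that rule, as a minimal sketch in Python (this is not SketchPath’s actual code; the function name and the exact regular expression are assumptions made for illustration):

```python
import re


def collapse_for_indent(text: str) -> str:
    """Replace runs of two or more linefeeds, or any run of tabs, with one space."""
    # A single linefeed is left alone; everything else in the text is untouched.
    return re.sub(r"\n{2,}|\t+", " ", text)
```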

Perhaps XML Editors should also have such a ‘read only’ auto-indented view? One thing I’m considering is colour-coding the space characters that replace other whitespace - so tabs could be ‘bluespace’ and linefeeds ‘redspace’, or is this unnecessary?

I’ve used Oxygen and found it very useful, with many excellent features. I’m really surprised, therefore, that Oxygen experiences the problems you describe with very long lines; hopefully this is on the Oxygen people’s ‘to do’ list.

If the Oxygen auto-indent fix fails only because of WordML’s xml:space issue, then it should work with Word 2007’s OOXML, because this only uses xml:space on <w:t> elements, and then only when they contain whitespace.

I think of OOXML (and its predecessor) as ‘data centric’ rather than ‘document centric’ so I don’t judge it in the same light as, say, DocBook (keeping out of the OOXML vs ODF debate too). It would be interesting to see how well (or not) DocBook implementations handle whitespace, especially because it uses ‘mixed content’ elements.