The project I’m working on at the moment aims to use RDFa (in XHTML) to expose some of the semantics in some natural-language text. We’re aiming moderately low — marking up dates, addresses, people’s names, and various other more domain-specific things — at least at the moment.
The problem we’re getting into now is how to get that information marked up. Because the information comes from various pretty unregulated sources, there’s no way we can force the authors to do the mark up. And the scope for making it “worth their while” (in terms of making their authoring job easier or more effective or even offering financial rewards) is very low.
So we’re taking a look at the technologies we might use for automating the markup, specifically GATE and UIMA.
I wrote previously about the, to my mind, wrong-headed use of xml:space in WordML (and OOXML), and promised something a bit more positive about how whitespace should be handled in markup languages. So here it is.
A bit of a disclaimer up front: my attitude on this topic is highly skewed by the fact I use XSLT all the time, and it has particular ways of dealing with whitespace. I happen to think that the way XSLT deals with whitespace is pretty solid, but that might just be because it’s what I’m used to.
The aim of this post is to answer the following question “when designing a markup language, what should I say about whitespace processing?”
I intend to do a series of “things that make me scream” posts. Many of them will be about WordML (as in the markup language used by Word 2003) because that’s what I’m struggling with at the moment and because it’s so goddam awful. I don’t want to get into the whole ODF vs OOXML open standard-or-not debate. My problems with WordML (and OOXML) are mainly about aesthetics rather than process: I look at it and… well, it makes me want to scream. Examining what it is about the language (or implementation thereof) that prompts this visceral reaction might help in designing better languages.
So: did you know that Word 2003 puts a xml:space="preserve" attribute on the <w:wordDocument> document element of the XML that it produces and doesn’t indent its output? This is a nightmare if you ever have to actually look at the documents: auto-indentation programs (like the one in <oXygen/>) quite rightly won’t add whitespace to elements that are in the scope of an xml:space="preserve" attribute, which means you can’t use these programs to indent XML automatically.
Yes, I’m determined to write up every talk I attended at XTech 2007, so that I have a record of it if nothing else. On Wednesday afternoon, I attended sessions on microformats, internationalisation and NVDL (as well as giving my own talk, of course).