This post was imported from my old Drupal blog. To see the full thing, including comments, it's best to visit the Internet Archive.
<bold> this is bold <italic> and italic </bold> text </italic>
and turning it into something well-formed, like:
<bold> this is bold <italic> and italic </italic></bold><italic> text </italic>
When you do this, you have to decide which elements can be split and which can’t, and their relative priorities. Wendell suggested that perhaps Creole might help to do this. What I have been thinking about is using Creole to add annotations to markup (something like: you add attributes to the Creole patterns and they get copied on to the matched ranges, or are used to create new ranges), but I haven’t done that yet. Actually, I think you probably want a different kind of language to do it (a new kind of schema language, like James Clark suggested), because the way in which you break up overlapping structures has a lot to do with how you’re going to process them.
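To make the fragmentation step concrete, here is a minimal sketch in Python. It is not Creole and not anything Wendell or James Clark proposed; the function name `make_well_formed` and the `splittable` set are my own assumptions for illustration. It walks the tags in order and, when a closing tag arrives for an element that is not the innermost open one, it closes the overlapping elements, closes the target, and reopens the overlapping ones. If one of those overlapping elements is not in `splittable`, it simply gives up rather than trying to be clever about priorities.

```python
import re

# Very small tokenizer: <name>, </name>, or a run of text without "<".
# Assumes attribute-free tags, as in the example above.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def make_well_formed(markup, splittable):
    """Rewrite overlapping tags as properly nested ones by splitting elements."""
    out = []
    stack = []  # names of currently open elements, outermost first
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
        elif not closing:
            stack.append(name)
            out.append(f"<{name}>")
        else:
            if name not in stack:
                raise ValueError(f"</{name}> has no matching start tag")
            # Find the most recently opened element with this name.
            idx = len(stack) - 1 - stack[::-1].index(name)
            # Everything opened after it is still open, so it overlaps.
            interrupted = stack[idx + 1:]
            blocked = [n for n in interrupted if n not in splittable]
            if blocked:
                raise ValueError(f"cannot split {blocked} to close <{name}>")
            for n in reversed(interrupted):
                out.append(f"</{n}>")
            out.append(f"</{name}>")
            del stack[idx:]
            # Reopen the interrupted elements as new fragments.
            for n in interrupted:
                stack.append(n)
                out.append(f"<{n}>")
    if stack:
        raise ValueError(f"unclosed elements: {stack}")
    return "".join(out)


overlapping = "<bold> this is bold <italic> and italic </bold> text </italic>"
print(make_well_formed(overlapping, splittable={"bold", "italic"}))
```

A real tool would need a policy for what to do when an unsplittable element is in the way (for example, splitting the other element earlier instead), which is exactly where the relative priorities come in.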
I’m reminded of the paper
Sperberg-McQueen, C. M., David Dubin, Claus Huitfeldt and Allen Renear. “Drawing inferences on the basis of markup.” In Proceedings of Extreme Markup Languages 2002.
in which (based on my memory of the talk) they discuss how different elements allow you to make different assertions about the text they contain, and consequently can be split in different ways. For example, a <paragraph> element can’t be split into two <paragraph> elements without changing the meaning of the document, whereas a <bold> element can be split into two <bold> elements with no problems because it’s really indicating “these characters are bold” rather than “this is a bold phrase”.
You can take a purist view (which would usually entail splitting hardly any elements, since most elements do mark up a range of text rather than the individual characters they contain), but I think the main reason you want to do this fragmentation is for presentation. And in that context, the notional semantics of the element don’t really matter: what matters is how they’re styled. For example, a <comment> element, marking up a range of text that has been commented on, might not be splittable at a theoretical level, but if you’re going to render it simply by turning the background yellow, then in fact you can split it for that purpose.
Since it’s related to presentation, I wonder whether you could use a (simplified) CSS stylesheet to provide both the fragmentation and the style. Block-level elements (display: block;) couldn’t be split whereas inline elements could. Elements that have the box model properties (margin, padding & borders) can’t be split, or, if they are, you need to mark the fragments as “left”, “middle” and “right”, and only apply the left margin/padding/border to the “left” fragment, and similarly with the right.
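As a rough sketch of how that rule might look, assuming each element's style is just a Python dict of CSS property names to values: the names `can_split`, `fragment_styles`, `LEFT_EDGE` and `RIGHT_EDGE` are hypothetical, and only the longhand edge properties are handled (shorthands like `margin` or `border` would need expanding first).

```python
# Hypothetical lists of the box-model properties that belong to each edge.
LEFT_EDGE = ("margin-left", "padding-left", "border-left")
RIGHT_EDGE = ("margin-right", "padding-right", "border-right")


def can_split(style):
    """Inline elements may be fragmented; block-level ones may not."""
    return style.get("display", "inline") != "block"


def fragment_styles(style, count):
    """Return (label, style) pairs for an element split into `count` pieces."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [("whole", dict(style))]
    if not can_split(style):
        raise ValueError("block-level elements cannot be split")
    fragments = []
    for i in range(count):
        frag = dict(style)
        if i > 0:
            # Not the first fragment: drop the left margin/padding/border.
            for prop in LEFT_EDGE:
                frag.pop(prop, None)
        if i < count - 1:
            # Not the last fragment: drop the right margin/padding/border.
            for prop in RIGHT_EDGE:
                frag.pop(prop, None)
        label = "left" if i == 0 else "right" if i == count - 1 else "middle"
        fragments.append((label, frag))
    return fragments


comment = {
    "display": "inline",
    "background-color": "yellow",
    "border-left": "1px solid",
    "border-right": "1px solid",
}
for label, frag in fragment_styles(comment, 3):
    print(label, frag)
```

With a style like this, the yellow background carries across every fragment while the borders only appear at the outer ends, which is the behaviour you would want for the <comment> case.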
It wouldn’t be a general purpose transformation mechanism, but it would be darned useful!