Henry Thompson had a lot to say after my Creole presentation (open takahashi.xul?data=creole.data; requires Firefox) about the benefits of stand-off markup for linguistic information. From his overview, it seems that the NITE XML Toolkit that he’s been involved with represents overlapping linguistic data by holding atoms (here meaning the “lowest common denominator” shared pieces of data) and having multiple trees marking up these atoms. The trees are independently validated (since they are pure XML), with cross-hierarchy validation done through the query language. This is pretty similar to the XCONCUR approach, which augments a CONCUR-like multi-grammar validation with a Schematron-like constraint language.
Now, I have nothing against using constraint languages (like Schematron) to validate documents, but grammars (like RELAX NG) have big advantages. Most importantly, they are easier to write (if they’re designed properly), and tools can analyse them to do useful things, such as tell you what element or attribute is expected next. If it’s possible to write cross-grammar constraints in a grammar (like Creole) then why would you use a constraint language to do it?
I think the big difference between Henry’s domain and the one that I think will move overlap into the mainstream is between global and local concurrence. With global concurrence, entirely separate hierarchies are applied to the same data, so the natural validation mechanism is to use entirely separate grammars (with perhaps a few small rules to do cross-grammar validation where that proves necessary). With local concurrence, the vast majority of the document follows a single hierarchy with concurrence happening at a low level.
Actually, the best example for this doesn’t even involve overlap. Consider HTML paragraphs, which contain various inline elements such as <strong>, <em> and <a>. It doesn’t make sense for these elements to contain themselves (strong text is neither made stronger nor negated by appearing in two <strong> elements, and it’s not allowed for links to contain other links). So the natural model in Creole is
p = element p { mixed { strong* & em* & a* } }
strong = range strong { text }
em = range em { text }
a = range a { attribute href { text }, ..., text }
This model allows <a> elements to appear within <em> elements, or vice versa, not because of the content model of <em> but because the two ranges are interleaved (and one arrangement of interleaved ranges is containment). It doesn’t allow any of these elements to appear inside themselves. It would be a real maintenance headache to have separate grammars for each of these inline elements, when most of each of the grammars (all the hierarchy down to the paragraph level) would be the same.
Actually, looking at NITE, it seems like it employs a data model that’s quite like LMNL’s, in that it has the concept of layers over atoms or ranges/elements. (Interestingly it looks like they get around the problem of identifying which ranges belong to which layers purely by using their name.) Another difference here might be that while I’m talking about supporting overlap in fairly heavily structured documents (like office documents), they’re really using fairly flat annotations, where there isn’t much of a grammar anyway. But I might have that wrong: need to do more reading. The other thing to investigate is whether they have any support for self-overlap (<phrase> elements overlapping other <phrase> elements): I kinda gather that they don’t.
Anyway, Henry also made the points that (a) that he doesn’t want a new syntax for overlap and (b) stand-off markup works very well thank you. To address the latter point first, I think stand-off markup works very well if you have the tools to support it. It’s fine if you have an integrated toolkit which can pull together and display the stand-off markup as embedded markup, and let you create ranges by highlighting text with a mouse. But the great power of HTML and other web technologies is that you don’t need to use a specialised toolkit to write it: you can just use a text editor and it’s all right there in front of you with no (or minimal) cross-referencing required. Frankly, I’m not interested in “core” technologies that require me to install a particular piece of software in order to make use of them (cf Erik Meijer’s talk on LINQ, which I’ll have to discuss another time). I expect to be able to write a document containing overlap as easily as I can write a normal XML document.
On Henry’s point about yet another syntax for overlap, I am more and more coming to the conclusion that overlap will hit the mainstream if we have a simple way of encoding overlap in normal XML documents, namely something along the lines of LIX. Interestingly, Yves Savourel’s talk on Applying the Internationalization Tag Set was quite inspirational in this regard, since the working group seem to have put together a standard that both provides a set of standard elements and attributes to guide localisation, along with a method of mapping elements and attributes in existing markup languages onto those ITS elements and attributes. I wonder whether a similar approach could be used with LIX… but I’ll have to leave those thoughts for another time.
Comments
Re: XTech Creole presentation fallout
Just a couple of clarifications:
NXT’s data model doesn’t allow for any constraints that would require cross-hierarchy validation - if there are two hierarchies that draw on the same base layer of atoms, they are completely independent (they don’t even necessarily draw the atoms in the same order). We’ve never run into this sort of constraint as a requirement from our users (which isn’t to say there are no reasons for ever wanting this, just that our users’ needs are limited). Otherwise, validation is as you suggest - we create a set of XML trees that together cover all of the relationships expressed in our overall data graph, where the same node can occur redundantly in more than one tree, and validate each one separately (using a single schema that can cover any of them). This is simple-minded, but inefficient.
Many of our example uses are fairly flat because users come to NXT not just because of the overlapping hierarchies but also because we support multimodal annotation. However, the data model does support heavily structured documents, and there are some corpora that make use of that, with perhaps four different trees built on top of the same orthographic transcription, and some trees that decompose recursively through the same layer of tags. We don’t allow self-overlap for the same tag, though.
Re: XTech Creole presentation fallout
Thanks for those clarifications, Jean. I wonder: do you have any documents that you could make available? I think it would be very helpful to see what real users do with overlapping markup.
Re: XTech Creole presentation fallout
I think the main problem with Schematron is that you don’t know when you are finished. When are there no more constraints to impose? It’s hard to be sure.
In XHTML 2.0, by the way, nested
strongandemelements will mean a greater level of strength or emphasis as the case may be. The draft doesn’t say what that would mean typographically. Traditionally, double emphasis, marked up in manuscripts and typescripts by double underlining, is represented by SMALL CAPS; triple emphasis, marked up by triple underlining, is shown with ALL CAPS. You could do double-strength text with extra bold, I guess.Re: XTech Creole presentation fallout
Hi Jen,
I too hate PPT yet I downloaded your Presentation and tried to view it in Firefox loading the xul file - but it was all messd up.
Any clues?
Re: XTech Creole presentation fallout
Probably you need to add the request parameter
?data=creole.dataon the end of the URI you use to access the file. For example, use the URI: