Overlap, Containment and Dominance

I’ve spent the last few days at a workshop on overlapping markup in Amsterdam. It was organised by Claus Huitfeldt and Michael Sperberg-McQueen under a GODDAG banner, but included representatives of other approaches, such as the XCONCUR crowd and the LMNListas Wendell and myself.

Overlap is arguably the main remaining problem area for markup technologists. Capturing and analysing the overlap between poetic and syntactic structures in poems and plays helps academics gain a deeper understanding of the ways poetic technique has changed over time. And the complexities of structures in documents such as the Bible simply cannot be represented without allowing overlap to happen.

But academic study aside, overlap is a really important problem because whenever we collaborate on documents and whenever we change documents, we create overlapping structures. One of the major projects that I’ve worked on at TSO deals with publishing consolidated legislation, showing the places where “current” legislation was amended over time from its original, enacted state. The authors of legislation care little for document structures, and amendments often overlap document structures such as paragraphs and list items, and each other.

An Example

I used the following example during my talk on the Creole schema language during the workshop. It uses TexMECS notation, in which <name| is a start tag, |name> an end tag and the normal XML syntax is used for attributes:

<page n="199"|
...
<poem|
  <title|<pl|Recueillement|pl>|title>
  <stanza|
    <s|<sl|<pl|Sois sage, ô ma douleur, et tiens-toi plus |pl>
                                        <pl|tranquille.|pl>|sl>|s>
    <s|<sl|<pl|Tu réclamais le Soir; il descend; le voici:|pl>|sl>
    <sl|<pl|Une atmosphère obscure enveloppe la ville,|pl>|sl>
    <sl|<pl|Aux uns portant la paix, aux autres le souci.|pl>|sl>|s>
  |stanza>
  <stanza|
    <s|<sl|<pl|Pendant que des mortels la multitude vile,|pl>|sl>
    <sl|<pl|Sous le fouet du Plaisir, ce bourreau sans merci,|pl>|sl>
    <sl|<pl|Va cueillir des remords dans la fête servile,|pl>|sl>
    <sl|<pl|Ma douleur, donne moi la main; viens par ici,|pl>|sl>
  |stanza>
  <stanza|
    <sl|<pl|Loin d'eux.|s> <s|Vois se pencher les défuntes |pl>
                                                <pl|Années,|pl>|sl>
    <sl|<pl|Sur les balcons du ciel, en robes surannées;|pl>|sl>
    <sl|<pl|Surgir du fond des eaux le Regret souriant;|pl>|sl>
  |stanza>|page>
  <page n="200"|<stanza|
    <sl|<pl|Le Soleil moribund s'endormir sous une arche,|pl>|sl>
    <sl|<pl|Et, comme un long linceul traînant à l'Orient,|pl>|sl>
    <sl|<pl|Entends, ma chère, entends la douce Nuit qui |pl>
                                              <pl|marche.|pl>|sl>|s>
  |stanza>
|poem>
...
|page>

The start and end tags mark ranges in the text. (In some discussions of overlap, the ranges are called “elements”, but I prefer to reserve that term for structures that are self-contained, such as those in XML, to avoid confusion.) In Creole’s compact syntax, you could articulate the structure as follows:

# a book is a sequence of pages; it is also a sequence of poems
start = element book { page+ ~ poem+ }

# a page is a sequence of page lines
page = range page { pl+ }

# a poem starts with a title; the body of the poem can be characterised
# as a sequence of stanzas, but also as a sequence of sentences
poem = range poem { title, ( stanza+ ~ s+ ) }

# a title is a self-contained structure that may contains several page lines
title = element title { pl+ }

# a stanza contains several stanza lines
stanza = range stanza { sl+ }

# a stanza line contains one or more page lines
sl = range sl { pl+ }

# a sentence contains some text
s = range s { text }

# a page line contains some text
pl = range pl { text }

You could go further: sentences are made up of phrases, which are made up of words, which are made up of syllables, which are made up of letters. Stanzas within a sonnet such as this one can be clustered into an octet and a sestet and classified as quatrains and tercets based on the number of lines they contain. Stanza lines are also made up of syllables. And so on. Analysing the way in which the syntactic (sentence/phrase) structure overlaps with the prosodic (stanza/line) structure is one important way in which you can analyse a poem.

Containment vs Dominance

When you’re talking about overlapping structures, it’s useful to make the distinction between structures that contain each other and structures that dominate each other. Containment is a happenstance relationship between ranges while dominance is one that has a meaningful semantic. A page may happen to contain a stanza, but a poem domainates the stanzas that it contains.

In LMNL, we view a document as consisting of a sequence of atoms, usually characters, and ranges over those characters. But the model makes no assertions about dominance relationships between the ranges. This document model is easy to construct from a serialised document like the one above.

Conversely, GODDAG document models are directed acyclic graphs (DAGs): the nodes within those graphs have children and parents, with leaf nodes containing characters, and the parent-child relationship implies dominance. This is a useful model for processing, and particularly querying. Navigating a DAG is a lot like navigating a tree, just one that represents multiple hierarchies. But it isn’t possible to construct a DAG from a serialised document like the one above without extra information about which containment relationships are actually dominance relationships, and which mere happenstance.

So an important challenge is how to get from a flat, containment-only model to a DAG. There are four approaches that can be taken:

  1. For any document, for each pair of range names A and B, if every range named A contains a range named B, then assume that A dominates B; from that set of relationships, create a DAG.
  2. Introduce additional syntax into tags, such that dominance relationships between ranges can be expressed explicitly within the serialisation.
  3. Associate each document with a schema, and use the model expressed in the schema to identify dominance relationships; a Creole schema like the one above could be taking as asserting that poems dominate stanzas, for example, since stanzas are mentioned in the content model of the poem range.
  4. Defer the construction of a DAG to the point of processing; a document would then not be a DAG in and of itself, but only in relation to a particular process.

I find the last of these the most satisfactory. 1 is too arbitrary. 2 requires too much syntax. 3 requires a single schema per document (which, from experience with XML, I think is a broken model). One could imagine being able to specify something like:

book > page > pl > #text
book > poem > stanza > sl > #text
book > poem > s > #text

and this generating a DAG in which a book node had page and poem children, page nodes had pl children which had text children, poem nodes had stanza children and s children, and so on. With this structure, it would be easy enough to find stanzas with four lines (/book/poem/stanza[count(sl) = 4]) without having to worry about the possibilities of happenstance containment, such as some stanza lines being contained by sentences that are contained by stanzas.

There’s lots more to talk about here. In particular, things about the useful and appropriate ways of querying and transforming these structures, and how to best serialise them in XML. But I’ll leave those thoughts for another post.

Comments

Re: Overlap, Containment and Dominance

Capturing and analysing the overlap between poetic and syntactic structures in poems and plays helps academics gain a deeper understanding of the ways poetic technique has changed over time. And the complexities of structures in documents such as the Bible simply cannot be represented without allowing overlap to happen. This is really intriguing for me - can you post anymore pointers to people doing this kind of stuff (especially in semantics of peotry) - I’ve been working on and off on blingual poetry presentation for awhile and am looking to get real.

Re: Overlap, Containment and Dominance

Very interesting.

I’m surprised by your rejection of approach 2 on the grounds that it “requires too much syntax”. I would be inclined to start by designing the information model first and then figure out a syntax to represent that information model. Maybe I’m just brainwashed by too much XML/SGML, but the hierarchical relationships seem like a fundamental aspect of the information about the document which the markup should be capturing explicitly.

Re: Overlap, Containment and Dominance

Speaking only for myself of course, although since Jeni and I are “on the same team” (working on LMNL) we do find our attitudes frequently align.

As I see it, #2 and #4 are actually not mutually exclusive, if a syntactic representation were to be taken as a warrant for a preferred graph, while applications remain free to do other things with a serialization, or more things, than represent that particular graph (as XML applications arguably are free with respect to XML syntax).

In any case, the core of the issue is what the information model is. Much progress was made at the Goddag workshop in exploring the limits that might best be placed on the Goddag structure proposed by Huitfeldt and Sperberg-McQueen in order to ensure that it could be serialized in the form of a markup instance. But the TexMECS/Goddag approach has always been #1, not #2 or #4. This raises the issue of whether the particular structure implied by the syntax (such as TexMECS or XML), where the dominance relation is implicit in the order of tagging, is actually always going to be the best structure, even while other approaches such as #3 (taking a CREOLE instance, say, as not only a schema but a set of structural declarations) might yield a different structure from the same syntactic instance.

This issue aside, taking up approach #2. Given an information model — let’s say a Goddag structure with limitations to ensure serializability — the question gets to be how to represent dominance relations. One solution might be:

  • Any range could be annotated with a pointer to its parent;
  • Any range not so annotated would be taken to be a child of its closest enclosing (“containing”) range;
  • A range that designates a parent range that does not enclose it would be an error.
  • Text fragments would be leaf nodes of the graph and always children of their closest enclosing range.

(Note: I leave aside issues relating to what CMSMcQ calls “spurious overlap”, namely when tag ordering seems not to reflect containment or enclosure properly, such as <b|bold <i|italic|b>|i>. LMNL deals with this by saying that tag ordering in itself carries no information, and enclosure is to be inferred only from the relation among the tagged ranges as such, so the ‘i’ range is enclosed by the ‘b’ range, despite appearances. But other definitions of the relation between markup and model may differ on this point.)

The main question that arises in my mind about this approach would be the overhead necessary to check and enforce well-formedness, especially with respect to the third rule. It’s partly due to this that I tend to agree with Jeni that any syntax is likely to be just too heavy duty.

This is especially the case since in practice, I think users of any system that really handled overlap would have to rely on a structural validation mechanism over and above well-formedness checking in order to maintain their data properly — schemas would be even more essential (and more necessarily switchable) than with XML. If this is the case, then we are leaning back towards options #3 or #4.

The key distinction between approach #2 and either #3 or #4 is that in the latter, the document structures can only be maintained at the level of the “document type” (either essential, as per #3, or incidental, as per #4), whereas in approach #2, one could have syntactic instances that were invalid with respect to the type — if, say, a range pointed to a parent that was syntactically legal (by virtue of its being enclosed by it) but semantically unsound and invalid according to its schema.

Whether this would be a feature or a bug remains to be seen, I think. But as to that, I guess I can even imagine schemas for which such a mechanism would be needed to disambiguate which of several graphs was meant — with the possibility of expressing an invalid graph a necessary corollary of this.

Hm.