Automatic markup and XML pipelines

The project I’m working on at the moment aims to use RDFa (in XHTML) to expose some of the semantics in some natural-language text. We’re aiming moderately low — marking up dates, addresses, people’s names, and various other more domain-specific things — at least at the moment.

The problem we’re getting into now is how to get that information marked up. Because the information comes from various pretty unregulated sources, there’s no way we can force the authors to do the markup. And the scope for making it “worth their while” (in terms of making their authoring job easier or more effective or even offering financial rewards) is very low.

So we’re taking a look at the technologies we might use for automating the markup, specifically GATE and UIMA.

These technologies basically use pipelines of components which each add some (out of line) annotations to the text. The annotations are done out of line because they might overlap, but you can (usually) serialize them into XML, which is what we want to do.
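
To make the out-of-line idea concrete, here is a minimal Python sketch, not taken from GATE or UIMA. It assumes annotations are plain (start, end, name) tuples over the text, and the function name serialize is hypothetical. It writes properly nested annotations out as inline XML and refuses overlapping ones, which is exactly the case that plain hierarchical XML cannot represent.

```python
from xml.sax.saxutils import escape


def serialize(text, annotations):
    """Turn (start, end, name) standoff annotations into inline XML.

    Assumes the annotations nest properly; overlapping spans raise ValueError.
    """
    # Outermost first: sort by start offset, then by longest span.
    ordered = sorted(annotations, key=lambda a: (a[0], -a[1]))
    out = []
    open_stack = []  # (end, name) of the elements that are currently open
    pos = 0

    def close_top():
        nonlocal pos
        end, name = open_stack.pop()
        out.append(escape(text[pos:end]))
        out.append(f"</{name}>")
        pos = end

    for start, end, name in ordered:
        # Close every open element that ends before this one starts.
        while open_stack and open_stack[-1][0] <= start:
            close_top()
        if open_stack and end > open_stack[-1][0]:
            raise ValueError(f"<{name}> overlaps <{open_stack[-1][1]}>")
        out.append(escape(text[pos:start]))
        out.append(f"<{name}>")
        pos = start
        open_stack.append((end, name))

    while open_stack:
        close_top()
    out.append(escape(text[pos:]))
    return "".join(out)


text = "Meet Jane Doe on 4 May 2008."
annotations = [(5, 13, "person"), (17, 27, "date"), (0, 28, "sentence")]
print(serialize(text, annotations))
```

Because the text and the annotations are kept apart until this last step, earlier components never have to worry about well-formedness; only the serializer does.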

I find these technologies frustrating for a number of reasons:

  • any configuration we do will be specific to that particular application; it’ll be hard for us to change to another implementation later on, and reuse by others will be limited to those who use the same implementation
  • they involve a fair bit of proper coding (by which I mean Java or C++)
  • where components can be configured through declarative means (such as keyword lists), there’s no way to reuse (XML/RDF) resources that we already have; we’ll have to manage transformations from them into the accepted formats through some external means, and I just know they’ll get out of sync
  • their user documentation is dreadful; it seems like you need to have a good understanding of natural language processing to have a hope of even getting started

It strikes me that the really powerful part of each of these technologies is the pipelining. The pipelining allows you to string together relatively simple operations (tokenising text, extrapolating sentences, marking up keywords, resolving ambiguities based on context etc.) which together give you something reasonably sophisticated.
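
As a rough illustration of that stringing together, the sketch below chains three deliberately naive steps in Python. Everything in it is an assumption for illustration: the step names, the (start, end, type) tuple format, and the regular expressions are mine, not anything GATE or UIMA define. The point is only that each step reads the annotations left by earlier ones and adds its own.

```python
import re

# Each step takes the text plus the annotations so far and returns a new list
# of (start, end, type) tuples; the text itself is never modified.


def tokenise(text, annotations):
    tokens = [(m.start(), m.end(), "token") for m in re.finditer(r"\w+", text)]
    return annotations + tokens


def mark_sentences(text, annotations):
    # Naive rule: a sentence runs up to the next '.', '!' or '?'.
    sentences = [(m.start(), m.end(), "sentence")
                 for m in re.finditer(r"\S[^.!?]*[.!?]", text)]
    return annotations + sentences


def make_keyword_step(keywords):
    def mark_keywords(text, annotations):
        # Builds on the output of tokenise: only tokens are candidates.
        found = [(s, e, "keyword") for s, e, kind in annotations
                 if kind == "token" and text[s:e] in keywords]
        return annotations + found
    return mark_keywords


def run_pipeline(text, steps):
    annotations = []
    for step in steps:
        annotations = step(text, annotations)
    return annotations


text = "Jane met Bob. They left."
steps = [tokenise, mark_sentences, make_keyword_step({"Jane", "Bob"})]
result = run_pipeline(text, steps)
for start, end, kind in result:
    if kind != "token":
        print(kind, repr(text[start:end]))
```

Swapping a step for a smarter one (a real sentence splitter, say) leaves the rest of the pipeline untouched, which is the property that makes the approach attractive.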

Using XProc to coordinate the pipeline would alleviate many of my frustrations. XProc can and will be implemented on many platforms, in many languages, so it’ll be possible to move the pipeline from place to place (assuming that the components of the pipelines are similarly generic). It’s declarative, so no “proper coding”. We’ll be able to incorporate any transformations from existing XML/RDF data to the required configuration formats right into the pipeline. And… OK, it won’t automatically give us great user documentation or GUIs, but they’ll come.
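
This is not XProc, but the idea of folding the resource transformation into the pipeline can be sketched in Python. The terms/term vocabulary below is a made-up stand-in for whatever XML resource already exists, and the function names are hypothetical. The keyword list is derived from the XML every time the pipeline runs, so there is no second copy to drift out of sync.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical existing resource; the element names are assumptions.
EXISTING_RESOURCE = """<terms>
  <term>Jane Doe</term>
  <term>London</term>
</terms>"""


def keywords_from_xml(xml_text):
    """Extract the keyword list straight from the existing XML resource."""
    root = ET.fromstring(xml_text)
    return sorted({t.text.strip() for t in root.iter("term")
                   if t.text and t.text.strip()})


def keyword_step(keywords):
    """Build a pipeline step that annotates every occurrence of a keyword."""
    if not keywords:
        return lambda text: []
    # Longest keywords first, so a longer term wins over a shorter prefix.
    alternatives = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives))

    def step(text):
        return [(m.start(), m.end(), "keyword") for m in pattern.finditer(text)]
    return step


step = keyword_step(keywords_from_xml(EXISTING_RESOURCE))
print(step("Jane Doe moved to London."))
```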

The big problem is that XProc is still a Working Draft and the XProc ecosystem isn’t well-developed. If we were one or two years down the line, XProc would be a Recommendation, there’d be a .NET implementation readily available, and even perhaps extension XProc step types for tokenising, grouping and the other things we’d need to do; anything that was missing we could pull together using XSLT.

As it is, we’re in that annoying in-between-time when the Right technology isn’t ready and it looks like we’re going to have to put effort into working with what feels like the Wrong technology just to get things done. But perhaps I’m overlooking something in GATE or UIMA, or have missed another technology that would help us. Anyone out there got some experience that could help guide us?

Comments

Re: Automatic markup and XML pipelines

annotations are done out of line because they might overlap

I believe there is a variant of XML that allows inline markup of overlapping regions, perhaps you have heard of it?….

David

Re: Automatic markup and XML pipelines

Indeed. If we were to use XProc (or another XML-based pipeline system), and we needed to support overlapping structures, I think we’d use something like CLIX. But I actually think we could go a long way with plain old hierarchical XML.

Re: Automatic markup and XML pipelines

I'm also hoping XProc materializes soon. In the meantime I'm (ab)using Ant too.

One other thought with adding RDFa to XHTML is that it could get awfully crowded in these documents when multiple applications are fighting for @property and @content attributes. I'm thinking it could be beneficial to be able to move such semantic markup in/out easily. ITS is not RDFa but is also for the most part an attributes language. It defines a way to add semantic markup using rule-sheets. Felix Sasaki wrote a paper where he describes the ITS approach to attach localization information to XML. Doing it in rule-sheets that are based on XPath reminds me a bit of the Schematron approach. With this approach you can probably push attributes (possibly RDFa) into XHTML.
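
A minimal sketch of that rule-sheet idea, in Python rather than in ITS syntax: each rule pairs a selector with the attributes to push onto whatever it matches. The rule format, the class names, and the property values are all assumptions for illustration, the XHTML namespace is omitted for brevity, and ElementTree only supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule sheet: (selector, attributes to add). Kept outside the
# document so the semantic markup can be moved in or out easily.
RULES = [
    (".//span[@class='date']", {"property": "dc:date"}),
    (".//span[@class='person']", {"property": "foaf:name"}),
]


def apply_rules(root, rules):
    """Set each rule's attributes on every element its selector matches."""
    for selector, attributes in rules:
        for element in root.findall(selector):
            for name, value in attributes.items():
                element.set(name, value)
    return root


doc = ET.fromstring(
    "<p>Met <span class='person'>Jane</span> "
    "on <span class='date'>4 May</span>.</p>"
)
apply_rules(doc, RULES)
print(ET.tostring(doc, encoding="unicode"))
```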

Re: Automatic markup and XML pipelines

To clarify: the overall process begins with some domain-specific XML. We’re looking to add extra elements to that XML. It’s only much further down the line that we will expose at least some of that extra markup as RDFa/XHTML, but the markup needs to be in the original XML for us to expose it in that way.

I saw a paper about ITS somewhere else (I think it must have been Yves Savourel’s presentation at XTech 2007) and I remember thinking it looked pretty general purpose. Adding attributes/annotation is just one step in automatically marking up text, but the ITS approach does look like a promising one.

Re: Automatic markup and XML pipelines

you might find some solution using Ant and continuous integration engines like Anthill pro… this is abuse of build systems to achieve ‘work’ … though it has worked for me well in the past when controlling pipelines of XML.

another approach might be using eXist XMLDB and having triggers on collections; each collection could represent a step in your workflow and when a document goes in there (or is updated) you could generate (XSLT, XPath, XQuery) some output (which could go to another collection and so on) … eXist is not really ready for prime time in its default configuration … I tend to wrap it up in a Perl handler and block all access to it (via iptables) … but very effective.

I have some significant experience with both these approaches and would be willing to answer questions.

cheers, Jim Fuller

Note that I have

Re: Automatic markup and XML pipelines

Thanks for the suggestions. Given the Microsoft/.NET bias in the organisation that I’m working for, I don’t think Ant would work, but there is NAnt, which I suppose might do the job. Then we’d just (perhaps) have to rewrite the pipeline in XProc syntax when it got finalised.

Similarly, we have an in-house content management system and I don’t think suggesting eXist would go down too well. I suppose we could script up some code to perform the transformations within the CMS, but it sounds like you’d get a lot of dependencies between the repository and the pipeline: adding a step would mean adding a collection, right? I think that’s something we want to avoid.

Re: Automatic markup and XML pipelines

Given the .NET bias, have you looked at XAML Workflow and how this is used in the declarative aspect of the Windows Workflow Foundation, and also how this might align with XProc?