pipelines

RELAX NG for matching

I’m still thinking about doing automatic markup with XML pipelines, and the kind of components that you might need in such a pipeline. These are the useful ones (list inspired by the components offered by GATE):

  • a tokeniser that uses regular expressions to add markup to plain text
  • a gazetteer that uses a lookup to add markup to plain text
  • an annotator that adds attributes to existing elements based on their context/content
  • a grouper that adds markup around sequences of existing markup
  • a stripper that removes markup
  • a general purpose transformer that uses XSLT to do just about everything else
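
As a rough illustration of the first two components in that list, here is a minimal Python sketch of a regex tokeniser followed by a gazetteer. Everything in it is an assumption made for the example: the date pattern, the gazetteer entries and the element names `date` and `place` are hypothetical, and this is not how GATE implements either component.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: ISO-style dates such as 2007-05-14.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical gazetteer: known phrases mapped to the element that wraps them.
GAZETTEER = {"Science Museum": "place", "London": "place"}


def tokenise(text):
    """Tokeniser step: escape plain text, then wrap regex matches in <date>."""
    return DATE_PATTERN.sub(
        lambda m: "<date>" + m.group(0) + "</date>", escape(text)
    )


def gazetteer(markup):
    """Gazetteer step: wrap known phrases, looking only at text between tags."""
    # Longest phrases first, so a longer entry wins over a shorter one inside it.
    phrases = sorted(GAZETTEER, key=len, reverse=True)
    lookup = re.compile("|".join(re.escape(p) for p in phrases))

    def wrap(match):
        name = GAZETTEER[match.group(0)]
        return "<{0}>{1}</{0}>".format(name, match.group(0))

    # The capturing group keeps the tags in the split result; they pass
    # through untouched, and only the text chunks are searched.
    parts = re.split(r"(<[^>]+>)", markup)
    return "".join(p if p.startswith("<") else lookup.sub(wrap, p) for p in parts)


def pipeline(text):
    """Run the two steps in order, each consuming the previous step's output."""
    return gazetteer(tokenise(text))
```

Calling `pipeline("We went to the Science Museum on 2007-05-14.")` wraps the museum name in `<place>` and the date in `<date>`. The sketch works on strings for brevity; a real pipeline would pass XML between steps and would need to handle phrases that span existing markup.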

Automatic markup and XML pipelines

The project I’m working on at the moment aims to use RDFa (in XHTML) to expose some of the semantics in some natural-language text. We’re aiming moderately low — marking up dates, addresses, people’s names, and various other more domain-specific things — at least at the moment.

The problem we’re getting into now is how to get that information marked up. Because the information comes from various pretty unregulated sources, there’s no way we can force the authors to do the markup. And the scope for making it “worth their while” (in terms of making their authoring job easier or more effective or even offering financial rewards) is very low.

So we’re taking a look at the technologies we might use for automating the markup, specifically GATE and UIMA.

Detecting streamability in XPath expressions and patterns

The XSL Working Group gave some comments recently on the Last Call Working Draft of XProc. One of the comments was about a bunch of standard steps that we’ve specified which do things you can do in XSLT, such as renaming certain nodes. These steps generally use XPath expressions or XSLT patterns to identify which nodes should be processed.

What bothers the XSL WG is that these steps aren’t guaranteed to be streamable. In a streamable process, an input document can be delivered to the processor as a stream of events (and an output similarly generated as a stream of events) rather than as an in-memory representation. Such processes will start producing results more quickly and require less memory than non-streamable ones. And, because they don’t need as much memory, they are able to work on larger documents.
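
To make the idea of a stream of events concrete, here is a minimal Python sketch of a rename step written as a SAX filter. It is an illustration under stated assumptions, not XProc's rename step: `RenameFilter` and `rename_streaming` are hypothetical names, and the filter matches on element name only. A plain name test like this can be decided as each event arrives, which is what makes it streamable; a pattern that depends on content appearing later in the document could not be handled this way.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Rename every element called `old` to `new` as events pass through."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old = old
        self._new = new

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        # Forward the (possibly renamed) start tag to the downstream handler.
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


def rename_streaming(xml_bytes, old, new):
    """Parse the input as SAX events and serialise the renamed events."""
    out = io.StringIO()
    reader = RenameFilter(xml.sax.make_parser(), old, new)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_bytes))
    return out.getvalue()
```

For example, `rename_streaming(b"<doc><para>hi</para></doc>", "para", "p")` produces a document whose `para` elements have become `p`. The input is held in memory here only to keep the example self-contained; the filter itself never builds a tree.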

If the processes we defined in XProc were streamable, they’d have a clear advantage over their XSLT equivalents, and therefore a purpose. However, since they’re not guaranteed streamable, it looks like we’re simply creating yet another transformation language.

XProc Last Call

Can you believe it, we’ve made it to Last Call on XProc (the XML pipeline language)! That’s only, like, nine months later than the published schedule, which I reckon is pretty good going. (Then again, I’m judging it against XSLT 2.0…)

I’m really excited about XProc. I’ve found that pipelining in XSLT — splitting up processing tasks into smaller, more manageable processing tasks and stringing them together — has greatly improved my productivity and the simplicity and maintainability of the code I write. But some processing (such as that used by my XSLT unit test framework) can’t be done in a single transformation, some is on massive documents that you can’t realistically process with XSLT (and I really don’t want to have to write SAX or StAX code to do it), and some I just want to do on all the files in a directory.

XProc gives me a high-level, declarative, streamable processing language for XML documents. And I think we’ve struck the right balance between something that’s simple enough to be easy for everyday tasks, and powerful enough to be able to do the more complex things you might want to do with it.

Pipelines (of lentils) in action

We went to the Science Museum on Monday. In Launch Pad, there are lots of hands-on activities for children. One of them starts with a big container with lots of lentils in it. You have to fill a bucket with lentils, then hoist the bucket up and along so it meets with a device that flips it over so that the lentils spill down a funnel into a tube and along a chute into another large container. From there there are two Archimedes screws linked together that, when you turn their handles, take the lentils into another funnel and down another tube into yet another large-ish container. From there, there are two conveyor belts with scoops attached that take the lentils up to another funnel, down another pipe and back into the first big container, where they can start the entire process again.

XTech 2007: Wednesday 16th May Afternoon

Yes, I’m determined to write up every talk I attended at XTech 2007, so that I have a record of it if nothing else. On Wednesday afternoon, I attended sessions on microformats, internationalisation and NVDL (as well as giving my own talk, of course).

Pipelining in XSLT

I took on a long-term contract back in January which is good fun (of course I have to say that; my boss might read this) and pretty challenging.

First, I’m hobbled by having to use XSLT 1.0 (MSXML, what’s more). I hadn’t really realised either how fantastic XSLT 2.0 is, or how used to it I’ve become, until I started this work. How I miss user-defined functions, sequence constructors and if expressions.

Second, my task is to take some XHTML generated from WordprocessingML and (a) turn all the CSS styling relative, so that it uses ems and percentages all over the place rather than points and (b) rationalise the CSS so that common styling appears in the <head> of the XHTML rather than on individual elements.
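
The arithmetic behind part (a) can be sketched in a few lines of Python. This is only an illustration of the conversion, not the XSLT 1.0 code the task actually calls for, and the 12pt base size and the name `points_to_ems` are assumptions made for the example. In real CSS the right base depends on the property: an em in `font-size` is relative to the parent's font size, while an em elsewhere is relative to the element's own.

```python
import re

# Matches a point length such as "18pt" or "10.5pt".
POINTS = re.compile(r"(\d+(?:\.\d+)?)pt\b")


def points_to_ems(declaration, base_pt=12.0):
    """Rewrite every point length in a CSS declaration as ems.

    base_pt is the font size (in points) that 1em is assumed to equal.
    """
    def convert(match):
        ems = float(match.group(1)) / base_pt
        # Round to three decimal places and drop trailing zeros.
        return "{:g}em".format(round(ems, 3))

    return POINTS.sub(convert, declaration)
```

With the assumed 12pt base, `points_to_ems("margin-left: 18pt; text-indent: 6pt")` gives `"margin-left: 1.5em; text-indent: 0.5em"`.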
