RELAX NG for matching

Mar 6, 2008

I’m still thinking about doing automatic markup with XML pipelines, and the kind of components that you might need in such a pipeline. These are the useful ones (list inspired by the components offered by GATE):

  • a tokeniser that uses regular expressions to add markup to plain text
  • a gazetteer that uses a lookup to add markup to plain text
  • an annotator that adds attributes to existing elements based on their context/content
  • a grouper that adds markup around sequences of existing markup
  • a stripper that removes markup
  • a general purpose transformer that uses XSLT to do just about everything else
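The first of these components is easy to sketch. Here's a minimal, illustrative regex-based tokeniser in Python — not GATE's implementation; the pattern and element name are my own, just to show the shape of the thing:

```python
import re

def tokenise(text, pattern, element):
    """Wrap every match of `pattern` in `text` in an <element>...</element> pair.

    A toy version of the tokeniser component: it adds markup to plain
    text using a regular expression. Real input would need XML-escaping
    first; that's omitted here for brevity.
    """
    return re.sub(pattern,
                  lambda m: "<%s>%s</%s>" % (element, m.group(0), element),
                  text)

# Mark up ISO-style dates in a sentence (pattern and element are illustrative)
marked_up = tokenise("The meeting is on 2008-03-06.", r"\d{4}-\d{2}-\d{2}", "date")
print(marked_up)
```

The gazetteer would look much the same, except that the pattern would be built from a lookup list of known names rather than written by hand.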

Decision making

Mar 3, 2008

When I was young, my dad taught me a way of making tough decisions. You get a sheet of paper, make one column for each of the possibilities, and list pros and cons. The one that ends up with the most (important) pros and fewest (important) cons is the one you should choose. My dad is a TJ-type.

My mum also taught me a way of making tough decisions. Her way was to toss a coin. But the point was not just to toss the coin, but to see how you feel when it lands. If you’re pleased, go with it. If you’re disappointed, ignore it and go with the other choice. My mum is an FJ-type.

New laptop time

Feb 28, 2008

[Update: Added Lenovo X300 to the comparison table. I haven’t managed to find a firm price, but the model I’d be looking at (with 4GB RAM) is selling for 2,926 Euros, which is £2,230.58.]

My current laptop is on its last legs, due to an annoying hardware problem (the plastic holding together the screen hinge on the right has broken, and every time you open up the laptop it feels like there’s a chance the screen will disconnect entirely).

So I need to find a new laptop, which is a shame because aside from being underpowered compared to current laptops, this one is just about perfect. It’s a Fujitsu Siemens Lifebook P7010 and has the following characteristics that I appreciate:

  • small: it’s about the size of an A4 piece of paper (but thicker, obviously), sits easily on my knees even in cramped commuter trains, and slides neatly into a smart shoulder bag that people never suspect holds a computer
  • light: it’s much less than 2kg, which is my cut-off weight
  • widescreen: it’s a 10.6” widescreen with a resolution of 1280x768. Actually, it’s the width that I appreciate, so I guess widescreen isn’t essential if the screen is larger anyway, but I need those 1280 pixels.
  • battery life: the battery life used to be around 6 hours, which is enough for the longest train journeys, or just an evening unconstrained by power cords; it’s tailed off now, but it’s still not bad

Automatic markup and XML pipelines

Feb 25, 2008

The project I’m working on at the moment aims to use RDFa (in XHTML) to expose some of the semantics in some natural-language text. We’re aiming moderately low – marking up dates, addresses, people’s names, and various other more domain-specific things – at least at the moment.

The problem we’re getting into now is how to get that information marked up. Because the information comes from various fairly unregulated sources, there’s no way we can force the authors to do the markup. And the scope for making it “worth their while” (in terms of making their authoring job easier or more effective, or even offering financial rewards) is very low.

So we’re taking a look at the technologies we might use for automating the markup, specifically GATE and UIMA.

RDF and XML Q&A: Which should I use?

Feb 17, 2008

Another question to answer:

I’ve been reading about RDF, and I’m not sure in what situations it is more appropriate to use RDF over straight XML. I usually see RDF expressed as XML, but sometimes I see it written as language-independent functions (or methods).

Part of me is wondering if RDF is more appropriate for this project. What might the benefits be? And if it is, how difficult would it be to refactor?

(Note that the person asking the question is talking about a small data-oriented project.) There’s a huge amount that could be said about this, so I might well post about some of it again. Here, I’m going to cut to the chase. This is what I’d recommend:

  1. Model your application in RDF terms: Create a description of what classes of resources your application needs to deal with, and which properties link them together. You can call this description an RDF schema or conceptual model or ontology, depending on how impressive you want to sound. This modelling activity is useful in itself, largely because it helps you understand what information you’re dealing with and how it fits together.

  2. Create a markup language that can be mapped to RDF: An XML version of your data makes it more generally available and reusable than it would be locked away in a triple store. Do one of the following:

    • Define a subset of RDF/XML for your application: The full flexibility of RDF/XML is complicated to handle for plain XML processors, so subset it to, for example, always use typed elements (such as <my:Course>) rather than rdf:type properties, and to use referencing or nesting in a consistent way.

    • Design markup languages that use RDFa attributes to reflect the semantics of the data: This gives you a standard way of mapping your markup language into RDF triples without having to adopt the “striped” design of RDF/XML in your markup language. A lot of the attributes can be defaulted to leave the markup language fairly streamlined.

    • Design markup languages exactly as you like, and define GRDDL mappings from them into RDF/XML: This gives you the most flexibility in your markup language design (though not complete flexibility – you still need to be able to identify the statements that you want to make from the XML), at the expense of having to write some XSLT.
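To make the RDFa option concrete, here’s the sort of fragment I mean — a hypothetical sketch in which the example.org URI, the course title and the author name are all invented; only the Dublin Core namespace is real:

```xml
<!-- Hypothetical XHTML+RDFa snippet: the about URI and the values are invented -->
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.org/course/101">
  <span property="dc:title">Introduction to RDF</span>
  by <span property="dc:creator">Jane Smith</span>
</div>
```

An RDFa processor would extract two triples from this — the resource http://example.org/course/101 has a dc:title of “Introduction to RDF” and a dc:creator of “Jane Smith” — without the markup language having to adopt the “striped” resource/property nesting that RDF/XML would force on it.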

The point of doing this is to put you in a position where you can just use XML if you want, but you also have the flexibility of using RDF either now or in the future.

The benefits of using RDF are partly to do with the ease with which you can do certain kinds of processing (specifically combining “facts” together to draw conclusions) and partly to do with the potential for reuse of your data. In the same way that XML gives people a common syntax and thus aids interchange of information, RDF allows others to draw some conclusions (more than they would with a random mess of elements and attributes) about what your data means.

I don’t think that using RDF triple stores, SPARQL and all that jazz gives you a great return for a small-scale, personal project – you’re better off sticking to flat files and some XSLT – but it doesn’t hurt to build in some of the formality of RDF anyway.