Unicode database in XML

May 14, 2007

Whatever algorithm you use to calculate Levenshtein distance, one of its great features is that you can tweak the cost of letter substitutions. For example, you can do a case-insensitive comparison of two strings, or perhaps more interestingly a semi-case-sensitive comparison of two strings, where the cost of replacing a character with its upper or lower case equivalent is less than the cost of replacing a character with an unrelated character, but more than zero. But that requires knowledge of whether and how two characters are related.
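To make the idea of a tweakable substitution cost concrete, here is a minimal sketch in Python rather than XSLT. The function names and the 0.5 cost for a case-only difference are assumptions chosen for illustration, and Python's built-in `str.lower()` stands in for a lookup in the Unicode database.

```python
from typing import Callable


def semi_case_cost(x: str, y: str) -> float:
    """Hypothetical cost function: case-only differences are cheap but not free."""
    if x == y:
        return 0.0
    if x.lower() == y.lower():
        return 0.5  # arbitrary value between 0 and 1, chosen for illustration
    return 1.0


def weighted_levenshtein(
    a: str,
    b: str,
    substitution_cost: Callable[[str, str], float] = semi_case_cost,
) -> float:
    """Levenshtein distance with a pluggable substitution cost.

    Insertions and deletions always cost 1.
    """
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [float(i)]
        for j, y in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete x
                    current[j - 1] + 1,  # insert y
                    previous[j - 1] + substitution_cost(x, y),  # substitute
                )
            )
        previous = current
    return previous[-1]
```

Passing a cost function that returns 0 whenever `x.lower() == y.lower()` would give the fully case-insensitive comparison instead.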

Of course all that information is stored in the Unicode Database, which is a bunch of text files in a structured format. I looked for an XML version but couldn’t find one (well, Googling “Unicode database XML” isn’t much help). So I downloaded UnicodeData.txt and NamesList.txt and put together an XSLT 2.0 stylesheet to create an XML version of the Unicode database.
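For a feel of what such a conversion involves, here is a rough sketch in Python (not the XSLT 2.0 stylesheet itself). It relies on UnicodeData.txt being a semicolon-separated file with the simple uppercase, lowercase and titlecase mappings in the last three fields. The element and attribute names (`unicode`, `char`, `upper`, `lower`, `title`) are made up for this example, and NamesList.txt is left out.

```python
import xml.etree.ElementTree as ET

# Two lines in the UnicodeData.txt format, used as sample input.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def unicode_data_to_xml(text: str) -> ET.Element:
    """Turn UnicodeData.txt-style lines into a small XML tree.

    The XML vocabulary here is hypothetical, chosen only for illustration.
    """
    root = ET.Element("unicode")
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        char = ET.SubElement(
            root, "char", code=fields[0], name=fields[1], category=fields[2]
        )
        # Fields 12-14 hold the simple upper, lower and title case mappings.
        for attribute, index in (("upper", 12), ("lower", 13), ("title", 14)):
            if fields[index]:
                char.set(attribute, fields[index])
    return root


if __name__ == "__main__":
    print(ET.tostring(unicode_data_to_xml(SAMPLE), encoding="unicode"))
```

With the case mappings exposed as attributes, a comparison function can check whether two characters are related by looking up one character's `upper` or `lower` value.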

Comment spam and feed format

May 14, 2007

I guess it’s an indication of something (like just being indexed by Google) when you first get comment spam on your blog. Anyway, I really don’t want to insist that commentators create accounts here, so after several annoying days of repeatedly deleting spam comments, I installed the Drupal Spam module and every spam comment since has been captured.

Reporting on the blogosphere

May 10, 2007

I noticed what I think is a new phenomenon earlier this week, while reading my daily paper. This is an extract from an Independent story about the abduction of a toddler from a holiday resort:

In the UK, the distraught parents were criticised in internet chat rooms for allowing their children to be out of their sight. [snip]

Some bloggers taking part in discussions threads on the internet since the news broke have claimed that as well-paid professionals the couple should have known better than to leave the children unsupervised. [snip]

Of course I’ve seen stories about blogging and internet use in newspapers before, but this is the first time that I’ve noticed a mainstream news article reporting on what internet users were saying about a mainstream news story.

Big XSLT applications just got easier to manage

May 10, 2007

I used to know how to arrange my XSLT modules. Each module had to be self-contained, and any common code had to be imported into all the modules that used it. The reason? Because when you have on-going validation of your XSLT stylesheets, if the module can’t stand alone then you get all sorts of spurious errors. For example, if you define a variable in module A, which includes module B which uses that variable, then although the application as a whole will work fine, when you’re editing module B you’ll get errors because the variable isn’t defined in that module.

That rationale just got blown out of the water.

Levenshtein distance on the diagonal

May 6, 2007

The big problem with the previous Levenshtein distance implementation is that it recurses a number of times (roughly) equal to the product of the lengths of the two strings you’re comparing. If you’re using an XSLT processor that doesn’t recognise the function as being tail recursive then you can’t compare two strings more than about 20 characters in length (400 recursions).

The problem is that the standard dynamic programming Levenshtein distance algorithm is written for procedural programming languages in which you can do useful things like updating variables. XSLT ain’t like that, so we need an alternative algorithm.
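One way to read "on the diagonal" can be sketched as follows. This is my own illustration in Python, not necessarily the algorithm developed in the post. It assumes the distance table is filled one anti-diagonal at a time, with each diagonal built as a fresh value from the two before it, so nothing is updated in place. The recursion depth is then about the sum of the two string lengths rather than their product.

```python
def levenshtein_by_diagonals(a: str, b: str) -> int:
    """Levenshtein distance computed one anti-diagonal at a time.

    Diagonal k holds every cell (i, j) with i + j == k, stored as a dict
    keyed by i. No diagonal is modified after it has been built.
    """
    n, m = len(a), len(b)

    def cell(i: int, j: int, prev1: dict, prev2: dict) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        substitution = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            prev1[i - 1] + 1,  # cell (i-1, j) on diagonal k-1: deletion
            prev1[i] + 1,  # cell (i, j-1) on diagonal k-1: insertion
            prev2[i - 1] + substitution,  # cell (i-1, j-1) on diagonal k-2
        )

    def step(k: int, prev1: dict, prev2: dict) -> dict:
        current = {
            i: cell(i, k - i, prev1, prev2)
            for i in range(max(0, k - m), min(k, n) + 1)
        }
        # One recursive call per diagonal: len(a) + len(b) + 1 calls in total.
        return current if k == n + m else step(k + 1, current, prev1)

    return step(0, {}, {})[n]
```

For two 20-character strings this makes 41 calls to `step` instead of around 400 recursions, because each call handles a whole diagonal.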