Unicode database in XML

This post was imported from my old Drupal blog. To see the full thing, including comments, it's best to visit the Internet Archive.

Whatever algorithm you use to calculate Levenshtein distance, one of its great features is that you can tweak the cost of letter substitutions. For example, you can do a case-insensitive comparison of two strings or, perhaps more interestingly, a semi-case-sensitive comparison, where the cost of replacing a character with its uppercase or lowercase equivalent is more than zero but less than the cost of replacing it with an unrelated character. But that requires knowing whether and how two characters are related.
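To make the semi-case-sensitive idea concrete, here is a minimal Python sketch of a Levenshtein distance with a pluggable substitution cost. The 0.5 weight for a case-only change is an arbitrary illustrative value, the function names are made up for this example, and the case test leans on Python's built-in `str.lower()` rather than on any data file.

```python
def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    # Assumed weight: a case-only difference costs half a full substitution.
    if a.lower() == b.lower():
        return 0.5
    return 1.0


def weighted_levenshtein(s: str, t: str) -> float:
    """Edit distance with unit insert/delete costs and a tunable substitution cost."""
    # previous[j] is the distance between the prefix of s handled so far and t[:j].
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute a with b
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("Unicode", "unicode"))  # 0.5
print(weighted_levenshtein("kitten", "sitting"))   # 3.0
```

Swapping in a different `substitution_cost` is all it takes to change the notion of "related" characters, which is exactly where character data comes in.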

Of course, all that information is stored in the Unicode Database, which is a bunch of text files in a structured format. I looked for an XML version but couldn’t find one (well, Googling “Unicode database XML” isn’t much help). So I downloaded UnicodeData.txt and NamesList.txt and put together an XSLT 2.0 stylesheet to create an XML version of the Unicode database.

The XML contains practically everything that you can get from those two files, which means:

block and subblock structures
hexadecimal and decimal codepoints
names, aliases and comments
category and numeric information
uppercase, lowercase and titlecase equivalents
decomposition of various kinds
related characters
bidi information
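
As a rough illustration of where several of these items come from, the sketch below turns one UnicodeData.txt record (15 semicolon-separated fields) into an XML element. It uses Python rather than the XSLT 2.0 stylesheet, and the `char` element and its attribute names are assumptions made up for this example, not the schema of the actual XML file. Blocks, aliases, comments and related characters come from NamesList.txt and are not covered here.

```python
import xml.etree.ElementTree as ET

# Field order of a UnicodeData.txt record. The attribute names are
# hypothetical labels chosen for this sketch.
FIELDS = [
    "cp", "name", "category", "combining-class", "bidi-class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode1-name", "iso-comment", "uppercase", "lowercase", "titlecase",
]


def record_to_element(line: str) -> ET.Element:
    """Convert one UnicodeData.txt line into a <char> element."""
    values = line.rstrip("\n").split(";")
    char = ET.Element("char")
    for field, value in zip(FIELDS, values):
        if value:  # skip empty fields instead of writing empty attributes
            char.set(field, value)
    # Add the decimal codepoint alongside the hexadecimal one.
    char.set("dec", str(int(values[0], 16)))
    return char


sample = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
print(ET.tostring(record_to_element(sample), encoding="unicode"))
```

The sample line is the record for U+0061; its uppercase and titlecase mappings both point at U+0041, and the lowercase field is empty because the character is already lowercase.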

It might prove easier to search than grepping the text files, if you’re used to using XPath. I might split it up and put together an AJAX browser, in my Copious Spare Time.
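
For a flavour of what searching with XPath could look like, here is a small Python sketch using the limited XPath subset in `xml.etree.ElementTree`. The XML excerpt is hypothetical: the element and attribute names are invented for illustration and may not match the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real file's element and attribute names may differ.
SAMPLE = """
<unicode>
  <block name="Basic Latin">
    <char cp="0041" name="LATIN CAPITAL LETTER A" category="Lu" lowercase="0061"/>
    <char cp="0061" name="LATIN SMALL LETTER A" category="Ll" uppercase="0041"/>
    <char cp="0031" name="DIGIT ONE" category="Nd"/>
  </block>
</unicode>
"""

root = ET.fromstring(SAMPLE)


def uppercase_of(cp: str):
    """Return the uppercase codepoint recorded for cp, or None if there is none."""
    char = root.find(f".//char[@cp='{cp}']")
    return None if char is None else char.get("uppercase")


# All characters in the lowercase-letter category.
lowercase_names = [c.get("name") for c in root.findall(".//char[@category='Ll']")]

print(uppercase_of("0061"))
print(lowercase_names)
```

A lookup like `uppercase_of` is the kind of question a substitution-cost function would ask when deciding whether two characters differ only by case.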
