Unicode database in XML

Whatever algorithm you use to calculate Levenshtein distance, one of its great features is that you can tweak the cost of letter substitutions. For example, you can do a case-insensitive comparison of two strings or, perhaps more interestingly, a semi-case-sensitive comparison, where the cost of replacing a character with its upper- or lower-case equivalent is more than zero but less than the cost of replacing it with an unrelated character. But that requires knowledge of whether and how two characters are related.
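As a rough sketch of the idea (in Python, and not the code behind this post), here is a dynamic-programming Levenshtein distance where a case-only substitution gets its own cost. The function name `weighted_levenshtein`, the `case_cost` parameter and the use of `str.casefold()` to decide whether two characters are "related" are all assumptions made for the example; a fuller version would look relationships up in the Unicode data instead.

```python
def weighted_levenshtein(a, b, case_cost=0.5):
    """Levenshtein distance where a case-only substitution costs `case_cost`.

    Insertions, deletions and unrelated substitutions cost 1.0.
    `case_cost` is a hypothetical tuning knob: 0.0 gives a case-insensitive
    comparison, 1.0 gives the ordinary case-sensitive distance.
    """
    def substitution_cost(x, y):
        if x == y:
            return 0.0
        # Assumption: two characters are "case-related" if they casefold
        # to the same string.
        if x.casefold() == y.casefold():
            return case_cost
        return 1.0

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [float(i)]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                                  # delete char_a
                curr[j - 1] + 1.0,                              # insert char_b
                prev[j - 1] + substitution_cost(char_a, char_b),  # substitute
            ))
        prev = curr
    return prev[-1]
```

With the default cost, `weighted_levenshtein("Hello", "hello")` comes out at 0.5, between an exact match (0.0) and an unrelated substitution (1.0).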

Of course all that information is stored in the Unicode Database, which is a bunch of text files in a structured format. I looked for an XML version but couldn’t find one (well, Googling “Unicode database XML” isn’t much help). So I downloaded UnicodeData.txt and NamesList.txt and put together an XSLT 2.0 stylesheet to create an XML version of the Unicode database.

The XML contains practically everything that you can get from those two files, which means:

  • block and subblock structures
  • hexadecimal and decimal codepoints
  • names, aliases and comments
  • category and numeric information
  • uppercase, lowercase and titlecase equivalents
  • decomposition of various kinds
  • related characters
  • bidi information
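To give a feel for the mapping, here is roughly how a single UnicodeData.txt record could be turned into an XML element. This is a Python sketch rather than the XSLT 2.0 stylesheet, and the element and attribute names (`char`, `upper`, `lower`, `dec` and so on) are made up for the example; they are not necessarily the names used in the actual file. The only thing taken as given is the format of UnicodeData.txt itself: fifteen semicolon-separated fields per line.

```python
import xml.etree.ElementTree as ET

# The fifteen fields of a UnicodeData.txt record, in order.
# The attribute names are hypothetical, chosen for this example only.
FIELDS = [
    "code", "name", "category", "combining", "bidi", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "unicode1-name",
    "iso-comment", "upper", "lower", "title",
]

def record_to_element(line):
    """Convert one UnicodeData.txt line into a <char> element."""
    values = line.strip().split(";")
    char = ET.Element("char")
    for field, value in zip(FIELDS, values):
        if value:  # leave out empty fields rather than emit empty attributes
            char.set(field, value)
    # Add the decimal codepoint alongside the hexadecimal one.
    char.set("dec", str(int(values[0], 16)))
    return char

element = record_to_element("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
xml_text = ET.tostring(element, encoding="unicode")
```

Here `xml_text` holds a single `<char>` element for U+0041, carrying its name, category, lowercase equivalent and both forms of the codepoint as attributes.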

It might prove easier to search than grepping the text files, if you’re used to using XPath. I might split it up and put together an AJAX browser, in my Copious Spare Time.
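For instance, a couple of XPath-style lookups might look like the following. This is only a sketch: it uses Python's `xml.etree.ElementTree`, which supports a limited subset of XPath, and the tiny sample document with its `unicode`, `block` and `char` names is invented for the example rather than copied from the real file.

```python
import xml.etree.ElementTree as ET

# A made-up fragment standing in for the real XML; the structure and
# names here are assumptions for illustration.
SAMPLE = """
<unicode>
  <block name="Basic Latin">
    <char code="0041" name="LATIN CAPITAL LETTER A" category="Lu" lower="0061"/>
    <char code="0061" name="LATIN SMALL LETTER A" category="Ll" upper="0041" title="0041"/>
    <char code="0031" name="DIGIT ONE" category="Nd" numeric="1"/>
  </block>
</unicode>
""".strip()

root = ET.fromstring(SAMPLE)

# Names of all characters in the uppercase-letter category.
uppercase_names = [c.get("name") for c in root.findall(".//char[@category='Lu']")]

# Codepoints of all characters that have a lowercase equivalent.
codes_with_lower = [c.get("code") for c in root.findall(".//char[@lower]")]
```

The equivalent with grep means remembering which of the semicolon-separated fields holds what, which is where the XML version should earn its keep.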

Comments

Re: Unicode database in XML

unicode.xml is 404.

Re: Unicode database in XML

I do sometimes wonder if we’re the same person:-)

Most (or some, depending on how you count) of the data in UnicodeData.txt has been available as “unicode.xml” for years (er, decades nearly:-). First in Sebastian’s jadetex support for DSSSL, then in the MathML sources, and most recently in the sources for the entity set draft at http://www.w3.org/2003/entities/xml/unicode.xml

As part of the build-up to the MathML 3 drafts I recently needed to update those to Unicode 5 (which has some new characters specifically to support the entity sets), and I got fed up of only having “most” of UnicodeData.txt, so I “put together an XSLT 2.0 stylesheet” which added all the missing info.

The draft is currently W3C-member only, but really only because that’s a convenient CVS archive until it’s ready to update the public version. If you (or your readers) have W3C access they may like to compare with http://www.w3.org/Math/Group/spec/xml/unicode.xml (the HTTP view of that uses a client-side stylesheet which doesn’t show all the information in the file). It does have all of UnicodeData.txt plus a pile of other stuff about entity names and names in Mathematica, the AFII glyph registry, TeX etc. I really should get that updated to a public part of the site…

Although actually while having unicodedata as xml is clearly a good thing, isn’t the approved way of solving the stated example problem to use a collation to do the comparison, which probably needs to be written in some less interesting language than XSLT, and may not benefit so much from an xml version?

David

Re: Unicode database in XML

Sorry! The XML file is 3.5MB, so it’s actually available as unicode.zip. (Also corrected the link in the main post.)