<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.jenitennison.com/blog" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>unicode</title>
 <link>http://www.jenitennison.com/blog/taxonomy/term/13</link>
 <description>The taxonomy view with a depth of 0.</description>
 <language>en</language>
<item>
 <title>Unicode database in XML</title>
 <link>http://www.jenitennison.com/blog/node/15</link>
 <description>&lt;p&gt;Whatever &lt;a href=&quot;http://www.jenitennison.com/blog/node/12&quot; title=&quot;Levenshtein distance on the diagonal&quot;&gt;algorithm&lt;/a&gt; you use to &lt;a href=&quot;http://www.jenitennison.com/blog/node/11&quot; title=&quot;Levenshtein distance in XSLT 2.0&quot;&gt;calculate Levenshtein distance&lt;/a&gt;, one of its great features is that you can tweak the cost of letter substitutions. For example, you can do a case-insensitive comparison of two strings, or, perhaps more interestingly, a semi-case-sensitive comparison of two strings, where the cost of replacing a character with its uppercase or lowercase equivalent is less than the cost of replacing a character with an unrelated character, but more than zero. But that requires knowledge of whether and how two characters are related.&lt;/p&gt;
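
&lt;p&gt;As a rough illustration of the semi-case-sensitive idea, here is a minimal Python sketch. The function names and the &lt;code&gt;case_cost&lt;/code&gt; value of 0.5 are assumptions made for this example, and it leans on Python&amp;#8217;s built-in case mapping rather than on the XSLT linked above.&lt;/p&gt;

```python
def substitution_cost(a, b, case_cost=0.5):
    """Cost of replacing character a with character b (hypothetical weights)."""
    if a == b:
        return 0.0
    # Characters that differ only in case get a reduced, non-zero cost.
    if a.lower() == b.lower():
        return case_cost
    return 1.0


def weighted_levenshtein(s, t, case_cost=0.5):
    """Edit distance with unit insert/delete costs and a tunable substitution cost."""
    # Row 0: turning the empty prefix of s into each prefix of t takes j inserts.
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]  # i deletes turn a prefix of s into the empty string
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,      # delete a
                current[j - 1] + 1.0,   # insert b
                previous[j - 1] + substitution_cost(a, b, case_cost),
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("Unicode", "unicode"))
print(weighted_levenshtein("cat", "cut"))
```

&lt;p&gt;Setting &lt;code&gt;case_cost&lt;/code&gt; to 0 gives the fully case-insensitive comparison, and setting it to 1 gives the ordinary case-sensitive one.&lt;/p&gt;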

&lt;p&gt;Of course, all that information is stored in the &lt;a href=&quot;http://www.unicode.org/Public/UNIDATA/&quot; title=&quot;Unicode Database directory&quot;&gt;Unicode Database&lt;/a&gt;, which is a bunch of text files in a structured format. I looked for an XML version but couldn&amp;#8217;t find one (well, Googling &amp;#8220;Unicode database XML&amp;#8221; isn&amp;#8217;t much help). So I downloaded &lt;a href=&quot;http://www.unicode.org/Public/UNIDATA/UnicodeData.txt&quot; title=&quot;Unicode Database&quot;&gt;UnicodeData.txt&lt;/a&gt; and &lt;a href=&quot;http://www.unicode.org/Public/UNIDATA/NamesList.txt&quot; title=&quot;Unicode Names List Database&quot;&gt;NamesList.txt&lt;/a&gt; and put together an &lt;a href=&quot;http://www.jenitennison.com/blog/files/Unicode.xsl&quot; title=&quot;Unicode database builder XSLT&quot;&gt;XSLT 2.0 stylesheet&lt;/a&gt; to create an &lt;a href=&quot;http://www.jenitennison.com/blog/files/unicode.zip&quot; title=&quot;Unicode XML&quot;&gt;XML version of the Unicode database&lt;/a&gt;.&lt;/p&gt;

&lt;!--break--&gt;

&lt;p&gt;The XML contains practically everything that you can get from those two files, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;block and subblock structures&lt;/li&gt;
&lt;li&gt;hexadecimal and decimal codepoints&lt;/li&gt;
&lt;li&gt;names, aliases and comments&lt;/li&gt;
&lt;li&gt;category and numeric information&lt;/li&gt;
&lt;li&gt;uppercase, lowercase and titlecase equivalents&lt;/li&gt;
&lt;li&gt;decomposition of various kinds&lt;/li&gt;
&lt;li&gt;related characters&lt;/li&gt;
&lt;li&gt;bidi information&lt;/li&gt;
&lt;/ul&gt;
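
&lt;p&gt;To give a feel for where several of these properties come from, here is a small Python sketch that turns one semicolon-separated line of UnicodeData.txt into an XML element. It is not the XSLT 2.0 stylesheet itself, and the element name &lt;code&gt;char&lt;/code&gt; and the attribute names are made up for this illustration; the markup in the real download may well differ.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Positions of a few of the 15 semicolon-separated fields in UnicodeData.txt.
# The attribute names on the left are hypothetical, chosen for this sketch.
FIELDS = {
    "name": 1,
    "category": 2,
    "bidi": 4,
    "decomposition": 5,
    "upper": 12,
    "lower": 13,
    "title": 14,
}


def char_element(line):
    """Build a (hypothetical) char element from one UnicodeData.txt line."""
    parts = line.rstrip("\n").split(";")
    # Field 0 is the hexadecimal codepoint; also record its decimal value.
    elem = ET.Element("char", hex=parts[0], dec=str(int(parts[0], 16)))
    for attr, index in FIELDS.items():
        if parts[index]:  # skip empty fields
            elem.set(attr, parts[index])
    return elem


# The UnicodeData.txt entry for U+0041.
sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
elem = char_element(sample)
print(ET.tostring(elem, encoding="unicode"))
```

&lt;p&gt;Empty fields are simply left out, so an uppercase letter such as U+0041 ends up with a &lt;code&gt;lower&lt;/code&gt; attribute but no &lt;code&gt;upper&lt;/code&gt; one.&lt;/p&gt;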

&lt;p&gt;It might prove easier to search than grepping the text files, if you&amp;#8217;re used to using XPath. I might split it up and put together an AJAX browser, in my Copious Spare Time.&lt;/p&gt;
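
&lt;p&gt;As a sketch of the kind of XPath lookup this makes possible, the Python below builds a tiny stand-in tree and queries it with the XPath subset that &lt;code&gt;xml.etree.ElementTree&lt;/code&gt; supports. The element and attribute names (&lt;code&gt;unicode&lt;/code&gt;, &lt;code&gt;block&lt;/code&gt;, &lt;code&gt;char&lt;/code&gt;, &lt;code&gt;hex&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;lower&lt;/code&gt;, &lt;code&gt;upper&lt;/code&gt;) are assumptions for the example and are not taken from the actual file.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the real document; the structure is hypothetical.
root = ET.Element("unicode")
block = ET.SubElement(root, "block", name="Basic Latin")
ET.SubElement(block, "char", hex="0041", name="LATIN CAPITAL LETTER A",
              category="Lu", lower="0061")
ET.SubElement(block, "char", hex="0061", name="LATIN SMALL LETTER A",
              category="Ll", upper="0041")
ET.SubElement(block, "char", hex="0031", name="DIGIT ONE", category="Nd")

# All uppercase letters, selected with an attribute predicate.
uppercase_names = [c.get("name") for c in root.findall(".//char[@category='Lu']")]


def lowercase_of(tree_root, hex_code):
    """Return the lowercase codepoint recorded for hex_code, or None."""
    char = tree_root.find(".//char[@hex='%s']" % hex_code)
    return None if char is None else char.get("lower")


print(uppercase_names)
print(lowercase_of(root, "0041"))
```

&lt;p&gt;A lookup like &lt;code&gt;lowercase_of&lt;/code&gt; is the sort of relationship test that a semi-case-sensitive substitution cost needs.&lt;/p&gt;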
</description>
 <comments>http://www.jenitennison.com/blog/node/15#comments</comments>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/14">xml</category>
 <category domain="http://www.jenitennison.com/blog/taxonomy/term/13">unicode</category>
 <pubDate>Mon, 14 May 2007 21:11:23 +0100</pubDate>
 <dc:creator>Jeni</dc:creator>
 <guid isPermaLink="false">15 at http://www.jenitennison.com/blog</guid>
</item>
</channel>
</rss>
