On Thursday morning, I was down to chair the first session in the “Core Technologies” track. Two interesting papers: one on XForms and one on Google Base. Then I snuck on to the “Applications” track to hear about scientific Wikis and the trials of managing schema repositories.
Yes, I’m determined to write up every talk I attended at XTech 2007, so that I have a record of it if nothing else. On Wednesday afternoon, I attended sessions on microformats, internationalisation and NVDL (as well as giving my own talk, of course).
Since there’s next to no ‘net connection at XTech 2007 (obviously the Web is not so ubiquitous as all that), I have nothing to do in the sessions but listen! Here are some thoughts about the sessions that I attended on the morning of Wednesday 16th. I haven’t included the keynotes, not because they weren’t interesting but because I can’t think of anything to say about them at the moment.
Henry Thompson had a lot to say after my Creole presentation (open takahashi.xul?data=creole.data; requires Firefox) about the benefits of stand-off markup for linguistic information. From his overview, it seems that the NITE XML Toolkit that he’s been involved with represents overlapping linguistic data by holding atoms (here meaning the “lowest common denominator” shared pieces of data) and having multiple trees marking up these atoms. The trees are independently validated (since they are pure XML), with cross-hierarchy validation done through the query language. This is pretty similar to the XCONCUR approach, which augments a CONCUR-like multi-grammar validation with a Schematron-like constraint language.
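The stand-off idea can be boiled down to a small sketch: shared atoms, plus several independent hierarchies that each mark up ranges of those atoms. The data structures and names below are my own invention for illustration, not the NITE XML Toolkit’s actual representation or API.

```python
# Stand-off markup sketch (illustrative only, not NITE's serialisation):
# the atoms are the "lowest common denominator" shared units, and each
# hierarchy marks up spans of atom indices independently.

atoms = ["and", "then", "she", "left"]

# (label, start, end) spans over atom indices, end inclusive.
# The two hierarchies below genuinely overlap -- the prosodic span (0, 2)
# crosses the syntactic boundary between atoms 1 and 2 -- so they could
# not be merged into a single well-formed tree.
syntax = [("NP", 0, 1), ("VP", 2, 3)]
prosody = [("intonation-unit", 0, 2), ("tail", 3, 3)]

def spans_valid(tree, atoms):
    """A minimal cross-hierarchy check: every span must index real atoms.
    (In NITE, richer constraints of this kind are expressed through the
    query language rather than per-tree schema validation.)"""
    return all(0 <= start <= end < len(atoms) for _, start, end in tree)
```

Each hierarchy can still be validated on its own as plain XML (or here, as a plain list of spans); only the constraints that relate one hierarchy to another need the extra machinery.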
Whichever algorithm you use to calculate Levenshtein distance, one of its great features is that you can tweak the cost of letter substitutions. For example, you can do a case-insensitive comparison of two strings, or, perhaps more interestingly, a semi-case-sensitive comparison, where the cost of replacing a character with its upper- or lower-case equivalent is less than the cost of replacing it with an unrelated character, but more than zero. That, though, requires knowing whether and how two characters are related.
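To make that concrete, here is a sketch of the standard dynamic-programming algorithm with a pluggable substitution-cost function. The function names are mine; insertions and deletions are fixed at cost 1 for simplicity, and the 0.5 cost for case-only differences is just an arbitrary example of a “more than zero, less than one” weight.

```python
def levenshtein(a, b, sub_cost=None):
    """Levenshtein distance with a pluggable substitution cost.
    sub_cost(x, y) returns the cost of substituting x with y;
    insertions and deletions are fixed at cost 1."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0 if x == y else 1
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + sub_cost(ca, cb)))  # substitution
        prev = curr
    return prev[-1]

def semi_case_sensitive(x, y):
    """Example cost function: case-only differences cost 0.5."""
    if x == y:
        return 0
    if x.lower() == y.lower():
        return 0.5
    return 1
```

With the default costs this gives the familiar results (`levenshtein("kitten", "sitting")` is 3), while `levenshtein("Kitten", "kitten", semi_case_sensitive)` comes out at 0.5: cheaper than a full substitution, dearer than a perfect match.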
Of course all that information is stored in the Unicode Database, which is a set of text files in a structured format. I looked for an XML version but couldn’t find one (well, Googling “Unicode database XML” isn’t much help). So I downloaded UnicodeData.txt and NamesList.txt and put together an XSLT 2.0 stylesheet to create an XML version of the Unicode database.
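The conversion itself is mostly mechanical, since each line of UnicodeData.txt is a semicolon-separated record. Here is a rough Python equivalent of the idea (my stylesheet is in XSLT 2.0, not Python): the field order follows the published UnicodeData.txt format, but the XML element and attribute names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Field order as documented for UnicodeData.txt; the attribute names
# themselves are my own invention, not an official vocabulary.
FIELDS = ["code", "name", "category", "combining-class", "bidi-class",
          "decomposition", "decimal", "digit", "numeric", "mirrored",
          "unicode1-name", "iso-comment", "uppercase", "lowercase",
          "titlecase"]

def unicode_data_to_xml(lines):
    """Turn lines from UnicodeData.txt into a simple XML tree."""
    root = ET.Element("unicode-database")
    for line in lines:
        values = line.rstrip("\n").split(";")
        char = ET.SubElement(root, "char")
        for field, value in zip(FIELDS, values):
            if value:  # skip empty fields rather than emit empty attributes
                char.set(field, value)
    return root
```

Feeding it the record for U+0041, for instance, yields a `char` element with `code="0041"`, `name="LATIN CAPITAL LETTER A"` and `lowercase="0061"`, with the empty fields simply omitted.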