Levenshtein distance in XSLT 2.0

[UPDATE: Added a link to the full stylesheet, and edited the code so it doesn’t overlap the right-hand column.]

Levenshtein distance is a measure of how many edits it takes to get from one string to another. In the basic algorithm, each addition, deletion and substitution counts as a single edit. So, for example, the distance between "XSLT 1.0" and "XSLT 2.0" is 1: the only difference is the substitution of 2 for 1, whereas the distance between "XSLT" and "XQuery" is 5: three substitutions and two additions.

One of the interesting features of Levenshtein distance is that there’s a fairly straight-forward dynamic programming algorithm that can be used to calculate it. I thought it might be interesting to see what an XSLT 2.0 implementation might look like.

In its naive form, the Levenshtein distance is calculated by saying:

  1. The distance between an empty string A and another string B is the number of characters in B (eg distance between '' and 'foo' is 3)
  2. Otherwise, take the minimum of:
    • one plus the distance between string A and B except for its last character
    • one plus the distance between A except for its last character and string B
    • the distance between A except for its last character and B except for its last character, plus one if the last characters of the two strings aren’t the same

In the normal case, then, for each distance you calculate, you have to calculate three other distances and take a minimum. A naive XSLT implementation would be:

<xsl:function name="my:LevenshteinDistanceA" 
  as="xs:integer">
  <xsl:param name="string1" as="xs:string" />
  <xsl:param name="string2" as="xs:string" />
  <xsl:sequence 
    select="my:LevenshteinDistanceA(
              string-to-codepoints($string1),
             string-to-codepoints($string2),
             string-length($string1),
             string-length($string2))" />
</xsl:function>

<xsl:function name="my:LevenshteinDistanceA" 
  as="xs:integer">
  <xsl:param name="chars1" as="xs:integer+" />
  <xsl:param name="chars2" as="xs:integer+" />
  <xsl:param name="i1" as="xs:integer" />
  <xsl:param name="i2" as="xs:integer" />
  <xsl:choose>
    <xsl:when test="$i1 = 0">
      <xsl:sequence select="$i2" />
    </xsl:when>
    <xsl:when test="$i2 = 0">
      <xsl:sequence select="$i1" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="char1" as="xs:integer" 
        select="$chars1[$i1]" />
      <xsl:variable name="char2" as="xs:integer"
        select="$chars2[$i2]" />
      <xsl:variable name="deletion" as="xs:integer"
        select="my:LevenshteinDistanceA($chars1, $chars2, 
                                        $i1 - 1, $i2) + 
                1" />
      <xsl:variable name="insertion" as="xs:integer"
        select="my:LevenshteinDistanceA($chars1, $chars2, 
                                        $i1, $i2 - 1) + 
                1" />
      <xsl:variable name="substitution" as="xs:integer"
        select="my:LevenshteinDistanceA($chars1, $chars2, 
                                        $i1 - 1, $i2 - 1) +
                (if ($char1 eq $char2) then 0 else 1)" />
      <xsl:sequence select="min(($deletion, $insertion, 
                                 $substitution))" />
    </xsl:otherwise>
  </xsl:choose> 
</xsl:function>

This is horrendously slow for even moderate-length strings. In Saxon 8.9B, on my laptop, it will compare two five-character strings in 63ms on average; for ten-character strings it takes 26 seconds!

If you look at the Wikipedia page on Levenshtein distance, you’ll see an algorithm that effectively does memoisation to address the performance problem. Memoisation is the technique of recording the result of a function for a given set of arguments, and then using that stored result rather than recalculating it each time. For the Levenshtein distance algorithm, this is done by building a matrix of the distances between all possible leading substrings of the strings you’re comparing. So to compare "foo" and "bar", you effectively compare "" and "", "b", "ba" and "bar"; "f" and "", "b", "ba" and "bar"; "fo" and "", "b", "ba" and "bar"; and "foo" and "", "b", and "ba", recording the results as you go.

Here’s the dynamic programming version of the function in XSLT 2.0:

<xsl:function name="my:LevenshteinDistanceB" 
  as="xs:integer">
  <xsl:param name="string1" as="xs:string" />
  <xsl:param name="string2" as="xs:string" />
  <xsl:sequence 
    select="my:LevenshteinDistanceB(
              string-to-codepoints($string1),
             string-to-codepoints($string2),
             1, 1,
             for $p in (0 to string-length($string1)) 
                return $p,
             1)" />
</xsl:function>

<xsl:function name="my:LevenshteinDistanceB" 
  as="xs:integer">
  <xsl:param name="chars1" as="xs:integer+" />
  <xsl:param name="chars2" as="xs:integer+" />
  <xsl:param name="i1" as="xs:integer" />
  <xsl:param name="i2" as="xs:integer" />
  <xsl:param name="lastRow" as="xs:integer+" />
  <xsl:param name="thisRow" as="xs:integer+" />
  <xsl:choose>
    <xsl:when test="$i1 > count($chars1)">
      <xsl:choose>
        <xsl:when test="$i2 = count($chars2)">
          <xsl:sequence select="$thisRow[last()]" />
        </xsl:when>
        <xsl:otherwise>
          <xsl:sequence 
            select="my:LevenshteinDistanceB(
                      $chars1, $chars2, 
                      1, $i2 + 1, 
                      $thisRow, ($i2 + 1))" />
        </xsl:otherwise>
      </xsl:choose>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="char1" as="xs:integer" 
        select="$chars1[$i1]" />
      <xsl:variable name="char2" as="xs:integer" 
        select="$chars2[$i2]" />
      <xsl:variable name="deletion" as="xs:integer"
        select="$lastRow[$i1 + 1] + 1" />
      <xsl:variable name="insertion" as="xs:integer"
        select="$thisRow[last()] + 1" />
      <xsl:variable name="substitution" as="xs:integer"
        select="$lastRow[$i1] +
                (if ($char1 eq $char2) then 0 else 1)" />
      <xsl:variable name="cost" 
        select="min(($deletion, $insertion, 
                     $substitution))" />
      <xsl:sequence 
        select="my:LevenshteinDistanceB(
                  $chars1, $chars2, 
                  $i1 + 1, $i2, 
                  $lastRow, ($thisRow, $cost))" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

This performs much better. Here’s a little table showing the results:

# characterstime
533ms
1050ms
1567ms
2077ms

Why do I end there, when the timings are by no means unreasonable? Well, the bad news is that with anything over 20 characters, Saxon bails with a “Too many nested function calls. May be due to infinite recursion.” error. Which isn’t unreasonable I guess, given that, with 20 characters, you’re calling the same function 400 times in a row, but is kinda annoying given the recursion is not actually infinite.

I know David C.’s done an implementation as well, which I hope he’ll post in a comment, and I’ll be interested to see whether that (or any other implementation) performs any better (both in terms of speed and in terms of not running out of stack). When we’ve had a look at them, I’ll try to put together some suitably gnomic comments on the design of recursive functions and templates in general…

You can download the full stylesheet if you want.

Comments

Re: Levenshtein distance in XSLT 2.0

Nice article,Thankyou.

Re: Levenshtein distance in XSLT 2.0

I was intrigued by this. I wanted to test out Gestalt, as I have just finished a big refactoring to fix some tail-call problems that Florent Georges had exposed.

It was a good test, as first it exposed another couple of minor bugs (one was a TODO that I should have done about a year ago).

Here are the result (with Saxon 8.9 in the last column for comparison):

# characterstime in ms. (Gestalt)time in ms. (Saxon)
53531666
105331786
158112036
2013662422
2520633092
3031933989
3545145245
4063166775
4584968951
501116811484
551444214452
601790618176
652233522114
702725827368
753251032647

Evidently I have a slower machine than Jeni .

I stopped there since it is evident that the two implementations have roughly identical performance (experimental variation was of the order of 1 second). Saxon has overcome the disadvantage of JVM warm-up time at around 10 seconds (I am running both from the command line).

I was pleased with this result, since when I have done time comparisons in the past, after runtimes longer than 10 seconds were achieved, Gestalt would fall well behind. It seems to be a lot more mature now. Although this may just be because there is no data document involved.

Re: Levenshtein distance in XSLT 2.0

Odd how appealing expressions like

for $p in (0 to string-length($string1)) return $p

are, even to experts! Fortunately, the optimizer quickly turns them into

(0 to string-length($string1))

It’s one of those things like writing (if ($x=true())…) - somehow we do it even though we know it’s grossly redundant.

I wonder if it’s worth inferring a type of “xs:integer+” for (0 to string-length($string1)) rather than the current “xs:integer*”? In this example we’re doing an unnecessary dynamic type check. No, probably not, special-casing based on single examples has to stop somewhere! Unless the example is in an important benchmark, of course.

Re: Levenshtein distance in XSLT 2.0

Amusingly, my thought-process when writing for $p in (0 to string-length($string1)) return $p was that I wanted to get the position of each character in the string (hence the name of the variable $p). Doh.

Memoïzation (Re: Levenshtein distance in XSLT 2.0)

Can we think of memoization as a interesting feature for XSLT 2.1 ?

Re: Memoïzation (Re: Levenshtein distance in XSLT 2.0)

Actually, Saxon has a saxon:memo-function extension attribute that you can use to indicate that a given function should be memoised. In this case, adding saxon:memo-function="yes" to the naive implementation makes it function a lot better (eg comparing two 50-character strings in 2430ms), but not as well as the dynamic programming method (which compares them in 113ms).

I think that since memoisation is purely to do with performance, rather than functionality, it doesn’t belong in an XSLT spec. An XSLT processor could currently do automatic memoisation of certain functions or templates if it thought it could detect ones where it would lead to a performance improvement.

Re: Memoïzation (Re: Levenshtein distance in XSLT 2.0)

It’s hard to think of a sensible way of automatically deciding how to make the space/time trade-off needed for function memoization. Even if you know whether the user cares more about time or space, you can’t do any sensible estimates of either in advance, so it would have to be a learning algorithm. As with keys, I think there’s a good case for allowing hints in the language for this kind of thing.

In fact the saxon:memo-function attribute is currently more than a hint; in the case of functions that create a new node each time they are called it results in the same node being returned each time. Or as I prefer to phrase it, because it sounds more conformant, it’s an assertion by the user that the semantics of the stylesheet don’t depend on this side-effect.

Re: Levenshtein distance in XSLT 2.0

Turns out saxon is better at optimising tail recursion in templates than in functions. If I make the obvious changes to mine to turn the functions back to named templates I never get the nested function call error (er, because there are no function calls:-) and can get up to length 50 or so, after that I don’t get an error but it slows down rather more than I’m prepared to wait, but your version (which isn’t passing the whole n*m array )is quicker anyway so might get further in template style.

I seem to remember Dimitre raising a similar comment on the saxon list a while ago?

David

Re: Levenshtein distance in XSLT 2.0

I did wonder about that, but I can’t for the life of me make the template version do any better. Here’s what I’ve been trying: maybe someone else can get Saxon to recognise the tail recursion…

<xsl:template name="my:LevenshteinDistanceB" as="xs:integer">
  <xsl:param name="string1" as="xs:string" select="''" />
  <xsl:param name="string2" as="xs:string" select="''" />
  <xsl:param name="chars1" as="xs:integer*" 
    select="string-to-codepoints($string1)" />
  <xsl:param name="chars2" as="xs:integer*" 
    select="string-to-codepoints($string2)" />
  <xsl:param name="i1" as="xs:integer" select="1" />
  <xsl:param name="i2" as="xs:integer" select="1" />
  <xsl:param name="lastRow" as="xs:integer+" 
    select="for $p in (0 to string-length($string1)) return $p" />
  <xsl:param name="thisRow" as="xs:integer+" select="1" />
  <xsl:choose>
    <xsl:when test="count($chars1) = 0 or $i2 > count($chars2)">
      <xsl:sequence select="$lastRow[last()]" />
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="cost" 
        select="min(($lastRow[$i1 + 1] + 1,
                     $thisRow[last()] + 1,
                     $lastRow[$i1] +
                       (if ($chars1[$i1] eq $chars2[$i2]) then 0 else 1)))" />
      <xsl:variable name="nextRow" as="xs:boolean" 
        select="$i1 = count($chars1)" />
      <xsl:call-template name="my:LevenshteinDistanceB">
        <xsl:with-param name="chars1" select="$chars1" />
        <xsl:with-param name="chars2" select="$chars2" />
        <xsl:with-param name="i1" 
          select="if ($nextRow) then 1 else ($i1 + 1)" />
        <xsl:with-param name="i2" 
          select="if ($nextRow) then ($i2 + 1) else $i2" />
        <xsl:with-param name="lastRow" 
          select="if ($nextRow) then ($thisRow, $cost) else $lastRow" />
        <xsl:with-param name="thisRow" 
          select="if ($nextRow) then $i2 + 1 else ($thisRow, $cost)" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Actually, this version made me think of another time when templates are nicer than functions: when you have lots of parameters, it’s helpful to be able to pass them by name rather than position.

Re: Levenshtein distance in XSLT 2.0

I’ve now made sufficient tweaks so both the function and the template versions are tail-recursive.

The template version wasn’t optimizing tail recursion in Saxon 8.9 for two reasons: (a) there’s no sensible static type inferencing being done for xsl:call-template (I think I was worried about the inferencing going into a loop when the call is recursive), and (b) there’s a blanket switch that says if the return type is declared, don’t do tail call optimization.

Keep these examples coming in: the only way to improve an optimizer is to give it lots of exercise.

The thing that really needs doing now (it’s long overdue) is some static analysis on apply-templates, using type information to narrow down the field of possible matching template rules.

Re: Levenshtein distance in XSLT 2.0

I seem to remember Dimitre raising a similar comment on the saxon list a while ago?

Yes, I did. From Dr. Kay’s response I got the impression that in order for Saxon to recognize TR in functions, the function should not produce any output until the very end. That means,collect the outpu in a variable and pass it along down the calls.

Definitely, absolutey, any improvement in Saxon’s treatment of TR in functions would be extremely appreciated. Otherwise, having to convert the code to DVC can sometimes be very challenging.

Cheers,

Dimitre

posting code

Hi Jeni,

I also had a similar problem with posting code in a webpage. The solution that I found was to add overflow auto to the pre element (and eventually a border) like below:

pre {
overflow: auto;
border: 1px dotted;
}

Here you have a sample to see how that looks like
http://georgebina.blogspot.com/2007/04/test.html

Hope that helps,
George

Re: posting code

Thanks George. You’ll see that I’ve adopted your suggestion (my apologies to anyone reading while I was experimenting).

Re: Levenshtein distance in XSLT 2.0

I implemented an XQuery version of the naive algorithm a little while back for a user on our forums:

http://forums.oracle.com/forums/thread.jspa?messageID=1570246&#1570246

declare function local:lev($arg1 as xs:string, $arg2 as xs:string) as xs:decimal?
{
  if(string-length($arg1) = 0) then string-length($arg2) else
  if(string-length($arg2) = 0) then string-length($arg1) else
  min((
    local:lev(substring($arg1, 2), $arg2) + 1,
    local:lev($arg1, substring($arg2, 2)) + 1,
    local:lev(substring($arg1, 2), substring($arg2, 2)) +
      (if(substring($arg1, 1, 1) = substring($arg2, 1, 1)) then 0 else 1)
  ))
};

Personally I think I find it easier to read with a non-XML syntax - but maybe I just haven't spent long enough looking at XSLT code.

Re: Levenshtein distance in XSLT 2.0

Here’s the XQuery implementation of the dynamic programming version. (I’ve used the same style as you, in particular omitting variable declarations where they’re unnecessary.)

declare function local:levB($arg1 as xs:string, 
                            $arg2 as xs:string) as xs:decimal
{
  local:levB(string-to-codepoints($arg1), 
             string-to-codepoints($arg2),
             1, 1,
             for $p in (0 to string-length($arg1)) return $p,
             1)
}

declare function local:levB($chars1 as xs:integer*,
                            $chars2 as xs:integer*,
                            $i1 as xs:integer,
                            $i2 as xs:integer,
                            $lastRow as xs:integer+,
                            $thisRow as xs:integer+)
  as xs:decimal
{
  if ($i1 > count($chars1)) then
    if ($i2 = count($chars2)) then $thisRow[last()] else
    local:levB($chars1, $chars2, 
               1, $i2 + 1, $thisRow, $i2 + 1) else
  local:levB($chars1, $chars2, $i1 + 1, $i2, $lastRow,
    ($thisRow,
     min((
       $lastRow[$i1 + 1] + 1,
       $thisRow[last()] + 1,
       $lastRow[$i1] + 
         (if ($chars1[$i1] = $chars2[$i2]) then 0 else 1)
     ))
    )
  )
};

Unsurprisingly, it gives exactly the same “infinite recursion” error with Saxon when comparing strings over 20 characters in length.

I’ll leave XSLT vs. XQuery debates for another time…

Re: Levenshtein distance in XSLT 2.0

Actually I’ve done 2 implementations one using O(M*N) space and one using O(m) except the O(m) one is broke, and I can’t decide whether to fix it now or to fall asleep watching the (UK, local gov) election results roll in…

David

Re: Levenshtein distance in XSLT 2.0

Oh well here’s the O(M*N) one, pretty similar to yours I think, and hits the same limits, I don’t think it should really as it’s tail recursive. Of course, normally I would have fully documented the code, but as I was so hurt by your remarks in the regex post, this is just unadorned xslt….

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc">

  <xsl:param name="debug"/>

  <xsl:template name="test">

    <xsl:sequence select="d:levenshteintest('0123456789001234567890','01234567890ABCDEFGHIJ')"/>
    <xsl:sequence select="d:levenshteintest('GUMBO','GAMBOL')"/>
    <xsl:sequence select="d:levenshteintest('GUMBO','GAMBOL')"/>

    <xsl:sequence select="d:levenshteintest('GUMBO','')"/>
    <xsl:sequence select="d:levenshteintest('','GAMBOL')"/>

  </xsl:template>


  <xsl:function name="d:levenshteintest">
    <xsl:param name="s" as="xs:string"/>
    <xsl:param name="t" as="xs:string"/>
    <xsl:text>&#10;distance( "</xsl:text>
    <xsl:value-of select="$s"/>
    <xsl:text>" , "</xsl:text>
    <xsl:value-of select="$t"/>
    <xsl:text>" ) = </xsl:text>
    <xsl:value-of select="d:levenshtein($s,$t)"/>
  </xsl:function>


  <xsl:function name="d:levenshtein">
    <xsl:param name="s" as="xs:string"/>
    <xsl:param name="t" as="xs:string"/>
    <xsl:variable name="n" select="string-length($s)"/>
    <xsl:variable name="m" select="string-length($t)"/>
    <xsl:variable name="ss" select="string-to-codepoints($s)"/>
    <xsl:variable name="tt" select="string-to-codepoints($t)"/>
    <xsl:sequence
    select="if ($n=0) 
        then
        $m
        else
                if ($m=0)
                then
                $n
                else
        d:levenshtein2(
        $m,
        $n,
        (for $i in (0 to $n) return $i,
                 for $j in (1 to $m) return 
                   ($j, for $i in (1 to $n) return 
                      (if($ss[$i]=$tt[$j]) then 0 else 1))),
        $n+3)
"/>
  </xsl:function>

  <xsl:function name="d:levenshtein2">
    <xsl:param name="m" as="xs:integer"/>
    <xsl:param name="n" as="xs:integer"/>
    <xsl:param name="a" as="xs:integer+"/>
    <xsl:param name="p" as="xs:integer"/>
<xsl:if test="$debug='yes'">
<xsl:value-of select="$m,$n"/>
======================================
<xsl:for-each select="0 to ($m)">
<xsl:text>&#10;</xsl:text>
<xsl:value-of select="for $x in subsequence($a,1+ . * ($n+1),$n+1) return format-number($x,'00')"/>
</xsl:for-each>
======================================
</xsl:if>
<xsl:choose>
  <xsl:when test="$p=2+($n+1)*($m+1)">
    <xsl:sequence select="$a[last()]"/>
  </xsl:when>
  <xsl:otherwise>
  <xsl:sequence select="d:levenshtein2($m,$n,
            ($a[position()&lt;$p],  
                        min((
$a[position()=$p - ($n+1)]+1,
$a[position()=$p - 1] + 1,
$a[position()=$p - ($n+2)]+$a[position()=$p]
                        )),
            $a[position()&gt;$p]),
            if($p mod ($n+1) = 0) then $p+2 else $p+1)
    "/>
  </xsl:otherwise>
</xsl:choose>
  </xsl:function>




</xsl:stylesheet>

$  saxon8 -it test levenshtein.xsl 

distance( "0123456789001234567890" , "01234567890ABCDEFGHIJ" ) = 11
distance( "GUMBO" , "GAMBOL" ) = 2
distance( "GUMBO" , "GAMBOL" ) = 2
distance( "GUMBO" , "" ) = 5
distance( "" , "GAMBOL" ) = 6

David

Re: Levenshtein distance in XSLT 2.0

The reason the function isn’t being treated as tail recursive is because it’s doing a cardinality check on the function result. This is to implement the “as=xs:integer”. The cardinality check is being applied because the other branch of the choice (the $thisRow[last()]) has a static cardinality of zero-or-one.

There are two things I could do to improve matters here. One is to recognize that when $thisRow has cardinality one-or-more, then $thisRow[last()] (and indeed $thisRow[1]) will have cardinality exactly-one. That’s something of a special case and one expects it won’t come up that often. This other thing is to move the cardinality check into the branches of the choice where it is actually needed.

It’s true that templates sometimes benefit from tail recursion in cases where functions don’t. The main reason for this is that templates are always evaluated in push mode, where concatenation of the result into a sequence is essentially a null operation. Functions are evaluated in pull or push mode depending on the circumstances, and it’s a run-time decision. In this case the template isn’t tail-recursive for the same reason as the function, that is, because of the (unnecessary) cardinality check applied to the result.

Re: Levenshtein distance in XSLT 2.0

Ah-ha! Well, I’m rather disgruntled about this. I thought that the point of adding type declarations everywhere was that the processor could use them to do clever things, and now I learn that (a) putting as="xs:integer" on my function causes the problem and (b) the as="xs:integer+" on the $thisRows declaration doesn’t get used! Makes me wonder why I bother…

Anyway, adding an exactly-one() call around $thisRows[last()] means that Saxon now recognises the function, but not the template, as tail-recursive. Removing the as="xs:integer" from the template is the only way I could find to fix that. I’m about to upload the new version (which also fixes the problem that the function didn’t do the right thing if either of the strings was empty).

Re: Levenshtein distance in XSLT 2.0

Oh sorry yours is O(M) already just storing 2 rows, so probably is closer to what I was aiming for than what I posted (which builds the whole matrix) so I don’t think I’ll bother fixing mine.

David