Markup Utility: Explanation

The Markup Utility is a utility for finding and changing words and phrases within some text.

This page explains how the Markup Utility works by going step by step through the stylesheet. It is intended for those interested in learning more about how XSLT works with real problems rather than as instructions on how to use the utility.

Global variables

For this kind of string manipulation there are a number of variables that are useful to have around. The way I've set up these variables means that they're highly English-centric. However, it would be easy to add any extra punctuation, lowercase or uppercase letters to the variables to make it more applicable to other languages.

$punctuation = ".,:;!?&tab;&cr;&lf;  "'()[]<>{}"

$punctuation holds punctuation characters to identify the starts and ends of words. These have to be declared within the content of the variable declaration because I want to include whitespace characters like ends of lines and tabs, all of which are converted automatically to spaces if they are held within attribute values.

<xsl:variable name="punctuation">
  <xsl:text>.,:;!?&tab;&cr;&lf;&nbsp; &quot;'()[]&lt;>{}</xsl:text>
</xsl:variable>
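The entity references &tab;, &cr;, &lf; and &nbsp; are not predefined in XML, so the stylesheet has to declare them before it can use them. The declarations are not shown on this page, so the following is only a sketch of how they might look, assuming they sit in an internal DTD subset at the top of the stylesheet. The small template at the end is there purely so that the sketch can be run on any input; it is not part of the utility.

```xml
<?xml version="1.0"?>
<!-- Assumed declarations: the entity names come from the variable above;
     the character references are the usual code points for tab,
     carriage return, line feed and non-breaking space. -->
<!DOCTYPE xsl:stylesheet [
  <!ENTITY tab  "&#x9;">
  <!ENTITY cr   "&#xD;">
  <!ENTITY lf   "&#xA;">
  <!ENTITY nbsp "&#xA0;">
]>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="punctuation">
    <xsl:text>.,:;!?&tab;&cr;&lf;&nbsp; &quot;'()[]&lt;>{}</xsl:text>
  </xsl:variable>

  <!-- Illustration only: reports how many characters $punctuation holds. -->
  <xsl:template match="/">
    <xsl:value-of select="string-length($punctuation)" />
  </xsl:template>

</xsl:stylesheet>
```

Writing the character references directly inside the xsl:text element would work just as well; the named entities simply make the list easier to read.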
    
$lowercase = 'abcdefghijklmnopqrstuvwxyz'

$lowercase holds a list of lowercase letters. The alphabetical order is simply to make it easier to make sure they're all there - it doesn't matter what order's used as long as it matches $uppercase.

<xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
    
$uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

$uppercase holds a list of uppercase letters. Again, the ordering doesn't matter as long as it matches $lowercase.

<xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
    

The 'markup' template

The 'markup' template is the main template of importance in the stylesheet. I use a named template because the identity of the current node isn't important.

<xsl:template name="markup">
  ...
</xsl:template>

The first thing is to identify the parameters and their default values. There are five parameters:

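$text
the text to be marked up - a string
$phrases
a node set of nodes whose value gives the word(s) that should be marked up
$words-only [= true]
whether only whole words should be matched
$first-only [= false]
whether only the first occurrence of each phrase should be marked up
$match-case [= false]
whether the match should be case-sensitive
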
<xsl:param name="text" />
<xsl:param name="phrases" />
<xsl:param name="words-only" select="true()" />
<xsl:param name="first-only" select="false()" />
<xsl:param name="match-case" select="false()" />

The next job is to identify those phrases that are actually included within the text, so that I can cycle through them and mark them up within it. Selecting only those phrases that are included at this point saves on processing. Since I sometimes have to check for matches where the case doesn't matter, it's worth setting a variable to hold the value of the all-lowercase text. This again saves on processing because the lowercase text is not generated for every phrase that is being looked at within the XPath.

<xsl:variable name="lcase-text" select="translate($text, $uppercase, $lowercase)" />
<xsl:variable name="included-phrases"
              select="$phrases[($match-case and contains($text, .)) or
                               (not($match-case) and contains($lcase-text,
                                                              translate(., $uppercase, $lowercase)))]" />
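The filtering step can be tried out on its own. The following is a minimal sketch rather than part of the utility: the doc, para and phrase element names and the sample content are assumptions made up for illustration, and $match-case is fixed to false().

```xml
<?xml version="1.0"?>
<!-- Assumed input document (element names are hypothetical):
     <doc>
       <para>The Ginger Cat sat on the mat.</para>
       <phrases>
         <phrase id="cat">cat</phrase>
         <phrase id="dog">dog</phrase>
         <phrase id="ginger-cat">ginger cat</phrase>
       </phrases>
     </doc>
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/doc">
    <xsl:variable name="text" select="string(para)" />
    <xsl:variable name="phrases" select="phrases/phrase" />
    <xsl:variable name="match-case" select="false()" />

    <!-- Lowercase the text once, outside the predicate. -->
    <xsl:variable name="lcase-text"
                  select="translate($text, $uppercase, $lowercase)" />
    <xsl:variable name="included-phrases"
                  select="$phrases[($match-case and contains($text, .)) or
                                   (not($match-case) and contains($lcase-text,
                                      translate(., $uppercase, $lowercase)))]" />

    <!-- List the id of each phrase that survived the filter. -->
    <xsl:for-each select="$included-phrases">
      <xsl:value-of select="@id" />
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With the assumed input, the "cat" and "ginger cat" phrases are kept and "dog" is dropped, because only the first two occur in the lowercased text.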

Now a big choice: are there any phrases included in this text or not? If there are, then I need to work on the text; if there aren't, then the text can be returned just as it is:

<xsl:choose>
  <xsl:when test="$included-phrases">
    ...
  </xsl:when>
  <xsl:otherwise><xsl:value-of select="$text" /></xsl:otherwise>
</xsl:choose>

When the text does include the phrases, I need to mark up the text with those phrases. There might be cases where there are two phrases that overlap each other: "ginger cat" and "cat", for example. To prevent "ginger cat" being missed and "cat" being marked up instead, I sort the phrases according to their length, and then process only the first one on the particular piece of text:

<xsl:for-each select="$included-phrases">
  <xsl:sort select="string-length(.)" data-type="number" order="descending" />
  <xsl:if test="position() = 1">
    ...
  </xsl:if>
</xsl:for-each>
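The sort-then-take-the-first idiom works because, inside xsl:for-each, position() counts nodes in sorted order rather than document order. Here is a small sketch of just that idiom, assuming a made-up input with phrase elements; it is not taken from the stylesheet itself.

```xml
<?xml version="1.0"?>
<!-- Assumed input document (hypothetical):
     <phrases>
       <phrase id="cat">cat</phrase>
       <phrase id="ginger-cat">ginger cat</phrase>
     </phrases>
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="/">
    <xsl:for-each select="//phrase">
      <!-- Longest phrase first. -->
      <xsl:sort select="string-length(.)" data-type="number"
                order="descending" />
      <!-- position() refers to the sorted order, so this picks the longest. -->
      <xsl:if test="position() = 1">
        <xsl:value-of select="." />
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For the assumed input this selects "ginger cat", even though "cat" comes first in the document.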

Now we're getting down to it. First some variable declarations:

$phrase
the node representing the phrase to be marked up
$word
the content of that node, the actual word(s) to be marked up in the text
$remaining
a node set of the rest of the phrases that are contained within the text
<xsl:variable name="phrase" select="." />
<xsl:variable name="word" select="string($phrase)" />
<xsl:variable name="remaining" select="$included-phrases[. != $word]" />

This next variable declaration is a little complicated. I allowed various options at the beginning of the 'markup' template, including whether the whole word needed to be matched, to prevent "cat" being marked up within "categories" for example. I know the word that we're looking for, but if I'm after whole words only, then I need something a bit more sophisticated than just contains(), substring-before() and substring-after() to find that word for me.

The $match variable contains the actual string that I'm going to search for in the text, whether it be " cat " or " cat." or "'cat'". If the $words-only option is false(), then I don't have to worry about it. But when it's true() I have another template, 'get-first-word', which takes the text we're looking at, the word we want to match, and an option indicating whether the match should be case-sensitive or not, and gives me the string that I should be matching on.

<xsl:variable name="match">
  <xsl:choose>
    <xsl:when test="$words-only">
      <xsl:call-template name="get-first-word">
        <xsl:with-param name="text" select="$text" />
        <xsl:with-param name="word" select="$word" />
        <xsl:with-param name="match-case" select="$match-case" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="$word" /></xsl:otherwise>
  </xsl:choose>
</xsl:variable>

There are now two situations to worry about. Firstly, I could have found an actual occurrence of the word within the text (in which case $match holds a string indicating that occurrence). Or I could have found that actually the text didn't hold the string at all. I want to do different things in the two cases, so again I need an xsl:choose to do the conditional processing. If it's turned out that the word isn't actually in the text that I have, then I just need to call this template recursively on the text with the rest of the phrases that I identified as possibly being in it (and the same options set):

<xsl:choose>
  <xsl:when test="string($match)">
    ...
  </xsl:when>
  <xsl:otherwise>
    <xsl:choose>
      <xsl:when test="$remaining">
        <xsl:call-template name="markup">
          <xsl:with-param name="text" select="$text" />
          <xsl:with-param name="phrases" select="$remaining" />
          <xsl:with-param name="words-only" select="$words-only" />
          <xsl:with-param name="first-only" select="$first-only" />
          <xsl:with-param name="match-case" select="$match-case" />
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise><xsl:value-of select="$text" /></xsl:otherwise>
    </xsl:choose>
  </xsl:otherwise>
</xsl:choose>

Now I'm in the situation where I know what the actual string is within the text that I need to substitute. The trouble is that this string may have punctuation either side (or may not). I need to match the whole string (to make sure I'm matching whole words), but when it comes to marking it up, I only actually want to mark up the word itself (which may be in a different case from the original $word). So, I set three variables:

$first
the first character in the match string, if it's punctuation
$last
the last character in the match string, if it's punctuation
$replace
the word itself
 
<xsl:variable name="first">
  <xsl:if test="contains($punctuation, substring($match, 1, 1))"><xsl:value-of select="substring($match, 1, 1)" /></xsl:if>
</xsl:variable>
<xsl:variable name="last">
  <xsl:if test="contains($punctuation, substring($match, string-length($match)))"><xsl:value-of select="substring($match, string-length($match))" /></xsl:if>                
</xsl:variable>
<xsl:variable name="replace" select="substring($match, string-length($first) + 1,
                                                       string-length($match) - (string-length($first) + string-length($last)))" />
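To see how the three variables split a match string, here is a small sketch that applies the same declarations to a fixed sample value. The sample value ' cat.' and the bracketed output format are assumptions for illustration; in the utility, $match comes from the 'get-first-word' template. The punctuation list is written with character references so that the sketch needs no entity declarations.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:variable name="punctuation">
    <xsl:text>.,:;!?&#x9;&#xD;&#xA;&#xA0; &quot;'()[]&lt;>{}</xsl:text>
  </xsl:variable>

  <!-- Hypothetical sample: a match with punctuation on both sides. -->
  <xsl:param name="match" select="' cat.'" />

  <xsl:template match="/">
    <!-- Leading character, kept only if it is punctuation. -->
    <xsl:variable name="first">
      <xsl:if test="contains($punctuation, substring($match, 1, 1))">
        <xsl:value-of select="substring($match, 1, 1)" />
      </xsl:if>
    </xsl:variable>
    <!-- Trailing character, kept only if it is punctuation. -->
    <xsl:variable name="last">
      <xsl:if test="contains($punctuation,
                             substring($match, string-length($match)))">
        <xsl:value-of select="substring($match, string-length($match))" />
      </xsl:if>
    </xsl:variable>
    <!-- Whatever lies between the two is the word itself. -->
    <xsl:variable name="replace"
                  select="substring($match, string-length($first) + 1,
                            string-length($match) -
                              (string-length($first) + string-length($last)))" />
    <xsl:value-of select="concat('[', $first, '][', $replace, '][', $last, ']')" />
  </xsl:template>

</xsl:stylesheet>
```

For the sample value this gives [ ][cat][.]: the space and the full stop are peeled off and only "cat" is left to be marked up.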

Again I'm faced with two possibilities: either there are more phrases that are left to be marked up (held in $remaining), or there aren't. If there aren't, the result consists of:

  1. the text before the matched word (plus that extra punctuation character if there is one)
  2. the marked-up word
  3. the text after the matched word (plus that extra punctuation character if there is one)
<xsl:choose>
  <xsl:when test="$remaining">
    ...
  </xsl:when>
  <xsl:otherwise>
    <xsl:value-of select="concat(substring-before($text, $match), $first)" />
    <xsl:apply-templates select="$phrase" mode="markup">
      <xsl:with-param name="word" select="$replace" />
    </xsl:apply-templates>
    <xsl:value-of select="concat($last, substring-after($text, $match))" />
  </xsl:otherwise>
</xsl:choose>

If there are more phrases left to mark up, then the result consists of:

  1. the text before the matched word (plus any extra punctuation), marked up with the remaining phrases
  2. the marked-up word
  3. the text after the matched word (plus any extra punctuation), either:
    1. marked up with the remaining phrases, if I was only marking up the first occurrence of the word or
    2. marked up with all the phrases, if I was marking up all occurrences of the word

<xsl:call-template name="markup">
  <xsl:with-param name="text" select="concat(substring-before($text, $match), $first)" />
  <xsl:with-param name="phrases" select="$remaining" />
  <xsl:with-param name="words-only" select="$words-only" />
  <xsl:with-param name="first-only" select="$first-only" />
  <xsl:with-param name="match-case" select="$match-case" />
</xsl:call-template>
<xsl:apply-templates select="$phrase" mode="markup">
  <xsl:with-param name="word" select="$replace" />
</xsl:apply-templates>
<xsl:choose>
  <xsl:when test="$first-only">
    <xsl:call-template name="markup">
      <xsl:with-param name="text" select="concat($last, substring-after($text, $match))" />
      <xsl:with-param name="phrases" select="$remaining" />
      <xsl:with-param name="words-only" select="$words-only" />
      <xsl:with-param name="first-only" select="$first-only" />
      <xsl:with-param name="match-case" select="$match-case" />
    </xsl:call-template>
  </xsl:when>
  <xsl:otherwise>
    <xsl:call-template name="markup">
      <xsl:with-param name="text" select="concat($last, substring-after($text, $match))" />
      <xsl:with-param name="phrases" select="$included-phrases" />
      <xsl:with-param name="words-only" select="$words-only" />
      <xsl:with-param name="first-only" select="$first-only" />
      <xsl:with-param name="match-case" select="$match-case" />
    </xsl:call-template>
  </xsl:otherwise>
</xsl:choose>

The 'get-first-word' template

The aim of the get-first-word template is to retrieve a string that will match the first occurrence of a whole word within the text. This string will most likely have punctuation either side that delimits it as a word - the punctuation gets returned as well.

In fact the 'get-first-word' template itself is very simple. It is passed three parameters:

$text
the text to be searched for the word - a string
$word
the word to be identified within the text
$match-case [= false]
whether the match should be case-sensitive

It then farms out the work to two other named templates, get-first-word-matching-case if the case should be matched, and get-first-word-non-matching-case if it shouldn't.

<xsl:template name="get-first-word">
  <xsl:param name="text" />
  <xsl:param name="word" />
  <xsl:param name="match-case" select="false()" />
  <xsl:choose>
    <xsl:when test="$match-case">
      <xsl:call-template name="get-first-word-matching-case">
        <xsl:with-param name="text" select="$text" />
        <xsl:with-param name="word" select="$word" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="get-first-word-non-matching-case">
        <xsl:with-param name="text" select="$text" />
        <xsl:with-param name="word" select="$word" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

The 'get-first-word-matching-case' template

The 'get-first-word-matching-case' template takes two parameters:

$text
the text to be searched for the word - a string
$word
the word to be identified within the text

<xsl:template name="get-first-word-matching-case">
  <xsl:param name="text" />
  <xsl:param name="word" />
  ...
</xsl:template>

It then sets four variables:

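$before
the string before the first occurrence of the word in the text
$after
the string after the first occurrence of the word in the text
$punc-before
whether the character immediately before the word is a punctuation character
$punc-after
whether the character immediately after the word is a punctuation character
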
<xsl:variable name="before" select="substring-before($text, $word)" />
<xsl:variable name="after" select="substring-after($text, $word)" />
<xsl:variable name="punc-before" select="contains($punctuation, substring($before, string-length($before), 1))" />
<xsl:variable name="punc-after" select="contains($punctuation, substring($after, 1, 1))" />

Now a big choose statement to decide what to do. There are six possible situations:

<xsl:choose>
  <xsl:when test="not(contains($text, $word))" />
  <xsl:when test="$punc-before and $punc-after">
    <xsl:value-of select="substring($text, string-length($before), string-length($word) + 2)" />
  </xsl:when>
  <xsl:when test="$text = $word">
    <xsl:value-of select="$word" />
  </xsl:when>
  <xsl:when test="$punc-after and starts-with($text, $word)">
    <xsl:value-of select="substring($text, 1, string-length($word) + 1)" />
  </xsl:when>
  <xsl:when test="$punc-before and not(substring-after($text, $word))">
    <xsl:value-of select="substring($text, string-length($text) - string-length($word))" />
  </xsl:when>
  <xsl:when test="contains($after, $word)">
    <xsl:call-template name="get-first-word-matching-case">
      <xsl:with-param name="text" select="$after" />
      <xsl:with-param name="word" select="$word" />
    </xsl:call-template>
  </xsl:when>
</xsl:choose>  
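Putting the pieces of this template together makes it easier to experiment with. The sketch below assembles the fragments shown above into a stylesheet that can be run on any input; the driver template and the sample text 'categories: cat.' are assumptions added for illustration, and the punctuation list is written with character references so that no entity declarations are needed.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:variable name="punctuation">
    <xsl:text>.,:;!?&#x9;&#xD;&#xA;&#xA0; &quot;'()[]&lt;>{}</xsl:text>
  </xsl:variable>

  <xsl:template name="get-first-word-matching-case">
    <xsl:param name="text" />
    <xsl:param name="word" />
    <xsl:variable name="before" select="substring-before($text, $word)" />
    <xsl:variable name="after" select="substring-after($text, $word)" />
    <xsl:variable name="punc-before"
                  select="contains($punctuation,
                                   substring($before, string-length($before), 1))" />
    <xsl:variable name="punc-after"
                  select="contains($punctuation, substring($after, 1, 1))" />
    <xsl:choose>
      <!-- The word is not in the text at all: return nothing. -->
      <xsl:when test="not(contains($text, $word))" />
      <!-- Delimited on both sides: return the word plus its neighbours. -->
      <xsl:when test="$punc-before and $punc-after">
        <xsl:value-of select="substring($text, string-length($before),
                                        string-length($word) + 2)" />
      </xsl:when>
      <xsl:when test="$text = $word">
        <xsl:value-of select="$word" />
      </xsl:when>
      <xsl:when test="$punc-after and starts-with($text, $word)">
        <xsl:value-of select="substring($text, 1, string-length($word) + 1)" />
      </xsl:when>
      <xsl:when test="$punc-before and not(substring-after($text, $word))">
        <xsl:value-of select="substring($text,
                                string-length($text) - string-length($word))" />
      </xsl:when>
      <!-- Not a whole word here, but it occurs again later: keep looking. -->
      <xsl:when test="contains($after, $word)">
        <xsl:call-template name="get-first-word-matching-case">
          <xsl:with-param name="text" select="$after" />
          <xsl:with-param name="word" select="$word" />
        </xsl:call-template>
      </xsl:when>
    </xsl:choose>
  </xsl:template>

  <!-- Hypothetical driver: brackets make the returned punctuation visible. -->
  <xsl:template match="/">
    <xsl:text>[</xsl:text>
    <xsl:call-template name="get-first-word-matching-case">
      <xsl:with-param name="text" select="'categories: cat.'" />
      <xsl:with-param name="word" select="'cat'" />
    </xsl:call-template>
    <xsl:text>]</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

With the sample text, the first "cat" (inside "categories") is followed by a letter, so the template recurses on the rest of the text and returns the second occurrence together with the space before it and the full stop after it: [ cat.]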

The 'get-first-word-non-matching-case' template

The 'get-first-word-non-matching-case' template is very similar to the get-first-word-matching-case template, but is a little more complex because the case doesn't matter. It takes two parameters:

$text
the text to be searched for the word - a string
$word
the word to be identified within the text
<xsl:template name="get-first-word-non-matching-case">
  <xsl:param name="text" />
  <xsl:param name="word" />
  ...
</xsl:template>

It then sets six variables:

$lcase-text
the text, translated into all lowercase letters
$lcase-word
the word, translated into all lowercase letters
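$before
the string before the first occurrence of the (lowercase) word in the (lowercase) text
$after
the string after the first occurrence of the (lowercase) word in the (lowercase) text
$punc-before
whether the character immediately before the word is a punctuation character
$punc-after
whether the character immediately after the word is a punctuation character
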
<xsl:variable name="lcase-text" select="translate($text, $uppercase, $lowercase)" />
<xsl:variable name="lcase-word" select="translate($word, $uppercase, $lowercase)" />
<xsl:variable name="before" select="substring($text, 1, string-length(substring-before($lcase-text, $lcase-word)))" />
<xsl:variable name="after" select="substring($text, string-length($before) + string-length($word) + 1)" />
<xsl:variable name="punc-before" select="contains($punctuation, substring($before, string-length($before), 1))" />
<xsl:variable name="punc-after" select="contains($punctuation, substring($after, 1, 1))" />

Now a big choose statement to decide what to do. There are six possible situations:

<xsl:choose>
  <xsl:when test="not(contains($lcase-text, $lcase-word))" />
  <xsl:when test="$punc-before and $punc-after">
    <xsl:value-of select="substring($text, string-length($before), string-length($word) + 2)" />
  </xsl:when>
  <xsl:when test="$lcase-text = $lcase-word">
    <xsl:value-of select="$text" />
  </xsl:when>
  <xsl:when test="$punc-after and starts-with($lcase-text, $lcase-word)">
    <xsl:value-of select="substring($text, 1, string-length($word) + 1)" />
  </xsl:when>
  <xsl:when test="$punc-before and not(substring-after($lcase-text, $lcase-word))">
    <xsl:value-of select="substring($text, string-length($text) - string-length($word))" />
  </xsl:when>
  <xsl:when test="contains(translate($after, $uppercase, $lowercase), $lcase-word)">
    <xsl:call-template name="get-first-word-non-matching-case">
      <xsl:with-param name="text" select="$after" />
      <xsl:with-param name="word" select="$word" />
    </xsl:call-template>
  </xsl:when>
</xsl:choose>

The generic markup template

The generic markup template is a template that matches any element in 'markup' mode. It's called by the markup template. This is the template that actually does the marking up of the phrase that has been identified within the text. It takes one parameter:

$word
the word that is being marked up

This template simply makes an HTML 'a' link around the word, linking it to the page identified by the 'id' attribute on the element that's being matched. So if you have a phrase that was:

  <phrase id="cat">cat</phrase>

then this template would produce:

  <a href="cat.html">cat</a>

The code for the template is:

<xsl:template match="*" mode="markup">
  <xsl:param name="word" />
  <a href="{@id}.html">
    <xsl:value-of select="$word" />
  </a>
</xsl:template>

You should create other templates in 'markup' mode to match the phrase nodes that you're using and create links to them, or highlight them, or do whatever you want to do with the marked up text.
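
As an illustration of such an override, here is a sketch of a template that highlights phrases instead of linking them. The phrase element name follows the example above, but the span element, the "highlight" class name and the driver template are assumptions; in real use the 'markup' template would be the one applying templates in 'markup' mode.

```xml
<?xml version="1.0"?>
<!-- Assumed input document (hypothetical):
     <phrases>
       <phrase id="cat">cat</phrase>
     </phrases>
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Generic template from above: links the word to {@id}.html. -->
  <xsl:template match="*" mode="markup">
    <xsl:param name="word" />
    <a href="{@id}.html">
      <xsl:value-of select="$word" />
    </a>
  </xsl:template>

  <!-- Hypothetical override: matching 'phrase' by name gives this template
       a higher default priority than the generic match="*" template. -->
  <xsl:template match="phrase" mode="markup">
    <xsl:param name="word" />
    <span class="highlight" title="{@id}">
      <xsl:value-of select="$word" />
    </span>
  </xsl:template>

  <!-- Driver for the sketch only: stands in for the 'markup' template. -->
  <xsl:template match="/">
    <p>
      <xsl:apply-templates select="//phrase" mode="markup">
        <xsl:with-param name="word" select="'Cat'" />
      </xsl:apply-templates>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Because the word is passed in as a parameter, the output keeps the case in which the word appeared in the text ("Cat" here) rather than the case used in the phrase node.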


/xslt/utilities/markup-explanation.xml by Jeni Tennison; generated using SAXON 6.5 from Michael Kay