Readable regexs

Excellent post on XSL-List by Abel Braaksma on creating readable regular expressions in XSLT 2.0. He suggests always defining regular expressions in the content of an <xsl:variable>, using normal XML comments to annotate the different parts of the regex, and then using the x flag to ignore the extraneous whitespace that you’ve introduced.

Here’s a full example. Say that you want to parse a UK date. You could do:

<xsl:variable name="UKdate" as="xs:string">
  ([0-9]{1,2})     <!-- group 1 holds the day: one or two digits -->
  /                <!-- separator -->
  ([0-9]{1,2})     <!-- group 2 holds the month: one or two digits -->
  /                <!-- separator -->
  ([0-9]{2})       <!-- group 3 holds the year: two digits -->
</xsl:variable>

The variable is set to a string with lots of whitespace in it. The comments are, of course, ignored (when the XSLT stylesheet is initially compiled).
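If you want to check what the variable actually holds, a quick debugging message along these lines can help. It’s only a sketch: it assumes the $UKdate variable above is in scope, and uses replace() to squeeze out the whitespace so that the pattern is easier to read in the message.

<xsl:message>
  <!-- debugging only: show the regex with all whitespace removed -->
  <xsl:value-of select="replace($UKdate, '\s+', '')"/>
</xsl:message>

The message should show the bare pattern, with nothing left over from the comments.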

Then when you want to use the regular expression, use the x flag. This ignores all whitespace in the regular expression. For example:

<xsl:if test="matches(@date, $UKdate, 'x')">
  <xsl:analyze-string select="@date" regex="{$UKdate}" flags="x">
    ...
  </xsl:analyze-string>
</xsl:if>

(I know I don’t need to do the <xsl:if>, I’m just trying to show how you’d use the regular expression in a function call as well.)
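To give an idea of what might go inside the instruction, here’s a minimal sketch, assuming that $UKdate is in scope and that you want to turn the date into an element. The <date> element and its attribute names are made up for illustration; the groups are picked out with regex-group(), numbered as in the comments on the variable.

<xsl:analyze-string select="@date" regex="{$UKdate}" flags="x">
  <xsl:matching-substring>
    <!-- hypothetical output element: groups 1 to 3 are day, month and year -->
    <date day="{regex-group(1)}" month="{regex-group(2)}" year="{regex-group(3)}"/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <!-- text that isn't part of a date is copied through unchanged -->
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>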

Even if you don’t want to document your regular expressions (David C.), it’s a good idea to define them in variables. I’ve been caught out a number of times accidentally doing something like:

<xsl:analyze-string select="$n" regex="[0-9]{2}">
  ...
</xsl:analyze-string>

which of course matches any two-digit number where the second digit is a 2. (The problem is that the regex attribute is an attribute value template, so the {}s are replaced by the result of evaluating their content, leaving the regular expression [0-9]2.) Using the content of an <xsl:variable> is a good idea as well, because it means you don’t have to escape quotes and apostrophes: regex syntax is enough of a headache without worrying about extra levels of escaping.
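If you do have to write the regular expression directly in the regex attribute, the braces need doubling so that the attribute value template leaves them alone. A minimal sketch, reusing the $n variable from the example above (the body just echoes each match, purely for illustration):

<!-- {{ and }} are the escapes for literal braces in an attribute value template -->
<xsl:analyze-string select="$n" regex="[0-9]{{2}}">
  <xsl:matching-substring>
    <xsl:value-of select="."/>
  </xsl:matching-substring>
</xsl:analyze-string>

The regular expression that the processor sees is then [0-9]{2}, as intended.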

Comments

Re: Readable regexs

Even if you don’t want to document your regular expressions (David C.),

hmpf.

David

Re: Readable regexs

It’s probably worth recording here the comment that Mike made on the xsl-list thread, that using such a regex efficiently relies on the XSLT processor recognising that the variable reference is a compile-time constant and so compiling the regular expression at (XSLT) compile time. Of course if enough users do this anyway the XSLT implementors will probably feel motivated to make their processors support these idioms…

Davide

Re: Readable regexs

For the record, I think the position in Saxon 8.9.0.3 is that the regex is recognized as constant if the variable is inlined, and variables are inlined if they are local and referenced exactly once and the reference is not in a looping construct such as for-each. (I may be simplifying the rules here.) So the example where you create a local variable and then reference it as regex="{$r}" works fine (whether or not you declare a type for the variable), but if you use the variable in more than one place, or in a for-each loop, or if it’s a global variable, then I think it might not be recognized as constant, and you will then incur the cost of regex compilation on each use.

This gets a bit cleverer in the next release, but looking at a test case it’s not yet as clever as I would like! (Another new optimization, whereby chunks of code that depend only on global variables get moved out of templates and functions into new global variables, is getting in the way. Conflicts between different rewrite rules are becoming an increasing problem.)

Re: Readable regexs

Jeni, as the original poster in that thread I benefited directly from all the suggestions, and I expressed my gratitude and appreciation for Abel’s advice there.

However, please, be cautious when you say:

“The variable is set to a string with lots of whitespace in it. The comments are, of course, ignored (when the XSLT stylesheet is initially compiled).”

In fact the comments are not ignored at all.

This transformation:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output omit-xml-declaration="yes"/>

  <xsl:variable name="UKdate" as="xs:string">
    ([0-9]{1,2})     <!-- group 1 holds the day: one or two digits -->
    /                <!-- separator -->
    ([0-9]{1,2})     <!-- group 2 holds the month: one or two digits -->
    /                <!-- separator -->
    ([0-9]{2})       <!-- group 3 holds the year: two digits -->
  </xsl:variable>

  <xsl:template match="/">
    <xsl:copy-of select="document('')/*/xsl:variable[1]/node()[2]"/>
  </xsl:template>

</xsl:stylesheet>

Produces this output:

<!-- group 1 holds the day: one or two digits -->

Cheers,

Dimitre Novatchev

Re: Readable regexs

OK,

Maybe I was confused by the phrase “when the XSLT stylesheet is initially compiled”.

The fact is that the comments are not skipped as a result of (XML) parsing of the stylesheet.

And, certainly, when the parsed stylesheet is compiled as XSLT, then the value of an xsl:variable declared as="xs:string" is obtained as the string value of the node named “xsl:variable”, that is, as the concatenation of all its text nodes.

Is this explanation more correct?

Cheers,

Dimitre Novatchev

Re: Readable regexs

Comments (and PIs) get stripped from the XML document almost immediately, even before whitespace (see 4.2 Stripping Whitespace from the Stylesheet). By the time the XSLT processor gets round to evaluating the content of the <xsl:variable>, it only contains one text node.

Re: Readable regexs

Thank you, Jeni,

This is the precise explanation.

Cheers,

Dimitre Novatchev

Re: Readable regexs

Yes, Dimitre, if you access a stylesheet using document(), you can get at things such as whitespace outside <xsl:text>, non-XSLT elements at the top level of the stylesheet, and, indeed, comments. If you access a stylesheet using unparsed-text(''), you can even get at serialisation details such as whether double or single quotes have been used around attribute values, and precisely how much whitespace has been used within a tag.

But when the document is loaded and compiled as a stylesheet, these things are ignored. The value of the $UKdate variable doesn’t include the comments, or less-than signs, or any other indication that there were comments in the stylesheet.