Grouping Using the Muenchian Method

Grouping is a common problem in XSLT stylesheets: how do you take a list of elements and arrange them into groups. One of the most common situations in which it occurs is when you are getting XML output from a database. The database usually gives you results that are structured according to the records in the database. If it's an address book, for example, it might give you something like:

<records>
	<contact id="0001">
		<title>Mr</title>
		<forename>John</forename>
		<surname>Smith</surname>
	</contact>
	<contact id="0002">
		<title>Dr</title>
		<forename>Amy</forename>
		<surname>Jones</surname>
	</contact>
	...
</records>

The problem is how to turn this flat input into a number of lists, grouped by surname, to give something like:

Jones,<br />
	Amy (Dr)<br />
	Brian (Mr)<br />
Smith,<br />
	Fiona (Ms)<br />
	John (Mr)<br />

There are two steps in getting to a solution:

  1. identifying what the surnames are
  2. getting all the contacts that have the same surname

Identifying what the surnames are involves identifying one contact with each surname within the XML, which may as well be the first one that appears in it. One way to find these is to get those contacts that do not have a surname that is the same as a surname of any previous contact:

contact[not(surname = preceding-sibling::contact/surname)]

Once these contacts have been identified, it's easy to find out their surnames, and to gather together all the contacts that have the same surname:

<xsl:apply-templates select="/records/contact[surname = current()/surname]" />

The trouble with this method is that it involves two XPaths that take a lot of processing for big XML sources (such as those from big databases). Searching through all the preceding siblings with the 'preceding-siblings' axis takes a long time if you're near the end of the records. Similarly, getting all the contacts with a certain surname involves looking at every single contact each time. This makes it very inefficient.

The Muenchian Method is a method developed by Steve Muench for performing these functions in a more efficient way using keys. Keys work by assigning a key value to a node and giving you easy access to that node through the key value. If there are lots of nodes that have the same key value, then all those nodes are retrieved when you use that key value. Effectively this means that if you want to group a set of nodes according to a particular property of the node, then you can use keys to group them together.

Let's take our address book above. We want to group the contacts according to their surname, so we create a key that assigns each contact a key value that is the surname given in the record. The nodes that we want to group should be matched by the pattern in the 'match' attribute. The key value that we want to use is the one that's given by the 'use' attribute:

<xsl:key name="contacts-by-surname" match="contact" use="surname" />

Once this key is defined, if we know a surname, we can quickly access all the contacts that have that surname. For example:

key('contacts-by-surname', 'Smith')

will give all the records that have the surname 'Smith'. So it's easy to satisfy the second thing we needed to do (get all the contacts with the same surname):

<xsl:apply-templates select="key('contacts-by-surname', surname)" />

The first thing that we needed to do, though, was identify what the surnames were, which involved identifying the first contact within the XML that had a particular surname. We can use keys again here. We know that a contact will be part of list of nodes that is given when we use the key on its surname: the question is whether it will be the first in that list (which is arranged in document order) or further down? We're only interested in the records that are first in the list.

Finding out whether a contact is first in the list returned by the key involves comparing the contact node with the node that is first in the list returned by the key. There are a couple of generic methods of testing whether two nodes are identical:

  1. compare the unique identifiers generated for the two nodes (using generate-id()):

    contact[generate-id() =
            generate-id(key('contacts-by-surname', surname)[1])]
    	  
  2. see whether a node set made up of the two nodes has one or two nodes in it - nodes can't be repeated in a node set, so if there's only one node in it, then they must be the same node:

    contact[count(. | key('contacts-by-surname', surname)[1]) = 1]
        

Once you've identified the groups, you can sort them in whatever order you like. Similarly, you can sort the nodes within the group however you want. Here is a template, then, that creates the output that we specified from the XML we were given from the database:

<xsl:key name="contacts-by-surname" match="contact" use="surname" />
<xsl:template match="records">
	<xsl:for-each select="contact[count(. | key('contacts-by-surname', surname)[1]) = 1]">
		<xsl:sort select="surname" />
		<xsl:value-of select="surname" />,<br />
		<xsl:for-each select="key('contacts-by-surname', surname)">
			<xsl:sort select="forename" />
			<xsl:value-of select="forename" /> (<xsl:value-of select="title" />)<br />
		</xsl:for-each>
	</xsl:for-each>
</xsl:template>

The Muenchian Method is usually the best method to use for grouping nodes together from the XML source to your output because it doesn't involve trawling through large numbers of nodes, and it's therefore more efficient. It's especially beneficial where you have a flat output from a database, for example, that you need to structure into some kind of hierarchy. It can be applied in any situation where you are grouping nodes according to a property of the node that is retrievable through an XPath.

The downside is that the Muenchian Method will only work with a XSLT Processor that supports keys. This rules out James Clark's xt and pre-May 2000 versions of MSXML. In addition, using keys can be quite memory intensive, because all the nodes and their key values have to be kept in memory. Finally, it can be quite complicated to use keys where the nodes that you want to group are spread across different source documents.


/xslt/grouping/muenchian.xml by Jeni Tennison; generated using SAXON 6.5 from Michael Kay