This post was imported from my old Drupal blog. To see the full thing, including comments, it's best to visit the Internet Archive.
One of the fundamental disconnects between HTML5 and previous versions of HTML is the way in which you answer the question “what is the structure of this page?”. Things that make use of that structure, such as RDFa, need to take this into account.
An example is the document:
<html>
  <head><title>HTML example</title></head>
  <body>
    <table>
      <span>Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>
There are two different ways in which you might interpret the structure of this document. First, you might view the structure to be as it is written, with the <span> element as a child of the <table> element and therefore a tree that looks like:
+- html
   +- head
   |  +- title
   +- body
      +- table
         +- span
         +- tr
            +- td
Second, you might view the structure of the page to be the DOM as it is constructed by an HTML5 processor, which will move the <span> out from the table due to foster parenting, giving the result:
+- html
   +- head
   |  +- title
   +- body
      +- span
      +- table
         +- tr
            +- td
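To make the difference concrete, here is a rough Python sketch that builds both structures from the same source. It parses the page as XML to get the as-written tree, then applies a deliberately crude imitation of foster parenting. The `TABLE_CONTENT` set and the `foster_parent` and `outline` helpers are my own illustrative names, not part of any HTML5 library, and the real HTML5 algorithm handles many more cases (it would also insert a `tbody`, which I ignore here).

```python
import xml.etree.ElementTree as ET

SOURCE = """<html>
  <head><title>HTML example</title></head>
  <body>
    <table>
      <span>Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>"""

# Assumption: a simplified list of elements that may stay inside <table>.
# Anything else found directly inside a table is treated as misplaced.
TABLE_CONTENT = {"caption", "colgroup", "col", "tbody", "thead", "tfoot",
                 "tr", "td", "th"}


def parent_map(root):
    """Map each element to its parent (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def foster_parent(root):
    """Move misplaced children of each <table> to just before that table."""
    parents = parent_map(root)
    for table in list(root.iter("table")):
        parent = parents.get(table)
        if parent is None:
            continue
        for child in list(table):
            if child.tag not in TABLE_CONTENT:
                table.remove(child)
                parent.insert(list(parent).index(table), child)
    return root


def outline(elem, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


as_written = outline(ET.fromstring(SOURCE))
as_html5 = outline(foster_parent(ET.fromstring(SOURCE)))
```

Printing `"\n".join(as_written)` and `"\n".join(as_html5)` gives two outlines that differ only in where the `span` sits: inside the `table` in the first, and as a preceding sibling of it in the second.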
Which of these views you take doesn’t really matter at this point, but it does once you introduce markup that infers information from the structure of the page, such as RDFa. Let me introduce some RDFa markup to the document:
<html xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head><title>HTML+RDFa example</title></head>
  <body>
    <table about="http://example.com">
      <span property="dc:title">Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>
Now, if you view the structure to be as written, the <span> element is within the <table> element, and is therefore viewed as talking about whatever it is that the <table> element is talking about, namely http://example.com. So the RDF that you will glean from this page is:
<http://example.com> dc:title "Example title"
On the other hand, if you view the structure to be that constructed by an HTML5 processor, the <span> element is not within the <table> element, and is therefore viewed as talking about whatever the document is talking about, namely the document itself. So the RDF that you will glean from the page is:
<> dc:title "Example title"
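The way the subject depends on the tree can be sketched in a few lines of Python. This is a toy, not an RDFa processor: it only understands `about` and `property`, leaves the `dc:title` CURIE unexpanded, and uses the empty string to stand for the document itself (the `<>` above). The `triples` function and the two hand-written inputs are illustrative assumptions; the second input is simply the markup rearranged the way the HTML5 DOM would have it.

```python
import xml.etree.ElementTree as ET

# The page as the author wrote it.
AS_WRITTEN = """<html xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head><title>HTML+RDFa example</title></head>
  <body>
    <table about="http://example.com">
      <span property="dc:title">Example title</span>
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>"""

# The same page with the <span> moved out of the table, as foster
# parenting would leave it in an HTML5 DOM.
AS_HTML5_DOM = """<html xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head><title>HTML+RDFa example</title></head>
  <body>
    <span property="dc:title">Example title</span>
    <table about="http://example.com">
      <tr><td>Example table</td></tr>
    </table>
  </body>
</html>"""


def triples(root, base=""):
    """Collect (subject, predicate, literal) for every element with @property.

    The subject is the nearest @about on the element or one of its
    ancestors, falling back to `base`, which stands for the document.
    """
    found = []

    def walk(elem, subject):
        subject = elem.get("about", subject)
        predicate = elem.get("property")
        if predicate is not None:
            found.append((subject, predicate, "".join(elem.itertext())))
        for child in elem:
            walk(child, subject)

    walk(root, base)
    return found


from_source = triples(ET.fromstring(AS_WRITTEN))
from_html5_dom = triples(ET.fromstring(AS_HTML5_DOM))
```

Here `from_source` holds a single triple whose subject is `http://example.com`, while `from_html5_dom` holds the same predicate and literal with the empty (document) subject. Nothing about the extraction logic changed; only the tree it was handed did.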
So it’s not a new problem, but it’s still a problem.
For those people trying to define how RDFa is interpreted in HTML5, there are several unpleasant alternatives:
Define RDFa as operating over a DOM, but leave the creation of that DOM as implementation-defined. This effectively passes the buck (“it’s not our fault that HTML5 processors will construct a different DOM from XML processors”) but makes it hard to test implementation conformance and for authors to know exactly how their page will be interpreted. For example, an implementation that constructed a DOM with randomly rearranged elements would be entirely conformant despite producing completely different triples from one that took the elements in the original order.
Perhaps the set of permissible methods of DOM creation could be listed to prevent completely random processing, but I expect that it will be effectively limited through social and technological pressures. Implementations that build DOMs in random ways aren’t going to be as useful (to their users) as those that build them in expected ways; it’s also going to be far easier to implement RDFa processors using standard parsing libraries.
The approach is not without its downsides, of course. XSLT is similarly defined as operating over a tree model, with the question of how that tree model is constructed left to the implementation. Most processors decided to construct the tree using standard XML parsing, but famously MSXML would strip certain whitespace-only text nodes from the tree (unless you specified a parsing flag telling it not to), leading to incompatibilities and user confusion.
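To show why that kind of whitespace handling matters, here is a small Python sketch using `xml.dom.minidom`. It is not MSXML or XSLT, just an imitation of the two tree-building policies: one keeps whitespace-only text nodes and one drops them. The sample document and the `child_nodes` helper are made up for the illustration.

```python
from xml.dom import minidom

SOURCE = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"


def child_nodes(element, strip_whitespace):
    """Return the element's children, optionally dropping whitespace-only text."""
    nodes = list(element.childNodes)
    if strip_whitespace:
        nodes = [
            node for node in nodes
            if not (node.nodeType == node.TEXT_NODE and not node.data.strip())
        ]
    return nodes


root = minidom.parseString(SOURCE).documentElement

preserved = child_nodes(root, strip_whitespace=False)
stripped = child_nodes(root, strip_whitespace=True)

# Position of the first <item> among its siblings under each policy.
first_item = root.getElementsByTagName("item")[0]
position_preserved = preserved.index(first_item)
position_stripped = stripped.index(first_item)
```

The stripped tree has only the two `item` elements as children, while the preserved tree also has the text nodes between them, so anything that counts or indexes child nodes gets a different answer depending on which tree it was given.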
My guess is that the same kind of thing will happen with RDFa processors. It could very well be the case that an author will:
- check their RDFa in an RDFa validator that constructs a static HTML5 DOM, revealing one set of triples
- be further confused when a search engine that uses a tidy-and-interpret-as-XML approach gleans yet another slightly different set of triples and displays it in the search result