What You Can't Do with HTML5 Microdata

Update: Fixed a couple of errors in the microdata code.

The HTML5 microdata proposal hit the web just days before Google announced its support for RDFa (or at least one vocabulary encoded using RDFa attributes). These are, indeed, “interesting times” for the semantic web.

Now, if you’re one of those weirdos who want to embed RDF triples within your web pages, what you’re going to care about is whether you can use microdata to do it. Those of us who have been using RDFa in anger, rather than in toy examples, know that it can be hard to map a particular set of RDF statements onto HTML content. I thought I’d take a look to see just what it would be like to create particular RDF with the HTML5 microdata proposal.

Basics

On the face of it, you can express any triple in microdata because a triple like this (Turtle):

<http://www.example.com/subject> <http://www.example.com/property> <http://www.example.com/object> .

can always, and anywhere, be expressed with (HTML5):

<span item>
  <link itemprop="about" href="http://www.example.com/subject">
  <link itemprop="http://www.example.com/property" href="http://www.example.com/object">
</span>

while a triple like:

<http://www.example.com/subject> <http://www.example.com/otherProperty> "value" .

can be expressed with:

<span item>
  <link itemprop="about" href="http://www.example.com/subject">
  <meta itemprop="http://www.example.com/otherProperty" content="value">
</span>

Of course having to use all those long, repetitive URIs is a bit of a pain and bloats out the markup, but we’d never expect this to be hand-authored, right? Right? And what we really care about is that we can express the RDF.
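If the markup is generated rather than hand-authored, the two patterns above are simple to produce mechanically. The following is a minimal Python sketch, not something defined by the microdata proposal; the function name triple_to_microdata and its is_literal flag are made up for illustration.

```python
from html import escape


def triple_to_microdata(subject, predicate, obj, is_literal=False):
    """Render one RDF triple using the <span item> pattern shown above.

    Hypothetical helper: URI objects become <link>, literal objects <meta>.
    """
    subj = escape(subject, quote=True)
    prop = escape(predicate, quote=True)
    value = escape(obj, quote=True)
    if is_literal:
        value_element = f'<meta itemprop="{prop}" content="{value}">'
    else:
        value_element = f'<link itemprop="{prop}" href="{value}">'
    return (
        "<span item>\n"
        f'  <link itemprop="about" href="{subj}">\n'
        f"  {value_element}\n"
        "</span>"
    )


markup = triple_to_microdata(
    "http://www.example.com/subject",
    "http://www.example.com/otherProperty",
    "value",
    is_literal=True,
)
```

Here markup holds the same four lines as the second <span item> example above.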

It’s not just the URIs that are long-winded, by the way. RDFa manages to cram a lot into each element, whereas microdata usually requires separate elements. This is an example from the RDFa specification:

<img src="photo1.jpg"
  rel="license" resource="http://creativecommons.org/licenses/by/2.0/"
  property="dc:creator" content="Mark Birbeck" />

which produces the triples:

<photo1.jpg> xhv:license <http://creativecommons.org/licenses/by/2.0/> .
<photo1.jpg> dc:creator "Mark Birbeck" .

In HTML5, I think this has to be done with:

<span item>
  <img itemprop="about" src="photo1.jpg">
  <link itemprop="http://www.w3.org/1999/xhtml/vocab#license" 
        href="http://creativecommons.org/licenses/by/2.0/">
  <meta itemprop="http://purl.org/dc/elements/1.1/creator" 
        content="Mark Birbeck">
</span>

It’s a bit more tedious, but also more obvious what’s going on. Even after handling RDFa as much as I have, I still struggle to work out when, for example, an href attribute is providing the object for a statement, and when the subject. And if you look at the London Gazette RDFa, you’ll notice many occasions where empty <span> elements are used to provide the equivalent of the inline <link> and <meta> elements shown above. (In fact, as far as I recall, earlier drafts of RDFa allowed <link> and <meta> elements to be used this way too.)

From what I can see, though, there are two things that the microdata proposal in its current form can’t handle: datatyping and XML literals.

Datatypes

Datatypes are important in RDF. Values of properties are often not just strings, but dates, times, integers and so on. The microdata proposal mentions using the <time> element to create values, and has this example:

<div item>
 I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
</div>

The triple that you’d want to create from this is:

<> <http://www.w3.org/1999/xhtml/custom#birthday> "2009-05-10"^^xsd:date .

which makes it plain that the value is a date. However, the definition of the mapping from microdata to RDF makes it clear that the triple that’s created is:

<> <http://www.w3.org/1999/xhtml/custom#birthday> "2009-05-10" .

In other words, the value is a plain literal, not a date.

In RDFa, the datatype attribute is used to indicate the datatype of the value, so you can do:

<div xmlns:custom="http://www.w3.org/1999/xhtml/custom#">
  I was born on <span property="custom:birthday" content="2009-05-10" datatype="xsd:date">May 10th 2009</span>
</div>

It would be easy enough to say that the value of a <time> element has the datatype xsd:date, xsd:time or xsd:dateTime dependent on the syntax of its datetime attribute, but there are other times that you want typed values. We’ve used strings (as opposed to plain literals), integers and years. I wouldn’t want to rule out the use of custom datatypes such as colours (and RDF permits these). The JSON mapping could, perhaps, use an appropriate object if there is one, and otherwise use just the string value without too much loss of power.
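To illustrate how the datatype of a <time> value could be chosen from the syntax of its datetime attribute, here is a rough Python sketch. It is an assumption about how such a mapping might look, not the HTML5 date and time parsing algorithm: the regular expressions are deliberately simplified, the function name datatype_for_datetime is invented, and a real mapping would also need to normalise the value into the XSD lexical form (xsd:time, for instance, requires seconds).

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified patterns; they check shape only, not calendar validity.
_DATE = r"\d{4}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}(:\d{2}(\.\d+)?)?"
_ZONE = r"(Z|[+-]\d{2}:\d{2})"


def datatype_for_datetime(value):
    """Guess an XSD datatype URI from a datetime attribute value.

    Returns None when the value matches none of the simplified patterns,
    in which case the value would stay a plain literal.
    """
    if re.fullmatch(_DATE, value):
        return XSD + "date"
    if re.fullmatch(_TIME, value):
        return XSD + "time"
    if re.fullmatch(f"{_DATE}T{_TIME}{_ZONE}?", value):
        return XSD + "dateTime"
    return None


birthday_type = datatype_for_datetime("2009-05-10")
```

With the birthday example above, birthday_type is the URI for xsd:date.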

XML Literals

Arguably less important is the lack of support for XML literals, which are values that contain markup. The example in the RDFa spec is:

<h2 property="dc:title">
  E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
</h2>

which generates the triple (Turtle):

<> <http://purl.org/dc/elements/1.1/title> "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral .

RDFa allows you to force a value as an XML literal or a plain literal using the datatype attribute. Otherwise, if the element has any element children then it’s assumed to be an XML literal, and if not, a plain literal. I think the microdata proposal could adopt the same course of action. The JSON mapping could, perhaps, result in a value which is an array or some other container for a sequence of text and element nodes.
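The rule described above (an explicit datatype wins; otherwise element children mean an XML literal) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name literal_kind and the string labels it returns are invented here, and it uses Python’s standard html.parser rather than a real RDFa or microdata processor.

```python
from html.parser import HTMLParser


class _ElementDetector(HTMLParser):
    """Records whether any start tag appears in the markup fed to it."""

    def __init__(self):
        super().__init__()
        self.saw_element = False

    def handle_starttag(self, tag, attrs):
        self.saw_element = True


def literal_kind(inner_markup, datatype=None):
    """Decide how the content of a property element would be typed.

    An explicit datatype is returned unchanged; otherwise content with
    element children is an XML literal and anything else a plain literal.
    """
    if datatype is not None:
        return datatype
    detector = _ElementDetector()
    detector.feed(inner_markup)
    detector.close()
    return "rdf:XMLLiteral" if detector.saw_element else "plain"


title_kind = literal_kind("E = mc<sup>2</sup>: The Most Urgent Problem of Our Time")
```

For the title above, title_kind is "rdf:XMLLiteral", because of the <sup> child.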

Final Thoughts

To my mind, the HTML5 microdata proposal is unacceptable in its current form because, unlike RDFa, it can’t be used to represent all the statements that you might want to represent. If those issues were fixed, there would be pros and cons between it and RDFa. Microdata is more long-winded, but more explicit. RDFa is more arcane but doesn’t swamp the content of the page to quite the same extent.

Like a lot of people, I would have far rather seen a proposal which didn’t reinvent the wheel, but how does the old saying go: “The great thing about standards is that there are so many to choose from.” If the microdata proposal stays the course, I only hope that we’ll see consumers supporting both it and RDFa so that producers can choose which to use rather than being forced to embed both within their pages.

Comments

Re: What You Can't Do with HTML5 Microdata

With the addition of one more attribute, the RDFa photo example could generate one more triple:

<img src="photo1.jpg" typeof="foaf:Image"
  rel="license" resource="http://creativecommons.org/licenses/by/2.0/"
  property="dc:creator" content="Mark Birbeck" />

Which results in:

  <photo1.jpg> a foaf:Image .
  <photo1.jpg> xhv:license <http://creativecommons.org/licenses/by/2.0/> .
  <photo1.jpg> dc:creator "Mark Birbeck" .

Three distinct facts from one element.

Re: What You Can't Do with HTML5 Microdata

In your first example, <link rel="license"> won't work - you have to use <link itemprop="http://www.w3.org/1999/xhtml/vocab#license">. (At least, it works in the demo parser so I'm pretty sure it's correct.)

(There's nothing in the spec's "Introduction" or "Encoding microdata" sections indicating that you can use rel - you just have to always use itemprop when specifying item properties. The only place "rel" is mentioned is in the algorithm for extracting RDF, for identifying resources linked to pages (not to items).)

Re: What You Can't Do with HTML5 Microdata

Oh no, you’re right. You have to use itemprop unless the subject of the <link> element is the document itself. Sorry for the error.

Re: What You Can't Do with HTML5 Microdata

Mmmm. Not very impressive Jeni. If you’re struggling with it… doesn’t say much for the design? I’m guessing that it should be hand authored, rather than… processed from Turtle or another easier format and ‘slotted in’ (which doesn’t make much sense), so it looks like there’s a lot more work to do before it becomes usable (if it ever does).

I note Shelley is showing Google taking an interest (http://tr.im/lcFd), which points to some form of uptake.

DaveP

Re: What You Can't Do with HTML5 Microdata

The microdata feature isn’t really meant to do RDF; that it can do it at all is mostly a bonus. The goal was to address a set of use cases that were brought up over the past few months, none of which actually need RDF as far as I can tell. For example, none of the use cases on the list needed data typing at the point of data entry or XML literals, so I didn’t try to find a way to provide those.

Regarding the time thing, I plan to add that typing at some point (since as you say it’s not too hard); I haven’t done it yet mostly because I didn’t want to add the whole date and time parsing algorithm to the microdata-to-RDF conversion algorithm unless it was clear we really needed it, since it pretty much doubles the total complexity of conversion to RDF.

What do RDF processors do when they find an object is a “time” and they expected a “date”? Or when it’s a string and they expected a datetime? (How can I test this?)

Re: What You Can't Do with HTML5 Microdata

“The microdata feature isn’t really meant to do RDF; that it can do it at all is mostly a bonus.”

Uh, OK. Given that a significant proportion of the people who work with the semantic web and who want to use metadata embedded within web pages are currently using RDF, I would have thought you’d use a “pave the cowpaths” approach. But I know your mind is made up so I’m not going to try to change it.

Regarding datatyping, you might consider using at least those datatypes that have an obvious mapping in Javascript, since this will allow people to compare and manipulate values (such as numbers) within their scripts without converting them explicitly. And with dates and times in particular, it would seem to break the principle of least surprise to have the type indicated within the markup but not reflected within the DOM.

Regarding XML Literals, the word ‘XML’ may send you running and screaming for the hills, but they’re actually about capturing structure rather than using a particular technology. They are a useful feature for document content. For example, descriptions of events may well run to several paragraphs, or include emphasised text or ruby markup, and it would be good in these cases for the value of the property to reflect that structure.

As far as validating the values goes: it’s not necessary for an RDF processor to validate the triples that it holds in order to process them (in the same way as it’s not necessary for an XML parser or an XSLT processor to validate a document in order to build a model of it and transform it). Knowing the datatypes of values becomes important when you’re processing RDF in exactly the same way as it would when processing the values in Javascript: in running comparisons and performing calculations, in RDF’s case within SPARQL or when using a library such as rdfQuery.

Only OWL supports the definition of datatypes for properties, so you would have to use an OWL validator to actually validate the triples against a vocabulary. Validators are harder to find than you might hope; they usually exist within triplestores rather than as standalone applications.

Re: What You Can't Do with HTML5 Microdata

(thanks for the twitter heads-up)

“Given that a significant proportion of the people who work with the semantic web and who want to use metadata embedded within web pages are currently using RDF, I would have thought you’d use a “pave the cowpaths” approach.”

The goal wasn’t to address “the semantic web”. The goal was to address the needs of a variety of people who want to annotate their HTML pages for various reasons (see http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019681.html for some of the scenarios that I had in mind; I’m going to be posting more detailed comments on more scenarios and how we approach them shortly). That is, in designing microdata I was trying to address specific concrete problems, I was not trying to plug into the RDF world. None of the problems needed the RDF world as far as I can tell.

Having said that, I did try to bridge the gap to RDF, by defining how this stuff maps to RDF. (Also JSON, for similar reasons, and now iCalendar, vCard, and BibTeX.)

For dates and times, the DOM interface does have ways to interact with the “time” element using native JS Date objects (timeElement.time, .date, and .timezone).

For XML literals, there really weren’t any use cases that needed them. For event descriptions, vEvent (from iCalendar, used as the event vocabulary in HTML5 now), doesn’t support structured markup, so it’s not clear that XML literals would actually be useful there.

Is there any way for me to test what RDF processors do when they find an object is a “time” when they expected a “date”, or when they find a string when they expected a datetime? I really would like to study this more, but I don’t know how to study it. What tools process RDF event data, for instance?

For validation in general (of custom vocabularies used in microdata), I’m thinking the best plan might be to leverage the RDFS/OWL world and just say that tools should use those mechanisms to validate microdata, by translating the microdata markup to triples (as defined by the spec) and validating the triples. This is more or less the same story that RDFa would have for validation and editing.

Re: What You Can't Do with HTML5 Microdata

I don’t agree with your approach to this at all, but I don’t want to waste either your or my time arguing about it. I’m not conceding the point, just unwilling to engage in fruitless discussion.

I know that the DOM interface for the <time> element has ways to interact with the JS Date objects. What I’m asking for is a way to get from element.properties to a Date object. If you don’t want the values themselves to be Date objects, perhaps you could have a pointer from a property to the element that defined that property.

I know that iCalendar doesn’t support structured values in descriptions, but that shouldn’t prevent the HTML5 DOM from doing so. Take an example where a web page contains event data and another application wants to embed that event data in their own web pages. Rather than going through the lossy transformation to iCalendar, it could pick up the full structured description of the event. This is particularly useful for i18n if nothing else.

Thinking about validation, there are two questions to answer:

  1. How should the HTML5 processor handle it when the value provided for the property doesn’t match the datatype that’s been defined in the markup for the property?
  2. How does an RDF processor handle it when the value provided for the property doesn’t match the datatype that’s been defined in the schema or ontology for the property?

For the first case, I’d suggest that the property is ignored, as if the value wasn’t present. This is consistent with how you handle it when the <time> element’s value isn’t in a recognised date/time syntax.
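As a sketch of that “ignore the property” behaviour, assuming properties arrive as (name, value, datatype) tuples, something like the following Python would do. The tuple shape, the VALIDATORS table and the function names are all hypothetical, and Python’s own int() and date.fromisoformat() are more lenient than the XSD lexical rules, so this only approximates real datatype checking.

```python
from datetime import date


def _is_integer(value):
    """True if Python can read the value as an integer."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def _is_date(value):
    """True if Python can read the value as an ISO 8601 date."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Hypothetical table: datatype name -> check on the lexical value.
VALIDATORS = {"xsd:integer": _is_integer, "xsd:date": _is_date}


def keep_valid(properties):
    """Drop properties whose value does not match their declared datatype.

    Untyped properties, and datatypes with no validator, are kept as-is.
    """
    kept = []
    for name, value, datatype in properties:
        check = VALIDATORS.get(datatype)
        if check is None or check(value):
            kept.append((name, value, datatype))
    return kept


result = keep_valid([
    ("birthday", "2009-05-10", "xsd:date"),
    ("birthday", "May 10th 2009", "xsd:date"),
    ("name", "Jeni", None),
])
```

The second birthday is dropped, as if it had never been present; the other two properties survive.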

For the second case, I think you are absolutely right to refer to RDFS/OWL for validation of the values when RDF is generated. Validation of the JSON would have to be through a JSON schema.

As I said, many RDF processors don’t validate the triples that they work with; you can still work with triples without it. In rdfQuery, for example, I don’t currently do any validation of the triples. I imagine that when I introduce it, I’ll ignore the invalid triple and provide a callback mechanism so that people can choose how to handle the error.

If you want to explore further applications that do validation, you could look at Jena; I’m afraid I’m not the best person to ask for assistance with it, but I’m sure #swig on irc.freenode.net will be able to help.

Re: What You Can't Do with HTML5 Microdata

I don’t understand what other approach one could use. Surely trying to address real concrete problems is the way all specifications should be written? Otherwise, what’s the point? I mean, we’re not doing this for fun, we’re trying to actually make things better.

The .properties array actually returns elements, so you can already do what you want directly. For example, item.properties['creation'][0].date returns the date of the first 'creation' itemprop (assuming it’s a time element, anyway).

I don’t see why a Web page wanting to grab an event description from another would go through the microdata mechanism. Surely they’d just grab the whole blob of markup. If they did go through microdata, how could they know what markup to grab? I mean, the description is bound to have class names specific to the source page, and so on.

Regarding datatypes, if people don’t generally observe them anyway, it seems like a small loss not to have them yet. When people start using them more, we can add them (it wouldn’t be too hard to add an attribute like content-datatype that stored the type URI). I think it’s likely that in the general case the vocabularies are going to expect distinguishable types for particular values, though, and in those cases there doesn’t seem to be much value in explicitly labeling each literal with the type. Best to just check to see if the literals match the values and go from there. iCalendar gets this wrong (IMHO): it requires that fields that take dates or date-times be labeled as having dates or date-times appropriately, even though that does nothing useful, and just adds one more way for the data to be wrong. Better, IMHO, to just check to see if the value is a date or date-time (or an error) and work from there.

I’ll look at Jena, thanks for the link.

Re: What You Can't Do with HTML5 Microdata

Ian: re: the approach issue that Jeni doesn’t want to get drawn into, the problem from my standpoint is that you initially ignored the real use cases many of us have been telling you about for a long time. Once you finally did get around to considering them, you at least in some cases rewrote them to fit your preconceived idea of what metadata in HTML should be, all the while pretending that RDFa didn’t exist and wasn’t solving the exact same use cases (and to be clear: I’m not an unreserved advocate of RDFa; but that’s true of most technologies).

For example, you reduced my informal use case (that I posted on Shelley’s blog) to be about contact data, when it was about much more. The “more” wasn’t a contrived example, and included being able to markup publication data. And that has concrete implications for the proposal you came up with. For example, I don’t have super strong opinions about XML literals, but it’s a fact that within publication data properties (say titles and abstracts), one can find need for embedded markup: chemicals, em or i or b tags, etc., etc.

Re: What You Can't Do with HTML5 Microdata

I didn’t ignore any use cases, I read every single e-mail on the topic (tens of thousands of lines) and tried to understand every one.

If I missed a specific use case, I apologise profusely; please reiterate it so that I can address it.

I did ignore comments of the form “we need RDFa”, just like I ignored comments of the form “we need SVG” or “we need MathML” or “we need VML” or “we need LaTeX”. My goal, what I am paid to do, and what I firmly believe is the right way to do standards development, is to start from end-user needs, and to work backwards to work out what the best technical solution might be. LaTeX isn’t the best technical solution for embedding Maths on Web pages, IMHO. RDFa (indeed RDF in general) isn’t the best technical solution for embedding contact or event or bibliographic information. Microformats isn’t the best technical solution for arbitrary nested name-value pair annotations by small authoring communities.

If there is a need to embed fragments of HTML documents, or fragments of other XML vocabularies, then I agree that we should address this. I didn’t see any concrete end-user goals that would need HTML or XML fragments to be solved. Events, for example, need to be compatible with the vEvent part of iCalendar to be imported into most calendar systems, and that doesn’t natively support marked up text. So for events, we don’t need HTML or XML fragment support. Replacing Atom in a way that doesn’t need an external resource (another use case that came up) would need fragment support, but probably won’t need to be addressed by a microdata-like solution at all.

If there are use cases that need to be addressed, please tell me, ideally by e-mailing one of the lists or me directly at ian@hixie.ch.

Re: What You Can't Do with HTML5 Microdata

“Regarding datatypes, if people don’t generally observe them anyway, it seems like a small loss not to have them yet.”

I should have pointed out, it’s not the case that people don’t use them. We use a bunch of different datatypes in the London Gazette RDF, including numbers, years, dates, URIs and strings (which are distinct from untyped literals in that they have no associated language). rdfQuery understands and uses datatypes. SPARQL understands and uses typed literals when it queries against triplestores.

What people aren’t doing much at the moment, as far as I know, is validating triples (whether they hold typed values or not) against a schema or ontology, because that means using OWL, and therefore potentially quite heavy-weight reasoning capabilities.

Re: What You Can't Do with HTML5 Microdata

By “use” I meant consume, as in, have end-user-visible behaviour differences based on different types in the input (other than error messages).

Re: What You Can't Do with HTML5 Microdata

Sure. As I said, SPARQL observes datatypes. If you do:

SELECT ?v WHERE { ?v ?p 42 }

then it will only match those resources that have a property whose value is typed as xsd:integer. In a page of SPARQL results based on the above query, the page would differ depending on whether the triples it’s based on had typed values or not. This would be end-user visible.
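The difference can be shown without a triplestore. In the Python sketch below, literals are modelled as (lexical form, datatype) pairs and matching is plain term equality; the data, the ex: names and the subjects_with_value helper are invented for illustration, and a real SPARQL engine does considerably more than this. The bare 42 in the query above is shorthand for "42"^^xsd:integer, which is why the untyped "42" does not match.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Two made-up triples: one typed value, one plain (untyped) literal.
triples = [
    ("ex:a", "ex:answer", ("42", XSD_INTEGER)),
    ("ex:b", "ex:answer", ("42", None)),
]


def subjects_with_value(triples, literal):
    """Return subjects having any property equal to the given literal term."""
    return [s for s, _p, o in triples if o == literal]


# Equivalent in spirit to: SELECT ?v WHERE { ?v ?p 42 }
matches = subjects_with_value(triples, ("42", XSD_INTEGER))
```

Only ex:a ends up in matches; ex:b, whose value is the plain literal that microdata currently produces, is left out.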

Re: What You Can't Do with HTML5 Microdata

Sure, but my mum (or any typical end user) isn’t going to be writing SPARQL any time soon. What do tools aimed at regular end users that consume RDF do with datatypes?

Re: What You Can't Do with HTML5 Microdata

Tools that are aimed at regular end users will generally use SPARQL to query over a set of triples to produce their results. My mum doesn’t write SQL but she still uses applications that are based on SQL queries. Those applications wouldn’t behave as anticipated if they (for example) sorted prices in lexicographic order. The same is true for applications built on RDF.

I think what you’re actually asking for are pointers to tools that use RDF and that are used by end-users. Frankly, I don’t think that the space is developed enough yet for anyone to provide you with pointers that you’d find persuasive. I asked the same kind of question a while back and didn’t really get any good pointers. The closest are the semantic web case studies and use cases which you undoubtedly already know about. Maybe by the time HTML5 is done there will be more.

Re: What You Can't Do with HTML5 Microdata

(to clarify, I understand entirely that if your use case is “interoperate with the semantic web toolchain” that datatypes and XMLLiterals and so forth are important. But I am focusing on end-user use cases, and those don’t seem to need to interoperate with RDF at that level. I understand that you don’t want to debate that, but it is the crux of my approach, so it’s hard to get away from it in a discussion about the approach.)

Re: What You Can't Do with HTML5 Microdata

Much as I admire your rhetorical devices, I repeat that I’m not going to get drawn into discussing your approach to specification development. There are enough other people around that are doing that, if that’s the kind of discussion you want.

I’d missed that the element.properties accessor returns a collection of elements. Thanks for clarifying. That will provide access for working out whether a value is a date, time, date-time, URL (as I believe you call them), or in fact structured content (as in XML Literals). So although you’re not exposing those in the RDF or JSON, they are there in the DOM, which I suppose is half way there. There’s nothing stopping me using that information within rdfQuery to create triples with typed and XML Literal values. I’m just limited in which types can be identified.

“I don’t see why a Web page wanting to grab an event description from another would go through the microdata mechanism. Surely they’d just grab the whole blob of markup. If they did go through microdata, how could they know what markup to grab? I mean, the description is bound to have class names specific to the source page, and so on.”

The application would use the microdata to identify what “blob of markup” is the description of the event. Presumably if they were sensible they would sanitise it (by removing things that might throw off the display, such as CSS classes, or that might be security risks, such as scripts) before displaying it on a different web page. Sanitisation could preserve elements such as paragraphs, emphasis and ruby markup.
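A very small version of that kind of sanitisation might look like the Python sketch below. It is illustrative only and not fit for real security use: the allow-list, the class and function names are assumptions made here, all attributes (including class) are simply discarded, script and style elements are dropped along with their content, and unbalanced markup is not repaired.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list: paragraphs, emphasis and ruby markup.
ALLOWED = {"p", "em", "strong", "ruby", "rt", "rp"}
# Elements removed together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


class _Sanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are not copied

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitise(markup):
    """Keep allowed elements and text; strip attributes, scripts and styles."""
    parser = _Sanitiser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


clean = sanitise('<p class="intro">Come <em>early</em>!</p><script>alert(1)</script>')
```

Here clean is '<p>Come <em>early</em>!</p>': the structure of the description survives, while the page-specific class and the script do not.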

Back to datatyping: if literals were explicitly marked with their type then a generic viewer could go to a page of, say, product information, identify the items described within the page and sort them numerically based on their price property without knowing what ‘price’ meant. It could go to a page that listed doctors and sort them numerically based on their distance property without knowing what ‘distance’ meant. And so on. Associating types with values is useful for these kinds of generic (vocabulary-unaware) applications.
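Here is a rough Python sketch of such a vocabulary-unaware viewer. It assumes each item is a dictionary mapping property names to (lexical form, datatype URI) pairs, and that all values of the property being sorted share one datatype; the CASTS table and the sort_items function are invented for illustration.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Datatype URI -> function turning a lexical form into a sortable value.
CASTS = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
}


def sort_items(items, prop):
    """Sort items by a property, using its datatype but not its meaning.

    Values with an unknown or missing datatype fall back to string order.
    """
    def key(item):
        lexical, datatype = item[prop]
        cast = CASTS.get(datatype, str)
        return cast(lexical)

    return sorted(items, key=key)


typed = [
    {"price": ("100.00", XSD + "decimal")},
    {"price": ("9.99", XSD + "decimal")},
    {"price": ("25.00", XSD + "decimal")},
]
untyped = [{"price": (item["price"][0], None)} for item in typed]

typed_order = [item["price"][0] for item in sort_items(typed, "price")]
untyped_order = [item["price"][0] for item in sort_items(untyped, "price")]
```

With typed values the prices come out as 9.99, 25.00, 100.00. With the same values as plain literals, the viewer can only fall back to string order and produces 100.00, 25.00, 9.99.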

Conversely, as it stands an RDF-based application that does do datatype processing will flag as errors many triples that I would like to encode as microdata within HTML5. As far as I know, RDF processors assume that you mean what you say (ie when you provide an untyped literal) and don’t provide mechanisms for automatically casting values from one type to another based on their definition in an ontology. I guess you’d have to do some kind of pre-load filtering. Or just use RDFa.

So the costs of not supporting datatyping in microdata are that it limits what generic applications can do, and effectively makes those of us who do want RDF do a lot more work to get it. What would the cost be in supporting datatyping (by adding an attribute like content-datatype as you suggest) from your point of view?

Re: What You Can't Do with HTML5 Microdata

The only cost would be extra complexity and the implications thereof — implementation cost, testing cost, cost to tutorial writers, cost to authors trying to understand the language, etc. Same cost as adding any API or language feature.

Re: What You Can't Do with HTML5 Microdata

Good. I thought there might have been something else you were worried about.