Google's RDFa Support

I can’t reply to Henri Sivonen

@JeniT What’s wrong with http://rdf.data-vocabulary.org/rdf.xml ?

in 140 characters.

http://rdf.data-vocabulary.org/rdf.xml is the RDF schema that describes the classes and properties recognised by Google’s rich snippets, which promise to provide richer information about search results than is available currently, in the manner of SearchMonkey.

So what’s so bad about this RDF schema?

Well, firstly, unlike what I just claimed, it’s not the RDF schema for Google’s rich snippets. If you look at an example like this one from their own help pages:

<div xmlns:v="http://rdf.data-vocabulary.org/" typeof="v:Person">
   <span property="v:name">John Smith</span>
   <span property="v:nickname">Smithy</span>
   <span property="v:url">http://www.example.com</span>
   <span property="v:affiliation">ACME</span>
   <span rel="v:address">
     <span property="v:locality">Albuquerque</span>
   </span>
   <span property="title">Engineer</span>
   <a href="http://darryl-blog.example.com/" rel="v:friend">
   <span property="v:name">Darryl</span>
</div>

you’ll see that the CURIE v:name resolves to the URI http://rdf.data-vocabulary.org/name. And that resolves to… well, take a look. The fact that you can’t resolve a property URI to a definition within a schema or ontology really undermines the idea of self-describing documents. To know anything about what http://rdf.data-vocabulary.org/name means you have to be human enough to go hunting for the documentation that describes it. Contrast with http://xmlns.com/foaf/0.1/name which, if you ask for RDF, takes you to an ontology in which you can look for the definition of that resource, and find out its domain, range, a label for it and so on.
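
To make the resolution step concrete, here is a minimal sketch in Python of how a CURIE is expanded against a prefix mapping. The function name expand_curie is made up for illustration, and real RDFa processing has more rules than this; the sketch only shows the "namespace URI plus reference" concatenation that turns v:name into the URI above.

```python
def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' by concatenating the namespace URI and the reference."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"no prefix mapping for {curie!r}")
    return prefixes[prefix] + reference


# Prefix mapping as declared by xmlns:v in the example markup.
google_prefixes = {"v": "http://rdf.data-vocabulary.org/"}
# FOAF's namespace, for comparison.
foaf_prefixes = {"foaf": "http://xmlns.com/foaf/0.1/"}

name_uri = expand_curie("v:name", google_prefixes)
foaf_name_uri = expand_curie("foaf:name", foaf_prefixes)
print(name_uri)
print(foaf_name_uri)
```

Expansion is purely mechanical; whether the resulting URI leads anywhere useful depends entirely on what the server at that address returns.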

So the redirection’s screwed up. What about the schema itself? Well, it’s not that it’s bad in and of itself; it reflects a reasonable model of the world of products, reviews, people, organisations. The model’s obviously heavily based on those used within the microformats that Google is also supporting: hProduct, hReview, hCard and XFN, which in turn are based on existing standards such as vCard or on solid document analysis of existing behaviour. So it’s not surprising that it’s a reasonable model.

The problem is not so much what has been done as what hasn’t. I’m not completely surprised that there’s no support for FOAF — the model is sufficiently different from that of hCard to make it harder to use — but not reusing the existing RDF ontology for vCard smacks of either ignorance or arrogance. I’ve advocated before going ahead and building your own ontologies rather than waiting for community standardisation, but not doing a rudimentary search for previous work is going too far.

The schema itself is very sparse. It only uses RDF Schema, rather than OWL, so it’s not surprising that it’s not particularly expressive, but even so a few labels would have been useful! The classes that have been introduced have no hierarchy (they are all subclasses of rdfs:Resource), so properties that are shared between people and organisations, such as v:name, have a domain of all resources. And there are no sub-properties, so no way to automatically identify the appropriate property to use as a substitute for rdfs:label when displaying information about a person or organisation. From a technical standpoint, it seems like it was hastily thrown together by people who either don’t understand why self-description is important, or don’t care about other people’s use of the vocabulary.

But more than these, the reason that I am disappointed with what’s been shown so far is that Google is really missing the point of using RDF: its extensibility. It’s easy enough to parse a set of microformats; to build something useful around a small number of known vocabularies. It’s easy, but it’s limited. When I heard the buzz of “Google supporting RDFa” I expected it to support not just the syntax of RDFa, but its extensibility, because otherwise what’s the point?

The quote that scares me most comes from the interview with RV Guha and Othar Hansson:

RV Guha … It’s really important that everybody, as far as possible, use the same vocabulary. So Google is essentially going to be making an investment in sort of hosting a vocabulary that maybe is Google Services. …

because a single centralised vocabulary is just not going to be feasible. The web’s too large and diverse for that.

And I believe that Google, by working with the web and combining semantic markup with their formidable computing power and existing natural language understanding, could do so much more. Let me talk through a really simple scenario to illustrate.

Google publish their own RDF vocabulary for products (as they have done). Amazon look at the vocabulary and decide that it doesn’t include some information that’s useful and important specifically for books, such as the author of the book. So Amazon extend Google’s vocabulary by adding a new class, amazon:Book, that’s a subclass of google:Product, and introduce another property in their RDFa markup, amazon:author, giving it a domain of amazon:Book.

When Google indexes Amazon’s pages, the RDFa parser sees that the resources that are described in the pages are described as amazon:Books. It doesn’t know what an amazon:Book is, so it resolves the URI, looks at Amazon’s ontology and finds the label and description of the class, and the fact that it’s a subclass of google:Product. The fact that an amazon:Book is a kind of google:Product means that it can be displayed just like other products. Perhaps Google even specify some other things that can be put in your ontology to supplement the rich snippet, like an icon, or a list of properties that are important. I don’t know. The point is that the ontology provides information to Google about what to do with this class, without Google having to invent and own the vocabulary.
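
The class lookup in this scenario could be sketched roughly as follows. Everything here is an assumption made for illustration: the example.* URIs stand in for google:Product and amazon:Book, the ontology is a hard-coded set of triples rather than something fetched over HTTP, and nearest_known_class is a made-up helper, not anything Google has described.

```python
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical URIs standing in for google:Product and amazon:Book.
GOOGLE_PRODUCT = "http://google.example/vocab#Product"
AMAZON_BOOK = "http://amazon.example/vocab#Book"

# Triples a consumer might obtain by resolving amazon:Book (hard-coded, no network access).
ontology = {
    (AMAZON_BOOK, RDFS_SUBCLASS_OF, GOOGLE_PRODUCT),
}

# Classes the consumer already has display code for.
known_classes = {GOOGLE_PRODUCT}


def nearest_known_class(cls, triples, known):
    """Walk up rdfs:subClassOf, breadth first, until a known class is found."""
    queue, seen = [cls], set()
    while queue:
        current = queue.pop(0)
        if current in known:
            return current
        if current in seen:
            continue  # guard against cycles in the class hierarchy
        seen.add(current)
        queue.extend(o for s, p, o in triples if s == current and p == RDFS_SUBCLASS_OF)
    return None


display_as = nearest_known_class(AMAZON_BOOK, ontology, known_classes)
print(display_as)
```

A class with no known ancestor simply yields None, at which point the consumer would fall back to whatever generic display it has.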

Similarly, when the RDFa is parsed, Google generates triples for this new amazon:author property just as it does for the other Google ones. It recognises that it doesn’t know what it means, so it resolves the URI for the property and finds a schema. The schema includes a label for the property, “author” and a natural language description. Google uses the label in the display of the rich snippet; it processes the natural language description to disambiguate the label and translate it into other languages. Perhaps it can even assess the importance of the property, and whether it should actually be displayed at all, by understanding its description.

Anyway, back to earth. These are the kind of pipe dreams that I used to ridicule semantic web folk about back before they body-snatched me! So although I’m sure that Google could do so much more, there are some things to be thankful for:

  • They have made the first steps towards recognising semantic markup as a potentially useful source of information.
  • They haven’t gone and invented yet another syntax for encoding semantic information within HTML. (Well, at least this arm hasn’t.)
  • They have reused microformats.
  • They have developed a vocabulary for a few things that are really useful, even if we did have ontologies for some of it already.
  • They will now have a stake in answering the difficult questions around trust, confidence, accuracy and time-sensitivity of semantic information.

And, of course, this will encourage people who haven’t previously been interested to use semantic markup, which will make data easier to get at and more open to reuse and I believe will benefit the web as a whole.

Comments

Re: Google's RDFa Support

Something else to be thankful for: Google didn’t give anyone reason to believe that “v:” is a magic prefix, but requires a proper xmlns:v declaration to be parsed properly.

Do you have a demo for verifying this?

Re: Google's RDFa Support

Untested, of course; I’m just basing this on the first “important property” listed at http://google.com/support/webmasters/bin/answer.py?answer=146898. At least they called it “important”!

Re: Google's RDFa Support

Something else to be thankful for: Google didn’t give anyone reason to believe that “v:” is a magic prefix, but requires a proper xmlns:v declaration to be parsed properly. Of course this is the right thing to do, but these days I don’t take that for granted.

Lately I’ve been thinking of the difference between URIs and URLs (besides URLs being a subset of URIs, like URNs) being that the former are Identifiers and the latter are Locators. If someone tells me that http://xxx is a locator, I expect to find something at that location; if they tell me that it’s an identifier, I know that it may just be a name used to distinguish that concept.

Bob

Re: Google's RDFa Support

The xmlns should have a # at the end: xmlns:v="http://rdf.data-vocabulary.org/#"

Then it works as it should: http://rdf.data-vocabulary.org/#name
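
One way to see why the trailing # could matter, sketched in Python with the standard library: a client strips the fragment before making the request, so every term in a hash namespace is fetched from the same document. This only illustrates the URI mechanics; it does not check what the server actually returns.

```python
from urllib.parse import urldefrag

namespace = "http://rdf.data-vocabulary.org/#"
property_uri = namespace + "name"

# The fragment is not sent to the server; only the document URI is requested.
document_uri, fragment = urldefrag(property_uri)
print(document_uri)
print(fragment)
```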

Re: Google's RDFa Support

So the web is a recipe for a DDoS attack?

No. None of the technologies deployed on Web scale (HTML, CSS, DOM, JS, etc.) require user agents to dereference a central URI on a vocabulary server in order to browse content. Browsers don’t need to GET anything from w3.org to make sense of content on example.com.

Re: Google's RDFa Support

First, there are two kinds of things that we’re talking about here:

  1. resources that are required in order to correctly parse a document (such as a DTD that is required to resolve entities; or the proposed RDFa profiles which as I understand it would be required in order to correctly interpret values in rel or property attributes)
  2. resources that add value to the document such as by providing additional styling or interaction

I agree with you that the first kind of resources should be avoided. RDF schemas and ontologies are of the second kind. They are not needed in order to generate or browse triples, but can be used to add value to those triples by enabling additional reasoning or by providing supplementary information to enhance their presentation. They are precisely as essential as CSS stylesheets and Javascript scripts. If the server is not available, it’s not the end of the world.

Second, we are not talking in this particular context about billions of browsers accessing a schema or ontology, but about Google accessing it in order to enhance the presentation of its search results. I believe that Google is pretty good at caching documents and only refreshing the cache at intervals. Perhaps Google could apply that expertise to caching (and storing in a more useful, compiled version, for its own purposes) RDF schemas and ontologies. I imagine that other semantic web applications would do something similar for popular and common vocabularies, simply for performance reasons, just as browsers do with the XHTML DTD.
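
As a rough illustration of that kind of caching, here is a minimal Python sketch. The SchemaCache class, its max_age parameter and the fetch callable are all hypothetical; a real crawler would do something far more elaborate. The point is only that a schema is refetched at intervals, and that an unavailable server degrades the result rather than breaking it.

```python
import time


class SchemaCache:
    """Cache fetched schemas and refetch only after max_age seconds."""

    def __init__(self, fetch, max_age=86400.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: URI -> schema document
        self._max_age = max_age
        self._clock = clock
        self._entries = {}       # URI -> (time fetched, document)

    def get(self, uri):
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        try:
            document = self._fetch(uri)
        except OSError:
            # Server unavailable: use a stale copy if there is one, else do without.
            return entry[1] if entry is not None else None
        self._entries[uri] = (now, document)
        return document


calls = []


def fake_fetch(uri):
    """Stand-in for an HTTP request, so the sketch runs without a network."""
    calls.append(uri)
    return "schema for " + uri


cache = SchemaCache(fake_fetch, max_age=60.0)
first = cache.get("http://xmlns.com/foaf/0.1/")
second = cache.get("http://xmlns.com/foaf/0.1/")
print(len(calls))
```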

Re: Google's RDFa Support

First, there are two kinds of things that we’re talking about here:

  1. resources that are required in order to correctly parse a document (such as a DTD that is required to resolve entities; or the proposed RDFa profiles which as I understand it would be required in order to correctly interpret values in rel or property attributes)
  2. resources that add value to the document such as by providing additional styling or interaction

I agree with you that the first kind of resources should be avoided. RDF schemas and ontologies are of the second kind.

From the earlier discussions on the WHATWG list (not the latest round) and from Follow Your Nose advocacy I had heard at TPAC, I had been led to believe that OWL ontologies retrieved by dereferencing namespace URIs were of the first kind.

In fact, I’m quite surprised to see the rejections of OWL as excess complexity compared to RDF itself in this comment thread.

They are not needed in order to generate or browse triples, but can be used to add value to those triples by enabling additional reasoning or by providing supplementary information to enhance their presentation.

This means that the app must be pre-programmed to recognize the predicates of all synonymous vocabularies it supports and can’t Follow its Nose to reason about unrecognized triples until the triples have been recast into something the app knows about, which is an operation often promoted as part of RDF advocacy (“Power of RDF”).

I don’t think end users “browse triples”, so you need some pre-programmed end-user-suitable functionality to trigger on the triples and you need to get the triples to a point where they match what has been programmed in advance.

They are precisely as essential as CSS stylesheets and Javascript scripts. If the server is not available, it’s not the end of the world.

If scripts aren’t available, it can break a page completely. The difference is that a page author can serve a script from whichever URI (s)he chooses and isn’t limited to relying on a constant third-party URI. Hence, the author can make the script as reliable as the HTML document as far as the HTTP server availability goes.

Second, we are not talking in this particular context about billions of browsers accessing a schema or ontology, but about Google accessing it in order to enhance the presentation of its search results.

I think we should develop solutions that work for both search engines and browsers.

Re: Google's RDFa Support

This means that the app must be pre-programmed to recognize the predicates of all synonymous vocabularies it supports and can’t Follow its Nose to reason about unrecognized triples until the triples have been recast into something the app knows about, which is an operation often promoted as part of RDF advocacy (“Power of RDF”).

If by ‘unrecognised triples’ you mean triples whose property isn’t recognised, yes. (The triple can always be recognised as a triple, but the vocabulary that its property comes from might not be recognised.) There are three kinds of applications that might use the triples:

  • generic, OWL-unaware, applications that will view/process/search any triples you throw at them, albeit not in a particularly user-friendly fashion
  • generic OWL-aware applications that will follow their nose to information from ontologies either to supplement the view they provide or to perform more sophisticated reasoning over the data (such as identifying that one property is equivalent to another)
  • vocabulary-aware applications that can only be used with specific kinds of triples and have no need for an ontology because any extra display-oriented information or reasoning is already hard-coded into the application

In fact, a single application might operate at two or all three levels: it might have built-in understanding of the ontologies that it has been built around, follow its nose to ontologies it doesn’t know about, and provide a default view based purely on the URIs and datatypes of the properties if the ontology can’t be found.
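
A minimal Python sketch of an application working at all three levels, here just for choosing a display label. The names label_for and fetch_ontology, the built-in table and the example.* URIs are all assumptions made for illustration.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Level 1: labels hard-coded for vocabularies the application was built around.
BUILT_IN_LABELS = {"http://rdf.data-vocabulary.org/name": "Name"}


def label_for(property_uri, fetch_ontology):
    """Return (label, source): built-in knowledge, then the ontology, then the URI itself."""
    if property_uri in BUILT_IN_LABELS:
        return BUILT_IN_LABELS[property_uri], "built-in"
    # Level 2: follow your nose. fetch_ontology is a hypothetical callable that
    # returns a set of (subject, predicate, object) triples, or None on failure.
    triples = fetch_ontology(property_uri) or set()
    for s, p, o in triples:
        if s == property_uri and p == RDFS_LABEL:
            return o, "ontology"
    # Level 3: default view derived purely from the URI.
    local_name = property_uri.rstrip("/#").rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return local_name, "uri"


AUTHOR = "http://amazon.example/vocab#author"


def fake_fetch(uri):
    """Stand-in for dereferencing the property URI."""
    return {(AUTHOR, RDFS_LABEL, "author")} if uri == AUTHOR else None


results = [
    label_for(uri, fake_fetch)
    for uri in ("http://rdf.data-vocabulary.org/name", AUTHOR, "http://example.org/terms/price")
]
print(results)
```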

From what I can tell, Google are only doing the third of these options, which I view as a missed opportunity. But that’s only because Google is the ultimate generic application. In fact, I think it’s fair to say that most applications built on RDF are built around particular vocabularies.

If scripts aren’t available, it can break a page completely. The difference is that a page author can serve a script from whichever URI (s)he chooses and isn’t limited to relying on a constant third-party URI. Hence, the author can make the script as reliable as the HTML document as far as the HTTP server availability goes.

Point taken. In my experience, the more popular/common ontologies tend to reside in places which are stable (eg purl.org or w3.org) and many other ontologies reside on the same web server as the pages that use them, but even so there are going to be times when an ontology moves or a new version is developed. Keeping pages and ontologies up to date is going to be a maintenance job, just like fixing broken links in websites is now.

I think we should develop solutions that work for both search engines and browsers.

OK. So do you foresee a future where millions of instances of browsers incorporate generic interfaces for browsing RDF and therefore hit particular URIs so much that they’re swamped and those servers go down? And the browsers are so reliant on ontology-based processing that the users cannot use them effectively any longer? Is that the scenario you’re worried about?

Re: Google's RDFa Support

I can’t reply to Henri Sivonen in 140 characters.

Thank you for replying in more characters.

<span property="title">Engineer</span>

Amusingly, Google’s documentation is lacking the prefix there.

you’ll see that the CURIE v:name resolves to the URI http://rdf.data-vocabulary.org/name. And that resolves to… well, take a look. The fact that you can’t resolve a property URI to a definition within a schema or ontology really undermines the idea of self-describing documents.

OTOH, the idea that consumers dereference constant URIs is a recipe for DDoS attack. It is a problem even with URIs one could change but that people often don’t change.

The problem is not so much what has been done as what hasn’t. I’m not completely surprised that there’s no support for FOAF — the model is sufficiently different from that of hCard to make it harder to use — but not reusing the existing RDF ontology for vCard smacks of either ignorance or arrogance.

Isn’t this kind of gratuitous babelization what unilateral extensibility should be expected to lead to? And isn’t RDF even designed for it with the assumption that it is OK, because you can simply http://www.w3.org/2002/07/owl#sameAs things?

The ability of anyone to mint their own RDF vocabularies is advocated as a feature over centralized community vetting such as the Microformats Process or the WHATWG/HTML WG centralized minting of HTML elements, but when Google and Yahoo! go ahead and unilaterally mint their vocabularies people see a problem. Hmm.

But more than these, the reason that I am disappointed with what’s been shown so far is that Google is really missing the point of using RDF: its extensibility. It’s easy enough to parse a set of microformats; to build something useful around a small number of known vocabularies. It’s easy, but it’s limited. When I heard the buzz of “Google supporting RDFa” I expected it to support not just the syntax of RDFa, but its extensibility, because otherwise what’s the point?

What kind of search result UI do you envisage for vocabularies that Google’s engineers have not hand-coded custom UI for?

P.S. My human success rate with your recaptcha is very low.

Re: Google's RDFa Support

Just to answer your final question (since the others have been answered by other people):

What kind of search result UI do you envisage for vocabularies that Google’s engineers have not hand-coded custom UI for?

When displaying individual search results, the plain and simple thing to do would be to have “label: value” displayed for some of the fields. I imagine that Google could be smarter than that. For example:

  • values could be displayed in a datatype-specific way (eg through the localisation of dates)
  • properties that were sub-properties of known properties, and classes that were sub-classes of known classes could be displayed in the hand-coded custom UI for the known property/class
  • low-level natural language processing could enable Google to display the value in an appropriate way, such as by putting a dollar symbol before a decimal property whose description was “price in dollars”
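
The first of these, datatype-specific display on top of a plain “label: value” line, might look something like this Python sketch. The render_field function and the sample values are hypothetical, and a single fixed English date format stands in for real localisation.

```python
from datetime import date
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def render_field(label, lexical_value, datatype=None):
    """Render one 'label: value' line, formatting the value by its XSD datatype."""
    if datatype == XSD + "date":
        d = date.fromisoformat(lexical_value)
        value = f"{d.day} {MONTHS[d.month - 1]} {d.year}"
    elif datatype == XSD + "decimal":
        value = f"{Decimal(lexical_value):,.2f}"
    else:
        value = lexical_value  # unknown or missing datatype: show the value as written
    return f"{label}: {value}"


lines = [
    render_field("published", "2009-05-12", XSD + "date"),
    render_field("price", "1234.5", XSD + "decimal"),
    render_field("author", "Jane Doe"),
]
print("\n".join(lines))
```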

And Google could explicitly provide mechanisms (ie properties within the RDF schema or OWL ontology) that would enable vocabulary developers to provide more control over the display of their metadata. For example:

  • indicating priorities about which properties should be displayed, and in what order
  • indicating which widget should be used to display a particular property
  • providing a template for displaying properties without explicit labels

and so on.

When displaying a bunch of search results with similar metadata, I’d imagine them having a side bar that enabled you to filter your search results by focusing on particular property values. The label for the property would come into play again, and I’d expect them to use the datatype of the property to determine the way in which a value could be chosen. (For example, having value ranges for numeric properties, calendars for dates, search boxes for free text.)

Re: Google's RDFa Support

The ability of anyone to mint their own RDF vocabularies is advocated as a feature over centralized community vetting such as the Microformats Process or the WHATWG/HTML WG centralized minting of HTML elements, but when Google and Yahoo! go ahead and unilaterally mint their vocabularies people see a problem.

I’d say as a general rule, it makes sense to invent new vocabularies when existing ones won’t do. That is emphatically not the case here. There is neither a technical nor a social justification for it in this case.

Suffice to say I won’t be changing the data I publish just to work around Google’s ignorance/arrogance here. So Google, let’s talk about how you can get value out of that data anyway.

Re: Google's RDFa Support

OTOH, the idea that consumers dereference constant URIs is a recipe for DDoS attack.

So the web is a recipe for a DDoS attack?

Isn’t this kind of gratuitous babelization what unilateral extensibility should be expected to lead to?

HTTP permits unrestricted use of mime types, but there are clear pressures towards a small number of formats on the web, rather than babelization.

And isn’t RDF even designed for it with the assumption that it is OK, because you can simply http://www.w3.org/2002/07/owl#sameAs things?

sameAs, samePropertyAs and sameClassAs are OWL things, not RDF. Requiring OWL processing is a bit heavyweight when compared with getting things right (which isn’t hard) in the first place.

Re: Google's RDFa Support

“They haven’t gone and invented yet another syntax for encoding semantic information within HTML. (Well, at least this arm hasn’t.)”

choke gurgle