Schema.org and the Responsibility of Monopoly

Update: This post has been translated to Italian on the Linked Open Data Italia blog.

In this post about schema.org I’m going to speculate about the economic drivers that affect how search engines use structured metadata on the web. I discuss how the technical features and choices within schema.org may cause wider long-term harm, and the role of open standards as a method for responsible companies to avoid the pitfalls of monopoly.

Before I launch into this, two things. The first is the standard disclaimer that I am speaking purely for myself. The second is that I recommend that you read Rufus Pollock’s paper Is Google the Next Microsoft? Competition, Welfare and Regulation in Internet Search. In it, he demonstrates how the search engine market will naturally tend to monopoly, and that because of the economic drivers in the search engine market, those monopolies will generally under-perform in terms of social good. In other words, if you are a search engine monopolist, you have to take positive steps to not be evil because all the market drivers force you in that direction.

Clearly schema.org is a significant move by our current search engine monopolist, Google, on several fronts and while I don’t pretend to have any particular insight, it’s fun to speculate about how schema.org fits with their wider goals, the extent to which they are avoiding monopolist traps, and what it might mean for the web in general.

Search engines serve their customers: advertisers. So why the interest in structured metadata? Structured metadata benefits search engines in at least three ways:

  1. presenting richer information increases the utility of the search engine for users, thus attracting more of them (more users => more attention overall => more money from advertisers)
  2. presenting richer information keeps users on the site for longer because search engines can present relevant information directly rather than users navigating away from the search engine’s site (more time on the site => more attention from individual users => more money from advertisers)
  3. analysing social metadata extracted from web pages, such as social graphs and individual interests can aid the targeting of adverts to particular users (more targeted adverts => more effective adverts => more money from advertisers)

Clearly there’s a lot of potential for search engines in structured metadata. Their difficulty is in getting people to use it such that they don’t lie, don’t find it too much hassle, and don’t make too many mistakes, because that way lies metacrap.

So the drivers for search engines are towards making it as easy as it could possibly be for publishers to embed metadata in their pages. It is also in their interest to ensure that the information that they extract is based as much as possible on the visible content of the page as this reduces the opportunity for people to lie (or make honest mistakes) by providing one value in the metadata and another in the content of the page. And it is in their interest to correct for errors when publishers make them.

The trap is that blindly pursuing these interests can also lead to anti-competitive behaviour.

Raising Barriers to Entry

The Conformance section of the Data Model page says (my emphasis):

While we would like all the markup we get to follow the schema, in practice, we expect a lot of data that does not. We expect schema.org properties to be used with new types. We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string. In the spirit of “some data is better than none”, we will accept this markup and do the best we can.

Schema.org contains multiple examples of properties whose values should be interpreted as being of a particular type, such as dates, times, numbers, durations, and specialised micro-syntaxes such as for an EventVenue’s openingHours property or an Article’s interactionCount property which (from the examples, if not the text) expects a syntax like UserTweets:65. These seem clear enough.

However, looking in more detail at the examples, it seems that even putting aside the option of providing a string when the schema expects an item, there are a variety of ways of expressing values for properties in schema.org. There are examples where numbers contain commas or are preceded by currency signs. Distances are a number followed by a “unit of measurement” without any indication of what the acceptable units of measurement are. Fat content seems to follow some kind of syntax that includes a number and a measure but various other text as well. Even when values have to adhere to a particular microsyntax, there are examples that are non-standard (such as initial ‘P’s missing from durations).
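
To make the duration case concrete, here is a minimal sketch of the difference between a strict reader and a generous one. It covers only the time-of-day subset of ISO 8601 durations (hours, minutes, seconds), and the function name `parse_duration` and its `lenient` flag are hypothetical, invented for illustration. Nothing here is taken from schema.org or from any search engine's actual code.

```python
import re
from datetime import timedelta

# Strict: the time-only subset of ISO 8601 durations, e.g. "PT1H30M".
STRICT = re.compile(r"^PT(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
# Lenient (an assumption for illustration): also tolerates a missing "P".
LENIENT = re.compile(r"^P?T(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value, lenient=False):
    """Return a timedelta, or None if the value is not accepted."""
    pattern = LENIENT if lenient else STRICT
    match = pattern.match(value.strip())
    if match is None:
        return None
    # Components that are absent from the string count as zero.
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


well_formed = parse_duration("PT1H30M")
rejected = parse_duration("T1H30M")
tolerated = parse_duration("T1H30M", lenient=True)
```

A strict consumer drops the second value entirely, while a generous one recovers an hour and a half from it. Which of the two a publisher is actually talking to is exactly what the documentation does not say.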

In other words, there is no documentation about the way in which the values of the schema.org properties will be interpreted by search engines and there is a clear intention on the part of the search engines behind schema.org to be generous in what they accept, so as to ensure that publishers can be lazy while search engines maximise the amount of data that they can understand on the web. Lacking a specification that describes how values are interpreted, the only way for publishers, validators and tool developers to work it out will be to try it out, see what happens, and attempt to find patterns that are generally interpreted in the same way by at least the major search engines, or more likely (because why bother with anyone else), try to work out what Google is going to do with it.
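
The cost of leaving interpretation unspecified can be sketched with two equally plausible readers of the same value. Both functions below are hypothetical consumers invented for illustration: one assumes commas are thousands separators, the other assumes the decimal-comma convention used in much of Europe. Neither is a description of how any real search engine behaves.

```python
from decimal import Decimal

CURRENCY_SIGNS = "$£€ "


def read_as_thousands(text):
    """Hypothetical consumer A: commas are thousands separators."""
    digits = text.lstrip(CURRENCY_SIGNS).replace(",", "")
    return Decimal(digits)


def read_as_decimal_comma(text):
    """Hypothetical consumer B: the comma is the decimal mark."""
    digits = text.lstrip(CURRENCY_SIGNS).replace(".", "").replace(",", ".")
    return Decimal(digits)


price = "$1,250"
reading_a = read_as_thousands(price)
reading_b = read_as_decimal_comma(price)
```

Given the same markup, the two readers disagree by three orders of magnitude, and a publisher has no way to know which reading is intended short of testing against each consumer.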

We have been here before, with HTML, pre-WHATWG. Then, IE, which dominated the browser market, had the clear intention to be generous in what it accepted, and there was no specification that described the various error handling quirks that had to be reproduced in bug-for-bug compatible user agents. WHATWG have had to work extremely hard to reverse engineer a specification that provides some kind of predictability and consistency for publishers as well as making it possible for new entrants to the browser market (such as Google’s Chrome), validators, and other tools to reproduce the behaviour of existing browsers. This work has paid off: over the past few years, browser market share has diversified somewhat, largely due to the rise of mobile browsers and Chrome taking market share from IE.

With structured metadata, Google is in an extremely dominant position. Concretely, it will be very hard for Google to reveal the methods by which they extract meaningful metadata from the huge variety of textual content on the web: they may have patents that cover some aspects, and in other cases (particularly when that interpretation depends on the analysis of their vast caches of web pages, as in the case of natural language translation) the behaviour simply might not be replicable by any third party.

None of this, by the way, would be helped by using a different syntax to express the data within the page. The only way it could be addressed is by much more clarity, detail, and conformance criteria within the schema.org vocabulary specification.

Without that specificity, we get into a world where Bing, Facebook and any other search engines will spend a lot of time and effort trying to reverse engineer Google behaviour to extract the same data as they do. They might even sometimes manage to introduce useful quirks of interpretation of their own, but that’s unlikely given that their constrained engineering effort will naturally be focused on matching Google. This also forms a massive barrier to entry (as if those weren’t already significant) to potential new search engines. Overall, the lack of specificity suppresses innovation in the market.

And of course publishers, writers and tool creators are left struggling to keep up.

Syntax Fixing

While both Google and Yahoo! have previously used information described using microformats and RDFa to provide similar functionality, in schema.org they deprecate that support, both by using microdata throughout the examples and by explicitly saying:

If you have already done markup and it is already being used by Google, Microsoft, or Yahoo!, the markup format will continue to be supported. Changing to the new markup format could be helpful over time because you will be switching to a standard that is accepted across all three companies, but you don’t have to do it.

Whichever technology they choose, the act of search engine monopolies making that choice and the consequent widespread adoption via SEO creates a large barrier to changes to the technology. Even if the specification for the technology changes, those changes will be likely to be ignored in practice as Google (and hence other search engines) seek to retain backwards compatibility with the examples and guidance published on schema.org as they stand now.

It is particularly damaging to have the choice be microdata because microdata is a relatively new technology that has only just reached W3C Last Call Working Draft. In my experience, Last Call is usually the first time that a wider community outside interested Working Groups starts to look at a technology seriously. To create better technologies and better specifications, Working Groups must be able to change in response to this review.

The ultimate result is again standardisation-by-implementation, which has long term adverse consequences in restricting competition (not between technologies, but between organisations using those technologies) and leads us to a situation where we could end up using something that is less than optimal for any kind of wider purpose outside the interests of the monopolist.

Standards Bodies

The development of schema.org might seem like a very minor thing, only of interest to people interested in SEO and structured metadata, but it is part of a bigger picture of the kinds of ripple-through effects the dominant players on the internet can have. It is almost impossible for monopolies not to do harm, not because anyone within them sets out to, but simply because they are so large that their behaviour is that much more important than anyone else’s.

The kinds of effects described above — ones that result in an overall sub-optimal outcome for society as a whole — are why society has competition laws that constrain monopolies and cartels. Sooner or later, just as it did with Microsoft, society applies the corrective force of regulation. There are already rumblings of this storm approaching.

It is also why we have neutral standards bodies, such as the W3C or the IETF, which provide a royalty-free patent policy as well as a defined process for developing specifications. These might seem tedious to comply with, and it might seem beneficial to companies to form a small cabal in order to get things done more quickly without having to seek wide consensus, but the bigger picture is that open standards developed within standards bodies protect companies from antitrust actions. Companies can point to royalty-free standards developed through a defined and fair process as proof of good behaviour that demonstrates their understanding of a wider responsibility to society as a whole.

As Winston Churchill might have said:

Many ways of developing standards have been tried and will be tried in this world of sin and woe. No one pretends that standards bodies are perfect or all-wise, and it has been said that developing standards within standards bodies is the worst possible way to do it except all those other ways that have been tried from time to time.

Objections to schema.org may seem to be sour grapes because they didn’t use a particular existing syntax or vocabulary, but look deeper and the issues schema.org raises are all about the responsibilities of monopolies and the role of open standards. The parallels with HTML, IE and Microsoft are striking; it will be interesting to see if this turns out the same way.

Comments

Re: Schema.org and the Responsibility of Monopoly

Thank you for the constructive feedback.

I do get your point about underspecified property values being bad. We would like to solve this problem without introducing order-of-magnitude delays in the whole process.

The current thinking is that the best way to deal with this is to create a substantial library of examples of correct/good practice. Ideally, we would like the example creation and curation process to be community driven. We are looking into the right mechanisms for doing this and hope to have an announcement in the next week or two.

Re: Schema.org and the Responsibility of Monopoly

That seems like an awful way to address the problem. It would be better if someone with access to your code described what it does in the kind of English that Hixie uses in the HTML spec and then that description was published on schema.org.

Re: Schema.org and the Responsibility of Monopoly

You had me there until you got to standards bodies. You propose the W3C and the IETF.

Let’s deal with the latter first. Anyone with experience in the IETF from the past decade knows that the IETF would actively reject a proposal that they be developer/keeper of such a spec. It would never happen.

As for the W3C: that seems like a silly suggestion, given the failures of the past few years. The processes for consensus in the W3C do not tend towards making good specs that involve semantics.

That leaves “create a new standards body”, and that becomes “who is going to pay for it, who will be allowed to participate, and who will lead the discussion and editing”. schema.org did all this without public exposure, but the end result is still a standards body, albeit with membership restrictions that we individuals bristle at. Really, the alternative would be “just have one company publish its rules”, which is what we had before this, and people hated that as well.

If there is some meaningful standards organization that can handle such semantic specifications that is about to appear, they could take over this work. To the best of my understanding, there is none.

Re: Schema.org and the Responsibility of Monopoly

Hi,

To be clear, I was using W3C and IETF as examples of standards bodies and didn’t mean to suggest that either would necessarily be the right place for a specification for a metadata vocabulary. On the other hand, the new Community or Business Groups within the W3C might be an option — they don’t require a consensus-based decision process, for example.

Jeni

On the neutrality of standards bodies

Great post. Thank you. I think the parallels between IE’s treatment of HTML and Google’s treatment of the undefined parts of the vocabulary layer are strong and should be a cause for concern.

Some nitpicking though:

neutral standards bodies, such as the W3C or the IETF

I think it’s an error to describe standards bodies as neutral. The W3C has a number of biases.

The first one is that the W3C seems to favor whatever gets written down first at the W3C even if it’s badly designed or just plain wrong. This biases the W3C against stuff that got written down first in a non-W3C venue, stuff that is better designed, stuff that reflects implementation practice better, etc. To say something positive, at least sometimes the W3C allows competing efforts within the W3C. At the IETF, it is super-hard to get even obvious spec bugs fixed if they’ve gone too far along on the Standards Track.

The second W3C bias is that it’s a pay-to-play forum and the W3C has accumulated so much staff that in order to pay the salaries of the staff, the W3C needs money all the time. Individuals and small businesses can’t buy their way in the way large businesses can. It also means that most companies who pay the W3C aren’t all that Web-focused. Non-browser vendors outnumber browser vendors and can vote the browser vendors off the island as happened in 2004. Non-search engines also outnumber search engines even if each of the search engines probably pays the maximum per-company fees. I’ve never been to an AC meeting, but it seems to me that for any given topic, there are going to be more people who are unfamiliar with the topic than people who are profoundly familiar with and invested in it, and the only legitimacy behind the AC reps’ votes is that their companies bought a seat at the table.

Then there’s the issue that stuff doesn’t get written down first (see the first bias) because it’s deemed good by relevant parties but because some people are close to the publication pipeline and get to publish more. Mark Birbeck was already part of the inner circle of the XHTML2 WG and got to propose RDFa right inside the W3C. Likewise, Hixie was already the editor of HTML5 and got to do Microdata that way. Tough luck for someone with a good idea who isn’t already on the inside.

And a fourth bias is that the W3C has special commitments to RDF and XML. This places an immense strategy tax on everything the W3C does.

There are probably other biases. My point is that standards bodies aren’t neutral. At all.

P.S. It would be really nice to be able to preview the post without a captcha.

Re: Schema.org and the Responsibility of Monopoly

Hi Jeni,

FWIW I found the presentations by Google representatives at Semtech both reasonable and well-reasoned even in the face of some unreason from the linked data crowd.

From Google/Microsoft/Yahoo’s point of view the overriding, overwhelming criterion was simplicity for web masters. The message was that much more web content than you might think is seriously broken and fixing it up enough to extract value from it is hard, so anything complex is out. This drove all the design choices. They didn’t want multiple markups because then web masters will mix them up in non-fixable ways. They didn’t want namespace complexity and multiple vocabularies for the same reason. They felt MicroData was the simplest, a balance between flexibility and too much rope - they acknowledged that the choice was certainly arguable but they had to make a choice and did.

Their aim was primarily to “get something out the door” which at least Yahoo, Microsoft and Google could agree on. Forging that agreement was hard and nearly didn’t happen. Trying to get broader agreement would probably not have terminated at all. It certainly came across as a genuine attempt to get something that would at least work across the “big 3”. I’m not sure the three-way agreement really changes your argument about monopoly but it did appear to be a genuine, hard won agreement across the parties.

Given that premise I can at least see where they are coming from.

Yes, the value ranges are very open so that as a means to publishing or exchanging machine processable data it is very poor. But that’s not the aim. Currently the search engines have to parse all this out of unstructured text. Having at least semi-structured markup means they can do a better job. Allowing some element of human readable values in fields eases publication and I think they might argue that extracting sufficient semantics out of that semi-constrained text to deliver on their use cases is easier than trying to recover from the mess of broken markup that would result from attempting to dictate stronger structure and semantics.

I think the proof of this will actually be in the next few months. There were a lot of sensible words said about the intention to engage multiple communities in improving and iterating schema.org. They said that they could understand the upset about doing the initial schema design out of the public eye, but could see no other viable option. The combination of the fact that it is cross-party and not just Google, and the openness expressed during the Semtech discussions, gives me some hope.

This is not semantic web or linked data though, nor need it be.

Dave

Re: Schema.org and the Responsibility of Monopoly

Thanks Jeni for this provocative post, and Dave for the “reasonable” first response! ;)

I see Dave’s point but I think it discounts the (massive) transformational effect any agreement on a “standard” by The Big Three (tm) will have. Regardless of whether they were to adopt microdata or RDFa, content producers who wish to be found will inevitably migrate to whatever model emerges; SEO consultants will continue to push their clients toward conformance with whatever the latest accepted “standard” turns out to be, and/or whatever causes the best placement in search results.

The Big Three could have raised the bar and used this as a teachable moment. Dave’s comment does imply they actually did do some “teaching”; part of their reasoning appears to be that, from their perspective, the web is much more of an untamed mess than the structured data community understands. My point is, perhaps they could have set a higher standard, one that would have caused more pain to adopters but would have been accompanied by earnest commitments to educate the community. The fact is, they’ll end up doing this anyway…

Dave highlights the challenges The Big Three apparently had reaching this agreement and how it almost didn’t happen. No doubt the challenges they faced were rooted in their fundamental assumptions and the goals for transforming search that each brought to the table. Radical transformation requires fighting for disruptive changes, including fighting with your peers on working groups and with the community for the adoption of those changes.

If the participants don’t enter into the discussions ready to face the battles that lie ahead, then precious little transformation can be expected.

Re: Schema.org and the Responsibility of Monopoly

Hi Dave,

Yes. I understand both the rationale of needing to satisfy the ‘dumb publisher’ and that getting agreement between Google and Microsoft (and Yahoo!) is an achievement. I can absolutely see why the developers behind schema.org are delighted with what they have managed to do; it is a significant technical and political accomplishment.

However, when you are in Google’s position, you need to think really carefully as a wider organisation about the larger scale impact that your actions have. It’s similar to how children who are naturally bigger than their peers have to be taught to be extra specially gentle: they don’t know their own strength and can easily hurt their friends without really meaning to.

I’m suspicious of phrases like “they had to make a choice” and “get something out the door”. These phrases instill a sense of urgency until you stop and wonder what would have happened if they hadn’t. I genuinely don’t know; perhaps they would have lost a huge amount of (potential) revenue (to Facebook, I assume) by not introducing schema.org right now. My argument is for a sense of social responsibility that is weighed against this potential loss. Indeed, the point here is that when you are a monopoly, doing the Right Thing almost always means swallowing a loss.

When it comes to processing values, I can quite understand the motivations to extract as much data as possible. I am sure that on the developer side these motives are purely about providing the best value to users. But these are the same arguments that led to the acceptance of completely broken HTML by IE6, and I believe that in the long term accepting as much garbage as possible causes damage to the web.

I have another post in mind to talk about the substantial difference in requirements on the exchange of data between dumb publishers and search engines compared to that between knowledgeable publisher and reusers…

Jeni

Re: Schema.org and the Responsibility of Monopoly

Hello,

(Coming as a small business e-commerce wanting to “look our best” to the search engines.)

Given that search/click rarely happens after page one of search results, and normally there are thousands or millions of search results, even empirically we can see that content is a deep mess—most sites must be complete train wrecks.

The data suggest that what we could call conscious or intentional page structuring is done on hardly any pages at all, and those pages dominate results. We fight our war for page one against maybe 30-100 pages out of millions.

Microdata in any standard is expensive to render in HTML, normally out of reach for all but the largest businesses and government agencies. From where I sit, one standard (schema.org) which the three major engines publicly understand, even partially, beats two that they may or may not.

Now that the engines support it, I’ll build and publish to schema.org. The big internal work is piping the data from catalog to schema.org tag. Once done, as schema.org matures, responding becomes maintenance.

But notice this huge change: Our business now routinely manages its metadata consciously and intentionally. Or put another way, our business metadata, driven by schema.org compliance, becomes day-to-day business management.

If we truly want a real and day-to-day useful semantic web, don’t we need a critical mass of publishers? Won’t they be more likely to encode to a standard that a critical mass of search engines actually use?

Microsoft (a rather competitive organization) is less than 30% of search, even with Yahoo added (Comscore Nov 2011). Isn’t it possible that, with common semantics, the two companies will use the microdata to compete to win viewers?

Schema.org may mean many things, but it for sure means that a lot more publishers will build semantic content management into their business.

So which is better: widespread but incomplete semantic standards we don’t like very much or really good ones that hardly anyone uses? Do we want semantic clarity put there by the publisher or guessed at by the engines?

The search engines have to follow the content—it’s all they get. When we build semantic publishers, we build semantic content, and we build the semantic web.

Thanks, Todd

Re: Schema.org and the Responsibility of Monopoly

Hi Jeni,

Yes, I do take your point that “with power comes responsibility”.

One thing to bear in mind is how much of such activity is the result of individuals and small groups just trying to make useful things happen, rather than necessarily deliberative corporate consciousness. I’ve no insight into Google and it may all be a top-down strategic plan, but having worked at [another monster enterprise] I’m quite aware of how the outside world can sometimes over-interpret the emergent behaviour of all the individual groups scurrying round as best they can :)

Dave

Re: Schema.org and the Responsibility of Monopoly

Dave,

I agree, and I honestly don’t think that this is a deliberate strategy on Google’s part. Rather, if you are in the position of a monopoly and you don’t want to do evil, then you (as an organisation) need to be aware of your emergent properties and do your utmost to create a corporate culture in which the individuals and small groups that are part of you think about the impact of their decisions and actions on a wider community.

Jeni

Re: Schema.org and the Responsibility of Monopoly

I don’t disagree re. monopolies; in fact, it’s hard not to see a conspiracy given Google’s position regarding HTML5. Where on earth did this microdata stuff come from? There is an obvious snub of RDFa - already accepted by many of these machines. Microformats have also had their nose put out of joint. It would be fantasy to imagine this was a choice made on technical grounds.

But while Dave suggests “the proof of this will actually be in the next few months” (and he may well be right), I personally think the proof will be over a longer term. Despite their domination of the search market, and despite probably having the world’s greatest concentration of PhDs, Google is now a big stupid lumbering dinosaur of a beast, not unlike the old Microsoft. No matter what power they may wield, in the Web environment the rest of the world will always be more powerful.

So they encourage structured Web data, on their own terms. Ha, using their own terms. But in doing so (I hope and suspect) they’re opening a Pandora’s box. Yes, in the short term they may be able to make big wins in search and the advertising space, but the same data that gets published to gain some SEO points is also perfectly good data for everyone else.

I’d be quite happy for them to succeed, in the same way AOL succeeded in making their own version of the Web. AOL expanded the user base of the Web significantly. The approach isn’t Webby, it’s fragile. So when they’ve crashed and burnt, the rest of the planet is left with a slightly improved system.

I do actually hope schema.org gains some traction. On the technical side it’s full of holes, but we know how to fill those holes. I was tickled by one of the first messages on the schema.org list: “Is Battle an Event?”. There are some things centralised schema don’t cut the mustard on.

I was always annoyed by Sowa’s thing: http://www.jfsowa.com/computer/standard.htm

But maybe it’s time just to go with the flow.