How can you control how data gets used?

Oct 5, 2019

Democracy Club are asking for advice on some changes they’re considering making to their API’s terms and conditions. They’re considering changes for two reasons: to enable them to track impact and to give them the right to withdraw access from those they believe are damaging democracy or Democracy Club’s reputation. Here’s my reckons.

Service T&Cs vs data licences

When you are making decisions about terms and conditions around data provided through an API, you actually have to worry about two things:

  • restrictions on who gets to use the API, or how they can use it

  • restrictions on what people who get access to the data can do with it

It’s important to think of these separately as it makes you more realistic about what you can and can’t monitor and control and therefore more likely to put useful mechanisms in place to achieve your goals.

To help with thinking separately about these two things, don’t think about the simple case of someone building an app directly on your API. Instead imagine a scenario where you have an intermediary who uses the API to access data from you, augments the data with additional information, and then provides access to that augmented data to a number of third parties who build applications with it.

If you think your data is useful and you want it to be used, this should be something you want to enable to happen. Combining datasets helps generate more valuable insights. Organisations that provide value added services might make money and create jobs and bring economic benefits, or at least give people something to hack on enjoyably over a weekend.

In this scenario, you have two technical/legal levers you can use to restrict what happens with the data:

  1. You can control access to the API. This is technically really easy, using API keys. And you can write into your T&Cs the conditions under which you will withdraw access privileges. The trouble is that when there are intermediaries, you cannot, on your own, selectively control access by the end users that the intermediaries serve. The intermediary won’t access your API for each request they receive: they will batch requests and cache responses and so on in order to reduce their load and reliance on you. So if there is a single bad actor on the other side of an intermediary, and API keys are your only lever, you will be faced with the decision of cutting off all the other good actors or tackling the bad actor through some other mechanism.

  2. You can put conditions in the data licence. You can of course put any conditions you like into a licence. But there are three problems with doing so. First, restrictions within a licence can mean that data cannot be combined with other data that people might feasibly want to combine it with. In particular, data that isn’t available under an open licence can’t be combined with data available under a share-alike licence, such as that from OpenStreetMap. Second, if everyone does this, intermediaries who combine data from lots of sources with different licences end up with all sorts of weird combination licences so you get not only licence proliferation but ever expanding complexity within those licences, which makes things complex for data users who are further downstream. Third, you have to be able to detect and enforce breaches. Detection is hard. Enforcement is costly - it’s a legal battle rather than a technical switch.
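Of the two levers, the first - key-based access - really is a technical switch. Here's a minimal sketch (all key names and the handler are hypothetical, not a real system) of how per-key revocation works, and why it's a blunt instrument once an intermediary's single key serves many end users:

```python
# Hypothetical sketch of API-key gating: access can be withdrawn per key,
# but an intermediary's one key covers *all* of its downstream users at once.

KNOWN_KEYS = {"key-app-1", "key-intermediary-1"}
REVOKED: set[str] = set()


def handle_request(api_key: str) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for a request using this key."""
    if api_key not in KNOWN_KEYS:
        return 401, "unknown API key"
    if api_key in REVOKED:
        return 403, "access withdrawn under our terms"
    return 200, "data payload"


# Revoking is a one-line technical switch...
REVOKED.add("key-intermediary-1")
# ...but it cuts off every end user behind that intermediary, good and bad alike.
```

Enforcing a licence condition, by contrast, has no equivalent switch: it's detection plus a legal process.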

The viral complexity and practical unenforceability of restrictive data licences is why I would always recommend simply using an open licence for the data itself - make it open data. You can still have terms and conditions on the API while using an open licence for the data, but you need to recognise their limitations too.

So with that scenario in mind, let’s consider the two goals of understanding impact and preventing bad uses.

Understanding impact

Here you have three choices:

  1. Track every use of your data. You will need to gather information about every direct user of your API, but you will also need a mechanism for intermediaries to report back to you about any users they have. You will need to have a way of enforcing this tracking, so that intermediaries don’t lie to you about how many people are using their services. This requirement will also restrict how the intermediaries themselves work: they couldn’t, for example, just publish the augmented data they’ve created under an open licence for anyone to download; they’d have to use an API or other mechanism to track uses as well.

  2. Track only direct uses of your data. This is easy enough with a sign-up form of course, when you give out API keys. But be aware that some people will obfuscate who they are because they’re contrary sods who simply don’t see why you need to know so much about them. How much do you care? What rigour are you going to put around identity checks?

  3. Track only the uses people are kind enough to tell you about. Use optional sign-up forms. Send public and private requests for information about how people are using your data and how it’s been useful. Get trained evaluators to do more rigorous studies every now and then.

Personally, I reckon all impact measures are guesses, and we can never get the entire picture of the impact of anything that creates changes that are remotely complex or systemic. Every barrier you put in place around people using data is a transaction cost that reduces usage. So personally, if your primary goal is for the data you steward to get used, I would keep that friction as low as possible - provide sign-up forms to supply API keys, but make the details optional - and be prepared to explain to whoever asks that when you describe your impact you can only provide part of the picture. They’re used to that. They’ll understand.

Withdrawing access from bad actors

When you provide infrastructure, some people who use that infrastructure are going to be doing bad stuff with it. Bank robbers escape by road. Terrorists call each other on the phone. Presidents tweet racist comments. What’s true for transport, communications and digital infrastructure is true for data infrastructure.

It is right to think about the bad things people could do and how to respond to these inevitable bad actors, so that you give yourselves the levers you need to act and communicate clearly the rules of the road. You also need to think through the consequences of the systems you put in place for good actors.

First question: how are you going to detect bad actors in the first place? Two options:

  1. Proactively check every use of the data you’re making available to see if you approve of it. This is prohibitively expensive and unwieldy for you, and introduces significant cost and friction for reusers, especially given some will be using it through intermediaries.

  2. Reactively respond to concerns that are raised with you. This means you will miss some bad uses (perhaps if they happen in places you and your community don’t usually see). It also means anyone who uses the data you provide will need to live with the risk that someone who disagrees with what they’re doing reports them as a bad actor. Sometimes that risk alone can reduce willingness to reuse data.

Second question: how will you decide whether someone is a good or bad actor? There are some behaviours that are easy to quantify and detect (like overusing an API). But there are other behaviours where “bad” is a moral judgement. These are by definition fuzzy, and the fuzzier they are, the more uncertainty there is for people reusing data about whether, at some point in the future, it might be decided that what they are doing is “bad” and the thing they have put time into developing be rendered useless. How do you give certainty to the people thinking about using the data you are providing? What if the people contributing to maintaining the data you’re stewarding disagree with your decision (one way or another)? When you make a judgement against someone, do they get to appeal? To whom?

Third question: what are you going to do about it? Some of the actions you think are bad won’t be ongoing - they might be standalone analyses based on the data you’re providing, for example. So withdrawing access to the API isn’t always going to be a consequence that matters for people. Does that matter? Do people who have had API access withdrawn ever get to access it again, if they clean up their act? How will you detect people who you ban through one intermediary potentially accessing it again through another, or those who have accessed the API directly using an intermediary to do so instead?

It is completely possible to put in place data governance structures, systems and processes that can detect, assess and take action against bad actors. Just as it’s possible to have traffic police, wiretaps and content moderation. But it needs to be designed in proportion to the risks, and with consideration of the costs to you and other users.

If we were talking about personal health records, I would be all for proactive ethical assessment of proposed uses, regular auditing of reusers and their systems, and having enough legal firepower to effectively enforce these terms and discourage breaches.

But with a collaboratively maintained register of public information, for me the systemic risks of additional friction and uncertainty that arise from introducing such a system, and the fact you can’t make it watertight anyway within the resources of a small non-profit, make me favour a different approach.

I would personally do the following:

  1. Make a very clear statement that guarantees you will not revoke access through the API except in very clearly defined, objectively verifiable circumstances, such as exceeding agreed rate limits. This is the policy you point to when people complain that you haven’t shut off access to someone they think is bad. Write up why you’ve adopted this policy. Indicate when you will review it, the circumstances in which it might change, and the notice period you’ll give of any changes. This is to give certainty to the people you want to use and build on the data.

  2. Institute an approvals scheme. Either give those approved a special badge or only let those who get your approval use your logo or your name in any advertising they do of their product. Publish a list of the uses that have been approved and why (they’re great case studies too - look, impact measurement!). Make it clear in your policy that the only signal that you approve of a use of the data you’re providing is this approvals process. It will take work to assess a product so charge for it (you can have a sliding scale for this). Have the approval only last a year and make them pay for renewal.

  3. Name and shame. When you see uses of the data you steward that you disagree with or think are bad, write about them. Point the finger at the people doing the bad thing and galvanise the community to put pressure on them to stop. Point out the use is not approved by you. Point out that these bad uses make you more likely to start placing harder restrictions on everyone’s use and access in the future.

I do not know whether anyone will go for an approvals scheme. It depends on how much being associated with you matters to them. It’s worth asking, to design it in a way that works for them and you.

And this approach will not protect you from all criticism or feeling bad about people doing bad things using the infrastructure you’ve built. But nothing will do that. Even if you design a stricter governance system, people will criticise you when they disagree with your moral judgements. They will criticise you for arbitrariness, unfairness, lack of appeal, lack of enforcement, not listening to them etc etc etc. Bad stuff will happen that you only work out months down the line and couldn’t have done anything about anyway. You’ll take action against someone who goes on to do something even worse as a consequence.

Life is suffering.

If you don’t take the approach outlined above, then do aim to:

  • Communicate clearly with potential reusers about what you will do, so they can assess the risks of depending on the data you’re making available.

  • Have a plan for how to deal with bad actors that access data through intermediaries.

  • Have a plan for how to deal with bad actors that perform one-off analyses.

And either way, do try to monitor and review what the impact of whatever measures you put in place actually is. Ask for feedback. Do user research. We’re all working this out as we go, and what works in one place won’t work in another, so be prepared to learn and adapt.

For more discussion / other opinion, see comments from Peter Wells and others on the original Democracy Club post and the obligatory Leigh Dodds blog posts.

(Being) The Elephant in the Room

Feb 10, 2019

In this post I’m going to write about the tensions I find myself struggling with around ODI’s role in the wider ecosystem of organisations working around data.

I want to preface this by saying this is absolutely not a “poor me” post. We at ODI and I personally have been immensely fortunate to have received backing from funders like the UK government and Omidyar Network (now Luminate), support from other organisations and the data community at large. I have played a senior role at ODI since it started and now as its CEO I completely recognise my personal responsibility in ODI’s shape and form and activities. Indeed that’s why I’m writing this post. I want to talk about an area where I don’t think we’re doing as well as I’d like us to but I’m struggling to find a way to do better given the other responsibilities I have to my team and the organisation.

For me, one measure of whether ODI is successful is whether other organisations, including businesses, in the ecosystem do well. This is something we share with other organisations that are trying to help grow ecosystems rather than themselves - such as the Catapults in the UK or the Open Contracting Partnership (which I’m on the Advisory Board of) - or aim to scale their impact through partnerships rather than size - such as Open Knowledge International or Privacy International. We have had three core values at ODI since its foundation: expert, enabling and fearless. Part of being enabling is helping others succeed.

Now, ODI is a big organisation relative to others working in this space. We have over 50 people in our team, not including associates and subcontractors. Our turnover is roughly £5m annually (though very little of that now is secure core/unrestricted funding). We invest in communications, so we make a lot of noise. We invest in public policy and we’re based in London, so we get to have good links into governments and attend the roundtables and launches and receptions that bring influence and opportunities. We also have some incredibly talented and well-respected experts in our team.

I think of this as being like a well-meaning elephant in a crowded room with lots of other animals. We’re all trying to break through a wall and there’s no way we’re going to do it alone. The elephant can cause some serious damage to the wall but it sometimes squashes small animals underfoot without meaning to, just because it doesn’t see them. It bumps into other animals in annoying and damaging ways as it manoeuvres. It lifts some animals onto its back where they can get a better angle on the wall for a while, but there’s only so much room and they keep falling off.

And then there’s the food. Most of it is placed really high up on shelves. The higher the shelves the more food there is on them. The elephant is tall and one of the only creatures that can reach the higher shelves. It’s well-meaning so it tries to share the food it gets around. Sometimes it forms a bridge with its body so other animals can get to higher shelves too. But it’s also hungry. It needs more food than the other animals just to survive and if it gets too weak it won’t be able to kick the wall any more, or reach the high shelves, or lift up any other animals.

How much should it eat? How should it choose which other animals to lift up or share food with? Should it be trying to grow bigger and taller so it can kick the wall harder and reach the higher shelves?

Analogies are fun. Zoom out from the elephant melodrama and the room is actually a small corner of a massive hangar of passive diplodocuses and brontosauruses who are able to reach even higher shelves and don’t care about the wall at all. Look at the adjoining paddock and there are animals who can feed from the ground (lucky things with their endowments). Look beyond and there be monsters - carnivores feeding on each other.

What I wrestle with is what the elephant should do, what ODI should do, what I should do in this situation. And of course there are no black and white answers here, just lots of ands. Eat enough to survive and share the food around. Work with people you work well with and choose partners fairly. Kick that wall hard yourself and help others kick it.

Some real examples:

  1. ODI’s big innovation grant from UK government comes with T&Cs that mean we get no overhead recovery (which we need to pay for things like desks, recruitment, financial administration, and reaching up to those high shelves to get more money) when we contract people or organisations outside ODI. In effect that means it costs us money to share that money, but we have also seen that “stimulus fund” approaches, and bringing in real expertise we lack in house, are much more effective at delivering high quality work and achieving impact than doing everything ourselves. So we have targets for external spend, and support the costs that don’t get covered “from elsewhere”.

  2. We developed the Data Ethics Canvas and made it openly available for others to pick up and use. Then we invested money “from elsewhere” into developing workshops and webinars around it and started selling those, and developing other advisory products that bring margin in (to be the “from elsewhere” money). Other organisations have done similarly, some through adopting the Canvas themselves (which is great, because what we really care about is knocking down that wall). But it means we could start competing with organisations we want to succeed.

  3. We’re putting together a bid for a new programme and we want to do it with a partner. An open procurement process isn’t appropriate because there’s no guarantee anything will come of it. But there are lots of potential partners who we could work with, who would each bring different perspectives and approaches - the point of involving them early is that it enables us to shape the programme to suit. We choose one but I know others could have been just as good and may be unhappy we chose someone else, and I couldn’t hand on heart say the choice was anything but arbitrary.

  4. This has happened so many times: we run an open procurement process for a piece of work and send the call to a number of different organisations, all of which are friends and allies we want to work with and all of whom we think could do the work. We score the resulting proposals and one wins because it’s closest to what we have in mind, which is determined by how we have shaped the project and the call. The organisations that don’t succeed are naturally disappointed, particularly when we’ve said we want to find ways to work with them (which we do), and ask why we even approached them for a bid, which they put unpaid time and effort into, if we weren’t going to choose them.

  5. Again a common pattern: an ally wants to take on a piece of work but isn’t constituted in the right way, isn’t big enough or secure enough or on the right government framework, so asks us to front it. They do the work, we take on administration, financial and reputational risk, but have limited control over the client relationship or work quality. If we insert ourselves into the project more, it feels like we’re exploiting others’ work - from their perspective we’re not really adding value. But if we stay hands off and something goes wrong, we are legally liable and our reputation could suffer.

I could go on. And again, I’m not moaning or saying any of this is in any way unfair on ODI. These are just the consequences of our position. I just wish I knew how to navigate them better, in a way that is fair both to the ODI team and organisation, and our friends and allies and partners; in a way that builds alliances rather than resentment, creates impact rather than conflict.

Answers on a postcard please.

My UKGovCamp 2019

Jan 20, 2019

I was at UKGovCamp yesterday for I think my 11th(?) year. Massive thanks to James, Amanda and all the rest of the campmakers for making it a brilliant day.

ODI were a sponsor and there were a bunch of us around. For the first time, I didn’t pitch myself. I was really glad that I encouraged others to instead. I only went to four sessions (rather than five). These are just some of my random thoughts following them (I’m not trying to represent everything that was said; I’ve linked to the notes from the sessions so you can read those if that’s what you want).

Data infrastructure

A session about the thing I spend my day job doing: working out how to build, or persuade others to build, a better data infrastructure.

  1. Infrastructure is boring. Despite the fact that government maintains so much of our physical infrastructure and understands how to invest in it, it doesn’t understand the link between the services, analysis, visualisations it wants and the data infrastructure that lies beneath. We need to motivate investment in data infrastructure through pointing at the more flashy, sexy, immediate stuff it enables (the websites, the apps). Think about the people who had to demonstrate why we need power lines or sewers or motorways. It’s not for their own sake, it’s to provide light, have flushable toilets, get around the country. We can’t just work on the data infrastructure level.

  2. The flashy and sexy stuff like AI enabled services funded through Govtech catalysts rely on data infrastructure. You can’t expect those efforts to succeed if the data isn’t there to support them. So you can’t just work at the service layer either.

  3. Building data infrastructure through delivering digital services is an art, a discipline, a cultural shift. Why doesn’t every digital service have an API? Why don’t service developers and designers think about all-of-government or even all-of-society needs as well as the immediate needs of their direct users?

  4. We didn’t talk in the session but I had more fundamental conversations around UKGovCamp about government’s attitude to data. There is a reversion from some quarters to the attitude of 10-15 years ago around how to get value from government’s data. If we Brexit, if the Reuse of Public Sector Information Regulations are repealed, there is a real risk of going back to the idea that government should sell access to data it holds. I’m worried.


Spreadsheets

I love spreadsheets and tabular data. Geek heaven. I spent the session occasionally pointing to things that already exist, like Datasette.

  1. We depend so much on spreadsheets for managing data, and they are both extremely well and extremely badly suited to the purposes we put them to, which are manifold. There are also many reinventions, with services like Airtable or Smartsheet, but they’re proprietary and come with risks about portability should the services fail.

  2. Sometimes spreadsheets are used to collaborate in ways where people really need to stick to a schema handed down from on high. Yet what people really like about spreadsheets (as opposed to databases) is the ease of adding columns to suit their needs - and too much flexibility breaks applications that have to ingest all the additional data but only care about some of it. It feels to me as if a transclusion mechanism - bringing data from one spreadsheet into another, such that edits to it are reflected in the original but you can also add columns that won’t be reflected back to the original - could be a way through this tension.

  3. It’s so powerful to be able to collaborate on the same data as others, as in Google Sheets, rather than passing Excel files back and forth by email. But not all data is shareable or designed to be shareable, and the ability to have space to add your own stuff without explanation is useful too.
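The transclusion idea in point 2 could work something like this - a sketch in plain Python rather than a real spreadsheet engine, with all the names and the data model invented for illustration: the merged view always reads the upstream columns fresh from the source sheet, then overlays locally added columns that are never written back.

```python
# Sketch of spreadsheet transclusion (hypothetical data model): upstream
# columns live in the source sheet; local columns are stored separately
# and joined on read.

source = {  # columns owned by the upstream spreadsheet, keyed by row id
    "row1": {"name": "Alice", "ward": "North"},
    "row2": {"name": "Bob", "ward": "South"},
}

local = {  # columns added in the transcluding sheet only
    "row1": {"contacted": True},
}


def view(row_id: str) -> dict:
    """Merged view of one row: upstream values plus local additions."""
    merged = dict(source[row_id])  # copy, so the overlay can't mutate source
    merged.update(local.get(row_id, {}))
    return merged


# An upstream edit shows up in the merged view...
source["row1"]["ward"] = "East"
# ...while the local column never leaks back into the source sheet.
```

The design choice doing the work here is that local columns are an overlay resolved at read time, so schema-enforcing applications can consume the source sheet untouched while individuals still get their extra columns.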

Rethinking government

I didn’t speak in this session. Although I have thoughts I have no settled Opinions and a niggling sense of unease.

  1. I tried to explain this session, and how I felt about this session, to my 15yo daughter. She’s been learning about feminism for her sociology GCSE and said that it reminded her of the characterisation of radical, liberal and Marxist feminism she’s been learning about. Digitalists probably all agree that the web (and all it entails) is changing society and government has to change too. But while some radical digitalists believe that requires a wholesale reinvention of how government works, I think there are some important pieces of our current system that we should preserve.

  2. I think trusted institutions are important. I think removing their identities and means of expressing their identities undermines them. It leaves us without things we can rely on. That’s scary and people who are scared are “not their best selves”, as the modern euphemism appears to be. Can we preserve institutions, grow people’s trust in them and reimagine the relationship between government and people?

  3. I am afraid of forgetting the large percentage (not a majority, but a large percentage) who through choice or circumstance do not have smartphones or broadband or a working knowledge of how to interact with digital anything. I am afraid of digitalism that facilitates an essentially inhuman and exclusive government. I am afraid of digital supremacy.

  4. It is ok for us to disagree about this. We should be disagreeing. In the session we talked about stories and visions and sci-fi for government. We should describe the futures we want to see, and we should critique the ones others describe, explain our fears, bang our drums. That’s how we’ll get there.

  5. But these are essentially political questions, questions about the role of the state, of communities, of individuals, of the private sector, of the media, of academia. And they highlight questions about the role of politicians and of civil servants, or even of policy and delivery if you like, and the relative power they wield, and the level of accountability and governance there is around them. Has digital changed that power balance? Should it? I know a lot of brilliant civil servants I would trust entirely but I’m not sure I want to live in a technocracy.

  6. I am all for discussing how to improve how government works, but how can we stop that conversation being dominated by white middle-class male Londoners, however wonderful, insightful, inspirational, well-meaning and right-thinking I might find them? I’m part of the problem here, massively privileged, London-centric. I want to hear other voices, outside the digital elite. Of course they won’t be at UKGovCamp. And I also recognise conversations have to start somewhere and gradually build coalitions. It’s just a concern that nags at me.

Open communication

This one’s more personal for me, but the session helped me reconcile some conflicts inside myself and move on my thinking about how I can encourage more openness at ODI.

  1. Communication is hard. So hard. The impact we intend to have, if we even think of our intent at all, is seldom the impact we do have. No one else is in the same context as you. Public communication is even harder, because the audience who sees what you write can be so varied. One-way or asynchronous communication is harder again because there’s no feedback until you’re done and posted that can help you adjust or explain or nuance. If you care at all about what people think or feel (and I care about that a lot, probably too much) every piece of communication is a risk.

  2. It is a risk worth taking, and it was good today to be reminded of that. I remember many years ago posting about a civil servant I’d just encountered who I thought (and said) was a little crazy. Turned out he read my blog. Him bringing it up was the first time I encountered the intersection of my work and my online tribe, and when I started being more circumspect and intentional about what I write publicly. But that slightly crazy civil servant was of course John Sheridan, who would become a fantastic colleague and one of my closest friends. There are ways to put things, and sometimes you have to deal with people you have upset intentionally or not, but I cannot think of a time I have regretted writing authentically and letting people see inside my head.

  3. But still it is hard for me to write candidly given where I am now. Because I am CEO at ODI, what I say can easily be taken as an institutional rather than personal position. When I communicate I am not just taking risks for myself but for my team and organisation. We are dependent on government money, and on preserving or building good relationships with funders. We and I must be seen to be strong, certain and confident to retain the confidence of our stakeholders. (Everything’s great by the way.)

  4. All this leaves me with is the recognition that anyone who cares about other people or about the organisation they work for will not post about everything, cannot be completely open. And that’s ok. At ODI we talk about data being as open as possible and the shades of grey between open and closed data. It was good to be reminded that as open as possible communication is better than nothing.

Reflections from the Canada/UK Colloquium on AI

Nov 28, 2018

Last week I was at the Canada-UK Colloquium on AI in Toronto. These are some things I learned and thoughts I had while there, in no particular order.

  1. On the role of “anchor firms”: Big tech firms help support a startup ecosystem by acting as a backstop for technologists, allowing them to take the risk of working for startups as they know they won’t be left completely high and dry if the startup fails. They also perform a useful role mapping academic approaches into the real world in the form of code, online services and so on that can be plugged together to build new applications quickly: no one else has the resources/capability/motivation to do this mapping. It’s interesting to think about the extent government should be doing this, or subsidising it, and the degree to which this mapping is done for data as well as code.

  2. On the role of third sector: We focus a lot when talking about AI and data on the role of the state, of business and of academia. But the third sector is important too. Consumer rights organisations have a role to play assessing and informing consumers about how services use data about them. Trade unions need to have a vision for how the demands on the workforce will change and workplaces and conditions should adapt. It was striking to me that through all the discussion of bodies supporting good governance of AI and data, the Ada Lovelace Institute was not mentioned.

  3. On the hype cycle: All the AI practitioners urged caution and were concerned about hyperbole in the media narrative about AI. They pointed out that deep learning and reinforcement learning are only suitable for particular tasks and that much of the AI vision we are being fed requires techniques that haven’t been invented yet. There’s a danger that when the current wave of AI (machine learning) fails to meet high expectations we will enter another AI winter of reduced funding for research that slows progress again.

  4. On what your phone can sense about you: Well-intentioned academics in Canada are prototyping applications to monitor levels of social anxiety, in a bid to provide better mental health care. (With permission) they can do things like work out what kind of places you go to, listen to your conversations, monitor movement, light, how much you touch your screen and so on. It felt creepy and invasive but got through the university ethics board. Not news, but to me it highlighted that these APIs and data were available to other Android apps, with the only check being the permissions dialog everyone clicks through. We probably don’t need to worry too much about well-intentioned academics with ethics approval: how do we find out about everyone else?

  5. On diversity: Canada has a strong commitment to increasing (particularly gender) diversity. There are warm words about diversity in the UK too. I have Opinions, highly influenced by Ellen Broad, that appear to be unusual:

    • Having a diverse team will not necessarily mean you avoid bias in your algorithms/products. Saying you need diversity to create products that work for everyone gives non-diverse teams an excuse for poor practices that they really shouldn’t be allowed to use. What about user research? What about empathy? It is impossible to represent everyone by having someone exactly like them within a team: we should focus on finding good ways of engaging with people outside development teams and hold those teams to a higher standard in using them.

    • We should be careful to quote local statistics, or statistics relevant to particular subfields, rather than make diversity out to be a general problem across technology. I also have a lingering concern that making a big deal about women being less prevalent in technology makes technology less attractive to women (no one likes to be in places where they’re a minority).

    • In contrast to software development, there are many women in the field of ethics and algorithmic accountability. Is ethics subtly being thought of as women’s work (emotional labour)? (In the UK, this is even spelled out in the names of our institutes: Alan Turing for computer science, Ada Lovelace for ethics.)

  6. On geopolitics: Canada and the UK have a lot in common. This may become even more true if Brexit goes ahead and Britain becomes a third country to Europe, with similar values but needing to prove data adequacy while having strong surveillance powers. France was the other ally most often mentioned by Canadian representatives. The sense was that despite its strong investment in AI research and work by CIFAR, Canada was behind on thinking about data and data governance; there were also hints that its information commissioner’s office was not as helpful (to businesses) as the UK’s. As is common in these fora, there was a lot of talk about China, and state-led AI, but a general feeling that we need to engage and create international norms around AI rather than enter into a race.

  7. On the stories we tell: Quite a lot of debunking went on in the room. There were requests never to treat or talk about Sophia as AI; never to use the trolley problem as if it had anything to do with the choices autonomous cars would make; not to believe Babylon’s figures about triage accuracy; not to spread the falsehood that a sexbot was manhandled at an Australian trade fair; not to mischaracterise how DeepMind Health use patient data in Streams. Even a room of “experts” needed to be corrected on occasion. It is good to challenge each other, the examples we repeat, and the evidence we quote.

  8. On data trusts: Everyone is interested in data trusts. More precisely, everyone is interested in how to get data shared more readily while preserving privacy. When people say “data trusts” they mean very different things; they project their own notion of what well governed data sharing might look like. I really hope our work at ODI, and the concrete pilots we’ll be taking forward over the next few months, help to make the notion more tangible and highlight other models for sharing.

  9. On regulation / government intervention: I find that whenever we start talking about how government should intervene around AI, we get sucked into a personal data ethics black hole. It is hard to see past what should or shouldn’t be done with personal data and into other issues such as public procurement, competition policy or worker rights. Particularly in the UK, where there’s already lots of activity around data & AI ethics, we should avoid the black hole by trying to create venues for discussions that don’t talk about personal data.

  10. On populism & fear of technology: We listened to a fascinating presentation (similar to this recording) about the correlation between populism and fear of technology. Recent displacement from work is more likely to arise from technology than immigration, but immigration is more likely to get blamed. The good news is that those who fear automation, and particularly populists who fear automation, are happy with any policy response, including positive ones like supporting retraining. The lesson is to have a vision.

  11. On the role of humans: Both humans and computers are biased and sometimes make poor decisions. (When people feel there’s too much emphasis on AI being good, they remind us of AI’s failings; when they feel there’s too much emphasis on it being bad, they remind us of human failings.) We are more concerned about the black boxes of silicon-based neural networks than we are about the ones in our heads, or perhaps in our organisations. I lazily insist that decisions are made by humans, informed by data, but that’s because my mental model is medical diagnosis or parole recommendations. In a battle, there’s no time for a system that detects and destroys incoming torpedoes to refer to a human. I have started to think that the same things are needed whatever the decision making entity: transparency, explanation, accountability (a means of recompense for harm and a correction for the future). The trap we need to avoid is thinking any system (human or machine) is faultless.

  12. More on the role of humans: Robots are common in automobile manufacturing, but customers are now demanding more customisation in their cars, which robots aren’t as good at providing. So there are new roles for humans, working with machines. They call them “cobots”. On the railroad, there are now “portals” that photograph every outwardly visible inch of railcars as they drive through, and detect faults in minutes that used to take hours of inspection. Railcar engineers can concentrate on maintenance rather than finding faults. The current crop of AI is good at dull operational tasks, leaving the more interesting work for people (but do some humans like doing dull things some of the time? I know I do).

  13. On intelligence: People are building more expressive bots, whether physical or virtual, that mimic human emotions through their appearance or behaviour. They are also getting better at reading emotion. At some point the mimicry gets so good we start reacting as if it’s real; that’s the point of the Turing test. On the other hand, knowing that you are talking to a machine rather than a human may be liberating: we learned about a chatbot designed to help people decide to stop smoking - one of its benefits was that people could talk to it without feeling judged. If a bot could fake care, would you prefer to tell a machine your woes?

Doesn't open data make data monopolies more powerful?

Jan 14, 2018

We live in a world where a few, mostly US-based companies hold huge amounts of data about us and about the world. Google and Facebook, and to a lesser extent Amazon and Apple, (GAFA) make money by providing services, including advertising services, that make excellent use of this data. They are big, rich, and powerful in both obvious and subtle ways. This makes people uncomfortable, and working out what to do about them and their impact on our society and economy has become one of the big questions of our age.

An argument has started to emerge against opening data, particularly government owned data, because of the power of these data monopolies. “If we make this data available with no restrictions,” the argument goes, “big tech will suck it up and become even more powerful. None of us want that.”

I want to dig into this line of argument, the elements of truth it contains, why the conclusion about not opening data is wrong, why the argument is actually being made, and look at better ways to address the issue.

More data disproportionately benefits big tech

It is true that big tech benefits, and benefits disproportionately to smaller organisations, from the greater availability of data.

Big tech have great capacity to work with data. They are geared to getting value from data: analysing it, drawing conclusions that help them grow and succeed, creating services that win them more customers. They have an advantage in both skills and scale when it comes to working with data.

Big tech have huge amounts of data that they can combine. Because of the network effects of linking and aggregating data together, the more data they have, the more useful that data becomes. They have an advantage because they have access to more data than other organisations.

Not opening data disproportionately damages smaller organisations

It is also true that small organisations suffer most from not opening data. Access to data enables organisations to experiment with ideas for innovative products and services. It helps them make better decisions, faster, which is particularly important for small organisations who need to make good choices about where to direct their energies or risk failure.

If data is sold instead of opened, big tech can buy it easily while smaller organisations are less able to afford to. Big tech have cash to spare, in house lawyers and negotiators, and savvy developers used to working with whatever copy or access protection is put around data. The friction that selling data access introduces is of minimal inconvenience to them. For small organisations, who lack these things, the friction is proportionately much greater. So on top of the disproportionate benefits big tech get from the data itself, they get an extra advantage from the barriers selling data puts in the way of smaller organisations.

If data isn’t made available to them (for example because they can’t negotiate acceptable licensing conditions or the price is too high), big tech have the money and user base that enable them to invest in creating their own versions. Small organisations simply cannot invest in data collection to anywhere near the same scale. The data that big tech collects is (at least initially) lower quality than official versions, but it usually improves as people use it and correct it. Unlike public authorities, big tech have low motivation to provide equal coverage for everyone, favouring more lucrative users.

An example is addresses in the UK. Google couldn’t get access to that data under licensing conditions they could accept, so they built their own address data for use in Google Maps. Professionals think it is less accurate than officially held records. It particularly suffers outside urban and tourist areas because fewer people live there and there’s less need for people to use Google’s services there, which means less data available for Google to use to correct it.

Using different terms & conditions for different organisations doesn’t help

“Ah,” I hear you say, “but we can use different terms & conditions for different kinds of organisations so smaller ones don’t bear the same costs.”

It is true that it is possible to construct licensing terms and differential charging schemes that make it free for smaller firms to access and use data and charge larger firms. You can have free developer licences; service levels that flex with the size of companies (whether in employees or turnover or terminals); non-commercial licences for researchers, not-for-profits and hobbyists.
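
A differential scheme like this can be sketched as a lookup from company size to licence terms. The tiers, thresholds and fees below are entirely hypothetical, chosen only to make the shape of such a scheme concrete:

```python
# Hypothetical tiering: thresholds, fees and licence names are
# illustrative only, not any real supplier's scheme.
TIERS = [
    # (max_employees, annual_fee_gbp, licence)
    (10, 0, "free developer licence"),
    (250, 5_000, "SME licence"),
    (None, 50_000, "enterprise licence"),  # unbounded top tier
]

def tier_for(employees: int):
    """Return the (fee, licence) tier a company of this size falls into."""
    for max_employees, fee, licence in TIERS:
        if max_employees is None or employees <= max_employees:
            return fee, licence

print(tier_for(3))     # a hobbyist or tiny startup
print(tier_for(120))   # a mid-sized firm
print(tier_for(9000))  # big tech
```

Even this toy version begs the questions that follow: which measure of size counts, who verifies it, and what happens at the boundaries.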

These are all possible, but they do not eliminate the problems.

First, the barrier for smaller organisations is not just cash but time and risk. Differential licensing and charging schemes are inevitably complex. Organisations have to work out whether they qualify for a particular tier and whether they are permitted to do what they want to do with the data. This takes time and often legal fees. The second question is often hard to answer because legal restrictions on particular data processing activities tend not to be black and white: they require interpretation and create uncertainty. This means organisations have to protect themselves against litigation arising from unintended non-compliance with the terms, which adds the cost of insurance. The more complex the scheme, the greater this friction.

Second, the clauses within a free licence always include one that prevents the organisation from undercutting the original supplier of the data by selling it on to large organisations. Necessarily, this will place restrictions on the services that an organisation offers and the business model they adopt. They might be unable, for example, to build an API that adds value by providing different cuts of data on demand, or if they do their price might be determined by additional fees from the original supplier. Licensing restrictions limit what kinds of organisations can benefit from the data, and their ability to make money. And, as above, uncertainty about the scope of the restrictions (and whether the originating organisation will ever actually act on them) introduces risk and cost.

Third, while these costs and barriers are bad enough with one set of data, add another from a different supplier with another set of conditions, and you have to make sure you meet both of them. Sometimes this will be impossible (for example combining OpenStreetMap data, available under a share-alike licence, with non-commercial data). Add a third or fourth and you’re dealing with a combinatorial explosion of T&C intersections to navigate.

In part, the problems with the differential pricing approach for data arise from the unique characteristics of data and the data economy:

  • it is endlessly manipulable, which makes it necessarily complex to list all the ways in which it can be used and to say which of those uses are allowed and which are not

  • the greatest opportunities for innovation and growth lie with infomediaries who slice and dice and add value to datasets; they need freedom to thrive

  • added value usually comes from the network effects of combining multiple datasets; but if there’s friction inherent in bringing datasets together, those same network effects will amplify that friction as well

It’s not surprising that people who are used to selling other kinds of things than data reach for “free licences for startups” as a solution to lower costs for smaller organisations. It seems an obvious thing to do. It might work for other kinds of products. It doesn’t work for data.

Opening data is better than not opening data

So far I’ve focused almost exclusively on the impacts of opening and not opening data on innovation and the ability of small businesses to thrive in comparison to big tech. I’ve discussed why selling or restricting access to and use of data favours big tech over and above the advantages they already receive from amassing more data.

If you like to think of playing fields, it’s true that opening data lifts big tech’s end of the pitch, but overall, it lifts the startup’s end more.

There are a few other considerations it’s worth quickly touching on.

Do we want big tech to use high quality data?

Earlier I wrote about how big tech makes its own data when it can’t get hold of official sources. They stitch together information from remote sensors, from what people upload, from explicit corrections, use clever machine learning techniques and come out with remarkably good reconstructions.

But “remarkably good” is not comprehensive. It is often skewed towards areas of high user demand, whether that’s cities rather than countryside, or the digitally included rather than the digitally excluded.

When big tech uses its own data rather than official data to provide services to citizens, it favours the enfranchised. It exacerbates societal inequalities.

It can also cost lives. I talked about Google’s address data and the doubts about its accuracy particularly outside towns and cities. Ambulances have started using it. When they are delayed because they go to the wrong place, people can die. Restricting access to address data forced Google to spend a bunch of money to recreate it, but who is actually suffering the consequences?

Not all services require the same level of detail in data. The impact of data errors is higher for some products than for others. But in general, we should want the products and services we use to be built on the highest quality, most reliable, most authoritative, timely, and comprehensive data infrastructure that we can provide. When we restrict access to that by not permitting companies with massive user bases amongst our citizenry to use that data, we damage ourselves.

What about big tech’s other advantages with data?

I’ve focused much of this on the advantage big tech enjoys in having access to data. As I touched on earlier, they also have an advantage in capability. If there’s a real desire to equalise smaller companies with big tech, they need support in growing their capability. This isn’t just about skills but also about tool availability and the ease of use of data.

Anything that helps people use data quickly and easily removes friction and gives a disproportionate advantage to organisations who aren’t able to just throw extra people at a problem. Make it available in standard formats and use standard identifiers. Create simple guides to help people understand how to use it. Provide open source tools and libraries to manipulate it. These are good things to do to increase the accessibility of data beyond simply opening it up.
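
A minimal sketch of what “standard formats and identifiers” buys a consumer, assuming a hypothetical CSV dataset keyed on ISO 3166-1 alpha-2 country codes: because both datasets share the standard identifier, joining them needs no bespoke matching code.

```python
import csv
import io

# Hypothetical open dataset published as plain CSV with a standard
# identifier column (ISO 3166-1 alpha-2 country codes) rather than
# free-text country names every consumer would have to clean up.
published = io.StringIO(
    "country_code,broadband_coverage\nGB,0.96\nCA,0.94\nFR,0.93\n"
)

# A second, equally hypothetical dataset keyed on the same identifier.
population_millions = {"GB": 67.0, "CA": 38.2, "FR": 67.8}

# The shared identifier does the work: no fuzzy matching, no cleanup.
joined = []
for row in csv.DictReader(published):
    code = row["country_code"]
    joined.append((code, float(row["broadband_coverage"]),
                   population_millions[code]))

print(joined)
```

For a small organisation, the difference between this three-line join and weeks of reconciling inconsistent names is exactly the friction the paragraph above describes.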

How do we make this benefit society as a whole?

I’ve also been focusing deliberately on the narrow question of how we level the playing field between small organisations and big tech. Of course it’s not the case that everything small organisations do is good and everything big tech does is evil. Making data more open and accessible doesn’t ensure that anyone builds what society as a whole needs, and may mean the creation of damaging tools as well as beneficial ones. There might even (whisper it) be issues that can’t be solved with tech or data.

That said, the charities, community groups, and social enterprises that are most likely to want to build products or produce insights with positive social impact are also likely to be small organisations with the same constraints as I’ve discussed above. We should aim to help them. We can also encourage people to use data for good through targeted stimulus funding towards applications that create social or environmental benefits, as we did in the Open Data Challenge Series that ODI ran with Nesta.

Making it fair

When you dig into why people actually cite increasing inequality between data businesses as a reason for not opening data, it usually comes down to it feeling unfair that large organisations don’t contribute towards the cost of its collection and maintenance. After all, they benefit from the data and can certainly afford to pay for it. In the case of government data, where the public is paying the upkeep costs, this can feel particularly unfair.

It is unfair. It is unfair in the same way that it’s unfair that big tech benefits from the education system that the PhDs they employ went through, the national health service that lowers their cost of employment, the clean air they breathe and the security they enjoy. These are all public goods that they benefit from. The best pattern we have found for getting them, and everyone else who enjoys those benefits, to pay for them is taxation.

Getting the right taxation regime so that big tech makes a fair contribution to public goods is a large, international issue. We can’t work around failures at that level by charging big tech for access to public data. Trying to do so would be cutting off our nose to spite our face.

What can be done from a data perspective, whether a data steward is in the public sector or not, is to try to lower the costs of collection and maintenance. Having mechanisms for other people and organisations to correct data themselves, or even just highlight areas that need updating by the professionals, can help to distribute the load. Opening data helps to motivate collaborative maintenance: the same data becomes a common platform for many organisations and individuals, all of whom also contribute to its upkeep, just like Wikipedia, Wikidata and OpenStreetMap. With government data, this requires government to shift its role towards being a platform provider; the Expert Participation Programme demonstrates how this can be done without compromising quality.

Big tech and data monopolies

I have focused on big tech as if all data monopolies are big tech. That isn’t the case. What makes a data monopoly a monopoly is not that it is big and powerful and has lots of users, it’s that it has a monopoly on the data it holds. These appear as much in the public sector as the private sector. Within the confines of the law, they get to either benefit exclusively or choose the conditions in which others can benefit from the data they hold.

Some of that data could benefit us as individuals, as communities and as societies. Rather than restricting what data of ours data monopolies can access, another way to level the playing field is to ensure that others can access the data they hold by making it as open as possible while protecting people’s privacy, commercial confidentiality and national security. Take the 1956 consent decree against the Bell System as inspiration. That decree forced Bell to license its patents royalty free. It increased innovation in the US over the long term, particularly innovation by startups.

There are various ways of making something similar happen around data. At the soft, encouraging end of the spectrum there are collaborative efforts such as OpenActive, or making positive noises whenever Uber Movement helps cities gain insights, or Facebook adds more data into OpenStreetMap or supports EveryPolitician. At the hard regulatory end of the spectrum, we see legislation: the new data portability right in GDPR; the rights given under the Digital Economy Act to the Office for National Statistics in the UK to access administrative data held by companies; the French Digital Republic Act’s definition of data of public interest; the Competition & Markets Authority Order on Open Banking.

We should be much more concerned about unlocking the huge value of data held by data monopolies for everyone to benefit from — building a strong, fair and sustainable data infrastructure — than about getting them to pay for access to public data.

Opening up authoritative, high quality data benefits smaller companies, communities, and citizens. There’s no doubt that it also benefits larger organisations. But attempts at ever more complex restrictions on who can use data are likely to be counterproductive. There are other ways of levelling these playing fields.