What to do about pro-bono data service offers

May 19, 2020

During the Covid-19 pandemic, many tech companies are offering pro-bono data services to public health organisations, governments, and communities. Their offers include free access to software and to people to help with managing data, analysing it, and building predictive models.

I don’t think it’s particularly worthwhile speculating about the internal motivations of these companies. I choose to believe they genuinely want to support the global effort to help societies and economies through this crisis, and are providing that help in the way they know how. Others believe it’s all about capturing markets, gaining access to personal data for other (nefarious) purposes, or developing their own intellectual property and capability. The truth probably lies somewhere between these positions.

But it doesn’t matter: any organisation considering accepting such help has to consider the same set of things regardless, to retain public trust, to safeguard its own future operations, and to manage the market effects it generates.

Before I dive into detail, just to say that any such arrangement needs to satisfy the basic hygiene factor of being a clear contractual relationship. Informal “partnerships” will leave you extremely exposed on many of the issues discussed here. And be aware that a pro-bono project is not free to you: even if you’re not paying them, you will be putting time, effort and resources into the project. You have to consider the project – and whether the help you’re being offered is the best way of achieving its goals – in the round.

Retaining public trust

Public trust is important in the best of times, but it’s particularly important in situations where you want the public to pay attention to what you say and do what you tell them (eg stay home, install an app, report their symptoms accurately).

Most companies large enough to offer you pro-bono data services will have a bad reputation around their use of data. This reputation might have arisen from actual security breaches, enforcement action by data protection regulators or previous dodgy deals. It might simply be that people are frightened because they know those companies already have huge amounts of data about them and they feel powerless. The bad reputation might be even more diffuse: about whether the company pays fair taxes or treats its workers well.

The point is you are not starting from a neutral position with the public: you’re starting from one in which the motivations of the companies offering support will be immediately questioned and treated with suspicion. The fact the services are offered for free makes this worse, not better: to the public and press this is a red flag about a hidden motive which probably involves little guys getting screwed over.

In making your decision about taking up an offer, you have to factor in the fact that countering this trust deficit will take time and effort on your part. This is a cost to weigh against the benefits of the services offered. The only way to counter the trust deficit, and protect your own reputation and the trust the public has in you as an institution, is clear, proactive, transparent communication and effective, representative, accountable governance – and even doing these isn’t guaranteed to work. You need an excellent comms team, proactively communicating about every aspect of the project. You need to draft in trusted external experts to oversee the work, and demonstrably listen and respond to their recommendations. All of this takes substantial effort and time. Don’t overlook or underestimate it.

You should operate under the assumption that every aspect of the deal you make will come to the surface eventually. The more you hide, and the more it feels like people have to fight to get hold of information about what you’re doing, the more time you will spend firefighting as they dig up dirt, and the more trust you will lose in the long term. Make sure you are completely satisfied that you can be open with the public about everything you’re doing. If you’re not comfortable doing that, it probably indicates that there’s an ethical problem somewhere in the deal that you need to resolve.

Deeper than these aspects of reputation management, though, are the genuine issues of exploitation of data about the public, and of the efficacy and trustworthiness of the service being provided. Things to question here include:

Is this service genuinely useful? Is it something you would procure off your own bat, because you really need it? Or are you being offered a tech solutionist stab in the dark that might come with huge opportunity costs?

Are there any security issues, outside the control of the data services company, that you need to be aware of? For example, can you ensure data is stored in a way that provides protection against intelligence service intrusion?

How are the biases and discriminatory issues in the data service being handled? What other processes or practices will need to be in place to counter those biases and ensure that you’re not excluding people by relying on this solution?

How have you secured permission or authority for this arrangement? Is it within your standard organisational policies, good practices, or lawful constitution? If some of those constraints have been waived due to the crisis, do you still have a mechanism for consulting with affected people and communities (eg patient or public representatives) about the project?

Who will ultimately benefit from the data service and how? It is inequitable and unjust for data about one group of people to be used to build tools that are then used to benefit another group of people. This includes you as an organisation getting benefits from data that don’t somehow return to the community. How can you ensure that the people and communities the data is about also benefit from the intelligence and services that are built over that data?

Answering these questions should help you identify additional things you need to make clear in the contract (eg where data will be stored) and do as part of the project (eg provide access back to communities).

Ensuring sustainability

You need to think hard about what happens when the arrangement ends. If you become reliant on a data service that’s being offered to you for free now, you may be committing yourself to future costs. If you don’t construct the contract well, you might be left in the situation where your friendly pro-bono supplier suddenly has you over a barrel and can charge what they want for the service you now can’t do without. Don’t kid yourself that everything will disappear when the crisis lifts. Assume that it will stay in place, and create contractual protections around that assumption.

Make sure the contract contains provisions that mean you retain as much intellectual property (IP) as possible. You should get ownership of as much of the code, data and models that get created during the project as you can. Making that IP as open as possible (ie open source, open data or at least open codebooks and schemas, and open models and algorithms) and ensuring everything is publicly documented will help alternative suppliers understand the system before you even start tendering for it.

One area where data service suppliers may want to retain IP is in any AI or automated services they build. I think this is reasonable: you want alternative suppliers to offer better services, not necessarily exact replicas. Just make sure that you retain enough rights over things like training data such that an alternative supplier will be able to create their own equivalent or improved solutions. (And remember what I said above about thinking through who gets to benefit from what’s being built.)

The contract should also ensure that, at the end of the contract period, an alternative supplier (which could be you, if you take it in house) will have sufficient time and access to existing systems to be able to take over the service. Make allowances for a transition period during which the pro-bono supplier continues to run the service while the alternate supplier builds their solution. Include in the terms that the supplier needs to capture and supply any updates to the data they were originally furnished with. Include public documentation for the logic behind algorithms too, where that’s important.

In other words, think deeply about your exit strategy before entering into the deal. Protect your future self.

Building the market

Any supplier who is offering pro-bono support is likely to already be in a good market position. Entering into an arrangement with them might well entrench that position. You’re giving them a reputational boost as well as building their internal capacity and possibly product set. Accepting an offer of pro-bono help from one company without having assessed their offer against those of other suppliers is not fair or open procurement.

However, in an emergency you might feel you don’t have time for a fair and open procurement process: you’re just choosing whether to take up this free offer or not. So it’s worth thinking of some ways to counter the market impact of that decision, and the costs of doing so, as you’re shaping the project and weighing up the deal.

Fortunately, if you’re doing the things described above you have a good foundation in place. To preserve trust, you’ve already dialled the transparency up to max, so other potential suppliers (which might range from smaller data companies through to academics and civil society groups) know what you’re up to. You’ve already open sourced code, opened up the models and algorithms underpinning any solution, and made as much data as possible (or at least its descriptors) open for others. Now you need to actively encourage and enable other people to create alternatives to (the fun/innovative/interesting/capacity building parts of) what your pro-bono supplier is giving you.

Here, I’d suggest employing open innovation techniques and more specifically put out a data challenge. Describe what you’re doing and invite others to show you their best ideas and implementations. Provide a prize (perhaps from the money you’re saving because of that lovely pro-bono help) to give some motivation; or link up with a research council or philanthropic funder to provide supporting grants; or just rely on the fact curious hackers love to challenge themselves to find better ways to do things with technology, particularly if that involves saving the world.

If developing the kinds of data services you’re getting for free requires access to personal data, create synthetic datasets that mirror the important characteristics of those datasets without containing any real information about real people. Make those available to the challenge participants.
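
As a very rough illustration, here is a minimal Python sketch that builds a synthetic dataset by resampling the marginal distributions of a real one. The file name and column names are invented, and a real project would also need to handle correlations between columns and check for re-identification risk; treat this as a sketch of the idea rather than a recipe.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    real = pd.read_csv("patient_journeys.csv")   # hypothetical dataset containing personal data
    n = len(real)

    # categorical column: resample according to the observed category frequencies
    region_freq = real["region"].value_counts(normalize=True)
    # numeric column: draw from a distribution fitted to the observed mean and spread
    ages = rng.normal(real["age"].mean(), real["age"].std(), size=n)

    synthetic = pd.DataFrame({
        "region": rng.choice(region_freq.index, size=n, p=region_freq.values),
        "age": ages.clip(0, 110).round().astype(int),
    })

    # this file contains no rows about real people and can be shared with challenge participants
    synthetic.to_csv("synthetic_patient_journeys.csv", index=False)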

Showcase the best solutions. Build their reputation. Count it as a success if those solutions get incorporated into the offer of the competitors to your pro-bono supplier. Remember you’re building your future market through this process, as well as everyone else’s.

In summary, an offer of free help is never actually free, but it is possible to construct a project such that everyone gets to benefit from it. If it’s still worth going ahead with those additional costs taken into account, knock yourself out.

Community consent

Jan 17, 2020

If we accept that individual consent for handling personal data is not working, what are the alternatives?

I’m writing this because it’s an area where I think there is some emerging consensus, in my particular data bubble, but also areas of disagreement. This post contains exploratory thinking: I’d really welcome thoughts and pointers to work and thinking I’ve missed.

The particular impetus for writing this is some of the recent work being done by UK regulators around adtech. First, there’s the Information Commissioner’s Office (ICO) work on how personal data is used in real-time bidding for ad space, which has found that many adtech intermediaries are relying on “legitimate interests” as the legal basis for processing personal data, even when this isn’t appropriate. Second, there’s the Competition and Markets Authority (CMA) interim report on the role of Google and Facebook in the adtech market, which includes examining the role of personal data and how current regulation affects the market.

But there’s other related work going on. In health, for example, there are long running debates on how to gain consent to use data in ways to improve healthcare (by Understanding Patient Data, medConfidential, etc). In transport or “Smart Cities” more generally, we see local government wanting to use data about people’s journeys for urban planning.

Personal data is used both for personalising services for individuals and (when aggregated into datasets that contain personal data about lots of people) understanding populations. The things that data is used to do or inform can be both beneficial and harmful for individuals, for societies and for particular groups and communities. And regardless of what it is used for, the collection of data can in itself be oppressive, and its sharing or sale inequitable.

The big question of this data age is how to control and govern this collection, use and sharing of personal data.

Personal data use entails choices

Going back to first principles, the point of collecting, using and sharing data is ultimately to inform decision making and action. There is always a range of options for what data to use in any analysis, with different data providing more or less accuracy or certainty when answering different questions. For example, the decision about what ads to serve within a story in an online newspaper could be based on any combination of the content of the story; knowledge about the general readership of the newspaper; or specific personal data about the reader looking at the page, their demographics and/or browser history.

There are also options when it comes to the architecture of the ecosystem supporting that analysis and use: how data is sourced and by whom, who performs which parts of the analysis and therefore what data gets shared with whom and in what ways.

Curious hackers like me will always think any data analysis is worth doing on its own terms - just because it’s interesting - and aim to use data that gives as accurate an answer as possible to the question posed. But there are damaging consequences whenever personal data is collected, shared and used: risks of data breaches, negative impacts on already excluded groups, the chilling effect of surveillance, and/or reduced competition, for example.

So behind any analysis there is a choice - an assessment that is a value judgement - about which combination of data to use, and how to source, collect and share it, to get value from it.

Our current data protection regime in the UK/Europe recognises some purposes for using data as being inherently worthwhile. For example, if someone needs to use or share personal data to comply with the law, there is an assumption that there is a democratic mandate for doing so (since the law was created through democratic processes). So for example, the vast data collection exercise that is the census has been deemed worthwhile democratically, is enshrined in law, and has a number of accountable governance structures in place to ensure it is done well and to mitigate the risks it entails.

In many other places, though, because people have different values, preferences and risk appetites, our current data protection regime has largely delegated making these value judgements to individuals. If someone consents to data about them being collected, used and shared in a particular way, then it’s deemed ok. And that’s a problem.

There are some very different attitudes to the role individuals should play in making assessments about the use of data about them. At one extreme, individual informed consent is seen as a gold standard, and anything done with personal data about someone that is done without their explicit consent is problematic. At the other extreme, people see basing what organisations can do on individual informed consent as fundamentally broken. There are a number of arguments for this:

  1. In practice, no matter how simple your T&Cs, well designed your cookie notices, or accessible your privacy controls, the vast majority of people agree to whatever they need to in order to access a service they want to access and never change the default privacy settings.
  2. Expecting people to spend time and effort on controlling their data shadows is placing an unfair and unnecessary burden on people’s lives, when that burden should be on organisations that use data to be responsible in how they handle it.
  3. Data value chains are so complex that no one can really anticipate how data might end up being used or what the consequences might be, so being properly informed when making those choices is unrealistic. Similarly, as with the climate crisis, it is unfair to hold individuals responsible for the systemic impacts of their actions.
  4. Data is never just about one person, so choices made by one individual can affect others that are linked to them (eg your friends and family on Facebook) or that share characteristics with them (we are not as unique as we like to think: data about other people like me gives you insights about me).
  5. Data about us and our communities is collected from our ambient environment (eg satellite imagery, CCTV cameras, Wifi signals, Streetview) in ways that it is impractical to provide individual consent for.
  6. The biases in who opts out of data collection and use aren’t well enough understood to compensate for them, which may undermine the validity of analyses of data where opting out is permitted and exacerbate the issue of solutions being designed for majority groups.

The applicability of these arguments varies from case to case. It may be that some of the issues can be mitigated in part through good design: requesting consent in straightforward ways, providing easy-to-understand privacy controls, finding ways to recognise the consent of multiple parties. All of these are important, both to allow for individual differences and to give people a sense of agency.

But even when consent is sought and controls provided, organisations handling personal data are the ones deciding:

  • What they never do
  • What they only do if people explicitly opt in (defaulting to “never do”)
  • What they do unless people opt out (defaulting to “always do”)
  • What they always do

Realistically, even with all the information campaigns in the world, and even with excellent design around consent and control over data, even with personal data stores and bottom up data trusts, the vast majority of people will neither opt in nor opt out but stick with the defaults. So in practice, even where a bit of individual choice is granted, organisations are making and will continue to make those assessments about the collection, use and sharing of data I talked about earlier.

Individual consent is theatre. In practice it’s no better than the most flexible (and abused) of legal bases for processing data under GDPR - “legitimate interests” - which specifically allows an organisation to make an assessment that balances their own, a third party’s, or broader public interest against the privacy interests of data subjects. Organisations that rely on consent for data processing don’t even have to demonstrate they carry out the balancing tests required for legitimate interests.

I do not think it’s acceptable for organisations to make decisions about how they handle data without giving those affected a voice, and listening and responding to it. But how can we make sure that happens?

Beyond individual consent, mechanisms for ensuring organisations listen to the preferences of people they affect are few and far between. When organisations rely on legitimate interests as their legal basis for processing data, the ICO recommends they carry out and record the results of a Legitimate Interests Assessment (LIA). This is a set of questions that prompts an organisation to reflect on whether the processing has a legitimate purpose, whether personal data is necessary to fulfil it, and how that purpose balances against individual rights.

But there is no requirement for LIAs to be contributed to, commented on, or even seen by anyone outside the organisation. The only time they become of interest to the ICO is if it carries out an investigation.

I believe that for meaningful accountability, organisations should be engaging with people affected by what they’re doing whenever they’re making assessments about how they handle personal data. And I think (because data is about multiple people, and its use can have systemic, community and society-wide effects) this shouldn’t just include the direct subjects of the data who might provide consent but everyone in the communities and groups who will be affected. Carrying out this community-level engagement is completely compatible with also providing consent and control mechanisms to individuals, where that’s possible.

There are lots of different approaches for engagement:

  • Publishing and seeking comment on LIAs or similar
  • Ethics boards that include representatives from affected groups, or where appropriate elected representatives (eg letting local councils or parliament decide)
  • Carrying out targeted research through surveys, interviews, focus groups or other user research methods
  • Holding citizen juries or assemblies where a representative sample of the affected population is led through structured discussion of the options

I note that the Nuffield Foundation has just put out a call for a review of the evidence on the effectiveness of public deliberation like this, so I’m aware I’m not being rigorous here. But it seems to me that in all cases it’s important for participation to be, and to be seen as, legitimate (for people in the affected communities who are not directly involved to feel their opinions will have been stated and heard). It’s vital that organisations follow through in using the results of the engagement (so that it isn’t just an engagement-washing exercise). It’s also important that this engagement continues after the relevant data processing starts: that there are mechanisms for reviewing objections and changing the organisation’s approach as technologies, norms and populations change.

The bottom line is that I would like to see organisations being required to provide evidence that their use of data is acceptable to those affected by it, and to important subgroups that may be differentially affected. For example, if we’re talking about data about the use of bikes, the collection and use of that data should be both acceptable to the community in which the monitoring is taking place, and to the subset who use bikes.

I would like to see higher expectations around the level of evidence required for larger organisations than smaller, and higher expectations for those that are in the advantageous position of people not being able to switch to alternative providers that might have different approaches to personal data. This includes both government services and big tech platform businesses.

Perhaps, at a very basic and crude level, to make this evidence easier to create and easier to analyse, there could be some standardisation around the collection of quantitative evidence. For example, there could be a standard approach to surveying, with respondents shown a description of how data is used and asked questions such as “how beneficial do you think this use of data is?”, “how comfortable would you feel if data about you was used like this?”, “on balance, do you think it’s ok to use data like this?”. There could be a standard way of transparently summarising the results of those surveys alongside the descriptions of processing, and perhaps statistics about the percentage of people who exercise any privacy controls the organisation offers.
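
To make that concrete, here is a sketch in Python of what a standardised summary might look like. The three questions are the ones suggested above; the five-point response scale, the field names and the JSON shape are assumptions for illustration only, not a proposed standard.

    import json
    from statistics import mean

    QUESTIONS = [
        "How beneficial do you think this use of data is?",
        "How comfortable would you feel if data about you was used like this?",
        "On balance, do you think it's ok to use data like this?",
    ]

    def summarise(processing_description, responses, percent_using_privacy_controls):
        """responses: list of dicts mapping each question to a 1-5 score."""
        return json.dumps({
            "processing": processing_description,
            "sample_size": len(responses),
            "questions": {
                q: {
                    "mean_score": round(mean(r[q] for r in responses), 2),
                    "percent_positive": round(
                        100 * sum(r[q] >= 4 for r in responses) / len(responses), 1),
                }
                for q in QUESTIONS
            },
            "percent_using_privacy_controls": percent_using_privacy_controls,
        }, indent=2)

    # e.g. two (made up) respondents rating a (made up) use of cycling data
    print(summarise(
        "Sensors recording bike journeys, used for local transport planning",
        [{q: 4 for q in QUESTIONS}, {q: 2 for q in QUESTIONS}],
        percent_using_privacy_controls=3.5,
    ))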

Now I don’t think the results of such surveys should be the entirety of the evidence organisations provide to demonstrate the legitimacy of their processing of personal data, not least because reasoning about acceptability is complex and contextual. But the publication of the results of standard surveys (which could work a bit like the reporting of gender pay gap data) would furnish consumer rights organisations, the media and privacy advocates with ammunition to lobby for better behaviour. It would enable regulators like ICO to prioritise their efforts on examining those organisations with poor survey results. (Eventually there could be hard requirements around the results of such surveys, but we’d need some benchmarking first.)

Whether this is realistic or not, I believe we have to move forward from critiquing the role of individual consent to requiring broader engagement and consent from people who are affected by how organisations handle data.

Possible effects

Let’s say that some of these ideas were put into place. What effects might that have?

The first thing to observe is that people are deeply unhappy with how some organisations are using data about them and others. This is particularly true about the tracking and profiling carried out by big tech and adtech and underpinning surveillance capitalism. ODI’s research with the RSA, “Data About Us”, showed people were also concerned about more systemic impacts such as information bubbles. Traditional market drivers are not working because of network effects and the lack of competitive alternatives. If the organisations involved were held to account against current public opinion (rather than being able to say “but our users consented to the T&Cs”), they would have to change how they operated fairly substantially.

It’s clear that both ICO and the CMA are concerned about the impact on content publishers and existing adtech businesses if the adtech market were disrupted too much and are consequently cautious about taking action. It is certainly true that if adtech intermediaries couldn’t use personal data in the way they currently do, many would have to pivot fairly rapidly to using different forms of data to match advertisers to display space, such as the content and source of the ad, and the content of the page in which it would be displayed. Facebook’s business model would be challenged. Online advertisers might find advertising spend provides less bang for buck (though it’s not clear to me what impact that would have on advertising spend or balance across different media). But arguments about how to weigh these potential business and economic consequences against the impacts of data collection on people and communities are precisely those we need to have in the open.

I can also foresee that a requirement for community consent would mean wider information campaigns from industry associations and others about the benefits of using personal data. Currently the media narrative about the use of personal data is almost entirely negative - to the extent that Doctor Who bad guys are big tech monopolists. Sectors, like health, where progress such as new treatments or better diagnosis often requires the use of personal data, are negatively impacted by this narrative, and can’t afford the widespread information campaigns that would shift that dial. If other industries needed to make the case that the use of personal data can bring benefits, that could mean the conversation about data starts to be more balanced.

Finally, and most importantly, people currently feel dissatisfied, feel they don’t have control, and are resigned to a status quo they are unhappy with. Requiring community consent could help provide a greater sense of collective power and agency over how organisations use data, which in turn should increase the levels of trust people have in the data economy. I believe that is essential if we are to build a healthy and robust data future.

Dominic Cummings, government transformation and digital twins

Jan 3, 2020

Dominic Cummings has written a viral job advert (or at least it seems viral in my particular Twitter bubble). Two observations.

First, on where the roles are focused, it’s worth looking at the Centre Forward report by Josh Harris and Jill Rutter at the Institute for Government. This looks at the kind of support Prime Ministers need (beyond diary management etc) to make the change they want to see in the world and in (small g) government. They break it down into the following categories:

  • policy advice and support
  • long-term policy development and direction
  • co-ordination and dispute resolution
  • progress assurance
  • incubating and catalysing change
  • communications and external relations

In Cummings’ post, the content or arrangement of the work isn’t made entirely clear (only the expertise required to do it), but using the categorisation above, it looks largely focused on short and long term policy development, progress assurance (assuming that’s what the project managers would do), and comms.

There aren’t roles targeted at either coordination or at catalysing change across government. In that regard, it feels like an insular team, with a theory of change that imagines the centre can come up with great ideas and then impose its will and direction through genius, personality and authority.

I guess it depends on how that team operates. They could act as a red team, bringing together and coordinating existing experts inside and outside government to focus on priority problems in a fairly short time frame. Perhaps that would work. But I do question the chances of long-term success of a team that doesn’t include anyone with expertise in bringing about and embedding change in government. It’s not as if there haven’t been attempts before; there’s plenty to learn from.

So, if you’re a “super-talented weirdo” with expertise in government transformation, and want to work in a Johnson/Cummings No 10, I reckon you could make a pretty good case. And if I were Cummings, I’d be scouting at UKGovcamp.

Second set of observations is about the urge to be able to model the world (in as much of its complexity as possible), predict what will happen in it, understand how policy will affect it, and use data and evidence to work out what to do. This is wrapped up with a desire to be able to play with these models in an immersive way, described in Cummings’ Seeing Rooms post and evident from his interest in Dynamic Land.

I think the most useful term for this idea of having a digital (and usually agent-based) simulation of reality is “Digital Twins”. There’s an ongoing stream of work on digital twins for physical infrastructure, led by the Centre for Digital Built Britain, which came out of the Infrastructure Commission report on Data for the public good. People also talk about digital twins in health, whether that’s digital twins of individuals or of larger health systems. People (including us at ODI) have also developed agent-based models for more abstract things like data policy and more classical models for things like fully renewable energy generation in 2050.

Large scale digital twins are more fantasy than reality. Look at the replies to this New Scientist tweet to see people calling bullshit on an article on AI simulations of everything. But smaller scale, focused models are feasible. I recommend in particular the Blackett Review of computational modelling which usefully spells out their uses and limitations.
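
For readers who haven’t met agent-based models before, here is a toy example in Python of the kind of small, focused simulation that is feasible: agents on a grid adopt a behaviour with a probability that rises with the number of neighbours who have already adopted it. The grid, parameters and adoption rule are all invented for illustration, not drawn from any real digital twin project.

    import random

    random.seed(1)
    SIZE, STEPS, BASE_RATE, PEER_EFFECT = 20, 50, 0.01, 0.1

    grid = [[False] * SIZE for _ in range(SIZE)]  # False = agent has not yet adopted

    def neighbours(x, y):
        return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx or dy) and 0 <= x + dx < SIZE and 0 <= y + dy < SIZE]

    for step in range(STEPS):
        snapshot = [row[:] for row in grid]  # update everyone against the same state
        for x in range(SIZE):
            for y in range(SIZE):
                if snapshot[x][y]:
                    continue
                adopted = sum(snapshot[nx][ny] for nx, ny in neighbours(x, y))
                if random.random() < BASE_RATE + PEER_EFFECT * adopted:
                    grid[x][y] = True
        print(step, sum(sum(row) for row in grid))  # how many agents have adopted so far

Even a toy like this embeds assumptions about how the world works (here, that influence is local and uniform), which is exactly the kind of thing that needs to be open to scrutiny, as discussed below.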

But to create a useful digital twin, you don’t just need computing power and data scientists: you need data. I think Cummings’ team will pretty rapidly come up against a lack of usable, accessible data for the things they want to look at. I think they will work around those gaps, like all researchers do: they will clean spreadsheets up for their purposes, they will get one-off access from public bodies, they will model based on data from other countries, they will pay for or use political pressure to get hold of data the private sector holds, they will make informed guesses.

My fervent wish is that they will use the power they have to also strengthen the UK’s underlying data infrastructure for the longer term: that they fix the plumbing as they go along.

I also fervently wish that Cummings will recognise that data is not all you need, and that it can’t tell you everything. Every digital twin, every model, of anything remotely complex and interesting, embeds assumptions about the way the world works. We need to be able to pick apart these models, and the data they’re based on, to critique and scrutinise, to test assumptions and refine our understanding and the model. We need social scientists, not just computer scientists.

And to make that work, we need to be able to share models and data, to create alternatives that embed different narratives about the way the world works, and to discuss them. This openness is, to me, at the heart of scientific, evidence-based policy-making. It’s this piece - how to share elements of digital twins - that we’ve been concentrating on at ODI.

So if Cummings really recognises the importance of tapping into distributed expertise, at least some of the data people he recruits will need to be good at sharing data and models, not just creating them. And there will need to be some way to engage social scientists (and not just economists) with them.

Last thing on digital twins - Cummings wants to play with digital twins in an immersive way: to manipulate virtual worlds and test the impact of policy before it hits the real one. I think he should be looking for game designers, not just of computer games (though in some cases I’m sure they would be great) but also table-top, policy games like those being explored at Nesta or like Datopolis, which Ellen Broad and I designed to help think about how data infrastructure works. Multi-player games can be cheap, simple, fun and communicative agent-based models. Simulated environments can be a good way to generate data you can’t otherwise get hold of. So I reckon there should be another “super-talented weirdo” opening for someone with game-creating skills.

I deliberately haven’t commented on everything and there’s plenty of other good commentary on Twitter; also look at Mark O’Neill’s and Matt Jukes’ posts. Just a final thought: I’m reminded of the warnings in “The Curious Hacker” by Connor Leahy, about how those driven by curiosity can bring both wonderful and hugely damaging changes. This Curious Hacker seems to be Cummings’ personality type (it’s my natural state as well), and he is recruiting more; they will need people to check, channel and challenge them. We all have a role to play here.

How can you control how data gets used?

Oct 5, 2019

Democracy Club are asking for advice on some changes they’re considering making to their API’s terms and conditions. They’re considering changes for two reasons: to enable them to track impact and to give them the right to withdraw access from those they believe are damaging democracy or Democracy Club’s reputation. Here’s my reckons.

Service T&Cs vs data licences

When you are making decisions about terms and conditions around data provided through an API, you actually have to worry about two things:

  • restrictions on who gets to use the API or how they can use it
  • restrictions on what people who get access to the data can do with it

It’s important to think of these separately as it makes you more realistic about what you can and can’t monitor and control and therefore more likely to put useful mechanisms in place to achieve your goals.

To help with thinking separately about these two things, don’t think about the simple case of someone building an app directly on your API. Instead imagine a scenario where you have an intermediary who uses the API to access data from you, augments the data with additional information, and then provides access to that augmented data to a number of third parties who build applications with it.

If you think your data is useful and you want it to be used, this should be something you want to enable to happen. Combining datasets helps generate more valuable insights. Organisations that provide value added services might make money and create jobs and bring economic benefits, or at least give people something to hack on enjoyably over a weekend.

In this scenario, you have two technical/legal levers you can use to restrict what happens with the data:

  1. You can control access to the API. This is technically really easy, using API keys (see the sketch after this list). And you can write into your T&Cs the conditions under which you will withdraw access privileges. The trouble is that when there are intermediaries, you cannot, on your own, selectively control access by the end users that the intermediaries serve. The intermediary won’t access your API for each request they receive: they will batch requests and cache responses and so on in order to reduce their load and reliance on you. So if there is a single bad actor on the other side of an intermediary, and API keys are your only lever, you will be faced with the decision of cutting off all the other good actors or tackling the bad actor through some other mechanism.

  2. You can put conditions in the data licence. You can of course put any conditions you like into a licence. But there are three problems with doing so. First, restrictions within a licence can mean that data cannot be combined with other data that people might feasibly want to combine it with. In particular, data that isn’t available under an open licence can’t be combined with data available under a share-alike licence, such as that from OpenStreetMap. Second, if everyone does this, intermediaries who combine data from lots of sources with different licences end up with all sorts of weird combination licences so you get not only licence proliferation but ever expanding complexity within those licences, which makes things complex for data users who are further downstream. Third, you have to be able to detect and enforce breaches. Detection is hard. Enforcement is costly - it’s a legal battle rather than a technical switch.
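
To illustrate the first lever, here is a minimal Python sketch of API-key checking and revocation, using Flask purely as an example framework. The keys, endpoint and data are invented; the point is that this “technical switch” operates at the level of whoever holds the key, which for an intermediary means everyone downstream of them.

    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    ISSUED_KEYS = {"abc123": "Friendly Intermediary Ltd"}  # key -> registered user
    REVOKED_KEYS = set()  # keys withdrawn under the conditions set out in your T&Cs

    @app.route("/api/candidates")
    def candidates():
        key = request.headers.get("X-API-Key")
        if key not in ISSUED_KEYS or key in REVOKED_KEYS:
            abort(403)  # easy to flip for a direct user, blunt when the key belongs to an intermediary
        return jsonify([{"name": "Example Candidate", "ward": "Example Ward"}])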

The viral complexity and practical unenforceability of restrictive data licences is why I would always recommend simply using an open licence for the data itself - make it open data. You can still have terms and conditions on the API while using an open licence for the data, but you need to recognise their limitations too.

So, with that scenario in mind, let’s consider the two goals of understanding impact and preventing bad uses.

Understanding impact

Here you have three choices:

  1. Track every use of your data. You will need to gather information about every direct user of your API, but you will also need a mechanism for intermediaries to report back to you about any users they have. You will need to have a way of enforcing this tracking, so that intermediaries don’t lie to you about how many people are using their services. This requirement will also restrict how the intermediaries themselves work: they couldn’t, for example, just publish the augmented data they’ve created under an open licence for anyone to download; they’ll have to use an API or other mechanism to track uses as well.

  2. Track only direct uses of your data. This is easy enough with a sign-up form of course, when you give out API keys. But be aware that some people will obfuscate who they are because they’re contrary sods who simply don’t see why you need to know so much about them. How much do you care? What rigour are you going to put around identity checks?

  3. Track only the uses people are kind enough to tell you about. Use optional sign-up forms. Send public and private requests for information about how people are using your data and how it’s been useful. Get trained evaluators to do more rigorous studies every now and then.

Personally, I reckon all impact measures are guesses, and we can never get the entire picture of the impact of anything that creates changes that are remotely complex or systemic. Every barrier you put in place around people using data is a transaction cost that reduces usage. So, if your primary goal is for the data you steward to get used, I would keep that friction as low as possible - provide sign-up forms to supply API keys, but make the details optional (a sketch of what that might look like follows) - and be prepared to explain to whoever asks that when you describe your impact you can only provide part of the picture. They’re used to that. They’ll understand.
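
As a sketch of that low-friction approach (again using Flask as an assumed framework, with invented field names): a sign-up endpoint that always issues a key but treats every detail as optional.

    import secrets
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    KEYS = {}  # api key -> whatever the user chose to tell us

    @app.route("/signup", methods=["POST"])
    def signup():
        details = request.get_json(silent=True) or {}
        key = secrets.token_urlsafe(16)
        KEYS[key] = {
            "name": details.get("name"),                  # optional
            "intended_use": details.get("intended_use"),  # optional: useful for impact stories
            "contact": details.get("contact"),            # optional
        }
        return jsonify({"api_key": key})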

Withdrawing access from bad actors

When you provide infrastructure, some people who use that infrastructure are going to be doing bad stuff with it. Bank robbers escape by road. Terrorists call each other on the phone. Presidents tweet racist comments. What’s true for transport, communications and digital infrastructure is true for data infrastructure.

It is right to think about the bad things people could do and how to respond to these inevitable bad actors, so that you give yourselves the levers you need to act and communicate clearly the rules of the road. You also need to think through the consequences for good actors of the systems you put in place.

First question: how are you going to detect bad actors in the first place? Two options:

  1. Proactively check every use of the data you’re making available to see if you approve of it. This is prohibitively expensive and unwieldy for you and introduces a large amount of cost and friction for reusers, especially given some will be using it through intermediaries.

  2. Reactively respond to concerns that are raised to you. This means you will miss some bad uses (perhaps if they happen in places you and your community don’t usually see). It also means anyone who uses the data you provide will need to live with the risk that someone who disagrees with what they’re doing reports them as a bad actor. Sometimes that risk alone can reduce people’s willingness to reuse the data.

Second question: how will you decide whether someone is a good or bad actor? There are some behaviours that are easy to quantify and detect (like overusing an API). But there are other behaviours where “bad” is a moral judgement. These are by definition fuzzy, and the fuzzier they are, the more uncertainty there is for people reusing data about whether, at some point in the future, it might be decided that what they are doing is “bad” and the thing they have put time into developing rendered useless. How do you give certainty to the people thinking about using the data you are providing? What if the people contributing to maintaining the data you’re stewarding disagree with your decision (one way or another)? When you make a judgement against someone do they get to appeal? To whom?

Third question: what are you going to do about it? Some of the actions you think are bad won’t be ongoing - they might be standalone analyses based on the data you’re providing, for example. So withdrawing access to the API isn’t always going to be a consequence that matters for people. Does that matter? Do people who have had API access withdrawn ever get to access it again, if they clean up their act? How will you detect people who you ban through one intermediary potentially accessing it again through another, or those who have accessed the API directly using an intermediary to do so instead?

It is completely possible to put in place data governance structures, systems and processes that can detect, assess and take action against bad actors. Just as it’s possible to have traffic police, wiretaps and content moderation. But it needs to be designed proportionally to the risks and in consideration of the costs to you and other users.

If we were talking about personal health records, I would be all for proactive ethical assessment of proposed uses, regular auditing of reusers and their systems, and having enough legal firepower to effectively enforce these terms and discourage breaches.

But with a collaboratively maintained register of public information, for me the systemic risks of additional friction and uncertainty that arise from introducing such a system, and the fact you can’t make it watertight anyway within the resources of a small non-profit, make me favour a different approach.

I would personally do the following:

  1. Make a very clear statement that guarantees you will not revoke access through the API except in very clearly defined, objectively verifiable circumstances, such as exceeding agreed rate limits. This is what you point to as your policy when people raise complaints about not shutting off access to people they think are bad. Write up why you’ve adopted this policy. Indicate when you will review it, the circumstances in which it might change, and the notice period you’ll give of any changes. This is to give certainty to the people you want to use and build on the data.

  2. Institute an approvals scheme. Either give those approved a special badge or only let those who get your approval use your logo or your name in any advertising they do of their product. Publish a list of the uses that have been approved and why (they’re great case studies too - look, impact measurement!). Make it clear in your policy that the only signal that you approve of a use of the data you’re providing is this approvals process. It will take work to assess a product so charge for it (you can have a sliding scale for this). Have the approval only last a year and make them pay for renewal.

  3. Name and shame. When you see uses of the data you steward that you disagree with or think are bad, write about them. Point the finger at the people doing the bad thing and galvanise the community to put pressure on them to stop. Point out the use is not approved by you. Point out that these bad uses make you more likely to start placing harder restrictions on everyone’s use and access in the future.

I do not know whether anyone will go for an approvals scheme. It depends on how much being associated with you matters to them. It’s worth asking, to design it in a way that works for them and you.

And this approach will not protect you from all criticism or feeling bad about people doing bad things using the infrastructure you’ve built. But nothing will do that. Even if you design a stricter governance system, people will criticise you when they disagree with your moral judgements. They will criticise you for arbitrariness, unfairness, lack of appeal, lack of enforcement, not listening to them etc etc etc. Bad stuff will happen that you only work out months down the line and couldn’t have done anything about anyway. You’ll take action against someone who goes on to do something even worse as a consequence.

Life is suffering.

If you don’t take the approach outlined above, then do aim to:

  • Communicate clearly with potential reusers about what you will do, so they can assess the risks of depending on the data you’re making available.

  • Have a plan for how to deal with bad actors that access data through intermediaries.

  • Have a plan for how to deal with bad actors that perform one-off analyses.

And either way, do try to monitor and review what the impact of whatever measures you put in place actually are. Ask for feedback. Do user research. We’re all working this out as we go, and what works in one place won’t work in another, so be prepared to learn and adapt.

For more discussion / other opinion, see the comments from Peter Wells and others on the original Democracy Club post, and the obligatory Leigh Dodds blog posts.

(Being) The Elephant in the Room

Feb 10, 2019

In this post I’m going to write about the tensions I find myself struggling with around ODI’s role in the wider ecosystem of organisations working around data.

I want to preface this by saying this is absolutely not a “poor me” post. We at ODI, and I personally, have been immensely fortunate to have received backing from funders like the UK government and Omidyar Network (now Luminate), and support from other organisations and the data community at large. I have played a senior role at ODI since it started and now, as its CEO, I completely recognise my personal responsibility for ODI’s shape and form and activities. Indeed that’s why I’m writing this post. I want to talk about an area where I don’t think we’re doing as well as I’d like us to, but where I’m struggling to find a way to do better given the other responsibilities I have to my team and the organisation.

For me, one measure of whether ODI is successful is whether other organisations, including businesses, in the ecosystem do well. This is something we share with other organisations that are trying to help grow ecosystems rather than themselves - such as the Catapults in the UK or the Open Contracting Partnership (which I’m on the Advisory Board of) - or aim to scale their impact through partnerships rather than size - such as Open Knowledge International or Privacy International. We have had three core values at ODI since its foundation: expert, enabling and fearless. Part of being enabling is helping others succeed.

Now, ODI is a big organisation relative to others working in this space. We have over 50 people in our team, not including associates and subcontractors. Our turnover is roughly £5m annually (though very little of that is now secure core/unrestricted funding). We invest in communications so we make a lot of noise. We invest in public policy and we’re based in London, so we get to have good links into governments and attend the roundtables and launches and receptions that bring influence and opportunities. We also have some incredibly talented and well respected experts in our team.

I think of this as like being a well-meaning elephant in a crowded room with lots of other animals. We’re all trying to break through a wall and there’s no way we’re going to do it alone. The elephant can cause some serious damage to the wall but it sometimes squashes small animals underfoot without meaning to, just because it doesn’t see them. It bumps into other animals in annoying and damaging ways as it manoeuvres. It lifts some animals onto its back where they can get a better angle on the wall for a while but there’s only so much room and they keep falling off.

And then there’s the food. Most of it is placed really high up on shelves. The higher the shelves the more food there is on them. The elephant is tall and one of the only creatures that can reach the higher shelves. It’s well-meaning so it tries to share the food it gets around. Sometimes it forms a bridge with its body so other animals can get to higher shelves too. But it’s also hungry. It needs more food than the other animals just to survive and if it gets too weak it won’t be able to kick the wall any more, or reach the high shelves, or lift up any other animals.

How much should it eat? How should it choose which other animals to lift up or share food with? Should it be trying to grow bigger and taller so it can kick the wall harder and reach the higher shelves?

Analogies are fun. Zoom out from the elephant melodrama and the room is actually a small corner of a massive hangar of passive diplodocuses and brontosauruses who are able to reach even higher shelves and don’t care about the wall at all. Look at the adjoining paddock and there are animals who can feed from the ground (lucky things with their endowments). Look beyond and there be monsters - carnivores feeding on each other.

What I wrestle with is what the elephant should do, what ODI should do, what I should do in this situation. And of course there are no black and white answers here, just lots of ands. Eat enough to survive and share the food around. Work with people you work well with and choose partners fairly. Kick that wall hard yourself and help others kick it.

Some real examples:

  1. ODI’s big innovation grant from UK government comes with T&Cs that mean we get no overhead recovery (which we need to pay for things like desks, recruitment, financial administration, and reaching up to those high shelves to get more money) when we contract people or organisations outside ODI. In effect that means it costs us money to share that money, but we have also seen that “stimulus fund” approaches, and bringing in real expertise we lack in house, are much more effective at delivering high quality work and achieving impact than doing everything ourselves. So we have targets for external spend, and support the costs that don’t get covered “from elsewhere”.

  2. We developed the Data Ethics Canvas and made it openly available for others to pick up and use. Then we invested money “from elsewhere” into developing workshops and webinars around it and started selling those, and developing other advisory products that bring margin in (to be the “from elsewhere” money). Other organisations have done similarly, some through adopting the Canvas themselves (which is great, because what we really care about is knocking down that wall). But it means we could start competing with organisations we want to succeed.

  3. We’re putting together a bid for a new programme and we want to do it with a partner. An open procurement process isn’t appropriate because there’s no guarantee anything will come of it. But there are lots of potential partners who we could work with, who would each bring different perspectives and approaches - the point of involving them early is that it enables us to shape the programme to suit. We choose one but I know others could have been just as good and may be unhappy we chose someone else, and I couldn’t hand on heart say the choice was anything but arbitrary.

  4. This has happened so many times: we run an open procurement process for a piece of work and send the call to a number of different organisations, all of which are friends and allies we want to work with and all of whom we think could do the work. We score the resulting proposals and one wins because it’s closest to what we have in mind, which is determined by how we have shaped the project and the call. The organisations that don’t succeed are naturally disappointed, particularly when we’ve said we want to find ways to work with them (which we do), and ask why we even approached them for a bid, which they put unpaid time and effort into, if we weren’t going to choose them.

  5. Again a common pattern: an ally wants to take on a piece of work but isn’t constituted in the right way, isn’t big enough or secure enough or on the right government framework, so asks us to front it. They do the work, we take on administration, financial and reputational risk, but have limited control over the client relationship or work quality. If we insert ourselves into the project more, it feels like we’re exploiting others’ work - from their perspective we’re not really adding value. But if we stay hands off and something goes wrong, we are legally liable and our reputation could suffer.

I could go on. And again, I’m not moaning or saying any of this is in any way unfair on ODI. These are just the consequences of our position. I just wish I knew how to navigate them better, in a way that is fair both to the ODI team and organisation, and our friends and allies and partners; in a way that builds alliances rather than resentment, creates impact rather than conflict.

Answers on a postcard please.