Doesn't open data make data monopolies more powerful?

Jan 14, 2018

We live in a world where a few, mostly US-based companies hold huge amounts of data about us and about the world. Google and Facebook, and to a lesser extent Amazon and Apple (GAFA), make money by providing services, including advertising services, that make excellent use of this data. They are big, rich, and powerful in both obvious and subtle ways. This makes people uncomfortable, and working out what to do about them and their impact on our society and economy has become one of the big questions of our age.

An argument has started to emerge against opening data, particularly government owned data, because of the power of these data monopolies. “If we make this data available with no restrictions,” the argument goes, “big tech will suck it up and become even more powerful. None of us want that.”

I want to dig into this line of argument: the elements of truth it contains, why the conclusion that we should not open data is wrong, why the argument is really being made, and what better ways there are to address the issue.

More data disproportionately benefits big tech

It is true that big tech benefits from the greater availability of data, and benefits disproportionately compared with smaller organisations.

Big tech have great capacity to work with data. They are geared to getting value from data: analysing it, drawing conclusions that help them grow and succeed, creating services that win them more customers. They have an advantage in both skills and scale when it comes to working with data.

Big tech have huge amounts of data that they can combine. Because of the network effects of linking and aggregating data together, the more data they have, the more useful that data becomes. They have an advantage because they have access to more data than other organisations.

Not opening data disproportionately damages smaller organisations

It is also true that small organisations suffer most from not opening data. Access to data enables organisations to experiment with ideas for innovative products and services. It helps them make better decisions, faster, which is particularly important for small organisations who need to make good choices about where to direct their energies or risk failure.

If data is sold instead of opened, big tech can buy it easily while smaller organisations are less able to afford to. Big tech have cash to spare, in house lawyers and negotiators, and savvy developers used to working with whatever copy or access protection is put around data. The friction that selling data access introduces is of minimal inconvenience to them. For small organisations, who lack these things, the friction is proportionately much greater. So on top of the disproportionate benefits big tech get from the data itself, they get an extra advantage from the barriers selling data puts in the way of smaller organisations.

If data isn’t made available to them (for example because they can’t negotiate to acceptable licensing conditions or the price is too high), big tech have the money and user base that enable them to invest in creating their own versions. Small organisations simply cannot invest in data collection to anywhere near the same scale. The data that big tech collects is (at least initially) lower quality than official versions, but it usually improves as people use it and correct it. Unlike public authorities, big tech have low motivation to provide equal coverage for everyone, favouring more lucrative users.

An example is addresses in the UK. Google couldn’t get access to that data under licensing conditions they could accept, so they built their own address data for use in Google Maps. Professionals think it is less accurate than officially held records. It particularly suffers outside urban and tourist areas because fewer people live there and there’s less need for people to use Google’s services there, which means less data available for Google to use to correct it.

Using different terms & conditions for different organisations doesn’t help

“Ah,” I hear you say, “but we can use different terms & conditions for different kinds of organisations so smaller ones don’t bear the same costs.”

It is true that it is possible to construct licensing terms and differential charging schemes that make it free for smaller firms to access and use data and charge larger firms. You can have free developer licences; service levels that flex with the size of companies (whether in employees or turnover or terminals); non-commercial licences for researchers, not-for-profits and hobbyists.

These are all possible, but they do not eliminate the problems.

First, the barrier for smaller organisations is not just about cash but about time and risk. Differential licensing and charging schemes are inevitably complex. Organisations have to understand whether they qualify for a particular tier and whether they are permitted to do what they want to do with the data. This takes time and often legal fees. Working out what is permitted is often hard because legal restrictions on particular data processing activities tend not to be black and white. They require interpretation and create uncertainty. This means organisations have to protect themselves against litigation arising from unintended non-compliance with the terms, which adds the cost of insurance. The more complex the scheme, the greater this friction.

Second, the clauses within a free licence always include one that prevents the organisation from undercutting the original supplier of the data and selling it on to large organisations. Necessarily, this places restrictions on the services an organisation can offer and the business model it can adopt. It might be unable, for example, to build an API that adds value by providing different cuts of data on demand, or if it does, its price might be determined by additional fees from the original supplier. Licensing restrictions limit what kinds of organisations can benefit from the data, and their ability to make money. And, as above, uncertainty about the scope of the restrictions (and whether the originating organisation will ever actually act on them) introduces risk and cost.

Third, while these costs and barriers are bad enough with one set of data, add another from a different supplier with another set of conditions, and you have to make sure you meet both of them. Sometimes this will be impossible (for example combining OpenStreetMap data, available under a share-alike licence, with non-commercial data). Add a third or fourth and you’re dealing with a combinatorial explosion of T&C intersections to navigate.

In part, the problems with a differential pricing approach for data arise from the unique characteristics of data and the data economy.

  • data is endlessly manipulable, which makes any attempt to list all the ways in which it can and cannot be used necessarily complex

  • the greatest opportunities for innovation and growth are within infomediaries who slice and dice and add value to datasets; they need freedom to thrive

  • added value usually comes from the network effects of combining multiple datasets; but if there’s friction inherent in bringing datasets together, those same network effects will amplify that friction as well

It’s not surprising that people who are used to selling things other than data reach for “free licences for startups” as a solution to lower costs for smaller organisations. It seems an obvious thing to do. It might work for other kinds of products. It doesn’t work for data.

Opening data is better than not opening data

So far I’ve focused almost exclusively on the impacts of opening and not opening data on innovation and the ability of small businesses to thrive in comparison to big tech. I’ve discussed why selling or restricting access to and use of data favours big tech over and above the advantages they already receive from amassing more data.

If you like to think of playing fields, it’s true that opening data lifts big tech’s end of the pitch, but overall, it lifts the startup’s end more.

There are a few other considerations it’s worth quickly touching on.

Do we want big tech to use high quality data?

Earlier I wrote about how big tech makes its own data when it can’t get hold of official sources. They stitch together information from remote sensors, from what people upload and from explicit corrections, apply clever machine learning techniques, and come out with remarkably good reconstructions.

But “remarkably good” is not comprehensive. It is often skewed towards areas of high user demand, whether that’s cities rather than countryside or favouring the digitally included.

When big tech uses its own data rather than official data to provide services to citizens, it favours the enfranchised. It exacerbates societal inequalities.

It can also cost lives. I talked about Google’s address data and the doubts about its accuracy particularly outside towns and cities. Ambulances have started using it. When they are delayed because they go to the wrong place, people can die. Restricting access to address data forced Google to spend a bunch of money to recreate it, but who is actually suffering the consequences?

Not all services require the same level of detail in data. The impact of data errors is higher for some products than for others. But in general, we should want the products and services we use to be built on the highest quality, most reliable, most authoritative, timely, and comprehensive data infrastructure that we can provide. When we restrict access to that by not permitting companies with massive user bases amongst our citizenry to use that data, we damage ourselves.

What about big tech’s other advantages with data?

I’ve focused much of this on the advantage big tech enjoys in having access to data. As I touched on earlier, they also have an advantage in capability. If there’s a real desire to equalise smaller companies with big tech, they need support in growing their capability. This isn’t just about skills but also about tool availability and the ease of use of data.

Anything that helps people use data quickly and easily removes friction and gives a disproportionate advantage to organisations who aren’t able to just throw extra people at a problem. Make it available in standard formats and use standard identifiers. Create simple guides to help people understand how to use it. Provide open source tools and libraries to manipulate it. These are good things to do to increase the accessibility of data beyond simply opening it up.

How do we make this benefit society as a whole?

I’ve also been focusing deliberately on the narrow question of how we level the playing field between small organisations and big tech. Of course it’s not the case that everything small organisations do is good and everything big tech does is evil. Making data more open and accessible doesn’t ensure that anyone builds what society as a whole needs, and may mean the creation of damaging tools as well as beneficial ones. There might even (whisper it) be issues that can’t be solved with tech or data.

That said, the charities, community groups, and social enterprises that are most likely to want to build products or produce insights with positive social impact are also likely to be small organisations with the same constraints as I’ve discussed above. We should aim to help them. We can also encourage people to use data for good through targeted stimulus funding towards applications that create social or environmental benefits, as we did in the Open Data Challenge Series that ODI ran with Nesta.

Making it fair

When you dig into why people actually cite increasing inequality between data businesses as a reason for not opening data, it usually comes down to it feeling unfair that large organisations don’t contribute towards the cost of its collection and maintenance. After all, they benefit from the data and can certainly afford to pay for it. In the case of government data, where the public is paying the upkeep costs, this can feel particularly unfair.

It is unfair. It is unfair in the same way that it’s unfair that big tech benefits from the education system that the PhDs they employ went through, the national health service that lowers their cost of employment, the clean air they breathe and the security they enjoy. These are all public goods that they benefit from. The best pattern we have found for getting them, and everyone else who enjoys those benefits, to pay for them is taxation.

Getting the right taxation regime so that big tech makes a fair contribution to public goods is a large, international issue. We can’t work around failures at that level by charging big tech for access to public data. Trying to do so would be cutting off our nose to spite our face.

What can be done from a data perspective, whether a data steward is in the public sector or not, is to try to lower the costs of collection and maintenance. Having mechanisms for other people and organisations to correct data themselves, or even just highlight areas that need updating by the professionals, can help to distribute the load. Opening data helps to motivate collaborative maintenance: the same data becomes a common platform for many organisations and individuals, all of whom also contribute to its upkeep, just like Wikipedia, Wikidata and OpenStreetMap. With government data, this requires government to shift its role towards being a platform provider; legislation.gov.uk’s Expert Participation Programme demonstrates how this can be done without compromising quality.

Big tech and data monopolies

I have focused on big tech as if all data monopolies are big tech. That isn’t the case. What makes a data monopoly a monopoly is not that it is big and powerful and has lots of users, it’s that it has a monopoly on the data it holds. These appear as much in the public sector as the private sector. Within the confines of the law, they get to either benefit exclusively or choose the conditions in which others can benefit from the data they hold.

Some of that data could benefit us as individuals, as communities and as societies. Rather than restricting what data of ours data monopolies can access, another way to level the playing field is to ensure that others can access the data they hold by making it as open as possible while protecting people’s privacy, commercial confidentiality and national security. Take the 1956 consent decree against AT&T’s Bell System as inspiration: it forced Bell to license its patents royalty-free, which increased innovation in the US over the long term, particularly among startups.

There are various ways of making something similar happen around data. At the soft, encouraging end of the spectrum there are collaborative efforts such as OpenActive, or making positive noises whenever Uber Movement helps cities gain insights, or Facebook adds more data into OpenStreetMap or supports EveryPolitician. At the hard regulatory end of the spectrum, we see legislation: the new data portability right in GDPR; the rights given under the Digital Economy Act to the Office for National Statistics in the UK to access administrative data held by companies; the French Digital Republic Act’s definition of data of public interest; the Competition & Markets Authority Order on Open Banking.

We should be much more concerned about unlocking the huge value of data held by data monopolies for everyone to benefit from — building a strong, fair and sustainable data infrastructure — than about getting them to pay for access to public data.

Opening up authoritative, high quality data benefits smaller companies, communities, and citizens. There’s no doubt that it also benefits larger organisations. But attempts at ever more complex restrictions on who can use data are likely to be counterproductive. There are other ways of levelling these playing fields.

Data portability

Dec 26, 2017

Earlier in the year I went to an OECD workshop on enhanced access to data. The workshop covered four general themes: open data, data sharing communities, data marketplaces and data portability. The discussion on the implications of data portability was particularly interesting.

Data portability is a new right under the EU-level General Data Protection Regulation (GDPR), due to come into force in May 2018, a version of which will be written into UK law through the Data Protection Bill currently going through parliament.

The data portability right is a version of the existing data access right (which gives you the right to get hold of data about you held by an organisation). It is more powerful, in that it gives you the right to have that data given to you, or to a third party of your choice, in a commonly used machine-readable format; but it is narrower in scope, in that it doesn’t apply to everything the organisation captures about you. It only applies to data captured automatically, and when it is either explicitly provided by you (eg when you fill in a form on a website) or generated as part of your activity (eg the records of your bank transactions). It does not apply to data that is inferred about you based on this data (eg if they’ve guessed that you’re gay or pregnant) or that they’ve got about you from other sources (eg your credit rating).

Why should we care about data portability?

There are three main reasons for the data portability right:

  1. Providing more transparency than is currently provided. At the moment, exercising your data access right can simply lead to receiving pages and pages of printed information. With data portability, people will be able to search within and analyse the data that organisations hold about them.

  2. Helping people to switch service providers without losing their histories. For example, if I wanted to switch from tracking my physical activity using Strava to using RunKeeper, the data portability right would guarantee I could get hold of the data held about my activities by Strava for import into RunKeeper.

  3. Supporting the growth of data analytics third party services that provide insights based on data. These include services oriented around providing deeper insights into particular types of activity (eg helping you to reduce your energy usage) or that link together different types of activity (eg bringing together your transport spend with the routes that you travel).

Transparency is the main reason that the data portability right was originally put into place: it is, after all, an extension of data protection legislation. However it’s unknown whether many people will exercise the data portability right for transparency purposes. On the other hand, under GDPR people will no longer have to pay to exercise their data access right. It is likely that this change will have a larger impact on the number of people exercising their right to find out what information organisations hold about them.

Support for switching is seen as a secondary positive effect, reducing lock-in and increasing competition. However, we switch services only rarely, and data portability is only one of the many barriers in place when switching. Analogies with mobile number portability (ie your ability to keep your mobile number when you move supplier) are ill-founded: if you switch your bank account you still have to update the details held by everyone who has your old account details, and data portability can only go so far in helping with this (eg in providing a list of standing orders and direct debits to recreate).

The growth of third party analytics services is likely to be the long-term large-scale side effect of the data portability right. The vision is that we could have applications that help us, both directly and through our carers and advisors, make better decisions by integrating data from across our lives. Imagine, for example, a grocery shopping app that takes into account your previous purchases, your travel plans, your current balance and your weight to suggest what to buy that week. Or a service that helps your doctor prescribe the right intervention based on accurate information about your diet, alcohol intake and activity.

It is worth exploring how these tools might manifest in a little more detail, but first let’s have a little reality check.

What makes data portability hard?

Getting the benefits of data portability won’t be quite as straightforward as might be imagined. The extent to which it’s useful depends a lot on how organisations choose to implement it.

First, organisations that receive a request under the data portability right have a month to respond. This is arguably a reasonable period to wait if the request is made for transparency reasons. It would cause some pain when switching suppliers (but people are likely to experience pain doing that anyway). But a delay of this length really undermines the ability of data analytics services to provide timely and useful services. One could imagine, say, a telecoms company providing up-to-date information about your location and mobile usage on their own site while only providing data that is a month old to competitor analytics services. The month window for response is there to enable smaller organisations to respond in an ad hoc way to requests rather than needing to invest in end-to-end technology. Large companies who anticipate a lot of requests will want to invest in automating responses to them, which should enable timely access, but will some deliberately build in a lag to their responses to retain a competitive advantage?

Second, the data portability right covers your right to get hold of data about you from an organisation but it does not provide any guarantees that that data can be imported into other services. One would have thought that competitor services would invest in making it easy for users to move to them by porting data from elsewhere, but this requires investing in tracking many moving targets (as the export formats used by competitors change over time) for a small proportion of potential users (given other switching barriers), particularly in unsaturated markets. Will competitors find it more worthwhile to invest in developing features that retain their existing users and win newcomers to the market? Will new users lower their risks by first trying out services that they know they can’t switch to later?

Third, while the data portability right requires data to be provided in a commonly used format, this by no means guarantees standardisation in data formats across particular sectors. Organisations might reasonably interpret the right as requiring the use of common syntaxes such as CSV, JSON or XML while leaving semantic interoperability completely untouched. For example, one supermarket might label a field in shopping transaction data “prodName” and another “PID”; each might use a completely different set of names for the same products, different categorisation schemes, different codes for suppliers and so on. Without standardisation, any service that wants to use data from a particular source will have to write a custom parser. Will organisations within particular sectors be motivated to collaborate on creating standards that provide greater interoperability?
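To make that concrete, here is a minimal Groovy sketch of the kind of per-source mapping a third-party service would have to write and maintain. The export structures, field names and product codes are invented for illustration; only the “prodName” versus “PID” contrast comes from the example above.

import groovy.json.JsonSlurper

// Two hypothetical supermarket exports describing the same kind of purchase:
// both are valid JSON in a "commonly used format", but they share no field names.
def exportA = '{"transactions": [{"prodName": "Semi-skimmed milk 2L", "price": 1.20}]}'
def exportB = '{"items": [{"PID": "MLK-2L-SS", "amountPence": 120}]}'

def slurper = new JsonSlurper()

// Each source needs its own hand-written mapping into a common internal structure.
def fromA = slurper.parseText(exportA).transactions.collect {
    [product: it.prodName, pricePence: (it.price * 100).intValue()]
}
def fromB = slurper.parseText(exportB).items.collect {
    [product: it.PID, pricePence: it.amountPence]   // still needs a product-code lookup
}

println fromA + fromB

Multiply that by every data source a service wants to support, and by every change those sources make to their export formats, and the cost of the missing semantic standardisation becomes clear.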

Fourth, there are questions about how the data portability right will be implemented securely. It is already common practice for third parties to access, and scrape, password-protected websites by asking users for their usernames and passwords. This is extremely bad practice from a security perspective as access can’t be limited or revoked easily, and because users frequently reuse passwords across multiple sites. Badly implemented, the data portability right could lead to a bonanza for phishers and identity thieves. Will organisations encourage their users to reveal their login details to get access to data under the portability right, or will they take the time to implement more sophisticated and secure ways of authenticating and authorising third party access such as OAuth?

Finally, the data portability right places control into the hands of individuals to decide with whom to share data about and from the services they use. If our experiences with the cookie law, privacy policies and website T&Cs teach us anything, it’s that many people are lazy and will simply click “I agree” on anything that stands in the way of accessing a service. Some of the products that request access to data under the data portability right will be bogus, actively created by identity thieves or to build marketing databases, or they may simply store data badly and thus increase the risks of security breaches. Will people be able to choose wisely which third party services to grant access to through the data portability right? Will existing or new consumer organisations build services to help them do so? Will regulators rise to the new challenges this creates?

How might data portability pan out?

Bearing these limitations in mind, there are a number of potential unintended consequences of the data portability right.

First, the data portability right may push towards a less innovative and competitive market. The creation of standards for data portability might push providers towards services that fit the “shape” defined by those standard data formats, but truly innovative services might not fit that shape. As a trivial example, traditional energy suppliers might not care about or provide information about who generated the energy they supply, whereas innovative energy brokers might consider this a key piece of information for customers who want to buy from a local wind farm. The data portability right requires standards to be useful, but whatever standards get created will need to be flexible enough to accommodate the different kinds of products that services might provide.

Second, rather than promoting competition, the data portability right may place even more power in the hands of the big tech companies who have the capacity, in terms of knowledge and resources, to take most advantage of it. For example, Amazon is already a threat to traditional retailers; it is also well placed to take advantage of the data portability right to import people’s shopping lists to AmazonFresh. Google already infers things about you through your and millions of other people’s search patterns and clickstream; it will be able to give much more personalised insights on your travel habits than a startup that hasn’t got that vast amount of data to draw on. There are many opportunities for startups and SMEs in providing data brokerage and user facing services, but the data portability right isn’t going to suddenly put them on a level playing field.

Third, rather than increasing our privacy and control, the data portability right could make importing data from elsewhere a natural part of signing up for a new service, resulting in data about us proliferating onto multiple services and out of our control. Consumer and privacy rights groups need to combine forces to put pressure on businesses to minimise their data greed and to increase the ability of consumers to understand the implications of and make good choices about porting data into other services.

Fourth, while the European Data Protection Supervisor may think that “one cannot monetise and subject a fundamental right to a simple commercial transaction, even if it is the individual concerned by the data who is a party to the transaction”, the data portability right will undoubtedly lead to the development of personal data markets. People will be encouraged to port data about themselves into personal data brokers, with the promise of control over use and a financial return when it is sold on. This in turn may lead to a future where access to data is determined by who can pay for it, accelerating knowledge, power and financial inequalities.

Finally, on a more positive note, the data portability right could lead to more people making the positive choice to donate data about themselves for good causes such as medical research. Research on public attitudes to data use indicates that people are happy for personal data to be used for societal benefits. Data portability could provide a mechanism for some charities and civil society groups to engage people in collective action.

Where are the gaps in data portability?

Finally, there are a few areas that the data portability right doesn’t tackle, where legislation could perhaps be extended or clarified through guidance.

First, the data portability right applies to natural people, and not to organisations. But organisations are heavy users of services; service providers capture data about them just as they do about people; and organisations would benefit just as much, if not more, from the benefits of being able to switch suppliers or receive data analytics support. The Open Banking initiative, which has data portability at its heart, has focused on benefits to small businesses of being able to find suitable financing more easily. While organisations don’t have a data portability right, the individuals within them do - will organisations start using their staff to front data requests in order to achieve the same benefits?

Second, while the data portability right could result in data donation for societal benefit as described above, it would be far easier to realise those benefits if researchers and statisticians were able to access data from a representative sample of service users, not just a biased subset of those savvy (and generous) enough to donate data. The Digital Economy Act gives the UK’s Office for National Statistics the power to require data from some public, private and third sector bodies, as long as doing so is consistent with the Data Protection Act. It will be interesting to see how the expectation of individual control over data use granted by GDPR interacts with this.

Third, while many speak about data portability in terms of providing access to “your data”, in reality data shared with third parties may include personal data about other people too. This might include data directly about other people in your social graph, or in your household, or with whom you transact through a peer-to-peer service. Similarly, it may include commercially sensitive data about businesses you frequent or charities you donate to. When analysed in bulk, data about a sample of the population becomes information about people who were not included directly in the analysis. For example, data about my shopping habits may be used to make guesses about the shopping habits of other middle class, middle aged mothers of two. Data about us is never only about us.

As data analytics and machine learning reach further into our individual lives, the choices we make as individuals about how data about us is shared and used, and indeed what we do while that data is being collected, have wider repercussions. They do not just affect the decisions that are made about us individually, but those that are made about others like us.

The data portability right provides us with a powerful positive ability to take advantage of the data others collect about us and new opportunities for innovators and campaigners. But it also pushes ever wider the door to a more surveilled society. It is hard to predict how it will affect the power dynamics between individuals and organisations, between incumbents and new providers, or between big tech and startups. Companies will need to cooperate, particularly around standards, for consumers to benefit. Regulators will need to watch closely how the right is implemented and the effects on the market. And we will need to take an ever more active role in questioning and holding to account everyone who uses data about us.

Acknowledgements

The discussion here is based heavily on the insights provided by Marc MacCarthy, Ruth Boardman, Lenard Koschwitz, Randi Flesland, John Foster, Babak Jahromi, Robin Wilton and the audience at the session on data portability at the OECD workshop on enhanced access to data. Thanks in particular to Christian Reimsbach Kounatze for organising it.

What would "data literature" look like?

May 19, 2017

My eldest daughter is now in secondary school and, while she enjoys and is good at Maths, what she really loves studying is History and English. Watching the critical thinking and analysis skills that she is learning and using for those subjects, I have started to wonder if we should be approaching data literacy from a different angle.

The need for children and adults to be equipped with data skills is well recognised. The Nesta paper Analytic Britain: Securing the Right Skills for the Data-Driven Economy contains some recommendations, for example. However, much of this work focuses on the development of what I would frame as data science skills: the basic skills like the ability to clean data, analyse it, display it in graphs and maps, and the more advanced skills of machine learning and interactive visualisations. Data literacy becomes equated with the ability to do things with data.

But for me, data literacy, and the skills we all need to have in our policymaking, businesses and lives, go beyond handling data. We need to know what data is capable of (even if we can’t do those things ourselves). We need to understand the limits of data, the ways it can be used for both good and ill, the implications that has on our lives and society. Understanding these things would help us use data well in government, business and our day to day lives and have more informed debate about how we use data in society.

You may remember from your own childhood studying both English Language and English Literature. English Language focuses on reading and writing, the production of material, the manipulation of language. English Literature focuses on the study of English in use, the material produced by different authors, their use of different techniques, the context in which they produced their works and the impact their work had. The two areas of study feed on each other: producing poetry enables you to understand poetry as a form, and studying great poems improves your own technique. But the focus of each is distinct. We expect children to be able to read and write when they leave school. We also expect them to understand how others’ writing has contributed to our culture and society.

Could we apply the same approach to data? Children are already taught Data Language as part of the Maths curriculum. They are taught how to collect data, record it, create basic statistics, make charts and graphs from it, even in primary school. But what about Data Literature?

What if children were taught about Florence Nightingale’s use of data? They could unpick the method of collection, the birth of new forms of visualisation and the use of data for argument and persuasion and change. They could examine the context of Nightingale’s work at the time and the repercussions through to the present day. They could create new works from her data, put together new visualisations and invent modern-day newspaper stories.

They could examine the works of great modern day data visualisers and compare and contrast their works around particular key events, such as the Iraq war or the 2016 presidential election, or on thematic topics such as climate change. They could examine commonalities in form - citation of sources, provision of values - as well as differences in style and expression. They could produce their own visualisations in the style of one of the greats, or simply copy a work to see how it’s done.

They could look at the use of data in reports, from official statistical releases, through academic papers, to sports commentary. They could look at how these have evolved over time, and the varying ways in which numbers and statistics can be used to inform and substantiate a story that is being told. They could look at the choices made about what numbers get quoted in such stories, and have exercises where they select different numbers or use different rhetorical devices (eg “almost 20%” vs “less than 20%”) to reach a different conclusion.

Children could be taught the history of census taking, from the Roman census that reportedly led to at least one birth in a stable, through the Domesday Book that redistributed land, to the modern day. They could examine different forms of census taking and the way in which the data is used. But they could also examine the way in which census taking, or indeed the gathering and use of any data, can exert power and change reality.

There are many other topics that would make rich study material: the art of fact checking; the role of open data in government transparency and accountability; the data flows in adtech; conversational interfaces with data such as Siri and Alexa; surveillance and secret data; personalisation and data ownership in smart devices.

I am not an educationalist, but I think that these kinds of topics would equip children with a much better understanding of what data really means to society. And I think it taps into the skills that those who lean towards the arts and social sciences enjoy exercising: skills such as critical thinking, context awareness and artistic appreciation. There are people who are turned off data because they don’t enjoy maths. This provides a different route to reach them.

I am sure there must be people thinking of and doing this already. I know of the Calling Bullshit course, for example. What else is there? Does this idea have legs? How could we advance it? Let me know at jeni@theodi.org.

Adding data trading to an agent-based model of the economy

May 15, 2016

It’s been a while since I gave an update about my attempt to build an agent-based model for the information economy. That’s partly because I got distracted crowdsourcing election candidate data and results for Democracy Club. It’s also partly because of this:

you think success is a straight line but it's actually a squiggle
Image from Demetri Martin’s “This is a Book”

You may recall that I constructed a basic agent-based economic model and added trade to it, based on Wilhite’s Bilateral Trade and ‘Small-World’ Networks. Then I did some sensitivity analysis on it to check that the assumptions that I’d made in coding it all up weren’t changing the outputs in any major way.

My next step was to add DATA to the mix. In the process I realised that the model wasn’t mirroring reality closely enough for me to draw conclusions from it.

Creating an economic model that includes data

Adding data to the agent-based model is pretty straightforward. It’s just the same as FOOD. Each Firm has some initialData which it can supplement either by trading or by producing (based on its dataPerStep productivity) to update its currentData running total. A Firm’s utility is calculated as DATA x FOOD x GOLD rather than just FOOD x GOLD as in the original scenario.
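In code terms the change is small. Here is a minimal sketch with the Repast scaffolding stripped away; the field names follow the ones above, but the class itself is a simplified illustration rather than the actual model code.

// Simplified Firm: DATA is produced and stored just like FOOD and GOLD,
// but it adds a third factor to the utility function.
class Firm {
    int currentFood, currentGold, currentData
    int foodPerStep, goldPerStep, dataPerStep

    def produceData() { currentData += dataPerStep }

    // Utility was FOOD x GOLD; with data in the mix it becomes DATA x FOOD x GOLD.
    def currentUtility() { currentData * currentFood * currentGold }
}

def firm = new Firm(currentFood: 10, currentGold: 20, currentData: 5, dataPerStep: 2)
firm.produceData()
assert firm.currentUtility() == 7 * 10 * 20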

In my initial model, the price for DATA is calculated in the same way as the price for FOOD. This isn’t fair, however, because unlike with FOOD, when DATA is sold the seller doesn’t lose any DATA. (The impact of this essential difference between data and physical goods is the thing that I wanted to explore in these models.)

If we take the same example as I gave when I added trade to the basic model, but this time with DATA rather than FOOD, this is how the trade works between a Firm that starts off with 30 GOLD and 5 DATA and a Firm that has 10 GOLD and 15 DATA. The price is set, as it would be if they were trading FOOD, to 2 GOLD for each DATA. At that price, the exchange goes like this:

GOLD_A  DATA_A  U_A  mrs_A     GOLD_B  DATA_B  U_B  mrs_B
  30       5    150   6.00       10      15    150   0.67
  28       6    168   4.67       12      15    180   0.80
  26       7    182   3.71       14      15    210   0.93
  24       8    192   3.00       16      15    240   1.07
  22       9    198   2.44       18      15    270   1.20
  20      10    200   2.00       20      15    300   1.33
  18      11    198   1.64       22      15    330   1.47

In the original example, with FOOD, both Firms gain from the transaction up until 10 GOLD have been traded for 5 FOOD, at which point both have 20 GOLD, 10 FOOD and a utility of 200, an increase of 50 each. If they keep trading beyond that point, their utilities start to go down.

When DATA is involved, however, the Firm that is selling DATA does much better out of the transaction. Every step of the trade increases its utility because it is always simply adding GOLD rather than reducing its stock of DATA. So while the buyer’s utility rises from 150 to 200 in the trade, the seller’s utility doubles from 150 to 300.
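To check the arithmetic, here is a short Groovy sketch that replays the walkthrough above (two goods only, as in the table; in the full model FOOD would be a third factor in the utility). The price of 2 GOLD per DATA is consistent with Wilhite’s rule of dividing the pair’s total GOLD by their total DATA, (30 + 10) / (5 + 15) = 2.

// Replay the DATA-for-GOLD walkthrough above: Firm A buys DATA from Firm B at
// 2 GOLD per DATA, and B keeps its DATA because selling data doesn't deplete it.
def a = [gold: 30, data: 5]
def b = [gold: 10, data: 15]
def price = 2   // consistent with (a.gold + b.gold) / (a.data + b.data) = 40 / 20

def utility = { f -> f.gold * f.data }   // two-good utility, as in the table
def mrs     = { f -> f.gold / f.data }   // marginal rate of substitution

7.times {
    printf("A: %2d GOLD %2d DATA U=%3d mrs=%.2f | B: %2d GOLD %2d DATA U=%3d mrs=%.2f%n",
           a.gold, a.data, utility(a), mrs(a), b.gold, b.data, utility(b), mrs(b))
    a.gold -= price; a.data += 1   // A pays and gains a copy of the DATA
    b.gold += price                // B gains GOLD but loses no DATA
}

Running it prints the seven rows of the table: A’s utility peaks at 200 while B’s keeps climbing to 330.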

Figuring out what a “fair price” for DATA would actually be, within this model, is something I want to come back to, but I thought I’d run the model with the price set in the same way as the price for FOOD is set, to give a baseline.

An economic model that includes data increases trading

Including data in the agent-based model does make some obvious changes. The first thing that’s really apparent is that compared to the FOOD-GOLD model, a lot more trading goes on.

In the FOOD-GOLD model, each Firm initiates an average of 1.25 trades, with a range of 0 to 12. Over the 5 runs, there are a total of 3182 trades.

In the DATA-FOOD-GOLD model, each Firm initiates an average of 3.54 trades over the 20 ticks, with a range of 0 to 14, and a total of 8893 trades over the 5 runs. This is biased heavily towards DATA trading, with each Firm averaging 2.61 DATA trades (ranging from 0 to 13) and 0.93 FOOD trades (ranging from 0 to 7). About 74% of the trades that go on in this model involve DATA.

Having done this, I realised that the increase in trading could just be because there’s an additional good to trade, namely DATA. So I created an alternative COAL-FOOD-GOLD model where there’s again an additional good, but one that operates exactly like FOOD.

In the COAL-FOOD-GOLD model, each firm initiates an average of 1.59 trades, with a range of 0 to 15 and a total of 3971 trades over the 5 runs. As you’d expect, these are pretty evenly split between COAL and FOOD: 49% of the trades involve COAL.

So the increase in trading is partly due to there being more goods to trade, but mostly because of the unique nature of DATA.

Price stabilisation with data trading

The price stabilisation graphs for food and for data are below.

price stabilisation for food

price stabilisation for data

Both prices stabilise at around 1 GOLD over time, but the prices for FOOD are a lot more variable than those for DATA. My guess is that this is because there’s less trading of FOOD than there is of DATA.

A look at inequality

One of the things that I’m particularly keen to examine in this model is whether data being in the mix changes the inequality in the set of Firms in the economy.

One thing to examine here is the relationship between the initial utility of each Firm and the final utility of each firm. In this version of the model, the initial utility is randomised (rather than being based on how much the Firm can produce, or all being initially equal). You’d expect a small correlation between what you start with and what you end up with, but not a large one as 20 steps is plenty of time to trade or produce your way out of your starting position. Here are the correlations:

model            correlation
FOOD-GOLD            11%
COAL-FOOD-GOLD       19%
DATA-FOOD-GOLD       23%

So there is some evidence that, when data is added to the mix, the starting condition of each Firm is more influential than it would otherwise be, but it’s still not a very strong correlation.

The other thing I looked at was the Gini coefficient of the economy, which is a measure of how unequal a society is, with 0% being perfect equality and 100% being perfect inequality (one person holding all the wealth). (For reference, the UK’s Gini coefficient is about 34%.)
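For reference, the coefficient itself is straightforward to compute from the Firms’ final wealth or utility. Here is a minimal standalone Groovy sketch using the pairwise-difference definition; it is not the instrumentation in the actual model.

// Gini coefficient of a list of wealth values: half the mean absolute difference
// between every pair of values, divided by the mean. 0 is perfect equality;
// the value approaches 1 (100%) as one member holds everything.
def gini = { List values ->
    def n = values.size()
    def mean = values.sum() / n
    def totalAbsDiff = 0.0
    values.each { x -> values.each { y -> totalAbsDiff += Math.abs(x - y) } }
    totalAbsDiff / (2 * n * n * mean)
}

assert gini([1, 1, 1, 1]) == 0      // perfect equality
println gini([0, 0, 0, 100])        // 0.75: one firm holds all the wealth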

model            initial Gini coefficient   final Gini coefficient
COAL-FOOD-GOLD             56%                       33%
DATA-FOOD-GOLD             56%                       36%

The Gini coefficients are roughly the same. But the thing about this result that makes me question whether the model is accurate is the fact that the Gini coefficients are decreasing from the initial state to the final state. This isn’t the case in the UK or the US, for example, where inequality is growing, and if you look at the global Gini index you’ll see it’s been increasing over time.

If the Gini coefficient in the model is decreasing, that’s a sign that the model isn’t properly reflecting inequalities that arise in real economies. That would probably be fine if I didn’t want explicitly to study inequalities. Given I do, it feels like I have to refine the model a bit more to make it better reflect reality so that we can draw conclusions from it.

Next steps

First, I need to add a mechanism to the model to measure the Gini coefficient over time. I’m currently only measuring the Gini coefficient at the very beginning (where wealth distribution is randomised) and at the very end (after 20 ticks of trading). It might be that the Gini coefficient goes down rapidly during the price stabilisation phase and then starts increasing, and therefore the model is accurately reflecting the increase of the Gini coefficient over time, once it gets going. I need to monitor it on each tick in order to work that out.

Then, if the Gini coefficient isn’t increasing, I need to add some mechanisms to the mix that are likely to increase inequality. Things that I’ve thought of are:

  • reducing FOOD and/or GOLD by a fixed amount each tick, to mirror the minimum expenses a Firm incurs simply for existing; if I do this I have to add the possibility of Firms failing (and new Firms being created) or going into debt (there’s a rough sketch of this after the list)
  • adding a mechanism that enables those Firms that have more GOLD to get more GOLD, for example by lending at interest to other Firms or by investing in increasing their own productivity
  • breaking up the economy into smaller sub-economies that only trade with each other, with some connecting Firms that can trade across those sub-economies; this was a variant in Wilhite’s original paper but I don’t know if it had an effect on inequality
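As a rough illustration of the first of these ideas (entirely simplified, with made-up numbers, and no claim that this is the right design), a per-tick upkeep charge with Firm failure might look something like:

// Hypothetical per-tick upkeep: each Firm pays a fixed amount of FOOD and GOLD
// just for existing, and fails (drops out of the economy) if it can't cover it.
def UPKEEP = 2

def firms = [
    [currentFood: 10, currentGold: 1],
    [currentFood: 5,  currentGold: 8],
]

def applyUpkeep = { List economy ->
    economy.removeAll { firm ->
        firm.currentFood -= UPKEEP
        firm.currentGold -= UPKEEP
        firm.currentFood < 0 || firm.currentGold < 0   // true means the Firm fails
    }
}

applyUpkeep(firms)
assert firms.size() == 1   // the Firm with only 1 GOLD couldn't cover its upkeep

Whether failed Firms should be replaced by new entrants, or allowed to go into debt instead, is exactly the kind of design choice that would need testing against its effect on the Gini coefficient.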

If you have any other ideas, let me know.

Trying it out

As before, if you want to try out where I’ve got to so far, I’ve committed the source code to Github and there are instructions there about how to run it. Any comments/PRs will be gratefully received. The code is quite messy now and could do with a refactor.

I’ve also put all the raw data generated from the runs described in this post in spreadsheets. These are:

Feel free to copy and create your own analyses if you don’t want to run the models yourself.

Sensitivity analysis on an agent-based economic model

Apr 1, 2016

Previously, in my quest to build an agent-based model for the information economy, I constructed a basic model and added trade to it, based on Wilhite’s Bilateral Trade and ‘Small-World’ Networks.

From doing that, we’ve seen that price stabilisation occurs over roughly the first 10 cycles, with about 38% of the 500 agents being pure producers, and about 5% only responding to trade requests from others.

There are a few parts of this model where I’ve made choices that might influence the outcome. To test these out, I want to do a sensitivity analysis to double-check that I’m not drawing unwarranted conclusions from single runs.

Setting up Repast to do multiple runs

Repast can be used to do batch runs of a particular model, spawning several instances with different starting conditions and therefore different end points.

Getting this working had a few false starts. Batch runs need to include code that stops the run after a set number of cycles. This code needs to be placed in the src/informationeconomy/context/SimBuilder.groovy file, which you don’t normally see when viewing the Package Explorer in Eclipse. Getting the simulation to stop after 20 iterations requires a simple line:

public class SimBuilder implements ContextBuilder {
	
	public Context build(Context context) {
        ...	
		RunEnvironment.getInstance().endAt(20)
		...
	}
}

With that in place, the Batch Run Configuration tool enables you to run any number of concurrent “worlds”. I ran five with different random seeds. The following price stabilisation graph shows that they all reach price stabilisation after about eight iterations (pale lines are individual runs; stronger lines are the average over these runs):

graph showing price stabilisation as max and min prices converge on a mean over around 8 ticks

With 20 ticks per run, about 43% spend all their time producing goods and 57% trade in some way. About 6% never initiate trade themselves but just respond to offers from other agents.

Initial FOOD and GOLD

The first area where I want to carry out some sensitivity analysis is in the initial amount of FOOD and GOLD that each agent has. In the runs described above, each agent starts with the same amount of FOOD and GOLD that they can make in a turn. There are two other options that I want to test out: one where every agent starts with one FOOD and one GOLD, and one where each agent starts with a random amount of FOOD and GOLD (between 1 and 30).
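In code, the three starting conditions amount to nothing more than different ways of initialising a Firm’s stocks; roughly like this (the strategy names here are mine, not anything in the model):

// Three ways of choosing a Firm's starting FOOD and GOLD (names are illustrative).
def rng = new Random()

def initialStock = { String strategy, int foodPerStep, int goldPerStep ->
    switch (strategy) {
        case 'match-productivity': return [food: foodPerStep, gold: goldPerStep]
        case 'one-each':           return [food: 1, gold: 1]
        case 'random':             return [food: rng.nextInt(30) + 1,   // 1 to 30
                                           gold: rng.nextInt(30) + 1]
    }
}

println initialStock('random', 5, 3)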

With all Firms initially having a random amount of FOOD and GOLD, there are slightly fewer pure producers (38%) and more Firms that only accept trades (10%). Prices don’t start as high and follow a smoother path to a later stabilisation (around 14 steps in), as shown here:

graph showing price stabilisation as max and min prices converge on a mean over around 16 ticks

As we’d expect, there’s no relationship between initial utility and final utility when the Firms’ initial utility is unrelated to their ability to produce goods:

graph showing final utility based on initial utility

Starting with one FOOD and one GOLD leads to more trading, with only 30% pure producers and 17% of Firms only accepting trades. Prices start higher (after no trading in the first step) but settle down in the same way as with the other kinds of starting conditions.

graph showing price stabilisation as max and min prices converge on a mean over around 16 ticks

Given the smoothness of the price stabilisation curve when Firms start with random amounts of FOOD and GOLD, I will use this version of the code going forward.

Randomising FOOD or GOLD production

On each step, each Firm currently has to decide whether to produce FOOD, produce GOLD, or trade. The code that determines which they choose to do has some built-in biases:

if (utilityMakingFood > utilityMakingGold) {
	if (trade['utility'] > utilityMakingFood) {
		action = makeTrade(trade)
	} else {
		currentFood += foodPerStep
		action = [ type: 'make', good: 'food', amount: foodPerStep, utility: currentUtility() ]
	}
} else if (trade['utility'] > utilityMakingGold) {
	action = makeTrade(trade)
} else {
	currentGold += goldPerStep
	action = [ type: 'make', good: 'gold', amount: goldPerStep, utility: currentUtility() ]
}

The Firm will only consider making FOOD if the utility of making it is greater than the utility of making GOLD. Similarly, it will only consider making a trade if the utility of trading is greater than producing either FOOD or GOLD. This should bias the Firms towards producing FOOD or GOLD, and specifically towards producing GOLD, all things being equal.

Under the initial configuration, where Firms begin with the amount of FOOD and GOLD that they can produce in a single step, 85% of Firms produce GOLD at some point, and 84% produce FOOD. Across the 5 runs, only 24 Firms produce only FOOD (never trading or producing GOLD), but even fewer produce only GOLD (2, over the 5 runs).

Under a randomised initial amount of FOOD and GOLD, 79% produce GOLD and 81% produce FOOD, with 41 only producing FOOD and 21 only producing GOLD over the 5 runs.

So I don’t think that the code is biasing the results towards producing GOLD, but it’s hard to tell whether it’s biasing away from trade. I’ve added a bit of randomness:

def randomlyTrue = random(10000) > 5000
if (utilityMakingFood > utilityMakingGold || (utilityMakingFood == utilityMakingGold && randomlyTrue)) {
	randomlyTrue = random(10000) > 5000
	if (trade['utility'] > utilityMakingFood || (trade['utility'] == utilityMakingFood && randomlyTrue)) {
		action = makeTrade(trade)
	} else {
		currentFood += foodPerStep
		action = [ type: 'make', good: 'food', amount: foodPerStep, utility: currentUtility() ]
	}
} else if (trade['utility'] > utilityMakingGold || (trade['utility'] == utilityMakingGold && randomlyTrue)) {
	action = makeTrade(trade)
} else {
	currentGold += goldPerStep
	action = [ type: 'make', good: 'gold', amount: goldPerStep, utility: currentUtility() ]
}

As anticipated, this makes very little difference. Over the five runs, one more Firm produces FOOD than previously, one more never trades, one more only produces FOOD and five more only produce GOLD. There is a more noticeable increase in the number of Firms that only receive (but do not initiate) trade, rising from 257 (10%) to 273 (11%).

Price stabilisation occurs as before, though the graph does show a more regular oscillation in maximum price over the first few ticks, compared to the slightly smoother trajectory shown in the previous graphs.

graph showing price stabilisation as max and min prices converge on a mean over around 16 ticks

All in all, the model does not appear to be sensitive to the biases in the code that determine how Firms choose what to do on each step. I will keep the less biased code.

Next steps

Next, it’s time to introduce DATA to the mix. My goal for the initial experiment is simply to replace FOOD with DATA, use the same formula to work out the price for DATA, but introduce the crucial difference between FOOD and DATA, namely that when you trade DATA, you do not lose it. I want to see what happens to price stabilisation in this scenario, and look at the kinds of Firms that emerge.

Trying it out

As before, if you want to try out where I’ve got to so far, I’ve committed the source code to Github and there are instructions there about how to run it. Any comments/PRs will be gratefully received.

I’ve also put all the raw data generated from the runs described in this post in a spreadsheet which you’re welcome to copy and run your own analysis over.