Agent-Based Model of the Information Economy: Initial Thoughts

Feb 9, 2016

There are several studies on the impact of open data, ranging from specific case studies through to macroeconomic studies that estimate the overall impact of greater availability of information.

Case studies are problematic because they tend to look only at effects around specific datasets or companies, and it’s hard to know how or whether these scale up to the economy as a whole. From what I’ve seen of the macroeconomic studies, on the other hand, they tend to be based on estimates created by multiplying big numbers together and don’t reveal who the winners and losers might be in a more open information-based economy.

Is it the case that opening data simply increases the gap between the information haves and have-nots, and that leads to wider economic inequality, or does everyone benefit when information is more widely available? Are there tipping points of availability at which we start realising the benefits of open data? What is the role of government in encouraging data to be more widely available and more widely used? To what extent should government invest in data infrastructure as a public good? How can local or specialist cooperatives pool resources to maintain data?

These are the kinds of questions we have answer in order to help frame public policy. While I can come up with and rationalise answers to those questions, I would prefer that they were based on something slightly more rigorous than a persuasive argument.

So I have been thinking recently about applying Agent-Based Modelling (ABM) or more specifically Agent-based Computational Economics (ACE) to the information economy. These seem to be promising techniques to get under the skin of the kind of impact that data might have on the variety of actors that there are within the economy, and to understand how public policy might affect that impact.

Now, I am not an economist, I do not have any background in agent-based modelling and it’s been a long time since I could justifiably call myself an artificial intelligence researcher. I’d like to learn about this field, and need to in order to pursue the goals described above. I hope other people will forgive my ignorance and mis-steps, and behave as this XKCD encourages:

Model design

I’ll share where my thinking is at so far.

The useful Tutorial on agent-based modelling and simulation contains a number of questions to help modellers focus on what they want to achieve:

What specific problem should be solved by the model? What specific questions should the model answer? What value-added would agent-based modelling bring to the problem that other modelling approaches cannot bring?

I’ve described above the kind of questions that I have in mind. Trying to be more specific, I’d like to explore:

  • What different groups or clusters of agents emerge in an information economy? Via an undergraduate dissertation by Gemma Hagen, I’ve found a paper by Allen Wilhite that talks about the emergence of specialisation in production or in trade. The Deloitte report on PSI talks about information holders, infomediaries, consumers and enablers: is there a model in which these roles emerge?
  • What difference does it make to overall productivity in an economy when there is a culture of data being open, available for anyone to access, use and share, as opposed to paid-for, available only to researchers or for non-commercial use, or completely closed?
  • How do share-alike licences propagate within an information economy? I would like to be able to work out whether we would achieve higher productivity by promoting share-alike licensing rather than public domain or attribution-based licensing.

An agent-based approach enables the study of emergent roles and enables us to tweak parameters to explore the difference that policy positions, capability building and emerging technologies might provide.

What should the agents be in the model? Who are the decision makers in the system? What are the entities that have behaviours? What data on agents are simply descriptive (static attributes)? What agent attributes would be calculated endogenously by the model and updated in the agents (dynamic attributes)?

Models of the circular flow of income contain additional agents. A two-sector model includes Households and Firms; a three sector model introduces Government. Given that I am particularly interested in the role of government and information held or gathered by the public sector, I think the model should eventually include Government, but I’m tempted to keep it really simple and only consider Firms in the first instance. Given at the moment I’m not so interested in the role of citizens, I think it shouldn’t include Households but instead abstract away their contribution.

In the Wilhite model there is only one type of agent (a Firm) but different Firms have different (randomly assigned) abilities to produce Goods of two types which you might characterise as FOOD and GOLD, which means that they naturally fall into different categories as producers or (if they’re not very good at production) traders.

So to start with, to keep it simple, I think the model should assign each Firm agent with a variable capacity to create DATA, FOOD and GOLD. We can imagine that agents that can create a lot of GOLD are services companies; those that can create a lot of FOOD are factories; those that can create a lot of DATA are technology companies.

Each agent will keep track of their DATA, FOOD and GOLD. To simulate the role of DATA within a Firm, I think that the amount of DATA held by the Firm should influence its productivity, namely its ability to create more FOOD or GOLD. Calculated from the amount of FOOD and GOLD held by the Firm is its wealth, which the Firm will attempt to maximise.

What is the agents’ environment? How do the agents interact with the environment? Is an agent’s mobility through space an important consideration?

I’m not sure whether to include proximity between Firms within the model. To keep it simple, and especially to mirror the fact that when it comes to exchanges of data, distance doesn’t matter, I’m tempted to simply have the agents co-existing in a soup.

What agent behaviours are of interest? What decisions do the agents make? What behaviours are being acted upon? What actions are being taken by the agents?

Mirroring the Wilhite model, I plan to let the agents choose between the following actions:

  • produce DATA, FOOD or GOLD
  • trade DATA, FOOD or GOLD with other Firms

In the initial model, they should choose between these actions based on which is going to increase their wealth the most. To ensure that agents ever want to trade for DATA, given that it doesn’t contribute directly to their wealth, they will need to have at least some predictive ability: to look forward to what they will be able to do if they have more DATA.

The other thing that makes DATA distinctive is that it is non-rivalrous. When it is traded, the original owner of the DATA doesn’t lose it. I’ll have to see how and whether that leads to sensible costing for DATA - I suspect that I’ll have to introduce knowledge of prior prices or earlier deals into the model to make it work.

How do the agents interact with each other? With the environment? How expansive or focused are agent interactions?

The Firms will interact with each other through trade. They can trade either DATA or FOOD for GOLD or vice versa, with a willing partner.

Where might the data come from, especially on agent behaviours, for such a model?

I aim to dig into the existing macroeconomic studies to see what estimates they use for the added productivity associated with access to data. I can also base behaviours on the Open data means business study we did at ODI. I’m open to other suggestions.

How might you validate the model, especially the agent behaviours?

There are statistics on the distribution of Firm sizes and sectors within the UK economy that could be used to validate the model. We can see whether instances emerge that match some of the case studies that we use. Again, I’m open to sugestions.

Modelling software

There’s a whole bunch of agent-based modelling software out there. I’ll admit when I looked at that list I was put off by the number of Java-based implementations, because it’s been something like 15 years since I last wrote Java.

I did like the look of ABCE because it’s Python (even though I don’t know Python very well either) and has a lot of the basic economic model that’s needed built-in. However, I was concerned that it might be hard to adapt it to an information economy because its primitive operations assume the transfer of physical goods, and about its relative immaturity.

Eventually I’ve settled on using Repast Simphony. The main reason for adopting it is the level of support that’s available, the easy-to-follow tutorial being a good example.

Final thought

I didn’t know anything about this field - not even that it was called agent-based computational economics - when I started looking on Friday night. By Sunday evening I had written my first agent-based model using the Repast tutorial and had a reasonably good idea about what to try first.

It’s at times like these when you realise quite how amazing the web is, how transformational. I can’t even imagine how I would have gotten started looking at this without it. But you also realise that we make this web through being generous with our time and knowledge, and not just on Wikipedia and Twitter. Take a look at these fantastic pages maintained by Leigh Tesfatsion on agent-based computational economics. From the HTML I guess they were started some time ago; the dates indicate they are still maintained. They contain tutorials, references to prior work, pointers to people working in the field. I have barely scratched the surface.

You also realise the tragedy of restricted access to academic research. The papers whose abstracts seem interesting but probably not £30 interesting. The book chapters that you can read snippets of, or pay £80 for. A world where those outside academia can’t easily access academic research is one where that research can have little impact.

I’ve created a repository on Github which I’ll use for code as it comes (currently all it has is the README and an open licence!). If you have any comments, advice or suggestions then please use the issue list there.

How to make fingerless gloves from old socks

Jan 24, 2016

Have you got socks that you love but that have developed holes in the soles? It’s really easy to turn them into cozy fingerless gloves like these, especially if they’re long and fluffy.

The finished article: fingerless gloves

1. Find some old socks

I love FatFace socks and particularly the fluffy ones with lovely deep colours that go with the rest of my clothes. Some are really long like these. And of course they develop holes.

Socks with holes in the soles

It always feels such a shame to throw them out when the vast majority of the sock is still fluffy. So, because I also really love fingerless gloves, I’ve taken to using old socks like these to make them.

2. Cut off the feet

Socks have two parts to them, the feet and the long bit that go up your leg. The join between the two parts that gives them the angle that enables them to fit well, is just at the heel, helpfully highlighted in a different colour on these socks.

The join of the foot with the rest of the sock

If you cut just above that join, you can cut a nice straight line across. It’s particularly easy with these socks because they have a stripe in the pattern to help guide you!

Cut off the foot of the sock

You can see after you’ve cut the foot away how a bend is introduced into socks.

After the foot is removed

Once the foot is removed you have a nice tube that forms the basis of your fingerless glove.

3. Add a thumb hole

Now it’s time to turn your attention to the top of the sock. Fold the tube so that you get the pattern to where you’d like it on the back of your hand. This sock has a fairly indistinct FatFace logo on it which I’m trying to position in the middle of the back of my hand.

Top of sock folded in half

You need to assess where to put your thumb hole based on the shape of your hand and how far along your fingers you want the glove to go when it’s on. I like mine to just cover my knuckles. Make a note of where your thumb needs to emerge from the glove. With these gloves it’s really easy because the pattern around the top coincides with where I want to put my thumb.

Finding the best place for the thumb hole

Now you need to make a vertical slit along the fold at the side of the soon-to-be-glove where your thumb should come out. Take that fold and pinch it vertically so that the middle is where your thumb should emerge.

Pinching the material to make a thumb hole

Cut about 1.5cm down into that pinch of fabric to make the vertical hole. It doesn’t matter if it’s a bit small, so err on the side of not cutting as much as that.

Cutting the thumb hole

Try on the glove to see whether the thumb hole is big enough. You can always cut a bit more up or down that slit to make it bigger.

Trying on the glove

That’s it! You’ve made your glove. No sewing required for these because the material is fleecy and doesn’t realy fray (and even if it does, you’ve just got double-life out of your socks!)

Thumbs up for fingerless gloves!

This post is dedicated to everyone at UKGovCamp 2016, especially Janet, Zaheda and Jacqui.

How might we use blockchains outside cryptocurrencies?

May 21, 2015

There’s been a bit of a buzz recently about using blockchains, one of the approaches used within the Bitcoin protocol, to support things other than the bitcoin currency.

Honduras is reportedly piloting a land register using blockchain technologies. On a smaller scale, the Isle of Man is piloting a registry of digital currency companies, again using blockchain. There are articles and videos that claim that blockchain can be used in the internet of things, for health records, and to track votes.

I want to dig a bit deeper and try to work out the practical application of blockchain for sharing registries, with a particular eye on open data. But before I can start looking at how those kinds of applications might work, I needed to understand how the Bitcoin blockchain works, at a technical level.

Note that this is written based only on a couple of days of research. I might well have missed things and certainly haven’t gotten into the arguments and subtleties within the blockchain community about how it should develop and grow.

What is a blockchain?

A blockchain is a chain of blocks, obviously. Here’s a picture of what a Bitcoin blockchain looks like:

Bitcoin Block Data

Each block in the chain contains a reference to the previous block in the chain (that’s the Prev_Hash in the picture) and some additional information which makes up the content of the block. The link to the previous block is what makes it a chain: given a block you can find all the information in all the previous blocks that led to this one, right back to what’s called the genesis block, the very first one in the chain.

Where is the blockchain stored?

The Bitcoin blockchain is managed by a network of distributed nodes. Every node has a copy of the entire blockchain (though from what I can gather some of the blockchain may be archived). New nodes can come and go, synchronising their copies of the blockchain against those of other nodes as they join the network.

"Centralised-decentralised-distributed" by 1983 (talk) (Uploads) - Own work. Licensed under CC BY-SA 3.0 via Wikipedia.

The fact that there are multiple copies of the blockchain on a distributed network of nodes is one of the powerful features of the blockchain. It makes the blockchain robust against nodes disappearing either temporarily or permanently, whether that’s due to connectivity issues, hardware failures, or interference. The more nodes there are in the network, the harder it is to disrupt the storage of the blockchain. There is no single point of failure, unlike in a centralised system with a single authority.

What does a block contain?

In Bitcoin, a block has a header and a list of bitcoin transactions (they are the Tx0, Tx1Tx3 in the picture of the blockchain). The header includes:

  • a pointer to the previous block (Prev_Hash in the picture)
  • a summary of the bitcoin transactions the block contains (technically, the Merkle hash of those transactions, the Tx_Root in the picture)
  • a timestamp that indicates when the block was created
  • a proof of the work that went into creating the block (that’s the Nonce in the picture)

The original paper on bitcoin doesn’t actually use the term “blockchain”. It talks about a timestamp server because that’s really what the blockchain is for: it provides irrefutable evidence that the data in a block existed at a particular time.

The actual timestamp given in a particular block isn’t necessarily to-the-second accurate. In fact there’s nothing stopping the timestamp of a particular block being before the timestamp of the previous block. If a block is in the blockchain, what is guaranteed is:

  1. the block was added at most two hours before its timestamp
  2. the block before this block in the chain existed at the time the block was created
  3. this block was added to the chain before the next block in the chain existed
  4. the data in this block (ie the bitcoin transactions) existed at the time the block was created

The hash of the header of the block, incorporating each of these pieces of information, becomes the identifier for the block which is used by the next block in the chain.

How do new blocks get added to the chain?

Every node in the network can add blocks to the chain. Every node is sent the data that needs to go into the blocks (eg the bitcoin transactions). Every node can package up that data into a block that links back to the last block in the chain that they are aware of. And every node can then transmit that block to the rest of the network to assert “this is the new chain”.

So what stops this being a completely chaotic situation where every node is adding blocks at the same point in the chain at the same time, forking it willy-nilly? What ensures that the nodes in the network have a consistent, consensus view of what the blockchain holds?

The first barrier is that nodes all operate under a set of protocol rules that determine what a valid block looks like. These rules include ensuring that each transaction is a valid transaction: that it is spending money that actually exists (technically, pointing to a previous matching transaction within the blockchain) and that it has been signed by the creator of the transaction. They also ensure integrity between transactions: that the same money isn’t being spent twice (technically, each output of a transaction can only form the input of one other transaction).

The other test for a valid block is where its nonce comes in. To be a valid block, the hash of the header of the block always has to start with a certain number of zeros. (Or, put another way, it has to be below a certain target number.) If you remember, the header contains:

  • the hash of the previous block in the chain
  • the Merkle hash of the transactions in the block
  • a timestamp
  • a nonce

So imagine you’re a node and you have a whole bunch of transactions that you want to put together into a block to add to the chain. You know the hash of the previous block in the chain. You can calculate the Merkle hash for the transactions you want to put in the block. You know what the time is. But what you don’t know, and what you have to work out, is what nonce will result in the header of your new block having a hash that starts with lots of zeros.

Now, the way that hashing works means that there is no way a node can algorithmically compute what nonce is going to give the block this magic property. Finding the nonce is like finding a Golden Ticket in Charlie & the Chocolate Factory. You might get lucky and find that the first nonce that you try gives you a valid hash. But realistically, you’re going to have to try lots and lots and lots of different nonces, like having to buy lots and lots and lots of chocolate bars in the hope of getting a Golden Ticket.

What do I mean by “lots and lots and lots”? Well, the current probability of managing to get a nonce that works per attempt is about 1 in a sextillion (10^21, about the number of grains of sand on all the beaches in the world). Specialised hardware performs at 1000 GH (giga-hashes), which means it tries about one trillion (10^12) nonces each second.

So nodes have to work really hard to create valid blocks. They have to try lots and lots and lots of different nonces, calculating lots and lots and lots of hashes. A valid block, whose hash begins with lots of zeros, is proof that the node that created it did lots of work, hence the nonce is sometimes called a proof of work.

How fast does the blockchain grow?

The number of zeros that a block’s hash has to start with, or the target number that it has to be below, determines the difficulty of creating a new block, and hence the average time that it will take. The smaller the target number, the more zeros the hash has to start with, the lower the probability of hitting on such a hash, and the harder it is to create a new block.

The difficulty of creating a new block is automatically regulated such that no matter how big the network of nodes gets, and no matter how much compute power they have, it will always take about 10 minutes for the whole network to find a new block.

Every 2016 blocks, which is about every 2 weeks, a new target number is calculated based on how long it took to create those 2016 blocks. If it averaged more than 10 minutes to create them, the target number goes up and the difficulty goes down; if it took less than 10 minutes to create them, the target number goes down and the difficulty goes up. So the difficulty adapts and varies over time based on the total compute power of the network.

This adaptation does give some limits to the growth of the blockchain and to the number of transactions the network can handle. A new block is added roughly every 10 minutes. Each block currently has a maximum size of 1MB, put in place to limit the amount of data that nodes have to pass around to stay synchronised. The size of transactions varies, but the commonly quoted maximum transaction rate is 7 transactions per second, equivalent to 4200 transactions per block. Bitcoin hasn’t reached these limits yet.

It also means that as the network grows, the total energy consumed grows, but the output stays the same. Researchers have estimated that the total power consumed by the Bitcoin network is something in the region of 3GW — about the same power consumption as Ireland.

Why do nodes make new blocks?

We’ve looked at the large numbers of calculations that nodes have to do in order to create new blocks. At the time of writing there are estimated to be about 100,000 nodes (called miners) in the Bitcoin network. All of them are trying to add new blocks to the chain; about every 10 minutes one of them will succeed.

Doing all this work requires computing power, and computers cost money to buy and to run. So what motivates the nodes to perform all these calculations?

The answer for the Bitcoin network is nodes participate because of the financial reward of creating a new block. This comes in two flavours:

  1. the node that creates a block gets a fixed reward, currently 25 bitcoins; this fixed reward halves every 210,000 blocks (roughly every 4 years)
  2. the node that creates a block picks up any transaction fees that the creator of the transaction assigns

Even with these rewards, the cost of specialised hardware and of electricity means that it can take more than a year to break even. It’s therefore common to have node pools which share the work, and the reward, to even up the odds somewhat.

What if two new blocks are made at the same time?

The work that nodes have to do to create a new block takes time. This lowers the probability that two blocks are made at the same time, but it’s still possible. When this happens, it creates a fork in the blockchain; some nodes might start building on one of those branches and some on the other.

But the protocol rules that govern how the nodes operate prevent this situation from going on for very long. Each node keeps track of all the branches, but nodes will only try to extend the longest branch they know about. (The length isn’t determined by the number of blocks but by the total amount of work that has gone into building the branch, based on the number of zeros at the beginning of the block hash.)

The transactions in shorter branches aren’t lost. When a node learns that another branch is longer than the one it has been working on, it checks its current branch for any transactions that aren’t in the new branch. These missed transactions are broadcast to the network again, and incorporated into new blocks on the chain.

So the blocks at the end of the blockchain could be retracted at any time. The network of nodes will eventually come to a consensus position on which block follows a given block, but at any one time there might be many branches in operation, potentially including different transactions.

And some of the transactions that are in the blockchain currently accepted by a node may eventually turn out to be invalid. If two branches contain transactions that cannot coexist (eg because they involve spending bitcoins twice over), one of the transactions will eventually be refused.

This uncertainty about the end of the chain leads to the common rule that a given bitcoin transaction isn’t confirmed until it is at least 6 blocks deep in the chain (ie after about an hour, given each block takes about 10 minutes to create). This number is based on understanding what it would take to change the contents of the blockchain, which leads on to the next question.

How can you change the contents of the blockchain?

As we’ve seen, each block contains a reference to the previous block, and that reference — the hash of the block — encodes both the content of the block and the reference to the previous block. In other words, if you changed the content of a block, you would change its hash. If you changed its hash, you would need to change the reference to it in the block after it in the chain. That in turn would change that block’s hash, which would mean changing the reference in the block after that, and so on to the end of the chain.

So changing the content of a block is hard to do when the block is followed by other blocks. Once a block is embedded within the chain, you have to unpick all the following blocks right back to the one you want to change, and then reconstruct those blocks again. As we’ve seen, creating a block is hard work: it takes lots of computational power. But more than that, while you are busy trying to recreate the chain you’ve unpicked, the rest of the network is merrily continuing to add even more blocks to that chain: you’re left playing catch-up.

The ease with which someone can alter the blockchain depends on the proportion of the network that they own. It’s possible to calculate the probability of someone being able to alter the blockchain depending on the proportion of the compute power that they control and the number of blocks back in the chain they are trying to amend.

For example, someone who controlled 20% of the compute power of the network has a 40% chance of being able to change the last block in the chain, but only a 4% chance of being able to change a block that is 5 blocks back in the chain. Someone who controls 10% of the compute power of the network has a less than 0.1% chance of being able to change a block 6 blocks back in the chain.

This last calculation is the basis for deciding when a transaction is confirmed: even if someone managed to capture 10% of the compute power of the network, they’d still have less than a 0.1% chance of altering a block that is 6 blocks deep in the chain. It’s judged unlikely that anyone would capture that much of the network, and they’re still unlikely to succeed even if they did, so 6 blocks feels like a safe bar for transactions in the Bitcoin network to need to meet.

How are people using blockchains for things other than bitcoin?

I haven’t yet dug into everything everyone is doing with blockchains but from my initial look there seem to be three main patterns for extending the use of Bitcoin blockchain to satisfy other requirements:

  • embedding data into Bitcoin blockchains
  • creating other kinds of cryptocurrency blockchains with different features
  • creating blockchains that aren’t based on cryptocurrencies at all

Embedding data into Bitcoin blockchains

Transactions within a cryptocurrency blockchain are simply blocks of data. It is possible to embed additional data within those transactions. Factom, the organisation implementing the pilot land registry for Honduras, inserts references to data managed on a completely separate network into the Bitcoin blockchain, using it as a timestamp server.

Creating other kinds of cryptocurrency blockchains with different features

There are various constraining features of the Bitcoin blockchain which limit its utility, such as the limits on the speed with which the blockchain can grow or the size and content of the blocks. Florincoin adds comments which developers can use to embed other data or references to other data. Alexandria is an application built on top of Florincoin that takes advantage of these comments to create a decentralised peer-to-peer library. The content itself (such as videos or music) isn’t embedded into the blockchain — it’s distributed via Bittorrent — but the metadata about that content is embedded in that way.

Creating non-cryptocurrency blockchains

Other developers are using blockchain principles to create blockchains that aren’t based on cryptocurrency exchange. I think this is what Guardtime are doing, using what they call a Keyless Signature Infrastructure (KSI), but I don’t yet have access to their Knowledge Base to be able to tell.

Questions, questions

Having understood more about how blockchains, and specifically the Bitcoin blockchain, works, I still have lots of questions about how it might be used for (open data) global registeries.

The unconstrained peer-to-peer nature of blockchain networks is important, but:

  • Are there other forms of network that might eg limit size of the network by limiting membership to trusted entities (no entity can be completely trusted)?
  • If not, then how do you incentivise nodes to join the network without it being associated with a cryptocurrency?
  • Or is the cryptocurrency side of blockchains essential to make them work?

The rules about the way in which currency transactions work (eg that the amount of money coming in to a transaction has to be equal to the amount of money coming out of the transaction) are built in to how Bitcoin nodes operate. It isn’t possible for an invalid transactions to be written into the Bitcoin blockchain (except by an error-prone node, whose outputs will be swamped by the rest of the network). When you move into other domains, however, validity rules might start to be untestable by machines, reliant on processes where human error can creep in.

  • How can you design blockchains in circumstances where mistakes can be made?
  • Or is this a natural limit for what blockchains are and aren’t suited for?

The Bitcoin blockchain records transactions rather than the state of each bitcoin. Getting a summary of who owns what is technically possible — all that data is public — but requires an additional layer of processing on top of the blockchain.

  • Do we similarly have to model things other than plain records in blockchains when we’re using them for other kinds of data?
  • How does this affect the ease of getting hold of the data itself, or creating aggregated summary statistics based on that data?

Because of the need for each node to copy the entire blockchain, it isn’t appropriate to embed large amounts of data within a blockchain. Hashing summaries into the blockchain guarantees integrity, but it doesn’t guarantee access.

  • How can blockchain be effectively used alongside distributed data storage mechanisms (such as Bittorrent)?

Blockchains are public ledgers. Although they can contain encrypted data, the validity of the blockchain relies on nodes being able to double check that the important aspects of the data match the hash summaries that they’ve been given.

  • How can blockchains be used for private data?

The way that the Bitcoin network is set up means that the total output of the network (in terms of adding blocks to the chain) remains constant regardless of the size of the network. This doesn’t seem scalable, either in terms of energy consumption or in terms of handling growing amounts of activity.

  • Which are the pieces of the Bitcoin protocol that can be tweaked to create something that is more scalable and energy efficient?

Any thoughts on, or pointers to documents discussing, any of these questions would be gladly received. Email me on or tweet me @JeniT.

A Visceral Sense of Privacy

Nov 21, 2013

I wrote this earlier in the year but for one reason or another never posted it.

I was at the excellent ORGCon yesterday. I was speaking about open data in my capacity as Technical Director at the Open Data Institute but I stayed for the rest of the conference in my capacity as an interested UK citizen. It is in the latter capacity that I write this.

When my 9-year-old daughter asked me what ORGCon was about, I explained that it was about the rights that we have when we are online. With the recent revelations about Prism in my mind, I asked her what she thought about people being able to read the emails or listen in on the Skype chats of people they thought might do harm.

She replied that she thought it was good, that it made her feel safe to think that the communications of “cyber bullies” (these are the most concerning “bad people” she thinks of) would be watched (and, presumably, punished if they were caught bullying).

From recent conversations I’ve had, this feels like a common reaction. I’d venture that you don’t need much technical knowledge or much paranoia to believe, when you stop and think about it, that the NSA and GCHQ can get access to much of our online communication. We shrug because we assume that the people they will want to listen in on are the baddies. And if it helps to stop those baddies doing bad things, so what?

In his opening keynote, Tim Wu talked about the need for us to have a visceral sense of our online rights — of our ownership of our data, of our privacy — in order to protect those rights.

Reflecting on my daughter, I’m struck that she does have a visceral sense of privacy, but it is not about privacy from the (benevolent, protecting) state. It is privacy from her (nosy, interfering) sister.

And I feel the same. I don’t have a visceral reaction to the NSA or GCHQ having access to my online (private) communication, but I certainly have that reaction when I think about it being seen by my coworkers, my friends, my family. This isn’t because of anything in particular that I’m worried about being discovered, just because I’d prefer to have some control over what I expose to the people with whom I interact most.

In his closing keynote, John Perry Barlow spoke about growing up in a town where everyone knew everything about everyone else, but no one brought up the past because everyone had skeletons. A type of mutually assured destruction. He said something I often hear the over-30s saying, that the young people growing up with Facebook are not concerned about their privacy.

If you read the focus group responses from the Pew Research Center’s Internet & American Life Project, you can see that this isn’t true. Teenagers might be sharing a lot of information through Facebook, but there are definitely some people they don’t want to see it (my emphasis):

[Friending my parents] sucks… Because then they [my parents] start asking me questions like why are you doing this, why are you doing that. It’s like, it’s my Facebook. If I don’t get privacy at home, at least, I think, I should get privacy on a social network.

In the open data world, we worry about publishing information about people who can be identified from that information. The ICO Anonymisation code of practice talks about a ‘motivated intruder’ test:

The ‘motivated intruder’ is taken to be a person who starts without any prior knowledge but who wishes to identify the individual from whose personal data the anonymised data has been derived. This test is meant to assess whether the motivated intruder would be successful.

The approach assumes that the ‘motivated intruder’ is reasonably competent, has access to resources such as the internet, libraries, and all public documents, and would employ investigative techniques such as making enquiries of people who may have additional knowledge of the identity of the data subject or advertising for anyone with information to come forward. The ‘motivated intruder’ is not assumed to have any specialist knowledge such as computer hacking skills, or to have access to specialist equipment or to resort to criminality such as burglary, to gain access to data that is kept securely.

The ‘motivated intruders’ that most people will be concerned about are those who are already known to them. (The main exception would be those people who for whatever reason have come to the attention of the press.) They are the suspicious spouse, the nosy neighbour, the interfering parents, the jealous colleague.

When I think about those people I know who have felt their privacy has been infringed, it has always been by people they know, behaving in extreme ways out of a desire to retain or regain control.

But those engaging on campaigns of harassment do not need additional personal information to make their target feel exposed. They just need to demonstrate knowledge of something.

“I know where you live.”

“I can hear you.”

It doesn’t have to be important. It doesn’t have to be private. The goal is to demonstrate to their victim that they are being monitored, constantly: their movements watched, their tweets read.

To invoke a visceral sense of your right to privacy, think of your friends and family reading your messages. To test anonymisation, think of a suspicious spouse aiming to prove infidelity. Even information we would never think of as private can be used against us.

It isn’t the state’s knowledge we fear, it’s that of those who already know us.

A Fair Web

Feb 3, 2013

The new membership of the W3C’s Technical Architecture Group (TAG), and some of the recent discussions on the TAG list about polyglot markup, have made me think about what the TAG should stand for and the role the TAG should play.

Fundamentally, the web is for everyone, whatever gender, whatever race, whatever sexual orientation, whatever visual or mental ability and so on. The web community should fight to keep the web open to all. And it should try to be a community that is open to all.

With same-sex marriages shortly being voted on by the UK Parliament, I have been struck, reading the recent threads about polyglot, how similar the arguments against polyglot seem to those used against homosexuality:

  • claiming that there is no use for polyglot, in the face of those who say they have use for it, is similar to denying that homosexuality exists, in the face of people saying that they are homosexual
  • stating that you can see no use for polyglot and therefore no one else should use it is similar to saying that since you are heterosexual, everyone else must be too
  • claiming that creating a Recommendation that describes polyglot will make people use it is similar to saying that talking about homosexuality will make people gay
  • saying that you don’t want to implement polyglot in a validator or editor is similar to being a priest who declines to marry gay people

By this comparison, those who argue that polyglot must be the only output anyone generates also has an analogy: someone arguing that all churches must marry only gay people.

I want the web community to be a fair and good society. To me the question about whether there should be a polyglot Recommendation is just the latest example of a need to ensure that our community is equitable.

The web is for everyone, whatever technology stack they use. The reason we have standards is to enable people to make their own choices about what they do in the privacy of their own servers. We don’t have to use the same libraries, so long as we implement the same standards. It doesn’t matter if you programme with Ruby or PHP or C# or Scala or XQuery, you can build a web application because we have the standards of HTTP, HTML, CSS, Javascript and so on.

Just as in wider society, we need to find compromises that balance the needs and desires of different constituencies. We need to balance the rights that everyone has to code as they wish against the rights that everyone has to have a web that works. We need to make sure that the quiet voices are heard, and support the equal rights of those who tread the less worn paths.

When there are conflicts between technologies, developers necessarily think of them in terms of which they would use, given their experience, expertise, environment and so on. I think the TAG needs to judge technologies in a different way. We have to consider the extent to which standardising their use disrupts the fabric of the web and prevents others from operating as they wish to. And, because we want a web that is fair and open and free, if there is no or minimal risk to the fabric of the web, and it does not overly constrain how others act, I believe we should err on the side of supporting diversity.

Making and expressing these judgements is just one of the things that I hope the newly formed TAG will manage to do better.