Data of the people, by the people, for the people

About 150 years ago, the American president Abraham Lincoln gave a very short speech - only a few minutes long - on a battlefield in Gettysburg, Pennsylvania. The occasion was to honor the soldiers who died in a fierce battle at the height of the American Civil War. Despite the brevity of the speech, and the fact that almost nobody understood what Lincoln was saying, it is now perhaps the most famous speech in US history by a US president. It is only ten sentences long, but to condense it even further here, Lincoln essentially said that there is nothing anyone could do to properly honor the fallen soldiers, other than to help ensure that the idea of this newly conceived nation would continue to live on, and that “government of the people, by the people, for the people, shall not perish from the earth.”

Why is this such a powerful line? It’s powerful because it expresses in very simple terms the basic idea of democracy, that we the people can form government, and that we the people can make political decisions, which is in itself the best guarantee that the decisions are made in the best interest of us, the people.

So, what does all of this have to do with open data?

Fundamentally, government is about organizing power. The vast majority of us agree that power should be distributed among the many, not the few. To quote John Dalberg-Acton: “Liberty consists in the division of power. Absolutism, in the concentration of power.” That is what democracy is about. And that is the discussion we should have about data. Because data is power. And if liberty consists in the division of power, or in the divided access to power, then that means that liberty also consists in the division of data.

But what does it even mean to say that data equals power?

Data contains information, and information can be used for commercial gain. We all understand that. But the power of data is much more fundamental than that. To understand this, we need to reflect on where we are as humans, at this point in time. We have now entered the second machine age - an age where machines will not only be much stronger, physically, as they have been for centuries, but also much, much smarter than we are. Not just a little smarter, but orders of magnitude smarter. Most of us have come to terms with the fact that machines will achieve human-level intelligence. But think about machines that are ten times smarter, a hundred times smarter. How do you feel about a machine that is a million times smarter than a human? It’s a question worth asking, because while we may not live to see such a machine, our children, or grandchildren, probably will. In any case, even a machine that’s 100 times smarter than us is something you wouldn’t want to compete against. You wouldn’t feel comfortable if such machines were controlled by a small elite group. However, if such a machine were an agent, at your service, and if everyone had such agents, which they’d use to make their lives better, that would be an entirely different story. Thus, when AI - artificial intelligence - becomes very powerful, it would be a disaster if that power were in the hands of a few. We would go back to absolutism, and despotism. We therefore need to ensure that the power of AI is distributed widely.

There are some efforts, like the non-profit organization OpenAI, that aim to ensure that this is the case. In fact, if you follow the field of machine learning a little bit, a field that is currently at the heart of many of the AI-relevant breakthroughs, then you will see that most organizations are now open-sourcing the code that’s behind these AI breakthroughs. That’s a good thing, because it helps ensure that the raw machinery to build AI, the algorithms, is indeed in the hands of many.

But this is not enough - not nearly. It’s very important to recognize that the power of AI is not simply in the algorithms; it’s not simply in the technology per se. It’s in the data. AI becomes intelligent when it can quickly learn from large amounts of data. AI without data does not exist. The analog version, the human brain, can perhaps help us to understand this idea a bit better. A human brain, in isolation, can only do so many things. It’s when the brain can learn from data that the magic happens. We call this education, or learning more generally. The brain itself is necessary, but it is the access to data - in the form of knowledge, and education - that makes us the most intelligent individuals to ever walk the face of the earth; of such an intelligence that we can even create artificial intelligence. And to take this analogy one step further, if you learn from small, false, or just generally crappy data, your brain will consistently make the wrong predictions. Incidentally, this is why science has been such a boon for mankind: the scientific method helps us ensure that our brains get trained on high-quality data.

So this is the central idea here: 

The enormous power of AI is based on data. If we want everyone to have access to this power, we need widespread access to data.

Put slightly differently:

Broad open data access is an absolute necessity for human liberty in the machine age.

If we accept this, then the question immediately arises: how do we get there? The fact that AI power is derived from data also means that from an economic perspective, privileged data access is incredibly valuable. Market players with privileged data access have absolutely no interest in losing this privilege. This is understandable - in the information economy, being able to extract information from data that can be used commercially is a matter of life and death, economically speaking. Forcing these players to give up their privileged access to data, which they generally collected themselves, would likely have severely negative economic consequences. It would also be highly unethical - for example, I’d be very upset if we forced Google to open up their data centers so that anyone could have access to my data. There has to be another way.

I would like to offer a suggestion for another way. Access to personal data should be controlled by those who generate the data, not by those who collect it. The data generator is the person whose data is collected. In order for the data generator to be able to control access, the collector needs to provide the person a copy of the personal data.

Let’s take an example. Let’s say you use a provider’s map on your smartphone to drive from A to B. As you’re driving, GPS data of your trip is collected by the app maker. The app maker uses this kind of data to give you real-time traffic information. Great service - but you’ll never be able to access this data. You should be able to access this data, either in real time or with some delay, and do whatever you please with it, from training your own AI to sharing it with or selling it to third parties.

Another example. Let’s say you track your fitness with some device, you always shop for food at the same grocery store, and you also took part in a cohort study where your genome was sequenced, with your permission of course. The fitness device maker may reuse your data to make a more compelling product; the grocery store may direct ads at you for new products that fit your profile; and the cohort study will use your DNA data for research. All good - but is it easy for you to combine these three data sources? Not at the moment. You should be able to access all three data sources - your fitness data, your nutrition data, and your DNA - without having to ask anyone for permission, for whatever reason. If you’re now asking, “why would anyone want that data”, you are asking the exact wrong question. It’s not anyone’s business why you would want that data - the point is that you should be able to get it with zero effort, in machine-readable form, and then you should be allowed to do with it whatever you want to. It's your data.

In some situations, we’re already close to this scenario. For example, when you open a bank account, of course you will be able to access every last detail of any transaction at any point in time, whenever and wherever you want to, without having to ask anyone. Any banking service without this possibility would be unthinkable. Why isn’t it like this with every service? If I can have my financial data like that, why can I not have the same access to my health data, my location data, my shopping data?

Once our own data is easily accessible to us, it will be possible for us to let others access the data, provided we allow it. We can for example give the data to third parties such as trusted research groups, not-for-profit organizations, or even trusted parts of the government or trusted corporations. At the moment, this sounds very futuristic. But imagine, for example, a trusted health data organization, perhaps a cooperative, where hundreds of thousands or even millions of people share their health data. This would be an enormous data pool that could be studied by public health officials to make better recommendations. It could be investigated by pharmaceutical companies to design new drugs. And, to bring this back to the original thought about AI, anyone could use this data to improve the artificial intelligence agents that will increasingly make health decisions on our behalf.

Today, we’ll hear many excellent arguments that make the case for open data, highlighting social, political, economic and scientific aspects. My argument is that human liberty cannot exist in a machine age that is run by algorithms, unless people have broad access to data to improve their own intelligent agents. From this perspective, it makes no sense to be concerned about “smart machines”, or “smart algorithms” - the major concern should be about closed data. We won’t be able to leverage the phenomenal power of smart, learning machines for the public good, and for distributed AI - for distributed power, really - if all the data is locked away, accessible only to a select few. We need data of the people, by the people, for the people.