Open Data: Our Best Guarantee for a Just Algorithmic Future

(Two days ago I gave a talk at TEDxLausanne - I'll post the video when it becomes available. This is the prepared text of the talk.)

Imagine you are coming down with the flu. A sudden, rapid onset of a fever, a sore throat, perhaps a cough. Worried, you start searching for your symptoms online. A few days later, as you're not getting better, you decide it's time to go see a doctor. A few days after that, at your appointment with the doctor, you get diagnosed with the flu. And because flu is a notifiable disease, your doctor will pass on that information to the public health authorities.

Now, let's pause for a moment and reflect on what just happened. The first thing you did was to go on the internet. Let’s say you searched on Google. Google now has a search query from you with typical flu-related search terms. And Google has that information from millions of other people who are coming down with the flu as well - one to two weeks before that information makes it to the public health authorities. In other words, from the perspective of Google, by then it will be old news.

In fact, this example isn’t hypothetical. Google Flu Trends was the first big example of a new field called “digital epidemiology”. When it launched, I was a postdoc. It became clear to me that the data that people generate about being sick, or staying healthy, would increasingly bypass the traditional healthcare systems, and go through the internet, apps, and online services. Not only would these novel data streams be much faster than traditional data streams, they would also be much larger, because - sadly - many more people have access to the internet through a phone than access to a health care system. In epidemiology, speed and coverage are everything; something the world was painfully reminded of last year during the Ebola outbreak.

So I became a digital epidemiologist - and I wondered: what other problems could we solve with these new data? Diseases like the flu, Ebola, and Zika get all the headlines, but there is an entire world of diseases that regularly kills on a large scale that almost nobody talks about: plant diseases. Today, 500 million smallholder farmers in the world depend on their crops doing well, but help is often hard to get when diseases start spreading. Now that the internet and mobile phones are omnipresent, even in low-income countries, it seemed that digital epidemiology could help, and so a colleague, David Hughes, and I built a platform called PlantVillage. The idea was simple - if you have a disease in your field or garden, simply snap a picture with your phone and upload it to the site. We’ll immediately have an expert look at it and help you.

This system works well - but there are only so many human experts available in real time. Can we possibly get the diagnosis done by a machine too? Can we teach a computer to see what’s in an image? 

A project at Stanford called ImageNet tried to do this with computer vision – they created a dataset of hundreds of thousands of images - showing things like a horse, a car, a frog, a house. They wanted to develop software that could learn from the images, to later correctly classify images that the software had never seen before. This process is called “machine learning”, because you are letting a machine learn on existing data. The other way of saying this is that you are training an algorithm on existing data. And when you do this right, then the end product - the trained algorithm - can work with information it hasn’t encountered before. But the people at ImageNet didn’t just use machine learning. They organized a challenge - a friendly competition - by saying “here, everybody can have access to all this data - if you think you can develop an algorithm that is better than the current state of the art, go for it!” And go for it, people did! Around the world, hundreds of research teams participated in this challenge, submitting their algorithms. And a remarkable thing happened. In less than five years, the field experienced a true revolution. In the end, the algorithms weren’t merely better than the previous ones. They were now better than humans.
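For readers who like to see the idea in code, here is a toy sketch of "train on existing data, then classify something new". It is only an illustration under made-up assumptions, not how ImageNet entries or PlantVillage actually work: the two numbers standing in for each image, the labels, and the function names `train` and `classify` are all invented for this example, and the method (nearest centroid) is far simpler than the algorithms used in the challenge.

```python
def train(examples):
    """Learn one average feature vector (a centroid) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(model, features):
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: squared_distance(model[label], features))


# Hypothetical training data: (leaf greenness, spot coverage) -> label.
# Real systems learn from the image pixels themselves, not two hand-picked numbers.
training_data = [
    ((0.9, 0.05), "healthy"),
    ((0.8, 0.10), "healthy"),
    ((0.4, 0.60), "diseased"),
    ((0.3, 0.70), "diseased"),
]

model = train(training_data)

# An example the model has never seen before.
print(classify(model, (0.45, 0.5)))
```

The important part is the split: `train` only ever sees the existing, labeled examples, while `classify` is asked about an example that was not among them.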

Machine learning is an incredibly hot and exciting research field, and it’s the basis of all the “artificial intelligence” craze that’s going on at the moment. And it's not just academic: it is how Facebook recognizes your friends when you upload an image. It is how Netflix recommends which movies you will probably like. And it is how self-driving cars will bring you safely from A to B in the very near future.

Now, take the ImageNet project, but replace the images of horses and cars and houses, with images of plant diseases. That is what we are now doing with PlantVillage. We are collecting hundreds of thousands of images of diseased and healthy plants around the world, making them open access, and we are running open challenges where everyone can pitch in algorithms that can correctly identify a disease. Imagine how transformational this can be! Imagine if these algorithms can be just as good as, or perhaps even better than, human experts. Imagine what can happen when you build these algorithms into apps, and release those apps for free to the 5 billion people around the globe with smartphones.

It’s clear to me now that this is not only the future of PlantVillage, but a future of applied science more generally. Because if you can do this with plant diseases, you can do this with human diseases as well. You can in principle do it with skin cancer detection. Basically, any task where a human needs to make a decision based on an image, you can train an algorithm to be just as good. And it doesn’t stop at images, of course. Text, videos, sounds, more complex data altogether - anything is up for grabs. As long as you have enough good data that a machine learning algorithm can train on, it’s only a matter of time until someone will develop an algorithm that will reach and exceed human performance. And here we're not talking science fiction in the next 50 years; we're talking now, in the next couple of years. And this is why these large datasets - big data - are so exciting. Big data is not exciting because it’s big per se. It’s exciting because that bigness means that algorithms can learn from vast amounts of knowledge stored in those datasets, and achieve human performance.

If algorithms derive their power from data, then data equals power. So who has the data? Things may be ethically easy with images of horses, cars, houses, or even plant diseases - but what about the data concerning your personal health? Who has the data about our health, data which will form the basis for smart, personalized health algorithms? The answer may surprise you, because it’s not just about your past visits to doctors, and to hospitals. It’s your genome, your microbiome, all the data from your various sensors, from smartphones to smartwatches. The drugs you took. The vaccines you received. The diseases you had. Everything you eat, every place you go to, how much you exercise. Almost anything you do is relevant to your health in one way or another. And all that data exists somewhere. In hospital databases. In electronic health records. On the servers of the Googles and Apples and Facebooks of this world. In the databases of the grocery stores, where you buy your food. In the databases of the credit card companies who know where you bought what, when. These organizations have the data on which to train the future algorithms of smart personalized healthcare.

Today, these organizations, mainly businesses, provide us with compelling services that we love to use. In the process, they collect a lot of data about us, and store them in their mostly secure databases. They use these data primarily for potential commercial gain. But the data are closed, not accessible to the public - we imprison our data in those silos that only a select few have access to, because we are afraid of privacy loss. And because of this fear, we don’t let the data work for us.

Remember Google Flu Trends that I mentioned a few minutes ago? Last year, Google shut it down. Why? We can only speculate. But what this reminds us of is that those who have the data with which they can build these fantastic services... can also shut them down. And when it comes to our health, to our wealth, to our public infrastructure, we should be really careful to think deeply about who owns the data. I applaud Google for what they have done with Google Flu Trends. I am a happy consumer of many Google services that I love to use. But it is our responsibility to ensure that we don’t start to depend too strongly on systems that can be shut down any day without warning, because of a business decision that's been made thousands of miles away. 

So, how can we strike the right balance between protecting individual privacy and unleashing big data for the good of the public? I think the solution lies in giving each of us a right to a copy of our data. We can then take a copy of our data, and either choose to retain complete privacy - or we can choose to donate parts of these data to others, to research projects, or into the public domain to pursue a public good, with the reassurance that these data will not be used by insurance companies, banks and employers to discriminate against us.

Implementing this vision is not going to be easy, but it is possible. It has to be possible. Why? Two reasons (at least). First, our data is already digital, stored in machines somewhere and hence eminently hackable. We should have regulations in place to manage the risks of the inevitable data breaches. Second, we are now running full speed into a second machine age where machines will not only be much stronger than us - as they have been in the past decades - but also much, much smarter than us. We need to continue to ensure that the machines work in our common interest. It’s not smart machines and artificial intelligence we should be concerned about - they are smart and intelligent because of the data. Our concern should be about closed data. We won’t be able to leverage the phenomenal power of smart, learning machines for the public good if all the data is locked away.

Open data is not what we should be afraid of - it's what we should embrace. It’s our best guarantee that we remain in control of the algorithms that will rule our digital world in the future.