Open communication has always been a part of science, but there are new ways to share data. The Royal Society report Science as an Open Enterprise highlights the need to deal with the avalanche of data of scientific interest made available by means of new technologies, in order to preserve the principle of openness and explore the data in a way that could potentially start a new scientific revolution.
What partly motivated the Royal Society to carry out this study was a major controversy that began in the UK in 2009 and lasted about a year and a half: Climategate. E-mails sent by researchers were hacked and published. It was a scientific WikiLeaks, so to speak. The e-mails suggested that some of the scientists had tried to hide data from skeptics of climate change. This was controversial because we expect scientists to be open to debate and skepticism. The event raised a number of other questions about science in the twenty-first century. The report, published last June, discusses opportunities and challenges and was the product of discussions with experts, including people from industry, the social sciences, computer science, and climate science.
Consider a case that took place in May 2011 in Hamburg, Germany. An outbreak of an intestinal infection caused by Escherichia coli spread rapidly across Europe, affecting about 4,000 people. All the victims tested positive for a particular strain of E. coli. The doctors in Hamburg did not know what to do. What had happened to that E. coli? It appeared so similar to other strains, so why did it cause this infection? The search for a solution led to fairly open cooperation. First, the genome data for the E. coli strain was opened and posted on a site where anyone anywhere in the world could access it. In three weeks, about 200 reports were published about what should be done to contain the epidemic and its effects, and the results were used to control the outbreak. This was possible thanks to a very open method of carrying out science, and to expertise from other countries, which led to a solution to a public health issue in just a few months. This is a very important story, and it shows the kind of achievement that we seek through more interactive science.
The report was produced after a year of debate, and we needed to be careful with the language and with our enthusiasm for this way of doing science, which we like to call “intelligent openness.” We have to open the data in a way that enables other people to use it, whether for public health purposes, in industry, or for other applications. Openness is not, in itself, something useful. Data must be opened in an intelligible manner. There are four criteria to follow. First, metadata must be made available. Second, the data must be understandable. Third, the context must be disclosed, so that people who use the data understand how it was obtained and how reliable it is; in other words, for purposes of peer review. Finally, the data must be reusable or reproducible. Only when these four criteria are met can we open data in an appropriate manner. Opening data is expensive, and this is a problem, because these criteria must be met for each of the various audiences that will use the data.
There was a great deal of discussion in the UK and Europe on the concept of “intelligent openness,” and about how industries can use this data to generate strong economic development. What are the limits of this openness? Are there legitimate business interests that we must also protect? To what extent can we be open? We have some examples, such as that of the European Bioinformatics Institute. It has a mechanism that allows companies to compare and contrast information from their internal databases with the institute's large databases, without either side being able to see the other’s data. It is a type of openness, but one within the limitations of the commercial context.
The other question concerns data about human beings. Clearly, this kind of information cannot be released because it would be an invasion of privacy. There are also security issues. There was a controversy about the H5N1 virus, when a new, highly contagious form of the virus was found. The question was whether this work on the H5N1 virus should be published or not, because bioterrorists could use it. In the end it was decided that the data should be published, because very few people were capable of using it for terrorism.
When we talk about the transition to a research environment in which data is more open, we are talking about a very complex system. The idea of a pyramid helps to illustrate how to deal with this. The higher the layer of this pyramid, the greater the responsibility and the demand for access. At the base of the pyramid is individual and personal data, which many researchers believe should be kept in an archive; nobody wants to open this data to the world. Our pyramid therefore has a large base, containing individual data held by researchers. Not all of it would be useful to everyone, but we must not forget that it exists. When we move up to the next layer, we see universities managing their own databases. In the UK, universities compete very strongly in the number of articles and the amount of data they publish, and this data belongs to the institutions, so they try to restrict access to institutional repositories. The next layer contains collections of national data. And at the top is international data, such as the global protein database, which contains data collected over the course of many years.
To simplify, we want to see all active data online. We want integrated operation. Data is an integral part of science and it needs to be communicated as such, not just included in articles. Our hope is that all scientific literature becomes available online, with all primary data accessible online as well. To put this in concrete terms, there are six priorities. The first is to change the prevailing culture that considers scientific data to be personal property. The second is to give credit, in the process of evaluating research, for the disclosure of useful data and for new forms of collaboration. The third is to create common standards for exchanging data. The fourth is to promote what we earlier referred to as intelligent openness. The fifth is to strengthen the group of scientists working with data. We do not have many computer engineers able to do this in the UK, so it is an urgent priority at the moment. And the sixth is the development of new software able to automate and simplify the joint creation and exploration of data sets. If you are going to invest in something today, this is what needs more resources. I hope that these issues, which are fundamental for researchers, can be dealt with in depth at the World Science Forum. And I hope that we can create something more than a tool to deal with a small epidemic outbreak, as occurred in Germany last year.
This and the following articles are the result of talks given at the first of seven preparatory meetings for the 2013 World Science Forum, held at the main offices of FAPESP August 29 – 31, 2012.