In this feature, Steven Hill (an Investigator Statistician in Sach Mukherjee’s group at the MRC Biostatistics Unit) describes the importance of causal relationships in science and his involvement in “DREAM”, a recent, high-profile computational biology challenge that focused on causality.
Determining cause and effect is one of the central goals of science, for example when investigating the causes of a disease. The language of newspaper headlines often suggests causation, such as “Red meat raises the risk of breast cancer by a quarter” (http://www.express.co.uk, 11 June 2014). However, in many cases, the research being reported shows only an association or correlation, and the causal claim implicit in the headline is not justified. In general, association does not imply cause and effect.
Causal relationships are central to biology. Cells in the body contain many molecules that interact with each other and these interactions are vital to the functions of the cells and of the tissues and organs that they form. The molecules and their interactions can be thought of as a network, with arrows between the molecules indicating who influences whom. These networks are causal in the sense that if molecule A influences molecule B, removing or altering A will lead to a change in B.
Molecular networks are altered in many diseases, such as cancer, and there may be unusual patterns of influence that are specific to a particular disease. It is therefore important to investigate the structure of networks, that is, their patterns of influence, in a disease-specific manner. An improved understanding could give new insights into disease biology and even help to design new therapies.
One way to determine network structure is to look at the levels or abundance of various molecules in cells over time and under different experimental conditions and use statistical and computational methods to tease out network structure from such data. But precisely because association does not equal causation, the problem of learning causal networks from data is a challenging one: if two molecules are correlated, it might be due to a causal influence between them, or it might be due to the fact that both are influenced by a third molecule (see Figure 1a), perhaps one that is not even measured in the experiment! For this and many other reasons, at present we cannot be sure that even very sophisticated computational methods are able to uncover causal relationships. Consequently, the task of carefully assessing whether such methods really work remains an important one.
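The confounding scenario described above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not part of the challenge itself: the molecules A, B and C, the Gaussian noise levels and the function names are all hypothetical. Molecule C influences both A and B, and A has no influence on B. In the observational setting A and B are strongly correlated, but when A is set externally (an intervention), the correlation disappears, which is what we would expect if A does not cause B.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0, intervene_on_a=False):
    """Simulate hypothetical molecules where C drives both A and B.

    There is no arrow from A to B. If intervene_on_a is True, A is set
    externally and no longer depends on C.
    """
    rng = random.Random(seed)
    a_vals, b_vals = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)            # unmeasured common cause
        if intervene_on_a:
            a = rng.gauss(0, 1)        # A fixed by the experimenter
        else:
            a = c + rng.gauss(0, 0.5)  # A influenced by C
        b = c + rng.gauss(0, 0.5)      # B influenced by C only
        a_vals.append(a)
        b_vals.append(b)
    return pearson(a_vals, b_vals)


observational = simulate()
interventional = simulate(intervene_on_a=True)
print(f"correlation of A and B, observational: {observational:.2f}")
print(f"correlation of A and B, A intervened:  {interventional:.2f}")
```

The contrast between the two numbers is the point: data in which a molecule is experimentally altered can distinguish a genuine causal influence from a correlation induced by a third molecule.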
Working with a team of colleagues from the UK, Europe and the US, I helped to organise the 2013 HPN-DREAM network inference challenge, aimed at assessing computational methods for learning causal networks, focussing in particular on molecular networks called protein signalling networks (see Figure 1b).
The DREAM project (Dialogue for Reverse Engineering and Assessment of Methods) organises competitive – yet collaborative – challenges that pose questions relevant to biomedicine and invite researchers from around the world to come up with computational methods to address the question. Such challenges are a great way to focus the research community’s attention on a particular problem and to assess diverse methods, contributed by many different teams, in a careful and consistent fashion. In the HPN-DREAM challenge, participants were asked to infer causal protein signalling networks using data obtained from cancer cells (HPN, the Heritage Provider Network, was a sponsor of the challenge).
For the challenge, we developed an approach to assess computational methods for learning networks, focusing on how accurately the methods discern causal relationships. This is difficult for the simple reason that the true underlying network is typically unknown in disease biology. We therefore used additional so-called “test” data (from experiments that were not in the data originally provided to challenge participants) to empirically test causal predictions derived from the participants’ networks.
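The general idea of testing causal predictions against held-out data can be sketched as follows. This is an illustrative simplification under stated assumptions and not the scoring procedure actually used in the challenge: the function names, the toy network and the choice of a simple agreement fraction are all hypothetical. The sketch assumes that the test data come from an experiment in which one molecule was inhibited, that a network predicts a change in every molecule reachable from the inhibited one by following arrows, and that the test data tell us which measured molecules actually changed.

```python
from collections import deque


def descendants(network, source):
    """Return all nodes reachable from source by following arrows.

    network maps each node to the nodes it is predicted to influence.
    """
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in network.get(node, ()):
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return seen


def agreement(network, inhibited, changed, measured):
    """Fraction of measured molecules where prediction matches test data.

    A molecule is predicted to change if it is a descendant of the
    inhibited molecule in the submitted network (an assumed rule).
    """
    predicted = descendants(network, inhibited)
    candidates = set(measured) - {inhibited}
    hits = sum((m in predicted) == (m in changed) for m in candidates)
    return hits / len(candidates)


# Hypothetical example: four measured molecules, A inhibited in the test data.
measured = ["A", "B", "C", "D"]
changed_in_test_data = {"B", "C"}

good_network = {"A": ["B"], "B": ["C"], "D": ["C"]}
poor_network = {"A": ["D"]}

good_score = agreement(good_network, "A", changed_in_test_data, measured)
poor_score = agreement(poor_network, "A", changed_in_test_data, measured)
print(good_score, poor_score)
```

A network is rewarded here only for getting the consequences of an intervention right, so no knowledge of the true underlying network is needed to compare submissions.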
We ran the challenge for three months over the summer of 2013, with approximately 70 teams participating. We provided feedback to participants on their performance via weekly online “leaderboards”. The best-performing teams received prizes and were invited to attend the challenge conference, held in Toronto, and to contribute to the scientific paper describing the challenge.
Despite the challenging nature of causal network inference, we were encouraged by the fact that several teams performed well. In addition to the molecular data, we allowed participants to use other sources of information, such as knowledge from the biological literature. Many teams used such biological knowledge and on average these teams performed better than teams that did not. Moreover, we found that combining prior biological knowledge with molecular data improved performance relative to either one alone.
From a personal perspective, being involved in the challenge organisation was a unique and exciting experience, which involved collaborating with colleagues from diverse research areas. In addition to taking part in many aspects of the challenge design and implementation, I co-led the analysis of participant submissions. Through the set of tools used by challenge participants, I was also exposed to many different approaches for network analysis. It was rewarding to contribute to an important community effort that involved scientists from across the world.