Sach Mukherjee (Theme co-Lead) and Paul Kirk (Theme co-Lead)
Biostatistical Machine Learning (BML) is a new research theme focused on cross-cutting, methodological research in machine learning (ML), artificial intelligence (AI), and high-dimensional statistics. The overarching aim is to bridge the gap between flexible and scalable AI and ML approaches and the need for robustness, interpretability and scientific understanding that is essential in biostatistical applications. The theme will build upon the BSU’s current work in ML, AI and high-dimensional statistics, acting as a hub for BSU research in these areas and working closely with other themes in the context of specific applied projects where these approaches are key. Efforts will focus on four selected areas:
Robust and scalable AI and ML for biomedical translation:
Existing ML and AI approaches provide a good base for predictive modelling, but further work is needed to address translationally relevant questions such as robustness to distributional shifts, interpretability, and transfer learning. We will focus on closing specific gaps in this area, in particular the need for modified deep learning frameworks that exploit scientific knowledge and structure, the use of ancillary data, and scalability, especially in the context of electronic health record (EHR) data analysis.
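As a minimal illustration of why distributional shift matters, the following pure-Python sketch (hypothetical toy data, not theme code) fits a misspecified linear predictor on one cohort and evaluates it on a cohort whose covariate distribution has shifted, mimicking deployment of an EHR-trained model at a new site:

```python
import random

# Toy illustration: a misspecified predictive model can degrade sharply
# under distributional shift, e.g. when a model trained on one cohort's
# EHR data is deployed on a cohort with shifted covariates.
random.seed(0)

def simulate(mu, n=2000):
    """Covariates x ~ N(mu, 1); the true outcome is nonlinear: y = x**2."""
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    ys = [x * x + random.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Ordinary least squares for y ~ a + b*x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mse(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Train where covariates are centred at 0; deploy where they are centred at 3.
x_src, y_src = simulate(mu=0.0)
x_tgt, y_tgt = simulate(mu=3.0)
a, b = fit_linear(x_src, y_src)

mse_source = mse(a, b, x_src, y_src)
mse_target = mse(a, b, x_tgt, y_tgt)  # far larger on the shifted cohort
```

The linear fit is adequate near the training distribution but fails badly under the shift, which is the kind of gap that robustness-aware and structure-exploiting frameworks aim to close.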
Networks and causality:
Causal models go beyond the predictive paradigm by asking what would happen in a novel setting, such as under an intervention. Such questions are crucial in biomedicine: they relate closely to the prediction of therapeutic response and to the analysis of perturbation data (e.g. via gene editing), and they have future implications for AI-powered prioritisation of molecular targets. Causal questions are challenging for ML models, which have typically focused on predictive, rather than causal or mechanistic, tasks. We will build on our existing track record in this area through cross-cutting collaborations across the BSU, and will develop models to learn causal relationships between variables at scale, motivated in particular by molecular perturbation data. We will also develop approaches that bring together the causal and predictive paradigms.
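The gap between observational association and interventional effect can be seen in a minimal structural simulation (a hypothetical toy model, not a theme method). A confounder drives both a "treatment" variable and the outcome, so naive regression overstates the causal effect, whereas data from an intervention (do-operation) recovers it:

```python
import random

# Toy structural causal model: confounder z drives both the treatment x
# and the outcome y; the true causal effect of x on y is 1.
random.seed(0)

N = 5000

def ols_slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

# Observational regime: x = z + noise, so z confounds the x-y association.
z = [random.gauss(0, 1) for _ in range(N)]
x_obs = [zi + random.gauss(0, 1) for zi in z]
y_obs = [1.0 * xi + 2.0 * zi + random.gauss(0, 0.5)
         for xi, zi in zip(x_obs, z)]

# Interventional regime, do(x): x is set independently of z,
# as in a randomised perturbation experiment.
x_int = [random.gauss(0, 1) for _ in range(N)]
y_int = [1.0 * xi + 2.0 * zi + random.gauss(0, 0.5)
         for xi, zi in zip(x_int, z)]

slope_obs = ols_slope(x_obs, y_obs)  # confounded: roughly 2
slope_int = ols_slope(x_int, y_int)  # recovers the causal effect of 1
```

This is why perturbation data (e.g. gene editing screens) are so valuable for learning causal structure: the intervention breaks the dependence on the confounder.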
Integrative analysis of heterogeneous data:
Biomedical data are often both high-dimensional and heterogeneous, spanning subgroups with non-identical underlying models. Such heterogeneity can confound naïve applications of AI tools, with implications for robustness and scientific interpretation. Furthermore, studies often span multiple high-dimensional data types (e.g. omics layers) that require integrative analysis. Motivated by the need to address such heterogeneity, we propose to develop scalable approaches at the intersection of supervised and unsupervised learning.
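A simple way to see how heterogeneity confounds a naive analysis is a toy simulation (hypothetical data) with two subgroups whose covariate effects have opposite signs; pooling them makes the effect vanish, whereas subgroup-aware modelling recovers it:

```python
import random

# Toy example: two subgroups with opposite effects (+2 and -2) average
# out to an apparent null effect when pooled, hiding real structure.
random.seed(0)

def ols_slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

def subgroup(slope, n=2000):
    """One subgroup with its own linear outcome model."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + random.gauss(0, 0.5) for x in xs]
    return xs, ys

xa, ya = subgroup(+2.0)
xb, yb = subgroup(-2.0)

slope_a = ols_slope(xa, ya)                  # close to +2
slope_b = ols_slope(xb, yb)                  # close to -2
slope_pooled = ols_slope(xa + xb, ya + yb)   # close to 0: effect hidden
```

Recovering such structure when subgroup labels are unknown is exactly where methods at the intersection of supervised and unsupervised learning (e.g. mixture-of-regressions-style approaches) come in.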
Latent functions and dynamic modelling:
Temporal biomedical data typically represent incomplete snapshots of complex dynamical processes. Emerging work in AI seeks to learn latent trajectories from such data by embedding them into latent spaces in which dynamical aspects can be better modelled and understood. This type of approach has high potential for longitudinal medical data and for the study of dynamic biological processes, but several challenging methodological and conceptual questions remain open. We propose to develop approaches combining deep representation learning with interpretable biostatistical/dynamical models in the latent space. This will build upon and unify lines of research in both fields, with the aim of automatically learning simple dynamical models that plausibly underpin complex, multivariate longitudinal observations and can disentangle the different sources of variation within a unified framework. Beyond time-varying data, we will also consider the more general challenge of learning latent functions, with a particular motivating example in modelling and interpreting drug combination responses in large-scale screening experiments.
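The core idea of "simple dynamics in a suitable latent space" can be sketched in miniature (a hypothetical toy, far simpler than the deep-learning setting described above): noisy observations of an exponential-decay process look nonlinear, but on the log scale the dynamics become linear, so the interpretable parameter of the dynamical model, the decay rate, can be read off as a regression slope:

```python
import math
import random

# Toy sketch: observations are noisy readouts of a latent trajectory
# z(t) = z0 * exp(-k * t). Mapping to the log scale linearises the
# dynamics, so the decay rate k is recoverable as a regression slope.
random.seed(0)

K_TRUE, Z0 = 0.5, 10.0
ts = [0.25 * i for i in range(21)]  # t = 0, 0.25, ..., 5
ys = [Z0 * math.exp(-K_TRUE * t) * math.exp(random.gauss(0, 0.05))
      for t in ts]                  # multiplicative observation noise

# Embed observations into the (log) latent space, then fit linear dynamics.
logs = [math.log(y) for y in ys]
mt = sum(ts) / len(ts)
ml = sum(logs) / len(logs)
slope = sum((t - mt) * (l - ml) for t, l in zip(ts, logs)) / \
    sum((t - mt) ** 2 for t in ts)
k_hat = -slope  # estimated decay rate, close to K_TRUE
```

Here the "encoder" is just a log transform chosen by hand; the research programme described above replaces it with a learned deep representation while keeping the latent dynamical model simple and interpretable.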