What is response-adaptive randomization?
Randomization, as a method to allocate patients to treatments in a clinical trial, has long been considered a defining element of a well-conducted study. It ensures comparability of treatment groups, mitigates selection bias, and provides the basis for statistical inference [1]. Clinical trials typically use a fixed (most often equal) randomization probability.
An alternative mode of patient allocation is known as response-adaptive randomization (RAR), in which randomization probabilities are altered during the trial based on the accrued response data, with the aim of achieving different experimental objectives. These objectives may include: early selection of a promising treatment among several candidates, increasing the power of a specific treatment comparison, and/or assigning more patients to an effective treatment arm during the trial.
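To make the idea concrete, here is a minimal sketch (our own illustrative code, not taken from any specific trial) of one well-known RAR rule for binary outcomes, Thompson sampling, where the next patient's randomization probability is the posterior probability that an arm has the higher response rate:

```python
import random

def thompson_allocation_prob(successes, failures, n_draws=10_000, seed=0):
    """Estimate P(arm 1 has the higher response rate) by Monte Carlo
    sampling from independent Beta(1 + s, 1 + f) posteriors.  Under
    Thompson sampling, this is the probability with which the next
    patient is randomized to arm 1."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        p0 = rng.betavariate(1 + successes[0], 1 + failures[0])
        p1 = rng.betavariate(1 + successes[1], 1 + failures[1])
        wins += p1 > p0
    return wins / n_draws

# Hypothetical interim data: 3/10 responses on arm 0, 7/10 on arm 1.
# The next patient is then randomized to the better-performing arm 1
# with probability well above one half.
prob = thompson_allocation_prob(successes=[3, 7], failures=[7, 3])
```

This is only one member of a large family of RAR rules; others update allocation probabilities far more conservatively.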
RAR has been a fertile area of methodological research over the past three decades, with books and many papers in top statistical journals being published on the subject. Despite this, the uptake of RAR in practice remains disproportionately slow in comparison with the theoretical attention it has received, and it continues to stand as a controversial and highly debated issue within the statistics community. These debates tend to intensify and multiply during health care crises such as the Ebola outbreak or the current COVID-19 pandemic. Unfortunately, such debates are mostly geared towards presenting arguments to justify one-sided positions around the use of RAR in clinical trials, which is of limited value for non-experts thinking about its use in practice.
Some background story
As researchers who have worked on RAR methodology, we have often been asked to recommend a comprehensive and up-to-date review paper on RAR. Since (to the best of our knowledge) none existed, we recently wrote one ourselves. Our aim was to provide a critical, balanced and up-to-date review of the methodological and practical issues to consider when debating the use of RAR in clinical trials. The paper, titled ‘Response-adaptive randomization in clinical trials: from myths to practical considerations’, is available as a pre-print at https://arxiv.org/abs/2005.00564
While we were wrapping up writing this review paper, a short article by Proschan and Evans was published in the journal Clinical Infectious Diseases, titled ‘The Temptation of Response-Adaptive Randomization’ (available at https://doi.org/10.1093/cid/ciaa334). The article strongly discouraged the use of RAR, listing a variety of problems and concluding that RAR jeopardises the integrity of a clinical trial.
We felt compelled to respond, aiming to start a dialogue that could help the non-expert. We wrote a letter to the Editor, titled ‘The Temptation of Overgeneralizing Response-Adaptive Randomization’ (available at https://doi.org/10.1093/cid/ciaa1027 and also upon request from us). In our letter, we argued that different types of RAR can avoid the problems mentioned by Proschan and Evans, and that any discussion around the value of a specific type of RAR needs careful consideration of the context for which it is being proposed – the benefits can in certain cases more than outweigh the downsides.
In turn, we received ‘Reply to Villar, et al’ (available at https://doi.org/10.1093/cid/ciaa1029). As we are unable to continue the conversation with a further letter to the Editor of Clinical Infectious Diseases, we would like to use the remainder of this blog post to respond to this latest correspondence. The selected quotations below are taken directly from the reply.
Continuing the conversation
Our most pressing concern is about the methods being proposed and used in actual clinical trials, as purveyors of RAR do not appear to be using power-based approaches. This may be because power-based thinking seems counter to the intent of RAR.
We agree with the first sentence, as we have not seen the use of more power-orientated RAR procedures in practice. However, it is not necessarily the case that these are counter to the intent of RAR, ethical or otherwise. For example, the large body of work by Rosenberger et al. (see for example [2]) imposes minimum constraints on power while still leading to better ethical properties. We agree there are many trial contexts where it may not be sensible to use RAR, but its use can be considered when (for example) it makes sense from an ethical perspective.
Many phase III trials use serious clinical endpoints such as death, whose probability of occurrence is much lower than 0.5; in that case, power is maximized by assigning more patients to the arm with the higher event rate!
The issue of maximising power leading to unethical allocations (where more patients are assigned to arms with the higher event rate) has long been recognised with regard to Neyman allocation (see [2]). However, Neyman is only one of the so-called power oriented rules and it is perhaps the most aggressive one in the sense that it is only aiming to maximise power (regardless of the resulting ethics of the allocations). Other RAR procedures, such as those mentioned above, ensure undesirable allocations are avoided and should be used instead.
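To illustrate the point with hypothetical numbers: for binary outcomes, Neyman allocation assigns patients in proportion to the arms' standard deviations sqrt(p(1 - p)), and when both event rates are below 0.5 the arm with the higher event rate has the larger standard deviation, so it receives more patients. A minimal sketch (illustrative rates, not from any trial):

```python
import math

def neyman_allocation(p0, p1):
    """Neyman allocation for two binary arms: randomize to arm k in
    proportion to sqrt(p_k * (1 - p_k)), which minimizes the variance
    of the estimated difference (i.e., maximizes power) for a fixed
    total sample size.  Returns the target proportion on arm 1."""
    s0 = math.sqrt(p0 * (1 - p0))
    s1 = math.sqrt(p1 * (1 - p1))
    return s1 / (s0 + s1)

# Mortality endpoint: 10% on control (arm 0) vs 30% on arm 1.
# Both rates are below 0.5, so Neyman allocation sends MORE patients
# to the arm with the higher death rate - the unethical behaviour
# discussed above.
rho1 = neyman_allocation(p0=0.10, p1=0.30)  # about 0.60
```

With these rates, roughly 60% of patients would be assigned to the arm with triple the death rate, which is exactly why power-oriented rules with ethical constraints are preferable.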
Incidentally, others have also generalized about RAR: “Many variants of RAR have been proposed in the literature. However, different RAR procedures often perform similarly because they obey the same fundamental principle.”
We admit that trying to definitively classify RAR into families of procedures was the wrong path to follow. This was driven home to us by the experience of writing the review, where we realised that this was an elusive goal. Instead, careful consideration is needed when deciding which randomization approach is best suited to a specific clinical trial and its goals.
The magnitude of a time trend is hard to predict but can be large in platform trials of infectious diseases, for example…
A key question is how large is ‘large’? There is much ambiguity in this, the “most” important concern. We agree that more data on time trends in these contexts is needed, and hence would strongly encourage platform trials to report on this important issue.
There is little solace in knowing that the bias and other damage caused by RAR might be “not too bad”, given that we can eliminate bias from temporal trends using standard randomization methods.
The solace in knowing that bias is “not too bad” lies in what can be gained – if we lose less than what we gain, then the use of a RAR design can be justified. To return to an example used in our letter to the Editor, in the context of an infectious disease epidemic, if RAR allows for a faster rolling out of a vaccine (say), objectives such as slowing down the infection rate could potentially be more important than bias or power considerations. Rather than only focusing on power and estimation properties, we want to make the best of other important goals as well.
We continue to believe that only a re-randomization test protects against arbitrary temporal trends.
We agree that currently, re-randomization tests are the only way to protect against arbitrary temporal trends. Indeed, at the moment there is a lack of methods for robust statistical inference in small samples – but we are working on it!
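For readers unfamiliar with re-randomization tests: the null distribution of the test statistic is built by repeatedly re-running the trial's own randomization procedure while keeping each patient's outcome fixed in its time order, which is what makes the test robust to temporal trends. A minimal Monte Carlo sketch (our own illustrative code, with hypothetical inputs):

```python
import random

def rerandomization_pvalue(assignments, outcomes, randomize,
                           n_reps=2000, seed=1):
    """Monte Carlo re-randomization test for a two-arm trial.
    `randomize(n, rng)` must draw a fresh 0/1 assignment sequence
    from the same rule used in the actual trial; outcomes stay fixed
    in time order, so the reference distribution automatically
    reflects any temporal trend in the data."""
    rng = random.Random(seed)

    def statistic(assign):
        g1 = [y for a, y in zip(assign, outcomes) if a == 1]
        g0 = [y for a, y in zip(assign, outcomes) if a == 0]
        if not g0 or not g1:
            return 0.0
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = statistic(assignments)
    extreme = sum(statistic(randomize(len(outcomes), rng)) >= observed
                  for _ in range(n_reps))
    return (1 + extreme) / (1 + n_reps)  # Monte Carlo p-value

def equal_randomize(n, rng):
    """Simple 1:1 randomization; a response-adaptive rule would also
    need access to the accruing outcomes to be plugged in here."""
    return [rng.randint(0, 1) for _ in range(n)]

# Outcomes that drift upward over time but with no treatment effect:
# the test stays valid because every re-run faces the same trend.
outcomes = [i / 20 for i in range(20)]
trial_rng = random.Random(7)
assign = equal_randomize(20, trial_rng)
pval = rerandomization_pvalue(assign, outcomes, equal_randomize)
```

The computational cost of re-running an adaptive randomization rule many times is one practical reason such tests are harder to deploy in small samples.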
We agree that some RAR methods have low probability of assigning more patients to the worse arm, an improvement learned only after missteps.
While this statement may be true for a particular class of RAR methods (based on Thompson sampling), other RAR procedures have been developed that have this low probability even without it being an explicit motivation or improvement – the probability was often not even computed when a RAR procedure was first proposed. For a simulation study comparing this probability for a variety of RAR procedures, we refer the reader to Section 4.1 of our review paper. In any case, if missteps led to improvements, then this is a positive development.
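For readers who want a feel for this quantity without consulting the review, here is an illustrative simulation (our own sketch, with hypothetical response rates rather than the scenarios in our Section 4.1) estimating the probability that a Thompson-sampling trial ends with more patients on the inferior arm:

```python
import random

def simulate_trial(p, n_patients=60, rng=None):
    """Run one two-arm binary trial under a simple Thompson-sampling
    RAR rule with Beta(1, 1) priors; return the patient count per arm.
    Each patient is assigned to the arm whose single posterior draw
    is largest, which implements probability matching."""
    rng = rng or random.Random()
    s, f, n = [0, 0], [0, 0], [0, 0]
    for _ in range(n_patients):
        draws = [rng.betavariate(1 + s[k], 1 + f[k]) for k in (0, 1)]
        arm = 0 if draws[0] > draws[1] else 1
        n[arm] += 1
        if rng.random() < p[arm]:
            s[arm] += 1
        else:
            f[arm] += 1
    return n

# Estimate P(more patients end up on the inferior arm 0) when the
# true response rates are 0.3 vs 0.5 (illustrative values only).
rng = random.Random(42)
runs = [simulate_trial([0.3, 0.5], rng=rng) for _ in range(500)]
p_worse = sum(n0 > n1 for n0, n1 in runs) / len(runs)
```

Varying the RAR rule in `simulate_trial` is exactly the kind of comparison carried out, far more carefully, in the simulation study of our review paper.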
We ponder how many other problems might be lurking for a method that has been seldom used despite existing for more than 80 years.
Our view is that potential problems that may be discovered in the future should not be seen as a reason to rule out the development and exploration of RAR methodology. On the contrary, these problems mean there is more potential for research and further improvements to a method that has a noble intention. Given that RAR is already being increasingly used in practice, we should encourage research into how to address these issues to ensure the theoretical foundations are as solid as possible.
Randomization is the essence of a clinical trial. If randomization fails, then the trial fails.
We agree that randomization is an essential part of a good clinical trial – and RAR does incorporate randomization as a key component.
It is a disservice not to appropriately educate researchers and research consumers about this fact, particularly in the context of infectious diseases and in the midst of a deadly pandemic, where correct answers are crucial.
As already mentioned, by no means are we proposing that RAR should be used in all trial contexts, but it is important to have a balanced view around its use. Advocating for partial views either against or in favor of RAR is a disservice – both the advantages and disadvantages of RAR should be taught to researchers and research consumers as part of educating them on good design of experiments. The latter is what we are really advocating for with our position paper and correspondence letter.
Also, could it be that in the context of infectious diseases, it is correct decisions that are most crucial, rather than correct answers? There is a subtle difference: power considerations (for example) cannot always be more important than other objectives in the trial, like slowing down a pandemic and saving lives.
We agree that not all RAR methods are the same and that improvements have been made to RAR methodologies through lessons like the ECMO trial. Those improvements, such as a “burn-in” period of conventional randomization, have moved the originally proposed RAR closer to conventional randomization. Continue moving in that direction ad infinitum and we will eventually agree.
In our view, the “burn-in” period is not an example of a substantial improvement in RAR methodology. Better examples of methodological advances in RAR include results about the validity of asymptotic inference, non-myopic (forward-looking) RAR rules, or the optimal designs of Rosenberger et al. The “burn-in” is in some sense an ad-hoc modification, and improves power simply by moving RAR closer to a balanced design, but at the potential cost of other considerations – which we firmly believe cannot be overlooked when making a design choice.
To conclude, we would like to stress that we do not hold the ‘opposite’ viewpoint to the final sentence quoted above. We are not looking for staunch opponents of RAR to move ad infinitum towards the use of RAR. Rather, our hope is that such opponents can move slightly in our direction, so that we can agree sooner. We seek a middle ground rather than a one-sided view in either direction – or at the very least, an open two-sided dialogue from which everyone can learn and make up their own view.
Blog post by Dr Sofía Villar and Dr David Robertson, MRC Biostatistics Unit
References
[1] Rosenberger WF and Lachin JM (2016). Randomization in Clinical Trials. Wiley Series in Probability and Statistics.
[2] Hu F and Rosenberger WF (2006). The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley Series in Probability and Statistics.
[3] Villar SS, Bowden J and Wason J (2018). Response-adaptive designs for binary responses: How to offer patient benefit while being robust to time trends? Pharmaceutical Statistics 17:182-197.