
MRC Biostatistics Unit


Large-scale population-based biobanks with medical record data are becoming increasingly popular as a data source for studying very rare conditions, like Pulmonary Arterial Hypertension (PAH), a condition in which blood pressure in the lungs is dangerously high. However, the suitability of these datasets for very rare conditions has not been examined.

In a letter published in the European Respiratory Journal, Benjamin Woolf and Stephen Burgess from the MRC Biostatistics Unit, and colleagues, demonstrate that medical record codes are an unreliable source of information for PAH. Their use in existing biobanks results both in a failure to detect true associations found in gold standard data, and in potential false positive associations which are not replicated in gold standard data.

Benji and colleagues were able to show that the vast majority (>90%) of medical record defined cases of PAH were misclassified by checking whether their records were implausible for PAH, e.g. no record of PAH medication, or codes connected to unrelated conditions which look similar but are not PAH.

They then showed that this misclassification results in both false negatives (i.e. failure to replicate known associations) and false positives (i.e. detection of biologically implausible associations, which could occur due to people with a similar-looking but unrelated condition getting a PAH diagnosis).

This is an important development in PAH research, as an increasing proportion of papers are using medical record data to identify PAH cases. Bayes' theorem implies that as the prevalence of a condition goes down, the proportion of coded cases that are genuine (the positive predictive value) also goes down, because a tiny fraction of misclassified non-cases can drown out the small number of true cases. For example, PAH has a prevalence of around 50 per million. This means that even if 99.99% of true non-cases do not have a PAH code, the remaining 0.01% amounts to about 100 people per million, so there will be twice as many false positive cases as true positive cases. This suggests that very large population-based samples are not the best way to study very rare conditions.
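The arithmetic behind this example can be sketched in a few lines of Python. This is an illustration only, not the analysis from the letter: the function name `expected_counts` is hypothetical, and the sketch assumes perfect sensitivity (every true case receives a PAH code), which is the most favourable scenario for medical record data.

```python
def expected_counts(prevalence, specificity, sensitivity=1.0, population=1_000_000):
    """Expected true/false positives and positive predictive value (PPV).

    Illustrative helper (hypothetical name). sensitivity=1.0 is an
    assumption: it supposes every true case is coded as PAH.
    """
    cases = prevalence * population            # people who truly have the condition
    non_cases = population - cases             # everyone else
    true_pos = sensitivity * cases             # true cases that carry a code
    false_pos = (1.0 - specificity) * non_cases  # non-cases that wrongly carry a code
    ppv = true_pos / (true_pos + false_pos)    # share of coded cases that are real
    return true_pos, false_pos, ppv


# Figures from the text: prevalence of 50 per million, 99.99% specificity.
tp, fp, ppv = expected_counts(prevalence=50e-6, specificity=0.9999)
print(f"true positives per million:  {tp:.1f}")
print(f"false positives per million: {fp:.1f}")
print(f"positive predictive value:   {ppv:.2f}")
```

Under these assumptions, roughly 50 true positives per million sit alongside roughly 100 false positives, so only about one in three coded cases is genuine. Any sensitivity below 100% would lower that share further.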

Benji Woolf, first author and Research Associate at the BSU, said:

"We hope this study highlights the importance of good quality genetic data and the difficulties with studying very rare conditions even with huge sample sizes."

Read the letter in full: https://publications.ersnet.org/content/erj/66/4/2500436