REVIEW ARTICLE
OPEN
Machine learning for medical imaging: methodological failures and recommendations for the future
Gaël Varoquaux 1,2,3 ✉ and Veronika Cheplygina 4 ✉
Research in computer analysis of medical images bears many promises to improve patients' health. However, a number of systematic challenges are slowing down the progress of the field, from limitations of the data, such as biases, to research incentives, such as optimizing for publication. In this paper, we review roadblocks to developing and assessing methods. Building our analysis on evidence from the literature and data challenges, we show that, at every step, potential biases can creep in. On a positive note, we also discuss ongoing efforts to counteract these problems. Finally, we provide recommendations on how to further address these problems in the future.
npj Digital Medicine (2022) 5:48; https://doi.org/10.1038/s41746-022-00592-y
INTRODUCTION
Machine learning, the cornerstone of today's artificial intelligence (AI) revolution, brings new promises to clinical practice with medical images [1–3]. For example, machine learning has been shown to perform on par with medical experts in diagnosing various conditions from medical images [4]. Software applications are starting to be certified for clinical use [5,6]. Machine learning may be the key to realizing the vision of AI in medicine sketched several decades ago [7].
The stakes are high, and there is a staggering amount of research on machine learning for medical images. But this growth does not inherently lead to clinical progress. The higher volume of research could be aligned with academic incentives rather than with the needs of clinicians and patients. For example, there can be an oversupply of papers showing state-of-the-art performance on benchmark data, but no practical improvement for the clinical problem. On the topic of machine learning for COVID-19, Roberts et al. [8] reviewed 62 published studies, but found none with potential for clinical use.
In this paper, we explore avenues to improve the clinical impact of machine learning in medical imaging. After sketching the situation and documenting uneven progress (Section "It's not all about larger datasets"), we study a number of failures frequent in medical-imaging papers at different steps of the "publishing lifecycle": what data to use (Section "Data, an imperfect window on the clinic"), which methods to use and how to evaluate them (Section "Evaluations that miss the target"), and how to publish the results (Section "Publishing, distorted incentives"). In each section, we first discuss the problems, supported by evidence from previous research as well as our own analyses of recent papers. We then discuss a number of steps to improve the situation, sometimes borrowed from related communities. We hope that these ideas will help shape research practices that are even more effective at addressing real-world medical challenges.
IT’S NOT ALL ABOUT LARGER DATASETS
The availability of large labeled datasets has enabled solving difficult machine learning problems, such as natural image recognition in computer vision, where datasets can contain millions of images. As a result, there is widespread hope that similar progress will happen in medical applications: algorithm research should eventually solve any clinical problem posed as a discrimination task. However, medical datasets are typically smaller, on the order of hundreds or thousands of subjects: ref. [9] lists sixteen "large open source medical imaging datasets", with sizes ranging from 267 to 65,000 subjects. Note that in medical imaging we count the number of subjects, but a subject may have multiple images, for example, taken at different points in time. For simplicity, we assume here a diagnosis task with one image or scan per subject.
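As an aside, when subjects do contribute multiple images, evaluation must split the data by subject rather than by image; otherwise images of the same subject can end up in both the training and the test set and inflate the measured accuracy. The sketch below is illustrative only (it is not from the paper; the data are synthetic, and scikit-learn's GroupShuffleSplit is just one suitable tool for such a grouped split):

```python
# Illustrative sketch (not from the paper): splitting image-level data by
# subject, so that no subject appears in both train and test. Synthetic data.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_subjects, images_per_subject = 50, 3
n_images = n_subjects * images_per_subject
X = rng.normal(size=(n_images, 10))                             # fake image features
y = rng.integers(0, 2, n_subjects).repeat(images_per_subject)   # one label per subject
groups = np.arange(n_subjects).repeat(images_per_subject)       # subject IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# All images of a given subject land on one side of the split only.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```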
Few clinical questions come as well-posed discrimination tasks that can be naturally framed as machine-learning tasks. But even for these, larger datasets have to date not led to the progress hoped for. One example is early diagnosis of Alzheimer's disease (AD), a growing health burden due to the aging population. Early diagnosis would open the door to early-stage interventions, which are the most likely to be effective. Substantial efforts have gone into acquiring large brain-imaging cohorts of aging individuals at risk of developing AD, on which early biomarkers can be developed using machine learning [10]. As a result, there have been steady increases in the typical sample size of studies applying machine learning to develop computer-aided diagnosis of AD or its precursor, mild cognitive impairment. This growth is clearly visible in publications, as shown in Fig. 1a, a meta-analysis compiling 478 studies from six systematic reviews [4,11–15].
However, the increase in data size (with the largest datasets containing over a thousand subjects) did not come with better diagnostic accuracy, in particular for the most clinically relevant question: distinguishing pathological from stable evolution in patients with symptoms of prodromal Alzheimer's (Fig. 1b). Rather, studies with larger sample sizes tend to report worse prediction accuracy. This is worrisome, as these larger studies are closer to real-life settings. On the other hand, research efforts across time did lead to improvements even on large, heterogeneous cohorts (Fig. 1c): studies published later show improvements even at large sample sizes (statistical analysis in the Supplementary Information).
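To make this kind of analysis concrete: the two trends can be probed with a simple meta-regression of reported accuracy on (log) sample size and publication year across the compiled studies. Below is a minimal sketch, assuming a hypothetical table studies.csv with columns accuracy, n_subjects, and year; it is illustrative only, not the paper's actual supplementary analysis:

```python
# Hedged meta-regression sketch; file and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

studies = pd.read_csv("studies.csv")              # accuracy, n_subjects, year
studies["log_n"] = np.log10(studies["n_subjects"])

# A negative coefficient on log_n would mean that larger studies report
# lower accuracy; a positive coefficient on year, that later studies do better.
model = smf.ols("accuracy ~ log_n + year", data=studies).fit()
print(model.summary())
```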
Current medical-imaging datasets are much smaller than those
that brought breakthroughs in computer vision. Although a one-
1 INRIA, Versailles, France. 2 McGill University, Montreal, Canada. 3 Mila, Montreal, Canada. 4 IT University of Copenhagen, Copenhagen, Denmark. ✉ email: gael.varoquaux@inria.fr; vech@itu.dk