to. We will discuss this idea more in Sec. B.
Data cleaning and preprocessing:
Statistics and machine learning models or algorithms help data scientists answer domain questions. Using models or algorithms requires cleaning (pre-processing) raw data into a suitable format, be it a categorical demographic feature or continuous measurements of biomarker concentrations. For instance, when data come from multiple laboratories, biologists must decide how to normalize individual measurements (for example, see (22)). When data scientists preprocess data, they are implicitly assuming that their choices are not unintentionally biasing the essential information in the raw data. In other words, they assume that the knowledge derived from a data result is stable with respect to their processing choices. If such an assumption cannot be justified, they should use multiple appropriate processing methods and interpret data results that are stable across these methods. Others have advocated evaluating results across alternatively processed datasets under the name “multiverse analysis” (23). Although the stability principle was developed independently of this work, it naturally leads to a multiverse-style analysis.
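To make this concrete, the following is a minimal Python sketch of such a stability check (not part of the original text): the same data result is recomputed under two plausible preprocessing choices and compared. The simulated measurements, the two normalization functions, and the correlation summary are all illustrative assumptions.

```python
import numpy as np

# Hypothetical raw biomarker measurements from two laboratories
# (rows: samples, columns: biomarkers); values are simulated for illustration.
rng = np.random.default_rng(0)
lab1 = rng.lognormal(mean=1.0, sigma=0.5, size=(50, 10))
lab2 = 1.3 * rng.lognormal(mean=1.0, sigma=0.5, size=(50, 10))  # lab-specific scaling
raw = np.vstack([lab1, lab2])

# Two appropriate, but different, preprocessing choices.
def log_standardize(x):
    z = np.log(x)
    return (z - z.mean(axis=0)) / z.std(axis=0)

def quantile_rank(x):
    # Rank-transform each column to [0, 1].
    return np.argsort(np.argsort(x, axis=0), axis=0) / (x.shape[0] - 1)

# The same downstream data result (here, pairwise feature correlations)
# computed under each preprocessing choice.
results = {name: np.corrcoef(f(raw), rowvar=False)
           for name, f in [("log_standardize", log_standardize),
                           ("quantile_rank", quantile_rank)]}

# A simple stability check: how much does the data result move
# when the preprocessing choice changes?
diff = np.max(np.abs(results["log_standardize"] - results["quantile_rank"]))
print(f"max change in correlations across preprocessing choices: {diff:.2f}")
```

Results that remain essentially unchanged across the two preprocessing pipelines would be reported; results that do not should be flagged as sensitive to processing choices.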
Exploratory data analysis:
Both before the modeling
stage and in post hoc analyses, data scientists often engage
in exploratory data analysis (EDA) to identify interesting
relationships in the data and interpret data results. When
visualizations or summaries are used to communicate these
analyses, it is implicitly assumed that the relationships or
data results are stable with respect to any decisions made by
the data scientist. For example, if the biologist believes that
clusters in a heatmap represent biologically meaningful groups,
she should expect to observe the same clusters with respect to
any appropriate choice of distance metric, data perturbation,
or clustering method.
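A cluster-stability check of this kind could be sketched as follows (our illustration, assuming scikit-learn and synthetic data standing in for a real heatmap): the same clustering question is answered with a second method and on a perturbed dataset, and agreement is measured with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three planted groups, standing in for the rows of a heatmap.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 20))
X = np.vstack([c + 0.5 * rng.normal(size=(30, 20)) for c in centers])

# The same clustering question answered with two different methods
# (the distance metric could be varied in the same way).
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# ... and under a data perturbation (random subsampling of observations).
idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
labels_km_sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[idx])

# Adjusted Rand index near 1 indicates cluster assignments that are
# stable to the choice of method and to the perturbation.
print("agreement across methods:    ", adjusted_rand_score(labels_km, labels_hc))
print("agreement under subsampling: ", adjusted_rand_score(labels_km[idx], labels_km_sub))
```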
B. Predictability as reality check.‖
After data collection, cleaning/preprocessing, and EDA, models or algorithms∗∗ are frequently used to identify more complex relationships in data. Many essential components of the modeling stage rely on the language of mathematics, both in technical papers and in code. A seemingly obvious but often ignored question is why conclusions presented in the language of mathematics depict reality that exists independently in nature, and to what extent we should trust mathematical conclusions to impact this external reality.††
This concern has been articulated and addressed by many others in terms of prediction. For instance, Philip Dawid drew connections between statistical inference and prediction under the name “prequential statistics,” highlighting the importance of forecasts in statistical analyses (24). David Freedman argued that when a model’s predictions are not tested against reality, conclusions drawn from the model are unreliable (25). Seymour Geisser advocated that statistical analyses should focus on prediction rather than parametric inference, particularly in cases where the statistical model is an inappropriate description of reality (26). Leo Breiman championed the essential role of prediction in developing realistic models that yield sound scientific conclusions (6). It can even be argued that the goal of most domain problems is prediction at the meta level. That is, the primary value of learning relationships in data is often to predict some aspect of future reality.

‖ Predictability is a form of empirical validation, though other reality checks may be performed beyond prediction (e.g. checking whether a model recovers known phenomena).
∗∗ Different model or algorithm choices could correspond to different translations of a domain problem.
†† The PCS documentation in Sec. 4 helps users assess whether this connection is reliable.
B.1. Formulating prediction.
We describe a general framework for prediction with data D = (x, y), where x ∈ X represents input features and y ∈ Y the prediction target. Prediction targets y ∈ Y may be observed responses (e.g. supervised learning) or extracted from data (e.g. unsupervised learning). Predictive accuracy is a simple, quantitative metric to evaluate how well a model represents relationships in D. It is well-defined relative to a prediction function, testing data, and an evaluation metric. We detail each of these elements below.
Prediction function: The prediction function

h : X → Y [1]

represents relationships between the observed features and the prediction target. For instance, in the case of supervised learning h may be a linear predictor or decision tree. In this setting, y is typically an observed response, such as a class label. In the case of unsupervised learning, h could map from input features to cluster centroids.
To compare multiple prediction functions, we consider

{h^(λ) : λ ∈ Λ}, [2]

where Λ denotes a collection of models/algorithms. For example, Λ may define different tuning parameters in lasso (27) or random forest (28). For deep neural networks, Λ could describe different network architectures. For algorithms with a randomized component, such as k-means or stochastic gradient descent, Λ can represent repeated runs. More broadly, Λ may describe a set of competing algorithms such as linear models, random forests, and neural networks, each corresponding to a different problem translation. We discuss model perturbations in more detail in Sec. D.3.
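As an illustration only, one possible instantiation of the collection in Eq. [2] is sketched below in Python with scikit-learn; the particular tuning grids, random seeds, and simulated training data are our own assumptions rather than part of the framework.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# One concrete instantiation of {h^(lambda) : lambda in Lambda}:
# lasso over a grid of penalties, random forests over tree depths,
# and repeated runs of the same randomized algorithm.
Lambda = {}
for alpha in [0.01, 0.1, 1.0]:
    Lambda[f"lasso(alpha={alpha})"] = Lasso(alpha=alpha)
for depth in [2, 5, None]:
    Lambda[f"rf(max_depth={depth})"] = RandomForestRegressor(
        n_estimators=200, max_depth=depth, random_state=0)
for seed in range(3):  # repeated runs capture algorithmic randomness
    Lambda[f"rf(seed={seed})"] = RandomForestRegressor(
        n_estimators=200, random_state=seed)

# Fit every member of Lambda on the same (here simulated) training data.
rng = np.random.default_rng(2)
x_train = rng.normal(size=(200, 10))
y_train = x_train[:, 0] - 2 * x_train[:, 1] + rng.normal(size=200)
fitted = {name: h.fit(x_train, y_train) for name, h in Lambda.items()}
```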
Testing (held-out) data:
We distinguish between training data that are used to fit a collection of prediction functions, and testing data that are used to evaluate the accuracy of fitted prediction functions.‡‡ At a minimum, one should evaluate predictive accuracy on a held-out test set generated at the same time and under the same conditions as the training data (e.g. by randomly sampling a subset of observations). This type of assessment addresses questions of internal validity, which describe the strength of a relationship in a given sample. It is also often important to understand how a model will perform in future conditions that differ from those that generated the training data. For instance, a biologist may want to apply their model to new cell lines. A social scientist might use a model trained on residents from one city to predict the behavior of residents in another. As an extreme example, one may want to use transfer learning to apply part of their model to an entirely new prediction problem. Testing data gathered under different conditions from the training data directly addresses questions of external validity, which describe how well a result will generalize to future observations. Domain knowledge and/or empirical validation are essential to assess the appropriateness of different prediction settings. These decisions should be reported in the proposed PCS documentation (Sec. 4).
‡‡ In some settings, a third set of data is used to tune model parameters.
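Continuing the illustrative setup above, the sketch below separates an internal test set (drawn from the same conditions as training) from an external one (a shifted condition standing in for, e.g., a new cell line or a different city); the simulated shift and the R^2 metric are assumptions made for the example, not prescriptions of the framework.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Simulated data standing in for the original study conditions.
rng = np.random.default_rng(3)
x = rng.normal(size=(300, 10))
y = x[:, 0] - 2 * x[:, 1] + rng.normal(size=300)

# Internal validity: a held-out test set from the same conditions.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)
h = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_tr, y_tr)
print("internal test R^2:", r2_score(y_te, h.predict(x_te)))

# External validity: test data gathered under different conditions,
# mimicked here by shifting and rescaling the features.
x_ext = 1.5 * rng.normal(size=(300, 10)) + 0.5
y_ext = x_ext[:, 0] - 2 * x_ext[:, 1] + rng.normal(size=300)
print("external test R^2:", r2_score(y_ext, h.predict(x_ext)))
```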