Veridical Data Science

Bin Yu (a,b,c) and Karl Kumbier (a)

(a) Statistics Department, University of California, Berkeley, CA 94720; (b) EECS Department, University of California, Berkeley, CA 94720; (c) Chan Zuckerberg Biohub, San Francisco, CA 94158

arXiv:1901.08152v5 [stat.ML] 12 Nov 2019
Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprised of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. Moreover, we develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others and compare it to existing methods in high-dimensional, sparse linear model simulations. Over a wide range of misspecified simulation models, PCS inference demonstrates favorable performance in terms of ROC curves. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo (1).
1. Introduction
Data science is a field of evidence seeking that combines data with domain information to generate new knowledge. The data science life cycle (DSLC) begins with a domain question or problem and proceeds through collecting, managing, processing (cleaning), exploring, modeling, and interpreting∗ data results to guide new actions (Fig. 1). Given the transdisciplinary nature of this process, data science requires human involvement from those who collectively understand both the domain and the tools used to collect, process, and model data. These individuals make implicit and explicit judgment calls throughout the DSLC. The limited transparency in reporting such judgment calls has blurred the evidence for many analyses, resulting in more false discoveries than might otherwise occur (3, 4). This fundamental issue necessitates veridical data science to extract reliable and reproducible information from data, with an enriched technical language to communicate and evaluate empirical evidence in the context of human decisions. Three core principles, predictability, computability, and stability (PCS), provide the foundation for such a data-driven language and a unified data analysis framework. They serve as minimum requirements for veridical data science†.

∗ For a precise definition of interpretability in the context of machine learning, we refer to our recent paper (2).

† Veridical data science is the broad aim of our proposed framework (veridical meaning “truthful” or “coinciding with reality”). This paper has been on arXiv since Jan. 2019 under the old title “Three principles of data science: predictability, computability, stability (PCS).”
Many ideas embedded in PCS have been widely used across various areas of data science. Predictability plays a central role in science through Popperian falsifiability (5). If a model does not accurately predict new observations, it can be rejected or updated. Predictability has been adopted by the machine learning community as a goal in its own right and more generally to evaluate the quality of a model or data result (6). While statistics has always considered prediction, machine learning emphasized its importance for empirical rigor. This was in large part powered by computational advances that made it possible to compare models through cross-validation (CV), developed by statisticians Stone and Allen (7, 8).
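For concreteness, here is a minimal sketch of CV-based model comparison; the dataset, models, and tuning values are hypothetical stand-ins, not taken from this paper:

```python
# Minimal sketch: comparing two prediction functions via cross-validation.
# The dataset and model choices here are hypothetical stand-ins.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

models = {
    "lasso": Lasso(alpha=0.1, max_iter=10_000),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    # 5-fold CV estimates out-of-sample predictive accuracy (here R^2).
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV R^2 = {scores.mean():.3f} (sd {scores.std():.3f})")
```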
The role of computation extends beyond prediction, setting limitations on how data can be collected, stored, and analyzed. Computability has played an integral role in computer science, tracing back to Alan Turing’s seminal work on the computability of sequences (9). Analyses of computational complexity have since been used to evaluate the tractability of machine learning algorithms (10). Kolmogorov built on Turing’s work through the notion of Kolmogorov complexity, which describes the minimum computational resources required to represent an object (11, 12). Since Turing machine-based notions of computability are not computable in practice, we treat computability as an issue of algorithm efficiency and scalability. This narrow definition of computability addresses computational considerations at the modeling stage of the DSLC but does not deal with data collection, storage, or cleaning.
Stability‡ is a common sense principle and a prerequisite for knowledge. It is related to the notion of scientific reproducibility, which Fisher and Popper argued is a necessary condition for establishing scientific results (5, 13). While replicability across laboratories has long been an important consideration in science, computational reproducibility has come to play an important role in data science as well. For example, (14) discusses reproducible research in the context of computational harmonic analysis. More broadly, (15) advocates for “preproducibility” to explicitly detail all steps along the DSLC and ensure sufficient information for quality control. Stability at the modeling stage of the DSLC has been advocated in (16) as a minimum requirement for reproducibility and interpretability. Modeling stage stability unifies numerous previous works, including jackknife, subsampling, bootstrap sampling, robust statistics, semi-parametric statistics, and Bayesian sensitivity analysis (see (16) and references therein). These methods have been enabled in practice through computational advances and allow researchers to investigate the reproducibility of data results. Econometric models with partial identification (see the book (17) and references therein) and fundamental theoretical results in statistics, such as the central limit theorem (CLT), can also be viewed as stability considerations.

‡ We differentiate between the notions of stability and robustness as used in statistics. The latter has traditionally been used to investigate performance of statistical methods across a range of distributions, while the former captures a much broader range of perturbations throughout the DSLC as discussed in this paper. At a high level, stability is about robustness.
In this paper, we unify and expand on these ideas through the PCS framework, which is built on the three principles of data science. The PCS framework consists of the PCS workflow and transparent PCS documentation. It uses predictability as a reality check, computability to ensure that the DSLC is tractable, and stability to test the reproducibility of data results (Sec. 2) relative to human judgment calls at every step of the DSLC. In particular, we develop basic PCS inference, which leverages data and model perturbations to evaluate the uncertainty human decisions introduce into the DSLC (Sec. 3). We propose PCS documentation in R Markdown or a Jupyter (iPython) Notebook to justify these decisions through narratives, code, and visualizations (Sec. 4). We draw connections between causal inference and the PCS framework, demonstrating the utility of the latter as a recommendation system for generating scientific hypotheses (Sec. 5). We conclude by discussing areas for further work, including additional vetting of the framework and theoretical analyses of connections between the three principles. A case study of our proposed framework based on the authors’ work studying gene regulation in Drosophila is documented on Zenodo.
2. PCS principles in the DSLC
Given a domain problem and data, the purpose of the DSLC is to generate knowledge, conclusions, and actions (Fig. 1). The PCS framework aims to ensure that this process is both reliable and reproducible through the three fundamental principles of data science. Below we discuss the roles of the three principles within the PCS framework§, including the PCS workflow and PCS documentation. The former applies the relevant principles at every step of the DSLC, with stability as the paramount consideration, and contains the PCS inference proposed in Sec. 3. The latter documents the PCS workflow and judgment calls made with a 6-step format described in Sec. 4.
A. Stability assumptions initiate the DSLC.

The ultimate goal of the DSLC is to generate knowledge that is useful for future actions, be it a biological experiment, business decision, or government policy. Stability is a useful concept to address whether another researcher making alternative, appropriate¶ decisions would obtain similar conclusions. At the modeling stage, stability has previously been advocated in (16). In this context, stability refers to acceptable consistency of a data result relative to appropriate perturbations of the data or model. For example, jackknife (18–20), bootstrap (21), and cross-validation (7, 8) may be considered appropriate perturbations if the data are deemed approximately independent and identically distributed (i.i.d.) based on domain knowledge and an understanding of the data collection process.

§ We organize our discussion with respect to the steps in the DSLC.

¶ We use the term appropriate to mean well-justified from domain knowledge and an understanding of the data generating process. The term “reasonable” has also been used with this definition (16).

[Fig. 1. The data science life cycle: a domain question leads through data collection, data cleaning, exploration & visualization, modeling, post hoc analysis, and interpretation of results to updated domain knowledge, with stability considered throughout.]
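As a minimal, hypothetical sketch of such a perturbation analysis (the data and model are stand-ins, not this paper’s case study), one can refit a model across bootstrap resamples and inspect how stable its fitted coefficients are:

```python
# Minimal sketch: assessing stability of lasso coefficients under
# bootstrap perturbations of the data. Data and settings are hypothetical.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

n, coefs = X.shape[0], []
for _ in range(100):
    idx = rng.integers(0, n, size=n)  # bootstrap resample (with replacement)
    fit = Lasso(alpha=0.1, max_iter=10_000).fit(X[idx], y[idx])
    coefs.append(fit.coef_)
coefs = np.array(coefs)

# A feature whose coefficient varies wildly across resamples is unstable.
print("coefficient sd across bootstrap fits:", coefs.std(axis=0).round(3))
```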
Human judgment calls prior to modeling also impact data results. The validity of an analysis relies on implicit stability assumptions that allow data to be treated as an informative representation of some natural phenomena. When these assumptions do not hold in a particular domain, conclusions rarely generalize to new settings unless empirically proven by future data. This makes it essential to evaluate stability to guard against costly future actions and false discoveries, particularly in the domains of science, business, and public policy, where data results are used to guide large-scale actions, and in medicine, where human lives are at stake. Below we outline stability considerations that impact the DSLC prior to modeling.
Question or problem formulation: The DSLC begins with a domain problem or a question, which could be hypothesis-driven or discovery-based. For instance, a biologist may want to discover biomolecules that regulate a gene’s expression. In the DSLC this question must be translated into a question regarding the output of a model or analysis of data that can be measured/collected. There are often multiple translations of a domain problem into a data science problem. For example, the biologist described above could measure factors binding regulatory regions of the DNA that are associated with the gene of interest. Alternatively, she could study how the gene covaries with regulatory factors across time and space. From a modeling perspective, the biologist could identify important features in a random forest or through logistic regression. Stability relative to question or problem formulation implies that the domain conclusions are qualitatively consistent across these different translations.
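As a hedged illustration on synthetic data (not the paper’s genomics analysis), one might check whether the same features rank highly under two translations of the problem:

```python
# Minimal sketch: comparing feature rankings across two translations of
# a domain question (random forest vs. logistic regression). Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=5000).fit(X, y)

rf_top = np.argsort(rf.feature_importances_)[::-1][:5]
lr_top = np.argsort(np.abs(lr.coef_[0]))[::-1][:5]

# Qualitative consistency: do the top features overlap across translations?
print("top-5 overlap:", set(rf_top) & set(lr_top))
```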
Data collection: To answer a domain question, domain experts and data scientists collect data based on prior knowledge and available resources. When these data are used to guide future decisions, researchers implicitly assume that the data are relevant to a future time. In other words, they assume that conditions affecting data collection are stable, at least relative to some aspects of the data. For instance, if multiple laboratories collect data to answer a domain question, protocols must be comparable across experiments and laboratories if they expect to obtain consistent results. These stability considerations are closely related to external validity in medical research, which characterizes similarities between subjects in a study and the subjects that researchers hope to generalize results to. We will discuss this idea more in Sec. B.
Data cleaning and preprocessing: Statistics and machine learning models or algorithms help data scientists answer domain questions. Using models or algorithms requires cleaning (pre-processing) raw data into a suitable format, be it a categorical demographic feature or continuous measurements of biomarker concentrations. For instance, when data come from multiple laboratories, biologists must decide how to normalize individual measurements (for an example, see (22)). When data scientists preprocess data, they are implicitly assuming that their choices are not unintentionally biasing the essential information in the raw data. In other words, they assume that the knowledge derived from a data result is stable with respect to their processing choices. If such an assumption cannot be justified, they should use multiple appropriate processing methods and interpret data results that are stable across these methods. Others have advocated evaluating results across alternatively processed datasets under the name “multiverse analysis” (23). Although the stability principle was developed independently of this work, it naturally leads to a multiverse-style analysis.
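A minimal sketch of such a multiverse-style check follows; the normalization choices, data, and downstream summary are hypothetical:

```python
# Minimal sketch: checking whether a downstream result is stable across
# alternative preprocessing (normalization) choices. Hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 2))  # e.g. two biomarkers

preprocessing = {
    "log": lambda m: np.log(m),
    "z-score": lambda m: (m - m.mean(axis=0)) / m.std(axis=0),
    "rank": lambda m: np.argsort(np.argsort(m, axis=0), axis=0) / len(m),
}

# The "data result" here is the correlation between the two features;
# a stable result should hold qualitatively under each appropriate choice.
for name, f in preprocessing.items():
    x = f(raw)
    r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    print(f"{name:8s} correlation = {r:+.3f}")
```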
Exploratory data analysis: Both before the modeling stage and in post hoc analyses, data scientists often engage in exploratory data analysis (EDA) to identify interesting relationships in the data and interpret data results. When visualizations or summaries are used to communicate these analyses, it is implicitly assumed that the relationships or data results are stable with respect to any decisions made by the data scientist. For example, if the biologist believes that clusters in a heatmap represent biologically meaningful groups, she should expect to observe the same clusters with respect to any appropriate choice of distance metric, data perturbation, or clustering method.
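As a hypothetical sketch of this check, one could rerun a clustering under different methods and data perturbations and compare the resulting labelings, e.g. with the adjusted Rand index (ARI):

```python
# Minimal sketch: stability of cluster assignments across clustering
# methods and subsampling perturbations. Synthetic data, hypothetical choices.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ac = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("k-means vs. hierarchical ARI:", round(adjusted_rand_score(km, ac), 3))

# Perturbation check: recluster random 80% subsamples and compare labels
# on the retained points against the full-data k-means labeling.
rng = np.random.default_rng(0)
for _ in range(3):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[idx])
    print("subsample ARI:", round(adjusted_rand_score(km[idx], sub), 3))
```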
B. Predictability as reality check.‖

After data collection, cleaning/preprocessing, and EDA, models or algorithms∗∗ are frequently used to identify more complex relationships in data. Many essential components of the modeling stage rely on the language of mathematics, both in technical papers and in code. A seemingly obvious but often ignored question is why conclusions presented in the language of mathematics depict reality that exists independently in nature, and to what extent we should trust mathematical conclusions to impact this external reality.††

This concern has been articulated and addressed by many others in terms of prediction. For instance, Philip Dawid drew connections between statistical inference and prediction under the name “prequential statistics,” highlighting the importance of forecasts in statistical analyses (24). David Freedman argued that when a model’s predictions are not tested against reality, conclusions drawn from the model are unreliable (25). Seymour Geisser advocated that statistical analyses should focus on prediction rather than parametric inference, particularly in cases where the statistical model is an inappropriate description of reality (26). Leo Breiman championed the essential role of prediction in developing realistic models that yield sound scientific conclusions (6). It can even be argued that the goal of most domain problems is prediction at the meta level. That is, the primary value of learning relationships in data is often to predict some aspect of future reality.

‖ Predictability is a form of empirical validation, though other reality checks may be performed beyond prediction (e.g. checking whether a model recovers known phenomena).

∗∗ Different model or algorithm choices could correspond to different translations of a domain problem.

†† The PCS documentation in Sec. 4 helps users assess whether this connection is reliable.
B.1. Formulating prediction.

We describe a general framework for prediction with data D = (x, y), where x ∈ X represents input features and y ∈ Y the prediction target. Prediction targets y ∈ Y may be observed responses (e.g. supervised learning) or extracted from data (e.g. unsupervised learning). Predictive accuracy is a simple, quantitative metric to evaluate how well a model represents relationships in D. It is well-defined relative to a prediction function, testing data, and an evaluation metric. We detail each of these elements below.

Prediction function: The prediction function

h : X → Y   [1]

represents relationships between the observed features and the prediction target. For instance, in the case of supervised learning h may be a linear predictor or decision tree. In this setting, y is typically an observed response, such as a class label. In the case of unsupervised learning, h could map from input features to cluster centroids.

To compare multiple prediction functions, we consider

{h^(λ) : λ ∈ Λ},   [2]

where Λ denotes a collection of models/algorithms. For example, Λ may define different tuning parameters in lasso (27) or random forest (28). For deep neural networks, Λ could describe different network architectures. For algorithms with a randomized component, such as k-means or stochastic gradient descent, Λ can represent repeated runs. More broadly, Λ may describe a set of competing algorithms such as linear models, random forests, and neural networks, each corresponding to a different problem translation. We discuss model perturbations in more detail in Sec. D.3.
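Here is a minimal sketch of comparing a collection {h^(λ)} over a hypothetical Λ of lasso tuning parameters, each fit on training data and evaluated on held-out data; the data and grid are stand-ins:

```python
# Minimal sketch: a model-perturbation collection Λ of lasso tuning
# parameters, each fit on training data and screened on held-out data.
# The dataset and the grid of λ values are hypothetical.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=30, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Lambda = [0.01, 0.1, 1.0, 10.0]  # the perturbation set Λ
fits = {lam: Lasso(alpha=lam, max_iter=10_000).fit(X_train, y_train)
        for lam in Lambda}

for lam, h in fits.items():
    # Held-out R^2 acts as the predictability screen for each h^(λ).
    print(f"λ = {lam:5.2f}: test R^2 = {h.score(X_test, y_test):.3f}")
```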
Testing (held-out) data: We distinguish between training data that are used to fit a collection of prediction functions, and testing data that are used to evaluate the accuracy of fitted prediction functions.‡‡ At a minimum, one should evaluate predictive accuracy on a held-out test set generated at the same time and under the same conditions as the training data (e.g. by randomly sampling a subset of observations). This type of assessment addresses questions of internal validity, which describe the strength of a relationship in a given sample. It is also often important to understand how a model will perform in future conditions that differ from those that generated the training data. For instance, a biologist may want to apply their model to new cell lines. A social scientist might use a model trained on residents of one city to predict the behavior of residents in another. As an extreme example, one may want to use transfer learning to apply part of their model to an entirely new prediction problem. Testing data gathered under different conditions from the training data directly address questions of external validity, which describe how well a result will generalize to future observations. Domain knowledge and/or empirical validation are essential to assess the appropriateness of different prediction settings. These decisions should be reported in the proposed PCS documentation (Sec. 4).
‡‡ In some settings, a third set of data are used to tune model parameters.
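A hypothetical sketch of this distinction: evaluating on a random held-out split (internal validity) versus holding out an entire group, e.g. one laboratory or city, to mimic new conditions (external validity). The data and “site” labels below are synthetic:

```python
# Minimal sketch: internal vs. external validity checks. A random split
# tests within-sample generalization; a group-held-out split tests
# generalization to new conditions. Data and "site" groups are synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupShuffleSplit, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
site = np.repeat([0, 1, 2, 3], 100)  # pretend the data came from 4 sites

# Internal validity: random held-out split from the same conditions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("random split R^2:", round(Ridge().fit(X_tr, y_tr).score(X_te, y_te), 3))

# External validity proxy: hold out an entire site.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr, te = next(gss.split(X, y, groups=site))
print("held-out site R^2:",
      round(Ridge().fit(X[tr], y[tr]).score(X[te], y[te]), 3))
```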