to. We will discuss this idea more in Sec. B.
Data cleaning and preprocessing:
Statistics and machine learning models or algorithms help data scientists answer domain questions. Using models or algorithms requires cleaning (pre-processing) raw data into a suitable format, be it a categorical demographic feature or continuous measurements of biomarker concentrations. For instance, when data come from multiple laboratories, biologists must decide how to normalize individual measurements (for example, see (22)). When data scientists preprocess data, they are implicitly assuming that their choices are not unintentionally biasing the essential information in the raw data. In other words, they assume that the knowledge derived from a data result is stable with respect to their processing choices. If such an assumption cannot be justified, they should use multiple appropriate processing methods and interpret data results that are stable across these methods. Others have advocated evaluating results across alternatively processed datasets under the name “multiverse analysis” (23). Although the stability principle was developed independently of this work, it naturally leads to a multiverse-style analysis.
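To make this concrete, the following is a minimal Python sketch of such a stability check (not part of the original text): the same data result is recomputed under two plausible preprocessing choices and compared. The simulated measurements, the two normalization functions, and the correlation summary are all illustrative assumptions.

```python
import numpy as np

# Hypothetical raw biomarker measurements from two laboratories
# (rows: samples, columns: biomarkers); values are simulated for illustration.
rng = np.random.default_rng(0)
lab1 = rng.lognormal(mean=1.0, sigma=0.5, size=(50, 10))
lab2 = 1.3 * rng.lognormal(mean=1.0, sigma=0.5, size=(50, 10))  # lab-specific scaling
raw = np.vstack([lab1, lab2])

# Two appropriate, but different, preprocessing choices.
def log_standardize(x):
    z = np.log(x)
    return (z - z.mean(axis=0)) / z.std(axis=0)

def quantile_rank(x):
    # Rank-transform each column to [0, 1].
    return np.argsort(np.argsort(x, axis=0), axis=0) / (x.shape[0] - 1)

# The same downstream data result (here, pairwise feature correlations)
# computed under each preprocessing choice.
results = {name: np.corrcoef(f(raw), rowvar=False)
           for name, f in [("log_standardize", log_standardize),
                           ("quantile_rank", quantile_rank)]}

# A simple stability check: how much does the data result move
# when the preprocessing choice changes?
diff = np.max(np.abs(results["log_standardize"] - results["quantile_rank"]))
print(f"max change in correlations across preprocessing choices: {diff:.2f}")
```

Results that remain essentially unchanged across the two preprocessing pipelines would be reported; results that do not should be flagged as sensitive to processing choices.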
Exploratory data analysis:
Both before the modeling
stage and in post hoc analyses, data scientists often engage
in exploratory data analysis (EDA) to identify interesting
relationships in the data and interpret data results. When
visualizations or summaries are used to communicate these
analyses, it is implicitly assumed that the relationships or
data results are stable with respect to any decisions made by
the data scientist. For example, if the biologist believes that
clusters in a heatmap represent biologically meaningful groups,
she should expect to observe the same clusters with respect to
any appropriate choice of distance metric, data perturbation,
or clustering method.
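A cluster-stability check of this kind could be sketched as follows (our illustration, assuming scikit-learn and synthetic data standing in for a real heatmap): the same clustering question is answered with a second method and on a perturbed dataset, and agreement is measured with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three planted groups, standing in for the rows of a heatmap.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 20))
X = np.vstack([c + 0.5 * rng.normal(size=(30, 20)) for c in centers])

# The same clustering question answered with two different methods
# (the distance metric could be varied in the same way).
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# ... and under a data perturbation (random subsampling of observations).
idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
labels_km_sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[idx])

# Adjusted Rand index near 1 indicates cluster assignments that are
# stable to the choice of method and to the perturbation.
print("agreement across methods:    ", adjusted_rand_score(labels_km, labels_hc))
print("agreement under subsampling: ", adjusted_rand_score(labels_km[idx], labels_km_sub))
```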
B. Predictability as reality check.‖
After data collection, cleaning/preprocessing, and EDA, models or algorithms∗∗ are frequently used to identify more complex relationships in data. Many essential components of the modeling stage rely on the language of mathematics, both in technical papers and in code. A seemingly obvious but often ignored question is why conclusions presented in the language of mathematics depict reality that exists independently in nature, and to what extent we should trust mathematical conclusions to impact this external reality.††
This concern has been articulated and addressed by many others in terms of prediction. For instance, Philip Dawid drew connections between statistical inference and prediction under the name “prequential statistics,” highlighting the importance of forecasts in statistical analyses (24). David Freedman argued that when a model’s predictions are not tested against reality, conclusions drawn from the model are unreliable (25). Seymour Geisser advocated that statistical analyses should focus on prediction rather than parametric inference, particularly in cases where the statistical model is an inappropriate description of reality (26). Leo Breiman championed the essential role of prediction in developing realistic models that yield sound scientific conclusions (6). It can even be argued that the goal of most domain problems is prediction at the meta level. That is, the primary value of learning relationships in data is often to predict some aspect of future reality.

‖ Predictability is a form of empirical validation, though other reality checks may be performed beyond prediction (e.g. checking whether a model recovers known phenomena).
∗∗ Different model or algorithm choices could correspond to different translations of a domain problem.
†† The PCS documentation in Sec. 4 helps users assess whether this connection is reliable.
B.1. Formulating prediction.
We describe a general framework for prediction with data D = (x, y), where x ∈ X represents input features and y ∈ Y the prediction target. Prediction targets y ∈ Y may be observed responses (e.g. supervised learning) or extracted from data (e.g. unsupervised learning). Predictive accuracy is a simple, quantitative metric to evaluate how well a model represents relationships in D. It is well-defined relative to a prediction function, testing data, and an evaluation metric. We detail each of these elements below.
Prediction function: The prediction function

h : X → Y [1]

represents relationships between the observed features and the prediction target. For instance, in the case of supervised learning h may be a linear predictor or decision tree. In this setting, y is typically an observed response, such as a class label. In the case of unsupervised learning, h could map from input features to cluster centroids.
To compare multiple prediction functions, we consider

{h^(λ) : λ ∈ Λ}, [2]

where Λ denotes a collection of models/algorithms. For example, Λ may define different tuning parameters in lasso (27) or random forest (28). For deep neural networks, Λ could describe different network architectures. For algorithms with a randomized component, such as k-means or stochastic gradient descent, Λ can represent repeated runs. More broadly, Λ may describe a set of competing algorithms such as linear models, random forests, and neural networks, each corresponding to a different problem translation. We discuss model perturbations in more detail in Sec. D.3.
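As an illustration only, one possible instantiation of the collection in Eq. [2] is sketched below in Python with scikit-learn; the particular tuning grids, random seeds, and simulated training data are our own assumptions rather than part of the framework.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# One concrete instantiation of {h^(lambda) : lambda in Lambda}:
# lasso over a grid of penalties, random forests over tree depths,
# and repeated runs of the same randomized algorithm.
Lambda = {}
for alpha in [0.01, 0.1, 1.0]:
    Lambda[f"lasso(alpha={alpha})"] = Lasso(alpha=alpha)
for depth in [2, 5, None]:
    Lambda[f"rf(max_depth={depth})"] = RandomForestRegressor(
        n_estimators=200, max_depth=depth, random_state=0)
for seed in range(3):  # repeated runs capture algorithmic randomness
    Lambda[f"rf(seed={seed})"] = RandomForestRegressor(
        n_estimators=200, random_state=seed)

# Fit every member of Lambda on the same (here simulated) training data.
rng = np.random.default_rng(2)
x_train = rng.normal(size=(200, 10))
y_train = x_train[:, 0] - 2 * x_train[:, 1] + rng.normal(size=200)
fitted = {name: h.fit(x_train, y_train) for name, h in Lambda.items()}
```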
Testing (held-out) data:
We distinguish between training data that are used to fit a collection of prediction functions, and testing data that are used to evaluate the accuracy of fitted prediction functions.‡‡ At a minimum, one should evaluate predictive accuracy on a held-out test set generated at the same time and under the same conditions as the training data (e.g. by randomly sampling a subset of observations). This type of assessment addresses questions of internal validity, which describe the strength of a relationship in a given sample. It is also often important to understand how a model will perform in future conditions that differ from those that generated the training data. For instance, a biologist may want to apply their model to new cell lines. A social scientist might use a model trained on residents from one city to predict the behavior of residents in another. As an extreme example, one may want to use transfer learning to apply part of their model to an entirely new prediction problem. Testing data gathered under different conditions from the training data directly addresses questions of external validity, which describe how well a result will generalize to future observations. Domain knowledge and/or empirical validation are essential to assess the appropriateness of different prediction settings. These decisions should be reported in the proposed PCS documentation (Sec. 4).
‡‡ In some settings, a third set of data is used to tune model parameters.
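Continuing the illustrative setup above, the sketch below separates an internal test set (drawn from the same conditions as training) from an external one (a shifted condition standing in for, e.g., a new cell line or a different city); the simulated shift and the R^2 metric are assumptions made for the example, not prescriptions of the framework.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Simulated data standing in for the original study conditions.
rng = np.random.default_rng(3)
x = rng.normal(size=(300, 10))
y = x[:, 0] - 2 * x[:, 1] + rng.normal(size=300)

# Internal validity: a held-out test set from the same conditions.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)
h = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_tr, y_tr)
print("internal test R^2:", r2_score(y_te, h.predict(x_te)))

# External validity: test data gathered under different conditions,
# mimicked here by shifting and rescaling the features.
x_ext = 1.5 * rng.normal(size=(300, 10)) + 0.5
y_ext = x_ext[:, 0] - 2 * x_ext[:, 1] + rng.normal(size=300)
print("external test R^2:", r2_score(y_ext, h.predict(x_ext)))
```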