QuantileRegressionForest.pdf: quantile random forest resource (CSDN Library)


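### Quantile Regression Forests: Key Points Explained

#### 1. Introduction and Background

The paper in "QuantileRegressionForest.pdf" introduces a machine learning tool, quantile regression forests, which builds on random forests and extends their functionality. Random forests are a powerful machine learning algorithm that performs well on high-dimensional regression and classification tasks. A standard random forest gives an accurate estimate of the conditional mean of the response variable. Quantile regression forests go further and provide information about the full conditional distribution of the response variable, not only its conditional mean.

#### 2. A Brief Introduction to Random Forests

Before discussing quantile regression forests, it helps to recall how random forests work. A random forest is an ensemble learning method that learns from data by building many decision trees. Each tree is trained on a bagged (resampled) version of the data, and the variable used at each split is also chosen with an element of randomness. This design makes random forests stable and accurate, and particularly well suited to high-dimensional data.

#### 3. The Core Concept of Quantile Regression Forests

The core idea is to use the random forest machinery to estimate conditional quantiles of the response variable. For a continuous response variable \( Y \) and a possibly high-dimensional predictor variable \( X \), traditional regression analysis focuses on estimating the conditional mean \( E(Y|X=x) \). Quantile regression forests provide richer information, such as the conditional quantile \( Q_\alpha(x) \): the threshold such that, given \( X=x \), the probability that \( Y \) is less than or equal to it is \( \alpha \).

#### 4. How Quantile Regression Forests Work

1. **Random sampling and feature selection**: As in a standard random forest, each tree is built on a random sample drawn from the original data set, and a random subset of the features is considered for the split at each node.
2. **Tree construction**: The trees are grown exactly as in the standard random forests algorithm. The method does not directly minimize a quantile-specific loss function when choosing splits.
3. **Quantile estimation**: For a new point \( x \), the forest yields a weight \( w_i(x) \) for each of the \( n \) original observations. The conditional distribution is estimated by the weighted distribution of the observed responses, using the same weights that a random forest uses for its weighted-mean prediction, and conditional quantiles are read off from this estimate.

#### 5. Advantages of Quantile Regression Forests

1. **Non-parametric**: The method does not assume that the data follow any particular distributional form, which makes it flexible in practice.
2. **Adaptivity**: The width of the estimated prediction interval adapts to \( x \), so it can reflect how the spread of the response changes across the predictor space.
3. **Suitability for high-dimensional data**: The method is well suited to high-dimensional predictor variables, a common situation in modern data science.
4. **Consistency**: The author proves that the algorithm is consistent, meaning that the estimates approach the true values as the sample size grows.

#### 6. Numerical Experiments and Applications

The paper reports a series of numerical experiments that compare quantile regression forests with existing methods; the results suggest that the algorithm is competitive in terms of predictive power. Quantile regression forests can be applied in areas such as financial risk assessment and economic forecasting, especially where the shape of the distribution matters and not only its mean. As an extension of random forests, the method provides richer statistical information while retaining the strengths of random forests on high-dimensional data and non-linear relationships.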


Journal of Machine Learning Research 7 (2006) 983–999 Submitted 10/05; Revised 2/06; Published 6/06

Quantile Regression Forests

Nicolai Meinshausen nicolai@stat.math.ethz.ch

Seminar für Statistik

ETH Zürich

8092 Zürich, Switzerland

Editor: Greg Ridgeway

Abstract

Random forests were introduced as a machine learning tool in Breiman (2001) and have since proven to be very popular and powerful for high-dimensional regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. It is shown here that random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Conditional quantiles can be inferred with quantile regression forests, a generalisation of random forests. Quantile regression forests give a non-parametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables. The algorithm is shown to be consistent. Numerical examples suggest that the algorithm is competitive in terms of predictive power.

Keywords: quantile regression, random forests, adaptive neighborhood regression

1. Introduction

Let Y be a real-valued response variable and X a covariate or predictor variable, possibly high-dimensional. A standard goal of statistical analysis is to infer, in some way, the relationship between Y and X. Standard regression analysis tries to come up with an estimate \( \hat{\mu}(x) \) of the conditional mean \( E(Y \mid X = x) \) of the response variable Y, given X = x. The conditional mean minimizes the expected squared error loss,

\[ E(Y \mid X = x) = \arg\min_z E\{ (Y - z)^2 \mid X = x \}, \]

and approximation of the conditional mean is typically achieved by minimization of a squared error type loss function over the available data.
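To see why the conditional mean is the minimizer, the following worked step may help. It is a sketch that assumes Y has finite conditional variance; the shorthand \( \mu(x) = E(Y \mid X = x) \) is introduced here only for illustration.

\[
\begin{aligned}
E\{ (Y - z)^2 \mid X = x \}
&= E\{ (Y - \mu(x))^2 \mid X = x \} + 2\,(\mu(x) - z)\, E\{ Y - \mu(x) \mid X = x \} + (\mu(x) - z)^2 \\
&= \operatorname{Var}(Y \mid X = x) + (\mu(x) - z)^2 .
\end{aligned}
\]

The cross term vanishes because \( E\{ Y - \mu(x) \mid X = x \} = 0 \). The first term does not depend on z, so the expression is smallest at \( z = \mu(x) \).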

© 2006 Nicolai Meinshausen.

**Beyond the Conditional Mean** The conditional mean illuminates just one aspect of the conditional distribution of a response variable Y, yet neglects all other features of possible interest. This led to the development of quantile regression; for a good summary see e.g. Koenker (2005). The conditional distribution function \( F(y \mid X = x) \) is given by the probability that, for X = x, Y is smaller than \( y \in \mathbb{R} \),

\[ F(y \mid X = x) = P(Y \le y \mid X = x). \]

For a continuous distribution function, the α-quantile \( Q_\alpha(x) \) is then defined such that the probability of Y being smaller than \( Q_\alpha(x) \) is, for a given X = x, exactly equal to α. In general,

\[ Q_\alpha(x) = \inf\{ y : F(y \mid X = x) \ge \alpha \}. \tag{1} \]

The quantiles give more complete information about the distribution of Y as a function of

the predictor variable X than the conditional mean alone.
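As a minimal sketch of definition (1), the code below applies it to the empirical distribution function of a plain sample, ignoring the conditioning on X. The function names and the data values are made up for illustration and are not part of the paper.

```python
def empirical_cdf(sample, y):
    """Fraction of observations that are less than or equal to y."""
    return sum(1 for value in sample if value <= y) / len(sample)


def empirical_quantile(sample, alpha):
    """Smallest observed y whose empirical CDF is at least alpha."""
    if not sample:
        raise ValueError("sample must be non-empty")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    for y in sorted(sample):
        if empirical_cdf(sample, y) >= alpha:
            return y
    return max(sample)  # not reached: the empirical CDF equals 1 at the maximum


ozone = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]  # made-up values
print(empirical_quantile(ozone, 0.5))
print(empirical_quantile(ozone, 0.9))
```

Scanning the sorted observations is enough here because the empirical distribution function only jumps at observed values, so the infimum in (1) is always attained at one of them.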

As an example, consider the predictions of next day ozone levels, as in Breiman and Friedman (1985). Least-squares regression tries to estimate the conditional mean of ozone levels. It gives little information about the fluctuations of ozone levels around this predicted conditional mean. It might for example be of interest to find an ozone level that is, with high probability, not surpassed. This can be achieved with quantile regression, as it gives information about the spread of the response variable. For some other examples see Le et al. (2005), which is to the best of our knowledge the first time that quantile regression is mentioned in the Machine Learning literature.

**Prediction Intervals** How reliable is a prediction for a new instance? This is a related question of interest. Consider again the prediction of next day ozone levels. Some days, it might be possible to pinpoint next day ozone levels to a higher accuracy than on other days (this can indeed be observed for the ozone data, see the section with numerical results). With standard prediction, a single point estimate is returned for each new instance. This point estimate does not contain information about the dispersion of observations around the predicted value.

Quantile regression can be used to build prediction intervals. A 95% prediction interval for the value of Y is given by

\[ I(x) = [\, Q_{.025}(x),\; Q_{.975}(x) \,]. \tag{2} \]

That is, a new observation of Y, for X = x, is with high probability in the interval I(x). The width of this prediction interval can vary greatly with x. Indeed, going back to the previous example, next day ozone level can on some days be predicted five times more accurately than on other days. This effect is even more pronounced for other data sets. Quantile regression thus offers a principled way of judging the reliability of predictions.
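The interval in (2) can be illustrated with a small sketch. It assumes that, for two fixed values of x, we have samples from the conditional distribution of Y; in practice these quantiles would come from a fitted quantile regression model. The two simulated "days", their means and spreads, and all function names are hypothetical.

```python
import random


def empirical_quantile(sample, alpha):
    """Smallest observed value whose empirical CDF is at least alpha."""
    ordered = sorted(sample)
    n = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        if rank / n >= alpha:
            return value
    return ordered[-1]


def prediction_interval(sample, lower_alpha=0.025, upper_alpha=0.975):
    """Return the (lower, upper) quantile pair that forms the interval."""
    return (empirical_quantile(sample, lower_alpha),
            empirical_quantile(sample, upper_alpha))


def coverage(sample, interval):
    """Fraction of the sample that falls inside the closed interval."""
    low, high = interval
    return sum(1 for value in sample if low <= value <= high) / len(sample)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
calm_day = [rng.gauss(50.0, 1.0) for _ in range(1000)]      # small spread
volatile_day = [rng.gauss(50.0, 5.0) for _ in range(1000)]  # large spread

calm_interval = prediction_interval(calm_day)
volatile_interval = prediction_interval(volatile_day)
print("calm day interval:", calm_interval)
print("volatile day interval:", volatile_interval)
```

Both intervals are built to cover about 95% of their own sample, yet the interval for the volatile day is much wider, which is the sense in which the interval width reflects how reliable a prediction is.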

**Outlier Detection** Quantile regression can likewise be used for outlier detection (for surveys on outlier detection see e.g. Barnett and Lewis, 1994; Hodge and Austin, 2004). A new observation (X, Y) would be regarded as an outlier if its observed value Y is extreme, in some sense, with regard to the predicted conditional distribution function.

There is, however, no generally applicable rule of what precisely constitutes an “extreme” observation. One could possibly flag observations as outliers if the distance between Y and the median of the conditional distribution is large; “large” being measured in comparison to some robust measure of dispersion like the conditional median absolute deviation or the conditional interquartile range (Huber, 1973). Both quantities are made available by quantile regression.

Note that only anomalies in the conditional distribution of Y can be detected in this

way. Outliers of X itself cannot be detected. Other research has focused on detecting

anomalies for unlabelled data (e.g. Markou and Singh, 2003; Steinwart et al., 2005).
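One possible reading of the flagging rule described above is sketched below: the distance of Y from the conditional median is measured in units of the conditional interquartile range. The three quantile arguments are assumed to come from some quantile regression fit at X = x, and the threshold of 3.0 is a hypothetical choice, since the text gives no general rule.

```python
def outlier_score(y, q25, q50, q75):
    """Distance of y from the conditional median, in units of the conditional IQR."""
    iqr = q75 - q25
    if iqr <= 0:
        raise ValueError("interquartile range must be positive")
    return abs(y - q50) / iqr


def is_outlier(y, q25, q50, q75, threshold=3.0):
    """Flag y when its score exceeds the (hypothetical) threshold."""
    return outlier_score(y, q25, q50, q75) > threshold


# Made-up conditional quartiles for one value of x.
print(outlier_score(10.0, 2.0, 4.0, 6.0))
print(is_outlier(20.0, 2.0, 4.0, 6.0))
```

Because the score uses only conditional quantiles of Y, it says nothing about whether x itself is unusual, in line with the limitation noted above.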


**Estimating Quantiles from Data** Quantile regression aims to estimate the conditional quantiles from data. Quantile regression can be cast as an optimization problem, just as estimation of the conditional mean is achieved by minimizing a squared error loss function. Let the loss function \( L_\alpha \) be defined for \( 0 < \alpha < 1 \) by the weighted absolute deviations

\[ L_\alpha(y, q) = \begin{cases} \alpha \, |y - q| & y > q \\ (1 - \alpha) \, |y - q| & y \le q \end{cases} \tag{3} \]

While the conditional mean minimizes the expected squared error loss, conditional quantiles minimize the expected loss \( E(L_\alpha) \),

\[ Q_\alpha(x) = \arg\min_q E\{ L_\alpha(Y, q) \mid X = x \}. \]
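The loss (3) and its link to quantiles can be checked numerically. The sketch below works on a plain sample without conditioning on X and searches only over the observed values as candidates for q; the function names and the data are made up for illustration.

```python
def pinball_loss(y, q, alpha):
    """Loss (3): weight alpha when y is above q, (1 - alpha) otherwise."""
    weight = alpha if y > q else 1 - alpha
    return weight * abs(y - q)


def mean_pinball_loss(sample, q, alpha):
    """Average loss of using q as the prediction for every observation."""
    return sum(pinball_loss(y, q, alpha) for y in sample) / len(sample)


def loss_minimizing_value(sample, alpha):
    """Observed value with the smallest average loss (smallest value on ties)."""
    candidates = sorted(set(sample))
    return min(candidates, key=lambda q: mean_pinball_loss(sample, q, alpha))


sample = [float(v) for v in range(1, 11)]  # made-up data: 1.0, 2.0, ..., 10.0
print(loss_minimizing_value(sample, 0.75))
```

For these ten values and α = 0.75 the minimizer is 8.0, which is also the smallest value at which the empirical distribution function reaches 0.75, so the loss-based and the distribution-based definitions agree on this example.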

A parametric quantile regression is solved by optimizing the parameters so that the empirical loss is minimal. This can be achieved efficiently due to the convex nature of the optimization problem (Portnoy and Koenker, 1997). Non-parametric approaches, in particular quantile smoothing splines (He et al., 1998; Koenker et al., 1994), involve similar ideas. Chaudhuri and Loh (2002) developed an interesting tree-based method for estimation of conditional quantiles which gives good performance and allows for easy interpretation, being in this respect similar to CART (Breiman et al., 1984).

In this manuscript, a different approach is proposed, which does not directly employ minimization of a loss function of the sort (3). Rather, the method is based on random forests (Breiman, 2001). Random forests grows an ensemble of trees, employing random node and split point selection, inspired by Amit and Geman (1997). The prediction of random forests can then be seen as an adaptive neighborhood classification and regression procedure (Lin and Jeon, 2002). For every X = x, a set of weights \( w_i(x) \), \( i = 1, \ldots, n \), for the original n observations is obtained. The prediction of random forests, or estimation of the conditional mean, is equivalent to the weighted mean of the observed response variables. For quantile regression forests, trees are grown as in the standard random forests algorithm. The conditional distribution is then estimated by the weighted distribution of observed response variables, where the weights attached to observations are identical to the original random forests algorithm.
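The weighting idea in the paragraph above can be sketched as follows. The excerpt does not spell out how the weights are computed, so the form used here is an assumption: within each tree, every training observation that falls in the same leaf as x shares a total weight of one equally, and the per-tree weights are averaged over trees. The leaf assignments are supplied directly as hypothetical inputs, not produced by a fitted forest, and all names are illustrative.

```python
def forest_weights(train_leaves, query_leaves):
    """Weights w_i(x): per-tree leaf-sharing weights, averaged over all trees.

    train_leaves[t][i] is the leaf of training observation i in tree t;
    query_leaves[t] is the leaf that the query point x reaches in tree t.
    """
    n_trees = len(train_leaves)
    weights = [0.0] * len(train_leaves[0])
    for leaves, target in zip(train_leaves, query_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == target]
        for i in members:
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def weighted_mean(y, weights):
    """Weighted mean of the observed responses."""
    return sum(w * value for value, w in zip(y, weights))


def weighted_quantile(y, weights, alpha):
    """Smallest observed response whose weighted empirical CDF reaches alpha."""
    cumulative = 0.0
    for value, w in sorted(zip(y, weights)):
        cumulative += w
        if cumulative >= alpha:
            return value
    return max(y)  # guard against rounding when alpha is very close to 1


# Hypothetical toy forest: 2 trees, 4 training observations.
y_train = [1.0, 2.0, 3.0, 4.0]
train_leaves = [[0, 0, 1, 1],   # tree 1 groups observations {0, 1} and {2, 3}
                [0, 1, 1, 1]]   # tree 2 groups observations {0} and {1, 2, 3}
query_leaves = [0, 1]           # x lands in leaf 0 of tree 1, leaf 1 of tree 2

w = forest_weights(train_leaves, query_leaves)
print(weighted_mean(y_train, w))
print(weighted_quantile(y_train, w, 0.5))
```

In this toy example the weighted mean is 2.25, the average of the two leaf means 1.5 and 3.0, which matches the view of the forest prediction as a weighted mean. The same weights then give quantile estimates with no change to how the trees are grown.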

In Section 2, necessary notation is introduced and the mechanism of random forests is briefly explained, using the interpretation of Lin and Jeon (2002), which views random forests as an adaptive nearest neighbor algorithm, a view that is later supported in Breiman (2004). Using this interpretation, quantile regression forests are introduced in Section 3 as a natural generalisation of random forests. A proof of consistency is given in Section 4, while encouraging numerical results for popular machine learning data sets are presented in Section 5.

2. Random Forests

Random forests grows an ensemble of trees, using n independent observations

\[ (Y_i, X_i), \quad i = 1, \ldots, n. \]

A large number of trees is grown. For each tree and each node, random forests employs

randomness when selecting a variable to split on. For each tree, a bagged version of the
