Gaussian Processes for Regression:
A Quick Introduction
M. Ebden, August 2008
Comments to mebden@gmail.com
1 MOTIVATION
Figure 1 illustrates a typical example of a prediction problem: given some noisy observations of a dependent variable at certain values of the independent variable x, what is our best estimate of the dependent variable at a new value, x∗?
If we expect the underlying function f(x) to be linear, and can make some assumptions about the input data, we might use a least-squares method to fit a straight line (linear regression). Moreover, if we suspect f(x) may also be quadratic, cubic, or even nonpolynomial, we can use the principles of model selection to choose among the various possibilities.
Gaussian process regression (GPR) is an even finer approach than this. Rather
than claiming f(x) relates to some specific models (e.g. f(x) = mx + c), a Gaussian
process can represent f(x) obliquely, but rigorously, by letting the data ‘speak’ more
clearly for themselves. GPR is still a form of supervised learning, but the training data
are harnessed in a subtler way.
As such, GPR is a less ‘parametric’ tool. However, it’s not completely free-form, and if we’re unwilling to make even basic assumptions about f(x), then more general techniques should be considered, including those underpinned by the principle of maximum entropy; Chapter 6 of Sivia and Skilling (2006) offers an introduction.
[Figure 1 (plot omitted; axes x and y): Given six noisy data points (error bars are indicated with vertical lines), we are interested in estimating a seventh at x∗ = 0.2.]
2 DEFINITION OF A GAUSSIAN PROCESS
Gaussian processes (GPs) extend multivariate Gaussian distributions to infinite dimensionality. Formally, a Gaussian process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution.
Now, the n observations in an arbitrary data set, y = {y_1, ..., y_n}, can always be imagined as a single point sampled from some multivariate (n-variate) Gaussian distribution, after enough thought. Hence, working backwards, this data set can be partnered with a GP. Thus GPs are as universal as they are simple.
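To make the ‘finite subset’ idea concrete, here is a minimal sketch in Python/NumPy that draws one such n-dimensional sample from a multivariate Gaussian whose covariance comes from a smooth covariance function (the squared exponential introduced below). The input grid, length scale, and jitter value are arbitrary choices for illustration, not taken from the text.

```python
import numpy as np

# Any finite set of inputs x_1..x_n yields an n-variate Gaussian over the function values.
x = np.linspace(-2.0, 2.0, 50)                      # an arbitrary finite subset of the domain
K = np.exp(-(x[:, None] - x[None, :])**2 / 2.0)     # covariance between every pair of inputs
K += 1e-8 * np.eye(len(x))                          # tiny jitter for numerical stability

rng = np.random.default_rng(0)
f = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K)  # one 50-dimensional "single point"

print(f[:5])  # plotted against x, this vector traces out one smooth random function
```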
Very often, it’s assumed that the mean of this partner GP is zero everywhere. What relates one observation to another in such cases is just the covariance function, k(x, x′). A popular choice is the ‘squared exponential’,

\[
k(x, x') = \sigma_f^2 \exp\!\left[ \frac{-(x - x')^2}{2 l^2} \right], \tag{1}
\]

where the maximum allowable covariance is defined as σ_f²; this should be high for functions which cover a broad range on the y axis. If x ≈ x′, then k(x, x′) approaches this maximum, meaning f(x) is nearly perfectly correlated with f(x′). This is good: for our function to look smooth, neighbours must be alike. Now if x is distant from x′, we have instead k(x, x′) ≈ 0, i.e. the two points cannot ‘see’ each other. So, for example, during interpolation at new x values, distant observations will have negligible effect. How much effect this separation has will depend on the length parameter, l, so there is much flexibility built into (1).
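As a quick illustration of (1), the sketch below evaluates the squared exponential covariance for a few input pairs; the values σ_f = 1.0 and l = 1.0 are arbitrary demonstration choices, not taken from the text.

```python
import numpy as np

def sq_exp(x1, x2, sigma_f=1.0, length=1.0):
    """Squared exponential covariance of eq. (1): sigma_f^2 * exp(-(x1 - x2)^2 / (2 l^2))."""
    return sigma_f**2 * np.exp(-(x1 - x2)**2 / (2.0 * length**2))

# Neighbouring inputs are strongly correlated; distant inputs barely 'see' each other.
print(sq_exp(0.0, 0.0))   # the maximum, sigma_f^2
print(sq_exp(0.0, 0.1))   # close to the maximum
print(sq_exp(0.0, 5.0))   # essentially zero
```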
Not quite enough flexibility though: the data are often noisy as well, from measurement errors and so on. Each observation y can be thought of as related to an underlying function f(x) through a Gaussian noise model:

\[
y = f(x) + \mathcal{N}(0, \sigma_n^2), \tag{2}
\]
something which should look familiar to those who’ve done regression before. Regression is the search for f(x). Purely for simplicity of exposition on the next page, we take the novel approach of folding the noise into k(x, x′), by writing

\[
k(x, x') = \sigma_f^2 \exp\!\left[ \frac{-(x - x')^2}{2 l^2} \right] + \sigma_n^2\, \delta(x, x'), \tag{3}
\]
where δ(x, x′) is the Kronecker delta function. (When most people use Gaussian processes, they keep σ_n separate from k(x, x′). However, our redefinition of k(x, x′) is equally suitable for working with problems of the sort posed in Figure 1. So, given n observations y, our objective is to predict y∗, not the ‘actual’ f∗; their expected values are identical according to (2), but their variances differ owing to the observational noise process. For example, in Figure 1 the expected value of y∗, and of f∗, is the dot at x∗.)
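A minimal sketch of the noise-folded covariance (3). The δ(x, x′) term contributes σ_n² only when the two arguments are the same observation, so in practice it adds σ_n² to the diagonal of the training covariance matrix. The function name and the default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def k_noisy(x1, x2, sigma_f=1.0, length=1.0, sigma_n=0.3):
    """Covariance of eq. (3): squared exponential plus sigma_n^2 when the inputs coincide."""
    delta = 1.0 if x1 == x2 else 0.0   # Kronecker delta, delta(x, x')
    return (sigma_f**2 * np.exp(-(x1 - x2)**2 / (2.0 * length**2))
            + sigma_n**2 * delta)

print(k_noisy(0.5, 0.5))   # sigma_f^2 + sigma_n^2: the noise appears on the diagonal
print(k_noisy(0.5, 0.6))   # off the diagonal the noise term vanishes
```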
To prepare for GPR, we calculate the covariance function, (3), among all possible
combinations of these points, summarizing our findings in three matrices:
\[
K = \begin{bmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{bmatrix} \tag{4}
\]

\[
K_* = \begin{bmatrix} k(x_*, x_1) & k(x_*, x_2) & \cdots & k(x_*, x_n) \end{bmatrix}, \qquad
K_{**} = k(x_*, x_*). \tag{5}
\]
Confirm for yourself that the diagonal elements of K are σ_f² + σ_n², and that its extreme off-diagonal elements tend to zero when x spans a large enough domain.
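Assembling the three matrices of (4) and (5) is mechanical. The sketch below does it with explicit loops for clarity, using the noise-folded covariance of (3); the training inputs, the test input, and the hyperparameter values are arbitrary illustrative choices.

```python
import numpy as np

sigma_f, length, sigma_n = 1.0, 1.0, 0.3   # illustrative values only

def k(x1, x2):
    """Noise-folded covariance, eq. (3)."""
    delta = 1.0 if x1 == x2 else 0.0
    return sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * length**2)) + sigma_n**2 * delta

x = np.array([-1.0, 0.0, 0.5, 2.0])   # arbitrary training inputs
x_star = 1.0                          # arbitrary test input
n = len(x)

K = np.array([[k(x[i], x[j]) for j in range(n)] for i in range(n)])   # eq. (4), n x n
K_star = np.array([k(x_star, x[j]) for j in range(n)])                # eq. (5), 1 x n
K_starstar = k(x_star, x_star)                                        # eq. (5), scalar

print(np.diag(K))     # every diagonal entry is sigma_f^2 + sigma_n^2
print(K[0, -1])       # extreme off-diagonal entry: small when the inputs are far apart
```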
3 HOW TO REGRESS USING GAUSSIAN PROCESSES
Since the key assumption in GP modelling is that our data can be represented as a
sample from a multivariate Gaussian distribution, we have that
\[
\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix}
\sim \mathcal{N}\!\left( \mathbf{0},\;
\begin{bmatrix} K & K_*^{T} \\ K_* & K_{**} \end{bmatrix} \right), \tag{6}
\]
where T indicates matrix transposition. We are of course interested in the conditional probability p(y∗ | y): “given the data, how likely is a certain prediction for y∗?”. As explained more slowly in the Appendix, the probability follows a Gaussian distribution:

\[
y_* \mid \mathbf{y} \sim \mathcal{N}\!\left( K_* K^{-1} \mathbf{y},\; K_{**} - K_* K^{-1} K_*^{T} \right). \tag{7}
\]
Our best estimate for y∗ is the mean of this distribution:

\[
y_* = K_* K^{-1} \mathbf{y}, \tag{8}
\]

and the uncertainty in our estimate is captured in its variance:

\[
\operatorname{var}(y_*) = K_{**} - K_* K^{-1} K_*^{T}. \tag{9}
\]
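Equations (8) and (9) translate directly into a few lines of linear algebra. A minimal sketch follows; it solves linear systems with K rather than forming K⁻¹ explicitly, which is mathematically equivalent but numerically better behaved. The function name and the toy numbers in the demonstration are assumptions for illustration, not values from the text.

```python
import numpy as np

def gp_predict(K, K_star, K_starstar, y):
    """Posterior mean, eq. (8), and variance, eq. (9), for a single test point."""
    alpha = np.linalg.solve(K, y)          # K^{-1} y without an explicit inverse
    mean = K_star @ alpha                  # eq. (8): K_* K^{-1} y
    v = np.linalg.solve(K, K_star)         # K^{-1} K_*^T
    var = K_starstar - K_star @ v          # eq. (9)
    return mean, var

# Toy demonstration with two observations (made-up numbers):
K = np.array([[1.09, 0.60],
              [0.60, 1.09]])
K_star = np.array([0.80, 0.90])
K_starstar = 1.09
y = np.array([0.3, -0.1])
print(gp_predict(K, K_star, K_starstar, y))
```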
We’re now ready to tackle the data in Figure 1.
1. There are n = 6 observations y, at

   x = [−1.50, −1.00, −0.75, −0.40, −0.25, 0.00].

   We know σ_n = 0.3 from the error bars. With judicious choices of σ_f and l (more on this later), we have enough to calculate a covariance matrix using (4):

\[
K = \begin{bmatrix}
1.70 & 1.42 & 1.21 & 0.87 & 0.72 & 0.51 \\
1.42 & 1.70 & 1.56 & 1.34 & 1.21 & 0.97 \\
1.21 & 1.56 & 1.70 & 1.51 & 1.42 & 1.21 \\
0.87 & 1.34 & 1.51 & 1.70 & 1.59 & 1.48 \\
0.72 & 1.21 & 1.42 & 1.59 & 1.70 & 1.56 \\
0.51 & 0.97 & 1.21 & 1.48 & 1.56 & 1.70
\end{bmatrix}.
\]

   From (5) we also have K∗∗ = 1.70 and

\[
K_* = \begin{bmatrix} 0.38 & 0.79 & 1.03 & 1.35 & 1.46 & 1.58 \end{bmatrix}.
\]
2. From (8) and (9), y∗ = 0.95 and var(y∗) = 0.21 (a numerical sketch of this calculation follows the list).
3. Figure 1 shows a data point with a question mark underneath, representing the estimation of the dependent variable at x∗ = 0.2.
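As a numerical check on step 1, the sketch below rebuilds K and K∗ from the six x values. The hyperparameters are not stated in this excerpt, so the values below (σ_f ≈ 1.27, l ≈ 1, together with σ_n = 0.3) are inferred from the printed matrix, whose diagonal is σ_f² + σ_n² ≈ 1.70; treat them as assumptions. The six observed y values are only shown in Figure 1 and are not listed here, so the final prediction is indicated in comments rather than computed.

```python
import numpy as np

sigma_f, length, sigma_n = 1.27, 1.0, 0.3   # assumed: consistent with the 1.70 diagonal above

def k(x1, x2):
    delta = 1.0 if x1 == x2 else 0.0
    return sigma_f**2 * np.exp(-(x1 - x2)**2 / (2 * length**2)) + sigma_n**2 * delta

x = np.array([-1.50, -1.00, -0.75, -0.40, -0.25, 0.00])   # the six observation locations
x_star = 0.2

K = np.array([[k(xi, xj) for xj in x] for xi in x])
K_star = np.array([k(x_star, xj) for xj in x])
K_starstar = k(x_star, x_star)

print(np.round(K, 2))        # should roughly match the matrix printed in step 1
print(np.round(K_star, 2))   # roughly [0.38 0.79 1.03 1.35 1.46 1.58]

# With the observed y vector (read off Figure 1, not listed in this excerpt):
# y_star = K_star @ np.linalg.solve(K, y)                        # eq. (8), about 0.95
# var_y  = K_starstar - K_star @ np.linalg.solve(K, K_star)      # eq. (9), about 0.21
```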
We can repeat the above procedure for various other points spread over some portion of the x axis, as shown in Figure 2. (In fact, equivalently, we could avoid the repetition by performing the above procedure once with suitably larger K∗ and K∗∗ matrices. In this case, since there are 1,000 test points spread over the x axis, K∗∗ would be of size 1,000 × 1,000.) Rather than plotting simple error bars, we’ve decided to plot y∗ ± 1.96√var(y∗), giving a 95% confidence interval.
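The parenthetical remark about doing everything in one pass can be sketched as follows: K∗ becomes an m×n matrix and K∗∗ an m×m matrix (1,000 × 1,000 for 1,000 test points), and (8) and (9) are applied once. The hyperparameters are the same assumed values as above, and the y vector below is a placeholder (not the Figure 1 data) included only so the code runs; the 95% band is y∗ ± 1.96√var(y∗) as in the text.

```python
import numpy as np

sigma_f, length, sigma_n = 1.27, 1.0, 0.3        # assumed hyperparameters, as before

def cov(a, b, noise=False):
    """Matrix of eq.-(3) covariances between input sets a (m,) and b (n,)."""
    K = sigma_f**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * length**2))
    if noise:                                    # delta term: an input with itself only
        K = K + sigma_n**2 * np.eye(len(a))
    return K

x = np.array([-1.50, -1.00, -0.75, -0.40, -0.25, 0.00])
y = np.array([-1.6, -1.1, -0.4, 0.1, 0.4, 0.8])  # placeholder observations, NOT the Figure 1 data

x_test = np.linspace(-1.6, 0.3, 1000)            # 1,000 test points across the x axis

K = cov(x, x, noise=True)                        # n x n
K_star = cov(x_test, x)                          # m x n
K_starstar = cov(x_test, x_test, noise=True)     # m x m (1,000 x 1,000)

mean = K_star @ np.linalg.solve(K, y)                                # eq. (8), one value per test point
var = np.diag(K_starstar - K_star @ np.linalg.solve(K, K_star.T))    # eq. (9), diagonal only
band = 1.96 * np.sqrt(var)                                           # 95% confidence half-width

print(mean[:3], band[:3])
```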