Journal of Machine Learning Research 1 (2001) 211-244    Submitted 5/00; Published 6/01
Sparse Bayesian Learning and the Relevance Vector Machine
Michael E. Tipping
mtipping@microsoft.com
Microsoft Research
St George House, 1 Guildhall Street
Cambridge CB2 3NH, U.K.
Editor: Alex Smola
Abstract
This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the 'relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art 'support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of 'nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-'Mercer' kernels).

We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.
1. Introduction
In supervised learning we are given a set of examples of input vectors {x_n}_{n=1}^N along with corresponding targets {t_n}_{n=1}^N, the latter of which might be real values (in regression) or class labels (classification). From this 'training' set we wish to learn a model of the dependency of the targets on the inputs with the objective of making accurate predictions of t for previously unseen values of x. In real-world data, the presence of noise (in regression) and class overlap (in classification) implies that the principal modelling challenge is to avoid 'over-fitting' of the training set.
Typically, we base our predictions upon some function y(x) defined over the input space, and 'learning' is the process of inferring (perhaps the parameters of) this function. A flexible and popular set of candidates for y(x) is that of the form:

y(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}),    (1)

where the output is a linearly-weighted sum of M, generally nonlinear and fixed, basis functions φ(x) = (φ_1(x), φ_2(x), ..., φ_M(x))^T. Analysis of functions of the type (1) is facilitated
since the adjustable parameters (or 'weights') w = (w_1, w_2, ..., w_M)^T appear linearly, and the objective is to estimate 'good' values for those parameters.
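To make the functional form (1) concrete, the following minimal Python sketch evaluates y(x; w) for a small fixed basis; the particular basis (a constant plus two Gaussian bumps) and the weight values are illustrative assumptions, not choices made in the paper.

    import numpy as np

    def y(x, w, basis):
        # Equation (1): y(x; w) = sum_i w_i * phi_i(x) = w^T phi(x)
        phi = np.array([phi_i(x) for phi_i in basis])  # phi(x), length M
        return w @ phi

    # Illustrative fixed basis: a constant plus two Gaussian bumps (assumed here).
    basis = [
        lambda x: 1.0,
        lambda x: np.exp(-(x - 1.0) ** 2),
        lambda x: np.exp(-(x + 1.0) ** 2),
    ]
    w = np.array([0.5, 2.0, -1.0])   # arbitrary example weights

    print(y(0.3, w, basis))          # scalar prediction at x = 0.3

Because the weights enter linearly, fitting such a model reduces to estimating the vector w, regardless of how nonlinear the individual basis functions are.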
In this paper, we detail a Bayesian probabilistic framework for learning in general models of the form (1). The key feature of this approach is that as well as offering good generalisation performance, the inferred predictors are exceedingly sparse in that they contain relatively few non-zero w_i parameters. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning those basis functions which are 'relevant' for making good predictions.

While the range of models of the type (1) that we can address is extremely broad, we concentrate here on a specialisation that we denote the 'relevance vector machine' (RVM), originally introduced by Tipping (2000). We consider functions of a type corresponding to those implemented by another sparse linearly-parameterised model, the support vector machine (SVM) (Boser et al., 1992; Vapnik, 1998; Schölkopf et al., 1999a). The SVM makes predictions based on the function:

y(\mathbf{x}; \mathbf{w}) = \sum_{i=1}^{N} w_i K(\mathbf{x}, \mathbf{x}_i) + w_0,    (2)
where K(x, x_i) is a kernel function, effectively defining one basis function for each example in the training set.¹ The key feature of the SVM is that, in the classification case, its target function attempts to minimise a measure of error on the training set while simultaneously maximising the 'margin' between the two classes (in the feature space implicitly defined by the kernel). This is a highly effective mechanism for avoiding over-fitting, which leads to good generalisation, and which furthermore results in a sparse model dependent only on a subset of kernel functions: those associated with training examples x_n (the "support vectors") that lie either on the margin or on the 'wrong' side of it. State-of-the-art results have been reported on many tasks where the SVM has been applied.

1. Note that the SVM predictor is not defined explicitly in this form; rather, (2) emerges implicitly as a consequence of the use of the kernel function to define a dot-product in some notional feature space.
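For the kernel-based form (2), each training input contributes one basis function K(x, x_i). A minimal Python sketch, assuming a Gaussian kernel with an arbitrary width (the kernel choice and its parameters are assumptions; any kernel fits this functional form):

    import numpy as np

    def predict(x, X_train, w, w0, r=1.0):
        # Equation (2): y(x; w) = sum_i w_i K(x, x_i) + w_0,
        # with one kernel basis function centred on each training input.
        K = np.exp(-np.sum((X_train - x) ** 2, axis=1) / r ** 2)  # Gaussian kernel (assumed)
        return w @ K + w0

    # Toy usage: 5 training inputs in 2 dimensions, arbitrary weights.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5, 2))
    w, w0 = rng.normal(size=5), 0.1
    print(predict(np.array([0.2, -0.4]), X_train, w, w0))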
However, despite its success, we can identify a number of significant and practical disadvantages of the support vector learning methodology:

• Although relatively sparse, SVMs make unnecessarily liberal use of basis functions since the number of support vectors required typically grows linearly with the size of the training set. Some form of post-processing is often required to reduce computational complexity (Burges, 1996; Burges and Schölkopf, 1997).

• Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification, a 'hard' binary decision. Ideally, we desire to estimate the conditional distribution p(t|x) in order to capture uncertainty in our prediction. In regression this may take the form of 'error-bars', but it is particularly crucial in classification where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs. Posterior probability estimates have been coerced from SVMs via post-processing (Platt, 2000), although we argue that these estimates are unreliable (Appendix D.2).
• It is necessary to estimate the error/margin trade-off parameter 'C' (and in regression, the insensitivity parameter 'ε' too). This generally entails a cross-validation procedure, which is wasteful both of data and computation.

• The kernel function K(x, x_i) must satisfy Mercer's condition. That is, it must be the continuous symmetric kernel of a positive integral operator² (a simple numerical check of this property is sketched below).

2. This restriction can be relaxed slightly to include conditionally positive kernels (Smola et al., 1998; Schölkopf, 2001).
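A practical symptom of Mercer's condition is that the kernel's Gram matrix on any finite sample must be symmetric positive semi-definite. The Python sketch below checks the eigenvalue spectrum numerically; the specific kernels, parameter values and sample are illustrative assumptions, and the sigmoid (tanh) kernel is included only because it is known to violate the condition for some parameter settings.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))                 # arbitrary sample of 30 inputs

    def gram(kernel):
        # Gram (kernel) matrix K_ij = K(x_i, x_j) on the sample.
        return np.array([[kernel(a, b) for b in X] for a in X])

    rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # Gaussian kernel: satisfies Mercer
    sig = lambda a, b: np.tanh(2.0 * (a @ b) + 1.0)    # sigmoid kernel: not Mercer in general

    for name, k in [("rbf", rbf), ("sigmoid", sig)]:
        min_eig = np.linalg.eigvalsh(gram(k)).min()
        print(f"{name:8s} minimum Gram eigenvalue: {min_eig:.3e}")
    # The RBF Gram matrix is positive semi-definite (minimum eigenvalue >= 0 up to
    # rounding error); the sigmoid kernel can produce negative eigenvalues for some
    # parameter settings, signalling a Mercer violation -- the restriction the RVM drops.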
The 'relevance vector machine' (RVM) is a Bayesian treatment³ of (2) which does not suffer from any of the above limitations. Specifically, we adopt a fully probabilistic framework and introduce a prior over the model weights governed by a set of hyperparameters, one associated with each weight, whose most probable values are iteratively estimated from the data. Sparsity is achieved because in practice we find that the posterior distributions of many of the weights are sharply (indeed infinitely) peaked around zero. We term those training vectors associated with the remaining non-zero weights 'relevance' vectors, in deference to the principle of automatic relevance determination which motivates the presented approach (MacKay, 1994; Neal, 1996). The most compelling feature of the RVM is that, while capable of generalisation performance comparable to an equivalent SVM, it typically utilises dramatically fewer kernel functions.

3. Note that our approach is not a Bayesian treatment of the SVM methodology per se, an area which has seen much recent interest (Sollich, 2000; Seeger, 2000; Kwok, 2000); here we treat the kernel function as simply defining a set of basis functions, rather than as a definition of a dot-product in some space.
In the next section, we introduce the Bayesian model, initially for regression, and define the procedure for obtaining hyperparameter values, and from them, the weights. The framework is then extended straightforwardly to the classification case in Section 3. In Section 4, we give some visualisable examples of application of the RVM in both scenarios, along with an illustration of some potentially powerful extensions to the basic model, before offering some benchmark comparisons with the SVM. We offer some theoretical insight into the reasons behind the observed sparsity of the technique in Section 5 before summarising in Section 6. To streamline the presentation within the main text, considerable theoretical and implementational details are reserved for the appendices.
2. Sparse Bayesian Learning for Regression
We now detail the sparse Bayesian regression model and associated inference procedures. The classification counterpart is considered in Section 3.
2.1 Model Specification
Given a data set of input-target pairs {x_n, t_n}_{n=1}^N, considering scalar-valued target functions only, we follow the standard probabilistic formulation and assume that the targets are samples from the model with additive noise:

t_n = y(\mathbf{x}_n; \mathbf{w}) + \epsilon_n,    (3)

where ε_n are independent samples from some noise process which is further assumed to be mean-zero Gaussian with variance σ². Thus p(t_n|x) = N(t_n | y(x_n), σ²), where the notation
specifies a Gaussian distribution over t_n with mean y(x_n) and variance σ². The function y(x) is as defined in (2) for the SVM, where we identify our general basis functions with the kernel as parameterised by the training vectors: φ_i(x) ≡ K(x, x_i). Due to the assumption of independence of the t_n, the likelihood of the complete data set can be written as
p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 \right\},    (4)
where t = (t_1, ..., t_N)^T, w = (w_0, ..., w_N)^T and Φ is the N × (N + 1) 'design' matrix with Φ = [φ(x_1), φ(x_2), ..., φ(x_N)]^T, wherein φ(x_n) = [1, K(x_n, x_1), K(x_n, x_2), ..., K(x_n, x_N)]^T. For clarity, we omit to notate the implicit conditioning upon the set of input vectors {x_n} in (4) and subsequent expressions.
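As a concrete illustration of this construction, the Python sketch below builds the N × (N + 1) design matrix Φ (a column of ones for the bias w_0 followed by one kernel column per training input) and evaluates the log of the likelihood (4). The Gaussian kernel, its width and the toy data are assumptions made only for the example; the construction itself follows the definitions above.

    import numpy as np

    def design_matrix(X, kernel):
        # Phi is N x (N+1): Phi[n] = [1, K(x_n, x_1), ..., K(x_n, x_N)].
        N = X.shape[0]
        Phi = np.ones((N, N + 1))
        for n in range(N):
            for i in range(N):
                Phi[n, i + 1] = kernel(X[n], X[i])
        return Phi

    def log_likelihood(t, Phi, w, sigma2):
        # log of equation (4): Gaussian likelihood of the complete target vector.
        N = len(t)
        resid = t - Phi @ w
        return -0.5 * (N * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2)

    # Toy usage with an assumed Gaussian kernel of width r = 0.5.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(20, 1))
    t = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)
    kernel = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 0.5 ** 2)
    Phi = design_matrix(X, kernel)
    w = np.zeros(21)                       # arbitrary weight vector for illustration
    print(Phi.shape, log_likelihood(t, Phi, w, sigma2=0.01))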
With as many parameters in the model as training examples, we would expect maximum-likelihood estimation of w and σ² from (4) to lead to severe over-fitting. To avoid this, a common approach is to impose some additional constraint on the parameters, for example, through the addition of a 'complexity' penalty term to the likelihood or error function. This is implicitly effected by the inclusion of the 'margin' term in the SVM. Here, though, we adopt a Bayesian perspective, and 'constrain' the parameters by defining an explicit prior probability distribution over them.

We encode a preference for smoother (less complex) functions by making the popular choice of a zero-mean Gaussian prior distribution over w:

p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}),    (5)
with α a vector of N + 1 hyperparameters. Importantly, there is an individual hyperparameter associated independently with every weight, moderating the strength of the prior thereon.⁴ To complete the specification of this hierarchical prior, we must define hyperpriors over α, as well as over the final remaining parameter in the model, the noise variance σ². These quantities are examples of scale parameters, and suitable priors thereover are Gamma distributions (see, e.g., Berger, 1985):

p(\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathrm{Gamma}(\alpha_i \mid a, b), \qquad p(\beta) = \mathrm{Gamma}(\beta \mid c, d),

with β ≡ σ^{-2} and where

\mathrm{Gamma}(\alpha \mid a, b) = \Gamma(a)^{-1} b^{a} \alpha^{a-1} e^{-b\alpha},    (6)
with Γ(a) = ∫₀^∞ t^{a-1} e^{-t} dt, the 'gamma function'. To make these priors non-informative (i.e. flat), we might fix their parameters to small values: e.g. a = b = c = d = 10^{-4}. However, by
setting these parameters to zero, we obtain uniform hyperpriors (over a logarithmic scale). Since all scales are equally likely, a pleasing consequence of the use of such 'improper' hyperpriors here is that of scale-invariance: predictions are independent of linear scaling of both t and the basis function outputs so, for example, results do not depend on the unit of measurement of the targets. For completeness, the more detailed derivations offered in Appendix A will consider the case of general Gamma priors for α and β, but in the main body of the paper, all further analysis and presented results will assume uniform scale priors with a = b = c = d = 0.

4. Note that although it is not a characteristic of this parameter prior in general, for the case of the RVM that we consider here, the overall implied prior over functions is data dependent due to the appearance of x_n in the basis functions K(x, x_n). This presents no practical difficulty, although we must take care in interpreting the "error-bars" implied by the model. In Appendix D.1 we consider this in further detail.
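To see the role of the individual precision hyperparameters, the short Python sketch below evaluates the log of the weight prior (5) and draws weights from it for a few fixed α values (the values themselves are illustrative assumptions): a very large α_i concentrates the corresponding w_i at zero, which is the mechanism behind the sparsity discussed next.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_weight_prior(w, alpha):
        # log p(w | alpha) = sum_i log N(w_i | 0, alpha_i^{-1})   -- equation (5)
        return 0.5 * np.sum(np.log(alpha) - np.log(2.0 * np.pi) - alpha * w ** 2)

    alpha = np.array([1e-2, 1.0, 1e6])            # illustrative precisions (assumed)
    w_samples = rng.normal(0.0, alpha ** -0.5, size=(5, 3))
    print(np.round(w_samples, 4))                 # third column is essentially zero
    print(log_weight_prior(np.zeros(3), alpha))   # prior log-density at w = 0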
This formulation of prior distributions is a type of automatic relevance determination (ARD) prior (MacKay, 1994; Neal, 1996). Using such priors in a neural network, individual hyperparameters would typically control groups of weights: those associated with each input dimension x (this idea has also been applied to the input variables in 'Gaussian process' models). Should the evidence from the data support such a hypothesis, using a broad prior over the hyperparameters allows the posterior probability mass to concentrate at very large values of some of these α variables, with the consequence that the posterior probability of the associated weights will be concentrated at zero, thus effectively 'switching off' the corresponding inputs, and so deeming them to be 'irrelevant'.
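A minimal Python sketch of this grouped form of ARD, with one precision per input dimension shared by all of that input's outgoing weights; the layer sizes and α values are hypothetical:

    import numpy as np

    rng = np.random.default_rng(1)

    # One ARD precision per input dimension, shared across the 4 outgoing weights
    # of a hypothetical 3-input linear layer; the second input is 'switched off'.
    alpha_per_input = np.array([0.5, 1e6, 2.0])
    W = rng.normal(0.0, alpha_per_input[:, None] ** -0.5, size=(3, 4))
    print(np.round(W, 3))   # the row for the second input is essentially zero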
Here, the assignment of an individual hyperparameter to each weight, or basis function, is the key feature of the relevance vector machine, and is responsible ultimately for its sparsity properties. To introduce an additional N + 1 parameters to the model may seem counter-intuitive, since we have already conceded that we have too many parameters, but from a Bayesian perspective, if we correctly 'integrate out' all such 'nuisance' parameters (or can approximate such a procedure sufficiently accurately), then this presents no problem from a methodological perspective (see Neal, 1996, pp. 16-17). Any subsequently observed 'failure' in learning is attributable to the form, not the parameterisation, of the prior over functions.
2.2 Inference
Having defined the prior, Bayesian inference proceeds by computing, from Bayes' rule, the posterior over all unknowns given the data:

p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)}{p(\mathbf{t})}.    (7)
Then, given a new test point, x_*, predictions are made for the corresponding target t_*, in terms of the predictive distribution:

p(t_* \mid \mathbf{t}) = \int p(t_* \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t})\; d\mathbf{w}\, d\boldsymbol{\alpha}\, d\sigma^2.    (8)
To those familiar, or even not-so-familiar, with Bayesian methods, it may come as no surprise to learn that we cannot perform these computations in full analytically, and must seek an effective approximation.
We cannot compute the posterior p(w, α, σ² | t) in (7) directly since we cannot perform the normalising integral on the right-hand-side, p(t) = ∫ p(t | w, α, σ²) p(w, α, σ²) dw dα dσ². Instead, we decompose the posterior as:

p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\, p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}),    (9)
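The first factor in (9) is what makes the decomposition useful: for fixed α and σ², the model is linear-Gaussian, so p(w | t, α, σ²) is Gaussian with mean and covariance available in closed form. The explicit expressions are derived in the remainder of the paper; the Python sketch below simply applies the standard Gaussian conjugacy result for the likelihood (4) and prior (5), and should be read as an illustration under that assumption rather than a quotation of the paper's equations.

    import numpy as np

    def weight_posterior(Phi, t, alpha, sigma2):
        # Standard conjugate result for a linear-Gaussian model: with likelihood (4)
        # and prior (5), p(w | t, alpha, sigma^2) = N(w | mu, Sigma) with
        #   Sigma = (A + Phi^T Phi / sigma^2)^{-1},  mu = Sigma Phi^T t / sigma^2,
        # where A = diag(alpha).
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + Phi.T @ Phi / sigma2)
        mu = Sigma @ (Phi.T @ t) / sigma2
        return mu, Sigma

    # Toy usage: a small random design matrix; larger alpha_i shrink w_i towards zero.
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(20, 6))
    t = Phi @ np.array([0.0, 1.5, 0.0, -2.0, 0.0, 0.0]) + 0.05 * rng.normal(size=20)
    alpha = np.full(6, 1.0)
    mu, Sigma = weight_posterior(Phi, t, alpha, sigma2=0.05 ** 2)
    print(np.round(mu, 2))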