tion are RELIEF (Kira & Rendell 1992) and FOCUS (Almuallim & Dietterich 1991). In RELIEF, a subset of features is not directly selected; rather, each feature is given a weighting indicating its level of relevance to the class label. RELIEF is therefore ineffective at removing redundant features, as two predictive but highly correlated features are both likely to be highly weighted. The FOCUS algorithm conducts an exhaustive search of all feature subsets to determine the minimal set of features that can provide a consistent labeling of the training data. This consistency criterion makes FOCUS very sensitive to noise in the training data. Moreover, the exponential growth of the power set of the features makes this algorithm impractical for domains with more than 25-30 features.
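To make the consistency criterion concrete, the following is a minimal sketch of a FOCUS-style search (the function names and the brute-force enumeration are our own illustration, not the original implementation): a subset is consistent if no two training instances agree on all of its features yet carry different class labels, and the search returns the smallest such subset.

    from itertools import combinations

    def is_consistent(instances, labels, subset):
        """True if no two instances agree on 'subset' but differ in label."""
        seen = {}
        for x, y in zip(instances, labels):
            key = tuple(x[i] for i in subset)
            if key in seen and seen[key] != y:
                return False
            seen[key] = y
        return True

    def focus(instances, labels):
        """Return a minimal consistent feature subset (exhaustive search)."""
        n = len(instances[0])
        for size in range(n + 1):                      # smallest subsets first
            for subset in combinations(range(n), size):
                if is_consistent(instances, labels, subset):
                    return subset
        return tuple(range(n))                         # noisy data: keep everything

Even in this toy form the exponential cost is apparent: the enumeration covers up to 2^n subsets, which is why FOCUS is restricted to a few dozen features.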
Another feature selection methodology which has recently received much attention is the wrapper model (John, Kohavi, & Pfleger 1994; Caruana & Freitag 1994; Langley & Sage 1994). This model searches through the space of feature subsets using the estimated accuracy from an induction algorithm as the measure of goodness for a particular feature subset. Thus, the feature selection is "wrapped around" an induction algorithm, so that the bias of the operators that define the search and that of the induction algorithm strongly interact. While these methods have encountered some success on induction tasks, they are often prohibitively expensive to run and can be intractable for a very large number of features. Furthermore, the methods leave something to be desired in terms of theoretical justification. While an important aspect of feature selection is how well a method helps an induction algorithm in terms of accuracy measures, it is also important to understand how the induction problem in general is affected by feature selection.
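As an illustration of the wrapper loop described above, the following is a minimal sketch (our own, not from the cited papers) that uses greedy forward selection with cross-validated accuracy as the subset score; the decision tree stands in for an arbitrary induction algorithm.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def wrapper_forward_selection(X, y, max_features=None):
        """Greedy forward selection scored by cross-validated accuracy."""
        n = X.shape[1]
        max_features = max_features or n
        selected, best_score = [], -np.inf
        while len(selected) < max_features:
            scores = {}
            for f in range(n):
                if f in selected:
                    continue
                cols = selected + [f]
                # estimated accuracy of the induction algorithm on this subset
                scores[f] = cross_val_score(DecisionTreeClassifier(),
                                            X[:, cols], y, cv=5).mean()
            f_best = max(scores, key=scores.get)
            if scores[f_best] <= best_score:       # no improvement: stop
                break
            best_score = scores[f_best]
            selected.append(f_best)
        return selected

Each candidate subset requires retraining and re-evaluating the classifier, so even this greedy variant costs on the order of n^2 model fits; less restricted wrapper searches are far more expensive, which is the cost the filter approach below avoids.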
In this work, we address both theoretical and empirical aspects of feature selection. We describe a formal framework for understanding feature selection, based on ideas from Information Theory (Cover & Thomas 1991). We then present an efficient implemented algorithm based on these theoretical intuitions. The algorithm overcomes many of the problems with existing methods: it has a sound theoretical foundation; it is effective in eliminating both irrelevant and redundant features; it is tolerant to inconsistencies in the training data; and, most importantly, it is a filter algorithm which does not incur the high computational cost of conducting a search through the space of feature subsets as in the wrapper methods, and is therefore efficient for domains containing hundreds or even thousands of features.
2 Theoretical Framework
A data instance is typically described to the system as an assignment of values f = (f_1, ..., f_n) to a set of features F = (F_1, ..., F_n). As usual, we assume that each instance is drawn independently from some probability distribution over the space of feature vectors. Formally, for each assignment of values f to F, we have a probability Pr(F = f).
A classifier is a procedure that takes as input a data instance and classifies it as belonging to one of a number of possible classes c_1, ..., c_l. The classifier must make its decision based on the assignment f associated with an instance. Optimistically, the feature vector will fully determine the appropriate classification. However, this is rarely the case: we do not typically have access to enough features to make this a deterministic decision. Therefore, we use a probability distribution to model the classification function: for each assignment of values f to F we have a distribution Pr(C | F = f) on the different possible classes C. A learning algorithm implicitly uses the empirical frequencies observed in the training set, an approximation to the conditional distribution Pr(C | F), to construct a classifier for the problem.
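As a purely illustrative reading of this, the empirical estimate of Pr(C | F = f) can be obtained by simple counting over the training set; the helper below is our own sketch and not part of the algorithm described later.

    from collections import Counter, defaultdict

    def empirical_class_distribution(instances, labels):
        """Estimate Pr(C | F = f) from the relative class frequencies
        observed for each distinct feature vector f in the training set."""
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[tuple(x)][y] += 1
        dists = {}
        for f, cnt in counts.items():
            total = sum(cnt.values())
            dists[f] = {c: n / total for c, n in cnt.items()}
        return dists

With many features, most feature vectors occur at most once in the training set, so these per-vector estimates are extremely sparse; this is part of the motivation for reducing F to a smaller subset of features.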
Let us consider the effect of feature space reduction on the distribution that characterizes the problem. Let G be some subset of F. Given a feature vector f, we use f_G to denote the projection of f onto the variables in G. Consider a particular data instance characterized by f. In the original distribution, this data instance induces the distribution Pr(C | F = f). In the reduced feature space, the same instance induces the (possibly different) distribution Pr(C | G = f_G). Our goal is to select G so that these two distributions are as close as possible. As our distance metric, we use the information-theoretic measure of cross-entropy (also known as KL-distance (Kullback & Leibler 1951)).
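For reference, the standard definition of the KL-distance between two distributions p and q over the class values, and its instantiation for the two conditional distributions above, is (in LaTeX notation):

\[
D(p \,\|\, q) \;=\; \sum_{c} p(c)\,\log\frac{p(c)}{q(c)},
\qquad
D\bigl(\Pr(C \mid F = f)\,\big\|\,\Pr(C \mid G = f_G)\bigr)
\;=\; \sum_{c} \Pr(c \mid f)\,\log\frac{\Pr(c \mid f)}{\Pr(c \mid f_G)} .
\]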
Thus, we can view this as selecting a set of features G which causes us to lose the least amount of information in these distributions. While other measures of separability (notably divergence) have been suggested in the statistics community for feature selection (Fukunaga 1990), these measures are often aimed at selecting features to enhance the separability of the data and may have difficulty in very large dimensional spaces. Hence, they bring with them an inherent bias which may not be appropriate for particular induction algorithms. Our method seeks to eliminate non-informative features and thereby allow induction methods to employ their own bias in a much reduced feature space.
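To make the selection criterion tangible, the self-contained sketch below (our own illustration, not the efficient algorithm presented later) scores a candidate subset G by the average KL-distance between Pr(C | F = f) and Pr(C | G = f_G) over the training instances, using empirical counts for both distributions.

    import math
    from collections import Counter, defaultdict

    def class_dists(instances, labels):
        """Pr(C | feature vector) estimated by empirical counting."""
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[tuple(x)][y] += 1
        return {f: {c: n / sum(cnt.values()) for c, n in cnt.items()}
                for f, cnt in counts.items()}

    def kl(p, q, eps=1e-12):
        """KL-distance D(p || q) between two class distributions (dicts)."""
        return sum(pc * math.log(pc / max(q.get(c, 0.0), eps))
                   for c, pc in p.items() if pc > 0)

    def subset_information_loss(instances, labels, subset):
        """Average D( Pr(C | f) || Pr(C | f_G) ) over the training instances."""
        full = class_dists(instances, labels)
        proj = class_dists([tuple(x[i] for i in subset) for x in instances],
                           labels)
        return sum(kl(full[tuple(x)], proj[tuple(x[i] for i in subset)])
                   for x in instances) / len(instances)

A subset whose average loss is close to zero preserves essentially all the class information carried by the full feature vector; minimizing this quantity naively over all subsets is again exponential, which is precisely the search cost the filter algorithm described in this paper is designed to avoid.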