tion are RELIEF (Kira & Rendell 1992) and FOCUS (Almuallim & Dietterich 1991). In RELIEF, a subset of features is not directly selected; rather, each feature is given a weighting indicating its level of relevance to the class label. RELIEF is therefore ineffective at removing redundant features, as two predictive but highly correlated features are both likely to be highly weighted. The FOCUS algorithm conducts an exhaustive search of all feature subsets to determine the minimal set of features that can provide a consistent labeling of the training data. This consistency criterion makes FOCUS very sensitive to noise in the training data. Moreover, the exponential growth of the power set of the features makes this algorithm impractical for domains with more than 25-30 features.
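To make the consistency criterion concrete, the following is a minimal sketch of a FOCUS-style search (the function names and the brute-force enumeration are our own illustration, not the original implementation): a subset is consistent if no two training instances agree on all of its features yet carry different class labels, and the search returns the smallest such subset.

    from itertools import combinations

    def is_consistent(instances, labels, subset):
        """True if no two instances agree on 'subset' but differ in label."""
        seen = {}
        for x, y in zip(instances, labels):
            key = tuple(x[i] for i in subset)
            if key in seen and seen[key] != y:
                return False
            seen[key] = y
        return True

    def focus(instances, labels):
        """Return a minimal consistent feature subset (exhaustive search)."""
        n = len(instances[0])
        for size in range(n + 1):                      # smallest subsets first
            for subset in combinations(range(n), size):
                if is_consistent(instances, labels, subset):
                    return subset
        return tuple(range(n))                         # noisy data: keep everything

Even in this toy form the exponential cost is apparent: the enumeration covers up to 2^n subsets, which is why FOCUS is restricted to a few dozen features.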
Another feature selection methodology which has recently received much attention is the wrapper model (John, Kohavi, & Pfleger 1994; Caruana & Freitag 1994; Langley & Sage 1994). This model searches through the space of feature subsets using the estimated accuracy from an induction algorithm as the measure of goodness for a particular feature subset. Thus, the feature selection is "wrapped around" an induction algorithm, so that the bias of the operators that define the search and that of the induction algorithm strongly interact. While these methods have encountered some success on induction tasks, they are often prohibitively expensive to run and can be intractable for a very large number of features. Furthermore, the methods leave something to be desired in terms of theoretical justification. While an important aspect of feature selection is how well a method helps an induction algorithm in terms of accuracy measures, it is also important to understand how the induction problem in general is affected by feature selection.
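As an illustration of the wrapper loop described above, the following is a minimal sketch (our own, not from the cited papers) that uses greedy forward selection with cross-validated accuracy as the subset score; the decision tree stands in for an arbitrary induction algorithm.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def wrapper_forward_selection(X, y, max_features=None):
        """Greedy forward selection scored by cross-validated accuracy."""
        n = X.shape[1]
        max_features = max_features or n
        selected, best_score = [], -np.inf
        while len(selected) < max_features:
            scores = {}
            for f in range(n):
                if f in selected:
                    continue
                cols = selected + [f]
                # estimated accuracy of the induction algorithm on this subset
                scores[f] = cross_val_score(DecisionTreeClassifier(),
                                            X[:, cols], y, cv=5).mean()
            f_best = max(scores, key=scores.get)
            if scores[f_best] <= best_score:       # no improvement: stop
                break
            best_score = scores[f_best]
            selected.append(f_best)
        return selected

Each candidate subset requires retraining and re-evaluating the classifier, so even this greedy variant costs on the order of n^2 model fits; less restricted wrapper searches are far more expensive, which is the cost the filter approach below avoids.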
In this work, we address both theoretical and empirical aspects of feature selection. We describe a formal framework for understanding feature selection, based on ideas from Information Theory (Cover & Thomas 1991). We then present an efficient implemented algorithm based on these theoretical intuitions. The algorithm overcomes many of the problems with existing methods: it has a sound theoretical foundation; it is effective in eliminating both irrelevant and redundant features; it is tolerant to inconsistencies in the training data; and, most importantly, it is a filter algorithm which does not incur the high computational cost of conducting a search through the space of feature subsets as in the wrapper methods, and is therefore efficient for domains containing hundreds or even thousands of features.
2 Theoretical Framework
A data instance is typically described to the system as an assignment of values f = (f_1, ..., f_n) to a set of features F = (F_1, ..., F_n). As usual, we assume that each instance is drawn independently from some probability distribution over the space of feature vectors. Formally, for each assignment of values f to F, we have a probability Pr(F = f).
A classifier is a procedure that takes as input a data instance and classifies it as belonging to one of a number of possible classes c_1, ..., c_l. The classifier must make its decision based on the assignment f associated with an instance. Optimistically, the feature vector will fully determine the appropriate classification. However, this is rarely the case: we do not typically have access to enough features to make this a deterministic decision. Therefore, we use a probability distribution to model the classification function: for each assignment of values f to F we have a distribution Pr(C | F = f) on the different possible classes C. A learning algorithm implicitly uses the empirical frequencies observed in the training set, an approximation to the conditional distribution Pr(C | F), to construct a classifier for the problem.
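As a purely illustrative reading of this, the empirical estimate of Pr(C | F = f) can be obtained by simple counting over the training set; the helper below is our own sketch and not part of the algorithm described later.

    from collections import Counter, defaultdict

    def empirical_class_distribution(instances, labels):
        """Estimate Pr(C | F = f) from the relative class frequencies
        observed for each distinct feature vector f in the training set."""
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[tuple(x)][y] += 1
        dists = {}
        for f, cnt in counts.items():
            total = sum(cnt.values())
            dists[f] = {c: n / total for c, n in cnt.items()}
        return dists

With many features, most feature vectors occur at most once in the training set, so these per-vector estimates are extremely sparse; this is part of the motivation for reducing F to a smaller subset of features.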
Let us consider the effect of feature space reduction on the distribution that characterizes the problem. Let G be some subset of F. Given a feature vector f, we use f_G to denote the projection of f onto the variables in G. Consider a particular data instance characterized by f. In the original distribution, this data instance induces the distribution Pr(C | F = f). In the reduced feature space, the same instance induces the (possibly different) distribution Pr(C | G = f_G). Our goal is to select G so that these two distributions are as close as possible. As our distance metric, we use the information-theoretic measure of cross-entropy (also known as KL-distance (Kullback & Leibler 1951)).
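For reference, the standard definition of the KL-distance between two distributions p and q over the class values, and its instantiation for the two conditional distributions above, is (in LaTeX notation):

\[
D(p \,\|\, q) \;=\; \sum_{c} p(c)\,\log\frac{p(c)}{q(c)},
\qquad
D\bigl(\Pr(C \mid F = f)\,\big\|\,\Pr(C \mid G = f_G)\bigr)
\;=\; \sum_{c} \Pr(c \mid f)\,\log\frac{\Pr(c \mid f)}{\Pr(c \mid f_G)} .
\]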
Thus, we can view this as selecting a set of features G which causes us to lose the least amount of information in these distributions. While other measures of separability (notably divergence) have been suggested in the statistics community for feature selection (Fukunaga 1990), these measures are often aimed at selecting features to enhance the separability of the data and may have difficulty in very large dimensional spaces. Hence, they bring with them an inherent bias which may not be appropriate for particular induction algorithms. Our method seeks to eliminate non-informative features and thereby allow induction methods to employ their own bias in a much reduced feature space.
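To make the selection criterion tangible, the self-contained sketch below (our own illustration, not the efficient algorithm presented later) scores a candidate subset G by the average KL-distance between Pr(C | F = f) and Pr(C | G = f_G) over the training instances, using empirical counts for both distributions.

    import math
    from collections import Counter, defaultdict

    def class_dists(instances, labels):
        """Pr(C | feature vector) estimated by empirical counting."""
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[tuple(x)][y] += 1
        return {f: {c: n / sum(cnt.values()) for c, n in cnt.items()}
                for f, cnt in counts.items()}

    def kl(p, q, eps=1e-12):
        """KL-distance D(p || q) between two class distributions (dicts)."""
        return sum(pc * math.log(pc / max(q.get(c, 0.0), eps))
                   for c, pc in p.items() if pc > 0)

    def subset_information_loss(instances, labels, subset):
        """Average D( Pr(C | f) || Pr(C | f_G) ) over the training instances."""
        full = class_dists(instances, labels)
        proj = class_dists([tuple(x[i] for i in subset) for x in instances],
                           labels)
        return sum(kl(full[tuple(x)], proj[tuple(x[i] for i in subset)])
                   for x in instances) / len(instances)

A subset whose average loss is close to zero preserves essentially all the class information carried by the full feature vector; minimizing this quantity naively over all subsets is again exponential, which is precisely the search cost the filter algorithm described in this paper is designed to avoid.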