Classifying Categorical Data by Rule-based Neighbors
Jiabing WANG (1), Pei ZHANG (1), Guihua WEN (1, 2), Jia WEI (1)
(1) School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
(2) State Key Laboratory of Brain and Cognitive Science, Chinese Academy of Sciences, Beijing, China
jbwang@scut.edu.cn, z.pei02@mail.scut.edu.cn, {crghwen, csjwei}@scut.edu.cn
Abstract—A new learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors), is proposed in this paper. CRN is a nonmetric and parameter-free classifier, and can be regarded as a hybrid of rule induction and instance-based learning. Based on a new measure of attribute quality and the separate-and-conquer strategy, CRN learns a collection of feature sets such that for each pair of instances belonging to different classes, there is a feature set on which the two instances disagree. For an unlabeled instance I and a labeled instance I′, I′ is a neighbor of I if and only if they agree on all attributes of a feature set. CRN then classifies the unlabeled instance I based on I's neighbors on those learned feature sets. To validate its performance, CRN is compared with six state-of-the-art classifiers on twenty-four datasets. Experimental results demonstrate that although the underlying idea of CRN is simple, its predictive accuracy is comparable to or better than that of the state-of-the-art classifiers on most datasets.
Keywords-classification; categorical data; feature selection;
rule-based neighbors
I. INTRODUCTION
Instance-based learning [1] is an important learning paradigm. The k-Nearest-Neighbor classifier (abbr. k-NN) [2] is a representative instance-based classifier that assigns an unlabeled instance the most common class among its k nearest neighbors. Due to its simplicity and effectiveness, k-NN classifiers have been widely employed in the field of pattern classification.
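As a point of reference for the rule-based approach that follows, the majority-vote scheme of a basic k-NN classifier can be sketched as below. This is an illustrative sketch, not code from the paper; the function names and the toy dataset are our own, and Hamming distance stands in for one of the simple categorical metrics the next paragraph discusses.

```python
from collections import Counter

def knn_predict(train, labels, query, k, dist):
    """Assign the query the most common class among its k nearest neighbors."""
    # Rank training instances by distance to the query.
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    top_k = [labels[i] for i in ranked[:k]]
    # Majority vote over the k nearest labels.
    return Counter(top_k).most_common(1)[0][0]

# Toy example: Hamming distance over tuples of categorical values.
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
train = [("red", "small"), ("red", "large"), ("blue", "small")]
labels = ["apple", "apple", "berry"]
print(knn_predict(train, labels, ("red", "small"), k=1, dist=hamming))  # → apple
```

Note that both k and the distance function dist are parameters the user must supply, which is exactly the burden CRN aims to remove.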
Most instance-based classifiers use a given metric to measure the similarity between the unlabeled instance and its neighbors. When attributes are numerical, the normalized Euclidean distance is a natural metric for measuring the similarity between instances. For many applications, however, no natural metric exists. In this case, many instance-based classifiers designed to handle numerical attributes run into difficulty and typically fall back on much simpler metrics to measure the distance between values of categorical attributes. Although those simpler metrics perform well in some cases, they may fail to capture the inherent complexity of the problem domain and, as a result, may perform badly [3].
One issue with instance-based classifiers is the choice of the parameter k, i.e., the number of neighbors, which in practice is often chosen empirically by cross-validation or by domain experts. Apart from k, many k-NN classifiers require additional options and parameters, and these must be fine-tuned for the classifiers to perform best. As with the choice of k, most parameters can only be obtained empirically by cross-validation or a similar method, which is usually tedious work. Another issue with instance-based classifiers is that they are very sensitive to irrelevant attributes. If a dataset contains a large number of irrelevant attributes, the contributions of these attributes to the global distance will dramatically degrade the performance of instance-based classifiers [4].
To address the drawbacks of instance-based learning mentioned above, a new instance-based learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors), is proposed in this paper. Unlike conventional instance-based classifiers, we treat “neighbor” in a rule-like way that is essentially different from the distance-metric view: for a test instance I and a labeled instance I′, I′ is a neighbor of I if and only if they agree on a feature set (a subset of attributes). As a result, CRN does not use any metric to measure the similarity between a test instance and its neighbors. Since there are exponentially many subsets of attributes, a new measure of attribute quality is proposed in order to learn “high-quality” feature sets. Based on this quality measure and the separate-and-conquer strategy commonly used in rule induction [5], a collection of feature sets is learned. Then, an unlabeled test instance is classified by its neighbors as defined on those feature sets.
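The neighbor relation just described can be sketched as follows. This is an illustrative sketch under our own naming, not the paper's implementation; in particular, how the learned feature sets are obtained and how the collected neighbors' votes are combined into a class prediction are covered in Section II, so only the agreement test itself is shown here.

```python
def is_neighbor(test_inst, labeled_inst, feature_set):
    """I' is a neighbor of I iff they agree on every attribute in the feature set."""
    return all(test_inst[a] == labeled_inst[a] for a in feature_set)

def rule_based_neighbors(test_inst, training_set, feature_sets):
    """Collect the labeled neighbors of a test instance over all learned feature sets."""
    neighbors = []
    for fs in feature_sets:
        for inst, label in training_set:
            if is_neighbor(test_inst, inst, fs):
                neighbors.append((inst, label, fs))
    return neighbors

# Toy example: one learned feature set containing only attribute index 0.
training_set = [(("a", "b"), "c1"), (("x", "b"), "c2")]
feature_sets = [(0,)]
print(rule_based_neighbors(("a", "z"), training_set, feature_sets))
```

Note that no distance is computed anywhere: neighborhood is a yes/no test of exact agreement on a feature set, which is what makes the classifier nonmetric.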
The rest of this paper is organized as follows. Section II
presents the CRN algorithm, including the definition of the measure of attribute quality, the training algorithm, and the
classification algorithm. Section III reports the experimental
results. We conclude the paper in Section IV.
We use the following notations in the rest of the paper:
D: the training set.
D_c: the subset of D consisting of instances with class c.
n: the number of instances in D.
m: the number of attributes, named A_1, A_2, …, A_m respectively.
T: the number of class labels, named 1, 2, …, T respectively.
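To make the notation concrete, a minimal sketch of how such a dataset might be represented is given below. The toy dataset is hypothetical, assumed only for illustration; it is not from the paper.

```python
from collections import defaultdict

# Hypothetical toy dataset matching the notation above:
# D is the training set; each instance is a tuple of m categorical
# attribute values, paired with its class label.
D = [(("sunny", "hot"), 1), (("rainy", "mild"), 2), (("sunny", "mild"), 1)]

n = len(D)                  # n: number of instances in D
m = len(D[0][0])            # m: number of attributes A_1, ..., A_m
T = len({c for _, c in D})  # T: number of distinct class labels

# D_c: the subset of D consisting of instances with class c.
D_by_class = defaultdict(list)
for inst, c in D:
    D_by_class[c].append(inst)
```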
II. THE CRN ALGORITHM
We propose the CRN algorithm based on a simple idea: if two instances belong to different classes, there is at least one attribute on which the two instances disagree. So, the core of
2011 11th IEEE International Conference on Data Mining
1550-4786/11 $26.00 © 2011 IEEE
DOI 10.1109/ICDM.2011.34