Classifying Categorical Data by Rule-based Neighbors
Jiabing WANG (1), Pei ZHANG (1), Guihua WEN (1, 2), Jia WEI (1)
(1) School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
(2) State Key Laboratory of Brain and Cognitive Science, Chinese Academy of Sciences, Beijing, China
jbwang@scut.edu.cn, z.pei02@mail.scut.edu.cn, {crghwen, csjwei}@scut.edu.cn
Abstract—A new learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors), is proposed in this paper. CRN is a nonmetric and parameter-free classifier, and can be regarded as a hybrid of rule induction and instance-based learning. Based on a new measure of attribute quality and the separate-and-conquer strategy, CRN learns a collection of feature sets such that for each pair of instances belonging to different classes, there is a feature set on which the two instances disagree. For an unlabeled instance I and a labeled instance I′, I′ is a neighbor of I if and only if they agree on all attributes of a feature set. CRN then classifies the unlabeled instance I based on I's neighbors on those learned feature sets. To validate its performance, CRN is compared with six state-of-the-art classifiers on twenty-four datasets. Experimental results demonstrate that although the underlying idea of CRN is simple, its predictive accuracy is comparable to or better than that of the state-of-the-art classifiers on most datasets.
Keywords-classification; categorical data; feature selection;
rule-based neighbors
I. INTRODUCTION
Instance-based learning [1] is an important learning paradigm. The k-Nearest-Neighbor classifier (abbr. k-NN) [2] is a representative instance-based classifier that assigns an unlabeled instance the most common class among its k nearest neighbors. Due to its simplicity and effectiveness, k-NN classifiers have been widely employed in the field of pattern classification.
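As a point of reference for the rule-based approach that follows, the majority-vote scheme of a basic k-NN classifier can be sketched as below. This is an illustrative sketch, not code from the paper; the function names and the toy dataset are our own, and Hamming distance stands in for one of the simple categorical metrics the next paragraph discusses.

```python
from collections import Counter

def knn_predict(train, labels, query, k, dist):
    """Assign the query the most common class among its k nearest neighbors."""
    # Rank training instances by distance to the query.
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    top_k = [labels[i] for i in ranked[:k]]
    # Majority vote over the k nearest labels.
    return Counter(top_k).most_common(1)[0][0]

# Toy example: Hamming distance over tuples of categorical values.
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
train = [("red", "small"), ("red", "large"), ("blue", "small")]
labels = ["apple", "apple", "berry"]
print(knn_predict(train, labels, ("red", "small"), k=1, dist=hamming))  # → apple
```

Note that both k and the distance function dist are parameters the user must supply, which is exactly the burden CRN aims to remove.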
Most instance-based classifiers use a given metric to measure the similarity between the unlabeled instance and its neighbors. When attributes are numerical, the normalized Euclidean distance is a natural metric for measuring the similarity between instances. For many applications, however, no natural metric exists. In this case, many instance-based classifiers designed to handle numerical attributes run into difficulty and typically fall back on much simpler metrics to measure the distance between values of categorical attributes. Although those simpler metrics perform well in some cases, they may fail to capture the inherent complexity of the problem domain and, as a result, may perform badly [3].
One issue with instance-based classifiers is the choice of the parameter k, i.e., the number of neighbors, which in practice is often chosen empirically by cross-validation or by domain experts. Apart from k, many k-NN classifiers require additional options and parameters, and these must be fine-tuned for the classifiers to perform best. As with the choice of k, most parameters can only be obtained empirically by cross-validation or a similar method, which is usually tedious work. Another issue with instance-based classifiers is that they are very sensitive to irrelevant attributes. If a dataset contains a large number of irrelevant attributes, the contributions of these attributes to the global distance will dramatically degrade the performance of instance-based classifiers [4].
To address the drawbacks of instance-based learning mentioned above, a new instance-based learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors), is proposed in this paper. Unlike conventional instance-based classifiers, we treat “neighbor” in a rule-like way that is essentially different from the distance-metric view: for a test instance I and a labeled instance I′, I′ is a neighbor of I if and only if they agree on a feature set (a subset of attributes). As a result, CRN does not use any metric to measure the similarity between a test instance and its neighbors. Since there are exponentially many subsets of attributes, a new measure of attribute quality is proposed in order to learn “high-quality” feature sets. Based on this quality measure and the separate-and-conquer strategy commonly used in rule induction [5], a collection of feature sets is learned. Then, an unlabeled test instance is classified by its neighbors as defined on those feature sets.
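The neighbor relation just described can be sketched as follows. This is an illustrative sketch under our own naming, not the paper's implementation; in particular, how the learned feature sets are obtained and how the collected neighbors' votes are combined into a class prediction are covered in Section II, so only the agreement test itself is shown here.

```python
def is_neighbor(test_inst, labeled_inst, feature_set):
    """I' is a neighbor of I iff they agree on every attribute in the feature set."""
    return all(test_inst[a] == labeled_inst[a] for a in feature_set)

def rule_based_neighbors(test_inst, training_set, feature_sets):
    """Collect the labeled neighbors of a test instance over all learned feature sets."""
    neighbors = []
    for fs in feature_sets:
        for inst, label in training_set:
            if is_neighbor(test_inst, inst, fs):
                neighbors.append((inst, label, fs))
    return neighbors

# Toy example: one learned feature set containing only attribute index 0.
training_set = [(("a", "b"), "c1"), (("x", "b"), "c2")]
feature_sets = [(0,)]
print(rule_based_neighbors(("a", "z"), training_set, feature_sets))
```

Note that no distance is computed anywhere: neighborhood is a yes/no test of exact agreement on a feature set, which is what makes the classifier nonmetric.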
The rest of this paper is organized as follows. Section II
presents the CRN algorithm, including the definition of the measure of attribute quality, the training algorithm, and the
classification algorithm. Section III reports the experimental
results. We conclude the paper in Section IV.
We use the following notations in the rest of the paper:
D: the training set.
D_c: the subset of D consisting of instances with class c.
n: the number of instances in D.
m: the number of attributes, named A_1, A_2, …, A_m respectively.
T: the number of class labels, named 1, 2, …, T respectively.
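To make the notation concrete, a minimal sketch of how such a dataset might be represented is given below. The toy dataset is hypothetical, assumed only for illustration; it is not from the paper.

```python
from collections import defaultdict

# Hypothetical toy dataset matching the notation above:
# D is the training set; each instance is a tuple of m categorical
# attribute values, paired with its class label.
D = [(("sunny", "hot"), 1), (("rainy", "mild"), 2), (("sunny", "mild"), 1)]

n = len(D)                  # n: number of instances in D
m = len(D[0][0])            # m: number of attributes A_1, ..., A_m
T = len({c for _, c in D})  # T: number of distinct class labels

# D_c: the subset of D consisting of instances with class c.
D_by_class = defaultdict(list)
for inst, c in D:
    D_by_class[c].append(inst)
```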
II. THE CRN ALGORITHM
We propose the CRN algorithm based on a simple idea: if two instances belong to different classes, there is at least one attribute on which the two instances disagree. So, the core of
2011 11th IEEE International Conference on Data Mining
1550-4786/11 $26.00 © 2011 IEEE
DOI 10.1109/ICDM.2011.34