attention, aiming at accurate and early diagnosis [10,11]. However, tumor classification suffers from the curse of dimensionality caused by high-dimensional gene data sets: gene expression data sets contain tens of thousands of genes, but only a few of them are beneficial to tumor classification. Therefore, gene selection is a key problem in tumor classification and diagnosis.
Many gene selection methods have been put forward in recent years. According to their evaluation functions, these methods are divided into two categories: filter methods and wrapper methods. A filter method is generally used as a preprocessing step, independent of the classifier. By analyzing the characteristics of the relevant genes, most filter methods rank the genes according to some criteria, including correlation coefficient, distance metrics, information gain and consistency. Golub et al. [12] first proposed the signal-to-noise ratio function to evaluate genes by how strongly their expression separates the tumor molecular subtypes.
ReliefF [13] and MRMR [14] are two conventional feature selection algorithms, and Zhang et al. [15] developed a gene selection algorithm that combines them. A wrapper method is essentially built around a classifier, relying on classification accuracy as the standard for selecting the optimal gene subset. A representative example is the support vector machine recursive feature elimination (SVM-RFE) algorithm proposed by Guyon et al. [16], which recursively eliminates the features with the smallest weights in the trained support vector machine and has been successfully applied to gene selection [17,18]. Ghosh et al. [19] put forward an optimal gene classification procedure combined with Lasso estimation. However, both kinds of methods have shortcomings. A filter method depends on some standard criterion while ignoring the classification performance of the selected genes. A wrapper method is sensitive to the chosen classifier, its performance is unstable, and its time complexity is usually high.
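As a concrete illustration of the filter idea, the signal-to-noise criterion of Golub et al. [12] scores each gene by the difference of class means normalized by the sum of class standard deviations. The sketch below is our own minimal rendering of that criterion; the function name, array layout and toy data are hypothetical:

```python
import numpy as np

def signal_to_noise(expr, labels):
    """Rank genes by the signal-to-noise ratio s = (mu0 - mu1) / (sd0 + sd1).

    expr   : (n_samples, n_genes) expression matrix
    labels : binary class labels (0/1), one per sample
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    c0, c1 = expr[labels == 0], expr[labels == 1]
    mu0, mu1 = c0.mean(axis=0), c1.mean(axis=0)
    sd0, sd1 = c0.std(axis=0), c1.std(axis=0)
    # A larger |s| means the gene separates the two classes better.
    return (mu0 - mu1) / (sd0 + sd1)

# Toy data: 4 samples, 3 genes; gene 0 separates the two classes cleanly,
# gene 2 carries no class information at all.
expr = [[5.0, 1.0, 2.0],
        [5.2, 1.1, 9.0],
        [1.0, 1.0, 2.1],
        [1.2, 0.9, 8.9]]
labels = [0, 0, 1, 1]
scores = signal_to_noise(expr, labels)
ranking = np.argsort(-np.abs(scores))  # best-separating gene first
print(ranking[0])  # gene 0 tops the ranking
```

A filter method of this kind simply keeps the top-k genes of the ranking; note that, exactly as criticized above, the score never consults a classifier.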
Rough set theory, pioneered by Pawlak [20], has been successfully applied in the bioinformatics field [21,22]. It is an effective gene selection tool that eliminates noisy and redundant genes and discovers important data dependencies through a reduction method. The classical reduction algorithms are founded on the equivalence relation and are only suitable for discrete data sets. Discretization must therefore be carried out before processing continuous gene expression data, which results in a loss of information. Considering this weakness, many scholars have developed extensions of the classical rough set model for gene selection. Dai and Xu [3] proposed a gene selection method based on fuzzy rough sets and fuzzy gain ratio. Hu et al. [23,24] proposed a neighborhood rough set model that deals with both discrete and continuous data sets via a δ-neighborhood parameter, which preserves the rich information needed for classifying the data sets. Wang et al. [25] proposed a gene selection method based on the neighborhood rough set model for dealing with gene expression profiles. Meng et al. [26] put forward a gene selection method using neighborhood rough sets for the analysis of plant stress response.
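The granulation behind these neighborhood models can be stated briefly: the δ-neighborhood of a sample collects all samples lying within distance δ of it, so continuous expression values need no discretization. The snippet below is a minimal illustration under our own assumptions (Euclidean distance and hypothetical expression values; the cited models may use other metrics):

```python
import numpy as np

def delta_neighborhood(expr, i, delta):
    """N(x_i): indices of samples within Euclidean distance delta of sample i."""
    expr = np.asarray(expr, dtype=float)
    dist = np.linalg.norm(expr - expr[i], axis=1)
    # The sample always belongs to its own neighborhood (distance 0).
    return {int(j) for j in np.flatnonzero(dist <= delta)}

# Hypothetical continuous expression values: 4 samples, 2 genes,
# forming two well-separated clusters.
expr = [[0.10, 0.20],
        [0.12, 0.21],
        [0.90, 0.80],
        [0.88, 0.79]]
print(delta_neighborhood(expr, 0, 0.05))  # {0, 1}
```

With δ = 0 on discrete data the neighborhoods collapse to classical equivalence classes, which is why a single model covers both data types.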
The general algorithm for finding the best gene subset employs an incremental greedy heuristic, and attribute significance is usually adopted as the heuristic in the rough set gene selection process. Rough set-based gene selection is thus a filter method that applies attribute significance as its filtering standard. In classical rough sets, attribute significance is constructed from three main measures: positive region, discernibility matrix and information entropy. However, these traditional measures are not directly suitable for neighborhood rough sets in gene selection, so it is necessary to transform or extend them. Information entropy is a good measure of uncertainty. In this paper, we granulate a gene data set by a neighborhood parameter and develop some uncertainty measures in the neighborhood rough set framework. In particular, a joint entropy is proposed for the first time to evaluate the uncertainty of a gene data set, and a joint entropy-based gene selection method is presented. Experimental results show that the proposed algorithm can obtain a small number of key genes and improve the classification accuracy.
The rest of the paper is structured as follows. Some concepts of rough sets and information entropy measures are presented in Section 2. Section 3 presents several entropy measures for neighborhood systems and proposes a gene selection approach based on joint entropy. In Section 4, experiments are conducted on two tumor data sets and the experimental results are analyzed. The conclusions and future work are summarized in Section 5.
2. Preliminaries
In this section, we introduce some basic concepts of rough sets, mutual entropy measures and gene selection. These notions can be found in the literature [27–31].
2.1. Rough sets
The equivalence relation is a core concept in rough sets. Let $DS = (U, C \cup \{d\}, V, f)$ be a discrete-value decision system, also called a discrete-value gene data set, where $U$ is a nonempty finite set of patient samples (or objects), named a universe; $C$ is a nonempty finite set of genes (condition features or attributes); $d$ represents a decision; $V$ is the union of the gene expression level values, which are discrete, i.e., $V = \bigcup_{a \in C \cup \{d\}} V_a$, where $V_a$ is the value set of gene $a$; and $f : U \times (C \cup \{d\}) \to V$ is a mapping function that assigns each sample a value for every gene and the decision. In practice, gene expression level values are real numbers; in a rough set gene selection process, they should first be transformed into discrete values by a discretization method.
For any two patient samples $x, y \in U$ and any gene subset $B \subseteq C$, the indiscernibility relation is
$$IND(B) = \{(x, y) \mid \forall a \in B,\; f(x, a) = f(y, a)\}. \quad (1)$$
Obviously, $IND(B)$ satisfies reflexivity, symmetry and transitivity, and is thus called an equivalence relation. For any patient sample $x \in U$, all samples equivalent to $x$ form the equivalence class of $x$, denoted $[x]_B = \{y \in U \mid (x, y) \in IND(B)\}$. The set of all equivalence classes composes a partition induced by $IND(B)$, denoted $U/IND(B)$ or $U/B$.
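The partition $U/B$ is straightforward to compute: samples that agree on every gene in $B$ fall into the same block. A minimal self-contained sketch over a hypothetical discrete gene table (sample and gene names are our own):

```python
from collections import defaultdict

def partition(samples, genes):
    """Group samples into the equivalence classes of IND(B), i.e. U/B.

    samples : dict mapping sample id -> dict of gene -> discrete value
    genes   : the gene subset B (iterable of gene names)
    """
    classes = defaultdict(set)
    for x, values in samples.items():
        # Two samples land in the same class iff they agree on every gene in B.
        key = tuple(values[a] for a in genes)
        classes[key].add(x)
    return list(classes.values())

# Hypothetical discrete gene data set: 4 samples, 2 genes.
U = {
    "x1": {"g1": 0, "g2": 1},
    "x2": {"g1": 0, "g2": 1},
    "x3": {"g1": 1, "g2": 0},
    "x4": {"g1": 0, "g2": 0},
}
print(partition(U, ["g1", "g2"]))  # three equivalence classes
```

Here $U/B = \{\{x_1, x_2\}, \{x_3\}, \{x_4\}\}$: $x_1$ and $x_2$ are indiscernible because they share the same values on both genes.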
In rough sets, these equivalence classes are the elementary units from which two classic sets, named the lower and upper approximation sets, are constructed. For any patient sample subset $X \subseteq U$ and gene subset $B \subseteq C$, the lower and upper approximation sets of $X$ with respect to $B$ are defined respectively as
$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}, \quad (2)$$
$$\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}. \quad (3)$$
The lower approximation set of $X$ with respect to a gene set $B$ is the set of all samples that are certainly classified into $X$ using $B$, while the upper approximation set of $X$ is the set of samples that are possibly classified into $X$ using $B$. The ordered pair $\langle \underline{B}(X), \overline{B}(X) \rangle$ is called the Pawlak rough set of $X$ with respect to $B$.
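Definitions (2) and (3) translate almost literally into code. The self-contained sketch below uses a hypothetical one-gene table (names and data are our own) to show a set whose lower and upper approximations differ:

```python
def equivalence_class(x, samples, genes):
    """[x]_B: all samples agreeing with x on every gene in B."""
    return {y for y, vals in samples.items()
            if all(vals[a] == samples[x][a] for a in genes)}

def approximations(X, samples, genes):
    """Return the lower and upper approximation sets of X w.r.t. B."""
    lower = {x for x in samples if equivalence_class(x, samples, genes) <= X}
    upper = {x for x in samples if equivalence_class(x, samples, genes) & X}
    return lower, upper

# Hypothetical discrete gene data set: 4 samples, one gene.
U = {"x1": {"g1": 0}, "x2": {"g1": 0}, "x3": {"g1": 1}, "x4": {"g1": 1}}
X = {"x1", "x2", "x3"}          # a target set of samples
lower, upper = approximations(X, U, ["g1"])
print(sorted(lower))  # ['x1', 'x2'] -- certainly in X
print(sorted(upper))  # ['x1', 'x2', 'x3', 'x4'] -- possibly in X
```

Sample $x_3$ is excluded from the lower approximation because its class $\{x_3, x_4\}$ is not contained in $X$, yet that same class puts both $x_3$ and $x_4$ into the upper approximation: the gap between the two sets is exactly the roughness of $X$ under $B$.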
2.2. Mutual entropy-based gene selection in rough sets
Feature reduction is also named gene selection, which aims to
delete redundant features (genes) while retaining classification
information in the field of machine learning [34–38]. There are
many literatures on rough set feature reduction and its applica-
60 Y. Chen et al. / Journal of Biomedical Informatics 67 (2017) 59–68