attention, aiming at accurate and early diagnosis [10,11]. However, tumor classification suffers from the curse of dimensionality caused by high-dimensional gene data sets: gene expression data sets contain tens of thousands of genes, but only a few of them are beneficial to tumor classification. Therefore, gene selection is a key problem in tumor classification and diagnosis.
Many gene selection methods have been put forward in recent years. According to their evaluation functions, these methods are divided into two categories: filter methods and wrapper methods. A filter method is generally used as a preprocessing step, independent of the classifier. By analyzing the characteristics of the relevant genes, most filter methods rank the genes according to some criteria, including correlation coefficient, distance metrics, information gain and consistency. Golub et al. [12] first proposed the signal-to-noise ratio function to evaluate genes by how strongly their expression separates the tumor molecular subtypes.
ReliefF [13] and MRMR [14] are two conventional feature selection algorithms, and Zhang et al. [15] developed a gene selection algorithm that combines them. A wrapper method is essentially built around a classifier, relying on classification accuracy as the standard for selecting the optimal gene subset. A representative example is the support vector machine recursive feature elimination (SVM-RFE) algorithm proposed by Guyon et al. [16], which recursively eliminates the features with the smallest weights in the trained support vector machine and has been successfully applied to gene selection [17,18]. Ghosh et al. [19] put forward an optimal gene classification procedure combined with Lasso estimation. However, both kinds of methods have shortcomings. A filter method depends on some standard criterion while ignoring the classification performance of the selected genes. A wrapper method is sensitive to the chosen classifier, its performance is unstable, and its time complexity is usually high.
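As a concrete illustration of the filter idea, the signal-to-noise criterion of Golub et al. [12] scores each gene by the difference of class means normalized by the sum of class standard deviations. The sketch below is our own minimal rendering of that criterion; the function name, array layout and toy data are hypothetical:

```python
import numpy as np

def signal_to_noise(expr, labels):
    """Rank genes by the signal-to-noise ratio s = (mu0 - mu1) / (sd0 + sd1).

    expr   : (n_samples, n_genes) expression matrix
    labels : binary class labels (0/1), one per sample
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    c0, c1 = expr[labels == 0], expr[labels == 1]
    mu0, mu1 = c0.mean(axis=0), c1.mean(axis=0)
    sd0, sd1 = c0.std(axis=0), c1.std(axis=0)
    # A larger |s| means the gene separates the two classes better.
    return (mu0 - mu1) / (sd0 + sd1)

# Toy data: 4 samples, 3 genes; gene 0 separates the two classes cleanly,
# gene 2 carries no class information at all.
expr = [[5.0, 1.0, 2.0],
        [5.2, 1.1, 9.0],
        [1.0, 1.0, 2.1],
        [1.2, 0.9, 8.9]]
labels = [0, 0, 1, 1]
scores = signal_to_noise(expr, labels)
ranking = np.argsort(-np.abs(scores))  # best-separating gene first
print(ranking[0])  # gene 0 tops the ranking
```

A filter method of this kind simply keeps the top-k genes of the ranking; note that, exactly as criticized above, the score never consults a classifier.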
Rough set theory, pioneered by Pawlak [20], has been successfully applied in the bioinformatics field [21,22]. It is an effective gene selection tool that eliminates noisy and redundant genes and discovers important data dependencies through a reduction method. The classical reduction algorithms are founded on the equivalence relation and are only suitable for discrete data sets. Discretization must therefore be carried out before processing continuous gene expression data, which results in a loss of information. Considering this weakness, many scholars have developed extensions of the classical rough set model for gene selection. Dai and Xu [3] proposed a gene selection method based on fuzzy rough sets and fuzzy gain ratio. Hu et al. [23,24] proposed a neighborhood rough set model that deals with both discrete and continuous data sets via a δ-neighborhood parameter, which preserves the rich information needed for classifying the data sets. Wang et al. [25] proposed a gene selection method based on the neighborhood rough set model for dealing with gene expression profiles. Meng et al. [26] put forward a gene selection method using neighborhood rough sets for the analysis of plant stress response.
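The granulation behind these neighborhood models can be stated briefly: the δ-neighborhood of a sample collects all samples lying within distance δ of it, so continuous expression values need no discretization. The snippet below is a minimal illustration under our own assumptions (Euclidean distance and hypothetical expression values; the cited models may use other metrics):

```python
import numpy as np

def delta_neighborhood(expr, i, delta):
    """N(x_i): indices of samples within Euclidean distance delta of sample i."""
    expr = np.asarray(expr, dtype=float)
    dist = np.linalg.norm(expr - expr[i], axis=1)
    # The sample always belongs to its own neighborhood (distance 0).
    return {int(j) for j in np.flatnonzero(dist <= delta)}

# Hypothetical continuous expression values: 4 samples, 2 genes,
# forming two well-separated clusters.
expr = [[0.10, 0.20],
        [0.12, 0.21],
        [0.90, 0.80],
        [0.88, 0.79]]
print(delta_neighborhood(expr, 0, 0.05))  # {0, 1}
```

With δ = 0 on discrete data the neighborhoods collapse to classical equivalence classes, which is why a single model covers both data types.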
The general algorithm for finding the best gene subset employs an incremental greedy heuristic, and attribute significance is usually adopted as the heuristic in the rough set gene selection process. Rough set-based gene selection is thus a filter method that applies attribute significance as its filtering standard. In classical rough sets, attribute significance is constructed from three main measures: positive region, discernibility matrix and information entropy. However, these traditional measures are not directly suitable for neighborhood rough sets in gene selection, so it is necessary to transform or extend them. Information entropy is a good measure of uncertainty. In this paper, we granulate a gene data set by a neighborhood parameter and develop some uncertainty measures in the neighborhood rough set framework. In particular, a joint entropy is proposed for the first time to evaluate the uncertainty of a gene data set, and a joint entropy-based gene selection method is presented. Experimental results show that the proposed algorithm can obtain a small number of key genes and improve the classification accuracy.
The rest of the paper is structured as follows. Some concepts of rough sets and information entropy measures are presented in Section 2. Section 3 presents several entropy measures for neighborhood systems and proposes a gene selection approach based on joint entropy. In Section 4, experiments are conducted on two tumor data sets and the experimental results are analyzed. The conclusions and future work are summarized in Section 5.
2. Preliminaries
In this section, we introduce some basic concepts of rough sets, mutual entropy measures and gene selection. These notions can be found in the literature [27–31].
2.1. Rough sets
The equivalence relation is a core concept in rough sets. Let $DS = (U, C \cup \{d\}, V, f)$ be a discrete-value decision system, also called a discrete-value gene data set, where $U$ is a nonempty finite set of patient samples (or objects), named a universe; $C$ is a nonempty finite set of genes (condition features or attributes); $d$ represents a decision; $V$ is the union of the gene expression level values, which are discrete, i.e., $V = \bigcup_{a \in C \cup \{d\}} V_a$, where $V_a$ is the value set of gene $a$; and $f : U \times (C \cup \{d\}) \to V$ is a mapping function that assigns each sample a value for every gene and the decision. In practice, gene expression level values are real numbers; in a rough set gene selection process, they should first be transformed into discrete values by a discretization method.
For any two patient samples $x, y \in U$ and any gene subset $B \subseteq C$, the indiscernibility relation is
$$IND(B) = \{(x, y) \mid \forall a \in B,\; f(x, a) = f(y, a)\}. \quad (1)$$
Obviously, $IND(B)$ satisfies reflexivity, symmetry and transitivity, and is thus called an equivalence relation. For any patient sample $x \in U$, all samples equivalent to $x$ form the equivalence class of $x$, denoted $[x]_B = \{y \in U \mid (x, y) \in IND(B)\}$. The set of all equivalence classes composes a partition induced by $IND(B)$, denoted $U/IND(B)$ or $U/B$.
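The partition $U/B$ is straightforward to compute: samples that agree on every gene in $B$ fall into the same block. A minimal self-contained sketch over a hypothetical discrete gene table (sample and gene names are our own):

```python
from collections import defaultdict

def partition(samples, genes):
    """Group samples into the equivalence classes of IND(B), i.e. U/B.

    samples : dict mapping sample id -> dict of gene -> discrete value
    genes   : the gene subset B (iterable of gene names)
    """
    classes = defaultdict(set)
    for x, values in samples.items():
        # Two samples land in the same class iff they agree on every gene in B.
        key = tuple(values[a] for a in genes)
        classes[key].add(x)
    return list(classes.values())

# Hypothetical discrete gene data set: 4 samples, 2 genes.
U = {
    "x1": {"g1": 0, "g2": 1},
    "x2": {"g1": 0, "g2": 1},
    "x3": {"g1": 1, "g2": 0},
    "x4": {"g1": 0, "g2": 0},
}
print(partition(U, ["g1", "g2"]))  # three equivalence classes
```

Here $U/B = \{\{x_1, x_2\}, \{x_3\}, \{x_4\}\}$: $x_1$ and $x_2$ are indiscernible because they share the same values on both genes.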
In rough sets, these equivalence classes are the elementary units from which two classic sets, named the lower and upper approximation sets, are constructed. For any patient sample subset $X \subseteq U$ and gene subset $B \subseteq C$, the lower and upper approximation sets of $X$ with respect to $B$ are defined respectively as
$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}, \quad (2)$$
$$\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}. \quad (3)$$
The lower approximation set of $X$ with respect to a gene set $B$ is the set of all samples that are certainly classified into $X$ using $B$, while the upper approximation set of $X$ is the set of samples that are possibly classified into $X$ using $B$. The ordered pair $\langle \underline{B}(X), \overline{B}(X) \rangle$ is called the Pawlak rough set of $X$ with respect to $B$.
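Definitions (2) and (3) translate almost literally into code. The self-contained sketch below uses a hypothetical one-gene table (names and data are our own) to show a set whose lower and upper approximations differ:

```python
def equivalence_class(x, samples, genes):
    """[x]_B: all samples agreeing with x on every gene in B."""
    return {y for y, vals in samples.items()
            if all(vals[a] == samples[x][a] for a in genes)}

def approximations(X, samples, genes):
    """Return the lower and upper approximation sets of X w.r.t. B."""
    lower = {x for x in samples if equivalence_class(x, samples, genes) <= X}
    upper = {x for x in samples if equivalence_class(x, samples, genes) & X}
    return lower, upper

# Hypothetical discrete gene data set: 4 samples, one gene.
U = {"x1": {"g1": 0}, "x2": {"g1": 0}, "x3": {"g1": 1}, "x4": {"g1": 1}}
X = {"x1", "x2", "x3"}          # a target set of samples
lower, upper = approximations(X, U, ["g1"])
print(sorted(lower))  # ['x1', 'x2'] -- certainly in X
print(sorted(upper))  # ['x1', 'x2', 'x3', 'x4'] -- possibly in X
```

Sample $x_3$ is excluded from the lower approximation because its class $\{x_3, x_4\}$ is not contained in $X$, yet that same class puts both $x_3$ and $x_4$ into the upper approximation: the gap between the two sets is exactly the roughness of $X$ under $B$.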
2.2. Mutual entropy-based gene selection in rough sets
Feature reduction is also named gene selection, which aims to
delete redundant features (genes) while retaining classification
information in the field of machine learning [34–38]. There are
many literatures on rough set feature reduction and its applica-
60 Y. Chen et al. / Journal of Biomedical Informatics 67 (2017) 59–68