An Ant Colony Optimization Based Algorithm for Identifying Gene Regulatory Elements


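In bioinformatics, identifying gene regulatory elements is an important task. Gene regulatory elements are sequences that regulate gene expression; they typically include promoters, enhancers, silencers and transcription factor binding sites. These elements play key roles in regulating cell function, in the development of organisms and in responses to the environment. Because they lie in the non-coding regions of the genome, are numerous and have complex functions, determining which sequences are actual regulatory elements and how they interact is a challenge for bioinformatics research. Existing algorithms for identifying these elements tend to converge to local optima and have high computational complexity.

In this paper, the authors propose an algorithm for identifying gene regulatory elements based on Ant Colony Optimization (ACO), named ACRI (ant-colony-regulatory-identification). ACO is a metaheuristic based on swarm intelligence, inspired by the collective foraging behavior of real ants; its main traits include self-organization and robustness. ACRI exploits these traits. In particular, to accelerate the ants' search, it introduces a local optimization strategy that adjusts the ants' start positions on the searched sequences, improving both search efficiency and solution quality. Experimental results show that on real-world datasets ACRI outperforms other traditional algorithms in solution quality and is also markedly faster.

In terms of implementation, the goal of ACRI is to identify all possible transcription factor binding sites upstream of co-expressed genes. This task is important for understanding gene expression regulatory networks and genetic regulatory mechanisms. Traditionally, problems of this kind involve complex statistical models and algorithms such as the Hidden Markov Model (HMM) and Support Vector Machines (SVM). ACRI offers a new approach and tool for this class of problems.

Another key point of ACRI is the local optimization strategy. ACO usually needs many iterations to reach a good solution; the local optimization strategy can, to some extent, avoid premature convergence to a local optimum while reducing the search space and speeding up the search. This is especially important when processing large-scale bioinformatics data.

In practice, the efficiency and accuracy of ACRI give it considerable potential in gene regulatory network analysis. It can be used to discover new regulatory elements, to help biologists build more accurate gene regulatory network models, and to support research on disease-related gene regulatory mechanisms. In cancer research, for example, accurately identifying the key regulatory elements involved in tumor initiation and progression helps to discover new therapeutic targets or disease biomarkers.

The research and application of ACRI represent an important advance for ant colony optimization in bioinformatics. The work not only provides an optimization algorithm in theory but also verifies its feasibility and advantages experimentally. As bioinformatics and computational biology continue to develop, similar algorithms will play an increasingly important role in revealing the complex regulatory mechanisms of biological systems.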


An ant colony optimization based algorithm for identifying gene regulatory elements

Wei Liu (a, b), Hanwu Chen, Ling Chen (b, c)

Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China

Department of Computer Science, Yangzhou University, Yangzhou 225127, China

National Key Lab of Novel Software Tech, Nanjing University, Nanjing 210093, China

article info

Article history: Received 15 July 2011; Accepted 11 April 2013

Keywords: Gene regulatory elements; Ant colony optimization; Motif identification

abstract

Identifying the regulatory elements in gene sequences is one of the most important tasks in bioinformatics. Most existing algorithms for identifying regulatory elements tend to converge to a local optimum and have high time complexity. Ant Colony Optimization (ACO) is a meta-heuristic method based on swarm intelligence, derived from a model inspired by the collective foraging behavior of real ants. Taking advantage of ACO traits such as self-organization and robustness, this paper designs and implements an ACO-based algorithm named ACRI (ant-colony-regulatory-identification) for identifying all possible transcription factor binding sites in the upstream regions of co-expressed genes. To accelerate the ants' searching process, a local optimization strategy is presented to adjust the ants' start positions on the searched sequences. By exploiting the powerful optimization ability of ACO, ACRI not only improves the precision of the results but also achieves a very high speed. Experimental results on real-world datasets show that ACRI outperforms other traditional algorithms in both speed and solution quality.

1. Introduction

A biological system is mainly composed of static and dynamic components. The static components include all genes in the genome, which are the elementary building blocks of a biological system. With the achievements in genome sequencing and annotation, special attention has been paid to gene regulatory elements, namely the dynamic component of the biological system. Genomic regulatory elements, which are also called DNA motifs, contain abundant biological information reflecting life characteristics and play an important role in gene function and structure construction. Discovering and recognizing gene regulatory elements has become one of the most important approaches in the analysis of genome sequences and has drawn extensive attention in bioinformatics research.

Gene regulatory element identification (also called motif identification) involves two problems: how to extract a motif from biological data or structures, and how to recognize the motif contained in target sequences or structures. It is a major research area in the study of gene non-coding regions. At the transcriptional and post-transcriptional level, gene expression is mainly controlled by cis-regulatory elements, which are essentially short DNA sequences. These sequences are often located in the upstream region of the regulated genes and are recognized and bound by a specific DNA-binding protein (transcription factor) so as to regulate DNA metabolism and transcription. Alternatively, they may be recognized and bound by an RNA-binding protein, and this binding can influence the processes of RNA modification, localization, translation and degradation. Hence, transcriptional regulatory element identification is one of the most important tasks for understanding and explaining genome behavior.

In searching for a known regulatory element or predicting a new one, three problems must be solved: (1) how to formally describe the regulatory elements, namely, what characteristic model should be constructed for them; (2) how to define a measurement or scoring function for the probability that a sequence segment is a regulatory element; (3) given the regulatory element model and scoring function, how to detect the regulatory element with the maximal score in the sequences to be analyzed, which is the problem of algorithm design.
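The three ingredients above (a model, a scoring function, and a search) can be illustrated with a minimal sketch. It assumes a position-probability matrix as the model, a log-odds score against a uniform background as the scoring function, and a plain sliding-window scan as the search. The matrix values, the function names and the pseudocount are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def log_odds_score(segment, matrix, background=0.25, pseudo=1e-6):
    """Sum over positions of log2(P(base, j) / background) for one segment."""
    score = 0.0
    for j, base in enumerate(segment):
        # The small pseudocount avoids log(0) for bases the model never saw.
        score += math.log2((matrix[base][j] + pseudo) / background)
    return score

def best_segment(sequence, matrix):
    """Return (start, score) of the highest-scoring window in `sequence`."""
    w = len(matrix["A"])
    best_start, best_score = -1, float("-inf")
    for start in range(len(sequence) - w + 1):
        s = log_odds_score(sequence[start:start + w], matrix)
        if s > best_score:
            best_start, best_score = start, s
    return best_start, best_score

# Hypothetical model of width 4 that strongly prefers "ACGT".
toy_matrix = {
    "A": [0.85, 0.05, 0.05, 0.05],
    "C": [0.05, 0.85, 0.05, 0.05],
    "G": [0.05, 0.05, 0.85, 0.05],
    "T": [0.05, 0.05, 0.05, 0.85],
}
start, score = best_segment("TTGACGTCCA", toy_matrix)
print(start, round(score, 3))
```

Scanning a sequence with a known matrix is cheap; the hard part, and the subject of the algorithms discussed here, is finding the matrix and the site positions at the same time.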

In the past two decades, more and more effort has been dedicated to gene regulatory element identification in DNA sequences. There are two main categories of identification methods: experimental methods and computational methods. Because of their high cost in time and money, experimental methods are unlikely to yield comprehensive results. Computational methods [1–3] have therefore drawn much more attention because of their effectiveness and high efficiency. Computational methods address three problems: (1) given genome sequences, finding all known regulatory elements; (2) finding unknown regulatory elements in the upstream regions of co-expressed or co-regulated genes; (3) finding the unknown genes regulated by a known transcription factor. We focus on the second problem, which is called sequence-driven regulatory element identification; accordingly, the first problem is called pattern-driven regulatory element identification.

Computers in Biology and Medicine 43 (2013) 922–932

http://dx.doi.org/10.1016/j.compbiomed.2013.04.008

Journal homepage: www.elsevier.com/locate/cbm

Corresponding author at: Department of Computer Science and Engineering, Yangzhou University, Yangzhou 225127, China. Tel.: +86 51489 786266.

E-mail addresses: yzliuwei@126.com (W. Liu), hanwu_chen@163.com (H. Chen), yzulchen@gmail.com (L. Chen).

Recently, many algorithms and models have been developed for identifying regulatory elements in different organisms and with different features. The most commonly used methods fall into two categories: pattern-driven methods and sequence-driven methods. The former search for potential sites mainly on the basis of string or matrix models of the regulatory element. The latter are predictive methods [4] for detecting the common element in a cluster of co-regulated genes. Much software [5–8] based on the pattern-driven method has been developed for identifying regulatory elements, such as SIGNAL SCAN, ConsInspector, TFSearch/TESS, MatInspector, Consite and Match.

At present, the methods used in existing algorithms for regulatory element identification include the counting method, the expectation–maximization method, the mixture model method and the Markov chain Monte Carlo (MCMC) method. The counting method [9] is the most intuitive and simplest exhaustive search method. Since its time complexity grows exponentially with the pattern length, this method is only suitable for identifying short regulatory elements.
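The exponential cost of exhaustive search is easy to see in a small sketch. The code below enumerates all 4^w candidate patterns and keeps the one with the smallest total Hamming distance to its best match in each sequence. This distance-based criterion is an assumption chosen for illustration; the counting method of [9] may score patterns differently. The function names and test sequences are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(pattern, sequence):
    """Distance from `pattern` to its closest window in `sequence`."""
    w = len(pattern)
    return min(hamming(pattern, sequence[i:i + w])
               for i in range(len(sequence) - w + 1))

def exhaustive_motif_search(sequences, w):
    """Try every pattern of length w over {A, C, G, T}: 4**w candidates."""
    best_pattern, best_total = None, float("inf")
    for letters in product("ACGT", repeat=w):
        pattern = "".join(letters)
        total = sum(min_distance(pattern, s) for s in sequences)
        if total < best_total:
            best_pattern, best_total = pattern, total
    return best_pattern, best_total

toy_sequences = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
print(exhaustive_motif_search(toy_sequences, 4))
```

For w = 4 the loop visits 256 patterns; for w = 12 it would visit about 16.8 million, each scored against every window of every sequence, which is why this style of search is limited to short elements.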

The Expectation–Maximization (EM) algorithm is an efficient iterative procedure for computing the maximum likelihood estimate in the presence of missing or hidden data. Each iteration of the EM algorithm consists of two steps: the E-step, which determines the conditional expectation, and the M-step, which maximizes the expectation. The efficiency of the algorithm depends greatly on the initial conditions. With an inappropriate initial setting of the parameters, it will converge to a local optimum instead of the global one. The Mixture Model (MM) method [10] is an improvement of the EM algorithm. The basic idea of MM rests on the conservation of regulatory elements and their corresponding characteristic matrices. During the iterations, the log-likelihood is maximized only when both of them are co-adapted. After conserved sequences, sensing matrices or characteristic models are obtained, they are evaluated by their statistical significance. The Gibbs sampling algorithm, proposed by Lawrence et al. [11], is a typical example of the MCMC (Markov Chain Monte Carlo) method and was used to identify motifs in protein sequences. Later, Liu et al. [12] incorporated the Gibbs sampler into a Bayesian model. Their method was used to solve the problem of multiple sequence alignment and achieved admirable results. Gibbs sampling and its improvements are now widely applied to regulatory element identification. Much mature software is available on the Internet, such as Gibbs Motif Sampler [13], AlignACE [14], BioProspector [15] and MotifSampler [16]. The primary principle of Gibbs sampling is to optimize the objective function by continually updating the regulatory element model and its position in each sequence through random sampling. The final regulatory element is obtained when the iteration terminates under a certain condition. Other popular software has also been developed, such as Consensus [17], MEME [18], ANN-Spec [19], PROJECTION [20], MDScan [21] and the most recent one, YMF [22].
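The Gibbs sampling principle described above, repeatedly rebuilding the model from all but one sequence and then resampling the site position in the held-out sequence, can be sketched as follows. This is a simplified site sampler under stated assumptions: a uniform background, a pseudocount of 1, a fixed number of iterations as the stopping condition, and one site per sequence. It is not the implementation used by any of the tools named above, and all function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def profile_from(segments, w, pseudo=1.0):
    """Per-position base probabilities with pseudocounts."""
    profile = []
    for j in range(w):
        counts = {b: pseudo for b in ALPHABET}
        for seg in segments:
            counts[seg[j]] += 1
        total = sum(counts.values())
        profile.append({b: c / total for b, c in counts.items()})
    return profile

def segment_prob(segment, profile):
    """Probability of a segment under the profile."""
    p = 1.0
    for j, base in enumerate(segment):
        p *= profile[j][base]
    return p

def gibbs_sampler(sequences, w, iterations=500, seed=0):
    """Return one start position per sequence after `iterations` updates."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in sequences]
    for _ in range(iterations):
        i = rng.randrange(len(sequences))  # sequence held out this round
        others = [s[p:p + w]
                  for k, (s, p) in enumerate(zip(sequences, starts)) if k != i]
        profile = profile_from(others, w)
        seq = sequences[i]
        weights = [segment_prob(seq[p:p + w], profile)
                   for p in range(len(seq) - w + 1)]
        # Sample the new position in proportion to its probability.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy_sequences = ["TTACGTAA", "GGACGTCC", "ACGTTTTT", "CCCACGTG"]
print(gibbs_sampler(toy_sequences, 4))
```

Because each position is sampled rather than chosen greedily, the sampler can leave a poor configuration, but it returns the last state rather than the best one seen; practical tools usually track the best-scoring alignment across iterations and restarts.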

Recently, other methods have also been applied to regulatory element identification, such as statistical analysis, neural networks [23], clustering prediction and word identification. With the development of information technology and a deeper understanding of molecular biology, more and more methods that use biological knowledge to identify regulatory elements have been proposed, such as detecting binding sites conserved during evolution by comparative genomics, and regulatory module identification methods based on the cooperation of regulatory elements.

Our study focuses on the problem of searching for binding sites in co-expressed gene sequences. When solving the problem computationally, we assume that genes regulated by the same regulatory element have the same or similar expression patterns. The co-expressed genes can be obtained by clustering gene chip data. Our goal is to detect all possible transcription factor binding sites in the upstream regions of co-expressed genes. The problem can therefore be defined as an optimization process that searches a sequence set for conserved sequence segments of a certain length. Based on ant colony optimization, we present a method named ACRI (ant-colony-regulatory-identification) for regulatory element identification. Compared with existing algorithms, ACRI can avoid converging to a local optimum, improves the quality of the results and solves the problem at a very high speed. Experimental results on two groups of standard test data show that ACRI obtains solutions of higher quality in less computational time than other traditional algorithms.
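To make the optimization view concrete, here is a generic ant colony sketch for choosing one start position per sequence. It is not the ACRI algorithm, whose details are not given in this part of the text. Everything in it is an assumption for illustration: pheromone is stored per start position, the objective is a simple consensus score, the local step shifts a start by one position while the score improves, and only the iteration-best ant deposits pheromone. All names and parameter values are hypothetical.

```python
import random
from collections import Counter

def consensus_score(sequences, starts, w):
    """Sum over columns of the count of the most frequent base."""
    segments = [s[p:p + w] for s, p in zip(sequences, starts)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*segments))

def local_shift(sequences, starts, w):
    """Hypothetical local step: shift single starts by +/-1 while improving."""
    starts = list(starts)
    best = consensus_score(sequences, starts, w)
    improved = True
    while improved:
        improved = False
        for i, seq in enumerate(sequences):
            for delta in (-1, 1):
                p = starts[i] + delta
                if 0 <= p <= len(seq) - w:
                    trial = starts[:i] + [p] + starts[i + 1:]
                    score = consensus_score(sequences, trial, w)
                    if score > best:
                        starts, best, improved = trial, score, True
    return starts, best

def aco_motif_search(sequences, w, ants=10, iterations=30, rho=0.1, seed=0):
    """Generic ACO loop: construct, improve locally, evaporate, deposit."""
    rng = random.Random(seed)
    n = len(sequences)
    # One pheromone value per (sequence, start position), initially uniform.
    tau = [[1.0] * (len(s) - w + 1) for s in sequences]
    best_starts, best_score = None, -1
    for _ in range(iterations):
        iter_starts, iter_score = None, -1
        for _ant in range(ants):
            starts = [rng.choices(range(len(t)), weights=t)[0] for t in tau]
            starts, score = local_shift(sequences, starts, w)
            if score > iter_score:
                iter_starts, iter_score = starts, score
        for t in tau:  # evaporation
            for p in range(len(t)):
                t[p] *= 1.0 - rho
        for i, p in enumerate(iter_starts):  # deposit by the iteration best
            tau[i][p] += iter_score / (n * w)
        if iter_score > best_score:
            best_starts, best_score = iter_starts, iter_score
    return best_starts, best_score

toy_sequences = ["TTACGTAA", "GGACGTCC", "ACGTTTTT", "CCCACGTG"]
print(aco_motif_search(toy_sequences, 4))
```

The evaporation rate `rho` controls how quickly earlier choices are forgotten: a larger value speeds convergence but raises the risk of settling on a poor alignment, which is the trade-off a local optimization step is meant to ease.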

2. Concepts and definitions

2.1. Problem definition

For convenience in describing the problem, we assume that each regulatory element occurs exactly once in each sequence. We are given the sequence set $X = \{X_1, \ldots, X_n\}$, where each sequence is composed of the four nucleotides A, T, G and C. The lengths of these sequences are denoted $l_1, l_2, \ldots, l_n$ respectively. Our goal is to find the set of conserved sequence segments $M = \{M_1, \ldots, M_n\}$ that make up the motif of length $w$. Here, $M_i$ is a substring of $X_i$ of length $w$ ($i = 1, 2, \ldots, n$).
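Under the one-occurrence-per-sequence assumption, a candidate solution is simply one start position per sequence, and sequence $X_i$ offers $l_i - w + 1$ possible starts. The sketch below shows this representation and the resulting size of the search space. The function names and the toy sequences are hypothetical, and positions are 0-based as is usual in Python.

```python
def extract_segments(sequences, starts, w):
    """Return the segments M_i = X_i[start_i : start_i + w]."""
    segments = []
    for seq, p in zip(sequences, starts):
        if not 0 <= p <= len(seq) - w:
            raise ValueError(f"start {p} is out of range for length {len(seq)}")
        segments.append(seq[p:p + w])
    return segments

def search_space_size(sequences, w):
    """Number of candidate solutions: product of (l_i - w + 1)."""
    size = 1
    for seq in sequences:
        size *= len(seq) - w + 1
    return size

X = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
print(extract_segments(X, [2, 2, 0], 4))
print(search_space_size(X, 4))
```

The product grows exponentially with the number of sequences, so enumerating every combination of start positions is impractical for realistic inputs; this is what motivates heuristic search.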

In the computational methods mentioned above, it is important to construct a proper feature model for the regulatory elements. In this paper, we use the matrix model, which uses a characteristic matrix to describe the distribution of the regulatory elements. We therefore first define the characteristic matrix.

2.2. Characteristic matrix

Definition 1. Let the length of the motif be $w$ and the alphabet be $\Sigma = \{A, T, G, C\}$. The characteristic matrix $M$ is a $4 \times w$ matrix whose $j$th element in the $i$th row is denoted $P_{b,j}$, where $b$ is the $i$th character in the alphabet, and $P_{b,j}$ denotes the probability of the $i$th character appearing at the $j$th position of the motif.
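A characteristic matrix as in Definition 1 can be estimated from a set of aligned motif instances by counting, for each position, how often each base occurs and dividing by the number of instances. The sketch below assumes this plain frequency estimate with no pseudocounts; the four input instances and the function name are hypothetical and chosen only to keep the arithmetic easy to check.

```python
ALPHABET = "ATGC"  # row order follows the alphabet in Definition 1

def characteristic_matrix(instances):
    """Return {base: [P(base, j) for j in range(w)]} from aligned instances."""
    w = len(instances[0])
    n = len(instances)
    counts = {b: [0] * w for b in ALPHABET}
    for inst in instances:
        if len(inst) != w:
            raise ValueError("all instances must have the same length")
        for j, base in enumerate(inst):
            counts[base][j] += 1
    # Convert counts to relative frequencies; each column sums to 1.
    return {b: [c / n for c in row] for b, row in counts.items()}

# Hypothetical aligned instances of a motif of length 4.
instances = ["ACGT", "ACGA", "TCGT", "ACGT"]
matrix = characteristic_matrix(instances)
for base in ALPHABET:
    print(base, matrix[base])
```

Here three of the four instances start with A and one with T, so the first column holds 0.75 for A and 0.25 for T, while the fully conserved second and third positions hold 1.0 for C and G respectively.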

Example 1. Assume that there are 12 regulatory elements of length 6, shown as follows:

1 = "ACGCGT"
2 = "ACGCGT"
3 = "CCGCGT"
4 = "TCGCGA"
5 = "ACGCGT"
6 = "ACGCGA"
7 = "ACGCGT"
8 = "ACGCGA"

