experimental methods are probably not able to obtain compre-
hensive results. Therefore, computational methods [1–3] have
drawn much more attention because of its effectiveness and high
efficiency. However, it is referred to three problems in computa-
tional methods: (1) given genome sequences, finding out all
known regulatory elements; (2) find out unknown regulatory
elements from the upstream of some co-expressed or co-
regulated genes; (3) find out the unknown gene regulated by a
known transcriptional factor. We focus on the second problem
which is called sequence-driven regulatory elements identifying,
and accordingly the first problem is called pattern-driven regula-
tory elements identifying.
Recently, many algorithms and models have been developed for
identifying the regulatory elements with different organisms and
different features. There are two categories of the most commonly
used methods: pattern-dri ven methods and seq uence-driv en meth-
ods. The former searches for the potential sites mainl y based on the
string or matrix models of regulat ory element. The latter is a
predicting method [4] for detecting the common element on the
co-regulated gene cluster. Using the pattern-driven method, lots of
software [5–8] has been developed for identifying the regulatory
elements, such as SIGNAL SCAN, ConsInspector, TFSearch/TESS,
Matinspect or, Consite, Match etc.
At present, methods used in existing algorithms for regulatory
elements identifying include the counting method, expectation–
maximization method, mixture model method and Markov chain
Monte Carlo (MCMM) method. The counting method [9] is the
most instinctive and simplest exhaustive search method. Since its
time complexity is proportional to the exponent of pattern length,
this method is only suitable to identify short regulatory elements.
The Expectation–Maximization (EM) algorithm is an efficient
iterative procedure to compute the maximum likelihood estimate
in the presence of missing or hidden data. Each iteration of the EM
algorithm consists of two steps: the E-step which determines
the conditional expectation, and the M-step which maximizes the
expectation. Efficiency of the algorithm greatly depends on
the initial conditions. With the inappropriate initial setting of
the parameters, it will converge to a local optimum instead of the
global one. The Mixture Model (MM) method [10] is an improve-
ment of the EM algorithm. The basic idea of MM lies on the
conservation of regulatory elements and their corresponding
characteristic matrices. During the process of continual iterations,
the log-likelihood will be maximized only when both of them are
co-adapted. After conserved sequences, sensing matrices or char-
acteristic models are obtained, they are evaluated by their statis-
tical significances. Gibbs sampling algorithm, proposed by
Lawrence et al. [11] is a typical example of the MCMC (Markov
Chain Monte Carlo) method to identify motifs of protein
sequences. Later Liu et al. [12] adopted Gibbs sampler into the
Bayesian model. Their method was used to solve the problem of
multiple sequence alignment and achieved admirable results. Now
Gibbs sampling and its improvements have sparked a major
increase in the application of regulatory element identification.
There are much mature software available on the Internet, such as
Gibbs Motif Sampler [13], AlignACE [14], BioProspector [15], and
MotifSampler
[16] etc. The primary principle of Gibbs sampling is
to optimize the object function through continually updating the
regulatory element model and its position in each sequence by a
random sampling. The final regulatory element is obtained when
the iteration is terminated under a certain condition. At present,
some other pieces of popular software is developed such as
Consensus [17], MEME [18], ANN-Spec [19], PROJECTION [20],
MDScan [21] and the most recent one named YMF [22].
Recently, some optimization methods are also applied on
regulatory element identification, such as statistical analysis,
neural network [23], clustering prediction and word identification.
With the development of information technology and deeply
understanding of molecular biology, more and more methods
based on biological knowledge to identify regulatory elements
have been proposed, such as detecting conserved binding sites in
the evolutional process by comparative genomics, regulatory
element module identification method based on the cooperation
of regulatory elements etc.
Our study focuses on the problem of searching for the binding
sites from co-expressed gene sequences. When using the compu-
tational method to solve the problem, we adopt the premise of
assuming that the genes regulated by the same regulatory element
possess the same or similar gene expression mode. The co-
expressed genes can be obtained by clustering the gene chip data.
Our goal is to detect all possible binding sites of transcription
factor from the upstream of co-expressed genes. Therefore the
problem can be defined as an optimization process to search for
the conserved sequence segments of certain length from a
sequence set. Based on the ant colony optimization, we present
a method named ACRI (ant-colony-regulatory-identification) for
regulatory elements identification. Compared with the existing
algorithms, our algorithm ACRI can avoid converging into local
optimum and can not only improve the quality of results, but also
solve the problem at a very high speed. Experimental results on
two groups of standard test data show that our algorithm ACRI can
obtain solutions with higher quality using less computational time
than other traditional algorithms.
2. Concepts and definitions
2.1. Problem definition
For convenience in description of the problem, we assume that
each regulatory element occurs only once in each sequence. Given
the sequence set X¼{X
1
,X
2
,…,X
n
}, where each sequence is com-
posed of four nucleotides: A, T, G and C. The lengths of those
sequences are denoted as l
1
; l
2
; :::; l
n
respectively. Our goal is to find
out the set of conserved sequence segments M¼{M
1
,M
2
,…,M
n
}
consisting of the motif with length w. Hereby, M
i
is the substring
of X
i
with length w, and M
i
is a subsequence of X
i
(i¼1,2,…,n).
In the computational methods mentioned above, it is impor-
tant to construct a proper feature model for the regulatory
elements. In this paper, we use the matrix model, which uses a
characteristic matrix to describe the distribution of the regulatory
elements. Hence firstly we would define the characteristic matrix.
2.2. Characteristic matrix
Definition 1. Let the length of motif be w and alphabet ∑ ¼{A,T,
G,C}. The characteristic matrix M is a 4 w matrix and its jth
element at the ith row is notated as P
b
j
, where b is the ith character
in the alphabet, and P
b
j
denotes the possibility of the ith character
appearing at the jth position of the motif.
Example 1. Assuming that there are 12 regulatory elements of
length 6 shown as follows:
M
1 ¼ “ ACGCGT”
M
2 ¼ “ ACGCGT”
M
3 ¼ “ CCGCGT”
M
4 ¼ “ TCGCGA”
M
5 ¼ “ ACGCGT”
M
6 ¼ “ ACGCGA”
M
7 ¼ “ ACGCGT”
M
8 ¼ “ ACGCGA”
W. Liu et al. / Computers in Biology and Medicine 43 (2013) 922–932 923