An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets (CSDN Library resource)


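This article, "An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets", addresses a key problem in bioinformatics: finding the patterns of transcription factor binding sites in large volumes of DNA data. Such patterns are called motifs, and they are important for understanding gene regulation and biological function. The article introduces an algorithm named MCES, which discovers motifs by mining and combining emerging substrings and is designed for large data sets.

The article first notes that planted motif discovery has been used successfully over the past decade to locate transcription factor binding sites in dozens of promoter sequences, but that too little work has been done on identifying motifs in next-generation sequencing (ChIP-seq) data sets. These data sets contain thousands of input sequences, so the new challenge is to make a good identification in reasonable time. To handle larger data sets, the authors design a MapReduce-based strategy that mines emerging substrings in a distributed manner.

Experiments on simulated data show that MCES identifies motifs efficiently and effectively in thousands to millions of input sequences and runs faster than existing motif discovery algorithms such as F-motif and TraverStringsR. MCES can also identify motifs of unknown length, with better identification accuracy than the competing algorithm CisFinder. Its validity is also tested on real data sets.

Several concepts are needed to follow the algorithm. A motif is a short pattern that recurs in DNA sequences, and each of its occurrences is called a motif instance. There are three sequence models for motif discovery: OOPS (one occurrence per sequence), ZOOPS (zero or one occurrence per sequence), and TCM (zero or more occurrences per sequence). ZOOPS and TCM are closer to the real biological situation than OOPS, but identifying motifs under them is harder.

The article also reviews earlier motif discovery algorithms, which fall into two categories according to how they represent motifs: those using consensus sequences, most of which are pattern-driven, and those using position weight matrices (PWM), which usually employ statistical techniques.

The core idea of MCES is to mine and combine emerging substrings, that is, substrings that are over-represented in a test set of sequences sharing the motif but not in a control set of background sequences. Combining these substrings yields the predicted motifs, which in turn helps locate transcription factor binding sites.

Key terms: ChIP-seq (chromatin immunoprecipitation followed by high-throughput sequencing) is an experimental technique for studying protein-DNA interactions. MapReduce is a programming model for processing large data sets in parallel across clusters of machines; Apache Hadoop provides a widely used implementation. Motif discovery is the task of identifying recurring patterns in DNA sequence data.

In short, the article proposes a new computational method for identifying motifs efficiently and accurately in the large data sets produced by next-generation sequencing.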


IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 14, NO. 5, JULY 2015

An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets

Qiang Yu, Hongwei Huo*, Member, IEEE, Xiaoyang Chen, Haitao Guo, Jeffrey Scott Vitter, Fellow, IEEE, and Jun Huan, Member, IEEE

Abstract—Planted motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, not enough work has been done on identifying motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring a new challenge: making a good identification in reasonable time. To cater to this need, we propose a new planted motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. In particular, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES is able to identify motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than state-of-the-art motif discovery algorithms such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.

Index Terms—ChIP-seq, emerging substrings, MapReduce, motif discovery.

I. INTRODUCTION

MOTIF discovery is an important and challenging problem in computational biology. It plays a key role in locating transcription factor binding sites (TFBS) in DNA sequences. Binding sites tend to be short and degenerate, so it is difficult to distinguish them from the rest of the input sequences. Planted motif discovery [1] is a well-known formulation of motif discovery, and it has been proven to be NP-complete [2].

Planted Motif Discovery Problem: Given a set of $t$ $n$-length DNA sequences over the alphabet $\Sigma = \{A, C, G, T\}$ and two nonnegative integers $l$ and $d$, satisfying $0 \le d < l < n$, the task is to find one or more $l$-length strings $m$ such that $m$ occurs in all or a large fraction of the sequences with up to $d$ mismatches. The $l$-length string $m$ is called an $(l, d)$ motif and each occurrence of $m$ is called a motif instance.

Manuscript received March 27, 2015; accepted March 31, 2015. Asterisk indicates corresponding author.

Q. Yu, X. Chen, and H. Guo are with the School of Computer Science and Technology, Xidian University, Xi'an, 710071, China (e-mail: qyu@mail.xidian.edu.cn; xychen@mail.xidian.edu.cn; htguo@mail.xidian.edu.cn).

*H. Huo is with the School of Computer Science and Technology, Xidian University, Xi'an, 710071, China (e-mail: hwhuo@mail.xidian.edu.cn).

J. S. Vitter and J. Huan are with the Information and Telecommunication Technology Center, The University of Kansas, Lawrence, 66047, USA (e-mail: {jsv,jhuan}@ku.edu).

This work was supported in part by the National Natural Science Foundation of China under Grant 61173025 and 61373044, and the Fundamental Research Funds for the Central Universities under Grant JB150306 and XJS15014.

Digital Object Identifier 10.1109/TNB.2015.2421340

According to how and where motif occurrences appear in the sequences, there are three types of motif discovery sequence models: OOPS, ZOOPS, and TCM [3], corresponding to one occurrence per sequence, zero or one occurrence per sequence, and zero or more occurrences per sequence, respectively. The ZOOPS and TCM sequence models are more consistent with the real biological situation than the OOPS model, but identifying motifs under these two models is more difficult than under the OOPS model.
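As a minimal sketch of the definitions above (not the authors' code), the following Python finds where a sequence contains an instance of a candidate motif with at most d mismatches, and reports the most restrictive sequence model that the observed counts are consistent with. The function names `hamming`, `instance_positions`, and `strictest_model` are hypothetical and introduced only for illustration.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def instance_positions(seq: str, motif: str, d: int) -> List[int]:
    """Start positions where seq holds an instance of motif (at most d mismatches)."""
    l = len(motif)
    return [i for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], motif) <= d]


def strictest_model(seqs: List[str], motif: str, d: int) -> str:
    """Most restrictive sequence model consistent with the instance counts."""
    counts = [len(instance_positions(s, motif, d)) for s in seqs]
    if all(c == 1 for c in counts):
        return "OOPS"    # exactly one occurrence per sequence
    if all(c <= 1 for c in counts):
        return "ZOOPS"   # zero or one occurrence per sequence
    return "TCM"         # zero or more occurrences per sequence
```

For example, `strictest_model(["TTACGTTT", "GGGACGAG", "CCCCCCCC"], "ACGT", 1)` returns `"ZOOPS"`, because the third sequence contains no instance while the other two contain exactly one each.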

Numerous algorithms have been proposed to identify motifs in several to dozens of promoter sequences from co-regulated or homologous genes [4]. These algorithms can be divided into two categories according to the motif representation model they use: those using consensus sequences [5] and those using position weight matrices (PWM) [6]. Most identification algorithms based on consensus sequences are pattern-driven [7]–[11]. They traverse all sequence patterns of length $l$, with an initial search space of $4^l$, and report all motifs. The identification algorithms based on PWM usually employ statistical techniques [3], [12]. They iteratively update an initial PWM and report the motif with a high score.
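To illustrate why the pattern-driven approach becomes expensive, here is a deliberately naive sketch that enumerates every candidate pattern of a given length over the DNA alphabet and keeps those that occur, with a bounded number of mismatches, in a sufficient fraction of the sequences. It is an assumption-laden illustration rather than any of the cited algorithms; the names `occurs_within`, `pattern_driven_search`, and `min_fraction` are hypothetical.

```python
from itertools import product
from typing import List


def occurs_within(seq: str, pattern: str, d: int) -> bool:
    """True if some window of seq differs from pattern in at most d positions."""
    l = len(pattern)
    return any(
        sum(a != b for a, b in zip(seq[i:i + l], pattern)) <= d
        for i in range(len(seq) - l + 1)
    )


def pattern_driven_search(seqs: List[str], l: int, d: int,
                          min_fraction: float = 1.0) -> List[str]:
    """Exhaustively test all 4**l patterns (illustrative, exponential in l)."""
    needed = min_fraction * len(seqs)
    motifs = []
    for letters in product("ACGT", repeat=l):  # 4**l candidate patterns
        pattern = "".join(letters)
        support = sum(occurs_within(s, pattern, d) for s in seqs)
        if support >= needed:
            motifs.append(pattern)
    return motifs
```

Because the outer loop runs over every pattern and the inner work grows with the number and length of the sequences, this style of search quickly becomes impractical as the pattern length or the data set grows.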

In recent years, novel experimental techniques such as protein-binding microarrays (PBM) [13] and chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) have allowed the genome-wide identification of TFBSs [4], [14]. These experiments output a list of transcription factor binding regions (i.e., peak regions), but motif discovery methods are still needed to accurately locate TFBSs within these peak regions. The advantage of a ChIP-seq data set is that the sequences are cleaner than traditional promoter sequences [4]. That is, not only does a high percentage of sequences contain TFBSs, but each sequence also has a high resolution (i.e., the sequence length is short, about 200 base pairs). It therefore seems easier for motif discovery methods to obtain high identification accuracy on ChIP-seq data sets, but a ChIP-seq data set is very large, containing thousands or more sequences, and so requires high computational efficiency. Unfortunately, almost all algorithms designed for identifying motifs in promoter sequences, whether pattern-driven or statistical, become too time-consuming for ChIP-seq data sets.

ChIP-tailored versions of traditional motif discovery algorithms have been proposed, such as MEME-ChIP [15]. These algorithms usually limit the data set size by selecting a small subset of the sequences for motif identification [16]. For example, MEME-ChIP selects just 600 sequences at random from the input sequences and then identifies motifs by using the expectation-maximization algorithm. In spite of this, these algorithms still show poor time performance because they maintain the original algorithm framework. In contrast to MEME-ChIP, EXTREME [17] achieves much better time performance by using the online expectation-maximization algorithm, but it requires too much storage space when handling large inputs (e.g., it requires about 8 GB of memory for 10 Mb inputs). A few new algorithms [18], [19] are based on either suffix trees or De Bruijn graphs, but their time performance degrades as $l$ and $d$ increase. Although DREME [20] can analyze very large data sets in minutes, it can only find short motifs. To process full-size ChIP-seq data sets efficiently, some algorithms based on word counting have been proposed, such as RSAT [21] and CisFinder [22]. Both RSAT and CisFinder rely only on the frequencies of very short words for the sake of good time performance, so they may miss some useful information contained in the sequences; in addition, CisFinder does not support outputting motifs of a specified length.

Against the background of identifying motifs in ChIP-seq data sets, it is necessary to design new algorithms with the following features: i) they can handle the full-size input sequences and make full use of the information contained in the sequences; ii) they can complete the computation with good time performance and good identification accuracy; iii) they can identify motifs without the OOPS constraint; and iv) they can report motifs with or without a specified length.

To cater to these needs, we propose a new motif discovery algorithm, named MCES, based on mining and combining emerging substrings, which are potential motif instances. We design MCES in terms of the ZOOPS sequence model. To handle very large data sets, we also design a MapReduce-based strategy to mine emerging substrings in a distributed manner. MCES fully uses the emerging substrings of different lengths, and is able to efficiently and effectively identify motifs with or without a specified length in full-size ChIP-seq data sets.
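This section does not describe how the MapReduce-based strategy partitions the work, so the following is only a generic sketch of the map and reduce pattern applied to substring counting, simulated in a single Python process. It assumes that the sequences are split into chunks, that each mapper counts in how many of its sequences every substring occurs, and that a reducer sums the partial counts. The names `map_chunk`, `reduce_counts`, `distributed_counts`, `k_min`, and `k_max` are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_chunk(chunk: Iterable[str], k_min: int, k_max: int) -> Counter:
    """Map phase: per chunk, count the sequences containing each substring."""
    counts: Counter = Counter()
    for seq in chunk:
        seen = {seq[i:i + k]
                for k in range(k_min, k_max + 1)
                for i in range(len(seq) - k + 1)}
        counts.update(seen)  # each substring counted once per sequence
    return counts


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce phase: merge two partial count tables by summing."""
    return a + b


def distributed_counts(seqs: List[str], n_chunks: int,
                       k_min: int, k_max: int) -> Counter:
    """Split the input, map each chunk, then reduce the partial results."""
    chunks = [seqs[i::n_chunks] for i in range(n_chunks)]
    # In a real cluster the mappers would run in parallel on separate nodes.
    partial = [map_chunk(c, k_min, k_max) for c in chunks]
    return reduce(reduce_counts, partial, Counter())
```

Since the per-chunk counts are simply added, the result does not depend on how the sequences are divided among the mappers, which is what makes this kind of counting easy to distribute.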

The rest of the paper is organized as follows. Section II first gives an overview of the proposed algorithm, then describes the mining step and the combining step in detail, and finally shows the whole algorithm. Section III presents the results and discussion. We conclude the paper in Section IV.

II. METHODS

A. Overview

We first introduce an observation: a given instance of an $(l, d)$ motif $m$ may occur exactly multiple times in ChIP-seq data sets. ChIP-seq data sets contain thousands or more DNA sequences, and thus the motif also has thousands or more instances. Since each instance differs from $m$ in at most $d$ positions, we can expect to find some motif instances repeating multiple times in thousands of sequences. In Section III-A we confirm this observation by using probabilistic analysis and demonstrate that motif instances have higher occurrence frequencies than the background $l$-mers.
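The probabilistic analysis of Section III-A is not part of this excerpt. As a rough back-of-the-envelope sketch under simplifying assumptions that are ours, not the paper's, suppose each sequence carries one instance drawn uniformly from all strings within the mismatch bound of the motif, and that background sequences are uniformly random. The helper names below are hypothetical.

```python
from math import comb


def neighborhood_size(l: int, d: int) -> int:
    """Number of l-length DNA strings within Hamming distance d of a fixed string."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))


def expected_instance_count(t: int, l: int, d: int) -> float:
    """Expected copies of one particular instance among t planted instances
    (assumes instances are uniform over the d-neighborhood)."""
    return t / neighborhood_size(l, d)


def expected_background_count(t: int, n: int, l: int) -> float:
    """Expected exact occurrences of one fixed l-mer in t random n-length sequences
    (assumes independent, uniformly distributed bases)."""
    return t * (n - l + 1) / 4 ** l
```

Under these assumptions, the motif neighborhood is far smaller than the full space of strings of the same length, so a particular instance is expected to recur more often than an arbitrary background string once the number of sequences is large.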

In view of these considerations, we identify motifs by mining and combining substrings with high occurrence frequencies. Accordingly, our algorithm contains two main steps, namely the mining step and the combining step. Table I summarizes the notations used in this paper.

TABLE I. NOTATIONS USED IN THIS PAPER.

In the mining step, we mine substrings of different lengths simultaneously, because i) the length of the motif to be identified is unknown in advance and ii) some segments of a motif are also over-represented, and mining them helps us obtain more motif information. Moreover, to reduce the disturbance of random over-represented substrings, we perform mining by using both a test set and a control set of DNA sequences. The test set contains the sequences that share the motifs to be identified, whereas the control set consists only of background sequences (i.e., sequences that do not contain motif instances). The substrings of interest, or motif instances, are therefore over-represented only in the test set and not in the control set, and we call such substrings emerging substrings. Naturally, we convert our mining task to the emerging substrings mining problem [23] as follows. The detailed description of the mining step is given in Section II-B.

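Emerging Substrings Mining Problem: Given a test set and a control set of sequences over the DNA alphabet $\Sigma$, a threshold frequency, and a minimum growth rate, the task is to find all substrings over $\Sigma$ whose frequency in the test set is at least the threshold frequency and whose growth rate from the control set to the test set is at least the minimum growth rate. The substrings satisfying both conditions are called emerging substrings.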
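The problem statement can be made concrete with a brute-force sketch. It assumes that the frequency of a substring is the fraction of sequences containing it, that the growth rate is the test frequency divided by the control frequency, and that the growth rate is treated as infinite when the substring is absent from the control set. These conventions and the names `substring_frequencies`, `emerging_substrings`, `min_freq`, and `min_growth` are assumptions for illustration; the paper itself relies on the far more efficient algorithm of [23].

```python
from collections import Counter
from typing import Dict, List


def substring_frequencies(seqs: List[str], k_min: int, k_max: int) -> Dict[str, float]:
    """Fraction of sequences containing each substring of length k_min..k_max."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k]
                       for k in range(k_min, k_max + 1)
                       for i in range(len(seq) - k + 1)})
    return {s: c / len(seqs) for s, c in counts.items()}


def emerging_substrings(test: List[str], control: List[str],
                        min_freq: float, min_growth: float,
                        k_min: int, k_max: int) -> List[str]:
    """Substrings frequent in the test set and over-represented versus the control set."""
    f_test = substring_frequencies(test, k_min, k_max)
    f_ctrl = substring_frequencies(control, k_min, k_max)
    result = []
    for s, ft in f_test.items():
        if ft < min_freq:
            continue  # fails the frequency threshold
        fc = f_ctrl.get(s, 0.0)
        growth = float("inf") if fc == 0.0 else ft / fc  # assumed convention
        if growth >= min_growth:
            result.append(s)
    return sorted(result)
```

Enumerating every substring explicitly is quadratic in the sequence length per sequence, which is acceptable for short peak regions in a toy example but not for the data set sizes the paper targets.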

In the combining step, we combine emerging substrings by using clustering methods to obtain predicted motifs. The key factors to consider are as follows. First, the mining step may output many emerging substrings because it mines substrings of different lengths, and these emerging substrings should be combined efficiently. Second, we need a method to calculate the similarity between two substrings of different lengths. Section II-C describes the combining step in detail.
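Section II-C is not included in this excerpt, so the paper's actual similarity measure and clustering method are not shown here. Purely as an illustration of the two requirements just listed, the sketch below uses a stand-in similarity (the best ungapped alignment of the shorter string inside the longer one, normalized by the shorter length) and a simple greedy grouping. Both choices and the names `overlap_similarity` and `greedy_clusters` are hypothetical.

```python
from typing import List


def overlap_similarity(a: str, b: str) -> float:
    """Best fraction of matching positions when sliding the shorter string
    along the longer one without gaps (illustrative measure only)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(x == y for x, y in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def greedy_clusters(substrings: List[str], threshold: float) -> List[List[str]]:
    """Assign each substring to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for s in substrings:
        for cluster in clusters:
            if overlap_similarity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

A single pass with one representative per cluster keeps the number of comparisons low, which matters when the mining step returns many substrings, at the cost of results that depend on input order.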

B. Mining Step

We take advantage of the string mining algorithm proposed by Fischer, Heun, and Kramer [23] to carry out the mining step efficiently. Their mining algorithm runs in optimal time. It first constructs the suffix array (SA) and the longest common prefix array (LCP) for the input data sets, and then visits all substrings by simulating the suffix tree traversal in the SA using the LCP information.
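For readers unfamiliar with these two arrays, the following sketch builds them for a single string. It uses a naive sort-based suffix array construction, which is much slower than the optimal-time machinery of [23], together with the standard Kasai-style linear pass for the LCP array. The helper `longest_repeat` is a hypothetical example of reading repeated substrings off the LCP values.

```python
from typing import List


def suffix_array(s: str) -> List[int]:
    """Start positions of all suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: List[int]) -> List[int]:
    """lcp[r] = length of the common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeat(s: str) -> str:
    """Longest substring occurring at least twice, read off the maximum LCP value."""
    if not s:
        return ""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    r = max(range(len(s)), key=lcp.__getitem__)
    return s[sa[r]:sa[r] + lcp[r]]
```

Adjacent suffixes in the SA that share a long prefix correspond to a repeated substring, which is why a scan over the SA guided by the LCP values can stand in for a suffix tree traversal.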

To adjust their mining algorithm to closely fit our problem, we make the following improvements. First, since the occurrence
