A Parallel Pairwise Alignment with Pruning for Large Genomic Sequences

Xiangyuan Zhu∗, Bing Li†, Kenli Li‡, Ping Shao∗, Yi Pan†

∗School of Computer Science, Zhaoqing University, Zhaoqing 526061, China
†Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
‡College of Information Science and Engineering, Hunan University, Changsha 410082, China
Email: hnzxy@hnu.edu.cn

Abstract—Pairwise sequence alignment is a common and fundamental task in Computational Biology, and it forms the basis for many Bioinformatics applications. In the post-genomic era, there is an increasing demand to align long DNA sequences to discover their functions. In this paper, we propose a parallel pairwise alignment algorithm for large genomic sequences that recursively divides the whole genomic sequences into small pieces and uses an effective pruning strategy to reduce the search and computation space. We ran rigorous tests on a 4-core computer using real genomic sequences and artificially generated sequences. The results show that our implementation achieves a speedup of 10.64 with 99.75% accuracy compared to the sequential algorithm. As far as we know, this is the first time that MBP (mega base-pairs) sequences have been globally aligned with an affine gap penalty.

I. INTRODUCTION

Pairwise sequence alignment is a fundamental operation in bioinformatics. Its typical task is to identify similar regions between two biological sequences, such as proteins or DNA sequences, which is an effective way to infer the evolutionary, functional and structural relationships between the two.

There are two categories of pairwise alignment: global and local. NW (Needleman and Wunsch) [1] is an optimal global alignment algorithm that applies a dynamic programming scheme to construct an exact end-to-end alignment of two sequences. SW (Smith-Waterman) [2] is a well-known local alignment algorithm that focuses only on finding similar regions between two sequences. Although both NW and SW retrieve exact solutions for pairwise alignment, their time and space complexity are O(m × n), where m and n are the lengths of the two sequences to be aligned. The quadratic time and space requirements make it infeasible to align large genomic sequences.
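
The quadratic cost can be seen in a minimal sketch of the NW score computation. The function name `nw_score` and the scoring values (match, mismatch, and a per-symbol gap penalty) are illustrative assumptions, not taken from the paper; the point is only that the full table has (m + 1) × (n + 1) cells, each filled in constant time.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score using the full (m+1) x (n+1) table.

    Scoring values are illustrative assumptions; a simple per-symbol
    (linear) gap penalty is used here, not an affine one.
    """
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[m][n]
```

For two sequences of one million bases each, this table would hold about 10^12 cells, which is why the full-table approach does not scale to genomic inputs.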

With decreasing sequencing costs and improvements in sequencing and longer-read technologies over the past 10 years, the amount of data produced has been doubling roughly every seven months [3]. These technological developments have enabled several large-scale sequencing efforts, including the Human Microbiome Project (HMP) [4] and the Genomic Encyclopedia of Bacteria and Archaea [5]. Currently, there are 258,075 organisms in GOLD (Genomes Online Database). This growing number makes aligning large genomic sequences imperative in computational biology.

Gotoh [6] gave an algorithm that solves the pairwise alignment problem using affine gap penalties rather than regular gap penalties, resulting in improved alignment quality. Formally, the affine function is specified as gap(k) = G + Hk for the cost of a k-symbol indel (insertion or deletion), where G and H are non-negative constants. That is, opening a gap costs G and each symbol in the gap costs H.
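
As a rough illustration of how gap(k) = G + Hk enters the dynamic program, the sketch below computes the minimum alignment cost with three quantities per cell: the best cost overall, the best cost ending in a deletion, and the best cost ending in an insertion. The names `gap` and `gotoh_cost`, the default values G = 2 and H = 1, and the unit mismatch cost are assumptions chosen for the example; this is a cost-only sketch in the spirit of Gotoh's recurrences, not the authors' implementation.

```python
import math

def gap(k, G=2, H=1):
    """Affine cost of a k-symbol indel: G to open, H per symbol."""
    return G + H * k if k > 0 else 0

def gotoh_cost(a, b, G=2, H=1, mismatch=1):
    """Minimum global alignment cost with affine gaps, cost only.

    Keeps a single row of C (best cost) and D (best cost ending in a
    deletion), plus a scalar for I (best cost ending in an insertion).
    """
    m, n = len(a), len(b)
    C = [0] + [gap(j, G, H) for j in range(1, n + 1)]   # row 0
    D = [math.inf] * (n + 1)                            # no deletion in row 0
    for i in range(1, m + 1):
        diag = C[0]                 # C[i-1][0]
        C[0] = gap(i, G, H)         # a[:i] deleted in one gap
        ins = math.inf              # I[i][0] is undefined
        for j in range(1, n + 1):
            # Extend an open deletion, or open a new one (C[j] is still row i-1).
            D[j] = min(D[j], C[j] + G) + H
            # Extend an open insertion, or open a new one (C[j-1] is row i).
            ins = min(ins, C[j - 1] + G) + H
            sub = diag + (0 if a[i - 1] == b[j - 1] else mismatch)
            diag = C[j]             # save C[i-1][j] for the next column
            C[j] = min(D[j], ins, sub)
    return C[n]
```

With these assumed parameters, aligning "AAAA" against "AA" costs 4: a single two-symbol gap (2 + 1·2) is cheaper than two separate one-symbol gaps (3 + 3), which is the behaviour the affine model is meant to capture.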

To reduce memory requirements, Myers and Miller (Myers-Miller) [7] developed a linear-space algorithm for constructing optimal sequence alignments by applying Hirschberg's technique to Gotoh's algorithm. The underlying idea of Myers-Miller is a divide-and-conquer strategy. It finds the "midpoint" of the optimal alignment using a "forward" and a "reverse" application of the linear-space cost-only computation. An optimal alignment can then be retrieved by recursively finding optimal alignments on both sides of this midpoint. The linear-space complexity makes it possible to align large genomic sequences. Myers-Miller is one of the most popular optimal algorithms, with over 1,400 citations on Google Scholar, even though it was developed nearly 30 years ago.
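
The midpoint idea can be sketched with Hirschberg's technique on a simplified cost model. The code below assumes a per-symbol gap cost rather than the affine penalty handled by Myers-Miller (which needs extra bookkeeping for gaps that cross the midpoint), and the names `last_row_cost` and `find_midpoint` are hypothetical. It splits the first sequence in half, runs one forward and one reverse cost-only pass, and picks the column of the second sequence where the two costs sum to the minimum.

```python
def last_row_cost(a, b, mismatch=1, gap=2):
    """Cost of aligning all of `a` with every prefix of `b`.

    Uses O(len(b)) space by keeping only one row of the table.
    Per-symbol gap cost is a simplifying assumption.
    """
    row = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        diag = row[0]               # value from row i-1, column 0
        row[0] = i * gap
        for j in range(1, len(b) + 1):
            sub = diag + (0 if a[i - 1] == b[j - 1] else mismatch)
            diag = row[j]           # save row i-1, column j
            row[j] = min(sub, row[j] + gap, row[j - 1] + gap)
    return row

def find_midpoint(a, b, mismatch=1, gap=2):
    """Return (i, j, cost): an optimal alignment passes through cell (i, j)."""
    mid = len(a) // 2
    n = len(b)
    # Forward pass: first half of `a` against prefixes of `b`.
    fwd = last_row_cost(a[:mid], b, mismatch, gap)
    # Reverse pass: second half of `a` against suffixes of `b`.
    rev = last_row_cost(a[mid:][::-1], b[::-1], mismatch, gap)
    totals = [fwd[j] + rev[n - j] for j in range(n + 1)]
    best_j = min(range(n + 1), key=totals.__getitem__)
    return mid, best_j, totals[best_j]
```

The two subproblems on either side of the returned midpoint, (a[:i], b[:j]) and (a[i:], b[j:]), are independent of each other, which is what makes the recursion a natural candidate for parallel execution.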

Several parallel pairwise alignment algorithms have been proposed, building on developments in high-performance computing platforms, including GPUs (Graphics Processing Units) [8], FPGAs (Field-Programmable Gate Arrays), and multiprocessor and cluster environments [9]. CUDAlign 2.1 is a parallel algorithm that uses a GPU to align huge sequences, executing the SW algorithm combined with Myers-Miller, with linear space complexity [10]. CUDASW++ 2.0 is an enhanced SW protein alignment tool for GPUs based on the single instruction, multiple thread (SIMT) and the virtualized single instruction, multiple data (SIMD) abstractions [11]. An FPGA hardware accelerator architecture was proposed to speed up the SW algorithm, allowing larger DNA sequences to be processed in memory-restricted environments [12]. Two parallel algorithms based on the dot plot algorithm were developed for aligning large genomic sequences in multiprocessor and cluster environments [13].

Although there are many parallel solutions for pairwise alignment, most of them are based on the SW algorithm. In this paper, we present a new approach to accelerating the Myers-Miller algorithm, with the aim of aligning large genomic sequences. We integrate an optimization method into the Myers-Miller algorithm that reduces the search space and the amount of computation. Greater acceleration is further achieved by recursively dividing the whole genomic sequences into small pieces and then aligning them in parallel. To the best of our knowledge, this is the first time that a global alignment has been constructed for whole large genomic sequences.
