没有合适的资源?快使用搜索试试~ 我知道了~
A Parallel Pairwise Alignment with Pruning for Large Genomic Seq...
0 下载量 181 浏览量
2021-02-07
03:28:36
上传
评论
收藏 181KB PDF 举报
温馨提示
Pairwise sequence alignment is a common and fundamental task in Computational Biology, which constitutes the basis for many Bioinformatics applications. In the post-genomic era, there is an increasing demand to align long DNA sequences to discover their functions. In this paper, we propose a parallel pairwise alignment algorithm for large genomic sequences by recursively dividing the whole genomic sequences into small pieces, with an effective pruning strategy to reduce search and computation sp
资源推荐
资源详情
资源评论
A Parallel Pairwise Alignment with Pruning for
Large Genomic Sequences
Xiangyuan Zhu
∗
, Bing Li
†
, Kenli Li
‡
, Ping Shao
∗
, Yi Pan
†
∗
School of Computer Science, Zhaoqing University, Zhaoqing 526061, China
†
Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
‡
College of Information Science and Engineering, Hunan University, Changsha 410082, China
Email: hnzxy@hnu.edu.cn
Abstract—Pairwise sequence alignment is a common and
fundamental task in Computational Biology, which constitutes the
basis for many Bioinformatics applications. In the post-genomic
era, there is an increasing demand to align long DNA sequences
to discover their functions. In this paper, we propose a parallel
pairwise alignment algorithm for large genomic sequences by
recursively dividing the whole genomic sequences into small
pieces, with an effective pruning strategy to reduce search and
computation space. We implemented rigorous tests on a 4-core
computer using real genomic sequences and artificially generated
sequences. The results show that our implementation can achieve
speedup 10.64 with 99.75% accuracy compared to the sequential
algorithm. As far as we know, this is the first time that MBP
(mega base-pairs) sequences are globally aligned with an affine
gap penalty.
I. INTRODUCTION
Pairwise sequence alignment is a fundamental operation in
bioinformatics. Its typical task is to identify similar regions
between two biological sequences such as proteins and DNA
sequences, which is an effective way to infer the evolutionary,
functional and structural relationships between the two.
There are two categories in the pairwise alignment: global
and local alignment. NW (Needleman and Wunsch) [1] is
an optimal global alignment which applies dynamic program-
ming scheme to construct an exact end-to-end alignment for
two sequences. SW (Smith-Waterman) [2] is a famous local
alignment algorithm which only focuses on producing similar
regions between two sequences. Although both NW and SW
are capable of retrieving exact solutions for the pairwise
alignment, their time and space complexity are O(m × n),
where m and n are the lengths of the two sequences needed
to be aligned. The quadratic time and space requirements make
it infeasible to align large genomic sequences.
With the decreasing sequencing costs and improvements
in sequencing and longer read technologies over the past 10
years, the amount of data produced every seven months is
continuously doubling [3]. These technological developments
have enabled several large scale sequencing efforts including
the Human Microbiome Project (HMP) [4], and Genomic
Encyclopedia of Bacteria and Archaea [5]. Currently, there
are 258,075 organisms in GOLD (Genomes Online Database).
The growing number of them makes aligning large genomic
sequences imperative in computational biology.
Gotoh [6] gave an algorithm to solve the pairwise align-
ment problems using affine gap penalties rather than regular
gap penalties, resulting an improved alignment quality. For-
mally, the affine function is specified as gap(k) = G + Hk
for the cost of a k-symbol indel (insert or deletion), where G
and H are non-negative constants. It means that opening up a
gap costs G and each symbol in the gap costs H.
To reduce memory space requirements, Myers and
Miller (Myers-Miller) [7] developed a linear-space algorithm
for constructing optimal sequence alignments by applying
Hirschberg’s technique to Gotoh’s algorithm. The underlying
idea of Myers-Miller is divide-and-conquer strategy. It finds
the ‘midpoint’ of the optimal alignment using a “forward”
and a “reverse” application of the linear space cost-only. Then
an optimal alignment can be retrieved by recursively finding
optimal alignment on both sides of this midpoint. The linear-
space complexity makes it possible to align large genomic
sequences. Myers-Miller is one of the most popular optimal
algorithm with citation of over 1,400 in the Google Scholar,
although it had been developed for nearly 30 years.
There are several parallel pairwise alignment algorithms
proposed based on the developments of high performance
computing platforms, including GPU (Graphical Processing
Unit) [8], FPGA (Field- Programmable Gate Array), and
multiprocessor and cluster environments [9]. CUDAlign 2.1
is a parallel algorithm that uses GPU to align huge sequences,
executing the SW algorithm combined with Myers-Miller,
with linear space complexity [10]. CUDASW++ 2.0 is an
enhanced SW protein alignment on GPUs based on the single
instruction, multiple thread (SIMT) and the virtualized single
instruction, multiple data (SIMD) abstraction [11]. A FPGA
hardware accelerator architecture was proposed to speedup the
implementation of SW algorithm which allows the processing
of larger DNA sequences in memory restricted environments
[12]. Two parallel algorithms based on the dot plot algorithm
was developed for aligning large genomic sequences on mul-
tiprocessor and cluster environments[13].
Although there are many parallel solutions for pairwise
alignments, most of them are based on SW algorithm. In this
paper, we present a new approach to accelerating the Myers-
Miller algorithm, with the aim for aligning large genomic se-
quences. We integrate an optimization method into the Myers-
Miller algorithm by reducing search space and calculation.
Greater acceleration is further achieved by recursively dividing
the whole genomic sequences into small pieces and then
aligning them in parallel. To the best of our knowledge, this is
the first time to construct the global alignment for whole large
genomic sequences.
资源评论
weixin_38667920
- 粉丝: 3
- 资源: 909
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- NSArgumentNullException如何解决.md
- VueError解决办法.md
- buvid、did参数生成算法
- tiny-cuda-cnn.zip
- 关于月度总结的PPT模板
- 手表品牌与型号数据集,手表型号数据
- 基于Java实现(IDEA)的贪吃蛇游戏-源码+jar文件+项目报告
- 数字按键3.2考试代码
- 颜色拾取器 for Windows
- 台球检测40-YOLO(v5至v11)、COCO、CreateML、Paligemma、TFRecord、VOC数据集合集.rar
- # 基于MATLAB的导航科学计算库
- Qt源码ModbusTCP 主机客户端通信程序 基于QT5 QWidget, 实现ModbusTCP 主机客户端通信,支持以下功能: 1、支持断线重连 2、通过INI文件配置自定义服务器I
- tesseract ocr 训练相关的环境部署包,包括jdk-8u331-windows-x64.exe、jTessBoxEditorFX-2.6.0.zip 等
- 好用的Linux终端管理工具,支持自定义多行脚本命令,密码保存、断链续接,SFTP等功能
- 大学毕业设计写作与答辩指南:选题、研究方法及PPT制作
- 小偏差线性化模型,航空发动机线性化,非线性系统线性化,求解线性系统具体参数,最小二乘拟合 MATLAB Simulink 航空发动机,非线性,线性,非线性系统,线性系统,最小二乘,拟合,小偏差,系统辨
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功