Abstract—Next-generation sequencing (NGS) platforms generate
shorter reads, higher coverage, and higher throughput than
Sanger sequencing. These low-cost technologies can produce
deep coverage of most species, including mammals, within a
few days in a single run. The sequence data produced by one of
these instruments consist of millions or billions of sequence
reads ranging from 50 to 150 nt in length. These short reads must
be assembled de novo before further genome analysis can begin.
Unfortunately, genome assembly remains a difficult problem for
many reasons, especially the short read length and complex
repeat structures that are longer than the reads. Many assembly
algorithms and software packages have appeared recently, but
most of them are powerless against repeats, especially identical
ones, and cannot produce a unique assembly result from exactly
the same input data. How to obtain a unique and stable assembly
result when the input data set contains repeats longer than the
reads is becoming a key issue. From this perspective, we propose
a genome assembly method based on dynamic overlap that obtains
a unique result regardless of which randomly selected read it
starts from and that can resolve high-similarity repeats whose
length is hundreds of times the read length. More importantly,
we use single-end data, rather than paired-end information, to
resolve high-similarity repeats.
I. INTRODUCTION
Over the past twenty years, genome sequencing technologies
have made great progress in many aspects, such as speed, cost,
and coverage. Automated Sanger sequencing is regarded as the
first-generation genome sequencing technology; it can read
through more than 1000 base pairs (1000-2000 bp). The newer
methods that followed are referred to as Next-Generation
Sequencing (NGS) technologies. Currently available commercial
NGS platforms include Roche 454, Illumina/HiSeq, ABI/SOLiD,
and Helicos/Heliscope [1]. NGS machines can now sequence the
whole human genome in a few days, and this capability has
inspired a flood of new projects that aim to sequence a wide
variety of animals and plants [2][3]. NGS is characterized by
highly parallel operation, higher yield, simpler operation,
shorter reads, and much lower cost [4].
This paper is supported by the National Nature Science Fund
(Grant: 61174163).
S. Lian and X. Dai are with the School of Information Science and
Technology, Sun Yat-sen University, Guangzhou, China
(phone: +86-020-39943331; e-mail: shuai_lian@qq.com).
How to assemble a complete genome from these short reads
efficiently and accurately is therefore becoming an urgent
problem. Assembly algorithms are challenged for the following
reasons.
• Complex repeat structure of genomes, especially identical
repeats that are longer than the reads [5]. Complex repeats
include interspersed repeats, tandem repeats, short
interspersed nuclear elements (SINEs), and long interspersed
nuclear elements (LINEs).
• A fundamental limitation of NGS lies in the technology
itself: the reads are short, far shorter than even the
smallest genome. Shorter reads deliver less information per
read, which makes assembling a genome more difficult. Shorter
reads also require higher coverage to satisfy the minimum
overlap criteria, and higher coverage increases the complexity
and intensity of the computation, particularly for larger
genome data sets.
• Sequencing errors make repeat resolution more difficult.
When errors are present, an assembler cannot reliably tell
whether a difference between reads is real genome variation
or was caused by a sequencing error.
• Non-uniform coverage of the genome during sequencing is
another obstacle to whole-genome assembly. Low coverage can
introduce gaps in the assembly, and coverage bias decreases
the confidence of repeat resolution based on statistical
tests of coverage [5].
• The computational complexity of processing large data
sets, such as the human genome or metagenomic data, is also
difficult to deal with. Computing overlaps between pairs of
reads is the most time-consuming stage in genome assembly, so
for large genomes assembly algorithms require
high-performance computing platforms. Whole-genome shotgun
(WGS) assembly is currently almost impossible on a personal
computer.
• And so on.
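The overlap computation and the repeat problem named in the list above can be illustrated with a small sketch. It is not the method proposed in this paper. The function names `suffix_prefix_overlap` and `successors`, the `min_overlap` threshold, and the toy reads are assumptions made purely for illustration. The naive scan over all candidate reads also hints at why overlap computation dominates running time on real data.

```python
def suffix_prefix_overlap(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def successors(read, reads, min_overlap=3):
    """All reads that can extend `read` to the right, with overlap lengths."""
    found = []
    for other in reads:
        if other == read:
            continue  # skip the read itself
        n = suffix_prefix_overlap(read, other, min_overlap)
        if n > 0:
            found.append((other, n))
    return found


# Hypothetical toy reads: "ATTACA" lies entirely inside a repeat, and the
# two other reads come from the two different copies of that repeat.
toy_reads = ["ATTACA", "TTACAT", "TTACAG"]
candidates = successors("ATTACA", toy_reads)
```

In this toy case `candidates` holds two reads with the same overlap length, so overlap length alone cannot decide which copy of the repeat the assembly should follow. That is the ambiguity that arises when a repeat is longer than the reads.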
To date, there are tens of genome assembly algorithms. These
NGS assemblers can be divided into three categories, all of
which are based on computing overlaps or k-mers. The
Overlap/Layout/Consensus (OLC) methods rely on overlap
graphs. The de Bruijn graph (DBG) methods use a k-mer graph
as another form of overlap. The greedy graph methods may
combine OLC and DBG. Typical examples are SSAKE [6],
SHARCGS [7], ABySS [8], ALLPATHS [9], SOAPdenovo [10], and
Velvet [11]. However, the efficiency and accuracy of these
current assembly algorithms differ from one another [12] for
many reasons.
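As a rough illustration of the k-mer graph idea behind the DBG category, the sketch below builds a de Bruijn graph from reads and lists the nodes with more than one outgoing edge. It is a minimal teaching example, not the implementation used by any of the assemblers cited above. The function names `de_bruijn_graph` and `branching_nodes`, the value of `k`, and the toy reads are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read; each k-mer is one edge
        # from its (k-1)-base prefix to its (k-1)-base suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one outgoing edge, i.e. ambiguous extensions."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


# Hypothetical toy reads that share the prefix "ATTACA" and then diverge.
toy_graph = de_bruijn_graph(["ATTACAT", "ATTACAG"], k=4)
ambiguous = branching_nodes(toy_graph)
```

Here the node `ACA` has two successors, `CAT` and `CAG`. In a k-mer graph, sequence shared between different genomic contexts collapses into shared nodes, and the branch marks the point where a traversal has to choose between them.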
Most of these assemblers cannot deal well with repeats longer
than the reads, even high-similarity repeats with variations.
Furthermore, their assembly results are not stable, especially
when identical repeats and high-similarity repeats are
confounded. From this perspective, we propose a new genome
assembly method based on dynamic overlap, called
DOAssembler (Dynamic Overlap Assembly). DOAssembler can
resolve high-similarity repeats of up to 99% similarity with a
read length of 100 bp and obtains a unique assembly result
from different starting reads by using
A New Genome Assembly Method Based On Dynamic Overlap
Shuaibin Lian and Xianhua Dai
Third International Conference on Information Science and Technology
March 23-25, 2013; Yangzhou, Jiangsu, China
978-1-4673-2764-0/13/$31.00 ©2013 IEEE