Abstract—Next-generation sequencing (NGS) platforms generate shorter reads at higher coverage and higher throughput than Sanger sequencing. These low-cost technologies can produce deep coverage of most species, including mammals, within a few days in a single run. The sequence data produced by one of these instruments consist of millions or billions of sequence reads ranging from 50 to 150 nt in length. These short reads must be assembled de novo before further genome analysis can begin. Unfortunately, genome assembly remains a difficult problem for many reasons, chiefly the short read length and complex repeat structures that are longer than the reads. Many assembly algorithms and software packages have appeared recently, but most of them handle repeats poorly, especially identical repeats, and cannot produce a unique assembly result from exactly the same input data. How to obtain a unique and stable assembly result when the input data set contains repeats longer than the reads is becoming a key issue. From this perspective, we propose a genome assembly method based on dynamic overlap that obtains a unique result regardless of which randomly selected read it starts from, and that can resolve highly similar repeats whose length is hundreds of times the read length. More importantly, we use only single-end data, without paired-end information, to resolve highly similar repeats.
I. INTRODUCTION
Over the past twenty years, genome sequencing technologies have made great progress in many respects, such as speed, cost and coverage. Automated Sanger sequencing is regarded as the first-generation genome sequencing technology; it can read through more than 1000 base pairs (1000-2000 bp). The newer methods that followed are referred to as Next-Generation Sequencing (NGS) technologies. Currently available commercial NGS platforms include Roche 454, Illumina/HiSeq, ABI/SOLiD and Helicos/Heliscope [1]. NGS machines can now sequence a whole human genome in a few days, and this capability has inspired a flood of new projects that aim to sequence a wide variety of animals and plants [2][3]. NGS is characterized by highly parallel operation, higher yield, simpler operation, shorter reads and much lower cost [4].
This paper is supported by National Nature Science Fund (Grant: 61174163). S. Lian and X. Dai are with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China (phone: +86-020-39943331; e-mail: shuai_lian@qq.com).

How to assemble a complete genome from these short reads efficiently and accurately is therefore becoming an urgent problem. Assembly algorithms are challenged for the following reasons.
• Complex repeat structure of genomes, especially identical repeats that are longer than the reads [5]. Complex repeats include interspersed repeats, tandem repeats, short interspersed nuclear elements (SINE) and long interspersed nuclear elements (LINE).
• A fundamental limitation of NGS technologies is the read length itself: the reads are much shorter than Sanger reads and far shorter than even the smallest genome. Shorter reads deliver less information per read, which makes assembling a genome more difficult. Shorter reads also require higher coverage to satisfy the minimum overlap criteria, and higher coverage increases the complexity and intensity of the computation on the resulting larger data sets.
• Furthermore, sequencing errors make repeat resolution more difficult. In the presence of errors, an assembler cannot reliably tell whether a difference between reads is real genomic variation or a sequencing error.
• Non-uniform coverage of the genome during sequencing is another obstacle to whole-genome assembly. Low coverage can induce gaps in the assembly, and coverage bias decreases the confidence of repeat resolution based on coverage statistical tests [5].
• The computational complexity of processing large data sets, such as the human genome or metagenomic samples, is also difficult to deal with. Computing overlaps between pairs of reads is the most time-consuming stage of genome assembly, so for large genomes assembly algorithms require high-performance computing platforms. Whole-genome shotgun (WGS) assembly on a personal computer is currently almost impossible.
• And so on.
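To make the cost of overlap computation concrete, the following is a minimal sketch of exact suffix-prefix overlap detection with a minimum overlap threshold. It is a generic illustration, not the method proposed in this paper; the function names, the toy reads and the threshold value are assumptions chosen for the example.

```python
def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest exact match is shorter than `min_overlap`.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_pairs_overlaps(reads, min_overlap):
    """Naive all-against-all comparison: every ordered pair of reads is tested."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_overlap)
                if length:
                    overlaps[(i, j)] = length
    return overlaps


# Hypothetical toy reads, not data from the paper.
reads = ["ACGTAC", "GTACGG", "CGGTTA"]
print(all_pairs_overlaps(reads, min_overlap=3))
# prints {(0, 1): 4, (1, 2): 3}
```

The nested loop tests every ordered pair of reads, so the work grows quadratically with the number of reads. With millions or billions of reads this is why overlap computation dominates the running time and why practical assemblers rely on indexing rather than a direct all-pairs scan.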
To date, tens of genome assembly algorithms have been published. These NGS assemblers can be divided into three categories, all based on computing overlaps or k-mers. Overlap/Layout/Consensus (OLC) methods rely on an overlap graph. De Bruijn Graph (DBG) methods use a k-mer graph as another form of overlap. Greedy graph methods may combine OLC and DBG. Typical examples are SSAKE [6], SHARCGS [7], ABySS [8], ALLPATHS [9], SOAPdenovo [10] and Velvet [11]. However, the efficiency and accuracy of current assembly algorithms differ from one another [12] for many reasons.
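The k-mer graph idea, and the way a repeat longer than the reads creates ambiguity in it, can be sketched as follows. This is a generic textbook-style de Bruijn construction, not the implementation of any assembler cited above; the toy genome, read length and value of k are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return the nodes with more than one successor, i.e. ambiguous extensions."""
    return {node: sorted(succ) for node, succ in graph.items() if len(succ) > 1}


# Hypothetical toy genome: the 7-nt repeat GATTACA occurs twice with different flanks.
genome = "TC" + "GATTACA" + "GG" + "GATTACA" + "TG"
read_len = 5  # shorter than the repeat, so no read spans a whole copy
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

graph = de_bruijn_graph(reads, k=4)
print(branching_nodes(graph))
# prints {'ACA': ['CAG', 'CAT']}
```

Both copies of the repeat collapse onto the same path in the graph. At the node `ACA`, where the repeat ends, the walk can continue into either flank, and with reads shorter than the repeat nothing in the graph indicates which continuation belongs to which copy.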
Most of these assemblers cannot deal well with repeats longer than the reads, even highly similar repeats that contain variations. Furthermore, their assembly results are not stable, especially when identical repeats and highly similar repeats are confounded. From this perspective, we propose a new genome assembly method based on dynamic overlap, called DOAssembler (Dynamic Overlap Assembly). DOAssembler can resolve highly similar repeats of up to 99% similarity with a read length of 100 bp and obtain a unique assembly result from different starting reads by using
A New Genome Assembly Method Based On Dynamic Overlap
Shuaibin Lian and Xianhua Dai
Third International Conference on Information Science and Technology
March 23-25, 2013; Yangzhou, Jiangsu, China
978-1-4673-2764-0/13/$31.00 ©2013 IEEE