
Abstract: Next-generation sequencing (NGS) platforms generate shorter reads at higher coverage and higher throughput than Sanger sequencing. These lower-cost technologies can produce deep coverage of most species, including mammals, within a few days in a single run. The data produced by one of these instruments consist of millions or billions of sequence reads ranging from 50 to 150 nt in length. These short reads must be assembled de novo before further genome analysis can begin. Unfortunately, genome assembly remains a difficult problem for many reasons, chiefly the short read length and complex repeat structures that are longer than the reads. Many assembly algorithms and software packages have appeared recently, but most of them struggle with repeats, especially identical ones, and cannot produce a unique assembly result even from exactly the same input data. How to obtain a unique and stable assembly when the input data set contains repeats longer than the reads is becoming a key issue. From this perspective, we propose a genome assembly method based on dynamic overlap that obtains a unique result when started from a randomly selected read and resolves high-similarity repeats whose length is hundreds of times the read length. More importantly, it uses single-end data rather than paired-end information to resolve these repeats.

I. INTRODUCTION

Over the past twenty years, genome sequencing technologies have made great progress in many respects, such as speed, cost, and coverage. Automated Sanger sequencing is regarded as the first-generation genome sequencing technology, with the ability to read through more than 1000 base pairs (1000-2000 bp); the newer methods that followed are referred to as Next-Generation Sequencing (NGS) technologies. Currently available commercial NGS platforms include Roche 454, Illumina/HiSeq, ABI/SOLiD, and Helicos/Heliscope[1]. NGS machines can now sequence a whole human genome in a few days, and this capability has inspired a flood of new projects that aim to sequence a wide variety of animals and plants[2][3]. NGS is characterized by highly parallel operation, higher yield, simpler operation, shorter reads, and much lower cost[4].

How to assemble a complete genome from these short reads efficiently and accurately is therefore becoming an urgent problem. Assembly algorithms are challenged for the following reasons.

• Complex repeat structure of genomes, especially identical repeats that are longer than the reads[5]. Complex repeats include interspersed repeats, tandem repeats, short interspersed nuclear elements (SINEs), and long interspersed nuclear elements (LINEs).

• A fundamental limitation lies in NGS itself: the reads are much shorter than even the smallest genome. Shorter reads deliver less information per read, which makes assembling a genome more difficult. Shorter reads also require higher coverage to satisfy the minimum overlap criteria, and higher coverage increases the complexity and intensity of the computation on the larger genome data.

• Furthermore, sequencing errors make repeat resolution more difficult. In the presence of errors, assemblers cannot reliably tell whether a difference is real genome variation or is caused by a sequencing error.

• Non-uniform coverage of the genome during sequencing is another obstacle to whole-genome assembly. Low coverage can induce gaps in the assembly, and coverage bias decreases the confidence of repeat resolution based on coverage statistical tests[5].

• The computational complexity of processing large data sets, such as human genome or metagenomic data, is also difficult to deal with. Computing overlaps between pairs of reads is the most time-consuming stage in genome assembly, so assembly algorithms for large genome data require high-performance computing platforms. WGS assembly on a personal computer is currently almost impossible.

• And so on.

This paper is supported by the National Natural Science Fund (Grant: 61174163).

S. Lian and X. Dai are with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China (phone: +86-020-39943331; e-mail: shuai_lian@qq.com).
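Two of the challenges listed above, the minimum overlap criterion and the cost of computing overlaps between pairs of reads, can be illustrated with a small sketch. It is not the method proposed in this paper: it only computes exact suffix-prefix overlaps by brute force. The function names, the toy reads, and the threshold `min_overlap=3` are assumptions made for illustration.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(a), len(b))
    for length in range(longest, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads: list[str], min_overlap: int) -> dict[tuple[int, int], int]:
    """Exact overlaps for every ordered pair of distinct reads.

    The number of pairs grows quadratically with the number of reads,
    which is why this stage dominates the running time on large data sets.
    """
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_overlap)
        if length > 0:
            overlaps[(i, j)] = length
    return overlaps


# Toy reads sampled from the made-up sequence "ACGTTGCATG" (illustrative only).
reads = ["ACGTTG", "GTTGCA", "TGCATG"]
print(all_overlaps(reads, min_overlap=3))  # prints {(0, 1): 4, (1, 2): 4}
```

Raising `min_overlap` above 4 removes both overlaps in this toy example, which shows why short reads need enough coverage for neighbouring reads to share a sufficiently long overlap.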

To date, dozens of genome assembly algorithms have been proposed. These NGS assemblers can be divided into three categories, all based on computing overlaps or k-mers. Overlap/Layout/Consensus (OLC) methods rely on overlap graphs. De Bruijn graph (DBG) methods use a k-mer graph as another form of overlap. Greedy graph methods may combine OLC and DBG. SSAKE[6], SHARCGS[7], ABySS[8], ALLPATHS[9], SOAPdenovo[10], and Velvet[11] are typical examples. However, the efficiency and accuracy of these assemblers differ from one another[12] for many reasons.
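As a rough illustration of the k-mer graph idea mentioned above, the following sketch builds a small de Bruijn graph and lists the nodes at which a walk has more than one possible successor. It is a generic textbook construction, not the code of any assembler cited here; the genome string, the read length of 5, and `k=4` are made up for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where a walk must choose a path."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Made-up genome in which the repeat "GGCC" occurs twice with different flanks.
genome = "ATGGCCTAAGGCCGT"
read_length = 5
# Error-free reads starting at every position (an idealised assumption).
reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

graph = de_bruijn_graph(reads, k=4)
print(branching_nodes(graph))  # prints ['GCC']
```

Both copies of the repeat collapse into the same nodes, so the node `GCC` has two successors (`CCT` and `CCG`). No read of length 5 covers the repeat together with both of its flanking bases, so a walk arriving at `GCC` cannot decide locally which successor to take.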

Most of these assemblers cannot deal well with repeats longer than the reads, even high-similarity repeats that contain variations. Furthermore, their assembly results are not stable, especially when identical repeats and high-similarity repeats are confounded. From this perspective, we propose a new genome assembly method based on dynamic overlap, called DOAssembler (Dynamic Overlap Assembly). DOAssembler can resolve high-similarity repeats of up to 99% similarity with a read length of 100 bp and obtain a unique assembly result from different starting reads by using

A New Genome Assembly Method Based On Dynamic Overlap

Shuaibin Lian and Xianhua Dai

Third International Conference on Information Science and Technology

March 23-25, 2013; Yangzhou, Jiangsu, China
