RESEARCH Open Access

iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data

Kun Sun1,2, Xiaona Chen1,3, Peiyong Jiang1,2, Xiaofeng Song, Huating Wang1,3*, Hao Sun1,2*

From ISCB-Asia 2012

Shenzhen, China. 17-19 December 2012

Abstract

Background: Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and potent gene regulators. High-throughput RNA-sequencing combined with de novo assembly promises large-scale discovery of novel transcripts. However, the identification of lincRNAs from thousands of assembled transcripts is still challenging due to the difficulty of separating them from protein coding transcripts (PCTs).

Results: We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of lincRNAs. iSeeRNA shows better performance compared to other software. A publicly available web server for iSeeRNA is also provided for small datasets.

Conclusions: iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus offering a valuable tool for lincRNA study.

Background

Over the past decade, evidence from numerous high-throughput genomic platforms has revealed that even though less than 2% of the mammalian genome encodes proteins, a significant fraction can be transcribed into different complex families of non-coding RNAs (ncRNAs) [1-4]. Other than microRNAs and other families of small non-coding RNAs, long non-coding RNAs (lncRNAs, >200 nt) are emerging as potent regulators of gene expression [5]. Long intergenic non-coding RNAs (lincRNAs), a subtype of lncRNAs originally identified by Guttman et al. [6] from four mouse cell types using chromatin state maps, are discrete transcriptional units located between known protein-coding loci. Recent studies demonstrate the functional significance of lincRNAs. However, it remains a daunting task to identify all the lincRNAs present in various biological processes and systems.

Whole transcriptome sequencing, known as RNA-Seq, offers the promise of rapid and comprehensive discovery of novel genes and transcripts [7]. With de novo assembly software such as Cufflinks [8] and Scripture [6], a large set of novel assemblies can be obtained from RNA-Seq data. Several programs have been used to facilitate the cataloging of lincRNAs from RNA-Seq assemblies. For example, Li et al. [9] used the Codon Substitution Frequency (CSF) score [10] to identify lincRNAs from de novo assembled transcripts in chicken skeletal muscle. Pauli et al. [11] took advantage of the PhyloCSF score [12], followed by other filtering steps, to identify lincRNAs expressed during zebrafish embryogenesis. Cabili et al. [13] also used the PhyloCSF program to eliminate de novo assembled transcripts with positive coding potential and identified ~8200 lincRNA loci in 24 human tissues. However, the extremely long computation time demanded by PhyloCSF may become the bottleneck when handling millions of assemblies generated from high-throughput sequencing. Furthermore, neither CSF nor PhyloCSF provides publicly available tools that can be readily integrated into the lincRNA identification workflow. Therefore, ab initio reconstruction of a reliable set of lincRNAs through computational methods remains a daunting task. There is an urgent need for a standalone tool that can accurately and quickly identify lincRNAs in extremely large datasets. Previous studies showed that supervised machine learning methods, especially the Support Vector Machine (SVM), may represent a potential solution for accurate identification of lincRNAs and protein coding gene transcripts (PCTs). For example, CONC (Coding Or Non-Coding) [14], CPC (Coding Potential Calculator) [15], and PORTRAIT [16] have been developed to discriminate between PCTs and ncRNAs in general. However, the performance of these programs is largely dependent on the dataset; for instance, CONC is slow when analyzing large datasets [15], which may limit its usefulness in transcriptome data analysis. CPC works well with known PCTs but may tend to classify novel PCTs as lincRNAs if they have not been recorded in the protein databases used by CPC [15]. PORTRAIT was specifically designed for neglected species such as fungi [16]. Moreover, the performance of these programs on the identification of lincRNAs has not been evaluated.

* Correspondence: xfsong@nuaa.edu.cn; huating.wang@cuhk.edu.hk; haosun@cuhk.edu.hk
Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Full list of author information is available at the end of the article

Sun et al. BMC Genomics 2013, 14(Suppl 2):S7
http://www.biomedcentral.com/1471-2164/14/S2/S7

Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this study, we present iSeeRNA, a new SVM-based classifier available as a standalone tool. It demonstrated high accuracy with balanced sensitivity and specificity on both lincRNA and PCT datasets. It also runs several orders of magnitude faster than other programs, thus representing an ideal tool for lincRNA identification from transcriptome sequencing data.

Methods

Standard input file formats

To be compatible with de novo assembly software such as Cufflinks and Scripture, which use the GTF/GFF or BED file formats, we set these three formats as the default input file formats for iSeeRNA. This allows easy integration of iSeeRNA into the transcriptome data analysis workflow. Detailed information about the file formats can be found at the UCSC Genome Browser (http://genome.ucsc.edu/FAQ/FAQformat.html).
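To illustrate how such input can be consumed, the following minimal Python sketch parses one 12-column BED record and computes the spliced transcript length from its exon blocks. The `Transcript` type, the `parse_bed12` function and the example record are hypothetical and are not taken from iSeeRNA; only the column layout follows the BED description on the UCSC page cited above.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    chrom: str
    start: int         # BED chromStart (0-based)
    end: int           # BED chromEnd (exclusive)
    name: str
    strand: str
    exon_sizes: tuple  # one size in nt per exon block

    @property
    def spliced_length(self) -> int:
        """Length of the mature transcript: the sum of its exon blocks."""
        return sum(self.exon_sizes)


def parse_bed12(line: str) -> Transcript:
    """Parse one tab-separated BED12 record into a Transcript."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected 12 BED columns, got {len(fields)}")
    # blockSizes (column 11) is comma-separated and may end with a comma.
    sizes = tuple(int(s) for s in fields[10].split(",") if s)
    if len(sizes) != int(fields[9]):
        raise ValueError("blockCount does not match blockSizes")
    return Transcript(fields[0], int(fields[1]), int(fields[2]),
                      fields[3], fields[5], sizes)


# Made-up example record with three exons of 100, 200 and 150 nt.
record = "chr1\t1000\t5000\ttx1\t0\t+\t1000\t5000\t0\t3\t100,200,150,\t0,1500,3850,"
tx = parse_bed12(record)
print(tx.name, tx.spliced_length)  # prints: tx1 450
```

Summing the exon blocks rather than using `end - start` gives the length of the spliced transcript, which is the quantity a length cutoff on transcripts would normally refer to.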

SVM settings

To build the SVM models for iSeeRNA, we used the LIBSVM (version 3.11) implementation [17] with the Radial Basis Function (RBF) kernel, which was shown to be the best kernel for this task [15]. During training, the SVM was set up as a binary classifier with the two classes being lincRNAs (positive set) and PCTs (negative set). Optimized SVM parameters C and gamma were obtained by running the accompanying grid.py script on 5,000 randomly selected instances from the training dataset. To obtain the best-performing model, 10-fold cross-validation was used. In addition, two models were trained and tested separately using species-specific datasets for human and mouse, respectively.
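The parameter search described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the LIBSVM grid.py code: the function names are hypothetical, the exponent ranges are assumed to mirror the default ranges of grid.py (the paper does not state which ranges were used), and `cv_accuracy` is a placeholder callback that would train an RBF-kernel SVM and return its cross-validation accuracy.

```python
import random
from itertools import product


def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint folds covering 0..n-1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def grid_search(instances, cv_accuracy, sample_size=5000, seed=0,
                log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Return (best_score, C, gamma) over an exponential (C, gamma) grid.

    `instances` must be a sequence (e.g. a list).
    `cv_accuracy(subset, C, gamma)` is a hypothetical callback; the exponent
    ranges are an assumption, not values reported in the paper.
    """
    rng = random.Random(seed)
    # Tune on a random subsample to keep the search affordable.
    subset = rng.sample(instances, min(sample_size, len(instances)))
    best = None
    for exp_c, exp_g in product(log2c, log2g):
        c, gamma = 2.0 ** exp_c, 2.0 ** exp_g
        score = cv_accuracy(subset, c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best
```

A callback passed as `cv_accuracy` could use `k_fold_indices(len(subset), k=10)` to split the subsample, train on each training part, and average the accuracy over the ten held-out folds.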

PhyloCSF and CPC settings

iSeeRNA was benchmarked against two other classification programs: PhyloCSF and CPC. These two programs were installed locally and executed with default parameters. For PhyloCSF, a score of 0 was used as the classification cutoff. For CPC, UniRef90 [18] was employed as the protein database and the default classification model developed by its authors was used.

Performance measurements

To evaluate performance, we calculated accuracy (sensitivity or specificity) and the Matthews Correlation Coefficient (MCC) [19], an indicator used in machine learning to measure the quality of binary (two-class) classification, and generated Receiver Operating Characteristic (ROC) curves.
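For readers unfamiliar with ROC curves, the following short sketch shows one common way to obtain the curve and the area under it from classifier scores. It is a generic illustration, not the procedure used in the paper: the function names are hypothetical, and it assumes that higher scores indicate the positive class.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) obtained by sweeping the score threshold.

    labels: True for the positive class; higher scores mean "more positive".
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all instances sharing this score before emitting a point,
        # so tied scores contribute a single segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative yields an area of 1.0, while random ranking is expected to give about 0.5.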

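The following equations were used for calculating sensitivity, specificity and MCC:

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1} \]

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2} \]

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3} \]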

where TP, FP, TN and FN are the numbers of true positives (lincRNAs predicted to be non-coding), false positives (PCTs predicted to be non-coding), true negatives (PCTs predicted to be coding) and false negatives (lincRNAs predicted to be coding), respectively.
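Equations (1)-(3) translate directly into code. The sketch below is a minimal illustration with a hypothetical function name and made-up counts; returning 0.0 for the MCC when its denominator is zero is a common convention assumed here, not something stated in the paper.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity and MCC from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one instance")
    sensitivity = tp / (tp + fn)   # Equation (1)
    specificity = tn / (tn + fp)   # Equation (2)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Equation (3); 0.0 when undefined is an assumed convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Made-up counts: 100 lincRNAs and 100 PCTs, 10 errors in each class.
print(classification_metrics(tp=90, fp=10, tn=90, fn=10))
```

With these counts the sensitivity and specificity are both 0.9 and the MCC is 0.8, which shows how the MCC summarizes both error types in a single value between -1 and 1.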

Results

Gold-standard datasets

The quality of the training data is critical for building an accurate SVM model. To obtain a pool of high-quality lincRNAs and PCTs as gold-standard datasets (Figure 1), we collected lincRNAs and PCTs annotated as either "known" or "novel" from the Human and Vertebrate Analysis and Annotation (HAVANA) project (http://vega.sanger.ac.uk/index.html) [20]. These lincRNA annotations were manually curated and supported by experimental evidence such as spliced cDNAs and ESTs, thus providing an ideal source of lincRNAs. We further filtered the data by transcript length (> 200 nt). Next, for lincRNAs, we eliminated those transcripts that were annotated as PCTs by RefSeq [21]; similarly, for PCTs, we only kept
