RESEARCH Open Access
iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data
Kun Sun^1,2, Xiaona Chen^1,3, Peiyong Jiang^1,2, Xiaofeng Song^4*, Huating Wang^1,3*, Hao Sun^1,2*
From ISCB-Asia 2012
Shenzhen, China. 17-19 December 2012
Abstract
Background: Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and potent gene regulators. High-throughput RNA sequencing combined with de novo assembly promises large-scale discovery of novel transcripts. However, identifying lincRNAs among thousands of assembled transcripts remains challenging because they are difficult to separate from protein-coding transcripts (PCTs).
Results: We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of lincRNAs. iSeeRNA shows better performance than other software. A publicly available web server for iSeeRNA is also provided for small datasets.
Conclusions: iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus offering a valuable tool for lincRNA study.
Background
Over the past decade, evidence from numerous high-throughput genomic platforms has revealed that, although less than 2% of the mammalian genome encodes proteins, a significant fraction can be transcribed into different complex families of non-coding RNAs (ncRNAs) [1-4]. In addition to microRNAs and other families of small non-coding RNAs, long non-coding RNAs (lncRNAs, >200 nt) are emerging as potent regulators of gene expression [5]. Long intergenic non-coding RNAs (lincRNAs), a subtype of lncRNAs originally identified by Guttman et al. [6] in four mouse cell types using chromatin state maps, are discrete transcriptional units located between known protein-coding loci. Recent studies demonstrate the functional significance of lincRNAs. However, it remains a daunting task to identify all the lincRNAs present in various biological processes and systems.
Whole transcriptome sequencing, known as RNA-Seq, offers the promise of rapid and comprehensive discovery of novel genes and transcripts [7]. With de novo assembly software such as Cufflinks [8] and Scripture [6], a large set of novel assemblies can be obtained from RNA-Seq data. Several programs have been used to facilitate the cataloging of lincRNAs from RNA-Seq assemblies. For example, Li et al. [9] used the Codon Substitution Frequency (CSF) score [10] to identify lincRNAs from de novo assembled transcripts in chicken skeletal muscle. Pauli et al. [11] took advantage of the PhyloCSF score [12], followed by other filtering steps, to identify lincRNAs expressed during zebrafish embryogenesis. Cabili et al. [13] also used the PhyloCSF program to eliminate de novo assembled transcripts with positive coding potential and identified ~8200 lincRNA loci in 24 human tissues. However, the extremely long computation times demanded by PhyloCSF may become the bottleneck for handling millions of assemblies generated from high-throughput sequencing. Furthermore, neither CSF nor PhyloCSF provides publicly available tools that can be readily integrated into the lincRNA identification workflow. Therefore, ab initio reconstruction of a reliable set of lincRNAs through computational methods remains a daunting task. There is an urgent need for a standalone tool that can accurately and quickly distinguish lincRNAs in extremely large datasets.
Previous studies showed that supervised machine learning methods, especially the Support Vector Machine (SVM), may represent a potential solution for accurately distinguishing lincRNAs from protein coding gene transcripts (PCTs). For example, CONC (Coding Or Non-Coding) [14], CPC (Coding Potential Calculator) [15], and PORTRAIT [16] have been developed to discriminate between PCTs and ncRNAs in general. However, the performance of these programs is largely dependent on the dataset. For instance, CONC is slow when analyzing large datasets [15], which may limit its usefulness in transcriptome data analysis. CPC works well with known PCTs but may tend to classify novel PCTs as lincRNAs if they have not been recorded in the protein databases used by CPC [15]. PORTRAIT was specifically designed for neglected species such as fungi [16]. Moreover, the performance of these programs on the identification of lincRNAs has not been evaluated.
* Correspondence: xfsong@nuaa.edu.cn; huating.wang@cuhk.edu.hk; haosun@cuhk.edu.hk
^1 Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
^4 Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Full list of author information is available at the end of the article
Sun et al. BMC Genomics 2013, 14(Suppl 2):S7
http://www.biomedcentral.com/1471-2164/14/S2/S7
© 2013 Sun et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this study, we present iSeeRNA, a new SVM-based classifier available as a standalone tool. It demonstrated high accuracy and balanced sensitivity and specificity on both lincRNA and PCT datasets. It also outperforms other programs by running several orders of magnitude faster, thus representing an ideal tool for lincRNA identification from transcriptome sequencing data.
Methods
Standard input file formats
To be compatible with de novo assembly software such as Cufflinks and Scripture, which use the GTF/GFF or BED file formats, we set these three formats as the default input file formats for iSeeRNA. This allows easy integration of iSeeRNA into the transcriptome data analysis workflow. Detailed information about the file formats can be found at the UCSC Genome Browser (http://genome.ucsc.edu/FAQ/FAQformat.html).
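As an illustration of what a BED-formatted transcript carries, the following minimal Python sketch reads one 12-column BED line as described in the UCSC format FAQ and derives the spliced transcript length from the exon block sizes. It is not iSeeRNA's own parser; the names `BedTranscript` and `parse_bed12_line` and the example record are hypothetical.

```python
from typing import List, NamedTuple


class BedTranscript(NamedTuple):
    """Subset of the BED12 fields needed to describe a spliced transcript."""
    chrom: str
    start: int            # chromStart, 0-based
    end: int              # chromEnd, exclusive
    name: str
    strand: str
    block_sizes: List[int]  # exon lengths in nucleotides

    @property
    def spliced_length(self) -> int:
        # Transcript length after splicing is the sum of the exon block sizes.
        return sum(self.block_sizes)


def parse_bed12_line(line: str) -> BedTranscript:
    """Parse one tab-separated BED12 line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")
    block_count = int(fields[9])
    # blockSizes is a comma-separated list that may end with a trailing comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    if len(block_sizes) != block_count:
        raise ValueError("blockCount does not match blockSizes")
    return BedTranscript(fields[0], int(fields[1]), int(fields[2]),
                         fields[3], fields[5], block_sizes)


# Made-up two-exon transcript used only for demonstration.
example = "chr1\t1000\t5000\ttx1\t0\t+\t1000\t5000\t0\t2\t150,100,\t0,3900,"
transcript = parse_bed12_line(example)
print(transcript.name, transcript.spliced_length)
```

The spliced length computed this way is the quantity that a length cutoff such as > 200 nt would be applied to.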
SVM settings
To build SVM models for iSeeRNA, we used the LIBSVM (version 3.11) implementation [17] with the radial basis function (RBF) kernel, which was shown to be the best kernel for this task [15]. During training, the SVM was set up as a binary classifier with the two classes being lincRNAs (positive set) and PCTs (negative set). Optimized values of the SVM parameters C and gamma were obtained using the accompanying grid.py script with 5,000 randomly selected instances from the training dataset. To obtain the best-performing model, 10-fold cross-validation was used. In addition, two models were trained and tested separately using species-specific datasets for human and mouse, respectively.
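The parameter search described above can be sketched in Python as an exhaustive grid over exponentially spaced C and gamma values, each scored by k-fold cross-validation. This is a minimal sketch, not the grid.py script itself: the exponent ranges (log2 C from -5 to 15 in steps of 2, log2 gamma from 3 to -15 in steps of -2) are assumed here to mirror grid.py's usual defaults, since the ranges actually used are not stated, and `evaluate` is a hypothetical callback that would train and test an SVM on the given index lists.

```python
import random
from itertools import product


def log2_range(begin, end, step):
    """Return 2**e for e = begin, begin + step, ... up to and including end."""
    values = []
    e = begin
    while (step > 0 and e <= end) or (step < 0 and e >= end):
        values.append(2.0 ** e)
        e += step
    return values


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(n_instances, evaluate, k=10, seed=0):
    """Return (C, gamma, mean score) with the best cross-validated score.

    evaluate(train_idx, test_idx, C, gamma) is a hypothetical callback that
    fits a classifier on train_idx and returns its accuracy on test_idx.
    """
    folds = kfold_indices(n_instances, k, seed)
    # Build each train/test split once and reuse it for every parameter pair.
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))

    c_values = log2_range(-5, 15, 2)       # assumed default range for C
    gamma_values = log2_range(3, -15, -2)  # assumed default range for gamma
    best = None
    for C, gamma in product(c_values, gamma_values):
        scores = [evaluate(tr, te, C, gamma) for tr, te in splits]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[2]:
            best = (C, gamma, mean_score)
    return best
```

In practice `evaluate` would wrap an SVM library call with an RBF kernel; keeping it as a callback leaves the sketch independent of any particular library.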
PhyloCSF and CPC settings
iSeeRNA was benchmarked against two other classification programs: PhyloCSF and CPC. These two programs were installed locally and executed with default parameters. For PhyloCSF, a score of 0 was used as the classification cutoff. For CPC, UniRef90 [18] was employed as the protein database and the default classification model developed by its authors was used.
Performance measurements
To evaluate performance, accuracy (sensitivity or specificity) and the Matthews Correlation Coefficient (MCC) [19], an indicator used in machine learning to measure the quality of binary (two-class) classification, were calculated, and Receiver Operating Characteristic (ROC) curves were generated.
The following equations were used for calculating sensitivity, specificity and MCC:
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{2}$$
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3}$$
where TP, FP, TN and FN are the numbers of true positives (lincRNAs predicted to be non-coding), false positives (PCTs predicted to be non-coding), true negatives (PCTs predicted to be coding) and false negatives (lincRNAs predicted to be coding), respectively.
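These measures can be computed directly from the four counts. The Python sketch below follows the definitions above, using the standard form of MCC in which the denominator is the square root of the four-term product; the function names and the example counts are illustrative and not taken from the paper's results.

```python
import math


def confusion_counts(truth, predicted, positive="lincRNA"):
    """Count TP, FP, TN, FN with lincRNA as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1   # lincRNA predicted to be non-coding
            else:
                fp += 1   # PCT predicted to be non-coding
        else:
            if t == positive:
                fn += 1   # lincRNA predicted to be coding
            else:
                tn += 1   # PCT predicted to be coding
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 if the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for demonstration only.
tp, fp, tn, fn = 90, 5, 95, 10
print(sensitivity(tp, fn), specificity(tn, fp), round(mcc(tp, fp, tn, fn), 4))
```

Returning 0.0 when the denominator vanishes is a common convention for the degenerate case in which one class is never observed or never predicted; it is a choice of this sketch, not something the text specifies.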
Results
Gold-standard datasets
The quality of the training data is critical for building an accurate SVM model. To obtain a pool of high-quality lincRNAs and PCTs as gold-standard datasets (Figure 1), we collected lincRNAs and PCTs annotated as either "known" or "novel" from the Human and Vertebrate Analysis and Annotation (HAVANA) project (http://vega.sanger.ac.uk/index.html) [20]. These lincRNA annotations were manually curated and supported by experimental evidence such as spliced cDNAs and ESTs, thus providing an ideal source of lincRNAs. We further filtered the data by transcript length (> 200 nt). Next, for lincRNAs, we eliminated those transcripts that were annotated as PCTs by RefSeq [21]; similarly, for PCTs, we only kept