Identification of microRNA precursors with the degenerate K-tuple or Kmer strategy (resource, CSDN Library)


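MicroRNAs (miRNAs) are small non-coding RNA molecules that play an important role in the transcriptional and post-transcriptional regulation of gene expression. Their abnormal expression has been observed in many cancers and other disease states, which implies that miRNAs are deeply involved in these diseases, particularly in carcinogenesis. For both basic research and miRNA-based therapy, it is therefore important to distinguish real miRNA precursors (pre-miRNAs) from false ones, such as hairpin sequences with similar stem-loop structures.

Most existing methods for identifying pre-miRNAs represent each RNA sample as a vector built from its K-tuple (Kmer) components. The Kmers must be very short, however; otherwise the dimension of the vector becomes extremely large, leading to the so-called "high-dimension disaster" or overfitting problem. Against this background, the paper proposes the degenerate K-tuple, or degenerate Kmer, strategy for identifying pre-miRNAs. The degenerate Kmer is a simplified representation that lowers the dimension by reducing combinatorial complexity, so the model keeps the important features while avoiding the high-dimension problem.

To understand how the strategy works, start from the Kmer itself. In bioinformatics, a Kmer is a subsequence of length K extracted from a DNA, RNA, or protein sequence. A degenerate Kmer allows a set of possible nucleotides at some positions of the Kmer instead of a single fixed nucleotide, which helps reduce the dimension and improve the efficiency of pattern recognition.

Using this representation, the authors developed a new predictor that identifies miRNA quickly and effectively. The degenerate Kmer features represent the sequence characteristics of pre-miRNAs better and allow the algorithm to capture the long-range effects that matter for miRNA maturation and function. Long-range effects are interactions between elements that lie far apart in the nucleotide sequence and that may influence the secondary structure and function of the miRNA.

The paper also describes a web server (deKmer web-server) through which researchers can submit their RNA sequences and have pre-miRNAs identified by the predictor. Such an online tool is especially useful for researchers who lack the computing resources to run complex models.

Biologically, a pre-miRNA is a hairpin-shaped precursor that is processed into a mature miRNA. Identifying real pre-miRNAs is essential for understanding their specific roles in the cell, particularly for miRNA-based therapy. Because miRNAs play a key role in regulating gene expression, accurately separating the precursors that yield functional miRNAs from other hairpin-like sequences is valuable for future medical research and therapy development.

In summary, the degenerate K-tuple or Kmer strategy is a novel approach to identifying pre-miRNAs. It improves model performance by lowering the dimension through degeneracy, and the web server makes the predictor easily accessible to researchers. The approach may help advance the use of miRNAs in disease diagnosis and treatment.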


Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy

Bin Liu (a,b,c), Longyun Fang, Shanyi Wang, Xiaolong Wang (a,b), Hongtao Li, Kuo-Chen Chou (c,e)

a School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China

b Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China

c Gordon Life Science Institute, Boston, MA 0478, USA

d Wendeng Marine Environmental Monitoring Station, Station Oceanic Administration Wendeng, Weihai, Shandong, China

e Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia

HIGHLIGHTS

• microRNA (miRNA) plays an important role in gene expression.

• Identification of real pre-miRNAs is important for miRNA-based therapy.

• A novel predictor was developed for fast and effective identification of miRNA.

article info

Article history: Received 16 July 2015; received in revised form 21 August 2015; accepted 24 August 2015; available online 9 September 2015.

Keywords: MicroRNA precursor; True pre-miRNA; False pre-miRNA; Degenerate Kmer; deKmer web-server; Long-range effect

abstract

The microRNA (miRNA), a small non-coding RNA molecule, plays an important role in transcriptional and post-transcriptional regulation of gene expression. Its abnormal expression, however, has been observed in many cancers and other disease states, implying that the miRNA molecules are also deeply involved in these diseases, particularly in carcinogenesis. Therefore, it is important for both basic research and miRNA-based therapy to discriminate the real pre-miRNAs from the false ones (such as hairpin sequences with similar stem-loops). Most existing methods in this regard are based on the strategy in which RNA samples are formulated as vectors formed by their Kmer components. But the length of Kmers must be very short; otherwise, the vector's dimension would be extremely large, leading to the “high-dimension disaster” or overfitting problem. Inspired by the concept of “degenerate energy levels” in quantum mechanics, we introduced the “degenerate Kmer” (deKmer) to represent RNA samples. By doing so, we can not only accommodate long-range coupling effects but also avoid the high-dimension problem. Rigorous jackknife tests and cross-species experiments indicated that our approach is very promising. It has not escaped our notice that the deKmer approach can also be applied to many other areas of computational biology. A user-friendly web-server for the new predictor has been established at http://bioinformatics.hitsz.edu.cn/miRNA-deKmer/, by which users can easily get their desired results.

Journal of Theoretical Biology 385 (2015) 153–159

http://dx.doi.org/10.1016/j.jtbi.2015.08.025

Journal homepage: www.elsevier.com/locate/yjtbi

Correspondence to: Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen 518055, China. Tel.: +86 0755 2603 3283.

E-mail addresses: bliu@insun.hit.edu.cn (B. Liu), dragoncloudest@gmail.com (L. Fang), wangshanyiwsy@gmail.com (S. Wang), wangxl@insun.hit.edu.cn (X. Wang), lht760@126.com (H. Li), kcchou@gordonlifescience.org (K.-C. Chou).

1. Introduction

MicroRNAs (miRNAs) are small, single-stranded, non-coding RNAs (ncRNAs) that play important roles in gene regulation by targeting messenger RNAs (mRNAs) for cleavage or translational repression (Fig. 1). Their lengths are about 17–25 nt (Lopes et al., 2014). The miRNAs are also involved in many important biological processes, such as affecting the stability and translation of mRNAs and negatively regulating gene expression in post-transcriptional processes. Therefore, it is fundamentally important to distinguish the real pre-miRNAs from the false ones. Unfortunately, it is difficult to detect miRNAs from a genome in a timely and systematic manner with traditional experimental techniques (Xuan et al., 2011). Facing the avalanche of genome sequences generated in the postgenomic age, it is imperative to develop computational methods (Li et al., 2010) for detecting miRNAs according to their sequence information alone.

At present, the most successful computational approaches in this field use the Kmer composition to represent RNA samples (Wei et al., 2014). But the Kmers that are practically useful in this area are shorter than 6 nucleobases. This is because any longer Kmers would require extremely high-dimension vectors to represent the statistical samples (Chen et al., 2014b, 2014a; Lin et al., 2014), leading to the “high-dimension disaster” (Wang et al., 2008) or overfitting problem, which would significantly reduce the deviation tolerance or cluster-tolerant capacity (Chou, 1999) and thus lower the success rate of prediction. However, miRNAs can vary from 17 to 25 nucleobases in length. Therefore, the Kmer approach can only represent the short-range or local information of miRNA sequences, not their long-range or global information. In particular, most pre-miRNAs have characteristic stem-loop hairpin structures (Xue et al., 2005). In view of this, novel approaches are needed to relax the aforementioned limitation on the length of Kmers for miRNA sequences. The present study was initiated in an attempt to address these problems.
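The dimension problem described above can be made concrete with a small Python sketch. It is illustrative only and is not the authors' code: the function names kmer_composition and kmer_dimension are hypothetical, and normalizing the counts by the number of sliding windows is an assumption. A Kmer vector over the four RNA nucleobases has 4^K components, so the dimension grows from 4096 at K = 6 to 65,536 at K = 8.

```python
from itertools import product
from typing import Dict

ALPHABET = "ACGU"  # the four RNA nucleobases


def kmer_dimension(k: int) -> int:
    """Number of distinct Kmers of length k, i.e. the vector dimension."""
    return len(ALPHABET) ** k


def kmer_composition(seq: str, k: int) -> Dict[str, float]:
    """Return the normalized frequency of every possible Kmer in seq.

    Hypothetical helper: frequencies are counts divided by the number
    of sliding windows (an assumption, not taken from the paper).
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need 1 <= k <= len(seq)")
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with non-ACGU symbols are skipped
            counts[kmer] += 1
    return {kmer: c / windows for kmer, c in counts.items()}


if __name__ == "__main__":
    for k in (2, 4, 6, 8):
        print(k, kmer_dimension(k))
    vector = kmer_composition("ACGUACGU", 2)
    print(len(vector), vector["AC"])
```

Because every possible Kmer gets its own component, even a long sequence leaves most components at zero once K is large, which is the sparsity behind the overfitting concern raised in the text.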

2. Methods

2.1. Benchmark dataset

The benchmark dataset $\mathbb{S}$ used in this study can be formulated as

$$\mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \qquad (1)$$

where the positive subset $\mathbb{S}^{+}$ contains pre-miRNA samples only, which were extracted from the latest version of miRBase (release 21: June 2014). Furthermore, the CD-HIT software (Li and Godzik, 2006; Li et al., 2009) was used to make sure that none of the pre-miRNA samples included in $\mathbb{S}^{+}$ has ≥ [...] pairwise sequence identity to any other. By doing so, we finally obtained 1612 pre-miRNA samples for the positive subset $\mathbb{S}^{+}$. The negative subset $\mathbb{S}^{-}$ also contained 1612 samples, which were randomly picked from the 8489 false pre-miRNAs in (Xue et al., 2005). Again, none of the negative samples included in $\mathbb{S}^{-}$ has ≥ [...] pairwise sequence identity to any other. Since the most stringent cutoff threshold for DNA sequences by CD-HIT is 75%, to the best of our knowledge, the aforementioned benchmark dataset is so far the most stringent and largest benchmark dataset constructed for studying the prediction of pre-miRNAs.

Also, as pointed out in a comprehensive review (Chou and Shen, 2007), there is no need to separate a benchmark dataset into a training dataset and a testing dataset if a prediction method is to be validated by the jackknife or subsampling (K-fold) cross-validation, since the outcome thus obtained is actually from a combination of many different independent dataset tests.
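As a rough illustration of the jackknife (leave-one-out) test mentioned above, the Python sketch below holds out each sample in turn, trains on the remaining samples, and scores the held-out one. The nearest-centroid classifier is only a stand-in chosen to keep the example self-contained; the classifier actually used in the paper is not described in this section, and all function names here are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]
Model = Callable[[Vector], int]


def train_nearest_centroid(xs: List[Vector], ys: List[int]) -> Model:
    """Stand-in classifier: predict the label of the closest class centroid."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x: Vector) -> int:
        def squared_distance(label: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        return min(centroids, key=squared_distance)

    return predict


def jackknife_accuracy(xs: List[Vector], ys: List[int],
                       train: Callable[[List[Vector], List[int]], Model]) -> float:
    """Leave each sample out once, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if model(xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


if __name__ == "__main__":
    features = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    labels = [0, 0, 0, 1, 1, 1]
    print(jackknife_accuracy(features, labels, train_nearest_centroid))
```

Every sample is used for testing exactly once and never appears in its own training set, which is why no separate test set needs to be carved out of the benchmark dataset.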

The benchmark dataset $\mathbb{S}$ as well as its subsets $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$, along with the corresponding detailed sequences, are given in Supporting information S1.

As pointed out in Chou (2011) and concurred in a series of recent publications (see, e.g., Chen et al., 2012; Min and Xiao, 2013; Xiao et al., 2013a, 2015; Xu et al., 2013b, 2014b; Liu et al., 2014a, 2015a; Qiu et al., 2014, 2015; Jia et al., 2015), one of the keys to successfully developing a sequence-based statistical predictor is how to formulate the sequence samples concerned with an effective mathematical expression that can truly capture their intrinsic correlation with the target to be predicted. Below we address this problem.

2.2. Use degenerate Kmer composition to represent RNA samples

Suppose an RNA sequence $\mathbf{R}$ with $L$ nucleobases (nitrogenous bases or nucleic acid residues); i.e.,

$$\mathbf{R} = B_1 B_2 B_3 B_4 B_5 B_6 B_7 \cdots B_L \qquad (2)$$

where

$$B_i \in \{\text{A (adenine)},\ \text{C (cytosine)},\ \text{G (guanine)},\ \text{U (uracil)}\} \qquad (3)$$

denotes the nucleobase at sequence position $i\ (= 1, 2, \cdots, L)$.

The most straightforward method to represent an RNA sample is just using its entire nucleobase sequence as shown in Eq. (2). In order to identify whether the RNA sample belongs to pre-miRNA or false pre-miRNA, one may use various sequence-similarity search tools, such as BLAST (Altschul et al., 1997; Schaffer et al., 2001), to search an RNA database for those sequences that have high sequence similarity to the query RNA sample R. Subsequently, the attributes of the RNAs thus found are used to deduce the attribute concerned for R. Unfortunately, this kind of straightforward sequential model, although quite intuitive and without missing any of the sample's information, fails to work when the query has no significant sequence similarity to any RNA of known attributes. To overcome such a difficulty, one has to consider using non-sequential or discrete vector models to formulate RNA samples. Actually, another important reason to embrace vector models is that all the existing computational algorithms can only handle vectors but not sequences, as elaborated in a recent paper (Chou, 2015).

Here we propose a completely different vector model to represent RNA samples, as described below.

First of all, formulating the RNA sequence of Eq. (2) according to its secondary structure derived from the Vienna RNA software package (release 2.1.6) (Hofacker, 2003), we have

$$\mathbf{R} = \Psi_1 \Psi_2 \Psi_3 \Psi_4 \Psi_5 \Psi_6 \Psi_7 \cdots \Psi_L \qquad (4)$$

where $\Psi_1$ denotes the secondary structure state of $B_1$, $\Psi_2$ the structure state of $B_2$, and so forth. They can be any of the following seven structure states; i.e.,

$$\Psi_i \in \{\text{A},\ \text{C},\ \text{G},\ \text{U},\ \text{A–U},\ \text{G–C},\ \text{U–G}\} \qquad (5)$$

where A, C, G, U represent the structure states of the four unpaired nucleobases, while A–U, G–C, U–G represent the structure states of the three paired bases. Note that, in order to reduce computational

Fig. 1. MicroRNAs (miRNAs) are small, single-stranded, non-coding RNAs (ncRNAs) that play important roles in gene regulation by targeting messenger RNAs (mRNAs) for cleavage or translational repression.
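The seven-state representation of Eqs. (4) and (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the secondary structure is assumed to be supplied in dot-bracket notation (the format produced by tools such as RNAfold from the Vienna RNA package), paired states are assumed to ignore which partner comes first (so A paired with U and U paired with A both map to A-U), and the function names are hypothetical.

```python
from typing import List, Optional

# Assumption: pair labels are order-insensitive, giving three paired states.
PAIR_STATES = {
    frozenset("AU"): "A-U",
    frozenset("GC"): "G-C",
    frozenset("GU"): "U-G",
}


def pair_partners(structure: str) -> List[Optional[int]]:
    """For each position, return the index of its pairing partner or None."""
    partners: List[Optional[int]] = [None] * len(structure)
    stack: List[int] = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket structure")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unexpected structure symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced dot-bracket structure")
    return partners


def structure_states(seq: str, structure: str) -> List[str]:
    """Map each nucleobase to one of the seven structure states of Eq. (5)."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    partners = pair_partners(structure)
    states = []
    for i, base in enumerate(seq):
        if base not in "ACGU":
            raise ValueError(f"unexpected nucleobase: {base!r}")
        j = partners[i]
        if j is None:
            states.append(base)  # unpaired: the state is the base itself
        else:
            pair = frozenset((base, seq[j]))
            if pair not in PAIR_STATES:
                raise ValueError(f"unsupported base pair: {base}-{seq[j]}")
            states.append(PAIR_STATES[pair])
    return states


if __name__ == "__main__":
    print(structure_states("GCAAAGU", "((...))"))
```

In this toy hairpin the two outer positions on each side form the stem and the three central adenines form the loop, so the stem positions receive paired states and the loop positions keep their unpaired base labels.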


