Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs


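Protein-DNA interactions are an important topic in bioinformatics, both for basic research and for drug development. With protein sequences accumulating rapidly in the post-genomic era, accurately predicting DNA-binding residues from protein sequences has become a key challenge for annotating protein function and designing new drugs. The paper introduces a new predictor, TargetDNA, which predicts protein-DNA binding residues from the primary sequence alone.

TargetDNA is built on two base features: the protein's evolutionary information and its predicted solvent accessibility. Evolutionary information reflects sequence conservation, which is often associated with functionally important residues. Predicted solvent accessibility describes how exposed a residue is on the protein surface, which matters when the protein interacts with DNA. To decide how much each feature should count, TargetDNA uses a centered linear kernel alignment algorithm to learn the two feature weights and then combines the features using those weights.

Because DNA-binding residues are far fewer than non-binding residues, the data are severely class-imbalanced. The authors therefore apply random under-sampling to the original dataset to balance the two classes. On the weighted, combined feature, multiple initial predictors are trained with a support vector machine (SVM) as the classifier, and these initial predictors are then ensembled by boosting into the final predictor. Boosting is an ensemble technique that iteratively combines a series of relatively weak classifiers into a stronger one. The experimental results show that TargetDNA achieves high prediction performance and outperforms many existing sequence-based predictors of protein-DNA binding residues. The TargetDNA web server and datasets are freely available for academic use at ***.

Key terms and concepts used in the paper:

- Protein-DNA interaction: an indispensable cellular process involved in DNA replication, transcription, splicing, and repair.
- Feature weighting: assigning each feature a weight according to its importance for the prediction, in order to improve model accuracy.
- Kernel alignment: a measure of similarity between two kernel matrices, used here to score how well the kernel computed from a feature agrees with the target labels.
- Ensemble learning: methods, boosting among them, that combine the predictions of several models to obtain more accurate and robust results.
- Class imbalance: one class (here, DNA-binding residues) has far fewer samples than the other (non-binding residues), which can bias a model toward the majority class and degrade prediction performance.
- Support vector machine (SVM): a widely used classification algorithm that separates classes by finding an optimal hyperplane in feature space.
- Random under-sampling: a preprocessing technique that reduces the number of majority-class samples to ease class imbalance.

The TargetDNA study illustrates how algorithmic and computational methods can be applied to a biological problem, and it offers ideas and tools for follow-up work on binding-residue prediction.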


Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs

Jun Hu, Yang Li, Ming Zhang, Xibei Yang, Hong-Bin Shen, and Dong-Jun Yu

Abstract: Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Correctly locating DNA-binding residues solely from protein sequences is an important but challenging task for protein function annotations and drug discovery, especially in the post-genomic era where large volumes of protein sequences have quickly accumulated. In this study, we report a new predictor, named TargetDNA, for targeting protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights for weightedly combining the two features. Based on the weightedly combined feature, multiple initial predictors with SVM as classifiers are trained by applying a random under-sampling technique to the original dataset, the purpose of which is to cope with the severe imbalance phenomenon that exists between the number of DNA-binding and non-binding residues. The final ensembled predictor is obtained by boosting the multiple initially trained predictors. Experimental simulation results demonstrate that the proposed TargetDNA achieves a high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.

Index Terms: Protein-DNA binding residues, kernel alignment, feature weighting, classifier ensemble, imbalance learning

1 INTRODUCTION

Interactions between proteins and DNA are indispensable for biological activities and play important roles in a wide variety of biological processes [1], [2], [3], such as DNA replication, transcription, splicing, and repair. Hence, accurately locating the protein-DNA binding residues is of great importance for both analyzing protein function and designing novel drugs [4]. Much effort has been made to uncover the intrinsic mechanism of protein-DNA interactions [5], [6], and a number of high-throughput experimental technologies have been developed to confirm the interactions between DNA and proteins, such as protein binding microarray (PBM) [7], ChIP-Seq [8], and protein microarray assays [9]. However, the identification of protein-DNA binding residues via experimental technologies is often cost-intensive and time-consuming. Due to the importance of protein-DNA interactions and the difficulty in experimentally identifying DNA-binding residues, together with the fact that a huge number of unannotated protein sequences have quickly accumulated, the development of computational methods for the fast prediction of protein-DNA binding residues solely from sequences has become a hot topic in bioinformatics [1], [5], [10].

In the last decade, a series of computational methods have emerged for predicting DNA-binding residues, which have been well characterized by Si et al. [1] and Miao et al. [11]. These existing methods can be grouped into the following three main categories according to the base features used: sequence-based methods [10], [12], structure-based methods [13], [14], and hybrid methods [15] that utilize both the sequence and structural information.

It is undeniable that the prediction accuracies of structure-based and hybrid methods often exceed those of sequence-based methods [15], likely because structure-based features are more effective than sequence-based features at expressing the differences between DNA-binding and non-binding residues [15]. Many structure-based features, such as the B-factor, surface curvature and depth index (DPX), have been successfully exploited to characterize DNA-binding residues [15]. However, the applicability of structure-based and hybrid methods is limited in the common scenario where only the sequence of a given protein target is known and no corresponding 3D structure is available. Although several homology modeling tools, such as MODELLER [16] and I-TASSER [17], have been developed and demonstrated as feasible tools for modeling 3D structure from a given protein sequence, discrepancies between the predicted structure and the actual structure still exist, particularly for proteins that do not fit a structural template [18]. Furthermore, with ever-evolving gene-sequencing technologies, the gap between protein sequences and structures continues to widen. Therefore, sequence-based computational methods for predicting DNA-binding residues are more practical and economical, and are urgently needed.

- J. Hu and Y. Li are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei Road, Nanjing, 210094, China. E-mail: junh_cs@126.com, balllee2011@sina.com.
- M. Zhang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei Road, Nanjing, 210094, China, and the School of Computer Science and Engineering, Jiangsu University of Science and Technology, 2 Huancheng Road, Zhenjiang, 212003, China. E-mail: zhangming@just.edu.cn.
- X. Yang is with the School of Computer Science and Engineering, Jiangsu University of Science and Technology, 2 Huancheng Road, Zhenjiang, 212003, China. E-mail: yangxibei@hotmail.com.
- H.B. Shen is with the Department of Automation, Shanghai Jiao Tong University, and the Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China. E-mail: hbshen@sjtu.edu.cn.
- D.-J. Yu is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei Road, Nanjing, 210094, China, and the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing, 210094, P.R. China. E-mail: njyudj@njust.edu.cn.

Manuscript received 14 Mar. 2016; revised 24 Aug. 2016; accepted 7 Oct. 2016. Date of publication 11 Oct. 2016; date of current version 6 Dec. 2017.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TCBB.2016.2616469

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 14, NO. 6, NOVEMBER/DECEMBER 2017 1389

1545-5963 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Compared to structure-based methods, sequence-based methods can quickly predict DNA-binding residues without using protein structure information. During the past decade, a number of machine-learning algorithms have been used to predict DNA-binding residues from protein sequences, and a series of sequence-based predictors have been developed, including BindN [10], DP-Bind [12], BindN+ [19], MetaDBsite [6], and DNABR [20], among others. These sequence-based predictors often utilize only protein sequence information and recognize DNA-binding residues with one or more machine-learning algorithms, such as support vector machine (SVM) [21] or random forest (RF) [22]. For example, in BindN [10], the prediction models are constructed by SVM with three sequence features, including the pKa value of the side chain, the hydrophobicity index, and the molecular mass of an amino acid. In DP-Bind [12], three machine-learning algorithms, including SVM, kernel logistic regression, and penalized logistic regression, are integrated to predict DNA-binding residues based on the profile of evolutionary conservation of a query protein sequence in the form of a position-specific scoring matrix (PSSM) [23]. Wong et al. [24] proposed and described a computational approach, which takes into account both protein sequence and DNA information, for learning the specificity-determining residue-nucleotide interactions of different known DNA-binding domain families. In addition, Wong et al. [25] developed an HMM-based approach using belief propagations (named kmerHMM), which accepts and pre-processes PBM raw data into median-binding intensities of individual k-mers to identify DNA motifs. Despite the promising results of these methods, there remains room for further improvements in accurately predicting DNA-binding residues from protein sequences.

Another important issue that warrants careful consideration for developing machine-learning-based predictors of protein-DNA binding residues is the severe intrinsic class imbalance: the number of DNA-binding residues (minority class) is significantly smaller than that of non-binding residues (majority class). Sample rescaling is the most straightforward strategy for dealing with the issue of class imbalance [26], [27]. In this strategy, over-sampling and under-sampling are the two most commonly used implementations. As demonstrated in previous work [26], [27], [28], over-sampling will obtain an enlarged training dataset and thus will inevitably increase the training and predicting time. In addition, over-sampling may also lead to a potential over-fitting problem. On the other hand, under-sampling can obtain a more compact training dataset but comes with the risk of losing data. In view of this, in this study, we address the class imbalance by integrating under-sampling with an appropriate boosting ensemble algorithm. More specifically, we train multiple different classifiers on balanced datasets obtained by applying random under-sampling (RUS); then, these trained classifiers are ensembled with a boosting procedure.
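The under-sampling step described above can be sketched as follows. This is a minimal illustration only: the function name `random_under_sample`, the 1:1 class ratio, the number of subsets, and the toy labels are all assumptions, since this part of the paper does not state the exact settings. Each subset keeps every minority-class sample and draws a different random set of majority-class samples, so several classifiers trained on different subsets together see more of the majority class than any single one does.

```python
import random

def random_under_sample(labels, n_subsets, seed=0):
    """Return n_subsets lists of sample indices, each balanced 1:1.

    labels: list of 0/1 class labels (1 = minority class, e.g. binding).
    The 1:1 ratio is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_subsets):
        # Keep all minority samples; draw an equal number of majority
        # samples without replacement (requires len(neg) >= len(pos)).
        sampled_neg = rng.sample(neg, len(pos))
        subsets.append(sorted(pos + sampled_neg))
    return subsets

# Toy labels with a 1:14 imbalance, similar in spirit to PDNA-543.
labels = [1] * 5 + [0] * 70
subsets = random_under_sample(labels, n_subsets=4)
for s in subsets:
    print(len(s), sum(labels[i] for i in s))
```

Each index list would then be used to train one base classifier. How the resulting classifiers are boosted is not specified in this section, so it is left out of the sketch.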

In view of the issues mentioned above, we propose a sequence-based predictor, named "TargetDNA", for the computational identification of DNA-binding residues. First, we employed the protein evolutionary information and the predicted solvent accessibility, which are determined solely from protein sequences, as two base features (refer to Section 2.2 for details). Next, to further quantify the difference between DNA-binding and non-binding residues, we utilized a centered linear kernel target alignment algorithm to learn the weights for weightedly combining the two features. Then, based on the weightedly combined feature, we trained multiple DNA-binding residue predictors with SVM as the base classifier by applying the RUS technique to the original imbalanced dataset. Finally, we obtained the ensembled predictor by using a boosting ensemble algorithm. We also created an online web server of TargetDNA, which is freely accessible for academic use at http://csbio.njust.edu.cn/bioinf/TargetDNA/.
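To make the feature-weighting step more concrete, the sketch below computes the centered alignment between a linear kernel built from each feature block and the target kernel built from the labels, then normalizes the alignment scores into weights. It is a sketch under stated assumptions, not the authors' procedure: the function names, the toy feature matrices `evo` and `acc`, the use of labels in {+1, -1}, the rule "weight proportional to alignment", and scaling each block by its weight before concatenation are all illustrative choices that this section of the paper does not confirm.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: Kc = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Frobenius inner product of the centered kernels, normalized to [-1, 1]."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    denom = np.linalg.norm(K1c) * np.linalg.norm(K2c)  # Frobenius norms
    return float(np.sum(K1c * K2c) / denom) if denom > 0 else 0.0

def alignment_weights(feature_blocks, y):
    """Weight each feature block by its alignment with the target kernel y y^T."""
    y = np.asarray(y, dtype=float)
    K_target = np.outer(y, y)
    # Linear kernel per feature block: K = F F^T.
    scores = np.array([centered_alignment(F @ F.T, K_target) for F in feature_blocks])
    scores = np.clip(scores, 0.0, None)
    total = scores.sum()
    if total == 0:
        return np.full(len(scores), 1.0 / len(scores))  # fall back to equal weights
    return scores / total

# Hypothetical toy data: six residues, labels +1 (binding) / -1 (non-binding).
y = np.array([1, 1, 1, -1, -1, -1])
evo = np.array([[1.0, 0.9], [0.9, 1.1], [1.1, 1.0],
                [-1.0, -0.9], [-0.9, -1.1], [-1.1, -1.0]])    # tracks the labels closely
acc = np.array([[1.0], [-1.0], [1.0], [-1.0], [1.0], [-1.0]])  # only weakly related

weights = alignment_weights([evo, acc], y)
combined = np.hstack([weights[0] * evo, weights[1] * acc])  # weighted concatenation
print(weights)
```

In this toy case the block that tracks the labels receives the larger weight, which is the behavior the weighting is meant to produce.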

2 METHODS

2.1 Benchmark Datasets

We constructed a dataset of 7,186 DNA-binding protein chains, which had clear target annotations in the Protein Data Bank (PDB) [29] before October 10, 2015. After removing the redundant sequences using CD-HIT software [30], a total of 584 non-redundant protein sequences were obtained such that no two sequences had more than 30 percent identity. Then, we divided the non-redundant sequences into two parts, the training dataset (PDNA-543) and the independent test dataset (PDNA-TEST). PDNA-543 consists of 543 protein sequences, which were all released into the PDB before October 10, 2014. PDNA-TEST includes 41 protein chains, which were all released into the PDB after October 10, 2014. More specifically, there are 9,549 DNA-binding residues (i.e., positive samples) and 134,995 non-binding residues (i.e., negative samples) in PDNA-543. PDNA-TEST consists of 734 positive samples and 14,021 negative samples. Table 1 summarizes the detailed compositions of PDNA-543 and PDNA-TEST.

2.2 Feature Representation

From the point of view of machine learning, the prediction of protein-DNA binding residues is a traditional binary classification problem. Thus, when training a machine-learning-based prediction model, encoding protein-DNA binding residues with discriminative features is one of the most crucial steps. Various effective sequence-based features, such as PSSM [12], predicted secondary structure [5],

TABLE 1
Composition of the Training and Independent Validation Datasets

Dataset      No. of Sequences   numP    numN      Ratio
PDNA-543     543                9,549   134,995   14.137
PDNA-TEST    41                 734     14,021    19.102

numP represents the number of positive samples.
numN represents the number of negative samples.
Ratio = numN / numP.
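As a quick check on Table 1, the Ratio column can be recomputed from the sample counts. The snippet below is only a worked arithmetic example; the dictionary layout and variable names are illustrative and not part of the paper.

```python
# (numP, numN) per dataset, copied from Table 1.
datasets = {
    "PDNA-543": (9549, 134995),
    "PDNA-TEST": (734, 14021),
}

# Ratio = numN / numP, rounded to three decimals as in the table.
ratios = {name: round(num_n / num_p, 3) for name, (num_p, num_n) in datasets.items()}

# Share of residues that are DNA-binding, as a percentage.
positive_share = {
    name: round(100 * num_p / (num_p + num_n), 1)
    for name, (num_p, num_n) in datasets.items()
}

print(ratios)          # imbalance ratio per dataset
print(positive_share)  # percentage of positive samples per dataset
```

Both datasets contain well under 10 percent positive samples, which is why the imbalance has to be handled explicitly during training.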
