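Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis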


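Based on the file information provided, the key concepts are explained below. The title, "Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis", touches on several core ideas in bioinformatics and machine learning: amino acid sequences, protein-protein interactions (PPIs), ensemble learning, extreme learning machines (ELM), and principal component analysis (PCA). Each is covered in turn:

1. **Amino acid sequence**: A protein is a macromolecule built from amino acids joined by peptide bonds. Amino acids are the building blocks of proteins, and a protein's function and structural properties are largely determined by its sequence. In bioinformatics, amino acid sequences are obtained through various experimental techniques and used to study protein structure, function, and interactions with other molecules.
2. **Protein-protein interactions (PPIs)**: PPIs are the dynamic physical contacts and interactions between proteins inside the cell. They are an essential part of nearly all life processes, including signal transduction, metabolic reactions, and cell division and differentiation. Correctly identifying PPIs is important for understanding biomolecular mechanisms and for disease research.
3. **Ensemble learning**: A method that completes a learning task by building and combining multiple learners. Typically several weak learners are combined in the expectation of better predictive performance than any single learner. The approach uses diverse learners to model different aspects of the same problem and then merges their predictions by voting or averaging.
4. **Extreme learning machine (ELM)**: A fast-learning single-hidden-layer feed-forward neural network. It does not iteratively tune weights and biases as traditional neural networks do, so it learns faster and has lower computational cost. ELM has been reported to perform well on many machine learning tasks.
5. **Principal component analysis (PCA)**: A widely used dimensionality reduction technique that extracts the main sources of variation in the data. It transforms the data into a new coordinate system in which the first few coordinates (the principal components) explain most of the variance. PCA is especially useful for high-dimensional data because it simplifies the data structure while retaining as much of the important information in the original data as possible.

According to the file description, the study proposes a new hierarchical model, PCA-EELM (principal component analysis-ensemble extreme learning machine), that predicts protein-protein interactions using only protein sequence information. The researchers first retrieved 11188 protein pairs from the DIP database and encoded them into feature vectors using four kinds of protein sequence information. To reduce dimensionality, they applied PCA as a feature extraction method to construct the most discriminative new feature set. Multiple extreme learning machines were then trained and aggregated into a consensus classifier by majority voting. Because the ensemble removes the dependence on the initial random weights, prediction performance improves.

The study concludes that, on the PPI data of Saccharomyces cerevisiae, the proposed method achieved 87.00% prediction accuracy, 86.15% sensitivity, and 87.59% precision. Extensive comparison experiments against other state-of-the-art techniques such as the support vector machine (SVM) showed that the proposed PCA-EELM model outperforms the SVM method under 5-fold cross-validation.

The study illustrates the growing combination of machine learning and bioinformatics, and how advanced data analysis techniques can address practical problems in biomedical research. Combining ensemble learning with algorithms such as PCA and ELM improves both the accuracy and the efficiency of PPI prediction, and offers new ideas and methods for future sequence-based PPI prediction.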


PROCEEDINGS Open Access

Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis

Zhu-Hong You1*†, Ying-Ke Lei2†, Lin Zhu, Junfeng Xia, Bing Wang

From The 2012 International Conference on Intelligent Computing (ICIC 2012)

Huangshan, China. 25-29 July 2012

Abstract

Background: Protein-protein interactions (PPIs) play crucial roles in the execution of various cellular processes and form the basis of biological mechanisms. Although a large amount of PPI data for different species has been generated by high-throughput experimental techniques, the PPI pairs obtained with experimental methods so far cover only a fraction of the complete PPI networks, and the experimental methods for identifying PPIs are both time-consuming and expensive. Hence, it is urgent and challenging to develop automated computational methods that predict PPIs efficiently and accurately.

Results: We present a novel hierarchical PCA-EELM (principal component analysis-ensemble extreme learning machine) model that predicts protein-protein interactions using only protein sequence information. In the proposed method, 11188 protein pairs retrieved from the DIP database were encoded into feature vectors using four kinds of protein sequence information. To reduce dimensionality, an effective feature extraction method, PCA, was then employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting. Ensembling the extreme learning machines removes the dependence of the results on the initial random weights and improves prediction performance.

Conclusions: When applied to the PPI data of Saccharomyces cerevisiae, the proposed method achieved 87.00% prediction accuracy with 86.15% sensitivity at a precision of 87.59%. Extensive experiments were performed to compare our method with the state-of-the-art Support Vector Machine (SVM) technique. Experimental results under 5-fold cross-validation demonstrate that the proposed PCA-EELM outperforms the SVM method. In addition, PCA-EELM runs faster than the PCA-SVM based method. Consequently, the proposed approach can be considered a promising and powerful new tool for predicting PPIs with excellent performance in less time.
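The three figures quoted above follow the usual confusion-matrix definitions. As a minimal sketch, assuming the standard formulas (the function name `binary_metrics` and the counts are hypothetical, chosen only to exercise the arithmetic; they are not the paper's data):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall) and precision from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,    # fraction of all pairs classified correctly
        "sensitivity": tp / (tp + fn),    # fraction of true interactions recovered
        "precision": tp / (tp + fp),      # fraction of predicted interactions that are real
    }

# Hypothetical counts, not taken from the paper.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
print(m)
```

In a 5-fold cross-validation setting these values would typically be computed per fold and then averaged.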

* Correspondence: zhuhongyou@gmail.com

† Contributed equally

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China

Full list of author information is available at the end of the article

You et al. BMC Bioinformatics 2013, 14(Suppl 8):S10

http://www.biomedcentral.com/1471-2105/14/S8/S10

Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Proteins are crucial for almost all functions in the cell, including metabolic cycles, DNA transcription and replication, and signalling cascades. Proteins rarely perform their functions alone; instead they cooperate with other proteins by forming a huge network of protein-protein interactions (PPIs) [1]. PPIs are responsible for the majority of cellular functions. In the past decades, many innovative techniques for detecting PPIs have been developed [1-3]. Owing to progress in large-scale experimental technologies such as yeast two-hybrid (Y2H) screens [2,4], tandem affinity purification (TAP) [1], mass spectrometric protein complex identification (MS-PCI) [3] and other high-throughput biological techniques for PPI detection, a large amount of PPI data for different species has been accumulated [1-5]. However, the experimental methods are costly and time-consuming, so the PPI pairs obtained from experiments so far cover only a small fraction of the complete PPI networks [6]. In addition, large-scale experimental methods usually suffer from high rates of both false positive and false negative predictions [6-8]. Hence, it is of great practical significance to develop reliable computational methods to facilitate the identification of PPIs [9-11].

A number of computational methods have been proposed for the prediction of PPIs based on different data types, including phylogenetic profiles, gene neighborhood, gene fusion, literature mining knowledge, and sequence conservation between interacting proteins [6-9,12-15]. There are also methods that combine interaction information from several different data sources [16]. However, these methods cannot be applied if such prior knowledge about the proteins is not available. Recently, methods that derive information directly from the amino acid sequence have attracted particular interest [7-9,11]. Many researchers have worked on sequence-based methods for discovering new PPIs, and the experimental results showed that the information in amino acid sequences alone is sufficient to predict PPIs [7,9,11]. Among them, one of the most notable is an SVM-based method developed by Shen et al. [11]. In that study, the 20 amino acids were clustered into seven classes according to the dipoles and volumes of their side chains, and the conjoint triad method then abstracted the features of protein pairs based on this classification of amino acids. When applied to predicting human PPIs, the method yielded a high prediction accuracy of 83.9%. Because the conjoint triad method cannot take neighboring effects into account, and interactions usually occur in discontinuous amino acid segments of the sequence, Guo et al. developed a method based on SVM and auto covariance to extract the interaction information in the discontinuous amino acid segments [9]. Their method yielded a prediction accuracy of 86.55% when applied to predicting Saccharomyces cerevisiae PPIs. In our previous works, we also obtained good prediction performance by using autocorrelation descriptors and correlation coefficients, respectively [8,17].
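To make the conjoint triad idea concrete, here is a minimal sketch of a triad-frequency encoder. The seven-class grouping below is the one commonly attributed to Shen et al. and should be checked against reference [11]; the function names, the min-max scaling, and the toy sequences are assumptions made for illustration only.

```python
from itertools import product

# Seven amino acid classes (assumed grouping; verify against Shen et al. [11]).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(CLASSES) for aa in group}
TRIADS = ["".join(t) for t in product("0123456", repeat=3)]  # 7^3 = 343 triads


def conjoint_triad(seq):
    """Return 343 min-max scaled triad frequencies for one protein sequence."""
    # Map residues to class labels, skipping non-standard letters.
    classes = [AA_TO_CLASS[aa] for aa in seq.upper() if aa in AA_TO_CLASS]
    counts = dict.fromkeys(TRIADS, 0)
    # Slide a window of three consecutive residues along the sequence.
    for i in range(len(classes) - 2):
        counts["".join(classes[i:i + 3])] += 1
    values = [counts[t] for t in TRIADS]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pair_features(seq_a, seq_b):
    """Concatenate the two proteins' descriptors into one pair vector."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)


# Toy sequences, not real proteins.
vec = pair_features("AGVAGV", "MKRDEC")
print(len(vec))
```

Each protein contributes 343 values, so a pair is represented by a 686-dimensional vector under this encoding.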

The general trend in current studies on predicting PPIs has been to focus on high accuracy without considering the time taken to train the classification models, which should be an important factor in developing a sequence-based method because the total number of possible PPIs is very large. Therefore, some computational models with high classification accuracy may not be satisfactory when the trade-off between classification accuracy and training time is considered. Recently, Huang et al. proposed a new learning algorithm called extreme learning machine (ELM), which randomly assigns all the hidden node parameters of generalized single-hidden layer feed-forward networks (SLFNs) and analytically determines the output weights of SLFNs [18-21]. Previous works have shown that ELM provides efficient unified solutions to generalized feed-forward networks, including kernel learning. Consequently, ELM offers significant advantages such as fast learning speed, ease of implementation, and minimal human intervention. ELM has good potential as a viable alternative technique for large-scale computing and artificial intelligence. On the other hand, a single ELM model sometimes struggles to achieve satisfactory performance for complex processes with strong nonlinearity, time variance and high uncertainty. Ensemble ELM methods have received special attention because they can improve the accuracy of the predictor and achieve better stability by training a set of models and then combining them for the final predictions [22-24]. For example, Lan et al. proposed an ensemble of online sequential ELMs with more stable and accurate results [25]. Zhao et al. proposed an ensemble ELM soft sensing model for effluent quality prediction based on kernel principal component analysis (KPCA), whose reliability and accuracy outperform those of other models [24]. In this study, an ensemble ELM model was built to predict protein interactions.
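The ELM recipe described above (random hidden-node parameters, output weights solved analytically) and the majority-vote ensemble can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the class name `ELM`, the helper `ensemble_predict`, the sigmoid activation, the hidden-layer size, and the toy two-cluster data are all assumptions.

```python
import numpy as np


class ELM:
    """Single-hidden-layer network: random hidden layer, least-squares output."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the randomly parameterised hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Hidden weights and biases are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights via the Moore-Penrose pseudoinverse; labels mapped to -1/+1.
        self.beta = np.linalg.pinv(H) @ (2.0 * y - 1.0)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


def ensemble_predict(models, X):
    """Majority vote over the 0/1 predictions of several ELMs."""
    votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
    return (2 * votes.sum(axis=0) > len(models)).astype(int)


# Toy data: two well-separated clusters (hypothetical, for illustration only).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(-2.0, 0.5, size=(40, 2)),
               data_rng.normal(2.0, 0.5, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# An odd number of models, each with a different random seed, avoids tied votes.
models = [ELM(n_hidden=20, seed=s).fit(X, y) for s in range(5)]
pred = ensemble_predict(models, X)
print("training accuracy:", (pred == y).mean())
```

Because each member differs only in its random hidden layer, the vote averages out the seed-to-seed variation of a single ELM.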

Previous works have pointed out that applying feature selection or feature extraction before the classification task can improve classification accuracy [26]. Here, we examine the effectiveness of dimensionality reduction before constructing the ELM classifier for PPI prediction. Principal component analysis (PCA) is used for feature extraction; it projects the original feature space into a new space, on which the ELM performs the prediction task. The effectiveness of the proposed PCA-ELM is examined in terms of classification accuracy on the PPI dataset. Promisingly, the developed PCA-ELM PPI prediction system achieves high accuracy and runs very fast as well.
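The projection step can be illustrated with a small PCA built on the singular value decomposition. This is a sketch under stated assumptions: the helper names `pca_fit` and `pca_transform`, the number of retained components, and the random input matrix are hypothetical and do not reflect the dimensions used in the paper.

```python
import numpy as np


def pca_fit(X, n_components):
    """Learn the mean and top principal directions of X (rows are samples)."""
    mean = X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # variance ratio per component
    return mean, Vt[:n_components], explained[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the learned principal directions."""
    return (X - mean) @ components.T


# Hypothetical feature matrix: 100 samples with 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

mean, components, explained = pca_fit(X, n_components=3)
Z = pca_transform(X, mean, components)
print(Z.shape, explained)
```

When PCA is combined with cross-validation, the mean and components would normally be fitted on the training folds only and then applied to the held-out fold, so that no test information leaks into the projection.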

In this study, we report a new sequence-based method for the prediction of protein-protein interactions from amino acid sequences with ensemble ELM and PCA, aiming to improve both the efficiency and the accuracy of classification. First, four kinds of useful sequence-based features, namely Auto Covariance (AC), Conjoint triad (CT), Local descriptor (LD) and Moran autocorrelation (MAC), are extracted from each protein sequence to mine the interaction information in the sequence. Second, in order to reduce the computational complexity and enhance the overall accuracy of the predictor, an effective feature reduction method PCA is

