accumulated [1-5]. However, experimental methods are
costly and time consuming, so the PPI pairs obtained
from experiments cover only a small fraction of
the complete PPI networks [6]. In addition, large-scale
experimental methods usually suffer from high rates of
both false positive and false negative predictions [6-8].
Hence, it is of great practical significance to develop
reliable computational methods to facilitate the
identification of PPIs [9-11].
A number of computational methods have been pro-
posed for the prediction of PPIs based on different data
types, including phylogenetic profiles, gene neighbor-
hood, gene fusion, literature mining knowledge, and
sequence conservation between interacting proteins
[6-9,12-15]. There are also methods that combine
interaction information from several different data sources
[16]. However, these methods cannot be implemented if
such pre-knowledge about the proteins is not available.
Recently, methods that derive information directly from
the amino acid sequence have been of particular
interest [7-9,11]. Many researchers have engaged in the
development of sequence-based methods for discovering
new PPIs, and the experimental results showed that the
information in amino acid sequences alone is sufficient
to predict PPIs [7,9,11]. Among them, one of the most
notable works is an SVM-based method developed by Shen
et al. [11]. In that study, the 20 amino acids were clustered
into seven classes according to the dipoles and
volumes of their side chains, and the conjoint triad
method then abstracts the features of protein pairs based on
this classification of amino acids. When applied to
predicting human PPIs, this method yielded a high prediction
accuracy of 83.9%. Because the conjoint triad method
cannot take neighboring effects into account and
interactions usually occur in discontinuous amino
acid segments in the sequence, Guo
et al. developed a method based on SVM and auto
covariance to extract the interaction information in the
discontinuous amino acid segments in the sequence
[9]. Their method yielded a prediction accuracy of
86.55% when applied to predicting Saccharomyces
cerevisiae PPIs. In our previous works, we also obtained
good prediction performance by using autocorrelation
descriptors and correlation coefficient, respectively
[8,17].
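The conjoint triad encoding described above can be illustrated with a minimal sketch. The text only states that the 20 amino acids fall into seven classes; the specific grouping below is the one commonly reported for this method and is an assumption here, as are the min-max normalisation, the handling of unknown residues, and the function names conjoint_triad and pair_features.

```python
from itertools import product

# Seven amino-acid classes (assumed grouping; the text only says "seven classes").
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(sequence):
    """Return a 343-dimensional, min-max normalised triad frequency vector."""
    # One counter per ordered triple of class indices (7 * 7 * 7 = 343).
    counts = dict.fromkeys(product(range(7), repeat=3), 0)
    seq = sequence.upper()
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        # Skip windows containing symbols outside the 20 standard residues.
        if all(aa in AA_TO_CLASS for aa in window):
            counts[tuple(AA_TO_CLASS[aa] for aa in window)] += 1
    values = list(counts.values())
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for very short sequences
    return [(v - lo) / span for v in values]


def pair_features(seq_a, seq_b):
    """Concatenate the two proteins' vectors into one 686-dimensional vector."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)
```

Each window covers three consecutive residues, so the vector captures only local, contiguous context; this is the limitation that motivates the auto covariance approach mentioned above.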
Current studies on predicting PPIs have generally
focused on high accuracy but have not considered
the time taken to train the classification models, which
should be an important factor in developing a sequence-
based method for predicting PPIs because the total
number of possible PPIs is very large. Therefore some
computational models with high classification accuracy
may not be satisfactory when considering the trade-off
between the classification accuracy and the time for
training the models. Recently, Huang et al. proposed a
new learning algorithm called extreme learning machine
(ELM), which randomly assigns all the hidden node
parameters of generalized single-hidden layer feed-for-
ward networks (SLFNs) and analytically determines the
output weights of SLFNs [18-21]. Previous works have shown
that ELM provides efficient unified solutions to generalized
feed-forward networks, including kernel learning.
Consequently, ELM offers significant advantages such as
fast learning speed, ease of implementation, and minimal
human intervention. ELM has good potential as a viable
alternative technique for large-scale computing and artificial
intelligence. On the other hand, a single ELM model
sometimes has difficulty achieving satisfactory performance
on complex processes with strong nonlinearity,
time variance and high uncertainty. Ensemble ELM
methods have received special attention because they can
improve the accuracy of the predictor and achieve better
stability by training a set of models and then combining
them for the final predictions [22-24]. For example,
Lan et al. proposed an ensemble of online sequential
ELMs with more stable and accurate results [25]. Zhao et
al. proposed an ensemble ELM soft sensing model for
effluent quality prediction based on kernel principal
component analysis (KPCA), whose reliability and accuracy
outperform those of other models [24]. In this study, an
ensemble ELM model was built to predict protein
interactions.
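To make the ELM idea concrete, the following is a minimal sketch using NumPy, not the implementation used in this study. It assumes a sigmoid activation, uniformly sampled hidden-node parameters, a pseudoinverse solve for the output weights, and simple majority voting for the ensemble; the class name ELM, the function ensemble_predict and the default of 50 hidden nodes are hypothetical choices made for illustration.

```python
import numpy as np


class ELM:
    """Single-hidden-layer ELM for binary labels (illustrative sketch)."""

    def __init__(self, n_hidden=50, seed=None):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the randomly parameterised hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Hidden-node parameters are drawn at random and never trained.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # Output weights are determined analytically by least squares.
        targets = np.where(y == 1, 1.0, -1.0)
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


def ensemble_predict(models, X):
    """Majority vote over the 0/1 predictions of several fitted ELMs."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return (2 * votes > len(models)).astype(int)
```

Because training reduces to one random draw and one linear solve, fitting several ELMs with different seeds is cheap, and an odd number of models avoids ties in the vote.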
Previous works have pointed out that applying feature
selection or feature extraction before classification
can improve the classification accuracy [26].
Here, we examine the effectiveness of
dimensionality reduction before constructing
the ELM classifier for PPI prediction. Principal component
analysis (PCA) is used for feature extraction: it
projects the original feature space into a new space,
on which the ELM performs the prediction task.
The effectiveness of the proposed PCA-ELM is examined
in terms of classification accuracy on the PPI dataset.
The results show that the developed PCA-ELM
PPI prediction system achieves high accuracy and
also runs very fast.
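The projection step can be sketched as follows. This is a generic PCA via singular value decomposition using NumPy, given under the assumption that features are mean-centred only; the function names pca_fit and pca_transform and the choice of the number of retained components are hypothetical and not taken from this study.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as rows)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the retained principal axes."""
    return (X - mean) @ components.T
```

In a classification setting the mean and axes would normally be estimated on the training set only and then reused to transform the test set, so that the classifier sees both in the same reduced space.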
In this study, we report a new sequence-based method
for the prediction of protein-protein interactions from
amino acid sequences with ensemble ELM and PCA, aiming
to improve both the efficiency and the accuracy of the
classification. Firstly, four kinds of useful
sequence-based features, namely Auto Covariance (AC),
Conjoint triad (CT), Local descriptor (LD) and Moran
autocorrelation (MAC), are extracted from each protein
sequence to mine the interaction information in the
sequence. Secondly, in order to reduce the computational
complexity and enhance the overall accuracy of the
predictor, an effective feature reduction method, PCA, is
You et al. BMC Bioinformatics 2013, 14(Suppl 8):S10
http://www.biomedcentral.com/1471-2105/14/S8/S10