and I-TASSER [17], have been developed and demonstrated
as feasible tools for modeling 3D structure from a given pro-
tein sequence, discrepancies between the predicted structure
and the actual structure still exist, particularly for proteins
that do not fit a structural template [18]. Furthermore, with
ever-evolving gene-sequencing technologies, the gap
between protein sequences and structures continues to
widen. Therefore, sequence-based computational methods
for predicting DNA-binding residues are more practical and
economical, and are urgently needed.
Compared to structure-based methods, sequence-based
methods can quickly predict DNA-binding residues with-
out using protein structure information. During the past
decade, a number of machine-learning algorithms have
been used to predict DNA-binding residues from protein
sequences, and a series of sequence-based predictors have
been developed, including BindN [10], DP-Bind [12],
BindN+ [19], MetaDBsite [6], and DNABR [20], among
others. These sequence-based predictors often utilize only
protein sequence information and recognize DNA-binding
residues with one or more machine-learning algorithms,
such as support vector machine (SVM) [21] or random forest
(RF) [22]. For example, in BindN [10], the prediction models
are constructed by SVM with three sequence features,
including the pKa value of the side chain, the hydrophobicity
index, and the molecular mass of an amino acid. In DP-Bind [12],
three machine-learning algorithms, including
SVM, kernel logistic regression, and penalized logistic
regression, are integrated to predict DNA-binding residues
based on the profile of evolutionary conservation of a query
protein sequence in the form of a position-specific scoring
matrix (PSSM) [23]. Wong et al. [24] proposed and described
a computational approach, which takes into account both
protein sequence and DNA information, for learning the
specificity-determining residue-nucleotide interactions of
different known DNA-binding domain families. In addition,
Wong et al. [25] developed an HMM-based approach using
belief propagation (named kmerHMM), which accepts
PBM raw data and pre-processes them into median binding
intensities of individual k-mers to identify DNA motifs.
Despite the promising results of these methods, there
remains room for further improvements in accurately pre-
dicting DNA-binding residues from protein sequences.
Another important issue that warrants careful consider-
ation for developing machine-learning-based predictors of
protein-DNA binding residues is the severe intrinsic class
imbalance: the number of DNA-binding residues (minority
class) is significantly smaller than that of non-binding residues
(majority class). Sample rescaling is the most straightforward
strategy for dealing with the issue of class imbalance [26], [27].
In this strategy, over-sampling and under-sampling are the
two most commonly used implementations. As demonstrated
in previous work [26], [27], [28], over-sampling enlarges the
training dataset and thus inevitably increases the training and
prediction time. In addition, over-sampling may lead to
over-fitting. On the other hand, under-sampling yields a more
compact training dataset but risks discarding informative
majority-class samples. In view of this,
in this study, we address the class imbalance by integrating
under-sampling with an appropriate boosting ensemble algorithm.
More specifically, we train multiple classifiers on
balanced datasets obtained by applying random
under-sampling (RUS); these trained classifiers are then
combined with a boosting procedure.
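The random under-sampling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `random_under_sample` and `balanced_subsets`, the toy labels, and the number of rounds are all hypothetical, and the base classifier and the boosting procedure are deliberately left out because this section does not specify them.

```python
import random


def random_under_sample(labels, rng):
    """Return indices of a balanced subset: every positive sample plus
    an equally sized random draw (without replacement) of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(neg) < len(pos):
        raise ValueError("expected negatives to be the majority class")
    return sorted(pos + rng.sample(neg, len(pos)))


def balanced_subsets(labels, n_rounds, seed=0):
    """Draw one balanced subset per round; each subset would be used to
    train one base classifier of the ensemble (training is omitted here)."""
    rng = random.Random(seed)
    return [random_under_sample(labels, rng) for _ in range(n_rounds)]


# Toy example (hypothetical): 3 binding residues (1) among 12 residues.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
subsets = balanced_subsets(labels, n_rounds=4, seed=42)
for idx in subsets:
    print(idx, [labels[i] for i in idx])
```

Because each round draws a different set of negatives, the ensemble as a whole sees more of the majority class than any single balanced subset does, which is the usual motivation for combining under-sampling with an ensemble.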
In view of the issues mentioned above, we propose a
sequence-based predictor, named “TargetDNA”, for the
computational identification of DNA-binding residues.
First, we employed the protein evolutionary information
and the predicted solvent accessibility, which are deter-
mined solely from protein sequences, as two base features
(refer to Section 2.2 for details). Next, to further quantify the
difference between DNA-binding and non-binding resi-
dues, we utilized a centered linear kernel target alignment
algorithm to learn the weights for combining
the two features. Then, based on the weighted combination of
the features, we trained multiple DNA-binding residue predictors
with SVM as the base classifier by applying the RUS technique
to the original imbalanced dataset. Finally, we
obtained the ensembled predictor by using a boosting
ensemble algorithm. We also created an online web server
of TargetDNA, which is freely accessible for academic use
at http://csbio.njust.edu.cn/bioinf/TargetDNA/.
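To make the feature-weighting idea more concrete, the sketch below computes the standard centered alignment between a linear kernel and the label kernel y y^T for two toy features, and then turns the two alignment scores into weights. This is only an illustration under stated assumptions: the helper names, the toy data, the use of labels in {+1, -1}, and the rule of setting weights proportional to the (non-negative) alignment scores are all hypothetical choices, and the exact weight-learning procedure used by TargetDNA is not given in this section.

```python
import math


def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j> for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K):
    """Center a symmetric kernel matrix: Kc = H K H with H = I - 11^T / n."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def frobenius(A, B):
    """Frobenius inner product of two matrices of equal shape."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def centered_alignment(X, y):
    """Centered alignment between the linear kernel of X and y y^T."""
    Kc = center(linear_kernel(X))
    Yc = center([[yi * yj for yj in y] for yi in y])
    denom = math.sqrt(frobenius(Kc, Kc) * frobenius(Yc, Yc))
    return frobenius(Kc, Yc) / denom if denom > 0 else 0.0


# Toy data (hypothetical): labels in {+1, -1} and two one-dimensional features.
y = [1, 1, -1, -1]
feature_a = [[1.0], [0.9], [-1.0], [-0.8]]   # tracks the labels closely
feature_b = [[0.2], [-0.1], [0.1], [-0.3]]   # only weakly related to the labels

align_a = centered_alignment(feature_a, y)
align_b = centered_alignment(feature_b, y)

# Assumed weighting rule: weights proportional to non-negative alignment.
scores = [max(0.0, align_a), max(0.0, align_b)]
weights = [s / sum(scores) for s in scores]
print(align_a, align_b, weights)
```

With linear kernels, a weighted kernel sum w1*K1 + w2*K2 is equivalent to concatenating the two feature vectors after scaling them by sqrt(w1) and sqrt(w2), which is one way such weights could be applied before training an SVM.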
2 METHODS
2.1 Benchmark Datasets
We constructed a dataset of 7,186 DNA-binding protein
chains, which had clear target annotations in the Protein
Data Bank (PDB) [29] before October 10, 2015. After remov-
ing the redundant sequences using CD-hit software [30], a
total of 584 non-redundant protein sequences were obtained
such that no two sequences had more than 30 percent iden-
tity. Then, we divided the non-redundant sequences into
two parts, the training dataset (PDNA-543) and the inde-
pendent test dataset (PDNA-TEST). PDNA-543 consists of
543 protein sequences, which were all released into the PDB
before October 10, 2014. PDNA-TEST includes 41 protein
chains, which were all released into the PDB after October
10, 2014. More specifically, there are 9,549 DNA-binding
residues (i.e., positive samples) and 134,995 non-binding
residues (i.e., negative samples) in PDNA-543. PDNA-TEST
consists of 734 positive samples and 14,021 negative sam-
ples. Table 1 summarizes the detailed compositions of
PDNA-543 and PDNA-TEST.
2.2 Feature Representation
From the point of view of machine learning, the prediction
of protein-DNA binding residues is a typical binary
classification problem. Thus, when training a machine-learning-based
prediction model, encoding protein-DNA
binding residues with discriminative features is one of the
most crucial steps. Various effective sequence-based features,
such as PSSM [12], predicted secondary structure [5],
TABLE 1
Composition of the Training and Independent Validation Datasets

Dataset      No. of Sequences   numP (a)   numN (b)   Ratio (c)
PDNA-543     543                9,549      134,995    14.137
PDNA-TEST    41                 734        14,021     19.102

(a) numP represents the number of positive samples.
(b) numN represents the number of negative samples.
(c) Ratio = numN / numP.
1390 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 14, NO. 6, NOVEMBER/DECEMBER 2017