highly dependent on the chosen metric. However, these methods are designed for classification tasks and cannot be used directly for regression. Only a few metric learning methods have been proposed specifically for regression so far. A typical one is MLKR [14], which learns a metric specifically for kernel regression. Unfortunately, the improvement in regression performance achieved by MLKR is limited, and it is still difficult for it to match sophisticated methods such as support vector regression [15] on many datasets.
To explore metric learning further for regression tasks, we consider learning a metric by incorporating support vector regression (SVR), one of the most popular regression algorithms. The metric is also important for SVR, especially with kernels: typical kernels for SVR have no prior knowledge about the meaning of the features and are assumed to be isotropic. Therefore, we focus on learning an embedded metric in SVR to improve regression performance. We propose a corresponding learning algorithm, termed SVRML, which simultaneously minimizes the error on a validation set and enforces sparsity on the learned metric matrix. The learning process combines Mahalanobis [16] metric learning with the training of SVR. More importantly, to make the metric learned by SVRML more effective, we propose a bagging-like ensemble metric learning framework. It extends the original bagging algorithm [17] by taking a positive semi-definite matrix, rather than a classifier or regressor, as the base learner.
The proposed SVRML algorithm has the following desirable properties: (1) SVRML learns a sparse Mahalanobis metric that is capable of removing potential redundancy or noise in the data. (2) SVRML can learn multiple base metrics in parallel using the bagging-like ensemble metric learning framework and obtain an aggregated metric that achieves better generalization performance for SVR. (3) It is easy to implement and can be treated as an alternative feature selection method, providing a convenient way to pre-process the data automatically. The primary contributions of this work are therefore as follows: (1) We propose a task-dependent metric learning algorithm for SVR. (2) We develop an effective bagging-like ensemble metric learning framework in which the resampling mechanism of the original bagging is specially modified for SVRML.
The rest of this paper is organized as follows: we provide an overview of related work in Section 2. Section 3 explains how to learn an embedded metric for SVR. The bagging-like ensemble metric learning framework is discussed in detail in Section 4. Experimental studies are presented in Section 5. Finally, we draw conclusions and outline future work in Section 6.
2. Related works
Over the last decade, several task-dependent metric learning algorithms have been proposed [2,4,18,14]. However, only a few of them are designed specifically for regression tasks. Support vector regression, which is very popular for regression tasks, also depends heavily on the metric. To the best of our knowledge, our work is the first to combine metric learning with support vector regression. Our proposed method, SVRML, also belongs to the family of task-dependent distance metric learning.
Weinberger and Tesauro constructed a metric learning algorithm for kernel regression, termed MLKR [14], which learns a task-specific (pseudo-)metric over the input space such that small distances between two vectors imply similar target values. The metric in MLKR is learned by directly minimizing the leave-one-out regression error. Similarly, Xu et al. [19] proposed a metric learning algorithm for support vector classification that minimizes the 0–1 classification error. Inspired by these works, we consider learning a metric for SVR by minimizing the regression error on a validation set. One drawback of such approaches, however, is that they tend to overfit the validation data [8].
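To make this validation-error criterion concrete, the sketch below shows one way a candidate metric could be scored, assuming the metric is parameterized as $M = L^{\top}L$ so that mapping $x \mapsto Lx$ makes the ordinary Euclidean RBF kernel behave like a Mahalanobis one. This is only an illustration: it uses scikit-learn's standard $\varepsilon$-SVR rather than the L2-SVR used in this paper, and all names are ours, not part of SVRML's actual optimization procedure (described in Section 3).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def validation_error(L, X_train, y_train, X_val, y_val, C=1.0, epsilon=0.1):
    """Score a candidate transform L (so M = L^T L is the metric) by the
    regression error it induces on a held-out validation set."""
    # Mapping x -> Lx turns the Euclidean RBF kernel into a Mahalanobis
    # RBF kernel with matrix M = L^T L.
    svr = SVR(kernel="rbf", C=C, epsilon=epsilon)
    svr.fit(X_train @ L.T, y_train)
    return mean_squared_error(y_val, svr.predict(X_val @ L.T))

# With L equal to the identity matrix, this reduces to scoring a plain
# Euclidean RBF SVR on the validation set.
```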
As a remedy, ensemble learning is an alternative that can be combined with the metric learning process, since ensemble learning is able to improve the generalization performance of learning systems [20]. Some ensemble learning methods, such as boosting [21], have already been introduced into metric learning. For example, Shen et al. [22] proposed a boosting-based technique, BoostMetric, which learns a metric using trace-one rank-one matrices as weak learners. Chang [23] developed a metric base learner specific to the boosting framework by improving a loss function iteratively. Mu et al. [24] proposed a local discriminative metric ensemble learning algorithm. However, none of these methods focuses on regression tasks. To fill this gap, we propose a bagging-like ensemble framework designed specifically for SVRML to improve regression performance; a rough sketch of its flavor is given below. Unlike existing methods such as BoostMetric, which learn the base metrics iteratively, our framework retains the parallelism of bagging. In our framework, the resampling mechanism of the original bagging is specially modified for SVRML to achieve better performance.
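The following is a minimal sketch of the general bagging-like idea, under two simplifying assumptions that are ours rather than SVRML's: the ordinary bootstrap is used for resampling (SVRML modifies this mechanism, see Section 4), and the base metrics are aggregated by simple averaging, which preserves positive semi-definiteness. The `learn_base_metric` routine is a hypothetical placeholder for any metric learner that returns a $d \times d$ positive semi-definite matrix.

```python
import numpy as np

def bagging_like_metric(X, y, learn_base_metric, n_bags=10, seed=None):
    """Sketch: learn one PSD base metric per bootstrap resample and aggregate
    them by averaging (an average of PSD matrices is again PSD)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    base_metrics = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=n, replace=True)            # bootstrap resample
        base_metrics.append(learn_base_metric(X[idx], y[idx]))
    return sum(base_metrics) / n_bags                        # aggregated metric
```

The averaging step only serves to keep the aggregated matrix a valid metric; the loop over resamples is embarrassingly parallel, which is the parallelism referred to above.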
In addition to the above, our work is also inspired by kernel-parameter selection methods for SVR. For example, Chang and Lin [25] derived various leave-one-out bounds for SVR parameter selection to improve generalization performance. Kernel-parameter selection for SVR can be viewed from the metric learning perspective, since adjusting the inner product leads to different distance metrics; for an RBF kernel, for instance, choosing the width parameter amounts to rescaling the squared Euclidean distance. Unlike choosing a single or a few kernel parameters, our method optimizes the entire metric matrix and learns a nonlinear metric.
3. Metric learning for support vector regression (SVRML)
3.1. Support vector regression
Our method is based on L2-SVR [15], one of the most commonly used varieties of SVR. Given a set of training examples $\{(x_i, y_i)\}_{i=1}^{\ell}$ of size $\ell$, where the input vector $x_i \in \mathbb{R}^d$ and the target value $y_i \in \mathbb{R}$, L2-SVR solves the primal problem:
$$
\begin{aligned}
\min_{w, b, \xi, \xi^*} \quad & \frac{1}{2} w^{\top} w + \frac{C}{2} \sum_{i=1}^{\ell} \xi_i^2 + \frac{C}{2} \sum_{i=1}^{\ell} (\xi_i^*)^2 \\
\text{s.t.} \quad & -\varepsilon - \xi_i \le w^{\top} \phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \quad i = 1, \ldots, \ell.
\end{aligned}
\tag{1}
$$
In order to solve the above problem efficiently, in practice we solve the dual problem of (1) instead:
$$
\begin{aligned}
\min_{\alpha, \alpha^*} \quad & \frac{1}{2} (\alpha - \alpha^*)^{\top} \widetilde{K} (\alpha - \alpha^*) + \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) - \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0, \\
& \alpha_i, \alpha_i^* \ge 0, \quad i = 1, \ldots, \ell,
\end{aligned}
\tag{2}
$$
where $k(x_i, x_j) = \phi(x_i)^{\top} \phi(x_j)$ is the kernel function, $\widetilde{K} = K + I/C$, $K$ is the $\ell \times \ell$ kernel matrix with $K_{ij} = k(x_i, x_j)$, and $I$ is the $\ell \times \ell$ identity matrix. The final prediction function is
$$
g(x) = w^{\top} \phi(x) + b = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \, k(x_i, x) + b.
\tag{3}
$$
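As a small numerical illustration of (3), the following sketch computes $g(x)$ for a batch of query points, assuming the dual variables $\alpha$, $\alpha^*$ and the offset $b$ have already been obtained from some QP solver for (2); the function and argument names are ours and not part of any particular library.

```python
import numpy as np

def svr_predict(kernel, X_train, alpha, alpha_star, b, X_new):
    """Evaluate g(x) = sum_i (alpha_i - alpha_i^*) k(x_i, x) + b, as in Eq. (3).

    `kernel(Xi, Xj)` must return the matrix of pairwise kernel values
    (shape: len(Xi) x len(Xj)); `alpha` and `alpha_star` have one entry
    per training example.
    """
    K = kernel(X_train, X_new)            # shape (l, m)
    return (alpha - alpha_star) @ K + b   # one prediction per query point
```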
For convenience of exposition, we no longer distinguish L2-SVR from SVR in the following sections. Many kernel functions are used for SVR; in fact, any function $k(\cdot, \cdot)$ can serve as a well-defined kernel as long as it is positive semi-definite. In this paper, we uniformly use the RBF kernel due to its popularity and the particular property that it depends directly on the distance function. The RBF kernel is defined as follows:
$$
k(x_i, x_j) = \exp\!\left\{ -d^2(x_i, x_j) \right\},
\tag{4}
$$
where $d(\cdot, \cdot)$ is the distance metric of the data. In the standard RBF kernel, $d^2(x_i, x_j)$ is commonly the squared Euclidean distance scaled by a kernel width parameter $\sigma$ ($\sigma > 0$). When training the SVR, the prediction performance can be improved by choosing an effective parameter $\sigma$.
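To make the connection between the kernel and the metric explicit, the sketch below implements (4) first with the scaled squared Euclidean distance and then with a Mahalanobis distance $d^2(x_i, x_j) = (x_i - x_j)^{\top} M (x_i - x_j)$ defined by a positive semi-definite matrix $M$, which is the kind of metric SVRML embeds in the kernel. This is only an illustrative sketch under our own naming; the exact parameterization and learning of $M$ are described in the remainder of Section 3.

```python
import numpy as np

def rbf_kernel_euclidean(Xi, Xj, sigma=1.0):
    """k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma): the usual isotropic RBF kernel."""
    diff = Xi[:, None, :] - Xj[None, :, :]            # shape (l, m, d)
    return np.exp(-(diff ** 2).sum(-1) / sigma)

def rbf_kernel_mahalanobis(Xi, Xj, M):
    """k(x_i, x_j) = exp(-(x_i - x_j)^T M (x_i - x_j)) for a PSD matrix M."""
    diff = Xi[:, None, :] - Xj[None, :, :]            # shape (l, m, d)
    sq_dists = np.einsum("abi,ij,abj->ab", diff, M, diff)
    return np.exp(-sq_dists)

# With M = I / sigma, the Mahalanobis kernel reduces to the Euclidean one,
# so learning M generalizes the usual choice of a single kernel width.
```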