Besides the distributed framework mentioned above, the alternating direction method of multipliers (ADMM), which aims to find a solution to a global problem by solving local subproblems in a coordinated fashion, is also an effective method for distributed convex optimization [9,17]. It is noteworthy that C. Zhang investigated a large-scale distributed linear classification algorithm in the framework of ADMM and achieved a significant speedup over several other classifiers [18]. C.Y. Lu et al. proposed a fast proximal ADMM with parallel splitting to further reduce the per-iteration complexity for multi-block problems [19].
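For reference, in its standard form (following [9]) ADMM addresses a problem of the form \min_{x,z} f(x) + g(z) subject to Ax + Bz = c by iterating over the augmented Lagrangian L_{\rho}; the generic updates, which are not yet the specific updates derived later in this paper, read

x^{k+1} = \arg\min_{x} L_{\rho}(x, z^{k}, y^{k}),
z^{k+1} = \arg\min_{z} L_{\rho}(x^{k+1}, z, y^{k}),
y^{k+1} = y^{k} + \rho\,(A x^{k+1} + B z^{k+1} - c),

where y is the dual variable and \rho > 0 is the penalty parameter. The x- and z-updates decompose into local subproblems whenever f and g are separable across the data.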
Motivated by the advantages of ADMM in distributed optimization, in this paper we develop a novel distributed extreme learning machine (DELM) that implements the extreme learning machine on multiple machines in parallel. This approach relieves, to some extent, the memory limitation imposed by large-scale data sets. It differs from the traditional extreme learning machine, in which all data are loaded onto one processor and share a common output weight vector of the hidden layer. Instead, the proposed DELM method allows a large-scale data set to be stored in a distributed manner by associating each processor with its own output weight vector, where each output weight vector can be determined in parallel since it depends only on the corresponding sub-dataset. In the framework of ADMM, the shared output weight vector across all processors is derived by combining all local output weight vectors with an included regularization vector.
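Concretely, with the training data split into P sub-datasets, one natural way to cast such a scheme as a global consensus problem (the exact formulation used by DELM is given in Section 3; the splitting shown here is only an illustrative sketch) is

\min_{\beta_1, \cdots, \beta_P,\, z} \; \sum_{p=1}^{P} \frac{C}{2}\,\|H_p \beta_p - T_p\|_2^2 + \frac{1}{2}\,\|z\|_2^2 \quad \text{s.t.} \quad \beta_p = z, \quad p = 1, 2, \cdots, P,

where H_p and T_p denote the hidden-layer output matrix and target vector of the p-th sub-dataset (both defined formally in Section 2); each \beta_p can then be updated in parallel from its own sub-dataset, while the consensus variable z carries the shared regularization.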
The remainder of this paper is organized as follows. In
Section 2 , notations and preliminaries about extreme learning ma-
chine are reviewed briefly. Section 3 focuses on the formulated op-
timization problem of distributed extreme learning machine. A cor-
responding distributed convex optimization algorithm is developed
for the proposed DELM on the basis of ADMM in Section 4 . Ex-
tensive experiments on well-known benchmark data sets are con-
ducted in Section 5 to illustrate the effectiveness and superiority
of the proposed method. Finally, the conclusion is given in Section 6.
2. Principles of extreme learning machine
The extreme learning machine refers to a kind of single-hidden-layer feedforward neural network whose hidden layer need not be tuned. Formally, the output function of the extreme learning machine is formulated as
f_L(x) = \sum_{j=1}^{L} \beta_j h_j(x) = h(x)\beta, \qquad (1)
where \beta = (\beta_1, \beta_2, \cdots, \beta_L)^{\top} \in \mathbb{R}^{L} is the output weight vector of the hidden layer with L nodes, which needs to be estimated analytically; the feature mapping h : \mathbb{R}^{n} \to \mathbb{R}^{L} maps the input variable x \in \mathbb{R}^{n} to the L-dimensional hidden-layer feature space such that
h(x) = (h_1(x), h_2(x), \cdots, h_L(x)), \qquad (2)
where the component h_j(x) = G(a_j, b_j, x) \; (j = 1, 2, \cdots, L) denotes the output function of the j-th hidden node, which is known to users because the parameters \{(a_j, b_j) : j = 1, 2, \cdots, L\} are generated randomly according to any continuous probability distribution [20]. In general,
the Sigmoid function

G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))},

the Gaussian function

G(a, b, x) = \exp\bigl(-b \,\| x - a \|^{2}\bigr),
and some other nonlinear activation functions are usually adopted as hidden-node output functions in the framework of the extreme learning machine. In addition, some kernel sparse coding methods have also been developed to meet the requirements of many applications [21,22].
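As a minimal illustrative sketch of this random feature mapping (the function names and the uniform sampling range below are our own assumptions; the theory in [20] only requires a continuous distribution for the parameters), one could write:

```python
import numpy as np

def random_hidden_layer(n_features, n_hidden, seed=None):
    """Randomly generate hidden-node parameters (a_j, b_j), j = 1, ..., L."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # column j holds a_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # bias b_j per node
    return A, b

def hidden_output(X, A, b):
    """Row-wise map h(x) = (G(a_1,b_1,x), ..., G(a_L,b_L,x)) with the Sigmoid G."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))                 # shape (N, L)
```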
For a data set D = \{(x_k, t_k) : x_k \in \mathbb{R}^{n}, t_k \in \mathbb{R}, k = 1, 2, \cdots, N\} that consists of N training data points x_k with actual outputs t_k \; (k = 1, 2, \cdots, N), the extreme learning machine randomly generates the input weights and estimates the output weight vector \beta by minimizing the training error together with the \ell_2-norm of the output weights for better generalization, i.e.,
\beta = \arg\min_{\beta} \; \frac{1}{2}\|\beta\|_2^2 + \frac{C}{2}\sum_{k=1}^{N} \bigl(h(x_k)\beta - t_k\bigr)^2 = \arg\min_{\beta} \; \frac{1}{2}\|\beta\|_2^2 + \frac{C}{2}\|H\beta - T\|_2^2, \qquad (3)
where C is the trade-off parameter between the training error and the regularization term; T = (t_1, t_2, \cdots, t_N)^{\top} \in \mathbb{R}^{N} denotes the actual output vector; and H \in \mathbb{R}^{N \times L} represents the hidden layer output matrix, i.e.,
H = \begin{pmatrix} h(x_1) \\ h(x_2) \\ \vdots \\ h(x_N) \end{pmatrix} = \begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_L(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_L(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_1(x_N) & h_2(x_N) & \cdots & h_L(x_N) \end{pmatrix}.
It is evident that the closed-form solution of the optimization problem (3) can be obtained as
\beta = H^{\top}\left(\frac{I}{C} + H H^{\top}\right)^{-1} T, \qquad (4)
and therefore the output of the conventional extreme learning machine satisfies
f_L(x) = h(x)\, H^{\top}\left(\frac{I}{C} + H H^{\top}\right)^{-1} T. \qquad (5)
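Assuming the illustrative helpers random_hidden_layer and hidden_output sketched above, the closed-form training step of Eq. (4) and the prediction of Eqs. (1) and (5) can be coded, for instance, as follows; the N x N linear solve here is exactly the step whose cost is discussed at the end of this section.

```python
def train_elm(X, T, n_hidden, C, seed=None):
    """Estimate beta = H^T (I/C + H H^T)^{-1} T, cf. Eq. (4)."""
    A, b = random_hidden_layer(X.shape[1], n_hidden, seed)
    H = hidden_output(X, A, b)            # hidden layer output matrix, N x L
    N = H.shape[0]
    # Solve the N x N system rather than forming an explicit inverse.
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return A, b, beta

def predict_elm(X_new, A, b, beta):
    """f_L(x) = h(x) beta, cf. Eqs. (1) and (5)."""
    return hidden_output(X_new, A, b) @ beta
```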
If the feature mapping h is not known to users, a kernel matrix \Omega = [\Omega_{i,j}] \in \mathbb{R}^{N \times N} with respect to the data set D can be calculated as
\Omega_{i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j),
where the kernel function K : \mathbb{R}^{n} \times \mathbb{R}^{n} \to \mathbb{R} is usually defined as the Gaussian function

K(x_i, x_j) = G(x_i, x_j, \sigma) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{\sigma^{2}}\right).
According to Eq. (5) , the output of extreme learning machine with
kernel function K is formulated as
f_K(x) = h(x)\, H^{\top}\left(\frac{I}{C} + H H^{\top}\right)^{-1} T \qquad (6)
= \bigl(K(x, x_1), K(x, x_2), \cdots, K(x, x_N)\bigr)\left(\frac{I}{C} + \Omega\right)^{-1} T.
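A corresponding sketch for the kernel case of Eq. (6) (again with illustrative function names; sigma denotes the Gaussian kernel width defined above):

```python
def gaussian_kernel(Xa, Xb, sigma):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2) for all pairs of rows."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

def train_kernel_elm(X, T, C, sigma):
    """Solve alpha = (I/C + Omega)^{-1} T with Omega_{ij} = K(x_i, x_j)."""
    Omega = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(np.eye(X.shape[0]) / C + Omega, T)

def predict_kernel_elm(X_new, X_train, alpha, sigma):
    """f_K(x) = (K(x, x_1), ..., K(x, x_N)) alpha, cf. Eq. (6)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```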
As is well known, the conventional extreme learning machine is usually implemented on a single machine and therefore inevitably suffers from memory limitations with large-scale data sets. Indeed, it is often essential to handle large-scale data that are located on different machines; for example, in some applications the data are collected and stored in a distributed manner and can only be accessed on their own machines, and confidentiality makes it impossible to gather all of the data together.
In addition, the traditional extreme learning machine is computation-intensive for large-scale data because of the high complexity of the matrix inversion involving the hidden layer output matrix H (see Eq. (5) for details). The situation becomes even worse in the case of the kernel matrix \Omega \in \mathbb{R}^{N \times N}, since the size of the matrix to be inverted is then determined by the number of samples, which is large for large-scale data sets.
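To make this concrete, a rough back-of-envelope estimate (assuming double-precision storage and a cubic-cost direct factorization, both our own assumptions): for N = 10^{6} training samples,

\text{memory for } \Omega \approx 8 N^{2} \ \text{bytes} = 8\ \text{TB}, \qquad \text{factorization cost} \sim O(N^{3}) = O(10^{18}) \ \text{flops},

which is far beyond the capacity of a single commodity machine and is precisely the bottleneck that motivates the distributed treatment in the following sections.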