A Maximally Split and Relaxed ADMM for
Regularized Extreme Learning Machines
Xiaoping Lai, Member, IEEE, Jiuwen Cao, Member, IEEE, Xiaofeng Huang, Tianlei Wang,
and Zhiping Lin, Senior Member, IEEE
Abstract—One of the salient features of the extreme learning
machine (ELM) is its fast learning speed. However, in a big
data environment, the ELM still suffers from an overly heavy
computational load due to the high dimensionality and the
large amount of data. Using the alternating direction method
of multipliers (ADMM), a convex model fitting problem can be
split into a set of concurrently executable subproblems, each with
just a subset of model coefficients. By maximally splitting across
the coefficients and incorporating a novel relaxation technique, a
maximally split and relaxed ADMM (MS-RADMM), along with
a scalarwise implementation, is developed for the regularized
ELM (RELM). The convergence conditions and convergence
rate of the MS-RADMM are established; the algorithm converges
linearly with a smaller convergence ratio than the unrelaxed
maximally split ADMM. The optimal parameter values of the
MS-RADMM are obtained and a fast parameter selection scheme
is provided. Experiments on ten benchmark classification data
sets are conducted, the results of which demonstrate the fast
convergence and parallelism of the MS-RADMM. Complexity
comparisons with the matrix-inversion-based method, in terms of
the numbers of multiplication and addition operations, the
computation time, and the number of memory cells, are provided
for performance evaluation of the MS-RADMM.
Index Terms—Alternating direction method of multipliers
(ADMM), computational complexity, convergence rate, extreme
learning machine (ELM), parallel algorithm.
I. INTRODUCTION
The extreme learning machine (ELM) [1], developed for
the training of single-hidden-layer feedforward neural
networks (SLFNs), has attracted much attention in the
past decade and has become popular due to its fast learning speed
and satisfactory generalization performance (see [2]–[7] and
the references therein). The fast learning speed of ELMs,
including those in online sequential mode [8]–[10], is due
to the randomly generated hidden nodes and the analytical
calculation of the output weight, which could be traced back
to [11]–[13]. The optimal output weight can be concisely
expressed as H⁺T, where H⁺ is the Moore–Penrose generalized
inverse of the hidden-layer output matrix H of the SLFN and
T represents the target output. The regularized
ELM (RELM) is an improved version of the original ELM
in which a regularization term is incorporated into its cost
function to minimize not only the total squared training error
but also the norm of the output weight [2].
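To make this closed-form training step concrete, the following NumPy sketch computes the RELM output weight under stated assumptions: a sigmoid hidden layer, a regularization parameter C trading off the training error against the weight norm, and illustrative names (relm_output_weight, W, b, beta) that are not taken from the paper.

```python
import numpy as np

def relm_output_weight(X, T, N=100, C=1.0, seed=0):
    """Illustrative closed-form RELM training step (a sketch, not the paper's algorithm).

    X : (M, d) array of input samples; T : (M, m) array of target outputs.
    N hidden nodes with randomly generated input weights and biases;
    C is the regularization parameter.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, N))           # random input weights
    b = rng.standard_normal(N)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # hidden-layer output matrix (sigmoid)
    # Regularized least-squares solution of the output weight:
    #   beta = (H'H + I/C)^(-1) H'T,
    # which approaches the Moore-Penrose solution H⁺T as C grows large
    # (for H of full column rank).
    beta = np.linalg.solve(H.T @ H + np.eye(N) / C, H.T @ T)
    return W, b, beta
```

The single N × N solve above is the matrix-inversion step whose memory and arithmetic costs become prohibitive for large M and N, which motivates the parallel and ADMM-based approaches discussed below.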
The noniterative analytical solution of the output weight is
one of the main factors that contribute to the high computational
efficiency of the ELM. However, the growing volume
and increasing complexity of the data sets in big data
applications render the implementation of the ELM highly
challenging. If the number of training samples M and the
number of hidden nodes N are both very large, the hidden-layer
output matrix H is of very high dimension, and the matrix-inversion-based
(MI-based) solutions require a huge memory space and
suffer from a heavy computational load; forming HᵀH alone takes
on the order of MN² operations, and the subsequent N × N inversion
a further O(N³).
To address the above challenges, several enhanced ELMs
were proposed [14]–[21]. The ELM in [14] introduces an
ℓ1-regularized cost that leads to sparse solutions and therefore
favors network pruning, and describes a hardware SLFN
structure that combines parallel and pipelined processing.
Frances-Villora et al. [15] presented two parallel hardware
architectures for on-chip ELMs implemented on
a field-programmable gate array (FPGA) and focused on
the parallelization of the Moore–Penrose generalized inverse
computation based on a QR decomposition. He et al. [16]
used the MapReduce programming model to process
large data sets with a parallel/distributed algorithm on a
cluster, parallelizing the MI-based output weight calculation
and the hidden node mapping. Xin et al. [17], [18] also used
the MapReduce framework to distribute the output weight
calculation. While Xin et al. [17] focused on the decomposition
of matrix multiplication, Xin et al. [18] focused on the
matrix multiplication in incremental/decremental/correctional
learning. The parallel RELM in [19] decomposes the data
matrix by rows or columns into a set of smaller block matrices
and trains the block-matrix-based models in parallel using
a cluster with the message passing interface environment.
Reference [20] divides each of the input data set, the hidden-
layer parameter data set, and the hidden-layer output matrix
into N parts, which are processed in parallel to calculate
the output weight. In [21], an ensemble of ELMs that are
implemented in parallel with multiple GPU and CPU cores
is used to reduce the error in regression problems with large
data sets. In [22], the training time of an ELM is reduced by
outsourcing the training to a computing cloud.