arXiv:1901.08755v1 [cs.LG] 25 Jan 2019
SecureBoost: A Lossless Federated Learning Framework
Kewei Cheng¹, Tao Fan², Yilun Jin³, Yang Liu², Tianjian Chen², Qiang Yang⁴
1. University of California, Los Angeles, Los Angeles, USA
2. Webank, Shenzhen, China
3. Peking University, Beijing, China
4. Hong Kong University of Science and Technology, Hong Kong
tobychen@webank.com, qyang@cse.ust.hk
Abstract
The protection of user privacy is an important concern in machine learning, as evidenced by the rollout of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine-learning frameworks that enable data sharing without violating user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system, known as SecureBoost, in the setting of federated learning. This federated-learning system allows a learning process to be jointly conducted over multiple parties with partially common user samples but different feature sets, which corresponds to a vertically partitioned virtual data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while, at the same time, revealing no information about any private data provider. We theoretically prove that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that bring the data into one place. In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.
Introduction
Modern society is increasingly concerned with the unlawful use and exploitation of personal data. At the individual level, improper use of personal data poses risks to user privacy. At the enterprise level, data leakage may have grave consequences for commercial interests. Actions are being taken by different societies. For example, the European Union has recently enacted a law known as the General Data Protection Regulation (GDPR). The GDPR is designed to give users more control over their personal data (Regulation 2016; Albrecht 2016; Mayer-Schonberger and Padova 2015; Goodman and Flaxman 2016). Many enterprises that rely heavily on machine learning are beginning to make sweeping changes as a consequence.
Despite the difficulty of meeting the goal of user-privacy protection, the need for different organizations to collaborate while building machine-learning models remains strong. In reality, many data owners do not have a sufficient amount of data to build high-quality models. For example, retail companies have user-transaction data, which corresponds to different data dimensions or features than those held by credit-rating companies. Likewise, mobile-phone users have their own usage data, but each device holds only a small amount of user-activity data. To obtain a usable model for user-preference prediction, it is necessary to integrate the data collected by the clients.
Thus, the challenge is to allow different data owners to collaborate in building high-quality machine-learning models while, at the same time, protecting user-data privacy and confidentiality. In the past, several attempts have been made to address the user-privacy problem while exchanging data (Hardy et al. 2017; Mohassel and Zhang 2017). For example, Apple proposed to use differential privacy (Dwork, Roth, and others 2014; Dwork 2008) to address the privacy-preservation issue. The basic idea of differential privacy (DP) is to add properly calibrated noise to data to disguise the identity of any individual when the data is exchanged and analyzed by a third party. However, as we discuss in this paper, DP only prevents user-data leakage to a certain degree and cannot completely rule out identifying an individual. In addition, data exchange under DP still requires that the data change hands between organizations, which may not be allowed by strict laws like the GDPR. Furthermore, the DP method is lossy for machine learning, in that models built after noise injection can suffer a marked loss in prediction accuracy.
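The calibrated-noise idea can be illustrated with a minimal sketch of the Laplace mechanism (function and variable names are our own; this is a didactic illustration, not a production DP implementation): a query answer is perturbed with noise whose scale is the query's sensitivity divided by the privacy budget ε.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52])

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
true_count = int(np.sum(ages > 30))
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
# Smaller epsilon means stronger privacy but a noisier, less accurate answer.
```

The accuracy loss the text mentions is visible here: every released statistic carries noise, and that noise propagates into any model trained on the perturbed data.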
More recently, Google introduced a federated learning framework (Konečný et al. 2016) on its Android cloud. The basic idea is to allow individual clients to encrypt their models, which are then uploaded and aggregated at a central cloud site. The machine-learning process at that site can make use of these encrypted models without leaking the clients' information. This framework applies to a data-partition setting where each partition corresponds to a subset of data samples collected from one or more users.
In this paper, we consider a general setting in which multiple parties collaboratively build their machine-learning models while protecting user privacy and data confidentiality. Our setting is shown in Figure 2. We consider a collection of parties, each holding part of its own data. We can visualize the data located at the different parties as a subsection of a big data table obtained by taking the union of all data at the different parties. The data at each party then has the following properties:
Figure 1: Illustration of the proposed SecureBoost framework (an active party and passive parties, each holding a sub-model, exchange confidential information via privacy-preserving entity alignment and intermediate computation exchange).
Figure 2: Vertically partitioned data set (Party 1 holds features X1, X2 and label Y for users U1, U2, U3; Party 2 holds features X3, X4, X5 for users U1, U2, U4; the virtually joint table covers the common users U1, U2).
1. The big data table is vertically split, such that the data are divided along the feature dimension among the parties;
2. only one data provider has the label information;
3. the users partially overlap across different parties.
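The three properties above can be made concrete with a small sketch using made-up users and values that mirror Figure 2 (the dictionaries and ids are ours, for illustration only):

```python
# Party 1 (active party): features X1, X2 plus the label Y, keyed by user id.
party1 = {"U1": {"X1": 0.2, "X2": 1.0, "Y": 0},
          "U2": {"X1": 0.5, "X2": 0.0, "Y": 1},
          "U3": {"X1": 0.9, "X2": 1.0, "Y": 1}}

# Party 2 (passive party): disjoint features X3-X5 for a partially
# overlapping user set.
party2 = {"U1": {"X3": 3.1, "X4": 0, "X5": 5.5},
          "U2": {"X3": 2.7, "X4": 1, "X5": 6.1},
          "U4": {"X3": 4.0, "X4": 0, "X5": 4.8}}

# The virtually joint table exists only over the intersection of user ids;
# no single party ever holds all of its columns.
common = sorted(party1.keys() & party2.keys())
joint = {u: {**party1[u], **party2[u]} for u in common}
```

In the federated setting this join is never materialized in one place; the sketch only shows which virtual table the parties are implicitly learning over.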
Our goal is then to allow each party to build a prediction model for some designated label, while preventing any party from obtaining information about the data of the other parties.
Our setting presents several distinctive challenges. In contrast with most existing work on privacy-preserving data mining and machine learning, the complexity of our setting is significantly increased. Unlike the situation where the data are horizontally split, the above setting requires a more complex mechanism to decompose the loss function at each party (Vaidya 2008; Vaidya and Clifton 2005; Hardy et al. 2017). In addition, in each model-building process, only one data provider owns the label information, which requires a secure protocol to guide the learning process instead of sharing label information explicitly among all parties. Finally, data-confidentiality and privacy concerns prevent the parties from exposing those of their users who are not common to the group when building the models. Hence, entity alignment must also be conducted in a sufficiently secure manner.
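To convey the entity-alignment idea, here is a deliberately naive salted-hash sketch (our own names and salt; note this is only illustrative — salted hashes of low-entropy ids are brute-forceable, and the naive exchange still reveals each side's hashed list, so production systems use proper cryptographic private-set-intersection protocols instead):

```python
import hashlib

def blinded_ids(user_ids, shared_salt):
    """Map raw user ids to salted hashes so raw ids need not be exchanged."""
    return {hashlib.sha256((shared_salt + uid).encode()).hexdigest(): uid
            for uid in user_ids}

# Assumption: the parties agreed on a secret salt out of band.
salt = "shared-secret-salt"
active = blinded_ids(["U1", "U2", "U3"], salt)
passive = blinded_ids(["U1", "U2", "U4"], salt)

# Comparing hashed ids lets each side recover the common users without
# ever sending raw identifiers.
common_hashes = active.keys() & passive.keys()
common_users = sorted(active[h] for h in common_hashes)
```

The output of this step — the set of common users — is exactly the index of the virtually joint table over which the model is subsequently trained.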
In this paper, we propose a novel end-to-end privacy-preserving tree-boosting algorithm and framework, known as SecureBoost, to enable machine learning in a federated setting. Unlike previous federated-learning frameworks that split the data along the user dimension, our framework enables collaborative model building when the data are split among different parties along the feature dimension. Our federated-learning framework operates in two steps. First, we find the common users among the parties under a privacy-preserving constraint. Then, we collaboratively learn a shared classification or regression model without leaking any user information to each other. We summarize our main contributions as follows:
• We formally define a novel problem of privacy-preserving machine learning over vertically partitioned data in the setting of federated learning.
• We present an approach to collaboratively train a high-quality tree-boosting model for each party while keeping the training data secret across multiple parties, and we do so without the participation of a trusted third party.
• Finally and importantly, we prove that our approach is lossless, in the sense that it is as accurate as any centralized non-privacy-preserving method that brings all data to one location.
• In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.
Preliminaries and Related Work
The existing literature on privacy-preserving machine learning broadly addresses two objectives: privacy of the data used for learning a model, and privacy of the data used as input to an existing model. To protect the privacy of the data used for learning, the authors of (Shokri and Shmatikov 2015; Abadi et al. 2016) propose to take advantage of differential privacy when training a deep-learning model. As one of the most popular privacy-preserving techniques, differential privacy (Dwork 2008) protects sensitive data by injecting noise into the raw datasets such that the amount of information leaked from an individual record is minimized. Even though differential privacy ensures a low probability of identifying an individual record, some probability of leakage remains, which conflicts with the requirements of the GDPR. To address these problems, Google introduced a federated learning framework that brings model training to each mobile terminal (Konečný et al. 2016); it achieves privacy protection by forbidding the data from being transferred off the device. Another family of privacy-preserving techniques focuses on the inference stage instead of the training stage. Microsoft proposed a cryptographic deep-learning framework, CryptoNets (Gilad-Bachrach et al. 2016), based on homomorphic encryption, which enables a trained neural network to make encrypted predictions over encrypted data. However, it has to sacrifice accuracy to obtain security. In (Rouhani, Riazi, and Koushanfar 2017), another framework, DeepSecure, is proposed to securely conduct deep-learning execution on encrypted data using Yao's Garbled Circuit (GC) protocol. Although it does not involve a trade-off between utility and privacy, it suffers from serious inefficiency.
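The homomorphic-encryption idea referenced above can be made concrete with a toy additively homomorphic (Paillier) sketch. The tiny primes, function names, and parameters below are our own choices for illustration — this is a didactic sketch, not a secure implementation; real deployments use keys of 2048 bits or more.

```python
import math
import random

# Toy Paillier keypair with tiny primes; messages must stay below n.
p, q = 17, 19
n = p * q                      # public modulus
n2 = n * n
g = n + 1                      # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)         # L(g^lam mod n^2)^-1 mod n

def encrypt(m):
    """Encrypt message m < n: c = g^m * r^n mod n^2 for random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    """Decrypt: m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts,
# so an untrusted party can aggregate encrypted statistics it cannot read.
a, b = 15, 27
assert decrypt(encrypt(a) * encrypt(b) % n2) == a + b
```

The additive property is exactly what lets a party sum encrypted per-instance statistics on behalf of another party without learning the underlying values.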
All the above methods are designed for horizontally partitioned data, where the data providers record the same features for different entities. We consider a vertical data partition, as shown in Figure 2, in which multiple parties record different features at different sites. Different from horizontal partitioning, which assumes that aggregation happens over data samples, a vertical partition builds a model over a common set of users, and how to collaboratively build such a model is an open question. Some previous works discuss privacy-preserving decision trees over vertically partitioned data (Vaidya and Clifton 2005; Vaidya et al. 2008). However, their proposed methods have to reveal the class distribution over the given attributes, which causes a potential security risk. In addition, they can only handle discrete data, which is less practical for real-life scenarios. In contrast, our method guarantees more secure protection of the data and easily applies to continuous data. In (Djatmiko et al. 2017), Patrini et al. proposed a framework to jointly perform logistic regression over encrypted vertically partitioned data by approximating the non-linear logistic loss with a Taylor expansion. Clearly, with this approximation, the algorithm inevitably incurs a loss of accuracy. To the contrary, we propose a novel approach that is lossless in nature. We believe that the SecureBoost framework is the first attempt at privacy-preserving federated learning over vertically partitioned data that balances accuracy and security.
Problem Statement
We now formally define our problem and clarify the difference between our setting and previous works. Let $\{X^k \in \mathbb{R}^{n_k \times d_k}\}_{k=1}^{m}$ be the data matrices distributed over $m$ private parties, with each row $X^k_{i*} \in \mathbb{R}^{1 \times d_k}$ being a data instance. We use $F^k = \{f_1, \ldots, f_{d_k}\}$ to denote the feature set of the corresponding data matrix $X^k$. If we consider all data as coming from a virtual big data table involving all users and all features, then we can view the data as being vertically split from a large virtual table across different parties, such that each party holds a different set of vertically partitioned data over a subset of users. Any two parties $p$ and $q$ have disjoint feature sets, denoted as $F^p \cap F^q = \emptyset,\ \forall p \neq q \in \{1, \ldots, m\}$. Different data providers may hold different sets of users as well, allowing some degree of overlap; that is, the user sets of sizes $n_1, \ldots, n_m$ may differ from each other. As mentioned before, when building a model for a common task, we consider that only one of the data providers has a class attribute for classification or regression. We denote the class label as $y \in \mathbb{R}^{n_k \times 1}$, where the class label is held by the $k$-th party.
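The notation can be grounded in a small sketch with made-up values (our own toy matrices, mirroring a two-party instance of the setting):

```python
import numpy as np

# m = 2 parties. X^k has shape (n_k, d_k), and the parties' feature
# sets are disjoint: each party holds its own columns of the virtual table.
X1 = np.array([[0.2, 1.0],             # party 1: n_1 = 3 instances, d_1 = 2 features
               [0.5, 0.0],
               [0.9, 1.0]])
X2 = np.array([[3.1, 0.0, 5.5],        # party 2: n_2 = 3 instances, d_2 = 3 features
               [2.7, 1.0, 6.1],
               [4.0, 0.0, 4.8]])
y = np.array([[0], [1], [1]])          # label vector in R^{n_1 x 1}, held by party 1

F1, F2 = {"f1", "f2"}, {"f3", "f4", "f5"}
assert F1 & F2 == set()                # F^p and F^q are disjoint for p != q
```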
Definition 1. Active Party:
We define the active party as the data provider who holds both a data matrix and the class label.
Since the class-label information is indispensable for supervised learning, there must be an active party with access to the label y. The active party naturally takes the role of the dominating server in federated learning.
Definition 2. Passive Party:
We define a data provider that holds only a data matrix as a passive party.
Passive parties play the role of clients in the federated-learning setting. They too need a model to predict the class label y for their own prediction purposes, and thus must collaborate with the active party to build a model that predicts y for their future users using their own features.
The problem of privacy-preserving machine learning over vertically partitioned data in federated learning can be stated as follows:
Given: a vertically partitioned data matrix $\{X^k\}_{k=1}^{m}$ distributed over $m$ private parties, and the class labels $y$ held by the active party.
Learn: a machine-learning model $M$ without giving away information about the data matrix of any party to the others in the process. The model $M$ is a function that has a projection $M^i$ at each party $i$, such that $M^i$ takes as input only that party's own features $X^i$.
Lossless Constraint: We require that the model $M$ be lossless, which means that the loss of $M$ under federated learning over the training data is the same as the loss of $M'$ when $M'$ is built on the union of all data.
Federated Learning with SecureBoost
As one of the most widely used machine-learning algorithms, the gradient tree-boosting model (Friedman et al. 2000) excels in many machine-learning tasks, such as fraud detection (Oentaryo et al. 2014), feature selection (Li et al. 2017), and product recommendation (He et al. 2014). In this section, we propose a novel gradient tree-boosting algorithm, which we call SecureBoost, in the setting of federated learning. As shown in Figure 1, SecureBoost consists of two major steps. First, it aligns the data under the privacy constraint. Second, it collaboratively learns a shared gradient tree-boosting model while keeping all the training data secret over multiple private parties. Below, we explain each part in turn.
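As background for the second step, recall the XGBoost-style statistics that gradient tree boosting operates on: each instance contributes a first-order gradient g_i and a second-order gradient h_i of the loss, and every candidate split can be scored from sums of g and h on each side alone. A minimal sketch with our own function names, assuming logistic loss and a single candidate split:

```python
import numpy as np

def grad_hess_logloss(y, raw_score):
    """First- and second-order gradients of logistic loss w.r.t. the raw score,
    as used in second-order gradient tree boosting."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)

def split_gain(g, h, left_mask, reg_lambda=1.0):
    """Regularized split gain; only per-side sums of g and h are needed."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + reg_lambda)
    gl, hl = g[left_mask], h[left_mask]
    gr, hr = g[~left_mask], h[~left_mask]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(g, h))

y = np.array([0.0, 1.0, 1.0, 0.0])
raw = np.zeros(4)                      # boosting starts from a constant score
g, h = grad_hess_logloss(y, raw)
# A split that separates the two classes yields a positive gain.
gain = split_gain(g, h, left_mask=np.array([True, False, False, True]))
```

The key observation such protocols exploit is that split evaluation depends only on aggregate sums of g and h per candidate partition, never on raw labels or raw features, which is what makes exchanging these statistics under encryption feasible.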