arXiv:1901.08755v1 [cs.LG] 25 Jan 2019
SecureBoost: A Lossless Federated Learning Framework
Kewei Cheng¹, Tao Fan², Yilun Jin³, Yang Liu², Tianjian Chen², Qiang Yang⁴
1. University of California, Los Angeles, Los Angeles, USA
2. Webank, Shenzhen, China
3. Peking University, Beijing, China
4. Hong Kong University of Science and Technology, Hong Kong
tobychen@webank.com, qyang@cse.ust.hk
Abstract
The protection of user privacy is an important concern in machine learning, as evidenced by the rollout of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine-learning frameworks that enable data sharing without violating user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system, known as SecureBoost, in the setting of federated learning. This federated-learning system allows a learning process to be jointly conducted over multiple parties with partially common user samples but different feature sets, which corresponds to a vertically partitioned virtual data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while, at the same time, revealing no information about any private data provider. We theoretically prove that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that bring the data into one place. In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.
Introduction
Modern society is increasingly concerned with the unlawful use and exploitation of personal data. At the individual level, improper use of personal data poses risks to user privacy. At the enterprise level, data leakage may have grave consequences for commercial interests. Actions are being taken by different societies. For example, the European Union has recently enacted a law known as the General Data Protection Regulation (GDPR). The GDPR is designed to give users more control over their personal data (Regulation 2016; Albrecht 2016; Mayer-Schonberger and Padova 2015; Goodman and Flaxman 2016). Many enterprises that rely heavily on machine learning are beginning to make sweeping changes as a consequence.
Despite the difficulty of meeting the goal of user-privacy protection, the need for different organizations to collaborate while building machine-learning models remains strong. In reality, many data owners do not have a sufficient amount of data to build high-quality models. For example, retail companies have user-transaction data, which corresponds to different data dimensions or features than those held by credit-rating companies. Likewise, mobile-phone users have their own usage data, but each device holds only a small amount of user-activity data. To obtain a usable model for user-preference prediction, it is necessary to integrate the data collected by the clients.
Thus, the challenge is to allow different data owners to collaborate in building high-quality machine-learning models while, at the same time, protecting user-data privacy and confidentiality. In the past, several attempts have been made to address the user-privacy problem while exchanging data (Hardy et al. 2017; Mohassel and Zhang 2017). For example, Apple proposed to use differential privacy (Dwork, Roth, and others 2014; Dwork 2008) to address the privacy-preservation issue. The basic idea of differential privacy (DP) is to add properly calibrated noise to data to disguise the identity of any individual when the data is exchanged and analyzed by a third party. However, as we discuss in this paper, DP only prevents user-data leakage to a certain degree and cannot completely rule out identifying an individual. In addition, data exchange under DP still requires that the data change hands between organizations, which may not be allowed by strict laws like the GDPR. Furthermore, the DP method is lossy for machine learning, in that models built after noise injection can suffer a marked loss in prediction accuracy.
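The calibrated-noise idea can be illustrated with a minimal sketch of the Laplace mechanism (function and variable names are our own; this is a didactic illustration, not a production DP implementation): a query answer is perturbed with noise whose scale is the query's sensitivity divided by the privacy budget ε.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52])

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
true_count = int(np.sum(ages > 30))
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
# Smaller epsilon means stronger privacy but a noisier, less accurate answer.
```

The accuracy loss the text mentions is visible here: every released statistic carries noise, and that noise propagates into any model trained on the perturbed data.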
More recently, Google introduced a federated learning framework (Konečný et al. 2016) on its Android cloud. The basic idea is to allow individual clients to encrypt their models, which are then uploaded and aggregated at a central cloud site. The machine-learning process at that site can make use of these encrypted models without leaking the clients' information. This framework applies to a data-partition setting where each partition corresponds to a subset of data samples collected from one or more users.
In this paper, we consider a general setting in which multiple parties collaboratively build their machine-learning models while protecting user privacy and data confidentiality. Our setting is shown in Figure 2. We consider a collection of parties, each holding part of its own data. We can visualize the data located at the different parties as a subsection of a big data table obtained by taking the union of all data at the different parties. The data at each party then has the following properties:
Figure 1: Illustration of the proposed SecureBoost framework (an active party and passive parties, each holding a sub-model, exchange confidential information via privacy-preserving entity alignment and intermediate computation exchange).
Figure 2: Vertically partitioned data set (Party 1 holds features X1, X2 and label Y for users U1, U2, U3; Party 2 holds features X3, X4, X5 for users U1, U2, U4; the virtually joint table covers the common users U1, U2).
1. The big data table is vertically split, such that the data are divided along the feature dimension among the parties;
2. only one data provider has the label information;
3. the users partially overlap across different parties.
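The three properties above can be made concrete with a small sketch using made-up users and values that mirror Figure 2 (the dictionaries and ids are ours, for illustration only):

```python
# Party 1 (active party): features X1, X2 plus the label Y, keyed by user id.
party1 = {"U1": {"X1": 0.2, "X2": 1.0, "Y": 0},
          "U2": {"X1": 0.5, "X2": 0.0, "Y": 1},
          "U3": {"X1": 0.9, "X2": 1.0, "Y": 1}}

# Party 2 (passive party): disjoint features X3-X5 for a partially
# overlapping user set.
party2 = {"U1": {"X3": 3.1, "X4": 0, "X5": 5.5},
          "U2": {"X3": 2.7, "X4": 1, "X5": 6.1},
          "U4": {"X3": 4.0, "X4": 0, "X5": 4.8}}

# The virtually joint table exists only over the intersection of user ids;
# no single party ever holds all of its columns.
common = sorted(party1.keys() & party2.keys())
joint = {u: {**party1[u], **party2[u]} for u in common}
```

In the federated setting this join is never materialized in one place; the sketch only shows which virtual table the parties are implicitly learning over.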
Our goal is then to allow each party to build a prediction model for some designated label, while preventing any party from obtaining information about the data of the other parties.
Our setting presents several distinctive challenges. In contrast with most existing work on privacy-preserving data mining and machine learning, the complexity of our setting is significantly increased. Unlike the situation where the data are horizontally split, the above setting requires a more complex mechanism to decompose the loss function at each party (Vaidya 2008; Vaidya and Clifton 2005; Hardy et al. 2017). In addition, in each model-building process, only one data provider owns the label information, which requires a secure protocol to guide the learning process instead of sharing label information explicitly among all parties. Finally, data-confidentiality and privacy concerns prevent the parties from exposing those of their users who are not common to the group when building the models. Hence, entity alignment must also be conducted in a sufficiently secure manner.
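To convey the entity-alignment idea, here is a deliberately naive salted-hash sketch (our own names and salt; note this is only illustrative — salted hashes of low-entropy ids are brute-forceable, and the naive exchange still reveals each side's hashed list, so production systems use proper cryptographic private-set-intersection protocols instead):

```python
import hashlib

def blinded_ids(user_ids, shared_salt):
    """Map raw user ids to salted hashes so raw ids need not be exchanged."""
    return {hashlib.sha256((shared_salt + uid).encode()).hexdigest(): uid
            for uid in user_ids}

# Assumption: the parties agreed on a secret salt out of band.
salt = "shared-secret-salt"
active = blinded_ids(["U1", "U2", "U3"], salt)
passive = blinded_ids(["U1", "U2", "U4"], salt)

# Comparing hashed ids lets each side recover the common users without
# ever sending raw identifiers.
common_hashes = active.keys() & passive.keys()
common_users = sorted(active[h] for h in common_hashes)
```

The output of this step — the set of common users — is exactly the index of the virtually joint table over which the model is subsequently trained.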
In this paper, we propose a novel end-to-end privacy-preserving tree-boosting algorithm and framework, known as SecureBoost, to enable machine learning in a federated setting. Unlike previous federated-learning frameworks that split the data along the user dimension, our framework enables collaborative model building when the data are split among different parties along the feature dimension. Our federated-learning framework operates in two steps. First, we find the common users among the parties under a privacy-preserving constraint. Then, we collaboratively learn a shared classification or regression model without leaking any user information to each other. We summarize our main contributions as follows:
• We formally define a novel problem of privacy-preserving machine learning over vertically partitioned data in the setting of federated learning.
• We present an approach to collaboratively train a high-quality tree-boosting model for each party while keeping the training data secret across multiple parties, and we do so without the participation of a trusted third party.
• Finally and importantly, we prove that our approach is lossless, in the sense that it is as accurate as any centralized non-privacy-preserving method that brings all data to one location.
• In addition, along with a proof of security, we discuss what would be required to make the protocols completely secure.
Preliminaries and Related Work
The existing literature on privacy-preserving machine learning broadly addresses two objectives: privacy of the data used for learning a model, and privacy of the data used as input to an existing model. To protect the privacy of the data used for learning, the authors of (Shokri and Shmatikov 2015; Abadi et al. 2016) propose to take advantage of differential privacy when training a deep-learning model. As one of the most popular privacy-preserving techniques, differential privacy (Dwork 2008) protects sensitive data by injecting noise into the raw datasets such that the amount of information leaked from an individual record is minimized. Even though differential privacy ensures a low probability of identifying an individual record, some probability of leakage remains, which conflicts with the requirements of the GDPR. To address these problems, Google introduced a federated learning framework that brings model training to each mobile terminal (Konečný et al. 2016); it achieves privacy protection by forbidding the data from being transferred off the device. Another family of privacy-preserving techniques focuses on the inference stage instead of the training stage. Microsoft proposed a cryptographic deep-learning framework, CryptoNets (Gilad-Bachrach et al. 2016), based on homomorphic encryption, which enables a trained neural network to make encrypted predictions over encrypted data. However, it has to sacrifice accuracy to obtain security. In (Rouhani, Riazi, and Koushanfar 2017), another framework, DeepSecure, is proposed to securely conduct deep-learning execution on encrypted data using Yao's Garbled Circuit (GC) protocol. Although it does not involve a trade-off between utility and privacy, it suffers from serious inefficiency.
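The homomorphic-encryption idea referenced above can be made concrete with a toy additively homomorphic (Paillier) sketch. The tiny primes, function names, and parameters below are our own choices for illustration — this is a didactic sketch, not a secure implementation; real deployments use keys of 2048 bits or more.

```python
import math
import random

# Toy Paillier keypair with tiny primes; messages must stay below n.
p, q = 17, 19
n = p * q                      # public modulus
n2 = n * n
g = n + 1                      # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)         # L(g^lam mod n^2)^-1 mod n

def encrypt(m):
    """Encrypt message m < n: c = g^m * r^n mod n^2 for random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    """Decrypt: m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts,
# so an untrusted party can aggregate encrypted statistics it cannot read.
a, b = 15, 27
assert decrypt(encrypt(a) * encrypt(b) % n2) == a + b
```

The additive property is exactly what lets a party sum encrypted per-instance statistics on behalf of another party without learning the underlying values.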
All the above methods are designed for horizontally partitioned data, where the data providers record the same features for different entities. We consider a vertical data partition, as shown in Figure 2, in which multiple parties record different features at different sites. Different from horizontal partitioning, which assumes that aggregation happens over data samples, a vertical partition builds a model over a common set of users, and how to collaboratively build such a model is an open question. Some previous works discuss privacy-preserving decision trees over vertically partitioned data (Vaidya and Clifton 2005; Vaidya et al. 2008). However, their proposed methods have to reveal the class distribution over the given attributes, which causes a potential security risk. In addition, they can only handle discrete data, which is less practical for real-life scenarios. In contrast, our method guarantees more secure protection of the data and easily applies to continuous data. In (Djatmiko et al. 2017), Patrini et al. proposed a framework to jointly perform logistic regression over encrypted vertically partitioned data by approximating the non-linear logistic loss with a Taylor expansion. Clearly, with this approximation, the algorithm inevitably incurs a loss of accuracy. To the contrary, we propose a novel approach that is lossless in nature. We believe that the SecureBoost framework is the first attempt at privacy-preserving federated learning over vertically partitioned data that balances accuracy and security.
Problem Statement
We now formally define our problem and clarify the difference between our setting and previous works. Let $\{X^k \in \mathbb{R}^{n_k \times d_k}\}_{k=1}^{m}$ be the data matrices distributed over $m$ private parties, with each row $X^k_{i*} \in \mathbb{R}^{1 \times d_k}$ being a data instance. We use $F^k = \{f_1, \ldots, f_{d_k}\}$ to denote the feature set of the corresponding data matrix $X^k$. If we consider all data as coming from a virtual big data table involving all users and all features, then we can view the data as being vertically split from a large virtual table across different parties, such that each party holds a different set of vertically partitioned data over a subset of users. Any two parties $p$ and $q$ have disjoint feature sets, denoted as $F^p \cap F^q = \emptyset,\ \forall p \neq q \in \{1, \ldots, m\}$. Different data providers may hold different sets of users as well, allowing some degree of overlap; that is, the user sets of sizes $n_1, \ldots, n_m$ may differ from each other. As mentioned before, when building a model for a common task, we consider that only one of the data providers has a class attribute for classification or regression. We denote the class label as $y \in \mathbb{R}^{n_k \times 1}$, where the class label is held by the $k$-th party.
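The notation can be grounded in a small sketch with made-up values (our own toy matrices, mirroring a two-party instance of the setting):

```python
import numpy as np

# m = 2 parties. X^k has shape (n_k, d_k), and the parties' feature
# sets are disjoint: each party holds its own columns of the virtual table.
X1 = np.array([[0.2, 1.0],             # party 1: n_1 = 3 instances, d_1 = 2 features
               [0.5, 0.0],
               [0.9, 1.0]])
X2 = np.array([[3.1, 0.0, 5.5],        # party 2: n_2 = 3 instances, d_2 = 3 features
               [2.7, 1.0, 6.1],
               [4.0, 0.0, 4.8]])
y = np.array([[0], [1], [1]])          # label vector in R^{n_1 x 1}, held by party 1

F1, F2 = {"f1", "f2"}, {"f3", "f4", "f5"}
assert F1 & F2 == set()                # F^p and F^q are disjoint for p != q
```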
Definition 1. Active Party:
We define the active party as the data provider who holds both a data matrix and the class label.
Since the class-label information is indispensable for supervised learning, there must be an active party with access to the label y. The active party naturally takes the role of the dominating server in federated learning.
Definition 2. Passive Party:
We define a data provider that holds only a data matrix as a passive party.
Passive parties play the role of clients in the federated-learning setting. They too need a model to predict the class label y for their own prediction purposes, and thus must collaborate with the active party to build a model that predicts y for their future users using their own features.
The problem of privacy-preserving machine learning over vertically partitioned data in federated learning can be stated as follows:
Given: a vertically partitioned data matrix $\{X^k\}_{k=1}^{m}$ distributed over $m$ private parties, and the class labels $y$ held by the active party.
Learn: a machine-learning model $M$ without giving away information about the data matrix of any party to the others in the process. The model $M$ is a function that has a projection $M^i$ at each party $i$, such that $M^i$ takes as input only that party's own features $X^i$.
Lossless Constraint: We require that the model $M$ be lossless, which means that the loss of $M$ under federated learning over the training data is the same as the loss of $M'$ when $M'$ is built on the union of all data.
Federated Learning with SecureBoost
As one of the most widely used machine-learning algorithms, the gradient tree-boosting model (Friedman et al. 2000) excels in many machine-learning tasks, such as fraud detection (Oentaryo et al. 2014), feature selection (Li et al. 2017), and product recommendation (He et al. 2014). In this section, we propose a novel gradient tree-boosting algorithm, which we call SecureBoost, in the setting of federated learning. As shown in Figure 1, SecureBoost consists of two major steps. First, it aligns the data under the privacy constraint. Second, it collaboratively learns a shared gradient tree-boosting model while keeping all the training data secret over multiple private parties. Below, we explain each part in turn.
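As background for the second step, recall the XGBoost-style statistics that gradient tree boosting operates on: each instance contributes a first-order gradient g_i and a second-order gradient h_i of the loss, and every candidate split can be scored from sums of g and h on each side alone. A minimal sketch with our own function names, assuming logistic loss and a single candidate split:

```python
import numpy as np

def grad_hess_logloss(y, raw_score):
    """First- and second-order gradients of logistic loss w.r.t. the raw score,
    as used in second-order gradient tree boosting."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)

def split_gain(g, h, left_mask, reg_lambda=1.0):
    """Regularized split gain; only per-side sums of g and h are needed."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + reg_lambda)
    gl, hl = g[left_mask], h[left_mask]
    gr, hr = g[~left_mask], h[~left_mask]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(g, h))

y = np.array([0.0, 1.0, 1.0, 0.0])
raw = np.zeros(4)                      # boosting starts from a constant score
g, h = grad_hess_logloss(y, raw)
# A split that separates the two classes yields a positive gain.
gain = split_gain(g, h, left_mask=np.array([True, False, False, True]))
```

The key observation such protocols exploit is that split evaluation depends only on aggregate sums of g and h per candidate partition, never on raw labels or raw features, which is what makes exchanging these statistics under encryption feasible.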