arXiv:1803.04035v2 [cs.DB] 20 Mar 2018

Entity Resolution and Federated Learning get a Federated Resolution

Richard Nock∗, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law,
Giorgio Patrini†, Guillaume Smith, Brian Thorne

N1 Analytics / Data61
firstname.lastname@data61.csiro.au
g.patrini@uva.nl

∗ The Australian National University & the University of Sydney
† Now with the University of Amsterdam
Abstract
Consider two data providers, each maintaining records of different feature sets about common entities. They aim to learn a linear model over the whole set of features. This problem of federated learning over vertically partitioned data includes a crucial upstream issue: entity resolution, i.e. finding the correspondence between the rows of the datasets. It is well known that entity resolution, just like learning, is mistake-prone in the real world. Despite the importance of the problem, there has been no formal assessment of how errors in entity resolution impact learning.

In this paper, we provide a thorough answer to this question, answering how optimal classifiers, empirical losses, margins and generalisation abilities are affected. While our answer spans a wide set of losses — going beyond proper, convex, or classification calibrated —, it brings simple practical arguments to upgrade entity resolution as a preprocessing step to learning. One of these suggests that entity resolution should be aimed at controlling or minimizing the number of matching errors between examples of distinct classes. In our experiments, we modify a simple token-based entity resolution algorithm so that it indeed aims at avoiding matching rows belonging to different classes, and perform experiments in the setting where entity resolution relies on noisy data, which is very relevant to real world domains. Notably, our approach covers the case where one peer does not have classes, or has only a noisy record of classes. Experiments display that using the class information during entity resolution can buy significant uplift for learning at little expense from the complexity standpoint.
1 Introduction
With the ever-expanding collection of data, it is becoming common practice for organisations to cooperate with the objective of leveraging their joint collections of data [12, 14], with a wider push to create and organise data marketplaces as followers to the more monolithic data warehouse [31]. Organisations are fully aware of the potential gain of combining their data assets, specifically in terms of increased statistical power for analytics and predictive tasks. For example, hospitals and medical facilities could leverage the medical history of common patients in order to prevent chronic diseases and risks of future hospitalisation.
The problem of learning models using the data collected and kept/maintained by different parties — federated learning for short [20] — has become as much a necessity as a concrete research challenge, expanding beyond machine learning through fields like databases and privacy. Among other features, work in the area can be classified in terms of (a) whether the data is vertically or horizontally partitioned and (b) the family of models being learned. The overwhelming majority of previous work on secure distributed learning considers a horizontal data partition, in which data providers record the same features for different entities. Solutions can take advantage of the separability of loss functions, which decompose the loss by examples. Relevant approaches can be found e.g. in [33] (and references therein).
In a vertical data partition, which is our setting, data providers can record different features for the same entities. The vertical data partition case is more challenging than the horizontal one [14]. To see this, notice that in the latter case, gathering all the data in one place makes any conventional learning algorithm fit to learn from the whole data. In the vertical partition case however, gathering the data in one place would not solve the problem, since we would still have to figure out the correspondence between entities of the different datasets to learn from the union of all features. The vertical data partition is more relevant to the setting where different organisations sit in the same market, thus aggregating different features for the same customers. The technical problem to overcome is that loss functions are in general not separable over features. With the exception of the unhinged loss [30], this would be the case for most proper, classification calibrated and/or non-convex losses [3, 24, 27]. A way to overcome this problem is to join the datasets upstream, using a broad family of techniques we refer to as entity resolution (or entity matching, record linkage, [8]). For the whole pipeline — from matching to learning — to be fully and properly optimized, taking into account possible additional constraints (like privacy), it is paramount to tackle and answer the following question:
"how does ent ity-resolut ion impact learning ?",
in particular because error-free entity resolution is often not available in the real-world [18], see
Figure 1. Case studies report that exact matching can be very damaging when identifiers are not
stable and error-prone: 25% of true matches would h ave been missed by exact m atching in a census
operation [29, 32]. In fact, one might expect such errors to just snowball with those of learning:
for example, wrong matches of a hospital database with pharmaceutical records with the objective
to improve preventive treatments coul d be disastrous on the predictive performances of a model
learned from the joined databases.
To our knowledge, there has been no formal treatment of this question so far, and the question
is open not just for machine learning as post-processing step to entity-resolution [15]. A s a con-
sequence perhaps, some work just assumes that th e so lution to entity-resolution is known a priori
[14].
Figure 1: The problem of entity resolution. In this example, peers A and B share common features (name, date of birth — DOB) that could be used to craft a unique identifier, but the entries are noisy, so it becomes hard to match rows between peers.

Our contribution — In this paper, we provide the first detailed answer to this question and hint at how it can be used to improve entity resolution as an upstream process to federated learning with
vertically partitioned data. We focus on a popular class of models for federated learning, linear
models [14, 33]. To summarize our theoretical contribution, we bound the variation of several key quantities as computed from the error-prone entity-resolved dataset on one hand, and also from the ideal dataset for which we would know the optimal correspondence on the other hand. These key quantities include:

(i) the relative deviation between the optimal classifiers;
(ii) the deviation between their respective losses;
(iii) the deviation in their respective generalization abilities.

More importantly, we carry out this analysis for any Ridge-regularized loss that satisfies some mild differentiability conditions, thus not necessarily being convex, nor classification-calibrated, nor even proper.
Overall, our results shed light on large margin classification in the context of federated learning, and how it brings resilience in learning after entity resolution. Indeed, we show that it yields immunity to entity resolution mistakes — examples receive the right class from the classifier learned from error-prone entity-resolved data if they would receive large margin classification from the optimal, "ideal" classifier learned from the ideal data. Federated learning in the vertical partition setting increases the number of features and is thereby likely to increase margins as well. Hence, such a theoretical result on immunity represents a very strong argument for federated learning.
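To make the mechanism behind this immunity concrete, here is a minimal worked inequality; it illustrates the generic large-margin argument under simple assumptions and is not the paper's precise theorem. Suppose the classifier $\hat{\theta}$ learned after entity resolution satisfies $\|\hat{\theta} - \theta^*\|_2 \leq \delta$ for the ideal classifier $\theta^*$, and observations satisfy $\|x\|_2 \leq R$. Then Cauchy-Schwarz gives

$$|\hat{\theta}^\top x - \theta^{*\top} x| = |(\hat{\theta} - \theta^*)^\top x| \leq \|\hat{\theta} - \theta^*\|_2 \, \|x\|_2 \leq \delta R \;,$$

$$\text{so } \operatorname{sign}(\hat{\theta}^\top x) = \operatorname{sign}(\theta^{*\top} x) \quad \text{whenever the ideal margin satisfies } |\theta^{*\top} x| > \delta R \;.$$

Bounding the deviation between the two classifiers (item (i) above) therefore translates directly into a margin threshold below which entity resolution mistakes cannot flip a prediction.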
On a broader agenda including impacts for practical entity resolution algorithms, our analysis suggests that there exists a small set of controls, defined from entity resolution mistakes, that essentially drive all the deviations highlighted before. Being able to control them essentially leads to a strong handle on how entity resolution impacts learning, from the classifier learned to its rates for generalization, with respect to the ideal classifier. The most prominent of these knobs is the errors made by entity resolution across classes, i.e. wrongly linking observations that belong to different classes. Our theory suggests that focusing on such mistakes during entity resolution can bring significant leverage for the classifier learned afterwards. We exemplify this experimentally, by modifying a simple token-based greedy entity resolution algorithm to integrate the constraint of carrying out entity resolution within classes [10, 15], assuming that one peer has knowledge of the classes but the other one may not — either classes are noisy or just not present. We perform experiments on fifteen distinct UCI domains, simulated to investigate the key parameters of federated learning in the setting where peers share the knowledge of some features (such as gender, age, postal code for customers), which can furthermore be noisy. Experiments display that even when only one peer has the knowledge of classes, significant improvements can be obtained over the approach that performs entity resolution without using classes, and can even compete with the result of the learner that has access to the (unknown) ideally entity-resolved data.
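The modified algorithm and its evaluation appear in the experimental section (Section 8), which lies beyond this excerpt. The following is only a minimal sketch of the general idea (greedy token-based matching with a within-class constraint); the function name, the Jaccard token similarity, and the dictionary-based record format are assumptions of this illustration, not details taken from the paper.

```python
def greedy_token_match(records_a, records_b, labels_a=None, labels_b=None):
    """Greedy token-based entity resolution, restricted to within-class matches
    when class information is available on both sides.

    records_a, records_b : dicts id -> set of tokens (e.g. character n-grams of
                           noisy shared features such as name or date of birth)
    labels_a, labels_b   : optional dicts id -> class label in {-1, +1}
    Returns a list of (id_a, id_b) matched pairs.
    """
    def jaccard(s, t):
        return len(s & t) / len(s | t) if (s | t) else 0.0

    # Score every cross pair, skipping pairs whose known classes disagree.
    candidates = []
    for ia, ta in records_a.items():
        for ib, tb in records_b.items():
            if labels_a is not None and labels_b is not None \
               and labels_a[ia] != labels_b[ib]:
                continue                      # forbid across-class matches
            candidates.append((jaccard(ta, tb), ia, ib))

    # Greedily keep the best-scoring pairs, using each record at most once.
    matches, used_a, used_b = [], set(), set()
    for score, ia, ib in sorted(candidates, key=lambda c: c[0], reverse=True):
        if ia in used_a or ib in used_b:
            continue
        matches.append((ia, ib))
        used_a.add(ia)
        used_b.add(ib)
    return matches

# Toy usage with 2-gram token sets of a noisy shared key (purely illustrative):
recs_a = {1: {"jo", "on", "n1", "19", "99"}, 2: {"an", "nn", "n2", "20", "01"}}
recs_b = {"x": {"jo", "on", "n1", "19", "98"}, "y": {"an", "nn", "n2", "20", "01"}}
print(greedy_token_match(recs_a, recs_b, {1: +1, 2: -1}, {"x": +1, "y": -1}))
```

When labels are missing or noisy on one side, the class filter can be dropped or relaxed for the affected pairs, which is the spirit of the setting studied in the experiments.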
The rest of this paper is organised as follows. Section 2 gives definitions. Section 3 shows how to reduce the analysis for a general loss to that of a specific kind of loss called the Taylor loss. Sections 4 through 7 develop our theoretical results, and Section 8 provides experiments. A last section discusses and concludes our paper. An Appendix, starting page 24, provides all proofs.
2 Definitions
Supervised learning, losses — Let $[n] = \{1, 2, ..., n\}$. In the ordinary batch supervised learning setting, one is given a set of $m$ examples $\hat{S} \doteq \{(\hat{x}_i, y_i), i \in [m]\}$, where $\hat{x}_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is an observation ($\mathcal{X}$ is called the domain) and $y_i \in \{-1, 1\}$ is a label, or class (the "hat" notation shall be explained below). Our objective is to learn a linear classifier $\theta \in \Theta$ for some fixed $\Theta \subseteq \mathbb{R}^d$. $\theta$ gives a label to some $x \in \mathcal{X}$ equal to the sign of $\theta^\top x \in \mathbb{R}$. The goodness of fit of $\theta$ on $\hat{S}$ is measured by a loss function. We essentially consider two categories of losses. The first is the set of Ridge-regularized losses. Each element, $\ell_F$, is defined by $\ell_F(\hat{S}, \theta; \gamma, \Gamma) \doteq L + R$ with

$$L \doteq \frac{1}{m} \cdot \sum_i F(y_i \theta^\top \hat{x}_i) \;, \qquad R \doteq \gamma \theta^\top \Gamma \theta \;. \tag{1}$$

Here, $\gamma > 0$ and $\Gamma$ is symmetric positive definite. $F : \mathbb{R} \to \mathbb{R}$ is $C^2$ and satisfies $|F'(0)|, |F''(0)| \ll \infty$, where "$\ll$" means finite. Note that this is a very general definition, as for example we do not assume that $F$ is convex nor even classification calibrated [3].
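As a concrete reading of eq. (1), here is a minimal NumPy sketch; the function name, the row-per-example data layout, and the logistic choice of F are assumptions of this illustration, not code from the paper.

```python
import numpy as np

def ridge_regularized_loss(F, X_hat, y, theta, gamma, Gamma):
    """l_F(S_hat, theta; gamma, Gamma) = L + R as in eq. (1):
    L = (1/m) * sum_i F(y_i * theta^T x_hat_i),  R = gamma * theta^T Gamma theta.

    F     : vectorized callable R -> R (need not be convex or classification calibrated)
    X_hat : (m, d) array, one observation per row;  y : (m,) labels in {-1, +1}
    theta : (d,) linear classifier;  gamma > 0;  Gamma : (d, d) symmetric positive definite
    """
    edges = y * (X_hat @ theta)           # y_i * theta^T x_hat_i for every example i
    L = np.mean(F(edges))                 # empirical part of the loss
    R = gamma * theta @ Gamma @ theta     # Ridge regularizer
    return L + R

# One choice of F satisfying the C^2 condition: the logistic loss.
logistic = lambda z: np.log1p(np.exp(-z))
```

Any other C^2 function with finite F'(0) and F''(0) can be plugged in the same way.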
The other set of losses we consider, called Taylor losses, is such that $L$ simplifies to a degree-two polynomial:

$$L \doteq a + \frac{b}{m} \cdot \sum_i y_i \theta^\top \hat{x}_i + \frac{c}{m} \cdot \sum_i \left( y_i \theta^\top \hat{x}_i \right)^2 \;, \tag{2}$$
with a, b, c ∈ R. Taylor losses have been used in secure federated learning [1, 12].
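Taylor losses are thus the special case of eq. (1) in which F is itself a degree-two polynomial, F(z) = a + b z + c z^2. A minimal self-contained sketch follows; the coefficient choice shown (the order-2 Taylor expansion of the logistic loss around 0, i.e. a = log 2, b = -1/2, c = 1/8) is a common one in the secure federated learning literature, and the function name is an assumption of this illustration.

```python
import numpy as np

def taylor_ridge_loss(X_hat, y, theta, a, b, c, gamma, Gamma):
    """Ridge-regularized Taylor loss: L of eq. (2) plus R = gamma * theta^T Gamma theta.
    X_hat : (m, d) observations, one row per example;  y : (m,) labels in {-1, +1}."""
    m = X_hat.shape[0]
    edges = y * (X_hat @ theta)                            # y_i * theta^T x_hat_i
    L = a + (b / m) * edges.sum() + (c / m) * np.sum(edges ** 2)
    return L + gamma * theta @ Gamma @ theta

# Order-2 Taylor expansion of log(1 + exp(-z)) around z = 0:
# log 2 - z/2 + z^2/8, i.e. a = log 2, b = -1/2, c = 1/8.
rng = np.random.default_rng(0)
m, d = 200, 5
X_hat = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)
print(taylor_ridge_loss(X_hat, y, theta, np.log(2), -0.5, 0.125, 0.1, np.eye(d)))
```

One reason Taylor losses appear in secure federated learning is that the degree-two structure only involves sums and squares of the inner products $y_i \theta^\top \hat{x}_i$, which are comparatively easy to evaluate under encryption or secret sharing.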
Federated learning — In federated learning, $\hat{S}$ is built from separate data-handling sources, called peers. In our vertical partition setting, we have two peers A and B, each of which has the description of the $m$ examples on a subset of the $d$ features. It may be the case that only one peer (A by default) has labels. In addition to learning a classifier, federated learning thus faces the mandatory preprocessing step of matching rows in the datasets of A and B to build dataset $\hat{S}$, a preprocessing step we define as entity resolution [8].

The observed dataset $\hat{S}$ is created from an unknown dataset $S \doteq \{(x_i, y_i), i \in [m]\}$ whose columns have been split between A and B. If we define $X \in \mathbb{R}^{d \times m}$ as the matrix storing (columnwise) the observations of $S$, then each row of $X$ is held by A or B. The "or" need not be exclusive, as some rows may be present in both A and B [25]. Also, duplicating rows in $X$ does not change the learning problem. There is thus both an ideal $X$ and an estimated observation matrix $\hat{X}$ giving the observations of $\hat{S}$ and built from entity resolution. To understand how the differences between $\hat{X}$ and $X$ impact learning, we need to drill down into the formalization of $\hat{X}$. Both matrices can be represented by block matrices, with each distinct feature row present exactly once, as:

$$X \doteq \begin{bmatrix} X_A \\ X_B \end{bmatrix}, \qquad \hat{X} \doteq \begin{bmatrix} X_A \\ \hat{X}_B \end{bmatrix}, \qquad \hat{X}_B \doteq X_B P^* \;, \tag{3}$$
where $P^* \in \{0, 1\}^{m \times m}$ is a permutation matrix (unknown) capturing the mistakes of entity resolution if $P^* \neq I_m$ (the identity matrix). From convention (3), the features of A are not affected by entity resolution: we call them anchor features. Because the features of B are affected by entity resolution, we call them shuffle features. A folklore fact [6] (Chapter I.5) is that any permutation matrix can be factored as a product of elementary permutation matrices, each of which swaps two rows/columns of $I_m$. So, suppose

$$P^* = \prod_{t=1}^{T} P_t \;, \tag{4}$$

where $P_t$ is an elementary permutation matrix and $T$, the size of $P^*$, is unknown. We let $u_A(t), v_A(t) \in [m]$ be the two column indexes in A affected by $P_t$.
$\hat{X}$ can be progressively constructed from a sequence $\hat{X}_0, \hat{X}_1, ..., \hat{X}_T$, where $\hat{X}_0 = X$, $\hat{X}_T = \hat{X}$ and, for $t \geq 1$,

$$\hat{X}_t \doteq \begin{bmatrix} X_A \\ \hat{X}_{tB} \end{bmatrix}, \qquad \hat{X}_{tB} \doteq X_B \prod_{j=1}^{t} P_j \;. \tag{5}$$

Let $\hat{X}_t \doteq [\hat{x}_{t1} \; \hat{x}_{t2} \; \cdots \; \hat{x}_{tm}]$ denote the column vector decomposition of $\hat{X}_t$ (with $\hat{x}_{0i} \doteq x_i$) and let $\hat{S}_t$ be the training sample obtained from the $t$ first permutations in the sequence. Hence, $\hat{S}_0 = S$, $\hat{S}_T = \hat{S}$ and $\hat{S}_t \doteq \{(\hat{x}_{ti}, y_i), i \in [m]\}$. We let $u_B(t)$ (resp. $v_B(t)$) denote the indices in $[m]$ of the shuffle features in $X$ that are in observation $u_A(t)$ (resp. $v_A(t)$) and that will be permuted by $P_t$, creating $\hat{X}_t$ from $\hat{X}_{t-1}$. For example, if $u_B(t) = v_A(t)$ and $v_B(t) = u_A(t)$, then $P_t$ correctly reconstructs the observations at indexes $u_A(t)$ and $v_A(t)$ in $X$. Figure 2 illustrates the use of these notations.
Key parameters of $P^*$ — it is clear that all mistakes of entity resolution are captured by $P^*$, so it is not surprising that all our results depend on some key parameters of $P^*$. A key property is how errors "accumulate" through the factorization of $P^*$ in eq. (4). Hereafter, $w_F$ for $w \in \mathbb{R}^d$ denotes the subvector of $w$ containing the features of peer $F \in \{A, B\}$.
Definition 1 We say that $P_t$ is $(\varepsilon, \tau)$-accurate for some $\varepsilon, \tau \geq 0$, $\varepsilon \leq 1$, iff for any $w \in \mathbb{R}^d$,

$$|(\hat{x}_{ti} - x_i)_B^\top w_B| \leq \varepsilon \cdot |x_i^\top w| + \tau \|w\|_2 \;, \quad \forall i \in [m] \;, \tag{6}$$

$$|(x_{u_F(t)} - x_{v_F(t)})_F^\top w_F| \leq \varepsilon \cdot \max_{i \in \{u_F(t), v_F(t)\}} |x_i^\top w| + \tau \|w\|_2 \;, \quad \forall F \in \{A, B\} \;. \tag{7}$$

We say that $P^*$ is $(\varepsilon, \tau)$-accurate iff each $P_t$ is $(\varepsilon, \tau)$-accurate, $\forall t = 1, 2, ..., T$.