arXiv:1803.04035v2 [cs.DB] 20 Mar 2018

Entity Resolution and Federated Learning get a Federated Resolution

Richard Nock∗, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law,
Giorgio Patrini†, Guillaume Smith, Brian Thorne

N1 Analytics / Data61
firstname.lastname@data61.csiro.au
g.patrini@uva.nl

∗ The Australian National University & the University of Sydney
† Now with the University of Amsterdam
Abstract
Consider two data providers, each maintaining records of different feature sets about common entities. They aim to learn a linear model over the whole set of features. This problem of federated learning over vertically partitioned data includes a crucial upstream issue: entity resolution, i.e. finding the correspondence between the rows of the datasets. It is well known that entity resolution, just like learning, is mistake-prone in the real world. Despite the importance of the problem, there has been no formal assessment of how errors in entity resolution impact learning.

In this paper, we provide a thorough answer to this question, answering how optimal classifiers, empirical losses, margins and generalisation abilities are affected. While our answer spans a wide set of losses — going beyond proper, convex, or classification calibrated —, it brings simple practical arguments to upgrade entity resolution as a preprocessing step to learning. One of these suggests that entity resolution should be aimed at controlling or minimizing the number of matching errors between examples of distinct classes. In our experiments, we modify a simple token-based entity resolution algorithm so that it indeed aims at avoiding matching rows belonging to different classes, and perform experiments in the setting where entity resolution relies on noisy data, which is very relevant to real world domains. Notably, our approach covers the case where one peer does not have classes, or has only a noisy record of classes. Experiments display that using the class information during entity resolution can buy significant uplift for learning at little expense from the complexity standpoint.
1 Introduction
With the ever-expanding collection of data, it is becoming common practice for organisations to cooperate with the objective of leveraging their joint collections of data [12, 14], with a wider push to create and organise data marketplaces as followers to the more monolithic data warehouse [31]. Organisations are fully aware of the potential gain of combining their data assets, specifically in terms of increased statistical power for analytics and predictive tasks. For example, hospitals and medical facilities could leverage the medical history of common patients in order to prevent chronic diseases and risks of future hospitalisation.
The problem of learning models using the data collected and kept/maintained by different parties — federated learning for short [20] — has become as much a necessity as a concrete research challenge, expanding beyond machine learning through fields like databases and privacy. Among other features, work in the area can be classified in terms of (a) whether the data is vertically or horizontally partitioned and (b) the family of models being learned. The overwhelming majority of previous work on secure distributed learning considers a horizontal data partition, in which data providers record the same features for different entities. Solutions can take advantage of the separability of loss functions, which decompose the loss by examples. Relevant approaches can be found e.g. in [33] (and references therein).
In a vertical data partition, which is our setting, data providers can record different features for the same entities. The vertical data partition case is more challenging than the horizontal one [14]. To see this, notice that in the latter case, gathering all the data in one place makes any conventional learning algorithm fit to learn from the whole data. In the vertical partition case however, gathering the data in one place would not solve the problem, since we would still have to figure out the correspondence between entities of the different datasets to learn from the union of all features. The vertical data partition is more relevant to the setting where different organisations sit in the same market, thus aggregating different features for the same customers. The technical problem to overcome is that loss functions are in general not separable over features. With the exception of the unhinged loss [30], this would be the case for most proper, classification calibrated and/or non-convex losses [3, 24, 27]. A way to overcome this problem is to join the datasets upstream, using a broad family of techniques we refer to as entity resolution (or entity matching, record linkage, [8]). For the whole pipeline — from matching to learning — to be fully and properly optimized, taking into account possible additional constraints (like privacy), it is paramount to tackle and answer the following question:
"how does ent ity-resolut ion impact learning ?",
in particular because error-free entity resolution is often not available in the real-world [18], see
Figure 1. Case studies report that exact matching can be very damaging when identifiers are not
stable and error-prone: 25% of true matches would h ave been missed by exact m atching in a census
operation [29, 32]. In fact, one might expect such errors to just snowball with those of learning:
for example, wrong matches of a hospital database with pharmaceutical records with the objective
to improve preventive treatments coul d be disastrous on the predictive performances of a model
learned from the joined databases.
To our knowledge, there has been no formal treatment of this question so far, and the question
is open not just for machine learning as post-processing step to entity-resolution [15]. A s a con-
sequence perhaps, some work just assumes that th e so lution to entity-resolution is known a priori
[14].
Figure 1: The problem of entity resolution. In this example, peers A and B share common features (name, date of birth — DOB) that could be used to craft a unique identifier, but the entries are noisy, so it becomes hard to match rows between peers.

Our contribution — In this paper, we provide the first detailed answer to this question and hint at how it can be used to improve entity resolution as an upstream process to federated learning with
vertically partitioned data. We focus on a popular class of models for federated learning, linear
models [14, 33]. To summarize our theoretical contribution, we bound the variation of several key quantities as computed from the error-prone entity-resolved dataset on one hand, and also from the ideal dataset for which we would know the optimal correspondence on the other hand. These key quantities include:

(i) the relative deviation between the optimal classifiers;
(ii) the deviation between their respective losses;
(iii) the deviation in their respective generalization abilities.

More importantly, we carry out this analysis for any Ridge-regularized loss that satisfies some mild differentiability conditions, thus not necessarily being convex, nor classification-calibrated, nor even proper.
Overall, our results shed light on large margin classification in the context of federated learning, and how it brings resilience in learning after entity resolution. Indeed, we show that it yields immunity to entity resolution mistakes — examples receive the right class from the classifier learned from error-prone entity-resolved data if they would receive large margin classification from the optimal, "ideal" classifier learned from the ideal data. Federated learning in the vertical partition setting increases the number of features and is thereby likely to increase margins as well. Hence, such a theoretical result on immunity represents a very strong argument for federated learning.
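To make the mechanism behind this immunity concrete, here is a minimal worked inequality; it illustrates the generic large-margin argument under simple assumptions and is not the paper's precise theorem. Suppose the classifier $\hat{\theta}$ learned after entity resolution satisfies $\|\hat{\theta} - \theta^*\|_2 \leq \delta$ for the ideal classifier $\theta^*$, and observations satisfy $\|x\|_2 \leq R$. Then Cauchy-Schwarz gives

$$|\hat{\theta}^\top x - \theta^{*\top} x| = |(\hat{\theta} - \theta^*)^\top x| \leq \|\hat{\theta} - \theta^*\|_2 \, \|x\|_2 \leq \delta R \;,$$

$$\text{so } \operatorname{sign}(\hat{\theta}^\top x) = \operatorname{sign}(\theta^{*\top} x) \quad \text{whenever the ideal margin satisfies } |\theta^{*\top} x| > \delta R \;.$$

Bounding the deviation between the two classifiers (item (i) above) therefore translates directly into a margin threshold below which entity resolution mistakes cannot flip a prediction.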
On a broader agenda including impacts for practical entity resolution algorithms, our analysis suggests that there exists a small set of controls, defined from entity resolution mistakes, that essentially drive all the deviations highlighted before. Being able to control them essentially leads to a strong handle on how entity resolution impacts learning, from the classifier learned to its rates for generalization, with respect to the ideal classifier. The most prominent of these knobs is the errors made by entity resolution across classes, i.e. wrongly linking observations that belong to different classes. Our theory suggests that focusing on such mistakes during entity resolution can bring significant leverage for the classifier learned afterwards. We exemplify this experimentally, by modifying a simple token-based greedy entity resolution algorithm to integrate the constraint of carrying out entity resolution within classes [10, 15], assuming that one peer has knowledge of the classes but the other one may not — either classes are noisy or just not present. We perform experiments on fifteen distinct UCI domains, simulated to investigate the key parameters of federated learning in the setting where peers share the knowledge of some features (such as gender, age, postal code for customers), which can furthermore be noisy. Experiments display that even when only one peer has the knowledge of classes, significant improvements can be obtained over the approach that performs entity resolution without using classes, and can even compete with the result of the learner that has access to the (unknown) ideally entity-resolved data.
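The modified algorithm and its evaluation appear in the experimental section (Section 8), which lies beyond this excerpt. The following is only a minimal sketch of the general idea (greedy token-based matching with a within-class constraint); the function name, the Jaccard token similarity, and the dictionary-based record format are assumptions of this illustration, not details taken from the paper.

```python
def greedy_token_match(records_a, records_b, labels_a=None, labels_b=None):
    """Greedy token-based entity resolution, restricted to within-class matches
    when class information is available on both sides.

    records_a, records_b : dicts id -> set of tokens (e.g. character n-grams of
                           noisy shared features such as name or date of birth)
    labels_a, labels_b   : optional dicts id -> class label in {-1, +1}
    Returns a list of (id_a, id_b) matched pairs.
    """
    def jaccard(s, t):
        return len(s & t) / len(s | t) if (s | t) else 0.0

    # Score every cross pair, skipping pairs whose known classes disagree.
    candidates = []
    for ia, ta in records_a.items():
        for ib, tb in records_b.items():
            if labels_a is not None and labels_b is not None \
               and labels_a[ia] != labels_b[ib]:
                continue                      # forbid across-class matches
            candidates.append((jaccard(ta, tb), ia, ib))

    # Greedily keep the best-scoring pairs, using each record at most once.
    matches, used_a, used_b = [], set(), set()
    for score, ia, ib in sorted(candidates, key=lambda c: c[0], reverse=True):
        if ia in used_a or ib in used_b:
            continue
        matches.append((ia, ib))
        used_a.add(ia)
        used_b.add(ib)
    return matches

# Toy usage with 2-gram token sets of a noisy shared key (purely illustrative):
recs_a = {1: {"jo", "on", "n1", "19", "99"}, 2: {"an", "nn", "n2", "20", "01"}}
recs_b = {"x": {"jo", "on", "n1", "19", "98"}, "y": {"an", "nn", "n2", "20", "01"}}
print(greedy_token_match(recs_a, recs_b, {1: +1, 2: -1}, {"x": +1, "y": -1}))
```

When labels are missing or noisy on one side, the class filter can be dropped or relaxed for the affected pairs, which is the spirit of the setting studied in the experiments.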
The rest of this paper is organised as follows. Section 2 gives definitions. Section 3 shows how to reduce the analysis for a general loss to that of a specific kind of loss called the Taylor loss. Sections 4 through 7 develop our theoretical results, and Section 8 provides experiments. A last section discusses and concludes our paper. An Appendix, starting page 24, provides all proofs.
2 Definitions
Supervised learning, losses — Let $[n] = \{1, 2, ..., n\}$. In the ordinary batch supervised learning setting, one is given a set of $m$ examples $\hat{S} \doteq \{(\hat{x}_i, y_i), i \in [m]\}$, where $\hat{x}_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is an observation ($\mathcal{X}$ is called the domain) and $y_i \in \{-1, 1\}$ is a label, or class (the "hat" notation shall be explained below). Our objective is to learn a linear classifier $\theta \in \Theta$ for some fixed $\Theta \subseteq \mathbb{R}^d$. $\theta$ gives a label to some $x \in \mathcal{X}$ equal to the sign of $\theta^\top x \in \mathbb{R}$. The goodness of fit of $\theta$ on $\hat{S}$ is measured by a loss function. We essentially consider two categories of losses. The first is the set of Ridge-regularized losses. Each element, $\ell_F$, is defined by $\ell_F(\hat{S}, \theta; \gamma, \Gamma) \doteq L + R$ with

$$L \doteq \frac{1}{m} \cdot \sum_i F(y_i \theta^\top \hat{x}_i) \;, \qquad R \doteq \gamma \theta^\top \Gamma \theta \;. \tag{1}$$

Here, $\gamma > 0$ and $\Gamma$ is symmetric positive definite. $F : \mathbb{R} \to \mathbb{R}$ is $C^2$ and satisfies $|F'(0)|, |F''(0)| \ll \infty$, where "$\ll$" means finite. Note that this is a very general definition, as for example we do not assume that $F$ is convex nor even classification calibrated [3].
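As a concrete reading of eq. (1), here is a minimal NumPy sketch; the function name, the row-per-example data layout, and the logistic choice of F are assumptions of this illustration, not code from the paper.

```python
import numpy as np

def ridge_regularized_loss(F, X_hat, y, theta, gamma, Gamma):
    """l_F(S_hat, theta; gamma, Gamma) = L + R as in eq. (1):
    L = (1/m) * sum_i F(y_i * theta^T x_hat_i),  R = gamma * theta^T Gamma theta.

    F     : vectorized callable R -> R (need not be convex or classification calibrated)
    X_hat : (m, d) array, one observation per row;  y : (m,) labels in {-1, +1}
    theta : (d,) linear classifier;  gamma > 0;  Gamma : (d, d) symmetric positive definite
    """
    edges = y * (X_hat @ theta)           # y_i * theta^T x_hat_i for every example i
    L = np.mean(F(edges))                 # empirical part of the loss
    R = gamma * theta @ Gamma @ theta     # Ridge regularizer
    return L + R

# One choice of F satisfying the C^2 condition: the logistic loss.
logistic = lambda z: np.log1p(np.exp(-z))
```

Any other C^2 function with finite F'(0) and F''(0) can be plugged in the same way.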
The other set of losses we consider, called Taylor losses, is such that $L$ simplifies to a degree-two polynomial:

$$L \doteq a + \frac{b}{m} \cdot \sum_i y_i \theta^\top \hat{x}_i + \frac{c}{m} \cdot \sum_i \left( y_i \theta^\top \hat{x}_i \right)^2 \;, \tag{2}$$
with a, b, c ∈ R. Taylor losses have been used in secure federated learning [1, 12].
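Taylor losses are thus the special case of eq. (1) in which F is itself a degree-two polynomial, F(z) = a + b z + c z^2. A minimal self-contained sketch follows; the coefficient choice shown (the order-2 Taylor expansion of the logistic loss around 0, i.e. a = log 2, b = -1/2, c = 1/8) is a common one in the secure federated learning literature, and the function name is an assumption of this illustration.

```python
import numpy as np

def taylor_ridge_loss(X_hat, y, theta, a, b, c, gamma, Gamma):
    """Ridge-regularized Taylor loss: L of eq. (2) plus R = gamma * theta^T Gamma theta.
    X_hat : (m, d) observations, one row per example;  y : (m,) labels in {-1, +1}."""
    m = X_hat.shape[0]
    edges = y * (X_hat @ theta)                            # y_i * theta^T x_hat_i
    L = a + (b / m) * edges.sum() + (c / m) * np.sum(edges ** 2)
    return L + gamma * theta @ Gamma @ theta

# Order-2 Taylor expansion of log(1 + exp(-z)) around z = 0:
# log 2 - z/2 + z^2/8, i.e. a = log 2, b = -1/2, c = 1/8.
rng = np.random.default_rng(0)
m, d = 200, 5
X_hat = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)
print(taylor_ridge_loss(X_hat, y, theta, np.log(2), -0.5, 0.125, 0.1, np.eye(d)))
```

One reason Taylor losses appear in secure federated learning is that the degree-two structure only involves sums and squares of the inner products $y_i \theta^\top \hat{x}_i$, which are comparatively easy to evaluate under encryption or secret sharing.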
Federated learning — In federated learning, $\hat{S}$ is built from separate data-handling sources, called peers. In our vertical partition setting, we have two peers A and B, each of which has the description of the $m$ examples on a subset of the $d$ features. It may be the case that only one peer (A by default) has labels. In addition to learning a classifier, federated learning thus faces the mandatory preprocessing step of matching rows in the datasets of A and B to build dataset $\hat{S}$, a preprocessing step we define as entity resolution [8].

The observed dataset $\hat{S}$ is created from an unknown dataset $S \doteq \{(x_i, y_i), i \in [m]\}$ whose columns have been split between A and B. If we define $X \in \mathbb{R}^{d \times m}$ as the matrix storing (columnwise) the observations of $S$, then each row of $X$ is held by A or B. The "or" need not be exclusive, as some rows may be present in both A and B [25]. Also, duplicating rows in $X$ does not change the learning problem. There is thus both an ideal $X$ and an estimated observation matrix $\hat{X}$ giving the observations of $\hat{S}$ and built from entity resolution. To understand how the differences between $\hat{X}$ and $X$ impact learning, we need to drill down into the formalization of $\hat{X}$. Both matrices can be represented by block matrices, with each distinct feature row present exactly once, as:

$$X \doteq \begin{bmatrix} X_A \\ X_B \end{bmatrix}, \qquad \hat{X} \doteq \begin{bmatrix} X_A \\ \hat{X}_B \end{bmatrix}, \qquad \hat{X}_B \doteq X_B P^* \;, \tag{3}$$
where $P^* \in \{0, 1\}^{m \times m}$ is a permutation matrix (unknown) capturing the mistakes of entity resolution if $P^* \neq I_m$ (the identity matrix). From convention (3), the features of A are not affected by entity resolution: we call them anchor features. Because the features of B are affected by entity resolution, we call them shuffle features. A folklore fact [6] (Chapter I.5) is that any permutation matrix can be factored as a product of elementary permutation matrices, each of which swaps two rows/columns of $I_m$. So, suppose

$$P^* = \prod_{t=1}^{T} P_t \;, \tag{4}$$

where $P_t$ is an elementary permutation matrix and $T$, the size of $P^*$, is unknown. We let $u_A(t), v_A(t) \in [m]$ be the two column indexes in A affected by $P_t$.
$\hat{X}$ can be progressively constructed from a sequence $\hat{X}_0, \hat{X}_1, ..., \hat{X}_T$, where $\hat{X}_0 = X$, $\hat{X}_T = \hat{X}$ and, for $t \geq 1$,

$$\hat{X}_t \doteq \begin{bmatrix} X_A \\ \hat{X}_{tB} \end{bmatrix}, \qquad \hat{X}_{tB} \doteq X_B \prod_{j=1}^{t} P_j \;. \tag{5}$$

Let $\hat{X}_t \doteq [\hat{x}_{t1} \; \hat{x}_{t2} \; \cdots \; \hat{x}_{tm}]$ denote the column vector decomposition of $\hat{X}_t$ (with $\hat{x}_{0i} \doteq x_i$) and let $\hat{S}_t$ be the training sample obtained from the $t$ first permutations in the sequence. Hence, $\hat{S}_0 = S$, $\hat{S}_T = \hat{S}$ and $\hat{S}_t \doteq \{(\hat{x}_{ti}, y_i), i \in [m]\}$. We let $u_B(t)$ (resp. $v_B(t)$) denote the indices in $[m]$ of the shuffle features in $X$ that are in observation $u_A(t)$ (resp. $v_A(t)$) and that will be permuted by $P_t$, creating $\hat{X}_t$ from $\hat{X}_{t-1}$. For example, if $u_B(t) = v_A(t)$ and $v_B(t) = u_A(t)$, then $P_t$ correctly reconstructs the observations at indexes $u_A(t)$ and $v_A(t)$ in $X$. Figure 2 illustrates the use of these notations.
Key parameters of $P^*$ — it is clear that all mistakes of entity resolution are captured by $P^*$, so it is not surprising that all our results depend on some key parameters of $P^*$. A key property is how errors "accumulate" through the factorization of $P^*$ in eq. (4). Hereafter, $w_F$ for $w \in \mathbb{R}^d$ denotes the subvector of $w$ containing the features of peer $F \in \{A, B\}$.
Definition 1 We say that $P_t$ is $(\varepsilon, \tau)$-accurate for some $\varepsilon, \tau \geq 0$, $\varepsilon \leq 1$, iff for any $w \in \mathbb{R}^d$,

$$|(\hat{x}_{ti} - x_i)_B^\top w_B| \leq \varepsilon \cdot |x_i^\top w| + \tau \|w\|_2 \;, \quad \forall i \in [m] \;, \tag{6}$$

$$|(x_{u_F(t)} - x_{v_F(t)})_F^\top w_F| \leq \varepsilon \cdot \max_{i \in \{u_F(t), v_F(t)\}} |x_i^\top w| + \tau \|w\|_2 \;, \quad \forall F \in \{A, B\} \;. \tag{7}$$

We say that $P^*$ is $(\varepsilon, \tau)$-accurate iff each $P_t$ is $(\varepsilon, \tau)$-accurate, $\forall t = 1, 2, ..., T$.