Improving Performance of Federated Learning
based Medical Image Analysis in Non-IID Settings
using Image Augmentation
Alper Emin Cetinkaya
Information Security Program
Gazi University
Ankara, Turkey
aemin.cetinkaya@gazi.edu.tr
0000-0003-2424-6075
Murat Akin
Gazi AI Center of Gazi University,
Basarsoft Information Systems Inc.
Ankara, Turkey
muratakin@gazi.edu.tr
0000-0003-0001-1036
Seref Sagiroglu
Computer Engineering Dept.
Gazi AI Center, Gazi University
Ankara, Turkey
ss@gazi.edu.tr
0000-0003-0805-5818
Abstract—Federated Learning (FL) is a suitable solution for making use of sensitive data belonging to patients, people, companies, or industries that are obliged to work under rigid privacy constraints. FL fully or partially addresses data privacy and security issues and offers an alternative to centralized training by enabling multiple edge devices or organizations to contribute to the training of a global model on their local data without ever sharing that data. The non-IID data that arises from the distributed nature of FL causes significant performance degradation and training instability. This paper introduces a novel method that addresses the non-IID data problem of FL by dynamically balancing the data distributions of clients through image augmentation. The introduced method remarkably stabilizes model training and improves the model's test accuracy from 83.22% to 89.43% for the detection of multiple chest diseases from chest X-ray images in a highly non-IID FL setting. The results of federated training under IID, non-IID, and non-IID-with-the-proposed-method settings demonstrate that the method may encourage organizations and researchers to develop better systems that extract value from data while respecting data privacy, not only in healthcare but also in other fields.
Keywords—Federated Learning, Deep Learning, Medical
Image Analysis, Chest X-Ray Image, Privacy, Non-IID Data.
I. INTRODUCTION
Using deep learning (DL) for medical image analysis, such as detecting COVID-19 from chest X-ray (CXR) images without any dedicated test kits, is a low-cost and accurate alternative to laboratory-based testing. Recent advances in DL provide promising results for medical image analysis and large-scale diagnostics. Moreover, the ease of access, scalability, and rapid diagnosis of DL-based systems are major advantages over human-based diagnosis. However, DL methods require large numbers of samples to achieve competitive results, since the performance of DL algorithms is strongly affected by the volume and diversity of the data.
The approach of centrally training a DL model to leverage health data carries the risk of violating patient privacy, a risk that grows with the increasing concerns over data privacy. Moreover, medical institutions are usually unlikely to share their local data, owing to ownership concerns and strict data-privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). Even if the collected data is adequately protected against malicious actors, there remains a high probability that unforeseen circumstances will result in violations of individuals' privacy. Healthcare data, including personal identity, behavior, biometrics, biomedical images, genomic data, and patients' medical histories, has become one of the primary targets of attackers, and healthcare is the sector most exposed to cyber-attacks. According to a recent report published by HIPAA [1], the healthcare records of more than 5 million people were breached across 38 incidents in August 2021, bringing the total number of incidents between September 2020 and August 2021 to 707. A breach of health data has a lifelong impact, unlike other personal data breaches, since it may include information such as genomic data that cannot be altered afterwards. Data holders, who are obliged to ensure the security and privacy of the data they keep, face serious economic and legal consequences in such cases. Hence, one of the primary challenges in developing data-driven intelligent applications for healthcare is to preserve privacy and to secure shared data against any kind of cyber threat or attack.
A naive solution for leveraging high-volume, diverse data across multiple organizations is to alter the data before collecting it in a central place, either by removing or anonymizing personal data so that no private information about individuals can be inferred. Unfortunately, re-identification of anonymized or redacted information is still possible using advanced attacks [2] such as linkage attacks. Furthermore, such privacy-preserving methods entail a trade-off between data utility and privacy: the utility of the data decreases as more privacy is required [3].
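As a toy illustration of such a linkage attack (all records, names, and field values below are fabricated for the example, and the field names are hypothetical), an adversary can join an "anonymized" release with public auxiliary data on the remaining quasi-identifiers:

```python
# "Anonymized" hospital records: names removed, but quasi-identifiers
# (ZIP code, birth year, sex) remain in the release.
anonymized = [
    {"zip": "06500", "birth_year": 1984, "sex": "F", "diagnosis": "pneumonia"},
    {"zip": "06510", "birth_year": 1990, "sex": "M", "diagnosis": "covid-19"},
]

# Public auxiliary data (e.g. a voter roll) sharing the same quasi-identifiers.
voter_roll = [
    {"name": "Ayse Yilmaz", "zip": "06500", "birth_year": 1984, "sex": "F"},
    {"name": "Mehmet Demir", "zip": "06510", "birth_year": 1990, "sex": "M"},
]

def linkage_attack(records, auxiliary, keys=("zip", "birth_year", "sex")):
    """Re-identify 'anonymized' records by joining them with auxiliary
    data on the quasi-identifier columns."""
    index = {tuple(p[k] for k in keys): p["name"] for p in auxiliary}
    return {
        index[tuple(r[k] for k in keys)]: r["diagnosis"]
        for r in records
        if tuple(r[k] for k in keys) in index
    }

reidentified = linkage_attack(anonymized, voter_roll)
# Both patients are uniquely re-identified despite the removed names.
```

Removing direct identifiers is therefore insufficient whenever the retained attributes are unique enough to act as a fingerprint, which motivates the federated alternative discussed next.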
An alternative to data anonymization is to train a global model with a recent approach called Federated Learning (FL), introduced by Google in 2016 [4]. In contrast to the conventional strategy of training a model centrally, FL enables collaborative training of a global model across multiple agents without gathering the data in a central place. Instead, the model training phase is decentralized, and training is performed on the devices where the data is produced. Since the data never leaves its origin, concerns about privacy risks and legal regulations no longer prevent leveraging high-volume, diverse data. FL also reduces the cost of
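The decentralized training scheme described above is typically realized with federated averaging (FedAvg [4]): in each round, clients train locally and the server aggregates their model weights, weighting each client by its local dataset size. A minimal sketch of the aggregation step (the toy weight tensors and client sizes are illustrative, not the paper's actual model):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate per-client model weights into a global model.
    Each client's contribution is weighted by its local dataset size."""
    total = sum(client_sizes)
    return [
        sum(n / total * w[i] for w, n in zip(client_weights, client_sizes))
        for i in range(len(client_weights[0]))
    ]

# Toy round: two clients, each holding one weight matrix and one bias vector.
w_a = [np.ones((2, 2)), np.zeros(2)]
w_b = [3 * np.ones((2, 2)), np.ones(2)]
global_w = fedavg([w_a, w_b], client_sizes=[100, 300])
# Client B holds 3/4 of the samples, so the average is pulled toward it.
```

Because the weighting depends on local dataset sizes and label mixes, skewed (non-IID) client distributions bias this average, which is the failure mode the paper's augmentation-based balancing targets.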
2-3 DECEMBER 2021, 14TH INTERNATIONAL CONFERENCE ON INFORMATION SECURITY AND CRYPTOLOGY, ANKARA-TURKEY
978-1-6654-0776-2/21/$31.00 ©2021 IEEE 69
2021 International Conference on Information Security and Cryptology (ISCTURKEY) | 978-1-6654-0776-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/ISCTURKEY53027.2021.9654356