Information Fusion 102 (2024) 102004
Available online 9 September 2023
1566-2535/© 2023 Elsevier B.V. All rights reserved.
Contents lists available at ScienceDirect
Information Fusion
journal homepage: www.elsevier.com/locate/inffus
An advanced data fabric architecture leveraging homomorphic encryption
and federated learning
Sakib Anwar Rieyan a, Md. Raisul Kabir News a, A.B.M. Muntasir Rahman a, Sadia Afrin Khan a, Sultan Tasneem Jawad Zaarif a, Md. Golam Rabiul Alam a, Mohammad Mehedi Hassan b,∗, Michele Ianni c, Giancarlo Fortino c
a Department of Computer Science and Engineering, School of Data and Sciences, BRAC University, 66 Mohakhali, Dhaka, 1212, Bangladesh
b Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, 11543, Saudi Arabia
c Department of Informatics, Modeling, Electronics, and Systems, University of Calabria, Rende, CS, 87036, Italy
ARTICLE INFO
Keywords:
Data fabric
Federated learning
Partially homomorphic encryption
Data fusion
Data lake
ABSTRACT
Data fabric is an automated, AI-driven data fusion approach that accomplishes data management unification, solving complex data problems without moving data to a centralized location. In a federated learning
architecture, the global model is trained on the learned parameters of several local models, which eliminates
the necessity of moving data to a centralized repository for machine learning. This paper introduces a secure
approach for medical image analysis using federated learning and partially homomorphic encryption within a
distributed data fabric architecture. With this method, multiple users or clients (hospitals/medical data centers)
can collaborate in training a machine-learning model without exchanging raw data. The approach complies
with laws and regulations such as HIPAA and GDPR, ensuring the privacy and security of the data. The study
demonstrates the method’s effectiveness through a case study on pituitary tumor classification, achieving a
significant accuracy of 83.31%. However, the primary focus of the study is using the data fabric architecture
to securely store and analyze medical images while complying with HIPAA and GDPR regulations. The results
highlight the potential of these techniques to be applied to other privacy-sensitive domains and contribute to
the growing body of research on secure and privacy-preserving machine learning.
1. Introduction
Artificial intelligence (AI) has become an integral part of our daily
lives, including in the field of healthcare. In order to make the most of
AI in healthcare, it is important to have access to large amounts of high-
quality data. However, the confidentiality and sensitivity of healthcare
data pose significant challenges to its storage and analysis. For example,
according to a study [1], almost 5,150 data breaches were reported to
the OCR (Office for Civil Rights) between October 21, 2009, and
December 31, 2022. The volume of healthcare data is often very large,
particularly due to the prevalence of image-based data. In addition, the
confidentiality of healthcare data is of the utmost importance, as it is
often personal and sensitive in nature.
To address these challenges, we have built an advanced data fabric
architecture that brings together healthcare centers in a region and
stores patient data and diagnoses in a secure and privacy-preserving
manner. Data fabric is a data fusion [2,3] and integration approach
to accomplish data management unification through analytics and AI.
The proposed approach utilizes federated learning and partially homo-
morphic encryption [4] to allow for collaborative machine learning
on encrypted data, while still maintaining compliance with laws and
regulations such as the Health Insurance Portability and Accountability
Act (HIPAA) [5] and the General Data Protection Regulation (GDPR)
Act 2018 [6].
In the context of advancing healthcare through technological in-
novation, the work of [7] emerges as a pivotal exploration into the
application of deep learning methods within the domain of multi-source
heterogeneous data fusion. They compare two fusion approaches: stage-
based fusion, which aligns different data sources toward a common
goal but misses interaction, and feature-based fusion, which over-
looks redundancy among features, affecting correlation. By combining
https://doi.org/10.1016/j.inffus.2023.102004
Received 24 June 2023; Received in revised form 20 August 2023; Accepted 31 August 2023
deep learning and data fusion, [7] underscores the potential of har-
nessing hierarchical features through unsupervised training. This res-
onates with our work proposing a sophisticated data fabric architecture,
emphasizing the fusion of technology and contemporary challenges.
In this study, we have used pituitary tumor classification as a
case study, employing various deep-learning models such as VGG16,
VGG19, ResNet50, and ResNet152. Our results show promising poten-
tial for the use of federated learning [2] and partially homomorphic
encryption in secure medical image analysis. Specifically, we achieved
good performance with VGG16 and VGG19 models, while ResNet50
and ResNet152 achieved lower accuracy and precision for both classes.
However, our custom CNN architecture outperformed all of these pre-
trained models in almost every metric that we used. Our findings
contribute to the growing body of research on secure and privacy-
preserving [8,9] machine learning [10] and demonstrate the potential
for these techniques to be applied in other privacy-sensitive domains.
This paper’s structure follows a clear sequence: Section 1 introduces
the context, addresses healthcare data security challenges, presents an
advanced data fabric architecture integrating encryption and federated
learning, and outlines the research objectives and contributions. Section 2 provides background on key concepts: Data Fabric, Federated Learning,
and Homomorphic Encryption for medical image analysis. Section 3
explores existing research on Data Fabric architecture, secure health-
care data management, encryption methods, and their applications
in healthcare data privacy and analysis. Section 4 outlines the ex-
perimental approach, including data description, preprocessing, model
architecture, and neural network models for encrypted medical image
classification. Section 5 presents study outcomes, including the use of
homomorphic encryption, model evaluation on encrypted and unen-
crypted data, performance metrics for different models in classifying
pituitary tumors, and discussing the results’ implications. Section 6
summarizes research achievements in developing an advanced data fab-
ric architecture using Partial Homomorphic Encryption and Federated
Learning for secure and decentralized machine learning on medical
data. It discusses study assumptions, limitations, and potential future
directions.
1.1. Motivation
In the field of medical image analysis, ensuring the security of
sensitive patient data is of utmost importance. However, with the
increasing use of machine learning and deep learning techniques for
medical image analysis, there is a pressing need for an effective and
secure architecture to handle such data. Previous studies have shown
that the use of conventional security measures, such as encryption and
access control, is not enough to ensure the privacy of patient data
in the context of machine learning and deep learning operations. In
2015, Anthem Inc., one of the largest health insurance companies in
the United States, suffered a massive data breach that exposed the
personal information of nearly 79 million individuals. The attackers
gained access to a vast amount of sensitive medical data. The breach
succeeded because of a centralized database vulnerability, a lack of
encryption, and slow detection [1].
In addition, the use of traditional centralized architectures [11,12]
for processing medical image data can be slow and resource-intensive,
which can further compromise the security of the data [13,14]. There-
fore, there is a clear need for a new, advanced data fabric architecture
that is specifically designed to handle the unique challenges of securing
medical image data while also supporting efficient machine learning
and deep learning operations. This research aims to address this gap in
the current state of the art by proposing and evaluating a novel archi-
tecture that is capable of effectively securing medical image data while
also enabling fast and accurate machine learning and deep learning
operations.
1.2. Research problems
The integration of data into healthcare has the potential to improve
the prediction of diseases and epidemics, enhance treatment outcomes,
and prevent premature deaths. However, the confidentiality of health-
care data and the complexity of managing large and diverse datasets
pose significant challenges to the integration of data into healthcare.
Ensuring data security and privacy is of utmost importance, as security
breaches in healthcare are on the rise. According to a study [15], there
were 3,033 data breaches reported between 2010 and 2019, resulting
in the exposure of 255.18 million records. Furthermore, as
previously mentioned, over the last 13 years there have been 5,150
reported data breaches in this sector [1].
Moreover, the substantial volume of healthcare data presents chal-
lenges in terms of efficient processing, storage, and communication.
Conventional methods may prove inadequate when confronted with the
magnitude of the data at hand. In one proposed solution [16], a big data
healthcare cloud would host clinical, financial, social, physical, and
psychological data from patients in a centralized location. However,
proper governance of the data cloud is necessary to effectively work
with and analyze complex data.
In this study, we aim to address these challenges by proposing
an advanced data fabric architecture that brings together healthcare
centers in a region and stores patient data and diagnoses in a secure
and privacy-preserving manner using federated learning and partially
homomorphic encryption. We demonstrate the effectiveness of our
approach using pituitary tumor classification as a case study. However,
the primary focus of our work is on the development and evaluation of
federated learning and partially homomorphic encryption as tools for
secure medical image analysis in the healthcare sector.
The primary objective of this study is to address the research
question:
How effective and practical is the implementation of advanced data
fabric architecture using federated learning and partially homomorphic
encryption for secure medical image analysis in the healthcare sector?
1.3. Contributions
Through our work, we show that a fully-fledged data fabric architecture based on healthcare data can be built while complying with
privacy regulations and maintaining good accuracy scores. Our primary
contribution spans four aspects:
• We propose an advanced data fabric architecture for storing, sharing, and fusing healthcare data in encrypted form using Partial Homomorphic Encryption (PHE), so that data can be exchanged with other parties without revealing its content. In this architecture, medical images of various patients/clients are encrypted on the client side, and these encrypted images are then used as inputs for deep learning models, enabling the models to learn and classify tumors. The system then collects the classified tumor data for further analysis and processing, all of which is carried out on data in its encrypted state. The raw data is encrypted and converted into local weights by the local FL model before ever reaching the global model; even if the data were traced back, the result would be nothing but an encrypted image. Thus, this architecture provides a secure and efficient mechanism for processing encrypted data while complying with data privacy and confidentiality regulations such as HIPAA and GDPR. This transformative approach marks a distinct departure from currently available data architectures that lack such encryption mechanisms.
• Our architecture also encompasses a federated learning frame-
work, allowing multiple clients to collaboratively train machine
learning models on their respective data. Unlike the existing
general federated learning frameworks which function on real-
time local and global updates, our framework offers the flexibility
to modify, scale, merge, or select the local model updates before
incorporating them into the global model. In this way, the framework
we proposed facilitates the systematic exchange of model updates
between the local and global models. This innovation grants
healthcare organizations the unparalleled ability to securely col-
laborate on model training while maintaining data privacy. No ex-
isting architecture seamlessly integrates federated learning within
a data fabric, thus highlighting the exceptional nature of our
contribution.
• Moreover, we have tailored a convolutional neural network
(CNN) architecture, inspired by VGG16 and VGG19, with a
smaller input size, resulting in a reduced parameter size compared
to the aforementioned models. This customization enables en-
hanced efficiency by reducing computational complexity, partic-
ularly when leveraging Partially Homomorphic Encryption (PHE)
techniques. This optimization sets it apart from previously estab-
lished architectures that often overlook the symbiotic relationship
between encryption and model design.
• We further evaluate the proposed approach by implementing
a prototype of the homomorphic encryption-based data fabric
and the federated learning framework. The assessment indicates
that the suggested method offers an effective and reliable data
fusion for sharing and analyzing data securely. The experimental
results demonstrate that the proposed approach achieves satisfac-
tory accuracy in the collaborative training of machine learning
models, even when the data is encrypted. This pivotal advance-
ment distinguishes our work from current architectures that re-
quire disjointed solutions for data management and operational
practices.
2. Background studies
2.1. Data fabric
According to Gartner [17], Data Fabric is a unified and integrated
platform that enables data discovery, fusion and integration, manage-
ment, and access across multiple environments. It provides a consistent
and scalable approach to managing data assets that are distributed
across various locations, such as on-premises, cloud, and edge com-
puting. Data Fabric helps organizations to simplify and optimize their
data management processes, reduce data silos, and enable real-time
access to data. It also supports the creation of a self-service data
marketplace, allowing users to discover, share, and consume data in
a secure and governed manner. Data Fabric is increasingly becoming a
critical component of modern data architectures, as organizations seek
to manage the growing volume, velocity, and variety of data generated
by digital business initiatives.
In our research, we are utilizing homomorphic encryption to classify
pituitary tumors from MRI images in our dataset. We have used a
Data Fabric architecture to store the weights of different machine-
learning models as encrypted data. The ML models are run on client
PCs, and the resulting encrypted data is saved in our Data Lake. Using
homomorphic encryption, a server PC can perform computations on
the encrypted data, allowing for the creation of a homogeneous global
model. The server can then provide users with the requested results
without compromising the privacy of the MRI images. This approach
benefits from the Data Fabric’s ability to provide a unified and inte-
grated platform that enables data discovery, integration, management,
and access across multiple environments. By utilizing homomorphic
encryption and a Data Fabric architecture, we can classify pituitary
tumors from MRI images in a privacy-preserving manner, contributing
to the development of more secure and privacy-preserving medical
imaging technologies.
2.1.1. Vanilla architecture of data fabric
Fig. 1 provides a visual overview of the key components and pro-
cesses of the Vanilla Architecture of Data Fabric.
(i) Accessing Data:
(a) Data Collecting and Encryption: Data is a volatile resource, and medical data is considered highly sensitive because it can contain personally identifiable information such as names, addresses, dates of birth, and medical records that can be exploited if it falls into the wrong hands. To address this risk, in this architecture data is neither collected nor stored on a central server, which would pose a risk of data leakage.
To ensure privacy and reduce data volatility, medical data from each user is first selected and then encrypted with Partially Homomorphic Encryption (PHE). The encrypted data is then used to train the model locally in order to collect updated model weights. Each user's data is generated and stored locally, without being transferred to the central server; instead, the generated model updates are stored and merged to form the global model.
(b) Master Data Management: Following the generation of
local model updates and subsequent merging of data,
feature selection is employed as a means of optimiz-
ing and enhancing efficiency. By selecting the most rele-
vant features or weights, the dimensionality of the data
can be reduced, facilitating ease of analysis. Moreover,
feature selection mitigates the risk of overfitting, im-
proves model accuracy, and reduces computational costs,
thereby achieving heightened efficiency through the uti-
lization of a reduced training dataset.
FedMax, FedAvg, and FedMin are optimization algorithms
used in Federated Learning for feature selection. In all
three algorithms, the updated model weights are sent to the
server and stored for future use; FedMax selects the model
with the highest accuracy, FedMin the model with the lowest
loss, and FedAvg an average of all models.
In our architecture, we selected FedMax as our feature se-
lection algorithm to select important and relevant model
weights as it showed more accuracy and efficiency com-
pared to FedAvg and FedMin. The selected data is col-
lected and kept together as ‘‘Master Data’’.
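The three selection rules described above can be sketched as follows. This is a minimal illustration only: the `(weights, accuracy, loss)` tuple layout and the toy numbers are our assumptions, not the paper's implementation.

```python
import numpy as np

# Each client reports (local_weights, validation_accuracy, validation_loss).
# These toy values are placeholders, not measurements from the paper.
client_updates = [
    (np.array([0.2, 0.5]), 0.81, 0.42),
    (np.array([0.3, 0.4]), 0.86, 0.35),
    (np.array([0.1, 0.6]), 0.78, 0.50),
]

def fed_max(updates):
    # FedMax: keep the weights of the highest-accuracy local model
    return max(updates, key=lambda u: u[1])[0]

def fed_min(updates):
    # FedMin: keep the weights of the lowest-loss local model
    return min(updates, key=lambda u: u[2])[0]

def fed_avg(updates):
    # FedAvg: average the weights of all local models
    return np.mean([u[0] for u in updates], axis=0)

master_weights = fed_max(client_updates)  # second client wins on accuracy
```

The selected weights (here `master_weights`) correspond to what the architecture retains as "Master Data".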
(ii) Managing Life Cycle:
(a) Governance: Data governance is an essential component
of data fabric architecture. Data fabric architecture is an
approach to data management that enables organizations
to manage and process data from multiple sources, lo-
cations, and formats. It provides a unified view of data
across the organization and supports various data process-
ing requirements, such as data integration, analytics, and
artificial intelligence.
Data governance in data fabric architecture refers to the
policies, processes, and standards that organizations im-
plement to manage their data assets effectively. Data
governance helps organizations ensure that their data
is accurate, consistent, and compliant with regulatory
requirements. It also helps organizations manage data
privacy, security, and access.
The following are some key considerations for data gov-
ernance in data fabric architecture:
Data quality: Data governance policies should include
measures to ensure data quality, such as data profiling,
data cleansing, and data validation.
Metadata management: Data governance policies should
include metadata management to ensure that data is
properly tagged, categorized, and classified. Metadata
helps organizations understand the meaning and context
of their data and facilitates data discovery and reuse.
Data privacy and security: Data governance policies
should include measures to ensure data privacy and se-
curity, such as access controls, data encryption, and data
masking.
Data lineage: Data governance policies should include
data lineage to track the origin, transformation, and
movement of data across the organization. Data lineage
helps organizations understand how data is used and
facilitates compliance with regulatory requirements.
Data ownership and stewardship: Data governance
policies should define data ownership and stewardship
to ensure that data is managed and maintained by the
appropriate individuals and teams.
(b) Compliance: Data compliance refers to adhering to rel-
evant laws, regulations, and industry standards related to
the handling, processing, and storage of data. In the con-
text of data fabric, data compliance refers to ensuring that
data is managed in accordance with these requirements
across the entire data fabric. To ensure data compliance
in a data fabric, it is necessary to establish policies and
procedures that cover the entire data lifecycle, from data
ingestion to archival and deletion. Personal data must
be collected, processed, and stored in compliance with
privacy regulations such as GDPR, CCPA, HIPAA, etc.
In this architecture, the feature-selected weights were produced
by various models such as VGG16, VGG19, ResNet50, and
ResNet152, and the updates are stored in a data lake organized
by the model they were trained on, in compliance with HIPAA
regulations.
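One way such model-keyed storage might look is sketched below. The directory layout, function names, and JSON serialization are our assumptions for illustration, not the paper's implementation; Paillier ciphertexts are large integers, so they serialize without custom encoding.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout: one folder per model architecture in the data lake,
# each holding the encrypted weight updates uploaded by individual clients.
def store_encrypted_update(lake_root, model_name, client_id, enc_weights):
    model_dir = Path(lake_root) / model_name
    model_dir.mkdir(parents=True, exist_ok=True)
    # enc_weights: a list of Paillier ciphertexts (plain Python ints)
    (model_dir / f"{client_id}.json").write_text(json.dumps(enc_weights))

def load_updates(lake_root, model_name):
    # Gather every client's encrypted update for one model architecture
    return [json.loads(p.read_text())
            for p in sorted((Path(lake_root) / model_name).glob("*.json"))]

lake = tempfile.mkdtemp()
store_encrypted_update(lake, "VGG16", "client_a", [123456789, 987654321])
store_encrypted_update(lake, "VGG16", "client_b", [555555555, 444444444])
updates = load_updates(lake, "VGG16")
```

Keeping updates partitioned by model makes it straightforward to audit which architecture touched which data, which is what the compliance requirement above calls for.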
(iii) Exposing Data: Data exposure refers to making data available
for consumption and analysis by users or applications within an
organization. Exposing data in a data fabric involves providing
access to the data for authorized users or applications. There are
several ways to expose data in a data fabric, including:
APIs: Application Programming Interfaces (APIs) enable appli-
cations to access and retrieve data from the data fabric.
Data Catalogs: A data catalog provides a searchable inventory
of data assets in the data fabric. Users can discover and access
data assets through the data catalog.
Self-Service Analytics: A self-service analytics platform enables
users to create their own queries and reports using the data
available in the data fabric.
Data Virtualization: Data virtualization enables users to access
and combine data from multiple sources as if it were in a single
location.
2.2. Federated learning
Federated Learning is a distributed machine learning technique that
enables multiple clients to collaboratively learn a shared model without
exchanging their raw data. This technique has gained popularity in
recent years due to its ability to preserve data privacy and security
while improving model performance. As shown in Fig. 2, each client
trains a local model using its own data and then sends the local model
weights to a central server. The central server then aggregates the local
model weights to update a global model that is shared among all clients.
This process continues iteratively until the global model achieves the
desired level of accuracy. According to the report [18], Federated
Learning has been successfully applied to various domains, such as
speech recognition, natural language processing, and healthcare, where
data privacy is a major concern.
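The train-locally/aggregate-globally loop just described can be sketched in a few lines. The linear-regression task, learning rate, and round count below are illustrative assumptions chosen so the example is self-contained; only the FedAvg structure (clients return weights, server averages them) reflects the text.

```python
import numpy as np

def local_train(global_w, data, lr=0.1, epochs=5):
    # Each client starts from the current global weights and runs a few
    # full-batch gradient steps on its own (private) data.
    w = global_w.copy()
    X, y = data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    # Server aggregates by averaging the locally trained weights (FedAvg);
    # only weights travel over the network, never the raw data.
    local_ws = [local_train(global_w, d) for d in client_datasets]
    return np.mean(local_ws, axis=0)

# Three simulated clients, each holding its own linear-regression data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
# w converges toward true_w without any client sharing its raw data
```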
2.3. Homomorphic encryption
A cryptographic method called homomorphic encryption enables
mathematical operations to be carried out on ciphertext without ex-
posing the underlying plaintext. Table 1 offers a comprehensive view of
different types of homomorphic encryption along with the distinctions
that set them apart. In our research, we used partially homomorphic
encryption to encrypt sensitive medical images, specifically brain MRI
scans.
Partially homomorphic encryption (PHE) is a type of homomorphic
encryption that only supports a limited set of mathematical operations,
such as addition or multiplication. By encrypting the medical images
using this technique, we were able to process and analyze the data
without exposing the sensitive information contained within it. Fig. 3
provides an illustrative overview of the Partially Homomorphic Encryption (PHE) technique applied at the pixel level to the dataset images.
One major benefit of using partially homomorphic encryption in this
context is that it ensures the confidentiality of medical data. As medical
information is often highly sensitive and personal, it is important
to protect it from unauthorized access. By encrypting the data, we
were able to securely process and analyze it without compromising its
confidentiality.
In addition, partially homomorphic encryption allows for more
efficient processing of the encrypted data. Because the mathematical
operations can be performed directly on the ciphertext, there is no
need to decrypt the data first, which can be a time-consuming process.
This was particularly useful when working with large datasets or when
processing data in real time [19].
Overall, our use of partially homomorphic encryption proved to be
a successful and effective method for protecting the confidentiality of
sensitive medical images while still enabling their analysis.
2.3.1. The paillier encryption scheme
As previously mentioned, we are using Partial Homomorphic En-
cryption for our dataset which follows The Paillier Encryption Scheme.
The Paillier encryption scheme [20] is an additively homomorphic
cryptosystem based on the computational difficulty of the decisional
composite residuosity assumption. The scheme’s security relies on the
difficulty of factoring the product of two large prime numbers. Key
components of the Paillier encryption scheme are:
1. Key Generation: The key generation process creates a public
key (𝑛, 𝑔) and a private key (𝜆, 𝜇).
• 𝑛 is the product of two large prime numbers; 𝑛 itself is public, but its prime factorization is kept secret and known only to the key owner.
• 𝑔 is a public system parameter, typically set as 𝑔 = 𝑛 + 1.
• 𝜆 is Carmichael’s totient function [21], 𝜆 = 𝑙𝑐𝑚(𝑝 − 1, 𝑞 − 1),
where 𝑝 and 𝑞 are the large prime factors of 𝑛.
• 𝜇 is the modular multiplicative inverse of 𝜆 modulo 𝑛.
2. Encryption: Given a plaintext image 𝑥, the encryption process
is performed as follows:
𝐸𝑛𝑐(𝑥) = (𝑔^𝑥 × 𝑟^𝑛) mod 𝑛²   (1)
where 𝑟 is a random value chosen for each encryption, ensuring
probabilistic encryption.
2.3.2. Homomorphic operations
1. Addition: Addition on encrypted values is akin to combining the
original plaintext values after decryption. Given two encrypted
images 𝐸𝑛𝑐(𝑥) and 𝐸𝑛𝑐(𝑦), the homomorphic addition can be
performed as:
𝐸𝑛𝑐(𝑥 + 𝑦) = (𝐸𝑛𝑐(𝑥) × 𝐸𝑛𝑐(𝑦)) mod 𝑛²   (2)
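The full scheme — key generation, encryption per Eq. (1), decryption, and homomorphic addition per Eq. (2) — can be sketched in pure Python. The toy primes and helper names below are ours; real deployments use primes of 1024 bits or more.

```python
import math
import random

def keygen(p, q):
    # Public key (n, g) with the common simplification g = n + 1;
    # private key (lam, mu) as in the Paillier scheme.
    n = p * q
    g = n + 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (valid when g = n + 1)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)      # fresh randomness -> probabilistic encryption
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)   # Eq. (1)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n  # L(x) = (x - 1) / n, then multiply by mu

pub, priv = keygen(61, 53)          # toy primes for demonstration only
c1, c2 = encrypt(pub, 42), encrypt(pub, 17)
c_sum = (c1 * c2) % (pub[0] ** 2)   # Eq. (2): multiply ciphertexts mod n^2
assert decrypt(pub, priv, c_sum) == 42 + 17
```

Note that the two ciphertexts are combined without ever decrypting them; only the holder of the private key can recover the sum, which is exactly the property the architecture relies on when aggregating encrypted model weights.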