Information Fusion 102 (2024) 102004
Available online 9 September 2023
1566-2535/© 2023 Elsevier B.V. All rights reserved.
Contents lists available at ScienceDirect
Information Fusion
journal homepage: www.elsevier.com/locate/inffus
An advanced data fabric architecture leveraging homomorphic encryption
and federated learning
Sakib Anwar Rieyan a, Md. Raisul Kabir News a, A.B.M. Muntasir Rahman a, Sadia Afrin Khan a, Sultan Tasneem Jawad Zaarif a, Md. Golam Rabiul Alam a, Mohammad Mehedi Hassan b,∗, Michele Ianni c, Giancarlo Fortino c
a Department of Computer Science and Engineering, School of Data and Sciences, BRAC University, 66 Mohakhali, Dhaka, 1212, Bangladesh
b Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, 11543, Saudi Arabia
c Department of Informatics, Modeling, Electronics, and Systems, University of Calabria, Rende, CS, 87036, Italy
ARTICLE INFO
Keywords:
Data fabric
Federated learning
Partially homomorphic encryption
Data fusion
Data lake
ABSTRACT
Data fabric is an automated, AI-driven data fusion approach that accomplishes data management unification, solving complex data problems without moving data to a centralized location. In a federated learning
architecture, the global model is trained on the learned parameters of several local models, which eliminates
the necessity of moving data to a centralized repository for machine learning. This paper introduces a secure
approach for medical image analysis using federated learning and partially homomorphic encryption within a
distributed data fabric architecture. With this method, multiple users or clients (hospitals/medical data centers)
can collaborate in training a machine-learning model without exchanging raw data. The approach complies
with laws and regulations such as HIPAA and GDPR, ensuring the privacy and security of the data. The study
demonstrates the method’s effectiveness through a case study on pituitary tumor classification, achieving a
significant accuracy of 83.31%. However, the primary focus of the study is using the data fabric architecture
to securely store and analyze medical images while complying with HIPAA and GDPR regulations. The results
highlight the potential of these techniques to be applied to other privacy-sensitive domains and contribute to
the growing body of research on secure and privacy-preserving machine learning.
1. Introduction
Artificial intelligence (AI) has become an integral part of our daily
lives, including in the field of healthcare. In order to make the most of
AI in healthcare, it is important to have access to large amounts of high-
quality data. However, the confidentiality and sensitivity of healthcare
data pose significant challenges to its storage and analysis. For example,
according to a study [1], almost 5,150 data breaches were reported to
the OCR (Office for Civil Rights) between October 21, 2009, and
December 31, 2022. The volume of healthcare data is often very large,
particularly due to the prevalence of image-based data. In addition, the
confidentiality of healthcare data is of the utmost importance, as it is
often personal and sensitive in nature.
To address these challenges, we have built an advanced data fabric
architecture that brings together healthcare centers in a region and
stores patient data and diagnoses in a secure and privacy-preserving
manner. Data fabric is a data fusion [2,3] and integration approach
to accomplish data management unification through analytics and AI.
The proposed approach utilizes federated learning and partially homo-
morphic encryption [4] to allow for collaborative machine learning
on encrypted data, while still maintaining compliance with laws and
regulations such as the Health Insurance Portability and Accountability
Act (HIPAA) [5] and the General Data Protection Regulation (GDPR)
Act 2018 [6].
In the context of advancing healthcare through technological in-
novation, the work of [7] emerges as a pivotal exploration into the
application of deep learning methods within the domain of multi-source
heterogeneous data fusion. They compare two fusion approaches: stage-
based fusion, which aligns different data sources toward a common
goal but misses interaction, and feature-based fusion, which over-
looks redundancy among features, affecting correlation. By combining
https://doi.org/10.1016/j.inffus.2023.102004
Received 24 June 2023; Received in revised form 20 August 2023; Accepted 31 August 2023
deep learning and data fusion, [7] underscores the potential of har-
nessing hierarchical features through unsupervised training. This res-
onates with our work proposing a sophisticated data fabric architecture,
emphasizing the fusion of technology and contemporary challenges.
In this study, we have used pituitary tumor classification as a
case study, employing various deep-learning models such as VGG16,
VGG19, ResNet50, and ResNet152. Our results show promising poten-
tial for the use of federated learning [2] and partially homomorphic
encryption in secure medical image analysis. Specifically, we achieved
good performance with VGG16 and VGG19 models, while ResNet50
and ResNet152 achieved lower accuracy and precision for both classes.
However, our custom CNN architecture outperformed all of these pre-
trained models in almost every metric that we used. Our findings
contribute to the growing body of research on secure and privacy-
preserving [8,9] machine learning [10] and demonstrate the potential
for these techniques to be applied in other privacy-sensitive domains.
This paper’s structure follows a clear sequence: Section 1 introduces
the context, addresses healthcare data security challenges, presents an
advanced data fabric architecture integrating encryption and federated
learning, and outlines the research objectives and contributions. Section 2 provides background on key concepts: Data Fabric, Federated Learning,
and Homomorphic Encryption for medical image analysis. Section 3
explores existing research on Data Fabric architecture, secure health-
care data management, encryption methods, and their applications
in healthcare data privacy and analysis. Section 4 outlines the ex-
perimental approach, including data description, preprocessing, model
architecture, and neural network models for encrypted medical image
classification. Section 5 presents study outcomes, including the use of
homomorphic encryption, model evaluation on encrypted and unen-
crypted data, performance metrics for different models in classifying
pituitary tumors, and discussing the results’ implications. Section 6
summarizes research achievements in developing an advanced data fab-
ric architecture using Partial Homomorphic Encryption and Federated
Learning for secure and decentralized machine learning on medical
data. It discusses study assumptions, limitations, and potential future
directions.
1.1. Motivation
In the field of medical image analysis, ensuring the security of
sensitive patient data is of utmost importance. However, with the
increasing use of machine learning and deep learning techniques for
medical image analysis, there is a pressing need for an effective and
secure architecture to handle such data. Previous studies have shown
that the use of conventional security measures, such as encryption and
access control, is not enough to ensure the privacy of patient data
in the context of machine learning and deep learning operations. In
2015, Anthem Inc., one of the largest health insurance companies in
the United States, suffered a massive data breach that exposed the
personal information of nearly 79 million individuals. The attackers
gained access to a vast amount of sensitive medical data. The breach
succeeded because of a centralized database vulnerability, a lack of
encryption, and slow detection [1].
In addition, the use of traditional centralized architectures [11,12]
for processing medical image data can be slow and resource-intensive,
which can further compromise the security of the data [13,14]. There-
fore, there is a clear need for a new, advanced data fabric architecture
that is specifically designed to handle the unique challenges of securing
medical image data while also supporting efficient machine learning
and deep learning operations. This research aims to address this gap in
the current state of the art by proposing and evaluating a novel archi-
tecture that is capable of effectively securing medical image data while
also enabling fast and accurate machine learning and deep learning
operations.
1.2. Research problems
The integration of data into healthcare has the potential to improve
the prediction of diseases and epidemics, enhance treatment outcomes,
and prevent premature deaths. However, the confidentiality of health-
care data and the complexity of managing large and diverse datasets
pose significant challenges to the integration of data into healthcare.
Ensuring data security and privacy is of utmost importance, as security
breaches in healthcare are on the rise. According to a study [15], there
were 3,033 data breaches reported between 2010 and 2019, resulting
in the exposure of 255.18 million records. Furthermore, as
previously mentioned, over the last 13 years there have been 5,150
reported data breaches in this sector [1].
Moreover, the substantial volume of healthcare data presents chal-
lenges in terms of efficient processing, storage, and communication.
Conventional methods may prove inadequate when confronted with the
magnitude of the data at hand. In one proposed solution [16], a big data
healthcare cloud would host clinical, financial, social, physical, and
psychological data from patients in a centralized location. However,
proper governance of the data cloud is necessary to effectively work
with and analyze complex data.
In this study, we aim to address these challenges by proposing
an advanced data fabric architecture that brings together healthcare
centers in a region and stores patient data and diagnoses in a secure
and privacy-preserving manner using federated learning and partially
homomorphic encryption. We demonstrate the effectiveness of our
approach using pituitary tumor classification as a case study. However,
the primary focus of our work is on the development and evaluation of
federated learning and partially homomorphic encryption as tools for
secure medical image analysis in the healthcare sector.
The primary objective of this study is to address the research
question:
How effective and practical is the implementation of advanced data
fabric architecture using federated learning and partially homomorphic
encryption for secure medical image analysis in the healthcare sector?
1.3. Contributions
Through our work, we show that a fully-fledged data fabric architecture based on healthcare data can be built while complying with
privacy regulations and maintaining good accuracy scores. Our primary
contribution spans four aspects:
• We propose an advanced data fabric architecture for storing, sharing, and fusing healthcare data in encrypted form using Partial Homomorphic Encryption (PHE), so that data can be exchanged with other parties without revealing its content. In this architecture, medical images of various patients/clients are encrypted on the client side, and these encrypted images are then used as inputs for deep learning models, enabling the models to learn and classify tumors. The system then collects the classified tumor data for further analysis and processing, all of which is carried out on data in its encrypted state. The raw data is encrypted and converted into local weights by the local FL model before ever reaching the global model; even if the data were traced back, the result would be nothing but an encrypted image. Thus, this architecture provides a secure and efficient mechanism for processing encrypted data while complying with data privacy and confidentiality regulations such as HIPAA and GDPR. This transformative approach marks a distinct departure from currently available data architectures that lack such encryption mechanisms.
• Our architecture also encompasses a federated learning frame-
work, allowing multiple clients to collaboratively train machine
learning models on their respective data. Unlike the existing
general federated learning frameworks which function on real-
time local and global updates, our framework offers the flexibility
to modify, scale, merge, or select the local model updates before
incorporating them into the global model. In this way, the framework
we proposed facilitates the systematic exchange of model updates
between the local and global models. This innovation grants
healthcare organizations the unparalleled ability to securely col-
laborate on model training while maintaining data privacy. No ex-
isting architecture seamlessly integrates federated learning within
a data fabric, thus highlighting the exceptional nature of our
contribution.
• Moreover, we have tailored a convolutional neural network
(CNN) architecture, inspired by VGG16 and VGG19, with a
smaller input size, resulting in a reduced parameter size compared
to the aforementioned models. This customization enables en-
hanced efficiency by reducing computational complexity, partic-
ularly when leveraging Partially Homomorphic Encryption (PHE)
techniques. This optimization sets it apart from previously estab-
lished architectures that often overlook the symbiotic relationship
between encryption and model design.
• We further evaluate the proposed approach by implementing
a prototype of the homomorphic encryption-based data fabric
and the federated learning framework. The assessment indicates
that the suggested method offers an effective and reliable data
fusion for sharing and analyzing data securely. The experimental
results demonstrate that the proposed approach achieves satisfac-
tory accuracy in the collaborative training of machine learning
models, even when the data is encrypted. This pivotal advance-
ment distinguishes our work from current architectures that re-
quire disjointed solutions for data management and operational
practices.
2. Background studies
2.1. Data fabric
According to Gartner [17], Data Fabric is a unified and integrated
platform that enables data discovery, fusion and integration, manage-
ment, and access across multiple environments. It provides a consistent
and scalable approach to managing data assets that are distributed
across various locations, such as on-premises, cloud, and edge com-
puting. Data Fabric helps organizations to simplify and optimize their
data management processes, reduce data silos, and enable real-time
access to data. It also supports the creation of a self-service data
marketplace, allowing users to discover, share, and consume data in
a secure and governed manner. Data Fabric is increasingly becoming a
critical component of modern data architectures, as organizations seek
to manage the growing volume, velocity, and variety of data generated
by digital business initiatives.
In our research, we are utilizing homomorphic encryption to classify
pituitary tumors from MRI images in our dataset. We have used a
Data Fabric architecture to store the weights of different machine-
learning models as encrypted data. The ML models are run on client
PCs, and the resulting encrypted data is saved in our Data Lake. Using
homomorphic encryption, a server PC can perform computations on
the encrypted data, allowing for the creation of a homogeneous global
model. The server can then provide users with the requested results
without compromising the privacy of the MRI images. This approach
benefits from the Data Fabric’s ability to provide a unified and inte-
grated platform that enables data discovery, integration, management,
and access across multiple environments. By utilizing homomorphic
encryption and a Data Fabric architecture, we can classify pituitary
tumors from MRI images in a privacy-preserving manner, contributing
to the development of more secure and privacy-preserving medical
imaging technologies.
2.1.1. Vanilla architecture of data fabric
Fig. 1 provides a visual overview of the key components and pro-
cesses of the Vanilla Architecture of Data Fabric.
(i) Accessing Data:
(a) Data Collecting and Encryption: Data is a volatile resource, and medical data is considered highly sensitive because it can contain personally identifiable information such as names, addresses, dates of birth, and medical records that can be exploited if it falls into the wrong hands. To address this risk, in this architecture data is neither collected nor stored on a central server, which would pose a risk of data leakage.
To ensure privacy and reduce data volatility, medical data from each user is first selected and then encrypted with Partially Homomorphic Encryption (PHE). The encrypted data is then used to train the model locally in order to collect updated model weights. Each user's data is generated and stored locally, without being transferred to the central server; instead, the generated model updates are stored and merged to form the global model.
(b) Master Data Management: Following the generation of
local model updates and subsequent merging of data,
feature selection is employed as a means of optimiz-
ing and enhancing efficiency. By selecting the most rele-
vant features or weights, the dimensionality of the data
can be reduced, facilitating ease of analysis. Moreover,
feature selection mitigates the risk of overfitting, im-
proves model accuracy, and reduces computational costs,
thereby achieving heightened efficiency through the uti-
lization of a reduced training dataset.
FedMax, FedAvg, and FedMin are optimization algorithms
used in Federated Learning for feature selection. In all
three algorithms, the updated model weights are sent to the
server and stored for future use; FedMax selects the model
with the highest accuracy, FedMin the model with the lowest
loss, and FedAvg an average of all models.
In our architecture, we selected FedMax as our feature se-
lection algorithm to select important and relevant model
weights as it showed more accuracy and efficiency com-
pared to FedAvg and FedMin. The selected data is col-
lected and kept together as ‘‘Master Data’’.
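The three selection rules described above can be sketched as follows. This is a minimal illustration only: the `(weights, accuracy, loss)` tuple layout and the toy numbers are our assumptions, not the paper's implementation.

```python
import numpy as np

# Each client reports (local_weights, validation_accuracy, validation_loss).
# These toy values are placeholders, not measurements from the paper.
client_updates = [
    (np.array([0.2, 0.5]), 0.81, 0.42),
    (np.array([0.3, 0.4]), 0.86, 0.35),
    (np.array([0.1, 0.6]), 0.78, 0.50),
]

def fed_max(updates):
    # FedMax: keep the weights of the highest-accuracy local model
    return max(updates, key=lambda u: u[1])[0]

def fed_min(updates):
    # FedMin: keep the weights of the lowest-loss local model
    return min(updates, key=lambda u: u[2])[0]

def fed_avg(updates):
    # FedAvg: average the weights of all local models
    return np.mean([u[0] for u in updates], axis=0)

master_weights = fed_max(client_updates)  # second client wins on accuracy
```

The selected weights (here `master_weights`) correspond to what the architecture retains as "Master Data".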
(ii) Managing Life Cycle:
(a) Governance: Data governance is an essential component
of data fabric architecture. Data fabric architecture is an
approach to data management that enables organizations
to manage and process data from multiple sources, lo-
cations, and formats. It provides a unified view of data
across the organization and supports various data process-
ing requirements, such as data integration, analytics, and
artificial intelligence.
Data governance in data fabric architecture refers to the
policies, processes, and standards that organizations im-
plement to manage their data assets effectively. Data
governance helps organizations ensure that their data
is accurate, consistent, and compliant with regulatory
requirements. It also helps organizations manage data
privacy, security, and access.
The following are some key considerations for data gov-
ernance in data fabric architecture:
Data quality: Data governance policies should include
measures to ensure data quality, such as data profiling,
data cleansing, and data validation.
Metadata management: Data governance policies should
include metadata management to ensure that data is
properly tagged, categorized, and classified. Metadata
helps organizations understand the meaning and context
of their data and facilitates data discovery and reuse.
Data privacy and security: Data governance policies
should include measures to ensure data privacy and se-
curity, such as access controls, data encryption, and data
masking.
Data lineage: Data governance policies should include
data lineage to track the origin, transformation, and
movement of data across the organization. Data lineage
helps organizations understand how data is used and
facilitates compliance with regulatory requirements.
Data ownership and stewardship: Data governance
policies should define data ownership and stewardship
to ensure that data is managed and maintained by the
appropriate individuals and teams.
(b) Compliance: Data compliance refers to adhering to rel-
evant laws, regulations, and industry standards related to
the handling, processing, and storage of data. In the con-
text of data fabric, data compliance refers to ensuring that
data is managed in accordance with these requirements
across the entire data fabric. To ensure data compliance
in a data fabric, it is necessary to establish policies and
procedures that cover the entire data lifecycle, from data
ingestion to archival and deletion. Personal data must
be collected, processed, and stored in compliance with
privacy regulations such as GDPR, CCPA, HIPAA, etc.
In this architecture, the feature-selected weights were produced
by various models such as VGG16, VGG19, ResNet50, and
ResNet152, and the updates are stored in a data lake organized
by the model they were trained on, in compliance with HIPAA
regulations.
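One way such model-keyed storage might look is sketched below. The directory layout, function names, and JSON serialization are our assumptions for illustration, not the paper's implementation; Paillier ciphertexts are large integers, so they serialize without custom encoding.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout: one folder per model architecture in the data lake,
# each holding the encrypted weight updates uploaded by individual clients.
def store_encrypted_update(lake_root, model_name, client_id, enc_weights):
    model_dir = Path(lake_root) / model_name
    model_dir.mkdir(parents=True, exist_ok=True)
    # enc_weights: a list of Paillier ciphertexts (plain Python ints)
    (model_dir / f"{client_id}.json").write_text(json.dumps(enc_weights))

def load_updates(lake_root, model_name):
    # Gather every client's encrypted update for one model architecture
    return [json.loads(p.read_text())
            for p in sorted((Path(lake_root) / model_name).glob("*.json"))]

lake = tempfile.mkdtemp()
store_encrypted_update(lake, "VGG16", "client_a", [123456789, 987654321])
store_encrypted_update(lake, "VGG16", "client_b", [555555555, 444444444])
updates = load_updates(lake, "VGG16")
```

Keeping updates partitioned by model makes it straightforward to audit which architecture touched which data, which is what the compliance requirement above calls for.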
(iii) Exposing Data: Data exposure refers to making data available
for consumption and analysis by users or applications within an
organization. Exposing data in a data fabric involves providing
access to the data for authorized users or applications. There are
several ways to expose data in a data fabric, including:
APIs: Application Programming Interfaces (APIs) enable appli-
cations to access and retrieve data from the data fabric.
Data Catalogs: A data catalog provides a searchable inventory
of data assets in the data fabric. Users can discover and access
data assets through the data catalog.
Self-Service Analytics: A self-service analytics platform enables
users to create their own queries and reports using the data
available in the data fabric.
Data Virtualization: Data virtualization enables users to access
and combine data from multiple sources as if it were in a single
location.
2.2. Federated learning
Federated Learning is a distributed machine learning technique that
enables multiple clients to collaboratively learn a shared model without
exchanging their raw data. This technique has gained popularity in
recent years due to its ability to preserve data privacy and security
while improving model performance. As shown in Fig. 2, each client
trains a local model using its own data and then sends the local model
weights to a central server. The central server then aggregates the local
model weights to update a global model that is shared among all clients.
This process continues iteratively until the global model achieves the
desired level of accuracy. According to the report [18], Federated
Learning has been successfully applied to various domains, such as
speech recognition, natural language processing, and healthcare, where
data privacy is a major concern.
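The train-locally/aggregate-globally loop just described can be sketched in a few lines. The linear-regression task, learning rate, and round count below are illustrative assumptions chosen so the example is self-contained; only the FedAvg structure (clients return weights, server averages them) reflects the text.

```python
import numpy as np

def local_train(global_w, data, lr=0.1, epochs=5):
    # Each client starts from the current global weights and runs a few
    # full-batch gradient steps on its own (private) data.
    w = global_w.copy()
    X, y = data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    # Server aggregates by averaging the locally trained weights (FedAvg);
    # only weights travel over the network, never the raw data.
    local_ws = [local_train(global_w, d) for d in client_datasets]
    return np.mean(local_ws, axis=0)

# Three simulated clients, each holding its own linear-regression data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
# w converges toward true_w without any client sharing its raw data
```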
2.3. Homomorphic encryption
A cryptographic method called homomorphic encryption enables
mathematical operations to be carried out on ciphertext without ex-
posing the underlying plaintext. Table 1 offers a comprehensive view of
different types of homomorphic encryption along with the distinctions
that set them apart. In our research, we used partially homomorphic
encryption to encrypt sensitive medical images, specifically brain MRI
scans.
Partially homomorphic encryption (PHE) is a type of homomorphic
encryption that only supports a limited set of mathematical operations,
such as addition or multiplication. By encrypting the medical images
using this technique, we were able to process and analyze the data
without exposing the sensitive information contained within it. Fig. 3
provides an illustrative overview of the Partially Homomorphic Encryption (PHE) technique applied at the pixel level to the dataset images.
One major benefit of using partially homomorphic encryption in this
context is that it ensures the confidentiality of medical data. As medical
information is often highly sensitive and personal, it is important
to protect it from unauthorized access. By encrypting the data, we
were able to securely process and analyze it without compromising its
confidentiality.
In addition, partially homomorphic encryption allows for more
efficient processing of the encrypted data. Because the mathematical
operations can be performed directly on the ciphertext, there is no
need to decrypt the data first, which can be a time-consuming process.
This was particularly useful when working with large datasets or when
processing data in real time [19].
Overall, our use of partially homomorphic encryption proved to be
a successful and effective method for protecting the confidentiality of
sensitive medical images while still enabling their analysis.
2.3.1. The paillier encryption scheme
As previously mentioned, we are using Partial Homomorphic En-
cryption for our dataset which follows The Paillier Encryption Scheme.
The Paillier encryption scheme [20] is an additively homomorphic
cryptosystem based on the computational difficulty of the decisional
composite residuosity assumption. The scheme’s security relies on the
difficulty of factoring the product of two large prime numbers. Key
components of the Paillier encryption scheme are:
1. Key Generation: The key generation process creates a public
key (𝑛, 𝑔) and a private key (𝜆, 𝜇).
• 𝑛 is the product of two large prime numbers; 𝑛 itself is public, but its prime factorization is kept secret and known only to the key owner.
• 𝑔 is a public system parameter, typically set as 𝑔 = 𝑛 + 1.
• 𝜆 is Carmichael’s totient function [21], 𝜆 = 𝑙𝑐𝑚(𝑝 − 1, 𝑞 − 1),
where 𝑝 and 𝑞 are the large prime factors of 𝑛.
• 𝜇 is the modular multiplicative inverse of 𝜆 modulo 𝑛.
2. Encryption: Given a plaintext image 𝑥, the encryption process
is performed as follows:
𝐸𝑛𝑐(𝑥) = (𝑔^𝑥 × 𝑟^𝑛) mod 𝑛²   (1)
where 𝑟 is a random value chosen for each encryption, ensuring
probabilistic encryption.
2.3.2. Homomorphic operations
1. Addition: Addition on encrypted values is akin to combining the
original plaintext values after decryption. Given two encrypted
images 𝐸𝑛𝑐(𝑥) and 𝐸𝑛𝑐(𝑦), the homomorphic addition can be
performed as:
𝐸𝑛𝑐(𝑥 + 𝑦) = (𝐸𝑛𝑐(𝑥) × 𝐸𝑛𝑐(𝑦)) mod 𝑛²   (2)
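The full scheme — key generation, encryption per Eq. (1), decryption, and homomorphic addition per Eq. (2) — can be sketched in pure Python. The toy primes and helper names below are ours; real deployments use primes of 1024 bits or more.

```python
import math
import random

def keygen(p, q):
    # Public key (n, g) with the common simplification g = n + 1;
    # private key (lam, mu) as in the Paillier scheme.
    n = p * q
    g = n + 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (valid when g = n + 1)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)      # fresh randomness -> probabilistic encryption
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)   # Eq. (1)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n  # L(x) = (x - 1) / n, then multiply by mu

pub, priv = keygen(61, 53)          # toy primes for demonstration only
c1, c2 = encrypt(pub, 42), encrypt(pub, 17)
c_sum = (c1 * c2) % (pub[0] ** 2)   # Eq. (2): multiply ciphertexts mod n^2
assert decrypt(pub, priv, c_sum) == 42 + 17
```

Note that the two ciphertexts are combined without ever decrypting them; only the holder of the private key can recover the sum, which is exactly the property the architecture relies on when aggregating encrypted model weights.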