
Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity

Mengjiao Zhang

Department of Computer Science

Stevens Institute of Technology

mzhang49@stevens.edu

Jia Xu

Department of Computer Science

Stevens Institute of Technology

jxu70@stevens.edu

Abstract

While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Therefore, protecting against such embedding attacks remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB), which encodes subwords as byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires less memory, with a vocabulary of only 256 bytes, while remaining efficient by keeping the same input length. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show that SEB can effectively prevent embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling, with even lower time and space complexity.

1 Introduction

Natural Language Processing (NLP) models, such as Large Language Models (LLMs), have improved noticeably in performance over the last decades, partially owing to the large datasets available. Since most data come from users, privacy concerns play an increasingly critical role: addressing them is essential to building user trust, encouraging the responsible use of language data, protecting personal information, ensuring ethical use, and avoiding potential harm to individuals. Federated learning (FL) enables training shared models across multiple clients without transferring the data to a central server, in order to preserve user privacy. Although only the model updates are sent to the central server, adversaries can still use these updates to reconstruct the original data and leak sensitive information, compromising the user’s privacy. Figure 1(a) demonstrates an FL framework, and Figure 1(b) shows how embedding-based attacks work, as in [ ]. In the illustrated example, the attacker extracts all candidate words in a batch of data from the embedding gradients and can easily reconstruct the text with beam search and reordering. This is possible because, given the one-to-one mapping between word/subword tokens and embedding vectors, a straightforward lookup reveals which tokens correspond to the updated vectors.

Preprint. Under review.

arXiv:2410.16410v1 [cs.AI] 21 Oct 2024

[Figure 1, panels: (a) clients holding private batch data send gradients to the server and receive model parameters; (b) bag-of-words extraction from the gradient, followed by beam search and reordering, reconstructs “Tom lives in New York.”; (c) bag-of-bytes extraction yields a set of byte IDs, which map to a much larger bag of candidate subwords.]

Figure 1: An attack example of recovering text in FL. (a): An FL framework. (b) and (c): Recovering text using embedding gradients of subwords and bytes.

Our intuitive idea is to apply byte embeddings, because the same bytes are reused across many subwords. We aim to design a one-to-many mapping between words/subwords and embedding vectors so that a simple lookup no longer suffices: retrieving the input subwords from the updated byte embeddings becomes harder, which makes byte embeddings in NLP models a potential defense. For example, with subword embeddings, if the word “good” is updated, the attacker retrieves exactly this word from the embedding updates. However, if we tokenize “good” into four bytes, such as “50, 10, 128, 32”, all subwords containing at least one of these bytes will be retrieved, resulting in a larger search space and more possibilities when recovering the original sentence. As shown in Figure 1(c), although the attacker extracts a set of bytes, the number of candidate subwords is much greater than with subword embeddings.
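The growth of the search space described above can be illustrated with a small sketch. It assumes a hypothetical vocabulary of 1,001 tokens and a random assignment of four byte IDs per subword; the helper names `build_byte_mapping` and `candidates_from_bytes` are made up for this illustration and are not taken from the paper.

```python
import random

def build_byte_mapping(vocab, n_bytes=4, seed=0):
    """Assign each subword a unique random sequence of n_bytes byte IDs (0..255).
    The random assignment is an assumption made for illustration."""
    rng = random.Random(seed)
    mapping, used = {}, set()
    for token in vocab:
        while True:
            seq = tuple(rng.randrange(256) for _ in range(n_bytes))
            if seq not in used:
                break
        used.add(seq)
        mapping[token] = seq
    return mapping

def candidates_from_bytes(mapping, updated_bytes):
    """Subwords that contain at least one of the updated byte IDs."""
    return {tok for tok, seq in mapping.items() if updated_bytes & set(seq)}

# Hypothetical vocabulary: 1,000 placeholder tokens plus the word "good".
vocab = [f"tok{i}" for i in range(1000)] + ["good"]
mapping = build_byte_mapping(vocab)

# Subword embeddings: the updated row identifies "good" directly.
subword_candidates = {"good"}

# Byte embeddings: only the byte rows of "good" are known to be updated.
updated_bytes = set(mapping["good"])
byte_candidates = candidates_from_bytes(mapping, updated_bytes)

print(len(subword_candidates), len(byte_candidates))
```

With a single updated word, the subword lookup returns one candidate, while the byte lookup returns every subword that shares a byte with it.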

There are two major challenges in directly applying existing byte encodings [ ] to enhance privacy. First, the smaller textual granularity does not convey the semantic meaning of each word, leading to a less interpretable and analyzable model. Second, byte-based models are more computationally expensive, as input sequences become much longer after byte tokenization.

To address these challenges in byte-based models, we propose to encode subwords with bytes and aggregate the byte embeddings to obtain a single subword embedding. The procedure consists of three steps: (1) Construct a mapping between subwords and bytes. (2) Convert the input text into a byte sequence. (3) Retrieve the corresponding byte embeddings and aggregate them back into subword embeddings using a feed-forward network while maintaining the subword boundaries. By adopting this approach, we can leverage the privacy protection provided by bytes with a small vocabulary size of 256 while keeping the same input sequence length as the subword sequence.
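The three steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the random subword-to-byte mapping, the sizes `N_BYTES`, `BYTE_DIM`, and `OUT_DIM`, and the single ReLU layer that aggregates concatenated byte embeddings are all hypothetical choices.

```python
import random

N_BYTES, BYTE_DIM, OUT_DIM = 4, 8, 16   # hypothetical sizes
rng = random.Random(0)

# Step 1: map every subword to a fixed-length byte sequence
# (random assignment is an assumption for this sketch).
vocab = ["tom", "lives", "in", "new", "york", "."]
sub2bytes = {tok: [rng.randrange(256) for _ in range(N_BYTES)] for tok in vocab}

# Trainable parameters: a 256-row byte embedding table and one feed-forward layer.
byte_emb = [[rng.gauss(0, 0.1) for _ in range(BYTE_DIM)] for _ in range(256)]
W = [[rng.gauss(0, 0.1) for _ in range(N_BYTES * BYTE_DIM)] for _ in range(OUT_DIM)]
b = [0.0] * OUT_DIM

def to_byte_sequence(tokens):
    """Step 2: convert subword tokens into per-subword groups of byte IDs."""
    return [sub2bytes[t] for t in tokens]

def aggregate(byte_ids):
    """Step 3: look up byte embeddings, concatenate, and apply a ReLU layer."""
    concat = [x for i in byte_ids for x in byte_emb[i]]
    return [max(0.0, sum(w * x for w, x in zip(row, concat)) + bias)
            for row, bias in zip(W, b)]

def embed_sentence(tokens):
    """One output vector per subword, so the sequence length is unchanged."""
    return [aggregate(group) for group in to_byte_sequence(tokens)]

sentence = ["tom", "lives", "in", "new", "york", "."]
out = embed_sentence(sentence)
print(len(out), len(out[0]))  # 6 16
```

Because the byte IDs are grouped per subword before aggregation, subword boundaries are preserved and the model sees six vectors for six subwords, while the only embedding table has 256 rows.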

Our main contributions are:

• We introduce a novel text representation method, SEB, which achieves a vocabulary size of 256 for the learned model without increasing the input sequence length.

• We verify that SEB can protect NLP models against data leakage attacks based on embedding gradients. To the best of our knowledge, our work is the first to study privacy preservation with byte representations in FL.

• We demonstrate that SEB improves privacy and, at the same time, achieves comparable or better accuracy with enhanced time and space efficiency, avoiding the privacy-performance/efficiency trade-off of conventional approaches.

2 Related Work

Attacks and defenses in language models. Some recent works treat reconstruction as an optimization task [ ]. The attacker updates its dummy inputs and labels to minimize the distance between the gradients uploaded by the victim and the gradients the attacker computes from its dummy inputs and labels. [ ] shows that attackers can reconstruct a set of words from the embedding gradients, then apply beam search and reordering with a pretrained language model to recover the input. One defense described in [ ] is to encrypt the gradients or make them not directly inferable. However, encryption requires special setups and can be costly to implement. Moreover, it does not provide effective protection against server-side privacy leakage [ ]. Differential privacy is another defense strategy, but it can hurt model accuracy [ ]. While [ ] proposed a secure federated learning framework that can prevent privacy leakage based on gradient reconstruction, it does not effectively address the retrieval of a bag of words from the embedding matrix gradients, as proposed in [7].

Subword-level and byte-level language models. Subword tokenization such as BPE [ ] has some limitations, despite its wide application. It cannot handle out-of-vocabulary subwords and requires language-specific tokenizers. Another challenge is the high space complexity of the embedding matrix when the vocabulary is large. Byte tokenization is a solution to these issues [ ]. UTF-8 can encode almost all languages, so there are no out-of-vocabulary words and a language-specific tokenizer is unnecessary. In addition, since a byte can take only 256 distinct values, the embedding matrix for a byte vocabulary is much smaller than for most subword vocabularies, reducing the number of parameters in the embedding layer and saving memory.
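As a rough worked example of the space saving, the snippet below compares embedding-table sizes. The subword vocabulary size of 32,000 and the embedding dimension of 512 are hypothetical values chosen for illustration; only the byte vocabulary size of 256 comes from the text.

```python
def embedding_params(vocab_size, dim):
    """Number of parameters in a vocab_size x dim embedding matrix."""
    return vocab_size * dim

DIM = 512               # hypothetical embedding dimension
SUBWORD_VOCAB = 32_000  # hypothetical subword vocabulary size
BYTE_VOCAB = 256        # number of distinct byte values

subword_params = embedding_params(SUBWORD_VOCAB, DIM)
byte_params = embedding_params(BYTE_VOCAB, DIM)

print(subword_params)                 # 16384000
print(byte_params)                    # 131072
print(subword_params // byte_params)  # 125
```

Under these assumed sizes the byte table is 125 times smaller. The count covers the embedding table only and ignores any additional layers a byte-based model may need.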

Subword-level model with character- or byte-level fusion. Character/byte-based models often result in longer input sequences and higher time complexity than subword-based models. To make the model efficient, recent works have explored character/byte-level fusion. For example, [ ] proposes CHARFORMER, which uses a soft gradient-based subword tokenization module to obtain “subword tokens”. It generates and scores multiple subword blocks, aggregates them to obtain a subword representation, and then performs downsampling to reduce the sequence length. Although CHARFORMER is faster than vanilla byte/character-based models, it does not maintain subword boundaries, limiting the model’s interpretability. [ ] proposes Local Bytes Fusion (LOBEF) to aggregate local semantic information and maintain the word boundary. However, it does not reduce the sequence length, making training and inference time-consuming.

3 Preliminaries

3.1 Federated Learning

In federated learning (FL), multiple clients jointly train a model using their private data. Assume we have $N$ clients, $C = \{c_1, c_2, \ldots, c_N\}$, and a server $s$ in an FL system. The jointly trained model is $f$ with parameters $\theta$. The clients’ private data are $D_1, D_2, \ldots, D_N$, and the objective function is $\mathcal{L}$. For ease of illustration, we assume all clients participate in each communication round and use FedSGD [ ] to update the model parameters. In each communication round $t$, server $s$ first sends the model parameters $\theta^t$ to all clients. Then each client $c_i$ computes $\nabla_{\theta^t} \mathcal{L}_{\theta^t}(B_i)$, the gradients of the current model $f_{\theta^t}$, based on a randomly sampled data batch $B_i \subset D_i$. After local computation, the clients send the gradients $\Delta_1, \Delta_2, \ldots, \Delta_N$ to the server, and server $s$ aggregates all the gradients and updates the model:

$$\theta^{t+1} = \theta^t - \eta \sum_{i=1}^{N} \nabla_{\theta^t} \mathcal{L}_{\theta^t}(B_i). \tag{1}$$

Here, Equation (1) is the gradient descent update, and $\eta$ is the learning rate.
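One communication round of this scheme can be sketched as follows. The sketch assumes that the server sums the client gradients, as Equation (1) appears to, and uses a toy linear model with mean squared error as a stand-in for the language model; `grad_mse`, `fedsgd_round`, and the two client batches are hypothetical.

```python
def grad_mse(theta, batch):
    """Gradient of the mean squared error of a linear model y ~ theta . x
    on one client's batch (a toy stand-in for the real loss)."""
    g = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            g[j] += 2 * err * xi / len(batch)
    return g

def fedsgd_round(theta, client_batches, lr):
    """Server update: theta_next = theta - lr * (sum of client gradients)."""
    # Each client computes its gradient locally; only gradients leave the client.
    grads = [grad_mse(theta, batch) for batch in client_batches]
    # The server aggregates the received gradients and applies gradient descent.
    total = [sum(g[j] for g in grads) for j in range(len(theta))]
    return [t - lr * s for t, s in zip(theta, total)]

# Two hypothetical clients whose private data are consistent with theta = (2, -1).
client_batches = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)],
    [((1.0, 1.0), 1.0)],
]

theta = [0.0, 0.0]
for _ in range(200):
    theta = fedsgd_round(theta, client_batches, lr=0.1)
print([round(t, 3) for t in theta])
```

The parameters converge toward (2, -1) although the server never sees the batches. The gradients it does see are exactly what the threat model in the next subsection exploits.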

3.2 Threat Model

Adversary’s capabilities and objective. We follow the attack settings in [ ]. The optimized model is a language model parameterized by $\theta$. In this scenario, the attacker has white-box access to the gradients $\nabla_{\theta^t} \mathcal{L}_{\theta^t}(B_i)$ that the victim client $c_i$ computes on its private batch $B_i$ and sends to the server, where $\theta^t$ denotes the model parameters that the server sends to the clients at any communication round $t$. From the parameters $\theta^t$ and the gradients $\nabla_{\theta^t} \mathcal{L}_{\theta^t}(B_i)$, the attacker can obtain the vocabulary and the embedding matrix and thereby determine which tokens are updated. The goal of the attacker is to recover at least one sentence from $B_i$, based on $\nabla_{\theta^t} \mathcal{L}_{\theta^t}(B_i)$ and $\theta^t$.

Attack model. This paper does not address the gradient leakage attack, which aims to obtain private data by minimizing the difference between gradients derived from a dummy input and the actual gradients of the victim’s data, because several methods have been proposed to mitigate this particular attack [ ]. Instead, we focus on a specific attack model, FILM, introduced in [ ], for which effective defenses have yet to be explored. In this model, the attacker attempts to reconstruct sentences from the victim’s training batches as follows: (1) extracting candidate tokens from the gradients, (2) applying beam search with a pre-trained language model, such as GPT-2, to reconstruct the input sentence, and (3) reordering the subword tokens to achieve the best reconstruction.
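Step (1) of this attack can be sketched in isolation. The sketch assumes a simplified setting in which only the embedding rows of tokens present in the batch receive a nonzero gradient; the function `extract_bag_of_tokens`, the tiny vocabulary, and the stand-in gradient values are hypothetical, and steps (2) and (3) are not shown.

```python
def extract_bag_of_tokens(embedding_grad, id_to_token, eps=0.0):
    """Return the tokens whose embedding rows carry a nonzero gradient."""
    return [id_to_token[i]
            for i, row in enumerate(embedding_grad)
            if any(abs(g) > eps for g in row)]

# Hypothetical vocabulary and a victim batch containing "tom lives in new york ."
id_to_token = ["[PAD]", "tom", "lives", "in", "new", "york", "paris", "."]
batch_ids = [1, 2, 3, 4, 5, 7]

# Stand-in for the embedding-matrix gradient: rows of unused tokens stay zero.
DIM = 4
grad = [[0.0] * DIM for _ in id_to_token]
for token_id in batch_ids:
    grad[token_id] = [0.01 * (token_id + 1)] * DIM

print(extract_bag_of_tokens(grad, id_to_token))
# ['tom', 'lives', 'in', 'new', 'york', '.']
```

The extracted bag contains exactly the batch tokens, which is why the later beam search and reordering steps only have to recover word order.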
