Python package for NLP obfuscation using differential privacy (.zip resource) - CSDN Library

26 files in total

py: 18 files

gitignore: 1 file

yml: 1 file

109 views · uploaded 2024-05-17 17:41:55 · 396KB ZIP

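Protecting personal privacy is an important concern in natural language processing (NLP). As big data and artificial intelligence advance, text data is analyzed in ever greater depth, which can expose users' sensitive information. Differential privacy (DP) is a rigorous privacy-protection technique that is widely used in data analysis and machine learning, including NLP. This page describes how the "pypantera-main" package can be used in Python to obfuscate NLP text under differential privacy.

The core idea of differential privacy is to add calibrated random noise so that the distribution of a mechanism's output changes only slightly (by a factor of at most e^ε) when any single individual's data changes, which protects that individual. In NLP this can be applied to word-frequency statistics, model training, and similar steps, so that a user's specific vocabulary or phrasing cannot be singled out.

"pypantera-main" appears to be a differential privacy library implemented in Python and designed for NLP tasks. No detailed tag information is provided, so the feature list below is an assumption about what such a library could offer. The bundled README reproduced further down lists what the package actually implements: word-embedding perturbation and word-sampling mechanisms.

1. **Text preprocessing**: Before differential privacy is applied, the raw text is normalized, for example by removing stop words and punctuation and by lemmatizing.
2. **Differential privacy mechanisms**: The library may include several mechanisms, such as the Laplace mechanism (noise calibrated to the L1 sensitivity, giving pure ε-DP) and the Gaussian mechanism (noise calibrated to the L2 sensitivity, giving (ε, δ)-DP), which add noise as part of statistical analysis or model training.
3. **Private word-frequency statistics**: Word-frequency counting is a common NLP task, but releasing exact counts can leak information. pypantera-main may provide a way to count word frequencies while protecting privacy.
4. **Differentially private model training**: The library may support training text-classification, sentiment-analysis, and similar models under differential privacy. This could include perturbing gradients so that model parameters are updated while privacy is preserved.
5. **Evaluation and visualization**: The library may include evaluation tools that show how model performance changes under differential privacy, and possibly visualization tools that help explain a model's decisions.
6. **Configuration and parameter tuning**: Users may be able to adjust the ε value, the key parameter that controls the trade-off between the strength of the privacy guarantee and the accuracy of the data.
7. **Compatibility and integration**: Given the other NLP libraries in the Python ecosystem, pypantera-main may be designed to work with popular libraries such as NLTK, spaCy, and transformers, so that it can be integrated into existing NLP pipelines.

When using "pypantera-main", developers need to choose a suitable privacy budget (ε) and keep model performance as high as possible within the privacy requirement. Understanding and tuning the other differential privacy parameters also matters: δ, the small probability with which the ε guarantee is allowed to fail, and b, the scale parameter of the Laplace distribution (μ is its location parameter, not its scale).

The "pypantera-main" package offers a practical way to apply differential privacy in NLP tasks, protecting user privacy while still allowing useful data analysis and model training. Further study and practice help in finding the balance between privacy protection and algorithm efficiency.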
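
To make the private word-frequency idea concrete, here is a minimal sketch of the Laplace mechanism applied to word counts. It is not taken from the package: the function names, the fixed public vocabulary, and the assumption that each user contributes a single token (so the L1 sensitivity is 1) are all illustrative choices.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_word_counts(tokens, vocabulary, epsilon, sensitivity=1.0, rng=None):
    """Return noisy counts for every word in a fixed, public vocabulary.

    Assumption: adding or removing one user's data changes the count vector
    by at most `sensitivity` in L1 norm (1.0 if each user contributes a
    single token).
    """
    rng = rng or random.Random()
    counts = Counter(tokens)
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    # Noise is added to every vocabulary word, including words with count 0,
    # so the set of released keys does not depend on the private data.
    return {word: counts[word] + laplace_noise(scale, rng) for word in vocabulary}


tokens = "privacy matters because privacy protects people".split()
vocabulary = ["privacy", "people", "security"]
noisy = private_word_counts(tokens, vocabulary, epsilon=1.0, rng=random.Random(0))
for word, value in noisy.items():
    print(f"{word}: {value:.2f}")
```

A smaller ε gives a larger noise scale and therefore stronger privacy at the cost of less accurate counts. If one user can contribute several tokens, `sensitivity` has to be raised accordingly.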

Python package for NLP obfuscation using differential privacy.zip (26 files)

pypantera-main

pypantera

__init__.py 0B

src

__init__.py 0B

EmbeddingPerturbationMechanism

__init__.py 0B

cmp.py 5KB

mahalanobis.py 6KB

AbstractEmbeddingPerturbationMechanism.py 3KB

vickrey.py 10KB

utils

__init__.py 0B

helper.py 8KB

vocab.py 2KB

AbstractTextObfuscationDPMechanism.py 4KB

SamplingPerturbationMechanism

__init__.py 0B

santext.py 4KB

custext.py 6KB

AbstractSamplingPerturbationMechanism.py 2KB

tem.py 7KB

setup.py 8KB

LICENSE 34KB

environment.yml 3KB

requirements.txt 2KB

.gitignore 3KB

images

classes.png 124KB

pyPANTER.webp 244KB

classes.pdf 13KB

test.py 2KB

README.md 11KB

# pyPANTERA

## A Python **P**ackage for n**A**tural la**N**guage obfusca**T**ion **E**nforcing p**R**ivacy & **A**nonymization

<p align="center">
  <img src="./images/pyPANTER.webp" width="255">
</p>

## What is pyPANTERA?

pyPANTERA[^1] is a Python package that provides a simple interface to obfuscate natural language text. It is designed to help developers and data scientists implement, reproduce, and test state-of-the-art techniques for natural language obfuscation that satisfy $\varepsilon$-Differential Privacy. The package is built on the numpy, pandas, and scikit-learn libraries, and it is designed to be easy to use and to integrate with other Python packages.

The package combines natural language processing and mathematical transformations to obfuscate natural language text. It replaces the original texts with obfuscated versions, so that the obfuscated text is not directly related to the original text. The obfuscation is performed using word-embedding and word-sampling mechanisms, and it is designed to be $\varepsilon$-Differential Privacy compliant.

## Virtual Environment

We also provide a virtual environment to run the package:

1. Create the virtual environment ***virtualEnvPyPANTERA*** from the `environment.yml` file by running in your terminal:

   ```bash
   conda env create -f environment.yml
   ```

2. Once the environment is created, verify that it is installed by running:

   ```bash
   conda env list
   ```

3. Finally, activate the virtual environment by running:

   ```bash
   conda activate virtualEnvPyPANTERA
   ```

The `requirements.txt` file lists the packages exported from the virtual environment.

## How to use pyPANTERA?

pyPANTERA is designed to be easy to use and accessible for everyone.
You can install it using pip:

```bash
pip install pypantera
```

Once installed, you can use it in your Python code by importing it as follows:

```python
import pypantera
```

## What can pyPANTERA do?

pyPANTERA implements current state-of-the-art mechanisms that use $\varepsilon$-Differential Privacy to obfuscate natural language text. The mechanisms implemented in pyPANTERA are divided into two categories:

- **Word Embeddings Perturbation**: These mechanisms use word embeddings to obfuscate the text. They replace the original word embeddings with perturbed versions, obtained by adding statistical noise whose form depends on the mechanism design. The implemented mechanisms are the following:
  - Calibrated Multivariate Perturbations (**CMP**): Addition of spherical noise to the word embeddings. See reference [^2] for more information.
  - Mahalanobis Perturbations (**Mahalanobis**): Addition of elliptical noise to the word embeddings. See reference [^3] for more information.
  - Vickrey family of mechanisms (**Vickrey**): Perturbation performed using a threshold value to select the nearest perturbed embedding of a term. See reference [^4] for more information.
- **Word Sampling Perturbation**: These mechanisms use word sampling to obfuscate the text. For each word in the text, the mechanism computes a list of neighbouring words with their respective scores, then samples a substitution candidate based on the scores of the neighbouring terms and the privacy budget $\varepsilon$. The implemented mechanisms are the following:
  - Customized Text (**CusText**): Sampling of the substitution candidate from the $k$ neighbouring words of the original word. See reference [^5] for more information.
  - Sanitization Text (**SanText**): Sampling of the substitution candidate from the neighbouring words of the original word. See reference [^6] for more information.
  - Truncated Exponential Mechanism (**TEM**): Sampling of the substitution candidate using the exponential mechanism with the scores of the neighbouring words. See reference [^7] for more information.

## How does pyPANTERA work?

We provide a concrete example to show how pyPANTERA works. We suggest using the prepared virtual environment and the base script `test.py` to run the obfuscation pipeline.

```bash
python test.py \
  --embPath /absolute/path/to/embeddings \
  --inputPath /absolute/path/to/input/data \
  --outputPath /absolute/path/to/output/data \
  --mechanism MECHANISM \
  --epsilon EPSILON \
  --task TASK \
  --numberOfObfuscations N \
  --PARAMETERS
```

The script runs the obfuscation pipeline using the embeddings at the path given in the `--embPath | -eP` argument and the input data at the path given in the `--inputPath | -i` argument, and it stores the results at the path given in `--outputPath | -o`. If `--outputPath` is not provided, the script creates a folder `./results/task/mechanism/` to save the obfuscated data frames.

pyPANTERA requires the input data to be a CSV file with a column named `text`, which contains the text to obfuscate, and an `id` column, which keeps track of the correspondence between the original and obfuscated versions.

The `--task | -tk` argument specifies the future task that you want to perform using the new obfuscated texts. The `--epsilon | -e` argument specifies the epsilon value for the differential privacy mechanism. The `--mechanism | -m` argument specifies the mechanism to use for the obfuscation. The `--numberOfObfuscations | -n` argument specifies the number of obfuscations to perform for the same text. Finally, `--PARAMETERS` stands for the parameters of the mechanism that you want to use. We provide a specific list of parameters for each mechanism in the following section.
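
The word-sampling idea described in this README can be illustrated with a toy version of the plain exponential mechanism. The sketch below is not pyPANTERA's implementation and it is not the truncated variant (TEM): the toy 2-D embeddings, the function names, and the use of the vocabulary diameter as a sensitivity bound are assumptions made for illustration only. It processes rows shaped like the `id`/`text` input described above.

```python
import math
import random

# Toy 2-D "embeddings" (hypothetical); a real run would load word vectors
# from the file passed via --embPath.
TOY_EMBEDDINGS = {
    "cat": (0.9, 0.1),
    "dog": (0.8, 0.2),
    "car": (0.1, 0.9),
    "truck": (0.2, 0.8),
}


def sampling_probabilities(word, epsilon, embeddings=TOY_EMBEDDINGS):
    """Exponential-mechanism probabilities over the whole toy vocabulary.

    A candidate's score is its negative Euclidean distance to `word`. The
    sensitivity is bounded by the largest pairwise distance in the
    vocabulary, so p(candidate) is proportional to
    exp(epsilon * score / (2 * sensitivity)).
    """
    vectors = list(embeddings.values())
    sensitivity = max(math.dist(a, b) for a in vectors for b in vectors)
    weights = {
        cand: math.exp(-epsilon * math.dist(embeddings[word], vec) / (2 * sensitivity))
        for cand, vec in embeddings.items()
    }
    total = sum(weights.values())
    return {cand: weight / total for cand, weight in weights.items()}


def obfuscate(text, epsilon, rng):
    """Replace each in-vocabulary token with a sampled substitute."""
    output = []
    for token in text.lower().split():
        if token not in TOY_EMBEDDINGS:
            # Simplification of this toy only: unknown tokens pass through
            # unchanged, which a real mechanism must not do.
            output.append(token)
            continue
        probs = sampling_probabilities(token, epsilon)
        output.append(rng.choices(list(probs), weights=list(probs.values()), k=1)[0])
    return " ".join(output)


rows = [{"id": 1, "text": "cat chases car"}, {"id": 2, "text": "dog rides truck"}]
rng = random.Random(0)
for row in rows:
    print(row["id"], obfuscate(row["text"], epsilon=10.0, rng=rng))
```

With a small $\varepsilon$ the sampling distribution is close to uniform over the vocabulary; as $\varepsilon$ grows, the original word and its nearest neighbours become the most likely substitutes.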
## UML of pyPANTERA

The UML diagram of the pyPANTERA source code is displayed below:

<center>

![pyPANTER UML diagram](./images/classes.png)

</center>

## Parameters

The script `test.py` accepts the following parameters, depending on the mechanism that you want to use:

- **General Parameters**:
  - `--embPath | -eP`: The path to the word embeddings file (default str: None, **required**)
  - `--inputPath | -i`: The path to the input data file (default str: None, **required**)
  - `--outputPath | -o`: The path to the output data file (default str: None)
  - `--task | -tk`: The future task that you want to perform using the new obfuscated texts (default str: 'retrieval')
  - `--epsilon | -e`: The epsilon value for the differential privacy mechanism (default List[float]: [1.0, 5.0, 10.0, 12.5, 15.0, 17.5, 20.0, 50.0])
  - `--mechanism | -m`: The mechanism to use for the obfuscation (default str: 'CMP', choices: ['CMP', 'Mahalanobis', 'VickreyCMP', 'VickreyMhl', 'CusText', 'SanText', 'TEM'])
  - `--numberOfObfuscations | -n`: The number of obfuscations to perform for the same text (default int: 1)
- **CMP**: The CMP mechanism takes only the general parameters.
- **Mahalanobis**: The parameters for the Mahalanobis mechanism are the following:
  - `--lam`: The lambda value for the Mahalanobis norm (default float: 1)
- **VickreyCMP/VickreyMhl**: The parameters for the Vickrey mechanism are the following:
  - `--t`: The threshold value for the Vickrey mechanism (default float: 0.75). Additionally, if you use the `VickreyMhl` mechanism, you can also use the `--lam` parameter to set the lambda value for the Mahalanobis norm (default float: 1)
- **CusText**: The parameters for the CusText mechanism are the following:
  - `--k`: The number of neighbouring words to consider for the sampling (default int: 10)
  - `--distance | -d`: The distance metric to use for the sampling (default str: 'euclidean')
- **SanText**: The SanText mechanism takes only the general parameters.
- **TEM**: The parameters for the TEM mechanism are the following:
  - `--beta`: The beta value for the exponential mechanism (default float: 0.001)

## Example

Suppose you want to run the obfuscation pipeline using the `CMP` mechanism with the embeddings in the path `./embeddings/glove.6B.50d.txt
