NLP course project: news sentiment polarity classification, with source code and documentation

70 files in total

py: 40

json: 6

csv: 5

Natural language processing

Uploaded 2023-12-23 08:11:21, 194.74MB ZIP

NLP-homework-master.zip (70 files)

NLP-homework-master

run_flask.py 2KB

README.txt 385B

requirements.txt 979B

models

__init__.py 0B

LSTM

LSTM.py 1KB

LSTM.h5 1.01MB

interface.py 994B

trained.kv.trainables.syn1neg.npy 75.02MB

trained.kv 12.17MB

trained.kv.wv.vectors.npy 75.02MB

NaiveBayes

TestModel.py 798B

__init__.py 0B

TrainModel.py 2KB

Test.csv 6.87MB

解释文档.txt 964B

Type2.csv 716KB

interface.py 724B

Type0.csv 357KB

Type1.csv 1.16MB

Train.csv 17.48MB

NaiveBayes.py 3KB

ALBERT

__init__.py 0B

preprocessor.py 9KB

interface.py 2KB

model

__init__.py 0B

configuration_bert.py 6KB

tokenization_utils.py 39KB

file_utils.py 10KB

modeling_albert.py 35KB

configuration_utils.py 10KB

optimization.py 8KB

tokenization_bert.py 19KB

modeling_utils.py 40KB

checkpoints

pytorch_model.bin 41.51MB

config.json 757B

train_albert_pytorch

__init__.py 15B

prepare_lm_data_ngram.py 22KB

prepare_lm_data_mask.py 22KB

configs

__init__.py 15B

albert_config_xlarge.json 562B

albert_config_xxlarge.json 564B

bert_config.json 518B

albert_config_large.json 563B

albert_config_base.json 561B

base.py 2KB

vocab.txt 107KB

run_pretraining.py 18KB

callback

utils.py 475B

__init__.py 0B

modelcheckpoint.py 4KB

progressbar.py 2KB

optimizater.py 66KB

ccf_processor.py 10KB

lcqmc_progressor.py 8KB

common

__init__.py 0B

metrics.py 10KB

tools.py 11KB

.gitignore 201B

README.md 4KB

convert_albert_tf_checkpoint_to_pytorch.py 3KB

run_classifier.py 25KB

vocab.txt 107KB

.gitignore 171B

static

jquery-3.4.1.min.js 86KB

css

bootstrap.min.css 98KB

demo.css-b981ec9.css 28KB

global.css-b981ec9.css 21KB

demo.html 3KB

1.gif 3KB

BaseAPI.py 119B

## albert_zh_pytorch

This repository contains a PyTorch implementation of the ALBERT model from the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/pdf/1909.11942.pdf) by Zhenzhong Lan, Mingda Chen, et al.

arXiv: https://arxiv.org/pdf/1909.11942.pdf

## Pre-LN and Post-LN

* Post-LN: in the original Transformer, Layer Norm comes after the residual connection. This is called the `Post-LN Transformer`.
* Pre-LN: Layer Norm is moved to a different position, for example inside the residual branch. This is called the `Pre-LN Transformer`.

![](https://lonepatient-1257945978.cos.ap-chengdu.myqcloud.com/Selection_001.png)

Paper: [On Layer Normalization in the Transformer Architecture](https://openreview.net/forum?id=B1x8anVFPr)

**Usage**

When using the model weight files provided by [brightmart](https://github.com/brightmart/albert_zh), add the `ln_type` parameter to the configuration file. Its value is either `postln` or `preln`:

```json
{
  "attention_probs_dropout_prob": 0.0,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "embedding_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128,
  "ln_type": "postln"
}
```

## Share type

**Cross-Layer Parameter Sharing**: ALBERT uses cross-layer parameter sharing in the attention and FFN (feed-forward network) layers to reduce the number of parameters. Set the `share_type` parameter to one of:

* all: share both attention and FFN layer parameters
* ffn: share only FFN layer parameters
* attention: share only attention layer parameters
* None: no parameter sharing

**Usage**

Specify the `share_type` parameter when loading `config`:

```python
config = BertConfig.from_pretrained(bert_config_file, share_type=share_type)
```

## Download Pre-trained Chinese Models

Thanks to brightmart for providing the Chinese model weights: [github](https://github.com/brightmart/albert_zh)

1. [albert_large_zh](https://storage.googleapis.com/albert_zh/albert_large_zh.zip): parameter count not given, 24 layers, file size 64M
2. [albert_base_zh (small model, trial version)](https://storage.googleapis.com/albert_zh/albert_base_zh.zip): 12M parameters, 12 layers, file size 40M
3. [albert_xlarge_zh](https://storage.googleapis.com/albert_zh/albert_xlarge_zh.zip): parameter count not given, 24 layers, file size 230M

## Pre-training

**n-gram**: following the original paper, n-gram lengths are sampled at random from the distribution below. The default `max_n` is 3.

<p align="center"><img width="200" src="https://lonepatient-1257945978.cos.ap-chengdu.myqcloud.com/n-gram.png" /></p>

1. Convert the text data to one sentence per line, with different documents separated by `\n`.
2. Run `python prepare_lm_data_ngram.py --do_data` to generate the dataset in n-gram mask format.
3. Run `python run_pretraining.py --share_type=all` to pre-train the model.

**Model size**

The following results are from experiments on `bert-base`:

| embedding_size | share_type | model_size |
| :------- | :---------: | :---------: |
| 768 | None | 476.5M |
| 768 | attention | 372.4M |
| 768 | ffn | 268.6M |
| 768 | all | 164.6M |
| | | |
| 128 | None | 369.1M |
| 128 | attention | 265.1M |
| 128 | ffn | 161.2M |
| 128 | all | 57.2M |

## Downstream Task Fine-tuning

1. Download the pre-trained albert model.
2. Run `python convert_albert_tf_checkpoint_to_pytorch.py` to convert the TF model weights to PyTorch model weights (by default, `share_type=all`).
3. Download the corresponding dataset, for example [LCQMC](https://drive.google.com/open?id=1HXYMqsXjmA5uIfu_SFqP7r_vZZG-m_H0), which contains training, validation, and test sets. The training set contains 240,000 colloquial Chinese sentence pairs labeled 1 or 0, where 1 means the two sentences are semantically similar and 0 means they are not.
4. Run `python run_classifier.py --do_train` to fine-tune.
5. Run `python run_classifier.py --do_test` to evaluate on the test set.

## Results

Question matching task: LCQMC (Sentence Pair Matching)

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| ALBERT-zh-base(tf) | 86.4 | 86.3 |
| ALBERT-zh-base(pytorch) | 87.4 | 86.4 |
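To make the distinction in the "Pre-LN and Post-LN" section concrete, here is a minimal pure-Python sketch of the two orderings. It is not the repository's implementation: `layer_norm` omits the learned scale and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or FFN sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-12) -> Vector:
    """Normalize to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN: add the residual first, then apply LayerNorm to the sum."""
    residual_sum = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual_sum)


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN: apply LayerNorm to the sublayer input, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or FFN: doubles each element."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the Post-LN variant the block output is itself normalized, while in the Pre-LN variant the unnormalized input `x` passes straight through the residual path.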
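The n-gram step in the pre-training section refers to a distribution that appears only as an image. As a hedged sketch, the code below assumes it is the distribution given in the ALBERT paper, where the probability of length n is proportional to 1/n for n from 1 to `max_n`. The function names are illustrative and are not taken from `prepare_lm_data_ngram.py`.

```python
import random
from typing import List


def ngram_length_probs(max_n: int = 3) -> List[float]:
    """Return p(n) for n = 1..max_n, with p(n) proportional to 1/n."""
    weights = [1.0 / n for n in range(1, max_n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_ngram_length(rng: random.Random, max_n: int = 3) -> int:
    """Draw one n-gram length from the distribution above."""
    lengths = list(range(1, max_n + 1))
    return rng.choices(lengths, weights=ngram_length_probs(max_n), k=1)[0]


probs = ngram_length_probs(3)
rng = random.Random(0)
sampled_lengths = [sample_ngram_length(rng) for _ in range(10)]
```

With `max_n = 3` the probabilities for lengths 1, 2, and 3 are 6/11, 3/11, and 2/11, so shorter spans are masked more often than longer ones.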
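The effect of each `share_type` option on parameter count can be illustrated with a rough estimate. The sketch below is an assumption-laden approximation, not the repository's code: it counts only the attention projections (query, key, value, output) and the two FFN linear layers with biases, and it ignores embeddings, LayerNorm, and the pooler, so it is not expected to reproduce the model sizes in the table.

```python
from typing import Optional


def encoder_param_count(
    hidden_size: int,
    intermediate_size: int,
    num_layers: int,
    share_type: Optional[str],
) -> int:
    """Approximate encoder parameter count under a given sharing scheme."""
    # Four hidden x hidden projections (Q, K, V, output), each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two linear layers: hidden -> intermediate and intermediate -> hidden.
    ffn = 2 * hidden_size * intermediate_size + intermediate_size + hidden_size
    # A shared component is stored once; an unshared one is stored per layer.
    attention_copies = 1 if share_type in ("all", "attention") else num_layers
    ffn_copies = 1 if share_type in ("all", "ffn") else num_layers
    return attention_copies * attention + ffn_copies * ffn


counts = {
    str(share_type): encoder_param_count(768, 3072, 12, share_type)
    for share_type in (None, "attention", "ffn", "all")
}
```

Under these assumptions the ordering is `None` > `attention` > `ffn` > `all`, because the FFN block holds roughly twice as many parameters as the attention block, so sharing it saves more.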
