Text Classification
-------------------------------------------------------------------------
The purpose of this repository is to explore text classification methods in NLP with deep learning.
UPDATE: if you want to try a model now, go to the folder 'a02_TextCNN' and run 'python -u p7_TextCNN_train.py'. It will train a model on sample data and print the loss and F1 score periodically.
It has all kinds of baseline models for text classification.
It also supports multi-label classification, where multiple labels are associated with a sentence or document.
Although many of these models are simple and may not get you to the top level of the task, some of them are classics, so they serve well as baseline models.
Each model has a test function under its model class. You can run it on a toy task first; the models are independent of any particular dataset.
<a href='https://github.com/brightmart/text_classification/blob/master/multi-label-classification.pdf'>Check here for a formal report on large-scale multi-label text classification with deep learning</a>
Several models here can also be used for modeling question answering (with or without context) or for sequence generation.
We explore two seq2seq models (seq2seq with attention, and the Transformer from "Attention Is All You Need") for text classification. These two models can also be used for sequence generation and other tasks. If your task is multi-label classification, you can cast the problem as sequence generation (see the sketch below).
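As a minimal illustration of that casting, a multi-label target such as {2, 7, 4} can be serialized into a label sequence that a seq2seq decoder emits step by step. This is only a sketch of the idea; the EOS/PAD ids and the fixed decoder length below are hypothetical choices, not this repo's conventions.

```python
# Minimal sketch: serialize a multi-label target into a decoder sequence.
# EOS_ID / PAD_ID and the fixed decoder length are hypothetical choices.
EOS_ID = 0
PAD_ID = -1

def labels_to_sequence(label_set, max_len):
    """Turn an unordered label set into a fixed-length target sequence."""
    seq = sorted(label_set) + [EOS_ID]       # impose an order, then end-of-seq
    seq += [PAD_ID] * (max_len - len(seq))   # pad to the decoder length
    return seq

print(labels_to_sequence({2, 7, 4}, max_len=6))  # [2, 4, 7, 0, -1, -1]
```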
We implement two memory networks. One is the Dynamic Memory Network, which previously reached state of the art on question answering, sentiment analysis, and sequence generation tasks: a so-called "one model for several different tasks" that achieves high performance on all of them. It has four modules, and the key component is the episodic memory module. It uses a gating mechanism to perform attention and a gated GRU to update the episodic memory, with another GRU (in the vertical direction) to update the hidden state. It is also capable of transitive inference.
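A rough numpy sketch of that gated update (the tanh candidate below is a simplified stand-in for a full GRU cell, and the shapes and names are ours, not the repo's):

```python
import numpy as np

def gated_episode_update(c_t, h_prev, g_t, W, U):
    """One episodic-memory step: the attention gate g_t (a scalar in [0, 1])
    interpolates between a candidate update and keeping the old state.
    tanh(W c + U h) is a simplified stand-in for a full GRU cell."""
    h_tilde = np.tanh(W @ c_t + U @ h_prev)
    return g_t * h_tilde + (1.0 - g_t) * h_prev

# toy run with hidden size 4
rng = np.random.RandomState(0)
W, U = rng.randn(4, 4), rng.randn(4, 4)
h = np.zeros(4)
for c_t, g_t in [(rng.randn(4), 0.9), (rng.randn(4), 0.1)]:
    h = gated_episode_update(c_t, h, g_t, W, U)  # high gate => big update
```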
The second memory network we implemented is the Recurrent Entity Network ("Tracking the State of the World"). It uses blocks of key-value pairs as memory, which run in parallel, and it achieved a new state of the art. It can be used to model question answering with context (or history): for example, you can let the model read some sentences (as context), ask a question (as query), and have the model predict an answer. If you feed the same text as both story and query, it performs a classification task.
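A compact numpy sketch of those parallel key-value blocks, following the paper's update rule in spirit (the variable names and the gate form are our simplification):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entity_net_step(s_t, keys, mems, U, V, W):
    """Update all memory blocks in parallel for one input sentence vector s_t.
    keys, mems: (num_blocks, d). The gate compares s_t to each block's key
    and current memory; gated candidates are added, then each block is
    normalized to unit length."""
    gates = sigmoid(mems @ s_t + keys @ s_t)                    # (num_blocks,)
    candidates = np.tanh(mems @ U.T + keys @ V.T + s_t @ W.T)   # (num_blocks, d)
    mems = mems + gates[:, None] * candidates
    norms = np.linalg.norm(mems, axis=1, keepdims=True)
    return mems / np.maximum(norms, 1e-8)

# toy run: 3 blocks, dimension 4
rng = np.random.RandomState(0)
keys, mems = rng.randn(3, 4), rng.randn(3, 4)
U, V, W = (rng.randn(4, 4) for _ in range(3))
mems = entity_net_step(rng.randn(4), keys, mems, U, V, W)
```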
If you need sample data and word embeddings pretrained with word2vec, you can find them in closed issues, such as: <a href="https://github.com/brightmart/text_classification/issues/3">issue 3</a>.
You can also find sample data in the folder "data". It contains two files: 'sample_single_label.txt' (50k examples with a single label) and 'sample_multiple_label.txt' (20k examples with multiple labels). Input and label are separated by " __label__".
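A minimal parser for that format (how multiple labels are encoded after the marker is an assumption here; check the sample files for the exact convention):

```python
def parse_line(line):
    """Split one sample into (input_text, labels) at the ' __label__' marker.
    Space-separated label ids after the marker are an assumed convention."""
    text, _, label_part = line.strip().partition(" __label__")
    labels = label_part.split()
    return text, labels

print(parse_line("x1 x2 x3 x4 x5 __label__ 323434"))
# -> ('x1 x2 x3 x4 x5', ['323434'])
```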
If you want to know more about the dataset, or the tasks these models can be used for, one place to look is:
https://biendata.com/competition/zhihu/
Models:
-------------------------------------------------------------------------
1) fastText
2) TextCNN
3) TextRNN
4) RCNN
5) Hierarchical Attention Network
6) seq2seq with attention
7) Transformer("Attention Is All You Need")
8) Dynamic Memory Network
9) EntityNetwork: tracking the state of the world
10) Ensemble models
11) Boosting:
For a single model, stack identical models together; each layer is a model, and the result is based on the logits added together. The only connection between layers is the labels' weights: each label's prediction error rate in the front layer becomes its weight in the next layer. Labels with a high error rate get a big weight, so later layers pay more attention to those mis-predicted labels and try to fix the former layer's mistakes. As a result, we get a much stronger model.
Check a00_boosting/boosting.py; a toy sketch of the scheme follows.
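A toy numpy sketch of that scheme (the exact weight-update rule and how a layer consumes the weights are simplified assumptions; see a00_boosting/boosting.py for the real code):

```python
import numpy as np

def boosted_predict(layers, x, error_rate_fn, num_labels):
    """Sum logits across stacked identical models, passing each layer
    per-label weights derived from the previous layer's error rates,
    so later layers focus on mis-predicted labels."""
    total = np.zeros(num_labels)
    label_weights = np.ones(num_labels)      # front layer: uniform weights
    for layer in layers:
        logits = layer(x, label_weights)     # layer conditioned on weights
        total += logits
        # assumed rule: labels with higher error rates get bigger weights
        label_weights = 1.0 + error_rate_fn(logits)
    return total

# toy usage with dummy layers and a dummy per-label error estimate
layers = [lambda x, w: np.tanh(x) * w for _ in range(3)]
errs = lambda logits: np.clip(-logits, 0.0, 1.0)
print(boosted_predict(layers, np.linspace(-1, 1, 5), errs, num_labels=5))
```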
And other models:
1) BiLstmTextRelation;
2) twoCNNTextRelation;
3) BiLstmTextRelationTwoRNN
Performance
-------------------------------------------------------------------------
(Multi-label prediction task: predict the top 5 labels; 3 million training examples; full score: 0.5)
Model | fastText|TextCNN|TextRNN| RCNN | HierAtteNet|Seq2seqAttn|EntityNet|DynamicMemory|Transformer
--- | --- | --- | --- |--- |--- |--- |--- |--- |----
Score | 0.362 | 0.405| 0.358 | 0.395| 0.398 |0.322 |0.400 |0.392 |0.322
Training| 10m | 2h |10h | 2h | 2h |3h |3h |5h |7h
--------------------------------------------------------------------------------------------------
Ensemble of TextCNN, EntityNet, DynamicMemory: 0.411
Ensemble of EntityNet, DynamicMemory: 0.403
--------------------------------------------------------------------------------------------------
Notice:
`m` stands for **minutes**; `h` stands for **hours**;
`HierAtteNet` means Hierarchical Attention Network;
`Seq2seqAttn` means Seq2seq with attention;
`DynamicMemory` means Dynamic Memory Network;
`Transformer` stands for the model from 'Attention Is All You Need'.
Usage:
-------------------------------------------------------------------------------------------------------
1) The model is in `xxx_model.py`.
2) Run `python xxx_train.py` to train the model.
3) Run `python xxx_predict.py` to do inference (test).
Each model has a test method under the model class. You can run the test method first to check whether the model works properly.
-------------------------------------------------------------------------
Environment:
-------------------------------------------------------------------------------------------------------
Python 2.7 + TensorFlow 1.1
(TensorFlow 1.2, 1.3, and 1.4 also work; most models should work with other TensorFlow versions as well, since we use very few version-specific features. Python 3.5 is fine as long as you change the print statements and try/except syntax.)
The TextCNN model has already been ported to Python 3.6.
-------------------------------------------------------------------------
Notice:
-------------------------------------------------------------------------------------------------------
Some utility functions are in data_util.py.
A typical input looks like: "x1 x2 x3 x4 x5 __label__ 323434", where 'x1, x2, ...' are words and '323434' is the label.
It also has a function to load pretrained word embeddings (trained with word2vec or fastText) and assign them to the model.
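A hedged sketch of that assignment in TensorFlow 1.x (the variable names and shapes are illustrative, not data_util.py's exact ones):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, per the Environment section

vocab_size, embed_size = 10000, 100
embedding = tf.get_variable("embedding", shape=[vocab_size, embed_size])

# 'pretrained' would be loaded from a word2vec/fastText file and aligned
# with the vocabulary; a random matrix stands in for it here.
pretrained = np.random.randn(vocab_size, embed_size).astype(np.float32)
assign_op = tf.assign(embedding, pretrained)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_op)  # overwrite the random init with pretrained vectors
```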
Models Detail:
-------------------------------------------------------------------------
1.fastText:
-------------
Implementation of <a href="https://arxiv.org/abs/1607.01759">Bag of Tricks for Efficient Text Classification</a>
After embedding each word in the sentence, the word representations are averaged into a text representation, which is in turn fed to a linear classifier. A softmax function computes the probability distribution over the predefined classes, and cross-entropy is used as the loss. A bag-of-words representation does not consider word order; to take word order into account, n-gram features are used to capture some partial information about the local word order. When the number of classes is large, computing the linear classifier is computationally expensive, so the paper uses hierarchical softmax to speed up training. (A minimal sketch of the forward pass appears after the figure below.)
1) Use bi-grams and/or tri-grams.
2) Use NCE loss to speed up the softmax computation (instead of the hierarchical softmax in the original paper).
Result: performance is as good as the paper's, and training is also very fast.
Check: p5_fastTextB_model.py
![alt text](https://github.com/brightmart/text_classification/blob/master/images/fastText.JPG)
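As promised above, a minimal numpy sketch of the fastText forward pass (shapes and names are ours; the real model is in p5_fastTextB_model.py):

```python
import numpy as np

def fasttext_logits(word_ids, E, W, b):
    """Average the word (and n-gram) embeddings into one text vector,
    then apply a linear classifier to get class logits."""
    text_vec = E[word_ids].mean(axis=0)   # bag-of-words average
    return text_vec @ W + b

# toy shapes: vocab 1000, embed 50, 10 classes
rng = np.random.RandomState(0)
E, W, b = rng.randn(1000, 50), rng.randn(50, 10), np.zeros(10)
logits = fasttext_logits([3, 17, 42], E, W, b)
probs = np.exp(logits - logits.max()); probs /= probs.sum()  # softmax
```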
-------------------------------------------------------------------------
2.TextCNN:
-------------
Implementation of <a href="http://www.aclweb.org/anthology/D14-1181">Convolutional Neural Networks for Sentence Classification</a>