# Deep Learning for Multi-Label Text Classification
[![Python Version](https://img.shields.io/badge/language-python3.6-blue.svg)](https://www.python.org/downloads/) [![Build Status](https://travis-ci.org/RandolphVI/Multi-Label-Text-Classification.svg?branch=master)](https://travis-ci.org/RandolphVI/Multi-Label-Text-Classification) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/c45aac301b244316830b00b9b0985e3e)](https://www.codacy.com/app/chinawolfman/Multi-Label-Text-Classification?utm_source=github.com&utm_medium=referral&utm_content=RandolphVI/Multi-Label-Text-Classification&utm_campaign=Badge_Grade) [![License](https://img.shields.io/github/license/RandolphVI/Multi-Label-Text-Classification.svg)](https://www.apache.org/licenses/LICENSE-2.0) [![Issues](https://img.shields.io/github/issues/RandolphVI/Multi-Label-Text-Classification.svg)](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues)
This repository is my research project and also a study of TensorFlow and deep learning (FastText, CNN, LSTM, etc.).
The main objective of the project is to solve the multi-label text classification problem with deep neural networks. Accordingly, each data label takes the multi-hot form [0, 1, 0, ..., 1, 1], matching the characteristics of this problem.
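For illustration, converting a record's `labels_index` into that multi-hot format could look like the sketch below (`to_multi_hot` and `num_classes` are illustrative names, not taken from this repo):

```python
# Hypothetical sketch: turn a list of label indices into the
# multi-hot vector format [0, 1, 0, ..., 1, 1] described above.
def to_multi_hot(labels_index, num_classes):
    """Return a 0/1 vector with a one at each label index."""
    vec = [0] * num_classes
    for idx in labels_index:
        vec[idx] = 1
    return vec

print(to_multi_hot([0, 2], 4))  # -> [1, 0, 1, 0]
```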
## Requirements
- Python 3.6
- Tensorflow 1.15.0
- Tensorboard 1.15.0
- Sklearn 0.19.1
- Numpy 1.16.2
- Gensim 3.8.3
- Tqdm 4.49.0
## Project
The project structure is below:
```text
.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt
```
## Innovation
### Data part
1. Supports both **Chinese** and English text (segment with `jieba` or `nltk`).
2. Supports **your own pre-trained word vectors** (e.g. trained with `gensim`).
3. Adds embedding visualization in **TensorBoard** (requires creating `metadata.tsv` first).
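For the TensorBoard embedding visualization, `metadata.tsv` lists one token per line, in the same order as the rows of the embedding matrix. A minimal sketch (the vocabulary here is illustrative, not the repo's actual one):

```python
# Minimal sketch: write a metadata.tsv for the TensorBoard embedding
# projector. One token per line, ordered like the embedding matrix rows.
# The vocabulary below is illustrative, not taken from the repo.
vocab = ["<PAD>", "pore", "water", "pressure"]

with open("metadata.tsv", "w", encoding="utf-8") as f:
    for word in vocab:
        f.write(word + "\n")
```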
### Model part
1. Adds the correct **L2 loss** calculation.
2. Adds **gradient clipping** to prevent gradient explosion.
3. Adds **learning rate decay** (exponential decay).
4. Adds a new **Highway Layer** (useful according to model performance).
5. Adds a **Batch Normalization Layer**.
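The decay and clipping items correspond to `tf.train.exponential_decay` and `tf.clip_by_global_norm` in TF 1.x; the pure-Python sketch below mirrors the arithmetic those ops perform (parameter values are illustrative):

```python
import math

def exponential_decay(lr, global_step, decay_steps, decay_rate, staircase=False):
    """Mirror of tf.train.exponential_decay: lr * decay_rate^(step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = math.floor(exponent)  # step-wise instead of continuous decay
    return lr * decay_rate ** exponent

def clip_by_global_norm(grads, clip_norm):
    """Mirror of tf.clip_by_global_norm: rescale all gradients together
    when their joint (global) norm exceeds clip_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm
    return [g * scale for g in grads], global_norm
```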
### Code part
1. Can either **train** the model from scratch or **restore** it from a checkpoint in `train.py`.
2. Can predict labels via a **threshold** or **top-K** in `train.py` and `test.py`.
3. Can calculate the evaluation metrics **AUC** and **AUPRC**.
4. Can create a prediction file in `test.py` containing the predicted values and predicted labels for the test set.
5. Adds other useful data preprocessing functions in `data_helpers.py`.
6. Uses `logging` to record the whole run (including **parameter display**, **model training info**, etc.).
7. Provides the ability to save the best n checkpoints in `checkmate.py`, whereas `tf.train.Saver` can only keep the last n checkpoints.
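The two prediction strategies from item 2 can be sketched as follows (the scores stand in for the model's sigmoid outputs; the function names are illustrative, not the repo's):

```python
# Hedged sketch of the two label-selection strategies mentioned above:
# keep labels whose score exceeds a threshold, or keep the top-K scores.
def predict_by_threshold(scores, threshold=0.5):
    """Indices of all labels whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def predict_by_topk(scores, k=3):
    """Indices of the k highest-scoring labels."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.9, 0.1, 0.7, 0.4]
print(predict_by_threshold(scores))   # -> [0, 2]
print(predict_by_topk(scores, k=2))   # -> [0, 2]
```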
## Data
See the data format in the `/data` folder, which contains the sample files. For example:
```json
{"testid": "3935745", "features_content": ["pore", "water", "pressure", "metering", "device", "incorporating", "pressure", "meter", "force", "meter", "influenced", "pressure", "meter", "device", "includes", "power", "member", "arranged", "control", "pressure", "exerted", "pressure", "meter", "force", "meter", "applying", "overriding", "force", "pressure", "meter", "stop", "influence", "force", "meter", "removing", "overriding", "force", "pressure", "meter", "influence", "force", "meter", "resumed"], "labels_index": [526, 534, 411], "labels_num": 3}
```
- **"testid"**: the record id.
- **"features_content"**: the segmented words (after stopword removal).
- **"labels_index"**: the label indices of the record.
- **"labels_num"**: the number of labels.
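Each line of the sample files is a standalone JSON object, so a minimal reader only needs `json.loads` per line (a sketch using a shortened version of the record above):

```python
import json

# Each line of the sample files is one JSON object; parse it directly.
line = ('{"testid": "3935745", "features_content": ["pore", "water"], '
        '"labels_index": [526, 534, 411], "labels_num": 3}')
record = json.loads(line)

# labels_num is redundant with len(labels_index), which makes a handy sanity check.
assert record["labels_num"] == len(record["labels_index"])
print(record["testid"], record["labels_index"])  # -> 3935745 [526, 534, 411]
```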
### Text Segment
1. You can use `nltk` package if you are going to deal with the English text data.
2. You can use `jieba` package if you are going to deal with the Chinese text data.
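The segment-then-filter step can be sketched language-agnostically; in practice `nltk.word_tokenize` (English) or `jieba.cut` (Chinese) would replace the whitespace split, and the stopword list below is a placeholder, not the repo's actual list:

```python
# Illustrative sketch of segmentation + stopword removal. A whitespace
# split stands in for nltk.word_tokenize (English) or jieba.cut (Chinese).
STOPWORDS = {"a", "the", "is", "of", "and"}

def segment(text):
    tokens = text.lower().split()  # replace with nltk/jieba in practice
    return [t for t in tokens if t not in STOPWORDS]

print(segment("The pressure of the water"))  # -> ['pressure', 'water']
```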
### Data Format
This repository can be used with other (text classification) datasets in two ways:
1. Convert your dataset into the same format as [the sample](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/data).
2. Modify the data preprocessing code in `data_helpers.py`.
Which approach is better depends on your data and task.
**🤔 Before you open a new issue about the data format, please check `data_sample.json` and read the other open issues first, because someone may have asked the same question already. For example:**
- [What does the input file format look like?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/1)
- [Where is the dataset for training?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/7)
- [What are `content.txt` and `metadata.tsv` in `data_helpers.py`? What exactly is their format? Could you provide a sample?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/12)
### Pre-trained Word Vectors
**You can download the [Word2vec model file](https://drive.google.com/file/d/1S33iejwuQOIaNQfXW7fA_6zBwHHClT--/view?usp=sharing) (dim = 100). Make sure it is unzipped and placed under the `/data` folder.**
You can pre-train your own word vectors (on your corpus) in several ways:
- Use the `gensim` package.
- Use the `glove` tools.
- Even train a **fastText** network.
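However the vectors are trained, they are typically assembled into an embedding matrix indexed by vocabulary id before being fed to the network. A hedged sketch with toy data (`word_vectors` stands in for a loaded `gensim` model; all names and values are illustrative):

```python
import random

# Hedged sketch: build an embedding matrix from pre-trained vectors.
# `word_vectors` stands in for a loaded gensim model; values are toy data.
random.seed(0)
dim = 4
word_vectors = {"pore": [0.1] * dim, "water": [0.2] * dim}
vocab = ["<PAD>", "pore", "water", "unseen"]

embedding_matrix = []
for word in vocab:
    if word in word_vectors:
        embedding_matrix.append(word_vectors[word])
    else:
        # padding and out-of-vocabulary words get small random vectors
        embedding_matrix.append([random.uniform(-0.25, 0.25) for _ in range(dim)])
```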
## Usage
See [Usage](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/Usage.md).
## Network Structure
### FastText
![](https://farm2.staticflickr.com/1917/45609842012_30f370a0ee_o.png)
References:
- [Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf)
---
### TextANN
![](https://farm2.staticflickr.com/1965/44745949305_50f831a579_o.png)
References:
- **Personal ideas**
---
### TextCNN
![](https://farm2.staticflickr.com/1927/44935475604_1d6b8f71a3_o.png)
References:
- [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882)
- [A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1510.03820)
---
### TextRNN
**Warning: the model is usable, but not finished yet!**
![](https://farm2.staticflickr.com/1925/30719666177_6665038ea2_o.png)
#### TODO
1. Add BN-LSTM cell unit.
2. Add attention.
References:
- [Recurrent Neural Network for Text Classification with Multi-Task Learning](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9745/9552)
---
### TextCRNN
![](https://farm2.staticflickr.com/1915/43842346360_e4660c5921_o.png)
References:
- **Personal ideas**
---
### TextRCNN
![](https://farm2.staticflickr.com/1950/31788031648_b5cba7bbf0_o.png)
References:
- **Personal ideas**
---
### TextHAN
References:
- [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)
---
### TextSANN
**Warning: the model is usable, but not finished yet!**
#### TODO
1. Add attention penalization loss.
2. Add visualization.
References:
- [A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING](https://arxiv.org/pdf/1703.03130.pdf)
---
## About Me
黄威 (Randolph)
SCU SE Bachelor; USTC CS Ph.D.
Email: chinawolfman@hotmail.com
My Blog: [randolph.pro](http://randolph.pro)
LinkedIn: [randolph's linkedin](https://www.linkedin.com/in/randolph-%E9%BB%84%E5%A8%81/)