# A simple BiLSTM-CRF model for Chinese Named Entity Recognition
This repository includes the code for building a very simple __character-based BiLSTM-CRF sequence labeling model__ for the Chinese Named Entity Recognition task. Its goal is to recognize three types of named entities: PERSON, LOCATION and ORGANIZATION.
This code works with __Python 3 & TensorFlow 1.2__. The repository [https://github.com/guillaumegenthial/sequence_tagging](https://github.com/guillaumegenthial/sequence_tagging) was a great help in writing it.
## Model
This model is similar to the models described in papers [1] and [2]. Its structure is illustrated below:
![Network](./pics/pic1.png)
For a Chinese sentence, each character is assigned a tag from the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.
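As a minimal sketch, the tag set can be turned into integer labels before training. The names `tag2label`, `label2tag` and `encode`, as well as the specific id assigned to each tag, are assumptions made for illustration and may differ from what the code in this repository uses.

```python
# Illustrative mapping between tags and integer labels (the ids are an assumption).
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tag2label = {tag: i for i, tag in enumerate(TAGS)}
label2tag = {i: tag for tag, i in tag2label.items()}


def encode(tags):
    """Convert a list of tag strings into a list of integer labels."""
    return [tag2label[t] for t in tags]


# One tag per character: "中国" is a location, the rest is outside any entity.
sentence = list("中国很大")
tags = ["B-LOC", "I-LOC", "O", "O"]
labels = encode(tags)
```

Each character gets exactly one label, so `labels` always has the same length as `sentence`.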
The first layer, the __look-up layer__, transforms each character's one-hot representation into a *character embedding*. In this code the embedding matrix is initialized randomly, which I know is very simple. We could add some linguistic knowledge later. For example, we could tokenize the text and use pre-trained word-level embeddings, so that every character in a token is initialized with that token's word embedding. In addition, we could build the character embedding by combining low-level features (see Section 4.1 of paper [2] and Section 3.3 of paper [3] for more details).
The second layer, the __BiLSTM layer__, can efficiently use *both past and future* input information and extracts features automatically.
The third layer, the __CRF layer__, assigns a tag to each character in a sentence. If we used a Softmax layer for labeling, we might get ill-formed tag sequences because the Softmax layer labels each position independently. We know that 'I-LOC' cannot follow 'B-PER', but Softmax doesn't. Compared to Softmax, a CRF layer can use *sentence-level tag information* and model the transition behavior between each pair of tags.
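The difference between position-wise labeling and sequence-level decoding can be sketched in plain Python. This is not the repository's TensorFlow code. The transition scores below are a hand-written penalty standing in for scores that a CRF would learn, and the function names `allowed`, `build_transitions` and `viterbi_decode` are hypothetical.

```python
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def allowed(prev, curr):
    """An I-X tag may only follow B-X or I-X of the same entity type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def build_transitions(penalty=-1e4):
    """Hand-written stand-in for learned CRF transition scores."""
    return [[0.0 if allowed(p, c) else penalty for c in TAGS] for p in TAGS]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index path for per-position emission scores."""
    n_tags = len(transitions)
    score = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            # Best previous tag i for ending in tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda i: score[i])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


# Toy emission scores: position 0 prefers B-PER, position 1 slightly prefers I-LOC.
emissions = [[0.0] * len(TAGS) for _ in range(2)]
emissions[0][TAGS.index("B-PER")] = 2.0
emissions[1][TAGS.index("I-LOC")] = 2.0
emissions[1][TAGS.index("I-PER")] = 1.5

# Independent argmax per position, as a Softmax layer would do.
greedy = [TAGS[max(range(len(TAGS)), key=lambda i: row[i])] for row in emissions]
# Sequence-level decoding with transition scores.
decoded = [TAGS[i] for i in viterbi_decode(emissions, build_transitions())]
print(greedy)   # ['B-PER', 'I-LOC']
print(decoded)  # ['B-PER', 'I-PER']
```

In this toy case the position-wise choice yields the ill-formed pair B-PER, I-LOC, while decoding over the whole sequence picks B-PER, I-PER because the transition score rules out the other path.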
## Dataset
| | #sentence | #PER | #LOC | #ORG |
| :----: | :---: | :---: | :---: | :---: |
| train | 46364 | 17615 | 36517 | 20571 |
| test | 4365 | 1973 | 2877 | 1331 |
It looks like a portion of the [MSRA corpus](http://sighan.cs.uchicago.edu/bakeoff2006/). I downloaded the dataset from the link in `./data_path/original/link.txt`.
### data files
The directory `./data_path` contains:
- the preprocessed data files, `train_data` and `test_data`
- a vocabulary file `word2id.pkl` that maps each character to a unique id
To generate the vocabulary file, please refer to the code in `data.py`.
### data format
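Each data file should be in the following format, with one character and its tag per line and a blank line at the end of each sentence (the sample text 句子结束是空行 itself means "a sentence ends with a blank line"):
```
中 B-LOC
国 I-LOC
很 O
大 O

句 O
子 O
结 O
束 O
是 O
空 O
行 O

```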
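A minimal sketch of a reader for this format, assuming one `character tag` pair per line, whitespace between the two fields, and a blank line between sentences. The function name `read_corpus_lines` is hypothetical; the repository's own loading code lives in `data.py` and may differ.

```python
def read_corpus_lines(lines):
    """Parse 'char tag' lines into a list of (chars, tags) sentence pairs.

    A blank line closes the current sentence. This sketch assumes the
    character itself is never whitespace.
    """
    sentences, chars, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if chars:
                sentences.append((chars, tags))
                chars, tags = [], []
            continue
        char, tag = line.split()
        chars.append(char)
        tags.append(tag)
    if chars:  # last sentence when the input has no trailing blank line
        sentences.append((chars, tags))
    return sentences


sample = "中 B-LOC\n国 I-LOC\n很 O\n大 O\n\n句 O\n子 O\n"
data = read_corpus_lines(sample.splitlines())
```

To read an actual file, the same function can be given an open file handle, for example `read_corpus_lines(open(path, encoding="utf-8"))`, where `path` points to a file in this format.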
If you want to use your own dataset, please:
- transform your corpus to the above format
- generate a new vocabulary file
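For the second step, a vocabulary can be sketched as a dictionary from characters to ids that is then pickled. Everything here is an assumption for illustration: the function names `build_vocab` and `save_vocab`, the `<PAD>` and `<UNK>` entries, and the `min_count` threshold. The actual procedure used for `word2id.pkl` is the one in `data.py`.

```python
import pickle
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each character seen at least min_count times to a unique id."""
    counts = Counter(ch for chars in sentences for ch in chars)
    word2id = {"<PAD>": 0}  # assumed padding entry
    for ch, n in counts.most_common():
        if n >= min_count:
            word2id[ch] = len(word2id)
    word2id["<UNK>"] = len(word2id)  # assumed entry for unseen characters
    return word2id


def save_vocab(word2id, path):
    """Write the vocabulary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(word2id, f)


vocab = build_vocab([list("中国很大"), list("中国")])
```

Characters that fall below `min_count`, or that never appear in the training data, would be looked up as `<UNK>` at prediction time.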
## How to Run
### train
`python main.py --mode=train`
### test
`python main.py --mode=test --demo_model=1521112368`
Please set the `--demo_model` parameter to the model that you want to test. `1521112368` is the model I trained.
An official evaluation tool for computing metrics is available [here (click 'Instructions')](http://sighan.cs.uchicago.edu/bakeoff2006/).
My test performance:
| P | R | F | F (PER)| F (LOC)| F (ORG)|
| :---: | :---: | :---: | :---: | :---: | :---: |
| 0.8945 | 0.8752 | 0.8847 | 0.8688 | 0.9118 | 0.8515 |
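The overall F value in the table is consistent with the harmonic mean of P and R: 2 × 0.8945 × 0.8752 / (0.8945 + 0.8752) ≈ 0.8847. As a rough illustration of how entity-level scores of this kind can be computed from tag sequences, here is a simplified sketch. It is not the official evaluation tool: the function names are hypothetical, and an I- tag with no matching B- tag before it is simply ignored, which may differ from how the official script treats such cases.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a tag sequence (end exclusive)."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                entities.add((etype, start, i))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return entities


def prf(gold_tags, pred_tags):
    """Entity-level precision, recall and F1; an entity counts only on exact match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER"]
pred = ["B-LOC", "I-LOC", "O", "B-PER", "O"]
scores = prf(gold, pred)  # the PER span is cut short, so only LOC matches
```

Because matching is on whole spans, a prediction that gets the entity type right but the boundary wrong counts as both a false positive and a false negative.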
### demo
`python main.py --mode=demo --demo_model=1521112368`
You can input one Chinese sentence and the model will return the recognition result:
![demo_pic](./pics/pic2.png)
## References
\[1\] [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/pdf/1508.01991v1.pdf)
\[2\] [Neural Architectures for Named Entity Recognition](http://aclweb.org/anthology/N16-1030)
\[3\] [Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition](https://link.springer.com/chapter/10.1007/978-3-319-50496-4_20)
\[4\] [https://github.com/guillaumegenthial/sequence_tagging](https://github.com/guillaumegenthial/sequence_tagging)