# Chinese Named Entity Recognition for Social Media
This repository contains:
1) Data: Named Entity Recognition (NER) for Chinese Social Media (Weibo). This dataset contains messages selected from Weibo and annotated according to the DEFT ERE annotation guidelines. Annotations include both name and nominal mentions. The corpus contains 1,890 messages sampled from Weibo between November 2013 and December 2014.
2) golden-horse: A neural based NER tool for Chinese Social Media.
## Important update of the data
We fixed some inconsistencies in the data, especially in the annotations for the nominal mentions.
We thank Hangfeng He for his contribution to the major cleanup and revision of the annotations.
The original and revised annotated data are both available in the `data/` directory, with the prefixes `weiboNER.conll` and `weiboNER_2nd_conll`, respectively.
We include updated results of our models on the revised version of the data in supplementary material: [golden_horse_supplement.pdf](golden_horse_supplement.pdf). If you want to compare with our models on the revised data, please refer to this supplementary material. Thanks!
If you use the revised dataset, please kindly cite the following BibTeX entry in addition to citing our papers:
```bibtex
@article{HeS16,
  author  = {Hangfeng He and Xu Sun},
  title   = {F-Score Driven Max Margin Neural Network for Named Entity Recognition in Chinese Social Media},
  journal = {CoRR},
  volume  = {abs/1611.04234},
  year    = {2016}
}
```
# golden-horse
The implementation of the papers:
**Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings**
Nanyun Peng and Mark Dredze
*Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2015
and
**Improving Named Entity Recognition for Chinese Social Media
with Word Segmentation Representation Learning**
Nanyun Peng and Mark Dredze
*Annual Meeting of the Association for Computational Linguistics (ACL)*, 2016
If you use the code, please kindly cite the following BibTeX entries:
```bibtex
@inproceedings{peng2015ner,
  title     = {Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings},
  author    = {Peng, Nanyun and Dredze, Mark},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages     = {548--554},
  year      = {2015},
  url       = {https://www.aclweb.org/anthology/D15-1064/}
}

@inproceedings{peng2016improving,
  title     = {Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning},
  author    = {Peng, Nanyun and Dredze, Mark},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)},
  volume    = {2},
  pages     = {149--155},
  year      = {2016},
  url       = {https://www.aclweb.org/anthology/P16-2025/}
}
```
## Dependencies:
This is a Theano implementation; it requires the following Python modules:
- Theano
- jieba (a Chinese word segmenter)

Both can be installed with `pip install <moduleName>`.
The LSTM layer was adapted from http://deeplearning.net/tutorial/lstm.html, and the feature extraction part was adapted from CRFsuite: http://www.chokkan.org/software/crfsuite/
## Running the EMNLP_15 experiments
### Sample commands for training:
```shell
python theano_src/crf_ner.py --nepochs 30 --neval_epochs 1 --training_data data/weiboNER.conll.train --valid_data data/weiboNER.conll.dev --test_data data/weiboNER.conll.test --emb_file embeddings/weibo_charpos_vectors --emb_type charpos --save_model_param weibo_best_parameters --emb_init true --eval_test False

python theano_src/crf_ner.py --nepochs 30 --neval_epochs 1 --training_data data/weiboNER_2nd_conll.train --valid_data data/weiboNER_2nd_conll.dev --test_data data/weiboNER_2nd_conll.test --emb_file embeddings/weibo_charpos_vectors --emb_type char --save_model_param weibo_best_parameters --emb_init true --eval_test False
```
In the above example, the predictions will be written to output_dir/weiboNER.conll.test.prediction. If you also want to see the evaluation (you must have labeled test data), add the flag `--eval_test True`.
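The evaluation scores predictions at the entity-span level rather than per character. As a rough illustration of how span-level precision, recall, and F1 are computed from BIO-tagged output (this is a sketch of the standard metric, not the tool's actual evaluation code; the tag scheme is assumed to be `B-TYPE`/`I-TYPE`/`O`):

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def span_f1(gold_tags, pred_tags):
    """Span-level precision, recall, and F1: a span counts as correct
    only if its boundaries and type both match."""
    gold = set(extract_spans(gold_tags))
    pred = set(extract_spans(pred_tags))
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

This matches the usual CoNLL-style convention where a partially overlapping span gets no credit.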
### Sample commands for running the test:
```shell
python theano_src/crf_ner.py --test_data data/weiboNER.conll.test --only_test true --output_dir data/ --save_model_param weibo_best_parameters
```
## Running the ACL_16 experiments
```shell
python theano_src/jointSegNER.py --cws_train_path data/pku_training.utf8 --cws_valid_path data/pku_test_gold.utf8 --cws_test_path data/pku_test_gold.utf8 --ner_train_path data/weiboNER_2nd_conll.train --ner_valid_path data/weiboNER_2nd_conll.dev --ner_test_path data/weiboNER_2nd_conll.test --emb_init file --emb_file embeddings/weibo_charpos_vectors --lr 0.05 --nepochs 30 --train_mode joint --cws_joint_weight 0.7 --m1_wemb1_dropout_rate 0.1
```
The last three parameters (`--train_mode`, `--cws_joint_weight`, `--m1_wemb1_dropout_rate`) and the learning rate can be tuned. In our experiments, we found that for named mentions, the best combination is (joint, 0.7, 0.1); for nominal mentions, the best combination is (alternative, 1.0, 0.1).
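The `--cws_joint_weight` flag controls how strongly the auxiliary word-segmentation (CWS) objective is weighted against the NER objective in joint training. A minimal sketch of the idea behind such a weighted combination (an illustration only, not the actual training code):

```python
def joint_loss(ner_loss, cws_loss, cws_joint_weight=0.7):
    """Combine the NER loss with a down-weighted auxiliary CWS loss.

    In joint mode, the two objectives are optimized together on a
    weighted sum; in alternative mode, the objectives would instead be
    optimized on alternating updates rather than summed.
    """
    return ner_loss + cws_joint_weight * cws_loss
```

With the paper's best named-mention setting of 0.7, a CWS loss of 2.0 and an NER loss of 1.0 would combine to 2.4.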
## Data
We noticed that several factors could affect the replicability of experiments:
1. the segmenter used for preprocessing: we used jieba 0.37
2. the random number generator: although we fixed the random seed, we noticed it yields slightly different numbers on different machines
3. the traditional lexical features used
4. the pre-trained embeddings
To enhance the replicability of our experiments, we provide the original data in CoNLL format at `data/weiboNER.conll.(train/dev/test)`. In addition, we provide files including all the features and the char-positional transformation used in our experiments at `data/crfsuite.weiboNER.charpos.conll.(train/dev/test)`, as well as the pre-trained char and char-positional embeddings.
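The char-positional ("charpos") transformation distinguishes the same character at different positions within a segmented word, so each (character, position) pair gets its own embedding. A rough sketch, assuming the encoding appends a 0-based position index to each character (the exact encoding in the provided files and embeddings may differ):

```python
def charpos_transform(segmented_words):
    """Tag each character with its 0-based position in its word,
    e.g. ['北京'] -> ['北0', '京1']."""
    out = []
    for word in segmented_words:
        for i, ch in enumerate(word):
            out.append(f"{ch}{i}")
    return out
```

In practice, the segmentation would come from jieba (version 0.37, as noted above), and the resulting charpos tokens are what char-positional embeddings such as `embeddings/weibo_charpos_vectors` are indexed by.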
Note: the data we provide contains both named and nominal mentions; you can get a dataset with only named entities by simply filtering out the nominal mentions.
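As a concrete illustration of that filtering, assuming the tag scheme marks named and nominal mentions with `.NAM`/`.NOM` suffixes (as in the revised `weiboNER_2nd_conll` files) and a `char<TAB>tag` column layout, nominal mentions can be dropped by rewriting their tags to `O`:

```python
def drop_nominal(conll_lines):
    """Rewrite nominal-mention tags (suffix .NOM) to O, keeping named
    mentions (.NAM) intact. Blank lines (sentence breaks) pass through."""
    out = []
    for line in conll_lines:
        if not line.strip():
            out.append(line)          # keep sentence boundary
            continue
        char, tag = line.rsplit("\t", 1)
        if tag.endswith(".NOM"):
            tag = "O"                 # drop the nominal mention
        out.append(f"{char}\t{tag}")
    return out
```

The same idea applies line-by-line to the `.train`/`.dev`/`.test` files on disk.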
## Data License
The annotations in this repository are released according to the [Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0)](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License). The messages themselves are selected from [Weibo](https://www.weibo.com/) and follow Weibo's terms of service.