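HanLP: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic dependency parsing, new word discovery, text classification and clustering with automatically generated summaries, pinyin and simplified/traditional Chinese conversion, natural language processing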


nlp

natural-language-processing

text-classification

hanlp

named-entity-recognition




# HanLP: Han Language Processing

[中文](https://github.com/hankcs/HanLP/tree/doc-zh) | [docs](https://hanlp.hankcs.com/docs/) | [1.x](https://github.com/hankcs/HanLP/tree/1.x) | [forum](https://bbs.hankcs.com/) | [docker](https://github.com/WalterInSH/hanlp-jupyter-docker)

The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user friendly and extendable.

Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks on 104 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.

For end users, HanLP offers lightweight RESTful APIs and native Python APIs.

## RESTful APIs

Tiny packages of only a few KB, suited to agile development and mobile applications. Anonymous users are welcome, but an auth key is recommended; [you can apply for a free one here](https://bbs.hankcs.com/t/apply-for-free-hanlp-restful-apis/3178) under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

### Python

```bash
pip install hanlp_restful
```

Create a client with our API endpoint and your auth key.

```python
from hanlp_restful import HanLPClient

# auth=None connects as an anonymous user.
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul')
```

### Java

Insert the following dependency into your `pom.xml`.

```xml
<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.3</version>
</dependency>
```

Create a client with our API endpoint and your auth key.

```java
HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul");
```

### Quick Start

No matter which language you use, the same interface can be used to parse a document.

```python
HanLP.parse("In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
```

See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.

## Native APIs

```bash
pip install hanlp
```

HanLP requires Python 3.6 or later. GPU/TPU is suggested but not mandatory.

### Quick Start

```python
import hanlp

# Load the pretrained multilingual multi-task model.
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
             '2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
             '2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))
```

In particular, the Python `HanLPClient` can also be used as a callable function following the same semantics. See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.

## Train Your Own Models

Writing DL models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.

```python
tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.70'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1609836303,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
```

The result is guaranteed to be `96.70` because the random seed is fixed. Unlike some overclaiming papers and projects, HanLP promises that every single digit in its scores is reproducible.
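
For readers unfamiliar with how a tokenizer score such as `96.70` is usually obtained, the following is a minimal sketch of micro-averaged word-level F1, the conventional metric for Chinese word segmentation. It assumes the reported score is this F1 multiplied by 100; the helper names `to_spans` and `segmentation_f1` are hypothetical and this is not HanLP's own evaluation code.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged word-level F1 over a corpus of segmented sentences."""
    n_gold = n_pred = n_correct = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = to_spans(g), to_spans(p)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
        # A predicted word counts as correct only if both boundaries match.
        n_correct += len(g_spans & p_spans)
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example (not real data): only the first word has matching boundaries.
gold = [["商品", "和", "服务"]]
pred = [["商品", "和服", "务"]]
print(round(segmentation_f1(gold, pred), 4))
```

Because the metric is a deterministic function of the predictions, fixing the training seed is what makes the final digit repeatable from run to run.
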
Any issues on reproducibility will be treated and solved as a top-priority fatal bug.

## Performance

<table>
<thead>
<tr><th rowspan="2">lang</th><th rowspan="2">corpora</th><th rowspan="2">model</th><th colspan="2">tok</th><th colspan="4">pos</th><th colspan="3">ner</th><th rowspan="2">dep</th><th rowspan="2">con</th><th rowspan="2">srl</th><th colspan="4">sdp</th><th rowspan="2">lem</th><th rowspan="2">fea</th><th rowspan="2">amr</th></tr>
<tr><td>fine</td><td>coarse</td><td>ctb</td><td>pku</td><td>863</td><td>ud</td><td>pku</td><td>msra</td><td>ontonotes</td><td>SemEval16</td><td>DM</td><td>PAS</td><td>PSD</td></tr>
</thead>
<tbody>
<tr><td rowspan="2">mul</td><td rowspan="2">UD2.7 <br>OntoNotes5</td><td>small</td><td>98.62</td><td>-</td><td>-</td><td>-</td><td>-</td><td>93.23</td><td>-</td><td>-</td><td>74.42</td><td>79.10</td><td>76.85</td><td>70.63</td><td>-</td><td>91.19</td><td>93.67</td><td>85.34</td><td>87.71</td><td>84.51</td><td>-</td></tr>
<tr><td>base</td><td>99.67</td><td>-</td><td>-</td><td>-</td><td>-</td><td>96.51</td><td>-</td><td>-</td><td>80.76</td><td>87.64</td><td>80.58</td><td>77.22</td><td>-</td><td>94.38</td><td>96.10</td><td>86.64</td><td>94.37</td><td>91.60</td><td>-</td></tr>
<tr><td rowspan="4">zh</td><td rowspan="2">open</td><td>small</td><td>97.25</td><td>-</td><td>96.66</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>95.00</td><td>84.57</td><td>87.62</td><td>73.40</td><td>84.57</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>base</td><td>97.50</td><td>-</td><td>97.07</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>96.04</td><td>87.11</td><td>89.84</td><td>77.78</td><td>87.11</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td rowspan="2">close</td><td>small</td><td>96.70</td><td>95.93</td><td>96.87</td><td>97.56</td><td>95.05</td><td>-</td><td>96.22</td><td>95.74</td><td>76.79</td><td>84.44</td><td>88.13</td><td>75.81</td><td>74.28</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>base</td><td>97.52</td><td>96.44</td><td>96.99</td><td>97.59</td><td>95.29</td><td>-</td><td>96.48</td><td>95.72</td><td>77.77</td><td>85.29</td><td>88.57</td><td>76.52</td><td>73.76</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
</tbody>
</table>

- AMR models will be released once our paper gets accepted.

## Citing

If you use HanLP in your research, please cite this repository.

```latex
@software{hanlp2,
  author = {Han He},
  title = {{HanLP: Han Language Processing}},
  year = {2020},
  url = {https://github.com/hankcs/HanLP},
}
```

## License

### Code

HanLP is licensed under **Apache License 2.0**. You can use HanLP in your commercial products for free. We would appreciate it if you added a link to HanLP on your website.

### Models

Unless otherwise specified, all models in HanLP are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## References

https://hanlp.hankcs.com/docs/references.html