# HanLP: Han Language Processing
[中文](https://github.com/hankcs/HanLP/tree/doc-zh) | [docs](https://hanlp.hankcs.com/docs/) | [1.x](https://github.com/hankcs/HanLP/tree/1.x) | [forum](https://bbs.hankcs.com/) | [docker](https://github.com/WalterInSH/hanlp-jupyter-docker)
HanLP is a multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, that aims to advance state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user-friendly and extensible.
Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks in 104 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.
For end users, HanLP offers lightweight RESTful APIs and native Python APIs.
## RESTful APIs
Tiny packages of only a few KB for agile development and mobile applications. Although anonymous users are welcome, an auth key is recommended, and [a free one can be applied for here](https://bbs.hankcs.com/t/apply-for-free-hanlp-restful-apis/3178) under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
### Python
```bash
pip install hanlp_restful
```
Create a client with our API endpoint and your auth key.
```python
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul')
```
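To keep the key out of source code, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `HANLP_AUTH` and the helpers `client_kwargs` and `make_client` are hypothetical, and only the `HanLPClient(url, auth=..., language=...)` call shape is taken from the snippet above.

```python
import os

API_URL = 'https://hanlp.hankcs.com/api'


def client_kwargs(env=os.environ):
    """Build keyword arguments for HanLPClient from a mapping of variables.

    HANLP_AUTH is a hypothetical variable name. An unset or empty value
    falls back to anonymous access (auth=None).
    """
    auth = env.get('HANLP_AUTH') or None
    return {'auth': auth, 'language': 'mul'}


def make_client(env=os.environ):
    """Create the client; the import is deferred so that this module
    loads even where hanlp_restful is not installed."""
    from hanlp_restful import HanLPClient
    return HanLPClient(API_URL, **client_kwargs(env))
```

Calling `HanLP = make_client()` then yields the same kind of client as above, anonymous when no key is set.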
### Java
Insert the following dependency into your `pom.xml`.
```xml
<dependency>
<groupId>com.hankcs.hanlp.restful</groupId>
<artifactId>hanlp-restful</artifactId>
<version>0.0.3</version>
</dependency>
```
Create a client with our API endpoint and your auth key.
```java
HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul");
```
### Quick Start
No matter which language you use, the same interface can be used to parse a document.
```python
HanLP.parse("In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
```
See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.
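As a rough illustration of consuming the result, the sketch below pairs tokens with their part-of-speech tags. It assumes a dict-like document whose `tok` and `pos` entries are parallel lists with one inner list per sentence. The key names vary by model and language, and the sample data is hand-written for this example, not real API output.

```python
def tokens_with_tags(doc, tok_key='tok', tag_key='pos'):
    """Pair each token with its tag, sentence by sentence."""
    return [list(zip(tokens, tags))
            for tokens, tags in zip(doc[tok_key], doc[tag_key])]


# Hypothetical sample shaped like a parsed document; not real API output.
sample = {
    'tok': [['HanLP', 'delivers', 'NLP', 'techniques', '.']],
    'pos': [['PROPN', 'VERB', 'PROPN', 'NOUN', 'PUNCT']],
}

pairs = tokens_with_tags(sample)
for token, tag in pairs[0]:
    print(f'{token}\t{tag}')
```

In practice `sample` would be replaced by the return value of `HanLP.parse(...)`, with the key arguments adjusted to whatever keys that document actually contains.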
## Native APIs
```bash
pip install hanlp
```
HanLP requires Python 3.6 or later. A GPU or TPU is recommended but not required.
### Quick Start
```python
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
'2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
'2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))
```
In particular, the Python `HanLPClient` can also be used as a callable function following the same semantics. See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.
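Because both the native pipeline and the Python `HanLPClient` are callables, downstream code can be written against a plain callable and stay independent of which backend is used. The sketch below shows the idea with a stand-in pipeline so that it runs without downloading a model. `count_tokens` and `fake_pipeline` are hypothetical names, and the `tok` key is an assumption about the returned document.

```python
def count_tokens(nlp, sentences, tok_key='tok'):
    """Run any HanLP-style callable and count tokens per sentence."""
    doc = nlp(sentences)
    return [len(tokens) for tokens in doc[tok_key]]


def fake_pipeline(sentences):
    """Stand-in for a loaded model or client: whitespace tokenization only."""
    return {'tok': [sentence.split() for sentence in sentences]}


counts = count_tokens(fake_pipeline, ['HanLP runs locally .', 'Hello world'])
print(counts)  # → [4, 2]
```

Passing the loaded `HanLP` object instead of `fake_pipeline` would exercise the real model, provided its output uses the same key.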
## Train Your Own Models
Writing DL models is not hard; the real challenge is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.
```python
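# Module paths follow HanLP 2.1's layout; adjust them if your installed version differs.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.70'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1609836303,  # fixed seed so the run is repeatable
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)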
```
The result is guaranteed to be `96.70` because the random seed is fixed. Unlike some overclaiming papers and projects, HanLP promises that every single digit in our scores is reproducible. Any reproducibility issue will be treated and resolved as a top-priority fatal bug.
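The role of the fixed seed can be illustrated in isolation. The sketch below uses only Python's standard library, not HanLP's training code, and shows that a seeded random number generator produces the same shuffled batches on every run. `shuffled_batches` is a hypothetical helper written for this example.

```python
import random


def shuffled_batches(items, batch_size, seed):
    """Shuffle items with a seeded generator and split them into batches."""
    rng = random.Random(seed)  # private generator; global state is untouched
    order = list(items)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


first = shuffled_batches(range(10), 4, seed=1609836303)
second = shuffled_batches(range(10), 4, seed=1609836303)
print(first == second)  # → True
```

A real training run has more sources of randomness (weight initialization, dropout, GPU kernels), so seeding the data order alone is only one part of making scores repeatable.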
## Performance
<table><thead><tr><th rowspan="2">lang</th><th rowspan="2">corpora</th><th rowspan="2">model</th><th colspan="2">tok</th><th colspan="4">pos</th><th colspan="3">ner</th><th rowspan="2">dep</th><th rowspan="2">con</th><th rowspan="2">srl</th><th colspan="4">sdp</th><th rowspan="2">lem</th><th rowspan="2">fea</th><th rowspan="2">amr</th></tr><tr><td>fine</td><td>coarse</td><td>ctb</td><td>pku</td><td>863</td><td>ud</td><td>pku</td><td>msra</td><td>ontonotes</td><td>SemEval16</td><td>DM</td><td>PAS</td><td>PSD</td></tr></thead><tbody><tr><td rowspan="2">mul</td><td rowspan="2">UD2.7 <br>OntoNotes5</td><td>small</td><td>98.62</td><td>-</td><td>-</td><td>-</td><td>-</td><td>93.23</td><td>-</td><td>-</td><td>74.42</td><td>79.10</td><td>76.85</td><td>70.63</td><td>-</td><td>91.19</td><td>93.67</td><td>85.34</td><td>87.71</td><td>84.51</td><td>-</td></tr><tr><td>base</td><td>99.67</td><td>-</td><td>-</td><td>-</td><td>-</td><td>96.51</td><td>-</td><td>-</td><td>80.76</td><td>87.64</td><td>80.58</td><td>77.22</td><td>-</td><td>94.38</td><td>96.10</td><td>86.64</td><td>94.37</td><td>91.60</td><td>-</td></tr><tr><td rowspan="4">zh</td><td rowspan="2">open</td><td>small</td><td>97.25</td><td>-</td><td>96.66</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>95.00</td><td>84.57</td><td>87.62</td><td>73.40</td><td>84.57</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>base</td><td>97.50</td><td>-</td><td>97.07</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>96.04</td><td>87.11</td><td>89.84</td><td>77.78</td><td>87.11</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td 
rowspan="2">close</td><td>small</td><td>96.70</td><td>95.93</td><td>96.87</td><td>97.56</td><td>95.05</td><td>-</td><td>96.22</td><td>95.74</td><td>76.79</td><td>84.44</td><td>88.13</td><td>75.81</td><td>74.28</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>base</td><td>97.52</td><td>96.44</td><td>96.99</td><td>97.59</td><td>95.29</td><td>-</td><td>96.48</td><td>95.72</td><td>77.77</td><td>85.29</td><td>88.57</td><td>76.52</td><td>73.76</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr></tbody></table>
- AMR models will be released once our paper gets accepted.
## Citing
If you use HanLP in your research, please cite this repository.
```latex
@software{hanlp2,
author = {Han He},
title = {{HanLP: Han Language Processing}},
year = {2020},
url = {https://github.com/hankcs/HanLP},
}
```
## License
### Code
HanLP is licensed under **Apache License 2.0**. You can use HanLP in your commercial products for free. We would appreciate it if you add a link to HanLP on your website.
### Models
Unless otherwise specified, all models in HanLP are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
## References
https://hanlp.hankcs.com/docs/references.html