Python third-party package: Chinese company name segmentation tool (extracting and segmenting company names in Python)

48 files in total: 24 .py, 13 .txt, 5 .md


Tags: Chinese word segmentation, natural language processing

Uploaded 2021-07-16 15:49:42 · 1.79 MB ZIP

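This resource is a third-party Python package for segmenting Chinese company names. It parses a company name and extracts its key components: place names, brand names, industry words, and company-name suffixes.

Chinese word segmentation is typically the first step in Chinese NLP. Unlike English, Chinese has no spaces between words, so a continuous sequence of characters must be cut into meaningful lexical units by an algorithm or model. Later tasks such as semantic analysis, sentiment analysis, and keyword extraction depend on this step. The package may build on a popular Chinese segmentation library such as Jieba, HanLP, or PKUSeg; these libraries usually combine statistical models with dictionary matching to segment text efficiently and accurately.

Brand-name recognition identifies the core brand within a company name, which matters for marketing and brand management. Industry words such as "科技" (technology), "金融" (finance), or "教育" (education) indicate the company's main line of business. Both are useful for market research, competitive analysis, and business positioning.

The tool also recognizes place names, which appear frequently in company names, especially for regional businesses. Extracted place names show where a company is based and can support studies of regional economies and geographic distribution.

Finally, the tool extracts company-name suffixes. Suffixes such as "有限公司" (limited company) or "集团" (group) indicate a company's legal form and scale, details that matter in data analysis and corporate credit reporting.

Taken together, the package offers a complete solution for parsing Chinese company names and simplifies data preprocessing. Typical applications include business intelligence, public-opinion analysis, and building enterprise databases. It installs through the standard Python workflow and can then be integrated into a project.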
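The dictionary-matching idea described above can be illustrated with a toy forward longest-match labeler. This is only a sketch under stated assumptions: the three tiny lexicons and the `toy_parse` function are hypothetical, and this is not the algorithm the package itself uses. Characters that match no lexicon entry are collected as the brand.

```python
from typing import Dict, List

# Hypothetical miniature lexicons for illustration only;
# the real package ships far larger data files.
LEXICONS: Dict[str, List[str]] = {
    "place": ["武汉", "泉州"],
    "trade": ["电子商务", "食品"],
    "suffix": ["有限公司"],
}


def toy_parse(name: str) -> Dict[str, str]:
    """Label a company name by forward longest-match against the lexicons."""
    found: Dict[str, List[str]] = {"place": [], "brand": [], "trade": [], "suffix": []}
    # Try longer words first so that the longest entry wins at each position.
    vocab = sorted(
        ((word, label) for label, words in LEXICONS.items() for word in words),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    unknown = ""  # run of characters not covered by any lexicon
    i = 0
    while i < len(name):
        for word, label in vocab:
            if name.startswith(word, i):
                if unknown:  # an unmatched run ends here: treat it as a brand
                    found["brand"].append(unknown)
                    unknown = ""
                found[label].append(word)
                i += len(word)
                break
        else:
            unknown += name[i]
            i += 1
    if unknown:
        found["brand"].append(unknown)
    # Join multiple hits with ",", mirroring the README's output convention.
    return {label: ",".join(words) for label, words in found.items()}


parsed = toy_parse("武汉海明智业电子商务有限公司")
print(parsed)
```

A purely dictionary-based approach like this fails as soon as a brand contains a lexicon word, which is one reason real segmenters add statistical models on top.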


Python第三方安装包-中文公司名称分词工具.zip (48 files)

Python第三方安装包-中文公司名称分词工具

setup.py 2KB

companynameparser

place.py 3KB

data

china_place.txt 58KB

brand.txt 2.54MB

suffix_single.txt 24B

pca.csv 244KB

trade.txt 12KB

THUOCL_diming.txt 626KB

place_single.txt 24B

trade_single.txt 23B

suffix.txt 2KB

__init__.py 266B

tokenizer.py 3KB

parser.py 12KB

tools

__init__.py 80B

generate_bio.py 3KB

bio_2_entity.py 1KB

__main__.py 2KB

logger.py 1KB

.gitignore 2KB

requirements.txt 5B

CONTRIBUTING.md 7KB

LICENSE 11KB

.github

stale.yml 766B

ISSUE_TEMPLATE

usage-question.md 630B

bug-report.md 1KB

feature-request.md 787B

examples

all_demo.py 1KB

custom_name_split.txt 1000B

enable_wordsegment_demo.py 583B

base_demo.py 570B

cmd_demo.py 1KB

use_custom_split_demo.py 809B

pos_sensitive_demo.py 657B

README.md 8KB

tests

evaluate_file.py 1008B

fix_bug.py 3KB

parser_stdin.py 411B

cut_stdin.py 286B

parser_demo.py 631B

company_demo.txt 854B

predict_input.txt 3KB

groundtruth.txt 21KB

test_cut.py 1KB

test_base.py 2KB

test_customsplit.py 3KB

中文公司名称分词工具.docx 13KB

docs

echarts.png 511KB

# companynameparser

[![PyPI version](https://badge.fury.io/py/companynameparser.svg)](https://badge.fury.io/py/companynameparser)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
![Language](https://img.shields.io/badge/Language-Python-blue.svg)
![Python3](https://img.shields.io/badge/Python-3.X-red.svg)

A company name parser that extracts the brand from a company name. It is a segmentation tool for Chinese company names and extracts the place name, brand name (main word), industry words, and company-name suffix.

**Guide**

- [Feature](#Feature)
- [Install](#Install)
- [Usage](#usage)
- [Command Line Usage](#command-line-usage)
- [Contribute](#contribute)
- [Reference](#Reference)

# Feature

Parses company-name text, then identifies and extracts place names, brand names, industry words, and company-name suffixes.

# Evaluate

Run the evaluation script [evaluate_file.py](./tests/evaluate_file.py). The evaluation is conservative: a prediction counts as True only if it exactly matches the ground truth. Results:

- Precision: 94.8%
- Recall: 94.5%

# Install

- Automatic install: `pip install companynameparser`
- Semi-automatic install:

```
git clone https://github.com/shibing624/companynameparser.git
cd companynameparser
python setup.py install
```

Either method completes the installation. If you prefer not to install the package, download the source from GitHub and install the dependencies in [requirements.txt](./requirements.txt) before using it.

# Usage

- Extract Company Name

Extracting each element of a company name: [base_demo.py](./examples/base_demo.py)

```python
import companynameparser

company_strs = [
    "武汉海明智业电子商务有限公司",
    "泉州益念食品有限公司",
    "常州途畅互联网科技有限公司合肥分公司",
    "昆明享亚教育信息咨询有限公司",
]
for name in company_strs:
    r = companynameparser.parse(name)
    print(r)
```

output:

```
{'place': '武汉', 'brand': '海明智业', 'trade': '电子商务', 'suffix': '有限公司', 'symbol': ''}
{'place': '泉州', 'brand': '益念', 'trade': '食品', 'suffix': '有限公司', 'symbol': ''}
{'place': '常州,合肥', 'brand': '途畅', 'trade': '互联网科技', 'suffix': '有限公司,分公司', 'symbol': ''}
{'place': '昆明', 'brand': '享亚', 'trade': '教育信息咨询', 'suffix': '有限公司', 'symbol': ''}
```

> The `name` argument passed to `parse` is a `str`.
> The output is a `dict` with the keys `place` (place name), `brand` (brand name), `trade` (industry words), `suffix` (company-name suffix), and `symbol` (punctuation).

Multiple place names, brands, or industry words are separated by `,`, for example `'常州,合肥'`.

- All Demo

A single demo that covers every example, [all_demo.py](./examples/all_demo.py), including:

1. Extracting each element of a company name
2. Enabling word segmentation within each element
3. Enabling position information for each element
4. User-defined element splits, which fix false positives and missed recalls

```python
import companynameparser

company_strs = [
    "武汉海明智业电子商务有限公司",
    "泉州益念食品有限公司",
    "常州途畅互联网科技有限公司合肥分公司",
    "昆明享亚教育信息咨询有限公司",
    "深圳光明区三晟股份有限公司",
]
# 1. Default parsing
for name in company_strs:
    r = companynameparser.parse(name)
    print(r)

# 2. Split each element into words
print("*" * 42, ' enable word segment')
for name in company_strs:
    r = companynameparser.parse(name, pos_sensitive=False, enable_word_segment=True)
    print(r)

# 3. Return (text, start, end) positions for each element
print("*" * 42, ' pos sensitive')
for name in company_strs:
    r = companynameparser.parse(name, pos_sensitive=True, enable_word_segment=False)
    print(r)

# Word segmentation and positions together
print("*" * 42, 'enable word segment and pos')
for name in company_strs:
    r = companynameparser.parse(name, pos_sensitive=True, enable_word_segment=True)
    print(r)

# 4. Load user-defined splits before parsing
print("*" * 42, 'use custom name')
companynameparser.set_custom_split_file('./custom_name_split.txt')
for name in company_strs:
    r = companynameparser.parse(name)
    print(r)
```

output:

```
{'place': '武汉', 'brand': '海明智业', 'trade': '电子商务', 'suffix': '有限公司', 'symbol': ''}
{'place': '泉州', 'brand': '益念', 'trade': '食品', 'suffix': '有限公司', 'symbol': ''}
{'place': '常州,合肥', 'brand': '途畅', 'trade': '互联网科技', 'suffix': '有限公司,分公司', 'symbol': ''}
{'place': '昆明', 'brand': '享亚', 'trade': '教育信息咨询', 'suffix': '有限公司', 'symbol': ''}
{'place': '深圳光明', 'brand': '区三晟', 'trade': '', 'suffix': '股份有限公司', 'symbol': ''}
****************************************** enable word segment
{'place': '武汉', 'brand': '海明智业', 'trade': '电子商务', 'suffix': '有限公司', 'symbol': ''}
{'place': '泉州', 'brand': '益念', 'trade': '食品', 'suffix': '有限公司', 'symbol': ''}
{'place': '常州,合肥', 'brand': '途畅', 'trade': '互联网,科技', 'suffix': '有限公司,分公司', 'symbol': ''}
{'place': '昆明', 'brand': '享亚', 'trade': '教育,信息,咨询', 'suffix': '有限公司', 'symbol': ''}
{'place': '深圳,光明', 'brand': '区三晟', 'trade': '', 'suffix': '股份,有限公司', 'symbol': ''}
****************************************** pos sensitive
{'place': [('武汉', 0, 2)], 'brand': [('海明智业', 2, 6)], 'trade': [('电子商务', 6, 10)], 'suffix': [('有限公司', 10, 14)], 'symbol': []}
{'place': [('泉州', 0, 2)], 'brand': [('益念', 2, 4)], 'trade': [('食品', 4, 6)], 'suffix': [('有限公司', 6, 10)], 'symbol': []}
{'place': [('常州', 0, 2), ('合肥', 13, 15)], 'brand': [('途畅', 2, 4)], 'trade': [('互联网科技', 4, 9)], 'suffix': [('有限公司', 9, 13), ('分公司', 15, 18)], 'symbol': []}
{'place': [('昆明', 0, 2)], 'brand': [('享亚', 2, 4)], 'trade': [('教育信息咨询', 4, 10)], 'suffix': [('有限公司', 10, 14)], 'symbol': []}
{'place': [('深圳光明', 0, 4)], 'brand': [('区三晟', 4, 7)], 'trade': [], 'suffix': [('股份有限公司', 7, 13)], 'symbol': []}
****************************************** enable word segment and pos
{'place': [('武汉', 0, 2)], 'brand': [('海明智业', 2, 6)], 'trade': [('电子商务', 6, 10)], 'suffix': [('有限公司', 10, 14)], 'symbol': []}
{'place': [('泉州', 0, 2)], 'brand': [('益念', 2, 4)], 'trade': [('食品', 4, 6)], 'suffix': [('有限公司', 6, 10)], 'symbol': []}
{'place': [('常州', 0, 2), ('合肥', 13, 15)], 'brand': [('途畅', 2, 4)], 'trade': [('互联网', 4, 7), ('科技', 7, 9)], 'suffix': [('有限公司', 9, 13), ('分公司', 15, 18)], 'symbol': []}
{'place': [('昆明', 0, 2)], 'brand': [('享亚', 2, 4)], 'trade': [('教育', 4, 6), ('信息', 6, 8), ('咨询', 8, 10)], 'suffix': [('有限公司', 10, 14)], 'symbol': []}
{'place': [('深圳', 0, 2), ('光明', 2, 4)], 'brand': [('区三晟', 4, 7)], 'trade': [], 'suffix': [('股份', 7, 9), ('有限公司', 9, 13)], 'symbol': []}
****************************************** use custom name
{'place': '武汉', 'brand': '海明智业', 'trade': '电子商务', 'suffix': '有限公司', 'symbol': ''}
{'place': '泉州', 'brand': '益念', 'trade': '食品', 'suffix': '有限公司', 'symbol': ''}
{'place': '常州,合肥', 'brand': '途畅', 'trade': '互联网科技', 'suffix': '有限公司,分公司', 'symbol': ''}
{'place': '昆明', 'brand': '享亚', 'trade': '教育信息咨询', 'suffix': '有限公司', 'symbol': ''}
{'place': '深圳光明区', 'brand': '三晟', 'trade': '', 'suffix': '股份有限公司', 'symbol': ''}
```

In the last block the custom split file corrects the fifth name: `区` moves from the brand (`区三晟`) into the place (`深圳光明区`), leaving `三晟` as the brand.

## Command Line Usage

<details>
<summary>Command line mode</summary>

Batch parsing of company names from a file is supported:

```
python3 -m companynameparser company_demo.txt -o out.csv

usage: python3 -m companynameparser [-h] -o OUTPUT input

@description:

positional arguments:
  input                 the input file path, file encode need utf-8.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file path.
```

</details>
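The batch mode above can be approximated from Python. The sketch below is hedged: the helper `parse_names_to_csv` and its CSV column layout are assumptions for illustration and may not match what the command-line tool actually writes. The parser is passed in as a function so the example runs without the package installed; in real use you would pass `companynameparser.parse`, which the README shows returning a dict with the keys `place`, `brand`, `trade`, `suffix`, and `symbol`.

```python
import csv
import io
from typing import Callable, Dict, Iterable, TextIO

# Assumed column layout; the real CLI output format may differ.
FIELDS = ["name", "place", "brand", "trade", "suffix", "symbol"]


def parse_names_to_csv(
    names: Iterable[str],
    parse: Callable[[str], Dict[str, str]],
    out: TextIO,
) -> int:
    """Parse one company name per input line and write one CSV row per name."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for raw in names:
        name = raw.strip()
        if not name:  # skip blank lines
            continue
        row = {"name": name}
        row.update(parse(name))
        writer.writerow(row)
        count += 1
    return count


def fake_parse(name: str) -> Dict[str, str]:
    """Stand-in for companynameparser.parse, used only to keep the demo runnable."""
    return {"place": name[:2], "brand": "", "trade": "", "suffix": name[-4:], "symbol": ""}


buffer = io.StringIO()
written = parse_names_to_csv(["泉州益念食品有限公司\n", "\n"], fake_parse, buffer)
print(buffer.getvalue())
```

To process a real file, open it with `encoding="utf-8"` (the CLI help states the input must be UTF-8) and open the output with `newline=""` as the `csv` module recommends.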
