# connlp
A collection of Python modules for analyzing text data in the construction industry.
It mainly repackages existing Python libraries for Natural Language Processing (NLP).
## _Project Information_
- Supported by C!LAB (@Seoul Nat'l Univ.)
## _Contributors_
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Sehwan Chung (hwani751@snu.ac.kr)
- Jungyeon Kim (janykjy@snu.ac.kr)
# Initialize
## _Setup_
Install _**connlp**_ with _pip_.
```shell
pip install connlp
```
Install the dependencies listed in _requirements.txt_.
```shell
cd WORKSPACE
wget -O requirements_connlp.txt https://raw.githubusercontent.com/blank54/connlp/master/requirements.txt
pip install -r requirements_connlp.txt
```
## _Test_
If the code below runs without errors, _**connlp**_ has been installed successfully.
```python
from connlp.test import hello
hello()
# 'Helloworld'
```
# Preprocess
The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).
## _Normalizer_
_**Normalizer**_ normalizes the input text by eliminating trash characters (e.g., punctuation marks) and lowercasing the remaining numbers and alphabet letters.
```python
from connlp.preprocess import Normalizer
normalizer = Normalizer()
normalizer.normalize(text='I am a boy!')
# 'i am a boy'
```
## _EnglishTokenizer_
_**EnglishTokenizer**_ tokenizes English input text based on word spacing.
Ngram-based tokenization is under development.
```python
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
tokenizer.tokenize(text='I am a boy!')
# ['I', 'am', 'a', 'boy!']
```
## _KoreanTokenizer_
_**KoreanTokenizer**_ tokenizes Korean input text using either a pre-trained or an unsupervised approach.
We recommend the pre-trained method unless you have a large corpus; this is the default setting.
To use a pre-trained tokenizer, you have to select an analyzer. The available analyzers are based on KoNLPy (https://konlpy.org/ko/latest/api/konlpy.tag/), a Python package for Korean language processing. The default analyzer is _**Hannanum**_.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
```
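If you prefer a different KoNLPy analyzer, pass its name instead. The sketch below is only illustrative: 'Komoran' is a KoNLPy tagger name, but whether _**connlp**_ accepts it depends on which taggers the package actually wraps.
```python
from connlp.preprocess import KoreanTokenizer

# 'Komoran' is an illustrative KoNLPy analyzer name, not a documented connlp option;
# check the package if this name is not accepted.
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Komoran')
```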
If your corpus is large, you may use the unsupervised method, which is based on _**soynlp**_ (https://github.com/lovit/soynlp), an unsupervised text analyzer for Korean.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
```
### _train_
If your _**KoreanTokenizer**_ is pre-trained, you can skip this step.
Otherwise (i.e., if you are using the unsupervised approach), the _**KoreanTokenizer**_ object first needs to be trained on an (unlabeled) corpus. A 'word score' is then calculated for every subword in the corpus.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)
print(tokenizer.word_score)
# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}
```
### _tokenize_
If you are using a pre-trained _**KoreanTokenizer**_, the selected KoNLPy analyzer will tokenize the input sentence based on morphological analysis.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
doc = '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스', '의', '첫', '번째', '문서', '입니다', '.']
```
If you are using an unsupervised _**KoreanTokenizer**_, tokenization is based on the 'word scores' calculated by the _**KoreanTokenizer.train**_ method.
For each blank-separated token, the subword with the maximum 'word score' is selected as an individual 'word' and separated from the remaining part.
```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)  # the unsupervised tokenizer needs word scores before tokenizing
doc = docs[0]  # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스의', '첫', '번째', '문서', '입니다.']
```
## _StopwordRemover_
_**StopwordRemover**_ removes stopwords from a given sentence based on a user-customized stopword list.
Before using _**StopwordRemover**_, the user should normalize and tokenize the documents.
```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover
normalizer = Normalizer()
eng_tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()
docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']
tokenized_docs = []
for doc in docs:
    normalized_doc = normalizer.normalize(text=doc)
    tokenized_doc = eng_tokenizer.tokenize(text=normalized_doc)
    tokenized_docs.append(tokenized_doc)
print(docs)
print(tokenized_docs)
# ['I am a boy!', 'He is a boy..', 'She is a girl?']
# [['i', 'am', 'a', 'boy'], ['he', 'is', 'a', 'boy'], ['she', 'is', 'a', 'girl']]
```
The user should prepare a customized stopword list (i.e., a _stoplist_).
The _stoplist_ should contain the user-customized stopwords separated by '\n', and the file should be in ".txt" format.
```text
a
is
am
```
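For instance, a minimal sketch of creating such a stoplist file with the Python standard library (the filepath is simply the example path used below):
```python
# Write a user-customized stoplist: one stopword per line, in a plain .txt file.
stopwords = ['a', 'is', 'am']
with open('test/thesaurus/stoplist.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(stopwords))
```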
Initiate the _**StopwordRemover**_ with the filepath of the user-customized stopword list.
If no stoplist exists at the filepath, the stoplist remains an empty list.
```python
fpath_stoplist = 'test/thesaurus/stoplist.txt'
stopword_remover.initiate(fpath_stoplist=fpath_stoplist)
print(stopword_remover)
# <connlp.preprocess.StopwordRemover object at 0x7f163e70c050>
```
The user can count the word frequencies and figure out additional stopwords based on the results.
```python
stopword_remover.count_freq_words(docs=tokenized_docs)
# ========================================
# Word counts
# | [1] a: 3
# | [2] boy: 2
# | [3] is: 2
# | [4] i: 1
# | [5] am: 1
# | [6] he: 1
# | [7] she: 1
# | [8] girl: 1
```
After updating the _stoplist_, use the _**remove**_ method to remove the stopwords from the text.
```python
stopword_removed_docs = []
for doc in tokenized_docs:
    stopword_removed_docs.append(stopword_remover.remove(sent=doc))
print(stopword_removed_docs)
# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```
The user can check which stopwords were removed with the _**check_removed_words**_ method.
```python
stopword_remover.check_removed_words(docs=tokenized_docs, stopword_removed_docs=stopword_removed_docs)
# ========================================
# Check stopwords removed
# | [1] BEFORE: a(3) ->
# | [2] BEFORE: boy -> AFTER: boy(2)
# | [3] BEFORE: is(2) ->
# | [4] BEFORE: i -> AFTER: i(1)
# | [5] BEFORE: am(1) ->
# | [6] BEFORE: he -> AFTER: he(1)
# | [7] BEFORE: she -> AFTER: she(1)
# | [8] BEFORE: girl -> AFTER: girl(1)
```
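Putting the pieces together, the whole preprocessing pipeline (normalize, tokenize, and remove stopwords) can be written as below. This sketch simply recombines the calls shown above and assumes the same example stoplist file.
```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover

normalizer = Normalizer()
tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()
stopword_remover.initiate(fpath_stoplist='test/thesaurus/stoplist.txt')

docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']

# Normalize and tokenize each document, then remove stopwords.
preprocessed_docs = []
for doc in docs:
    tokens = tokenizer.tokenize(text=normalizer.normalize(text=doc))
    preprocessed_docs.append(stopword_remover.remove(sent=tokens))

print(preprocessed_docs)
# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```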
# Embedding
## _Vectorizer_
_**Vectorizer**_ includes several text embedding methods that have been commonly used for decades.
### _tfidf_
TF-IDF is one of the most commonly used techniques for word embedding.
The TF-IDF model counts the term frequency (TF) and inverse document frequency (IDF) in the given documents.
The results include the following:
- TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
- TF-IDF Matrix
- TF-IDF Vocabulary
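For reference, since the vectorizer is scikit-learn's TfidfVectorizer, the weights follow its standard smoothed formulation (assuming _**connlp**_ keeps the default settings):
```math
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right)
```
where n is the number of documents, tf(t, d) is the count of term t in document d, and df(t) is the number of documents containing t; each document vector is then L2-normalized.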
```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()