<div align="center">
<a href="https://github.com/andrewtavis/kwx"><img src="https://github.com/andrewtavis/kwx/blob/main/resources/kwx_logo_transparent.png" width=431 height=215></a>
</div>
--------------------------------------
[![rtd](https://img.shields.io/readthedocs/kwx.svg?logo=read-the-docs)](http://kwx.readthedocs.io/en/latest/)
[![ci](https://img.shields.io/github/workflow/status/andrewtavis/kwx/CI?logo=github)](https://github.com/andrewtavis/kwx/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/andrewtavis/kwx/branch/main/graphs/badge.svg)](https://codecov.io/gh/andrewtavis/kwx)
[![pyversions](https://img.shields.io/pypi/pyversions/kwx.svg?logo=python&logoColor=FFD43B&color=306998)](https://pypi.org/project/kwx/)
[![pypi](https://img.shields.io/pypi/v/kwx.svg?color=4B8BBE)](https://pypi.org/project/kwx/)
[![pypistatus](https://img.shields.io/pypi/status/kwx.svg)](https://pypi.org/project/kwx/)
[![license](https://img.shields.io/github/license/andrewtavis/kwx.svg)](https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt)
[![coc](https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md)
[![codestyle](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![colab](https://img.shields.io/badge/%20-Open%20in%20Colab-097ABB.svg?logo=google-colab&color=097ABB&labelColor=525252)](https://colab.research.google.com/github/andrewtavis/kwx)
### BERT, LDA, and TFIDF based keyword extraction in Python
[//]: # "The '-' after the section links is needed to make them work on GH (because of ↩s)"
**Jump to:**<a id="jumpto"></a> [Models](#models-) • [Usage](#usage-) • [Visuals](#visuals-) • [To-Do](#to-do-)
**kwx** is a toolkit for multilingual keyword extraction based on Google's [BERT](https://github.com/google-research/bert) and [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the resulting corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words not to include in outputs, letting them apply their own intuitions to fine-tune the modeling process.
For a thorough overview of the process and techniques see the [Google slides](https://docs.google.com/presentation/d/1BNddaeipNQG1mUTjBYmrdpGC6xlBvAi3rapT88fkdBU/edit?usp=sharing), and reference the [documentation](https://kwx.readthedocs.io/en/latest/) for explanations of the models and visualization methods.
# Installation via PyPI
kwx can be downloaded from PyPI via pip or sourced directly from this repository:
```bash
pip install kwx
```
```bash
git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install
```
```python
import kwx
```
# Models [`↩`](#jumpto)
Implemented NLP modeling methods within [kwx.model](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) include:
### BERT
[Bidirectional Encoder Representations from Transformers](https://github.com/google-research/bert) derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.
kwx uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) pretrained models. See their GitHub and [documentation](https://www.sbert.net/) for the available models.
### LDA
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.
Although not as statistically strong as the BERT-based methods above, LDA provides quick results that are suitable for many applications.
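The generative story LDA assumes — each document is a mixture of topics, and each word is drawn from one of them — can be sketched in a few lines of standard-library Python. This is a toy illustration of the model's assumptions, not kwx's implementation; the two topics and their word distributions are made up:

```python
import random

random.seed(0)

# Two hypothetical topics, each a distribution over its own vocabulary
topics = [
    (["flight", "plane", "delay"], [0.5, 0.3, 0.2]),
    (["service", "customer", "love"], [0.4, 0.4, 0.2]),
]

def sample_dirichlet(alpha):
    """Draw topic proportions via normalized Gamma samples."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(items, probs):
    """Pick one item according to its probability."""
    r, acc = random.random(), 0.0
    for item, p in zip(items, probs):
        acc += p
        if r < acc:
            return item
    return items[-1]

def generate_document(num_words, alpha=(0.5, 0.5)):
    """LDA's generative process: mix topics per document, then draw each word."""
    theta = sample_dirichlet(alpha)  # this document's topic mixture
    doc = []
    for _ in range(num_words):
        words, word_probs = sample_categorical(topics, theta)  # pick a topic
        doc.append(sample_categorical(words, word_probs))  # pick a word from it
    return doc

print(generate_document(8))
```

Fitting LDA inverts this process: given only the documents, it infers the topic-word distributions and the per-document topic mixtures.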
### LDA with BERT embeddings
This technique combines LDA topic representations with BERT embeddings, which are merged and compressed via [kwx.autoencoder](https://github.com/andrewtavis/kwx/blob/main/kwx/autoencoder.py) to derive topics that leverage both approaches.
### Other Methods
The user can also choose to simply query the most common words from a text corpus or compute TFIDF ([Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) keywords: those that are unique in a text body when compared to another corpus. The former method is used in kwx as a baseline to check model efficacy, and the latter is useful when the user has another text body to compare the target corpus against.
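The TFIDF idea can be sketched in plain Python: score each word by how frequent it is in the target corpus, discounted by how many documents overall contain it. This is a simplified toy with made-up corpora and an illustrative `tfidf_keywords` helper, not kwx's implementation (kwx builds on `sklearn.feature_extraction.text.TfidfVectorizer`):

```python
import math
from collections import Counter

def tfidf_keywords(target_docs, comparison_docs, num_keywords=5):
    """Rank words frequent in the target corpus but rare across all documents."""
    all_docs = target_docs + comparison_docs
    df = Counter()  # document frequency over both corpora
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter()  # term frequency within the target corpus only
    for doc in target_docs:
        tf.update(doc.split())
    total = sum(tf.values())
    scores = {w: (c / total) * math.log(len(all_docs) / df[w]) for w, c in tf.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_keywords]

target = ["flight delayed again", "flight cancelled baggage lost"]
other = ["great pizza tonight", "pizza and a movie tonight"]
print(tfidf_keywords(target, other, num_keywords=3))
```

Words shared with the comparison corpus get a low inverse-document-frequency weight and drop out of the ranking, which is why TFIDF needs a comparison corpus while the frequency baseline does not.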
# Usage [`↩`](#jumpto)
Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.
The following outlines using kwx to derive keywords from a text corpus with `prompt_remove_words` as `True` (the user will be asked if some of the extracted words need to be removed):
```python
from kwx.utils import prepare_data
from kwx.model import extract_kws

input_language = "english"  # see kwx.languages for options
num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keyword arguments for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT",  # "BERT", "LDA_BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    batch_size=32,
)
```
```
The BERT keywords are:
['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']
Are there words that should be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica
The new BERT keywords are:
['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']
Are there words that should be removed [y/n]? n
```
The model will be rerun until all words known to be unreasonable are removed and a suitable output is produced. [kwx.model.gen_files](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) can also be used as a run-all function that produces a directory containing a keyword text file and visuals (useful for experienced users wanting quick results).
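The re-run loop that `prompt_remove_words=True` drives can be sketched as follows. This is a hypothetical standalone helper, not kwx's internal code; `extract_fn` and `ask` are illustrative parameters standing in for the extraction call and interactive input:

```python
def extract_with_removal(extract_fn, ignore_words=None, max_rounds=5, ask=input):
    """Filter and re-extract keywords until the user accepts them."""
    ignore = set(ignore_words or [])
    keywords = []
    for _ in range(max_rounds):
        # Re-extract, dropping everything the user has flagged so far
        keywords = [w for w in extract_fn() if w not in ignore]
        print(f"The keywords are: {keywords}")
        if ask("Are there words that should be removed [y/n]? ").strip().lower() != "y":
            break
        removals = ask("Type or copy word(s) to be removed: ")
        ignore.update(w.strip() for w in removals.split(","))
    return keywords

# Scripted demo (stands in for interactive input)
answers = iter(["y", "united", "n"])
final_kws = extract_with_removal(
    lambda: ["flight", "united", "delay"], ask=lambda prompt: next(answers)
)
```

Because extraction is re-run each round rather than merely filtered, new keywords can surface to fill the slots freed by removed words, as in the sample session above.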
# Visuals [`↩`](#jumpto)
[kwx.visuals](https://github.com/andrewtavis/kwx/blob/main/kwx/visuals.py) includes functions for both presenting and analyzing the results of keyword extraction.
### Topic Number Evaluation