<div align="center">
<a href="https://github.com/andrewtavis/kwx"><img src="https://github.com/andrewtavis/kwx/blob/main/resources/kwx_logo_transparent.png" width=431 height=215></a>
</div>
--------------------------------------
[![rtd](https://img.shields.io/readthedocs/kwx.svg?logo=read-the-docs)](http://kwx.readthedocs.io/en/latest/)
[![ci](https://img.shields.io/github/workflow/status/andrewtavis/kwx/CI?logo=github)](https://github.com/andrewtavis/kwx/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/andrewtavis/kwx/branch/main/graphs/badge.svg)](https://codecov.io/gh/andrewtavis/kwx)
[![pyversions](https://img.shields.io/pypi/pyversions/kwx.svg?logo=python&logoColor=FFD43B&color=306998)](https://pypi.org/project/kwx/)
[![pypi](https://img.shields.io/pypi/v/kwx.svg?color=4B8BBE)](https://pypi.org/project/kwx/)
[![pypistatus](https://img.shields.io/pypi/status/kwx.svg)](https://pypi.org/project/kwx/)
[![license](https://img.shields.io/github/license/andrewtavis/kwx.svg)](https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt)
[![coc](https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md)
[![codestyle](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![colab](https://img.shields.io/badge/%20-Open%20in%20Colab-097ABB.svg?logo=google-colab&color=097ABB&labelColor=525252)](https://colab.research.google.com/github/andrewtavis/kwx)
### BERT, LDA, and TFIDF based keyword extraction in Python
[//]: # "The '-' after the section links is needed to make them work on GH (because of ↩s)"
**Jump to:**<a id="jumpto"></a> [Models](#models-) • [Usage](#usage-) • [Visuals](#visuals-) • [To-Do](#to-do-)
**kwx** is a toolkit for multilingual keyword extraction based on Google's [BERT](https://github.com/google-research/bert) and [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the resulting corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/kwx/languages.py) for the various degrees of language support). A unique focus is allowing users to decide which words not to include in outputs, letting them apply their own intuitions to fine-tune the modeling process.
For a thorough overview of the process and techniques see the [Google slides](https://docs.google.com/presentation/d/1BNddaeipNQG1mUTjBYmrdpGC6xlBvAi3rapT88fkdBU/edit?usp=sharing), and reference the [documentation](https://kwx.readthedocs.io/en/latest/) for explanations of the models and visualization methods.
# Installation via PyPI
kwx can be downloaded from PyPI via pip or sourced directly from this repository:
```bash
pip install kwx
```
```bash
git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install
```
```python
import kwx
```
# Models [`↩`](#jumpto)
Implemented NLP modeling methods within [kwx.model](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) include:
### BERT
[Bidirectional Encoder Representations from Transformers](https://github.com/google-research/bert) derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.
kwx uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) pretrained models. See their GitHub and [documentation](https://www.sbert.net/) for the available models.
### LDA
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.
Although not as statistically strong as the BERT-based methods above, LDA provides quick results that are suitable for many applications.
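The generative story LDA assumes — each document is a mixture of topics, and each word is drawn from one of them — can be sketched in a few lines of standard-library Python. This is a toy illustration of the model's assumptions, not kwx's implementation; the two topics and their word distributions are made up:

```python
import random

random.seed(0)

# Two hypothetical topics, each a distribution over its own vocabulary
topics = [
    (["flight", "plane", "delay"], [0.5, 0.3, 0.2]),
    (["service", "customer", "love"], [0.4, 0.4, 0.2]),
]

def sample_dirichlet(alpha):
    """Draw topic proportions via normalized Gamma samples."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(items, probs):
    """Pick one item according to its probability."""
    r, acc = random.random(), 0.0
    for item, p in zip(items, probs):
        acc += p
        if r < acc:
            return item
    return items[-1]

def generate_document(num_words, alpha=(0.5, 0.5)):
    """LDA's generative process: mix topics per document, then draw each word."""
    theta = sample_dirichlet(alpha)  # this document's topic mixture
    doc = []
    for _ in range(num_words):
        words, word_probs = sample_categorical(topics, theta)  # pick a topic
        doc.append(sample_categorical(words, word_probs))  # pick a word from it
    return doc

print(generate_document(8))
```

Fitting LDA inverts this process: given only the documents, it infers the topic-word distributions and the per-document topic mixtures.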
### LDA with BERT embeddings
This technique combines LDA topic representations with BERT embeddings, which are merged and compressed via [kwx.autoencoder](https://github.com/andrewtavis/kwx/blob/main/kwx/autoencoder.py) to derive topics that leverage both approaches.
### Other Methods
The user can also choose to simply query the most common words from a text corpus or compute TFIDF ([Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) keywords: those that are unique in a text body when compared to another corpus. The former method is used in kwx as a baseline to check model efficacy, and the latter is useful when the user has another text body to compare the target corpus against.
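The TFIDF idea can be sketched in plain Python: score each word by how frequent it is in the target corpus, discounted by how many documents overall contain it. This is a simplified toy with made-up corpora and an illustrative `tfidf_keywords` helper, not kwx's implementation (kwx builds on `sklearn.feature_extraction.text.TfidfVectorizer`):

```python
import math
from collections import Counter

def tfidf_keywords(target_docs, comparison_docs, num_keywords=5):
    """Rank words frequent in the target corpus but rare across all documents."""
    all_docs = target_docs + comparison_docs
    df = Counter()  # document frequency over both corpora
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter()  # term frequency within the target corpus only
    for doc in target_docs:
        tf.update(doc.split())
    total = sum(tf.values())
    scores = {w: (c / total) * math.log(len(all_docs) / df[w]) for w, c in tf.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_keywords]

target = ["flight delayed again", "flight cancelled baggage lost"]
other = ["great pizza tonight", "pizza and a movie tonight"]
print(tfidf_keywords(target, other, num_keywords=3))
```

Words shared with the comparison corpus get a low inverse-document-frequency weight and drop out of the ranking, which is why TFIDF needs a comparison corpus while the frequency baseline does not.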
# Usage [`↩`](#jumpto)
Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. [examples.kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.
The following outlines using kwx to derive keywords from a text corpus with `prompt_remove_words` as `True` (the user will be asked if some of the extracted words need to be removed):
```python
from kwx.utils import prepare_data
from kwx.model import extract_kws

input_language = "english"  # see kwx.languages for options
num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keyword arguments for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT",  # "BERT", "LDA_BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    batch_size=32,
)
```
```
The BERT keywords are:
['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']
Are there words that should be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica
The new BERT keywords are:
['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']
Are there words that should be removed [y/n]? n
```
The model will be rerun until all words known to be unreasonable are removed and a suitable output is produced. [kwx.model.gen_files](https://github.com/andrewtavis/kwx/blob/main/kwx/model.py) can also be used as a run-all function that produces a directory containing a keyword text file and visuals (useful for experienced users wanting quick results).
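The re-run loop that `prompt_remove_words=True` drives can be sketched as follows. This is a hypothetical standalone helper, not kwx's internal code; `extract_fn` and `ask` are illustrative parameters standing in for the extraction call and interactive input:

```python
def extract_with_removal(extract_fn, ignore_words=None, max_rounds=5, ask=input):
    """Filter and re-extract keywords until the user accepts them."""
    ignore = set(ignore_words or [])
    keywords = []
    for _ in range(max_rounds):
        # Re-extract, dropping everything the user has flagged so far
        keywords = [w for w in extract_fn() if w not in ignore]
        print(f"The keywords are: {keywords}")
        if ask("Are there words that should be removed [y/n]? ").strip().lower() != "y":
            break
        removals = ask("Type or copy word(s) to be removed: ")
        ignore.update(w.strip() for w in removals.split(","))
    return keywords

# Scripted demo (stands in for interactive input)
answers = iter(["y", "united", "n"])
final_kws = extract_with_removal(
    lambda: ["flight", "united", "delay"], ask=lambda prompt: next(answers)
)
```

Because extraction is re-run each round rather than merely filtered, new keywords can surface to fill the slots freed by removed words, as in the sample session above.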
# Visuals [`↩`](#jumpto)
[kwx.visuals](https://github.com/andrewtavis/kwx/blob/main/kwx/visuals.py) includes functions for both presenting and analyzing the results of keyword extraction.
### Topic Number Evaluation