<p align="center" id="title_en"><img src="assets/logo.png" width="480"\></p>
[English](#title_en) | [中文](#title_zh)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
PyTorch implementations of Neural Topic Model varieties proposed in recent years, including NVDM-GSM, WTM-MMD (W-LDA), WTM-GMM, ETM, BATM, and GMNTM. The aim of this project is to provide practical and working examples of neural topic models to facilitate research in related fields. The configurations of the models are not exactly the same as those proposed in the papers, and the hyper-parameters have not been carefully fine-tuned, but the core ideas are covered.
Empirically, NTMs are superior to classical statistical topic models, especially on short texts. Datasets of short news ([cnews10k](#cnews10k_exp)), dialogue utterances ([zhddline](#zhddline_exp)), and conversations ([zhdd](#zhdd_exp)) are provided for evaluation purposes, all of them in Chinese. For comparison with the NTMs, an out-of-the-box LDA script based on the gensim library is also provided.
If you have any questions or suggestions about this implementation, please do not hesitate to contact me. **To make it better, you are welcome to join me.** ;)
*Note*: If the pictures in this README load slowly, you can read this [article](https://zll17.github.io/2020/11/17/Introduction-to-Neural-Topic-Models/) on my blog instead.
<h2 id="TOC_EN">Table of Contents</h2>
* [1. Installation](#Installation)
* [2. Models](#Models)
  + [2.1 NVDM-GSM](#NVDM-GSM)
  + [2.2 WTM-MMD](#WTM-MMD)
  + [2.3 WTM-GMM](#WTM-GMM)
  + [2.4 ETM](#ETM)
  + [2.5 GMNTM](#GMNTM-VaDE)
  + [2.6 BATM](#BATM)
* [3. Datasets](#Datasets)
  + [3.1 cnews10k](#cnews10k_exp)
  + [3.2 zhddline](#zhddline_exp)
  + [3.3 zhdd](#zhdd_exp)
* [4. Usage](#Usage)
  + [4.1 Preparation](#Preparation)
  + [4.2 Run](#Run)
* [5. Acknowledgement](#Acknowledgement)
<h2 id="Installation">1. Installation</h2>
```shell
$ git clone https://github.com/zll17/Neural_Topic_Models
$ cd Neural_Topic_Models/
$ sudo pip install -r requirements.txt
```
<h2 id="Models">2. Models</h2>
<h3 id="NVDM-GSM">2.1 NVDM-GSM</h3>
Original paper: _Discovering Discrete Latent Topics with Neural Variational Inference_
*Authors:* Yishu Miao, Edward Grefenstette, Phil Blunsom
<p align="center">
<img src="assets/vae_arch.png" width="720"\>
</p>
#### Description
**VAE + Gaussian Softmax**
The architecture of the model is a simple VAE that takes the BOW representation of a document as input. After sampling the latent vector **z** from the variational distribution *Q(z|x)*, the model normalizes **z** through a softmax layer, and the result is taken as the topic distribution $\theta$ in the following steps. The configuration of the encoder and decoder can be customized to suit your application.
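For intuition, below is a minimal PyTorch sketch of the sample-then-softmax step described above. It is an illustrative simplification with arbitrary layer sizes, not the exact code in [models/GSM.py](models/GSM.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGSM(nn.Module):
    """Minimal VAE with a Gaussian Softmax over the latent vector (sketch)."""
    def __init__(self, vocab_size, n_topic, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topic)       # mean of Q(z|x)
        self.logvar = nn.Linear(hidden, n_topic)   # log-variance of Q(z|x)
        self.decoder = nn.Linear(n_topic, vocab_size)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization trick
        theta = torch.softmax(z, dim=-1)   # Gaussian Softmax: z -> topic distribution theta
        recon = F.log_softmax(self.decoder(theta), dim=-1)    # reconstructed word log-probs
        return recon, mu, logvar
```

Training then combines the reconstruction term (e.g. cross-entropy against the input BOW) with the usual KL term computed from `mu` and `logvar`.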
*Explanation of some arguments:*

- `--taskname`: the name of the dataset on which to build the topic model.
- `--n_topic`: the number of topics.
- `--num_epochs`: the number of training epochs.
- `--no_below`: filter out tokens whose document frequency is below this threshold; must be an integer.
- `--no_above`: filter out tokens whose document frequency is above this threshold; given as a float, interpreted as a fraction of the total number of documents.
- `--auto_adj`: when set, `no_above` need not be specified; the model automatically filters out the 20 tokens with the highest document frequencies.
- `--bkpt_continue`: when set, the model loads the latest checkpoint file and continues training from it.
[[Paper](http://proceedings.mlr.press/v70/miao17a.html)] [[Code](models/GSM.py)]
#### Run Example
```shell
$ python3 GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 1000 --no_above 0.0134 --no_below 5 --criterion cross_entropy
```
<p align="center">
<img src="assets/GSM_cnews10k.png" width="auto"\>
</p>
<h3 id="WTM-MMD">2.2 WTM-MMD</h3>
Original paper: _Topic Modeling with Wasserstein Autoencoders_
*Authors:* Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang
<p align="center">
<img src="assets/wtm_arch.png" width="720"\>
</p>
#### Description
**WAE with Dirichlet prior + Gaussian Softmax**
The architecture is a WAE, which is essentially a plain autoencoder with an additional regularization on the latent space. Following the original [paper](https://www.aclweb.org/anthology/P19-1640/), the prior over the latent vectors **z** is a Dirichlet distribution, and the aggregated variational distribution is matched to it under the Wasserstein distance, relaxed in practice to an MMD penalty (hence the name WTM-MMD). Compared with the GSM model, this model greatly alleviates the KL-collapse problem and yields more coherent topics.
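To make the regularization concrete, here is a minimal sketch of an MMD penalty between encoded topic distributions and Dirichlet prior samples. The Gaussian (RBF) kernel and the `gamma` value are illustrative assumptions, not necessarily the choices made in [models/WTM.py](models/WTM.py):

```python
import torch

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of MMD^2 between two sample sets, with an RBF kernel."""
    k = lambda a, b: torch.exp(-gamma * torch.cdist(a, b).pow(2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

n_topic, batch = 20, 64
theta = torch.softmax(torch.randn(batch, n_topic), dim=-1)  # stand-in for encoder outputs
prior = torch.distributions.Dirichlet(torch.full((n_topic,), 0.1)).sample((batch,))
penalty = rbf_mmd2(theta, prior)  # added to the reconstruction loss during training
```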
*Explanation of some arguments:*

- `--dist`: the type of the prior distribution; set it to `dirichlet` to use the model as W-LDA.
- `--alpha`: the hyperparameter $\alpha$ of the Dirichlet distribution.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[[Paper](https://www.aclweb.org/anthology/P19-1640/)] [[Code](models/WTM.py)]
#### Run Example
```shell
$ python3 WTM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --no_above 0.013 --dist dirichlet
```
<p align="center">
<img src="assets/WTM_cnews10k.png" width="auto"\>
</p>
<h3 id="WTM-GMM">2.3 WTM-GMM</h3>
Original paper: _Research on Clustering for Subtitle Dialogue Text Based on Neural Topic Model_
*Author:* Leilan Zhang
<p align="center">
<img src="assets/wtm_gmm_arch.png" width="720"\>
</p>
#### Description
**WAE with Gaussian Mixture prior + Gaussian Softmax**
An improved model over the original W-LDA. It takes a Gaussian mixture distribution as the prior, with two evolution strategies: `gmm-std` and `gmm-ctm` (short for GMM-standard and GMM-customized, respectively). gmm-std uses a Gaussian mixture whose components have fixed means and variances, while the components of gmm-ctm are adjusted to fit the latent vectors throughout training. The number of components is usually set equal to the number of topics. Empirically, WTM-GMM usually achieves better performance, in both topic coherence and diversity, than WTM-MMD and NVDM-GSM. It also avoids the mode collapse problem that has long plagued GMNTM. I personally recommend this model.
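For illustration, below is a minimal sketch of drawing prior samples from a fixed Gaussian mixture in the spirit of `gmm-std`. Placing the component means at one-hot vectors and the `sigma` value are assumptions of this sketch, not necessarily the repository's choices:

```python
import torch

def sample_gmm_prior(batch, n_topic, sigma=0.1):
    """Sample from a Gaussian mixture with fixed components, one per topic."""
    means = torch.eye(n_topic)                    # assumption: component k centered at e_k
    comp = torch.randint(0, n_topic, (batch,))    # pick a mixture component per sample
    return means[comp] + sigma * torch.randn(batch, n_topic)

z_prior = sample_gmm_prior(64, 20)  # replaces the Dirichlet draws in the MMD penalty
```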
*Explanation of some arguments:*

- `--dist`: the type of the prior distribution; set it to `gmm-std` or `gmm-ctm` to use the corresponding variant.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[Under review] [[Code](models/WTM.py)]
#### Run Example
```shell
$ python3 WTM_run.py --taskname zhdd --n_topic 20 --num_epochs 300 --dist gmm-ctm --no_below 5 --auto_adj
```
<p align="center">
<img src="assets/WLDA-GMM_zhdd.png" width="auto"\>
</p>
<h3 id="ETM">2.4 ETM</h3>
Original paper: _Topic Modeling in Embedding Spaces_
*Authors:* Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei
<p align="center">
<img src="assets/etm_arch.png" width="720"\>
</p>
#### Description
**VAE + Gaussian Softmax + Embedding**
The architecture is a straightforward VAE in which the topic-word distribution matrix is decomposed into the product of topic vectors and word vectors; both sets of vectors are trained jointly with the topic modeling process. A noteworthy advantage of this model is improved topic interpretability, since it locates the topic vectors and the word vectors in the same embedding space. Correspondingly, the model needs more time than the others to converge to an ideal result, since it has more parameters to fit.
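The decomposition can be summarized in a few lines of PyTorch. This is an illustrative sketch (the names `rho` and `alpha` follow the paper's notation, and the sizes are arbitrary), not the exact code in [models/ETM.py](models/ETM.py):

```python
import torch

vocab_size, n_topic, emb_dim = 5000, 20, 300
rho = torch.randn(vocab_size, emb_dim, requires_grad=True)    # word vectors
alpha = torch.randn(n_topic, emb_dim, requires_grad=True)     # topic vectors

# Topic-word matrix beta factorized as the inner product of the two embeddings:
beta = torch.softmax(alpha @ rho.t(), dim=-1)                 # shape (n_topic, vocab_size)

theta = torch.softmax(torch.randn(8, n_topic), dim=-1)        # topic proportions from the encoder
doc_word_probs = theta @ beta                                 # per-document word distribution
```

Because `beta` is tied to the embeddings rather than stored as a free matrix, nearby words in the embedding space naturally receive similar topic weights.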
*Explanation of some arguments:*

- `--emb_dim`: the dimension of the topic vectors and the word vectors; defaults to 300.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[[Paper](https://arxiv.org/abs/1907.04907)] [[Code](models/ETM.py)]