<p align="center" id="title_en"><img src="assets/logo.png" width="480"\></p>
[English](#title_en) | [中文](#title_zh)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
PyTorch implementations of Neural Topic Model varieties proposed in recent years, including NVDM-GSM, WTM-MMD (W-LDA), WTM-GMM, ETM, BATM, and GMNTM. The aim of this project is to provide practical and working examples of neural topic models to facilitate research in related fields. The configurations of the models are not exactly the same as those proposed in the papers, and the hyper-parameters have not been carefully fine-tuned, but the core ideas are covered.
Empirically, NTMs are superior to classical statistical topic models, especially on short texts. Datasets of short news ([cnews10k](#cnews10k_exp)), dialogue utterances ([zhddline](#zhddline_exp)), and conversations ([zhdd](#zhdd_exp)) are provided for evaluation purposes, all of them in Chinese. For comparison with the NTMs, an out-of-the-box LDA script based on the gensim library is also provided.
If you have any questions or suggestions about this implementation, please do not hesitate to contact me. **To make it better, you are welcome to join me.** ;)
*Note*: If the pictures in this README load slowly, you can read this [article](https://zll17.github.io/2020/11/17/Introduction-to-Neural-Topic-Models/) on my blog instead.
<h2 id="TOC_EN">Table of Contents</h2>
* [1. Installation](#Installation)
* [2. Models](#Models)
  + [2.1 NVDM-GSM](#NVDM-GSM)
  + [2.2 WTM-MMD](#WTM-MMD)
  + [2.3 WTM-GMM](#WTM-GMM)
  + [2.4 ETM](#ETM)
  + [2.5 GMNTM](#GMNTM-VaDE)
  + [2.6 BATM](#BATM)
* [3. Datasets](#Datasets)
  + [3.1 cnews10k](#cnews10k_exp)
  + [3.2 zhddline](#zhddline_exp)
  + [3.3 zhdd](#zhdd_exp)
* [4. Usage](#Usage)
  + [4.1 Preparation](#Preparation)
  + [4.2 Run](#Run)
* [5. Acknowledgement](#Acknowledgement)
<h2 id="Installation">1. Installation</h2>
```shell
$ git clone https://github.com/zll17/Neural_Topic_Models
$ cd Neural_Topic_Models/
$ sudo pip install -r requirements.txt
```
<h2 id="Models">2. Models</h2>
<h3 id="NVDM-GSM">2.1 NVDM-GSM</h3>
Original paper: _Discovering Discrete Latent Topics with Neural Variational Inference_
*Authors:* Yishu Miao, Edward Grefenstette, Phil Blunsom
<p align="center">
<img src="assets/vae_arch.png" width="720"\>
</p>
#### Description
**VAE + Gaussian Softmax**
The architecture of the model is a simple VAE that takes the BOW representation of a document as input. After sampling the latent vector **z** from the variational distribution *Q(z|x)*, the model normalizes **z** through a softmax layer, and the result is taken as the topic distribution $\theta$ in the following steps. The configuration of the encoder and decoder can be customized to suit your application.
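For intuition, below is a minimal PyTorch sketch of the sample-then-softmax step described above. It is an illustrative simplification with arbitrary layer sizes, not the exact code in [models/GSM.py](models/GSM.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGSM(nn.Module):
    """Minimal VAE with a Gaussian Softmax over the latent vector (sketch)."""
    def __init__(self, vocab_size, n_topic, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topic)       # mean of Q(z|x)
        self.logvar = nn.Linear(hidden, n_topic)   # log-variance of Q(z|x)
        self.decoder = nn.Linear(n_topic, vocab_size)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization trick
        theta = torch.softmax(z, dim=-1)   # Gaussian Softmax: z -> topic distribution theta
        recon = F.log_softmax(self.decoder(theta), dim=-1)    # reconstructed word log-probs
        return recon, mu, logvar
```

Training then combines the reconstruction term (e.g. cross-entropy against the input BOW) with the usual KL term computed from `mu` and `logvar`.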
*Explanation of some arguments:*

- `--taskname`: the name of the dataset on which to build the topic model.
- `--n_topic`: the number of topics.
- `--num_epochs`: the number of training epochs.
- `--no_below`: filter out tokens whose document frequency is below this threshold; must be an integer.
- `--no_above`: filter out tokens whose document frequency is above this threshold; given as a float, interpreted as a fraction of the total number of documents.
- `--auto_adj`: when set, `no_above` need not be specified; the model automatically filters out the 20 tokens with the highest document frequencies.
- `--bkpt_continue`: when set, the model loads the latest checkpoint file and continues training from it.
[[Paper](http://proceedings.mlr.press/v70/miao17a.html)] [[Code](models/GSM.py)]
#### Run Example
```shell
$ python3 GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 1000 --no_above 0.0134 --no_below 5 --criterion cross_entropy
```
<p align="center">
<img src="assets/GSM_cnews10k.png" width="auto"\>
</p>
<h3 id="WTM-MMD">2.2 WTM-MMD</h3>
Original paper: _Topic Modeling with Wasserstein Autoencoders_
*Authors:* Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang
<p align="center">
<img src="assets/wtm_arch.png" width="720"\>
</p>
#### Description
**WAE with Dirichlet prior + Gaussian Softmax**
The architecture is a WAE, which is essentially a plain autoencoder with an additional regularization on the latent space. Following the original [paper](https://www.aclweb.org/anthology/P19-1640/), the prior over the latent vectors **z** is a Dirichlet distribution, and the aggregated variational distribution is matched to it under the Wasserstein distance, relaxed in practice to an MMD penalty (hence the name WTM-MMD). Compared with the GSM model, this model greatly alleviates the KL-collapse problem and yields more coherent topics.
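To make the regularization concrete, here is a minimal sketch of an MMD penalty between encoded topic distributions and Dirichlet prior samples. The Gaussian (RBF) kernel and the `gamma` value are illustrative assumptions, not necessarily the choices made in [models/WTM.py](models/WTM.py):

```python
import torch

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of MMD^2 between two sample sets, with an RBF kernel."""
    k = lambda a, b: torch.exp(-gamma * torch.cdist(a, b).pow(2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

n_topic, batch = 20, 64
theta = torch.softmax(torch.randn(batch, n_topic), dim=-1)  # stand-in for encoder outputs
prior = torch.distributions.Dirichlet(torch.full((n_topic,), 0.1)).sample((batch,))
penalty = rbf_mmd2(theta, prior)  # added to the reconstruction loss during training
```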
*Explanation of some arguments:*

- `--dist`: the type of the prior distribution; set it to `dirichlet` to use the model as W-LDA.
- `--alpha`: the hyperparameter $\alpha$ of the Dirichlet distribution.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[[Paper](https://www.aclweb.org/anthology/P19-1640/)] [[Code](models/WTM.py)]
#### Run Example
```shell
$ python3 WTM_run.py --taskname cnews10k --n_topic 20 --num_epochs 600 --no_above 0.013 --dist dirichlet
```
<p align="center">
<img src="assets/WTM_cnews10k.png" width="auto"\>
</p>
<h3 id="WTM-GMM">2.3 WTM-GMM</h3>
Original paper: _Research on Clustering for Subtitle Dialogue Text Based on Neural Topic Model_
*Author:* Leilan Zhang
<p align="center">
<img src="assets/wtm_gmm_arch.png" width="720"\>
</p>
#### Description
**WAE with Gaussian Mixture prior + Gaussian Softmax**
An improved model over the original W-LDA. It takes a Gaussian mixture distribution as the prior, with two evolution strategies: `gmm-std` and `gmm-ctm` (short for GMM-standard and GMM-customized, respectively). gmm-std uses a Gaussian mixture whose components have fixed means and variances, while the components of gmm-ctm are adjusted to fit the latent vectors throughout training. The number of components is usually set equal to the number of topics. Empirically, WTM-GMM usually achieves better performance, in both topic coherence and diversity, than WTM-MMD and NVDM-GSM. It also avoids the mode collapse problem that has long plagued GMNTM. I personally recommend this model.
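For illustration, below is a minimal sketch of drawing prior samples from a fixed Gaussian mixture in the spirit of `gmm-std`. Placing the component means at one-hot vectors and the `sigma` value are assumptions of this sketch, not necessarily the repository's choices:

```python
import torch

def sample_gmm_prior(batch, n_topic, sigma=0.1):
    """Sample from a Gaussian mixture with fixed components, one per topic."""
    means = torch.eye(n_topic)                    # assumption: component k centered at e_k
    comp = torch.randint(0, n_topic, (batch,))    # pick a mixture component per sample
    return means[comp] + sigma * torch.randn(batch, n_topic)

z_prior = sample_gmm_prior(64, 20)  # replaces the Dirichlet draws in the MMD penalty
```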
*Explanation of some arguments:*

- `--dist`: the type of the prior distribution; set it to `gmm-std` or `gmm-ctm` to use the corresponding variant.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[Under review] [[Code](models/WTM.py)]
#### Run Example
```shell
$ python3 WTM_run.py --taskname zhdd --n_topic 20 --num_epochs 300 --dist gmm-ctm --no_below 5 --auto_adj
```
<p align="center">
<img src="assets/WLDA-GMM_zhdd.png" width="auto"\>
</p>
<h3 id="ETM">2.4 ETM</h3>
Original paper: _Topic Modeling in Embedding Spaces_
*Authors:* Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei
<p align="center">
<img src="assets/etm_arch.png" width="720"\>
</p>
#### Description
**VAE + Gaussian Softmax + Embedding**
The architecture is a straightforward VAE in which the topic-word distribution matrix is decomposed into the product of topic vectors and word vectors; both sets of vectors are trained jointly with the topic modeling process. A noteworthy advantage of this model is improved topic interpretability, since it locates the topic vectors and the word vectors in the same embedding space. Correspondingly, the model needs more time than the others to converge to an ideal result, since it has more parameters to fit.
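The decomposition can be summarized in a few lines of PyTorch. This is an illustrative sketch (the names `rho` and `alpha` follow the paper's notation, and the sizes are arbitrary), not the exact code in [models/ETM.py](models/ETM.py):

```python
import torch

vocab_size, n_topic, emb_dim = 5000, 20, 300
rho = torch.randn(vocab_size, emb_dim, requires_grad=True)    # word vectors
alpha = torch.randn(n_topic, emb_dim, requires_grad=True)     # topic vectors

# Topic-word matrix beta factorized as the inner product of the two embeddings:
beta = torch.softmax(alpha @ rho.t(), dim=-1)                 # shape (n_topic, vocab_size)

theta = torch.softmax(torch.randn(8, n_topic), dim=-1)        # topic proportions from the encoder
doc_word_probs = theta @ beta                                 # per-document word distribution
```

Because `beta` is tied to the embeddings rather than stored as a free matrix, nearby words in the embedding space naturally receive similar topic weights.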
*Explanation of some arguments:*

- `--emb_dim`: the dimension of the topic vectors and the word vectors; defaults to 300.

For the meaning of the other arguments, see the [GSM](#NVDM-GSM) model.
[[Paper](https://arxiv.org/abs/1907.04907)] [[Code](models/ETM.py)]