通过非负矩阵分解的动态主题建模_Python_下载.zip (Dynamic Topic Modeling via Non-negative Matrix Factorization, Python, download)

20 files in total: 14 `.py`, 2 `.txt`, 1 `.gitignore`


Uploaded 2023-04-13 23:36:57 · 1.48MB ZIP · 113 views

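Non-negative Matrix Factorization (NMF) is a widely used data analysis technique, particularly in text mining, image processing, and recommender systems. This archive contains a project named `dynamic-nmf-master`, a Python code base for dynamic topic modeling.

Dynamic topic modeling extends ordinary topic modeling by tracking how the topics in a text collection change over time. Conventional topic models such as LDA (Latent Dirichlet Allocation) treat the corpus as a single collection given all at once, whereas dynamic approaches take the time stamps of the documents into account and can capture how topics evolve from one time point to the next.

NMF suits this task because it factorizes a non-negative document-term matrix into two non-negative matrices: a document-topic matrix and a topic-term matrix. The factorization shows how strongly each document is associated with each latent topic, and each topic is described by a set of related terms.

In Python, this kind of modeling usually relies on libraries such as `scikit-learn` or `gensim`. `scikit-learn` provides an NMF implementation, while `gensim` focuses on text modeling, including LDA and a dynamic topic model. Judging by its README and file list, the `dynamic-nmf-master` project likely covers the following steps:

1. **Pre-processing**: clean and transform the text (tokenization, stop-word and punctuation removal, possibly stemming or lemmatization), then build a bag-of-words or TF-IDF representation.
2. **Time windows**: group the pre-processed documents by publication time, so that each time window has its own document-term matrix.
3. **Non-negative matrix factorization**: apply NMF to the matrix of each time window to obtain a document-topic matrix and a topic-term matrix for that window.
4. **Dynamic topics**: link the topics found in the individual windows into topics that span several windows. The README describes this as a two-level approach, rather than updating each time step's topics from the previous step.
5. **Interpretation and visualization**: inspect the top-weighted terms of each topic in the topic-term matrix; trends over time can then be plotted with standard tools (for example, line charts).
6. **Evaluation**: compare models with a topic coherence score. The package implements the TC-W2V coherence measure; perplexity, which is defined for probabilistic models such as LDA, does not apply directly to NMF.

The archive also contains sample data (`data/sample.zip`), stop-word lists, and scripts for running each step and for displaying or exporting the results. It is a reasonable starting point for studying dynamic topic modeling with NMF. After extracting the files, read the README and the code for the implementation details. Some Python experience and a basic understanding of machine learning and natural language processing will help.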


通过非负矩阵分解的动态主题建模_Python_下载.zip (20 files)

- dynamic-nmf-master/
  - unsupervised/
    - `__init__.py` (0B)
    - `nmf.py` (3KB)
    - `rankings.py` (2KB)
    - `coherence.py` (1KB)
  - `prep-text.py` (4KB)
  - `create-dynamic-partition.py` (3KB)
  - `display-topics.py` (2KB)
  - data/
    - `sample.zip` (1.65MB)
  - `LICENSE` (10KB)
  - `prep-word2vec.py` (4KB)
  - text/
    - `__init__.py` (0B)
    - `stopwords_twitter.txt` (3KB)
    - `util.py` (5KB)
    - `stopwords.txt` (2KB)
  - `track-dynamic-topics.py` (5KB)
  - `find-window-topics.py` (7KB)
  - `.gitignore` (728B)
  - `find-dynamic-topics.py` (8KB)
  - `README.md` (14KB)
  - `export-csv.py` (3KB)

# dynamic-nmf: Dynamic Topic Modeling

### Summary

Standard topic modeling approaches assume the order of documents does not matter, making them unsuitable for time-stamped corpora. In contrast, *dynamic topic modeling* approaches track how language changes and topics evolve over time. We have developed a two-level approach for dynamic topic modeling via Non-negative Matrix Factorization (NMF), which links together topics identified in snapshots of text sources appearing over time.

If you make use of this implementation, please consider citing [the associated paper](https://doi.org/10.1017/pan.2016.7):

* Greene, Derek, and James P. Cross. "Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach." Political Analysis 25.1 (2017): 77-94. [[PDF]](http://derekgreene.com/papers/greene17europarl.pdf) [[BibTeX]](http://derekgreene.com/bib/greene17europarl.bib) [[Preprint]](http://arxiv.org/abs/1607.03055)

This repository contains a Python reference implementation for the approach described in the paper.

### Dependencies

Tested with Python 3.5+, and requiring the following packages, which are available via PIP:

* Required: [numpy >= 1.8.0](http://www.numpy.org/)
* Required: [scikit-learn >= 0.14](http://scikit-learn.org/stable/)
* Required for utility tools: [prettytable >= 0.7.2](https://code.google.com/p/prettytable/)
* Required for automatic model selection: [gensim >= 0.10.3](https://radimrehurek.com/gensim/)

### Basic Usage

To perform dynamic topic modeling, the input corpus of documents should consist of plain text files (one document per file), organised into two or more sub-directories. Each of these sub-directories should correspond to a unique *time window*, representing a different time interval. The names of these sub-directories are arbitrary, provided that their alphabetic ordering corresponds to their order in time (e.g. 2000, 2001, 2002; month1, month2, month3; 2010-q1, 2010-q2, 2010-q3).
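The directory convention above can be checked before any processing. The following is a minimal sketch using only the standard library; the function name `list_time_windows` and the temporary demo corpus are hypothetical and are not part of this package.

```python
from pathlib import Path


def list_time_windows(corpus_dir):
    """Map each time-window sub-directory name to its sorted .txt files.

    Windows are returned in alphabetical order of their directory names,
    which by the convention above must match their chronological order.
    """
    root = Path(corpus_dir)
    windows = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        windows[sub.name] = sorted(sub.glob("*.txt"))
    if len(windows) < 2:
        raise ValueError("expected two or more time-window sub-directories")
    return windows


if __name__ == "__main__":
    import tempfile

    # Build a throwaway corpus with three windows, created out of order.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("month2", "month1", "month3"):
            window_dir = Path(tmp) / name
            window_dir.mkdir()
            (window_dir / "doc1.txt").write_text("example text", encoding="utf-8")
        for name, files in list_time_windows(tmp).items():
            print(name, len(files))
```

Because the ordering is purely alphabetical, names such as `month2` and `month10` would sort in the wrong order; zero-padded names (`month02`, `month10`) avoid this.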
The dynamic topic modeling process consists of three steps, discussed below. The archive 'data/sample.zip' contains a sample corpus of 1,324 news articles divided into three time windows (month1, month2, month3), which is used to illustrate these steps.

##### Step 1: Pre-processing

Before applying dynamic topic modeling, the first step is to pre-process the documents from each time window (i.e. sub-directory), to produce a *document-term matrix* for each window. This involves tokenizing the documents, removing common stop-words, and building a document-term matrix for the time window. In the example below, we parse all .txt files in the sub-directories of 'data/sample'. The output files will be stored in the directory 'data'. Note that the final options below indicate that we want to apply TF-IDF term weighting and document length normalization to the documents before writing each matrix.

    python prep-text.py data/sample/month1 data/sample/month2 data/sample/month3 -o data --tfidf --norm

The result of this process will be a collection of Joblib binary files (`*.pkl` and `*.npy`) written to the directory 'data', where the prefix of each corresponds to the name of each time window (e.g. month1, month2 etc).

##### Step 2: Window Topic Modeling

Once the data has been pre-processed, the next step is to generate the *window topics*, where a topic model is created by applying NMF to the pre-processed data for each time window. For the example data, we apply it to the three months. If we want to use the same number of topics for each window (e.g. 5 topics), we can run the following, where results are written to the directory 'out':

    python find-window-topics.py data/*.pkl -k 5 -o out

When the process has completed, we can view the descriptors (i.e.
the top ranked terms) for the resulting window topics as follows:

    python display-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl

The top terms and document IDs can be exported from an NMF results file to two individual comma-separated files using 'export-csv.py'. For instance, to export the top 50 terms and document IDs for a single results file:

    python export-csv.py out/month1_windowtopics_k05.pkl -t 50

##### Step 3: Dynamic Topic Modeling

Once the window topics have been created, we combine the results for the time windows to generate *dynamic topics* that span across multiple time windows. If we want to specify a fixed number of dynamic topics (e.g. *k=5*), we can run the following, where results are written to the directory 'out':

    python find-dynamic-topics.py out/month1_windowtopics_k05.pkl out/month2_windowtopics_k05.pkl out/month3_windowtopics_k05.pkl -k 5 -o out

In this case the results will be written to 'out/dynamictopics_k05.pkl'.
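One way to picture how window topics could be combined into dynamic topics is sketched below. It assumes that the second level is itself an NMF applied to a matrix whose rows are the window topics, restricted to their top terms. The window topics, their weights, and the plain multiplicative-update solver `nmf_multiplicative` are all illustrative assumptions; this is not the code in 'find-dynamic-topics.py'.

```python
import numpy as np


def nmf_multiplicative(B, k, n_iter=500, seed=0, eps=1e-9):
    """Plain Lee-Seung multiplicative updates for B ~ U @ V (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    U = rng.random((B.shape[0], k)) + eps
    V = rng.random((k, B.shape[1])) + eps
    for _ in range(n_iter):
        V *= (U.T @ B) / (U.T @ U @ V + eps)
        U *= (B @ V.T) / (U @ V @ V.T + eps)
    return U, V


# Hypothetical window topics: (window name, {term: weight}), top terms only.
window_topics = [
    ("month1", {"election": 0.9, "labour": 0.7, "party": 0.5}),
    ("month1", {"club": 0.8, "league": 0.6, "cup": 0.4}),
    ("month2", {"election": 0.8, "party": 0.6, "minister": 0.5}),
    ("month2", {"league": 0.9, "club": 0.5, "players": 0.4}),
]

# Stack the window topics as rows over the union of their terms.
vocab = sorted({term for _, topic in window_topics for term in topic})
col = {term: j for j, term in enumerate(vocab)}
B = np.zeros((len(window_topics), len(vocab)))
for i, (_, topic) in enumerate(window_topics):
    for term, weight in topic.items():
        B[i, col[term]] = weight

# Second-level factorization: rows of V are the dynamic topics.
U, V = nmf_multiplicative(B, k=2)
assignment = U.argmax(axis=1)  # dynamic topic index for each window topic

for d in range(2):
    members = [window_topics[i][0] for i in np.flatnonzero(assignment == d)]
    descriptor = [vocab[j] for j in np.argsort(V[d])[::-1][:3]]
    print(f"D{d + 1:02d}: windows={members} top terms={descriptor}")
```

In this sketch each window topic is assigned to the dynamic topic on which it has the largest weight, so a dynamic topic can be followed through the windows that contribute to it.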
When the process has completed, we can view the dynamic topic descriptors using 'display-topics.py':

    python display-topics.py out/dynamictopics_k05.pkl

For the sample corpus, the output for the top 10 terms for 5 dynamic topics should look like:

    +------+------------+-----------+------------+--------+----------+
    | Rank | D01        | D02       | D03        | D04    | D05      |
    +------+------------+-----------+------------+--------+----------+
    | 1    | blair      | chelsea   | people     | best   | growth   |
    | 2    | labour     | game      | mobile     | band   | economy  |
    | 3    | election   | club      | users      | music  | oil      |
    | 4    | government | united    | software   | film   | sales    |
    | 5    | minister   | arsenal   | microsoft  | album  | prices   |
    | 6    | brown      | league    | technology | awards | market   |
    | 7    | party      | players   | net        | show   | bank     |
    | 8    | prime      | cup       | phone      | number | economic |
    | 9    | howard     | liverpool | computer   | award  | profits  |
    | 10   | told       | football  | security   | top    | company  |
    +------+------------+-----------+------------+--------+----------+

### Advanced Usage

The examples above involve using a manually-specified number of topics, for both window topics and dynamic topics. In cases where this number is not known in advance, a variety of strategies exist for automatically or semi-automatically choosing a number of topics. This package contains an implementation of the TC-W2V *topic coherence* measure, which can be used to compare different topic models and subsequently choose a model with a suitable number of topics. More details on TC-W2V are included in the paper:

* O'Callaghan, D., Greene, D., Carthy, J., Cunningham, P. "An Analysis of the Coherence of Descriptors in Topic Modeling." Expert Systems with Applications (ESWA), 2015.

The approach involves a number of steps, listed below. Again these steps are illustrated using the sample corpus.

##### Step 1: Build Word2Vec Model

As well as preparing the input text corpus, we also need to build a Word2Vec model from all of the documents in our corpus.
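To show why word vectors are needed, here is a minimal sketch of a coherence score in the spirit of TC-W2V, assuming it is computed as the mean pairwise cosine similarity between the vectors of a topic's top terms, averaged over all topics in a model. The two-dimensional vectors and the function names are made up for illustration; a real run would look terms up in the trained Word2Vec model, and the package's own 'unsupervised/coherence.py' may differ in detail.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def topic_coherence(top_terms, vectors):
    """Mean pairwise cosine similarity between the vectors of a topic's top terms."""
    pairs = list(combinations(top_terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    """Average topic coherence over all topics of one model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)


# Hypothetical 2-d embeddings standing in for Word2Vec vectors.
vectors = {
    "club": (1.0, 0.1), "league": (0.9, 0.2), "cup": (0.8, 0.0),
    "oil": (0.1, 1.0), "prices": (0.0, 0.9),
}
coherent = [["club", "league", "cup"]]   # terms that sit close together
mixed = [["club", "oil", "prices"]]      # terms drawn from two themes

print(model_coherence(coherent, vectors) > model_coherence(mixed, vectors))
```

A model whose topic descriptors consist of closely related terms scores higher, which is what makes the score usable for comparing models built with different numbers of topics.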
The script 'prep-word2vec.py' uses [Gensim](https://radimrehurek.com/gensim/) to build a Skipgram (SG) Word2Vec model. All of the text files in the specified sub-directories are used to build the model, which is written to the file 'out/w2v-model.bin'.

    python prep-word2vec.py data/sample -o out -m sg

##### Step 2: Window Topic Modeling

Next, we use topic coherence based on the pre-built Word2Vec model to evaluate a range of different values for the number of topics *k* for each time window. We use the same 'find-window-topics.py' script, but specify a comma-separated range of values to try *(kmin,kmax)* (e.g. 4,10 will test all numbers of topics from *k=4* to *k=10*), and also specify the path to the Word2Vec model file:

    python find-window-topics.py data/*.pkl -k 4,10 -o out -m out/w2v-model.bin -w selected.csv

The script will apply NMF to each time window for each value of *k*, writing a result file each time to the directory 'out'. The output of the above for the sample data will also include the following top 3 rec
