Training a word-vector model with word2vec on the Chinese Wikipedia corpus (.zip): Word2Vec, Wikipedia, CBOW and word segmentation resources - CSDN文库

12 files in total: 7 .py, 3 .txt, 1 .md

Tags: word2vec, Chinese Wikipedia corpus, model training

Uploaded 2023-01-27 10:57:21, 13KB ZIP, 138 views

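In natural language processing, a word embedding maps each word to a continuous vector so that semantic and syntactic relationships between words are reflected in the vector space. Word2vec is a word-vector training algorithm published by Google in 2013. It learns distributed representations of words with one of two models: Continuous Bag of Words (CBOW) or Skip-gram. This project, "Training a word-vector model with word2vec on the Chinese Wikipedia corpus", looks at how to build a word-vector model from Word2vec and the Chinese Wikipedia corpus.

1. **Word2vec overview**
   - **CBOW**: predicts the current word from its context. The vectors of the context words are averaged, and the average is used to predict the target word.
   - **Skip-gram**: the reverse of CBOW. It predicts the context words from the target word. It is slower to train but tends to give better representations for infrequent words.
2. **Chinese Wikipedia corpus**
   - Chinese Wikipedia is a large and varied source of Chinese text that covers many topics and domains, which makes it suitable for training word vectors. The corpus usually needs preprocessing: word segmentation, removal of stop words and punctuation, and handling of multi-character words.
3. **Preprocessing**
   - **Word segmentation**: Chinese text has no spaces between words, so it has to be split into words first, usually with a tool such as jieba or THULAC.
   - **Filtering**: remove words that carry little meaning, such as the stop words “的” and “和”, along with special characters.
   - **Multi-character words**: idioms and compound words should be treated as single units, for example by marking them in a specific way.
4. **Training**
   - **Build the vocabulary**: count how often each word occurs, keep the words above a frequency threshold, and replace the rest with an "unknown word" token.
   - **Initialize the word vectors**: give every word a random initial vector, typically with 100 to 300 dimensions.
   - **Optimization**: update the vectors with gradient descent (for example SGD), usually with a cross-entropy loss.
   - **Negative sampling**: for each target word, draw a fixed number of negative samples (words that are not in the context), which reduces the computation per update and speeds up training.
5. **Evaluation**
   - **Similarity and analogy tasks**: compare word vectors by cosine similarity, for example by finding the countries most similar to “中国” (China), or by completing the analogy “男人 : 女人 :: 国王 : ?” (man : woman :: king : ?).
   - **Lexical inference**: check whether the model captures semantic relations such as “北京 : 中国 :: 上海 : ?” (Beijing : China :: Shanghai : ?).
6. **Applications**
   - **Text classification**: use the word vectors as input features for tasks such as sentiment analysis and news classification.
   - **Information retrieval**: improve the precision and recall of keyword matching.
   - **Machine translation**: serve as one component of a translation system, helping to represent the source language and to generate vector representations for the target language.
7. **Notes**
   - **Hyperparameter tuning**: the window size, learning rate, number of negative samples and similar settings directly affect model quality and have to be tuned experimentally.
   - **Training time and resources**: training on a large corpus can take considerable time and compute. Distributed training or a pretrained model are possible alternatives.

With the steps above, word2vec and the Chinese Wikipedia corpus can be used to build word vectors that support downstream natural language processing tasks. The project is a hands-on exercise that helps in understanding how word vectors are trained.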
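The difference between CBOW and Skip-gram in item 1 is easiest to see in the training examples each model draws from the same sentence. The following is a minimal sketch in plain Python, not the word2vec implementation itself: the function names, the toy sentence and the fixed window are assumptions made for illustration, and real implementations add refinements that are left out here.

```python
def context_window(tokens, center, window):
    """Return the tokens within `window` positions of index `center`, excluding the center."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (all context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (target word, single context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Toy sentence, already segmented (made up for illustration)
sentence = ["我", "喜欢", "自然", "语言", "处理"]

cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
print(cbow[1])        # (['我', '自然'], '喜欢')
print(skipgram[:3])   # [('我', '喜欢'), ('喜欢', '我'), ('喜欢', '自然')]
```

CBOW produces one example per position with the whole context as input, while Skip-gram produces one example per context word, so the same sentence yields more (and smaller) Skip-gram updates.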

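The similarity and analogy checks described under evaluation come down to cosine similarity and simple vector arithmetic. Below is a hedged sketch using hand-made two-dimensional vectors rather than trained ones: the `toy_vectors` values and the helper names `cosine` and `analogy` are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Answer a : b :: c : ? with the word closest to vec(b) - vec(a) + vec(c)."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words themselves from the candidates
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Hand-made 2-d vectors, not trained: axis 0 is roughly "royalty", axis 1 roughly "gender"
toy_vectors = {
    "男人": [0.0, 1.0],
    "女人": [0.0, -1.0],
    "国王": [1.0, 1.0],
    "王后": [1.0, -1.0],
    "苹果": [-1.0, 0.1],
}

answer = analogy(toy_vectors, "男人", "女人", "国王")
print(answer)   # 王后
```

With trained vectors the same arithmetic applies, only in a few hundred dimensions and over the full vocabulary.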

Archive contents: 基于 word2vec 使用 wiki 中文语料库实现词向量训练模型.zip (12 files)

- word2vec
  - orientation.py (6KB)
  - xml2txt.py (977B)
  - word2vec.py (727B)
  - separate.py (808B)
  - positive.txt (569B)
  - LICENSE (1KB)
  - negative.txt (1KB)
  - remove.py (880B)
  - tradition2simple.py (649B)
  - README.md (9KB)
  - stopwords.txt (5KB)
  - fasttext.py (709B)

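# Training a word-vector model with word2vec on the Chinese Wikipedia corpus

I previously did some natural language processing work: analyzing online news about certain companies and judging its orientation, with the goal of helping a domestic agency supervise the companies under its jurisdiction. This is a write-up of that work. This article covers the **word-vector training stage**.

---

## Getting the data

The corpus is the Chinese Wikipedia dump, available at [https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2](https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2). A Baidu Netdisk mirror is also provided: [https://pan.baidu.com/s/1eLkybiYOE_aVxsN0pALATg](https://pan.baidu.com/s/1eLkybiYOE_aVxsN0pALATg), extraction code: hmtn. After downloading, the file looks as shown below (PyCharm screenshot) and is 1.16GB in size.

![](https://www.writebug.com/myres/static/uploads/2021/12/31/c6a623c5fefdd28f7c095943f17ae1ff.writebug)

---

## Converting the XML data to txt

The original file is compressed XML, so the first step decompresses it and converts the format. This uses WikiCorpus, the Wikipedia-handling class in the gensim library. Its get_texts method yields the articles of the original file one at a time, each as a list of tokens. A for loop reads out each article and writes it to the output file, one article per line.

![](https://www.writebug.com/myres/static/uploads/2021/12/31/6f24f1ac1780ba4f4ddfc6a8630e2110.writebug)

```
# coding: utf-8
"""
Convert the XML-format Chinese Wikipedia corpus downloaded from the web to txt.
Corpus mirror: https://pan.baidu.com/s/1eLkybiYOE_aVxsN0pALATg  extraction code: hmtn
"""
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    print('Main program started...')
    input_file_name = 'zhwiki-latest-pages-articles.xml.bz2'
    output_file_name = 'wiki.cn.txt'

    print('Reading the wiki data...')
    # dictionary={} skips building a vocabulary, which this step does not need.
    # gensim >= 4.0 removed the `lemmatize` argument; on gensim 3.x also pass lemmatize=False.
    input_file = WikiCorpus(input_file_name, dictionary={})
    print('Wiki data loaded!')

    print('Processing started...')
    count = 0
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        # get_texts() yields one article at a time as a list of tokens
        for text in input_file.get_texts():
            output_file.write(' '.join(text) + '\n')
            count += 1
            if count % 10000 == 0:
                print('Processed %d articles so far' % count)
    print('Processing finished!')
    print('Main program finished!')
```

Screenshots of the result file:

![](https://www.writebug.com/myres/static/uploads/2021/12/31/3cf0924666141c4b545e8cea076a43aa.writebug)

![](https://www.writebug.com/myres/static/uploads/2021/12/31/3ce2ff402ff1f2e04220084e990ed17d.writebug)

---

## Converting Traditional Chinese to Simplified

To make later processing easier, the next step converts all Traditional Chinese characters in the result above to Simplified Chinese. This uses another library, zhconv: calling its convert function on each line of the result is enough.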
![](https://www.writebug.com/myres/static/uploads/2021/12/31/ba3f4107fde5950eaca1d1a9b0497679.writebug)

```
# coding: utf-8
import zhconv

print('Main program started...')
input_file_name = 'wiki.cn.txt'
output_file_name = 'wiki.cn.simple.txt'

print('Conversion started...')
count = 0
# Stream line by line so the whole corpus never has to fit in memory
with open(input_file_name, 'r', encoding='utf-8') as input_file, \
        open(output_file_name, 'w', encoding='utf-8') as output_file:
    for line in input_file:
        output_file.write(zhconv.convert(line, 'zh-hans'))
        count += 1
        if count % 10000 == 0:
            print('Converted %d lines so far' % count)
print('Conversion finished!')
print('Main program finished!')
```

Result screenshots:

![](https://www.writebug.com/myres/static/uploads/2021/12/31/089de963d78651a3840f41f5601c6764.writebug)

![](https://www.writebug.com/myres/static/uploads/2021/12/31/64795521a2b3484f97ef0162ed116a00.writebug)

---

## Word segmentation

For Chinese text, word segmentation is a required step, so it comes next. This uses the well-known jieba library and its cut method.

![](https://www.writebug.com/myres/static/uploads/2021/12/31/a24caa49f8f9560067c56f72b6145961.writebug)

```
# coding: utf-8
import jieba

print('Main program started...')
input_file_name = 'wiki.cn.simple.txt'
output_file_name = 'wiki.cn.simple.separate.txt'

print('Segmentation started...')
count = 0
with open(input_file_name, 'r', encoding='utf-8') as input_file, \
        open(output_file_name, 'w', encoding='utf-8') as output_file:
    for line in input_file:
        # jieba.cut yields tokens that have to be joined back into a line.
        # jieba treats spaces and newlines as tokens too, so strip them first.
        text = line.rstrip('\n').replace(' ', '')
        output_file.write(' '.join(jieba.cut(text)) + '\n')
        count += 1
        if count % 10000 == 0:
            print('Segmented %d lines so far' % count)
print('Segmentation finished!')
print('Main program finished!')
```

Result screenshots:

![](https://www.writebug.com/myres/static/uploads/2021/12/31/f1392f7455ba53a5496da3dfbd54112a.writebug)

![](https://www.writebug.com/myres/static/uploads/2021/12/31/f431f345c6ec043d8f755b642cd5c05c.writebug)

---

## Removing non-Chinese tokens

After the steps above the result is nearly ready, but it still contains some non-Chinese tokens, so the next step removes them. A regular expression checks whether each token starts with a Chinese character, ends with a Chinese character and contains only Chinese characters in between, that is, “**^[\u4e00-\u9fa5]+$**”.
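Before the full script, it may help to see what this pattern keeps and what it drops on a handful of tokens. This is a small sketch only: the sample tokens are made up, and `keep_chinese` is a hypothetical helper name that is not part of the project.

```python
import re

# Same pattern as in the text: one or more characters, all in U+4E00..U+9FA5
CN_TOKEN = re.compile('^[\u4e00-\u9fa5]+$')


def keep_chinese(tokens):
    """Return only the tokens made up entirely of characters in the range above."""
    return [t for t in tokens if CN_TOKEN.search(t)]


# Made-up tokens resembling segmenter output
sample = ['数学', 'math', '2019', '年', 'A股', '，', '语言学']
filtered = keep_chinese(sample)
print(filtered)   # ['数学', '年', '语言学']
```

Mixed tokens such as “A股” are dropped entirely rather than trimmed, and full-width punctuation falls outside the range, so it is removed as well.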
![](https://www.writebug.com/myres/static/uploads/2021/12/31/2e2bf13621429348f9c558c4cbf13542.writebug)

```
# coding: utf-8
import re

print('Main program started...')
input_file_name = 'wiki.cn.simple.separate.txt'
output_file_name = 'wiki.txt'

# A token is kept only if every character is in the range U+4E00..U+9FA5
cn_reg = re.compile('^[\u4e00-\u9fa5]+$')

print('Filtering started...')
count = 0
with open(input_file_name, 'r', encoding='utf-8') as input_file, \
        open(output_file_name, 'w', encoding='utf-8') as output_file:
    for line in input_file:
        line_list = line.rstrip('\n').split(' ')
        line_list_new = [word for word in line_list if cn_reg.search(word)]
        output_file.write(' '.join(line_list_new) + '\n')
        count += 1
        if count % 10000 == 0:
            print('Filtered %d lines so far' % count)
print('Filtering finished!')
print('Main program finished!')
```

Result screenshots:

![](https://www.writebug.com/myres/static/uploads/2021/12/31/13f2e406b8a93b2e3cd22e345243fc0e.writebug)

![](https://www.writebug.com/myres/static/uploads/2021/12/31/5412bf66361e96fe55ef259cfea5c937.writebug)

---

## Training the word vectors

Everything above was data preprocessing for the wiki corpus. The actual word-vector training starts here.

![](https://www.writebug.com/myres/static/uploads/2021/12/31/e56a9c305da10b9e508422367d8b1bef.writebug)

```
# coding: utf-8
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == "__main__":
    print('Main program started...')
    input_file_name = 'wiki.txt'
    model_file_name = 'wiki.model'

    print('Training started...')
    # sg is left at its default of 0, so this trains the CBOW model
    model = Word2Vec(LineSentence(input_file_name),
                     vector_size=400,  # 400-dimensional word vectors (named `size` in gensim < 4.0)
                     window=5,
                     min_count=5,
                     workers=multiprocessing.cpu_count())
    print('Training finished!')

    print('Saving the model...')
    model.save(model_file_name)
    print('Model saved!')
    print('Main program finished!')
```

This step also uses the gensim library: its Word2Vec class trains the model, and the resulting word vectors are saved to disk.

![](https:
