# Word Vectors
[![Build Status](https://travis-ci.org/bmschmidt/wordVectors.svg?branch=master)](https://travis-ci.org/bmschmidt/wordVectors)
An R package for building and exploring word embedding models.
# Description
This package does three major things to make it easier to work with word2vec and other vector-space models of language.
1. [Trains word2vec models](#creating-text-vectors) using an extended version of Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only *part* of a model (rows or columns) so you can explore a model in memory-limited situations.
2. [Creates a new `VectorSpaceModel` class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods.](#vectorspacemodel-object) For example, instead of writing
> `model[rownames(model)=="king",]`,
you can write
> `model[["king"]]`,
and instead of writing
> `vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])` (whew!),
you can write
> `vectors %>% closest_to(~"king" - "man" + "woman")`.
3. [Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection](#useful-matrix-operations) with some caching that makes them much faster than the simplest implementations.
### Quick start
For a step-by-step interactive demo that includes installation and training a model on 77 historical cookbooks from Michigan State University, [see the introductory vignette](https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd).
### Credit
This includes an altered version of Tomas Mikolov's original C code for word2vec; the R wrappers around it were originally written by Jian Li, and I've only tweaked them a little. Several other users have improved that code since I posted it here.
Right now, it [does not (I don't think) install under Windows 8](https://github.com/bmschmidt/wordVectors/issues/2). Help appreciated on that thread. OS X, Windows 7, Windows 10, and Linux install perfectly well, with one or two exceptions.
It's not extremely fast, but once the data is loaded, most operations happen in suitable time for exploratory data analysis (under a second on my laptop).
For high-performance analysis of models, C or Python's numpy/gensim will likely be better than this package, in part because R doesn't have support for single-precision floats. The goal of this package is to facilitate clear code and exploratory data analysis of models.
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.
## Creating text vectors
One portion of this is an expanded version of the code from Jian Li's `word2vec` package with a few additional parameters enabled as the function `train_word2vec`.
The input must still be in a single file and pre-tokenized, but it uses the existing word2vec C code. For online data processing, I like the gensim Python implementation, but I don't plan to link that to R.
In RStudio I've noticed that training appears to hang, but if you check processor usage you'll see it is actually still running. Try it on smaller portions first, and then let it take time: the training function can take hours for tens of thousands of books.
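
As a rough illustration of what "a single file and pre-tokenized" means, the base R sketch below flattens two made-up sentences into one whitespace-separated file. The helper name `tokenize_to_file`, the lowercasing, and the punctuation stripping are assumptions for illustration, not what the package does internally; the package's own `prep_word2vec` function is the intended tool for real corpora.

```r
# Hypothetical helper (not part of wordVectors): lowercase, strip punctuation,
# and write all documents into a single whitespace-tokenized file.
tokenize_to_file <- function(texts, destination) {
  cleaned <- tolower(texts)
  cleaned <- gsub("[^[:alnum:][:space:]]+", " ", cleaned)  # punctuation to spaces
  cleaned <- gsub("[[:space:]]+", " ", trimws(cleaned))     # collapse whitespace
  writeLines(cleaned, destination)
  invisible(destination)
}

# Two invented documents standing in for a real corpus.
texts <- c("Boil the potatoes; then mash them.", "Bake the bread, and serve warm!")
corpus_file <- tempfile(fileext = ".txt")
tokenize_to_file(texts, corpus_file)

# Read the file back as a flat stream of tokens to check the result.
tokens <- scan(corpus_file, what = character(), quiet = TRUE)

# With the package installed, training would then look roughly like this
# (argument values are illustrative; a corpus this small will not train usefully):
# model <- wordVectors::train_word2vec(corpus_file, "vectors.bin", vectors = 100, threads = 2)
```

Trying the pipeline on a tiny file like this first is one way to follow the advice above about starting with smaller portions.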
## VectorSpaceModel object
The package loads the word2vec binary format with the function `read.vectors` into a new object called a "VectorSpaceModel" object. It's a light subclass of the standard R matrix object. Anything you can do with matrices, you can do with VectorSpaceModel objects.
It has a few convenience functions as well.
### Faster Access to text vectors
The rownames of a VectorSpaceModel object are presumed to be tokens in a vector space model and therefore semantically useful. The classic word2vec demonstration is that vector('king') - vector('man') + vector('woman') =~ vector('queen'). With a standard matrix, the vector on the right-hand side of the equation would be described as
```{r, include=F,show=T}
vector_set[rownames(vector_set)=="king",] - vector_set[rownames(vector_set)=="man",] + vector_set[rownames(vector_set)=="woman",]
```
In this package, you can simply access it by using the double bracket operator:
```{r, include=F,show=T}
vector_set[["king"]] - vector_set[["man"]] + vector_set[["woman"]]
```
(And in the context of the custom functions, as a formula like `~"king" - "man" + "woman"`: see below).
Since frequently an average of two vectors provides a better indication, multiple words can be collapsed into a single vector by specifying multiple labels. For example, this may provide a slightly better gender vector:
```{r}
vector_set[["king"]] - vector_set[[c("man","men")]] + vector_set[[c("woman","women")]]
```
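
As a rough picture of what collapsing several labels into one vector means, the base R sketch below averages the matching rows of an ordinary matrix. The toy matrix `toy` and the helper `average_rows` are invented for illustration; the package's `[[` method may differ in details such as the shape of the value it returns.

```r
# Toy 2-dimensional "embedding"; the values are invented for illustration.
toy <- matrix(
  c(1, 0,
    3, 2,
    0, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("man", "men", "woman"), NULL)
)

# Hypothetical helper: average all rows whose names match the given labels.
average_rows <- function(m, labels) {
  colMeans(m[rownames(m) %in% labels, , drop = FALSE])
}

# Element-wise mean of the rows named "man" and "men".
man_vector <- average_rows(toy, c("man", "men"))
```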
Sometimes you want to subset *without* averaging. You can do this by passing the argument `average=FALSE` to the subset. This is particularly useful for comparing slices of the matrix to itself in similarity operations.
```{r}
cosineSimilarity(vector_set[[c("man","men","king"),average=F]], vector_set[[c("woman","women","queen"),average=F]])
```
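
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so a comparison like the one above yields one value per pair of rows. A minimal base R sketch on plain matrices follows; `cosine_sim` is a hypothetical stand-in rather than the package's `cosineSimilarity`, and the row names are invented.

```r
# Hypothetical stand-in: cosine similarity between every row of x and every row of y.
cosine_sim <- function(x, y) {
  x_unit <- x / sqrt(rowSums(x^2))  # scale each row to length 1
  y_unit <- y / sqrt(rowSums(y^2))
  x_unit %*% t(y_unit)              # dot products of the unit vectors
}

# Invented 2-dimensional vectors pointing in easily checked directions.
a <- rbind(east = c(1, 0), northeast = c(1, 1))
b <- rbind(north = c(0, 1), east2 = c(2, 0))

sims <- cosine_sim(a, b)  # 2 x 2 matrix: rows come from a, columns from b
dists <- 1 - sims         # one minus similarity, usable as a rough distance
```

Vectors pointing the same way score 1 regardless of length, and perpendicular vectors score 0, which is why the measure suits embeddings whose magnitudes vary with word frequency.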
## A few native functions defined on the VectorSpaceModel object
The native `show` method just prints the dimensions. The native `plot` method reduces the vectors to two dimensions with t-SNE (the t-SNE package must be installed for this to work), because t-SNE is a nice way to reduce the dimensionality of vectors, **or** lets you pass `method='pca'` to lay out a full set or subset by its first two principal components.
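
The PCA option amounts to a standard two-component reduction of the rows. The base R sketch below performs that kind of reduction on random data with `prcomp` from the `stats` package; the matrix `vectors` and its row names are invented, and the package's own plotting code may centre or scale the data differently.

```r
set.seed(1)
# Invented stand-in for a model: 20 "words" in 10 dimensions.
vectors <- matrix(rnorm(200), nrow = 20,
                  dimnames = list(paste0("word", 1:20), NULL))

pca <- prcomp(vectors)   # principal components of the rows
coords <- pca$x[, 1:2]   # position of each word on the first two components

# Draw an empty frame, then label each point with its word.
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))
```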
## Useful matrix operations
One challenge of vector-space models of texts is that it takes some basic matrix multiplication functions to make them dance around in an entertaining way.
This package bundles the ones I think are the most useful.
Each takes a `VectorSpaceModel` as its first argument. Sometimes, it's appropriate for the VSM to be your entire data set; other times, it's sensible to limit it to just one or a few vectors. Where appropriate, the functions can also take vectors or matrices as inputs.
* `cosineSimilarity(VSM_1,VSM_2)` calculates the cosine similarity of every vector in one vector space model to every vector in another. This is `n^2` complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest.
* `cosineDist(VSM_1,VSM_2)` is one minus `cosineSimilarity`. It's not really a distance metric, but can be used as one for clustering and the like.
* `closest_to(VSM,vector,n)` wraps a particularly common use case for `cosineSimilarity`: finding the top `n` terms in a `VectorSpaceModel` closest to `vector`.
* `project(VSM,vector)` takes a `VectorSpaceModel` and returns the portion parallel to the vector `vector`.
* `reject(VSM,vector)` is the complement of `project`; it takes a `VectorSpaceModel` and returns the portion orthogonal to the vector `vector`. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning.
* `magnitudes` calculates the magnitude (Euclidean length) of each row vector in a VSM. This is useful in many operations.
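
Projection and rejection are the two halves of an orthogonal decomposition: the projection of a row onto a direction is its component along that direction, and the rejection is whatever is left over. A base R sketch on a plain matrix follows; `project_onto` and `reject_from` are hypothetical names used for illustration, not the package's `project` and `reject`.

```r
# Hypothetical helper: component of each row of m that lies along direction u.
project_onto <- function(m, u) {
  u <- as.numeric(u)
  coefs <- as.numeric(m %*% u) / sum(u^2)  # scalar coefficient for each row
  outer(coefs, u)                          # coefficient times u, row by row
}

# Hypothetical helper: what remains of each row once that component is removed.
reject_from <- function(m, u) m - project_onto(m, u)

m <- rbind(c(3, 4), c(1, 0))  # two invented row vectors
u <- c(1, 0)                  # direction to project onto

parallel <- project_onto(m, u)     # the parts of each row along u
orthogonal <- reject_from(m, u)    # the parts of each row perpendicular to u
row_lengths <- sqrt(rowSums(m^2))  # Euclidean length of each row
```

Adding `parallel` and `orthogonal` back together recovers `m`, which is why repeated rejection removes one distinction of meaning at a time without touching the rest of the vector.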
All of these functions place the VSM object as the first argument. This makes it easy to chain together operations using the `magrittr` package. For example, beginning with a single vector set, one could find the nearest words in the set to a version of the vector for "bank" that has been decomposed to remove any semantic similarity to the banking sector.
```{r}
library(magrittr)

# Strip the banking-related directions out of the vector for "bank"
not_that_kind_of_bank = chronam_vectors[["bank"]] %>%
  reject(chronam_vectors[["cashier"]]) %>%
  reject(chronam_vectors[["depositors"]]) %>%
  reject(chronam_vectors[["check"]])

chronam_vectors %>% clos
```