# fastText [![CircleCI](https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg)](https://circleci.com/gh/facebookresearch/fastText/tree/master)
[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.
This document shows how to use fastText from Python.
## Table of contents
* [Requirements](#requirements)
* [Installation](#installation)
* [Usage overview](#usage-overview)
   * [Word representation model](#word-representation-model)
   * [Text classification model](#text-classification-model)
   * [IMPORTANT: Preprocessing data / encoding conventions](#important-preprocessing-data-encoding-conventions)
   * [More examples](#more-examples)
* [API](#api)
   * [`train_unsupervised` parameters](#train_unsupervised-parameters)
   * [`train_supervised` parameters](#train_supervised-parameters)
   * [`model` object](#model-object)
# Requirements
[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
Since it uses C\++11 features, it requires a compiler with good C++11 support. You will need [Python](https://www.python.org/) (version 2.7 or ≥ 3.4), [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/) and [pybind11](https://github.com/pybind/pybind11).
# Installation
To install the latest release, run:
```bash
$ pip install fasttext
```
Alternatively, to get the latest development version of fastText, install it from our GitHub repository:
```bash
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install
```
# Usage overview
## Word representation model
To learn word vectors, as [described here](https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information), use the `fasttext.train_unsupervised` function like this:
```py
import fasttext
# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')
# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')
```
where `data.txt` is a training file containing utf-8 encoded text.
The returned `model` object represents your learned model, and you can use it to retrieve information.
```py
print(model.words) # list of words in dictionary
print(model['king']) # get the vector of the word 'king'
```
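A common way to use the retrieved vectors is to compare two words by cosine similarity. The sketch below is a minimal, dependency-free illustration. The helper name `cosine_similarity` and the toy vectors are hypothetical. With a trained model you would pass vectors such as `model['king']` instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors standing in for learned word vectors (illustrative values only).
vec_a = [1.0, 2.0]
vec_b = [2.0, 4.0]
print(cosine_similarity(vec_a, vec_b))
```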
### Saving and loading a model object
You can save your trained model object by calling the function `save_model`.
```py
model.save_model("model_filename.bin")
```
You can load it again later with the function `load_model`:
```py
model = fasttext.load_model("model_filename.bin")
```
For more information about word representation usage of fasttext, you can refer to our [word representations tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html).
## Text classification model
To train a text classifier using the method [described here](https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification), use the `fasttext.train_supervised` function like this:
```py
import fasttext
model = fasttext.train_supervised('data.train.txt')
```
where `data.train.txt` is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string `__label__`.
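As a rough illustration of this file format, the sketch below builds training lines from (labels, text) pairs. The helper name `format_example` and the sample labels are hypothetical. Only the `__label__` prefix comes from the description above.

```python
LABEL_PREFIX = "__label__"  # default label prefix described above

def format_example(labels, text):
    """Return one training line: prefixed labels followed by the sentence."""
    # Collapse whitespace so a stray newline cannot split one example in two.
    clean_text = " ".join(text.split())
    prefixed = [LABEL_PREFIX + label for label in labels]
    return " ".join(prefixed + [clean_text])

# Hypothetical labelled examples.
examples = [
    (["baking", "equipment"], "Which baking dish is best to bake a banana bread ?"),
    (["cleaning"], "Why not put knives in the dishwasher?"),
]
lines = [format_example(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Writing these lines to a file, one per line, gives input in the shape that `train_supervised` expects.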
Once the model is trained, we can retrieve the list of words and labels:
```py
print(model.words)
print(model.labels)
```
To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the `test` function:
```py
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
```
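To make the two metrics concrete, here is a small sketch of how precision and recall at `k` can be computed from predicted and gold labels. It illustrates the definitions and is not a copy of fastText's internal implementation. The function name and the sample labels are hypothetical.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists, one per example; gold: sets of true labels."""
    n_correct = n_predicted = n_gold = 0
    for pred_labels, gold_labels in zip(predicted, gold):
        top_k = pred_labels[:k]
        n_correct += sum(1 for label in top_k if label in gold_labels)
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = float(n_correct) / n_predicted if n_predicted else 0.0
    recall = float(n_correct) / n_gold if n_gold else 0.0
    return precision, recall

# Hypothetical predictions (best label first) and gold labels for two examples.
predicted = [["__label__baking", "__label__bread"], ["__label__knives"]]
gold = [{"__label__baking", "__label__equipment"}, {"__label__cleaning"}]
p_at_1, r_at_1 = precision_recall_at_k(predicted, gold, k=1)
print("P@1 = {:.3f}, R@1 = {:.3f}".format(p_at_1, r_at_1))
```

Precision divides the correct predictions by the number of labels predicted, while recall divides them by the number of gold labels, so recall at 1 stays low when examples carry several labels.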
We can also predict labels for a specific text :
```py
model.predict("Which baking dish is best to bake a banana bread ?")
```
By default, `predict` returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter `k`:
```py
model.predict("Which baking dish is best to bake a banana bread ?", k=3)
```
To predict labels for more than one sentence, pass a list of strings:
```py
model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
```
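The sketch below shows one way to post-process a prediction, assuming that `predict` on a single string returns a pair of parallel sequences (labels, probabilities). That return shape is an assumption here, so check it against your installed version. The helper names and the sample values are hypothetical.

```python
def strip_prefix(label, prefix="__label__"):
    """Remove the label prefix if present."""
    return label[len(prefix):] if label.startswith(prefix) else label

def to_pairs(result):
    """Turn (labels, probabilities) into a list of (name, probability) pairs."""
    labels, probabilities = result
    return [(strip_prefix(label), float(prob))
            for label, prob in zip(labels, probabilities)]

# Hand-written stand-in for the output of model.predict(text, k=2);
# the values are made up for illustration.
fake_result = (("__label__baking", "__label__equipment"), (0.7, 0.2))
pairs = to_pairs(fake_result)
print(pairs)
```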
Of course, you can also save and load a model to/from a file as [in the word representation usage](#saving-and-loading-a-model-object).
For more information about text classification usage of fasttext, you can refer to our [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html).
### Compress model files with quantization
When you save a supervised model, fastText can compress it to produce a much smaller model file at the cost of only a small loss in performance.
```py
# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)
# then display results and save the new model
# (`valid_data` stands for the path to your validation file):
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")
```
`model_filename.ftz` will have a much smaller size than `model_filename.bin`.
For further reading on quantization, you can refer to [this paragraph from our blog post](https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression).
## IMPORTANT: Preprocessing data / encoding conventions
In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.
fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
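If you would rather do the conversion in Python than with iconv, a minimal sketch looks like the following. The function name `convert_to_utf8`, the file names and the Latin-1 source encoding are all assumptions for the example. You need to know the real encoding of your own data.

```python
import io
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file from src_encoding to UTF-8, line by line."""
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
         io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Demo on a throwaway Latin-1 file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "latin1.txt")
dst = os.path.join(tmp_dir, "utf8.txt")
with io.open(src, "w", encoding="latin-1", newline="") as f:
    f.write(u"caf\u00e9 au lait\n")

convert_to_utf8(src, dst, "latin-1")
```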
fastText tokenizes (splits text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of other Unicode whitespace. We advise you to convert Unicode whitespace / word boundaries into one of the following symbols as appropriate.
* space
* tab
* vertical tab
* carriage return
* formfeed
* the null character
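The following sketch imitates this splitting rule in pure Python and shows why other Unicode whitespace should be normalized first. It illustrates the behaviour described above and is not the library's own tokenizer. The helper names are hypothetical, and mapping every other Unicode whitespace character to a plain space is just one possible choice.

```python
# The six ASCII delimiters listed above.
ASCII_DELIMS = {" ", "\t", "\v", "\r", "\f", "\0"}

def split_tokens(line):
    """Split a single line on the ASCII delimiters only."""
    tokens, current = [], []
    for ch in line:
        if ch in ASCII_DELIMS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

def normalize_whitespace(text):
    """Map other Unicode whitespace to a space; keep newlines and the delimiters."""
    return "".join(
        " " if ch.isspace() and ch != "\n" and ch not in ASCII_DELIMS else ch
        for ch in text
    )

# No-break space (U+00A0) and ideographic space (U+3000) are not delimiters.
raw = u"banana\u00a0bread\u3000recipe\tnow"
print(split_tokens(raw))
print(split_tokens(normalize_whitespace(raw)))
```

Without normalization the first three words stay glued together as one token. After normalization they are split as intended.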
The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means that if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token will not be appended.
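The chunking behaviour can be pictured with the simplified sketch below. It assumes space-separated tokens, uses `</s>` as the EOS token string, and takes the maximum line size as a parameter rather than the real constant. These are illustrative assumptions, and the exact boundary handling in the C++ code may differ. Empty lines are skipped for brevity.

```python
EOS = "</s>"  # assumed EOS token string, see the Dictionary header

def iter_chunks(text, max_line_size):
    """Yield token lists; EOS is added only where a newline ended the line."""
    segments = text.split("\n")
    for index, segment in enumerate(segments):
        ended_by_newline = index < len(segments) - 1
        tokens = [tok for tok in segment.split(" ") if tok]
        for start in range(0, len(tokens), max_line_size):
            chunk = tokens[start:start + max_line_size]
            is_last_chunk = start + max_line_size >= len(tokens)
            if is_last_chunk and ended_by_newline:
                chunk = chunk + [EOS]
            yield chunk

# Newline-terminated text: the final chunk of each line carries EOS.
with_newlines = list(iter_chunks("a b c d e\nf g\n", max_line_size=2))
# One unbroken run of text: chunks are cut by size and never get EOS.
unbroken = list(iter_chunks("a b c d e", max_line_size=2))
print(with_newlines)
print(unbroken)
```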
The length of a token is measured in UTF-8 characters, which are counted by inspecting the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
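To illustrate the counting rule, the sketch below counts every byte that is not a continuation byte (continuation bytes have the bit pattern `10xxxxxx`). It also lists character n-grams for a word wrapped in `<` and `>` boundary markers. The helper names are hypothetical, the boundary markers are an assumption based on the subword paper linked earlier, and the n-gram function is a simplification that works on Python code points rather than raw bytes.

```python
def utf8_length(token):
    """Number of UTF-8 characters: bytes whose top two bits are not 10."""
    data = bytearray(token.encode("utf-8"))
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

def char_ngrams(word, minn, maxn):
    """Character n-grams of '<word>' with lengths from minn to maxn."""
    wrapped = "<" + word + ">"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                ngrams.append(wrapped[start:start + n])
    return ngrams

word = u"na\u00efve"  # "naive" with a diaeresis: 5 characters, 6 bytes
print(utf8_length(word), len(word.encode("utf-8")))
print(char_ngrams(u"o\u00f9", 3, 3))
```

A subword length of 3 therefore means three characters, not three bytes, which matters for scripts where most characters take several bytes.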
## More examples
In order to have a better knowledge of fastText models, please consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervi