# fastText [![CircleCI](https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg)](https://circleci.com/gh/facebookresearch/fastText/tree/master)
[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.
In this document we present how to use fastText in python.
## Table of contents
* [Requirements](#requirements)
* [Installation](#installation)
* [Usage overview](#usage-overview)
* [Word representation model](#word-representation-model)
* [Text classification model](#text-classification-model)
* [IMPORTANT: Preprocessing data / encoding conventions](#important-preprocessing-data-encoding-conventions)
* [More examples](#more-examples)
* [API](#api)
* [`train_unsupervised` parameters](#train_unsupervised-parameters)
* [`train_supervised` parameters](#train_supervised-parameters)
* [`model` object](#model-object)
# Requirements
[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
Since it uses C++11 features, it requires a compiler with good C++11 support. You will need [Python](https://www.python.org/) (version 2.7 or ≥ 3.4), [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/) and [pybind11](https://github.com/pybind/pybind11).
# Installation
To install the latest release, you can do:
```bash
$ pip install fasttext
```
Or, to get the latest development version of fastText, you can install it from our GitHub repository:
```bash
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install
```
# Usage overview
## Word representation model
In order to learn word vectors, as [described here](https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information), we can use the `fasttext.train_unsupervised` function like this:
```py
import fasttext
# Skipgram model:
model = fasttext.train_unsupervised('data.txt', model='skipgram')
# or, CBOW model:
model = fasttext.train_unsupervised('data.txt', model='cbow')
```
where `data.txt` is a training file containing utf-8 encoded text.
The returned `model` object represents your learned model, and you can use it to retrieve information.
```py
print(model.words) # list of words in dictionary
print(model['king']) # get the vector of the word 'king'
```
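A common use of these vectors is measuring how similar two words are. As a minimal sketch with plain Python (the two 4-dimensional vectors below are illustrative; with a real model you would pass `model['king']` and `model['queen']`, which have `dim` components, 100 by default):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors, not real fastText output
vec_king = [0.1, 0.3, -0.2, 0.5]
vec_queen = [0.2, 0.25, -0.1, 0.45]
print(cosine_similarity(vec_king, vec_queen))
```

Values close to 1 indicate similar directions in the embedding space.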
### Saving and loading a model object
You can save your trained model object by calling the function `save_model`.
```py
model.save_model("model_filename.bin")
```
and retrieve it later with the `load_model` function:
```py
model = fasttext.load_model("model_filename.bin")
```
For more information on using fastText for word representations, you can refer to our [word representations tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html).
## Text classification model
In order to train a text classifier using the method [described here](https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification), we can use the `fasttext.train_supervised` function like this:
```py
import fasttext
model = fasttext.train_supervised('data.train.txt')
```
where `data.train.txt` is a text file containing one training sentence per line, along with its labels. By default, we assume that labels are words prefixed by the string `__label__`.
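For example, a few lines of a training file for a hypothetical sentiment task could look like this (the labels and sentences are made up for illustration):

```
__label__positive I loved this restaurant , the food was amazing
__label__negative the service was slow and the food arrived cold
__label__positive great atmosphere and friendly staff
```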
Once the model is trained, we can retrieve the list of words and labels:
```py
print(model.words)
print(model.labels)
```
To evaluate our model by computing precision and recall at 1 (P@1 and R@1) on a test set, we use the `test` function:
```py
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
```
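`test` computes these scores internally; as a sketch of what P@k and R@k mean (assuming a multi-label setting, where each example may carry several gold labels — with exactly one gold label per example, P@1 and R@1 coincide):

```python
def precision_recall_at_k(predictions, gold, k=1):
    # predictions: list of ranked label lists, one per example
    # gold: list of sets of true labels, one per example
    retrieved = 0  # total labels predicted (up to k per example)
    relevant = 0   # total gold labels
    correct = 0    # predicted labels that appear in the gold set
    for preds, labels in zip(predictions, gold):
        top_k = preds[:k]
        retrieved += len(top_k)
        relevant += len(labels)
        correct += sum(1 for p in top_k if p in labels)
    return correct / retrieved, correct / relevant

# Two examples: the first prediction is correct, the second is not
p, r = precision_recall_at_k(
    [["__label__baking"], ["__label__food"]],
    [{"__label__baking"}, {"__label__equipment", "__label__cleaning"}],
)
print(p, r)  # → 0.5 0.3333333333333333
```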
We can also predict labels for a specific text:
```py
model.predict("Which baking dish is best to bake a banana bread ?")
```
By default, `predict` returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter `k`:
```py
model.predict("Which baking dish is best to bake a banana bread ?", k=3)
```
If you want to predict more than one sentence, you can pass a list of strings:
```py
model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
```
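In recent versions of the Python bindings, `predict` returns the predicted labels together with their probabilities as a pair of parallel sequences (most probable first). A sketch of pairing them up — the values below are illustrative, not real model output:

```python
# Illustrative (labels, probabilities) return shape of `predict` with k=3
labels = ("__label__baking", "__label__food", "__label__equipment")
probs = (0.82, 0.11, 0.04)

# Pair each label with its probability, most probable first
ranked = list(zip(labels, probs))
for label, prob in ranked:
    print("{}\t{:.2f}".format(label, prob))
```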
Of course, you can also save and load a model to/from a file as [in the word representation usage](#saving-and-loading-a-model-object).
For more information on using fastText for text classification, you can refer to our [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html).
### Compress model files with quantization
When you want to save a supervised model file, fastText can compress it in order to have a much smaller model file by sacrificing only a little bit of performance.
```py
# with the previously trained `model` object, call:
model.quantize(input='data.train.txt', retrain=True)
# then display results and save the new model:
print_results(*model.test('test.txt'))
model.save_model("model_filename.ftz")
```
`model_filename.ftz` will have a much smaller size than `model_filename.bin`.
For further reading on quantization, you can refer to [this paragraph from our blog post](https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression).
## IMPORTANT: Preprocessing data / encoding conventions
In general, it is important to properly preprocess your data; in particular, our example scripts in the [root folder](https://github.com/facebookresearch/fastText) show how to do this.
fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.
* space
* tab
* vertical tab
* carriage return
* formfeed
* the null character
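Because the tokenizer only splits on these ASCII bytes, Unicode whitespace such as a no-break space (U+00A0) would otherwise end up glued to neighboring tokens. A minimal normalization sketch (the regex and the choice of replacement are one possible convention, not part of fastText itself):

```python
import re

def normalize_whitespace(text):
    # Replace any run of Unicode whitespace (except newline, which fastText
    # uses to delimit lines) with a single ASCII space, so the ASCII-only
    # tokenizer can split on it
    return re.sub(r"[^\S\n]+", " ", text)

print(normalize_whitespace("banana\u00a0bread\u2009recipe"))  # → banana bread recipe
```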
The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means that if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens, and the EOS token is not appended.
The length of a token is the number of UTF-8 characters, determined by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
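Since Python 3 strings are sequences of Unicode characters, `len` already counts characters this way. As a sketch of how character n-gram subwords are formed from a token (mirroring the scheme in the subword paper, with `<` and `>` as boundary markers; note that the real implementation also keeps the full word itself as its own token):

```python
def subwords(word, minn=3, maxn=6):
    # Wrap the word in boundary markers, then take every character n-gram
    # with length between minn and maxn (counted in Unicode characters)
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(subwords("where", minn=3, maxn=3))  # → ['<wh', 'whe', 'her', 'ere', 're>']
```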
## More examples
In order to have a better knowledge of fastText models, please consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).