fastText is a library for efficient learning of word representations and sentence classification (resource) - CSDN Library

526 files in total

html: 198

js: 145

png: 44

Natural language processing toolkit

Uploaded 2024-11-20 10:01:35 · 4.17 MB ZIP · 96 views · 2 points required


fastText is a library for efficient learning of word representations and sentence classification (526 files)

.variables_a.html.1MGQ27 0B

fasttext.cc 24KB

fasttext_pybind.cc 19KB

args.cc 16KB

autotune.cc 14KB

dictionary.cc 13KB

main.cc 13KB

fasttext_wasm.cc 11KB

loss.cc 9KB

productquantizer.cc 6KB

meter.cc 6KB

densematrix.cc 4KB

quantmatrix.cc 3KB

filter_utf8.cc 3KB

model.cc 2KB

vector.cc 2KB

utils.cc 1KB

dedup.cc 1KB

matrix.cc 494B

setup.cfg 40B

eval.cpp 3KB

doxygen.css 27KB

tabs.css 9KB

search.css 4KB

navtree.css 2KB

fasttext.css 739B

.gitignore 113B

fasttext.h 5KB

loss.h 4KB

dictionary.h 3KB

meter.h 2KB

autotune.h 2KB

args.h 2KB

densematrix.h 2KB

utils.h 2KB

model.h 2KB

productquantizer.h 2KB

quantmatrix.h 1KB

vector.h 1KB

matrix.h 972B

real.h 266B

classfasttext_1_1Model.html 64KB

classfasttext_1_1Dictionary.html 56KB

classfasttext_1_1FastText.html 52KB

classfasttext_1_1ProductQuantizer.html 42KB

fasttext_8h_source.html 37KB

model_8h_source.html 35KB

classfasttext_1_1Args.html 35KB

dictionary_8h_source.html 29KB

productquantizer_8h_source.html 29KB

args_8h_source.html 28KB

classfasttext_1_1Matrix.html 28KB

classfasttext_1_1Vector.html 25KB

classfasttext_1_1QMatrix.html 25KB

main_8cc.html 24KB

functions_func.html 22KB

qmatrix_8h_source.html 21KB

matrix_8h_source.html 20KB

namespacefasttext.html 19KB

vector_8h_source.html 18KB

functions_vars.html 18KB

classfasttext_1_1Dictionary-members.html 17KB

classfasttext_1_1Model-members.html 17KB

classfasttext_1_1FastText-members.html 16KB

classfasttext_1_1Args-members.html 13KB

classfasttext_1_1ProductQuantizer-members.html 12KB

dir_68267d1309a1af8e8297ef4c3efbcdba.html 10KB

classfasttext_1_1Matrix-members.html 9KB

files.html 9KB

classfasttext_1_1QMatrix-members.html 9KB

model_8h.html 9KB

classfasttext_1_1Vector-members.html 8KB

structfasttext_1_1Node.html 8KB

fasttext_8h.html 8KB

classes.html 8KB

structfasttext_1_1entry.html 8KB

dictionary_8h.html 7KB

annotated.html 7KB

args_8h.html 7KB

utils_8h_source.html 7KB

functions_s.html 6KB

functions_n.html 6KB

functions_p.html 6KB

namespacefasttext_1_1utils.html 6KB

globals.html 6KB

functions_m.html 6KB

utils_8cc.html 6KB

utils_8h.html 6KB

real_8h_source.html 6KB

vector_8h.html 6KB

functions_g.html 5KB

functions_c.html 5KB

functions_l.html 5KB

globals_func.html 5KB

qmatrix_8h.html 5KB

functions.html 5KB

vector_8cc.html 5KB

productquantizer_8h.html 5KB

functions_t.html 5KB

functions_d.html 5KB


# fastText [![CircleCI](https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg)](https://circleci.com/gh/facebookresearch/fastText/tree/master)

[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.

In this document we present how to use fastText in Python.

## Table of contents

* [Requirements](#requirements)
* [Installation](#installation)
* [Usage overview](#usage-overview)
   * [Word representation model](#word-representation-model)
   * [Text classification model](#text-classification-model)
   * [IMPORTANT: Preprocessing data / encoding conventions](#important-preprocessing-data-encoding-conventions)
   * [More examples](#more-examples)
* [API](#api)
   * [`train_unsupervised` parameters](#train_unsupervised-parameters)
   * [`train_supervised` parameters](#train_supervised-parameters)
   * [`model` object](#model-object)

# Requirements

[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
Since it uses C++11 features, it requires a compiler with good C++11 support. You will need [Python](https://www.python.org/) (version 2.7 or ≥ 3.4), [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/) and [pybind11](https://github.com/pybind/pybind11).

# Installation

To install the latest release, you can do:

```bash
$ pip install fasttext
```

or, to get the latest development version of fasttext, you can install from our GitHub repository:

```bash
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install
```

# Usage overview

## Word representation model

In order to learn word vectors, as [described here](https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information), we can use the `fasttext.train_unsupervised` function like this:

```py
import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')
```

where `data.txt` is a training file containing UTF-8 encoded text.

The returned `model` object represents your learned model, and you can use it to retrieve information.

```py
print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'
```

### Saving and loading a model object

You can save your trained model object by calling the function `save_model`.

```py
model.save_model("model_filename.bin")
```

and retrieve it later thanks to the function `load_model`:

```py
model = fasttext.load_model("model_filename.bin")
```

For more information about word representation usage of fasttext, you can refer to our [word representations tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html).

## Text classification model

In order to train a text classifier using the method [described here](https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification), we can use the `fasttext.train_supervised` function like this:

```py
import fasttext

model = fasttext.train_supervised('data.train.txt')
```

where `data.train.txt` is a text file containing a training sentence per line along with the labels.
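As a rough illustration of that one-example-per-line layout, the sketch below builds a single training line, assuming fastText's default `__label__` prefix. The helper name `to_fasttext_line` and the sample labels are hypothetical and not part of the fastText API; the helper only shows one way to keep each example on its own line.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_line(labels, text):
    """Format one training example as a single line (hypothetical helper)."""
    # Newlines delimit examples, so collapse every whitespace run
    # (including embedded newlines) into a single space.
    clean_text = " ".join(text.split())
    # Labels are whitespace-delimited tokens, so spaces inside a label
    # are replaced by hyphens here (an arbitrary choice for this sketch).
    tagged = [LABEL_PREFIX + label.replace(" ", "-") for label in labels]
    return " ".join(tagged + [clean_text])


line = to_fasttext_line(
    ["baking", "bread"],
    "Which baking dish is best\nto bake a banana bread ?",
)
print(line)
```

Writing one such line per example to `data.train.txt` gives a file in the shape that `train_supervised` expects.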
By default, we assume that labels are words that are prefixed by the string `__label__`.

Once the model is trained, we can retrieve the list of words and labels:

```py
print(model.words)
print(model.labels)
```

To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the `test` function:

```py
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
```

We can also predict labels for a specific text:

```py
model.predict("Which baking dish is best to bake a banana bread ?")
```

By default, `predict` returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter `k`:

```py
model.predict("Which baking dish is best to bake a banana bread ?", k=3)
```

If you want to predict more than one sentence you can pass an array of strings:

```py
model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
```

Of course, you can also save and load a model to/from a file as [in the word representation usage](#saving-and-loading-a-model-object).

For more information about text classification usage of fasttext, you can refer to our [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html).

### Compress model files with quantization

When you want to save a supervised model file, fastText can compress it to produce a much smaller model file at the cost of only a small loss in performance.

```py
# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model
# (`valid_data` is the path to a held-out validation file) :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")
```

`model_filename.ftz` will have a much smaller size than `model_filename.bin`.
For further reading on quantization, you can refer to [this paragraph from our blog post](https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression).

## IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.

fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

* space
* tab
* vertical tab
* carriage return
* formfeed
* the null character

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means that if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.
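The whitespace conversion advised above could be done along these lines before writing the training file. This is a minimal sketch, not something fastText provides: `normalize_whitespace` is a hypothetical helper, and it relies on Python's `str.isspace()` to decide what counts as whitespace, which is an assumption about what you want converted.

```python
def normalize_whitespace(text):
    """Replace Unicode whitespace with ASCII spaces, keeping "\n" as the line delimiter."""
    # Split on "\n" only: str.splitlines() would also break on other
    # separators such as form feed or U+2028.
    lines = text.split("\n")
    return "\n".join(
        # Every whitespace character inside a line becomes a plain ASCII space.
        "".join(" " if ch.isspace() else ch for ch in line)
        for line in lines
    )


# No-break space, ideographic space and em space are not ASCII separators.
sample = "banana\u00a0bread\u3000recipe\nnext\u2003line"
normalized = normalize_whitespace(sample)
print(normalized.split("\n"))
```

Each whitespace character is replaced one-for-one rather than collapsed, so character offsets within a line are preserved.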
The length of a token is the number of UTF-8 characters by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.

## More examples

In order to have a better knowledge of fastText models, please consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervi
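The token-length rule described in the preprocessing section (count characters by inspecting the leading two bits of each byte) can be illustrated with a few lines of Python. This is a sketch of the idea only, not fastText's C++ code, and `utf8_char_count` is a hypothetical name.

```python
def utf8_char_count(token):
    """Count characters by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    data = token.encode("utf-8")
    # A byte whose two leading bits are 10 continues a multi-byte sequence,
    # so only bytes that do NOT match that pattern start a new character.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


for token in ["king", "naïve", "日本語"]:
    print(token, len(token.encode("utf-8")), utf8_char_count(token))
```

The byte length and the character count differ for non-ASCII tokens, which is why subword length limits should be reasoned about in characters rather than bytes.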
