<img align="right" width="150" height="150" src="doc/cat-delft-small.jpg">
[![Build Status](https://travis-ci.org/kermitt2/delft.svg?branch=master)](https://travis-ci.org/kermitt2/delft)
[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
# DeLFT
__Work in progress!__
__DeLFT__ (**De**ep **L**earning **F**ramework for **T**ext) is a Keras framework for text processing, covering sequence labelling (e.g. named entity tagging) and text classification (e.g. comment classification). This library re-implements standard state-of-the-art Deep Learning architectures.
Observing that most open source implementations using Keras are toy examples, our motivation is to develop a framework that is efficient, scalable and usable in a production environment (with all the known limitations of Python for this purpose, of course). The benefits of DeLFT are:
* Re-implement a variety of state-of-the-art deep learning architectures for both sequence labelling and text classification problems, including the recent [ELMo](https://allennlp.org/elmo) contextualised embeddings, all usable within the same environment. For instance, this makes it possible to reproduce the performance of all recent NER systems under similar conditions, and even to improve on most of them.
* Reduce model size, in particular by removing word embeddings from the models. For instance, the toxic comment classifier went down from 230 MB with embeddings to 1.8 MB without. In practice, every DeLFT model is smaller than 2 MB, except the Ontonotes 5.0 NER model, which is 4.7 MB.
* Use a dynamic data generator, so that the training data does not need to fit entirely in memory (see the sketch after this list).
* Load and manage efficiently an unlimited volume of pre-trained embeddings: instead of loading pre-trained embeddings in memory - which is horribly slow in Python and limits the number of embeddings that can be used simultaneously - the pre-trained embeddings are compiled the first time they are accessed and stored efficiently in an LMDB database. This makes the pre-trained embeddings immediately "warm" (no load time), frees memory, and allows any number of embeddings to be used with a negligible impact on runtime when using an SSD.
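As an illustration of the data generator point above, here is a minimal sketch (not DeLFT's actual implementation) based on `keras.utils.Sequence`; the `vectorize` callable is a hypothetical placeholder for whatever turns raw texts into model input arrays:

```python
# Minimal sketch of a dynamic data generator: only one batch is vectorised
# and held in memory at a time. Not DeLFT's actual code; `vectorize` is a
# hypothetical callable turning a batch of texts into model input arrays.
import numpy as np
from keras.utils import Sequence

class TextSequence(Sequence):
    def __init__(self, texts, labels, batch_size, vectorize):
        self.texts = texts
        self.labels = labels
        self.batch_size = batch_size
        self.vectorize = vectorize

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.texts) / float(self.batch_size)))

    def __getitem__(self, idx):
        # build one batch on the fly
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.vectorize(self.texts[batch]), np.asarray(self.labels[batch])
```

Such a generator can be passed directly to `model.fit_generator(...)`, so the full training set never has to be vectorised up front.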
The medium-term goal is then to provide models with good performance (accuracy, runtime, compactness) to production stacks such as Java/Scala and C++.
DeLFT has been tested with Python 3.5, Keras 2.1 and Tensorflow 1.7+ as the backend. At this stage, we do not guarantee that DeLFT will run with other versions of these libraries or with other Keras backends. As always, GPU(s) are required for decent training times: a GeForce GTX 1050 Ti, for instance, is absolutely OK without ELMo contextual embeddings. Using ELMo was fine with a GeForce GTX 1080 Ti.
## Install
Get the github repo:
```sh
git clone https://github.com/kermitt2/delft
cd delft
```
It is advised to first set up a virtual environment to avoid falling into one of these gloomy Python dependency marshlands:
```sh
virtualenv --system-site-packages -p python3 env
source env/bin/activate
```
Install the dependencies:
```sh
pip3 install -r requirements.txt
```
DeLFT uses Tensorflow 1.7 as its backend, and will exploit your available GPU on the condition that CUDA (>=8.0) is properly installed.
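To quickly check that Tensorflow actually sees your GPU, you can use the standard Tensorflow 1.x device listing (this is plain Tensorflow, not a DeLFT command):

```python
# List the devices Tensorflow can use; a working CUDA setup should show
# an entry like '/device:GPU:0' next to '/device:CPU:0'.
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])
```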
You then need to download some pre-trained word embeddings and register their paths in the embedding registry. To use the provided models, we suggest the following:
* _glove Common Crawl_ (2.2M vocab., cased, 300 dim. vectors): [glove-840B](http://nlp.stanford.edu/data/glove.840B.300d.zip)
* _fasttext Common Crawl_ (2M vocab., cased, 300 dim. vectors): [fasttext-crawl](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip)
* _word2vec GoogleNews_ (3M vocab., cased, 300 dim. vectors): [word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)
* _fasttext_wiki_fr_ (1.1M vocab., NOT cased, 300 dim. vectors) for French: [wiki.fr](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.vec)
* _ELMo_ trained on a 5.5B word corpus (will produce 1024 dim. vectors) for English: [options](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json) and [weights](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5)
Then edit the file `embedding-registry.json` and modify the value of `path` to the location where you saved the corresponding embeddings. The embedding files must be unzipped.
```json
{
    "embeddings": [
        {
            "name": "glove-840B",
            "path": "/PATH/TO/THE/UNZIPPED/EMBEDDINGS/FILE/glove.840B.300d.txt",
            "type": "glove",
            "format": "vec",
            "lang": "en",
            "item": "word"
        },
        ...
    ]
}
```
You're ready to use DeLFT.
## Management of embeddings
The first time DeLFT starts and accesses pre-trained embeddings, these embeddings are serialised and stored in an LMDB database, a very efficient embedded database using memory-mapped files (already used in the Machine Learning world by Caffe and Torch for managing large training data). The next time these embeddings are accessed, they will be immediately available.
Our approach solves the bottleneck problem pointed out, for instance, [here](https://spenai.org/bravepineapple/faster_em/) in a much better way than quantising+compression or pruning. After being compiled and stored at first access, any volume of embedding vectors can be used immediately, without any loading, with negligible memory usage, without any accuracy loss, and with a negligible impact on runtime when using an SSD. In practice, we can exploit embeddings for a dozen languages simultaneously without any memory or runtime issues - a requirement for any ambitious industrial deployment of a neural NLP system.
For instance, with a traditional approach, `glove-840B` takes around 2 minutes to load and 4 GB of memory. Managed with LMDB, after a first compilation of around 4 minutes, `glove-840B` can be accessed immediately, takes only a couple of MB of memory, and has a negligible impact on runtime (around 1% slower) for any further command line calls.
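The lookup itself then boils down to a keyed read in the memory-mapped database. The following is a minimal sketch of the idea, not DeLFT's actual code: the database path, the key encoding and the raw float32 value format are all assumptions.

```python
# Sketch only: assumes each key is a UTF-8 token and each value the raw
# bytes of a float32 vector, stored under data/db/glove-840B.
import lmdb
import numpy as np

env = lmdb.open('data/db/glove-840B', readonly=True, lock=False)

def get_vector(word, dim=300):
    with env.begin() as txn:
        raw = txn.get(word.encode('utf-8'))
    if raw is None:
        return np.zeros(dim, dtype=np.float32)  # OOV: zero vector fallback
    return np.frombuffer(raw, dtype=np.float32)
```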
By default, the LMDB databases are stored under the subdirectory `data/db`. The size of a database is roughly equivalent to the size of the original uncompressed embeddings file. To modify this path, edit the file `embedding-registry.json` and change the value of the attribute `embedding-lmdb-path`.
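For instance, with the default location, the top of `embedding-registry.json` looks like this (the `embeddings` array is elided):

```json
{
    "embedding-lmdb-path": "data/db",
    "embeddings": [
        ...
    ]
}
```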
To get FastText .bin format support please uncomment the package `fasttextmirror==0.8.22` in `requirements.txt` or `requirements-gpu.txt` according to your system's configuration. Please note that the **.bin format is not supported on Windows platforms**. Installing the FastText .bin format support introduces the following additional dependencies:
* (gcc-4.8 or newer) or (clang-3.3 or newer)
* [Python](https://www.python.org/) version 2.7 or >=3.4
* [pybind11](https://github.com/pybind/pybind11)
While the FastText .bin format is supported by DeLFT (including the use of ngrams for OOV words), this format will be loaded entirely in memory and does not take advantage of our memory-efficient management of embeddings.
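For reference, here is how a .bin model is typically loaded with the `fastText` Python binding provided by `fasttextmirror` (a sketch under that assumption; note that the whole model ends up in memory):

```python
# Assumes the fastText Python binding installed via fasttextmirror.
from fastText import load_model

model = load_model('wiki.fr.bin')
# subword ngrams give a vector even for out-of-vocabulary words
print(model.get_word_vector('anticonstitutionnellement')[:5])
```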
> I have plenty of memory on my machine, I don't care about load time because I need to grab a coffee, and I only process one language at a time, so I am not interested in taking advantage of the LMDB embedding management!
Ok, ok, then set the `embedding-lmdb-path` value to `"None"` in the file `embedding-registry.json`, and the embeddings will be loaded in memory as immutable data, like in the usual Keras scripts.
## Sequence Labelling
### Available models
* _BidLSTM-CRF_ with words and characters input following:
[1] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. "Neural Architectures for Named Entity Recognition". Proceedings of NAACL 2016.