<h1 align="center">Service Streamer</h1>
<p align="center">
Boosting web services for your deep learning applications.
<a href="./README_zh.md">中文README</a>
</p>
<p align="center">
</p>
<p align="center">
<a href="#what-is-service-streamer-">What is Service Streamer ?</a> •
<a href="#highlights">Highlights</a> •
<a href="#installation">Installation</a> •
<a href="#develop-bert-service-in-5-minutes">Develop BERT Service in 5 Minutes</a> •
<a href="#api">API</a> •
<a href="#benchmark">Benchmark</a> •
<a href="#faq">FAQ</a> •
</p>
<h6 align="center">
<a href="https://travis-ci.org/ShannonAI/service-streamer">
<img src="https://travis-ci.org/ShannonAI/service-streamer.svg?branch=master" alt="Build status">
</a>
• Made by ShannonAI • :globe_with_meridians: <a href="http://www.shannonai.com/">http://www.shannonai.com/</a>
</h6>
<h2 align="center">What is Service Streamer ?</h2>
Deep learning models are usually fed mini-batches of data samples, which lets them exploit the parallel computing capability of GPUs. Requests arriving at a web service, however, are discrete: with a conventional looping or threaded server, the GPU handles one request at a time and sits mostly idle, and latency grows linearly as concurrent requests pile up.
ServiceStreamer is middleware for machine learning web services. It collects queued requests from users into mini-batches, significantly improving overall system performance by raising GPU utilization.
<h2 align="center">Highlights</h2>
- :hatching_chick: **Easy to use**: a few minor changes can speed up your model by 10x.
- :zap: **Fast processing speed**: low latency for online inference of machine learning models.
- :octopus: **Good scalability**: easily extends to multi-GPU scenarios to handle a large number of requests.
- :crossed_swords: **Applicability**: works with any web framework and/or deep learning framework.
<h2 align="center">Installation</h2>
Install ServiceStreamer with `pip` (requires **Python >= 3.5**):
```bash
pip install service_streamer
```
<h2 align="center">Develop BERT Service in 5 Minutes</h2>
We provide a step-by-step tutorial for bringing BERT online in 5 minutes. The resulting service processes 1400 sentences per second.
``Text Infilling`` is a natural language processing task: given a sentence with several words removed, the model predicts the missing words from the surrounding context.
``BERT`` has attracted a lot of attention in recent years and achieves state-of-the-art results on many NLP tasks. One of its pre-training objectives is the Masked Language Model (MLM): some of the input tokens are randomly masked, and the model predicts the original vocabulary id of each masked token from its context. Since MLM is essentially text infilling, it is natural to apply BERT to this task.
1. First, we define a model for the text infilling task in [bert_model.py](./example/bert_model.py). The `predict` function accepts a batch of sentences and returns the predicted word at each sentence's `[MASK]` position.
```python
class TextInfillingModel(object):
    ...


batch = ["twinkle twinkle [MASK] star.",
         "Happy birthday to [MASK].",
         "the answer to life, the [MASK], and everything."]
model = TextInfillingModel()
outputs = model.predict(batch)
print(outputs)
# ['little', 'you', 'universe']
```
**Note**: Please download the pre-trained BERT model first.
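For reference, here is a minimal sketch of what the elided `predict` method could look like, using the Hugging Face `transformers` library with a recent PyTorch. The actual [bert_model.py](./example/bert_model.py) may be implemented differently, and `model_name` here is just an illustrative default:
```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

class TextInfillingModel:
    def __init__(self, model_name="bert-base-uncased"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForMaskedLM.from_pretrained(model_name).to(self.device).eval()

    @torch.no_grad()
    def predict(self, batch):
        # Tokenize the whole batch at once so BERT does a single forward pass.
        enc = self.tokenizer(batch, return_tensors="pt", padding=True).to(self.device)
        logits = self.model(**enc).logits
        results = []
        for i in range(len(batch)):
            # Locate the [MASK] token and take the highest-scoring vocabulary entry.
            mask_pos = (enc["input_ids"][i] == self.tokenizer.mask_token_id).nonzero()[0, 0]
            predicted_id = logits[i, mask_pos].argmax().item()
            results.append(self.tokenizer.decode([predicted_id]).strip())
        return results
```
Because the whole batch goes through one forward pass, the per-sentence cost drops sharply as the batch grows, which is exactly what ServiceStreamer exploits below.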
2. Second, use [Flask](https://github.com/pallets/flask) to wrap the prediction interface into a web service: [flask_example.py](./example/flask_example.py)
```python
from flask import Flask, request, jsonify

app = Flask(__name__)
model = TextInfillingModel()

@app.route("/naive", methods=["POST"])
def naive_predict():
    inputs = request.form.getlist("s")
    outputs = model.predict(inputs)
    return jsonify(outputs)

app.run(port=5005)
```
Run [flask_example.py](./example/flask_example.py) and you will get a vanilla web server:
```bash
curl -X POST http://localhost:5005/naive -d 's=Happy birthday to [MASK].'
["you"]
```
At this point, your web server can only serve 12 requests per second. See the [benchmark](#benchmark) section for details.
3. Third, wrap the model's predict function with `service_streamer`. Three lines of code raise the throughput of the BERT service to 200+ sentences per second (a 16x speed-up).
```python
from service_streamer import ThreadedStreamer

streamer = ThreadedStreamer(model.predict, batch_size=64, max_latency=0.1)

@app.route("/stream", methods=["POST"])
def stream_predict():
    inputs = request.form.getlist("s")
    outputs = streamer.predict(inputs)
    return jsonify(outputs)

app.run(port=5005, debug=False)
```
Run [flask_example.py](./example/flask_example.py) and test the performance with [wrk](https://github.com/wg/wrk).
```bash
wrk -t 2 -c 128 -d 20s --timeout=10s -s benchmark.lua http://127.0.0.1:5005/stream
...
Requests/sec: 200.31
```
4. Finally, wrap the model with ``Streamer`` and start service workers on multiple GPUs. ``Streamer`` further accelerates inference and reaches 1000+ sentences per second (an 80x speed-up).
```python
from service_streamer import ManagedModel, Streamer

class ManagedBertModel(ManagedModel):

    def init_model(self):
        self.model = TextInfillingModel()

    def predict(self, batch):
        return self.model.predict(batch)

streamer = Streamer(ManagedBertModel, batch_size=64, max_latency=0.1,
                    worker_num=8, cuda_devices=(0, 1, 2, 3))
app.run(port=5005, debug=False)
```
Eight GPU workers are started and distributed evenly across the 4 GPUs.
<h2 align="center">API</h2>
#### Quick Start
In general, inference is faster when a batch of inputs is processed at once, since the GPU can parallelize the computation:
```python
outputs = model.predict(batch_inputs)
```
**ServiceStreamer** is middleware for machine learning web services. Requests from users are queued, scheduled into mini-batches, and forwarded to GPU workers. ServiceStreamer trades a small delay (0.1 s maximum by default) for much better overall performance by improving GPU utilization.
```python
from service_streamer import ThreadedStreamer
# Wrap the batch prediction function (model.predict) with a streamer
streamer = ThreadedStreamer(model.predict, batch_size=64, max_latency=0.1)
# Replace model.predict with streamer.predict
outputs = streamer.predict(batch_inputs)
```
Then start the web server with multiple threads (or coroutines). By adding just a few lines of code, your server can usually run about ```batch_size/batch_per_request``` times faster, often around 10x.
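The mechanism described above can be pictured as a small batching loop: each web thread enqueues its input, a background thread drains the queue until either ``batch_size`` items are collected or ``max_latency`` has elapsed, runs one batched prediction, and hands each result back to its caller. The sketch below is only a simplified illustration of that idea, not the library's implementation; the names `batching_loop` and `streamed_predict` are made up for this example:
```python
import queue
import threading
import time

def batching_loop(task_queue, predict_fn, batch_size=64, max_latency=0.1):
    """Drain queued (input, result_slot) pairs into mini-batches and run predict_fn."""
    while True:
        batch = [task_queue.get()]                 # block until one request arrives
        deadline = time.monotonic() + max_latency
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(task_queue.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [inp for inp, _ in batch]
        outputs = predict_fn(inputs)               # one batched forward pass
        for (_, result_slot), out in zip(batch, outputs):
            result_slot.put(out)                   # hand each result back to its caller

def streamed_predict(task_queue, single_input):
    # Each caller enqueues its input with a one-slot queue and blocks until it is filled.
    result_slot = queue.Queue(maxsize=1)
    task_queue.put((single_input, result_slot))
    return result_slot.get()

if __name__ == "__main__":
    tasks = queue.Queue()
    dummy_predict = lambda xs: [x.upper() for x in xs]   # stand-in for model.predict
    threading.Thread(target=batching_loop, args=(tasks, dummy_predict), daemon=True).start()
    print(streamed_predict(tasks, "happy birthday to [MASK]."))
```
The trade-off is visible in the loop: waiting up to ``max_latency`` adds a bounded delay per request, but the GPU then processes up to ``batch_size`` inputs in a single pass.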
#### Distributed GPU worker
In practice, the throughput (QPS) a web server can sustain is much higher than what a single GPU model can provide. We therefore also support one web server backed by multiple GPU worker processes.
```python
from service_streamer import Streamer
# Spawn 4 GPU worker processes
streamer = Streamer(model.predict, 64, 0.1, worker_num=4)
outputs = streamer.predict(batch)
```
``Streamer`` runs GPU workers in ``spawn``-ed subprocesses by default and communicates with them through inter-process queues, so a large number of requests can be distributed across multiple workers.
The prediction results are returned to the web server in batches, and each result is then forwarded to its corresponding HTTP response.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116 Driver Version: 390.116 |
|-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
```
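With spawned workers, the model is normally constructed inside each worker process rather than passed from the parent; that is what the ``ManagedModel`` wrapper from the tutorial is for. Below is a minimal sketch combining it with multi-GPU workers; the ``worker_num`` and ``cuda_devices`` values are illustrative:
```python
from service_streamer import ManagedModel, Streamer

class ManagedBertModel(ManagedModel):
    def init_model(self):
        # Runs inside each spawned worker after it has been assigned a GPU,
        # so every worker loads its own copy of the model weights.
        self.model = TextInfillingModel()

    def predict(self, batch):
        return self.model.predict(batch)

# 4 workers spread over GPUs 0 and 1
streamer = Streamer(ManagedBertModel, batch_size=64, max_latency=0.1,
                    worker_num=4, cuda_devices=(0, 1))
outputs = streamer.predict(["Happy birthday to [MASK]."])
```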