<h1 align="center">Service Streamer</h1>
<p align="center">
Boosting web services for your deep learning applications.
<a href="./README_zh.md">中文README</a>
</p>
<p align="center">
</p>
<p align="center">
<a href="#what-is-service-streamer-">What is Service Streamer ?</a> •
<a href="#highlights">Highlights</a> •
<a href="#installation">Installation</a> •
<a href="#develop-bert-service-in-5-minutes">Develop BERT Service in 5 Minutes</a> •
<a href="#api">API</a> •
<a href="#benchmark">Benchmark</a> •
<a href="#faq">FAQ</a> •
</p>
<h6 align="center">
<a href="https://travis-ci.org/ShannonAI/service-streamer">
<img src="https://travis-ci.org/ShannonAI/service-streamer.svg?branch=master" alt="Build status">
</a>
• Made by ShannonAI • :globe_with_meridians: <a href="http://www.shannonai.com/">http://www.shannonai.com/</a>
</h6>
<h2 align="center">What is Service Streamer ?</h2>
Deep learning models are usually fed mini-batches of data samples, which lets them exploit the parallel computing capability of GPUs. Requests arriving at a web service, however, are discrete: with a conventional looping or threaded server, the GPU handles one request at a time and sits mostly idle, and latency grows linearly as concurrent requests pile up.
ServiceStreamer is middleware for machine learning web services. It collects queued requests from users into mini-batches, significantly improving overall system performance by raising GPU utilization.
<h2 align="center">Highlights</h2>
- :hatching_chick: **Easy to use**: a few minor changes can speed up your model by 10x.
- :zap: **Fast processing speed**: low latency for online inference of machine learning models.
- :octopus: **Good scalability**: easily extends to multi-GPU scenarios to handle a large number of requests.
- :crossed_swords: **Applicability**: works with any web framework and/or deep learning framework.
<h2 align="center">Installation</h2>
Install ServiceStreamer with `pip` (requires **Python >= 3.5**):
```bash
pip install service_streamer
```
<h2 align="center">Develop BERT Service in 5 Minutes</h2>
We provide a step-by-step tutorial for bringing BERT online in 5 minutes. The resulting service processes 1400 sentences per second.
``Text Infilling`` is a natural language processing task: given a sentence with several words removed, the model predicts the missing words from the surrounding context.
``BERT`` has attracted a lot of attention in recent years and achieves state-of-the-art results on many NLP tasks. One of its pre-training objectives is the Masked Language Model (MLM): some of the input tokens are randomly masked, and the model predicts the original vocabulary id of each masked token from its context. Since MLM is essentially text infilling, it is natural to apply BERT to this task.
1. First, we define a model for the text infilling task in [bert_model.py](./example/bert_model.py). The `predict` function accepts a batch of sentences and returns the predicted word at each sentence's `[MASK]` position.
```python
class TextInfillingModel(object):
    ...


batch = ["twinkle twinkle [MASK] star.",
         "Happy birthday to [MASK].",
         "the answer to life, the [MASK], and everything."]
model = TextInfillingModel()
outputs = model.predict(batch)
print(outputs)
# ['little', 'you', 'universe']
```
**Note**: Please download the pre-trained BERT model first.
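For reference, here is a minimal sketch of what the elided `predict` method could look like, using the Hugging Face `transformers` library with a recent PyTorch. The actual [bert_model.py](./example/bert_model.py) may be implemented differently, and `model_name` here is just an illustrative default:
```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

class TextInfillingModel:
    def __init__(self, model_name="bert-base-uncased"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForMaskedLM.from_pretrained(model_name).to(self.device).eval()

    @torch.no_grad()
    def predict(self, batch):
        # Tokenize the whole batch at once so BERT does a single forward pass.
        enc = self.tokenizer(batch, return_tensors="pt", padding=True).to(self.device)
        logits = self.model(**enc).logits
        results = []
        for i in range(len(batch)):
            # Locate the [MASK] token and take the highest-scoring vocabulary entry.
            mask_pos = (enc["input_ids"][i] == self.tokenizer.mask_token_id).nonzero()[0, 0]
            predicted_id = logits[i, mask_pos].argmax().item()
            results.append(self.tokenizer.decode([predicted_id]).strip())
        return results
```
Because the whole batch goes through one forward pass, the per-sentence cost drops sharply as the batch grows, which is exactly what ServiceStreamer exploits below.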
2. Second, use [Flask](https://github.com/pallets/flask) to wrap the prediction interface into a web service: [flask_example.py](./example/flask_example.py)
```python
from flask import Flask, request, jsonify

app = Flask(__name__)
model = TextInfillingModel()

@app.route("/naive", methods=["POST"])
def naive_predict():
    inputs = request.form.getlist("s")
    outputs = model.predict(inputs)
    return jsonify(outputs)

app.run(port=5005)
```
Run [flask_example.py](./example/flask_example.py) and you will get a vanilla web server:
```bash
curl -X POST http://localhost:5005/naive -d 's=Happy birthday to [MASK].'
["you"]
```
At this point, your web server can only serve 12 requests per second. See the [benchmark](#benchmark) section for details.
3. Third, wrap the model's predict function with `service_streamer`. Three lines of code raise the throughput of the BERT service to 200+ sentences per second (a 16x speed-up).
```python
from service_streamer import ThreadedStreamer

streamer = ThreadedStreamer(model.predict, batch_size=64, max_latency=0.1)

@app.route("/stream", methods=["POST"])
def stream_predict():
    inputs = request.form.getlist("s")
    outputs = streamer.predict(inputs)
    return jsonify(outputs)

app.run(port=5005, debug=False)
```
Run [flask_example.py](./example/flask_example.py) and test the performance with [wrk](https://github.com/wg/wrk).
```bash
wrk -t 2 -c 128 -d 20s --timeout=10s -s benchmark.lua http://127.0.0.1:5005/stream
...
Requests/sec: 200.31
```
4. Finally, wrap the model with ``Streamer`` and start service workers on multiple GPUs. ``Streamer`` further accelerates inference and reaches 1000+ sentences per second (an 80x speed-up).
```python
from service_streamer import ManagedModel, Streamer

class ManagedBertModel(ManagedModel):

    def init_model(self):
        self.model = TextInfillingModel()

    def predict(self, batch):
        return self.model.predict(batch)

streamer = Streamer(ManagedBertModel, batch_size=64, max_latency=0.1,
                    worker_num=8, cuda_devices=(0, 1, 2, 3))
app.run(port=5005, debug=False)
```
Eight GPU workers are started and distributed evenly across the 4 GPUs.
<h2 align="center">API</h2>
#### Quick Start
In general, inference is faster when a batch of inputs is processed at once, since the GPU can parallelize the computation:
```python
outputs = model.predict(batch_inputs)
```
**ServiceStreamer** is middleware for machine learning web services. Requests from users are queued, scheduled into mini-batches, and forwarded to GPU workers. ServiceStreamer trades a small delay (0.1 s maximum by default) for much better overall performance by improving GPU utilization.
```python
from service_streamer import ThreadedStreamer
# Wrap the batch prediction function (model.predict) with a streamer
streamer = ThreadedStreamer(model.predict, batch_size=64, max_latency=0.1)
# Replace model.predict with streamer.predict
outputs = streamer.predict(batch_inputs)
```
Then start the web server with multiple threads (or coroutines). By adding just a few lines of code, your server can usually run about ```batch_size/batch_per_request``` times faster, often around 10x.
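The mechanism described above can be pictured as a small batching loop: each web thread enqueues its input, a background thread drains the queue until either ``batch_size`` items are collected or ``max_latency`` has elapsed, runs one batched prediction, and hands each result back to its caller. The sketch below is only a simplified illustration of that idea, not the library's implementation; the names `batching_loop` and `streamed_predict` are made up for this example:
```python
import queue
import threading
import time

def batching_loop(task_queue, predict_fn, batch_size=64, max_latency=0.1):
    """Drain queued (input, result_slot) pairs into mini-batches and run predict_fn."""
    while True:
        batch = [task_queue.get()]                 # block until one request arrives
        deadline = time.monotonic() + max_latency
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(task_queue.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [inp for inp, _ in batch]
        outputs = predict_fn(inputs)               # one batched forward pass
        for (_, result_slot), out in zip(batch, outputs):
            result_slot.put(out)                   # hand each result back to its caller

def streamed_predict(task_queue, single_input):
    # Each caller enqueues its input with a one-slot queue and blocks until it is filled.
    result_slot = queue.Queue(maxsize=1)
    task_queue.put((single_input, result_slot))
    return result_slot.get()

if __name__ == "__main__":
    tasks = queue.Queue()
    dummy_predict = lambda xs: [x.upper() for x in xs]   # stand-in for model.predict
    threading.Thread(target=batching_loop, args=(tasks, dummy_predict), daemon=True).start()
    print(streamed_predict(tasks, "happy birthday to [MASK]."))
```
The trade-off is visible in the loop: waiting up to ``max_latency`` adds a bounded delay per request, but the GPU then processes up to ``batch_size`` inputs in a single pass.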
#### Distributed GPU worker
In practice, the throughput (QPS) a web server can sustain is much higher than what a single GPU model can provide. We therefore also support one web server backed by multiple GPU worker processes.
```python
from service_streamer import Streamer
# Spawn 4 GPU worker processes
streamer = Streamer(model.predict, 64, 0.1, worker_num=4)
outputs = streamer.predict(batch)
```
``Streamer`` runs GPU workers in ``spawn``-ed subprocesses by default and communicates with them through inter-process queues, so a large number of requests can be distributed across multiple workers.
The prediction results are returned to the web server in batches, and each result is then forwarded to its corresponding HTTP response.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116 Driver Version: 390.116 |
|-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
```
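With spawned workers, the model is normally constructed inside each worker process rather than passed from the parent; that is what the ``ManagedModel`` wrapper from the tutorial is for. Below is a minimal sketch combining it with multi-GPU workers; the ``worker_num`` and ``cuda_devices`` values are illustrative:
```python
from service_streamer import ManagedModel, Streamer

class ManagedBertModel(ManagedModel):
    def init_model(self):
        # Runs inside each spawned worker after it has been assigned a GPU,
        # so every worker loads its own copy of the model weights.
        self.model = TextInfillingModel()

    def predict(self, batch):
        return self.model.predict(batch)

# 4 workers spread over GPUs 0 and 1
streamer = Streamer(ManagedBertModel, batch_size=64, max_latency=0.1,
                    worker_num=4, cuda_devices=(0, 1))
outputs = streamer.predict(["Happy birthday to [MASK]."])
```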