TensorFlow: A system for large-scale machine learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,
Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Brain
Abstract
TensorFlow is a machine learning system that operates at
large scale and in heterogeneous environments. Tensor-
Flow uses dataflow graphs to represent computation,
shared state, and the operations that mutate that state. It
maps the nodes of a dataflow graph across many machines
in a cluster, and within a machine across multiple com-
putational devices, including multicore CPUs, general-
purpose GPUs, and custom-designed ASICs known as
Tensor Processing Units (TPUs). This architecture gives
flexibility to the application developer: whereas in previ-
ous “parameter server” designs the management of shared
state is built into the system, TensorFlow enables devel-
opers to experiment with novel optimizations and train-
ing algorithms. TensorFlow supports a variety of appli-
cations, with particularly strong support for training and
inference on deep neural networks. Several Google ser-
vices use TensorFlow in production, we have released it
as an open-source project, and it has become widely used
for machine learning research. In this paper, we describe
the TensorFlow dataflow model in contrast to existing sys-
tems, and demonstrate the compelling performance that
TensorFlow achieves for several real-world applications.
1 Introduction
In recent years, machine learning has driven advances in
many different fields [3, 5, 23, 24, 30, 27, 40, 45, 48,
50, 55, 68, 69, 73, 76]. We attribute this success to the
invention of more sophisticated machine learning mod-
els [42, 51], the availability of large datasets for tack-
ling problems in these fields [10, 65], and the devel-
opment of software platforms that enable the easy use
of large amounts of computational resources for training
such models on these large datasets [14, 21].
We introduce the TensorFlow system¹ for experiment-
ing with new models, training them on large datasets, and
moving them into production. We have based TensorFlow
on years of experience with our first-generation system,
DistBelief [21], both simplifying and generalizing it to en-
able researchers to explore a wider variety of ideas with
relative ease. TensorFlow supports both large-scale train-
ing and inference: it efficiently uses hundreds of powerful
(GPU-enabled) servers for fast training, and it runs trained
models for inference in production on various platforms,
ranging from large distributed clusters in a datacenter,
down to performing inference locally on mobile devices.
At the same time, it is flexible and general enough to
support experimentation and research into new machine
learning models and system-level optimizations.
TensorFlow uses a unified dataflow graph to repre-
sent both the computation in an algorithm and the state
on which the algorithm operates. We draw inspiration
from the high-level programming models of dataflow sys-
tems [2, 22, 75], and the low-level efficiency of parame-
ter servers [14, 21, 46]. Unlike traditional dataflow sys-
tems, in which graph vertices represent functional compu-
tation on immutable data, TensorFlow allows vertices to
represent computations that own or update mutable state.
Edges carry tensors (multi-dimensional arrays) between
nodes, and TensorFlow transparently inserts the appropri-
ate communication between distributed subcomputations.
By unifying the computation and state management in a
single programming model, TensorFlow allows program-
mers to experiment with different parallelization schemes
that, for example, offload computation onto the servers
that hold the shared state to reduce the amount of network
traffic. We have also built various coordination protocols,
and achieved encouraging results with synchronous repli-
cation, echoing recent results [11, 19] that contradict the
commonly held belief that asynchronous replication is re-
quired for scalable learning [14, 21, 46].

¹ TensorFlow can be downloaded from https://github.com/tensorflow/tensorflow.
Over the past year, more than 60 teams at Google have
used TensorFlow, and we have released the system as an
open-source project. Thanks to our large community of
users we have gained experience with many different ma-
chine learning applications. In this paper, we focus on
neural network training as a challenging systems problem,
and select two representative applications from this space:
image classification and language modeling. These ap-
plications stress computational throughput and aggregate
model size respectively, and we use them both to demon-
strate the extensibility of TensorFlow, and to evaluate the
efficiency and scalability of our present implementation.
2 Background & Motivation
To make the case for developing TensorFlow, we start
by outlining the requirements for a large-scale machine
learning system (§2.1), then consider how related work
meets or does not meet those requirements (§2.2).
2.1 Requirements
Distributed execution A cluster of powerful comput-
ers can solve many machine learning problems more effi-
ciently, using more data and larger models.
Machine learning algorithms generally perform bet-
ter with more training data. For example, recent break-
throughs in image classification models have benefited
from the public ImageNet dataset, which contains 136 gi-
gabytes of digital images [65]; and language modeling has
benefited from efforts like the One Billion Word Bench-
mark [10]. The scale of these datasets motivates a data-
parallel approach to training: a distributed file system
holds the data, and a set of workers processes different
subsets of data in parallel. Data-parallelism eliminates
the I/O bottleneck for input data, and any preprocessing
operations can be applied to input records independently.
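
This data-parallel input pattern is easy to sketch. The fragment below is illustrative only, not TensorFlow code: read_records and preprocess are hypothetical helpers standing in for the distributed file system reader and the per-record preprocessing described above.

```python
# A minimal sketch of data-parallel input: each of num_workers workers
# reads a disjoint shard of the input files from a shared file system.
# read_records() and preprocess() are hypothetical stand-ins.

def shard_for_worker(all_files, worker_index, num_workers):
    """Assign every num_workers-th file to this worker."""
    return [f for i, f in enumerate(all_files) if i % num_workers == worker_index]

def input_pipeline(all_files, worker_index, num_workers):
    for path in shard_for_worker(all_files, worker_index, num_workers):
        for record in read_records(path):   # hypothetical reader
            yield preprocess(record)        # applied independently per record
```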
Effective learned models for image recognition, lan-
guage modeling, document clustering, and many other
problems have a large number of parameters. For ex-
ample, the current state-of-the-art image classification
model, ResNet, uses 2.3 million floating-point parame-
ters to classify images into one of 1000 categories [26].
The One Billion Word Benchmark has a vocabulary of
800,000 words, and it has been used to train language
models with 1.04 billion parameters [39]. A distributed
system can shard the model across many processes, to in-
crease the available network bandwidth when many work-
ers are simultaneously reading and updating the model.
A distributed system for model training must use the
network efficiently. Many scalable algorithms train a
model using mini-batch gradient descent [21, 47], where a
worker reads the current version of the model and a small
batch of input examples, calculates an update to the model
that reduces a loss function on those examples, and ap-
plies the update to the model. Mini-batch methods are
most effective when each worker uses the most current
model as a starting point, which requires a large amount
of data to be transferred to the worker with low latency.
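
The mini-batch cycle just described can be sketched as follows. This is a schematic of a worker's read-compute-apply loop, not the paper's implementation: fetch_model, grad_loss, and push_update are hypothetical stand-ins for the RPCs against shared state and the model's gradient computation.

```python
# A schematic worker loop for mini-batch gradient descent against shared
# state. fetch_model(), grad_loss(), and push_update() are hypothetical.

def worker_loop(batches, learning_rate=0.01):
    for x, y in batches:
        w = fetch_model()                 # read the current model with low latency
        g = grad_loss(w, x, y)            # update that reduces the loss on this batch
        push_update(-learning_rate * g)   # apply the update to the shared model
```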
Accelerator support Machine learning algorithms of-
ten perform expensive computations, such as matrix mul-
tiplication and multi-dimensional convolution, which are
highly parallelizable, but have many data dependencies
that require a tightly coupled implementation. The re-
cent availability of general-purpose GPUs has provided a
large number of cores that can operate on fast local mem-
ory. For example, a single NVIDIA Titan X GPU card
has 6 TFLOPS peak performance [60]. In 2012, state-of-
the-art results for different image classification tasks were
achieved using 16,000 CPU cores for three days [45], and
using two GPUs for six days [42]. Since then, GPU ven-
dors have innovated in their support for machine learning:
NVIDIA’s cuDNN library [13] for GPU-based neural net-
work training accelerates several popular image models
by 2–4× when using version R4 in place of R2 [15].
In addition to general-purpose devices, many special-
purpose accelerators for deep learning have achieved
significant performance improvements and power sav-
ings. At Google, our colleagues have built the Tensor
Processing Unit (TPU) specifically for machine learn-
ing, and it achieves an order of magnitude improve-
ment in performance-per-watt compared to alternative
state-of-the-art technology [38]. The Movidius Deep
Learning Accelerator uses a low-power Myriad 2 pro-
cessor with custom vector processing units that accel-
erate many machine learning and computer vision algo-
rithms [53]. Ovtcharov et al. have achieved significant
performance improvements and power savings for some
convolutional models using field programmable gate ar-
rays (FPGAs) [58]. Since it is difficult to predict the next
popular architecture for executing machine learning algo-
rithms, we require that TensorFlow uses a portable pro-
gramming model that can target a generic device abstrac-
tion, and allows its operations to be specialized for new
architectures as they emerge.
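
TensorFlow's public API exposes this device abstraction through symbolic device names. A minimal TF 1.x-style sketch (device names vary by machine, and '/gpu:0' assumes a GPU is present):

```python
import tensorflow as tf

# Place operations on devices by name; the same graph definition can
# target CPUs, GPUs, or other accelerators (TF 1.x graph style).
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
with tf.device('/gpu:0'):    # assumes a GPU is available on this machine
    b = tf.matmul(a, a)
```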
Training & inference support In addition to training,
scalable and high-performance inference is a requirement
for using models in production [18]. Depending on the
nature of the application, the inference may be required
to produce results with very low latency in an interactive
service, or execute on a disconnected mobile device. If
the model is large, it might require multiple servers to
participate in each inference computation, and thus re-
quire distributed computation support. Developers benefit
when they can use the same code to define a model for
both training and inference. Training and inference de-
mand similar performance, so we prefer a common well-
optimized system for both computations. Since inference
can be computationally intensive (e.g., an image classi-
fication model might perform 5 billion FLOPs per im-
age [70]), it must be possible to accelerate it with GPUs.
Extensibility Single-machine machine learning frame-
works [36, 2, 17] have extensible programming models
that enable their users to advance the state of the art with
new approaches, such as adversarial learning [25] and
deep reinforcement learning [51]. We seek a system that
provides the same ability to experiment, and also allows
users to scale up the same code to run in production. The
system must support expressive control-flow and stateful
constructs, while also satisfying our other requirements.
2.2 Related work
Single-machine frameworks Many machine learning
researchers carry out their work on a single—often GPU-
equipped—computer [41, 42], and many flexible single-
machine frameworks have emerged to support this sce-
nario. Caffe [36] is a high-performance framework for
training declaratively specified convolutional neural net-
works that runs on multicore CPUs and GPUs. Theano [2]
allows programmers to express a model as a dataflow
graph, and generates efficient compiled code for train-
ing that model. Torch [17] has an imperative program-
ming model for scientific computation (including machine
learning) that supports fine-grained control over the order
of execution and memory utilization.
While these frameworks do not satisfy our require-
ment for distributed execution, TensorFlow’s program-
ming model is close to Theano’s dataflow representation
(§3).
Batch dataflow systems Starting with MapRe-
duce [22], batch dataflow systems have been applied
to a large number of machine learning algorithms [71],
and more recent systems have focused on increasing
expressivity and performance. DryadLINQ [74] adds a
high-level query language that supports more sophisti-
cated algorithms than MapReduce. Spark [75] extends
DryadLINQ with the ability to cache previously com-
puted datasets in memory, and is therefore better suited to
iterative machine learning algorithms (such as k-means
clustering and logistic regression) when the input data fit
in memory. Dandelion extends DryadLINQ to support
generating code for GPUs [63] and FPGAs [16].
The principal limitation of a batch dataflow system is
that it requires the input data to be immutable, and all
of the subcomputations to be deterministic, so that the
system can re-execute subcomputations when machines
in the cluster fail. This feature—which is beneficial for
many conventional workloads—makes updating a ma-
chine learning model a heavy operation. For example,
the SparkNet system for training deep neural networks on
Spark takes 20 seconds to broadcast weights and collect
updates from five workers [52]. As a result, these systems
must process larger batches in each model update step,
which slows convergence [9]. We show in Subsection 6.3
that TensorFlow can train larger models on larger clusters
with step times as short as 2 seconds.
While not a batch dataflow system, Naiad [54] aug-
ments a dataflow model with streaming execution, stateful
vertices, and structured timestamps (“timely dataflow”)
that enable it to handle incremental updates and iterative
algorithms in the same computation. Naiad represents it-
eration using cyclic dataflow graphs, which together with
mutable state make it possible to implement algorithms
that require millisecond-scale latencies for coordination.
Naiad is designed for computing on sparse, discrete data,
and does not support GPU (or any other form of) acceler-
ation, but we borrow aspects of timely dataflow iteration
in Subsection 3.4.
Parameter servers Inspired by work on distributed
key-value stores, a parameter server architecture uses a set
of servers to manage shared state that is updated by a set of
data-parallel workers. Unlike a standard key-value store,
the write operation in a parameter server is specialized for
parameter updates: it is typically an associative and com-
mutative combiner, like addition-assignment (+=), that is
applied to the current parameter value and the incoming
update to produce a new parameter value.
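
A minimal sketch of this specialized write, assuming a single in-memory shard; a real parameter server adds RPC, replication, and consistency machinery on top of this core:

```python
import numpy as np

# One parameter-server shard. Writes are not overwrites: each update is
# combined into the stored value with an associative, commutative
# operator (here +=), so concurrent updates from workers commute.

class ParameterShard:
    def __init__(self):
        self.params = {}   # key -> np.ndarray

    def read(self, key):
        return self.params[key]

    def apply_update(self, key, delta):
        current = self.params.get(key, np.zeros_like(delta))
        self.params[key] = current + delta   # addition-assignment combiner
```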
Parameter servers emerged as an architecture for scal-
able topic modeling [66], and our previous system DistBe-
lief [21] showed how a similar architecture could be ap-
plied to deep neural network training. Project Adam [14]
demonstrated an efficient parameter server architecture
for training convolutional neural networks, and Li et al.’s
“Parameter Server” [46] added innovations in consistency
models, fault tolerance, and elastic rescaling. Despite ear-
lier skepticism that parameter servers would be compati-
ble with GPU acceleration [14], Cui et al. have recently
shown that GeePS [19], a parameter server specialized
for use with GPUs, can achieve speedups on modest-sized
clusters.
MXNet [12] is a recent system that uses a parameter
server to scale training, supports GPU acceleration, and
includes a flexible programming model with interfaces
for many languages. While MXNet partially fulfills our
extensibility requirements, the parameter server is “priv-
ileged” code, which makes it difficult for researchers to
customize the handling of large models (§4.2).
The parameter server architecture meets most of our
requirements, and our DistBelief [21] uses parameter
servers with a Caffe-like model definition format [36] to
great effect. We found this architecture to be insufficiently
extensible, because adding a new optimization algorithm,
or experimenting with an unconventional model archi-
tecture would require our users to modify the parameter
server implementation, which uses C++ for performance.
While some of the practitioners who use that system are
comfortable with making these changes, the majority are
accustomed to writing models in high-level languages,
such as Python and Lua, and the complexity of the high-
performance parameter server implementation is a barrier
to entry. With TensorFlow we therefore sought a high-
level programming model that allows users to customize
the code that runs in all parts of the system (§3).
3 TensorFlow execution model
TensorFlow uses a single dataflow graph to represent
all computation and state in a machine learning algo-
rithm, including the individual mathematical operations,
the parameters and their update rules, and the input pre-
processing (Figure 1). Dataflow makes the communi-
cation between subcomputations explicit, and therefore
makes it easy to execute independent computations in par-
allel, and partition the computation across multiple dis-
tributed devices. TensorFlow differs from batch
dataflow systems (§2.2) in two respects:
• The model supports multiple concurrent executions
on overlapping subgraphs of the overall graph.
• Individual vertices may have mutable state that can
be shared between different executions of the graph.
The key observation in the parameter server architec-
ture [21, 14, 46] is that mutable state is crucial when
training very large models, because it becomes possible to
make in-place updates to very large parameters, and prop-
agate those updates to parallel training steps as quickly
as possible. Dataflow with mutable state enables Tensor-
Flow to mimic the functionality of a parameter server,
but with additional flexibility, because it becomes pos-
sible to execute arbitrary dataflow subgraphs on the ma-
chines that host the shared model parameters. As a re-
sult, our users have been able to experiment with different
optimization algorithms, consistency schemes, and paral-
lelization strategies.
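
As a concrete illustration of computation and state in one graph, here is a minimal sketch in the TF 1.x graph style (API names follow the open-source release; the feed values are elided): the variable w is a stateful vertex, and running train_op executes a subgraph that reads and mutates it in place.

```python
import tensorflow as tf

# Computation and mutable state in a single dataflow graph (TF 1.x style).
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([10, 1]))   # mutable state, shared across steps

loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Each sess.run() executes a subgraph; concurrent steps can share w.
    # sess.run(train_op, feed_dict={x: ..., y: ...})
```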
3.1 Dataflow graph elements
In a TensorFlow graph, each vertex represents an atomic
unit of computation, and each edge represents the out-
put from or input to a vertex. We refer to the compu-
tation at vertices as operations, and the values that flow
along edges as tensors, because TensorFlow is designed
for mathematical computation, and uses tensors (or multi-
dimensional arrays) to represent all data in those compu-
tations.
Tensors In TensorFlow, we model all data as tensors
(dense n-dimensional arrays) with each element having
one of a small number of primitive types, such as int32,
float32, or string. Tensors naturally represent the
inputs to and results of the common mathematical oper-
ations in many machine learning algorithms: for exam-
ple, a matrix multiplication takes two 2-D tensors and
produces a 2-D tensor; and a mini-batch 2-D convolution
takes two 4-D tensors and produces another 4-D tensor.
All tensors in TensorFlow are dense. This decision en-
sures that the lowest levels of the system can have sim-
ple implementations for memory allocation and serializa-
tion, which reduces the overhead imposed by the frame-
work. To represent sparse tensors, TensorFlow offers two
alternatives: either encode the data into variable-length
string elements of a dense tensor, or use a tuple of
dense tensors (e.g., an n-D sparse tensor with m non-zero
elements could be represented as an m × n index matrix and
a length-m value vector). The size of a tensor can vary in
one or more dimensions, making it possible to represent
sparse tensors with differing numbers of elements, at the
cost of more sophisticated shape inference.
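
The tuple encoding above corresponds to the SparseTensor type in the open-source API, which pairs an index matrix with a value vector and a dense shape:

```python
import tensorflow as tf

# A 3x4 sparse matrix with m = 2 non-zero elements, encoded as a tuple of
# dense tensors: an m x n index matrix, a length-m value vector, and the
# dense shape.
st = tf.SparseTensor(indices=[[0, 1], [2, 3]],
                     values=[4.0, 5.0],
                     dense_shape=[3, 4])
```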
Operations An operation takes m ≥ 0 tensors as input,
and produces n ≥ 0 tensors as output. An operation has
a named “type” (such as Const, MatMul, or Assign)
and may have zero or more compile-time attributes that
determine its behavior. An operation can be generic and
variadic at compile-time: its attributes determine both the
expected types and arity of its inputs and outputs.
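
For example (a small sketch against the open-source Python API), the Const, MatMul, and Concat operation types are specialized by attributes that are fixed when the graph is constructed:

```python
import tensorflow as tf

# Attributes specialize generic operations at graph-construction time.
a = tf.constant([[1.0, 2.0]])            # Const: dtype and value are attributes
b = tf.matmul(a, a, transpose_b=True)    # MatMul: transpose_b is a boolean attribute
c = tf.concat([a, a, a], axis=0)         # Concat: variadic, N=3 fixed by an attribute
```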