TensorFlow: A system for large-scale machine learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,
Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Brain
Abstract
TensorFlow is a machine learning system that operates at
large scale and in heterogeneous environments. Tensor-
Flow uses dataflow graphs to represent computation,
shared state, and the operations that mutate that state. It
maps the nodes of a dataflow graph across many machines
in a cluster, and within a machine across multiple com-
putational devices, including multicore CPUs, general-
purpose GPUs, and custom-designed ASICs known as
Tensor Processing Units (TPUs). This architecture gives
flexibility to the application developer: whereas in previ-
ous “parameter server” designs the management of shared
state is built into the system, TensorFlow enables devel-
opers to experiment with novel optimizations and train-
ing algorithms. TensorFlow supports a variety of appli-
cations, with particularly strong support for training and
inference on deep neural networks. Several Google ser-
vices use TensorFlow in production, we have released it
as an open-source project, and it has become widely used
for machine learning research. In this paper, we describe
the TensorFlow dataflow model in contrast to existing sys-
tems, and demonstrate the compelling performance that
TensorFlow achieves for several real-world applications.
1 Introduction
In recent years, machine learning has driven advances in
many different fields [3, 5, 23, 24, 30, 27, 40, 45, 48,
50, 55, 68, 69, 73, 76]. We attribute this success to the
invention of more sophisticated machine learning mod-
els [42, 51], the availability of large datasets for tack-
ling problems in these fields [10, 65], and the devel-
opment of software platforms that enable the easy use
of large amounts of computational resources for training
such models on these large datasets [14, 21].
We introduce the TensorFlow system¹ for experiment-
ing with new models, training them on large datasets, and
moving them into production. We have based TensorFlow
on years of experience with our first-generation system,
DistBelief [21], both simplifying and generalizing it to en-
able researchers to explore a wider variety of ideas with
relative ease. TensorFlow supports both large-scale train-
ing and inference: it efficiently uses hundreds of powerful
(GPU-enabled) servers for fast training, and it runs trained
models for inference in production on various platforms,
ranging from large distributed clusters in a datacenter,
down to performing inference locally on mobile devices.
At the same time, it is flexible and general enough to
support experimentation and research into new machine
learning models and system-level optimizations.
TensorFlow uses a unified dataflow graph to repre-
sent both the computation in an algorithm and the state
on which the algorithm operates. We draw inspiration
from the high-level programming models of dataflow sys-
tems [2, 22, 75], and the low-level efficiency of parame-
ter servers [14, 21, 46]. Unlike traditional dataflow sys-
tems, in which graph vertices represent functional compu-
tation on immutable data, TensorFlow allows vertices to
represent computations that own or update mutable state.
Edges carry tensors (multi-dimensional arrays) between
nodes, and TensorFlow transparently inserts the appropri-
ate communication between distributed subcomputations.
By unifying the computation and state management in a
single programming model, TensorFlow allows program-
mers to experiment with different parallelization schemes
that, for example, offload computation onto the servers
that hold the shared state to reduce the amount of network
traffic. We have also built various coordination protocols,
and achieved encouraging results with synchronous repli-
cation, echoing recent results [11, 19] that contradict the
commonly held belief that asynchronous replication is re-
quired for scalable learning [14, 21, 46].

¹ TensorFlow can be downloaded from https://github.com/tensorflow/tensorflow.
Over the past year, more than 60 teams at Google have
used TensorFlow, and we have released the system as an
open-source project. Thanks to our large community of
users we have gained experience with many different ma-
chine learning applications. In this paper, we focus on
neural network training as a challenging systems problem,
and select two representative applications from this space:
image classification and language modeling. These ap-
plications stress computational throughput and aggregate
model size respectively, and we use them both to demon-
strate the extensibility of TensorFlow, and to evaluate the
efficiency and scalability of our present implementation.
2 Background & Motivation
To make the case for developing TensorFlow, we start
by outlining the requirements for a large-scale machine
learning system (§2.1), then consider how related work
meets or does not meet those requirements (§2.2).
2.1 Requirements
Distributed execution A cluster of powerful comput-
ers can solve many machine learning problems more effi-
ciently, using more data and larger models.
Machine learning algorithms generally perform bet-
ter with more training data. For example, recent break-
throughs in image classification models have benefited
from the public ImageNet dataset, which contains 136 gi-
gabytes of digital images [65]; and language modeling has
benefited from efforts like the One Billion Word Bench-
mark [10]. The scale of these datasets motivates a data-
parallel approach to training: a distributed file system
holds the data, and a set of workers processes different
subsets of data in parallel. Data-parallelism eliminates
the I/O bottleneck for input data, and any preprocessing
operations can be applied to input records independently.
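
This data-parallel input pattern is easy to sketch. The fragment below is illustrative only, not TensorFlow code: read_records and preprocess are hypothetical helpers standing in for the distributed file system reader and the per-record preprocessing described above.

```python
# A minimal sketch of data-parallel input: each of num_workers workers
# reads a disjoint shard of the input files from a shared file system.
# read_records() and preprocess() are hypothetical stand-ins.

def shard_for_worker(all_files, worker_index, num_workers):
    """Assign every num_workers-th file to this worker."""
    return [f for i, f in enumerate(all_files) if i % num_workers == worker_index]

def input_pipeline(all_files, worker_index, num_workers):
    for path in shard_for_worker(all_files, worker_index, num_workers):
        for record in read_records(path):   # hypothetical reader
            yield preprocess(record)        # applied independently per record
```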
Effective learned models for image recognition, lan-
guage modeling, document clustering, and many other
problems have a large number of parameters. For ex-
ample, the current state-of-the-art image classification
model, ResNet, uses 2.3 million floating-point parame-
ters to classify images into one of 1000 categories [26].
The One Billion Word Benchmark has a vocabulary of
800,000 words, and it has been used to train language
models with 1.04 billion parameters [39]. A distributed
system can shard the model across many processes, to in-
crease the available network bandwidth when many work-
ers are simultaneously reading and updating the model.
A distributed system for model training must use the
network efficiently. Many scalable algorithms train a
model using mini-batch gradient descent [21, 47], where a
worker reads the current version of the model and a small
batch of input examples, calculates an update to the model
that reduces a loss function on those examples, and ap-
plies the update to the model. Mini-batch methods are
most effective when each worker uses the most current
model as a starting point, which requires a large amount
of data to be transferred to the worker with low latency.
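
The mini-batch cycle just described can be sketched as follows. This is a schematic of a worker's read-compute-apply loop, not the paper's implementation: fetch_model, grad_loss, and push_update are hypothetical stand-ins for the RPCs against shared state and the model's gradient computation.

```python
# A schematic worker loop for mini-batch gradient descent against shared
# state. fetch_model(), grad_loss(), and push_update() are hypothetical.

def worker_loop(batches, learning_rate=0.01):
    for x, y in batches:
        w = fetch_model()                 # read the current model with low latency
        g = grad_loss(w, x, y)            # update that reduces the loss on this batch
        push_update(-learning_rate * g)   # apply the update to the shared model
```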
Accelerator support Machine learning algorithms of-
ten perform expensive computations, such as matrix mul-
tiplication and multi-dimensional convolution, which are
highly parallelizable, but have many data dependencies
that require a tightly coupled implementation. The re-
cent availability of general-purpose GPUs has provided a
large number of cores that can operate on fast local mem-
ory. For example, a single NVIDIA Titan X GPU card
has 6 TFLOPS peak performance [60]. In 2012, state-of-
the-art results for different image classification tasks were
achieved using 16,000 CPU cores for three days [45], and
using two GPUs for six days [42]. Since then, GPU ven-
dors have innovated in their support for machine learning:
NVIDIA’s cuDNN library [13] for GPU-based neural net-
work training accelerates several popular image models
by 2–4× when using version R4 in place of R2 [15].
In addition to general-purpose devices, many special-
purpose accelerators for deep learning have achieved
significant performance improvements and power sav-
ings. At Google, our colleagues have built the Tensor
Processing Unit (TPU) specifically for machine learn-
ing, and it achieves an order of magnitude improve-
ment in performance-per-watt compared to alternative
state-of-the-art technology [38]. The Movidius Deep
Learning Accelerator uses a low-power Myriad 2 pro-
cessor with custom vector processing units that accel-
erate many machine learning and computer vision algo-
rithms [53]. Ovtcharov et al. have achieved significant
performance improvements and power savings for some
convolutional models using field programmable gate ar-
rays (FPGAs) [58]. Since it is difficult to predict the next
popular architecture for executing machine learning algo-
rithms, we require that TensorFlow uses a portable pro-
gramming model that can target a generic device abstrac-
tion, and allows its operations to be specialized for new
architectures as they emerge.
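
TensorFlow's public API exposes this device abstraction through symbolic device names. A minimal TF 1.x-style sketch (device names vary by machine, and '/gpu:0' assumes a GPU is present):

```python
import tensorflow as tf

# Place operations on devices by name; the same graph definition can
# target CPUs, GPUs, or other accelerators (TF 1.x graph style).
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
with tf.device('/gpu:0'):    # assumes a GPU is available on this machine
    b = tf.matmul(a, a)
```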
Training & inference support In addition to training,
scalable and high-performance inference is a requirement
for using models in production [18]. Depending on the
nature of the application, the inference may be required
to produce results with very low latency in an interactive
service, or execute on a disconnected mobile device. If
the model is large, it might require multiple servers to
participate in each inference computation, and thus re-
quire distributed computation support. Developers benefit
when they can use the same code to define a model for
both training and inference. Training and inference de-
mand similar performance, so we prefer a common well-
optimized system for both computations. Since inference
can be computationally intensive (e.g., an image classi-
fication model might perform 5 billion FLOPs per im-
age [70]), it must be possible to accelerate it with GPUs.
Extensibility Single-machine machine learning frame-
works [36, 2, 17] have extensible programming models
that enable their users to advance the state of the art with
new approaches, such as adversarial learning [25] and
deep reinforcement learning [51]. We seek a system that
provides the same ability to experiment, and also allows
users to scale up the same code to run in production. The
system must support expressive control-flow and stateful
constructs, while also satisfying our other requirements.
2.2 Related work
Single-machine frameworks Many machine learning
researchers carry out their work on a single—often GPU-
equipped—computer [41, 42], and many flexible single-
machine frameworks have emerged to support this sce-
nario. Caffe [36] is a high-performance framework for
training declaratively specified convolutional neural net-
works that runs on multicore CPUs and GPUs. Theano [2]
allows programmers to express a model as a dataflow
graph, and generates efficient compiled code for train-
ing that model. Torch [17] has an imperative program-
ming model for scientific computation (including machine
learning) that supports fine-grained control over the order
of execution and memory utilization.
While these frameworks do not satisfy our require-
ment for distributed execution, TensorFlow’s program-
ming model is close to Theano’s dataflow representation
(§3).
Batch dataflow systems Starting with MapRe-
duce [22], batch dataflow systems have been applied
to a large number of machine learning algorithms [71],
and more recent systems have focused on increasing
expressivity and performance. DryadLINQ [74] adds a
high-level query language that supports more sophisti-
cated algorithms than MapReduce. Spark [75] extends
DryadLINQ with the ability to cache previously com-
puted datasets in memory, and is therefore better suited to
iterative machine learning algorithms (such as k-means
clustering and logistic regression) when the input data fit
in memory. Dandelion extends DryadLINQ to support
generating code for GPUs [63] and FPGAs [16].
The principal limitation of a batch dataflow system is
that it requires the input data to be immutable, and all
of the subcomputations to be deterministic, so that the
system can re-execute subcomputations when machines
in the cluster fail. This feature—which is beneficial for
many conventional workloads—makes updating a ma-
chine learning model a heavy operation. For example,
the SparkNet system for training deep neural networks on
Spark takes 20 seconds to broadcast weights and collect
updates from five workers [52]. As a result, these systems
must process larger batches in each model update step,
which slows convergence [9]. We show in Subsection 6.3
that TensorFlow can train larger models on larger clusters
with step times as short as 2 seconds.
While not a batch dataflow system, Naiad [54] aug-
ments a dataflow model with streaming execution, stateful
vertices, and structured timestamps (“timely dataflow”)
that enable it to handle incremental updates and iterative
algorithms in the same computation. Naiad represents it-
eration using cyclic dataflow graphs, which together with
mutable state make it possible to implement algorithms
that require millisecond-scale latencies for coordination.
Naiad is designed for computing on sparse, discrete data,
and does not support GPU (or any other form of) acceler-
ation, but we borrow aspects of timely dataflow iteration
in Subsection 3.4.
Parameter servers Inspired by work on distributed
key-value stores, a parameter server architecture uses a set
of servers to manage shared state that is updated by a set of
data-parallel workers. Unlike a standard key-value store,
the write operation in a parameter server is specialized for
parameter updates: it is typically an associative and com-
mutative combiner, like addition-assignment (+=), that is
applied to the current parameter value and the incoming
update to produce a new parameter value.
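
A minimal sketch of this specialized write, assuming a single in-memory shard; a real parameter server adds RPC, replication, and consistency machinery on top of this core:

```python
import numpy as np

# One parameter-server shard. Writes are not overwrites: each update is
# combined into the stored value with an associative, commutative
# operator (here +=), so concurrent updates from workers commute.

class ParameterShard:
    def __init__(self):
        self.params = {}   # key -> np.ndarray

    def read(self, key):
        return self.params[key]

    def apply_update(self, key, delta):
        current = self.params.get(key, np.zeros_like(delta))
        self.params[key] = current + delta   # addition-assignment combiner
```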
Parameter servers emerged as an architecture for scal-
able topic modeling [66], and our previous system DistBe-
lief [21] showed how a similar architecture could be ap-
plied to deep neural network training. Project Adam [14]
demonstrated an efficient parameter server architecture
for training convolutional neural networks, and Li et al.’s
“Parameter Server” [46] added innovations in consistency
models, fault tolerance, and elastic rescaling. Despite ear-
lier skepticism that parameter servers would be compati-
ble with GPU acceleration [14], Cui et al. have recently
shown that GeePS [19], a parameter server specialized
for use with GPUs, can achieve speedups on modest-sized
clusters.
MXNet [12] is a recent system that uses a parameter
server to scale training, supports GPU acceleration, and
includes a flexible programming model with interfaces
for many languages. While MXNet partially fulfills our
extensibility requirements, the parameter server is “priv-
ileged” code, which makes it difficult for researchers to
customize the handling of large models (§4.2).
The parameter server architecture meets most of our
requirements, and our DistBelief [21] uses parameter
servers with a Caffe-like model definition format [36] to
great effect. We found this architecture to be insufficiently
extensible, because adding a new optimization algorithm,
or experimenting with an unconventional model archi-
tecture would require our users to modify the parameter
server implementation, which uses C++ for performance.
While some of the practitioners who use that system are
comfortable with making these changes, the majority are
accustomed to writing models in high-level languages,
such as Python and Lua, and the complexity of the high-
performance parameter server implementation is a barrier
to entry. With TensorFlow we therefore sought a high-
level programming model that allows users to customize
the code that runs in all parts of the system (§3).
3 TensorFlow execution model
TensorFlow uses a single dataflow graph to represent
all computation and state in a machine learning algo-
rithm, including the individual mathematical operations,
the parameters and their update rules, and the input pre-
processing (Figure 1). Dataflow makes the communi-
cation between subcomputations explicit, and therefore
makes it easy to execute independent computations in par-
allel, and partition the computation across multiple dis-
tributed devices. TensorFlow differs from batch
dataflow systems (§2.2) in two respects:
• The model supports multiple concurrent executions
on overlapping subgraphs of the overall graph.
• Individual vertices may have mutable state that can
be shared between different executions of the graph.
The key observation in the parameter server architec-
ture [21, 14, 46] is that mutable state is crucial when
training very large models, because it becomes possible to
make in-place updates to very large parameters, and prop-
agate those updates to parallel training steps as quickly
as possible. Dataflow with mutable state enables Tensor-
Flow to mimic the functionality of a parameter server,
but with additional flexibility, because it becomes pos-
sible to execute arbitrary dataflow subgraphs on the ma-
chines that host the shared model parameters. As a re-
sult, our users have been able to experiment with different
optimization algorithms, consistency schemes, and paral-
lelization strategies.
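
As a concrete illustration of computation and state in one graph, here is a minimal sketch in the TF 1.x graph style (API names follow the open-source release; the feed values are elided): the variable w is a stateful vertex, and running train_op executes a subgraph that reads and mutates it in place.

```python
import tensorflow as tf

# Computation and mutable state in a single dataflow graph (TF 1.x style).
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([10, 1]))   # mutable state, shared across steps

loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Each sess.run() executes a subgraph; concurrent steps can share w.
    # sess.run(train_op, feed_dict={x: ..., y: ...})
```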
3.1 Dataflow graph elements
In a TensorFlow graph, each vertex represents an atomic
unit of computation, and each edge represents the out-
put from or input to a vertex. We refer to the compu-
tation at vertices as operations, and the values that flow
along edges as tensors, because TensorFlow is designed
for mathematical computation, and uses tensors (or multi-
dimensional arrays) to represent all data in those compu-
tations.
Tensors In TensorFlow, we model all data as tensors
(dense n-dimensional arrays) with each element having
one of a small number of primitive types, such as int32,
float32, or string. Tensors naturally represent the
inputs to and results of the common mathematical oper-
ations in many machine learning algorithms: for exam-
ple, a matrix multiplication takes two 2-D tensors and
produces a 2-D tensor; and a mini-batch 2-D convolution
takes two 4-D tensors and produces another 4-D tensor.
All tensors in TensorFlow are dense. This decision en-
sures that the lowest levels of the system can have sim-
ple implementations for memory allocation and serializa-
tion, which reduces the overhead imposed by the frame-
work. To represent sparse tensors, TensorFlow offers two
alternatives: either encode the data into variable-length
string elements of a dense tensor, or use a tuple of
dense tensors (e.g., an n-D sparse tensor with m non-zero
elements could be represented as an m × n index matrix and
a length-m value vector). The size of a tensor can vary in
one or more dimensions, making it possible to represent
sparse tensors with differing numbers of elements, at the
cost of more sophisticated shape inference.
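
The tuple encoding above corresponds to the SparseTensor type in the open-source API, which pairs an index matrix with a value vector and a dense shape:

```python
import tensorflow as tf

# A 3x4 sparse matrix with m = 2 non-zero elements, encoded as a tuple of
# dense tensors: an m x n index matrix, a length-m value vector, and the
# dense shape.
st = tf.SparseTensor(indices=[[0, 1], [2, 3]],
                     values=[4.0, 5.0],
                     dense_shape=[3, 4])
```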
Operations An operation takes m ≥ 0 tensors as input,
and produces n ≥ 0 tensors as output. An operation has
a named “type” (such as Const, MatMul, or Assign)
and may have zero or more compile-time attributes that
determine its behavior. An operation can be generic and
variadic at compile-time: its attributes determine both the
expected types and arity of its inputs and outputs.
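
For example (a small sketch against the open-source Python API), the Const, MatMul, and Concat operation types are specialized by attributes that are fixed when the graph is constructed:

```python
import tensorflow as tf

# Attributes specialize generic operations at graph-construction time.
a = tf.constant([[1.0, 2.0]])            # Const: dtype and value are attributes
b = tf.matmul(a, a, transpose_b=True)    # MatMul: transpose_b is a boolean attribute
c = tf.concat([a, a, a], axis=0)         # Concat: variadic, N=3 fixed by an attribute
```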