Google's TPU Paper

Google's publicly released paper on the TPU (Tensor Processing Unit).
Figure 1. TPU Block Diagram. The main computation part is the yellow Matrix Multiply Unit in the upper-right corner. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB), and its output is the blue Accumulators (Acc). The yellow Activation Unit performs the nonlinear functions on the Acc, which go to the UB.

Figure 2. Floor plan of the TPU die. The shading follows Figure 1. The light (blue) data buffers are 37% of the die, the light (yellow) compute is 30%, the medium (green) I/O is 10%, and the dark (red) control is just 2%. Control is much larger (and much more difficult to design) in a CPU or GPU.

The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond, instead of just what was required for 2013 NNs. Figure 1 shows the block diagram of the TPU.

The TPU instructions are sent from the host over the PCIe Gen3 x16 bus into an instruction buffer. The internal blocks are typically connected together by 256-byte-wide paths. Starting in the upper-right corner, the Matrix Multiply Unit is the heart of the TPU. It contains 256x256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit. The 4 MiB holds 4096, 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle. We picked 4096 by first noting that the operations per byte needed to reach peak performance (the roofline knee in Section 4) is ~1350, so we rounded that up to 2048 and then duplicated it so that the compiler could use double buffering while running at peak performance.

When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half speed, and it computes at a quarter speed when both are 16 bits. It reads and writes 256 values per clock cycle and can perform either a matrix multiply or a convolution. The matrix unit holds one 64 KiB tile of weights plus one for double buffering (to hide the 256 cycles it takes to shift a tile in). This unit is designed for dense matrices; sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs.

The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory (for inference, weights are read-only; 8 GiB supports many simultaneously active models). The Weight FIFO is four tiles deep. The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit. A programmable DMA controller transfers data to or from CPU host memory and the Unified Buffer.

Figure 2 shows the floor plan of the TPU die. The 24 MiB Unified Buffer is almost a third of the die and the Matrix Multiply Unit is a quarter, so the datapath is nearly two-thirds of the die.
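As a quick sanity check on the sizes quoted above, the following short Python sketch (plain arithmetic only; the variable names are ours, not from the paper) reproduces the 64 KiB weight tile, the 2048-then-doubled accumulator count, and the 4 MiB accumulator capacity.

```python
import math

MACS = 256 * 256                       # 65,536 8-bit multiply-accumulators
weight_tile = MACS * 1                 # one 8-bit weight per MAC
assert weight_tile == 64 * 1024        # = the 64 KiB weight tile

# Accumulator count: the roofline knee (~1350 ops/byte) rounded up to a power
# of two (2048), then doubled so the compiler can double-buffer.
knee = 1350
accumulators = 2 * 2 ** math.ceil(math.log2(knee))
assert accumulators == 4096

acc_bytes = accumulators * 256 * 4     # 4096 accumulators x 256 elements x 32 bits
assert acc_bytes == 4 * 2**20          # = the 4 MiB accumulator memory
```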
The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler (see Section 7). Control is just 2%. Figure 3 shows the TPU on its printed circuit card, which inserts into existing servers like an SATA disk.

As instructions are sent over the relatively slow PCIe bus, TPU instructions follow the CISC tradition, including a repeat field. The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20. It has about a dozen instructions overall, but these five are the key ones:

1. Read_Host_Memory reads data from the CPU host memory into the Unified Buffer (UB).
2. Read_Weights reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
3. MatrixMultiply/Convolve causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators. A matrix operation takes a variable-sized B x 256 input, multiplies it by a 256 x 256 constant weight input, and produces a B x 256 output, taking B pipelined cycles to complete.
4. Activate performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its inputs are the Accumulators, and its output is the Unified Buffer. It can also perform the pooling operations needed for convolutions using the dedicated hardware on the die, as it is connected to nonlinear function logic.
5. Write_Host_Memory writes data from the Unified Buffer into the CPU host memory.

The other instructions are alternate host memory read/write, set configuration, two versions of synchronization, interrupt host, debug-tag, nop, and halt. The CISC MatrixMultiply instruction is 12 bytes, of which 3 are Unified Buffer address, 2 are accumulator address, 4 are length (sometimes 2 dimensions for convolutions), and the rest are opcode and flags.

The philosophy of the TPU microarchitecture is to keep the matrix unit busy. It uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The plan was to hide the execution of the other instructions by overlapping their execution with the MatrixMultiply instruction. Toward that end, the Read_Weights instruction follows the decoupled-access/execute philosophy [Smi82], in that it can complete after sending its address but before the weight is fetched from Weight Memory. The matrix unit will stall if the input activation or weight data is not ready.

We don't have clean pipeline overlap diagrams, because our CISC instructions can occupy a station for thousands of clock cycles, unlike the traditional RISC pipeline with one clock cycle per stage. Interesting cases occur when the activations for one network layer must complete before the matrix multiplications of the next layer can begin; we see a "delay slot," where the matrix unit waits for explicit synchronization before safely reading from the Unified Buffer.

As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit.
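To make the wavefront concrete, here is a minimal, cycle-level NumPy sketch of a weight-stationary systolic array (our own toy model, not Google's design files): activations enter from the left with a one-cycle skew per row, weights stay in place, and partial sums flow down into accumulators, reproducing x @ W.

```python
import numpy as np

def systolic_matvec(x, W):
    """Cycle-level toy model of a weight-stationary systolic array (cf. Figure 4)."""
    N = len(x)
    a = np.zeros((N, N))          # activation latched in each PE this cycle
    p = np.zeros((N, N))          # partial sum latched in each PE this cycle
    y = np.zeros(N)               # accumulators below the array
    for t in range(2 * N):        # enough cycles for the wavefront to drain
        a_new = np.zeros_like(a)
        p_new = np.zeros_like(p)
        for i in range(N):
            for j in range(N):
                # activation arrives from the left neighbour, or is injected
                # from outside at cycle t == i (the diagonal-wavefront skew)
                a_in = a[i, j - 1] if j > 0 else (x[i] if t == i else 0.0)
                # partial sum arrives from the PE above
                p_in = p[i - 1, j] if i > 0 else 0.0
                a_new[i, j] = a_in                   # pass activation right
                p_new[i, j] = p_in + W[i, j] * a_in  # MAC, pass sum downward
        a, p = a_new, p_new
        for j in range(N):
            if t == (N - 1) + j:                     # column j finishes now
                y[j] = p[N - 1, j]                   # drop into accumulator j
    return y

rng = np.random.default_rng(0)
x = rng.integers(-8, 8, size=8).astype(float)
W = rng.integers(-8, 8, size=(8, 8)).astype(float)
assert np.allclose(systolic_matvec(x, W), x @ W)     # matches a plain matmul
```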
The TPU software stack had to be compatible with those developed for CPUs and GPUs so that applications could be ported quickly to the TPU. The portion of the application run on the TPU is typically written in TensorFlow and is compiled into an API that can run on GPUs or TPUs [Lar16]. Like GPUs, the TPU stack is split into a User Space Driver and a Kernel Driver. The Kernel Driver is lightweight and handles only memory management and interrupts; it is designed for long-term stability. The User Space Driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary. The User Space Driver compiles a model the first time it is evaluated, caching the program image and writing the weight image into the TPU's Weight Memory; the second and following evaluations run at full speed. The TPU runs most models completely from inputs to outputs, maximizing the ratio of TPU compute time to I/O time. Computation is often done one layer at a time, with overlapped execution allowing the matrix multiply unit to hide most non-critical-path operations.

Figure 3. TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

Figure 4. Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and that they instantly update one location of each of 256 accumulator RAMs.
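As a rough illustration of the compile-on-first-evaluation behaviour described above, here is a hypothetical sketch (the class and method names are ours; the real User Space Driver API is not public): the first call pays the compilation cost and caches the program image, and later calls reuse it.

```python
# Toy sketch of compile-once-then-cache driver behaviour (illustrative only).
class ToyUserSpaceDriver:
    def __init__(self):
        self._program_cache = {}          # model id -> compiled "program image"

    def run(self, model_id, graph, inputs):
        if model_id not in self._program_cache:
            # First evaluation: translate the graph into TPU-style instructions
            # and stage the weights into Weight Memory (both faked here).
            program = [("MatrixMultiply", layer) for layer in graph["layers"]]
            self._program_cache[model_id] = program
        # Second and following evaluations reuse the cached program image.
        program = self._program_cache[model_id]
        return [op for op, _ in program], inputs   # stand-in for real execution

driver = ToyUserSpaceDriver()
graph = {"layers": ["fc1", "fc2"]}
driver.run("mlp0", graph, inputs=[0.0] * 256)      # compiles and caches
driver.run("mlp0", graph, inputs=[1.0] * 256)      # cache hit, "full speed"
```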
Model                    | mm2 | nm | MHz  | TDP  | Idle | Busy | TOPS/s 8b | TOPS/s FP | GB/s | On-chip Memory | Dies | DRAM Size                   | Server TDP | Server Idle | Server Busy
Haswell E5-2699 v3       | 662 | 22 | 2300 | 145W | 41W  | 145W | 2.6       | 1.3       | 51   | 51 MiB         | 2    | 256 GiB                     | 504W       | 159W        | 455W
NVIDIA K80 (2 dies/card) | 561 | 28 | 560  | 150W | 25W  | 98W  | --        | 2.8       | 160  | 8 MiB          | 8    | 256 GiB (host) + 12 GiB x 8 | 1838W      | 357W        | 991W
TPU                      | NA* | 28 | 700  | 75W  | 28W  | 40W  | 92        | --        | 34   | 28 MiB         | 4    | 256 GiB (host) + 8 GiB x 4  | 861W       | 290W        | 384W

Table 2. Benchmarked servers use Haswell CPUs, K80 GPUs, and TPUs. Haswell has 18 cores, and the K80 has 13 SMX processors. Figure 10 has measured power. The low-power TPU allows for better rack-level density than the high-power GPU. The 8 GiB DRAM per TPU is Weight Memory. GPU Boost mode is not used (Sec. 8). SECDED and the lack of Boost mode reduce K80 bandwidth from 240 to 160 GB/s. The lack of Boost mode and single-die (versus dual-die) performance reduce K80 peak TOPS/s from 8.7 to 2.8. (*The TPU die is less than half the Haswell die size.)

3. CPU, GPU, and TPU Platforms

The six production applications in Table 1 are our workload for this paper. As mentioned above, these six are representative of 95% of TPU use in our datacenters. Ironically, deploying and measuring popular small DNNs like AlexNet or VGG is difficult on production machines. However, one of our CNNs derives from Inception V2, which is widely used.

The benchmark platforms are server-class computers that were available in 2015, when the TPUs were deployed. This restriction meant that they must include at least SECDED protection of internal SRAM as well as external DRAM memory like the TPU, which excludes some choices such as the NVIDIA Maxwell GPU. For our company to purchase and deploy them, they also had to be sensibly configured machines, and not awkward artifacts assembled solely to win benchmarks. Table 2 lists our choices.

The traditional CPU server is represented by an 18-core, dual-socket Haswell processor from Intel. This platform is also the host server for GPUs or TPUs. Haswell was fabbed in an Intel 22nm process. Both the CPU and GPU are very large dies: about 600 mm². The 2.3 GHz CPU clock rate doesn't include Turbo mode because it seldom occurs in our datacenters for NN apps. Haswell has different clock rates depending on whether programs use AVX instructions, which our NN apps often use. The higher clock rate of Turbo mode (for programs that avoid AVX) occurs when they don't use all their cores. Thus, another reason Turbo mode is rare in our datacenters is that our apps typically do use all the cores, plus they can run other datacenter jobs to fill any idle cores.

The GPU accelerator is the NVIDIA K80. Each K80 card contains two dies and offers SECDED on internal memory and DRAM. NVIDIA states that the K80 "dramatically lowers data center cost by delivering application performance with fewer, more powerful servers" [Nvi16]. NN researchers frequently used K80s in 2015, and they were chosen for new cloud-based GPU offerings as recently as September 2016 [Bar16]. Up to eight K80 dies can be installed in four cards on this server, which is the configuration we benchmark.

The K80 offers a Boost mode that increases the clock rate to as high as 875 MHz. Turbo mode in Haswell is controlled by hardware and so can operate in short bursts before the chip temperature rises significantly. Boost mode, however, is under the control of a software driver [Nvi15], and thus lasts at least hundreds of milliseconds. Hence, power and cooling would have to be provisioned for the K80 as if it were essentially always running in Boost mode, or else the chips could get too hot. For this platform, enabling Boost mode would force us to reduce the number of K80 cards, which would hurt total cost of ownership, so Boost mode is disabled. This restriction reduces peak bandwidth and TOPS (see the Table 2 caption); Section 8 examines what would happen if it were enabled.

As the number of dies per benchmarked server varies between 2 and 8, we usually show results normalized per die (Figures 5-8, Figures 10-11, and Tables 3, 4, and 6), but we occasionally show whole systems (Figure 9). We hope this distinction is clear.

4. Performance: Rooflines, Response-Time, and Throughput

To illustrate the performance of the six apps on the three processors, we adapt the Roofline performance model from high-performance computing (HPC) [Wil09]. This simple visual model is not perfect, yet it offers insights into the causes of performance bottlenecks. The assumption behind the model is that applications don't fit in on-chip caches, so they are either computation-limited or memory-bandwidth-limited. For HPC, the Y-axis is performance in floating-point operations per second, so the peak computation rate forms the flat part of the roofline. The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed. Memory bandwidth is bytes per second, which turns into the slanted part of the roofline, since (FLOPS/sec) / (FLOPS/Byte) = Bytes/sec. Without sufficient operational intensity, a program is memory-bandwidth-bound and lives under the slanted part of the roofline.
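The roofline formula just described is simple enough to state directly in code. The helper below is our own sketch: attainable performance is the minimum of the compute roof and the bandwidth-times-intensity slanted roof. The bandwidth plugged in is implied by the ridge point quoted for Figure 5, not a number from Table 2.

```python
def roofline(peak_ops_per_s, bytes_per_s, operational_intensity):
    """Attainable throughput: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bytes_per_s * operational_intensity)

# Illustrative TPU-like numbers: the 92 TOPS peak, with a weight-memory
# bandwidth implied by a ridge point of ~1350 ops/byte (bandwidth = peak/ridge).
peak = 92e12
bandwidth = peak / 1350
for oi in (100, 1350, 3000):
    print(f"{oi:>5} ops/byte -> {roofline(peak, bandwidth, oi) / 1e12:.1f} TOPS")
```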
Figure 5. TPU (die) roofline. Its ridge point is far to the right at 1350 operations per byte of weight memory fetched. (Log-log scale; X-axis is operational intensity in ops per weight byte.)

The gap between the actual operations per second of an application and the ceiling directly above it shows the potential benefit of further performance tuning while leaving operational intensity untouched; of course, optimizations that increase operational intensity (such as cache blocking) may yield even greater benefit.

To use the Roofline model for the TPU, where NN applications are quantized, we first replace floating-point operations with integer operations. As weights do not normally fit in on-chip memory for NN applications, the second change is to redefine operational intensity to be integer operations per byte of weights read (see the tenth column of Table 1).

Figure 5 shows the Roofline model for a single TPU die on log-log scales. The TPU has a long "slanted" part of its roofline, where low operational intensity means that performance is limited by memory bandwidth rather than by peak compute. Five of the six applications are happily bumping their heads against the ceiling: the MLPs and LSTMs are memory-bound, and the CNNs are computation-bound. CNN1, despite a very high operational intensity, is running at only 14.1 TOPS, while CNN0 runs at 86 TOPS.

Table 3 explains what happened with CNN1, based on the performance counters that give us partial visibility into TPU operation. The TPU spends less than half of its cycles performing matrix operations for CNN1 (column 7, row 1). On each of those active cycles, only about half of the 65,536 MACs hold useful weights because some layers in CNN1 have shallow feature depths. About 35% of cycles are spent waiting for weights to load from memory into the matrix unit, which occurs during the 4 fully connected layers that run at an operational intensity of just 32 (see the last Fallacy in Section 8). This leaves roughly 19% of cycles not explained by the matrix-related counters. Because of overlapped execution on the TPU, we do not have exact accounting for those cycles, but we can see that 23% of cycles have stalls for RAW dependences in the pipeline, and 1% are spent stalled for input over the PCIe bus.

Application                        | MLP0  | MLP1  | LSTM0 | LSTM1 | CNN0  | CNN1  | Mean | Row
Array active cycles                | 12.7% | 10.6% | 8.2%  | 10.5% | 78.2% | 46.2% | 28%  | 1
Useful MACs in 64K matrix (% peak) | 12.5% | 9.4%  | 8.2%  | 6.3%  | 78.2% | 22.5% | 23%  | 2
Unused MACs                        | 0.3%  | 1.2%  | 0.0%  | 4.2%  | 0.0%  | 23.7% | 5%   | 3
Weight stall cycles                | 53.9% | 44.2% | 58.1% | 62.1% | 0.0%  | 28.1% | 43%  | 4
Weight shift cycles                | 15.9% | 13.4% | 15.8% | 17.1% | 0.0%  | 7.0%  | 12%  | 5
Non-matrix cycles                  | 17.5% | 31.9% | 17.9% | 10.3% | 21.8% | 18.7% | 20%  | 6
RAW stalls                         | 3.3%  | 8.4%  | 14.6% | 10.6% | 3.5%  | 22.8% | 11%  | 7
Input data stalls                  | 6.1%  | 8.8%  | 5.1%  | 2.4%  | 3.4%  | 0.6%  | 4%   | 8
TeraOps/sec (92 peak)              | 12.3  | 9.7   | 3.7   | 2.8   | 86.0  | 14.1  | 21.4 | 9

Table 3. Factors limiting TPU performance of the NN workload, based on hardware performance counters. Rows 1, 4, 5, and 6 total 100% and are based on measurements of activity of the matrix unit. Rows 2 and 3 further break down the fraction of 64K weights in the matrix unit that hold useful weights on active cycles. Our counters cannot exactly explain the time when the matrix unit is idle in row 6; rows 7 and 8 show counters for two possible reasons, including RAW pipeline hazards and PCIe input stalls. Row 9 (TOPS) is based on measurements of production code while the other rows are based on performance-counter measurements, so they are not perfectly consistent. Host server overhead is excluded here. The MLPs and LSTMs are memory-bandwidth limited, but CNNs are not. CNN1 results are explained in the text.
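The CNN1 narrative above can be tied back to Table 3 with a few lines of arithmetic (percentages transcribed from the CNN1 column; the check itself is ours):

```python
# CNN1 column of Table 3, tied back to the narrative above.
active, weight_stall, weight_shift, non_matrix = 46.2, 28.1, 7.0, 18.7
useful_macs = 22.5                       # row 2: useful MACs as % of peak

# Rows 1, 4, 5, and 6 account for essentially all cycles:
assert abs(active + weight_stall + weight_shift + non_matrix - 100.0) < 0.5
# "about 35% of cycles are spent waiting for weights" = stall + shift cycles:
print(weight_stall + weight_shift)       # 35.1
# "only about half of the 65,536 MACs hold useful weights" on active cycles:
print(useful_macs / active)              # ~0.49
# "roughly 19% of cycles not explained by the matrix-related counters":
print(non_matrix)                        # 18.7
```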
Figure 6. Intel Haswell CPU (die) roofline, with its ridge point at 13 operations/byte, which is much further left than in Figure 5. LSTM0 and MLP1 are faster on Haswell than on the K80, but it is vice versa for the other DNNs.

Figure 7. NVIDIA K80 GPU (die) roofline. The much higher memory bandwidth moves the ridge point to 9 operations per weight byte, which is even further left than in Figure 6. The DNNs are much lower than their roofline because of response-time limits (see Table 4).

Figures 6 and 7 show rooflines for a single Haswell die and for a single K80 die. The six NN applications are generally further below their ceilings than was the TPU in Figure 5. Response time is the reason. Many of these NN applications are parts of end-user-facing services. Researchers have demonstrated that small increases in response time cause customers to use a service less [Sch09]. Hence, while training may not have hard response-time deadlines, inference usually does. That is, inference prefers latency over throughput [Pat04].

Table 4 illustrates the impact of the 99th-percentile response-time limit of 7 ms for MLP0 on Haswell and the K80 [Dea13], which was required by the application developer. (The inferences per second and the 7 ms latency include the server host time as well as the accelerator time.) They operate at 42% and 37%, respectively, of the highest throughput achievable for MLP0 if the response-time limit were relaxed. Thus, while CPUs and GPUs have potentially much higher throughput, it is wasted if they don't meet the response-time limit. These bounds affect the TPU as well, but at 80% in Table 4 it is operating much closer to its highest MLP0 throughput.

As compared to CPUs and GPUs, the single-threaded TPU has none of the sophisticated microarchitectural features that consume transistors and energy to improve the average case but not the 99th-percentile case: no caches, branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing, multithreading, context switching, and so forth. Minimalism is a virtue of domain-specific processors.

Table 3 shows TPU performance, but it doesn't account for host server time, which can be divided into running the host's share of the application and talking to the TPU. Table 5 lists the second part, but the first part is hard. Queueing theory shows that long input queues raise throughput (by ensuring that the computer is never idle) but stretch response time. Thus, most applications keep their input queues empty. Alas, we can't measure when the TPU is idle because it is waiting for the CPU to do its portion of the application or because the CPU is also idle due to an empty input queue.
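The 42%, 37%, and 80% figures quoted above follow directly from the throughput columns of Table 4 below; a quick check (numbers transcribed from that table, arithmetic ours):

```python
# Share of the unconstrained (large-batch) throughput each platform keeps once
# the 7 ms 99th-percentile limit forces a smaller batch (Table 4 numbers).
ips_at_7ms = {"CPU": 5_482, "GPU": 13_461, "TPU": 225_000}
ips_max    = {"CPU": 13_194, "GPU": 36_465, "TPU": 280_000}
for platform in ips_at_7ms:
    share = ips_at_7ms[platform] / ips_max[platform]
    print(f"{platform}: {share:.0%}")   # CPU: 42%, GPU: 37%, TPU: 80%
```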
Table 6 gives the bottom line of relative inference performance per die, including the host server overhead, for the two accelerators versus the CPU. The next-to-last column shows the geometric mean of the relative performance for the six NN applications, which suggests the K80 die is 1.1X the speed of a Haswell die, that the TPU die is 14.5 times as fast, and thus that the TPU die is 13.2 times as fast as the GPU die. Figure 8 shows their relative speeds visually. Recall that architects use the geometric mean when they don't know the actual mix of programs that will be run [Hen18]. For this study, however, we do know the mix (Table 1). The weighted mean in the last column of Table 6, using the actual mix, increases the GPU to 1.9X and the TPU to 29.2X, so the TPU die is now 15.3 times as fast as the GPU die.

Type | Batch | 99th% Response | Inferences/s (IPS) | % Max IPS
CPU  | 16    | 7.2 ms         | 5,482              | 42%
CPU  | 64    | 21.3 ms        | 13,194             | 100%
GPU  | 16    | 6.7 ms         | 13,461             | 37%
GPU  | 64    | -- ms          | 36,465             | 100%
TPU  | 200   | 7.0 ms         | 225,000            | 80%
TPU  | 250   | 10.0 ms        | 280,000            | 100%

Table 4. 99th-percentile response time and per-die throughput (IPS) for MLP0 as batch size varies. The longest allowable latency is 7 ms. For the GPU and TPU, the maximum MLP0 throughput is limited by the host server overhead. Larger batch sizes increase throughput, but as the text explains, their longer response times exceed the limit, so CPUs and GPUs must use less-efficient, smaller batch sizes (16 vs. 200 for the TPU).

     | MLP0 | MLP1 | LSTM0 | LSTM1 | CNN0
     | 21%  | 76%  | 11%   | 20%   | 5%

Table 5. Time for the host CPU to interact with the TPU, expressed as a percentage of TPU execution time (from TPU performance counters). This fraction is the time the CPU and TPU are communicating over the PCIe bus, not including the time the CPU is doing a portion of the application but not interacting with the TPU. As the text explains, it is hard for the TPU to measure whether the CPU is idle or working on the app.

Type  | MLP0 | MLP1 | LSTM0 | LSTM1 | CNN0 | CNN1 | GM   | WM
GPU   | 2.5  | 0.3  | 0.4   | 1.2   | 1.6  | 2.7  | 1.1  | 1.9
TPU   | 41.0 | 18.5 | 3.5   | 1.2   | 40.3 | 71.0 | 14.5 | 29.2
Ratio | 16.7 | 60.0 | 8.0   | 1.0   | 25.4 | 26.3 | 13.2 | 15.3

Table 6. K80 GPU die and TPU die performance relative to the CPU for the NN workload. GM and WM are the geometric and weighted means (using the actual mix of the six apps in Table 1). Relative performance for the GPU and TPU includes host server overhead.

MLPs and CNNs perform well on the TPU. Table 4 explains that the TPU can have larger batch sizes and still meet the time limits, increasing operations per byte (Table 1) or, equivalently, reducing memory accesses per operation. Also, CNNs by their nature have greater weight reuse and thus higher operations per byte. Thus, the lower memory bandwidth of the TPU doesn't significantly hurt CNN performance.

Figure 8. Figures 5-7 combined into a single log-log graph. Stars are for the TPU, triangles are for the K80, and circles are for Haswell. All TPU stars are at or above the other two rooflines.

Figure 9. Relative performance/Watt (TDP) of the GPU server (blue bar) and TPU server (red bar) to the CPU server, and of the TPU server to the GPU server (orange bar). TPU' is an improved TPU (Sec. 7); the green bar shows its ratio to the CPU server and the lavender bar shows its relation to the GPU server. Total includes host server power, but incremental doesn't. GM and WM are the geometric and weighted means.
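The geometric means in Table 6 can be reproduced from the per-application columns; a short check (our arithmetic, using the table's rounded values, so the last digit of the final ratio differs slightly from the table's 13.2):

```python
# Reproduce the GM column of Table 6 from the six per-app speedups.
from math import prod

gpu_vs_cpu = [2.5, 0.3, 0.4, 1.2, 1.6, 2.7]
tpu_vs_cpu = [41.0, 18.5, 3.5, 1.2, 40.3, 71.0]

def gmean(xs):
    return prod(xs) ** (1.0 / len(xs))

print(round(gmean(gpu_vs_cpu), 1))                      # ~1.1
print(round(gmean(tpu_vs_cpu), 1))                      # ~14.5
print(round(gmean(tpu_vs_cpu) / gmean(gpu_vs_cpu), 1))  # ~13.4 vs. 13.2 in the table
```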
5. Cost-Performance, TCO, and Performance/Watt

When buying computers by the thousands, cost-performance trumps performance. The best cost metric in a datacenter is total cost of ownership (TCO). The actual price we pay for thousands of chips depends on negotiations between the companies involved. For business reasons, we can't publish such price information or data that might let it be deduced. However, power is correlated with TCO, and we can publish Watts per server, so we use performance/Watt as our proxy for performance/TCO in this paper. In this section, we compare whole servers rather than single dies, which Table 2 lists in the Benchmarked Servers columns.

Figure 9 shows the geometric and weighted mean performance/Watt for the K80 GPU and TPU relative to the Haswell CPU. We present two different calculations of performance/Watt. The first ("total") includes the power consumed by the host CPU server when calculating performance/Watt for the GPU and TPU. The second ("incremental") subtracts the host CPU server power from the GPU and TPU beforehand.

For total-performance/Watt, the K80 server is 1.2-2.1X Haswell. For incremental-performance/Watt, when Haswell server power is omitted, the K80 server is 1.7-2.9X. The TPU server has 17 to 34 times better total-performance/Watt than Haswell, which makes the TPU server 14 to 16 times the performance/Watt of the K80 server. The relative incremental-performance/Watt, which was our company's justification for a custom ASIC, is 41 to 83 for the TPU, which lifts the TPU to 25 to 29 times the performance/Watt of the GPU.

6. Energy Proportionality

Thermal Design Power (TDP) affects the cost of provisioning power, as you must supply sufficient power and cooling when hardware is at full power. However, the cost of electricity is based on the average power consumed as the workload varies during the day. [Bar07] found that servers are 100% busy less than 10% of the time and advocated energy proportionality: servers should consume power proportional to the amount of work performed. The estimate of power consumed in the prior section is based on the fraction of the TDP that has been seen in our datacenters.

We measured performance and power as the offered workload utilization varied from 0% to 100%, collected in buckets of 10% delta of workload [Lan09]. Figure 10 shows server power divided by the number of dies per server for the three chips, obtained by varying CNN0's workload. We plot incremental power (K80 and TPU) as well as total power (K80 + Haswell/4 and TPU + Haswell/2) for the GPU and TPU. Note that all were given the same batch sizes.

We see that the TPU has the lowest power: 118W per die total (TPU + Haswell/2) and 40W per die incremental (the TPU curve in Figure 10). However, it has poor energy proportionality: at 10% load, the TPU uses 88% of the power it uses at 100%. (The short design schedule prevented inclusion of many energy-saving features.) Not surprisingly, Haswell is the best at energy proportionality of the group: it uses 56% of the power at 10% load that it does at 100%. The K80 is closer to the CPU than to the TPU, using 66% of the full-load power at a 10% workload. LSTM1, which is not computation bound, performs similarly: at 10% load the CPU uses 47% of full power, the GPU uses 78%, and the TPU uses 94%.

What happens to server power usage when running CNN0 if the CPU becomes a host to accelerators? When the GPU and TPU are at 100% load, the CPU server uses 52% of full power for the GPU and 69% for the TPU. (The CPU does more work for the TPU because it is running so much faster than the GPU.) Consequently, the Haswell server plus four TPUs uses less than 20% additional power but runs CNN0 80 times faster than the Haswell server alone (4 TPUs vs. 2 CPUs).
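The server-to-server ratios quoted in Section 5 follow from the ranges given in the text; a short check (our arithmetic on those rounded endpoints, so the lower incremental bound comes out at ~24 rather than the 25 derived from the unrounded Figure 9 bars):

```python
# Performance/Watt relative to the Haswell server (ranges quoted in Sec. 5).
total_gpu, total_tpu = (1.2, 2.1), (17, 34)
incr_gpu, incr_tpu   = (1.7, 2.9), (41, 83)

# TPU server vs. K80 server, total performance/Watt: ~14x to ~16x.
print(round(total_tpu[0] / total_gpu[0], 1), round(total_tpu[1] / total_gpu[1], 1))
# TPU server vs. K80 server, incremental performance/Watt: ~24x to ~29x.
print(round(incr_tpu[0] / incr_gpu[0], 1), round(incr_tpu[1] / incr_gpu[1], 1))
```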
7. Evaluation of Alternative TPU Designs

Like an FPU, the TPU coprocessor has a relatively easy microarchitecture to evaluate, so we created a performance model for our six applications. Table 7 shows the differences between the model results and the hardware performance counters, which average below 10%. We then modeled performance as we varied the memory bandwidth, the clock rate and number of accumulators, and the matrix multiply unit size.

Figure 11 shows the mean performance sensitivity of the TPU die as we scale these parameters over the range from 0.25x to 4x. It plots weighted means, but the geometric means look similar. In addition to evaluating the impact of only raising clock rates (clock in Figure 11), we also plot a design (clock+) where the clock rate is increased and the number of accumulators is correspondingly scaled so the compiler can keep more memory references in flight. Likewise, we plot matrix unit expansion where we increase the number of accumulators with the square of the rise in one dimension (matrix+), since the number of multipliers in the matrix grows in both dimensions, as well as just increasing the matrix unit alone (matrix).

     | MLP0 | MLP1  | LSTM0 | LSTM1 | CNN0 | CNN1
     | 6.8% | 10.9% | 7.7%  | 5.4%  | 8.2% | 11.2%

Table 7. Difference in clock cycles between the TPU hardware performance counters and the TPU performance model. The average is 8%.
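The paper's internal performance model is not public, but the kind of parameter sweep described above can be sketched with the roofline helper from Section 4: scale one parameter at a time and see whether a memory-bound or a compute-bound application responds. Everything below (the normalized baseline and the two stand-in operational intensities) is an illustrative assumption, not the authors' model.

```python
# Toy sensitivity sweep in the spirit of Figure 11 (normalized units).
def roofline(peak, bandwidth, oi):
    return min(peak, bandwidth * oi)

PEAK = 1.0                  # normalized peak compute
BW = PEAK / 1350            # bandwidth implied by a ridge at ~1350 ops/byte
apps = {"memory-bound (oi=100)": 100, "compute-bound (oi=3000)": 3000}

for name, oi in apps.items():
    base = roofline(PEAK, BW, oi)
    for scale in (0.25, 0.5, 1, 2, 4):
        more_bw = roofline(PEAK, scale * BW, oi) / base    # scale memory bandwidth
        more_clk = roofline(scale * PEAK, BW, oi) / base   # scale clock / matrix peak
        print(f"{name} @ {scale}x: bandwidth -> x{more_bw:.2f}, clock -> x{more_clk:.2f}")
```

Under this simplified model, the memory-bound stand-in scales with bandwidth and ignores extra clock, while the compute-bound stand-in does the opposite until it hits the ridge, which is the qualitative behaviour a roofline-style sweep is meant to expose.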

Uploaded 2017-04-06 · Size: 1.31 MB

评论 下载该资源后可以进行评论 共5条

ncf25 (2019-03-15): A very good paper.

jiuzhangzi (2019-03-01): Not good; it is not a complete walkthrough, only part of the paper.

haoying6691 (2018-10-13): It can be downloaded directly from IEEE.

caffeineatp (2018-01-10): A very good resource.

panchaofeng (2017-11-09): The points required are too expensive.