16.1 MulTCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid
Sparse Digital CIM-Based Accelerator for Multimodal
Transformers
Fengbin Tu, Zihan Wu, Yiqi Wang, Weiwei Wu, Leibo Liu, Yang Hu,
Shaojun Wei, Shouyi Yin
Tsinghua University, Beijing, China
Human perception is multimodal, comprehending a mixture of vision, natural language, speech, and other signals. Multimodal Transformer (MulT, Fig. 16.1.1) models introduce a
cross-modal attention mechanism to vanilla transformers to learn from different
modalities, achieving excellent results on multimodal AI tasks like video question
answering and multilingual image retrieval. Transformers require specialized hardware
for efficient inference [1]. Prior work demonstrates that a Compute-In-Memory (CIM)
accelerator with attention sparsity can efficiently process vanilla transformers [2].
Multimodal signals like video and audio exhibit diverse token significance, providing new
opportunities for token sparsity via runtime pruning [3]. Additionally, activation functions
like GELU and softmax produce many near-zero values that expose bit sparsity in the
most-significant bits (MSBs). Exploiting attention-token-bit hybrid sparsity poses three challenges: 1) For attention sparsity, irregular patterns result in long reuse distances, which force the CIM to hold infrequently used weights and lower CIM utilization. 2) Although token sparsity reduces computation, MulT’s cross-modal attention processes tokens from two modalities with different token lengths (N) and embedding dimensionalities (d_m), causing high latency during the cross-modal switch. 3) At the bit level, since token sparsity reduces value locality, a CIM macro sees more variance in effective bitwidth within the same group of inputs. In a conventional CIM’s bit-serial MAC scheme, computation time is set by the longest bitwidth.
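To make the bit-level challenge concrete, a minimal Python sketch (our illustration, not MulTCIM’s hardware) shows why a conventional bit-serial CIM’s latency is set by the longest effective bitwidth in an input group:

def effective_bitwidth(x: int) -> int:
    # Bits needed for the magnitude of x; near-zero values are short.
    return max(1, abs(x).bit_length())

def bit_serial_cycles(group: list[int]) -> int:
    # A conventional bit-serial CIM streams one input bit per cycle, so the
    # longest-EB input in the group dictates the whole group's latency.
    return max(effective_bitwidth(x) for x in group)

# Near-zero activations after GELU/softmax keep most EBs short,
# but two long values stall the entire group.
group = [1, 0, 2, 113, 1, 3, 0, 91]
print(bit_serial_cycles(group))  # 7 cycles, though the average EB is under 3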
We propose MulTCIM, a digital CIM-based accelerator for MulT models, with three features that tackle the hybrid-sparsity challenges: 1) Instead of generating Q, K, V tokens in order, we implement a Long Reuse Elimination Scheduler (LRES) that dynamically reshapes the attention matrix into a global+local sparse pattern. With this approach, weights stored in
the CIM can be reused more frequently to improve CIM utilization. 2) We design a
Runtime Token Pruner (RTP) to remove insignificant tokens and a Modal-Adaptive CIM
Network (MACN) to dynamically divide all CIM cores into two pipeline stages, StageS
for static matrix multiplication (MM) in Q, K and V generation, and StageD for dynamic
MM of attention computation. During the cross-modal switch, the MACN further exploits modal symmetry to overlap Q, K generation for lower latency. 3) An Effective Bitwidth Balanced
CIM (EBB-CIM) architecture is designed to balance input bits across in-memory MACs
by performing effective bitwidth detection and bit equalization, reducing the computation
time.
Figure 16.1.2 shows the MulTCIM accelerator, comprising a MACN with 16 CIM cores,
an RTP, an LRES, a 64KB input buffer (IB), a 128KB global buffer (GB), a SIMD core and
a top controller. CIM cores are connected by the MACN’s pipeline bus, and each core
has 8 EBB-CIM macros. The controller configures the MACN with current layer
parameters. For fully-connected layers, all CIM cores store layer weights and work in
parallel. For attention layers (QK^T-MM, A’V-MM), CIM cores work in pipeline mode similar to [2] for less off-chip access. Take QK^T-MM for example: the RTP first prunes insignificant
input tokens. LRES then configures sparse token and attention scheduling for IB and
MACN’s StageD. The cores in StageS store weight matrices (W_Q, W_K), and load inputs
from IB to generate Q, K. The output q, k vectors are merged in the StageS adder and
streamed to StageD. LRES controls the StageS output’s destination core in StageD for
weight writing or input feed. Specifically for cross-modal attention, the MACN uses the modal workload allocator (MWA) to adjust the StageS workload during the cross-modal switch. The MACN outputs are finally stored in the GB, with activation functions performed in the SIMD core.
Figure 16.1.3 illustrates the LRES, which optimizes for attention sparsity. The LRES comprises an
attention sparsity manager, a local attention sorter and a reshaped attention generator.
The manager stores the sparse attention pattern obtained by training. After runtime token
pruning by the RTP, the manager compares the pattern’s row/column-wise sums to
select global-like attention. The corresponding q or k vectors are reused as weights in
StageD’s CIM for a long time. The local attention sorter forms a local-like pattern for the
rest of the attention with k as weights and q as inputs in StageD’s CIM. Local-like means
k is frequently consumed and dynamically replaced by new k from StageS. The pattern
is first row-wise reordered via the similarity-based q-sorter to decide the input feed
sequence with minimum reuse distance for the current k. Then, the difference-based k-sorter reorders the pattern column-wise to decide the weight-writing sequence. The
reshaped attention generator sends configurations to the IB for token feed and StageD
for workload assignment. In the example QK^T-MM, LRES achieves 4.16× higher CIM
utilization for StageD and 2.39× total speedup over the conventional in-order computing
approach.
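A minimal software sketch of this reshaping under our own simplified metrics (column sums pick global-like attention; a greedy-overlap heuristic stands in for the similarity-based q-sorter, and a first-use ordering for the difference-based k-sorter):

import numpy as np

def lres_reshape(pattern: np.ndarray, n_global: int):
    # pattern: N_q x N_k binary sparse-attention pattern from training.
    col_sums = pattern.sum(axis=0)
    global_k = np.argsort(-col_sums)[:n_global]   # global-like k: long-lived CIM weights
    local_k = np.setdiff1d(np.arange(pattern.shape[1]), global_k)
    local = pattern[:, local_k]

    # q-sorter: place rows needing the same k next to each other, shrinking
    # the reuse distance of the k currently resident in StageD's CIM.
    q_order, rest = [0], list(range(1, local.shape[0]))
    while rest:
        cur = local[q_order[-1]]
        nxt = max(rest, key=lambda r: int((local[r] & cur).sum()))  # most similar row
        q_order.append(nxt); rest.remove(nxt)

    # k-sorter: order columns by the first q row that consumes them,
    # which fixes the weight-writing sequence from StageS.
    k_order = np.argsort([np.argmax(local[q_order, c]) for c in range(local.shape[1])])
    return global_k, np.array(q_order), local_k[k_order]

pat = (np.random.default_rng(1).random((8, 8)) < 0.4).astype(np.uint8)
print(lres_reshape(pat, n_global=2))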
Figure 16.1.4 depicts the RTP and MACN blocks that optimize for token sparsity. Since
the class (CLS) token characterizes other tokens’ significance [3], the RTP receives the
previous layer’s CLS score and selects the current layer’s top-n most-significant tokens.
The MACN comprises an MWA, 16 CIM cores and a pipeline bus. The MWA updates
pruning information for the current layer and allocates the StageS workload. Initially, the
MWA divides CIM cores into StageS and StageD, and pre-distributes weights for StageS
based on the allocation table. In cross-modal attention, conventional methods compute
one modality after another, so the differing modal parameters leave many CIM macros idle during the cross-modal switch. The MWA exploits modal symmetry to overlap multimodal Q, K
generation. The CIM’s 4:1 activation structure stores multimodal weights in one macro
and switches modality by time multiplexing. Core1 stores W_QX and W_QY in the example. At cycle N_X, the MACN switches to Phase2 and Core1 activates W_QY to generate Q_Y instead of staying idle. Modal symmetry makes Q_Y and K_X generation conclude at the same time with better CIM utilization. The RTP reduces latency by 2.13× and 1.58× for single- and cross-modal attention, and symmetric modal overlap offers an extra 1.69× speedup for
cross-modal attention.
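A rough functional sketch of the RTP’s selection rule (our simplification; the hardware ranks scores at runtime and forwards the kept indices to the LRES and MWA), assuming token 0 is the CLS token:

import numpy as np

def runtime_token_prune(tokens: np.ndarray, cls_scores: np.ndarray, n_keep: int):
    # Keep the CLS token plus the top-n tokens ranked by the previous
    # layer's CLS attention scores.
    order = np.argsort(-cls_scores[1:]) + 1            # rank the non-CLS tokens
    keep = np.concatenate(([0], np.sort(order[:n_keep])))
    return tokens[keep], keep                          # pruned tokens + kept indices

rng = np.random.default_rng(2)
tokens = rng.standard_normal((10, 16))                 # 10 tokens, d_m = 16
cls_scores = rng.random(10)                            # CLS attention row, previous layer
pruned, kept = runtime_token_prune(tokens, cls_scores, n_keep=5)
print(pruned.shape, kept)                              # (6, 16) and the kept indices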
Figure 16.1.5 shows the EBB-CIM macro that optimizes for bit sparsity. It comprises 32
EBB-CIM arrays, an effective bitwidth detector, a bit equalizer and a bit-balanced feeder.
Each EBB-CIM array has 4×64 6T-SRAM bitcells (8 banks) and a cross-shift MAC tree.
We exploit a full-digital CIM architecture with 4 rows of time-sharing CIM logic (4:1
activation) to achieve high computing accuracy at INT16, while maintaining memory
density [4]. The detector receives inputs and detects effective bitwidth (EB) at runtime.
The bit equalizer calculates the average EB and assigns the long-EB data’s bits to the short-EB data, producing a bit-balanced input sequence. The bit-balanced feeder fetches the
sequence and generates cross-shift configurations. The example shows INT8
multiplication of 8-element input and weight vectors. The MAC tree cross-shifts 8 weights
in a row to multiply 8 input bits at the correct position. At cycle 0, the cross-shift MAC
computes I_0[3]×W_0 + I_3[6]×(W_3<<3) + I_3[4]×(W_3<<1) + I_3[3]×W_3 + ... + I_6[4]×(W_6<<1) by balancing the long-EB W_3 and W_6, thereby avoiding wasting in-memory MACs. The bit
equalizer limits weight shifting to within 4 bits for lower memory cost. The EBB-CIM is reconfigurable for INT16 by fusing every two INT8 operations. Compared to a conventional bit-serial CIM, the EBB-CIM reduces latency by 2.38×, 2.20×, and 1.58× for softmax-MM, GELU-MM, and the entire encoder, respectively, with only 5.1% power and 4.6% area overhead.
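The bit-equalization idea can be sketched behaviorally as follows (our simplification, assuming unsigned INT8 inputs and ignoring sign handling, the 4-bit shift limit, and INT16 fusion):

def effective_bits(x: int) -> list[int]:
    # Positions of the set bits of x -- its effective bits.
    return [b for b in range(8) if (x >> b) & 1]

def equalize(inputs: list[int]) -> list[list[tuple[int, int]]]:
    # Gather every (input, bit) work item and deal them across the 8
    # in-memory MAC lanes per cycle, so the cycle count tracks the
    # average EB of the group instead of its longest EB.
    work = [(i, b) for i, x in enumerate(inputs) for b in effective_bits(x)]
    return [work[c:c + 8] for c in range(0, len(work), 8)]

def mac(inputs, weights):
    # Each scheduled item (i, b) contributes (W_i << b), the cross-shift
    # MAC tree's partial product for bit b of input i.
    return sum(weights[i] << b
               for cycle in equalize(inputs) for i, b in cycle)

ins = [5, 1, 0, 109, 2, 1, 3, 90]   # two long-EB values dominate a bit-serial CIM
wts = [3, 1, 4, 1, 5, 9, 2, 6]
print(len(equalize(ins)), mac(ins, wts) == sum(i * w for i, w in zip(ins, wts)))
# -> 2 True: 16 effective bits finish in 2 balanced cycles (vs. 7 bit-serial),
#    with the exact dot-product result preserved.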
Figure 16.1.6 shows the measurement results for the 28nm MulTCIM accelerator. The
chip works at 0.6-1.0V supply, corresponding to 85-275MHz. MulTCIM supports general
transformer models with special optimizations for MulT. Experiments are conducted on
2 typical MulT models, ViLBERT-base and ViLBERT-large [5], with the Visual Question
Answering v2.0 dataset. The hybrid sparsity techniques obtain 9.47× speedup and 8.11× energy savings on ViLBERT-base’s attention layers, and 6.54× speedup and 5.61× energy savings on the entire model. The negligible accuracy loss mainly comes from
INT8/16 quantization. The peak energy efficiency is 101.1TOPS/W for INT8 and
60.3TOPS/W for INT16 at 0.7V, 160MHz. Compared with prior transformer and digital
CIM accelerators, MulTCIM consumes 2.24μJ/Token for the ViLBERT-base model – an
energy reduction of 5.91× over [1] and 5.61× over [2]. Owing to the hybrid sparsity
exploitation, MulTCIM achieves 2.50× higher efficiency than the more advanced 5nm
digital CIM architecture [4]. MulTCIM’s die photo, voltage-frequency scaling curves and
summary table are shown in Fig. 16.1.7.
Acknowledgement:
This work was supported in part by NSFC Grant 62125403, Grant U19B2041, and Grant
92164301; in part by the National Key Research and Development Program under Grant
2021ZD0114400; in part by Beijing National Research Center for Information Science
and Technology; and in part by the Beijing Advanced Innovation Center for Integrated
Circuits. The corresponding author of this paper is Shouyi Yin (yinsy@tsinghua.edu.cn).
References:
[1] Y. Wang et al., “A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer
Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing,” ISSCC,
pp. 464-465, 2022.
[2] F. Tu et al., “A 28nm 15.59μJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes,” ISSCC, pp. 466-467, 2022.
[3] Y. Xu et al., “Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer,”
AAAI, pp. 2964-2972, 2022.
[4] H. Fujiwara et al., “A 5-nm 254-TOPS/W 221-TOPS/mm² Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations,” ISSCC, pp. 186-187, 2022.
[5] J. Lu et al., “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks,” NeurIPS, 2019.