ISSCC 2023 / SESSION 29 / DIGITAL ACCELERATORS AND CIRCUIT TECHNIQUES / 29.1
29.1 A 32.5mW Mixed-Signal Processing-in-Memory-Based k-SAT
Solver in 65nm CMOS with 74.0% Solvability for 30-Variable
126-Clause 3-SAT Problems
Daehyun Kim, Nael Mizanur Rahman, Saibal Mukhopadhyay
Georgia Institute of Technology, Atlanta, GA
Boolean satisfiability (k-SAT, k ≥ 3) is an NP-complete combinatorial optimization problem (COP) with applications in communication, flight networks, supply chains, and finance, among others. ASICs for SAT and other COP solvers have been demonstrated using continuous-time dynamics [1], simulated annealing [2], oscillator interaction [3], and stochastic automata annealing [4]. However, prior designs show low solvability for complex problems ([1] shows 16% solvability for 30 variables and 126 clauses) and use small, fixed network topologies (King's graph [3], lattice graph [2], or 3-SAT [1]), limiting problem-solving flexibility. A digital fully connected processor enables flexibility but incurs large area, latency, and power overheads [4]. This paper presents a k-SAT solver in which a Continuous-Time Stochastic Recurrent Neural Network (CT-SRNN), controlled by a Discrete-Time Finite-State Machine (DT-FSM), uses unsupervised learning to search for an optimal solution (Fig. 29.1.1). A 65nm test-chip based on a Mixed-Signal Processing-in-Memory (MS-PIM) architecture is presented. Measured results demonstrate higher solvability (74.0% for 30 variables and 126 clauses, vs. 16% in [1]) and improved flexibility (k > 3, different numbers of variables per clause) in mapping k-SAT problems.
Figure 29.1.1 shows an overview of the proposed solver for a SAT problem (N variables and M clauses). The variables (v_n; ∀n ∈ (0, N−1)) are represented by binary stochastic neurons. A single-layer fully connected recurrent neural network (RNN) uses a weighted linear combination of the past Boolean states of the neurons to control a set of random processes that determine the neurons' current states. The k-SAT (k ≥ 3; each clause can have a different k) problems are programmed into a crossbar architecture to compute the Boolean states of the clauses (f_m; ∀m ∈ (0, M−1)). A digital FSM aggregates the states of all clauses to compute a current satisfiability score (CSC = the number of satisfied clauses) and updates the RNN weights to control the stochasticity of the neurons. Updates to the RNN weights (2b integers) are governed by stochastic unsupervised global and local learning rules. The temporal gradient of the CSC determines the probability of global learning (G) for all RNN weights; a positive (negative) gradient increases (decreases) G. The probability of local learning (L_k) of the weights connected to the k-th neuron is determined by flipping its state (v_k) and computing the change in the number of unsatisfied clauses; a reduction in the number of unsatisfied clauses increases L_k. The global and local learning probabilities and the current Boolean state of a neuron are used to update the weights associated with that neuron. Global learning guides the system to evolve towards higher CSC (global optima), while local learning helps escape from local minima. During early iterations, the neurons show a high degree of stochasticity due to continuous changes in the RNN weights (chaotic state). The weights and neurons converge to deterministic states over iterations.
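The search dynamics can be summarized in software. Below is a minimal behavioral sketch of the learning loop in Python, not the chip's mixed-signal implementation; the noise level, the G update step, the two-level L_k values, and the exact weight increment/decrement rule are illustrative assumptions, since the text specifies only the sign relationships.

```python
# Behavioral model of the CT-SRNN k-SAT search (software sketch, not the chip).
# Constants (noise sigma, G step, L_k levels, weight-update rule) are assumed.
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 126                                   # variables, clauses

# Clause matrix: C[m, 2k] marks v_k in clause m, C[m, 2k+1] marks NOT v_k.
C = np.zeros((M, 2 * N), dtype=np.int8)
for m in range(M):
    ks = rng.choice(N, 3, replace=False)         # 3 distinct variables per clause
    C[m, 2 * ks + rng.integers(0, 2, 3)] = 1     # random true/complement form

def literals(v):
    """Expand state v into [v_0, NOT v_0, v_1, NOT v_1, ...]."""
    lit = np.empty(2 * N, dtype=np.int8)
    lit[0::2], lit[1::2] = v, 1 - v
    return lit

def csc(v):
    """Current satisfiability score: clauses with at least one matching literal."""
    return int(((C @ literals(v)) > 0).sum())

W = rng.integers(-2, 2, size=(2 * N, N)).astype(np.int8)   # 2b signed weights
v = rng.integers(0, 2, N).astype(np.int8)
G, prev = 0.5, csc(v)

for it in range(20000):
    # Stochastic neurons: weighted sum of past states plus noise, thresholded.
    u = literals(v) @ W
    v = (u + rng.normal(0, 2.0, N) > 0).astype(np.int8)
    score = csc(v)
    if score == M:
        print(f"all {M} clauses satisfied after {it} iterations")
        break
    # Global learning probability follows the temporal gradient of the CSC.
    G = float(np.clip(G + 0.05 * np.sign(score - prev), 0.05, 0.95))
    prev = score
    # Local learning probability L_k: raise it if flipping v_k helps.
    L = np.empty(N)
    for k in range(N):
        v[k] ^= 1
        L[k] = 0.9 if csc(v) > score else 0.1
        v[k] ^= 1
    # Stochastic update: enabled weights push each neuron toward its new state.
    enable = (rng.random(W.shape) < G) | (rng.random(W.shape) < L[None, :])
    delta = literals(v)[:, None] * np.where(v[None, :] == 1, 1, -1)
    W = np.clip(W + enable * delta, -2, 1).astype(np.int8)
```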
Figure 29.1.2 shows the system architecture and operation flow. Mixed-Signal
Processing-in-Memory (MS-PIM) modules implement the RNN (RNN-PIM) and the
crossbar for mapping/computing SAT clauses (SAT-PIM). The stochastic neurons are
realized using analog circuits with controllable randomness. Digital FSMs compute the
RNN weight update probabilities (Probability Processor, PR-PC) and use these
probabilities and a pseudo-random number generator (PRNG) to stochastically update
the RNN weights (Weight Update Module, WUM). A SAT input filter is used to connect
the neuron outputs to SAT-PIM and support the learning rules.
An SRAM array with 128 rows and 128 columns (2KB of 8T-SRAM cells) implements the RNN-PIM for 64 variables and 2b weights (Fig. 29.1.3). The weights are updated using decoders and read/write peripherals connected to the WL (column) and BL/BLB (row) of the bit-cells, respectively. The PIM operation is performed via the IN and OUT terminals of the bit-cell. The outputs from the stochastic neurons (variable sets) are filtered using the RNN Input Filter (RIF) to generate PIM inputs via the SRAM rows. Each variable is associated with two rows. The odd (2j+1) and even (2j) rows are associated with the true (v_j) and complemented (v̄_j) forms of the j-th [∀j ∈ (0,63)] variable, respectively. RNN weights W_{2j+1,k} and W_{2j,k} represent the influences of v_j and v̄_j (past states of the j-th variable) on the next state of the k-th variable (v_k), respectively. The RIFs enable one (odd or even) row for each variable (depending on the variable's state) and disable both rows if a variable is not included in a problem (an example is shown in Fig. 29.1.3). Vector-matrix multiplication (VMM) results are accumulated as current on the column lines. Each pair of column lines is connected to a stochastic neuron that includes a current mirror, a noise generator, and a differential amplifier (Fig. 29.1.3). The current mirror for each neuron adds the currents of the corresponding column pair (with 2× gain for the current of the MSB column) and generates a membrane potential (V_IN) for the neuron. A programmable noise generator adds fluctuation to V_IN. The differential amplifier determines the next Boolean state of each variable by comparing the reference voltage with the noisy V_IN. The added programmable noise (controlled via V_REF and V_noise) and the inherent thermal noise of the differential amplifier lead to stochastic neuron behavior.
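To make the column readout concrete, here is a minimal sketch of one stochastic neuron; V_REF, the noise sigma, and the current-to-voltage scaling are assumed values, since the text gives none numerically.

```python
# One stochastic neuron of the RNN-PIM readout (software sketch; V_REF, the
# noise sigma, and the unit current-to-voltage scale are assumed values).
import numpy as np

rng = np.random.default_rng(1)

def neuron_next_state(i_msb, i_lsb, v_ref=0.5, noise_sigma=0.1):
    """Current mirror sums the column pair with 2x gain on the MSB column to
    form the membrane potential V_IN; programmable noise (plus the amplifier's
    thermal noise) is added; the differential amplifier thresholds against V_REF."""
    v_in = 2.0 * i_msb + i_lsb            # 2x MSB gain in the current mirror
    v_in += rng.normal(0.0, noise_sigma)  # programmable + thermal noise
    return int(v_in > v_ref)              # comparator output = next Boolean state

# Example: column currents accumulated from the enabled rows (arbitrary units).
print(neuron_next_state(i_msb=0.2, i_lsb=0.15))
```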
The WUM consists of a PRNG and comparators (Fig. 29.1.3). The PRNG generates four pseudo-random numbers (RNDs) using LFSRs and digitally mixes them to generate two RNDs. A set of comparators, each connected to a row of the RNN-PIM, generates the global/local learning enable (GLE/LLE) signals based on the two RNDs and the update probabilities. All weights connected to one output neuron [i.e., for the k-th neuron: W_{2j+1,k} and W_{2j,k}, ∀j ∈ (0,63)] are updated in parallel; weights for different neurons are updated in sequence. For the k-th neuron, the comparators generate GLE_{2j+1,k}, LLE_{2j+1,k}, GLE_{2j,k}, and LLE_{2j,k}, which are coupled with v_k to update (increment by '1', decrement by '1', or leave unchanged) W_{2j+1,k} and W_{2j,k}. The comparator reads, updates, and writes back the RNN weights in 2, 1, and 2 cycles, respectively. The optimal configurations (V_REF and V_noise) are determined by training and differ from chip to chip due to process variation.
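A software analogue of this update path might look as follows; the 16b LFSR polynomial, the two-way XOR mix (standing in for the chip's four-LFSR mixing), the 2b weight range, and the coupling of the enables with v_k are assumptions.

```python
# Sketch of the WUM's stochastic enable generation: LFSRs produce RNDs, and a
# comparator turns an update probability into an enable bit. The 16b taps and
# the two-way XOR mix are assumptions (the chip mixes four LFSR outputs).
def lfsr16(state):
    """One step of a 16b Fibonacci LFSR (taps 16, 14, 13, 11)."""
    fb = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | fb) & 0xFFFF

s1, s2 = 0xACE1, 0x1D2C
def next_rnd():
    """Advance both LFSRs and digitally mix their states into one RND."""
    global s1, s2
    s1, s2 = lfsr16(s1), lfsr16(s2)
    return s1 ^ s2

def enable(prob):
    """Comparator: assert the enable when the RND is below prob * 2^16."""
    return next_rnd() < int(prob * 65536)

# Update one 2b weight of the k-th neuron (signed range assumed to be [-2, 1]).
def update_weight(w, v_k, G, L_k):
    if enable(G) or enable(L_k):    # GLE or LLE fires for this row
        w += 1 if v_k else -1       # push the weight toward the neuron's state
    return max(-2, min(1, w))

print(update_weight(w=0, v_k=1, G=0.3, L_k=0.7))
```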
Figure 29.1.4 shows the SAT input filter, the SAT-PIM, and the PR-PC for computing the CSC and the weight update probabilities. Each row of the 256-row, 128-column SAT-PIM (consisting of 8T-SRAM cells) represents a clause. The columns represent the variables v_k and v̄_k. The SAT problem is mapped by programming the bit-cells of the SAT-PIM to indicate the presence ('1') or absence ('0') of v_k and v̄_k in a clause (an example is shown in Fig. 29.1.4). The SAT-PIM can map a maximum of 256 clauses, each with a maximum of 64 variables; different clauses may also have different numbers of variables. All the neuron states, expanded via the SAT input filters to create v_k and v̄_k, are simultaneously applied to the columns of the SAT-PIM. All SRAM bit-lines (OUT in the bit-cell), representing clauses, are pre-discharged. For each clause, any matching clause-input variable charges the corresponding bit-line, indicating that the clause is satisfied. The PR-PC accumulates the clause outputs to compute the CSC. The change in CSC between successive iterations is used to compute G (Fig. 29.1.4). To support local learning, the PR-PC also stores the set of unsatisfied clauses in an iteration. The SAT input filter inverts one variable (v_k) at a time, and the PR-PC determines the corresponding L_k by computing the change in the number of unsatisfied clauses. The L_k values for all variables are computed in sequence.
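The wired-OR clause evaluation and the flip-and-count local-learning sweep can be modeled briefly in software; the clause-matrix layout and the two-level L_k assignment below are assumptions (the text states only that a reduction in unsatisfied clauses increases L_k).

```python
# Software model of SAT-PIM clause evaluation and the L_k sweep. The clause
# matrix layout and the two-level L_k values (1.0 / 0.1) are assumptions.
import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 126
C = np.zeros((M, 2 * N), dtype=np.int8)   # C[m, 2k] = v_k, C[m, 2k+1] = NOT v_k
for m in range(M):
    ks = rng.choice(N, 3, replace=False)
    C[m, 2 * ks + rng.integers(0, 2, 3)] = 1

def clause_outputs(v):
    """Bit-lines are pre-discharged; any programmed cell whose column literal is
    asserted charges its line, so a clause reads '1' if any literal matches."""
    lit = np.empty(2 * N, dtype=np.int8)
    lit[0::2], lit[1::2] = v, 1 - v
    return (C @ lit) > 0

def local_learning_probs(v):
    """Invert one variable at a time and compare the unsatisfied-clause counts."""
    base_unsat = M - int(clause_outputs(v).sum())
    L = np.empty(N)
    for k in range(N):
        v[k] ^= 1                                 # SAT input filter inverts v_k
        unsat = M - int(clause_outputs(v).sum())
        L[k] = 1.0 if unsat < base_unsat else 0.1
        v[k] ^= 1                                 # restore the original state
    return L

v = rng.integers(0, 2, N).astype(np.int8)
print("CSC =", int(clause_outputs(v).sum()), "of", M)
```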
Figure 29.1.5 shows measurement results from a prototype 65nm test chip operating at 1.2V and 400MHz at room temperature. A random search approach, realized on the chip by disabling the PR-PC and the WUM (no learning), serves as a baseline. The measured CSC over time for a 30-variable, 126-clause 3-SAT problem shows that the proposed method achieves 100% CSC within 350μs, while the random search does not converge. The probability of achieving 95% or higher CSC within 1ms is computed over 1000 random 3-SAT problems. The random search shows 1% success for a 60-variable, 252-clause problem, compared to 100% success for the proposed method. Using only global or only local learning for the same problem yields 4.6% and 26.6% success rates, respectively. The evolution of the variables over time is visualized using t-distributed stochastic neighbor embedding (t-SNE) to reduce the 60-dimensional (variable) space to a 3-dimensional space; each marker represents a dimension-reduced variable-set, and the color represents the time-step. The random trials show chaotic search behavior even for an easy (low clause-to-variable ratio) problem. The proposed approach converges quickly to the optimum search area for easy problems, while harder (larger clause-to-variable ratio) problems show more chaotic search behavior. The design demonstrates 99% satisfiability with median run-times of 11.25ms and 125ms for 3-SAT problems with 30 variables/126 clauses and 60 variables/252 clauses, respectively.
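For reference, this dimensionality reduction can be reproduced with a standard t-SNE implementation; the trajectory array below is a random placeholder for the recorded variable states, and the perplexity is an assumed setting.

```python
# Reproducing the trajectory visualization: embed recorded 60b variable states
# into 3 dimensions with t-SNE. `traj` here is a random placeholder trajectory.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
traj = rng.integers(0, 2, size=(500, 60)).astype(float)   # (time-steps, variables)

emb = TSNE(n_components=3, perplexity=30.0, init="random").fit_transform(traj)
print(emb.shape)   # one 3-D point per time-step, colored by time in Fig. 29.1.5
```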
Figure 29.1.6 shows the measured data over 200 randomly generated k-SAT problems.
The solvability, defined as the percentage of these 200 cases in which all clauses are satisfied, decreases with an increasing number of variables and with a higher clause-to-variable ratio (M/N). The measurement shows 74.0% solvability for 30 variables with M/N = 4.2, considered the hard region for 3-SAT. The design is flexible enough to map and solve problems with higher k and mixed k. The 0.4mm² core consumes 32.5mW, dominated by the clock and the switching power of the FSM. In comparison to a prior SAT solver,
the design shows smaller area, higher solvability for hard problems, and more flexibility
in mapping problems. Fig. 29.1.7 shows the die photo and the chip specifications.
References:
[1] M. Chang et al., "An Analog Clock-free Compute Fabric based on Continuous-Time Dynamical System for Solving Combinatorial Optimization Problems," IEEE CICC, 2022.
[2] Y. Su et al., “FlexSpin: A Scalable CMOS Ising Machine with 256 Flexible Spin
Processing Elements for Solving Complex Combinatorial Optimization Problems,” ISSCC,
pp. 274-275, 2022.
[3] I. Ahmed et al., “A Probabilistic Self-Annealing Compute Fabric Based on 560
Hexagonally Coupled Ring Oscillators for Solving Combinatorial Optimization Problems,”
IEEE Symp. VLSI Circuits, 2020.
[4] K. Yamamoto et al., “STATICA: A 512-Spin 0.25M-Weight Full-Digital Annealing
Processor with a Near-Memory All-Spin-Updates-at-Once Architecture for Combinatorial
Optimization with Complete Spin-Spin Interactions,” ISSCC, pp. 138-139, 2020.