Session 13 Overview: Ideas for the Future
TECHNOLOGY DIRECTIONS SUBCOMMITTEE
This session illustrates the scope of what integrated circuits may become in the future. The back-end-of-line (BEOL) hosts 3D memory in 13.1
and dielectric waveguides in 13.5. Signal processing is shown with low-power mixed-signal circuits in 13.2 and with reconfigurable photonics
in 13.6. Energy harvesting circuit operation is enabled in 0.1mm-class systems in 13.3 and in wafer scribe lines in 13.4.
Session Chair: Sudip Shekhar
University of British Columbia
Vancouver, Canada
Session Co-Chair: Daniel Morris
Meta, Menlo Park, CA
• 2023 IEEE International Solid-State Circuits Conference
ISSCC 2023 / SESSION 13 / IDEAS FOR THE FUTURE / OVERVIEW
978-1-6654-9016-0/23/$31.00 ©2023 IEEE
ISSCC 2023 / February 21, 2023

10:15 AM
13.1 Crystalline Oxide Semiconductor-based 3D Bank Memory System for Endpoint Artificial Intelligence with Multiple Neural Networks Facilitating Context Switching and Power Gating
Yuto Yakubo, Semiconductor Energy Laboratory, Atsugi, Japan
In Paper 13.1, SEL and Fukuoka University present a 3D AI SoC with one tier of CMOS logic and two tiers of oxide-semiconductor memory that provide retention, enabling intermittent operation at 25.15μW.

10:45 AM
13.2 A 47nW Mixed-Signal Voice Activity Detector (VAD) Featuring a Non-Volatile Capacitor-ROM, a Short-Time CNN Feature Extractor and an RNN Classifier
Jinhai Lin, University of Macau, Macau, China
In Paper 13.2, the University of Macau and Instituto Superior Tecnico/University of Lisboa describe a mixed-signal voice-activity detector that consumes 47nW and achieves >92% accuracy. The chip employs an RNN-based classifier and a ROM/capacitor-based multiply-accumulate function.

11:15 AM
13.3 A Triturated Sensing System
Noriyuki Miura, Osaka University, Suita, Japan
In Paper 13.3, Osaka University and Kobe University describe a system consisting of clusters of 0.1mm-class battery, power, and data-link ICs in close proximity, leveraging inductive-coupling power-line communication. A temperature-sensor application is demonstrated as a use case.

11:30 AM
13.4 A Self-Programming PUF Harvesting the High-Energy Plasma During Fabrication
Kotaro Naruse, Osaka University, Suita, Japan
In Paper 13.4, Osaka University harvests the high-energy plasma with antennas built in a dicing street to power up circuits during the semiconductor fabrication process. An oxide-breakdown PUF implementation is demonstrated.

11:45 AM
13.5 Subtractive Photonic Waveguide-Coupled Photodetectors in 180nm Bulk CMOS
Craig Ives, California Institute of Technology, Pasadena, CA
In Paper 13.5, Caltech introduces a technique that uses the BEOL dielectric layers to form photonic waveguides in a standard bulk CMOS process. Transmission in the visible and IR is demonstrated, along with waveguide-coupled photodiodes.

12:00 PM
13.6 A Silicon Photonic Reconfigurable Optical Analog Processor (SiROAP) with a 4x4 Optical Mesh
Md Jubayer Shawon, University of Delaware, Newark, DE
In Paper 13.6, the University of Delaware presents a reconfigurable optical analog processor based on a large-scale optical mesh fabricated in a CMOS-compatible silicon photonics process.

DIGEST OF TECHNICAL PAPERS •
13.1 Crystalline Oxide Semiconductor-based 3D Bank Memory
System for Endpoint Artificial Intelligence with Multiple Neural
Networks Facilitating Context Switching and Power Gating
Yuto Yakubo¹, Kazuma Furutani¹, Kouhei Toyotaka¹, Haruki Katagiri¹, Masashi Fujita¹, Munehiro Kozuma¹, Yoshinori Ando¹, Yoshiyuki Kurokawa¹, Toru Nakura², Shunpei Yamazaki¹
¹Semiconductor Energy Laboratory, Atsugi, Japan
²Fukuoka University, Fukuoka, Japan
Endpoint artificial intelligence (AI) requires high flexibility in AI processes under strict
cost and power limitations. This work aims to achieve a chip capable of executing AI
processes at low power while periodically switching the context of multiple neural
networks (NNs) in a small chip area. Transistors fabricated using a crystalline oxide
semiconductor (OS) such as indium–gallium–zinc oxide exhibit an extremely low off-
state current. Such transistors have high compatibility with Si CMOS processes and
multiple OS transistor layers can be stacked [1]. A normally-off (Noff) CPU using OS
memory as FF backup memory to enable power gating (PG) has been reported [2]. A
structure where the Noff CPU has a high-efficiency AI accelerator (ACC) could be a
candidate for an endpoint AI chip. Nevertheless, the ACC requires large-scale memory
to switch between AI processes; otherwise, it wastes power and time on data
rewriting, making power reduction infeasible. Moreover, the chip must be
adapted for another NN by context switching in which not only weight data but also FF
data are quickly switched. The challenge is to secure large-scale memory and achieve
context switching with low latency. To meet the challenge, ACC memory for NN weight
data, FF backup memory, and CPU memory used as instruction and data memory are
3D stacked via an OS transistor stacking technique where OS memory in each layer
serves as a bank (Fig. 13.1.1). As proof of this concept, a test chip was fabricated through
the OS/OS/Si process (130nm Si CMOS and two layers of 200nm OS). In the system,
bank switching of the ACC memory is linked with bank switching of the FF backup
memory, and inference of different NNs is switched with low latency and power so that
the PG standby time is extended.
Figure 13.1.2 shows the chip configuration and circuits. The chip consists of Cortex-M0
(Core), an ACC including 128 MAC processing elements (PEs) and two layers of 32 KB
ACC memory, two layers of 4 KB CPU memory, a PMU, and peripheral circuits. All logic blocks
except the PMU are capable of PG. In an OSFF, each OS memory layer with a 3T1C/unit
configuration is stacked on a scan FF with no area overheads. A fine-grained random
arrangement is possible by taking advantage of monolithic stacking. The ACC supports
a binary NN (BNN) for low-power operation. Balancing the trade-off between driver-area
reduction and latency improvement, both set by the number of memory-block divisions,
a block-by-block arrangement is adopted: eight blocks, each comprising
16 PEs sharing two layers of 4 KB memory. Eight MAC operations are
executed in parallel in one clock, and results are temporarily stored in the accumulator.
After the MAC operations are repeated in accordance with the number of inputs
(neurons), threshold judgment (biasing) is performed to complete arithmetic operations
for one layer of the NN. Furthermore, a maximum of 128 PEs can be driven in parallel.
In the case of a fully connected network with three hidden layers, the inference requires
194 clock cycles. A layer selection driver fabricated using only OS transistors is used to
determine an accessible memory layer. Additionally, a bootstrap circuit is used to
suppress the Vt drop on the read and write word lines (RWL and WWL) that an NMOS
switch would otherwise cause. The layer selection driver and memory elements can be concurrently
embedded in the OS layer. Therefore, there are no area overheads even with the increased
number of stacked layers. Moreover, there is no need to change the address size of a
CMOS driver, and neither area nor power is increased.
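As a rough illustration of the arithmetic described above (not the chip's PE implementation), one fully connected BNN layer performs MAC operations over ±1 inputs and weights, then a threshold judgment produces the next layer's activations. A minimal Python sketch with made-up values:

```python
def bnn_layer(x, weights, thresh=0):
    """One fully connected binary-NN layer: multiply-accumulate of +/-1
    inputs and weights, then threshold judgment (biasing) to produce the
    +/-1 activations passed to the next layer of the network."""
    out = []
    for w_row in weights:                               # one row per output neuron
        acc = sum(w * xi for w, xi in zip(w_row, x))    # MAC, accumulated
        out.append(1 if acc >= thresh else -1)          # threshold judgment
    return out

# Toy 4-input, 2-output layer; the values are illustrative only.
x = [1, -1, 1, 1]
weights = [[1, 1, -1, 1],
           [-1, 1, 1, -1]]
result = bnn_layer(x, weights)   # -> [1, -1]
```

On the chip this loop is unrolled across PEs, with eight MAC operations per clock per block.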
Figure 13.1.3 presents a timing chart of context switching and PG, and the performance
of each circuit block. In the OSFF, the context switching is performed in the following
manner: data are backed up in response to a backup signal BK [0] (BK [1]) of the first
(second) layer of the OS memory that corresponds to context 0 (1), the backed up data
of context 1 (0) are written back to the FF in response to a restore signal RE [1] (RE [0]),
and the task and results are backed up by BK[1] (BK[0]). After data backup, the chip can
enter a sleep mode with PG. Chip evaluation results demonstrate that write and read
energies are 510 and 111fJ/bit, respectively, when data in 4045 FFs are concurrently
backed up and restored in 160ns and 180ns. The ACC memory is capable of context
switching by just switching layer selection signals. It is possible to access memory cells
in a row of any selected OS memory layer when the RWL is activated by the CMOS driver.
Since the OS memory retains data during PG, no special operations are required. The
chip evaluation results show that the ACC’s top energy efficiency is 4.44TOPS/W (PECLK
400 kHz, System Clock 10MHz, including ACC control logic power). The critical path of
our system is the memory read for inference, and classification accuracy degrades beyond
the maximum frequency. The energy for inference (MNIST) using only the CPU memory and
the core is 1681.97μJ, whereas energy for inference using the ACC is 0.19μJ. The
inference time is reduced from 3.55s to 485μs. Therefore, our ACC enables inference
according to the frame rate of imaging data (e.g., 60fps and 16ms).
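The OSFF backup/restore sequence described above can be modeled as a toy two-bank register. This sketch follows the paper's BK/RE signal names but is purely illustrative, not the circuit itself:

```python
class OSFF:
    """Toy model of an OSFF: a scan FF with two stacked OS memory layers,
    one per context. The OS layers retain their data through power gating."""
    def __init__(self):
        self.ff = 0                 # the scan FF's current value
        self.layer = [None, None]   # two OS memory layers (banks)

    def bk(self, i):                # BK[i]: back up FF -> OS layer i
        self.layer[i] = self.ff

    def re(self, i):                # RE[i]: restore OS layer i -> FF
        self.ff = self.layer[i]

# Switch from context 0 to context 1 and back again.
f = OSFF()
f.ff = 1; f.bk(0)   # running context 0; back up its state
f.ff = 0; f.bk(1)   # context 1 runs and is backed up in turn
f.re(0)             # restoring context 0 brings back its FF state (1)
```

Because both layers survive PG, the restore needs no refresh or reload from off-chip memory, which is what makes the low-latency switch possible.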
The effect of power reduction when performing context switching and PG is compared
between an OS/Si chip and a Si (SRAM) chip, as shown in Fig. 13.1.4. The OS/Si chip is
fabricated by stacking only one layer of OS memory on a CMOS circuit. The Si chip does
not use OS and the ACC consists of PEs and SRAM. Since the SRAM is volatile memory,
the Si chip reduces standby power by clock gating (CG). The power of these chips is
estimated under intermittent operations in which the inference is performed while
switching two NNs, and then, PG (CG) is performed. The OS/Si and Si chips (estimated
based on the SRAM generator) can only retain data for one NN in the ACC memory. Thus,
every inference requires weight data rewriting. The stacked OS/OS/Si process enables
instant context switching and results in low power by allocating the time for PG. At room
temperature, the power consumed by ACC inference, by writing weight data to the ACC
memory, and by PG is 386.5, 637.4, and 0.89μW, respectively.
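The benefit of extending the PG time follows from simple duty-cycle arithmetic. The sketch below uses the paper's ACC inference power (386.5μW), PG power (0.89μW), and the ~485μs inference time per 60fps frame; it captures only the ACC's contribution, while the other blocks bring the reported chip average up to 25.15μW:

```python
def avg_power_uw(p_active_uw, t_active_s, p_pg_uw, cycle_s):
    # Duty-cycled average: active for t_active, power-gated for the rest.
    t_pg = cycle_s - t_active_s
    return (p_active_uw * t_active_s + p_pg_uw * t_pg) / cycle_s

# ACC-only contribution over one 60fps frame (~16.7ms cycle): ~12uW,
# dominated by the short inference burst rather than the long PG interval.
acc_avg = avg_power_uw(386.5, 485e-6, 0.89, 1 / 60)
```

Any cycle time spent rewriting weight data at 637.4μW, as the single-bank chips must, would add directly to this average, which is the paper's central argument for bank switching.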
When inference is performed at 60fps, the average power of the chip is 25.15μW, which
is 79% lower than that of the Si chip. Since the OS/Si chip is capable of PG, its average
power is lower than that of the Si chip when the cycle time is longer than 16ms.
Additionally, the average power of the OS/Si chip is equivalent to or lower than that of
the Si chip when the cycle time is 16ms because the PG and CG time is short and thus
the contribution of PG and CG to the average power is small. This power reduction is
also effective when inference is performed with four stacked layers of OS
memory and four NNs are switched: the chip can be driven at an average power
of 39.6μW with no increase in chip area. We confirmed this memory retains data for at
least 1h at room temperature. That is, the memory refresh is required once every 225,000
intermittent operations with a cycle time of 16ms, and the impact of refresh power is
almost negligible. The power and area of an ACC block where 16 PEs and memory are
regarded as one set, as shown in Fig. 13.1.2, and a Si (SRAM) ACC block are estimated,
each with increased memory capacity. The area and standby-power advantages
become more apparent as the number of stacked memory layers increases, and the
active-power advantage appears with four stacked layers of memory.
memory capacity, the Si (SRAM) ACC block has a larger number of word line drivers
because of address size expansion, greater bit line load, and a higher memory leakage
current. By contrast, the OS memory with increased memory capacity does not increase
the number of word line drivers as described above, and uses a memory cell with the
same bit line load (just switching the OS memory layer) and a low leakage current. Thus,
power and area do not change significantly. Accordingly, eliminating ACC memory
rewriting through instant context switching with OS bank memory, together with the
longer PG execution time, yields power and area benefits in the multi-layer OS
stacking process with two or more layers of OS memory. This demonstrates the system’s
effectiveness (Fig. 13.1.5).
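The refresh budget quoted above follows directly from the measured retention; the arithmetic, stated in integer milliseconds to keep it exact:

```python
# OS memory retains data for at least 1 hour at room temperature, so with a
# 16ms intermittent-operation cycle a refresh is needed only once every
# 3,600,000ms / 16ms = 225,000 cycles.
retention_ms = 3600 * 1000   # >= 1 hour of retention
cycle_ms = 16                # intermittent-operation cycle time
ops_per_refresh = retention_ms // cycle_ms   # 225000
```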
Figure 13.1.6 compares our chip with other chips including embedded microcontrollers
[3–6]. Moreover, Fig. 13.1.7 displays a die micrograph and a cross-sectional image.
References:
[1] M. Oota et al., “3D-Stacked CAAC-In-Ga-Zn Oxide FETs with Gate Length of 72nm,”
IEEE IEDM, pp. 3.2.1–3.2.4, 2019.
[2] T. Ishizu et al., “A 48 MHz 880-nW Standby Power Normally-Off MCU with 1 Clock
Full Backup and 4.69-μs Wakeup Featuring 60-nm Crystalline In–Ga–Zn Oxide BEOL-
FETs,” IEEE Symp. VLSI Circuits, pp. 48–49, 2019.
[3] T. F. Wu et al., “A 43pJ/Cycle Non-Volatile Microcontroller with 4.7μs
Shutdown/Wake-up Integrating 2.3-bit/Cell Resistive RAM and Resilience Techniques,”
ISSCC, pp. 226–227, Feb. 2019.
[4] N. Sakimura et al., “A 90nm 20MHz Fully Nonvolatile Microcontroller for Standby-
Power-Critical Applications,” ISSCC, pp. 184–185, Feb. 2014.
[5] M. Chang et al., “A 40nm 60.64TOPS/W ECC-Capable Compute-in-Memory/Digital
2.25MB/768KB RRAM/SRAM System with Embedded Cortex M3 Microprocessor for
Edge Recommendation Systems,” ISSCC, pp. 270–271, Feb. 2022.
[6] J. Wang et al., “A Compute SRAM with Bit-Serial Integer/Floating-Point Operations
for Programmable In-Memory Vector Acceleration,” ISSCC, pp. 224–225, Feb. 2019.
Figure 13.1.1: Motivation and concept for this work.
Figure 13.1.2: Chip configuration and circuits.
Figure 13.1.3: Timing chart, OSFF operations, ACC operations, and ACC
performance.
Figure 13.1.4: Chip power consumption and intermittent operation (Si, OS/Si, and
OS/OS/Si).
Figure 13.1.5: Chip performance.
Figure 13.1.6: Comparison table.