Session 13 Overview: Ideas for the Future
TECHNOLOGY DIRECTIONS SUBCOMMITTEE
This session illustrates the scope of what integrated circuits may become in the future. The back-end-of-line (BEOL) hosts 3D memory in 13.1
and dielectric waveguides in 13.5. Signal processing is shown with low-power mixed-signal circuits in 13.2 and with reconfigurable photonics
in 13.6. Energy harvesting circuit operation is enabled in 0.1mm-class systems in 13.3 and in wafer scribe lines in 13.4.
Session Chair: Sudip Shekhar
University of British Columbia
Vancouver, Canada
Session Co-Chair: Daniel Morris
Meta, Menlo Park, CA
• 2023 IEEE International Solid-State Circuits Conference
ISSCC 2023 / SESSION 13 / IDEAS FOR THE FUTURE / OVERVIEW
978-1-6654-9016-0/23/$31.00 ©2023 IEEE
ISSCC 2023 / February 21, 2023

10:15 AM
13.1 Crystalline Oxide Semiconductor-based 3D Bank Memory System for Endpoint Artificial Intelligence with Multiple Neural Networks Facilitating Context Switching and Power Gating
Yuto Yakubo, Semiconductor Energy Laboratory, Atsugi, Japan
In Paper 13.1, SEL and Fukuoka University present a 3D AI SoC with one tier of CMOS logic and two tiers of oxide-semiconductor memory that provide retention, enabling intermittent operation at 25.15μW.

10:45 AM
13.2 A 47nW Mixed-Signal Voice Activity Detector (VAD) Featuring a Non-Volatile Capacitor-ROM, a Short-Time CNN Feature Extractor and an RNN Classifier
Jinhai Lin, University of Macau, Macau, China
In Paper 13.2, the University of Macau and Instituto Superior Tecnico/University of Lisboa describe a mixed-signal voice-activity detector that consumes 47nW and achieves >92% accuracy. The chip employs an RNN-based classifier and a ROM/capacitor-based multiply-accumulate function.

11:15 AM
13.3 A Triturated Sensing System
Noriyuki Miura, Osaka University, Suita, Japan
In Paper 13.3, Osaka University and Kobe University describe a system consisting of clusters of 0.1mm-class battery, power, and data-link ICs in close proximity, leveraging inductive-coupling power-line communication. A temperature-sensor application is demonstrated as a use case.

11:30 AM
13.4 A Self-Programming PUF Harvesting the High-Energy Plasma During Fabrication
Kotaro Naruse, Osaka University, Suita, Japan
In Paper 13.4, Osaka University harvests the high-energy plasma with antennas built in a dicing street to power up circuits during the semiconductor fabrication process. An oxide-breakdown PUF implementation is demonstrated.

11:45 AM
13.5 Subtractive Photonic Waveguide-Coupled Photodetectors in 180nm Bulk CMOS
Craig Ives, California Institute of Technology, Pasadena, CA
In Paper 13.5, Caltech introduces a technique that uses the BEOL dielectric layers to form photonic waveguides in a standard bulk CMOS process. Transmission in the visible and IR is demonstrated, along with waveguide-coupled photodiodes.

12:00 PM
13.6 A Silicon Photonic Reconfigurable Optical Analog Processor (SiROAP) with a 4x4 Optical Mesh
Md Jubayer Shawon, University of Delaware, Newark, DE
In Paper 13.6, the University of Delaware presents a reconfigurable optical analog processor based on a large-scale optical mesh fabricated in a CMOS-compatible silicon photonics process.

DIGEST OF TECHNICAL PAPERS •
13.1 Crystalline Oxide Semiconductor-based 3D Bank Memory
System for Endpoint Artificial Intelligence with Multiple Neural
Networks Facilitating Context Switching and Power Gating
Yuto Yakubo¹, Kazuma Furutani¹, Kouhei Toyotaka¹, Haruki Katagiri¹, Masashi Fujita¹, Munehiro Kozuma¹, Yoshinori Ando¹, Yoshiyuki Kurokawa¹, Toru Nakura², Shunpei Yamazaki¹
¹Semiconductor Energy Laboratory, Atsugi, Japan
²Fukuoka University, Fukuoka, Japan
Endpoint artificial intelligence (AI) requires high flexibility in AI processes under strict
cost and power limitations. This work aims to achieve a chip capable of executing AI
processes at low power while periodically switching the context of multiple neural
networks (NNs) in a small chip area. Transistors fabricated using a crystalline oxide
semiconductor (OS) such as indium–gallium–zinc oxide exhibit an extremely low off-
state current. Such transistors have high compatibility with Si CMOS processes and
multiple OS transistor layers can be stacked [1]. A normally-off (Noff) CPU using OS
memory as FF backup memory to enable power gating (PG) has been reported [2]. A
structure where the Noff CPU has a high-efficiency AI accelerator (ACC) could be a
candidate for an endpoint AI chip. Nevertheless, the ACC requires large-scale memory
to switch between AI processes; otherwise, it wastes power and time on data
rewriting, making power reduction infeasible. Moreover, the chip must be
adapted for another NN by context switching in which not only weight data but also FF
data are quickly switched. The challenge is to secure large-scale memory and achieve
context switching with low latency. To meet the challenge, ACC memory for NN weight
data, FF backup memory, and CPU memory used as instruction and data memory are
3D stacked via an OS transistor stacking technique where OS memory in each layer
serves as a bank (Fig. 13.1.1). As proof of this concept, a test chip was fabricated through
the OS/OS/Si process (130nm Si CMOS and two layers of 200nm OS). In the system,
bank switching of the ACC memory is linked with bank switching of the FF backup
memory, and inference of different NNs is switched with low latency and power so that
the PG standby time is extended.
Figure 13.1.2 shows the chip configuration and circuits. The chip consists of Cortex-M0
(Core), an ACC including 128 MAC processing elements (PEs) and two layers of 32 KB
ACC memory, two layers of 4 KB CPU memory, a PMU, and peripheral circuits. All logic blocks
except the PMU are capable of PG. In an OSFF, each OS memory layer with a 3T1C/unit
configuration is stacked on a scan FF with no area overheads. A fine-grained random
arrangement is possible by taking advantage of monolithic stacking. The ACC supports
a binary NN (BNN) for low-power operation. Balancing the trade-off between driver-area
reduction and latency improvement, both set by the number of memory-block divisions,
a block-by-block arrangement is adopted: eight blocks, each comprising
16 PEs sharing two layers of 4 KB memory. Eight MAC operations are
executed in parallel in one clock, and results are temporarily stored in the accumulator.
After the MAC operations are repeated in accordance with the number of inputs
(neurons), threshold judgment (biasing) is performed to complete arithmetic operations
for one layer of the NN. Furthermore, a maximum of 128 PEs can be driven in parallel.
In the case of a fully connected network with three hidden layers, the inference requires
194 clock cycles. A layer selection driver fabricated using only OS transistors is used to
determine an accessible memory layer. Additionally, a bootstrap circuit is used to
suppress the Vt drop on the read and write word lines (RWL and WWL) that an NMOS
switch would otherwise cause. The layer selection driver and memory elements can be concurrently
embedded in the OS layer. Therefore, there are no area overheads even with the increased
number of stacked layers. Moreover, there is no need to change the address size of a
CMOS driver, and neither area nor power is increased.
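As a rough illustration of the arithmetic described above (not the chip's PE implementation), one fully connected BNN layer performs MAC operations over ±1 inputs and weights, then a threshold judgment produces the next layer's activations. A minimal Python sketch with made-up values:

```python
def bnn_layer(x, weights, thresh=0):
    """One fully connected binary-NN layer: multiply-accumulate of +/-1
    inputs and weights, then threshold judgment (biasing) to produce the
    +/-1 activations passed to the next layer of the network."""
    out = []
    for w_row in weights:                               # one row per output neuron
        acc = sum(w * xi for w, xi in zip(w_row, x))    # MAC, accumulated
        out.append(1 if acc >= thresh else -1)          # threshold judgment
    return out

# Toy 4-input, 2-output layer; the values are illustrative only.
x = [1, -1, 1, 1]
weights = [[1, 1, -1, 1],
           [-1, 1, 1, -1]]
result = bnn_layer(x, weights)   # -> [1, -1]
```

On the chip this loop is unrolled across PEs, with eight MAC operations per clock per block.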
Figure 13.1.3 presents a timing chart of context switching and PG, and the performance
of each circuit block. In the OSFF, the context switching is performed in the following
manner: data are backed up in response to a backup signal BK [0] (BK [1]) of the first
(second) layer of the OS memory that corresponds to context 0 (1), the backed up data
of context 1 (0) are written back to the FF in response to a restore signal RE [1] (RE [0]),
and the task and results are backed up by BK[1] (BK[0]). After data backup, the chip can
enter a sleep mode with PG. Chip evaluation results demonstrate that write and read
energies are 510 and 111fJ/bit, respectively, when data in 4045 FFs are concurrently
backed up and restored in 160ns and 180ns. The ACC memory is capable of context
switching by just switching layer selection signals. It is possible to access memory cells
in a row of any selected OS memory layer when the RWL is activated by the CMOS driver.
Since the OS memory retains data during PG, no special operations are required. The
chip evaluation results show that the ACC’s top energy efficiency is 4.44TOPS/W (PECLK
400 kHz, System Clock 10MHz, including ACC control logic power). The critical path of
our system is the memory read for inference, and classification accuracy degrades beyond
the maximum frequency. The energy for inference (MNIST) using only the CPU memory and
the core is 1681.97μJ, whereas energy for inference using the ACC is 0.19μJ. The
inference time is reduced from 3.55s to 485μs. Therefore, our ACC enables inference
according to the frame rate of imaging data (e.g., 60fps and 16ms).
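The OSFF backup/restore sequence described above can be modeled as a toy two-bank register. This sketch follows the paper's BK/RE signal names but is purely illustrative, not the circuit itself:

```python
class OSFF:
    """Toy model of an OSFF: a scan FF with two stacked OS memory layers,
    one per context. The OS layers retain their data through power gating."""
    def __init__(self):
        self.ff = 0                 # the scan FF's current value
        self.layer = [None, None]   # two OS memory layers (banks)

    def bk(self, i):                # BK[i]: back up FF -> OS layer i
        self.layer[i] = self.ff

    def re(self, i):                # RE[i]: restore OS layer i -> FF
        self.ff = self.layer[i]

# Switch from context 0 to context 1 and back again.
f = OSFF()
f.ff = 1; f.bk(0)   # running context 0; back up its state
f.ff = 0; f.bk(1)   # context 1 runs and is backed up in turn
f.re(0)             # restoring context 0 brings back its FF state (1)
```

Because both layers survive PG, the restore needs no refresh or reload from off-chip memory, which is what makes the low-latency switch possible.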
The effect of power reduction when performing context switching and PG is compared
between an OS/Si chip and a Si (SRAM) chip, as shown in Fig. 13.1.4. The OS/Si chip is
fabricated by stacking only one layer of OS memory on a CMOS circuit. The Si chip does
not use OS and the ACC consists of PEs and SRAM. Since the SRAM is volatile memory,
the Si chip reduces standby power by clock gating (CG). The power of these chips is
estimated under intermittent operations in which the inference is performed while
switching two NNs, and then, PG (CG) is performed. The OS/Si and Si chips (estimated
based on the SRAM generator) can only retain data for one NN in the ACC memory. Thus,
every inference requires weight data rewriting. The stacked OS/OS/Si process enables
instant context switching and results in low power by allocating the time for PG. At room
temperature, the power consumed by ACC inference, by writing weight data to the ACC
memory, and by PG is 386.5, 637.4, and 0.89μW, respectively.
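The benefit of extending the PG time follows from simple duty-cycle arithmetic. The sketch below uses the paper's ACC inference power (386.5μW), PG power (0.89μW), and the ~485μs inference time per 60fps frame; it captures only the ACC's contribution, while the other blocks bring the reported chip average up to 25.15μW:

```python
def avg_power_uw(p_active_uw, t_active_s, p_pg_uw, cycle_s):
    # Duty-cycled average: active for t_active, power-gated for the rest.
    t_pg = cycle_s - t_active_s
    return (p_active_uw * t_active_s + p_pg_uw * t_pg) / cycle_s

# ACC-only contribution over one 60fps frame (~16.7ms cycle): ~12uW,
# dominated by the short inference burst rather than the long PG interval.
acc_avg = avg_power_uw(386.5, 485e-6, 0.89, 1 / 60)
```

Any cycle time spent rewriting weight data at 637.4μW, as the single-bank chips must, would add directly to this average, which is the paper's central argument for bank switching.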
When inference is performed at 60fps, the average power of the chip is 25.15μW, which
is 79% lower than that of the Si chip. Since the OS/Si chip is capable of PG, its average
power is lower than that of the Si chip when the cycle time is longer than 16ms.
Additionally, the average power of the OS/Si chip is equivalent to or lower than that of
the Si chip when the cycle time is 16ms because the PG and CG time is short and thus
the contribution of PG and CG to the average power is small. This power reduction is
also effective when inference is performed with four stacked layers of OS
memory and four NNs are switched: the chip can be driven at an average power
of 39.6μW with no increase in chip area. We confirmed this memory retains data for at
least 1h at room temperature. That is, the memory refresh is required once every 225,000
intermittent operations with a cycle time of 16ms, and the impact of refresh power is
almost negligible. The power and area of an ACC block where 16 PEs and memory are
regarded as one set, as shown in Fig. 13.1.2, and a Si (SRAM) ACC block are estimated,
each with increased memory capacity. The area and standby-power advantages
become more apparent as the number of stacked memory layers increases, and the
active-power advantage appears with four stacked layers of memory.
memory capacity, the Si (SRAM) ACC block has a larger number of word line drivers
because of address size expansion, greater bit line load, and a higher memory leakage
current. By contrast, the OS memory with increased memory capacity does not increase
the number of word line drivers as described above, and uses a memory cell with the
same bit line load (just switching the OS memory layer) and a low leakage current. Thus,
power and area do not change significantly. Accordingly, eliminating ACC memory
rewriting through instant context switching with OS bank memory, together with the
longer PG execution time, yields power and area benefits in the multi-layer OS
stacking process with two or more layers of OS memory. This demonstrates the system’s
effectiveness (Fig. 13.1.5).
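The refresh budget quoted above follows directly from the measured retention; the arithmetic, stated in integer milliseconds to keep it exact:

```python
# OS memory retains data for at least 1 hour at room temperature, so with a
# 16ms intermittent-operation cycle a refresh is needed only once every
# 3,600,000ms / 16ms = 225,000 cycles.
retention_ms = 3600 * 1000   # >= 1 hour of retention
cycle_ms = 16                # intermittent-operation cycle time
ops_per_refresh = retention_ms // cycle_ms   # 225000
```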
Figure 13.1.6 compares our chip with other chips including embedded microcontrollers
[3–6]. Moreover, Fig. 13.1.7 displays a die micrograph and a cross-sectional image.
References:
[1] M. Oota et al., “3D-Stacked CAAC-In-Ga-Zn Oxide FETs with Gate Length of 72nm,”
IEEE IEDM, pp. 3.2.1–3.2.4, 2019.
[2] T. Ishizu et al., “A 48 MHz 880-nW Standby Power Normally-Off MCU with 1 Clock
Full Backup and 4.69-μs Wakeup Featuring 60-nm Crystalline In–Ga–Zn Oxide BEOL-
FETs,” IEEE Symp. VLSI Circuits, pp. 48–49, 2019.
[3] T. F. Wu et al., “A 43pJ/Cycle Non-Volatile Microcontroller with 4.7μs
Shutdown/Wake-up Integrating 2.3-bit/Cell Resistive RAM and Resilience Techniques,”
ISSCC, pp. 226–227, Feb. 2019.
[4] N. Sakimura et al., “A 90nm 20MHz Fully Nonvolatile Microcontroller for Standby-
Power-Critical Applications,” ISSCC, pp. 184–185, Feb. 2014.
[5] M. Chang et al., “A 40nm 60.64TOPS/W ECC-Capable Compute-in-Memory/Digital
2.25MB/768KB RRAM/SRAM System with Embedded Cortex M3 Microprocessor for
Edge Recommendation Systems,” ISSCC, pp. 270–271, Feb. 2022.
[6] J. Wang et al., “A Compute SRAM with Bit-Serial Integer/Floating-Point Operations
for Programmable In-Memory Vector Acceleration,” ISSCC, pp. 224–225, Feb. 2019.
Figure 13.1.1: Motivation and concept for this work.
Figure 13.1.2: Chip configuration and circuits.
Figure 13.1.3: Timing chart, OSFF operations, ACC operations, and ACC
performance.
Figure 13.1.4: Chip power consumption and intermittent operation (Si, OS/Si, and
OS/OS/Si).
Figure 13.1.5: Chip performance.
Figure 13.1.6: Comparison table.