Multi-bit Transient Fault Control for NoC Links Using the 2D Fault Coding Method - CSDN Wenku

14 views · Uploaded 2021-03-06 22:37:35 · 324KB PDF

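At deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques incur large area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, called the 2D fault coding method, is adopted to address multi-bit transient faults on NoC links. Its key innovation is to treat the wires of a link as a matrix and to perform lightweight Parity Check Coding (PCC) on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to locate the faults, which are then corrected by simply inverting them. The procedure for protecting a NoC link with the 2D fault coding method is proposed, its correction and detection capability is analyzed, and a hardware implementation is carried out. Comparative experiments show that the scheme can greatly reduce ECC hardware cost, achieve higher fault detection coverage, maintain an almost zero silent fault rate, and achieve a higher fault correction rate normalized under the same area, indicating that it is cost-effective and suitable for multi-bit transient fault control of NoC links.

### Multi-bit Transient Fault Control for NoC Links Using the 2D Fault Coding Method

#### 1. Introduction

As chip technology moves to deep nanometer scale, i.e., transistor feature sizes shrink to 45nm, 40nm, 28nm, and even smaller, integrated circuits operate at high frequency and low voltage, which makes them more sensitive to transient and permanent faults. Transient faults are estimated to account for roughly 80% of all faults[1]. Ensuring link reliability has therefore become an important challenge in large-scale Network-on-Chip (NoC) design. At deep nanometer scale, transient fault tolerance of NoC links faces new phenomena:

- The fault probability of a link wire increases.
- The probability of multi-bit transient faults rises significantly.

#### 2. Background and Problem Definition

**Background:** Traditional Error Checking and Correcting (ECC) techniques can correct and detect single-bit or multi-bit transient faults, but handling multi-bit transient faults requires large area, power, and timing overheads, which limits their use at deep nanometer technology nodes.

**Problem definition:** How can ECC hardware cost be reduced and fault detection coverage be improved, while keeping the silent fault rate low and still guaranteeing link reliability?

#### 3. The 2D Fault Coding Method

To solve the problem above, the researchers propose a method called 2D fault coding. Its key innovation is to treat the wires of a link as a two-dimensional matrix and to perform lightweight Parity Check Coding (PCC) on the matrix's two dimensions (horizontal rows and vertical columns).

**Key features:**

- **Horizontal and vertical PCCs work together**: the horizontal PCC detects and locates faults along matrix rows, while the vertical PCC detects and locates faults along matrix columns. Together they pinpoint the fault positions.
- **Simple fault correction**: once a fault position is determined, the fault is corrected by simply flipping the corresponding bit.

#### 4. Procedure for Applying the 2D Fault Coding Method

1. **Matrix organization**: organize the wires of the NoC link into a two-dimensional matrix.
2. **Parity check coding**: perform PCC on every horizontal row and every vertical column to generate the corresponding parity bits.
3. **Fault detection and location**: when a fault is detected, compare the horizontal and vertical parity results to locate the fault position.
4. **Fault correction**: flip the bit at each located fault position.

#### 5. Performance Evaluation and Advantages

The performance evaluation of the 2D fault coding method shows the following advantages:

- **Lower ECC hardware cost**: compared with traditional ECC techniques, the 2D fault coding method needs fewer hardware resources because of its lighter coding scheme.
- **Higher fault detection coverage**: the method detects multi-bit transient faults without a significant increase in hardware cost.
- **Very low silent fault rate**: the silent fault rate stays close to zero.
- **Higher fault correction rate**: under the same area, the fault correction rate of the 2D fault coding method is higher than that of traditional ECC techniques.

#### 6. Conclusion

The 2D fault coding method is a cost-effective ECC technique suited to multi-bit transient fault control of NoC links. By organizing the link wires into a two-dimensional matrix and performing lightweight PCC on both dimensions, the method detects and corrects multi-bit transient faults while reducing hardware cost, improving fault detection coverage, and keeping the silent fault rate low. These advantages make the 2D fault coding method a suitable technique for deep nanometer NoC design.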


Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method

Xiaowen Chen†,‡, Zhonghai Lu‡, Yuanwu Lei†, Yaohua Wang†, Shenggang Chen†

† College of Computer, National University of Defense Technology, 410073, Changsha, China

‡ Department of Electronic Systems, KTH - Royal Institute of Technology, 16440 Kista, Stockholm, Sweden

‡ {xiaowenc,zhonghai}@kth.se

Abstract: In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named the 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' positions and then correct the faults by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, achieve much higher fault detection coverage, maintain almost zero silent fault percentages, and achieve higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable for multi-bit transient fault control of NoC links.

I. INTRODUCTION

As chip technology goes into the deep nanometer era, i.e., the transistor feature size is reduced to 45nm, 40nm, 28nm, and even smaller, integrated circuits characterized by high frequency and low voltage will be increasingly susceptible to transient faults and permanent faults. Transient faults are considered to account for roughly 80% of all faults[1]. Reliability of links challenges large-scale Network-on-Chip (NoC) design. In deep nanometer scale, transient fault tolerance of NoC links faces new phenomena: (I) The fault probability of a link wire becomes bigger. The fault probability (ε) of a link wire can be characterized by the classic fault model[2][3] with a Gaussian distribution as

$$\varepsilon = Q\left(\frac{V}{2\sigma}\right) = \int_{V/2\sigma}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}\, dy \qquad (1)$$

where V is the supply voltage and σ is the noise voltage. Fig. 1 depicts the trends of the supply voltage and of the ratio of noise voltage to supply voltage, according to real technology data from the TSMC foundry[4]. As the technology shrinks, the supply voltage decreases, mainly to reduce chip power consumption. However, the proportion of voltage noise in the supply voltage becomes bigger. Therefore, according to Equation (1), the increase of the ratio of noise voltage to supply voltage results in an increase of the fault probability (ε) of a link wire.

[Fig. 1. Trends of supply voltage and the ratio of noise voltage from TSMC. x-axis: TSMC technology (250nm, 180nm, 130nm, 90nm, 65nm, 40nm, 28nm); y-axes: supply voltage (V) and ratio of noise voltage.]

The research is partially supported by the National Natural Science Foundation of China (No. 61502508), the Hunan Natural Science Foundation of China (No. 2015JJ3017), and the Doctoral Program of the Ministry of Education in China (No. 20134307120034).

(II) The fault probability of a link becomes bigger. Because technology shrinking leads to narrower wires and smaller distances between adjacent wires, and because the width of an on-chip link is not subject to the limited IO resources of a chip, an on-chip link can usually be designed to be 256-bit, 512-bit, or even wider in order to improve bandwidth. Equation (2) shows that, as the link width (denoted w) becomes bigger, multiple wires in a link may have transient faults concurrently, resulting in an increase of the fault probability (η) of a link[5]. Multiple faults existing on the links have become more important[6][7].

$$\eta = 1 - (1 - \varepsilon)^{w} \qquad (2)$$
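To make Equations (1) and (2) concrete, the following minimal sketch evaluates them numerically. The function names are illustrative and not from the paper, and the noise-to-supply ratios in the demo are simply values that appear among the Fig. 1 labels; the sketch does not claim which technology node each ratio belongs to.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Y > x) for a standard normal Y."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def wire_fault_probability(noise_ratio: float) -> float:
    """Equation (1): epsilon = Q(V / (2 * sigma)), with noise_ratio = sigma / V."""
    return q_function(1.0 / (2.0 * noise_ratio))


def link_fault_probability(epsilon: float, width: int) -> float:
    """Equation (2): eta = 1 - (1 - epsilon) ** width.

    expm1/log1p keep the result accurate when epsilon is very small.
    """
    return -math.expm1(width * math.log1p(-epsilon))


if __name__ == "__main__":
    # Illustrative ratios only (assumption): sigma / V values seen in Fig. 1.
    for ratio in (0.083, 0.125, 0.176):
        eps = wire_fault_probability(ratio)
        for width in (64, 256, 512):
            eta = link_fault_probability(eps, width)
            print(f"sigma/V={ratio:.3f} w={width:3d} eps={eps:.3e} eta={eta:.3e}")
```

Both trends described in the text show up directly: a larger noise ratio raises ε, and a wider link raises η for the same ε.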

NoC links are more prone to multi-bit transient faults than ever before in deep nanometer scale, and multi-bit transient fault control for NoC links needs to be studied.

Typically, fault tolerance can be achieved by redundancy. Redundancy is achieved by redundant components that cope with failing ones (spatial redundancy), by re-execution of a data transmission with the same component (temporal redundancy), and by adding information for fault detection and correction (information redundancy)[8]. In this paper, our scope is multi-bit transient fault control for on-chip communication links of large-scale NoCs via information redundancy.

978-1-4673-9030-9/16/$31.00 © 2016 IEEE

In information redundancy, ECC (Error Correcting Codes)[9][10] is a commonly used and effective protection technique. The Hamming algorithm is often used as the base to generate SECDED (Single-Error-Correction Double-Error-Detection) codes, while BCH serves as the base to form codes that can detect and correct multiple faults, e.g., DECTED (2-Error-Correction 3-Error-Detection), QECPED (4-Error-Correction 5-Error-Detection), and OECNED (8-Error-Correction 9-Error-Detection). Fig. 2 summarizes the hardware cost of SECDED, DECTED, QECPED, and OECNED under TSMC 40nm technology. (1) For the same link width, as the number of detected and corrected faults increases, more area, power consumption, and delay are required. (2) For the same detection and correction capability, a wider link incurs more hardware cost. Therefore, scaling up conventional ECC techniques to cover multiple transient faults incurs large hardware cost.

[Fig. 2. Hardware cost of traditional ECC techniques such as SECDED, DECTED, QECPED, and OECNED under TSMC 40nm technology. Panels: area (µm²), power consumption (mW), and delay (ns); series: 32-bit, 64-bit, 128-bit, 256-bit, and 512-bit links.]
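As a rough illustration of why wider links need more redundancy even at a fixed correction capability, the sketch below counts the check bits of a Hamming-based SECDED code for the link widths shown in Fig. 2. It only counts extra wires using the standard Hamming bound; it does not model the area, power, or delay reported in Fig. 2, and it does not cover the BCH-based codes. The function names are illustrative.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SECDED adds one overall parity bit to the Hamming check bits."""
    return hamming_check_bits(data_bits) + 1


if __name__ == "__main__":
    for width in (32, 64, 128, 256, 512):
        print(f"{width}-bit link: {secded_check_bits(width)} SECDED check bits")
```

The check-bit count grows slowly with link width, so most of the cost growth in a wider encoder and decoder comes from the logic that processes the larger word, which this count does not capture.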

For multi-bit transient fault tolerance, ECC design should weigh fault detection and correction capability against hardware cost so as to obtain better cost-effectiveness. Therefore, we are motivated to adopt a 2D fault coding method to support multi-bit transient fault control for NoC links.

The main contributions of the paper are summarized below:

1) The 2D fault coding method is adopted to organize the wires of a link as a matrix, perform light-weight parity check coding on two dimensions (horizontal matrix rows and vertical matrix columns), and combine the horizontal and vertical fault coding information to detect and correct multiple transient faults.

2) The procedure of using the 2D fault coding method to protect a NoC link is proposed, and its correction and detection capability on NoC links is analyzed in detail. The proposal is implemented in hardware.

3) The cost-effectiveness of the proposal is demonstrated by comparative experiments. Compared with the conventional ECC techniques, our method largely reduces the hardware cost, has higher fault detection coverage, maintains an almost zero silent fault percentage, and has higher fault correction coverage normalized under the same area.
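The idea in contribution 1) can be sketched with plain even parity on rows and columns. This is a minimal illustration, not the paper's encoder or decoder: it assumes one parity bit per row and per column, fault-free parity wires, and it attempts correction only under the extra assumption that all faults lie in a single row or a single column (other fault patterns can produce the same parity mismatches). All names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def parity_bits(bits: Matrix) -> Tuple[List[int], List[int]]:
    """Even parity of every matrix row (horizontal) and column (vertical)."""
    rows = [sum(row) % 2 for row in bits]
    cols = [sum(col) % 2 for col in zip(*bits)]
    return rows, cols


def check_and_correct(
    received: Matrix, sent_rows: List[int], sent_cols: List[int]
) -> Tuple[str, Matrix]:
    """Return (status, matrix); status is 'clean', 'corrected', or 'detected'."""
    rows, cols = parity_bits(received)
    bad_rows = [r for r, (a, b) in enumerate(zip(rows, sent_rows)) if a != b]
    bad_cols = [c for c, (a, b) in enumerate(zip(cols, sent_cols)) if a != b]
    if not bad_rows and not bad_cols:
        # No mismatch: fault-free, or a silent pattern parity cannot see.
        return "clean", received
    # Assumption: faults confined to one row or one column, so the faulty
    # wires sit at the intersections of the mismatching rows and columns.
    confined = len(bad_rows) == 1 or len(bad_cols) == 1
    consistent = (len(bad_rows) - len(bad_cols)) % 2 == 0
    if bad_rows and bad_cols and confined and consistent:
        fixed = [row[:] for row in received]
        for r in bad_rows:
            for c in bad_cols:
                fixed[r][c] ^= 1  # correct by inverting the located bit
        return "corrected", fixed
    return "detected", received


if __name__ == "__main__":
    data = [[(3 * r + c) % 2 for c in range(8)] for r in range(8)]  # 64 wires
    row_par, col_par = parity_bits(data)
    faulty = [row[:] for row in data]
    for c in (1, 4, 6):  # three transient faults in row 2
        faulty[2][c] ^= 1
    status, fixed = check_and_correct(faulty, row_par, col_par)
    print(status, fixed == data)
```

An even number of faults in one row leaves that row's parity unchanged, so such a pattern is reported as detected rather than corrected; this is one reason the detection and correction capability has to be analyzed separately.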

II. RELATED WORK

[Fig. 3. An example of organizing the link wires into several groups that constitute the link's matrix appearance. The wires of a link, indexed (0,0) to (7,7), form Group 0 through Group 7.]

Information redundancy at the data link layer widely uses coding schemes such as Hamming code, BCH code, and parity bits to catch faults in NoC links. Hamming code based SEC or SECDED is the most popular, since the occurrence of 1 fault has the highest probability. As the transistor feature size shrinks, the probability of multi-bit faults grows, and some studies adopt BCH code as the base to detect and correct two or more multi-bit faults[11]. However, BCH code has large extra hardware overhead. To reduce the hardware cost, some researchers combine a set of relatively simple SEC or SECDED codes with a bit interleaving technique to correct adjacent multi-bit faults[5][10][12]. For instance, in [5], the data is split into blocks to be interleaved, while a SEC code is applied to each block. Then an adaptive fault control method is used to select the ECC scheme dynamically. In [12], Lehtonen analyzes forward fault correction methods for nanoscale NoC. Another approach is developed by Dutta and Touba[13], who use an unequal coding scheme to protect different parts of the packet. It has similar costs to SEC codes while providing better fault detecting and correcting capability, thus the cost is relatively reduced. The ECC codes used in all of these works originate from Hamming code or BCH code. Due to the algorithm structure itself, Hamming code with bit interleaving or BCH code can theoretically scale to detect and correct more faults, but the hardware cost will grow rapidly, so the number of faults detected and corrected in the current literature is small (< 4). We adopt a new technique (called the 2D fault coding method) rather than Hamming code or BCH code to achieve fast multi-bit fault detection/correction while still maintaining high fault coverage with low cost.
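The bit interleaving idea mentioned above can be illustrated generically: if adjacent wires are assigned to different blocks, a burst of adjacent faults puts at most one fault in each block, which a per-block SEC code can then correct. The sketch below assumes a simple round-robin assignment (wire i goes to block i mod n); it is not the specific scheme of [5], [10], or [12], and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def block_of_wire(wire: int, n_blocks: int) -> int:
    """Round-robin interleaving (assumption): adjacent wires go to different blocks."""
    return wire % n_blocks


def faults_per_block(faulty_wires: List[int], n_blocks: int) -> Dict[int, int]:
    """Count how many faulty wires land in each interleaved block."""
    return dict(Counter(block_of_wire(w, n_blocks) for w in faulty_wires))


def burst_is_sec_correctable(start: int, length: int, n_blocks: int) -> bool:
    """A burst is correctable by per-block SEC if no block sees more than one fault."""
    counts = faults_per_block(list(range(start, start + length)), n_blocks)
    return all(n <= 1 for n in counts.values())


if __name__ == "__main__":
    print(burst_is_sec_correctable(start=10, length=4, n_blocks=4))
    print(burst_is_sec_correctable(start=10, length=5, n_blocks=4))
```

With n interleaved blocks, bursts of up to n adjacent faults stay correctable, so covering longer bursts means adding more SEC blocks, which is the cost growth the paragraph above points to.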

The 2D fault coding method organizes the wires of a link as a matrix, and performs horizontal ECC along matrix rows and vertical ECC along matrix columns. Regarding ECC on two dimensions, prior work mainly concerns the protection of memory arrays. In [14], Mohr applies product codes to memory arrays and uses horizontal byte-parity codes to enable low-latency fault detection. This scheme only addresses detecting and correcting single-bit faults within a memory array. In [15][16], Argyrides proposed a matrix code to protect SRAM-based memories against Multiple Bit Upsets (MBU). The proposed method integrates Hamming code and parity code to assure memory reliability with low area and performance overhead. His work is only at the algorithm level, without a concrete hardware implementation. Similar to Argyrides's work, in [17], Kim proposed a fault coding scheme for embedded L1/L2 memories and enabled vertical fault coding across words in combination with conventional per-word horizontal fault coding, in order to achieve multi-bit fault protection. The vertical fault coding process is not
