[Figure 2: three panels comparing SECDED, DECTED, QECPED, and OECNED for link widths of 32, 64, 128, 256, and 512 bits. Panels: Area (×10^4 μm^2), Power Consumption (mW), and Delay (ns).]
Fig. 2. Hardware cost of traditional ECC techniques such as SECDED,
DECTED, QECPED, and OECNED under TSMC 40nm technology
The Hamming algorithm is often used as the base for
SECDED (Single-Error-Correction Double-Error-Detection)
codes, while BCH is used as the base for codes that can
detect and correct multiple faults, e.g., DECTED
(2-Error-Correction 3-Error-Detection), QECPED
(4-Error-Correction 5-Error-Detection), and OECNED
(8-Error-Correction 9-Error-Detection). Fig. 2 summarizes
the hardware cost of SECDED, DECTED, QECPED, and OECNED
under TSMC 40nm technology. (1) For the same link width,
as the number of detected and corrected faults increases,
area, power consumption, and delay all increase. (2) For the
same detection and correction capability, a wider link incurs
a higher hardware cost. Therefore, scaling up conventional
ECC techniques to cover multiple transient faults incurs a
large hardware cost.
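One contributor to the trend in Fig. 2 is the number of check bits each code adds. The sketch below estimates that count from two standard coding-theory facts: the Hamming bound for single-error correction, and the upper bound of m*t parity bits for a binary BCH code of length 2^m - 1 that corrects t errors. The function names are hypothetical, the BCH figure is an upper bound rather than the exact redundancy of any particular design, and check-bit count does not reproduce the synthesized area, power, or delay reported in Fig. 2.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits of an extended Hamming (SECDED) code for data_bits."""
    r = 1
    # Single-error correction needs 2^r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one overall parity bit adds double-error detection


def bch_check_bits_bound(data_bits: int, t: int) -> int:
    """Upper bound on check bits for a shortened binary BCH code that
    corrects t errors, plus one parity bit to detect t + 1 errors."""
    m = 1
    # The codeword (data plus at most m*t parity bits) must fit in 2^m - 1.
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m * t + 1


if __name__ == "__main__":
    for width in (32, 64, 128, 256, 512):
        row = [secded_check_bits(width)]
        row += [bch_check_bits_bound(width, t) for t in (2, 4, 8)]
        print(width, row)  # SECDED, then DECTED/QECPED/OECNED bounds
```

Under these assumptions the redundancy grows with both the link width and the number of correctable faults, in line with observations (1) and (2).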
For multi-bit transient fault tolerance, an ECC design
must weigh its fault detection and correction capability
against its hardware cost in order to be cost-effective.
Therefore, we are motivated to adopt a 2D fault coding
method to support multi-bit transient fault control for
NoC links.
The main contributions of the paper are summarized below:
1) The 2D fault coding method is adopted to organize
the wires of a link as a matrix, perform light-weight
parity check coding on two dimensions (horizontal
matrix rows and vertical matrix columns), and combine
the horizontal and vertical fault coding information
to detect and correct multiple transient faults.
2) The procedure of using the 2D fault coding method to
protect a NoC link is proposed, and its correction and
detection capability on NoC links is analyzed in detail.
The proposal is implemented in hardware.
3) The cost-effectiveness of the proposal is demonstrated
by comparative experiments. Compared with conventional
ECC techniques, our method substantially reduces the
hardware cost, has higher fault detection coverage,
maintains an almost zero silent fault percentage, and has
higher fault correction coverage normalized to the same area.
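The idea in contribution 1) can be illustrated with plain two-dimensional parity. The following is a minimal sketch, not the scheme proposed in the paper: it assumes a 64-bit link arranged as an 8 x 8 matrix, assumes the parity bits themselves arrive fault-free, and corrects only the single-fault case by intersecting the failing row and column. All names are hypothetical.

```python
ROWS, COLS = 8, 8  # assumed 8 x 8 arrangement of a 64-bit link


def encode(bits):
    """Return (row_parity, col_parity) for a flat list of ROWS * COLS bits."""
    rows = [bits[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
    row_parity = [sum(row) % 2 for row in rows]
    col_parity = [sum(row[c] for row in rows) % 2 for c in range(COLS)]
    return row_parity, col_parity


def check(bits, row_parity, col_parity):
    """Compare received data bits against the transmitted parity bits.

    Returns (status, bits), where status is "ok", "corrected", or "detected".
    Simplification: the parity bits are assumed to be received without faults.
    """
    new_rp, new_cp = encode(bits)
    bad_rows = [r for r in range(ROWS) if new_rp[r] != row_parity[r]]
    bad_cols = [c for c in range(COLS) if new_cp[c] != col_parity[c]]
    if not bad_rows and not bad_cols:
        return "ok", list(bits)
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One failing row and one failing column intersect at the faulty bit.
        fixed = list(bits)
        fixed[bad_rows[0] * COLS + bad_cols[0]] ^= 1
        return "corrected", fixed
    return "detected", list(bits)


if __name__ == "__main__":
    data = [(i * 5 + i // 8) % 2 for i in range(ROWS * COLS)]
    rp, cp = encode(data)
    hit = list(data)
    hit[19] ^= 1  # single transient fault on one wire
    print(check(hit, rp, cp)[0])
```

This simplified decoder can miscorrect some patterns, for example three faults arranged in an L shape, which also produce exactly one failing row and one failing column. Handling such cases is part of what a complete correction procedure has to address.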
II. RELATED WORK
[Figure 3: the wires of a link, w0 to w63, organized into Group 0 to Group 7, which hold the matrix bits b0,0 to b7,7.]
Fig. 3. An example of organizing the link wires into several groups that
constitute the link’s matrix appearance
Information redundancy at the data link layer widely uses
coding schemes such as Hamming code, BCH code, and parity
bits to catch faults in NoC links. Hamming-code-based SEC
or SECDED is the most popular, since a single fault is the
most probable case. As the transistor feature size shrinks,
the probability of multi-bit faults grows, so some
researchers adopt BCH code as the base to detect and correct
multi-bit faults [11]. However, BCH code has
large extra hardware overhead. To reduce the hardware cost,
some researchers combine a set of relatively simple SEC
or SECDED codes with bit interleaving to correct
adjacent multi-bit faults [5][10][12]. For instance, in [5], the
data is split into blocks that are interleaved, and a SEC code
is applied to each block. An adaptive fault control
method then selects the ECC scheme dynamically. In
[12], Lehtonen analyzes forward fault correction methods for
nanoscale NoC. Dutta and Touba [13] take another approach,
using an unequal coding scheme to protect
different parts of the packet. Its cost is similar to that of SEC
codes while it provides better fault detection and correction
capability, so its cost relative to its capability is lower. The
ECC codes used in all of these works are derived from Hamming
code or BCH code. Owing to their algorithmic structure, Hamming
code with bit interleaving or BCH code can theoretically scale to
detect and correct more faults, but the hardware cost grows
rapidly, so the number of faults detected and corrected in
the existing literature is small (< 4). We adopt a different technique
(called the 2D fault coding method) rather than Hamming code or
BCH code to achieve fast multi-bit fault detection/correction
while still maintaining high fault coverage at low cost.
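The bit interleaving mentioned above can be sketched generically as follows. This is not the specific scheme of [5], [10], or [12]; it only assumes four blocks of eight bits whose bits are placed round-robin on the wires, so that a burst of adjacent faults lands in different blocks and each block's SEC code sees at most one fault. The function names are hypothetical.

```python
def interleave(blocks):
    """Place bit i of every block on neighbouring wires."""
    return [block[i] for i in range(len(blocks[0])) for block in blocks]


def deinterleave(wires, num_blocks):
    """Inverse of interleave: wire j belongs to block j % num_blocks."""
    return [wires[b::num_blocks] for b in range(num_blocks)]


def errors_per_block(sent_blocks, received_wires):
    """Count how many flipped bits each block sees after deinterleaving."""
    received = deinterleave(received_wires, len(sent_blocks))
    return [sum(a != b for a, b in zip(s, r))
            for s, r in zip(sent_blocks, received)]


if __name__ == "__main__":
    blocks = [[(b + i) % 2 for i in range(8)] for b in range(4)]
    wires = interleave(blocks)
    for j in range(9, 13):  # burst fault on four adjacent wires
        wires[j] ^= 1
    print(errors_per_block(blocks, wires))
```

With four blocks, any burst of up to four adjacent wire faults places at most one fault in each block, which is within the reach of a per-block SEC code. Longer bursts require more blocks and therefore more encoders and decoders.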
The 2D fault coding method organizes the wires of a link as
a matrix and performs horizontal ECC along matrix rows
and vertical ECC along matrix columns. Regarding
ECC on two dimensions, prior work mainly concerns the
protection of memory arrays. In [14], Mohr applies product codes
to memory arrays and uses horizontal byte-parity codes to
enable low-latency fault detection. This scheme only addresses
detecting and correcting single-bit faults within a memory
array. In [15][16], Argyrides proposed a matrix code to protect
SRAM-based memories against Multiple Bit Upsets (MBU).
The proposed method integrates Hamming code and parity
code to ensure memory reliability with low area
and performance overhead. His work remains at the algorithm
level, without a concrete hardware implementation. Similar to
Argyrides’s work, in [17], Kim proposed a fault coding
scheme for embedded L1/L2 memories that enables vertical
fault coding across words in combination with conventional
per-word horizontal fault coding, in order to achieve
multi-bit fault protection. The vertical fault coding process is not