Chapter 3 Solutions
Case Study 1: Exploring the Impact of Microarchitectural Techniques
3.1 The baseline performance (in cycles, per loop iteration) of the code sequence in
Figure 3.48, if no new instruction’s execution could be initiated until the previous
instruction’s execution had completed, is 40. See Figure S.2. Each instruction
requires one clock cycle of execution (a clock cycle in which that instruction,
and only that instruction, is occupying the execution units; since every
instruction must execute, the loop will take at least that many clock cycles).
To that base number, we add each instruction’s extra latency cycles. Don’t forget
the branch shadow cycle. The 11 instructions contribute 11 base cycles, and the
extra latencies (including the one-cycle branch shadow) contribute 29 more, for a
total of 40.
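The arithmetic behind the baseline figure can be checked with a short script. This is a minimal sketch, not part of the original solution: the names LOOP_BODY and baseline_cycles are made up for illustration, and the extra-latency values are copied from Figure S.2.

```python
# (mnemonic, operands, extra latency cycles), copied from Figure S.2.
# LOOP_BODY and baseline_cycles are illustrative names, not from the text.
LOOP_BODY = [
    ("LD",    "F2,0(Rx)",   4),
    ("DIVD",  "F8,F2,F0",  12),
    ("MULTD", "F2,F6,F2",   5),
    ("LD",    "F4,0(Ry)",   4),
    ("ADDD",  "F4,F0,F4",   1),
    ("ADDD",  "F10,F8,F2",  1),
    ("ADDI",  "Rx,Rx,#8",   0),
    ("ADDI",  "Ry,Ry,#8",   0),
    ("SD",    "F4,0(Ry)",   1),
    ("SUB",   "R20,R4,Rx",  0),
    ("BNZ",   "R20,Loop",   1),  # the extra cycle here is the branch shadow
]


def baseline_cycles(body):
    """Fully serialized execution: each instruction costs 1 + extra cycles."""
    return sum(1 + extra for _, _, extra in body)


base = len(LOOP_BODY)                      # one execution cycle per instruction
extra = sum(e for _, _, e in LOOP_BODY)    # all added latency cycles
print(base, extra, baseline_cycles(LOOP_BODY))
```

Because nothing overlaps in this model, the order of the instructions does not matter; only the sum of the per-instruction costs does.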
3.2 How many cycles would the loop body in the code sequence in Figure 3.48
require if the pipeline detected true data dependencies and only stalled on those,
rather than blindly stalling everything just because one functional unit is busy?
The answer is 25, as shown in Figure S.3. Remember, the point of the extra
latency cycles is to allow an instruction to complete whatever actions it needs in
order to produce its correct output. Until that output is ready, no dependent
instruction can be executed. So the first LD must stall the next instruction (the
DIVD, which reads F2) for four clock cycles, matching the LD’s four extra latency
cycles. The MULTD produces a result for its successor, and therefore
must stall 4 more clocks, and so on.
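The stall-on-true-dependence behavior can be sketched as a small in-order, single-issue simulation. The model below is an assumption, not taken from the text: an instruction issues one cycle after its predecessor at the earliest, and a consumer cannot issue until 1 + extra-latency cycles after the producer of any of its source registers. The register lists are read off the operands in Figure S.2, and schedule is a hypothetical helper name.

```python
# (mnemonic, destination register or None, source registers, extra latency)
# Latencies are from Figure S.2; the dependence model is an assumption.
LOOP_BODY = [
    ("LD",    "F2",  ["Rx"],        4),
    ("DIVD",  "F8",  ["F2", "F0"], 12),
    ("MULTD", "F2",  ["F6", "F2"],  5),
    ("LD",    "F4",  ["Ry"],        4),
    ("ADDD",  "F4",  ["F0", "F4"],  1),
    ("ADDD",  "F10", ["F8", "F2"],  1),
    ("ADDI",  "Rx",  ["Rx"],        0),
    ("ADDI",  "Ry",  ["Ry"],        0),
    ("SD",    None,  ["F4", "Ry"],  1),
    ("SUB",   "R20", ["R4", "Rx"],  0),
    ("BNZ",   None,  ["R20"],       1),  # extra cycle is the branch shadow
]


def schedule(body):
    """Return (issue cycle of each instruction, total cycles per iteration)."""
    ready = {}          # register -> first cycle in which a consumer may issue
    issue_cycles = []
    finish = 0          # last cycle occupied by any instruction so far
    prev_issue = 0
    for _, dest, srcs, extra in body:
        # In-order issue: at least one cycle after the previous instruction,
        # and not before every source register's value is available.
        issue = max([prev_issue + 1] + [ready.get(r, 0) for r in srcs])
        issue_cycles.append(issue)
        if dest is not None:
            ready[dest] = issue + 1 + extra
        finish = max(finish, issue + extra)
        prev_issue = issue
    return issue_cycles, finish


issue_cycles, total = schedule(LOOP_BODY)
for (name, _, _, _), cycle in zip(LOOP_BODY, issue_cycles):
    print(f"{name:6s} issues in cycle {cycle}")
print("cycles per loop iter", total)
```

Under these assumptions the total agrees with the stated answer of 25. The gaps between consecutive issue cycles are the stall cycles, and the largest gap comes from the second ADDD waiting for the DIVD result in F8.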
Figure S.2 Baseline performance (in cycles, per loop iteration) of the code sequence
in Figure 3.48.
Loop: LD     F2,0(Rx)     1 + 4
      DIVD   F8,F2,F0     1 + 12
      MULTD  F2,F6,F2     1 + 5
      LD     F4,0(Ry)     1 + 4
      ADDD   F4,F0,F4     1 + 1
      ADDD   F10,F8,F2    1 + 1
      ADDI   Rx,Rx,#8     1
      ADDI   Ry,Ry,#8     1
      SD     F4,0(Ry)     1 + 1
      SUB    R20,R4,Rx    1
      BNZ    R20,Loop     1 + 1
                          ____
      cycles per loop iter  40
Copyright © 2012 Elsevier, Inc. All rights reserved.