经典FPGA算法教材(二).pdf

所需积分/C币:46 2019-09-05 21:01:51 8.64MB PDF
38
收藏 收藏
举报

此书是关于各种DSP的FPGA实现的书,包括DSP算法原理,算法优化,以及FPGA的硬件实现,包括完整的VHDL,Verilog HDL代码!原版教
2. Computer Arithmetic 2.3 Binary Adders Table 2.4. IEEE floating-point standard CRNS Double g Mantissa Exponent LUT: b2 LUT: b2 LUT: b2 b Bi las 1023 Range 2138≈3.8,103821024≈9.10307 (+ normalized so that both numbers are adjusted to the larger exponent. Then the two mantissas can be added. For both multiplication and addition a final normalization is necessary to achieve again the fractional 1 m representation A B QRNS A B for the mantissa The single IEEE floating -point format uses FPGA resources interisivel Fig. 2.5. CRNS +QR.NS conversion due to the 23 x 23-bit fast integer multiplier. Therefore Shirazi et al.[42 have developed a modified format to implement various algorithms on their f1(0,.8)=7(0+8)+j410-8)md13=4+y7mad13 custom computing machine called SPLASH-2, a multiple-FPGA board based on Xilinx XC4(10 devices(see Table 1.5, p. 16). They used an 18-bit format so that they can transport two operands over the 36-bit wide system bus of the multiple-FPGA board. The 18-bit format has a 10-bit mantissa. 7- Fig. 2.5 shows a graphical interpretation of the mapping between CRNS bit exponent and a sign bit, and can represents a range of 3. 7. 10.The and QRNS implemented an adder and a multiplier with three pipeline stages. The data for a single floating-point multiplier and adder [42 can be seen from Table 2.2.3 Floating-Point Numbers 2.5 Floating-point systems were developed to provide high resolution over a large dynamic range. Floating-point systems often can provide a solution when Table 2.5. Custom 18-bit Hoating point design using Xilinx XC4010. fixed-point systems, with their limited precision and dynamic range, fail Adder Multiplier Floating-point systems, however, bring a speed and complexity penalty. Most Function Generator (ic, LE) 224 352 floating-point systems comply with the published single- or double-precision Flip-llups 112 IEEE Hoating-point standard [41. A standard floating-point word consists of 8. 6 MHz 4.9 MHz a sign-bit S, exponent e, and an unsigned Fractional) normalized mantissa h, arranged as follows: S Exponent e unsigned mantissa m Algebraically, a foating-point word is represented by 2.3 Binary Adders X=(-1)5.1,m2 (2.22) A basic binary N-bit adder/subtractor consists of N full adders(FA). A full The exponent e=1.1, is reserved for oo, and e=0 for zero. The other adder implements the following Boolean equations paramcters of the IEEE single and double format can he seen from Table 2.4 sk=k XOR 3* XOR Ch In floating-point multiplication the mantissas have to be multiplied like tixed-moint numbers, and the exponent must be added Floating-point ad .k D yk (DCK (224) dition is in general morc complicated, because the mantissas must first be that define the sum-bit. The carry(out) bit is computed with 2. Computer Arithmetic 2.3 Binary Adde eIs a3 bl3] a2]b|2] alljbll alo]blo A B a c[2 FA FA FA FA O 0 AL4(G B供2 S[2] s[11 A;1 B, GOUT A a31b33] a[2]b[2 a[lb[] alo][ol (b) Fig.2.7. XC4000 fast carry logic ((1993 Xilinx cl] 2 Bil Adder c 21 2 Bil Adder LUT 25x3 LUT 25x3 Fig 2.8 summarizes the size and Registered Performance of N-bit bi lary adders, if implenented with the lpn_add_ sub megafunction compo- nent. If the operands are applied through I/O cells, the delays through the busses of a FLEX device are dominant and performance decreases if the reg- S3]s[2 s[]s[0 isters of the I/O cells are used (Option: Assign- Global Project Logic Fig. 2.6. Twos complement adders Synthesis+Automatic Fast I/0). If the data are routed from local regis ters, performance improves. This type of additional LC register allocation will appear (in the project report file) as increased LC use by a factor of Ch+1=(=k AND y/)OR(k AND ck)OR (3k AND Ck 225) three.A synchronous registered design would not consume any additional =(akk)+(k·ck)+(k·ck (2.26) resources.A typical design will achieve a speed between these two cases In the case of a, 2C adder, the lsb can be reduced to a half adder because the carry iuput is zero 2.3.1 Pipelined Adders The simplest adder structure is called the"ripple carry adder"as shown in Fig. 2.6(a)in a bit-serial form. If larger tables are available in the FPGa Pipelining is extensively used in DSP solutions due the intrinsic data-flow several bits can be grouped together into one LUT, as shown in Fig. 2.((b) regularity of DSP algorithms. Programmable a .a signal processor MACs 14, 15, 6 typically carry at least four pipelined stages. The processor For this two bit at a time adder the longest delay comes from the ripple of the carry through all stages. Attempts have been. made to reduce the carry 1) Decodes the command delays using techniques such as the carry-skip, carry lookahead, conditional 2)Loads the operands in registars sum, or carry-select adders. These techniques can speed up addition and can 3)Performs multiplication and stores the product, and be used with older generation FPGA families (e. g, XC 3000 from Xilinx 4) Accumulates the products, all concurrently since these devices do not provide internal fast carry logic. Modern families, such as the Xilinx XC4K or Altera Flex, posses very fast"ripple carry logic The pipelining principle can be applied to FPGa designs as well, at little that is about a magnitude faster than the delay through a regular logic LUT or no additional cost since each logic element contains a flip-Hop, which is [1. Altera uses fast tables(see Fig. 1. 12, p. 17), while the Xilinx XC4K uses otherwise unused, to save routing resources. With pipelining it is possible to hardwired decoders for implementing carry logic based on the multiplexer break an arithmetic operation into small primitive operations, save the carry structure shown in Fig. 2.7. The presence of the fast car and the intermediate values in registers, and continue the calculation in the n FPGA families removes the need to develop hardware intensive carry look next clock cycle. Such adders arc sometimes called carry save adders(CSAs ahead schemes in the literature. Then the question arises: In how many pieces should we The name carry save adder is also used in the context of Wallace-nultiplier, see Exercise 2.1, p.75 2. Computer Arithmetic 2. 3 Binary Adders Table 2.7. Performance of pipeiined adders. Size and speed are for the maximum 140 bit width for 15-.22-, and 29-bit adders e- Speed use of l/0 register Bit Use I/o Size use of vo register No use of Pipeline Desig idth gist *t* Speed no use of l/o register I/O register stages file name MHz LCs MH Lcs * Size no use of l/O register 1597082694.3361 P 16-2294.335891.74113 23-2991.7410590.90180 add_3p. vhd MSBS MSBS 十 of xtr MSBS f X F: Carry 0 16 Bit width of adder Register*n: .b-p: Fig. 2.8. Adder speed and size for Flex 10K. LSBS f Y LSBS divide the adder? should we use bit level? For Altera s Flex 10K devices a reasonable choice will be al ways using an LAB with 8 LCs and 8 FFs for of xy one pipeline element. In fact it can be shown that if we try to pipeline(for instance) a 5-bit adder, the performance drops, as reported in Table 2.6, SBS because the pipelined 5-bit adder does not fit in one LAB of X Fig. 2.9. Pipelined adder Table 2.6, Performance of a 5-bit pipelined adder using synthesis of predefined LPM nodules with pipeline option Because the number of flip-flops in one LAB is 8 and we need an extra Pipeline- Use I/O No use of Hip-Hop for the carry-out, we should use a maximum block size of 7 bits for stages register I/O register MHz LCs MHZ LCs maximun Registered Performance. Only the blocks with the MSBs can be 8 bits wide, because we do not need the extra flip-Hlop for the carry. This 123.456120.4815 observation leads to the following conclusions 012345 1219510123.4518 112.3514123.4524 1)With one additional pipeline stage we can build adders up to a length 112.320120.4832 7+8=15 1123528123.4541 1C3.91311219545 2) With two pipeline stages we can build adders with up to 7+7+8=22-bit length 0 2. ColllpuLer Arithmetic 2.3 Binary Adde 51 3) With three pipeline stages we can build adders with up to 7+7+7+8= 11(k)<=x(k); 29-bit length 12(k)<=y(k); END LOOI Table 2.7 shows the Registered Performance and LC utilization of this kind -- SPlit MSBs from input x, of pipelined adder. From Table 2. 7 it can be concluded that although the bit FEk工NW工DTH2-1 DOVNTO C L0oP width increases the Registered Performance remains almost the same if we 13(k)<=x(k+ITH1) add the appropriate number of pipeline stages. 14(k)<=y(k+WITTH1 END LOOP The following example shows the code of a 15-bit pipelined adder which END FROCESS 1s graphically interpreted by Fig. 2.9 irst stage of the adder add_1: lpm_add_ sub Add Lses of c and Example 2.14: VHDL Design of 15-bit Pipelined Adder GENERIC MAP LPM_WIDTH =>WIDTH1 Consider the VHDL code of a 15-bit pipelined adder that is graphically LPM_REPRESENTATION => UNSIGNED interpreted in Fig. 2.9. Depending on the synthesis style, the design runs at LPM_DIRECTION =>"ADD") 织uto104MHx PORT MAP dataa = 11, datab =>12 LIBRARY Ipm; result =>r1, cout = cr1(0,) USE Ipm, lpm_components. ALL: reg_1: lpm_ff Save lsBs of xty and carry GENERIC MAP LPM WIDTH = WIDTH1) LIBRARY ieee PORT MAP data = r1, q=>q1, clock =>clk USE ieee std_logic_1164. ALL; reg-2: lpm_ff USE ieee std_Icgic_arith. ALL GENERIC MAP( LPM_ WIDTH ->oNE PORT MAP data = cr1, q=> cq1, clock = clk) ENTITY add_1p Is GENERIC (WIDTH: INTEGER =15; -- Total bit width add_2: lpm_add_ sub Add msBs of x and y 工DTH1: INTEGER:=7 Bit width of LsB GENERIC MAP LPM WIDTH =>WIDTH2 WIDTH2: INTEGER =8: Bit width of MsB LPMREFRESENTATION => UNSIGNED ONE: INTEGER =1): --1 bit for carry reg LPM DIRECTION =>"ADD) PORT (x, y IN STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0) PORT MAP (dataa => 13, datab =514, result = r2) Inputs reg-3: Ipm_ff Save MSBs of x+y sum CUT STD LOGIC_VECTOR(WIDTH-1 DOWNTO 0); GENERIC MAP C LPM WIDTH = WIDTH2 Result PORT MAP data = r2, q=> 92, clock = clk ) clk: IN STD_LOGIC) --------- Second stage of the adder END add_1p rand is zero h2<=(0 THERS=>103) ARCHITECtURE flex OF add_1p Is SIGNAL 11, 12, r1, q1 LSBs of inputs Add result from MSBs (x+y) and carry from lSBs STD LOGIC VECTORCWIDTH1-1 DOWNTO 0) add_3: lpm_add_sub SIGNAL 13, 14, r2, q2, u2, h2 MSBs of inPuts GENERIC MAP LPM WIDTH =2 WIDTH2 STD LOGIC VECTORCWIDTH2-1 DOWNTO 0); LPM REPRESENTATIUN E>UNSIGNED S工GNAL STD LOGIC VECTOR(WIDTH-1 DOWNTO 0) LPM- DIRECTION =>"ADD") ster PORT MAP cin = cq1(a),dat 92 SIGNAL crl, cq1 STD LOGIC VEGTOR(ONE-1 DOWNTO 0) datab =>h2, result =>02 - LSBs carry signal PROCESS Build a single registered output BEGIN word of width=wIDtH+IDhT2 BEGIN 真工 T UNTIL C1k=11; PROCESS Split in MSBs and LSBs and store in registers FOR K IN MIDTH1-1 DOWNTO O LOCP BEGIN 8(k)<=q1(k); AIT UNTIL C1k=1’; END LOOP Split LSBs from input x FURk工N阿工DTH2-1 DOWNT00L0CP FOR K IN NIDTH1-1 DOWNTO C LOOP s(k+WIDTH1)<= u2(k) END LOOP 2 The equivalent Verilog cade add 1p. v for this example can be found in Al END PROCESS dix A on page 348 52 2. Computer Arithmet 4 Binary Multipliers 國ALn! Wareham edi Interal. TH 4ns NaTe Pipeline :register 14 y LsBs cardI X Fig.2.10. Sinulation results for a pipelined adder ROM sum <= s - Connect s to output pins (x+y) END flex Size: b>0+ ulated perform of the 15-bit pipel Fig,2,10 MPX Note that the addition of 140 and 130 produces a carry from the lower 7-bit adder. but there is no carry for 120+5= 123< 127. (a) (b Fig, 2.11. Modular additions. (a)MPX-Add and MPX-Add-Pipe(b)ROM-Pipe 2.3.2 Modulo Adders Modulo adders are the most important building blocks in RNS-DSP designs Although the ROM shown in Fig 2.11 provides high speed, the ROM They are used for both additions and, via index arithmetic, for multiplica- itself produces a four-cycle pipeline delay and the number of ROMs is limited. tions. We wish to describe some design options for FPGAs in the following ROMs. however, are mandatory for the scaling schemes discussed before. The multiplexed-adder(MPX-Add) has a comparatively reduced speed even if a A wide variety of modular addition designs exists [413. Using LEs only, the carry chain is added to each column. The pipelined version sually needs the design of Fig. 2. 11(a)is viable for FPGAs. The Altera FLEX devices contain same number of lEs as the unpipelined version but runs about twice as fast a srmlall lumber of 2Kbit ROMs or RAMs(EABs)which can be configured Maximum throughput occurs when the adders are implemented in two blocks as28×8,28×4,210×2or21×1 tables and can be used for modulo m within 6-bit pipelined channels correction. The next table shows size and Regist. ered Performance 6,7, and 8-bit modulo adder [44 2.4 Binary Multipliers Pipeline Bits stages The product of two N-bit binary numbers, say X and A=2k=0 ak.* 41.3 MSPS 46.5 MSPS 33. 7 MSPS given by the "pencil and paper "method as: MPX 0 27 LE BI LE 35 LE P=A·X= 27i MPX 76. 3 MSPS 62.5 MSPS 60.9 MSPS (2 16 LE 18 LE 20 LE 151.5 MSPS 138.9 MSPS 123.5 MSPS MPX 27 LE 31 LE 35 LE It can be seen that the input X is successively shifted by h positions and 86.2 MSPS 86.2 MSPS 86.2 MSPS whenever ak+0, then X2 is accumulated. If ak=0, then the corresponding ROM 7 LE 8 LE 9 LE shift-add can be ignored (i.e, nop). The following VHDL example uses this 1 EAB 1 EAB 2 EAB pencil and paper"scheme to multiply two 8-bit integers. 2. Computer Arithmetic 2.4 Binary Multipliers Example 2. 15: 8-bit Multiplier 址 er.scf- waveform Editor 回区 Ref: 0.Ons Time: 20n The vHDL descriptionof an 8-bit multiplier is developed below. Multiplica- 0. Ons tion is performed in three stages. First, the 8-bit operands are "loaded and 100 Ons 200 Cns the product register reset. In the second stage, s1, the actual serial-parallel takes place. In the third st ap, s2, the product is transferred to ay count the output register y oy state PACKAGE eight-bit-int Is User defined types SUBTYPE BYTE Is integer RANGE-128 T0 127; SUBTYPE TWOBYTES IS integer RANGE -32768 TO 32767 MD eight-bit_int DE 510Xx400X160(320X L工 BRARY work; USE work.eight_bit_int. ALL; y L工 BRARY ieee g predefined packag UsE ieeestd_logic1164. ALL USE ieee std_logic_arith. ALL ENTITY mul ser Is --- Interface Fig. 2.12. Simulation results for a shift add multiplier. EN STD LOGIC TN BYTE N LOGIC VECTOR(7 DOWNTO C)i t米2; JUT TWOEYTES) coun七 unt t 1 END mmu ser. state <=81 END IF ARCHITECtURE flex OF mul_ser Is aHEN S2 = Output of result to and t next mu1七ip1 icat TYPE STATE TYPE IS (s0, s1, s2) state < so: SIGNAL state STATE TYPE END CASE; END PROCESS States; BEG工N END flex States: PROCESS ------ Multiplier in behavioral style VARIABLe p, t TWOB YTES; Double bit width Fig. 2. 12 shows the simulation result of a multiplication of 13 and 5. The VAR工 ABLE Count integer RANGE 0 TO 7 register t shows the partial produet sequence of 5, 10, 20, .. Since 1310= BEGIN 000011012c, the product register p is updated only three times in the pro ATT UNTIL Clk duction of the tinal result, 65. In state s2 the result 65 is transferred to the CASE sta七eIS output y of the multipl TJHEN SO - Initialization step state < s1 count =0 P 0 Product register reset Because one operand is used in parallel (i.e, X) and the second operand Set temporary shift register to x A is used bit wise, the multipliers we just described are called serial/parallel WHEN S1=> Processing step multipliers. If both operands are used serial, the scheme is called a serial /serial IF count =7 THEN - Multiplication ready multiplier[ 45, and such a multiplier only needs one full adder, but the latency state < s2 of serial/serial multipliers is high O(N2), because the state machine needs ELSE about N cycles IF a(count)=:/ THEN P:P+t add 2k Another approach, which trades speed for increased complexity, is called END工F; an"array, or parallel/parallel multiplier. A 4x4-bit array multiplier is shown 3 The equivalent Verilog code mul ser v for this example can be found in Fig. 2. 13. Notice that both operands are presented in parallel lo all adder array of N adder cells pendix A on page 353 56 2. Computer Arithmetic 2.4 Binary Multipliers ApION a l x[o] aOx[O 128Xa Pipeline-Register op a3]x[1] HA HA amOxiL 64X a 32X a3]x2 a2|x[2 a[][2 a0x{2 16X a +| a[3×[3 axIs a[ 31 ajaX[31 8X a, P 4X a FA HA 」:+ 2X a pL7 p 6 pIsI pl41 pI31 p2 pll p Fig.2.13. A 4-bit array multiplier Fig. 2. 14. Fast array multiplier for FPGAs This arrangement is viable if the time required to complete the carry and sun calculations are the same. For a modern FPGA. however, the carry Fig 2.15 reports the Registered Performance of three pipelined NXN computation is performed faster than the sum calculation and a different ar bit multipliers, using the MaxPlusII lpm_mult function, in a range from 4x4 chitecture is more efficient for FPGAs. The approach for this array multiplier is shown in Fig. 2. 14. for an 8X8-bit multiplier. This scheme combines in to 16 x 16 bits without(dotted line)and with(solid line) the use of I/ocell the first stage two neighboring partial products anX2" and an+12>2+1 and registers. Fig. 2.16 shows the effort for the multiplier, with(solid line) and without( dotted line) using the I/O cell registers. Placing the input regis the results are added to arrive at the final output product. This is a direct ter close to the multiplier(turn off option: AssignGlobal Project Logic array form of the "pencil and paper"method and must therefore produce a Synthesis-+Automatic Fast I/0), produces a slight gain in performance valid product The maximum size of multiplier that fits in the FleX10K20 is a 20x 20-bit We recognize from Fig. 2. 14 that this type of array multiplier gives the unit that uses 872 LCs and runs at 37.03 MHz Registered Performance apportunity to realize a(parallel) binary tree of the multiplier with a total with four pipeline stages nlumber of stages in the binary tree multiplier= log?(N). (2 9)又 Other multiplier architectures typically used in the ASIC world include Booth multipliers and Wallace-tree multipliers. They are discussed in Ex This alternative architecture also makes it easier to introduce pipeline stages ercises 22(p. 76)and 2.1 (p. 75) but are rarely used in connection with after each tree level. The necessary number of pipeline stages, according to FPGAS (2.28), to achieve maximum throughput is Bit width 23-45-89-1617-32 2.4.1 Multiplier Blocks Optimal number 5 A 2N X 2N multiplier can be defined in terms of an N x N multiplier block of pipeline stages The resulting multiplication is detined as 58 2. Computer Arithmetic 2.5 Multiply-Accumulator (MAC) and Sum of Product (SOP) 4x4 00 8> 500 16x16 400 4×4 40 8×3 0016x16 6 Number of pipeline stages Number of pipeline stag Fig. 2. 15. Performance of array multiplier for FPGAs Fig. 2.16. Effort in LOs for array multipliers P=Y.X=(V2N+Y1)(X2+Y1) Table 2.8. Data for EAB-based multipliers -F21222+(Y2X1+YX22+Y1X1 (229) Size Use I/O LCs EABs Pipeline Registered register sta Performarce where the indices 2 and 1 indicate the most significant half and least signif- icant N bit halves respectively. This partitioning scheme can be used if the 4X4 35.58MH: 4×4 1 51.54MH capacity of the FPGA is insufficient to implement a multiplier of desired size 4x4 2 76.33MHz or used to implement a. multiplier using 2Kbit EaB blocks. The number of 4×4 16 37, 87 MHZ EAB blocks in the flex10k20 are limited to six, and therefore the maximum 4×4 16 51.81MH乙 symmetric multiplier is 8x. Table 2. 8 shows the data for an EAB-based mul 4×4 76.33MHz tiplier. Comparing the data of Table 2.8 with the data from Fig 2.15 (p. 58 8×8 29 111144 23.80MHz 8×8 29 40.81MHz and 2.16(p. 59), it can be seen that the EAB-based multiplier reduces the 8×8 4 2 40.81MHz number of LCs but does not improve the Registered Performance 8×8 49 4 24.15MHz 8×8 49 10.81MHz 8×8 49 40.81MH 2.5 Multiply-Accumulator (MAC) and Sum of product (SOP) DSP algorithms are known to be multiply-accumulate(MAC) intensive. To vn]=fm*m=∑fkn-k 2.30 illustrate, consider the linear convolution sum given by

...展开详情
试读 28P 经典FPGA算法教材(二).pdf
立即下载 身份认证VIP会员低至7折
一个资源只可评论一次,评论内容不能少于5个字
您会向同学/朋友/同事推荐我们的CSDN下载吗?
谢谢参与!您的真实评价是我们改进的动力~
  • 至尊王者

关注 私信
上传资源赚钱or赚积分
最新推荐
经典FPGA算法教材(二).pdf 46积分/C币 立即下载
1/28
经典FPGA算法教材(二).pdf第1页
经典FPGA算法教材(二).pdf第2页
经典FPGA算法教材(二).pdf第3页
经典FPGA算法教材(二).pdf第4页
经典FPGA算法教材(二).pdf第5页
经典FPGA算法教材(二).pdf第6页

试读结束, 可继续读3页

46积分/C币 立即下载