Finding the best compromise in compiling compound loops to Verilog

Yosi Ben-Asher *, Nadav Rotem, Eddie Shochat

Computer Sci. Dep. Haifa University, Haifa, Israel

article info

Article history:

Received 31 March 2009

Received in revised form 28 June 2010

Accepted 2 July 2010

Available online 18 July 2010

Keywords:

High-level synthesis

FPGA

Compilation

abstract

In this work we consider a special optimization problem involved in compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for an optimized execution time. For example, one loop requires at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC must be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected number of iterations) of the sub-loops, is a good compromise between the conflicting requirements of the sub-loops. Though synthesis of nested loops has been implemented in quite a few systems, this aspect has not been considered so far. We avoid the use of time-consuming integer linear programming (ILP) techniques and instead use a fast space exploration technique combined with an efficient variant of list scheduling.

Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be considered as variables of the OHC rather than fixed real values, as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in the execution time due to the use of this search procedure.

© 2010 Published by Elsevier B.V.

1. Introduction

Embedded systems are characterized by extensive loop processing and a low power budget. In many cases these loops can be compiled to hardware circuits that are significantly faster and more power efficient than their software versions. Automating this process is a fundamental problem in high-level synthesis (HLS) [13]. Synthesizing code to circuits is done in the scheduling phase of a compiler, wherein the operations in the code are scheduled to a time × hardware-resources 2D table without violating data dependencies in the code. Each row in the 2D table determines the clock cycle in which the operations in that row will be executed. We remark that the main difference between scheduling to hardware and scheduling to a CPU is that when compiling to hardware, we have the freedom to select the target architecture, namely the resources that can be used in each clock cycle. In particular we can generate circuits with multiple parallel memory references that are started in the same clock cycle. There is also freedom to choose the partition of the schedule (the 2D table) into clock cycles (see [27] for a detailed survey on scheduling issues in HLS).

For a given loop $L_i$ let $t_{opt}(L_i)$ be the latency (number of clock cycles) of a circuit obtained by scheduling $L_i$'s operations using some known scheduling algorithm (list scheduling in our case) with an unlimited amount of resources. By the term "latency" we refer to the number of clock cycles it takes for the synthesized circuit to complete one iteration of the loop. The execution time of an iteration is obtained by multiplying the latency by the clock rate. The optimized hardware configuration (OHC) of $L_i$ is the minimal amount of resources for which list scheduling (LS) of $L_i$ will produce a circuit $C_i = LS(L_i, OHC)$ whose latency is "close enough" to $t_{opt}(L_i)$. The term "close enough" indicates that the latency of $C_i$ is greater than $t_{opt}(L_i)$ by no more than some small fraction, e.g., $t(C_i) < 1.2 \times t_{opt}(L_i)$. While finding an OHC of a single loop can be done by an exhaustive search, finding an OHC for a set of loops $L_1, \ldots, L_k$ is problematic since each $L_i$ can have a different and possibly conflicting OHC. Note that we assume that this set of loops is part of the same program and hence must be executed using the same set of resources (or at least partially share resources). Clearly, if we select a common OHC which is the union of all the resources needed by each $L_i$ to achieve $t_{opt}(L_i)$, then all loops will achieve their optimal latencies. However, this might be too wasteful, as there might be another common OHC that minimizes the use of resources without damaging the overall performance significantly.
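The single-loop case of this definition can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `find_ohc`, `toy_latency`, the choice to measure "minimal amount of resources" as the total unit count, and the toy latency model itself. In the paper the latency comes from list scheduling, not from a closed formula.

```python
from itertools import product

def find_ohc(latency, max_units, slack=1.2):
    """Exhaustively search for the smallest configuration that is "close enough".

    latency:   callable mapping a configuration tuple to a latency in clock
               cycles (a stand-in for running list scheduling on the loop).
    max_units: per-resource upper bounds, used here as "unlimited" resources.
    slack:     allowed factor over t_opt (1.2 in the text's example).
    """
    t_opt = latency(tuple(max_units))
    best = None
    for cfg in product(*(range(1, m + 1) for m in max_units)):
        if latency(cfg) < slack * t_opt:
            # Assumption: "minimal" means fewest total units; ties are
            # broken lexicographically.
            if best is None or (sum(cfg), cfg) < (sum(best), best):
                best = cfg
    return best, t_opt

def toy_latency(cfg):
    """Hypothetical model: 2 loads and 3 additions, one cycle each."""
    ports, adders = cfg
    load_cycles = -(-2 // ports)     # ceil(2 / ports)
    add_cycles = -(-3 // adders)     # ceil(3 / adders)
    return load_cycles + add_cycles

ohc, t_opt = find_ohc(toy_latency, max_units=(4, 4))
print(ohc, t_opt)
```

Under this toy model the search returns two memory ports and three adders, since any smaller configuration exceeds the 1.2 threshold. The exhaustive loop is only practical for one loop and a handful of resource types, which is the limitation the text goes on to discuss.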

Journal of Systems Architecture 56 (2010) 474–486
journal homepage: www.elsevier.com/locate/sysarc
1383-7621/$ - see front matter © 2010 Published by Elsevier B.V.
doi:10.1016/j.sysarc.2010.07.001
* Corresponding author. E-mail address: yosi@cs.haifa.ac.il (Y. Ben-Asher).
Footnote: An early short version of this work appeared in ISVLSI-2008.
Footnote: Sometimes, when the clock rate is regarded as 1 time unit, the execution time corresponds to the latency.

Fig. 1 depicts the synthesis (optimal scheduling) of two consecutive loops. Each operation is described by a rectangular box (annotated by the real latency of that operation) whose base indicates the time when the inputs for this operation are ready and whose end indicates the time when the output of this operation is ready. Different resources of the same type are designated by different textures in the figure. The first loop executes s += A[j] + b + B[j]; and the second s += A[j] * b * B[j];. Assume that the real latencies are: multiplication 40 ns, memory 50 ns, and addition 10 ns. By evaluating all possible schedulings, we can verify that:

1. The OHC of the first loop includes two memory ports, two addition units and a memory delay factor of one clock cycle. Here, latencies can be optimized if the two load operations (of A[i] and B[i]) are executed in parallel.

2. The OHC of the second loop includes one memory port, one multiplication unit, one addition unit and a memory/arithmetic delay factor of one clock cycle. Given that the latency of the multiplication is close to the memory latency and the multiplications must be executed one after the other, there is no point in executing two parallel loads.

Hence, a set of loops $L_1, \ldots, L_k$ can have conflicting requirements for an OHC, from which a compromise OHC must be selected. This compromise can be based on the following considerations, namely OHC1 is preferable to OHC2 if:

• The loops that OHC1 improves execute more iterations than those improved by OHC2.

• OHC1 improves more loops than OHC2.

• OHC1 saves more resources than OHC2.

• OHC1 saves more significant resources than OHC2, e.g., a reduction in the number of memory ports used is preferable to a reduction in the number of adders.

In the example of Fig. 1 we assumed that the number of iterations n of the first loop is smaller than the number of iterations m of the second loop (m > n); hence, it is best to select the OHC of the second loop and use one memory port. The "damage" to the execution time of the first loop is that each iteration takes 3 × 50 ns instead of the 2 × 50 ns it would take with two parallel load operations. This is a good choice because it saves one memory port and, for m > n, the overall execution time is not affected significantly. In general the problem of selecting the best compromise hardware architecture for a given set of loops is not trivial and, to the best of our knowledge, has not been considered so far.
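The cost of this compromise can be written out from the figures quoted above. The symbol $\Delta T_1$ is introduced here only for illustration and denotes the extra time spent in the first loop when one memory port is used:

```latex
\Delta T_1 = n \cdot \left( 3 \cdot 50\,\mathrm{ns} - 2 \cdot 50\,\mathrm{ns} \right) = 50\,n\ \mathrm{ns}
```

The penalty grows only with n, the iteration count of the colder loop, which is why saving the memory port is attractive when m > n.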

In this work we use a search procedure combined with a modified list scheduling algorithm to find an OHC for a given set of loops. The use of the fast linear-time list scheduling allows us to complete the proposed search procedure in the reasonable times needed for an interactive mode of operation. The need for relatively fast synthesis times is, in our view, essential for practical use of hardware compilation systems. An alternative approach would have been to use an ILP solver. However, using an ILP solver is not practical due to its state explosion problem and the relatively large code sizes of the underlying compound loops. The use of a modified list scheduling facilitates the fast interactive synthesis mode needed to study the different tradeoffs involved in this form of synthesis. We remark that for a reasonable number of possible resources and loops there are too many possible configurations in the OHC space to evaluate. Hence, an exhaustive search evaluating all possible OHCs is not practical if the system is to be used interactively. Thus, the search procedure described here finds optimized solutions using a gradient-descent type of search, so that only a few hundred OHCs need to be evaluated.

The paper is organized as follows. Section 2 describes the use of memory delay factors instead of real memory latencies in HLS scheduling. Section 3 contains a detailed description of the LS variant we use to schedule a sequence of loops for a given OHC. Section 4 describes the search procedure used to find the OHC which is a good compromise between the conflicting OHCs of the different loops being synthesized. Section 4.1 contains a detailed example of how the search procedure works. Section 5 contains a detailed description of the system, describing how the user can use the different loop transformations that are available. Section 6 contains the experimental results and Section 7 discusses related work.

    for(i=0;i<n;i++){
        s += b+A[i]+B[i];
        b = b+i;
    }
    for(j=i;j<m;j++){
        s += A[j]*b*B[j];
        b = b+j;
    }

Fig. 1. Two similar loops with different OHCs. [Schedule diagrams omitted; the box labels in the original figure are load, mem, 10ns, 40ns and 50ns.]

2. Memory delay factors and system overview

In this section we describe practical aspects of the synthesis to Verilog and introduce a variant of multi-cycle operations called "memory delay factor" (MDF), reflecting the ability to partition a memory reference into several clock cycles. Fig. 2 depicts the synthesis of a loop where the resource constraints include four addition units and two memory ports. The load operation is implemented by storing an address into a memory port (p0/p1); after some clock cycles the memory returns a value available in a value-port (v0/v1). Once the scheduling has been computed, it is relatively easy to translate it to Verilog, as depicted in Fig. 2. The resulting code is fairly self-explanatory, except for the assignments reg <= exp; which indicate that when the clock goes up, the register is loaded with the value available at the output of a combinatorial circuit computing exp. All the assignments reg <= exp; in the same case are executed in parallel. For simplicity and clarity, we do not show the Verilog code that is obtained after the binding pass, as it includes many MUXs/DeMUXs that make it less readable. We remark that pipelining techniques for loops (which are usually an important part of an HLS system) are executed as a source-level loop transformation before the synthesis begins and are thus not relevant to the system described here (an example showing how it is done will be given later on).

It follows that in the case of Fig. 2 the hardware configuration used includes: two memory ports, four addition units, a memory delay of one clock cycle and a delay of 0.5 clock cycles for the additions. Note that the values returned from the memory must be used at least one clock cycle after the address has been loaded. Assume that the real memory latency is 50 ns and the real addition latency is 5 ns. The resulting clock rate (after synthesis) will correspond to the longest path in each clock cycle, yielding a clock cycle of 55 ns. Now assume that we require that a memory operation be synthesized with two clock cycles instead of one. It follows that the resulting scheduling must now include three clock cycles, as depicted in Fig. 3. Hence the clock rate is reduced to 25 ns (still dominated by the memory latency). The execution time for one loop iteration has been improved from 60 × 2 ns to 25 × 3 ns, which is a significant improvement. In addition, the new scheduling uses only two adders instead of four.
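Written out with the figures given in the text (the symbols $T_{\mathrm{MDF}=1}$ and $T_{\mathrm{MDF}=2}$ are introduced here only for illustration), the per-iteration execution times are:

```latex
T_{\mathrm{MDF}=1} = 60\,\mathrm{ns} \times 2 = 120\,\mathrm{ns},
\qquad
T_{\mathrm{MDF}=2} = 25\,\mathrm{ns} \times 3 = 75\,\mathrm{ns}
```

This is a saving of 45 ns per iteration, or 37.5% of the original time, obtained even though the schedule has one more clock cycle.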

This introduces the concept of using memory delay factors (MDFs) as part of the OHC instead of real memory latencies.

Definition 2.1. Let reg = load(address) be a load operation $l$ that is synthesized to a circuit C. Let $l.start$ be the clock cycle where address has been loaded in a memory port and $l.use$ be the clock cycle where the value returned by the memory module is first used (stored to reg). A given circuit C has a memory delay factor MDF = k if for every load operation $l$:

• $l.use - l.start \ge k$.

• In all the clock cycles between $l.start$ and $l.use$ no other address is loaded to the memory port associated with $l$.

Similarly, the memory port used for every store operation cannot be used for at least k clock cycles. In addition, arithmetic operations can also be characterized by a delay factor. A circuit C has an arithmetic delay factor ADF = k if for every multiplication or division reg = mult/div(reg1, reg2) there are at least k clock cycles between setting the input values to reg1, reg2 and the use of the output in reg by other operations in C.
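The two conditions of Definition 2.1 can be checked mechanically on a finished schedule. The following is a minimal Python sketch; the `Load` record and the `has_mdf` function are hypothetical names, and treating "between" as the half-open interval from l.start up to but not including l.use is an assumption of this sketch.

```python
from collections import namedtuple

# One load operation: the memory port it uses, the cycle in which its address
# is loaded, and the cycle in which the returned value is first used.
Load = namedtuple("Load", ["port", "start", "use"])

def has_mdf(loads, k):
    """Return True if every load satisfies both conditions for MDF = k."""
    for l in loads:
        # Condition 1: the value is used at least k cycles after the address load.
        if l.use - l.start < k:
            return False
        # Condition 2: no other address is loaded to the same port in the
        # cycles from l.start up to (not including) l.use.
        for other in loads:
            same_port = other is not l and other.port == l.port
            if same_port and l.start <= other.start < l.use:
                return False
    return True

# Loads as in Fig. 3: addresses loaded in state 1, values used in state 3.
fig3 = [Load("p0", 1, 3), Load("p1", 1, 3)]
# Two loads sharing one port with overlapping windows.
clash = [Load("p0", 1, 3), Load("p0", 2, 4)]
print(has_mdf(fig3, 2), has_mdf(fig3, 3), has_mdf(clash, 2))
```

The Fig. 3 schedule satisfies MDF = 2 but not MDF = 3, and the overlapping pair fails because the port is reused before the first value has been consumed.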

Note that in the example of Fig. 2 increasing the MDF above 7 will only slow down the resulting iteration time, since the last stage with the two additions (10 ns) will always lead to an execution time greater than 60 ns. Consequently, in the proposed approach there is an iterative search to find optimized delay factors, including not only the MDF but also the delay factors for "heavy" arithmetic operations (ADF) such as multiplications.

Generating schedulings with optimized delay factors differs from the usual way in which HLS schedulings are obtained. This is because delay factors become "open" variables that are optimized by the scheduling, while in regular HLS resource latencies are fixed constraints that the scheduler must satisfy (including the use of multi-cycle operations).

As indicated earlier, maximizing the delay factors (making them part of the solution instead of part of the problem) makes the resulting scheduling problem less rigid, allowing more parallelism in the resulting schedulings. Optimal solutions with real hardware latencies are harder to obtain and require the use of integer linear programming (ILP). As indicated in several works, it may become impractical to schedule large or even medium code segments (above 100 lines of source code) using ILP methods. Using delay factors instead of real latencies simplifies the scheduling constraints since:

• The range of the delay factors is smaller than that of the real latencies.

• The search for optimized delay factors is not part of the scheduling algorithm and is performed as an external procedure, hence the scheduler is presented with a simpler problem to solve.

Thus using delay factors allows us to separate the scheduling problem into a search for an optimized combination of delay factors and resources, followed by a simple list scheduling to compute the resulting latencies (for the current choice of the delay factors and the resources). In this way we can generate and evaluate several hundred combinations in a few seconds.

    for(i=0;i<n;i++){
        s += A[i]+b+B[i];
        b = b+i;
    }

    always @(posedge clock)
    begin case(state)
        1: begin
            p0 <= A+i; p1 <= B+i;
            s <= s+b; b <= b+i;
            state <= 2; end
        2: begin
            s <= s + v0 + v1;
            if(i < n) begin i <= i+1;
                state <= 1; end
            else state <= done; end
    endcase
    end

Fig. 2. A given loop, a scheduling of its Data Dependency Graph and the resulting Verilog code. [Schedule diagram omitted.]

    for(i=0;i<n;i++){
        s += A[i]+b+B[i];
        b = b+i;
    }

    always @(posedge clock)
    begin case(state)
        1: begin
            p0 <= A+i; p1 <= B+i;
            state <= 2; end
        2: begin
            s <= s+b; b <= b+i;
            state <= 3; end
        3: begin
            s <= s + v0 + v1;
            if(i < n) begin i <= i+1;
                state <= 1; end
            else state <= done; end
    endcase
    end

Fig. 3. Improving execution times by assigning a delay factor of two cycles to memory operations. [Schedule diagram omitted; its annotation reads: clock cycle 25ns, but only 10ns needed.]

Footnote: We use the term clock rate instead of clock frequency in order not to confuse it with the loop's execution frequency.

Using delay factors is basically not a new concept; there have been other efforts considering multi-cycle operations, including multi-cycle load/store operations and pipelined arithmetic operations that are executed in k > 1 stages. Though delay factors are basically multi-cycle operations, there is a significant difference between them. While multi-cycle operations are operations with a fixed delay expressed by the number of clock cycles they should last, delay factors are free variables that the synthesis tries to maximize. Thus it can happen that the synthesis will produce MDF = 1, meaning that a memory reference will be synthesized to complete in one clock cycle, or MDF = 3, meaning that the memory reference will be synthesized to complete after three clock cycles. As explained earlier, using delay factors can improve the resulting clock cycle.

3. Detailed description of the LS algorithm

In this section we briefly describe the LS variant (see Fig. 4) used in this work (called modified LS). The goal of list scheduling (LS) [13] in its HLS version is to schedule a DDG of operations to a minimal consecutive set of clock cycles such that data dependencies, delays and resource limitations are preserved. The input to LS is usually a DDG (G) of operations where each edge $u \to v$ is labeled by the real_latency between u and v. In our case only edges depending on a memory operation are labeled by the delay factor and the rest of the edges are labeled by 1. This reflects the difference between the usual approach to scheduling and the delay-factor method that is used here. Delays are imposed by marking the suitable resources as used for d > 1 clock cycles (line-14 in Fig. 4). Basically, the LS maintains a ready-list (RL) of nodes and at each step it selects the node with the highest priority from the RL and schedules it at the next available place in a reservation table. Priorities are assigned to the nodes of G mainly based on the maximal path length to an exit node (in our case the longest path counts only delay factors; Fig. 4 line-05). Intuitively, by selecting a node u with a long path first, the LS will be able to fill the delays between the nodes of this path with independent operations from other, shorter paths. In addition to the path length, the priority is also affected by other factors, e.g., the out-degree of a node. We use the LS not only to find optimized schedulings for a given OHC but also to determine the desired changes in the current OHC. For each resource R the LS counts (Fig. 4 line-19) how many times an operation from the ready-list was not scheduled due to a lack of a resource. Similarly, the LS counts how many times a resource was not used in the current schedule (Fig. 4 line-24). Based on these two statistics it is possible to evaluate whether the number of resources of type R should be changed or not.
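The description above can be sketched in Python as follows. This is a minimal illustration and not the algorithm of Fig. 4: the function name `list_schedule`, the data layout, and the simplifications (priority is only the delay-weighted longest path, every non-memory operation takes one cycle, each resource type has at least one unit) are assumptions of this sketch.

```python
def list_schedule(ops, deps, units, mdf=1):
    """Schedule a DDG under resource limits and collect lack/idle statistics.

    ops:   {name: resource type}, e.g. {"ld_a": "mem", "sum": "add"}
    deps:  {name: [names it depends on]}
    units: {resource type: number of units available (at least 1 each)}
    mdf:   delay in cycles on edges leaving a "mem" operation; other edges use 1
    """
    def delay(u):
        return mdf if ops[u] == "mem" else 1

    succs = {u: [v for v in ops if u in deps.get(v, [])] for u in ops}

    # Priority: longest delay-weighted path from the node to an exit node.
    prio = {}
    def path(u):
        if u not in prio:
            prio[u] = delay(u) + max((path(v) for v in succs[u]), default=0)
        return prio[u]
    for u in ops:
        path(u)

    start = {}
    busy_until = {r: [0] * n for r, n in units.items()}
    lack = {r: 0 for r in units}   # ready op found no free unit of type r
    idle = {r: 0 for r in units}   # unit of type r was free during a cycle
    cycle = 0
    while len(start) < len(ops):
        # Ready list: unscheduled ops whose inputs are all available this cycle.
        ready = [u for u in ops if u not in start and all(
            p in start and start[p] + delay(p) <= cycle
            for p in deps.get(u, []))]
        ready.sort(key=lambda u: -prio[u])
        for u in ready:
            r = ops[u]
            free = [i for i, t in enumerate(busy_until[r]) if t <= cycle]
            if free:
                start[u] = cycle
                # The unit is marked as used for the whole delay of the op.
                busy_until[r][free[0]] = cycle + delay(u)
            else:
                lack[r] += 1
        for r in units:
            idle[r] += sum(1 for t in busy_until[r] if t <= cycle)
        cycle += 1
    latency = max(start[u] + delay(u) for u in ops)
    return start, latency, lack, idle

ops = {"ld_a": "mem", "ld_b": "mem", "sum": "add", "acc": "add"}
deps = {"sum": ["ld_a", "ld_b"], "acc": ["sum"]}
_, lat_two_ports, lack_two, idle_two = list_schedule(ops, deps, {"mem": 2, "add": 1})
_, lat_one_port, lack_one, _ = list_schedule(ops, deps, {"mem": 1, "add": 1})
print(lat_two_ports, lack_two, lat_one_port, lack_one)
```

With two ports both loads issue in the first cycle and nothing is ever starved; with one port the second load is deferred, the latency grows by one cycle, and the `lack` counter for `mem` records the shortage. A search procedure could read these counters to decide which resource count to raise or lower.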

The number of resources in the OHC (ports, adders, multipliers) is usually small compared to the number of nodes in G. Hence, the LS algorithm can be regarded as a linear-time algorithm (in the number of nodes in G) that can be used many times to test different combinations of OHCs.

For clarity we list the definitions of some of the main terms related to the synthesis of the scheduling computed by the LS:

Stage – each row in the scheduling table is synthesized to a set of parallel Verilog assignments that are executed in one clock cycle.

Control steps – the control structure of the original set of loops is synthesized to a state machine such that each stage is determined by an if-statement executed as part of the parallel assignments of a stage. Fig. 2 illustrates the synthesis of the control structure of a loop.

Clock cycle – one step in the execution of the resulting circuits; its duration is determined by the synthesis tool.

Latency – since each clock cycle corresponds to one stage, the number of stages/clock cycles (latency) needed to synthesize one loop iteration corresponds to the final execution time of that loop iteration.

4. The search engine

In this section we describe how LS is used to synthesize a compound loop and how the proposed search procedure works. Though each loop $L_i$ has its own $OHC_i$, the goal is to find a single OHC that is a good compromise between the possibly conflicting $OHC_i$s. Each loop may have a different heat level (expected number of iterations). Thus the "global OHC" should minimize the weighted sum of the scheduling lengths of each loop. Obviously, we can achieve this optimality using the union of all $OHC_i$; however, this may require too many resources and in particular use the minimal delay factor of all $OHC_i$, increasing the execution time significantly (e.g., if MDF = 1 then the clock rate will be at least as large as the memory latency).

First, we show via one example that the impact of changing the OHC can be computed, namely, we can determine if a given change in the current OHC is desirable. For a given loop $L$ let $L.f$ be its execution frequency (number of iterations the loop is executed) and $L.s$ the latency or scheduling length of one iteration. For example, if a compound loop contains two loops $L_1, L_2$ such that

• $L_1$'s heat level is $L_1.f = 4000$ iterations while $L_2.f = 1000$ iterations, and both loops have the same scheduling time of $L_1.s = L_2.s = 10$ ns per iteration when an optimized OHC is used for each loop.

• $L_1$ needs only one memory port for an optimal execution time ($L_1.p = 1$) while $L_2$ needs two memory ports for an optimized scheduling ($L_2.p = 2$). $L_1$ needs two multiplication units for optimized execution ($L_1.mult = 2$) compared to only one multiplication unit needed by $L_2$ ($L_2.mult = 1$).

• The total execution time of a loop is thus $T(L) = L.f \cdot L.s$.

Fig. 4. The LS algorithm.
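The weighted sum of scheduling lengths used in the Section 4 example can be evaluated with a few lines of Python. The `Loop` class and `total_time` function are hypothetical names, and the slowed-down latency of 12 ns is an invented value used only to show how the heat level weights a change; it does not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Loop:
    f: int      # heat level: expected number of iterations
    s: float    # latency (scheduling length) of one iteration, in ns

def total_time(loops):
    """Weighted sum of scheduling lengths: sum of L.f * L.s over all loops."""
    return sum(loop.f * loop.s for loop in loops)

# Values from the example: 4000 and 1000 iterations, 10 ns per iteration each.
baseline = total_time([Loop(f=4000, s=10), Loop(f=1000, s=10)])

# Hypothetical: a compromise OHC slows one of the two loops to 12 ns.
hot_loop_slowed = total_time([Loop(f=4000, s=12), Loop(f=1000, s=10)])
cold_loop_slowed = total_time([Loop(f=4000, s=10), Loop(f=1000, s=12)])
print(baseline, hot_loop_slowed, cold_loop_slowed)
```

The same 2 ns penalty costs four times as much when it lands on the loop with the higher heat level, which is why the search weights each loop's scheduling length by its expected iteration count.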
