Automatic Memory Partitioning and Scheduling for Throughput
and Power Optimization
JASON CONG, WEI JIANG, BIN LIU, and YI ZOU, University of California, Los Angeles
Memory bottleneck has become a limiting factor in satisfying the explosive demands on performance and cost
in modern embedded system design. Selected computation kernels for acceleration are usually captured by
loop nests, which are optimized by state-of-the-art techniques like loop tiling and loop pipelining. However,
memory bandwidth bottlenecks prevent designs from reaching optimal throughput with respect to available
parallelism. In this article we present an automatic memory partitioning technique which can efficiently
improve throughput and reduce energy consumption of pipelined loop kernels for given throughput constraints
and platform requirements. Our proposed algorithm can also handle general array accesses beyond affine
array references.
Our partitioning scheme consists of two steps. The first step considers cycle-accurate scheduling information
to meet the hard constraints on memory bandwidth requirements, specifically for synchronized hardware
designs. An ILP formulation is proposed to solve the memory partitioning and scheduling problem optimally for
small designs, followed by a heuristic algorithm which is more scalable and equally effective for solving large
scale problems. Experimental results show an average 6× throughput improvement on a set of real-world
designs with moderate area increase (about 45% on average), given that fewer resource-sharing opportunities
exist at higher throughput in optimized designs. The second step further partitions the memory banks
for reducing the dynamic power consumption of the final design. In contrast to previous approaches, our
technique can statically compute memory access frequencies in polynomial time with little or no profiling.
Experimental results show about 30% power reduction on the same set of benchmarks.
Categories and Subject Descriptors: B.5.2 [Register-Transfer-Level Implementation]: Design Aids—
Automatic synthesis
General Terms: Algorithms, Design, Experimentation
Additional Key Words and Phrases: Behavioral synthesis, memory partition
ACM Reference Format:
Cong, J., Jiang, W., Liu, B., and Zou, Y. 2011. Automatic memory partitioning and scheduling for throughput
and power optimization. ACM Trans. Des. Autom. Electron. Syst. 16, 2, Article 15 (March 2011), 25 pages.
DOI = 10.1145/1929943.1929947 http://doi.acm.org/10.1145/1929943.1929947
1. INTRODUCTION
Behavioral synthesis has emerged as a promising direction in EDA because of the
exponentially increasing complexity in modern SoC designs. Behavioral synthesis
transforms an algorithmic description of behavior into a hardware implementation. Typical
applications for behavioral synthesis are data-intensive or computation-intensive
kernels in multimedia processing, which require high throughput. Usually computation
A preliminary version of this article appeared as Cong et al. [2009].
This work is partly supported by GSRC, SRC Contract 2009-TJ-1879, and NSF grant CNS-0725354. Support
from Magma and AutoESL Inc. is also gratefully acknowledged.
Authors’ address: J. Cong, W. Jiang, B. Liu, and Y. Zou, Computer Science Department, University of
California, Los Angeles, CA 90095; email: {cong,wjiang,bliu,zouyi}@cs.ucla.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1084-4309/2011/03-ART15 $10.00
DOI 10.1145/1929943.1929947 http://doi.acm.org/10.1145/1929943.1929947
ACM Transactions on Design Automation of Electronic Systems, Vol. 16, No. 2, Article 15, Pub. date: March 2011.
Fig. 1. Area of SRAM cells (taken from Tatsumi and Mattausch [1999]).
kernels are captured by perfect loop nests. Loop tiling is a common optimization tech-
nique that partitions a loop’s iteration space into smaller chunks or blocks, so as to
increase the parallelism inside one loop iteration, and improve data locality. However,
memory bandwidth limits the potential performance gain even with large parallelism available.
For instance, in a pipelined design, the final throughput is limited by the number of
memory accesses divided by the number of memory ports.
Memory bottleneck problems have drawn intensive interest, in both software and
hardware designs. A natural way to tackle this problem is to increase memory ports in
the final memory implementations. However, research [Tatsumi and Mattausch 1999]
shows that the size of memory cells increases almost quadratically with port number
N (shown in Figure 1), which limits the maximal memory port number in practical
designs. Moreover, increasing memory ports is not feasible on FPGA platforms; for
example, block RAMs on Xilinx Virtex systems¹ are restricted to two read-write ports.
By duplicating the original memory N times, memory read bandwidth can be improved
N-fold. But besides the large area overhead, the number of simultaneous memory
writes cannot be increased, because it is difficult to guarantee that all memory copies
hold synchronized data. A better approach is memory partitioning, which divides the original
memories into several banks. To take advantage of memory partitioning, compilers
must analyze the memory access patterns in the original programs carefully, and balance them
among different banks.
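To make the idea concrete, cyclic (modulo) partitioning is one common bank-mapping scheme; the sketch below uses it purely for illustration, and is not necessarily the mapping derived by the approach in this article.

```python
# A sketch of cyclic memory partitioning (one common scheme; the paper's
# partitioning is derived from the access pattern, not fixed a priori).
# Address a goes to bank a % BANKS, at local offset a // BANKS.
BANKS = 2

def bank_of(addr: int) -> int:
    """Bank index holding the given flat array address."""
    return addr % BANKS

# In a pipelined loop reading A[i] and A[i+1] in every iteration, the two
# accesses always fall in different banks, so each single-port bank serves
# one access per cycle and the loop can sustain II = 1.
for i in range(16):
    assert bank_of(i) != bank_of(i + 1)
```

With a poorly chosen mapping (e.g., a stride equal to the bank count), both accesses could collide in one bank and halve the throughput, which is why the access pattern must drive the partitioning.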
Data partitioning and loop partitioning have been studied intensively in the compiler
domain for parallel computing on distributed systems. Ramanujam and Sadayappan
[1991] and Anderson and Lam [1993] developed algorithms to generate communication-
free data partitions for multi-computers with local memory. The approach in Agarwal
et al. [1995] presented an optimal tiling and iteration/data space mapping algorithm
even if communication-free partitioning solutions do not exist. In general, these algo-
rithms attempt to exploit coarse-level parallelism to generate good data partitions that
¹http://www.xilinx.com
minimize remote data accesses. However, in hardware design, data accesses must be
considered at cycle-accurate level to avoid excessive simultaneous accesses on the same
memory. In behavioral synthesis, Gong et al. [2005] studied the memory partitioning
problem based on the footprint of data accesses, where only row-based or column-based
partitioning is supported. Another approach [Baradaran
and Diniz 2008] studied the impact of data distribution, data duplication and scalar
replacement for behavioral synthesis to reduce latency of loop bodies. Their work did
not consider scheduling information in the data partitioning formulation. Our approach
considers memory partitioning at the cycle-accurate level to meet throughput require-
ments on pipelined perfect loop nests.
Memory partitioning can also be applied for power optimization. As IC technology
scales into the sub-100nm regime, power optimization has emerged as one of the most
important tasks in SoC design. Power consumption usually goes up with performance
improvement; therefore, power optimization is considered in this article as a secondary
objective to compensate for the overhead brought by performance optimization. Memory
partitioning techniques for power reduction have been studied [Lyuh and Kim 2004;
Wang and Hu 2004; Poncino et al. 2000]. Lyuh and Kim [2004] and Wang and Hu
[2004] assign discrete memory read/write operations to different memory banks for
performance and power optimizations. Their formulations mainly work for data flow
graphs where one memory access can be bound to only one memory bank statically. The
approach in Poncino et al. [2000] generates an optimal memory partitioning scheme
based on platform information and memory access frequencies. The access frequencies
are assumed to be given in their formulation, and are usually obtained by profiling. In our
approach, each array access is modeled as a parameterized polytope, and array access
profiles can be calculated by a polyhedron library to get an analytical solution with
respect to loop bounds and other parameters. For multimedia programs, our approach
usually requires no profiling (for affine array accesses) or only lightweight profiling by
using a parameterized polytope model. In addition, our approach can support more
partitioning schemes.
In this article, a synthesis flow is proposed to solve the memory partitioning problem
for throughput and power optimization. The goal of throughput optimization is to
maximize concurrent memory accesses by partitioning, whereas power optimization tries to
shut down unused memory banks to save dynamic power consumption. These distinct
objectives make it extremely difficult to tackle throughput and power optimization in
one step. In our approach, the throughput optimization problem is solved in the first
step; the results are then passed to the power optimization step to further reduce the total
energy consumption.
Our contributions include the following: (i) a behavioral synthesis flow which generates
optimal memory partitioning solutions automatically for fast design space
exploration; (ii) unlike most previous work, our approach can handle array references
beyond the affine form, such as indirect array accesses and nonaffine array
subscripts; (iii) our partitioning scheme considers cycle-accurate scheduling
information to meet the hard constraints on memory bandwidth requirements, specifically
for synchronous hardware designs; (iv) an effective power optimization algorithm
is proposed that considers more general partition schemes and requires less profiling
effort. Compared to previous works, our approach requires only lightweight profiling,
and can support more partitioning schemes for better results. These contributions are
validated by our experimental results; more than 15× performance improvement and
60% power reduction can be observed on certain designs.
A preliminary version of this work was reported in Cong et al. [2009]. In this article,
we propose an integer-linear-programming (ILP) based algorithm to optimally
solve memory partitioning and scheduling together for small examples. Our
algorithm is extended to consider loop dependences as well for throughput optimization.
Furthermore, an iterative modulo scheduling algorithm is adapted for the final
loop pipelining following our memory partitioning step.
The remainder of the article is organized as follows. Section 2 describes the
preliminaries, such as the polytope model and loop pipelining. Section 3 gives a motivating
example for our memory partitioning problem. Section 4 gives the problem formulation of
our memory partitioning problem, and Sections 5–7 present our proposed algorithms
for solving this problem. Section 8 reports experimental results and is followed by
conclusions in Section 9.
2. PRELIMINARIES
2.1. Polytope Model
Computation-intensive parts of applications are commonly captured by perfect loop
nests. Without loss of generality, all loops are normalized to have unit step size. Loops
and memory accesses can be modeled as polytopes on rational spaces.
Definition 1. The iteration space I of a depth-l loop nest is defined in the
l-dimensional space, where each iteration corresponds to an integer vector identified
by its index values I = (i_1, i_2, ..., i_l).
Definition 2. Similarly, an n-dimensional array is defined by an array space A,
and each array access at a certain iteration is determined by an integer tuple
A = (a_1, a_2, ..., a_n).
Definition 3. Given iteration space I and array space A, each array access R_k in a
loop nest is a function R_k : I → A. R_k is an affine array access if R_k(I) = M ∗ I + V,
where M is an n × l matrix and V is a constant vector of size n.
Example 1.

for (i = 0; i <= N; i++)
    for (j = i; j <= M; j++) {
        A[i-1, i+j] = ...;
    }
For Example 1, the iteration space is

I = {(0, 0), (0, 1), ..., (0, M), (1, 1), (1, 2), ..., (1, M), ...}.

Assume the size of array A is 10 × 10; the data space of array A is

A = {(0, 0), (0, 1), ..., (0, 9), (1, 0), (1, 1), ..., (1, 9), ...}.
And array access R_1 : A[i−1, i+j] is an affine array access with respect to iteration
space I = (i, j), since

R_1(I) = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 0 \end{pmatrix} = \begin{pmatrix} i-1 \\ i+j \end{pmatrix}.
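The affine form R_k(I) = M ∗ I + V can be exercised in a few lines of Python (an illustrative sketch, not the paper's tooling), using the M and V of Example 1:

```python
# Affine array access R1(I) = M*I + V from Example 1: A[i-1, i+j].
M = [[1, 0],
     [1, 1]]     # row r gives the r-th array subscript as a function of (i, j)
V = [-1, 0]      # constant offset vector

def access(i, j):
    """Array element touched by iteration (i, j) under R1."""
    I = (i, j)
    return tuple(sum(M[r][c] * I[c] for c in range(2)) + V[r]
                 for r in range(2))

# Iteration (i, j) = (2, 3) touches A[i-1, i+j] = A[1, 5].
assert access(2, 3) == (1, 5)
```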
If loop bounds and array accesses are affine, loop space and array accesses can be
modeled as polytopes on rational space Q.
Definition 4. A rational parametric polytope PT_p with n parameters p is a set of
rational d-dimensional vectors x in rational space Q^d defined by m linear inequalities
on x and p, where A and B are (m × d) and (m × n) rational matrices, and c is a size-m
constant vector:

PT_p = {x ∈ Q^d | Ax ≥ Bp + c}.
Fig. 2. A loop pipelining example.
We can easily construct the polytope which contains the iteration space from the loop
structure. For the example above, the iteration space is the set of integer points inside the
polytope

PT_I = {(i, j) ∈ Q^2 | 0 ≤ i ≤ N, i ≤ j ≤ M}.
The corresponding values of A, B, p and c in Definition 4 can also be calculated from
the given form:

PT_I = \left\{ I \in Q^2 \,\middle|\, \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} \ge \begin{pmatrix} 0 \\ -N \\ 0 \\ -M \end{pmatrix} \right\}.
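As a sanity check (not part of the paper's flow), the inequality form of PT_I can be enumerated for small parameter values and compared against the loop bounds of Example 1:

```python
# Check that the inequality form A*x >= B*p + c of PT_I reproduces the loop
# bounds 0 <= i <= N, i <= j <= M for small N, M (a sketch, not the paper's
# polyhedral tooling, which handles symbolic parameters).
A = [[1, 0], [-1, 0], [-1, 1], [0, -1]]   # constraint matrix

def rhs(N, M):
    """Right-hand side B*p + c for parameter vector p = (N, M)."""
    return [0, -N, 0, -M]

def polytope_points(N, M):
    """Integer points of PT_I, found by filtering the bounding box."""
    c = rhs(N, M)
    return {(i, j)
            for i in range(N + 1) for j in range(M + 1)
            if all(A[r][0] * i + A[r][1] * j >= c[r] for r in range(4))}

def loop_points(N, M):
    """Iteration space generated directly by the loop nest of Example 1."""
    return {(i, j) for i in range(N + 1) for j in range(i, M + 1)}

assert polytope_points(3, 5) == loop_points(3, 5)
```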
If array access R_k is an affine array access R_k(I) = M ∗ I + V, the number of accesses
on an array element A is the number of integer points inside the intersection of iteration
space I and the hyperplane {I | A = M ∗ I + V}. For instance, the set of loop iterations
that access array location A[a_1, a_2] is the polytope

{(i, j) | i − 1 = a_1, i + j = a_2, 0 ≤ i ≤ N, i ≤ j ≤ M}.
The polytope model provides a useful abstraction of the array access behavior, which
can be used to find the array access frequency for power optimization. Details will be
discussed later.
2.2. Loop Pipelining
Loop pipelining is the most commonly used technique for improving throughput; it
allows simultaneous execution of operations from several consecutive loop iterations. Modulo
scheduling [Rau 1994] is a loop pipelining technique which tries to find a schedule for
one iteration of the loop such that, when this same schedule is repeated at regular intervals,
no intra- or inter-iteration dependence is violated, and no resource usage conflict
arises between operations of either the same or different iterations. This constant
interval between the start of successive iterations is termed the initiation interval (II),
usually measured in clock cycles in synchronized hardware designs. An example is
shown in Figure 2, which has II = 1.
Definition 5. The minimum initiation interval MII is a lower bound on the smallest
possible value of II for which a modulo schedule exists.
The lower bound MII is limited by two factors: resource limits and inter-iteration loop
dependences, which are termed the resource-constrained MII (ResMII) and the recurrence-
constrained MII (RecMII), respectively. Thus, MII ≥ MAX(ResMII, RecMII).
In a pipelined design, each operation in a loop body is executed once within II cycles,
so obviously ResMII ∗ resources ≥ operations, or ResMII ≥ ⌈operations/resources⌉. For
memory accesses, if the memory port number is N, and there are K memory accesses,