Automatic Memory Partitioning and Scheduling for Throughput
and Power Optimization
JASON CONG, WEI JIANG, BIN LIU, and YI ZOU, University of California, Los Angeles
Memory bottleneck has become a limiting factor in satisfying the explosive demands on performance and cost
in modern embedded system design. Selected computation kernels for acceleration are usually captured by
loop nests, which are optimized by state-of-the-art techniques like loop tiling and loop pipelining. However,
memory bandwidth bottlenecks prevent designs from reaching optimal throughput with respect to available
parallelism. In this article we present an automatic memory partitioning technique which can efficiently
improve throughput and reduce energy consumption of pipelined loop kernels for given throughput constraints
and platform requirements. Our proposed algorithm can also handle general array accesses beyond affine
array references.
Our partitioning scheme consists of two steps. The first step considers cycle-accurate scheduling information
to meet the hard constraints on memory bandwidth requirements, specifically for synchronized hardware
designs. An ILP formulation is proposed to solve the memory partitioning and scheduling problem optimally for
small designs, followed by a heuristic algorithm which is more scalable and equally effective for solving large
scale problems. Experimental results show an average 6× throughput improvement on a set of real-world
designs with moderate area increase (about 45% on average), given that fewer resource-sharing opportunities
exist at higher throughput in optimized designs. The second step further partitions the memory banks
for reducing the dynamic power consumption of the final design. In contrast to previous approaches, our
technique can statically compute memory access frequencies in polynomial time with little or no profiling.
Experimental results show about 30% power reduction on the same set of benchmarks.
Categories and Subject Descriptors: B.5.2 [Register-Transfer-Level Implementation]: Design Aids—
Automatic synthesis
General Terms: Algorithms, Design, Experimentation
Additional Key Words and Phrases: Behavioral synthesis, memory partition
ACM Reference Format:
Cong, J., Jiang, W., Liu, B., and Zou, Y. 2011. Automatic memory partitioning and scheduling for throughput
and power optimization. ACM Trans. Des. Autom. Electron. Syst. 16, 2, Article 15 (March 2011), 25 pages.
DOI = 10.1145/1929943.1929947 http://doi.acm.org/10.1145/1929943.1929947
1. INTRODUCTION
Behavioral synthesis has emerged as a promising direction in EDA because of the
exponentially increasing complexity in modern SoC designs. Behavioral synthesis
transforms an algorithmic description of behavior into a hardware implementation. Typical
applications for behavioral synthesis are data-intensive or computation-intensive
kernels in multimedia processing, which require high throughput. Usually computation
A preliminary version of this article appeared as Cong et al. [2009].
This work is partly supported by GSRC, SRC Contract 2009-TJ-1879, and NSF grant CNS-0725354. Support
from Magma and AutoESL Inc. is also gratefully acknowledged.
Authors’ address: J. Cong, W. Jiang, B. Liu, and Y. Zou, Computer Science Department, University of
California, Los Angeles, CA 90095; email: {cong,wjiang,bliu,zouyi}@cs.ucla.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1084-4309/2011/03-ART15 $10.00
DOI 10.1145/1929943.1929947 http://doi.acm.org/10.1145/1929943.1929947
ACM Transactions on Design Automation of Electronic Systems, Vol. 16, No. 2, Article 15, Pub. date: March 2011.
Fig. 1. Area of SRAM cells (taken from Tatsumi and Mattausch [1999]).
kernels are captured by perfect loop nests. Loop tiling is a common optimization tech-
nique that partitions a loop’s iteration space into smaller chunks or blocks, so as to
increase the parallelism inside one loop iteration, and improve data locality. However,
memory bandwidth limits the potential performance gain even with large parallelism available.
For instance, in a pipelined design, the final throughput is limited by the number of
memory accesses divided by the number of memory ports.
Memory bottleneck problems have drawn intensive interest, in both software and
hardware designs. A natural way to tackle this problem is to increase memory ports in
the final memory implementations. However, research [Tatsumi and Mattausch 1999]
shows that the size of memory cells increases almost quadratically with port number
N (shown in Figure 1), which limits the maximal memory port number in practical
designs. Moreover, increasing memory ports is not feasible on FPGA platforms; for
example, block RAMs on Xilinx Virtex systems¹ are restricted to two read-write ports.
By duplicating the original memory N times, memory read bandwidth can be improved
N-fold. But besides the large area overhead, the number of simultaneous memory
writes cannot be increased, because it is difficult to guarantee that all memory copies
hold synchronized data. A better approach is memory partitioning, which divides the original
memories into several banks. To take advantage of memory partitioning, compilers
must analyze the memory access patterns in the original programs carefully, and balance them
among different banks.
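To make the idea concrete, cyclic (modulo) partitioning is one common bank-mapping scheme; the sketch below uses it purely for illustration, and is not necessarily the mapping derived by the approach in this article.

```python
# A sketch of cyclic memory partitioning (one common scheme; the paper's
# partitioning is derived from the access pattern, not fixed a priori).
# Address a goes to bank a % BANKS, at local offset a // BANKS.
BANKS = 2

def bank_of(addr: int) -> int:
    """Bank index holding the given flat array address."""
    return addr % BANKS

# In a pipelined loop reading A[i] and A[i+1] in every iteration, the two
# accesses always fall in different banks, so each single-port bank serves
# one access per cycle and the loop can sustain II = 1.
for i in range(16):
    assert bank_of(i) != bank_of(i + 1)
```

With a poorly chosen mapping (e.g., a stride equal to the bank count), both accesses could collide in one bank and halve the throughput, which is why the access pattern must drive the partitioning.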
Data partitioning and loop partitioning have been studied intensively in the compiler
domain for parallel computing on distributed systems. Ramanujam and Sadayappan
[1991] and Anderson and Lam [1993] developed algorithms to generate communication-
free data partitions for multi-computers with local memory. The approach in Agarwal
et al. [1995] presented an optimal tiling and iteration/data space mapping algorithm
even if communication-free partitioning solutions do not exist. In general, these algo-
rithms attempt to exploit coarse-level parallelism to generate good data partitions that
¹http://www.xilinx.com
minimize remote data accesses. However, in hardware design, data accesses must be
considered at cycle-accurate level to avoid excessive simultaneous accesses on the same
memory. In behavioral synthesis, Gong et al. [2005] studied the memory partitioning
problem based on the footprint of data accesses, where only row-based or column-based
partitioning is supported. Another approach [Baradaran
and Diniz 2008] studied the impact of data distribution, data duplication and scalar
replacement for behavioral synthesis to reduce latency of loop bodies. Their work did
not consider scheduling information in the data partitioning formulation. Our approach
considers memory partitioning at the cycle-accurate level to meet throughput require-
ments on pipelined perfect loop nests.
Memory partitioning can also be applied for power optimization. As IC technology
scales into the sub-100nm regime, power optimization has emerged as one of the most
important tasks in SoC design. Power consumption usually goes up with performance
improvement; therefore, power optimization is considered in this article as a secondary
objective to compensate for the overhead brought by performance optimization. Memory
partitioning techniques for power reduction have been studied [Lyuh and Kim 2004;
Wang and Hu 2004; Poncino et al. 2000]. Lyuh and Kim [2004] and Wang and Hu
[2004] assign discrete memory read/write operations to different memory banks for
performance and power optimizations. Their formulations mainly work for data flow
graphs where one memory access can be bound to only one memory bank statically. The
approach in Poncino et al. [2000] generates an optimal memory partitioning scheme
based on platform information and memory access frequencies. The access frequencies
are assumed to be given in their formulation, and are usually obtained by profiling. In our
approach, each array access is modeled as a parameterized polytope, and array access
profiles can be calculated by a polyhedron library to get an analytical solution with
respect to loop bounds and other parameters. For multimedia programs, our approach
usually requires no profiling (for affine array accesses) or only lightweight profiling by
using a parameterized polytope model. In addition, our approach can support more
partitioning schemes.
In this article, a synthesis flow is proposed to solve the memory partitioning problem
for throughput and power optimization. The goal of throughput optimization is to
maximize concurrent memory accesses by partitioning, whereas power optimization tries to
shut down unused memory banks to save dynamic power consumption. These distinct
objectives make it extremely difficult to tackle throughput and power optimization in
one step. In our approach, the throughput optimization problem is solved in the first
step; the results are then passed to the power optimization step to further reduce the total
energy consumption.
Our contributions include the following: (i) a behavioral synthesis flow which generates
optimal memory partitioning solutions automatically for fast design space
exploration; (ii) unlike most previous work, our approach can handle array references
beyond the affine form, such as indirect array accesses and nonaffine array
subscripts; (iii) our partitioning scheme considers cycle-accurate scheduling
information to meet the hard constraints on memory bandwidth requirements, specifically
for synchronous hardware designs; (iv) an effective power optimization algorithm
is proposed that considers more general partition schemes and requires less profiling
effort. Compared to previous works, our approach requires only lightweight profiling,
and can support more partitioning schemes for better results. These contributions are
validated by our experimental results; more than 15× performance improvement and
60% power reduction can be observed on certain designs.
A preliminary version of this work was reported in Cong et al. [2009]. In this article,
we propose an integer-linear-programming (ILP) based algorithm to optimally
solve memory partitioning and scheduling together for small examples. Our
algorithm is extended to consider loop dependences as well for throughput optimization.
Furthermore, an iterative modulo scheduling algorithm is adapted for the final
loop pipelining following our memory partitioning step.
The remainder of the article is organized as follows. Section 2 describes the
preliminaries, such as the polytope model and loop pipelining. Section 3 gives a motivating
example for our memory partitioning problem. Section 4 gives the problem formulation of
our memory partitioning problem, and Sections 5–7 present our proposed algorithms
for solving this problem. Section 8 reports experimental results and is followed by
conclusions in Section 9.
2. PRELIMINARIES
2.1. Polytope Model
Computation-intensive parts of applications are commonly captured by perfect loop
nests. Without loss of generality, all loops are normalized to have unit step size. Loops
and memory accesses can be modeled as polytopes on rational spaces.
Definition 1. The iteration space I of a depth-l loop nest is defined in the
l-dimensional space, where each iteration corresponds to an integer vector identified
by its index values I = (i_1, i_2, ..., i_l).
Definition 2. Similarly, an n-dimensional array is defined by an array space A,
and each array access at a certain iteration is determined by an integer tuple
A = (a_1, a_2, ..., a_n).
Definition 3. Given iteration space I and array space A, each array access R_k in a
loop nest is a function R_k : I → A. R_k is an affine array access if R_k(I) = M ∗ I + V,
where M is an n × l matrix and V is a constant vector of size n.
Example 1.

for (i = 0; i <= N; i++)
    for (j = i; j <= M; j++) {
        A[i-1, i+j] = ...;
    }
For Example 1, the iteration space is

I = {(0, 0), (0, 1), ..., (0, M), (1, 1), (1, 2), ..., (1, M), ...}.

Assume the size of array A is 10 × 10; the data space of array A is

A = {(0, 0), (0, 1), ..., (0, 9), (1, 0), (1, 1), ..., (1, 9), ...}.
And array access R_1 : A[i−1, i+j] is an affine array access with respect to iteration
space I = (i, j), since

R_1(I) = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 0 \end{pmatrix} = \begin{pmatrix} i-1 \\ i+j \end{pmatrix}.
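The affine form R_k(I) = M ∗ I + V can be exercised in a few lines of Python (an illustrative sketch, not the paper's tooling), using the M and V of Example 1:

```python
# Affine array access R1(I) = M*I + V from Example 1: A[i-1, i+j].
M = [[1, 0],
     [1, 1]]     # row r gives the r-th array subscript as a function of (i, j)
V = [-1, 0]      # constant offset vector

def access(i, j):
    """Array element touched by iteration (i, j) under R1."""
    I = (i, j)
    return tuple(sum(M[r][c] * I[c] for c in range(2)) + V[r]
                 for r in range(2))

# Iteration (i, j) = (2, 3) touches A[i-1, i+j] = A[1, 5].
assert access(2, 3) == (1, 5)
```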
If loop bounds and array accesses are affine, loop space and array accesses can be
modeled as polytopes on rational space Q.
Definition 4. A rational parametric polytope PT_p with n parameters p is a set of
rational d-dimensional vectors x in rational space Q^d defined by m linear inequalities
on x and p, where A and B are (m × d) and (m × n) rational matrices, and c is a size-m
constant vector:

PT_p = {x ∈ Q^d | Ax ≥ Bp + c}.
Fig. 2. A loop pipelining example.
We can easily construct the polytope which contains the iteration space from the loop
structure. For the example above, the iteration space is the set of integer points inside the
polytope

PT_I = {(i, j) ∈ Q^2 | 0 ≤ i ≤ N, i ≤ j ≤ M}.
The corresponding values of A, B, p and c in Definition 4 can also be calculated from
the given form:

PT_I = \left\{ I \in Q^2 \,\middle|\, \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} \ge \begin{pmatrix} 0 \\ -N \\ 0 \\ -M \end{pmatrix} \right\}.
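As a sanity check (not part of the paper's flow), the inequality form of PT_I can be enumerated for small parameter values and compared against the loop bounds of Example 1:

```python
# Check that the inequality form A*x >= B*p + c of PT_I reproduces the loop
# bounds 0 <= i <= N, i <= j <= M for small N, M (a sketch, not the paper's
# polyhedral tooling, which handles symbolic parameters).
A = [[1, 0], [-1, 0], [-1, 1], [0, -1]]   # constraint matrix

def rhs(N, M):
    """Right-hand side B*p + c for parameter vector p = (N, M)."""
    return [0, -N, 0, -M]

def polytope_points(N, M):
    """Integer points of PT_I, found by filtering the bounding box."""
    c = rhs(N, M)
    return {(i, j)
            for i in range(N + 1) for j in range(M + 1)
            if all(A[r][0] * i + A[r][1] * j >= c[r] for r in range(4))}

def loop_points(N, M):
    """Iteration space generated directly by the loop nest of Example 1."""
    return {(i, j) for i in range(N + 1) for j in range(i, M + 1)}

assert polytope_points(3, 5) == loop_points(3, 5)
```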
If array access R_k is an affine array access R_k(I) = M ∗ I + V, the number of accesses
on an array element A is the number of integer points inside the intersection of iteration
space I and the hyperplane {I | A = M ∗ I + V}. For instance, the set of loop iterations
that access array location A[a_1, a_2] is the polytope

{(i, j) | i − 1 = a_1, i + j = a_2, 0 ≤ i ≤ N, i ≤ j ≤ M}.
The polytope model provides a useful abstraction of the array access behavior, which
can be used to find the array access frequency for power optimization. Details will be
discussed later.
2.2. Loop Pipelining
Loop pipelining is the most commonly used technique for improving throughput; it
allows simultaneous execution of operations from several consecutive loop iterations. Modulo
scheduling [Rau 1994] is a loop pipelining technique which tries to find a schedule for
one iteration of the loop such that, when this same schedule is repeated at regular intervals,
no intra- or inter-iteration dependence is violated, and no resource usage conflict
arises between operations of either the same or different iterations. This constant
interval between the start of successive iterations is termed the initiation interval (II),
usually measured in clock cycles in synchronized hardware designs. An example is
shown in Figure 2, which has II = 1.
Definition 5. The minimum initiation interval MII is a lower bound on the smallest
possible value of II for which a modulo schedule exists.
The lower bound MII is limited by two factors: resource limits and inter-iteration loop
dependences, which are termed the resource-constrained MII (ResMII) and the recurrence-
constrained MII (RecMII), respectively. Thus, MII ≥ MAX(ResMII, RecMII).
In a pipelined design, each operation in a loop body is executed once within II cycles,
so obviously ResMII ∗ resources ≥ operations, or ResMII ≥ ⌈operations/resources⌉. For
memory accesses, if the memory port number is N, and there are K memory accesses,