
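### An efficient tile size selection model based on machine learning

#### Overview

This article presents an efficient tile size selection (TSS) model based on machine learning, which predicts optimal rectangular tile sizes for a given program on multi-core processors. Tile size selection is critical for improving data locality and achieving coarse-grained parallelism. Traditional TSS approaches tend to rely on highly skilled manpower, and even then the optimal tile sizes are hard to find. The proposed model extracts a set of loop features that capture the locality of data references and the effect of vectorization in the tiled loop dimensions. It then uses these features and the corresponding best tile sizes to build a generalized regression neural network (GRNN) model that hides the complicated interactions between tile sizes and underlying factors.

#### Proposed method

##### Feature extraction

To predict the best tile sizes accurately, the study first extracts features from the tiled code. These features include, but are not limited to, data access patterns, loop bounds, and array sizes, and they reflect the degree of data locality and the effect of vectorization. The goal of feature extraction is to capture the factors that significantly affect tile size selection, so that the final model predicts accurately.

##### Applying the generalized regression neural network

The generalized regression neural network (GRNN) is a non-parametric regression technique that is well suited to small-sample problems. In this study, the GRNN serves as the basis of the tile size selection model. By training the model to learn the complex relationship between tile sizes and the features above, the GRNN can predict near-optimal tile sizes for a given program. The advantage of such a model is that it predicts tile sizes effectively without requiring a deep understanding of the underlying mechanisms.

##### Parallel load balance

Although the impact of multithreading is not directly considered when training the model, the predicted tile sizes adapt well to different numbers of threads, achieving parallel load balance. Whether the program runs with few or many threads, the predicted tile sizes keep the computational load of each thread roughly equal, which maximizes overall performance.

#### Experimental results

To validate the proposed model, the researchers selected 20 benchmarks and ran experiments on two different platforms. The results show that the predicted tile sizes achieve on average 90% and 81% of the optimal performance. This indicates that, even without manual intervention, a machine learning model can predict tile sizes very close to the optimal ones, which greatly reduces manpower cost and improves the efficiency of code optimization.

#### Conclusion

This article proposes a machine learning based tile size selection model that automatically predicts the best tile sizes for a given program to achieve efficient data locality and parallel load balance. By extracting key loop features and building the model with a generalized regression neural network, the approach not only simplifies tile size selection but also achieves near-optimal performance across different platforms and multithreaded environments. Future work could explore how to integrate more kinds of loop features into the model and how to improve the neural network architecture to raise prediction accuracy.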


J. Parallel Distrib. Comput. 121 (2018) 27–41


An efficient tile size selection model based on machine learning

Song Liu *, Yuanzhen Cui, Qing Jiang, Qian Wang, Weiguo Wu

Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, PR China

Highlights

• We revisit the tile size selection problem of loop tiling by machine learning.

• Extracted features can capture the effect of data locality and vectorization.

• We build a tile size prediction model with generalized regression neural network.

• Predicted tile sizes can be adapted to different threads for parallel load balance.

• Results show near optimal performance over different benchmarks on 2 platforms.

Article info

Article history:

Received 18 September 2017

Received in revised form 27 April 2018

Accepted 13 June 2018

Available online 26 June 2018

Keywords:

Tile size selection

Loop features

Locality of data references

Machine learning

Parallel load balance

Abstract

Tiling is a classic loop optimization to improve data locality and achieve coarse-grained parallelism. Tile size selection (TSS) plays an important role in tiling because it determines the performance of tiled codes. Most previous TSS approaches require a great deal of highly skilled manpower, yet it is still difficult to find the optimal tile sizes. In this article, we propose an efficient TSS model that uses machine learning to predict optimal rectangular tile sizes for a given program on multi-core processors. A set of loop features is extracted from tiled codes to capture the locality of data references and the effect of vectorization in tiled loop dimensions. Using the features and corresponding best tile sizes, a generalized regression neural network is employed to build the TSS model, hiding the complicated interactions between tile sizes and underlying factors. Although the impact of multithreading is not directly considered in training the model, the predicted tile sizes can be well adapted to different numbers of threads. Experimental results show that the predicted tile sizes achieve 90% and 81% of the optimal performance on average for 20 selected benchmarks on an Intel Xeon and an IBM Power6 multi-core platform, respectively. The optimal performance is delivered by the tile sizes obtained through a heuristically exhaustive search. Our TSS model outperforms an artificial neural network (ANN)-based TSS prediction model, which depends on the prefetched features, by over 9% in average performance for 9 benchmarks. It also outperforms a state-of-the-art analytical TSS model, which uses the cache set associativity and interaction with the single instruction multiple data (SIMD) units to estimate the optimal tile sizes, by over 7% in average performance for 7 benchmarks.

* Corresponding author.
E-mail addresses: lsong28@stu.xjtu.edu.cn (S. Liu), cuiyuanzhen@stu.xjtu.edu.cn (Y. Cui), jiangqing@cffex.com.cn (Q. Jiang), rebeccamango@stu.xjtu.edu.cn (Q. Wang), wgwu@mail.xjtu.edu.cn (W. Wu).

1. Introduction

Nested loops are generally the hot spots in many important scientific computing kernels: they take up most of the execution time and may easily lead to frequent cache misses. Tiling [19,29,32,46–48] is a classic loop transformation widely used in program optimization to enhance data locality in higher levels of the memory hierarchy and exploit coarse-grained parallelism. Loop tiling reorders iterative computations by traversing the iteration space according to the tile sizes, which reduces data reuse distance to minimize cache misses. The choice of tile sizes leads to significant variation in the performance of the tiled codes. Thus, tile size selection (TSS) plays an important role in the effective use of loop tiling. However, selecting optimal tile sizes has become ever more challenging, since the ways in which the on-chip memory hierarchy and program running environments influence the tile sizes are increasingly complicated.
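As an illustration of the transformation described above, the following Python sketch tiles a matrix multiplication with rectangular tile sizes. It is not taken from the paper: the function names and the parameters `ti`, `tj`, `tk` are assumptions made for this example, and pure Python is used only to show the loop structure, not to demonstrate any speedup.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix product with the untiled i-k-j loop nest."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, ti, tj, tk):
    """Same product, traversing the iteration space tile by tile.

    ti, tj, tk are the (possibly different) tile sizes of the i, j and k
    loop dimensions, i.e. a rectangular tile. These names are hypothetical.
    """
    c = [[0.0] * n for _ in range(n)]
    # Outer loops step from tile to tile.
    for ii in range(0, n, ti):
        for kk in range(0, n, tk):
            for jj in range(0, n, tj):
                # Inner loops stay inside one tile; min() handles the
                # partial tiles at the boundary of the iteration space.
                for i in range(ii, min(ii + ti, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tk, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tj, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Both functions compute the same result; only the order in which the iterations are visited changes, which is what shortens the reuse distance on a real memory hierarchy.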

Previous work on TSS mainly falls into three categories: analytical approaches [6,8,22,27,50], empirical search [7,12,30,45], and machine learning based approaches [24,31,40,44,52]. In analytical approaches, the TSS models are constructed from static analysis of loop codes and critical parameters of modern processor architectures, calculating the best tile sizes for a given combination of program, architecture, and compiler. However, these approaches have proved less effective in practice because it is difficult to figure out the complex interactions between source program characteristics and execution environments. Hence, the performance of tiled codes with the tile sizes selected by analytical models lags behind that yielded by the actual best tile sizes. Besides, manually creating an accurate analytical TSS model is not easy. The model also needs to be rebuilt when adapting to different processor architectures or loop structures, and keeping it up to date with advances in processor architectures and compilers often involves much human effort.

https://doi.org/10.1016/j.jpdc.2018.06.005

In empirical search approaches, the tiled code is repeatedly generated and executed over a huge search space of tile sizes to pick the optimal ones on the target machine. One of the most crucial issues faced by empirical tuning is the enormous search space to be explored when considering multi-dimensional or rectangular tile sizes, i.e. different tile sizes in different loop dimensions. The search consumes so much time that many empirical approaches only adopt cubic tiles, i.e. equal tile sizes along all loop dimensions, to reduce the search space. But cubic tiles have been shown to be generally not the best choice, resulting in unsatisfactory performance [15,26]. Empirical approaches are therefore usually combined with analytical models to perform heuristic search [5,41], where the TSS model is used to prune the search space and reduce the time cost. However, these approaches have not been widely adopted because of their time cost and accuracy issues.
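The gap between the rectangular and the cubic search space can be made concrete with a short Python sketch. The candidate sizes and the cost function below are made up for illustration. In an actual empirical search the cost of a candidate would be the measured execution time of the tiled code, which `toy_cost` only stands in for.

```python
from itertools import product


def rectangular_candidates(sizes, dims):
    """All combinations of one tile size per loop dimension."""
    return list(product(sizes, repeat=dims))


def cubic_candidates(sizes, dims):
    """Only tiles that use the same size in every loop dimension."""
    return [(s,) * dims for s in sizes]


def exhaustive_search(candidates, cost):
    """Return the candidate with the lowest cost (first one on ties)."""
    return min(candidates, key=cost)


def toy_cost(tile):
    """Hypothetical stand-in for a measured run time.

    It simply pretends that (8, 16, 64) is the best tile; a real search
    would compile and time the tiled code for each candidate instead.
    """
    return abs(tile[0] - 8) + abs(tile[1] - 16) + abs(tile[2] - 64)


sizes = [4, 8, 16, 32, 64, 128, 256, 512]
rect = rectangular_candidates(sizes, 3)   # 8**3 = 512 candidates
cubic = cubic_candidates(sizes, 3)        # 8 candidates
best_rect = exhaustive_search(rect, toy_cost)
best_cubic = exhaustive_search(cubic, toy_cost)
```

With 8 candidate sizes and 3 tiled loops, the rectangular space holds 512 points against 8 for the cubic one. Under this toy cost the cubic search can only return a compromise tile, whereas the rectangular search reaches the assumed optimum at the price of 64 times as many evaluations.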

Machine learning techniques have been applied to the TSS problem in recent years. In these approaches [31,40,52], program features characterizing the crucial interactions between performance and tile sizes are extracted to build a prediction model for the optimal tile sizes using machine learning techniques such as artificial neural networks and classifiers. Such approaches can effectively hide the complicated influence of processor architectures and intertwined compiler optimization phases on TSS, because the training data collected in the real architecture-compiler environment expresses the inherent connections between tile sizes and these factors. When the hardware platform or compiler changes, the training data is re-collected, and the new dataset reflects the influence of the changed running environment. Further, the data collection can be accomplished without much manpower. Hence, the extraction of program features has become a key step in creating an accurate TSS prediction model. But which features effectively capture the essential connections between the performance of tiled code and the corresponding tile sizes is still an open question.

This article proposes a new TSS model based on machine learning to predict optimal rectangular tile sizes for loop codes on modern multi-core processors. The primary loop features are extracted from the tiled codes in the light of data locality in multiple loop dimensions and vectorization in the innermost loop dimension. They leverage the locality of data references to fit the working set sizes of tiles in multilevel caches and capture the exact effect of tile sizes on the performance of tiled codes. A generalized regression neural network (GRNN) is employed to build the TSS model, using artificially synthesized programs to generate plenty of loop features and corresponding best tile sizes as the training datasets. Although only 4 threads are used to train the model, the predicted tile sizes can be adapted to different numbers of threads. To evaluate the proposed TSS model, 20 typical kernels including 3D/2D loops and 2D/1D data with 3 different kinds of problem sizes were chosen to carry out a series of experiments on an Intel Xeon multi-core platform and an IBM Power6 multi-core platform. The predicted tile sizes achieved stable near-optimal performance on both platforms. The results also indicated that the proposed model outperformed an artificial neural network (ANN)-based model and a state-of-the-art analytical model. Our approach yielded good performance with various numbers of threads. Overall, this article makes the following contributions.

• The loop features extracted from the tiled codes effectively capture the locality of data references in all tiled loop dimensions and the effect of vectorization in the innermost loop dimension, which allows the optimal rectangular tile sizes to be predicted for the target programs. The features are experimentally shown to be necessary and effective for constructing a good-quality TSS model.

• The TSS model is built with machine learning, specifically a GRNN, to hide the intricate underlying factors involved in TSS and provide stable near-optimal performance. The proposed approach could be extended to more or fewer dimensions of loops and data, and is not limited to 3D loops with 2D data. In addition, the performance can be improved noticeably when the model is combined with a simple local search. The proposed approach could also be platform and hardware independent for the target programs.

• A post-processing approach is proposed to adapt the predicted tile sizes to different numbers of threads through simple adjustment. Although the effect of multithreading is not directly considered and only 4 threads are used to train the TSS model, the proposed approach leverages the crucial impact of parallel load balance among threads to adjust the predicted tile sizes for different numbers of threads, keeping the good locality the predicted tile sizes have achieved and achieving good performance for various numbers of threads.
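For readers unfamiliar with the GRNN mentioned in the contributions, its standard formulation is a Gaussian-kernel weighted average of the training targets. The Python sketch below shows only that prediction rule. The feature vectors, tile sizes, and the smoothing parameter `sigma` are invented for illustration; this section does not list the paper's actual features or training data.

```python
import math


def grnn_predict(train_x, train_y, query, sigma):
    """Predict an output vector as a kernel-weighted mean of train_y.

    train_x: list of feature vectors (one per training sample)
    train_y: list of target vectors (here: hypothetical tile sizes)
    query:   feature vector to predict for
    sigma:   smoothing parameter of the Gaussian kernel (assumed given)
    """
    weights = []
    for x in train_x:
        # Squared Euclidean distance between the query and this sample.
        d2 = sum((q - xi) ** 2 for q, xi in zip(query, x))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase sigma")
    n_out = len(train_y[0])
    # Each output is the weighted average of that component of the targets.
    return [
        sum(w * y[k] for w, y in zip(weights, train_y)) / total
        for k in range(n_out)
    ]


# Two made-up training samples: 2 features each, 2 tile sizes each.
features = [[0.0, 0.0], [1.0, 1.0]]
best_tiles = [[8.0, 32.0], [16.0, 64.0]]
halfway = grnn_predict(features, best_tiles, [0.5, 0.5], sigma=0.5)
```

A query halfway between the two samples receives equal weights and therefore the average of the two targets. A small `sigma` makes the prediction follow the nearest sample, while a large one smooths over the whole training set. In practice the real-valued output would still have to be rounded to usable integer tile sizes.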

The rest of this article is structured as follows. Section 2 introduces the related work and the motivation of our work. Section 3 details the loop features of the TSS model. Section 4 describes the process of building the TSS model with GRNN and proposes the approach for adapting the predicted tile sizes to different numbers of threads. Section 5 shows the experiments and results. Section 6 concludes this article and points out future work.

2. Related work and motivation

2.1. Related work

TSS has been extensively studied to exploit data locality and coarse-grained parallelism for loop codes. The approaches used in the study of TSS fall into three kinds: static analytical models, model-driven empirical search, and machine learning techniques. Previous TSS models have already been analyzed and summarized in the literature [27,33].

Early research focuses on emulating the access behavior in cache to estimate the best tile sizes from program and memory characteristics. The studies in [14,22] are dedicated to establishing the relationship among tile sizes, problem sizes, and cache parameters to avoid self-interference cache misses, and the approaches presented in [6,8] work on minimizing cross-interference cache misses. Some studies [17,28,36] aim at choosing the best tile sizes for pathological problem sizes by combining array padding with the analytical approaches. However, these analytical approaches do not provide consistently good performance for a wide range of applications on modern processors because of their simple assumptions about cache architectures. Meanwhile, the impact of single instruction multiple data (SIMD) units and vectorization on TSS is gaining more and more attention in academia. Some studies [21,43] are dedicated to auto-vectorization or optimization for SIMD parallelism, but they provide no substantial methods to integrate the TSS model with the vectorization or SIMD effect. Another method [10] adapts the tile sizes to the SIMD register width and cache sizes for the benefit of vectorization, but it leaves the exploitation of data reuse out of the TSS model. Recently, a state-of-the-art analytical approach [26] has shown outstanding performance. It considers the high cache set-associativity ignored in previous work and the interaction of tiling with SIMD units to exploit data reuse in multiple levels of caches. Although these analytical approaches provide clearer insight into the underlying interactions between hardware parameters and software characteristics, they need much human effort to develop and to apply to different program features or platforms.

The empirical search approaches perform tuning in the enormous space of candidate tile sizes. ATLAS [45] is perhaps one of the most remarkable auto-tuners [3,7,13]. It optimizes the basic linear algebra subprograms (BLAS) library on a given hardware platform, but it is time-consuming and only works for specialized libraries. Model-driven empirical approaches [11,12,37], which employ analytical models to characterize good tile sizes and reduce search space and time, are the most commonly used approaches in practice. For example, some researchers use the DL (distinct cache lines) model [11] and the ML (minimum working set lines) model [37] to restrict the upper and lower bounds of the search space for empirical search, while tapping into the data reuse in hierarchical caches. The TSS approach proposed in this article can also be used with empirical search. With the rise of parameterized tiling [34,35], tile sizes can be handled as symbolic parameters that take different values, and the best one is determined at run time. This property enables some approaches [35,42] to dynamically select tile sizes in the production run by monitoring the performance of a few loop iterations. However, it is hard to capture exact data reuse in a few iterations since the reuse distance may be quite long, which results in underutilization of cache resources and sub-optimal tile sizes.

Since machine learning techniques have been successfully applied in many fields, such as system architectures and compilers, some researchers use neural networks and regression to construct prediction models [18,40,44]. With respect to TSS, a classifier learning system [24] is proposed to determine the number of levels of tiling and the tile sizes based on the dimensions of matrices and the architectural features of the running environment, but the system is specific to optimizing matrix multiplication. Statistical machine learning [31] is also employed to create an ANN for predicting the performance distribution of different tile sizes. This method can be used to determine the convergence bounds for random empirical search. A similar ANN model [52] directly predicts the optimal tile sizes based on a set of simple program features and synthetic programs. However, this model is limited to specific kernels with 3D loops and 2D data. Besides, another TSS approach [25] uses dynamic program features, i.e. the information collected by hardware performance counters, to build a multi-layered ANN model for predicting optimal tile sizes. But this approach does not provide any insight into how the selected features interact with the performance of tiled codes and only considers the tiling problem for the L1 cache. Our approach is more or less inspired by [52] in the data collection, but we use entirely different features based on tiled codes to capture the locality of data references in multiple loop dimensions and the effect of vectorization in the innermost loop dimension. The approach proposed in this article could be hardware independent and extended to different dimensions of loops and data.

Fig. 1. Execution time of matmul with different tile sizes.

2.2. Motivation

To achieve optimal performance, a TSS model needs to consider many aspects, such as data reuse, data layout in memory, cache replacement policy, processor architecture, the effect of registers and prefetchers, vectorization, and similar factors. A lot of research [49,51,52] has shown that different combinations of programs, hardware parameters, compilers, and compilation optimizations affect the construction of an effective TSS model. In fact, it is unrealistic to encompass all possible influencing factors of architectures and compilers in one model. Therefore, it is critical to capture the key issues according to the context in which the TSS model is applied. In this article, the tile sizes are chosen from the perspective of locality and vectorization in the tiled codes, and we leave the other influencing factors to the machine learning technique.

As is known, large tile sizes restrain parallelism and may be unable to reduce cache capacity and conflict misses, whereas small tile sizes may lead to insufficient data reuse and too much control overhead. Proper tile sizes do not suffer from these problems. However, it is still hard for such tile sizes to take full advantage of the precious cache resources if the locality difference among loop dimensions is not considered. Noticeable performance variation is caused by the different locality of different loop dimensions. For example, Fig. 1 shows the performance of the tiled matmul kernel with different tile sizes, obtained on a 4-core Intel Xeon processor with a problem size of 1024×1024. Although all tiles have roughly the same working set size, because they use the same sizes in different combinations, the performance of each combination is quite different, and the best one performs 4 times better than the worst. This is because the locality of data references differs in each loop dimension, which affects performance. Better locality means more data reuse and cache hits and fewer new data accesses from memory. Therefore, tiles with the same number of iterations would probably access different amounts of data in memory because of the distinct locality of different loop dimensions. If the locality difference is not fully considered, improper sizes are prone to cause a mismatch between tiles and cache resources, resulting in obvious performance degradation. Hence, the effects of different locality in loop dimensions are nontrivial and need to be fully exploited to realize the optimal performance.
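A small calculation shows how tiles with the same working set can still differ in what they demand from the cache. The Python sketch below counts, for one tile of `C[i][j] += A[i][k] * B[k][j]`, both the array elements and the cache lines it touches. It is a simplified illustration and not the paper's feature set: it assumes row-major arrays, 8 elements per cache line (for example 64-byte lines holding doubles), tile rows that start on a line boundary, and array rows long enough that different rows never share a line.

```python
from itertools import permutations
from math import ceil


def tile_elements(ti, tj, tk):
    """Distinct elements of A, B and C touched by one (ti, tj, tk) tile."""
    return ti * tk + tk * tj + ti * tj


def tile_cache_lines(ti, tj, tk, elems_per_line=8):
    """Distinct cache lines touched under the assumptions stated above.

    Each tile row of an array occupies ceil(row_length / elems_per_line)
    lines, so short rows waste most of every line they pull in.
    """
    lines_a = ti * ceil(tk / elems_per_line)  # A[i][k]: ti rows of tk elements
    lines_b = tk * ceil(tj / elems_per_line)  # B[k][j]: tk rows of tj elements
    lines_c = ti * ceil(tj / elems_per_line)  # C[i][j]: ti rows of tj elements
    return lines_a + lines_b + lines_c


# The same three sizes assigned to the loop dimensions in different orders.
footprints = {
    tile: (tile_elements(*tile), tile_cache_lines(*tile))
    for tile in sorted(set(permutations((4, 64, 64))))
}
```

All three orderings of the sizes 4, 64, 64 touch 4608 elements, but they touch 576, 608, or 640 cache lines depending on which loop dimension receives the small size. The element count alone therefore does not describe how well a tile fits the cache.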

Besides the locality and multi-core parallelism, SIMD instruction sets are also playing an increasingly important role in performance enhancement due to the rapid development of on-chip vector registers and SIMD units and the support of advanced compiler directives in modern multi-core processors. The SIMD impact on loop codes is particularly important since the innermost
