28 S. Liu et al. / J. Parallel Distrib. Comput. 121 (2018) 27–41
tile sizes for a given combination of program, architecture, and
compiler. However, these approaches are proved to be less effec-
tive in practice because it is difficult to figure out the complex
interactions between source program characteristics and execu-
tion environments. Hence, the performance of tiled codes with the
tile sizes selected by analytical models lags behind that yielded by
the actual best tile sizes. Besides, manually creating an accurate
analytical TSS model is not an easy work. The TSS model also needs
to be rebuilt while adapting to different processor architectures or
loop structures. Meanwhile, keeping a TSS model up-to-date with
respect to the progress of processor architectures and compilers
often involves much human effort.
In empirical search approaches, the tiled code is repeatedly
generated and executed for a huge search space of different tile
sizes to pick the optimal tile sizes on the target machine. One of
the most crucial issues faced by empirical tuning is the enormous
search space to be explored when considering multi-dimensional
tile sizes or rectangular tile sizes, i.e. different tile sizes in different
loop dimensions. This approach consumes so much time that many
empirical approaches only adopt cubic tiles, i.e. equal tile sizes
along all loop dimensions, to reduce the search space. But the cubic
tile has been shown that it is probably not the best in general and
therefore resulting in unsatisfactory performance [15,26]. Since
the empirical approaches are usually combined with analytical
models to perform heuristic search [5,41], the TSS model is used
to prune valid search space for reducing the time cost. However,
these approaches have not been widely applicable due to the time
cost and accuracy issue.
The machine learning techniques have been used in TSS prob-
lem in recent years. For these approaches [31,40,52], the program
features characterizing the crucial interactions between perfor-
mance and tile sizes are extracted to build the optimal tile sizes
prediction model by using machine learning techniques, such as
artificial neural networks and classifiers. This kind of approaches
can effectively hide the complicated influencing factors of proces-
sor architectures and intertwined compiler optimization phases on
TSS. Because the training data collected on the real architecture-
compiler environment has ability to express the inherent
connections between tile sizes and these factors. When the hard-
ware platform and compiler changes, the training data will be re-
collected and the new dataset could reflect the influences of the
changed running environment. Further, the data collection can be
accomplished without much manpower. Hence, the extraction of
the program features has become a key step in creating an accurate
TSS prediction model. But finding effective features that capture
the essential connections between performance of tiled code and
corresponding tile sizes is still under dispute.
This article proposes a new TSS model based on machine learn-
ing technique to predict optimal rectangular tile sizes for loop
codes on modern multi-core processors. The primary loop features
are extracted from the tiled codes in the light of data locality in
multiple loop dimensions and the vectorization in innermost loop
dimension, which leverages the locality of data references to fit
the working set sizes of tiles in multilevel caches and capture
the exact effect of tile sizes on the performance of tiled codes.
A generalized regression neural network (GRNN) is employed to
build the TSS model for the tile sizes prediction by using artificially
synthesized programs to generate a plenty of loop features and
corresponding best tile sizes as the training datasets. Although only
4 threads are used to train the model, the predicted tile sizes can be
adapted to different numbers of threads. To evaluate the proposed
TSS model, 20 typical kernels including 3D/2D loops and 2D/1D
data with 3 different kinds of problem sizes were chosen to carry
out a series of experiments on an Intel Xeon multi-core platform
and an IBM Power6 multi-core platform. The predicted tile sizes
achieved stable near-optimal performance on both platforms. The
results also indicated that the proposed model outperformed an
artificial neural network (ANN)-based model and a state-of-the-art
analytical model. Our approach yielded good performance when
applying various numbers of threads. Overall, this article has made
the following contributions.
• The loop features extracted from the tiled codes are able
to effectively capture the locality of data references in all
tiled loop dimensions and the effect of vectorization in the
innermost loop dimension, predicting the optimal rectan-
gular tile sizes for the target programs. And the features are
experimentally proved to be necessary and effective for the
good quality of constructing the TSS model.
• The TSS model is built with machine learning, specifically
a GRNN, to hide the intricate underlying influencing fac-
tors involved in the TSS and provide stable near-optimal
performance. The proposed approach could be extended
to more/less dimensions of loops and data, not limited to
3D loops with 2D data. In addition, the performance can
be improved noticeably when the model is combined with
a simple local search. And the proposed approach could
be platform and hardware independent for the target pro-
grams.
• A post-processing approach is proposed to adapt the pre-
dicted tile sizes to different numbers of threads through
simple adjustment. Although the effect of multithreading is
not directly considered and only 4 threads are used to train
the TSS model, the proposed approach leverages the crucial
impact of parallel load balance among threads to adjust the
predicted tile sizes for different numbers of threads, keeping
the good locality the predicted tile sizes have achieved and
achieving good performance for various numbers of threads.
The rest of this article is structured as follows. Section 2 in-
troduces the related work and motivation of our work. Section 3
details the loop features of the TSS model. Section 4 describes the
process of building the TSS model with GRNN and propose the
approach adapting the predicted tile sizes to different numbers of
threads. Section 5 shows the experiments and results. Section 6
concludes this article and points out the future work.
2. Related work and motivation
2.1. Related work
The TSS has been extensively studied to explore data locality
and coarse-grained parallelism for loop codes. The approaches
used in the study of TSS are categorized into three kinds: static
analytical model, model-driven empirical search, and machine
learning techniques. There is already some literature [27,33] that
has analyzed and summarized previous TSS models.
The early research focuses on emulating the access behaviors
in cache to estimate the best tile sizes by using program and
memory characteristics. The studies in [14,22] are dedicated in
establishing the relationship of tile sizes, problem sizes, and cache
parameters to avoid self-interference cache misses. And the ap-
proaches presented in [6,8] work on minimizing cross-interference
cache misses. Some studies [17,28,36] aim at choosing the best
tile sizes for pathological problem sizes through combining array
padding with the analytical approaches. However, these analyt-
ical approaches do not provide consistent good performance for
a wide range of applications on modern processors due to the
simple assumption of cache architectures. On the other hand,
considering the impact of single instruction multiple data (SIMD)
units and vectorization on TSS is gaining more and more attention
from academic circles, some studies [21,43] are dedicated to auto-
vectorization or optimization for SIMD parallelism. However, they