High-Level Language Extensions For Fast Execution of Pipeline-Parallelized Code On Current Chip Multi-Processor Systems
Apan Qasem
Department of Computer Science, Texas State University, San Marcos, Texas, USA
[email protected]
ABSTRACT
The last few years have seen multicore architectures emerge as the defining technology shaping the future of high-performance computing. Although multicore architectures present tremendous performance potential, software must play a key role in realizing that potential. In particular, high-level language abstractions, the compiler, and the operating system need to exploit the on-chip parallelism and utilize the underlying hardware resources of these emerging platforms. This paper presents a set of high-level abstractions that allow the programmer to specify, at the source-code level, a variety of parameters related to parallelism and inter-thread data locality. These abstractions are implemented as extensions to both C and Fortran. We present the syntax of these directives and also discuss their implementation in the context of a source-to-source transformation framework and autotuning system. The abstractions are particularly applicable to pipeline-parallelized code. We demonstrate the effectiveness of these strategies on a set of pipeline-parallel benchmarks on three different multicore platforms.
KEYWORDS
Language abstractions, parallelism, multicore architecture
1. INTRODUCTION
The last few years have seen multicore and manycore systems emerge as the defining technology shaping the future of high-performance computing. As the number of cores per socket grows, so too does the performance potential of these systems. However, much of the responsibility for exploiting the on-chip parallelism and utilizing the underlying hardware resources of these emerging platforms lies with software. To harness the full potential of these systems, there is a need for improved compiler techniques for automatic parallelization, managed runtime systems, and improved operating-system strategies. Most importantly, there is a need for better language abstractions that allow a programmer to express parallelization in the source code and specify parameters for parallel execution. Concomitant to these abstractions, there is a need for software that can translate the programmer's directives into efficient parallel code across platforms. This paper presents a set of high-level language abstractions and extensions that allow the programmer to specify, at the source-code level, a variety of parameters related to parallelism and inter-thread data locality. These abstractions are implemented as extensions to both C and Fortran. We present the syntax of these directives and also discuss their implementation in the context of a source-to-source transformation framework and autotuning system. These abstractions are particularly suitable for pipeline-parallelized code. Multicore architectures, because of their high-speed inter-core communication, have increased the applicability of
pipeline-parallelized applications. Thus, the abstractions described in this paper are especially significant. We demonstrate the effectiveness of these strategies on a set of benchmarks on three different multicore platforms. The rest of the paper is organized as follows: in Section 2, we present related work; in Section 3 we describe the language extensions; the tuning framework is described in Section 4; experimental results appear in Section 5 and finally we conclude in Section 6.
2. RELATED WORK
The code transformations described in this paper have been widely studied in the literature. Tiling has been the predominant transformation for exploiting temporal locality in numerical kernels [20, 5, 3]. Loop fusion and array contraction have been used in conjunction to improve cache behavior and reduce storage requirements [4, 13]. Loop alignment has been used as an enabling transformation for loop fusion and scalarization [2]. The iteration-space splicing algorithm used in our strategy was first introduced by Jin et al. [6]. Time skewing has been a key transformation for optimizing time-step loops. McCalpin and Wonnacott introduced the concept in [10], and Wonnacott later extended the method to multiprocessor systems [23]. Wonnacott's method imposed restrictions on dependencies carried by spatial loops and hence was not applicable to general time-step loops. Song and Li describe a strategy that combines time skewing with tiling and array contraction to reduce memory traffic by improving temporal data reuse [13]. Jin et al. describe recursive prismatic time skewing, which performs skewing in multiple dimensions and handles shadow regions through bi-directional skewing [6]. Wolfe introduced loop skewing, a key transformation for achieving pipelined parallelism [21]. Previously, pipelined parallelism was mainly limited to vector and systolic-array machines. With the advent of multicore platforms, however, there has been renewed interest in exploiting pipelined parallelism for a larger class of programs. Much of this attention has centered on streaming applications because of their regular dependence patterns [15, 17, 8]. Thies et al. [14] describe a method for exploiting coarse-grain pipelined parallelism in C programs. Vadlamani and Jenks [18] present the synchronized pipelined parallelism model for producer-consumer applications. Although their model attempts to exploit locality between the producer and consumer threads, they do not provide a method for choosing the appropriate synchronization interval (i.e., tile size).
3. LANGUAGE ABSTRACTIONS
We propose several language extensions in C and Fortran that allow a programmer to express constraints on parallelism and to provide directives and suggestions to the compiler (or a source-to-source code transformation tool) about applying certain optimizations that improve the performance of the code. The syntax of these extensions closely follows that of OpenMP, increasing the usability of the proposed strategy. Fig. 2(a) shows an example of directives embedded in Fortran source code. A directive is simply a comment line that specifies a particular transformation and one or more optional parameter values; a directive can be associated with any function, loop, or data structure. The association of directives with specific loops and data structures is particularly useful because current state-of-the-art commercial and open-source compilers do not provide this level of control. For example, GCC does not allow the user to specify a tile size as a command-line option, whereas Intel's icc allows a user-specified tile size but applies it to every loop nest in the compilation unit.
    $cdir skew 2
    do i = 1, N
      do j = 1, M
        A(j+1) = A(j) + A(j+1)
      enddo
    enddo

Figure 2. Extensions embedded in Fortran code (the accompanying iteration-space diagram, with its block and steady-state regions along the i dimension, is not reproduced here)
The use of source-level directives provides a novel and useful way of specifying optimization parameters at loop-level granularity, making both automatic and semi-automatic tuning more effective. Each source directive is processed by CREST, a transformation tool that serves as a meta-compiler in our framework. Each directive is inspected, the legality of the requested code transformation is verified, and a profitability threshold is determined. In the autotuning mode, the parameters are exposed for tuning through heuristic search, whereas in the semi-automatic mode, performance feedback is presented to the programmer. Directive parameters can strongly influence parallel performance; for example, tile size and shape can be used to determine the number of required synchronizations and the amount of work done per thread. The rest of this section describes the four main directives used in optimizing pipeline-parallelized code. Some of the transformations currently supported by CREST include tiling, unroll-and-jam, multi-level loop fusion and distribution, array contraction, array padding, and iteration-space splicing.
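The directives shown in this paper use the Fortran $cdir spelling. As a rough illustration of what the C-side extensions mentioned in the abstract might look like, the following sketch applies the same skew directive to an equivalent C loop; the "#pragma crest" prefix is our assumption for illustration, not syntax confirmed by the text (a conforming C compiler simply ignores unknown pragmas):

    /* Hypothetical C spelling of the skew directive; the "#pragma crest"
       prefix is an illustrative assumption. */
    void sweep(int N, int M, double A[])
    {
        #pragma crest skew 2
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= M; j++)
                A[j+1] = A[j] + A[j+1];   /* loop-carried dependence on j */
    }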
Skewing results in a trapezoidal iteration space, where the triangular regions represent the pipeline fill-and-drain times and the rectangular region in the middle represents the steady state. In a skewed iteration space, the area of the triangular regions is determined by the size of the x dimension (the outer loop). If we assume

    n  = loop bound of the outer loop
    ov = pipeline fill-and-drain time
    s  = pipeline steady state

then we get

    ov = 2 × ½ × (n − 1)(n − 1) = (n − 1)²    (1)
Therefore, ov will increase and dominate execution time for larger time dimensions (this scenario is depicted in Fig. 2(a)). This can cause severe performance degradation because of high synchronization overhead. To alleviate this situation, we propose blocking along the time dimension to produce a blocked iteration space, as depicted in Fig. 2(b).
    $cdir fuse 1
    do j = 1, N
    $cdir fuse 1
    $cdir tile 64
      do i = 1, M
        b(i,j) = a(i,j) + a(i,j-1)
      enddo
    enddo

    $cdir fuse 1
    do j = 1, N
    $cdir fuse 1
      do i = 1, M
        c(i,j) = b(i,j) + d(j)
      enddo
    enddo
Figure 3. Before-and-after snapshot of data locality transformations with language extensions

This scheme reduces the pipeline fill-and-drain time in two ways. First, if Bt is the time-skew (blocking) factor, then the new fill-and-drain area is

    ov′ = (n − 1)/Bt × 2 × ½ × Bt × Bt = (n − 1)Bt    (2)

Since Bt < n − 1, we have (n − 1)Bt < (n − 1)², and therefore ov′ < ov. Second, since each iteration of the outer loop performs roughly the same amount of work, the maximum parallelism is bounded by the available parallelism on the target platform. Therefore, by picking Bt to be the amount of available parallelism, we ensure that we do not lose any parallelism as a result of blocking. For this work, we take the available parallelism to be the number of cores (or hardware threads, if hyperthreading is present) available per node.
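To gauge the magnitude of this reduction, consider an illustrative case (the numbers are ours, not taken from the paper's experiments): with n = 1024 time steps and Bt = 8, i.e., one block per core on an eight-core node, equations (1) and (2) give

    ov  = 1023 × 1023 ≈ 1.05 × 10⁶
    ov′ = 1023 × 8    ≈ 8.2 × 10³

a reduction in fill-and-drain overhead by a factor of (n − 1)/Bt ≈ 128, while the blocked schedule still exposes Bt = 8 concurrent time steps, matching the available parallelism.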
We propose several language extensions that allow a programmer to specify optimizations that can be applied by a tool at the source-code level. These transformations include loop interchange, unroll-and-jam, loop fusion, loop distribution, iteration-space splicing, and tiling. Each of these optimizations can be applied to effect changes in both inter- and intra-thread data reuse. Further, the transformations can also be applied to modify thread granularity, which can have a significant impact on the performance of pipeline-parallelized code. Fig. 3(a) shows directives for tiling in Fortran source; Fig. 3(b) shows the resulting code after the transformation has been applied by CREST. The principal technique that we apply to exploit data reuse in pipeline-parallelized code is multilevel blocking. For an n-dimensional loop nest, the loop at level n − 1 is tiled with respect to the loop at level n (the outermost loop). The main issue, of course, is picking the right tile size. Because the tile size controls not only the cache footprint but also the granularity of parallelism, we need to be especially careful in selecting it. The ideal tile size is one that gives the maximum granularity without overburdening the cache. To find a suitable tile size for a shared cache, we first determine the cache footprint of all co-running threads. This information is obtained through reuse analysis of memory references in the loop nest, in conjunction with thread-scheduling information. According to our scheduling strategy, the data touched by a block bi is reused d steps after the completion of bi, where d is the block-coverage value derived in the previous section. Our schedule allows for p concurrent threads at each stage. Hence, to exploit temporal locality from one time step to the next, we pick a block size B that satisfies the following constraint:
    (d + 1) × (fb0 + fb1 + · · · + fbp) < CS    (1)
where fbi is the cache footprint of thread i and CS is the estimated cache size. For many applications, however, the above condition may be too relaxed. We can enforce a tighter (and more profitable) constraint by considering the amount of data shared among concurrent threads. Generally, our synchronization-free schedule implies that there are no true dependencies between concurrent threads. However, it is common to have input dependencies from one block to another. We do not want to double-count this shared data when computing the cache footprint of each thread. Thus, we can revise (1) as follows:

    (d + 1) × (fb0
             + fb1 − sh(b0, b1)
             + fb2 − (sh(b0, b2) + sh(b1, b2))
             + · · ·
             + fbp − (sh(b0, bp) + · · · + sh(bp−1, bp))) < CS    (2)

where sh(bi, bj) is the amount of data shared between blocks bi and bj. There is one other aspect of data locality that needs to be considered when choosing a tile size: we need to ensure that the tile size is also able to exploit intra-block locality. If we were running the loop nest sequentially, then tiling the inner spatial dimensions would suffice. For the parallel case, however, we need to tweak the outermost tile size to exploit the locality within the inner loops. In particular, we need to ensure that the concurrent threads do not interfere with each other in cache. The constraint enforced by (2) handles this situation for the last-level cache. Hence, we add a secondary constraint on the tile size that targets higher-level caches for exploiting intra-block locality.
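A tool can apply constraint (2) mechanically once the reuse analysis supplies footprint and sharing estimates. The following minimal sketch searches for the largest power-of-two block size that satisfies the constraint; pick_block_size, footprint, shared, and all parameter names are our illustrative assumptions, not part of CREST:

    /* Sketch: pick the largest block size B whose combined footprint,
       with shared data counted only once, fits in the estimated cache. */
    long pick_block_size(int p, int d, long CS,
                         long (*footprint)(int thread, long B),
                         long (*shared)(int t1, int t2, long B))
    {
        for (long B = 1L << 20; B >= 1; B /= 2) {   /* halve until it fits */
            long bytes = 0;
            for (int i = 0; i <= p; i++) {
                bytes += footprint(i, B);
                for (int j = 0; j < i; j++)
                    bytes -= shared(j, i, B);       /* remove double-counted data */
            }
            if ((d + 1) * bytes < CS)
                return B;                           /* largest B satisfying (2) */
        }
        return 1;                                   /* fall back to smallest block */
    }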
threads. The scheduling distance is specified using a single integer value, whereas the affinity information is specified using a sequence of integers, as shown in Fig. 4. We propose an alternate schedule to avoid this synchronization overhead. In our schedule, we place the concurrent threads d blocks apart, where d represents the block coverage of the shadow region in terms of the number of blocks; in most cases, d = 1. Scheduling the threads d blocks apart creates a safe zone in which each block can execute without having to synchronize with any of the blocks from the previous time step. This schedule does increase the fill-and-drain time: even for a dependence distance of 1, the fill-and-drain overhead is twice as large. However, this overhead is significantly less than the overhead of synchronizing on shadow regions, as the sketch below illustrates.
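One way to realize the d-apart schedule is to run the pipeline in global stages, with the thread responsible for time step t executing block s − t·d at stage s; the barrier at the end of each parallel stage maintains the spacing. This is a minimal sketch under our own assumptions (process_block and all names are illustrative; the paper does not prescribe an implementation):

    /* d-apart pipeline schedule: at stage s, the thread for time step t
       executes block s - t*d, keeping concurrent threads d blocks apart. */
    void process_block(int t, long b);   /* work for time step t, block b */

    void run_pipeline(int p, int d, long num_blocks)
    {
        long num_stages = num_blocks + (long)(p - 1) * d;
        for (long s = 0; s < num_stages; s++) {
            #pragma omp parallel for
            for (int t = 0; t < p; t++) {
                long b = s - (long)t * d;
                if (b >= 0 && b < num_blocks)   /* threads idle during fill/drain */
                    process_block(t, b);
            }
        }
    }

With this staging, the first and last (p − 1)·d stages are only partially occupied, which corresponds to the increased fill-and-drain overhead noted above.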
4. FRAMEWORK OVERVIEW
Fig. 1 provides an overview of our transformation framework. The system comprises four major components: a source-to-source code restructurer (CREST), a program analyzer and feature extractor, a set of performance-measurement tools, and a search engine for exploring the optimization search space. MATS can operate in two modes: offline and online. In the online mode, the source code is first fed into the program analyzer, which uses program characteristics and architectural profiles gathered at install time to generate a representative search space. The search engine is then invoked and selects the next search point to be evaluated. Depending on the configuration, the search point is translated into a set of parameters for high-level code optimizations, compiler flags for low-level optimizations, or a combination of both. The program is then transformed, compiled with the specified optimizations, and executed on the target platform. During program execution, the performance-measurement tools collect a variety of measurements to feed back to the search module. The search module uses these metrics, in combination with results from previous passes, to generate the next set of tuning
Figure 1. Overview of the transformation and tuning framework (annotated source, search engine, search space, code variant, and execution stages, with online-search and offline-modeling paths)
parameters. This process continues until a pre-specified optimization time limit is reached or the search algorithm converges to a local minimum. In the offline mode, the source program is fed into the feature extractor, which extracts the relevant features and maps the program to one of the pre-trained predictive models using a nearest-neighbor algorithm. The predictive model emits the best optimization for the program based on its training. The gathering of training data and the training of the predictive models occur at install time. Next, we describe the four major components of our adaptive tuning system and highlight their key features.
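The online mode just described is, in essence, a measure-and-refine loop. The following sketch captures its control flow under our own naming; every function here is an illustrative stand-in for a framework component, not an actual MATS or CREST API:

    /* Illustrative skeleton of the online tuning loop. */
    typedef struct { int tile; int skew; int unroll; } SearchPoint;

    SearchPoint initial_point(void);                  /* from the program analyzer */
    SearchPoint next_point(SearchPoint p, double t);  /* search-engine step        */
    void   apply_transformations(SearchPoint p);      /* source rewrite (CREST)    */
    void   compile_variant(void);                     /* native compile            */
    double execute_and_measure(void);                 /* run + performance tools   */
    double elapsed(void);
    int    converged(void);

    void tune_online(double time_limit)
    {
        SearchPoint pt = initial_point();
        double best = 1e30;
        while (elapsed() < time_limit && !converged()) {
            apply_transformations(pt);
            compile_variant();
            double t = execute_and_measure();   /* feedback for the search */
            if (t < best) best = t;
            pt = next_point(pt, t);
        }
    }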
search space. We include random search in our framework as a benchmark for the other search algorithms: a search is considered effective only if it performs better than random on a given search space. For offline tuning, we implement several supervised and unsupervised learning algorithms, including support vector machines (SVM), k-nearest neighbor, k-means clustering, and principal component analysis (PCA). Additionally, to support these learning algorithms, we implement three statistical predictive models: an independent and identically distributed (IID) model, a Markov chain, and logistic regression.

Table 1. Evaluation platforms

    Platform   Cores   Clock      L1 cache       L2 cache          Compiler    OS
    Core2      2       2.33 GHz   16 KB, 4-way   4 MB, 4-way       GCC 4.3.2   Linux
    Quad       4       2.4 GHz    32 KB, 8-way   2 x 4 MB, 8-way   GCC 4.3.2   Linux
    8Core      8       2.53 GHz   32 KB, 8-way   8 MB, 16-way      GCC 4.3.2   Linux
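As a concrete reference for the random baseline mentioned above, random search simply samples the tuning parameters uniformly. The sketch below is our illustration; the parameter set and ranges are assumptions, with the tile range loosely matching the block sizes evaluated in Section 5:

    /* Random-search baseline: sample one point of the tuning space.
       Parameters and ranges are illustrative assumptions. */
    #include <stdlib.h>

    typedef struct { int tile; int skew; int unroll; } Point;

    Point random_point(void)
    {
        Point p;
        p.tile   = 1 << (1 + rand() % 7);   /* tile size: 2..128, powers of two */
        p.skew   = 1 + rand() % 4;          /* skew factor: 1..4 */
        p.unroll = 1 + rand() % 8;          /* unroll factor: 1..8 */
        return p;
    }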
5. EVALUATION
To evaluate the effectiveness of these ideas, we conducted a series of experiments in which we applied the language extensions to three Fortran applications: a multigrid solver (multigrid), an optimization problem (knapsack), and an advection code excerpted from the NCOMMAS [19] weather-modeling application (advect3d). All three applications are amenable to pipelined parallelism and hence serve as suitable candidates for evaluating our strategy. Each parallel variant is evaluated on three different multicore platforms. Table 1 shows the hardware and software configurations of each experimental platform. The two-core system is based on Intel's Conroe chip, in which the L2 cache is shared between the two cores. The four-core system uses Intel's Kentsfield chip, which has two L2 caches, each shared between two cores. The eight-core platform has a Xeon processor with Hyperthreading (HT) enabled, providing 16 logical cores. The rest of this paper refers to the two-, four-, and eight-core platforms as Core2, Quad, and 8Core, respectively. In addition to these three platforms, we also conduct scalability studies on a Linux cluster.
Figure. Speedup over baseline for knapsack and advect3d as the block size varies from 2 to 128.
The configurations for this platform are given in Section 5.3.
Figure. Performance of the blocking variants: 1D block (outer), 1D block (inner), and 2D block + skew.
as opposed to 2 MB), there were more cache misses in the unblocked variants, which were eliminated through blocking.
Figure. Scalability of multigrid, knapsack, and advect3d: speedup over baseline as the number of cores varies from 2 to 64, with results for the Core2, Quad, and 8Core platforms.
best performance on average, which is a direct result of the higher inter-thread data reuse present in this application.
6. CONCLUSIONS
This paper described a set of high-level language extensions that can be used in concert to improve the performance of pipeline-parallelized code on current multicore architectures. The language abstractions, in conjunction with a source-to-source transformation tool, can yield significant speedups by determining a suitable pipeline depth, exploiting inter-thread data locality, and orchestrating thread affinity. Evaluation of three applications shows that the strategy is effective on three multicore platforms with different memory-hierarchy and core configurations. The experimental results reveal several interesting aspects of pipelined-parallel performance. We found significant performance variation as the skew factor changes, and blocking in the outer dimension appears more critical than blocking in the inner dimensions. The experimental evaluation also points out some limitations of the proposed strategy; in particular, inter-node communication can be an important factor influencing the performance of parallel pipeline applications.
REFERENCES
[1] Stencilprobe: A microbenchmark for stencil applications.
[2] R. Allen and K. Kennedy, (2002) Optimizing Compilers for Modern Architectures, Morgan Kaufmann.
[3] S. Coleman and K. S. McKinley, (1995) Tile size selection using cache organization, In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation.
[4] C. Ding and K. Kennedy, (2001) Improving effective bandwidth through compiler enhancement of global cache reuse, In International Parallel and Distributed Processing Symposium, San Francisco, CA.
[5] G. Gao, R. Olsen, V. Sarkar, and R. Thekkath, (1992) Collective loop fusion for array contraction, In Proceedings of the Fifth Workshop on Languages and Compilers for Parallel Computing, New Haven, CT.
[6] G. Jin, J. Mellor-Crummey, and R. Fowler, (2001) Increasing temporal locality with skewing and recursive blocking, In Proceedings of SC2001, Denver, CO.
[7] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan, (2007) Effective automatic parallelization of stencil computations, In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 235-244, ACM, New York, NY, USA.
[8] M. Kudlur and S. Mahlke, (2008) Orchestrating the execution of stream programs on multicore platforms, SIGPLAN Not., 43(6):114-124.
[9] J. McCalpin and D. Wonnacott, (1999) Time skewing: A value-based approach to optimizing for memory locality, Technical report, http://www.haverford.edu/cmsc/davew/cache-opt/cache-opt.html.
[10] J. Michalakes, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang, (2004) The weather research and forecast model: Software architecture and performance, In Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing in Meteorology.
[11] A. Qasem, G. Jin, and J. Mellor-Crummey, (2003) Improving performance with integrated program transformations, Technical Report CS-TR03-419, Dept. of Computer Science, Rice University.
[12] Y. Song, R. Xu, C. Wang, and Z. Li, (2001) Data locality enhancement by memory reduction, In Proceedings of the 15th ACM International Conference on Supercomputing, Sorrento, Italy.
[13] W. Thies, V. Chandrasekhar, and S. Amarasinghe, (2007) A practical approach to exploiting coarse-grained pipeline parallelism in C programs, In International Symposium on Microarchitecture.
[14] W. Thies, M. Karczmarek, and S. P. Amarasinghe, (2002) StreamIt: A language for streaming applications, In Proceedings of the International Conference on Compiler Construction, pages 179-196.
[15] J. Treibig, G. Wellein, and G. Hager, (2010) Efficient multicore-aware parallelization strategies for iterative stencil computations, CoRR, abs/1004.1741.
[16] N. Vachharajani, R. Rangan, E. Raman, M. J. Bridges, G. Ottoni, and D. I. August, (2007) Speculative decoupled software pipelining, In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 49-59, IEEE Computer Society, Washington, DC, USA.
[17] S. N. Vadlamani and S. F. Jenks, (2004) The synchronized pipelined parallelism model, In The 16th IASTED International Conference on Parallel and Distributed Computing and Systems.
[18] L. J. Wicker, NSSL Collaborative Model for Atmospheric Simulation (NCOMMAS), http://www.nssl.noaa.gov/~wicker/commas.html.
[19] M. E. Wolf and M. Lam, (1991) A data locality optimizing algorithm, In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada.
[20] M. J. Wolfe, (1982) Optimizing Supercompilers for Supercomputers, PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign.
[21] D. Wonnacott, (2000) Time skewing for parallel computers, In LCPC '99: Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, pages 477-480, Springer-Verlag, London, UK.
[22] D. Wonnacott, (2000) Using time skewing to eliminate idle time due to memory bandwidth and network limitations, In IPDPS '00: Proceedings of the 14th International Symposium on Parallel and Distributed Processing, page 171, IEEE Computer Society, Washington, DC, USA.