TOWARDS PARALLEL TWISTED BLOCK FACTORIZATIONS


Michael Moldaschl, Research Group Theory and Applications of Algorithms, University of Vienna, Vienna, Austria
Wilfried N. Gansterer, Research Group Theory and Applications of Algorithms, University of Vienna, Vienna, Austria
Abstract. With the rise of multicore architectures, concurrency and parallelization potential of numerical algorithms have become pivotal for achieving high performance. In this paper, the parallelization of the computation of twisted block factorizations of a block tridiagonal matrix is studied and several parallelization variants are discussed. This process is an important building block for computing eigenvectors of symmetric block tridiagonal matrices without tridiagonalization.

1 Introduction

A sequential algorithm for computing eigenvectors of a block tridiagonal or banded matrix based on twisted block factorizations (TBF, [1]) has been proposed recently [2]. An important benefit of this method, which we refer to as the TBF method in this paper, is that it does not require tridiagonalization of the block tridiagonal matrix and thus no backtransformation process. Since band reduction processes which reduce a full matrix to a band matrix with a bandwidth larger than one tend to be more efficient than a full tridiagonalization, block tridiagonal or banded matrices as considered here may also arise as an intermediate step when computing spectral information of a full dense matrix. Moreover, there is also a computationally less expensive approximative method for transforming a full matrix into a closely related block tridiagonal matrix [3].

The central component of the TBF method for computing eigenvectors of a block tridiagonal matrix proposed in [2] is the computation of twisted block factorizations. For fully understanding the performance potential of this method on contemporary and future hardware architectures, efficient parallelizations for multicore architectures have to be developed and their scaling behavior with the number of cores has to be investigated. Multicore architectures will be the dominant processor architecture for the years to come, and it is expected that the number of cores will grow steadily over time.

In this paper, we focus on several parallel strategies for computing twisted block factorizations of a given block tridiagonal matrix. The other components of the eigenvector computation, i.e., the determination of a starting vector and the solution of the linear system in an inverse iteration step, are much less significant in terms of overall runtime and therefore not discussed in detail in this paper. However, we discuss the effect of two new CPU features on the performance of parallel twisted block factorizations: the Intel hyperthreading technology
and the Intel turbo boost technology. The former supports the execution of two separate threads on a single core. The latter allows one, two or four cores to increase their clock frequency if more performance is needed (see http://www.intel.com/Assets/PDF/manual/325384.pdf).

Related Work. Various parallel eigensolvers have been developed and integrated into parallel software packages such as ScaLapack [4] or PLapack [5]. These libraries contain routines primarily for tridiagonal problems, though [6, 7, 8, 9], and rely on a preceding tridiagonalization process. Moreover, the target platforms for classical libraries such as ScaLapack and PLapack are distributed memory machines, and they are not particularly well suited for modern multicore architectures. Plasma [10] is a more recent effort specifically targeting multicore architectures, but its most current version 2.4.1 does not yet contain any routines for computing eigenvectors. In our experiments, we use eigensolvers from ScaLapack for comparison. The only effort so far specifically targeting block tridiagonal eigenproblems without tridiagonalization is the block divide-and-conquer method [11, 12], which has already been parallelized [13]. However, the block divide-and-conquer method is not competitive in terms of performance if highly accurate solutions are required and the ranks of the off-diagonal blocks are high [11]. This has motivated the development of a method for computing eigenvectors of block tridiagonal matrices based on twisted block factorizations without tridiagonalization [2], [1]. Sequential implementations of this method have been developed and evaluated.

Synopsis. In Section 2 the sequential TBF method for computing eigenvectors of a symmetric block tridiagonal matrix is reviewed. In Section 3, several parallelization strategies for the most important component of the TBF method, i.e., the computation of the twisted block factorizations, are introduced and discussed. Section 4 summarizes experimental evaluations of some of these parallelization strategies, and Section 5 concludes the paper.

2 Computing eigenvectors using twisted block factorizations


The problem is defined by a symmetric block tridiagonal matrix with p diagonal blocks:

M := \begin{pmatrix}
       B_1 & A_2^T &        &        &       \\
       A_2 & B_2   & A_3^T  &        &       \\
           & A_3   & \ddots & \ddots &       \\
           &       & \ddots & \ddots & A_p^T \\
           &       &        & A_p    & B_p
     \end{pmatrix} \in \mathbb{R}^{n \times n}                                (1)

where A_i, B_i \in \mathbb{R}^{b \times b} with n = p \cdot b and B_i = B_i^T. Given an (approximation of an) eigenvalue \lambda of M, the corresponding eigenvector x of M can be computed as the solution of

(M - \lambda I)\, x = W x = 0                                                 (2)

with block tridiagonal W, e.g., using an inverse iteration process:

1. Determine starting vector x_0 with \|x_0\| = 1; i := 0
2. Solve (M - \lambda I)\, x_{i+1} = x_i for x_{i+1}
3. Normalize x_{i+1}; i := i + 1
4. If not converged, continue with Step 2
For solving the linear system in Step 2, a factorization of W can be used. For a block tridiagonal matrix, a twisted block factorization [1] can be constructed which has the following structure:

W = P \begin{pmatrix}
        L_1^+ &        &           &     &           &        &       \\
        M_2^+ & \ddots &           &     &           &        &       \\
              & \ddots & L_{i-1}^+ &     &           &        &       \\
              &        & M_i^+     & L_i & M_{i+1}^- &        &       \\
              &        &           &     & L_{i+1}^- & \ddots &       \\
              &        &           &     &           & \ddots & M_p^- \\
              &        &           &     &           &        & L_p^-
      \end{pmatrix}
      \begin{pmatrix}
        U_1^+ & N_1^+  &           &           &           &           &       \\
              & \ddots & \ddots    &           &           &           &       \\
              &        & U_{i-1}^+ & N_{i-1}^+ &           &           &       \\
              &        &           & U_i       &           &           &       \\
              &        &           & N_i^-     & U_{i+1}^- &           &       \\
              &        &           &           & \ddots    & \ddots    &       \\
              &        &           &           &           & N_{p-1}^- & U_p^-
      \end{pmatrix}                                                           (3)

The plus superscripts in Eqn. (3) indicate that these blocks result from a forward factorization, and the minus superscripts that they were calculated in a backward factorization. The combination of the forward and the backward factorization is called a twisted block factorization (TBF). One of the possible TBFs defines a suitable starting vector for the inverse iteration process and is also used in solving the linear system in Step 2 [2]. In total, there are p different twisted block factorization representations (combining i - 1 forward and p - i backward factorization steps for i = 1, ..., p). For determining the defining entities of all TBFs, an entire forward and an entire backward factorization need to be computed, and then the updated diagonal blocks need to be factorized according to the following p equations:

B_i - P_i^+ M_i^+ N_{i-1}^+ - P_{i+1}^- M_{i+1}^- N_i^- = P_i L_i U_i,   i = 1, ..., p.      (4)

The minimal diagonal entry in all U_i can be used to define the starting vector x_0 for the inverse iteration process [2].
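The following NumPy sketch makes the role of Eqn. (4) concrete. It works with the unpivoted Schur-complement form of the recurrences, so the blocks D_plus[k] and D_minus[k] below stand in for the products L_k^+ U_k^+ and L_k^- U_k^- of the forward and backward factorizations, and Delta[i] stands in for P_i L_i U_i; the permutations from pivoting and the explicit M and N factors are omitted for brevity. Function and variable names are ours, not those of the paper's Fortran implementation.

import numpy as np
from scipy.linalg import lu

def twisted_diagonal_blocks(B, A, lam):
    """Unpivoted sketch of the recurrences behind Eqn. (4) for W = M - lam*I.
    B: list of p symmetric diagonal blocks (b x b) of M.
    A: list of p-1 subdiagonal blocks; A[k] sits in block position (k+1, k)
       (0-based block indices), its transpose in (k, k+1).
    Returns the p twisted diagonal blocks Delta_i (standing in for P_i L_i U_i)."""
    p, b = len(B), B[0].shape[0]
    Bs = [Bk - lam * np.eye(b) for Bk in B]              # shifted diagonal blocks

    # forward block elimination; D_plus[k] stands in for L_k^+ U_k^+
    D_plus = [Bs[0]]
    for k in range(1, p):
        D_plus.append(Bs[k] - A[k-1] @ np.linalg.solve(D_plus[k-1], A[k-1].T))

    # backward block elimination; D_minus[k] stands in for L_k^- U_k^-
    D_minus = [None] * p
    D_minus[p-1] = Bs[p-1]
    for k in range(p - 2, -1, -1):
        D_minus[k] = Bs[k] - A[k].T @ np.linalg.solve(D_minus[k+1], A[k])

    # twisted diagonal blocks: combine i-1 forward and p-i backward steps
    Delta = []
    for i in range(p):
        D = Bs[i].copy()
        if i > 0:
            D = D - A[i-1] @ np.linalg.solve(D_plus[i-1], A[i-1].T)
        if i < p - 1:
            D = D - A[i].T @ np.linalg.solve(D_minus[i+1], A[i])
        Delta.append(D)
    return Delta

def best_twist_index(Delta):
    """Heuristic in the spirit of [2]: pick the twist index whose U factor has
    the smallest diagonal entry in magnitude (here via an LU of each Delta_i)."""
    mins = [np.min(np.abs(np.diag(lu(D)[2]))) for D in Delta]
    return int(np.argmin(mins))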

3 Parallelization strategies

In this section, we develop and discuss several possible strategies for parallelizing the computation of twisted block factorizations on multicore architectures. Computing the twisted block factorizations of W by far dominates the computational cost and runtime of the TBF method, and thus we focus on the parallelization of this component. The parallelization strategies are grouped according to the number of cores working on the computation of a single eigenvector.
3.1 One core per eigenvector

The straightforward approach would be to use a single core per eigenvector and to parallelize over different eigenvectors. Theoretically, this achieves almost perfect parallelization. However, in practice, there are some influences which cause a deviation from perfect scaling: initially replicating the block tridiagonal matrix over all cores and finally collecting the computed eigenvectors on one core causes communication overhead; and on the nodes we currently have available, up to four cores share the cache, which can cause memory delays if several processes use the cache at the same time. Last, but not least, orthogonality of computed eigenvectors based on twisted block factorizations may be insufficient for clustered eigenvalues [2], and thus separate reorthogonalization of eigenvectors may be required in some cases. Nevertheless, we use simplified versions of this straightforward approach as reference versions. Two different strategies based on the parallelization over the eigenvectors have been implemented: version0 and version3.

Version0: The eigenvalues are distributed blockwise over the available cores and the matrix is broadcast. Each core computes the eigenvectors defined by its local eigenvalues, and then the eigenvectors of all cores are merged into the eigenvector matrix. No reorthogonalization is performed. The implementation of version0 is based on MPI, including the distribution of the eigenvalues at the beginning and the merging of the eigenvectors at the end (a sketch of this scheme is shown below).

Version3: This strategy tries to exploit the shared memory in a node by using OpenMP. Every process consists of two or four threads, which are used to simultaneously compute several eigenvectors within a process. The number of processes times the number of threads is always equal to the number of cores used, and we do not consider several threads running on a single core.
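As an illustration of the version0 scheme, the following mpi4py sketch distributes the eigenvalues blockwise, broadcasts the matrix blocks, lets every rank compute the eigenvectors defined by its local eigenvalues, and merges the results on rank 0. The routine tbf_eigenvector is a hypothetical stand-in for the sequential TBF eigenvector computation; the actual implementation is written in Fortran with MPI.

from mpi4py import MPI
import numpy as np

def version0(B, A, eigenvalues, tbf_eigenvector):
    """Blockwise distribution of eigenvalues over MPI ranks (version0 sketch).
    B, A and eigenvalues only need to be valid on rank 0;
    tbf_eigenvector(B, A, lam) stands in for the sequential TBF computation."""
    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    # rank 0 splits the eigenvalues blockwise; the matrix blocks are broadcast
    chunks = np.array_split(eigenvalues, nprocs) if rank == 0 else None
    my_eigvals = comm.scatter(chunks, root=0)
    B = comm.bcast(B, root=0)     # diagonal blocks of M
    A = comm.bcast(A, root=0)     # subdiagonal blocks of M

    # each core computes the eigenvectors defined by its local eigenvalues
    my_vecs = [tbf_eigenvector(B, A, lam) for lam in my_eigvals]

    # merge the eigenvectors of all ranks into the eigenvector matrix on rank 0
    parts = comm.gather(my_vecs, root=0)
    if rank == 0:
        return np.column_stack([v for part in parts for v in part])
    return None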

3.2 Two cores per eigenvector

As outlined in Section 2, the p different twisted block factorizations can be computed as one forward and one backward factorization followed by solving the p equations (4). An obvious idea for parallelizing the computation of one eigenvector over two cores is the simultaneous computation of the forward and backward factorization with two processes. This idea leads to five different strategies outlined in the following.

Version1: In this version the computation is also parallelized over different eigenvectors, but always two processes work on the same eigenvector. In particular, the forward and backward factorizations are computed in parallel by two processes. The first process computes the forward factorization of the upper half of W, and simultaneously the second process computes the backward factorization of the lower half of W. Then, the two processes exchange the blocks of the row where the factorizations meet. Process 1 uses these blocks to compute the backward factorization of the upper part of W, and simultaneously Process 2 uses them to compute the forward factorization of the lower part of W. After that, Process 1 has all data of the forward and backward factorization of the upper half of W, and Process 2 has all data of the forward and backward factorization of the lower half of W.

Based on Eqn. (4), these data can be used to calculate all possible twisted factorizations without needing any other data except two blocks which were calculated in the upper or lower half but are necessary for the other part. Both processes exchange these necessary blocks, and Process 1 computes the twisted factorizations of the upper part of W (for all twisted blocks in the upper part), while simultaneously Process 2 computes the twisted factorizations of the lower part of W (for all twisted blocks in the lower part). As a result, Process 1 has all the information about twisted factorizations which meet in the upper half of W, and Process 2 has all the information about twisted factorizations which meet in the lower half of W. Both processes then search for the minimal diagonal entry in the twisted factorizations of their half, compare their minima, and the process with the global minimum starts the back substitution process. Depending on the location of the minimum, the corresponding process first substitutes towards the center, exchanges required information with the other process (which can then start its back substitution process), and then substitutes towards the outside. As a result, the upper half of the eigenvector is computed on Process 1, and the lower half on Process 2 (a sketch of this communication pattern is given after the version descriptions below).

Version2 is an implementation of version1 which uses OpenMP instead of MPI for the parallelization of the forward, backward and twisted factorizations. In this case, each process consists of two threads: one computes the forward factorization, the other one the backward factorization. Finally, both threads compute the twisted factorizations in parallel (exchanging the required data through the shared memory). In version2, the solution of the linear systems (forward and backward substitutions) is not parallelized, because the runtime of this part is insignificant.

Version4 is a combination of version2 and version3. MPI is used to distribute the computation of eigenvectors over different processes. Then, OpenMP is used to create two threads in each process to further distribute the computation of the eigenvectors. Finally, each thread uses OpenMP to create another thread to also work on the same eigenvector. These two threads both work on the forward and backward factorization simultaneously. Then each of the two threads computes half of the twisted factorizations. This implementation requires the nested use of OpenMP, which is not supported by all compilers.

Version5: The parallelization is done as in version4, only the lowest level (the forward, backward and twisted factorizations and the back substitutions for each eigenvector) is parallelized based on MPI instead of OpenMP. Two threads of two different processes use MPI to communicate with each other for the work at the lowest level.

Version6 is a refinement of version2. The inverse iteration and the search for the minimum are also parallelized, using the same threads which previously computed the forward and backward factorization.
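The communication pattern of version1 can be sketched with mpi4py for two ranks as follows. For brevity the sketch reuses the unpivoted Schur-complement recurrences from the earlier sketch (so the exchanged data are single b-by-b blocks rather than the full pivoted factors), and it only covers the factorization phase, not the subsequent twisted factorizations and back substitutions; all names and the data layout are ours. Version2 follows the same pattern but exchanges the meeting-row data through shared memory (OpenMP) instead of messages.

from mpi4py import MPI
import numpy as np

def half_factorizations(B, A, lam, comm):
    """Version1 pattern (sketch, no pivoting): rank 0 owns the upper half of
    W = M - lam*I, rank 1 the lower half. Each rank first runs its 'outward'
    factorization, then the ranks exchange the blocks where the factorizations
    meet and continue with the opposite factorization on their own half."""
    rank = comm.Get_rank()
    p, b = len(B), B[0].shape[0]
    m = p // 2                                   # first block row of the lower half
    Bs = [Bk - lam * np.eye(b) for Bk in B]

    if rank == 0:
        # forward recurrence on block rows 0 .. m-1
        D_plus = [Bs[0]]
        for k in range(1, m):
            D_plus.append(Bs[k] - A[k-1] @ np.linalg.solve(D_plus[-1], A[k-1].T))
        # exchange the blocks where the factorizations meet
        D_minus_m = comm.sendrecv(D_plus[-1], dest=1, source=1)
        # backward recurrence on the upper half, seeded by rank 1's result
        D_minus = {m: D_minus_m}
        for k in range(m - 1, -1, -1):
            D_minus[k] = Bs[k] - A[k].T @ np.linalg.solve(D_minus[k+1], A[k])
        return D_plus, D_minus                   # forward and backward data, upper half
    else:
        # backward recurrence on block rows p-1 .. m
        D_minus = {p - 1: Bs[p - 1]}
        for k in range(p - 2, m - 1, -1):
            D_minus[k] = Bs[k] - A[k].T @ np.linalg.solve(D_minus[k+1], A[k])
        D_plus_prev = comm.sendrecv(D_minus[m], dest=0, source=0)
        # forward recurrence on the lower half, seeded by rank 0's result
        D_plus = {m - 1: D_plus_prev}
        for k in range(m, p):
            D_plus[k] = Bs[k] - A[k-1] @ np.linalg.solve(D_plus[k-1], A[k-1].T)
        return D_plus, D_minus                   # forward and backward data, lower half

The sketch is meant to be run with exactly two MPI ranks, e.g. mpiexec -n 2 python version1_sketch.py (a hypothetical file name).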

3.3 More than two cores per eigenvector

The following strategies use more than two cores for the parallel computation of a single eigenvector.

Version7: We can use k cores to compute the p corresponding twisted factorizations concurrently. The computation of the factorizations is in the first step uniformly distributed between the cores, i.e., core i of the k cores computes TBF 1 + (i - 1)(p - 1)/(k - 1) (assuming that k and p are such that integer division is possible); see the small sketch at the end of this subsection. Then, each core communicates with its neighbors to compute the remaining twisted factorizations based on Eqn. (4). The larger the number of cores, the more computations are done redundantly, but less communication is required. We are currently investigating which number of cores reaches the best efficiency for given problem parameters n, p.

Version8: Another possibility to use up to four cores is the simultaneous computation of the twisted, the forward and the backward factorization. One process starts with the computation of the forward factorization, while another process starts with the computation of the backward factorization (in contrast to version1, each of them calculates the whole forward or backward factorization, respectively). After each factorization step, each of these two processes sends the result to another process (Process 1 sends to Process 3 and Process 2 sends to Process 4). After both have reached the center of the matrix, they change the target process for the results (Process 1 sends to Process 4 and Process 2 sends to Process 3). At the beginning, Processes 3 and 4 have no tasks except receiving the results. After the center of the matrix has been reached, each further result allows each of the two receiving processes to calculate one twisted factorization. Thus, once half of the forward and backward factorization has been finished, all four processes can work simultaneously.

Version9: Another parallelization strategy is the use of parallel basic operations inside the twisted block factorization, such as LU factorizations or the solution of systems of equations. Efficient parallel implementations of these operations are provided in PLASMA. Open questions are how many cores can be used for the parallelization of the basic operations and for which block sizes this method becomes competitive.

Version10: The last and most interesting parallelization strategy for more than two cores is the investigation of a tiled approach for the process of computing all twisted block factorizations. For this purpose, all blocks are split into smaller blocks (called tiles). The result of one tile is used in the computation of one tile of the next block while the next tile of the first block is computed. This strategy constructs a pipeline with very small blocks where multiple processes can compute on different blocks concurrently, even before the basic operations on the blocks involved have finished. Important questions for further research are how much overlap between the independent operations is possible in principle and how efficiently this strategy can be implemented. Factors which influence the performance will be the size of the tiles, the block size and the degree of overlap (which depends on the other factors and the number of cores).
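The initial work distribution of version7 follows directly from the formula above; the small sketch below (names ours) lists, for each of the k cores, the twist index it computes first.

def version7_start_indices(p, k):
    """Initial (uniform) assignment of twist indices to cores in version7:
    core i (1-based) first computes the TBF with twist index
    1 + (i - 1)*(p - 1)/(k - 1), assuming (p - 1) is divisible by (k - 1).
    The remaining p - k twisted factorizations are then filled in from the
    neighboring cores' forward/backward data via Eqn. (4)."""
    assert k >= 2 and (p - 1) % (k - 1) == 0
    return [1 + (i - 1) * (p - 1) // (k - 1) for i in range(1, k + 1)]

# example: p = 13 block rows on k = 4 cores -> twist indices [1, 5, 9, 13]
print(version7_start_indices(13, 4))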

4 Numerical experiments

So far, we have implemented versions 0, 1, 2, 3 and 6. The versions which use only one core per eigenvector are included as references for the other methods in order to quantify the parallelization overhead of using more than one core per eigenvector. We summarize evaluations of the parallel efficiency of these parallelization strategies using up to two cores per eigenvector on an Intel i7-860 CPU with 2.8 GHz and 8 GB main memory, using the GNU Fortran 4.4.3 compiler. The test system also offers the turbo boost and hyperthreading technology. In all experiments the runtimes for computing all n eigenvectors of a random symmetric block tridiagonal matrix were measured. The parallel efficiency (sometimes only called efficiency in the following) was computed by dividing the speedup (runtime of the sequential program over the runtime of the parallel program) by the number of cores used. The numerical accuracy of the parallel versions is identical to that of the sequential implementation, which has already been analysed in [2].
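For reference, the efficiency metric used in all plots can be written as a one-liner; the numbers in the example are purely illustrative and not measured values.

def parallel_efficiency(t_seq, t_par, ncores):
    """Parallel efficiency as used in the experiments: speedup (sequential
    runtime over parallel runtime) divided by the number of cores used."""
    return (t_seq / t_par) / ncores

# illustrative example: 100 s sequentially, 52 s on 2 cores -> about 0.96
print(parallel_efficiency(100.0, 52.0, 2))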

[Figure 1: Parallel efficiency of various parallelization strategies (version0, version1, version2, version3, version6) for computing all eigenvectors on two cores (b = 10, matrix sizes n vary).]

Varying the matrix dimension. The first evaluation of the performance is illustrated in Figure 1 for different matrix dimensions and fixed block size b = 10. We see that version1 achieves the highest efficiency of the versions which use two cores per eigenvector. The OpenMP implementations perform quite well (version6 is a little bit better than version2), but they do not achieve the same efficiency as the MPI version.

Varying the block size. In Figure 2 the matrix size is 8000 and different block sizes are illustrated for two cores. We can see that the efficiency strongly depends on the block size. For very small blocks the efficiency of the two-cores-per-eigenvector strategies is much lower. Version1 reaches the best efficiency for block size 10 and stays at almost the same level for all larger block sizes. The efficiency of version2 and version6 increases with the block size, and version6 becomes as efficient as version1 for block sizes greater than or equal to 60. Note that except for very small block and matrix sizes, the parallel efficiency of version1 is in the worst case only 5% lower than that of the trivial parallelizations version0 and version3.

[Figure 2: Parallel efficiency of various parallelization strategies (version0, version1, version2, version3, version6) for computing all eigenvectors on two cores (n = 8000, block sizes b vary).]

Turbo boost and hyperthreading. We also investigated the influence of the turbo boost technology and of hyperthreading on the efficiency of the parallelization strategies. On the test system each core normally runs with a clock frequency of 2.8 GHz, but with turbo boost one core can be accelerated up to 3.46 GHz, two cores up to 3.33 GHz, and four cores up to 2.93 GHz (see http://download.intel.com/newsroom/kits/embedded/pdfs/Core i7-860 Core i5-750.pdf). Using turbo boost on more cores reduces the average performance of all cores and will therefore decrease the efficiency. Eqns. (5) and (6) estimate the theoretical slow-down caused by this effect on two cores (c_2) and on four cores (c_4):

c_2 = \frac{2.8 + 0.53}{2.8 + 0.66} \approx 96.24\%                           (5)

c_4 = \frac{2.8 + 0.13}{2.8 + 0.66} \approx 84.68\%                           (6)


Eqn. (6) could be confirmed experimentally with an error tolerance of up to 6%. Because of the relatively small slow-down for two cores, the error tolerance is larger than the estimated change caused by the turbo boost. In the experiments investigating the influence of the hyperthreading technology, three cores were disabled to guarantee that only one physical core was used. In Figure 3 we compare the speedup (over the sequential version) achieved by the different parallelization strategies with hyperthreading for different matrix sizes and block size 10 to the speedup achieved on two cores without hyperthreading. We see that using hyperthreading for simulating two cores yields definitely worse performance than achieved on two physical cores, but all variants (with or without hyperthreading) achieve a speedup over the sequential version, although the latter is also a very efficient program based on Lapack and (ATLAS-)Blas. With hyperthreading, version0 is the winner, whereas version3 is not able to use the additional thread as well. Version6 is as good as version1, and for large matrix sizes it is even better. Version2 shows the worst performance in almost all cases.
[Figure 3: Parallel speedup of parallelization strategies (over the sequential variant on a single core) on two cores without hyperthreading and on one core with hyperthreading (b = 10, matrix sizes n vary); each of version0, version1, version2, version3, version6 is shown with and without HT.]


5 Conclusions

Several parallelization strategies for computing eigenvectors of block tridiagonal matrices based on twisted block factorizations have been discussed. The excellent parallel efficiency of the trivial parallelization over the eigenvectors was almost matched by more sophisticated parallelization strategies which utilize two cores per eigenvector and are based on the idea of computing the forward and backward factorizations in parallel. In particular, version1 achieved the highest parallel efficiency of all strategies which utilize two cores per eigenvector for all tested matrix and block sizes. Larger block sizes tend to lead to higher parallel efficiency.

This work has focused on the efficient parallel computation of a single eigenvector with two cores. We are currently working on the strategies mentioned above which utilize more than two cores per eigenvector; these have a higher potential for scaling with the growing numbers of cores expected on future multicore architectures.

Acknowledgements. This work has been partly supported by the Austrian Science Fund (FWF) under contract S10608 (NFN SISE).

References
[1] W. N. Gansterer and G. König, "On twisted factorizations of block tridiagonal matrices," Procedia Computer Science, vol. 1, no. 1, pp. 279-287, 2010.
[2] G. König, M. Moldaschl, and W. N. Gansterer, "Computing eigenvectors of block tridiagonal matrices based on twisted block factorizations," Journal of Computational and Applied Mathematics, 2011, in press.
[3] Y. Bai and R. C. Ward, "Parallel block tridiagonalization of real symmetric matrices," J. Parallel Distrib. Comput., vol. 68, pp. 703-715, 2008.
[4] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLapack Users' Guide. Philadelphia, PA: SIAM Press, 1997.
[5] R. van de Geijn, Using PLapack: Parallel Linear Algebra Package. Cambridge, MA: The MIT Press, 1997.

[6] C. Vömel, "ScaLAPACK's MRRR algorithm," ACM Trans. Math. Softw., vol. 37, pp. 1:1-1:35, January 2010.
[7] P. Bientinesi, I. S. Dhillon, and R. A. van de Geijn, "A parallel eigensolver for dense symmetric matrices based on multiple relatively robust representations," SIAM J. Sci. Comput., vol. 27, pp. 43-66, 2005.
[8] F. Tisseur and J. Dongarra, "A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures," SIAM J. Sci. Comput., vol. 20, pp. 2223-2236, 1999.
[9] I. S. Dhillon, B. N. Parlett, and C. Vömel, "The design and implementation of the MRRR algorithm," ACM Trans. Math. Softw., vol. 32, pp. 533-560, 2006.
[10] E. Agullo et al., PLASMA Users' Guide, Version 2.0, November 10, 2009. [Online]. Available: http://icl.cs.utk.edu/projectsfiles/plasma/pdf/users_guide.pdf
[11] W. N. Gansterer, R. C. Ward, R. P. Muller, and W. A. Goddard, III, "Computing approximate eigenpairs of symmetric block tridiagonal matrices," SIAM J. Sci. Comput., vol. 25, pp. 65-85, 2003.
[12] W. N. Gansterer, R. C. Ward, and R. P. Muller, "An extension of the divide-and-conquer method for a class of symmetric block-tridiagonal eigenproblems," ACM Trans. Math. Softw., vol. 28, pp. 45-58, 2002.
[13] Y. Bai and R. C. Ward, "A parallel symmetric block-tridiagonal divide-and-conquer algorithm," ACM Trans. Math. Softw., vol. 33, 2007.
