
Code optimization for Cell/B.E.

Opportunities for ABINIT – a software package for physicists

Timo Schneider1, Simon Wunderlich1, Wolfgang Rehm1, Torsten Hoefler1,2, Heiko Schick3
1 Chemnitz University of Technology, Germany 2 Indiana University, USA 3 IBM Deutschland Entwicklung GmbH, Germany

{timos,siwu,rehm,htor}@informatik.tu-chemnitz.de, [email protected], [email protected]

ABINIT on Cell - Overview

• The Cell/B.E. processor ("Cell"), developed by Sony, Toshiba and IBM, is a heterogeneous multicore processor.
• This architecture offers great peak performance for scientific computations.
• We took some opportunities to optimize ABINIT for Cell and present first results.

ABINIT:
• A software package to compute the total energy, charge density and electronic structure of systems made of electrons and nuclei
• 240,000 lines of Fortran code
• Uses MPI for parallelization

Project Goals:
1. Run ABINIT on the PPE of a single Cell
2. Make good use of the SPEs
3. Run ABINIT on a cluster of Cells
4. Evaluate how ABINIT could benefit from a hybrid multiprocessor architecture

Profiling ABINIT showed that optimizing a few functions should lead to a serious speedup of the whole application: 4765 source lines of code (SLOC), only 2% of the code base, make up 87% of ABINIT's runtime.
Function     Runtime   Task                              SLOC
ZGEMM        25%       matrix multiplication             415
opernl4      35%       applying the non-local operator   1800
fftstp       15%       fast Fourier transformation       1450
mkffkg3      7%        fast Fourier transformation       580
pw_orthon    5%        Gram-Schmidt orthogonalization    520

We started by optimizing ZGEMM because the operation performed by this routine is easy to understand, so we could focus on the optimization itself and on getting familiar with the Cell programming environment.

Math kernel optimization

Our BLAS3/ZGEMM implementation:
• Parallel multiplication of complex matrices in double precision
• The whole computation is done on the SPEs; the PPE only administers the SPE threads

[Figure: ABINIT (PPE) allocates A and calls ZGEMM. In the existing implementation ZGEMM runs on the PPE; our contribution is the SPE part: the PPE copies the parameters, starts the SPE threads and waits for their completion, while each of the SPEs 0..N loads its part of A from memory, does its share of the computation and writes the result back.]

• Divide the input matrices into blocks that fit into an SPE's local store
• The actual data partitioning scheme is less significant: the algorithm requires many more multiplications than memory operations
• The innermost loop (where we multiply) must be optimal, so the dual issue rate and the number of pipeline stalls are important

An entry of C = A·B is

c_{ij} = \sum_k a_{ik} b_{kj}
       = \sum_k \bigl(\operatorname{Re}(a_{ik}) + i\operatorname{Im}(a_{ik})\bigr)\bigl(\operatorname{Re}(b_{kj}) + i\operatorname{Im}(b_{kj})\bigr)
       = \sum_k \operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) + \operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj})\,i + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\,i - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj})
       = \sum_k \bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj})\bigr] + \bigl[\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\bigr]\,i

Accumulating the sum term by term, with \sum^{k-1} denoting the partial sum over the first k-1 terms, the k-th update can be grouped as

  \Bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \Bigl(\operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj}) - \operatorname{Re}\bigl(\textstyle\sum^{k-1} a_{ik} b_{kj}\bigr)\Bigr)\Bigr]
+ \Bigl[\operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj}) + \Bigl(\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}\bigl(\textstyle\sum^{k-1} a_{ik} b_{kj}\bigr)\Bigr)\Bigr]\,i

The last equation can be computed with only 4 fused multiply-add (FMADD) instructions per term, compared to the 4 multiplications and 2 additions of the bracketed form above.
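Before looking at the SPU intrinsics below, the same 4-FMADD accumulation may be easier to follow in plain scalar C. This is only an illustrative sketch; the function and variable names are ours, not taken from the ZGEMM kernel.

#include <complex.h>

/* Accumulate one product a*b into the running partial sums (rre, rim)
   of c_ij using the 4-FMADD grouping derived above.
   Illustrative sketch only; names are not from the actual kernel. */
static inline void cmac_fmadd(double complex a, double complex b,
                              double *rre, double *rim)
{
    double tre = cimag(a) * cimag(b) - *rre;  /* multiply-subtract  */
    double tim = creal(a) * cimag(b) + *rim;  /* multiply-add       */
    *rre = creal(a) * creal(b) - tre;         /* new Re partial sum */
    *rim = cimag(a) * creal(b) + tim;         /* new Im partial sum */
}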

The complex multiplication derived above, implemented with SPU intrinsics in C:


#define VPTR (vector double *)

/* Shuffle patterns: high_double picks the first double (the real part)
   of each operand, low_double the second double (the imaginary part). */
vector unsigned char high_double = {  0,  1,  2,  3,  4,  5,  6,  7, 16, 17, 18, 19, 20, 21, 22, 23 };
vector unsigned char low_double  = {  8,  9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31 };
vector double fre, fim, gre, gim;   /* real/imaginary parts of the current a and b elements */
vector double rre, rim, tre, tim;   /* running partial sums and temporaries */

/* aa, bb, astep, bstep, atstep, klen, k and the initial values of
   rre/rim are set up by the surrounding kernel code. */
for (k = 0; k < klen; k++, aa += astep, bb += bstep) {
    fim = spu_shuffle(*(VPTR aa), *(VPTR (aa + atstep)), low_double);
    gim = spu_shuffle(*(VPTR bb), *(VPTR bb),            low_double);
    fre = spu_shuffle(*(VPTR aa), *(VPTR (aa + atstep)), high_double);
    gre = spu_shuffle(*(VPTR bb), *(VPTR bb),            high_double);
    tre = spu_msub(fim, gim, rre);   /* Im(a)*Im(b) - rre */
    tim = spu_madd(fre, gim, rim);   /* Re(a)*Im(b) + rim */
    rre = spu_msub(fre, gre, tre);   /* Re(a)*Re(b) - tre */
    rim = spu_madd(fim, gre, tim);   /* Im(a)*Re(b) + tim */
}
/* Re-interleave the accumulated real and imaginary parts into complex results. */
tre = spu_shuffle(rre, rim, high_double);
tim = spu_shuffle(rre, rim, low_double);
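On the PPE side, the partitioning and thread administration described in the figure above boil down to walking over the output tiles and handing each one to an SPE. The following is a minimal, self-contained sketch of that idea only: the tile size NB, the function names and the scalar stand-in for the SPE kernel are assumptions for illustration, not code from ABINIT or from our implementation.

#include <complex.h>
#include <stddef.h>

/* Tile edge chosen so that a few NB x NB double-complex tiles
   (16 bytes per element) fit into the 256 KB local store of an SPE:
   3 * 64 * 64 * 16 B = 192 KB. NB = 64 is an assumption for this sketch. */
enum { NB = 64 };

/* Stand-in for the work one SPE performs on a tile of C: on the real
   system the tile is DMAed into the local store and multiplied with the
   vectorized kernel shown above; here a plain scalar loop is used so the
   sketch is runnable on its own. */
static void multiply_tile(int spe, const double complex *A,
                          const double complex *B, double complex *C,
                          size_t n, size_t i0, size_t j0)
{
    (void)spe;  /* only meaningful when real SPE threads are involved */
    for (size_t i = i0; i < i0 + NB && i < n; i++)
        for (size_t j = j0; j < j0 + NB && j < n; j++) {
            double complex acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* PPE-side orchestration: distribute the C tiles over num_spes SPEs in
   round-robin order; the real code starts SPE threads here and waits
   for all of them to complete afterwards. */
void zgemm_tiled(const double complex *A, const double complex *B,
                 double complex *C, size_t n, int num_spes)
{
    int spe = 0;
    for (size_t i = 0; i < n; i += NB)
        for (size_t j = 0; j < n; j += NB) {
            multiply_tile(spe, A, B, C, n, i, j);
            spe = (spe + 1) % num_spes;
        }
}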

Benchmark results

The Cell SDK 3.0 (pre-release) achieves 9.5 GFlop/s DGEMM(a) performance for a 2000x2000 matrix. This corresponds to 68% of the Cell's peak performance. Our optimized ZGEMM implementation is able to leverage up to 73.5% of the peak performance, even though the complex multiplication requires more shuffle operations.

Our ZGEMM implementation is 40 times faster for 2000x2000 square matrices than the ZGEMM implementation in the refblas package. We achieve linear speedup with our ZGEMM implementation, which is due to the good memory/CPU coupling of the Cell architecture.

The unmodified version of ABINIT is roughly twice as fast on a 2 GHz Opteron as on a Cell. If we manage to optimize the other compute kernels by the same factor as ZGEMM, the Cell version could be more than three times faster than the Opteron.(b)

(a) The current SDK does not offer a ZGEMM implementation, thus we used DGEMM for comparison.
(b) We simulated a test system in a 54^3 FFT box with 108 atoms.

To simplify the process of porting math kernels to the Cell platform, we are currently building tools that help with optimizing the compiler-generated (gcc -S) assembly, similar to spu_timing but in a more "active" way: the pipeline status should not only be viewable, optimizations should also be suggested.
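GFlop/s rates for matrix multiplication are usually computed from the nominal operation count: roughly 2*M*N*K real operations for a real DGEMM and 8*M*N*K for a complex ZGEMM (4 real multiplications and 4 real additions per complex multiply-add). A small helper for converting a measured runtime into GFlop/s, as an illustration of that convention only (the runtime below is a placeholder, not a measurement from this work):

#include <stdio.h>
#include <stddef.h>

/* Nominal flop counts: flops_per_element = 2 for real GEMM, 8 for complex GEMM. */
static double gemm_gflops(double flops_per_element, size_t m, size_t n,
                          size_t k, double seconds)
{
    return flops_per_element * (double)m * (double)n * (double)k
           / seconds / 1e9;
}

int main(void)
{
    /* Example: a 2000x2000x2000 ZGEMM with a placeholder runtime of 10 s. */
    double gf = gemm_gflops(8.0, 2000, 2000, 2000, 10.0);
    printf("%.1f GFlop/s\n", gf);
    return 0;
}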

Future Work

• Optimization of the other ABINIT compute kernels for Cell
• Exploring ways to efficiently use heterogeneous clusters for ABINIT

This research is supported by the Center for Advanced Studies (CAS) of the IBM Böblingen Laboratory as part of the NICOLL Project.
