Code Optimization For Cell/B.E. - Opportunities For ABINIT - A Software Package For Physicists
Timo Schneider¹, Simon Wunderlich¹, Wolfgang Rehm¹, Torsten Hoefler¹˒², Heiko Schick³
¹ Chemnitz University of Technology, Germany; ² Indiana University, USA; ³ IBM Deutschland Entwicklung GmbH, Germany
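... electrons and nuclei
• 240,000 lines of Fortran code
• Uses MPI for parallelization

4. Evaluate how ABINIT could benefit from a hybrid multiprocessor architecture

Profiling ABINIT indicated that optimizing a few functions should yield a substantial speedup of the whole application: 4765 source lines of code (SLOC), about 2% of the code base, account for 87% of ABINIT's runtime.

Function    Runtime  Task                             SLOC
ZGEMM       25%      matrix multiplication             415
opernl4     35%      applying the non-local operator  1800
fftstp      15%      fast Fourier transform           1450
mkffkg3      7%      fast Fourier transform            580
pw_orthon    5%      Gram-Schmidt orthogonalization    520

We started by optimizing ZGEMM because the operation this routine performs is easy to understand, so we could focus on optimization and on getting familiar with the Cell programming environment.

[Figure: labels "load part of A", "do computation", "write back"; legend "existing implementation", "our contribution"]

• Divide the input matrices into blocks that fit into an SPE's local store
• The actual data partitioning scheme is less significant, because the algorithm requires many more multiplications than memory operations
• The innermost loop (where we multiply) must be optimal, so the dual-issue rate and the number of pipeline stalls are important

\[
\begin{aligned}
c_{ij} &= \sum_k a_{ik} \cdot b_{kj}
        = \sum_k \bigl(\operatorname{Re}(a_{ik}) + i\,\operatorname{Im}(a_{ik})\bigr)\bigl(\operatorname{Re}(b_{kj}) + i\,\operatorname{Im}(b_{kj})\bigr) \\
       &= \sum_k \operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) + \operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj})\,i + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\,i - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj}) \\
       &= \sum_k \bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj})\bigr] + \bigl[\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\bigr]\,i \\
       &= \Bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \Bigl(\operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj}) - \operatorname{Re}\Bigl(\sum_{k-1} a_{ik} \cdot b_{kj}\Bigr)\Bigr)\Bigr] \\
       &\quad + \Bigl[\operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj}) + \Bigl(\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}\Bigl(\sum_{k-1} a_{ik} \cdot b_{kj}\Bigr)\Bigr)\Bigr]\,i
\end{aligned}
\]

The last line gives the update for the k-th term, where the sum over k−1 denotes the partial sum accumulated over all preceding terms.

The last equation can be computed with only 4 fused multiply-add (FMADD) instructions, compared to 4 multiply and 2 add instructions in line 3.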
Our ZGEMM implementation is 40 times faster for 2000x2000 square matrices than the ZGEMM implementation in the refblas package.
We achieve linear speedup with our ZGEMM implementation, which is due to the good memory/CPU coupling on the Cell architecture.

Future Work
• Optimization of the other ABINIT compute kernels for Cell
• Exploring ways to efficiently use heterogeneous clusters for ABINIT
This research is supported by the Center for Advanced Studies (CAS) of the IBM Böblingen Laboratory as part of the NICOLL Project.