
Code optimization for Cell/B.E.

Opportunities for ABINIT – a software package for physicists

Timo Schneider1, Simon Wunderlich1, Wolfgang Rehm1, Torsten Hoefler1,2, Heiko Schick3
1 Chemnitz University of Technology, Germany 2 Indiana University, USA 3 IBM Deutschland Entwicklung GmbH, Germany

{timos,siwu,rehm,htor}@informatik.tu-chemnitz.de, [email protected], [email protected]

ABINIT on Cell - Overview

• The Cell/B.E. processor ("Cell"), developed by Sony, Toshiba and IBM, is a heterogeneous multicore processor.
• This architecture offers great peak performance for scientific computations.
• We took some opportunities to optimize ABINIT for Cell and present first results.

ABINIT:
• A software package to compute the total energy, charge density and electronic structure of systems made of electrons and nuclei
• 240,000 lines of Fortran code
• Uses MPI for parallelization

Project Goals:
1. Run ABINIT on the PPE of a single Cell
2. Make good use of the SPEs
3. Run ABINIT on a cluster of Cells
4. Evaluate how ABINIT could benefit from a hybrid multiprocessor architecture

Profiling ABINIT showed that optimizing a few functions should lead to a serious speedup of the whole application: 4765 source lines of code (SLOC), only 2% of the code base, make up 87% of ABINIT's runtime.
Function     Runtime   Task                              SLOC
ZGEMM        25%       matrix multiplication             415
opernl4      35%       applying the non-local operator   1800
fftstp       15%       fast Fourier transformation       1450
mkffkg3      7%        fast Fourier transformation       580
pw_orthon    5%        Gram-Schmidt orthogonalization    520

We started by optimizing ZGEMM because the operation performed by this routine is easy to understand, so we could focus on the optimization itself and on getting familiar with the Cell programming environment.

Math kernel optimization

Our BLAS3/ZGEMM implementation:
• Parallel multiplication of complex matrices in double precision
• The whole computation is done on the SPEs; the PPE only administers the SPE threads

[Figure: ABINIT (PPE) allocates A and calls ZGEMM. In the existing implementation ZGEMM runs on the PPE; our contribution is the SPE part: the PPE copies the parameters, starts the SPE threads and waits for their completion, while each of the SPEs 0..N loads its part of A from memory, does its share of the computation and writes the result back.]

• Divide the input matrices into blocks that fit into an SPE's local store
• The actual data partitioning scheme is less significant: the algorithm requires many more multiplications than memory operations
• The innermost loop (where we multiply) must be optimal, so the dual issue rate and the number of pipeline stalls are important

An entry of C = A·B is

c_{ij} = \sum_k a_{ik} b_{kj}
       = \sum_k \bigl(\operatorname{Re}(a_{ik}) + i\operatorname{Im}(a_{ik})\bigr)\bigl(\operatorname{Re}(b_{kj}) + i\operatorname{Im}(b_{kj})\bigr)
       = \sum_k \operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) + \operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj})\,i + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\,i - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj})
       = \sum_k \bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj})\bigr] + \bigl[\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj})\bigr]\,i

Accumulating the sum term by term, with \sum^{k-1} denoting the partial sum over the first k-1 terms, the k-th update can be grouped as

  \Bigl[\operatorname{Re}(a_{ik})\operatorname{Re}(b_{kj}) - \Bigl(\operatorname{Im}(a_{ik})\operatorname{Im}(b_{kj}) - \operatorname{Re}\bigl(\textstyle\sum^{k-1} a_{ik} b_{kj}\bigr)\Bigr)\Bigr]
+ \Bigl[\operatorname{Im}(a_{ik})\operatorname{Re}(b_{kj}) + \Bigl(\operatorname{Re}(a_{ik})\operatorname{Im}(b_{kj}) + \operatorname{Im}\bigl(\textstyle\sum^{k-1} a_{ik} b_{kj}\bigr)\Bigr)\Bigr]\,i

The last equation can be computed with only 4 fused multiply-add (FMADD) instructions per term, compared to the 4 multiplications and 2 additions of the bracketed form above.
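Before looking at the SPU intrinsics below, the same 4-FMADD accumulation may be easier to follow in plain scalar C. This is only an illustrative sketch; the function and variable names are ours, not taken from the ZGEMM kernel.

#include <complex.h>

/* Accumulate one product a*b into the running partial sums (rre, rim)
   of c_ij using the 4-FMADD grouping derived above.
   Illustrative sketch only; names are not from the actual kernel. */
static inline void cmac_fmadd(double complex a, double complex b,
                              double *rre, double *rim)
{
    double tre = cimag(a) * cimag(b) - *rre;  /* multiply-subtract  */
    double tim = creal(a) * cimag(b) + *rim;  /* multiply-add       */
    *rre = creal(a) * creal(b) - tre;         /* new Re partial sum */
    *rim = cimag(a) * creal(b) + tim;         /* new Im partial sum */
}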

The complex multiplication derived above, implemented with SPU intrinsics in C:


#define VPTR (vector double *)

/* Shuffle patterns: high_double picks the first double (the real part)
   of each operand, low_double the second double (the imaginary part). */
vector unsigned char high_double = {  0,  1,  2,  3,  4,  5,  6,  7, 16, 17, 18, 19, 20, 21, 22, 23 };
vector unsigned char low_double  = {  8,  9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31 };
vector double fre, fim, gre, gim;   /* real/imaginary parts of the current a and b elements */
vector double rre, rim, tre, tim;   /* running partial sums and temporaries */

/* aa, bb, astep, bstep, atstep, klen, k and the initial values of
   rre/rim are set up by the surrounding kernel code. */
for (k = 0; k < klen; k++, aa += astep, bb += bstep) {
    fim = spu_shuffle(*(VPTR aa), *(VPTR (aa + atstep)), low_double);
    gim = spu_shuffle(*(VPTR bb), *(VPTR bb),            low_double);
    fre = spu_shuffle(*(VPTR aa), *(VPTR (aa + atstep)), high_double);
    gre = spu_shuffle(*(VPTR bb), *(VPTR bb),            high_double);
    tre = spu_msub(fim, gim, rre);   /* Im(a)*Im(b) - rre */
    tim = spu_madd(fre, gim, rim);   /* Re(a)*Im(b) + rim */
    rre = spu_msub(fre, gre, tre);   /* Re(a)*Re(b) - tre */
    rim = spu_madd(fim, gre, tim);   /* Im(a)*Re(b) + tim */
}
/* Re-interleave the accumulated real and imaginary parts into complex results. */
tre = spu_shuffle(rre, rim, high_double);
tim = spu_shuffle(rre, rim, low_double);
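On the PPE side, the partitioning and thread administration described in the figure above boil down to walking over the output tiles and handing each one to an SPE. The following is a minimal, self-contained sketch of that idea only: the tile size NB, the function names and the scalar stand-in for the SPE kernel are assumptions for illustration, not code from ABINIT or from our implementation.

#include <complex.h>
#include <stddef.h>

/* Tile edge chosen so that a few NB x NB double-complex tiles
   (16 bytes per element) fit into the 256 KB local store of an SPE:
   3 * 64 * 64 * 16 B = 192 KB. NB = 64 is an assumption for this sketch. */
enum { NB = 64 };

/* Stand-in for the work one SPE performs on a tile of C: on the real
   system the tile is DMAed into the local store and multiplied with the
   vectorized kernel shown above; here a plain scalar loop is used so the
   sketch is runnable on its own. */
static void multiply_tile(int spe, const double complex *A,
                          const double complex *B, double complex *C,
                          size_t n, size_t i0, size_t j0)
{
    (void)spe;  /* only meaningful when real SPE threads are involved */
    for (size_t i = i0; i < i0 + NB && i < n; i++)
        for (size_t j = j0; j < j0 + NB && j < n; j++) {
            double complex acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* PPE-side orchestration: distribute the C tiles over num_spes SPEs in
   round-robin order; the real code starts SPE threads here and waits
   for all of them to complete afterwards. */
void zgemm_tiled(const double complex *A, const double complex *B,
                 double complex *C, size_t n, int num_spes)
{
    int spe = 0;
    for (size_t i = 0; i < n; i += NB)
        for (size_t j = 0; j < n; j += NB) {
            multiply_tile(spe, A, B, C, n, i, j);
            spe = (spe + 1) % num_spes;
        }
}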

Benchmark results

The Cell SDK 3.0 (pre-release) achieves 9.5 GFlop/s DGEMM(a) performance for a 2000x2000 matrix. This corresponds to 68% of the Cell's peak performance. Our optimized ZGEMM implementation is able to leverage up to 73.5% of the peak performance, even though the complex multiplication requires more shuffle operations.

Our ZGEMM implementation is 40 times faster for 2000x2000 square matrices than the ZGEMM implementation in the refblas package. We achieve linear speedup with our ZGEMM implementation, which is due to the good memory/CPU coupling of the Cell architecture.

The unmodified version of ABINIT is roughly twice as fast on a 2 GHz Opteron as on a Cell. If we manage to optimize the other compute kernels by the same factor as ZGEMM, the Cell version could be more than three times faster than the Opteron.(b)

(a) The current SDK does not offer a ZGEMM implementation, thus we used DGEMM for comparison.
(b) We simulated a test system in a 54^3 FFT box with 108 atoms.

To simplify the process of porting math kernels to the Cell platform, we are currently building tools that help with optimizing the compiler-generated (gcc -S) assembly, similar to spu_timing but in a more "active" way: the pipeline status should not only be viewable, optimizations should also be suggested.
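GFlop/s rates for matrix multiplication are usually computed from the nominal operation count: roughly 2*M*N*K real operations for a real DGEMM and 8*M*N*K for a complex ZGEMM (4 real multiplications and 4 real additions per complex multiply-add). A small helper for converting a measured runtime into GFlop/s, as an illustration of that convention only (the runtime below is a placeholder, not a measurement from this work):

#include <stdio.h>
#include <stddef.h>

/* Nominal flop counts: flops_per_element = 2 for real GEMM, 8 for complex GEMM. */
static double gemm_gflops(double flops_per_element, size_t m, size_t n,
                          size_t k, double seconds)
{
    return flops_per_element * (double)m * (double)n * (double)k
           / seconds / 1e9;
}

int main(void)
{
    /* Example: a 2000x2000x2000 ZGEMM with a placeholder runtime of 10 s. */
    double gf = gemm_gflops(8.0, 2000, 2000, 2000, 10.0);
    printf("%.1f GFlop/s\n", gf);
    return 0;
}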

Future Work

• Optimization of the other ABINIT compute kernels for Cell
• Exploring ways to efficiently use heterogeneous clusters for ABINIT

This research is supported by the Center for Advanced Studies (CAS) of the IBM Böblingen Laboratory as part of the NICOLL Project.
