Parallel Computing Platforms: Chieh-Sen (Jason) Huang
• Loop unrolling
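do i = 1, n, 4                      ! n is assumed to be a multiple of 4
    sum1 = sum1 + a(i)   * b(i)
    sum2 = sum2 + a(i+1) * b(i+1)
    sum3 = sum3 + a(i+2) * b(i+2)
    sum4 = sum4 + a(i+3) * b(i+3)
end do
sum = sum1 + sum2 + sum3 + sum4     ! combine the four partial sums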
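• Assuming that the vectors are laid out linearly in memory, the eight
FLOPs (four multiply-adds) of one iteration can be performed in 200
cycles. The figure implies two memory accesses of 100 cycles each,
one per vector, with each access fetching the four consecutive words
that the iteration needs.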
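The unrolled loop above can be sketched in C++ as follows. This is a minimal illustration, not the slides' own code: the function name `dot_unrolled4` and the cleanup loop for lengths that are not a multiple of four are assumptions added here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product with the loop unrolled by four.
// Four independent accumulators mirror sum1..sum4 in the pseudocode.
double dot_unrolled4(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());  // both vectors must have the same length
    const std::size_t n = a.size();
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;

    std::size_t i = 0;
    // Main loop: each iteration touches four consecutive elements of a and b.
    for (; i + 3 < n; i += 4) {
        sum1 += a[i]     * b[i];
        sum2 += a[i + 1] * b[i + 1];
        sum3 += a[i + 2] * b[i + 2];
        sum4 += a[i + 3] * b[i + 3];
    }

    // Cleanup loop (an addition to the slide's version): leftover elements
    // when n is not a multiple of four.
    double tail = 0.0;
    for (; i < n; ++i)
        tail += a[i] * b[i];

    return (sum1 + sum2) + (sum3 + sum4) + tail;
}
```

Each pass through the main loop reads four consecutive elements from each vector, which is what lets one fetched block per vector feed all four multiply-adds.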
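#include <iostream>
using namespace std;

int main() {
    int a[2][3];
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 3; j++)
            cout << '\t' << &a[i][j];   // address of element (i, j)
        cout << endl;
    }
}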
-----------------------------------------------------------
0x7fff9987c700 0x7fff9987c704 0x7fff9987c708
0x7fff9987c70c 0x7fff9987c710 0x7fff9987c714
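The addresses printed above advance by 4 bytes per element, and the second row starts 12 bytes (three ints) after the first, which is the row-major layout that C and C++ use for built-in arrays. The sketch below checks that relationship in code. The helper names `row_major_offset` and `layout_is_row_major` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element offset of a[i][j] from a[0][0] in a
// row-major array with `cols` columns.
std::size_t row_major_offset(std::size_t i, std::size_t j, std::size_t cols) {
    assert(j < cols);  // column index must lie inside the row
    return i * cols + j;
}

// Returns true if every a[i][j] of an int[2][3] sits exactly
// row_major_offset(i, j, 3) * sizeof(int) bytes past the start of the array.
bool layout_is_row_major() {
    int a[2][3] = {};
    const char* base = reinterpret_cast<const char*>(&a);
    for (std::size_t i = 0; i < 2; ++i) {
        for (std::size_t j = 0; j < 3; ++j) {
            const char* p = reinterpret_cast<const char*>(&a[i][j]);
            const std::size_t byte_offset = static_cast<std::size_t>(p - base);
            if (byte_offset != row_major_offset(i, j, 3) * sizeof(int))
                return false;
        }
    }
    return true;
}
```

With this layout, stepping along a row touches consecutive memory, while stepping down a column jumps by a full row each time.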
Impact of Memory Bandwidth: Example
• The vector column_sum is small and easily fits into the cache
[Figure: matrix A and vector b, with partial results b1 + b2 + b3 + b4]
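As a rough sketch of the kind of loop this example appears to concern, the code below computes column sums of a row-major matrix in two traversal orders. The slide names only `column_sum`. The flat `matrix` storage, the `rows` and `cols` parameters, and both function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-by-column traversal: the inner loop walks down one column, so
// successive reads of `matrix` are `cols` elements apart (strided access).
std::vector<double> column_sum_strided(const std::vector<double>& matrix,
                                       std::size_t rows, std::size_t cols) {
    assert(matrix.size() == rows * cols);  // flat row-major storage
    std::vector<double> column_sum(cols, 0.0);
    for (std::size_t i = 0; i < cols; ++i)
        for (std::size_t j = 0; j < rows; ++j)
            column_sum[i] += matrix[j * cols + i];
    return column_sum;
}

// Row-by-row traversal: the inner loop walks along one row, so successive
// reads of `matrix` are adjacent in memory. `column_sum` is revisited on
// every row, which is cheap when it is small enough to stay in cache.
std::vector<double> column_sum_row_order(const std::vector<double>& matrix,
                                         std::size_t rows, std::size_t cols) {
    assert(matrix.size() == rows * cols);  // flat row-major storage
    std::vector<double> column_sum(cols, 0.0);
    for (std::size_t j = 0; j < rows; ++j)
        for (std::size_t i = 0; i < cols; ++i)
            column_sum[i] += matrix[j * cols + i];
    return column_sum;
}
```

Both functions return the same sums. They differ only in the order in which `matrix` is read, which is the point at which memory bandwidth and cache-line reuse come into play.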