Data-Level Parallelism: Vectors and GPUs
• Performance?
• Because they are just instructions…
• Best case: 4x speedup
• …superscalar execution of vector instructions
• But, vector instructions don’t always have single-cycle throughput
• Multiple n-wide vector instructions per cycle
• Execution width (implementation) vs vector width (ISA)
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs
• Let the compiler do it (automatic vectorization, with feedback)
• GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
• Limited impact for C/C++ code (old, hard problem)
• Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
• More flexible (vector “masks”, scatter, gather) and wider
• Should be easier to exploit, more bang for the buck
Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf