Data-Level Parallelism: Vectors and GPUs

This document discusses data-level parallelism and vectorization. It begins by explaining how performing the same operation on many data items can be parallelized using single-instruction multiple-data (SIMD) techniques. It then provides an example of vectorizing a SAXPY loop to perform the operation on multiple elements with each instruction. The document goes on to describe example vector instruction set extensions, how vectors can speed up code by reducing the number of instructions, and how vector operations are implemented in hardware by replicating functional units to operate in parallel on wide vector registers. It concludes by discussing other specialized vector instructions and how vectors continue to be expanded in modern CPUs.

CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

How to Compute This Fast?

•  Performing the same operations on many data items
•  Example: SAXPY

for (I = 0; I < 1024; I++) {
    Z[I] = A*X[I] + Y[I];
}

L1: ldf [X+r1]->f1    // I is in r1
    mulf f0,f1->f2    // A is in f0
    ldf [Y+r1]->f3
    addf f2,f3->f4
    stf f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1

•  Instruction-level parallelism (ILP) - fine grained
   •  Loop unrolling with static scheduling –or– dynamic scheduling (see the unrolling sketch below)
   •  Wide-issue superscalar (non-)scaling limits benefits
•  Thread-level parallelism (TLP) - coarse grained
   •  Multicore
•  Can we do some “medium grained” parallelism?
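A minimal sketch of the loop-unrolling option above: unrolling SAXPY 4x exposes four independent multiply-adds per iteration, giving a static scheduler (or an out-of-order core) more ILP to work with.

/* SAXPY unrolled 4x: the four element computations are independent */
for (I = 0; I < 1024; I += 4) {
    Z[I]   = A*X[I]   + Y[I];
    Z[I+1] = A*X[I+1] + Y[I+1];
    Z[I+2] = A*X[I+2] + Y[I+2];
    Z[I+3] = A*X[I+3] + Y[I+3];
}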

Data-Level Parallelism

•  Data-level parallelism (DLP)
   •  Single operation repeated on multiple data elements
   •  SIMD (Single-Instruction, Multiple-Data)
   •  Less general than ILP: parallel insns are all the same operation
   •  Exploit with vectors
•  Old idea: Cray-1 supercomputer from late 1970s
   •  Eight 64-entry x 64-bit floating point “vector registers”
      •  4096 bits (0.5KB) in each register! 4KB for the vector register file
   •  Special vector instructions to perform vector operations
      •  Load vector, store vector (wide memory operation)
      •  Vector+Vector or Vector+Scalar
         •  addition, subtraction, multiply, etc.
   •  In Cray-1, each instruction specifies 64 operations!
      •  ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)

•  Extend ISA with floating point (FP) vector storage …
   •  Vector register: fixed-size array of 32- or 64-bit FP elements
   •  Vector length: for example 4, 8, 16, 64, …
•  … and example operations for vector length of 4 (v1_i denotes element i of v1)
   •  Load vector: ldf.v [X+r1]->v1
      ldf [X+r1+0]->v1_0
      ldf [X+r1+1]->v1_1
      ldf [X+r1+2]->v1_2
      ldf [X+r1+3]->v1_3
   •  Add two vectors: addf.vv v1,v2->v3
      addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
   •  Add vector to scalar: addf.vs v1,f2->v3
      addf v1_i,f2->v3_i (where i is 0,1,2,3)
•  Today’s vectors: short (128 or 256 bits), but fully parallel
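To connect this pseudo-ISA to a real short-vector ISA, here is a minimal C sketch using Intel SSE intrinsics (the function name vec4_demo is made up for illustration): _mm_loadu_ps plays the role of ldf.v, _mm_add_ps of addf.vv, and _mm_set1_ps plus _mm_add_ps of addf.vs.

#include <xmmintrin.h>  /* SSE intrinsics */

/* 4-wide vector add and vector+scalar add; all arrays hold >= 4 floats */
void vec4_demo(const float *x, const float *y, float a,
               float *sum, float *plus_a) {
    __m128 v1 = _mm_loadu_ps(x);   /* ldf.v [X+r1]->v1: load 4 floats */
    __m128 v2 = _mm_loadu_ps(y);   /* ldf.v [Y+r1]->v2 */
    _mm_storeu_ps(sum, _mm_add_ps(v1, v2));          /* addf.vv v1,v2->v3 */
    _mm_storeu_ps(plus_a,
                  _mm_add_ps(v1, _mm_set1_ps(a)));   /* addf.vs v1,f2->v3 */
}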
Example Use of Vectors – 4-wide

Scalar loop (7x1024 instructions):
L1: ldf [X+r1]->f1
    mulf f0,f1->f2
    ldf [Y+r1]->f3
    addf f2,f3->f4
    stf f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1

Vector loop (7x256 instructions):
L1: ldf.v [X+r1]->v1
    mulf.vs v1,f0->v2
    ldf.v [Y+r1]->v3
    addf.vv v2,v3->v4
    stf.v v4,[Z+r1]
    addi r1,16->r1
    blti r1,4096,L1

•  Operations (4x fewer instructions)
   •  Load vector: ldf.v [X+r1]->v1
   •  Multiply vector by scalar: mulf.vs v1,f2->v3
   •  Add two vectors: addf.vv v1,v2->v3
   •  Store vector: stf.v v1->[X+r1]
•  Performance?
   •  Best case: 4x speedup
   •  But vector instructions don’t always have single-cycle throughput
      •  Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation

•  Vector insns are just like normal insns… only “wider”
   •  Single instruction fetch (no extra N² checks)
   •  Wide register read & write (not multiple ports)
   •  Wide execute: replicate floating point unit (same as superscalar)
   •  Wide bypass (avoid the N² bypass problem)
   •  Wide cache read & write (single cache tag check)
•  Execution width (implementation) vs vector width (ISA)
   •  Example: Pentium 4 and “Core 1” execute vector ops at half width
   •  “Core 2” executes them at full width
•  Because they are just instructions…
   •  …superscalar execution of vector instructions
   •  Multiple n-wide vector instructions per cycle
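The vector loop above maps directly onto SSE intrinsics in C. A minimal sketch, assuming n is a multiple of 4 (a real version would need a scalar cleanup loop for the remainder):

#include <xmmintrin.h>

/* 4-wide SAXPY: Z[i] = A*X[i] + Y[i] */
void saxpy4(int n, float A, const float *X, const float *Y, float *Z) {
    __m128 va = _mm_set1_ps(A);          /* broadcast A to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&X[i]); /* ldf.v  [X+r1]->v1 */
        __m128 v2 = _mm_mul_ps(va, vx);  /* mulf.vs v1,f0->v2 */
        __m128 vy = _mm_loadu_ps(&Y[i]); /* ldf.v  [Y+r1]->v3 */
        __m128 v4 = _mm_add_ps(v2, vy);  /* addf.vv v2,v3->v4 */
        _mm_storeu_ps(&Z[i], v4);        /* stf.v  v4,[Z+r1]  */
    }
}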

Intel’s SSE2/SSE3/SSE4/AVX…

•  Intel SSE2 (Streaming SIMD Extensions 2) - 2001
   •  16 128-bit floating point registers (xmm0–xmm15)
   •  Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)
   •  Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
   •  Or 1x64b or 1x32b FP (just normal scalar floating point)
   •  Original SSE: only 8 registers, no packed integer support
•  Other vector extensions
   •  AMD 3DNow!: 64b (2x32b)
   •  PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)
•  Looking forward for x86
   •  Intel’s “Sandy Bridge” brings 256-bit vectors to x86
   •  Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86

Other Vector Instructions

•  These target specific domains: e.g., image processing, crypto
   •  Vector reduction (sum all elements of a vector)
   •  Geometry processing: 4x4 translation/rotation matrices
   •  Saturating (non-overflowing) subword add/sub: image processing (see the sketch after this list)
   •  Byte asymmetric operations: blending and composition in graphics
   •  Byte shuffle/permute: crypto
   •  Population (bit) count: crypto
   •  Max/min/argmax/argmin: video codec
   •  Absolute differences: video codec
   •  Multiply-accumulate: digital-signal processing
   •  Special instructions for AES encryption
•  More advanced (but in Intel’s Xeon Phi)
   •  Scatter/gather loads: indirect store (or load) from a vector of pointers
   •  Vector mask: predication (conditional execution) of specific elements
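As one concrete instance of these domain-specific operations, a minimal sketch of saturating byte arithmetic using the SSE2 intrinsic _mm_adds_epu8 (the function brighten16 is made up for illustration): each of the 16 byte lanes clamps at 255 instead of wrapping around, which is the behavior you want when brightening an image.

#include <emmintrin.h>  /* SSE2 packed-integer intrinsics */

/* Brighten 16 pixels at once with a per-byte saturating add */
void brighten16(unsigned char *pixels, unsigned char amount) {
    __m128i p = _mm_loadu_si128((const __m128i *)pixels);
    __m128i a = _mm_set1_epi8((char)amount);
    _mm_storeu_si128((__m128i *)pixels, _mm_adds_epu8(p, a));
}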
Using Vectors in Your Code

•  Write in assembly
   •  Ugh
•  Use “intrinsic” functions and data types
   •  For example: _mm_mul_ps() and the “__m128” datatype
•  Use vector data types (see the sketch after this list)
   •  typedef double v2df __attribute__ ((vector_size (16)));
•  Use a library someone else wrote
   •  Let them do the hard work
   •  Matrix and linear algebra packages
•  Let the compiler do it (automatic vectorization, with feedback)
   •  GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
   •  Limited impact for C/C++ code (old, hard problem)

Recap: Vectors for Exploiting DLP

•  Vectors are an efficient way of capturing parallelism
   •  Data-level parallelism
   •  Avoid the N² problems of superscalar
   •  Avoid the difficult fetch problem of superscalar
   •  Area efficient, power efficient
•  The catch?
   •  Need code that is “vector-izable”
   •  Need to modify the program (unlike dynamic-scheduled superscalar)
   •  Requires some help from the programmer
•  Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
   •  More flexible (vector “masks”, scatter, gather) and wider
   •  Should be easier to exploit, more bang for the buck
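A minimal sketch of the vector-data-type route above (a GCC-specific extension; the function scale2 is made up for illustration). Arithmetic on vector types is element-wise, and GCC lowers it to SIMD instructions when the target has them:

/* v2df: two doubles packed into 16 bytes */
typedef double v2df __attribute__ ((vector_size (16)));

v2df scale2(v2df x, double a) {
    v2df va = {a, a};   /* broadcast the scalar to both elements */
    return va * x;      /* one packed multiply instead of two scalar ones */
}

Compiling ordinary loops with -ftree-vectorize (plus -ftree-vectorizer-verbose=n for a report) asks GCC to discover such vector operations automatically.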

Graphics Processing Units (GPU)

•  Killer app for parallelism: graphics (3D games)

[Image: Tesla S870]

GPUs and SIMD/Vector Data Parallelism

•  How do GPUs have such high peak FLOPS & FLOPS/Joule?
   •  Exploit massive data parallelism – focus on total throughput
   •  Remove hardware structures that accelerate single threads
   •  Specialized for graphics: e.g., data-types & dedicated texture units
•  “SIMT” execution model
   •  Single instruction multiple threads
   •  Similar to both “vectors” and “SIMD”
   •  A key difference: better support for conditional control flow
•  Program it with CUDA or OpenCL
   •  Extensions to C
   •  Perform a “shader task” (a snippet of scalar computation) over many elements (see the sketch after this list)
   •  Internally, the GPU uses scatter/gather and vector mask operations
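To make the “shader task” idea concrete, a minimal sketch in plain C rather than actual CUDA/OpenCL syntax (saxpy_task and launch_all are illustrative names): the programmer writes scalar per-element code, and a GPU launch conceptually replaces the loop with one logical thread per index, executed internally in lockstep groups with masks for divergent branches.

/* A "shader task": scalar code for one element */
void saxpy_task(int i, float A, const float *X, const float *Y, float *Z) {
    Z[i] = A*X[i] + Y[i];
}

/* On a CPU this is a loop; on a GPU, conceptually one thread per index i */
void launch_all(int n, float A, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n; i++)
        saxpy_task(i, A, X, Y, Z);
}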
[Slides 13–20: GPU architecture figures by Kayvon Fatahalian – http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf]
Data Parallelism Summary

•  Data Level Parallelism
   •  “medium-grained” parallelism between ILP and TLP
   •  Still one flow of execution (unlike TLP)
   •  Compiler/programmer must explicitly express it (unlike ILP)
•  Hardware support: new “wide” instructions (SIMD)
   •  Wide registers, perform multiple operations in parallel
•  Trends
   •  Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000), 256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
   •  More advanced and specialized instructions
•  GPUs
   •  Embrace data parallelism via the “SIMT” execution model
   •  Becoming more programmable all the time
•  Today’s chips exploit parallelism at all levels: ILP, DLP, TLP
