Data-Level Parallelism: Vectors and GPUs

This document discusses data-level parallelism and vectorization. It begins by explaining how performing the same operation on many data items can be parallelized using single-instruction multiple-data (SIMD) techniques. It then provides an example of vectorizing a SAXPY loop to perform the operation on multiple elements with each instruction. The document goes on to describe example vector instruction set extensions, how vectors can speed up code by reducing the number of instructions, and how vector operations are implemented in hardware by replicating functional units to operate in parallel on wide vector registers. It concludes by discussing other specialized vector instructions and how vectors continue to be expanded in modern CPUs.

CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

How to Compute This Fast?

•  Performing the same operations on many data items
•  Example: SAXPY

for (I = 0; I < 1024; I++) {
    Z[I] = A*X[I] + Y[I];
}

L1: ldf [X+r1]->f1    // I is in r1
    mulf f0,f1->f2    // A is in f0
    ldf [Y+r1]->f3
    addf f2,f3->f4
    stf f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1

•  Instruction-level parallelism (ILP) - fine grained
   •  Loop unrolling with static scheduling –or– dynamic scheduling (see the unrolling sketch below)
   •  Wide-issue superscalar (non-)scaling limits benefits
•  Thread-level parallelism (TLP) - coarse grained
   •  Multicore
•  Can we do some “medium grained” parallelism?
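A minimal sketch of the loop-unrolling option above: unrolling SAXPY 4x exposes four independent multiply-adds per iteration, giving a static scheduler (or an out-of-order core) more ILP to work with.

/* SAXPY unrolled 4x: the four element computations are independent */
for (I = 0; I < 1024; I += 4) {
    Z[I]   = A*X[I]   + Y[I];
    Z[I+1] = A*X[I+1] + Y[I+1];
    Z[I+2] = A*X[I+2] + Y[I+2];
    Z[I+3] = A*X[I+3] + Y[I+3];
}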

Data-Level Parallelism

•  Data-level parallelism (DLP)
   •  Single operation repeated on multiple data elements
   •  SIMD (Single-Instruction, Multiple-Data)
   •  Less general than ILP: parallel insns are all the same operation
   •  Exploit with vectors
•  Old idea: Cray-1 supercomputer from late 1970s
   •  Eight 64-entry x 64-bit floating point “vector registers”
      •  4096 bits (0.5KB) in each register! 4KB for the vector register file
   •  Special vector instructions to perform vector operations
      •  Load vector, store vector (wide memory operation)
      •  Vector+Vector or Vector+Scalar
         •  addition, subtraction, multiply, etc.
   •  In Cray-1, each instruction specifies 64 operations!
      •  ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)

•  Extend ISA with floating point (FP) vector storage …
   •  Vector register: fixed-size array of 32- or 64-bit FP elements
   •  Vector length: for example 4, 8, 16, 64, …
•  … and example operations for vector length of 4 (v1_i denotes element i of v1)
   •  Load vector: ldf.v [X+r1]->v1
      ldf [X+r1+0]->v1_0
      ldf [X+r1+1]->v1_1
      ldf [X+r1+2]->v1_2
      ldf [X+r1+3]->v1_3
   •  Add two vectors: addf.vv v1,v2->v3
      addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
   •  Add vector to scalar: addf.vs v1,f2->v3
      addf v1_i,f2->v3_i (where i is 0,1,2,3)
•  Today’s vectors: short (128 or 256 bits), but fully parallel
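To connect this pseudo-ISA to a real short-vector ISA, here is a minimal C sketch using Intel SSE intrinsics (the function name vec4_demo is made up for illustration): _mm_loadu_ps plays the role of ldf.v, _mm_add_ps of addf.vv, and _mm_set1_ps plus _mm_add_ps of addf.vs.

#include <xmmintrin.h>  /* SSE intrinsics */

/* 4-wide vector add and vector+scalar add; all arrays hold >= 4 floats */
void vec4_demo(const float *x, const float *y, float a,
               float *sum, float *plus_a) {
    __m128 v1 = _mm_loadu_ps(x);   /* ldf.v [X+r1]->v1: load 4 floats */
    __m128 v2 = _mm_loadu_ps(y);   /* ldf.v [Y+r1]->v2 */
    _mm_storeu_ps(sum, _mm_add_ps(v1, v2));          /* addf.vv v1,v2->v3 */
    _mm_storeu_ps(plus_a,
                  _mm_add_ps(v1, _mm_set1_ps(a)));   /* addf.vs v1,f2->v3 */
}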
Example Use of Vectors – 4-wide

Scalar loop (7x1024 instructions):
L1: ldf [X+r1]->f1
    mulf f0,f1->f2
    ldf [Y+r1]->f3
    addf f2,f3->f4
    stf f4->[Z+r1]
    addi r1,4->r1
    blti r1,4096,L1

Vector loop (7x256 instructions):
L1: ldf.v [X+r1]->v1
    mulf.vs v1,f0->v2
    ldf.v [Y+r1]->v3
    addf.vv v2,v3->v4
    stf.v v4,[Z+r1]
    addi r1,16->r1
    blti r1,4096,L1

•  Operations (4x fewer instructions)
   •  Load vector: ldf.v [X+r1]->v1
   •  Multiply vector by scalar: mulf.vs v1,f2->v3
   •  Add two vectors: addf.vv v1,v2->v3
   •  Store vector: stf.v v1->[X+r1]
•  Performance?
   •  Best case: 4x speedup
   •  But vector instructions don’t always have single-cycle throughput
      •  Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation

•  Vector insns are just like normal insns… only “wider”
   •  Single instruction fetch (no extra N² checks)
   •  Wide register read & write (not multiple ports)
   •  Wide execute: replicate floating point unit (same as superscalar)
   •  Wide bypass (avoid the N² bypass problem)
   •  Wide cache read & write (single cache tag check)
•  Execution width (implementation) vs vector width (ISA)
   •  Example: Pentium 4 and “Core 1” execute vector ops at half width
   •  “Core 2” executes them at full width
•  Because they are just instructions…
   •  …superscalar execution of vector instructions
   •  Multiple n-wide vector instructions per cycle
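The vector loop above maps directly onto SSE intrinsics in C. A minimal sketch, assuming n is a multiple of 4 (a real version would need a scalar cleanup loop for the remainder):

#include <xmmintrin.h>

/* 4-wide SAXPY: Z[i] = A*X[i] + Y[i] */
void saxpy4(int n, float A, const float *X, const float *Y, float *Z) {
    __m128 va = _mm_set1_ps(A);          /* broadcast A to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&X[i]); /* ldf.v  [X+r1]->v1 */
        __m128 v2 = _mm_mul_ps(va, vx);  /* mulf.vs v1,f0->v2 */
        __m128 vy = _mm_loadu_ps(&Y[i]); /* ldf.v  [Y+r1]->v3 */
        __m128 v4 = _mm_add_ps(v2, vy);  /* addf.vv v2,v3->v4 */
        _mm_storeu_ps(&Z[i], v4);        /* stf.v  v4,[Z+r1]  */
    }
}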

Intel’s SSE2/SSE3/SSE4/AVX…

•  Intel SSE2 (Streaming SIMD Extensions 2) - 2001
   •  16 128-bit floating point registers (xmm0–xmm15)
   •  Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)
   •  Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
   •  Or 1x64b or 1x32b FP (just normal scalar floating point)
   •  Original SSE: only 8 registers, no packed integer support
•  Other vector extensions
   •  AMD 3DNow!: 64b (2x32b)
   •  PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)
•  Looking forward for x86
   •  Intel’s “Sandy Bridge” brings 256-bit vectors to x86
   •  Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86

Other Vector Instructions

•  These target specific domains: e.g., image processing, crypto
   •  Vector reduction (sum all elements of a vector)
   •  Geometry processing: 4x4 translation/rotation matrices
   •  Saturating (non-overflowing) subword add/sub: image processing (see the sketch after this list)
   •  Byte asymmetric operations: blending and composition in graphics
   •  Byte shuffle/permute: crypto
   •  Population (bit) count: crypto
   •  Max/min/argmax/argmin: video codec
   •  Absolute differences: video codec
   •  Multiply-accumulate: digital-signal processing
   •  Special instructions for AES encryption
•  More advanced (but in Intel’s Xeon Phi)
   •  Scatter/gather loads: indirect store (or load) from a vector of pointers
   •  Vector mask: predication (conditional execution) of specific elements
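As one concrete instance of these domain-specific operations, a minimal sketch of saturating byte arithmetic using the SSE2 intrinsic _mm_adds_epu8 (the function brighten16 is made up for illustration): each of the 16 byte lanes clamps at 255 instead of wrapping around, which is the behavior you want when brightening an image.

#include <emmintrin.h>  /* SSE2 packed-integer intrinsics */

/* Brighten 16 pixels at once with a per-byte saturating add */
void brighten16(unsigned char *pixels, unsigned char amount) {
    __m128i p = _mm_loadu_si128((const __m128i *)pixels);
    __m128i a = _mm_set1_epi8((char)amount);
    _mm_storeu_si128((__m128i *)pixels, _mm_adds_epu8(p, a));
}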
Using Vectors in Your Code

•  Write in assembly
   •  Ugh
•  Use “intrinsic” functions and data types
   •  For example: _mm_mul_ps() and the “__m128” datatype
•  Use vector data types (see the sketch after this list)
   •  typedef double v2df __attribute__ ((vector_size (16)));
•  Use a library someone else wrote
   •  Let them do the hard work
   •  Matrix and linear algebra packages
•  Let the compiler do it (automatic vectorization, with feedback)
   •  GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
   •  Limited impact for C/C++ code (old, hard problem)

Recap: Vectors for Exploiting DLP

•  Vectors are an efficient way of capturing parallelism
   •  Data-level parallelism
   •  Avoid the N² problems of superscalar
   •  Avoid the difficult fetch problem of superscalar
   •  Area efficient, power efficient
•  The catch?
   •  Need code that is “vector-izable”
   •  Need to modify the program (unlike dynamic-scheduled superscalar)
   •  Requires some help from the programmer
•  Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
   •  More flexible (vector “masks”, scatter, gather) and wider
   •  Should be easier to exploit, more bang for the buck
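A minimal sketch of the vector-data-type route above (a GCC-specific extension; the function scale2 is made up for illustration). Arithmetic on vector types is element-wise, and GCC lowers it to SIMD instructions when the target has them:

/* v2df: two doubles packed into 16 bytes */
typedef double v2df __attribute__ ((vector_size (16)));

v2df scale2(v2df x, double a) {
    v2df va = {a, a};   /* broadcast the scalar to both elements */
    return va * x;      /* one packed multiply instead of two scalar ones */
}

Compiling ordinary loops with -ftree-vectorize (plus -ftree-vectorizer-verbose=n for a report) asks GCC to discover such vector operations automatically.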

Graphics Processing Units (GPU)

•  Killer app for parallelism: graphics (3D games)

[Image: Tesla S870]

GPUs and SIMD/Vector Data Parallelism

•  How do GPUs have such high peak FLOPS & FLOPS/Joule?
   •  Exploit massive data parallelism – focus on total throughput
   •  Remove hardware structures that accelerate single threads
   •  Specialized for graphics: e.g., data-types & dedicated texture units
•  “SIMT” execution model
   •  Single instruction multiple threads
   •  Similar to both “vectors” and “SIMD”
   •  A key difference: better support for conditional control flow
•  Program it with CUDA or OpenCL
   •  Extensions to C
   •  Perform a “shader task” (a snippet of scalar computation) over many elements (see the sketch after this list)
   •  Internally, the GPU uses scatter/gather and vector mask operations
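To make the “shader task” idea concrete, a minimal sketch in plain C rather than actual CUDA/OpenCL syntax (saxpy_task and launch_all are illustrative names): the programmer writes scalar per-element code, and a GPU launch conceptually replaces the loop with one logical thread per index, executed internally in lockstep groups with masks for divergent branches.

/* A "shader task": scalar code for one element */
void saxpy_task(int i, float A, const float *X, const float *Y, float *Z) {
    Z[i] = A*X[i] + Y[i];
}

/* On a CPU this is a loop; on a GPU, conceptually one thread per index i */
void launch_all(int n, float A, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n; i++)
        saxpy_task(i, A, X, Y, Z);
}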
[Slides 13–20: GPU architecture figures by Kayvon Fatahalian – http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf]
Data Parallelism Summary

•  Data Level Parallelism
   •  “medium-grained” parallelism between ILP and TLP
   •  Still one flow of execution (unlike TLP)
   •  Compiler/programmer must explicitly express it (unlike ILP)
•  Hardware support: new “wide” instructions (SIMD)
   •  Wide registers, perform multiple operations in parallel
•  Trends
   •  Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000), 256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
   •  More advanced and specialized instructions
•  GPUs
   •  Embrace data parallelism via the “SIMT” execution model
   •  Becoming more programmable all the time
•  Today’s chips exploit parallelism at all levels: ILP, DLP, TLP
