CA 4 Notes
Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Introduction
Data-Level Parallelism (DLP) is about doing the same task on many
pieces of data at the same time.
For example: If you have two lists of numbers and you want to add
them together, instead of adding them one by one, DLP lets you add
all the numbers at once in parallel. This makes things much faster.
In computer architecture, when we talk about vector processing, we
mean using special hardware that can work with groups of data at
once (called vectors), instead of just working with one piece of data
at a time. So, DLP helps computers do tasks on big sets of data much
quicker.
Vector Architecture:
Vector architecture is a processor design that handles vector operations efficiently. In a
vector architecture, vectors (large arrays of data) are processed in parallel with a single
instruction, rather than one data element at a time.
In other words, it is a way of designing processors so that they can work on a lot of data at
the same time, rather than just one piece of data at a time.
Here's how it works:
• Vector registers: Think of these as super-sized memory storage that can
hold multiple pieces of data (like a list or array of numbers) at once.
• SIMD (Single Instruction, Multiple Data): This means that one instruction can tell the
processor to do the same thing to several pieces of data at the same time. So instead of
adding two numbers one by one, you add a whole list of numbers in one go.
Basic Structure of Vector Architecture: VMIPS
Example:
Array 1: [ A1 A2 A3 A4 ]
Array 2: [ B1 B2 B3 B4 ]
Parallel Operation: Add: A1 + B1 → R1
Add: A2 + B2 → R2
Add: A3 + B3 → R3
Add: A4 + B4 → R4
Result: [ R1 R2 R3 R4 ]
Vector Instructions:
These instructions are specifically designed to perform operations like addition,
multiplication, or subtraction on entire vectors of data. For instance, a vector addition
instruction can add two vectors element-wise, all at once.
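To make this concrete, here is a plain C sketch of the scalar loop that a single vector instruction replaces (the function name addScalar is just illustrative):

// Scalar form: one addition per loop iteration.
void addScalar(const float *A, const float *B, float *R, int N) {
    for (int i = 0; i < N; i++)
        R[i] = A[i] + B[i];
}
// On a vector machine, this whole loop can be expressed as a vector load of A,
// a vector load of B, a single element-wise vector add, and a vector store of R.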
Different processors provide SIMD instruction sets tailored to multimedia tasks. Let’s explore some
common SIMD instruction sets:
1. MMX (Multi Media Extensions) - Intel
Purpose: MMX was the first SIMD extension introduced by Intel in the mid-1990s. It was designed
to accelerate multimedia tasks like video and audio processing.
Key Features:
• MMX uses 64-bit registers to store multiple data elements (e.g., integers or packed data).
• It allows multiple integer operations to be executed in parallel on these registers.
• MMX supports a range of operations such as addition, subtraction, multiplication, and logical
operations, all applied to packed data.
Example Application: Audio processing, video processing, and image manipulation.
2. SSE (Streaming SIMD Extensions) - Intel
Purpose: SSE was introduced by Intel to overcome some limitations of MMX, specifically adding
support for floating-point operations.
Key Features:
• SSE uses 128-bit registers (called XMM registers) to store 4 single-precision floating-point
numbers or 2 double-precision floating-point numbers.
• SSE supports operations like addition, subtraction, multiplication, and division on multiple
floating-point values simultaneously.
• SSE includes more complex operations like sqrt (square root), clipping, and shifting.
Example Application: Advanced image processing, video encoding/decoding, 3D rendering, and
scientific computations.
SSE2 (a later version of SSE) extends SSE by adding support for double-precision floating-point
numbers, and SSE3 and SSSE3 further improve multimedia performance with additional
instructions.
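As a rough sketch of the SSE idea using compiler intrinsics (the function name add4 is illustrative; MMX works the same way, only on 64-bit packed integers instead of 128-bit packed floats):

#include <xmmintrin.h>   // SSE intrinsics

// Adds 4 pairs of single-precision floats with one SIMD add.
void add4(const float *a, const float *b, float *r) {
    __m128 va = _mm_loadu_ps(a);            // load 4 floats into a 128-bit XMM register
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(va, vb));   // one instruction performs all 4 additions
}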
3. AVX (Advanced Vector Extensions) - Intel/AMD
Purpose: AVX extends SSE by providing 256-bit wide registers, allowing more data elements to
be processed simultaneously.
Key Features:
• AVX supports floating-point operations and has more flexible vector operations than SSE.
• The registers are wider (256 bits) than SSE's 128 bits, which allows for even more parallelism
(e.g., processing 8 single-precision floating-point numbers at once).
• AVX also provides enhanced performance for multimedia workloads, particularly in video
processing, signal processing, and scientific simulations.
• AVX-512 (introduced later) further extends this to 512-bit registers, offering even greater
parallelism.
Example Application: High-performance video and audio decoding, advanced 3D graphics
rendering, and computationally intensive tasks like machine learning and scientific computing.
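A similar sketch with AVX intrinsics, processing 8 floats per iteration (it assumes n is a multiple of 8 and an AVX-capable build flag such as -mavx, purely for illustration):

#include <immintrin.h>   // AVX intrinsics

// Adds two float arrays 8 elements at a time using 256-bit YMM registers.
void addAVX(const float *a, const float *b, float *r, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // 8 single-precision floats per register
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(r + i, _mm256_add_ps(va, vb));   // 8 additions in one instruction
    }
}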
Roofline Model
• The "roofline model" in computer architecture is a performance visualization tool used to
understand and evaluate the performance of a system, specifically focusing on the
relationship between computational performance and memory bandwidth.
• Computational Performance: how fast the processor can do calculations.
• Memory Bandwidth: how fast data can be transferred to/from memory.
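The attainable performance is the minimum of these two limits: attainable FLOP/s = min(peak compute, arithmetic intensity × peak memory bandwidth). A small C sketch of this bound (all numbers below are made up purely for illustration):

#include <stdio.h>

int main(void) {
    double peak_gflops = 100.0;   // assumed peak computational performance (GFLOP/s)
    double peak_bw_gbs = 25.0;    // assumed peak memory bandwidth (GB/s)
    double intensity   = 2.0;     // assumed arithmetic intensity (FLOPs per byte moved)

    double memory_roof = intensity * peak_bw_gbs;                        // 50 GFLOP/s
    double attainable  = memory_roof < peak_gflops ? memory_roof : peak_gflops;

    // With these numbers the kernel is memory-bound: it can reach at most 50 GFLOP/s.
    printf("attainable = %.1f GFLOP/s\n", attainable);
    return 0;
}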
Graphics Processing Unit
A Graphics Processing Unit (GPU) is a special type of processor inside
your computer that's really good at handling graphics and images. It was
originally designed to help with video games and 3D graphics, but over
time, it’s become super useful for other tasks too, like running artificial
intelligence or doing scientific calculations.
What is a GPU?
• A Graphics Processing Unit (GPU) is designed to accelerate image rendering, but is also
highly effective for parallel computing tasks like simulations, machine learning, and data
processing.
• Unlike a CPU, which is optimized for executing tasks sequentially, a GPU executes many tasks
simultaneously, making it ideal for operations that can be parallelized.
Key Concepts
• Parallel Processing: GPUs have thousands of smaller cores, enabling them to perform many
operations at once.
• SIMT Model: Single Instruction, Multiple Threads—many threads execute the same
instruction on different data.
• Kernel: A function executed on the GPU by many threads in parallel.
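For instance, a minimal CUDA kernel sketch for vector addition (the name vecAdd and the launch configuration are illustrative), where each thread adds one pair of elements:

// Each thread computes one element of the result.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against extra threads
        c[i] = a[i] + b[i];
}
// Launch enough blocks of 256 threads to cover all n elements:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

Thread 0 handles element 0, thread 1 handles element 1, and so on, which is exactly the SIMT model described above.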
Programming Models & Frameworks
1. CUDA (NVIDIA):
A platform for GPU programming using C/C++. Widely used in machine learning,
scientific computing, and video processing.
Example: Vector addition where each thread adds corresponding elements from two
arrays.
2. OpenCL:
An open standard for parallel computing that works across various hardware platforms
(NVIDIA, AMD, Intel).
3. DirectCompute:
Part of DirectX, designed for GPU-accelerated computing on Windows.
4. GPGPU (General-Purpose GPU Computing):
Uses graphics APIs like OpenGL and Vulkan for non-graphics computations.
Applications
• Machine Learning: Training deep neural networks with significant speed-ups.
• Scientific Simulations: Accelerates complex calculations in physics, chemistry,
and biology.
• Image & Video Processing: Enhances real-time processing for editing and
rendering.
Optimization Considerations
• Efficient memory management (e.g., using shared memory).
• Proper synchronization between threads.
• Minimize data transfer between CPU and GPU for better performance.
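As a sketch of the last point, the usual pattern is to copy inputs to the GPU once, launch the kernel, and copy results back once, rather than transferring data inside a loop. This reuses the illustrative vecAdd kernel from earlier; runVecAdd is a made-up name:

#include <cuda_runtime.h>

void runVecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&d_a, bytes);                      // allocate GPU (device) memory
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU, once
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);    // many threads run in parallel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU, once
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}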
NVIDIA GPUs use the CUDA (Compute Unified Device Architecture) programming
model, which enables parallel computing for tasks such as graphics rendering, machine
learning, and scientific simulations. The GPU Instruction Set Architecture (ISA) is
designed to execute many threads simultaneously, using a SIMT (Single Instruction,
Multiple Threads) model, where each thread performs the same operation on different
data.
Key aspects of the CUDA ISA include:
1. Core Instructions:
• Arithmetic: Operations like addition, multiplication, and division (e.g., add.f32, mul.f32).
• Memory: Instructions for accessing different memory types (e.g., ld.global, st.shared).
• Control Flow: Instructions for branching and managing execution paths (e.g., bra, call).
• Atomic Operations: For thread-safe access to shared data (e.g., atomicAdd).
2. Execution Model:
• Warps: Groups of 32 threads that execute the same instruction in parallel.
• SIMT: Threads in a warp work on different pieces of data but execute the same operation.
3. Memory Hierarchy:
• Global Memory: Large but slow, shared across all threads.
• Shared Memory: Fast, shared by threads within a block.
• Registers: Fastest memory, local to each thread.
4. Optimizations:
• Memory Coalescing: Threads should access contiguous memory locations to improve
performance.
• Warp Divergence: Minimized to avoid performance penalties when threads in a warp follow
different paths.
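A small CUDA sketch combining several of the points above: shared memory inside a block, an atomic operation, and a coalesced access pattern (blockSum is an illustrative name, not a standard API):

// Each block sums its portion of the input and adds the result to a global total.
__global__ void blockSum(const float *in, float *total, int n) {
    __shared__ float partial;                        // shared memory: visible to all threads in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads read consecutive elements (coalesced)

    if (threadIdx.x == 0) partial = 0.0f;
    __syncthreads();

    if (i < n)
        atomicAdd(&partial, in[i]);                  // thread-safe update of shared data
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, partial);                   // one atomic update of global memory per block
}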
Multithreaded SIMD Processor:
NVIDIA Instruction Set Architecture
NVIDIA’s GPU instruction set architecture (ISA) is designed to handle parallel computing
tasks, making it ideal for high-performance computing like graphics rendering, machine
learning, and simulations. Here’s how it works with an example:
Key Features:
1. CUDA Cores: The basic units that perform calculations in parallel.
2. Streaming Multiprocessors (SMs): Groups of CUDA cores that process instructions
simultaneously. Each SM has multiple CUDA cores.
3. SIMT (Single Instruction, Multiple Threads): A group of 32 threads (called a warp)
executes the same instruction but works on different pieces of data.
4. Tensor Cores: Specialized hardware that accelerates machine learning tasks, like matrix
multiplications.
Example: Matrix Multiplication
• Suppose you want to multiply two large matrices for a deep learning task. This is a
computationally intensive operation that benefits from parallel processing.
• Matrix A and Matrix B are both large (e.g., 1000x1000 elements each).
• On a CPU, the multiplication would be computed largely sequentially, one output element at a
time, which is slow. On an NVIDIA GPU, each thread can compute a different element of the
result at the same time.
1. CUDA Cores are assigned the task of computing different elements in the resulting matrix.
2. The GPU divides the operation into many warps (groups of 32 threads).
3. Each warp performs the same operation (multiplying a row from Matrix A with a column
from Matrix B) but on different elements of the resulting matrix.
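A naive CUDA sketch of the steps above (matMul and the 16x16 block size are illustrative choices): one thread computes one element of the result matrix.

__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];   // row of A times column of B
        C[row * N + col] = sum;                       // each thread writes one output element
    }
}
// Launch for N = 1000 with 16x16 thread blocks:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matMul<<<grid, block>>>(d_A, d_B, d_C, N);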
Conditional branching in GPU
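All 32 threads of a warp execute the same instruction, so when a conditional branch makes threads in the same warp take different paths, the hardware runs each path in turn with the non-participating threads masked off; this is the warp divergence mentioned under Optimizations above. A small illustrative fragment inside a kernel (out, a, b, and i are placeholder names):

// Even- and odd-numbered threads take different branches, so the warp
// executes both paths one after the other (warp divergence).
if (threadIdx.x % 2 == 0)
    out[i] = a[i] + b[i];
else
    out[i] = a[i] - b[i];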
Memory Types:
Memory Type | Size | Speed | Access | Optimized Use
Global Memory | Large (GBs) | Slow (high latency) | All threads | Store large data shared across threads/blocks
Enhancing Parallelism
What it is: Once parallelism is detected, enhancing parallelism means modifying the code or
applying techniques that make the parallel execution more efficient. This step optimizes the code
to fully exploit multiple processors or cores.
Goal: To improve the execution performance by making the parallelization more effective and
efficient.
How it works:
• Modify the code to maximize parallel execution (e.g., splitting work into smaller chunks,
reducing overhead, handling dependencies).
• Use parallel programming techniques or tools like OpenMP, CUDA, or thread libraries to make
sure the detected parallelism is properly implemented.
Example: Using OpenMP to enhance parallelism:
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i];
}
This directive enhances parallelism by ensuring the loop runs in parallel across multiple
threads.
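A minimal complete version of this example, purely as a sketch (compile with an OpenMP flag such as gcc -fopenmp):

#include <omp.h>
#define N 1000

int main(void) {
    float A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2.0f * i; }

    // Iterations are independent, so OpenMP can split them across threads.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];

    return 0;
}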