SIMD Programming

CS 240A, 2017

Flynn* Taxonomy, 1966

• As of 2013, SIMD and MIMD are the most common kinds of parallelism in
  architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program
  Multiple Data ("SPMD")
  – Single program that runs on all processors of a MIMD
  – Cross-processor execution coordination using synchronization
    primitives
• SIMD (aka hw-level data parallelism): specialized function units for
  handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video
    processing)

*Prof. Michael Flynn, Stanford
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")

• A SIMD computer exploits multiple data streams against a single
  instruction stream to perform operations that may be naturally
  parallelized, e.g., the Intel SIMD instruction extensions or the
  NVIDIA Graphics Processing Unit (GPU)
SIMD: Single Instruction, Multiple Data

• Scalar processing (the traditional mode): one operation produces one
  result
      X + Y = X+Y
• SIMD processing (with Intel SSE / SSE2; SSE = streaming SIMD
  extensions): one operation produces multiple results
      [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0]

Slide source: Alex Klimovitski & Dean Macri, Intel Corporation
What does this mean to you?

• In addition to SIMD extensions, the processor may have other special
  instructions
  – Fused Multiply-Add (FMA) instructions:
        x = y + c * z
    is so common that some processors execute the multiply/add as a
    single instruction, at the same rate (bandwidth) as + or * alone
• In theory, the compiler understands all of this
  – When compiling, it will rearrange instructions to get a good
    "schedule" that maximizes pipelining and uses FMAs and SIMD
  – It works with the mix of instructions inside an inner loop or other
    block of code
• But in practice the compiler may need your help
  – Choose a different compiler, optimization flags, etc.
  – Rearrange your code to make things more obvious
  – Use special functions ("intrinsics") or write in assembly
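As a sketch of the FMA idea (my addition, not from the slides): C99's
fma() in <math.h> expresses a fused multiply-add directly, and compilers
can map it to a hardware FMA instruction when one is available:

    #include <math.h>

    /* A minimal sketch: y + c*z computed as a fused multiply-add.
       With hardware FMA (e.g., compiling with -mfma on x86), the
       compiler can emit a single instruction for this. */
    double axpy_term(double y, double c, double z) {
        return fma(c, z, y);  /* c*z + y with a single rounding */
    }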
Intel SIMD Extensions

• MMX: 64-bit registers, reusing the floating-point registers [1996]
• SSE/SSE2/3/4: eight new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
SSE / SSE2 SIMD on Intel

• SSE2 data types: anything that fits into 16 bytes, e.g.,
  – 4x floats
  – 2x doubles
  – 16x bytes
• Instructions perform add, multiply, etc. on all the data in parallel
• Similar on GPUs and vector processors (but with many more
  simultaneous operations)
Intel Architecture SSE2+ 128-Bit SIMD Data Types

• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: double word (32 bits)
  – Double-precision FP: quad word (64 bits)
• A 128-bit register can hold:
  – 16 packed bytes (8 bits each)
  – 8 packed words (16 bits each)
  – 4 packed double words (32 bits each)
  – 2 packed quad words (64 bits each)
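As an illustrative sketch (my addition, not from the slides), the SSE
intrinsic header types expose these views of the same 128 bits in C:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* The same 128-bit XMM register, viewed three ways: */
    __m128  f;  /* 4 packed single-precision floats */
    __m128d d;  /* 2 packed double-precision floats */
    __m128i i;  /* packed integers: 16 bytes, 8 words,
                   4 double words, or 2 quad words   */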
Packed and Scalar Double-Precision Floating-Point Operations

[Figure: a packed operation applies to all elements of the 128-bit
operands in parallel; a scalar operation applies only to the low-order
element.]
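A small sketch (my addition) contrasting packed and scalar adds with
SSE2 intrinsics; the function name is illustrative:

    #include <emmintrin.h>

    void packed_vs_scalar(double out_p[2], double out_s[2]) {
        /* x = [2.0 | 1.0], y = [20.0 | 10.0]
           (_mm_set_pd takes the high element first) */
        __m128d x = _mm_set_pd(2.0, 1.0);
        __m128d y = _mm_set_pd(20.0, 10.0);

        __m128d p = _mm_add_pd(x, y);  /* packed: [22.0 | 11.0] */
        __m128d s = _mm_add_sd(x, y);  /* scalar: [ 2.0 | 11.0] -- only
                                          the low elements are added; the
                                          high element is copied from x */
        _mm_storeu_pd(out_p, p);
        _mm_storeu_pd(out_s, s);
    }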
SSE/SSE2 Floating Point Instructions

• Move instructions do both load and store
  – xmm: one operand is a 128-bit SSE2 register
  – mem/xmm: the other operand is in memory or an SSE2 register
• Instruction suffixes:
  – {SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
  – {PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
  – {SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
  – {PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
  – {A} means the 128-bit operand is aligned in memory
  – {U} means the 128-bit operand is unaligned in memory
  – {H} means move the high half of the 128-bit operand
  – {L} means move the low half of the 128-bit operand
Example: SIMD Array Processing

    for each f in array
        f = sqrt(f)

Scalar style:

    for each f in array
    {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }

SIMD style:

    for each 4 members in array
    {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        store the 4 results from the register to memory
    }
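A concrete sketch of the SIMD version with SSE intrinsics (my
illustration; assumes n is a multiple of 4 and a is 16-byte aligned):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Square-root each element of a, 4 floats at a time.
       Assumes n is a multiple of 4 and a is 16-byte aligned. */
    void sqrt_array(float *a, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(a + i);  /* load 4 members         */
            v = _mm_sqrt_ps(v);             /* 4 square roots at once */
            _mm_store_ps(a + i, v);         /* store the 4 results    */
        }
    }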
Data-Level Parallelism and SIMD

• SIMD wants adjacent values in memory that can be operated on in
  parallel
• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a
  single iteration of a loop?
• Unroll the loop and adjust the iteration rate
Loop Unrolling in C

• Instead of the compiler doing loop unrolling, you could do it
  yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
Generalizing Loop Unrolling

• A loop of n iterations, unrolled with k copies of the body of the loop
• Assuming (n mod k) ≠ 0:
  – run the loop with 1 copy of the body (n mod k) times
  – then run it with k copies of the body floor(n/k) times
General Loop Unrolling with a Head

• Handling loop iteration counts not divisible by the step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1003; i>1000; i--)  // handle the head (1003 mod 4 = 3 iterations)
        x[i] = x[i] + s;

    for(i=1000; i>0; i=i-4) { // handle the other iterations
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
Tail Method for General Loop Unrolling

• Handling loop iteration counts not divisible by the step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten (1003 % 4 == 3)
    for(i=1003; i > 1003 % 4; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
    for(i = 1003 % 4; i>0; i--)  // special handling in the tail
        x[i] = x[i] + s;
Another Loop Unrolling Example

Normal loop:

    int x;
    for (x = 0; x < 103; x++) {
        delete(x);
    }

After loop unrolling:

    int x;
    for (x = 0; x < 103/5*5; x += 5) {
        delete(x);
        delete(x + 1);
        delete(x + 2);
        delete(x + 3);
        delete(x + 4);
    }
    /* Tail */
    for (x = 103/5*5; x < 103; x++) {
        delete(x);
    }
Intel SSE Intrinsics

• Intrinsics are C functions and procedures for inserting assembly
  language into C code, including SSE instructions

  Intrinsics:                Corresponding SSE instructions:

• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd              MOVAPD / aligned, packed double
    _mm_store_pd             MOVAPD / aligned, packed double
    _mm_loadu_pd             MOVUPD / unaligned, packed double
    _mm_storeu_pd            MOVUPD / unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd             MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd               ADDPD / add, packed double
    _mm_mul_pd               MULPD / multiply, packed double
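As a sketch tying these intrinsics together (my addition; the function
name and the assumptions that n is even and the arrays are 16-byte
aligned are illustrative):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* c[i] = a[i] + b[i], two doubles at a time.
       Assumes n is even and all arrays are 16-byte aligned. */
    void add_arrays(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(a + i);         /* MOVAPD load   */
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(c + i, _mm_add_pd(va, vb)); /* ADDPD + store */
        }
    }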
Example 1: Use of SSE SIMD Instructions

• Scalar version:
    for (i = 0; i < n; i++) sum = sum + a[i];

• SIMD version:
    Set 128-bit temp = 0;
    for (i = 0; i < n/4*4; i = i+4) {
        Add the 4 integers in the 128 bits at &a[i] to temp;
    }
    Tail: copy out the 4 integers of temp and add them together into sum.
    for (i = n/4*4; i < n; i++) sum += a[i];
Related SSE SIMD Instructions

__m128i _mm_setzero_si128()                  returns a 128-bit zero vector

__m128i _mm_loadu_si128(__m128i *p)          loads the 128 bits of data at
                                             pointer p into a vector and
                                             returns it

__m128i _mm_add_epi32(__m128i a, __m128i b)  returns the vector
                                             (a0+b0, a1+b1, a2+b2, a3+b3)

void _mm_storeu_si128(__m128i *p, __m128i a) stores the content of the
                                             128-bit vector a to memory
                                             starting at pointer p
Related SSE SIMD Instructions (continued)

• Add the 4 integers in the 128 bits at &a[i] to the temp vector, with
  loop body temp = temp + a[i..i+3]
• Add 128 bits, then the next 128 bits, ...

    __m128i temp = _mm_setzero_si128();
    __m128i temp1 = _mm_loadu_si128((__m128i *)(a + i));
    temp = _mm_add_epi32(temp, temp1);
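Putting the pieces of Example 1 together, a complete sketch (my
addition; the function name is illustrative, and unaligned loads avoid
any alignment assumption):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    int sum_array(const int *a, int n) {
        __m128i temp = _mm_setzero_si128();   /* 4 partial sums = 0 */
        int i;
        for (i = 0; i < n/4*4; i += 4) {
            __m128i t1 = _mm_loadu_si128((const __m128i *)(a + i));
            temp = _mm_add_epi32(temp, t1);   /* 4 adds in parallel */
        }
        int partial[4];
        _mm_storeu_si128((__m128i *)partial, temp); /* copy out 4 sums */
        int sum = partial[0] + partial[1] + partial[2] + partial[3];
        for (; i < n; i++)                    /* tail */
            sum += a[i];
        return sum;
    }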
Example 2: 2 x 2 Matrix Multiply

Definition of matrix multiply:

    Ci,j = (A×B)i,j = Σ (k = 1 to 2) Ai,k × Bk,j

    A1,1 A1,2     B1,1 B1,2     C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2
    A2,1 A2,2  x  B2,1 B2,2  =  C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2

Numeric example:

    1 0     1 3     C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3
    0 1  x  2 4  =  C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4
Example: 2 x 2 Matrix Multiply

• Using the XMM registers: 64-bit double precision, two doubles per XMM
  register
  – C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2]  (C stored in memory in
    column order)
  – A  = [A1,i | A2,i]  (column i of A)
  – B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2]  (elements of row i of B,
    duplicated)
Example: 2 x 2 Matrix Multiply

• Initialization
  – C1 = [0 | 0], C2 = [0 | 0]

• i = 1
  – A  = [A1,1 | A2,1]   _mm_load_pd: loads 2 doubles into an XMM
    register (A is stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]   _mm_load1_pd: SSE
    instruction that loads a double word and stores it in the high and
    low double words of the XMM register (duplicates the value in both
    halves of the XMM register)
Example: 2 x 2 Matrix Multiply

• First iteration intermediate result (i = 1)
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1]   c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
  – C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]   c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  – The SSE instructions first do the parallel multiplies and then the
    parallel adds in XMM registers
Example: 2 x 2 Matrix Multiply

• i = 2
  – A  = [A1,2 | A2,2]   _mm_load_pd (stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]   _mm_load1_pd (duplicates
    the value in both halves of the XMM register)
Example: 2 x 2 Matrix Multiply

• Second iteration intermediate result (i = 2)
  – C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
         c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
  – C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]
         c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
Example: 2 x 2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
    */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

    /* B = (note column order!)
       1 3
       2 4
    */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

    /* C = (note column order!)
       0 0
       0 0
    */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2 x 2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C + 0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C + 1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
        */
        a = _mm_load_pd(A + i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
        */
        b1 = _mm_load1_pd(B + i + 0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
        */
        b2 = _mm_load1_pd(B + i + 1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
        */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
        */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C + 0*lda, c1);
    _mm_store_pd(C + 1*lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
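For reference (my addition, not from the slides): SSE2 is part of the
base x86-64 ISA, so something like gcc -O2 example.c should build this
on a 64-bit x86 system; 32-bit builds may need -msse2. With A set to
the identity matrix, C = B, and the program prints 1,3 then 2,4.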
Conclusion

• Flynn Taxonomy
• Intel SSE SIMD instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE instructions in C
  – Embed the SSE machine instructions directly into C programs through
    the use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler
