Lec15 x86 SIMD
Overview
• SIMD
• MMX architectures
• MMX instructions
• examples
• SSE/SSE2
Performance boost
• Increasing the clock rate alone is not enough to boost performance.
• Architecture improvements (such as pipelining, caches, and SIMD) are more significant.
• Intel analyzed multimedia applications and found they share the following characteristics:
– Small native data types (8-bit pixel, 16-bit audio)
– Recurring operations
– Inherent parallelism
SIMD
• SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel.
• Example: PADDW MM0, MM1 adds the four packed 16-bit words in MM1 to those in MM0 with a single instruction.
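Not in the original slides: a portable C model of what PADDW computes, assuming the 64-bit MMX register is represented as a `uint64_t` (the helper name `paddw` is ours):

```c
#include <stdint.h>

/* Model of MMX PADDW: one 64-bit value holds four packed 16-bit lanes,
   and the "single instruction" adds all four lanes at once.
   Each lane wraps around independently (modulo 2^16). */
uint64_t paddw(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (lane * 16));
        uint16_t y = (uint16_t)(b >> (lane * 16));
        r |= (uint64_t)(uint16_t)(x + y) << (lane * 16);
    }
    return r;
}
```

The real instruction performs the four lane additions simultaneously in hardware; the loop here is only a specification of the result.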
SISD/SIMD/Streaming (figure)
IA-32 SIMD development
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extension) was introduced with the Pentium III.
• SSE2 was introduced with the Pentium 4.
• SSE3 was introduced with the Pentium 4 supporting hyper-threading technology. SSE3 adds 13 more instructions.
MMX
• After analyzing many existing applications such as graphics, MPEG, music, speech recognition, games, and image processing, Intel found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: 64-bit packed data type. Why 64 bits?
– Good enough
– Practical
MMX data types (figure)
Compatibility
• To be fully compatible with the existing IA, no new mode or state was created. Hence, for context switching, no extra state needs to be saved.
• To reach this goal, MMX is hidden behind the FPU: when floating-point state is saved or restored, MMX state is saved or restored with it.
• This allows an existing OS to perform context switching on processes executing MMX instructions without being aware of MMX.
• However, it means MMX and the FPU cannot be used at the same time, and switching between them carries a big overhead.
Compatibility
• Although Intel defended its decision to alias MMX onto the FPU for compatibility, it was arguably a bad decision: an OS could simply have been updated with a service pack.
• That is why Intel later introduced SSE without any aliasing.
MMX instructions
• 57 MMX instructions are defined to perform parallel operations on multiple data elements packed into 64-bit data types.
• These include add, subtract, multiply, compare, shift, data conversion, 64-bit data moves, 64-bit logical operations, and multiply-add for multiply-accumulate operations.
• All instructions except data moves use MMX registers as operands.
• 16-bit operations have the most complete support.
Saturation arithmetic
• Useful in graphics applications.
• When an operation overflows or underflows, the result becomes the largest or smallest representable number instead of wrapping around.
• Two types: signed and unsigned saturation.
(figure: wrap-around vs. saturating behavior)
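A sketch of the two saturation modes for a single byte lane in portable C (the helper names `addusb`/`addsb` are ours; the MMX instructions PADDUSB/PADDSB apply the same rule to eight bytes at once):

```c
#include <stdint.h>

/* Unsigned saturating byte add: clamps to 255 instead of wrapping. */
uint8_t addusb(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Signed saturating byte add: clamps to the range [-128, 127]. */
int8_t addsb(int8_t a, int8_t b) {
    int s = (int)a + b;
    if (s > 127)  return 127;
    if (s < -128) return -128;
    return (int8_t)s;
}
```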
Arithmetic
• PADDB/PADDW/PADDD: add two packed numbers; no EFLAGS are set, so you must ensure yourself that overflow never occurs.
• Multiplication takes two steps:
• PMULLW: multiplies four words and stores the four low words of the four double-word results.
• PMULHW/PMULHUW: multiplies four words and stores the four high words of the four double-word results. PMULHUW is for unsigned operands.
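One lane of the two-step multiply, modeled in portable C (helper names `pmullw`/`pmulhw` are ours): a 16×16 multiply yields a 32-bit product, of which PMULLW keeps the low word and PMULHW the high word.

```c
#include <stdint.h>

/* Low 16 bits of the signed 16x16 -> 32-bit product (one PMULLW lane). */
int16_t pmullw(int16_t a, int16_t b) {
    return (int16_t)((int32_t)a * b);
}

/* High 16 bits of the signed 16x16 -> 32-bit product (one PMULHW lane). */
int16_t pmulhw(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 16);
}
```

Keeping the high word is effectively a divide by 65536, which is why PMULHW is handy for fixed-point fractions such as the fade factor later in these slides.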
Detect MMX/SSE
mov eax, 1          ; request version info
cpuid               ; supported since Pentium
test edx, 00800000h ; bit 23 = MMX
                    ; 02000000h (bit 25) SSE
                    ; 04000000h (bit 26) SSE2
jnz HasMMX
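The detection code above only tests single bits of the EDX value returned by CPUID. The same bit tests in C (helper names are ours; the masks match the slide):

```c
#include <stdint.h>

/* Feature tests on the EDX value returned by CPUID leaf 1. */
int has_mmx (uint32_t edx) { return (edx & 0x00800000u) != 0; } /* bit 23 */
int has_sse (uint32_t edx) { return (edx & 0x02000000u) != 0; } /* bit 25 */
int has_sse2(uint32_t edx) { return (edx & 0x04000000u) != 0; } /* bit 26 */
```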
Example: add a constant to a vector
char d[]={5, 5, 5, 5, 5, 5, 5, 5};
char clr[]={65,66,68,...,87,88}; // 24 bytes
__asm{
movq mm1, d        ; mm1 = eight copies of the constant 5
mov ecx, 3         ; 24 bytes / 8 bytes per iteration (LOOP uses ECX in 32-bit mode)
mov esi, 0
L1: movq mm0, clr[esi]
    paddb mm0, mm1
    movq clr[esi], mm0
    add esi, 8
    loop L1
    emms
}
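A scalar C equivalent of the loop above (function name `add_const` is ours), useful for checking the MMX version; each PADDB iteration handles 8 of these bytes at once:

```c
#include <stdint.h>
#include <stddef.h>

/* Add a constant to every byte of an array.
   Like PADDB, the addition wraps modulo 256. */
void add_const(uint8_t *clr, size_t n, uint8_t d) {
    for (size_t i = 0; i < n; i++)
        clr[i] = (uint8_t)(clr[i] + d);
}
```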
Comparison
• No EFLAGS are set — how many flags would you need for packed data? Instead, results are stored in the destination register as per-element masks.
• Only EQ/GT comparisons exist; there is no LT.
Change data types
• Pack: converts a larger data type to the next smaller data type.
• Unpack: takes two operands and interleaves them. It can be used to expand a data type for intermediate calculations.
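The saturation applied by pack can be sketched for a single lane in portable C (helper name `pack_sat_byte` is ours; PACKSSWB applies this word-to-byte conversion to eight words at once):

```c
#include <stdint.h>

/* Convert one signed word to a signed byte with saturation,
   as pack-with-signed-saturation does per lane. */
int8_t pack_sat_byte(int16_t w) {
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}
```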
Pack with signed saturation (figures)
Unpack low portion (figures)
Unpack high portion (figure)
Keys to SIMD programming
• Efficient data layout
• Elimination of branches
Application: frame difference
Goal: compute |A-B| for two frames A and B (figures in the original slides). With unsigned saturating subtraction, (A-B) OR (B-A) equals |A-B|, because whichever difference would go negative saturates to zero.
MOVQ mm1, A      // move 8 pixels of image A
MOVQ mm2, B      // move 8 pixels of image B
MOVQ mm3, mm1    // mm3 = A
PSUBUSB mm1, mm2 // mm1 = A-B (unsigned saturating subtract)
PSUBUSB mm2, mm3 // mm2 = B-A (unsigned saturating subtract)
POR mm1, mm2     // mm1 = |A-B|
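The branch-free absolute-difference trick, modeled per byte in portable C (helper names `subusb`/`absdiff` are ours): one of the two saturating differences is always 0, so OR-ing them yields |a-b|.

```c
#include <stdint.h>

/* Unsigned saturating subtract: negative results clamp to 0. */
uint8_t subusb(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a-b| without branches in the SIMD version:
   (a (-sat) b) | (b (-sat) a), since one side is always zero. */
uint8_t absdiff(uint8_t a, uint8_t b) {
    return subusb(a, b) | subusb(b, a);
}
```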
Example: image fade-in-fade-out
Blend two images A and B: A*α + B*(1-α) = B + α(A-B). (Figures show the result for α = 0.75, 0.5, 0.25.)
Example: image fade-in-fade-out
• Two memory formats: planar and chunky
• In chunky format (interleaved R G B A), 16 of every 64 bits are wasted
• So, we use the planar format in the following example (figure: source images A and B)
Example: image fade-in-fade-out
MOVQ mm0, alpha    //4 zero-padded 16-bit copies of α
MOVD mm1, A        //move 4 pixels of image A
MOVD mm2, B        //move 4 pixels of image B
PXOR mm3, mm3      //clear mm3 to all zeroes
//unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3 // because A-B could be
PUNPCKLBW mm2, mm3 // negative, we need 16 bits
PSUBW mm1, mm2     //(A-B)
PMULHW mm1, mm0    //(A-B)*α, keeping the high word
PADDW mm1, mm2     //(A-B)*α + B
//pack four words back to four bytes
PACKUSWB mm1, mm3
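A scalar sketch of the blend for one pixel (function name `blend` is ours). For clarity it uses 8-bit fixed point (alpha256 = 256 means α = 1.0); the MMX version instead keeps the high 16 bits of a 16×16 product via PMULHW, which is the same idea at higher precision:

```c
#include <stdint.h>

/* One pixel of the fade: out = B + alpha*(A-B),
   with alpha as an 8-bit fixed-point fraction (0..256). */
uint8_t blend(uint8_t a, uint8_t b, int alpha256) {
    int diff = (int)a - (int)b;          /* widened, like PUNPCKLBW */
    return (uint8_t)(b + ((diff * alpha256) >> 8));
}
```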
Data-independent computation
• Each operation can execute without needing to know the results of a previous operation.
• Example: sprite overlay
for i=1 to sprite_Size
    if sprite[i]=clr
        then out_color[i]=bg[i]
        else out_color[i]=sprite[i]
Application: sprite overlay
MOVQ mm0, sprite  ; mm0 = 4 sprite pixels (words)
MOVQ mm2, mm0     ; keep a copy of the sprite
MOVQ mm4, bg      ; mm4 = 4 background pixels
MOVQ mm1, clr     ; mm1 = transparent color, replicated
PCMPEQW mm0, mm1  ; mask: FFFF where sprite = clr
PAND mm4, mm0     ; keep bg where transparent
PANDN mm0, mm2    ; keep sprite elsewhere
POR mm0, mm4      ; combine the two halves
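The branch elimination in the PCMPEQW/PAND/PANDN/POR sequence, modeled for one element in portable C (function name `select_pixel` is ours): the compare builds an all-ones or all-zeros mask, and the mask then selects between background and sprite with pure bitwise operations.

```c
#include <stdint.h>

/* Branchless select: bg where sprite equals the transparent color clr,
   sprite elsewhere. The SIMD version does 4 words at once. */
uint16_t select_pixel(uint16_t sprite, uint16_t bg, uint16_t clr) {
    uint16_t mask = (sprite == clr) ? 0xFFFF : 0x0000;        /* PCMPEQW */
    return (uint16_t)((bg & mask) | (sprite & (uint16_t)~mask)); /* PAND/PANDN/POR */
}
```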
Application: matrix transpose (figure)
Application: matrix transpose
char M1[4][8]; // matrix to be transposed
char M2[8][4]; // transposed matrix
int n=0;
for (int i=0;i<4;i++)
  for (int j=0;j<8;j++)
    { M1[i][j]=n; n++; }
__asm{
//move the 4 rows of M1 into MMX registers
movq mm1, M1
movq mm2, M1+8
movq mm3, M1+16
movq mm4, M1+24
Application: matrix transpose
//generate rows 1 to 4 of M2
punpcklbw mm1, mm2
punpcklbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 2 & row 1
punpckhwd mm0, mm3 //mm0 has row 4 & row 3
movq M2, mm1
movq M2+8, mm0
Application: matrix transpose
//generate rows 5 to 8 of M2
movq mm1, M1 //get row 1 of M1
movq mm3, M1+16 //get row 3 of M1
punpckhbw mm1, mm2
punpckhbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 6 & row 5
punpckhwd mm0, mm3 //mm0 has row 8 & row 7
//save results to M2
movq M2+16, mm1
movq M2+24, mm0
emms
} //end
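A reference transpose in plain C (function name `transpose4x8` is ours), useful for verifying the punpck-based MMX version against the same test matrix the slides initialize:

```c
/* Transpose a 4x8 byte matrix into an 8x4 one, element by element.
   The MMX version achieves the same result with byte/word unpacks. */
void transpose4x8(char M1[4][8], char M2[8][4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++)
            M2[j][i] = M1[i][j];
}
```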
Performance boost (data from 1996)
Benchmark kernels: FFT, FIR, vector dot-product, IDCT, motion compensation. (figure)
Link ASM and HLL programs
• Assembly is rarely used to develop the entire program.
• Use a high-level language for overall project development
– Relieves the programmer from low-level details
• Use assembly language code to
– Speed up critical sections of code
– Access nonstandard hardware devices
– Write platform-specific code
– Extend the HLL's capabilities
General conventions
• Considerations when calling assembly language procedures from high-level languages:
– Both must use the same naming convention (rules regarding the naming of variables and procedures)
– Both must use the same memory model, with compatible segment names
– Both must use the same calling convention
Inline assembly code
• Assembly language source code that is inserted directly into a HLL program.
• Compilers such as Microsoft Visual C++ and Borland C++ have compiler-specific directives that identify inline ASM code.
• Efficient: inline code executes quickly because CALL and RET instructions are not required.
• Simple to code because there are no external names, memory models, or naming conventions involved.
• Decidedly not portable because it is written for a single platform.
__asm directive in Microsoft Visual C++
• Can be placed at the beginning of a single statement
• Or, it can mark the beginning of a block of assembly language statements
• Syntax:
__asm statement
__asm {
    statement-1
    statement-2
    ...
    statement-n
}
Intrinsics
• An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.
• The compiler manages things that the user would normally have to be concerned with, such as register names, register allocation, and memory locations of data.
• Intrinsic functions are inherently more efficient than called functions because no calling linkage is required. But they are not necessarily as efficient as hand-written assembly.
• Naming: _mm_<opcode>_<suffix>, where the suffix is e.g. ps (packed single-precision) or ss (scalar single-precision).
Intrinsics
#include <xmmintrin.h>
__m128 a, b, c;
c = _mm_add_ps(a, b);

// a = b * c + d / e;
__m128 a = _mm_add_ps( _mm_mul_ps(b, c),
                       _mm_div_ps(d, e) );
SSE
• Adds eight 128-bit registers
• Allows SIMD operations on packed single-precision floating-point numbers
• Most SSE instructions require 16-byte-aligned addresses
SSE features
• Adds eight 128-bit data registers (XMM registers) in non-64-bit modes; sixteen XMM registers are available in 64-bit mode.
• 32-bit MXCSR register (control and status)
• Adds a new data type: 128-bit packed single-precision floating-point (4 FP numbers)
• Instructions to perform SIMD operations on 128-bit packed single-precision FP, plus additional 64-bit SIMD integer operations
• Instructions that explicitly prefetch data, and control data cacheability and ordering of stores
SSE programming environment (figure: XMM0-XMM7 alongside the MMX/FPU registers MM0-MM7)
MXCSR control and status register (figure)
Exception
_MM_ALIGN16 float test1[4] = { 0, 0, 0, 1 };
_MM_ALIGN16 float test2[4] = { 1, 2, 3, 0 };
_MM_ALIGN16 float out[4];
// without the next line, the result is 1.#INF rather than an exception
_MM_SET_EXCEPTION_MASK(0); // enable exceptions
__try {
    __m128 a = _mm_load_ps(test1);
    __m128 b = _mm_load_ps(test2);
    a = _mm_div_ps(a, b);
    _mm_store_ps(out, a);
}
__except(EXCEPTION_EXECUTE_HANDLER) {
    if (_mm_getcsr() & _MM_EXCEPT_DIV_ZERO)
        cout << "Divide by zero" << endl;
    return;
}
SSE packed FP operation (figure: e.g., ADDPS operates on all four elements)
SSE scalar FP operation (figure: e.g., ADDSS operates only on the lowest element)
SSE2 features
• Adds new data types (e.g., 128-bit packed double-precision FP) and instructions for them
SSE shuffle (SHUFPS) (figure)
Example (cross product)
Vector cross(const Vector& a , const Vector& b ) {
return Vector(
( a[1] * b[2] - a[2] * b[1] ) ,
( a[2] * b[0] - a[0] * b[2] ) ,
( a[0] * b[1] - a[1] * b[0] ) );
}
Example (cross product)
/* cross */
__m128 _mm_cross_ps( __m128 a , __m128 b ) {
__m128 ea , eb;
// set to a[1][2][0][3] , b[2][0][1][3]
ea = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,0,2,1) );
eb = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,1,0,2) );
// multiply
__m128 xa = _mm_mul_ps( ea , eb );
// set to a[2][0][1][3] , b[1][2][0][3]
a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,1,0,2) );
b = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,0,2,1) );
// multiply
__m128 xb = _mm_mul_ps( a , b );
// subtract
return _mm_sub_ps( xa , xb );
}
Example: dot product
• Given a set of vectors {v1,v2,…,vn} = {(x1,y1,z1), (x2,y2,z2),…, (xn,yn,zn)} and a vector vc=(xc,yc,zc), calculate {vc·vi}
• Two options for memory layout:
• Array of structures (AoS)
typedef struct { float dc, x, y, z; } Vertex;
Vertex v[n];
• Structure of arrays (SoA)
typedef struct { float x[n], y[n], z[n]; } VerticesList;
VerticesList v;
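As a cross-check for the asm that follows, the SoA dot products can be written in portable scalar C (function name `dot_soa` is ours). The unit-stride access to each array is exactly what lets mulps/addps process four vectors per iteration:

```c
/* Dot products of n vectors (SoA layout) against (xc, yc, zc).
   The SSE version computes four out[i] values at a time. */
void dot_soa(const float *x, const float *y, const float *z,
             float xc, float yc, float zc, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = x[i]*xc + y[i]*yc + z[i]*zc;
}
```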
Example: dot product (AoS)
movaps xmm0, v         ; xmm0 = DC, x0, y0, z0
movaps xmm1, vc        ; xmm1 = DC, xc, yc, zc
mulps  xmm0, xmm1      ; xmm0 = DC, x0*xc, y0*yc, z0*zc
movhlps xmm1, xmm0     ; xmm1 = DC, DC, DC, x0*xc
                       ; (movhlps: DEST[63..0] := SRC[127..64])
addps  xmm1, xmm0      ; xmm1 = DC, DC, DC, x0*xc+z0*zc
movaps xmm2, xmm0
shufps xmm2, xmm2, 55h ; xmm2 = DC, DC, DC, y0*yc
addps  xmm1, xmm2      ; xmm1 = DC, DC, DC, x0*xc+y0*yc+z0*zc
Example: dot product (SoA)
; X = x1,x2,x3,x4
; Y = y1,y2,y3,y4
; Z = z1,z2,z3,z4
; A = xc,xc,xc,xc
; B = yc,yc,yc,yc
; C = zc,zc,zc,zc
movaps xmm0, X ; xmm0 = x1,x2,x3,x4
movaps xmm1, Y ; xmm1 = y1,y2,y3,y4
movaps xmm2, Z ; xmm2 = z1,z2,z3,z4
mulps xmm0, A  ; xmm0 = x1*xc,x2*xc,x3*xc,x4*xc
mulps xmm1, B  ; xmm1 = y1*yc,y2*yc,y3*yc,y4*yc
mulps xmm2, C  ; xmm2 = z1*zc,z2*zc,z3*zc,z4*zc
addps xmm0, xmm1
addps xmm0, xmm2 ; xmm0 = the four dot products
Other SIMD architectures
• Graphics Processing Unit (GPU): NVIDIA 7800, 24 pipelines (8 vertex/16 fragment)
NVIDIA GeForce 8800, 2006
• Each GeForce 8800 GPU stream processor is a fully generalized, fully decoupled, scalar processor that supports IEEE 754 floating-point precision.
• Up to 128 stream processors
Cell processor
• Cell processor (IBM/Toshiba/Sony): 1 PPE (Power Processing Element) + 8 SPEs (Synergistic Processing Elements)
• An SPE is a RISC processor with 128-bit SIMD instructions for single/double precision, 128 128-bit registers, and 256KB of local store
• Used in the PS3.
Cell processor (figure)
GPUs track Moore's law better (figure)
Different programming paradigms (figure)
References
• Intel MMX for Multimedia PCs, CACM, Jan. 1997
• Chapter 11, The MMX Instruction Set, The Art of Assembly
• Chapters 9-11, IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture
• http://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html