Programming With SIMD-instructions

The document discusses programming with SIMD (Single Instruction Multiple Data) instructions. It describes several SIMD extensions including MMX, SSE, SSE2 and AVX, which allow parallel operations on packed integer or floating-point data. It provides details on the characteristics of SIMD operations, different data types, instructions, and programming techniques for utilizing SIMD, including using compiler intrinsics or inline assembly. Automatic vectorization by advanced compilers is also discussed, allowing SIMD parallelism without requiring explicit programming.

Uploaded by

Mahipal Yadav

Programming with SIMD-instructions

SIMD = Single Instruction stream, Multiple Data stream

From Flynn's taxonomy: SISD, SIMD, (MISD), MIMD

Extensions to the Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data
  data parallelism, parallel vector operations
  apply the same operation in parallel on a number of data items packed into a 64-, 128- or 256-bit vector
  also support scalar operations on integer or floating-point values

Originally designed to speed up media processing applications

can also be very useful in other types of applications

There are many different versions of SIMD extensions

MMX, SSE, SSE2, SSE3, SSE4, 3DNow!, Altivec, VIS, AVX, ...

MMX, SSE and AVX

Extensions to the IA-32 and x86-64 instruction sets for parallel SIMD operations on packed data

MMX Multimedia Extensions
  introduced with the Pentium MMX processor in 1997 (the original Pentium of 1993 did not include MMX)
  supports only integer operations

SSE Streaming SIMD Extensions
  introduced in the Pentium III, 1999
  support for single-precision floating-point operations
  SSE2 Streaming SIMD Extensions 2 was introduced in the Pentium 4, 2000
  also supports double-precision floating-point operations
  later extensions: SSE3, SSSE3, SSE4

AVX Advanced Vector Extensions
  announced in 2008, supported in the Intel Sandy Bridge processors
  extends the vector registers to 256 bits

Characteristics of SIMD operations

The SIMD extensions were designed to speed up multimedia and communication applications
  graphics and image processing
  video and audio processing
  speech compression and recognition
  can also be used for data-intensive scientific computations

Applications can benefit from SIMD processing if they have the following characteristics
  small integer or floating-point data types (8-bit pixel values or characters, 16-bit audio samples, 32-bit floating-point values)
  small, highly repetitive loops
  frequent additions, multiplications or other simple operations
  compute-intensive algorithms
  data parallelism, can operate on independent values in parallel

SIMD operation

SIMD execution
  performs an operation in parallel on an array of 2, 4, 8 or 16 values
  data parallel operation

The operation can be a
  data movement instruction
  arithmetic instruction
  logical instruction
  comparison instruction
  conversion instruction
  shuffle instruction

  Source 1:     X3        X2        X1        X0
  Source 2:     Y3        Y2        Y1        Y0
  Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0
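The lane-by-lane behaviour in the diagram can be modelled in plain C. This is only a scalar sketch of the semantics, not SIMD code: the function name simd_add_model and the lane count of 4 are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* assumed: four 32-bit values per 128-bit vector */

/* Hypothetical scalar model of one SIMD add: the same operation is applied
   to every lane, and no lane depends on the result of another lane. */
void simd_add_model(const float *src1, const float *src2, float *dst)
{
    assert(src1 != NULL && src2 != NULL && dst != NULL);
    for (size_t lane = 0; lane < LANES; ++lane)
        dst[lane] = src1[lane] + src2[lane];
}
```

A real SIMD instruction performs all iterations of this loop as one operation.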

MMX registers

8 64-bit MMX registers (MM0 to MM7)
  aliased to the x87 floating-point registers
  no stack organization
  can store 1, 2, 4 or 8 packed integer values

MMX registers can only hold data
  not memory addresses
  the general-purpose registers are used for addresses

Cannot use the x87 floating-point unit and the MMX unit at the same time
  they share the same set of registers

MMX data types

MMX instructions operate on 8-, 16-, 32- or 64-bit integer values, packed into a 64-bit field

4 MMX data types
  packed byte: 8 bytes packed into a 64-bit quantity
  packed word: 4 16-bit words packed into a 64-bit quantity
  packed doubleword: 2 32-bit doublewords packed into a 64-bit quantity
  quadword: one 64-bit quantity

  Layout from bit 63 (left) to bit 0 (right)
  packed byte:        b7  b6  b5  b4  b3  b2  b1  b0
  packed word:        w3      w2      w1      w0
  packed doubleword:  dw1             dw0
  quadword:           one 64-bit value

MMX operations are limited to integer values
  the SSE extensions also provide operations on floating-point data

MMX instructions

MMX introduced 47 new instructions for operations on packed integer data
  arithmetic: addition, subtraction, multiplication, multiply and add; also with signed and unsigned saturation
  comparison: compare for equal, compare for greater than
  conversion: packing and unpacking of data
  logical: and, or, xor, and not

The MMX instructions start with the prefix P (for Packed)
  Ex: paddb, paddw, paddd add packed bytes/words/doublewords (8/16/32-bit integers)
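The difference between wrap-around and saturating addition is easiest to see on a single 8-bit lane. The sketch below is a scalar model in plain C, not MMX code; all function names are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one PADDB lane: wrap-around (modulo 256) addition. */
uint8_t lane_add_wrap(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Model of one PADDUSB lane: unsigned saturation clamps to 255. */
uint8_t lane_add_usat(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Model of one PADDSB lane: signed saturation clamps to [-128, 127]. */
int8_t lane_add_ssat(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}

/* Apply the unsigned saturating add to all eight byte lanes of a 64-bit value. */
uint64_t paddusb_model(uint64_t x, uint64_t y)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; ++lane) {
        uint8_t a = (uint8_t)(x >> (8 * lane));
        uint8_t b = (uint8_t)(y >> (8 * lane));
        result |= (uint64_t)lane_add_usat(a, b) << (8 * lane);
    }
    return result;
}

/* Self-check: 200 + 100 = 300 wraps to 44 but saturates to 255. */
void saturation_self_check(void)
{
    assert(lane_add_wrap(200, 100) == 44);
    assert(lane_add_usat(200, 100) == 255);
    assert(lane_add_ssat(100, 100) == 127);
}
```

Saturation is usually what media code wants: a pixel that gets too bright stays white instead of wrapping around to a dark value.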

SSE

Streaming SIMD Extensions
  introduced with the Pentium III processor
  designed to speed up performance of advanced 2D and 3D graphics, motion video, videoconferencing, image processing, speech recognition, ...
  supports only single-precision floating-point operations

Parallel operations on packed single-precision floating-point values
  128-bit packed single-precision floating-point data type
  four IEEE 32-bit floating-point values packed into a 128-bit field
  data must be aligned in memory on 16-byte boundaries for aligned loads and stores (MOVAPS); MOVUPS accepts unaligned addresses
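One way to meet the 16-byte alignment requirement for heap data is C11 aligned_alloc. This is a minimal sketch assuming a C11 library that provides aligned_alloc; the helper names alloc_floats_16 and is_aligned_16 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16. */
float *alloc_floats_16(size_t n)
{
    assert(n > 0);
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);   /* release with free() */
}

/* Returns 1 if p is suitable for an aligned 128-bit load or store. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

For static or stack arrays, an alignment specifier such as C11 _Alignas(16) on the declaration serves the same purpose.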

XMM registers

SSE adds a set of new 128-bit XMM registers
  8 XMM registers (XMM0 to XMM7) in 32-bit mode
  16 XMM registers (XMM0 to XMM15) in 64-bit mode

The XMM registers are new physical registers
  not aliased to any other registers
  independent of the general-purpose and FPU/MMX registers
  can mix MMX and SSE instructions

XMM registers can be accessed in 32-bit, 64-bit or 128-bit mode
  only for operations on data, not addresses

There is also a 32-bit control and status register, MXCSR
  flag and mask bits for floating-point exceptions
  rounding control bits
  flush-to-zero bit
  denormals-are-zero bit
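MXCSR can be read and written from C through the intrinsics in xmmintrin.h. The sketch below sets the flush-to-zero bit, reads the register back and restores the old value; the function name mxcsr_with_ftz is a hypothetical helper, and an x86 compiler with SSE support is assumed.

```c
#include <assert.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */

/* Hypothetical helper: switch on flush-to-zero, return the MXCSR value
   that was active while it was on, and restore the previous state. */
unsigned int mxcsr_with_ftz(void)
{
    unsigned int saved = _mm_getcsr();             /* read MXCSR */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);    /* set the flush-to-zero bit */
    unsigned int active = _mm_getcsr();
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON);
    _mm_setcsr(saved);                             /* restore previous state */
    return active;
}
```

MXCSR is kept per thread, so a setting like this affects only the thread that makes it.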

SSE instructions
The SSE extension added 70 new instructions to the instruction set
  50 for SIMD floating-point operations
  12 for SIMD integer operations
  8 for cache control

Supports both packed and scalar single precision floating-point instructions

operations on packed 32-bit floating-point values
packed instructions have the suffix PS

operations on a scalar 32-bit floating-point value (the 32 LSB)

scalar instructions have the suffix SS

Also included some 64-bit SIMD integer instructions

  extension to MMX
  operations on packed integer values stored in MMX registers

Packed and scalar operations

Packed SSE operations apply an operation in parallel on 2 or 4 floating-point values

  Source 1:     X3        X2        X1        X0
  Source 2:     Y3        Y2        Y1        Y0
  Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0

Scalar SSE operations apply an operation on a single (scalar) floating-point value

  Source 1:     X3        X2        X1        X0
  Source 2:     Y3        Y2        Y1        Y0
  Destination:  X3        X2        X1        X0 op Y0
  (the upper lanes are copied unchanged from Source 1)

Compilers use the scalar operations for ordinary floating-point arithmetic instead of the x87 FP unit
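The two diagrams correspond to the intrinsics _mm_add_ps (ADDPS) and _mm_add_ss (ADDSS). A minimal sketch, assuming an x86 compiler with SSE support; add_packed and add_scalar are hypothetical wrapper names, and unaligned loads are used so that the callers' arrays need no special alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* ADDPS: out[i] = x[i] + y[i] for all four lanes. */
void add_packed(const float *x, const float *y, float *out)
{
    assert(x != NULL && y != NULL && out != NULL);
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(a, b));
}

/* ADDSS: only lane 0 is added; lanes 1..3 are copied from the first source. */
void add_scalar(const float *x, const float *y, float *out)
{
    assert(x != NULL && y != NULL && out != NULL);
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ss(a, b));
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, add_packed gives {11, 22, 33, 44} and add_scalar gives {11, 2, 3, 4}.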

SSE2

Streaming SIMD Extensions 2
  introduced in the Pentium 4 processor
  designed to speed up performance of advanced 3D graphics, video encoding/decoding, speech recognition, e-commerce and Internet, scientific and engineering applications

Extends MMX and SSE with support for
  packed double-precision floating-point values
  packed integer values

Adds over 70 new instructions to the instruction set

Operates on 128-bit entities in the XMM registers
  must be aligned on 16-byte boundaries when stored in memory

SSE2 data types

128-bit packed double-precision floating point
  2 IEEE double-precision floating-point values

128-bit packed byte integer
  16 byte integers (8 bits), numbered 15 down to 0 from bit 127 to bit 0

128-bit packed word integer
  8 word integers (16 bits), w7 at bit 127 down to w0 at bit 0

128-bit packed doubleword integer
  4 doubleword integers (32 bits), d3 at bit 127 down to d0 at bit 0

128-bit packed quadword integer
  2 quadword integers (64 bits)
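The same 128-bit register can be treated as any of these types; the instruction chooses the lane width. A small sketch using SSE2 intrinsics from emmintrin.h, assuming an x86 compiler with SSE2 support. The wrapper names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* PADDB on 128 bits: sixteen independent 8-bit additions (wrap-around). */
void add_bytes_16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* PADDD on 128 bits: four independent 32-bit additions. */
void add_dwords_4(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* ADDPD: two independent double-precision additions. */
void add_doubles_2(const double *a, const double *b, double *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

All integer widths share the single type __m128i; only the intrinsic name (epi8, epi16, epi32, epi64) tells how the bits are split into lanes.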

SSE instructions
MMX instructions have names composed of different fields
  a prefix P, which stands for Packed
  the operation, for example ADD, SUB or MUL
  for arithmetic operations, US (Unsigned Saturation) or S (Signed Saturation)
  a suffix describing the data type
    B Packed Byte, 8 bytes
    W Packed Word, four 16-bit words
    D Packed Doubleword, two 32-bit doublewords
    Q Quadword, one single 64-bit quadword

Example:
  PADDB Add Packed Byte Integers
  PADDSB Add Packed Signed Byte Integers with Signed Saturation

SSE2 operations on packed double-precision data have the suffix PD
  examples: ADDPD, MULPD, MAXPD, ANDPD

SSE2 operations on scalar double-precision data have the suffix SD
  examples: MOVSD, ADDSD, MULSD, MINSD

Programming with MMX and SSE

There are different ways a programmer can use SSE in a program

automatic vectorization by the compiler
  no explicit SSE programming needed, but requires a vectorizing compiler

arithmetic operations on vector data types
  declare variables of vector type
  express computations as normal arithmetic expressions

compiler intrinsic functions for SSE operations
  functions that provide access to the MMX/SSE instructions from a high-level language
  also requires detailed knowledge of MMX/SSE operation

program with inline assembly language
  very good possibilities to arrange instructions for efficient execution
  difficult to program, error-prone
  requires detailed knowledge of MMX/SSE operation and assembly language programming
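As an illustration of the inline assembly route, the sketch below issues one ADDPS instruction through GCC-style extended asm. It assumes GCC or Clang on x86 with the default AT&T assembler syntax; the function name add4_asm is hypothetical, and the loads and stores still use intrinsics to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* out[i] = x[i] + y[i] for four floats, with the addition written as
   inline assembly. The "x" constraint asks for an XMM register. */
void add4_asm(const float *x, const float *y, float *out)
{
    assert(x != NULL && y != NULL && out != NULL);
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    __asm__("addps %1, %0"     /* AT&T operand order: a = a + b */
            : "+x"(a)          /* a is read and written */
            : "x"(b));         /* b is read only */
    _mm_storeu_ps(out, a);
}
```

Even this one-line case needs knowledge of operand order and register constraints, which is the cost the list above refers to.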

Automatic vectorization
The compiler automatically recognizes loops that can be implemented with vectorized code
  very easy to use, no changes to the program code are needed

Only loops that can be analyzed and that are found to be suitable for SIMD execution are vectorized
  does not guarantee any performance improvement
  has no effect if the compiler cannot analyze the code and find the opportunities for SIMD operation

Requires a compiler with vectorizing capabilities
  in gcc, vectorization is enabled by -O3 (use -ftree-vectorizer-verbose=1 to print reports about which loops are vectorized; newer gcc releases replace this option with -fopt-info-vec)
  the Intel compiler, icc, also does advanced vectorization

    gcc -O3 -ftree-vectorizer-verbose=1 saxpy.c -o saxpy

    saxpy.c:9: note: created 1 versioning for alias checks.

    saxpy.c:9: note: LOOP VECTORIZED.
    saxpy.c:6: note: vectorized 1 loops in function.
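The contents of saxpy.c are not shown in the slides. The loop below is an assumed, conventional SAXPY (y = a*x + y) of the kind a vectorizing compiler can handle: a simple counted loop with independent iterations and no function calls.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed SAXPY kernel: y[i] = a * x[i] + y[i].
   restrict promises the compiler that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Without restrict the compiler cannot know that x and y are distinct arrays, which is one likely reason for a "versioning for alias checks" note like the one in the report above: it emits a vector version and a scalar version and picks one at run time.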

Arithmetic operations on vector data types

Declare variables of vector data types and express computations with normal arithmetic expressions
  SSE2 data types are defined in emmintrin.h

Vector data types
  four 32-bit floating-point values: __m128
  two 64-bit floating-point values: __m128d
  integer data types: __m128i

    __m128 a;   /* 4 packed float values */
    __m128 b;
    __m128 c;

    c = a + b;
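Arithmetic operators on vector types are a compiler extension, not standard C. The sketch below uses the GCC/Clang vector_size attribute to declare its own 4-float type, so it does not depend on the SSE headers; the type name v4sf and the function scale_and_add are assumptions made for illustration.

```c
#include <assert.h>

/* GCC/Clang vector extension: four floats in one 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));
static_assert(sizeof(v4sf) == 16, "v4sf should fill one 128-bit register");

/* Element-wise a * s + b, written with ordinary operators. */
v4sf scale_and_add(v4sf a, v4sf b, float s)
{
    v4sf vs = {s, s, s, s};   /* broadcast the scalar into every lane */
    return a * vs + b;
}
```

With GCC and Clang the individual lanes of the result can be read with normal subscripts, for example c[0].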

Compiler intrinsic functions

Functions for performing MMX and SSE operations on packed data
  implemented with inline assembly code
  allow the programmer to use C function calls and variables

Defines a C function for each MMX/SSE instruction
  there are also intrinsic functions composed of several MMX/SSE instructions

New data types to represent packed integer and floating-point values
  __m64 represents the contents of a 64-bit MMX register (8-, 16- or 32-bit packed integers)
  __m128 represents 4 packed single-precision floating-point values
  __m128d represents 2 packed double-precision floating-point values
  __m128i represents packed integer values (8-, 16-, 32- or 64-bit)

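C intrinsic functions

Example:
  multiply two arrays A and B of 400 single-precision floating-point values

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, ... */

    #define SIZE 400         /* a multiple of 4, so no remainder loop is needed */

    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses */
    _Alignas(16) float A[SIZE], B[SIZE], C[SIZE];

    void multiply(void)
    {
        __m128 m1, m2, m3;

        for (int i = 0; i < SIZE; i += 4) {
            m1 = _mm_load_ps(A + i);      /* load 4 floats from A */
            m2 = _mm_load_ps(B + i);      /* load 4 floats from B */
            m3 = _mm_mul_ps(m1, m2);      /* 4 multiplications at once */
            _mm_store_ps(C + i, m3);      /* store 4 results to C */
        }
    }

Register allocation and instruction scheduling are left to the compiler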
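The example relies on a length that is a multiple of 4 and on aligned arrays. When neither can be assumed, a common pattern is to use unaligned loads and finish with a scalar loop. A sketch under those assumptions; mul_arrays is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* c[i] = a[i] * b[i] for any n and any alignment. */
void mul_arrays(const float *a, const float *b, float *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {              /* full groups of 4 floats */
        __m128 va = _mm_loadu_ps(a + i);      /* unaligned load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
    for (; i < n; ++i)                        /* scalar remainder, 0 to 3 items */
        c[i] = a[i] * b[i];
}
```

Unaligned loads may be slower than aligned ones on older processors, so aligned data is still preferable where the program controls the allocation.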

Variables of vector data types have to be aligned to 16-byte boundaries

May also need to access the individual values in the packed data
  can be done by using a union structure

    union mmdata {
        __m128 m;
        float f[4];
    };
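A possible use of such a union is to add up the four lanes of a vector. The sketch below shows that, plus an alternative that stores the vector to a float array instead; both function names are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>

union mmdata {
    __m128 m;
    float  f[4];
};
static_assert(sizeof(union mmdata) == 16, "both members cover the same 16 bytes");

/* Sum the four lanes of v by reading them through the union. */
float horizontal_sum(__m128 v)
{
    union mmdata u;
    u.m = v;
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}

/* Alternative without a union: store the vector to a float array. */
float horizontal_sum_store(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Note that _mm_set_ps takes its arguments from the highest lane to the lowest, so _mm_set_ps(4, 3, 2, 1) puts 1 in lane 0, which is f[0] in the union.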
