XX-BSC Compact Vector Processing

Vector processors provide improved performance over superscalar processors for workloads involving linear algebra operations on arrays by mapping loops to vector instructions. They reduce fetch/decode bandwidth and avoid data hazards between elements of the same vector. The basic architecture includes vector and scalar units, with vector registers and functional units. Vector instructions are executed through convoys that avoid structural hazards. Performance is measured by total execution time and MFLOPS. Streaming SIMD Extensions provide vector-like instructions to x86 processors and are useful for multimedia, scientific and financial applications. The Intel Larrabee architecture combines aspects of CPUs and GPUs through its use of in-order cores with 512-bit vector processing units.

Uploaded by

mheba11

Vector Processors

Kavitha Chandrasekar
Sreesudhan Ramkumar
Agenda
• Why Vector processors
• Basic Vector Architecture
• Vector Execution time
• Vector load - store units and Vector memory
systems
• Vector length Control
• Vector stride
Limitations of ILP
• ILP:
– Increase in instruction width (superscalar)
– Increase in machine pipeline depth
– Hence, Increase in number of in-flight instructions
• Need for increase in hardware structures like
ROB, rename register files
• Need to increase logic to track dependences
• Even in VLIW, increase in hardware and logic is
required
Vector Processor
• Work on linear arrays of numbers (vectors)
• Each iteration of a loop becomes one element of the vector
• Overcoming limitations of ILP:
– Dramatic reduction in fetch and decode bandwidth.
– No data hazard between elements of the same vector.
– Data hazard logic is required only between two vector
instructions
– Memory is heavily interleaved across banks, so the higher latency of
initiating a memory access (versus a cache access) is amortized over
the whole vector.
– Since loops are reduced to vector instructions, the control hazards
of the loop branch disappear
– Good performance even with poor locality
Basic Architecture
• Vector and Scalar units
• Types:
– Vector-register processors
– Memory-memory Vector
processors
• Vector Units
– Vector registers (with 2
read and 1 write ports)
– Vector functional units
(fully pipelined)
– Vector load-store unit (fully
pipelined)
– Set of scalar registers
VMIPS vector instructions
MIPS vs VMIPS
(DAXPY loop)
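The side-by-side listing for this slide did not survive. As a rough sketch of the contrast, DAXPY computes Y = a × X + Y; the scalar version issues one multiply-add per loop iteration, while the vector version needs only a handful of whole-vector instructions. The function names are hypothetical, and the mnemonics in the comments follow the VMIPS convention of the cited textbook rather than anything shown in these notes.

```python
def daxpy_scalar(a, x, y):
    """Scalar-style loop: one multiply-add (plus loop overhead) per element."""
    result = list(y)
    for i in range(len(x)):
        result[i] = a * x[i] + y[i]
    return result


def daxpy_vector(a, x, y):
    """Vector-style: each statement models one whole-vector instruction."""
    v1 = list(x)                            # LV      V1, Rx
    v2 = [a * e for e in v1]                # MULVS.D V2, V1, F0
    v3 = list(y)                            # LV      V3, Ry
    v4 = [p + q for p, q in zip(v2, v3)]    # ADDV.D  V4, V2, V3
    return v4                               # SV      Ry, V4
```

Both functions return the same values; the point is that the vector form fetches and decodes five instructions regardless of the vector length.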
Execution time of vector instructions
• Factors:
– length of operand vectors
– structural hazards among operations
– data dependences
• Overhead:
– initiating multiple vector instructions in a clock cycle
– Start-up overhead (more details soon)
Vector Execution time (contd.)
• Terms:
– Convoy: a set of vector instructions that can begin execution
together in one clock period
– Instructions in a convoy must not contain any structural or
data hazards
– Analogous to placing scalar instructions in a VLIW word
– One convoy must finish before another begins
– Chime: unit of time taken to execute one convoy
• Hence a vector sequence of m convoys executes in m chimes
• Hence, for a vector length of n, time ≈ m × n clock cycles
(ignoring start-up overhead)
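The convoy rules above can be sketched as a greedy grouping pass. This is a simplified model under stated assumptions: each instruction is described by a hypothetical tuple (name, functional unit, destination, sources), there is one unit of each kind, there is no chaining, and only read-after-write hazards are checked.

```python
def partition_into_convoys(instructions):
    """Group instructions, in program order, into convoys.

    A new convoy starts when the instruction's unit is already busy in the
    current convoy (structural hazard) or when one of its sources is written
    in the current convoy (data hazard, since chaining is not modeled).
    """
    convoys, busy_units, written = [], set(), set()
    for name, unit, dest, sources in instructions:
        hazard = unit in busy_units or any(s in written for s in sources)
        if hazard or not convoys:
            convoys.append([])
            busy_units, written = set(), set()
        convoys[-1].append(name)
        busy_units.add(unit)
        if dest is not None:
            written.add(dest)
    return convoys


def chime_time(num_convoys, vector_length):
    """m convoys on vectors of length n take about m * n cycles."""
    return num_convoys * vector_length


# DAXPY written with VMIPS-style mnemonics (illustrative encoding).
DAXPY = [
    ("LV V1,Rx", "load_store", "V1", []),
    ("MULVS.D V2,V1,F0", "multiply", "V2", ["V1"]),
    ("LV V3,Ry", "load_store", "V3", []),
    ("ADDV.D V4,V2,V3", "add", "V4", ["V2", "V3"]),
    ("SV Ry,V4", "load_store", None, ["V4"]),
]
```

Under these assumptions `partition_into_convoys(DAXPY)` yields four convoys, with the multiply and the second load sharing one, so a 64-element DAXPY takes about `chime_time(4, 64)` = 256 cycles before start-up overhead.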
Example: convoys
Start-up overhead
• Start-up time: time between the initiation of
the instruction and the time the first result
emerges from the pipeline
• Once the pipeline is full, a result is produced every
cycle.
• If vector lengths were infinite, the start-up
overhead would be fully amortized
• But for finite vector lengths, it adds significant
overhead
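A minimal sketch of the amortization argument, assuming a single fully pipelined vector instruction that delivers one result per cycle once the first result has emerged. The function name and the sample numbers are illustrative only.

```python
def cycles_per_element(startup_cycles, vector_length):
    """Average cost per element: (start-up + n) / n."""
    total_cycles = startup_cycles + vector_length
    return total_cycles / vector_length
```

With a 12-cycle start-up, a 64-element vector costs 76 / 64 ≈ 1.19 cycles per element, an 8-element vector costs 2.5, and the cost approaches 1 as the vector grows.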
Startup overhead-example
Vector Load-Store Units and Vector
Memory Systems
• Start-up time: Time to get first word from
memory into a register
• To deliver a word every clock after start-up, multiple
memory banks are used
• Need for multiple memory banks in vector
processors:
– Many vector processors allow multiple loads and
stores per clock cycle
– Support for nonsequential access
– Support for sharing of system memory by multiple
processors
Example
• Number of memory banks required:
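The worked numbers for this slide are missing. A common way to estimate the requirement, offered here as an assumption rather than the slide's own calculation, is that each bank stays busy for a whole number of processor clocks, so the system needs at least (memory references per clock) × (bank busy time in clocks) banks.

```python
import math


def banks_required(accesses_per_clock, bank_busy_ns, clock_ns):
    """Lower bound on banks needed to sustain the given access rate."""
    busy_clocks = math.ceil(bank_busy_ns / clock_ns)
    return accesses_per_clock * busy_clocks
```

For instance, with illustrative figures of 32 processors each issuing 4 loads and 2 stores per clock, a 2.167 ns clock, and a 15 ns bank cycle time, `banks_required(32 * 6, 15, 2.167)` gives 192 × 7 = 1344 banks.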
Real world issues
• The vector length in a program is not always
a fixed value (say, 64)
• Need to access nonadjacent elements in
memory
• Solutions:
– Vector length Control
– Vector Stride
Vector Length Control
• Example:

• Here the value of ‘n’ might be known only at run time.
• If n is a parameter to a procedure, it can change from
call to call at run time
• Hence, VLR (Vector Length Register) is used to control
the length of a vector operation at run time
• MVL (Maximum Vector Length) holds the maximum
length of a vector operation (processor dependent)
Vector Length Control(contd.)
• Strip mining:
– Used when a vector operation is longer than MVL: the
loop is split into strips of at most MVL elements
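A minimal sketch of strip mining, assuming the common scheme in which the odd-sized strip of n mod MVL elements runs first and every later strip is a full MVL long. The helper names are hypothetical; the `vl` variable plays the role of VLR.

```python
def strip_mine(n, mvl):
    """Yield (start, length) strips covering elements 0..n-1.

    The first strip holds the n mod mvl leftover elements (skipped when
    that is zero); every later strip is exactly mvl long.
    """
    start = 0
    leftover = n % mvl
    if leftover:
        yield start, leftover
        start += leftover
    while start < n:
        yield start, mvl
        start += mvl


def daxpy_strip_mined(a, x, y, mvl=64):
    """DAXPY over arbitrary n using vector operations of length <= mvl."""
    out = list(y)
    for start, vl in strip_mine(len(x), mvl):    # vl would be written to VLR
        for i in range(start, start + vl):       # one vector op of length vl
            out[i] = a * x[i] + y[i]
    return out
```

For n = 200 and MVL = 64 this produces strips of 8, 64, 64 and 64 elements.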
Execution time due to strip mining
• Key factors that contribute to the running time of a strip-mined
loop consisting of a sequence of convoys:
1. Number of convoys in the loop, which determines the number of
chimes.
2. Overhead for each strip-mined sequence of convoys. This
overhead consists of the cost of executing the scalar code for
strip-mining each block, plus the vector start-up cost for each
convoy.
• Total running time for a vector sequence operating on a vector of
length n, Tn:
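The formula itself is missing from these notes. The form given in the cited textbook, reproduced here as an assumption about what the slide showed, combines the per-strip overhead with the per-element chime cost:

```latex
T_n = \left\lceil \frac{n}{\mathrm{MVL}} \right\rceil \times \left( T_{\mathrm{loop}} + T_{\mathrm{start}} \right) + n \times T_{\mathrm{chime}}
```

Here the ceiling term counts the strips, each strip pays the loop and start-up overhead once, and every element costs one cycle per chime.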
Example
Vector Stride
• Handles access to nonadjacent elements in memory
• Example:

• This loop can be strip-mined as a vector multiplication
• Each row of B would be the first operand and each column of C would be
the second operand
• With column-major memory organization, the elements of a row of B are
nonadjacent
• Stride: the uniform distance between the nonadjacent elements that are
gathered into one vector register
• Allows access to nonsequential memory elements
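A small sketch of a strided load, assuming a matrix stored in column-major order in a flat list. Walking down a column then uses stride 1, while walking along a row uses a stride equal to the number of rows. The function names are hypothetical.

```python
def column_major_index(row, col, num_rows):
    """Flat offset of element (row, col) in column-major storage."""
    return col * num_rows + row


def load_strided(memory, base, stride, vl):
    """Model a strided vector load of vl elements starting at base."""
    return [memory[base + i * stride] for i in range(vl)]


# A 3 x 4 matrix B whose element (r, c) holds the value 10*r + c.
NUM_ROWS, NUM_COLS = 3, 4
B = [10 * r + c for c in range(NUM_COLS) for r in range(NUM_ROWS)]

row_1 = load_strided(B, column_major_index(1, 0, NUM_ROWS), NUM_ROWS, NUM_COLS)
col_2 = load_strided(B, column_major_index(0, 2, NUM_ROWS), 1, NUM_ROWS)
```

`row_1` collects the four elements of row 1 with stride 3, and `col_2` collects the three elements of column 2 with unit stride.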
Vector processors - Contd.
Agenda
• Enhancing Vector performance
• Measuring Vector performance
• SSE Instruction set and Applications
• A case study - Intel Larrabee vector processor
• Pitfalls and Fallacies
Enhancing Vector performance
• General
o Pipelining individual operations of one instruction
o Reducing Startup latency
• Addressing following hazards effectively
o Structural hazards
o Data hazards
o Control hazards
Pipelining & reducing Startup latency
Addressing Structural hazards - Multiple Lanes
• Addressed using pipelining and parallel lanes

Multiple Lanes - Contd.
• Registers & Floating point units are localized
within lanes
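A rough model of how lanes shorten a chime, assuming elements are interleaved across lanes so that element i is handled by lane i mod L, and each lane completes one element per cycle. Both functions are illustrative helpers, not part of any real toolchain.

```python
import math


def lane_assignment(vector_length, lanes):
    """List, per lane, the element indices that lane processes."""
    return [
        [i for i in range(vector_length) if i % lanes == lane]
        for lane in range(lanes)
    ]


def cycles_with_lanes(vector_length, lanes):
    """Cycles for one vector instruction, ignoring start-up overhead."""
    return math.ceil(vector_length / lanes)
```

With 4 lanes, a 64-element operation takes 16 cycles instead of 64, and because each lane only ever touches its own elements, the register file and floating point units can stay local to the lane.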
Addressing Data hazards - Flexible chaining
• Similar to Forwarding
• Chaining allows a vector operation to start as soon as the
individual elements of its vector source operand become
available
• Example:

Instruction          Start-up time (cycles)   Vector length (elements)
MULV.D V1, V2, V3    7                        64
ADDV.D V4, V1, V5    6                        64
Flexible Chaining - Contd.

MULV.D V1, V2, V3 followed by ADDV.D V4, V1, V5:

                      Unchained                          Chained
Time (cycles)         VLM + VLA + STM + STA = 141        VLM/A + STM + STA = 77
Cycles / result       141 / 64 = 2.2                     77 / 64 = 1.2
FLOPS / clock cycle   128 / 141 = 0.9                    128 / 77 = 1.7
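The two totals can be reproduced with a short calculation. The sketch assumes the timing model implied by the table: without chaining, the add waits for the multiply to finish all its elements; with chaining, the add starts on each element as soon as the multiply produces it, so the vector length is paid only once.

```python
def unchained_cycles(vector_length, startup_mul, startup_add):
    """The dependent ADDV.D starts only after MULV.D has fully completed."""
    return (startup_mul + vector_length) + (startup_add + vector_length)


def chained_cycles(vector_length, startup_mul, startup_add):
    """ADDV.D consumes elements as MULV.D produces them."""
    return startup_mul + startup_add + vector_length


def flops_per_cycle(vector_length, total_cycles, ops_per_element=2):
    """One multiply and one add per element by default."""
    return ops_per_element * vector_length / total_cycles
```

With a vector length of 64 and start-up times of 7 and 6 cycles, these give 141 and 77 cycles, matching the table.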
Addressing Control hazards - Vector mask

• Loops whose body contains a control statement (for example an if)
cannot run in vector mode directly
• Solution:
o Convert the control dependence into a data dependence by
executing the control statement and updating the vector mask
register
o Run the data-dependent instructions in vector mode based on
the value in the vector mask register
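A minimal sketch of masked execution, using the hypothetical loop `if A[i] != 0: A[i] = A[i] - B[i]`. The comparison is evaluated for every element to fill the mask, and the subtraction then runs over the full vector but only commits results where the mask bit is 1.

```python
def set_mask(values, predicate):
    """Evaluate the control condition element-wise into a 0/1 mask."""
    return [1 if predicate(v) else 0 for v in values]


def masked_sub(a, b, mask):
    """Subtract under mask: elements with mask 0 keep their old value."""
    return [x - y if m else x for x, y, m in zip(a, b, mask)]


A = [5.0, 0.0, 3.0, 0.0]
B = [1.0, 1.0, 1.0, 1.0]
VM = set_mask(A, lambda v: v != 0)
A_new = masked_sub(A, B, VM)
```

Note that the masked operation still occupies the functional unit for every element, including those whose mask bit is 0.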
Vector mask - Contd.
Improving Vector mask - Scatter & Gather method

• Step 1: Set each VM bit to 1 where the control condition holds
• Step 2: CVI - Create Vector Index based on VM
o Create an index vector which points to the addresses of valid
contents
• Step 3: LVI - Load Vector Index (GATHER)
o Load the valid operands based on step 2
• Step 4: Execute the arithmetic operation on the compressed vector
• Step 5: SVI - Store Vector Index (SCATTER)
o Store the valid output based on step 2
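The five steps above can be sketched as follows. This is a simplified model: the index vector holds element positions rather than byte offsets, and the function names merely echo the CVI / LVI / SVI mnemonics.

```python
def create_vector_index(mask):
    """CVI: positions of the elements whose mask bit is set."""
    return [i for i, m in enumerate(mask) if m]


def gather(memory, index):
    """LVI: load only the indexed elements into a compressed vector."""
    return [memory[i] for i in index]


def scatter(memory, index, values):
    """SVI: store the compressed results back to their original positions."""
    for i, v in zip(index, values):
        memory[i] = v


def sparse_sub(a, b):
    """Compute a[i] - b[i] only where a[i] is nonzero."""
    mask = [1 if v != 0 else 0 for v in a]         # step 1
    index = create_vector_index(mask)              # step 2
    va, vb = gather(a, index), gather(b, index)    # step 3
    result = [x - y for x, y in zip(va, vb)]       # step 4, short vector
    out = list(a)
    scatter(out, index, result)                    # step 5
    return out
```

The arithmetic in step 4 runs on a vector whose length equals the number of selected elements, which is where the saving over plain masking comes from.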
Scatter & Gather - Contd.
Comparison - Basic vector mask & Scatter - Gather

• Conclusion: Scatter & Gather will run faster if less than
one-quarter of the elements are nonzero
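The supporting calculation is not in these notes. One way to arrive at the one-quarter threshold, assuming chime counts of 5 for the masked sequence and 4 + 4f for the scatter-gather sequence (f being the fraction of nonzero elements, and constant overheads ignored), is:

```latex
\[
\begin{aligned}
T_{\text{mask}} &= 5n, \qquad T_{\text{sg}} = 4n + 4fn \\
T_{\text{sg}} \le T_{\text{mask}} &\iff 4fn \le n \iff f \le \tfrac{1}{4}
\end{aligned}
\]
```

These chime counts are an assumption taken from the usual textbook treatment of this example; a different instruction sequence would move the break-even point.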
Enhancing Vector performance - Summary

• General
o Pipelining individual operations of one instruction
o Reducing Startup latency
• Structural hazards
o Multiple Lanes
• Data hazards
o Flexible chaining
• Control hazards
o Basic vector mask
o Scatter & Gather
Measuring Vector Performance - Total
execution time
Scale for measuring performance:
• Total execution time of the vector loop - Tn
o Used to compare the performance of different instruction
sequences on a processor
o Unit - clock cycles
o n - vector length
o MVL - maximum vector length
o Tloop - loop overhead
o Tstart - start-up overhead
o Tchime - number of chimes (convoys) in the loop
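A small calculator for Tn built from the quantities listed above. It assumes the textbook form of the model, Tn = ceil(n / MVL) × (Tloop + Tstart) + n × Tchime, which is not written out in these notes; the sample numbers are illustrative.

```python
import math


def total_execution_time(n, mvl, t_loop, t_start, t_chime):
    """Clock cycles for a strip-mined vector loop over n elements."""
    strips = math.ceil(n / mvl)
    return strips * (t_loop + t_start) + n * t_chime
```

For example, n = 200, MVL = 64, Tloop = 15, Tstart = 49 and Tchime = 3 give 4 × 64 + 200 × 3 = 856 cycles.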
Measuring Vector Performance -
MFLOPS
• MFLOPS - Millions of FLoating point Operations Per Second
o Used to compare the performance of two different processors
• Rn - MFLOPS rate on a vector of length n
o Unit - millions of floating point operations / second
• Rinfinity - MFLOPS rate on an infinite-length vector (theoretical / peak
performance)
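A sketch of how the two rates can be computed, assuming the usual definition: floating point operations per element times the clock rate, divided by the clock cycles spent per element. For Rn the cycles per element are Tn / n; for the peak rate they are the limit of that ratio as n grows. Function names and sample values are illustrative.

```python
def mflops(flops_per_element, n, clock_mhz, total_cycles):
    """Rn: sustained MFLOPS for a loop over n elements taking total_cycles."""
    return flops_per_element * n * clock_mhz / total_cycles


def peak_mflops(flops_per_element, clock_mhz, cycles_per_element):
    """R-infinity: MFLOPS when per-element cost has reached its limit."""
    return flops_per_element * clock_mhz / cycles_per_element
```

With 2 operations per element, a 500 MHz clock and 856 cycles for 200 elements, Rn is about 233.6 MFLOPS; if the limiting cost is 4 cycles per element, the peak is 250 MFLOPS.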
SSE Instructions
• Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to
the x86 architecture
• Streaming SIMD Extensions are similar to vector instructions.
• SSE originally added eight new 128-bit registers known as XMM0
through XMM7
• Each register packs together (the double-precision and integer
formats were added by SSE2):
o four 32-bit single-precision floating point numbers or
o two 64-bit double-precision floating point numbers or
o two 64-bit integers or
o four 32-bit integers or
o eight 16-bit short integers or
o sixteen 8-bit bytes or characters.
SSE Instruction set & Applications
• Sample instruction set for floating point operations
o Scalar – ADDSS, SUBSS, MULSS, DIVSS
o Packed – ADDPS, SUBPS, MULPS, DIVPS
• Example

• Applications - multimedia, scientific and financial applications
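The difference between the scalar and packed forms listed above can be modeled in a few lines. This is a behavioral sketch only, with a 128-bit register represented as a list of four Python floats (lane 0 first); it does not call real SSE instructions and ignores single-precision rounding.

```python
def addps(dst, src):
    """Packed add (ADDPS): all four lanes are added."""
    return [d + s for d, s in zip(dst, src)]


def addss(dst, src):
    """Scalar add (ADDSS): only lane 0 is added; the upper three lanes
    of the destination pass through unchanged."""
    return [dst[0] + src[0]] + list(dst[1:])


xmm0 = [1.0, 2.0, 3.0, 4.0]
xmm1 = [10.0, 20.0, 30.0, 40.0]
```

`addps(xmm0, xmm1)` updates every lane, whereas `addss(xmm0, xmm1)` changes only the lowest one.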

A Case study - Intel Larrabee
Architecture
• Larrabee is the code name for a many-core visual computing
architecture
• Intel’s new approach to a GPU
• Considered to be a hybrid between a
multi-core CPU and a GPU
• Combines functions of a multi-core CPU
with the functions of a GPU
Larrabee - The Big picture

• In-order execution (execution is also more deterministic, so
instruction and task scheduling can be done by the compiler)
• Each Larrabee core contains a 512-bit vector
processing unit, able to process 16 single-precision floating
point numbers at a time.
• Uses an extended x86 instruction set with additional features,
like scatter / gather instructions and a mask register,
designed to make using the vector unit easier and more
efficient.
Larrabee VPU Architecture
• 16-wide vector ALU in each core
• Executes integer, single-precision
float and double-precision float
instructions
• Choice of 16: a trade-off between
increased computational density and the
difficulty of keeping a wider unit
highly utilized
• Supports swizzling and replication
• Mask register and index register
operations
Larrabee Data types

• 32 512-bit vector registers & 8 16-bit vector mask registers
• Each vector register can be treated as
o 16 wide - to store 16 float32s or 16 int32s
o 8 wide - to store 8 float64s or 8 int64s
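The lane counts follow directly from the register width, and a 16-bit mask register supplies one bit per 32-bit lane. The sketch below assumes write-mask semantics in which unselected lanes keep their previous contents; the helper names are hypothetical.

```python
REGISTER_BITS = 512


def lanes(element_bits):
    """Number of elements of the given width in one vector register."""
    return REGISTER_BITS // element_bits


def apply_write_mask(old, new, mask_bits):
    """Commit new[i] only where bit i of the mask is set."""
    return [n if (mask_bits >> i) & 1 else o
            for i, (o, n) in enumerate(zip(old, new))]
```

`lanes(32)` is 16 and `lanes(64)` is 8, so a 16-bit mask covers the widest case with one bit per element.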
Larrabee Instruction set

• vector arithmetic, logic and shift
• vector mask generation
• vector load / store
• swizzling
• vector multiply-add and multiply-sub instructions
Past, Present & Future of Vector
processors
• Past
o Cray X1
o Earth simulator
• Present
o Cray Jaguar
o Larrabee
• Future: AVX (Advanced Vector Extensions)
o Sandy Bridge (Intel)
o Bulldozer (AMD)
Pitfalls and Fallacies

• Pitfalls:
o Concentrating on peak performance and ignoring start up
overhead (on memory-memory vector architecture)
o Increasing Vector performance, without comparable
increase in scalar performance
• Fallacy
o You can get vector performance without providing memory
bandwidth (by reusing vector registers)
Recap

• Why Vector processors

• Basic Vector Architecture
• Vector Execution time
• Vector load - store units and Vector memory systems
• Vector length - VLR
• Vector stride
• Enhancing Vector performance
• Measuring Vector performance
• SSE Instruction set and Applications
• A case study - Intel Larrabee vector processor
• Pitfalls and Fallacies
References

• Computer Architecture: A Quantitative Approach, 4th edition (Appendices A, F &
G; Chapters 2 & 3)
• Cray X1: http://www.supercomp.org/sc2003/paperpdfs/pap183.pdf
• Larrabee official page on Intel: http://software.intel.com/en-us/articles/larrabee/
• Larrabee: http://www.gpucomputing.org/drdobbs_042909_final.pdf
• http://www.vizworld.com/2009/05/new-whitepapers-from-intel-about-larrabee/
Thank you.
