
Lecture: Parallel Architecture – Thread Level Parallelism and Data Level Parallelism

CSCE 569 Parallel Computing


Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh

Topics
• Introduction
• Programming on shared memory system (Chapter 7)
– OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
– MPI (point-to-point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
– Performance Metrics for Parallel Systems
• Execution Time, Overhead, Speedup, Efficiency, Cost
– Scalability of Parallel Systems
– Use of performance tools
Topics
• Programming on shared memory system (Chapter 7)
– Cilk/Cilkplus and OpenMP Tasking
– Pthreads, mutual exclusion, locks, synchronization
• Parallel architectures and memory
– Parallel computer architectures
• Thread Level Parallelism
• Data Level Parallelism
• Synchronization
– Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
– GPUs architectures
– CUDA programming
– Introduction to the offloading model in OpenMP
Lecture: Parallel Architecture –
Thread Level Parallelism

Note:
• Parallelism in hardware
• Not (just) multi-/many-core architecture

Binary Code and Instructions

• Compile a program with the -save-temps flag to see the generated
assembly code, or disassemble an existing binary with objdump -D,
as in the example below
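
For example (the file name is hypothetical; both commands are standard gcc/binutils usage):

gcc -save-temps -o hello hello.c   # keeps hello.i (preprocessed), hello.s (assembly), hello.o (object)
objdump -D hello                   # disassemble all sections of the resulting binary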
Stages to Execute an Instruction

Pipeline

Pipeline and Superscalar

What do we do with that many transistors?

• Optimizing the execution of a single instruction stream through


– Pipelining
• Overlap the execution of multiple instructions
• Example: all RISC architectures; Intel x86 underneath the
hood
– Out-of-order execution:
• Allow instructions to overtake each other, as long as data
dependencies (RAW, WAW, WAR) are respected
• Example: all commercial processors (Intel, AMD, IBM, SUN)
– Branch prediction and speculative execution:
• Reduce the number of stall cycles due to unresolved
branches
• Example: (nearly) all commercial processors
What do we do with that many transistors? (II)

– Multi-issue processors:
• Allow multiple instructions to start execution per clock cycle
• Superscalar (Intel x86, AMD, …) vs. VLIW architectures
– VLIW/EPIC architectures:
• Allow compilers to indicate independent instructions per
issue packet
• Example: Intel Itanium
– Vector units:
• Allow for the efficient expression and execution of vector
operations
• Example: SSE - SSE4, AVX instructions
Limitations of optimizing a single instruction
stream (II)
• Problem: within a single instruction stream we do not find
enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulty detecting memory dependencies among instructions
(alias analysis)
• Consequence: a significant number of functional units are idle at
any given time
• Question: Can we instead execute instructions from another
instruction stream?
– Another thread?
– Another process?
Hardware Multi-Threading (SMT)
• Three types of hardware multi-threading (single-core only):
– Coarse-grained MT
– Fine-grained MT
– Simultaneous Multi-threading
(Figure: issue-slot diagrams comparing Superscalar, Coarse MT, Fine MT, and SMT)
Thread-level parallelism
• Problems for executing instructions from multiple threads
at the same time
– The instructions in each thread might use the same register
names
– Each thread has its own program counter
• Virtual memory management allows for the execution of
multiple threads and sharing of the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches between every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g.
level-2 cache misses)
Simultaneous Multi-threading
• Convert Thread-level parallelism to instruction-level
parallelism
• Dynamically scheduled processors already have most
hardware mechanisms in place to support SMT (e.g.
register renaming)
• Required additional hardware:
– Register file per thread
– Program counter per thread
• Operating system view:
– If a CPU supports n simultaneous threads, the Operating
System views them as n processors
– OS distributes the most time-consuming threads ‘fairly’ across
the n processors that it sees.
Example for SMT architectures (I)
• Intel Hyperthreading:
– First released for Intel Xeon processor family in 2002
– Supports two architectural state sets per CPU
– Each architectural state has its own
• General purpose registers
• Control registers
• Interrupt control registers
• Machine state registers
– Adds less than 5% to the relative chip size
Reference: D.T. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”,
Intel Technology Journal, 6(1), 2002, pp. 4-15.
ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_t
echnology.pdf
Example for SMT architectures (II)
• IBM Power 5
– Same pipeline as IBM Power 4 processor but with SMT support
– Further improvements:
• Increase associativity of the L1 instruction cache
• Increase the size of the L2 and L3 caches
• Add separate instruction prefetch and buffering units for
each SMT thread
• Increase the size of issue queues
• Increase the number of virtual registers used internally by
the processor.
Simultaneous Multi-Threading
• Works well if
– Number of compute intensive threads does not exceed the number of
threads supported in SMT
– Threads have highly different characteristics (e.g. one thread doing mostly
integer operations, another mainly doing floating point operations)
• Does not work well if
– Threads try to utilize the same functional units
– Assignment problems:
• e.g. a dual processor system, each processor supporting 2 threads
simultaneously (OS thinks there are 4 processors)
• 2 compute intensive application processes might end up on the same
processor instead of different processors (OS does not see the difference
between SMT and real processors!)
Lecture: Parallel Architecture --
Data Level Parallelism

Classification of Parallel Architectures
Flynn’s Taxonomy
• SISD: Single instruction single data
– Classical von Neumann architecture
• SIMD: Single instruction multiple data
– Vector, GPU, etc
• MISD: Multiple instructions single data
– Nonexistent, listed only for completeness
• MIMD: Multiple instructions multiple data
– Most common and general parallel machine
– Multi-/many- processors/cores/threads/computers
Single Instruction Multiple Data
• Also known as array processors
• A single instruction stream is broadcast to multiple
processors, each having its own data stream
– Still used in some graphics cards today

(Figure: a control unit broadcasting one instruction stream to multiple processors, each with its own data stream)

SIMD In Real
• Hardware for data-level parallelism
for (i=0; i<n; i++) {
a[i] = b[i] + c[i];
}

• Three major implementations
– SIMD extensions to the conventional CPU
– Vector architectures
– GPU variant

SIMD Instructions
• Originally developed for Multimedia applications
• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow
utilizing the same functional unit for multiple operations
– E.g. a 64 bit adder can be utilized for two 32-bit add
operations simultaneously
• Instructions originally not intended to be used by compilers, but just for
handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory page to
avoid page faults within the instruction
Intel SIMD Instructions
• MMX (Multi-Media Extension) - 1996
– Existing 64-bit floating-point registers could be used for eight
8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight 16-bit,
or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 - 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added
Vector Processors
• Vector processors abstract operations on vectors, e.g.
replace the following loop
for (i=0; i<n; i++) {
a[i] = b[i] + c[i];
}

by

a = b + c; ADDV.D V10, V8, V6

• Some languages offer high-level support for these operations
(e.g. Fortran 90 or newer)
AVX Instructions
AVX Instruction Description
VADDPD Add four packed double-precision operands
VSUBPD Subtract four packed double-precision operands
VMULPD Multiply four packed double-precision operands
VDIVPD Divide four packed double-precision operands
VFMADDPD Multiply and add four packed double-precision operands
VFMSUBPD Multiply and subtract four packed double-precision operands
VCMPxx Compare four packed double-precision operands for EQ,
NEQ, LT, LTE, GT, GE…
VMOVAPD Move aligned four packed double-precision operands
VBROADCASTSD Broadcast one double-precision operand to four locations in a
256-bit register
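
As an illustration, the loop a[i] = b[i] + c[i] from earlier can be written with AVX intrinsics in C. A minimal sketch, assuming n is a multiple of 4 and 32-byte-aligned arrays (compile with -mavx):

#include <immintrin.h>

/* a[i] = b[i] + c[i] using 256-bit AVX registers, 4 doubles at a time */
void add_avx(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d vb = _mm256_load_pd(&b[i]);  /* load 4 packed doubles */
        __m256d vc = _mm256_load_pd(&c[i]);
        __m256d va = _mm256_add_pd(vb, vc);  /* compiles to VADDPD */
        _mm256_store_pd(&a[i], va);          /* aligned store of 4 doubles */
    }
}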
Main concepts
• Advantages of vector instructions
– A single instruction specifies a great deal of work
– Each loop iteration must be free of data dependences on
other loop iterations, therefore:
• No need to check for data hazards between loop iterations
• Only one check required between two vector instructions
• Loop branches are eliminated
Basic vector architecture
• A modern vector processor contains
– Regular, pipelined scalar units
– Regular scalar registers
– Vector units (the inventors of pipelining!)
– Vector register: can hold a fixed number of entries (e.g. 64)
– Vector load-store units
Comparison MIPS code vs. vector code
Example: Y=aX+Y for 64 elements

      L.D    F0, a        /* load scalar a */
      DADDIU R4, Rx, #512 /* last address: 64 elements * 8 bytes */
L:    L.D    F2, 0(Rx)    /* load X(i) */
      MUL.D  F2, F2, F0   /* calc. a times X(i) */
      L.D    F4, 0(Ry)    /* load Y(i) */
      ADD.D  F4, F4, F2   /* a*X(i) + Y(i) */
      S.D    F4, 0(Ry)    /* store Y(i) */
      DADDIU Rx, Rx, #8   /* increment X pointer */
      DADDIU Ry, Ry, #8   /* increment Y pointer */
      DSUBU  R20, R4, Rx  /* compute bound */
      BNEZ   R20, L       /* loop until done */
Comparison MIPS code vs. vector code (II)
Example: Y=aX+Y for 64 elements

      L.D     F0, a      /* load scalar a */
      LV      V1, 0(Rx)  /* load vector X */
      MULVS.D V2, V1, F0 /* vector-scalar multiply */
      LV      V3, 0(Ry)  /* load vector Y */
      ADDV.D  V4, V2, V3 /* vector add */
      SV      V4, 0(Ry)  /* store vector Y */
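
For these 64 elements, the scalar MIPS loop executes 9 instructions per
iteration, roughly 580 dynamic instructions in total (2 setup + 9 * 64 = 578),
while the vector version above needs only 6.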
Vector length control
• What happens if the length is not matching the length of
the vector registers?
• A vector-length register (VLR) contains the number of
elements used within a vector register
• Strip mining: split a large loop into loops less or equal the
maximum vector length (MVL)
Vector length control (II)
low = 0;
VL = (n mod MVL);                /* first piece: the odd-sized remainder */
for (j = 0; j <= n/MVL; j++) {   /* note <=: all n/MVL + 1 pieces run */
  for (i = low; i < low + VL; i++) {
    Y(i) = a * X(i) + Y(i);
  }
  low += VL;                     /* start of the next piece */
  VL = MVL;                      /* remaining pieces have maximum length */
}
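
Worked example: for n = 200 and MVL = 64, n mod MVL = 8 and n/MVL = 3, so the
loop makes 4 passes: the first handles 8 elements, the remaining three handle
64 each (8 + 3 * 64 = 200).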
Vector stride
• Memory is typically organized in multiple banks
– Allow for independent management of different memory
addresses
– Memory-bank busy time is an order of magnitude larger than
the CPU clock cycle
• Example: assume 8 memory banks and 6 cycles of memory
bank time to deliver a data item
– Overlapping of multiple data requests by the hardware
Vector stride (II)
• What happens if the code does not access subsequent
elements of the vector
for (i=0; i<n; i+=2) {
a[i] = b[i] + c[i];
}
– Vector load ‘compacts’ the data items in the vector register
(gather)
• No effect on the execution of the loop
• You might however use only a subset of the memory banks
-> longer load time
• Worst case: stride is a multiple of the number of memory
banks (see the worked example below)
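
Worked example: with 8 banks and a 6-cycle bank busy time, a stride of 8 maps
every access to the same bank, so the load delivers one element every 6 cycles
instead of one per cycle once the pipeline is full.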
Conditional execution
• Consider the following loop
for (i=0; i< N; i++ ) {
if ( A(i) != 0 ) {
A(i) = A(i) – B(i);
}
}
• The loop usually cannot be vectorized because of the
conditional statement
• Vector-mask control: a boolean vector of length MVL controls
whether an instruction is executed or not
– Per element of the vector
Conditional execution (II)
LV V1, Ra /* load vector A into V1 */
LV V2, Rb /* load vector B into V2 */
L.D F0, #0 /* set F0 to zero */
SNEVS.D V1, F0 /* set VM(i)=1 if V1(i)!=F0 */
SUBV.D V1, V1, V2 /* sub using vector mask*/
CVM /* reset vector mask to all 1s */
SV V1, Ra /* store V1 */
Support for sparse matrices
• Access of non-zero elements in a sparse matrix often
described by
A(K(i)) = A(K(i)) + C (M(i))
– K(i) and M(i) describe which elements of A and C are non-zero
– The numbers of non-zero elements have to match; their
locations do not
• Gather-operation: take an index vector and fetch the
according elements using a base-address
– Mapping from a non-contiguous to a contiguous
representation
• Scatter-operation: inverse of the gather operation
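
In C, the scalar loop being vectorized here would look like the following
sketch (the array and index names are illustrative):

for (i = 0; i < n; i++)
    a[k[i]] = a[k[i]] + c[m[i]];  /* k[] and m[] list the non-zero positions */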
Support for sparse matrices (II)
LV Vk, Rm         /* load index vector K into Vk */
LVI Va, (Ra+Vk)   /* load vector indexed A(K(i)) */
LV Vm, Rm         /* load index vector M into Vm */
LVI Vc, (Rc+Vm)   /* load vector indexed C(M(i)) */
ADDV.D Va, Va, Vc /* add the non-zero elements */
SVI Va, (Ra+Vk)   /* store vector indexed A(K(i)) */
• Note:
– Compiler needs the explicit hint, that each element of K is
pointing to a distinct element of A
– Hardware alternative: a hash table keeping track of the
address acquired
• Start of a new vector iteration (convoy) as soon as an
address appears the second time
Lecture: Parallel Architecture –
Synchronization between processors,
cores and HW threads

Synchronization between processors
• Required on all levels of multi-threaded programming
– Lock/unlock
– Mutual exclusion
– Barrier synchronization

• Key hardware capability:
– An uninterruptible instruction capable of atomically retrieving
and changing a value
Race Condition
int count = 0;
int *cp = &count;
….
(*cp)++; /* executed by two threads: read, increment, write */

Note: *cp++ would increment the pointer; (*cp)++ increments the shared
counter, and even then the read-modify-write sequence is not atomic.

Pictures from Wikipedia: http://en.wikipedia.org/wiki/Race_condition


Simple Example (IIIb)

void *thread_func (void *arg) {
    int *cp = (int *) arg;

    pthread_mutex_lock (&mymutex);   // mymutex: a global pthread_mutex_t
    (*cp)++;  // read, increment and write shared variable
    pthread_mutex_unlock (&mymutex);

    return NULL;
}
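
A complete, compilable version of this sketch (the iteration count and thread
count are illustrative assumptions; build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg)
{
    int *cp = (int *) arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mymutex);
        (*cp)++;                      /* critical section: atomic w.r.t. the other thread */
        pthread_mutex_unlock(&mymutex);
    }
    return NULL;
}

int main(void)
{
    int count = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func, &count);
    pthread_create(&t2, NULL, thread_func, &count);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("count = %d\n", count);    /* always 200000 with the mutex; varies without it */
    return 0;
}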
Synchronization
• Lock/unlock operations on the hardware level, e.g.
– Lock returning 1 if lock is free/available
– Lock returning 0 if lock is unavailable
• Implementation using atomic exchange (compare and swap)
– Process sets the value of a register/memory location to the
required operation
– Setting the value must not be interrupted in order to avoid
race conditions
– Access by multiple processes/threads will be resolved by write
serialization
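
As a software-level analogue, a minimal sketch of such a lock using C11
atomics (the names and the 0 = free / 1 = held convention are assumptions,
not the hardware interface described above):

#include <stdatomic.h>

atomic_int lock_var = 0;                    /* 0 = free, 1 = held */

void acquire(atomic_int *l)
{
    /* atomic_exchange returns the old value: 0 means the lock was free */
    while (atomic_exchange(l, 1) != 0)
        ;                                   /* spin until we observe 'free' */
}

void release(atomic_int *l)
{
    atomic_store(l, 0);                     /* mark the lock free again */
}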
Synchronization (II)
• Other synchronization primitives:
– Test-and-set
– Fetch-and-increment
• Problems with all three algorithms:
– Require a read and a write operation in a single,
uninterruptible sequence
– Hardware cannot allow any operations between the read and
the write operation
– Complicates cache coherence
– Must not deadlock
Load linked/store conditional
• A pair of instructions where the second instruction returns a
value indicating whether the pair was executed as if the
instructions were atomic
• Special pair of load and store operations
– Load linked (LL)
– Store conditional (SC): returns 1 if successful, 0 otherwise
• Store conditional fails if
– The contents of the memory location specified by LL changed
before the SC
– The processor executed a context switch between LL and SC
Load linked/store conditional (II)
• Assembler code sequence to atomically exchange the
contents of register R4 and the memory location specified
by R1

try: MOV  R3, R4    /* move exchange value into R3 */
     LL   R2, 0(R1) /* load linked: read memory at R1 */
     SC   R3, 0(R1) /* store conditional: R3 = 1 on success, 0 on failure */
     BEQZ R3, try   /* retry if the store conditional failed */
     MOV  R4, R2    /* return the loaded value in R4 */
Load linked/store conditional (III)
• Implementing fetch-and-increment using load linked and
conditional store
try: LL     R2, 0(R1)  /* load linked: read the current value */
     DADDUI R3, R2, #1 /* increment into R3 */
     SC     R3, 0(R1)  /* store conditional: succeeds only if unchanged */
     BEQZ   R3, try    /* retry if the store conditional failed */
• Implementation of LL/SC by using a special Link Register,
which contains the address of the operation
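
In C, the same primitive is exposed through C11 atomics; on most ISAs the
compiler lowers this to exactly such an LL/SC loop (or a single atomic add).
A minimal sketch:

#include <stdatomic.h>

/* Atomically increments *p and returns the old value,
   just like the LL/SC sequence above. */
int fetch_and_increment(atomic_int *p)
{
    return atomic_fetch_add(p, 1);
}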
Spin locks
• A lock that a processor continuously tries to acquire, spinning around in a
loop until it succeeds.
• Trivial implementation

        DADDUI R2, R0, #1  !load the 'locked' value 1
lockit: EXCH   R2, 0(R1)   !atomic exchange with the lock variable
        BNEZ   R2, lockit  !old value != 0: lock was held, retry
• Since the EXCH operation includes a read and a modify operation
– Value will be loaded into the cache
• Good if only one processor tries to access the lock
• Bad if multiple processors in an SMP try to get the lock (cache coherence)
– EXCH includes a write attempt, which will lead to a write-miss for SMPs
Spin locks (II)
• For cache coherent SMPs, slight modification of the loop
required

lockit: LD     R2, 0(R1)  !load the lock
        BNEZ   R2, lockit !lock available?
        DADDUI R2, R0, #1 !load locked value
        EXCH   R2, 0(R1)  !atomic exchange
        BNEZ   R2, lockit !EXCH successful?
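
The same load-then-exchange (test-and-test-and-set) pattern can be sketched
with C11 atomics, reusing the conventions of the acquire()/release() sketch
shown earlier:

#include <stdatomic.h>

void acquire_tts(atomic_int *l)
{
    for (;;) {
        while (atomic_load(l) != 0)
            ;                             /* spin on a cached read: no write traffic */
        if (atomic_exchange(l, 1) == 0)
            return;                       /* the exchange saw 'free': lock acquired */
    }
}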
Spin locks (III)
• …or using LL/SC
lockit: LL R2, 0(R1) !load the lock
BNEZ R2, lockit !lock available?
DADDUI R2, R0, #1 !load locked value
SC R2, 0(R1) !atomic exchange
BNEZ R2, lockit !SC successful?
Lecture: Parallel Architecture –
Moore’s Law

Moore’s Law
• Long-term trend on the number of transistor per integrated circuit
• Number of transistors doubles every ~18 months

Source: http://en.wikipedia.org/wiki/Images:Moores_law.svg
The “Future” of Moore’s Law
• The chips are down for Moore’s law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore's Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC,
2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf
