Parallelism (2) & Heterogeneous Computing & Future Perspectives
Hung-Wei Tseng
Modern processor pipelines
[Figure 2-1 from Intel's documentation of Intel 64 and IA-32 processor architectures: CPU Core Pipeline Functionality of the Skylake Microarchitecture. The diagram shows the 32K L1 instruction cache and branch prediction unit (BPU) feeding the allocate/rename/retire/move-elimination/zero-idiom logic and the scheduler, which issues to ports 0-7 (integer ALUs, vector FMA/MUL/Add/ALU/shuffle/shift units, LEA, divide, branch, load/store-address, and store-data units), backed by the 32K L1 data cache and 256K L2 cache.]
2
Super-scalar processors with OoO
Superscalar: Modern processors provide multiple pipelines/functional units so that more instructions can be in flight at the same time, exposing instruction-level parallelism
  Fetch multiple instructions at the same time
  Execute them at the same time
Out-of-order execution: Modern processors dynamically extract the data flow among instructions to achieve better instruction-level parallelism
  Instructions without data dependencies can be executed in parallel (see the sketch below)
  Buffer the results and commit them only after all previous instructions have finished
If you're interested, we will discuss more about this in CSC456 (Spring 2018, planned)
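As a concrete illustration (a minimal C sketch, not from the slides; the variable names are made up), consider the data dependencies in the fragment below: an out-of-order superscalar core can overlap the two independent additions, but the multiply must wait for both of them.

/* a, b, c, d are assumed to be previously initialized int variables */
int x = a + b;    /* independent of the next statement              */
int y = c + d;    /* no dependency on x: can execute in parallel    */
int z = x * y;    /* depends on both x and y: must wait (data flow) */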
3
Simplified SuperScalar+OOO pipeline
[Block diagram:] Instruction Fetch → Instruction Decode → Register renaming logic → Instruction Schedule → Execution Units → Data Memory → Reorder Buffer/Commit
  Fetch a bunch of instructions at a time
  Schedule instructions whenever their inputs and target functional units are ready
4
Pentium 4 vs. Athlon 64
Application: 80% ALU, 20% branch, 90% branch prediction accuracy, no data dependencies, perfect cache. Consider the two machines:
Pentium 4 with 20 pipeline stages, branch resolved in stage 19, running at 3 GHz
  Stages 1-2: TC Nxt IP, 3-4: TC Fetch, 5: Drive, 6: Alloc, 7-8: Rename, 9: Que, 10-12: Sch, 13-14: Disp, 15-16: RF, 17: Ex, 18: Flgs, 19: Br Ck, 20: Drive
[Pipeline diagram of an SMT processor: per-thread instruction fetch (Fetch: T0-T3) and per-thread reorder buffers (ROB: T0-T3), sharing the instruction decode, register renaming logic, instruction schedule, execution units, and data cache.]
6
Simultaneous Multi-Threading (SMT)
Fetch instructions from different threads/processes to fill the pipeline
Exploit thread-level parallelism (TLP) to solve the problem of insufficient ILP in a single thread
  Each thread is a separate scheduling entity in the OS
Known as Hyper-Threading in Intel processors
Keep separate architectural state for each thread
  PC
  Register files
  Reorder buffer
Creates an illusion of multiple processors for the OS
The rest of the superscalar processor hardware is shared
Invented by Dean Tullsen, Hung-Wei's PhD advisor
7
SMT
How many of the following about SMT are correct?
SMT makes processors with deep pipelines more tolerant of mis-predicted branches
SMT can improve the latency of a single-threaded application
  (It can actually hurt, because the thread shares resources with other threads.)
SMT processors can better utilize hardware during cache misses compared with superscalar processors of the same issue width
SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications
A. 0
B. 1
C. 2
D. 3
E. 4
8
Announcement
Check your grades as soon as possible
Homework #4 due 4/26 (Wednesday)
Final review this Wednesday
You may be getting some surprise!
Project #3 is up. Due 5/1 (Monday)
Project #2: you may revise and resubmit before 5/1 with a 10% penalty
Final exam: 5/10, 8am
ClassEval
9
Outline
Thread-level parallelism & multithreaded processors
Brief introduction to parallel computing
Heterogeneous computing and future perspectives
10
Chip-multiprocessor
11
A wide-issue processor or
multiple narrower-issue processors?
What can you do within a 21 mm * 21 mm die area?
[Floorplans compared side by side. Figure 2: floorplan for the six-issue dynamic superscalar microprocessor; Figure 3: floorplan for the four-way single-chip multiprocessor. Each floorplan shows the I-caches, D-caches, on-chip L2 cache, external interface, clocking & pads, and the fetch/decode/rename/out-of-order logic.]
A 6-issue superscalar processor: 3 integer ALUs, 3 floating-point ALUs, 3 load/store units
Four 2-issue superscalar processors: 4x1 integer ALUs, 4x1 floating-point ALUs, 4x1 load/store units
You will have more ALUs if you choose the multiprocessor!
[Excerpt from the accompanying paper: each 2-issue core is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP, because the 6-way processor had three units of each type while the 4-way MP must have four, one for each CPU. The 6-way execution engine also requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture to keep instruction fetch from becoming a bottleneck.]
12
Die photo of a CMP processor
13
CMP advantages
How many of the following are advantages of CMP over a traditional superscalar processor?
CMP can exploit thread-level parallelism
CMP can deliver better instruction throughput within the same die area (chip size)
CMP can achieve better ILP for each running thread
  (Not really; this depends on the single-core microarchitecture.)
CMP can improve the performance of a single-threaded application without modifying code
  (Not really; a single-threaded program cannot use functional units from other processor cores.)
A. 0
B. 1
C. 2
D. 3
E. 4
14
Speeding up a single application on multithreaded processors
15
Parallel programming
To exploit CMP/SMT parallelism, you need to break your computation into multiple processes or multiple threads
Processes (in OS/software systems)
  Separate programs actually running (not sitting idle) on your computer at the same time
  Each process has its own virtual memory space, and you need to explicitly exchange data using inter-process communication (IPC) APIs
Threads (in OS/software systems)
  Independent portions of your program that can run in parallel
  All threads share the same virtual memory space
We will refer to these collectively as threads
A typical user system might have 1-8 actively running threads. Servers can have more if needed (the sysadmins will hopefully configure it that way)
16
Create threads/processes
The only way we can improve a single application's performance on CMP/SMT
You can use fork() to create a child process (a minimal sketch follows below)
Or you can use pthreads or OpenMP to compose multi-threaded programs
  Threads share the same memory space with each other

/* Do matrix multiplication */
for(i = 0 ; i < NUM_OF_THREADS ; i++)
{
    tids[i] = i;
    /* Spawn a thread */
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    /* Synchronize: wait for a thread to terminate */
    pthread_join(thread[i], NULL);
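For the fork() route mentioned above, here is a minimal sketch (not from the slides; the child's "work" is a hypothetical placeholder) of creating a child process and waiting for it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                /* create a child process */
    if (pid == 0) {
        /* child: runs in its own virtual memory space */
        printf("child %d doing part of the work\n", (int)getpid());
        exit(0);
    }
    /* parent: wait for the child to terminate */
    waitpid(pid, NULL, 0);
    printf("parent %d done\n", (int)getpid());
    return 0;
}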
17
Supporting shared memory model
Provide a single memory space that all processors
can share
All threads within the same program share the same address space
Threads communicate with each other using shared variables in memory
Provide the same memory abstraction as single-threaded programming
18
Simple idea...
Connect all processors and the shared memory to a bus
Processor speed will be slow because all devices on a bus must run at the same speed
[Diagram: processors and a shared cache ("Shared $") connected to a common bus]
19
Memory hierarchy on CMP
Each processor has its own local cache
[Diagram: Core 0 through Core 3, each with a local cache ("Local $"), connected over a bus to a shared cache ("Shared $")]
20
Cache on Multiprocessor
Coherency
  Guarantees that all processors see the same value for a variable/memory address when they need the value at the same time
  Defines what value should be seen
Consistency
  All threads see changes to data in the same order (see the sketch below)
  Defines when a memory operation should be visible
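To make the distinction concrete, here is a minimal sketch (an illustration, not from the slides; producer/consumer are hypothetical names and the code deliberately has no synchronization) of the classic flag/data pattern whose outcome depends on the consistency model:

#include <stdio.h>

int data = 0;   /* shared */
int flag = 0;   /* shared */

void producer(void)          /* thread 1 */
{
    data = 42;               /* store the data first           */
    flag = 1;                /* then publish it via the flag   */
}

void consumer(void)          /* thread 2 */
{
    while (flag == 0)
        ;                    /* wait until the flag is set     */
    /* Coherency decides which value of data is seen; consistency decides
       whether the store to data is guaranteed to become visible before the
       store to flag. If the two stores are reordered, this may print 0.   */
    printf("%d\n", data);
}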
21
What will each thread see?
Assuming that we are running the following code on
a CMP with some cache coherency protocol, which
output is NOT possible? (a is initialized to 0)
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100
22
It's show time!
Demo!
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
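A self-contained version of the demo (a minimal sketch, assuming the two infinite loops from the slide are wrapped in pthreads; the function names reader/writer are made up):

#include <pthread.h>
#include <stdio.h>

int a = 0;                      /* shared counter, initialized to 0 */

void *reader(void *arg)         /* thread 1: keeps printing a       */
{
    while (1)
        printf("%d ", a);
    return NULL;
}

void *writer(void *arg)         /* thread 2: keeps incrementing a   */
{
    while (1)
        a++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    pthread_join(t1, NULL);     /* never returns; both loops run forever */
    pthread_join(t2, NULL);
    return 0;
}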
23
Cache coherency
Assuming that we are running the following code on
a CMP with some cache coherency protocol, which
output is NOT possible? (a is initialized to 0)
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100
24
Hard to debug
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

thread 1:

int loop;
void* modifyloop(void *x);      /* defined below (thread 2) */

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:

void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}
25
Hard to debug
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

thread 1:

volatile int loop;              /* volatile makes sure the compiled code
                                   re-reads loop from memory every time  */
void* modifyloop(void *x);      /* defined below (thread 2) */

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:

void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}
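Why the non-volatile version can hang: a sketch (an illustration, assuming an optimizing build such as -O2; the temporary variable is hypothetical) of the transformation the compiler is allowed to make when loop is not volatile:

/* The compiler may load loop once and keep it in a register,
   so the spin loop effectively becomes:                       */
int cached = loop;          /* single load from memory         */
while (cached == 1)
{
    continue;               /* never re-reads loop, so the update
                               made by thread 2 is never observed */
}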
26
void* update_array(void *x)
{
27
How to make it work?
You need these instructions to make it work:

volatile int lock = 0;

inline void atomic_increment(volatile int *pw) {
    asm(
        "lock\n\t"
        "incl %0\n\t"
        :
        "=m"(*pw):      // output (%0)
        "m"(*pw):       // input (%1)
        "cc"            // clobbers
    );
}

inline int atomic_compare_and_exchange(int requiredOldValue, volatile int* _ptr, int newValue, int sizeOfValue) {
    int old;
    __asm volatile
    (
        "mov %3, %%eax;\n\t"
        "lock\n\t"
        "cmpxchg %4, %0\n\t"
        "mov %%eax, %1\n\t"
        :
        "=m" ( *_ptr ), "=r" ( old ):                              // outputs (%0, %1)
        "m" ( *_ptr ), "r" ( requiredOldValue ), "r" ( newValue ): // inputs (%2, %3, %4)
        "memory", "eax", "cc"                                      // clobbers
    );
    return old;
}
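One way these primitives can be used (a minimal sketch, not from the slides; the spin_lock/spin_unlock helpers are hypothetical): the compare-and-exchange guards a critical section through the lock variable declared above.

/* Acquire: atomically change *lock from 0 (free) to 1 (held); spin otherwise. */
void spin_lock(volatile int *lock)
{
    while (atomic_compare_and_exchange(0, lock, 1, sizeof(int)) != 0)
        ;                           /* someone else holds the lock: retry */
}

/* Release: a plain store is enough once the critical section is done. */
void spin_unlock(volatile int *lock)
{
    *lock = 0;
}

/* Usage inside a thread:
       spin_lock(&lock);
       shared_counter++;            // critical section
       spin_unlock(&lock);
   or simply atomic_increment(&shared_counter); for a lone counter update. */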
30
The changing role of
GPUs
31
GPU (Graphics Processing Unit)
Originally for displaying images
  HD video: 1920*1080 pixels * 60 frames per second (roughly 124 million pixel updates every second)
Graphics processing pipeline:
  Input Assembler → Vertex Shader → Geometry Shader → Setup & Rasterizer → Pixel Shader → Raster Operations / Output Merger
Shader code fragment shown on the slide:
  vec3 L = normalize(o_toLight);
  vec3 V = normalize(o_toCamera);
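To see why pixels are such a natural source of parallelism, here is a rough C sketch (an illustration, not from the slides; the vec3 type, the dot helper, and the simple diffuse model are assumed) of the same shading operation applied independently to every pixel:

typedef struct { float x, y, z; } vec3;

/* assumed helper: dot product of two vectors */
static float dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Every pixel runs the same few arithmetic operations on its own L and N;
   there are no dependencies between pixels, so a GPU can evaluate
   thousands of them at once. */
void shade_frame(const vec3 *L, const vec3 *N, float *out, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++) {   /* 1920*1080 pixels per frame */
        float d = dot(N[i], L[i]);           /* simple diffuse term        */
        out[i] = d > 0.0f ? d : 0.0f;
    }
}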
33
What do you want from a GPU?
Given the basic idea of shading algorithms, how many of
the following statements would fit the agenda of designing
a GPU?
Many ALUs to process multiple pixels simultaneously
  (Each frame contains 1920*1080 pixels!)
High-bandwidth memory bus to supply pixels, vectors and textures
  (Each pixel requires different L, N, R, V)
High-performance branch predictors
  (Not really; the behavior is uniform across all pixels)
Powerful ALUs to process many different kinds of operators
  (Not really; we only need vector add, multiply, and divide. Low frequency is OK since we have many threads)
A. 0
B. 1
C. 2
D. 3
E. 4
34
Nvidia GPU architecture
[Block diagram of an Nvidia Kepler-class GPU; the highlighted blocks are the SMX (Streaming Multiprocessor) units.]
From the Kepler whitepaper: GK210 expands upon GK110's on-chip resources, doubling the available register file and shared memory capacities per SMX, with hardware support throughout the design to enable new programming-model capabilities.
35
Streaming Multiprocessor (SMX) Architecture
[Diagram of one SMX: a large array of cores. Each of these cores performs the same operation, but each of them also executes a separate thread.]
36
[Slide excerpting the AMD GCN whitepaper: another crucial innovation in GCN is coherent caching. Historically, GPUs have relied on specialized caches (such as read-only texture caches) that do not maintain a coherent view of memory; to communicate between cores within a GPU, the programmer or compiler must insert explicit synchronization instructions to flush shared data back to memory. While this approach simplifies the design, it increases overhead for applications which share data.]
38
CPU vs. GPU
Comparing the CPU and GPU architectures, how many of the following are correct?
GPU architectures are more tolerant of memory latencies
  (Think about SMT)
CPU is a better fit if the application requires low-latency operations
  (GPU memory is optimized for bandwidth, sacrificing latency)
GPU is efficient for workloads like matrix multiplications
GPU is inefficient if the workload needs only a small set of input data points but contains lots of decision making and backward references
  (The utilization of the GPU will be low; besides, GPU ALUs are slower than CPU ALUs)
A. 0
B. 1
C. 2
D. 3
E. 4
39
Programming GPGPU
A GPGPU application contains the host program and
GPU kernels
Host program: a C/C++-based CPU program that invokes the GPU APIs
GPU kernels: C/C++-like programs running on GPUs
Programming models
CUDA (Compute Unified Device Architecture)
  Proposed by NVIDIA
  Supported only by NVIDIA GPUs
OpenCL
Maintained by Khronos Group (non-profit)
Supported by Altera, AMD, Apple, ARM Holdings, Creative
Technology, IBM, Imagination Technologies, Intel, Nvidia,
Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS
40
What does a host program look like?

int main (int argc, const char * argv[]) {
    /* Initialize the GPU runtime */
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
    /* ... */
}

/* Per-element matrix multiplication for one (row, col) position of the output: */
float value = 0;
for (int k = 0; k < size; ++k)
{
    value += input_a[row * size + k] * input_b[k * size + col];
}
output[row * size + col] = value;
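For comparison, a minimal sketch of what a corresponding GPU kernel could look like in OpenCL C (an illustration only, not the slides' actual kernel; the kernel name, the argument list, and the mapping of get_global_id to row/col are assumptions):

/* Each work-item computes one element of the output matrix. */
__kernel void matrixmul(__global const float *input_a,
                        __global const float *input_b,
                        __global float *output,
                        int size)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float value = 0.0f;
    for (int k = 0; k < size; ++k)
        value += input_a[row * size + k] * input_b[k * size + col];
    output[row * size + col] = value;
}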
42
Demo
Matrix multiplication
blocked algorithm vs. naive OpenCL code
43
How things are connected
[Diagram: the CPU connects to DRAM and, through a PCIe switch, to the GPU]
44
New overhead/bottleneck emerges
[Same diagram: the PCIe switch between the CPU/DRAM and the GPU is highlighted as the new overhead/bottleneck]
45
Closing remarks and
future perspectives
49
Challenges
Moore's law is slowing down
Leakage power constrains the processing power of a single chip
  Discontinuation of Dennard scaling
  Also called the dark silicon problem
50
CPU performance scaling slows down
[Figure 1: CPU SPECRate and supercomputer FLOPS over time. Annotations on the chart: CPU SPECRate grew 1.5x in 56 months; supercomputer FLOPS grew 2.7x in 56 months; no faster supercomputers since Dec 2012; 4x; legend entries for DRAM, CPU, and GPU.]
52
Conclusion
In the past, software engineers and hardware engineers worked on different sides of the ISA abstraction
  Software engineers had no idea what happens inside processors/hardware
  Hardware engineers had no sense of what the demands of applications are
  This worked fine as long as we could keep accelerating CPUs, but that is no longer true
We need new execution & programming models to better utilize these hardware components
We need innovative computer system design to address the
challenges from process technologies and the application
demands
Artificial intelligence
3D VR/AR
Hope to see you again in CSC456 Spring 2018 (planned)
53
We will talk more about high-performance/heterogeneous computer system design, as well as corresponding techniques to accelerate your programs, in CSC456 next Spring!
54