
Parallelism (2) &

Heterogeneous computing
& Future perspectives
Hung-Wei Tseng
Modern processor pipelines
INTEL 64 AND IA-32 PROCESSOR ARCHITECTURES

2.1 THE SKYLAKE MICROARCHITECTURE


The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures.
The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

[Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture: the front end (BPU and 32K L1 instruction cache feeding the MSROM at 4 uops/cycle, the legacy decode pipeline at 5 uops/cycle, and the decoded icache (DSB) at 6 uops/cycle) fills the Instruction Decode Queue (IDQ, or micro-op queue); after Allocate/Rename/Retire/MoveElimination/ZeroIdiom, the scheduler issues to ports 0, 1, 5, 6 (integer ALUs, vector FMA/MUL/Add/ALU/Shuffle/Shift, LEA, divide, branches) and ports 2, 3, 4, 7 (load/store address and store data), backed by the 32K L1 data cache and the 256K unified L2 cache.]

2
Super-scalar processors with OoO
Superscalar: modern processors provide multiple pipelines/functional units so that more instructions can be in flight at the same time, providing instruction-level parallelism
Fetch multiple instructions at the same time
Execute them at the same time
Out-of-order execution: modern processors dynamically extract the data flow among instructions to achieve better instruction-level parallelism
Instructions without data dependencies can be executed in parallel (see the small example below)
Buffer the results and commit them after previous instructions have finished
If you're interested, we will discuss more about this in
CSC456 (Spring 2018, planned)
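A tiny illustration (an added sketch, not from the slides; the function and variable names are arbitrary): an OoO superscalar core can issue the first two statements in the same cycle because they have no data dependence, while the third must wait for both results.

int ilp_example(int a, int b, int d, int e)
{
    int c = a + b;   /* independent of the next statement          */
    int f = d * e;   /* no dependence, can issue in the same cycle */
    int g = c + f;   /* depends on both c and f, so it must wait   */
    return g;
}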

3
Simplified SuperScalar+OOO pipeline
[Diagram: Instruction Fetch -> Instruction Decode -> register renaming logic -> Instruction Schedule -> Execution Units -> Data Memory -> Reorder Buffer/Commit, with a branch predictor steering fetch. Fetch a bunch of instructions; the renaming logic extracts the data flow among instructions; the scheduler issues instructions whenever their inputs and target functional units are ready; the reorder buffer updates the result based on the compiled order.]
4
Pentium 4 vs. Athlon 64
Application: 80% ALU, 20% branch, 90% branch prediction accuracy, no data dependencies, perfect cache. Consider the two machines:
Pentium 4 with 20 pipeline stages, branch resolved in stage 19, running at 3 GHz
  Stages 1-20: TC Nxt IP | TC Fetch | Drive | Alloc | Rename | Que | Sch | Sch | Sch | Disp | Disp | RF | RF | Ex | Flgs | Br Ck | Drive
Athlon 64 with 12 pipeline stages, branch resolved in stage 10, running at 2.7 GHz (11% longer cycle time)
  Stages 1-12: Inst. Addr Decode | Inst Mem Read | Inst. Byte Pick | ID1 | ID2 | Inst. Dbl. & Pack | ID and Pack | Dispatch | Scheduling | Execution | D-Cache Address | D-Cache Access
Which one is faster?
A. Athlon 64
B. Pentium 4
CPI_P4 = 80%*1 + 20%*90%*1 + 20%*10%*19 = 1.36
CPI_Athlon64 = 80%*1 + 20%*90%*1 + 20%*10%*10 = 1.18
The Pentium 4 needs at least a 15% faster clock rate than the Athlon 64 to achieve the same performance.
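As a quick check of the arithmetic above (an added calculation, not on the original slide):
Time per instruction, Pentium 4: 1.36 cycles / 3 GHz = about 0.453 ns
Time per instruction, Athlon 64: 1.18 cycles / 2.7 GHz = about 0.437 ns
So the Athlon 64 wins, and the Pentium 4 would need a clock about 1.36/1.18 = 1.15x (15%) higher than the Athlon 64's just to break even.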
5
Simplified SMT-OOO pipeline

[Diagram: four front ends (Instruction Fetch: T0, T1, T2, T3) feed a shared Instruction Decode stage, register renaming logic, Instruction Schedule, Execution Units, and Data Cache; each thread has its own reorder buffer (ROB: T0, T1, T2, T3).]

6
Simultaneous Multi-Threading (SMT)
Fetch instructions from different threads/processes to fill the pipeline
Exploit thread-level parallelism (TLP) to solve the problem of insufficient ILP in a single thread
Each thread is a separate scheduling entity in the OS
Known as Hyper-Threading in Intel processors
Keep separate architectural states for each thread
  PC
  Register files
  Reorder buffer
Creates an illusion of multiple processors for OSs
The rest of the superscalar processor hardware is shared
Invented by Dean Tullsen, Hung-Wei's PhD advisor

7
SMT
How many of the following statements about SMT are correct?
SMT makes processors with deep pipelines more tolerant of mis-predicted branches
SMT can improve the latency of a single-threaded application
  It can hurt, because the thread shares resources with other threads.
SMT processors can better utilize hardware during cache misses compared with superscalar processors of the same issue width
SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications.
A. 0
B. 1
C. 2
D. 3
E. 4

8
Announcement
Check your grades as soon as possible
Homework #4 due 4/26 (Wednesday)
Final review this Wednesday
You may be getting some surprise!
Project #3 is up. Due 5/1 (Monday)
Project #2: You may revise and resubmit before 5/1 (10% penalty)
Final exam 5/10 8am
ClassEval

9
Outline
Thread-level parallelism & multithreaded processors
Brief introduction to parallel computing
Heterogeneous computing and future perspectives

10
Chip-multiprocessor

11
A wide-issue processor or
multiple narrower-issue processors?
What can you do within a 21 mm * 21 mm area?
[Figure 2. Floorplan for the six-issue dynamic superscalar microprocessor; Figure 3. Floorplan for the four-way single-chip multiprocessor. Both fit the same 21 mm * 21 mm die. The superscalar floorplan spends most of its area on the instruction fetch engine and TLB, instruction decode & rename, reorder buffer, instruction queues and out-of-order logic, and wide integer and floating point units, around a 32 KB I-cache, a 32 KB D-cache, and a 256 KB on-chip L2 cache. The multiprocessor floorplan packs four 2-issue processors, each with its own 8K I-cache and 8K D-cache, connected by a crossbar to a shared 256 KB on-chip L2 cache.]

A 6-issue superscalar processor: 3 integer ALUs, 3 floating point ALUs, 3 load/store units
Four 2-issue superscalar processors: 4*1 integer ALUs, 4*1 floating point ALUs, 4*1 load/store units
You will have more ALUs if you choose the multiprocessor!

From the paper: "...prevents the instruction fetch mechanism from becoming a bottleneck, since the 6-way execution engine requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture. ... [each of the MP's] processors is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP because the 6-way processor had three units of each type, while the 4-way MP must have four, one for each CPU."

12
Die photo of a CMP processor

13
CMP advantages
How many of the following are advantages of CMP over a traditional superscalar processor?
CMP can exploit thread-level parallelism
CMP can deliver better instruction throughput within the same die area (chip size)
CMP can achieve better ILP for each running thread
  Not really; this depends on the single-core microarchitecture
CMP can improve the performance of a single-threaded application without modifying code
  Not really; a single-threaded program cannot use functional units from other processor cores
A. 0
B. 1
C. 2
D. 3
E. 4

14
Speeding up a single application on multithreaded processors

15
Parallel programming
To exploit CMP/SMT parallelism you need to break your computation into multiple processes or multiple threads
Processes (in OS/software systems)
  Separate programs actually running (not sitting idle) on your computer at the same time
  Each process has its own virtual memory space, and you need to explicitly exchange data using inter-process communication APIs
Threads (in OS/software systems)
  Independent portions of your program that can run in parallel
  All threads share the same virtual memory space
We will refer to these collectively as threads
A typical user system might have 1-8 actively running threads; servers can have more if needed (the sysadmins will hopefully configure it that way)

16
Create threads/processes
Creating threads/processes is the only way we can improve a single application's performance on CMP/SMT
You can use fork() to create a child process
Or you can use pthread or OpenMP to compose multi-threaded programs
  Threads share the same memory space with each other

/* Do matrix multiplication */
for(i = 0 ; i < NUM_OF_THREADS ; i++)
{
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);   /* spawn a thread */
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    pthread_join(thread[i], NULL);   /* synchronize: wait for a thread to terminate */
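For reference, a minimal compilable sketch of the same pattern (threaded_blockmm here is a placeholder body, not the actual project code); build with gcc -pthread:

#include <stdio.h>
#include <pthread.h>

#define NUM_OF_THREADS 4

static void *threaded_blockmm(void *arg)
{
    int id = *(int *)arg;                 /* each thread reads its own id */
    printf("thread %d: multiply my block\n", id);
    return NULL;
}

int main(void)
{
    pthread_t thread[NUM_OF_THREADS];
    int tids[NUM_OF_THREADS];
    int i;

    for (i = 0; i < NUM_OF_THREADS; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);  /* spawn */
    }
    for (i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);    /* wait for every thread to terminate */
    return 0;
}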

17
Supporting the shared memory model
Provide a single memory space that all processors can share
All threads within the same program share the same address space
Threads communicate with each other using shared variables in memory
Provides the same memory abstraction as single-threaded programming

18
Simple idea...
Connect all processors and the shared memory to a bus.
Processor speed will be limited, because all devices on a bus must run at the same speed

Core 0 Core 1 Core 2 Core 3

Bus

Shared $

19
Memory hierarchy on CMP
Each processor has its own local cache
[Diagram: Core 0-Core 3, each with its own local $, connected through a bus to a shared $]
20
Cache on multiprocessors
Coherency
  Guarantees that all processors see the same value for a variable/memory address when they need the value at the same time
  What value should be seen
Consistency
  All threads see changes to data in the same order
  When each memory operation should be done
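A classic way to see the consistency question (an added sketch, not from the slides): with x and y both 0, can both threads read 0? The answer depends on the order in which each core's write becomes visible to the other.

#include <stdio.h>
#include <pthread.h>

int x = 0, y = 0, r1, r2;

static void *t1(void *arg) { x = 1; r1 = y; return NULL; }   /* write x, then read y */
static void *t2(void *arg) { y = 1; r2 = x; return NULL; }   /* write y, then read x */

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r1 = r2 = 0 means both reads bypassed the other thread's write;
       whether that outcome is allowed is defined by the consistency model. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}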

21
What will each thread see?
Assuming that we are running the following code on a CMP with some cache coherency protocol, which output is NOT possible? (a is initialized to 0)

thread 1: while(1) printf("%d ", a);
thread 2: while(1) a++;

A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100

22
It's show time!
Demo!

thread 1: while(1) printf("%d ", a);
thread 2: while(1) a++;
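A minimal sketch of the demo program (an assumed reconstruction; the real demo may differ): a is a plain global int, one thread keeps printing it while the other keeps incrementing it, and what gets printed depends on compiler optimizations and the coherency protocol.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

int a = 0;

static void *reader(void *arg)
{
    while (1) {                      /* thread 1: print a forever     */
        printf("%d ", a);
        fflush(stdout);
        usleep(100000);
    }
    return NULL;
}

static void *writer(void *arg)
{
    while (1)                        /* thread 2: increment a forever */
        a++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    pthread_join(t1, NULL);          /* never returns; stop with Ctrl-C */
    return 0;
}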

23
Cache coherency
Assuming that we are running the following code on
a CMP with some cache coherency protocol, which
output is NOT possible? (a is initialized to 0)

thread 1: while(1) printf("%d ", a);
thread 2: while(1) a++;

A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100

24
Hard to debug

thread 1:
int loop;

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:
void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}

25
Hard to debug

thread 1:
volatile int loop;   /* volatile: make sure the compiled code re-reads this from memory every time */

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:
void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}

26
Shared queue

int ARRAY_SIZE = 64;
int *array;
volatile int array_index;

void* update_array(void *x)
{
    int *thread_id = (int *)x;
    int i;
    for(; array_index < ARRAY_SIZE; array_index++) {
        array[array_index]++;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i;
    pthread_t thread[NUM_OF_THREADS];
    int thread_args[NUM_OF_THREADS];
    ARRAY_SIZE = atoi(argv[1]);
    array = (int *)calloc(ARRAY_SIZE, sizeof(int));
    array_index = 0;
    for(i = 0; i < NUM_OF_THREADS; i++) {
        thread_args[i] = i;
        pthread_create(&thread[i], NULL, update_array, &thread_args[i]);
    }
    for(i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);
    for(i = 0; i < ARRAY_SIZE; i++) {
        switch(array[i])
        {
            case 0:
                printf("No one cares task %d! QQ\n", i);
                return 0;
                break;
            case 1:
                break;
            default:
                printf("Task %d has done more than once!\n", i);
                return 0;
        }
    }
    return 0;
}

(Every thread runs the same update_array; the slide shows the function once per thread.)

27
How to make it work?
You need these instructions to make it work:

volatile int lock = 0;

inline void atomic_increment(volatile int *pw) {
    asm(
        "lock\n\t"
        "incl %0\n\t"
        :
        "=m"(*pw):   // output (%0)
        "m"(*pw):    // input (%1)
        "cc"         // clobbers
    );
}

inline int atomic_compare_and_exchange(int requiredOldValue, volatile int* _ptr, int newValue, int sizeOfValue) {
    int old;
    __asm volatile
    (
        "mov %3, %%eax;\n\t"
        "lock\n\t"
        "cmpxchg %4, %0\n\t"
        "mov %%eax, %1\n\t"
        :
        "=m" ( *_ptr ), "=r" ( old ):                                // outputs (%0, %1)
        "m" ( *_ptr ), "r" ( requiredOldValue ), "r" ( newValue ):   // inputs (%2, %3, %4)
        "memory", "eax", "cc"                                        // clobbers
    );
    return old;
}

void* update_array(void *x) {
    int *thread_id = (int *)x;
    int i;
    do
    {
        while(atomic_compare_and_exchange(0, &lock, 1, sizeof(lock)) != 0);  // acquire the lock
        array[array_index]++;
        atomic_increment(&array_index);
        atomic_compare_and_exchange(1, &lock, 0, sizeof(lock));             // release the lock
    }
    while(array_index < ARRAY_SIZE);
    return NULL;
}
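The same effect can also be obtained without hand-written assembly. A sketch using C11 <stdatomic.h> (shown for comparison, not the slide's code; update_array_atomic and next_index are illustrative names): atomic_fetch_add hands each thread a unique index, so no explicit spinlock is needed.

#include <stdatomic.h>

extern int *array;       /* defined in the earlier slide's code */
extern int ARRAY_SIZE;

atomic_int next_index = 0;

void *update_array_atomic(void *x)
{
    int idx;
    /* atomically claim the next index; each index is handed out exactly once */
    while ((idx = atomic_fetch_add(&next_index, 1)) < ARRAY_SIZE)
        array[idx]++;
    return NULL;
}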
28
Take-away
You need to understand the design of your multithreaded hardware to implement your ideas correctly

30
The changing role of
GPUs

31
GPU (Graphics Processing Unit)
Originally for displaying images
  HD video: 1920*1080 pixels * 60 frames per second
Graphics processing pipeline:
  Input Assembler -> Vertex Shader -> Geometry Shader -> Setup & Rasterizer -> Pixel Shader -> Raster Operations / Output Merger
These shaders need to be programmable to apply different rendering effects/algorithms (Phong shading, Gouraud shading, etc.)
32
Basic concept of shading
L (to light), N (surface normal), R (reflection), and V (to camera) are all vectors
  Iamb = Kamb Mamb
  Idiff = Kdiff Mdiff (N . L)
  Ispec = Kspec Mspec (R . V)^n
  Itotal = Iamb + Idiff + Ispec

For each point/pixel:
void main(void)
{
    // normalize vectors after interpolation
    vec3 L = normalize(o_toLight);
    vec3 V = normalize(o_toCamera);
    vec3 N = normalize(o_normal);

    // get Blinn-Phong reflectance components
    float Iamb = ambientLighting();
    float Idif = diffuseLighting(N, L);
    float Ispe = specularLighting(N, L, V);

    // diffuse color of the object from texture
    vec3 diffuseColor = texture(u_diffuseTexture, o_texcoords).rgb;

    // combination of all components and diffuse color of the object
    resultingColor.xyz = diffuseColor * (Iamb + Idif + Ispe);
    resultingColor.a = 1;
}

33
What do you want from a GPU?
Given the basic idea of shading algorithms, how many of the following statements would fit the agenda of designing a GPU?
Many ALUs to process multiple pixels simultaneously
  Each frame contains 1920*1080 pixels!
High-bandwidth memory bus to supply pixels, vectors, and textures
  Each pixel requires different L, N, R, V
High-performance branch predictors
  Not really; the behavior is uniform across all pixels
Powerful ALUs to process many different kinds of operators
  Not really; we only need vector add, vector mul, vector div. Low frequency is OK since we have many threads
A. 0
B. 1
C. 2
D. 3
E. 4

34
Nvidia GPU architecture
Hardware support throughout the design to enable new programming model capabilities
GK210 expands upon GK110's on-chip resources, doubling the available register file and shared memory capacities per SMX.

SMX (Streaming Multiprocessor)

35
Streaming Multiprocessor (SMX) Architecture

Inside each SMX


The Kepler GK110/GK210 SMX unit features several architectural innovations that make it the most powerful multiprocessor we've built for double precision compute workloads.

Each of these
performs the same
operation, but each
of these is also a
thread

36

AMD GPU Architecture


Figure 7: AMD Radeon HD 7970

37


A CU in an AMD GPU
Figure 3: GCN Compute Unit

Another crucial innovation in GCN is coherent caching. Historically, GPUs have relied on specialized caches (such as read-only texture caches) that do not maintain a coherent view of memory. To communicate between cores within a GPU, the programmer or compiler must insert explicit synchronization instructions to flush shared data back to memory. While this approach simplifies the design, it increases overhead for applications which share data. GCN is ...
38
CPU vs. GPU
Comparing the CPU and GPU architectures, how many of the following are correct?
GPU architectures are more tolerant of memory latencies
  Think about SMT
CPU is a better fit if the application requires low-latency operations
  GPU memory is optimized for bandwidth, sacrificing latency
GPU is efficient for workloads like matrix multiplication
GPU is inefficient if the workload needs only a small set of input data points but contains lots of decision making and backward references
  The utilization of the GPU will be low; plus, GPU ALUs are slower than CPU ALUs
A. 0
B. 1
C. 2
D. 3
E. 4

39
Programming GPGPU
A GPGPU application contains a host program and GPU kernels
  Host program: a C/C++-based CPU program that invokes the GPU API
  GPU kernels: C/C++-like programs running on GPUs
Programming models
  CUDA (Compute Unified Device Architecture)
    Proposed by NVIDIA
    Supported only by NVIDIA GPUs
  OpenCL
    Maintained by Khronos Group (non-profit)
    Supported by Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS

40
What does a host program look like?

int main (int argc, const char * argv[]) {
    // Initialize GPU runtime
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);

    // Initialize host memory objects
    float *a = (float *)malloc(ARRAY_SIZE*ARRAY_SIZE*sizeof(cl_float));
    float *b = (float *)malloc(ARRAY_SIZE*ARRAY_SIZE*sizeof(cl_float));
    float *c = (float *)malloc(ARRAY_SIZE*ARRAY_SIZE*sizeof(cl_float));
    if(a == NULL || b == NULL || c == NULL)
        fprintf(stderr, "allocating array c failed\n");
    else {
        init(a, ARRAY_SIZE);
        init(b, ARRAY_SIZE);
    }

    // Allocate GPU memory space & copy data to GPU memory
    void *gpu_a = gcl_malloc(sizeof(cl_float) * ARRAY_SIZE*ARRAY_SIZE, a,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
    void *gpu_b = gcl_malloc(sizeof(cl_float) * ARRAY_SIZE*ARRAY_SIZE, b,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
    void *gpu_c = gcl_malloc(sizeof(cl_float) * ARRAY_SIZE*ARRAY_SIZE, NULL,
                             CL_MEM_WRITE_ONLY);

    // Submit the GPU kernel to the runtime
    dispatch_sync(queue, ^{
        cl_ndrange range = { 2, {0, 0, 0}, {ARRAY_SIZE, ARRAY_SIZE, 0}, {16, 16, 0} };
        matrix_mul_kernel(&range, (cl_float*)gpu_a, (cl_float*)gpu_b, (cl_float*)gpu_c,
                          (cl_int)ARRAY_SIZE);
        // Copy data from GPU memory to main memory
        gcl_memcpy(c, gpu_c, sizeof(cl_float) * ARRAY_SIZE*ARRAY_SIZE);
    });

    gcl_free(gpu_a); gcl_free(gpu_b); gcl_free(gpu_c);
    dispatch_release(queue);
    free(a); free(b); free(c);
    return 0;
}

41
What does a GPU kernel look like?

__kernel void
matrix_mul(__global float* input_a,
           __global float* input_b,
           __global float* output,
           int size)
{
    int col = get_global_id(0);   // identify the current GPU thread
    int row = get_global_id(1);

    float value = 0;
    for (int k = 0; k < size; ++k)
    {
        value += input_a[row * size + k] * input_b[k * size + col];
    }
    output[row * size + col] = value;
}

42
Demo
Matrix multiplication
Blocked algorithm vs. naive OpenCL code
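A sketch of the blocked (tiled) algorithm referred to above (an added illustration, assuming square n x n row-major matrices and that c is zero-initialized; B is a tile size chosen so the working set fits in cache):

#define B 32

void blocked_matmul(const float *a, const float *b, float *c, int n)
{
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                /* multiply one B x B tile of a by one tile of b into a tile of c */
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        float aik = a[i * n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}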

43
How things are connected

DRAM
CPU
GPU

PCIe
Switch

Secondary Storage Devices

44
New overhead/bottleneck emerges

New overhead
DRAM
CPU
GPU

PCIe
Switch

Secondary Storage Devices

45
Closing remarks and
future perspectives

49
Challenges
Moore's law is slowing down
Leakage power constrains the processing power of a single chip
  Discontinuation of Dennard scaling
  Also called the dark silicon problem

50
CPU performance scaling slows down
[Figure 1: best CPU SPECRate (left axis, 0-80) and best supercomputer FLOPS (right axis, 0-4E+07), plotted from Oct-06 to Jan-16. CPU performance improved only about 1.5x in 56 months; supercomputers improved about 2.7x over a similar 56-month span, and no faster supercomputer has appeared since Dec 2012.]
Sources:
IDC's Digital Universe Study, sponsored by EMC, December 2012
https://www.spec.org/cpu2006/results/
http://www.top500.org/statistics/sublist/
51
New opportunities

DRAM
CPU
GPU

PCIe
Switch

Processors are everywhere in your computers. But how do/will you use them in the right way?

Secondary Storage Devices

52
Conclusion
In the past, software engineers and hardware engineers worked on different sides of the ISA abstraction
  Software engineers have no idea about what happens in processors/hardware
  Hardware engineers have no sense of what the demands of applications are
  This works fine if we can keep accelerating CPUs, but that is not true anymore
We need new execution & programming models to better utilize these hardware components
We need innovative computer system design to address the challenges from process technologies and the application demands
  Artificial intelligence
  3D VR/AR
Hope to see you again in CSC456 Spring 2018 (planned)

53
We will talk more about high-performance/heterogeneous computer system design, as well as corresponding techniques to accelerate your programs, in CSC456 next Spring!

54
