CSE 820 Graduate Computer Architecture: Dr. Enbody

This document provides information about a graduate computer architecture course (CSE 820). It introduces the instructor, Dr. Enbody, and discusses his background and research interests. It outlines the course objectives, which involve studying advanced computer architecture concepts like modern processor design and multicore systems. More than half the course will cover material from the textbook, with the remainder covering additional topics. Students will complete homework assignments and be evaluated based on exams, homework, and participation. The document schedules topics to be covered over the semester and provides examples of potential "cool stuff" that may be discussed, including new processor architectures.


1/10/11

CSE 820
Graduate Computer Architecture
Richard Enbody

Dr. Enbody

Born and raised in NH and ME


Former High School Math Teacher
At MSU since 1987
Research
Computer Security
Computer Architecture

Hockey and squash player


Objectives
In this course students will study advanced
concepts in computer architecture. The
emphasis is on modern processor design,
and will include some multicore design. More
than half the time will be spent with material
related to the textbook; the remainder will be
material not in the text. Research papers will
be assigned to be read and analyzed.

Prerequisites
Assume undergraduate computer
architecture course such as CSE 420


Grading
30% Homework
30% Midterm Exam (Tuesday, March 1 in class)
35% Final Exam (Monday, May 2, 7:45 - 9:45 AM)
05% Classroom Participation
Course grade:
93% and above is a 4.0;
85% - 92% is a 3.5;
80% - 84% is a 3.0, etc.

Schedule
First half: text
Midterm
Second half: finish text
then cool architecture stuff
Final
In-between: readings, writings


Cool Stuff?
Possibilities
Virtualization support
IBM Cell processor
Multi-cores
Newest Intel and AMD chips
Google architecture
Power, Thermal, Skew issues
Asynchronous
Graphic processing

Homework
Most are brief overviews
of assigned reading,
e.g. one page.


Use some of Patterson's slides
(text author)

Intel Aubrey Isle 32-core CPU


Why?
Intel's response to GPGPU-based supercomputers running CUDA
It is all about FLOPS per Watt


Meanwhile in your pocket


Motorola Atrix 4G
NVIDIA Tegra2 processor (40nm)
Dual-core ARM Cortex-A9 CPU
Out-of-order processing
1080p HDTV - HDMI
1 GHz
L2 cache 1 MB (shared?)
L1 32KB I & 32KB D per core
1 GB memory
8-core GPU
12 MP camera with 16X zoom

Algorithms
A benchmark production-planning model solved using linear
programming would have taken 82 years to solve in 1988.
Fifteen years later (2003) it could be solved in roughly 1 minute,
an improvement by a factor of roughly 43 million.
A factor of roughly 1,000 was due to increased processor speed;
a factor of roughly 43,000 was due to improvements in algorithms!
Professor Martin Grötschel,
Konrad-Zuse-Zentrum für Informationstechnik Berlin
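The decomposition quoted above can be sanity-checked with quick arithmetic (a sketch using the rounded factors from the slide):

```python
# Rounded factors quoted above.
hardware_factor = 1_000        # processor speed improvement, 1988-2003
algorithm_factor = 43_000      # algorithmic improvement over the same period

# The total improvement is the product of the two independent factors.
total = hardware_factor * algorithm_factor
print(f"{total:,}")  # roughly the "43 million" overall improvement
```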


Outline
Computer Science at a Crossroads
Computer Arch. vs. Instruction Set Arch.
What Computer Architecture brings to table

Crossroads: Conventional Wisdom in Comp. Arch

Old: Power is free, transistors expensive
New: "Power wall": power expensive, transistors free
(can put more on chip than can afford to turn on)

Old: increasing Instruction Level Parallelism (ILP) via compilers and
innovation (out-of-order, speculation, VLIW, ...)
New: "ILP wall": law of diminishing returns on more HW for ILP

Old: multiplies are slow, memory access is fast
New: "Memory wall": memory slow, multiplies fast
(200 clock cycles to DRAM memory, 4 clocks for multiply)

Old: uniprocessor performance 2X / 1.5 yrs
New: Power Wall + ILP Wall + Memory Wall = Brick Wall
Uniprocessor performance now 2X / 5(?) yrs

Sea change in chip design: multiple "cores"
(2X processors per chip / ~2 years)
More, simpler processors are more power efficient


Crossroads: Uniprocessor Performance (SPECint)

[Figure: performance relative to the VAX-11/780, plotted 1978-2006 on a log scale; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.]

VAX: 25%/year, 1978 to 1986 (technology driven)
RISC + x86: 52%/year, 1986 to 2002 (architectural and organizational driven)
Since 2002: ??%/year
SPECfp increased faster.


Sea Change in Chip Design

Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz,
10-micron PMOS, 11 mm² chip
RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz,
3-micron NMOS, 60 mm² chip
A 125 mm² chip in 0.065-micron CMOS holds the equivalent of
2312 copies of RISC II + FPU + Icache + Dcache
RISC II shrinks to ~0.02 mm² at 65 nm
Caches via DRAM or 1-transistor SRAM (www.tram.com)?
Proximity Communication via capacitive coupling at > 1 TB/s?
(Ivan Sutherland @ Sun / Berkeley)
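The RISC II shrink quoted above follows from classical area scaling; a sketch with the slide's numbers (3-micron original process, 65 nm target, 60 mm² die):

```python
feature_old_um = 3.0      # RISC II process feature size (microns)
feature_new_um = 0.065    # 65 nm CMOS
area_old_mm2 = 60.0       # original RISC II die area

# Ideal scaling: area shrinks with the square of the linear
# feature-size ratio.
shrink = (feature_old_um / feature_new_um) ** 2
area_new_mm2 = area_old_mm2 / shrink
print(round(area_new_mm2, 3))  # ~0.028 mm^2, matching the "~0.02 mm^2" estimate
```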

Processor is the new transistor?

Déjà vu all over again?

"Multiprocessors imminent in 1970s, '80s, '90s; today's processors
are nearing an impasse as technologies approach the speed of light."
David Mitchell, The Transputer: The Time Is Now (1989)
Transputer was premature:
custom multiprocessors strove to lead uniprocessors;
procrastination rewarded: 2X sequential perf. / 1.5 years

"We are dedicating all of our future product development to
multicore designs. This is a sea change in computing."
Paul Otellini, President, Intel (2004)
The difference this time: all microprocessor companies switch to
multiprocessors (AMD, Intel, IBM, Sun, ...)
Procrastination penalized: 2X sequential perf. / 5 yrs
Biggest programming challenge: going from 1 to 2 CPUs


Problems with Sea Change

Algorithms, programming languages, compilers, operating systems,
architectures, libraries, ... are not ready to supply Thread Level
Parallelism or Data Level Parallelism for 1000 CPUs / chip
Architectures are not ready for 1000 CPUs / chip

Unlike Instruction Level Parallelism, this cannot be solved by
computer architects and compiler writers alone, but it also cannot
be solved without the participation of computer architects

Outline
Computer Science at a Crossroads
Computer Arch. vs. Instruction Set Arch.
What Computer Architecture brings to table


Instruction Set Architecture: Critical Interface


software

instruction set

hardware

Properties of a good abstraction


Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels

ISA Example: MIPS

Programmable storage:
2^32 x bytes
31 x 32-bit GPRs (r0 = 0), registers r0 ... r31, plus PC, HI, LO
32 x 32-bit FP regs (paired DP)

Data types? Format? Addressing modes?

Arithmetic/logical:
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI,
SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory access:
LB, LBU, LH, LHU, LW, LWL, LWR,
SB, SH, SW, SWL, SWR

Control (32-bit instructions on word boundary):
J, JAL, JR, JALR,
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL


Instruction Set Architecture


"... the attributes of a [computing] system as seen by the
programmer, i.e. the conceptual structure and functional behavior,
as distinct from the organization of the data flows and controls,
the logic design, and the physical implementation."
Amdahl, Blaauw, and Brooks, 1964

-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Patterson:
ISA vs. Computer Architecture

Old definition of computer architecture = instruction set design
Other aspects of computer design were called "implementation"
(insinuating that implementation is uninteresting or less challenging)
Patterson's view: computer architecture >> ISA
The architect's job is much more than instruction set design;
technical hurdles today are more challenging
than those in instruction set design
Since instruction set design is not where the action is, some conclude
computer architecture (using the old definition) is not where the action is:
disagree on the conclusion,
agree that ISA is not where the action is


Comp. Arch. is an Integrated Approach


What really matters
is the functioning of the complete system:
hardware, runtime system, compiler,
operating system, and application
In networking, this is called the End-to-End argument

Computer architecture is not just about


transistors, individual instructions, or particular
implementations
E.g., Original RISC replaced complex instructions
with a compiler + simple instructions

Computer Architecture is
Design and Analysis
Design

Architecture is an iterative process:


Searching the space of possible designs
At all levels of computer systems

Analysis


Outline
Computer Science at a Crossroads
Computer Arch. vs. Instruction Set Arch.
What Computer Architecture brings to table
Technology Trends



What Computer Architecture brings to Table

Other fields often borrow ideas from architecture

Quantitative Principles of Design:
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl's Law
5. The Processor Performance Equation

Careful, quantitative comparisons:
Define, quantify, and summarize relative performance
Define and quantify relative cost
Define and quantify dependability
Define and quantify power

Culture of anticipating and exploiting advances in technology
Culture of well-defined interfaces
that are carefully implemented and thoroughly checked

1) Taking Advantage of Parallelism

Increasing throughput of a server
via multiple processors or multiple disks
Detailed HW design:
Carry-lookahead adders use parallelism to speed up computing sums
from linear to logarithmic in the number of bits per operand
Multiple memory banks searched in parallel in set-associative caches
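To illustrate the carry-lookahead idea mentioned above, here is a minimal sketch (function name is mine, not from the course): per-bit generate/propagate signals determine every carry. Real hardware evaluates the carry recurrence as parallel prefix logic; the serial loop here is just for clarity.

```python
def cla_add(a, b, width=8):
    """Add two unsigned integers using carry-lookahead g/p signals."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate: both bits 1
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]  # propagate: either bit 1
    # Carry recurrence c[i+1] = g[i] | (p[i] & c[i]); a hardware CLA
    # flattens these expressions into parallel logic (logarithmic depth
    # with prefix trees) instead of this serial loop.
    c = [0] * (width + 1)
    for i in range(width):
        c[i + 1] = g[i] | (p[i] & c[i])
    bits = [(((a >> i) ^ (b >> i)) & 1) ^ c[i] for i in range(width)]
    return sum(bit << i for i, bit in enumerate(bits)) | (c[width] << width)

print(cla_add(100, 57))  # 157
```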

Pipelining: overlap instruction execution
to reduce the total time to complete an instruction sequence.
Not every instruction depends on its immediate predecessor, so
executing instructions completely or partially in parallel is possible.
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch)
2) Register Read (Reg)
3) Execute (ALU)
4) Data Memory Access (Dmem)
5) Register Write (Reg)
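Under the idealized assumption of no hazards, the payoff of the 5-stage pipeline above can be sketched numerically (a toy cycle count, not a simulator):

```python
def serial_cycles(n_instr, n_stages=5):
    # Unpipelined: each instruction occupies all stages in turn.
    return n_instr * n_stages

def pipelined_cycles(n_instr, n_stages=5):
    # Ideal pipeline: fill for n_stages cycles, then retire one
    # instruction per cycle (assumes no structural/data/control hazards).
    return n_stages + (n_instr - 1)

n = 100
print(serial_cycles(n), pipelined_cycles(n))   # 500 vs 104
# Speedup approaches n_stages as the instruction count grows.
print(serial_cycles(n) / pipelined_cycles(n))
```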


Pipelined Instruction Execution

[Figure: pipeline timing diagram, Cycle 1 through Cycle 7. Instructions enter in order; each flows through Ifetch, Reg, ALU, DMem, Reg, offset one cycle behind its predecessor, so several instructions overlap in time.]

Limits to pipelining: hazards prevent the next instruction
from executing during its designated clock cycle.
Structural hazards:
attempt to use the same hardware to do two different things at once
Data hazards:
instruction depends on the result of a prior instruction still in the pipeline
Control hazards:
caused by the delay between the fetching of instructions and
decisions about changes in control flow (branches and jumps)


2) The Principle of Locality


The Principle of Locality:
Programs access a relatively small portion of the address space
at any instant of time.

Two Different Types of Locality:


Temporal Locality (Locality in Time):
If an item is referenced,
it will tend to be referenced again soon (e.g., loops, reuse)

Spatial Locality (Locality in Space):


If an item is referenced,
items whose addresses are close by tend to be referenced soon
(e.g., straight-line code, array access)

For the last 30 years, hardware has relied on locality for memory performance
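Hardware's reliance on locality can be illustrated with a small sketch (my own, illustrative): the two calls below do identical work but visit memory in a spatially local order (stride 1) versus a cache-hostile large stride. In compiled code the stride-1 version runs measurably faster; in CPython interpreter overhead masks most of the effect, but the access patterns are the point:

```python
def sum_with_stride(data, stride):
    """Sum every element, visiting them `stride` apart.

    stride=1 is sequential and spatially local; a large stride
    touches a new cache line on nearly every access.
    """
    total = 0
    n = len(data)
    for start in range(stride):
        for i in range(start, n, stride):
            total += data[i]
    return total

data = list(range(1_000_000))
# Same work, same answer, very different memory-access pattern.
assert sum_with_stride(data, 1) == sum_with_stride(data, 4096)
```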


Levels of the Memory Hierarchy

Level         Capacity       Access time              Cost          Staging unit (xfer size)   Managed by
Registers     100s bytes     300-500 ps (0.3-0.5 ns)  (on chip)     instr. operands (1-8 B)    prog./compiler
L1 cache      10s-100s KB    ~1 ns                    ~$1000s/GB    blocks (32-64 B)           cache controller
L2 cache      10s-100s KB    ~10 ns                   ~$1000s/GB    blocks (64-128 B)          cache controller
Main memory   GBytes         80-200 ns                ~$100/GB      pages (4K-8K B)            OS
Disk          10s TB         10 ms (10,000,000 ns)    ~$1/GB        files (MBytes)             user/operator
Tape          infinite       sec-min                  ~$1/GB        -                          -

Upper levels are smaller and faster; lower levels are larger, slower, and cheaper per byte.


3) Focus on the Common Case

Common sense guides computer design;
since it's engineering, common sense is valuable.

In making a design trade-off,
favor the frequent case over the infrequent case.
E.g., the instruction fetch and decode unit is used more frequently
than the multiplier, so optimize it first.
E.g., if a database server has 50 disks per processor, storage
dependability dominates system dependability, so optimize it first.

The frequent case is often simpler
and can be done faster than the infrequent case.
E.g., overflow is rare when adding two numbers, so improve
performance by optimizing the more common case of no overflow.
This may slow down overflow, but overall performance is improved by
optimizing for the normal case.
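The overflow example can be mirrored in software; in this sketch (the names and the saturating policy are mine, for illustration) the no-overflow common case is a single cheap masked add, and overflow handling sits on a rarely taken slow path:

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF
INT32_MIN_U = 0x80000000  # unsigned encoding of INT32_MIN

def add32_saturating(a, b):
    """32-bit add optimized for the common no-overflow case."""
    s = (a + b) & MASK32                       # fast path: one masked add
    sa, sb, ss = (a >> 31) & 1, (b >> 31) & 1, (s >> 31) & 1
    if sa == sb and ss != sa:                  # rare: signed overflow
        return INT32_MAX if sa == 0 else INT32_MIN_U  # slow path: clamp
    return s
```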

What is the frequent case, and how much can performance be
improved by making that case faster? => Amdahl's Law

4) Amdahl's Law

ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 − Fraction_enhanced)


Amdahl's Law example

New CPU 10X faster
I/O-bound server, so 60% of time is spent waiting for I/O
(Fraction_enhanced = 0.4)

Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                = 1 / [ (1 − 0.4) + 0.4 / 10 ]
                = 1 / 0.64
                = 1.56

Apparently, it's human nature to be attracted by "10X faster"
vs. keeping in perspective that it's just 1.6X faster.
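The example works out directly in code; a small helper (my naming) for Amdahl's Law:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 10X-faster CPU, but only 40% of time is CPU (60% waiting on I/O):
print(round(amdahl_speedup(0.4, 10), 2))   # 1.56

# Upper bound as speedup_enhanced grows without limit: 1 / (1 - 0.4)
print(round(1 / (1 - 0.4), 2))             # 1.67
```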

5) Processor performance equation

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

Which design layers affect each factor:

              Inst Count    CPI    Clock Rate
Program           X
Compiler          X         (X)
Inst. Set         X          X
Organization                 X         X
Technology                             X
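The equation above can be evaluated directly; a sketch with made-up example numbers:

```python
def cpu_time(inst_count, cpi, clock_rate_hz):
    """CPU time = instructions x (cycles/instruction) x (seconds/cycle)."""
    return inst_count * cpi * (1.0 / clock_rate_hz)

# Hypothetical program: 1 billion instructions, CPI of 1.5, 2 GHz clock.
print(cpu_time(1_000_000_000, 1.5, 2_000_000_000))  # 0.75 seconds
```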


What's a Clock Cycle?

[Figure: a latch or register feeding combinational logic, which feeds the next latch.]

Old days: 10 levels of gates
Today: determined by numerous time-of-flight issues + gate delays
(clock propagation, wire lengths, drivers)

And in conclusion ...

Computer Architecture >> instruction sets
Computer Architecture skill sets are different:
5 quantitative principles of design
Quantitative approach to design
Solid interfaces that really work
Technology tracking and anticipation

Computer Science is at a crossroads from sequential to parallel computing.
Salvation requires innovation in many fields, including computer architecture.
