Computer Science 146 Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
[email protected]
Lecture 21: Multithreading and I/O
Lecture Outline
HW#5 on webpage
Project Questions?
Tale of two multithreaded x86s
Intel Pentium 4 multithreading
MemoryLogix MLX1 multithreading
Multithreading Paradigms
SuperScalar: Pentium 3, Alpha EV6/7
Coarse MT: IBM Pulsar
Fine MT (fine-grained)
Simultaneous MT: Intel P4-HT, EV8, Others?
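The difference between the paradigms comes down to which thread may use the issue slots in a given cycle. The sketch below is a toy model, not a description of any of the processors named above: the per-cycle ready counts in `READY`, the issue width `WIDTH`, and all function names are made up for illustration, unissued instructions are simply dropped rather than carried over, and the coarse-grained model loses only the stall cycle when it switches threads.

```python
from typing import List

Ready = List[List[int]]  # ready[t][c]: instructions thread t could issue in cycle c


def superscalar(ready: Ready, width: int) -> int:
    """Single-threaded baseline: only thread 0 ever issues."""
    return sum(min(width, r) for r in ready[0])


def coarse_mt(ready: Ready, width: int) -> int:
    """Run one thread until it stalls (nothing ready), then switch to the next."""
    issued, cur = 0, 0
    for c in range(len(ready[0])):
        if ready[cur][c] == 0:
            cur = (cur + 1) % len(ready)  # the stall cycle itself is lost
        else:
            issued += min(width, ready[cur][c])
    return issued


def fine_mt(ready: Ready, width: int) -> int:
    """One thread owns every slot in a cycle; threads alternate round-robin."""
    n = len(ready)
    return sum(min(width, ready[c % n][c]) for c in range(len(ready[0])))


def smt(ready: Ready, width: int) -> int:
    """Slots within a single cycle can be filled from several threads."""
    issued = 0
    for c in range(len(ready[0])):
        free = width
        for thread in ready:
            take = min(free, thread[c])
            issued += take
            free -= take
    return issued


# Hypothetical workload: 2 threads, 6 cycles, 4-wide issue.
WIDTH = 4
READY = [[2, 0, 0, 0, 1, 2],
         [1, 2, 3, 1, 0, 2]]

if __name__ == "__main__":
    total = WIDTH * len(READY[0])
    for model in (superscalar, coarse_mt, fine_mt, smt):
        print(f"{model.__name__:12s} {model(READY, WIDTH)} of {total} slots used")
```

In this toy workload only the SMT model can fill leftover slots in a cycle where the first thread has some, but not enough, ready instructions.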
Pentium 4 Front-End
Pentium 4 Backend
Some queues and buffers are partitioned (each thread can use only its own share of the entries)
Scheduler is oblivious to instruction thread IDs (but the number of entries per thread in each scheduler is limited)
Multithreading Performance
[Figure: IPC (axis 0 to 6) of SMT vs. SuperScalar on SPECint95, Apache, OLTP, DSS, SPECint2000]
Machine configuration:
9-stage pipeline
128KB I/D Cache
6 IntALU (4 Load/Store)
4 FPALUs
8 SMT threads
8-insn fetch (2 threads)
Processors      (unlabeled)  Die Size (mm2)  Est Core Size (mm2)  Core Size Ratio  Typical Speed (MHz)
(not shown)     544          80              ~34                  ~13              >1000
AMD Duron       192          55              ~37                  ~14              >1000
Transmeta 5800  640          55              ~25                  ~10              >800
VIA C3          192          52              ~31                  ~12              >800
ARM 1026EJ-S    32           4.6             2.6                                   >400
MemoryLogix MLX1:
Tiny Multithreaded 586 core
Presented at Microprocessor Forum 2002
MLX1 Design Goals
Up to 8 accesses/cycle
8 banks/line, 8 bytes/bank
Support multiple instruction fetch
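With 8 banks per line and 8 bytes per bank, a line is 64 bytes, and up to 8 accesses can proceed in one cycle only if they fall in different banks. The sketch below illustrates that idea under assumptions that are not from the MLX1 presentation: the bank is taken from address bits [5:3], each bank serves one access per cycle, and the helper names `bank_of` and `cycles_needed` are hypothetical.

```python
from collections import Counter
from typing import Iterable

BANKS_PER_LINE = 8   # from the slide
BYTES_PER_BANK = 8   # from the slide
LINE_BYTES = BANKS_PER_LINE * BYTES_PER_BANK  # 64-byte line


def bank_of(addr: int) -> int:
    """Assumed mapping: consecutive 8-byte chunks go to consecutive banks."""
    return (addr // BYTES_PER_BANK) % BANKS_PER_LINE


def cycles_needed(addrs: Iterable[int]) -> int:
    """Cycles to serve a group of accesses if each bank handles one per cycle."""
    per_bank = Counter(bank_of(a) for a in addrs)
    return max(per_bank.values(), default=0)


if __name__ == "__main__":
    spread = [0, 8, 16, 24, 32, 40, 48, 56]   # one access per bank
    clash = [0, 64, 8]                        # 0 and 64 map to the same bank
    print(cycles_needed(spread), cycles_needed(clash))
```

Under this model, eight accesses spread across the banks finish in one cycle, while two accesses to the same bank serialize.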
MLX1 Summary
MLX1 Tiny x86 core
In 0.13um:
3.5 mm2 (core) + 1.0 mm2 (MMX) + 1.5 mm2 (FPU) = 6.0 mm2
Compared to 146 mm2 for a Pentium 4
Can multithreading buy back the performance?
Sounds interesting
Depends on workloads
Are there enough embedded workloads that are throughput oriented?
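The area figures in the MLX1 summary can be checked with a few lines of arithmetic. The numbers are the ones quoted on the slide; the variable names are just for illustration, and the result says nothing about relative performance.

```python
# Areas in mm2, as quoted on the slide for a 0.13um process.
core_mm2, mmx_mm2, fpu_mm2 = 3.5, 1.0, 1.5
pentium4_mm2 = 146.0

mlx1_mm2 = core_mm2 + mmx_mm2 + fpu_mm2   # total MLX1 area
area_ratio = pentium4_mm2 / mlx1_mm2      # how many MLX1s fit in one Pentium 4 die

if __name__ == "__main__":
    print(f"MLX1 total: {mlx1_mm2:.1f} mm2")
    print(f"Pentium 4 / MLX1 area ratio: {area_ratio:.1f}x")
```

The ratio comes out to roughly 24x, which is the area budget multithreading would have to help win back in throughput.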
I/O bottleneck:
Increasing fraction of time in I/O (relative to CPU)
Similar to Memory Wall problem
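The growing I/O fraction follows from Amdahl's law: if only the CPU part of a job gets faster, the unchanged I/O part takes up more of what remains. The sketch below uses a made-up job (90 s of CPU time, 10 s of I/O) purely as an illustration; the function name and the numbers are assumptions, not data from the lecture.

```python
def io_fraction(cpu_s: float, io_s: float, cpu_speedup: float) -> float:
    """Fraction of run time spent in I/O after the CPU part is sped up."""
    new_cpu_s = cpu_s / cpu_speedup
    return io_s / (new_cpu_s + io_s)


def overall_speedup(cpu_s: float, io_s: float, cpu_speedup: float) -> float:
    """Whole-job speedup when I/O time stays fixed."""
    return (cpu_s + io_s) / (cpu_s / cpu_speedup + io_s)


if __name__ == "__main__":
    # Hypothetical job: 90 s CPU, 10 s I/O.
    for s in (1, 2, 10, 100):
        print(f"CPU {s:>3}x faster: I/O is {io_fraction(90, 10, s):.0%} of run time, "
              f"job is {overall_speedup(90, 10, s):.1f}x faster")
```

With a 10x faster CPU, I/O grows from 10% to more than half of the run time for this hypothetical job, and the whole job speeds up by only about 5x.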
Device  Type     Partner  Data rate
Mouse            Human    0.01
CRT              Human    60,000
Modem   I/O      Machine  2-8
LAN     I/O      Machine  500-600
Tape    Storage  Machine  2000
Disk    Storage  Machine  2000-10,000
I/O Systems
[Figure: processor with cache, connected to I/O controllers for two disks, graphics, and network; the controllers signal the processor through interrupts]
Data utilities
high capacity, hierarchically managed storage
[Figure: disk drive anatomy showing platters (12), inner and outer tracks, sectors, heads, and actuator]
[Figure: disk drive components: platter, spindle, inner track, sector, head, arm, actuator, controller]
[Figure: disk geometry showing track, sector, cylinder, head, and platter]
Characteristics:
Seek Time (~8 ms avg)
  positional latency
  rotational latency
Transfer rate
  10-40 MByte/sec
  Block transfers
Capacity
Response time = Queue + Controller + Seek + Rot + Xfer
Service time = Controller + Seek + Rot + Xfer
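The response-time sum can be worked through for a single request. In the sketch below only the ~8 ms average seek and the 10-40 MByte/sec transfer range come from the slide; the 7200 RPM spindle speed, 1 ms controller overhead, 4 KB request size, 20 MByte/sec rate, and all function names are illustrative assumptions. Average rotational latency is taken as half a rotation.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one full rotation."""
    return 0.5 * 60_000.0 / rpm


def transfer_ms(nbytes: int, mbyte_per_s: float) -> float:
    """Time to move nbytes at the given rate (1 MByte = 10**6 bytes here)."""
    return nbytes / (mbyte_per_s * 1e6) * 1e3


def service_time_ms(controller_ms: float, seek_ms: float, rpm: float,
                    nbytes: int, mbyte_per_s: float) -> float:
    """Service = Controller + Seek + Rot + Xfer."""
    return (controller_ms + seek_ms + rotational_latency_ms(rpm)
            + transfer_ms(nbytes, mbyte_per_s))


def response_time_ms(queue_ms: float, service_ms: float) -> float:
    """Response = Queue + Service."""
    return queue_ms + service_ms


if __name__ == "__main__":
    # Assumed: 1 ms controller, 8 ms seek, 7200 RPM, 4 KB at 20 MByte/sec.
    svc = service_time_ms(1.0, 8.0, 7200, 4096, 20.0)
    print(f"rotational latency: {rotational_latency_ms(7200):.2f} ms")
    print(f"transfer:           {transfer_ms(4096, 20.0):.3f} ms")
    print(f"service time:       {svc:.2f} ms")
    print(f"response (5 ms queue): {response_time_ms(5.0, svc):.2f} ms")
```

For a small request like this, seek and rotation dominate and the transfer itself is a tiny part of the service time, which is why block transfers are preferred.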
MB/$: > 100%/year (2X / 1.0 yrs)
Driven by fewer support chips + increased areal density
source: www.seagate.com
[Figure: areal density vs. year, 1970 to 2000, log scale from 1 to 100,000]
Historical Perspective
1956 IBM RAMAC through early 1970s Winchester
Developed for mainframe computers, proprietary interfaces
Steady shrink in form factor: 27 in. to 14 in
2000s:
1 inch for cameras, cell phones?
Disk History
Year  Data density (Mbit/sq. in.)  Capacity of unit shown (MBytes)
1973  1.7                          140
1979  7.7                          2,300
Disk History
Year  Data density (Mbit/sq. in.)  Capacity of unit shown (MBytes)
1989  63                           60,000
1997  1450                         2300
1997  3090                         8100
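The density figures in the Disk History slides imply a compound annual growth rate, which can be computed directly. The data points below are copied from those slides (using the higher of the two 1997 values); the function name and the choice of intervals are illustrative assumptions.

```python
def annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two data points."""
    return (end / start) ** (1.0 / years) - 1.0


# (year, Mbit/sq. in.) from the Disk History slides.
DENSITY = [(1973, 1.7), (1979, 7.7), (1989, 63.0), (1997, 3090.0)]

if __name__ == "__main__":
    # Growth over each consecutive interval.
    for (y0, d0), (y1, d1) in zip(DENSITY, DENSITY[1:]):
        rate = annual_growth(d0, d1, y1 - y0)
        print(f"{y0}-{y1}: {rate:.0%} per year")
    # Growth over the whole span.
    (y0, d0), (y1, d1) = DENSITY[0], DENSITY[-1]
    print(f"{y0}-{y1}: {annual_growth(d0, d1, y1 - y0):.0%} per year overall")
```

Over the whole 1973 to 1997 span this works out to roughly 37% per year, with the 1989 to 1997 interval growing noticeably faster than the earlier ones.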
Next Wednesday:
Google Cluster
Course Summary and Wrapup
Final Review (may schedule another review before final)