An Introduction
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 18, 2011
[Figure: a timeline of representative machines spanning roughly 10^3 to 10^15 operations per second: EDSAC (1949), IBM 7094 (1959), Cray-1 (1976), Intel Delta (1991), Cray T3E (1996), Cray X1 (2003), Cray XT5 (2009).]
[Figure: ITRS technology roadmap, 1997–2012 (log scale): MB per DRAM chip, logic transistors per chip (millions), and microprocessor clock (MHz) vs. year of technology availability.]
Classical DRAM
• Memory mats: ~ 1 Mbit each
• Row Decoders
• Primary Sense Amps
• Secondary sense amps & “page” multiplexing
• Timing, BIST, Interface
• Kerf
[Figure: DRAM capacity in Gbits per chip (log scale, 10^-6 to 10^3) and percent chip overhead, 1970–2020: historical data vs. SIA and ITRS projections, at production and at introduction.]
[Figure: microprocessor clock rate (MHz, log scale) vs. year (1975–2020) and vs. feature size: classical Moore's Law scaling against historical data and the ITRS max clock rate (12 inverters); both views flatten at about 3 GHz.]
The 2005 ITRS projection called for 5.2 GHz, which production parts never reached; production clock rates remain stuck at 3+ GHz.
Classes of Architecture for High Performance Computers
• Parallel Vector Processors (PVP)
– NEC Earth Simulator, SX-6
– Cray-1, 2, X-MP, Y-MP, C90, T90, X1
– Fujitsu VPP5000 series
• Massively Parallel Processors (MPP)
– Intel Touchstone Delta & Paragon
– TMC CM-5
– IBM SP-2 & 3, Blue Gene/L
– Cray T3D, T3E, Red Storm/Strider
• Distributed Shared Memory (DSM)
– SGI Origin
– HP Superdome
• Single Instruction stream, Multiple Data stream (SIMD)
– Goodyear MPP, MasPar 1 & 2, TMC CM-2
• Commodity Clusters (see the MPI sketch after this list)
– Beowulf-class PC/Linux clusters
– Constellations
– HP Compaq SC, Linux NetworX MCR
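All of these classes are programmed today chiefly through explicit message passing. As a concrete taste of the distributed-memory model used on MPPs and commodity clusters, here is a minimal MPI sketch (mine, not from the slides); every call shown is standard MPI:

/* Minimal MPI program: each process reports its rank.
   Build: mpicc hello.c -o hello    Run: mpirun -np 8 ./hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id, 0..size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut the runtime down cleanly */
    return 0;
}

Each rank is an independent process with its own address space, typically one or more per cluster node; all sharing of data happens through explicit messages.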
• Automated calculating
– 17th century
• Stored program digital electronic
– 1948
• Vector
– 1975
• SIMD
– 1980s
• MPPs
– 1991
• Commodity Clusters
– 1993/4
• Multicore
– 2006
[Figure: THE WALL: memory access time vs. CPU time, 1997–2009, with their diverging ratio (log scale).]
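The wall can be measured directly with the standard pointer-chasing technique. Below is a minimal sketch (mine, not from the slides): a working set far larger than cache forces every dependent load to pay main-memory latency, typically tens to hundreds of CPU cycles.

/* Pointer-chasing sketch of the memory wall (compile: cc -O2 -std=c99). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M entries (~128 MB): far larger than any cache */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo's algorithm: shuffle into one big cycle, so the chase below
       visits every entry exactly once, in cache-hostile random order. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = ((size_t)rand() << 16 ^ (size_t)rand()) % i; /* crude wide index */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];  /* each load depends on the last */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("ended at %zu: %.1f ns per dependent load\n", p, 1e9 * sec / N);
    free(next);
    return 0;
}

On hardware of this era the result is typically on the order of 100 ns per load, a few hundred cycles of a 3 GHz core: the ratio the figure above calls THE WALL.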
Microprocessors no longer realize the full potential of VLSI technology
[Figure: processor performance in ps per instruction (log scale), 1980–2020: the historical 52%/year improvement against the 74%/year pace the underlying technology allowed, slowing to 19%/year; the widening gaps are marked 30:1, 1,000:1, and 30,000:1.]
[Diagram: timeline of the non-accelerated computation (total time T_O) and the accelerated computation (total time T_A), in which the acceleratable portion T_F runs in time T_F / g.]

T_O ≡ time for non-accelerated computation
T_A ≡ time for accelerated computation
T_F ≡ time of portion of computation that can be accelerated
g ≡ peak performance gain for accelerated portion of computation
f ≡ fraction of non-accelerated computation to be accelerated
S ≡ speedup of computation with acceleration applied

    S = \frac{T_O}{T_A}, \qquad f = \frac{T_F}{T_O}

    T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O

    S = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O} = \frac{1}{1 - f + \frac{f}{g}}
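A quick worked example (values chosen here for illustration, not from the slides): if 90% of the computation can be accelerated (f = 0.9) by a factor of ten (g = 10),

    S = \frac{1}{1 - 0.9 + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26

Even with a 10x accelerator the speedup stays well below 10, and as g grows without bound it saturates at 1/(1 - f) = 10: the non-accelerated fraction dominates.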
[Diagram: the accelerated computation, with the acceleratable work divided into n segments t_F, each incurring start-up overhead v.]

n ≡ number of accelerated work segments
v ≡ overhead incurred per accelerated segment

    T_F = \sum_{i=1}^{n} t_{F_i}

    T_A = (1 - f)\,T_O + \frac{f}{g}\,T_O + n\,v

    S = \frac{T_O}{T_A} = \frac{T_O}{(1 - f)\,T_O + \frac{f}{g}\,T_O + n\,v}

    S = \frac{1}{1 - f + \frac{f}{g} + \frac{n\,v}{T_O}}
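The effect of the overhead term is easy to see numerically; the following sketch (illustrative values, not from the course materials) evaluates both forms:

/* Amdahl's law, ideal and with per-segment overhead (illustrative sketch). */
#include <stdio.h>

/* Ideal speedup: fraction f of the work accelerated by peak gain g. */
static double amdahl(double f, double g)
{
    return 1.0 / (1.0 - f + f / g);
}

/* Overhead-adjusted speedup: n segments, overhead v each, baseline time T_O. */
static double amdahl_overhead(double f, double g, int n, double v, double T_O)
{
    return 1.0 / (1.0 - f + f / g + n * v / T_O);
}

int main(void)
{
    double f = 0.9, g = 10.0;   /* 90% of the work, 10x peak gain */
    double T_O = 100.0;         /* baseline run time, arbitrary units */
    printf("ideal:         S = %.2f\n", amdahl(f, g));                          /* 5.26 */
    /* 100 accelerated segments, each paying 0.05 units of start-up overhead */
    printf("with overhead: S = %.2f\n", amdahl_overhead(f, g, 100, 0.05, T_O)); /* 4.17 */
    return 0;
}

Here 100 segments at 0.05 units each add 0.05 to the denominator, cutting the speedup from 5.26 to about 4.17; finer-grained acceleration (larger n) makes this worse, which is why overhead, not peak gain g, often dominates in practice.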
• Computational Scientists
• HPC Researchers
• System Administrators
• Design Engineers
January   Tu 18   Introduction
          Th 20   Parallel Computer Architecture, Quiz 1
          Tu 25   Commodity Clusters
          Th 27   Benchmarking, Quiz 2
February  Tu 1    Throughput Computing
April     Tu 19   Spring Break
          Th 21   Spring Break
          Tu 26   Scheduling / Workload Management Systems
          Th 28   Checkpointing / System Administration, Project Due, Quiz 14
May       Tu 3    Beyond and Beyond
          Th 5    Class Summary / Final Exam Review
          Th 12   FINAL EXAM (7:30 – 9:30 AM)
Arete [arete.cct.lsu.edu]
● 64 compute nodes × 8 cores
● Quad-core AMD Opteron processors @ 2.4 GHz
● 8 GB RAM per node
● 24 TB of shared storage
● 1 Gb/s Ethernet network interface
● 10 Gb/s InfiniBand interconnect
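As a back-of-the-envelope check (my arithmetic, not a figure from the slides), assuming this Opteron generation retires 4 double-precision floating-point operations per core per cycle, the machine's aggregate peak is

    64\ \text{nodes} \times 8\ \text{cores} \times 2.4\ \text{GHz} \times 4\ \text{FLOPs/cycle} \approx 4.9\ \text{TFLOPS}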
Code of Conduct: http://appl003.lsu.edu/slas/dos.nsf/$Content/Code+of+Conduct?OpenDocument
Thinking Machines CM-2
• Thinking Machines Corporation, 1987.
• Hypercube architecture with 65,536 processors.
• SIMD.
• Performance in the GFLOPS range.

Earth Simulator
• Japan, 1997.
• Fastest supercomputer from 2002 to 2004: 35.86 TFLOPS.
• 640 nodes, each with eight vector processors and 16 gigabytes of memory.

IBM Blue Gene/L
• IBM, 2004.
• First supercomputer to sustain over 100 TFLOPS on a real-world application, a three-dimensional molecular dynamics code (ddcMD).