Simultaneous Multi-Threaded Design
Virendra Singh
Associate Professor
Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
http://www.ee.iitb.ac.in/~viren/
E-mail: [email protected]
EE-739: Processor Design
Lecture 35 (11 April 2013) CADSL
Simultaneous Multi-threading ...
[Figure: issue-slot occupancy by cycle (cycles 1 to 9) across the 8 functional units (M, M, FX, FX, FP, FP, BR, CC), shown for one thread and for two threads sharing the units.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
11 Apr 2013 EE-739@IITB 2 CADSL
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1 through Thread 5, and idle slots.]
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation, so what is the impact of fine-grained scheduling on single-thread performance?
  – Does a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  – Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• The clock cycle time must not be affected, especially in
  – Instruction issue: more candidate instructions need to be considered
  – Instruction completion: choosing which instructions to commit may be challenging
• Cache and TLB conflicts generated by SMT must not degrade performance
Basic Out-of-order Pipeline
SMT Pipeline
Simultaneous Multithreading
Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  – A large set of virtual registers that can be used to hold the register sets of independent threads
  – Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  – Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• This requires just adding a per-thread renaming table and keeping separate PCs
  – Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Source: "Compaq Chooses SMT for Alpha", Microprocessor Report, December 6, 1999
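The idea of independent commitment through a logically separate reorder buffer per thread can be illustrated with a small sketch. This is only a toy model under stated assumptions: the class name `PerThreadROB`, the round-robin sharing of commit bandwidth, and the string tags are all hypothetical and are not taken from any real design.

```python
from collections import deque


class PerThreadROB:
    """Toy model: one logical reorder buffer (FIFO) per hardware thread."""

    def __init__(self, n_threads):
        self.queues = [deque() for _ in range(n_threads)]

    def allocate(self, tid, tag):
        # Entries enter in program order for their own thread.
        entry = {"tag": tag, "done": False}
        self.queues[tid].append(entry)
        return entry

    def commit(self, width):
        """Retire up to `width` completed instructions this cycle.

        Order is preserved within a thread, but a stalled head in one
        thread does not block retirement in another thread.
        Round-robin over threads is an assumption of this sketch.
        """
        retired = []
        progress = True
        while len(retired) < width and progress:
            progress = False
            for tid, q in enumerate(self.queues):
                if len(retired) == width:
                    break
                if q and q[0]["done"]:
                    retired.append((tid, q.popleft()["tag"]))
                    progress = True
        return retired


rob = PerThreadROB(n_threads=2)
a0 = rob.allocate(0, "A0")   # thread 0, oldest (e.g. waiting on a cache miss)
a1 = rob.allocate(0, "A1")
b0 = rob.allocate(1, "B0")
a1["done"] = True
b0["done"] = True
print(rob.commit(width=4))   # [(1, 'B0')]: thread 1 retires, thread 0 waits on A0
a0["done"] = True
print(rob.commit(width=4))   # [(0, 'A0'), (0, 'A1')]
```

With a single shared in-order reorder buffer, B0 would have been stuck behind A0; the per-thread organization is what lets thread 1 keep retiring.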
SMT Architecture
• Straightforward extension to a conventional superscalar design:
  – multiple program counters and some mechanism by which the fetch unit selects one each cycle,
  – a separate return stack for each thread for predicting subroutine return destinations,
  – per-thread instruction retirement, instruction queue flush, and trap mechanisms,
  – a thread id with each branch target buffer entry to avoid predicting phantom branches, and
  – a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
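To make the thread-id tag on branch target buffer entries concrete, here is a minimal sketch of a direct-mapped BTB that only hits when both the PC and the thread id match. The class name `ThreadTaggedBTB`, the direct-mapped organization, the 4-byte instruction alignment, and the example addresses are assumptions for illustration only.

```python
class ThreadTaggedBTB:
    """Toy direct-mapped BTB whose entries carry a thread id."""

    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.entries = [None] * n_entries  # each entry: (tid, pc, target)

    def _index(self, pc):
        # Assumes 4-byte aligned instructions (an assumption of this sketch).
        return (pc >> 2) % self.n_entries

    def update(self, tid, pc, target):
        self.entries[self._index(pc)] = (tid, pc, target)

    def predict(self, tid, pc):
        """Return the predicted target, or None on a miss."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == tid and entry[1] == pc:
            return entry[2]
        # Without the tid check, another thread fetching from the same PC
        # would be handed this target: a "phantom branch".
        return None


btb = ThreadTaggedBTB(n_entries=16)
btb.update(tid=0, pc=0x400, target=0x800)
print(btb.predict(tid=0, pc=0x400) == 0x800)  # True: same thread hits
print(btb.predict(tid=1, pc=0x400))           # None: other thread misses
```

The second lookup shows the point of the tag: thread 1 may be running a different program in which address 0x400 is not a branch at all.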
SMT Performance
Tullsen ‘96
Implementing SMT
Can use most hardware on current out-of-order processors as is
Out-of-order renaming & instruction scheduling mechanisms
• physical register pool model
• renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads
• map thread-specific architectural registers onto a pool of thread-independent physical registers
• operands are thereafter called by their physical names
• an instruction is issued when its operands become available & a functional unit is free
• the instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
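The mapping of thread-specific architectural registers onto a shared physical pool can be sketched as follows. This is a simplified illustration, not a description of any particular processor: the class name `SharedRenamer`, the FIFO free list, and the tiny register counts in the example are all assumptions.

```python
from collections import deque


class SharedRenamer:
    """Toy renamer: one map table per thread, one shared physical pool."""

    def __init__(self, n_threads, n_arch_regs, n_phys_regs):
        if n_phys_regs <= n_threads * n_arch_regs:
            raise ValueError("need extra physical registers for renaming")
        self.free = deque(range(n_phys_regs))
        # Each thread's architectural registers start mapped to
        # distinct physical registers taken from the shared pool.
        self.maps = [
            [self.free.popleft() for _ in range(n_arch_regs)]
            for _ in range(n_threads)
        ]

    def rename(self, tid, dst, srcs):
        """Rename one instruction of thread `tid`.

        Returns (new physical dst, physical srcs, previous physical dst).
        The previous mapping would be released when the instruction commits.
        """
        phys_srcs = [self.maps[tid][s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_dst = self.free.popleft()
        old_dst = self.maps[tid][dst]
        self.maps[tid][dst] = new_dst
        return new_dst, phys_srcs, old_dst

    def release(self, phys_reg):
        self.free.append(phys_reg)


r = SharedRenamer(n_threads=2, n_arch_regs=4, n_phys_regs=12)
# Both threads execute "r1 = r2 op r3" on their own architectural registers.
print(r.rename(tid=0, dst=1, srcs=[2, 3]))  # (8, [2, 3], 1)
print(r.rename(tid=1, dst=1, srcs=[2, 3]))  # (9, [6, 7], 5)
```

After renaming, the two instructions name disjoint physical registers, which is why the scheduler downstream can treat them like any other independent instructions without looking at thread IDs.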
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
• 8 threads * 32 registers + renaming registers
SMT instruction fetcher (ICOUNT)
• fetch from 2 threads each cycle
• count the number of instructions for each thread in the pre-execution stages
• pick the 2 threads with the lowest number
• in essence fetching from the two highest throughput threads
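The ICOUNT selection step described above can be sketched in a few lines. The function name `icount_select`, the dictionary of per-thread counts, and the tie-breaking rule (lower thread id first) are assumptions made for this illustration; the slide only specifies picking the two threads with the lowest counts.

```python
from typing import Dict, List


def icount_select(pre_exec_counts: Dict[int, int], n_fetch: int = 2) -> List[int]:
    """Pick the `n_fetch` threads with the fewest instructions
    in the pre-execution stages.

    Ties are broken toward the lower thread id (an assumption of this sketch).
    """
    ranked = sorted(pre_exec_counts, key=lambda tid: (pre_exec_counts[tid], tid))
    return ranked[:n_fetch]


# Thread id -> number of its instructions currently in pre-execution stages.
counts = {0: 12, 1: 3, 2: 7, 3: 3}
print(icount_select(counts))  # [1, 3]
```

A thread with few instructions waiting before execution is one whose instructions are moving through the machine quickly, which is why favoring low counts amounts to fetching from the highest-throughput threads.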
From Superscalar to SMT
Per-thread hardware
• small stuff
• all part of current out-of-order processors
• none endangers the cycle time
• other per-thread processor state, e.g.,
  – program counters
  – return stacks
  – thread identifiers, e.g., with BTB entries, TLB entries
• per-thread bookkeeping for
  – instruction retirement
  – trapping
  – instruction queue flush
This is why there is only a 10% increase to Alpha 21464 chip area.
Implementing SMT
Thread-shared hardware:
• fetch buffers
• branch prediction structures
• instruction queues
• functional units
• active list
• all caches & TLBs
• MSHRs
• store buffers
This is why there is little single-thread performance degradation (~1.5%).
Thank You