Simultaneous Multi-Threaded Design
Virendra Singh
Associate Professor
Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
http://www.ee.iitb.ac.in/~viren/
E-mail: [email protected]

EE-739: Processor Design


Lecture 35 (11 April 2013) CADSL
Simultaneous Multi-threading ...
[Figure: issue-slot occupancy of 8 functional units (M, M, FX, FX, FP, FP, BR, CC) over cycles 1 to 9. Left panel: one thread, 8 units. Right panel: two threads, 8 units.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

11 Apr 2013 EE-739@IITB 2 CADSL


Multithreaded Categories
[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
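The difference between these categories can be illustrated with a toy issue-slot model. This is only a sketch: the issue width, the three-thread workload, and the function names are all made up for illustration, and coarse-grained multithreading and multiprocessing are left out to keep it short.

```python
ISSUE_WIDTH = 4  # assumed number of issue slots per cycle (hypothetical)

# READY[c][t] = instructions thread t could issue in cycle c (made-up numbers)
READY = [
    [2, 1, 3],
    [0, 2, 1],
    [1, 0, 2],
    [3, 2, 0],
]

def superscalar(ready, width=ISSUE_WIDTH):
    """Single-threaded superscalar: only thread 0 runs, unused slots stay idle."""
    return [min(row[0], width) for row in ready]

def fine_grained(ready, width=ISSUE_WIDTH):
    """Fine-grained MT: one thread owns all slots in a cycle, chosen round-robin."""
    num_threads = len(ready[0])
    return [min(row[c % num_threads], width) for c, row in enumerate(ready)]

def smt(ready, width=ISSUE_WIDTH):
    """SMT: slots within one cycle may be filled from any thread."""
    return [min(sum(row), width) for row in ready]

def utilization(issued, width=ISSUE_WIDTH):
    """Fraction of issue slots that were actually used."""
    return sum(issued) / (width * len(issued))

for name, policy in [("superscalar", superscalar),
                     ("fine-grained", fine_grained),
                     ("SMT", smt)]:
    print(f"{name:12s} {utilization(policy(READY)):.4f}")
```

With this particular workload the model fills 6, 9, and 14 of the 16 slots respectively. The numbers carry no weight beyond showing that only SMT removes idle slots inside a cycle as well as idle cycles.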



Design Challenges in SMT
• Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
– Does a preferred-thread approach sacrifice neither throughput nor single-thread performance?
– Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• Clock cycle time must not be affected, especially in
– Instruction issue: more candidate instructions need to be considered
– Instruction completion: choosing which instructions to commit may be challenging
• Cache and TLB conflicts generated by SMT must not degrade performance



Basic Out-of-order Pipeline



SMT Pipeline



Simultaneous Multithreading



Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading
– A large set of virtual registers that can be used to hold the register sets of independent threads
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
– Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• SMT needs just a per-thread renaming table and separate PCs
– Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Source: "Compaq Chooses SMT for Alpha", Microprocessor Report, December 6, 1999
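Independent commitment with a logically separate reorder buffer per thread might look like the sketch below. The class and method names (`PerThreadROB`, `allocate`, `commit`) and the round-robin sharing of commit bandwidth are assumptions made for illustration, not details taken from the slides or from the Alpha design.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    tag: str            # label for the instruction (hypothetical)
    done: bool = False  # set once the instruction has finished executing

class PerThreadROB:
    """One logical in-order queue per thread; commit bandwidth is shared."""

    def __init__(self, num_threads):
        self.queues = [deque() for _ in range(num_threads)]

    def allocate(self, thread, tag):
        """Append an instruction to its own thread's queue in program order."""
        entry = RobEntry(tag)
        self.queues[thread].append(entry)
        return entry

    def commit(self, width):
        """Retire up to `width` finished instructions.

        Order is preserved within a thread, but an unfinished head in one
        thread never blocks retirement in another thread.
        """
        retired = []
        progress = True
        while len(retired) < width and progress:
            progress = False
            for tid, queue in enumerate(self.queues):
                if len(retired) == width:
                    break
                if queue and queue[0].done:
                    retired.append((tid, queue.popleft().tag))
                    progress = True
        return retired

rob = PerThreadROB(num_threads=2)
a0 = rob.allocate(0, "a0")
a1 = rob.allocate(0, "a1")
b0 = rob.allocate(1, "b0")
a1.done = True   # younger instruction of thread 0 finished first
b0.done = True
print(rob.commit(width=4))  # thread 0 is stalled behind a0; thread 1 still retires
```

Thread 0 cannot retire `a1` until `a0` completes, yet thread 1 commits `b0` in the same cycle, which is the independence the slide refers to.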



SMT Architecture
• Straightforward extension to a conventional superscalar design:
– multiple program counters and some mechanism by which the fetch unit selects one each cycle,
– a separate return stack for each thread for predicting subroutine return destinations,
– per-thread instruction retirement, instruction queue flush, and trap mechanisms,
– a thread ID with each branch target buffer entry to avoid predicting phantom branches, and
– a larger register file, to support logical registers for all threads plus additional registers for register renaming.
• The size of the register file affects the pipeline and the scheduling of load-dependent instructions.
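The thread ID stored with each branch target buffer entry can be sketched as follows. This is a minimal direct-mapped model; the entry count, the indexing function, and the names `ThreadTaggedBTB`, `update`, and `predict` are assumptions for illustration only.

```python
class ThreadTaggedBTB:
    """Direct-mapped BTB whose entries carry the thread ID that created them."""

    def __init__(self, num_entries=512):
        self.num_entries = num_entries
        # Each slot holds None or a (thread_id, pc, target) tuple.
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.num_entries

    def update(self, thread_id, pc, target):
        """Record a taken branch seen by `thread_id` at `pc`."""
        self.entries[self._index(pc)] = (thread_id, pc, target)

    def predict(self, thread_id, pc):
        """Return a predicted target, or None when there is no matching entry."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == thread_id and entry[1] == pc:
            return entry[2]
        return None  # another thread's entry would be a phantom branch

btb = ThreadTaggedBTB()
btb.update(thread_id=0, pc=0x1000, target=0x2000)
print(btb.predict(0, 0x1000))  # hit for the thread that trained the entry
print(btb.predict(1, 0x1000))  # None: same PC, different thread
```

Without the thread ID comparison, thread 1 would be redirected to a target that only makes sense in thread 0's program.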



SMT Performance
Tullsen ‘96



Implementing SMT
Can use most of the hardware on current out-of-order processors as is
Out-of-order renaming & instruction scheduling mechanisms
• physical register pool model
• renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads
• map thread-specific architectural registers onto a pool of thread-independent physical registers
• operands are thereafter called by their physical names
• an instruction is issued when its operands become available & a functional unit is free
• instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
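The mapping of thread-specific architectural registers onto a shared physical pool can be sketched like this. The class name `SharedRenamer`, its methods, and the tiny register counts are hypothetical, and freeing of old mappings at commit is reduced to a single `release` call.

```python
class SharedRenamer:
    """Per-thread map tables that draw from one shared pool of physical registers."""

    def __init__(self, num_threads, arch_regs, phys_regs):
        if phys_regs < num_threads * arch_regs:
            raise ValueError("not enough physical registers for all contexts")
        self.free = list(range(phys_regs))
        # Every architectural register of every thread starts with its own
        # physical register; whatever is left over is available for renaming.
        self.map = [[self.free.pop(0) for _ in range(arch_regs)]
                    for _ in range(num_threads)]

    def rename(self, thread, dst, srcs):
        """Rename one instruction; returns (new_dst, phys_srcs, old_dst)."""
        phys_srcs = [self.map[thread][s] for s in srcs]  # read before overwrite
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_dst = self.free.pop(0)
        old_dst = self.map[thread][dst]
        self.map[thread][dst] = new_dst
        return new_dst, phys_srcs, old_dst

    def release(self, phys_reg):
        """Return a physical register to the pool once its old value is dead."""
        self.free.append(phys_reg)

renamer = SharedRenamer(num_threads=2, arch_regs=4, phys_regs=12)
# Both threads execute "r1 = r1 + r2" using the same architectural names.
print(renamer.rename(thread=0, dst=1, srcs=[1, 2]))
print(renamer.rename(thread=1, dst=1, srcs=[1, 2]))
```

After renaming, the two instructions share no physical register, so the scheduler can treat them like any two independent instructions and does not need the thread ID.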

From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
• 8 threads * 32 registers + renaming registers
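As a worked example of the register count above: the slide does not say how many renaming registers are provided, so the value of 100 below is purely an assumption chosen for illustration, as is the helper name.

```python
def physical_registers(threads, arch_regs_per_thread, rename_regs):
    """Total physical registers: one full architectural set per thread plus renaming."""
    return threads * arch_regs_per_thread + rename_regs

ASSUMED_RENAME_REGS = 100  # hypothetical value, not given on the slide

smt_total = physical_registers(8, 32, ASSUMED_RENAME_REGS)
single_total = physical_registers(1, 32, ASSUMED_RENAME_REGS)
print(smt_total, single_total)  # 356 132
```

Under that assumption the shared file is well over twice the size of a single-threaded one, which is why access to it may need extra pipeline stages.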
 
SMT instruction fetcher (ICOUNT)
• fetch from 2 threads each cycle
• count the number of instructions for each thread in the pre-execution stages
• pick the 2 threads with the lowest number
• in essence, fetching from the two highest-throughput threads
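The ICOUNT selection step can be sketched in a few lines. The function name, the dictionary layout, and breaking ties by lowest thread ID are assumptions; real hardware would use counters and a priority circuit instead of a sort.

```python
def icount_select(front_end_counts, num_fetch=2):
    """Pick the threads with the fewest instructions in the pre-execution stages.

    front_end_counts maps thread ID -> instructions currently waiting before
    execution. Ties are broken by lower thread ID (an assumption).
    """
    ranked = sorted(front_end_counts,
                    key=lambda tid: (front_end_counts[tid], tid))
    return ranked[:num_fetch]

counts = {0: 12, 1: 3, 2: 7, 3: 3}
print(icount_select(counts))  # [1, 3]
```

A thread with a low count is draining its instructions quickly, so fetching for it keeps the instruction queues from filling with instructions from a stalled thread.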



From Superscalar to SMT
Per-thread hardware
• small stuff
• all part of current out-of-order processors
• none endangers the cycle time
• other per-thread processor state, e.g.,
– program counters
– return stacks
– thread identifiers, e.g., with BTB entries, TLB entries
• per-thread bookkeeping for
– instruction retirement
– trapping
– instruction queue flush
This is why there is only a 10% increase to Alpha 21464 chip area.



Implementing SMT
Thread-shared hardware:
• fetch buffers
• branch prediction structures
• instruction queues
• functional units
• active list
• all caches & TLBs
• MSHRs
• store buffers
This is why there is little single-thread performance degradation (~1.5%).



Thank You
