Simultaneous Multithreading
Pratyusa Manadhata, Vyas Sekar
{pratyus, vyass}@cs.cmu.edu
Figure 1: Horizontal and Vertical Wastage

...cient ILP) and vertical waste (due to data dependencies and long-latency operations). The MT system minimizes vertical waste because it can choose among multiple threads to fetch from in each cycle, and thus it can tolerate long-latency operations within each thread.

3. Most h/w resources remain available to a single thread, unlike in static resource allocation. This implies that a non-parallelizable program will still run efficiently on SMT.
4. Fetch Mechanism:
   a. 2.8 scheme: select 2 threads and fetch 8 instructions from each thread (2.4 scheme?); out of these, choose a subset to match the h/w decoding b/w.
   b. H/w cost: an additional port on the instruction cache (IC); 2.8 performs better than 2.4.
   c. icount technique: when selecting the thread, give higher priority to those threads that have the fewest instructions in the decode, rename, and queue pipeline stages. This gives an even distribution, prevents starvation, etc. Other options are misscount, bcount, etc.
5. Caveat: The hardware register file is larger, so a register access takes 2 clock cycles and needs a 2-cycle read/write.
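The 2.8 fetch scheme combined with the icount priority can be sketched as follows. This is a minimal illustration, not the hardware logic from the papers: the names `Thread`, `icount_fetch`, and `DECODE_WIDTH` are hypothetical, the decode bandwidth of 8 is an assumption, and giving the lowest-count thread first claim on that bandwidth is one plausible way to "choose a subset".

```python
from dataclasses import dataclass, field
from typing import Dict, List

FETCH_THREADS = 2      # the "2" in the 2.8 scheme: threads selected per cycle
FETCH_PER_THREAD = 8   # the "8" in the 2.8 scheme: max instructions per thread
DECODE_WIDTH = 8       # assumed h/w decode bandwidth per cycle (hypothetical)


@dataclass
class Thread:
    tid: int
    icount: int  # instructions currently in decode, rename, and queue stages
    pending: List[str] = field(default_factory=list)  # instructions ready to fetch


def icount_fetch(threads: List[Thread]) -> Dict[int, List[str]]:
    """Select the threads with the fewest in-flight instructions and fetch from them."""
    # Lowest icount first; ties are broken by thread id to keep the choice deterministic.
    chosen = sorted(threads, key=lambda t: (t.icount, t.tid))[:FETCH_THREADS]
    fetched: Dict[int, List[str]] = {}
    budget = DECODE_WIDTH
    for t in chosen:
        # Trim each thread's fetch block to whatever decode bandwidth is left.
        take = min(FETCH_PER_THREAD, len(t.pending), budget)
        fetched[t.tid] = t.pending[:take]
        del t.pending[:take]
        t.icount += take  # fetched instructions now occupy the front-end stages
        budget -= take
    return fetched
```

Because a thread's count rises as soon as it is fetched from, a thread that fills the front end without issuing loses priority on later cycles, which is the even-distribution and starvation-avoidance effect noted in item 4c.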
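The two kinds of waste named in Figure 1 can be counted from a per-cycle issue trace. The sketch below assumes the usual definitions: a cycle in which nothing issues contributes all of its slots to vertical waste, and a cycle in which some but not all slots are filled contributes the empty slots to horizontal waste. The function name `issue_slot_waste` and the trace format are hypothetical.

```python
from typing import List, Tuple


def issue_slot_waste(issued_per_cycle: List[int], issue_width: int) -> Tuple[int, int]:
    """Return (horizontal_waste, vertical_waste), both measured in issue slots."""
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        if issued == 0:
            # Completely idle cycle: every slot is vertical waste.
            vertical += issue_width
        else:
            # Partially filled cycle: the unused slots are horizontal waste.
            horizontal += issue_width - issued
    return horizontal, vertical
```

For a 4-wide machine with the trace `[4, 0, 2, 1, 0]`, this returns `(5, 8)`: five slots lost in partially filled cycles and eight lost in the two idle cycles.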
4 Results and Observations

It has been observed that superscalars give an IPC of approximately 1-2, but the results shown indicate that SMT can reach an IPC of up to 6.7 (for an 8-issue architecture). Even though the SMT pipeline is longer, implying a longer latency for a single thread, this is observed not to have a significant effect on performance. The reason performance does not degrade in the presence of conflicts and a longer pipeline is essentially the system's ability to absorb additional conflicts, i.e., the ability to hide latency by issuing instructions from multiple threads. The multiprocessor architectures MP2 and MP4 were observed to be hindered by static resource partitioning, while SMT, on the other hand, dynamically partitions resources among threads. A comparison of MP2 vs MP4 also shows that MP2 adapts better to ILP, while MP4 is better suited to exploiting TLP. This is quite intuitive, as there are more functional units per processor in the MP2 and more parallel units in the MP4. SMT can also lead to increased cache misses/conflicts and greater stress on the branching hardware. However, the impact on overall program performance is not significant, as SMT, efficient hardware design, and compiler optimizations can hide latencies and conflicts significantly. The key insight is that SMT achieves a better performance gain than superscalar, multithreaded, and multiprocessor architectures due to its ability to ignore the distinction between ILP and TLP, which implies that resources are not statically partitioned.

4.1 Discussion of Issues in SMTs

• Cost vs Performance: It is necessary to determine which architecture makes the best use of the chip area and provides enhanced performance with minimal hardware overhead.
• Quantitative Comparisons: It is difficult to quantify in absolute terms the performance gain that the SMT processor can deliver. Often this depends heavily on design cycle time, the actual hardware implementation, etc., which are hard to predict given the technology trends.
• Compilers: One of the earlier claims was that SMT is easier for compilers and programmers, as the hardware can dynamically repartition resources. But the general feeling is that, in order to guarantee performance no worse than the competing architectures and to ensure maximum processor utilization, one does need compiler support for identifying sources of parallelism and for help with static scheduling.
• OS: It is important to consider OS issues such as thread scheduling, thread priority, etc., that will be necessary in a realistic implementation of an SMT. The interaction between thread priority and the fetch/issue logic is an interesting issue.
• Another observation is that, more than the static partitioning of resources in multiprocessors, the communication overhead is a significant reason why SMTs perform better than MPs.
• The question also arises whether SMT needs a branch prediction mechanism at all. The answer is yes, which is again consistent with the design philosophy that a non-parallelizable program still needs to get good performance.
• Is the performance gain adequate given the additional resource cost? It has been shown that an SMT outperforms an equally resource-equipped multiprocessor running at the maximum number of supported threads, which indicates that the SMT achieves higher resource utilization.

What does the future hold for SMTs? Each processor in an SMP can use SMT. This is a direct extension of the SMP and SMT architectures that can create small to massively parallel systems where each processor employs SMT to minimize execution time. It has been observed that next-generation architectures would be based on designs that make the most efficient use of power and chip area, and this would mean that multiprocessing (MP or MT or SMT) on chip is more efficient than wider superscalars.

An interesting observation is that even though the research on SMT was done in the mid-to-late 90s, the actual commercial implementation of an SMT on a processor has been delayed until now (the Intel “Hyperthreading” Pentium). This shows that chip design in reality is far more complex, and there are other economic factors that come into play.

References

[1] Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. Simultaneous Multithreading: A Platform for Next-generation Processors, in IEEE Micro, September/October 1997, pages 12-18.
[2] Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading, in ACM Transactions on Computer Systems, August 1997, pages 322-354.
[3] Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.