
Simultaneous Multithreading

Pratyusa Manadhata, Vyas Sekar


{pratyus,vyass}@cs.cmu.edu

1 Introduction

Current research in processor technology and computer architecture is motivated primarily by the need for greater performance. In this context, it is well understood that the performance gain from improving the memory system alone is limited, and that System Level Integration (such as supporting graphics/sound on chip) can only lead to marginal performance benefits. The most significant gain can be achieved by increasing parallelism in execution.

There exist two kinds of parallelism in typical programming workloads: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). Modern superscalar architectures are designed to capture ILP in programs, while multithreaded and multiprocessor systems are designed to capture TLP, i.e., parallelism across threads/processes. (A short sketch at the end of this section illustrates the distinction.)

The better solution, then, is to exploit both ILP and TLP: TLP from either multithreaded parallel programs or a multiprogramming workload, and ILP from within each thread.

Neither superscalar nor multiprocessor (MP) architectures can capture ILP and TLP in their entirety, and both are inherently incapable of adapting to dynamic levels of ILP and TLP. This is the primary motivation for a new processor architecture called Simultaneous Multithreading (SMT).
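To make the ILP/TLP distinction concrete, the following Python sketch (our own illustration, not drawn from the referenced papers; the array size, thread count, and function names are arbitrary assumptions) expresses the same reduction in two ways: with independent operations inside one thread that a wide-issue core could overlap (ILP), and with the work split across threads that a multithreaded or multiprocessor system could run in parallel (TLP).

# Illustrative sketch only: one reduction written to expose ILP within a
# single thread, and the same work split across threads to expose TLP.
# Sizes, names, and the thread count are arbitrary assumptions.
from concurrent.futures import ThreadPoolExecutor

def sum_ilp(values):
    # Four independent accumulators: a superscalar core can issue these
    # additions in the same cycle (instruction-level parallelism).
    a = b = c = d = 0
    tail = len(values) - len(values) % 4
    for i in range(0, tail, 4):
        a += values[i]
        b += values[i + 1]
        c += values[i + 2]
        d += values[i + 3]
    return a + b + c + d + sum(values[tail:])

def sum_tlp(values, threads=4):
    # The same work as independent chunks: a multithreaded or multiprocessor
    # system runs the chunks in parallel (thread-level parallelism).
    chunk = (len(values) + threads - 1) // threads
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(sum, parts))

data = list(range(100_000))
assert sum_ilp(data) == sum_tlp(data) == sum(data)

The point is only that the two forms of parallelism live at different granularities; an SMT processor is designed to harvest both at once.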
2 SMT

In this section we identify some of the key characteristics of an SMT architecture and some of the design requirements that can facilitate the implementation of an SMT over a conventional superscalar architecture. The characteristics of SMT processors are:
1. inherited from superscalar: issue multiple instructions per cycle;
2. inherited from multithreaded architectures: maintain hardware state for multiple threads.

In Fig. 1 we can see that a significant number of issue slots are wasted in both the superscalar and the multithreaded system. There are essentially two kinds of waste: vertical waste (an entire cycle is unused) and horizontal waste (within a cycle, some issue slots are unused). Superscalar processors look at multiple instructions from the same process, and suffer both horizontal waste (as a result of insufficient ILP) and vertical waste (due to data dependencies and long-latency operations). The multithreaded (MT) system minimizes vertical waste, since it can choose among multiple threads to fetch from in each cycle and can thus tolerate long-latency operations within each thread.

Figure 1: Horizontal and Vertical Wastage
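As an illustration of this accounting (our own sketch, not taken from the papers; the 8-slot issue width and the example traces are assumed), the snippet below takes a per-cycle count of filled issue slots and separates the unused slots into vertical waste (wholly idle cycles) and horizontal waste (partially filled cycles):

# Sketch: classify unused issue slots as vertical waste (entire cycle idle)
# or horizontal waste (cycle only partially filled). The issue width and the
# example traces are assumptions for illustration.
ISSUE_WIDTH = 8

def waste_breakdown(issued_per_cycle, width=ISSUE_WIDTH):
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    total_slots = width * len(issued_per_cycle)
    return {
        "vertical_waste": vertical,
        "horizontal_waste": horizontal,
        "utilization": 1 - (vertical + horizontal) / total_slots,
        "ipc": sum(issued_per_cycle) / len(issued_per_cycle),
    }

# A single thread stalling on misses (zeros) and limited by ILP (small counts):
print(waste_breakdown([3, 2, 0, 0, 0, 4, 2, 0, 3, 1]))
# An SMT machine fills many of the same slots from other threads:
print(waste_breakdown([7, 6, 5, 6, 8, 7, 6, 5, 7, 6]))

A fine-grained multithreaded machine removes most of the vertical waste (it can switch threads on an otherwise idle cycle) but leaves the horizontal waste; SMT attacks both by issuing from several threads within the same cycle.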
3 SMT Model

1. Consider a superscalar that fetches 8 instructions per cycle from the instruction cache.

2. SMT hardware modifications required over a conventional superscalar:
   (a) state for per-thread hardware contexts;
   (b) per-thread exception and retirement mechanisms.

3. Most hardware resources are available to any thread, unlike in static resource allocation. This implies that a non-parallelizable program will still run efficiently on an SMT.

4. Fetch mechanism (a sketch follows this list):
   (a) The 2.8 scheme: select 2 threads and fetch 8 instructions from each (a 2.4 scheme, fetching 4 from each of 2 threads, is a possible alternative); out of these, choose a subset to match the hardware decode bandwidth.
   (b) Hardware cost: an additional port on the instruction cache (the 2.8 scheme performs better than 2.4).
   (c) The ICOUNT technique for selecting threads: higher priority is given to the threads with the fewest instructions in the decode, rename, and queue pipeline stages. This distributes work evenly and prevents starvation. Other selection policies include MISSCOUNT, BRCOUNT, etc.

5. Caveat: the hardware register file is larger, so register access has a 2-clock latency, requiring 2-cycle reads and writes.
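The following Python sketch (our own simplification, not the hardware description in the papers; the parameter values and data structures are assumptions) shows the shape of the 2.8 fetch scheme with ICOUNT thread selection described in item 4:

# Simplified sketch of 2.8 fetching with ICOUNT selection: pick the 2 threads
# with the fewest instructions in the decode/rename/queue stages, fetch up to
# 8 instructions from each, then keep only enough to fill the decode
# bandwidth. All parameters and structures below are assumptions.
THREADS_PER_CYCLE = 2   # the "2" in "2.8"
FETCH_PER_THREAD = 8    # the "8" in "2.8"
DECODE_BANDWIDTH = 8

def icount_fetch(pre_issue_counts, fetch_buffers):
    """pre_issue_counts: thread id -> instructions in decode/rename/queues.
    fetch_buffers: thread id -> instructions available to fetch this cycle."""
    # ICOUNT: favour threads with the emptiest front end, which spreads work
    # evenly and keeps any one thread from starving the others.
    chosen = sorted(pre_issue_counts, key=pre_issue_counts.get)[:THREADS_PER_CYCLE]
    fetched = []
    for tid in chosen:
        fetched.extend((tid, instr) for instr in fetch_buffers[tid][:FETCH_PER_THREAD])
    # Only DECODE_BANDWIDTH instructions can be decoded in one cycle.
    return fetched[:DECODE_BANDWIDTH]

# Example: thread 2 has the emptiest front end but few ready instructions,
# so thread 1 (the next-highest priority) fills the remaining slots.
counts = {0: 12, 1: 5, 2: 1, 3: 9}
buffers = {0: [f"t0_i{k}" for k in range(10)],
           1: [f"t1_i{k}" for k in range(10)],
           2: [f"t2_i{k}" for k in range(3)],
           3: [f"t3_i{k}" for k in range(10)]}
print(icount_fetch(counts, buffers))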

3.1 SMT Disadvantages

• There is greater register pressure, and greater per-thread latency due to the longer pipeline.

• On a multiprogrammed workload there is greater stress on shared structures such as the branch prediction buffer (BPB), caches, TLB, etc.

• A parallel workload tends to stress the functional units more.
4 Results and Observations

It has been observed that superscalars give an IPC of roughly 1-2. The results shown indicate that SMT can reach an IPC of up to 6.7 (for an 8-issue architecture). Even though the SMT pipeline is longer, implying a longer latency for a single thread, this is observed not to have a significant performance effect. The reason performance does not degrade in the presence of conflicts and a longer pipeline is essentially the system's ability to absorb additional conflicts, i.e., its ability to hide latency by issuing from multiple threads. The multiprocessor architectures MP2 and MP4 were observed to be hindered by static resource partitioning, while SMT dynamically partitions resources among threads. A comparison between MP2 and MP4 also shows that MP2 can better adapt to ILP, while MP4 is better suited to exploiting TLP; this is quite intuitive, as there are more functional units per processor in MP2, while there are more parallel units in MP4. SMT can also lead to increased cache misses/conflicts and greater stress on the branching hardware. However, the impact on overall program performance is not significant, since the ability to issue from multiple threads, efficient hardware design, and compiler optimizations can hide these latencies and conflicts. The key insight is that SMT achieves a better performance gain than superscalar, multithreaded, and multiprocessor architectures because it can ignore the distinction between ILP and TLP, which means that resources are not statically partitioned.
hand dynamically partitions resources among that SMT is easier for compilers and pro-
threads. Also a comparison between MP2 grammers, as the hardware can dynami-
vs MP4 shows that MP2 can better adapt cally repartition resources. But the gen-
to ILP, while MP4 is better suited for uti- eral feeling is that in order to assure
lizing TLP, which is quite intuitive as there a performance no worse than the com-
are more functional units per processor avail- peting architectures and to ensure maxi-
able in the MP2, while there are more paral- mum processor utilization, one does need
lel units in the MP4. SMT can also lead to compiler support for identifying sources
increased cache misses/conflicts and greater of parallelism and help in static schedul-
stress on the branching hardware. However ing.
the impact on overall program performance
is not significant as SMT, efficient hardware • OS: It is important to consider OS issues
design, and compiler optimizations can hide such as thread scheduling, thread prior-
latencies and conflicts significantly. The key ity etc. that will be necessary in a realis-
insight is that SMT achieves a better perfor- tic implementation of an SMT, and the
mance gain than Superscalar, multithreaded, interaction between the thread priority
and multiprocessor architectures due to the and the fetch/issue logic is an interest-
ability to ignore the distinction between ILP ing issue.

3
• Another observation is that, more than the static partitioning of resources in multiprocessors, communication overhead is a significant reason why SMTs perform better than MPs.

• The question also arises whether SMT needs a branch prediction mechanism at all. The answer is yes, which is again consistent with the design philosophy that a non-parallelizable program should still achieve good performance.

• Is the performance gain adequate given the additional resource cost? It has been shown that an SMT outperforms an equally resource-equipped multiprocessor running its maximum number of supported threads, which indicates that the SMT makes the fullest use of its resources.

What does the future hold for SMTs? Each processor in an SMP can itself use SMT: this direct extension of the SMP and SMT architectures can create small to massively parallel systems in which each processor employs SMT to minimize execution time. It has been observed that next-generation architectures will be driven by design issues that tend to maximize the use of power and chip area, and this would mean that multiprocessing on chip (MP, MT, or SMT) is more efficient than a wider superscalar.

An interesting observation is that even though the research on SMT was done in the mid-to-late 90s, the actual commercial implementation of SMT in a processor has been delayed until now (the Intel "Hyper-Threading" Pentium). This shows that chip design in reality is far more complex, and that other economic factors come into play.

References

[1] Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, September/October 1997, pages 12-18.

[2] Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, August 1997, pages 322-354.

[3] Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.