Hyper-Threading Technology Speeds Clusters

Kazimierz Waćkowski, Politechnika Warszawska, [email protected]
Pawel Gepner, Intel Corporation, [email protected]

Conference paper in Lecture Notes in Computer Science, September 2003. DOI: 10.1007/978-3-540-24669-5_3
Hyper-Threading Technology enabled processors contain multiple logical processors per physical
processor package. The state information necessary to support each logical processor is replicated,
while the physical processor's execution resources are shared or partitioned between the logical
processors. To the operating system (OS), a single physical processor thus appears as two logical
processors. When HT is enabled, the OS can have the processor execute multiple threads
simultaneously, in parallel within each physical processor. Because most applications typically leave
processor resources underutilized, a CPU with Hyper-Threading Technology enabled can generally
improve overall application performance: multiple threads running in parallel achieve higher
processor utilization and increased throughput. To obtain the full benefit, it is necessary to focus on
three key areas that need to be aware of Hyper-Threading Technology and tuned for it: the operating
system, the compiler and the application.
Operating System optimization
The first and fundamental requirement for operating system optimization for HT is awareness of,
and the ability to run in, a multi-processor environment, i.e. symmetric multiprocessor (SMP)
support in the kernel. For clustered implementations the predominant OS in use is Linux, so we will
focus our attention there. The Linux kernel has been HT capable since release 2.4.17. The 2.4.17
kernel recognizes the logical processors and treats a Hyper-Threaded processor like two physical
processors [15]. Hyper-Threading support can be verified with the command cat /proc/cpuinfo,
which shows the presence of two processors: processor 0 and processor 1.
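The same information can also be read programmatically. The following short C program is a minimal
sketch (not part of the original text) that simply counts the "processor" entries in /proc/cpuinfo,
i.e. the number of logical processors the kernel has brought up:

    /* Count the logical processors the Linux kernel reports in /proc/cpuinfo. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int logical_cpus = 0;

        if (!f) {
            perror("fopen /proc/cpuinfo");
            return 1;
        }
        while (fgets(line, sizeof line, f)) {
            /* Each logical processor contributes one "processor : N" line. */
            if (strncmp(line, "processor", 9) == 0)
                logical_cpus++;
        }
        fclose(f);
        printf("Logical processors reported by the kernel: %d\n", logical_cpus);
        return 0;
    }

On a single Hyper-Threading enabled Xeon with HT visible to the kernel, this would report two
logical processors.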
Typically, each physical processor has a single architectural state on a single processor core to
service threads. With HT, each physical processor has two architectural states on a single core,
making the physical processor appear as two logical processors to service threads [2]. The BIOS
counts each architectural state on the physical processor.
Figure 1. Hyper-Threading technology on an SMP: each physical processor provides two architectural
states and two local APICs on top of a single execution engine and bus interface, with both processor
packages sharing the system bus.
Figure 1 shows a typical, bus-based SMP scenario on a processor with Hyper-Threading technology.
Each logical processor can execute a software thread, allowing a maximum of two software threads
to execute simultaneously on one physical processor [11]. Since Hyper-Threading-aware operating
systems take advantage of the logical processors, they have twice as many processors available to
service threads. The following resources are replicated, so that each of the two executing threads
has its own copy [9]:
• The register alias tables map the architectural registers (eax, ebx, ecx, etc.) to physical
rename registers. Since we need to keep track of the architectural state of both logical
processors independently, these tables have to be duplicated.
• The Return Stack Predictor has to be duplicated in order to accurately predict call-return
instruction pairs.
• The next instruction pointers also needed to be duplicated because each logical processor
needs to keep track of its progress through the program it is executing independently. There
are two sets of next instruction pointers: one at the trace cache (the "Trace Cache Next
IP"), which is a first-level instruction cache that stores decoded instructions, and, in the
case of a Trace Cache miss, another set of next instruction pointers at the fetch and decode logic.
• Some of the front-end buffers are duplicated (Instruction Streaming Buffers and Trace
Cache Fill Buffers) to improve instruction prefetch behavior.
• The Instruction TLB was duplicated because it was simpler to duplicate it than to implement
the logic to share this structure. Also there was some die area near the instruction TLB that
was easy to use.
• In addition, there are also some miscellaneous pointers and control logic that are too small
to point out.
This duplication of resources accounts for far less than 5% of the total die area, although the
added design complexity was considerable.
The Xeon processor was the first member of the family of Hyper-Threading technology enabled CPUs. To
achieve the goal of executing two threads on a single physical processor, the processor
simultaneously maintains the context of multiple threads allowing the scheduler to dispatch two
potentially independent threads concurrently. The OS schedules and dispatches threads to each
logical processor, just as it would in a dual-processor or multi-processor system. When a thread is
not dispatched, the associated logical processor is kept idle. When a thread is scheduled and
dispatched to a logical processor (#0), the Hyper-Threading technology utilizes the necessary
processor resources to execute the thread. When a second thread is scheduled and dispatched on
the second logical processor (#1), resources are replicated, divided, or shared as necessary in order
to execute the second thread. Each processor makes selections at points in the pipeline to control
and process the threads. As each thread finishes, the operating system idles the unused processor,
freeing resources for the running CPU. Hyper-Threading technology is supported in Linux kernel
2.4.x; however, the scheduler used in kernel 2.4.x is not able to differentiate between two
logical processors and two physical processors [15].
The support for Hyper-Threading in Linux kernel 2.4.x includes the following enhancements:
• 128-byte lock alignment
• Spin-wait loop optimization
• Non-execution based delay loops
• Detection of Hyper-Threading enabled processor and starting the logical processor as if
machine was SMP
• Serialization in MTRR and Microcode Update driver as they affect shared state
• Optimization to scheduler when system is idle to prioritize scheduling on a physical
processor before scheduling on logical processor
• Offset user stack to avoid 64K aliasing
All these enhancements can improve system performance by up to 30% in areas such as the scheduler,
low-level kernel primitives, the file server, the network, and threading support. Compiling the
Linux kernel with a parallel make (make -j 2, for example) also provides a significant speedup.
Figure 2 shows the absolute performance of a kernel build on one and two Intel Xeon processors MP,
with and without Hyper-Threading technology. This application scales nicely from 1 to 2 processors,
showing an impressive 1.95 speedup. On a single processor with Hyper-Threading technology, the
application shows a speedup of 1.20. While this is not close to the dual-processor speedup, it
demonstrates that the technology is promising [22]. Hyper-Threading technology achieves a
significant speedup while keeping the system cost constant, whereas a dual-processor system costs
significantly more than a single-processor one.
Figure 2. Linux kernel compile performance (relative speedup) on one and two processors (1P, 2P),
with and without Hyper-Threading Technology.
The Linux kernel 2.5.x may provide a performance speedup of up to 51%, mainly via improvements
to the scheduler. In addition to the optimized scheduler, there are other modifications added to the
Linux kernel that increase performance [22]. Those changes are:
HT-aware passive load-balancing:
The IRQ-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise, it might
happen that one physical CPU runs two tasks while another physical CPU runs no task; the stock
scheduler does not recognize this condition as "imbalance" because the stock scheduler does not
realize that the two logical CPUs belong to the same physical CPU.
"Active" load-balancing:
This is when a logical CPU goes idle and causes a physical CPU imbalance. The imbalance caused by
an idle CPU can be solved via the normal load-balancer. In the case of HT, the situation is special
because the source physical CPU might have just two tasks running, both runnable. This is a
situation that the stock load-balancer is unable to handle, because running tasks are hard to
migrate. This migration is essential; otherwise a physical CPU can get stuck running two tasks while
another physical CPU stays idle.
HT-aware task pickup:
When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU
before trying to pull in tasks from other CPUs. The stock scheduler only picks tasks that were
scheduled to that particular logical CPU.
HT-aware affinity:
Tasks should attempt to "link" to physical CPUs, not logical CPUs.
HT-aware wakeup:
The stock scheduler only knows about the "current" CPU; it does not know about any sibling. On HT,
if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle,
then the sibling CPU has to be woken up and has to execute the newly woken-up task immediately.
Hyper-Threading support in Linux kernel 2.5.x includes all the above changes.
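To make the placement idea behind these changes more concrete, the following toy C function is a
user-space illustration only (not actual kernel code); it assumes that logical CPUs 2k and 2k+1 are
siblings on the same physical package, and it prefers a completely idle physical CPU before placing
work next to a busy sibling:

    /* Toy illustration of HT-aware placement: prefer an idle physical package. */
    #include <stdio.h>

    #define NR_LOGICAL 4   /* two physical packages, two logical CPUs each */

    /* busy[i] != 0 means logical CPU i is currently running a task. */
    static int pick_cpu(const int busy[NR_LOGICAL])
    {
        int i;

        /* First pass: an idle logical CPU whose sibling is also idle. */
        for (i = 0; i < NR_LOGICAL; i++)
            if (!busy[i] && !busy[i ^ 1])
                return i;

        /* Second pass: any idle logical CPU, even if its sibling is busy. */
        for (i = 0; i < NR_LOGICAL; i++)
            if (!busy[i])
                return i;

        return -1;  /* all logical CPUs are busy */
    }

    int main(void)
    {
        int busy[NR_LOGICAL] = { 1, 0, 0, 0 };  /* CPU0 busy, its sibling CPU1 idle */

        /* An HT-unaware scheduler might pick CPU1; this picks CPU2, which sits
         * on an entirely idle physical package. */
        printf("Chosen logical CPU: %d\n", pick_cpu(busy));
        return 0;
    }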
Compiler optimization
Intel processors have a rich set of performance-enabling features such as the Streaming-SIMD-
Extensions (SSE and SSE2) in the IA-32 architecture, and large register files, predication, and control and data
speculation in the Itanium-based architecture. These features allow the compiler to exploit
parallelism at various levels [4].
Intel’s newest Hyper-Threading Technology, a simultaneous multithreading design, allows one
physical processor to act as two logical processors, handling the instruction streams of two threads
in parallel rather than serially. Hyper-Threading Technology-enabled processors can
significantly increase the performance of application programs with a high degree of parallelism.
These potential performance gains are only obtained when an application is efficiently
multithreaded, either manually or automatically [2]. The Intel C++/Fortran high-performance
compiler supports several such techniques. One of those techniques is automatic loop
parallelization. In addition to automatic loop-level parallelization, Intel compilers support OpenMP
directives, which significantly increase the range of applications amenable to effective
parallelism. For example, users can use OpenMP parallel sections to develop an application where
section-1 calls an integer-intensive routine and where section-2 calls a floating-point intensive
routine. Higher performance is obtained by scheduling section-1 and section-2 onto two different
logical processors that share the same physical processor to fully utilize processor resources based
on the Hyper-Threading Technology. The OpenMP standard API supports a multi-platform, shared-
memory, parallel programming paradigm in C++/C/Fortran95 on all Intel architectures and popular
operating systems such as Windows NT*, Linux*, and Unix*. OpenMP directives and programs have
emerged as the de facto standard of expressing parallelism in various applications as they
substantially simplify the notoriously complex task of writing multithreaded programs. The Intel
compilers support the OpenMP pragmas and directives in the languages C++/C/Fortran95, on IA-32
and IPF architectures. The Intel OpenMP implementation in the compiler strives to generate
multithreaded code that gains a speedup from Hyper-Threading Technology over optimized
uniprocessor code, to integrate parallelization tightly with advanced scalar and loop optimizations
(such as intra-register vectorization and memory optimizations) so as to achieve better cache
locality and exploit multi-level parallelism efficiently, and to minimize the overhead of data
sharing among threads.
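As a hedged sketch of the parallel-sections scenario described above (the two worker routines are
invented placeholders, not taken from the paper), an OpenMP program in C could pair an
integer-intensive section with a floating-point-intensive section so that the two threads share one
Hyper-Threaded physical processor with little contention:

    /* OpenMP parallel sections: one integer-heavy and one FP-heavy thread. */
    #include <stdio.h>
    #include <omp.h>

    static long integer_work(void)
    {
        long sum = 0;
        for (long i = 0; i < 50000000L; i++)
            sum += i ^ (i >> 3);          /* integer ALU work */
        return sum;
    }

    static double float_work(void)
    {
        double acc = 0.0;
        for (long i = 1; i < 50000000L; i++)
            acc += 1.0 / (double)i;       /* floating-point work */
        return acc;
    }

    int main(void)
    {
        long s = 0;
        double d = 0.0;

        #pragma omp parallel sections
        {
            #pragma omp section
            s = integer_work();           /* section 1: integer-intensive */

            #pragma omp section
            d = float_work();             /* section 2: FP-intensive */
        }
        printf("integer result %ld, float result %f\n", s, d);
        return 0;
    }

Compiled with an OpenMP-capable compiler (for example gcc -fopenmp or the Intel compiler), the two
sections can be scheduled onto the two logical processors of a single physical package.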
Application optimization
Multi-threaded applications that perform well on SMP systems will generally perform well on Hyper-
Threading enabled processors. But don’t confuse Hyper-Threading enabled processors with SMP
systems. Each processor in an SMP system has all its physical processor resources available and will
not experience any resource contention at this level. Well-designed multithreaded applications will
therefore perform better on SMP systems than on Hyper-Threading enabled processors. Enterprise
and technical computing users have a never-ending need for increased performance and capacity.
Performance continues to be a key concern for them [12].
Processor resources, however, are often underutilized and the growing gap between core processor
frequency and memory speed causes memory latency to become an increasing performance
challenge. Intel's Hyper-Threading Technology brings Simultaneous Multi-Threading to the Intel
Architecture and makes a single physical processor appear as two logical processors with duplicated
architecture state, but with shared physical execution resources. This allows two tasks (two threads
from a single application or two separate applications) to execute in parallel, increasing processor
utilization and reducing the performance impact of memory latency by overlapping the latency of
one task with the execution of another. Hyper-Threading Technology-capable processors offer
significant performance improvements for multi-threaded and multi-tasking workloads without
sacrificing compatibility with existing software or single-threaded performance.
The first step in multi-threading applications for Hyper-Threading is to follow the threading
methodology for designing Symmetric Multi-Processor (SMP) solutions. The best way of designing
for Hyper-Threading enabled processors is to avoid known traps.
There are several known pitfalls that developers can encounter when tuning an application for
Hyper-Threading enabled processors. The pitfalls are covered in detail in the “Intel Pentium 4 and
Intel Xeon Processor Optimization Manual”. Short descriptions of each of the known issues are
presented below [12].
Spin-waits
A spin-wait loop is a technique used in multithreaded applications whereby one thread waits for
other threads. The wait can be required for protection of a critical section, for barriers or for other
necessary synchronizations. Typically the structure of a spin-wait loop consists of a loop that
compares a synchronization variable with a predefined value. On a processor with a super-scalar
speculative execution engine, a fast spin-wait loop results in the issue of multiple read requests by
the waiting thread as it rapidly goes through the loop. These requests potentially execute out-of-
order. When the processor detects a write by one thread to any read of the same data that is in
progress from another thread, the processor must guarantee that no violations of memory order
occur. To ensure the proper order of outstanding memory operations, the processor incurs a severe
penalty. The penalty from memory order violation can be reduced significantly by inserting a PAUSE
instruction in the loop. If the spin-wait runs for a long time before another thread updates the
variable, the spinning loop consumes execution resources without accomplishing any useful work [13].
To prevent a spin-wait loop from consuming resources that another ready thread could use, developers
often insert a call to Sleep(0). This allows the thread to yield if another thread is waiting. But if there is
no waiting thread, the spin wait loop will continue to execute. On a multi-processor system, the
spin-wait loop consumes execution resources but does not affect the application performance. On a
system with Hyper-Threading enabled processors, the consumption of execution resources without
contribution to any useful work can negatively impact the overall application performance [7].
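The following C sketch illustrates this pattern (the flag variable, the spin threshold and the
producer thread are illustrative assumptions, not code from the paper): a PAUSE hint via the
_mm_pause() intrinsic inside the loop, plus an occasional yield (sched_yield() on Linux, the
analogue of Sleep(0)) so that the waiting thread does not monopolize the execution resources of its
sibling logical processor:

    /* Spin-wait with PAUSE and a bounded spin count before yielding. */
    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <xmmintrin.h>   /* _mm_pause() */
    #include <sched.h>       /* sched_yield() */

    static volatile int flag = 0;   /* set by the producer thread */

    static void *producer(void *arg)
    {
        (void)arg;
        usleep(100000);       /* pretend to do some work first */
        flag = 1;
        return NULL;
    }

    static void wait_for_flag(void)
    {
        int spins = 0;

        while (flag == 0) {
            _mm_pause();             /* hint: this is a spin-wait loop */
            if (++spins > 1000) {    /* after a while, yield the CPU */
                sched_yield();
                spins = 0;
            }
        }
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, producer, NULL);
        wait_for_flag();
        pthread_join(t, NULL);
        printf("flag observed\n");
        return 0;
    }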
Write-combining store buffers
Data is read from the first level cache - the fastest cache - if at all possible. If the data is not in
that level, the processor attempts to read it from the next level out, and so on. When data is
written, it is written to the first level cache only if that cache already contains the specific cache
line being written, and "writes-through" to the second level cache in either case. If the data cache
line is not in the second level cache, it will be fetched from further out in the memory hierarchy
before the write can complete.
Data store operations place data into "store buffers", which stay allocated until the store completes.
Furthermore, there are a number of "write combining"(WC) store buffers, each holding a 64 byte
cache line. If a store is to an address within one of the cache lines of a store buffer, the data can
often be quickly transferred to and combined with the data in the WC store buffer, essentially
completing the store operation much faster than writing to the second level cache. This leaves the
store buffer free to be re-used sooner - minimizing the likelihood of entering a state where all the
store buffers are full and the processor must stop processing and wait for a store buffer to become
available [22].
The Intel NetBurst architecture, as implemented in the Intel Pentium 4 and Xeon processors, has 6
WC store buffers. If an application is writing to more than 4 cache lines at about the same time, the
WC store buffers will begin to be flushed to the second level cache. This is done to help insure that
a WC store buffer is ready to combine data for writes to a new cache line. The "Intel Pentium 4
Processor and Intel Xeon Processor Optimization" guide recommends writing to no more than 4
distinct addresses or arrays in an inner loop, in essence writing to no more than 4 cache lines at a
time, for best performance. With Hyper-Threading enabled processors, the WC store buffers are
shared between two logical processors on a single physical processor. Therefore, the total number
of simultaneous writes by both threads running on the two logical processors must be counted in
deciding whether the WC store buffers can handle all the writes [5]. In order to be reasonably
certain of getting the best performance by taking fullest advantage of the WC store buffers, it is
best to split inner loop code into multiple inner loops, each of which writes no more than two
regions of memory. Generally, look for data being written to arrays with an incrementing index, or
stores via pointers that move sequentially through memory. Writes to elements of a modest-sized
structure or several sequential data locations can usually be counted as a single write, since they
will often fall into the same cache line and be write combined on a single WC store buffer.
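A minimal example of this loop-splitting advice is sketched below (the array names and sizes are
illustrative, not from the paper). The first routine writes to six distinct arrays in one inner
loop, i.e. six cache lines at a time; the second splits the work into loops that each write at most
two regions of memory, keeping the demand on the shared WC store buffers low:

    /* Splitting an inner loop so each loop writes at most two memory regions. */
    #include <stdio.h>

    #define N 4096

    static double a[N], b[N], c[N], d[N], e[N], f[N], src[N];

    /* Writes to 6 distinct arrays per iteration: may thrash the WC buffers,
     * especially with two threads sharing one physical processor. */
    static void fill_all_at_once(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = src[i] + 1.0;
            b[i] = src[i] + 2.0;
            c[i] = src[i] + 3.0;
            d[i] = src[i] + 4.0;
            e[i] = src[i] + 5.0;
            f[i] = src[i] + 6.0;
        }
    }

    /* Split into three loops, each writing no more than two regions. */
    static void fill_split(void)
    {
        for (int i = 0; i < N; i++) { a[i] = src[i] + 1.0; b[i] = src[i] + 2.0; }
        for (int i = 0; i < N; i++) { c[i] = src[i] + 3.0; d[i] = src[i] + 4.0; }
        for (int i = 0; i < N; i++) { e[i] = src[i] + 5.0; f[i] = src[i] + 6.0; }
    }

    int main(void)
    {
        fill_all_at_once();
        fill_split();
        printf("a[0]=%f f[0]=%f\n", a[0], f[0]);
        return 0;
    }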
64K alias conflict
The Intel Xeon processor with Hyper-Threading Technology shares the first level data cache among
logical processors. Two data virtual addresses that reside on cache lines that are modulo 64 KB
apart will conflict for the same cache line in the first level data cache. This can affect both the first
level data cache performance as well as impact the branch prediction unit. This alias conflict is
particularly troublesome for applications that create multiple threads to perform the same
operation but on different data. Subdividing the work into smaller tasks performing the identical
operation is often referred to as data domain decomposition. Threads performing similar tasks and
accessing local variables on their respective stacks will encounter the alias conflict condition
resulting in significantly degraded overall application performance [12].
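One application-level way to sidestep this conflict, sketched below under the assumption that each
worker thread's hot data lives on its stack, is to shift every thread's stack frame by a
thread-specific amount (here 128 bytes per thread id, an arbitrary illustrative choice) so that the
threads' local variables no longer lie a multiple of 64 KB apart in the shared first-level cache:

    /* Per-thread stack offset to avoid 64 KB aliasing between worker threads. */
    #include <stdio.h>
    #include <string.h>
    #include <alloca.h>
    #include <pthread.h>

    #define NTHREADS 2

    static void do_work(int id)
    {
        double local[1024];               /* hot per-thread stack data */

        memset(local, 0, sizeof local);
        local[0] = id;
        printf("thread %d works at %p\n", id, (void *)local);
    }

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;

        /* Shift this thread's stack frame by a thread-specific amount so the
         * hot locals of different threads do not alias modulo 64 KB. */
        volatile char *pad = alloca((id + 1) * 128);
        pad[0] = 0;                       /* keep the allocation live */

        do_work(id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }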
Effective cache locality
There are many factors that impact cache performance. Effective use of data cache locality is one
such significant factor. A well-known data cache blocking technique is used to take advantage of
data cache locality. The cache blocking technique restructures loops with frequent iterations over
large data arrays by sub-dividing the large array into smaller blocks, or tiles, such that the block of
data fits within the data cache. Each data element in the array is reused within the data block
before operating on the next block or tile. Depending on the application, a cache data blocking
technique is very effective. It is widely used in numerical linear algebra and is a common
transformation applied by compilers and application programmers [20]. Since the L2 cache contains
instructions as well as data, compilers often try to take advantage of instruction locality by
grouping related blocks of instructions close together as well [13]. However, the effectiveness of
the technique is highly dependent on the data block size, the processor cache size, and the number
of times the data is reused. With the introduction of Hyper-Threading Technology in the Intel Xeon
processor, where the cache is shared between logical processors, the relationship between block
size and cache size still holds, but it must also account for the number of logical processors
supported by the physical processor. Applications should detect the data cache size using Intel’s CPUID
instruction and dynamically adjust cache blocking tile sizes to maximize performance across
processor implementations. Be aware that a minimum block size should be established such that the
overhead of threading and synchronization does not exceed the benefit from threading [4]. As a
general rule, cache block sizes should target approximately one-half to three-quarters the size of
the physical cache for non-Hyper-Threading processors and one-quarter to one-half the physical
cache size for a Hyper-Threading enabled processor supporting two logical processors.
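The following C sketch applies this guideline to a blocked matrix transpose (the cache size and the
number of logical processors are passed in as parameters; a real application would obtain them via
CPUID as suggested above, and the one-half budget and the minimum block of 8 are illustrative
choices, not figures from the paper):

    /* Cache-blocked matrix transpose with a tile size derived from cache size. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    void transpose_blocked(const double *in, double *out, int n,
                           size_t cache_bytes, int logical_per_physical)
    {
        /* Target roughly half of this thread's share of the shared cache. */
        size_t budget = cache_bytes / (2 * logical_per_physical);
        int block = (int)sqrt((double)budget / (2 * sizeof(double)));

        if (block < 8)
            block = 8;     /* floor so threading overhead does not dominate */

        for (int ii = 0; ii < n; ii += block)
            for (int jj = 0; jj < n; jj += block)
                for (int i = ii; i < n && i < ii + block; i++)
                    for (int j = jj; j < n && j < jj + block; j++)
                        out[j * (size_t)n + i] = in[i * (size_t)n + j];
    }

    int main(void)
    {
        enum { N = 512 };
        double *in = malloc(N * N * sizeof *in);
        double *out = malloc(N * N * sizeof *out);

        for (int i = 0; i < N * N; i++)
            in[i] = i;
        /* Assume a 512 KB shared cache and 2 logical processors per package. */
        transpose_blocked(in, out, N, 512 * 1024, 2);
        printf("out[1] = %f\n", out[1]);
        free(in);
        free(out);
        return 0;
    }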
Summary
HT brings additional performance to many applications, but it is not an automatic process. The
speedup can be achieved via operating system optimization, following the threading methodology
for designing Hyper-Threading applications, avoiding known traps, and applying smart thread
management practices. In addition, there are a large number of dedicated engineers working to
analyze and optimize applications for this technology; their contributions will continue to make a
real difference to server applications and clustering solutions.
References:
1. A. Agarwal, B.-H. Lim, D. Kranz and J. Kubiatowicz. APRIL: A Processor Architecture for
Multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer
Architecture, pages 104-114, May 1990
2. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porter and B. Smith. The TERA Computer
System. In International Conference on Supercomputing, pages 1-6, June 1990.
3. L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In
Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282–293,
June 2000
4. P. Doerffer, O Szulc. Technologia hiperwątkowości (Hyper Threading) w zastosowaniach CFD. Institute
of Fluid-Flow Machinery, Polish Academy of Sciences, Gdańsk, August 2003
5. M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine
Multicomputer. In 28th Annual International Symposium on Microarchitecture, Nov. 1995
6. L. Hammond, B. Nayfeh, and K. Olukotun. A Single-Chip Multiprocessor. Computer, 30(9):79–85,
September 1997.
7. G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker and P. Roussel. The Microarchitecture
of the Intel® Pentium® 4 Processor. Intel Technology Journal. 1st quarter 2001.
8. G. Hinton and J. Shen. Intel’s Multithreading Technology. Microprocessor Forum. October 2001.
http://www.intel.com/research/mrl/Library/HintonShen.pdf
9. Intel Corporation. IA-32 Intel® Architecture Software Developer’s Manual, Volume 2: Instruction Set
Reference. Order number 245471. 2001. http://developer.intel.com/design/Pentium4/manuals
10. Intel Corporation. IA-32 Intel® Architecture Software Developer’s Manual, Volume 3: System
Programming Guide. Order number 245472. 2001
http://developer.intel.com/design/Pentium4/manuals
11. Intel Corporation. The Intel® VTune™ Performance Analyzer.
http://developer.intel.com/software/products/vtune
12. Intel Corporation. Intel OpenMP C++/Fortran Compiler for Hyper-Threading Technology:
Implementation and Performance, Xinmin Tian, Aart Bik, Milind Girkar, Paul Grey, Hideki Saito,
Ernesto Su.
13. Intel Corporation. Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon™ Processor MP,
Application Note AP-949. http://developer.intel.com/software/products/itc/sse2/sse2_appnotes.htm
14. D. J. C. Johnson, HP's Mako Processor. Microprocessor Forum. October 2001.
http://www.cpus.hp.com/technical_references/mpf_2001.pdf
15. J. A. Redstone, S. J. Eggers and H. M. Levy. An Analysis of Operating System Behavior on a
Simultaneous Multithreaded Architecture. Proceedings of the 9th International Conference on
Architectural Support for Programming Languages and Operating Systems, November 2000
16. Standard Performance Evaluation Corporation. SPEC CPU2000 Benchmark.
http://www.spec.org/osg/cpu2000
17. B. J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In SPIE Real
Time Signal Processing IV, pages 241-248, 1981
18. A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor.
Proceedings of the 9th International Conference on Architectural Support for Programming Languages
and Operating Systems, November 2000
19. J. M. Tendler, S. Dodson and S. Fields. POWER4 System Microarchitecture. Technical White Paper.
IBM Server Group. October 2001.
20. D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-chip Parallelism. In
22nd Annual International Symposium on Computer Architecture, June 1995
21. D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting choice: Instruction fetch and
issue on an implementable simultaneous multithreading processor. In 23rd Annual International
Symposium on Computer Architecture, May 1996
22. D. Vianney. Hyper-Threading speeds Linux. Linux Kernel Performance Group, Linux Technology
Center, IBM, January 2003