Multi-Threading Technology and The Challenges of Meeting Performance
Introduction
The increasing ubiquity of the smartphone along with a myriad of other smart mobile consumer
devices is changing personal communication. But it is the smartphone that perhaps has become
the leading wirelessly connected device used by people to browse the web, interact on social
networks, play audio or video, or receive and send emails – in addition of course to the more
traditional and prosaic activity of using it to speak to another human being. However, rather than
the ‘always connected’ tasks of voice calls, text messages and emails, it is applications such as
browsing, video streaming, gaming and navigation that are increasing the demand for higher
performance silicon. What is more, this next-generation silicon has to enable increased battery
life and also meet strict thermal requirements.
There are well-known question marks over the ability of single-core processors to meet these
increasing demands as they start to reach performance limits. The latest microprocessors can
now integrate billions of transistors on a single silicon die. Historically this increasing integration
has been largely governed by Moore's Law, which concerns the doubling every two years of the
number of transistors that can be implemented in a given area of silicon. However, it is proving
increasingly difficult to utilize these additional transistors to accelerate the execution of a single
sequential program.
The issue is that processors tend to become less efficient as single-thread performance increases. The rationale is that higher single-thread performance requires an increasing number of complex structures to exploit higher levels of Instruction Level Parallelism (ILP). These additional structures are not necessarily performing useful computational work; rather, they are largely occupied with scheduling and managing instructions in an order
that can get the most out of the hardware.
In addition, in order to provide ever more instruction level parallelism, more and more execution
units are placed within a single processor core. This, combined with the scheduling hardware,
offers the possibility of higher single thread performance, but the challenge of exposing
sufficient instruction-level parallelism to keep the execution units busy grows rapidly. When such parallelism cannot be found, or when pipeline bubbles occur, as they do on a branch mis-prediction, these execution units are under-utilized. Such under-utilization is
clearly a source of inefficiency.
An alternative is to spend the transistor budget on multiple, simpler processor cores. Each of these cores runs a separate thread of execution, so the overall aggregate throughput from a given transistor budget can be increased, provided of course that the computational problem can be expressed with thread-level parallelism.
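To make the thread-level parallelism point concrete, the short C++ sketch below (the array size and the reduction task are illustrative choices, not taken from this paper) splits a summation across however many hardware threads the platform reports; each core runs its own independent thread, and aggregate throughput grows with the core count because the chunks do not depend on one another.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        // A large array of values to sum; each worker handles a disjoint slice.
        const std::vector<double> data(1 << 24, 1.0);
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());

        std::vector<double> partial(n, 0.0);
        std::vector<std::thread> workers;
        const std::size_t chunk = data.size() / n;

        for (unsigned i = 0; i < n; ++i) {
            const std::size_t lo = i * chunk;
            const std::size_t hi = (i + 1 == n) ? data.size() : lo + chunk;
            // One software thread per core, each running independently of the rest.
            workers.emplace_back([&data, &partial, i, lo, hi] {
                partial[i] = std::accumulate(data.begin() + lo,
                                             data.begin() + hi, 0.0);
            });
        }
        for (auto& w : workers) w.join();

        // The aggregate result; throughput scales with the number of cores used.
        std::cout << std::accumulate(partial.begin(), partial.end(), 0.0) << '\n';
    }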
Multi-threading
Multi-threading technology was first developed in the 1990s and famously introduced into
mainstream PC processor applications in 2002. In essence multi-threading uses one physical
processor core to run two threads of execution simultaneously, creating two ‘logical’ processors.
This offers the promise of addressing the under-utilization problem of the complex processor by
using a second thread as a source of independent instructions.
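This ‘logical processor’ view is exactly what the operating system sees. As a hedged illustration, the C++ sketch below reports the logical CPU count and, on Linux (the sysfs path is Linux-specific), which logical CPUs are hardware threads of the same physical core.

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
        // hardware_concurrency() counts *logical* processors: on a 2-way
        // multi-threaded core this is twice the number of physical cores.
        std::cout << "logical CPUs: "
                  << std::thread::hardware_concurrency() << '\n';

        // On Linux, sysfs reports which logical CPUs share one physical core.
        std::ifstream siblings(
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        std::string line;
        if (std::getline(siblings, line))
            std::cout << "cpu0 shares its core with logical CPUs: "
                      << line << '\n';
    }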
By addressing the under-utilization, it seems that the efficiency of the complex core can be
improved; there are fewer occasions when the expensive multiple execution units are not being
used, so the efficiency, expressed as computation performed per unit of energy consumed, is
improved.
[1] An out-of-order processor reschedules a sequence of instructions, originally specified in program order, so as to maximize use of processor resources. The processor executes the rescheduled instructions and then retires the results in the original program order.
It is important to realize that when the physical processor core is running two threads of execution in this way, the throughput of each thread is usually substantially reduced compared with the throughput that thread would have achieved running on its own. However, by finding opportunities to use the under-utilized execution units, the prospect held out by multi-threading is that the aggregate throughput of the two threads running simultaneously is greater than if they had run sequentially, leading to higher performance and lower energy consumption. For example, if each of two co-scheduled threads achieved only 65% of its solo throughput, the pair together would still deliver 1.3 times the throughput of a single thread running alone.
Using the premises above, a common conclusion is that a processor designed for multi-threading can bring performance and efficiency advantages: specifically, additional throughput for approximately half the additional power consumption, although it does come with a slight increase in silicon die area because of the additional hardware execution resources required. The addition of multi-threading to an out-of-order core delivers a small improvement in throughput per Watt, and also in throughput per mm², compared with that achieved by the equivalent out-of-order core without multi-threading. On initial observation, then, multi-threading sounds like a compelling proposition, but looking deeper into the subject unveils a number of uncertainties.
Given that the management and scheduling structures contribute so much to the energy consumption of a complex, out-of-order processor, it is inefficient to pay for such logic for each thread while achieving, for that thread, a throughput that could be matched by a much smaller, simpler and more efficient processor core.

A more efficient alternative can be to implement two processor cores with low single-thread performance. When the single-thread performance of a large processor is not required, it can be argued that it is preferable to run the two threads on two small processors rather than on a large processor disguised as two small processors.
Achieving higher single-thread performance means extracting more instruction-level parallelism, which requires larger rename pools and reservation stations, as well as more hazard checking. Practical experience of large processors shows that this additional logic tends to suffer from diminishing returns, where the additional energy overhead of these resources increases faster than average performance.
When compared with an in-order processor such as the Cortex-A7 processor, an out-of-order
high performance processor such as the Cortex-A12 processor adds significant additional
hardware resources, including register rename engines and execution dependency tracking.
This additional logic enables a higher level of single-thread performance, but it also means higher energy consumption per instruction. Evaluation tests by ARM have shown that, for highly complex processors, increasing performance by 50% costs more than a 50% increase in power. The figure above shows an example of that scaling, comparing a Cortex-A7 processor with a Cortex-A12 processor and an ARM-estimated multi-threaded version of the Cortex-A12 processor.
Looking at the typically reported costs in die area and power consumption versus the throughput benefits of adding multi-threading capability to a typical out-of-order processor core, the incremental throughput benefit is approximately twice the incremental cost in die area and power consumption.
Fig.2 Relative pipeline complexity, Cortex-A7 vs Cortex-A12
Based on this premise, it has been estimated that adding multi-threading to a high-performance
Cortex core, the Cortex-A12 core for example, would result in a processor that offers
approximately the same throughput as two Cortex-A7 cores. However, this hypothetical multi-threaded high-performance Cortex core, while delivering equivalent throughput, would use twice the aggregate power of the two Cortex-A7 cores and would have a die area approximately 10 percent larger than the combined Cortex-A7 cores. This assumes that both systems have a
level 2 cache of the same size, which would be consistent with delivering an equivalent
throughput.
Furthermore, sharing a level 1 cache between two threads leads to interference effects such as cache thrashing, which have been reasonably well documented in research into multi-threading. However, carefully written code can actually deliver ‘constructive cache interference’, where one thread conveniently brings in a piece of data that a second thread is just about to use, thereby providing a prefetching effect. But this is only really applicable in well-controlled, closed environments. For the independent workloads that are commonplace in open systems, constructive interference is rare compared with the destructive effects, and deliberately exploiting it is extremely difficult, especially under a single operating system such as Linux or Android. Of course, a bigger cache can be used, but the resulting increase in silicon area undermines any area saving that comes from implementing multi-threading.
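A standard, easily reproduced form of destructive interference between threads sharing cache hardware is false sharing. The C++ sketch below is illustrative rather than taken from this paper, and assumes 64-byte cache lines: two threads update counters that either share one cache line or are padded onto separate lines, and on most hardware the packed version runs markedly slower because the shared line ping-pongs between the threads.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>

    // Two counters packed into one cache line (assuming 64-byte lines)...
    struct Packed { std::atomic<std::uint64_t> a{0}, b{0}; };
    // ...and the same counters padded onto separate lines.
    struct Padded {
        alignas(64) std::atomic<std::uint64_t> a{0};
        alignas(64) std::atomic<std::uint64_t> b{0};
    };

    template <typename Counters>
    double run() {
        Counters c;
        auto work = [](std::atomic<std::uint64_t>& x) {
            for (int i = 0; i < 50'000'000; ++i)
                x.fetch_add(1, std::memory_order_relaxed);
        };
        const auto t0 = std::chrono::steady_clock::now();
        std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
        t1.join(); t2.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::cout << "packed (contended line): " << run<Packed>() << " s\n";
        std::cout << "padded (separate lines): " << run<Padded>() << " s\n";
    }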
In addition, as two completely separate threads are being executed on the same physical piece
of silicon hardware, it becomes much harder to guarantee perfect isolation than if they were
being executed on separate cores. Potentially, this can lead to significantly longer processor
validation for silicon vendors.
Some tasks are inherently ill-suited to large, complex processors because of a fundamental lack of predictability in their memory accesses. Such memory-bound tasks tend to stall processor cores while they wait for data to arrive from slow memory before processing can proceed. When this happens there is very little benefit in using a large, complex processing core capable of exploiting high instruction-level parallelism, because such a core simply remains stalled. A simple analogy would be taking a high-performance sports car through the centre of a city with many red lights. Running such tasks on simple processor cores, which do not carry the hardware and energy overhead of complex scheduling and control, is a very efficient solution.
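A hedged sketch of such a memory-bound task appears below (the sizes and access pattern are illustrative): chasing pointers through a randomly linked array creates a serial chain of cache misses, so the core stalls on every access no matter how much instruction-level-parallelism hardware it carries.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        // Enough entries that the working set is far larger than any L2 cache.
        const std::size_t n = 1 << 22;

        // Build one random cycle through all n slots, so every load lands at
        // an unpredictable address and hardware prefetchers are defeated.
        std::vector<std::size_t> order(n), next(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        for (std::size_t k = 0; k + 1 < n; ++k) next[order[k]] = order[k + 1];
        next[order[n - 1]] = order[0];

        // A serial dependency chain: index i+1 is unknown until load i has
        // completed, so no amount of ILP hardware can hide the misses.
        std::size_t i = 0;
        for (std::size_t step = 0; step < n; ++step) i = next[i];
        std::cout << i << '\n';   // print the result so the chase is kept
    }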
In such workloads, which are common in situations such as networking data plane applications, it can be beneficial to run a second thread, or more, during the times when other threads are stalled. Stalling is wasteful of energy, as parts of the system will inevitably consume power during stalls, either through leakage or through the dynamic power of clocks that cannot be fully stopped.
Such a system is different from large-core multi-threading: each thread is not being pushed through a processor core designed for much higher throughput, so the vast majority of the energy spent in the processor core goes on the essential work of computation rather than on management overhead. Correspondingly, there are efficiencies to be gained by finding alternative threads to run while the first task is stalled, constructively using energy that would otherwise be wasted during the stalls.
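The principle can be shown with a toy simulation, given below in C++ (purely illustrative, with assumed instruction counts and miss latencies, and not a model of any real core): a single issue slot switches to another ready thread whenever the current one stalls on a simulated memory access, so cycles that would otherwise be wasted still retire useful work.

    #include <array>
    #include <iostream>

    // A runnable toy model: each "thread" has work remaining and may be
    // stalled waiting on a simulated memory access.
    struct SimThread {
        int remaining = 100;   // instructions left to execute
        int stall_until = 0;   // cycle at which its outstanding miss completes
    };

    int main() {
        std::array<SimThread, 2> threads;
        constexpr int kMissLatency = 10;   // assumed miss cost, in cycles
        int cycle = 0, issued = 0;

        while (threads[0].remaining > 0 || threads[1].remaining > 0) {
            ++cycle;
            // Issue from the first thread that is ready; a stalled thread is
            // skipped, so its stall cycles are spent on the other thread.
            for (auto& t : threads) {
                if (t.remaining > 0 && t.stall_until <= cycle) {
                    --t.remaining;
                    ++issued;
                    if (t.remaining % 5 == 0)      // every 5th op misses
                        t.stall_until = cycle + kMissLatency;
                    break;                         // one issue slot per cycle
                }
            }
        }
        std::cout << "utilization: " << (100 * issued) / cycle << "%\n";
    }

Run with a single active thread, the issue slot sits idle for most of each miss; with two threads the reported utilization rises, which is precisely the gain described above.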
Core Combinations
As previously mentioned, the dominant solution used today in the majority of mobile phone
platforms is the single-threaded multi-processor solution. And increasingly, this solution is
evolving to be a combination of multiple processors that offer significantly different levels of
performance. A key industry example is the ARM big.LITTLE processing approach, which
combines high-performance cores with highly efficient cores in a multi-core processor system. A
central tenet of the platform is running small workloads on small and efficient processors,
whereas multi-threading is about running small workloads on larger, more complex and
therefore less-efficient processors.
In the first big.LITTLE system, the ‘big’ core is a Cortex-A15 processor, paired with the ‘LITTLE’
Cortex-A7 processor. By coherently connecting the Cortex-A15 and Cortex-A7 processors, the
system is flexible enough to support various use models, which can be tailored to the
processing requirements of the tasks.
With big.LITTLE processing, if the application to be run requires a level of performance that can
be met by running on the Cortex-A7 core, then the operating system will schedule that task to
run on the Cortex-A7 core. This exploits the inherent efficiency of the Cortex-A7 core, while
meeting the needs of the application. However, if the task requires the performance of the
Cortex-A15 core, then the task will be scheduled by the operating system to the Cortex-A15
core.
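In practice, recent Linux kernels make this placement decision themselves (through energy-aware scheduling), but the mechanism can be illustrated by pinning a thread to a chosen core explicitly. The sketch below is Linux-specific, and the CPU index is a placeholder: which CPU numbers correspond to the ‘big’ and ‘LITTLE’ clusters varies between SoCs and must be read from the platform's topology description.

    #include <pthread.h>
    #include <sched.h>

    #include <iostream>
    #include <thread>

    // Pin the calling thread to one logical CPU (Linux-specific glibc call).
    static bool pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    int main() {
        // Placeholder index: which CPU numbers are 'big' and which 'LITTLE'
        // depends on the SoC; recent kernels expose relative core performance
        // in /sys/devices/system/cpu/cpu*/cpu_capacity.
        const int little_cpu = 0;

        std::thread light_task([little_cpu] {
            if (pin_to_cpu(little_cpu))
                std::cout << "low-demand task pinned to CPU "
                          << little_cpu << '\n';
            // ... the low-performance-demand work would run here ...
        });
        light_task.join();
    }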
While this appears to create complexity for the operating system, the complexity is in principle
little different from the complexity of an operating system that schedules effectively for multi-
threading. In a multi-threading system, the operating system must decide whether the
application to be run can have its performance requirements met by running as a second thread
on a multi-threaded core, or has a higher performance requirement meaning it must be run as
the sole occupant of a core. Indeed this scheduling problem for multi-threading is made worse
by the fact that the decision to schedule a second thread on a multi-threaded core affects the
performance of both threads on that core, so the scheduler must also consider the needs of the
current occupant of that core.
Conclusions
Multi-threading is a significant processor technology and it is highly likely that it will see growing implementation in a few specific applications. Of particular interest are the technology’s possibilities in networking, and more specifically ‘small-core’ multi-threading. This is a compelling use that should readily find a place in network data plane applications.
However, in mobile applications, where the performance/power-efficiency ratio is crucial, and particularly where applications require larger superscalar and out-of-order multicore designs, single-threaded multicore implementations such as big.LITTLE are the most efficient solution, certainly today and probably for the next few generations.
ENDS