Soft Machine MPR-11303


SOFT MACHINES TARGETS IPC BOTTLENECK

New CPU Approach Boosts Performance Using Virtual Cores


By Linley Gwennap (October 27, 2014)

Coming out of stealth mode at last week’s Linley Processor Conference, Soft Machines disclosed a new CPU technology that greatly improves performance on single-threaded applications. The new VISC technology can convert a single software thread into multiple virtual threads, which it can then divide across multiple physical cores. This conversion happens inside the processor hardware and is thus invisible to the application and the software developer. Although this capability may seem impossible, Soft Machines has demonstrated its performance advantage using a test chip that implements a VISC design.

Without VISC, the only practical way to improve single-thread performance is to increase the parallelism (instructions per cycle, or IPC) of the CPU microarchitecture. Taken to the extreme, this approach results in massive designs such as Intel’s Haswell and IBM’s Power8 that deliver industry-leading performance but waste power and die area. But more-efficient designs, such as Cortex-A7, offer weak performance on single-threaded software, which still constitutes the majority of applications. VISC (which, according to the company, does not stand for virtual instruction set computing) delivers high performance on a single thread using a simpler, more efficient design.

As an added bonus, Soft Machines has created an intermediate software layer that can translate from any standard instruction set into threadable VISC instructions, as Figure 1 shows. The initial design uses this conversion layer to run ARM code, but the company claims it can create a conversion layer for x86 or other instruction sets as needed.

[Figure 1 diagram. Labels, in order: Application (sequential code); Single Thread; OS and Hypervisor; Standard ISA; Conversion Layer; VISC ISA; Global Front End; Threadlets; four Physical Cores; Main Memory.]
Figure 1. Soft Machines VISC technology. The conversion layer converts standard instructions (e.g., ARM) into VISC instructions. The hardware then converts a single instruction stream into multiple threadlets that can execute on multiple physical cores.

Soft Machines is no fly-by-night operation. The company has spent seven years and $125 million to develop and validate its technology. It currently has more than 250 employees, led by CEO Mahesh Lingareddy and president/CTO Mohammad Abdallah. Investors include AMD, GlobalFoundries, and Samsung as well as government investment funds from Abu Dhabi (Mubadala), Russia (Rusnano and RVC), and Saudi Arabia (KACST and Taqnia). Its board of directors is chaired by GlobalFoundries CEO Sanjay Jha and includes legendary entrepreneur Gordon Campbell.

Soft Machines hopes to license the VISC technology to other CPU-design companies, which could add it to their existing CPU cores. Because its fundamental benefit is better IPC, VISC could aid a range of applications from

© The Linley Group • Microprocessor Report October 2014



smartphones to servers. The company has applied for more than 80 patents on its technology.

Breaking the IPC Bottleneck
Since the first multiprocessor (SMP) computers appeared in the 1960s, programmers have realized that dividing their code to run on multiple CPUs improves throughput. In the past decade, multicore processor chips have driven SMP techniques into high-volume devices such as PCs, smartphones, and tablet computers.

Despite these advances, most programs continue to rely on a single instruction stream, or thread, to do most or all of the work. Breaking a program into multiple threads, ensuring that each can work independently with minimal inter-thread communication, is a challenging task. A few application types, such as graphics and packet processing, are simple to thread, but most others are not. For decades, development-tool vendors and other researchers have attempted to create a “magic compiler” that automatically creates a multithreaded program, but these tools either work only on a few easily parallelizable programs or require manual input from the programmer.

This failure puts the burden on CPU designers to improve single-thread performance. Once clock speeds hit a wall around 2005, performance per clock became the primary focus. Today’s high-end CPUs, however, are well past the point of diminishing returns: adding another instruction-issue slot, another function unit, or a bigger reorder buffer burns more power while offering little IPC improvement. As a result, the IPC of high-end CPUs has increased at an annual rate of only 6% over the past decade, despite 40% annual growth in transistor count.

Thus, the implications of Soft Machines’ VISC technology are tremendous. By breaking a single software thread into multiple hardware threads, the technology can combine multiple CPU cores into a single virtual core that delivers greater performance than any single CPU. This accomplishment could break the IPC bottleneck, accelerating performance scaling well beyond single-digit percentages to 2x or greater.

Pulling Apart the Threads
How can Soft Machines accomplish in hardware a task that software vendors have failed at? The hardware sees only a binary instruction stream, so it has less visibility than the compiler regarding data and code structures. The hardware has certain advantages, however, since it can access pointer values and other run-time information. At the conference, the company disclosed some description of its hardware design, although it did not provide complete details. The following provides a high-level view.

A VISC processor comprises a single instruction cache, a global front end, and two or more physical cores that each has its own instruction scheduling, register files, and data cache. The processor may also have a level-two (L2) cache that backs both the global instruction cache and the individual data caches. The initial test chip includes two physical cores and 1MB of L2 cache, as Figure 2 shows, but future designs are likely to include four or more cores.

The front end fetches instructions from the I-cache and places them into an instruction buffer. From this buffer, it attempts to form sequences of related instructions. For example, instructions with register dependencies can be grouped. These sequences are fairly short and are quite different from a conventional software thread; they can be better thought of as threadlets. The front end also performs global register renaming to avoid false dependencies. It can check pointers so as to group instructions that refer to the same memory location. The process of forming threadlets adds three cycles to the test chip’s pipeline.

After creating the threadlets, the front end dispatches them to the physical cores. When it sends a threadlet to a core, it also allocates in that core the rename registers that are needed to execute the instructions. The goal is to minimize cases in which an instruction in one core requires a register allocated in another core, as this situation requires multiple cycles to resolve. Figure 2 shows a link between the two register files that handles this situation; the link becomes more complex in a processor with more than two cores.

Getting to the Core
Each core buffers the instructions received from the front end and can reorder them to avoid stalls. The buffer can

[Figure 2 diagram. Global Front End: Branch Pred, Instruction Cache, Fetch Unit, Decode / Thread Formation, Instr Buffer, Retire Unit, Rename/Dispatch. Two Physical Cores, each with: Scheduler, Mem Unit, ALUs, Rename Registers, Data Cache. Level-Two Cache.]
Figure 2. Conceptual diagram of a VISC processor. The global front end fetches a single instruction stream and divides it into threadlets that are dispatched to the multiple physical cores, which schedule and execute the instructions.
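
As a rough illustration of the grouping step described above, in which instructions with register dependencies are collected into short sequences, the following Python sketch clusters a toy instruction list with a union-find structure. This is not Soft Machines' algorithm, which the company has not disclosed. The instruction format, the `form_threadlets` function, and the grouping rule are all assumptions made for illustration, and the sketch assumes registers have already been renamed so that each destination is written only once.

```python
from collections import defaultdict

# Hypothetical toy instruction: (destination register, source registers).
# Assumes registers are already renamed, so each destination is written once.
Instr = tuple[str, tuple[str, ...]]


def form_threadlets(instrs: list[Instr]) -> list[list[int]]:
    """Group instruction indices that are linked by register dependencies."""
    parent = list(range(len(instrs)))

    def find(i: int) -> int:
        # Follow parent links to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    producer: dict[str, int] = {}  # register -> index of the instruction that wrote it
    for i, (dest, srcs) in enumerate(instrs):
        for src in srcs:
            if src in producer:
                # The consumer joins the group of the instruction that produced its input.
                parent[find(i)] = find(producer[src])
        producer[dest] = i

    groups: dict[int, list[int]] = defaultdict(list)
    for i in range(len(instrs)):
        groups[find(i)].append(i)
    # Order groups by their first instruction; program order is kept inside each group.
    return sorted(groups.values(), key=lambda g: g[0])


program: list[Instr] = [
    ("r1", ()),            # 0: starts the first dependency chain
    ("r2", ("r1",)),       # 1: depends on 0
    ("r3", ()),            # 2: starts an independent chain
    ("r4", ("r3",)),       # 3: depends on 2
    ("r5", ("r2", "r1")),  # 4: depends on 0 and 1
    ("r6", ("r4",)),       # 5: depends on 3
]
threadlets = form_threadlets(program)
print(threadlets)
```

For this toy program the two independent dependency chains end up in separate groups, which is the property that would let them run on different cores without cross-core register traffic. A real front end would also have to bound the length of each group and account for memory dependencies, neither of which this sketch attempts.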


contain instructions from various threadlets, and these instructions can intermix freely. The core need not switch threads, because each threadlet has its own set of rename registers. When instructions are ready to execute, the core fetches data from its register file and completes the operations using its function units.

As in any out-of-order (OOO) design, speculative results are held pending until the completion of all previous instructions. In the VISC approach, the retirement unit must track instructions across all physical cores, since the original instruction stream could have been divided among multiple cores. Similarly, a branch misprediction detected in one core can affect instructions executing in other cores.

Thus, a VISC processor can be viewed as having global hardware for branch prediction as well as for fetching, grouping, tracking, and retiring instructions, but local hardware for scheduling and executing instructions and accessing memory. Compared with a high-IPC processor such as Haswell, the front end is of similar complexity, but the scheduler in each physical core is much simpler, as it manages only a few function units (versus eight in Haswell). The data cache also has fewer ports and can cycle faster. Because the execution resources of both cores can apply to a single thread, however, even a two-core VISC processor can deliver total ALU operations or memory operations per cycle that match or exceed those of Haswell.

VISC relies on a unique internal instruction set to do its magic, but true to its name, Soft Machines provides a conversion layer that converts from a standard instruction set to VISC. This approach is similar to how Transmeta processors executed x86 instructions (see MPR 1/24/00, “Transmeta Unveils Crusoe”), but VISC uses a completely different internal architecture. The company says the conversion overhead is less than 5%. In addition to its ARM-to-VISC translator, Soft Machines has prototyped an x86 translator and says it can develop other translators on customer request.

Better Load Balancing
The previous example shows how the global front end can break down a single thread, but it can also handle multiple software threads at the same time. In this way, a single VISC processor can emulate a traditional multicore design. But unlike traditional designs, it can more easily perform load balancing, matching processing power to the task at hand.

Figure 3 shows an example with two active threads: one heavy (high performance) and one light. In a processor with several identical cores, each thread would run on one core, wasting cycles for the light thread and limiting performance for the heavy thread. In a VISC design with two physical cores, the heavy thread is split into two hardware threads, one running on each core. Note that the second core automatically shares its resources between the heavy thread and the light thread, because a VISC core can mix instructions from multiple threadlets.

Soft Machines calls this approach “virtual cores.” From a software viewpoint, the heavy thread runs on a single virtual core that comprises more than one physical core, while the light thread runs on a virtual core that uses only a portion of a physical core. This allocation of cores is invisible to software, except that the heavy thread runs faster than it would otherwise. According to the company, its front-end hardware will recognize which threads need more performance and allocate the virtual core resources appropriately. The operating system need not know the number of physical cores in the processor to assign threads or balance the load.

Demonstrated Performance
Soft Machines has designed and fabricated a test chip that implements its VISC architecture using two physical cores. It refused to discuss the number of function units or other basic microarchitecture capabilities of these cores but characterized them as A15-class CPUs. We interpret this statement to mean that each core can execute three to four operations per cycle with moderate instruction reordering.

The company also withheld the test chip’s clock speed. Because it uses a pipeline with only 10 stages (including some extra stages for VISC scheduling), the CPU cannot match the high clock speeds of leading-edge x86 and ARM processors. We estimate the chip runs at several hundred megahertz. Even at this low speed, it completes some programs in less time than a low-end Haswell processor.

Despite its relatively simple design, the chip achieves spectacular performance. On the single-thread SPEC2006

For More Information
A copy of the Soft Machines presentation from the Linley Processor Conference is available for free download at www.linleygroup.com/events/proceedings.php?num=29 (registration required). For additional information on Soft Machines, access the company’s web site at www.softmachines.com.

[Figure 3 diagram. Heavy App and Light App feed the Global Front End, which maps two Virtual Cores onto two Physical Cores.]
Figure 3. Soft Machines virtual cores. A single virtual core can comprise multiple physical cores or only a portion of a physical core. In this way, the front-end hardware aligns the performance of the virtual core with the needs of each thread.
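
The heavy-thread/light-thread example can be put into a toy capacity model. The sketch below is an illustrative assumption, not a description of the actual hardware: it treats each physical core as one unit of execution capacity, gives each thread a hypothetical demand in the same units, and ignores any efficiency lost when a single thread spans more than one core. The function names and the demand values are invented for the example.

```python
def conventional(heavy: float, light: float) -> tuple[float, float]:
    """Each thread is pinned to its own core (capacity 1.0 per core)."""
    return min(heavy, 1.0), min(light, 1.0)


def virtual_cores(heavy: float, light: float, cores: int = 2) -> tuple[float, float]:
    """The light thread takes what it needs; the heavy thread may use the rest."""
    total = float(cores)
    light_share = min(light, 1.0)  # a light thread fits within one physical core
    heavy_share = min(heavy, total - light_share)
    return heavy_share, light_share


# Hypothetical demands: the heavy thread could use 1.6 cores' worth of
# execution resources, while the light thread needs only 0.3.
print(conventional(1.6, 0.3))   # heavy thread is capped at one core
print(virtual_cores(1.6, 0.3))  # heavy thread spills onto the second core
```

In the conventional case the heavy thread is limited to one core while most of the other core sits idle. In the virtual-core case the same two cores satisfy both demands, which is the load-balancing effect the article describes.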


test suite, the company reports an average IPC of 2.1, counting ARM instructions rather than VISC instructions. This IPC compares with 0.71 for Cortex-A15 and 1.39 for Haswell. (For consistency, Soft Machines measured the IPC on all three processors using GCC rather than Intel’s favorite compiler, ICC.) Thus, the VISC chip achieved three times the IPC of ARM’s highest-end CPU shipping today and 50% better IPC than Intel’s fastest mainstream CPU.

Figure 4 shows more-detailed performance results comparing the VISC test chip against Cortex-A15 (measured in a Samsung Exynos processor). The figure shows the results of 62 different benchmarks, including components of SPEC2000, SPEC2006, EEMBC’s digital entertainment benchmark (DenBench), and the Kraken JavaScript benchmark. While individual results range from 1.5x to 7x, the average gain across these tests ranges between 3x and 4x. Soft Machines points out that although the test chip runs Linux and other applications, the software contains workarounds for certain hardware bugs, so it will likely obtain better results with future tuning and bug fixes.

Although these results are impressive, they require some caveats to put them into perspective. A shorter CPU pipeline reduces branch penalties and other pipeline hazards, thereby improving IPC compared with a longer pipeline. In addition, a low CPU speed reduces the effective latency of caches and main memory (measured in CPU cycles), again improving IPC relative to a CPU with a faster clock. The latter effect might explain why the test chip appears to perform better on SPEC2006 than on SPEC2000, which has a smaller memory footprint.

Doubling Performance Is Realistic
The test results are difficult to interpret, owing to both the lack of information about the test chip’s core CPU and the effects of the short pipeline. A better way to quantify the advantage of using VISC is the performance increase relative to a single core. By coupling the resources of two cores to execute a single thread, the maximum performance increase (compared with a single core of similar design) is 2x. In reality, the front end cannot fully utilize the resources of the second core, particularly for code that is difficult to break apart. For example, code that contains many branches or many register dependencies will be harder to thread and thus will use less of the second core.

Soft Machines has run many tests and simulations of its VISC technology. It estimates the second core improves performance by an average of 50–60% across a variety of benchmarks. This factor implies that the test chip would achieve about 1.3 IPC when using a single core, considerably higher than Cortex-A15. This higher IPC includes the effects of the lower clock speed and shorter pipeline.

Although the test chip has only two physical cores, Soft Machines has run simulations on a four-core design. As one might expect, the performance gains diminish for the additional cores: the third core adds 20–30% to single-thread performance, and the fourth adds only 10–20%. In total, the four-core design delivers about twice the performance of a single core. The unused resources in the extra cores, however, can be devoted to additional threads. For example, a four-core design could run two threads at close to their maximum performance.

Performance-critical applications will benefit from VISC, but the technology can also apply to low-power designs. As the test chip demonstrates, a VISC design can operate at a relatively low clock speed while achieving the same absolute performance as a traditional design operating at a higher clock speed. Thus, it should use less power, particularly if the voltage is reduced as well. Soft Machines, however, declined to reveal the power consumption of its test chip.

Other details also remain undisclosed, including die area. Details of how the processor handles privileged operations, inter-thread synchronization, traps, and interrupts could all affect how well it runs certain applications. Performance could vary widely across different workloads. The company provides additional information to potential customers under NDA and plans to make further disclosures over time.

A Rising Tide
Soft Machines has identified a critical (perhaps the most critical) problem in CPU design today: the minimal improvement in single-thread performance over the past decade. Despite the obvious need for multithreaded software, most applications, even performance-intensive ones, continue to

[Figure 4 chart. Y-axis: IPC Relative to Cortex-A15 (0x to 7x). X-axis groups: SPEC2000, SPEC2006, EEMBC DE (color conversion), Kraken (JavaScript).]
Figure 4. VISC performance results. This chart shows measured instructions per cycle (IPC) for Soft Machines’ test chip normalized against Cortex-A15. The company did not disclose the clock speed of the test chip. (Source: Soft Machines)
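
The headline IPC ratios and the per-core scaling estimates can be checked with a few lines of arithmetic. The Python sketch below uses only figures quoted in the article. Treating each added core's gain as a fraction of single-core performance (added together rather than compounded) is an assumption, chosen because it lands near the stated total of about twice the single-core performance. The variable names are illustrative.

```python
# IPC figures quoted in the article (single-thread SPEC2006).
ipc = {"VISC test chip": 2.1, "Cortex-A15": 0.71, "Haswell": 1.39}

visc = ipc["VISC test chip"]
ratio_a15 = visc / ipc["Cortex-A15"]   # about 3x
ratio_haswell = visc / ipc["Haswell"]  # about 1.5x, i.e. roughly 50% better
print(f"vs Cortex-A15: {ratio_a15:.2f}x, vs Haswell: {ratio_haswell:.2f}x")

# Implied single-core IPC if the second core adds 50% or 60%.
single_core_ipc = [visc / (1 + gain) for gain in (0.50, 0.60)]
print("implied single-core IPC:", [round(x, 2) for x in single_core_ipc])

# Four-core speedup over one core, assuming each quoted gain is a fraction
# of single-core performance (additive, not compounded).
low = 1 + 0.50 + 0.20 + 0.10
high = 1 + 0.60 + 0.30 + 0.20
print(f"four-core speedup: {low:.1f}x to {high:.1f}x")
```

Under these assumptions the implied single-core IPC falls between roughly 1.3 and 1.4, in line with the article's figure of about 1.3, and the four-core range of 1.8x to 2.1x brackets the stated total of about 2x.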


rely on a single thread. The software tools for creating multithreaded applications remain primitive.

With the announcement and demonstration of its VISC technology, Soft Machines has taken a big step toward solving this problem. By shifting the burden to hardware, VISC aims to deliver the benefits of multithreading to all applications. The initial performance results are excellent: a 50–100% gain in single-thread IPC represents a decade of progress at the industry’s current sluggish pace. As with any radical new technology, however, we remain skeptical, particularly given the startup’s limited disclosure. Potential customers must fully assess the technology to determine how it will perform in an actual product across a range of workloads.

Assuming the technology works as advertised, it will change the way all processors are designed. CPU designers will stop trying to improve IPC by adding more hardware; in fact, complex high-IPC designs like Haswell could disappear in favor of smaller, simpler ones. Replacing large cores with clusters of simpler cores will improve performance and power efficiency, benefiting almost every type of processor. To deliver on this promise, Soft Machines must fully validate VISC and license it to leading processor vendors. ♦
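
As a rough check on the "decade of progress" comparison, the growth rates quoted earlier in the article (6% annual IPC growth and 40% annual transistor growth) can be compounded directly. The Python sketch below assumes steady annual compounding, which is a simplification; the function and constant names are illustrative.

```python
import math

IPC_GROWTH = 0.06         # annual IPC growth of high-end CPUs, per the article
TRANSISTOR_GROWTH = 0.40  # annual transistor-count growth, per the article


def compounded(rate: float, years: float) -> float:
    """Total growth factor after compounding `rate` annually for `years`."""
    return (1 + rate) ** years


def years_to_reach(factor: float, rate: float) -> float:
    """Years of compounding at `rate` needed to grow by `factor`."""
    return math.log(factor) / math.log(1 + rate)


print(f"IPC after 10 years: {compounded(IPC_GROWTH, 10):.2f}x")
print(f"Transistors after 10 years: {compounded(TRANSISTOR_GROWTH, 10):.1f}x")
print(f"Years for a 50% IPC gain: {years_to_reach(1.5, IPC_GROWTH):.1f}")
print(f"Years for a 100% IPC gain: {years_to_reach(2.0, IPC_GROWTH):.1f}")
```

At 6% per year, IPC grows by a factor of about 1.8 in ten years, so a 50–100% gain corresponds to roughly 7 to 12 years of progress at that pace, while transistor count grows nearly 29-fold over the same decade.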

To subscribe to Microprocessor Report, access www.linleygroup.com/mpr or phone us at 408-270-3772.

