Soft Machine MPR-11303
Coming out of stealth mode at last week’s Linley Processor Conference, Soft Machines disclosed a new CPU technology that greatly improves performance on single-threaded applications. The new VISC technology can convert a single software thread into multiple virtual threads, which it can then divide across multiple physical cores. This conversion happens inside the processor hardware and is thus invisible to the application and the software developer. Although this capability may seem impossible, Soft Machines has demonstrated its performance advantage using a test chip that implements a VISC design.

Without VISC, the only practical way to improve single-thread performance is to increase the parallelism (instructions per cycle, or IPC) of the CPU microarchitecture. Taken to the extreme, this approach results in massive designs such as Intel’s Haswell and IBM’s Power8 that deliver industry-leading performance but waste power and die area. But more-efficient designs, such as Cortex-A7, offer weak performance on single-threaded software, which still constitutes the majority of applications. VISC (which, according to the company, does not stand for virtual instruction set computing) delivers high performance on a single thread using a simpler, more efficient design.

As an added bonus, Soft Machines has created an intermediate software layer that can translate from any standard instruction set into threadable VISC instructions, as Figure 1 shows. The initial design uses this conversion layer to run ARM code, but the company claims it can create a conversion layer for x86 or other instruction sets as needed.

[Figure 1 diagram, layers in order: application (sequential code), single thread, OS and hypervisor, standard ISA, conversion layer, VISC ISA, global front end, threadlets, four physical cores, main memory.]
Figure 1. Soft Machines VISC technology. The conversion layer converts standard instructions (e.g., ARM) into VISC instructions. The hardware then converts a single instruction stream into multiple threadlets that can execute on multiple physical cores.

Soft Machines is no fly-by-night operation. The company has spent seven years and $125 million to develop and validate its technology. It currently has more than 250 employees, led by CEO Mahesh Lingareddy and president/CTO Mohammad Abdallah. Investors include AMD, GlobalFoundries, and Samsung as well as government investment funds from Abu Dhabi (Mubadala), Russia (Rusnano and RVC), and Saudi Arabia (KACST and Taqnia). Its board of directors is chaired by GlobalFoundries CEO Sanjay Jha and includes legendary entrepreneur Gordon Campbell.

Soft Machines hopes to license the VISC technology to other CPU-design companies, which could add it to their existing CPU cores. Because its fundamental benefit is better IPC, VISC could aid a range of applications from
smartphones to servers. The company has applied for more than 80 patents on its technology.

Breaking the IPC Bottleneck
Since the first multiprocessor (SMP) computers appeared in the 1960s, programmers have realized that dividing their code to run on multiple CPUs improves throughput. In the past decade, multicore processor chips have driven SMP techniques into high-volume devices such as PCs, smartphones, and tablet computers.

Despite these advances, most programs continue to rely on a single instruction stream, or thread, to do most or all of the work. Breaking a program into multiple threads, ensuring that each can work independently with minimal inter-thread communication, is a challenging task. A few application types, such as graphics and packet processing, are simple to thread, but most others are not. For decades, development-tool vendors and other researchers have attempted to create a “magic compiler” that automatically creates a multithreaded program, but these tools either work only on a few easily parallelizable programs or require manual input from the programmer.

This failure puts the burden on CPU designers to improve single-thread performance. Once clock speeds hit a wall around 2005, performance per clock became the primary focus. Today’s high-end CPUs, however, are well past the point of diminishing returns: adding another instruction-issue slot, another function unit, or a bigger reorder buffer burns more power while offering little IPC improvement. As a result, the IPC of high-end CPUs has increased at an annual rate of only 6% over the past decade, despite 40% annual growth in transistor count.

Thus, the implications of Soft Machines’ VISC technology are tremendous. By breaking a single software thread into multiple hardware threads, the technology can combine multiple CPU cores into a single virtual core that delivers greater performance than any single CPU. This accomplishment could break the IPC bottleneck, accelerating performance scaling well beyond single-digit percentages to 2x or greater.

Pulling Apart the Threads
How can Soft Machines accomplish in hardware a task that software vendors have failed at? The hardware sees only a binary instruction stream, so it has less visibility than the compiler regarding data and code structures. The hardware has certain advantages, however, since it can access pointer values and other run-time information. At the conference, the company disclosed some description of its hardware design, although it did not provide complete details. The following provides a high-level view.

A VISC processor comprises a single instruction cache, a global front end, and two or more physical cores that each has its own instruction scheduling, register files, and data cache. The processor may also have a level-two (L2) cache that backs both the global instruction cache and the individual data caches. The initial test chip includes two physical cores and 1MB of L2 cache, as Figure 2 shows, but future designs are likely to include four or more cores.

The front end fetches instructions from the I-cache and places them into an instruction buffer. From this buffer, it attempts to form sequences of related instructions. For example, instructions with register dependencies can be grouped. These sequences are fairly short and are quite different from a conventional software thread; they can be better thought of as threadlets. The front end also performs global register renaming to avoid false dependencies. It can check pointers so as to group instructions that refer to the same memory location. The process of forming threadlets adds three cycles to the test chip’s pipeline.
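
The grouping step described above can be illustrated with a toy model. Soft Machines has not disclosed its threadlet-formation algorithm, so the sketch below is only an assumption-laden illustration: the function name `form_threadlets`, the `(dest, srcs)` instruction encoding, and the union-find grouping are all hypothetical. It links each instruction to the most recent writer of every register it reads, which mimics the effect of renaming by ignoring false (write-after-write and write-after-read) dependencies.

```python
from collections import defaultdict


def form_threadlets(instrs):
    """Group instructions linked by true register dependencies.

    instrs: list of (dest, srcs) tuples in program order, where dest is a
    register name and srcs is a tuple of register names that are read.
    Returns a list of groups, each a list of instruction indices.
    This is a hypothetical toy model, not the VISC hardware algorithm.
    """
    parent = list(range(len(instrs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    last_writer = {}  # register name -> index of its most recent writer
    for i, (dest, srcs) in enumerate(instrs):
        for reg in srcs:
            if reg in last_writer:
                a, b = find(last_writer[reg]), find(i)
                if a != b:
                    # Keep the earliest instruction as the group root.
                    parent[max(a, b)] = min(a, b)
        last_writer[dest] = i

    groups = defaultdict(list)
    for i in range(len(instrs)):
        groups[find(i)].append(i)
    return [groups[root] for root in sorted(groups)]


if __name__ == "__main__":
    window = [
        ("r1", ()),        # 0: load r1
        ("r2", ("r1",)),   # 1: depends on 0
        ("r3", ()),        # 2: independent load
        ("r4", ("r3",)),   # 3: depends on 2
        ("r5", ("r2",)),   # 4: depends on 1
    ]
    print(form_threadlets(window))
```

In this example the five instructions split into two independent chains, which a real design could send to different cores. The model ignores memory dependencies, branches, and limits on threadlet length, all of which the article says affect how well code can be divided.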
After creating the threadlets, the front end dispatches them to the physical cores. When it sends a threadlet to a core, it also allocates in that core the rename registers that are needed to execute the instructions. The goal is to minimize cases in which an instruction in one core requires a register allocated in another core, as this situation requires multiple cycles to resolve. Figure 2 shows a link between the two register files that handles this situation; the link becomes more complex in a processor with more than two cores.

[Figure 2 diagram: a global front end (instruction cache, branch predictor, fetch unit, decode/thread formation, instruction buffer, rename/dispatch, retire unit) feeds two physical cores, each with a scheduler, ALUs and memory units, rename registers, and a data cache, backed by a shared level-two cache.]
Figure 2. Conceptual diagram of a VISC processor. The global front end fetches a single instruction stream and divides it into threadlets that are dispatched to the multiple physical cores, which schedule and execute the instructions.

Getting to the Core
Each core buffers the instructions received from the front end and can reorder them to avoid stalls. The buffer can
test suite, the company reports an average IPC of 2.1, counting ARM instructions rather than VISC instructions. This IPC compares with 0.71 for Cortex-A15 and 1.39 for Haswell. (For consistency, Soft Machines measured the IPC on all three processors using GCC rather than Intel’s favorite compiler, ICC.) Thus, the VISC chip achieved three times the IPC of ARM’s highest-end CPU shipping today and 50% better IPC than Intel’s fastest mainstream CPU.

Figure 4 shows more-detailed performance results comparing the VISC test chip against Cortex-A15 (measured in a Samsung Exynos processor). The figure shows the results of 62 different benchmarks, including components of SPEC2000, SPEC2006, EEMBC’s digital entertainment benchmark (DenBench), and the Kraken JavaScript benchmark. While individual results range from 1.5x to 7x, the average gain across these tests ranges between 3x and 4x. Soft Machines points out that although the test chip runs Linux and other applications, the software contains workarounds for certain hardware bugs, so it will likely obtain better results with future tuning and bug fixes.

Although these results are impressive, they require some caveats to put them into perspective. A shorter CPU pipeline reduces branch penalties and other pipeline hazards, thereby improving IPC compared with a longer pipeline. In addition, a low CPU speed reduces the effective latency of caches and main memory (measured in CPU cycles), again improving IPC relative to a CPU with a faster clock. The latter effect might explain why the test chip appears to perform better on SPEC2006 than on SPEC2000, which has a smaller memory footprint.

Doubling Performance Is Realistic
The test results are difficult to interpret, owing to both the lack of information about the test chip’s core CPU and the effects of the short pipeline. A better way to quantify the advantage of using VISC is the performance increase relative to a single core. By coupling the resources of two cores to execute a single thread, the maximum performance increase (compared with a single core of similar design) is 2x. In reality, the front end cannot fully utilize the resources of the second core, particularly for code that is difficult to break apart. For example, code that contains many branches or many register dependencies will be harder to thread and thus will use less of the second core.

Soft Machines has run many tests and simulations of its VISC technology. It estimates the second core improves performance by an average of 50–60% across a variety of benchmarks. This factor implies that the test chip would achieve about 1.3 IPC when using a single core, considerably higher than Cortex-A15. This higher IPC includes the effects of the lower clock speed and shorter pipeline.

Although the test chip has only two physical cores, Soft Machines has run simulations on a four-core design. As one might expect, the performance gains diminish for the additional cores: the third core adds 20–30% to single-thread performance, and the fourth adds only 10–20%. In total, the four-core design delivers about twice the performance of a single core. The unused resources in the extra cores, however, can be devoted to additional threads. For example, a four-core design could run two threads at close to their maximum performance.

Performance-critical applications will benefit from VISC, but the technology can also apply to low-power designs. As the test chip demonstrates, a VISC design can operate at a relatively low clock speed while achieving the same absolute performance as a traditional design operating at a higher clock speed. Thus, it should use less power, particularly if the voltage is reduced as well. Soft Machines, however, declined to reveal the power consumption of its test chip.
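
The IPC ratios and per-core scaling figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only numbers from the article; the function names are hypothetical. The article does not say whether the third-core and fourth-core gains are measured against a single core or against the preceding configuration, so both readings are computed and both are assumptions.

```python
# IPC figures quoted in the article (ARM instructions per cycle).
VISC_TWO_CORE_IPC = 2.1
CORTEX_A15_IPC = 0.71
HASWELL_IPC = 1.39

# Per-core gains quoted in the article: second, third, and fourth core.
LOW_GAINS = [0.50, 0.20, 0.10]
HIGH_GAINS = [0.60, 0.30, 0.20]


def implied_single_core_ipc(two_core_ipc, second_core_gain):
    """Back out single-core IPC from the two-core result."""
    return two_core_ipc / (1.0 + second_core_gain)


def additive_speedup(gains):
    """Assumption: each gain is a fraction of single-core performance."""
    return 1.0 + sum(gains)


def compounded_speedup(gains):
    """Assumption: each gain multiplies the previous configuration."""
    speedup = 1.0
    for gain in gains:
        speedup *= 1.0 + gain
    return speedup


if __name__ == "__main__":
    print("VISC vs Cortex-A15:", round(VISC_TWO_CORE_IPC / CORTEX_A15_IPC, 2))
    print("VISC vs Haswell:", round(VISC_TWO_CORE_IPC / HASWELL_IPC, 2))
    for gain in (0.50, 0.60):
        ipc = implied_single_core_ipc(VISC_TWO_CORE_IPC, gain)
        print("single-core IPC at", gain, "gain:", round(ipc, 2))
    for gains in (LOW_GAINS, HIGH_GAINS):
        print("four cores, additive:", round(additive_speedup(gains), 2),
              "compounded:", round(compounded_speedup(gains), 2))
```

Under either reading the four-core result lands between roughly 1.8x and 2.5x, consistent with the article's "about twice" figure. The implied single-core IPC of about 1.3 corresponds to the 60% end of the second-core range.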
Other details also remain undisclosed, including die area. Details of how the processor handles privileged operations, inter-thread synchronization, traps, and interrupts could all affect how well it runs certain applications. Performance could vary widely across different workloads. The company provides additional information to potential customers under NDA and plans to make further disclosures over time.

[Figure 4 chart: IPC relative to Cortex-A15 (vertical axis, 0x to 7x) for benchmark groups SPEC2000, SPEC2006, EEMBC DE (color conversion), and Kraken (JavaScript).]
Figure 4. VISC performance results. This chart shows measured instructions per cycle (IPC) for Soft Machines’ test chip normalized against Cortex-A15. The company did not disclose the clock speed of the test chip. (Source: Soft Machines)

A Rising Tide
Soft Machines has identified a critical (perhaps the most critical) problem in CPU design today: the minimal improvement in single-thread performance over the past decade. Despite the obvious need for multithreaded software, most applications, even performance-intensive ones, continue to
rely on a single thread. The software tools for creating multithreaded applications remain primitive.

With the announcement and demonstration of its VISC technology, Soft Machines has taken a big step toward solving this problem. By shifting the burden to hardware, VISC aims to deliver the benefits of multithreading to all applications. The initial performance results are excellent: a 50–100% gain in single-thread IPC represents a decade of progress at the industry’s current sluggish pace. As with any radical new technology, however, we remain skeptical, particularly given the startup’s limited disclosure. Potential customers must fully assess the technology to determine how it will perform in an actual product across a range of workloads.

Assuming the technology works as advertised, it will change the way all processors are designed. CPU designers will stop trying to improve IPC by adding more hardware; in fact, complex high-IPC designs like Haswell could disappear in favor of smaller, simpler ones. Replacing large cores with clusters of simpler cores will improve performance and power efficiency, benefiting almost every type of processor. To deliver on this promise, Soft Machines must fully validate VISC and license it to leading processor vendors. ♦
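
The "decade of progress" comparison can be checked by compounding the article's 6% annual IPC growth rate. The sketch below is a back-of-the-envelope calculation with hypothetical helper names. It assumes the growth rates stay constant from year to year, which the article does not state.

```python
import math


def cumulative_gain(annual_rate, years):
    """Total growth factor after compounding annual_rate for years."""
    return (1.0 + annual_rate) ** years


def years_to_reach(gain_factor, annual_rate):
    """Years of compounding needed to reach gain_factor."""
    return math.log(gain_factor) / math.log(1.0 + annual_rate)


if __name__ == "__main__":
    # IPC growth at 6% per year over a decade.
    print("IPC growth over 10 years:", round(cumulative_gain(0.06, 10), 2))
    # Transistor-count growth at 40% per year over the same decade.
    print("Transistor growth over 10 years:", round(cumulative_gain(0.40, 10), 1))
    # Time needed at 6% per year to match a 50% or 100% IPC gain.
    print("Years for a 50% gain:", round(years_to_reach(1.5, 0.06), 1))
    print("Years for a 100% gain:", round(years_to_reach(2.0, 0.06), 1))
```

At 6% per year, IPC grows by about 79% in ten years, so a 50–100% gain corresponds to roughly 7 to 12 years of progress at that rate. Over the same decade, 40% annual growth multiplies the transistor count about 29-fold.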