Performance Analysis On Multicore Processors
In the case of the Intel-based SUT, memory copy throughput does not scale linearly with the number of threads. In contrast to data sizes of 16 KB and 1 MB, which fit in the L2 caches, copying 16 MB requires extensive memory accesses through the shared bus. Throughput is therefore lower than in the cases where accesses hit in the L2 caches, and it saturates at around 40 Gbps as the bus becomes a bottleneck. Furthermore, throughput is constrained by shared L2 cache conflicts for up to four cores, but then starts increasing as operations spread to other cores with thread affinity. This process continues until the bus becomes a secondary bottleneck. This result is consistent with measurements reported for a similar dual quad-core system. On the other hand, throughput scales linearly for the AMD and Cavium based SUTs at the 16 MB data size, due to their more efficient, low-latency memory controllers instead of a shared system bus.

[Figure 5.2: Summation]
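This scaling behavior can be reproduced with a small multithreaded copy benchmark. The following sketch is not the MPAC implementation; it is a minimal pthreads program of our own (names such as copy_worker are illustrative) in which every thread repeatedly copies a private buffer of the requested size and reports its throughput:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define ITERS 100          /* copies per thread */
    #define MAX_THREADS 64

    typedef struct { size_t bytes; double gbps; } task_t;

    static void *copy_worker(void *p) {
        task_t *t = p;
        char *src = malloc(t->bytes), *dst = malloc(t->bytes);
        struct timespec a, b;
        memset(src, 1, t->bytes);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ITERS; i++)
            memcpy(dst, src, t->bytes);            /* the measured operation */
        clock_gettime(CLOCK_MONOTONIC, &b);
        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        t->gbps = (double)t->bytes * ITERS * 8 / secs / 1e9;  /* bits, not bytes */
        free(src); free(dst);
        return NULL;
    }

    int main(int argc, char **argv) {
        int n = argc > 1 ? atoi(argv[1]) : 4;      /* thread count */
        size_t bytes = argc > 2 ? (size_t)atol(argv[2]) : (16u << 20); /* 16 MB default */
        if (n < 1 || n > MAX_THREADS) n = 4;
        pthread_t tid[MAX_THREADS];
        task_t task[MAX_THREADS];
        double total = 0;
        for (int i = 0; i < n; i++) {
            task[i].bytes = bytes;
            pthread_create(&tid[i], NULL, copy_worker, &task[i]);
        }
        for (int i = 0; i < n; i++) { pthread_join(tid[i], NULL); total += task[i].gbps; }
        printf("%d threads x %zu bytes: %.1f Gbps aggregate\n", n, bytes, total);
        return 0;
    }

Compiled with -pthread and swept across 16 KB, 1 MB, and 16 MB buffers at increasing thread counts, this reproduces the qualitative behavior above: cache-resident sizes scale, while the 16 MB case flattens once the bus saturates.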
[Figure 8: CELL Thermal Diagram]

5.2 Cache Coherence
Cache coherence is a concern in a multicore environment because of the distributed L1 and L2 caches. Since each core has its own cache, the copy of the data in that cache may not always be the most up-to-date version. For example, imagine a dual-core processor where each core brought a block of memory into its private cache. One core writes a value to a specific location; when the second core attempts to read that value from its cache, it won't have the updated copy unless its cache entry is invalidated and a cache miss occurs. This cache miss forces the second core's cache entry to be updated. If this coherence policy weren't in place, garbage data would be read and invalid results would be produced, possibly crashing the program or the entire computer.
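The hardware keeps caches coherent automatically, but the invalidation traffic it generates has a measurable cost. The sketch below is our own illustration (not from the cited papers) and assumes 64-byte cache lines: two threads first increment counters that sit on the same line, so every write invalidates the line in the other core's cache, and then increment counters padded onto separate lines:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L

    /* Two adjacent counters: almost certainly on the same 64-byte line. */
    static volatile long same_line[2];
    /* Counters placed 64 bytes apart: guaranteed to be on different lines. */
    static volatile long padded[2 * 64 / sizeof(long)];

    static void *bump(void *p) {
        volatile long *c = p;
        for (long i = 0; i < ITERS; i++) (*c)++;
        return NULL;
    }

    static double run(volatile long *x, volatile long *y) {
        pthread_t t1, t2;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pthread_create(&t1, NULL, bump, (void *)x);
        pthread_create(&t2, NULL, bump, (void *)y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        printf("same line:      %.2f s\n", run(&same_line[0], &same_line[1]));
        printf("separate lines: %.2f s\n", run(&padded[0], &padded[64 / sizeof(long)]));
        return 0;
    }

On typical multicore parts the padded run is several times faster, because the line carrying the counters no longer ping-pongs between the cores' private caches.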
6. OPEN ISSUES
6.1 Improved Memory System
With numerous cores on a single chip there is an enormous need for increased memory. 32-bit processors, such as the Pentium 4, can address up to 4 GB of main memory. With cores now using 64-bit addresses, the amount of addressable memory is almost infinite. An improved memory system is a necessity; more main memory and larger caches are needed for multithreaded multiprocessors.
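The jump in capacity is simple arithmetic: 32 bits select one of 2^32 bytes (4 GiB), while 64 bits select one of 2^64 bytes (16 EiB). A few lines of C, included here only to make the figures concrete, print both limits:

    #include <stdio.h>

    int main(void) {
        /* 2^32 bytes is the ceiling of a 32-bit address space. */
        printf("32-bit: %llu bytes = %llu GiB\n", 1ULL << 32, (1ULL << 32) >> 30);
        /* 2^64 itself overflows a 64-bit integer, so express it in EiB (2^60 bytes). */
        printf("64-bit: 2^64 bytes = %llu EiB\n", 1ULL << (64 - 60));
        return 0;
    }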
6.2 System Bus and Interconnection Networks
Extra memory will be useless if the amount of time required for memory requests doesn't improve as well. Redesigning the interconnection network between cores is a major focus of chip manufacturers. A faster network means lower latency in inter-core communication and memory transactions. Intel is developing its QuickPath Interconnect, a 20-bit wide bus running between 4.8 and 6.4 GT/s; AMD's new HyperTransport 3.0 is a 32-bit wide bus that runs at 5.2 GT/s. A different kind of interconnect is seen in the TILE64's iMesh, which consists of five networks used to fulfill I/O and off-chip memory communication.
Using five mesh networks gives the Tile architecture a per-tile (or per-core) bandwidth of up to 1.28 Tbps (terabits per second). The question remains, though: which type of interconnect is best suited for multicore processors? Is a bus-based approach better than an interconnection network? Or is there a hybrid, like the mesh network, that would work best?
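These headline figures follow directly from link width times transfer rate. The short calculation below reproduces them; the assumption that 16 of QuickPath's 20 lanes carry payload (the rest carrying CRC and protocol bits) is based on its published lane layout, not on anything in this paper:

    #include <stdio.h>

    int main(void) {
        /* Raw link rate = width (bits) x transfer rate (GT/s). */
        double qpi_raw = 20 * 6.4;          /* 128 Gbps per direction at the top rate */
        double qpi_payload = 16 * 6.4 / 8;  /* assuming 16 payload lanes: 12.8 GB/s per direction */
        double ht30_raw = 32 * 5.2;         /* 166.4 Gbps for 32-bit HyperTransport 3.0 */
        printf("QuickPath: %.0f Gbps raw, %.1f GB/s payload per direction\n", qpi_raw, qpi_payload);
        printf("HyperTransport 3.0: %.1f Gbps raw\n", ht30_raw);
        return 0;
    }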
6.3 Parallel Programming
To use multicore, you really have to use multiple threads. If you know how to do it, it's not bad. But the first time you do it there are lots of ways to shoot yourself in the foot. The bugs you introduce with multithreading are so much harder to find.

In May 2007, Intel fellow Shekhar Borkar stated that "The software has to also start following Moore's Law; software has to double the amount of parallelism that it can support every two years." Since the number of cores in a processor is set to double every 18 months, it only makes sense that the software running on these cores takes this into account. Ultimately, programmers need to learn how to write parallel programs that can be split up and run concurrently on multiple cores, instead of trying to exploit single-core hardware to increase the parallelism of sequential programs.
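The classic way to shoot yourself in the foot is a data race. In the minimal pthreads sketch below (our own example), two threads increment a shared counter; without a lock, the interleaved read-modify-write sequences silently lose updates:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *racy(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++)
            counter++;                     /* unsynchronized read-modify-write */
        return NULL;
    }

    static void *safe(void *arg) {
        (void)arg;
        for (int i = 0; i < N; i++) {
            pthread_mutex_lock(&lock);     /* serializes the increment */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, racy, NULL);
        pthread_create(&t2, NULL, racy, NULL);
        pthread_join(t1, NULL); pthread_join(t2, NULL);
        printf("racy:   %ld (expected %d)\n", counter, 2 * N);
        counter = 0;
        pthread_create(&t1, NULL, safe, NULL);
        pthread_create(&t2, NULL, safe, NULL);
        pthread_join(t1, NULL); pthread_join(t2, NULL);
        printf("locked: %ld\n", counter);
        return 0;
    }

The racy total usually comes up short and differs from run to run, which is precisely why such bugs are hard to find; the locked version always prints 2000000.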
Developing software for multicore processors brings up some latent concerns. How does a programmer ensure that a high-priority task gets priority across the processor, not just within a core? In theory, even if a thread had the highest priority within the core on which it is running, it might not have a high priority in the system as a whole. Another necessary tool for developers is debugging. However, how do we guarantee that the entire system stops and not just the core on which an application is running?
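What a developer can do today is OS-specific. As one illustration rather than a general answer, the Linux-specific sketch below pins a thread to a core and requests a SCHED_FIFO real-time priority, which the kernel honors system-wide rather than per core (raising real-time priority typically requires privileges):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *work(void *arg) {
        (void)arg;
        /* ... the high-priority task ... */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, work, NULL);

        cpu_set_t set;                      /* pin the thread to core 0 */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(t, sizeof set, &set);

        struct sched_param sp = { .sched_priority = 50 };
        if (pthread_setschedparam(t, SCHED_FIFO, &sp) != 0)
            fprintf(stderr, "could not raise priority (insufficient privileges?)\n");

        pthread_join(t, NULL);
        return 0;
    }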
These issues need to be addressed, along with teaching good parallel programming practices to developers. Once programmers have a basic grasp on how to multithread and program in parallel, instead of sequentially, ramping up to follow Moore's law will be easier.
6.4 Starvation
If a program isn't developed correctly for use on a multicore processor, one or more of the cores may starve for data. This would be seen if a single-threaded application were run on a multicore system. The thread would simply run on one of the cores while the other cores sat idle. This is an extreme case, but it illustrates the problem.
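Keeping every core fed is ultimately the programmer's job. The sketch below shows the simplest remedy for the single-threaded case, static work partitioning; the names (sum_range, partial) are our own, and each thread sums a disjoint slice of an array so that no core sits idle:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000       /* kept divisible by NTHREADS for simplicity */

    static double data[N];
    static double partial[NTHREADS];

    static void *sum_range(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        double s = 0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;    /* one slot per thread: no lock required */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) data[i] = 1.0;
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_range, (void *)i);
        double total = 0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("sum = %.0f\n", total);
        return 0;
    }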
With a shared cache, for example the Intel Core 2 Duo's shared L2 cache, if a proper replacement policy isn't in place one core may starve for cache usage and continually make costly calls out to main memory. The replacement policy should include stipulations for evicting cache entries that other cores have recently loaded. This becomes more difficult as an increased number of cores effectively reduces the amount of evictable cache space without increasing cache misses.

6.5 Homogeneous vs. Heterogeneous Cores
Architects have debated whether the cores in a multicore environment should be homogeneous or heterogeneous, and there is no definitive answer...yet. Homogeneous cores are all exactly the same: equivalent frequencies, cache sizes, functions, etc. However, each core in a heterogeneous system may have a different function, frequency, memory model, etc. There is an apparent trade-off between processor complexity and customization. All of the designs discussed above have used homogeneous cores except for the CELL processor, which has one Power Processing Element and eight Synergistic Processing Elements.

Homogeneous cores are easier to produce, since the same instruction set is used across all cores and each core contains the same hardware. But are they the most efficient use of multicore technology?

Each core in a heterogeneous environment could have a specific function and run its own specialized instruction set. Building on the CELL example, a heterogeneous model could have a large centralized core built for generic processing and running an OS, a core for graphics, a communications core, an enhanced mathematics core, an audio core, a cryptographic core, and so on. This model is more complex, but it may have efficiency, power, and thermal benefits that outweigh its complexity. With major manufacturers on both sides of this issue, the debate will stretch on for years to come; it will be interesting to see which side comes out on top.

7. Conclusion
Before multicore processors, the performance increase from generation to generation was easy to see: an increase in frequency. This model broke when high frequencies pushed power consumption and heat dissipation to detrimental levels. Adding multiple cores within a processor offered the solution of running at lower frequencies, but it introduced interesting new problems.

We presented the open-source MPAC benchmarking library, which provides a common, extensible benchmarking infrastructure. It can be leveraged to ease the development of specification-based micro-benchmarks, application benchmarks, and network traffic load generators for state-of-the-art multi-core processor based platforms. We implemented the specifications of the Stream and Netperf micro-benchmarks using the MPAC library and validated our MPAC-based performance measurements on Intel, AMD, and Cavium based multi-core platforms using these benchmarks for single-thread executions.

Multicore processors are architected to adhere to reasonable power consumption, heat dissipation, and cache coherence protocols. However, many issues remain unsolved. To use a multicore processor at full capacity, the applications run on the system must be multithreaded. There are relatively few applications (and, more importantly, few programmers with the know-how) written with any level of parallelism. The memory systems and interconnection networks also need improvement. And finally, it is still unclear whether homogeneous or heterogeneous cores are more efficient.
8. REFERENCES
[1] W. Knight, "Two Heads Are Better Than One", IEEE Review, September 2005.
[2] R. Merritt, "CPU Designers Debate Multi-core Future", EETimes Online, February 2008.
[3] P. Frost Gorder, "Multicore Processors for Science and Engineering", IEEE CS, March/April 2007.
[4] D. Geer, "Chip Makers Turn to Multicore Processors", Computer, IEEE Computer Society, May 2005.
[5] L. Peng et al., "Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study", IEEE, 2007.
[6] D. Pham et al., "The Design and Implementation of a First-Generation CELL Processor", ISSCC.
[7] P. Hofstee and M. Day, "Hardware and Software Architecture for the CELL Processor", CODES+ISSS '05, September 2005.
[8] J. Kahle, "The Cell Processor Architecture", MICRO-38 Keynote, 2005.
[9] D. Stasiak et al., "Cell Processor Low-Power Design Methodology", IEEE MICRO, 2005.
[10] D. Pham et al., "Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor", IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, January 2006.
[11] M. Hasan Jamal, G. Mustafa, A. Waheed, and W. Mahmood, "An Extensible Infrastructure for Benchmarking Multi-Core Processors based Systems", IEEE SPECTS 2009.
[12] M. Sato, Y. Sato, and M. Namiki, "Proposal of a Multi-core Processor from the Viewpoint of Evolutionary Computation", IEEE, 2010.