4 Changes in Hardware
This chapter deals with hardware and lays the foundation for understanding how changing hardware impacts software and application development. It is partly taken from [SKP12].
In the early 2000s, multi-core architectures were introduced, starting a trend towards more and more parallelism. Today, a typical board has eight CPUs and 8 to 16 cores per CPU, so each board has between 64 and 128 cores. A board is a pizza-box sized server component; in a multi-node system it is called a blade or node. Each of those blades offers a high level of parallel computing for a price of about $50,000.
Despite the introduction of massive parallelism, until recently the disk dominated all thinking and performance optimization. It was extremely slow, but necessary to store the data. Compared to the speed development of CPUs, the development of disk performance could not keep up. This resulted in a complete distortion of the whole model of working with databases and large amounts of data. Today, the large amounts of main memory available in servers initiate a shift from disk-based systems to in-memory systems, which keep the primary copy of their data in main memory.
4.1 Memory Cells

In early computer systems, the frequency of the CPU was the same as the frequency of the memory bus, and register access was only slightly faster than memory access. However, CPU frequencies have increased heavily over the last years following Moore's Law [Moo65] (the assumption that the number of transistors on integrated circuits doubles every 18 to 24 months, which still holds today), while the frequencies of memory buses and the latencies of memory chips did not grow at the same speed.
[Figure: each core has its own TLB, L1 and L2 caches; the cores of a processor share an L3 cache, and processors are connected via QPI]
As a result, memory access gets more expensive, as more CPU cycles are wasted while stalling for memory access. This development is not due to the fact that fast memory cannot be built; it is an economical decision, as memory that is as fast as current CPUs would be orders of magnitude more expensive and would require extensive physical space on the boards. In general, memory designers have the choice between SRAM (Static Random Access Memory) and DRAM (Dynamic Random Access Memory).
SRAM cells are usually built out of six transistors (variants with only four transistors exist but have disadvantages [MSMH08]) and can store a stable state as long as power is supplied. Accessing the stored state requires raising the word access line, and the state is immediately available for reading.
In contrast, DRAM cells can be constructed using a much simpler structure
consisting of only one transistor and a capacitor. The state of the memory
cell is stored in the capacitor while the transistor is only used to guard the
access to the capacitor. This design is more economical compared to SRAM.
However, it introduces a couple of complications. The capacitor discharges
over time and while reading the state of the memory cell. Therefore, today’s
systems refresh DRAM chips every 64 ms [CJDM01] and after every read of
the cell in order to recharge the capacitor. During the refresh, no access to
the state of the cell is possible. The charging and discharging of the capacitor
takes time, which means that the current cannot be detected immediately after requesting the stored state, thereby limiting the speed of DRAM cells.
In a nutshell, SRAM is fast but requires a lot of space, whereas DRAM chips are slower but allow for larger capacities due to their simpler structure. For more details regarding the two types of RAM and their physical realization, the interested reader is referred to [Dre07].
4.3 Cache Internals
Caches are organized in cache lines, which are the smallest addressable unit
in the cache. If the requested content cannot be found in any cache, it is
loaded from main memory and transferred up the hierarchy. The smallest
transferable unit between adjacent levels is one cache line. Caches where every cache line of level i is also present in level i + 1 are called inclusive caches; otherwise, the model is called an exclusive cache. All Intel processors implement an inclusive cache model, which is assumed for the rest of this text.
When requesting a cache line from the cache, the process of determining
whether the requested line is already in the cache and locating where it is
cached is crucial. Theoretically, it is possible to implement fully associative
caches, where each cache line can cache any memory location. However,
in practice this is only realizable for very small caches as a search over
the complete cache is necessary when searching for a cache line. In order
to reduce the search space, the concept of an n-way set associative cache with associativity Ai divides a cache of Ci bytes and cache line size Bi into Ci/Bi/Ai sets and restricts the number of cache lines which can hold a copy of a certain memory address to one set, i.e. to Ai cache lines. Thus, when determining whether a cache line is already present in the cache, only one set with Ai cache lines has to be searched.
[Fig. 4.2: Splitting of a 64-bit memory address into tag T, cache set S, and offset O]
A requested address from main memory is split into three parts for determining whether the address is already cached, as shown in Figure 4.2. The first part is the offset O, whose size is determined by the cache line size of the cache. With a cache line size of 64 bytes, the lower 6 bits of the address are used as the offset into the cache line. The second part identifies the cache set. The number s of bits used to identify the cache set is determined by the cache size Ci, the cache line size Bi and the associativity Ai of the cache by s = log2(Ci/Bi/Ai). The remaining 64 − o − s bits of the address (with o being the number of offset bits) are used as a tag to identify the cached copy. Therefore, when requesting an address from main memory, the processor can calculate the set S by masking the address and then search the respective cache set for the tag T. This can easily be done by comparing the tags of the Ai cache lines in the set in parallel.
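To make the splitting concrete, the following minimal C sketch decomposes an address into offset, set index, and tag. The cache parameters (a 256 KB, 8-way cache with 64-byte lines) are assumptions chosen only for illustration and mirror the formula s = log2(Ci/Bi/Ai) from the text:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache parameters (assumptions for illustration only). */
#define CACHE_SIZE    (256 * 1024)   /* C_i: 256 KB                  */
#define LINE_SIZE     64             /* B_i: 64-byte cache line      */
#define ASSOCIATIVITY 8              /* A_i: 8-way set associative   */

#define NUM_SETS    (CACHE_SIZE / LINE_SIZE / ASSOCIATIVITY)  /* 512 sets */
#define OFFSET_BITS 6                /* log2(LINE_SIZE)               */
#define SET_BITS    9                /* log2(NUM_SETS) = log2(512)    */

int main(void) {
    uint64_t addr = 0x00007f8a12345678ULL;       /* some 64-bit address */

    uint64_t offset = addr & (LINE_SIZE - 1);                 /* lower 6 bits */
    uint64_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* next 9 bits  */
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);       /* remaining 49 bits */

    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}

Only the Ai (here 8) lines of the computed set have to be compared against the tag, which the hardware does in parallel.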
4.4 Address Translation

Whenever a program accesses memory through a virtual address, this virtual address has to be translated into a physical address with the help of the memory management unit inside the processor.
We do not go into the details of the translation and paging mechanisms. However, the address translation is usually done by a multi-level page table, where the virtual address is split into multiple parts which are used as indices into the page directories, resulting in a physical address and a respective offset. As the page table is kept in main memory, each translation of a virtual address into a physical address would require additional main memory accesses, or cache accesses in case the page table is cached.
In order to speed up the translation process, the computed values are cached in the so-called Translation Lookaside Buffer (TLB), which is a small and fast cache. When accessing a virtual address, the respective tag for the memory page is calculated by masking the virtual address, and the TLB is searched for this tag. If the tag is found, the physical address can be retrieved from the cache. Otherwise, a TLB miss occurs and the physical address has to be calculated, which can be quite costly. Details about the address translation process, TLBs and paging structure caches for Intel 64 and IA-32 architectures can be found in [Int08].
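As a toy illustration of the two mechanisms just described, the following C sketch first consults a tiny direct-mapped TLB and, on a miss, walks a two-level page table for a 32-bit address space with 4 KB pages. All structures and sizes are simplifying assumptions and far from real x86-64 paging:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_BITS   12                     /* 4 KB pages                 */
#define LVL_BITS    10                     /* 10 index bits per level    */
#define LVL_ENTRIES (1u << LVL_BITS)
#define TLB_SIZE    64                     /* tiny direct-mapped TLB     */

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } TlbEntry;

static TlbEntry  tlb[TLB_SIZE];            /* the translation cache          */
static uint32_t *page_dir[LVL_ENTRIES];    /* level 1: pointers to tables    */

/* Translate a 32-bit virtual address; returns physical address or 0 on fault. */
static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;             /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    /* 1. TLB lookup: a hit avoids the memory accesses of the table walk. */
    TlbEntry *e = &tlb[vpn % TLB_SIZE];
    if (e->valid && e->vpn == vpn)
        return (e->pfn << PAGE_BITS) | offset;

    /* 2. TLB miss: walk the two-level table (two extra memory reads). */
    uint32_t  dir_idx = vpn >> LVL_BITS;               /* upper 10 bits of VPN */
    uint32_t  tbl_idx = vpn & (LVL_ENTRIES - 1);       /* lower 10 bits of VPN */
    uint32_t *table   = page_dir[dir_idx];
    if (table == NULL) return 0;                       /* page fault           */
    uint32_t  pfn     = table[tbl_idx];

    /* 3. Refill the TLB so the next access to this page is fast. */
    *e = (TlbEntry){ .vpn = vpn, .pfn = pfn, .valid = 1 };
    return (pfn << PAGE_BITS) | offset;
}

int main(void) {
    static uint32_t table0[LVL_ENTRIES];
    table0[3] = 42;                        /* map virtual page 3 -> frame 42 */
    page_dir[0] = table0;
    printf("0x%08x\n", (unsigned)translate((3u << PAGE_BITS) | 0x123));
    return 0;
}

A TLB hit answers the translation without touching the page table; a miss adds the extra memory (or cache) accesses of the walk, which is why TLB misses are described as costly above.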
The costs introduced by the address translation scale linearly with the width of the translated address [HP03, CJDM99], making it hard or impossible to build large memories with very small latencies.
4.5 Prefetching
Modern processors try to guess which data will be accessed next and initiate loads before the data is accessed, in order to reduce the incurred access latencies. Good prefetching can completely hide the latencies, so that the data is already in the cache when it is accessed. However, if data is loaded that is not accessed later on, it can evict data that would have been accessed and thereby induce additional misses by forcing this data to be loaded again. Processors support software and hardware prefetching. Software prefetching can be seen as a hint to the processor, indicating which addresses will be accessed next. Hardware prefetching automatically recognizes access patterns by utilizing different prefetching strategies. The Intel Nehalem architecture contains two second level cache prefetchers – the L2 streamer and the data prefetch logic (DPL) [Int11].
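As a hedged sketch of software prefetching, the following C function uses the GCC/Clang builtin __builtin_prefetch to hint at data a few iterations ahead of a scan. The prefetch distance of 16 elements is an assumption that would need tuning for a concrete machine, and for a simple sequential scan the hardware prefetchers mentioned above would typically detect the pattern on their own:

#include <stddef.h>

/* Sum an array while hinting the processor to load data a few
 * iterations ahead. PREFETCH_DIST is a tuning assumption.       */
#define PREFETCH_DIST 16

long sum_with_prefetch(const long *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels) */
            __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}

Issuing the hint too early or too late either wastes cache space or fails to hide the latency, which is why the distance is a tuning parameter rather than a constant.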
4.6 Memory Hierarchy and Latency Numbers

[Figure: the memory hierarchy from hard disk over flash, main memory, and the CPU caches up to the CPU registers; levels towards the bottom have higher latency and a lower price per performance]
At the very bottom of the hierarchy is the hard disk. It is cheap, offers large amounts of storage, and has replaced magnetic tapes as the slowest necessary storage medium.
The next medium is flash. It is faster than disk, but it is still treated as disk from a software perspective because of its persistence and its usage characteristics. This means that the same block-oriented input and output methods which were developed more than 20 years ago for disks are still in place for flash. In order to fully utilize the speed of flash-based storage, the interfaces and drivers have to be adapted accordingly.
On top of flash is the main memory, which is directly accessible. The next level are the CPU caches (L3, L2, L1) with different characteristics. Finally, at the top of the memory hierarchy are the registers of the CPU, where operations such as calculations actually take place.
When accessing data from a disk, there are usually four layers between the accessed disk and the registers of the CPU, which only transport information. In the end, every operation takes place inside the CPU, and therefore the data has to be in the registers.
Table 4.1 shows some of the latencies that come into play in the memory hierarchy. Latency is the time delay experienced by the system from requesting the data from the storage medium until it is available in a CPU register. The L1 cache latency is 0.5 nanoseconds. In contrast, a main memory reference takes 100 nanoseconds and a simple disk seek takes 10 milliseconds; a main memory reference is thus about 200 times slower than an L1 cache access, and a single disk seek costs as much as roughly 100,000 main memory references.
4.7 Non-Uniform Memory Access

With an increasing number of processors and cores per machine, a single central memory interface becomes a bottleneck and introduces heavy challenges in hardware design to connect all cores and memory. Non-Uniform Memory Access (NUMA) attempts to solve this problem by introducing local memory locations which are cheap to access for local processors. Figure 4.4 pictures an overview of an UMA and a NUMA system. In an UMA system, every processor observes the same speed when accessing an arbitrary memory address, as the complete memory is accessed through a central memory interface, as shown in Figure 4.4 (a). In contrast, in NUMA systems every processor has its primarily used local memory as well as remote memory supplied by the other processors. This setup is shown in Figure 4.4 (b). The different kinds of memory from the processor's point of view introduce different memory access times between local memory (adjacent slots) and remote memory that is adjacent to the other processing units.

[Fig. 4.4: (a) Shared FSB, (b) Intel Quick Path Interconnect [Int09]]

Additionally, systems can be classified into cache-coherent NUMA (ccNUMA) and non cache-coherent NUMA systems. ccNUMA systems provide each CPU cache with the same view of the complete memory and enforce coherency by a hardware-implemented protocol, whereas non cache-coherent NUMA systems require software layers to take care of conflicting memory accesses. Although non-ccNUMA hardware is easier and cheaper to build, it is more difficult to program. Since most of the available standard hardware only provides ccNUMA, we solely concentrate on this form.

To fully exploit the potential of NUMA, applications have to be made aware of the different memory latencies and should primarily load data from the locally attached memory slots of a processor. Memory-bound applications may suffer a degradation of up to 25% of their maximal performance if only remote memory is accessed instead of local memory. Reasons for this degradation can be the saturation of the QPI link between processors, which transports data from the adjacent memory slot of another processor, or the influence of the higher latency of a single access to a remote memory slot. The full degradation might not be experienced, as memory caches and the prefetching of data mitigate the effects of local versus remote memory. Assume a job can be split into many parallel tasks. For the parallel execution of these tasks, the distribution of data is relevant. Optimal performance can only be reached if the tasks solely access local memory. If data is badly distributed and many tasks need to access remote memory, the connections between the processors can become flooded with extensive data transfer.

Aside from the use for data-intensive applications, some vendors use NUMA to create alternatives for distributed systems. Through NUMA, multiple physical machines can be consolidated into one virtual machine. Note the difference to the commonly used term virtual machine, where part of a physical machine is provided as a virtual machine. With NUMA, several physical machines fully contribute to one virtual machine, giving the user the impression of working with an extensively large server. With such a virtual machine, the main memory of all nodes and all CPUs can be accessed as local resources. Extensions to the operating system enable the system to efficiently scale out without any need for special remote communication that would have to be handled in the operating system or the applications.
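To make this locality explicit in code, the following minimal sketch uses the Linux libnuma API (assuming libnuma is installed and the program is linked with -lnuma) to allocate a buffer on the NUMA node the calling thread currently runs on, so that subsequent accesses stay on local memory slots:

#define _GNU_SOURCE
#include <numa.h>       /* libnuma: numa_available, numa_alloc_onnode, ... */
#include <sched.h>      /* sched_getcpu (GNU extension)                    */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* Determine the NUMA node the calling thread currently runs on. */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    if (node < 0) node = 0;                  /* fall back to node 0 */
    printf("running on CPU %d, NUMA node %d of %d\n",
           cpu, node, numa_max_node() + 1);

    /* Allocate 64 MB directly on the local node; accesses from this
     * thread then go to local (adjacent) memory slots instead of
     * remote ones reached via the processor interconnect. */
    size_t size = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    memset(buf, 0, size);        /* touch the pages to actually place them */
    numa_free(buf, size);
    return EXIT_SUCCESS;
}

Such an allocation only helps if the threads that later work on the buffer actually run on the same node, so in practice it is combined with pinning threads to cores or nodes, for example via numa_run_on_node or sched_setaffinity.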
4.8 Scaling Main Memory Systems
An example system that consists of multiple nodes can be seen in Figure 4.5. One node has eight CPUs with eight cores each, so each node has 64 cores.
[Fig. 4.5: A multi-node system. Each node contains eight CPUs; every core has its own registers, L1 and L2 caches, the cores of one CPU share an L3 cache, and each CPU has locally attached main memory. The nodes are connected via a network and share a storage area network (SSD/disk)]
4.10 References