
Chapter 4

Changes in Hardware

This chapter deals with hardware and lays the foundations to understand
how the changing hardware impacts software and application development
and is partly taken from [SKP12].
In the early 2000s, multi-core architectures were introduced, starting a trend towards
more and more parallelism. Today, a typical board has eight CPUs and 8 to 16 cores
per CPU, so each board has between 64 and 128 cores. A board is a pizza-box sized
server component; in a multi-node system it is called a blade or node. Each of those
blades offers a high level of parallel computing for a price of about $50,000.
Despite the introduction of massive parallelism, until not long ago the disk dominated
all thinking and performance optimization. It was extremely slow, but necessary to
store the data. Compared to the speed development of CPUs, the development of disk
performance could not keep up, which distorted the whole model of working with
databases and large amounts of data. Today, the large amounts of main memory
available in servers initiate a shift from disk-based systems to in-memory systems,
which keep the primary copy of their data in main memory.

4.1 Memory Cells

In early computer systems, the frequency of the CPU was the same as the frequency
of the memory bus, and register access was only slightly faster than memory access.
However, CPU frequencies have increased heavily over the years following Moore's
Law1 [Moo65], while the frequencies of memory buses and the latencies of memory
chips have not grown at the same speed. As a result,

1 Moore's Law is the assumption that the number of transistors on integrated circuits
doubles every 18 to 24 months. This assumption still holds today.


Fig. 4.1: Memory Hierarchy on Intel Nehalem Architecture (two quad-core sockets,
each core with its own TLB, L1 and L2 cache, a shared L3 cache per socket, local
main memory, and a QPI link between the sockets)

memory access gets more expensive, as more CPU cycles are wasted while stalling
for memory access. This development is not due to the fact that fast memory cannot
be built; it is an economic decision, as memory which is as fast as current CPUs
would be orders of magnitude more expensive and would require extensive physical
space on the boards. In general, memory designers have the choice between SRAM
(Static Random Access Memory) and DRAM (Dynamic Random Access Memory).
SRAM cells are usually built out of six transistors (variants with only four exist but
have disadvantages [MSMH08]) and can store a stable state as long as power is
supplied. Accessing the stored state requires raising the word access line, and the
state is immediately available for reading.
In contrast, DRAM cells can be constructed using a much simpler structure consisting
of only one transistor and a capacitor. The state of the memory cell is stored in the
capacitor, while the transistor is only used to guard access to the capacitor. This
design is more economical than SRAM. However, it introduces a couple of
complications. The capacitor discharges over time and while reading the state of the
memory cell. Therefore, today's systems refresh DRAM chips every 64 ms [CJDM01]
and after every read of the cell in order to recharge the capacitor. During the refresh,
no access to the state of the cell is possible. The charging and discharging of the
capacitor takes time, which means that the current cannot be detected immediately
after requesting the stored state, thereby limiting the speed of DRAM cells.
In a nutshell, SRAM is fast but requires a lot of space, whereas DRAM chips are
slower but allow larger capacities due to their simpler structure. For more details
regarding the two types of RAM and their physical realization, the interested reader
is referred to [Dre07].

4.2 Memory Hierarchy

An underlying assumption of the memory hierarchy of modern computer systems is a
principle known as data locality [HP03]. Temporal data locality indicates that data
which is accessed is likely to be accessed again soon, whereas spatial data locality
indicates that data which is stored together in memory is likely to be accessed
together. These principles are leveraged by caches, which combine the best of both
worlds by exploiting the fast access to SRAM chips and the sizes made possible by
DRAM chips. Figure 4.1 shows the memory hierarchy using the example of the Intel
Nehalem architecture.
Small and fast caches close to the CPUs, built out of SRAM cells, cache accesses to
the slower main memory built out of DRAM cells. The hierarchy therefore consists of
multiple levels with increasing storage sizes but decreasing speed. Each CPU core has
its private L1 and L2 cache and one large L3 cache shared by the cores on one socket.
Additionally, the cores on one socket have direct access to their local part of main
memory through an IMC (Integrated Memory Controller). When accessing memory
other than their local part, the access is performed over a QPI (Quick Path
Interconnect) controller coordinating access to the remote memory.
The first level consists of the actual registers inside the CPU, used to store inputs and
outputs of the processed instructions. Processors usually have only a small number of
integer and floating-point registers, which can be accessed extremely fast. When
working with parts of the main memory, their content first has to be loaded and
stored in a register to make it accessible for the CPU. However, instead of accessing
the main memory directly, the content is first searched for in the L1 cache. If it is not
found in the L1 cache, it is requested from the L2 cache. Some systems even make
use of an L3 cache.
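
To make the data locality principle tangible, the following small C program (a sketch
written for illustration, not taken from [SKP12]) sums the same matrix once row by
row and once column by column. The row-wise traversal follows the memory layout,
so consecutive accesses hit cache lines that were already loaded, while the
column-wise traversal jumps across cache lines.

/* Sketch: spatial locality, row-wise vs. column-wise traversal of a matrix. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    int *m = malloc((size_t)N * N * sizeof(int));   /* N x N ints, row-major */
    long sum = 0;
    struct timespec t0, t1, t2;

    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < N; r++)            /* row-wise: sequential addresses */
        for (int c = 0; c < N; c++)
            sum += m[(size_t)r * N + c];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int c = 0; c < N; c++)            /* column-wise: stride of N ints  */
        for (int r = 0; r < N; r++)
            sum += m[(size_t)r * N + c];
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("sum=%ld row-wise: %.1f ms, column-wise: %.1f ms\n",
           sum, elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    free(m);
    return 0;
}

Compiled with a standard compiler (for example gcc -O2), the column-wise traversal
is typically several times slower on current hardware; the exact factor depends on the
cache sizes and the hardware prefetchers.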

4.3 Cache Internals

Caches are organized in cache lines, which are the smallest addressable unit in the
cache. If the requested content cannot be found in any cache, it is loaded from main
memory and transferred up the hierarchy. The smallest transferable unit between the
levels is one cache line. Caches where every cache line of level i is also present in
level i + 1 are called inclusive caches; otherwise the model is called exclusive caches.
All Intel processors implement an inclusive cache model, which is assumed for the
rest of this text.
When requesting a cache line from the cache, the process of determining whether the
requested line is already in the cache and locating where it is cached is crucial.
Theoretically, it is possible to implement fully associative caches, where each cache
line can cache any memory location. However, in practice this is only realizable for
very small caches, as a search over the complete cache is necessary when looking for
a cache line. In order to reduce the search space, the concept of an n-way set
associative cache with associativity Ai divides a cache of Ci bytes and cache line size
Bi into Ci / (Bi · Ai) sets and restricts the number of cache lines which can hold a
copy of a certain memory address to one set, i.e. Ai cache lines. Thus, when
determining if a cache line is already present in the cache, only one set with Ai cache
lines has to be searched.

Fig. 4.2: Parts of a Memory Address (from bit 63 down to bit 0: tag T, set S, offset O)

A requested address from main memory is split into three parts for determining
whether the address is already cached, as shown in Figure 4.2. The first part is the
offset O, whose size is determined by the cache line size of the cache. So with a
cache line size of 64 bytes, the lower 6 bits of the address would be used as the offset
into the cache line. The second part identifies the cache set. The number s of bits
used to identify the cache set is determined by the cache size Ci, the cache line size
Bi and the associativity Ai of the cache as s = log2(Ci / (Bi · Ai)). The remaining
64 − o − s bits of the address, where o is the number of offset bits, are used as a tag
to identify the cached copy. Therefore, when requesting an address from main
memory, the processor can calculate S by masking the address and then search the
respective cache set for the tag T. This can easily be done by comparing the tags of
the Ai cache lines in the set in parallel.
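
The address split can be reproduced with a few shift and mask operations. The
following C sketch uses illustrative cache parameters (a 32 KB, 8-way cache with
64-byte lines), not the values of a specific processor.

/* Sketch: splitting a 64-bit address into tag, set and offset. */
#include <stdio.h>
#include <stdint.h>

static unsigned log2u(uint64_t x) {       /* x is assumed to be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    uint64_t C = 32 * 1024;   /* cache size Ci: 32 KB (assumption)    */
    uint64_t B = 64;          /* cache line size Bi: 64 bytes         */
    uint64_t A = 8;           /* associativity Ai: 8-way (assumption) */

    unsigned o = log2u(B);            /* number of offset bits              */
    unsigned s = log2u(C / B / A);    /* number of set bits, log2(Ci/Bi/Ai) */

    uint64_t addr   = 0x7ffe12345678ULL;          /* an arbitrary address       */
    uint64_t offset = addr & ((1ULL << o) - 1);
    uint64_t set    = (addr >> o) & ((1ULL << s) - 1);
    uint64_t tag    = addr >> (o + s);            /* remaining 64 - o - s bits  */

    printf("offset bits=%u set bits=%u\n", o, s);
    printf("offset=%llu set=%llu tag=0x%llx\n", (unsigned long long)offset,
           (unsigned long long)set, (unsigned long long)tag);
    return 0;
}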

4.4 Address Translation

The operating system provides each process with a dedicated, contiguous address
space containing an address range from 0 to 2^x. This has several advantages, as the
process can address the memory through virtual addresses and does not have to
bother about physical fragmentation. Additionally, memory protection mechanisms
can control the access to memory, preventing programs from accessing memory
which was not allocated by them. Another advantage of virtual memory is the paging
mechanism, which allows a process to use more memory than is physically available
by paging pages in and out and saving them on secondary storage.
The contiguous virtual address space of a process is divided into pages of size p,
which is 4 KB on most operating systems. Those virtual pages are mapped to physical
memory. The mapping itself is saved in a so-called page table, which resides in main
memory itself. When the process accesses a virtual memory address, the address is
translated into a physical address by the operating system with the help of the
memory management unit inside the processor.
We do not go into the details of the translation and paging mechanisms. However, the
address translation is usually done via a multi-level page table, where the virtual
address is split into multiple parts which are used as indexes into the page directories,
resulting in a physical address and a respective offset. As the page table is kept in
main memory, each translation of a virtual address into a physical address would
require additional main memory accesses or cache accesses in case the page table is
cached.
In order to speed up the translation process, the computed values are cached in the
so-called Translation Lookaside Buffer (TLB), which is a small and fast cache. When
accessing a virtual address, the respective tag for the memory page is calculated by
masking the virtual address, and the TLB is searched for the tag. In case the tag is
found, the physical address can be retrieved from the cache. Otherwise, a TLB miss
occurs and the physical address has to be calculated, which can be quite costly.
Details about the address translation process, TLBs and paging structure caches for
Intel 64 and IA-32 architectures can be found in [Int08].
The costs introduced by the address translation scale linearly with the width of the
translated address [HP03, CJDM99], therefore making it hard or impossible to build
large memories with very small latencies.
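
As an illustration of this split, the following sketch decomposes a 64-bit virtual
address into the 4 KB page offset and the four 9-bit indexes used by x86-64 four-level
paging; the concrete address is arbitrary and only the bit arithmetic matters here.

/* Sketch: decomposing a virtual address for 4 KB pages and 4-level paging. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;   /* an example virtual address */

    uint64_t offset = vaddr & 0xfffULL;       /* 12-bit offset within the 4 KB page */
    uint64_t vpn    = vaddr >> 12;            /* virtual page number                */

    /* indexes into the four page-table levels, 9 bits each */
    unsigned pt   = (vaddr >> 12) & 0x1ff;
    unsigned pd   = (vaddr >> 21) & 0x1ff;
    unsigned pdpt = (vaddr >> 30) & 0x1ff;
    unsigned pml4 = (vaddr >> 39) & 0x1ff;

    printf("page number=0x%llx offset=0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset);
    printf("pml4=%u pdpt=%u pd=%u pt=%u\n", pml4, pdpt, pd, pt);
    return 0;
}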

4.5 Prefetching

Modern processors try to guess which data will be accessed next and initiate loads
before the data is accessed in order to reduce the incurred access latencies. Good
prefetching can completely hide the latencies, so that the data is already in the cache
when it is accessed. However, if data is loaded which is not accessed later, it can also
evict data which would be accessed later and thereby induce additional misses by
loading this data again. Processors support software and hardware prefetching.
Software prefetching can be seen as a hint to the processor, indicating which
addresses are accessed next. Hardware prefetching automatically recognizes access
patterns by utilizing different prefetching strategies. The Intel Nehalem architecture
contains two second level cache prefetchers, the L2 streamer and the data prefetch
logic (DPL) [Int11].
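
As a small, hedged illustration of software prefetching, the following C sketch uses
the GCC/Clang builtin __builtin_prefetch to hint that an element a few iterations
ahead will be read soon. The prefetch distance of 16 elements is a guessed tuning
parameter, and for a plain sequential scan like this the hardware prefetchers will
usually make the hint redundant.

/* Sketch: a software prefetch hint in a sequential scan. */
#include <stdio.h>
#include <stdlib.h>

static long sum_with_prefetch(const long *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                       /* read hint, low temporal locality */
            __builtin_prefetch(&data[i + 16], 0, 1);
        sum += data[i];
    }
    return sum;
}

int main(void) {
    size_t n = 1 << 20;
    long *data = malloc(n * sizeof(long));
    for (size_t i = 0; i < n; i++) data[i] = 1;
    printf("%ld\n", sum_with_prefetch(data, n));
    free(data);
    return 0;
}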

4.6 Memory Hierarchy and Latency Numbers

The memory hierarchy can be viewed as a pyramid of storage media. The slower a
medium is, the cheaper it gets. This also means that the amount of storage on the
lower levels increases, because it is simply more affordable. The hierarchy levels of
today's hardware are outlined in Figure 4.3.
Fig. 4.3: Conceptual View of the Memory Hierarchy (from top to bottom: CPU
registers, CPU caches, main memory, flash, hard disk; latency increases towards the
bottom, while price per performance decreases)

At the very bottom is the hard disk. It is cheap, offers large amounts of storage and
replaced magnetic tape as the slowest necessary storage medium.
The next medium is flash. It is faster than disk, but it is still regarded as disk from a
software perspective because of its persistence and its usage characteristics. This
means that the same block-oriented input and output methods which were developed
more than 20 years ago for disks are still in place for flash. In order to fully utilize the
speed of flash-based storage, the interfaces and drivers have to be adapted
accordingly.
On top of flash is the main memory, which is directly accessible. The next level
consists of the CPU caches (L3, L2 and L1), each with different characteristics.
Finally, the top level of the memory hierarchy consists of the registers of the CPU,
where operations such as calculations happen.
When accessing data from a disk, there are usually four layers between the accessed
disk and the registers of the CPU which only transport information. In the end, every
operation takes place inside the CPU, and in turn the data has to be in the registers.
Table 4.1 shows some of the latencies which come into play in the memory hierarchy.
Latency is the time delay experienced by the system from requesting the data from
the storage medium until it is available in a CPU register. The L1 cache latency is
0.5 nanoseconds. In contrast, a main memory reference takes 100 nanoseconds and a
simple disk seek takes 10 milliseconds.

Action                                    Time in nanoseconds      Time
L1 cache reference (cached data word)                  0.5 ns
Branch mispredict                                        5 ns
L2 cache reference                                       7 ns
Mutex lock / unlock                                     25 ns
Main memory reference                                  100 ns    0.1 µs
Send 2,000 bytes over 1 Gb/s network                20,000 ns     20 µs
SSD random read                                    150,000 ns    150 µs
Read 1 MB sequentially from memory                 250,000 ns    250 µs
Disk seek                                       10,000,000 ns     10 ms
Send packet CA to Netherlands to CA            150,000,000 ns    150 ms

Table 4.1: Latency Numbers
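
To put these numbers into relation, the following small program computes a few of
the ratios implied by Table 4.1, for example how many main memory references fit
into a single disk seek.

/* Sketch: ratios between the latency numbers of Table 4.1. */
#include <stdio.h>

int main(void) {
    double l1_ns   = 0.5;          /* L1 cache reference    */
    double mem_ns  = 100.0;        /* main memory reference */
    double ssd_ns  = 150000.0;     /* SSD random read       */
    double seek_ns = 10000000.0;   /* disk seek (10 ms)     */

    printf("main memory is %.0f times slower than L1\n", mem_ns / l1_ns);
    printf("one SSD random read costs %.0f memory references\n", ssd_ns / mem_ns);
    printf("one disk seek costs %.0f memory references\n", seek_ns / mem_ns);
    printf("one disk seek costs %.0f L1 references\n", seek_ns / l1_ns);
    return 0;
}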

In the end, there is nothing special about “in-memory” computing: all computing ever
done was in memory, because it can only take place in the CPU. Assuming a
bandwidth-bound application, the performance is determined by how fast the data
can be transferred through the hierarchy to the CPU. In order to estimate the runtime
of an algorithm, it is possible to roughly estimate the amount of data which has to be
transferred to the CPU. A very simple operation that a CPU can do is a comparison,
like filtering for an attribute. Let us assume a calculation speed of 2 MB per
millisecond for this operation using one core. So one core of a CPU can digest 2 MB
per millisecond. This number scales with the number of cores: if there are ten cores,
they can scan 20 GB per second, and if there are 10 nodes with ten cores each, that is
already 200 GB per second.
Considering a large multi-node system like that, with 10 nodes and 40 CPUs per node
and the data distributed across the nodes, it is hard to write an algorithm which needs
more than one second, even for large amounts of data. The previously mentioned
200 GB are highly compressed data, so the equivalent amount of plain character data
is much higher. To sum this up, the number to remember is 2 MB per millisecond per
core. If an algorithm shows a completely different result, it is worth looking into it, as
there is probably something going wrong. This could be an issue in the SQL, like an
overly complicated join or a loop within a loop.
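
The following sketch turns this rule of thumb into a small calculation; the core and
node counts are the example values used above, not measurements.

/* Sketch: back-of-the-envelope scan time using the 2 MB/ms/core rule of thumb. */
#include <stdio.h>

int main(void) {
    double gb_per_s_per_core = 2.0;   /* 2 MB per ms per core == 2 GB/s per core */
    int cores_per_node = 10;          /* example values from the text            */
    int nodes = 10;
    double data_gb = 200.0;           /* amount of (compressed) data to scan     */

    double aggregate = gb_per_s_per_core * cores_per_node * nodes;
    printf("aggregate scan rate: %.0f GB/s\n", aggregate);
    printf("scanning %.0f GB takes about %.1f s\n", data_gb, data_gb / aggregate);
    return 0;
}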

4.7 Non-Uniform Memory Access

As the development in modern computer systems goes from multi-core to many-core
systems and the amount of main memory continues to increase, using a Front Side
Bus (FSB) with Uniform Memory Access (UMA) becomes
a bottleneck and introduces heavy challenges in hardware design to connect all cores
and memory. Non-Uniform Memory Access (NUMA) attempts to solve this problem
by introducing local memory locations which are cheap to access for the local
processors. Figure 4.4 pictures an overview of a UMA and a NUMA system. In a
UMA system, every processor observes the same speed when accessing an arbitrary
memory address, as the complete memory is accessed through a central memory
interface, as shown in Figure 4.4 (a). In contrast, in NUMA systems every processor
has its primarily used local memory as well as remote memory supplied by the other
processors; this setup is shown in Figure 4.4 (b). These different kinds of memory
introduce, from the processor's point of view, different memory access times between
local memory (adjacent slots) and remote memory that is adjacent to the other
processing units.

Fig. 4.4: (a) Shared FSB, (b) Intel Quick Path Interconnect [Int09]

Additionally, systems can be classified into cache-coherent NUMA (ccNUMA) and
non cache-coherent NUMA systems. ccNUMA systems provide each CPU cache with
the same view of the complete memory and enforce coherency by a protocol
implemented in hardware, whereas non cache-coherent NUMA systems require
software layers to handle memory conflicts accordingly. Although non-ccNUMA
hardware is easier and cheaper to build, most of today's standard hardware provides
ccNUMA, since non-ccNUMA hardware is more difficult to program.
To fully exploit the potential of NUMA, applications have to be made aware of the
different memory latencies and should primarily load data from the locally attached
memory slots of a processor. Memory-bound applications may suffer a degradation of
up to 25% of their maximal performance if remote memory is accessed instead of
local memory.

By introducing NUMA, the central bottleneck of the FSB can be avoided and memory
bandwidth can be increased. Benchmark results have shown that a throughput of
more than 72 GB per second is possible on an Intel Xeon 7560 (Nehalem EX) system
with four processors [Fuj10].
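
On Linux, NUMA awareness can be made explicit with libnuma. The following
hedged sketch, which is not part of the original text and assumes libnuma is installed
and the program is linked with -lnuma, allocates a buffer on the NUMA node of the
CPU the thread is currently running on, so that subsequent accesses stay local.

/* Sketch: allocating memory on the local NUMA node with libnuma (link with -lnuma). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();          /* CPU this thread currently runs on */
    int node = numa_node_of_cpu(cpu);   /* NUMA node that CPU belongs to     */

    size_t size = 64UL * 1024 * 1024;   /* 64 MB */
    char *buf = numa_alloc_onnode(size, node);   /* place pages on the local node */
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < size; i++)   /* touch the pages: local memory accesses */
        buf[i] = 0;

    printf("cpu %d, node %d, nodes in system: %d\n", cpu, node, numa_max_node() + 1);
    numa_free(buf, size);
    return 0;
}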

4.8 Scaling Main Memory Systems

An example system consisting of multiple nodes can be seen in Figure 4.5. Each node
has eight CPUs with eight cores each, so one node has 64 cores, and there are four
nodes. Each of them has a terabyte of RAM and SSDs for persistence. Everything
below DRAM is used for logging, archiving, and for the emergency reconstruction of
data, which means reloading the data after the power supply was turned off.
Fig. 4.5: A System Consisting of Multiple Blades (four nodes, each with 8 CPUs of
8 cores, 1 TB RAM and SSDs for persistence, connected via 10 GbE to the network
and to a storage area network with SSDs / disks)

The networks which connect the nodes are continuously increasing in speed. In the
example shown in Figure 4.5, a 10 Gb/s Ethernet network connects the four nodes.
Computers with 40 Gb/s InfiniBand are already on the market, and switch
manufacturers are talking about 100 Gb/s switches which even have logic allowing
smart switching. This is another place where optimization can take place, at a low
level and very effectively for applications. It can be leveraged to improve joins, where
calculations often go across multiple nodes.

4.9 Remote Direct Memory Access

Shared memory is another interesting way to directly access memory between
multiple nodes. The nodes are connected via an InfiniBand network and create a
shared memory region. The main idea is to automatically access data which resides
on a different node without explicitly shipping the data; in turn, there is direct access
without shipping the data and processing it on the other side. Research on this has
been done at Stanford University in cooperation with the HPI using a RAM cluster. It
is very promising, as it could basically offer direct access to a seemingly unlimited
amount of memory from a program's perspective.

4.10 References

[CJDM99] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison
of contemporary DRAM architectures. Proceedings of the 26th Annual International
Symposium on Computer Architecture, 1999.
[CJDM01] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. High-performance
DRAMs in workstation environments. Computers, IEEE Transac-
tions on, 50(11):1133–1153, 2001.
[Dre07] U. Drepper. What Every Programmer Should Know About Memory.
http://people.redhat.com/drepper/cpumemory.pdf, 2007.
[Fuj10] Fujitsu. Speicher-Performance Xeon 7500 (Nehalem EX)
basierter Systeme, 2010.
[HP03] J. Hennessy and D. Patterson. Computer architecture: a quantitative
approach. Morgan Kaufmann, 2003.
[Int08] Intel Inc. TLBs, Paging-Structure Caches, and Their Invalidation,
2008.
[Int09] Intel. An Introduction to the Intel QuickPath Interconnect, 2009.
[Int11] Intel Inc. Intel 64 and IA-32 Architectures Optimization Reference
Manual, 2011.
[Moo65] G. Moore. Cramming more components onto integrated circuits.
Electronics, 38:114 ff., 1965.
[MSMH08] A. A. Mazreah, M. R. Sahebi, M. T. Manzuri, and S. J. Hosseini. A Novel
Zero-Aware Four-Transistor SRAM Cell for High Density and Low Power Cache
Application. In Advanced Computer Theory and Engineering, 2008. ICACTE '08.
International Conference on, pages 571–575, 2008.
[SKP12] David Schwalb, Jens Krueger, and Hasso Plattner. Cache con-
scious column organization in in-memory column stores. Tech-
nical Report 60, Hasso-Plattner-Institute, December 2012.
