Hardware
(Intel example)
Table of contents
• Computer memory hierarchy
• Types of memory
• Layout of physical memory
• Memory addressing
• At what speeds do different parts of the computer work?
• How does the processor cache work?
• Computer architectures
• Additional reading
Computer Memory Hierarchy, Ryan J. Leng, 2007
Types of memory
Dynamic Random Access Memory
Dynamic Random Access Memory (DRAM) – a kind of volatile semiconductor memory with random access,
whose bits are represented by the state of charge of capacitors. DRAM cells require periodic refreshing
of their content (use less energy), unlike static memories, which require a constant power supply (use more
energy).
https://www.youtube.com/watch?v=yi0FhRqDJfo
Types of memory
Static Random Access Memory
SRAM memories are used in fast cache memories:
• they do not require large capacities (data density in SRAM is 4 times lower than in DRAM),
• access is about 7 times faster than in DRAM (one SRAM cycle takes about 10 ns, while a
DRAM cycle takes about 70 ns).
This speed advantage applies to random access; when reading data from neighboring address cells,
the speeds of SRAM and DRAM are comparable.
https://www.youtube.com/watch?v=yi0FhRqDJfo
Image source: Intel
3D XPoint: a non-volatile memory (NVM) technology.
In 2015, Intel and Micron claimed that 3D XPoint would be up to
1,000 times faster and have up to 1,000 times more
endurance than NAND flash, offer 10 times the storage
density of conventional memory, and cost up to half as much as
DRAM.
Available on the open market under the brand name
Optane (Intel) since April 2017.
Intel 3D XPoint Technology, Disruptive Technologies Session, 2015 HPC User Forum
3D XPoint™ Technology Revolutionizes Storage Memory
Admiral Grace Hopper (1906-1992) explains the nanosecond
• I called over to the engineering building and I said: "Please cut off a
nanosecond and send it over to me".
• I wanted a piece of wire which would represent the maximum
distance that electricity could travel in a billionth of a second. Of
course, it wouldn’t really be through wire. It’d be out in space; the
velocity of light.
• So, if you start with the velocity of light, you’ll discover that a
nanosecond is 11.8 inches long (29.97 cm).
• At the end of about a week, I called back and said: "I need something to
compare this to. Could I please have a microsecond?"
• Here is a microsecond, 984 feet (29992.32 cm, about 300 m). I sometimes think we ought
to hang one over every programmer’s desk (or around their neck) so they
know what they’re throwing away when they throw away microseconds.
https://www.youtube.com/watch?v=9eyFDBPk4Yw
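The lengths quoted by Hopper can be checked with a few lines of arithmetic. The sketch below assumes propagation at the speed of light in vacuum (signals in real wires are slower); the function name is made up for this illustration.

```python
# Speed of light in vacuum, exact by definition of the metre.
C_M_PER_S = 299_792_458

METERS_PER_INCH = 0.0254
METERS_PER_FOOT = 0.3048


def light_distance_m(seconds: float) -> float:
    """Distance in metres that light travels in vacuum in `seconds`."""
    return C_M_PER_S * seconds


ns_m = light_distance_m(1e-9)  # one nanosecond
us_m = light_distance_m(1e-6)  # one microsecond

print(f"1 ns: {ns_m * 100:.2f} cm = {ns_m / METERS_PER_INCH:.1f} in")
print(f"1 us: {us_m:.1f} m = {us_m / METERS_PER_FOOT:.0f} ft")
```

The microsecond is exactly 1,000 times longer than the nanosecond, which is the point of the comparison: a wasted microsecond is a thousand of those short wires.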
Memory hierarchy, Intel
Memory hierarchy, sample values ~2021
https://www.youtube.com/watch?v=J6jkrDlgflo
Layout of physical memory
Memory addressing
Memory addressing depends on the hardware. 80x86 microprocessors distinguish three types of addresses:
– Logical address – used at the level of machine language instructions. It is tied to the segmentation
available in this architecture. Each logical address consists of a segment number and an offset within the segment.
– Linear (virtual) address – a 32-bit number that can address up to 4 GB.
– Physical address – the address recognized by the memory module. Physical addresses are 32-bit or
36-bit numbers.
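The address widths above translate directly into address-space sizes. A minimal sketch, assuming byte-addressable memory and using binary gigabytes (2^30 bytes) for the "GB" on the slide; the function name is only illustrative:

```python
GIB = 1 << 30  # 2^30 bytes


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses an address of this width can name."""
    return 1 << address_bits


for bits in (32, 36):
    print(f"{bits}-bit addresses: {addressable_bytes(bits) // GIB} GiB")
```

Four extra physical address bits multiply the reachable memory by 16, which is why a 36-bit physical address can cover more memory than a single 32-bit linear address space.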
Memory address translation in x86 CPUs with paging enabled (source: Duarte, Software
Illustrated)
Address translation and MMU
Segmentation
When paging is turned off, the output from the segmentation unit is already a physical address; in 16-bit
real mode that is always the case.
The original 8086 (1978) had 16-bit registers. This allowed code to work with 2^16 bytes (64 KB) of memory.
To increase the size of the available address space without increasing the size of registers and
instructions, segment registers were introduced to allow switching between different blocks (segments)
of 64 KB each.
There were four segment registers: one for the stack (ss), one for program
code (cs), and two for data (ds, es). There are also two
general-purpose segment registers: fs and gs.
Nowadays segmentation is still present and is always enabled
in x86 processors.
https://en.wikipedia.org/wiki/X86
Each instruction that touches memory implicitly uses a
segment register. Segment registers store 16-bit segment
selectors.
16-bit real mode
Though segmentation is always on, it works differently in real mode versus protected mode.
In real mode, such as during early boot, the segment selector is a 16-bit number specifying the physical
memory address for the start of a segment.
This number must be scaled. Intel made the decision to multiply the segment selector by only 2^4 (that is, 16),
which limits memory to about 1 MB and complicates translation.
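The real-mode calculation can be sketched as follows. This is a minimal illustration, not a model of any particular CPU; the function name is an assumption, and the sketch does not model the wrap-around that occurs on processors with only 20 address lines.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Physical address for a real-mode segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # Multiplying by 2^4 is the same as shifting left by 4 bits.
    return (segment << 4) + offset


# Two different segment:offset pairs can name the same byte, which is
# one reason the translation is awkward to reason about.
a = real_mode_address(0x1234, 0x0010)
b = real_mode_address(0x1235, 0x0000)
print(hex(a), hex(b), a == b)

# The largest reachable address is slightly above 1 MB (2^20 bytes).
print(hex(real_mode_address(0xFFFF, 0xFFFF)))
```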
32-bit protected mode
https://en.wikipedia.org/wiki/I386
In 32-bit protected mode (1985), a segment selector is an index into a table of 8-byte segment
descriptors. Segment descriptors are stored in two tables: Global Descriptor Table (GDT) and Local
Descriptor Table (LDT).
The TI bit in a segment selector is 0 for the GDT and 1 for the LDT, while the index specifies the desired
segment within the table. Each CPU (or core) contains a register called gdtr, which stores the linear
memory address of the first byte in the GDT.
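A segment selector can be taken apart with a few shifts and masks. The sketch below assumes the usual x86 selector layout: index in bits 15-3, TI in bit 2, and a 2-bit requested privilege level (RPL) in bits 1-0. The RPL field is not discussed in the text above, and the table base used in the example is a made-up value.

```python
from typing import NamedTuple

DESCRIPTOR_SIZE = 8  # bytes per segment descriptor


class Selector(NamedTuple):
    index: int  # position of the descriptor within the table
    ti: int     # 0 = GDT, 1 = LDT
    rpl: int    # requested privilege level (assumption: not covered above)


def decode_selector(selector: int) -> Selector:
    """Split a 16-bit segment selector into its fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must be a 16-bit value")
    return Selector(index=selector >> 3, ti=(selector >> 2) & 1, rpl=selector & 3)


def descriptor_address(selector: int, table_base: int) -> int:
    """Linear address of the descriptor, given the base of the GDT or LDT."""
    return table_base + decode_selector(selector).index * DESCRIPTOR_SIZE


# Hypothetical example: selector 0x23 with a GDT that starts at 0x1000.
print(decode_selector(0x23))
print(hex(descriptor_address(0x23, table_base=0x1000)))
```

For a selector with TI = 0, `table_base` would be the value held in gdtr.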
The base address in a segment descriptor is a 32-bit linear
address pointing to the beginning of the segment.
The limit specifies how big the segment is.
Adding the base address to a logical memory address
yields a linear address.
To choose a segment, a segment selector has to be loaded into a
segment register.
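The base-plus-offset step can be sketched as below. This is a simplified model under stated assumptions: the limit is treated as the largest valid offset, and descriptor details such as the granularity bit, expand-down segments, and access rights are ignored. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base: int   # 32-bit linear address where the segment starts
    limit: int  # largest valid offset within the segment (simplification)


def to_linear(segment: Segment, offset: int) -> int:
    """Translate an offset within a segment to a 32-bit linear address."""
    if not 0 <= offset <= segment.limit:
        raise ValueError("offset is outside the segment limit")
    return (segment.base + offset) & 0xFFFFFFFF


# A hypothetical 64 KB segment starting at 0x00400000.
small = Segment(base=0x00400000, limit=0xFFFF)
print(hex(to_linear(small, 0x1234)))

# With base 0 and the maximum limit, the offset and the linear address coincide.
flat = Segment(base=0, limit=0xFFFFFFFF)
print(hex(to_linear(flat, 0xDEADBEEF)))
```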
The jump instruction in 32-bit protected mode.
cs contains the code segment selector.
Most user mode applications do not make use of the LDT, thus the kernel defines a default LDT to be shared by
most processes. In some cases, however, processes may need to set up their own LDT (e.g. applications
like Wine that execute segment-oriented MS Windows applications).
Basic flat model
When the CPU is in 32-bit mode, registers and instructions can address the entire linear address space, so it
is enough to set the base address to 0 and treat the logical address as a linear address. Intel calls this the
basic flat model.
The basic flat model is equivalent to disabling segmentation when it comes to translating memory addresses.
All processes executed in user mode use the same pair of segments: the user code segment and the user data
segment.
Similarly, all kernel mode processes use the same pair of segments: the kernel code segment and the kernel data
segment.
All of these segments have a base address of 0 and a limit of 2^32 - 1. This means that all processes, both
in user mode and in kernel mode, can use the same logical addresses.
64-bit flat linear address space
In Linux, only 3 segment descriptors are used during boot. Two of the segments are flat, addressing the entire
32-bit space (a code segment and a data segment), the third segment is a system segment TSS (Task State
Segment).
After boot, each CPU has its own copy of the GDT:
– the layout of the GDT is specified in arch/x86/include/asm/segment.h,
– and its instantiation in arch/x86/kernel/cpu/common.c.
Since coinciding logical and linear addresses are simpler to handle, they became standard, such that 64-bit
mode enforces a flat linear address space (2005).
Except in unusual cases, segmentation does not change the resulting physical address in 64-bit mode
(segmentation is just used to store traits like the current privilege level, and to enforce features like SMEP –
Supervisor Mode Execution Prevention).
One well-known "unusual case" is the implementation of Thread Local Storage by most compilers on x86,
which uses the fs and gs segments to define per-logical-processor offsets into the address space. Other
segments cannot have non-zero bases, and therefore cannot shift addresses through segmentation.
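The rule described above can be written down as a small model. This is only a sketch of the address arithmetic: the function and parameter names are hypothetical, the base values in the example are invented, and privilege checks are not modelled.

```python
MASK_64 = 0xFFFF_FFFF_FFFF_FFFF


def linear_address_64(segment: str, offset: int,
                      fs_base: int = 0, gs_base: int = 0) -> int:
    """Linear address in 64-bit mode: only fs and gs contribute a base."""
    bases = {"cs": 0, "ds": 0, "es": 0, "ss": 0, "fs": fs_base, "gs": gs_base}
    if segment not in bases:
        raise ValueError(f"unknown segment register: {segment}")
    return (bases[segment] + offset) & MASK_64


# Hypothetical thread-local block located at 0x7f0000000000.
tls_base = 0x7F00_0000_0000
print(hex(linear_address_64("ds", 0x1000, fs_base=tls_base)))  # base ignored
print(hex(linear_address_64("fs", 0x10, fs_base=tls_base)))    # base applied
```

Giving each thread a different fs base lets the same instruction, with the same offset, reach a different copy of a thread-local variable.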
At what speeds do different parts of the computer work?
What are the
– latency (speed),
– throughput
of various subsystems in a commodity PC
(Intel Core 2 Duo at 3.0 GHz)?
Time units are:
– nanoseconds (ns, 10^-9 s),
– milliseconds (ms, 10^-3 s),
– seconds (s).
Throughput units are megabytes and
gigabytes per second.
Brendan Gregg
https://www.youtube.com/watch?v=tDacjrSCeq4
https://www.youtube.com/watch?v=lMPozJFC8g0
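To relate latencies in nanoseconds to the processor's own pace, it helps to express them in clock cycles. The sketch below uses the 3.0 GHz clock mentioned above and the approximate SRAM and DRAM cycle times of 10 ns and 70 ns quoted in the types-of-memory slides; the function names are illustrative, and real access latencies depend on the specific hardware.

```python
def cycle_time_ns(freq_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / freq_ghz


def cycles_for(latency_ns: float, freq_ghz: float) -> float:
    """How many clock cycles elapse during a latency given in nanoseconds."""
    return latency_ns * freq_ghz


FREQ_GHZ = 3.0
print(f"one cycle at {FREQ_GHZ} GHz: {cycle_time_ns(FREQ_GHZ):.3f} ns")
for name, latency_ns in (("SRAM", 10.0), ("DRAM", 70.0)):
    print(f"{name}: {latency_ns} ns is {cycles_for(latency_ns, FREQ_GHZ):.0f} cycles")
```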
Why do we need the processor cache?
Flavors of Memory supported by Linux, their use and benefit, Christopher Lameter
(source: presentation at Open Source Summit, 2018)
AMD Ryzen 5000, 2020
https://www.youtube.com/watch?v=J6jkrDlgflo
IBM POWER10, 2020
https://www.youtube.com/watch?v=J6jkrDlgflo
How does the processor cache work?
Intel
Search for matching tag in the set (source: Duarte, Software Illustrated)
You can imagine each way and its directory as columns in a spreadsheet, in which case the rows are the
sets. Each cell in a way column contains a cache line.
Physical memory is divided into 4 KB physical pages. Each page has 4 KB/64 bytes = 64 cache lines in it.
Bytes 0 through 63 within that page are in the first cache line, bytes 64-127 in the second cache line,
and so on. The pattern repeats for each page.
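The page and cache-line arithmetic above can be sketched in a few lines, assuming the 4 KB pages and 64-byte cache lines used in the text; the function name is made up for this illustration.

```python
PAGE_SIZE = 4096  # bytes per physical page
LINE_SIZE = 64    # bytes per cache line
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE


def line_in_page(physical_address: int) -> int:
    """Index of the cache line, within its page, that holds this byte."""
    return (physical_address % PAGE_SIZE) // LINE_SIZE


print("cache lines per page:", LINES_PER_PAGE)
for addr in (0, 63, 64, 127, 4095, 4096):
    print(f"byte {addr:>4} -> line {line_in_page(addr)}")
```

Byte 4096 is the first byte of the next page, so it maps back to line 0: this is the repeating pattern mentioned above.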
Basic types of cache organization:
• fully associative cache – any memory line can be stored in any of the cache cells. This makes
storage flexible, but searching for a cell on each access becomes expensive,
• direct-mapped cache (one-way set associative) – a given memory line can only be stored in one
specific set (or row),
• N-way set-associative cache – each memory line may be stored in any one of the N cache lines of its set. In an
8-way set-associative cache each row has 8 cells available to store the cache lines it is associated with. Bits
11-6 of the address determine the line number within the 4 KB page and therefore the set to be used.
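A toy model can make the set-associative lookup concrete. The sketch below assumes 64-byte lines and 64 sets, so that bits 11-6 select the set as described above. The least-recently-used (LRU) replacement policy is an assumption made for this example, not something the slides specify, and all names are illustrative.

```python
from collections import OrderedDict

LINE_BITS = 6  # 64-byte lines: bits 5-0 are the byte offset within the line
SET_BITS = 6   # 64 sets: bits 11-6 select the set


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set_index, offset) for a physical address."""
    offset = addr & ((1 << LINE_BITS) - 1)
    set_index = (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (LINE_BITS + SET_BITS)
    return tag, set_index, offset


class SetAssociativeCache:
    """Tracks which tags are resident in each set; stores no data."""

    def __init__(self, ways: int = 8) -> None:
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(1 << SET_BITS)]

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (the line is then loaded)."""
        tag, set_index, _ = split_address(addr)
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used tag
        lines[tag] = None
        return False


cache = SetAssociativeCache(ways=2)
for addr in (0, 0, 4096, 8192, 0):
    print(f"address {addr:>5}: {'hit' if cache.access(addr) else 'miss'}")
```

Addresses 0, 4096 and 8192 differ only in their tags, so they all compete for the same set. With `ways=1` the model behaves like a direct-mapped cache; a fully associative cache would correspond to a single set holding every line.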
Architectures
Symmetric multiprocessors
Since 2002
The interconnect between the two systems introduced latency for the memory access across nodes.
Understanding Non-Uniform Memory Access/Architectures (NUMA)
AMD Hyper-Transport (HT)
Where the older SMP architecture had a separate
memory controller, newer systems have an
integrated memory controller built into the processor
itself, and each processor has its own memory bank.
The first processors to introduce an integrated
memory controller were the AMD Opteron series of
processors in early 2003. AMD processors share
memory access through Hyper-Transport (HT) links
between the processors.
NUMA organization with 4 AMD Opteron 6128 (2010)
Hardware insights, Francesco Quaglia
A Primer on Modern Hardware Architectures, Alessandro Pellegrini
Additional reading