
A little about hardware

(Intel example)
Table of contents
• Computer memory hierarchy
• Types of memory
• Layout of physical memory
• Memory addressing
• At what speeds do different parts of the computer work?
• How does the processor cache work?
• Computer architectures
• Additional reading

Computer Memory Hierarchy, Ryan J. Leng, 2007
Types of memory
Dynamic Random Access Memory
Dynamic Random Access Memory, DRAM – a kind of volatile semiconductor memory with random access, in which each bit is represented by the charge state of a capacitor. DRAM requires periodic refreshing of its contents (and uses less energy), unlike static memory, which requires a constant power supply (and uses more energy).

Synchronous Dynamic Random Access Memory, SDRAM – a type of DRAM working synchronously with the system bus (which distinguishes it from classic DRAM, which works asynchronously).
The SDRAM family includes: SDR (Single Data Rate), DDR (Double Data Rate), DDR2, DDR3, DDR4, DDR5.
(Figure: DRAM is used for main memory, SRAM is used for cache.)

https://www.youtube.com/watch?v=yi0FhRqDJfo
Types of memory
Static Random Access Memory
SRAM memories are used in fast cache memories,
• they do not require large capacities (data density in SRAM is about 4 times lower than in DRAM),
• access speed is about 7 times faster than DRAM (one SRAM cycle takes about 10 ns, versus about 70 ns for DRAM).
This speed advantage applies to random access; when reading data from neighboring address cells, the speed of SRAM and DRAM is comparable.
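The random-access gap is easy to observe with a pointer-chasing loop, the classic latency measurement technique: each load depends on the previous one, so the CPU cannot overlap or prefetch the accesses. Below is a minimal sketch; the buffer size is a hypothetical value chosen to exceed typical caches so the chase lands in DRAM.

/* Rough sketch of a pointer-chasing latency measurement (POSIX assumed). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1u << 22)   /* 4M pointers = 32 MB on a 64-bit machine */
#define HOPS (1u << 24)   /* number of dependent loads to time */

int main(void) {
    void **buf   = malloc(N * sizeof *buf);
    size_t *perm = malloc(N * sizeof *perm);
    if (buf == NULL || perm == NULL) return 1;

    /* Build a random cyclic permutation so hardware prefetchers are useless. */
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[perm[i]] = &buf[perm[(i + 1) % N]];
    free(perm);

    struct timespec t0, t1;
    void **p = (void **)buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < HOPS; i++)
        p = (void **)*p;                 /* serialized chain of loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the compiler from optimizing the loop away. */
    printf("%.1f ns per access (end=%p)\n", ns / HOPS, (void *)p);
    free(buf);
    return 0;
}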

https://www.youtube.com/watch?v=yi0FhRqDJfo
3D XPoint
A non-volatile memory (NVM) technology (image source: Intel).
In 2015, Intel and Micron claimed 3D XPoint would be up to 1,000 times faster and have up to 1,000 times more endurance than NAND flash, and have 10 times the storage density of conventional memory, at up to half the cost of DRAM.
Available on the open market under the brand name Optane (Intel) since April 2017.
Intel 3D XPoint Technology, Disruptive Technologies Session, 2015 HPC User Forum
3D XPoint™ Technology Revolutionizes Storage Memory
Admiral Grace Hopper (1906-1992) explains the nanosecond
• I called over to the engineering building and I said: "Please cut off a nanosecond and send it over to me."
• I wanted a piece of wire which would represent the maximum distance that electricity could travel in a billionth of a second. Of course, it wouldn't really be through wire; it'd be out in space, at the velocity of light.
• So, if you start with the velocity of light, you'll discover that a nanosecond is 11.8 inches long (29.97 cm): light covers 299,792,458 m/s × 10^-9 s ≈ 0.2998 m in that time.

• At the end of about a week, I called back and said: "I need something to compare this to. Could I please have a microsecond?"
• Here is a microsecond, 984 feet (299.92 m). I sometimes think we ought to hang one over every programmer's desk (or around their neck) so they know what they're throwing away when they throw away microseconds.

https://www.youtube.com/watch?v=9eyFDBPk4Yw
Memory hierarchy, Intel
Memory hierarchy, sample values ~2021

https://www.youtube.com/watch?v=J6jkrDlgflo
Layout of physical memory

Most processor requests are for RAM; part are for memory-mapped I/O, mainly devices such as video cards and PCI cards, as well as the flash memory that stores the BIOS.
This device address mapping punches holes in the physical address space: a legacy hole between 640 KB and 1 MB, and a larger one below 4 GB reserved for devices. This is why 32-bit operating systems have problems using a full 4 GB of memory.
The figure shows a typical layout of the first 4 GB of physical memory addresses in an Intel PC.

Layout of physical memory, 2008 (source: Duarte, Software Illustrated)
On Linux, the physical memory mapping can be read from the file /proc/iomem. This is how it looks on my old workstation and on duch.mimuw.edu.pl (640 KB is 0x000a0000, 1 MB is 0x00100000).
Do you see the difference?
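A quick way to inspect this from a program is simply to read the file; a minimal sketch in C (note that on recent kernels, non-root readers see the address ranges zeroed out):

/* Dump the physical memory map exposed by the kernel. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/iomem", "r");
    if (f == NULL) { perror("/proc/iomem"); return 1; }
    char line[256];
    while (fgets(line, sizeof line, f) != NULL)
        fputs(line, stdout);   /* lines look like "start-end : resource name" */
    fclose(f);
    return 0;
}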
Memory addressing
Memory addressing depends on the hardware. 80x86 microprocessors distinguish three types of addresses:
– Logical address – used at the level of machine language instructions. It is related to the segmentation available in this architecture. Each logical address consists of a segment number and a segment offset.
– Linear (virtual) address – a 32-bit number that allows addressing up to 4 GB.
– Physical address – the address recognized by the memory module. Physical addresses take the form of a 32-bit or 36-bit number.

Memory address translation in x86 CPUs with paging enabled (source: Duarte, Software Illustrated)
Address translation and MMU

The translation of addresses is done by the hardware Memory Management Unit (MMU).
The translation of logical addresses into physical addresses depends on the processor mode.
o In real mode, the CPU can address only 1 MB of physical RAM.
o When the CPU works in 32-bit protected mode, it can physically address only 4 GB of memory (unless it uses the so-called Physical Address Extension). Because the upper 1 GB is mapped to devices connected to the motherboard, only 3 GB of RAM remain effectively usable.
o When the CPU works in 64-bit protected mode, it can physically address 64 GB (hardly any motherboard delivers enough RAM). In this mode, you can use physical addresses above the RAM available in the system to reach RAM areas corresponding to the physical addresses stolen by devices connected to the motherboard.
Segmentation

When paging is turned off, the output from the segmentation unit is already a physical address; in 16-bit real mode that is always the case.
The original 8086 (1978) had 16-bit registers. This allowed code to work with 2^16 bytes, or 64 KB, of memory. To increase the size of the available address space without increasing the size of registers and instructions, segment registers were introduced to allow switching between different 64 KB blocks (segments).
There were four segment registers: one for the stack (ss), one for program code (cs), and two for data (ds, es). There are also two additional segment registers: fs and gs.
Nowadays segmentation is still present and is always enabled in x86 processors. Each instruction that touches memory implicitly uses a segment register. Segment registers store 16-bit segment selectors.
https://en.wikipedia.org/wiki/X86
16-bit real mode

Though segmentation is always on, it works differently in real mode versus protected mode.
In real mode, such as during early boot, the segment selector is a 16-bit number specifying the physical memory address for the start of a segment. This number must be scaled: Intel made the decision to multiply the segment selector by 2^4 (i.e., 16), which limits memory to about 1 MB and complicates translation.
The example shows a jump instruction where the cs register contains 0x1000.
Real mode segmentation (source: Duarte, Software Illustrated)

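As a sketch of the arithmetic, real-mode translation is just segment * 16 + offset; the values below mirror the slide's example, plus the highest reachable address:

/* Real-mode address translation: physical = selector * 16 + offset. */
#include <stdint.h>
#include <stdio.h>

static uint32_t real_mode_phys(uint16_t seg, uint16_t off) {
    return ((uint32_t)seg << 4) + off;    /* << 4 is the multiply-by-16 */
}

int main(void) {
    /* cs = 0x1000: a jump to offset 0 lands at physical 0x10000. */
    printf("%#x\n", real_mode_phys(0x1000, 0x0000));   /* 0x10000  */
    /* Highest reachable address: just past 1 MB. */
    printf("%#x\n", real_mode_phys(0xffff, 0xffff));   /* 0x10ffef */
    return 0;
}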
32-bit protected mode
https://en.wikipedia.org/wiki/I386
In 32-bit protected mode (1985), a segment selector is an index into a table of 8-byte segment descriptors. Segment descriptors are stored in two tables: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT).
The TI bit in a segment selector is 0 for the GDT and 1 for the LDT, while the index specifies the desired segment within the table. Each CPU (or core) contains a register called gdtr, which stores the linear memory address of the first byte in the GDT.
To choose a segment, a segment selector has to be loaded into a segment register.
The base address in a segment descriptor is a 32-bit linear address pointing to the beginning of the segment. The limit specifies how big the segment is. Adding the base address to a logical memory address yields a linear address.
The jump instruction in 32-bit protected mode: cs contains the code segment selector.
Segmentation in protected mode (source: Duarte, Software Illustrated)
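A segment selector packs three fields into 16 bits; the sketch below decodes them (the selector value is an illustrative example):

/* Decoding the fields of a 16-bit segment selector. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t sel   = 0x0010;          /* illustrative example value */
    unsigned index = sel >> 3;        /* bits 15-3: entry in the descriptor table */
    unsigned ti    = (sel >> 2) & 1;  /* bit 2 (TI): 0 = GDT, 1 = LDT */
    unsigned rpl   = sel & 3;         /* bits 1-0: requested privilege level */
    printf("index=%u table=%s rpl=%u\n", index, ti ? "LDT" : "GDT", rpl);
    return 0;
}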
There is one copy of the GDT for each processor. All GDT and LDT copies have identical content except for a few items. Each processor has its own TSS segment (Task State Segment), and several items depend on the process currently executing on a given processor (the LDT and TLS segment descriptors).
The TSS is used to save the contents of the processor registers. These segments are arranged one after another in the init_tss array. Each TSS has 236 bytes and cannot be accessed in user mode.
TLS (Thread Local Storage) allows threads within one program to use up to three segments containing per-thread local data. The system calls set_thread_area() and get_thread_area() are used to create and release a TLS segment for a running process.
Global Descriptor Table (GDT) (source: Bovet, Cesati, Understanding the Linux Kernel)
Most user mode applications do not make use of the LDT, so the kernel defines a default LDT to be shared by most processes. In some cases, however, processes may need to set up their own LDT (e.g. applications, like Wine, that execute segment-oriented MS Windows applications).
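As a hedged sketch of that interface: on x86 Linux, set_thread_area(2) takes a struct user_desc describing the segment, and entry_number = -1 asks the kernel to pick a free TLS slot in the GDT. The buffer and its size are illustrative; since the base_addr field is 32-bit, this direct use fits x86-32.

/* Installing a TLS segment descriptor via set_thread_area(2). */
#include <asm/ldt.h>          /* struct user_desc */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static char tls_block[4096];  /* per-thread data area (example) */

int main(void) {
    struct user_desc desc;
    memset(&desc, 0, sizeof desc);
    desc.entry_number = -1;                 /* "pick a free entry for me" */
    desc.base_addr    = (uintptr_t)tls_block;
    desc.limit        = sizeof tls_block - 1;
    desc.seg_32bit    = 1;
    desc.useable      = 1;

    if (syscall(SYS_set_thread_area, &desc) != 0) {
        perror("set_thread_area");
        return 1;
    }
    /* The kernel wrote the chosen GDT index back into the struct. */
    printf("TLS segment installed at GDT entry %u\n", desc.entry_number);
    return 0;
}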
Basic flat model

When the CPU is in 32-bit mode, registers and instructions can address the entire linear address space, so it is enough to set the base address to 0 and treat the logical address as a linear address. Intel calls this the basic flat model.
The basic flat model is equivalent to disabling segmentation as far as address translation is concerned.
All processes executed in user mode use the same pair of segments: the user code segment and the user data segment. Similarly, all kernel mode processes use the same pair of segments: the kernel code segment and the kernel data segment.
All of these segments have a base address set to 0 and a limit of 2^32-1. This means that all processes, both in user mode and in kernel mode, can use the same logical addresses.
64-bit flat linear address space

In Linux, only 3 segment descriptors are used during boot. Two of the segments are flat, addressing the entire 32-bit space (a code segment and a data segment); the third is a system segment, the TSS (Task State Segment).
After boot, each CPU has its own copy of the GDT:
– the layout of the GDT is specified in arch/x86/include/asm/segment.h,
– and its instantiation in arch/x86/kernel/cpu/common.c.
Since coinciding logical and linear addresses are simpler to handle, they became the standard, and 64-bit mode enforces a flat linear address space (2005).
Except in unusual cases, segmentation does not change the resulting physical address in 64-bit mode (segmentation is just used to store traits like the current privilege level, and to enforce features like SMEP – Supervisor Mode Execution Prevention).
One well-known "unusual case" is the implementation of Thread Local Storage by most compilers on x86, which uses the fs and gs segments to define per logical processor offsets into the address space. Other segments cannot have non-zero bases, and therefore cannot shift addresses through segmentation.
At what speeds do different parts of the computer work?
What are the
– speed (latency),
– throughput
of various subsystems in a commodity PC (Intel Core 2 Duo at 3.0 GHz)?
Time units are:
– nanoseconds (ns, 10^-9 s),
– milliseconds (ms, 10^-3 s),
– seconds (s).
Throughput units are in megabytes and gigabytes per second.

Processor and northbridge chip (source: Duarte, Software Illustrated)
https://en.wikipedia.org/wiki/X86
Hard disks also have caches. They are small (a 16 MB cache is only 0.002% of a 750 GB disk), but thanks to them the disk can queue up writes and then perform them in one batch. Reads can also be grouped in this way for performance, and both the OS and the drive firmware engage in these optimizations.
A standard SSD can read sequential data at about 550 megabytes per second (MBps) and write it at 520 MBps. A fast HDD may carry out sequential reads and writes at just 125 MBps.
USB 3.0 can transfer data at up to 5 Gbit/s (625 MB/s), which is about 10 times faster than the USB 2.0 standard.
Wired Ethernet – 2.5 Gbit/s (soon 10 Gbit/s).
Wireless 802.11ac – 200 Mbit/s.
The latency to a fast website is about 45 ms, comparable to hard drive seek latency.
The vision of "the Internet as a computer" is real!
Southbridge chip
But ...
Be aware: shouting in the datacenter is not recommended. Vibration can badly influence disk latency (Brendan Gregg).
https://www.youtube.com/watch?v=tDacjrSCeq4
https://www.youtube.com/watch?v=lMPozJFC8g0
Why do we need the processor cache?

Modern processors use a cache to store recently used memory cells. It allows them to overcome the difference between processor speed and memory access time (the processor is much faster, and in addition there may be several processors competing for access to the same memory).
Flavors of Memory supported by Linux, their use and benefit, Christopher Lameter (source: presentation at Open Source Summit, 2018)
AMD Ryzen 5000, 2020

https://www.youtube.com/watch?v=J6jkrDlgflo
IBM POWER10, 2020

https://www.youtube.com/watch?v=J6jkrDlgflo
How does the processor cache work?
The unit of data in the cache is the line (block), which is a contiguous chunk of bytes in memory. This cache uses 64-byte lines.
The lines are stored in cache banks or cache ways, and each bank has a dedicated directory to store its metadata.
This particular cache has 64 sets and 8 ways, hence 512 cells to store cache lines, which adds up to 32 KB of space.
Processor cache L1 of the Intel Core 2 processor (source: Duarte, Software Illustrated)
How does the processor cache work?
Each cached line is tagged by its corresponding directory cell; the tag is the number of the page the line came from. The processor can address 64 GB of physical RAM, so there are 64 GB / 4 KB = 2^24 such pages, and thus we need 24 bits for the tag.
Search for a matching tag in the set (source: Duarte, Software Illustrated)
How does the processor cache work?
You can imagine each bank and its directory as columns in a spreadsheet, in which case the rows are the sets. Each cell in a way column contains a cache line.
Physical memory is divided into 4 KB physical pages. Each page contains 4 KB / 64 bytes = 64 cache lines. Bytes 0 through 63 within a page fall in the first cache line, bytes 64-127 in the second, and so on. The pattern repeats for each page.
Basic types of cache organization:
• fully associative cache – any memory line can be stored in any of the cache cells. This makes storage flexible, but it becomes expensive to search for cells when accessing them.
• direct-mapped cache (one-way set associative) – a given memory line can only be stored in one specific set (or row).
• N-way set-associative cache – each memory line may be stored in one of N cache lines. In an 8-way set-associative cache, each row has 8 cells available to store the cache lines it is associated with. Bits 11-6 determine the line number within the 4 KB page and therefore the set to be used (see the sketch below).
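A sketch of how such a cache slices an address, using the geometry described above (64-byte lines, 64 sets, 4 KB pages); the address value is arbitrary:

/* Decomposing a physical address into cache offset, set index, and tag. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t paddr  = 0x12345678;          /* example physical address */
    unsigned offset = paddr & 0x3F;        /* bits 5-0: byte within the line */
    unsigned set    = (paddr >> 6) & 0x3F; /* bits 11-6: set index (64 sets) */
    uint64_t tag    = paddr >> 12;         /* bits above 11: physical page number */
    printf("offset=%u set=%u tag=%#llx\n",
           offset, set, (unsigned long long)tag);
    return 0;
}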

Well described in Wikipedia.


How does the processor cache work?
The tag matching is very fast; electrically all tags are compared simultaneously.
If a set fills up, then a cache line must be evicted before another one can be stored.
To avoid this, performance-sensitive programs try to organize their data so that memory accesses are
evenly spread among cache lines.
A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the paging unit to
obtain the physical page address used for the cache tags.
By contrast, the set index comes from the least significant bits of the linear address and is used without
translation.
Hence the L1 cache is physically tagged, but virtually indexed, helping the CPU to parallelize lookup
operations.
Because the L1 bank is never bigger than an MMU page, a given physical memory location is guaranteed
to be associated with the same set even with virtual indexing.
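The key fact enabling virtual indexing is that, with 4 KB pages, the low 12 bits of an address are not changed by translation, so the set index (bits 11-6) is already known before the TLB answers. A tiny illustration (both addresses below are hypothetical):

/* The page offset survives translation; the set index lives inside it. */
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t vaddr = 0x00007f61a2c45b10;  /* virtual address */
    uint64_t paddr = 0x000000012f045b10;  /* its (made-up) translation */
    assert((vaddr & 0xFFF) == (paddr & 0xFFF)); /* page offset preserved */
    unsigned set = (vaddr >> 6) & 0x3F;   /* computable before translation ends */
    (void)set;
    return 0;
}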
L2 caches must be physically tagged and physically indexed because their bank size can be bigger than
MMU pages.
Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2 cache.
TLBs and Caches, David Black-Schaffer
Virtual vs physical memory addressing
• x86 – caches are physically indexed and physically tagged (except for small L1 caches); the virtual address associated with the memory map is filtered by the MMU before the real access to the memory hierarchy is performed.
• ARMv4 and ARMv5 – the cache is organized as virtually indexed, virtually tagged; cache lookups are faster (the TLB is not involved in matching cache lines for a virtual address), but the same physical address can be mapped to multiple virtual addresses.
• MIPS R4x00 – virtually indexed, physically tagged on-chip L1 cache.
In typical architectures, optimal performance is achieved by having the L1 cache and the TLB racing to provide their outputs for subsequent use.
A Primer on Modern Hardware Architectures, Alessandro Pellegrini


Cache coherence
In multiprocessor systems, each processor has its own cache memory. Additional hardware is needed to synchronize the contents of these memories (snooping cache).
The directory cell also stores the state of its corresponding cached line. A line in the L1 code cache is either Invalid or Shared (which means valid). There is a whole family of cache coherence protocols (MSI, MESI, MOESI).
Extra reading: Memory part 2: CPU Caches (Ulrich Drepper, October 2007); Hardware insights, Francesco Quaglia.
The false cache sharing problem (illustrated in the sketch below)
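A minimal C11 sketch of false sharing (64-byte cache lines assumed): in struct bad, counters updated by two different threads share one cache line, so every write invalidates the other core's copy and the line ping-pongs between caches; the alignment in struct good gives each counter its own line.

#include <stdalign.h>

struct bad {
    long a;               /* incremented by thread 1 ...                 */
    long b;               /* ... and this by thread 2: same cache line   */
};

struct good {
    alignas(64) long a;   /* each counter gets its own 64-byte line */
    alignas(64) long b;
};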

Architectures
Symmetric multiprocessors

Hyper-Threading Technology enables a single physical processor to execute two or more threads concurrently using shared execution resources. The OS will recognize each physical core as 2 virtual cores. Intel has shipped it since 2002.
(Figures: Chip Multi Processor (CMP) – multicore; Simultaneous Multi-Threading (SMT) – hyperthreading.)
Hardware insights, Francesco Quaglia


UMA vs NUMA
In Symmetric Multiprocessing (SMP) systems, a single memory controller is shared among all CPUs (Uniform Memory Access, UMA). All of the processors have equal access to the memory and I/O in the system. As more processors were added to the system, the processor bus became a limitation to overall system performance.
To scale further, Non-Uniform Memory Architectures (NUMA) implement multiple buses and memory controllers. The interconnect between the systems introduces latency for memory accesses across nodes.
Understanding Non-Uniform Memory Access/Architectures (NUMA)
AMD Hyper-Transport (HT)
Where the older SMP architecture had a separate memory controller, newer systems have a memory controller integrated into the processor itself, and each processor has its own memory bank. The first processors to introduce an integrated memory controller were the AMD Opteron series in early 2003. AMD processors share memory access through Hyper-Transport (HT) links between the processors.
Intel Quick-Path Interconnect (QPI)
The Intel QPI interconnects the processors with each other in a similar manner to the AMD HT interconnect. However, the QPI snoop-based cache forwarding implementation can return remote data in as little as 2 hops.
Understanding Non-Uniform Memory Access/Architectures (NUMA)


Non-Uniform Memory Access
Each CPU has its own local memory, which is accessed faster.
• Shared memory is the union of the local memories.
• The latency of access to remote memory depends on the ‘distance’.
NUMA organization with 4 AMD Opteron 6128 processors (2010)
A Primer on Modern Hardware Architectures, Alessandro Pellegrini


NUMA latency asymmetries
• A processor (made of multiple cores) and the memory local to it form a NUMA node.
• There are commodity systems which are not fully meshed: remote nodes can only be accessed with multiple hops.
• On commodity systems, one hop has been shown to produce a performance degradation of as much as 100%, and it can be even higher with increased load on the interconnect.
Linux is NUMA-aware – support started in 2004 (Linux 2.6). There are two optimization areas:
– thread scheduling,
– memory allocation (see the sketch below).
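For memory allocation, user programs can steer placement explicitly with libnuma; a minimal sketch (assumes libnuma is installed, link with -lnuma; node 0 and the allocation size are illustrative):

/* Pin a thread to a node and allocate node-local memory. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int node = 0;
    numa_run_on_node(node);                   /* schedule this thread on node 0 */
    size_t sz = 1 << 20;
    void *buf = numa_alloc_onnode(sz, node);  /* memory physically on node 0 */
    if (buf == NULL) return 1;
    /* ... accesses to buf from node 0 now pay local, not remote, latency ... */
    numa_free(buf, sz);
    return 0;
}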

Hardware insights, Francesco Quaglia; A Primer on Modern Hardware Architectures, Alessandro Pellegrini
Additional reading

• Intel 64 and IA-32 Architectures Software Developer Manuals


• What Every Programmer Should Know About Memory (Ulrich Drepper, Red Hat)
• Thread Local Storage (TLS) descriptors in GDT
