HPC 00 HW Basics

Fakultät Informatik, Institut für Technische Informatik, Professur Rechnerarchitektur

Dr. Robert Schöne


APB 1029 WIL A104 https://tu-dresden.de/zih/die-einrichtung/struktur/robert-schoene
☎ +49 351 463 42483 ☎ +49 351 463 35450 [email protected]

High Performance Computers and Their Programming (Preparation)
Here's what you know

An example

[Figure: the same short program, shown once in C and once in Python]

Your task:
What is printed at the command line?

How did you get this result?
— Read the instructions
— Calculate

Congratulations, you can calculate like a computer!
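
The code images from this slide did not survive. Judging from the instruction list on the later slides (sum=0, i=0, sum+=i, i++, i<10?), the program was presumably a counting loop like the sketch below. The function name `sum_below` and the assumption that the program prints the sum are illustrative guesses, not the original code.

```python
# Assumed reconstruction of the slide's example, not the original code.
def sum_below(limit):
    """Add up 0, 1, ..., limit - 1 with an explicit counting loop."""
    total = 0          # sum=0
    i = 0              # i=0
    while i < limit:   # i<10?
        total += i     # sum+=i
        i += 1         # i++
    return total


if __name__ == "__main__":
    print(sum_below(10))  # prints 45
```

Working through it by hand means adding 0 + 1 + ... + 9, which is the "read the instructions, calculate" procedure the slide refers to.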

High Performance Computers and Their Programming


Robert Schöne
Slide 3
You wish!

— That took way too long
— You should have done that in < 0.01 µs
— But you can listen to me, correctly interpret the images that contain the code, and then execute the task
— So you're fine, just not optimal for numerical calculations

Let's look at what you did, and how a computer would do it

You
— Given:
  - List of ordered statements
— Interpretation:
  - Read statement
  - Understand statement
  - Execute statement
  - Read next statement
  - Understand next statement
  - Execute next statement
  - …

Computer
— Given:
  - List of ordered instructions
— Interpretation:
  - Read instruction
  - Decode instruction
  - Execute operation
  - Read data
  - Calculate
  - Write result
  - Read next instruction
  - …

Things needed

Computer
— Given:
  - List of ordered instructions
— Interpretation:
  - Read instruction 0
  - Decode instruction 0
  - Execute operation 0
  - Read data
  - Calculate
  - Write result
  - Read instruction 1
  - …

[Figure: a computer made of memory and a processor, connected by a bus. Memory stores the instructions (sum=0, i=0, sum+=i, i++, i<10?) and the data (sum, i). The processor contains instruction fetch, a decoder, the execution units (ALU, mul, floating point, load, store), and the Program Counter (PC) / Instruction Pointer (IP), which says which instruction to execute; here PC = 0.]

Things needed

— Read instruction 0
  - The instruction sum=0 is copied from memory into the processor and can now be decoded

[Figure: the computer diagram from before; instruction fetch now holds sum=0, PC = 0.]

Things needed

— Decode instruction 0
  - Decode → select the execution unit, prepare execution

[Figure: the computer diagram from before; the decoder works on sum=0, PC = 0.]

Things needed

— Execute operation 0
  - Execute the instruction: the memory location of sum now holds 0

[Figure: the computer diagram from before; memory shows 0 in place of sum, PC = 0.]

Things needed

— Read instruction 1
  - Increment the program counter → next instruction

[Figure: the computer diagram from before; memory shows 0 in place of sum, PC = 1.]

Things needed

[Figure: the computer diagram from before with its annotations (instructions stored in memory; Program Counter (PC) / Instruction Pointer (IP): which instruction to execute); PC = 1, so instruction 1 is read next.]
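
The read, decode, execute, increment cycle walked through on these slides can be sketched as a small interpreter. Everything about the encoding is an assumption made for illustration: the tuple format, the opcode names (`set`, `add`, `inc`, `jump_if_less`), and the choice to model `i<10?` as a jump back to instruction 2 are hypothetical and do not correspond to a real instruction set.

```python
# Hypothetical mini-machine: the tuple encoding and opcode names are made up
# for illustration and do not correspond to any real instruction set.
PROGRAM = [
    ("set", "sum", 0),             # instruction 0: sum=0
    ("set", "i", 0),               # instruction 1: i=0
    ("add", "sum", "i"),           # instruction 2: sum+=i
    ("inc", "i"),                  # instruction 3: i++
    ("jump_if_less", "i", 10, 2),  # instruction 4: i<10? then continue at 2
]


def run(program):
    """Execute the program and return (memory contents, instructions executed)."""
    memory = {"sum": None, "i": None}  # data stored in memory
    pc = 0                             # program counter: which instruction to execute
    executed = 0
    while pc < len(program):
        instruction = program[pc]          # read the instruction from memory
        opcode, *operands = instruction    # decode: which operation, which operands
        pc += 1                            # increment: point to the next instruction
        if opcode == "set":
            name, value = operands
            memory[name] = value           # write result
        elif opcode == "add":
            target, source = operands
            # read data, calculate, write result
            memory[target] = memory[target] + memory[source]
        elif opcode == "inc":
            (name,) = operands
            memory[name] += 1
        elif opcode == "jump_if_less":
            name, limit, target_pc = operands
            if memory[name] < limit:
                pc = target_pc             # overwrite the PC to repeat the loop
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        executed += 1
    return memory, executed


final_memory, instruction_count = run(PROGRAM)
```

In this sketch the program counter is incremented right after decoding, whereas the slides show the increment after execution; the visible result is the same because only the jump overwrites the counter.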

Things needed

Hardware
— Memory
  - Stores instructions
  - Stores data
— Bus
— Processor:
  - Program Counter
  - Execution Units
  - Registers (not depicted before)
— I/O bus (not depicted before)
  - For input/output devices like hard drives

Common Language / Instruction Set Architecture (ISA)
— Software uses this language
— Hardware understands this language
— Defines possible instructions
— Defines available registers
— Defines how to access memory
— Defines how to do I/O
— Examples: x86, ARM, RISC-V, RDNA

Things not depicted before: Registers

— Today, mostly general-purpose registers (integers, memory addresses)
— Usage depends on the ISA, example:
    LOAD r1, (r2)    # load from the address stored in register 2 into register 1
    ADD  r3, r3, r1  # add to register 3
    ADD  r2, $8      # increase r2 by 8
    CMP  r2, r4      # compare with r4
    B.LT $-20        # if <, go back 20 bytes (5 instructions)
— Special registers for non-integers
— Program Counter can also be a register

Terms you might hear
— Load/store architecture:
  - An ISA where data cannot be used directly from memory, but has to be explicitly transferred from/to registers
— 3-operand format
  - Instructions with three operands, mostly: $r_{\text{target}} = r_{\text{source1}} \ \text{op} \ r_{\text{source2}}$
— 2-operand format
  - Instructions with two operands, mostly: $r_{\text{target/source1}} = r_{\text{target/source1}} \ \text{op} \ r_{\text{source2}}$
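
To see what the five-instruction register example on this slide computes, here is a minimal interpreter sketch. Several things are assumptions not stated on the slide: the tuple encoding, 4-byte instructions (so that 20 bytes equal 5 instructions), the branch offset being relative to the address of the following instruction, and the function name `run_loop`.

```python
# Hypothetical interpreter for the five-instruction example; the encoding,
# the 4-byte instruction size, and the branch base address are assumptions.
INSTRUCTION_BYTES = 4

PROGRAM = [
    ("LOAD", "r1", "r2"),       # r1 = memory[r2]
    ("ADD", "r3", "r3", "r1"),  # r3 = r3 + r1   (3-operand format)
    ("ADD", "r2", 8),           # r2 = r2 + 8    (2-operand format, immediate)
    ("CMP", "r2", "r4"),        # remember whether r2 < r4
    ("B.LT", -20),              # if less: go back 20 bytes
]


def run_loop(memory, start, end):
    """Sum the values stored at addresses start, start + 8, ... below end.

    Like a do-while loop, the body runs at least once, so memory[start]
    must exist.
    """
    regs = {"r1": 0, "r2": start, "r3": 0, "r4": end}
    pc = 0         # byte address of the current instruction
    less = False   # flag written by CMP, read by B.LT
    while 0 <= pc < INSTRUCTION_BYTES * len(PROGRAM):
        op, *args = PROGRAM[pc // INSTRUCTION_BYTES]
        next_pc = pc + INSTRUCTION_BYTES
        if op == "LOAD":
            target, address_reg = args
            regs[target] = memory[regs[address_reg]]
        elif op == "ADD" and len(args) == 3:
            target, left, right = args
            regs[target] = regs[left] + regs[right]
        elif op == "ADD":
            target, immediate = args
            regs[target] += immediate
        elif op == "CMP":
            left, right = args
            less = regs[left] < regs[right]
        elif op == "B.LT":
            (offset,) = args
            if less:
                next_pc += offset  # 20 - 20 = 0: back to the LOAD
        pc = next_pc
    return regs["r3"]


data = {0: 1, 8: 2, 16: 3, 24: 4}   # four 8-byte values at consecutive addresses
total = run_loop(data, start=0, end=32)
```

Under these assumptions the loop walks through memory in steps of 8 bytes and accumulates the loaded values in r3, which is the load/store pattern described on the slide: memory is only touched by LOAD, all arithmetic works on registers.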

Things not depicted before: I/O devices and bus(es)

I/O devices
— Includes all sorts of long-term storage
  - Hard drive
  - Flash drive
  - …
— Includes all devices for user interaction
  - Keyboard
  - Mouse
  - Digitizer
  - …
— Includes all devices to connect to the outside world
  - (W)LAN card, Bluetooth chip, …

Buses
— Includes all buses within the computer
  - SATA (Serial ATA)
  - PCIe (PCI Express)
  - USB (Universal Serial Bus)

Hardware Optimizations

Basic Assumption

— Five steps of processing an instruction
— Instruction Fetch
  - Fetch instruction from memory to processor
— Instruction Decode
  - Decode instruction and select execution unit for the underlying operation, e.g. ALU
— Execute
  - Execute the operation on the execution unit
— Memory (optional)
  - Access memory
  - Why after "Execute"? Execution can calculate a memory address which would be used here
— Writeback (optional)
  - Write result back to registers

How long would the example from before take?

— The loop can be realized with two instructions on some platforms
— Add + conditional jump (one of these can increase i implicitly)
— With 10 iterations, this sums up to 20 instructions
— Each of these has 5 steps
— That makes 100 cycles (= 50 ns on a 2 GHz processor)
— But we said "You should have done that in < 0.01 µs" (= 10 ns), which is less than 50 ns
— How do we get there?
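
The arithmetic on this slide can be checked with a few lines. The function name `execution_time_ns` is made up for this sketch, and it assumes what the slide assumes: every instruction needs all five steps, one step per cycle, with no overlap between instructions.

```python
def execution_time_ns(instructions, steps_per_instruction, clock_hz):
    """Return (cycles, nanoseconds) when every step takes one clock cycle."""
    cycles = instructions * steps_per_instruction
    return cycles, cycles * 1e9 / clock_hz


# 10 loop iterations x 2 instructions, 5 steps each, 2 GHz clock
cycles, time_ns = execution_time_ns(
    instructions=10 * 2, steps_per_instruction=5, clock_hz=2e9
)
target_ns = 0.01 * 1000  # 0.01 µs expressed in nanoseconds
```

With these numbers `cycles` is 100 and `time_ns` is 50, which is five times the 10 ns target.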

Caches

Problem

— Memory access via bus is slow
— How slow?
— > 100 cycles for one access
— Only limited number of registers
  - → have to use memory

[Figure: the computer diagram; the processor (instruction fetch, decoder, execution units ALU, mul, floating point, load, store) accesses memory directly.]

Principle

— Buffer recently used parts of memory
— Buffer memory that is close to recently used parts of memory
— Use fast buffer, which resides in processor
— Accesses to data which is (close to) recently used are fast
— Accesses to non-recently-used data are slow

[Figure: the computer diagram with a cache placed inside the processor, between the execution units and memory.]

Details

— Cache holds multiple lines of data (cache lines)
— Cache line
  - Typically 64 bytes of consecutive memory per line (64 bytes == 8 x 64-bit values / 16 x 32-bit values / ...)
  - In addition to data, the line holds information about the memory address which it holds a copy of
  - Can be clean (1:1 copy of memory)
  - Can be dirty (changes to cache line but not to memory)
— Caches have limited space
  - For each new cache line added to cache, an old cache line must be evicted
  - If dirty, then write back to memory

[Figure: a cache drawn as rows, each consisting of an address tag (addr) followed by int64 data values; a further row is shown as the new line to be added.]
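
The behavior described on this slide (64-byte lines, clean and dirty state, eviction with write-back) can be sketched as follows. The slide does not say which line gets evicted; this sketch assumes least-recently-used (LRU) replacement and a fully associative cache, and the class name `TinyCache` and its counters are hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line, as on the slide


class TinyCache:
    """Write-back cache sketch; LRU eviction is an assumption, not from the slide."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory        # dict: line address -> bytearray of LINE_SIZE
        self.lines = OrderedDict()  # line address -> [data, dirty flag]
        self.hits = self.misses = self.writebacks = 0

    def _line(self, address):
        line_addr = address - address % LINE_SIZE  # start address of the line
        if line_addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(line_addr)      # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.num_lines:  # cache full: evict oldest line
                old_addr, (old_data, dirty) = self.lines.popitem(last=False)
                if dirty:                          # dirty: write back to memory
                    self.memory[old_addr] = bytearray(old_data)
                    self.writebacks += 1
            data = bytearray(self.memory.get(line_addr, bytes(LINE_SIZE)))
            self.lines[line_addr] = [data, False]  # fresh copy is clean
        return self.lines[line_addr]

    def read(self, address):
        """Return the byte stored at the given address."""
        return self._line(address)[0][address % LINE_SIZE]

    def write(self, address, value):
        """Store one byte (0..255); the line becomes dirty."""
        entry = self._line(address)
        entry[0][address % LINE_SIZE] = value
        entry[1] = True


memory = {}
cache = TinyCache(num_lines=2, memory=memory)
cache.write(0, 7)   # miss: line 0 is loaded and becomes dirty
cache.read(8)       # hit: address 8 lies in the same 64-byte line
cache.read(64)      # miss: second line
cache.read(128)     # miss: cache is full, dirty line 0 is written back
```

After these four accesses the modified byte has reached `memory` only because line 0 was evicted; until then the change existed in the cache alone, which is exactly what "dirty" means.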

Details

— Multiple cache levels (L1, L2, L3)
  - Increasing size
  - Increasing latency (how long until data is read)
— Example: 2 caches, read 64 bits from address
  - 1. Check L1 cache
    - If present, select line and return the corresponding 64 bits
  - 2. Check L2 cache
    - If present, select line and return the corresponding 64 bits
  - 3. Read from memory

[Figure: the computer diagram with an L1 and an L2 cache between the processor's execution units and memory.]

Speedup

— Depends on the latency and bandwidth of the cache
— Typical latencies:
  - L1 cache: 2-4 cycles
  - L2 cache: ~10 cycles
  - L3 cache: ~40 cycles
  - RAM: ~250 cycles (very rough estimate)
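
To get a feeling for what these latencies mean, the sketch below computes an average access latency. The latencies come from the slide (with L1 taken as 4 cycles); the fractions of accesses served by each level are invented for illustration, and the function name `average_latency` is hypothetical.

```python
# Latencies in cycles from the slide (L1 taken as 4); the fractions used
# below are invented for illustration.
LATENCY_CYCLES = {"L1": 4, "L2": 10, "L3": 40, "RAM": 250}


def average_latency(fraction_served_by):
    """Average cycles per access, given the share of accesses each level serves."""
    if abs(sum(fraction_served_by.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    return sum(
        LATENCY_CYCLES[level] * share
        for level, share in fraction_served_by.items()
    )


mostly_cached = average_latency({"L1": 0.90, "L2": 0.06, "L3": 0.03, "RAM": 0.01})
uncached = average_latency({"RAM": 1.0})
```

With the assumed shares, the average is about 7.9 cycles per access instead of 250, even though 1% of the accesses still go all the way to RAM.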

Instruction Pipelining

Problem

— A single instruction has multiple steps, a typical example* would be: Instruction Fetch, Instruction Decode, Execute, Memory Access, Writeback to register
— Each of these parts has specialized hardware
— Each piece of this hardware is used only 1/5 of the time and idles for the other 4/5

Cycle 0: Read PC, access memory, read instruction
Cycle 1: Read instruction, select execution unit, pass instruction
Cycle 2: Process instruction
Cycle 3: Access memory
Cycle 4: Write result to registers

*(Actual pipeline is highly processor-dependent)

Idea

— Use these units in parallel
— After one unit finishes, it can start working on the next instruction
— Cycle 0:
  - Fetch instruction 0
— Cycle 1:
  - Decode instruction 0
  - Fetch instruction 1
— Cycle 2:
  - Execute instruction 0
  - Decode instruction 1
  - Fetch instruction 2

[Figure: "Before": instructions run one after another as IF ID EX WB, IF ID EX WB, ..., each "ready" only after all four stages. "After": each instruction starts one cycle after the previous one, so the IF ID EX WB sequences overlap in staggered rows.]
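
The overlap described on this slide can be printed as a small schedule. This is a sketch under ideal conditions: four stages as drawn in the figure, one cycle per stage, and no stalls; the function name `pipeline_schedule` is made up for this example.

```python
STAGES = ["IF", "ID", "EX", "WB"]  # the four stages drawn in the slide's figure


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage active in cycle c, or ''."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(stages):
            row[i + s] = stage  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


for row in pipeline_schedule(3):
    print(" ".join(f"{cell:2}" for cell in row))
```

Each printed row is one instruction and each column one cycle, so reading a column from top to bottom shows which units are busy at the same time.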

New Problem: Hazards

Pipeline stalls (hazards) if:
— Instructions depend on each other (a=b+c; d=2*a;)
— There's a branch/jump instruction (which way to go?)
— Multiple stages use common resources (IF, MEM both use the memory controller)
— …
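
The first kind of hazard, an instruction needing the result of the one directly before it, can be detected with a simple check. This sketch only looks at neighboring instructions and only at this one hazard type; in a real pipeline the critical distance depends on the pipeline design. The representation of instructions as (target, sources) pairs and the function name `dependent_neighbors` are assumptions for illustration.

```python
def dependent_neighbors(instructions):
    """Find instructions that read the result of their direct predecessor.

    instructions: list of (target, sources) pairs.
    Returns a list of (producer index, consumer index) pairs.
    """
    hazards = []
    for i in range(1, len(instructions)):
        previous_target, _ = instructions[i - 1]
        _, sources = instructions[i]
        if previous_target in sources:
            hazards.append((i - 1, i))
    return hazards


program = [
    ("a", ("b", "c")),  # a = b + c
    ("d", ("a",)),      # d = 2 * a   (needs a from the line above)
    ("e", ("b",)),      # e = b + 1   (independent of d)
]
found = dependent_neighbors(program)
```

Here `found` contains the single pair (0, 1): the second instruction cannot enter execution until the first one has produced `a`.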

Speedup

— Depends on pipeline length ($k$)
— First result after $k$ cycles
— Afterwards, one result every cycle
— Probably lower due to hazards
— Speedup for $n$ instructions: $S_k = \frac{T_1}{T_k} = \frac{k \cdot n}{n + (k - 1)}$, with $\lim_{n \to \infty} S_k = k$

[Figure: staggered IF ID EX WB rows of a four-stage pipeline; no result while the pipeline fills, then one result per cycle.]
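
The speedup formula can be evaluated directly. The sketch assumes an ideal pipeline without hazards; the function name `pipeline_speedup` and the chosen values (the 20-instruction loop from earlier with 5 stages) are illustrative.

```python
def pipeline_speedup(n, k):
    """S_k = T_1 / T_k = (k * n) / (n + (k - 1)) for n instructions, k stages."""
    unpipelined_cycles = k * n       # every instruction takes all k steps alone
    pipelined_cycles = n + (k - 1)   # k cycles for the first result, then 1 each
    return unpipelined_cycles / pipelined_cycles


loop_speedup = pipeline_speedup(n=20, k=5)      # the 20-instruction loop
long_run = pipeline_speedup(n=1_000_000, k=5)   # approaches k for large n
```

For the 20-instruction loop this gives 100 / 24, a bit more than 4, so the loop would need 24 cycles (12 ns at 2 GHz) instead of 100. That is still above the 10 ns target, which is why further optimizations are needed.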

Harvard Architecture

Problem and solution

— Instruction Fetch and Memory Access both access memory, for instructions and data, respectively
— Needs high bandwidth, low latencies
— Shared resources: memory controller, buses, memory
— Solution: Harvard Architecture
  - Separate memories for instructions and data
  - → can serve instructions and data in parallel
— Today used in L1 caches, split into L1i and L1d

[Figure: the computer diagram with memory, a shared L2 cache, and separate L1i and L1d caches in front of the processor.]

Superscalarity

Problem

— Most of the time, Execution Units are not used (during floating point addition, ALUs are not used)
— Inefficient use of transistors

[Figure: the computer diagram with the processor's execution units (ALU, mul, floating point, load, store).]

Solution and Speedup

— Instruction Fetch/Decode can fetch/decode multiple instructions
— Pass them to multiple Execution Units, if:
  - there are enough EUs available
  - instructions do not depend on each other
— Speedup depends on:
  - Number of instructions that can be fetched/decoded
  - The number and types of EUs
  - The instruction mix
— 3x superscalar → max. speedup = 3
— Can be combined with pipelining …

[Figure: "Before": one sequence of IF ID EX WB instructions, each "ready" after the other. "After": three such sequences running side by side.]
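
Why the maximum speedup of 3 is only reached for independent instructions can be illustrated with a small issue model. The model is a deliberate simplification and an assumption of this sketch: every instruction takes one cycle, instructions are issued in program order, and the types of execution units are ignored. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(dependencies, width):
    """Cycles to issue all instructions, at most `width` per cycle, in order.

    dependencies[i] is the set of earlier instruction indices whose results
    instruction i needs; a result is usable from the following cycle on.
    """
    done = set()
    next_index = 0
    cycles = 0
    while next_index < len(dependencies):
        issued = []
        while (len(issued) < width
               and next_index < len(dependencies)
               and dependencies[next_index] <= done):
            issued.append(next_index)
            next_index += 1
        if not issued:
            raise ValueError("instruction depends on a later or unknown instruction")
        done.update(issued)   # results become available for the next cycle
        cycles += 1
    return cycles


independent = [set() for _ in range(6)]      # six unrelated instructions
chain = [set(), {0}, {1}, {2}, {3}, {4}]     # each needs the previous result

speedup_independent = cycles_needed(independent, 1) / cycles_needed(independent, 3)
speedup_chain = cycles_needed(chain, 1) / cycles_needed(chain, 3)
```

In this model the independent instructions reach the full speedup of 3, while the dependency chain gains nothing from the extra execution units.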

Out-of-Order Execution, Branch Prediction

Out-of-Order Execution

— Default: still have to wait if a single instruction stalls (e.g., due to memory access)
— What if there were other instructions that could already be executed?
— Decouple IF/ID from EX/WB
— Collect instructions
  - Mark them ready if their data is available
  - If ready, pass them to execution units
— Needs significant amount of processor space and management overhead
— According to ISA: Everything is still in-order!
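
The "collect, mark ready, execute" idea can be compared with in-order execution in a small timing model. The model is an assumption of this sketch: at most one instruction starts per cycle, each instruction has a fixed latency, and the oldest ready instruction is chosen first. The function name `finish_time` and the example latencies are invented for illustration.

```python
def finish_time(instructions, out_of_order):
    """Cycle in which the last result is available.

    instructions: list of (latency in cycles, set of dependency indices).
    In-order: only the oldest waiting instruction may start.
    Out-of-order: the oldest *ready* waiting instruction may start.
    Dependencies must form no cycle (and, in-order, point to earlier indices).
    """
    done_at = {}                          # index -> cycle when result is ready
    waiting = list(range(len(instructions)))
    cycle = 0
    while waiting:
        candidates = waiting if out_of_order else waiting[:1]
        for index in candidates:
            latency, deps = instructions[index]
            ready = all(d in done_at and done_at[d] <= cycle for d in deps)
            if ready:
                done_at[index] = cycle + latency
                waiting.remove(index)
                break                     # at most one instruction starts per cycle
        cycle += 1
    return max(done_at.values(), default=0)


program = [
    (10, set()),  # 0: load x from memory (slow)
    (1, {0}),     # 1: y = x + 1      (has to wait for the load)
    (1, set()),   # 2: a = b + c      (independent of the load)
    (1, {2}),     # 3: d = 2 * a
]
in_order_cycles = finish_time(program, out_of_order=False)
out_of_order_cycles = finish_time(program, out_of_order=True)
```

In this model the in-order run finishes after 13 cycles and the out-of-order run after 11, because instructions 2 and 3 execute while the load is still outstanding. The results are the same in both cases, which matches the slide's point that the ISA still sees everything in order.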

Branch Prediction

— Roughly every 8th instruction is a branch
— Pipelining needs to stop if an if-clause depends on data that is yet to be computed
  - → Branch prediction
— Take notes on whether a branch was taken previously
— Take branch based on previous behavior
— Good for for-loops
— Simple: Do what was done last time
— Complicated: Take note for each branch instruction, also store possible target address
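
The simple strategy, "do what was done last time", can be sketched in a few lines. The starting guess (not taken) is an assumption of this sketch, as are the function name `mispredictions` and the two example branch histories.

```python
def mispredictions(outcomes, initial_prediction=False):
    """Count wrong guesses when always predicting 'same as last time'.

    outcomes: sequence of booleans, True = branch taken.
    initial_prediction: assumed first guess (not taken by default).
    """
    prediction = initial_prediction
    wrong = 0
    for taken in outcomes:
        if taken != prediction:
            wrong += 1
        prediction = taken   # remember what was done this time
    return wrong


loop_branch = [True] * 9 + [False]   # i<10? jumps back 9 times, then falls through
alternating = [True, False] * 5      # worst case for this strategy

loop_misses = mispredictions(loop_branch)
alternating_misses = mispredictions(alternating)
```

For the loop branch only the first and the last guess are wrong (2 of 10), which is why this strategy is good for for-loops. A branch that alternates between taken and not taken is mispredicted every time.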
