
CSIT123 Computing and Cyber Security Fundamentals
Week 4: Computer Architecture Fundamentals (Part 1)
Dr. Huseyin Hisil and Dr. Xueqiao Liu

Initially prepared by Dr. Dung Duong


Reading Task:
● Hennessy, J.L. and Patterson, D.A., 2019. Computer Architecture: A Quantitative Approach, 6th edition. Morgan Kaufmann.
○ Fundamentals of Quantitative Design and Analysis
Introduction to Quantitative Design and Analysis
● Performance Improvements
• Semiconductor technology - feature size, clock speed
• Improvements in computer architectures - enabled by HLL compilers and UNIX; led to RISC
(Reduced Instruction Set Computer) architectures.
• Together these enable - lightweight computers; productivity-based, managed/interpreted
programming languages; SaaS, virtualization, and cloud computing.
• Applications - Speech, sound, images, video, “augmented/extended reality”, “big data”.
Processor Performance

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Introduction
● Current Trends in Architecture
• Cannot continue to leverage instruction-level parallelism (ILP) alone - the era of rapid single-processor
performance improvement ended around 2003
• New models for performance - Data-level parallelism (DLP); Thread-level parallelism (TLP);
Request-level parallelism (RLP)
• These require explicit restructuring of the application
Classes of Computers
● Internet of Things/Embedded Computers
● Personal Mobile Device (PMD)
● Desktop Computing
● Servers
● Clusters / Warehouse Scale Computers
Classes of Computers
Internet of Things (IoT)/Embedded Computers
● Embedded computers are found in everyday machines: microwaves, washing
machines, most printers, networking switches, and all automobiles.
● Internet of Things (IoT) refers to embedded computers that are connected to the
Internet, typically wirelessly
○ When augmented with sensors and actuators, IoT devices collect useful data and
interact with the physical world, leading to a wide variety of “smart” applications, e.g.,
smart watches, smart thermostats, smart speakers, smart cars, smart homes, smart
grids, and smart cities.
● Embedded computers have the widest spread of processing power and cost.
○ Price is a key factor in the design of computers for this space.
Classes of Computers
Personal Mobile Device
● The term applies to a collection of wireless devices with multimedia user interfaces, such as cell
phones, tablet computers, and so on.
○ Cost is a prime concern
● Applications on PMDs are often web-based and media-oriented
● Energy and size requirements lead to use of Flash memory for storage
● Characteristics:
○ Responsiveness and predictability
○ Minimize memory use and maximize energy efficiency
Classes of Computers
Desktop Computing
● Largest market in dollar terms
● Desktop market tends to be driven to optimize price-performance.
○ Price-performance matters to customers and hence to computer designers
○ As a result, both the highest-performance and the cost-reduced microprocessors tend to appear first in this market
Classes of Computers
Servers
● Servers have become the backbone of large-scale enterprise computing, replacing the
traditional mainframe.
● Important characteristics:
○ Availability
○ Scalability
○ Efficiency and cost-effectiveness
Classes of Computers
Clusters/Warehouse-Scale Computers
● Clusters are collections of desktop computers or servers connected by local area
networks to act as a single larger computer.
● WSCs (Warehouse-Scale Computers) are the largest of the clusters
○ tens of thousands of servers can act as one
● Price-performance and power are critical to WSCs
● WSCs are related to servers in that availability is critical
● Difference between WSCs and servers:
○ WSCs use redundant, inexpensive components as the building blocks, relying on a
software layer to catch and isolate the many failures
○ scalability for a WSC is handled by the local area network connecting the computers
and not by integrated computer hardware, as in the case of servers.
● Supercomputers are related to WSCs in that they are equally expensive
○ Supercomputers emphasize floating-point performance
○ WSCs emphasize interactive applications, large-scale storage, dependability, and high
Internet bandwidth.
Classes of Computers
Parallelism at Multiple Levels
● Classes of parallelism in applications
○ Data-Level Parallelism (DLP) - many data items that can be operated on at the
same time.
○ Task-Level Parallelism (TLP) - tasks of work are created that can operate
independently and largely in parallel
● Classes of architectural parallelism
○ Instruction-Level Parallelism (ILP) - exploits data-level parallelism at modest levels with
compiler help using ideas like pipelining and at medium levels using ideas like
speculative execution.
○ Vector architectures/Graphic Processor Units (GPUs) - exploit data-level parallelism by
applying a single instruction to a collection of data in parallel.
○ Thread-Level Parallelism - exploits either data-level parallelism or task-level parallelism
in a tightly coupled hardware model that allows for interaction among parallel threads.
○ Request-Level Parallelism - exploits parallelism among largely decoupled tasks specified
by the programmer or the operating system.
Classes of Computers
Flynn’s Taxonomy - Proposed in the 1960s
● Single instruction stream, single data stream (SISD)
○ Uniprocessor. A sequential computer that can nonetheless exploit instruction-level parallelism, using ILP techniques
such as superscalar and speculative execution.
● Single instruction stream, multiple data streams (SIMD)
○ The same instruction is executed by multiple processors using different data streams. Exploit data-level parallelism
by applying the same operations to multiple items of data in parallel.
○ Each processor has its own data memory (hence the MD of SIMD), but there is a single instruction memory and
control processor.
○ DLP and three different architectures that exploit it: vector architectures, multimedia extensions to standard
instruction sets, and GPUs.
● Multiple instruction streams, single data stream (MISD)
○ No commercial multiprocessor of this type has been built to date.
● Multiple instruction streams, multiple data streams (MIMD)
○ Each processor fetches its own instructions and operates on its own data, targeting task-level parallelism.
○ More flexible than SIMD and thus more generally applicable, but more expensive.
○ Tightly coupled MIMD - exploit thread-level parallelism since multiple cooperating threads operate in parallel;
Loosely coupled MIMD – (clusters and warehouse-scale computers) exploit request-level parallelism with little
communication or synchronization.
Define Computer Architecture
● The Myopic View: only Instruction Set Architecture (ISA)
○ ISA serves as the boundary between the software and hardware
○ Seven dimensions: Class of ISA, Memory addressing, Addressing modes, Types and sizes
of operands, Operations, Control flow instructions, and Encoding an ISA
● The Genuine View: Designing the Organization and Hardware
○ Meet functional requirements as well as price, power, performance, and availability
goals.
○ Covers three aspects: instruction set architecture, microarchitecture (organization), and
hardware
Trends in Technology
● Integrated circuit technology
○ Transistor density increases by 35%/year
○ Die size increases by 10-20%/year
○ Together these lead to growth in transistor count on a chip of 40-55%/year
● DRAM (dynamic random-access memory) capacity
○ Capacity per DRAM chip has increased by 25-40%/year (slowing)
● Flash (electrically erasable programmable read-only memory) capacity
○ Capacity per Flash chip has increased by 50-60%/year
○ In 2019, 8-10X cheaper/bit than DRAM
● Magnetic disk technology
○ Since 2004, density increased by 40%/year
○ 8-10X cheaper/bit than Flash
○ 200-300X cheaper/bit than DRAM
● Network technology: performance depends on
○ performance of switches
○ performance of the transmission system
Performance Trends: Bandwidth over Latency
● Bandwidth or throughput
○ The total amount of work done in a given time, such as megabytes per second for a disk transfer
○ 10,000-25,000X improvement for processors over the first milestone
○ 300-1200X improvement for memory and disks over the first milestone
● Latency or response time
○ The time between the start and completion of an event
○ 30-80X improvement for processors over the first milestone
○ 6-8X improvement for memory and disks over the first milestone
Scaling of Transistor Performance and Wires
● Feature size
○ Minimum size of transistor or wire in x or y dimension
○ Shrank from 10 microns in 1971 to 0.016 microns in 2017
○ Transistor performance scales linearly with decreasing feature size - wire delay does not improve as feature size shrinks
○ Integration density scales quadratically
○ Linear performance and quadratic density growth present both a challenge and an opportunity,
creating the need for the computer architect
Power and Energy in Integrated Circuits
● Power is the biggest challenge
○ Problem: power is brought in and distributed around the chip, and modern
microprocessors use hundreds of pins and multiple interconnect layers just for power
and ground; power is dissipated as heat and must be removed.
○ Three concerns
■ What is the maximum power a processor ever requires?
■ What is the sustained power consumption? - the thermal design power (TDP): Characterizes
sustained power consumption; Used as target for power supply and cooling system; Lower
than peak power, higher than average power consumption
■ Consider energy and energy efficiency – energy (not power) consumption per task better
measures efficiency
Dynamic Energy and Power
● Dynamic energy
○ The dynamic energy required per transistor is proportional to the product of the capacitive load
driven by the transistor and the square of the voltage, i.e., the energy of a pulse of the
logic transition 0→1→0 or 1→0→1:
   Energy_dynamic ∝ Capacitive load × Voltage²
○ The energy of a single transition (0→1 or 1→0):
   Energy_dynamic ∝ ½ × Capacitive load × Voltage²
● Dynamic power
○ The dynamic power required per transistor is the energy of a transition multiplied
by the frequency of transitions:
   Power_dynamic ∝ ½ × Capacitive load × Voltage² × Frequency switched
Dynamic Energy and Power
● Example: Some microprocessors today are designed to have adjustable voltage, so
a 15% reduction in voltage may result in a 15% reduction in frequency. What would
be the impact on dynamic energy and on dynamic power?
● Solution:
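○ A sketch of the calculation, assuming voltage and frequency both scale by a factor of 0.85: dynamic energy is
proportional to Voltage², so Energy_new / Energy_old = 0.85² ≈ 0.72, i.e., energy drops to about 72% of its original
value; dynamic power additionally scales with frequency, so Power_new / Power_old = 0.85³ ≈ 0.61, about 61% of the
original power.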
Dynamic Energy and Power
● Reducing clock frequency/rate reduces power, not energy
○ The first microprocessors consumed less than a watt, and the first 32-bit microprocessors (like the
Intel 80386) used about 2 W, while a 4.0 GHz Intel Core i7-6700K consumes 95 W.
Improve Energy Efficiency
● Techniques
○ Do nothing well - Most microprocessors turn off the clock of inactive modules to save energy
and dynamic power. E.g., if no floating-point instructions are executing, the clock of the
floating-point unit is disabled; if some cores are idle, their clocks are stopped.
○ Dynamic Voltage-Frequency Scaling (DVFS) - Modern microprocessors typically offer a few
clock frequencies and voltages at which to operate that use lower power and energy.
○ Design for the typical case - Because PMDs and laptops are often idle, memory and storage offer
low-power modes to save energy. E.g., DRAMs have a series of increasingly lower-power modes to
extend battery life, and disks can spin at lower rates when idle. These modes cannot be accessed
directly, so the device must return to fully active mode to read or write.
○ Overclocking - Run at a higher clock rate for a short time on some cores until the temperature begins to rise.
For single-threaded code, microprocessors can turn off all cores but one and run it at a higher clock
rate.
Static Power
● Static power
○ Leakage current flows even when a transistor is off:
   Power_static ∝ Current_static × Voltage
○ Static power is proportional to the number of transistors (more devices means more leakage)
○ Power gating reduces static power - Turn off the power supply
○ Race-to-halt strategy - use a faster, less energy-efficient processor to allow the rest of
the system to go into a sleep mode
● New metric to evaluate
○ Old: performance per mm² of silicon
○ New: tasks per joule or performance per watt
Trends in Cost
● The learning curve drives costs down.
○ Measured by change in yield - the percentage of manufactured devices that survives the
testing procedure. Designs that have twice the yield will have half the cost.
● DRAM prices closely track cost
● Microprocessors
○ With significant competition, price closely tracks cost
○ Volume also determines cost
■ Increasing volumes decrease the time needed to move down the learning curve, which is partly
proportional to the number of systems/chips manufactured
■ Increasing volumes decrease cost through improved purchasing and manufacturing efficiency (roughly 10%
less cost for each doubling of volume)
Integrated Circuit
● Integrated circuit costs are a growing and highly variable portion of system cost
○ Although the cost of integrated circuits has dropped exponentially, the basic silicon manufacturing process is
unchanged: a wafer is still tested and chopped into dies that are packaged
○ To predict the number of good chips per wafer, we need to know how many dies fit on a
wafer and how to predict the percentage of those that will work
○ The number of dies per wafer is approximately the area of the wafer divided by the area of the die,
minus a correction for partial dies along the wafer's edge:
   Dies per wafer = [π × (Wafer diameter / 2)²] / Die area − [π × Wafer diameter] / √(2 × Die area)
Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Integrated Circuit
● Example: Find the number of dies per 300 mm (30 cm) wafer for a die
that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
● Answer:
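○ Working, using the dies-per-wafer formula above: for the 1.5 cm die, the die area is 2.25 cm², so
Dies per wafer ≈ [π × 15²] / 2.25 − [π × 30] / √(2 × 2.25) ≈ 314.2 − 44.4 ≈ 270; for the 1.0 cm die, the die area
is 1.00 cm², so Dies per wafer ≈ 706.9 − 66.6 ≈ 640.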
Integrated Circuit
● Problem: what fraction of the dies on a wafer are good (the die yield)?
○ The formula above gives only the maximum number of dies per wafer. Defects are randomly distributed
over the wafer, and yield is inversely proportional to the complexity of the fabrication process.
○ The Bose–Einstein formula, an empirical fit to the yield of many manufacturing lines:
   Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
■ Wafer yield accounts for wafers that are completely bad and so need not be tested; assume the
wafer yield is 100% for simplicity.
■ Defects per unit area measures the random manufacturing defects that occur. In 2017, 0.08
to 0.1 defects per square inch; or 0.012 to 0.016 defects per square cm for a 28nm process,
as it depends on the maturity of the process (learning curve).
■ N: the process-complexity factor, a measure of manufacturing difficulty. For 28 nm
processes in 2017, N ranged from 7.5 to 9.5.
○ The manufacturing process dictates the wafer cost, wafer yield, and defects per unit
area, so the sole control of the designer is die area

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Integrated Circuit
● Example: Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect
density of 0.047 per cm² and N is 12.
● Answer:
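○ Working, using the Bose–Einstein formula above with a wafer yield of 100%: for the 1.5 cm die (area 2.25 cm²),
Die yield = 1 / (1 + 0.047 × 2.25)^12 ≈ 0.30; for the 1.0 cm die (area 1.0 cm²),
Die yield = 1 / (1 + 0.047 × 1.0)^12 ≈ 0.57.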
Dependability
● With respect to a service level agreement (SLA) or service level
objective (SLO), systems alternate between two states of service
○ Service accomplishment, where the service is delivered as specified
○ Service interruption, where the delivered service is different from the SLA
● Transitions caused by failures (1 to 2) or restorations (2 to 1)
○ Module reliability: a measure of continuous service accomplishment (equivalently, the time to failure)
■ Mean time to failure (MTTF); failure rates are also reported as failures in time (FIT), i.e., failures per billion (10⁹) hours of operation
■ An MTTF of 1,000,000 hours equals 10⁹/10⁶ = 1000 FIT
■ Mean time to repair (MTTR)
■ Mean time between failures (MTBF) = MTTF + MTTR
○ Module availability: a measure of service accomplishment with respect to the alternation
between the two states of accomplishment and interruption; for nonredundant systems with repair:
   Module availability = MTTF / (MTTF + MTTR)
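As a small illustration of these definitions, a Python sketch (the MTTF and MTTR values below are made up, not taken from the slides):

    MTTF_HOURS = 1_000_000   # mean time to failure (illustrative)
    MTTR_HOURS = 24          # mean time to repair (illustrative)

    # FIT = failures per billion (10^9) device-hours, so an MTTF of 10^6 hours is 1000 FIT.
    fit = 1e9 / MTTF_HOURS

    mtbf = MTTF_HOURS + MTTR_HOURS                          # mean time between failures
    availability = MTTF_HOURS / (MTTF_HOURS + MTTR_HOURS)

    print(f"FIT rate: {fit:.0f}")                # 1000
    print(f"MTBF: {mtbf} hours")                 # 1000024
    print(f"Availability: {availability:.6f}")   # ~0.999976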

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Dependability
● Example:
Dependability
● Answer:
Performance Measurement
● Two metrics
• Reduce response time: time between start and end of an event (execution time)
• Increase throughput: the total amount of work done in a given time
• Computer X is n times faster than Y:
   n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
• “The throughput of X is 1.3 times higher than Y” signifies here that the number of tasks
completed per unit time on computer X is 1.3 times the number completed on Y.
● Execution time
• Wall-clock time/response time/elapsed time: latency to complete a task (all overheads)
• CPU time: only computation time
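As a small illustration of these definitions, a Python sketch (the execution times and task counts below are made-up values):

    # Made-up execution times for the same task (seconds).
    time_x = 10.0
    time_y = 15.0

    # "X is n times faster than Y" compares execution times (equivalently, performance = 1/time).
    n = time_y / time_x
    print(f"X is {n:.2f} times faster than Y")    # 1.50

    # Throughput compares tasks completed in the same interval.
    tasks_x, tasks_y = 130, 100
    print(f"Throughput of X is {tasks_x / tasks_y:.1f}x that of Y")   # 1.3x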

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Performance Measurement
● Benchmarks
• The best choice of benchmarks to measure performance is real applications.
• Kernels: small, key pieces of real applications
• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort
• Synthetic benchmarks: fake programs invented to try to match the profile and behaviour of
real applications, e.g., Dhrystone (https://en.wikipedia.org/wiki/Dhrystone)
Performance Measurement
● Benchmarks
• Benchmark suites: a popular measure of processor performance across a variety of
applications, e.g., SPEC (Standard Performance Evaluation Corporation)
• SPECRatio: divide the execution time on the reference computer by the execution time on the computer being rated
   SPECRatio = Execution time_reference / Execution time_rated
• Because SPECRatio is a ratio rather than an absolute execution time, the mean over a suite must be computed
using the geometric mean:
   Geometric mean = (SPECRatio_1 × SPECRatio_2 × … × SPECRatio_n)^(1/n)
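As a small illustration, a Python sketch of the SPECRatio and geometric-mean calculation (the execution times below are made up, not actual SPEC results):

    from math import prod

    # Made-up execution times (seconds) on the reference machine and on the machine being rated.
    ref_times   = [500.0, 800.0, 1200.0]
    rated_times = [125.0, 320.0, 400.0]

    # SPECRatio = reference time / rated-machine time, per benchmark.
    spec_ratios = [ref / rated for ref, rated in zip(ref_times, rated_times)]   # [4.0, 2.5, 3.0]

    # Because SPECRatios are ratios, they are summarised with the geometric mean.
    geometric_mean = prod(spec_ratios) ** (1.0 / len(spec_ratios))
    print(f"Geometric mean: {geometric_mean:.2f}")    # (4.0 * 2.5 * 3.0) ** (1/3) ≈ 3.11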

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design
● Take Advantage of Parallelism
• Multiple processors, disks, memory banks, pipelining, multiple functional units
● Principle of Locality
• Reuse of data and instructions
Extra Reading Material
Quantitative Principles of Computer Design
● Focus on the Common Case
• Amdahl’s law:
   Speedup_overall = Execution time_old / Execution time_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
• The speedup gained from an enhancement depends on two factors:
• Fraction_enhanced - the fraction of the computation time in the original computer that can use the
enhancement. E.g., if 20 seconds of the execution time of a program that takes 60 seconds in total can use
an enhancement, Fraction_enhanced = 20/60. This fraction is always less than or equal to 1.
• Speedup_enhanced - how much faster the task would run if the enhanced mode were used for the entire
program, i.e., the time of the original mode over the time of the enhanced mode. E.g., if the enhanced mode
takes 2 seconds for a portion of the program that takes 5 seconds in the original mode,
Speedup_enhanced = 5/2. This speedup is always greater than 1.
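As a small illustration, a Python sketch of Amdahl’s law as stated above, applied to the 20-of-60-seconds example:

    def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
        """Overall speedup when only a fraction of the original execution time is enhanced."""
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # One third of the original time (20 of 60 seconds) runs 5/2 = 2.5 times faster.
    print(amdahl_speedup(20 / 60, 5 / 2))    # ≈ 1.25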

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design

● Focus on the Common Case


○ Example: Suppose that we want to enhance the processor used for web serving. The
new processor is 10 times faster on computation in the web serving application than
the old processor. Assuming that the original processor is busy with computation 40%
of the time and is waiting for I/O 60% of the time, what is the overall speedup gained
by incorporating the enhancement?
○ Answer:
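■ A sketch of the calculation using Amdahl’s law: Fraction_enhanced = 0.4 and Speedup_enhanced = 10, so
Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56.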

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design
● Focus on the Common Case
○ Example: A common transformation required in graphics processors is square root.
Implementations of floating-point (FP) square root vary significantly in performance,
especially among processors designed for graphics. Suppose FP square root (FSQRT) is
responsible for 20% of the execution time of a critical graphics benchmark. One
proposal is to enhance the FSQRT hardware and speed up this operation by a factor of
10. The other alternative is just to try to make all FP instructions in the graphics
processor run faster by a factor of 1.6; FP instructions are responsible for half of the
execution time for the application. The design team believes that they can make all FP
instructions run 1.6 times faster with the same effort as required for the fast square
root. Compare these two design alternatives.
○ Answer: Improving the performance of the FP operations overall is slightly better
because of the higher frequency.
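■ The supporting arithmetic via Amdahl’s law: Speedup_FSQRT = 1 / (0.8 + 0.2/10) = 1 / 0.82 ≈ 1.22, whereas
Speedup_FP = 1 / (0.5 + 0.5/1.6) = 1 / 0.8125 ≈ 1.23.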

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design
● The Processor Performance Equation
• All computers are constructed using a clock running at a constant rate.
• These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles.
Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g.,
1 GHz).
   CPU time = CPU clock cycles for a program × Clock cycle time = CPU clock cycles for a program / Clock rate
• The number of instructions executed is the instruction path length or instruction count (IC). If we know
the number of clock cycles and the instruction count, we can calculate the average number of clock
cycles per instruction (CPI), or its inverse, instructions per clock (IPC):
   CPI = CPU clock cycles for a program / Instruction count
• Clock cycles can therefore be expressed as IC × CPI, which gives
   CPU time = Instruction count × CPI × Clock cycle time
• Expanding the first formula into the units of measurement shows how the pieces fit together:
   (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle) = Seconds / Program = CPU time
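As a small illustration, a Python sketch of the processor performance equation (the instruction count, CPI, and clock rate below are made-up values):

    # Made-up program and machine parameters.
    instruction_count = 2_000_000_000    # instructions executed (IC)
    cpi = 1.5                            # average clock cycles per instruction
    clock_rate_hz = 2.0e9                # 2 GHz clock

    # CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate
    cpu_time_s = instruction_count * cpi / clock_rate_hz
    print(f"CPU time: {cpu_time_s:.2f} s")    # 1.50 s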

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design
● Processor performance and CPU time equally depend on
• clock cycle (or rate)
• clock cycles per instruction
• instruction count
● Basic technologies are interdependent so hard to change one parameter
in complete isolation from others
• Clock cycle time: Hardware technology and organization
• CPI: Organization and instruction set architecture
• Instruction count: Instruction set architecture and compiler technology
Quantitative Principles of Computer Design
● Potential improvement techniques improve one component of the equation with
small or predictable impacts on the other two
• Calculate the total number of processor clock cycles, where IC_i is the number of times instruction i is
executed in a program and CPI_i is the average number of clocks per instruction for instruction i:
   CPU clock cycles = Σ_i (IC_i × CPI_i)
• The overall CPI can then be expressed using each individual CPI_i and the fraction of occurrences of that instruction in a program,
i.e., IC_i ÷ Instruction count:
   CPI = Σ_i (IC_i ÷ Instruction count) × CPI_i
• In practice, use measurements of instruction frequencies and of instruction CPI values, since CPI_i must
include pipeline and memory-system effects
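As a small illustration, a Python sketch of the per-class form of the CPI equation (the instruction mix and per-class CPI values below are made up):

    # Made-up instruction mix: class -> (fraction of instruction count, CPI for that class).
    mix = {
        "ALU":    (0.50, 1.0),
        "load":   (0.20, 2.0),
        "store":  (0.10, 2.0),
        "branch": (0.20, 1.5),
    }

    # Overall CPI = sum over classes of (IC_i / Instruction count) x CPI_i
    overall_cpi = sum(fraction * cpi for fraction, cpi in mix.values())
    print(f"Overall CPI: {overall_cpi:.2f}")    # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*1.5 = 1.40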

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Quantitative Principles of Computer Design
● Example:

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


● Answer

Image Source: https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1


Reference
● Hennessy, J.L. and Patterson, D.A., 2019. Computer Architecture: A Quantitative Approach, 6th edition. Morgan Kaufmann.
