CSC 2111
Computer Organisation and Architecture 2

Unit 4: Intended Learning Outcomes
By the end of this unit, you should be able to:
- Understand the key performance issues that relate to computer design.
- Distinguish among multicore, MIC, and GPGPU organisations.
- Perform basic measures of computer performance.

Designing for Performance

Unit Introduction
Chipmakers have been busy learning how to fabricate chips of greater and greater density. But the raw speed of the microprocessor will not achieve its potential unless it is fed a constant stream of work to perform in the form of computer instructions. Processor designers must therefore come up with ever more elaborate techniques for feeding the processor with instructions. Among the techniques built into contemporary processors are the following:
- Pipelining
- Branch prediction
- Superscalar execution
- Data flow analysis
- Speculative execution

Pipelining Recap
The execution of an instruction involves multiple stages of operation: fetching the instruction, decoding the opcode, fetching operands, performing a calculation, and so on.

Pipelining
A processor works simultaneously on multiple instructions by performing a different phase for each of them at the same time. It overlaps operations by moving data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously. For example, while one instruction is being executed, the computer is decoding the next instruction. This is the same principle as an assembly line.

Assembly Line Example
Consider a water bottle packaging plant. Let there be 3 stages that a bottle must pass through: inserting the bottle (I), filling water in the bottle (F), and sealing the bottle (S). Consider these as stage 1, stage 2 and stage 3 respectively, and let each stage take 1 minute to complete its operation.

Without (left) vs. With (right) Pipelining

Execution in a Pipelined Processor
Consider a processor having 4 stages to execute an instruction, and let there be 2 instructions to be executed.
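The bottling analogy above can be checked with a small calculation. This sketch (not from the slides; it just encodes the 3-stage, 1-minute-per-stage example) compares total time without and with pipelining:

```python
def time_without_pipeline(n_items, n_stages, stage_time):
    # Each item passes through every stage before the next item starts.
    return n_items * n_stages * stage_time

def time_with_pipeline(n_items, n_stages, stage_time):
    # The first item takes n_stages ticks to fill the pipe; after that,
    # one item completes per tick.
    return (n_stages + n_items - 1) * stage_time

# 3 bottling stages (I, F, S), 1 minute each, 100 bottles:
print(time_without_pipeline(100, 3, 1))  # 300 minutes
print(time_with_pipeline(100, 3, 1))     # 102 minutes
```

The gap widens as the number of items grows: in the long run a full pipeline approaches one completed item per stage time.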
Hyper-Threading vs. Pipelining
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. The multiple threads of a given process may be executed concurrently (via multithreading capabilities), sharing resources such as memory, while different processes do not share these resources. Pipelining works on a single thread; hyper-threading works on multiple threads.

Hyper-Threading vs. Multithreading

Branch Prediction
A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence, and thus deviate from its default behavior of executing instructions in order.
A branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong, the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.

Superscalar Execution
This is the ability to issue more than one instruction in every processor clock cycle. In effect, multiple parallel pipelines are used. A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. Each execution unit is not a separate processor (or a core, if the processor is a multicore processor), but an execution resource within a single CPU, such as an arithmetic logic unit.

Superscalar Execution vs. Pipelining
While a superscalar CPU is typically also pipelined, superscalar execution and pipelining are considered different performance enhancement techniques:
- Superscalar execution runs multiple instructions in parallel by using multiple execution units.
- Pipelining runs multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases.
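The branch predictor described earlier can be sketched in software. This example uses a 2-bit saturating counter, a common hardware scheme; the counter design is an illustrative assumption, since the slides describe branch prediction only in general terms:

```python
class TwoBitPredictor:
    """2-bit saturating-counter branch predictor (a common scheme;
    an assumption here, not a design given in the slides).
    States 0-1 predict "not taken"; states 2-3 predict "taken"."""

    def __init__(self):
        self.state = 2  # start weakly predicting "taken"

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Nudge the counter toward the observed outcome,
        # saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken 9 times, then falls through once at loop exit.
predictor = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:
    if predictor.predict() == taken:
        hits += 1
    predictor.update(taken)
print(hits)  # 9 correct predictions out of 10
```

The two-bit counter needs two consecutive mispredictions before it flips its guess, so a single loop exit does not destroy its confidence in a branch that is almost always taken.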
Superscalar Execution vs. Pipeline Execution (diagram)

Data Flow Analysis
The processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions. In fact, instructions are scheduled to be executed when ready, independent of the original program order. This prevents unnecessary delay.

Speculative Execution
Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.

Performance Balance
Why is performance balance needed? Processor power has raced ahead at breakneck speed, while other critical components of the computer have not kept up. The result is a need to adjust the organisation and architecture to compensate for the mismatch among the capabilities of the various components, especially at the interface between the processor and main memory or I/O devices.

Processor-Memory Interface
- Increase the number of bits that are retrieved at one time from DRAMs: make them "wider" rather than "deeper", using wide bus data paths.
- Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory.
- Increase the interconnect bandwidth between processors and memory by using higher-speed buses and a hierarchy of buses to buffer and structure data flow.

Processor-I/O Interface
As computers become faster and more capable, more sophisticated applications are developed that support the use of peripherals with intensive I/O demands. Strategies for moving I/O data between the processor and peripherals include:
- Caching and buffering schemes.
- Use of higher-speed interconnection buses and interconnection structures.
- Use of multiple-processor configurations to satisfy I/O demands.

Designers constantly strive to balance the throughput and processing demands of the processor components, main memory, I/O devices, and the interconnection structures.

Multicore, MICs, and GPGPUs
New approaches to improving performance:
- Multicore: improving performance by placing multiple processors on the same chip, with a large shared cache.
- Many Integrated Core (MIC): chips on which the number of cores is large, more than 50 cores per chip.
- GPGPU: a chip with multiple general-purpose processors plus graphics processing units (GPUs) and specialised cores for video processing and other tasks. When a broad range of applications is supported by such a processor, the term general-purpose computing on GPUs (GPGPU) is used.

Basic Measures of Computer Performance

Amdahl's Law
Computer system designers look for ways to improve system performance through advances in technology or changes in design. However, a speedup in one aspect of the technology or design does not necessarily result in a corresponding improvement in overall performance. Amdahl's law, first proposed by Gene Amdahl in 1967, deals with the potential speedup of a program using multiple processors compared to a single processor.

Consider a program running on a single processor such that:
- a fraction (1 - f) of the execution time involves code that is inherently sequential, and
- a fraction f involves code that is infinitely parallelizable with no scheduling overhead.

Let T be the total execution time of the program using a single processor.
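Under these assumptions, Amdahl's law gives Speedup = 1 / ((1 - f) + f/N). A quick numeric sketch (the fraction and processor-count values here are illustrative, not from the unit):

```python
def amdahl_speedup(f, n):
    """Speedup of a program whose parallelizable fraction is f,
    run on n processors, per Amdahl's law."""
    return 1.0 / ((1.0 - f) + f / n)

# 90% parallelizable code on 8 processors:
print(round(amdahl_speedup(0.9, 8), 2))       # 4.71
# Even with (effectively) unlimited processors, the sequential
# 10% caps the speedup at 1 / (1 - f) = 10:
print(round(amdahl_speedup(0.9, 10**9), 2))   # 10.0
```

This is the key lesson of the law: the inherently sequential fraction, not the processor count, sets the ceiling on achievable speedup.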
Then the speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is:

    Speedup = T / (T(1 - f) + T f / N) = 1 / ((1 - f) + f/N)

Illustration of Amdahl's Law

Amdahl's Law for Multiprocessors

Clock Speed
Operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on, are governed by a system clock. Typically, all operations begin with a pulse of the clock. Thus, at the most fundamental level, the speed of a processor is dictated by the pulse frequency produced by the clock, measured in cycles per second, or hertz (Hz):
- MHz (megahertz): millions of pulses per second
- GHz (gigahertz): billions of pulses per second

System Clock
Typically, clock signals are generated by a quartz crystal, which generates a constant signal wave while power is applied. This wave is converted into a digital voltage pulse stream that is provided in a constant flow to the processor circuitry. The rate of pulses is known as the clock rate, or clock speed. One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick. The time between pulses is the cycle time.

Instruction Cycle, Machine Cycle and T-State
The execution of an instruction involves a number of discrete steps: fetching the instruction from memory, decoding the various portions of the instruction, loading and storing data, and performing arithmetic and logical operations. Most instructions on most processors require multiple clock cycles to complete.

Instruction Cycle
The fetching, decoding and execution of a single instruction. It typically consists of one to five read or write operations between the processor and memory or input/output devices.
Machine Cycle
The time period required by each memory or I/O operation. In other words, to move a byte of data into or out of the microprocessor, a machine cycle is required.

T-State
Each machine cycle consists of 3 to 6 clock periods/cycles, referred to as T-states. Typically, one instruction cycle consists of one to five machine cycles, and one machine cycle consists of three to six T-states, i.e. three to six clock periods.

Instruction Execution Rate: Parameters
- Instruction count (Ic): for a program, the number of machine instructions executed for that program until it runs to completion, or for some defined time interval.
- Average cycles per instruction (CPI): the number of clock cycles required per machine instruction. On any given processor, the number of clock cycles required varies for different types of instructions, such as load, store, branch, and so on. Hence, the average CPI for a program is an important parameter.

Class Exercise
Consider the execution of a program that results in the execution of 2 million instructions. The program consists of 4 major types of instructions; the instruction mix and the CPI for each instruction type are given below. What is the average CPI?

CPI Formula
Let CPIi be the number of cycles required for instruction type i, and Ii be the number of executed instructions of type i for a given program. Then we can calculate the average CPI as follows:

    CPI = [ Σ (CPIi × Ii) ] / Ic

Processor Execution Time Formula
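The instruction-mix table for the class exercise is not reproduced in these notes, so this sketch of the CPI formula uses an illustrative mix of 4 instruction types over the exercise's 2 million instructions (the mix fractions and per-type CPIs are assumptions):

```python
# Illustrative mix: (fraction of Ic, CPI) for each instruction type.
# The actual class-exercise table is not shown in these notes.
mix = [
    (0.60, 1),   # e.g. ALU operations
    (0.18, 2),   # e.g. loads
    (0.12, 4),   # e.g. stores
    (0.10, 12),  # e.g. branches and others
]
ic = 2_000_000   # total instruction count Ic

# Average CPI = sum(CPI_i * I_i) / Ic
total_cycles = sum(frac * ic * cpi for frac, cpi in mix)
avg_cpi = total_cycles / ic
print(avg_cpi)  # 2.64
```

Note that because every I_i is a fixed fraction of Ic, the Ic terms cancel: the average CPI is just the mix-weighted sum of the per-type CPIs.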
A processor is driven by a clock with a constant frequency f or, equivalently, a constant cycle time τ, where τ = 1/f. The processor time T needed to execute a given program can then be expressed as:

    T = Ic × CPI × τ

Class Exercise
The processor runs at a clock rate of 400 MHz. What is the processor execution time?

Instruction Execution Rate
A common measure of performance for a processor is the rate at which instructions are executed, expressed as millions of instructions per second (MIPS): the MIPS rate. It can be expressed in terms of the clock rate and CPI as follows:

    MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)
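Putting the formulas together: at the stated 400 MHz clock, and reusing an assumed average CPI of 2.64 (the exercise's own CPI comes from the instruction-mix table, which is not reproduced in these notes), the execution time and MIPS rate work out as:

```python
ic = 2_000_000           # instruction count Ic, from the exercise
cpi = 2.64               # assumed average CPI (illustrative)
f = 400e6                # clock rate: 400 MHz
tau = 1 / f              # cycle time, tau = 1/f

t = ic * cpi * tau       # execution time, T = Ic * CPI * tau
mips = f / (cpi * 1e6)   # MIPS rate = f / (CPI * 10^6)

print(f"{t * 1000:.1f} ms")  # 13.2 ms
print(f"{mips:.1f} MIPS")    # 151.5 MIPS
```

As a sanity check, the two formulas agree: Ic / (T × 10^6) = 2e6 / (0.0132 × 10^6) ≈ 151.5, the same MIPS rate obtained from f / (CPI × 10^6).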