Module 6
Architectural Enhancements
• Superscalar Architectures, Out-of-Order Execution,
Multi-core processors, Clusters, GPU
Superscalar Processors
• Superscalar designs may appear in microprocessors or in larger (macro) processors, depending on the functionality required.
• A basic processor generally executes one instruction per clock
cycle.
• To improve processing speed beyond this limit, the superscalar
processor came into being; it combines pipelining with multiple
execution units so that it can execute two or more instructions per
clock cycle.
• The concept was first realized in Seymour Cray’s CDC 6600 in
1964 and was later enhanced by Tjaden & Flynn in 1970.
• The first commercial single-chip superscalar microprocessor, the
MC88100, was developed by Motorola in 1988; Intel introduced
its version, the i960CA, in 1989, and AMD followed with the
29000-series 29050 in 1990.
• A typical superscalar processor in current use is the Intel Core
i7, based on the Nehalem microarchitecture.
• Superscalar implementations, however, continue to grow in
complexity.
• The design of these processors refers to a set of methods that
permit a CPU to attain a throughput of more than one instruction
per cycle while executing a single sequential program.
• A superscalar processor is a microprocessor that implements
instruction-level parallelism within a single processor, executing
more than one instruction per clock cycle by simultaneously
dispatching multiple instructions to separate execution units on
the processor.
• A scalar processor executes a single instruction per clock cycle;
a superscalar processor can execute more than one instruction
during a clock cycle.
• Superscalar design techniques normally comprise parallel
register renaming, parallel instruction decoding, out-of-order
execution & speculative execution.
• These methods are used together with complementary design
methods like pipelining, branch prediction, caching & multi-core
within current designs of microprocessors.
Features
• Superscalar architecture is a parallel computing technique utilized
in various processors.
• In a superscalar computer, the CPU manages several instruction
pipelines to perform numerous instructions simultaneously during
a clock cycle.
• Superscalar architectures include all the features of pipelining,
but in addition several instructions execute simultaneously in
parallel pipelines.
Superscalar Processor Architecture
• A superscalar processor is a CPU that executes more than one
instruction per clock cycle; processing speed is commonly
measured in clock cycles per second.
• Compared to a scalar processor, this processor is therefore much
faster.
• Superscalar processor architecture mainly includes parallel
execution units where these units can implement instructions
simultaneously.
• This parallel architecture was first implemented in RISC
processors, which use simple & short instructions to perform
calculations. Owing to their superscalar abilities, RISC
processors have typically outperformed CISC processors running
at the same clock speed. However, most CISC processors today,
such as the Intel Pentium, incorporate some RISC techniques as
well, which allows them to execute instructions in parallel.
• The superscalar processor is equipped with several processing units for
handling various instructions in parallel in every processing stage.
• Using the above architecture, a number of instructions start
execution within the same clock cycle.
• These processors can thereby achieve an instruction throughput
of more than one instruction per cycle.
• Consider a processor with two execution units: one for integer
operations & one for floating-point operations.
• The instruction fetch unit (IFU) reads several instructions at a
time & stores them in the instruction queue.
• In every cycle, the dispatch unit fetches & decodes up to 2
instructions from the front of the queue.
• If there is one integer instruction, one floating-point instruction
& no hazards, both instructions are dispatched in the same clock
cycle (see the dispatch sketch below).
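The following is a minimal sketch, in Python, of the dual-issue dispatch rule just described. It is illustrative rather than taken from the slides: the instruction format, register names and the pairwise RAW-hazard check are assumptions made for the example.

```python
# Sketch of 2-issue dispatch: up to one integer and one floating-point
# instruction leave the queue per cycle; a RAW hazard between the pair
# blocks dual issue for that cycle.
from collections import deque

def dispatch(queue):
    """Dispatch up to 2 instructions per cycle; return the per-cycle groups."""
    cycles = []
    while queue:
        issued = [queue.popleft()]                       # first instruction always issues
        if queue:
            a, b = issued[0], queue[0]
            different_units = a["unit"] != b["unit"]     # one INT unit + one FP unit
            no_hazard = not (set(a["dst"]) & set(b["src"]))  # no RAW dependence in the pair
            if different_units and no_hazard:
                issued.append(queue.popleft())           # dual issue this cycle
        cycles.append([i["op"] for i in issued])
    return cycles

program = deque([
    {"op": "add  r1, r2, r3", "unit": "INT", "dst": ["r1"], "src": ["r2", "r3"]},
    {"op": "fmul f1, f2, f3", "unit": "FP",  "dst": ["f1"], "src": ["f2", "f3"]},
    {"op": "sub  r4, r1, r5", "unit": "INT", "dst": ["r4"], "src": ["r1", "r5"]},
    {"op": "fadd f4, f1, f2", "unit": "FP",  "dst": ["f4"], "src": ["f1", "f2"]},
])
for n, group in enumerate(dispatch(program), 1):
    print(f"cycle {n}: {group}")   # two instructions dispatched per cycle here
```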
Pipelining
• Pipelining is the procedure of breaking down tasks into sub-steps
& executing them within different processor parts.
• In the following superscalar pipeline, two instructions can be
fetched and dispatched at a time to complete a maximum of 2
instructions per cycle.
• The instructions in a superscalar processor are issued from a
sequential instruction stream.
• The processor must issue multiple instructions per clock cycle,
and the CPU must check dynamically for data dependencies
between instructions.
• The scalar processor pipeline architecture includes a single
pipeline with four stages: fetch, decode, execute & result
write-back.
• In the single-pipeline scalar processor, execution proceeds as
follows: in the first clock period, instruction I1 is fetched; in
the second clock period, I1 is decoded while the second
instruction I2 is fetched.
• In the third clock period, I3 is fetched, I2 is decoded and I1
executes. In the fourth clock period, I4 is fetched, I3 is
decoded, I2 executes and I1 writes its result to memory.
• So, in seven clock periods, the single pipeline completes 4
instructions.
Superscalar Processor Pipeline Architecture
• The superscalar processor pipeline architecture includes two
pipelines, each with four stages: fetch, decode, execute & result
write-back.
• It is a 2-issue superscalar processor, which means that two
instructions at a time are fetched, decoded, executed and written
back.
• The two instructions I1 & I2 are fetched, decoded, executed and
written back together, one stage per clock period.
• One clock period behind them, the remaining two instructions
I3 & I4 likewise move through the stages together.
• So, in five clock periods, the two pipelines complete 4
instructions.
• Thus, a scalar processor issues a single instruction per clock
cycle and performs a single pipeline stage per clock cycle,
whereas a 2-issue superscalar processor issues two instructions
per clock cycle and executes two instances of each stage in
parallel.
• Instruction execution therefore takes more time on a scalar
processor and less time on a superscalar one; the cycle counts
above can be checked with the short sketch below.
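A minimal sketch (an illustration, not from the slides) of the quoted cycle counts for an ideal, stall-free pipeline: with S stages, issue width W and N instructions, the total is S + ceil(N / W) - 1 cycles.

```python
# Ideal pipeline timing: the pipeline fills in n_stages cycles, then one
# group of issue_width instructions completes every cycle afterwards.
from math import ceil

def pipeline_cycles(n_instructions, n_stages, issue_width=1):
    return n_stages + ceil(n_instructions / issue_width) - 1

print(pipeline_cycles(4, 4, issue_width=1))  # scalar pipeline:      7 clock periods
print(pipeline_cycles(4, 4, issue_width=2))  # 2-issue superscalar:  5 clock periods
```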
Characteristics
• A superscalar processor is a super-pipelined model in which
independent instructions are issued in sequence but proceed
without any waiting between them.
• A superscalar processor fetches & decodes at a time several
instructions of the incoming instruction stream.
• The architecture of superscalar processors exploits the
potential of instruction-level parallelism.
• Superscalar processors issue more than a single instruction
per cycle.
• The number of instructions issued depends on the mix of
instructions within the instruction stream.
• Instructions are frequently reordered to fit the architecture of
the processor better.
• The superscalar method is usually associated with some
identifying characteristics. Instructions are normally issued
from a sequential instruction stream.
• The CPU checks dynamically for data dependencies in
between instructions at run time.
• The CPU executes multiple instructions for each clock cycle.
Advantages
• A superscalar processor implements instruction-level parallelism in a
single processor.
• These processors can be designed around essentially any instruction set.
• A superscalar processor with out-of-order execution, branch
prediction & speculative execution can find parallelism across
several basic blocks & loop iterations.
Disadvantages
• Superscalar processors are not used much in small embedded systems
due to power usage.
• The problem with scheduling can happen in this architecture.
• A superscalar processor increases the complexity of the hardware
design.
• The instructions in this processor are simply fetched based on their
sequential program order but this is not the best execution order.
Applications
• Superscalar execution is used in virtually every laptop or
desktop processor. The processor scans the program in execution
to discover sets of instructions that can be executed together.
• A superscalar processor includes various data path hardware
copies which execute various instructions at once.
• This processor is designed to sustain an execution rate of
more than one instruction per clock cycle for a single
sequential program.
Out-of-order instruction execution
• Instructions are fetched in compiler-generated order
• Instruction completion may be in-order (today) or out-of-order
(older computers)
• In between they may be executed in some other order
• Independent instructions behind a stalled instruction can pass it
• Instructions are dynamically scheduled
Dynamic Scheduling
• Out-of-order processors: After instruction decode
• Check for structural hazards
– An instruction can be issued when a functional unit is available
– An instruction stalls if no appropriate functional unit
• Check for data hazards
– An instruction can execute when its operands have been
calculated or loaded from memory
– An instruction stalls if operands are not available
• Out-of-order processors: don’t wait for previous instructions to
execute if this instruction does not depend on them, i.e.,
independent ready instructions can execute before earlier
instructions that are stalled
Case 1: instructions are waiting for a load that missed in the cache
• lw $3, 100($4) in execution, cache miss
• sub $5, $6, $7 can execute during the cache miss
• add $2, $3, $4 waits until the miss is satisfied
• Out-of-order processors: ready instructions can execute before
earlier instructions that are stalled, as the dependence check
sketched below illustrates.
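Below is a minimal sketch (illustrative; the helper name and data layout are assumptions) of the dynamic dependence check that lets the sub execute during the lw cache miss while the add must wait for $3.

```python
# An instruction is ready when none of its source registers is still
# being produced by an earlier, unfinished instruction.
def ready(instr, pending_dests):
    return not (set(instr["src"]) & pending_dests)

window = [
    {"op": "lw  $3, 100($4)", "dst": "$3", "src": ["$4"]},  # in execution, cache miss
    {"op": "sub $5, $6, $7",  "dst": "$5", "src": ["$6", "$7"]},
    {"op": "add $2, $3, $4",  "dst": "$2", "src": ["$3", "$4"]},
]
pending = {"$3"}  # lw has issued but its result has not come back yet
for instr in window[1:]:
    blocked = set(instr["src"]) & pending
    state = "can execute now" if not blocked else "waits for " + ", ".join(blocked)
    print(f'{instr["op"]}: {state}')
# sub $5, $6, $7: can execute now
# add $2, $3, $4: waits for $3
```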
Case 2: path instructions are waiting for a branch condition to be
computed
• When path instructions go around a branch instruction:
– The instructions that are issued from the predicted path are
issued speculatively, called speculative execution
– Speculative instructions can execute (but not commit) before
the branch is resolved
– If the prediction was wrong, speculative instructions are
flushed from the pipeline
– If prediction is right, instructions are no longer speculative
Speculative Execution
• Instruction speculation: executing an instruction before it is known
that it should be executed
– All instructions that are fetched because of a prediction are
speculative
• In-order pipeline:
– The branch is executed before the path instructions
• Out-of-order pipeline:
– The path instructions can be executed before the branch
– Speculative instructions can execute but not commit
– Getting rid of wrong-path instructions is not just a matter of
flushing them from the pipeline
• A small sketch of this execute-then-commit-or-flush behavior follows.
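The following is a minimal sketch (purely illustrative; the instruction strings and function are invented for the example) of speculative execution: instructions past a predicted branch execute but only commit once the branch resolves, and are flushed on a misprediction.

```python
# Speculatively execute the predicted-path instructions, then either
# commit them (prediction right) or flush them (prediction wrong).
def resolve_branch(speculative_instrs, predicted_taken, actual_taken):
    executed = list(speculative_instrs)   # speculative instructions may execute...
    if predicted_taken == actual_taken:
        return executed                   # ...and commit: no longer speculative
    return []                             # wrong path: flushed, nothing commits

path = ["add r1, r2, r3", "mul r4, r1, r5"]
print(resolve_branch(path, predicted_taken=True, actual_taken=False))  # [] -- flushed
print(resolve_branch(path, predicted_taken=True, actual_taken=True))   # both commit
```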
Superscalar vs. Pipelining
• Superscalar: a CPU design used to implement a form of parallelism
called instruction-level parallelism in a single processor.
Pipelining: an implementation technique in which several
instructions are overlapped during execution.
• Superscalar: initiates several instructions simultaneously &
executes them separately. Pipelining: executes only a single
pipeline stage per clock cycle.
• Superscalar: depends on spatial parallelism. Pipelining: depends
on temporal parallelism.
• Superscalar: several operations run concurrently on separate
hardware. Pipelining: several operations overlap on common
hardware.
• Superscalar: achieved by duplicating hardware resources such as
register file ports & execution units. Pipelining: achieved by
pipelining execution units more deeply with very fast clock cycles.
Multicore Processor
• A multicore processor is an integrated circuit that has two or more
processor cores attached for enhanced performance and reduced
power consumption.
• These processors also enable more efficient simultaneous
processing of multiple tasks, as with parallel processing and
multithreading (see the sketch after this list).
• A dual core setup is similar to having multiple, separate processors
installed on a computer. However, because the two processors are
plugged into the same socket, the connection between them is
faster.
• The use of multicore processors or microprocessors is one
approach to boost processor performance without exceeding the
practical limitations of semiconductor design and fabrication.
• Using multiple cores also ensures safe operation in areas such as
heat generation.
• The heart of every processor is an execution engine, also known as
a core.
• The core is designed to process instructions and data according to
the direction of software programs in the computer's memory.
• Over the years, designers found that every new processor design
had limits.
• Numerous technologies were developed to accelerate
performance, including the following ones:
– Clock speed.
– Hyper-threading.
– More chips.
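Here is a minimal sketch (an illustration, not from the slides; the chunk sizes and worker count are assumptions) of the task-level parallelism a multicore processor enables: one job is split across worker processes that the OS can schedule onto separate cores.

```python
# Split a large summation across 4 worker processes, one per assumed core.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    chunks = [(0, 2_500_000), (2_500_000, 5_000_000),
              (5_000_000, 7_500_000), (7_500_000, 10_000_000)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))   # partial results combined
    print(total)  # same answer as one sequential loop, computed in parallel
```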
Clock speed
• One approach was to make the processor's clock faster.
• The clock is the "drumbeat" used to synchronize the processing of
instructions and data through the processing engine.
• Clock speeds have accelerated from several megahertz to several
gigahertz (GHz) today.
• However, transistors use up power with each clock tick.
• As a result, clock speeds have nearly reached their limits given
current semiconductor fabrication and heat management
techniques.
Hyper-threading
• Another approach involved the handling of multiple instruction
threads. Intel calls this hyper-threading.
• With hyper-threading, processor cores are designed to handle two
separate instruction threads at the same time.
• When properly enabled and supported by both the computer's
firmware and operating system (OS), hyper-threading techniques
enable one physical core to function as two logical cores.
• Still, the processor only possesses a single physical core.
• The logical abstraction of the physical processor added little real
performance other than helping to streamline the behavior of
multiple simultaneous applications running on the computer (the
sketch below shows how software can observe logical versus
physical cores).
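A minimal sketch of observing hyper-threading from software. os.cpu_count() reports logical cores; the third-party psutil package (an assumption here, not part of the slides) can report physical cores for comparison.

```python
import os

logical = os.cpu_count()
print(f"logical cores: {logical}")

try:
    import psutil  # pip install psutil -- optional third-party package
    physical = psutil.cpu_count(logical=False)
    print(f"physical cores: {physical}")
    if physical and logical and logical > physical:
        print("hyper-threading (or SMT) appears to be enabled")
except ImportError:
    print("psutil not installed; physical-core count unavailable")
```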
More chips
• The next step was to add processor chips -- or dies -- to the
processor package, which is the physical device that plugs into the
motherboard.
• A dual-core processor includes two separate processor cores.
• A quad-core processor includes four separate cores.
• Today's multicore processors can easily include 12, 24 or even
more processor cores.
• The multicore approach is almost identical to the use of
multiprocessor motherboards, which have two or four separate
processor sockets.
• The effect is the same.
• Today's huge processor performance involves the use of processor
products that combine fast clock speeds and multiple
hyper-threaded cores.
Architecture of Multicore Processor
Core
• Cores are the central components of multicore processors.
• Cores contain all of the registers and circuitry -- sometimes
hundreds of millions of individual transistors -- needed to perform
the closely synchronized tasks of ingesting data and instructions,
processing that content and outputting logical decisions or results.
Processor support
• Processor support circuitry includes an assortment of input/output
control and management circuitry, such as clocks, cache
consistency, power and thermal control and external bus access.
Caches
• Caches are relatively small areas of very fast memory.
• A cache retains often-used instructions or data, making that
content readily available to the core without the need to access
system memory.
• A processor checks the cache first. If the required content is
present, the core takes that content from the cache, enhancing
performance benefits.
• If the content is absent, the core will access system memory for
the required content.
• A Level 1, or L1, cache is the smallest and fastest cache unique to
every core. A Level 2, or L2, cache is a larger storage space shared
among the cores.
• Some multicore processor architectures dedicate both L1 and
L2 caches to each core, sharing a larger cache among all the
cores; the check-cache-first behavior is sketched below.
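A minimal sketch (illustrative; the line count, addresses and direct-mapped policy are assumptions) of the check-cache-first behavior described above.

```python
# Tiny direct-mapped cache in front of a stand-in for system memory.
CACHE_LINES = 4
memory = {addr: addr * 10 for addr in range(32)}   # "system memory"
cache = {}                                          # line index -> (tag, value)

def load(addr):
    index, tag = addr % CACHE_LINES, addr // CACHE_LINES
    line = cache.get(index)
    if line and line[0] == tag:        # hit: content served from the fast cache
        print(f"addr {addr}: hit")
        return line[1]
    value = memory[addr]               # miss: the core accesses system memory
    cache[index] = (tag, value)        # fill the line for next time
    print(f"addr {addr}: miss")
    return value

load(5); load(5)   # miss, then hit
load(9); load(5)   # addr 9 maps to the same line as addr 5 and evicts it
```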
• However, multicore chips have several issues to consider.
• First, the addition of more processor cores doesn't automatically
improve computer performance.
• The OS and applications must direct software program instructions
to recognize and use the multiple cores.
• This must be done in parallel, assigning various threads to
different cores within the processor package.
• Some software applications may need to be refactored to support
and use multicore processor platforms.
• Otherwise, only the default first processor core is used, and any
additional cores are unused or idle.
• Second, the performance benefit of additional cores is not a direct
multiple.
• That is, adding a second core does not double the processor's
performance, nor does a quad-core processor quadruple it.
• This happens because of the shared elements of the processor,
such as access to internal memory or caches, external buses and
computer system memory.
• The benefit of multiple cores can be substantial, but there are
practical limits (see the sketch below).
• Still, the acceleration is typically better than a traditional
multiprocessor system because the coupling between cores in the
same package is tighter and there are shorter distances and fewer
components between cores.
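One standard way to quantify these diminishing returns, not stated in the slides but consistent with them, is Amdahl's law: with a parallel fraction p of the work and n cores, speedup is bounded by 1 / ((1 - p) + p / n).

```python
# Amdahl's law: the serial fraction (1 - p) limits multicore speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for cores in (1, 2, 4, 8, 16):
    print(f"{cores:2d} cores: {amdahl_speedup(0.90, cores):.2f}x")
# With 90% of the work parallelizable, 4 cores give about 3.1x (not 4x)
# and 16 cores only about 6.4x.
```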
Multicore advantages
• Better application performance.
– The principal benefit of multicore processors is more potential processing
capability.
– Each processor core is effectively a separate processor that OSes and applications
can use.
– In a virtualized server, each VM can employ one or more virtualized processor
cores, enabling many VMs to coexist and operate simultaneously on a physical
server.
– Similarly, an application designed for high levels of parallelism may use any
number of cores to provide high application performance that would be impossible
with single-chip systems.
• Better hardware performance.
– By placing two or more processor cores on the same device, the processor can
use shared components -- such as common internal buses and processor caches --
more efficiently.
– It also benefits from superior performance compared with multiprocessor systems
that have separate processor packages on the same motherboard.
Multicore disadvantages
• Software dependent.
– The application uses processors -- not the other way around. OSes and
applications will always default to using the first processor core, dubbed core
0.
– Any additional cores in the processor package will remain unused or idle
until software applications are enabled to use them.
– Such applications include database applications and big data processing
tools like Hadoop.
– A business should consider what a server will be used for and the
applications it plans to use before making a multicore system investment to
ensure that the system delivers its optimum computing potential.
• Performance boosts are limited.
– Multiple processors in a processor package must share common
system buses and processor caches.
– The more processor cores share a package, the more sharing
must take place across common processor interfaces and
resources.
– This results in diminishing returns to performance as cores are
added.
– For most situations, the performance benefit of having multiple
cores far outweighs the performance lost to such sharing, but
it's a factor to consider when testing application performance.
• Power, heat and clock restrictions.
– A computer may not be able to drive a processor with many cores as hard
as a processor with fewer cores or a single-core processor.
– A modern processor core may contain over 500 million transistors.
– Each transistor generates heat when it switches, and this heat increases as
the clock speed increases.
– All of that heat generation must be safely dissipated from the core through
the processor package.
– When more cores are running, this heat can multiply and quickly exceed
the cooling capability of the processor package.
– Thus, some multicore processors may actually reduce clock speeds -- for
instance, from 3.5 GHz to 3.0 GHz -- to help manage heat.
– This reduces the performance of all processor cores in the package.
– High-end multicore processors require complex cooling systems and
careful deployment and monitoring to ensure long-term system reliability.
Homogenous vs. Heterogeneous Multicore Processors
• The cores within a multicore processor may be homogeneous or
heterogeneous.
• Mainstream Intel and AMD multicore processors for x86
computer architectures are homogeneous and provide identical
cores.
• Consequently, most discussion of multicore processors is about
homogeneous processors.
• However, dedicating a complex device to a simple job is often
wasteful; the greatest efficiency frequently comes from matching
the core to the task.
• There is a heterogeneous multicore processor market that uses
processors with different cores for different purposes.
• Heterogeneous cores are generally found in embedded or Arm
processors that might mix microprocessor and microcontroller
cores in the same package.
There are three general goals for heterogeneous multicore processors:
Optimized performance
– While homogeneous multicore processors are typically
intended to provide vanilla or universal processing capabilities,
many processors are not intended for such generic system use
cases.
– Instead, they are designed and sold for use in embedded --
dedicated or task-specific -- systems that can benefit from the
unique strengths of different processors.
– For example, a processor intended for a signal processing
device might use an Arm processor that contains a Cortex-A
general-purpose processor with a Cortex-M core for dedicated
signal processing tasks.
Optimized power
– Providing simpler processor cores reduces the transistor count
and eases power demands.
– This makes the processor package and the overall system
cooler and more power-efficient.
Optimized security
– Jobs or processes can be divided among different types of
cores, enabling designers to deliberately build high levels of
isolation that tightly control access among the various
processor cores.
– This greater control and isolation offer better stability and
security for the overall system, though at the cost of general
flexibility.
Graphics Processing Units (GPU)
• GPUs are also known as video cards or graphics cards.
• In order to display pictures, videos, and 2D or 3D animations, each
device uses a GPU.
• A GPU performs fast arithmetic calculations and frees up the
CPU to do other things.
• A GPU has lots of smaller cores made for multitasking, while a
CPU uses a few cores designed primarily for sequential serial
processing.
• In the world of computing, graphics processing technology has
advanced to offer specific benefits.
• Modern GPUs enable new possibilities in content creation,
machine learning, gaming, etc.
• GPU became a common term in the 1990s, when chip maker
Nvidia coined it for the part that powered graphics on a system.
• The company's GeForce range of graphics cards was the first to
popularize the term and the associated technologies, including
programmable shading, hardware acceleration, and stream
processing.
• Although rendering simple objects, such as an operating system's
desktop environment, can typically be handled by the limited
graphics processing built into the CPU, heavier workloads require
the extra horsepower of a dedicated GPU.
• For personal and business systems alike, the graphics processing
unit (GPU) has become one of the most important types of
computing technology.
• The GPU is designed for parallel processing and is used in various
applications, including video rendering and graphics.
• Originally, GPUs were designed to accelerate 3D graphics
rendering.
• They have become more modular and programmable over time,
improving their capabilities.
• This enables graphics programmers to use advanced lighting and
shadowing techniques to create more exciting visual effects and
more realistic scenes.
• Other developers have also started to harness the GPU's power in
high-performance computing, deep learning, etc., to significantly
speed up additional workloads.
• GPUs are generally used to drive high-quality gaming
experiences, delivering life-like, smooth rendering and graphic
design.
• However, there are also many business applications, which depend
on strong graphics chips.
• Today, GPUs are more programmable than ever before, giving
them the potential to speed up a wide variety of applications that
go far beyond conventional graphics rendering.
• There are various applications in which GPUs can be used;
several are described below, after a short sketch of the
data-parallel style of computation GPUs are built for.
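A minimal sketch of that data-parallel style (illustrative: NumPy runs on the CPU here and is only a stand-in for the per-element kernels a GPU would spread across thousands of cores; the brightness operation is an invented example).

```python
import numpy as np

pixels = np.random.rand(1920 * 1080)        # one value per pixel of a frame

# Scalar style: one element at a time, as a single CPU core would loop.
brightened_scalar = [min(p * 1.2, 1.0) for p in pixels]

# Data-parallel style: one operation over the whole array at once -- the
# shape of work a GPU dispatches across its many small cores.
brightened_parallel = np.minimum(pixels * 1.2, 1.0)

print(np.allclose(brightened_scalar, brightened_parallel))  # True
```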
GPU for Gaming
• Video games have become more computationally intensive, with
vast, hyper-realistic and complex in-game worlds.
• With new display technology, such as 4K displays and high
refresh rates, along with the rise of virtual reality gaming,
demand for graphics processing is increasing rapidly.
• Games may be played at a higher resolution, at a higher frame
rate, or both, with improved graphics performance.
GPU for Machine Learning
• Artificial intelligence and machine learning offer several exciting
applications for GPU technology.
• Since GPUs have an exceptional amount of computational power,
they can provide tremendous acceleration in workloads that take
advantage of GPU's highly parallel design, such as image
recognition.
• Many advanced learning technologies depend on GPUs working in
combination with CPUs.
GPU for Video Editing and Content Creation
• For many years, video editors, graphic designers, and other
professionals struggled with long rendering times for video
editing and content creation, which tied up system resources and
stifled creative flow.
• Now, GPU's parallel processing makes rendering video and
graphics in higher quality formats easier and faster.
• Moreover, modern GPUs have dedicated media and display
engines, which make video production and playback more
power-efficient.
Clusters
• In computer organization, clusters refer to groups of
interconnected computers or servers that work together as a
unified system.
• Clustering is a powerful technique that provides computer systems
with enhanced performance, fault tolerance, and scalability.
• Types of Clusters
– High Availability (HA) Clusters
– Load Balancing Clusters
– High-Performance Computing (HPC) Clusters
– Data Clusters
High Availability (HA) Clusters:
• HA clusters are designed to provide continuous availability of
services by utilizing redundant hardware and software
configurations.
• In the event of a failure, services are automatically transferred to
a standby node, minimizing downtime and ensuring uninterrupted
operations.
Load Balancing Clusters:
• Load-balancing clusters distribute incoming workloads across
multiple nodes to optimize resource utilization and improve
system performance.
• Requests are evenly distributed among the cluster nodes,
preventing the overloading of any single node and ensuring
efficient handling of user requests (a round-robin sketch follows).
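A minimal sketch (illustrative; the node names and round-robin policy are assumptions, and real load balancers use richer policies) of evenly distributing requests across cluster nodes.

```python
# Round-robin assignment: each incoming request goes to the next node in turn.
from itertools import cycle

nodes = cycle(["node-1", "node-2", "node-3"])   # hypothetical cluster nodes
for i in range(7):
    print(f"request-{i} -> {next(nodes)}")      # no single node is overloaded
```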
High-Performance Computing (HPC) Clusters:
• HPC clusters are used for computationally intensive scientific
simulations, data analysis, and complex calculations.
• These clusters harness the computational power of multiple nodes
to execute parallel processing tasks, significantly reducing the
time required to complete these tasks.
Data Clusters:
• Data clusters are designed explicitly for managing large volumes
of data.
• They provide high-capacity storage and efficient data retrieval
mechanisms.
• Data clustering technologies like distributed file systems and
databases ensure data availability, reliability, and fault tolerance.
Cluster Architectures
• Clusters can be organized into different architectures depending on
how they are interconnected:
– Shared-Nothing Architecture
– Shared-Disk Architecture
– Shared-Memory Architecture
Shared-Nothing Architecture:
• In a shared-nothing architecture, each node in the cluster has its
dedicated resources, including processors, memory, and storage.
• Nodes communicate by passing messages over a network, and
each node operates independently.
• This architecture provides high scalability and fault tolerance but
requires explicit communication and data transfer between nodes,
as in the sketch below.
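A minimal sketch of the shared-nothing style (illustrative: multiprocessing.Pipe stands in for the cluster's network link, and the summation workload is invented). Each worker owns its data and shares only explicit messages.

```python
from multiprocessing import Process, Pipe

def node(conn, data):
    conn.send(sum(data))     # compute on private data, send only a message
    conn.close()

if __name__ == "__main__":
    link_a, end_a = Pipe()
    link_b, end_b = Pipe()
    Process(target=node, args=(end_a, range(0, 50))).start()
    Process(target=node, args=(end_b, range(50, 100))).start()
    print(link_a.recv() + link_b.recv())   # combine the nodes' results: 4950
```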
Shared-Disk Architecture:
• In a shared-disk architecture, all nodes in the cluster share a
common storage system.
• Multiple nodes can access and modify the shared disk
simultaneously, enabling efficient data sharing.
• However, shared-disk architectures may introduce data access
contention and performance bottlenecks.
Shared-Memory Architecture:
• In a shared-memory architecture, all nodes in the cluster share a
common physical memory space.
• This architecture allows for easy data sharing and communication
between nodes as they can directly access shared memory.
• However, shared-memory architectures are limited by memory
capacity and scalability.
Advantages:
• Improved Performance:
– By harnessing the combined computational power of multiple
nodes, clusters can significantly enhance system performance.
– Tasks can be distributed among nodes, enabling parallel
processing and reducing execution time.
• Cost Efficiency:
– Clusters can offer cost savings by utilizing commodity
hardware and distributed computing resources.
– Instead of investing in expensive high-end servers,
organizations can build clusters with standard off-the-shelf
hardware, making it a cost-effective solution.
Disadvantages:
• Complexity:
– Designing, configuring, and managing clusters can be complex
and requires specialized knowledge and skills.
– Proper setup and maintenance of the cluster components,
including interconnects, networking, and distributed software,
are crucial for optimal performance.
• Load Balancing:
– Efficient load balancing is needed to ensure that resources are
evenly distributed among cluster nodes.
– Improper load balancing can lead to performance degradation,
as some nodes may be overloaded while others remain
underutilized.
• Data Consistency and Access:
– In shared-disk or shared-memory architectures, ensuring data
consistency and preventing conflicts during simultaneous
access is critical.
– Synchronization mechanisms and distributed file systems
manage data integrity and access control.
• Network Communication:
– Communication between cluster nodes over the network
introduces latency and bandwidth limitations.
– Network congestion and bottlenecks can affect cluster
performance, requiring careful design and optimization.