UNIT-V

What is RAID?

RAID (Redundant Array of Independent Disks) is like having backup copies of your important files stored in different places on several hard drives or solid-state drives (SSDs). If one drive stops working, your data is still safe because you have other copies stored on the other drives. It's like having a safety net to protect your files from being lost if one of your drives breaks down.

RAID (Redundant Array of Independent Disks) in a Database Management System (DBMS) is a technology that combines multiple physical disk drives into a single logical unit for data storage. The main purpose of RAID is to improve data reliability, availability, and performance. There are different levels of RAID, each offering a balance of these benefits.

How RAID Works?

Let us understand how RAID works with an example. Imagine you have a bunch of friends, and you want to keep your favorite book safe. Instead of giving the book to just one friend, you make copies and give a piece to each friend. Now, if one friend loses their piece, you can still put the book together from the other pieces. That's similar to how RAID works with hard drives. It splits your data across multiple drives, so if one drive fails, your data is still safe on the others. RAID helps keep your information secure, just like spreading your favorite book among friends keeps it safe.

What is a RAID Controller?


A RAID controller is like a boss for your hard drives in a big storage
system. It works between your computer’s operating system and the
actual hard drives, organizing them into groups to make them easier to
manage. This helps speed up how fast your computer can read and write
data, and it also adds a layer of protection in case one of your hard drives
breaks down. So, it’s like having a smart helper that makes your hard
drives work better and keeps your important data safer.

Types of RAID Controller

There are three types of RAID controller:

Hardware Based: In hardware-based RAID, there's a physical controller that manages the whole array. This controller can handle the whole group of hard drives together. It's designed to work with different types of hard drives, like SATA (Serial Advanced Technology Attachment) or SCSI (Small Computer System Interface). Sometimes, this controller is built right into the computer's main board, making it easier to set up and manage your RAID system. It's like having a captain for your team of hard drives, making sure they work together smoothly.

Software Based: In software-based RAID, the controller doesn't have its own special hardware, so it uses the computer's main processor and memory to do its job. It performs the same functions as a hardware-based RAID controller, like managing the hard drives and keeping your data safe. But because it's sharing resources with other programs on your computer, it might not make things run as fast. So, while it's still helpful, it might not give you as big of a speed boost as a hardware-based RAID system.

Firmware Based: Firmware-based RAID controllers are like helpers built into the computer's main board. They work with the main processor, just like software-based RAID, but they only operate while the computer starts up. Once the operating system is running, a special driver takes over the RAID job. These controllers aren't as expensive as hardware ones, but they make the computer's main processor work harder. People also call them hardware-assisted software RAID, hybrid model RAID, or fake RAID.

Why Data Redundancy?

Data redundancy, although taking up extra space, adds to disk reliability. This means that in case of disk failure, if the same data is also backed up onto another disk, we can retrieve the data and go on with the operation. On the other hand, if the data is spread across multiple disks without the RAID technique, the loss of a single disk can affect the entire data.

Key Evaluation Points for a RAID System

 Reliability: How many disk faults can the system tolerate?

 Availability: What fraction of the total session time is the system in uptime mode, i.e., how available is the system for actual use?

 Performance: How good is the response time? How high is the throughput (rate of processing work)? Note that performance involves many parameters, not just these two.

 Capacity: Given a set of N disks each with B blocks, how much useful capacity is available to the user?

RAID is very transparent to the underlying system. This means that, to the host system, it appears as a single big disk presenting itself as a linear array of blocks. This allows older technologies to be replaced by RAID without making too many changes to existing code.

Different RAID Levels

 RAID-0 (Striping)

 RAID-1 (Mirroring)

 RAID-2 (Bit-Level Striping with Dedicated Parity)

 RAID-3 (Byte-Level Striping with Dedicated Parity)

 RAID-4 (Block-Level Striping with Dedicated Parity)

 RAID-5 (Block-Level Striping with Distributed Parity)

 RAID-6 (Block-Level Striping with Two Parity Bits)

[Figure: RAID controller]

1. RAID-0 (Striping)

 Blocks are "striped" across disks.

[Figure: RAID-0]

 In the figure, blocks “0,1,2,3” form a stripe.

 Instead of placing just one block into a disk at a time, we can work
with two (or more) blocks placed into a disk before moving on to
the next one.
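
The striping layout boils down to simple modular arithmetic. Below is a minimal sketch in Python of how a logical block number could map to a disk and a stripe position under RAID-0 (the function name and layout are illustrative, not a real controller interface):

```python
def raid0_map(block, n_disks):
    """Map a logical block number to (disk, stripe) under round-robin striping."""
    disk = block % n_disks      # consecutive blocks rotate across the disks
    stripe = block // n_disks   # position of the block within each disk
    return disk, stripe

# With 4 disks, blocks 0,1,2,3 form stripe 0; blocks 4,5,6,7 form stripe 1.
for b in range(8):
    disk, stripe = raid0_map(b, 4)
    print(f"logical block {b} -> disk {disk}, stripe {stripe}")
```

Reads and writes to blocks in the same stripe can hit all the disks in parallel, which is where the RAID-0 speedup comes from.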

Evaluation

 Reliability: 0
There is no duplication of data. Hence, a block once lost cannot be
recovered.

 Capacity: N*B
The entire space is being used to store data. Since there is no
duplication, N disks each having B blocks are fully utilized.

Advantages

 It is easy to implement.

 It utilizes the storage capacity in a better way.

Disadvantages

 A single drive loss can result in the complete failure of the system.

 It’s not a good choice for a critical system.

2. RAID-1 (Mirroring)

 More than one copy of each block is stored on a separate disk, so every block has two (or more) copies lying on different disks.

[Figure: RAID-1]

 The above figure shows a RAID-1 system with mirroring level 2.

 RAID-0 was unable to tolerate any disk failure, but RAID-1 provides reliability through duplication.

Evaluation

Assume a RAID system with mirroring level 2.

 Reliability: 1 to N/2
1 disk failure can be handled for certain because blocks of that disk
would have duplicates on some other disk. If we are lucky enough
and disks 0 and 2 fail, then again this can be handled as the blocks
of these disks have duplicates on disks 1 and 3. So, in the best case,
N/2 disk failures can be handled.

 Capacity: N*B/2
Only half the space is being used to store data. The other half is just
a mirror of the already stored data.
Advantages

 It provides complete redundancy.

 It can increase data security and read speed.

Disadvantages

 It is highly expensive.

 Usable storage capacity is only half of the total.

3. RAID-2 (Bit-Level Stripping with Dedicated Parity)

 In RAID-2, errors in the data are checked at the bit level, using the Hamming-code parity method to find errors in the data.

 It uses designated drives to store the parity bits.

 The structure of RAID-2 is very complex because extra disks are needed: some disks store the bits of each data word, while additional disks store the error-correction code.

 It is not commonly used.

Advantages

 It uses Hamming code for error correction.

 It uses designated drives to store parity.

Disadvantages

 It has a complex structure and high cost due to extra drive.

 It requires an extra drive for error detection.

4. RAID-3 (Byte-Level Stripping with Dedicated Parity)


 It consists of byte-level striping with dedicated parity.

 At this level, parity information is stored on a dedicated parity drive.

 Whenever a drive fails, the parity drive is accessed to reconstruct the data.

[Figure: RAID-3]

 Here Disk 3 contains the parity bits for Disk 0, Disk 1, and Disk 2. If data loss occurs, we can reconstruct it with Disk 3.

Advantages

 Data can be transferred in bulk.

 Data can be accessed in parallel.

Disadvantages

 It requires an additional drive for parity.


 In the case of small-size files, it performs slowly.

5. RAID-4 (Block-Level Striping with Dedicated Parity)

 Instead of duplicating data, this level adopts a parity-based approach.

[Figure: RAID-4]

 In the figure, we can observe one column (disk) dedicated to parity.

 Parity is calculated using a simple XOR function. If the data bits are 0,0,0,1, the parity bit is XOR(0,0,0,1) = 1. If the data bits are 0,1,1,0, the parity bit is XOR(0,1,1,0) = 0. A simple rule: an even number of ones results in parity 0, and an odd number of ones results in parity 1.

[Figure: RAID-4 parity example]
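
Because parity is a plain XOR, both computing it and rebuilding a lost block are one pass over the remaining disks. The following is a minimal sketch in Python, assuming one byte-string block per disk (illustrative code, not a real controller API):

```python
from functools import reduce

def parity(blocks):
    """XOR the blocks together, byte by byte, to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x06", b"\x0b", b"\x0d"]       # data blocks on disks 0..2
p = parity(data)                         # parity block on the dedicated disk

# Reconstruct a lost block by XOR-ing the survivors with the parity block.
recovered = parity([data[0], data[2], p])
assert recovered == data[1]              # the lost block is recovered exactly
```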

 Assume that in the above figure, C3 is lost due to some disk failure.
Then, we can recompute the data bit stored in C3 by looking at the
values of all the other columns and the parity bit. This allows us to
recover lost data.

Evaluation

 Reliability: 1
RAID-4 allows recovery of at most 1 disk failure (because of the way
parity works). If more than one disk fails, there is no way to recover
the data.

 Capacity: (N-1)*B
One disk in the system is reserved for storing the parity. Hence, (N-
1) disks are made available for data storage, each disk having B
blocks.

Advantages

 It helps in reconstructing the data if at most one disk is lost.

Disadvantages

 It cannot help reconstruct data when more than one disk is lost.

6. RAID-5 (Block-Level Striping with Distributed Parity)

 This is a slight modification of the RAID-4 system, where the only difference is that the parity rotates among the drives.

[Figure: RAID-5]

 In the figure, we can notice how the parity block "rotates".

 This was introduced to make the random write performance better.
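
A common left-symmetric layout simply rotates the parity block one disk to the left on each successive stripe. A tiny sketch of that rule (this rotation is one convention among several; real arrays vary):

```python
def parity_disk(stripe, n_disks):
    """Disk that holds the parity block for a given stripe (left-symmetric)."""
    return (n_disks - 1 - stripe) % n_disks

for stripe in range(5):
    print(f"stripe {stripe}: parity on disk {parity_disk(stripe, 4)}")
# stripe 0 -> disk 3, stripe 1 -> disk 2, ... wrapping around the array,
# so parity writes are spread evenly instead of hammering one disk.
```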

Evaluation

 Reliability: 1
RAID-5 allows recovery of at most 1 disk failure (because of the way
parity works). If more than one disk fails, there is no way to recover
the data. This is identical to RAID-4.

 Capacity: (N-1)*B
Overall, space equivalent to one disk is utilized in storing the parity.
Hence, (N-1) disks are made available for data storage, each disk
having B blocks.

Advantages

 Data can be reconstructed using parity bits.

 It makes the performance better.

Disadvantages

 Its technology is complex and extra space is required.

 If two disks fail simultaneously, data will be lost forever.

7. RAID-6 (Block-Level Striping with Two Parity Bits)

 Raid-6 helps when there is more than one disk failure. A pair of
independent parities are generated and stored on multiple disks at
this level. Ideally, you need four disk drives for this level.

 There are also hybrid RAIDs, which make use of more than one
RAID level nested one after the other, to fulfill specific
requirements.
[Figure: RAID-6]

Advantages

 Very high data accessibility.

 Fast read data transactions.

Disadvantages

 Due to double parity, it has slow write data transactions.

 Extra space is required.

Advantages of RAID

 Data redundancy: By keeping multiple copies of the data on many disks, RAID can shield data from disk failures.

 Performance enhancement: RAID can enhance performance by distributing data over several drives, enabling the simultaneous execution of several read/write operations.

 Scalability: RAID is scalable, so the storage capacity may be expanded by adding more disks to the array.

 Versatility: RAID is applicable to a wide range of devices, such as workstations, servers, and personal computers.

Disadvantages of RAID

 Cost: RAID implementation can be costly, particularly for arrays with large capacities.

 Complexity: The setup and management of RAID can be challenging.

 Decreased performance: The parity calculations necessary for some RAID configurations, including RAID 5 and RAID 6, may result in a decrease in write speed.

 Single point of failure: RAID offers data redundancy, but it is not a comprehensive backup solution. The array's whole contents could be lost if the RAID controller malfunctions.

Storage Device

The storage unit is a part of the computer system which is employed to store the information and instructions to be processed. A storage device is an integral part of the computer hardware which stores information/data to process the result of any computational work. Without a storage device, a computer would not be able to run or even boot up. In other words, a storage device is hardware that is used for storing, porting, or extracting data files. It can store information/data both temporarily and permanently.

Types of Computer Memory

1. Primary Memory

2. Secondary Memory

3. Tertiary Memory

1. Primary Memory: It is also known as internal memory and main memory. This is the section of memory that the CPU works with directly, holding program instructions, input data, and intermediate results. It is generally smaller in size. RAM (Random Access Memory) and ROM (Read Only Memory) are examples of primary storage.

2. Secondary Memory: Secondary memory is storage external to the CPU. It is mainly used for the permanent and long-term storage of programs and data. Hard disks, CDs, DVDs, pen/flash drives, SSDs, etc., are examples of secondary storage.

3. Tertiary Memory: Tertiary memory is a type of memory that is rarely used in personal computers, and due to this it is not considered an important one. Tertiary memory works automatically without human intervention.

Types of Computer Storage Devices

Now we will discuss the different types of storage devices available in the market. These storage devices have their own specifications and uses. Some of the commonly used storage devices are:

1. Primary Storage Devices

2. Magnetic Storage Devices

3. Flash memory Devices


4. Optical Storage Devices

5. Cloud and Virtual Storage

1. Primary Storage Devices

 RAM: It stands for Random Access Memory. It is used to store information that is needed immediately; in other words, it is a temporary memory. Computers bring the software installed on a hard disk into RAM to process it and to be used by the user. Once the computer is turned off, the data is deleted. With the help of RAM, computers can perform multiple tasks like loading applications, browsing the web, editing a spreadsheet, or playing the newest game. It allows you to switch quickly among these tasks, remembering where you are in one task when you switch to a different task. It is also used to load and run applications like your spreadsheet program, to respond to commands like the edits you make within the spreadsheet, and to toggle between multiple programs, as when you leave the spreadsheet to check email. Memory is nearly always actively employed by your computer. It ranges from 1 GB to 32/64 GB depending upon the specifications. There are different types of RAM, and although they all serve the same purpose, the most common ones are:

o SRAM: It stands for Static Random Access Memory. It consists of circuits that retain stored information as long as the power supply is on; it is thus a volatile memory. It is used to build cache memory. The access time of SRAM is lower and it is much faster than DRAM, but in terms of cost it is more expensive than DRAM.

o DRAM: It stands for Dynamic Random Access Memory. It stores binary bits in the form of electrical charges applied to capacitors. The access time of DRAM is slower than SRAM, but it is cheaper and has a high packaging density.

o SDRAM: It stands for Synchronous Dynamic Random Access Memory. It is faster than plain DRAM and is widely used in computers. After SDRAM was introduced, upgraded versions of double data rate RAM (DDR1, DDR2, DDR3, and DDR4) entered the market and are widely used in home/office desktops and laptops.

 ROM: It stands for Read-Only Memory. The data stored in these devices is non-volatile, i.e., once the data is stored in the memory it cannot be modified or deleted. It is memory that can only be read, not written to. The information is stored permanently, once, during manufacture. ROM stores the instructions that are used to start a computer; this operation is referred to as bootstrapping. It is also used in other electronic items like washers and microwaves. ROM chips can only store a few megabytes (MB) of data, typically between 4 and 8 MB per chip. Common types of ROM include:

o PROM: Programmable Read-Only Memory. These are ROMs that can be programmed once. A special PROM programmer is employed to enter the program onto the PROM. Once the chip has been programmed, the information on the PROM cannot be altered. PROM is non-volatile, that is, data is not lost when power is switched off.

o EPROM: Erasable Programmable Read-Only Memory. It is possible to erase the information previously stored on an EPROM and write new data onto the chip.

o EEPROM: Electrically Erasable Programmable Read-Only Memory. Here, data can be erased without using ultraviolet light, by just applying an electric field.

[Figure: Primary storage devices]

2. Magnetic Storage Devices

 Floppy Disk: A floppy disk is also known as a floppy diskette. It is generally used on a personal computer to store data externally. A floppy disk is made up of a plastic cartridge secured with a protective case. Nowadays the floppy disk has been replaced by newer and more effective storage devices like USB drives, etc.

 Hard Disk: A hard disk drive (HDD) is a storage device that stores and retrieves data using magnetic storage. It is a non-volatile storage device whose contents can be modified or deleted any number of times without any problem. Most computers and laptops use HDDs as their secondary storage device. It is actually a set of stacked disks, just like phonograph records. On every hard disk, the data is recorded electromagnetically in concentric circles, or tracks, present on the hard disk, and a head, similar to a phonograph arm (but fixed in position), reads the information present on the track. The read/write speed of HDDs is not especially fast, but decent. Capacity ranges from a few GB to several TB.

 Magnetic Card: This is a card in which data is stored by modifying or rearranging the magnetism of tiny iron-based magnetic particles present on a band of the card. It is also known as a swipe card. It is used like a passcode (to enter a house or hotel room), credit card, identity card, etc.

 Tape Cassette: Also known as a music cassette, it is a rectangular flat container in which the data is stored on an analog magnetic tape. It is generally used to store audio recordings.

 SuperDisk: It is also called LS-240 and LS-120. It was introduced by Imation Corporation and was popular with OEM computers. It can store data up to 240 MB.

[Figure: Magnetic storage devices]

3. Flash Memory Devices

Flash memory is a cheaper and more portable kind of storage. It is one of the most commonly used ways to store data because it is more reliable and efficient compared to other storage devices. Some of the commonly used flash memory devices are:

 Pen Drive: Also known as a USB flash drive, it includes flash memory with an integrated USB interface. We can directly connect these devices to our computers and laptops and read/write data in a much faster and more efficient way. These devices are very portable. Capacity generally ranges from 1 GB to 256 GB.

 SSD: It stands for Solid State Drive, a mass storage device like an HDD. It is more durable because it does not contain spinning platters and moving heads like hard disks. It needs less power than hard disks, is lightweight, and has up to 10x faster read and write speeds than hard disks. But SSDs are costlier as well. While SSDs serve an equivalent function to hard drives, their internal components are much different. Unlike hard drives, SSDs don't have any moving parts, and thus they're called solid-state drives. Instead of storing data on magnetic platters, SSDs store data using non-volatile flash storage. Since SSDs have no moving parts, they do not need to "spin up". Capacity ranges from about 150 GB to several TB.

 SD Card: Known as a Secure Digital card, it is generally used with electronic devices like phones, digital cameras, etc. to store larger data. It is portable, and the size of the SD card is small so it can easily fit into electronic devices. It is available in different sizes like 2 GB, 4 GB, 8 GB, etc.

 Memory Card: It is generally used in digital cameras, printers, game consoles, etc. It is also used to store large amounts of data and is available in different sizes. To use a memory card on a computer you require a separate memory card reader.

 Multimedia Card: Also known as MMC, it is an integrated circuit that is generally used in car radios, digital cameras, etc. It is an external device to store data/information.

[Figure: Flash memory devices]

4. Optical Storage Devices

Optical storage devices are also secondary storage devices. They are removable storage devices. Following are some optical storage devices:

 CD: It is known as a Compact Disc. It contains tracks and sectors on its surface to store data. It is made up of polycarbonate plastic and is circular in shape. A CD can store data up to 700 MB. It is of two types:

o CD-R: It stands for Compact Disc Recordable. On this type of CD, once the data is written it cannot be erased; afterwards it is read-only.

o CD-RW: It stands for Compact Disc ReWritable. On this type of CD, you can easily write or erase data multiple times.

 DVD: It is known as a Digital Versatile Disc. DVDs are circular flat optical discs used to store data. They come in two different sizes: 4.7 GB single-layer discs and 8.5 GB double-layer discs. DVDs look like CDs, but the storage capacity of DVDs is greater. It is of two types:

o DVD-R: It stands for DVD Recordable. On this type of DVD, once the data is written it cannot be erased; afterwards it is read-only. It is generally used to write movies, etc.

o DVD-RW: It stands for DVD ReWritable. On this type of DVD, you can easily write or erase data multiple times.

 Blu-ray Disc: It is just like a CD or DVD, but the storage capacity of a Blu-ray disc is up to 25 GB. To run a Blu-ray disc you need a separate Blu-ray reader. Blu-ray technology reads the disc with a blue-violet laser, whose shorter wavelength allows information to be stored at greater density.

[Figure: Optical storage devices]

5. Cloud and Virtual Storage

Nowadays, secondary memory has been complemented by virtual or cloud storage. We can store our files and other data in the cloud, and the data is stored for as long as we pay for the cloud storage. Many companies provide cloud services, notably Google, Amazon, and Microsoft. We pay rent for the amount of space we need and get multiple benefits out of it. Though the data is actually stored on physical devices located in the data centers of the service provider, the user doesn't interact with the physical device or its maintenance. For example, Amazon Web Services offers AWS S3, a type of storage where users can store data virtually instead of on their own physical hard drives. Such innovations represent the frontier of where storage media is going.

[Figure: Cloud and virtual storage]


Characteristics of Computer Storage Devices

 Data stored in the memory can be changed or replaced in case of a requirement, because of the mobility of storage devices.

 Storage devices ensure that saved data can be replaced or deleted as per the requirements, because storage devices are easily readable, writeable, and rewritable.

 Storage devices are easy and convenient to access, because they do not require much skill to handle these resources.

 The storage capacity of these devices is an extra advantage to the system.

 Storage devices have good performance, and data can be easily transferred from one device to another.

Parallel processing

Parallel processing is used to increase the computational speed of computer systems by performing multiple data-processing operations simultaneously. For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system can have two or more ALUs and be able to execute multiple instructions at the same time. The amount of hardware increases with parallel processing, and with it, the cost of the system increases. But technological development has reduced hardware costs to the point where parallel processing methods are economically feasible. Parallel processing can be viewed at multiple levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used: shift registers work one bit at a time in a serial fashion, while parallel registers work simultaneously with all bits of the word.

At higher levels of complexity, parallel processing comes from having a plurality of functional units that perform separate or similar operations simultaneously. Parallel processing is established by distributing the data among several functional units. As an example, arithmetic, shift, and logic operations can be divided among three units, and the operations are performed in each unit under the supervision of a control unit. One possible method of dividing the execution unit into eight functional units operating in parallel is shown in the figure. Depending on the operation specified by the instruction, operands in the registers are transferred to the unit associated with those operands. The operation performed in each functional unit is denoted in each block of the diagram. Arithmetic operations on integer numbers are performed by the adder and integer multiplier.

Floating-point operations can be divided into three circuits operating in parallel. Logic, shift, and increment operations are performed concurrently on different data, and all units are independent of each other, so one number can be shifted while another number is being incremented. Generally, a multi-functional organization is associated with a complex control unit that coordinates all the activities among the several components.

The main advantage of parallel processing is that it provides better utilization of system resources by increasing resource multiplicity, which improves overall system throughput.
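
As a software analogy of distributing data among several functional units, the sketch below uses Python's standard multiprocessing pool: the same operation is applied to different pieces of data by parallel workers (in real hardware the "workers" are adders, shifters, and other units under one control unit):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four workers run in parallel
        results = pool.map(square, range(8))   # data distributed among them
    print(results)                             # [0, 1, 4, 9, 16, 25, 36, 49]
```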

Hardware multithreading
Types of Hardware Multithreading

There are two primary types of hardware multithreading: coarse-grained and fine-grained. Coarse-grained multithreading switches threads at a coarse granularity, where each thread is given a designated time slice to execute before switching to the next thread. Fine-grained multithreading, on the other hand, switches threads at a much smaller granularity, typically at the cycle or sub-cycle level. Fine-grained multithreading allows for even more efficient utilization of the processor core, as it minimizes the impact of pipeline stalls and other latency-inducing events.

In coarse-grained multithreading, each thread has its own instruction buffer, resources, and register set. Fine-grained multithreading, however, typically shares these resources among all active threads, which requires careful management and scheduling.

Both types of hardware multithreading have their own advantages and trade-offs. Coarse-grained multithreading is simpler to implement and requires less complex hardware, but it may lead to inefficiencies when threads experience significant differences in execution times. Fine-grained multithreading, on the other hand, maximizes core utilization and can better tolerate latency, but it requires more advanced hardware support and scheduling algorithms.

Coarse-Grained Multithreading

In coarse-grained multithreading, each thread is given a designated time slice, known as a quantum, to execute before the processor switches to another thread. The quantum is typically short, ranging from a few hundred cycles to a few thousand cycles, depending on the architecture and workload characteristics.

When one thread encounters a long-latency operation, such as a cache miss or a branch misprediction, the processor can switch to another thread that is ready for execution. This allows the processor to continue executing useful work while waiting for the completion of the long-latency operation.

Coarse-grained multithreading is relatively straightforward to implement in hardware. The processor maintains multiple sets of key resources, such as register sets and instruction buffers, one for each active thread. The switching between threads can be performed during pipeline flushes or other idle cycles, ensuring smooth progression of the execution.

Fine-Grained Multithreading

Fine-grained multithreading operates at a much smaller granularity than coarse-grained multithreading. Instead of running each thread for a long stretch of instructions, the processor can switch threads at the cycle or sub-cycle level, allowing for more interleaved execution.

In fine-grained multithreading, the processor maintains a single set of key resources, such as instruction buffers and register sets, which are shared among all active threads. Each thread is allocated a limited amount of resources, and the processor dynamically schedules the execution of instructions from different threads in a round-robin or priority-based manner.

One of the key advantages of fine-grained multithreading is its ability to maximize core utilization even in the presence of long-latency operations. When one thread encounters a stall, such as a cache miss or a branch misprediction, the processor can continue executing instructions from other threads, making efficient use of the pipeline resources.
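
The contrast between the two policies can be seen in a toy simulation; the instruction streams and the "MISS" event below are made up purely for illustration:

```python
threads = {"T0": ["i0", "i1", "MISS", "i2"],
           "T1": ["j0", "j1", "j2", "j3"]}

# Fine-grained: switch threads every cycle, round-robin.
fine = [instrs[k] for k in range(4) for instrs in threads.values()]
print("fine-grained  :", fine)

# Coarse-grained: run a thread until a long-latency event, then switch.
coarse = []
for _name, instrs in threads.items():
    for ins in instrs:
        coarse.append(ins)
        if ins == "MISS":          # e.g. a cache miss triggers the switch
            break
print("coarse-grained:", coarse)
```

In the fine-grained trace, instructions from T0 and T1 alternate every cycle; in the coarse-grained trace, T0 runs until its miss and T1 then takes over the pipeline.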

Benefits of Hardware Multithreading

Hardware multithreading offers several benefits in computer architecture, including improved utilization of core resources, increased throughput, and better performance under certain workloads.

One of the main advantages of hardware multithreading is its ability to maximize core utilization. By allowing the processor to switch between multiple threads, idle time can be significantly reduced, leading to better overall system performance. This is particularly beneficial in situations where threads experience long-latency operations, such as memory accesses or I/O operations.

Furthermore, hardware multithreading can increase the overall throughput of a system by executing multiple threads simultaneously. This is especially useful in applications that can be parallelized, such as scientific simulations, rendering, and data processing. With hardware multithreading, multiple threads can be executed in parallel, effectively reducing the time required to complete a task.

Another advantage of hardware multithreading is its ability to provide better performance under certain workloads. For example, workloads with a mix of both compute-intensive and memory-intensive tasks can benefit from hardware multithreading by allowing the processor to switch between threads and allocate resources more efficiently.

Improved Resource Utilization

Hardware multithreading improves resource utilization by minimizing idle time and maximizing the utilization of core resources. By allowing the processor to switch between threads during latency-inducing events, such as cache misses or branch mispredictions, idle time can be reduced and resources can be effectively utilized.

For example, in coarse-grained multithreading, when one thread encounters a cache miss, instead of waiting for the cache access to complete, the processor can switch to another thread that is ready to execute. This enables the execution of useful work during the cache miss, effectively hiding the latency.

In fine-grained multithreading, the sharing of resources among multiple threads allows for a more efficient allocation of resources. The processor can dynamically schedule the execution of instructions from different threads, interleaving instructions from different threads at a very fine granularity. This improves overall resource utilization and allows for better performance in latency-sensitive workloads.

Implementation Techniques for Hardware Multithreading

There are several implementation techniques for hardware multithreading, depending on the specific goals and requirements of the architecture.

One common technique used in many modern processors is simultaneous multithreading (SMT). SMT allows multiple threads to execute simultaneously on a single processor core by leveraging the available core resources, such as instruction fetch, decode, and execution units.

In SMT, the processor maintains multiple threads' contexts and interleaves their execution at the pipeline level. This allows for better utilization of core resources and improved performance in multi-threaded workloads. SMT can be implemented in both coarse-grained and fine-grained configurations, depending on the level of thread-switching granularity.

Another implementation technique for hardware multithreading is clustered multithreading. In clustered multithreading, the processor core is divided into multiple clusters, each with its own set of resources, such as instruction buffers and register sets. Each cluster is responsible for executing a specific subset of threads, and the switching between clusters occurs at a coarser granularity compared to fine-grained multithreading.

Clustered multithreading can provide a good balance between performance and complexity. By dividing the core into clusters, the processor can effectively manage the sharing of resources among threads while minimizing the overhead associated with fine-grained multithreading.

Simultaneous Multithreading (SMT)

SMT is a widely used implementation technique for hardware multithreading in modern processors. It allows for the concurrent execution of multiple threads on a single processor core by leveraging the available resources.

In SMT, the processor maintains the context for multiple threads and interleaves their execution at the pipeline level. This means that instructions from different threads can be fetched, decoded, and executed simultaneously, leading to improved core utilization and better overall system performance.

SMT can be implemented in various configurations, such as 2-way or 4-way SMT, depending on the specific architecture. Each configuration determines the number of threads that can be executed simultaneously on a single core.

Clustered Multithreading
Clustered multithreading divides the processor core into multiple
clusters, with each cluster responsible for executing a specific subset of
threads. Each cluster has its own set of resources, including instruction
buffers and register sets, which are shared among the threads within
the cluster.

The switching between clusters occurs at a coarser granularity compared to fine-grained multithreading, which helps minimize the complexity associated with fine-grained multithreading while still providing efficient resource utilization.

Clustered multithreading can be advantageous in situations where there is a need for some resource sharing among threads, but the overhead of fully fine-grained multithreading is too high.

Exploring Another Dimension of Hardware Multithreading in Computer Architecture

Within the realm of hardware multithreading, another important aspect to consider is the relationship between hardware multithreading and instruction-level parallelism.

Instruction-level parallelism (ILP) refers to the ability of a processor to execute instructions in parallel by identifying and exploiting independent instruction-level operations from a single instruction stream. In this sense, ILP extracts parallelism from within a single thread, so that the execution of multiple instructions occurs simultaneously.

Hardware multithreading and ILP are complementary techniques that can work together to enhance processor performance. While hardware multithreading allows for the execution of multiple threads concurrently, ILP focuses on extracting parallelism within a single thread.

By combining hardware multithreading and ILP, processors can achieve even greater throughput and performance improvements. Hardware multithreading can help overcome the limitations of ILP when there are dependencies or stalls in the instruction stream, while ILP can exploit parallelism within a single thread that is not suitable for hardware multithreading.

Vector (SIMD) processing

What is vector processing?

Vector processing is a computing method that can process numerous data components at once. It operates on every element of the entire vector in one operation, or in parallel, to avoid the overhead of a processing loop. The simultaneous operations must be independent of one another in order for vector processing to be effective.
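
A convenient way to see the idea in software is NumPy, whose array operations are backed by vectorized kernels: one expression acts on every element, avoiding the per-iteration overhead of an explicit loop. The sketch below mirrors the C(1:50) = A(1:50) + B(1:50) example used later in this section:

```python
import numpy as np

a = np.arange(50.0)
b = np.arange(50.0)

# Scalar style: one element at a time, paying loop overhead per iteration.
c_scalar = np.empty(50)
for i in range(50):
    c_scalar[i] = a[i] + b[i]

# Vector style: the whole operation C(1:50) = A(1:50) + B(1:50) at once.
c_vector = a + b
assert (c_scalar == c_vector).all()
```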

Vector processing vs. array and parallel processing

Now, let's understand the important difference between array processing and vector processing. Arrays are groups of data elements that are kept in close proximity to one another in memory, and they're frequently used to represent parallel-processable datasets. The term "vector processing", by contrast, describes the simultaneous processing of many data units using specialized technology. The distinction is that vector processing uses a single processor to execute the same operation on numerous data items concurrently, whereas array processing uses several processors to work on individual array elements.

The difference between parallel processing and vector processing is that parallel processing involves multiple processors working on separate tasks simultaneously, whereas vector processing involves a single processor performing the same operation on multiple data elements simultaneously.

How vector processing works

Let's get a deeper understanding of vector processing by exploring how it works.

1. A vector is an ordered, one-dimensional array of data elements. A vector V of length n can be represented as the row vector V = [V1, V2, V3, ..., Vn]. A column vector could be used instead if the data elements are listed in a column.

2. For a processor with multiple ALUs, one instruction can be used to perform parallel operations on several data elements. These instructions are usually referred to as vector instructions. Unlike scalar processors that operate on one value at a time, vector processors can operate on multiple values simultaneously. This makes them ideal for applications where many data points are processed at once, such as image processing and graphics rendering.

3. In vector processing, consecutive pairs of items are processed every clock cycle. Two pairs of items can be handled concurrently with dual vector pipes and dual sets of vector functional units. The results are delivered to the relevant parts of the result register as soon as each pair of operations is finished. The procedure continues until the number of processed items reaches the count in the vector length register. For instance, C(1:50) = A(1:50) + B(1:50), the operation sketched in the code above.

4. Such a vector instruction includes the starting addresses of the two source operands and the one destination operand, the length of the vectors, and the operation to be carried out.

Features of vector processing

For specific kinds of computing applications, vector processing performs very well. Its key features are as follows:

1. Simultaneous operations: These are achieved through the use of specialized hardware that can process multiple data elements in parallel.

2. High performance: Vector processing can achieve high performance by exploiting data parallelism and reducing memory accesses. This means that vector processors can perform computations faster than traditional processors, particularly for tasks that involve repeated operations on large datasets.

3. Scalability: Vector processors can scale up to handle larger datasets without sacrificing performance.

4. Limited instruction set: Vector processors have a limited instruction set that's optimized for numerical computations.

5. Data alignment: Vector processors require data to be aligned in memory to achieve optimal performance. This means the data must be stored in contiguous memory locations so that the processor can access it efficiently.
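
The data-alignment point can be demonstrated with NumPy: slicing an array with a step yields non-contiguous data, which vectorized kernels handle less efficiently, and copying it back into contiguous storage restores the fast layout (the array contents here are arbitrary):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
strided = a[::2]                        # every other element: not contiguous
print(strided.flags['C_CONTIGUOUS'])    # False

packed = np.ascontiguousarray(strided)  # copy into contiguous memory
print(packed.flags['C_CONTIGUOUS'])     # True
```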

Types of vector processing

Vector processing provides higher performance than traditional CPU or GPU architectures because it's able to handle more data at once. And we all know how vital high performance is when you're working on graphics-related use cases. There are two main types of vector processing: SIMD and MIMD.

Computer designs like SIMD and MIMD are used to enhance the
efficiency of specific computing activities. The amount of data and
instruction streams serve as the classification’s basis. The computer
architecture known as SIMD, or single instruction multiple data,
allows for the execution of a single instruction across numerous data
streams. In contrast, MIMD (multiple instruction multiple data)
computer architectures can carry out a number of instructions on a
number of data streams.

Single Instruction Multiple Data (SIMD)

SIMD architecture executes the same instruction on multiple data sets simultaneously. This means that all processors in a SIMD system perform the same operation on different pieces of data. This architecture is often used in applications such as multimedia processing, where the same operation needs to be performed on multiple datasets at the same time.

Multiple Instruction Multiple Data (MIMD)

MIMD architecture, on the other hand, allows multiple processors to execute different instructions on different datasets at the same time. Each processor in an MIMD system has its own program counter and instruction stream, allowing it to operate independently of the other processors in the system. This architecture is often used in applications such as scientific computing, where different processors need to perform different calculations simultaneously.

Key difference between SIMD and MIMD

With multiple processors carrying out the same task in parallel, SIMD
is frequently employed for issues needing numerous computations.
With each component allocated to a distinct processor for
simultaneous solutions, MIMD is widely employed for issues that
divide algorithms into independent and separate portions.

Technically, SIMD and MIMD differ from one another. Although MIMD processors are capable of much more complicated functions, SIMD processors are often simpler, smaller, faster, and cheaper. Complex processes must be carried out sequentially by SIMD processors, but they can be carried out concurrently by MIMD processors.

Advantages of vector processing

There are several advantages of vector processing:

1. Increased performance: Numerical and scientific computing applications can perform much better thanks to vector processing. Vector processors are substantially quicker than standard processors because they can operate on numerous data components at once.

2. Reduced memory access: The number of memory accesses necessary to process huge datasets is decreased through vector processing. Because the processor can access numerous data components in a single transaction, less time is wasted waiting for data to be loaded into memory.

3. Improved energy efficiency: Because less time is spent waiting for data to be loaded into memory, vector processing can be more energy-efficient than conventional processing. The processor can do jobs more quickly, which lowers the amount of energy needed to complete the task.

4. Scalability: Due to its tremendous scalability, vector processing can handle bigger datasets without compromising performance. This is because vector processors are made to carry out the same action on several data components at once.

5. Optimized for numerical computations: With their focus on numerical calculations, vector processors have a small set of instructions that are specifically designed for them. This means they can complete numerical calculations more quickly than conventional CPUs.

Disadvantages of vector processing

There are also some disadvantages of vector processing:

1. Limited applicability: Tasks requiring repetitive operations on huge datasets benefit most from vector processing. It might not be appropriate for tasks requiring intricate branching and conditional logic.

2. Data alignment requirements: For vector processors to operate at their best, the data must be aligned in memory. For some applications, especially those that use atypical data structures, this can be difficult.

3. Limited software support: Specialized, hardware-optimized software is needed for vector processing. For some applications, especially those that need to work in a variety of software environments, this can be a constraint.

4. Cost: For high-end systems that demand specialized hardware, vector processors can be more expensive than conventional CPUs.

Ways to improve vector processing methods

To increase the efficiency of vector processing, there are different ways to reduce the overhead on the processor. Here are some of the popular approaches:

 Improving instructions: The pipeline can be improved to integrate scalar instructions of the same type, and vector instructions can be improved by reducing memory accesses.

 Algorithm: The choice of algorithm is important; it should be picked wisely according to the use case, so that it runs fast and integrates easily with vector processors.

 Vectorizing compiler: When a high-level programming language is used, a vectorizing compiler must regenerate the parallelism.

Cache coherence
Cache coherence: In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. In a shared-memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Example: the cache and the main memory may have inconsistent copies of the same object.

 Suppose there are three processors, each having a cache, and consider the following scenario:

 Processor 1 reads X: it obtains 24 from the memory and caches it.

 Processor 2 reads X: it obtains 24 from memory and caches it.

 Processor 1 then writes X = 64: its locally cached copy is updated. Now, when processor 3 reads X, what value should it get?

 Memory and processor 2 think it is 24, while processor 1 thinks it is 64.

 As multiple processors operate in parallel and independently, multiple caches may possess different copies of the same memory block; this creates the cache coherence problem. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion. There are three distinct levels of cache coherence:

 Every write operation appears to occur instantaneously.

 All processors see exactly the same sequence of changes of values for each separate operand.

 Different processors may see an operation and assume different sequences of values; this is known as non-coherent behavior.

 There are various cache coherence protocols in multiprocessor systems. These are:

 MSI protocol (Modified, Shared, Invalid)

 MOSI protocol (Modified, Owned, Shared, Invalid)

 MESI protocol (Modified, Exclusive, Shared, Invalid)

 MOESI protocol (Modified, Owned, Exclusive, Shared, Invalid)


 These important terms are discussed as follows:

 Modified – The value in the cache is dirty, that is, the value in the current cache is different from that in main memory.

 Exclusive – The value present in the cache is the same as that in main memory, that is, the value is clean.

 Shared – The cache value holds the most recent data copy, and it is shared among all the caches and main memory as well.

 Owned – The current cache holds the block and is now the owner of that block, that is, it has all rights over that particular block.

 Invalid – The current cache block is invalid, and the data must be fetched from another cache or from main memory.


 Coherency mechanisms: There are three types of coherence mechanisms:

 Directory-based – In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.

 Snooping – First introduced in 1983, snooping is a process where the individual caches monitor address lines for accesses to memory locations that they have cached. It is called a write-invalidate protocol: when a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.

 Snarfing – This is a mechanism where a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies a location in main memory. When a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snarfed memory location with the new data.
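
The write-invalidate idea behind snooping can be illustrated with a toy simulation of the X = 24 / X = 64 scenario from earlier. This is a write-through sketch for clarity only; real protocols track per-line states (MSI, MESI, etc.) and usually write back lazily:

```python
memory = {"X": 24}
caches = [dict(), dict(), dict()]        # one private cache per processor

def read(p, addr):
    if addr not in caches[p]:
        caches[p][addr] = memory[addr]   # miss: fetch from main memory
    return caches[p][addr]

def write(p, addr, value):
    caches[p][addr] = value              # update the local copy
    memory[addr] = value                 # write through to main memory
    for q, cache in enumerate(caches):   # snoop: invalidate other copies
        if q != p:
            cache.pop(addr, None)

read(0, "X"); read(1, "X")               # P0 and P1 both cache X = 24
write(0, "X", 64)                        # P0 writes; P1's stale copy is gone
print(read(2, "X"))                      # 64 - P2 sees the new value
print(read(1, "X"))                      # 64 - P1 re-fetches after invalidate
```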

Message-passing Multicomputers
Multicomputers are message-passing machines which apply a packet-switching method to exchange data. Here, each processor has a private memory, but there is no global address space, as a processor can access only its own local memory.

A message-passing multicomputer is a parallel computing system where multiple computers, or nodes, communicate with each other using messages to complete a task. Each node works on a portion of the overall computing problem. The nodes must synchronize their actions, exchange data, and receive command and control over the entire system.

Message passing is a common technique in computer science and is used in many models of concurrency and object-oriented programming. It's also used to allow objects and systems on different computers to interact, such as over the internet.

The Message Passing Interface (MPI) is a standard set of functions that defines how to perform these tasks in a message-passing system. MPI is considered the industry standard for message passing and is used with programming languages like C, C++, Fortran, and Java.
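
Here is a minimal message-passing sketch using mpi4py, a Python binding for MPI; it would be launched with something like `mpirun -n 2 python demo.py`, and the payload is just an example:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # each process (node) has its own rank

if rank == 0:
    comm.send({"task": "part-1"}, dest=1, tag=0)   # explicit send
elif rank == 1:
    msg = comm.recv(source=0, tag=0)               # explicit receive
    print("node 1 received:", msg)
```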

Multiprocessor System Interconnects

Parallel processing needs efficient system interconnects for fast communication among the input/output and peripheral devices, multiprocessors, and shared memory.

Hierarchical Bus Systems

A hierarchical bus system consists of a hierarchy of buses connecting various systems and sub-systems/components in a computer. Each bus is made up of a number of signal, control, and power lines. Different buses, like local buses, backplane buses, and I/O buses, are used to perform different interconnection functions.

Local buses are the buses implemented on printed-circuit boards. A backplane bus is a printed circuit on which many connectors are used to plug in functional boards. Buses which connect input/output devices to a computer system are known as I/O buses.

Crossbar switch and Multiport Memory

Switched networks give dynamic interconnections among the inputs and outputs. Small or medium-size systems mostly use crossbar networks. Multistage networks can be expanded to larger systems if the increased latency problem can be solved.

Both the crossbar switch and the multiport memory organization are single-stage networks. Though a single-stage network is cheaper to build, multiple passes may be needed to establish certain connections. A multistage network has more than one stage of switch boxes, and these networks should be able to connect any input to any output.

Multistage and Combining Networks

Multistage networks, or multistage interconnection networks, are a class of high-speed computer networks mainly composed of processing elements on one end of the network and memory elements on the other end, connected by switching elements.

These networks are applied to build larger multiprocessor systems. Examples include the Omega network, the Butterfly network, and many more.

Multicomputers

Multicomputers are distributed-memory MIMD architectures.

[Figure: conceptual model of a multicomputer]

Multicomputers are message-passing machines which apply a packet-switching method to exchange data. Here, each processor has a private memory, but there is no global address space, as a processor can access only its own local memory. So, communication is not transparent: programmers have to explicitly put communication primitives in their code.

Having no globally accessible memory is a drawback of multicomputers. This can be addressed by using the following two schemes:

 Virtual Shared Memory (VSM)

 Shared Virtual Memory (SVM)

In these schemes, the application programmer assumes a big shared memory which is globally addressable. If required, the memory references made by applications are translated into the message-passing paradigm.

Virtual Shared Memory (VSM)

VSM is a hardware implementation. The virtual memory system of the operating system is transparently implemented on top of VSM, so the operating system thinks it is running on a machine with shared memory.

Shared Virtual Memory (SVM)

SVM is a software implementation at the operating-system level, with hardware support from the Memory Management Unit (MMU) of the processor. Here, the unit of sharing is an operating-system memory page.

If a processor addresses a particular memory location, the MMU determines whether the memory page associated with the memory access is in the local memory or not. If the page is not in memory, in a normal computer system it is swapped in from the disk by the operating system. But in SVM, the operating system fetches the page from the remote node which owns that particular page.


Three Generations of Multicomputers

In this section, we will discuss three generations of multicomputers.

Design Choices in the Past

While selecting a processor technology, a multicomputer designer


chooses low-cost medium grain processors as building blocks. Majority
of parallel computers are built with standard off-the-shelf
microprocessors. Distributed memory was chosen for multi-computers
rather than using shared memory, which would limit the scalability. Each
processor has its own local memory unit.

For interconnection scheme, multicomputers have message passing,


point-to-point direct networks rather than address switching networks.
For control strategy, designer of multi-computers choose the
asynchronous MIMD, MPMD, and SMPD operations. Caltech’s Cosmic
Cube (Seitz, 1983) is the first of the first generation multi-computers.

Present and Future Development

The next generation evolved from medium-grain to fine-grain multicomputers using a globally shared virtual memory. Second-generation multicomputers are still in use at present, and with better processors such as the i386 and i860 they have improved considerably.

Third-generation computers are the next generation, in which VLSI-implemented nodes will be used. Each node may have a 14-MIPS processor, 20-Mbytes/s routing channels, and 16 Kbytes of RAM integrated on a single chip.

The Intel Paragon System

Previously, homogeneous nodes were used to build hypercube multicomputers, and all functions were given to the host. This limited the I/O bandwidth, so these computers could not be used to solve large-scale problems efficiently or with high throughput. The Intel Paragon System was designed to overcome this difficulty. It turned the multicomputer into an application server with multiuser access in a network environment.
Message Passing Mechanisms

Message-passing mechanisms in a multicomputer network need special hardware and software support. In this section, we will discuss some schemes.

Message-Routing Schemes

In multicomputers with a store-and-forward routing scheme, packets are the smallest unit of information transmission. In wormhole-routed networks, packets are further divided into flits. Packet length is determined by the routing scheme and network implementation, whereas flit length is affected by the network size.

In store-and-forward routing, packets are the basic unit of information transmission. In this case, each node uses a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes. Latency is directly proportional to the distance between the source and the destination.

In wormhole routing, the transmission from the source node to the


destination node is done through a sequence of routers. All the flits of
the same packet are transmitted in an inseparable sequence in a
pipelined fashion. In this case, only the header flit knows where the
packet is going.
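A commonly used first-order model makes the contrast concrete (a sketch that ignores contention and per-hop router delay; the numbers below are illustrative, not from the text). With D hops, packet length L, flit length f, and channel bandwidth B:

```latex
T_{\text{SF}} \approx D \cdot \frac{L}{B}
\qquad\qquad
T_{\text{WH}} \approx \frac{L}{B} + D \cdot \frac{f}{B}
```

For example, with L = 1024 bytes, f = 8 bytes, B = 1 GB/s, and D = 10 hops: T_SF ≈ 10 × 1.024 µs ≈ 10.2 µs, while T_WH ≈ 1.024 µs + 10 × 8 ns ≈ 1.1 µs. This shows why store-and-forward latency grows in proportion to distance, while wormhole latency is nearly distance-insensitive once the packet is long relative to a flit.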

Deadlock and Virtual Channels

A virtual channel is a logical link between two nodes. It is formed by a flit buffer in the source node, a flit buffer in the receiver node, and a physical channel between them. When a physical channel is allocated for a pair, one source buffer is paired with one receiver buffer to form a virtual channel.

When all the channels are occupied by messages and none of the channels in the cycle is freed, a deadlock occurs. To avoid this, a deadlock-avoidance scheme has to be followed.

Graphics Processing Unit


A Graphics Processing Unit (GPU) is a specialized electronic circuit in a
computer that speeds up the processing of images and videos in a
computer system. Initially created for graphics tasks, GPUs have
transformed into potent parallel processors with applications extending
beyond visual computing. This in-depth exploration will cover the
history, architecture, operation, and various uses of GPUs.

Table of Content

 GPU Meaning and Usage

 Diverse Applications of GPUs

 GPU vs. CPU

 Streaming Multiprocessors (SMs)

 Memory Hierarchy

 Parallel Processing

 GPU Applications Beyond Graphics

 Functioning of GPUs

 Challenges and Future Trends in GPU Technology


GPU Meaning and Usage

A GPU, or Graphics Processing Unit, is a specialized part of a computer or phone that handles graphics and images. It is very good at displaying videos, running games, and doing anything that needs demanding visuals.

The main uses of a GPU in a computer are described in the sections below.

Diverse Applications of GPUs

Graphics Processing Units (GPUs), renowned for their role in graphics


rendering, have evolved into versatile tools with applications spanning
various domains, driven by their exceptional parallel processing
capabilities. The diverse applications of GPUs showcase their
transformative impact across industries:

1. Gaming

GPUs play a key role in gaming by providing realistic graphics and smooth
animations. Their parallel architecture ensures a better gaming
experience with lifelike visuals and immersive gameplay.

2. Deep Learning and AI

GPUs play a pivotal role in the realm of artificial intelligence (AI) and deep
learning. Their parallel architecture accelerates the matrix calculations
essential for training and running deep neural networks. This
acceleration significantly boosts the efficiency of machine learning
models, making GPUs integral to AI applications.

3. Scientific Computing

In scientific computing, GPUs find extensive use in simulations and


calculations. From climate modeling and fluid dynamics to molecular
dynamics and astrophysics, the parallel processing capabilities of GPUs
expedite complex computations. Scientists benefit from accelerated
simulations, enabling faster insights and discoveries.
4. Medical Imaging

The processing speed of medical imaging tasks, such as MRI and CT scans,
is enhanced by GPUs. GPU acceleration enables real-time rendering and
analysis of volumetric data, contributing to quicker diagnoses and
improved medical imaging outcomes.

5. Cryptocurrency Mining

The parallel processing power of GPUs has found utility in cryptocurrency


mining. Complex mathematical calculations required for validating
transactions on blockchain networks are efficiently handled by GPUs.
Cryptocurrency miners leverage GPUs for their computational prowess
in this domain.

GPU vs. CPU

In the world of computing, two essential components play crucial roles in the performance and functionality of modern devices: the GPU (Graphics Processing Unit) and the CPU (Central Processing Unit). Each serves distinct purposes, but their roles often intersect in complex tasks. In short, a CPU has a few powerful cores optimized for low-latency execution of sequential code and complex control flow, while a GPU has many simpler cores optimized for high-throughput execution of the same operation across large amounts of data.

Streaming Multiprocessors (SMs)

At the heart of a Graphics Processing Unit (GPU) lies the concept of


Streaming Multiprocessors (SMs), defining the core processing units
responsible for the execution of tasks.

In NVIDIA's architecture, these SMs comprise multiple CUDA (Compute


Unified Device Architecture) cores, while in AMD's architecture, they are
referred to as Stream Processors. The essence of SMs lies in their
concurrent operation, enabling the GPU to handle and execute multiple
tasks simultaneously.

Each SM acts as a powerhouse, capable of performing a multitude of


operations concurrently. The parallelism achieved through SMs is a
fundamental characteristic of GPU architecture, making it exceptionally
efficient in handling tasks that can be parallelized. This parallel
processing capability is particularly advantageous in scenarios where
tasks involve a vast number of repetitive calculations or operations.
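To make the SM discussion concrete, here is a minimal CUDA vector-addition kernel (a sketch; the array size and launch configuration are illustrative). The grid of thread blocks is distributed across the SMs by the hardware scheduler, and within each block the threads execute in parallel:

```cuda
#include <cstdio>

// Each thread computes one element: the classic data-parallel pattern.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                          // threads per block
    int blocks = (N + threads - 1) / threads;   // blocks spread over the SMs
    vecAdd<<<blocks, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```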

Memory Hierarchy

The memory hierarchy of GPUs is a critical aspect that significantly


influences their performance. GPUs come equipped with dedicated
memory known as Video RAM (VRAM), specifically designed to store
data essential for graphics processing. The efficiency of memory
management directly impacts the overall performance of the GPU.

The memory hierarchy within a GPU includes different levels, such as


global memory, shared memory, and registers. Global memory serves as
the primary storage for data that needs to be accessed by all threads.

Level | Type | Characteristics | Proximity to GPU Cores | Examples
Global | GDDR (Graphics DDR) | High capacity, moderate speed | Far | GDDR5, GDDR6, HBM (High Bandwidth Memory)
Device | GPU (Device) | On-chip, shared among all GPU cores | On-chip | Shared L2 cache, L1 cache
Shared | Shared Memory | On-chip, shared within a GPU block (thread block) | On-chip | Shared memory within a CUDA thread block
Texture | Texture Memory | Optimized for texture mapping and filtering | On-chip | Specialized for texture operations
Constant | Constant Memory | Read-only data shared among all threads | On-chip | Read-only data for all threads
L1 Cache | Level 1 Cache | Fast, private cache for each GPU core | On-chip | L1 cache for individual GPU cores
L2 Cache | Level 2 Cache | Larger, shared cache for all GPU cores | On-chip | L2 cache shared among all GPU cores
Registers | Register File | Fastest, private storage for individual threads | On-chip | Registers allocated to each thread

Shared memory is a faster but smaller memory space that allows threads
within the same block to share data. Registers are the smallest and
fastest memory units residing on the GPU cores for rapid access during
computation.

Efficient memory management involves optimizing the utilization of


these memory types based on the specific requirements of tasks. It
ensures that data is swiftly accessed, processed, and shared among
different components of the GPU, contributing to enhanced overall
performance.
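A short CUDA sketch of how these levels interact in practice (illustrative; it assumes a power-of-two block size of 256): each thread reads from global memory into a register, stages its value in on-chip shared memory, and the block cooperatively reduces the tile before a single write back to global memory:

```cuda
// Block-level sum reduction. Assumes blockDim.x == 256 (power of two).
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                 // fast on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;           // 'v' lives in a register
    tile[threadIdx.x] = v;                      // stage data in shared memory
    __syncthreads();

    // Tree reduction entirely within shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one global write per block
}
```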

Parallel Processing

Parallel processing stands as a cornerstone of GPU architecture, making


it exceptionally well-suited for tasks that can be parallelized. In parallel
processing, multiple operations are executed simultaneously, a
capability harnessed through the presence of multiple cores within SMs.

GPU Applications Beyond Graphics

In financial modeling, GPUs offer speed boosts for intricate simulations,


aiding risk assessment. Autonomous vehicles and robotics rely on GPU
efficiency for real-time object detection and decision-making. The broad
impact showcases GPUs as versatile tools shaping advancements in
technology.

1. Data Science and Machine Learning

GPUs have become instrumental in accelerating data science and


machine learning tasks. The parallel architecture of GPUs, designed for
handling massive parallel computations, aligns seamlessly with the
requirements of training and running complex machine-learning models.

Aspect | Description
Deep Learning Acceleration | GPU-accelerated matrix operations significantly speed up deep neural network training.
Model Training and Inference Speed | GPUs reduce the time for both model training and inference, enabling faster development cycles.
Large-Scale Data Processing | Efficient handling of large datasets for tasks like preprocessing and feature extraction.
Simulation and Optimization | Accelerates tasks like Monte Carlo simulations and optimization algorithms.
GPU-Accelerated Libraries and APIs | Frameworks and APIs enable developers to optimize code for GPU architectures.
Support for ML Frameworks | Integration with popular ML frameworks like TensorFlow and PyTorch for seamless GPU usage.
Scalability with GPU Clusters | GPU clusters and cloud services facilitate scalable and distributed computing.
Energy Efficiency | GPUs are often more energy-efficient, providing cost-effective solutions for ML tasks.
Integration with AI Hardware | Specialized hardware enhancements, such as Tensor Cores, improve performance for AI workloads.
Democratization of AI | GPU availability broadens access to high-performance computing for researchers and developers.

Frameworks like TensorFlow and PyTorch leverage GPU capabilities to


significantly reduce the time required for model training, enabling rapid
advancements in artificial intelligence.

2. Cryptocurrency Mining

The parallel processing power of GPUs finds unconventional yet


impactful applications in cryptocurrency mining. Cryptocurrencies, like
Bitcoin, rely on complex mathematical calculations to validate
transactions on blockchain networks.
GPUs excel in parallelizing these calculations, providing miners with a
powerful tool for efficient and competitive mining. While specialized
hardware (ASICs) has emerged in this domain, GPUs remain accessible
and versatile for various cryptocurrency mining endeavors.

3. Computational Biology and Drug Discovery

The computational demands of tasks in biology, such as molecular


dynamics simulations and protein folding studies, align with the parallel
processing capabilities of GPUs.

Researchers in computational biology leverage GPUs to accelerate


simulations, gaining insights into biological processes. Additionally, in
drug discovery, where extensive computational analyses are required,
GPUs play a crucial role in speeding up the identification of potential drug
candidates.

4. Financial Modeling and Simulation

In the financial sector, where complex mathematical models and


simulations are essential for risk assessment and decision-making, GPUs
offer a significant boost in processing speed.

Financial analysts utilize GPU-accelerated computations to run intricate


models, conduct simulations, and analyze vast datasets efficiently. This
accelerates the pace of financial analysis and contributes to more
informed decision-making.

5. Autonomous Vehicles and Robotics

The demanding computational requirements of autonomous vehicles


and robotics benefit from the parallel processing capabilities of GPUs.
Tasks such as real-time object detection, image recognition, and sensor
fusion rely on the efficiency of GPU architecture. This application extends
to the field of robotics, where GPUs contribute to enhancing the
perception and decision-making capabilities of autonomous systems.

Functioning of GPUs

Graphics Processing Units (GPUs) function as specialized processors


designed to handle parallelizable tasks, complementing the Central
Processing Unit (CPU) in a computer system.

The functioning of GPUs revolves around optimizing parallel processing


for tasks that benefit from concurrent execution. The symbiotic
relationship between CPUs and GPUs ensures a balanced distribution of
workload, enhancing overall system performance.

The operational dynamics of GPUs can be elucidated through key aspects


that define their role and efficiency:

1. Task Offloading

GPUs operate on the principle of task offloading, taking on parallelizable


tasks from the CPU. Tasks that exhibit parallel characteristics, such as
graphics rendering or processing extensive datasets, are delegated to the
GPU. This strategic offloading optimizes the overall processing speed and
efficiency of the system, allowing the CPU to focus on non-parallel tasks
without unnecessary workload.
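The offload pattern can be sketched as follows (illustrative CUDA host code; the saxpy kernel and sizes are not from the text): the CPU copies inputs to the device, launches the parallel kernel, and copies results back, remaining free for other work in between:

```cuda
// y = a*x + y, computed on the GPU.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void offload_saxpy(float a, const float *hx, float *hy, int n) {
    float *dx, *dy;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
    saxpy<<<(n + 255) / 256, 256>>>(a, dx, dy, n);      // GPU does the work
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);  // results back to CPU
    cudaFree(dx);
    cudaFree(dy);
}
```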

2. Data Parallelism

A defining strength of GPUs lies in their prowess in data parallelism. In


scenarios where the same operation needs to be performed on multiple
sets of data simultaneously, GPUs excel.
This attribute is particularly advantageous in graphics rendering, where
pixels or vertices can be processed independently. The ability to handle
data in parallel significantly accelerates the processing of tasks, making
GPUs indispensable for graphics-intensive applications.

3. APIs and Shaders

Application Programming Interfaces (APIs) act as bridges connecting


software applications with the GPU. Prominent APIs like DirectX and
OpenGL facilitate seamless communication, enabling software to
leverage the capabilities of the GPU.

Shaders, programmable units within the GPU, play a crucial role.


Developers utilize shaders to write code tailored for specific tasks,
fostering customization and flexibility. Shaders are instrumental in tasks
like rendering complex graphics, where precise control is paramount.

4. GPGPU (General-Purpose computing on GPUs)

GPUs have evolved beyond their initial focus on graphics rendering and
are increasingly harnessed for general-purpose computing through
GPGPU. General-purpose computing on GPUs extends the utility of GPUs
to a broader spectrum of applications.

Developers can use GPUs for various tasks like scientific simulations and
machine learning, thanks to their ability to handle multiple tasks
simultaneously. This extends the use of GPUs beyond just graphics,
making them essential for a wide range of computational challenges.

GPUs, with their focus on data parallelism, customizable shaders, and


adaptability for general-purpose computing, have become pivotal
components in modern computing architectures.
Whether enhancing gaming experiences, accelerating scientific
simulations, or driving advancements in artificial intelligence, GPUs
continue to shape the landscape of high-performance computing.

Challenges and Future Trends in GPU Technology

While Graphics Processing Units (GPUs) have revolutionized computing


in various domains, they encounter challenges and are subject to
ongoing developments that shape their future trajectory:

1. Energy Efficiency

The power consumption of high-performance GPUs, particularly in data


centers, poses a challenge to sustainability efforts. Ongoing research and
development focus on enhancing the energy efficiency of GPUs,
addressing concerns related to power consumption and environmental
impact.

2. Ray Tracing

Ray tracing, a sophisticated rendering technique for achieving realistic


lighting effects in graphics, places additional computational demands on
GPUs. Advances in both hardware and algorithms dedicated to ray
tracing are underway, aiming to further enhance graphical realism while
optimizing computational efficiency.

3. Quantum Computing and Hybrid Approaches

The advent of quantum computing introduces challenges to traditional


computing paradigms, including GPUs. Researchers are exploring hybrid
approaches that leverage the strengths of both GPUs and emerging
quantum technologies. This pursuit aims to navigate the evolving
landscape of computing capabilities.
4. Edge Computing

The shift towards edge computing, particularly in applications like


autonomous vehicles and Internet of Things (IoT) devices, necessitates
GPUs optimized for edge computing workloads. The demand for efficient
and powerful GPUs at the edge is rising, prompting developments in
architecture and design tailored for edge-centric applications.

CLUSTERS AND WAREHOUSE SCALE COMPUTERS

Warehouse-scale computers (WSCs) form the foundation of internet services. Present-day WSCs act as one giant machine. The main parts of a WSC are the building with its electrical and cooling infrastructure, the networking equipment, and the servers.

WSCs as Servers

The following features of WSCs make them work as servers:

 Cost-performance: Because of the scale, cost-performance becomes very critical. Even small savings can amount to a large amount of money.

 Energy efficiency: Since large numbers of systems are clustered together, a lot of money is invested in power distribution and heat dissipation. Work done per joule is critical for both WSCs and servers because of the high cost of building the power and mechanical infrastructure for a warehouse of computers and the monthly utility bills to power the servers. Servers that are not energy-efficient increase the cost of electricity, the cost of the infrastructure that provides electricity, and the cost of the infrastructure that cools the servers.

 Dependability via redundancy: The hardware and software in a WSC must collectively provide at least 99.99% availability, while individual servers are much less reliable. Redundancy is the key to dependability for both WSCs and servers. WSC architects rely on multiple cost-effective servers connected by a low-cost network, with redundancy managed by software. Multiple WSCs may be needed to handle faults in whole WSCs; multiple WSCs also reduce latency for services that are widely deployed.
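A back-of-the-envelope illustration of why redundancy matters (the per-server figure here is assumed, not from the text): 99.99% availability permits only about 52.6 minutes of downtime per year, yet two independent replicas that are each only 99% available can together reach that target under software-managed failover:

```latex
(1 - 0.9999) \times 365 \times 24 \times 60 \approx 52.6\ \text{minutes of downtime per year}
```

```latex
A_{\text{pair}} = 1 - (1 - 0.99)^2 = 0.9999
```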
 Network I/O: Networking is needed to interface with the public as well as to keep data consistent between multiple WSCs.

 Interactive and batch-processing workloads: Search and social networks are interactive and require fast response times. At the same time, indexing, big-data analytics, etc. create a lot of batch-processing workloads. WSC workloads must be designed to tolerate large numbers of component faults without affecting overall performance and availability.

Differences between WSCs and Data Centers

Data Centers | WSCs
Host services for multiple providers | Run by only one client
Little commonality between hardware and software | Homogeneous hardware and software management
Third-party software solutions | In-house middleware

WSCs are not servers. The following features of WSCs make them different from servers:

 Ample parallelism: Unlike a server architect, a WSC architect need not worry about whether the target applications have enough parallelism to justify the amount of parallel hardware, because in WSCs most jobs are totally independent and exploit request-level parallelism. Request-level parallelism (RLP) is a way of representing tasks as sets of requests that can run in parallel. In interactive internet service applications, the workload consists of independent requests from millions of users. Also, the data of many batch applications can be processed in independent chunks, exploiting data-level parallelism.

 Operational costs count: Server architects normally design systems for peak performance within a cost budget, whereas WSC architects must also account for operational costs such as power and cooling over the lifetime of the equipment.
