Computer Organization
Coprocessors &
Processor Virtualization
● The CPU gives the coprocessor an instruction or set of instructions and tells it to
execute them.
● The coprocessor is more independent and runs largely on its own.
● Because network traffic arrives faster than a general-purpose CPU can process it,
special network processors have been developed to handle the traffic.
● To discuss how network processors work, we first need a general overview of how
networking works. Let’s start with a brief introduction to networking!
Introduction to networking
● Computer networks are categorized into
○ Local-area Network (LAN) and Wide-area Network (WAN)
● Ethernet is the most popular LAN, connecting multiple computers within a
building or campus. Its original form was a single coaxial cable that computers
tapped into (with a "vampire tap"), but modern versions attach the computers to a
central switch.
● Speed of Ethernet
○ Original Ethernet – 3 Mbps
○ First commercial Ethernet (10M) – 10 Mbps
○ Fast Ethernet (100M) – 100 Mbps
○ 1G – 1 Gbps
○ 10G – 10 Gbps
○ 40G, 100G, 400G, etc…
Introduction to networking (cont’d), WAN
● Say you want to see a cute kitten picture on the Internet. Let’s observe what
happens when you click on a web page that contains many pictures of kittens.
1. Your browser first establishes a connection to the Web server using TCP
(Transmission Control Protocol).
2. The browser sends a packet containing a GET PAGE request using the
HTTP (HyperText Transfer Protocol) to the server.
Introduction to networking (cont’d), Network Software (cont’d)
3. Web browser formats the GET PAGE request as a correct HTTP message and then
hands it to the TCP software to transmit over the connection.
4. The TCP software adds a header in front of the message containing a sequence
number and other information. This header is naturally called the TCP header.
5. The TCP software takes the TCP header and payload (containing the GET PAGE
request) and passes it to another piece of software that implements the IP protocol
(Internet Protocol).
6. This software attaches an IP header to the front. The IP header contains
● the source address (the machine the packet is coming from)
● the destination address (the machine the packet is supposed to go to)
● how many more hops the packet may live (to prevent lost packets from living forever)
● a checksum (to detect transmission and memory errors)
● and other fields.
Introduction to networking (cont’d), Network Software (cont’d)
7. The resulting packet (which constitutes the IP header, TCP header, and GET
PAGE request) is passed down to the data link layer, where a data link header is
attached to the front for actual transmission.
8. The data link layer also adds a checksum to the end called a CRC (Cyclic
Redundancy Code) used to detect transmission errors.
2. Pipeline organization
○ each PPE performs one processing step and then feeds a pointer to its output packet
to the next PPE in the pipeline
○ the PPE pipeline acts very much like a CPU pipeline
○ PPEs are completely programmable.
Introduction to Network Processors (cont’d)
PPE Multithreading
Path Selection
● Most network processors use a special fast path to
handle plain old garden-variety data packets.
● Other packets need to be managed differently and more
carefully by the control processor; these are sent down the
slow path.
● Each packet is routed through whichever path
suits it best.
Destination Network Determination
● IP packets contain a 32-bit destination address.
● It is not feasible to have a 2^32-entry table for looking up the destination of each IP
packet.
● The leftmost part of the address is the network number; the rest specifies a machine
on that network.
● Network numbers can be any length, so determining the destination network
number is nontrivial, and multiple matches are possible (the longest match wins).
● To handle this step, a custom ASIC (Application-Specific Integrated Circuit) is
often used.
Route Lookup
● Once the number of the destination network is known, the outgoing line can
be looked up in a table in SRAM.
● A custom ASIC may be used in this step too.
Fragmentation and Reassembly
Header Management
● Sometimes headers need to be added or removed, or
some of their fields modified.
● For example, the IP header contains a hop count that limits
how many hops a packet may make before being discarded.
● Every time the packet is forwarded, this count must be
decremented.
Queue Management
● Incoming and outgoing packets have to be queued while waiting their turn to
be processed.
● Certain types of applications, such as video or games, need specific
interpacket spacing to work smoothly (i.e., to avoid jitter).
Checksum Generation
● Outgoing packets need to be checksummed.
● The IP checksum can be generated by the network
processor, but the Ethernet CRC is generally computed
by hardware.
Accounting
● Accounting for packet traffic is needed in some cases, especially when one
network is forwarding traffic for other networks.
Statistics Gathering
● Finally,many organizations like to collect statistics
about their traffic.
● The network processor is a good place to collect
statistics such as how many packets came in and how many
went out, at what times of day, and more.
Improving Performance
❖ Improving performance of network processors is crucial for efficient data
handling.
❖ Performance can be measured using metrics like packets forwarded per second
and bytes forwarded per second.
Key Strategies For Enhancing Performance
❖ Clock Speed and Parallelism: Boosting the clock speed of the processor
can help, but the relationship isn't always straightforward due to
memory cycle time and heat issues. Adding more PPEs operating in
parallel, or deepening the PPE pipeline, can also enhance
performance.
❖ Memory Type Optimization: Replacing slower SDRAM with faster SRAM can
generally enhance performance, but at an expense.
Graphic Processors
The second area that uses coprocessors is high-resolution
graphics processing, such as 3D rendering.
Since an ordinary CPU cannot handle graphics processing, which
requires massive computation over large amounts of data, many modern
processors are equipped with GPUs (Graphics Processing Units).
Nvidia Fermi GPU
So, let’s talk about the Fermi GPU, which you may have heard of.
Architecture of Fermi GPU
● Organized into 16 Streaming Multiprocessors (SMs), the basic
building blocks of the GPU
● Each SM contains a private L1 cache and dedicated shared memory
● Each SM is composed of 32 CUDA (Compute Unified Device
Architecture) cores, the heart of the GPU
● A CUDA core is a simple processor that renders images for display on
monitors and TVs, and is also used for computer-vision applications
Architecture of Fermi GPU (Cont’d)
● The SMs share a single unified 768-KB L2 cache, which connects to a
multiported DRAM (Dynamic Random Access Memory)
● The Host Processor Interface connects the host system and the GPU through a
shared DRAM bus interface, typically PCIe
SIMD (Single Instruction, Multiple Data) Processing
● The Fermi architecture is designed to efficiently execute graphics,
video, and image-processing code, which applies the same operation
to many data elements
● This lets all cores within an SM execute the identical operation
in a given cycle
● This style of processing is called SIMD computation
Advantages of SIMD processing
● Each SM fetches and decodes an instruction only once per cycle and
shares it across all of its cores
● Only by sharing instruction fetch and decode across all cores could
NVIDIA cram 512 CUDA cores onto a single piece of silicon
● If programmers are clever enough to utilize the computation resources, the
system provides computational advantages over a traditional scalar
architecture
Problems with SIMD
● SIMD processing limits programmers by putting
constraints on the code.
● To be exact, the CUDA cores must run the same code in lock-step to
execute 16 operations simultaneously.
● This was a burden for programmers.
CUDA language
● To remove this burden, NVIDIA created the CUDA programming
language (with C, C++, and Fortran variants).
● The CUDA language specifies program parallelism using threads.
● Threads are grouped into blocks, which are then assigned to SMs.
● As long as the threads in a block execute the same code sequence (all
branches make the same decision), 16 operations can be executed
simultaneously.
Branch Divergence
● If threads on an SM make different branch decisions, branch
divergence occurs, which degrades performance.
● Divergence forces threads on different code paths to execute serially on
the SM.
● This reduces parallelism and slows GPU processing.
Countering Branch Divergence & GPGPUS
● Fortunately, there are techniques for avoiding branch divergence and achieving
good speed-ups.
● SIMD-style graphics processors provide benefits in areas such as
medical imaging, proof solving, financial prediction, and graph analysis.
● Because of this wide range of applications, such GPUs have earned the
nickname GPGPUs (General-Purpose Graphics Processing Units).
Fermi GPU Memory Hierarchy
Virtual Memory
❖ Virtual memory is a method that computers use to manage storage space to
keep systems running quickly and efficiently.
❖ Using the technique, operating systems can transfer data between different
types of storage, such as random access memory (RAM), also known as main
memory, and hard drive or solid-state disk storage.
❖ The term "Virtual Machine Monitor" (VMM) is the more traditional term,
used before "hypervisor" became popular. A VMM is the
software layer responsible for creating and managing virtual machines on a
physical host. It is the precursor to the modern concept of the hypervisor,
and the two terms are often used interchangeably.
Application Virtualization
❖ Application virtualization abstracts the operating system from the application
code and provides a degree of sandboxing.
❖ Application virtualization replaces portions of the runtime environment with a
virtualization layer that performs tasks such as intercepting disk I/O calls and
redirecting them to a sandboxed, virtualized disk environment.
❖ Application virtualization can encapsulate a complex software installation
process, consisting of hundreds of files installed in various directories, as well as
numerous Windows registry modifications, in an equivalent virtualized
environment contained within a single executable file.
❖ Simply copying the executable to a target system and running it brings up the
application as if the entire installation process had taken place on the target.
Network Virtualization
❖ Network virtualization is the connection of software-based emulations of
network components, such as switches, routers, firewalls, and
telecommunication networks in a manner that represents a physical
configuration of these components.
❖ This allows operating systems and the applications running on them to
interact with and communicate over the virtual network in the same manner
they would on a physical implementation of the same network architecture.
❖ A single physical network can be subdivided into multiple virtual local area
networks (VLANs), each of which appears to be a complete, isolated network
to all systems connected on the same VLAN.
❖ Multiple computer systems at the same physical location can be connected to
different VLANs, effectively placing them on separate networks.
Storage Virtualization
❖ A storage virtualization system manages the process of translating logical data
requests to physical data transfers. Logical data requests are addressed as block
locations within a disk partition. Following the logical-to-physical translation,
data transfers may ultimately interact with a storage device that has an
organization completely different from the logical disk partition.
❖ The process of accessing physical data given a logical address is similar to the
virtual-to-physical address translation process in virtual memory systems.
❖ The logical disk I/O request includes information such as a device identifier and
a logical block number.
❖ This request must be translated to a physical device identifier and block number.
The requested read or write operation then takes place on the physical disk.
❖ Storage virtualization also enables several improvements:
centralized management, replication, and data migration.
Categories of processor virtualization
❖ With full virtualization, binary code in operating systems and applications runs
in the virtual environment with no modifications whatsoever.
❖ Guest operating system code performing privileged operations executes under
the illusion that it has complete and sole access to all machine resources and
interfaces.
❖ The hypervisor manages interactions between guest operating systems and
host resources, and takes any steps needed to deconflict access to I/O devices
and other system resources for each virtual machine under its control.
Trap-and-emulate virtualization
❖ Gerald J. Popek and Robert P. Goldberg described the three properties a
hypervisor must implement to efficiently and fully virtualize a computer
system.
❖ Equivalence: Programs (including the guest operating system) running in a
hypervisor must exhibit essentially the same behavior as when they run directly
on machine hardware, excluding the effects of timing.
❖ Resource control: The hypervisor must have complete control over all of the
resources used by the virtual machine.
❖ Efficiency: A high percentage of instructions executed by the virtual machine
must run directly on the physical processor, without hypervisor intervention.
Trap-and-emulate virtualization (cont’d)
❖ The hardware and operating system of the computer on which it is running
must grant the hypervisor the power to fully control the virtual machines it
manages.
❖ In a hypervisor implementing the trap-and-emulate virtualization method,
portions of the hypervisor run with kernel privilege, while all guest operating
systems operate at the user privilege level.
❖ Kernel code within the guest operating systems executes normally until a
privileged instruction attempts to execute or a memory-access instruction
attempts to read or write memory outside the user-space address range
available to the guest operating system. When the guest attempts any of these
operations, a trap occurs.
Exception types: faults, traps, and aborts
❖ A fault is an exception that ends by restarting the instruction that caused the
exception. For example, a page fault occurs when a program attempts to
access a valid memory location that is currently inaccessible. After the page
fault handler completes, the triggering instruction is restarted, and execution
continues from that point.
❖ A trap is an exception that ends by continuing the execution with the
instruction following the triggering instruction. For example, execution
resumes after the exception triggered by a debugger breakpoint by
continuing with the next instruction.
❖ An abort represents a serious error condition that may be unrecoverable.
Problems such as errors accessing memory may cause aborts.
Paravirtualization
❖ In Paravirtualization, the hypervisor is installed on the device. Then, the
guest operating systems are installed into the environment.
❖ This virtualization method modifies the guest operating system to
communicate with the hypervisor, reducing the time the guest spends on
operations that are difficult or slow to perform in a virtual
environment.
❖ Paravirtualization helps to increase the performance of the system.
Moreover, the guest operating systems communicate with the hypervisor
using API calls.
Binary translation
❖ Problematic instructions within processor architectures that lack full support for
virtualization is to scan the binary code prior to execution to detect the presence
of nonvirtualizable instructions. Where such instructions are found, the code is
translated into virtualization-friendly instructions that produce identical effects.
❖ Static binary translation recompiles a set of executable images into a form ready
for execution in the virtual environment. This translation takes some time, but it is a
one-time process providing a set of system and user images that will continue to
work until new image versions are installed, necessitating a recompilation
procedure for the new images.
❖ Dynamic binary translation scans sections of code during program execution to
locate problematic instructions. When such instructions are encountered, they are
replaced with virtualizable instruction sequences.
Hardware emulation
❖ When emulating processor hardware, each instruction executing in an
emulated guest system must be translated to an equivalent instruction or
sequence of instructions in the host ISA.
❖ One example of hardware emulation tools is the open source QEMU
machine emulator and virtualizer, which supports the running of operating
systems for a wide variety of processor architectures on an impressive list
of differing architectures, with reasonably good performance.
Virtualization challenges
In this section, we will focus on the hosted (type 2) hypervisor, because this mode of
operation presents a few added challenges that a bare-metal (type 1) hypervisor,
which runs directly on the hardware and is optimized to support virtualization, may not face.
❖ In a type 2 hypervisor, the host operating system supports kernel and user modes,
as does the guest operating system. As the guest operating system and the
applications running within it request system services, the hypervisor must
intercept each request and translate it into a suitable call to the host kernel.
❖ In a virtualized environment, the hypervisor must manage the interfaces to
peripheral devices whenever the user requests interaction with the guest OS. The
degree of difficulty involved in implementing these capabilities depends on the
instruction set of the host computer.
Unsafe instructions
❖ Processor instructions that rely on or modify privileged system state
are referred to as unsafe. For the trap-and-emulate method to function in a
comprehensively secure and reliable manner, all unsafe instructions must
generate exceptions that trap to the hypervisor.
● Between 2005 and 2006, Intel and AMD released versions of their x86
processors containing hardware extensions supporting virtualization.
● These extensions resolved the problems caused by the privileged but
non-trapping instructions, enabling full system virtualization under the
Popek and Goldberg criteria.
● The extensions are called AMD-V in AMD processors and VT-x in Intel processors.
ARM processor virtualization