
Unit -1

Performance Tuning Concepts


Elements of Cloud Infrastructure, Hardware, Storage, Operating Systems, Hypervisors, Networks, Power Management, Introduction to the term Performance, Elements of Performance Tuning, Resource Management, Resource Provisioning and Monitoring, Process, Resource Allocation, Resource Monitoring, Performance Analysis and Performance Monitoring Tools.

What is Cloud Performance?

Cloud performance refers to the monitoring of applications and servers in private and public clouds. Cloud performance management tools are typically used to track not only cloud server performance, in terms of processor, memory and storage usage, but also how cloud-based applications are performing. Cloud performance monitoring helps administrators track the widely varying workloads to which cloud applications are subjected, noting potential problems under peak loads. Cloud performance software can also help identify instances where a particular application may require additional resources to prevent potential outages.


Many factors need to be considered for performance tuning, and these can become more complex in a cloud computing context, including:

• The service consumer, e.g. an application or a browser.
• The network speed, e.g. internal network speed and public network speed.
• Cloud provider tunables, e.g. database and network configurations.


The service consumer varies greatly depending on the type of cloud computing service consumed. Taking storage as an example, the native application is a consumer of the storage service. Tuning, or configuration, concerns the manner in which the service is consumed, including the use of any available caching systems that may eliminate many of the requests the application would otherwise make to the cloud provider for block storage services.
The network can be tuned in any number of ways, including the path and method of transmission from the service consumer to the service provider. Network tuning is typically a matter of configuration within the routers and other transmission points. This is often achieved by trial and error, and it must not impact others who are also using the network, so a compromise has to be found between the requirements of the cloud computing traffic and those of everyone else leveraging the network.
Lastly, the cloud computing provider normally exposes some tunables, generally around processing and the location of data. The main intention here is to place the data as close to the processing as possible, eliminating many of the unnecessary hops an internal process would otherwise need to fetch the data. For a highly interactive application that makes many database requests during the normal course of processing, placing the data as close as possible becomes an absolute requirement.


Elements of Cloud Infrastructure: Hardware

Server Technology
A server is a computer program that provides resources to the other computers in a network. Servers can run on a dedicated computer and provide several services in the network. The server manages the communication between the computers in the network through network protocols. The server hardware varies depending on the application running on the server.
Client-server types of service include:
• File Server: provides file management functionality.
• Print Server: provides printing functions and shares the printer for access across the network.
• Database Server: provides the database functionality running on the server.


Blade Server
A blade server is a single circuit board populated with hardware components such as processors, memory and network cards that would usually be found on multiple boards. Because blade servers use laptop technology, they are thin and require less power and cooling than other servers.
The individual blades in the chassis (cabinet) are connected using a bus system. The cabinets combine to form the blade server, and all blades in a cabinet share a common power supply and cooling resources. Each blade has its own operating system and software applications installed on it.


Rack Servers
A rack server is a server designed to fit into a rack. The rack contains multiple mounting slots and bays; each bay holds a hardware unit, which is screwed into place. A single rack can hold multiple servers and network components. Keeping all the servers in racks reduces the floor space used in the organization, and a cooling system is necessary to prevent excessive heat in the room. Each rack server has its own power supply. Rack servers come in different sizes.
U is the standard unit of measure of vertical usable space: the height of racks (the metal frames designed to hold the hardware devices) and cabinets. Servers come in sizes from 1U to 7U, where 1U is equal to 1.75 inches.


Enterprise Server
An enterprise server is a computer whose programs collectively serve the requirements of an enterprise. Mainframe computer systems were traditionally used as enterprise servers. Because of their ability to run enterprise-wide programs, UNIX servers and Wintel servers are also generally called enterprise servers.
Examples of enterprise servers are Sun Microsystems' UNIX-based systems, IBM iSeries systems, HP systems and so on.


High Performance Server


High-performance computer systems are the most powerful and flexible research instruments today. They are used in fields such as climatology, quantum chemistry, computational medicine, high-energy physics and many other areas. The hardware structure, or architecture, determines to a large extent what is and is not possible when it comes to speeding up the system.
HPC is an essential part of business success in many organizations. Over the past years, the HPC cluster has disrupted the supercomputing market. Typical HPC systems can deliver industry-leading, cost-effective performance.


Server Workload:

Server workload can be defined as the amount of processing that the server is assigned at a given time. The workload can consist of a certain amount of application programming running and of users interacting with the system's applications.

The workload can be used as a benchmark to evaluate the server in terms of its performance, which is divided into response time and throughput. Response time is the time between a user request and the response to that request. Throughput is the amount of work accomplished over a period of time. The amount of work handled by the server indicates the efficiency and performance of that particular server.


Memory Workload: Programs or instructions require memory to store data and perform intermediate computations. The amount of memory used by the server over a given period of time, or at a specific instant of time, is the memory workload. Usage of main memory is increased by the paging and segmentation processes, which use a lot of virtual memory; if the number of programs to be executed becomes large, more memory is needed and it must be managed effectively.

CPU Workload: The number of instructions executed by the processor during a given period of time indicates the CPU workload. More processing power is needed if the CPU is constantly overloaded. Performance can be improved for the same number of instructions by decreasing the number of cycles each instruction requires.

I/O Workload: The number of inputs received by a server and the number of outputs produced by the server over a particular duration of time is called the I/O workload.

Database Workload: The workload of a database is analyzed by determining the number of queries executed by the database over a given period of time, or the average number of queries executed at an instant of time.
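The figures above can be sampled programmatically. Below is a minimal sketch, assuming the third-party psutil library is installed (pip install psutil); it reports the CPU, memory and I/O workload of the machine it runs on over a short window.

```python
# Minimal sketch: sampling the workload metrics described above with the
# third-party psutil library (an assumption; any metrics source would do).
import psutil

def sample_workload(window_seconds=5):
    io_before = psutil.disk_io_counters()
    # cpu_percent() blocks for the window and returns average CPU utilisation.
    cpu_pct = psutil.cpu_percent(interval=window_seconds)
    io_after = psutil.disk_io_counters()
    mem = psutil.virtual_memory()

    print(f"CPU workload          : {cpu_pct:.1f} % over {window_seconds} s")
    print(f"Memory workload       : {mem.percent:.1f} % of {mem.total // 2**20} MiB in use")
    print(f"Disk reads in window  : {io_after.read_count - io_before.read_count}")
    print(f"Disk writes in window : {io_after.write_count - io_before.write_count}")

if __name__ == "__main__":
    sample_workload()
```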

Storage
Storage networking is the practice of linking storage devices together and connecting them to other IT networks. It provides a centralized repository for data that can be accessed by users, and it uses high-speed connections to provide fast performance.

Storage networking is used in reference to storage area networks (SANs), which link multiple storage devices and provide block-level storage.

Storage networking can also refer to network attached storage (NAS) devices. A NAS is a standalone device which connects to a network and provides file-level storage. The NAS unit typically does not use a keyboard or display and is controlled and configured over the network using a browser. A full operating system is not needed on a NAS device; usually a stripped-down operating system such as FreeNAS is used. Network attached storage uses file-based protocols such as NFS, Server Message Block (SMB/CIFS) and NCP (NetWare Core Protocol).

Storage networking can include direct-attached storage (DAS), network-attached storage (NAS) and storage area networks (SANs).

The benefits of storage networking are improved performance, reliability and availability; it also makes it easier to back up data for disaster recovery purposes. Storage networks are often used together with storage management technologies such as storage resource management software, virtualization and compression.

Types of Storage System: Storage is a technology consisting of components and recording media that retain digital data for future use. It is a core function and fundamental component of computers. All computers use a storage hierarchy, which places fast, expensive and small storage options close to the CPU. The fast volatile technologies are referred to as memory, and the slower persistent technologies are referred to as storage.

The computer represents data using the binary numeral system. Text, numbers, pictures and audio are converted into strings of bits, or binary digits, each having a value of 0 or 1. Storage in a computer is usually measured in bytes, Kilobytes (KB), Megabytes (MB), Gigabytes (GB) and, these days, Terabytes (TB).


Types of Storage:
Primary Storage:
Primary storage is also known as memory and is directly accessible to the CPU. The CPU continuously reads the instructions stored there and executes them as and when required. RAM, ROM and cache memory are examples of primary memory.

Secondary Storage:
Secondary storage is also referred to as external memory or auxiliary storage and is not directly accessible by the CPU. The computer uses input/output channels to access secondary storage and transfers the desired data via an intermediate area in primary storage. Secondary storage is non-volatile memory. Hard disks are an example of secondary storage. Other examples of secondary storage technologies are flash memory, floppy disks, magnetic tapes, optical devices such as CD and DVD, and Zip drives.


Storage Devices and Technologies: Hard Disk Drive

RAID or Disk Array

Redundant Array of Independent Disks (RAID) is a technology in which hard disk drives are grouped together using hardware or software and treated as a single data storage unit. Data is recorded across multiple hard disk drives in parallel, which can improve access speed significantly. The array of hard disk drives which forms the RAID can also be partitioned and assigned a file system.

RAID Functions:
• Striping
Striping is the process in which consecutive logical blocks of data are stored on consecutive physical disks that form the array.
• Mirroring
Mirroring is the process in which the data is written to the same block on two or more physical disks in the array.
• Parity Calculation
If there are N disks in the RAID array, N-1 consecutive blocks are used for storing data and the Nth block is used for storing parity. When any of the N-1 data blocks is altered, N-2 XOR calculations are performed over the N-1 blocks to recompute the parity. The data blocks and the parity block are written across the array of hard disk drives that forms the RAID array.
If one of the N blocks fails, the data in that particular block is reconstructed using N-2 XOR calculations on the remaining N-1 blocks. If two or more blocks fail in the RAID array, reconstruction of the data from the failed blocks is not possible.
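The parity scheme described above can be illustrated in a few lines. This is a toy sketch, not a RAID implementation: the parity block is the byte-wise XOR of the data blocks in a stripe, and a single failed block is rebuilt by XOR-ing the surviving blocks.

```python
# Toy illustration of RAID parity: parity = XOR of the N-1 data blocks,
# and any single lost block can be recovered by XOR-ing the survivors.

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# One stripe across a 4-disk array: three data blocks plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate the failure of disk 2 and rebuild its block from the remaining blocks.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data_blocks[1]      # the lost block is recovered exactly
print("Recovered block:", rebuilt)
```

If two blocks of the stripe are lost, the XOR above no longer has enough information, which is exactly the limitation stated in the last sentence.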

Optical Storage
Optical storage is a low-cost and reliable storage medium used in personal computers for incremental data archiving. Optical storage is available in one of three basic formats: Compact Disc (CD), Digital Versatile Disc (DVD) and Blu-ray Disc (BD). The per-unit media cost of recordable CDs, DVDs and Blu-ray discs is very low. However, compared to hard disk drives or tape drives, the capacities and speeds of optical discs are considerably lower.


Solid State Drives:


Solid state drives do not contain any moving parts, unlike magnetic drives. Solid state storage also appears as flash memory, thumb drives, USB flash drives, Memory Sticks and Secure Digital cards. SSDs are relatively expensive per unit of capacity compared to other storage types, but are very convenient for backing up data. SSD drives are now available in capacities from around 500 GB up to several TB.


Network and Online Storage:


Network storage is a method in which users' data is stored and backed up onto their company's network servers. Files stored online reside on a hard disk drive located remotely from the computer and can be accessed from a remote location. As internet access becomes more widespread, remote backup services are gaining popularity. Backing up user data over the internet to a remote location helps protect the data against worst-case scenarios such as fires, floods or earthquakes, which would destroy any backups stored in the same location.


FC-AL (Fibre Channel Arbitrated Loop)


FC-AL is a Fibre Channel topology used to connect devices in a loop, similar to a token ring network. A token is used to prevent data from colliding when two or more streams are sent at the same time. FC-AL passes data using a one-way loop technique.
FC-AL technology eliminates the need for expensive Fibre Channel switches and allows several servers and storage devices to be connected. The Fibre Channel arbitrated loop is also called simply an arbitrated loop.
Fibre Channel has three major topologies for connecting ports:
• Switched Fabric: a network topology using crossbar switches to connect devices.
• Point-to-Point: allows two-way data communication connecting one device to another.
• Arbitrated Loop: devices are connected in a loop, and only two devices can communicate at the same time. This is the topology used in FC-AL.
A Fibre Channel arbitrated loop can connect up to 127 devices, with one port attached to the fabric. In FC-AL only one port can transmit data at a time; FC-AL uses an arbitration signal to choose the port. Once selected by the arbitration signal, the port can use the Fibre Channel, a gigabit-speed network technology commonly used for network storage.


The FC-AL topology has the following properties:
• All devices share the same bandwidth in the loop.
• It can be cabled using hub or loop cabling.
• If a port malfunctions in the loop, all ports stop working.
• Two connected ports behave as an arbitrated loop, not as a point-to-point link.
• 127 ports (devices) can be supported, with one port attached to the fabric.
• It has a serial architecture compatible with the Small Computer System Interface (SCSI).
FC-AL topology is used where interconnection of storage devices is needed and where multiple node connections require high-bandwidth links.
FC-AL loops are of two types:
• Private loops: private loops are not connected to the fabric, so their nodes are inaccessible to nodes that do not belong to the loop.
• Public loops: public loops are connected to the fabric through one FL_Port and are accessible to other nodes that are not part of the loop.


FABRIC
Storage Area Network (SAN) fabric is the hardware used to connect workstations and servers to the storage devices in a SAN. Fibre Channel switching technology is used in a SAN fabric to enable any-server-to-any-storage-device connectivity.
A fabric is a network topology used to connect network nodes with each other using one or more network switches.
Switched Fabric in Fibre Channel:
Switched Fabric is a topology in which the devices are connected to each other through one or more Fibre Channel switches. This topology has the best scalability of the three Fibre Channel topologies (the others being Arbitrated Loop and Point-to-Point), as the traffic is spread across multiple physical links; it is also the only one that requires switches. The visibility among the various devices, also called nodes, in a fabric is controlled with zoning.
There are two methods of implementing zoning:
• Hardware Zoning: hardware zoning is a port-based zoning method. Physical ports are assigned to a zone; a port can be assigned to one zone or to multiple zones at the same time. Devices are tied to a particular port, so if a device is moved to a different port it falls outside its zone and the zoning must be reconfigured.
• Software Zoning: software zoning is an SNS-based (Simple Name Server) zoning method. The device's physical connectivity to a port plays no role in the definition of zones: even if a device is connected to a different port, it remains in the same zone.

Storage Area Network:


A Storage Area Network (SAN) is a dedicated network that carries data between computer
systems and storage devices, which can include tape and disk resources. SAN forms a
communication infrastructure providing physical connections and consists of a management
layer, which organizes the connections, storage elements, and computer systems so that the
data transfer is secure and robust.


Zoning
Zoning in a storage area network is the allocation of resources for device load balancing and for allowing access to data only to certain users. Zoning is similar to partitioning a file system. Zoning can be of two kinds, hard or soft. In hard zoning each device is assigned to a particular zone, and this assignment does not change. In soft zoning, device assignments can be changed to accommodate variations in demand on different servers in the network.
Zoning is used to minimize the risk of data corruption, to help secure data against hackers and to limit the spread of viruses and worms. The disadvantage of zoning is that it complicates scaling if the number of users and servers in a SAN increases significantly.


Storage Virtualization
Storage virtualization is a concept in which the storage system uses virtualization to enable better functionality and advanced features within and across storage systems. Storage virtualization hides the complexity of the SAN by pooling multiple storage devices together so that they appear as a single storage device.
Storage virtualization can be of three types:
•Host based: in host-based virtualization the virtualization layer is provided by a server, which presents a single drive to the applications. Host-based storage virtualization depends on software at the server, often at the OS level. A volume manager is the tool used to enable this functionality; it is configured so that several drives are presented as a single resource which can be divided as needed.
•Appliance based: in appliance-based virtualization, a hardware appliance that sits on the storage network provides the virtualization layer.
•Network based: network-based virtualization is similar to appliance-based virtualization except that it works at the switching level. Appliance- and network-based virtualization work at the storage infrastructure level. Data can be migrated from one storage unit to another without reconfiguring any of the servers; the virtualization layer handles the remapping.


Storage virtualization can be implemented using software applications. The main reasons to implement storage virtualization are:
• Improved storage management in an IT environment
• Better availability with automated management
• Better storage utilization
• Less energy usage
• Increase in loading and backup speed
• Cost effectiveness, with no need to purchase additional software and hardware
Storage virtualization can be applied to different storage functions such as physical storage, RAID arrays, LUNs, storage zones and logical volumes.
The two main types of virtualization are:
•Block Virtualization: the abstraction (separation) of logical storage (partitions) from physical storage so that a partition can be accessed without regard to the underlying physical storage. Separating logical storage from physical storage gives users greater flexibility in managing storage.
•File Virtualization: file virtualization eliminates the dependencies between the data accessed at the file level and the physical location where the files are stored. This optimizes storage use, aids server consolidation and makes it possible to perform non-disruptive file migrations.

Operating Systems

The collection of software that manages computer hardware resources and provides common services for computer programs is called an operating system. The operating system is required for the functioning of application programs.
Various tasks such as recognizing input from the keyboard, transferring output to the display, keeping track of files and directories on the disk and controlling the peripheral devices are performed by the operating system. It is also responsible for security, ensuring that unauthorized users do not access the system. The operating system also provides a consistent application interface, which is important when more than one type of computer uses the same operating system. The Application Program Interface (API) allows a developer to write an application on a particular computer and run the same application on another computer with a different amount of memory or storage.
Users normally interact with the operating system through a set of commands. The command processor, or command line interpreter, in the operating system accepts and executes the commands. A graphical user interface allows users to enter commands by pointing and clicking at objects on the screen. Examples of modern operating systems are Android, BSD, iOS, Linux, Microsoft Windows and IBM z/OS.


A real-time operating system (RTOS) is one that is used to control machinery, scientific instruments and industrial systems. An RTOS has minimal user-interface capability and no end-user interfaces; the system is delivered as a sealed box. An RTOS executes a particular operation precisely, in the same amount of time, every time it occurs.


Features of the operating system:


•Multi-User: a multi-user operating system allows two or more users to run programs at the same time.
•Multiprocessing: a multiprocessing operating system supports running a program on more than one CPU.
•Multitasking: a multitasking operating system allows more than one program to run concurrently.
•Multithreading: a multithreading operating system allows different parts of a single program to run concurrently, as in the sketch below.


Cloud Performance Tuning


The operating system performs the following system tasks:
•Process management: processor management is responsible for ensuring that each process and application receives the processor time required for its functioning, while using as many processor cycles as possible. Depending on the operating system, the basic unit of software that the operating system deals with when scheduling the work done by the processor is called a process or a thread. A process is software that performs actions and can be controlled by a user, by other applications or by the operating system.

•Memory management: a single process must have enough memory to execute and should not run into the memory space of another process. The various memories in the system must be used properly so that each process can run effectively; memory management is responsible for accomplishing this. The operating system sets up memory boundaries for types of software and for individual applications.

•Device management: a driver is a program used to control the path between the operating system and the hardware on the computer's motherboard. The driver acts as a translator between the electrical signals of the hardware and the high-level programming languages of the operating system. The operating system assigns high-priority blocks to drivers so that the hardware resource can be released and made ready for further use as quickly as possible.

•File management: file management, also known as the file system, is the system the operating system uses to organize and keep track of files. A hierarchical file system uses directories to organize files into a tree structure. The operating system creates an entry in the file system recording the start and end locations of a file, the file name, the file type, archiving information, the user's permissions to read and modify the file, and the date and time of the file's creation.
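A minimal sketch of that bookkeeping from the user side is shown below: it reads back some of the metadata the file system keeps for a file. The file name is a hypothetical example.

```python
# Reading the metadata the file system records for a file (size, permissions,
# timestamps).  'example.txt' is a hypothetical file created for illustration.
import os
import stat
import time

path = "example.txt"
with open(path, "w") as f:
    f.write("hello")

info = os.stat(path)
print("size (bytes)  :", info.st_size)
print("permissions   :", stat.filemode(info.st_mode))
print("last modified :", time.ctime(info.st_mtime))
print("last accessed :", time.ctime(info.st_atime))
```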

Virtualization overview:
The technique in which one physical resource is split into multiple virtual resources is called virtualization. In this way a single physical resource can be used for multiple purposes and can perform the required actions.
The benefits of virtualization are:
• Reduced hardware cost - a single server acts as multiple servers.
• Optimized workload - resources are shared dynamically.
• IT responsiveness and flexibility - it gives a single consolidated view of, and access to, all available resources.

Virtualization, or system virtualization, is the creation of many virtual systems on a single physical system. These virtual systems use virtual resources and run independent operating systems.


Hypervisor:
A hypervisor is a technique that allows multiple operating systems to run on a single piece of hardware at the same time, with the hardware virtualized. It is also called a virtual machine manager.
Hypervisor types:
There are basically two types of hypervisor, type 1 and type 2.
•A Type 1 hypervisor runs directly on the physical hardware, without an intermediate operating system. It is also called a "bare metal hypervisor". In this case the hypervisor itself is the operating system managing the hardware, and it serves the physical resources to the individual virtual machines. This layer is more efficient than a type 2 hypervisor because there is no host OS underneath. A type 1 hypervisor is built only to host other operating systems. Most enterprise hypervisors are type 1 hypervisors.


A Type 2 hypervisor runs on top of an operating system, for example Virtual PC or VirtualBox. In this case there are two layers, the host operating system and the hypervisor, beneath each virtual machine. It is essentially an application installed on an operating system.


Hypervisor features:
Most organizations run their servers in virtual environments in their data centers. This helps them carry their workloads with high availability and better performance.
Operating systems and workloads can be consolidated onto one server, reducing the cost of operations and hardware.
Multiple operating systems can run on a single piece of hardware at the same time, each running applications according to its requirements.
Resources can be assigned dynamically between the virtual resources and the physical resource through methods such as dispatching and paging.
Workloads are managed with ease on a single server to improve performance, system utilization and price.


Network Topologies
You may have come across the word topology used in networks. “Topology” refers to the
physical arrangement of network components and media within a networking structure.
Topology Types
There are four primary kinds of LAN topologies: bus, tree, star, and ring.
Bus topology

Bus topology is a linear LAN architecture in which transmissions from network components propagate along the length of the medium and are received by all other components. The bus is the common signal path, directed through wires or other media, that carries signals from one part of the network to another. Bus topology is commonly implemented by IEEE 802.3/Ethernet networks.

Tree topology
Tree topology is similar to bus topology. A tree network can contain multiple nodes and branches. As in a bus topology, transmissions propagate along the length of the medium and are received by all other components.
A disadvantage of the bus and tree topologies is that the entire network goes down if the connection to any one user is broken, disrupting communication between all users.
Advantages - less cabling is required, hence it is cost-effective.
Star Topology

Star topology is a LAN topology in which the network is built around a common central switch or hub, with the nodes connected to it in a hub-and-spoke model (point-to-point connections). A star network is often combined with a bus topology: the central hub is then connected to the backbone of the bus, and this combination is called a tree.
Advantages - a connection failure between a node and the central device will not hinder communication between the other nodes, and the network remains functional.
Disadvantage - more cabling is required, so the cost is somewhat higher.

Ring topology
Ring topology is a single closed loop formed by unidirectional transmission links between a series of repeaters. A repeater in the network connects each station to the network. Although logically a ring, ring topologies are usually cabled as a closed-loop star; the topology still operates as a unidirectional closed loop built from point-to-point links. Token Ring is an example of ring topology.
In the event of a connection failure between two components, redundancy is used to avoid collapse of the entire ring.


OSI Model
To define and abstract network communications we use the Open Systems Interconnection (OSI) model. The OSI model has seven logical layers, listed below. Each logical layer has a well-defined function:

OSI model functionality by layer
Layer 7 - Application  : Interaction with application software
Layer 6 - Presentation : Data formatting
Layer 5 - Session      : Host-to-host connection management
Layer 4 - Transport    : Host-to-host data transfer
Layer 3 - Network      : Addressing and routing
Layer 2 - Data-Link    : Local network data transfer
Layer 1 - Physical     : Physical hardware

Switching Concepts

Switching Concepts (continued)

Application-specific integrated circuits (ASICs) are used in switches to build and maintain their filter tables.
A layer 2 switch can be called a multiport bridge because it breaks up collision domains.
Layer 2 switches are faster than routers because they do not look at Network layer header information; they look only at the frame's hardware addresses to decide whether to forward, flood or drop the frame. Independent bandwidth is provided on each port of the switch in order to create dedicated collision domains.
Layer 2 switching provides the following:
• Hardware-based bridging (ASIC)
• Wire speed
• Low latency
• Low cost
The efficiency of layer 2 switching comes from the fact that no modification is made to the data packet. The switching process is made less error-prone, and considerably faster than routing, by designing the device to read just the frame encapsulation of the packet. The increased bandwidth that comes from giving each interface its own collision domain makes it possible to connect multiple devices to each interface.
Switch Functions at Layer 2


The basic functions of a layer 2 switch are address learning, forward/filter decisions and loop avoidance (a small sketch of the first two follows below).

Address learning
Layer 2 switches maintain a MAC database: the source hardware address of each frame received on an interface is remembered. This information is also known as the forward/filter table.
Forward/filter decisions
When a frame is received on an interface, the switch looks up the destination hardware address in the MAC database and forwards the frame out of the specified destination port.
Loop avoidance
Network loops can occur when multiple connections are created between switches for redundancy purposes. The Spanning Tree Protocol (STP) is used to avoid network loops while maintaining redundancy.
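A minimal sketch of address learning and the forward/filter decision is shown below; the MAC addresses and port numbers are made up for illustration, and real switches implement this in ASICs rather than software.

```python
# Toy model of a layer 2 switch: learn the source MAC per port, then forward a
# frame only to the learned destination port, or flood it when unknown.
class Layer2Switch:
    def __init__(self, num_ports):
        self.ports = range(num_ports)
        self.mac_table = {}                       # the forward/filter table

    def receive(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port         # address learning
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:                      # unknown destination: flood
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:                   # same segment: filter (drop)
            return []
        return [out_port]                         # known destination: forward

sw = Layer2Switch(num_ports=4)
print(sw.receive(0, "aa:aa", "bb:bb"))   # bb:bb unknown -> flood to ports 1, 2, 3
print(sw.receive(1, "bb:bb", "aa:aa"))   # aa:aa learned on port 0 -> [0]
```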
Limitations of Layer 2 Switching
Layer 2 switches do not break up broadcast domains by default, which limits the size and growth potential of the network and reduces overall performance.
Slow spanning-tree convergence, together with unconstrained broadcasts and multicasts, becomes a major hitch as the network grows, which is why layer 2 switches are replaced by routers (layer 3 devices) in the internetwork.

Switch Vs Hub Vs Router
A switch is a networking device which can transmit incoming data to the destination port without disturbing any other ports; that is, the switch can differentiate between two different computers by using their MAC addresses. Switches are mostly used in Local Area Networks. A switch is also known as a Layer 2 device because it works at the Data-Link layer. We can also say that the switch acts as the manager of data transmission for all the computers connected to it, because the switch holds the information about all the traffic passing between those computers.

A hub acts as a repeater which transfers/broadcasts incoming data to all the computers connected to it. This is the biggest drawback of the hub: it broadcasts every data packet to the whole network. To overcome this limitation, we use switches in place of hubs. A hub is a Layer 1 device because it works at the Physical layer.

The router is a more intelligent device than a switch or a hub. A router is used to connect two or more different networks and to maintain the flow of data between them. It is known as a Layer 3 device because routers work at the Network layer, which is Layer 3. For example, to connect two MANs (Metropolitan Area Networks) we use routers to join the two different networks.

What is Routing?
Forwarding packets from one network to another and determining the path using various metrics are the basic functions of a router. The load on the link between devices, delay, bandwidth, reliability and even hop count are the metrics used in routing.
Routers can do everything switches do, and more. Routers look at the destination and source IP addresses in the network header, which is added during packet encapsulation at the network layer.
What is a VLAN?
A VLAN is a virtual LAN. Switches create broadcast domains through VLANs. Normally a router creates broadcast domains, but with the help of VLANs a switch can create them as well. The administrator places some switch ports in a VLAN other than the default VLAN (VLAN 1). The ports in a single VLAN form a single broadcast domain.
Switches normally communicate with each other; some ports on switch A could be placed in VLAN 5 and some ports on switch B could also be in VLAN 5. In such a case the broadcasts between these two switches will not be seen on any port in any VLAN other than 5. The devices in VLAN 5 can all communicate with each other because they are on the same VLAN; communication with devices in other VLANs is not possible without extra configuration.


When do I need a VLAN?


VLANs are required in the following situations:
• when you have a large number of devices on your LAN
• when the broadcast traffic on your LAN is high
• when a group of users needs security, or protection from slowdowns caused by broadcasts
• when groups of users running the same applications need to be in a separate broadcast domain, away from regular users
• when multiple virtual switches are required within a single switch


Power management

Power management is a feature of some electrical appliances, especially copiers, computers and computer peripherals such as monitors and printers, that turns off the power or switches the system to a low-power state when inactive. In computing this is known as PC power management and is built around a standard called ACPI (Advanced Configuration and Power Interface), which supersedes APM. All recent (consumer) computers have ACPI support.

PC power management for computer systems is desired for many reasons, particularly to:
• Reduce overall energy consumption
• Prolong battery life for portable and embedded systems
• Reduce cooling requirements
• Reduce noise
• Reduce operating costs for energy and cooling
Lower power consumption also means lower heat dissipation, which increases system stability, and less energy use, which saves money and reduces the impact on the environment.


Processor level techniques

Power management for microprocessors can be applied to the whole processor or to specific areas. With dynamic voltage scaling and dynamic frequency scaling, the CPU core voltage, clock rate, or both can be altered to decrease power consumption at the price of potentially lower performance. This is sometimes done in real time to optimize the power-performance tradeoff.
Examples:
• AMD Cool'n'Quiet
• AMD PowerNow!
• IBM EnergyScale
• Intel SpeedStep
• Transmeta LongRun and LongRun2
• VIA LongHaul (PowerSaver)

Additionally, processors can selectively power off internal circuitry (power gating). For example, newer Intel Core processors support ultra-fine power control over the functional units within the processors.

AMD CoolCore technology achieves more efficient performance by dynamically activating or turning off parts of the processor.

Intel VRT technology splits the chip into a 3.3 V I/O section and a 2.9 V core section; the lower core voltage reduces power consumption.
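On Linux, the effect of dynamic frequency scaling can be observed through the cpufreq interface in sysfs. The sketch below assumes the standard /sys/devices/system/cpu/cpu*/cpufreq layout; the files actually present depend on the platform and driver, so treat it as illustrative.

```python
# Minimal sketch: report the scaling governor and current/maximum frequency of
# each CPU via the Linux cpufreq sysfs interface (layout assumed, Linux only).
from pathlib import Path

for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    cpufreq = cpu_dir / "cpufreq"
    if not cpufreq.is_dir():
        continue
    governor = (cpufreq / "scaling_governor").read_text().strip()
    cur_khz = int((cpufreq / "scaling_cur_freq").read_text())
    max_khz = int((cpufreq / "cpuinfo_max_freq").read_text())
    print(f"{cpu_dir.name}: governor={governor}, "
          f"current={cur_khz / 1000:.0f} MHz, max={max_khz / 1000:.0f} MHz")
```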

Heterogeneous computing
ARM's big.LITTLE architecture can migrate processes between faster "big" cores and more power-efficient "LITTLE" cores.
Hibernation in computing
When a computer system hibernates, it saves the contents of RAM to disk and powers down the machine. On startup it reloads the data. This allows the system to be completely powered off while in hibernate mode. It requires a file the size of the installed RAM to be placed on the hard disk, potentially using up space even when the machine is not hibernating. Hibernate mode is enabled by default in some versions of Windows and can be disabled to recover this disk space.


Advanced Power Management


Advanced power management (APM) is an API developed by Intel and Microsoft and released
in 1992 which enables an operating system running an IBM-compatible personal computer to
work with the BIOS (part of the computer's firmware) to achieve power management.
Revision 1.2 was the last version of the APM specification, released in 1996. ACPI is intended
as the successor to APM. Microsoft dropped support for APM in Windows Vista. The Linux
kernel still mostly supports APM, with the last fully functional APM support shipping in 3.3.
APM uses a layered approach to manage devices. APM-aware applications (which include
device drivers) talk to an OS-specific APM driver. This driver communicates to the APM-aware
BIOS, which controls the hardware. There is the ability to opt-out of APM control on a device-
by-device basis, which can be used if a driver wants to communicate directly with a hardware
device.


The layers in APM.


Communication occurs both ways; power management events are sent from the BIOS to the
APM driver, and the APM driver sends information and requests to the BIOS via function calls.
In this way the APM driver is an intermediary between the BIOS and the operating system.
Power management happens in two ways; through the above mentioned function calls from the
APM driver to the BIOS requesting power state changes, and automatically based on device
activity.


Advanced Configuration and Power Interface


In computing, the Advanced Configuration and Power Interface (ACPI) specification provides an
open standard for device configuration and power management by the operating system.
First released in December 1996, ACPI defines platform-independent interfaces for hardware
discovery, configuration, power management and monitoring. With the intention of replacing
Advanced Power Management, the MultiProcessor Specification and the Plug and Play BIOS
Specification, the standard brings power management under the control of the operating
system, as opposed to the previous BIOS-centric approach, which relied on platform-specific firmware to determine power management and configuration policy. The specification is central to Operating System-directed configuration and Power Management (OSPM), a system implementing ACPI which removes device management responsibilities from legacy firmware interfaces.
The standard was originally developed by Intel, Microsoft and Toshiba, and was later joined by
HP and Phoenix. The latest version is "Revision 5.0", which was published on 6 December
2011. As the ACPI technology gained wider adoption with many operating systems and
processor architectures, the desire to improve the governance model of the specification has
increased significantly. In October 2013, original developers of the ACPI standard agreed to
transfer all assets to the UEFI Forum, where all future development will be taking place.


BatteryMAX (idle detection)


BatteryMAX is an Idle Detection System used for computer power management developed at
Digital Research, Inc.'s European Development Centre (EDC) in Hungerford, UK. It was
invented by British-born engineers Roger Gross and John Constant in August 1989 and was
first released with DR DOS 5.0. It was created to address the new genre of portable personal
computers (lap-tops) which ran from battery power. As such, it was also an integral part of
Novell's PalmDOS 1.0 operating system tailored for early palmtops in 1992.
Power saving in laptop computers traditionally relied on hardware inactivity timers to determine
whether a computer was idle. It would typically take several minutes before the computer could
identify idle behavior and switch to a lower power consumption state. By monitoring software
applications from within the operating system, BatteryMAX is able to reduce the time taken to
detect idle behavior from minutes to microseconds. Moreover it can switch power states around
20 times a second between a user's keystrokes. The technique was named Dynamic Idle
Detection and includes halting, or stopping the CPU for periods of just a few microseconds until
a hardware event occurs to restart it.
DR DOS 5.0 was the first Personal Computer operating system to incorporate an Idle Detection
System for power management. A US patent describing the Idle Detection System was filed on 9 March 1990 and was granted on 11 October 1994.


CPU power dissipation


Central Processing Unit power dissipation or CPU power dissipation is the process in which
central processing units (CPUs) consume electrical energy, and dissipate this energy both by
the action of the switching devices contained in the CPU (such as transistors or vacuum tubes)
and by the energy lost in the form of heat due to the impedance of the electronic circuits.

CPU power management


• Designing CPUs that perform tasks efficiently without overheating is a major consideration of
nearly all CPU manufacturers to date. Some implementations of CPUs use very little power;
for example, the CPUs in mobile phones often use just a few hundred milliwatts of electricity.
Some microcontrollers, used in embedded systems may use a few milliwatts. In comparison,
CPUs in general purpose personal computers, such as desktops and laptops, dissipate
significantly more power because of their higher complexity and speed. These microelectronic
CPUs may consume power in the order of a few watts to hundreds of watts. Historically, early
CPUs implemented with vacuum tubes consumed power on the order of many kilowatts.


• CPUs for desktop computers typically use a significant portion of the power consumed by the
computer. Other major uses include fast video cards, which contain graphics processing
units, and the power supply. In laptops, the LCD's backlight also uses a significant portion of
overall power. While energy-saving features have been instituted in personal computers for
when they are idle, the overall consumption of today's high-performance CPUs is
considerable. This is in strong contrast with the much lower energy consumption of CPUs
designed for low-power devices. One such CPU, the Intel XScale, can run at 600 MHz with
only half a watt of power, whereas x86 PC processors from Intel in the same performance
bracket consume roughly eighty times as much energy.

Implications of increased clock frequencies
Processor manufacturers consistently delivered increases in clock rates and instruction-level parallelism, so that
single-threaded code executed faster on newer processors with no modification. Now, to manage CPU power
dissipation, processor makers favor multi-core chip designs, and software has to be written in a multi-threaded or
multi-process manner to take full advantage of the hardware. Many multi-threaded development paradigms
introduce overhead, and will not see a linear increase in speed vs number of processors. This is particularly true
while accessing shared or dependent resources, due to lock contention. This effect becomes more noticeable as
the number of processors increases. Recently, IBM has been exploring ways to distribute computing power more
efficiently by mimicking the distributional properties of the human brain.
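The non-linear scaling mentioned above is often summarized with Amdahl's law (not named in the text, but it captures the same serial-fraction and contention effect): if a fraction s of the work cannot be parallelized, the speedup on n processors is at most 1 / (s + (1 - s)/n).

```python
# Amdahl's law sketch: speedup flattens out as processors are added when part
# of the work (here 10%, an arbitrary figure) remains serial, e.g. due to locks.
def amdahl_speedup(serial_fraction, processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

for n in (1, 2, 4, 8, 16, 64):
    print(f"{n:3d} processors -> speedup {amdahl_speedup(0.10, n):5.2f}x")
```

With 10% serial work, even 64 processors give less than a 9x speedup, which is why simply adding cores does not yield a linear gain.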

Energy Star
• Energy Star (trademarked ENERGY STAR) is an international standard for energy-efficient consumer products that originated in the United States of America. The EPA estimates that it
saved about $14 billion in energy costs in 2006 alone. The Energy Star program has helped
spread the use of LED traffic lights, efficient fluorescent lighting, power management systems
for office equipment, and low standby energy use.


VESA Display Power Management Signaling


VESA (Video Electronics Standards Association) Display Power Management Signaling (DPMS) is a standard from the VESA consortium for managing the power supply of video monitors through the graphics card, e.g. shutting off the monitor after the computer has been unused for some time (idle), to save power.
DPMS 1.0 was issued by VESA in 1993, and was based on the United States Environmental
Protection Agency's (EPA) earlier Energy Star power management specifications. Subsequent
revisions were rolled into future VESA BIOS Extensions.

Successfully managing performance ensures that your system is
efficiently using resources and that your server provides the best
possible services to your users and to your business needs. Moreover,
effective performance management can help you quickly respond to
changes in your system.

Elements of Performance Tuning


The fundamental element of resource management is the discovery process. It involves searching for suitable physical resources on which the virtual machines are to be created, matching the user's request.
Resource scheduling selects the best resource from the matched physical resources; in effect, it identifies the physical resource on which the virtual machines are to be created in order to provision resources from the cloud infrastructure.
Resource allocation allocates the selected resource to the job or task in the user's request; in practice this means submitting the job to the selected cloud resource. After the job has been submitted, the resource is monitored. A simplified sketch of these steps follows.
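A much simplified sketch of these steps is shown below. The host names, capacities and the "most free memory" scheduling policy are illustrative assumptions, not a real cloud scheduler.

```python
# Discovery -> scheduling -> allocation, reduced to a few lines (illustrative only).
hosts = [
    {"name": "host-a", "free_cpus": 8,  "free_mem_gb": 16},
    {"name": "host-b", "free_cpus": 4,  "free_mem_gb": 64},
    {"name": "host-c", "free_cpus": 16, "free_mem_gb": 8},
]
request = {"cpus": 4, "mem_gb": 12}            # the user's VM request

# Discovery: find physical hosts that can satisfy the request.
candidates = [h for h in hosts
              if h["free_cpus"] >= request["cpus"]
              and h["free_mem_gb"] >= request["mem_gb"]]

# Scheduling: pick the best candidate (here, the one with the most free memory).
best = max(candidates, key=lambda h: h["free_mem_gb"])

# Allocation: reserve the resources on the selected host and place the VM there.
best["free_cpus"] -= request["cpus"]
best["free_mem_gb"] -= request["mem_gb"]
print(f"VM placed on {best['name']}; remaining: "
      f"{best['free_cpus']} CPUs, {best['free_mem_gb']} GB")
```

After this point the allocated resource would be monitored, as described in the following slides.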


Resource Management:
Resource management includes processes such as resource discovery (which includes resource scheduling), resource allocation and resource monitoring. These processes manage resources such as disk space, CPU cores and network bandwidth. These resources must in turn be sliced and shared between different virtual machines, which mostly run heterogeneous workloads.
The taxonomy of resource management elements is depicted in the figure below.


The fundamental element of resource management is the discovery process. It involves searching for the appropriate resource types available that match the application requirements. The process is managed by the cloud service provider; a resource broker or user broker carries out this process to discover available resources.
Discovery produces a detailed description of the resources available. Resource discovery provides a way for a resource management system (RMS) to determine the state of the resources managed by it and by other RMSs that interoperate with it. Resource discovery works together with the dissemination of resources, which provides information about the state of resources to the information server.
The allocation process is the process of assigning an available resource to the cloud applications that need it over the internet. These resources are allocated based on the user request and a pay-per-use model. In this process, scheduling and dispatching methods are applied to allocate the resources: the scheduler schedules the assigned resources for the client, and the dispatcher then allocates the assigned resources to the client.
Resource monitoring is a key tool for controlling and managing hardware and software infrastructures. It also provides information and Key Performance Indicators (KPIs) for both platforms and applications in the cloud, used for data collection to assist decisions about allocating the resources. It is also a key component for monitoring the state of the resources in the event of failure, whether at the physical layer or at the services layer.


Performance Analysis and Performance Monitoring Tools


The first step in a strategy for managing system performance is to set measurable objectives. You can begin by setting goals that match the demands of your business and identifying areas of the system where performance improvements can have a positive effect.
The tasks that follow make up a performance strategy. Implementing a performance strategy is an iterative process that begins with defining your performance goals or objectives, and then repeating the rest of the tasks until you accomplish your goals.
• Set performance goals.
  - Set goals that match the demands of your business.
  - Identify the areas of the system where an improvement in performance can affect your business.
  - Set goals that can be measured.
  - Make the goals reasonable.
• Collect performance data continuously with Collection Services.
• Always save performance measurement data before installing a new software release, a major hardware upgrade, a new application, or a large number of additional workstations or jobs.
• Your data should include typical medium to heavy workloads.


• Check and analyze performance data.
  - Summarize the collected data and compare the data to objectives or resource guidelines.
  - Perform monthly trend analysis that includes at least the previous three months of summarized data. As time progresses, include at least six months of summary data to ensure that a trend is consistent. Make decisions based on at least three months of trend information and your knowledge of upcoming system demands (a minimal trend-fit sketch follows this list).
  - Analyze performance data to catch situations before they become problems. Performance data can point out objectives that have not been met. Trend analysis shows you whether resource consumption is increasing significantly or performance objectives are approaching or exceeding guideline values.

• Tune performance whenever guidelines are not met.

• Plan for capacity when:
  - Trend analysis shows significant growth in resource utilization.
  - A major new application, additional interactive workstations, or new batch jobs will be added to the current hardware configuration.
  - You have reviewed business plans and expect a significant change.

Collecting performance data for analysis


Collecting data is an important step toward improving performance. When you collect
performance data, you gather information about your server that can be used to understand
response times and throughput. It is a way to capture the performance status of the server, or
set of servers, involved in getting your work done. The collection of data provides a context, or a
starting point, for any comparisons and analysis that can be done later. When you use your first
data collections, you have a benchmark for future improvements and a start on improving your
performance today. You can use the performance data you collect to make adjustments,
improve response times, and help your systems achieve peak performance. Performance
problem analysis often begins with the simple question: "What changed?" Performance data
helps you answer that question.


System workload
An accurate and complete definition of a system's workload is critical to predicting or
understanding its performance.
A difference in workload can cause far more variation in the measured performance of a system
than differences in CPU clock speed or random access memory (RAM) size. The workload
definition must include not only the type and rate of requests sent to the system, but also the
exact software packages and in-house application programs to be executed.
Workloads can be classified into the following categories:
Multiuser
A workload that consists of a number of users submitting work through individual terminals.
Typically, the performance objectives of such a workload are either to maximize system
throughput while preserving a specified worst-case response time or to obtain the best possible
response time for a constant workload.
Server
A workload that consists of requests from other systems. For example, a file-server workload
is mostly disk read and disk write requests. It is the disk-I/O component of a multiuser workload
(plus NFS or other I/O activity), so the same objective of maximum throughput within a given
response-time limit applies. Other server workloads consist of items such as math-intensive
programs, database transactions, printer jobs.
Workstation
A workload that consists of a single user submitting work through a keyboard and receiving
results on the display of that system. Typically, the highest-priority performance objective of
such
a workload is minimum response time to the user's requests.

Performance objectives

After defining the workload that your system will have to process, you can choose performance
criteria and set performance objectives based on those criteria.
The overall performance criteria of computer systems are response time and throughput.
Response time is the elapsed time between when a request is submitted and when the
response from that request is returned. Examples include:

 The amount of time a database query takes


 The amount of time it takes to echo characters to the terminal
 The amount of time it takes to access a web page

Throughput is a measure of the amount of work that can be accomplished over some unit of
time. Examples include:
 Database transactions per minute
 Kilobytes of a file transferred per sec
 Kilobytes of a file read or written per sec
 Web server hits per sec
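
Both criteria can be derived directly from timing data captured during a collection interval. The sketch below (Python; the submit/return timestamps are assumptions for illustration) computes average response time and throughput from a list of requests:

# Minimal sketch: average response time and throughput from (submit, return) times in seconds.
requests = [(0.0, 0.4), (0.5, 1.1), (1.0, 1.3), (1.2, 2.0), (1.8, 2.2)]   # assumed sample data

response_times = [done - submitted for submitted, done in requests]
avg_response = sum(response_times) / len(response_times)

elapsed = max(done for _, done in requests) - min(submitted for submitted, _ in requests)
throughput = len(requests) / elapsed               # completed requests per second

print(f"Average response time: {avg_response:.2f} s, throughput: {throughput:.2f} requests/s")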

The relationship between these metrics is complex. Sometimes you can have higher throughput
at the cost of response time or better response time at the cost of throughput. In other
situations, a single change can improve both. Acceptable performance is based on reasonable
throughput combined with reasonable response time.

In planning for or tuning any system, make sure that you have clear objectives for both
response time and throughput when processing the specified workload. Otherwise, you risk
spending analysis time and resource dollars improving an aspect of system performance that
is of secondary importance.
Performance monitoring tools
There are numerous performance analysis and monitoring tools; we discuss some common ones
in the following section.
nmon
nmon is a free tool to analyze AIX and Linux performance
This free tool gives you a huge amount of information all on one screen.

The nmon tool is designed for AIX and Linux performance specialists to use for monitoring and
analyzing performance data, including:
– CPU utilization
– Memory use
– Kernel statistics and run queue information
– Disks I/O rates, transfers, and read/write ratios
– Free space on file systems
– Disk adapters
– Network I/O rates, transfers, and read/write ratios
– Paging space and paging rates
– CPU and AIX specification
– Top processors
– IBM HTTP Web cache
– User Defined disk groups
– Machine details and resources
– Asynchronous I/O -- AIX only
– Workload Manager (WLM) -- AIX only
– IBM TotalStorage® Enterprise Storage Server® (ESS) disks -- AIX only
– Network File System (NFS)
– Dynamic LPAR (DLPAR) changes -- only pSeries p5 and OpenPower for either AIX or
Linux
Performance analysis with the trace facility


The operating system's trace facility is a powerful system-observation tool.
The trace facility captures a sequential flow of time-stamped system events, providing a fine
level of detail on system activity. Events are shown in time sequence and in the context of other
events. Trace is a valuable tool for observing system and application execution. Unlike other
tools that only provide CPU utilization or I/O wait time, trace expands that information to aid in
understanding what events are happening, who is responsible, when the events are taking
place, how they are affecting the system and why.
Simple Performance Lock Analysis Tool (splat)
The Simple Performance Lock Analysis Tool (splat) is a software tool that generates reports on
the use of synchronization locks. These include the simple and complex locks provided by the
AIX kernel, as well as user-level mutexes, read and write locks, and condition variables provided
by the PThread library. The splat tool is not currently equipped to analyze the behavior of the
Virtual Memory Manager (VMM) and PMAP (process-mapping) locks used in the AIX kernel.

Analyzing collected performance data


The key to analyzing performance data is to do it on a regular basis. Just like performance
monitoring, analyzing is an ongoing process. Performance analysis is a method for
investigating, measuring, and correcting deficiencies so that system performance meets the
user's expectations. It does not matter that the system is a computer; it could be an automobile
or a washing machine. The problem-solving approach is essentially the same:
– Understand the symptoms of the problem.
– Use tools to measure and define the problem.
– Isolate the cause.
– Correct the problem.
– Use tools to verify the correction.

The following tasks help you to analyze your data. These tasks allow you to focus on the
problem, isolate the cause, and make the corrections. Use these same tasks to verify the
corrections.
View collected performance data
Look at the data that you collected with Performance Tools.
Summarize performance data into reports
Performance reports provide a way for you to effectively research areas of the system that
are causing performance problems.
Create graphs of performance data to show trends
Use historical data to show the changes in resource utilization on your system over time.
Work with performance database files
Find information about the performance database files -- what they contain, how to use them,
and how to create them.

IBM application for Performance Management:
The data that Performance Explorer collects is stored in Performance Explorer database files.

The following table shows the Performance Explorer (PEX) data files collected by the system when
using data collection commands. Type the Display File Field Description (DSPFFD) command as follows
to view the contents for a single file:

DSPFFD FILE(xxxxxxxxx)
where xxxxxxxxx is the name of the file that you want to display.

Type of information contained in file              File name

Trace Resources Affinity                           QAYPEAFN
Auxiliary storage management event data            QAYPEASM
Auxiliary storage pool (ASP) information data      QAYPEASPI
Base event data                                    QAYPEBASE
Basic configuration information                    QAYPECFGI
Communications event data                          QAYPECMN
Disk event data                                    QAYPEDASD
Disk server event data                             QAYPEDSRV
Event type and subtype mapping                     QAYPEEVENT
File serving event data                            QAYPEFILSV
Configured filter information                      QAYPEFTRI
Performance measurement counter (PMC) selection    QAYPEFQCFG
Heap event data                                    QAYPEHEAP


Unit 2
Fundamentals of High Availability
High Availability (HA)
• In the IT world, high availability means that a
system operates continuously (100 percent
operational) without downtime, despite
failures in hardware, software, and
applications.
High Availability (HA)
A high availability (HA) setup is an infrastructure
without a single point of failure.
It prevents a single server failure from being a
downtime event by adding redundancy to
every layer of your architecture.
A load balancer facilitates redundancy for the
backend layer (web/app servers), but for a
true high availability setup, you need to have
redundant load balancers as well.
What is HA Clustering?
• One service goes down => others take over its
work
• IP address takeover, service takeover,
• Not designed for high performance
• Not designed for high throughput (load
balancing)
Does it Matter?
• Downtime is expensive
• You miss out on service; you miss out on
opportunity
• Your boss complains; your boss's boss
complains
• New users don't return
The Rules of HA
• Keep it Simple
• Prepare for Failure
• Complexity is the enemy
• Test your HA setup
Myths
• Virtualization will solve your HA Needs
• Live migration is the solution to all your
problems
• HA will make your platform more stable
What do you care about?
Your data?
• Consistent
• Real-time
• Eventual consistency
Your connection?
• Always
• Most of the time
Eliminating the SPOF
Find out what Will Fail
• Disks
• Fans
• Power (Supplies)
Find out what Can Fail
• Network
• Running out of memory
High Availability/Clustering in Linux

How to Configure and Maintain High Availability/Clustering in Linux
High Availability/Clustering in Linux
• High Availability (HA) simply refers to a quality
of a system to operate continuously without
failure for a long period of time.
• HA solutions can be implemented using
hardware and/or software, and one of the
common solutions to implementing HA is
clustering.
High Availability/Clustering in Linux
• In computing, a cluster is made up of two or
more computers (commonly known as nodes
or members) that work together to perform a
task.

• In such a setup, only one node provides the
service, with the secondary node(s) taking over
if it fails.
Cluster Types
• Storage: provide a consistent file system
image across servers in a cluster, allowing the
servers to simultaneously read and write to a
single shared file system.
• High Availability: eliminate single points of
failure by failing over services from one
cluster node to another in case a node
becomes inoperative.
Cluster Types
• Load Balancing: dispatch network service
requests to multiple cluster nodes to balance
the request load among the cluster nodes.
• High Performance: carry out parallel or
concurrent processing, thus helping to
improve performance of applications.
Cluster Types

• Another widely used solution to providing HA
is replication (specifically, data replication).
• Replication is the process by which one or
more (secondary) databases can be kept in
sync with a single primary (or master)
database.
Cluster quorum disk
• A cluster quorum disk is the storage medium
on which the configuration database is stored
for a cluster computing network.
• The cluster configuration database, also called
the quorum, tells the cluster which physical
server(s) should be active at any given time.
• The quorum disk comprises a shared block
device that allows concurrent read/write
access by all nodes in a cluster.
Cluster quorum disk
• In networking, clustering is the use of multiple
servers (computers) to form what appears to
users as a single highly available system.
• A Web page request is sent to a "manager"
server, which then determines which of several
other servers to forward the request to for handling.
• Cluster computing is used to load-balance the
traffic on high-traffic Web sites.
• Load balancing involves dividing the work up
among multiple servers so that users get served
faster.
Cluster quorum disk
• Although clusters comprise multiple servers, users or other
computers see any given cluster as a single virtual server.
• The physical servers themselves are called cluster nodes.
• The quorum tells the cluster which node should be active at
any given time, and intervenes if communications fail
between cluster nodes by determining which set of nodes
gets to run the application at hand.
• The set of nodes with the quorum keeps running the
application, while the other set of nodes is taken out of
service.
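
A minimal sketch of the quorum rule described above (Python; the node names and vote counts are assumptions): the partition that still holds a strict majority of the configured votes keeps running the application, and the other partition is taken out of service.

# Minimal quorum sketch: a partition keeps the service only with a strict majority of votes.
votes = {"node1": 1, "node2": 1, "node3": 1}       # one vote per node (a quorum disk could add one more)
total_votes = sum(votes.values())

def has_quorum(reachable_nodes):
    """True if the nodes that can still communicate hold a strict majority of all votes."""
    return sum(votes[n] for n in reachable_nodes) > total_votes / 2

print(has_quorum({"node1", "node2"}))   # True  -> this partition keeps running the application
print(has_quorum({"node3"}))            # False -> this partition is taken out of service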
Redundant Network Connectivity
Network Redundancy

• Network redundancy is a process through which
additional or alternate instances of network
devices, equipment, and communication media
are installed within the network infrastructure.
• It is a method for ensuring network availability in
case of a network device or path failure and
unavailability. As such, it provides a means of
network failover.
Network Redundancy

• Network redundancy is a simple concept to
understand.
• If you have a single point of failure and it fails,
then you have nothing to rely on.
• If you put in a secondary (or tertiary) method
of access, then when the main connection
goes down, you will have a way to connect to
resources and keep the business operational.
Network Redundancy
• Network redundancy is primarily implemented
in enterprise network infrastructure to
provide a redundant source of network
communications.
• It serves as a backup mechanism for quickly
swapping network operations onto redundant
infrastructure in the event of unplanned
network outages.
Network Redundancy

• Typically, network redundancy is achieved
through the addition of alternate network
paths, which are implemented through
redundant standby routers and switches.
• When the primary path is unavailable, the
alternate path can be instantly deployed to
ensure minimal downtime and continuity of
network services.
Active-Active high availability cluster
• An active-active cluster is typically made up of at
least two nodes, both actively running the same
kind of service simultaneously.
• The main purpose of an active-active cluster is to
achieve load balancing.
• Load balancing distributes workloads across all
nodes in order to prevent any single node from
getting overloaded.
• Because there are more nodes available to serve,
there will also be a marked improvement in
throughput and response times.
Active-Active high availability cluster
• For example, a setup that consists of a
load balancer and two HTTP servers (i.e., two
nodes) is an example of this type of HA
cluster configuration.
• Instead of connecting directly to an HTTP
server, web clients go through the load
balancer, which in turn connects each client to
any of the HTTP servers behind it.
Active-Active high availability cluster
• The assignment of clients to the nodes in the cluster
isn't an arbitrary process.
• Rather, it's based on whatever load-balancing
algorithm is set on the load balancer.
• So, for example in a "Round Robin" algorithm,
the first client to connect is sent to the 1st
server, the second client to the 2nd server, the
3rd client back to the 1st server, the 4th client
back to the 2nd server, and so on.
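
A minimal sketch of this Round Robin behaviour (Python; the server and client names are assumptions):

# Minimal round-robin sketch: clients are handed to the backend nodes in strict rotation.
from itertools import cycle

backends = cycle(["http-server-1", "http-server-2"])

for client in ["client-1", "client-2", "client-3", "client-4"]:
    print(client, "->", next(backends))
# client-1 -> http-server-1, client-2 -> http-server-2,
# client-3 -> http-server-1, client-4 -> http-server-2, and so on.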
Active-Active high availability cluster
• In order for the high availability cluster to operate
seamlessly, it's recommended that the two nodes
be configured for redundancy. In other words,
their individual configurations/settings must be
virtually identical.

• Another thing to bear in mind is that a cluster like
this works best when the nodes store files in
shared storage, such as a NAS.
Active-Passive high availability cluster
• Like the active-active configuration, active-
passive also consists of at least two nodes.
• However, as the name "active-passive" implies,
not all nodes are going to be active.
• In the case of two nodes, for example, if the first
node is already active, the second node must be
passive or on standby.
• The passive (a.k.a. failover) server serves as a
backup that's ready to take over as soon as the
active (a.k.a. primary) server gets disconnected or
is unable to serve.
Active-Passive high availability cluster
• When clients connect to a 2-node cluster in active-
passive configuration, they only connect to one server.
• In other words, all clients will connect to the same
server.
• Like in the active-active configuration, it's important
that the two servers have exactly the same settings
(i.e., redundant).
• If changes are made on the settings of the primary
server, those changes must be cascaded to the failover
server.
• So when the failover does take over, the clients won't
be able to tell the difference.
Active-Passive high availability cluster
• A high-availability cluster, a.k.a. a failover cluster
(active-passive cluster), is one of the most widely
used cluster types in production environments.
• This type of cluster provides continued
availability of services even if one of the nodes in
the group of computers fails.
• If the server running an application fails for
some reason (for example, a hardware failure), the
cluster software (Pacemaker) will restart the
application on another node.
Active-Passive high availability cluster
• In production, this type of cluster is mainly
used for databases, custom applications, and
file sharing.
• Failover is not just starting an application; it
involves a series of operations such as
mounting filesystems, configuring networks, and
starting dependent applications.

• CentOS 7 / RHEL 7 supports failover clustering
using Pacemaker.
N+1 redundancy
What does N+1 redundancy mean?

N+1 redundancy is a formula meant to express a
form of resilience used to ensure system
availability in the event of a component
failure.
The formula suggests that components (N) have
at least one independent backup component
(+1).
N+1 redundancy
• The "N" can refer to many different
components that make up a data center
infrastructure, including servers, hard disks,
power supplies, switches routers and cooling
units.
• The level of resilience is referred to as
active/passive or standby, as backup
components do not actively participate within
the system during normal operation.
N+1 redundancy
• It is also possible to have N+1 redundancy with
active/active components. In such cases, the
backup component remains active in the
operation even if all other components are fully
functional.
• In the event that one component fails, however,
the system will be able to perform. An
active/active approach is considered superior in
terms of performance and resiliency.
N+1 redundancy
• Very simply, it means that anything in your data
center can be shut down and kept down for a
period of time, without directly affecting ongoing
processing.
• If everything has been well maintained, the
statistical chance of a simultaneous failure
practically drops off the charts.
• Further, to actually bring down a facility that is
designed and installed for concurrent
maintainability most likely takes not two, but at
least three sequential events.
N-to-1 configuration
• An N-to-1 configuration is based on the concept that
multiple, simultaneous server failures are unlikely;
therefore, a single redundant server can protect
multiple active servers.
• When a server fails, its applications move to the
redundant server.
• For example, in a 4-to-1 configuration, one server can
protect four servers.
• This configuration reduces redundancy cost at the
server level from 100 percent to 25 percent. In this
configuration, a dedicated, redundant server is cabled
to all storage and acts as a spare when a failure occurs.
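
The redundancy-cost arithmetic above is straightforward; a minimal sketch (Python) shows why a 4-to-1 configuration costs 25 percent while a 1-to-1 (asymmetric) configuration costs 100 percent:

# Minimal sketch: spare capacity as a fraction of active capacity in an N-to-1 setup.
def redundancy_cost(active_servers, spares=1):
    return spares / active_servers

print(f"{redundancy_cost(1):.0%}")   # 1-to-1 (asymmetric): 100% redundancy cost
print(f"{redundancy_cost(4):.0%}")   # 4-to-1: 25% redundancy cost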
N-to-1 configuration
• An N-to-1 failover configuration reduces the
cost of hardware redundancy and still
provides a potential, dedicated spare.
• By contrast, in an asymmetric (1-to-1) configuration
there is no performance penalty and there are no
issues with multiple applications running on the
same system; however, the drawback is the 100
percent redundancy cost at the server level.
N-to-1 configuration
• The problem with this design is the issue of
failback.
• When the failed server is repaired, you must
manually fail back all services that are hosted
on the failover server to the original server.
• The failback action frees the spare server and
restores redundancy to the cluster.
N-to-1 configuration
• Most shortcomings of early N-to-1 cluster
configurations are caused by the limitations of
storage architecture.
• Typically, it is impossible to connect more than
two hosts to a storage array without complex
cabling schemes and their inherent reliability
problems, or expensive arrays with multiple
controller ports.
Hardware Performance Tuning

Introduction
The server is the heart of the entire network operation. The performance of the server is a
critical factor in the efficiency of the overall network, and it affects all users. Although simply
replacing the entire server with a newer and faster one might be an alternative, it is often more
appropriate to replace or to add only to those components that need it and to leave the other
components alone. Often, poor performance is due to bottlenecks in individual hardware
subsystems, an incorrectly configured operating system, or a poorly tuned application. The
proper tools can help you diagnose these bottlenecks, and removing them can help improve
performance.
For example, adding more memory or using the correct device driver can improve performance
significantly. Sometimes, however, the hardware or software might not be the direct cause of the
poor performance. Instead, the cause might be the way in which the server is configured. In this
case, reconfiguring the server to suit the current needs might also lead to a considerable
performance improvement.
In this unit we will study the processors, memory, and storage in detail; we will also deal with
power management and heat dissipation as they relate to performance.

CPU or Processor
The central processing unit (CPU or processor) is the key component of any computer system.
In this chapter, we cover several different CPU architectures from Intel (IA32, Intel 64
Technology, and IA64) and AMD (AMD64), and outline their main performance characteristics.
Processor technology
The central processing unit (CPU) has outperformed all other computer subsystems in its
evolution. Thus, most of the time, other subsystems such as disk or memory will impose a
bottleneck upon your application (unless pure number crunching or complex application
processing is the desired task).
Understanding the functioning of a processor in itself is already quite a difficult task, but today IT
professionals are faced with multiple and often very different CPU architectures.
Comparing different processors is no longer a matter of looking at the CPU clock rate, but is
instead a matter of understanding which CPU architecture is best suited for handling which kind
of workload. Also, 64-bit computing has finally moved from high-end UNIX® and mainframe
systems to the Intel-compatible arena, and has become yet another new technology to be
understood.
The Intel-compatible microprocessor has evolved from the first 4004 4-bit CPU, produced in
1971, to the current line of Xeon and Core processors. AMD offers the world’s first IA32
compatible 64-bit processor. Our overview of processors begins with the current line of Intel
Xeon CPUs, followed by the AMD Opteron and the Intel Core™ Architecture. For the sake of
simplicity, we do not explore earlier processors.
Intel Xeon processors


Moore's Law states that the number of transistors on a chip doubles about every two years.
Similarly, as transistors have become smaller, the frequency of processors has increased,
which is generally equated with performance.
However, around 2003, physics started to limit advances in obtainable clock frequency.
Transistor sizes have become so small that electron leakage through transistors has begun to
occur. Those electron leaks result in large power consumption and substantial extra heat, and
could even result in data loss. In addition, cooling processors at higher frequencies by using
traditional air cooling methods has become too expensive.
This is why the material that comprises the dielectric in the transistors has become a major
limiting factor in the frequencies that are obtainable. Manufacturing advances have continued to
enable a higher per-die transistor count, but have only been able to obtain about a 10%
frequency improvement per year. For that reason, processor vendors are now placing more
processors on the die to offset the inability to increase frequency. Multi-core processors provide
the ability to increase performance with lower power consumption.

Dual-core processors
Intel released its first dual-core Xeon processors in October 2005. Dual-core processors are two
separate physical processors combined onto a single processor socket. Dual-core processors
provide twice as many cores, but each core runs at a lower clock speed than an equivalent
single-core chip to reduce waste heat. Waste heat is the heat that is produced from
electron leaks in the transistors.
Recent dual-core Xeon processor models that are available in System x servers:
Xeon 7100 Series MP processor (Tulsa)

Xeon 5100 Series DP processor (Woodcrest)


The Woodcrest processor is the first Xeon DP processor that uses the Intel Core
microarchitecture instead of the Netburst microarchitecture. Intel core microarchitecture is
explained in detail in the following sections.

Processor     Speed     L2 cache  Front-side bus  Power (TDP)
Xeon 5110     1.6 GHz   4 MB      1066 MHz        65 W
Xeon 5120     1.86 GHz  4 MB      1066 MHz        65 W
Xeon 5130     2.00 GHz  4 MB      1333 MHz        65 W
Xeon 5140     2.33 GHz  4 MB      1333 MHz        65 W
Xeon 5148 LV  2.33 GHz  4 MB      1333 MHz        40 W
Xeon 5150     2.66 GHz  4 MB      1333 MHz        65 W
Xeon 5160     3.0 GHz   4 MB      1333 MHz        80 W
Table - Woodcrest processor models

Xeon 5200 Series DP processor (Wolfdale)


The Wolfdale dual-core processor is based on the new 45 nm manufacturing process. It
features SSE4, which provides expanded computing capabilities over its predecessor. With the
Intel Core microarchitecture and Intel EM64T, the Wolfdale delivers superior performance and
energy efficiency to a broad range of 32-bit and 64-bit applications.

Processor model  Speed     L2 cache  Front-side bus  Power (TDP)
E5205            1.86 GHz  6 MB      1066 MHz        65 W
L5238            2.66 GHz  6 MB      1333 MHz        35 W
X5260            3.33 GHz  6 MB      1333 MHz        80 W
X5272            3.40 GHz  6 MB      1600 MHz        80 W
X5270            3.50 GHz  6 MB      1333 MHz        80 W
Table - Wolfdale processor models

Xeon 7200 Series MP Processor (Tigerton)


Tigerton comes with 7200 series 2-core and 7300 series 4-core options. For detailed
information about this topic, refer to the Xeon 7300 Tigerton discussion in the quad-core
processor section below.
Xeon 5500 (Gainestown)
The Intel 5500 series, with the Intel Nehalem microarchitecture, brings together a number of
advanced technologies for energy efficiency, virtualization, and intelligent performance.
Integrated with Intel QuickPath Technology, Intel Intelligent Power Technology, and Intel
Virtualization Technology, the Gainestown is available with a range of features for different
computing demands. Most of the Gainestown models are quad-core; refer to the quad-core
section below for specific models.

Quad-core processors
Quad-core processors differ from single-core and dual-core processors by providing four
independent execution cores. Although some execution resources are shared, each logical
processor has its own architecture state with its own set of general purpose registers and
control registers to provide increased system responsiveness. Each core runs at the same clock
speed.
Intel quad-core and six-core processors include the following:
Xeon 5300 Series DP processor (Clovertown)
The Clovertown processor is a quad-core design that is actually made up of two Woodcrest dies
in a single package. Each Woodcrest die has 4 MB of L2 cache, so the total L2 cache in
Clovertown is 8 MB.
The Clovertown processors are also based on the Intel Core microarchitecture as described in
the following Intel core microarchitecture topic.
Processor models available include the E5310, E5320, E5335, E5345, and E5355. The
processor front-side bus operates at either 1066 MHz (processor models ending in 0) or 1333
MHz (processor models ending in 5). For specifics, see the table below. None of these
processors support Hyper-Threading.

In addition to the features of the Intel Core microarchitecture, the features of the Clovertown
processor include:
Intel Virtualization Technology - processor hardware enhancements that support software-
based virtualization.
Intel 64 Architecture (EM64T) - support for both 64-bit and 32-bit applications.
Demand-Based Switching (DBS) - technology that enables hardware and software power
management features to lower average power consumption of the processor while maintaining
application performance.
Intel I/O Acceleration Technology (I/OAT) - reduces processor bottlenecks by offloading
network-related work from the processor.

Processor model  Speed     L2 cache  Front-side bus  Power (TDP)  Demand-based switching
E5310            1.60 GHz  8 MB      1066 MHz        80 W         No
E5320            1.86 GHz  8 MB      1066 MHz        80 W         Yes
E5335            2.00 GHz  8 MB      1333 MHz        80 W         No
E5345            2.33 GHz  8 MB      1333 MHz        80 W         Yes
E5355            2.66 GHz  8 MB      1333 MHz        120 W        Yes
Table - Clovertown processor models

Six-core processors
Six-core processors extend the quad-core paradigm by providing six independent execution
cores.
Xeon 7400 Series MP processor (Dunnington)
With enhanced 45 nm process technology, the Xeon 7400 series Dunnington processor features
a single-die 6-core design with 16 MB of L3 cache. Both 4-core and 6-core models of
Dunnington have shared L2 cache between each pair of cores. They also have a shared L3
cache across all cores of the processor. A larger L3 cache increases efficiency of cache-to-core
data transfers and maximizes main memory-to-processor bandwidth. Specifically built for
virtualization, this processor comes with enhanced Intel VT, which greatly optimizes
virtualization software efficiency.

 Intel Advanced Smart Cache


The L2 cache in the Core microarchitecture is shared between cores instead of each core using a separate L2.
The below Figure illustrates the difference in the cache between the traditional Xeon with Netburst
microarchitecture and the Intel Core microarchitecture.
The front-side bus utilization would be lower, similar to the Tulsa L3 shared cache as discussed in "Xeon 7100
Series MP processor (Tulsa)". With a dual-core processor, the Core microarchitecture allows a single
core to use the entire shared L2 cache if the second core is powered down for power-saving purposes. As the
second core begins to ramp up and use memory, it will allocate the L2 memory away from the first CPU until it
reaches a balanced state where there is equal use between cores.
Single core performance benchmarks, such as SPEC Int and SPEC FP, benefit from this architecture because
single core applications are able to allocate and use the entire L2 cache. SPEC Rate benchmarks balance the
traffic between the cores and more effectively balance the L2 caches.

Figure - Intel Xeon versus Intel Core Architecture

Intel Nehalem microarchitecture


The new Nehalem is built on the Core microarchitecture and incorporates significant system
architecture and memory hierarchy changes. Being scalable from 2 to 8 cores, the Nehalem
includes the following features:
 QuickPath Interconnect (QPI)
QPI, previously named Common System Interface or CSI, acts as a high-speed
interconnection between the processor and the rest of the system components, including
system memory, various I/O devices, and other processors. A significant benefit of the QPI
is that it is point-to-point. As a result, no longer is there a single bus that all processors must
connect to and compete for in order to reach memory and I/O. This improves scalability and
eliminates the competition between processors for bus bandwidth.
Each QPI link is a point-to-point, bi-directional interconnection that supports up to 6.4 GTps
(giga transfers per second). Each link is 20 bits wide using differential signaling, thus
providing bandwidth of 16 GBps. The QPI link carries QPI packages that are 80 bits wide,
with 64 bits for data and the remainder used for communication overhead. This gives a
64/80 rate for valid data transfer; thus, each QPI link essentially provides bandwidth of 12.8
GBps, and equates to a total of 25.6 GBps for bi-directional interconnection.
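
The bandwidth figures quoted above follow from simple arithmetic; the sketch below (Python, using only the values stated in the text) reproduces them:

# Worked QPI bandwidth figures (a sketch; inputs come from the text above).
transfer_rate = 6.4e9        # transfers per second per direction (6.4 GTps)
link_width_bits = 20         # bits carried per transfer
payload_bits = 64            # data bits per 80-bit QPI package

raw_bw = transfer_rate * link_width_bits / 8         # bytes per second, one direction
effective_bw = raw_bw * payload_bits / 80            # usable data, one direction

print(raw_bw / 1e9, "GBps raw")                      # 16.0 GBps
print(effective_bw / 1e9, "GBps of data")            # 12.8 GBps per direction, 25.6 GBps bi-directional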

 Integrated Memory Controller


The Nehalem replaces front-side bus memory access with an integrated memory controller
and QPI. Unlike the older front-side bus memory access scheme, this is the Intel approach
to implementing the scalable shared memory architecture known as non-uniform memory
architecture (NUMA) that we introduced with the IBM eX4 “Hurricane 4” memory controller
on the System x3950 M2.
As a part of Nehalem’s scalable shared memory architecture, Intel integrated the memory
controller into each processor’s silicon die. With that, the system can provide independent
high bandwidth, low latency local memory access, as well as scalable memory bandwidth as
the number of processors is increased. In addition, by using Intel QPI, the memory
controllers can also enjoy fast efficient access to remote memory controllers.

Compared to its predecessor, Nehalem's cache hierarchy extends to three levels, as shown in the
figure below. The first two levels are dedicated to individual cores and stay relatively small. The
third-level cache is much larger and is shared among all cores.

Figure - Intel Nehalem versus Intel Core



64-bit computing
As discussed above, there are three 64-bit implementations in the Intel-compatible processor
marketplace:
Intel IA64, as implemented on the Itanium 2 processor
Intel 64 Technology, as implemented on the 64-bit Xeon DP and Xeon MP processors
AMD AMD64, as implemented on the Opteron processor

There exists some uncertainty as to the definition of a 64-bit processor and, even more
importantly, the benefit of 64-bit computing.
Definition of 64-bit:
A 64-bit processor is a processor that is able to address 64 bits of virtual address space. A 64-
bit processor can store data in 64-bit format and perform arithmetic operations on 64-bit
operands. In addition, a 64-bit processor has general purpose registers (GPRs) and arithmetic
logical units (ALUs) that are 64 bits wide.

The Itanium 2 has both 64-bit addressability and GPRs and 64-bit ALUs. So, it is by definition a
64-bit processor.
Intel 64 Technology extends the IA32 instruction set to support 64-bit instructions and
addressing, but are Intel 64 Technology and AMD64 processors real 64-bit chips? The answer is
yes. When these processors operate in 64-bit mode, the addresses are 64-bit, the GPRs are
64 bits wide, and the ALUs are able to process data in 64-bit chunks. Therefore, these
processors are full-fledged 64-bit processors in this mode.
Note that while IA64, Intel 64 Technology, and AMD64 are all 64-bit, they are not compatible for
the following reasons:
Intel 64 Technology and AMD64 are, with the exception of a few instructions such as 3DNow!,
binary compatible with each other. Applications written and compiled for one will usually run at
full speed on the other.
IA64 uses a completely different instruction set from the other two. 64-bit applications written for
the Itanium 2 will not run on the Intel 64 Technology or AMD64 processors, and vice versa.

64-bit extensions: AMD64 and Intel 64 Technology


Both the AMD AMD64 and Intel 64 Technology (formerly known as EM64T) architectures
extend the well-established IA32 instruction set with:
 A set of new 64-bit general purpose registers (GPRs)
 64-bit instruction pointers
 The ability to process data in 64-bit chunks
 Up to 1 TB of address space for physical memory
 64-bit integer support and a 64-bit flat virtual address space
Even though the names of these extensions suggest that the improvements are simply in
memory addressability, both the AMD64 and the Intel 64 Technology are fully functional 64-bit
processors.
There are three distinct operation modes available in AMD64 and Intel 64 Technology:
 32-bit legacy mode
 Compatibility mode
 Full 64-bit mode (long mode)
For more information about the AMD64 architecture, see: http://www.x86-64.org/
For more information about Intel 64 Technology, see:
http://www.intel.com/technology/64bitextensions/

Benefits of 64-bit (AMD64, Intel 64 Technology) computing


Processors using the Intel 64 Technology and AMD64 architectures are making this transition very smooth by
offering 32-bit and 64-bit modes. This means that the hardware support for 64-bit will be in place before you
upgrade or replace your software applications with 64-bit versions. IBM System x already has many models
available with the Intel 64 Technology-based Xeon and AMD64 Opteron processors.
Here are examples of applications that will benefit from 64-bit computing:
 Encryption applications
Most encryption algorithms are based on very large integers and would benefit greatly from the use of 64-bit
GPRs and ALUs. Although modern high-level languages allow you to specify integers above the 2^32 limit, in a
32-bit system this is achieved by using two 32-bit operands, thereby causing significant overhead when
moving those operands through the CPU pipelines. A 64-bit processor allows you to perform a 64-bit integer
operation with one instruction.
 Scientific applications
Scientific applications are another example of workloads that need 64-bit data operations. Floating-point
operations do not benefit from the larger integer size because floating-point registers are already 80 or 128 bits
wide even in 32-bit processors.
 Software applications requiring more than 4 GB of memory
The biggest advantage of 64-bit computing for commercial applications is the flat, potentially massive, address
space.
32-bit enterprise applications such as databases currently implement Physical Address Extension
(PAE) and Address Windowing Extensions (AWE) addressing schemes to access memory above the 4 GB
limit imposed by 32-bit address-limited processors. With Intel 64 Technology and AMD64, these 32-bit
addressing extension schemes support access to memory up to 128 GB in size.

In addition, 32-bit applications might also get a performance boost from a 64-bit Intel 64
Technology or AMD64 system running a 64-bit operating system. When the processor runs in
Compatibility mode, every process has its own 4 GB memory space, not the 2 GB or 3 GB
memory space each gets on a 32-bit platform. This is already a huge improvement compared to
IA32, where the operating system and the application had to share those 4 GB of memory.
When the application is designed to take advantage of more memory, the availability of the
additional 1 or 2 GB of physical memory can create a significant performance improvement. Not
all applications take advantage of the global memory available. APIs in code need to be used to
recognize the availability of more than 2 GB of memory.
Furthermore, some applications will not benefit at all from 64-bit computing and might even
experience degraded performance. If an application does not require greater memory capacity
or does not perform high-precision integer or floating-point operations, then 64-bit will not
provide any improvement.

Shared cache
Shared cache was introduced in the "Intel Core microarchitecture" section as a shared L2 cache to
improve resource utilization and boost performance. In Nehalem, the shared cache moves to the
third-level cache for overall system performance considerations. In both cases, the last level of
cache is shared among different cores and provides significant benefits in multi-core
environments as opposed to a dedicated cache implementation.
Benefits of shared cache include:
 Improved resource utilization, which makes efficient usage of the cache. When one core idles,
the other core can take all the shared cache.
 Shared cache, which offers faster data transfer between cores than system memory, thus
improving system performance and reducing traffic to memory.
 Reduced cache coherency complexity, because a coherency protocol does not need to be set
for the shared level cache because data is shared to be consistent rather than distributed.
 More flexible design of the code relating to communication of threads and cores because
programmers can leverage this hardware characteristic.
 Reduced data storage redundancy, because the same data in the shared cache needs to be
stored only once.

Memory

Memory technology
This section introduces key terminology and technology that are related to memory; the topics
discussed are:
 “DIMMs and DRAMs”
 “Ranks”
 “SDRAM”
 “Registered and unbuffered DIMMs”
 “Double Data Rate memory”
 “Fully-buffered DIMMs”
 “MetaSDRAM”
 “DIMM nomenclature”
 “DIMMs layout”
 “Memory interleaving”

DIMMs and DRAMs

Figure - DRAM chips on a DIMM



Figure - DRAM capacity as printed on a PC3200 (400 MHz) DDR DIMM

The sum of the capacities of the DRAM chips (minus any used for ECC functions, if any) equals the
capacity of the DIMM. Using the previous example, the DRAMs in the figure above are 8 bits wide, so:

8 x 128M = 1024 Mbits = 128 MB per DRAM


128 MB x 8 DRAM chips = 1024 MB or 1 GB of memory
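
The same arithmetic expressed as a short sketch (Python; the x8 width, 128M depth, and 8 data DRAMs come from the example above, with ECC DRAMs excluded):

# Minimal DIMM capacity sketch.
dram_width_bits = 8          # x8 DRAM devices
dram_depth_mbits = 128       # 128M locations per DRAM ("128M x 8" on the label)
dram_count = 8               # data DRAM chips on the DIMM (ECC chips excluded)

per_dram_mbit = dram_width_bits * dram_depth_mbits   # 8 x 128M = 1024 Mbit per DRAM
per_dram_mb = per_dram_mbit // 8                     # 128 MB per DRAM
dimm_mb = per_dram_mb * dram_count                   # 1024 MB, or 1 GB, per DIMM
print(per_dram_mb, "MB per DRAM;", dimm_mb, "MB per DIMM")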

Ranks
A rank is a set of DRAM chips on a DIMM that provides eight bytes (64 bits) of data. DIMMs are
typically configured as either single-rank (1R) or double-rank (2R) devices but quad-rank
devices (4R) are becoming more prevalent.
Using x4 DRAM devices, and not including DRAMS for ECC, a rank of memory is composed of
64 / 4 = 16 DRAMs. Similarly, using x8 DRAM devices, a rank is composed of only 64 / 8 = 8
DRAMs.
It is common, but less accurate, to refer to memory ranking in terms of “sides”. For example,
single-rank DIMMs can often be referred to as single-sided DIMMs, and double-rank DIMMs
can often be referred to as double-sided DIMMs.
However, single ranked DIMMs, especially those using x4 DRAMs often have DRAMs mounted
on both sides of the DIMMs, and quad-rank DIMMs will also have DRAMs mounted on two
sides. For these reasons, it is best to standardize on the true DIMM ranking when describing the
DIMMs.
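
A small sketch of the rank arithmetic (Python): because a rank supplies 64 data bits, the number of data DRAMs per rank follows directly from the DRAM width (ECC devices excluded).

# Minimal sketch: data DRAMs needed to form one 64-bit rank.
def drams_per_rank(dram_width_bits, rank_width_bits=64):
    return rank_width_bits // dram_width_bits

print(drams_per_rank(4))   # 16 x4 DRAMs per rank
print(drams_per_rank(8))   # 8 x8 DRAMs per rank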

DIMMs may have many possible DRAM layouts, depending on word size, number of ranks, and
manufacturer design. Common layouts for single and dual-rank DIMMs are identified here:
 x8SR = x8 single-ranked modules
These have five DRAMs on the front and four DRAMs on the back with empty spots in between
the DRAMs, or they can have all 9 DRAMs on one side of the DIMM only.
 x8DR = x8 double-ranked modules
These have nine DRAMs on each side for a total of 18 (no empty slots).
 x4SR = x4 single-ranked modules
These have nine DRAMs on each side for a total of 18, and they look similar to x8 double-
ranked modules.
 x4DR = x4 double-ranked modules
These have 18 DRAMs on each side, for a total of 36.
The rank of a DIMM also impacts how many failures a DIMM can tolerate using redundant bit
steering.

SDRAM
Synchronous Dynamic Random Access Memory (SDRAM) is used commonly in servers today,
and this memory type continues to evolve to keep pace with modern processors. SDRAM
enables fast, continuous bursting of sequential memory addresses. After the first address is
supplied, the SDRAM itself increments an address pointer and readies the next memory
location that is accessed. The SDRAM continues bursting until the predetermined length of data
has been accessed. The SDRAM supplies and uses a synchronous clock to clock out data from
the SDRAM chips. The address generator logic of the SDRAM module also uses the system-
supplied clock to increment the address counter to point to the next address.
There are two types of SDRAMs currently on the market: registered and unbuffered. Only
registered SDRAMs are now used in System x servers, however. Registered and unbuffered
DIMMs cannot be mixed together in a server.
With unbuffered DIMMs, the memory controller communicates directly with the DRAMs, giving
them a slight performance advantage over registered DIMMs. The disadvantage of unbuffered
DIMMs is that they have a limited drive capability, which means that the number of DIMMs that
can be connected together on the same bus remains small, due to electrical loading.
In contrast, registered DIMMs use registers to isolate the memory controller from the DRAMs,
which leads to a lighter electrical load. Therefore, more DIMMs can be interconnected and
larger memory capacity is possible. The register does, however, typically impose a clock or
more of delay, meaning that registered DIMMs often have slightly longer access times than their
unbuffered counterparts.

Double Data Rate memory

Data transfers made to and from an SDRAM DIMM use a synchronous clock signal to establish
timing. For example, SDRAM memory transfers data whenever the clock signal makes a
transition from a logic low level to a logic high level. Faster clock speeds mean faster data
transfer from the DIMM into the memory controller (and finally to the processor) or PCI
adapters. However, electromagnetic effects induce noise, which limits how fast signals can be
cycled across the memory bus, and have prevented memory speeds from increasing as fast as
processor speeds.

All Double Data Rate (DDR) memories, including DDR, DDR2, and DDR3, increase their
effective data rate by transferring data on both the rising edge and the falling edge of the clock
signal. DDR DIMMs use a 2-bit prefetch scheme such that two sets of data are effectively
referenced simultaneously. Logic on the DIMM multiplexes the two 64-bit results (plus ECC bits)
to appear on subsequent data transfers. Thus, two data transfers can be performed during one
memory bus clock cycle, enabling double the data transfer rate over non-DDR technologies.
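
As an illustration (Python; the 200 MHz bus clock is an assumed example, matching PC3200/DDR-400), the peak sequential rate of a DDR bus is the clock rate times two transfers per clock times eight bytes per transfer:

# Minimal sketch: peak sequential DDR transfer rate.
def ddr_peak_mbps(bus_clock_mhz, bus_width_bytes=8, transfers_per_clock=2):
    return bus_clock_mhz * transfers_per_clock * bus_width_bytes

print(ddr_peak_mbps(200), "MBps")   # 3200 MBps for a 200 MHz DDR bus (hence "PC3200")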

DDR2

DDR2 is the technology follow-on to DDR, with the primary benefits being the potential for faster
throughput and lower power. In DDR2, the memory bus is clocked at two times the frequency of
the memory core. Stated alternatively, for a given memory bus speed, DDR2 allows the memory
core to operate at half the frequency, thereby enabling a potential power savings.
Although DDR memory topped out at a memory bus clock speed of 200 MHz, DDR2 memory
increases the memory bus speed to as much as 400 MHz. Note that even higher DDR2 speeds
are available, but these are primarily used in desktop PCs. The figure below shows standard and
small-form-factor DDR2 DIMMs.

Figure - A standard DDR2 DIMM (top) and small form-factor DDR2 DIMM (bottom)

DDR2 also enables additional power savings through use of a lower operating voltage. DDR
uses a range of 2.5 V to 2.8 V. DDR2 only requires 1.8 V.

Because only the highest speed DDR2 parts have similar memory core speeds as compared to
DDR due to being clocked at half the bus rate, DDR2 employs a number of mechanisms to
reduce the potential for performance loss. DDR2 increases the number of bits prefetched into
I/O buffers from the memory core to 4 per clock, thus enabling the sequential memory
throughput for DDR and DDR2 memory to be equal when the memory bus speeds are equal.

However, because DDR2 still has a slower memory core clock when the memory bus speeds
are equal, the memory latencies of DDR2 are typically higher.
Fortunately, the latest generation and most common DDR2 memory speeds typically also
operate at significantly higher memory frequency.

DDR2 Memory is commonly found in recent AMD Opteron systems, as well as in the HS12
server blade and the x3850M2 and x3950M2 platforms.
The table below lists the common DDR2 memory implementations.

Table - DDR2 memory implementations



DDR3

DDR3 is the next evolution of DDR memory technology. Like DDR2, it promises to deliver ever-
increasing memory bandwidth and power savings. DDR3 DIMMs are used on all 2-socket-
capable servers using the Intel 5500-series processors and newer.

DDR3 achieves its power efficiency and throughput advantages over DDR2 using many of the
same fundamental mechanisms employed by DDR2.

DDR3 improvements include:


 Lower supply voltages, down to 1.5 V in DDR3 from 1.8 V in DDR2
 Memory bus clock speed increases to four times the core clock
 Prefetch depth increases from 4-bit in DDR2 to 8-bit in DDR3
 Continued silicon process improvements

Latency
The performance of memory access is usually described by listing the number of memory bus clock cycles that
are necessary for each of the 64-bit transfers needed to fill a cache line. Cache lines are multiplexed to
increase performance, and the addresses are divided into row addresses and column addresses.
A row address is the upper half of the address (that is, the upper 32 bits of a 64-bit address). A column address
is the lower half of the address. The row address must be set first, then the column address must be set. When
the memory controller is ready to issue a read or write request, the address lines are set, and the command is
issued to the DIMMs.
When two requests have different column addresses but use the same row address, they are said to "occur in
the same page." When multiple requests to the same page occur together, the memory controller can set the
row address once, and then change the column address as needed for each reference. The page can be left
open until it is no longer needed, or it can be closed after the first request is issued. These policies are referred
to as a page open policy and a page closed policy, respectively.
The act of changing a column address is referred to as Column Address Select (CAS).
There are three common access times:
CAS: Column Address Select
RAS to CAS: delay between row access and column access
RAS: Row Address Strobe
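
As an illustration only (the 3-cycle CAS and 400 MHz bus are assumptions, not vendor figures), an access time quoted in bus clock cycles converts to nanoseconds by dividing by the bus frequency:

# Minimal sketch: convert an access time in bus clock cycles to nanoseconds.
def cycles_to_ns(cycles, bus_clock_mhz):
    return cycles / bus_clock_mhz * 1000

print(cycles_to_ns(3, 400), "ns")   # a CAS of 3 cycles on a 400 MHz bus is 7.5 ns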

SMP and NUMA architectures


There are two fundamental system architectures used in the x86 server market: SMP and
NUMA. Each architecture can have its strengths and limitations, depending on the target
workload, so it is important to understand these details when comparing systems.
SMP architecture
Prior to the introduction of Intel 5500-series processors, most systems based on Intel
processors typically use the Symmetric Multiprocessing (SMP) Architecture. The exception to
this is the IBM x3950 M2, which uses a combination of SMP in each node and NUMA
architecture between nodes.
SMP systems are fundamentally defined by having one or more Front-Side Buses (FSB) that
connect the processors to a single memory controller, also known as a north bridge. Although
older system architectures commonly deployed a single shared FSB, newer systems often have
one FSB per processor socket.
In an SMP architecture, the CPU accesses memory DIMMs through the separate memory
controller or north bridge. The north bridge handles traffic between the processor and memory,
and controls traffic between the processor and I/O devices, as well as data traffic between I/O
devices and memory. In the below figure, the central position of the north bridge and the shared
front-side bus are shown. These components play the dominant role in determining memory
performance.

64-bit memory addressing

To break through the 4 GB limitations of 32-bit addressing, CPU and operating system vendors
extended the x86 specification to 64-bits. Known by many names, this technology is most
generally referred to as x86-64 or x64, though Intel refers to it as EM64T, and AMD uses the
name AMD64. Fundamentally, this technology enables significantly increased memory
addressability for both operating systems and applications.
Below table illustrates the differences between the 32-bit and 64-bit operating systems.

Table - Virtual memory limits


a. 3 GB for the application and 1 GB for the kernel if system booted with a /3GB switch.
b. 4 GB if the 32-bit application has the LARGEADDRESSAWARE flag set (LAA).

The width of a memory address dictates how much memory the processor can address. As
shown in the table below, a 32-bit processor can address up to 2^32 bytes, or 4 GB. A 64-bit
processor can theoretically address up to 2^64 bytes, or 16 exabytes (16,777,216 terabytes).

Table - Relation between address space and number of address bits
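
The addressability arithmetic can be verified with a couple of lines (Python):

# Minimal sketch: address space as a function of address width.
print(f"32-bit addressing: {2**32 / 2**30:.0f} GB")                           # 4 GB
print(f"64-bit addressing: {2**64 / 2**40:.0f} TB = {2**64 / 2**60:.0f} EB")  # 16777216 TB = 16 EB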



Current implementation limits are related to memory technology and economics. As a result,
physical addressing limits for processors are typically implemented using less than the full 64
potential address bits, as shown in the table below.

Table - Memory supported by processors

These values are the limits imposed by the processors themselves. They represent the
maximum theoretical memory space of a system using these processors.

[Tip: Both Intel 64 and AMD64 server architectures can utilize either the 32-bit (x86) or 64-bit
(x64) versions of their respective operating systems. However, the 64-bit architecture
extensions will be ignored if 32-bit operating system versions are employed. For systems using
x64 operating systems, it is important to note that the drivers must also be 64-bit capable.]
