
CHAPTER 3 Network Maintenance

This document discusses network maintenance and provides details on various related tasks. It describes network maintenance as proactively working to keep a network running smoothly by troubleshooting issues, installing software and hardware, monitoring performance, and planning for growth. Common maintenance models are also outlined, such as FCAPS, which focuses on fault, configuration, accounting, performance, and security management. The document emphasizes the importance of documentation, change management, and planning for failures through metrics like MTTR, MTBF, and MTTF.

Uploaded by Oge Esther

Copyright © All Rights Reserved

Network Maintenance

Network maintenance means doing whatever it takes to keep a network up and running. It
includes a number of tasks:

• Troubleshooting network problems.
• Hardware and software installation/configuration.
• Monitoring and improving network performance.
• Planning for future network growth.
• Creating network documentation and keeping it up-to-date.
• Ensuring compliance with company policies.
• Ensuring compliance with legal regulations.
• Securing the network against all kinds of threats.

Of course, this list may be different for each network you work on, and you may be responsible
for only some of these tasks. All of these tasks can be performed in one of two ways:

1. Structured tasks.
2. Interrupt-driven tasks.

Structured (proactive) means you have a predefined maintenance plan that makes sure problems
are solved before they occur. As a network engineer, this will also make your life a whole lot
easier. Interrupt-driven (reactive) means you wait for trouble to occur and then fix it as fast
as you can.
NETWORK MAINTENANCE MODEL
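A network maintenance model is the strategy an organization puts in place for carrying out
maintenance. The strategy varies from one company to another.
Types of Network Maintenance Models
- FCAPS (Fault, Configuration, Accounting, Performance and Security management)
- PPDIOO (Prepare, Plan, Design, Implement, Operate and Optimize)
- Others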
To give you an idea of what a network maintenance model is about and what it looks like, here
is an example for FCAPS:

• Fault management: we will configure our network devices (routers, switches, firewalls,
servers, etc.) to capture logging messages and send them to an external server. Whenever
an interface goes down or the CPU goes above 80% we want to receive an e-mail so we
can see what is going on.
• Configuration management: Any changes made to the network have to be logged. We
will use a change management process so relevant personnel will be notified of planned
network changes. Changes to network devices have to be reported and acknowledged before
they are implemented.
• Accounting management: We will charge (guest) users for their use of the wireless
network, for example a fee for every 100 MB of data. Accounting is also commonly used to
charge people for long-distance VoIP calls.
• Performance management: Network performance will be monitored on all LAN and
WAN links so we know when things go wrong. QoS (Quality of Service) will be
configured on the appropriate interfaces.
• Security management: We will create a security policy and implement it using
firewalls, VPNs and intrusion prevention systems, and we will use AAA (Authentication,
Authorization and Accounting) servers to validate user credentials. Network breaches
have to be logged and an appropriate response has to be made.
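The fault-management rule in the FCAPS example (alert when an interface goes down or the CPU goes above 80%) can be sketched as a check over polled device status. This is a minimal sketch under stated assumptions: the `DeviceStatus` record and the `check_faults` helper are hypothetical, the status values would in practice come from SNMP polling or syslog messages, and the actual e-mail delivery is left out.

```python
from dataclasses import dataclass, field

CPU_THRESHOLD = 80.0  # percent, as in the FCAPS example

@dataclass
class DeviceStatus:
    """Hypothetical snapshot of one device, e.g. gathered by SNMP polling."""
    name: str
    cpu_percent: float
    interfaces: dict[str, bool] = field(default_factory=dict)  # interface -> is it up?

def check_faults(devices: list[DeviceStatus]) -> list[str]:
    """Return one alert message per fault condition found in the snapshot."""
    alerts = []
    for dev in devices:
        if dev.cpu_percent > CPU_THRESHOLD:
            alerts.append(f"{dev.name}: CPU at {dev.cpu_percent:.0f}% (> {CPU_THRESHOLD:.0f}%)")
        for ifname, is_up in dev.interfaces.items():
            if not is_up:
                alerts.append(f"{dev.name}: interface {ifname} is down")
    return alerts

if __name__ == "__main__":
    snapshot = [
        DeviceStatus("core-sw1", 91.0, {"Gi0/1": True, "Gi0/2": False}),
        DeviceStatus("edge-rtr1", 35.0, {"Gi0/0": True}),
    ]
    for alert in check_faults(snapshot):
        print(alert)  # a real deployment would e-mail these to the on-call engineer
```

The thresholds live in one constant so the policy ("what" and "when") stays separate from the polling mechanism ("how").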

As you can see, FCAPS is not just a “theoretical” method; it describes “what” we will do, “how”
we will do it and “when”.

Whatever network maintenance model you decide to use, there are always a number of routine
maintenance tasks that should have documented procedures. Here are a few examples:

• Configuration changes: Businesses are never static; they change all the time.
Sometimes you need to change the network to give guest users access, or normal users
might move from one office to another and you’ll have to change the network to
accommodate them.
• Replacement of hardware: Older hardware has to be replaced with more modern
equipment, and production hardware may also fail, in which case we’ll have to replace it
immediately.
• Backups: If we want to recover from network problems such as failing switches or
routers, we need to make sure we have recent backups of their configurations. Normally
you will schedule backups so that the running configuration is saved every day, week,
month, or at whatever interval suits you.
• Software updates: We need to keep our network devices and operating systems up-to-date.
Updates fix bugs, and they also make sure we don’t have devices running older software
with known security vulnerabilities.
• Monitoring: We need to collect and understand traffic statistics and bandwidth
utilization so we can spot (future) network problems but also so we can plan for future
network growth.
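Bandwidth utilization is commonly derived from two successive readings of an interface's octet counter, such as the SNMP `ifInOctets` counter: the change in octets, converted to bits, divided by what the link could have carried during the polling interval. A minimal sketch, assuming a 32-bit counter that wraps around at most once between polls; the function name and sample values are illustrative.

```python
COUNTER32_MAX = 2**32  # a 32-bit SNMP counter wraps back to 0 after 2^32 - 1

def utilization_percent(octets_t1: int, octets_t2: int,
                        interval_s: float, if_speed_bps: int) -> float:
    """Percentage of link capacity used between two counter readings."""
    delta = octets_t2 - octets_t1
    if delta < 0:              # counter wrapped around between the two polls
        delta += COUNTER32_MAX
    bits = delta * 8           # octets -> bits
    return bits / (interval_s * if_speed_bps) * 100

if __name__ == "__main__":
    # Two readings 300 seconds apart on a 100 Mbit/s link
    print(f"{utilization_percent(1_000_000, 751_000_000, 300, 100_000_000):.1f}%")  # 20.0%
```

On fast links the 64-bit `ifHCInOctets` counter is usually preferred, because a 32-bit counter can wrap more than once per polling interval and the single-wrap correction above would then be wrong.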

Normally you will create a list of the tasks that have to be done for your network, and each
task can be assigned a priority. If an access layer switch fails, you will likely want to
replace it as fast as you can, but a failed distribution or core layer device will have a much
higher priority since it impacts more users of the network. Other tasks, like backups and
software updates, can be scheduled. You will probably want to install software updates outside
of business hours, and backups can be scheduled to run every day after midnight, for example.
The advantage of scheduling tasks is that network engineers are less likely to forget to do
them.
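A prioritized task list like this can be modelled as a priority queue. A minimal sketch using Python's standard `heapq`; the category names, their priority values (core before distribution before access, with scheduled work last) and the task descriptions are illustrative assumptions, not a prescribed scheme.

```python
import heapq
from itertools import count

# Lower number = handled first. Illustrative priorities only.
PRIORITY = {"core": 0, "distribution": 1, "access": 2, "scheduled": 3}

def order_tasks(tasks: list[tuple[str, str]]) -> list[str]:
    """Return task descriptions in the order they should be handled.

    Each task is (category, description); tasks with equal priority
    keep the order in which they were added.
    """
    heap = []
    seq = count()  # tie-breaker so equal priorities stay first-in, first-out
    for category, description in tasks:
        heapq.heappush(heap, (PRIORITY[category], next(seq), description))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

if __name__ == "__main__":
    queue = [
        ("scheduled", "Install software update on edge routers (after hours)"),
        ("access", "Replace failed access switch in building B"),
        ("core", "Replace failed core switch"),
        ("scheduled", "Nightly configuration backup"),
    ]
    for task in order_tasks(queue):
        print(task)
```

The sequence counter matters: without it, two tasks with the same priority would be compared by their description strings, which is not a meaningful order.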

Making changes to your network will sometimes impact the productivity of users who rely on
network availability. Some changes will have a huge impact: changes to firewall or access-list
rules might impact more users than you’d wish for. For example, you might install a new
firewall and plan for a certain result, but forget about an application that uses random port
numbers, and you end up troubleshooting the issue. Meanwhile, some users are not able to use
this application (and are shouting at you while you try to fix it…;).

Larger companies might have more than one IT department, and each department is responsible
for different network services. If you plan to replace a certain router tomorrow at 2 AM, you
might want to warn the “Microsoft Windows” department because their servers will be
unreachable. You can use change management for this: when you plan to make a change to the
network, the other departments are informed and can object if the change conflicts with their
planning.
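Detecting a conflict with another department's planning boils down to checking whether two time windows overlap. A minimal sketch, assuming each department publishes its planned work as a start/end pair; the `ChangeWindow` record, the `conflicts` helper and the example entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeWindow:
    owner: str          # department that planned the work
    description: str
    start: datetime
    end: datetime

def overlaps(a: ChangeWindow, b: ChangeWindow) -> bool:
    """Half-open windows [start, end) overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end

def conflicts(proposed: ChangeWindow, calendar: list[ChangeWindow]) -> list[ChangeWindow]:
    """Return the already-planned windows that clash with the proposed change."""
    return [w for w in calendar if overlaps(proposed, w)]

if __name__ == "__main__":
    calendar = [
        ChangeWindow("Windows team", "Patch file servers",
                     datetime(2024, 5, 14, 1, 30), datetime(2024, 5, 14, 3, 0)),
        ChangeWindow("Telephony", "VoIP gateway upgrade",
                     datetime(2024, 5, 15, 2, 0), datetime(2024, 5, 15, 4, 0)),
    ]
    router_swap = ChangeWindow("Network team", "Replace WAN router",
                               datetime(2024, 5, 14, 2, 0), datetime(2024, 5, 14, 3, 0))
    for clash in conflicts(router_swap, calendar):
        print(f"Conflict with {clash.owner}: {clash.description}")
```

Treating windows as half-open means a change that ends at 03:00 does not clash with one that starts at 03:00.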

When you want to implement change management you might want to think about the following:

• Who will be responsible for authorizing changes to the network?
• Which tasks will be performed during scheduled maintenance windows?
• What procedures have to be followed before making a change? (for example: doing a
“copy run start” before making changes to a switch).
• How will you measure the success or failure of network changes? (For example, if you
plan to change a number of IP addresses, you will plan the time required to make this
change. If it takes 5 minutes to reconfigure the IP addresses and you end up
troubleshooting for 2 hours because something else is not working, you might want to
“roll back” to the previous configuration. How much time do you allow for
troubleshooting? 5 minutes? 10 minutes? 1 hour?)
• How, when and who will add the network change to the network documentation?
• How will you create a rollback plan so you can restore a configuration to the previous
configuration in case of unexpected problems?
• What circumstances will allow change management policies to be overruled?
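Saving the configuration before a change (the “copy run start” step) also makes it easy to see exactly what a change did, which helps both when documenting the change and when planning a rollback. A minimal sketch using Python's standard `difflib`; the two configuration snippets are made-up examples that, in practice, would be the saved pre-change configuration and the current running configuration.

```python
import difflib

def config_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines describing how the configuration changed."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before-change", tofile="after-change", lineterm=""))

if __name__ == "__main__":
    before = """interface GigabitEthernet0/1
 ip address 192.168.1.1 255.255.255.0
 no shutdown"""
    after = """interface GigabitEthernet0/1
 ip address 192.168.10.1 255.255.255.0
 no shutdown"""
    for line in config_diff(before, after):
        print(line)
```

Lines prefixed with `-` are what a rollback would have to restore, and the diff itself is a compact record to attach to the change in the network documentation.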

Another task we have to do is to create and update our network documentation. Whenever a new
network is designed and created it should be documented.

The more challenging part is to keep it up-to-date in the future. There are a number of items that
you should find in any network documentation:

• Physical topology diagram: This should show all the network devices and how they are
physically connected to each other.
• Logical topology diagram: This should show how devices are logically connected to each
other: the protocols that are used, VLAN information, etc.
• Interconnections: It’s useful to have a diagram that shows which interfaces of one
network device are connected to the interface of another network device.
• Inventory: You should have an inventory of all network equipment, vendor lists, product
numbers, software versions and software license information, and each network device should
have an organization asset tag number.
• IP Addresses: You should have a diagram that covers all the IP addresses in use on the
network and on which interfaces they are configured.
• Configuration management: Before changing a configuration we should save the
current running-configuration so it’s easy to restore to a previous (working) version. It’s
even better to keep an archive of older configurations for future use.
• Design documents: Documents that were created during the original design of the
network should be kept so you can always check why certain design decisions were
made.
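An IP address inventory is easier to keep accurate if it can be checked automatically. A minimal sketch using Python's standard `ipaddress` module: it flags addresses assigned more than once and addresses that fall outside the documented subnets. The inventory format (a dictionary from "device interface" to "address/prefix") and the entries are hypothetical.

```python
import ipaddress
from collections import Counter

def audit_ip_inventory(subnets: list[str], assignments: dict[str, str]) -> list[str]:
    """Check documented interface addresses against the documented subnets.

    assignments maps "device interface" -> "address/prefix".
    """
    nets = [ipaddress.ip_network(s) for s in subnets]
    addrs = {where: ipaddress.ip_interface(a) for where, a in assignments.items()}
    uses = Counter(iface.ip for iface in addrs.values())  # how often each IP appears
    problems = []
    for where, iface in addrs.items():
        if uses[iface.ip] > 1:
            problems.append(f"{where}: {iface.ip} is assigned more than once")
        if not any(iface.ip in net for net in nets):
            problems.append(f"{where}: {iface.ip} is not in any documented subnet")
    return problems

if __name__ == "__main__":
    subnets = ["10.0.10.0/24", "10.0.20.0/24"]
    assignments = {
        "R1 Gi0/0": "10.0.10.1/24",
        "R1 Gi0/1": "10.0.20.1/24",
        "SW1 Vlan10": "10.0.10.1/24",
        "SW2 Vlan30": "10.0.30.2/24",
    }
    for problem in audit_ip_inventory(subnets, assignments):
        print(problem)
```

An empty result means the documented addresses are at least internally consistent; it does not prove they match what is actually configured on the devices.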

FAILURE AND FAILURE METRICS


Even the most efficient maintenance teams experience equipment failures. That’s why it’s
critical to plan for them.

Failure exists in varying degrees (e.g. partial or total failure) but in the most basic terms,
failure simply means that a system, component, or device can no longer produce specific desired
results. Even if a piece of manufacturing equipment is still running and producing items, it has
failed if it doesn’t deliver the expected quantities.

Managing failure correctly can help you significantly reduce its negative impact. To help you
manage failures effectively, there are a number of critical metrics that should be monitored,
such as MTTR, MTBF and MTTF.

1. Mean Time To Repair (MTTR) refers to the amount of time required to repair a system
and restore it to full functionality. The MTTR clock starts ticking when the repairs start
and it goes on until operations are restored. This includes repair time, testing period,
and return to the normal operating condition. Taking too long to repair a system or
equipment is not desirable as it can have a highly unpleasant impact on business results.
This is especially the case for processes that are particularly sensitive to failure. It often
results in production downtime, missed deadlines, loss of revenue and so on.

Mean Time To Recovery (also abbreviated MTTR) measures the time from the point at which the
failure is first discovered until the point at which the equipment returns to operation. So, in
addition to repair time, testing period, and return to normal operating condition, it captures
failure notification time and diagnosis.
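The two definitions differ only in when the clock starts. A minimal sketch, assuming each incident is logged with the time the failure was detected, the time repairs started, and the time normal operation was restored; the `Incident` record, helper names and sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    detected: datetime        # failure first discovered
    repair_started: datetime  # repairs begin
    restored: datetime        # back in normal operation (after testing)

def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Clock starts when repairs start."""
    return _mean([i.restored - i.repair_started for i in incidents])

def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Clock starts at discovery, so notification and diagnosis time are included."""
    return _mean([i.restored - i.detected for i in incidents])

if __name__ == "__main__":
    log = [
        Incident(datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30),
                 datetime(2024, 3, 1, 11, 0)),
        Incident(datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 0),
                 datetime(2024, 3, 8, 15, 30)),
    ]
    print("Mean time to repair:  ", mean_time_to_repair(log))    # 1:00:00
    print("Mean time to recovery:", mean_time_to_recovery(log))  # 1:45:00
```

The gap between the two figures (45 minutes here) is the average time spent noticing and diagnosing failures, which is often the easiest part to reduce with better monitoring.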
2. Mean Time Between Failures (MTBF) measures the predicted time that elapses between one
failure of a mechanical/electrical system and the next failure during normal operation; in
other words, the time between one system breakdown and the next.

The expectation that failure will occur at some point is an essential part of MTBF.

The term MTBF is used for repairable systems, but it does not take into account units
that are shut down for routine scheduled maintenance (re-calibration, servicing, lubrication)
or routine preventive parts replacement. Rather, it captures failures that occur due to design
conditions and force the unit out of operation until it can be repaired.

So, while MTTR measures availability, MTBF measures availability and reliability.
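For a repairable device, MTBF is usually estimated as total operating time (scheduled maintenance excluded) divided by the number of failures, and combining it with MTTR gives the commonly used steady-state availability, Availability = MTBF / (MTBF + MTTR). A minimal sketch; the function names and figures are illustrative.

```python
def mtbf_hours(total_uptime_hours: float, failures: int) -> float:
    """Mean time between failures: operating time per failure."""
    if failures == 0:
        raise ValueError("MTBF is undefined when no failures have been observed")
    return total_uptime_hours / failures

def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf / (mtbf + mttr)

if __name__ == "__main__":
    # A router that accumulated 8,760 operating hours and failed 4 times,
    # with an MTTR of 2 hours per failure.
    m = mtbf_hours(8760, 4)
    print(m)  # 2190.0
    print(f"Availability: {availability(m, 2):.4%}")
```

Both metrics appear in the availability formula, which is why a long MTBF alone does not guarantee high availability if each repair takes a long time.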

3. Mean Time To Failure (MTTF) is a very basic measure of reliability used for non-
repairable systems. It represents the length of time that an item is expected to last in
operation until it fails.

MTTF is what we commonly refer to as the lifetime of a product or device. Its value is
calculated by observing a large number of items of the same kind over an extended period of
time and computing their mean time to failure.
MTTF is one of the many metrics commonly used to evaluate the reliability of manufactured
products. However, MTTF and MTBF are often confused because their definitions are similar.
The confusion is easily resolved by remembering that MTBF is used only for repairable items,
while MTTF is used for non-repairable items.
When using MTTF as a failure metric, repair of the asset is not an option.

MTTF is an important metric used to estimate the lifespan of products that are not
repairable. Common examples of these products range from items like fan belts in automobiles
to light bulbs in our homes and offices.
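Because the items are not repaired, MTTF is estimated over a population of identical units. A minimal sketch, assuming every unit in the test was run until it failed, so the estimate is simply the mean of the observed lifetimes; the lifetimes are made-up values.

```python
def mttf_hours(lifetimes_hours: list[float]) -> float:
    """Mean time to failure for non-repairable items, each run until it failed."""
    if not lifetimes_hours:
        raise ValueError("need at least one failed unit")
    return sum(lifetimes_hours) / len(lifetimes_hours)

if __name__ == "__main__":
    # Hours until failure for five identical light bulbs (hypothetical test data)
    print(mttf_hours([1_050, 980, 1_120, 1_010, 940]))  # 1020.0
```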

Disaster and Disaster Recovery

A network disaster is an event that can disrupt services either temporarily or
permanently.

Types of Disaster:

Environmental disasters

Environmental disasters are what most people think of first when they think of disaster recovery
for networks. Some types of environmental disasters are regional. Others can happen pretty
much anywhere.

• Fire: Fire is probably the first disaster that most people think of when they consider
disaster planning. Fires can be caused by unsafe conditions; carelessness, such as
electrical wiring that isn’t up to code; natural causes, such as lightning strikes; or arson.

• Earthquakes: Not only can earthquakes cause structural damage to your building, but
they can also disrupt the delivery of key services and utilities, such as water and power.
Serious earthquakes are rare and unpredictable, but some areas experience them with
more regularity than others. If your business is located in an area known for earthquakes,
your business continuity plan (BCP) should consider how your company would deal with a
devastating earthquake.

• Weather: Weather disasters can cause major disruption to your business. Moderate
weather may close transportation systems so that your employees can’t get to work.
Severe weather may damage your building or interrupt delivery of services, such as
electricity and water.

• Water: Flooding can wreak havoc with electrical equipment, such as computers. If
floodwaters get into your computer room, chances are good that the computer equipment
will be totally destroyed. Flooding can be caused not only by bad weather but also by
burst pipes or malfunctioning sprinklers.

• Lightning: Lightning storms can cause electrical damage to your computer and other
electronic equipment from lightning strikes as well as surges in the local power supply.

Deliberate disasters

Some disasters are the result of deliberate actions by others. For example:

• Intentional damage: Vandalism or arson may damage or destroy your facilities or your
computer systems. The vandalism or arson may be targeted at you specifically, by a
disgruntled employee or customer, or it may be random. Either way, the effect is the
same.

• Theft: Theft is always a possibility. You may come to work someday to find that your
servers or other computer equipment have been stolen.

• Terrorism: No matter where you live in the world, the possibility of a terrorist attack is
real.

Disruption of services

You may not realize just how much your business depends on the delivery of services and
utilities. A BCP should take into consideration how you will deal with the loss of certain
services:

• No juice: Electrical power is crucial for computers and other types of equipment.
Electrical outages are not uncommon, but the technology to deal with them is readily
available. UPS (uninterruptible power supply) equipment is reliable and inexpensive.
• No communications: Communication connections can be disrupted by many causes. For
example, road construction can cut through phone lines, completely cutting off your
phone services, including Internet connections.

• No water: An interruption in the water supply may not shut down your computers, but it
can disrupt your business by forcing you to close your facility until the water supply is
reestablished.

Equipment failure

Modern companies depend on many different types of equipment for their daily operations. The
failure of any of these key systems can disrupt business until the systems are repaired:

• Computer equipment failure can obviously affect business operations.

• Air-conditioning systems are crucial to regulate temperatures, especially in
computer rooms. Computer equipment can be damaged if the temperature climbs too
high.

• Elevators, automatic doors, and other equipment may also be necessary for your
business.

Other disasters

You should assess many other potential disasters. Here are just a few:

Disaster Recovery Planning:

A disaster recovery plan is a documented set of procedures for protecting and recovering an
organization’s IT infrastructure in the event of a disaster. It is a written document that
specifies the procedures an organization has to follow during a disaster. “Continuity of
operations” is another term associated with IT disaster recovery planning for the recovery
of assets, data and facilities.
The primary objective of disaster recovery services is to minimize data loss and downtime. It
protects the organization in disaster situations in terms of two concepts:

• Recovery Time Objective (RTO)

• Recovery Point Objective (RPO)


Recovery Time Objective
The RTO is the maximum acceptable time within which a business process must be restored after
a major incident occurs.

Recovery Point Objective

After a disaster, some files need to be recovered from backup storage before normal operations
can resume. The RPO is the maximum acceptable amount of data loss, measured in time: it states
how old the most recent usable backup is allowed to be.
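RPO and RTO can be checked against a backup strategy with simple arithmetic. In the worst case a failure strikes just before the next backup runs, so up to one full backup interval of data is lost; the backup interval therefore must not exceed the RPO, and the expected restore time must not exceed the RTO. A minimal sketch that ignores the time a backup itself takes; the `meets_objectives` helper and the numbers are hypothetical.

```python
from datetime import timedelta

def meets_objectives(backup_interval: timedelta, restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict[str, bool]:
    """Compare a backup/restore plan with the recovery objectives.

    Worst case, a failure strikes just before the next backup, so up to one
    full backup interval of data is lost.
    """
    return {
        "RPO met": backup_interval <= rpo,
        "RTO met": restore_time <= rto,
    }

if __name__ == "__main__":
    # Nightly backups, 3 hours to restore; the business requires RPO 4 h, RTO 8 h.
    print(meets_objectives(timedelta(hours=24), timedelta(hours=3),
                           rpo=timedelta(hours=4), rto=timedelta(hours=8)))
    # {'RPO met': False, 'RTO met': True}
```

In this example nightly backups are too infrequent for a 4-hour RPO, so backups would have to run at least every four hours.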

A network disaster recovery plan is a set of procedures prepared for an organization to respond
to an interruption of network services, for instance local area network, wide area network,
Internet access and wireless network functionality, during a man-made or natural disaster. A
network disaster recovery plan provides guidelines for restoring network services. In IT
disaster recovery planning, resources are required to perform the network recovery procedures,
store critical documents and maintain the offsite backups.
Computer Disaster Recovery Plan
A computer disaster recovery plan is used for software database systems. Here, the question is
how the business continues to function if software is lost and the computer system crashes.
The problem can be solved by following some simple steps:

• A backup solution is the first step of database disaster recovery: create a backup of your
system on a regular basis.

• 3 consecutive copies should be maintained before a backup is overwritten.

• A copy of the documentation and key passwords should be saved in a secure place.

• Every 6 months, the most recent data backup must be retained for permanent offsite storage.
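The retention rules in the list above can be expressed as a small rotation function. A minimal sketch under two stated assumptions: backups are identified by date, and “every 6 months” is interpreted as keeping the latest backup of each completed half-year (January to June, July to December) for permanent offsite storage. The function names and that interpretation are hypothetical.

```python
from datetime import date

KEEP_ONSITE = 3  # consecutive copies kept before the oldest is overwritten

def half_year(d: date) -> tuple[int, int]:
    """(year, 1) for January-June, (year, 2) for July-December."""
    return (d.year, 1 if d.month <= 6 else 2)

def plan_retention(backups: list[date], today: date) -> tuple[list[date], list[date]]:
    """Return (onsite copies to keep, copies to move to permanent offsite storage)."""
    ordered = sorted(backups)
    onsite = ordered[-KEEP_ONSITE:]  # the three most recent copies
    last_in_period: dict[tuple[int, int], date] = {}
    for d in ordered:
        last_in_period[half_year(d)] = d  # later dates overwrite earlier ones
    current = half_year(today)
    # Only completed half-years are archived; the current one is still running.
    offsite = [d for period, d in last_in_period.items() if period < current]
    return onsite, offsite

if __name__ == "__main__":
    backups = [date(2023, 12, 29), date(2024, 6, 28), date(2024, 7, 5),
               date(2024, 7, 12), date(2024, 7, 19)]
    onsite, offsite = plan_retention(backups, today=date(2024, 7, 20))
    print("Keep onsite:", onsite)
    print("Archive offsite:", offsite)
```

Any backup that is neither among the three most recent nor selected for offsite storage may be overwritten.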
