Chapter Contents

7.1 Introduction
7.2 Aspects Of Performance
7.3 Items That Can Be Measured
7.4 Measures Of Network Performance
7.5 Application And Endpoint Sensitivity
7.6 Degraded Service, Variance In Traffic, And Congestion
7.7 Congestion, Delay, And Utilization
7.8 Local And End-To-End Measurements
7.9 Passive Observation Vs. Active Probing
7.10 Bottlenecks And Future Planning
7.11 Capacity Planning
7.12 Planning The Capacity Of A Switch
7.13 Planning The Capacity Of A Router
7.14 Planning The Capacity Of An Internet Connection
7.15 Measuring Peak And Average Traffic On A Link
7.16 Estimated Peak Utilization And 95th Percentile
7.17 Relationship Between Average And Peak Utilization
7.18 Consequences For Management And The 50/80 Rule
7.19 Capacity Planning For A Complex Topology
7.20 A Capacity Planning Process
7.21 Route Changes And Traffic Engineering
7.22 Failure Scenarios And Availability
7.23 Summary
7 Performance Assessment And Optimization

7.1 Introduction

Chapters in this part of the text provide general background and define the problem
of network management. The three previous chapters explain aspects of the FCAPS
model.
This chapter continues the discussion of FCAPS by focusing on evaluation of net-
work performance. Unlike the discussion of monitoring in Chapter 6 that focuses on
data needed for accounting and billing, our discussion considers performance in a
broader sense. In particular, we examine the important question of how measurements
are used for capacity assessment and planning.

7.2 Aspects Of Performance

Three aspects of performance are relevant to network managers. The first two per-
tain to assessment, and the third pertains to optimization. The aspects can be expressed
as determining:

d What to measure
d How to obtain measurements
d What to do with the measurements

Although each aspect is important, the discussion in this chapter emphasizes concepts; a
later chapter discusses the SNMP technology that can be used to obtain measurements.

7.3 Items That Can Be Measured

In the broadest sense, a network manager can choose to measure specific entities
such as:
d Individual links
d Network elements
d Network services
d Applications

Individual Links. A manager can measure the traffic on an individual link and cal-
culate link utilization over time. Chapter 5 discusses monitoring individual links to
detect anomalous behavior; the load on links can also be used in some forms of capacity
planning†.
Network Elements. The performance of a network element, such as a switch or a
router, is easy to assess. Even basic elements provide statistics about packets pro-
cessed; more powerful devices include sophisticated internal performance monitoring
mechanisms supplied by the vendor. For example, some elements keep a count of
packets that travel between each pair of ports.
Network Services. Both enterprises and providers measure basic network services
such as domain name lookup, VPNs, and authentication service. As with applications,
providers are primarily concerned with how the service performs from a customer’s
point of view.
Applications. The measurement of applications is primarily of concern to enter-
prise networks. An enterprise manager assesses application performance for both inter-
nal and external users. Thus, a manager might measure the response time of a company
database system when employees make requests as well as the response time of a company
web server when outsiders submit requests.

7.4 Measures Of Network Performance

Although it would be convenient if managers could use a single value to capture
the performance of their networks, no single measure exists. Instead, a set of indepen-
dent measures is often used to assess and characterize network performance. Five
measures are commonly used:

d Latency
d Throughput
d Packet loss
d Jitter
d Availability

†Although link performance is conceptually separate from network element performance, measurement of
link performance is usually obtained from network elements attached to the link.

Network latency refers to the delay a packet experiences when traversing a net-
work, and is measured in milliseconds (ms). The throughput of a network refers to the
data transfer rate measured in bits transferred per unit time (e.g., megabits per second or
gigabits per second). A packet loss statistic specifies the percentage of packets that the
network drops. When a network is operating correctly, packet loss usually results from
congestion, so loss statistics can reveal whether a network is overloaded†. The jitter a
network introduces is a measure of the variance in latency. Jitter is only important for
real-time applications, such as Voice over IP (VoIP), because accurate playback requires
a stream of packets to arrive smoothly. Thus, if a network supports real-time audio or
video applications, a manager is concerned with jitter. Finally, availability, which
measures how long a network remains operational and how quickly a network can re-
cover from problems, is especially important for a business that depends on the network
(e.g., a service provider).

7.5 Application And Endpoint Sensitivity

When is a network performing well? How can we distinguish among excellent,
good, fair, and poor performance? Some managers think that numbers and statistics
alone cannot be used to make a judgement. Instead, they rely on human reaction to
gauge performance: a network is performing satisfactorily provided that no user is com-
plaining (i.e., no trouble reports are received).
It may seem that a more precise definition of quality is needed. Unfortunately,
network performance cannot be judged easily. In fact, we cannot even define an exact
boundary between a network that is up and a network that is down. The ambiguity
arises because each application is sensitive to a certain set of conditions and the perfor-
mance received by a given pair of endpoints can differ from the performance received
by other pairs.
To understand application sensitivity, consider a remote login application, a voice
application, and a file transfer application running over the same network. If the net-
work exhibits high throughput and high latency, it may be adequate for file transfer, but
unsuitable for remote login. If it exhibits low latency and high jitter, the network may
be suitable for remote login, but intolerable for voice transmission. Finally, if it exhi-
bits low delay and low jitter but has low throughput, the network may work well for re-
mote login and voice, but be unsuitable for file transfer.

†Loss rates can also reveal problems such as interference on a wireless network.

To understand endpoint sensitivity, observe that latency and throughput depend on
the path through a network. The path between a pair of endpoints (a, b) can be com-
pletely disjoint from the path between another pair of endpoints (c, d). Thus, equip-
ment along the paths can differ, one path can be much longer than the other, or one path
can be more congested than the other. Because paths can differ, statements about aver-
age behavior are not usually meaningful.
The point is:

Because each application is sensitive to specific network characteris-
tics and the performance that an application observes depends on the
path that data travels through a network, a manager cannot give a
single assessment of network performance that is appropriate for all
applications or all endpoints.

7.6 Degraded Service, Variance In Traffic, And Congestion

Instead of making quantitative statements about network performance, network
managers use the term degraded service to refer to a situation in which some charac-
teristics of latency, throughput, loss, and jitter are worse than baseline expectations. Of
course, degraded service can result from faults such as malfunctioning hardware drop-
ping packets. However, degraded service can also result from performance problems
such as nonoptimal routes or route flapping (i.e., routes changing back and forth
between two paths).
In many networks, the most significant cause of degraded service is congestion:
packets arrive in the network faster than they leave. Congestion often occurs when a
given link becomes saturated (e.g., a Layer 2 switch where two computers each send
traffic at 1 gbps toward a single output port that operates at 1 gbps).
To handle variations in traffic, network elements, such as switches or routers, each
contain a packet queue that can accommodate a temporary burst of arrivals. However,
if a high-speed burst continues over an extended period of time, the queue fills and the
network must discard additional packets that arrive. Thus, congestion raises latency, in-
creases packet loss, and lowers effective throughput. The point is:

In many networks, congestion is a leading cause of degraded perfor-
mance.

7.7 Congestion, Delay, And Utilization

We said that congestion occurs as a link becomes saturated. The relationship
between congestion and degraded performance is crucial because it allows managers to
deduce performance from a quantity that is easy to measure: link utilization.
We define utilization as the percentage of the underlying hardware capacity that
traffic is currently using, expressed as a value between 0 and 1. When utilization in-
creases, congestion occurs, which raises the delay a packet experiences. As a first order
approximation, the effective delay can be estimated by Equation 7.1:

D ≈ D0 / (1 − U)        (7.1)

where D is the effective delay, U is the utilization, and D0 is the hardware delay in the
absence of traffic (i.e., the delay when no packets are waiting to be sent).
Although it is a first-order approximation, Equation 7.1 helps us understand the re-
lationship between congestion and delay: as utilization approaches 100%, congestion
causes the effective delay to become arbitrarily large. For now, it is sufficient to under-
stand that utilization is related to performance; in later sections, we will revisit the ques-
tion and see how utilization can be used for capacity planning.
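
To see the effect numerically, the following short Python sketch evaluates Equation 7.1 for several utilization levels; the idle delay of 2 ms is an assumed value chosen only for illustration.

    # Estimate effective delay from utilization using Equation 7.1: D = D0 / (1 - U).

    def effective_delay(d0_ms, utilization):
        """Return the estimated delay in milliseconds for a utilization in [0, 1)."""
        if not 0.0 <= utilization < 1.0:
            raise ValueError("utilization must be at least 0 and less than 1")
        return d0_ms / (1.0 - utilization)

    if __name__ == "__main__":
        d0 = 2.0                                   # assumed idle-network delay in ms
        for u in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
            print(f"U = {u:4.2f}  ->  D = {effective_delay(d0, u):7.1f} ms")

As the loop shows, the estimated delay doubles at 50% utilization and grows without bound as utilization approaches 100%.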

7.8 Local And End-To-End Measurements

Measurements can be divided into two broad categories, depending on the scope of
the measurement. A local measurement assesses the performance of a single resource
such as a single link or a single network element. Local measurements are the easiest
to perform, and many tools are available. For example, tools exist that allow a manager
to measure latency, throughput, loss, and congestion on a link.
Unfortunately, a local measurement is often meaningless to network users because
users are not interested in the performance of individual network resources. Instead,
users are interested in end-to-end measurements (e.g., end-to-end latency, throughput,
and jitter). That is, a user cares about the behavior observed from an application on an
end system. End-to-end measurements include the performance of software running on
end systems as well as the performance of data crossing an entire network. For exam-
ple, end-to-end measurement of a web site includes measurement of the server as well
as the network.
It may seem that the end-to-end performance of a network could be computed from
measurements of local resources. Such a computation would be ideal because local
measurements are easier to make than end-to-end measurements. Unfortunately, the re-
lationship between local and end-to-end performance is complex, which makes it impos-
sible to draw conclusions about end-to-end performance even if the performance of each
network element and link is known. For example, even if the packet loss rate is known
for each link, the overall packet loss rate is difficult to compute. We can summarize:

Although they are the easiest to obtain, local performance measure-
ments cannot be used to deduce end-to-end performance.

7.9 Passive Observation Vs. Active Probing

How are networks measured? There are two basic approaches:

d Passive observation
d Active probing

Passive observation refers to a nonintrusive mechanism that obtains measurements
without affecting the network or the traffic. A passive observation system can measure
a network under actual load. That is, a passive mechanism measures network perfor-
mance while production traffic is passing through the network.
Active probing refers to a mechanism that injects and measures test traffic. For ex-
ample, test generators exist that can be used to create many simultaneous TCP connec-
tions. Thus, active probing is intrusive in the sense that measurement introduces addi-
tional traffic.
In general, passive observation is restricted to local measurement, and active prob-
ing is used for end-to-end measurement. In fact, to obtain a more accurate picture of
end-to-end performance, external devices are often used as the source of active probes.
For example, to test how a web site performs, an active probing mechanism injects web
requests at various points in the Internet and measures the response†. Of course, the
company that owns and operates the web site may not be able to control the situation
because the cause of poor performance may be networks run by ISPs on the path
between the test system and the web server. However, active probing can provide a
realistic assessment of performance.
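
As a rough illustration of active probing (not a description of any particular commercial tool), the Python sketch below injects a test HTTP request from a probe machine and records the end-to-end response time; the target URL is a placeholder.

    # Minimal active probe: send a test request and measure end-to-end response time.
    import time
    import urllib.request

    def probe_once(url, timeout=10):
        """Return the response time in milliseconds for one test request."""
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()                        # include transfer time, not just headers
        return (time.monotonic() - start) * 1000.0

    if __name__ == "__main__":
        test_url = "http://www.example.com/"       # placeholder probe target
        samples = [probe_once(test_url) for _ in range(5)]
        print("response times (ms):", [round(s, 1) for s in samples])

Running such a probe from several points in the Internet approximates the view that external users have of a service.
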
To summarize:

The only way a manager can obtain a realistic assessment of end-to-
end performance is to employ active probing that measures an entire
network path plus the performance of applications.

†Commercial companies exist that use active probing to measure performance.



7.10 Bottlenecks And Future Planning

How do network managers use measurement data that is collected? Chapter 5
discusses one use: detection of faults and anomalies. There are two principal uses that
arise in the context of performance optimization. The two are related:

d Optimize current network performance
d Optimize future network performance

Optimize Current Network Performance. Managers can maximize performance of
a network without upgrading the hardware. To do so, they identify bottlenecks. En-
gineers use the term bottleneck to refer to the component or subsystem that is the
slowest. The bottleneck in a network can consist of a link that is saturated or a router
that is running at capacity. The point of performing bottleneck assessment is to identify
links or network elements that are causing performance problems. A bottleneck can be
upgraded to improve performance†, or traffic can be rerouted to alternative paths. Un-
fortunately, as we will see in the next section, bottleneck identification tends to focus on
individual network elements when, in fact, most data networks are so complex that no
single element forms a bottleneck.
Optimize Future Network Performance. By far the most important and complex
use of performance data arises in future planning. A network manager must anticipate
future needs, acquire the necessary equipment, and integrate the new facilities into the
network before the need arises. As we will see, future planning can be difficult, and re-
quires taking careful and extensive measurements.

7.11 Capacity Planning

We use the term capacity planning to refer to network management activities con-
cerned with estimating future needs. For large networks, the planning task is
complex. A manager begins by measuring existing traffic and estimating future traffic
increases. Once estimates have been generated, a manager must translate the predicted
loads into effects on individual resources in the network. Finally, a manager considers
possible scenarios for enhancing specific network resources, and chooses a plan as a
tradeoff among performance, reliability, and cost.
To summarize:

Capacity planning requires a manager to estimate the size of
resources that will be needed to meet anticipated load, taking into ac-
count a desired level of performance, a desired level of robustness
and resilience, and a bound on cost.

†There is no point in upgrading a device that is not a bottleneck.



A major challenge in capacity planning arises because the underlying networks can
be enhanced in many ways. For example, a manager can: increase the number of ports
on an existing network element, add new network elements, increase the capacity of ex-
isting links, or add additional links. Thus, a manager must consider many alternatives.

7.12 Planning The Capacity Of A Switch

Estimating the capacity needed for a switch is among the most straightforward
capacity planning tasks. The only variables are the number of connections needed and
the speed of each connection. In many cases, the speed required for each connection is
predetermined, either by policy, assumptions about traffic, or the equipment to be at-
tached to the switch. For example, an enterprise might have a policy that specifies each
desktop connection uses wired Ethernet running at 100 mbps. As an example of equip-
ment, the connection between a switch and a router might operate at 1 gbps.
To plan the capacity of a switch, a manager estimates the number of connections,
N, along with the capacity of each. For most switches, planning capacity is further sim-
plified because modern switches employ 10/100/1000 hardware that allows each port to
select a speed automatically. Thus, a manager merely needs to estimate N, the number
of ports. A slow growth rate further simplifies planning switch capacity — additional
ports are only needed when new users or new network elements are added to the net-
work. The point is:

Estimating switch capacity is straightforward because a manager only
needs to estimate the number of ports needed, and port demand grows
slowly.

7.13 Planning The Capacity Of A Router

Planning router capacity is more complex than planning switch capacity for three
reasons. First, because a router can provide services other than packet forwarding (e.g.,
DHCP), a manager needs to plan capacity for each service. Second, because the speed
of each connection between a router and a network can be changed, planning router
capacity entails planning the capacity of connections. Third, and most important, the
traffic a router must handle depends on the way a manager configures routing. Thus, to
predict the router capacity that will be needed, a manager must plan the capacity of sur-
rounding systems and routing services. To summarize:

Estimating the capacity needed for a router requires a manager to es-
timate parameters for surrounding hardware systems as well as esti-
mate services the router must perform.

7.14 Planning The Capacity Of An Internet Connection

Another capacity planning task focuses on the capacity of a single link between an
organization and an upstream service provider. For example, consider a link between
an enterprise customer and an ISP, which can be managed by the ISP or the customer.
If the ISP manages the link, the ISP monitors traffic and uses traffic increases to en-
courage the customer to pay a higher fee to upgrade the speed of the link. If the custo-
mer manages the link, the customer monitors traffic and uses increases to determine if
the link will soon become a bottleneck.
In theory, planning link capacity should be straightforward: use link utilization, an
easy quantity to measure, in place of delay, throughput, loss, and jitter. That is, com-
pute the percentage of the underlying link capacity that is currently being used, track
utilization over many weeks, and increase link capacity when utilization becomes too
high.
In practice, two difficult questions arise:

d How should utilization be measured?
d When should link capacity be increased?

To understand why the questions are difficult, recall from the discussion in Chapter
5 that the amount of traffic on a link varies considerably (e.g., is often much lower dur-
ing nights and weekends). Variations in traffic make measurement difficult because
measurements are only meaningful when coordinated with external events such as holi-
days. Variations make the decision about increasing link capacity difficult because a
manager must choose an objective. The primary question concerns packet loss: is the
objective to ensure that no packets are lost at any time or to compromise by choosing a
lower-cost link that handles most traffic, but may experience minor loss during times of
highest traffic? The point is:

Because link utilization varies over time, when upgrading a link, a
manager must decide whether the objective is to prevent all packet
loss or compromise with some packet loss to lower cost.

7.15 Measuring Peak And Average Traffic On A Link

How should a link be measured? The idea is to divide a week into small intervals,
and measure the amount of data sent on the link during each interval. From the meas-
urements, it is possible to compute both the maximum and average utilization during
the week. The measurements can be repeated during successive weeks to produce a
baseline.
How large should measurement intervals be? Choosing a large interval size has
the advantage of producing fewer measurements, which means management traffic in-
troduces less load on links, intermediate routers, and the management system. Choos-
ing a small interval size has the advantage of giving more accuracy. For example,
choosing the interval size to be one minute allows a manager to assess very short
bursts; choosing the interval size to be one hour produces much less data, but hides
short bursts by averaging all the traffic in an hour.
As a compromise, a manager can choose an interval size of 5, 10, or 15 minutes,
depending on how variable the manager expects traffic to be. For typical sites, where
traffic is expected to be relatively smooth, using a 15-minute interval represents a rea-
sonable choice. With 7 days of 24 hours per day and 4 intervals per hour, a week is di-
vided into 672 intervals. Thus, measurement data collected over an entire week consists
of only 672 values. Even if the procedure is repeated for an entire year, the total data
consists of only 34,944 values. To further reduce the amount of data, progressively old-
er data can be aggregated.
We said that measurement data can be used to compute peak utilization. To be
precise, we should say that it is possible to compute the utilization during the 15-minute
interval with the most traffic. For capacity planning purposes, such an estimate is quite
adequate. If additional accuracy is needed, a manager can reduce the interval size. To
summarize:

To compute peak and average utilization on a link, a manager meas-
ures traffic in fixed intervals. An interval size of 15 minutes is ade-
quate for most capacity planning; smaller intervals can be used to im-
prove accuracy.

Of course, the computation described above only estimates utilization for data trav-
eling in one direction over a connection (e.g., from the Internet to an enterprise). To
understand utilization in both directions, a manager must also measure traffic in the re-
verse direction. In fact, in a typical enterprise, managers expect traffic traveling
between the enterprise and the Internet will be asymmetric, with more data flowing
from the Internet to the enterprise than from the enterprise to the Internet.
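
A minimal sketch of the computation, assuming one byte count per 15-minute interval for one direction of a link (the counts and the 100 mbps link speed are invented for illustration):

    # Convert per-interval byte counts into link utilization and report
    # the weekly average and the single busiest interval.

    INTERVAL_SECONDS = 15 * 60                     # one 15-minute interval
    LINK_CAPACITY_BPS = 100_000_000                # assumed 100 mbps link

    def utilization(byte_count):
        """Fraction of link capacity used during one interval, between 0 and 1."""
        return (byte_count * 8) / (LINK_CAPACITY_BPS * INTERVAL_SECONDS)

    def summarize(byte_counts):
        """Return (average utilization, peak utilization) over all intervals."""
        values = [utilization(b) for b in byte_counts]
        return sum(values) / len(values), max(values)

    if __name__ == "__main__":
        # A full week supplies 672 counts; three made-up values keep the example short.
        counts = [2_500_000_000, 6_000_000_000, 9_000_000_000]
        avg, peak = summarize(counts)
        print(f"average utilization {avg:.0%}, peak utilization {peak:.0%}")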

7.16 Estimated Peak Utilization And 95th Percentile

Once a manager has collected statistics for average and peak utilization in each
direction over many weeks, how can the statistics be used to determine when a capacity
increase is needed? There is no easy answer. To prevent all packet loss, a manager
must track the change in absolute maximum utilization over time, and must upgrade the
capacity of the link before peak utilization reaches 100%.
Unfortunately, the absolute maximum utilization can be deceptive because packet
traffic tends to be bursty and peak utilization may only occur for a short period. Events
such as error conditions or route changes can cause small spikes that are not indicative
of normal traffic. In addition, legitimate short-lived traffic bursts can occur. Many sites
decide that minor packet loss is tolerable during spikes. To mitigate the effects of short
spikes, a manager can follow a statistical approach that smooths measurements and
avoids upgrading a link too early: instead of using the absolute maximum, use the 95th
percentile of traffic to compute peak utilization. That is, take traffic measurements in
15 minute intervals as usual, but instead of selecting one interval as the peak, sort the
list, select intervals at the 95th percentile and higher, and use the selected intervals to
compute an estimated peak utilization. To summarize:

Because traffic can be bursty, a smoothed estimate for peak utilization
avoids reacting to a single interval with unusually high traffic. A
manager can obtain a smoothed estimate by averaging over intervals
at the 95th percentile and higher.
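
The smoothing step might be sketched as follows; the per-interval utilization values are fabricated, and averaging everything at or above the 95th percentile follows the description above.

    # Estimate peak utilization as the average of the intervals at or above the
    # 95th percentile rather than the single busiest interval.

    def smoothed_peak(utilizations):
        """Average the intervals at the 95th percentile and higher."""
        ordered = sorted(utilizations)
        cutoff = int(0.95 * len(ordered))          # index of the 95th percentile
        top = ordered[cutoff:]
        return sum(top) / len(top)

    if __name__ == "__main__":
        # One week of 15-minute intervals gives 672 values; these are made up.
        week = [0.30] * 600 + [0.55] * 60 + [0.90] * 12
        print(f"absolute peak {max(week):.0%}, smoothed peak {smoothed_peak(week):.0%}")

With these numbers the absolute peak is 90%, while the smoothed estimate is roughly 67%, so one short burst does not by itself suggest an upgrade.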

7.17 Relationship Between Average And Peak Utilization

Experience has shown that for backbone links, traffic rises and falls fairly steadily
over time. A corresponding result holds for traffic on a link connecting a large organi-
zation to the rest of the Internet. If similar conditions are observed on a given link and
absolute precision is not needed, a manager can simplify the calculation of peak utiliza-
tion.
One more observation is required for the simplification: traffic on a heavily used
link (e.g., a backbone link at a provider) follows a pattern where the ratio between the
estimated peak utilization of a link, calculated as a 95th percentile, and the average utili-
zation of the link is almost constant. The constant ratio depends on the organization.
According to one Internet backbone provider, the ratio can be approximated by Equa-
tion 7.2:

Estimated peak utilization / Average link utilization ≈ 1.3        (7.2)

7.18 Consequences For Management And The 50/80 Rule

How does a constant peak-to-average ratio affect managers? For a connection to
the Internet that is heavily used, a manager only needs to measure the average utiliza-
tion, and can use the average to estimate peak utilization and draw conclusions. Figure
7.1 lists examples of average utilization and the meaning.

Average Utilization    Peak Utilization    Interpretation
40%                    52%                 Link is underutilized
50%                    65%                 Comfortable operating range
60%                    78%                 Link is beginning to fill
70%                    91%                 Link is effectively saturated
80%                    100%                Link is completely saturated

Figure 7.1 Interpretation of peak utilization for various values of average uti-
lization assuming a constant ratio. Although utilization is limited
to 100%, peak demand can exceed capacity.

As the figure shows, a link with average utilization of 50% is running at approxi-
mately two-thirds of capacity during peak times, a comfortable level that allows for
unusual circumstances such as a national emergency that creates unexpected load.
When the average utilization is less than 50% and extra link capacity is not reserved for
backup, the link is underutilized. However, if the average utilization is 70%, only 9%
of the link capacity remains available during peak times. Thus, a manager can track
changes in average utilization over time, and use the result to determine when to up-
grade the link. In particular, by the time the average utilization climbs to 80%, a
manager can assume the peak utilization has reached 100% (i.e., the link is saturated).
Thus, the goal is to maintain average utilization between 50% and 80%. The bounds are
known as the 50/80 Rule:

On a heavily used Internet connection, a manager can use average
utilization to determine when to upgrade link capacity. When average
utilization is less than 50%, the link is underutilized; when average
utilization climbs to 80%, the link is saturated during peak times.
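
A minimal sketch of how a manager might apply the rule, assuming the constant 1.3 peak-to-average ratio of Equation 7.2 holds for the link being tracked:

    # Apply the 1.3 ratio and the 50/80 Rule to a measured average utilization.

    PEAK_TO_AVERAGE = 1.3

    def assess(average_utilization):
        """Return an interpretation of a measured average utilization."""
        peak = min(average_utilization * PEAK_TO_AVERAGE, 1.0)
        if average_utilization < 0.50:
            verdict = "link is underutilized"
        elif average_utilization < 0.80:
            verdict = "within the 50/80 operating range"
        else:
            verdict = "saturated during peak times; plan an upgrade"
        return f"average {average_utilization:.0%}, estimated peak {peak:.0%}: {verdict}"

    if __name__ == "__main__":
        for avg in (0.40, 0.50, 0.70, 0.80):
            print(assess(avg))

The output reproduces the rows of Figure 7.1.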

Of course, measurements may reveal that a given Internet connection is sometimes
unused. In particular, if an enterprise closes during nights and weekends, the traffic
during those periods can fall to almost zero, which will significantly lower average utili-
zation and make the ratio between peak and average utilization high. To handle such
cases, a manager can measure average utilization only over in-use periods.

7.19 Capacity Planning For A Complex Topology

Capacity planning for a large network is much more complex than capacity plan-
ning for individual elements and links. That is, because links can be added and routes
can be changed, capacity planning must consider the performance of the entire network
and not just the performance of each individual link or element. The point is:

Assessing how to increase the capacity of a network that contains N
links and elements requires a manager to perform more work than
planning capacity increases for each of the N items individually.

7.20 A Capacity Planning Process

Capacity planning in a large network requires six steps. Figure 7.2 summarizes the
overall planning process by listing the steps a network management team performs.
The next sections explain the steps.

1. Use measurement of the current network and forecasting to
devise an estimate of the expected load.
2. Translate the expected load into a model that can be used with
capacity planning software.
3. Use the load model plus a description of network resources to
compute resource utilization estimates and validate the results.
4. Propose modifications to the network topology or routing, and
compute new resource utilization estimates.
5. Use resource utilization estimates to derive estimates on the
performance needed from network elements and links.
6. Use performance estimates to make recommendations for capa-
city increases and the resulting costs.

Figure 7.2 An overview of the steps a network management team takes to
plan capacity increases in a complex network.

7.20.1 Forecasting Future Load

Forecasting the future load on a network requires a network manager to estimate
both growth in existing traffic patterns and potentially new traffic patterns. Estimating
growth in a stable business is easiest. On one hand, a manager can track traffic over a
long period of time, and can use past growth to estimate future growth. On the other
hand, a manager can obtain estimates of new users (either internal users or external cus-
tomers), and calculate additional increases in load that will result.
Estimating new traffic patterns is difficult, especially in a rapidly expanding pro-
vider business. The load introduced by new service offerings may differ from the past
load and may grow to dominate quickly. An ISP sets sales targets for both existing and
new services, so a manager should be able to use the estimates to calculate the resulting
increase in load. Before a manager can do so, however, the manager must translate
from the marketing definition of a service to a meaningful statement of network load.
In particular, sales quotas cannot be used in capacity planning until the quantity, types,
and destinations of the resulting packets can be determined. Thus, when marketing de-
fines and sells a service, a network manager must determine how many new additional
packets will be sent from point A to point B in the network.

Forecasting future load is especially difficult in cases where a net-
work must support new services because a manager must estimate the
quantity, type, and destinations of packets that will be generated.

7.20.2 Measuring Existing Resource Use

We have already discussed the measurement of existing resources. We have seen
that most large network elements, such as routers, contain an API that allows a manager
to obtain measurements of performance. We know that the traffic on a link varies over
time, that it is possible to measure traffic in intervals, and that a network manager can
use the measurements to calculate average and peak utilization. To smooth the estimate
of peak utilization, a manager can compute the peak at the 95th percentile and above.
Finally, we know that on heavily utilized backbone links the expected load is smooth,
with a constant ratio of peak-to-average utilization. When measuring a network, a
manager must measure each network element and link.

7.20.3 A Load Model Based On A Traffic Matrix

Effective estimates of resource use rely on an accurate model of network traffic.
Once a model has been derived, the model can be used to compute the effect on under-
lying resources, and can allow managers to test the effect of possible changes.
One technique has emerged as the leader for modeling network load: a traffic ma-
trix. Conceptually, a traffic matrix corresponds to connections between the network and
outside traffic sources and sinks. That is, a traffic matrix has one row for each network
ingress point and one column for each network egress point. In practice, most connec-
tions between a network and outside sources are bidirectional, which means a traffic
matrix has one row and one column for each external connection. If T is a traffic ma-
trix, Tij gives the rate at which data is expected to arrive on external connection i des-
tined to leave over external connection j. Figure 7.3 illustrates the concept.
A traffic matrix is an especially appropriate starting point for future planning be-
cause a matrix specifies an expected external traffic load without making any assump-
tions about the interior of a network or routing. Thus, planners can build a single traffic
matrix, and then run a set of simulations that compare performance of a variety of topo-
logies and routing architectures; the matrix does not need to change for each new simu-
lation.

[matrix diagram: rows labeled ingress1 through ingressn, columns labeled egress1 through egressm]

Figure 7.3 The concept of a traffic matrix. Each entry in the matrix stores
the rate of traffic from a source (ingress) to a destination (egress).
For bidirectional network connections, row i corresponds to the
same connection as column i.

A traffic matrix is most useful for modeling the backbone of a large network where
aggregate traffic flows can be averaged. It is more difficult to devise a traffic matrix for
a network that connects many end-user computers because the individual traffic from
each computer must be specified. Thus, for a large service provider, each input or out-
put of the traffic matrix represents a smaller ISP or a peer provider. For an enterprise, a
traffic matrix can model the corporate backbone, which means an input or output
corresponds to either an external Internet connection or an internal source of traffic,
such as a group of offices.
What values should be placed in a traffic matrix? The question arises because traf-
fic varies over time. Should a traffic matrix give the average traffic expected for each
pair of external connections, the peak traffic, or some combination? The ultimate goal
is to understand how traffic affects individual resources, but peak loads on individual
resources do not always occur under the same traffic conditions. As we have seen, for
heavily-loaded backbone links, average and peak utilization are related. In other cases,
planning is focused on the worst case — a network manager needs to ensure sufficient
resources to handle the worst possible combinations of traffic. Thus, the manager uses
peak load as the measure stored in the traffic matrix.

Unfortunately, we know that traffic varies over time. Furthermore, peak traffic
between a given pair of external connections may occur at a different time than the peak
traffic between another pair of connections. The notion of independent temporal varia-
tions can impact capacity planning. For example, consider an ISP that serves residential
customers and businesses. If business traffic is high during business hours and residen-
tial traffic is high in the evening or on weekends, it may be possible to use a single net-
work infrastructure for both types of traffic. However, if a traffic matrix represents the
peak traffic for each pair of connections without specifying times, a manager may end
up planning a network with twice the capacity actually required. The point can be summarized:

A traffic matrix that stores peak traffic loads without documenting
temporal variations can lead to overestimation of needed capacity.

How can a manager create a load model that accounts for temporal variation?
There are two possibilities:

d Use multiple traffic matrices, where each matrix corresponds to a
single time slot.
d Use a single matrix, but make each entry correspond to the peak
traffic only during the busy time of the network.

To use multiple matrices, a manager divides time into blocks, and specifies a traf-
fic matrix for each block. For example, in the ISP mentioned above, a manager might
choose to create a traffic matrix for business hours and a separate traffic matrix for oth-
er hours. The chief disadvantage of using multiple matrices is that a manager must
spend time creating each.
The single-matrix approach works best for networks in which a manager can easily
identify a time slot during which peak demand occurs. For example, some ISPs experi-
ence peak demand when business hours from multiple time zones overlap.

7.20.4 Flows And Aggregates

We said that a traffic matrix can represent the peak traffic from each ingress to
each egress. Another complication arises when managers add estimates of future traffic
to a traffic matrix: instead of starting with aggregates of traffic for each combination of
ingress and egress, a manager may be given estimates of traffic from specific flows.
For example, suppose an enterprise plans to install a new application. If a manager can
estimate the amount of traffic each user will generate, the estimates must be combined
to create aggregates. Similarly, if an ISP plans to sell new services, the traffic resulting
from the new service must be aggregated with other traffic.

It may be difficult to generate aggregate estimates from individual flows for two
reasons. First, in many cases, the actual destination of a new flow is unknown. For ex-
ample, if an ISP offers a VPN service that encrypts data and expects to sell N subscrip-
tions to the service, a manager may find it difficult to estimate the locations of the end-
points. Second, peaks from flows may not coincide even if the flows cross the same
link. Thus, to build a traffic matrix, a manager must estimate the combined effect of
new flows on aggregate traffic without knowing exactly how the peaks will coincide.

7.20.5 Deriving Estimates And Validation

Once a traffic matrix has been produced, a manager can use the matrix plus a
description of the network topology and routing architecture to compute peak resource
demands for individual links and network elements. In essence, traffic for each (source,
destination) pair is mapped onto network paths, and the traffic is summed for each link
or network element in the path. The capacity needed for a link is given by the total
traffic that is assigned to the link. The switching capacity needed for a device such as
an IP router can be computed by calculating the number of packets that can arrive per
second (i.e., the sum over all inputs).
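
The mapping can be sketched in a few lines of Python; the traffic matrix entries, routes, and link names below are invented solely to illustrate the idea.

    # Map a traffic matrix onto links: add each (ingress, egress) pair's traffic
    # to every link along the route the pair uses.
    from collections import defaultdict

    traffic_mbps = {                               # assumed traffic matrix entries
        ("A", "B"): 300,
        ("A", "C"): 150,
        ("B", "C"): 200,
    }

    routes = {                                     # assumed routing for each pair
        ("A", "B"): ["A-R1", "R1-R2", "R2-B"],
        ("A", "C"): ["A-R1", "R1-R2", "R2-C"],
        ("B", "C"): ["B-R2", "R2-C"],
    }

    def link_loads(traffic, routes):
        """Return the total offered load, in mbps, for every link."""
        loads = defaultdict(float)
        for pair, rate in traffic.items():
            for link in routes[pair]:
                loads[link] += rate
        return dict(loads)

    if __name__ == "__main__":
        for link, load in sorted(link_loads(traffic_mbps, routes).items()):
            print(f"{link}: {load:.0f} mbps")

Run with the current traffic matrix, the same calculation supports the validation step described next.
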
One of the key ideas underlying the process of capacity planning involves valida-
tion of the traffic matrix: before adding estimates of new traffic, a manager uses meas-
ures of current traffic, maps data from the traffic matrix to individual resources, and
compares the calculated load to the actual measured load on the resource. If the esti-
mates for individual resources are close to the measured values, the model is valid and a
manager can proceed to add estimates for new traffic with some confidence that the
results will also be valid.
If a traffic model does not agree with reality, the model must be tuned. In addition
to checking basic traffic measurements, a manager checks assumptions about the times
of peak traffic with peak times for individual links. In each case where an estimate was
used, the manager notes the uncertainty of the estimate, and concentrates on improving
items that have the greatest uncertainty.

7.20.6 Experimenting With Possible Changes

The chief advantage of a traffic model lies in the ability of managers to investigate
possible network enhancements without disrupting the network. That is, a manager pos-
tulates a change in the network, and then uses the traffic matrix to see how the change
affects behavior. There are three types of changes:

d Change the network topology
d Change routing in the network
d Change assumptions about failure

Changes in topology are the easiest to imagine. For example, a manager can ex-
plore the effect of increasing the capacity of one or more links or the effect of adding
extra links. If the calculation of resource usage is automated, a manager can experiment
with several possibilities easily. The next sections discuss changing the routing archi-
tecture and changing assumptions about failure.

7.21 Route Changes And Traffic Engineering

One of the more subtle aspects of capacity planning concerns routing: a change in
the Layer 3 routing structure can change resource utilization dramatically. In particu-
lar, one alternative to increasing the capacity of a link involves routing some traffic
along an alternative path.
Managers who are concerned with routing often follow an approach known as traf-
fic engineering in which a manager controls the routing used for individual traffic
flows. In particular, a manager can specify that some of the network traffic from a
given ingress to a given egress can be forwarded along a different path than other traffic
traveling between the same ingress and egress.
The most widely recognized traffic engineering technology is Multi-Protocol Label
Switching (MPLS), which allows a manager to establish a path for a specific type of
traffic passing from a given ingress to a given egress. Many large ISPs establish a full
mesh of MPLS paths among routers in the core of their network. That is, each pair of
core routers has an MPLS path between them†.

7.22 Failure Scenarios And Availability

Another aspect of capacity planning focuses on planning network resiliency in the
face of failure. The idea is straightforward: consider how a network will perform under
a set of failure scenarios, and plan excess capacity and backup routes that will achieve
the required availability. Failure planning is especially critical for service providers be-
cause the business depends on being able to guarantee availability; it can also be impor-
tant for enterprises.
Typically, failure scenarios are chosen to consider single point failures. The most
obvious cases include the failure of a single link or a single network element. However,
many planners focus on failure of a facility. For example, a planner might observe that:
network elements residing in a physical rack share the same source of power, several
logical circuits are multiplexed onto the same underlying fiber, or a set of services are
located within a single Point Of Presence (POP). Thus, for planning purposes, a
manager can define a facility to be a rack, a cable, or a POP. Once a facility has been
defined, a manager can consider how the network will perform if all items in the facility
become unavailable. The point is:

†A later chapter on tools for network management discusses MPLS traffic engineering.

In addition to planning capacity to handle increases in traffic, net-
work managers consider possible failures and plan capacity sufficient
to allow a network to route around problems.
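
As a rough sketch of a single-failure check (all names, capacities, routes, and traffic values are hypothetical), a planner can treat the links of a failed facility as unavailable, shift affected traffic to its backup route, and test whether any surviving link would be overloaded:

    # Evaluate a single-failure scenario against assumed capacities and routes.

    capacities_mbps = {"A-R1": 1000, "R1-R2": 1000, "R2-B": 1000,
                       "A-R3": 1000, "R3-R2": 1000}

    primary_routes = {("A", "B"): ["A-R1", "R1-R2", "R2-B"]}
    backup_routes = {("A", "B"): ["A-R3", "R3-R2", "R2-B"]}
    traffic_mbps = {("A", "B"): 700}

    def loads_after_failure(failed_links):
        """Return the load on every surviving link after traffic is rerouted."""
        loads = {link: 0.0 for link in capacities_mbps if link not in failed_links}
        for pair, rate in traffic_mbps.items():
            route = primary_routes[pair]
            if any(link in failed_links for link in route):
                route = backup_routes[pair]
            for link in route:
                loads[link] += rate
        return loads

    if __name__ == "__main__":
        for link, load in sorted(loads_after_failure({"R1-R2"}).items()):
            status = "OK" if load <= capacities_mbps[link] else "OVERLOADED"
            print(f"{link}: {load:.0f}/{capacities_mbps[link]} mbps {status}")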

7.23 Summary

Performance measurement and assessment is one of the key parts of network
management. There are two aspects. The first focuses on understanding how resources
are currently being used and how an existing network is performing. The second
focuses on long-term trends and capacity planning.
The primary measures of a network are latency, throughput, packet loss, jitter, and
availability. No single measure of performance exists because each application can be
sensitive to some of the measures without being sensitive to others. Although measur-
ing individual items in a network is easiest, meaningful assessment requires end-to-end
measurement, possibly with active probing.
Measurement is used to find bottlenecks and do capacity planning. Planning the
capacity of a single switch is straightforward and involves knowing the number of ports.
Utilization is related to delay. To measure link utilization, a manager makes multi-
ple measurements over time; 15-minute intervals work well for most networks. Peak
utilization can be computed from the 95th percentile. On heavily used links, the peak-to-
average ratio is almost constant, which allows a manager to measure average utilization
and employ the 50/80 Rule.
Managers use a traffic matrix to model network load, where the matrix gives the
average or peak data rate from each network ingress to each egress. Given a network
topology, a manager can use values from the traffic matrix to determine capacity needed
on a given link or network element. To plan capacity, a manager experiments by
changing assumptions about network topology or routing and calculating new resource
requirements. In addition to planning capacity for normal circumstances, managers con-
sider the excess capacity needed to accommodate failure scenarios.
