A Graph-Theoretic Approach to Enterprise Network Dynamics
Horst Bunke
Peter J. Dickinson
Miro Kraetzl
Walter D. Wallis
Birkhäuser
Boston Basel Berlin
Horst Bunke
Universität Bern
Institute of Computer Science and
Applied Mathematics (IAM/FKI)
CH-3012 Bern
Switzerland
[email protected]
Walter D. Wallis
Southern Illinois University
Department of Mathematics
Carbondale, IL 62901
USA
[email protected]
e-ISBN-10: 0-8176-4519-5
e-ISBN-13: 978-0-8176-4519-9
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Birkhäuser Boston, c/o Springer Science+Business Media LLC, 233
Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or
scholarly analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they
are not identified as such, is not to be taken as an expression of opinion as to whether or not they are
subject to proprietary rights.
www.birkhauser.com
Preface
unique node labeling; this work has also been described in [62]. Parts of reference [62]
are used in this monograph with kind permission of Springer Science and Business
Media. Chapter 4 introduces the most important graph similarity measures for abnormal
change detection in networks, as reported in [26, 158]. Median graphs and their most
important applications in detection of anomalous changes in dynamic networks are
outlined in Chapter 5. Many experimental results are given as well. This work was also
reported in [59, 63]. Chapter 6 addresses the important problem of clustering time series of graphs in the graph domain. The most important types of clustering are presented, and applications to network analysis are outlined. Graph distances based on intra-graph clustering are covered in Chapter 7; this work has also been reported in [61]. Chapter
8 outlines possible applications of matching sequences of graphs to network dynamics
investigations. Some applications to incomplete network knowledge are given. Some
of this work was reported in [22].
Part III is dedicated to the exploration of properties of underlying graphs in dynamic
enterprise networks. Chapter 9 introduces graph dynamics measures using path lengths
and clustering coefficients. Relationships to small-world networks and to general enterprise networks are described. In Chapter 10, a new set of measures utilizing Kendall–Wei ranking of graph tournaments is applied to network dynamics modeling and to ranking of enterprise network nodes by their importance in overall communication.
Part IV deals with theory and applications of network behavior inferencing and forecasting using sequences of graphs. Moreover, in this part, graph distances based on the
hierarchical graph abstractions are introduced. Chapter 11 describes the reconstruction
of missing network data using context in time and also machine learning and decision
tree classifiers applied to network prediction. In this chapter, a detailed examination
of the algorithms implemented is given, along with an extensive set of computational
results. Some of the new results described in this chapter have been reported elsewhere [23]. In Chapter 12, network dynamics measures involving hierarchical graph
abstractions are explored, together with their most important applications to enterprise
network monitoring. Bounding techniques are implemented for graph contractions, resulting in favorable speedups of anomalous change computations; the main results of
this chapter have been reported elsewhere [64]. We acknowledge the permission of World Scientific to use the material from that publication.
A monograph of this size and scope would not be possible without the help and
support of many people. The first author wants to acknowledge contributions from
his students at the University of Bern, especially Christophe Irniger, Michel Neuhaus,
and Florian Thalmann. The second and third authors would like to thank the many
people from the Intelligence, Surveillance and Reconnaissance Division of DSTO for
their support during the development of this book. All four authors would also like to
acknowledge the contribution of Peter Shoubridge to many theoretical and experimental
aspects of our work on network dynamics investigations. Any views stated within this
book are completely our own and are not related in any way to the Australian Department
of Defence.
Bern (Switzerland)
Adelaide (Australia)
Carbondale (U.S.A.)
March 2006
Horst Bunke
Peter J. Dickinson, Miro Kraetzl
Walter D. Wallis
Contents

Preface

Part I Introduction

1 Intranets and Network Management

2 Graph-Theoretic Concepts
   2.1 Introduction
   2.2 Basic Ideas
   2.3 Connectivity, Walks, and Paths
   2.4 Trees
   2.5 Factors, or Spanning Subgraphs
   2.6 Directed Graphs

10 Tournament Scoring
   10.1 Introduction
   10.2 Tournaments
        10.2.1 Definitions
        10.2.2 Tournament Matrices
   10.3 Ranking Tournaments
        10.3.1 The Ranking Problem
        10.3.2 Kendall–Wei Ranking
        10.3.3 The Perron–Frobenius Theorem
   10.4 Application to Networks
        10.4.1 Matrix of a Network
   10.5 Modality Distance
        10.5.1 Defining the Measure
        10.5.2 Applying the Distance Measure
   10.6 Variations in the Weight Function
   10.7 Conclusion

References

Index
Part I
Introduction
1 Intranets and Network Management
1.1 Introduction
The origins of the Internet and the TCP/IP protocol suite date back to 1969, when the Advanced Research Projects Agency (ARPA) funded a research and development project
to create a packet-switched network, named the ARPANET. The aim was to demonstrate techniques to provide a heterogeneous, robust, and reliable data communications
network. The ARPANET grew steadily as many research and educational institutions
implemented the open protocols and connected to the network. The ARPANET has
since evolved into the global network of networks that we know as the Internet and has
continued to grow rapidly.
The Transmission Control Protocol/Internet Protocol (TCP/IP) is a suite of protocols
that form the basic foundation of the Internet. The Internet Protocol (IP) is central to
the architecture of the Internet. In terms of the OSI (Open Systems Interconnection) seven-layer reference model, it provides the network layer service. The
function of IP is to ensure that packets injected at any point in the network are routed to
the intended destination. It is a connectionless protocol; hence it provides no guarantee
that packets will be successfully delivered to the destination node. The TCP is a reliable
connection-oriented transport layer protocol. Its function is to fragment a byte stream
into discrete messages and then use IP to route these messages to the destination. At
the destination, TCP sorts packets into the correct order and requests the sender to
retransmit lost packets. A comprehensive coverage of TCP/IP and layered protocols,
such as the OSI reference model, can be found in [166].
The popularity and growth of the Internet, in conjunction with the variety of applications that make use of it (e.g., World Wide Web and email), has led to the widespread
usage of the TCP/IP protocol suite in networks not connected, or indirectly connected,
to the Internet. These networks are called intranets and provide data communications
for internal use by organizations. Within such organizations, the trend has been toward
larger and more complex networks that support numerous corporate activities. Not
surprisingly, this has led to an increased reliance on the intranet for daily business functions. In addition, the larger and more complex the network becomes, the greater the demands placed on its management. A small organization may require only a single network technology; a local area network (LAN) using Ethernet technology would suffice. On the other hand, a large organization that has offices globally may use numerous networking technologies. This global intranet may comprise many Ethernet LANs, servicing the individual offices, that are interconnected using a high-speed wide area networking (WAN) technology, such as asynchronous transfer mode (ATM) or frame relay. Network technologies are dealt with more thoroughly in
[169]. The difficulty in managing networks increases with the size and complexity of the
network. While the management of a single LAN segment requires only basic network management tools, a large heterogeneous network requires a powerful network
management system.
It is often a requirement for an organization to connect its intranet to the public Internet. Under these circumstances a firewall is deployed to keep unauthorized Internet traffic off the intranet. Figure 1.1 shows the interconnectivity between an intranet and
the Internet. Likewise, if a partnership is formed between two organizations, a need
may arise to connect their individual intranets to share information. While this poses
less risk than connection to the Internet, firewalls are again deployed, albeit with less-stringent filters. When two or more enterprise intranets are connected to one another, they are referred to as an extranet (see Figure 1.1). Organizations are faced with
supporting a broader range of communications among a wider range of sites and at
the same time reducing the cost of the communications infrastructure. When an urgent
need arises to provide connectivity to remote offices, the past solutions to wide area networking, such as dedicated leased lines, have proven to be inflexible and expensive.
Many of these problems have been solved following the advent of virtual private network (VPN) technology. VPNs use the open distributed infrastructure of the Internet
to transfer encrypted data from the corporate intranet to remote sites. An example of
this can be seen in Figure 1.1. The use of VPNs is not limited to corporate sites only.
They can be used to provide secure connectivity to mobile users and to provide interconnections to extranets. VPNs provide a significant cost reduction over dedicated leased lines. The tradeoff for having greater flexibility in connections to extranets and
the Internet is increased security risks. The management of security of an intranet is
vital and should consider both internal and external vulnerabilities. While firewalls help
to secure interconnections to external networks, including the Internet, and VPN technology provides intranets with a secure method of data transfer that utilizes the public
infrastructure, additional methods are required to ensure that intranets, and the sensitive information they carry, remain protected from unauthorized access and malicious
attacks. These techniques require the ability to detect network anomalies and identify
network intrusion in order to identify and isolate incursions. Figure 1.1 also depicts
how a VPN is used to connect to remote intranets via the Internet.
As mentioned, the main objective of an intranet is to provide a mechanism for information exchange. The applications that are most prevalent in intranets for information
exchange are email, web browsers, ftp, and telnet. More recently, voice over IP (VoIP)
and video conferencing have grown in popularity. All of these applications can generate a large volume of dynamic traffic on the underlying physical network. For most intranets, web traffic makes up the bulk of this traffic and arises from communications
between distributed web servers and web browsers on user workstations. Unlike the
traditional client/server models, where traffic patterns are somewhat predictable, the new web-centric model leads to unpredictable traffic patterns. These unusual patterns result from a large number of users accessing a variety of web pages that reside on different web servers distributed across the intranet. Under some circumstances this traffic can lead to significant network problems. Flash crowds [95], whereby a recently published web site is accessed concurrently by a large number of users, are one such example that results in network congestion. The ensemble behavior of web traffic is thus largely driven by the activity of its users. Web traffic, and traffic resulting from many
other TCP/IP-based applications, poses significant challenges to network managers if
they are to provide acceptable performance and availability of intranet applications to
end users. To maintain desired levels of service to end users it is critical to exercise
effective network management of web resources, bandwidth, and traffic. Network management tools that can help to identify abnormal behavior before it leads to performance
degradation or network failure are still very immature. Better techniques are required
to detect network anomalies in order to identify problems early so that corrective action
can be taken.
The protocols used in the TCP/IP suite to control network routing (e.g., the open shortest path first (OSPF) protocol) can also result in significant dynamic network behavior. It is important for network managers to know whether this behavior is normal or abnormal so that faults, such as misconfiguration of routers, can be identified early. In order to derive models of normal and abnormal behavior, it is necessary for metrics to be collected
from various points in the network. The network management system is used to retrieve such information. Fortunately, intranets are generally owned by one organization
and thus have the advantage of a single management entity. This makes it possible to
access network devices directly to acquire management data. Conversely, when one is
performing network monitoring functions in the Internet domain, ownership of certain
parts of the network may not always be so clear, and access to management data may
be prohibited by the owner. Where complete ownership is not guaranteed, other techniques, such as network tomography [50], are required to derive information about the
network. These techniques may also prove valuable for intranets. Tools that can make
use of varied sources of network metrics in order to distinguish abnormal from normal
behavior, and thus identify anomalous events, are valuable to network managers.
Intranets have proven to be very useful facilitators of information exchange within an organization and as such are relied upon for performing the daily functions of the organization. The size and complexity of these networks, in
combination with the flexible nature of applications available to end users, can lead to
serious network faults. This has the potential to cause major disruption to the operations
of an organization. The impact of such faults can suspend an organization's activities until the problem is rectified. This could result in a large amount of lost revenue for
the organization. It is very important that an organization has a network management
system in place and that the capability of that system is commensurate with the risk that
the organization is willing to accept if the network were to fail. Performance, fault, and
security management of a network are vital components of an overall network management solution. They minimize the risk of network failure and network intrusion, and
ensure that the network is providing the desired quality of service to its end users.
device. The management agent associated with the network resource is responsible for
maintaining the MIB. The management station accesses information from the MIB of
managed resources to determine their status. The management station can also change
the characteristics of a network element by controlling values of attributes in the MIB
that relate to configuration aspects of the resource. Interaction between a manager station and management agents is provided by the network management protocol. A network management protocol defines functions for retrieving management information
from agents and for issuing commands to agents. The SNMP is widely used in TCP/IP
networks and hence is commonly used to manage intranets.
It is quite straightforward for a single vendor to produce an NMS to manage a network constructed solely from its own products. In practice, it is seldom the case that an
organization would choose network components from a single vendor. Hence the installation of multiple network management systems may be required by an organization
[159]. The use of multivendor network management systems is a problem, not only in
intranets, but in network management systems in general. Much effort has been made to
develop standards for the purpose of providing a common management system and to
enable interoperability among different NMSs. Some attention has been given to solving internetworking problems [42, 190]. In addition, commercial products, such as HP
Openview, provide an open network management platform on top of which can be built
network management applications. Openview provides common management services
that can be accessed through standard application programming interfaces (APIs). The APIs enable
third-party vendors to develop their own network management systems that conform
to Openview. Thus an enterprise can deploy an integrated multivendor NMS.
The two main standards for network management are those of the Internet and OSI.
The OSI protocols, developed by ISO, comprise the Common Management Information Protocol (CMIP) and its associated services.
The following two sections provide an overview of SNMP and RMON. For a more
comprehensive description see [162].
1.5.1 Simple Network Management Protocol (SNMP)
SNMP is a set of standards used for network management of TCP/IP networks. Not only
does it define a protocol for exchanging network management information between an agent and manager, but it also contains a framework for the definition of management
information relating to network devices. For a device to be managed by SNMP it must
be capable of running an SNMP management agent. Currently, most devices designed
for use in TCP/IP networks, such as bridges and routers, meet this requirement. This
makes SNMP a cost-effective method for monitoring network functions. In contrast
to CMIP, SNMP does not provide a definition for network management functions and
services. Instead, it provides a set of primitives from which network management applications can be built. The network management applications are performed at the
management station on information retrieved from management agents. Many vendors
produce proprietary network management applications that use SNMP [160].
The Management Information Base (MIB) is a database used by SNMP to define
characteristics of the managed resource (e.g., server, bridge, router). Each resource to be
managed is represented in the database by an object. The MIB is a structured collection
of such objects. Objects within the MIB must be either scalar or tabular quantities. A tabular variable is used where multiple object instances are defined. The total number
of packets on a router interface would be represented by a scalar variable, whereas a
list of interface entries would be tabular. The SNMP standards describe in considerable
detail the information that must be maintained by each type of management agent. This
information is rigidly specied to ensure that a management station will be compatible
with network resources produced by a variety of vendors. Management agents are not
required to maintain all objects defined in the MIB. Depending on the type of device, a
set of objects relevant to the operations of that device will be populated. For example,
a device that does not require the implementation of TCP will not have to manage MIB
objects relating to TCP. Most of the information stored in the MIB of a device either
represents the status of the device (e.g., operational status of an interface on a router), or
is an aggregated traffic-related parameter (e.g., number of packets into an interface on
a router). The management station also maintains a MIB. The contents of this database
reflects the contents of the MIBs of network devices that it manages, and hence the objects
that these devices support.
The objects defined for the MIB of SNMP are arranged hierarchically as a tree
and are clustered into groups of related areas. The tree structure facilitates this logical
grouping, with each managed object occupying a leaf in the tree. MIB-II is the current
standard for SNMP and is specified in RFC 1213 [128]. There are eleven groups defined
for MIB-II. These groups contain information relating to such areas as system, interface,
TCP, and UDP (User Datagram Protocol). The system group defines objects related to
system administration, such as system name, contact person, and physical locality.
It is the most accessed group in MIB-II. Other groups, such as interfaces and TCP,
comprise objects that control the behavior of the network resource (e.g., the maximum
number of allowable TCP connections) and provide counts for certain traffic-related variables (e.g., total number of input octets received by an interface). In total there are 175 objects (or variables) specified in MIB-II [128]. An important criterion used by the developers of MIB-II when selecting object groups was that each object must assist in fault or configuration management. Further criteria are given in the RFC. A private subtree has been included in the MIB to cater for vendor-specific objects.
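To make the tree structure concrete, the following sketch lists a few MIB-II object identifiers in Python. The OIDs shown are the standard ones for the system and interfaces groups; the snippet itself is purely illustrative and not part of any SNMP toolkit.

# Illustrative map of a few MIB-II object identifiers (OIDs).
# MIB-II lives under 1.3.6.1.2.1; each group is a subtree, and each
# managed object occupies a leaf (scalars carry instance index 0).
MIB2 = '1.3.6.1.2.1'

OIDS = {
    'sysName.0':    MIB2 + '.1.5.0',     # system group: administrative name
    'sysContact.0': MIB2 + '.1.4.0',     # system group: contact person
    'ifNumber.0':   MIB2 + '.2.1.0',     # interfaces group: interface count
    'ifInOctets':   MIB2 + '.2.2.1.10',  # per-interface input octets (tabular)
}

for name, oid in OIDS.items():
    print(f'{name}: {oid}')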
The SNMP protocol is used to provide communication of information contained
in the MIB between the managed agent and management station, and between two
manager processes. SNMP was designed to operate over UDP for robustness. Since
UDP is connectionless, no connections are maintained between a management station
and its agents. The use of UDP can result in some SNMP messages not reaching their
intended destination node. This problem is exacerbated when SNMP is used to manage
larger networks and when regular polling is relied upon for management information.
If SNMP had used TCP, then SNMP exchanges may have been lost whenever one or
more links in the path between the management station and managed agent failed.
SNMP was designed to be easy to implement and to consume minimal resources.
As a result, the capabilities it offers are very simple. The protocol provides four basic
functions to allow a manager station to communicate with an agent. These functions
include get, set, trap, and traversal (e.g., get-next-request) operations. In summary,
the get command is used to retrieve management information from agents and hence
monitor managed devices; the set command is used to control managed devices by
changing the values of certain MIB objects. The trap command is used to send alarms
asynchronously from an agent to a management station whenever an abnormal event
is detected by the agent. In the event of an alarm, a network manager would respond
by polling for management information from the agent in order to ascertain the cause
of the alarm. A manager station can also send an alarm to another manager station.
Finally, traversal operations are used to determine which objects are supported by a
device on the network and to sequentially retrieve information from tabular objects,
such as routing tables.
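As an illustration of the get operation, the following minimal Python sketch polls an agent for its system name. It assumes the third-party pysnmp library, which the text does not prescribe; the agent address, community string, and OID instance are placeholders.

# Poll one scalar MIB object via an SNMPv2c get (sketch; assumes pysnmp).
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData('public'),                 # community name (placeholder)
        UdpTransportTarget(('192.0.2.1', 161)),  # agent address (placeholder)
        ContextData(),
        ObjectType(ObjectIdentity('SNMPv2-MIB', 'sysName', 0)),
    )
)

if error_indication:
    print(error_indication)  # e.g., a timeout: UDP gives no delivery guarantee
else:
    for var_bind in var_binds:
        print(' = '.join(x.prettyPrint() for x in var_bind))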
The first version of SNMP, referred to as SNMP version 1, was developed as an interim management protocol, ahead of OSI management. The specification is given in RFC 1157 [32]. Since OSI management was never realized, further development of
SNMP was undertaken to address a number of deficiencies in SNMPv1. This resulted in SNMPv2 [33–40]. The main change to the protocol in SNMPv2 was the introduction
of two new messages. The rst message, get-bulk, provided a bulk data transfer capability for improving retrieval speed of data from tables. The second, inform-request,
provided improved communication of management information between management
stations. SNMPv2 is currently the most widely used version of SNMP. Many security
issues that were identified in SNMPv1 were unable to be fixed during the development cycle of SNMPv2. The final revision of SNMPv2 was termed community-based SNMPv2, or SNMPv2c, since it uses the concept of a community name for authentication purposes. Recent work on SNMPv3 has addressed security issues in greater detail [161].
An important limitation of SNMP is the inability to derive information about traffic between two subnetworks separated by two or more routers. If the traffic between such subnetworks were to increase, then the cause of such an increase could not be identified
using MIB-II alone. In these circumstances RMON can be used. RMON is discussed in
Section 1.5.2 below. SNMP is primarily a capability for collecting and reporting management information about the network. Further processing of information is required
to perform network anomaly detection.
1.5.2 Remote Network Monitoring (RMON) Protocol
The success of SNMP management is reflected in its widespread usage in TCP/IP-based networks along with its availability in most vendors' networking equipment. This success has resulted in growth of the number of managed network resources in computer
networks. SNMPv1 provided the rst capability for implementing remote monitoring
of a network. This capability enabled a centralized Network Operation Center (NOC) to
remotely configure network resources and to detect faults. It is common for large enterprise networks to comprise many thousands of hosts and subnetworks. Unfortunately,
SNMPv1 alone could not provide adequate capability for monitoring the performance of
such networks. The RMON standard is an enhancement to SNMP to provide a network
management system with the ability to monitor a subnetwork as a whole rather than
having to monitor individual devices connected to the subnetwork. Since the characterization of network performance is statistical in nature, it was logical that such statistics
be produced locally and later transmitted to a central network management station. The
process of remote network monitoring involves three steps. The first step requires access to the transmission medium so that packets flowing on the network can be viewed.
This activity is performed by a network monitor (or probe) attached to the subnetwork.
Network monitors can be stand-alone devices or can be embedded into existing equipment, such as routers. Network monitoring employed by RMON is a passive operation
and hence does not disrupt the flow of data on the network. Figure 1.3 shows a typical network configuration employing RMON probes at each LAN segment. The second
step in remote monitoring is to produce summary information from data collected from
a probe. This may include error statistics, such as number of collisions, and performance
statistics, such as throughput and packet size distribution. The production of statistics
is performed within the network monitor. The final step requires communication of the
summarized information to a remote network management station.
There are several advantages in using RMON devices for network monitoring.
It is not always practical to monitor subnetworks using SNMP due to the additional
traffic generated by the protocol. A single RMON device can devote all of its resources
to monitor a network segment and relay summarized information, characterizing the
behavior of the subnetwork, to the management station. The information can be sent
upon request by a management station or as a result of an abnormal event being detected
by the management agent. The SNMP protocol is still used to access information from
RMON devices; however, the overall effect is a marked reduction of SNMP traffic on the network, especially in the segment where the network management station resides. In addition, RMON is able to provide greater detail about the nature of traffic on a
subnetwork. This can be used to deduce information such as the host within a LAN
segment that is generating the most errors. It would not be possible to do this without
remote monitoring unless the network management station were connected directly to
the subnetwork. Since SNMP uses unreliable transport of packets, it is more likely
for packet loss to occur across a large network. The local probing of a subnetwork,
performed by an RMON device, is thus more reliable than that which could be achieved
by regular polling of information across a large network using SNMP. If the network
management station is unable to communicate with a device due to link failure, the
RMON device can continue to collect statistics about the local subnetwork and report
back to the network management station when connectivity resumes. It is possible to
perform near-continuous monitoring of a subnetwork using RMON; hence proactive
fault detection is a possibility. At the least, network problems can be identified more
quickly and reported to the network management station.
In order to implement RMON, it was necessary to add new MIB variables to supplement MIB-II. No changes were required to the SNMP protocol to support RMON.
A device that implements the RMON MIB is known as an RMON probe. RMON is
described in detail in a number of RFCs published by the IETF. The earliest implementation of RMON, now referred to as RMON1, is given in RFC 1757 [177]. It is
capable of monitoring all traffic on the LAN segment to which it is attached. RMON1
operates at the data link layer; hence it can capture MAC (medium access control) level
frames and read source and destination MAC addresses in those frames. If a router is
attached to the LAN, RMON1 can monitor only the total traffic into and out of that router.
It is not capable of determining source and destination addresses beyond the router. The
RMON1 MIB is divided into ten groups of variables. Each group provides storage of
data specific to that group. An example is the statistics group, used to store information
on utilization of the network.
RMON2 is defined by RFC 2021 and RFC 2074 [11, 178]. It was developed to provide a capability for monitoring protocol traffic above the MAC level. RMON2 operates upward from the network layer to the application layer. It can monitor traffic at the network layer, including IP addressing, and at the application level, such as email, ftp, and web. As a result, RMON2 can determine source or destination addresses beyond a router. This additional capability enables a network manager to determine such things as which nodes are contributing the bulk of traffic incoming or outgoing to the LAN. It also enables a breakdown of traffic by protocol or application. The RMON2
MIB introduces an additional nine groups of variables to that of the RMON1 MIB. These
hold information related to higher-layer activities, such as statistics of traffic carried between specific host pairs for a given application. Of most importance to the study of
anomaly detection in intranets, especially the graph-theoretic techniques developed in
this monograph, are the matrix groups. These groups provide statistics on the amount of traffic between pairs of hosts, and contain statistics relating to the network layer and the application layer. The network-layer matrix (nlMatrix) group provides statistics for the aggregated traffic between host pairs, while the application-layer matrix (alMatrix) group provides statistics on the basis of application-level addresses. This information can be used to describe the network topology and traffic flow at the network layer for any given time interval.
The finer-grained detail of network management information provided by RMON2 comes at the cost of greater processing requirements at the management agent. This has
led vendors to produce stand-alone RMON probes that are hosted on high-end servers.
At present, the standards for RMON2 are being extended to support high-capacity
networks.
Passive monitors can observe all traffic into and out of a single device and can examine encapsulated headers to derive behavior related
to the network layer and above. Devices such as routers containing SNMP agents
and network monitors that implement the RMON MIB are the most commonly used
passive monitors. While active monitoring techniques are useful for producing certain performance metrics, such as network latency measurements, passive techniques that collect information relating to origin–destination (OD) traffic flows are of prime interest to this monograph. The information from OD traffic flows can be represented as a
graph. Graph-based techniques can then be used to produce measures that are sensitive
to network change and hence be used in network anomaly detection. These and other techniques applied to the analysis of enterprise network dynamics represent the main topics of this monograph.
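As a minimal sketch of this idea, the following Python fragment aggregates OD flow records observed during one sampling interval into a weighted directed graph; the record format and addresses are invented for illustration.

from collections import defaultdict

# (origin, destination, bytes) records for one sampling interval (toy data)
flows = [
    ('10.0.1.5', '10.0.2.9', 1400),
    ('10.0.1.5', '10.0.2.9', 5200),
    ('10.0.3.2', '10.0.1.5', 800),
]

# Edge weights aggregate total traffic between each ordered host pair.
graph = defaultdict(int)
for origin, dest, nbytes in flows:
    graph[(origin, dest)] += nbytes

for (origin, dest), weight in graph.items():
    print(f'{origin} -> {dest}: {weight} bytes')

A sequence of such graphs, one per sampling interval, is the kind of time series of graphs to which the measures developed in later chapters are applied.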
1.6.2 Common Monitoring Solutions for Intranets
SNMP-based polling systems have proven to be cost-effective means for network monitoring due to the widespread deployment of devices that are SNMP enabled. MIB
variables are a very good source of aggregated network data and are hence often used
for passive network monitoring [15, 31, 90]. These polling systems, however, have an
inherent overhead in terms of the processing load on the network devices required to compute traffic statistics, and in the network bandwidth consumed when the management station retrieves data from managed agents. In some extreme cases, where a poorly designed network management system is in operation, the SNMP traffic can be responsible for disrupting the very services that it aims to maintain. In order to be capable of
rapidly detecting network anomalies and faults, the rate of polling of each SNMP agent
in a network has to be at least of the same order of time as that of the fault; otherwise,
the fault will go undetected. This can lead to very short polling intervals on the order
of a few minutes. This increased polling frequency further accentuates the processing
demand on network devices and bandwidth overhead. Also, as network management
systems become more focused on application-level management, the network monitoring system is required to collect more data. An increased processing load on network
devices can lead to lost packets. Some research has recently been undertaken to improve the efficiency of polling [43] and data access [19] for the SNMP protocol. When
a centralized measurement system is utilized, bandwidth bottlenecks can occur on links
nearest to the central management station. The resource-intensive task of polling can
be overcome using a distributed measurement system whereby network monitors send
important data back to midlevel management stations. A midlevel station will usually
oversee approximately ten probes and be located in close proximity to those probes. Its
function is to consolidate data from each probe and respond to periodic queries from a
higher level, or central management station [175]. In distributed polling systems, the
bulk of polling is moved closer to each device, where the corresponding links are less
likely to be affected by the additional trafc. The trafc load on links between midlevel
management systems and the central management system is thus greatly reduced. In
[112] a method to minimize bandwidth utilization using hierarchical network monitoring is addressed. To reduce the hardware requirements, and hence cost, this research
sought to find the minimum number and location of midlevel stations in a network to
perform polling. Other research into failure-resilient monitoring of link delays [9] and
monitoring of bandwidth and latency [18] also deals with the problem of minimizing
the cost of measurement infrastructure.
RMON is also commonly used for network monitoring. However, the cost of deploying an RMON solution network-wide is high. RMON was described in detail in
Section 1.5.2. Since the function of RMON is to monitor and aggregate traffic in a
subnetwork, it provides a similar benet to that of a distributed management model, in
that it reduces the need for regular polling by a management station. RMON is generally limited to monitoring LAN segments. Implementation of RMON for monitoring
higher-speed backbone interfaces has proven to be infeasible or prohibitively expensive
[77].
Packet monitors are an alternative method for the production of network measurements and are commonly used on high-speed backbone links. Packet monitors can provide very detailed information about traffic traversing a link. They operate by collecting a copy of each packet that traverses a link, recording IP, TCP, or application layer
information. In monitoring high-speed links the collection of every packet becomes
impractical. To reduce the load on processing elements and volume of data collected,
these monitors often collect only a limited number of bytes from each packet. Typically,
only the IP header is collected, which contains information such as source and destination addresses and port numbers. Many packet monitoring tools have been developed
by research institutions [69, 72, 103] and commercial vendors.
Many commercial tools are available for performing network monitoring functions.
These range from personal computers fitted with a network interface card and special
monitoring software to custom hardware devices. Examples of popular commercial
monitors include NetFlow by Cisco [48], and Ecoscope by Compuware.
1.6.3 Alternative Methods for Network Monitoring
In the management of intranets we assume that a single administrative domain exists.
From a network monitoring perspective this means that monitoring techniques that
require direct connection to the network to perform active or passive monitoring, or
require access to information on network devices, can be employed. Conversely, monitoring of Internet performance is a more difficult problem due to the size, heterogeneity, and lack of centralized administration [49, 86]. Before entering into a peering relationship,¹ the owner of an autonomous system (AS) would generally like to gain some
insight into the operations of the AS that it wishes to peer with. Under these conditions
it is necessary for monitoring techniques to be able to acquire information about the
AS of interest without cooperation from devices within that network. Techniques exist
whereby passive monitoring of traffic emanating from that network is used to derive information about the network's internal operations. Network tomography [50], which
is based on signal processing, is one such technique that has been adapted for this
purpose. Here network inferencing is used to estimate network performance parameters based on trafc measurements taken at a small subset of network nodes. Many of
¹ A bilateral agreement established between two or more ISPs for the purpose of directly exchanging Internet traffic.
the techniques developed specifically for Internet monitoring [44] can also be applied
to intranets. While it may seem unnecessary to do so, in some circumstances where
limited monitoring infrastructure is available, and network bandwidth is at a premium,
such techniques can be invaluable. These methods provide additional sources of data
for network anomaly detection.
1.6.4 Sampling Interval and Polling Rate
Two important parameters in network monitoring are the period of time between measurements and the sampling (or aggregation) interval. The period of time between measurements, or polling rate, defines how often a network measurement is taken.
When a fast polling rate is used, the interval between network measurements is short.
Likewise, a slow polling rate results in a long interval of time between measurements.
The sampling interval determines the length of time that information collected from
the network is aggregated to produce any given network measurement. This interval
governs the types of network faults or anomalies that can be detected. An example of
this is a count of the number of packets into or out of a router interface. A sampling
interval of fifteen minutes will enable shorter-duration faults to be detected than those
that would be detected by a longer time interval. Aggregation of statistics over a longer
time interval would mask the occurrence of the shorter-duration faults or anomalies,
but would be better suited for predicting long-term network trends. The performance
of a network is also largely influenced by the time of day, or operating period [194].
This is mostly due to trafc resulting from network users.
It is common for a fixed sampling interval to be employed in network monitoring implementations. However, the use of a variable-length sampling interval can be better suited to certain monitoring functions. Instead of time being used to determine the boundaries of a sampling interval, the interval could be determined by using a fixed number of OD traffic flows on a link. Thus the time interval would vary depending on
the rate of traffic on the link. In busy periods, when traffic on the network is heavy, the
interval would be short. Conversely, when network activity is low, such as in the middle
of the night, the time interval would be longer. Obviously, there exist many other ways
for determining the length of sampling intervals.
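A minimal Python sketch of such a count-based interval follows; the window size is an arbitrary illustrative choice.

def count_based_windows(flow_stream, window_size=1000):
    """Yield one list of OD flow records per sampling interval, closing the
    interval after a fixed number of records rather than a fixed time span."""
    window = []
    for record in flow_stream:
        window.append(record)
        if len(window) >= window_size:   # busy periods close windows quickly
            yield window
            window = []
    if window:                           # flush a final partial interval
        yield window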
Depending on the network monitoring requirement it may be advantageous to have
a short sampling interval and a slow polling rate. This approach is most suited to
monitoring networks that are known to be relatively static over time. The sampling
interval is set to a value suitable for capturing network anomalies of interest. Aggregation of measurements over the sampling interval is usually performed locally at a router or network probe. A slow polling rate reduces the network load resulting from traffic between
the management station and network element. The polling rate selected should ensure
that any change that may occur will be detected within an acceptable time frame.
Most of the examples given above discuss aggregation of SNMP MIB variables. In
this monograph, we address the aggregation of network topology. The aim is to produce
a static representation of the network topology observed during the aggregation interval
(see Section 4.2). This method is central to the application of graph-based techniques
developed throughout this monograph.
1.7 Network Anomaly Detection

Some traffic anomalies are not visible at any single device but appear only as changes in the network as a whole. The standard techniques are unable to detect such
anomalies. A requirement exists for automated techniques to perform early detection
of network-wide traffic anomalies to enable timely and rapid correction of a problem
before it can result in a failure of service [94].
1.7.1 Anomaly Detection Methods
Anomaly detection has found applications in many of the activities relating to network
management including network security [56, 117], fault detection [89, 101, 116, 124],
and performance monitoring [86]. Network intrusion detection (i.e., detection of intrusions and denial-of-service attacks) is the greatest driver for this research. Proactive fault detection [87–89, 171] is also a big driver of this work. The methods used for implementing anomaly detection fall into two major categories, and this section describes both: the signature-based and the statistical approaches.
Signature Method
Signature-based, or rule-based, approaches to anomaly detection have been used considerably in network management [110, 119, 126]. The fundamental characteristic of
signature-based approaches is that they can only detect network anomalies that have
been observed in the past. A signature for each anomaly is created and stored in a
database. When new anomalies are identied, through other means, a new signature
is created and added to the database. Techniques in this category perform well against
known problems, usually with very low false alarm rates. Signature-based methods can
also help to limit the source domain. Information not relevant to signatures of interest
can be ignored. Any data that cannot be matched to a known pattern can also be immediately discarded. The major disadvantage is that signature approaches are unable
to identify new problems as they arise. A network anomaly that is not represented in
the database of signatures of known anomalies will remain undetected. In addition, this
method assumes that information is available to build a database of representative signatures. In general, such a database requires substantial time to develop, and demands the
attention of network experts. Signature-based techniques are used in network anomaly
detection due to the large number of efficient algorithms that have been developed in this area over time. While only a limited number of anomalies can be detected by any
system using this method, it is expected that detection of anomalies using such systems
is reliable, and able to explicitly identify the type of anomaly that has transpired.
Several variants of signature-based methods have been explored over time. The
early work in this area was based on expert systems, where rules dening the behavior
of faulty systems or known network intrusions were compiled. The rule-based systems
rely heavily upon expertise of network managers and do not adapt well to an evolving
network environment. In the detection of network faults, alarms are generated by network resources and sent to a central management station. Alarms arising from multiple
network resources must be correlated to determine the cause of a problem. There is
considerable research in the domain of alarm correlation [97, 150]. Rule-based systems are used to correlate such alarms. Case-based reasoning [119] is an extension to rule-based systems whereby previous fault scenarios are used to assist in the decision-making process. Adaptive learning techniques can be employed to enable this approach
to adapt to the changing network environment. Finite state machines [116] have also
been used for anomaly detection in networks. Historical data is used to build a finite state machine model. The sequences of alarms generated from devices in the network are modeled as states of the finite state machine. A problem is flagged whenever the sequence of events leads to a state that is known to represent an anomaly. In this way the
exact type of anomaly is usually determined. The problem with finite state machines is that there may be an explosion in the number of states as a function of the number and complexity of anomalies to be modeled. Finite state machines are also not well suited to adaptation to a changing network environment.
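The following toy Python sketch illustrates the idea; the alarm names, transition table, and anomaly state are invented for illustration and are not drawn from [116].

# Alarm sequences drive transitions; reaching a state marked anomalous
# flags a known problem (toy model).
TRANSITIONS = {
    ('ok', 'link_down'):       'degraded',
    ('degraded', 'link_down'): 'partition',   # a known anomaly state
    ('degraded', 'link_up'):   'ok',
}
ANOMALOUS = {'partition'}

def detect(alarms, state='ok'):
    for alarm in alarms:
        state = TRANSITIONS.get((state, alarm), state)  # ignore unknown alarms
        if state in ANOMALOUS:
            return state
    return None

print(detect(['link_down', 'link_down']))   # -> 'partition'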
Statistical Method
Statistical approaches to anomaly detection [7, 125, 171] offer an alternative to
signature-based methods. These methods function by learning normal network behavior from network measures. When network measurement variables deviate from normal behavior, a possible anomaly is indicated. In order to build a model of normal
behavior there is a need to know when the network is operating normally. In practice it
is rare for this information to be known. Likewise, it is difficult to characterize abnormal
behavior [52]. Instead, it is common practice that normal behavior be represented by
a period devoted to learning. An assumption is made that it would be unlikely that a
network anomaly would occur during a learning period because of the low frequency of
occurrence of anomalies. However, if an anomaly does occur during the learning period,
the problematic behavior will become part of the model for normal behavior and hence
be undetectable in the future. Common techniques used to learn normal network behavior and perform anomaly detection include autoregressive processes [31, 83, 171], neural networks, hidden Markov models (HMMs) [87], wavelets [125], Kalman filters,
change point detection [197], and Bayesian networks [89].
Unlike the signature approaches, statistical methods are capable of detecting network anomalies that have not been observed in the past. In addition, as the network
evolves, statistical approaches can continuously update their model of normal behavior.
Thus they have no need for regular recalibration or retraining. For statistical methods to
perform adequately, they require suitable indicators to be selected as input for the decision engine. Choosing the right measures is difficult, since different types of anomalies produce different symptoms. This requires a large range of network measures to be
considered. It is critical that the measurement variables selected enable modeling of
normal network behavior and be sensitive to network abnormalities of interest. When
more than one measure has been deemed suitable, they can be combined to produce a
single anomaly measure (see Section 1.6.6) so that single time series analysis can be utilized. Alternatively, techniques based on analysis of multiple time series, or multivariate
analysis, such as principal component analysis, may be used.
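A toy stand-in for these richer models, sketched below in Python, learns a mean and standard deviation from a training window assumed to be anomaly-free and flags measurements deviating by more than k standard deviations; the measurement values are invented.

import statistics

def fit_baseline(training_values):
    # Training window assumed anomaly-free, as discussed above.
    return statistics.mean(training_values), statistics.stdev(training_values)

def is_anomalous(value, mean, stdev, k=3.0):
    return abs(value - mean) > k * stdev

mean, stdev = fit_baseline([102, 98, 105, 99, 101, 97, 103])  # packets/s
print(is_anomalous(100, mean, stdev))   # False: within the normal band
print(is_anomalous(180, mean, stdev))   # True: likely anomaly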
Network-wide approaches to anomaly detection build on the statistical techniques. However, the difference is that network measures are collected from numerous points across a network and combined to produce new measures that capture
network-wide anomalies. Network-wide measures are suited to modeling the dynamic
behavior of a network and detecting topology and trafc anomalies brought about by
unusual behavioral patterns of its users. It is important for network operators to be able
to identify the causes of change in traffic volumes over physical links caused by changes
in user behavior across the network. The work in this monograph takes this approach,
employing graph-based techniques to produce measures that are sensitive to changes
in network topology.
A technique to detect network-wide anomalies was studied in [113]. Here the subspace method, based on multivariate statistical process control, was applied to OD-level traffic flows collected from all routers in a network. Principal component analysis (PCA) was used to decompose OD flows into their constituent eigenflows. The top eigenflows correspond to normal network behavior, with the remainder of the eigenflows representing abnormal behavior. The original OD flows were reconstructed as the sum of normal and abnormal components. Abnormal events were isolated by inspecting the residual traffic. Results were obtained for three OD flow data sets comprising 5-tuple
data and one of the number of bytes, packets, or flows. The anomalies detected using this approach proved to be valid. However, an additional finding showed that each data set led to different anomalies being detected. This suggests that the data sets derived from numbers of bytes, packets, and flows produce complementary information about network behavior.
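A compressed Python sketch of the subspace idea (using numpy, with random data standing in for real OD flow measurements) may help fix the mechanics:

import numpy as np

X = np.random.rand(200, 20)              # 200 intervals x 20 OD flows (toy)
Xc = X - X.mean(axis=0)                  # center each flow's time series

# Principal components of the flow matrix are the "eigenflows".
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 4                                    # top eigenflows span the normal subspace
P = Vt[:k].T @ Vt[:k]                    # projector onto the normal subspace

residual = Xc - Xc @ P                   # abnormal component of each interval
score = (residual ** 2).sum(axis=1)      # squared prediction error per interval
print(np.argsort(score)[-5:])            # intervals most suspicious of anomaly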
1.7.3 Examples of Network Anomalies
There are numerous network anomalies that are of great interest to network operators.
Many of these anomalies arise from network device failure, performance degradations,
and network security issues. Some common performance anomalies include file server failure, paging across the network, babbling nodes, broadcast storms, and transient
congestion [127]. Denial of service attacks and network intrusion are examples of
security-related anomalies. Some network anomalies can be detected using techniques
that make use of data gathered at the link layer or below. This may require network
measurements such as counts of packets into or out of router interfaces. There are,
however, many instances whereby a network-wide solution may be better suited to
detecting certain kinds of network anomalies. Here, monitoring techniques that gather
data at the network layer and above are required. This may entail collecting IP header
information at several points across a network. The resulting data sets would include
OD flows. However, these could also be further refined by specification of a certain application layer protocol (e.g., OD flows relating to http traffic only). Network-wide
anomalies are often the result of unusual patterns of activity caused by user behavior.
Below are a number of examples of network anomalies that impact intranets, and would
be best detected using a network-wide analysis approach. The examples given aim to
provide motivation for research into network monitoring strategies outlined in this
monograph. Accordingly, they are prime candidates of network problems where the
techniques proposed in this monograph could be applied.
1.8 Summary
In this chapter, a definition of a typical enterprise intranet was given along with many of the network management functions required to adequately maintain quality of service to
users. Intranets are based on technologies developed for the Internet and have recently
become invaluable to modern business. Many enterprise business functions now rely
upon the intranet, and network failures or intrusions can be costly. Early detection of
network anomalies can reduce or eliminate possible failures. Techniques for network
anomaly detection can be used to aid network management functions.
A model widely adopted for network management comprises five functional areas: fault management, performance management, security management, configuration management, and accounting management. While a detailed explanation of each function was given, application to performance, fault, and security management of large dynamic intranets was identified as the focus of the techniques developed in this
monograph. TCP/IP network management is used to manage most intranets. The SNMP
protocols form the basis of TCP/IP management. The implementation of network management functions is executed by a network management system. Management stations
gather information compiled at network elements (e.g., router) by a management agent
and use this information to monitor and control the behavior of the entire network.
The database used in SNMP is known as the MIB. The SNMP agent is responsible
for populating the MIB with useful network measures. The standardization of SNMP has
resulted in it being implemented by most vendors of networking equipment. RMON
is another important protocol in TCP/IP management, providing the ability to monitor
remote parts of a network.
Network monitoring is essential for the management of all networks. It is the process of gathering useful information pertaining to the operation of the network, and it facilitates all five functional areas of network management. Network monitoring consists of three steps: data collection, information processing, and anomaly detection. New methods are required to improve both information processing and anomaly detection. The synthesis of new network measures using graph-based techniques can be used to improve the detection of network-wide anomalies, which is the main theme of this monograph.
Network anomaly detection is a growing field of research, driven by the increasing size and complexity of modern networks and the lack of tools to identify abnormal network behavior. The two main approaches to network anomaly detection are the signature-based and statistical methods. The signature-based methods rely on a database of known signatures to detect anomalies. While they are unable to detect new types of network anomalies, they generally produce very low false alarm rates and are able to identify the actual anomaly that has occurred. The statistical approaches model normal network behavior and classify deviations from normal behavior as abnormal. These methods are capable of detecting unknown anomalies. However, they suffer from larger numbers of false alarms and a reduced ability to identify the type of anomaly when one occurs.
2
Graph-Theoretic Concepts
2.1 Introduction
We are going to discuss the structure of several kinds of communications networks.
In every case, the network consists of a number of individuals, or nodes, and certain
relationships between them.
The basic mathematical structure underlying this is a graph. A graph consists of
undened objects called vertices together with a binary relation called adjacency: given
any two vertices, either they are adjacent or they are not. Vertices will usually represent
nodes or collections of nodes of the network; for example, they might be individuals
in an organization, or servers in an intranet. (When vertices represent single nodes,
the words node and vertex are used interchangeably.) Adjacency might represent
communication between two nodes, or acquaintanceship, or any other relation.
It is useful to represent a graph in a diagram. The vertices are specially identified points, and a line is drawn between each adjacent pair of vertices; for this reason adjacent pairs are called edges. (The name graph derives from this graphic representation.) Numerical measures may be associated with the vertices or with the edges, representing costs, capacities, et cetera.
In this book we shall encounter several generalizations of graphs; for example, it is often (but not always) useful to associate directions with the edges. The definition of an edge implies that there can be at most one edge between two vertices, but in some representations multiple edges make sense.
Because of the importance of the graphs that underlie networks, we shall start with
a formal discussion of graphs and a few standard denitions.
General discussions of graph theory include [10, 181, 188]. The relation to networks is discussed in [6, 135, 184].
In terms of the more general definitions sometimes used, we can say that our graphs are finite and contain neither loops nor multiple edges.
We write v(G) and e(G) for the orders of V(G) and E(G), respectively.
The edge containing x and y is written xy or (x, y); x and y are called its endpoints. We say that this edge joins x to y. G − xy denotes the result of deleting edge xy from G; if x and y were not adjacent, then G + xy is the graph constructed from G by adjoining an edge xy. Similarly, G − x is the graph derived from G by deleting one vertex x (and all the edges on which x lies), and G − S denotes the result of deleting some set S of vertices.
In order to discuss cases in which there may be more than one link between two vertices, we define a multigraph in the same way as a graph except that there may be more than one edge corresponding to the same unordered pair of vertices. The underlying graph of a multigraph is formed by replacing all edges corresponding to the unordered pair {x, y} by a single edge xy. Unless otherwise mentioned, all definitions pertaining to graphs will be applied to multigraphs in the obvious way.
If vertices x and y are endpoints of one edge in a graph or multigraph, then x and y are said to be adjacent to each other, and it is often convenient to write x ∼ y. The set of all vertices adjacent to x is called the neighborhood of x, and denoted by N(x). We define the degree or valency d(x) of the vertex x to be the number of edges that have x as an endpoint. If d(x) = 0, x is an isolated vertex. A graph is called regular if all its vertices have the same degree; in particular, if the common degree is 3, the graph is called cubic. We write δ(G) for the smallest of all degrees of vertices of G, and Δ(G) for the largest. (One also writes d(G) for the common degree of a regular graph G.) If G has v vertices, so that its vertex set is, say, V(G) = {x1, x2, ..., xv}, then its adjacency matrix M_G is the v × v matrix with entries m_ij such that

$$m_{ij} = \begin{cases} 1 & \text{if } x_i \sim x_j, \\ 0 & \text{otherwise.} \end{cases}$$
Some authors define the adjacency matrix of a multigraph to be the adjacency matrix of the underlying graph; others set m_ij equal to the number of edges joining x_i to x_j. We shall use the former convention.
A vertex and an edge are called incident if the vertex is an endpoint of the edge, and
two edges are called incident if they have a common endpoint. A set of edges is called
independent if no two of its members are incident, while a set of vertices is independent
if no two of its members are adjacent.
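For readers who want to experiment with these definitions, the following is a minimal sketch (ours, not from the book) that builds the adjacency matrix of a graph; the function and variable names are illustrative only.

```python
def adjacency_matrix(vertices, edges):
    """vertices: list [x_1, ..., x_v]; edges: iterable of unordered pairs (x, y)."""
    index = {x: i for i, x in enumerate(vertices)}
    v = len(vertices)
    M = [[0] * v for _ in range(v)]
    for x, y in edges:
        i, j = index[x], index[y]
        M[i][j] = M[j][i] = 1   # m_ij = 1 exactly when x_i ~ x_j; symmetric for graphs
    return M
```

In this representation the sum of row i is the degree d(x_i), which anticipates Theorem 2.1 below.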
Theorem 2.1. In any graph or multigraph, the number of edges equals half the sum of
the degrees of the vertices.
Proof. It is convenient to work with the incidence matrix N: we sum its entries. The sum of the entries in row i is just d(x_i); the sum of the degrees is then ∑_{i=1}^{v} d(x_i), which equals the sum of the entries in N. The sum of the entries in column j is 2, since each edge is incident with two vertices; the sum over all columns is thus 2e, so that

$$\sum_{i=1}^{v} d(x_i) = 2e.$$
graph is called bipartite. More generally, the complete r-partite graph K_{n1,n2,...,nr} is a graph with vertex set V1 ∪ V2 ∪ ··· ∪ Vr, where the Vi are disjoint sets and Vi has order ni, in which xy is an edge if and only if x and y are in different sets. Any subgraph of this graph is called an r-partite graph. If n1 = n2 = ··· = nr = n, we use the abbreviation K_n^{(r)}.
also included, and (to avoid the triviality of allowing K2 to be defined as a cycle) n must be at least 3. The latter convention ensures that every Cn has n edges. Figure 2.3 shows P4 and C5.
d(v) + d(w) ≤ n − 1, a contradiction.
Corollary 2.6. If G is a graph with n vertices, n ≥ 3, and every vertex has degree at least n/2, then G is Hamiltonian.
Theorem 2.5 was first proved by Ore [138] and Corollary 2.6 some years earlier by Dirac [65]. Both can in fact be generalized into the following result of Pósa [143]: a graph with n vertices, n ≥ 3, has a Hamiltonian cycle provided the number of vertices of degree less than or equal to k does not exceed k − 1, for each k satisfying 1 ≤ k ≤ (n − 1)/2.
2.4 Trees
As we stated in the preceding section, a tree is a connected graph that contains no cycle.
Figure 2.4 contains three examples of trees. It is also clear that every path is a tree, and
the star K1,n is a tree for every n.
A tree is a minimal connected graph in the following sense: if any vertex of degree
at least 2, or any edge, is deleted, then the resulting graph is not connected.
Theorem 2.7. A connected graph is a tree if and only if every edge is a bridge.
Trees are also characterized among connected graphs by their number of edges.
Theorem 2.8. A nite connected graph G with v vertices is a tree if and only if it has
exactly v 1 edges.
From this it follows that every tree other than K1 has at least two vertices of degree 1.
(This does not hold if we allow our graphs to have infinite vertex sets: one elementary example consists of the infinitely many vertices 0, 1, 2, ..., n, ... and the edges 01, 12, 23, ..., (n, n + 1), ...; but infinite graphs will not arise in this book.)
Theorem 2.9. Suppose T is a tree with k edges and G is a graph with minimum degree
(G) k. Then G has a subgraph isomorphic to T .
A vertex v is called a start in the digraph if B(v) is empty and a finish if A(v) is empty. The indegree and outdegree of a vertex are the numbers of arcs leading into and leading away from that vertex respectively, so if multiple arcs are not allowed, then the indegree and outdegree of v equal |B(v)| and |A(v)| respectively.
Our notation is extended to directed graphs in the obvious way, so that if X and Y
are any sets of vertices of G, then [X, Y ] consists of all arcs with start in X and nish
in Y . If X or Y has only one element, it is usual to omit the set brackets in this notation.
Observe that if V is the vertex set of G, then
[v, A(v)] = [v, V ] = set of all arcs leading out of v,
[B(v), v] = [V , v] = set of all arcs leading into v.
A walk in a directed multigraph is a sequence of arcs such that the finish of one is the start of the next. (This is analogous to the definition of a walk in a graph, but takes into account the direction of each arc. Each arc must be traversed in its proper direction.) A directed path is a sequence (a0, a1, ..., an) of vertices, all different, such that a_{i−1}a_i is an arc for every i. Not every path is a directed path. If a directed path is considered as a digraph, then a0 is a start, and is unique, and an is the unique finish, so we call a0 and an the start and finish of the path. We say ai precedes aj (and aj succeeds ai) when i < j.
A directed cycle (a1, a2, ..., an) is a sequence of two or more vertices in which all of the members are distinct, each consecutive pair a_{i−1}a_i is an arc, and a_n a_1 is also an arc. (Notice that there can be a directed cycle of length 2, or digon, which is impossible in the undirected case.) A digraph is called acyclic if it contains no directed cycle.
As an example, consider the digraph of Figure 2.5. It has
A(a) = {b, c},  A(b) = ∅,  A(c) = {b, d},  A(d) = {b},
B(a) = ∅,  B(b) = {a, c, d},  B(c) = {a},  B(d) = {c};
a is a start and b is a finish. We have [{a, c}, {b, d}] = {ab, cb, cd}. There are various directed paths, such as (a, c, d, b), but no directed cycle.
Proof. Any digraph has finitely many vertices, so the sequence (a0, a1, ...) must contain repetitions. Suppose ai is repeated; say j is the smallest subscript greater than i such that ai = aj. Then (ai, ai+1, ..., aj) is a directed cycle in the digraph.
In a similar way we can prove the following lemma:
Lemma 2.11. If a digraph contains an infinite sequence of vertices (a0, a1, ...) such that ai+1ai is an arc for every i, then the digraph contains a cycle.
Theorem 2.12. Every acyclic digraph has a start and a finish.
The concept of a complete graph generalizes to the directed case in two ways. The
complete directed graph on vertex set V , denoted by DKV , has as its arcs all ordered
pairs of distinct members of V , and is uniquely determined by V . On the other hand,
one can consider all the different digraphs that can be formed by assigning directions
to the edges of the complete graph on V ; these are called tournaments.
In those cases in which a directed graph is fully determined, up to isomorphism,
by its number of vertices, notation is used that is analogous to the undirected case. The
directed path, directed cycle, and complete directed graph on v vertices are denoted by
DPv , DCv , and DKv respectively.
We shall say that vertex x is reachable from vertex y if there is a walk (and consequently a directed path) from y to x. (When x is reachable from y, some authors
say x is a descendant of y and y is an ancestor of x.) Two vertices are strongly
connected if each is reachable from the other, and a digraph (or directed multigraph)
is called strongly connected if every vertex is strongly connected to every other vertex.
For convenience, every vertex is defined to be strongly connected to itself. We shall say
that a directed graph or multigraph is connected if the underlying graph is connected,
and disconnected otherwise. However, some authors reserve the word connected for
a digraph in which given any pair of vertices x and y, either x is reachable from y or y
is reachable from x.
It is clear that strong connectivity is an equivalence relation on the vertex set of any
digraph D. The equivalence classes, and the subdigraphs induced by them, are called
the strong components of D.
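As a small illustration of these definitions, the following sketch (ours, not from the book) groups vertices into strong components by testing mutual reachability. It is a simple quadratic-time approach rather than an optimized linear-time algorithm such as Tarjan's; names are illustrative.

```python
def reachable(adj, s):
    """Set of vertices reachable from s by a directed walk (including s itself)."""
    seen, stack = {s}, [s]
    while stack:
        for w in adj.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def strong_components(vertices, adj):
    """adj: dict vertex -> list of out-neighbors. Returns a list of components."""
    reach = {v: reachable(adj, v) for v in vertices}
    components, assigned = [], set()
    for v in vertices:
        if v in assigned:
            continue
        comp = {u for u in reach[v] if v in reach[u]}  # mutually reachable with v
        components.append(comp)
        assigned |= comp
    return components
```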
Part II
3
Matching Graphs with Unique Node Labels
3.1 Introduction
In its most general form, graph matching refers to the problem of finding a mapping
f from the nodes of one given graph g1 to the nodes of another given graph g2 that
satises some constraints or optimality criteria. For example, in graph isomorphism
detection [130], mapping f is a bijection that preserves all edges and labels. In subgraph
isomorphism detection [173], mapping f has to be injective such that all edges of g1 are
included in g2 and all labels are preserved. Other graph matching problems that require
the constructions of a mapping f with particular properties are maximum common
subgraph detection [118, 129] and graph edit distance computation [131, 151].
The main problem with graph matching is its high computational complexity, which arises from the fact that it is usually very costly to find the mapping f for a pair of given graphs. It is a known fact that the detection of a subgraph isomorphism or a maximum common subgraph and the computation of graph edit distance are NP-complete problems. If the graphs in the application are small, optimal algorithms can be used. These algorithms are usually based on an exhaustive enumeration of all possible mappings f between two graphs. Sometimes application-dependent heuristics can be found that allow us to eliminate significant portions of the search space (i.e., the space of all possible functions f) while still guaranteeing that the correct, or optimal, solution is found. Such heuristics can be used in conjunction with look-ahead techniques and constraint satisfaction [51, 115, 173]. For the matching of large graphs, one needs to resort to suboptimal matching strategies. Methods of this type are characterized by an (often low-order) polynomial-time complexity, but they are no longer guaranteed to find the optimal solution for a given problem. A large variety of such suboptimal approaches have been proposed in the literature, based on a multitude of different computational paradigms. Examples include probabilistic relaxation [45, 191], genetic algorithms [54, 183], expectation maximization [122], eigenspace methods [108, 123], and quadratic programming [141].
Another possibility to overcome the problem arising from the exponential complexity of graph matching is to focus on classes of graphs with an inherently lower
computational complexity of the matching task. Some examples of such classes are
given in [91, 99, 121]. Most recently, in the field of pattern recognition and computer vision, the class of trees has received considerable attention [140, 156].
In this chapter another special class of graphs will be introduced. The graphs belonging to this class are characterized by the existence of unique node labels, which
means that each node in a graph possesses a node label that is different from all other
node labels in that graph. This condition implies that whenever two graphs are being
matched with each other, each node has at most one candidate for possible assignment
under function f in the other graph. This candidate is uniquely defined through its node
label. Consequently, the most costly step in graph matching, which is the exploration
of all possible mappings between the nodes of the two graphs under consideration, is
no longer needed. Moreover, we introduce matching algorithms for this special class
of graphs and analyze their computational complexity. Particular attention is directed
to the computation of graph isomorphism, subgraph isomorphism, maximum common
subgraph, graph edit distance, and median graph computation.
If constraints are imposed on a class of graphs, we usually lose some representational
power. The class of graphs considered in this chapter is restricted by the requirement
of each node label being unique. Despite this restriction, there exist some interesting
applications for this class of graphs. From the general point of view, graphs with unique
node labels seem to be appropriate whenever the objects from the problem domain,
which are modeled through nodes, possess properties that can be used to identify them
uniquely. In particular, the condition of unique node labels does not pose a problem when
one is dealing with graphs constructed from data collected from computer networks.
It is common in these networks that each node, such as a client, server, or router, be uniquely identified. For example, in an intranet employing Ethernet technology on a local area network (LAN) segment, either the Media Access Control (MAC) address or the Internet Protocol (IP) address could be used to uniquely identify nodes on the local segment. As a consequence, the efficient graph matching algorithms described in Theorem 3.8 can be applied to computer networks to assist in network management functions. Another
application of graphs with unique node labels is web document analysis [154].
The remainder of this chapter is organized as follows. In Section 3.2, we introduce our basic concepts and terminology. Graphs with unique node labels and related
matching strategies are discussed in Section 3.3. In Section 3.4, we present the results
of an experimental study in which the run time of some of the proposed algorithms was
measured. Finally, conclusions from this work are drawn in Section 3.5.
a special case if there exists an edge (y, x) ∈ E for every edge (x, y) ∈ E with β(x, y) = β(y, x).
Let g = (V, E, α, β) and g′ = (V′, E′, α′, β′) be graphs; g′ is a subgraph of g, written g′ ⊆ g, if V′ ⊆ V, E′ ⊆ E, α′(x) = α(x) for all x ∈ V′, and β′(x, y) = β(x, y) for all (x, y) ∈ E′. Let g″ ⊆ g and g″ ⊆ g′. Then g″ is called a common subgraph of g and g′. Furthermore, g″ is called a maximum common subgraph (notation: mcs) of g and g′ if there exists no other common subgraph of g and g′ that has more nodes or, for a given number of nodes, more edges than g″.
For graphs g and g′, a graph isomorphism is any bijection f: V → V′ such that:
(1) α(x) = α′(f(x)) for all x ∈ V; and
(2) for any edge (x, y) ∈ E, there exists an edge (f(x), f(y)) ∈ E′ with β(x, y) = β′(f(x), f(y)), and for any edge (x′, y′) ∈ E′ there exists an edge (f⁻¹(x′), f⁻¹(y′)) ∈ E with β′(x′, y′) = β(f⁻¹(x′), f⁻¹(y′)).
If f: V → V′ is a graph isomorphism between graphs g and g′, and g′ is a subgraph of another graph g″, i.e., g′ ⊆ g″, then f is called a subgraph isomorphism from g to g″.
Next we introduce the concept of graph edit distance (notation: ged), which is
based on graph edit operations. We consider six types of edit operations: substitution
of a node label, substitution of an edge label, insertion of a node, insertion of an edge,
deletion of a node, and deletion of an edge. A cost (i.e., a nonnegative real number) is
assigned to each edit operation. Let e be an edit operation and c(e) its cost. The cost of a sequence of edit operations, s = e1 ··· en, is given by the sum of all its individual costs, i.e., c(s) = ∑_{i=1}^{n} c(e_i). The edit distance d(g1, g2) of two graphs g1 and g2 is equal to the minimum cost, taken over all sequences of edit operations that transform g1 into g2. Procedures for ged computation are discussed in [131].
Finally, we introduce the median of a set of graphs [100]. Let G = {g1, ..., gN} be a set of graphs and U the set of all graphs with labels from LV and LE. The median ḡ of G is a graph that satisfies the condition

$$\sum_{i=1}^{N} d(\bar g, g_i) = \min\left\{ \sum_{i=1}^{N} d(g, g_i) \;\middle|\; g \in U \right\}.$$
It follows that the median is a graph that has the minimum average edit distance
to the graphs in set G. It is a useful concept to represent a set of graphs by a single
prototype. In many instances the median of a given set G is not unique; nor is it always
a member of G. For further details on median graphs see [100].
i.e., LV = {1, 2, 3, . . .}, or words over an alphabet that can be lexicographically ordered.
Throughout this chapter we consider graphs from this class only, unless otherwise
mentioned.
Definition 3.1. Let g = (V, E, α, β) be a graph. The label representation ρ(g) of g is given by ρ(g) = (L, C, λ), where:
(1) L = {α(x) | x ∈ V};
(2) C = {(α(x), α(y)) | (x, y) ∈ E}; and
(3) λ: C → LE with λ(α(x), α(y)) = β(x, y) for all (x, y) ∈ E.
According to this definition, the label representation of a graph g is obtained by representing each node of g by its (unique) label and dropping the set V. From the formal point of view, ρ(g) defines the equivalence class of all graphs that are isomorphic to g. The individual members of this class are obtained by assigning an arbitrary node, or more precisely an arbitrary node name, to each unique node label, i.e., to each element of L.
Example 3.2. Let LV = {1, 2, 3, 4, 5} and g = (V, E, α, β), where V = {a, b, c, d, e}, E = {(a, b), (b, e), (e, d), (d, a), (a, c), (b, c), (d, c), (e, c), (a, e), (b, d)}, α: a ↦ 1, b ↦ 2, c ↦ 5, d ↦ 4, e ↦ 3, and β: (x, y) ↦ 1 for all (x, y) ∈ E. A graphical illustration of g is shown in Figure 3.1(a), where the node names (i.e., the elements of V) appear inside the nodes and the corresponding labels outside. Because all edge labels are identical, they have been omitted. The label representation ρ(g) of g is then given by the following quantities: L = {1, 2, 3, 4, 5}, C = {(1, 2), (2, 3), (3, 4), (4, 1), (1, 5), (2, 5), (4, 5), (3, 5), (1, 3), (2, 4)}, and λ: (i, j) ↦ 1 for all (i, j) ∈ C.
Intuitively, we can interpret the label representation ρ(g) of any graph g as a graph identical to g up to the fact that all node names are left unspecified. Hence ρ(g) can be conveniently represented graphically in the same way as g is represented. For example, a graphical representation of ρ(g), where g is shown in Figure 3.1(a), is given in Figure 3.1(b).
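As an illustration, the following minimal Python sketch (ours, not the book's code) builds the label representation of Example 3.2; the function name and data layout are illustrative assumptions.

```python
def label_representation(alpha, edges, beta):
    """alpha: dict node -> unique node label; edges: iterable of node pairs (E);
       beta: dict node pair -> edge label. Returns (L, C, lam) as in Definition 3.1."""
    L = set(alpha.values())
    C = {(alpha[x], alpha[y]) for (x, y) in edges}
    lam = {(alpha[x], alpha[y]): beta[(x, y)] for (x, y) in edges}
    return L, C, lam

# Example 3.2: node names a..e with labels 1, 2, 5, 4, 3; all edge labels equal 1.
alpha = {'a': 1, 'b': 2, 'c': 5, 'd': 4, 'e': 3}
E = [('a', 'b'), ('b', 'e'), ('e', 'd'), ('d', 'a'), ('a', 'c'),
     ('b', 'c'), ('d', 'c'), ('e', 'c'), ('a', 'e'), ('b', 'd')]
beta = {e: 1 for e in E}
L, C, lam = label_representation(alpha, E, beta)   # L = {1, 2, 3, 4, 5}, etc.
```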
a cost equal to one. Hence the total cost arising from edit operations on the nodes of g1 and g2 amounts to |L1| − |L1 ∩ L2| + |L2| − |L1 ∩ L2| = |L1| + |L2| − 2|L1 ∩ L2|.
We now consider the edges. There exist |C1| − |C0| edges in g1 that do not occur in g2, and need to be deleted. Similarly, there exist |C2| − |C0| edges in g2 that do not have a counterpart in g1, and need to be inserted. Furthermore, there are two types of edges in the set C0 = C1 ∩ C2. The first type are edges (i, j) ∈ C0 for which λ1(i, j) = λ2(i, j); no edit operations are needed for edges of this kind. The second type are edges (i, j) ∈ C0 for which λ1(i, j) ≠ λ2(i, j); denote the set of such edges by C̄0. An edge substitution with cost one is needed for each such edge. Hence the total cost of edit operations on the edges of g1 and g2 is equal to |C1| − |C0| + |C2| − |C0| + |C̄0| = |C1| + |C2| − 2|C0| + |C̄0|. This concludes the proof.
Possible computational procedures for ged computation implied by Lemma 3.6 are again based on the intersection of two ordered sets. Hence, similar to all other graph matching procedures considered before, the complexity of edit distance computation for graphs with unique node labels is O(n²).
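The following is a minimal sketch (ours, not the authors' implementation; unit costs assumed, as in Lemma 3.6) of this set-intersection procedure, operating directly on label representations.

```python
def ged(L1, C1, lam1, L2, C2, lam2):
    """L*: sets of node labels; C*: sets of node-label pairs (edges);
       lam*: dicts mapping each edge in C* to its edge label."""
    C0 = C1 & C2                                     # edges occurring in both graphs
    C0_bar = {e for e in C0 if lam1[e] != lam2[e]}   # common edges with differing labels
    node_cost = len(L1) + len(L2) - 2 * len(L1 & L2)
    edge_cost = len(C1) + len(C2) - 2 * len(C0) + len(C0_bar)
    return node_cost + edge_cost
```

With hashed sets the intersections run in time roughly linear in the number of graph elements; the quadratic bound simply reflects that a graph on n nodes may have up to O(n²) edges.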
Finally, we turn to the problem of computing a graph ḡ that is the median of a set of graphs G = {g1, ..., gN} with unique node labels. In the remainder of this section we assume, for the purpose of notational convenience and without restricting generality, that all graphs under consideration are complete. That is, there is an edge (x, y) ∈ E between any pair of nodes x, y ∈ V for any considered graph g. Real edges can be easily distinguished from virtual edges by including a special null symbol in the edge label alphabet LE and defining β(x, y) = null for any virtual edge. The benefit we get from considering complete graphs is that the only edit operations needed on the edges are substitutions. In other words, any edge deletion or insertion now becomes a substitution that involves the null label. No conflicts will arise from this simplification because the costs of edge substitutions, deletions, and insertions are the same.
Let ρ(g1), ..., ρ(gN) be the label representations of g1, ..., gN. Define LU = ⋃_{i=1}^{N} Li and CU = ⋃_{i=1}^{N} Ci. Furthermore, let γ(i) be the total number of occurrences of node label i ∈ LU in L1, ..., LN. Note that 1 ≤ γ(i) ≤ N. Formally, γ(i) can be defined through the following procedure:
γ(i) = 0;
for k = 1 to N do
  if i ∈ Lk then γ(i) = γ(i) + 1
Next, we define ρ(ḡ) = (L, C, λ) such that
(1) L = {i | i ∈ LU and γ(i) ≥ N/2};
(2) C = {(i, j) | i, j ∈ L}; and
(3) λ(i, j) = max_label(i, j),
where the function max_label(i, j) returns the label λk(i, j) ∈ LE that has the maximum number of occurrences on edge (i, j) in C1, ..., CN. In case of a tie, any of the competing labels λk(i, j) may be returned.
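A minimal sketch of this construction (ours; the data layout, with Python's None standing in for the null label, is an assumption) follows.

```python
from collections import Counter

NULL = None   # stands for the special null edge label of a virtual edge

def median_graph(reps):
    """reps: list of (L_i, lam_i) pairs, where L_i is the node-label set of g_i
       and lam_i maps ordered label pairs (i, j) to edge labels (NULL when the
       edge is virtual). Returns (L, lam) describing a median graph."""
    N = len(reps)
    gamma = Counter()                       # gamma(i): occurrences of node label i
    for L_i, _ in reps:
        gamma.update(L_i)
    L = {i for i, count in gamma.items() if count >= N / 2}
    lam = {}
    for i in L:
        for j in L:
            if i == j:
                continue
            votes = Counter(lam_i.get((i, j), NULL) for _, lam_i in reps)
            lam[(i, j)] = votes.most_common(1)[0][0]   # max_label(i, j); ties arbitrary
    return L, lam
```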
Lemma 3.7. Let G and ρ(ḡ) be as above. Then any graph ḡ with label representation ρ(ḡ) is a median graph of G.
Proof. The smallest potential median graph candidate is the graph with an empty set of nodes, while the largest potential candidate corresponds to the case L = LU. The second observation is easy to verify, because any graph g′ that includes more node labels will have at least one label k that does not occur in any of the Li's. Hence the node with label k will be deleted in all of the distance computations d(g′, gi), i = 1, ..., N. Therefore dropping the node with label k from g′ will produce a graph with a smaller average edit distance to the members of G. It follows that for any median graph ḡ with node label representation ρ(ḡ), the set L must necessarily be a subset of LU.
If we substitute the expression derived in Lemma 3.6 into the definition of a median graph given in Section 3.2, we recognize that any median graph ḡ with node label representation ρ(ḡ) must minimize the following expression:

$$\Phi = N|L| + \sum_{i=1}^{N}|L_i| - 2\sum_{i=1}^{N}|L \cap L_i| + N|C| + \sum_{i=1}^{N}|C_i| - 2\sum_{i=1}^{N}|C_{0i}| + \sum_{i=1}^{N}|\bar C_{0i}|,$$

where C_{0i} denotes the edges common to ḡ and g_i with identical labels, and C̄_{0i} those with differing labels.
Note that all quantities are nonnegative integers. Since all the Li's and Ci's are given, minimization of Φ is equivalent to minimizing

$$N|L| - 2\sum_{i=1}^{N}|L \cap L_i| \;+\; N|C| - 2\sum_{i=1}^{N}|C_{0i}| + \sum_{i=1}^{N}|\bar C_{0i}|,$$

where the first two terms (Φ1) depend only on the choice of the node labels L, and the remaining terms (Φ2) only on the choice of the edges and edge labels.
To minimize Φ2, we have to minimize −2∑_{i=1}^{N}|C_{0i}| + ∑_{i=1}^{N}|C̄_{0i}|. Since |C_{0i}| + |C̄_{0i}| = |C ∩ C_i|, this is equivalent to maximizing ∑_{i=1}^{N}|C_{0i}|. However, such a maximization is exactly what is accomplished by the function max_label. This function chooses, for edge (i, j), the label that most often occurs on edge (i, j) in all the given graphs.
So far we have treated the terms Φ1 and Φ2 independently of each other. In fact, they are not independent, because the exclusion of a node with label i from the median graph implies the exclusion of all of its incident edges (i, j) and (j, i). Therefore the question arises whether this dependency can lead to an inconsistency in the minimization of Φ = Φ1 + Φ2, in the sense that decreasing Φ1 leads to an increase of Φ2 by a larger amount, and vice versa. It is easy to see that such an inconsistency can never happen. First of all, exclusion of an edge (i, j) for the sake of minimizing Φ2 does not impose any constraints on the inclusion or exclusion of either of the incident nodes i and j. Secondly, if node i is not included because γ(i) < N/2, the function max_label will surely return the null label for any edge (i, j) or (j, i). This is equivalent to not including (i, j) or (j, i) in the median graph. In other words, if a node i is not included in the median graph because γ(i) < N/2, the dependency between Φ1 and Φ2 also leads to the exclusion of all incident edges, which is exactly what is required to minimize Φ2. This concludes the proof of Lemma 3.7.
In order to derive a practical computational procedure for the computation of a median of a set of graphs with unique node labels, we need to implement the functions γ(i) and max_label(i, j). It is easy to verify that the complexities of these two functions are O(nN) and O(n²N), respectively. It follows that the median graph computation problem can be solved in O(n²N) time for graphs with unique node labels.
So far, we have assumed that there are O(n²) edges in a graph with n nodes. There are, however, applications in which the graphs are of bounded degree, i.e., the maximum number of edges incident to a node is bounded by a constant. In this case all of the expressions O(n²) reduce to O(n).
The following theorem summarizes all the results derived in this section.
Theorem 3.8. For the class of graphs with unique node labels there exist computational procedures that solve the following problems in quadratic time with respect to the number of nodes in the underlying graph:
(1) graph isomorphism;
(2) subgraph isomorphism;
(3) maximum common subgraph; and
(4) graph edit distance under the cost function introduced earlier in this section.
The median graph computation problem can be solved in O(n2 N ) time, where n is the
number of nodes in the largest graph and N is the number of given graphs.
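Under the set-based view used throughout this section, the first three procedures of Theorem 3.8 reduce to set comparisons on label representations. Here is a minimal sketch (ours, with illustrative names; each representation is a tuple (L, C, lam) as built earlier):

```python
def isomorphic(rep1, rep2):
    """Two uniquely labeled graphs are isomorphic iff their label
       representations coincide."""
    return rep1 == rep2

def subgraph_isomorphic(rep1, rep2):
    """Tests whether g1 is (isomorphic to) a subgraph of g2."""
    (L1, C1, lam1), (L2, C2, lam2) = rep1, rep2
    return (L1 <= L2 and C1 <= C2
            and all(lam1[e] == lam2[e] for e in C1))

def mcs(rep1, rep2):
    """Label representation of a maximum common subgraph."""
    (L1, C1, lam1), (L2, C2, lam2) = rep1, rep2
    L = L1 & L2
    C = {e for e in C1 & C2 if lam1[e] == lam2[e]}
    return L, C, {e: lam1[e] for e in C}
```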
isomorphism, subgraph isomorphism, mcs, and ged are measured for graphs ranging in size from hundreds of nodes to tens of thousands of nodes, and with different edge densities. In addition, we validate the linear dependency of the time taken to compute a median graph on the number of graphs from which the median is derived. Computation
times are measured for synthetic data sets and real network data. Real network data were
acquired from a link in the core of a wide area computer network. A test for similarity of
computational time measurements for real and synthetic data sets is made to verify that
results achieved for simulated networks can be repeated for real-world implementations.
An experiment was conducted to verify that the running times of the algorithms in this chapter are independent of network topology. Two graph generators were used to
produce synthetic data sets having different network topologies. The real network data
set was used as a third sample having different topology. Graphs in each data set had
to be equivalent in number of nodes and links for this test.
The hardware platform used to measure computational times was a SUN Fire V880 with four 750-MHz UltraSparc3 processors and 8 GB of RAM. The specific hardware platform used to perform the experiments is not important and has been provided for completeness only. Only relative computational times with respect to graph dimensions are important.
3.4.1 Synthetic Network Data
Synthetic data sets are used to validate the computational complexity of the procedures defined in Section 3.3. These data sets are also used to verify that the procedures are independent of network topology.
Two data sets have been produced using normally distributed random edges with edge densities of 2.5% and 10%, respectively. An edge density of 2.5% was used so that graphs with 20000 nodes could be synthesized without exceeding the memory of the computer platform used for the experiments. The data set with 10% edge density was chosen to mimic the characteristics of the real data network. The maximum number of nodes possible for graphs in this data set was 10000.
An additional single synthetic data set, having edge density of 2.5%, was created
using a Fan Chung algorithm [47]. This graph generator produced graphs having vertex
degrees with a power-law distribution. The resultant topology of graphs produced using this method is quite different from that of graphs having normally distributed random edges. In fact, power-law degree distributions are characteristic of large networks, such as the Internet [167].
For each synthetic data set we first obtain a series S of graphs, containing 100, 1000, 3000, 5000, 7000, and 10000 nodes. For data sets with an edge density of 2.5% we obtain an additional graph in the series that has 20000 nodes. The resulting graphs have directed edges with Poisson distributed edge weights. A second series S′ was produced as a counterpart, using the same procedure, for measurements of computational times for mcs and ged.
A further set of graphs was created to verify the linear increase in computational time with an increase in edge density, for a fixed number of nodes. The graph generator assigned edges using a normal distribution. For this data set, graphs had 5000 vertices
and edge densities ranging from 1% to 10% in steps of 1%. A counterpart was created
for each graph to be used for mcs and ged computations.
To compare computational times of algorithms measured for synthetic data against
real data sets, we created two randomly distributed graphs having the same number of
vertices and edges as each of the real data sets in Section 3.4.2.
Finally, for the validation of computational times for median graphs we created a
series of 100 graphs using randomly distributed edges. In this series the average number
of vertices and edge density are matched to our business domain network data set (i.e.,
comprising graphs having on average 70 vertices with edge density of 10%) as described
in Section 3.4.2.
3.4.2 Real Network Data
Real network data were acquired from a core link in a large-enterprise data network using network performance monitoring tools. The data network employs static IP addresses; hence its suitability for representation by the class of graphs defined in this chapter. Graphs were produced from traffic traversing the link at intervals of one day. This resulted in a time series of 100 graphs representing 100 days of network traffic.
Two levels of abstraction have been used to produce the time series of real network
data. Both have quite different characteristics. The first data set has graph vertices that
represent IP addresses, while the second has vertices that represent business domains.
In both data sets, edges represent logical links, and edge weights represent the total
number of bytes communicated between vertices in one day. The business domain
abstraction is created by coalescing IP addresses belonging to a predened business
entity into a single node. This resulted in graphs that contain on average 70 nodes with
edge densities of 10%. The IP network abstraction has graphs that have on average
9000 nodes with an edge density of 0.04%. The low edge density is a result of the near
bipartite nature of the graphs arising from data collected at a single point in the core of
the enterprise data network. The business domain and IP network abstractions are of interest to network administrators because they provide coarse and fine network performance data, respectively.
Two consecutive graphs were chosen from each of the real network data set abstractions, to be used in comparisons of computational times of algorithms with times
measured for synthetic data. The two graphs chosen for the business domain abstraction contained approximately 90 vertices with an edge density of 10%, while the graphs
chosen from the IP abstraction contained 9000 vertices with an edge density of 0.04%.
To verify median graph computational times the whole 100-day time series of graphs
of business domain data was used.
3.4.3 Verification of O(n²) Theoretical Computational Complexity for Isomorphism, Subgraph Isomorphism, MCS, and GED
To measure the time taken to compute a test for graph isomorphism we select the first graph g1 from S, containing one hundred unique nodes, and make an exact copy g2. The fact that g2 = g1 guarantees that the graphs tested are in fact isomorphic to each other.
The computational time measurement does not include the time taken to derive the label representations ρ(g1) and ρ(g2) for graphs g1 and g2. This is true for all computational
times measured for each algorithm. For the measurement of computational time for the
subgraph isomorphism test, we use the same graph g1 , together with graph g3 , obtained
by removing 20% of the edges from g1 . The graph g3 is obviously a subgraph of g1 .
The measurements of time to compute both mcs and ged required both graph series S and S′. To measure the time taken to execute these algorithms we again use g1 from S, and select the equivalent-size graph from S′. The procedures outlined above were
repeated for all three synthetic data sets for graph sizes 1000, 3000, 5000, 7000, 10000,
and 20000 (where present).
Table 3.1. Computational times for isomorphism.
The results of all computational time measurements are shown in Tables 3.1, 3.2,
3.3, and 3.4. As expected, the measured computational complexity of all matching
algorithms is O(n²). Figures 3.2, 3.3, 3.4, and 3.5 illustrate this observation for isomorphism, subgraph isomorphism, mcs, and ged, respectively; the x-axis corresponds to the number of nodes in a graph and the y-axis represents the time, in seconds, to
compute each graph matching algorithm. The figures show greater computational times for larger edge densities. This result was anticipated, due to the dependency of the computations on the number of graph elements. Computation times to test for graph isomorphism were the longest. Testing for subgraph isomorphism required the least time. This was a consequence of removing 20% of the edges from g1 to produce the subgraph g3.
[Figures 3.2 and 3.3: computational times (seconds) versus number of nodes for the Fan Chung 2.5%, Random 2.5%, and Random 10% data sets.]
The smaller the size of g3 with respect to g1, the shorter the time taken to compute the subgraph isomorphism. The computational times for both mcs and ged, as observed in Figures 3.4 and 3.5, are almost indistinguishable. This is not surprising, since the computational steps proposed in Lemmas 3.5 and 3.6 are nearly identical. In all cases the computational times measured for graphs with randomly distributed edges and for those with power-law degree distributions, at an edge density of 2.5%, are nearly identical. The results would be identical if the numbers of edges in the graphs from both data sets were equal. Since the graph generator used to produce graphs with randomly distributed edges creates, on average, graphs with a specified edge density, the actual number of edges can vary. The closeness of the results verifies the independence of the algorithms from network topology.
Further experimentation was performed to show the linear dependency of computational time on the number of edges in a graph with a fixed number of vertices. Figure 3.6 shows results for the four graph algorithms. Observation of these results reveals the linear relationship.
[Figures 3.4 and 3.5: computational times for mcs and ged versus number of nodes.]
Fig. 3.6. Plot of linear dependency of computational times vs. edge density, for isomorphism, subgraph isomorphism, maximum common subgraph, and graph edit distance.
3.4.4 Comparison of Computational Times for Real and Synthetic Data Sets
In this section, measurements of computational times of isomorphism, subgraph isomorphism, mcs, and ged on the two real network data sets (i.e., the business domain and IP level abstractions) and their synthetic counterparts are described. The aim was to confirm that synthetic data measurements are consistent with those measured for real network data.
network data.
The label representation is first derived for the two graphs in each data set. The isomorphism and subgraph isomorphism computations require only one of the graphs from each set. Both graphs are required for the mcs and ged computations. The results are given in Table 3.5. It can be seen that all measurements for real and equivalent synthetic data sets agree. This implies that results obtained for synthetic data are consistent with those for real data.
Table 3.5. Comparison of computational times for real and synthetic network data, for isomorphism, subgraph isomorphism, mcs, and ged.
[Figure: median graph computational times for real and synthetic data versus the number of graphs in the series.]
3.5 Conclusions
Graph matching is finding many applications in the fields of science and engineering. In this chapter we considered a special class of graphs characterized by unique node labels. A label representation is given for graphs in this class. For a given graph it contains the set of unique vertex labels of the graph, an edge set based on vertex labels, and a function assigning labels to the edges.
Fig. 3.8. Vertex and edge counts for real and synthetic data sets (frequency of occurrence versus time in days).
The vertex and edge counts for the real and synthetic data sets agree. This outcome, along with the knowledge that the algorithms are independent of network topology, means that simulation of the performance of the algorithms on synthetic data can be used to accurately predict the performance that will be achieved for real networks.
In conclusion, graph matching algorithms for uniquely labeled graphs having a label representation provide a significant computational saving compared to the general class of graphs, for which matching algorithms have an exponential computational complexity. In this chapter we have shown that for this class of graphs, matching algorithms can be applied to graphs having many thousands of nodes.
While this chapter focused on applications to network analysis, it is important to note that the class of graphs described, and the matching algorithms, can be used for any application in which graphs have a unique node labeling. In particular, there are applications in content-based web document analysis [154].
4
Graph Similarity Measures for Abnormal Change
Detection
4.1 Introduction
In managing large-enterprise data networks, the ability to measure network changes
in order to detect abnormal trends is an important performance monitoring function
[17]. The early detection of abnormal network events and trends can provide advance
warning of possible fault conditions [171], or at least assist with identifying the causes
and locations of known problems.
Network performance monitoring typically uses statistical techniques to analyze variations in traffic distribution [84, 98] or changes in topology [189]. Visualization techniques are also widely used to monitor changes in network performance [8]. To complement these approaches, specific measures of change at the network level, in both logical connectivity and traffic variations, are useful in highlighting when and where abnormal events may occur in the network [57, 59-64, 157]. Using these measures, other network management tools may then be focused on problem regions of the network for more detailed analysis.
This chapter examines various measures of network change based on the concept of graph distance. The aim is to identify whether, using these techniques, significant changes in logical connectivity or traffic distributions can be observed between large groups of users communicating over a wide area data network. This data network interconnects some 120,000 users around Australia. For the purposes of this study, a network management probe was attached to a physical link on the wide area network backbone, and traffic statistics of all data traffic operating over the link were collected. From this information a logical network of users communicating over the physical link is constructed.
Communications between user groups (business domains) within the logical network over any one day are represented as a directed graph. Edge direction indicates the direction of traffic transmitted between two adjacent nodes (business domains) in the network, with edge labels (also called edge-weights) indicating the amount of traffic carried. A subsequent graph can then describe communications within the same network on the following day. This second graph can be compared with the original graph, using a measure of distance between the two graphs to indicate the degree of change occurring in the logical network. The more dissimilar the graphs, the greater the graph distance value. By continuing network observations over subsequent days, the graph distance scores provide a trend of the logical network's relative dynamic behavior as it evolves over time.
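A minimal sketch (ours; dist stands for any of the graph distance measures examined in this chapter) of the day-to-day trend just described:

```python
def distance_series(daily_graphs, dist):
    """Distance score between each pair of consecutive daily graphs."""
    return [dist(daily_graphs[t], daily_graphs[t + 1])
            for t in range(len(daily_graphs) - 1)]

def flag_abnormal(scores, k=2.0):
    """Flags time steps whose score deviates from the mean by more than
       k standard deviations (one simple statistical criterion)."""
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return [t for t, s in enumerate(scores) if s > mean + k * std]
```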
This problem becomes one of finding good graph distance measures that are sensitive to abnormal change events but insensitive to typical variations in logical network connectivity or traffic. In addition to graph distance measures, it is also necessary to readily identify where in a network an abnormal change has occurred. This requires locating the regions in the graph that contributed most to the measured change.
The chapter is structured in the following way. Section 4.2 describes how a telecommunication system can be represented as a graph and provides details of how the network traffic was sampled. Graph distance measures suited to changes in logical network connectivity are assessed in Section 4.3, with Section 4.4 examining distance measures aimed at variations in traffic. Measures exploiting graph structures are discussed in Section 4.5, and Section 4.6 applies localization approaches to a particular abnormal event in the sampled network traffic. Concluding remarks are presented in Section 4.7.
over the physical link with edge-weight denoting the total traffic transmitted between corresponding OD pairs over a 24-hour period.
Successive log files collected over subsequent days produced a time series of corresponding directed and labeled graphs representing traffic flows between business domains communicating over the physical link in the network. Log files were collected continuously over the period 9 July 1999 to 24 December 1999. Weekends, public holidays, and days for which probe data were unavailable were removed to produce a final set of 102 log files representing the successive business days' traffic. The graph distance measures examined in this chapter produce a distance score indicating the dissimilarity between two given graphs. Successive graphs derived from the 102 log files of the network data set are compared using the various graph distance measures to produce a set of distance scores representing the change experienced in the network from one day to the next.
Operators in the network management center are interested in identifying the causes of change in traffic volume over the physical link caused by changes in user behavior on the network. Primarily, these effects are caused by the introduction of new applications or services at a local user level, and by changes in the communities of interest or user groups communicating over particular physical links. Detecting significant changes in logical connectivity is useful in tracking changes in user groups, while logical traffic variations between OD pairs assist with the identification of new applications or services consuming large amounts of network capacity. Distance measures assessing change in topology are required for measuring change in communities of interest between user groups, with traffic-based distance measures required for measurements of change to logical traffic distributions. When significant abnormal change is observed in either connectivity or traffic patterns over successive business days, operators then require identification of the regions within the logical network that contribute most to the overall change observed.
A measure of network topology change can be defined using the maximum common subgraph:

$$d(g, g') = 1 - \frac{|\mathrm{mcs}(g, g')|}{\max\{|g|, |g'|\}}, \qquad (4.1)$$
where mcs(g, g′) denotes the maximum common subgraph of g and g′, and |g| denotes the number of vertices in the graph g. The number of edges can also be used as |g|, or any other measure of problem size, in the denominator of equation (4.1) [158, 182].
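For graphs with unique node labels, |mcs(g, g′)| measured in vertices follows from the label-set intersection of Chapter 3, so a minimal sketch of the distance in equation (4.1) (ours, under that assumption) is:

```python
def mcs_distance(nodes1, nodes2):
    """nodes*: sets of unique node labels; |mcs| taken as the vertex count."""
    if not nodes1 and not nodes2:
        return 0.0
    return 1.0 - len(nodes1 & nodes2) / max(len(nodes1), len(nodes2))
```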
[Figures 4.1 and 4.2: MCS-based graph distance scores per day over the 102-day time series.]
$$d(g, g') = |V| + |V'| - 2|V \cap V'| + |E| + |E'| - 2|E \cap E'|. \qquad (4.2)$$
Clearly the edit distance, as a measure of topology change, increases with the degree of change experienced by the network over successive time intervals. The edit distance d(g, g′) is bounded below by d(g, g′) = 0 when g and g′ are isomorphic (i.e., there is no change), and bounded above by d(g, g′) = |V| + |V′| + |E| + |E′| when g ∩ g′ = ∅, the case in which the networks are completely different. Cost functions can also be designed to place greater significance on the more important vertices or edges within the graph representation of the network. It is interesting to note that this measure, ged,¹ is related to the maximum common subgraph distance metric of equation (4.1); see [20].
Results plotted in Figure 4.3 show that the edit distance produces peaks (indicating significant change) that correlate with the maximum common subgraph distance measures, in addition to indicating further possible events of interest.
¹ Note that equation (4.2) is a simplification of the formula derived in Lemma 3.6, because no edge label substitutions are considered here.
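A minimal sketch of the topology-only edit distance of equation (4.2), matching the footnote's simplification (unit costs, no edge label substitutions); ours, not the authors' code:

```python
def edit_distance(nodes1, edges1, nodes2, edges2):
    """nodes*: sets of node labels; edges*: sets of directed label pairs."""
    node_cost = len(nodes1) + len(nodes2) - 2 * len(nodes1 & nodes2)
    edge_cost = len(edges1) + len(edges2) - 2 * len(edges1 & edges2)
    return node_cost + edge_cost
```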
[Figure 4.3: edit distance scores per day over the time series.]
In the transition from day 63 to day 64, a significant amount of change was measured using both the MCS and edit distance measures. To provide an example of the degree of change measured, the logical connectivity between business domains on days 63 and 64 is shown in Figures 4.4 and 4.5 respectively. Note that all node positions are kept the same in both figures. There is clearly a significant change in connectivity, in the lower right region of the graphs, between these two days. Such changes were not evident in comparisons of other adjacent graphs in the time series with lower distance scores.
[Figure 4.6: frequency of occurrence versus the number of graphs containing the same edge.]
Figure 4.6 shows the distribution of edge occurrences in the time series of graphs over the 102 days. From this figure, a large number of edges appear in a reasonably small (≤ 10) number of graphs; these edges are the cause of the large amount of distance variation. Also, there are quite a few edges that occur consistently in a large number of the graphs; 99 graphs in the time series have approximately 100 edges in common. This further explains the three significant events observed in the previous graph distance plots, in which many of these 100 consistent edges were found missing from the graphs
of those three particular days. The distribution of vertex presence in the series of graphs is shown in Figure 4.7. Again there are a large number of dynamic vertices and also many consistent vertices in the series. Almost 30 vertices occur in all graphs of the time series. This shows a consistent group of business domains communicating daily.
The variation in traffic between corresponding edges of g and g′ can be measured by the normalized relative edge-weight difference

$$d(g, g') = \frac{1}{|E \cup E'|} \sum_{(u,v) \in E \cup E'} \frac{|w(u,v) - w'(u,v)|}{\max\{w(u,v), w'(u,v)\}}, \qquad (4.3)$$

where w and w′ denote the edge-weight functions of g and g′, with a missing edge taken to have weight zero.
[Figures 4.8 and 4.9: edge-weight distance scores per day over the time series.]
Figure 4.8 shows the comparative plot of the edge-weight distance derived from the data network observations represented as a time series of graphs. Three of the peaks, indicating increased network change, correlate with previous plots showing changes in connectivity. Clearly these logical connectivity changes involve relatively large amounts of traffic, as illustrated in Figure 4.8. Two peaks at days 48 and 53 indicate large changes in traffic distribution between business domains (vertices) without greatly affecting overall connectivity. This becomes apparent by comparing Figure 4.8 with, say, Figure 4.3 for days 48 and 53. Figure 4.9 plots the same edge-weight distance, but in this case only with edges belonging to the maximum common subgraph of the graphs being compared. This plot is useful because it shows that there was a reasonable amount of change in traffic on five of the days independent of any changes in logical connectivity. By considering together the plots of connectivity (Figures 4.1-4.3), total traffic (Figure 4.8), and traffic variations in the maximum common subgraph (Figure 4.9), it is possible to make judgements regarding the type of network changes occurring.
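A minimal sketch of the traffic distance, under the assumption that equation (4.3) takes the normalized relative edge-weight form reconstructed above (absent edges carry weight zero):

```python
def edge_weight_distance(w1, w2):
    """w1, w2: dicts mapping edge (u, v) -> traffic weight for g and g'."""
    edges = set(w1) | set(w2)                 # E union E'
    if not edges:
        return 0.0
    total = 0.0
    for e in edges:
        a, b = w1.get(e, 0.0), w2.get(e, 0.0)
        total += abs(a - b) / max(a, b)       # in [0, 1]; equals 1 if edge is absent in one graph
    return total / len(edges)
```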
$$d(g, g') = \frac{\sum_{i=1}^{k} (\lambda_i - \bar\lambda_i)^2}{\min\left\{ \sum_{i=1}^{k} \lambda_i^2, \; \sum_{j=1}^{k} \bar\lambda_j^2 \right\}}, \qquad (4.4)$$

where λ_1, ..., λ_k and λ̄_1, ..., λ̄_k are corresponding eigenvalues associated with g and g′, respectively.
[Figure: distance scores per day based on equation (4.4).]
One possible approach is to use various path sets for the definition of new graph measures. For specified vertices (network terminals) u, v ∈ V_g, one can consider the family P_k^g(u, v) of all paths (in g) of length k connecting u and v. This can be generalized to the collection P_k^g of all k-long paths in graph g (i.e., the distance-k path set of g). One could also replace the k-length paths with paths containing the minimal number of edges, or with shortest paths satisfying certain vertex- and/or edge-weight minimality. Let P^g(u, v) = ⋃_{k≥2} P_k^g(u, v), and similarly let P^g = ⋃_{k≥2} P_k^g denote the sets of all paths of length greater than or equal to 2 joining u and v, or joining any two vertices, respectively.
In communication networks, it is common for information to be passed from a source node to a destination node along a path via intermediate nodes. Therefore, if an edge is deleted from a graph and many paths contain this edge, the deletion will have a greater impact on communications than the deletion of an edge that affects only one path. It therefore seems reasonable that, in the context of communications, graph distance measures based on the number of paths containing specific edges should be more sensitive to the network changes of interest.
For a chosen (nonempty) subset of edges Ê ⊆ E, a new graph g(Ê) = (V′, E′, α′, β′) can be generated from g = (V, E, α, β) such that only those edges of g are retained that lie within paths p in P^g(·, ·) (or in P^g) containing edges in Ê. The graph g(Ê) is constructed using the following conditions:
1. V′ = V.
2. An edge e is in E′ if and only if there exists a path p ∈ P_k^g such that e ∈ p and p contains at least one edge in Ê.
3. α′ = α.
4. The edge-weight values in g(Ê) denote the number of paths in P^g containing that edge and at least one edge in Ê.
Edge attributes in g(Ê) provide an indication of edge significance, because edges contained within many paths have a high associated edge-weight value and are likely to affect communications connectivity between more vertices.
Graphs g1(Ê) and g2(Ê) can be constructed from graphs g1 and g2 respectively and compared using equation (4.3) as the basis for measuring graph distance. There are some obvious choices of the respective sets Ê as subsets of E1 and E2: either Ê = E1 and Ê = E2 respectively, or Ê can consist of edges belonging to mcs(g1, g2). Furthermore, Ê can comprise edges of the core backbone of the communication network or important logical links. In the case that the edge sets of g1 and g2 are disjoint, both graphs are treated as completely different. The resulting distance measure should be more sensitive to change that potentially impacts a greater number of nodes or users within the network. Analysis using the sampled network traffic showed no improvement in clearly identifying further change events over the previous methods. However, this approach is worth investigating further due to the potential advantages it offers.
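A minimal sketch (ours) of this construction, restricted for brevity to paths of length 2 in the spirit of the 2-hop measure discussed next: each retained edge is weighted by the number of 2-hop paths through it that contain at least one edge of Ê.

```python
from collections import defaultdict

def two_hop_projection(edges, e_hat):
    """edges: set of directed edges (u, v); e_hat: chosen edge subset (E hat).
       Returns dict edge -> number of qualifying 2-hop paths containing it."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)
    weight = defaultdict(int)
    for u, v in edges:              # first edge of a 2-hop path (u, v, x)
        for x in out[v]:
            if x == u:
                continue            # path vertices must be distinct
            e1, e2 = (u, v), (v, x)
            if e1 in e_hat or e2 in e_hat:
                weight[e1] += 1
                weight[e2] += 1
    return dict(weight)
```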
[Figure 4.11: 2-hop edit distance scores per day over the time series.]
The 2-hop edit distance has been applied to the time series of network data, producing a result showing correlation of significant events with previous distance measures (Figure 4.11). In addition, this distance measure highlights a number of secondary-level events (in terms of change magnitude) not obvious in earlier results. However, clear evidence demonstrating the significance of these events has not yet been obtained.
While similarity measures such as these are useful in determining when abnormal events have occurred that might warrant closer examination, they do not provide information describing where in the network the greatest change has occurred. For this latter problem an indication of the change distribution within the network is required.
4.6.1 Symmetric Difference
One approach to locating regions in the network most affected by topology change is
to rank vertices in the order they experienced insertions or deletions of incident edges
during the transition from one time interval to the next. Consider two graphs g and g′ representing network communications over two successive time intervals, and assume that the graph distance between g and g′ is deemed to be significant. Of particular interest is the distribution of the change during the transition from g to g′. The following method ranks all vertices in V ∪ V′ within the two graphs in ascending order of the number of network topology change events experienced by individual vertices.
Differences between graphs g and g′ can be described by a change matrix C = [c_uv] that indicates where edges have been deleted from g or inserted into g′. This matrix C has a row and column for every vertex contained within the two graphs. An edge deleted or inserted in the transition from one graph to the other is represented in the matrix by a corresponding row-column entry c_uv = 1. Indices u and v denote the respective originator and destination vertices of the deleted or inserted directed edge. Any edges (u, v) that remain incident to the same pair of vertices in both g and g′ result in the corresponding entry c_uv = 0, indicating that no change has occurred. All other entries in C also equal 0. This matrix C essentially describes the symmetric difference between graphs g and g′, where the symmetric difference g∆g′ is the graph on the vertices V ∪ V′ whose edges appear in exactly one of the graphs g and g′ [157, 188].
Respective row and column sums of C indicate the amount of change experienced
locally by the corresponding vertex. The amount of change experienced by each vertex
can be plotted in descending order to show the distribution of local change occurring in
the transition from g to g′. This is illustrated in Figure 4.12, derived from the network
data, for change occurring between days 63 and 64. Vertices contributing most to the
network change can be readily identified from the figure.
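As an illustration, a minimal Python sketch of this bookkeeping is given below; it assumes the two snapshots are available simply as collections of directed edges, and the vertex names and edges in the example are hypothetical.

    from itertools import chain

    def change_matrix(edges_g, edges_g2):
        """Change matrix C for two directed edge sets from successive intervals.
        C[u][v] = 1 marks an edge deleted from g or inserted into g';
        unchanged edges and absent edges both give 0."""
        e1, e2 = set(edges_g), set(edges_g2)
        changed = e1 ^ e2                       # symmetric difference of the edge sets
        vertices = set(chain(*e1)) | set(chain(*e2))
        C = {u: {v: 0 for v in vertices} for u in vertices}
        for (u, v) in changed:
            C[u][v] = 1
        return C, vertices

    def rank_vertices_by_change(C, vertices):
        """Row sum + column sum = change events incident to each vertex."""
        score = {u: sum(C[u].values()) + sum(C[v][u] for v in vertices)
                 for u in vertices}
        return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical two-day observation:
    g_day63 = [("a", "b"), ("b", "c"), ("c", "d")]
    g_day64 = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "a")]
    C, V = change_matrix(g_day63, g_day64)
    print(rank_vertices_by_change(C, V))   # vertex d experiences the most change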
This approach could be modified to identify those vertices experiencing the greatest
variations in traffic along incident edges by substituting the entries cuv in C with relative
differences in edge-weight value between g and g′, for example

cuv = |ω(u, v) − ω′(u, v)| / max{ω(u, v), ω′(u, v)} ,

where (u, v) ∈ E ∩ E′.
4.6.2 Vertex Neighborhoods
An alternative to the graph-symmetric difference approach just described is to measure
graph distances between corresponding vertex neighborhood subgraphs. This technique
will produce a vector of graph distance measures describing the differences between
two graphs g1 and g2. Each vector coordinate indicates the distance between g1 and
g2 from the perspective of an individual vertex and its adjacent vertices, essentially
providing a measure of change for a local region. The neighborhood subgraph of a
vertex u in g1 = (V1, E1, α1, ω1) is the subgraph g1(u) = (V1(u), E1(u), α1, ω1),
where V1(u) = N1[u] is the closed neighborhood consisting of u and its adjacent vertices,
and E1(u) = E1 ∩ (N1[u] × N1[u]) is the set of incident edges between the
vertices in N1[u]. Vertex- and edge-weight functions in g1(u) are α1 and ω1, restricted
to N1[u] and E1(u) respectively.
Successive graphs in the time series of networks can be compared using this neighborhood approach, with the corresponding neighborhood graph distances calculated
using equations (4.1), (4.2), (4.3), or (4.4). A vertex that exists in only one of the two
graphs g1 and g2 has its graph neighborhood compared with the empty neighborhood,
where N[u] = E(u) = ∅. The resulting neighborhood graph distances are stored in a
distance vector d, defined as

d = [d(g1(u), g2(u))] ,

for all u ∈ (V1 ∪ V2), where d(·, ·) is a particular graph distance measure. Those
coordinates contained within the neighborhood distance vector d with higher subgraph
distance measures correspond to the vertices u that experienced the greatest change in their
respective local regions as the network transitioned from one time interval to the next.
The distance vector coordinates can be ordered to rank vertices by the degree to which
they experienced local change.
The vertex neighborhoods defined above describe single-hop (1-hop) neighborhoods that include only the vertices adjacent to a given vertex u. It may be useful to
consider neighborhoods of 2 hops, whereby vertices adjacent to those in N[u] are also
included in the neighborhood subgraph (together with incident edges). Of course this
can be extended to the general case of k hops, as in the sketch below.
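The following Python sketch computes such a neighborhood distance vector, assuming the networkx library and undirected graphs with unique node labels; under that assumption the unit-cost edit distance has the closed form used below, and the parameter k generalizes the construction to k-hop neighborhoods.

    import networkx as nx

    def khop_neighborhood(G, u, k=1):
        """Closed k-hop neighborhood subgraph of u (empty graph if u is absent)."""
        if u not in G:
            return nx.Graph()
        nodes = nx.single_source_shortest_path_length(G, u, cutoff=k)
        return G.subgraph(nodes)

    def edit_distance_d1(g1, g2):
        """Unit-cost edit distance for graphs with unique node labels:
        |V1| + |V2| - 2|V1 ∩ V2| + |E1| + |E2| - 2|E1 ∩ E2|."""
        v1, v2 = set(g1.nodes), set(g2.nodes)
        e1 = {frozenset(e) for e in g1.edges}
        e2 = {frozenset(e) for e in g2.edges}
        return (len(v1) + len(v2) - 2 * len(v1 & v2)
                + len(e1) + len(e2) - 2 * len(e1 & e2))

    def neighborhood_distance_vector(G1, G2, k=1):
        """d = [d(g1(u), g2(u))] over all u in V1 ∪ V2, largest change first."""
        d = {u: edit_distance_d1(khop_neighborhood(G1, u, k),
                                 khop_neighborhood(G2, u, k))
             for u in set(G1.nodes) | set(G2.nodes)}
        return sorted(d.items(), key=lambda kv: kv[1], reverse=True)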
4.7 Conclusions
This chapter has examined several techniques that can be used to measure the degree
of change occurring within a data network as it evolves over time. Communication
transactions between logical network nodes, collected by a network management system
over periodic time intervals, are represented as a series of weighted graphs.
Graph distance measures are used to assess the changes in communications between user
groups (nodes) over successive time intervals, to focus a network operator's attention
on the times and regions within the logical network where abnormal events occur.
Identifying what constitutes a significant event, or what is the network's normal operational behavior, is difficult. However, drawing attention to abnormal events, when
compared with previous network observations, appears feasible and potentially very
useful. It relieves network operators from the need to continually monitor networks if
connectivity and traffic patterns can be shown to be similar to the activity over previous
time periods. Alternatively, if traffic volumes are seen to abruptly increase over the
physical link, network operators are able to more readily identify the individual user
groups contributing to this increase in aggregate traffic.
Both maximum common subgraph and edit distance measures identified significant
changes in logical connectivity, using relative and absolute measures of change respectively. The more significant events were verified by comparing graph visualizations of
days around the events and of other normal days. The edge-weight distance measure
highlighted large variations in logical traffic distributions on two days in particular that
were not affected by large changes in connectivity. These results correlate with distance measures based on analysis of the graph spectra. The use of vertex neighborhood
distance vectors and the graph-symmetric difference assists with identifying those vertices
contributing most to network change.
5
Median Graphs for Abnormal Change Detection
5.1 Introduction
In Chapter 4, abnormal network behavior was detected based on the distance of a pair
of graphs. In this chapter we derive more general procedures that use more than just
two graphs. Considering a larger number of graphs, it can be expected that abnormal
event detection procedures will exhibit more stability and robustness against random
fluctuations and noise in the underlying network.
The procedures derived in this chapter will be based on the median of a set, or a
sequence, of graphs as introduced in Chapter 4. Intuitively speaking, the median of a
sequence of graphs S = (g1, . . . , gn) is a graph that represents the given graphs gi in the
best possible manner. Using any of the graph distance measures introduced in Chapter
4, the median of a sequence of graphs S is defined as a graph that minimizes the sum
of edit distances to all members of sequence S. Formally, let U be the family of all
graphs that can be constructed using labels from LV for vertices and real numbers for
edges. Then g is a median graph of the sequence S = (g1 , . . . , gn ) if
g = arg min
gU
n
d(g, gi ) .
(5.1)
i=1
The median of a sequence of real numbers (x1, . . . , xn) has a similar property to the median of a sequence of graphs: it minimizes the sum
of distances to all elements of the given sequence, i.e., it minimizes the expression
Σ_{i=1}^{n} |x − xi| .
In Chapter 3, an efficient procedure for median graph computation was introduced.
This procedure uses the edit distance of graphs and assumes particular costs of the
underlying edit operations. Each of the edit operations node deletion, node insertion,
edge deletion, edge insertion, and edge substitution has a cost equal to one. Node label
substitution is not admissible and has infinite cost. We will use the notation d1(g1, g2)
to refer to this kind of graph edit distance. In Section 5.2 we introduce a second graph
edit distance, called d2(g1, g2), and describe a procedure for median graph computation based on d2(g1, g2). The cost function underlying d2(g1, g2) is more general than
the cost function underlying d1(g1, g2) in that it takes differences in edge weight into
account. In Section 5.3 we derive procedures for abnormal event detection in telecommunication networks using median graphs. Experimental results are reported in Section
5.4 and conclusions are drawn in Section 5.5.
Graph distance measure d2, which takes differences in edge weight into account, is defined as

d2(g1, g2) = c (|V1| + |V2| − 2|V1 ∩ V2|) + Σ_{e∈E1∩E2} |ω1(e) − ω2(e)| + Σ_{e∈E1\(E1∩E2)} ω1(e) + Σ_{e∈E2\(E1∩E2)} ω2(e) ,

where c is a constant weighting the node term.
Let graph ĝ = (V̂, Ê, ω̂) be defined as follows:

V̂ = {u | u ∈ V and ν(u) > n/2} ,
Ê = {(u, v) | u, v ∈ V̂} ,
ω̂(u, v) = median{ωi(u, v) | i = 1, . . . , n} ,

where ν(u) denotes the number of graphs in S that contain node u.

Theorem 5.1. Graph ĝ = (V̂, Ê, ω̂) is a median of set S under graph distance measure
d2.
Proof. Similarly to the proof of Lemma 3.7 we observe that the set of nodes potentially
included in V̂ is bounded from above and below by ∪_{i=1}^{n} Vi and ∅, respectively. The
quantity to be minimized is

Λ = Λnodes + Λedges = c (n|V̂| + Σ_{i=1}^{n} |Vi| − 2 Σ_{i=1}^{n} |V̂ ∩ Vi|) + Σ_{(u,v)∈Ê} Σ_{i=1}^{n} |ω̂(u, v) − ωi(u, v)| ,

where Λnodes denotes the first and

Λedges = Σ_{(u,v)∈Ê} Σ_{i=1}^{n} |ω̂(u, v) − ωi(u, v)|

the second of the two terms.
Clearly, Λnodes is minimized by the same procedure used in Lemma 3.7, i.e., we
include a node u ∈ V in V̂ if and only if ν(u) > n/2. For the minimization
of Λedges we use the property of the median of an ordered sequence of real numbers cited in Section 5.1. That is, Λedges is minimized by assigning to each edge
(u, v) ∈ Ê the median of the weights of the edge (u, v) in E1, . . . , En. Formally,
ω̂(u, v) = median(ωi1(u, v), . . . , ωin(u, v)), where ωi1(u, v), . . . , ωin(u, v) is the ordered sequence of ω1(u, v), . . . , ωn(u, v).
Similarly to Lemma 3.7, the values of Λnodes and Λedges are not independent of each
other, because the exclusion of a node u from V̂ implies the exclusion of all edges (u, v)
and (v, u) in Ê, which means ω̂(u, v) = 0 or ω̂(v, u) = 0 for those edges. We observe
that median(ωi(u, v) | i = 1, . . . , n) = 0 and median(ωi(v, u) | i = 1, . . . , n) = 0
if ν(u) ≤ n/2 or ν(v) ≤ n/2. Hence none of the edges incident to u will ever be
a candidate for inclusion in Ê during the minimization of Λedges. This concludes the
proof.
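The construction of Theorem 5.1 is straightforward to implement. The sketch below assumes each graph is given as a set of node identifiers plus a dictionary of directed edge weights; this representation, and the convention that a missing edge has weight 0, are assumptions of the sketch.

    from statistics import median

    def median_graph_d2(graphs, all_weights):
        """Median of a sequence of weighted graphs under d2 (Theorem 5.1):
        keep a node iff it occurs in more than n/2 of the graphs; weight
        each surviving edge with the median of its weights, taking weight
        0 where the edge is absent.

        graphs:      list of node-identifier sets, one per graph
        all_weights: list of dicts {(u, v): weight}, one per graph
        Edges whose median weight is 0 are omitted (a 0-weight edge
        equals a missing edge)."""
        n = len(graphs)
        count = {}
        for nodes in graphs:
            for u in nodes:
                count[u] = count.get(u, 0) + 1
        v_hat = {u for u, c in count.items() if c > n / 2}

        w_hat = {}
        for u in v_hat:
            for v in v_hat:
                w = median(W.get((u, v), 0) for W in all_weights)
                if w != 0:
                    w_hat[(u, v)] = w
        return v_hat, w_hat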
Comparing the median graph construction procedure described in Lemma 3.7 with
the one introduced in this chapter, we notice that the former is a special case of the
latter, constraining edge weights to assume only binary values. Edge weight zero (or
one) indicates the absence (or presence) of an edge. Including an edge (u, v) in the
Fig. 5.1. Three graphs (g1, g2, g3).
median graph ĝ because it occurs in more than n/2 of the given graphs is equivalent to
labeling that edge in ĝ with the median of the weights assigned to it in the given graphs.
The median of a set of numbers according to equation (5.2) is unique. Hence in constructing a median graph under graph distance measure d2, there will be no ambiguity
in edge weight. But the inclusion of a node in the median graph is in general not unique.
We conclude this section with an example of a median graph under graph distance d2.
Three different graphs g1, g2, and g3 are shown in Figure 5.1. Their median is unique
and is displayed in Figure 5.2. Moreover, we observe that Λ = c + 14.
[Figure 5.2: the median graph ĝ of the three graphs in Figure 5.1.]
may lead to a graph distance larger than the chosen threshold, though these changes are
not really significant.
One expects that a more robust change detection procedure would be obtained
through the use of median graphs. In statistical signal processing the median filter is
widely used for removing impulsive noise. A median filter is computed by sliding a
window of length L over the data values in a time series. At each step the output of this
process is the median of the values within the window. This process is also termed the
running median [3]. In the following we discuss four different approaches to abnormal
change detection that utilize median filters. All these approaches assume that a time
series of graphs (g1, . . . , gn, gn+1, . . .) is given. The median graph of a subsequence
of these graphs can be computed using either graph distance measure d1 or d2.
5.3.1 Median vs. Single Graph, Adjacent in Time (msa)
Given the time series of graphs, we compute the median graph in a window of length L,
where L is a parameter that is to be specified by the user, depending on the underlying
application. Let ĝn be the median of the sequence (gn−L+1, . . . , gn). Then d(ĝn, gn+1)
can be used to measure the abnormal network change. We classify the change between
gn and gn+1 as abnormal if d(ĝn, gn+1) is larger than some threshold.
Increased robustness can be expected if we take the average deviation of the graphs
(gn−L+1, . . . , gn) into account. We compute

φ = (1/L) Σ_{i=n−L+1}^{n} d(ĝn, gi) ,   (5.3)

and classify the change between gn and gn+1 as abnormal if

d(ĝn, gn+1) ≥ α φ ,   (5.4)

where α is a parameter that needs to be determined from examples of normal and abnormal network change. Note that the median ĝn is by definition a graph that minimizes φ
in equation (5.3).
Earlier in this chapter it was pointed out that ĝn is not necessarily unique. If several instances ĝn1, . . . , ĝnt of ĝn exist, one can apply equations (5.3) and (5.4) to
all of them. This will result in a series of values φ1, . . . , φt and a series of values
d(ĝn1, gn+1), . . . , d(ĝnt, gn+1). Under a conservative scheme, an abnormal change will
be reported if

d(ĝn1, gn+1) ≥ α φ1 ∧ . . . ∧ d(ĝnt, gn+1) ≥ α φt .

By contrast, a more sensitive change detector is obtained if a change is reported as soon
as there exists at least one i for which

d(ĝni, gn+1) ≥ α φi ,   1 ≤ i ≤ t .
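A compact Python sketch of the msa detector follows, with the graph distance (d1 or d2) and the median graph routine supplied as plug-in functions (for instance the sketches given earlier, modulo the graph representation):

    def msa_detector(series, dist, median_of, L=5, alpha=2.0):
        """Median vs. single graph, adjacent in time (msa) - a sketch.

        series:    time series of graphs (any representation)
        dist:      graph distance function, e.g., d1 or d2
        median_of: function returning a median graph of a list of graphs
        Flags step n+1 as abnormal when d(median, g_{n+1}) >= alpha * phi,
        where phi is the average deviation within the median window
        (equations (5.3) and (5.4))."""
        alarms = []
        for n in range(L - 1, len(series) - 1):
            window = series[n - L + 1 : n + 1]
            g_med = median_of(window)
            phi = sum(dist(g_med, g) for g in window) / L
            score = dist(g_med, series[n + 1])
            if score >= alpha * phi:
                alarms.append((n + 1, score, phi))
        return alarms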
[Figure 5.3: edit distance d1 between consecutive graphs in the time series (edit distance, d1 vs. time in days).]
Figure 5.4 shows the results for distance measure d2 applied to consecutive
graphs in the time series. While this measure considers both topology and traffic, it
does not appear to provide any indicators of change additional to those
found using the topology-only measure. The main difference with this measure is that
it has increased the amplitude of the first two peaks so that they now appear as major
peaks. This suggests that these peaks were the result of network change consisting of
large changes in edge weights.
The findings for this technique indicate that the application of measures to consecutive graphs in a time series is most suited to detecting daily fluctuations in network
behavior. This is especially useful for identifying outliers in a time series of graphs.
1 Figure 5.3 is identical to Figure 4.3 but is shown here again for easier comparison.
[Figure 5.4: edit distance d2 between consecutive graphs in the time series (edit distance, d2 vs. time in days).]
5.4.2 Edit Distance and Median Graph vs. Single Graph Adjacent in Time
(msa)
The results computed for msa used a median graph constructed from L = 5 consecutive
graphs, with α = 2 used to compute the threshold of significant network change. Figures
5.5 and 5.6 show msa results for topology and topology plus traffic, respectively.
[Figure 5.5: msa results using d1 (edit distance, d1 vs. time in days).]
In Figure 5.5 there are now three additional major peaks identified, on days 8, 12,
and 32, compared to those observed in ssa. The peaks occurring on days 8 and 12
have been accentuated using msa. This method also shows that there was significant
change on days 6, 7, 8, 12, 21, 22, 23, 32, 49, 56, 62, 64, 65, 86, 88, and 90, based on
the threshold using α = 2. Raising the threshold (i.e., α = 3) reduces
the number of days on which significant change was detected to three. Here significant
change occurred on days 56, 65, and 90. Interestingly enough, day 56 corresponded
to a minor peak, yet has been deemed significant based on the deviations observed
within the preceding median graph window. Finally, from Figure 5.5 it is evident that
the network underwent a gradual transition from considerable daily network change
to less daily change between days 1 and 60.
[Figure 5.6: msa results using d2 (edit distance, d2 vs. time in days).]
The results presented in Figure 5.6 show the effect of introducing edge weight. While
the peaks here occur at the same times as in the unweighted experiment, their amplitudes
do differ somewhat. The amplitudes of the peaks on days 22, 65, and 90 have been reduced
relative to the two peaks at the start of the time series. This suggests that these changes
are more influenced by changes in topology. The number of significant events detected
is noticeably smaller for the same value of α. Significant change was detected on four
occasions: days 21, 56, 65, and 90. The inclusion of edge weight has resulted in an
increase in average deviation within the median graph window, which is the result of
large traffic variation on edges. This had the overall effect of raising the threshold
for detection of significant change.
This method is particularly useful for detection of significant events with improved
robustness to noise. Increasing the length of the median graph window results in a
greater smoothing effect.
5.4.3 Edit Distance and Median Graph vs. Median Graph Adjacent in Time
(mma)
The mma procedure computes an edit distance between two adjacent median graphs in
the time series. In this experiment a median graph window length of L1 = L2 = 5
was used. There is no reason why the window lengths must be equal for both median
graphs. A value of α = 2 was used.
Results obtained using d1 are shown in Figure 5.7. In this figure there are only
two major peaks, one occurring at the start of the time series and the other around day 65.
Significant change was detected around day 7 and on days 12 and 46. Surprisingly,
significant change was not detected around day 65. This is most likely due to the
increased smoothing effect, which would have reduced the effect of the large outlier
[Figure 5.7: mma results using d1 (edit distance, d1 vs. time in days).]
occurring on day 65. Significant change is, however, detected on day 65 with a value
of α below 1.8. The mma procedure improves the observation of network change
decreasing slowly between days 1 and 60.
[Figure 5.8: mma results using d2 (edit distance, d2 vs. time in days).]
When distance measure d2 is applied in the mma procedure, the results shown in
Figure 5.8 are obtained. Here the results are very similar to those attained using measure
d1. The main point to note is that there are no points identified as exhibiting significant
change with a value of α = 2. This again indicates that the inclusion of edge weight has
increased the average deviation within the median graph window. The value of α must
be reduced below 1.5 to get the same results as those of measure d1.
The mma procedure seems particularly useful for providing an additional
smoothing effect that eliminates the effects caused by large outliers. This makes it more
robust to noise than msa.
5.4.4 Edit Distance and Median Graph vs. Single Graph Distant in Time (msd)
The experiment involving the msd procedure used a median graph window of length
L = 5 with an offset of l = 10 graphs between the median graph and the second graph
used in the distance computation. Values of α = 3 for d1 and α = 2 for d2 were used in
the detection of significant change. The larger value of α for d1 was required to reduce
the unusually large number of significant events detected with α = 2. Figures 5.9
and 5.10 only have outputs at points relating to the graph that is distant in time. Thus
the starting point of a trace depends on both the size L of the median graph window
and the offset period l between graphs.
[Figure 5.9: msd results using d1 (edit distance, d1 vs. time in days).]
Figure 5.9 shows the results achieved for measure d1. This figure shows three major
peaks, on days 23, 65, and 90. The main point of interest in this result is the cluster
of indicators of significant change between days 64 and 74. This suggests that
a step change in network behavior took place on day 64. This characteristic
could not clearly be observed using ssa, msa, or mma. It is also important to note
that the indicators of significant change were based on a threshold of three times the
average deviation occurring within the median graph window, indicating a large change
in network behavior.
Figure 5.10 shows the results achieved for measure d2 . Again there was quite a
degree of similarity with the results achieved using measure d1 . The main difference
was the relative increase in the peak occurring on day 90.
While the msd procedure would be better suited to a data set exhibiting gradual
change, it was crucial in highlighting a step change in network behavior.
5.4.5 Edit Distance and Median Graph vs. Median Graph Distant in Time
(mmd)
Median graph windows of length L1 = L2 = 5 were used in the mmd procedure,
with an offset of l = 10 graphs between the two median graphs used in the distance
computation. A value of α = 3 was used for d1 and α = 2 for d2.
[Figure 5.10: msd results using d2 (edit distance, d2 vs. time in days).]
The results for mmd can be seen in Figures 5.11 and 5.12, using measures d1 and d2,
respectively. The two figures are very similar to one another, with peaks and significant
change detected in the same periods. Compared with the msd results, the mmd procedure
provides additional smoothing, giving greater robustness to noise and large outliers.
This can be seen in the relative fall in amplitude of the peak on day 65 and the near
elimination of the peak on day 90. Like the msd procedure, mmd is also capable of
showing the step change in network behavior commencing on day 64.
5.5 Conclusions
In this chapter the use of the median graph for detecting abnormal change in data
networks is proposed. The median of a sequence of graphs S is a graph that minimizes
the average edit distance to all members of S. Thus the median can be regarded as the
best single representative of a sequence of graphs. Abnormal change in a network can
be detected by computing the median of a time series of graphs over a window of a given
size and comparing it with a single graph, or another median graph, adjacent or distant in time.
[Figure 5.12: mmd results using d2 (edit distance, d2 vs. time in days).]
6
Graph Clustering for Abnormal Change Detection
6.1 Introduction
Graph similarity measures, including graph edit distance, and their application to the
detection of abnormal change in telecommunication networks were introduced in Chapter 4. In Section 3.3, the median of a set of graphs was studied. This concept, too, has
proven useful to measure change in communication networks and to detect unusual
behavior. Recently, graph clustering based on graph edit distance has been proposed
[78]. In this chapter, some known clustering algorithms will be reviewed first. Then the
potential of graph clustering for analyzing time series of graphs and telecommunication
networks will be studied.
Clustering is the process of dividing a set of objects into groups, or clusters, such
that similar objects are assigned to the same cluster, while dissimilar objects are
put into different clusters. Clustering is a fundamental technique in the whole discipline
of computer science. However, almost all clustering algorithms published to date
are based on object representations in terms of feature vectors. Only very few papers
address the clustering of symbolic data structures, particularly the clustering of graphs.
For a general introduction to clustering see [66, 96]. Graph clustering was addressed
in [155]. Recently an extension of self-organizing maps [106] for graph clustering was
proposed in [78]. This method will be reviewed in greater detail in Section 6.2.2.
The meaning of the term graph clustering in the literature is not unique. In [149]
the identification of groups of nodes within the same graph is called graph clustering,
while in [78, 155] the term is used to denote procedures that cluster graphs rather than
feature vectors. Throughout this chapter, graph clustering is to be interpreted in the
latter sense.
The rest of this chapter is organized as follows. In the next section the most important
clustering algorithms for the domain of feature vectors will be reviewed. In Section 6.3
techniques will be introduced that allow us to extend the algorithms presented in Section
6.2 into the graph domain. The application of the resulting graph clustering algorithms
to time series of graphs and the detection of abnormal change in telecommunication
networks will be discussed in Section 6.4. Finally, conclusions will be drawn in Section 6.5.
input:  X = {x1, . . . , xM}
output: a hierarchical partitioning of set X, i.e., a sequence R0,
        R1, . . . , RM−1, where each Ri is a partition of set X, and
        Ri is finer than Ri+1
begin
    R0 = {c1 = {x1}, c2 = {x2}, . . . , cM = {xM}};
    t = 0;
    repeat
        t = t + 1;
        put each cluster ci from Rt−1 into Rt;
        find among all pairs of clusters in Rt the one with
        minimum distance, i.e., find (ci, cj) such that
        d(ci, cj) = min{d(cr, cs) | cr ≠ cs; cr, cs ∈ Rt};
        /* ties are broken arbitrarily */
        generate a new cluster cnew = ci ∪ cj;
        remove ci and cj from Rt;
        add cnew to Rt
    until all xi belong to the same cluster
end

Fig. 6.1. Bottom-up hierarchical clustering algorithm.
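A runnable Python version of this procedure might look as follows; the object distance is supplied by the caller, the cluster distance defaults to single linkage, and the 1-D example data are hypothetical.

    def hierarchical_clustering(X, dist, cluster_dist=min):
        """Bottom-up hierarchical clustering following Figure 6.1.

        X:            list of objects
        dist:         pairwise distance between objects
        cluster_dist: min -> single linkage, max -> complete linkage
        Returns the sequence R0, ..., R_{M-1} of successively coarser
        partitions (the dendrogram levels)."""
        R = [frozenset([x]) for x in X]          # R0: singleton clusters
        history = [list(R)]
        while len(R) > 1:
            # find the pair of clusters with minimum distance (ties arbitrary)
            ci, cj = min(
                ((a, b) for i, a in enumerate(R) for b in R[i + 1:]),
                key=lambda p: cluster_dist(dist(x, y) for x in p[0] for y in p[1]))
            R = [c for c in R if c not in (ci, cj)] + [ci | cj]
            history.append(list(R))
        return history

    # Hypothetical 1-D example:
    pts = [1.0, 1.2, 4.0, 4.1, 9.0]
    for level in hierarchical_clustering(pts, dist=lambda a, b: abs(a - b)):
        print([sorted(c) for c in level])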
Example 6.1. Assume that set X consists of the seven points A, B, . . . , G in the x-y
plane shown in Figure 6.2. The dendrogram resulting from the hierarchical clustering
algorithm is depicted in Figure 6.3. First, objects B and C are merged into one cluster.
Next D and E, and then F and G are merged. In the next step the cluster consisting
of B and C is merged with object A, and so on. Finally, one cluster is obtained that
includes all seven objects.
One important detail needed for the implementation of the hierarchical clustering
algorithm is the definition of the distance of clusters, d(ci, cj). The three most popular
distance functions are the following:

single-linkage distance:
d(ci, cj) = min{d(x, y) | x ∈ ci, y ∈ cj}.
The distance of two clusters is equal to the distance of the two closest representatives,
one from ci and the other from cj.

complete-linkage distance:
d(ci, cj) = max{d(x, y) | x ∈ ci, y ∈ cj}.
Here the representatives that have the largest distance define the distance of two
clusters.
average distance:
d(ci, cj) = (1 / (|ci| |cj|)) Σ_{x∈ci} Σ_{y∈cj} d(x, y).
This is the average distance of two elements that belong to different clusters.
It has been reported by several authors that the complete-linkage distance generally leads
to compact clusters, while the single-linkage distance has a tendency to produce rather
diffuse clusters. The behavior of the average distance lies somewhere between complete- and single-linkage. Another possibility is to represent each cluster by its mean or its
median and to use the distance between those as the distance measure for clusters.
Fig. 6.2. A clustering example. (The clusters correspond to the horizontal dashed line in Figure 6.3).
A dendrogram, such as the one shown in Figure 6.3, is an excellent tool to visualize
the structure of the given objects. The dendrogram can also be used to partition set X
into a given number of clusters. For this purpose one needs only to split the dendrogram
at a certain height. For an example see the horizontal dashed line in Figure 6.3, which splits
the data into three clusters, namely {A, B, C}, {D, E}, and {F, G}.
Usually one needs to reorder the elements of set X in a dendrogram in order to
achieve a clean graphical representation in which branches of the tree don't cross each
other. Obviously, such a reordering is always possible, regardless of the number of
elements in set X.
The algorithm in Figure 6.1 is a bottom-up hierarchical procedure. It starts with
singleton clusters and successively builds larger sets. It is also possible to start with the
full set X and split it recursively into subsets. Such an algorithm is called top-down. A
closer examination reveals that the splitting criteria needed in top-down clustering are
often more difficult to define than the merging criteria required in bottom-up clustering.
Therefore, bottom-up approaches are much more popular than hierarchical top-down
clustering.
1st iteration:
x1 → m1; x2, x3, x4 → m2;
m1 = 1; m2 = 3.66;
2nd iteration:
x1, x2 → m1; x3, x4 → m2;
m1 = 1.5; m2 = 4.5;
3rd and following iterations:
no more change, i.e., c1 = {x1, x2}, c2 = {x3, x4}.
Example 6.3. For this example we use again Figure 6.2. Assume k = 3 and the initial
cluster centers are A, D, and F . Then the result of the k-means algorithm is the one
shown in Figure 6.2. However, if we select A, B, and C as initial cluster centers, we
will obtain the clustering shown in Figure 6.5.
From this example, it becomes clear that the result of the k-means clustering algorithm critically depends on the choice of the initial cluster centers. Therefore, often
a number of runs of the complete algorithm are executed, each with a different set of
initial cluster centers. Finally, the best result is chosen.
To measure the quality of a clustering produced by the k-means algorithm, the sum
E of all quadratic distances from the cluster centers can be used. Let mj be the center
of cluster cj. Then we define

ej = Σ_{x∈cj} d²(mj, x)

and

E = Σ_{j=1}^{k} ej .
It has been proposed to add some postprocessing steps to the results generated by the
k-means algorithm. Possible postprocessing operations include the merging of a pair
of (small) clusters that have a small distance, or the splitting of clusters that have many
elements and/or large variance.
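For concreteness, the following Python sketch implements classical k-means together with the quality measure E; the 1-D data set {1, 2, 4, 5} is chosen to be consistent with the centers appearing in the iteration trace above, and the initialization strategy is an assumption.

    import random

    def kmeans(X, k, dist, mean, iters=100, seed=0):
        """Classical k-means - a sketch.

        X:    list of data points
        dist: distance between two points
        mean: function computing the center of a list of points
        Returns (clusters, centers, E), with E the sum of squared
        distances from the cluster centers."""
        rng = random.Random(seed)
        centers = rng.sample(X, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for x in X:                      # assign each point to nearest center
                j = min(range(k), key=lambda j: dist(x, centers[j]))
                clusters[j].append(x)
            new_centers = [mean(c) if c else centers[j]
                           for j, c in enumerate(clusters)]
            if new_centers == centers:       # converged
                break
            centers = new_centers
        E = sum(dist(centers[j], x) ** 2 for j in range(k) for x in clusters[j])
        return clusters, centers, E

    X = [1.0, 2.0, 4.0, 5.0]
    print(kmeans(X, 2, dist=lambda a, b: abs(a - b),
                 mean=lambda c: sum(c) / len(c)))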
Self-organizing maps (SOMs) are a popular method in information processing that
is inspired by biological systems [106]. They can be used for various purposes. In this
chapter, we consider SOMs exclusively for the purpose of clustering.
A pseudocode description of the classical SOM algorithm is given in Figure 6.6.
Given a set of patterns X, the algorithm returns a prototype yi for each cluster ci. The
prototypes are sometimes called neurons. The number of clusters, k, is a parameter
that must be provided a priori. In the algorithm, first each prototype yi is randomly
initialized (line 4). In the main loop (lines 5-10) one randomly selects an element
x ∈ X and determines the neuron y* that is nearest to x. In the inner loop (lines 7,
8) one considers all neurons y that are within a neighborhood N(y*) of y*, including
y*, and updates them according to the formula in line 8. The effect of neuron updating
is to move neuron y closer to pattern x. The degree by which y is moved toward x
is controlled by the parameter α, which is called the learning rate. It has to be noted
that α is dependent on the distance between y and y*, i.e., the smaller this distance
is, the larger is the change to neuron y. After each iteration through the repeat-loop,
the learning rate is reduced by a small amount, thus facilitating convergence of the
algorithm. It can be expected that after a sufficient number of iterations the yi's have
moved into areas where many xj's are concentrated. Hence each yi can be regarded as
a cluster center. The cluster around center yi consists of exactly those patterns that have
yi as their closest neuron.
SOM clustering is in fact similar to the k-means clustering algorithm. The main
difference between the two algorithms is that SOM is incremental in nature, while k-means is batch-oriented. That is, under k-means cluster centers are updated only after a
complete cycle through all x ∈ X has been conducted, while in SOM cluster centers are
updated immediately after the presentation of each individual object x ∈ X. Similarly to
k-means, SOM clustering critically depends on a good initialization strategy.
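A small Python sketch of SOM-based clustering for scalar data follows. The neighborhood scheme (the winner and its index neighbors on a chain of neurons, updated at a halved rate) and all parameter values are illustrative assumptions, not the exact algorithm of Figure 6.6.

    import random

    def som_cluster(X, k, dist, move, iters=500, alpha0=0.5, seed=0):
        """Self-organizing map used purely for clustering - a sketch.

        dist: distance between a pattern and a neuron
        move: move(y, x, alpha) shifts neuron y toward pattern x."""
        rng = random.Random(seed)
        neurons = rng.sample(X, k)             # random initialization
        alpha = alpha0
        for _ in range(iters):
            x = rng.choice(X)
            w = min(range(k), key=lambda i: dist(neurons[i], x))  # winner
            for i in (w - 1, w, w + 1):        # neighborhood N(y*)
                if 0 <= i < k:
                    rate = alpha if i == w else alpha / 2
                    neurons[i] = move(neurons[i], x, rate)
            alpha *= 0.995                     # slowly reduce the learning rate
        # each pattern joins the cluster of its nearest neuron
        clusters = [[] for _ in range(k)]
        for x in X:
            clusters[min(range(k), key=lambda i: dist(neurons[i], x))].append(x)
        return neurons, clusters

    data = [1.0, 1.1, 4.0, 4.2, 9.0, 9.3]
    print(som_cluster(data, 3,
                      dist=lambda y, x: abs(y - x),
                      move=lambda y, x, a: y + a * (x - y)))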
Fig. 6.7. Example of the Dunn index: (a) a clustering with a small value of D; (b) a clustering with a
large value of D; dmin = min{d(ci, cj) | i, j = 1, 2, 3}.
6.2.3 Cluster Validation Indices
A cluster validation index is a function that measures the quality of a given clustering.
Hence, given a cluster validation index and a clustering algorithm, such as k-means
or SOM, one can execute the clustering algorithm a number of times, specifying a
different number of clusters to be produced in each run, and finally select the clustering
that yields the optimal value of the cluster validation index.
At first glance, the sum E of all quadratic distances defined in Section 6.2.2 may
look like a suitable cluster validation index. It turns out, however, that this measure
assumes its minimum value E = 0 for k = M, i.e., for the case in which each cluster
consists of just a single element. A number of more appropriate indices have been proposed in the literature [55, 66, 70, 76, 96]. Some of them are reviewed below.
Dunn Index
Let d(ci, cj) denote the distance of clusters ci and cj. For this function any of the
measures discussed in Section 6.2.1 can be used, i.e., single-linkage, complete-linkage,
average distance, as well as the distance between the means or the medians of ci and cj.
Furthermore, let Δ(ci) denote the maximum distance within cluster ci, i.e.,

Δ(ci) = max{d(x, y) | x, y ∈ ci},

and let Δmax be the maximum within-cluster distance taken over all clusters, i.e.,

Δmax = max{Δ(ci) | i = 1, . . . , k}.

Then the Dunn index D is defined as

D = (1 / Δmax) min{d(ci, cj) | i, j = 1, . . . , k; i ≠ j}.

The Dunn index D considers the distance of the two nearest clusters in relation to
the largest distance within a single cluster. Clearly, the larger D is, the better is the
considered clustering. For a graphical illustration see Figure 6.7. It is easy to verify that
D ∈ [0, ∞).
This index is easy to compute. Its main shortcoming, however, is the fact that it is
vulnerable to outliers, i.e., if only a single outlier is added to the data, the value of the
index may drastically change.
Davies-Bouldin Index
Let mi be the center of cluster ci, i = 1, . . . , k. The average distance of the elements
xl ∈ ci to mi is given by

di = (1 / |ci|) Σ_{xl∈ci} d(xl, mi).

First we define

Rij = Rji = (di + dj) / d(mi, mj).
Fig. 6.8. Example of the Davies-Bouldin index: the smaller the di's are and the further the mj's are
away from each other, the smaller is DB, i.e., the better is the clustering. In the figure, the notation
dij = d(mi, mj) is used.
Next we are interested, for cluster ci, in the worst case, i.e., in the cluster cj that
yields the maximum value of Rij. Hence we define

Ri = max{Rij | j = 1, . . . , k; j ≠ i}.

Finally the Davies-Bouldin index DB is defined as the average of the Ri's taken over
all clusters, i.e.,

DB = (1/k) Σ_{i=1}^{k} Ri .

Clearly, the smaller the value of DB is, the better the clustering. It is easy to see that
DB ∈ [0, ∞).
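Both indices are easy to compute once a distance function is available. In the sketch below the Dunn index uses the single-linkage cluster distance (any of the measures above would do), and the example data are hypothetical.

    def dunn_index(clusters, dist):
        """Dunn index: distance of the two nearest clusters divided by the
        largest within-cluster distance; larger is better."""
        d_min = min(dist(x, y)
                    for i, ci in enumerate(clusters) for cj in clusters[i + 1:]
                    for x in ci for y in cj)
        delta_max = max(dist(x, y) for c in clusters for x in c for y in c)
        return d_min / delta_max

    def davies_bouldin(clusters, dist, mean):
        """Davies-Bouldin index: average worst-case ratio Rij; smaller is better."""
        m = [mean(c) for c in clusters]
        d = [sum(dist(x, m[i]) for x in c) / len(c)
             for i, c in enumerate(clusters)]
        k = len(clusters)
        R = [max((d[i] + d[j]) / dist(m[i], m[j]) for j in range(k) if j != i)
             for i in range(k)]
        return sum(R) / k

    cl = [[1.0, 1.2], [4.0, 4.4], [9.0, 9.5]]
    e = lambda a, b: abs(a - b)
    print(dunn_index(cl, e), davies_bouldin(cl, e, lambda c: sum(c) / len(c)))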
Goodman-Kruskal Index
We define

δ(xi, xj) = 1 if xi and xj belong to different clusters,
δ(xi, xj) = 0 if xi and xj belong to the same cluster.

To compute the index, one considers all quadruples (xi, xj, xr, xs) where xi ≠ xj, xr ≠
xs, (xi, xj) ≠ (xr, xs). For each such quadruple, the quantities d(xi, xj), δ(xi, xj),
d(xr, xs), and δ(xr, xs) are computed. A quadruple is called concordant if either

d(xi, xj) < d(xr, xs) ∧ δ(xi, xj) < δ(xr, xs)
or
d(xi, xj) > d(xr, xs) ∧ δ(xi, xj) > δ(xr, xs).

By contrast, a quadruple is called discordant if either

d(xi, xj) < d(xr, xs) ∧ δ(xi, xj) > δ(xr, xs)
or
d(xi, xj) > d(xr, xs) ∧ δ(xi, xj) < δ(xr, xs).

Notice that there are usually quadruples that are neither concordant nor discordant, for
example if δ(xi, xj) = δ(xr, xs).

Fig. 6.9. (a) Example of a concordant quadruple (xi, xj, xr, xs); (b) example of a discordant
quadruple (xi, xj, xr, xs).

Let S+ denote the number of concordant and S− the number of discordant quadruples. The Goodman-Kruskal index is then defined as

(S+ − S−) / (S+ + S−) .
In fuzzy clustering, each element xi is assigned a degree of membership μj(xi) in each
cluster cj. The membership degrees are required to satisfy

Σ_{j=1}^{k} μj(xi) = 1   for i = 1, . . . , M   (6.1)

and

0 < Σ_{i=1}^{M} μj(xi) < M   for j = 1, . . . , k.   (6.2)
The first condition means that for each xi the membership degrees, taken over all clusters
cj, must sum to unity. By means of the second condition, clusterings are excluded
in which all elements belong to just a single cluster with membership degree one.
Let mj denote the center of cluster cj. We compute the degree of membership of
an element xi in cluster cj as follows:

μj(xi) = 1   if xi = mj ,
μj(xi) = 1 / ( Σ_{l=1}^{k} ( d(xi, mj) / d(xi, ml) )^{2/(β−1)} )   otherwise.   (6.3)
It can be verified that μj(xi) will have a value close to 1 if xi is near mj, and a value
close to 0 if xi is far away from mj. In the equation, β > 1 is a parameter that controls
how quickly the value of μj(xi) drops from 1 to 0 as xi is moved away from mj. A value
of β close to 1 means a quick drop of the membership function, while large values of β
imply a slower decrease of μj(xi).
In the classical version of the k-means algorithm, there is an iterative updating of
the cluster centers. A similar updating operation on the cluster centers takes place in
fuzzy k-means clustering:

mj = ( Σ_{i=1}^{M} μj(xi) xi ) / ( Σ_{i=1}^{M} μj(xi) ) .   (6.4)

This operation can be interpreted as computing the weighted average over all input
elements xi, where each xi is weighted by its degree of membership in cluster cj.
Given equations (6.3) and (6.4), the fuzzy k-means clustering algorithm can be
formulated as shown in Figure 6.10. Notice that, similarly to its classical counterpart
given in Figure 6.4, the number of clusters needs to be given as a parameter. The same
techniques for cluster center initialization and termination as discussed in Section 6.2.2
can be applied, but the error measure E needs to be appropriately redefined, for example
by letting

FE = Σ_{i=1}^{M} Σ_{j=1}^{k} μj(xi) d(xi, mj) .
It can be shown that the validity of equations (6.1) and (6.2) is maintained during
the execution of the fuzzy k-means algorithm. The result of the fuzzy k-means algorithm is a set of cluster centers m1, . . . , mk and a sequence of membership values
(μ1(xi), . . . , μk(xi)) for each input element xi. It is possible to harden this result by
means of a defuzzification procedure. One possibility for defuzzification is the winner-take-all strategy. Under this strategy, xi is assigned to the cluster cj* with maximum
membership value, i.e., we assign xi to cj* if and only if

j* = arg max{μj(xi) | j = 1, . . . , k}.

This decision rule is equivalent to assigning xi to that cluster cj whose center mj
is closest to xi.
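A Python sketch of fuzzy k-means for scalar data, following equations (6.3) and (6.4); the initialization with the first k points and the handling of elements that coincide with a center are implementation assumptions.

    def fuzzy_kmeans(X, k, dist, beta=2.0, iters=100):
        """Fuzzy k-means - a sketch of equations (6.3) and (6.4)."""
        m = list(X[:k])                  # simple initialization: first k points
        p = 2.0 / (beta - 1.0)
        for _ in range(iters):
            mu = []                      # eq. (6.3): membership degrees
            for x in X:
                hits = [j for j in range(k) if dist(x, m[j]) == 0]
                if hits:                 # x coincides with a center
                    mu.append([1.0 if j == hits[0] else 0.0 for j in range(k)])
                else:
                    mu.append([1.0 / sum((dist(x, m[j]) / dist(x, m[l])) ** p
                                         for l in range(k)) for j in range(k)])
            # eq. (6.4): membership-weighted cluster centers
            m = [sum(mu[i][j] * X[i] for i in range(len(X))) /
                 sum(mu[i][j] for i in range(len(X))) for j in range(k)]
        # defuzzification: winner-take-all
        labels = [max(range(k), key=lambda j: mu[i][j]) for i in range(len(X))]
        return m, labels

    centers, labels = fuzzy_kmeans([1.0, 2.0, 4.0, 5.0], 2,
                                   dist=lambda a, b: abs(a - b))
    print(centers, labels)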
A multitude of clustering algorithms have been published in the literature, but there
is no general principle that can be used to predict which method performs best. The
actual performance of a method crucially depends on the given task and the underlying
data, and has to be experimentally determined.
The remainder of this chapter focuses on graph clustering, i.e., on the clustering of objects that are represented by means of
graphs.
In [155] the objects to be clustered are random graphs. In such a graph, each node
and each edge has a probability assigned to it, which reflects its likelihood of existence.
An information-theoretic measure is used to assess the quality of a clustering. This
measure also controls the assignment of individual random graphs to clusters.
In [78] an extension of the SOM clustering procedure from feature vectors to the
domain of graphs is described. The basic steps of the algorithm are identical to those
given in Figure 6.6. However, two important extensions were introduced to make
SOM clustering applicable to graphs. The first is the replacement of the Euclidian
distance by graph edit distance (see Chapter 3), and the second a generalization of the
SOM updating rule (see line 8 in Figure 6.6) from n-dimensional real space to the graph
domain.
The method introduced in [78] was augmented by cluster validation indices for the
graph domain in [79]. As a result, a graph clustering procedure is obtained that can
find the optimal number of clusters automatically.
Next we discuss, from a general standpoint, what is needed to extend the clustering
algorithms introduced in Section 6.2 from n-dimensional real space to the domain of
graphs. Obviously, one of the concepts essential to each of the clustering algorithms is
a distance function. Very often, Euclidian distance is used. In order to apply clustering
algorithms in the graph domain, we need a function d(g1 , g2 ) that computes the distance
of any two given graphs, g1 and g2 . Fortunately, there are a number of such graph
distance functions; see Chapter 4. Also, graph distance measures d1 and d2 introduced
in Section 5.2 can be used as a graph distance function in the context of graph clustering.
A close look reveals that the availability of a function that computes graph distance
is already sufficient to implement the hierarchical clustering algorithm discussed in
Section 6.2.1 (see Figure 6.1). For the implementation of the k-means algorithm (Figure 6.4) we need, in addition to a distance function on graphs, a method to compute the

[Figure 6.11: a weighted mean g of two graphs g1 and g2, lying between g1 and g2 in the graph domain, with d(g1, g) + d(g, g2) = d(g1, g2).]
center of a cluster, i.e., the center of a finite set of graphs. For this task, any procedure
for median (or set median) graph computation can be used; see Chapters 3 and 5.
Median graph computation is not needed for the implementation of the SOM algorithm in the graph domain. Here a different operation is required. We need to synthesize
a new graph g on the basis of two given graphs g1 and g2 such that

d(g1, g) = γ  and  d(g1, g) + d(g, g2) = d(g1, g2) ,   (6.5)

where γ is a given constant, called the learning rate. It controls the degree by which a
cluster center is moved closer to an input element (see Figure 6.6, line 8). Intuitively,
g is a graph on the connecting line between g1 and g2 at distance γ from g1 in the graph
domain; see Figure 6.11 for an illustration. Such a graph has been called a weighted
mean in [24].
Fortunately, it turns out that the computation of a weighted mean g, given g1, g2,
and γ, is a straightforward task. In fact, this computation can be accomplished as a
postprocessing step of edit distance computation. The computation of d(g1, g2) yields
a sequence of edit operations (e1, . . . , el) that transforms g1 into g2 with minimum
cost. Let c(ei) denote the cost of edit operation ei in this sequence. Now g can be
synthesized from g1 by selecting a subsequence (ei1, . . . , eir) of the sequence (e1, . . . , el),
r ≤ l, such that Σ_{j=1}^{r} c(eij) approximates γ as closely as possible, and applying all
edit operations of this subsequence to g1. It can be proven that the graph g that results
from this procedure is in fact a weighted mean, satisfying equation (6.5).
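A sketch of this postprocessing step is given below. The representation of edit operations, their costs c(e), and their dependencies (e.g., a node insertion that must precede an incident edge insertion, as in Example 6.4) are supplied by the caller, and the greedy selection is only one possible way of approximating γ.

    def weighted_mean(g1, edit_ops, gamma, apply_op, cost, depends_on):
        """Weighted mean of two graphs - a sketch.

        edit_ops:   optimal edit sequence (e1, ..., el) from g1 to g2
        gamma:      target distance from g1
        apply_op:   applies a single edit operation to a graph
        cost:       cost c(e) of an edit operation
        depends_on: depends_on(e) lists operations that must precede e,
                    so only independently applicable subsequences are chosen.
        Greedily picks a dependency-closed subsequence whose total cost
        approximates gamma, then applies it to g1."""
        chosen, total = [], 0
        for e in edit_ops:
            if (total + cost(e) <= gamma
                    and all(d in chosen for d in depends_on(e))):
                chosen.append(e)
                total += cost(e)
        g = g1
        for e in chosen:
            g = apply_op(g, e)
        return g, total      # total = d(g1, g); d(g, g2) = d(g1, g2) - total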
The procedure for weighted mean graph computation is also applicable to distance
measures d1 and d2 introduced in Chapter 5, as we will show in the following two
examples.
Fig. 6.12. An example of weighted mean graph using d1 (see also Figure 6.13 and Figure 6.14).
Example 6.4. In Figure 6.12 two graphs, g1 and g2, are shown. It is easy to verify that
d1(g1, g2) = 8. An optimal, i.e., minimum-cost, sequence of edit operations transforming g1 into g2 is (e1, e2, e3, e4, e5, e6, e7, e8).
Fig. 6.13. Four weighted mean graphs of g1 and g2 in Figure 6.12 for γ = 1.
For γ = 1 we may select any of the edge deletions e1, e2, e4, or the node insertion
e6, and apply it to g1. The resulting graphs are shown in Figure 6.13. We observe
that d(g1, g) = 1 and d(g2, g) = 7 for any of the weighted mean graphs g shown
in Figure 6.13. Note, however, that it is not legitimate to select any of the other edit
operations, i.e., e3, e5, e7, or e8, because these edit operations can't be carried out
independently. This means they need additional edit operations to produce a valid graph.
For example, if we delete node 0 through edit operation e3, we must also delete edges
(0, 1) and (0, 3) through edit operations e1 and e2, respectively. Similarly, insertion of
edge (1, 4) through edit operation e7 requires the insertion of node 4 by means of e6.
For γ = 2 we may select any pair of edit operations among (e1, e2), (e1, e4),
(e1, e6), (e2, e4), (e2, e6), (e4, e6), (e6, e7), (e6, e8), resulting in one of the graphs
shown in Figure 6.14. Here we observe that d(g1, g) = 2 and d(g, g2) = 6 for any
graph g depicted in Figure 6.14. Any pair of edit operations other than the ones listed
above is not legitimate, for the same reasons as given for the case γ = 1.
The cases γ = 3, . . . , 7 are similar. Note that for γ = 0 and γ = 8, the weighted
means will be isomorphic to g1 and g2, respectively.
Example 6.5. Two graphs with edge labels are shown in Figure 6.15. Here we observe
that d2(g1, g2) = 6 if we let c = 1. An optimal sequence of edit operations that
transforms g1 into g2 is (e1, e2, e3, e4).

Fig. 6.14. Eight weighted mean graphs of g1 and g2 in Figure 6.12 for γ = 2.
Fig. 6.15. An example of a weighted mean graph using d2 (see also Figures 6.16 and 6.17).

For γ = 1, any single one of these operations that can be carried out independently may be applied to g1.
The resulting three graphs are shown in Figure 6.16. Obviously, d2 (g1 , g) = 1 and
d2 (g, g2 ) = 5 for any of the weighted means.
For γ = 2 the following possibilities exist:
- delete edge (2, 3) and change the label of edge (1, 2) from 3 to 2;
- delete edge (2, 3) and change the label of edge (1, 3) from 2 to 1;
- delete edge (1, 3);
- change the label on edge (1, 2) from 3 to 1.
Fig. 6.16. Three weighted mean graphs of g1 and g2 in Figure 6.15 for γ = 1.
Fig. 6.17. Three weighted mean graphs of g1 and g2 in Figure 6.15 for γ = 2.
The resulting graphs are shown in Figure 6.17. Now we have d2(g1, g) = 2 and
d2(g, g2) = 4.
The graphs for γ = 3, . . . , 6 can be constructed similarly.
Fig. 6.18. An example of weighted median graph computation using d1 (see also Figure 6.19).
Having a procedure for weighted mean graph computation at our disposal, we are
able to implement the SOM clustering algorithm in the graph domain. Next we consider
the fuzzy k-means clustering algorithm. It is easy to see that for the implementation of
equation (6.3) the availability of a function for graph distance computation is sufficient.
However, equation (6.4) needs to be elaborated on. Here the aim is to compute a cluster
center mj. As a generalization of the median graph computation procedure discussed
in the context of nonfuzzy k-means clustering, each of the elements xi now has an
individual weight, namely μj(xi). Yet the computation of the median of a weighted
set of graphs can be considered a straightforward extension of normal median graph
2 Note that the deletion of an edge is equivalent to substituting the label by weight 0.
computation. All that needs to be done is to include multiple copies of each graph in the
underlying set, with the number of copies of a graph being proportional to its weight.
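A sketch of this reduction is given below; the quantization of the weights into integer copy counts (parameter denom) is an implementation assumption.

    def weighted_median_graph(graphs, weights, median_of, denom=10):
        """Median of a weighted set of graphs - a sketch.

        Replicates each graph in proportion to its weight and applies an
        ordinary median graph routine (e.g., the d2 construction of
        Theorem 5.1) to the resulting multiset. denom controls how finely
        the weights are quantized into copy counts."""
        multiset = []
        for g, w in zip(graphs, weights):
            multiset.extend([g] * round(w * denom))
        return median_of(multiset)

    # With weights 0.4, 0.3, 0.3 and denom = 10 this reproduces a multiset
    # of ten graphs (four copies of g1, three of g2, three of g3), as in
    # Example 6.6 below.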
Example 6.6. Three graphs, g1 , g2 , and g3 , together with their weights are shown in
Figure 6.18. Let us consider the computation of a weighted median under graph distance
d1 . From g1 , g2 , and g3 we produce a set of ten graphs. This set includes four copies
of g1 , three copies of g2 , and three copies of g3 . Application of the median graph
computation procedure as described in Chapter 3 to this set yields the graph shown in
Figure 6.19.
Fig. 6.20. An example of weighted median graph computation using d2 (see also Figure 6.21).
112
6 Graph Clustering
Example 6.7. The graphs in Figure 6.18, augmented by edge labels, are shown in Figure 6.20. The weighted median computed under graph distance d2 (see Section 5.2) is
depicted in Figure 6.21.
Further inspection of the pseudocode in Figure 6.10 reveals that no additional tools
are needed to implement fuzzy k-means clustering in the graph domain.
It is easy to see that all cluster validation indices introduced in Section 6.2.3 are based
only on distances. Therefore, only graph distance is needed for their implementation in
the graph domain.
113
and c2 . A partitioning of a time series of graphs, such as the one shown in Figure 6.22,
may be useful to characterize a whole series of graphs in more general and global terms
than just referring to individual points in time where abnormal change occurs. For the
sequence shown in Figure 6.22 one could infer, for example, that the sequence consists
of two different alternating types of graphs. If the behavior of the network follows some
(deterministic or nondeterministic) rules, it may be possible to automatically detect such
rules by means of grammatical inference [29] or tools from the discipline of machine
learning (see Chapter 11). Once appropriate rules have been derived from an existing
sequence, they may be used to predict the future behavior of this sequence, or to patch
up points in time where no measurements are available. In such a system, clustering
would be needed to provide the basic similarity classes of network states.2
Any noncontiguous clustering can be easily converted into a contiguous one by
splitting the clusters into contiguous subsequences. Figure 6.23 shows the contiguous
clusters that are obtained from Figure 6.22. Another procedure for the generation of
contiguous clusters can be derived from the hierarchical clustering algorithm introduced
in Section 6.2.1. Here we start with the individual graphs as the initial clusters. Then,
whenever two clusters (or individual graphs) are to be merged, only candidates that are
adjacent in time will be considered. In other words, only pairs of (contiguous) clusters
that will result in a new contiguous cluster are eligible for merging.
Fig. 6.23. Contiguous clustering resulting from splitting the clusters in Figure 6.22.
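The conversion amounts to cutting every cluster at each gap in time, which takes only a few lines of Python:

    from itertools import groupby

    def make_contiguous(labels):
        """Split a (possibly noncontiguous) clustering of a time series
        into contiguous clusters, as in Figure 6.23.

        labels: cluster label of each graph g1, ..., gn in time order.
        Returns (label, [time indices]) runs; each run is a maximal
        contiguous subsequence with the same cluster label."""
        runs, t = [], 0
        for label, grp in groupby(labels):
            members = list(grp)
            runs.append((label, list(range(t, t + len(members)))))
            t += len(members)
        return runs

    # A sequence alternating between two types of graphs, c1 and c2:
    print(make_contiguous(["c1", "c1", "c2", "c2", "c1", "c1"]))
    # -> [('c1', [0, 1]), ('c2', [2, 3]), ('c1', [4, 5])]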
Graph clustering can be used to develop additional tools for abnormal change detection in time series of graphs. The basic idea is to first produce a contiguous clustering
of a given time series of graphs. Then pairs of clusters that are adjacent in time are
compared to each other. If their dissimilarity is above some threshold, it can be concluded that some abnormal change has occurred at the point in time at which the two
clusters meet. Potential measures for cluster similarity, or dissimilarity, are single linkage, complete linkage, average distance, and the distance of medians. It can be expected
that such measures for abnormal change detection have noise-resistance properties
similar to those of the median graph. But they are more general because they can adapt to
the actual data in a more flexible way than median graphs, which are always computed
over a time window of fixed size.
2 For a more detailed treatment of predicting future network behavior and recovering missing
information see Chapter 11.
6.5 Conclusion
Graph representation and similarity measures on graphs have proven useful in many applications, including the analysis of telecommunication networks. In this chapter a number of clustering algorithms are reviewed and their extension from the n-dimensional
real space to the graph domain is studied. Furthermore, the application of these algorithms to the monitoring of telecommunication networks is discussed.
It can be expected that graph clustering will provide additional tools to detect
abnormal change in time series of graphs and telecommunication networks. Moreover, it
is potentially useful to identify patterns of similar behavior in long network observation
sequences. This may lead to enhanced tools that are able not only to detect network
events in short time windows, but also to analyze network behavior and deliver human-understandable high-level interpretations over long periods of time.
7
Graph Distance Measures based on Intragraph
Clustering and Cluster Distance
7.1 Introduction
Various graph distance measures were considered in previous chapters. All of these
measures have in common that the distance of two given graphs g1 and g2 is equal to
zero if and only if g1 and g2 are isomorphic to each other. Sometimes, however, it may
be desirable to have a more flexible distance measure for which d(g1, g2) = 0 if g1 and
g2 are similar, but not necessarily isomorphic. Such a measure is potentially useful to
make our graph-distance-based computer network monitoring procedures more robust
against noise and small random perturbations.
A family of new graph distance measures with this property is introduced in this
chapter. Given two graphs g1 and g2, their distance d(g1, g2) is computed in two steps.
First an intragraph clustering procedure is applied that partitions the set of nodes of
each graph into a set of clusters, based on the weights on the edges. In the second step,
the distance d(C1, C2) of the two clusterings C1 and C2 derived in the first step is
computed, and this quantity serves as our graph distance measure d(g1, g2). Thus
the problem of computing the distance of two graphs is transformed into the problem of measuring the similarity of two given clusterings. For the latter problem, some
algorithms are known from the literature [147, 164]. In addition to these algorithms
we propose a novel approach to measuring cluster similarity, based on bipartite graph
matching. The intragraph clustering procedure can be combined with any of the methods that measure clustering similarity. Hence a whole family of new graph distance
measures, rather than just a single procedure, is obtained.
In the next section our basic terminology and an algorithm for intragraph clustering
will be introduced. Then in Section 7.3, distance measures for clusterings will be discussed. Combining the concepts from Sections 7.2 and 7.3, our novel graph-matching
methods will be described in Section 7.4. In Section 7.5, we continue with a discussion
of applications of the new graph distance measures in the field of computer network
monitoring. Finally, in Section 7.6 conclusions and suggestions for further research are
provided.
(G, F), (F, C), (C, I), (C, D), (H, A), (A, B), (D, E) to T, in this order. The edges
belonging to T are printed in bold. It can be proven that the mst of a given graph is
uniquely defined if all edge weights are different. Otherwise several msts may exist
for the same given graph. In the remainder of this monograph we shall assume, for the
purpose of simplification, that each graph under consideration has a unique mst.
As mentioned above, the objective in this chapter is not to cluster a set of objects,
as is done in the algorithm described in [196], but to cluster the nodes in a graph
g = (V, E, ω). However, we can directly apply the algorithm given in [196] in case
the underlying graph is connected, i.e., any node y ∈ V can be reached from any
other node x ∈ V via a sequence of edges (x, x1), (x1, x2), (x2, x3), . . . , (xi, y) that
all belong to E. Otherwise, if g is not connected, we add edges to graph g such that
it becomes a complete graph (i.e., such that there is an edge (x, y) between any pair
of nodes x, y ∈ V). Each of the additional edges e is assigned a weight ω(e) = ∞.
Given graph g, we compute an mst using the algorithm described above. Given the
mst, the k edges with the largest weights are determined and deleted from the mst. This
results in k + 1 subsets, or clusters, of the set of nodes V. Each cluster is defined by a
connected component of the mst.
Example 7.1. For this example we use the graph and the mst from Figure 7.1. If we
delete just the single edge with maximum weight from the mst, i.e., edge (D, E) with
weight 9, then the nodes of graph g are divided into two clusters, the one just including
node E, and the other consisting of all other nodes. Alternatively, if we delete the
two edges with maximum weight from the mst, we get three clusters, {E}, {A, B}, and
{C, D, F, G, H, I }. Deletion of three edges from the mst yields four clusters, {E}, {D},
{A, B}, {C, F, G, H, I }, and so on.
[Figure 7.1: a weighted graph on the nodes A through I and its minimum spanning tree.]
The number of edges to be deleted from the mst is a parameter that controls the
number of clusters to be produced. Two basic strategies can be applied to select appropriate values of this parameter. First, we can put a threshold t on the edge weights and
delete all edges from the mst with a weight larger than t. Under this strategy, the number
of clusters is not fixed a priori, but depends on the actual weights in the given graph.
Secondly, the number of edges to be deleted can be fixed. This will always result in
the same number of clusters being produced, independently of the actual weights on the
edges. Which of the two strategies is more suitable usually depends on the underlying
application. A sketch of the mst-based procedure follows.
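The sketch below assumes the networkx library, with edge distances stored in the attribute 'weight'; both selection strategies are supported.

    import networkx as nx

    def mst_clusters(G, k=None, threshold=None):
        """Intragraph clustering via the minimum spanning tree - a sketch.

        G: weighted undirected nx.Graph (edge attribute 'weight' = distance).
        Disconnected graphs are completed with infinite-weight edges, so
        the mst always spans all nodes. Then either the k heaviest mst
        edges are deleted (giving k+1 clusters) or all mst edges heavier
        than the threshold t are deleted. Returns the node clusters."""
        H = G.copy()
        for u in H:                          # complete the graph if needed
            for v in H:
                if u != v and not H.has_edge(u, v):
                    H.add_edge(u, v, weight=float("inf"))
        mst = nx.minimum_spanning_tree(H, weight="weight")
        edges = sorted(mst.edges(data="weight"),
                       key=lambda e: e[2], reverse=True)
        if k is not None:
            to_delete = edges[:k]
        else:
            to_delete = [e for e in edges if e[2] > threshold]
        mst.remove_edges_from((u, v) for u, v, _ in to_delete)
        return list(nx.connected_components(mst))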
The algorithm described above is based on the idea that edge weights represent
distances and that clusters should comprise objects that have a small distance to each
other. Therefore, the minimum spanning tree is computed, and those k edges are deleted
from the mst that have the largest weights. There is a dual version of this problem in
which edge weights represent affinity between nodes, and we want to cluster together those
nodes that have a high degree of affinity. In this dual version we compute the
maximum spanning tree of the complete graph, analogously to the minimum spanning
tree, and delete those k edges from the maximum spanning tree that have the smallest
weights. If we consider the second version of the problem, a small modification of the
graph completion procedure is required in case g is not connected: to each edge
e added to the original graph g with the purpose of turning g into a complete graph,
a weight ω(e) = 0, rather than ω(e) = ∞, should be assigned. Intuitively, this means
that a missing edge corresponds to the case of zero affinity between two nodes.
We conclude this section by pointing out that for the new graph distance
measures to be introduced below, any other procedure that partitions the set of nodes of
a graph into disjoint subsets, or clusters, can be used instead of the mst-based clustering
algorithm.
7.3.1 Rand Index
A pair of objects (oi, oj) ∈ O × O is called

consistent if either
- oi and oj are in the same cluster in C1 and in the same cluster in C2, or
- oi and oj are in different clusters in C1 and in different clusters in C2;

inconsistent if either
- oi and oj are in the same cluster in C1, but in different clusters in C2, or
- oi and oj are in different clusters in C1, but in the same cluster in C2.
R+
.
+ R
R+
(7.1)
It is easy to see that R(C1, C2) ∈ [0, 1] for any two given clusterings C1 and C2 over a set of objects O. We have R(C1, C2) = 0 if and only if k = l and there exists a bijective mapping f between C1 and C2 such that f(c1i) = c2j for i, j ∈ {1, . . . , l}, which means that the two clusterings are the same except for possibly assigning different names to the individual clusters, or listing the clusters in different order. The case R(C1, C2) = 1 corresponds to the maximum degree of cluster dissimilarity. It indicates that there exists no consistent pair (oi, oj) ∈ O × O. In order to compute the Rand index, O(n²) pairs of objects from O have to be considered since R+ + R− = n(n − 1)/2.
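The Rand-index-based distance of equation (7.1) can be computed directly from the definition of consistent and inconsistent pairs. The following sketch (clusterings represented as lists of disjoint sets over the same object set; all names are ours) does this in O(n²) time:

from itertools import combinations

def rand_distance(c1, c2):
    # Label each object with the index of its cluster in each clustering.
    label1 = {o: i for i, c in enumerate(c1) for o in c}
    label2 = {o: i for i, c in enumerate(c2) for o in c}
    objects = list(label1)
    # A pair is inconsistent if the two clusterings disagree on whether
    # its members belong together.
    r_minus = sum(
        (label1[x] == label1[y]) != (label2[x] == label2[y])
        for x, y in combinations(objects, 2)
    )
    return r_minus / (len(objects) * (len(objects) - 1) / 2)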
Example 7.3. The following values of the Rand index are obtained for the clusterings given in Example 7.2:

d(C1, C2) = 0,      with R+ = 15, R− = 0;
d(C1, C3) = 6/15,   with R+ = 9,  R− = 6;
d(C1, C4) = 5/15,   with R+ = 10, R− = 5;
d(C1, C5) = 9/15,   with R+ = 6,  R− = 9;
d(C2, Ci) = d(C1, Ci)   for i = 3, 4, 5;
d(C3, C4) = 5/15,   with R+ = 10, R− = 5;
d(C3, C5) = 5/15,   with R+ = 10, R− = 5;
d(C4, C5) = 8/15,   with R+ = 7,  R− = 8.
Hence if we order these distance values, we may conclude, for example, that the
clustering that is most similar to C1 is C2 , followed by C4 , C3 , and C5 .
7.3.2 Mutual Information
Mutual information (MI ) is a well-known concept in information theory [53]. It measures how much information about random variable Y is obtained from observing
random variable X. Let X and Y be two random variables with joint probability distribution p(x, y) and marginal probability functions p(x) and p(y). Then the mutual
information of X and Y, MI(X, Y), is defined as

MI(X, Y) = Σ_(x,y) p(x, y) log [ p(x, y) / (p(x) p(y)) ] .   (7.2)
Some properties of MI are summarized below. For a more detailed treatment the
reader is referred to [53].
1. MI(X, Y) = MI(Y, X);
2. MI(X, Y) ≥ 0;
3. MI(X, Y) = 0 if and only if X and Y are independent;
4. MI(X, Y) ≤ max(H(X), H(Y)), where H(X) = −Σ_x p(x) log p(x) is the entropy of random variable X;
5. MI(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) = −Σ_(x,y) p(x, y) log p(x, y) is the joint entropy of X and Y.
In the context of measuring the distance of two given clusterings C1 and C2 over
a set of objects O, the discrete values of random variable X are the different clusters
c1i of C1 to which an element of O can be assigned. Similarly, the discrete values of
Y are the different clusters c2j of C2 to which an object of O can be assigned. Hence
equation (7.2) becomes

MI(C1, C2) = Σ_(c1i,c2j) p(c1i, c2j) log [ p(c1i, c2j) / (p(c1i) p(c2j)) ] .   (7.3)
Given set O and clusterings C1 and C2 , we can construct a table such as the one shown in
Figure 7.2 to compute MI (C1 , C2 ). An entry at position (c1i , c2j ) in this table indicates
how often an object has been assigned to cluster c1i in clustering C1 , and to cluster c2j
in clustering C2 . To compute p(c1i , c2j ) in equation (7.3) we just have to divide the
element at position (c1i , c2j ) by the sum of all elements in the table. For p(c1i ) we sum
all elements in row c1i and divide again by the sum of all elements in the table. To get
p(c2j ) we proceed similarly, summing over all values in column c2j , and dividing by
the total sum.
Fig. 7.2. Illustration of the computation of MI(C1, C2): a table whose rows correspond to the clusters c11, . . . , c1k of C1 and whose columns to the clusters c21, . . . , c2l of C2.
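The table-based computation of MI(C1, C2) can be sketched as follows (clusterings as lists of sets; the logarithm is taken natural here, which only changes the unit; names are ours):

from math import log

def mutual_information(c1, c2):
    n = sum(len(c) for c in c1)           # total number of objects
    mi = 0.0
    for ci in c1:
        for cj in c2:
            joint = len(ci & cj) / n      # p(c1i, c2j): table entry / total
            if joint > 0:
                mi += joint * log(joint / ((len(ci) / n) * (len(cj) / n)))
    return mi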
In contrast with the Rand index, which is normalized to the interval [0, 1], no normalization is provided in equation (7.3). As MI(C1, C2) ≤ max(H(C1), H(C2)) and H(C) ≤ log(k), with k being the number of clusters present in clustering C, the upper bound of MI(C1, C2) depends on the number of clusters in C1 and C2. To get a normalized value, it was proposed to divide MI(X, Y) in equation (7.3) by log(k · l), where k and l are the number of discrete values of X and Y, respectively [164]. This
leads to the normalized mutual information

NMI(C1, C2) = [1 / log(k · l)] Σ_(c1i,c2j) p(c1i, c2j) log [ p(c1i, c2j) / (p(c1i) p(c2j)) ] .   (7.4)
For the clusterings of Example 7.2, the contingency tables are as follows (row and column sums in the margins):

MI(C1, C2):        c21  c22  c23
    c11             0    0    2   | 2
    c12             2    0    0   | 2
    c13             0    2    0   | 2
                    2    2    2   | 6

MI(C1, C3):        c31  c32  c33
    c11             1    1    0   | 2
    c12             1    0    1   | 2
    c13             0    1    1   | 2
                    2    2    2   | 6

MI(C1, C4):        c41  c42
    c11             2    0   | 2
    c12             1    1   | 2
    c13             0    2   | 2
                    3    3   | 6

MI(C1, C5):        c51  c52
    c11             1    1   | 2
    c12             1    1   | 2
    c13             1    1   | 2
                    3    3   | 6
Given a maximum-weight bipartite matching between the clusters of C1 and C2, let w denote its total weight, i.e., the total number of objects shared by the matched pairs of clusters, and let n be the number of objects under consideration. The bipartite-graph-matching-based distance is then defined as

BGM(C1, C2) = 1 − w/n .   (7.5)
Clearly, this measure is equal to 0 if and only if k = l and there is a bijective mapping f between the clusters of C1 and C2 such that f(c1i) = c2j for i, j ∈ {1, . . . , k}. Values
close to one indicate that no good mapping between the clusters of C1 and C2 exists
such that corresponding clusters have many elements in common.
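A sketch of the bipartite-graph-matching distance of equation (7.5), using SciPy's assignment solver as one possible maximum-weight bipartite matching routine (the text does not commit to a particular algorithm; names are ours):

import numpy as np
from scipy.optimize import linear_sum_assignment

def bgm_distance(c1, c2, n=None):
    # Overlap matrix: entry (i, j) is |c1i ∩ c2j|, the weight of the
    # edge between clusters c1i and c2j in the bipartite graph.
    if n is None:
        n = sum(len(c) for c in c1)
    overlap = np.array([[len(a & b) for b in c2] for a in c1])
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    w = overlap[rows, cols].sum()       # weight of a maximum matching
    return 1 - w / n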
Example 7.5. Based on the clusterings C1 , . . . , C5 introduced in Example 7.2, the
graphs used to find the maximum-weight bipartite graph matching between (C1, C2),
. . . , (C4 , C5 ) are shown in Figures 7.4a-g. In these graphs the numbers in the left
columns represent the clusters in C1 , and the numbers in the right columns the clusters
in C2 . Edges are drawn between any two clusters that have a nonempty intersection.
The maximum-weight bipartite matching is indicated in bold. For example, in Figure 7.4c, we compare clusters c11, c12, and c13 (represented by numbers 1, 2, 3 on the left) with clusters c41 and c42 (represented by numbers 1, 2 on the right). Since c11 ∩ c41 = {o1, o2}, an edge with weight 2 is drawn between c11 and c41. Similarly, an edge with weight 2 is drawn between c13 and c42. Cluster c12 has one element in common with each of c41 and c42, which is reflected in the two edges with weight 1. Since
all other pairs of clusters have an empty intersection, there are no more edges included
in Figure 7.4c. The maximum-weight bipartite matching is the one corresponding to
the bold edges. Note that in Figures 7.4a,c,e,f,g the maximum-weight bipartite graph matching is uniquely defined, while in Figures 7.4b,d several solutions exist. However, all these solutions lead to the same distance value. The distances are as follows:
BGM(C1, C2) = 0,      BGM(C1, C4) = 2/6,      BGM(C3, C4) = 2/6,
BGM(C1, C3) = 3/6,    BGM(C1, C5) = 4/6,      BGM(C3, C5) = 2/6,
                                              BGM(C4, C5) = 2/6.

Fig. 7.4. Bipartite graphs and maximum-weight matchings for (a) BGM(C1, C2), (b) BGM(C1, C3), (c) BGM(C1, C4), (d) BGM(C1, C5), (e) BGM(C3, C4), (f) BGM(C3, C5), (g) BGM(C4, C5).

1 In bipartite graph matching we aim at matching two subsets of nodes of the same graph with each other. This is different from the meaning of graph matching as used in the rest of this book, where our aim is to match the nodes and edges of two different graphs with each other.
Fig. 7.5. Two graphs g1 and g2 with weighted edges over the common node pool {A, B, C, D, E, F}; the ovals indicate the clusters produced by the intragraph clustering algorithm (cf. the example below).
In the first step, the intragraph clustering procedure of Section 7.2 is applied to obtain a clustering C1 of the nodes of g1, and a clustering C2 of the nodes of g2. Then any of the three cluster distance measures discussed in Section 7.3 can be used to get the distance d(g1, g2) between g1 and g2. This yields three different new graph distance measures. Note that even more graph distance measures can be obtained if the mst-clustering procedure applied in the first step is replaced by another intragraph clustering procedure.
We assume that there is a common pool V = V1 ∪ V2 from which the nodes of graphs g1 = (V1, E1, ω1) and g2 = (V2, E2, ω2) are chosen. Note, however, that we
do not require V1 = V2 . This means that in general, there will be nodes in g1 that do
not occur in g2 , and there will be nodes in g2 that are not present in g1 . From this point
of view, the situation is more general than the one considered in Section 7.3 because
it was assumed there that both clusterings C1 and C2 are derived from the same set
of objects. In the following we will discuss three extensions to the distance measures
introduced in Section 7.3 that allow us to deal with this problem.
Assume that we are given two clusterings, C1 = {c11 , c12 , . . . , c1k } and C2 =
{c21 , c22 , . . . , c2l } produced by our intragraph clustering algorithm, and we want to
apply the Rand index R(C1 , C2 ) to measure the distance of C1 and C2 . First we add
a dummy cluster c10 to C1 . This cluster includes exactly the nodes that are present in
V2 , but not in V1 . Similarly, we add a dummy cluster c20 to C2 , consisting of exactly
the nodes that are present in V1 , but not in V2 . We can think of the nodes in cluster
c20 as nodes that are deleted from g1 , while the nodes in cluster c10 can be understood
as nodes that are inserted in g2. Now consider the Rand index as defined in equation (7.1). In order to cope with the situation that V1 may be different from V2, we need to generalize the notion of a consistent pair. Any pair (x, y) with x ∈ c10 or y ∈ c10 (or both x, y ∈ c10) will be considered inconsistent in our new setting. Moreover, any pair (x, y) with x ∈ c20 or y ∈ c20 (or both x, y ∈ c20) will be considered inconsistent as well. This means that a consistent pair (x, y) needs to fulfill, in addition to the properties stated in Section 7.3.1, the condition that neither x nor y is in c10 or c20.
An example is shown in Figure 7.5. We assume that we run the intragraph clustering
algorithm in such a way that it produces two clusters. Then the clusters represented by
the ovals will be obtained. We have c10 = {E, F }, c11 = {A, B}, c12 = {C, D},
c20 = {C, D}, c21 = {A, B, F }, c22 = {E}. There are altogether 15 different pairs
to be considered, out of which only one is consistent, namely pair (A, B). Hence
d(g1, g2) = 1 − 1/15 = 14/15. So we get a relatively large distance value, which
makes sense from the intuitive point of view since the two clusterings are in fact quite
different.
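Under this generalization, the Rand-index computation only needs the two dummy clusters and the modified notion of consistency. A sketch (node sets v1, v2 and clusterings of the respective graphs given as lists of sets; names are ours):

from itertools import combinations

def generalized_rand(c1, c2, v1, v2):
    c10, c20 = v2 - v1, v1 - v2                   # dummy clusters
    label1 = {o: i for i, c in enumerate([c10] + list(c1)) for o in c}
    label2 = {o: i for i, c in enumerate([c20] + list(c2)) for o in c}
    objs = list(v1 | v2)
    def consistent(x, y):
        # Pairs touching a dummy cluster are inconsistent by definition.
        if x in c10 or y in c10 or x in c20 or y in c20:
            return False
        return (label1[x] == label1[y]) == (label2[x] == label2[y])
    pairs = list(combinations(objs, 2))
    return 1 - sum(consistent(x, y) for x, y in pairs) / len(pairs)

For the example of Figure 7.5 this yields 1 − 1/15 = 14/15, as computed above.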
125
For the mutual information-based clustering similarity measure defined by equations (7.3) and (7.4) we need to extend the table shown in Figure 7.2 by adding one
row for dummy cluster c10 , and one column for dummy cluster c20 . Given the extended
table, we propose to compute the distance measure in the same fashion as described in
Section 7.3.2, but to carry out the summation in equations (7.3) and (7.4) only over the
rows corresponding to c11 , . . . , c1k and the columns corresponding to c21 , . . . , c2l . In
this way elements from c10 and c20 will not contribute to the value of NMI(C1, C2).
For the example shown in Figure 7.5 we get the table shown in Figure 7.6,
where the entries correspond to absolute frequencies. From these numbers we derive
MI(C1, C2) = (2/6) log [ (2/6) / ((1/2)(1/3)) ] = (1/3) log 2 .
            c20 = CD   c21 = ABF   c22 = E
c10 = EF        0          1          1     | 2
c11 = AB        0          2          0     | 2
c12 = CD        2          0          0     | 2
                2          3          1     | 6

Fig. 7.6. Contingency table for the example of Figure 7.5, extended by the dummy clusters c10 and c20 (absolute frequencies).
For the bipartite graph-matching-based measure we do not need to add clusters c10
and c20 to the result of the intragraph clustering procedure, but we adjust the normalization factor n in equation (7.5). In Section 7.3.3 this factor was equal to the number of
objects under consideration, i.e., |V1 | = |V2 | = n. In the generalized setting discussed
in this section, we let n = |V1 V2 |, which means that all nodes occurring in graphs
g1 and g2 are taken into consideration.
Using the example in Figure 7.5, we derive the bipartite weighted graph shown in
Figure 7.7, from which we get d(g1, g2) = 1 − 2/6 = 4/6.

Fig. 7.7. The bipartite weighted graph derived from the example of Figure 7.5.
Next we provide a more extensive example using the bipartite graph matching
scheme for measuring the distance of two clusterings. Four graphs, g1 , g2 , g3 , and g4 ,
are shown in Figures 7.8a,b,c,d. The bold edges represent the msts of these graphs.
The clusters that result when all edges with a weight greater than or equal to 3 are
deleted from the mst are graphically represented by ellipses.2 For these four graphs,
the bipartite graph matching measure yields
d(g1, g2) = 0,
d(g1, g3) = d(g2, g3) = 2/9,
d(g1, g4) = d(g2, g4) = 1/2,
d(g3, g4) = 4/9.

2 In this example, the edge weights are not unique and several msts exist. Nevertheless, the resulting clusterings of all four graphs are uniquely defined.
The bipartite matchings for d(g1 , g2 ), d(g1 , g3 ), d(g1 , g4 ), and d(g3 , g4 ) are shown
in Figures 7.9a,b,c,d. We note that the bipartite matchings for d(g2 , g3 ) and d(g2 , g4 )
are identical to those for d(g1 , g3 ) and d(g1 , g4 ), respectively. In the computation
of d(g1 , g2 ) and d(g1 , g3 ), a total of nine nodes are involved. Hence, for example,
d(g1, g3) = 1 − 7/9 = 2/9, with 7 being the sum of the weights on the edges of the maximum-weight bipartite graph shown in Figure 7.9b. For d(g1, g4) we need to
consider ten nodes, and for d(g3 , g4 ) nine nodes.
In the remainder of this section we formally characterize some properties of the
new graph distance measures. For the following considerations we assume that the
intragraph clustering procedure described in Section 7.2 is applied in such a way that
it always returns the same number of clusters. That is, we always cut a fixed number of edges from the mst, rather than defining the edges to be cut by means of a threshold on the weight of the edges. Let g1 = (V1, E1, ω1) and g2 = (V2, E2, ω2) be graphs. Furthermore, let E1 = {e1, . . . , ek} and ω1(e1) < ω1(e2) < · · · < ω1(ek). We call graph g2 a scaled version of g1, symbolically g2 = σ(g1), if V1 = V2, E1 = E2, and ω2(ei) = c · ω1(ei) for i = 1, . . . , k, where c > 0 is a constant. Basically, g1 and g2 are the same up to a constant factor that is applied on each edge weight in g1 to get the corresponding edge weight in g2.
Lemma 7.6. Let g be a graph and σ(g) a scaled version of g. Then d(g, σ(g)) = 0.

Proof. Let g = g1 and σ(g) = g2. Because each ω2(ei) is identical to ω1(ei) up to a constant scaling factor, the mst of g1, which is uniquely defined because all edge weights are different by assumption, is also an mst of g2. Hence cutting a fixed number of edges with the maximum weight from those msts will lead to identical clusterings in g1 and g2. Consequently, d(g1, g2) = 0.

Lemma 7.7. Let g1 and g2 be graphs, and σ(g2) a scaled version of g2. Then d(g1, g2) = d(g1, σ(g2)).

Proof. Similarly to the proof of Lemma 7.6, we notice that the clustering obtained for g2 will be the same as the clustering obtained for σ(g2). Hence d(g1, g2) = d(g1, σ(g2)).
Finally, we consider a transformation on the edge weights of a graph that is more general than scaling. Let g1 = (V1, E1, ω1) and g2 = (V2, E2, ω2) be graphs. Furthermore, let E1 = {e1, . . . , ek} and ω1(e1) < ω1(e2) < · · · < ω1(ek). We call g2 an order-preserved transformed version of g1, symbolically g2 = τ(g1), if V1 = V2, E1 = E2, and ω2(e1) < ω2(e2) < · · · < ω2(ek).
Fig. 7.9. Bipartite graph matching corresponding to (a) d(g1 , g2 ), (b) d(g1 , g3 ), (c) d(g1 , g4 ),
(d) d(g3 , g4 ).
Theorem 7.8. Let g1 and g2 be graphs and τ(g2) an order-preserved transformed version of g2. Then
(i) d(g1, τ(g1)) = 0,
(ii) d(g1, g2) = d(g1, τ(g2)).
The proof is based on arguments similar to the ones used for Lemmas 7.6 and 7.7.
Clearly, Theorem 7.8 is a generalization of Lemmas 7.6 and 7.7 because any scaled
version of a graph g is necessarily an order-preserved transformed version of g.
For an example of Theorem 7.8 consider Figure 7.10. Here we apply the intragraph
clustering algorithm in such a way that it always produces two clusters. Clearly, g2 is a
scaled and therefore an order-preserved transformed version of g1 , and d(g1 , g2 ) = 0.
The distance of any other graph g to g1 will always be the same as the distance of g
to g2 , because g1 and g2 are clustered in the same way; hence d(g, g1 ) = d(g, g2 ).
For example, d(g1 , g3 ) = d(g2 , g3 ) = 0.5, using the bipartite graph-matching-based
cluster similarity measure.
Fig. 7.10. Three weighted graphs (a) g1, (b) g2, and (c) g3; g2 is a scaled version of g1.
Another very important feature of the graph distance measure introduced in this chapter is the fact that it may lend itself to useful visualization tools that allow a human operator to visually track the changes in a computer network. First of all, the clusters resulting from the first step of the new method can be readily displayed (as single nodes) on a screen, which allows us to view the network from a bird's-eye perspective. The level of detail, i.e., the number of clusters or, equivalently, their size, can be chosen by the user. It corresponds to the number of edges to be deleted from the mst, as described in Section 7.2. Secondly, if we use the bipartite graph matching procedure in the second step of the algorithm for computing graph distance, we can see exactly what happens to the individual clusters when we go from time t to t + 1. That is, we can see how the individual clusters of nodes develop over time. This allows a human operator not only to quantitatively measure network change, but also to qualitatively interpret these changes. Using the bipartite graph matching procedure introduced in Section 7.3.3 one can also highlight, for example, the k clusters that have undergone the largest change between time t and t + 1, where k is a user-defined parameter. Similarly, one can define a threshold T and display all clusters on the screen with a change larger than T. It is perhaps more difficult to implement similar visualization tools based on graph distance measures d1, d2, or others described in Chapter 4.
7.6 Conclusion
In this chapter we have proposed a set of new distance measures for graphs with unique
node labels and weighted edges. These measures are based on a clustering procedure
that partitions the set of nodes of a graph into clusters based on the edge weights, and
algorithms to determine the distance of two given clusterings. We have also discussed
potential applications of the new measures in the field of computer network monitoring
and have pointed out that they may be a valuable addition to the repository of existing
tools with some potential advantages over other graph distance measures.
In the second step of the new algorithm the distance of two given clusterings is
evaluated. For this task, three different methods have been proposed. Also for the first step, i.e., intragraph clustering, several alternatives exist to the algorithm described
in Section 7.2. Hence, what is described in the current chapter may be regarded as a
novel general algorithmic framework rather than a single concrete procedure for graph
distance computation.
8
Matching Sequences of Graphs
8.1 Introduction
In the current chapter we are going to address a new problem in the domain of graph
matching. All algorithms for graph distance computation discussed previously in this
book consider just a pair of graphs g and G, and derive their distance d(g, G). Our
proposed extension consists in computing the distance d(s, S) of a pair of graph sequences s = g1 , . . . , gn and S = G1 , . . . , Gm . Both sequences can be of arbitrary
length. In particular, the length of s can be different from the length of S. The normal
graph distance is obtained as a special case of the proposed graph sequence distance if
each of s and S consists of only one graph. Similarly to any of the classical graph distance
measures d(g, G), the proposed graph sequence distance d(s, S) will be equal to zero
if s and S are the same, i.e., n = m and gi = Gi for i = 1, . . . , n. On the other hand,
d(s, S) will be large for two highly dissimilar sequences.
In the next section, we review basic concepts and procedures in sequence matching
that will be used later in this chapter. In Section 8.3 we will bring the two concepts,
classical sequence matching and graph matching, together and develop novel procedures
for graph sequence matching. Applications of graph sequence matching, particularly
in the eld of network behavior analysis, will be discussed in Section 8.4. Finally, in
Section 8.5 conclusions will be drawn.
input: x = x1 . . . xn, y = y1 . . . ym
output: success, if x is a continuous subsequence of y; failure, otherwise.
begin
  i = 1
  j = 1
  while i ≤ m and j ≤ n do
    if xj = yi then
      i = i + 1
      j = j + 1
    else
      i = i − j + 2
      j = 1
  if j > n then
    output(success: x1 . . . xn = yi−n . . . yi−1)
  else
    output(failure)
end
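A runnable Python rendering of the algorithm above (a sketch; sequences are arbitrary Python sequences, and names are ours):

def is_continuous_subsequence(x, y):
    n, m = len(x), len(y)
    i = j = 0                    # 0-based counterparts of i and j above
    while i < m and j < n:
        if x[j] == y[i]:
            i += 1
            j += 1
        else:
            i = i - j + 1        # restart one position after the last start
            j = 0
    return j == n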
considered. The value of D(i, j) is defined to be equal to the minimum of the following three quantities: (a) D(i − 1, j − 1) plus the cost of substituting xi by yj, (b) D(i − 1, j) plus the cost of deleting xi, (c) D(i, j − 1) plus the cost of inserting yj. Hence the value of D(i, j), which represents d(x1 . . . xi, y1 . . . yj), is obtained by selecting the minimum among the following three terms:

(a) d(x1 . . . xi−1, y1 . . . yj−1) + c(xi → yj)
(b) d(x1 . . . xi−1, y1 . . . yj) + c(xi → ε)
(c) d(x1 . . . xi, y1 . . . yj−1) + c(ε → yj)
Matrix element D(n, m) will become available last, and it holds the desired value
d(x, y). For a proof of the correctness of this algorithm see [176], for example.
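A direct implementation of this recurrence (a sketch, with the unit insertion/deletion costs and substitution cost 2 of Example 8.2 below as defaults; names are ours) reproduces d(x, y) = 6 for x = ababbb, y = babaaa:

def edit_distance(x, y, c_sub=2, c_del=1, c_ins=1):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c_del                 # initial column: d(x1..xi, ε)
    for j in range(1, m + 1):
        D[0][j] = j * c_ins                 # initial row: d(ε, y1..yj)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else c_sub),
                D[i - 1][j] + c_del,
                D[i][j - 1] + c_ins,
            )
    return D[n][m]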
From the pointers set in steps 10 to 12 the sequence of edit operations that actually transform x into y can be reconstructed. For this purpose one starts at D(n, m) and
traces back these pointers to position D(0, 0). Such a sequence of pointers is also called
an optimal path in the cost matrix. It has to be noted that, due to the way the algorithm is
presented in Figure 8.2, only one pointer is installed for each matrix element. In general,
however, the predecessor element from which D(i, j ) is computed may not be uniquely
dened. In other words, the minimum among (a), (b), and (c) as introduced above may
not be unique. If one is interested in recovering all possible sequences of edit operations
that transform x into y with minimum cost, one has to install and track all these pointers.

Fig. 8.3. Structure of the cost matrix D: rows are indexed by x1, . . . , xn, columns by y1, . . . , ym; the element D(i, 0) of the initial column holds d(x1 . . . xi, ε).
Example 8.2. Let x = ababbb, y = babaaa and let the set of edit operations be given by a → b, b → a, a → ε, b → ε, ε → a, ε → b with costs c(a → ε) = c(b → ε) = c(ε → a) = c(ε → b) = 1 and c(a → b) = c(b → a) = 2. Then the execution
of the algorithm yields the matrix shown in Figure 8.4. From the element in the lower
right corner we conclude that d(x, y) = 6.
        a   b   a   b   b   b
    0   1   2   3   4   5   6
b   1   2   1   2   3   4   5
a   2   1   2   1   2   3   4
b   3   2   1   2   1   2   3
a   4   3   2   1   2   3   4
a   5   4   3   2   3   4   5
a   6   5   4   3   4   5   6

Fig. 8.4. Cost matrix D for Example 8.2.
For the sake of simplicity, pointers have been left out in Figure 8.4. However, the path that can actually be reconstructed from the pointers is shown in Figure 8.5. This path corresponds to the following edit operations:

(ε → b), (a → a), (b → b), (a → a), (b → ε), (b → a), (b → a).

Application of this sequence of edit operations to x in fact yields sequence y, and the cost of this sequence of edit operations is equal to 6.
Fig. 8.5. The cost matrix of Figure 8.4 with the optimal path (corresponding to the edit operations listed above) marked.
Under these cost assumptions, the length l of the longest common subsequence (lcs) of x and y satisfies

l = [ |x| + |y| − d(x, y) ] / 2 ,   (8.2)

and an actual lcs corresponds to the identical substitutions on an optimal path in the cost matrix.
Example 8.3. Consider again Example 8.2. Here the costs comply with the condition stated above. Hence we conclude that l = [|x| + |y| − d(x, y)]/2 = [6 + 6 − 6]/2 = 3, i.e., the lcs of x and y is of length 3. Moreover, from the optimal path shown in Figure 8.5 we extract three identical substitutions, namely (a → a), (b → b), (a → a), which correspond to the lcs aba. Note that in this example the lcs is not unique.
Obviously, the algorithm for lcs computation can be used to find out whether a given sequence s is a subsequence of another given sequence S. We just run the algorithm and check whether |s| = l, where l is the length of the lcs of s and S according to equation
(8.2).
It is easy to see that the algorithm given in Figure 8.2 has a time and space complexity
equal to O(nm). We want to remark that the present section provides only a basic introduction to the field of sequence matching. For further details, including algorithms with a lower time or space complexity, the reader is referred to [80, 152, 163].
Fig. 8.6. Algorithm for edit distance, d(s, S), of graph sequences s = g1 . . . gn and S =
G1 . . . Gm .
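Since the body of the algorithm of Fig. 8.6 is not reproduced here, the following sketch shows one plausible reading of it: the edit-distance recurrence of Section 8.2 with graphs as sequence elements and a graph distance d_graph supplying the substitution cost (the function names and the fixed insertion/deletion costs are our assumptions):

def graph_sequence_distance(s, S, d_graph, c_del=1.0, c_ins=1.0):
    n, m = len(s), len(S)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c_del
    for j in range(1, m + 1):
        D[0][j] = j * c_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + d_graph(s[i - 1], S[j - 1]),  # substitution
                D[i - 1][j] + c_del,                            # deletion
                D[i][j - 1] + c_ins,                            # insertion
            )
    return D[n][m]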
database such that the pair Gi−1, Gi corresponds to the pair gn−1, gn, and if furthermore the transition from Gi−1 to Gi is known to be anomalous, then we may conclude that also the transition from gn−1 to gn is anomalous. Such a strategy, which compares
actual input data to past events of known type, follows the paradigm of case-based
reasoning, where past cases are used to solve an actual problem. For an introduction
to, and overview of, case-based reasoning, see [107].
In the rest of this section we will discuss the strategy that was sketched in the last paragraph in greater detail. The first question that arises in the implementation of a scheme for anomalous change detection using a library of past cases concerns the length of sequence s. We assume here that there exist some deterministic, but unknown, rules that partly govern the network behavior. In particular, we assume that whether the transition from gn−1 to gn is anomalous depends not only on gn−1 and gn, but on some longer, but finite, history, i.e., on a suffix of sequence s of length t ≥ 2. Finding an appropriate value of the relevant length t of the suffix of s to be used to decide whether the transition from gn−1 to gn represents an anomalous event is the first important step to be solved. Clearly, if the value of t is too small then we may not have enough information to decide of which type the event leading from gn−1 to gn is. On the other hand, if t is too large we may unnecessarily slow down the computation or introduce noise. In order to simplify our notation, we assume that n = t, i.e., we assume that the whole length of sequence s is relevant to classifying the transition from gn−1 to gn as anomalous or not. The actual choice of a suitable value of parameter t will usually need some prior knowledge that depends on the actual application, and may require some experimental investigation.
Given a fixed value of parameter t, let us furthermore assume that we know that the transition from Gi−1 to Gi in sequence S of the database is anomalous. Then we can compute the distance of s = g1 . . . gn and Gi−n+1 . . . Gi. If this distance d(g1 . . . gn, Gi−n+1 . . . Gi) is smaller than a given threshold we may conclude that the transition from gn−1 to gn is anomalous. Moreover, we may conclude that the abnormality is of the same type as the one corresponding to the transition from Gi−1 to Gi. Hence detection of anomalous events can be solved by means of the graph sequence matching algorithm introduced in this chapter. If there is more than one anomalous event recorded in the database, then all of these events can be used for comparison to sequence s. If more than one anomalous event from the database matches the actual sequence, then the closest match, i.e., the event corresponding to the sequence with the smallest distance, can be used to determine the type of abnormality.
When matching sequence g1 . . . gn with Gi−n+1 . . . Gi, we can either disable insertions and deletions or allow them, depending on the particular application. Disabling insertions and deletions can be easily accomplished by defining the costs of these edit operations to equal infinity. If insertions and deletions are disabled, then the two sequences under consideration must be of the same length, and computing the edit distance of two sequences of equal length, s = g1 . . . gn and S = G1 . . . Gn, reduces to computing the sum of the substitution costs c(gi, Gi), i.e., in this case

d(g1 . . . gn, G1 . . . Gn) = Σ_{i=1}^n c(gi, Gi) .

Note that this operation requires only O(nk) time, where parameter k represents the cost of computing c(gi, Gi).
If deletions and insertions are enabled, one may consider a time window longer than n in the sequences in the database to be matched with the input sequence s. That is, if s = g1 . . . gn and the transition from Gi−1 to Gi in database sequence S is known to be anomalous, then we may wish to compute d(g1 . . . gn, Gi−N+1 . . . Gi) rather than d(g1 . . . gn, Gi−n+1 . . . Gi) for some N > n. Again, finding an appropriate value of parameter N may require a priori knowledge about the problem domain and experimental investigation.
8.4.2 Prediction of Anomalous Events
We assume again that a database with past time series of graphs is at our disposal, where it is known for each transition from Gi−1 to Gi in each sequence S whether the transition represents an anomalous event. Given an actual time series of graphs s = g1 . . . gn, we consider the task of predicting whether the transition from gn to gn+1 will be anomalous. It is assumed, similarly to Section 8.4.1, that the behavior of the network to be monitored is to some degree deterministic, though the concrete rules that govern the network's behavior are unknown. Consequently, we may conclude that an anomalous event is likely to occur if there exists a sequence S = G1 . . . Gm in the database that contains a (contiguous) subsequence Gi−n+1 . . . Gi that is similar to s, and the transition from Gi to Gi+1 represents an anomalous event. This strategy is based on the assumption that, given two sequences s and S, if the first n graphs are similar in both sequences, the elements at position n + 1 will also be similar.
To implement a scheme for prediction as sketched in the last paragraph, we need to fix again parameter t that defines the length of the relevant time window. This parameter has the same meaning as in Section 8.4.1. In order to simplify our notation we assume again that t = n. In analogy to Section 8.4.1 let the transition from Gi to Gi+1 represent an anomalous event. Then we compute the distance d(g1 . . . gn, Gi−n+1 . . . Gi) and predict that the transition from gn to gn+1 will be an anomalous one if this distance is smaller than some predefined threshold.
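This prediction scheme can be sketched on top of the sequence distance above; the database format (sequences paired with the 0-based indices i of their known anomalous transitions Gi to Gi+1) and all names are our assumptions:

def predict_anomalous(s_suffix, database, d_graph, threshold):
    n = len(s_suffix)
    for S, anomalous in database:
        for i in anomalous:
            if i + 1 >= n:                  # window G_{i-n+1} .. G_i exists
                window = S[i - n + 1 : i + 1]
                if graph_sequence_distance(s_suffix, window, d_graph) < threshold:
                    return True             # an anomalous transition is predicted
    return False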
Similarly to Section 8.4.1, two different instances of this scheme seem feasible. We
may choose either to disable or enable insertions and deletions. In the second case, it
may be meaningful to consider a time window in the sequences in the database that is
longer than n.
8.4.3 Recovery of Incomplete Network Knowledge
For a number of reasons we may sometimes have to deal with incomplete time series of graphs, where at certain points in time no graphs are available. For the sake of simplicity we consider just the case of a simple gap of length one in a given time series of graphs. That is, we assume that an actual time series of graphs s = g1 . . . gi−1 gi+1 . . . gn is
given, where graph gi is missing. For each window of length n in a database sequence S = G1 . . . Gm we compute

d(g1 . . . gi−1 ∗ gi+1 . . . gn, Gj+1 . . . Gj+n) = Σ_{k=1}^n c(gk, Gj+k) ,

for j = 0, . . . , m − n + 1. Symbol ∗ = gi is a dummy symbol for which we define c(∗, g) = 0 for any graph g. Clearly, whenever there exists a subsequence Gj+1 . . . Gj+i−1 Gj+i Gj+i+1 . . . Gj+n for which d(g1 . . . gi−1 ∗ gi+1 . . . gn, Gj+1 . . . Gj+i−1 Gj+i Gj+i+1 . . . Gj+n) is smaller than some predefined threshold, graph Gj+i qualifies to be used as a suitable substitute for the missing graph gi.
8.5 Conclusions
In this chapter we have proposed a novel class of algorithms for graph sequence matching. While both the matching of graphs and the matching of sequences of symbols have been individually addressed in the literature before, this is the first attempt to combine both concepts into a common framework that allows us to compare sequences in which the individual elements are given in terms of graphs. Our novel algorithms are
subsequence of S. Moreover, we have introduced a procedure that is able to compute
the distance of two graph sequences based on a set of edit operations, each of which
affects a complete graph in one of the two given sequences. Also, some applications
of the proposed algorithms in the analysis of network behavior have been discussed.
Compared to previous work in network analysis that uses the same kind of graph representation as in this chapter, the proposed algorithms can be expected to be distinguished by increased robustness and precision in the detection of anomalous events, since they take more information into account when classifying an event as normal or abnormal. Moreover, there are a number of novel tasks in network analysis that
can be addressed, thanks to the improved capabilities of the new tools proposed in this
report. Examples include prediction of anomalous behavior and recovery of missing
information. One crucial requirement, however, is the availability of a sufficiently large
database that represents typical network behavior of both normal and abnormal type.
In Section 8.4.3 we have shown how a missing graph in a time series can be recovered. A variation of this task consists in the reconstruction of partial graphs. Here we assume that at time i only a partial graph gi was observed and we are interested in reconstructing the missing portion of graph gi. Obviously, this problem can be solved by the method described in Section 8.4.3 if we don't distinguish whether graph gi is completely missing or was only partially observed. Another possible solution is, however, to apply some kind of graph interpolation operator that would reconstruct gi from gi−1 and gi+1, or from a larger window. That is, under this scheme we would not need any of the graph sequences stored in the database but rely only on graphs in the actual
input sequence. This topic will be discussed in greater detail in Chapter 11.
Another open question is whether more elaborate versions of the algorithms proposed in this chapter can be developed that would be able to find appropriate values of some of the involved parameters automatically. For example, in all of the methods discussed in Section 8.4 it is necessary for the user, or the system developer, to fix the length t of the relevant sequences or subsequences that are to be taken into account in comparing the actual time series of graphs with the cases stored in the database. It would be an advantage from the user's point of view if appropriate values of this parameter could be automatically found by our graph sequence matching algorithm.
Part III
9
Distances, Clustering, and Small Worlds
Section 9.3 presents an example network for calculating vertex eccentricities and graph
diameter.
Of course, the distance function in the definition of eccentricity could be replaced by the weighted distance. In that case we would refer to the weighted eccentricity and weighted graph diameter.
9.1.3 Average Distances
It is of interest to know the typical distance between two vertices in a graph. One
obvious parameter is the mean path length, the mean value of D(x, y), where x and y
vary through the vertices. While this is a useful parameter when known, its calculation is
impractical for large graphs such as those that arise in discussing intelligent networks. It
is necessary to estimate the average. Such estimation must be carried out by sampling.
Since calculating a median is much faster than calculating a mean, there are better
sampling techniques available for accurately estimating medians than for means, so
Watts and Strogatz considered the median path length as a measure.
One might think that the median would be a significantly less reliable measure of average distance than the mean. However, Huber [93] studied the sampling technique for estimating medians and estimated its accuracy. The results are summarized in [184, pp. 29-30] and show that the median is quite a good estimator.
Another theoretical advantage is that medians can be calculated even if the graph is disconnected, provided the number of disconnected pairs of vertices is not too large. (A disconnected pair corresponds to an infinite path, which is longer than any finite path. Provided fewer than half the paths are infinite, the median can be calculated.) Newman [135] suggests ignoring disconnected pairs, but we believe that the infinite-distance technique is more appropriate.
To approximate a median of a set is straightforward, because the sample median is an unbiased estimator of the median (and of the mean, in normal or near-normal data). One simply takes a number of readings and finds their median.
If S is a set with n elements, its median is defined as that value M(S) such that at least n/2 members of S have value ≤ M(S) and at least n/2 members have value ≥ M(S). Let us say that an estimate M of M(S) is of accuracy ε if at least εn/2 members have value ≤ M and εn/2 members have value ≥ M. Then Huber proved the following theorem:

Theorem 9.3. [93] Suppose ε (usually near 1) and δ (usually small) are positive constants. Approximation of the median M(S) of a set S by sampling s readings yields a value of accuracy at least ε with probability 1 − δ, where

s = 8 ln(2/δ) / (1 − ε)² .
9.1.4 Characteristic Path Length
The median of a finite set of readings is a discrete-valued function: it takes either one of the reading values or the average of two of them. This leads to some minor inaccuracies.
Consider, for example, two sets of readings: A contains 49 readings 0 and 50 readings
1, while B contains 50 readings 0 and 49 readings 1. These sets are very similar, but the
medians are 0 and 1 respectively. To avoid such chaotic leaps in value, the following definition is used.

The characteristic path length L(G) of a graph G is calculated as follows. First, for each vertex x, the median D̄x of all the values D(x, y), y a vertex, is calculated. Then L(G) is the mean of the values D̄x, for varying x. So, in the sampling procedure suggested by Huber, one calculates a number of values of D̄x and takes their median.
9.1.5 Clustering Coefficient
The neighborhood of a set of vertices consists of all the vertices adjacent to at least one
member of the set, excluding members of the set itself. If the set consists of a single
vertex x, we denote the neighborhood of {x} by N(x). The graph generated by N(x), denoted by ⟨N(x)⟩, has vertex set N(x), and its edges are all edges of the graph with both endpoints in N(x). We write k(x) (the degree of x) and e(x) for the numbers of vertices and edges, respectively, in N(x). Then the clustering coefficient γx of x is

γx = e(x) / (k(x)(k(x) − 1)/2) = 2e(x) / (k(x)(k(x) − 1)) .
In other words, it equals the number of connections between the neighbors of x divided
by the maximum possible number of connections.
The clustering coefficient of a graph G equals the mean of the clustering coefficients of all vertices of G, and is denoted γ(G) or simply γ.

The extreme maximum value γ(G) = 1 occurs if and only if G consists of a number of disjoint complete graphs of the same order (each has k + 1 vertices, where every vertex has the same degree k). The extreme minimum value γ(G) = 0 is attained if and only if the graph G contains no triangles.
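A sketch of γx and γ(G) in Python, again assuming networkx (whose built-in nx.clustering computes the same per-vertex quantity for undirected graphs); names are ours:

import networkx as nx

def gamma_x(g, x):
    nbrs = set(g[x])                         # N(x)
    k = len(nbrs)
    if k < 2:
        return 0.0                           # convention for degree < 2
    e = g.subgraph(nbrs).number_of_edges()   # e(x)
    return 2 * e / (k * (k - 1))

def gamma(g):
    return sum(gamma_x(g, x) for x in g) / g.number_of_nodes()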
9.1.6 Directed Graphs
The theory of these measures is much the same if directed graphs are considered. However, in calculating the clustering coefficient in a directed graph, the formula

γx = e(x) / (k(x)(k(x) − 1))

should be used. In other words, the result should be halved, corresponding to the fact that there are k(k − 1) possible directed edges between k vertices.
9.2 Diameters
Graph diameter can be used as a measure of difference between two graphs. Given two graphs G and H, simply calculate the difference between their respective diameters. The difference f(G, H) is given by

f(G, H) = |D(G) − D(H)| .
Finally, it must be shown that the triangle inequality f(G, H) ≤ f(G, K) + f(K, H) holds for all graphs G, H, K. Without loss of generality it is assumed that D(G) ≥ D(H). Three cases are considered. In the first case, let D(G) ≥ D(H) ≥ D(K). Then

|D(G) − D(K)| + |D(K) − D(H)| = D(G) − D(K) + D(H) − D(K),

but D(H) − D(K) ≥ 0 and D(K) ≤ D(H), so

D(G) − D(K) + D(H) − D(K) ≥ D(G) − D(K) + 0
                          ≥ D(G) − D(H)
                          = |D(G) − D(H)|.

For the second case, let D(G) ≥ D(K) ≥ D(H). Then

|D(G) − D(K)| + |D(K) − D(H)| = D(G) − D(K) + D(K) − D(H)
                              = D(G) − D(H)
                              = |D(G) − D(H)|.

In the last case, let D(K) ≥ D(G) ≥ D(H). Then

|D(G) − D(K)| + |D(K) − D(H)| = D(K) − D(G) + D(K) − D(H),

but D(K) − D(G) ≥ 0 and D(K) ≥ D(G), so

D(K) − D(G) + D(K) − D(H) ≥ 0 + D(K) − D(H)
                          ≥ D(G) − D(H)
                          = |D(G) − D(H)|.

Axioms (i)-(iv) hold, and therefore f(G, H) = |D(G) − D(H)| is a pseudometric.
It is important to note that it is possible to define an equivalence class for which the function f is a metric. If we define [G] to be the set of all networks with diameter equal to D(G) and write D([G]) for the (common) diameter of members of [G], then f([G], [H]) = |D([G]) − D([H])| is a metric on the (quotient set of) equivalence classes {[G]}.
9.2.2 Sensitivity Analysis
In addition to being a pseudometric, if difference in graph diameter is to be a useful
indicator of changes in network topology, it is important that small changes in a relatively dense unweighted network should not result in large changes in the diameter. To
investigate the sensitivity of our distance measure, we look at the family of complete
graphs and simple cycles.
The eccentricity of any vertex of the complete graph Kn is 1, so the diameter is D(Kn) = 1. If one edge is deleted, two vertices of the resulting graph G will have eccentricity 2, so D(G) = (n + 2)/n. If k edges are deleted, where k ≤ ⌊n/2⌋, at most 2k vertices will have eccentricity 2, and the others will have eccentricity 1. So D(G) ≤ 2.
while

D(C9) = D(C9^2) = 4,   D(C9^3) = D(C9^4) = 3.56.

Theorem 9.6. If n ≡ 0 (mod 4), then

D(Cn^i) ≥ D(Cn^{n/2}) = 3n/8 .

If n ≡ 2 (mod 4), then

D(Cn^i) ≥ D(Cn^{(n−2)/2}) = 3n/8 + 4/(8n) .

If n ≡ 1 (mod 4), then

D(Cn^i) ≥ D(Cn^{(n−1)/2}) = D(Cn^{(n−3)/2}) = (3n + 1)/8 + 4/(8n) .

If n ≡ 3 (mod 4), then

D(Cn^i) ≥ D(Cn^{(n−1)/2}) = 3n/8 + (2n − 9)/(8n) .
Proof. In the first case, it is easy to see that the graph diameter is smallest when the chord is of length n/2: in those cases the eccentricities of the vertices are n/2 − 1 (this is the eccentricity of two vertices, the vertices furthest from the endpoints of the chord), n/2 − 2 (four vertices adjacent to the two just mentioned), n/2 − 3 (four vertices), . . . , n/4 + 1 (four vertices), n/4 (two vertices, the endpoints of the chord). The average is as shown.
The other cases are handled similarly. It is interesting to note that when n ≡ 2 (mod 4), D(Cn^{n/2}) = D(Cn^{(n−2)/2}) + (n − 2)/(2n).
Theorem 9.6 shows that the addition of an edge can reduce the diameter of Cn by approximately 25%, from n/2 to approximately 3n/8. This discussion shows that the distance measure based on graph diameter is sensitive to small changes in sparse graphs and less sensitive for small changes in nearly complete graphs.
This analysis is pointless in the case of weighted networks, because the difference
in diameter due to inclusion or deletion of one edge can be made as large as we please
by increasing the weight of the edge.
Fig. 9.1. An example graph G on the vertex set {1, 2, . . . , 7}.
Let G be the graph in Figure 9.1 with V = {1, 2, . . . , 7} and E = {{1, 2}, {1, 3}, {2, 5}, {3, 5}, {4, 7}, {5, 6}, {5, 7}, {6, 7}}.1 The eccentricities for the example graph are ε(1) = 4, ε(2) = 3, ε(3) = 3, ε(4) = 4, ε(5) = 2, ε(6) = 3, ε(7) = 3. Taking an average of all the eccentricities, D(G) ≈ 3.143.
Fig. 9.2. The graph of Figure 9.1 with the edge {3, 4} added.
To demonstrate that graph diameter can detect small changes in graph topology, an edge is added to the graph in Figure 9.1. Figure 9.2 shows the graph with the edge {3, 4} added. Let this graph be H = (V, E ∪ {{3, 4}}). It is left to the reader to verify that the graph diameter D(H) ≈ 2.714 and the difference f(G, H) ≈ 0.429.
1 The example graph is undirected; therefore, the edges are just unordered pairs of adjacent
vertices. It is trivial to convert the graph to directed as described in Section 9.1.
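The example can be reproduced with a few lines of Python (networkx assumed; names are ours), computing D as the average of the vertex eccentricities:

import networkx as nx

edges = [(1, 2), (1, 3), (2, 5), (3, 5), (4, 7), (5, 6), (5, 7), (6, 7)]
G = nx.Graph(edges)
H = G.copy()
H.add_edge(3, 4)

def avg_ecc(g):
    return sum(nx.eccentricity(g).values()) / g.number_of_nodes()

print(avg_ecc(G))                    # 22/7 ≈ 3.143
print(abs(avg_ecc(G) - avg_ecc(H)))  # f(G, H) = 3/7 ≈ 0.429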
Fig. 9.3. Time series of change based on the difference in graph diameter, normalized by the maximum change over the interval (horizontal axis: days).
Figure 9.3 shows the time series of change based on the difference in graph diameter
normalized by the maximum change over the interval. Peaks in the time series represent
potential abnormal change. Abnormal change would be based on deviations from some
normal state of the network.
Figure 9.4 shows the time series change based on the difference in weighted graph
diameter, normalized by the maximum change over the interval. As in the plot in Figure
9.3, peaks represent potential abnormal change. The peaks and valleys differ between the
two plots, since different factors are compared for weighted and nonweighted graphs.
The nonweighted plot would show topological changes where edges are either added or deleted. The weighted plot, based on edge traffic, can show both topological changes as well as abnormal changes in edge traffic. As demonstrated in Sections 9.2.1 and 9.3, both are sensitive to small changes in the network.
Fig. 9.4. Time series of change based on the difference in weighted graph diameter, normalized by the maximum change over the interval (horizontal axis: days).
Fig. 9.5. Time series of graph diameter (vertical axis: diameter, horizontal axis: day).
As can be seen in Figure 9.5, there are still peaks and valleys in the time series plot of graph diameter, but these peaks and valleys have a different connotation than in the change time series of Figures 9.3 and 9.4, which were based on the difference of two graphs. From a
communications network perspective, large graph diameter implies a longer average
communication path between any two vertices. From the aspects of performability and
Suppose G is a 1-lattice with n vertices, for which k is even and at least 2. Then

γ(G) = 3(k − 2) / (4(k − 1)) .

The exact value of L(G) depends on the remainder when n is divided by k, but approximately,

L(G) = n(n + k − 2) / (2k(n − 1)) .
Random Graphs
There are two standard models of random graphs in the mathematical literature (see,
for example, [16]).
A graph of type G(n, M) has n vertices and M edges. The M edges are chosen from the n(n − 1)/2 possibilities in such a way that any of the possible M-sets is equally likely to be the one chosen. (In statistical terminology, the edges are a simple random sample of size M from the possibilities.)

A graph of type G(n, p) has n vertices. The probability that the vertex-pair xy is an edge is p, and the n(n − 1)/2 events "xy is an edge" are independent.

In studying most properties (in particular, the properties that will interest us), these two models of random graphs are interchangeable, provided M is approximately np. When this is true, it is more usual to study G(n, p), since the independence makes mathematical discussions simpler.
The average degree of vertices in a G(n, p) is clearly p(n − 1). The clustering coefficient is approximately p; in other words, for a reasonably large random graph of average degree k, the clustering coefficient is approximately k/n.
The other two networks analyzed were electrical in nature: the power grid for the western United States and the neural system of the much-studied nematode C. elegans. (Nematodes are a class of worms comprising 10,000 known species. The parasitic kinds, such as hookworm, infect nearly half the world's human population.) The power grid has n = 4941 vertices, consisting of generators, transformers, and substations, with an average of k = 2.67 high-voltage transmission lines per vertex. C. elegans has a mere n = 282 neurons (not counting 20 neurons in the worm's throat, which biologists have not yet completely mapped), with an average of k = 14 connections (synapses and gap junctions) for each neuron.
The power grid has a rather low clustering coefficient γ = 0.08. But this is 160 times what would be expected for a random network of the same size. The average distance between vertices is 18.7, far below the 925 predicted for a perfectly regular network. (Actually, this is an unfair comparison, since Watts and Strogatz's regular network is essentially 1-dimensional, whereas the power grid is inherently 2-dimensional. It would be fairer to compare the power grid to a hexagonal honeycomb, for which most vertices have three links. The average distance between two vertices in an m × m honeycomb, which has n = 2m(m + 2) vertices altogether, is approximately 2m. For n = 4941, that is approximately 98.) For C. elegans, the clustering coefficient is 0.28 (as against 0.05 for the random model), and the average distance is 2.65 (with 2.25 calculated for the random model and 10 for the corresponding lattice graph with k = 14).
Extensive surveys of small-world phenomena appear in the literature [6, 184].
325 business domains observed over this physical link in a day were then represented as a directed and labeled graph. Vertex-weight identified the business domains of logical nodes communicating over the physical link, with edge-weight denoting the total traffic transmitted between corresponding OD pairs over a 24-hour period.

Successive log files collected over subsequent days produced a time series of corresponding directed and labeled graphs representing traffic flows between business domains communicating over the physical link in the network. Log files were collected continuously over a period of several months, from July 19 to October 25, 1999, and weekends and public holidays were removed to produce a final set of 90 log files representing the successive business days' traffic.
9.6.2 Results on Enterprise Graphs
To test our hypothesis, we used our data to produce two samples of enterprise graphs by removing weights and directions (see Sections 3.4.2 and 4.2). The resulting graphs represented connections in the network. Time series of both the original graphs (with 45,000 vertices corresponding to individual users) and the domain graphs (with 325 vertices corresponding to the business domains) were studied.
Both characteristic path length and clustering coefficient can be discussed in the case of directed graphs. The theory of both these measures is much the same in the directed case. However, in calculating the clustering coefficient in a directed graph, the formula

γx = e(x) / (k(x)(k(x) − 1))

should be used. In other words, the result should be halved, corresponding to the fact that there are k(k − 1) possible directed edges between k vertices.
Fig. 9.8. Time series of characteristic path lengths for the original enterprise graphs (vertical axis: distance, horizontal axis: time).
Fig. 9.9. Time series of characteristic path lengths for the domain graphs (vertical axis: distance, horizontal axis: time).
Much work has been done on random directed graphs. To have a directed benchmark
for highly structured graphs, we define the directed version of a d-lattice by replacing every edge xy in the undirected case by the pair of edges {xy, yx}. The formulas for characteristic path length and clustering coefficient apply to the directed lattices.
We measured the characteristic path lengths for the directed versions of our two
series of enterprise graphs. In the original graphs, the characteristic path length ranges
approximately from 3 to 4.5, averaging between 3.8 and 3.9. In the domain graph series
the average is about 2.2; the length is never less than 2, and there were only three days
when it was greater than 2.35. This is consistent with the hypothesis that enterprise
graphs behave like small-world graphs in the directed case also. The time series of
characteristic path lengths are shown in Figures 9.8 and 9.9.
In both series of graphs, the clustering coefficient ranges approximately from 0.58 to 0.75, averaging about 0.67. In the domain graph series there is very little variation; the original graph series is a little less tight. These figures are very close to the lattice graph figure, 0.749. This is also consistent with the small-world hypothesis. The time series of clustering coefficients are shown in Figures 9.10 and 9.11.

We produced a histogram of the values γx for all vertices x on the first day (day 0) of the series. These histograms are shown in Figures 9.12 and 9.13.
In our examples of enterprise graphs, there were very few situations in which communication was not symmetric, that is, if there was a directed edge from x to y, there
was usually a directed edge from y to x. This would not necessarily be true in some
enterprise graphs, for example in a highly structured social network where instructions
or data are broadcast from a source that does not accept answering messages. While
the characteristic path length would be higher in such networks (due to the existence
of a larger number of infinite paths), and the clustering coefficient would be lower, we predict that the underlying graph will still fit the small-world pattern.

Fig. 9.10. Time series of clustering coefficients for the original enterprise graphs (horizontal axis: time).

Fig. 9.11. Time series of clustering coefficients for the domain graphs (horizontal axis: time).
Figs. 9.12 and 9.13. Histograms of the values γx for all vertices x on the first day (day 0) of the two series of enterprise graphs.

9.6.3 Discovering Anomalous Behavior

We do not derive a measure of change between snapshots of the network from the characteristic path length or clustering coefficient. However, the relative consistency
of the characteristic path length, particularly for the domain graphs, suggests that one should study those days with a markedly different characteristic length, to see whether they correspond to other marked differences in the network. Therefore the small-world measures provide a significant tool for network management.
We would also suggest that the upward trend in the clustering coefficient be investigated. If it in fact reflects a seasonal or other consistent change in the network, this would be important, since other measures that have been tested on these data do not reflect such a consistent change.
10
Tournament Scoring
10.1 Introduction
Various methods have been used to assess the relative strengths of players in round-robin tournaments. These methods involve the digraph representing the results of the tournament. (Somewhat confusingly, this digraph is also called a tournament.) In this chapter we shall generalize these techniques to provide a graph measure that we call modality. We propose techniques to measure rapid change in the structure of the network by looking at changes in modality.
The ideas of matrix theory will be used in discussing tournaments. In addition to
standard references such as [92] and [114], another useful reference for nonnegative
matrices is [145]. We shall use the notation ‖v‖ to denote the (Euclidean) norm of the vector v. Thus, if v has n elements,

‖v‖ = ( Σ_{i=1}^n v_i² )^{1/2} .
10.2 Tournaments
10.2.1 Definitions
In general, a tournament is an oriented complete graph. In other words, it is a digraph
in which every pair of vertices is joined by exactly one arc.
In sports scheduling, a tournament is an event in which a number of players compete
two at a time. A round-robin tournament is one in which every pair of players competes exactly once, so a (graphical) tournament represents the results of a (scheduling)
round-robin tournament in which ties are impossible: the players are the vertices and x → y whenever x beats y. We shall use the graphical and scheduling terminologies interchangeably.
10.2.2 Tournament Matrices

The win-loss outcomes of the matches in a tournament can conveniently be recorded in a tournament matrix A = [aij], where aij = 1 if i → j and aij = 0 otherwise; in particular, A has zero diagonal. Such a matrix is called a tournament matrix. A square zero-one matrix A is a tournament matrix if and only if

A + Aᵀ = J − I .
As an example, consider the 6-player tournament in which
1 beats 2, 3, 4, 5;
2 beats 3, 4, 5, 6;
3 beats 5, 6;
4 beats 3, 6;
5 beats 4, 6;
6 beats 1.
The corresponding tournament matrix is

A = [ 0 1 1 1 1 0 ]
    [ 0 0 1 1 1 1 ]
    [ 0 0 0 0 1 1 ]
    [ 0 0 1 0 0 1 ]
    [ 0 0 0 1 0 1 ]
    [ 1 0 0 0 0 0 ] .

A square matrix is called reducible if its rows and columns can be simultaneously permuted to bring it to the block form

[ X Y ]
[ O Z ] ,

where the submatrices X and Z are square, and irreducible otherwise. A tournament is called reducible if there is some ordering of the players for which the tournament matrix is reducible, and called irreducible otherwise.
In the example, players 1 and 2 have score 4; players 3, 4, and 5 have score 2; and player 6 has score 1. There is an immediate problem: how do we rank players 1 and 2 in this example? Our first intuition is that 1 should outrank 2, because 1 beat 2 when they met. Is this appropriate, or should they be considered equal? The
second problem we notice is the anomalous behavior of player 6. This player won
only one match. However, it was against one of the two strongest players. Should this
performance receive extra credit? In general, should we give more credit to a player
who beats better opponents?
10.3.2 Kendall–Wei Ranking
We outline a method due to Wei [187] and Kendall [102] that takes into account the
quality of the opponents. We assume that the tournament is irreducible (if the tournament
is reducible, all players in the upper part are ranked above all players in the lower part).
We assume that there is a nonnegative quantity that we shall call the (relative) strength
of a player, and that the players should be ranked according to strength. So the object of
a scoring system is to estimate the strengths of players. We denote the strength of player
i by wi, and (assuming there are n players) define the strength vector of a tournament
to be
w = (w1 , w2 , . . . , wn ) .
For convenience, we assume that the strength vector is normalized, so that the sum of
strengths of the players adds to 1. Suppose we have an estimate of all the strengths of the
players. One could define the strength of player i to depend on the sum of the strengths
of the players that i beats. The Kendall–Wei method approximates w by finding a sequence of vectors w^1, w^2, . . . . In w^1, the strength of player i is proportional to the number of matches won in the tournament: in the example, we take

v^1 = (4, 4, 2, 2, 2, 1)

and define w^1 to be v^1/‖v^1‖. When v^{j−1} is determined, we define

v_i^j = Σ_{i→k} v_k^{j−1} .

In other words, the strength of player i in the jth iteration is derived from the sum of the (j − 1)st iteration strengths of the players beaten by i. This is a reasonable approximation to the suggested principle, and also is easy to compute: v^1 = Ae, and v^j = Av^{j−1}, so in general v^n = A^n e. Then w^n = v^n/‖v^n‖. If the sequence (w^n) approaches a limit, then that limit is a reasonable value for w. Proceeding with the
example, we obtain
v^2 = (10, 7, 3, 3, 3, 4),
v^3 = (16, 13, 7, 7, 7, 10),
v^4 = (34, 31, 17, 17, 17, 16),
v^5 = (82, 67, 33, 33, 33, 34),
v^6 = (166, 133, 67, 67, 67, 82).
So w^6 is approximately (.65, .52, .26, .26, .26, .32). At this stage player 1 is significantly stronger than player 2, and player 6 has moved clearly into third place. It is not obvious that convergence occurs, but we would expect the order of strength of the players to stay unchanged in future iterations. An important question is, of course, whether convergence takes place.
10.3.3 The Perron–Frobenius Theorem
The spectral radius of a real square matrix A, denoted by ρ(A) or simply ρ, is the largest of the absolute values of the eigenvalues of A. The matrix A is called primitive if all its entries are nonnegative and if there is some positive integer power of A all of whose entries are positive. The following theorem guarantees convergence of the strength vector of a tournament matrix. It comes from a theorem of Perron [142], generalized by Frobenius [74] (see [114, p. 538]).
Theorem 10.1 (Perron–Frobenius). Suppose A is an irreducible matrix with all entries nonnegative. Then ρ(A) is an eigenvalue of multiplicity one for A, and there is a corresponding eigenvector v all of whose entries are positive. Moreover, any nonnegative eigenvector of A is a multiple of v. If, furthermore, A is primitive, then every other eigenvalue of A has absolute value less than ρ(A).
The multiple of v with norm 1, the eigenvector v/‖v‖, is called the Perron vector of A, and denoted by v(A). If the eigenvalue λ of largest magnitude of a matrix A is unique, then it is known that lim_{k→∞} A^k e/‖A^k e‖ exists, and in fact is an eigenvector corresponding to λ. This is the power method, a standard technique for finding the eigenvalue of largest magnitude of A when that eigenvalue is unique (see, for example, [75, p. 209]). This means that the Kendall–Wei method always gives a result whenever there are four or more players, since all tournament matrices of size greater than three are primitive.
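As an illustration of the power method, the following sketch (in Python with numpy; the code and its names are ours, not part of the original treatment) reproduces the Kendall–Wei iteration for the 6-player example above.

import numpy as np

# Tournament matrix of the 6-player example; rows/columns are players 1..6.
A = np.array([
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0],
], dtype=float)

def kendall_wei(A, iterations=100):
    """Approximate the Perron vector by the power method: w^n = A^n e / ||A^n e||."""
    v = A @ np.ones(A.shape[0])   # v^1 = Ae, the vector of match scores
    for _ in range(iterations - 1):
        v = A @ v                 # v^j = A v^{j-1}
        v /= np.linalg.norm(v)    # normalize each step to avoid overflow
    return v

print(np.round(kendall_wei(A), 3))
# The strengths rank the players 1 > 2 > 6 > {3, 4, 5}, matching w^6 above.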
period. Thus the snapshots form a time series of graphs. We have discussed these series
in [25].
Typically p_{ij} will be a count of the number of packets of information, but this is not necessary. We could use any function p_{ij} provided it is a monotone increasing function of the number of packets, and provided p_{ij} = 0 if and only if no information is transmitted. As defined, p_{ij} is symmetric, p_{ij} = p_{ji}, but one could equally well define p_{ij} to be the number of packets sent from i to j, and associate a digraph rather than a graph with the network. We shall discuss possible variations of p_{ij} in Section 10.6 below.
We now define the snapshot matrix corresponding to a snapshot to be the matrix A = (p_{ij}). A snapshot is called reducible if there is some ordering of the vertices for which the snapshot matrix is reducible, and called irreducible otherwise. Because of the symmetry of the matrix, a reducible snapshot is decomposable: the network decomposes into two or more components, with no communication between the components.
We define the underlying graph of a snapshot as the subgraph of G obtained by deleting edges (i, j) for which p_{ij} = 0. The underlying graph of a decomposable snapshot is disconnected, and the components that do not communicate correspond to the components of the underlying graph.
All the analysis in Section 10.3 can be carried out for matrices more general than tournament matrices. In fact, Kendall's original analysis put an entry ½ in every diagonal position, and Thompson [170] showed that this ½ may be replaced by any value r with 0 ≤ r < 1. Moreover, one could multiply the matrix by a positive constant without affecting the Perron vector. In particular, consider a snapshot matrix of a communication network. The Perron vector of such a matrix will be defined and will measure the relative strength of the nodes; the greater the entry in the ith position, the greater the strength of node i. What is this strength? From the discussion of the Kendall–Wei method, we see that the first approximation measures the amount of information that the node communicates with the rest of the network. Subsequent approximations weight this with the communicativeness of those that the node contacts, so the information will be disseminated more rapidly. If we were to treat the amount of information communicated to or originating from a node as a random variable, the first approximation to strength is the value of the variable at that node; the largest value corresponds to the mode of the distribution. Consequently, we refer to the measure as the modality of the node. The Perron vector will be called the modality vector of the network.
Various problems exist where a measure of difference between two graphs is required; see, for example, the discussion in [157]. In order to make numerical comparisons between differences in different situations, it is desirable that measures of graph difference have metric properties (see [30]), and the modality distance is clearly a metric: if there are n nodes, then the modality distance is derived directly from the Euclidean distance metric in n-dimensional space.
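The computation can be sketched as follows (Python with numpy; we assume, as described above, that the modality vector is the Perron vector of the snapshot matrix with diagonal entries set to r and remaining zeros replaced by a small positive value, and that the modality distance between two snapshots is the Euclidean distance between their modality vectors; the toy matrices are our own).

import numpy as np

def modality_vector(P, r=0.1, eps=1e-5, iterations=200):
    """Perron vector of a snapshot matrix, computed by the power method.

    Following the treatment in the text, the diagonal is set to r and the
    remaining zero entries are replaced by a small positive value eps, so
    that the matrix is primitive and the power method converges.
    """
    A = P.astype(float)
    A[A == 0] = eps
    np.fill_diagonal(A, r)
    v = np.ones(A.shape[0])
    for _ in range(iterations):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def modality_distance(P1, P2):
    """Euclidean distance between the modality vectors of two snapshots."""
    return np.linalg.norm(modality_vector(P1) - modality_vector(P2))

# Two toy 3-node snapshots (symmetric packet counts p_ij).
P_day1 = np.array([[0, 5, 2], [5, 0, 0], [2, 0, 0]])
P_day2 = np.array([[0, 5, 0], [5, 0, 4], [0, 4, 0]])
print(modality_distance(P_day1, P_day2))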
10.5.2 Applying the Distance Measure
The modality distance method was tested on two datasets, comprising 102 and 284 days, respectively, of TCP/IP traffic collected from a number of probes on an enterprise network, excluding weekends and public holidays. The 284-days data were collected about a year after the 102-days data. In the 102-days data set, the network administrators identified three days (22, 64, and 89) on which they thought the network had changed or behaved anomalously. Only on day 64 was there a suggested reason: the introduction of a web-based personnel and financial management system. There may have been other network changes in this period, but none were notable enough to be mentioned by the network administrators. From Figure 10.1, one can see that the modality distance produces an evident change point on day 64. All tests were performed on a PC with a Pentium 4 processor running at 3.00 GHz. Using Java-based REDBACK software, the 102-days data set required 7 seconds of CPU time, while the 284-days data set required 102 seconds of processing time. The running times seem reasonable, considering that eigenvalues of large matrices were computed. Also, we chose the diagonal parameter r from [170] equal to 0.1, and we replaced every other zero entry of the adjacency matrices by 1.0 × 10⁻⁵.
Fig. 10.1. Modality distances for the 102-days data set (axes: Modality vs. Days; days 22, 64, and 89 marked).
We also compared the modality distance results with a number of other time series graph distance measures. For example, the MCS distances [25, 30] using edges (Figure 10.2) and vertices (Figure 10.3) for the 102-days data set produced obvious large changes in network topology behavior on all three anomalous days. Naturally, since the graphs in the time series are labeled, these MCS-distance computations run in linear time; they produced the numerical distances in about 7 seconds on the same PC.
Fig. 10.2. MCS edge distances for the 102-days data set (axes: MCS Edge vs. days).
Fig. 10.3. MCS vertex distances for the 102-days data set (axes: MCS Vertex vs. Days).
Modality distance results for the 284-days data set are shown in Figure 10.4. We did not have any identified anomalous change points for this period, but the resulting numerical time series could be postprocessed with algorithms for change point detection and/or forecasting of network behavior.
Fig. 10.4. Modality distances for the 284-days data set (axes: modality vs. days).
10.7 Conclusion
In this chapter we proposed an application of tournament scoring to the analysis of communication networks. Tournament matrices record the win–loss outcomes of the matches in tournaments. Kendall–Wei ranking is an approach to ranking tournament outcomes that takes into account the quality of the players. The strength vector is approximated by a recursive sequence of vectors in which the strength of a player i is proportional to the sum of the strengths of all the players beaten by i. The question of the convergence of that sequence is solved by a version of the Perron–Frobenius theorem
for primitive matrices. That version states that for a primitive matrix, every other eigenvalue is strictly dominated by the spectral radius (which is a simple zero of the characteristic polynomial). We introduced snapshot matrices for the individual members of a time series of weighted digraphs and used them to implement tournament scoring. The modality distance was defined and used to measure the distance between consecutive members of the time series of digraphs. Some future directions for generalizing the measure (by varying the weight function) were discussed.
Part IV
11
Recovery of Missing Information in Graph Sequences
11.1 Introduction
Various procedures for the detection of anomalous events and network behavior were
introduced in previous chapters of this book. In the current chapter we are going to
address a different problem, viz., the recovery of missing information. Procedures for
missing information recovery are useful in computer network monitoring in situations
in which one or more network probes have failed. Here the presence, or absence, of
certain nodes and edges is not known. In these instances, the network management
system would be unable to compute an accurate measurement of network change. The
techniques described in this chapter can be used to determine the likely status of this
missing data and hence reduce false alarms of abnormal change.
This chapter is organized as follows. In the next section we address the problem of
recovering missing information in a computer network using three different heuristic
procedures that exploit graph context in time. In Section 11.3 an alternative approach to
the recovery of missing information in a computer network is proposed. This approach
makes use of decision tree classifiers. Finally, Section 11.4 draws conclusions and
discusses potential future work.
In the present section we will present three heuristic procedures for the recovery of
missing information in a computer network based on graph context in time. We start
by introducing the basic concepts and our notation in Section 11.2.1. Then the three
strategies for information recovery will be described in Sections 11.2.2 to 11.2.4.
11.2.1 Basic Concepts and Notation
We consider graphs with unique node labels. To represent graphs with unique node
labels in a convenient way, we drop the set V and define each node in terms of its unique label. Hence a graph with unique node labels is represented by a 3-tuple g = (L, E, γ), where L is the set of node labels occurring in g, E ⊆ L × L is the set of edges, and γ is the edge-labeling function, assigning a label γ(e) to each edge e ∈ E. The terms node label and node will be used synonymously in the remainder.
Fig. 11.1. The graphs g1, g2, and g3 of the example sequence.
In this chapter we will especially consider time series of graphs, i.e., graph sequences s = g1, g2, . . . , gN. The notation gi = (Li, Ei, γi) will be used to represent an individual graph gi in sequence s; i = 1, . . . , N. Motivated by computer network analysis applications, we assume the existence of a universal set of node labels, or nodes, L, from which all node labels that occur in a sequence s are drawn. That is, Li ⊆ L for i = 1, . . . , N and L = ⋃_{i=1}^{N} Li.¹ Given sequence s = g1, g2, . . . , gN, a subsequence of s is obtained by deleting the first i and the last j graphs from s, where 0 ≤ i + j ≤ N. Thus s′ = g_{i+1}, . . . , g_{N−j} is a subsequence of sequence s.²
As an example, consider sequence s = g1, g2, g3, where graphs g1, g2, and g3 are depicted in Figure 11.1. These graphs are formally represented as follows:
g1 = (L1, E1, γ1) with L1 = {A, B, C} and E1 = {(A, B), (B, C), (C, A)};
g2 = (L2, E2, γ2) with L2 = {A, B, D} and E2 = {(A, B), (B, D), (D, A)};
g3 = (L3, E3, γ3) with L3 = {A, C, D} and E3 = {(A, D), (D, C), (C, A)}.
We assume that γ1 = γ2 = γ3 = const and omit the edge labels in Figure 11.1. In this example we have L = {A, B, C, D}.
¹ In the computer network analysis application, L will be, for example, the set of all unique IP host addresses in the network. Note that in one particular graph gi, usually only a subset is actually present.
² The notation used here differs slightly from that of Chapter 8, where a subsequence as defined above is referred to as a continuous subsequence.
Given a time series of graphs s = g1, g2, . . . , gN and its corresponding universal set of node labels L, we can represent each graph gi = (Li, Ei, γi) in this series as a 3-tuple (αi, βi, γi) where:
• αi : L → {0, 1} with αi(l) = 1 if l ∈ Li and αi(l) = 0 otherwise;³
• βi : L × L → {0, 1} with βi(e) = 1 if e ∈ Ei and βi(e) = 0 otherwise;
• γi(e) is the label of edge e in gi, for each e ∈ Ei.
The definition of γi(e) means that each edge e that is in fact present in gi will have label γi(e).
The 3-tuple (αi, βi, γi) that is constructed from gi = (Li, Ei, γi) will be called the characteristic representation of gi, and denoted by χ(gi). Clearly, for any given graph sequence s = g1, g2, . . . , gN the corresponding sequence χ(s) = χ(g1), χ(g2), . . . , χ(gN) can easily be constructed and is uniquely defined. Conversely, given χ(s) = χ(g1), χ(g2), . . . , χ(gN) we can uniquely reconstruct s = g1, g2, . . . , gN.
As an example, consider graphs g1, g2, and g3 in Figure 11.1. As mentioned before, L = {A, B, C, D}. The following characteristic representations are obtained:
χ(g1) = (α1, β1, γ1) where
α1 : A → 1, B → 1, C → 1, D → 0
β1 : (A, B) → 1, (B, C) → 1, (C, A) → 1, (B, A) → 0, (C, B) → 0, (A, C) → 0
γ1 : (A, B) → const, (B, C) → const, (C, A) → const;
γ1(x, y) undefined for any other (x, y) ∈ {A, B, C} × {A, B, C}
χ(g2) = (α2, β2, γ2) where
α2 : A → 1, B → 1, C → 0, D → 1
β2 : (A, B) → 1, (B, D) → 1, (D, A) → 1, (B, A) → 0, (D, B) → 0, (A, D) → 0
γ2 : (A, B) → const, (B, D) → const, (D, A) → const;
γ2(x, y) undefined for any other (x, y) ∈ {A, B, D} × {A, B, D}
χ(g3) = (α3, β3, γ3) where
α3 : A → 1, B → 0, C → 1, D → 1
β3 : (A, D) → 1, (D, C) → 1, (C, A) → 1, (D, A) → 0, (C, D) → 0, (A, C) → 0
γ3 : (A, D) → const, (D, C) → const, (C, A) → const;
γ3(x, y) undefined for any other (x, y) ∈ {A, C, D} × {A, C, D}
³ One can easily verify that {l | αi(l) = 1} = Li.
In this chapter we will pay particular attention to graph sequences with missing information. There are two possible cases of interest. First, it may not be known whether node l is present in graph gi; in other words, in χ(gi), it is not known whether αi(l) = 1 or αi(l) = 0. Secondly, it may not be known whether edge (l1, l2) is present in gi, which is equivalent to not knowing, in χ(gi), whether βi(l1, l2) = 1 or βi(l1, l2) = 0.
To cope with the problem of missing information, and in order to make our notation more convenient, we extend the functions α and β in the characteristic representation χ(g) of graph g = (L, E, γ) by including the special symbol ? in the range of values of each function to indicate the case of missing information. That is, we write α(l) = ? if it is unknown whether node l is present in g. Similarly, the notation β(l1, l2) = ? will be used to indicate that it is not known whether edge (l1, l2) is present in g. To keep our notation simple, we will not explicitly distinguish between the functions α and β as originally introduced and their extended versions that include the symbol ? in their range of values, and we will refer to the 3-tuple χ(g) = (α, β, γ) as the characteristic representation of graph g regardless of whether the original functions or their extended versions are used.
Since any existing edge (l1, l2) ∈ E requires the existence of both incident nodes l1, l2 ∈ L, we assume that the condition α(l1) = α(l2) = 1 whenever β(l1, l2) = 1 is always fulfilled, to ensure the consistency of any graph g.
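A characteristic representation can be held in code in two dictionaries, with the string "?" standing for missing information; a minimal Python sketch (the dictionary encoding is our own choice, not prescribed by the text):

# Characteristic representation of g1 over L = {A, B, C, D}:
# alpha maps each node label to 1, 0, or '?'; beta does the same for node pairs.
L = ["A", "B", "C", "D"]

alpha1 = {"A": 1, "B": 1, "C": 1, "D": 0}
beta1 = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): 1,
         ("B", "A"): 0, ("C", "B"): 0, ("A", "C"): 0}

def consistent(alpha, beta):
    """Check that every present edge has both incident nodes present."""
    return all(alpha.get(u) == 1 and alpha.get(v) == 1
               for (u, v), present in beta.items() if present == 1)

print(consistent(alpha1, beta1))  # True for g1 of the running example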
11.2.2 Recovery of Missing Information Using a Voting Procedure
In Sections 11.2.2 to 11.2.4 we introduce three simple heuristic procedures that are all
based on the idea of using information about the behavior of a particular node or an
edge along the time axis in order to predict its presence or absence in a particular graph.
Consider graph sequence s = g1, g2, . . . , gt and let L denote the underlying universal set of node labels. Furthermore, consider graph gt = (Lt, Et, γt) with characteristic representation χ(gt) = (αt, βt, γt) and assume that αt(l) = ? for some l ∈ L. In order to make a statement about the possible presence or absence of node l in graph gt, we consider a subsequence, or time window, s′ = g_{t−M}, . . . , g_{t−1} of length M. The length of the window, M, is a parameter that can be tuned to the considered application. The basic idea is to utilize information about node l in the graphs belonging to subsequence s′ in order to make a statement about the presence or absence of l in graph gt. A simple approach consists in computing the relative frequency of occurrence of node l in subsequence s′ and using this value for the decision to be made. Let k1 be the number of graphs in subsequence s′ in which node l is actually present; in other words, k1 is the number of graphs g in subsequence s′ for which α(l) = 1. Clearly 0 ≤ k1 ≤ M. Similarly, let k0 be the number of graphs g in subsequence s′ for which α(l) = 0; as for k1, we observe 0 ≤ k0 ≤ M. Obviously, there are 0 ≤ M − (k0 + k1) ≤ M graphs g in subsequence s′ where α(l) = ?. Given parameters k0 and k1, we can use the following rule to make a decision as to the presence of node l in gt:
$$\alpha_t(l) = \begin{cases} 0 & \text{if } k_0 > k_1,\\ 1 & \text{if } k_1 > k_0. \end{cases} \qquad (11.1)$$
graph g4, we observe α4(l) = ? for any l ∈ L, and β4(l1, l2) = ? for any (l1, l2) ∈ L × L. This means that we don't have any information about graph g4. Assume the task is to recover the missing information for graph g4. Applying rule (11.1) results in the following decisions: α4(A) = α4(D) = 1; α4(B) = 0; for node C we have k0 = k1 = 1 and therefore make a random decision, for example, α4(C) = 0. Furthermore, for the edges of g4 we get β4(A, D) = 1 and β4(D, A) = 0. A graphical representation of graph g4 after recovering the missing information is shown in Figure 11.3.
Fig. 11.2. An example graph sequence g1, . . . , g4 in which all information about g4 is missing.
Fig. 11.3. Result of information recovery procedure when applied to g4 in Figure 11.2.
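In code, rule (11.1) amounts to counting occurrences within the time window; a minimal Python sketch (continuing the dictionary encoding above; ties are broken at random, as in the worked example):

import random

def vote(history):
    """Decide the presence of a node (or edge) from its values in the window.

    history is the list of alpha-values of one node (or beta-values of one
    edge) in the M graphs preceding g_t; entries are 1, 0, or '?'.
    """
    k1 = sum(1 for x in history if x == 1)
    k0 = sum(1 for x in history if x == 0)
    if k0 > k1:
        return 0
    if k1 > k0:
        return 1
    return random.choice([0, 1])  # tie: decide at random, as in the text

print(vote([1, 0, "?"]))  # k0 = k1 = 1: a random decision is made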
over a time window of length M and retrieve all cases recorded in set R that match the current history. Then a decision is made as to αt(l) = 0 or αt(l) = 1, depending on which case occurs more frequently in the reference set. From the intuitive point of view, this approach is based on the assumption that not only the number of occurrences, but also the pattern of presence and absence of a node l in a time window of a certain length, has a correlation with the occurrence of l in gt.
As an example, consider the graph sequence s = g1, . . . , g9 shown in Figure 11.4. Let t = 9 and assume we want to make a decision as to α9(A) = 0 or α9(A) = 1. Let M = 2. Then we get the following reference set:
R = {(g1, g2), (g2, g3), . . . , (g7, g8)}.   (11.6)
From this set, we can extract the following reference patterns for node A:
(0, 0) : 1 instance, extracted from (g1, g2)
(0, 1) : 2 instances, extracted from (g2, g3) and (g4, g5)
(1, 0) : 2 instances, extracted from (g3, g4) and (g7, g8)
(1, 1) : 2 instances, extracted from (g5, g6) and (g6, g7)
The query pattern in this example is (α8(A), α9(A)) = (0, ?). There are three matching reference patterns, namely the single instance of (0, 0) and the two instances of (0, 1). Hence k = 3, k0 = 1, k1 = 2, and the result will be α9(A) = 1.
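The pattern-based decision can be sketched as follows (Python; reference patterns of length M are read off the node's history, with the last position playing the role of xM; the handling of '?' inside patterns, described below, is omitted here):

import random

def pattern_decision(series, M=2):
    """Decide the last, unknown value of a 0/1 series from recurring patterns.

    series holds the alpha-values of one node in g_1, ..., g_t, with
    series[-1] == '?'. Patterns of length M are read off the history; the
    last position of each pattern plays the role of x_M.
    """
    history, query = series[:-1], series[-M:-1]  # query = last M-1 known values
    k0 = k1 = 0
    for start in range(len(history) - M + 1):
        pattern = history[start:start + M]
        if pattern[:-1] == query:        # prefix matches the current history
            if pattern[-1] == 0:
                k0 += 1
            elif pattern[-1] == 1:
                k1 += 1
    if k1 > k0:
        return 1
    if k0 > k1:
        return 0
    return random.choice([0, 1])         # tie-breaking as in rule (11.1)

# Node A of Figure 11.4: series over g_1, ..., g_9 with alpha_9(A) unknown.
a_series = [0, 0, 1, 0, 1, 1, 1, 0, "?"]
print(pattern_decision(a_series))  # k0 = 1, k1 = 2, hence alpha_9(A) = 1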
The method described so far is based on the assumption that none of the reference patterns for node l, extracted from set R, contains the symbol ?. From the practical point of view, this implies that whenever the situation αi(l) = ? is actually encountered, we have to discard the corresponding reference pattern, which may result in a set of reference patterns too small to be meaningful. However, the restriction that symbol ? must not occur in a reference pattern can be overcome as described below. Consider reference set R as defined in equation (11.4) and assume that in fact αi(l) = ? for some i, 1 ≤ i < t. Then there will be reference patterns for node l that include symbol ?. Let x = (x1, . . . , x_{i−1}, ?, x_{i+1}, . . . , xM) be such a reference pattern. In order to eliminate symbol ? from the reference pattern, we replace x by two new reference patterns x⁰ and x¹, where x⁰ is obtained from x by replacing symbol ? by symbol 0, and x¹ is obtained by replacing ? by symbol 1. That is, x⁰ = (x1, . . . , x_{i−1}, 0, x_{i+1}, . . . , xM) and x¹ = (x1, . . . , x_{i−1}, 1, x_{i+1}, . . . , xM). This scheme can be applied iteratively in case there is more than one position in x equal to ?. Generally, if there are r, 1 ≤ r ≤ M, positions equal to ?, we will replace the original reference pattern by 2^r new reference patterns, taking all combinations into account to substitute symbol 0 or symbol 1 for symbol ?. As an example, consider reference pattern x = (?, ?, ?), which will be replaced by eight new reference patterns, x⁰⁰⁰ = (0, 0, 0), x⁰⁰¹ = (0, 0, 1), . . ., x¹¹¹ = (1, 1, 1).
Once all occurrences of symbol ? have been eliminated from all reference patterns for node l, we assign a weight to each new reference pattern. The weight is equal to 1/2^r, where r is the number of symbols equal to ? in the original reference pattern (equivalently, 2^r is the number of new reference patterns obtained by substitution from the original reference pattern). In the previous example, where x = (?, ?, ?), each of the new reference patterns x⁰⁰⁰, x⁰⁰¹, . . . , x¹¹¹ gets a weight equal to 1/8. In case
Fig. 11.4. An example used to demonstrate the method introduced in Section 11.2.3.
no substitution operation was applied to a reference pattern x (which means that this reference pattern never included an occurrence of symbol ?), we assign weight 1 to x.
Once all symbols ? have been eliminated and weights have been assigned to all reference patterns of node l, we apply the following modified rule in order to decide as to αt(l) = 0 or αt(l) = 1, given αt(l) = ?. The numbers k0 and k1 in equation (11.5) no longer reflect the number of matching reference patterns with xM = 0 and xM = 1, respectively, but are now equal to the sum of the weights of all matching reference patterns with xM = 0 and xM = 1, respectively. With this modified definition of k0 and k1, equation (11.5) is applied.
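The substitution of 0 and 1 for '?' with the weights 1/2^r can be written compactly; a Python sketch (itertools enumerates the combinations):

from itertools import product

def expand(pattern):
    """Replace each '?' in a pattern by 0 and 1 in all combinations.

    A pattern with r unknown positions yields 2**r patterns, each of
    weight 1/2**r, as described in the text.
    """
    slots = [i for i, x in enumerate(pattern) if x == "?"]
    weight = 1.0 / 2 ** len(slots)
    results = []
    for bits in product([0, 1], repeat=len(slots)):
        filled = list(pattern)
        for i, b in zip(slots, bits):
            filled[i] = b
        results.append((tuple(filled), weight))
    return results

print(expand(("?", "?", "?")))  # eight patterns, each of weight 1/8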
As an example, consider the graph sequence in Figure 11.4 and assume α6(A) = ? rather than α6(A) = 1. For this situation we get the following reference patterns and weights for node A:
$$\hat{y}_t = \sum_{i=1}^{M} a_i\, y_{t-i}. \qquad (11.7)$$
In this formula the last M values y_{t−M}, . . . , y_{t−1} are used for the prediction of y_t, using M real-valued weighting coefficients a_1, . . . , a_M. In linear prediction one chooses the weighting factors a_i in such a way that the estimate ŷ_t results in an error as small as possible. Hence, defining the error as
$$e_t = y_t - \hat{y}_t = y_t - \sum_{i=1}^{M} a_i\, y_{t-i}, \qquad (11.8)$$
and summing the squared error over a past time window of length T yields the quantity
$$E = \sum_{i=0}^{T-1} e_i^2 = \sum_{i=0}^{T-1} \Bigl( y_i - \sum_{j=1}^{M} a_j\, y_{i-j} \Bigr)^2. \qquad (11.9)$$
Now the aim is to choose the coefficients a_i in such a way that E is minimized. The minimum of E occurs when the derivative ∂E/∂a_i with respect to each parameter is equal to zero. After some intermediate steps, the following solution is obtained [136]:
$$a = \Phi^{-1} \varphi_0, \qquad (11.10)$$
where
$$\Phi = \begin{pmatrix} \varphi_{1,1} & \cdots & \varphi_{1,M}\\ \vdots & \ddots & \vdots\\ \varphi_{M,1} & \cdots & \varphi_{M,M} \end{pmatrix}$$
with elements
$$\varphi_{r,s} = \sum_{i=0}^{T-1} y_{i-r}\, y_{i-s} \quad \text{for } r, s = 0, 1, \ldots, M,$$
and φ₀ = (φ_{1,0}, . . . , φ_{M,0})^T is a column vector of dimension M.
It can be shown that the matrix Φ is symmetric, and this symmetry can be exploited to speed up the inversion process [136].
In order to apply linear prediction to the recovery of missing information about node l in a time series of graphs s = g1, g2, . . . , gt, we just need to select appropriate values for the parameters M and T and substitute the values α1(l), . . . , α_{t−1}(l) for y1, . . . , y_{t−1} in the previous equations. Similarly, we can use the sequence β1(e), . . . , β_{t−1}(e) for the recovery of missing edge information. It is possible to dynamically recompute all a_i values whenever the case αt(l) = ? or βt(e) = ? occurs. Alternatively, one can compute these coefficients just a single time and use them in equation (11.7) whenever the case αt(l) = ? or βt(e) = ? occurs. It has to be noted that, in contrast with the methods described in Sections 11.2.2 and 11.2.3, linear prediction will not tolerate missing information in any of the graphs gi used to derive the coefficients a_i. That is, the cases αi(l) = ? or βi(e) = ? are not admissible if graph gi is being used in the computation of a.
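Equations (11.7)–(11.10) translate into a few lines of numpy; a sketch (the indexing conventions for the window and the test series are our own choices):

import numpy as np

def lp_coefficients(y, M, T):
    """Solve the normal equations a = Phi^{-1} phi_0 of equations (11.9)-(11.10).

    y is a 1-D array of past observations; the last T values define the
    squared-error window, and each prediction looks back M further steps,
    so len(y) must be at least T + M.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # phi[r, s] = sum over the window of y_{i-r} * y_{i-s}
    phi = np.array([[np.dot(y[n - T - r:n - r], y[n - T - s:n - s])
                     for s in range(M + 1)] for r in range(M + 1)])
    Phi = phi[1:, 1:]      # the M x M matrix of equation (11.10)
    phi0 = phi[1:, 0]      # the column vector (phi_{1,0}, ..., phi_{M,0})
    return np.linalg.solve(Phi, phi0)

def lp_predict(y, a):
    """One-step prediction y_t = sum_{i=1}^{M} a_i y_{t-i} (equation (11.7))."""
    M = len(a)
    return float(np.dot(a, y[-1:-M - 1:-1]))

# Example: a sampled sinusoid obeys an exact order-2 recursion, so M = 2
# recovers it perfectly.
series = np.sin(0.5 * np.arange(30))
a = lp_coefficients(series, M=2, T=20)
print(lp_predict(series, a), np.sin(0.5 * 30))  # both approximately 0.650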
An example of a decision tree that can be used to classify fruits is shown in Figure 11.5. The different object classes underlying our example are
ω1 = watermelon
ω2 = apple
ω3 = grape
ω4 = grapefruit
ω5 = lemon
ω6 = banana
ω7 = cherry
Given a decision tree, such as the one shown in Figure 11.5, and an unknown input object x to be classified, we simply apply the tests represented by the nonleaf nodes of the tree, starting with the test represented by the root, and traverse the tree top-down toward the leaves, according to the outcome of each test. Once a leaf has been reached, the class ωi represented by that leaf is assigned to the unknown input object x. As an example, consider the object x = (yellow, small, round, sour). Classification of this object leads to the leaf node that represents object class lemon. Therefore, object x is classified as lemon. Note that for this decision the value of the attribute taste has not been used.
From the decision tree shown in Figure 11.5 it can be concluded that the same class may occur at different leaf nodes. This simply means that objects of the same class may have different feature values, or intuitively speaking, a different appearance. An example in Figure 11.5 is the class grape, which occurs twice. We also note that the same test might occur multiple times, at different nonleaf nodes, in the same decision tree. For example, there are three different nonleaf nodes in Figure 11.5 that all test the attribute size.
Given a decision tree, such as the one shown in Figure 11.5, and an unknown object, for example x = (yellow, small, round, sour), the classification of x is accomplished easily by a straightforward traversal of the decision tree. A more difficult question is how the decision tree is obtained. Clearly, one possibility is to have the decision tree built by a human expert, in a manual fashion based on his or her expertise. However, such manual decision tree construction has clear limitations, for example if many
features or many classes are involved. Also, for certain applications, there may be a
lack of human expertise. In the following we introduce a procedure that allows us to
infer a decision tree automatically from a set of examples. This set of examples is
called a learning or training set in machine learning, and it is conceptually similar to
the reference set R used in Section 11.2.3.
A training set is a set of objects, x = (x1 , . . . , xd ), where the class of each object
in the training set is known. There are several algorithms for the inference of a decision
tree from a learning set that are similar to each other. In the following we describe
an approach closely related to C4.5 [146]. It is a procedure that recursively splits
the training set into smaller subsets, according to the possible outcomes of the tests
represented by the nodes of the decision tree, i.e., the values of a chosen attribute. The
procedure starts with the whole training set and terminates once a subset contains only
elements that belong to the same class.
A pseudocode description of the procedure for decision tree learning is given in Figure 11.6. As an example, consider the set L = {x1, . . . , x9} shown in Figure 11.7. We observe that L contains elements from different classes. Hence case 1 applies, and the algorithm generates a node for set L; this node will actually become the root node of the decision tree. Assume that the best feature is color. We assign this feature as a test to the node corresponding to L. The different values of color, viz. green, yellow, and red, split L into three subsets, L1 = {x1, x2, x3}, L2 = {x4, x5, x6}, and L3 = {x7, x8, x9}, respectively. For a graphical illustration see Figure 11.8. For each subset we generate an edge that leaves the node corresponding to L. Then we recursively apply procedure decision-tree-inference to each of L1, L2, L3. At the node corresponding to L1, case 1 applies. Assume the best feature is size. This splits L1 into L11 = {x1}, corresponding to size = big, L12 = {x2}, corresponding to size = medium, and L13 = {x3}, corresponding to size = small. We generate the corresponding edges and continue with L11. Since this subset contains only a single element, case 2 applies. We generate a leaf node for L11 and label it with ω1 = watermelon. Similarly, we generate a leaf node for L12 and a leaf node for L13 and label them with ω2 = apple and ω3 = grape, respectively. It is easy to verify that by continuing this procedure we get the tree shown in Figure 11.8, assuming that the following features are chosen as best feature: shape for L2, size for L3, size for L21 = {x4, x5}, and taste for L32 = {x8, x9}. Dropping the sets L, L1, L2, L3, L11, and so on from the tree and keeping only the best feature as a test at each nonleaf node renders the decision tree shown in Figure 11.5.
An important question in decision tree induction is how the best feature is found at each nonleaf node. The basic idea is to seek the feature that contributes most toward the purity of the resulting subsets. At any stage of the execution of the decision tree induction algorithm shown in Figure 11.6, a training set L is called pure if all its elements are from the same class. On the other hand, it is impure if different classes are represented in L. A quantity that is suitable to formally model the concept of purity is entropy. The entropy of training set L is given by
$$E(L) = -\sum_{i=1}^{c} p(\omega_i) \log p(\omega_i). \qquad (11.11)$$
decision_tree_inference(L)
input: learning set L where the class of each object is known
output: decision tree
begin
  case 1: the learning set L includes objects from different classes;
  in this case do
    1. generate a decision tree node N for L
    2. choose the best feature xi, assign it as a test to N, and divide
       set L into disjoint subsets L1, L2, ..., Lk corresponding to the
       different values v1, v2, ..., vk of xi
    3. for each value vj of xi do
       (a) generate an edge to the child node of N corresponding to the
           value vj
       (b) execute decision_tree_inference(Lj)
  case 2: the learning set L includes objects from only a single class,
  ωi; in this case generate a leaf node for L and assign class ωi to it
end
Fig. 11.6. Pseudocode of the decision tree inference procedure.
In this formula, c is the number of classes and p(ωi) is the probability of class ωi occurring in L. This probability is usually computed by dividing the number of elements from ωi in L by the total number of elements in L. For example, if L = {x1, x2, x3, x4} with x1, x2 ∈ ω1, x3 ∈ ω2, x4 ∈ ω3, then p(ω1) = 0.5 and p(ω2) = p(ω3) = 0.25. It is
known that E(L) ≥ 0, and E(L) = 0 if and only if all elements in L are from the same class. On the other hand, the maximum value of E(L) occurs if and only if the probabilities of all classes ωi in L are the same, which means that p(ωi) = 1/c for i = 1, . . . , c. Note that maximum and minimum purity coincide with minimum and maximum entropy, respectively.
Given a training set L, in order to find the best feature at a particular node in the decision tree, we probe each feature xi by computing the weighted entropy of the successor nodes that result if L is split into subsets L1, L2, . . . , Lk depending on the k different values of xi. More precisely, the expression
$$E = \sum_{j=1}^{k} \frac{|L_j|}{|L|}\, E(L_j) \qquad (11.12)$$
is computed for each feature xi, and the feature that minimizes E is taken as the best feature. Clearly, this minimization strategy is equivalent to maximizing the purity of the resulting training subsets, which makes sense because we require each leaf node that is eventually produced to be completely pure.
As an example, consider the decision tree in Figure 11.8 and assume we want to find the best feature for the root node, which corresponds to the training set L = {x1, . . . , x9}. Evaluation of the feature color gives
$$E = \tfrac{1}{3} E(L_1) + \tfrac{1}{3} E(L_2) + \tfrac{1}{3} E(L_3) = -\Bigl( \tfrac{1}{3} \log \tfrac{1}{3} + \tfrac{1}{3} \log \tfrac{1}{3} + \tfrac{1}{3} \log \tfrac{1}{3} \Bigr) = -\log \tfrac{1}{3},$$
since each of the three subsets contains three elements from three different classes, so that E(L1) = E(L2) = E(L3) = −log ⅓.
We compute E in the same manner for all other features, i.e., size, shape, and taste, and choose as best feature the one that yields the smallest value of E.⁴ The same procedure is repeated at all other nonleaf nodes of the decision tree. Note that at any of the other nonleaf nodes we test all features, even if a feature was already chosen as best feature at a predecessor of the current node in the tree.
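Equations (11.11) and (11.12) can be checked mechanically; a Python sketch (the natural logarithm and the function names are our own choices; data from Figure 11.7):

from collections import Counter
from math import log

def entropy(labels):
    """E(L) = -sum_i p(w_i) log p(w_i) over the classes occurring in labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log(c / total) for c in counts.values())

def weighted_entropy(split):
    """Equation (11.12): sum_j |L_j|/|L| * E(L_j) for a candidate split."""
    total = sum(len(part) for part in split)
    return sum(len(part) / total * entropy(part) for part in split)

# Splitting the nine fruits of Figure 11.7 by color gives three subsets of
# three pairwise distinct classes each, so E = -log(1/3), as computed above.
by_color = [["watermelon", "apple", "grape"],      # green
            ["grapefruit", "lemon", "banana"],     # yellow
            ["apple", "cherry", "grape"]]          # red
print(weighted_entropy(by_color), -log(1 / 3))     # equal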
x1 = (green, big, round, sweet) ∈ ω1 = watermelon
x2 = (green, medium, round, sour) ∈ ω2 = apple
x3 = (green, small, round, sweet) ∈ ω3 = grape
x4 = (yellow, big, round, sour) ∈ ω4 = grapefruit
x5 = (yellow, small, round, sour) ∈ ω5 = lemon
x6 = (yellow, small, elongated, sweet) ∈ ω6 = banana
x7 = (red, medium, round, sweet) ∈ ω2 = apple
x8 = (red, small, round, sweet) ∈ ω7 = cherry
x9 = (red, small, round, sour) ∈ ω3 = grape
Fig. 11.7. Training set for inference of the decision tree shown in Figure 11.5.
There are more issues that need to be addressed before a decision tree classifier can actually be applied to a practical problem. One of these issues is how to deal with
⁴ Note that for the decision tree shown in Figure 11.8, the best feature was chosen randomly. That is, the entropy minimization procedure based on equation (11.12) was not used in Figure 11.8.
unknown feature values that may occur in the training set and/or in the unknown input objects to be classified. Furthermore, it may happen during decision tree construction that elements from different classes end up in the same leaf node. In this case there exist no features that allow us to discriminate between these elements. Such a case, where two identical objects belong to different classes, is not uncommon in real-world applications.⁵
Another issue is decision tree pruning in order to avoid overfitting. Usually, the aim of the algorithm described in Figure 11.6 is to produce a decision tree that is used as a classifier on future input objects. In particular, the classifier should work well on new input objects that are not included in the training set. That is, we have to expect that a new input object is different from any of the training objects used to build the tree. It is well known that decision trees that are overadapted, or overfit, to the given training set tend to have rather poor performance on new, unseen data.⁶ To avoid overfitting, some pruning strategies are available. They typically cut off some branches after a decision tree has been generated, or they avoid the generation of such branches from the beginning.
For a detailed treatment of all of these issues we refer to [146] and Chapter 3 on decision tree learning in [134]. There are several software packages available that include all the functionality needed to implement decision tree classifiers for a variety of applications, including techniques to deal with unknown feature values and to avoid overfitting.
Fig. 11.8. Example of decision tree induction, using the training set given in Figure 11.7.
classifier. We use the same terminology as in Section 11.2 and assume we want to make a decision as to αt(l) = 0 or αt(l) = 1, given αt(l) = ?. This decision problem can be transformed into a classification problem as follows. The network at time t, gt, corresponds to the unknown object to be classified. Network gt is described by means of a feature vector x = (x1, . . . , xd), and the decision as to αt(l) = 0 or αt(l) = 1 can be interpreted as a two-class classification problem, where αt(l) = 0 corresponds to class ω0 and αt(l) = 1 corresponds to class ω1. As the features x1, . . . , xd that represent the unknown object x, i.e., graph gt, one can use, in principle, any quantity that is extractable from the graphs g1, . . . , gt. In the present section we consider the case that these features are extracted from graph gt exclusively. Assume that the universal set of node labels is given by L = {l0, l1, . . . , lD}, and assume furthermore that it is node label l0 for which we want to make a decision as to αt(l0) = 0 or αt(l0) = 1, given αt(l0) = ?. Then we set d = D and use the D-dimensional binary feature vector (αt(l1), . . . , αt(lD)) to represent graph gt. In other words, x = (αt(l1), . . . , αt(lD)). This feature vector is to be classified as belonging either to class ω0 or to class ω1. The former case corresponds to deciding αt(l0) = 0, and the latter to αt(l0) = 1. Intuitively, using (αt(l1), . . . , αt(lD)) as a feature vector for the classification of gt means we make a decision as to the presence or absence of l0 in gt depending on the presence or absence of all other nodes from L in gt.
For the implementation of the classification procedure described in the last paragraph, we need a training set. For the training set we can use all previous graphs in the given time series, i.e., g1, . . . , g_{t−1}. From each graph gi, we extract the D-dimensional feature vector
$$x_i = (\alpha_i(l_1), \ldots, \alpha_i(l_D)). \qquad (11.13)$$
So our training set becomes L = {x1, . . . , x_{t−1}}. As pointed out in Section 11.3.1, we need to assign the proper class to each element of the training set. This can easily be accomplished by assigning class ω0 to xi if αi(l0) = 0; otherwise, if αi(l0) = 1, we assign class ω1 to xi.
Given such a training set constructed from g1, . . . , g_{t−1}, we can now apply the procedure described in Figure 11.6 to infer a decision tree from training set L. Once the decision tree has been produced, it is easy to classify the feature vector xt (see equation (11.13)), which describes gt, as belonging to ω0 or ω1.
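A sketch of this recovery step with an off-the-shelf decision tree learner (here scikit-learn, which is our choice of tool and is not used in the book's experiments; unlike the packages alluded to below, this toy version sidesteps unknown feature values entirely, and the dependency between the nodes is fabricated for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# alpha[i, j] = presence (1/0) of node l_j in graph g_i; column 0 is l_0,
# the node whose presence in the newest graph g_t we want to recover.
rng = np.random.default_rng(0)
alpha = rng.integers(0, 2, size=(50, 6))
alpha[:, 0] = alpha[:, 1]           # toy dependency: l_0 mirrors l_1

X, y = alpha[:, 1:], alpha[:, 0]    # feature vectors (11.13) and classes
clf = DecisionTreeClassifier().fit(X[:-1], y[:-1])  # train on g_1..g_{t-1}
print(clf.predict(X[-1:]))          # recovered alpha_t(l_0)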
As mentioned in Section 11.3.1, decision tree classifiers are able to deal with unknown attribute values. This is important in our application because we must expect that not only information about node l0 in gt is missing, but also information about other nodes li in gt, where i ∈ {1, . . . , D}. Similarly, in building the decision tree from training set L = {x1, . . . , x_{t−1}}, there may be graphs gi, i ∈ {1, . . . , t − 1}, for which it is not known for some nodes whether they are present in gi. Hence some of the αi(lj) may be unknown. Using a state-of-the-art decision tree software package will allow us to deal with missing feature values without the necessity of taking any additional precautions. In other words, it doesn't matter, either during decision tree inference or while classifying an unknown input object, whether there are unknown feature values or not; the system will be able to correctly handle any case.
The procedure described in this section is based on two assumptions. The first assumption is that there is some kind of correlation between the occurrence of a node l in graph gt and the occurrence of some (or all) other nodes in the same graph. In other words, we assume that the behavior of node l is dependent, in some way, on the behavior of the other nodes. Note, however, that we don't need to make any assumptions as to the mathematical nature of this dependency. Our second assumption is that there is some stationarity in the dependency between l and the other nodes. Using graphs g1, . . . , g_{t−1} as a training set to derive a classifier that makes a decision pertaining to graph gt will work well only if the dependency between l and the other nodes in gt is of the same nature as in g1, . . . , g_{t−1}.
In a practical setting it may be computationally too demanding to infer a decision tree at each point of time t from g1, . . . , g_{t−1}, because decision tree induction procedures typically work in batch mode. That is, as time progresses from t to t + 1, and a new decision tree for g_{t+1} is built from g1, . . . , gt, we can't reuse the tree produced for gt before, but need to generate the decision tree for g_{t+1} completely from scratch. Hence it may be preferable to update the actual decision tree only after a certain period of time has elapsed. In the decision tree updating process it may also be advisable to use only part of the network history. This means that for the construction of the decision tree for gt, we don't use g1, . . . , g_{t−1}, but focus only on the M most recent graphs g_{t−M}, . . . , g_{t−1}. This is particularly advisable if there is evidence that the behavior of the network is not perfectly stationary, but changing over time.
11.3.3 Possible Extensions of the Basic Scheme
In Section 11.3.2 we presented a basic scheme for applying a decision tree classifier to the recovery of missing information in a computer network. In the current section we discuss a number of possible extensions. All extensions are based on the decision tree induction and traversal procedures described in Section 11.3.1; they differ only in the feature vector x = (x1, . . . , xd) used to represent the underlying network.
The first possible extension discussed in this section concerns network edges. It is easy to see that information about network edges can be integrated into the feature vector in a straightforward way. If Ē = {e1, . . . , e_{D′}} = L × L is the set of all potential edges in the network, then we can extend equation (11.13) as follows:
$$x_i = (\alpha_i(l_1), \ldots, \alpha_i(l_D), \beta_i(e_1), \ldots, \beta_i(e_{D'})). \qquad (11.14)$$
Such an extension would allow us to use not only node information, but also information about the presence or absence of edges, in the process of recovering information about node l0 in graph gt. All other steps remain the same as described in Section 11.3.2. However, a note of caution regarding computational complexity is in order here, because such an extension will increase the dimensionality of the feature vector from O(D) to O(D²), which leads to a corresponding increase of the complexity of decision tree inference.
Our next extension concerns the recovery of missing edge data. That is, we consider the problem of making a decision as to βt(e) = 0 or βt(e) = 1, given βt(e) = ? for some edge e.
11.4 Conclusions
In this chapter the problem of missing information recovery has been investigated. In Section 11.2, three heuristic schemes were proposed that all use context in time, i.e., the behavior of a node or an edge in previous graphs of the sequence under observation, in order to predict its presence or absence in the actual graph. Next, in Section 11.3, we developed a machine-learning-based method to solve the same problem. This method can utilize context in time as well as intragraph context, which means that not only the history of a node or an edge can be used to infer information about its possible presence or absence in the actual graph, but also information about the presence or absence of certain other nodes or edges in the graph under consideration.
The information recovery schemes introduced in Sections 11.2 and 11.3 can be extended in various ways. First of all, we have not addressed the problem of edge label recovery. That is, it may be known that edge e exists in graph gt, but its label γt(e) may be unknown. Here we can imagine the development of some kind of extrapolation
⁷ This phenomenon is also known as the curse of dimensionality [67, 134].
12
Matching Hierarchical Graphs
12.1 Introduction
In general, the computation of graph similarity is a very costly task. In the context
of this book, however, we focus on a special class of graphs that allow for low-order
polynomial-time matching algorithms. The considered class of graphs is characterized
by the constraint that each node has a unique node label. This constraint is met in all
computer network monitoring and abnormal event detection applications considered in
this book.
Future applications of graph matching may require one to deal with graphs consisting of tens or even hundreds of thousands of nodes. For these applications, low-order polynomial matching algorithms, such as those considered in previous chapters, may still be too slow. In this chapter we introduce a hierarchical graph representation scheme
that is suitable for reducing the size of the graphs under consideration. Other reduction
schemes have been proposed in [109], for example. There are also some conceptual
similarities with hierarchical quadtree, or pyramid, representations in image processing
[4]. The basic idea underlying the proposed hierarchical representation scheme is to
contract some nodes of the given graph and represent them as a single node at a higher
level of abstraction. There are no particular assumptions about the criteria that control
the selection of nodes to be contracted into a single node at a higher abstraction level.
For the contraction process, any algorithm that clusters nodes of a graph, including
heuristic selection strategies or the algorithms discussed in Chapter 7, may be chosen.
Properties of the nodes that are contracted are stored as attributes with the corresponding
node at the higher level of abstraction. This process can be carried out in a hierarchical,
iterative fashion, which will allow us to eventually contract any arbitrarily large set of
nodes into a single node.
Because of the reduced number of nodes, computing the similarity of two graphs at
a higher level of abstraction can be expected to be much faster than the corresponding
computation on the original graphs. It is, however, desirable that the graph contraction
procedure, as well as the chosen graph distance measure, have some monotonicity
properties. That is, if graph g1 is more similar to g2 than to g3 at the original, full graph
resolution level, then this property should be maintained for the representation at any
higher level of abstraction. In this chapter we study several of these properties. While the general monotonicity property, as stated above, can't be guaranteed, we will derive upper and lower bounds of the graph similarity measure at higher levels of abstraction. It will be shown that under certain conditions these bounds are tight, i.e., they are identical to the real similarity value.
In the next section, the proposed graph abstraction scheme is presented. Then, in Section 12.3, our new graph similarity measures will be defined, and upper and lower bounds for the graph distance at higher levels of abstraction derived. Next, potential applications of the proposed graph contraction scheme and of the similarity measures in the domain of computer network monitoring will be discussed. In Section 12.5 the results of an experimental study will be presented. Finally, a summary and conclusions will be provided in Section 12.6.
$$d(g_i, g_j) = |V_i| + |V_j| - 2|V_i \cap V_j| + |E_i| + |E_j| - 2|E_i \cap E_j|. \qquad (12.1)$$
This edit distance is identical to the edit distance introduced in Chapter 4 for the case that we neglect edge weights and are just interested in whether an edge is present between a given pair of nodes.
We start our graph abstraction process by partitioning the set of nodes V into a set of subsets, or clusters, C = {c1, . . . , cn}, where ci ⊆ V, ci ∩ cj = ∅ for i ≠ j; i, j = 1, . . . , n, and ⋃_{i=1}^{n} ci = V.
Definition 12.1. Given a graph g and an arbitrary partitioning C, a hierarchical abstraction of g is the graph ḡ = (V̄, Ē, nodes, edges, γ̄), where:
(i) V̄ = C, i.e., each node in ḡ represents a cluster of nodes in g (hence V̄ = {c1, . . . , cn});
(ii) Ē = V̄ × V̄, i.e., ḡ is fully connected;
(iii) each node v ∈ V̄, representing cluster c, carries the attributes nodes(v) = |{x | x ∈ V ∧ x ∈ c}| and edges(v) = |{(x, y) | (x, y) ∈ E ∧ x ∈ c ∧ y ∈ c}|, i.e., the number of nodes and of edges of g inside cluster c;
(iv) γ̄(e) = |{(x, y) | (x, y) ∈ E ∧ x ∈ ci ∧ y ∈ cj ∧ e = (ci, cj)}| for each e ∈ Ē. That is, if e is an edge in ḡ originating at the node representing cluster ci and terminating at the node representing cluster cj, then we count the number of edges in g that lead from a node in ci to a node in cj.
Example 12.2. A graph gi and its hierarchical abstraction ḡi are shown in Figure 12.1. For these graphs we observe that Vi = {1, 2, 3, 4, 5} and Ei = {(1, 2), (1, 4), (2, 1), (2, 4), (2, 5), (3, 1), (4, 3)}. We assume that Vi is partitioned into C = {{1, 2}, {3, 4}, {5}}, i.e., c1 = {1, 2}, c2 = {3, 4}, c3 = {5}. The hierarchical abstraction ḡi = (V̄i, Ēi, nodesi, edgesi, γ̄i) is then given by
V̄i = {{1, 2}, {3, 4}, {5}} = {c1, c2, c3},
Ēi = {(c1, c2), (c1, c3), (c2, c1)},
nodesi : c1 → 2, c2 → 2, c3 → 1,
edgesi : c1 → 2, c2 → 1, c3 → 0,
γ̄i : (c1, c2) → 2, (c1, c3) → 1, (c2, c1) → 1.
Fig. 12.1. The graph gi and its hierarchical abstraction ḡi.
All edges e that have an attribute value γ̄(e) = 0 are not included in Figure 12.1. In the graphical representation of ḡi in Figure 12.1, the pairs (x, y) displayed next to the nodes correspond to the node attributes, i.e., x = nodes(v), y = edges(v). Similarly, the numbers next to the edges correspond to the edge attributes.
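The attributes of Definition 12.1 can be computed directly; the following Python sketch reproduces the numbers of Example 12.2 (function and variable names are ours):

from collections import Counter

def abstract(nodes, edges, clusters):
    """Hierarchical abstraction of Definition 12.1: per-cluster node and
    edge counts, and inter-cluster edge counts."""
    where = {x: k for k, c in enumerate(clusters) for x in c}
    n_attr = Counter(where[x] for x in nodes)
    e_attr = Counter(where[x] for (x, y) in edges if where[x] == where[y])
    gamma = Counter((where[x], where[y]) for (x, y) in edges
                    if where[x] != where[y])
    return n_attr, e_attr, gamma

Vi = {1, 2, 3, 4, 5}
Ei = {(1, 2), (1, 4), (2, 1), (2, 4), (2, 5), (3, 1), (4, 3)}
C = [{1, 2}, {3, 4}, {5}]
print(abstract(Vi, Ei, C))
# nodes: {c1: 2, c2: 2, c3: 1}; edges: {c1: 2, c2: 1};
# gamma: {(c1, c2): 2, (c1, c3): 1, (c2, c1): 1}, matching Example 12.2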
The first of our two graph distance measures is defined as
$$D_l(\bar{g}_i, \bar{g}_j) = \sum_{v \in \bar{V}} |nodes_i(v) - nodes_j(v)| + \sum_{v \in \bar{V}} |edges_i(v) - edges_j(v)| + \sum_{e \in \bar{E}} |\bar{\gamma}_i(e) - \bar{\gamma}_j(e)|. \qquad (12.2)$$
Example 12.3. A graph gj and its hierarchical abstraction ḡj are shown in Figure 12.2. We assume that V = Vi ∪ Vj, E = Ei ∪ Ej, and C = {{1, 2}, {3, 4}, {5, 6}}. It is easy to verify that d(gi, gj) = 13 and Dl(ḡi, ḡj) = 9. The distance Dl(ḡi, ḡj) is obtained by summing the absolute differences of all pairs of corresponding attribute values. For the nodes we get the value two, for the edges the value three, and for γ̄(e) the value four.
Fig. 12.2. The graph gj and its hierarchical abstraction ḡj (node attributes (1, 0), (2, 1), (2, 1) for the clusters {2}, {3, 4}, {5, 6}).
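The lower bound of equation (12.2) is then a sum of absolute attribute differences; in the Python sketch below, the node and intra-cluster attributes of ḡj are read off Figure 12.2, while the inter-cluster counts γ̄j are hypothetical (the edge set of gj is given only graphically), so the printed value differs from the Dl = 9 of Example 12.3:

from collections import Counter

# Attributes of g_i from Example 12.2 (clusters indexed 0, 1, 2).
nodes_i = Counter({0: 2, 1: 2, 2: 1})
edges_i = Counter({0: 2, 1: 1})
gamma_i = Counter({(0, 1): 2, (0, 2): 1, (1, 0): 1})

# Attributes of g_j; the gamma values here are hypothetical placeholders.
nodes_j = Counter({0: 1, 1: 2, 2: 2})
edges_j = Counter({1: 1, 2: 1})
gamma_j = Counter({(0, 1): 1, (1, 2): 2})

def D_l(n_i, e_i, g_i, n_j, e_j, g_j, n_clusters=3):
    """Equation (12.2): sum the absolute differences of all attribute pairs."""
    d = sum(abs(n_i[c] - n_j[c]) + abs(e_i[c] - e_j[c])
            for c in range(n_clusters))
    d += sum(abs(g_i[e] - g_j[e]) for e in set(g_i) | set(g_j))
    return d

print(D_l(nodes_i, edges_i, gamma_i, nodes_j, edges_j, gamma_j))
# prints 10 with these illustrative gamma values; Example 12.3 obtains 9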
Lemma 12.4. Let gi and gj be two graphs, and let ḡi and ḡj be their hierarchical abstractions, based on a common partitioning C of V = Vi ∪ Vj, with E = Ei ∪ Ej. Then
Dl(ḡi, ḡj) ≤ d(gi, gj).
Proof. The proof is based on the observation that the term |Vi| + |Vj| − 2|Vi ∩ Vj| in equation (12.1) is equal to the number of nodes that are in either gi or gj, but not in both. Similarly, |Ei| + |Ej| − 2|Ei ∩ Ej| is equal to the number of edges in either gi or gj, but not in both. In equation (12.2), node v (corresponding to one of the clusters ck) includes exactly nodesi(v) nodes from gi and nodesj(v) nodes from gj. Hence there must be at least |nodesi(v) − nodesj(v)| nodes that are not in both gi and gj. Summing up over all nodes v ∈ V̄ (i.e., clusters ck ∈ C) yields a lower bound of the expression |Vi| + |Vj| − 2|Vi ∩ Vj|. Similarly, the sum of the second and the third terms in equation (12.2) yields a lower bound of |Ei| + |Ej| − 2|Ei ∩ Ej|.
It can be shown that under certain conditions the lower bound given by equation (12.2) is exact.
Lemma 12.5. Let gi, gj, ḡi, and ḡj be the same as in Lemma 12.4. Furthermore, let Vi ⊆ Vj and Ei ⊆ Ej. Then
Dl(ḡi, ḡj) = d(gi, gj).
Proof. From our assumptions it follows that |Vi ∩ Vj| = |Vi| and |Ei ∩ Ej| = |Ei|. Hence |Vi| + |Vj| − 2|Vi ∩ Vj| = |Vj| − |Vi|, |Ei| + |Ej| − 2|Ei ∩ Ej| = |Ej| − |Ei|, and d(gi, gj) = |Vj| − |Vi| + |Ej| − |Ei|. Obviously, the right-hand side of this equation is identical to the right-hand side of equation (12.2) under the assumptions |Vi| ≤ |Vj| and |Ei| ≤ |Ej|.
The second graph distance measure is defined as follows:
$$D_u(\bar{g}_i, \bar{g}_j) = \sum_{v \in \bar{V}} \mathrm{NODES}(v) + \sum_{v \in \bar{V}} \mathrm{INTRACLUSTER\text{-}EDGES}(v) + \sum_{e \in \bar{E}} \mathrm{INTERCLUSTER\text{-}EDGES}(e), \qquad (12.3)$$
where
$$\mathrm{NODES}(v) = \begin{cases} nodes_i(v) + nodes_j(v) & \text{if } nodes_i(v) + nodes_j(v) < |c|,\\ 2|c| - nodes_i(v) - nodes_j(v) & \text{otherwise}, \end{cases}$$
$$\mathrm{INTRACLUSTER\text{-}EDGES}(v) = \begin{cases} edges_i(v) + edges_j(v) & \text{if } edges_i(v) + edges_j(v) < |\mathrm{EDGES}(v)|,\\ 2|\mathrm{EDGES}(v)| - edges_i(v) - edges_j(v) & \text{otherwise}, \end{cases}$$
and
$$\mathrm{INTERCLUSTER\text{-}EDGES}(e) = \begin{cases} \bar{\gamma}_i(e) + \bar{\gamma}_j(e) & \text{if } \bar{\gamma}_i(e) + \bar{\gamma}_j(e) < |\mathrm{EDGES}(e)|,\\ 2|\mathrm{EDGES}(e)| - \bar{\gamma}_i(e) - \bar{\gamma}_j(e) & \text{otherwise}. \end{cases}$$
In this definition, c denotes the cluster that corresponds to node v, EDGES(v) is the set of all edges in set E that belong to cluster c, and EDGES(e) is the set of all edges in E that start and end at the same clusters as edge e. Formally,
EDGES(v) = {e | e = (x, y) ∈ E ∧ x ∈ c ∧ y ∈ c},
and
EDGES(e) = {(x, y) | (x, y) ∈ E ∧ x ∈ ci ∧ y ∈ cj ∧ e = (ci, cj)}.
Example 12.6. For the graphs shown in Figures 12.1 and 12.2, we obtain Du(ḡi, ḡj) = 13. Note that in all of the quantities NODES(v), INTRACLUSTER-EDGES(v), and INTERCLUSTER-EDGES(e) the second condition always evaluates to true. The first term in equation (12.3) evaluates to two, while the values five and six are obtained for the second and third terms, respectively.
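The case distinction of equation (12.3) is equally short in code; a Python sketch for the NODES term only (the two edge terms are analogous), reproducing the first term of Example 12.6:

def NODES(ni, nj, cluster_size):
    """The NODES(v) term of equation (12.3) for one cluster."""
    s = ni + nj
    return s if s < cluster_size else 2 * cluster_size - s

# Clusters {1, 2}, {3, 4}, {5, 6} with the node counts of Example 12.6:
counts = [(2, 1, 2), (2, 2, 2), (1, 2, 2)]   # (nodes_i, nodes_j, |c|)
print(sum(NODES(ni, nj, c) for ni, nj, c in counts))  # 2, the first term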
Next we show that the measure Du(ḡi, ḡj) is an upper bound on d(gi, gj).
Lemma 12.7. Let gi, gj, ḡi, and ḡj be graphs as introduced above. Then
d(gi, gj) ≤ Du(ḡi, ḡj).
Proof. If the number of nodes of V that belong to cluster c is greater than the number of nodes of gi in cluster c plus the number of nodes of gj in cluster c, then the intersection of the nodes of gi and gj may be empty, and the expression |Vi| + |Vj| − 2|Vi ∩ Vj| in equation (12.1) is bounded from above by nodesi(c) + nodesj(c). Otherwise, some nodes from gi and gj must be the same, i.e., some nodes must occur in both gi and gj. The number of these nodes is at least nodesi(c) + nodesj(c) − |c|. Hence the expression |Vi| + |Vj| − 2|Vi ∩ Vj| is bounded from above by nodesi(c) + nodesj(c) − 2(nodesi(c) + nodesj(c) − |c|) = 2|c| − nodesi(c) − nodesj(c). A similar argument holds for the edges, i.e., for the attributes edgesi(c), edgesj(c), γ̄i(e), and γ̄j(e). Summing over all clusters c and all edges in Ē provides an upper bound of d(gi, gj).
In Example 12.6 we note that Du(ḡi, ḡj) = d(gi, gj). This is no coincidence, because the proof of Lemma 12.7 implies that the upper bound Du(ḡi, ḡj) is equal to the actual distance d(gi, gj) if |Vi| + |Vj| ≥ |V| and |Ei| + |Ej| ≥ |E|. This is summarized in the following lemma.
Lemma 12.8. Let gi, gj, ḡi, and ḡj be defined as in Lemma 12.7, and let |Vi| + |Vj| ≥ |V| and |Ei| + |Ej| ≥ |E|. Then
Du(ḡi, ḡj) = d(gi, gj).
A consequence of this lemma is that for any two graphs gi, gj and their hierarchical abstractions ḡi, ḡj, the quantity Du(ḡi, ḡj) is always equal to d(gi, gj) if we set V = Vi ∪ Vj and E = Ei ∪ Ej.
In the remainder of this section we will investigate how the upper and lower bounds Du(ḡi, ḡj) and Dl(ḡi, ḡj) depend on the way we partition the set V. Let C = {c1, . . . , cn} and C̃ = {c̃1, . . . , c̃m} be two different partitionings of set V. We call C finer than C̃ if for each ci there exists a c̃j such that ci ⊆ c̃j. Let gi and gj be two graphs, ḡi and ḡj their hierarchical abstractions based on partition C, and G̃i and G̃j their hierarchical abstractions based on partition C̃, where C is finer than C̃. Then we can prove that Du(ḡi, ḡj) and Dl(ḡi, ḡj) are better approximations of d(gi, gj) than Du(G̃i, G̃j) and Dl(G̃i, G̃j), respectively.
Lemma 12.9. Let ḡi, ḡj, G̃i, and G̃j be as defined above. Then
Dl(G̃i, G̃j) ≤ Dl(ḡi, ḡj).
Proof. Assume that the cluster c̃ ∈ C̃ is split into clusters c1, . . . , ck ∈ C when we refine the partition C̃ to the partition C. Clearly, |c̃| = Σ_{l=1}^{k} |cl|. The contribution of cluster c̃ to the first term of Dl(G̃i, G̃j) is equal to |nodesi(c̃) − nodesj(c̃)|, which can be rewritten as |Σ_{l=1}^{k} nodesi(cl) − Σ_{l=1}^{k} nodesj(cl)|; see equation (12.2). On the other hand, for the clusters c1, . . . , ck we get a contribution equal to Σ_{l=1}^{k} |nodesi(cl) − nodesj(cl)| to the first term in Dl(ḡi, ḡj). Applying a similar argument to the second and third terms in equation (12.2), and using the well-known relation |Σ_{l=1}^{k} al − Σ_{l=1}^{k} bl| ≤ Σ_{l=1}^{k} |al − bl|, which holds for any real numbers al, bl, concludes the proof.
Lemma 12.10. Let ḡi, ḡj, G̃i, and G̃j be the same as in Lemma 12.9. Then
Du(ḡi, ḡj) ≤ Du(G̃i, G̃j).
Proof. The proof is based on observing that in the computation of NODES(v) in equation (12.3), whenever the second case evaluates to true for a partition C̃, it will also evaluate to true for any other partition C that is finer than C̃. On the other hand, if the first case evaluates to true for C̃, then either the first or the second case may evaluate to true for any of the clusters in C. Moreover, we observe that the value under the second condition is always less than or equal to the value obtained under the first condition. Applying a similar argument to INTRACLUSTER-EDGES(v) and INTERCLUSTER-EDGES(e) yields the proof.
Summarizing all results derived in this section, we obtain the following theorem:
Theorem 12.11. Let all quantities be as introduced above. Then:
(i) Dl(G̃i, G̃j) ≤ Dl(ḡi, ḡj) ≤ d(gi, gj) ≤ Du(ḡi, ḡj) ≤ Du(G̃i, G̃j),
Fig. 12.5. Experimental data illustrating the upper and lower bound (m = 60).
Fig. 12.6. Further illustration of upper and lower bound: (a) m = 70, (b) m = 80, (c) m = 90.
The aim of the next set of experiments is to analyze the behavior of the upper and lower bounds in case the conditions of Lemmas 12.5 and 12.8 are no longer satisfied. For this purpose, we start again with a graph g1 that is generated in exactly the same way as described in the previous paragraph. Next we randomly delete 50% of the nodes of g1, together with their incident edges; the resulting graph is referred to as g3. Then graph g4 is generated by randomly deleting m% of all nodes, together with their incident edges,
Table 12.1. Distance values for m = 60 (columns: number of nodes in the hierarchical abstraction).

        1000        100         10          1
Du   2,409,516   2,410,042   2,410,088   2,411,390
Dl   1,239,038     638,308     534,130     390,094
d    2,407,580

Table 12.2. Distance values for m = 70, 80, and 90 (columns: number of nodes in the hierarchical abstraction).

m = 70:
        1000        100         10          1
Du   2,397,782   2,398,060   2,398,146   2,400,294
Dl   1,457,062   1,126,264   1,020,270     749,790
d    2,396,262

m = 80:
        1000        100         10          1
Du   2,322,436   2,322,510   2,322,814   2,325,836
Dl   1,700,350   1,495,504   1,360,116   1,034,514
d    2,321,440

m = 90:
        1000        100         10          1
Du   2,313,272   2,313,318   2,313,494   2,317,408
Dl   1,969,420   1,890,110   1,815,788   1,275,640
d    2,312,800
from g1 (m = 60, 70, 80, 90). Clearly, when we match graphs g3 and g4 , the conditions
of Lemmas 12.5 and 12.8 are not necessarily satised any longer. Similarly to the rst
set of experiments, hierarchical abstractions of g3 and g4 were generated consisting
of 1,000, 100, 10, and 1 node. In Figure 12.5, the distances d(g3 , g4 ), Dl (g 3 , g 4 ), and
Du (g 3 , g 4 ) are shown for m = 60. While the lower bound is signicantly smaller
than the real distance, the upper bound is quite tight. As a matter of fact, d(g3 , g4 )
visually coincides with Du (g 3 , g 4 ) in Figure 12.5. To see that d(g3 , g4 ) is not identical
to Du (g 3 , g 4 ), the information provided in Figure 12.5 is shown in tabular form in Table
12.1. In Figure 12.6 and Table 12.2 the corresponding values are given for m = 70, 80,
and 90. As m increases, graphs g3 and g4 become more similar to each other. In any
case, the upper bound is very close to the real distance even for the maximum degree
of compression, where both graphs are represented through a single node only. We also
observe that both upper and lower bounds become tighter as the distance d(g3 , g4 )
decreases.
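As an aside, the deletion step used to derive $g_3$ and $g_4$ is easy to reproduce. The following sketch is ours, not the chapter's generator code, and the adjacency-set representation of a graph is an assumption; it removes a random m% of a graph's nodes together with their incident edges.

```python
import random

def delete_random_nodes(graph, percent, seed=0):
    """Return a copy of `graph` (dict: node -> set of neighbours) with a
    random `percent`% of its nodes removed, together with incident edges."""
    rng = random.Random(seed)
    victims = set(rng.sample(sorted(graph), int(len(graph) * percent / 100)))
    return {v: nbrs - victims for v, nbrs in graph.items() if v not in victims}

# g3 is obtained from g1 by deleting 50% of its nodes; each g4 is obtained
# from g1 by deleting m% of the nodes (m = 60, 70, 80, 90).
g1 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
g3 = delete_random_nodes(g1, 50)
g4 = delete_random_nodes(g1, 60, seed=1)
```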
The motivation for the third set of experiments is to measure the computational savings that can be achieved by means of the proposed hierarchical graph abstraction scheme. We assume that the sensors, or devices, that yield the graph data provide us not only with the graphs at the full level of resolution, but also with their hierarchical abstractions. Hence the time needed to generate hierarchical abstractions from a graph at the full resolution level is not taken into account in the experiments described in the following. To analyze the computational efficiency of the proposed graph similarity measures, we select graphs $g_3$ and $g_4$ (with m = 60) and their hierarchical abstractions,
as described in the last paragraph, and measure the time needed to compute $d(g_3, g_4)$, $D_l(\bar g_3, \bar g_4)$, and $D_u(\bar g_3, \bar g_4)$. The results are shown in Table 12.3. The computation of $d(g_3, g_4)$ is performed on the original graphs $g_3$ and $g_4$ and is independent of the cluster size in the hierarchical abstraction. It turns out that the computation of both $D_l(\bar g_3, \bar g_4)$ and $D_u(\bar g_3, \bar g_4)$ is extremely fast compared to that of $d(g_3, g_4)$. From this observation we can conclude that the distance measure $D_u$ provides an excellent compromise between speed and precision. On the one hand, it is extremely fast to compute; on the other, it returns values very close to the real graph distance. As a matter of fact, for the case of maximum graph compression a speedup on the order of $10^8$ over the computation of $d(g_3, g_4)$ can be observed (36,000 msec versus 0.000076 msec; see Table 12.3), while the precision of the upper bound is still within a tolerance of 0.2%.
Table 12.3. Computational time of the distance measures in msec (m = 60).

  Number of nodes    1,000    100       10         1
  Time Du            61.6     0.5632    0.00568    0.000076
  Time Dl            19.52    0.1552    0.001304   0.0000552
  Time d             36,000 (independent of the abstraction level)
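Measurements like those in Table 12.3 can be gathered with a simple best-of-n wall-clock harness. The sketch below is our own illustration (the chapter does not describe its timing setup); the distance implementations named in the usage comment are assumed to exist.

```python
import time

def time_msec(fn, *args, repeats=5):
    """Best-of-`repeats` wall-clock time of fn(*args), in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, (time.perf_counter() - t0) * 1000.0)
    return best

# Hypothetical usage, assuming implementations d, D_l, D_u are available:
#   print(time_msec(d, g3, g4))
#   print(time_msec(D_u, g3_abstracted, g4_abstracted))
```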
12.6 Conclusions

In this chapter we have described a hierarchical graph abstraction procedure that contracts clusters, or groups, of nodes into single nodes. On this hierarchical representation, graph similarity can be computed more efficiently than on the original graphs. Two distance measures for contracted graphs are introduced, and it is shown that they provide lower and upper bounds, respectively, for the distance of graphs at the original level of resolution. The proposed methods can be used to very significantly speed up the computation of graph similarity in the context of computer network monitoring and abnormal change detection. It can be proven that under special conditions, the upper and/or lower bounds are exact.