Computer Network Analysis by Visualization
PAULINE GOMÉR
JON-ERIK JOHNZON
© PAULINE GOMÉR.
© JON-ERIK JOHNZON.
Contents

Preface
1 Project Goal
2 Introduction
 2.1 Motivation
  2.1.1 Operation
  2.1.2 Research and development
  2.1.3 New protocols
 2.2 Issues with traffic analysis
  2.2.1 Collecting data
  2.2.2 Data amount
  2.2.3 Analysis of data
 2.3 Overview of classification methods
  2.3.1 Exact matching
  2.3.2 Machine learning
  2.3.3 Heuristic methods
 2.4 Related work
  2.4.1 Behavior-based network analysis
3 Requirements Specification
 3.1 Software
  3.1.1 Functional Requirements
  3.1.2 Non-Functional Requirements
 3.2 Hardware
 3.3 Test trace
  3.3.1 Data source
  3.3.2 Data properties
4 Implementation
 4.1 Development
  4.1.1 Java with pts
  4.1.2 Java with Haskell preprocessor and pts
  4.1.3 Java with Haskell preprocessor and MySQL
  4.1.4 Java with MySQL
 4.2 Database optimization
  4.2.1 Filter Analysis and Database Indexing
 4.3 Database Initialization
 4.4 Graph Library Research
 4.5 Application interface
  4.5.1 Filter interface
  4.5.2 Graph interface
  4.5.3 Graph Layout Algorithms
5 Results
 5.1 Application usage
  5.1.1 Simple
  5.1.2 Advanced
6 Conclusions
7 Future work
Abstract
The explosive growth of the Internet has raised interest in traffic analysis. Understanding what traffic traverses the network is important for operation, investments, research and the design of new protocols.

However, network traffic analysis has not evolved as rapidly as network usage. Many researchers still look at data in raw text format, even though the human brain is much better at recognizing patterns in images than in text. The purpose of this thesis is to develop a tool that builds a graph to visualize network traffic. Network analysis using this approach is not new, but there are no tools available where visualization is the focus.

The network graph is built by defining hosts as nodes and communication between hosts as edges. To enable analysis, the user can select a subset of the traffic to visualize. The tool is able to produce graphs for large data sets, 3000 nodes and 30000 edges, on a home computer. We have tested the tool on generated data and on data provided by the MonNet project.

The tool is ready for testing, but further development is needed since the graph library we used is resource intensive when visualizing large graphs.
Preface
This 30-credit master's thesis concludes our studies on the Software Engineering and Technology master's programme at Chalmers University of Technology.

We would like to thank Wolfgang John and Tomas Olovson for their support during our work.
1 Project Goal
The main objective of this project is to develop the ideas presented in a paper by Iliofotou et al. [15] and produce a tool for visualizing network traffic. Iliofotou et al. propose to visualize network traffic as Traffic Dispersion Graphs, with IP addresses as nodes and communication between two IP addresses as edges. Different protocols will appear as different patterns in the graph.

This tool should not be specialized towards specific kinds of traffic. The user should be able to filter the traffic and visualize the result as a graph. We present a few examples of possible use cases in order to show that the software works as intended and to give an indication of its usefulness. When starting the project, we had three goals:

• The tool should be able to take traffic traces as input and produce graphs and numerical graph features as output.
• The tool should include additional features like different (pre-)filter options, zooming capabilities, and information retrieval possibilities per node (on click or mouse-over).
• We should also provide examples that show that our tool works as intended.
2 Introduction
The Internet has evolved beyond what anyone would have guessed. It has moved from being a tool for researchers and experts to a social and economic platform on which many build their lives. The reasons are many, but the biggest are the ease of sharing information, near-instant communication and inexpensive usage.

However, the architecture and fundamental protocols used on the Internet have remained mostly unchanged. This presents problems for network operators and users, since it is difficult to understand how protocols can be used or misused. The aim when the Internet was created was to make communication work seamlessly between networks with different types of hardware. Important properties were the ease of including new networks and users, routing traffic through intermediate networks regardless of content, and using a common addressing scheme [6]. Many of the problems on the Internet come from abuse of design decisions. This has raised an interest in analyzing and monitoring traffic on the Internet.
2.1 Motivation
Analysis of network traffic is important for network operation, for research and development of networks, and for the design of new protocols. For these reasons we have developed a visualization tool to make traffic analysis faster and easier for humans.
2.1.1 Operation
When the number of users grew faster than network capacity, network operators faced the challenge of keeping the network running smoothly while scaling it to support the increasing traffic [3]. In the 90s, the growth was handled by expanding and upgrading the infrastructure. This was made possible by venture capital that flowed into the new Internet sector and generated investments in the infrastructure. However, when the speculative bubble burst in the early 21st century, the network operators could not afford to continue upgrading as before and needed to make the most of existing capacity.

To optimize usage of the available capacity, network operators try to prioritize some traffic over other traffic by using different Quality of Service (QoS) methods. QoS works by distinguishing traffic belonging to different traffic classes, such as sensitive, best-effort and unwanted traffic. Each class of traffic is then treated differently: sensitive traffic is given top priority, best-effort traffic gets no guarantees, and unwanted traffic is often blocked or limited [22].
QoS methods depend on accurate analysis methods to reliably classify traffic into the right classes. The Internet protocols (TCP and IP) use addresses and port numbers to deliver the traffic to the right host and application. While the Internet Assigned Numbers Authority (IANA) [14] lists many ports as belonging to specific protocols1, the port numbers are not enforced and a protocol can use any port number for its traffic. The analysis methods must be able to distinguish traffic belonging to different application protocols even if they are using the same port number. If the analysis results are used in routers to decide which packets to discard during congestion, a false positive or false negative in a live QoS deployment could result in sensitive traffic being blocked or unwanted traffic being given priority.
As the Internet gained popularity, the incentives increased for users to make a profit or gain other advantages by attacking other users [3]. A larger number of users makes it easier to hide and avoid detection, as well as providing more targets. Malicious users generally aim to reach as many users as possible while accepting a small success rate. The attack methods often abuse or misuse features of protocols. An example of this is the 'Ping of Death', where ICMP was misused: a packet exceeding the maximum allowed size was sent fragmented, and when the receiver reassembled it, susceptible systems crashed due to a buffer overflow.
Unwanted traffic has no strict definition, since what is unwanted depends on who is doing the classification. Some network operators consider bandwidth-intensive applications to belong to the unwanted traffic class. Streaming media is entering the scene, and it uses a lot of bandwidth. The applications and protocols used are designed to use any available bandwidth to enable as many users as possible to use the service. Notable applications in this area are Spotify [29], Voddler [31] and Skype [28].
Network operators rely on statistical multiplexing2 when dimensioning their networks to fulfill QoS constraints while minimizing unused bandwidth [25]. Statistical multiplexing assumes that users most of the time use less than their full bandwidth, so QoS constraints can be met with less capacity than the sum of all users' maximum bandwidth. However, users that run bandwidth-intensive applications do not follow this pattern and can cause congestion3, delaying traffic for all users.
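As an illustrative formulation (our own sketch, not taken from [25] or the thesis): if user $i$ has peak demand $p_i$ and instantaneous demand $X_i$, peak-rate dimensioning would provision $\sum_i p_i$, while statistical multiplexing provisions the smallest capacity $C$ that keeps the overload probability below a tolerance $\varepsilon$:

$C = \min\{\, c : \Pr(\sum_i X_i > c) \le \varepsilon \,\}$, which in practice is well below $\sum_i p_i$.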
On corporate and privately owned networks, the network operator might also block other kinds of applications due to policies or competition with their own solutions.

1. The TCP and UDP protocols allow 65536 ports. Ports 0-1023 are for common system services and are called well-known ports; ports 1024-49151 are for common applications and are called registered ports. The rest of the ports are intended for dynamic and private use.
2. Multiplexing: several users share a resource.
3. Congestion: there is more traffic than the network has capacity to handle.
and that traffic type needs to be identified before any configuration optimizations can be made. With the help of traffic analysis on live traces, network usage can be reliably established.
2.2.2 Data amount
As the volume of traffic in the networks grows, the amount of storage needed for collection also increases. There are several possible approaches to reducing the size of traffic traces. However, they are all tradeoffs between completeness and scalability [1].

• Packet and flow sampling only collects a small portion of the total traffic, such as one packet for every N packets, or only flows with more than N packets. This approach only provides general knowledge about the traffic trace, such as packet distributions and inter-arrival times (a minimal sampling sketch follows below).
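A minimal sketch of 1-in-N packet sampling (our own illustration, not part of the thesis tool; the class name and parameters are assumptions): each captured packet is kept with probability 1/N, so only a small, roughly uniform portion of the trace needs to be stored.

import java.util.Random;

// Minimal 1-in-N packet sampler: keep each packet with probability 1/n.
public class PacketSampler {
    private final int n;
    private final Random rng = new Random();

    public PacketSampler(int n) {
        this.n = n;
    }

    // Returns true if this packet should be kept in the reduced trace.
    public boolean keep() {
        return rng.nextInt(n) == 0;
    }

    public static void main(String[] args) {
        PacketSampler sampler = new PacketSampler(100); // 1-in-100 sampling
        int kept = 0, total = 1_000_000;
        for (int i = 0; i < total; i++) {
            if (sampler.keep()) kept++;
        }
        System.out.printf("kept %d of %d packets (~%.2f%%)%n", kept, total, 100.0 * kept / total);
    }
}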
Problems There are several problems with the collection of packet traces, but the biggest is that it is impossible to start a trace and collect 100% of all communication. Packets are missed because a connection was established before the collection started, or because a connection uses another link for part of the communication.
to use ports that either are associated with another application or are randomly chosen. Encryption is also used to hide application headers in the payload that could otherwise be used to identify the application.
The signature-based analysis methods have several problems [9]:
training data, but do belong to one of the classes, might not be correctly
classified if they differ too much from the other related patterns.
Therefore, supervised learning is used when the aim is to single out spe-
cific traffic classes.
• The classic k-means algorithm, which puts each example in exactly one group (a minimal assignment-step sketch follows below)
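As a minimal sketch of the assignment step (our own illustration with hypothetical flow features, not the thesis implementation): each flow is described by a feature vector, and k-means puts it in exactly one group, the one whose centroid is closest in Euclidean distance.

import java.util.Arrays;

// Illustrative k-means assignment step: assign each flow to its nearest centroid.
public class KMeansAssign {

    // Index of the centroid closest to the given feature vector (squared Euclidean distance).
    static int nearestCentroid(double[] flow, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centroids.length; k++) {
            double dist = 0;
            for (int d = 0; d < flow.length; d++) {
                double diff = flow[d] - centroids[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical flow features: {mean packet size in bytes, mean inter-arrival time in seconds}.
        double[][] flows = { {1400, 0.01}, {80, 0.50}, {1200, 0.02} };
        double[][] centroids = { {1300, 0.02}, {100, 0.40} }; // k = 2
        for (double[] flow : flows) {
            System.out.println(Arrays.toString(flow) + " -> cluster " + nearestCentroid(flow, centroids));
        }
    }
}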
results, but the possibility that a host changes behavior in the middle of an interval increases [17]. The heuristics must also handle the possibility that only the traffic in one direction is available, due to asymmetric routing [19].

On the other hand, heuristics can use port analysis to reliably classify common applications. While other applications obfuscate their traffic by using the same ports, the protocols they use for communication generate patterns that differ considerably from those of the original applications, especially in packet sizes and TCP headers [21].
7. Who knows whom and who communicates with whom.
3 Requirements Specification
3.1 Software
When planning our project, we agreed on a list of requirements for our software tool and its operating environment. This list contains both functional requirements and non-functional requirements.
• Provide a GUI
• Intuitive Interface
of the program. Therefore, the program can become noticeably smaller by avoiding unnecessary imports.

The latter meaning refers to avoiding version dependencies, that is, to avoid requiring the latest version of a dependency unless its latest functionality is really needed. This is because some systems run old software, and upgrading one program will usually require its dependencies to be upgraded as well.

Our focus was to reduce the number of external third-party applications and libraries as much as possible. The reason was to make the application easy to install and deploy, since installation and configuration of dependencies often require administrative privileges.
User Prerequisites
• Know SQL
Licensing Source code for computer programs is in many countries defined as a literary work and falls under copyright law. While copyright varies between countries8, the rights holder is typically granted exclusive rights to sell, distribute, and adapt the work.

8. Most countries in the world are signatories of the Berne Convention for the Protection of Literary and Artistic Works, which provides similar protection.

Our application should have a license that allows:
• free redistribution
• free to modify
• free to use
These criteria are typically fulfilled by open source and free software licenses. They were chosen to allow the tool to be spread and used widely without requiring permission.

Since our application will use third-party libraries, a potential problem is that those libraries may have licenses that cannot co-exist in the same application. Therefore the preferred licenses are those that do not impose restrictions on which license our tool uses, such as requiring that our tool be released under the same license. Two common licenses that do not have such restrictions are BSD and LGPL.
Operation Environment
3.2 Hardware
We recommend that our tool be run on a computer with at least the following specifications:
• 1024MB RAM
3.3.2 Data properties
Data about the trace (with the payload stripped off):
9. A flow is all packets from a source IP address and port to a destination IP address and port during a certain time period [34].
4 Implementation
4.1 Development
When we started to develop the application, we had a set of base tools available. These were GraphViz [13], crl_flow (part of CoralReef [7]) and a Perl script that formats crl_flow output into GraphViz input.

GraphViz is software for visualizing graphs, but it only produces output in various image formats. Since our goal was an interactive visualization of the graph, which would be difficult with a static image, we decided to find another graph library.

We stayed with crl_flow because it is fast, customizable and supports many trace formats, so we wrote a parser for crl_flow output and started trying different solutions for data storage and presentation. The solutions we tried are presented below in chronological order.
our application started to consume a lot of memory for more complex filters. For example, for an out- or in-degree filter on the nodes (which requires the application to keep a counter in memory for each encountered IP address), memory usage was about 900MB. Although there are data structures more memory-efficient than the Java ArrayList (for example, a Bloom filter [33] could be used), this approach still does not scale with increasing traffic volumes.
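A minimal sketch of why such a filter is memory-hungry (the class and field names are our own, for illustration only): every distinct IP address in the trace needs its own counters in memory before an in- or out-degree filter can be evaluated.

import java.util.HashMap;
import java.util.Map;

// Per-IP degree counters; memory use grows with the number of distinct addresses seen.
public class DegreeCounter {
    private static class Degrees { int in; int out; }

    private final Map<Integer, Degrees> perIp = new HashMap<>(); // IP address stored as 32-bit int

    // Record one flow from src to dst (both given as 32-bit integer addresses).
    public void addFlow(int srcIp, int dstIp) {
        perIp.computeIfAbsent(srcIp, ip -> new Degrees()).out++;
        perIp.computeIfAbsent(dstIp, ip -> new Degrees()).in++;
    }

    // True if the address passes an "out-degree greater than threshold" filter.
    public boolean passesOutDegreeFilter(int ip, int threshold) {
        Degrees d = perIp.get(ip);
        return d != null && d.out > threshold;
    }
}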
Pros
• Lightweight
Cons
• Bad performance on larger traces
Preprocessor Clever data organization and storage are key to good performance; avoiding exhaustive search is vital. Using a preprocessor and storing the results in a suitable form on persistent storage is therefore very important. We do not want to waste memory and CPU, since they are needed for the visualization, and because we are using Java, memory management is even more important in order to avoid runtime failures such as stack or heap overflows. For each node, the preprocessor will calculate:
By using precomputed statistics, memory usage is minimized at runtime because the data is read directly from storage and does not need to be kept in memory. This type of optimization gains execution speed by sacrificing persistent storage space, which is usually much cheaper than more memory or faster CPUs.
The first failure made us think about how we could reduce memory requirements without reducing functionality. We realized that pre-computation was necessary to avoid keeping the entire IP list in memory. To implement the preprocessor we needed a language with good libraries for parsing and pattern recognition. Having taken some functional programming courses, we chose to write the preprocessor in Haskell using the highly optimized ByteString library [8].
Now that we were doing the calculations outside runtime, we chose to precompute all statistics we thought would be useful. For each node we summarized the total number of packets, bytes and flows sent or received, and computed its in- and out-degree. For edges, however, little was computed, since we could not find anything of value to precompute. We also converted the IP addresses from dotted-quad (x.x.x.x) notation to a single 32-bit integer. The visualizer was rewritten to work with the above changes and performance increased; we could now apply our filters to the complete 20-minute data set. Performance for basic filters was very good: selecting all traffic with destination port 80, or all nodes with more than 50,000 bytes sent or received, completed in a few minutes. However, the performance of combined filters was still poor, taking close to an hour. An example of a combined filter is "nodes with out-degree > 50 that communicate on port 80". The slow combined filters, together with heap overflows when a filter matched too much of the data set, made this solution unusable.
Pros
Cons
Lessons Learned Working with plain text storage seemed to be the most limiting factor. It required us to parse all the data and keep too much of it in memory, which meant that the application had trouble as traces grew larger, since there is no way of knowing how large the data set will be before processing it. The conclusion we drew from this was that we needed a database.
but now, instead of using the text files directly, we created a table for each of them (one for nodes and one for edges) and then loaded them into the database using the LOAD DATA statement.

The switch to database storage led to a huge performance improvement. The basic filters were now completing in seconds instead of minutes, and more complex filters were done in minutes instead of hours.
However, we now discovered a problem with the application. When we could test the application properly, we realized that we allowed more than one edge between two nodes. While this may not seem like a big problem, it rendered the graphs useless, since there could be hundreds of edges between two nodes, cluttering the graph to the point of incomprehension. To solve this problem we placed a unique constraint10 on the edges in the database and updated the information on an edge when a duplicate was detected. This solved the multiple-edge problem, but at the expense of data loss, since we could no longer distinguish the individual flows between two nodes. Information loss in an analysis tool is not acceptable, since we do not know in advance what data our users need.
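A minimal sketch of the approach (the table and column names are our assumptions, not the thesis schema): a unique key on the host pair plus MySQL's INSERT ... ON DUPLICATE KEY UPDATE makes a repeated flow update the existing edge instead of adding a parallel one, which is exactly the aggregation that loses the individual flows.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative "one edge per host pair" upsert; connection details are placeholders.
public class EdgeUpsert {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/traffic", "analyst", "secret");
             Statement stmt = conn.createStatement()) {
            // One unique constraint on the host pair prevents parallel edges.
            stmt.executeUpdate("ALTER TABLE edges ADD UNIQUE KEY uq_pair (src_ip, dst_ip)");
            // A duplicate flow updates the existing edge's counters instead of inserting a new row.
            stmt.executeUpdate(
                "INSERT INTO edges (src_ip, dst_ip, packets, bytes) "
              + "VALUES (2130706433, 2130706434, 10, 4800) "
              + "ON DUPLICATE KEY UPDATE packets = packets + VALUES(packets), "
              + "bytes = bytes + VALUES(bytes)");
        }
    }
}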
Pros
• Good for basic and complex filters
• Good performance even on large traces
• Better scaling
• Good memory management
Cons
• Information loss!
Lessons Learned This solution seemed to solve all the problems we had. It relieved the memory appetite, gave good performance and enabled size estimation. But one severe bug appeared: the way we added the data into the database made us lose information about which protocols were responsible for which traffic. This is not good, because that information is useful when searching for malicious traffic. The problem resided in our preprocessor, and after much effort to fix it we decided to drop the preprocessor. We first considered writing a new one in C or the like, but we realized that we could use MySQL constructs instead, more specifically triggers and on-update.
10. A unique constraint is used to guarantee that no duplicate values are entered in specific columns which are not part of the primary key.
4.1.4 Java with MySQL
To solve the data loss and multiple edge problems we decided to make a
proper database design.
We also found a way to perform the statistics computations during the inserts into the database, using triggers11.
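A minimal sketch of the idea (trigger, table and column names are our assumptions, not the thesis schema): a trigger on the edge-data table updates the per-node statistics as each flow row is inserted, so the counters never have to be recomputed at query time.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative MySQL trigger that maintains node statistics at insert time.
public class StatsTrigger {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/traffic", "analyst", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TRIGGER edge_data_stats AFTER INSERT ON edge_data "
              + "FOR EACH ROW BEGIN "
              + "  UPDATE nodes SET bytes_sent = bytes_sent + NEW.bytes "
              + "    WHERE ip = (SELECT src_ip FROM edges WHERE id = NEW.edge_id); "
              + "  UPDATE nodes SET bytes_received = bytes_received + NEW.bytes "
              + "    WHERE ip = (SELECT dst_ip FROM edges WHERE id = NEW.edge_id); "
              + "END");
        }
    }
}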
This table setup minimizes redundancy and solves a visualization issue when there is more than one flow between two hosts. An edge is created as soon as some form of communication has been seen between two hosts. The actual data is then stored in the edge-data table, along with the corresponding edge id. With the edge id in the edge-data table, the actual edge can be retrieved, and from the edge the source and destination nodes can be reached via their IP addresses.

Timestamps and traffic volumes are stored in the edges table to allow searches for high-bandwidth and/or high-throughput links. The reason for storing traffic volumes in the nodes is to be able to search for traffic-heavy nodes.
Pros
• Better scaling
Cons
• Reliance on triggers
Lessons Learned This solution fixed the severe issue with the previous solution while keeping all of the improvements. But it added a rather big constraint on the database: it must support triggers. This means that MySQL version 5 or later is required, as well as a user account with the SUPER privilege on the database. However, the SUPER privilege is only required to add data to the database, not to use existing data.
To optimize the indices for the IP addresses, we store each address as a single 32-bit integer, since indices work better on numerical types than on strings. The conversion from standard dotted IP notation is done with the MySQL function INET_ATON; for example, INET_ATON('127.0.0.1') is 2130706433.
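A minimal sketch of the conversion (the Java method is our own illustration; only INET_ATON itself is a MySQL function): each octet is shifted into a single 32-bit value, so 127.0.0.1 becomes 127*2^24 + 1 = 2130706433.

// Converts a dotted-quad IPv4 address to the 32-bit integer that MySQL's INET_ATON() returns.
public class IpToInt {
    static long inetAton(String dottedQuad) {
        long value = 0;
        for (String octet : dottedQuad.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet); // shift in one octet at a time
        }
        return value; // returned as long to avoid sign problems with addresses above 127.255.255.255
    }

    public static void main(String[] args) {
        System.out.println(inetAton("127.0.0.1")); // prints 2130706433, matching INET_ATON('127.0.0.1')
    }
}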
One thing to keep in mind when constructing indices in MySQL is that a composite index works as a leftmost prefix chain:
1. col1 → col2 → col3
2. col1 → col2
3. col1
So if an index is built on (col1, col2, col3), an index on (col1, col2) and an index on col1 are automatically gained, but no indices on col2, col3 or (col2, col3). This is useful because some indices are given for free when constructing larger ones, and thus redundant indices are minimized [23].
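A minimal sketch (the index and column names are our assumptions): one composite index on (src_ip, dst_ip, dst_port) also serves queries that filter on src_ip alone or on (src_ip, dst_ip), but not queries that filter only on dst_port.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative composite index creation; the leftmost-prefix rule makes the smaller indices redundant.
public class EdgeIndex {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/traffic", "analyst", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE INDEX idx_edge_lookup ON edges (src_ip, dst_ip, dst_port)");
        }
    }
}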
Using crl_flow To get the needed output from crl_flow, the following command and arguments are used: crl_flow -Tf60 -o outfile tracefile
• -Tf60: sets the flow timeout to 60 seconds. This means that a flow is terminated 60 seconds after its last packet is seen. The CAIDA website [4] recommends setting the flow timeout to 60s.
• If the input is a bi-directional trace contained in two separate files, they can be merged by using the flag -m, i.e. -m tracefile1 tracefile2 instead of tracefile.
• -o specifies the output file name.
Stream Editor - sed Sed is used to format the flow data for parsing, using regular expressions to clean and reformat the crl_flow output.
• Remove all comments (lines beginning with #)
• Split the timestamp into seconds and subseconds12
12. crl_flow denotes timestamps as seconds.subseconds (fractions of a second). This is harder to parse with a regexp, since the IP addresses also contain dots, so we reformat the timestamps to list seconds and subseconds separately.
Database script This script creates the required tables in the database and fills them with data. It works by collecting all needed information on the command line and then performs the following operations, in the listed order:
• Create and populate the protocol table if it does not already exist.
• Create the node, edge and edge-data tables for this trace
4.5.1 Filter interface
In the first version of the application, no filtering was available through the
interface. This version was primarily for testing the graph library. Skeleton
menus were added in preparation for holding different graph manipulation
operations.
At this point in time, we were using plain text storage and had not decided how to apply the filters. The options we considered were typing the filter rules on the command line when starting the application, or using a dialogue accessed from a menu. If the filtering was done on the command line, we would have to write our own parser and filtering language. If we used a form in a dialogue, we would need to make it flexible enough to handle more complex filters.
In the second version, the interface was redesigned and the menus were discarded in favor of a tabbed layout. The filtering interface was given its own tab, since the application needed to handle a database containing far more information than could be visualized. It is therefore necessary to filter the input before drawing the graph. This made the filtering as important as drawing the graph, and that was reflected in the new design.
The main reason for the redesign was that we replaced the plain text storage with MySQL. MySQL uses SQL for querying, and we could take advantage of its flexible and powerful syntax for filtering. We estimated that it would be too time-consuming to create an interface that would give the user the same level of flexibility, power and freedom as SQL. This decision imposes a new requirement on the user, but since the user would have had to learn a new filtering language in any case, we chose SQL since there is a lot of material available on it.
The filtering interface became a single text field that exposed the WHERE clause of a SELECT query13. A SQL SELECT query retrieves data from a SQL database; the data retrieved is limited by the WHERE clause.
Example:
SELECT * FROM users WHERE username='testuser';
This would select the rows in the users table that have the username 'testuser'.
By exposing the WHERE clause, the user does not have to write a full SELECT query and can focus on filtering the data.
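As an illustration of this design (not the tool's actual code; connection details and column names are assumptions), the application only has to wrap the user's filter text in a fixed SELECT statement. The user is deliberately trusted to write SQL here, which is the whole point of exposing the WHERE clause.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative wrapping of a user-supplied WHERE clause into a full SELECT query.
public class WhereClauseFilter {
    public static void main(String[] args) throws Exception {
        String userFilter = "dst_port = 80"; // typed by the user in the filter field
        String query = "SELECT src_ip, dst_ip FROM edges WHERE " + userFilter;

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/traffic", "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                System.out.println(rs.getLong("src_ip") + " -> " + rs.getLong("dst_ip"));
            }
        }
    }
}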
In the third version of the application, the single text field became three fields, one for each table: nodes, edges and edge data. The SELECT query used in the second version was a join between the tables edges and edge-data, as they contain most of the data. However, this left out the nodes table with statistics for each node, since a three-table join would be too expensive. Instead each table was given its own field, and the filters for each table narrow down the results.
Figure 2: The query tab with three exposed MySQL queries and a log win-
dow.
New in this version was the description of the tables and the possibility to count the number of results of a query. Since there is a limit on how many nodes the graph library can handle, depending on the hardware, we try to lower the risk of a crash by providing the option of counting the number of results before drawing the graph14.
13. The WHERE clause limits which rows in the tables are included in the result.
14. Since it is impossible for us to know exactly how much data a computer can handle (there are too many variables to consider), we let the user estimate this instead.
4.5.2 Graph interface
The first version of the GUI was a prototype for testing the JUNG library. Since the focus was on the graph, a single window with the graph drawing area was used. In this version, the general graph drawing features were implemented, such as scrolling, zooming and pick support (moving the whole graph).

The second version of the GUI used a tabbed layout with the filtering and graph drawing areas on separate tabs. The application needed to be usable with far more data than the graph would be able to handle. That made filtering as important as the resulting graph, and this was reflected in the design by putting the filtering before the graph view.
Figure 4: Graph showing traffic over the Telnet protocol (port 23)
Several new features were added to the graph tab: basic node information, a time interval filter and extraction of subgraphs.

Information about a node is shown in a popup window when it is clicked. The popup shows which IP address (in decimal notation) the node represents, as well as its in-degree and out-degree15. The degree values in the graph are based on the filtered subset that is used to draw the graph.
15. Communication between two IP addresses may be limited to one direction. This can have several different reasons; two plausible ones are that the trace is unidirectional or that the return traffic occurred outside the capture window. In-degree and out-degree are the number of incoming and outgoing connections, respectively.
The filtering over a set time interval has two purposes: showing change over time and de-cluttering the graph. If the initial filtering gives a large number of results, the drawn graph will be very cluttered and could be difficult to analyze. The time filtering only shows the traffic that occurs within the time interval, while retaining the positions of the nodes.
Figure 5: Only the traffic in the last 60 seconds of the time span is displayed.
Figure 6: The sixty second time interval has been extracted to a separate
graph and the layout has been recalculated.
The time filtering panel is placed under the graph and provides controls
for showing the traffic in a smaller time interval. The slider marks the start
while the spinner gives the extent of the interval. The interval is also used to
define which nodes to include when extracting a subset of the graph. Since
the time filter only controls which graph elements to display, the layout of
the graph is unchanged and the placement of the graph elements appears
unbalanced. However, the layout is recalculated when those elements are extracted to a new window.
5 Results
5.1 Application usage
5.1.1 Simple
A typical usage is to define a filter that only takes a small amount of information into account; this could be the protocol, source port, destination port or a combination of these. The purpose of such a filter is to see whether there is any traffic with those characteristics: for example, is there any UDP traffic in the trace, how much traffic has destination port 80, or how many nodes have an out-degree of more than 3000 and have transmitted more than 40MB of data.
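As an illustration (the column names are our assumptions, not the tool's actual schema), the checks above could be expressed as WHERE-clause filters like the following:

// Hypothetical filter strings matching the simple checks described above.
public class SimpleFilters {
    public static void main(String[] args) {
        String anyUdp = "protocol = 17";                                    // is there any UDP traffic?
        String port80 = "dst_port = 80";                                    // traffic with destination port 80
        String heavyNodes = "out_degree > 3000 AND bytes_total > 40000000"; // busy, traffic-heavy nodes
        for (String filter : new String[] { anyUdp, port80, heavyNodes }) {
            System.out.println("WHERE " + filter);
        }
    }
}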
Some examples of simple usage:
Figure 9: Searching for traffic originating from any port larger than 65530,
here we see normal communication
5.1.2 Advanced
Advanced usage is to use the application's main feature, the visualization, to draw conclusions and to do further queries based on those conclusions. This is useful, for example, to find malicious activity like botnets, worms and
Figure 10: Searching for traffic on port 79 (finger), here we see a potential
scanning attack
Data mining In this usage example, we are looking for interesting patterns in a random port interval. The port interval was selected randomly within the range used for dynamic or private ports.
Filter used: source-port>64999 and source-port<65101
This gives us a graph with a mix of unknown application traffic with a source port between 65000 and 65100, but we might be able to discover an interesting node that we would not be able to find otherwise.
In the upper right corner there is a host that is connected to at least three
hosts that have a large number of connections. Here is a closer look with the
node marked with a square.
Figure 12: The square marks a node connected to several nodes which in turn have many connections.
Figure 13: The node communicates with only a few other nodes.
A count of the four nodes connected to the center node shows that they are communicating with several thousand other nodes in our 20-minute trace.

Since the number of edges is almost double the number of nodes, the communication is clearly going in both directions. From this we can conclude that the four nodes are probably not performing a scan. If the traffic had only been going from nodes 1-4, it could have been an indication of a botnet or similar.
6 Conclusions
In this thesis, we have presented a visualization tool for network traffic with filtering capabilities. The tool can be used to analyze any network trace that can be reduced to flows. The user can define filters to check for specific traffic patterns in large data volumes, for example port scanning. The application can also be used for data mining, which means starting out with traffic on a randomly selected port and then exploring it further by refining the filters based on the results of the previous ones. It can also be used for discovering protocol patterns by analyzing traces from controlled environments, which can be useful when developing new protocols.
Our application is split into two parts: a backend responsible for data
management and a front-end that is responsible for visualization and user
interaction.
The backend is realized with a SQL database that is accessed by the front-end through standard SQL queries. The reason for using SQL queries is that, regardless of whether we construct our own language or use an existing one, the user still faces a learning curve. Learning something that is standardized and has a lot of available documentation should be easier and more worthwhile.
The front-end is realized in Java because it is platform independent and
has good third party graph libraries available. This is essential because the
visualization is the core of our application and writing a graph visualization
library from scratch is a thesis of its own.
The result is an application that is more of a proof of concept than a
general tool and thus requires some expert knowledge to use. The user needs
to be familiar with network concepts such as protocols and flows. The user
also needs to know SQL syntax to construct the filter rules used for selecting
data to be visualized.
That said, the application is still very usable, and since the target user group is network analysts, the knowledge needed to use the application is most likely already covered.
7 Future work
This version of the application is a proof of concept. To be truly usable in
a production environment a lot of improvements to the presentation of data
must be made. Some examples:
• Extend the node information and make the information retrieval dy-
namic (fetch info when it is requested)
References
[1] Martin F. Arlitt and Carey Williamson. The extensive challenges of
internet application measurement. IEEE Network, 21(3):41–46, 2007.
[3] Nevil Brownlee and K.C. Claffy. Internet measurement. IEEE Internet
Computing, 8(5):30–33, 2004.
[9] Jeffrey Erman, Martin Arlitt, and Anirban Mahanti. Traffic classifica-
tion using clustering algorithms. In MineNet ’06: Proceedings of the
2006 SIGCOMM workshop on Mining network data, pages 281–286.
ACM, 2006.
[11] Scott Fluhrer, Itsik Mantin, and Adi Shamir. Weaknesses in the key scheduling algorithm of RC4. Lecture Notes in Computer Science, 2259:1–??, 2001.
[14] Internet assigned numbers authority. http://www.iana.org/. [Online;
accessed 25-July-2010].
[17] Wolfgang John and Sven Tafvelin. Heuristics to classify internet back-
bone traffic based on connection patterns. In ICOIN ’08: 22nd Inter-
national Conference on Information Networking, 2008.
[18] Tomihisa Kamada and Satoru Kawai. An algorithm for drawing general
undirected graphs. Information Processing Letters, 31(1):7–15, 1989.
[23] MySQL 5.1 Reference Manual. 7.4.4 How MySQL uses indexes. http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html, 2009. [Online; accessed 22-October-2009].
[24] Anthony Mcgregor, Mark Hall, Perry Lorier, and James Brunskill. Flow
clustering using machine learning techniques. Passive and Active Net-
work Measurement, pages 205–214, 2004.
[25] J. Mignault, A. Gravey, and C. Rosenberg. A survey of straightforward statistical multiplexing models for ATM networks. Telecommunication Systems, 5(1):177–208, 1996.
[26] Thuy Nguyen and Grenville Armitage. A survey of techniques for inter-
net traffic classification using machine learning. IEEE Communications
Surveys & Tutorials, 10(4):56–76, 2008.
[27] M. Perenyi, D. Trang Dinh, A. Gefferth, and S. Molnar. Identification and analysis of peer-to-peer traffic. Journal of Communications, 1(7):36–46, 2006.
[28] Skype. http://www.skype.com/, 2009. [Online; accessed 21-October-
2009].
[29] Spotify. http://www.spotify.com/, 2009. [Online; accessed 3-
November-2009].
[30] Erik Tews, Ralf-Philipp Weinmann, and Andrei Pyshkin. Breaking 104 bit WEP in less than 60 seconds. Cryptology ePrint Archive, Report 2007/120, Apr 2007.
[31] Voddler. http://www.voddler.com/, 2010. [Online; accessed 24-July-
2010].
[32] Wikipedia. Caesar cipher — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Caesar_cipher&oldid=318415052, 2009. [Online; accessed 21-October-2009].
[33] Wikipedia. Bloom filter — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Bloom_filter&oldid=346894015, 2010. [Online; accessed 24-July-2010].
[34] Wikipedia. Traffic flow (computer networking) — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Traffic_flow_(computer_networking)&oldid=355521842, 2010. [Online; accessed 24-July-2010].
[35] Carey Williamson. Internet traffic measurement. IEEE Internet Com-
puting, 5(6):70–74, 2001.