

Automatic root cause analysis for LTE networks


based on unsupervised techniques
A. Gomez-Andrades, P. Munoz, I. Serrano and R. Barco

A. Gomez-Andrades, P. Munoz and R. Barco are with the Department of Communication Engineering, University of Malaga, 29071, Malaga, Spain (e-mail: {aga, pabloml, rbm}@ic.uma.es). I. Serrano is with Ericsson SDT EDOS-DP, 29590, Malaga, Spain (e-mail: [email protected]).
Abstract: The increase in size and complexity of current


cellular networks is complicating their operation and maintenance tasks. While the end-to-end user experience in terms of
throughput and latency has been significantly improved, cellular
networks have also become more prone to failures. In this
context, mobile operators start to concentrate their efforts on
creating Self-Healing networks, i.e. those networks capable of
performing troubleshooting in an automatic way, making the
network more reliable and reducing costs. In this paper, an
automatic diagnosis system based on unsupervised techniques for
Long-Term Evolution (LTE) networks is proposed. In particular,
this system is built through an iterative process, using Self-Organizing Maps (SOM) and Ward's hierarchical method,
in order to guarantee the quality of the solution. Furthermore,
in order to obtain a number of relevant clusters and label them
properly from a technical point of view, an approach based on the
analysis of the statistical behaviour of each cluster is proposed.
Moreover, with the aim of increasing the accuracy of the system,
a novel adjustment process is presented. It intends to refine the
diagnosis solution provided by the traditional SOM according
to the so-called Silhouette index and the most similar cause on
the basis of the minimum Xth percentile of all distances. The
effectiveness of the developed diagnosis system is validated using
real and simulated LTE data by analysing its performance and
comparing it with reference mechanisms.
Index Terms: LTE, Self-Healing, diagnosis, fault identification,
unsupervised learning, Self-Organizing Maps, silhouette.

I. INTRODUCTION
Cellular networks are an effective means of communication that allow people to instantaneously send and receive information anywhere. In recent years, as a consequence
of a sharp increase in the traffic demand, the infrastructure of
cellular networks has been profoundly modified. The higher
complexity of these networks has encouraged mobile operators
to implement effective and low cost management algorithms.
In this context, Self-Organizing Networks (SONs) establish a
new concept of network management in which the operation
and maintenance tasks are carried out with a high level of
automation [1]. Within this paradigm, Self-Healing is a major
SON category whose aim is to automate the troubleshooting
process [2], [3], which is composed of the detection, diagnosis,
compensation and recovery phases. As cellular networks are
currently more prone to failures due to their huge increase
in size and complexity, Self-Healing is gradually gaining


attention from operators. In this sense, the enormous diversity


of performance indicators, counters, configuration parameters
and alarms can be used to develop more intelligent and
automatic techniques that cope with faults in a much more
efficient manner. The implementation of these mechanisms
has a direct impact on network availability and operators' revenues.
The aim of this paper is to present an automatic diagnosis
system that can be part of a Self-Healing Long Term Evolution (LTE) network. Although there are some references
related to automated diagnosis in the literature [4], [5], [6],
it is extremely difficult to deploy those proposed systems
in real networks. This is due to the fact that the design
of these systems is based either on expert knowledge or
on historical databases of fault cases. On the one hand,
experts in troubleshooting do not have either the time or the
expertise to build the required complex models. On the other
hand, although there are historical databases of performance
indicators, they do not contain any indication of whether
there is a fault or its cause. Therefore, compared to previous
references, the key characteristic of the proposed system is
that it provides both a simple way of analysing the behaviour
of faults and an accurate fault identification without requiring
labelled historical databases (labelled means that the fault
cause has been identified and it is included in the database).
In this context, there are two types of learning algorithms,
supervised and unsupervised, whose application depends on
whether the training dataset includes additional information
about the cause of the problem or only the raw data (i.e.
performance indicators), respectively.
This paper presents an automatic diagnosis system based
on different unsupervised techniques with SOM as the centrepiece. In particular, the proposed system consists of three
novel approaches:
The first one is the proposed unsupervised method to
automatically identify the number of clusters that an
expert will consider statistically different. To that end,
the combined use of the Davies-Bouldin index and the
Kolmogorov-Smirnov test [7] is proposed.
The second one is related to how the expert knowledge
is introduced in the system. In order to make the system
autonomous during the exploitation stage, the expert
knowledge is included in the design stage. The proposed
method focuses on the study of the statistical behaviour
of the clusters by inspection of the training data associated with each of them, instead of analysing directly
the characteristics of the neurons, which provides less
detailed information. With this information, experts are
able to identify whether each individual cluster is relevant


and represents a particular fault of mobile networks. As


a result, the diagnosis system will be able to diagnose
a set of faults that has been validated and approved by
experts.
The third contribution is the proposed adjustment in the
exploitation stage. Once the diagnosis system is built,
it can be used as part of a Self-Healing system to
automatically diagnose faulty cells, i.e. to determine the
cause of the problem. As mentioned earlier, this diagnosis
must be as accurate as possible. Therefore, the main
difference with other SOM systems is that the mapping
between the input metrics and the root cause includes
an additional adjustment phase where the worst-cases
(i.e. those whose diagnosis is not clear) are analysed in
detail. In particular, the proposed approach intends to
find out the cause with greatest similarities among all
neighbouring cases on the basis of the distances within
each case. Then, this is compared to the initial decision
corresponding to the original SOM algorithm and the
solution that maximises the average Silhouette index [8]
is determined.

The rest of the paper is organized as follows. First, the existing work related to fault diagnosis in mobile communication
systems is presented in Section 2, followed by the motivation
for the design of an unsupervised diagnosis system in Section
3. Then, in Section 4, the proposed system is described in
detail, explaining the concepts and the algorithms used in each
stage. Section 5 demonstrates the validity and the robustness
of this approach using real and simulated data and comparing
its efficacy with reference mechanisms. Finally, in Section 6,
the main conclusions are discussed.
II. RELATED WORK
Even though automatic fault identification is essential for the prompt enforcement of maintenance decisions, the related literature in the field of mobile networks is scarce; it thus remains an open research problem. Nevertheless, some solutions for mobile networks can be found in the literature, and an overview is presented below.

Bayesian Networks: Some studies, such as [4], [5], proposed the use of Bayesian Networks as classifiers of fault problems in order to automatically diagnose them. In particular, [4] presents a method based on a naive Bayesian
classifier to identify the fault cause in GSM/GPRS, 3G
or multi-systems networks using performance indicators
as continuous variables. In addition, [5] presents another
automated fault diagnosis system based on Bayesian
Networks for UMTS networks using two different algorithms to discretize the performance indicators (i.e.
the percentile-based discretization (PBD) and the entropy
minimization discretization (EMD)).
Scoring-based systems: A different approach based on a
scoring system has been proposed in [6]. This detection
and diagnosis system is automatically built from the
labelled fault cases reported by experts and it uses a
scoring system in order to determine how well a specific

case matches each diagnosis target. [9] proposes more sophisticated profiling and detection capabilities to improve
this framework.
All these techniques proposed in the literature, i.e. [4],
[5], [6], [9], are supervised techniques. Therefore, their
diagnosis process requires a historical database of labelled fault cases in order to learn the impact the faults
have on the performance indicators. However, when the
set of documented and solved cases is poor, partly due to
the limited occurrences of each fault, unsupervised techniques are the most appropriate ones. This is precisely
what occurs with the historical records of faults in mobile
networks because experts are not concerned with storing
the cause of the problem or the actions they took when
performing their troubleshooting tasks.
Neural Networks: Several works have demonstrated the
potential and utility of using the unsupervised neural
network known as Self-Organizing Maps (SOM) [10]
in order to automate the fault detection phase of the
troubleshooting tasks (that is, the phase prior to diagnosis). In particular, in [11] SOM is used to build the
normal profile of the network and to determine the
healthy ranges of the selected performance indicators (i.e.
the symptoms). Therefore, this system helps to identify
whether the symptoms are healthy or faulty, without ever
providing the fault cause. In addition, [12], [13], [14],
[15] show how SOM can be used to analyse multidimensional Third Generation (3G) network performance
data in order to aid the manual fault diagnosis. The
aim is to cluster cells based on their performance in
order to assist experts in their manual troubleshooting
and parameter optimization tasks. In contrast, the system
proposed in this paper aims to provide automatically and
directly the fault cause of a problematic cell without
any supervision of the experts in the exploitation stage.
Therefore, it is essential for the proposed system to
ensure two important aspects: (a) all identified clusters
must present different statistical behaviours in order to
guarantee that there are not any similar clusters associated
to the same fault cause; and (b) the final diagnosis must be
as accurate as possible. Even though the proposed system
is based on SOM technique, as stated before the purpose
of the whole system is different from the one presented in
[12-15]; thus, the system proposed in this paper consists
of additional and complex techniques in order to achieve
those requirements.

Regarding the use of the Silhouette index with SOM techniques, there are some references in the literature (not related to wireless networks), such as [16] and [17]. However, in those cases, the Silhouette index is only used with the goal of evaluating the quality of the obtained clustering. Furthermore, in [18], this index is used as an indicator to choose the best
clustering technique. Unlike the previous references, in this
paper, the Silhouette index is used to correct the mapping of
a given input.


III. PROBLEM FORMULATION


Diagnosis is one of the most critical functions in a Self-Healing network. It is therefore essential to ensure that the
automatic fault identification is accurate and reliable, so that
experts do not have to check every diagnosis provided by the
system. Consequently, the automatic diagnosis system must
emulate the manual process followed by an expert to determine
the existing faults.
Through this manual procedure, experts analyse the symptoms or measurements that may reveal the cause of the
problem. These symptoms could be alarms, key performance
indicators (KPIs), configuration parameters, etc. Thus, depending on which symptoms are degraded and their level of deterioration, experts can identify the fault affecting a cell. However, this raises difficulties that experts must face; difficulties that are also present when the diagnosis system is designed. First, the system must be able to operate with a wide range of symptoms. Furthermore, each symptom has different features,
e.g.: it can be either discrete or continuous; its range of
variation can be limited or not; etc. Finally, the automatic
diagnosis system, like experts, must know the effects that
each fault causes on the symptoms in order to perform the
identification from among a plurality of faults.
In addition, automating the diagnosis process implies that
the diagnosis system has to learn how the faults behave. A
possible approach could be to extract the information from stored cases that have been solved satisfactorily and whose fault is known (i.e. labelled cases). Such a dataset would allow building an automatic system through supervised learning.
Nevertheless, since experts do not tend to collect the values of
the KPI along with a standard label associated to the cases that
they resolve, the available historical records are characterized
by being scarce. In particular, they do not have a high variety
of faults and, for each specific fault, there is not a high number
of labelled cases. As a result, the historical data obtained from
a real network is not sufficiently rich to build a diagnosis
system with supervised techniques. Although this valuable
information could be obtained from the expert knowledge, this
would give rise to a system that would highly depend on the
notions provided by the experts and how this knowledge is
translated into the system. Furthermore, each network may
have different faults and not all of them may be known by the
experts. Consequently, the unsupervised technique proposed in
this paper is the best solution for the design of the automatic
diagnosis system, since it avoids the use of labelled cases for
its construction.
Unsupervised methods allow building systems from a
dataset taken directly from the real network, without including
any information about the fault cause in question. Moreover,
the dataset used to design the system contains symptoms from
both healthy and faulty cells because these two states cannot
be distinguished, i.e. the dataset is unlabelled. Therefore,
one additional aspect to consider, arising from the use of
unsupervised techniques, is that the obtained system must
be able to identify not only the cause of the problems but
also whether a cell has a problem or not. As a result, the
automatic system performs simultaneously the diagnosis and the detection functions of the self-healing procedure. Even though the dataset is previously filtered by the detection phase, some samples of healthy cells will reach the diagnosis phase (due to non-idealities of the detection phase). This leads to a diagnosis system that is trained with both healthy and problematic cases. Thus, those healthy cases should also be identified.

Fig. 1. Automatic root cause analysis: the input stage (KPI database and pre-processing), the semiautomatic design stage (training, clustering and labelling) and the exploitation stage (BMU, percentile-based approach, Silhouette controller and adjustment).
IV. PROPOSED DIAGNOSIS SYSTEM
The root cause analysis procedure can be generally formulated through the scheme shown in Fig. 1, where $S = [KPI_1, KPI_2, \ldots, KPI_M] \in \mathbb{R}^M$ is the input vector that represents the state of the cell by means of $M$ KPIs and $c$ is the output of the system, whose value belongs to the set $C = \{FC_1, \ldots, FC_L, N\}$. Therefore, the output $c$ determines whether the cell presents a normal behaviour ($N$) or not, indicating, in the latter case, its fault cause ($FC$). Firstly, in the
input stage, the specific KPIs are selected and pre-processed.
Then, the pre-processed input vector flows toward the diagnosis system. In particular, the diagnosis system proposed in this
paper follows the scheme shown in Fig. 1 and its core is based
on SOM. As it can be observed, this system can be executed
in two different modes, one corresponding to the design phase
(the cross-hatched area in Fig. 1) and another corresponding
to the exploitation phase (the hatched area in Fig. 1). In the
first one, the diagnosis system is built, while in the second
phase, the system is used to diagnose the fault cause. Each of
these three stages is described in detail below.
A. Input stage
The input data vector (S) consists of all relevant KPIs of the
cell under study. These KPIs can be estimated with different
time aggregation levels (hourly, daily, weekly, monthly, etc.)
according to the required granularity (level of detail) of the
diagnosis process. Two types of input data can be distinguished
depending on whether the data is used in the design stage or
in the exploitation stage:
Training data: To build the proposed system, a training
dataset with as many cases as possible is required. The
resulting system will depend greatly on the training
dataset, thus the training data and the data to be used in
the exploitation stage must have the same features, e.g.


the same KPIs, time and layer (per cell, per base station,
etc.) aggregation levels, etc. Since no labels are used to
obtain the training dataset, there will be data describing
normal behaviour and cells with abnormal behaviour due
to faults. It is important to highlight that since the dataset
will include data both from normal cells and faulty cells,
one of the classes obtained after the clustering phase
will be the normal profile, while the rest of classes will
correspond to faults.
Analytical data: During the exploitation stage, the input
data is the specific cell's state, taken directly from the cell
under study.
The input of the diagnosis system must be pre-processed
in order to suit the technical requirements of the system. In
particular, for the proposed system, the input data must be
quantitative, i.e. they can be expressed in terms of numbers
(e.g. power, throughput). In particular, performance indicators
of mobile networks are characterised by being numerical
variables. As a consequence, this system is appropriate to
automate the troubleshooting process in a mobile network by
working directly with KPIs, avoiding both the discretization
of the variables and the definition of the thresholds by experts.
However, given that the proposed system is based on the
Euclidean distance, the raw data taken from the network must
be normalised. This ensures that their dynamic ranges are
similar and thus there are no high values that dominate the
training. In this system, the normalization process is performed
by using the following methods:
Range normalization: This method transforms the dynamic range of a particular metric ($KPI_i$). The objective is to ensure that all input variables range within the desired interval. In particular, this method is applied only to those KPIs whose values are not within the interval [0, 1]. This normalization is given by the following equation, on the basis of the minimum and maximum values of $KPI_i$ in the training dataset ($\widehat{KPI}_i$):

$$\widetilde{KPI}_i = \frac{KPI_i - \min(\widehat{KPI}_i)}{\max(\widehat{KPI}_i) - \min(\widehat{KPI}_i)} \qquad (1)$$

Z-score normalization: This technique modifies the input variable in order to achieve zero mean and unit standard deviation. It is carried out taking into account the mean and the standard deviation (std) of $\widehat{KPI}_i$ through the following linear operation:

$$\widetilde{KPI}_i = \frac{KPI_i - \mathrm{mean}(\widehat{KPI}_i)}{\mathrm{std}(\widehat{KPI}_i)} \qquad (2)$$

In this case, this normalization is applied to all input


variables to guarantee that all KPIs have unit variance.
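The two normalizations above are straightforward to reproduce. The following minimal sketch (Python/NumPy, for illustration only; the authors' implementation is in MATLAB, and the function and variable names here are assumptions) applies Eq. (1) to KPIs outside [0, 1] and Eq. (2) to every KPI, using statistics computed on the training dataset:

```python
import numpy as np

def range_normalize(kpi, kpi_train):
    # Eq. (1): map the KPI onto [0, 1] using the min/max observed in the training set
    lo, hi = np.min(kpi_train), np.max(kpi_train)
    return (kpi - lo) / (hi - lo)

def zscore_normalize(kpi, kpi_train):
    # Eq. (2): zero mean and unit standard deviation, using training-set statistics
    return (kpi - np.mean(kpi_train)) / np.std(kpi_train)

def normalize_dataset(X):
    # X: rows are cell states, columns are KPIs
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.min() < 0 or col.max() > 1:      # range normalization only if outside [0, 1]
            col = range_normalize(col, col)
        X[:, j] = zscore_normalize(col, col)    # z-score applied to all KPIs
    return X
```

In the exploitation stage the same training-set minima, maxima, means and standard deviations would be reused for each incoming cell state.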
B. Semiautomatic design stage
Before the proposed diagnosis system can be used, it must
be designed through an iterative process (see Fig. 1) with the
goal of obtaining an accurate system. The proposed design
method consists of different unsupervised techniques with

the SOM as the key algorithm, along with the validation of


the experts who determine if the obtained diagnosis system
makes sense. Thus, this semiautomatic design is carried out
through three different phases: unsupervised SOM training,
unsupervised clustering and labelling by experts.
1) Unsupervised SOM training: A self-organizing map
(SOM) is a type of unsupervised neural network capable of
acquiring knowledge and learning from a set of unlabelled
data. Therefore, in this paper, SOM is used as the centrepiece to classify the cell's state according to the behaviour of its symptoms and subsequently identify the fault cause. The great advantage of SOM is its capacity for processing high-dimensional data and reducing it to a lower dimension (e.g. two), which enormously facilitates the interpretation and understanding of the final diagnosis. Furthermore, this system does not require discrete data, which enables working directly with raw data, without any discretization methods that cause loss of information.
In particular, SOM consists of elements (artificial neurons) which are organised in a two-dimensional grid. Each of those neurons has a specific weight vector $W = [W_{KPI_1}, W_{KPI_2}, \ldots, W_{KPI_M}] \in \mathbb{R}^M$, whose dimension $M$ is determined by the number of KPIs in the input vector. Firstly, all weight vectors are initialised, for example, through the linear initialisation method described in [10]. Via this method the weight vectors are initialised in an orderly fashion along a two-dimensional subspace spanned by the two principal eigenvectors of the training dataset [19]. The advantage of the linear initialisation is that the convergence of the algorithm is much faster than if the random method is used. Then, these weight vectors are updated through an unsupervised training process in order to determine the values of the weight vectors that best match the behaviour of the input data. Fundamentally, the training process depends on the training data and the neighbourhood function ($h_{ij}$) that links neurons $i$ and $j$; it consists of identifying the winner neuron or best matching unit (BMU) and subsequently updating both its weight vector and the weight vectors of all its neighbouring neurons.
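As a rough illustration of the linear initialisation idea (a sketch only, not the SOM Toolbox routine; the grid size, scaling and names below are assumptions), the weight vectors can be spread over the plane spanned by the two principal eigenvectors of the normalised training data:

```python
import numpy as np

def linear_init(X, rows=10, cols=10):
    # Plane spanned by the two leading eigenvectors of the training-data covariance
    mean = X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    eigval = np.clip(eigval, 0, None)                     # guard against tiny negative values
    v1, v2 = eigvec[:, -1], eigvec[:, -2]                 # principal directions
    s1, s2 = np.sqrt(eigval[-1]), np.sqrt(eigval[-2])     # spread along each direction
    a = np.linspace(-1, 1, rows)[:, None, None]
    b = np.linspace(-1, 1, cols)[None, :, None]
    W = mean + a * s1 * v1 + b * s2 * v2                  # shape (rows, cols, M)
    return W.reshape(rows * cols, -1)
```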
In this paper, the SOM training is done in two phases. The
first phase is the rough phase which aims to order the neurons,
while the second one is the fine phase that intends to achieve
the convergence, as follows:

Rough phase: first, the training in the rough-tuning phase is carried out by the Batch algorithm [19] (summarised in Algorithm 1) due to its rapid convergence [20], especially in the cases where the linear initialization method is used. The Batch algorithm is an iterative method that is characterised by modifying all the weight vectors after the entire dataset has been presented to the SOM. At the beginning of each iteration ($t$), for each normalised input ($S_i$), the winner neuron, namely the best matching unit (BMU, $n_i^t$), is searched in relation to the minimum Euclidean distance ($d$). Afterwards the weight vectors ($W^t$) of all neurons are updated taking into account the bubble neighbourhood function ($h^t_{n_i^t j}$) [21] and its radius ($\sigma^t$). In particular, that function determines the neurons around $n_i^t$ that are considered neighbours and

Algorithm 1. Batch Self-Organizing Map: Rough-tuning
1. Calculate the best-matching units (BMU) for all input vectors:
   $n_i^t = \arg\min_j ||S_i - W_j^t||, \quad i \in [1 \ldots P]$
   where $P$ is the number of input vectors in the training dataset.
2. Apply the neighbourhood function (Bubble):
   $h^t_{n_i^t j} = 1$ if $d(n_i^t, j) < \sigma^t$, and $h^t_{n_i^t j} = 0$ if $d(n_i^t, j) > \sigma^t$,
   where $\sigma^t$ is the neighbourhood radius in the current iteration $t$:
   $\sigma^t = \sigma^0 - (t-1)\,\frac{\sigma^0 - \sigma^{T_1}}{T_1}$,
   $\sigma^0$ and $\sigma^{T_1}$ are the initial and final neighbourhood radius respectively, and $T_1$ is the number of iterations.
3. Update the weight vectors:
   $W_j^{(t+1)} = \frac{\sum_{i=1}^{P} h^t_{n_i^t j}\, S_i}{\sum_{i=1}^{P} h^t_{n_i^t j}} \quad \forall j$
4. Repeat the above-described process starting from step 1 until $t = T_1$.


Algorithm 2. Sequential Self-Organizing Map: Fine-tuning
1. Calculate the best-matching unit (BMU):
   $n^t = \arg\min_j ||S^t - W_j^t||$
2. Neighbourhood function (Gaussian):
   $h^t_{n^t j} = e^{-\frac{||W_{n^t} - W_j^t||^2}{2(\sigma^t)^2}}$
3. Learning rate:
   $\alpha^t = \frac{\alpha^0}{1 + 100\,t/T_2}$,
   where $\alpha^0$ is the initial learning rate and $T_2$ is the number of iterations.
4. Update step:
   $W_j^{(t+1)} = W_j^t + \alpha^t\, h^t_{n^t j}\,[S^t - W_j^t] \quad \forall j$
   $\sigma^{(t+1)} = \sigma^t - \frac{\sigma^0 - \sigma^{T_2}}{T_2}$,
   where $\sigma^0$ and $\sigma^{T_2}$ are the initial and final neighbourhood radius respectively.
5. Repeat the above-described process starting from step 1 until $t = T_2$.

thus the neurons that must be updated. The rough-tuning phase aims to order the neurons quickly, therefore it is carried out in a few thousand iterations. Furthermore, at the beginning of the rough phase the radius of the neighbourhood function covers all neurons and it linearly decreases at each iteration until it only covers two neighbouring neurons [22].
Fine phase: in the fine-tuning phase, the sequential algorithm [19] (summarised in Algorithm 2) is used. In each iteration, a random normalised input ($S_i$) is presented and its winner neuron ($n^t$) is searched. Thereafter, the weight vectors of the winner neuron ($W_{n^t}$) and its neighbours ($W_j^t$) are updated considering the neighbourhood function ($h^t_{n^t j}$) and the learning rate ($\alpha^t$). The objective is to slowly modify the weight vectors until the convergence of the system is achieved. This process is repeated during several thousand iterations. Thus, the learning rate must take very small values (e.g. 0.01) and, unlike in the previous phase, the neighbourhood radius only includes the closest neurons (e.g. one) [22].
After this SOM training, the topology of the neural network reflects the spatial distribution of the training dataset.
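A compact sketch of the two training phases is given below (Python/NumPy, illustrative only: the parameter defaults, the grid handling and the use of map-coordinate distances in the Gaussian neighbourhood are assumptions, and the paper's actual implementation relies on MATLAB and the SOM Toolbox):

```python
import numpy as np

def grid_coords(rows, cols):
    # 2-D map coordinates of every neuron, used for neighbourhood distances
    return np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

def rough_phase(X, W, coords, T1=1000, sigma0=10.0, sigmaT=2.0):
    # Batch updates with a bubble neighbourhood and a linearly decreasing radius
    for t in range(1, T1 + 1):
        sigma = sigma0 - (t - 1) * (sigma0 - sigmaT) / T1
        bmu = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
        h = (np.linalg.norm(coords[bmu][:, None, :] - coords[None, :, :], axis=-1) < sigma)
        h = h.astype(float)
        denom = h.sum(axis=0)
        upd = denom > 0
        W[upd] = (h.T @ X)[upd] / denom[upd][:, None]
    return W

def fine_phase(X, W, coords, T2=20000, alpha0=0.01, sigma0=1.0, sigmaT=0.0, seed=0):
    # Sequential updates with a Gaussian neighbourhood and a decaying learning rate
    rng = np.random.default_rng(seed)
    sigma = sigma0
    for t in range(1, T2 + 1):
        s = X[rng.integers(len(X))]
        bmu = np.argmin(((s - W) ** 2).sum(-1))
        alpha = alpha0 / (1 + 100 * t / T2)
        d2 = ((coords - coords[bmu]) ** 2).sum(-1)
        h = np.exp(-d2 / (2 * max(sigma, 1e-6) ** 2))
        W += alpha * h[:, None] * (s - W)
        sigma -= (sigma0 - sigmaT) / T2
    return W
```

Running the rough phase followed by the fine phase on the normalised training KPIs yields the ordered map that the clustering step below works on.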

2) Unsupervised clustering: In this phase, all ordered neurons of the SOM are clustered into as many groups or classes as possible using an unsupervised algorithm. To that end, it must be noted that the activated neuron (i.e. the BMU) for a specific input or cell state is the closest one based on the Euclidean distance ($d$). Thus, similar cell states activate nearby neurons or even the same one. This indicates that the same cause ($c_i \in C$) can be represented by several neurons with slight differences between them, so they must be grouped together into the same class ($g_i \in G$). Therefore, once the SOM has been trained, the next step groups the neurons representing data with similar behaviour into the same cluster. The aim is to divide the SOM map into as many clusters as causes can be distinguished.
It is highlighted that the exact number of clusters in the dataset is unknown because this system works with unlabelled data, so there is no evidence to reliably determine it in advance. As a result, two design considerations are established to automatically determine whether the obtained classification can be accepted or not. First, the identification of all existing clusters in the dataset is not necessary because the objective is to identify the most frequent causes, that is, the most relevant ones, in order to recognize which are the most problematic faults and automate the diagnosis of the most repeated problems. Consequently, the total number of clusters identified by the system should be limited. In particular, the system should identify a minimum of two clusters (the normal state and a faulty state) and a maximum $TC$ configured by the designer (e.g. $TC = 10$). Second, all the identified clusters are valid only if they present different statistical behaviours. Taking these points into consideration, we propose to cluster the neurons in an unsupervised manner through the following algorithm.

Let $G = \{g_1, \ldots, g_J\}$ be the set of classes just after the training; in particular, this method starts with a class for each neuron.
Then, the SOM is clustered into different numbers of groups (from 2 to $TC$) using Ward's hierarchical clustering algorithm [23]. To that end, the Euclidean distance between each pair of classes ($d(g_j, g_k)$) is calculated. Then, the Ward algorithm iteratively merges the closest two classes into a new one based on the minimum distance. After each union, the distances between each current pair of clusters are updated following the Lance-Williams recurrence formula [23].
Each classification is evaluated through the Davies-Bouldin index [24], which measures how good each clustering is, and the clustering with the minimum index (i.e. the best clustering according to the Davies-Bouldin metric) is selected.
Finally, it is verified that no pair of clusters has a similar statistical behaviour through the Kolmogorov-Smirnov test (KS test) [7]. The null hypothesis to be checked by this test is that the observed values of a KPI of two clusters present the same distribution. Therefore, the KS test is applied to each KPI for each pair of clusters, using as observed values the values of the training dataset that have been assigned to those clusters. The p-value obtained for each KPI and pair of clusters determines


the probability of having values consistent with the null hypothesis. In particular, there must not be any pair of clusters whose KPIs all present the same behaviour. The lower the p-value, the more inconsistent the data are with the null hypothesis, so if the p-value is too small the null hypothesis can be rejected. Taking this into account, we consider that two clusters are statistically different if the p-values obtained for all their KPIs are below a predefined threshold called the significance level (e.g. SL = 0.1%). If there is at least one pair of clusters that are statistically similar, the number of clusters is reduced and the clustering obtained with Ward's hierarchical algorithm for the new number of clusters is chosen. This process is repeated until all the identified classes are statistically different. As a result, the final set of classes is $G = \{g_1, \ldots, g_L, g_N\}$ (i.e. $L$ fault causes along with the normal case ($N$)).

To explain this, let us assume a cell state consisting of two KPIs: Retainability, which indicates the ratio of connections that have finished properly to the total number of connections that have occurred in the cell; and handover success rate (HOSR), which determines the ratio of handovers performed successfully to the total number of handovers. Let us assume that the dataset, for example, is composed of cell states belonging to three different causes, but this detail is unknown; so, in the clustering phase, the Davies-Bouldin index has indicated that the best number of classes is four (Fig. 2). However, the KS test determines that clusters 3 and 4 present similar statistical behaviour. This means that they represent the same cause, since both cases have the same cluster-symptom relation. Thus, the number of clusters is automatically reduced and the KS test is applied to the new clustering.

Fig. 2. Histograms of retainability (a) and HOSR (b) given each cluster, when the neurons have been grouped in 4 clusters.

Fig. 3. Neurons of the diagnosis system during the clustering phase (a), the resulting classification with fragmented clusters (b) and without fragmentation (c).
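The clustering loop described above can be sketched as follows (Python, illustrative only; scipy/scikit-learn are used in place of the MATLAB toolboxes, and the merging criterion is paraphrased as "two clusters are similar when no KPI rejects the KS null hypothesis at the significance level SL"):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ks_2samp
from sklearn.metrics import davies_bouldin_score

def pair_is_similar(Xa, Xb, SL):
    # KS test per KPI; the pair is considered similar if every KPI fails to reject H0
    pvals = [ks_2samp(Xa[:, m], Xb[:, m]).pvalue for m in range(Xa.shape[1])]
    return all(p > SL for p in pvals)

def cluster_neurons(W, X_train, TC=10, SL=0.001):
    Z = linkage(W, method='ward')                       # Ward's hierarchy over the neuron weights
    db = {k: davies_bouldin_score(W, fcluster(Z, t=k, criterion='maxclust'))
          for k in range(2, TC + 1)}
    k = min(db, key=db.get)                             # clustering with the minimum DB index
    while k >= 2:
        labels = fcluster(Z, t=k, criterion='maxclust')
        bmu = np.argmin(((X_train[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
        case_labels = labels[bmu]                       # training cases inherit their BMU's cluster
        ids = list(np.unique(case_labels))
        similar = any(pair_is_similar(X_train[case_labels == a], X_train[case_labels == b], SL)
                      for i, a in enumerate(ids) for b in ids[i + 1:])
        if not similar or k == 2:
            return labels
        k -= 1                                          # reduce the number of clusters and re-check
```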
3) Labelling by expert: At this stage, the obtained clusters
need to be labelled with the identified causes. First, the
unsupervised phases of the design (training and clustering)
must be verified because there is no information to evaluate
the goodness of the solution. A simple way of doing this is to check whether the solution is right by visual inspection
of the neurons belonging to each group. To illustrate this, Fig.
3.a shows an example of a SOM of 10x10 neurons during the
clustering phase where the neurons are being grouped into two
different classes. In each iteration, a new neuron (represented
in white colour) is selected and joined to the closest group.
At the end of the clustering, all neurons belong to one of the
created groups. However, if the resulting classification provides a fragmented cluster (see Fig. 3.b) then the training and
clustering phase must be repeated until clusters are composed
of adjoining neurons as in Fig. 3.c. This can be done by
changing the parameters of the training procedure, such as
the training length, the initial or final neighbourhood radius
and the learning rate of the sequential algorithm.
Furthermore, since there is a non-deterministic relation
between KPIs and causes [3], it is necessary to analyse
their statistical relations. To do this, the following process is
proposed:

For each cell state contained in the training dataset, the associated cluster must be identified. In particular, the cluster ($g_j$) associated with a particular normalised cell state ($S_i$) is considered to be the one that includes the neuron (BMU) activated for that state $S_i$, i.e.:

$$S_i \in g_j \iff BMU(S_i) \in g_j \qquad (3)$$

Once all cell states in the dataset have been included in a given cluster, the conditional probability density functions (PDF) of each KPI given each cluster ($f(KPI_i|g_j)$) are estimated. As the distribution followed by $KPI_i$ is unknown, a non-parametric technique must be used. Among them, the most commonly used to define the PDF are the histogram and the kernel smoothing function [25] (a brief sketch of the kernel-based estimation is given after this list).
The estimated PDFs for each cluster are studied in order
to examine the statistical behaviour of each KPI and,
as a result, determine the cluster-symptom relation. This
statistical information also helps to verify whether the
clustering is correct or not. Namely, it allows experts to


detect if a cluster is associated with more than one cause.


Finally, taking into account the cluster-symptom relations, the experts should identify the cause associated
with each cluster based on their knowledge and, as a
consequence, provide a suitable label to each cluster. As
stated before, one of the clusters will correspond to the
normal behaviour of the cells and it will be labelled
with $N$; on the other hand, the remaining clusters will have a descriptive label related to the possible fault cause ($FC_i$). Therefore, the process of labelling maps the clusters to a specific cause:

$$G = \{g_1, \ldots, g_L, g_N\} \rightarrow C = \{FC_1, \ldots, FC_L, N\} \qquad (4)$$
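For the statistical inspection of each cluster, the conditional PDFs can be approximated with a Gaussian kernel smoother. A minimal sketch (Python/SciPy, for illustration; the paper's figures were produced with MATLAB's kernel smoothing function, and the names below are assumptions) is:

```python
import numpy as np
from scipy.stats import gaussian_kde

def cluster_kpi_pdfs(X_train, case_labels, kpi_index, points=200):
    # Estimate f(KPI_i | g_j) of one KPI for every cluster, for visual inspection
    kpi = X_train[:, kpi_index]
    grid = np.linspace(kpi.min(), kpi.max(), points)
    pdfs = {}
    for g in np.unique(case_labels):
        values = kpi[case_labels == g]
        if len(values) > 1 and values.std() > 0:        # the KDE needs some spread
            pdfs[g] = gaussian_kde(values)(grid)
    return grid, pdfs
```

Plotting these curves per cluster (as in Fig. 2 or Fig. 5) lets the expert judge, for example, whether retainability in a given cluster lies mostly below 0.98 and label the cluster accordingly.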

C. Exploitation Stage
Once the system has been designed, it can be used to automatically perform the diagnosis in the exploitation phase. This
diagnosis process can be performed periodically to determine
whether the identified fault is sporadic, continuous or periodic
and to track the evolution of the fault over the time when
either compensation or recovery tasks are carried out.
The proposed diagnosis process is summarised in Algorithm
3. Such a process must be as accurate and reliable as possible.
To achieve this, first, for a specific cell's state ($S_i$), the winner neuron (BMU) is determined on the basis of the minimum Euclidean distance and thus the diagnosis ($Diagnosis_{BMU}$) is the cause ($c_j$) related to that neuron. However, if the activated neuron is at the border between two or more causes, the likelihood of an erroneous diagnosis is higher because the cell's state is close to the behaviour of different faults. Thus, we propose further processing in order to guarantee a high successful diagnosis rate. In particular, the adjustment proposed in this
paper is carried out using a percentile-based approach and a
Silhouette controller (see Fig. 1):
Percentile-based approach: this method is only used when
the BMU is a border neuron among different causes. In
this situation, the diagnosis should be the cause located
at the borderline that has, in general, the most similar
behaviour. In view of the above, a different approach
to identify the most similar cause is proposed. For each border cause, the Xth percentile of the distances between the cell's state and all neurons in that cluster is estimated. Then, the cause selected by the percentile-based approach ($Diagnosis_P$) is the one that has the minimum Xth percentile of all its distances. Compared to the BMU
where the considered distance is only to one single neuron
(i.e. the closest), the proposed approach considers the
distance to all neurons in the cluster.
Silhouette controller: once the diagnoses have been determined by the BMU and percentile-based approaches,
one of the two diagnoses must be selected. For this
purpose, a controller based on the silhouette index [8]
is proposed. First, if the diagnosis provided by the
percentile-based approach matches the one given by the
BMU, it can be concluded that the diagnosis is right
and, as a result, the selected cause corresponds to the
DiagnosisBM U . Nevertheless, when both diagnoses are

different, a comparative evaluation is required so that


it is possible to discern which one is better. Therefore,
the average silhouette is used to evaluate the quality of
each diagnosis and determine whether the input fits well
with the selected clusters or not. Then, the silhouette
index for each neuron and for the input vector is estimated through the equation of step 3.3.1 in Algorithm 3.
Particularly, this process is carried out twice, once for the
mapping made with the BMU and once for the mapping
of the percentile-based approach. For each diagnosis,
the average silhouette is calculated. This measure shows
how well the input has been categorised in a cause. In
particular, the higher the average silhouette is, the better
the classification is. Consequently, the final decision is
the diagnosis whose average silhouette is greater.
It should be noted that during the exploitation stage of the
system, only a specific diagnosis (i.e. one cause) is provided,
corresponding to the diagnosis that best matches the values of
the KPIs measured in the faulty cell. Therefore, when a cell
presents deterioration due to multiple causes, the diagnosis
provided by the system depends on the effect of those multiple
causes on the KPIs. On the one hand, a cell deteriorated due
to multiple causes can present the symptoms of the most
dominant fault. In this case, only the most dominant cause
is identified by this system. Once this fault is solved, then the
next dominant fault can be identified. On the other hand, if a
specific combination of multiple causes produces its particular
symptoms, this behaviour could lead to a specific cluster of the
diagnosis system during the design stage (provided that there
are enough cases with this behaviour in the training dataset).
Therefore, in this situation, this combination of multiple causes
will probably be diagnosed with the label assigned by the
expert.
V. EXPERIMENT RESULTS
Herewith, the assessment of the proposed diagnosis system
is presented. Firstly, the characteristics of the simulated data
are described. Afterwards, the proposed diagnosis system is
built for the simulated LTE network, illustrating the construction process. Then, with a labelled dataset, the obtained system
is evaluated and compared with reference mechanisms. Furthermore, the complexity of the proposed system is discussed.
Finally, a demonstration of the diagnosis system in a live
network is presented.
A. Simulated dataset
This section briefly presents the features of the dataset used, which is available at [26]. This dataset consists of two
different collections: the training set, which will be used to
design the proposed diagnosis system, and thus it is used
without labels; and the validation set, which will be used to
evaluate the system. In particular, the dataset has been generated by a dynamic LTE system-level simulator implemented in
MATLAB [27], whose 57 macro-cells are evenly distributed
across the entire scenario forming a hexagonal grid. Table I
describes the principal parameters of the simulator.


Algorithm 3. Automatic diagnosis in the exploitation phase
1. Calculate the BMU for the state $S$:
   $BMU(S) = \arg\min_j ||S - W_j^T||$
2. if $BMU(S)$ is not at the border:
   $Diagnosis(S) = Diagnosis_{BMU} = c_j \in C \mid BMU(S) \rightarrow c_j$
3. else:
   3.1. Calculate the Euclidean distance to each neighbouring cause:
        $D_{c_{neigh}} = \{d(S, W_j^T) : j \in c_{neigh}\} \quad \forall c_{neigh}$
   3.2. Determine the most similar neighbouring cause:
        $Diagnosis_P = c_{neigh} \in C \mid$ the $X$th percentile of $D_{c_{neigh}}$ is minimum
   3.3. if $Diagnosis_{BMU}$ and $Diagnosis_P$ are different:
        3.3.1 Calculate the silhouette associated with both diagnoses ($BMU$ and $P$), using all the neurons ($J$) and the input:
              $Silhouette_x(i) = \frac{b(i) - a(i)}{\max(b(i), a(i))}, \quad i \in [1 \ldots J+1]$
              where $x$ determines whether the silhouette is related to $Diagnosis_{BMU}$ or $Diagnosis_P$, $a(i)$ is the average Euclidean distance between $i$ and all the other components of its cluster, and $b(i)$ is the minimum average Euclidean distance between $i$ and all the components of the nearest neighbouring cluster different from the cluster of $i$.
        3.3.2 Calculate the average silhouette for both diagnoses ($BMU$ and $P$):
              $\overline{Silhouette}_x = \frac{1}{J+1}\sum_{i=1}^{J+1} Silhouette_x(i), \quad x \in \{BMU, P\}$
        3.3.3 Choose the final diagnosis:
              $Diagnosis(S) = Diagnosis_{BMU}$ if $\overline{Silhouette}_{BMU} \geq \overline{Silhouette}_P$, and $Diagnosis(S) = Diagnosis_P$ if $\overline{Silhouette}_P > \overline{Silhouette}_{BMU}$.
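The adjustment of Algorithm 3 can be prototyped as below (Python/NumPy, illustrative only; the border test via grid adjacency, the percentile X and the tie handling are assumptions, not the authors' MATLAB code):

```python
import numpy as np

def avg_silhouette(points, labels):
    # Average silhouette over all points (the J neurons plus the input), as in step 3.3
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == g].mean() for g in set(labels) if g != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(scores))

def diagnose(s, W, coords, neuron_cluster, X_pct=10):
    d = np.linalg.norm(W - s, axis=1)
    bmu = int(np.argmin(d))
    diag_bmu = neuron_cluster[bmu]
    # Border test: does any grid-adjacent neuron belong to a different cause?
    adjacent = np.linalg.norm(coords - coords[bmu], axis=1) <= 1.0
    neigh_causes = set(neuron_cluster[adjacent]) - {diag_bmu}
    if not neigh_causes:
        return diag_bmu                                  # step 2: clear BMU diagnosis
    # Step 3.2: cause with the minimum Xth percentile of distances to all its neurons
    candidates = [diag_bmu] + sorted(neigh_causes)
    diag_p = min(candidates, key=lambda c: np.percentile(d[neuron_cluster == c], X_pct))
    if diag_p == diag_bmu:
        return diag_bmu
    # Step 3.3: the silhouette controller arbitrates between the two mappings of the input
    pts = np.vstack([W, s])
    sil_bmu = avg_silhouette(pts, np.append(neuron_cluster, diag_bmu))
    sil_p = avg_silhouette(pts, np.append(neuron_cluster, diag_p))
    return diag_bmu if sil_bmu >= sil_p else diag_p
```

Here neuron_cluster holds the cause label assigned to each neuron in the design stage and coords the map coordinates of the neurons.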

In order to carry out the root cause analysis of this scenario,


the chosen indicators are statistics at cell level related both to
radio environment quality and the efficiency of the offered
service. In particular, the used KPIs are those proposed in
[28]:

Retainability, the ratio of connections that have finished


successfully to the total number of finished connections.
Handover success rate (HOSR), calculated as the ratio of the number of successful handovers to the total number of handovers.
The 95th percentile of the reference signal received power
(RSRP95 ) measured by all the users connected to the
cell. The RSRP is defined in [29] as the average power
received from the serving cell over the resource elements
that carry the cell-specific reference signals (RS) within
the considered measurement frequency bandwidth.
The 5th percentile of the reference signal received quality
(RSRQ5 ), which is the ratio of the RSRP multiplied by
the total number of resource blocks to the total received
power within the measurement bandwidth.
The 95th percentile of the signal to interference plus noise
ratio (SIN R95 ). In particular, the SINR of the users is
the ratio of the desired power received to the total power
of noise and interference.
The average throughput of all users in a cell ($AvThroughput$), where the user throughput ($T_u$) is calculated on the basis of their SINR through the following equation [30]:

$$T_u = (1 - BLER(SINR_u)) \cdot \frac{D_u}{TTI} \qquad (5)$$

where $BLER$ is the block error probability, which depends on the SINR of user $u$, $D_u$ is its data block payload in bits and $TTI$ is the transmission time interval (a brief numerical illustration is given after this list).
The 95th percentile of the distance between the base
station and each user (Distance95 ).
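As a quick numerical illustration of Eq. (5), with values assumed purely for the example, $BLER(SINR_u) = 0.1$, $D_u = 1000$ bits and $TTI = 1$ ms:

$$T_u = (1 - 0.1)\cdot\frac{1000\ \text{bits}}{1\ \text{ms}} = 900\ \text{kbit/s}$$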

Therefore, the input vector of the system consists of the


previous KPIs, formally, it can be expressed as:
$$S = [Retainability,\ HOSR,\ RSRP_{95},\ RSRQ_5,\ SINR_{95},\ AvThroughput,\ Distance_{95}] \qquad (6)$$

Considering these KPIs, all normal cells (i.e. cells without problems) have been configured to achieve good cell performance in terms of retainability and HOSR (which must
be above 0.98 and 0.95, respectively). The KPI time period
in this simulation, i.e. the time interval in which the KPIs
are estimated, corresponds to a simulation loop which is
sufficiently long to provide reliable statistics (in this study,
it is composed of 18000 simulation steps). In each simulation
loop, all cells present the normal configuration, except a set
of cells which present one of the simulated faults.
In particular, six fault causes (presented in [28]) have been
simulated in order to deteriorate some randomly chosen cells.
The first one consists of an excessive uptilt (EU) of the antenna
which causes overshooting due to the excessive increment of
the coverage area. The second fault is caused by an excessive
downtilt of the antenna (ED) due to a wrong parameter
configuration. This causes a reduction of the coverage area with respect to what was planned, focusing the transmission power of the cell near the base station. The third fault cause is modelled
by reducing excessively the transmission power of the antenna
(so it is called excessive reduction of cell power, ERP). This
can happen in a real network due to wiring problems or
wrong parameter configuration. The next fault cause consists
of creating a coverage hole (CH) within the coverage area
of a cell by increasing the attenuation suffered in a small
part. This models the shadowing caused by obstacles in the
environment (such as buildings or hills). Another simulated
fault cause is a too late handover problem (TLHO). If the
handover margin (HOM) parameter is wrongly configured, the


TABLE I
SIMULATION PARAMETERS

Parameter                 Configuration
Cellular layout           Hexagonal grid, 57 cells, cell radius 0.5 km
Transmission direction    Downlink
Carrier frequency         2.0 GHz
System bandwidth          1.4 MHz (6 physical resource blocks)
Frequency reuse           1
Propagation model         Okumura-Hata with wrap-around, log-normal slow fading,
                          sigma_sf = 8 dB, correlation distance = 50 m
Channel model             Multipath fading, ETU model
Mobility model            Random direction, 3 km/h
Service model             Full buffer, Poisson traffic arrival
Base station model        Tri-sectorized antenna, SISO, Downtilt = 9 deg,
                          P_TXmax = 43 dBm, Azimuth beamwidth (AB) = 70 deg,
                          Elevation beamwidth (EB) = 10 deg
Scheduler                 Time domain: Round-Robin; Frequency domain: Best Channel
Power control             Equal transmit power per physical resource block
Link adaptation           Fast, CQI based, perfect estimation
Handover                  Triggering event = A3, HOM = 3 dB, Measurement type = RSRP
Radio Link Failure        SINR < -6.9 dB for 500 ms
Traffic distribution      Evenly distributed in space
Time resolution           100 TTI (100 ms)
Epoch & KPI time          100 s

handover process is not performed correctly. In particular, the


handover is performed too late when the HOM is too high, so the condition to perform the handover is very restrictive. The last fault cause is an inter-system interference (II) problem that may happen due to external systems such as TV, radars or even other cellular systems. This is simulated by adding an extra antenna in the scenario within the service area of a cell. The
configuration used for generating each fault cause is presented
in Table II.
Each case represents the state of a cell so it is a vector
with the value of its KPI estimated over a KPI time period.
Therefore, in each simulation loop, a case for each cell of the
scenario is obtained. Since in a live network the frequency
in which faults happen is not deterministic, the number of
cases stored for each fault has been randomly selected taking
also into account that the frequency of each fault is different.
The actual number of cases which compose the datasets is
presented in Table II. It is highlighted that normal cells are
more common than faulty cells in a real network, and thus
the total number of normal cases is much greater than the
number of fault cases in both simulated datasets. This allows determining whether the system can identify both the most and the least prevalent problems within the dataset. Another important
detail is that the size of the training dataset is much lower
than the size of the validation dataset. This is to ensure that
the system can be properly designed with little data and then
there are sufficient data available to estimate the overall total
error.
B. Experimental design
In this section, the proposed diagnosis system is applied
to the simulated LTE network in order to diagnose the cells

TABLE II
FAULT CAUSE DESCRIPTION

Fault Cause   Configuration                                             Cases (Training / Validation)
EU            Downtilt = [0, 1]                                         32 / 212
ED            Downtilt = [16, 15, 14]                                   28 / 212
ERP           P_TX = [7, 8, 9, 10] dB                                   28 / 208
CH            Hole attenuation = [49, 50, 52, 53] dBm                   14 / 103
TLHO          HOM = [6, 7, 8] dB                                        34 / 204
II            P_TXmax = 33 dBm, Downtilt = 15, AB = [30, 60], EB = 10   15 / 106
Normal        No fault                                                  399 / 2964

and find abnormal behaviours due to faults. In particular, the


system has been implemented in MATLAB using its Statistics Toolbox and the SOM Toolbox [31]. The diagnosis system
has been designed as proposed in Section IV, using only
the training database (described in Section V.1 and without
considering any label for each case).
First, the system has been trained with the configuration
parameters presented in Table III. The configuration of the
training dataset has been done in accordance with the theoretical concepts. Namely, the rough phase has been carried
out in a thousand iterations while the fine phase has been
performed in twenty thousand iterations, but using a smaller
neighbourhood radius.
After that, the neural network has been clustered in an
unsupervised manner because there is no information available
about the specific fault cause associated with each group. To
that end, the two configuration parameters of the unsupervised clustering algorithm should be set. In this study, the maximum number of clusters identifiable by the system ($TC$) has been set to 10. Therefore, the final number of clusters is limited to a value between 2 and 10. This ensures that the system analyses a wide enough range of possible clusterings, trying to find the top ten clusters. The significance level (SL) for the KS test has been set to a small value (i.e. 0.1%) in order to ensure that the statistical behaviours of the identified clusters are sufficiently different. During this process the Davies-Bouldin (DB) index is calculated for each of the clusterings performed with Ward's hierarchical method. In Table IV, the DB index for each possible clustering is presented. As can be seen, the clustering consisting of eight clusters presents the smallest DB index; this means that it is the best clustering according to the DB index. Then, the KS test is applied between each pair of clusters, in order to compare the distribution of their KPIs. To
that end, the cases of the training dataset that are assigned to
each cluster are used. As an example, the p-value obtained by
applying the KS test to the KPIs of a specific pair of clusters
is shown in Table V. It can be seen that the p-value obtained
for all KPIs is greater than the predefined significance level
(i.e. 0.001). This means that the distribution of the KPI of both
clusters is very similar. Therefore, the KS test reveals that these
clusters present similar statistical behaviours. Consequently,
the number of clusters is reduced and the KS test is repeated. In this case, the KS test determines that there is no pair of clusters with the same statistical behaviour, so the automatic


TABLE III
CONFIGURATION PARAMETERS USED TO TRAIN THE SYSTEM

                                 Simulated scenario          Real scenario
Parameters                       Rough       Fine            Rough      Fine
Training algorithm               Batch       Sequential      Batch      Sequential
Training length, T               1000        20000           1000       2000
Neighbourhood function           Bubble      Gaussian        Bubble     Gaussian
Initial neighbourhood radius     10          1               10         1
Final neighbourhood radius       2           0               1          0
Initial learning rate            -           0.01            -          0.001

Fig. 4. Proposed diagnosis system clustered in 7 groups by the unsupervised clustering method.

In this case, the KS test determines that there is no pair of clusters with the same statistical behaviour, so the automatic unsupervised clustering concludes and determines that the best number of clusters is seven. As a result, neurons are grouped as shown in Fig. 4. With these two evaluations, the proposed clustering ensures that the chosen classification is either the best one according to the Davies-Bouldin index or, at least, one that does not include any cluster with a statistical behaviour similar to another according to the KS test.
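The same selection logic can be sketched with standard Python tooling (scipy for Ward's linkage and the KS test, scikit-learn for the Davies-Bouldin index); the helper names and the BMU-based assignment of training cases are illustrative assumptions rather than the authors' Matlab code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ks_2samp
from sklearn.metrics import davies_bouldin_score

def choose_clustering(neuron_weights, train_cases, max_clusters=10, sl=0.001):
    """Ward clustering of the SOM codebook: start from the partition with the
    lowest DB index and merge further while two clusters look statistically alike."""
    Z = linkage(neuron_weights, method='ward')
    db = {k: davies_bouldin_score(neuron_weights, fcluster(Z, k, criterion='maxclust'))
          for k in range(2, max_clusters + 1)}
    k = min(db, key=db.get)                      # partition with the lowest DB index
    while k > 2 and clusters_look_alike(Z, k, neuron_weights, train_cases, sl):
        k -= 1                                   # reduce the number of clusters and re-test
    return fcluster(Z, k, criterion='maxclust')

def clusters_look_alike(Z, k, neuron_weights, train_cases, sl):
    """True if some pair of clusters has a KS p-value above sl for every KPI."""
    neuron_labels = fcluster(Z, k, criterion='maxclust')
    d2 = ((train_cases[:, None, :] - neuron_weights[None, :, :]) ** 2).sum(-1)
    case_labels = neuron_labels[d2.argmin(axis=1)]   # label of each case via its BMU
    ids = np.unique(case_labels)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            pvals = [ks_2samp(train_cases[case_labels == a, m],
                              train_cases[case_labels == b, m]).pvalue
                     for m in range(train_cases.shape[1])]
            if min(pvals) > sl:
                return True
    return False
```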
Afterwards, in the labelling phase, the statistical behaviour of each cluster has been analysed in detail to reasonably identify its corresponding cause. This requires knowledge and understanding of when a specific KPI is degraded, but this information usually depends on the network features and context factors, and it is not always well-defined.

TABLE IV
DAVIES-BOULDIN INDEX

Number of Clusters | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10
DB index           | 1.43 | 1.25 | 1.06 | 0.98 | 0.84 | 0.80 | 0.70 | 0.83 | 0.87

TABLE V
P-VALUE OF THE KS TEST BETWEEN TWO CLUSTERS

KPI     | Ret    | HOSR   | RSRP95 | RSRQ5  | SINR95 | AvTh   | Dist95
p-value | 0.0026 | 0.0026 | 0.3788 | 0.0045 | 0.8552 | 0.2853 | 0.2095

Therefore, first of all, the PDFs of each KPI, estimated by the kernel smoothing function, are analysed to determine whether or not each KPI is deteriorated for each cause (see Fig. 5). Since the kernel smoothing function is used, the PDFs are estimated as a sum of Gaussian functions centred at the data. Therefore, the estimated PDF is smoother than the histogram, which causes the estimated PDF to cover values out of range (see Fig. 5.a and b). It is important to highlight that this is part of the estimation error, but this particular error does not affect the proposed analysis, since the estimated PDFs are only used to determine the overall statistical behaviour of the clusters by visual inspection.
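The per-cluster PDFs are obtained in Matlab with the kernel smoothing function; an equivalent visual check could be produced with scipy's Gaussian kernel estimator, as in the illustrative sketch below (function and variable names are assumptions).

```python
import numpy as np
from scipy.stats import gaussian_kde

def cluster_kpi_pdfs(cases, labels, kpi_index, grid_points=200):
    """Estimate the PDF of one KPI for every cluster with a Gaussian kernel,
    mirroring the visual inspection described above. Note that the smoothed
    estimate may spill slightly outside the observed range of the KPI."""
    grid = np.linspace(cases[:, kpi_index].min(), cases[:, kpi_index].max(), grid_points)
    pdfs = {}
    for c in np.unique(labels):
        values = cases[labels == c, kpi_index]   # needs at least two distinct samples
        pdfs[c] = gaussian_kde(values)(grid)
    return grid, pdfs
```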
The objective of this examination is to get a rough idea
about the most probable values of each KPI depending on the
cluster and thus determine, for a specific cluster, if the majority
of values of a KPI are considered degraded or not. As a result
of this detailed study, the overall behaviour of each cluster has
been assessed to identify the associated fault cause and assign
it a label (see Fig. 5 and Table VI):
Cluster 1: This cluster represents cells with normal behaviour given that none of the indicators are deteriorated.
Cluster 2: Comparing the PDFs of the second cluster with
the first one, it is possible to identify that this cluster
represents cells whose KPIs have a normal behaviour
but with the difference that their retainability is more
likely to be low (approximately below 0.98). Therefore,
the high number of dropped calls together with the fact
that no other KPIs are deteriorated means that the fault
cause is a coverage hole within the service area. This
results from the fact that the coverage hole affects only
a particular group of users located in this specific area,
so that the aggregated values of the other analysed KPIs
do not present any deterioration.
Cluster 3: After analysing the PDFs, it is possible to
conclude that the most deteriorated KPIs of this cluster
are retainability, HOSR and RSRQ5 because they are
approximately below 0.98, 0.9 and -18.5 dB respectively.
Bad HOSR determines that this cell has mobility problems, while low RSRQ5 shows that the number of users
with bad quality has increased. This reveals that the cell is
retaining users with bad quality instead of performing the
handover process. As a result, this kind of cells presents
mobility problems due to the fact that handovers are
carried out too late.
Cluster 4: All the KPIs of this cluster, except the average throughput, are likely to be low; namely, the overall performance of those cells is degraded. The number of drops has increased and the maximum served distance (Distance95) has been reduced (approximately below
-75 dBm and 0.8 km respectively). This indicates that those cells cannot keep providing service to the furthest users, which typically have the lowest throughput, and thus the average throughput of the cell has increased. At the same time, the level of signal received by the best users (RSRP95) has decreased, which reveals that the nearby users have experienced a decrease in their received power. Therefore, this fault is causing deterioration both to nearby and distant users. This behaviour corresponds to cells whose transmission power has been reduced considerably.
Cluster 5: In this cluster, the degraded KPIs are RSRP95, SINR95 and the average throughput (approximately below -75 dBm, 13 dB and 100 kbps respectively); furthermore, those cells serve more distant users than in normal conditions. Thus, those cells present a service area much greater than necessary. This fault is caused by an excessive uptilt of their antennas, because the coverage area of the cells has been increased and, at the same time, the power received by the nearby users, and thus their SINR, has been reduced.
Cluster 6: The PDFs of this cluster show that SINR95 is low (approximately below 13 dB), meaning that this KPI is degraded. As a consequence, in those cells the quality of the service has worsened, causing a lot of drops and a decrease in the average throughput (which is approximately below 100 kbps). Thus, the performance of those cells is deteriorated due to a high level of inter-system interference.
Cluster 7: The last cluster matches a problem caused by an excessive downtilt of the antenna, where the retainability, RSRQ5 and Distance95 are low (approximately below 0.98, -18.5 dB and 0.8 km respectively) while the level of signal (RSRP95) is even better than in a normal situation (approximately above -68 dBm). This results from the fact that an antenna with high downtilt focuses its radiation around the base station, causing a reduction of the serving area and the unintended disconnection of the distant users, while the signal level of the nearby users improves.
Fig. 5. Estimated PDF of Retainability (a), HOSR (b), RSRP 95 pctl (c), RSRQ 5 pctl (d), SINR 95 pctl (e), average throughput (f) and distance 95 pctl (g) given each fault cause.


TABLE VI
LABEL ASSIGNED TO EACH CLUSTER

Cluster   | Label
Cluster 1 | Normal
Cluster 2 | Coverage Hole
Cluster 3 | Too late HO
Cluster 4 | Excessive reduction of cell power
Cluster 5 | Excessive Uptilt
Cluster 6 | Inter-system interference
Cluster 7 | Excessive downtilt


C. Performance Evaluation
In this section, the diagnosis system is assessed and compared with reference mechanisms in order to show its effectiveness. It should be pointed out that there are no previous unsupervised systems proposed in the literature for diagnosis in wireless networks. Among the available supervised solutions in the literature, two reference systems have been chosen.
The first one is the rule-based system (RBS) [32], which can be considered a baseline scheme in the field of diagnosis given that it is the simplest solution and is widely used by network operators. This system uses a set of IF...THEN rules to perform a diagnosis based on the values of the KPIs. The second one is a Bayesian network classifier (BNC), proposed in [5] to diagnose faults in mobile networks. In this study, it has been designed using the GeNIe modelling environment [33]. Both systems require discretized inputs, thus the percentile-based discretization (PBD) method proposed in [5] is used. The thresholds of this method discretize each KPI based on the X-th percentile of the training data, where the percentage X is defined by the expert (e.g. the 5th percentile of the normal values of each KPI is chosen in this analysis). Furthermore, since the two mechanisms are supervised systems, they require labelled data to be built. In particular, both systems have been built using the training dataset, like the diagnosis system, but for these reference mechanisms the label of the data has been used.
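As a hedged illustration of this percentile-based discretization, the sketch below derives one threshold per KPI from the normal training cases and flags values below it; treating "below the threshold" as degraded for every KPI is a simplifying assumption, since the degraded side actually depends on the KPI.

```python
import numpy as np

def pbd_thresholds(normal_cases, percentile=5.0):
    """One threshold per KPI: the X-th percentile of the values observed
    in normal (fault-free) training cases."""
    return np.percentile(normal_cases, percentile, axis=0)

def discretize(cases, thresholds):
    """Binary discretization: 1 if the KPI is below its normal-range
    threshold (assumed here to mean 'degraded'), 0 otherwise."""
    return (cases < thresholds).astype(int)
```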
In light of the above, and despite the fact that the proposed
diagnosis system is unsupervised, the evaluation has been done
with the validation dataset using the labels of the cases so that
it is possible to compare it with the reference solutions. In
order to do that three different metrics have been used:
False Positive Rate (FPR): it is the proportion of normal cases diagnosed as a fault cause (N_FP) to the total number of normal cases (N_NC). This measurement determines the probability of the system identifying a normal case as problematic.

FPR = N_FP / N_NC    (7)

False Negative Rate (FNR): it is the proportion of problematic cases diagnosed as normal (N_FN) to the total number of problematic cases (N_PC). It represents the incapacity of the system to detect problems in a cell when that cell actually has a problem.

FNR = N_FN / N_PC    (8)

Diagnosis Error Rate (DER): it is the proportion of problematic cases diagnosed with a fault cause different to the real one (N_E) to the total number of problematic cases (N_PC). It indicates the probability of misdiagnosing a problematic cell.

DER = N_E / N_PC    (9)

Based on these measures, the total error (E_total) of the system can be estimated through the following expression:

E_total = P_N · FPR + P_PC · (FNR + DER)    (10)

where P_N and P_PC represent the prevalence of normal and problematic cases, respectively, over the total validation dataset. In particular, their values are P_N = 73.93% and P_PC = 26.07% in the validation dataset.
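These metrics are straightforward to compute once the validation cases are labelled; the following sketch assumes string labels with 'Normal' marking fault-free cases and estimates the prevalences from the validation set itself, as in Eq. (10).

```python
import numpy as np

def diagnosis_errors(true_labels, predicted_labels, normal_label='Normal'):
    """FPR, FNR, DER and the total error as defined in Eqs. (7)-(10)."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    normal = true_labels == normal_label
    problematic = ~normal
    fpr = np.mean(predicted_labels[normal] != normal_label)        # Eq. (7)
    fnr = np.mean(predicted_labels[problematic] == normal_label)   # Eq. (8)
    wrong_cause = ((predicted_labels[problematic] != normal_label) &
                   (predicted_labels[problematic] != true_labels[problematic]))
    der = np.mean(wrong_cause)                                      # Eq. (9)
    p_normal, p_problem = normal.mean(), problematic.mean()
    e_total = p_normal * fpr + p_problem * (fnr + der)              # Eq. (10)
    return fpr, fnr, der, e_total
```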


TABLE VII
TEST EVALUATION

System          | FPR [%] | FNR [%] | DER [%] | E_total [%]
Proposed system | 0.24    | 1.24    | 1.05    | 0.77
RBS             | 37.42   | 0       | 16.36   | 31.93
BNC             | 10.9    | 6.2     | 10      | 12.28

Table VII shows the evaluation error estimated for each system. Since the simulations to obtain the artificial dataset presented in Table II are very time consuming, there is only one validation dataset available. As a result, those total errors are derived from this single validation dataset and are therefore not averaged values. However, since the number of cases in the validation dataset is very high, the obtained error is a valid average figure. From the results, it can be concluded
that both RBS and BNC present a higher error rate than the
proposed diagnosis system, even though they are supervised
techniques. This results from the fact that the two reference
mechanisms discretise their inputs causing a great loss of
information, while the proposed system works directly with
the continuous value of the KPIs. Therefore, the performance
of RBS and BNC is highly dependent on the discretization process. The conclusion to be drawn from the above
is that by processing the inputs with higher resolution the
proposed diagnosis system is capable of performing successful
diagnoses, even without having considered the labels in the
design phase. As it can be seen, the overall error is very low
(Etotal = 0.77%), especially for an unsupervised system. This
is achieved thanks to each of the phases of the design stage
along with the adjustment process of the exploitation stage. As
a result, it can be said that the proposed design process makes
it possible to obtain a reliable system, which, in addition, has
been validated by the expert during its construction in the
labelling phase. Furthermore, this small total error rate also
demonstrates that the clustering phase has made a successful
classification finding the proper number of clusters. Note that
an error in the number of clusters would have led to a very
high error rate.
The number of clusters the system can identify is limited
and it is defined during the design phase. In particular,
assuming that there can be a huge number of possible faults
in a network, among all possible fault cases only the most
frequent will be identified. As a result, the diagnosis system
will not find any cluster related to those faults which are not
present in the training dataset, or whose presence is scarce.
Therefore, both the rare failures and the new ones will not be
considered by the system. This issue is an inherent limitation
of the unsupervised systems. Consequently, if the diagnosis of
more faults is required, the system should be redesigned using
a larger volume of training data. However, as these data are
unlabelled, it is not possible to ensure that this new training
dataset includes occurrences of those new failures. In addition,
new KPIs may be required in order to have new information
to facilitate the identification of new faults.
When analysing the specific states that are wrongly diagnosed by the BMU phase, it can be observed that the majority of them activate border neurons. In particular, 92.5%
of the cases misdiagnosed by the BMU phase are assigned
to a border neuron. Therefore, this justifies the decision of
proposing an adjustment focused on the borders of the clusters.
With that unsupervised adjustment, 24.3% of those cases are
satisfactorily corrected. This has reduced the total error rate
from 1% to 0.77%. It should be noticed that even a little
improvement in the percentage, given the high number of
cells in the network, means a considerable improvement in
the number of cells correctly diagnosed. Furthermore, the use
of the Silhouette index provides a confirmation of the obtained
diagnosis in the most difficult cases.
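The exact adjustment rule is defined in Section IV; as a rough sketch only, the fragment below chooses, among the clusters adjacent to the activated border neuron, the one with the minimum X-th percentile of distances between the input and the training cases of that cluster. The percentile value and the helper names are assumptions, and the Silhouette-based confirmation step is omitted.

```python
import numpy as np

def adjust_border_case(x, train_cases, case_labels, candidate_clusters, pct=10.0):
    """For a case mapped to a border neuron, pick among the adjacent clusters
    the one with the minimum pct-th percentile of distances between the input
    vector x and the training cases of that cluster."""
    scores = {}
    for c in candidate_clusters:
        d = np.linalg.norm(train_cases[case_labels == c] - x, axis=1)
        scores[c] = np.percentile(d, pct)        # robust 'distance to the cluster'
    return min(scores, key=scores.get)
```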

D. Algorithm complexity evaluation


In this section, the complexity of the iterative procedures is discussed. In particular, the proposed diagnosis system is composed of several iterative algorithms in its design stage: both the rough-tuning and the fine-tuning of the training phase, as well as the proposed unsupervised clustering. The rough-tuning phase is performed by the Batch algorithm, which has a computational complexity of approximately O(PJM/2) according to [34]. It is recalled that P is the total number of cases in the training dataset, J is the number of neurons and M is the dimension of the input vector. In order to achieve convergence in the rough phase, thousands of iterations of this algorithm are required. However, the computational complexity of the sequential algorithm in the fine-tuning phase is about double that of the Batch algorithm, that is, O(PJM) [34]. Furthermore, this phase requires more iterations to converge, i.e. several thousand iterations. Regarding the unsupervised clustering, which consists of a combination of mechanisms, its computational complexity is determined by the upper level, i.e. Ward's hierarchical method, whose computational complexity is at least O(J^2) [23]. However, it is executed over a few iterations, determined by the maximum number of identifiable clusters (TC).
The running time of the procedures is strongly dependent on the number of iterations for which each procedure is executed. Thus, the running time of the training phases varies according to the configured training length. Table VIII presents the running time of each iterative process, measured when the diagnosis system was designed with the simulated training dataset, whose training length (i.e. the number of iterations of each procedure) is presented in Table III. As stated before, the evaluation of the system has been performed with only one validation dataset. Furthermore, these experiments were conducted on an Intel Core i5-2540M at 2.60 GHz with 4 GB of memory, running Windows 7 Enterprise. As can be observed, the fine-tuning phase is the most time-consuming part. It should be stressed that the duration of the
design phase is not as critical as the duration of the exploitation
stage, which should identify the problem as fast as possible
to minimize the time-to-resolution. For the proposed diagnosis
system, the execution of the exploitation stage is instantaneous.

0018-9545 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TVT.2015.2431742, IEEE Transactions on Vehicular Technology
14

TABLE VIII
RUNNING TIME [s] OF EACH PROCEDURE

Rough-tuning | Fine-tuning | Unsupervised clustering
2.40         | 681.39      | 0.33


TABLE IX
PARAMETERS OF THE REAL LTE NETWORK

Parameter                         | Configuration
Network Layout                    | Urban area
Number of cells                   | 8679
System bandwidth                  | 10 MHz
Number of PRBs                    | 50
Frequency reuse factor            | 1
Max. Transmit power               | 46 dBm
Maximum Tx Power of UE            | 23 dBm
Horizontal HPBW                   | 65°
HOM                               | 3 dB
KPI time period                   | Hourly
Number of observed cells          | 45
Number of days under observation  | 6 days (on average)
Size of the training dataset      | 14478 unlabelled cases

E. Demonstration in a live network


Once the performance of the proposed diagnosis system has
been assessed with simulated data, it has been applied in a
live LTE network in order to demonstrate its usability and
effectiveness using real and unlabelled data.
1) Details of the analysed live LTE network: The analysis
of the proposed system has been conducted in a real LTE
network of a big urban area corresponding to a city with
a population of nearly 4 million. This LTE network has
been chosen because its deployment is extensive and well
established. The characteristics of this LTE network (such as
transmission power, handover parameters, system bandwidth,
etc.) are summarised in Table IX. It consists of more than 8000 different cells, so there is a great variety of cells, each of them located in a different place and subject to different environment conditions. In order to obtain a big training dataset, 45 cells
have been randomly chosen among all the available cells in
the network. From those cells, the values of the KPIs have
been stored at hourly level during an observation period of 6
days (on average). As a result, a training dataset with a total of
14478 different unlabelled cases has been obtained. It should
be noted that it is important to store the state of the same
cell at different hours because the values of the KPIs are highly dependent on traffic conditions and the volume of served users, which vary over time. Therefore, several cases of
the same cells have been stored over time, instead of storing
a single case of different cells for the same hour. This ensures
that the training dataset includes cases that are affected by the
traffic variations.
In particular, the KPIs selected to perform the diagnosis
are some of the most common KPIs used by the experts in
their manual troubleshooting tasks. Furthermore, these KPIs
are associated with the main categories in mobile networks:
connectivity, e.g. accessibility, retainability and failed Radio Resource Control (RRC) connections rate; mobility, e.g.
HOSR, number of ping-pong HO and Inter-Radio Access Technology (IRAT) HO Rate; quality, e.g. number of bad coverage reports and average of the Received Signal Strength Indicator (RSSI); capacity, e.g. number of RRC connections and average CPU load; and configuration, e.g. antenna tilt.

Accessibility: shows the ability of the cell to provide


the service requested by the user under acceptable conditions [35]. Therefore, it is usually used to identify
the percentage of connections that have got access to
that cell over the KPI time period. As a result, a low
value of accessibility shows that there are many blocked
connections.
Retainability: this KPI is the same one that was described
in Section V.A. Namely, it represents the percentage of
connections that are not interrupted or ended prematurely
out of the total number of connections [35]. Thus, a high value of retainability indicates that the majority of the connections have been successfully finished.
Failed RRC Connections Rate: a successful RRC connection [36] determines that a user has been provided
with the LTE resources required to transfer any kind of
data. Thus, this KPI determines the ratio between the total
number of failed RRC connections and the total number
of requested RRC connections.
HOSR: as stated in Section V.A, this KPI shows how well a cell performs the handover functionality, providing satisfactory mobility to its users, given that it represents the number of HOs that have been successfully performed over the total number of HOs (considering both successful and failed HOs) [35].
Number of ping-pong HO: this KPI counts the total
number of ping-pong HO that happen during the KPI time
period. A ping-pong HO occurs when the user equipment
(UE) switches between two cells repeatedly in a short
time period [37]. This KPI is considered given that the
ping-pong HO is a critical issue on the HO procedure
that negatively affects the performance of a cell.
IRAT HO Rate: an inter-radio access technology HO is a
mobility process whereby users switch their connections
from one RAT to another. In this case, this KPI represents
the percentage of users in LTE that perform an IRAT HO
from LTE to a different RAT over the total number of
connections successfully finished. A high IRAT HO rate
means that lots of users are leaving LTE.
Number of bad coverage report: it counts the number of
signal level measurements in which the Event A2 [36]
of the mobility process is met, that is, the total number
of times the received signal level from the serving cell
is under an absolute threshold. A high value of this KPI
gives an indication of a lack of coverage.
Average RSSI: RSSI is the wide band power received by
the user considering both the desired signals and the rest
of received power due to thermal noise, adjacent channel
interference, etc. Therefore, this KPI is calculated as the
average of all the RSSI reported over the KPI time period.
Number of RRC connections: is the number of RRC connection attempts that have been successfully established.
This KPI is a measure of the amount of users served by the cell.
Average CPU Load: is the weighted average of the CPU
processes over the KPI time period. A cell with a high
average load means that it has overload problems.
Tilt: is the antenna configuration parameter that determines the angle that the antenna forms with the horizontal
plane. This means that the smaller the antenna tilt, the larger its coverage area.

2) Construction of the proposed diagnosis system: According to the proposed design process, the diagnosis system has
been built using the real training dataset previously presented.
The first point to mention is that this training dataset is much
larger than the artificial training dataset (i.e. 14478 real cases
against 550 simulated cases). This would result in an excessive
rise of the running time of the training procedure, which is
the most critical one. As a result, the first design decision
was to reduce the training length of the fine-tuning phase
to 10 percent of the training length used with the artificial
dataset. The rest of the configuration parameters were set up
with the same configuration. Once the training and clustering
phases were completed, during the labelling phase it was found
that the obtained classification was fragmented. Therefore, as
explained in Section IV, the training and clustering phase was
repeated with different configuration parameters. To that end,
the final neighbourhood radius of the rough-tuning phase and
the initial learning rate of the fine-tuning phase were reduced
in order to do the training with more resolution. The particular
values used for those training parameters are shown in Table
III. Furthermore, the maximum number of clusters identifiable
by the system remains 10. With this design configuration, four
statistically different clusters have been found by the system.
In addition, all clusters are composed of adjoining neurons, which indicates that both the training and clustering phases have
been successful. In order to label each of them, their statistical
behaviour has been analysed through the PDFs of the KPI
estimated by the kernel smoothing function, as stated earlier
in this paper. Nevertheless, in this section, the mean and the
standard deviation of each KPI given each cluster have been
presented in Table X instead of the figures of those PDFs,
because of space constraints.
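Such a summary is simple to reproduce; a minimal pandas sketch (the column names and the cluster series are assumptions) could be:

```python
import pandas as pd

def cluster_statistics(kpis: pd.DataFrame, cluster: pd.Series) -> pd.DataFrame:
    """Mean and standard deviation of every KPI given each cluster,
    i.e. the kind of summary reported in Table X. 'kpis' holds one row
    per training case and one column per KPI; 'cluster' holds the
    cluster index assigned by the SOM-based system to each case."""
    return kpis.groupby(cluster).agg(['mean', 'std']).round(2)
```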
First, the statistical behaviour of the clusters is analysed to
find the normal one. This will be the cluster whose KPIs have the most common and least deteriorated values. On this basis,
cluster 3 has been labelled as Normal (see Table XI), given
the following reasons. This cluster represents cells whose
connectivity procedure works successfully given that it has
high accessibility and retainability (both KPIs are around 99%)
along with a very low failed RRC connections rate. The HOSR
is adequate and both the absolute number of ping-pong HO
and the IRAT HO rate are relatively low (around 34.02 and
1.16%, respectively), which means that there are no problems in
the mobility process. Regarding the quality, the low number of
bad coverage reports determines that the served users measure
the cell with high signal level. Furthermore, the average RSSI
is around -115.86 dBm, so it is within the desired range from
-120 dBm to -114 dBm. Finally, the low load average indicates
that there is no overload. As a result, this cluster presents a

normal performance.
By comparing cluster 1 with the Normal cluster, it can be
seen that all its KPIs are deteriorated. In particular, the low
value of accessibility and the high failed RRC connections
rate indicate that there are lots of users that cannot establish a connection. In addition, there are a large number of
dropped connections, as the low values of the retainability
show (86.88% on average). These symptoms, along with the
high average CPU load (around 42.32), reveal that this cluster
matches cells that have overload problems. Moreover, it is
fully in line with the rest of the symptoms such as the high
number of RRC connections. Furthermore, as the number of
users increases, the interference levels suffered by the users are
incremented, causing a significant deterioration of the average
RSSI (which is around -107 dBm, outside the acceptable
range). In conclusion, the cell is overloaded due to the high
amount of traffic, which prevents the cells from maintaining the service under acceptable conditions and further blocks the connections of new users.
Regarding cluster 2, its accessibility, retainability and failed
RRC connections rate have a good statistical behaviour (similar to the normal one). Furthermore, the KPIs related to the
HO procedure also present normal values, except the IRAT HO
rate which is higher than normal. This indicates that lots of
users are leaving the LTE technology, which is undesirable. By
analysing the rest of KPIs, it can be observed that, in addition,
both the number of bad coverage reports and the average RSSI are deteriorated. In this case, the high number of established RRC connections, along with the low values of the antenna tilt, suggests that the cells are covering a larger coverage area than necessary and are thus serving users with bad signal conditions. On the basis of the above, it is concluded that this
behaviour is in line with the symptoms of lack of coverage.
Finally, concerning cluster 4, the majority of its KPIs are
concentrated near zero. Both accessibility and retainability
are practically zero which means that there is hardly any
connection established in those cells. As a result, this cluster
is labelled as non-operating cell.
3) Case Study: diagnosing problematic cells over time:
In order to demonstrate the performance of the automatic
diagnosis system and validate the assigned labels, the system
has been applied to diagnose different cells of this real LTE
network.
The first chosen cell is a problematic cell that has been
manually reconfigured over time in order to improve its
performance. Therefore, for this study, it has been analysed
by the proposed diagnosis system during the days when the
troubleshooting tasks were being carried out. Fig. 6 shows
the evolution of the number of bad coverage reports (represented by a continuous black line), since it is the most relevant KPI
in this situation. Furthermore, the diagnosis achieved with the
proposed system over the same period of time is superimposed.
In particular, the diagnosis automatically obtained during the
first three days determines that this cell is not operating. This
matches the troubleshooting tasks, which reveal that this cell
was inactive during this period of time. After the cell was
launched, its KPIs indicate that the performance of the cell was
deteriorated. In particular, the diagnosis varies between normal

0018-9545 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TVT.2015.2431742, IEEE Transactions on Vehicular Technology
16

TABLE X
MEAN AND STANDARD DEVIATION OF EACH KPI GIVEN EACH CLUSTER

Mean:
KPI                              | C1      | C2      | C3      | C4
Accessibility [%]                | 76.04   | 99.07   | 99.78   | 0.12
Retainability [%]                | 86.88   | 99.11   | 99.75   | 0
Failed RRC Connections Rate [%]  | 18.40   | 0.41    | 0.23    | 0
HOSR [%]                         | 83.24   | 97.77   | 96.88   | 6.82
Number of ping-pong HO           | 4040    | 36.98   | 34.02   | 0.15
IRAT HO Rate [%]                 | 1.38    | 3.47    | 1.16    | 0.07
Number of bad coverage report    | 1241    | 1352    | 198     | 0.35
Average RSSI [dBm]               | -107.73 | -113.82 | -115.86 | -15.22
Average CPU Load                 | 42.32   | 32.69   | 21.66   | 22.65
Number of RRC connections        | 28867   | 13990   | 5467    | 0.04
Tilt [°]                         | 0.63    | 1.38    | 2.34    | 3.63

Standard deviation:
KPI                              | C1      | C2      | C3      | C4
Accessibility [%]                | 25.79   | 7.01    | 0.34    | 3.47
Retainability [%]                | 23.36   | 6.96    | 0.36    | 0
Failed RRC Connections Rate [%]  | 16.80   | 0.89    | 0.41    | 0
HOSR [%]                         | 16.79   | 8.62    | 12.65   | 25.18
Number of ping-pong HO           | 4363    | 34.14   | 146.51  | 1.31
IRAT HO Rate [%]                 | 1.32    | 5.23    | 1.17    | 0.53
Number of bad coverage report    | 1041    | 903.28  | 262.55  | 2.12
Average RSSI [dBm]               | 4.39    | 2.27    | 2.72    | 39.68
Average CPU Load                 | 12.17   | 9.74    | 3.53    | 4.17
Number of RRC connections        | 17943   | 6459    | 4474    | 1.06
Tilt [°]                         | 1.13    | 1.67    | 2.40    | 3.48

TABLE XI
LABEL ASSIGNED TO EACH CLUSTER

Cluster   | Label
Cluster 1 | Overload
Cluster 2 | Lack of Coverage
Cluster 3 | Normal
Cluster 4 | Non-operating cell

VI. CONCLUSIONS
An automatic diagnosis system as part of a Self-Healing
network has been proposed in this paper. This system is built
through unsupervised techniques with the aim of obtaining a
system that represents the normal and faulty behaviour of the
real network. The use of unsupervised techniques guarantees
that the system can be built without historical reports of solved
cases while simultaneously enabling the system to identify
new faults which are not previously known. Even so, the
clusters derived from the proposed system are labelled by


an expert based on their statistical behaviour, although the effort required from experts is negligible compared to that required for supervised methods. In particular, the PDFs of each KPI given each cluster are estimated taking into account all the cases in the training dataset, which provides more information than only considering the weight vectors of the
neurons. By performing a supervised labelling, experts can:
detect errors in the clustering; identify the behaviour of each
cluster; assign the best suited fault cause to each cluster based
on their knowledge; and verify whether the system is right or
not. As a result, this stage is not only a labelling phase but
also a validation phase.
The main requirement is that the identification process
must be relatively prompt, objective, and automatic. The key
element for achieving this is the proposed adjustment phase.
In order to avoid slowing down the exploitation process, this


technique only acts when the traditional mapping is more likely to be wrong, that is, when the activated neuron is a border neuron between two or more clusters. Furthermore, this phase attempts to correct the errors in an objective and automatic way. This correction is done based on the X-th percentile of all distances between the input and each cluster, and the evaluation provided by the average Silhouette index.
To assess the proposed approach, the diagnosis system has
been built with both simulated and real data, showing how
the construction phase must be done and how the diagnosis
is performed in a live network. The obtained results demonstrate the value of the integrated approach. Furthermore, the
proposed diagnosis system has been compared with reference
mechanisms in order to objectively evaluate its effectiveness. It
is important to point out that the proposed diagnosis system is
highly accurate, taking into account that it has been built using
unsupervised techniques. Finally, it can be concluded that this
system could be part of a Self-Healing network where specific
corrective actions are taken after the automatic diagnosis stage.
ACKNOWLEDGMENT
This work has been partially funded by Optimi-Ericsson,
Junta de Andalucía (Agencia IDEA, Consejería de Ciencia, Innovación y Empresa, ref.59288, and Proyecto de Investigación
de Excelencia P12-TIC-2905) and ERDF. Furthermore, the
authors wish to thank E. J. Khatib for providing real data and
for his valuable comments and suggestions about the statistical
analysis of the real data.
REFERENCES
[1] 3GPP, "Self-Organizing Networks (SON); Concepts and requirements," 3rd Generation Partnership Project (3GPP), TS 32.500.
[2] 3GPP, "Self-Organizing Networks (SON); Self-healing concepts and requirements," version 11.0.0 (2012-09), TS 32.541.
[3] R. Barco, P. Lazaro, and P. Munoz, "A unified framework for self-healing in wireless networks," IEEE Communications Magazine, pp. 134-142, 2012.
[4] R. Barco, V. Wille, and L. Díez, "System for automated diagnosis in cellular networks based on performance indicators," European Transactions on Telecommunications, vol. 16, pp. 399-409, 2005.
[5] R. M. Khanafer, B. Solana, J. Triola, R. Barco, L. Moltsen, Z. Altman, and P. Lazaro, "Automated Diagnosis for UMTS Networks Using Bayesian Network Approach," IEEE Transactions on Vehicular Technology, vol. 57, no. 4, pp. 2451-2461, 2008.
[6] P. Szilagyi and S. Novaczki, "An Automatic Detection and Diagnosis Framework for Mobile Communication Systems," IEEE Transactions on Network and Service Management, vol. 9, no. 2, pp. 184-197, 2012.
[7] F. J. Massey, "The Kolmogorov-Smirnov test for goodness of fit," Journal of the American Statistical Association, vol. 46, no. 253, pp. 68-78, 1951.
[8] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[9] S. Novaczki, "An Improved Anomaly Detection and Diagnosis Framework for Mobile Network Operators," in Proc. of 9th International Conference on the Design of Reliable Communication Networks (DRCN), 2013.
[10] T. Kohonen, M. R. Schroeder, and T. S. Huang, Eds., Self-Organizing Maps, 3rd ed. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2001.
[11] G. A. Barreto, J. C. M. Mota, L. G. M. Souza, R. A. Frota, and L. Aguayo, "A new approach to fault detection and diagnosis in cellular systems using competitive learning," in Proceedings of the VII Brazilian Symposium on Neural Networks, 2004.
[12] J. Laiho, K. Raivio, P. Lehtimäki, K. Hatonen, and O. Simula, "Advanced analysis methods for 3G cellular networks," IEEE Transactions on Wireless Communications, vol. 4, pp. 930-942, 2005.
[13] K. Raivio, O. Simula, J. Laiho, and P. Lehtimäki, "Analysis of mobile radio access network using the Self-Organizing Map," in IEEE Eighth International Symposium on Integrated Network Management, 2003, pp. 439-451.
[14] K. Raivio, O. Simula, and J. Laiho, "Neural analysis of mobile radio access network," in IEEE International Conference on Data Mining, December 2001, pp. 457-464.
[15] M. Kylvaja, K. Hatonen, P. Kumpulainen, J. Laiho, P. Lehtimäki, K. Raivio, and P. Vehvilainen, "Trial Report on Self-Organizing Map Based Analysis Tool for Radio Networks," in IEEE 59th Vehicular Technology Conference, 2004. VTC 2004-Spring, vol. 4, May 2004, pp. 2365-2369.
[16] S. Chebbout and H. F. Merouani, "Comparative study of clustering based colour image segmentation techniques," in IEEE International Conference on Signal Image Technology and Internet Based Systems, 2012, pp. 839-844.
[17] P. Liu, "Using self-organizing feature maps and data mining to analyze liability authentications of two-vehicle traffic crashes," Third International Conference on Natural Computation, vol. 2, pp. 94-102, 2007.
[18] J. G. Brida, M. Disegna, and L. Osti, "Segmenting visitors of cultural events by motivation: A sequential non-linear clustering analysis of Italian Christmas market visitors," Expert Systems with Applications, vol. 39, no. 13, pp. 11349-11356, 2012.
[19] T. Kohonen, "Essentials of the Self-Organizing Map," Neural Networks, vol. 37, pp. 52-65, 2013.
[20] D. Brugger, M. Bogdan, and W. Rosenstiel, "Automatic cluster detection in Kohonen's SOM," IEEE Transactions on Neural Networks, vol. 19, no. 3, pp. 442-459, 2008.
[21] J. A. Lee and M. Verleysen, "Self-organizing maps with recursive neighborhood adaptation," Neural Networks, vol. 15, no. 8-9, pp. 993-1003, 2002.
[22] S. Haykin, Ed., Neural Networks: A Comprehensive Foundation. Macmillan Publishing Co., 1994.
[23] F. Murtagh and P. Legendre, "Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion?" Journal of Classification, 2013.
[24] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 224-227, 1979.
[25] M. P. Wand and M. C. Jones, Eds., Kernel Smoothing. Chapman & Hall/CRC Monographs on Statistics & Applied Probability (60), 1994.
[26] A. Gomez-Andrades, P. Munoz, R. Barco, E. J. Khatib, I. de-la-Bandera, and I. Serrano. (2014) Labelled cases of LTE problems. [Online]. Available: http://webpersonal.uma.es/de/rbarco/
[27] P. Munoz, I. de la Bandera, F. Ruiz, S. Luna-Ramírez, R. Barco, M. Toril, P. Lazaro, and J. Rodríguez, "Computationally-Efficient Design of a Dynamic System-Level LTE Simulator," Intl Journal of Electronics and Telecommunications, vol. 57, no. 3, pp. 347-358, 2011.
[28] A. Gomez-Andrades, P. Munoz, E. J. Khatib, I. de-la-Bandera, I. Serrano, and R. Barco, "Methodology for the Design and Evaluation of Self-Healing LTE Networks," Submitted, 2014.
[29] 3GPP, "Physical layer; Measurements," 3rd Generation Partnership Project (3GPP), TS 25.215.
[30] 3GPP, "OFDM-HSDPA System level simulator calibration (R1-040500)," 3rd Generation Partnership Project (3GPP), May 2004, 3GPP TSG-RAN WG1 37.
[31] E. Alhoniemi, J. Himberg, J. Parhankangas, and J. Vesanto. SOM Toolbox 2.0 for Matlab 5 software. [Online]. Available: http://www.cis.hut.fi/somtoolbox/
[32] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, 1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2001.
[33] Decision Systems Laboratory of the University of Pittsburgh. GeNIe modeling environment. [Online]. Available: http://genie.sis.pitt.edu/
[34] J. Vesanto. Neural Network Tool for Data Mining: SOM Toolbox. Helsinki University of Technology. [Online]. Available: http://www.cis.hut.fi/proyects/somtoolbox/
[35] 3GPP, "Key Performance Indicators (KPI) for Evolved Universal Terrestrial Radio Access Network," 3rd Generation Partnership Project (3GPP), TS 32.450.
[36] 3GPP, "Evolved Universal Terrestrial Radio Access (E-UTRA) Radio Resource Control (RRC); Protocol specification," version 9.2.0 (2010-04), TS 36.331.
[37] K. Ghanem, H. Alradwan, A. Motermawy, and A. Ahmad, "Reducing ping-pong handover effects in intra EUTRA networks," Proc. of 8th International Symposium on Communication Systems, Networks & Digital Signal Processing, pp. 1-5, 2012.



Ana Gomez Andrades received her M.Sc. degree in


telecommunication engineering from the University
of Malaga (Spain) in 2012. She is currently working in the Communications Engineering Department
(University of Malaga) in cooperation with Ericsson.
Since 2014, she has been a Ph.D. student in the
area of self-healing LTE networks. Her research
interests include mobile communications and big-data analytics applied to Self-Organizing Networks.

Raquel Barco holds a M.Sc. and a Ph.D. in


Telecommunication Engineering. She has worked at
Telefonica in Madrid (Spain) and at the European
Space Agency (ESA) in Darmstadt (Germany). She
has also worked part-time for Nokia Networks. In
2000 she joined the University of Malaga, where
she is currently Associate Professor.


Pablo Munoz received his M.Sc. and Ph.D. degrees in Telecommunication Engineering from the University of Malaga (Spain) in 2008 and 2013, respectively. He is currently working with the Communications Engineering Department at the same university. Since September 2009, he has been a Ph.D. Fellow, working on self-optimization of mobile radio access networks and radio resource management.

Inmaculada Serrano obtained her M.Sc. degree


from the Universidad Politécnica de Valencia. She specialized
further in radio after complementing her education
with a Master in Mobile Communications. In 2004
she joined Optimi and started a wide career in
optimization and troubleshooting of Mobile Networks, including a variety of consulting, training
and technical project management roles. In 2012,
she moved to the Advanced Research Department at
Ericsson.

