Automatic Root Cause Analysis For LTE Networks Based On Unsupervised Techniques
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TVT.2015.2431742, IEEE Transactions on Vehicular Technology
I. INTRODUCTION
CELLULAR networks are an effective means of communication that allows people to send and receive information instantaneously, anywhere. In recent years, as a consequence
of a sharp increase in the traffic demand, the infrastructure of
cellular networks has been profoundly modified. The higher
complexity of these networks has encouraged mobile operators
to implement effective and low-cost management algorithms.
In this context, Self-Organizing Networks (SONs) establish a
new concept of network management in which the operation
and maintenance tasks are carried out with a high level of
automation [1]. Within this paradigm, Self-Healing is a major
SON category whose aim is to automate the troubleshooting
process [2], [3], which is composed of the detection, diagnosis,
compensation and recovery phases. As cellular networks are
currently more prone to failures due to their huge increase
in size and complexity, Self-Healing is gradually gaining
0018-9545 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The rest of the paper is organized as follows. First, the existing work related to fault diagnosis in mobile communication
systems is presented in Section II, followed by the motivation
for the design of an unsupervised diagnosis system in Section
III. Then, in Section IV, the proposed system is described in
detail, explaining the concepts and the algorithms used in each
stage. Section V demonstrates the validity and the robustness
of this approach using real and simulated data and comparing
its efficacy with reference mechanisms. Finally, in Section VI,
the main conclusions are discussed.
II. RELATED WORK
Even though automatic fault identification is essential
for the prompt enforcement of maintenance decisions, the
related literature in the field of mobile networks is scarce.
Thus, it remains an open research problem. Nevertheless, some
solutions for mobile networks can be found in the literature.
An overview of them is presented below.
Bayesian Networks: Some studies, such as [4], [5], proposed the use of Bayesian Networks as classifiers of fault
problems in order to diagnose them automatically. In particular, [4] presents a method based on a naive Bayesian
classifier to identify the fault cause in GSM/GPRS, 3G
or multi-system networks using performance indicators
as continuous variables. In addition, [5] presents another
automated fault diagnosis system based on Bayesian
Networks for UMTS networks using two different algorithms to discretize the performance indicators, i.e.
percentile-based discretization (PBD) and entropy
minimization discretization (EMD).
Scoring-based systems: A different approach based on a
scoring system has been proposed in [6]. This detection
and diagnosis system is automatically built from the
labelled fault cases reported by experts and it uses a
scoring system in order to determine how well a specific
case matches each diagnosis target. [9] proposes more sophisticated profiling and detection capabilities to improve
this framework.
All these techniques proposed in the literature, i.e. [4],
[5], [6], [9], are supervised techniques. Therefore, their
diagnosis process requires a historical database of labelled fault cases in order to learn the impact the faults
have on the performance indicators. However, when the
set of documented and solved cases is small, partly due to
the limited occurrences of each fault, unsupervised techniques are the most appropriate ones. This is precisely
what occurs with the historical records of faults in mobile
networks, because experts are not concerned with recording
the cause of the problem or the actions they took when
performing their troubleshooting tasks.
Neural Networks: Several works have demonstrated the
potential and utility of using the unsupervised neural
network known as Self-Organizing Maps (SOM) [10]
in order to automate the fault detection phase of the
troubleshooting tasks (that is, the phase prior to diagnosis). In particular, in [11] SOM is used to build the
normal profile of the network and to determine the
healthy ranges of the selected performance indicators (i.e.
the symptoms). Therefore, this system helps to identify
whether the symptoms are healthy or faulty, without ever
providing the fault cause. In addition, [12], [13], [14],
[15] show how SOM can be used to analyse multidimensional Third Generation (3G) network performance
data in order to aid manual fault diagnosis. The
aim is to cluster cells based on their performance in
order to assist experts in their manual troubleshooting
and parameter optimization tasks. In contrast, the system
proposed in this paper aims to provide the fault cause of a
problematic cell automatically and directly, without
any expert supervision in the exploitation stage.
Therefore, it is essential for the proposed system to
ensure two important aspects: (a) all identified clusters
must present different statistical behaviours in order to
guarantee that no two similar clusters are associated
with the same fault cause; and (b) the final diagnosis must be
as accurate as possible. Even though the proposed system
is based on the SOM technique, as stated before, the purpose
of the whole system is different from the one presented in
[12]-[15]; thus, the system proposed in this paper includes
additional and more complex techniques in order to meet
those requirements.
Regarding the use of the Silhouette index with SOM techniques, there are some references in the literature (not related
to wireless networks), such as [16] and [17]. However, in
those cases, the Silhouette index is only used to
evaluate the quality of the obtained clustering. Furthermore,
in [18], this index is used as an indicator to choose the best
clustering technique. Unlike the previous references, in this
paper, the Silhouette index is used to correct the mapping of
a given input.
[Fig. 1: diagram of the Diagnosis System. Labels recovered from the figure: Input Stage (KPI Database, Pre-process, S); Process Stage (Training, Clustering, Labelling; BMU, Percentile-based approach, Silhouette controller, Adjustment); Output Stage (Cause).]
the same KPIs, time and layer (per cell, per base station,
etc.) aggregation levels, etc. Since no labels are used to
obtain the training dataset, there will be data describing
normal behaviour and cells with abnormal behaviour due
to faults. It is important to highlight that since the dataset
will include data both from normal cells and faulty cells,
one of the classes obtained after the clustering phase
will be the normal profile, while the rest of classes will
correspond to faults.
Analytical data: During the exploitation stage, the input
data is the specific cell's state taken directly from the cell
under study.
The input of the diagnosis system must be pre-processed
in order to suit the technical requirements of the system. In
particular, for the proposed system, the input data must be
quantitative, i.e. they can be expressed in terms of numbers
(e.g. power, throughput). Performance indicators
of mobile networks are numerical
variables. As a consequence, this system is appropriate to
automate the troubleshooting process in a mobile network by
working directly with KPIs, avoiding both the discretization
of the variables and the definition of thresholds by experts.
However, given that the proposed system is based on the
Euclidean distance, the raw data taken from the network must
be normalised. This ensures that their dynamic ranges are
similar and thus that no KPI with high values dominates the
training. In this system, the normalization process is performed
by using the following methods:
Range normalization: This method transforms the dynamic range of a particular metric ($KPI_i$). The objective
is to ensure that all input variables range within the
desired interval. In particular, this method is applied only
to those KPIs whose values are not within the interval [0,
1]. This normalization is given by the following equation
on the basis of the minimum and maximum value of the
$KPI_i$ in the training dataset ($\widehat{KPI}_i$):
\[ \widetilde{KPI}_i = \frac{KPI_i - \min(\widehat{KPI}_i)}{\max(\widehat{KPI}_i) - \min(\widehat{KPI}_i)}. \qquad (1) \]
(2)
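The range normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `fit_range` and `range_normalize`, the KPI names and the sample values are all hypothetical, and the handling of a KPI with a constant training value is an assumption, since the text does not cover that case.

```python
from typing import Dict, List, Tuple


def fit_range(training: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Store (min, max) of every KPI as observed in the training dataset."""
    return {kpi: (min(values), max(values)) for kpi, values in training.items()}


def range_normalize(sample: Dict[str, float],
                    ranges: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Apply min-max scaling to one cell state, KPI by KPI."""
    normalized = {}
    for kpi, value in sample.items():
        lo, hi = ranges[kpi]
        if 0.0 <= lo and hi <= 1.0:
            # Training values already lie within [0, 1] (e.g. a ratio): keep as is.
            normalized[kpi] = value
        elif hi == lo:
            # Assumption: a KPI that is constant in training is mapped to 0.
            normalized[kpi] = 0.0
        else:
            normalized[kpi] = (value - lo) / (hi - lo)
    return normalized


# Hypothetical training data: one ratio KPI and one KPI in dBm.
training = {
    "retainability": [0.70, 0.95, 0.99],
    "rsrp_dbm": [-85.0, -70.0, -55.0],
}
ranges = fit_range(training)
print(range_normalize({"retainability": 0.90, "rsrp_dbm": -70.0}, ranges))
```

Because the minimum and maximum come from the training dataset, a value observed later in the exploitation stage may fall slightly outside [0, 1]; the sketch does not clip such values.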
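[Algorithm residue: SOM training. Only the fragments below are recoverable; the Greek symbols were lost in extraction, so $\sigma$ (neighbourhood radius) and $\alpha$ (learning rate) are used here.]
Rough-tuning ($T_1$ iterations):
Neighbourhood radius: $\sigma_t$ is obtained from $\sigma_{t-1}$ at each iteration [update rule not recoverable], where $\sigma_0$ and $\sigma_{T_1}$ are the initial and final neighbourhood radius
respectively, and $T_1$ is the number of iterations.
3. Update the weight factors:
\[ W_j^{(t+1)} = \frac{\sum_{i=1}^{P} h^{t}_{n_i^t j} \, S^i}{\sum_{i=1}^{P} h^{t}_{n_i^t j}}, \quad \forall j, \]
where the neighbourhood function $h^{t}_{n^t j}$ has an exponential form [exponent not recoverable].
Fine-tuning ($T_2$ iterations):
3. Learning rate:
\[ \alpha_t = \frac{\alpha_0}{1 + 100 \, \frac{t}{T_2}} \]
Neighbourhood radius: $\sigma_{(t+1)}$ is obtained from $\sigma_t$ [update rule not recoverable],
where $\sigma_0$ and $\sigma_{T_2}$ are the initial and final
neighbourhood radius respectively.
5. Repeat the above-described process starting
from step 1 until $t = T_2$.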
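The weight update and the learning-rate schedule of the training steps above can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the paper: the map is one-dimensional, the neighbourhood kernel is a Gaussian of the grid distance (the exact exponent is not legible in the source), and the names `bmu_index`, `gaussian_neighbourhood`, `batch_update` and `learning_rate` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def bmu_index(sample: Vector, weights: List[Vector]) -> int:
    """Index of the neuron whose weight vector is closest (Euclidean) to the sample."""
    return min(range(len(weights)), key=lambda j: math.dist(sample, weights[j]))


def gaussian_neighbourhood(winner: int, j: int, radius: float) -> float:
    """Assumed kernel on a 1-D map: exp(-d^2 / (2 * radius^2)), d = grid distance."""
    return math.exp(-((winner - j) ** 2) / (2.0 * radius ** 2))


def batch_update(samples: List[Vector], weights: List[Vector], radius: float) -> List[Vector]:
    """One batch iteration: each weight becomes a neighbourhood-weighted mean of all samples."""
    winners = [bmu_index(s, weights) for s in samples]
    dim = len(weights[0])
    new_weights = []
    for j in range(len(weights)):
        h = [gaussian_neighbourhood(n, j, radius) for n in winners]
        total = sum(h)  # strictly positive for a Gaussian kernel and moderate radius
        new_weights.append(
            [sum(h_i * s[d] for h_i, s in zip(h, samples)) / total for d in range(dim)]
        )
    return new_weights


def learning_rate(alpha0: float, t: int, total_iterations: int) -> float:
    """Inverse-time decay: alpha0 / (1 + 100 * t / T)."""
    return alpha0 / (1.0 + 100.0 * t / total_iterations)


# Two one-dimensional samples and a two-neuron map (hypothetical values).
weights = batch_update([[0.0], [1.0]], [[0.2], [0.8]], radius=1.0)
print(weights)
print(learning_rate(0.01, 0, 20000), learning_rate(0.01, 20000, 20000))
```

With this schedule the learning rate at the last iteration is roughly one hundredth of its initial value, which is why the fine-tuning phase only makes small corrections to the map.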
Fig. 2. Histograms of retainability (a) and HOSR (b) given each [caption truncated in the source]. Legend: Cluster 1 to Cluster 4; horizontal axes: Retainability (a), HOSR (b).
(3)
C. Exploitation Stage
Once the system has been designed, it can be used to automatically perform the diagnosis in the exploitation phase. This
diagnosis process can be performed periodically to determine
whether the identified fault is sporadic, continuous or periodic
and to track the evolution of the fault over time when
either compensation or recovery tasks are carried out.
The proposed diagnosis process is summarised in Algorithm
3. Such a process must be as accurate and reliable as possible.
To achieve this, first, for a specific cell's state (S_i), the winner
neuron (BMU) is determined on the basis of the minimum
Euclidean distance and thus the diagnosis (Diagnosis_BMU) is
the cause (c_j) related to that neuron. However, if the activated
neuron is at the border between two or more causes, the
likelihood of an erroneous diagnosis is higher because the
cell's state is close to the behaviours of different faults. Thus, we
propose further processing in order to guarantee a high rate of successful diagnoses. In particular, the adjustment proposed in this
paper is carried out using a percentile-based approach and a
Silhouette controller (see Fig. 1):
Percentile-based approach: this method is only used when
the BMU is a border neuron among different causes. In
this situation, the diagnosis should be the cause located
at the borderline that has, in general, the most similar
behaviour. In view of the above, a different approach
to identify the most similar cause is proposed. For each
border cause, the X-th percentile of the distances between
the cell's state and all neurons in that cluster is estimated. Then, the cause selected by the percentile-based
approach (Diagnosis_P) is the one that has the minimum
X-th percentile of all its distances. Compared to the BMU,
where the considered distance is only to one single neuron
(i.e. the closest), the proposed approach considers the
distance to all neurons in the cluster.
Silhouette controller: once the diagnoses have been determined by the BMU and percentile-based approaches,
one of the two diagnoses must be selected. For this
purpose, a controller based on the silhouette index [8]
is proposed. First, if the diagnosis provided by the
percentile-based approach matches the one given by the
BMU, it can be concluded that the diagnosis is right
and, as a result, the selected cause corresponds to
Diagnosis_BMU. Nevertheless, when both diagnoses are
\[ Diagnosis(S) = \begin{cases} Diagnosis_{BMU} & \text{if } Silhouette_{BMU} \geq Silhouette_{P} \\ Diagnosis_{P} & \text{if } Silhouette_{P} > Silhouette_{BMU} \end{cases} \qquad (4) \]
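The adjustment described above (BMU diagnosis, percentile-based diagnosis, then selection by the Silhouette controller) can be sketched as follows. Several points are assumptions, not statements of the authors' method: the per-sample silhouette is computed as (b - a) / max(a, b) from mean distances to the neurons of each cluster, the percentile X defaults to 50, and the set of border causes is passed in by the caller because detecting a border neuron depends on the map topology. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Neuron = Tuple[List[float], str]  # (weight vector, cause label of its cluster)


def distances_by_cause(state: Sequence[float], neurons: List[Neuron]) -> Dict[str, List[float]]:
    """Euclidean distance from the cell state to every neuron, grouped by cause."""
    grouped: Dict[str, List[float]] = {}
    for weights, cause in neurons:
        grouped.setdefault(cause, []).append(math.dist(state, weights))
    return grouped


def percentile(values: List[float], x: float) -> float:
    """X-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * x / 100.0
    low, high = math.floor(pos), math.ceil(pos)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def silhouette(cause: str, grouped: Dict[str, List[float]]) -> float:
    """Assumed per-sample silhouette of the state with respect to one cause."""
    a = sum(grouped[cause]) / len(grouped[cause])
    b = min(sum(d) / len(d) for c, d in grouped.items() if c != cause)
    return 0.0 if max(a, b) == 0.0 else (b - a) / max(a, b)


def diagnose(state: Sequence[float], neurons: List[Neuron],
             border_causes: List[str], x: float = 50.0) -> str:
    grouped = distances_by_cause(state, neurons)
    bmu_cause = min(grouped, key=lambda c: min(grouped[c]))
    if len(border_causes) < 2:
        return bmu_cause  # BMU is not a border neuron: no adjustment needed
    pct_cause = min(border_causes, key=lambda c: percentile(grouped[c], x))
    if pct_cause == bmu_cause:
        return bmu_cause  # both approaches agree
    # Silhouette controller: ties are resolved in favour of the BMU diagnosis.
    if silhouette(bmu_cause, grouped) >= silhouette(pct_cause, grouped):
        return bmu_cause
    return pct_cause


neurons = [
    ([0.45, 0.0], "Normal"), ([-1.0, 0.0], "Normal"), ([-1.2, 0.0], "Normal"),
    ([0.6, 0.0], "Coverage Hole"), ([0.7, 0.0], "Coverage Hole"), ([0.8, 0.0], "Coverage Hole"),
]
print(diagnose([0.5, 0.0], neurons, ["Normal", "Coverage Hole"]))
```

In this toy map the closest single neuron belongs to the Normal cluster, but the state is on the whole much closer to the Coverage Hole neurons, so the percentile-based diagnosis differs from the BMU diagnosis and the silhouette comparison decides between them.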
\[ \ldots \, \frac{D_u}{TTI} \qquad (5) \]
where BLER is the block error probability, which depends on the SINR of user u, $D_u$ is its data block payload
in bits and $TTI$ is the transmission time interval.
The 95th percentile of the distance between the base
station and each user (Distance95 ).
(6)
TABLE I
SIMULATION PARAMETERS
Parameter | Configuration
Cellular layout | Hexagonal grid, 57 cells, cell radius 0.5 km
Transmission direction | Downlink
Carrier frequency | 2.0 GHz
System bandwidth | 1.4 MHz (6 physical resource blocks)
Frequency reuse | 1
Propagation model | Okumura-Hata with wrap-around, log-normal slow fading, sf = 8 dB and correlation distance = 50 m
Channel model | Multipath fading, ETU model
Mobility model | Random direction, 3 km/h
Service model | Full buffer, Poisson traffic arrival
Base station model | Tri-sectorized antenna, SISO, Downtilt = 9, PTXmax = 43 dBm, Azimuth beamwidth (AB) = 70, Elevation beamwidth (EB) = 10
Scheduler | Time domain: Round-Robin, Frequency domain: Best Channel
Power control | Equal transmit power per physical resource block
Link Adaptation | Fast, CQI based, perfect estimation
Handover | Triggering event = A3, HOM = 3 dB, Measurement type = RSRP
Radio Link Failure | SINR < 6.9 dB for 500 ms
Traffic distribution | Evenly distributed in space
Time resolution | 100 TTI (100 ms)
Epoch & KPI time | 100 s
TABLE II
FAULT CAUSE DESCRIPTION
Fault Cause | Configuration | Number of Cases (Training) | Number of Cases (Validation)
EU | Downtilt = [0, 1] | 32 | 212
ED | Downtilt = [16, 15, 14] | 28 | 212
ERP | PTX = [7, 8, 9, 10] dB | 28 | 208
CH | hole = [49, 50, 52, 53] dBm | 14 | 103
TLHO | HOM = [6, 7, 8] dBm | 34 | 204
II | PTXmax = 33 dBm, Downtilt = 15, AB = [30, 60], EB = 10 | 15 | 106
Normal | No fault | 399 | 2964
TABLE III
CONFIGURATION PARAMETERS USED TO TRAIN THE SYSTEM
Parameters | Simulated scenario, Rough-tuning | Simulated scenario, Fine-tuning | Real scenario, Rough-tuning | Real scenario, Fine-tuning
Training algorithm | Batch | Sequential | Batch | Sequential
Training length, T | 1000 | 20000 | 1000 | 2000
Neighbourhood function | Bubble | Gaussian | Bubble | Gaussian
Initial neighbourhood radius | 10 | 1 | 10 | 1
Final neighbourhood radius | 2 | 0 | 1 | 0
Initial learning rate | - | 0.01 | - | 0.001

TABLE IV
DAVIES-BOULDIN INDEX
Number of Clusters | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Davies-Bouldin index | 1.43 | 1.25 | 1.06 | 0.98 | 0.84 | 0.80 | 0.7 | 0.83 | 0.87

TABLE V
P-VALUE OF THE KS TEST BETWEEN TWO CLUSTERS
KPI | Ret | HOSR | RSRP95 | RSRQ5 | SINR95 | AvTh | Dist95
p-value | 0.0026 | 0.0026 | 0.3788 | 0.0045 | 0.8552 | 0.2853 | 0.2095
each KPI is deteriorated for each cause (see Fig. 5). Because a kernel smoothing function is used, the PDFs
are estimated through the sum of Gaussian functions centred
at the data. Therefore, the estimated PDF is smoother than the
histogram, which causes the estimated PDF to cover values out of
range (see Fig. 5.a and b). It is important to highlight that this
is part of the estimation error, but this particular error does
not affect the proposed analysis, since the estimated PDFs are
only used to determine the overall statistical behaviour of the
clusters by visual inspection.
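The effect described above, where a kernel-smoothed PDF assigns density to values outside the valid range of a KPI, can be reproduced with a minimal Gaussian kernel density estimate. The sketch below is illustrative only: the bandwidth of 0.01 and the retainability values are hypothetical, and the paper does not state which bandwidth its kernel smoothing function uses.

```python
import math
from typing import Callable, List


def gaussian_kde(data: List[float], bandwidth: float) -> Callable[[float], float]:
    """Return pdf(x) = average of Gaussians of width `bandwidth` centred at each data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return pdf


# Hypothetical retainability samples of one cluster; the KPI cannot exceed 1.
retainability = [0.97, 0.98, 0.99, 0.99, 1.00]
pdf = gaussian_kde(retainability, bandwidth=0.01)

# The estimate is still positive above 1, although no such value can be observed.
print(pdf(0.99), pdf(1.02))
```

The density above 1 comes entirely from the tails of the Gaussians centred at the samples close to the upper limit, which is the estimation error the text refers to.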
The objective of this examination is to get a rough idea
about the most probable values of each KPI depending on the
cluster and thus determine, for a specific cluster, if the majority
of values of a KPI are considered damaged or not. As a result
of this detailed study, the overall behaviour of each cluster has
been assessed to identify the associated fault cause and assign
it a label (see Fig. 5 and Table VI):
Cluster 1: This cluster represents cells with normal behaviour given that none of the indicators are deteriorated.
Cluster 2: Comparing the PDFs of the second cluster with
the first one, it is possible to identify that this cluster
represents cells whose KPIs have a normal behaviour
but with the difference that their retainability is more
likely to be low (approximately below 0.98). Therefore,
the high number of dropped calls together with the fact
that no other KPIs are deteriorated means that the fault
cause is a coverage hole within the service area. This
results from the fact that the coverage hole affects only
a particular group of users located in this specific area,
so that the aggregated values of the other analysed KPIs
do not present any deterioration.
Cluster 3: After analysing the PDFs, it is possible to
conclude that the most deteriorated KPIs of this cluster
are retainability, HOSR and RSRQ5 because they are
approximately below 0.98, 0.9 and -18.5 dB respectively.
A bad HOSR indicates that this cell has mobility problems, while a low RSRQ5 shows that the number of users
with bad quality has increased. This reveals that the cell is
retaining users with bad quality instead of performing the
handover process. As a result, this kind of cell presents
mobility problems due to the fact that handovers are
carried out too late.
Cluster 4: All the KPIs of this cluster, except average
throughput, are likely to be low. That is, the overall
performance of those cells is degraded. The number of
drops has increased and the maximum served distance
(Distance95) has been reduced (approximately below
[Fig. 5: plot residue. Panel titles recovered: PDF(Retainability | Cluster), PDF(HOSR | Cluster), PDF(RSRP | Cluster), PDF(RSRQ | Cluster), PDF(SINR | Cluster), PDF(Throughput | Cluster), PDF(Distance | Cluster). Horizontal axes recovered: HOSR, RSRP [dBm], RSRQ [dB], SINR [dB], Average Throughput [kbps]. Legend: Cluster 1 to Cluster 7.]
TABLE VI
LABEL ASSIGNED TO EACH CLUSTER
Clusters | Label
Cluster 1 | Normal
Cluster 2 | Coverage Hole
Cluster 3 | Too late HO
Cluster 4 | Excessive reduction of cell power
Cluster 5 | Excessive Uptilt
Cluster 6 | Inter-system interference
Cluster 7 | Excessive downtilt
C. Performance Evaluation
In this section, the diagnosis system is assessed and compared with reference mechanisms in order to show its effectiveness. It should be pointed out that there are no previous
unsupervised systems proposed in literature for diagnosis in
\[ FPR = \frac{N_{FP}}{N_{NC}} \qquad (7) \]
False Negative Rate (FNR): the proportion of problematic cases diagnosed as normal ($N_{FN}$) to the total number
of problematic cases ($N_{PC}$). It represents the inability
of the system to detect problems in a cell when that cell
actually has a problem.
\[ FNR = \frac{N_{FN}}{N_{PC}} \qquad (8) \]
\[ DER = \frac{N_{E}}{N_{PC}} \qquad (9) \]
(10)
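The evaluation rates can be computed from a list of true causes and a list of diagnosed causes, as in the sketch below. Only the FNR follows a definition that survives in the text. The other two rates are computed under assumed definitions: FPR as normal cases diagnosed as faulty over all normal cases, and the diagnosis error rate as problematic cases given a wrong fault label over all problematic cases. The function name and the example labels are hypothetical.

```python
from typing import List, Tuple


def evaluation_rates(true_causes: List[str], diagnosed: List[str],
                     normal: str = "Normal") -> Tuple[float, float, float]:
    """Return (FPR, FNR, DER) for paired lists of true and diagnosed causes."""
    pairs = list(zip(true_causes, diagnosed))
    n_nc = sum(1 for t, _ in pairs if t == normal)   # normal cases
    n_pc = len(pairs) - n_nc                         # problematic cases
    if n_nc == 0 or n_pc == 0:
        raise ValueError("both normal and problematic cases are required")
    n_fp = sum(1 for t, d in pairs if t == normal and d != normal)
    n_fn = sum(1 for t, d in pairs if t != normal and d == normal)
    # Assumed: problematic cases detected as faulty but with the wrong cause.
    n_e = sum(1 for t, d in pairs if t != normal and d != normal and d != t)
    return n_fp / n_nc, n_fn / n_pc, n_e / n_pc


true_causes = ["Normal"] * 4 + ["CH", "CH", "TLHO", "TLHO"]
diagnosed = ["Normal", "Normal", "Normal", "CH", "CH", "Normal", "CH", "TLHO"]
print(evaluation_rates(true_causes, diagnosed))
```

Keeping the three counts separate makes it visible whether a system fails by raising false alarms, by missing faults, or by confusing one fault with another.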
TABLE VII
TEST EVALUATION
System | FPR [%] | FNR [%] | DER [%] | Etotal [%]
Proposed system | 0.24 | 1.24 | 1.05 | 0.77
RBS | 37.42 | 0 | 16.36 | 31.93
BNC | 10.9 | 6.2 | 10 | 12.28
TABLE VIII
RUNNING TIME [s] OF EACH PROCEDURE
Rough-tuning | Fine-tuning | Unsupervised clustering
2.40 | 681.39 | 0.33

TABLE IX
PARAMETERS OF THE REAL LTE NETWORK
Parameter | Configuration
Network Layout | Urban area
Number of cells | 8679
System bandwidth | 10 MHz
Number of PRBs | 50
Frequency reuse factor | 1
Max. Transmit power | 46 dBm
Maximum Tx Power of UE | 23 dBm
Horizontal HPBW | 65
HOM | 3 dB
KPI time period | Hourly
Number of observed cells | 45
Number of days under observation | 6 days (on average)
Size of the training dataset | 14478 unlabelled cases
the cell.
Average CPU Load: the weighted average of the CPU
processes over the KPI time period. A cell with a high
average load has overload problems.
Tilt: the antenna configuration parameter that determines the angle that the antenna forms with the horizontal
plane. This means that the smaller the antenna tilt, the
larger its coverage area.
2) Construction of the proposed diagnosis system: According to the proposed design process, the diagnosis system has
been built using the real training dataset previously presented.
The first point to mention is that this training dataset is much
larger than the artificial training dataset (i.e. 14478 real cases
against 550 simulated cases). This would result in an excessive
increase in the running time of the training procedure, which is
the most critical one. As a result, the first design decision
was to reduce the training length of the fine-tuning phase
to 10 percent of the training length used with the artificial
dataset. The rest of the configuration parameters were kept
unchanged. Once the training and clustering
phases were completed, during the labelling phase it was found
that the obtained classification was fragmented. Therefore, as
explained in Section IV, the training and clustering phases were
repeated with different configuration parameters. To that end,
the final neighbourhood radius of the rough-tuning phase and
the initial learning rate of the fine-tuning phase were reduced
in order to train with more resolution. The particular
values used for those training parameters are shown in Table
III. Furthermore, the maximum number of clusters identifiable
by the system remains 10. With this design configuration, four
statistically different clusters have been found by the system.
In addition, all clusters are made up of adjoining neurons,
which indicates that both training and clustering phases have
been successful. In order to label each of them, their statistical
behaviour has been analysed through the PDFs of the KPIs
estimated by the kernel smoothing function, as stated earlier
in this paper. Nevertheless, in this section, the mean and the
standard deviation of each KPI given each cluster are
presented in Table X instead of the figures of those PDFs,
because of space constraints.
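A per-cluster summary of this kind (mean and standard deviation of each KPI given each cluster) can be produced with the standard library alone. The sketch below is only an illustration: the function name `cluster_summary`, the KPI name and the values are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

Case = Tuple[int, Dict[str, float]]  # (cluster id, KPI name -> value)


def cluster_summary(cases: List[Case]) -> Dict[int, Dict[str, Tuple[float, float]]]:
    """Mean and population standard deviation of every KPI, grouped by cluster."""
    grouped: Dict[int, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for cluster, kpis in cases:
        for name, value in kpis.items():
            grouped[cluster][name].append(value)
    return {
        cluster: {
            name: (statistics.mean(values), statistics.pstdev(values))
            for name, values in kpis.items()
        }
        for cluster, kpis in grouped.items()
    }


# Hypothetical cases: two clusters described by one KPI.
cases = [
    (1, {"cpu_load": 40.0}),
    (1, {"cpu_load": 44.0}),
    (3, {"cpu_load": 10.0}),
    (3, {"cpu_load": 14.0}),
]
print(cluster_summary(cases))
```

A table of means and standard deviations is more compact than one PDF per KPI and cluster, at the cost of hiding skewed or multi-modal distributions.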
First, the statistical behaviour of the clusters is analysed to
find the normal one. This will be the cluster whose KPIs have
the most common and least deteriorated values. On this basis,
cluster 3 has been labelled as Normal (see Table XI), for
the following reasons. This cluster represents cells whose
connectivity procedure works successfully, given that it has
high accessibility and retainability (both KPIs are around 99%)
along with a very low failed RRC connections rate. The HOSR
is adequate and both the absolute number of ping-pong HOs
and the IRAT HO rate are relatively low (around 34.02 and
1.16%, respectively), which means that there are no problems in
the mobility process. Regarding the quality, the low number of
bad coverage reports indicates that the served users measure
the cell with a high signal level. Furthermore, the average RSSI
is around -115.86 dBm, so it is within the desired range from
-120 dBm to -114 dBm. Finally, the low average load indicates
that there is no overload. As a result, this cluster presents a
normal performance.
By comparing cluster 1 with the Normal cluster, it can be
seen that all its KPIs are deteriorated. In particular, the low
value of accessibility and the high failed RRC connections
rate indicate that many users cannot establish a connection. In addition, there is a large number of
dropped connections, as the low values of the retainability
show (86.88% on average). These symptoms, along with the
high average CPU load (around 42.32), reveal that this cluster
matches cells that have overload problems. Moreover, this is
fully in line with the rest of the symptoms, such as the high
number of RRC connections. Furthermore, as the number of
users increases, the interference levels suffered by the users
increase, causing a significant deterioration of the average
RSSI (which is around -107 dBm, outside the acceptable
range). In conclusion, the cell is overloaded due to the high
amount of traffic, so that it cannot maintain the
service under acceptable conditions and also blocks the
connections of new users.
Regarding cluster 2, its accessibility, retainability and failed
RRC connections rate show good statistical behaviour (similar
to that of the Normal cluster). Furthermore, the KPIs related to the
HO procedure also present normal values, except the IRAT HO
rate, which is higher than normal. This indicates that many
users are leaving the LTE technology, which is undesirable. By
analysing the rest of the KPIs, it can be observed that, in
addition, both the number of bad coverage reports and the
average RSSI are deteriorated. In this case, the high number of
established RRC connections, along with the low values of the
antenna tilt, suggests that the cells are covering a larger area
than necessary and thus are serving users with bad signal
conditions. On the basis of the above, it is concluded that this
behaviour is in line with the symptoms of lack of coverage.
Finally, concerning cluster 4, the majority of its KPIs are
concentrated near zero. Both accessibility and retainability
are practically zero, which means that hardly any connection
is established in those cells. As a result, this cluster
is labelled as a non-operating cell.
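The comparison against the Normal cluster described above can be reproduced numerically from the mean values in Table X. The following is only a minimal sketch: the dictionary layout and the function name `deviation_from_normal` are illustrative assumptions, and the labels themselves were assigned by expert inspection in the paper, not by this code.

```python
# Mean KPI values per cluster, copied from Table X.
# C3 is the cluster labelled Normal in Table XI.
KPI_MEANS = {
    "Accessibility [%]":               {"C1": 76.04, "C2": 99.07, "C3": 99.78, "C4": 0.12},
    "Retainability [%]":               {"C1": 86.88, "C2": 99.11, "C3": 99.75, "C4": 0.0},
    "Failed RRC Connections Rate [%]": {"C1": 18.40, "C2": 0.41, "C3": 0.23, "C4": 0.0},
    "HOSR [%]":                        {"C1": 83.24, "C2": 97.77, "C3": 96.88, "C4": 6.82},
    "Number of ping-pong HO":          {"C1": 4040, "C2": 36.98, "C3": 34.02, "C4": 0.15},
    "IRAT HO Rate [%]":                {"C1": 1.38, "C2": 3.47, "C3": 1.16, "C4": 0.07},
    "Number of bad coverage reports":  {"C1": 1241, "C2": 1352, "C3": 198, "C4": 0.35},
    "Average RSSI [dBm]":              {"C1": -107.73, "C2": -113.82, "C3": -115.86, "C4": -15.22},
    "Average CPU Load":                {"C1": 42.32, "C2": 32.69, "C3": 21.66, "C4": 22.65},
    "Number of RRC connections":       {"C1": 28867, "C2": 13990, "C3": 5467, "C4": 0.04},
}


def deviation_from_normal(means, normal="C3"):
    """Return {cluster: {kpi: cluster_mean - normal_mean}} for the other clusters."""
    out = {}
    for kpi, per_cluster in means.items():
        reference = per_cluster[normal]
        for cluster, value in per_cluster.items():
            if cluster == normal:
                continue
            out.setdefault(cluster, {})[kpi] = round(value - reference, 2)
    return out


if __name__ == "__main__":
    for cluster, diffs in sorted(deviation_from_normal(KPI_MEANS).items()):
        print(cluster)
        for kpi, diff in diffs.items():
            print(f"  {kpi:<34}{diff:+.2f}")
```

Reading the differences row by row mirrors the reasoning in the text: cluster 1 stands out in accessibility, retainability and CPU load, cluster 2 in IRAT HO rate and bad coverage reports, and cluster 4 in almost every KPI.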
3) Case Study: diagnosing problematic cells over time:
In order to demonstrate the performance of the automatic
diagnosis system and validate the assigned labels, the system
has been applied to diagnose different cells of this real LTE
network.
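To make the idea of diagnosing a cell over time concrete, the sketch below assigns each KPI sample to the closest cluster mean from Table X. It is a simplified stand-in and not the diagnosis system proposed in the paper: the nearest-centroid rule, the min-max scaling, the choice of only four KPIs, and the function name `diagnose` are all assumptions made for illustration, and the sample values in the usage example are invented.

```python
import math

# Cluster means from Table X, restricted to four KPIs for brevity:
# (accessibility %, retainability %, IRAT HO rate %, average CPU load).
# Labels follow Table XI.
CENTROIDS = {
    "Overload":           (76.04, 86.88, 1.38, 42.32),
    "Lack of Coverage":   (99.07, 99.11, 3.47, 32.69),
    "Normal":             (99.78, 99.75, 1.16, 21.66),
    "Non-operating cell": (0.12, 0.0, 0.07, 22.65),
}


def _scales(centroids):
    """Per-KPI (minimum, span) over the centroids, used for min-max scaling."""
    columns = list(zip(*centroids.values()))
    return [(min(col), (max(col) - min(col)) or 1.0) for col in columns]


def diagnose(sample, centroids=CENTROIDS):
    """Return the label of the centroid closest to the sample (assumed rule)."""
    scales = _scales(centroids)

    def normalise(vector):
        return [(x - lo) / span for x, (lo, span) in zip(vector, scales)]

    scaled_sample = normalise(sample)
    return min(
        centroids,
        key=lambda label: math.dist(scaled_sample, normalise(centroids[label])),
    )


if __name__ == "__main__":
    # Hypothetical daily samples for one cell (not taken from the paper).
    series = [
        (0.0, 0.0, 0.0, 22.0),
        (99.5, 99.6, 1.2, 22.0),
        (98.9, 99.0, 3.6, 33.0),
        (75.0, 85.0, 1.4, 43.0),
    ]
    for day, sample in enumerate(series, start=1):
        print(f"Day {day}: {diagnose(sample)}")
```

Scaling each KPI before computing distances keeps percentage KPIs and the CPU load on comparable ranges, which matters because the raw KPIs in Table X differ by several orders of magnitude.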
The first chosen cell is a problematic cell that was
manually reconfigured over time in order to improve its
performance. Therefore, for this study, it has been analysed
by the proposed diagnosis system during the days when the
troubleshooting tasks were being carried out. Fig. 6 shows
the evolution of the number of bad coverage reports (represented
by a continuous black line), since it is the most relevant KPI
in this situation. Furthermore, the diagnosis obtained with the
proposed system over the same period of time is superimposed.
In particular, the diagnosis automatically obtained during the
first three days determines that this cell is not operating. This
matches the troubleshooting records, which reveal that this cell
was inactive during this period of time. After the cell was
launched, its KPIs indicated that its performance was
degraded. In particular, the diagnosis varies between normal
TABLE X
MEAN AND STANDARD DEVIATION OF EACH KPI GIVEN EACH CLUSTER

Mean
KPI                                      C1        C2        C3        C4
Accessibility [%]                     76.04     99.07     99.78      0.12
Retainability [%]                     86.88     99.11     99.75         0
Failed RRC Connections Rate [%]       18.40      0.41      0.23         0
HOSR [%]                              83.24     97.77     96.88      6.82
Number of ping-pong HO                 4040     36.98     34.02      0.15
IRAT HO Rate [%]                       1.38      3.47      1.16      0.07
Number of bad coverage reports         1241      1352       198      0.35
Average RSSI [dBm]                  -107.73   -113.82   -115.86    -15.22
Average CPU Load                      42.32     32.69     21.66     22.65
Number of RRC connections             28867     13990      5467      0.04
Tilt [°]                               0.63      1.38      2.34      3.63

Standard deviation
KPI                                      C1        C2        C3        C4
Accessibility [%]                     25.79      7.01      0.34      3.47
Retainability [%]                     23.36      6.96      0.36         0
Failed RRC Connections Rate [%]       16.80      0.89      0.41         0
HOSR [%]                              16.79      8.62     12.65     25.18
Number of ping-pong HO                 4363     34.14    146.51      1.31
IRAT HO Rate [%]                       1.32      5.23      1.17      0.53
Number of bad coverage reports         1041    903.28    262.55      2.12
Average RSSI [dBm]                     4.39      2.27      2.72     39.68
Average CPU Load                      12.17      9.74      3.53      4.17
Number of RRC connections             17943      6459      4474      1.06
Tilt [°]                               1.13      1.67      2.40      3.48

TABLE XI
LABEL ASSIGNED TO EACH CLUSTER

Clusters     Label
Cluster 1    Overload
Cluster 2    Lack of Coverage
Cluster 3    Normal
Cluster 4    Non-operating cell
VI. CONCLUSIONS
An automatic diagnosis system that forms part of a Self-Healing
network has been proposed in this paper. The system is built
with unsupervised techniques so that it represents the normal
and faulty behaviour of the real network. The use of
unsupervised techniques guarantees that the system can be
built without historical reports of solved cases, while also
enabling it to identify new faults that were not previously
known. Even so, the
clusters derived from the proposed system are labelled by
Fig. 6. Number of bad coverage reports of the diagnosed cell along
with the obtained diagnosis. [Plot not reproduced: time axis in days
with ticks at 00:00 and 12:00; diagnosis levels Non-operating, Normal,
Lack of Coverage and Overload.]
Fig. 7. Average RSSI values of the diagnosed cell along with the obtained
diagnosis. [Plot not reproduced: average RSSI axis with ticks from -120
to -100; time axis in days; diagnosis levels Non-operating, Normal,
Lack of Coverage and Overload.]
Pablo Munoz received his M.Sc. and Ph.D. degrees in Telecommunication
Engineering from the University of Malaga (Spain) in 2008 and 2013,
respectively. He is currently working with the Communications
Engineering Department at the same university. Since September 2009,
he has been a Ph.D. Fellow, working on self-optimization of mobile
radio access networks and radio resource management.