
ECEN464: Network Node Fault Identification based on ML

Network Node Fault Identification based on ML

By

Abd-ElRahman Ahmed – 202000898

Doaa Muhammed Mostafa – 202000566

Ibrahim Hefny – 202002482

Sadeen Alaa – 2020009493

Ziad Elelimy– 202003033

Under Supervision of

Dr. Mohamed Saeed

Eng. Shahd Gmal

Submitted in partial fulfillment of the requirements for ECEN-464 course project.


Table of Contents
Abstract
Introduction
Literature review
1. Network Fault Localization
Rule-based techniques
Case-based techniques
Probability-based techniques
Model-based techniques
2. Types of Faults in WSN
Based on their behavior
Time-Based faults
Based on components
3. The Main Aspects of Faults Management Structure in WSNs
Error Detection
Error Diagnosis
Error Recovery
4. Machine Learning for Network Fault Management
Depend on ML in Anomaly Detection
Depend on ML in Detecting Location-Based Faults
Methodology
Results
Conclusion
References
Appendix


Abstract

The increasing connectivity in modern networks introduces significant challenges in managing the vast

volume of data and the potential for link failures, which can disrupt services if not promptly addressed.

Traditional manual fault recovery techniques are often slow and inefficient. This paper presents ML-LFIL,

a machine learning-based method for fault localization and identification, leveraging traffic engineering

principles. ML-LFIL operates in three stages: link fault detection, differentiation between disconnections

and reconnections, and fault location determination. By utilizing Multi-Layer Perceptron neural networks,

Random Forests, and Support Vector Machines, ML-LFIL analyzes traffic metrics gathered through passive

monitoring, thereby avoiding the drawbacks associated with active probing. Extensive experiments across

various network topologies demonstrate that ML-LFIL achieves rapid and accurate fault detection and

localization, significantly enhancing fault management in complex networks.

Keywords: link fault detection; fault localization; machine learning; fault recovery techniques; fault

location determination.


Introduction

The growth in connectivity has made network management more difficult, particularly because of
the sheer volume of data and the possibility of link failures. If not fixed immediately, these faults,
whether they result from disconnections or reconnections, can cause service disruptions. Manual
fault recovery techniques that rely on tools such as ping and traceroute are available, but they are
time-consuming. Effective fault management systems with quick diagnosis and recovery times are
therefore vital. The accuracy of fault detection depends on the algorithms employed and on the
quality of the network data, which can be gathered actively or passively. Although it is frequently
employed, active probing increases network traffic and latency because measurement points in the
network must exchange control packets. By comparison, passive monitoring finds faults without
adding overhead by examining traffic metrics that are already available. By examining network
traffic attributes, machine learning, and deep learning in particular, offers a chance to address
these issues.

Machine Learning (ML) is especially suitable for managing complex systems. It enables

advanced data analysis and can help achieve network goals, including root-cause analysis and

failure localization, by learning from past data and predicting future responses. Recent

advancements in computational hardware, parallel computing, big data storage, and processing

frameworks, alongside the introduction of Software-defined Networking (SDN) and Network

Function Virtualization (NFV) platforms, have facilitated the application of ML to various networking

challenges.

In this work, we suggest a traffic engineering-based machine learning method for fault
localization and identification. Our method, ML-LFIL, consists of three steps: link fault detection,
distinction between disconnections and reconnections, and fault location determination. We train
our model using Multi-Layer Perceptron neural networks, Random Forests, and Support Vector
Machines, which have shown efficacy in classification and regression tasks. Crucially, ML-LFIL
enables rapid fault identification and localization, even in large networks, by learning from real-time
data points. We validate our approach through extensive experiments on various network topologies,
demonstrating its effectiveness in improving fault management.

To put it briefly, we have developed a machine learning model to understand traffic behavior

and link faults, adopted a passive monitoring approach, and extensively experimented to

demonstrate the effectiveness of our method. We present performance evaluations, describe our

methodology, summarize related work, and offer suggestions for further research.


Literature review

Link failures can cause a link to detach and then reattach without any prompt replacement. For
example, when a wireless node switches access points, it becomes difficult to determine whether a link
has failed or been reconnected, as well as to pinpoint the location of the failed link. An active probing
strategy results in high communication overhead and latency, since investigating the network by sending
signaling messages over various paths takes a long time. Furthermore, a wireless sensor network (WSN)
is made up of several detection stations, also known as nodes, that collaboratively perform a variety of
tasks, including sensing, communicating, and computing. WSNs are widely used in fields such as
medicine, the military, intelligent security systems, and many more. However, they suffer from
considerable dependability and fault tolerance problems. Many studies are being conducted to improve
the fault tolerance of WSNs so that they can be used efficiently in critical applications. In this project,
we use machine learning-based link fault identification and localization in networks.

1. Network Fault Localization

Many authors have presented techniques for localizing link faults. These techniques are broadly
categorized into four types: rule-based, case-based, probability-based, and model-based techniques.

Rule-based techniques:
These techniques depend on a knowledge base developed by system experts, which is
effectively a series of if-then statements, or the system's rules. However, rule-based systems
cannot learn adaptively, either from past experience or from network dynamics observed in
previously unseen traffic behavior [2].
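
As an illustration of the if-then structure of a rule base, the following minimal Python sketch evaluates a small set of rules against one set of link measurements. The metric names, thresholds, and diagnoses are hypothetical and are not taken from [2].

```python
# Hypothetical rule base: each rule is a (condition, diagnosis) pair.
# Metric names and thresholds are illustrative assumptions only.
RULES = [
    (lambda m: m["operational_status"] == "down", "link down"),
    (lambda m: m["packet_loss"] > 0.5, "severe packet loss"),
    (lambda m: m["latency_ms"] > 500, "excessive latency"),
]


def diagnose(metrics):
    """Return the diagnosis of every rule whose condition holds."""
    return [diagnosis for condition, diagnosis in RULES if condition(metrics)]


sample = {"operational_status": "up", "packet_loss": 0.7, "latency_ms": 120}
print(diagnose(sample))
```

Because the rules are fixed in advance, a fault pattern that matches none of them goes unreported, which is the adaptivity limitation of rule-based systems.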


Case-based techniques:
Fault diagnosis by case-based techniques depends mainly on expert knowledge and on
experience obtained from past cases [2].

Probability-based techniques:
These techniques diagnose faults based on likelihood. The probability mass functions
associated with the links indicate where the link faults are located in the network [2].

Model-based techniques:
These techniques build a mathematical model from a knowledge base to describe network
behavior. The traffic patterns anticipated by the model are compared with the recently observed
network traffic. Network faults are identified when the observed behavior diverges from the
predicted behavior. Therefore, to diagnose link problems effectively, model-based approaches
need precise information about the links in the network [2].
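
The comparison between predicted and observed traffic can be sketched as follows. This is a minimal illustration, assuming the model's prediction is already available as a list of values; the function name, the relative tolerance of 20%, and the sample throughput figures are hypothetical.

```python
def detect_divergence(predicted, observed, tolerance=0.2):
    """Return the indices where observed traffic deviates from the model's
    prediction by more than `tolerance` (a fraction of the prediction)."""
    faults = []
    for i, (p, o) in enumerate(zip(predicted, observed)):
        if p == 0:
            diverged = o != 0  # any traffic is unexpected when none is predicted
        else:
            diverged = abs(o - p) / abs(p) > tolerance
        if diverged:
            faults.append(i)
    return faults


# Hypothetical throughput values in Mbit/s for four measurement intervals.
predicted_mbps = [100.0, 100.0, 100.0, 100.0]
observed_mbps = [98.0, 105.0, 40.0, 101.0]
print(detect_divergence(predicted_mbps, observed_mbps))
```

Only the third interval deviates by more than the tolerance, so it is the only one flagged. The quality of such a check depends entirely on how precise the underlying model is.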


2. Types of Faults in WSN

A wireless sensor network (WSN) is a new type of information acquisition and processing network.

It consists of a large number of low power sensor nodes. The sensor nodes communicate through a wireless

network. WSN has been widely used in mechanical parameter detection, industrial monitoring, mine safety,

medical and health, environmental monitoring and other industries [4].

Figure 1: Classification of Faults in WSN


Faults in WSN can also be classified on the basis of behavior, time, component

and location:

Based on their behavior, faults can be classified as:

a. Hard Faults: Sensor nodes (SNs) are unable to communicate among themselves due to failure
in certain modules, for example a black hole caused by energy depletion.

b. Soft Faults: The SN can continue to work despite the failure, but it senses, processes, or
transmits faulty data.

Time-Based faults can be classified as:

a. Transient Faults: Faults occur because of environmental conditions such as temperature,

humidity, cosmic rays, vibrations. These types of errors usually occur once and then disappear

and thus are difficult to manage.

b. Intermittent Faults: These faults do not occur continuously; they appear and vanish
repeatedly. Examples include loose connections and obsolete or aged components.

c. Permanent Faults: These faults include built-in defects such as chip manufacturing faults and
burned-out electronic components. The effects of permanent faults remain until the faulty
components are removed from the circuit.

d. Potential Faults: These occur due to the depletion of hardware resources, which ultimately
reduces the network lifespan. The most common is energy depletion, which limits the lifetime
of the node. SNs require energy for various operations such as data sensing, data collection,
communication, and processing. Thus, the battery must be recharged or replaced once it is
depleted.
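
The time-based categories can be told apart, to a first approximation, by how a fault shows up across successive observations. The sketch below is a simplified illustration under the assumption that each observation interval is reduced to a single flag (True when a fault was seen); the function name and this encoding are hypothetical, and potential faults are not covered because they concern resource depletion rather than an occurrence pattern.

```python
def classify_fault_pattern(fault_flags):
    """Classify a sequence of per-interval fault observations (True = fault seen)."""
    # Count contiguous runs of True values.
    runs = 0
    previous = False
    for flag in fault_flags:
        if flag and not previous:
            runs += 1
        previous = flag

    if runs == 0:
        return "none"
    if runs > 1:
        return "intermittent"  # the fault appears and vanishes repeatedly
    # Exactly one run: permanent if it lasts until the final observation.
    return "permanent" if fault_flags[-1] else "transient"


print(classify_fault_pattern([False, True, False, False]))
print(classify_fault_pattern([False, True, False, True, False]))
print(classify_fault_pattern([False, False, True, True]))
```

A single run that is still active at the last observation is labeled permanent here only because no recovery has been seen yet; a longer observation window could change that label.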

Based on components of WSN:


An unexpected change in, or departure of, one or more system characteristics from the
normal, acceptable, or standard condition is referred to as a defect or fault. Three types of
failures occur in WSNs: node, network, and software failures (see Figure 2) [3].

Figure 2: types of failures based on components

a. Node failure: it occurs because large numbers of nodes are deployed in harsh and/or
inaccessible outdoor environments, where they can easily be damaged or destroyed.
Furthermore, each node in a network has a finite amount of energy that can run out. Sensor
failures and low power readings could also be the cause of failures [3, 4].

b. Network failure: the SNs collect information and transmit the data toward the sink through
communication links, and routing plays an important role in this. Communication links and
the routing layer are therefore another cause of faults in WSNs. Path faults, radio interference,
and temporary or permanent blocks in paths may cause errors in SN communication.
Congestion also arises when an enormous number of SNs transmit data simultaneously upon
the occurrence of interesting events, which can lead to packet loss. Thus, software should be
programmed so that the applied algorithms reduce congestion [3, 4].


c. Software faults: it includes issues brought on by software defects and crashes in the operating

system's processes. While WSNs are frequently impacted by this kind of failure, the likelihood

of it happening is low in comparison to other failures. Understanding a general fault model and

diagnosis methodology is necessary before delving deeper into the fault diagnosis idea [3].


3. The Main Aspects of Faults Management Structure in WSNs

The fault management structure in WSNs consists of three stages: error detection, diagnosis,

and recovery as shown in Figure 3. The following subsections describe the three phases of the fault

management framework.

Figure 3: General steps for fault tolerance structure in WSN.

Error Detection:

Error or fault detection refers to identifying any unexpected failure or damaging forces that

affect a network’s or node’s optimum condition. Based on their performance, fault detection

methods are divided into three categories: centralized, self-supervision, and decentralized [5].

Error Diagnosis:

To apply the fault-tolerance concept correctly, the type of fault and the faulty nodes must be
identified. It is important to identify the cause, kind, and effects of failures on the health of the
network [50]. One well-known method uses particular reference nodes at specified geographic
locations in the network to help other nodes determine their location. To find and investigate
network issues, it is necessary to monitor the WSN. Four types of monitoring exist: proactive,
reactive, passive, and active [5].

Error Recovery:

The fundamental definition of "recovery" for WSNs is the reconstruction or restoration of the

network to prevent damaged nodes from impairing its optimal functioning. The process of

substituting an ideal state for a malfunctioning one is known as recovery. Depending on the defect,

two fault recovery techniques, forward recovery and backward recovery, may be applied [5].


4. Machine Learning for Network Fault Management

Machine learning (ML) for network node fault identification involves the use of algorithms and

techniques to automatically detect, classify, and predict faults or anomalies in network nodes. This

approach leverages historical data, network telemetry, and various features extracted from network

traffic to build models that can identify abnormal behavior indicative of faults or failures. Recently,

several works have proposed using machine learning techniques for network fault management [1, 3, 4].

There are several applications of ML related to fault management in optical networks. The affected

traffic could be restored by starting a restoration procedure, but it would be preferable to foresee these

degradations and identify the root cause of the (soft) failure so that the lightpath can be rerouted before

it is interrupted. It should be noted that failure localization is necessary in order to schedule maintenance

tasks and to exclude the failed resources from path computation. Proactive failure detection would also

provide planners more time to schedule the rerouting process, for example, during off-peak hours [1].

Depend on ML in Anomaly Detection


Sensor faults such as erratic, hard-over, spike, drift, and stuck faults have been categorized
using Support Vector Machines (SVM). In [20], the SVM is trained using various kernel functions.
Increasing the number of features and the amount of training data improved the classifier's
effectiveness, but further increases led to overfitting. Cross-validation strategies are employed to
address the overfitting problem. Moreover, expanding the input sample size improves the
classifier's accuracy.


Depend on ML in Detecting Location-Based Faults

This section is divided into two parts. The first covers detecting data-centric faults and the second
covers detecting system-centric faults in WSNs.

Detecting data centric faults:

Sensor faults such as erratic, hard-over, spike, drift, and stuck faults have been categorized
using Support Vector Machines (SVM) trained with various kernel functions. Increasing the number
of features and the amount of training data improved the classifier's effectiveness, but further
increases led to overfitting. Cross-validation strategies are employed to address the overfitting
problem. Moreover, expanding the input sample size improves the classifier's accuracy [4].
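
Cross-validation can be sketched as follows. This is a minimal pure-Python illustration of k-fold splitting, assuming an unshuffled split into contiguous folds; the helper name `k_fold_indices` is hypothetical, the source does not state which cross-validation variant was used in [4], and libraries such as scikit-learn provide equivalent utilities.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold.

    Samples are split in order; shuffle the data beforehand if order matters.
    """
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1  # spread the remainder over the first folds

    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=7, n_folds=3):
    print(train, validation)
```

Each sample is used for validation exactly once, so the averaged validation score is less sensitive to one lucky or unlucky split than a single train/test partition.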

Detecting system centric faults:

The paper [4] focuses on communication link failure in WSNs because, even when all the
nodes are operating as intended, decisions may still be affected by communication link issues.
Link faults may split the network into separate sets of nodes, which defeats the purpose and
general functioning of the WSN. The automatic link failure detection system is built on a
Feedforward Neural Network (FFNN), which adapts and learns through gradient descent. Packet
Delivery Ratio (PDR) and latency are used to evaluate the quality of the network. When latency is
extremely high and PDR is extremely low, a link is deemed to have failed. The neural network uses
these parameters as its input features. Testbed experiments conducted both indoors and outdoors
verify the suggested methodology.
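
The two link-quality parameters and the failure criterion can be illustrated with a short sketch. It is not the FFNN from [4]: it only shows how PDR is computed and how a fixed-threshold rule could produce the failed/healthy labels that such a network learns to predict. The threshold values and function names are hypothetical.

```python
def packet_delivery_ratio(packets_sent, packets_received):
    """PDR = packets received / packets sent (0.0 if nothing was sent)."""
    if packets_sent == 0:
        return 0.0
    return packets_received / packets_sent


def link_failed(pdr, latency_ms, pdr_min=0.2, latency_max_ms=1000.0):
    """Label a link as failed when PDR is very low AND latency is very high.

    Both thresholds are hypothetical placeholders, not values from [4].
    """
    return pdr < pdr_min and latency_ms > latency_max_ms


pdr = packet_delivery_ratio(packets_sent=200, packets_received=20)
print(pdr, link_failed(pdr, latency_ms=1500.0))
```

A learned model replaces the hand-picked thresholds with a decision boundary fitted to measured (PDR, latency) pairs.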


Methodology
Our methodology focuses on the application of Machine Learning techniques, specifically the K-

Nearest Neighbors (KNN) classifier, to identify faults in an IP network using the SOFI dataset.

This dataset includes network performance data collected over 649 hours, with faults induced

during 10 of those hours. Here is a breakdown of our methodology:

Data Preparation

The dataset consists of two separate CSV files, presumably representing data from two different

core switches in the network:

SOFI CoreSwitch-I.csv

SOFI CoreSwitch-II.csv

The class labels in these datasets, representing healthy (NE) and faulty (F) network states, are

converted from categorical to binary format (F=0, NE=1) for the application of ML algorithms.

data1['class'] = data1['class'].replace({'F':0,'NE':1})

data2['class'] = data2['class'].replace({'F':0,'NE':1})

Feature Selection and Model Training

The dataset includes 34 attributes from which we exclude the class label for input features.

Different models are trained with variations in feature subsets to evaluate their impact on model

performance.

Initial Model: A KNN classifier is trained using all features except the class label.


Exclusion of Temporal Features: To evaluate the impact of temporal data (timestamp, range), these

are excluded in another model.

Exclusion of ICMP and Packet-related Features: We train additional models excluding features

that might relate more to network performance variations than faults, such as ICMP ping responses

and packet error rates.

Further Reduced Feature Set: Further models test the exclusion of an increasingly larger set of

features deemed less relevant or redundant, based on prior analysis or domain knowledge.

Each model is tested for its accuracy:

model.score(testInputs.__array__(), testResults.__array__())
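
For a scikit-learn classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below computes that fraction by hand and, as a point of comparison, the accuracy of a trivial classifier that always predicts the healthy class. The 10-in-649 fault ratio comes from the dataset description; treating it as the per-sample class ratio (one sample per hour) is an assumption, since the number of samples per hour is not stated here.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Labels follow the encoding used in this report: F=0 (faulty), NE=1 (healthy).
# Assumption: one sample per hour, with 10 faulty hours out of 649.
actual = [0] * 10 + [1] * 639
always_healthy = [1] * len(actual)

baseline = accuracy(always_healthy, actual)
print(round(baseline, 4))
```

Under that assumption the always-healthy baseline is about 0.9846, so accuracy figures for this dataset are best read against that value, and per-class measures such as recall on the faulty class may be informative alongside accuracy.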

Hyperparameter Tuning

For the KNN algorithm, the choice of k (number of neighbors) is crucial. We iterate k from 1 to

29 to find the optimal k that maximizes the accuracy of the classifier on the test dataset.

from sklearn.neighbors import KNeighborsClassifier

results = []  # collects [k, accuracy] pairs
for i in range(1, 30):  # k = 1 .. 29
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(trainInputs, trainResults)
    s = model.score(testInputs, testResults)
    results.append([i, s])
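
To make the role of k concrete, the following is a minimal pure-Python sketch of KNN classification with a sweep over k. It is a stand-in for illustration and not the scikit-learn implementation used in this project; the toy points, the Euclidean distance, and the choice of k values are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Predict the label of `point` by majority vote among its k nearest
    training points (Euclidean distance)."""
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], point)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Toy data: label 1 = healthy (NE), label 0 = faulty (F).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (0.15, 0.15)]
train_y = [1, 1, 1, 0, 0, 0]  # the last point carries a deliberately noisy label
test_x = [(0.12, 0.12), (0.95, 1.0)]
test_y = [1, 0]

results = [[k, knn_accuracy(train_x, train_y, test_x, test_y, k)]
           for k in (1, 3, 5)]
best_k, best_score = max(results, key=lambda r: r[1])
print(results, best_k)
```

With k=1 the first test point is misled by the single noisy neighbor, while k=3 and k=5 outvote it. Larger k smooths over noise in this way, but too large a k lets the majority class dominate, which is why k is tuned rather than fixed.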

Feature Importance Evaluation

To determine the influence of each feature on the network's health classification, we isolate each

feature alongside temporal data, train a model using these features, and record the accuracy.


results = {}  # maps feature name -> accuracy
for col in allData.keys():
    if col in ('class', 'timestamp'):
        continue  # skip the label and the shared temporal column
    trainInputs = data1[[col, 'timestamp']]
    testInputs = data2[[col, 'timestamp']]
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(trainInputs, trainResults)
    results[col] = model.score(testInputs, testResults)

Sorted results help identify the most and least predictive features.
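
The sorting step can be sketched as follows, assuming the per-feature accuracies are held in a dictionary keyed by feature name. The feature names are taken from the dataset description, but the scores are made-up placeholders for illustration and are not results of this project.

```python
# Placeholder per-feature accuracies (NOT measured values).
results = {
    "P_Bits_received": 0.91,
    "Device_uptime": 0.97,
    "EX_Speed": 0.88,
}

# Sort (feature, accuracy) pairs from highest to lowest accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
most_predictive = ranked[0][0]
least_predictive = ranked[-1][0]

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Because each score is obtained from one feature plus the timestamp, the ranking reflects the predictive value of features taken individually and may miss features that are useful only in combination.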


Description for the Dataset Features:

EX_Inbound_packets_discarded: Tracks the number of incoming packets discarded on the

extended network interface. This parameter highlights potential problems like buffer overflow,

hardware limitations, or configuration errors on the secondary interface, which could lead to data

loss and impact network performance.

P_Inbound_packets_with_errors: Monitors the number of incoming packets on the primary

network interface that contain errors. Errors in inbound packets can result from issues such as

signal degradation, interference, or faulty hardware. This metric is essential for diagnosing data

integrity problems and ensuring reliable network communication.

EX_Inbound_packets_with_errors: Measures the number of incoming packets on the extended

network interface that contain errors. This parameter is crucial for identifying data integrity issues

on the secondary interface, which can affect the accuracy and reliability of the received data,

leading to potential communication failures.

P_Bits_received: Quantifies the total number of bits received by the primary network interface.

This metric provides insight into the volume of incoming data, helping to assess the network's

capacity to handle traffic and identifying potential bottlenecks or bandwidth limitations.

EX_Bits_received: Indicates the total number of bits received by the extended network interface.

This parameter reflects the incoming data volume on the secondary interface, providing valuable

information about the network's ability to manage and distribute traffic efficiently across multiple

interfaces.


P_Outbound_packets_discarded: Tracks the number of outgoing packets on the primary network

interface that are discarded. Packet discards can occur due to network congestion, buffer overflow,

or misconfiguration. Monitoring this metric helps in identifying and resolving issues that impede

the network's ability to transmit data effectively.

EX_Outbound_packets_discarded: Counts the number of outgoing packets discarded on the

extended network interface. This parameter indicates potential problems such as network

congestion, hardware limitations, or misconfiguration on the secondary interface, which can affect

the smooth flow of outbound traffic.

P_Outbound_packets_with_errors: Monitors the number of outgoing packets on the primary

network interface that contain errors. Errors in outbound packets can result from hardware faults,

signal interference, or network misconfigurations, impacting data integrity and communication

reliability.

EX_Outbound_packets_with_errors: Measures the number of outgoing packets on the extended

network interface that contain errors. This metric is essential for identifying issues affecting the

accuracy and reliability of outgoing data on the secondary interface, which can lead to

communication failures.

P_Bits_sent: Quantifies the total number of bits sent by the primary network interface. This

parameter reflects the volume of outgoing data, providing insight into the network's capacity to

handle outbound traffic and identifying potential issues related to bandwidth or data transfer rates.


EX_Bits_sent: Indicates the total number of bits sent by the extended network interface. This

metric provides valuable information about the volume of outgoing data on the secondary

interface, helping to assess the network's ability to manage and distribute traffic efficiently.

P_Speed: Measures the operational speed of the primary network interface, expressed in bits per

second (bps). This parameter indicates the data transfer rate capability of the primary interface,

essential for evaluating the network's performance and capacity to handle high-speed data

transmission.

IN_Speed: Measures the speed of the inbound connection, reflecting the data transfer rate

capability for incoming traffic. This metric is crucial for assessing the network's ability to handle

and process incoming data efficiently, ensuring optimal performance and minimal delays.

EX_Speed: Measures the operational speed of the extended network interface, expressed in bits

per second (bps). This parameter indicates the data transfer rate capability of the secondary

interface, essential for evaluating the network's overall performance and ability to manage high-

speed data transmission across multiple interfaces.

P_Operational_status: Indicates the current operational state of the primary network interface,

showing whether the interface is active and functioning properly. This parameter helps in

monitoring the availability and reliability of the primary interface, ensuring continuous network

operation.

EX_Operational_status: Reflects the current operational state of the extended network interface,

indicating whether this secondary interface is active and functioning properly. This metric is

important for ensuring the availability and reliability of the secondary interface, supporting overall

network stability.


P_Interface_type: Identifies the type of the primary network interface, such as Ethernet, Wi-Fi, or

fiber optic. This parameter provides context about the physical or logical connection type, helping

in understanding the network's architecture and the capabilities of the primary interface.

IN_Interface_type: Identifies the type of interface used for inbound traffic, providing information

about the physical or logical connection for incoming data. This parameter is crucial for

understanding the network's architecture and the characteristics of the inbound data flow.

EX_Interface_type: Specifies the type of the extended network interface, such as Ethernet, Wi-Fi,

or fiber optic. This parameter provides context about the secondary connection type, helping in

understanding the network's architecture and the capabilities of the extended interface.

Device_uptime: Measures the total time the network device has been operational since its last

restart. This metric is indicative of the device's reliability and stability, helping in identifying

potential issues related to device performance and the need for maintenance or troubleshooting.

SNMP_Availability: Indicates the accessibility of the network device via SNMP. This parameter

is crucial for network management and monitoring, allowing administrators to collect and analyze

network performance data, configure devices, and troubleshoot issues remotely.

IN_Inbound_packets_discarded: Counts the number of incoming packets on the inbound interface

that are discarded. Discards can occur due to buffer overflow, misconfiguration, or network

congestion. This metric helps in diagnosing potential issues affecting the network's ability to

handle incoming traffic efficiently and maintaining data integrity.

IN_Inbound_packets_with_errors: Measures the number of incoming packets on the inbound

interface that contain errors. Errors in inbound packets can result from issues such as signal


degradation, interference, or faulty hardware. This metric is essential for diagnosing data integrity

problems and ensuring reliable network communication.

IN_Bits_received: Quantifies the total number of bits received by the inbound interface. This

metric provides insight into the volume of incoming data, helping to assess the network's capacity

to handle traffic and identifying potential bottlenecks or bandwidth limitations.

IN_Outbound_packets_discarded: Tracks the number of outgoing packets on the inbound interface

that are discarded. Packet discards can occur due to network congestion, buffer overflow, or

misconfiguration. Monitoring this metric helps in identifying and resolving issues that impede the

network's ability to transmit data effectively.

IN_Outbound_packets_with_errors: Monitors the number of outgoing packets on the inbound

interface that contain errors. Errors in outbound packets can result from hardware faults, signal

interference, or network misconfigurations, impacting data integrity and communication

reliability.

IN_Bits_sent: Quantifies the total number of bits sent by the inbound interface. This parameter

reflects the volume of outgoing data, providing insight into the network's capacity to handle

outbound traffic and identifying potential issues related to bandwidth or data transfer rates.

IN_Operational_status: Indicates the current operational state of the inbound interface, showing

whether the interface is active and functioning properly. This parameter helps in monitoring the

availability and reliability of the inbound interface, ensuring continuous network operation.

Class: Indicates the health status of the network, with 'F' representing a faulty network and 'NE'

representing a healthy network. This parameter is crucial for categorizing network performance

and diagnosing issues, enabling targeted troubleshooting and maintenance efforts.

Range: Provides a categorization or classification range for network performance or fault

detection. This parameter offers additional context for the dataset's attributes, helping in the

analysis and interpretation of network performance data.
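
The Class labels have to be converted to numbers before a classifier can use them. A minimal sketch of that step, assuming the labels arrive as plain strings; the helper name encode_labels and the sample values are illustrative and not taken from the dataset:

```python
from collections import Counter

# Mapping used in this report: 'F' = faulty, 'NE' = healthy.
LABEL_CODES = {"F": 0, "NE": 1}


def encode_labels(labels):
    """Map 'F'/'NE' strings to 0/1, failing loudly on anything else."""
    unknown = set(labels) - set(LABEL_CODES)
    if unknown:
        raise ValueError(f"unexpected class labels: {sorted(unknown)}")
    return [LABEL_CODES[label] for label in labels]


# Hypothetical sample, for illustration only.
sample = ["NE", "NE", "F", "NE"]
encoded = encode_labels(sample)
counts = Counter(encoded)   # how many rows fall into each class
print(encoded)
print(counts)
```

Raising on unknown labels is a design choice: a silent pass-through would leave string values in a column the classifier expects to be numeric.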

Results
Network node fault identification was evaluated through a series of experiments with a K-Nearest Neighbors (KNN) classifier. In each experiment the model was trained on data from SOFI CoreSwitch-I and tested on data from SOFI CoreSwitch-II.

Initial Model Evaluation:

The initial model, with k=5, was trained on the full feature set from the first dataset and tested on

the second dataset, yielding an accuracy score of 0.9859.
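
The train-on-one-switch, test-on-the-other setup can be sketched as follows. The data here is synthetic (two well-separated Gaussian clusters stand in for the CoreSwitch-I and CoreSwitch-II exports), so the printed score says nothing about the real dataset; make_switch_data is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def make_switch_data(rng, n_per_class=100, n_features=3):
    """Synthetic stand-in for one switch: class 0 near 0, class 1 near 5."""
    faulty = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
    healthy = rng.normal(loc=5.0, scale=1.0, size=(n_per_class, n_features))
    X = np.vstack([faulty, healthy])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_switch_data(rng)   # plays the role of CoreSwitch-I
X_test, y_test = make_switch_data(rng)     # plays the role of CoreSwitch-II

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)     # fraction of correct predictions
print(f"accuracy: {accuracy:.4f}")
```

Because the test rows come from a different device than the training rows, the score reflects how well the model transfers between switches rather than how well it memorizes one of them.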

Model Performance Across Different k Values:

To determine the optimal number of neighbors (k), the model was evaluated for k ranging from 1 to 29. The scores for k = 1 to 14 are listed below:

k     Score
1     0.9875
2     0.9833
3     0.9907
4     0.9897
5     0.9902
6     0.9898
7     0.9896
8     0.9894
9     0.9888
10    0.9888
11    0.9874
12    0.9873
13    0.9865
14    0.9862

The results indicated that the highest accuracy of 0.9907 was achieved with k=3, suggesting that

this is the most suitable k for our dataset.
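
Picking the best k from the sweep is a single maximum over the recorded scores. The sketch below uses the values listed in the table above; the variable names are illustrative.

```python
# (k, accuracy) pairs as listed in the table above.
scores = [
    (1, 0.9875), (2, 0.9833), (3, 0.9907), (4, 0.9897), (5, 0.9902),
    (6, 0.9898), (7, 0.9896), (8, 0.9894), (9, 0.9888), (10, 0.9888),
    (11, 0.9874), (12, 0.9873), (13, 0.9865), (14, 0.9862),
]

# max() returns the first pair with the highest score, so ties go to the smaller k.
best_k, best_score = max(scores, key=lambda pair: pair[1])
print(best_k, best_score)  # 3 0.9907
```

The gap between k=3 and k=5 is only 0.0005, so the ranking of neighboring k values may not hold on a different test set.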

Feature Reduction Analysis:

To assess the impact of feature reduction on model performance, several iterations of the model

were trained and tested with different sets of features removed:

Model 2: Dropped features timestamp and range. This model configuration tested the influence of

temporal and range-based features and achieved an accuracy of 0.9863.

Model 3: Excluded ICMP_ping, ICMP_loss, ICMP_response_time, and P_Inbound_packets_discarded, focusing on the non-ICMP metrics. This configuration resulted in an accuracy of 0.9902.

Model 4: Further reduced the feature set by additionally removing

P_Inbound_packets_with_errors, P_Bits_received, EX_Inbound_packets_discarded,

EX_Inbound_packets_with_errors, Device_uptime, and IN_Inbound_packets_discarded. This

extensive reduction aimed to isolate the most critical features for fault identification and achieved

the highest accuracy of 0.9908.
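
Each reduced model follows the same pattern: drop a list of columns from both datasets, refit, and score. A sketch of that pattern as a reusable helper, run on a small synthetic frame; score_without and the column names in the toy data are hypothetical. The helper always removes the label column from the inputs so that it cannot be used as a feature.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def score_without(train, test, dropped, label="class", k=5):
    """Fit KNN on `train` minus the `dropped` columns and return test accuracy."""
    excluded = list(dropped) + [label]   # the label is never a feature
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train.drop(columns=excluded), train[label])
    return model.score(test.drop(columns=excluded), test[label])


def toy_frame(rng, n=200):
    """Synthetic data: 'signal' tracks the label, 'noise' does not."""
    label = rng.integers(0, 2, size=n)
    return pd.DataFrame({
        "signal": label * 5.0 + rng.normal(size=n),
        "noise": rng.normal(size=n),
        "class": label,
    })


rng = np.random.default_rng(0)
train, test = toy_frame(rng), toy_frame(rng)

full = score_without(train, test, dropped=[])
reduced = score_without(train, test, dropped=["noise"])
print(f"all features: {full:.4f}, without 'noise': {reduced:.4f}")
```

Keeping the dropped-column list in one place also guarantees that the training and test frames are reduced identically.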

Feature Importance Evaluation:

The feature importance evaluation aimed to identify which network metrics most

significantly impacted the accuracy of the KNN classifier in detecting network node faults. This

analysis was conducted by training and testing the model with each feature individually, alongside

the timestamp, and measuring the resulting accuracy. The features were then sorted based on their

contribution to the model's performance.
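
The per-feature procedure described above can be sketched as a loop that fits one small model per feature and then ranks the features by score. The frame below is synthetic and its column names (bits, errors) are hypothetical; the timestamp is assumed to be scaled to the range 0 to 1 for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def toy_frame(rng, n=200):
    """Synthetic stand-in for one switch export (hypothetical columns)."""
    label = rng.integers(0, 2, size=n)
    return pd.DataFrame({
        "timestamp": np.linspace(0.0, 1.0, n),
        "bits": label * 5.0 + rng.normal(size=n),   # informative
        "errors": rng.normal(size=n),               # uninformative
        "class": label,
    })


rng = np.random.default_rng(1)
train, test = toy_frame(rng), toy_frame(rng)

per_feature = {}
for col in train.columns.drop(["class", "timestamp"]):
    inputs = [col, "timestamp"]                     # one feature plus timestamp
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(train[inputs], train["class"])
    per_feature[col] = model.score(test[inputs], test["class"])

# Rank features from most to least accurate.
ranking = sorted(per_feature.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Since the timestamp is part of every input pair, its numeric scale relative to the feature under test affects the distances KNN computes, and therefore the ranking.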

Key Findings from Feature Importance Evaluation:

The evaluation revealed that several features consistently contributed to high model accuracy, while some had slightly less impact. Here are the details:

High-Impact Features:

- P_Bits_received, EX_Bits_sent, IN_Bits_received:

Each of these features achieved an accuracy score of 0.9849. This suggests that the amount of data

being received and sent across different interfaces (primary, external, internal) is crucial for

identifying network faults.

- EX_Bits_received, IN_Bits_sent:

These features showed a slightly higher accuracy score of 0.9853, indicating that the volume of

data traffic on external and internal interfaces is highly indicative of network health and potential

faults.

Moderate-Impact Features:

Features such as range, ICMP_ping, ICMP_loss, ICMP_response_time, and P_Inbound_packets_discarded scored around 0.9849 as well. This indicates that range-based metrics and ICMP-related measurements (ping, loss, and response time) are about as informative for fault detection as the throughput features above.

Other Relevant Features:

- P_Inbound_packets_with_errors, EX_Inbound_packets_discarded, EX_Inbound_packets_with_errors:

These metrics, with scores around 0.9849, emphasize the significance of monitoring errors and

discarded packets in both primary and external inbound traffic.

- Device_uptime, P_Speed, IN_Speed, EX_Speed:

Operational metrics such as device uptime and interface speeds (primary, internal, external) also

played a significant role, suggesting that continuous performance and speed metrics are vital for

fault identification.

Implications for Model Refinement

The feature importance evaluation provides several actionable insights:

Prioritizing Critical Features:

Given the high impact of data traffic and error metrics, future models should prioritize

these features. This can help in creating more efficient and accurate models by focusing on the

most informative metrics.

Potential for Feature Reduction:

The relatively close performance scores across different features suggest a degree of

redundancy. Models could potentially achieve similar accuracy with a reduced set of features,

leading to simpler and faster models without significant loss of accuracy.

Balanced Feature Set:

A balanced inclusion of both throughput (bits sent/received) and error-related metrics

ensures that the model can capture a comprehensive picture of network health.
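
Throughput counters (bits) and error counters can differ by many orders of magnitude, and the Euclidean distance used by KNN is dominated by whichever feature has the largest numeric range. One common way to balance such features is to standardize them before fitting. This was not part of the experiments in this report; the sketch below only illustrates the effect on synthetic data, with hypothetical feature roles.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_data(rng, n=200):
    """Column 0: large-scale noise. Column 1: small-scale signal."""
    y = rng.integers(0, 2, size=n)
    big_noise = rng.normal(scale=1e6, size=n)
    small_signal = y + rng.normal(scale=0.1, size=n)
    return np.column_stack([big_noise, small_signal]), y


rng = np.random.default_rng(0)
X_train, y_train = make_data(rng)
X_test, y_test = make_data(rng)

# Same classifier twice: once on raw values, once behind a standardizing step.
raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

raw_score = raw.score(X_test, y_test)
scaled_score = scaled.score(X_test, y_test)
print(f"raw: {raw_score:.3f}, standardized: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training switch only and then applied unchanged to the test switch.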

Continuous Monitoring Metrics:

Features related to ongoing operations like device uptime and speed are crucial. Continuous

monitoring of these metrics allows for real-time fault detection and proactive network

management.

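The full list of per-feature scores showed that the network metrics had a very similar impact on the model's performance, indicating a high level of feature redundancy. Because every single-feature model was trained alongside the timestamp, part of this similarity may also stem from the shared timestamp column. The scores of the five throughput features were: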

Feature             Score
P_Bits_received     0.9848
EX_Bits_sent        0.9848
IN_Bits_received    0.9848
EX_Bits_received    0.9852
IN_Bits_sent        0.9852
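
When many single-feature models land on nearly the same accuracy, it is worth comparing that number with a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. If the baseline is close to the model scores, the features add little beyond the class imbalance. The class distribution of the SOFI data is not given in this report, so the labels below are made up purely for illustration.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Hypothetical labels: 1 = healthy ('NE'), 0 = faulty ('F').
train_labels = [1] * 95 + [0] * 5
test_labels = [1] * 90 + [0] * 10

baseline = majority_baseline(train_labels, test_labels)
print(baseline)  # 0.9
```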

Conclusion
The final sorted results from the feature importance analysis indicated that network

throughput and packet metrics were the most predictive of network node faults. This provided

valuable insights for further model refinement and feature engineering.

Overall, the experiments demonstrated that the KNN classifier could effectively identify

network node faults with varying degrees of accuracy depending on the choice of k and the set of

features used. The feature reduction analysis highlighted the potential for improving model

efficiency without significantly compromising accuracy, while the feature importance evaluation

offered a roadmap for prioritizing the most impactful network metrics.

Appendix
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Train on CoreSwitch-I, test on CoreSwitch-II.
data1 = pd.read_csv("SOFI CoreSwitch-I.csv")
data2 = pd.read_csv("SOFI CoreSwitch-II.csv")

# Encode the labels: 'F' (faulty) -> 0, 'NE' (healthy) -> 1.
data1['class'] = data1['class'].map({'F': 0, 'NE': 1})
data2['class'] = data2['class'].map({'F': 0, 'NE': 1})

trainResults = data1['class']
testResults = data2['class']

# Initial model: k=5 on the full feature set.
trainInputs = data1.drop(['class'], axis=1)
testInputs = data2.drop(['class'], axis=1)
model = KNeighborsClassifier(5)
model.fit(trainInputs, trainResults)
print(model.score(testInputs, testResults))

# Sweep k from 1 to 29 on the full feature set.
kResults = []
for k in range(1, 30):
    model = KNeighborsClassifier(k)
    model.fit(trainInputs, trainResults)
    kResults.append([k, model.score(testInputs, testResults)])
print(kResults)

# Model 2: drop timestamp and range.
dropped2 = ['class', 'timestamp', 'range']
model2 = KNeighborsClassifier(5)
model2.fit(data1.drop(dropped2, axis=1), trainResults)
print(model2.score(data2.drop(dropped2, axis=1), testResults))

# Model 3: drop the ICMP metrics and P_Inbound_packets_discarded.
# 'class' is dropped as well; otherwise the label itself would be a feature.
dropped3 = ['class', 'ICMP_ping', 'ICMP_loss', 'ICMP_response_time',
            'P_Inbound_packets_discarded']
model3 = KNeighborsClassifier(5)
model3.fit(data1.drop(dropped3, axis=1), trainResults)
print(model3.score(data2.drop(dropped3, axis=1), testResults))

# Model 4: additionally drop six more error, discard, traffic, and uptime metrics.
dropped4 = dropped3 + [
    'P_Inbound_packets_with_errors', 'P_Bits_received',
    'EX_Inbound_packets_discarded', 'EX_Inbound_packets_with_errors',
    'Device_uptime', 'IN_Inbound_packets_discarded']
model4 = KNeighborsClassifier(5)
model4.fit(data1.drop(dropped4, axis=1), trainResults)
print(model4.score(data2.drop(dropped4, axis=1), testResults))

# Per-feature evaluation: one feature plus the timestamp at a time.
# k is fixed at 29, the last value reached by the sweep above.
featureResults = {}
features = data2.drop(['class', 'timestamp', 'SNMP_availability'], axis=1).columns
for col in features:
    print(col)
    trainInputs = data1[[col, 'timestamp']]
    testInputs = data2[[col, 'timestamp']]
    model = KNeighborsClassifier(29)
    model.fit(trainInputs, trainResults)
    featureResults[col] = model.score(testInputs, testResults)

# Sort features by accuracy, lowest first.
sorted_results = dict(sorted(featureResults.items(), key=lambda x: x[1]))
print(sorted_results)
