A Review on Machine Learning and Deep Learning Techniques for Predicting Infrastructure

Failures in Cloud Environment


Predicting Infrastructure Failures in Cloud Environment: A Systematic Review of Machine Learning
and Deep Learning Techniques
Abstract
The future of electronic health, computing, and smart appliances depends on reliable and secure cloud
applications. Because cloud services are so numerous and varied, failures have occurred in the past. With fault
tolerance, cloud computing allows for uninterrupted service even when individual components fail. The
complexity of cloud computing systems is higher than that of traditional distributed systems. This makes it
necessary for new initiatives to ensure high availability and reliability. Both the customers and the cloud
providers are affected by this issue. The present article provides a literature review summarizing and
analyzing the behaviours of normal and failure activities, characterizing the failure evaluation, understanding
advanced algorithms for failure prediction, and effectiveness of implementing machine learning and deep
learning techniques during failure classifications in the cloud environment. The goal of this study is to analyse
the patterns of failure in the cloud environment. The study then compares the actions of successful and
unsuccessful jobs to see if there is a link between the characteristics of successful and failed cloud
applications. Several models were discussed with respect to their efficiency of predicting the failures and
other factors that are associated with the cloud infrastructure.
Keywords: Cloud Computing Services, Cloud Infrastructure Failures, Failure Characterization, Failure
Prediction Models, Reliable Computing.
1. Introduction

Due to the rise of big data analytics and the increasing number of people accessing the Internet, users are
looking for more efficient and effective computing solutions. Cloud software has made it easier for people
to access the resources they need, and major companies such as Amazon, Microsoft, and Google are
providing users with leasing solutions that allow them to get the most out of their computational power. The
rise of cloud computing has made it more common than ever to use distributed systems. These services
provide high-speed, reliable, and unlimited resources. Their environments are made up of various
components, such as memory units, processors, networking devices, and sensors. Due to the complexity of
the cloud's architecture, it is becoming more difficult to maintain its availability and reliability. Many users
are complaining about the issues that they are experiencing with the various cloud services. The availability
of publicly available clouds provides users with access to resources without requiring them to worry about
the hardware's maintenance.

Cloud computing applications are prone to failure as they operate in large-scale environments, such as virtual
and physical machines. Data centers and services may experience various types of failures, such as disk,
software, and hardware failures, which can affect their operations. For cloud users and service providers
to anticipate when and how a failure will happen, a model has to be developed that can accurately
forecast job failure. However, it is important to note that the maintenance of the cloud environment is carried
out by the service providers. In the event of hardware failure, both users and cloud providers are greatly
affected, so accurately forecasting when and how a failure will occur is very important. The
efficiency of the prediction techniques currently used in the cloud computing environment was compared
with the performance of their alternatives using a variety of statistical methods. This study will help the
development of future research in this field and provide the opportunity for cloud computing users to respond
efficiently to predicted failures. It could also benefit the cloud computing services of major companies such
as Microsoft and Amazon.
The main objective of the present review is to understand the characteristics of the failure and successful
activities, key factors that can affect the performance of the cloud-based applications, understanding the role
of machine learning and deep learning models with respect to failure predictions, and analyzing the
performance of existing models. The review can help future research identify the factors
that affect the availability and reliability of cloud services while reducing the number of failures. The
study also provides scope to develop a new failure prediction model that can analyze various failure
characteristics. Future research can apply various statistical, machine learning, and neural network techniques
to develop an efficient cloud-based failure prediction system.
The major contributions of this work are described as follows:
• To define the scope of understanding several factors that can upgrade the performance of cloud
computing services by understanding the behaviour of failure activities.
• To discuss the concepts of machine learning and deep learning and understand their applicability and
provide a failure prediction system in real-world situations.
• To summarize the potential benefits of using machine learning and deep learning algorithms for
developing an efficient failure prediction system to classify the failure and successful activities in
cloud environments.
• To summarize and discuss the potential directions of future research within the scope of failure
predictions.

The remainder of the article is organized as follows: Section 2 summarizes the failure
characteristics, failure prediction techniques, and the existing models. Section 3 presents the comparative
analysis with respect to several attributes, followed by the conclusion and future directions in Section 4.

2. Related Work
Several papers on the subject “failures of cloud computing infrastructure” were evaluated for this review. The
studies highlight the importance of developing predictive models that classify successful and failed
activities by using historical data and applying machine learning and deep learning models in a timely
manner. The review was carried out by collecting relevant data from articles
published in well-reputed journals between 2017 and 2024. Only this recent period
is covered in order to capture the latest advancements in the systems developed for the prediction of
faults in cloud environments.

[Figure 1: bar chart. x-axis: Year of Publication (2017 to 2024); y-axis: Number of Articles.]

Figure 1. Distribution of articles (year-wise)


[Figure 2: chart of the number of articles per section. Failure Characteristics and Prediction, 24; Failure Prediction Techniques, 7; Failure Prediction Systems, 17; Advancements in Failure Analysis, 4; Analysing State-of-the-art Systems, 9.]

Figure 2. Section-wise distribution of articles


2.1 Failure Characterization and Prediction
The rise of cloud computing has raised the bar for distributed computing. Due to its rapid adoption, it is
important that the various faults in the servers and data centers are detected and predicted in order to prevent
them from happening in the first place. This can help businesses and customers get the most out of their cloud
computing investments. One of the most effective ways to predict a failure is by training a machine to identify
the various message patterns in the logs and messages sent between different cloud components. This method
can then be used to analyse the messages and determine if the patterns are consistent with the data center's
failure. One of the most important factors that can be considered when it comes to identifying a failure is the
state of the cloud computing servers. This can be done by analysing the various parameters that are used in
the logs to monitor the performance of the cloud. For instance, the CPU usage and memory usage can be
analysed to determine if the cloud is operating properly. By using these parameters, a decision tree can be
created to identify a potential issue (Bambharolia et al., 2017). Various studies have been conducted on the
characterization and evaluation of failures in cloud, grid, or supercomputing environments. The reliability of
hardware in these environments has been a central issue (Kumar, 2024; Liang & Chen, 2024; K. K. Kumar et
al., 2019; M. S. Jassas & Mahmoud, 2022).
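
The decision-tree idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of Bambharolia et al. (2017): it fits a one-level tree (a decision stump) on CPU and memory utilization, and the sample values, labels, and helper names (`fit_stump`, `predict`) are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical monitoring samples: (cpu_utilization, memory_utilization), both in [0, 1].
SAMPLES: List[Tuple[float, float]] = [
    (0.20, 0.30), (0.35, 0.40), (0.50, 0.45), (0.60, 0.55),
    (0.85, 0.90), (0.90, 0.70), (0.95, 0.95), (0.80, 0.85),
]
# Hypothetical labels: 1 = the server failed soon afterwards, 0 = it kept running.
LABELS: List[int] = [0, 0, 0, 0, 1, 1, 1, 1]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels: Sequence[int]) -> int:
    """Most frequent label (ties resolve to 1)."""
    return int(2 * sum(labels) >= len(labels))


def fit_stump(samples, labels):
    """Return (feature_index, threshold, left_label, right_label).

    The split is chosen to minimise the size-weighted Gini impurity.
    """
    best = None
    best_score = float("inf")
    for f in range(len(samples[0])):
        values = sorted({s[f] for s in samples})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for s, y in zip(samples, labels) if s[f] <= t]
            right = [y for s, y in zip(samples, labels) if s[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score = score
                best = (f, t, majority(left), majority(right))
    return best


def predict(stump, sample) -> int:
    """Apply the learned split to one (cpu, memory) sample."""
    f, t, left_label, right_label = stump
    return left_label if sample[f] <= t else right_label


stump = fit_stump(SAMPLES, LABELS)
```

A full decision tree repeats this split search recursively on each side. The stump is kept here only to show how a utilization threshold can be learned from labelled log data.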
One proposed method predicts which jobs will fail and optimizes the resource usage of cloud applications.
The model was evaluated in depth on the Google
Cluster, Mustang, and Trinity traces. The traces were then fed into various machine learning models to find the
most reliable one. The efficiency evaluation of the proposed model revealed that it performed well in terms
of recall, accuracy, and F1-score. It also showed that it can increase cloud service availability and reliability
by implementing various factors (Bommala et al., 2023). The study proposed an approach to analyse the
effects of the cloud system's improvements on performance. It performed a comprehensive analysis of the
various characteristics of the workload, such as its frequency of workloads, performance time, and memory
consumption. They also found that some aspects of the workload are associated with job failure (Bommala
et al., 2023).
The researchers used a Markov model to analyse the utilization of the Google Cloud Platform's cluster. They
then validated the reliability of the system by looking at its resources' stability and the time it takes to fix
them (Mesbahi et al., 2019). A case study is presented on the use of Google Cluster tracing in dual cloud
computing instances (Ruan et al., 2019). The researchers then used the system's distribution function to
analyse the time it takes to fix and recover from failures (Uddin Ahmed et al., 2020). Previous studies on the
use of Google Cluster tracing focused on the load patterns in the sequence. This study, however, used the
K-Means algorithm to classify events and analyse the sequence. The researchers found two new traces from
private clusters and one from high-performance computing clusters (Amvrosiadis et al., 2018). The
researchers found that the data analysis activities that are most likely to be related to the Google workload
are the most common in private clusters. These activities are also commonly used in high-performance
computing (HPC) clusters. The researchers additionally found four new traces from the LANL cluster tracing
system. These traces came from two different private clouds and one from a highly efficient cluster (Vo L,
2018). The variables that were analysed in this study included the storage space, memory use, and CPU speed
(M. Jassas & Mahmoud, 2018).
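
To make the K-Means step mentioned above concrete, the following sketch clusters events by two resource features using Lloyd's algorithm. It is an illustration under assumptions: the feature choice (normalized CPU and memory use), the event values, and the initial centroids are hypothetical and do not come from the cited traces.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical events described by (normalized CPU use, normalized memory use).
EVENTS: List[Point] = [
    (0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.85),
]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points: Sequence[Point], init: Sequence[Point], max_iter: int = 100):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    centroids = list(init)
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments


# Deterministic start: one seed from each visually distinct group.
centroids, assignments = kmeans(EVENTS, init=[EVENTS[0], EVENTS[3]])
```

In practice the initial centroids are usually chosen randomly or with a seeding heuristic, and the number of clusters has to be selected separately.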
The failure prediction model's correlation between abortive and unsuccessful tasks can be evaluated using
cloud traces. These traces can also be used to identify the various features of the virtual machine that are not
working properly. To determine if the CPU has recurring behavior patterns, PRESS uses signal processing to
analyse the data. If it finds that the CPU is exhibiting these patterns, it will use historical workload data to
forecast future demand. Otherwise, it uses a statistical technique known as a state-driven method.
Numerous studies have examined job failure, but predicting whether an individual task will succeed or fail
has received comparatively little academic attention (Alahmad et al., 2021; Jyothi Shetty et al., 2019). Some
methods, such as proactive fault tolerance (Prajapati & Thakkar, 2020), utilize ML algorithms and data
analysis to predict when a task will fail. Deep learning is a well-known method for identifying software
failures (Sun et al., 2018). In addition, the study utilized a method that involves breeding new samples for
the prediction of failures. They used various machine learning models to identify scientific applications that
will fail to complete their tasks (Padmakumari & Umamakeswari, 2019). The researchers evaluated and
developed the models using a dataset. The Naive Bayes classifier performed well in terms of exactness.
The researchers used machine learning to identify job and task failures. They developed a multilayered
model with a Bi-LSTM, which was trained and tested on the collected data. They achieved
an accuracy of 87% for predicting a job failure and 93% for predicting a task failure. Unfortunately, most
researchers evaluated only their own models and did not investigate the various classification methods
available, which is a major gap in the current literature. To address this issue, the researchers utilized
different techniques, such as KNN, QDA, XGBoost, and DTs. Later work proposed a model that predicts a
task's failure using a feature-selection algorithm to improve the model's accuracy (Gao et al., 2022). The
researchers found that the workload variables and job failure are related to one another. They also found that
the failure rate of tasks and the use of resources are both significantly high in cloud computing instances.
Only a small number of the failed activities were reimplemented multiple times before they were successful.
The findings support the idea that the scheduling class is associated with failure (Swain et al., 2020).
One paper introduces the development and use of a deep learning-based failure prediction model. It can
identify tasks that are likely to fail before the failure happens, which can help improve cloud applications' performance. To
investigate the predicted failure behavior in large-scale environments, the authors used the Google cluster,
Mustang, and Trinity traces. They also evaluated the proposed model's performance through different
evaluation metrics. The goal of these measures is to ensure that the model delivers the most accurate
prediction (M. S. Jassas & Mahmoud, 2021). There has been a lot of research on the development of machine
learning models that can improve the accuracy of cloud failure prediction for large data centers, but there has
been little research on the use of ensemble models.
A new model that uses the Adaboost ensemble machine learning framework to predict software and hardware
failures in the cloud is presented (Ng’ang’a et al., 2023). The study developed a system metrics approach that
can improve the reliability of cloud failure prediction by using artificial intelligence (Kumar, 2024). The model was tested
on more than a hundred cloud servers with four AI algorithms. The results revealed that the
correlation between the various system metrics is very important in predicting cloud failure. The experimental
results also showed that the combination of these metrics can outperform the state of the art (Chhetri et
al., 2022). A conceptual framework for a cloud-based AI system that can predict a component's failure based
on collected data is presented. The system's architecture is built using methods such as cloud computing and
mining (Karthik & Kamala, 2021). The researchers proposed a model that uses performance statistics to
identify and predict the faults in the system. This method can help prevent the system from experiencing
issues in the future (Gaur & Mahalkari, 2021).
2.2 Failure Prediction Techniques
On-demand computing is a type of service that allows users to access and manage various computing
resources over the internet. Cloud computing is usually offered through a pay-as-you-go model. Rather than
owning their own data centers or infrastructure, users can rent the services of cloud service providers. One
of the most important factors that cloud computing users should consider when it comes to maintaining their
environment is the prediction of failure. This can help them avoid experiencing major issues and ensure that
their infrastructure is reliable. This can be done with load balancing and migration. With the help of predictive
maintenance, users can take proactive steps to avoid experiencing major issues. This process can be carried
out with various machine learning algorithms. One of the most common factors that researchers consider
when it comes to developing a prediction model is the classification problem. This involves determining if
the current state of the cloud environment can lead to a failure.
The ARIMA model has been used to predict a VM's failure by modelling non-stationary failure
traces. In addition, this model can also be applied to other stationary and non-stationary sources (Rawat et
al., 2021). Li et al. presented a method that uses the Hidden Markov Model and the AdaBoost
algorithm to improve the performance of cloud systems. The method was able to predict the failure state of
a VM based on the relationship between its observation state and its hidden state. It can be used with other
cloud systems to improve their security state prediction (Z. Li et al., 2019). The various factors that can cause
a node failure are reflected in spatial and temporal signals. To improve the prediction of a node's failure,
Lin et al. presented a novel model called MING. This method combines multiple models for the
prediction of failure, such as the LSTM model and the Random Forest model (Lin et al., 2018).
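
The way ARIMA handles a non-stationary trace can be illustrated with a deliberately simplified sketch: difference the series once, fit a first-order autoregression on the differences, then integrate the forecasts back. This corresponds to an ARIMA(1,1,0) model without a constant, fitted here by least squares rather than by the maximum-likelihood estimation that statistical packages typically use. The trace values are hypothetical and the code is not the model of Rawat et al. (2021).

```python
from typing import List

# Hypothetical non-stationary trace, e.g. a cumulative count of VM failures per day.
TRACE: List[float] = [10.0, 14.0, 16.0, 17.0, 17.5]


def difference(series: List[float]) -> List[float]:
    """First-order differencing (the 'I' part with d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs: List[float]) -> float:
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + noise."""
    num = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den if den else 0.0


def forecast(series: List[float], steps: int) -> List[float]:
    """Forecast `steps` future values of the original (undifferenced) series."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff   # AR(1) step on the differenced series
        level += last_diff            # integrate back to the original scale
        predictions.append(level)
    return predictions


next_two_days = forecast(TRACE, steps=2)
```

Differencing removes the trend so that the autoregressive part only has to model the change between consecutive observations. A production model would also select the orders (p, d, q) and validate the residuals.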
The study developed fast learning methods that can reliably predict and analyse the performance of cloud
computing jobs. The methods were integrated with the Google cluster dataset and various other tools, such
as an artificial neural network, a support vector machine, and a stack ensemble. The suggested models were
able to identify and predict the failed tasks efficiently and effectively. The stack ensemble performed well
in the experiments, reaching a 99.8% success rate (Gollapalli et al., 2022). Another proposed model is
designed to achieve high accuracy in failure prediction regardless of whether it is applied to
a small or a large trace. Three classification algorithms (RF, DT, and ANN) were tested, and
the first, RF, performed well on the Google trace, with a 99.8% accuracy rate (M. S. Jassas & Mahmoud,
2021).
The model is composed of three components: a Logistic Regression, a Random Forest Classifier, and a
Decision Tree Classifier. The results indicate that the approach performed relatively well compared to the
previous research. The Decision Tree performed the best in terms of accuracy, recording 91.7% precision,
88.8% recall, and an F1 Score of 89.7% (Ng’ang’a et al., 2023). The study revealed the relationship between
the various system metrics and cloud failure prediction. The experimental results show that combining the
metrics can perform better than the state-of-the-art. The results of the analysis were compared using different
algorithms, and the best performing one was used for the comparison. The system metrics approach
outperformed previous research, improving precision by up to 11% and recall by 32%. Compared
with state-of-the-art studies, the combined approach was also competitive, improving the F1
score by 5% and precision by 10% (Chhetri et al., 2022).
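
Since the studies above are compared through precision, recall, F1 score, and accuracy, a short sketch of how these metrics follow from predicted and actual labels may help. The formulas are the standard definitions, and the label vectors are hypothetical (1 = failed task, 0 = successful task).

```python
from typing import Dict, Sequence


def classification_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1 and accuracy for the positive class (1 = failure)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # failures caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed failures
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct "no failure"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical ground truth and predictions for ten tasks.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Y_PRED = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_scores(Y_TRUE, Y_PRED)
```

Because failures are usually rare, accuracy alone can look high even when many failures are missed, which is why the reviewed studies tend to report precision and recall alongside it.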
A model is proposed to improve cloud application efficiency and resource utilization. It was evaluated
comprehensively on the Google cluster, Trinity, and Mustang traces, and the
most accurate model was identified using several machine learning techniques. The analysis revealed that
requests and unsuccessful tasks have a significant correlation, and that the
proposed model performs well in terms of accuracy, recall, and F1-score. On a Google trace of 29 days,
the RF-based model takes longer to run, at 247.6s, while the DT
has a relatively low time of 53.8s (M. S. Jassas & Mahmoud, 2022). Another evaluation method aims
to predict both task failure and job failure. The GCT dataset was used to analyse the performance of
different TML and DL models. The XGBoost classifier was selected as the best candidate for predicting
job-level failures. It achieved an accuracy rate of 94.35% and a score of 0.9310. In contrast, two supervised
task-level classification models performed well, with the former having an accuracy rate of 89.75% and the
latter an F-score of 0.9154 (Tengku Asmawi et al., 2022).
2.3 Fault Prediction Systems
Machine learning-based models are commonly used in the cloud environment to detect and recover faults.
They are supervised learning methods that learn from data related to specific fault situations. Various
frameworks have been presented for developing and implementing this type of model. The review addresses
the need for monitoring abnormal changes in cloud services. It also highlights the need to
understand the implementation of various fault prediction models in the cloud environment. Different
machine learning and deep learning algorithms that perform the efficient classification of successful and
failure activities are also discussed. Further studies on the applications of deep learning and machine learning
in forecasting and fault management have been carried out (Soualhia et al., 2019; Wang et al., 2022; Yang
& Kim, 2022). The method presented in (Kaur & Vaithiyanathan, 2024) combines the optimization
techniques of hybrid systems with neural networks.
The proposed framework can be used to improve the maintenance strategy of an organization by analysing
and predicting the likelihood of task failures. It can also be used in real-world environments such as a
manufacturing facility (Aboshosha et al., 2023). The proposed framework is based on the Hidden Markov
Model and the Cloud Theory, and it can extend this model to predict system failure. In the simulations, the
model was able to perform well. It also exhibited an optimal tradeoff between the computational complexity
and the performance of the prediction (Zheng et al., 2016).
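
A Hidden Markov Model links what can be observed (for example, monitored load or latency) to an unobserved system state. The sketch below decodes the most likely hidden state sequence with the Viterbi algorithm. It is only an illustration of the HMM component: the states, observation symbols, and all probabilities are hypothetical, and the Cloud Theory extension of Zheng et al. (2016) is not modelled.

```python
from typing import Dict, List, Sequence

STATES = ("healthy", "degraded")

# Hypothetical model parameters (each row sums to 1).
START: Dict[str, float] = {"healthy": 0.8, "degraded": 0.2}
TRANS: Dict[str, Dict[str, float]] = {
    "healthy": {"healthy": 0.9, "degraded": 0.1},
    "degraded": {"healthy": 0.2, "degraded": 0.8},
}
EMIT: Dict[str, Dict[str, float]] = {
    "healthy": {"normal_load": 0.9, "high_latency": 0.1},
    "degraded": {"normal_load": 0.3, "high_latency": 0.7},
}


def viterbi(observations: Sequence[str]) -> List[str]:
    """Return the most probable hidden state sequence for the observations."""
    # prob[s]: probability of the best path that ends in state s.
    prob = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    path = {s: [s] for s in STATES}
    for obs in observations[1:]:
        new_prob, new_path = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this step.
            best_prev = max(STATES, key=lambda r: prob[r] * TRANS[r][s])
            new_prob[s] = prob[best_prev] * TRANS[best_prev][s] * EMIT[s][obs]
            new_path[s] = path[best_prev] + [s]
        prob, path = new_prob, new_path
    final_state = max(STATES, key=lambda s: prob[s])
    return path[final_state]


decoded = viterbi(["normal_load", "high_latency", "high_latency"])
```

For long observation sequences the products of probabilities become very small, so practical implementations usually work with log-probabilities instead.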
Several studies have been conducted on the use of anomaly detection methods for detecting various faults in
an environment using software defined networking (SDN). One of these is a method that uses cloud log data
to predict and detect faults. Unfortunately, because it relies on SVM, the method suffers from issues such as
labelling effort and class imbalance (Garg et al., 2019; M. El-Shamy et al., 2021). Another commonly
used technique is the Bi-LSTM method, but it has disadvantages such as requiring a large amount of labelled data (Gao et al., 2022;
He & Lee, 2021). In (Mohammed et al., 2019), a method that uses machine learning to improve the accuracy
of failure prediction is presented. The authors developed a model based on a variety of algorithms,
such as the Support Vector Machine, the Random Forest, the KNN, and the CART, and tested the accuracy of
the prediction with different comparisons. Another proposed model for failure prediction is based on CloudSim. It
collects performance-related data from the cloud and uses a neural network to analyse the hardware's status.
It was able to predict the cloud's host failure with an accuracy of about 89% (Davis et al., 2017).
Several critical factors related to cloud computing were addressed by developing models that can improve
the prediction of cloud performance and provide better fault tolerance. The authors utilized a combination of
machine learning techniques, such as gradient boosting, linear regression, and decision trees, to build their
models (Kalaskar & Thangam, 2023). A novel predictive model can be developed by first extracting various
features from log data through a text mining algorithm. The second step
involves developing a set of models using various algorithms, such as rank-based and association rules, to predict the failure
of the system's critical devices and identify the ones that need to be replaced. Time-series
models are then built using machine learning techniques. The last step involves
developing a forecasting model that can predict the infrastructure's health (Patel, 2020).
The proposed framework can be used to identify and predict various faults in an infrastructure-level cloud.
It performs well at detecting non-fatal faults in the hardware and software of the system. The accuracy of
the predictions made by the two models is comparable: the CNN has a 96.47% accuracy while
the Long Short-Term Memory (LSTM) network has a 96.88% accuracy (Soualhia et al., 2019). Another paper presents an
innovative method for root cause analysis and system failure prediction that combines the three aspects of IT
observability: logs, traces, and metrics. The method is designed to capture the temporal aspects of the data
by integrating GNNs. The F1 scores of the system failure prediction were 0.98, 0.96, and 0.97,
which are significantly better than the state of the art (Rouf et al., 2024). The goal of another study is to develop a
framework that can predict the likelihood of task failures in scientific workflows. Failures predicted with
Naive Bayes were compared with the actual failures observed on Amazon EC2 and Pegasus,
and the model's accuracy was confirmed at 94% (Bala & Chana, 2015).
3. Comparative Analysis
3.1 Advancements in Failure Analysis
In recent years, advancements in analysing the failures of cloud infrastructure have gained significant
attention due to their potential impact on applications like cloud data availability, reliability and fault
tolerance. With the emergence of cloud computing, researchers have been increasingly able to tackle
challenges such as predictive maintenance, anticipating and addressing failures, detecting anomalies,
predicting outages, and optimizing cloud performance. This progression has laid the groundwork for
examining the reliability of cloud computing. Despite considerable progress, there remains a lack of
comprehensive studies on the application of machine learning techniques to failure prediction in cloud
environments. Consequently, this literature review seeks to explore these
gaps, examining how various existing problems are being addressed in the current research landscape.
The paper introduces Preface, a novel approach that enhances neural-network-based failure predictors to
effectively handle time series of KPI sets with variable sizes, which is essential for cloud applications that
utilize autoscaling. This is achieved by incorporating a Rectifier layer that transforms the variable KPI sets
into a fixed set of rectified-KPIs, making them compatible with the neural network's input requirements.
Experimental results demonstrate that Preface can successfully predict many harmful failures in both a
commercial application and a widely used academic exemplar, allowing for timely activation of
countermeasures to prevent negative impacts on users of the applications (Y. Li et al., 2020). The study found
that the best overall performing configuration for failure prediction is a CNN-based encoder combined with
the Logkey2vec embedding strategy. This combination demonstrated high accuracy when specific dataset
conditions were met, namely a dataset size greater than 350 or a failure percentage exceeding 7.5%. The
research systematically investigated the impact of various deep learning encoders (LSTM, BiLSTM, CNN,
and transformer) and embedding strategies (BERT and Logkey2vec) on failure prediction accuracy, revealing
that the characteristics of the dataset, such as size and failure percentage, significantly influence the
performance of the models (Hadadi et al., 2024).
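
The problem that Preface addresses, feeding a variable number of KPI readings into a model with a fixed input size, can be illustrated with a simple aggregation. The sketch below is not claimed to match the Rectifier layer of Preface; it is a hand-written stand-in that collapses per-node readings into summary statistics, and the KPI names and values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def rectify(kpis_per_node: Dict[str, Dict[str, float]], kpi_names: List[str]) -> List[float]:
    """Collapse per-node KPI readings into [min, mean, max] for each KPI name.

    The output length is 3 * len(kpi_names) no matter how many nodes are running.
    """
    vector: List[float] = []
    for name in kpi_names:
        values = [node[name] for node in kpis_per_node.values() if name in node]
        if values:
            vector.extend([min(values), mean(values), max(values)])
        else:
            vector.extend([0.0, 0.0, 0.0])  # assumption: missing KPIs are zero-filled
    return vector


KPI_NAMES = ["cpu", "mem"]

# Hypothetical snapshots before and after an autoscaling event adds a third node.
two_nodes = {"n1": {"cpu": 0.2, "mem": 0.5}, "n2": {"cpu": 0.6, "mem": 0.7}}
three_nodes = {**two_nodes, "n3": {"cpu": 0.9, "mem": 0.4}}

fixed_a = rectify(two_nodes, KPI_NAMES)
fixed_b = rectify(three_nodes, KPI_NAMES)
```

Both snapshots yield a vector of the same length, so the same downstream predictor can consume them even though the number of nodes changed.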
The proposed failure prediction algorithm based on multi-layer Bidirectional Long Short-Term Memory
(Bi-LSTM) demonstrates a significant improvement in predicting task and job failures in cloud data centers,
achieving an accuracy of 93 percent for task failures and 87 percent for job failures in trace-driven
experiments. The study highlights the importance of accurately predicting task and job failures to enhance
service reliability and availability in large-scale cloud data centers, thereby reducing resource wastage
associated with recovery from such failures (Gao et al., 2022). The study proposes a conceptual model for
preparing, constructing, and evaluating both traditional machine learning algorithms and deep learning
algorithms specifically for predicting job and task failures in cloud systems, addressing a critical issue faced
by cloud service providers and users. Experimental results indicate that Extreme Gradient Boosting
outperforms other algorithms in job failure prediction with an accuracy of 94.35%, while Decision Tree and
Random Forest achieve the highest accuracy of 89.75% in task failure prediction, highlighting the importance
of specific features such as disk space request, CPU request, and task priority in determining prediction
outcomes (He & Lee, 2021).
3.2 Analysing the State-of-the-art Systems
Due to the complexity of cloud computing, many service providers are not able to prevent the failures that
commonly occur in their components. Previous studies have mainly focused on understanding the behavior
of failed jobs and identifying their causes. Other studies have investigated the
prediction of failures. The main objective of this approach is to enhance the efficiency of cloud applications
by minimizing the number of failed jobs. In this subsection, the existing
literature is compared to evaluate the performance of existing failure prediction models. The
systems developed, datasets used, performance metrics, and results achieved are shown in Table 1.
Table 1. Summary of the existing literature towards failure prediction systems
| Reference | Dataset | Process | Approach | Metrics | Results |
|---|---|---|---|---|---|
| (Islam & Manivannan, 2017) | Google cluster | Analysing failure characterization | Long Short-Term Memory (LSTM) | Accuracy, True Positive Rate, False Positive Rate | 97% accuracy, 85% TPR, 11% FPR |
| (Lin et al., 2018) | Custom data collected from production cloud service system | Predicting failure proneness of a node | LSTM with Random Forest (RF) and a ranking model | Precision, Recall, F1 measure | 92.4% average precision, 63.5% average recall, 75.2% average F1-score |
| (Kalaskar & Thangam, 2023) | Google Trace | Encompassing diverse metrics | XGBoost, C5.0, Ada-Boost, Average Neural Network, and Bayesian GLM | Precision, Sensitivity | 100% precision, 80% sensitivity |
| (Patel, 2020) | Log messages and maintenance records | Text mining | Artificial Neural Networks and Support Vector Machine | Sensitivity and Specificity | 69.96% sensitivity and 97.13% specificity |
| (Kaur & Vaithiyanathan, 2024) | Failure-Dataset OpenStack database | Augment the purity metrics | Yellow Saddle Goat Fish Algorithm, and Grasshopper Optimization Algorithm | Purity value and STO workload | 0.95 purity value and 0.901, 0.89 STO workloads |
| (M. S. Jassas & Mahmoud, 2022) | Bit Brains | Mitigating the losses | Logistic Regression (LR) and K-Nearest Neighbour (KNN) | Accuracy | 99% for KNN, 95% for LR |
| (Faraz Bashir et al., 2022) | Google cluster | Failure characterization | XGBoost | Precision and Recall | 92% precision and 94.8% recall |
| (Gollapalli et al., 2022) | Google cluster | Multiple assessment criteria | ANN and SVM | Accuracy | 99.8% accuracy |
| (Gao et al., 2022) | Google cluster | Analysing system message logs | Bidirectional LSTM | Accuracy | 93% accuracy |
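
The metrics reported in Table 1 (accuracy, precision, recall or sensitivity, specificity, false positive rate, and F1) can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch of those standard definitions; it is not taken from any of the reviewed studies, and the function name `failure_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def failure_metrics(tp, fp, tn, fn):
    """Compute common failure-prediction metrics from confusion-matrix counts.

    A 'positive' is a job or node predicted to fail:
    tp = failures correctly predicted, fn = failures missed,
    fp = healthy instances flagged as failures, tn = healthy instances passed.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called true positive rate or sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, not drawn from any study in Table 1.
metrics = failure_metrics(tp=85, fp=11, tn=89, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the studies in Table 1 report different subsets of these metrics, their results are only directly comparable where the same metric was measured on the same dataset.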

3.3 Limitations of the existing studies


• The distribution of failure and non-failure instances is imbalanced, which leads to biased
models that are less effective in predicting the less frequent failure events.
• Current approaches to predicting failures are limited to applications with statically configured
sets of components and computational nodes, and do not accommodate the dynamic nature of cloud
applications that use autoscaling.
• Few studies extensively address the limitations or challenges associated with implementing
these models in real-world scenarios, such as data privacy concerns, integration with existing systems,
and the potential for false positives in failure predictions.
• Limited research has been conducted on ensemble models for cloud failure prediction, despite the
focus on failure characterization and analysis with machine learning models.
• Current empirical studies do not cover all main deep learning types,
specifically Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and
transformers, nor do they examine them on a wide range of diverse datasets for failure prediction
tasks.
• Many studies do not address the potential limitations or challenges associated with the
implementation of AI models in diverse cloud computing environments, which may affect their
generalizability and effectiveness across different infrastructures.
• Few studies have explored the applicability of the developed algorithms across diverse cloud
platforms and varying workload characteristics.
• There is a need to focus on improving the accuracy of the prediction model by adopting advanced
prediction techniques such as neural networks and recurrent neural networks.
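
The class-imbalance limitation listed above can be made concrete with a small example. The Python sketch below is an illustration under stated assumptions, not a method from the reviewed studies: the 95/5 label split is hypothetical, and the helper names are invented for this example. It shows that a model which always predicts "no failure" reaches high accuracy while detecting no failures, and it computes inverse-frequency class weights (the same heuristic scikit-learn uses for `class_weight="balanced"`) as one common mitigation.

```python
from collections import Counter


def majority_baseline(labels):
    """Predict the most frequent class for every instance."""
    majority = Counter(labels).most_common(1)[0][0]
    return [majority] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of instances whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives (failures) that were predicted as positive."""
    actual = sum(t == positive for t in y_true)
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / actual if actual else 0.0


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical workload: 95 successful jobs (0) and 5 failed jobs (1).
labels = [0] * 95 + [1] * 5
preds = majority_baseline(labels)

acc = accuracy(labels, preds)
rec = recall(labels, preds)
weights = balanced_class_weights(labels)

print(f"accuracy={acc:.2f}, failure recall={rec:.2f}")
print(f"class weights={weights}")
```

The weights give each failed job roughly nineteen times the influence of a successful one during training, which is why accuracy alone is a poor basis for comparing failure predictors on imbalanced traces.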
4. Conclusion
The research will use a unique approach to identify and analyse the root causes of infrastructure failures,
which will provide a foundation for more accurate and targeted predictions. The researchers then plan to
develop a preprocessing framework that ensures the collected data is of high quality by handling issues such
as missing data and noise. This will help improve the model's responsiveness and reliability. The researchers
have integrated deep learning and machine learning techniques into a hybrid architecture. This method helps
in understanding the characteristics of failure activities based on historical data, which makes the prediction
model more accurate and comprehensive than traditional ones. The present review will help to improve the
classification of successful and failed activities in cloud infrastructures using machine learning and deep
learning algorithms. It also paves the way for innovations in intelligent cloud management and predictive
maintenance, which would benefit both users and providers.
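
As an illustration of the kind of preprocessing mentioned above for missing data and noise, the following Python sketch fills missing readings with the median of the observed values and then smooths the series with a trailing moving average. It is an assumption-laden example rather than the framework the researchers plan to build: the function names, the window size, and the CPU-usage values are all hypothetical.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def moving_average(values, window=3):
    """Smooth a series with a trailing moving average over `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        # Early points use a shorter window because less history exists.
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical CPU-usage samples with two missing readings and one spike.
cpu_usage = [0.20, None, 0.25, 0.90, 0.30, None, 0.35]
clean = impute_missing(cpu_usage)
smooth = moving_average(clean, window=3)

print(clean)
print([round(v, 3) for v in smooth])
```

Median imputation is used here because it is less sensitive to spikes than the mean; a real pipeline would need to choose imputation and smoothing strategies per feature.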
References
Aboshosha, A., Haggag, A., George, N., & Hamad, H. A. (2023). IoT-based data-driven predictive maintenance
relying on fuzzy system and artificial neural networks. Scientific Reports, 13(1).
https://doi.org/10.1038/s41598-023-38887-z

Alahmad, Y., Daradkeh, T., & Agarwal, A. (2021). Proactive Failure-Aware Task Scheduling Framework for Cloud
Computing. IEEE Access, 9, 106152–106168. https://doi.org/10.1109/ACCESS.2021.3101147

Amvrosiadis, G., Woo Park, J., Ganger, G. R., Gibson, G. A., Baseman, E., & DeBardeleben, N. (n.d.). On the
diversity of cluster workloads and its impact on research results.
https://www.usenix.org/conference/atc18/presentation/amvrosiadis

Bala, A., & Chana, I. (2015). Intelligent failure prediction models for scientific workflows. Expert Systems with
Applications, 42(3), 980–989. https://doi.org/10.1016/j.eswa.2014.09.014

Bambharolia, P., Bhavsar, P., & Prasad, V. (2017). Failure Prediction and Detection In Cloud Datacenters.
International Journal of Scientific & Technology Research, 6(09). www.ijstr.org

Bommala, H., Uma Maheswari, V., Aluvalu, R., & Mudrakola, S. (2023). Machine learning job failure analysis
and prediction model for the cloud environment. High-Confidence Computing, 3(4).
https://doi.org/10.1016/j.hcc.2023.100165

Chhetri, T. R., Dehury, C. K., Lind, A., Srirama, S. N., & Fensel, A. (2022). A Combined System Metrics
Approach to Cloud Service Reliability Using Artificial Intelligence. Big Data and Cognitive Computing,
6(1). https://doi.org/10.3390/bdcc6010026

Davis, N. A., Rezgui, A., Soliman, H., Manzanares, S., & Coates, M. (2017). FailureSim: A System for Predicting
Hardware Failures in Cloud Data Centers Using Neural Networks. IEEE International Conference on
Cloud Computing, CLOUD, 2017-June, 544–551. https://doi.org/10.1109/CLOUD.2017.75

Gao, J., Wang, H., & Shen, H. (2022). Task Failure Prediction in Cloud Data Centers Using Deep Learning. IEEE
Transactions on Services Computing, 15(3), 1411–1422. https://doi.org/10.1109/TSC.2020.2993728

Garg, S., Kaur, K., Kumar, N., & Rodrigues, J. J. P. C. (2019). Hybrid deep-learning-based anomaly detection
scheme for suspicious flow detection in SDN: A social multimedia perspective. IEEE Transactions on
Multimedia, 21(3), 566–578. https://doi.org/10.1109/TMM.2019.2893549

Gaur, D. K., & Mahalkari, A. (n.d.). Effective Fault prediction using classifier analysis for cloud environment.

Gollapalli, M., AlMetrik, M. A., AlNajrani, B. S., AlOmari, A. A., AlDawoud, S. H., AlMunsour, Y. Z., Abdulqader,
M. M., & Aloup, K. M. (2022). Task Failure Prediction Using Machine Learning Techniques in the Google
Cluster Trace Cloud Computing Environment. Mathematical Modelling of Engineering Problems, 9(2),
545–553. https://doi.org/10.18280/mmep.090234

Hadadi, F., Dawes, J. H., Shin, D., Bianculli, D., & Briand, L. (2024). Systematic Evaluation of Deep Learning
Models for Log-based Failure Prediction. Empirical Software Engineering, 29(5).
https://doi.org/10.1007/s10664-024-10501-4

He, Z., & Lee, R. B. (2021). CloudShield: Real-time Anomaly Detection in the Cloud.
http://arxiv.org/abs/2108.08977

Islam, T., & Manivannan, D. (2017). Predicting Application Failure in Cloud: A Machine Learning Approach.
Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017, 24–31.
https://doi.org/10.1109/IEEE.ICCC.2017.11

Jassas, M., & Mahmoud, Q. H. (n.d.). Failure Analysis and Characterization of Scheduling Jobs in Google
Cluster Trace.

Jassas, M. S., & Mahmoud, Q. H. (2021, April 15). A Failure Prediction Model for Large Scale Cloud
Applications using Deep Learning. 15th Annual IEEE International Systems Conference, SysCon 2021 -
Proceedings. https://doi.org/10.1109/SysCon48628.2021.9447141

Jassas, M. S., & Mahmoud, Q. H. (2022). Analysis of Job Failure and Prediction Model for Cloud Computing
Using Machine Learning. Sensors, 22(5). https://doi.org/10.3390/s22052035

Kalaskar, C., & Thangam, S. (2023). Fault Tolerance of Cloud Infrastructure with Machine Learning.
Cybernetics and Information Technologies, 23(4), 26–50. https://doi.org/10.2478/cait-2023-0034

Karthik, T. S., & Kamala, B. (2021). Cloud based AI approach for predictive maintenance and failure prevention.
Journal of Physics: Conference Series, 2054(1). https://doi.org/10.1088/1742-6596/2054/1/012014

Kaur, R., & Vaithiyanathan, R. (2024). Hybrid YSGOA and neural networks-based software failure prediction in
cloud systems. Scientific Reports, 14(1). https://doi.org/10.1038/s41598-024-67107-5

Kumar, A. (n.d.). AI-Driven Innovations in Modern Cloud Computing. Computer Science and Engineering,
2024(6), 129–134. https://doi.org/10.5923/j.computer.20241406.02

Li, Y., Jiang, Z. M. J., Li, H., Hassan, A. E., He, C., Huang, R., Zeng, Z., Wang, M., & Chen, P. (2020). Predicting
Node Failures in an Ultra-Large-Scale Cloud Computing Platform. ACM Transactions on Software
Engineering and Methodology, 29(2). https://doi.org/10.1145/3385187

Li, Z., Liu, L., & Kong, D. (2019). Virtual Machine Failure Prediction Method Based on AdaBoost-Hidden Markov
Model. Proceedings - 2019 International Conference on Intelligent Transportation, Big Data and Smart
City, ICITBS 2019, 700–703. https://doi.org/10.1109/ICITBS.2019.00173

Liang, J., & Chen, M. (n.d.). AI-Driven Predictive Maintenance for Cloud Infrastructure: Advancements,
Challenges, and Future Directions.

Lin, Q., Hsieh, K., Dang, Y., Zhang, H., Sui, K., Xu, Y., Lou, J. G., Li, C., Wu, Y., Yao, R., Chintalapati, M., & Zhang,
D. (2018). Predicting node failure in cloud service systems. ESEC/FSE 2018 - Proceedings of the 2018
26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the
Foundations of Software Engineering, 480–490. https://doi.org/10.1145/3236024.3236060

M. El-Shamy, A., A. El-Fishawy, N., Attiya, G., & A. A. Mohamed, M. (2021). Anomaly Detection and Bottleneck
Identification of The Distributed Application in Cloud Data Center using Software–Defined Networking.
Egyptian Informatics Journal, 22(4), 417–432. https://doi.org/10.1016/j.eij.2021.01.001

Mesbahi, M. R., Rahmani, A. M., & Hosseinzadeh, M. (2019). Dependability analysis for characterizing Google
cluster reliability. International Journal of Communication Systems, 32(16).
https://doi.org/10.1002/dac.4127

Mohammed, B., Awan, I., Ugail, H., & Muhammad, Y. (n.d.). Failure Prediction using Machine Learning in a
Virtualised HPC System and application.

Ng’ang’a, D. N., Cheruiyot, W., & Njagi, D. (2023). A Machine Learning Framework for Predicting Failures in
Cloud Data Centers -A case of Google Cluster -Azure Clouds and Alibaba Clouds.
https://doi.org/10.21203/rs.3.rs-3326876/v1

Padmakumari, P., & Umamakeswari, A. (2019). Task Failure Prediction using Combine Bagging Ensemble
(CBE) Classification in Cloud Workflow. Wireless Personal Communications, 107(1), 23–40.
https://doi.org/10.1007/s11277-019-06238-9

Patel, S. S. (2020). Forecasting health of complex IT systems using system log data. Journal of Banking and
Financial Technology, 4(1), 27–35. https://doi.org/10.1007/s42786-019-00011-z

Prajapati, V., & Thakkar, V. (2020). A Survey on Failure Prediction Techniques in Cloud Computing.
EasyChair Preprint.

Proceedings of the 9th International Conference On Cloud Computing, Data Science and Engineering:
Confluence 2019 : 10-11 January 2019, Uttar Pradesh, India. (2019a). IEEE.

Rawat, A., Sushil, R., Agarwal, A., & Sikander, A. (2021). A New Approach for VM Failure Prediction using
Stochastic Model in Cloud. IETE Journal of Research, 67(2), 165–172.
https://doi.org/10.1080/03772063.2018.1537814

Rouf, R., Rasolroveicy, M., Litoiu, M., Nagar, S., Mohapatra, P., Gupta, P., & Watts, I. (2024). InstantOps: A Joint
Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native
Applications. ICPE 2024 - Proceedings of the 15th ACM/SPEC International Conference on Performance
Engineering, 119–129. https://doi.org/10.1145/3629526.3645047

Ruan, L., Xu, X., Xiao, L., Yuan, F., Li, Y., & Dai, D. (2019). A comparative study of large-scale cluster workload
traces via multiview analysis. Proceedings - 21st IEEE International Conference on High Performance
Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE
International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019, 397–404.
https://doi.org/10.1109/HPCC/SmartCity/DSS.2019.00067

Soualhia, M., Fu, C., & Khomh, F. (2019). Infrastructure fault detection and prediction in edge cloud
environments. Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, SEC 2019, 222–235.
https://doi.org/10.1145/3318216.3363305

Sun, Y., Xu, L., Li, Y., Guo, L., Ma, Z., & Wang, Y. (2018). Utilizing Deep Architecture Networks of VAE in Software
Fault Prediction. 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous
Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking,
Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), 870–877.
https://doi.org/10.1109/BDCloud.2018.00129

Swain, D., Kumar, P., Tushar, P., & Editors, A. (n.d.). Advances in Intelligent Systems and Computing 1311
Machine Learning and Information Processing Proceedings of ICMLIP 2020.
http://www.springer.com/series/11156

Tengku Asmawi, T. N., Ismail, A., & Shen, J. (2022). Cloud failure prediction based on traditional machine
learning and deep learning. Journal of Cloud Computing, 11(1). https://doi.org/10.1186/s13677-022-00327-0

Uddin Ahmed, K. M., Alvarez, M., & Bollen, M. H. J. (2020). Characterizing failure and repair time of servers in a
hyper-scale data center. IEEE PES Innovative Smart Grid Technologies Conference Europe, 2020-October,
660–664. https://doi.org/10.1109/ISGT-Europe47291.2020.9248891

Vo L. (n.d.). The Atlas Cluster Trace Repository. www.usenix.org

Wang, B., Hua, Q., Zhang, H., Tan, X., Nan, Y., Chen, R., & Shu, X. (2022). Research on anomaly detection and
real-time reliability evaluation with the log of cloud platform. Alexandria Engineering Journal, 61(9),
7183–7193. https://doi.org/10.1016/j.aej.2021.12.061

Yang, H., & Kim, Y. (2022). Design and Implementation of Machine Learning-Based Fault Prediction System in
Cloud Infrastructure. Electronics (Switzerland), 11(22). https://doi.org/10.3390/electronics11223765

Zheng, W., Wang, Z., Huang, H., Meng, L., & Qiu, X. (2016). EHMM-CT: An online method for failure prediction
in cloud computing systems. KSII Transactions on Internet and Information Systems, 10(9), 4087–4107.
https://doi.org/10.3837/tiis.2016.09.004
