AI-driven Prediction Based Energy-Aware Fault-Tolerant Scheduling Scheme (PEFS) For Cloud Data Center

ABSTRACT
Data centers are becoming increasingly popular for the provisioning of computing resources. The cost and operational expenses of data centers have skyrocketed with the increase in computing capacity [24]. Energy consumption is a growing concern for data center operators, as it is becoming one of the main entries on a data center's operational expenses (OPEX) bill. The Gartner Group estimates energy consumption to account for up to 10% of current OPEX, and this share is projected to rise to 50% in the next few years. A slice of roughly 40% is related to the energy consumed by information technology (IT) equipment, which includes the energy consumed by the computing servers as well as the data center network hardware used for interconnection. In fact, about one-third of the total IT energy is consumed by communication links, switching, and aggregation elements, while the remaining two-thirds is allocated to computing servers. Other systems contributing to data center energy consumption are the cooling and power distribution systems, which account for 45% and 15% of total energy consumption, respectively. The first data center energy-saving solutions operated on a distributed basis and focused on making the data center hardware energy efficient. A popular technique for power savings in computing systems is Dynamic Voltage and Frequency Scaling (DVFS). Finally, the selection of the optimal target PMs is modeled as an optimization problem that is solved using an improved particle swarm optimization algorithm. We evaluate our approach against five related approaches in terms of overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. Experimental results demonstrate the efficiency and effectiveness of our approach.
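The DVFS technique mentioned above trades clock frequency and supply voltage for power. As a minimal illustration (not the paper's power model; the capacitance and operating points are hypothetical), the classic CMOS dynamic power relation P = C * V^2 * f shows why lowering voltage together with frequency cuts power superlinearly:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Classic CMOS dynamic power model: P = C * V^2 * f (watts)."""
    return capacitance * voltage ** 2 * frequency

# Two hypothetical operating points for one server core.
p_full = dynamic_power(1e-9, 1.2, 2.0e9)    # nominal voltage/frequency
p_scaled = dynamic_power(1e-9, 0.9, 1.0e9)  # DVFS-scaled operating point
# Halving frequency while also lowering voltage cuts dynamic power to ~28%.
```

Because voltage enters quadratically, a DVFS-scaled operating point saves far more than the frequency reduction alone would suggest.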
1.1 PROBLEM DEFINITION:
The Prediction-based Energy-aware Fault-tolerant Scheduling scheme (PEFS) proposed in this paper involves two stages: 1) failure prediction and 2) task scheduling. The predictor is designed based on a DNN, trained and tested on the tasks in the HDS. This predictor estimates the probability of task failure, and based on this probability, tasks are classified into failure-prone and non-failure-prone tasks. Failure-prone and non-failure-prone tasks are organized into a failure-prone task queue and a non-failure task queue, respectively, and the two types of tasks are then scheduled separately. The power model is adopted to address the energy-saving concern of the CDC. Similarly, the fault model is used to design a fault-tolerant mechanism for the failure-prone tasks. Task failures may occur due to the unavailability of resources, hardware failures, execution cost and time exceeding a threshold value, the system running out of memory or disk space, over-utilization of resources, improper installation of required libraries, and so on. These faults can be transient or permanent and are assumed to be independent. Thus, a fault-tolerant scheduling scheme needs to guarantee that the deadlines of all tasks in the system are met before a fault occurs, even under the worst-case scenario. The replication strategy is widely used for fault tolerance; it generally replicates each task into two or more copies, which are then scheduled to different hosts. This creates more opportunities for wastage of resources and an unusual increase in energy consumption. Thus, in this paper, only failure-prone tasks are replicated. First, three consecutive tasks are taken from the failure-prone task queue, and each task is replicated into three copies. Then, the vector reconstruction method is designed to reconstruct super tasks from the replicated copies. Reconstructed super tasks are mapped to the most suitable hosts, allocated resources, and then scheduled on different hosts separately. The sequence of replicated copies within a super task is designed so that the executions of different copies of the same task on different hosts do not overlap, in order to avoid redundant execution.
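The two-stage flow above (classify tasks by predicted failure probability, then replicate only failure-prone tasks into interleaved super tasks) can be sketched as follows. The threshold, class names, and the rotation used for interleaving are assumptions for illustration, not the paper's actual vector reconstruction method:

```python
from collections import deque
from dataclasses import dataclass

FAILURE_THRESHOLD = 0.5  # hypothetical cut-off on the predicted failure probability

@dataclass
class Task:
    task_id: int
    failure_prob: float  # in PEFS this would come from the trained DNN predictor

def classify(tasks):
    """Split arriving tasks into failure-prone and non-failure-prone queues."""
    failure_q, normal_q = deque(), deque()
    for t in tasks:
        (failure_q if t.failure_prob >= FAILURE_THRESHOLD else normal_q).append(t)
    return failure_q, normal_q

def build_super_tasks(failure_q, copies=3, group=3):
    """Take three consecutive failure-prone tasks, replicate each into three
    copies, and rotate the copies across hosts so that no time slot ever
    holds two copies of the same task on the same host."""
    super_tasks = []
    while len(failure_q) >= group:
        batch = [failure_q.popleft() for _ in range(group)]
        # Row r is the copy sequence for host r; the rotation guarantees
        # copies of one task land on different hosts in different slots.
        super_tasks.append(
            [[batch[(i + r) % group] for i in range(group)] for r in range(copies)]
        )
    return super_tasks
```

With this layout, every host executes all three tasks once, and at any given slot the three hosts are running three different tasks, so copies of the same task never execute concurrently.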
1.2 SCOPE OF THE PROJECT
Firstly, a prediction model based on a machine learning approach is trained to classify arriving tasks into “failure-prone tasks” and “non-failure-prone tasks” according to the predicted failure rate. Then, two efficient scheduling mechanisms are proposed to allocate the two types of tasks to the most appropriate hosts in a CDC. The vector reconstruction method is developed to construct super tasks from failure-prone tasks and to separately schedule these super tasks and the non-failure-prone tasks to the most suitable physical hosts. When multiple task instances from different applications start to execute on numerous hosts, some of the hosts may fail accidentally, resulting in a fault in the system. This phenomenon is usually avoided by a fault-tolerance mechanism. Various factors can lead to a host failure, and one failure event often triggers another fault event. These failures may include operating system crashes, network partitions, hardware malfunctions, power outages, abrupt software failures, etc.
2. BACKGROUND
2.1 DOMAIN
2.2 SUB DOMAIN
3. LITERATURE SURVEY
1. TOPIC: Quantitative comparisons of the state-of-the-art data center
architectures
AUTHOR: K. Bilal, S. U. Khan, L. Zhang, H. Li, K. Hayat, S. A. Madani, N.
Min-Allah, L. Wang, D. Chen, M. Iqbal, C.-Z. Xu, and A. Y. Zomaya
YEAR: 2012
Data centers are experiencing a remarkable growth in the number of
interconnected servers. Being one of the foremost data center design concerns,
network infrastructure plays a pivotal role in the initial capital investment and
ascertaining the performance parameters for the data center. Legacy data center
network (DCN) infrastructure lacks the inherent capability to meet the data
centers growth trend and aggregate bandwidth demands. Deployment of even
the highest-end enterprise network equipment only delivers around 50% of the
aggregate bandwidth at the edge of network. The vital challenges faced by the
legacy DCN architecture trigger the need for new DCN architectures, to
accommodate the growing demands of the ‘cloud computing’ paradigm. We
have implemented and simulated the state of the art DCN models in this paper,
namely: (a) legacy DCN architecture, (b) switch-based, and (c) hybrid models,
and compared their effectiveness by monitoring the network: (a) throughput and
(b) average packet delay. The presented analysis may be perceived as a
background benchmarking study for the further research on the simulation and
implementation of the DCN-customized topologies and customized addressing
protocols in the large-scale data centers. We have performed extensive
simulations under various network traffic patterns to ascertain the strengths and
inadequacies of the different DCN architectures. Moreover, we provide a firm
foundation for further research and enhancement in DCN architectures.
2. TOPIC: DENS: Data Center Energy-Efficient Network-Aware Scheduling
AUTHOR: D. Kliazovich, P. Bouvry, and S. U. Khan,
YEAR: 2013
In modern data centers, energy consumption accounts for a considerably large slice of operational expenses. The state of the art in data center energy optimization focuses only on job distribution between computing servers based on workload or thermal profiles. This paper underlines the role of the communication fabric in data center energy consumption and presents a scheduling approach that combines energy efficiency and network awareness, termed DENS. The DENS methodology balances the energy consumption of a data center, individual job performance, and traffic demands. The proposed approach optimizes the tradeoff between job consolidation (to minimize the number of computing servers) and distribution of traffic patterns (to avoid hotspots in the data center network).
3.TOPIC: Energy-efficient data centers
AUTHOR: J. Shuja, S. A. Madani, K. Bilal, K. Hayat, S. U. Khan, and S.
Sarwar
YEAR: 2012
Energy consumption of the Information and Communication Technology (ICT)
sector has grown exponentially in recent years. A major component of the
today’s ICT is constituted by data centers, which have recently experienced unprecedented growth in size and population. Internet giants like Google, IBM and Microsoft house large data centers for cloud computing and application hosting. Many studies on the energy consumption of data centers point to the need to evolve strategies for energy efficiency. Due to the large-scale carbon dioxide emissions of electricity production, ICT facilities are indirectly responsible for considerable amounts of greenhouse gas emissions. Heat generated by these densely populated data centers needs large
cooling units to keep temperatures within the operational range. These cooling
units, obviously, escalate the total energy consumption and have their own
carbon footprint. In this survey, we discuss various aspects of the energy
efficiency in data centers with the added emphasis on its motivation for data
centers. In addition, we discuss various research ideas, industry adopted
techniques and the issues that need our immediate attention in the context of
energy efficiency in data centers.
4. TOPIC: Using proactive fault-tolerance approach to enhance cloud service
reliability
AUTHOR: J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya
YEAR: 2018
The large-scale utilization of cloud computing services for hosting
industrial/enterprise applications has led to the emergence of cloud service
reliability as an important issue for both cloud service providers and users. To
enhance cloud service reliability, two types of fault tolerance schemes, reactive
and proactive, have been proposed. Existing schemes rarely consider the
problem of coordination among multiple virtual machines (VMs) that jointly
complete a parallel application. Without VM coordination, the parallel
application execution results will be incorrect. To overcome this problem, we
first propose an initial virtual cluster allocation algorithm according to the VM
characteristics to reduce the total network resource consumption and total
energy consumption in the data center. Then, we model CPU temperature to
anticipate a deteriorating physical machine (PM). We migrate VMs from a
detected deteriorating PM to some optimal PMs. Finally, the selection of the
optimal target PMs is modeled as an optimization problem that is solved using
an improved particle swarm optimization algorithm. We evaluate our approach
against five related approaches in terms of the overall transmission overhead,
overall network resource consumption, and total execution time while executing
a set of parallel applications. Experimental results demonstrate the efficiency
and effectiveness of our approach.
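The abstract above does not specify the "improved" particle swarm optimization used to select target PMs. As an illustration of the underlying idea only, a plain (non-improved) PSO minimizing a purely hypothetical placement cost can be sketched:

```python
import random

def pso_minimize(cost, dim, bounds, n_particles=20, iters=100, seed=1):
    """Plain PSO: each particle tracks its personal best and is pulled
    toward both that and the swarm-wide best position."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights (typical values)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

# Hypothetical objective: prefer placements whose per-PM load matches each
# machine's spare capacity (illustrative only, not the paper's cost model).
spare = [0.3, 0.8, 0.5]
best, best_cost = pso_minimize(
    lambda x: sum((xi - s) ** 2 for xi, s in zip(x, spare)),
    dim=3, bounds=(0.0, 1.0))
```

A real PM-selection problem is discrete and multi-constraint; the "improved" variant in the paper would replace this toy continuous objective and add constraint handling.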
5. TOPIC: Probabilistic model for evaluating a proactive fault tolerance
approach in the cloud
AUTHOR: O. Hannache and M. Batouche
YEAR: 2015
Cloud computing is an emerging paradigm where computing services are
provided across the web. Virtualization powers the cloud by mutualizing
physical resources thus ensuring flexibility and high availability of the cloud.
Mechanisms such as fault tolerance, load balancing, and security provisioning aim to foster availability, but classic reactive fault-tolerance techniques prove to be greedy in terms of memory and recovery time. In contrast, proactive
fault tolerance is possible by preemptive virtual machine migration requiring a
strong and accurate failure predictor. In quest of an effective approach for
proactive fault tolerance we introduce in this paper a probabilistic model of the
cloud with a failure generator for evaluating a proposed approach based on three
scenarios of virtual machine migration.
6. TOPIC: Failover strategy for fault tolerance in cloud computing environment
AUTHOR: B. Mohammed, M. Kiran, M. Kabiru, and I.-U. Awan
YEAR: 2015
Cloud fault tolerance is an important issue in cloud computing platforms and
applications. In the event of an unexpected system failure or malfunction, a
robust fault-tolerant design may allow the cloud to continue functioning
correctly possibly at a reduced level instead of failing completely. To ensure
high availability of critical cloud services, the application execution, and
hardware performance, various fault-tolerant techniques exist for building self-
autonomous cloud systems. In comparison with current approaches, this paper
proposes a more robust and reliable architecture using optimal checkpointing
strategy to ensure high system availability and reduced system task service
finish time. Using pass rates and virtualized mechanisms, the proposed smart
failover strategy (SFS) scheme uses components such as cloud fault manager,
cloud controller, cloud load balancer, and a selection mechanism, providing
fault tolerance via redundancy, optimized selection, and checkpointing. In our
approach, the cloud fault manager repairs faults generated before the task time
deadline is reached, blocking unrecoverable faulty nodes as well as their virtual
nodes. This scheme is also able to remove temporary software faults from
recoverable faulty nodes, thereby making them available for future request. We
argue that the proposed SFS algorithm makes the system highly fault tolerant by
considering forward and backward recovery using diverse software tools.
Compared with existing approaches, preliminary experiment of the SFS
algorithm indicates an increase in pass rates and a consequent decrease in
failure rates, showing an overall good performance in task allocations. We
present these results using experimental validation tools with comparison with
other techniques, laying a foundation for a fully fault-tolerant infrastructure as a
service cloud environment.
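The checkpointing idea used by SFS bounds the work lost to a failure by the checkpoint interval: on failover, execution resumes from the last saved state rather than from scratch. A toy illustration (class, step counts, and failure injection are invented for the example, not part of SFS):

```python
class CheckpointedTask:
    """Toy checkpoint/restart: after a failure, work resumes from the last
    durably saved step rather than from zero (names are illustrative)."""

    def __init__(self, total_steps, interval):
        self.total_steps = total_steps
        self.interval = interval
        self.checkpoint = 0  # last step persisted to stable storage

    def run(self, fail_at=None):
        """Execute from the last checkpoint; optionally inject one failure."""
        step = self.checkpoint
        while step < self.total_steps:
            step += 1
            if step == fail_at:
                raise RuntimeError("node failed at step %d" % step)
            if step % self.interval == 0:
                self.checkpoint = step  # persist progress
        return step

task = CheckpointedTask(total_steps=100, interval=10)
try:
    task.run(fail_at=37)   # simulated node failure mid-execution
except RuntimeError:
    pass
finished = task.run()      # failover resumes from step 30, not step 0
```

Choosing the interval is the optimization SFS-style schemes face: frequent checkpoints cost overhead during normal operation, while sparse ones increase the work redone after a failure.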
7. TOPIC: Elastic reliability optimization through peer-to-peer checkpointing in
cloud computing
AUTHOR: J. Zhao, Y. Xiang, T. Lan, H. H. Huang, and S. Subramaniam
YEAR: 2017
Modern day data centers coordinate hundreds of thousands of heterogeneous
tasks and aim at delivering highly reliable cloud computing services. Although
offering equal reliability to all users benefits everyone at the same time, users
may find such an approach either inadequate or too expensive to fit their
individual requirements, which may vary dramatically. In this paper, we
propose a novel method for providing elastic reliability optimization in cloud
computing. Our scheme makes use of peer-to-peer checkpointing and allows
user reliability levels to be jointly optimized based on an assessment of their
individual requirements and total available resources in the data center. We
show that the joint optimization can be efficiently solved by a distributed
algorithm using dual decomposition. The solution improves resource utilization
and presents an additional source of revenue to data center operators. Our
validation results suggest a significant improvement of reliability over existing
schemes.
8. TOPIC: A comparative study into distributed load balancing algorithms for
cloud computing
AUTHOR: M. Randles, D. Lamb, and A. Taleb-Bendiab
YEAR: 2010
The anticipated uptake of Cloud computing, built on well-established research
in Web Services, networks, utility computing, distributed computing and
virtualisation, will bring many advantages in cost, flexibility and availability for
service users. These benefits are expected to further drive the demand for Cloud
services, increasing both the Cloud's customer base and the scale of Cloud
installations. This has implications for many technical issues in Service
Oriented Architectures and Internet of Services (IoS)-type applications;
including fault tolerance, high availability and scalability. Central to these
issues is the establishment of effective load balancing techniques. It is clear the
scale and complexity of these systems makes centralized assignment of jobs to
specific servers infeasible; requiring an effective distributed solution. This paper
investigates three possible distributed solutions proposed for load balancing;
approaches inspired by Honeybee Foraging Behaviour, Biased Random
Sampling and Active Clustering.
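The distributed schemes surveyed above avoid any central assignment of jobs to servers. A much-simplified cousin of the Biased Random Sampling approach can be sketched as a "probe a few servers, join the least loaded" rule; the sampling stands in for the random walk over the virtual load graph described in the paper:

```python
import random

def sample_and_assign(loads, jobs, sample_size=2, rng=None):
    """Each job probes a small random sample of servers and joins the
    least-loaded one -- no central view of the whole cluster is needed."""
    rng = rng or random.Random(0)
    for _ in range(jobs):
        candidates = rng.sample(range(len(loads)), sample_size)
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return loads

loads = sample_and_assign([0] * 10, jobs=1000)
# Loads stay close to the 100-job average despite purely local decisions.
```

Even sampling only two candidates per job keeps the servers nearly balanced, which is why such local schemes scale where centralized assignment is infeasible.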
9. TOPIC: Energy-aware fault-tolerant dynamic task scheduling scheme for
virtualized cloud data center
AUTHOR: A. Marahatta, Y.-S. Wang, F. Zhang, A. K. Sangaiah, S. K. Sah
Tyagi, and Z. Liu
YEAR: 2021
Resource scheduling is a challenging job in multi-cloud environments. Multi-cloud technology has attracted much research aimed at solving the problems of vendor lock-in, reliability, interoperability, etc. The uncertainty of multi-cloud environments with heterogeneous user demands makes it challenging to dispense resources on user demand. Researchers remain focused on deriving efficient, optimized resource allocation management from the existing resource allocation policies in multi-cloud environments. The research aims to provide a broad systematic literature analysis of resource management in the area of multi-cloud environments. A number of optimization techniques are discussed among the open issues and future challenges, in consideration of flexibility and reliability in present environments. To analyse the literature, it is necessary to cover the existing homogeneous/heterogeneous user demands, cloud applications, and the algorithms to manage them in multi-clouds. In this paper, we present the definition and classification of resource allocation techniques in multi-clouds and a generalized taxonomy for resource management in cloud environments. Finally, we explore the open challenges and future directions of resource management in a multi-cloud environment.
10. TOPIC: Network failure-aware redundant virtual machine placement in a
cloud data center
AUTHOR: A. Zhou, S. Wang, C.-H. Hsu, M. H. Kim, and K. S. Wong
YEAR: 2017
Cloud has become a very popular infrastructure for many smart city
applications. A growing number of smart city applications from all over the
world are deployed on clouds. However, node failure events in the cloud data center have a negative impact on the performance of smart city applications. Survivable virtual machine placement has been proposed by researchers to enhance service reliability. Because they ignore switch failures, current survivable virtual machine placement approaches cannot achieve the best effect. In this paper, we aim to enhance service reliability by
designing a novel network failure–aware redundant virtual machine placement
approach in a cloud data center. Firstly, we formulate the network failure–aware
redundant virtual machine placement problem as an integer nonlinear
programming problem and prove that the problem is NP-hard. Secondly, we
propose a heuristic algorithm to solve the problem. Finally, extensive simulation
results show the effectiveness of our algorithm.
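The paper's heuristic for the NP-hard redundant placement problem is not detailed in the abstract. As a hedged illustration of the general idea only, a greedy stand-in might place each VM's primary and redundant copy on two different hosts, preferring the more reliable ones (all names and data structures here are hypothetical):

```python
def place_redundant(vm_demands, capacity, fail_prob):
    """Greedy stand-in for a redundant-placement heuristic: give each VM a
    primary and a redundant copy on two *different* hosts, preferring
    reliable hosts, so one host failure cannot take out both copies."""
    placement = {}
    order = sorted(capacity, key=fail_prob.__getitem__)  # most reliable first
    for vm, demand in vm_demands.items():
        chosen = []
        for host in order:
            if capacity[host] >= demand and host not in chosen:
                capacity[host] -= demand
                chosen.append(host)
                if len(chosen) == 2:  # primary + redundant copy placed
                    break
        placement[vm] = chosen
    return placement
```

The paper's network failure-aware formulation additionally models switch failures, so its real algorithm must also keep the two copies from sharing a vulnerable switch, which this sketch omits.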
4. SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
Our evaluation results show that the proposed scheme can intelligently predict task failures, achieve better fault tolerance, and reduce total energy consumption compared with the existing schemes.
The existing fault-tolerant techniques in CDCs include replication, checkpointing, job migration, retry, task resubmission, etc. Some studies introduced methods
based on certain principles, such as retry, resubmission, replication, renovation
of software, screening, and migration, to harmonize the fault-tolerant
mechanism with CDC task scheduling. However, for parallel and distributed
computing systems, the most widely adopted and acknowledged method is to
replicate data to multiple hosts. A rearrangement-based improved fault-tolerant scheduling algorithm (RTFR) has been presented to deal with the dynamic scheduling issue for tasks in cloud systems. A primary-backup model is adopted to realize fault tolerance in this method. The corresponding backup copy is released once the primary replica completes, freeing the resources it occupies. In addition, waiting tasks can be rearranged to utilize the released resources. In contrast, in most existing algorithms, once a task is sent to the waiting queue of a virtual machine, the execution sequence is fixed and cannot be changed.
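The primary-backup release-and-rearrange behavior described above can be sketched with toy bookkeeping; slot granularity, names, and the admission rule are simplifications for illustration, not RTFR itself:

```python
from collections import deque

class PrimaryBackupScheduler:
    """Toy bookkeeping in the spirit of primary-backup scheduling: every
    admitted task reserves a primary slot and a backup slot; the backup
    reservation is released as soon as the primary completes, and waiting
    tasks are rearranged onto the freed capacity."""

    def __init__(self, capacity):
        self.free = capacity   # free slots across all hosts
        self.running = set()
        self.waiting = deque()

    def submit(self, task_id):
        if self.free >= 2:     # one primary slot + one backup slot
            self.free -= 2
            self.running.add(task_id)
            return True
        self.waiting.append(task_id)
        return False

    def primary_done(self, task_id):
        """Primary finished: release both slots and admit waiting tasks."""
        self.running.remove(task_id)
        self.free += 2
        while self.waiting and self.free >= 2:
            self.submit(self.waiting.popleft())
```

For example, with four slots, two tasks run while a third waits; completing one primary releases its reservation and the waiting task is admitted, which is exactly the rearrangement opportunity that fixed-sequence schemes forgo.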
In addition, the performance of the proposed scheduling scheme, i.e. PEFS, is compared with some existing techniques: the real-time fault-tolerant scheduling algorithm with rearrangement (RFTR), the dynamic fault-tolerant scheduling mechanism (DFTS), and modified breadth-first search (MBFS), as all of them are designed for fault-tolerant scheduling, whereas in most existing algorithms the executing sequence is settled after sending tasks to the waiting queue of a VM. Experiments on the Internet data set and the Euler data set are conducted, and the experimental results validate the merits of the proposed scheme in comparison with the existing techniques.