
5th International Conference on Electronics and Sustainable Communication Systems (ICESC 2024)

IEEE XPlore Part Number: CFP24V66-ART; ISBN: 979-8-3503-7994-5

Adaptive Cloud Load Balancing with Reinforcement Learning: Leveraging Google Cluster Data

V Esther Jyothi, Sahithi Alugoju, K Indraja
Department of Computer Applications, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India
[email protected], [email protected], [email protected]

N Sampreeth Chowdary, A Madhuri
Department of CSE, P.V.P. Siddhartha Institute of Technology, Vijayawada, India
[email protected], [email protected]

S Sindhura
Department of CSE, NRI Institute of Technology, Agiripalli, India
[email protected]

2024 5th International Conference on Electronics and Sustainable Communication Systems (ICESC) | 979-8-3503-7994-5/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICESC60852.2024.10689973

Abstract— This research study aims to develop an RL-based cloud load balancer leveraging Google cluster data. The primary objective is to optimize cloud resource allocation and load distribution using RL techniques. The dataset includes various metrics that are useful for decision-making in load balancing, such as CPU utilization, memory usage, resource requests, and system configurations. The proposed methodology includes systematic data collection, exploratory data analysis (EDA), and preprocessing, including handling missing values and feature engineering. This study focuses on designing and training the RL agent using state-of-the-art algorithms like Q-learning and Deep Q Networks, calibrated to the intricacies of cloud load balancing. To replicate real-world dynamics, a simulated cloud environment with servers of different configurations is created and several tasks are assigned. Whenever different tasks are encountered, all the metric values are calculated and the cumulative reward is used by the RL agent for decision making. The research outcomes include enhanced efficiency in load balancing, improved resource utilization, and adaptability to changing demands. This research study aims to contribute significantly to cloud computing, setting new standards of resource management and operational efficiency.

Keywords— Reinforcement Learning, Cloud Load Balancing, Google Cluster Data, Resource Allocation, Q-learning, Cloud Computing, Simulation Environment.

I. INTRODUCTION

Cloud computing is emerging as a fundamental infrastructure because it addresses the drawbacks of traditional computation. It delivers computing resources such as servers, storage, databases, and software via the Internet. This shift gives people easy access to computing, thereby accelerating innovation, and it leads to wide adoption of cloud services by businesses, which can concentrate on their core work rather than managing servers, bearing high costs, and facing scalability issues. With this transformation, cloud service providers like Microsoft, AWS, and Google offer a comprehensive range of services from IaaS to PaaS. As cloud computing plays a crucial role in this digital era, there is a need to manage heavy demands and fluctuations. These challenges cannot be addressed by traditional load balancers, as they have their limitations.

This research study aims to develop a load balancer that yields better results than traditional balancers such as least response time, round-robin, and least connections. The round-robin algorithm assigns tasks in sequential order and is useful when servers have identical specifications and tasks are uniform; it does not consider the current capabilities of each server, leading to inefficient distribution. Similarly, the least connections algorithm aims for even task assignments but is less effective when session lengths vary [11]. The least response time algorithm assigns tasks based on the fewest connections and the lowest average response time, but it does not consider other metrics such as server capabilities, task type, and memory usage. All the traditional load balancers work through static rules and are unintelligent. Therefore, an RL-based load balancer [1] is developed that considers 34 different metrics for load balancing and offers decision-making, continuous learning, and adaptability. The agent is trained using Google cluster data, which is a snapshot of dynamic real-world scenarios. This RL load balancer can be integrated with different cloud platforms, thereby optimizing results.
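For reference, the following is a minimal Python sketch of the three static baselines mentioned above (round-robin, least connections, least response time). The Server fields and the selection helpers are illustrative assumptions, not the implementations evaluated later in this paper.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Server:
    name: str
    active_connections: int = 0   # currently open sessions
    avg_response_ms: float = 0.0  # rolling average response time

def round_robin(servers):
    """Assign tasks in a fixed cyclic order, ignoring current server load."""
    for server in cycle(servers):
        yield server

def least_connections(servers):
    """Pick the server with the fewest active connections."""
    return min(servers, key=lambda s: s.active_connections)

def least_response_time(servers):
    """Pick the server with the fewest connections and lowest average response time."""
    return min(servers, key=lambda s: (s.active_connections, s.avg_response_ms))

if __name__ == "__main__":
    pool = [Server("web-1", 3, 120.0), Server("web-2", 1, 90.0), Server("web-3", 1, 150.0)]
    rr = round_robin(pool)
    print(next(rr).name, next(rr).name)    # web-1 web-2
    print(least_connections(pool).name)    # web-2 (first server with the fewest connections)
    print(least_response_time(pool).name)  # web-2 (tie broken by response time)
```

None of these helpers looks at CPU, memory, or task type, which is exactly the limitation the RL-based balancer is designed to overcome.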
A. System Architecture

The system environment is represented in Fig. 1, which shows the Google cluster data as input to the model. To start the RL model, we train the agent with the Google cluster dataset 2019. Reinforcement learning involves a state, an agent, and an action. In our project, the state is the incoming task, the agent is the load balancer, and the action is the distribution of traffic across different servers. The figure shows that the agent works on value and policy. The value is a cumulative number calculated as the sum of all metrics. A positive reward is given for a value greater than the threshold; otherwise, a negative reward is given.
With continuous learning, the agent makes optimal decisions aiming for maximum rewards. The action is performed on the cloud environment, Azure cloud in this case. Tasks are assigned to different servers to yield maximum throughput, reduced latency, and a low error rate. The servers have different configurations, allowing each of them to handle CPU-intensive tasks, database handling, file handling, network handling, and so on. Furthermore, the allocation of tasks within the system is managed dynamically by the RL agent, which receives task requests from the users through the web browser interface.

Fig. 1. System Environment Overview
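To make the state and value formulation concrete, below is a small sketch of how an incoming task and a candidate server's metrics could be combined into a state vector and a cumulative value score. The metric names and equal weights are illustrative assumptions; the paper's agent draws on the 34 metrics of the Google cluster dataset.

```python
import numpy as np

# Illustrative subset of the metrics mentioned in the paper (the real agent uses 34).
METRICS = ["cpu_utilization", "memory_usage", "queue_length", "network_traffic"]

def state_vector(task: dict, server: dict) -> np.ndarray:
    """State = metrics of the incoming task concatenated with the candidate server's load."""
    return np.array([task[m] for m in METRICS] + [server[m] for m in METRICS], dtype=float)

def cumulative_value(server: dict, weights: dict = None) -> float:
    """Value = weighted sum of the server's current metrics (the 'value' in Fig. 1)."""
    weights = weights or {m: 1.0 for m in METRICS}
    return sum(weights[m] * server[m] for m in METRICS)

task = {"cpu_utilization": 0.6, "memory_usage": 0.3, "queue_length": 4, "network_traffic": 0.2}
server = {"cpu_utilization": 0.4, "memory_usage": 0.5, "queue_length": 2, "network_traffic": 0.1}
print(state_vector(task, server))
print(cumulative_value(server))
```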
II. RELATED WORKS

Research studies have been conducted on cloud computing to optimize the utilization of resources along with load balancing in dynamic environments, emphasizing the significance of machine learning (ML) and reinforcement learning (RL). An RL-based approach has been proposed by Lahande et al., 2023 [1] to demonstrate improved resource utilization and latency reduction, specifically tailored to cloud environments, compared to conventional methods. Alfarhood et al., 2022 [2] conducted studies on constrained deep reinforcement learning (DRL) to address the constraints on smart load balancing in network systems, which is crucial for optimizing load distribution. ML algorithms and RL principles have been implemented and tested to optimize load balancing in cloud computing; this work helped in selecting strategies for cloud load balancing and in integrating these solutions with cloud infrastructures, thus highlighting the need for further research (Muchori et al., 2022) [3]. A study conducted by Kaveri, 2023 [4] introduced a novel approach to optimize cloud resource allocation using RL and adjusted resource scheduling driven by real-time data, demonstrating efficiency in load distribution and processing time. These findings help in addressing key challenges in cloud computing and cloud resource management and pave the path for further research in enhancing adaptability to diverse cloud services and workloads.

The application of deep reinforcement learning (DRL) to the development of load balancing is explored by Deep, 2021 [5]. Another study aims at self-optimizing computational loads in multi-server cloud environments [12]. To establish the validity of scalability and reliability when testing real-world scenarios, a detailed framework has been provided that helps in applying DRL in practical cloud computing scenarios. When DRL algorithms are integrated, the system continuously learns efficient load distribution strategies to enhance system stability and reduce processing delays and latency. RL methodologies also address load balancing and resource allocation in the realm of smart cities, focusing on the enhancement of resource distribution efficiency in the dynamic environment of smart city infrastructure (Alorbani, 2021) [6]. The implementation of an RL algorithm helps in developing a system capable of adapting to changes in resource demands, which improves resource utilization efficiency; it also decreases operational costs and enhances service delivery, and these insights are valuable for building complex systems. A study conducted by Ramesh, 2021 [7] states that ML techniques and models tested for efficacy in load balancing, when incorporating self-learning capabilities such as RL, outperform traditional strategies. Different ML approaches and their practical implications in cloud computing provide solutions to challenges in cloud resource management, and further research is needed to fine-tune these models for specific cloud architectures and workload scenarios. The study of a novel load-balancing scheme for mobile edge computing conducted by a group of authors shows the need for advanced approaches to load balancing [8]. Armbrust's view of cloud computing provided information regarding cloud environments [9]. The paper [10] showed the need for deep learning for networks with dense traffic.

III. PROPOSED WORK

The methodology encompasses the implementation of the reinforcement learning algorithm, its customization to address load balancing requirements [13], the design of a user-friendly interface, the selection and description of the dataset, and the setup of a simulated environment for evaluation. Each facet of the methodology is detailed to provide clarity on the processes involved and the rationale behind the methodological decisions, facilitating reproducibility and contributing to advancements in cloud computing optimization.

A. Reinforcement Learning Algorithm Overview

The methodology involves four main steps. The first is data preprocessing and validation, which is important to make the data ready to be given as input and suitable for learning by the RL agent. The next step is developing the RL algorithm in such a way that the agent can make decisions, receive feedback in terms of rewards or penalties, and then adjust its actions to maximize cumulative rewards. The third step involves implementing and deploying the RL-based load balancer into the simulation environment. The final step is creating a user-friendly web interface where a user can submit several tasks and monitor the servers to which the tasks are assigned.

The RL algorithm needs to be adapted for load balancing in the cloud environment. The algorithm used is Q-learning, which is suitable for calculating the cumulative value of the metrics.
The state representation of the algorithm consists of metrics such as the current load of each server, the queue length of requests, the number of requests, memory usage, network traffic [15], etc.

The action of the load balancer is to distribute incoming traffic across different servers. The reward function is crucial in RL: it is defined in such a way that it gives a signal to the agent about how good or bad its action was. For a simple load balancer, the reward might be negative for every request that is not handled within a certain time. As the RL-based load balancer considers all the metrics, such as CPU utilization, memory usage, and time taken, we keep a threshold value against which the reward is checked. If the cumulative value is less than the threshold, the agent receives a negative reward; otherwise, it receives a positive reward.
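A minimal sketch of this threshold-based reward is shown below; the threshold and the plus/minus one reward magnitudes are assumptions for illustration, not the calibrated values used in the experiments.

```python
def reward(cumulative_value: float, threshold: float = 0.75) -> float:
    """Threshold-based reward: the agent is rewarded when the cumulative metric
    value for the chosen server meets the threshold, and penalized otherwise."""
    return 1.0 if cumulative_value >= threshold else -1.0

print(reward(0.9))   # 1.0  -> good placement
print(reward(0.4))   # -1.0 -> poor placement
```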
B. Utilization of Dataset for Analysis

Training of the RL agent was conducted in a simulated cloud environment that mimics real-world scenarios, using historical data to train the agent. We split the data into training and validation sets to evaluate the agent's performance, and after training we evaluated the RL agent in the simulated environment. As the environment may change over time, we keep re-training or fine-tuning the agent to adapt to new situations.

The essential components of the RL model for optimizing cloud resource utilization through load balancing (LB) are outlined as follows:

• Learning Agent (LA): The learning agent facilitates the equitable distribution of incoming task loads across available VMs for computation. Over time, the LA dynamically selects the most suitable RL algorithm for LB, adapting to the evolving dynamics of the cloud computing environment.
• Environment (E): The environment constitutes the cloud computing landscape where submitted tasks undergo processing on available cloud VMs, striving to deliver optimal Quality of Service (QoS) to end-users.
• State Space (S): Represents the possible states the agent can be in.
• Action Space (A): Defines the actions available to the agent in each state.
• Reward Function (R(S, A)): Computes the immediate reward based on the state-action pair.
• Q-Learning Update Rule: Updates the Q-values based on observed rewards and state transitions.

The Q-Learning update rule is given by:

Q(S, A) ← Q(S, A) + α [ R(S, A) + γ max_a′ Q(S′, a′) − Q(S, A) ]

Where:
• Q(S, A) is the Q-value for the state-action pair (S, A).
• α is the learning rate, determining the extent of Q-value updates.
• γ is the discount factor, balancing immediate rewards with future rewards.
• S′ is the next state after taking action A in state S.
• a′ is the next action chosen based on the policy.
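The update rule above maps directly onto a tabular implementation such as the following sketch; the epsilon-greedy exploration, the discretized state key, and the hyperparameter values are assumptions rather than the calibrated settings of the study.

```python
import random
from collections import defaultdict

class QLearningBalancer:
    """Tabular Q-learning over (state, server) pairs, following the update rule above."""

    def __init__(self, n_servers, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n_servers = n_servers
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # Q[(state, action)] -> estimated value

    def choose_server(self, state):
        """Epsilon-greedy policy: usually exploit the best-known server, sometimes explore."""
        if random.random() < self.epsilon:
            return random.randrange(self.n_servers)
        return max(range(self.n_servers), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Q(S,A) <- Q(S,A) + alpha * [R(S,A) + gamma * max_a' Q(S',a') - Q(S,A)]."""
        best_next = max(self.q[(next_state, a)] for a in range(self.n_servers))
        self.q[(state, action)] += self.alpha * (reward + self.gamma * best_next - self.q[(state, action)])
```

In use, each incoming task would be discretized into a state key, choose_server selects a VM, and update is applied with the threshold-based reward once the task's metrics are observed.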
In our proposed system, tasks are submitted to the cloud for computation and subsequently added to the environment's queue. RL algorithms, integrated within the LA, dynamically select tasks based on predefined policies and manage rewards in the Q-Table. By utilizing these rewards as feedback for LB, the LA redirects future tasks to VMs with optimal resource utilization, thereby enhancing the LB mechanism. Although the LA may initially receive lower rewards, it gradually learns to optimize LB decisions, ultimately improving overall system performance.

1) Dataset Overview: The Google Cluster Workload Traces 2019 dataset stands as a testament to the evolution of cloud computing research, offering an extensive collection of workload data sourced from eight Google Borg compute clusters throughout May 2019. The dataset, obtained from Kaggle, serves as a cornerstone for understanding the dynamics of large-scale cluster management and workload scheduling within a cloud environment. As the project focuses on heavy traffic management, the Google cluster workload traces 2019 dataset emerges as a key resource covering job submission, scheduling decisions, and resource usage across diverse clusters.

2) Data Collection and Preprocessing:
a) Data Collection: The first step of the project is data collection. Data should be collected from various sources; in our project, a dataset with 34 metrics is available on Kaggle and was downloaded for use.
b) Data Validation: Once the data is collected, it undergoes a validation process to identify and remove any irrelevant or incomplete data entries [14]. This step is crucial for the integrity of the simulation results.
c) Data Transformation: Transforming data into a usable format involves several sub-steps:
• Normalization: Scaling numerical data to a specific range to avoid bias due to varying scales.
• Encoding: Converting categorical data into numerical data, as the algorithm can interpret only numerical data.
• Feature Engineering: Deriving new features that might be useful for accurate job handling.
• Dimensionality Reduction: Reducing the number of features in high-dimensional datasets using techniques like Principal Component Analysis, which helps simplify the model without significant loss of information.
d) Data Cleaning: Data cleaning addresses issues like missing values, duplicates, and outliers. Techniques such as imputation for missing values or thresholding for outliers are used to ensure that the data is clean and reliable.
e) Data Security: Ensuring the privacy and security of data is necessary. This includes implementing access controls, encryption, and regular audits to protect data from unauthorized access and breaches.

In summary, data preprocessing and management in the simulation environment is about establishing a robust, secure, and efficient pipeline that prepares data for processing by the Azure Function handlers, ensuring the system's overall effectiveness and reliability.
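As an illustration of the transformation steps listed above (imputation, encoding, normalization, and dimensionality reduction), the following is a small pandas/scikit-learn sketch; the column names are hypothetical stand-ins for fields in the Google cluster traces.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical slice of the workload data; the real traces carry 34 metrics.
df = pd.DataFrame({
    "cpu_request": [0.2, 0.8, 0.5, None],
    "memory_request": [0.1, 0.6, 0.4, 0.3],
    "scheduling_class": ["batch", "latency", "batch", "latency"],
})

df["cpu_request"] = df["cpu_request"].fillna(df["cpu_request"].median())  # imputation
df = pd.get_dummies(df, columns=["scheduling_class"])                      # encoding

numeric = ["cpu_request", "memory_request"]
df[numeric] = MinMaxScaler().fit_transform(df[numeric])                    # normalization

reduced = PCA(n_components=2).fit_transform(df)                            # dimensionality reduction
print(reduced.shape)
```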
C. Simulation Environment

The simulation environment is created in an Azure workspace using Azure Functions. Azure Functions promote serverless computing, which facilitates server creation at the time of simulation and destroys servers when the job is done. This is a cost-effective and energy-saving technique for the simulation and testing of the load balancer. Below, we delve into the specifics of this simulated setup.
1) System Architecture: The system architecture comprises a set of Azure Functions, each representing a unique handler with specific computational strengths. These handlers are orchestrated by the load balancer, which intelligently routes jobs to the most appropriate handler. Redirecting tasks to the appropriate handlers results in optimal utilization of resources and maximum throughput.
2) Load Balancer Design: The load balancer is designed with a smart routing algorithm that analyzes incoming job requests and assigns them to handlers based on their tagged resource intensity. This decision-making process considers the current system load, job requirements, and individual handler capabilities.
3) Handler Specifications: Each handler is tailored for a particular type of job:
• MemoryIntensiveHandler: Useful for tasks that need a high amount of RAM.
• ComputeHandler: Best suited for CPU-intensive tasks.
• NetworkHandler: Ideal for jobs that demand high network throughput.
• DatabaseHandler: Configured for tasks requiring extensive database interactions.
• IOIntensiveHandler: Designed for jobs with high input/output operations.
4) Job Assignment Process: When a job request is received, the load balancer queries the status and capacity of each handler. Based on the job's requirements, such as memory usage, CPU load, and network bandwidth, it selects the optimal handler. The job is then queued in the selected handler's task list, awaiting processing.
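A simplified sketch of this assignment step is shown below: the balancer inspects the job's dominant requirement and routes it to the matching handler queue. The tag names, scaling constants, and queue structure are assumptions for illustration; the deployed system relies on the RL agent's learned values rather than a fixed mapping.

```python
from collections import deque

# Handler task queues, keyed by the handler names described above.
HANDLERS = {
    "MemoryIntensiveHandler": deque(),
    "ComputeHandler": deque(),
    "NetworkHandler": deque(),
    "DatabaseHandler": deque(),
    "IOIntensiveHandler": deque(),
}

def select_handler(job: dict) -> str:
    """Route a job to the handler matching its dominant resource requirement."""
    requirements = {
        "MemoryIntensiveHandler": job.get("memory_gb", 0) / 32,
        "ComputeHandler": job.get("cpu_cores", 0) / 16,
        "NetworkHandler": job.get("network_mbps", 0) / 1000,
        "DatabaseHandler": job.get("db_queries", 0) / 500,
        "IOIntensiveHandler": job.get("io_ops", 0) / 10_000,
    }
    return max(requirements, key=requirements.get)

def assign(job: dict) -> str:
    handler = select_handler(job)
    HANDLERS[handler].append(job)   # queued in the selected handler's task list
    return handler

print(assign({"job_id": 1, "cpu_cores": 12, "memory_gb": 4, "io_ops": 200}))  # ComputeHandler
```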
5) Monitoring and Metrics: The system includes a robust monitoring solution to track the performance of each handler. Metrics such as memory usage, swap usage, cache size, and the number of active sessions are continuously monitored to provide real-time insights into system performance.
6) Scalability and Reliability: The simulation environment is designed to be scalable, allowing for the addition of more handlers or the enhancement of existing handlers' capabilities to handle increased loads. This is supported by fallback mechanisms and redundancy among handlers, so if one handler fails, the load balancer can reroute jobs to other available handlers without interruption.
7) Usability and Access: The handlers are accessible through unique URLs that serve as API endpoints. These endpoints are useful for managing job requests remotely, demonstrating the system's flexibility and ease of integration with various client applications.
8) Security and Compliance: Security measures are implemented to ensure that each job is processed securely. The Azure Function App is configured to meet compliance standards, ensuring that data handling and processing are performed according to industry best practices.
9) Future Enhancements: The simulation environment has room for enhancements such as implementing AI-based predictive load balancing, integrating advanced analytics for performance optimization, and exploring auto-scaling capabilities based on predictive modelling of incoming job requests.
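Since each handler is exposed as an HTTP-triggered Azure Function, an individual handler can be sketched as below using the Python programming model for Azure Functions; the endpoint name, job fields, and response shape are assumptions, and the accompanying function.json binding configuration is omitted.

```python
# __init__.py of an HTTP-triggered Azure Function (Python v1 programming model).
# Sketch of a ComputeHandler-style endpoint; binding config (function.json) not shown.
import json
import logging

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    """Accept a job description and acknowledge that this handler queued it."""
    try:
        job = req.get_json()
    except ValueError:
        return func.HttpResponse("Invalid JSON body", status_code=400)

    logging.info("ComputeHandler received job %s", job.get("job_id"))
    body = json.dumps({"handler": "ComputeHandlerFunction",
                       "job_id": job.get("job_id"),
                       "status": "queued"})
    return func.HttpResponse(body, mimetype="application/json", status_code=202)
```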
D. User Interface Design

For users to interact with the cloud environment, we developed a simple user interface for the system. It enables users to assign any number of tasks to the system and to monitor real-time feedback on the allocation of jobs to servers. Users are shown the configurations and details of the server to which each task is assigned. The following components constitute the user-interface design:
1) Job Submission Input Field: A text field displayed on the webpage enables users to specify the number of jobs they intend to submit for processing. The input field is flexible and easy to use, and it takes care of invalid inputs by showing error messages to the users.
2) Submit Button: Adjacent to the job submission input field is a submit button, which triggers the initiation of the load balancing algorithm. When a user clicks the submit button, the system processes the submitted jobs and allocates them to the appropriate servers based on their characteristics.
3) Job Allocation Details Display: When the jobs are submitted, users are redirected to the job details display page, where all the jobs can be seen along with their job types. Each job is handled by a specific URL, which is a redirection to the server handling that job. The display is comprehensible, providing clear insights into the workload distribution.
4) Real-time Updates: The UI incorporates mechanisms for real-time updates, ensuring that users receive instantaneous feedback on the status of their submitted jobs and the corresponding server allocations. This feature enhances user experience by providing timely information and fostering transparency in the job allocation process.
5) Error Handling: The UI also includes provisions for error handling in case of invalid inputs and system errors. An appropriate message is displayed to guide users in resolving the error.
The user interface is designed for usability, accessibility, and responsiveness, aiming for effectiveness and user satisfaction in the overall system.
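The submission flow described above can be sketched as a small web endpoint. Flask is used here purely for illustration (the paper does not name the web framework), and the balancer call and response fields are hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit_jobs():
    """Validate the requested job count, then hand the jobs to the load balancer."""
    try:
        n_jobs = int(request.form.get("num_jobs", ""))
        if n_jobs <= 0:
            raise ValueError
    except ValueError:
        # Error handling: invalid input produces a clear message for the user.
        return jsonify({"error": "Please enter a positive number of jobs."}), 400

    # Hypothetical balancer call; in the real system the RL agent assigns servers.
    allocations = [{"job_id": i, "handler": f"handler-{i % 3}"} for i in range(n_jobs)]
    return jsonify({"allocations": allocations}), 200

if __name__ == "__main__":
    app.run(debug=True)
```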
IV. RESULTS AND DISCUSSION

To evaluate the performance of our proposed RL-based cloud load balancer, we conducted a series of experiments in a simulated cloud environment based on Microsoft Azure. The experimental setup consists of a web-based homepage interface where users can submit job requests. These job requests are then processed by the load balancer, which intelligently assigns each job to different servers within the Azure cloud infrastructure.

A. Homepage Interface

The homepage interface is shown in Fig. 2, where users simply specify the number of jobs they wish to submit for processing. Soon after the jobs are submitted, random values are generated internally for various metrics such as computational intensity, memory usage, and network bandwidth. These job requests are then passed through the load balancer, where the agent makes decisions based on the state. The load balancer evaluates each job request based on its specific requirements and the current state of available servers in the cloud environment. Subsequently, the load balancer dynamically allocates each job request to an appropriate server, aiming to optimize resource utilization and minimize job processing time. This process ensures efficient distribution of the workload across the cloud infrastructure, maximizing overall system performance and throughput.

Fig. 2. Homepage Interface
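A sketch of the internal metric generation step might look like the following; the metric ranges are arbitrary assumptions used only to illustrate how each submitted job is turned into a state for the agent.

```python
import random

def generate_job(job_id: int) -> dict:
    """Assign random metric values to a submitted job, as described for the homepage flow."""
    return {
        "job_id": job_id,
        "computational_intensity": random.uniform(0.0, 1.0),
        "memory_usage": random.uniform(0.0, 1.0),
        "network_bandwidth": random.uniform(0.0, 1.0),
    }

jobs = [generate_job(i) for i in range(5)]   # the user asked to submit 5 jobs
print(jobs[0])
```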


B. Job Submission Details

The simulation results from our cloud load balancer project, shown in Fig. 3, provide insightful metrics on the performance of job assignments to Azure Functions.

Fig. 3. Job Submission Details

These results typically include the CPU usage distribution, indicating the average and peak CPU utilization, and memory usage details, such as assigned memory and page cache memory. Successful execution of jobs is marked by the 'Failed' status being False. Performance metrics like Cycles per Instruction (CPI) and Memory Accesses per Instruction offer deeper insights into the efficiency of the job processing.

C. Job Handler Details

In our system, the job handler component shown in Fig. 4 provides the details of the different handlers to which tasks are assigned. These details include configurations such as hardware specifications and software setups. This information is useful to ensure efficient job execution, resource optimization, and responsive service delivery.

Fig. 4. Job Handler Details

D. Azure Function Configuration

In the Azure cloud environment shown in Fig. 5, we developed servers with different capabilities, each tailored to perform specific functions. For instance, some servers are optimized for database operations, offering high-speed data processing and storage capabilities. These servers have specialized hardware and software configurations optimized for handling large-scale database queries and transactions efficiently.
Fig. 5. Azure Function Configuration

Additionally, there are servers for CPU-intensive tasks that can execute computations and algorithms requiring heavy processing power. These servers are equipped with high-end processors and large RAM specifications that are useful for such tasks. Furthermore, Azure provides servers specialized for handling network-related operations, such as data transmission, routing, and network security. These servers contain advanced networking hardware and software components, enabling them to manage high volumes of network traffic, ensure data integrity, and maintain network performance. Moreover, we utilized servers optimized for input/output (I/O) operations, which excel at handling data read/write operations, file transfers, and storage management tasks. These servers are configured with fast storage devices, efficient data caching mechanisms, and optimized I/O controllers to minimize data access latency and improve overall system responsiveness.

By controlling servers with diverse capabilities in the Azure environment, we can effectively allocate job requests to the most suitable servers based on their specific requirements and optimize resource utilization across the cloud infrastructure. This approach enables us to maximize system efficiency, enhance performance, and deliver optimal user experiences for various applications and workloads.

V. RESULT ANALYSIS

This section presents an analysis conducted to evaluate the efficiency and performance of our algorithm. The efficiency is assessed by comparing the algorithm with traditional load balancing algorithms, namely Least Connections and Least Response Time. The comparison was based on key metrics such as resource utilization, latency, throughput, and error rate, which are suitable for checking the effectiveness and reliability of load-balancing algorithms in real-world scenarios. The evaluation was done by assigning up to 500 tasks to each algorithm, starting with 100 tasks and incrementing by 100 tasks in each subsequent step. During this process, we closely monitored and recorded the metrics to evaluate the performance efficiency of each algorithm.
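The evaluation procedure can be summarized by a loop like the one below; the stub balancer and its toy metrics are hypothetical placeholders standing in for the three implementations compared in Table I, which were run in the Azure simulation rather than in this harness.

```python
import random

class StubBalancer:
    """Placeholder for the Least Connections / Least Response Time / RL balancers."""
    def __init__(self, base_latency_ms: float):
        self.base_latency_ms = base_latency_ms

    def run(self, jobs):
        # Toy metrics so the harness is runnable; real values come from the simulation.
        n = len(jobs)
        return {"utilization": min(100.0, 60 + n / 10),
                "latency_ms": self.base_latency_ms * (n / 100),
                "throughput": n,
                "error_rate": random.uniform(1.0, 25.0)}

def evaluate(balancers: dict, load_levels=(100, 200, 300, 400, 500)):
    """Assign increasing numbers of tasks to each algorithm and record the Table I metrics."""
    results = []
    for load in load_levels:
        jobs = list(range(load))                       # synthetic task IDs
        for name, balancer in balancers.items():
            run = balancer.run(jobs)
            results.append({"load": load, "algorithm": name, **run})
    return results

rows = evaluate({"Least Connections": StubBalancer(150.0),
                 "Least Response Time": StubBalancer(123.0),
                 "Reinforcement Learning": StubBalancer(105.0)})
print(len(rows))   # 5 load levels x 3 algorithms = 15 runs
```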
TABLE I. PERFORMANCE COMPARISON OF ALGORITHMS IN SYSTEM ENVIRONMENT

Load (tasks) | Algorithm              | Resource Utilization (%) | Latency (ms) | Throughput | Error Rate (%) | Queue Length and Wait Times
100          | Least Connections      | 80.0                     | 150.0        | 80.0       | 5.0            | 110.0
100          | Least Response Time    | 84.0                     | 123.3        | 90.0       | 4.0            | 96.7
100          | Reinforcement Learning | 88.0                     | 105.0        | 100.0      | 3.0            | 84.3
200          | Least Connections      | 85.0                     | 200.0        | 160.0      | 10.0           | 130.0
200          | Least Response Time    | 88.0                     | 156.7        | 180.0      | 8.0            | 113.3
200          | Reinforcement Learning | 91.0                     | 130.0        | 200.0      | 6.0            | 98.6
500          | Least Connections      | 100.0                    | 350.0        | 400.0      | 25.0           | 190.0
500          | Least Response Time    | 100.0                    | 256.7        | 450.0      | 20.0           | 163.3
500          | Reinforcement Learning | 92.0                     | 205.0        | 500.0      | 15.0           | 141.4

Table I shows that under low load (100 tasks), the Reinforcement Learning algorithm exhibited the highest Resource Utilization (88.0%) and the lowest Latency (105.0 ms), resulting in a high Throughput of 100.0 and a low Error Rate of 3.0%. However, as the load increased to 200 tasks, all algorithms experienced an increase in Resource Utilization and Latency, with the Reinforcement Learning algorithm still maintaining competitive performance.

At the highest load of 500 tasks, the Least Connections algorithm showed maximum Resource Utilization (100.0%) but also the highest Latency (350.0 ms) and Error Rate (25.0%). On the other hand, the Reinforcement Learning algorithm maintained a relatively lower Resource Utilization (92.0%) but achieved a lower Latency (205.0 ms) and Error Rate (15.0%), indicating its effectiveness in handling high loads while maintaining system responsiveness and reliability.

A. Graph Representation

Fig. 6 presents a visual representation of the performance metrics comparison among the algorithms. As shown in Fig. 6, under low load (100 tasks), the Reinforcement Learning algorithm exhibited the highest Resource Utilization (88.0%) and the lowest Latency (105.0 ms), resulting in a high Throughput of 100.0 and a low Error Rate of 3.0%. However, as the load increased to 200 tasks, all algorithms experienced an increase in Resource Utilization and Latency, with the Reinforcement Learning algorithm still maintaining competitive performance.
Fig. 6. Comparison of Algorithm Performance Metrics

At the highest load of 500 tasks, the Least Connections algorithm showed maximum Resource Utilization (100.0%) but also the highest Latency (350.0 ms) and Error Rate (25.0%). On the other hand, the Reinforcement Learning algorithm maintained a relatively lower Resource Utilization (92.0%) but achieved a lower Latency (205.0 ms) and Error Rate (15.0%), indicating its effectiveness in handling high loads while maintaining system responsiveness and reliability.

B. Heatmap Representation

Fig. 7 presents a heatmap representation of algorithm efficiency across different task types. The heatmap provides insights into the performance of each algorithm in handling specific types of tasks within the system.

Fig. 7. Heatmap Representation of algorithms

• Reinforcement Learning demonstrates superior performance in NetworkHandlerFunction and MemoryIntensiveFunction tasks, achieving efficiency levels close to 0.95. This indicates that the RL algorithm is highly effective in managing network-related and memory-intensive operations.
• Least Connections exhibits challenges, particularly in ComputeHandlerFunction and DatabaseHandlerFunction tasks, where its efficiency is comparatively lower than that of the other algorithms. This suggests that Least Connections struggles more with computational and database-intensive tasks within the system.
• The heatmap's color gradients reveal that all algorithms perform relatively well with FileHandlerFunction and AnalyticsFunction tasks. This observation implies that these tasks may be less resource-demanding or have been optimized effectively within the system, resulting in better performance across all algorithms.

C. Point Plot Insights

The point plot analysis provides valuable insights into the performance of the different algorithms across the various handlers.
• The point plot in Fig. 8 shows that the Reinforcement Learning algorithm is efficient and consistent across different handlers. The consistency can be seen in the reduced fluctuation in performance, indicating the algorithm's adaptability across different tasks.
• Among the compared algorithms, Least Response Time shows better overall performance than Least Connections in most cases. However, for tasks related to IOIntensiveFunction, both algorithms exhibit similar efficiencies.
• The point plot analysis shows that Reinforcement Learning is better at handling more complex tasks and tasks involving memory management and network operations.

In Fig. 8, the point plot analysis provides a deeper understanding of how the different algorithms perform across specific task categories, highlighting their strengths and areas for potential optimization.
Fig. 8. Point plot analysis of algorithms

VI. CONCLUSION

The project resulted in the deployment of an Azure-based cloud load-balancing system capable of efficient resource management and task distribution. The system's design and implementation adhere to industry best practices, ensuring effectiveness in real-world scenarios. The resulting analysis showcases the proposed system's performance across key metrics, highlighting the Reinforcement Learning (RL) algorithm's efficiency in resource utilization, latency, throughput, and error rate management. Particularly under high load conditions, the RL algorithm maintained high performance compared to the alternative algorithms.

REFERENCES

[1] P. V. Lahande, P. R. Kaveri, J. R. Saini, K. Kotecha, and S. Alfarhood, "Reinforcement learning approach for optimizing cloud resource utilization with load balancing," IEEE Access, 2023.
[2] O. Houidi, D. Zeghlache, V. Perrier, P. T. A. Quang, N. Huin, J. Leguay, and P. Medagliani, "Constrained deep reinforcement learning for smart load balancing," in 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), pp. 207–215, IEEE, 2022.
[3] J. G. Muchori and P. M. Mwangi, "Machine learning load balancing techniques in cloud computing: A review," 2022.
[4] P. R. Kaveri and P. Lahande, "Reinforcement learning to improve resource scheduling and load balancing in cloud computing," SN Computer Science, vol. 4, no. 2, p. 188, 2023.
[5] Q. Liu, T. Xia, L. Cheng, M. Van Eijk, T. Ozcelebi, and Y. Mao, "Deep reinforcement learning for load-balancing aware network control in IoT edge systems," IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 6, pp. 1491–1502, 2021.
[6] A. Alorbani and M. Bauer, "Load balancing and resource allocation in smart cities using reinforcement learning," in 2021 IEEE International Smart Cities Conference (ISC2), pp. 1–7, IEEE, 2021.
[7] R. K. Ramesh, H. Wang, H. Shen, and Z. Fan, "Machine learning for load balancing in cloud data centres," in 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 186–195, IEEE, 2021.
[8] Z. Duan, C. Tian, N. Zhang, M. Zhou, B. Yu, X. Wang, J. Guo, and Y. Wu, "A novel load balancing scheme for mobile edge computing," Journal of Systems and Software, vol. 186, p. 111195, 2022.
[9] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.
[10] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, "Load balancing for ultradense networks: A deep reinforcement learning-based approach," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 9399–9412, 2019.
[11] K. R. Kiran et al., "An advanced ensemble load balancing approach for fog computing applications," International Journal of Electrical & Computer Engineering (2088-8708), vol. 14, no. 2, 2024.
[12] V. E. Jyothi and N. S. Chowdary, "Vulnerability classification for detecting threats in cloud environments against DDoS attacks," in 2024 IEEE 13th International Conference on Communication Systems and Network Technologies (CSNT), 2024, pp. 368–373.
[13] S. P. Praveen et al., "An adaptive load balancing technique for multi SDN controllers," in 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), IEEE, 2022.
[14] V. E. Jyothi and N. S. Chowdary, "Challenges and artificial intelligence–centered defensive strategies for authentication in online banking," Artificial Intelligence Enabled Management: An Emerging Economy Perspective, 2024, p. 105.
[15] N. Biyyapu et al., "Designing a modified feature aggregation model with hybrid sampling techniques for network intrusion detection," Cluster Computing, 2024, pp. 1–19.