Adaptive Cloud Load Balancing With Reinforcement Learning Leveraging Google Cluster Data
Abstract— This research study aims to develop an RL-based cloud load balancer leveraging Google cluster data. The primary objective is to optimize cloud resource allocation and load distribution using RL techniques. The dataset includes various metrics that are useful for decision-making in load balancing, such as CPU utilization, memory usage, resource requests, and system configurations. The proposed methodology includes systematic data collection, exploratory data analysis (EDA), and preprocessing, including handling missing values and feature engineering. This study focuses on designing and training the RL agent using state-of-the-art algorithms like Q-learning and Deep Q Networks, calibrated to the intricacies of cloud load balancing. To replicate real-world dynamics, a simulated cloud environment with servers of different configurations is created and several tasks are assigned to it. Whenever tasks arrive, the values of all metrics are calculated and the cumulative reward is used by the RL agent for decision-making. The research outcomes include enhanced efficiency in load balancing, improved resource utilization, and adaptability to changing demands. This research study aims to contribute significantly to cloud computing, setting new standards of resource management and operational efficiency.

Keywords— Reinforcement Learning, Cloud Load Balancing, Google Cluster Data, Resource Allocation, Q-learning, Cloud Computing, Simulation Environment.
I. INTRODUCTION

Cloud computing is emerging as a fundamental infrastructure as it addresses the drawbacks of traditional computation. It delivers computing resources like servers, storage, databases, and software via the Internet. This shift gives people easy access to computing, thereby accelerating innovation. It has led to wide adoption of cloud services by businesses, which can concentrate on their core work rather than maintaining servers, bearing high costs, and facing scalability issues. With this transformation, cloud service providers like Microsoft, AWS, and Google are offering a comprehensive range of services, from IaaS to PaaS. As cloud computing plays a crucial role in this digital era, there is a need to manage heavy demands and fluctuations. These challenges cannot be addressed by traditional load balancers, which have inherent limitations.

This research study aims to develop a load balancer that yields better results than traditional balancers such as least response time, round-robin, and least connections. The round-robin algorithm assigns tasks in sequential order and is useful when servers have identical specifications and the tasks are uniform; it does not consider the current capabilities of each server, leading to inefficient distribution. Similarly, the least-connections algorithm aims for even task assignment but is less effective when session lengths vary [11]. The least-response-time algorithm assigns tasks based on the fewest connections and the lowest average response time, but it does not consider other metrics like server capabilities, task type, and memory usage. All these traditional load balancers work through static rules and are unintelligent. We therefore develop an RL-based load balancer [1] that considers 34 different metrics for load balancing and offers decision-making, continuous learning, and adaptability. The agent is trained using Google cluster data, which is a snapshot of real-world dynamics. This RL load balancer can be integrated with different cloud platforms, thereby optimizing results.

A. System Architecture

The system environment is represented in Fig. 1, which shows the Google cluster data as input to the model. To start the RL model, we train the agent with the Google cluster dataset 2019. Reinforcement learning involves a state, an agent, and an action. In our project, the state is the incoming task, the agent is the load balancer, and the action is the distribution of traffic across different servers. The figure shows that the agent works on value and policy. The value is a cumulative number calculated as the sum of all metrics: a positive reward is given when the value exceeds a threshold; otherwise, a negative reward is given. With continuous learning, the agent makes optimal decisions aiming for maximum rewards. The action is performed on the cloud environment, Azure in this case. Tasks are assigned to different servers to yield maximum throughput, reduced latency, and a lower error rate. The servers have different configurations, allowing each of them to handle CPU-intensive tasks, database handling, file handling, network handling, etc. Furthermore, the allocation of tasks within the system is managed dynamically by the RL agent, which receives task requests from users through the web browser interface.
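As a concrete illustration of this formulation, the following minimal Python sketch models the state (task demands plus current server loads), the action (the server a task is routed to), and the threshold-based reward described above. The metric set, threshold value, and load model are placeholder assumptions, not values from the paper.

import random

# Minimal sketch: state = task demands plus current server loads,
# action = index of the chosen server, reward = +1 when the cumulative
# value clears a threshold and -1 otherwise. All numbers are illustrative.
class CloudLBEnv:
    def __init__(self, num_servers=4, threshold=0.7):
        self.num_servers = num_servers
        self.threshold = threshold
        self.loads = [0.0] * num_servers   # fraction of each server in use

    def observe(self, task):
        return (task["cpu"], task["mem"], *self.loads)

    def step(self, task, action):
        self.loads[action] = min(1.0, self.loads[action] + task["cpu"])
        # Cumulative value over the tracked metrics (here: mean spare capacity).
        value = sum(1.0 - load for load in self.loads) / self.num_servers
        reward = 1.0 if value >= self.threshold else -1.0
        return self.observe(task), reward

env = CloudLBEnv()
state, reward = env.step({"cpu": 0.2, "mem": 0.1}, action=random.randrange(4))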
Fig. 1. System Environment Overview

II. RELATED WORKS

Research studies have been conducted on cloud computing to optimize the utilization of resources along with load balancing in dynamic environments, emphasizing the significance of machine learning (ML) and reinforcement learning (RL). An RL-based approach was proposed by Lahande et al., 2023 [1] that demonstrates improved resource utilization and reduced latency compared to conventional methods, specifically tailored to cloud environments. Houidi et al., 2022 [2] studied constrained deep reinforcement learning (DRL) to address the constraints on smart load balancing in network systems, which is crucial for optimizing load distribution. ML algorithms and RL principles have been implemented and tested to optimize cloud load balancing; this work helped select strategies for cloud load balancing and for integrating these solutions with cloud infrastructures, thus highlighting the need for further research (Muchori et al., 2022) [3]. A study conducted by Kaveri, 2023 [4] introduced a novel approach to optimizing cloud resource allocation using RL to adjust resource scheduling, which uses real-time data and demonstrates efficiency in load distribution and processing time. These findings help address key challenges in cloud computing and cloud resource management, and pave the way for further research into adaptability to diverse cloud services and workloads.

The application of DRL to load balancing is explored by Liu et al., 2021 [5]. Another study targets self-optimizing computational loads in multi-server cloud environments [12]. To validate scalability and reliability in real-world scenarios, a detailed framework has been provided that helps in applying DRL to practical cloud computing settings. When DRL algorithms are integrated, the system continuously learns efficient load distribution strategies that enhance system stability and reduce processing delays and latency. RL methodologies also address load balancing and resource allocation in the realm of smart cities, focusing on the enhancement of resource distribution efficiency in the dynamic environment of smart city infrastructure (Alorbani, 2021) [6]. Implementing an RL algorithm helps in developing a system capable of adapting to changes in resource demands, improving resource utilization efficiency, decreasing operational costs, and enhancing service delivery. A study conducted by Ramesh, 2021 [7] shows that ML techniques tested for efficacy in load balancing, when incorporating self-learning capabilities like RL, outperform traditional strategies. Different ML approaches and their practical implications in cloud computing provide solutions to challenges in cloud resource management, though further research is needed to fine-tune these models for specific cloud architectures and workload scenarios. A study of a novel load-balancing scheme for mobile edge computing shows the need for advanced approaches to load balancing [8]. Armbrust's view of cloud computing provides background on cloud environments [9], and the paper [10] shows the need for deep learning in networks with dense traffic.

III. PROPOSED WORK

The methodology encompasses the implementation of the reinforcement learning algorithm, its customization to address load balancing requirements [13], the design of a user-friendly interface, the selection and description of the dataset, and the setup of a simulated environment for evaluation. Each facet of the methodology is detailed to provide clarity on the processes involved and the rationale behind the methodological decisions, facilitating reproducibility and contributing to advancements in cloud computing optimization.

A. Reinforcement Learning Algorithm Overview

The methodology involves four main steps. The first is data preprocessing and validation, which is important to make the data suitable as input for learning by the RL agent. The next step is developing the RL algorithm so that the agent can make decisions, receive feedback in terms of rewards or penalties, and then adjust its actions to maximize the cumulative reward. The third step is deploying the RL-based load balancer into the simulation environment. The final step is creating a user-friendly web interface where a user can submit tasks and monitor the servers to which the tasks are assigned.

The RL algorithm is adapted for load balancing in the cloud environment. The algorithm used is Q-learning, which is suitable for calculating a cumulative value over the metrics. The state representation consists of metrics like the current load of each server, the queue length of requests, the number of requests, memory usage, network traffic [15], etc. The action of the load balancer is to distribute incoming traffic across different servers. The reward function is crucial in RL: it signals to the agent how good or bad its action was. For a load balancer, a simple reward might be negative for every request that is not handled within a certain time. As the RL-based load balancer considers metrics like CPU utilization, memory usage, and time taken, we keep a threshold value on the cumulative metric value: if the value is less than the threshold, the agent receives a negative reward; otherwise, it receives a positive reward.
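A minimal sketch of this thresholded reward, assuming equal metric weights and an illustrative threshold (neither is specified in the text beyond the threshold idea itself):

def reward(metrics, threshold=0.5):
    # metrics: normalized values in [0, 1], e.g. CPU headroom, memory
    # headroom, timeliness of the response. Equal weighting is assumed.
    value = sum(metrics.values()) / len(metrics)
    return 1.0 if value >= threshold else -1.0

print(reward({"cpu_headroom": 0.6, "mem_headroom": 0.7, "timeliness": 0.8}))  # 1.0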
Training of the RL agent was conducted using a simulated cloud environment that mimics real-world scenarios, with historical data used to train the agent. We split the data into training and validation sets to evaluate the agent's performance, and after training we evaluated the RL agent in the simulated environment. As the environment may change over time, we keep re-training or fine-tuning the agent so that it adapts to new situations.
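The split just mentioned can be done in one line; the file name and the 80/20 ratio below are assumptions for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("google_cluster_2019.csv")  # hypothetical file name
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)  # assumed 80/20 split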
The essential components of the RL model for optimizing cloud resource utilization through LB are outlined as follows:

• Learning Agent (LA): The learning agent facilitates the equitable distribution of incoming task loads across available VMs for computation. Over time, the LA dynamically selects the most suitable RL algorithm for LB, adapting to the evolving dynamics of the cloud computing environment.
• Environment (E): The environment constitutes the cloud computing landscape where submitted tasks undergo processing on available cloud VMs, striving to deliver optimal Quality of Service (QoS) to end users.
• State Space (S): Represents the possible states the agent can be in.
• Action Space (A): Defines the actions available to the agent in each state.
• Reward Function (R(S, A)): Computes the immediate reward based on the state-action pair.
• Q-Learning Update Rule: Updates the Q-values based on observed rewards and state transitions.

The Q-Learning Update Rule is given by:

Q(S, A) ← Q(S, A) + α [ R(S, A) + γ max_{a′} Q(S′, a′) − Q(S, A) ]
Where:
• Q(S, A) is the Q-value for state-action pair (S, A).
• α is the learning rate, determining the extent of Q-value updates.
• γ is the discount factor, balancing immediate rewards with future rewards.
• S′ is the next state after taking action A in state S.
• a′ is the next action chosen based on the policy.
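The update rule above translates directly into a small tabular implementation. The sketch below assumes illustrative hyperparameters (α = 0.1, γ = 0.9, ε = 0.1 for an ε-greedy policy) and four servers; none of these values are taken from the paper.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative hyperparameters
ACTIONS = range(4)                       # one action per server
Q = defaultdict(float)                   # Q[(state, action)] -> value

def choose_action(state):
    # epsilon-greedy policy over the Q-table
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Q(S,A) <- Q(S,A) + alpha * [ R(S,A) + gamma * max_a' Q(S',a') - Q(S,A) ]
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])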
In our proposed system, tasks are submitted to the cloud for computation and subsequently added to the environment's queue. RL algorithms, integrated within the LA, dynamically select tasks based on predefined policies and manage rewards in the Q-table. By utilizing these rewards as feedback for LB, the LA redirects future tasks to VMs with optimal resource utilization, thereby enhancing the LB mechanism. Although the LA may initially receive lower rewards, it gradually learns to optimize LB decisions, ultimately improving overall system performance.

B. Utilization of Dataset for Analysis

1) Dataset Overview: The Google Cluster Workload Traces 2019 dataset stands as a testament to the evolution of cloud computing research, offering an extensive collection of workload data sourced from eight Google Borg compute clusters throughout May 2019. The dataset, obtained from Kaggle, serves as a cornerstone for understanding the dynamics of large-scale cluster management and workload scheduling within a cloud environment. As the project focuses on heavy traffic management, the Google cluster workload traces 2019 dataset emerges as a key resource covering job submission, scheduling decisions, and resource usage across diverse clusters.

2) Data Collection and Preprocessing:

a) Data Collection: The first step of the project is data collection. Data should be collected from various sources; in our project, a dataset with 34 metrics is available on Kaggle and is downloaded for use.

b) Data Validation: Once the data is collected, it undergoes a validation process to identify and remove any irrelevant or incomplete data entries [14]. This step is crucial for the integrity of the simulation results.

c) Data Transformation: Transforming data into a usable format involves several sub-steps (a sketch follows the list):
• Normalization: Scaling numerical data to a specific range to avoid bias due to varying scales.
• Encoding: Converting categorical data into numerical data, as the algorithm can interpret only numerical data.
• Feature Engineering: Deriving new features that might be useful for accurate job handling.
• Dimensionality Reduction: Reducing the number of features in high-dimensional datasets using techniques like Principal Component Analysis, which simplifies the model without significant loss of information.
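A compact pandas/scikit-learn sketch of these sub-steps; the column names and file name are placeholders rather than the actual trace schema:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.read_csv("google_cluster_2019.csv")      # hypothetical file name
df = df.dropna()                                 # simple cleaning; see d) below
num_cols = ["cpu_request", "memory_usage"]       # placeholder column names

# Encoding: one-hot encode a categorical field.
df = pd.get_dummies(df, columns=["scheduling_class"])

# Normalization: scale numeric features to [0, 1].
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Feature engineering: derive a combined demand feature.
df["resource_demand"] = df["cpu_request"] + df["memory_usage"]

# Dimensionality reduction: project the numeric features onto a few
# principal components.
features = PCA(n_components=5).fit_transform(df.select_dtypes("number"))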
d) Data Cleaning: Data cleaning addresses issues like missing values, duplicates, and outliers. Techniques such as imputation for missing values or thresholding for outliers are used to ensure that the data is clean and reliable.

e) Data Security: Ensuring the privacy and security of the data is necessary. This includes implementing access controls, encryption, and regular audits to protect data from unauthorized access and breaches.

In summary, data preprocessing and management in the simulation environment is about establishing a robust, secure, and efficient pipeline that prepares data for processing by the Azure Function handlers, ensuring the system's overall effectiveness and reliability.

C. Simulation Environment

The simulation environment is created in an Azure workspace using Azure Functions. Azure Functions enable serverless computing, creating servers at simulation time and destroying them when the job is done. This is a cost-effective and energy-saving technique for simulating and testing the load balancer. Below, we delve into the specifics of this simulated setup.
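For concreteness, a handler in this setup could be an HTTP-triggered Azure Function. The sketch below uses the Azure Functions Python v1 programming model; the trigger type and the response shape are assumptions, not the paper's stated configuration.

import json
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    job = req.get_json()   # job description forwarded by the load balancer
    # ... process the job according to this handler's specialty ...
    return func.HttpResponse(
        json.dumps({"status": "accepted", "job": job}),
        mimetype="application/json",
    )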
1) System Architecture: The system architecture comprises a set of Azure Functions, each representing a unique handler with specific computational strengths. These handlers are orchestrated by the load balancer, which intelligently routes jobs to the most appropriate handler. Redirecting tasks to the appropriate handlers results in optimal utilization of resources and maximum throughput.

2) Load Balancer Design: The load balancer is designed with a smart routing algorithm that analyzes incoming job requests and assigns them to handlers based on their tagged resource intensity. This decision-making process considers the current system load, job requirements, and individual handler capabilities.

3) Handler Specifications: Each handler is tailored for a particular type of job:
• MemoryIntensiveHandler: Useful for tasks that need a high amount of RAM.
• ComputeHandler: Best suited for CPU-intensive tasks.
• NetworkHandler: Ideal for jobs that demand high network throughput.
• DatabaseHandler: Configured for tasks requiring extensive database interactions.
• IOIntensiveHandler: Designed for jobs with high input/output operations.

4) Job Assignment Process: When a job request is received, the load balancer queries the status and capacity of each handler. Based on the job's requirements, such as memory usage, CPU load, and network bandwidth, it selects the optimal handler. The job is then queued in the selected handler's task list, awaiting processing.
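The following sketch illustrates one way this assignment step could score handlers; the capability weights and the queue-length discount are assumptions for illustration, not the paper's actual routing rule.

HANDLERS = {
    "MemoryIntensiveHandler": {"memory": 1.0, "cpu": 0.3, "network": 0.3},
    "ComputeHandler":         {"memory": 0.3, "cpu": 1.0, "network": 0.3},
    "NetworkHandler":         {"memory": 0.3, "cpu": 0.3, "network": 1.0},
}
queues = {name: [] for name in HANDLERS}

def assign(job):
    # Score each handler by how well its strengths match the job's
    # demands, discounted by its current queue length.
    def score(name):
        caps = HANDLERS[name]
        fit = sum(caps[k] * job.get(k, 0.0) for k in caps)
        return fit / (1 + len(queues[name]))
    best = max(HANDLERS, key=score)
    queues[best].append(job)   # queued in the handler's task list
    return best

print(assign({"memory": 0.8, "cpu": 0.2, "network": 0.1}))  # MemoryIntensiveHandler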
5) Monitoring and Metrics: The system includes a robust monitoring solution to track the performance of each handler. Metrics such as memory usage, swap usage, cache size, and the number of active sessions are continuously monitored to provide real-time insight into system performance.

6) Scalability and Reliability: The simulation environment is designed to be scalable, allowing for the addition of more handlers or the enhancement of existing handlers' capabilities to handle increased loads. Fallback mechanisms and redundancy among handlers ensure that if one handler fails, the LB can reroute jobs to other available handlers without interruption.

7) Usability and Access: The handlers are accessible through unique URLs that serve as API endpoints. These endpoints are useful for managing job requests remotely, demonstrating the system's flexibility and ease of integration with various client applications.

8) Security and Compliance: Security measures are implemented to ensure that each job is processed securely. The Azure Function App is configured to meet compliance standards, ensuring that data handling and processing are performed according to industry best practices.

9) Future Enhancements: The simulation environment has room for enhancements such as implementing AI-based predictive load balancing, integrating advanced analytics for performance optimization, and exploring auto-scaling capabilities based on predictive modelling of incoming job requests.

D. User Interface Design

For users to interact with the cloud environment, we developed a simple user interface for the system. It enables users to assign any number of tasks to the system and monitor real-time feedback on the allocation of jobs to servers. Users are shown the configuration and details of the server to which each task is assigned. The following components constitute the user interface design:

1) Job Submission Input Field: A text field displayed on the webpage enables users to specify the number of jobs they intend to submit for processing. This input field is flexible and easy to use, and it handles invalid inputs by showing error messages to the users.

2) Submit Button: Adjacent to the job submission input field is a submit button, which triggers the initiation of the load balancing algorithm. When a user clicks the submit button, the system processes the submitted jobs and allocates them to the appropriate servers based on their characteristics.

3) Job Allocation Details Display: When the jobs are submitted, users are redirected to the job details display page, where all the jobs can be seen along with their types. Each job is handled by a specific URL, which is a redirection to the server handling that job. The display is comprehensible, providing clear insight into the workload distribution.

4) Real-time Updates: The UI incorporates mechanisms for real-time updates, ensuring that users receive instantaneous feedback on the status of their submitted jobs and the corresponding server allocations. This feature enhances the user experience by providing timely information and fostering transparency in the job allocation process.

5) Error Handling: The UI also includes provisions for error handling in case of invalid inputs and system errors.
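From a client's perspective, job submission through these endpoints could look like the following; the URL and the JSON shapes are hypothetical, not the deployed API contract.

import requests  # third-party; pip install requests

resp = requests.post(
    "https://<function-app>.azurewebsites.net/api/submit",  # hypothetical endpoint
    json={"num_jobs": 10},
    timeout=30,
)
resp.raise_for_status()
for job in resp.json()["jobs"]:   # assumed response field
    print(job["id"], job["type"], job["handler_url"])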
… specific task categories, highlighting their strengths and areas for potential optimization.

Fig. 8. Point plot analysis of algorithms
VI. CONCLUSION

The project resulted in the deployment of an Azure-based cloud load-balancing system capable of efficient resource management and task distribution. The system's design and implementation adhere to industry best practices, ensuring effectiveness in real-world scenarios. The resulting analysis showcases the proposed system's performance across key metrics, highlighting the reinforcement learning (RL) algorithm's efficiency in resource utilization, latency, throughput, and error-rate management. Particularly under high load conditions, the RL algorithm maintained high performance compared to alternative algorithms.
REFERENCES

[1] P. V. Lahande, P. R. Kaveri, J. R. Saini, K. Kotecha, and S. Alfarhood, "Reinforcement learning approach for optimizing cloud resource utilization with load balancing," IEEE Access, 2023.
[2] O. Houidi, D. Zeghlache, V. Perrier, P. T. A. Quang, N. Huin, J. Leguay, and P. Medagliani, "Constrained deep reinforcement learning for smart load balancing," in 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), pp. 207–215, IEEE, 2022.
[3] J. G. Muchori and P. M. Mwangi, "Machine learning load balancing techniques in cloud computing: A review," 2022.
[4] P. R. Kaveri and P. Lahande, "Reinforcement learning to improve resource scheduling and load balancing in cloud computing," SN Computer Science, vol. 4, no. 2, p. 188, 2023.
[5] Q. Liu, T. Xia, L. Cheng, M. Van Eijk, T. Ozcelebi, and Y. Mao, "Deep reinforcement learning for load-balancing aware network control in IoT edge systems," IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 6, pp. 1491–1502, 2021.
[6] A. Alorbani and M. Bauer, "Load balancing and resource allocation in smart cities using reinforcement learning," in 2021 IEEE International Smart Cities Conference (ISC2), pp. 1–7, IEEE, 2021.
[7] R. K. Ramesh, H. Wang, H. Shen, and Z. Fan, "Machine learning for load balancing in cloud data centres," in 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 186–195, IEEE, 2021.
[8] Z. Duan, C. Tian, N. Zhang, M. Zhou, B. Yu, X. Wang, J. Guo, and Y. Wu, "A novel load balancing scheme for mobile edge computing," Journal of Systems and Software, vol. 186, p. 111195, 2022.
[9] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.
[10] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, "Load balancing for ultra-dense networks: A deep reinforcement learning-based approach," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 9399–9412, 2019.
[11] K. R. Kiran et al., "An advanced ensemble load balancing approach for fog computing applications," International Journal of Electrical & Computer Engineering, vol. 14, no. 2, 2024.
[12] V. E. Jyothi and N. S. Chowdary, "Vulnerability classification for detecting threats in cloud environments against DDoS attacks," in 2024 IEEE 13th International Conference on Communication Systems and Network Technologies (CSNT), pp. 368–373, 2024.
[13] S. P. Praveen et al., "An adaptive load balancing technique for multi SDN controllers," in 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), IEEE, 2022.
[14] V. E. Jyothi and N. S. Chowdary, "Challenges and artificial intelligence–centered defensive strategies for authentication in online banking," in Artificial Intelligence Enabled Management: An Emerging Economy Perspective, 2024, p. 105.
[15] N. Biyyapu et al., "Designing a modified feature aggregation model with hybrid sampling techniques for network intrusion detection," Cluster Computing, pp. 1–19, 2024.