


A Framework Design of Transfer
Reinforcement Learning (TRL) for Cooling
Water System Optimization
Zhechao Wang, Student Member ASHRAE                    Zhihong Pang, PhD, Associate Member ASHRAE

Zhechao Wang is a Ph.D. student, and Zhihong Pang is an Assistant Professor in the Bert S. Turner Department of Construction Management at Louisiana State University in Baton Rouge, LA.

ABSTRACT
Transfer reinforcement learning (TRL) offers a promising approach to optimize the operation of building cooling water systems,
striking a balance between power consumption and effective cooling performance. This study presents a TRL framework that
enhances the initial performance of reinforcement learning (RL) agents by leveraging experiences pre-acquired from similar
systems. First, three data-driven models are trained for three cooling water systems in China utilizing the field data. Then, an
RL model is trained for one cooling water system model, followed by the development of a novel TRL framework to adapt the
initial RL model for the remaining two systems. This TRL framework revolutionizes conventional approaches by introducing
self-adaptive input mapping and flexible output actions, enabling the accommodation of configuration differences among
various cooling water systems while maintaining consistent neural network architectures across diverse RL agents. Specifically,
it allows for inputs like part load ratio (PLR) and outputs like ratio-based operation decisions. Besides, a supervisory
mechanism is implemented to dynamically modulate the TRL’s reliance on transferred versus new agents based on a belief
function and selected key performance indexes (KPIs), ensuring optimal agent selection probability and overall system
performance achievement. A parametric simulation incorporating two TRL frameworks and various scenarios is conducted to
investigate the TRL performance. The simulation results demonstrate significant performance enhancements, with the most
notable improvement being an 83.3% reduction in training time, compared with the learning-from-scratch method.

INTRODUCTION

Background

Reinforcement learning (RL) is a machine learning (ML) paradigm in which an agent actively engages with its environment and iteratively learns and refines its decision-making policy through a process known as exploration and exploitation, with the ultimate goal of optimizing its behavior to maximize the cumulative reward in complex, sequential decision-making tasks (Sutton and Barto 2018). Such a model-free method has garnered increasing attention from researchers and
engineers in the field of intelligent controls for Heating, Ventilation, and Air Conditioning (HVAC) systems, primarily
attributable to its ability to circumvent the time-intensive and expertise-dependent process of HVAC equipment and systems
modeling (Biemann et al. 2021; Fu et al. 2022; Heidari et al. 2022; Mahbod et al. 2022). Additionally, RL demonstrates the
capability to effectively adapt to the performance degradation in equipment due to extended operation through its online
learning mechanism, thus eliminating the need to rebuild models as required in conventional model-based methods when
confronted with system changes (Qiu et al. 2020).
RL has made notable strides in HVAC, with recent applications showcasing its ability to enhance energy efficiency and
cut operational costs. For instance, Qiu et al. (2020) proposed an RL-based Q-learning controller to reduce the energy
consumption of building cooling water systems. This method uses the outdoor wet bulb temperature and system cooling load

(CL) as the state variables, while the frequency of the fan and water pump are set as the action variables, and the system coefficient of performance (COP), defined as the system CL divided by the system total power, serves as the reward. Compared
with the basic controller, the proposed approach saved 7% of the operational energy in the cooling season. Building upon this
methodology, Qiu et al. (2022) further developed an advanced RL controller guided by a comprehensive utility function
considering energy and thermal comfort to optimize the chilled water temperature, yielding an energy-saving ratio of
approximately 4% and enhanced indoor thermal comfort compared to conventional human-operated control strategies.
RL has also been used to optimize occupant’s wellbeing in the built environment. Ahn and Park (2020) applied a deep Q-
learning (DQN) model in a demand-controlled ventilation (DCV) study to control the indoor CO2 concentration and
achieve a balance between indoor air quality (IAQ) and energy savings. Their study utilizes a multitude of state variables,
including the indoor and outdoor temperatures, relative humidity, solar irradiation, outdoor air damper position, etc. The
selected actions of the RL controller include the chilled water supply temperature setpoint, cooling water temperature setpoint,
and outdoor air damper position, with punishments imposed for excessive energy consumption and elevated indoor CO2
concentration. The DQN controller lowered the total energy usage by 15.7% in comparison with the baseline operation while
maintaining the indoor CO2 concentration below 1,000 ppm.

Transfer Learning Applications

Although RL has demonstrated considerable versatility and achieved good performance across various HVAC control
applications, its broader adoption is impeded by the need for scenario-specific controller redesign, leading to increased
computational demands, time, and costs, as well as limited policy transferability and potential suboptimal performance (Kurte
et al. 2020; Sierla et al. 2022; Wang et al. 2023). Hence, transfer reinforcement learning (TRL), which combines RL with the
transfer learning (TL) framework, emerges as a compelling solution to this challenge, enabling the leverage of knowledge
acquired from similar tasks to improve the initial performance in new-yet-related tasks, thus significantly reducing the
associated computational demands, time, and costs for redesign of RL controllers (Coraci et al. 2023; Fang et al. 2023; Fu et
al. 2023; Lissa et al. 2020; Pinto et al. 2022).
TRL has been successfully applied in some existing studies within the HVAC domain. For instance, Lissa et al. (2020)
proposed a methodology based on TL to share an RL controller among different rooms in the same building, which efficiently
reduced the thermal discomfort of occupants of various zones by incorporating the zonal geometrical variations into the
framework. Similarly, Fang et al. (2023) investigated the cross temporal-spatial transferability of an RL controller in an HVAC
system to enhance indoor temperature conditions and reduce energy consumption simultaneously in other systems with a
similar climatic scenario. The results indicated that the proposed TRL framework can effectively improve the training efficiency
of control strategy by about 13.28% when compared to that of the baseline RL models trained from scratch.
Despite some successful applications, many existing studies on TRL often rely on high similarities between the source
and target systems, i.e., the new and previous systems, which compromises their broad applicability and reduces data utilization
efficiency when similarities are low. Firstly, a complicated system may encompass numerous distinct factors potentially
applicable for TRL, which makes it challenging to establish a standardized boundary for determining transfer feasibility based
on similarity. Furthermore, a significant amount of potentially valuable data may be discarded when a transfer is not executed
due to sub-threshold similarity, resulting in wasted opportunities for learning and improvement.

Objectives

This paper aims to investigate a uniform RL framework applicable across different systems by incorporating consistent
input and output workflows and adjustable exploration and exploitation in the transfer. This approach enables the trained agent
to interpret inputs from the new system and generate actions easily executable by these systems. Instead of addressing similarity
directly, this novel mechanism enhances data efficiency by adaptively adjusting the new system's trust in the transferred agent,
hence eliminating the need to predetermine transferability prior to the initiation of the learning process in the new system.

METHODOLOGY

This section provides an overview of the primary technical methods, including the data-driven model for cooling water
systems, the mechanism of the uniform RL controller, and the supervision mechanism for exploration and exploitation. As
presented in Figure 1, this paper relies on a data-driven modeling approach to simulate the cooling water system operation and
evaluate TRL performance. To assess the transferability, an RL agent is first trained on a baseline cooling water system (grey
module) and reused as the transferred agent (green box). Then a TRL framework is established for a new cooling water system,
incorporating both the pre-trained transferred agent and a new RL agent (red box). During each TRL iteration, the new agent
and the transferred agent concurrently receive state information from the new cooling water system model (blue box), and
independently predict the COP and control actions. A supervision mechanism (purple box) is established to select one of the
two agents for control implementation based on a belief function, which is continuously and automatically updated to ensure
the optimal performance of TRL as the operation data accumulate.

Figure 1 Flowchart of the proposed transfer reinforcement learning framework.

Data-Driven Cooling Water System Models

Equipment Model. Data-driven models are widely utilized in nonlinear systems due to their straightforward procedures
in model formalization and development, especially in RL applications of HVAC (Qiu et al. 2020; Wang and Lin 2023; Xiong
et al. 2023). This paper adopts the data-driven approach to train the cooling water models for TRL investigations.
Considering the primary equipment in a cooling water system and their operational characteristics, this study selects five
data-driven models for three types of equipment: chillers, cooling towers, and cooling water pumps. Table 1 summarizes the
inputs and outputs chosen for these models, along with the corresponding regressors, where the variables 𝐶𝐿, 𝑇𝑐𝑤𝑟, 𝑇𝑐𝑤𝑠,
𝑇𝑐ℎ𝑤𝑠, 𝐹𝑐ℎ𝑤, 𝐹𝑐𝑤, 𝑓𝑟𝑒𝑞, 𝑇𝑤𝑏, and 𝑃 represent the cooling load of the chiller, temperature of cooling water returning to the chiller, temperature of cooling water supplied from the chiller, temperature of chilled water supplied from the chiller, flow rate of chilled water, flow rate of cooling water, frequency of the tower fan or water pump, ambient wet bulb temperature, and equipment operation
power, respectively. The LightGBM model (Ke et al. 2017) is chosen for the chiller COP and cooling tower water temperature
models due to its high accuracy and low computing demands. Linear regression is employed for the remaining models to
capture the underlying physical relationships governing equipment operation. For instance, the cooling water pump follows the
pump affinity law “𝑃 ∝ 𝑓𝑟𝑒𝑞^3” in operation (ASHRAE 2024); thus its model can be constructed as a cubic polynomial that is linear in its coefficients, as presented in Equation (1), where 𝑎, 𝑏, 𝑐, and 𝑑 are coefficients determined through linear regression.
𝑃 = 𝑎 + 𝑏·𝑓𝑟𝑒𝑞 + 𝑐·𝑓𝑟𝑒𝑞^2 + 𝑑·𝑓𝑟𝑒𝑞^3          (1)
Table 1. Inputs, Outputs, and Regressors for the Various Data-Driven Models
Model                                          Function                                               Regressor
Chiller COP model                              𝑓(𝐶𝐿, 𝑇𝑐𝑤𝑟, 𝑇𝑐ℎ𝑤𝑠, 𝐹𝑐ℎ𝑤, 𝐹𝑐𝑤) → Chiller 𝐶𝑂𝑃           LightGBM
Cooling tower water temperature model          𝑓(𝑇𝑐𝑤𝑠, 𝑓𝑟𝑒𝑞, 𝑇𝑤𝑏, 𝐹𝑐𝑤) → 𝑇𝑐𝑤𝑟                        LightGBM
Cooling tower power model                      𝑓(𝑓𝑟𝑒𝑞) → 𝑃                                            Linear Regressor
Cooling water pump flow rate model             𝑓(𝑓𝑟𝑒𝑞) → 𝐹𝑐𝑤                                          Linear Regressor
Cooling water pump power model                 𝑓(𝑓𝑟𝑒𝑞) → 𝑃                                            Linear Regressor
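To make the modeling step concrete, the following Python sketch illustrates how the five models in Table 1 could be fitted with LightGBM and scikit-learn. It is a minimal illustration only: the DataFrame, column names, and hyperparameters are assumed placeholders, not the authors' actual data or code.

import pandas as pd
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

df = pd.read_csv("cooling_water_field_data.csv")  # hypothetical post-processed field data

# Chiller COP model: f(CL, T_cwr, T_chws, F_chw, F_cw) -> COP, fitted with LightGBM
chiller_cop = lgb.LGBMRegressor(n_estimators=200)
chiller_cop.fit(df[["CL", "T_cwr", "T_chws", "F_chw", "F_cw"]], df["COP"])

# Cooling tower water temperature model: f(T_cws, freq, T_wb, F_cw) -> T_cwr
tower_temp = lgb.LGBMRegressor(n_estimators=200)
tower_temp.fit(df[["T_cws", "freq_tower", "T_wb", "F_cw"]], df["T_cwr"])

# Cooling water pump power model, Equation (1): a cubic polynomial in frequency
# (affinity law P ∝ freq^3) fitted by ordinary linear regression on [freq, freq^2, freq^3]
pump_power = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                           LinearRegression())
pump_power.fit(df[["freq_pump"]], df["P_pump"])

# The cooling tower power and pump flow rate models follow the same linear-regression pattern.
predicted_kw = pump_power.predict(pd.DataFrame({"freq_pump": [45.0]}))  # pump power at 45 Hz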

Integrated Cooling Water System. The system model is constructed based on the single-equipment models and actual
equipment connection in a cooling water circuit. The trial-and-error based iterative procedures to determine the stable operating
point are outlined in Figure 2. First, 𝐹𝑐𝑤 is calculated based on the cooling water pump flow rate model and its frequency setup
derived from actions. Then, an assumed 𝑇𝑐𝑤𝑠 along with other known variables are used sequentially to simulate 𝑇𝑐𝑤𝑟, chiller
𝐶𝑂𝑃, chiller exchange heat, and the updated temperature 𝑇𝑐𝑤𝑠′. Last, a conditional branch is added to evaluate the differential
between 𝑇𝑐𝑤𝑠′ and 𝑇𝑐𝑤𝑠: if the differential is below a threshold value 𝛿, the loop terminates; otherwise, 𝑇𝑐𝑤𝑠 is updated and the loop is iterated to recalculate 𝑇𝑐𝑤𝑠′ until convergence is achieved.
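The loop in Figure 2 can be read as a fixed-point search on the cooling water supply temperature. A minimal sketch under assumed model interfaces is given below; the function names, the initial guess, and the condenser heat balance CL·(1 + 1/COP) are illustrative assumptions rather than the paper's code.

def solve_operating_point(CL, T_chws, F_chw, T_wb, pump_freq, tower_freq,
                          pump_flow, tower_temp, chiller_cop,
                          delta=0.01, max_iter=100):
    # pump_flow, tower_temp, and chiller_cop are callables wrapping the data-driven models
    F_cw = pump_flow(pump_freq)             # cooling water flow rate from the pump model
    T_cws = 32.0                            # initial guess of the supply temperature (degC)
    for _ in range(max_iter):
        T_cwr = tower_temp(T_cws, tower_freq, T_wb, F_cw)   # tower outlet / chiller inlet temperature
        cop = chiller_cop(CL, T_cwr, T_chws, F_chw, F_cw)   # chiller COP at this operating state
        heat_rejected = CL * (1.0 + 1.0 / cop)              # condenser heat = CL + compressor power
        # Energy balance on the cooling water loop updates the supply temperature;
        # 4.186 kJ/(kg*K) is the specific heat of water, with F_cw in kg/s and CL in kW.
        T_cws_new = T_cwr + heat_rejected / (4.186 * F_cw)
        if abs(T_cws_new - T_cws) < delta:                  # converged within threshold delta
            return T_cws_new, cop
        T_cws = T_cws_new                                   # update the guess and iterate again
    return T_cws, cop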
The developed data-driven cooling water system model mimics the operation of a real system, through which
the system COP can be calculated. This COP serves multiple purposes in this study: it is used as the reward for RL agents
training, the input for the supervision mechanism, as well as a KPI for TRL evaluation.

Figure 2 Schematic of the stable operating point determination for the cooling water system model.

Uniform Reinforcement Learning Framework

Reinforcement Learning Principle. Value-based reinforcement learning, like DQN (Mnih et al. 2015), requires agents
to learn a value function to evaluate the state or the state-action pairs, which are denoted by 𝑉(𝑠) or 𝑄(𝑠, 𝑎), respectively. The
Q value refers to the future accumulated reward, i.e., the long-term return, based on which the RL agent chooses control actions
to maximize the comprehensive benefits. Compared with fundamental value-based RL algorithms such as the Q-table, DQN employs a neural network model to approximate the optimal Q value for each state, and is hence more efficient and accurate for complicated, non-linear scenarios.
In this paper, DQN is adopted as the RL algorithm due to its simplicity, effectiveness, and elegance (Fang et al. 2023).
Besides, DQN is highly versatile when it comes to cooperation with other modules such as the supervision mechanism, which is
desired in this study.
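As a concrete illustration of a DQN value network for this setting, the PyTorch sketch below (an assumption of this description, not the authors' implementation) maps a state vector to one Q value per discrete action and selects actions greedily; the state would contain the PLR terms, the number of operating chillers, and the wet bulb temperature introduced in the next sub-section.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 150):
        super().__init__()
        # Two hidden layers of 150 neurons each, matching the setup reported in Model Setups
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q value (predicted long-term return) per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def greedy_action(q_net: QNetwork, state: torch.Tensor) -> int:
    # Choose the action whose predicted Q value is largest (exploitation step)
    with torch.no_grad():
        return int(q_net(state).argmax().item())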
Uniform Agent Framework. The agent framework must be well-designed to enable effective knowledge transfer since
it determines the selection and utilization of data during transfer. From a learning structure perspective, TL requires all agents
to operate within a consistent framework. Specifically, the framework must maintain the same input and output structures, with
data dimensions remaining consistent to facilitate transfer. This presents two primary concerns related to input and output, for
which we propose two potential solutions:
• Heterogeneity in cooling load conditions. Cooling water systems have unique features and scales regarding the
cooling load condition depending on building characteristics. This results in a generalizability issue, e.g., the data
from a system with a large CL may not generalize well to a system with a small one. Hence, we propose to use
PLR, a dimensionless variable, along with the number of active chillers, to describe the system’s cooling load
condition.
• Heterogeneity in system configurations. The quantities of cooling towers and cooling pumps often vary across systems. Direct output of absolute quantities by the agents would result in variable-dimensional action spaces across different systems, impeding the effective transfer of learned policies. This study proposes a ratio-based approach to tackle the problem, in which the control agent generates ratio-based control actions instead of specifying absolute operational instructions, hence improving the dimensional consistency, scalability, and generalization (see the sketch after this list).
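The sketch below illustrates these two ideas under assumed system parameters: the state is built from the dimensionless PLR, the number of operating chillers, and the wet bulb temperature, and a ratio-based action is decoded against each target system's own configuration. The class, field names, and example values are hypothetical.

from dataclasses import dataclass

@dataclass
class SystemConfig:                          # hypothetical per-system description
    chiller_capacity_kw: float               # nominal capacity of the main chiller group
    n_towers: int                            # number of cooling towers installed (design value)
    tower_min: int                           # lower bound of the tower-number action scale
    pump_freq_range: tuple = (30.0, 50.0)    # pump frequency action scale in Hz

def build_state(cooling_load_kw, n_chillers_on, t_wb, cfg: SystemConfig):
    # The dimensionless part load ratio replaces the absolute cooling load in the state
    plr = cooling_load_kw / (n_chillers_on * cfg.chiller_capacity_kw)
    return [plr, n_chillers_on, t_wb]        # same state dimensions for every system

def decode_actions(tower_ratio, pump_ratio, cfg: SystemConfig):
    # Ratio-based actions in [0, 1] are scaled onto each system's own action ranges
    n_towers = int(cfg.tower_min + tower_ratio * (cfg.n_towers - cfg.tower_min) + 0.5)
    freq_lo, freq_hi = cfg.pump_freq_range
    pump_freq = freq_lo + pump_ratio * (freq_hi - freq_lo)
    return n_towers, pump_freq

# The same ratio-based action yields system-specific instructions for Systems A and B
cfg_a = SystemConfig(chiller_capacity_kw=11251, n_towers=22, tower_min=15)
cfg_b = SystemConfig(chiller_capacity_kw=9845, n_towers=14, tower_min=7)
print(decode_actions(0.5, 0.5, cfg_a))       # (19, 40.0)
print(decode_actions(0.5, 0.5, cfg_b))       # (11, 40.0)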
As shown in Figure 3, the system cooling load is expressed as a function of part load ratios and the number of operating chillers, the value of which is sent to the tower agent and the water pump agent, which control the number and frequency of their respective equipment. Each agent outputs actions that map to corresponding ratio-based instructions within a pre-set scale, which can be converted to operational instructions for the equipment based on system characteristics. The innovations of this study are highlighted in the red dotted box in Figure 3, where the input-output procedure enhances the transferability and flexibility of RL by inputting PLRs and operating chiller numbers and outputting ratio instructions, compared to the conventional framework that directly takes physical inputs or generates operational instructions.
In practice, it is common for a cooling water system to have various chillers of different capacities to dynamically
distribute loads and optimize efficiency. Therefore, CL can be decomposed into multiple groups of PLR and the number of
operating chillers, with each group corresponding to a specific cooling capacity. These variables, along with 𝑇𝑤𝑏, are input to
the new agent and transferred agent as the state, which output actions for cooling towers and cooling pumps, respectively.
Flexibility. While the performance and efficacy of TRL implementation are positively correlated with the degree of similarity between systems, defining, comparing, and quantifying that similarity is challenging in practice, considering that similarity depends on various perspectives, including climates, loads, system structures, equipment capacities, etc.
This study proposes a method to bypass the similarity determination by structuring the control action as a ratio-based
value. These actions map to pre-set ratio instructions, expressed as percentages within a fixed scale. Hence, operational
instructions can be derived by multiplying these percentages by the topological metrics of cooling water systems. This method
allows for flexible adjustments to system parameters prior to learning and transfer processes, accommodating practical needs.

Figure 3 Uniform reinforcement learning framework in cooling water system optimization.

Supervision Mechanism

Overview. Directly transferring an RL agent into a new system often results in inefficient learning. Restricting TRL to
highly similar systems may mitigate efficiency issues but introduces the drawback of low data utilization. Moreover, a
standardized procedure for TRL remains absent in the current literature.
The proposed TRL approach addresses these issues with a dual-agent architecture, consisting of a transferred agent
responsible for exploitation and a new agent equipped with an initial DQN for exploration. Both agents operate within a uniform framework and share access to system information. Each agent in the TRL approach contains two sub-agents that control the
cooling towers and cooling water pumps, as described in the Uniform Agent Framework section. For simplicity in this section,
the focus remains solely on the new agent and transferred agent from the TRL perspective.
To balance exploration and exploitation, a supervision mechanism is employed using a preference function that defines
the supervisor’s preferences and calculates beliefs for each agent, allowing one to be selected to output an action at each step. Preferences are continuously updated based on system feedback, guided by a well-designed objective.
Preference and Belief. A preference function, denoted as 𝐻(𝐴𝑎𝑔𝑒𝑛𝑡), is established to facilitate the selection of either the
new agent or the transferred agent in a control iteration. A belief function is utilized to translate the output of the preference
function to a probability for agent selection.
The preference and belief functions need to be updated continuously to ensure optimal performance. The Gradient Bandit
Algorithm (Sutton and Barto 2018) is employed to facilitate updating the preference function and adapting the belief for both
agents in a coordinated manner. The update rule of the preference is shown in Equation (2), where 𝛼 refers to the learning rate and 𝑂𝑏𝑗 refers to the objective to be designed. In this study, the value of 𝛼 is determined through preliminary investigations. The details of 𝑂𝑏𝑗 determination are presented in the Objective Design of Supervision sub-section.
𝐻 ← 𝐻 + 𝛼(𝑂𝑏𝑗)(1 − 𝑃𝑟(𝐴𝑎𝑔𝑒𝑛𝑡)),   if 𝐴𝑎𝑔𝑒𝑛𝑡 is the chosen agent          (2)
𝐻 ← 𝐻 − 𝛼(𝑂𝑏𝑗)𝑃𝑟(𝐴𝑎𝑔𝑒𝑛𝑡),           if 𝐴𝑎𝑔𝑒𝑛𝑡 is not the chosen agent
A soft-max distribution is utilized to define the belief function (i.e., Gibbs or Boltzmann distribution), as presented in
Equation (3). This definition meets the requirement that the higher the H value of an agent, the more likely it is to be chosen.
𝑃𝑟(𝐴𝑎𝑔𝑒𝑛𝑡) = 𝑒^𝐻(𝐴𝑎𝑔𝑒𝑛𝑡) / ∑ 𝑒^𝐻          (3)
Objective Design of Supervision. In the initial stage of TRL, the new agent is expected to learn primarily from the
transferred agent. Specifically, when an action from the transferred agent is executed by the system, the corresponding output
is also provided to the new agent, allowing it to evolve and improve. In the later stages, as the new agent accumulates knowledge
from the transferred agent, it tends to be more frequently selected to overcome the potential bottleneck of the transferred agent.
Thus, in scenarios where the system similarity is limited, a gradual decrease in the preference for the transferred agent is
beneficial for the overall performance.
To reflect this guiding principle, the objective function for the preference update is defined as −MSE(𝐶𝑂𝑃, 𝑄), where MSE refers to the mean-square error between 𝐶𝑂𝑃 and 𝑄, 𝐶𝑂𝑃 refers to the actual COP from the system, and 𝑄 refers to the action value predicting the COP. If the transferred agent consistently underperforms the new agent in predicting the COP due to its fixed model structure and parameters, its preference will be decreased progressively by the supervision module. Conversely, if the transferred agent performs well in COP prediction, indicating a high similarity between the source and target systems, its preference is maintained rather than reduced.
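Putting Equations (2) and (3) and the −MSE objective together, the supervision mechanism could be sketched as follows. This is a minimal, assumed implementation (the class and method names are illustrative, not the authors' code), using a single-step squared error as the objective.

import numpy as np

class Supervisor:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.H = np.zeros(2)        # preferences: index 0 = transferred agent, 1 = new agent

    def beliefs(self):
        # Softmax (Gibbs/Boltzmann) belief over the two agents, Equation (3)
        e = np.exp(self.H - self.H.max())
        return e / e.sum()

    def select_agent(self):
        # Sample which agent acts at this step according to the current beliefs
        return np.random.choice(2, p=self.beliefs())

    def update(self, chosen, actual_cop, predicted_q):
        # Objective: negative squared error between the measured COP and the
        # chosen agent's action-value prediction Q of the COP
        obj = -(actual_cop - predicted_q) ** 2
        pr = self.beliefs()
        for agent in range(2):
            if agent == chosen:
                self.H[agent] += self.alpha * obj * (1.0 - pr[agent])   # Equation (2), chosen agent
            else:
                self.H[agent] -= self.alpha * obj * pr[agent]           # Equation (2), agent not chosen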

Key Performance Indexes

The system COP is selected as the KPI for evaluation. It is computed as 𝐶𝐿/(𝑃chiller + 𝑃pump + 𝑃𝑡𝑜𝑤𝑒𝑟), where 𝑃chiller, 𝑃pump, and 𝑃𝑡𝑜𝑤𝑒𝑟 refer to the summed power of all chillers, cooling water pumps, and cooling towers, respectively. Besides, the system COP also serves as the reward fed back to the RL agent for decision making at each iterative step.

CASE STUDY

System Descriptions

A case study is conducted to validate the proposed methodology, comprising three cooling water systems (System A, B,
and C) for three factories located in distinct cities across China (Changzhou: Köppen Cfa, humid subtropical, or IECC 3A, warm humid; Xiamen: Cfa or 2A, hot humid; and Chengdu: Cwa, monsoon-influenced humid subtropical, or 3A) (Mineralogy 1993), each characterized by unique climatic conditions. This cross-climate case study enables a comprehensive
examination of the methodology's effectiveness in transferring knowledge across different environments, with System A
serving as the source and Systems B and C as target systems.
Detailed information about these systems is shown in Table 2, including the number of chillers, cooling towers, cooling
pumps, and the chillers’ nominal refrigerating capacities. The cooling loads of Systems A and C are significantly higher than that of System B, resulting in the latter having relatively fewer cooling towers.
Table 2. Systems’ Information
System                               A                          B                         C
Capacity and number of chillers      11,251 kW/3,200 RT, 8      9,845 kW/2,800 RT, 4      11,251 kW/3,200 RT, 9
                                     5,997 kW/1,700 RT, 2       4,922 kW/1,400 RT, 2      5,626 kW/1,600 RT, 2
Number of cooling towers             22                         14                        22
Number of cooling water pumps        10                         6                         14

In Figure 4 we illustrate two common types of cooling water system structures. In subplot (a), chillers are cascaded with
cooling pumps, and this combination is then connected to parallel-connected cooling towers. In subplot (b), chillers, cooling
pumps, and cooling towers are first connected in parallel within their respective equipment groups, and then these groups are connected sequentially. For simplicity, these two system structures are referred to as the cascade structure and the parallel structure. Systems A and B utilize the cascade structure, while System C employs the parallel structure. In summary, with two groups of contrasts, i.e., (A, B) and (A, C), the performance of the transfer can be investigated under different system structures as well as different refrigerating capacities.
As presented in Table 2 and Figure 4, it is evident that transfer (A, B) involves the same system structure but has distinct
cooling capacities and more climate diversity, while transfer (A, C) features different system structures with similar cooling
capacities and climate characteristics. In Figure 4(c), the scatter plot illustrates the relationship between the part load ratio and
the COP for the main chillers of the three different systems, highlighting variations in efficiency across varying load conditions.
It is worth noting that this paper does not present or quantify the similarity index of the systems, as this approach aims to
demonstrate its applicability even when precise similarity metrics are not available or easily quantifiable in real-world scenarios.

Figure 4 Two types of system structures: (a) cascade structure, and (b) parallel structure; and (c) Part Load Ratio
comparison between three systems.

Model Setups

System Model. The field operation data, including cooling loads and outdoor 𝑇𝑤𝑏, were collected at 20-minute intervals from July to December 2022. These data were post-processed and fed into the cooling water system models for training
purposes.
Agent Parameters. All DQN models are designed to have the same architecture, i.e., two hidden layers, each with 150
neurons. The learning rate is set as 1/((𝑖𝑡𝑒𝑟 + 1) × 20), where 𝑖𝑡𝑒𝑟 refers to the current iteration number. These values were
determined based on preliminary investigations to ensure good model performance (Xiong et al. 2023).
Operational Instruction Scale. Based on engineering judgments and the load condition, the action scale for the number
of cooling towers spans from the design value to seven units below it, while the action scale for the cooling water pump
frequency ranges from 50 to 30 Hertz.
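For reference, the reported settings can be summarized in a short configuration sketch; the function and constant names below are illustrative, not taken from the paper.

HIDDEN_LAYERS = (150, 150)                 # two hidden layers of 150 neurons each (all DQN agents)

def learning_rate(iteration: int) -> float:
    # Learning rate decays with the iteration number, as reported above
    return 1.0 / ((iteration + 1) * 20)

def tower_action_scale(design_tower_count: int) -> tuple:
    # Tower-count actions span from the design value down to seven units below it
    return (design_tower_count - 7, design_tower_count)

PUMP_FREQ_SCALE_HZ = (30.0, 50.0)          # pump frequency actions range from 30 to 50 Hz

print(learning_rate(0), tower_action_scale(22))   # 0.05 and (15, 22), e.g., for System A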

Simulation Plan

Performance Comparison. This study investigates two TRL scenarios, alongside a baseline method. The baseline
method, referred to as Learn from Scratch (LFS), involves a new agent learning without prior experience. Additionally, two
variants of the proposed TRL framework are examined: Transfer and Learn with Belief (TLB), where a new agent learns under
the supervision of a transferred agent that continues to learn alongside the new agent, with both agents selected by a belief
mechanism; and Transfer and Fix the Policy with Belief (TFB), where a new agent learns from a transferred agent with a fixed
policy, with both agents chosen by the belief mechanism. These variants aim to explore the efficacy of having a transferred
agent continue learning in a new system as an advisor.
Action Scale Adjustment. Based on the extensive experience of the building management team, who have overseen this
specific facility for several years, the potential optimal number of operating cooling towers for System B is determined to be 7 under most conditions. This number is derived from the historical operation of the facility. Considering that the primary objective of this paper is to create and validate a generic framework for TRL in building applications, the focus is restricted to the TRL framework rather than a detailed validation of the operational characteristics of the system. Hence, a long scale [7, 14] and a short scale [7, 10] for actions controlling the cooling tower number are employed for TFB in the transfer A-B, which is convenient thanks to the proposed ratio-based instructions (see the Flexibility subsection). This scenario aims to investigate
how different scales influence the TRL performance.

RESULT AND DISCUSSION

Performance Comparison

The comparison of various scenarios is presented in Figure 5. All the performance (COP) curves are based on the average
results of 10 runs, where a 95% confidence interval is provided in a lighter color. The x-axis represents the number of iterations,
while the y-axis is the system COP. Subplots (a) and (b) show different COP ranges in the y-axis due to different target system
characteristics. The conclusions are summarized below.
First, LFS starts at a COP of 3.67 and 5.63 in transfer scenarios (A-B) and (A-C), respectively. For TLB, the starting values are 3.65 and 5.63, indicating slightly poorer initial performance than LFS. TFB starts at a COP of 3.71 in A-B and 5.66 in A-C, higher than the others. Second, LFS and TLB end at a COP of 3.72, while TFB ends at a COP near 3.71, in transfer A-B; LFS, TLB, and TFB end at a COP of 5.68, 5.69, and 5.685, respectively, in transfer A-C. In terms of final performance, there are no significant distinctions. Third, the trends of the three scenarios differ between A-B and A-C owing to system characteristics. Specifically, in the transfer A-B, TFB’s COP decreases until the 6th iteration and rises from iteration 6 to the end, while LFS and TLB have similar learning curves, starting at a low value and catching up with TFB at iteration 6, indicating that TFB saves about 83.3% of the training time. In the transfer A-C, the three scenarios have similar trends, and TLB is lower than the others in most iterations.
In summary, from the perspectives of initial performance and time saving, TFB is favorable and performs well, while TLB’s poor performance indicates that continued learning of the transferred agent lowers the performance of the whole TRL framework.

Figure 5 Performance comparison of scenarios: (a) COP (A-B), (b) COP (A-C).

Action Scale Adjustment

In Figure 6, the histogram of operational instructions across three iteration scenarios (i.e., iteration 1, 5, and 9) is shown
for the A-B transfer. The iteration numbers are marked with different colors.
In Figure 6(a), with the large scale, the distribution center of iteration 1 (blue) and iteration 5 (orange) is 13, and that of iteration 9 is 10. In Figure 6(b), with the small scale, the distribution center of iterations 5 and 9 is 7, and that of iteration 1 is 9. From the two subplots, it is evident that the agent prefers a lower operating number as the iterations increase. Based on the building manager’s experience mentioned in the Simulation Plan section, the small action scale performs better, since it reaches 7 by iteration 5 while the other does not, even by iteration 9. In summary, an action mapping from ratio-based values to operational instructions provides more flexibility for TRL.

Figure 6 The operational instruction distribution changes with the number of iterations under different preset scales:
(a) Number scale: [7, 14]; (b) Number scale: [7, 10].

CONCLUSION

In this paper, we argue that introducing transfer reinforcement learning into cooling water system optimization is
beneficial and reasonable. A TRL framework is proposed to unify the information transmission and action generation to
enhance the overall performance and efficiency of TRL. Besides, a supervision mechanism with a belief function is
incorporated to control the probability of selecting the new or transferred agent, balancing exploitation and exploration.
A case study shows that the proposed method can reduce the training time by 83.3% in its best-performance scenario.
Moreover, the performance of the proposed TRL cannot be guaranteed when the transferred agent continues learning in the new system and acts as an advisor to the new agent. Additionally, flexible action mapping contributes to better performance in TRL.
The proposed TRL framework is valuable for promoting the RL implementation in cooling water systems.
REFERENCES

Ahn, K. U., and C. S. Park. 2020. Application of deep Q-networks for model-free optimal control balancing between different
HVAC systems. Science and Technology for the Built Environment, 26 (1), 61-74.
ASHRAE. 2024. 2024 ASHRAE Handbook—HVAC Systems and Equipment.
Biemann, M., F. Scheller, X. Liu, and L. Huang. 2021. Experimental evaluation of model-free reinforcement learning
algorithms for continuous HVAC control. Applied energy, 298, 117164.
Coraci, D., S. Brandi, T. Hong, and A. Capozzoli. 2023. Online transfer learning strategy for enhancing the scalability and
deployment of deep reinforcement learning control in smart buildings. Applied energy, 333, 120598.
Fang, X., G. Gong, G. Li, L. Chun, P. Peng, W. Li, and X. Shi. 2023. Cross temporal-spatial transferability investigation of
deep reinforcement learning control strategy in the building HVAC system level. Energy, 263, 125679.
Fu, Q., Z. Han, J. Chen, Y. Lu, H. Wu, and Y. Wang. 2022. Applications of reinforcement learning for building energy
efficiency control: A review. Journal of Building Engineering, 50, 104165.
Fu, Q., Z. Wang, N. Fang, B. Xing, X. Zhang, and J. Chen. 2023. MAML2: meta reinforcement learning via meta-learning for
task categories. Frontiers of Computer Science, 17(4), 174325.
Heidari, A., F. Maréchal, and D. Khovalyg. 2022. An occupant-centric control framework for balancing comfort, energy use
and hygiene in hot water systems: A model-free reinforcement learning approach. Applied energy, 312, 118833.
Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. 2017. Lightgbm: A highly efficient gradient
boosting decision tree. Advances in neural information processing systems, 30.
Kurte, K., J. Munk, O. Kotevska, K. Amasyali, R. Smith, E. McKee, Y. Du, B. Cui, T. Kuruganti, and H. Zandi. 2020.
Evaluating the adaptability of reinforcement learning based HVAC control for residential houses. Sustainability,
12(18), 7727.
Lissa, P., M. Schukat, and E. Barrett. 2020. Transfer learning applied to reinforcement learning-based hvac control. SN
Computer Science, 1(3), 127.
Mahbod, M. H. B., C. B. Chng, P. S. Lee, and C. K. Chui. 2022. Energy saving evaluation of an energy efficient data center
using a model-free reinforcement learning approach. Applied energy, 322, 119392.
Mineralogy, H. I. o. 1993. The Köppen Climate Classification.
Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and
G. Ostrovski. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Pinto, G., Z. Wang, A. Roy, T. Hong, and A. Capozzoli. 2022. Transfer learning for smart buildings: A critical review of
algorithms, applications, and future perspectives. Advances in Applied Energy, 5, 100084.
Qiu, S., Z. Li, Z. Li, J. Li, S. Long, and X. Li. 2020. Model-free control method based on reinforcement learning for building
cooling water systems: Validation by measured data-based simulation. Energy and buildings, 218, 110055.
Qiu, S., Z. Li, D. Fan, R. He, X. Dai, and Z. Li. 2022. Chilled water temperature resetting using model-free reinforcement
learning: Engineering application. Energy and buildings, 255, 111694.
Sierla, S., H. Ihasalo, and V. Vyatkin. 2022. A review of reinforcement learning applications to control of heating, ventilation
and air conditioning systems. Energies, 15(10), 3526.
Siqin, Z., D. Niu, M. Li, T. Gao, Y. Lu, and X. Xu. 2022. Distributionally robust dispatching of multi-community integrated
energy system considering energy sharing and profit allocation. Applied energy, 321, 119202.
Sutton, R. S., and A. G. Barto. 2018. Reinforcement learning: An introduction: MIT press.
Wang, M., and B. Lin. 2023. MF^2: Model-free reinforcement learning for modeling-free building HVAC control with data-
driven environment construction in a residential building. Building and Environment, 244, 110816.
Wang, Z., Q. Fu, J. Chen, Y. Wang, Y. Lu, and H. Wu. 2023. Reinforcement learning in few-shot scenarios: A survey. Journal
of Grid Computing, 21(2), 30.
Xing, Z., Y. Pan, Y. Yang, X. Yuan, Y. Liang, and Z. Huang. 2024. Transfer learning integrating similarity analysis for short-
term and long-term building energy consumption prediction. Applied energy, 365, 123276.
Xiong, Q., Z. Li, W. Cai, and Z. Wang. 2023. Model-free optimization of building cooling water systems with refined action space. Paper presented at Building Simulation 2023.
Zhang, X., Y. Sun, D.-c. Gao, W. Zou, J. Fu, and X. Ma. 2022. Similarity-based grouping method for evaluation and
optimization of dataset structure in machine-learning based short-term building cooling load prediction without
measurable occupancy information. Applied energy, 327, 120144.

