1. Introduction
With the rapid growth of the Internet, demand for online services and data storage is driving the expansion of data centers, which has led to a massive increase in energy consumption and carbon emissions. Facing the dual pressures of energy and environmental concerns, China has introduced its dual-carbon target, and improving energy efficiency in the data center field is one of the keys to achieving it. Given that cooling systems are a major component of data centers and directly affect their energy consumption, the types of cooling systems, their optimization control strategies, and their evaluation metrics form the crucial links in the chain from design and operation to assessment. First, categorizing cooling systems provides a framework for understanding the design and functionality of each system. Next, exploring optimization control strategies helps to enhance energy efficiency. Finally, appropriate evaluation metrics allow system performance to be quantified and compared, providing data support for optimization and adjustment. A detailed assessment of these key aspects not only reveals trends in technological development but also offers a theoretical and practical basis for selecting or developing the most suitable cooling technologies. This comprehensive analysis aids energy efficiency management and promotes the fulfillment of environmental responsibilities, in line with the strategic goals of sustainable development.
Data center cooling is crucial for safeguarding equipment functionality and enhancing energy efficiency, primarily categorized into air and liquid cooling systems. Air cooling methods, including direct air cooling, indirect air cooling, and evaporative cooling [
1,
2], improve efficiency through optimized airflow management [
3,
4]. Liquid cooling methods, such as cold plate cooling, immersion cooling, and spray cooling [
5,
6], directly cool components with liquid, enhancing server efficiency, stability, and reducing energy consumption and noise [
7,
8,
9]. Although extensive research exists on cooling technologies [
10], there is a lack of focus on selecting the most suitable cooling system during the design phase and integrating these systems into specific environments [
11,
12]. This review compares the advantages, disadvantages, cost-effectiveness, and environmental adaptability of air and liquid cooling systems, emphasizing the key factors in system selection. It aims to assist decision-makers in choosing the most appropriate cooling system based on the external environment of their data centers. In addition, modern data centers are seeing an increase in rack density due to advancements in AI and other high-performance computing workloads. The average rack power density has grown from around 8.5 kW per rack in 2023 to an expected 12 kW per rack in 2024, highlighting the need for cooling systems capable of managing significantly higher heat loads [
13]. Cooling systems can account for approximately one-third of a data center’s total energy consumption, with liquid cooling technologies offering notable efficiency improvements. For example, direct-to-chip cooling systems have been shown to reduce facility power needs by up to 18%, contributing to an overall energy cost savings of around 10% [
14]. Furthermore, the Coefficient of Performance (COP) of cooling systems varies by technology and configuration, with modern liquid cooling systems achieving higher COPs compared to traditional air cooling. Immersion cooling systems, in particular, have been reported to reduce energy consumption by up to 94% in specific applications, reflecting their superior energy efficiency [
15].
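For reference, the two efficiency metrics cited above are simple ratios of measured quantities; a minimal sketch, with purely hypothetical facility figures that are not drawn from the cited studies:

```python
# PUE (Power Usage Effectiveness): total facility power / IT equipment power.
# COP (Coefficient of Performance): heat removed / electrical power used by cooling.
# All numbers below are hypothetical, for illustration only.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Lower is better; the ideal value is 1.0 (all power goes to IT)."""
    return total_facility_kw / it_load_kw

def cop(heat_removed_kw: float, cooling_power_kw: float) -> float:
    """Higher is better: kW of heat removed per kW of electricity consumed."""
    return heat_removed_kw / cooling_power_kw

facility = pue(total_facility_kw=1500.0, it_load_kw=1000.0)    # -> 1.5
chiller = cop(heat_removed_kw=1000.0, cooling_power_kw=250.0)  # -> 4.0
print(f"PUE = {facility:.2f}, COP = {chiller:.2f}")
```

These definitions make the comparison above concrete: a liquid-cooled facility that cuts cooling power while holding the IT load constant lowers its PUE and raises the COP of its cooling plant.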
The control strategies of cooling systems have a significant impact on the energy consumption and reliable operation of data centers. To optimize energy efficiency and ensure system stability, it is essential to classify and understand the different control strategies. Although proportional–integral–derivative (PID) control is simple and effective in practice, its limitations become apparent under the dynamic, strongly coupled thermal loads of data centers. Therefore, exploring and implementing more advanced optimization control strategies is crucial for enhancing system efficiency and adaptability. Model predictive control (MPC) predicts future states and adjusts controls accordingly to better adapt to changing conditions, while reinforcement learning (RL) optimizes control strategies through trial-and-error learning, improving system efficiency over time. In this context, Zhang et al. [
6] proposed optimization strategies based on MPC and RL. They evaluated the effectiveness of these methods in improving system efficiency and reliability, summarizing their applicable conditions. Du et al. [
16] analyzed the performance of PID, MPC, and RL in the dynamic thermal environment control of data centers. Recent technological advancements have led to the integration of multiple control strategies in the optimization process, rather than relying solely on a single method. Therefore, this paper aims to analyze and discuss the features of combined control strategies versus single control strategies. Additionally, it outlines the future development directions of each technology, providing better guidance for the future optimization control of data centers.
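To illustrate the baseline strategy discussed above, a discrete PID loop regulating supply air temperature might look as follows; the gains, setpoint, and sign convention are arbitrary placeholders for illustration, not tuned values from the cited work:

```python
# Minimal discrete PID controller for a cooling setpoint; purely illustrative.
class PID:
    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement: float, dt: float) -> float:
        """Return a control signal (e.g., a fan or valve command) from one sample."""
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical use: a 24 °C supply-air target; a negative signal here means
# the measured temperature is above the setpoint, i.e., more cooling is needed.
controller = PID(kp=2.0, ki=0.1, kd=0.5, setpoint=24.0)
signal = controller.update(measurement=26.5, dt=1.0)
```

The sketch also makes the paragraph's contrast tangible: this loop reacts only to the current error with fixed gains, whereas MPC anticipates future states and RL adapts its policy from experience, which is precisely what the combined strategies discussed above exploit.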
Given the high energy consumption of data centers, their energy efficiency assessment has attracted extensive attention from data center operators, policy makers, and researchers. Energy efficiency assessment is not only an effective tool for measuring the efficiency of energy use but also provides important guidance [
17,
18]. In the current literature, there are various ways to categorize energy efficiency indicators; for example, they can be categorized according to the granularity of the indicator [
19], or according to the type of indicator, such as energy efficiency, eco-design and safety [
20]. PUE has become the most mainstream energy efficiency assessment index in the industry, yet it has its limitations. The literature has seen numerous improvements to PUE, including the creation of refined energy efficiency metrics like ApPUE and AoPUE [
21], the development of new assessment tools such as PEMC [
22], the optimization of prediction models with artificial intelligence algorithms [
23], and the introduction of advanced sensing technologies for more accurate parameter measurement [
24]. However, there are few review-type papers that provide a comprehensive summary of the work on PUE improvement. Therefore, our work serves to illuminate potential directions for future research. Many studies [
9,
10,
11] have reviewed the classification, control strategy optimization, and evaluation metrics of cooling systems in data centers, but the utilization of bibliometrics in this field remains infrequent. Using bibliometrics quantitatively can provide a more comprehensive overview, which can guide future research by analyzing historical developments and current trends in the field, as has been proven in other fields [
25,
26,
27,
28]. Therefore, this review offers an overview of the following topics through bibliometric analysis using the CiteSpace-6.3.1 software. Combining this bibliometric software with a comprehensive review of the existing literature, we aim to address the following key issues related to the optimization control strategies and evaluation indicators of data center cooling systems:
- (1)
The specific classification of data center cooling systems, with a comprehensive description of liquid cooling and air cooling, including the specific categories included.
- (2)
Design optimization measures for data center cooling systems, elaborated in detail from the two aspects of air cooling and liquid cooling.
- (3)
Optimal control strategies for data center cooling systems, such as PID control, model predictive control, and reinforcement learning.
- (4)
The classification and current status of the energy consumption indicators of data center cooling systems, and a detailed description of the shortcomings of PUE and improvement measures.
These key issues first clarify the basis of the research, namely that the study is carried out with the support of bibliometric software, and are then addressed step by step. Starting from the classification of cooling systems gives readers a clear understanding of the basic composition of these systems and lays the foundation for the subsequent discussion of optimization measures and control strategies. Design optimization measures are then explored in depth, with detailed elaboration for the different cooling methods to keep the discussion targeted and actionable. Next, optimization control strategies are examined, and specific control methods are listed to provide technical support for the efficient operation of the system. Finally, attention turns to energy consumption indicators, in particular a detailed analysis of the shortcomings of PUE and measures for its improvement, with the aim of reducing energy consumption and improving system performance. Overall, this study provides new insights into the design, operation optimization, and performance evaluation of data center cooling systems, enabling researchers to quickly and thoroughly grasp the research content and significance of this field.
2. Data and Methods
2.1. Paper Search
To conduct a quantitative and visual analysis of the research content, it is essential to extract relevant papers from pertinent databases. The process of collecting papers is primarily divided into three phases: identifying the databases, performing keyword searches, and selecting the relevant papers. The databases selected for this study were Scopus, Web of Science, and IEEE. The selection was based on the comprehensive nature of these databases. Scopus offers an extensive collection of papers. Web of Science indexes various core journal articles, providing a platform for high-quality paper searches. IEEE focuses on electronic technology, making it crucial for research related to data centers and electronic technology. Consequently, it was necessary to include the IEEE database in the paper search process.
After determining the databases, we conducted literature retrieval and screening in Scopus, Web of Science, and IEEE. In alignment with our research focus, we searched for the topic “data center cooling systems”, using that phrase as the search keywords. The initially retrieved papers were then selected or excluded based on specific criteria.
- (1)
Papers published between 2004 and 2024.
- (2)
Research fields limited to engineering, civil engineering, and architecture. Papers within these fields were selected, while those outside these fields were excluded.
- (3)
Non-English papers were excluded.
- (4)
Published journal papers were selected, while non-journal papers were excluded.
The literature search was restricted to the period from 2004 to 2024, spanning nearly 20 years, for two reasons. First, much older research may have limited relevance to future scientific inquiries. Second, the past two decades represent a period of vigorous development in the global data center industry. During this period, data centers have evolved from computing centers, information centers, and cloud centers to power centers, gradually integrating new technologies and advancing towards greener and smarter directions. This period has been marked by continuous upgrading and development for data centers [
29].
After the initial screening, the selected papers were further screened and excluded based on the following criteria.
- (1)
To ensure the selected papers aligned more closely with our research topic, filters in the Scopus and IEEE databases were applied with the following keywords: “data center”, “cooling systems”, “energy-saving”, “energy consumption”, “optimization”, and “energy efficiency”. The same keywords were also used to filter results in Web of Science.
- (2)
Based on the titles and abstracts of the papers, we checked whether one or more of the above keywords were mentioned. Papers that mentioned these keywords were selected, while those that did not mention any of the keywords were excluded. Additionally, papers that included the keywords in the titles and abstracts but were unrelated in direction were also excluded.
- (3)
After re-screening the papers based on the above criteria, we performed a full-text review of the remaining papers. We selected the final papers based on the following criteria: whether the paper focused on data center cooling systems and whether it involved technologies, theories, or practical cases in this field. Papers meeting these criteria were retained, determining the final number of selected papers in each database. A summary of the paper search and screening process in a flowchart is shown in
Figure 1.
2.2. Database Analysis
Figure 2 illustrates the total number of studies selected from the Web of Science, Scopus, and IEEE databases. We observed that the total number of studies selected in Scopus was the highest, with 291 articles, accounting for 62%. The total number of studies selected in Web of Science was 121 articles, accounting for 26%. The total number of studies selected in IEEE was 55 articles, accounting for 12%.
IEEE excels in electrical, electronics, and computer engineering research and is highly regarded for its comprehensive collection of published papers. Data centers have gained significant attention, correlating strongly with research areas such as electronics and computer science. Utilizing the keyword “data center” in the IEEE database yielded 69,086 journal articles from the period of 2004 to 2024. This indicates that data centers have become a significant focus in IEEE journal publications. However, when searching for “data center cooling systems” in the IEEE database from 2004 to 2024, only 415 journal articles were found. This suggests that specialized research on data center cooling systems within IEEE is relatively scarce.
The cooling system is also a vital component of data centers, accounting for nearly 50% of total data center power consumption [
30].
For instance, regarding system control and energy efficiency evaluation, Bruno et al. [
30] analyzed data centers from the perspective of networked physical systems. They considered both network and physical factors, introducing a control model that couples a computer network representing network dynamics with a thermal network representing physical dynamics. This model was used to evaluate the energy efficiency and computational performance of different data center control strategies. Additionally, they introduced the cyber–physical index (CPI) to measure the distribution of network and physical effects in data centers. The study by Bruno et al. [
30] is thus notable for simultaneously considering physical effects alongside computational networks in data centers.
In the direction of optimal control, Cheong et al. [
31] proposed a comprehensive and multi-pronged approach to data center environmental control to prevent the formation of hotspots and cold spots around server cabinets. This approach aims to prevent overheating, premature aging, performance degradation, or condensation damage to server cabinets. The method includes Computational Fluid Dynamics (CFD) simulation-assisted predictive design and a complementary Internet of Things (IoT) responsive management system.
In achieving the goal of energy conservation, Berezovskaya et al. [
32] proposed a modeling toolkit that can construct data center models of any scale and configuration. The toolkit consists of a set of building blocks that can model various components of a data center. Using this toolkit, data centers can be modeled to estimate the energy consumption, temperature variation, and air temperature of all computing nodes and to evaluate the performance under different energy-saving strategies. The simulation data from the model, when compared with the actual data from corresponding real data centers, show similar trends.
Furthermore, in the direction of reducing energy consumption and saving energy, Ran et al. [
33] proposed a new event-driven control and optimization algorithm within the framework of deep reinforcement learning (DRL).
Ahmed et al. [
34] published a review paper in the IEEE database. They systematically reviewed and classified energy consumption models for the major load segments of data center components, such as IT, Internal Power Conditioning Systems (IPCS), and cooling loads, revealing the strengths and weaknesses of various models across different applications. The paper's most innovative contribution is that it provides the first systematic review of the current research status of reliability indicators, reliability models, and reliability methods. However, it does not focus on data center cooling systems themselves: it offers only a brief overview of existing energy consumption models for the cooling part and does not review the classification, performance optimization strategies, or evaluation indicators of the cooling systems. A review specifically targeting cooling systems could therefore significantly enrich the research content in the data center domain of the IEEE database.
IEEE Access is a multidisciplinary, open-access journal focused on electronic research. Its scope is broad, encompassing many areas that intersect with and relate to data centers, and it already publishes papers in the data center domain. Therefore, articles on data center cooling systems are both relevant and desirable for this journal.
2.3. Analysis of Annual Publication Trends
We observed the publication years of the selected papers and listed the number of papers selected from each database annually from 2004 to 2024, spanning nearly 20 years, as shown in
Figure 3. Papers in this field have appeared in the Scopus database consistently from 2004 to 2024, with rapid growth in publication numbers after 2015. The peak came in 2023, with 50 papers, accounting for 17% of the selected papers in Scopus. In the Web of Science database, relevant papers emerged around 2008; the peak was also in 2023, with 18 articles, accounting for 15% of the selected papers in Web of Science. The IEEE database has relatively low annual publication numbers, with relevant papers appearing only after 2011; its peak was in 2021, with 9 articles, accounting for 16% of the selected papers in IEEE.
The analysis of annual publication trends and the obtained results are indeed consistent with the development trajectory of data centers. In the early 1990s, particularly in the USA, the number of data centers was minimal, primarily used for government and research applications, with limited commercial use. During the period from 2001 to 2006, the rapid growth of the Internet resulted in a substantial rise in the number of websites. It was also during this time that data centers gained widespread recognition. From 2006 to 2012, data centers globally entered the stage of information centers, characterized by a slowdown in the growth rate of the data center market. This phase also witnessed the introduction of cloud computing technology into data centers. From 2012 to 2019, the data center industry transitioned into the era of cloud-centric data centers. During this period, countries worldwide imposed energy-saving requirements on data centers. For example, in 2016, the United States Office of Management and Budget (OMB) announced the “Data Center Optimization Initiative (DCOI)”, requiring U.S. government agencies to monitor and measure metrics such as data center energy consumption, PUE targets, virtualization, server utilization rates, and equipment utilization rates. Since 2019, due to the accelerated development of various new digital technologies, the growth rate of data centers has experienced a counter-cyclical increase. During this phase, data centers began to evolve towards green and intelligent directions [
29]. The global demand for energy efficiency in data centers has increased since 2012, and data centers have transitioned towards green and intelligent directions since 2019. The number of papers on data center cooling systems regarding classification, optimization strategies, and energy efficiency evaluation has been increasing in various paper databases.
From the publication years of the papers, it can be observed that research on data center cooling systems, as well as related areas such as natural cooling, optimization methods, and energy consumption metrics evaluation, has received increasing attention in recent years. Studying and summarizing these areas will play a crucial guiding role in the design and optimization of future data center cooling systems.
2.4. Keyword Analysis
“CiteSpace-6.3.1” is a free Java application used for visualizing and analyzing trends and patterns in the scientific literature [
35,
36]. CiteSpace-6.3.1 is a visual bibliometric software designed for creating scientific knowledge maps. It can generate knowledge maps of relevant fields, offering a comprehensive view of a specific knowledge domain. Through diversified and dynamic network analysis, CiteSpace-6.3.1 identifies critical literature, hot research topics, and emerging trends within a scientific field. It can perform co-occurrence, clustering, and burst analysis of keywords, with keyword co-occurrence and burst analysis being valuable indicators for assessing the future development trends of a research area. Additionally, it provides visual analyses of countries, institutions, and more. These features give CiteSpace-6.3.1 unique advantages compared to other visual bibliometric software. Apart from CiteSpace-6.3.1, VOSviewer is another commonly used visual bibliometric software. Although VOSviewer offers a simpler interface and charts, it lacks some of the unique features and more comprehensive, diversified visual analyses that CiteSpace-6.3.1 provides [
37]. Using CiteSpace-6.3.1 to conduct keyword co-occurrence analysis on the selected papers from the Scopus and Web of Science databases, we generated keyword co-occurrence graphs as shown in
Figure 4 and
Figure 5.
Scopus adopts the time interval from 2005 to 2024 for the keyword co-occurrence graph, while Web of Science adopts the time interval from 2008 to 2024 for the keyword co-occurrence graph. The high-frequency keywords extracted from the selected papers in Scopus and Web of Science using CiteSpace-6.3.1 are presented in
Table 1 and
Table 2, respectively.
Based on the high-frequency keywords and keyword co-occurrence graphs from both databases, “data center” and “cooling system” emerge as the most frequent keywords. This indicates that the focus of our research is indeed the cooling systems of data centers. In the keyword co-occurrence graphs, we can also observe numerous co-occurrence relationships between keywords related to “data center” and “cooling” with other keywords. This demonstrates that these two keywords are indeed core keywords. Additionally, “energy efficiency” has a high frequency of occurrence in both databases, indicating that the energy efficiency of data center cooling systems is also a key focus of research. In the Scopus database, “energy utilization” and “energy conservation” are also high-frequency keywords, indicating an increasing focus on the importance of optimization strategies for data center cooling systems. Optimizing data center cooling systems can significantly increase their energy-saving potential. The comparison between
Figure 4a–c shows that after 2011, there is a greater variety of keywords appearing in the keyword co-occurrence graph. This indicates that research on performance, energy efficiency, and optimization has become increasingly in-depth since 2011. In the Web of Science database, “performance” and “free cooling” are both high-frequency keywords. This indicates that natural cooling is widely utilized in data centers, and the performance of various cooling systems is a key focus in research on data center cooling. The thickness of the warm-colored rings around the high-frequency keywords in the co-occurrence graph is also significant, further confirming that these high-frequency keywords are indeed the current hotspots of attention in data center cooling systems.
Using CiteSpace-6.3.1, we generated a temporal clustering graph of keywords from the database with the most selected articles, Scopus, for the years 2020 to 2024, as shown in
Figure 6. Among them, “energy efficiency” and “indirect evaporative cooling” are two clusters within the keyword clustering. In the direction of energy efficiency, relevant keywords continue to appear from 2020 to 2024. Data center cooling systems urgently need energy-saving measures, and adopting optimization strategies can alleviate the energy consumption of the cooling systems. Indirect evaporative cooling is one of the technologies for cooling systems and falls under the broader classification of cooling systems. Additionally, the COP is also one of the clusters. However, there has been little attention paid to COP in recent years. COP is a performance indicator used to evaluate the refrigeration capacity of data center cooling systems. Discussing the advantages and disadvantages of various energy efficiency evaluation indicators and proposing improvements to these indicators is indeed a valuable research direction.
2.5. Publication Journal Analysis
Figure 7 illustrates the annual publication volumes of selected papers from Scopus, Web of Science, and IEEE, categorized by the journals with the highest publication counts.
Figure 7a is a bubble chart of the journal publication years from the Scopus database,
Figure 7b is a bubble chart of the journal publication years from the Web of Science database, and
Figure 7c is a bubble chart of the journal publication years from the IEEE database. In the bubble chart, the size of the bubble represents the volume of publications in that year. The larger the bubble, the greater the number of publications. Each bubble chart also includes a legend, where the specific number of publications represented by bubbles of different sizes can be seen.
In the Scopus database, the journals with the highest publication volumes are Applied Thermal Engineering, Energy and Buildings, Energy, and Applied Energy. Applied Thermal Engineering had the highest publication volume in 2023, with ten related papers. Energy and Buildings had the most publications in 2014, with six related papers. Energy and Applied Energy had the highest publication volumes in 2023 and 2022, with eight and seven related papers, respectively, indicating a significant number of publications in these journals in recent years. Overall, journals with high publication volumes have consistently published related papers since 2014, indicating that research in this area has been continuously active. This underscores the ongoing necessity to study the performance, optimization, energy-saving measures, and energy consumption evaluation of data center cooling systems.
In the Web of Science database, Energy and Buildings and Applied Thermal Engineering have significant publication volumes. Applied Thermal Engineering had the highest number of publications in this direction in 2023, with six papers, accounting for 23% of the selected papers from this journal. Energy and Buildings had the highest numbers of publications in 2014, 2020, and 2023, with six papers each.
Among the selected papers, IEEE has only three journals with higher publication volumes: IEEE Transactions on Components, Packaging and Manufacturing Technology, IEEE Access, and IEEE Transactions on Sustainable Computing. IEEE Transactions on Components, Packaging and Manufacturing Technology and IEEE Transactions on Sustainable Computing each published the most papers in 2017, with three papers apiece, while IEEE Access peaked in 2019 and 2021, also with three papers each. In recent years, the IEEE database has not had a high volume of publications in this direction; publishing relevant research in IEEE journals can therefore effectively fill this gap in the field.
2.6. Analysis of Publication Countries
Here we present the country-wise publications from the Scopus and Web of Science databases, along with the country co-occurrence maps generated using CiteSpace-6.3.1. These are represented in
Figure 8 and
Figure 9, respectively.
Figure 8a,b depict the scatter plots of high-publishing countries from the two databases, while
Figure 9a,b represent the country co-occurrence maps for each database. In both the Scopus and Web of Science databases, the two countries with the highest number of publications are China and the United States, with China’s publication count far exceeding that of the United States. According to the country co-occurrence map generated by CiteSpace-6.3.1, the outermost circle of China’s color ring represents earlier years, but the remaining warm-colored rings are also quite thick, indicating that this field remains a hot research topic even in recent years. The results indicate that China has been the most prolific in publishing in this field, consistently showing a sustained interest in scientific research in this area.
To explore whether there is a relationship between countries with high publication volumes and their top publishing institutions, a co-occurrence map of publishing institutions, as shown in
Figure 10, was created. In the Scopus co-occurrence map, the countries where the institutions are located are not clearly identifiable. However, in the Web of Science co-occurrence map, it is evident that institutions such as Hunan University, Northeastern University, and the Chinese Academy of Sciences are leading publishing institutions in this research field. This helps to explain why both the Scopus and Web of Science databases show that China has the highest publication volume.
2.7. Discussion on the Citation of Papers
Due to a large number of selected papers in the Scopus database, especially during the period from 2020 to 2024, a citation network map for the Scopus database from 2020 to 2024 was generated using CiteSpace-6.3.1, as shown in
Figure 11. According to the citation network map, the papers are clustered into eight main directions: (1) Immersion Cooling, (2) Thermal Performance, (3) Thermal Environment, (4) Deflector, (5) Indirect Evaporative Cooling, (6) Water-Cooled Heat Exchanger, (7) Optimization, (8) Simulation.
Among these eight main directions, immersion cooling and indirect evaporative cooling can be considered classifications of data center cooling systems. The features of different cooling systems vary, and their applicability differs as well. Therefore, it is necessary to summarize and discuss the positive aspects of each system. The optimization direction also continues to be researched. In response to the energy-saving trend, optimizing the various categories of cooling systems is undoubtedly of paramount importance. Simulation is likewise a method for developing optimization strategies for data center cooling systems. Relevant research should focus on improving the energy-saving capabilities of data center cooling systems through simulation and discussion.
2.8. Discussion of Research Directions in Papers
The research direction of our study is data center cooling systems. Each cooling approach has its own advantages, so it is important to understand the current classification of data center cooling systems and, during the design phase, to select the appropriate cooling system based on the specific requirements and conditions. In the aforementioned
Section 2.4 on keyword analysis, it can be observed that energy efficiency and optimization are high-frequency keywords commonly appearing in the literature. It can therefore be inferred that these keywords represent the issues of greatest concern to researchers.
Keyword burst analysis was then conducted for the Scopus and Web of Science databases using the “burstness” feature in CiteSpace-6.3.1. Since the Scopus database has the highest number of studies in recent years, the time interval from 2020 to 2024 was selected for its keyword burst analysis. The keyword burstness graph is shown in the following
Figure 12.
Figure 12a,b, respectively, show the keyword burstness graphs for Scopus from 2020 to 2024 and for Web of Science. The keyword burstness graphs display the top ten keywords with the highest burstness. The red color in the keyword burstness graph indicates the time periods during which the keyword had a high frequency of occurrence, while the blue color represents the periods of lower frequency. The lighter the blue, the lower the frequency of the keyword during that time period. In the keyword burstness results from the Scopus database, it is evident that “data center” remains a highly discussed topic. At the same time, “airflow management” experienced burstness during the period of 2020–2021. This term is related to optimization strategies for air-cooled cooling systems. Hence, it suggests that there has been an increasing focus on optimizing strategies for different types of cooling systems in data centers in recent years. The keyword “cooling perform” experienced significant burstness during 2020–2022, indicating a growing emphasis on improving the performance of cooling systems, including efficiency and energy-saving aspects. The keyword “optimization” experienced burstness during 2021–2022 in the Web of Science database, indicating a recent surge in research focusing on optimizing systems to improve efficiency and save energy.
Based on the analysis of keyword co-occurrence and burstness, and considering the complete process of data center cooling systems from design and operation to evaluation, we will delve into four research directions in the following sections: the selection of current data center cooling systems at the design stage, optimization strategies for different types of data centers, optimization control strategies that can achieve energy savings, and evaluation metrics for energy efficiency. A summary of these four major directions of research on data center cooling systems and their significance, which will be elaborated on in the subsequent sections, is provided in Table 3.
According to the keyword burstness graph, it is evident that waste heat recovery, thermal environment, and thermal management have emerged as keywords during the period from 2021 to 2024. This indicates a potential future research trend in data center cooling systems. In future research, there may be a focus on studying the thermal environment and the better utilization of waste heat, among other directions.
3. Classification of Data Center Cooling Systems
As a core component of IT infrastructure, data centers are rapidly expanding in both number and scale [
7]. Data centers are known for their high energy density and continuous operation, running up to 8760 h a year, which results in an exceptionally high energy consumption [
38]. Globally, data centers account for 1.3% of total electricity consumption [
39]. This underscores the urgent need for high-efficiency solutions driven by the rapid growth of the data center market.
To ensure the stable operation of IT equipment, data centers rely on air-conditioning systems to provide continuous cooling throughout the year, preventing room temperatures from exceeding the maximum allowable limits for the equipment.
As shown in
Figure 13, the energy consumption of air-conditioning systems is the largest, apart from the power consumption of the IT equipment itself, contributing about 40% of the total energy consumption [
40]. Therefore, optimizing the energy efficiency of air-conditioning systems has become a crucial measure.
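The roughly 40% cooling share translates directly into power usage effectiveness (PUE), the ratio of total facility power to IT power. A minimal sketch with illustrative numbers (not measured data from any cited facility):

```python
def pue(it_kw: float, cooling_kw: float, other_kw: float = 0.0) -> float:
    """Power usage effectiveness: total facility power / IT power."""
    return (it_kw + cooling_kw + other_kw) / it_kw

# Illustrative load split: 1000 kW of IT load with cooling drawing
# roughly 40% of the *total* facility power (800 / 2000 kW).
it = 1000.0
cooling = 800.0
other = 200.0
print(round(pue(it, cooling, other), 2))  # → 2.0
```

Lowering the cooling term is the main lever for moving PUE toward its ideal value of 1.0.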
Recently, numerous researchers have conducted in-depth studies on data center cooling technologies. Zhang et al. [
40] reviewed the development of natural cooling technologies in data centers from the aspects of configuration characteristics and performance, providing a detailed analysis of the performance characteristics of air-side natural cooling, water-side natural cooling, and thermosyphon cooling, and they also summarized the performance standards for evaluating the effectiveness of natural cooling in data centers.
Ebrahimi et al. [
39] focused on cooling technologies and their operating conditions in data centers, exploring the possibility of utilizing waste heat from data centers. Additionally, they assessed the feasibility and effectiveness of implementing low-grade waste heat recovery technologies in combination with energy-saving cooling technologies.
Nadjahi et al. [
7] provided an overview of potential cooling technologies for data centers from an energy-saving perspective in
Table 4.
Based on these studies, data center cooling technologies can be clearly classified into two main categories: liquid cooling and air cooling. Liquid cooling technologies use liquids to directly or indirectly absorb and dissipate heat, providing high-efficiency thermal management solutions, especially suitable for high-heat-density environments. In contrast, air cooling technologies achieve cooling through air circulation, suitable for scenarios with a lower heat density and requiring a lower initial investment. The advantages and applicable conditions of these two technologies are important factors to consider when designing.
3.1. Liquid Cooling Technology
Liquid cooling technology utilizes the high heat capacity and thermal conductivity of liquids to dissipate the heat, thereby maintaining the equipment within a safe operating temperature range [
1,
2], as shown in
Figure 14. Liquid cooling technology operates by circulating a coolant through a closed-loop system to efficiently manage the heat generated by data center equipment. The process begins with the cooling water system, where cooling towers dissipate heat to the external environment. The cooled water is then pumped into the cooling distribution unit (CDU), which acts as a central hub, distributing the coolant to various cooling systems connected directly to the equipment. Within the information equipment room, the coolant flows through specialized cooling systems attached to the server cabinets. These systems absorb the heat produced by the servers, maintaining optimal operating temperatures. The heated coolant is then returned to the CDU, where it is recirculated back to the cooling towers, completing the cooling cycle.
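The closed loop described above is governed by a simple sensible-heat balance, Q = ṁ·cp·ΔT. As a hedged illustration, the required coolant mass flow for a rack can be sized as follows (the rack load, coolant properties, and temperature rise are assumed values, not taken from any cited system):

```python
def coolant_flow_kg_s(heat_kw: float, cp_kj_kg_k: float, delta_t_k: float) -> float:
    """Mass flow needed to carry `heat_kw` with a coolant temperature rise
    of `delta_t_k`: Q = m_dot * cp * dT  =>  m_dot = Q / (cp * dT)."""
    return heat_kw / (cp_kj_kg_k * delta_t_k)

# Illustrative: a 30 kW rack, water coolant (cp ≈ 4.18 kJ/kg·K), 10 K rise.
m_dot = coolant_flow_kg_s(30.0, 4.18, 10.0)
print(round(m_dot, 3))  # → 0.718 kg/s
```

The same balance explains why liquids outperform air: with a cp roughly four times that of air and a density nearly a thousand times higher, the required volume flow is orders of magnitude smaller.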
Indirect Liquid Cooling: The heat source does not come into direct contact with the coolant. Instead, heat is transferred through cooling devices such as cold plates, with the coolant flowing through enclosed pipes to absorb the heat [
45,
46]. This method offers a high cooling efficiency and precise temperature control, making it suitable for high heat load scenarios. However, its drawbacks include high costs, system complexity, maintenance challenges, and the risk of leakage. Direct liquid cooling involves direct contact between the coolant and the heat-generating components and includes immersion and spray cooling systems. Immersion cooling submerges the equipment entirely in a non-conductive liquid and is further divided into single-phase and two-phase systems. Single-phase systems maintain the coolant in a liquid state, while two-phase systems utilize phase change from liquid to gas to release heat efficiently [
47]. Immersion cooling boasts a high cooling efficiency, near-silent operation, and space savings, but it comes with a high initial investment and maintenance costs, as well as complex hardware replacement procedures [
48].
Performance and Recommendations: Research by Hnayno et al. [
49] indicates that single-phase immersion cooling can reduce server energy consumption by at least 20% compared to air cooling systems, and by 7% compared to other liquid cooling systems. Greenberg et al. [
50] have highlighted that adopting liquid cooling systems can significantly reduce the total cooling energy demand. Implementing liquid cooling at the server or rack level, combined with ambient free-air cooling, can reduce or even eliminate the reliance on CRAC (computer room air-conditioning) units and chillers, resulting in substantial energy savings.
3.1.1. Cold Plate Liquid Cooling
Chip-level liquid cooling technology is an indirect method that achieves heat dissipation by installing cold plates on high-heat-output components of servers, such as CPUs and GPUs. A notable feature of direct-to-chip cooling is its ability to use warm water as the coolant, making it environmentally friendly [
51]. This technology can provide waste heat in the form of water at temperatures around 45 °C or even higher [
52].
Commercial liquid cooling products primarily offer waste heat recovery channels on the primary side. However, it is noteworthy that waste heat can be recovered from both the primary and secondary sides of the CDU [
53].
To address this research gap, Lu et al. [
54] conducted an innovative study using a liquid-cooled rack in an office building as a “data furnace” to supply heat to the return pipeline on the secondary side of the space heating network. This approach eliminates the need for heat pumps, thereby reducing district heating demand and investment costs [
54]. In conclusion, chip-level liquid cooling technology offers a promising solution for efficient heat dissipation and waste heat recovery in data centers. The ability to use warm water as a coolant and recover waste heat at useful temperatures presents significant environmental and economic benefits. Additionally, exploring innovative applications, such as using liquid-cooled racks as “data furnaces”, can further enhance the sustainability and cost-effectiveness of heating solutions in buildings.
3.1.2. Immersion Liquid Cooling
Immersion liquid cooling systems achieve effective heat exchange by directly immersing heat-generating electronic equipment in coolant, with the coolant circulating to remove heat. During use, the IT equipment is fully immersed in non-conductive secondary-side coolants, including mineral oil, silicone oil, or fluorinated liquids. Immersion liquid cooling technology is further divided into single-phase and two-phase immersion cooling based on whether a phase change occurs in the coolant during heat exchange.
- (1)
Single-Phase Immersion Cooling
In single-phase liquid cooling systems, the coolant undergoes a temperature change during heat transfer without a phase change. The cooling distribution unit (CDU) circulates low-temperature coolant through the IT equipment, absorbing heat, and then returns the heated coolant to the CDU. Inside the CDU, heat is transferred to the primary-side coolant and then released into the atmosphere to complete the cooling cycle [
55]. The performance of single-phase liquid cooling significantly depends on various design parameters of the cold plate, such as porous media, microchannel heat sinks, and the heat sink pressure [
47]. These parameters enhance the efficiency of heat transfer, making liquid cooling systems more effective than traditional forced air cooling solutions. For instance, Parida et al. [
56] conducted a comparative study using traditional forced air cooling and liquid cold plate solutions to explore the cooling capacity of the server racks. Additionally, Eiland et al. [
57] explored the heat transfer performance of servers immersed in mineral oil, a type of single-phase immersion cooling system. Their findings indicated that the immersion cooling system reduced the thermal resistance by 34.4% compared to traditional air-cooled solutions, achieving a power usage effectiveness (PUE) as low as 1.0.
These studies illustrate the superior efficiency of liquid cooling systems, emphasizing how various design parameters contribute to their enhanced performance compared to air-cooled systems. By optimizing factors like porous media, microchannel heat sinks, and the heat sink pressure, single-phase liquid cooling not only improves thermal management but also significantly reduces energy consumption. Despite the advantages of liquid cooling, several challenges remain. The high initial investment and maintenance costs, the complexity of system integration, and the potential risk of leaks are significant barriers to widespread adoption. Future research should focus on the development of more cost-effective materials and designs for cold plates and heat exchangers, as well as the optimization of coolant formulations to improve the thermal performance and reduce the environmental impact [
45,
46,
47,
48,
49,
55,
56,
57].
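The 34.4% reduction in thermal resistance reported by Eiland et al. can be made concrete with the usual definition R = ΔT/Q between the component and the coolant. The temperatures and power below are hypothetical, chosen only to illustrate the calculation:

```python
def thermal_resistance_k_per_w(t_component_c: float, t_coolant_c: float,
                               power_w: float) -> float:
    """Component-to-coolant thermal resistance: R = dT / Q."""
    return (t_component_c - t_coolant_c) / power_w

# Hypothetical air-cooled CPU: 80 °C case, 25 °C inlet air, 200 W load.
r_air = thermal_resistance_k_per_w(80.0, 25.0, 200.0)
# A 34.4% lower resistance, as reported for mineral-oil immersion.
r_oil = r_air * (1 - 0.344)
print(round(r_air, 3), round(r_oil, 3))  # → 0.275 0.18
```

A lower resistance means the same load can be carried at a smaller temperature difference, allowing warmer (cheaper) coolant.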
- (2)
Two-Phase Immersion Cooling
Two-phase cooling involves a phase change in the circulating coolant during heat transfer, with the liquid absorbing latent heat as it evaporates. This reduces the required coolant flow rate and produces a more uniform temperature distribution [
6]. The key to improving efficiency lies in designing efficient porous media and microchannel heat sinks. However, two-phase cooling methods still face the challenge of flow instability, which may cause surface overheating and potential damage [
58].
To compare the performance of single-phase immersion cooling (SPIC) systems and two-phase immersion liquid cooling systems, Kanbur et al. [
59] conducted experimental studies on data center servers and found that the Coefficient of Performance (COP) of two-phase immersion cooling systems was 72.2–79.3% higher than that of SPIC systems. Despite the superior thermal characteristics of two-phase immersion cooling systems, they are relatively inferior in terms of the investment cost, safety, and maintainability [
60]. Therefore, SPIC systems are more suitable for large-scale and commercial applications than two-phase immersion liquid cooling systems. In summary, both single-phase and two-phase liquid cooling systems offer significant advantages over traditional air cooling methods. Single-phase systems benefit from stable operation and a lower energy consumption, while two-phase systems provide a higher thermal efficiency but come with challenges in cost and stability. Future research should focus on overcoming the flow instability in two-phase systems and reducing the associated costs to make them more viable for large-scale applications. Additionally, advancements in cold plate design and immersion cooling technologies will further enhance the cooling efficiency and energy savings in data centers, contributing to more sustainable and high-performance computing environments.
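The COP comparison by Kanbur et al. follows from COP = Q_cooling / W_input. A sketch with purely illustrative power draws (only the 72.2% lower-bound relative gain comes from the cited study):

```python
def cop(cooling_kw: float, input_kw: float) -> float:
    """Coefficient of performance: heat removed per unit of work input."""
    return cooling_kw / input_kw

# Illustrative numbers only: a 100 kW heat load, hypothetical power draw
# for a single-phase immersion cooling (SPIC) system.
cop_spic = cop(100.0, 10.0)
# Lower bound of the reported 72.2–79.3% improvement for two-phase cooling.
cop_two_phase = cop_spic * 1.722
print(cop_spic, round(cop_two_phase, 1))  # → 10.0 17.2
```

A higher COP means the same heat load is removed with proportionally less pump and fan power, which is where the two-phase advantage appears on the energy bill.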
3.1.3. Spray Liquid Cooling
Unlike the previous two systems, spray liquid cooling technology directly sprays coolant onto electronic equipment through specially designed nozzles to achieve efficient heat exchange [
61]. During spraying, the coolant directly contacts the surface of electronic equipment or connected heat-conducting materials. The heated coolant is then recovered through the system’s return pipeline and sent back to the CDU for recooling [
62]. This system typically includes a cooling tower, CDU, liquid cooling pipeline, and spray liquid cooling cabinet, which integrates key components such as the pipeline system, liquid distribution system, spray module, and liquid return system, ensuring efficient and precise cooling throughout the process. Spray cooling has promising potential in various fields including aerospace, biomedicine, and battery safety. Continuous advancements are expected to enhance its efficiency and applicability, overcoming existing technical challenges and expanding its usage in more advanced and compact electronic devices.
3.2. Air Cooling Technology
Air cooling methods use fans to cool the refrigerant in the condenser and directly release the heat into the air, as shown in
Figure 15. Compared to water-cooled chiller systems, this method does not require the installation of cooling towers, cooling water pumps, and piping equipment, thus ensuring normal cooling operation in water-scarce environments. Air-cooled chiller systems are simple, reliable, and easy to maintain, making them widely used in medium and large data centers.
3.2.1. Direct Air Cooling
Direct air cooling is straightforward and cost-effective, particularly suited for environments where the ambient air quality and temperature are within acceptable limits for IT equipment operation [
38]. However, this method has inherent limitations. Its efficiency significantly depends on ambient air conditions, making it less effective in hot or polluted environments. Furthermore, the reliance on high-speed fans can introduce significant noise levels, which can be problematic in densely populated data centers. The system’s cooling capacity is also limited by the heat dissipation capabilities of air compared to liquid cooling solutions, making it less suitable for extremely high-density configurations or high-performance computing (HPC) applications where heat loads are substantial [
63,
64]. Future research should aim at developing hybrid cooling systems that combine the simplicity and cost-effectiveness of direct air cooling with other cooling technologies to enhance overall performance. For example, integrating direct air cooling with liquid cooling systems could provide a more robust solution for managing diverse heat loads across different data center environments.
3.2.2. Indirect Air Cooling
Indirect air cooling technology involves transferring heat from one medium to another through a heat exchanger, typically transferring heat from the hot equipment to water or coolant, which then dissipates heat through the air [
65]. Indirect air cooling systems are widely used in data centers, where heat exchangers transfer heat from servers to the liquid in cooling pipes. These systems can integrate with building HVAC systems to effectively manage the thermal environment of large-scale data centers [
5]. In industrial applications, indirect air cooling technology is often used in environments with strict cooling requirements, such as chemical plants or pharmaceutical facilities, maintaining stable operating temperatures to prevent overheating and associated safety risks. The complexity of heat exchange mechanisms and integration with HVAC systems can lead to high operational costs. Regular maintenance is required to prevent leaks and ensure efficient operation. Precise control of coolant temperatures is crucial to maintain efficiency and prevent overheating.
3.2.3. Evaporative Cooling
Evaporative cooling uses the principle of heat absorption during water evaporation to cool air. It achieves more efficient thermal management by humidifying and lowering the air temperature. In green buildings, evaporative cooling systems are designed to utilize natural evaporation and air cooling synergy, reducing traditional air-conditioning energy consumption and achieving environmentally friendly temperature control [
39]. By using this technology in hot environments, the overall system energy efficiency and output performance can be significantly enhanced. The efficiency of evaporative cooling is highly dependent on ambient air humidity levels; in very humid environments, its effectiveness diminishes significantly. Additionally, the system requires a continuous supply of water, which can be a limitation in areas with water scarcity. Regular maintenance is necessary to prevent mold and bacteria growth in the system, which can affect air quality and system efficiency. Managing the balance between humidity and temperature control can also be complex, requiring advanced control systems to optimize performance without compromising indoor air quality [
5,
39].
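The humidity dependence noted above is commonly expressed through wet-bulb effectiveness: a direct evaporative cooler can at best approach the ambient wet-bulb temperature. A minimal sketch (the 0.85 effectiveness is an assumed, typical value, not from the cited references):

```python
def supply_air_temp_c(t_dry_bulb_c: float, t_wet_bulb_c: float,
                      effectiveness: float) -> float:
    """Direct evaporative cooler outlet temperature:
    T_out = T_db - eff * (T_db - T_wb), with eff in (0, 1)."""
    return t_dry_bulb_c - effectiveness * (t_dry_bulb_c - t_wet_bulb_c)

# Dry climate: large wet-bulb depression, strong cooling.
print(round(supply_air_temp_c(35.0, 20.0, 0.85), 2))  # → 22.25
# Humid climate: small depression, little cooling — the limitation noted above.
print(round(supply_air_temp_c(30.0, 28.0, 0.85), 2))  # → 28.3
```

The two cases show why the same unit that delivers data-center-grade supply air in an arid region becomes nearly useless in a humid one.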
3.3. Key Factors in Data Center Cooling Systems
To maintain uniform airflow distribution and avoid the mixing of hot and cold air, researchers shape the airflow organization by adjusting the height of raised floors, the openness of perforated tiles [
66], and the deployment of obstacles in the plenum ventilation system [
67]. The optimal values for these adjustments are obtained through simulations and experiments [
68]. In such systems, energy consumption is reduced by changing structural parameters, particularly the porosity of perforated tiles [
69]. Natural cooling systems, as part of air cooling systems, also focus on climate conditions, cooling system structure design [
70], natural cooling switch points, and flow rates for thermal management and energy savings [
71]. Ham et al. [
72] compared the cooling performance of nine air-side heat exchangers in data centers. They found that using energy-efficient equipment could save 47.5% to 62% of the total cooling energy compared to traditional data center cooling systems. Among these nine cooling systems, the indirect air-side economizer with high-efficiency heat exchangers achieved 63.6% energy savings, while the indirect air-side economizer with low-efficiency heat exchangers had the lowest energy savings. This study shows that the choice of heat exchanger is crucial for energy savings. Previous studies have shown that plate heat exchangers are more suitable for indirect air cooling and have focused on using algorithms to optimize various factors to improve the structural performance, save energy, and achieve uniform heat dissipation. In terms of thermal management and energy savings for liquid cooling systems, factors such as the thermal load, structural parameters [
73,
74], coolant flow rate [
75,
76], and coolant type [
6,
43] are involved. Researchers have analyzed the relationship between these factors and cooling performance and energy consumption and have optimized the structure of heat exchange facilities. Through these measures, liquid cooling systems can achieve more effective thermal management and energy savings, providing reliable assurance for data center operations.
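The perforated-tile parameters discussed above can be related by a simple sensible-heat balance: the airflow needed to remove a rack's heat, divided by the tile's open area, gives the tile face velocity. All numbers below are illustrative assumptions, not values from the cited studies:

```python
RHO_AIR = 1.2    # kg/m^3, approximate density of air at room conditions
CP_AIR = 1.005   # kJ/(kg*K), specific heat of air

def tile_face_velocity_m_s(rack_kw: float, delta_t_k: float,
                           tile_area_m2: float, porosity: float) -> float:
    """Mean air velocity through a perforated tile's open area.
    Volume flow V = Q / (rho * cp * dT); velocity = V / (A * porosity)."""
    vol_flow = rack_kw / (RHO_AIR * CP_AIR * delta_t_k)   # m^3/s
    return vol_flow / (tile_area_m2 * porosity)

# Illustrative: 10 kW rack, 12 K air-side rise, 0.6 m x 0.6 m tile, 40% open.
print(round(tile_face_velocity_m_s(10.0, 12.0, 0.36, 0.40), 2))  # ≈ 4.8 m/s
```

The sketch makes the trade-off visible: halving tile porosity doubles the face velocity for the same heat load, which raises pressure drop and fan energy — the reason porosity appears as a key energy-saving parameter above.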
5. Optimization Control Strategies
In the operation of modern data centers, optimizing control strategies for cooling systems is particularly important. This requires not only the application of advanced technologies and methods to enhance efficiency and reduce energy consumption but also strengthening overall system performance and reliability. As the scale of data centers increases and operational complexity rises, traditional experience-based methods are no longer sufficient to meet the demands, making automated control strategies crucial [
6]. By implementing intelligent control systems and utilizing advanced monitoring and automatic control technologies, cooling equipment can be adjusted in real-time to match the actual thermal load. Researchers have widely adopted various control strategies, such as Proportional-Integral-Derivative (PID), model predictive control (MPC), and reinforcement learning (RL) [
102]. Although PID control is widely used due to its simplicity of operation and high stability, advancements in technology have driven the adoption of more advanced control methods, such as MPC and RL. These methods not only enhance the level of automation and intelligence of the system but also optimize the operational efficiency and cost-effectiveness. Each control method demonstrates unique advantages and challenges, and they can be used alone or in combination with other technologies to improve system efficiency and adaptability.
5.1. PID Control
In the field of optimization control for data center cooling systems, PID control is highly favored. This method is widely adopted and suitable for most conventional cooling systems, primarily used to maintain stable operation at set points. However, PID control exhibits limitations in handling complex and rapidly changing environments.
For instance, research by Durand-Estebe et al. [
102] has shown that the performance of PID control is constrained in nonlinear or rapidly changing system environments. This is because traditional PID control struggles to achieve ideal control effects when the parameters are poorly tuned or when the system dynamics are highly nonlinear or fast-changing. To address this issue, researchers have proposed enhancing the performance of PID control in dynamic data center environments by improving PID parameter adjustment methods. Durand-Estebe et al. [
102] optimized data centers using PID control in CFD simulations. They employed a computational fluid dynamics (CFD) simulation environment to monitor temperature and airflow in real time and adjust the PID parameters accordingly. Through iterative simulations and experiments, they precisely tuned the P, I, and D parameters, enabling the cooling system to quickly respond to thermal load changes, thus optimizing the system energy efficiency and stability. Zheng and Ping [
103] proposed an Active Disturbance Rejection Control (ADRC) method for temperature regulation in server storage systems. This method augmented traditional PID control with real-time disturbance compensation mechanisms. Specifically, it employed an Extended State Observer (ESO) to estimate unknown disturbances in the system and dynamically adjusted the PID parameters based on these estimations, thereby enhancing the system’s disturbance rejection capability and control accuracy. These approaches aim to enhance the adaptability and efficiency of PID control through more precise parameter adjustments, thereby better meeting the practical demands and effectively optimizing the thermal environment of data centers.
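The positional-form PID law that these studies build on can be sketched in a few lines. The plant model, gains, and setpoint below are hypothetical toys, not tuned values from the cited work; the loop is reverse-acting, so a hotter room calls for more cooling:

```python
class PID:
    """Minimal discrete PID controller (positional form)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical toy plant: return-air temperature drifts toward 35 °C under
# load and is pulled down in proportion to the cooling command u (in %).
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
temp, setpoint = 30.0, 24.0
for _ in range(50):
    error = temp - setpoint          # reverse-acting: hot room => more cooling
    u = max(0.0, min(100.0, pid.update(error)))
    temp += 0.1 * (35.0 - temp) - 0.05 * u   # simple first-order response
print(round(temp, 1))  # settles close to the 24 °C setpoint
```

The integral term is what holds the room at the setpoint under a steady heat load; the tuning difficulty the text describes is precisely choosing kp, ki, and kd so that this loop responds quickly without oscillating.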
Using PID control alone is rare; it is usually combined with other control strategies to improve control precision. The combination of different control strategies has shown significant advantages in the optimization of data center cooling systems. For example, Demir et al. [
104] employed a combination of proportional control and fuzzy logic to independently control the temperature and humidity. This combination enhances the flexibility and responsiveness of the control system and allows for more precise control of the data center environmental conditions through intelligent methods, optimizing the energy efficiency and enhancing the overall system performance to better adapt to complex environmental changes.
PID technology was originally conceived over a century ago for single-input, single-output processes. Although its parameter-tuning methods can be cumbersome, they are manageable [
104]. To enhance PID performance, researchers have proposed adapting to dynamic changes through fine-tuning parameters. In the future, it will be necessary to explore the integration of more advanced algorithms and technologies to address the application limitations of PID control in complex and rapidly changing environments. This will promote the development of data center cooling system control strategies towards greater efficiency and intelligence. This requires not only technological innovation but also comprehensive consideration from system design to practical application, ensuring that control strategies are effectively implemented and achieve the expected optimization results.
5.2. Model Predictive Control
Model predictive control (MPC) has garnered widespread attention in optimizing data center cooling systems due to its exceptional predictive and optimization capabilities, making it a core technology for enhancing system efficiency and responsiveness.
For example, Wang et al. [
76] proposed a global optimization method for air-conditioning water systems in data centers. Global optimization was achieved by building an energy consumption model for each component and using the differential evolution (DE) algorithm. The results show that the chiller saves 10.2% and the pump saves 28.1%, while the energy consumption of the cooling tower increases by 29.7%, highlighting the importance of system-level optimization. The potential of MPC in balancing energy efficiency and system performance is underscored by the development of predictive optimization methods for chilled water systems. This work highlights the important application value of MPC in large-scale systems, while also noting the limitations of existing models in handling dynamic changes and complex systems. Zhao et al. [
105] demonstrated that in ice storage air-conditioning systems, combining MPC with a multi-objective optimization control strategy can reduce energy consumption by 25% and operating costs by 20.9% compared to storage priority control.
In addition, Zhu et al. [
106] studied an advanced control strategy combining refrigeration technology to optimize data center chillers. Advanced MPC strategies have also been extensively studied, achieving higher energy efficiency and system responsiveness through integrated refrigeration technology and standardized operating states. Zhu et al. [
107] found that an advanced MPC strategy for a hybrid cooling system reduced energy consumption by 12.19% in the natural cooling mode, 4.04% in the mixed cooling mode, and 22.15% in the mechanical cooling mode. They combined this advanced control strategy with mixed integer linear programming (MILP) to effectively improve energy efficiency and reduce refrigeration losses. Choi et al. [
108] introduced highly adaptive artificial neural network models and optimal control algorithms, greatly enhancing the responsiveness and adaptability of data center cyber–physical systems. Similarly, Fan and Zhou [
109] applied MPC to optimize chiller units combined with water-side economizers, further affirming MPC’s flexibility and efficiency.
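The receding-horizon idea behind these MPC applications can be illustrated with a toy enumeration-based controller: at each step, simulate every candidate control sequence over a short horizon against a plant model, apply only the first move of the cheapest sequence, and repeat. The plant model, candidate set, and cost weights below are all assumptions for illustration:

```python
import itertools

def predict(temp, u_seq):
    """Toy plant model: first-order response of room temperature to cooling."""
    traj = []
    for u in u_seq:
        temp = temp + 0.1 * (35.0 - temp) - 0.05 * u
        traj.append(temp)
    return traj

def mpc_step(temp, setpoint=24.0, horizon=3,
             candidates=(0.0, 25.0, 50.0, 75.0, 100.0)):
    """Enumerate control sequences over the horizon; return the first move of
    the cheapest one (tracking error plus a small energy penalty)."""
    best_u, best_cost = 0.0, float("inf")
    for seq in itertools.product(candidates, repeat=horizon):
        traj = predict(temp, seq)
        cost = sum((t - setpoint) ** 2 for t in traj) + 0.001 * sum(seq)
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u

temp = 30.0
for _ in range(30):
    temp += 0.1 * (35.0 - temp) - 0.05 * mpc_step(temp)
print(round(temp, 1))  # hovers near the 24 °C setpoint
```

Real MPC replaces the brute-force enumeration with a solver (e.g. the MILP formulation mentioned above) and a calibrated model, but the receding-horizon structure — predict, optimize, apply the first move, re-measure — is exactly this loop.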
Despite MPC’s outstanding performance, Du et al. [
16] noted that implementing MPC in dynamic environments faces numerous challenges. Future research needs to enhance MPC’s adaptability and flexibility by integrating advanced predictive models and optimization algorithms, improving real-time capabilities, and reducing implementation complexity and costs. The integration of emerging machine-learning technologies will be crucial to automatically adjust and optimize MPC parameters, making MPC more suitable for widespread data center applications.
These studies collectively constitute the application framework of MPC technology, from basic optimization methods to complex system management and specific technology strategies. They demonstrate MPC’s expanding application in data center cooling systems, from the meticulous control of individual servers to the holistic optimization of complex systems. Future research will focus on improving MPC’s learning efficiency, reducing reliance on big data, and enhancing real-time performance and stability. Through these efforts, MPC is expected to achieve more widespread and effective applications in data center cooling systems, driving control strategies toward greater efficiency and intelligence.
5.3. Reinforcement Learning
Reinforcement learning (RL), as an advanced intelligent control strategy, has demonstrated significant potential in adaptive adjustment and performance optimization. RL techniques have been widely applied to enhance the energy efficiency and responsiveness of data center cooling systems, showing substantial promise in adapting to complex and dynamic environments.
In specific applications, Zhang et al. [
6] explored the use of RL in optimizing data center cooling systems. Their research showed that RL techniques effectively optimize energy efficiency, especially when dealing with complex system dynamics and continuous control variables. Although this method can improve energy efficiency, it requires high-quality training data and substantial computational resources, which limits its application in certain settings. He et al. [
110] combined deep reinforcement learning with predictive control technologies to enhance the energy efficiency of chiller units, providing a new optimization approach for complex systems. This study utilized deep reinforcement learning to handle more complex system dynamics and manage continuous control variables, underscoring RL’s capabilities in complex decision-making environments. Despite its benefits, this approach also demands large amounts of high-quality training data and substantial computational resources, restricting its use in certain contexts.
To expand RL applications to broader system-level management, Qin et al. [
111] employed a distributed reinforcement learning approach to optimize energy use across regional building clusters, effectively addressing energy distribution among multiple buildings. This distributed RL method not only optimized the energy efficiency of the entire region but also showed potential in coordinating multiple control points in large-scale systems. However, it faces technical challenges related to effective communication and synchronization between systems.
In the latest relevant studies, Lin et al. [
112] enhanced server energy efficiency by integrating multi-agent reinforcement learning methods with dynamic voltage frequency scaling (DVFS) and dynamic fan control technologies. This approach finely tuned hardware operations to minimize energy consumption, demonstrating the direct effects of RL at a specific hardware level. The study also highlighted the need to improve algorithm generalization across different server environments.
These cases illustrate that RL technology is expanding its application scope from the meticulous control of individual servers to the holistic optimization of complex systems and even to energy management across multiple buildings. Future research needs to further enhance the learning efficiency of RL through algorithm improvements and improve the real-time performance and stability of algorithms. With these efforts, RL is expected to achieve more widespread and effective applications in data center cooling systems, driving control strategies towards greater efficiency and intelligence.
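As a toy illustration of the control loop these studies build on (not drawn from any cited paper), the following sketch trains a tabular Q-learning agent to choose a cooling-power level for a simplified room model; the heat load, cooling gains, temperature band, and reward weights are all illustrative assumptions.

```python
import random

# Minimal sketch: tabular Q-learning picks a cooling level for a toy
# server-room thermal model. All constants below (heat load, cooling
# gain, safe band, reward weights) are illustrative assumptions.
random.seed(0)

TEMPS = list(range(18, 31))          # discretized room temperatures (deg C)
ACTIONS = [0, 1, 2]                  # cooling power level: off / low / high
TARGET_LOW, TARGET_HIGH = 22, 26     # assumed safe operating band

def step(temp, action):
    """Toy dynamics: the IT load warms the room, cooling removes heat."""
    heat_gain = 2                    # deg C added per step by servers (assumed)
    cooling = action * 2             # deg C removed per step per cooling level
    next_temp = max(TEMPS[0], min(TEMPS[-1], temp + heat_gain - cooling))
    # Reward staying in band; penalize energy use (0.1 weight is assumed).
    in_band = TARGET_LOW <= next_temp <= TARGET_HIGH
    reward = (1.0 if in_band else -1.0) - 0.1 * action
    return next_temp, reward

Q = {(t, a): 0.0 for t in TEMPS for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    temp = random.choice(TEMPS)
    for _ in range(30):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(temp, a)])
        next_temp, reward = step(temp, action)
        best_next = max(Q[(next_temp, a)] for a in ACTIONS)
        # Standard Q-learning update.
        Q[(temp, action)] += alpha * (reward + gamma * best_next - Q[(temp, action)])
        temp = next_temp

# Greedy policy: cool hard when hot, back off when already cool.
policy = {t: max(ACTIONS, key=lambda a: Q[(t, a)]) for t in TEMPS}
print(policy[30], policy[18])
```

In the learned policy, hot states map to the highest cooling level and cool states to no cooling, mirroring the energy/temperature trade-off encoded in the reward; the deep RL methods surveyed above replace the Q-table with a neural network to handle continuous states and actions.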
5.4. Summary
In the optimization of cooling systems, the choice of control strategies is crucial for enhancing system energy efficiency and responsiveness. This chapter discusses three main control strategies: PID control, MPC, and RL, exploring their applications, advantages, and the challenges they face in data center cooling systems. PID control, as a fundamental control strategy, is widely adopted due to its simplicity and high robustness, making it suitable for most conventional cooling systems, and it is primarily used to maintain stable operation at set points. However, PID control shows limitations in dealing with complex and rapidly changing environments. Model predictive control (MPC), as a model-based control strategy, is valued in data center cooling systems for its excellent optimization and predictive capabilities. It can achieve global optimization in large-scale and complex systems, significantly enhancing the system energy efficiency and performance by adjusting operations in real time to adapt to environmental changes. The main challenges for MPC lie in its model dependency and the need for real-time operation. Reinforcement learning (RL), as a model-free optimal control strategy, demonstrates strong potential in complex system regulation through its adaptive learning mechanism, especially in multi-level system applications. The application of RL requires substantial training data and computational resources, along with effective strategies to ensure learning process stability and convergence speed.
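The set-point regulation role of PID described above can be sketched with a minimal discrete PID loop acting on a toy room model; the gains and plant constants below are illustrative assumptions, not tuned values from any surveyed system.

```python
# Minimal discrete PID sketch holding a toy server room at 24 deg C.
# Gains and plant constants are illustrative assumptions.

def make_pid(kp, ki, kd, setpoint):
    """Return a stateful PID controller emitting a cooling-power command."""
    state = {"integral": 0.0, "prev_error": 0.0}
    def control(measured, dt=1.0):
        error = measured - setpoint          # above set point -> cool harder
        state["integral"] += error * dt
        derivative = (error - state["prev_error"]) / dt
        state["prev_error"] = error
        u = kp * error + ki * state["integral"] + kd * derivative
        return max(0.0, u)                   # cooling power cannot be negative
    return control

# Toy plant: constant IT heat load, cooling proportional to the command.
temp = 30.0                                  # start warm (deg C)
heat_load = 1.5                              # deg C per step added by servers
pid = make_pid(kp=0.8, ki=0.2, kd=0.1, setpoint=24.0)

for _ in range(100):
    u = pid(temp)
    temp += heat_load - 1.0 * u              # cooling gain of 1 deg C per unit u

print(round(temp, 2))
```

The integral term removes the steady-state offset caused by the constant heat load, which is exactly the set-point-holding behavior that makes PID the default choice for conventional cooling loops.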
In Table 7, the advantages, disadvantages, application scenarios, and application stages of PID control, model predictive control, and reinforcement learning are summarized, allowing readers to quickly understand these three optimal control strategies.
Data centers may choose to employ a single control method or a combination of methods to adapt to complex and dynamic system environments. This diversified strategy provides more comprehensive and flexible solutions. Typically, a single control strategy is suitable for standard system optimization tasks, while a composite control strategy can more effectively handle complex and multi-objective optimization problems, thereby significantly enhancing overall optimization outcomes. Large data centers tend to prefer advanced control strategies, such as MPC or RL, or their combinations, to ensure system stability and maximize energy efficiency. However, the implementation and maintenance costs of these strategies are high, so when adopting these strategies, it is necessary to comprehensively consider the data center’s budget and business requirements.
Future trends will focus on the deeper integration of algorithms and technologies to enhance the performance and adaptability of these control strategies. Particularly, the integration of machine-learning technologies to automatically optimize control parameters will be a key step, not only improving the flexibility and accuracy of control strategies but also further reducing the energy consumption and enhancing the overall system performance. Through the integration and innovation of these technologies, data center cooling system control strategies are expected to become more efficient and intelligent, better meeting the growing performance demands and environmental adaptation challenges.
6. Data Center Cooling System Energy Consumption Indicators
Enhancing the energy efficiency to support energy conservation and emissions reduction is a primary focus of current research. To achieve this goal, researchers are working to develop and refine a series of energy efficiency assessment metrics. These metrics serve dual purposes: they measure the efficiency of energy use and guide energy management and conservation strategies in data centers [
17].
We delve into the critical role of energy consumption metrics, which are essential for monitoring and evaluating energy usage efficiency and are pivotal in optimizing energy consumption and enhancing the sustainability of these facilities.
Section 6.1 will introduce the classification and current status of these energy metrics, ranging from fundamental measures like power usage effectiveness (PUE) to more sophisticated and holistic efficiency metrics. Moving forward,
Section 6.2 will conduct an in-depth analysis of the challenges and limitations faced in the practical application of PUE, particularly highlighting its potential shortcomings in comprehensively reflecting a data center’s energy efficiency. Furthermore,
Section 6.3 will explore the ongoing improvements in the industry that build upon PUE. Researchers and engineers are striving to refine these metrics to more precisely assess and optimize the energy performance, thereby better aligning with global environmental sustainability objectives.
6.1. Classification and Status of Energy Consumption Indicators
In recent years, as data center energy efficiency has received increasing global attention, numerous researchers have systematically classified energy consumption metrics. Within this field, scholars have proposed a variety of metrics to quantify and manage energy consumption based on different assessment needs and methods. This section provides an overview of the existing energy consumption indicators and explores the current status of their development. By sorting through the relevant research results, this section reveals the strengths and weaknesses of the existing indicators and offers a theoretical basis and guidance for future research.
Long et al. [
19] classified the existing data center energy efficiency assessment indicators into three categories in terms of granularity: coarse-grained indicators, medium-grained indicators, and fine-grained indicators. Coarse-grained metrics assess only the total energy efficiency and lack detailed information on subcomponents, making it difficult to provide specific recommendations for energy savings. Medium-grained metrics cover information on specific equipment, infrastructure, and green energy sources to more accurately assess energy efficiency. Fine-grained metrics provide detailed performance information on specific components, assessing energy efficiency through changes in performance and energy consumption, and are important to operators even though they do not directly represent overall energy efficiency. Shao et al. [
20] collected energy efficiency assessment metrics proposed worldwide for data centers over the last 20 years, summarized them under the headings of energy saving, eco-design, and data center security, presented calculation formulas for each metric, and discussed their strengths and weaknesses as well as their relevance and applicability. Wang et al. [
118] discuss a taxonomy of green data center performance metrics, including basic performance metrics (e.g., greenhouse gas emissions, humidity, power, and heat metrics) as well as extended metrics. Reddy et al. [
119] classified data center metrics according to greening, performance, heat, security, storage, and financial impact.
All four papers provide a systematic framework to classify data center energy efficiency assessment metrics, but their respective focuses and approaches show significant differences. Long et al. [
19] construct a hierarchical framework by classifying metrics as coarse-, medium-, or fine-grained, which facilitates the systematic assessment of metrics according to their level of detail. This categorization highlights the applicability and limitations of indicators at each level of granularity. Shao et al. [
20] focus on energy efficiency assessment metrics from around the world over the last two decades. Their work provides a macroscopic perspective, not only summarizing a variety of metrics but also examining their calculation formulas, strengths, and weaknesses, and highlighting their relevance and applicability in practice. Wang et al. [
118] discuss a taxonomy of green data center performance metrics, covering basic as well as extended metrics. The focus of this work is a more granular perspective that helps to assess the overall performance of green data centers. Reddy et al. [
119] classify data center metrics according to energy efficiency, cooling, green technology application, performance, heat and air management, network, security, storage, and financial impact. By incorporating financial impacts, this classification helps managers develop more effective decision-making and management strategies to enhance the overall performance of data centers.
Despite these advances, as data centers face increasing environmental challenges such as air pollution and toxic by-products [
120], the industry is calling for future research to be based not only on practical needs, but also to explore the assessment methods best suited to address specific issues.
Future research could consider fusing these two approaches, using the granularity classification of Long et al. [19] to organize the broad set of indicators collected by Shao et al. [20]. This would not only maintain the comprehensiveness of the assessment but also increase its usefulness. In addition, to address the shortcomings of current classification methods, further research should explore how to optimize the use of indicators in specific application scenarios and how to remedy the gaps in existing methods through technological innovation.
6.2. Shortcomings of the PUE Indicator
A range of metrics is available for evaluating data centers, but PUE has become the de facto industry standard over time [
121]. The PUE metric has become increasingly popular in the data center industry as a measure of computing energy efficiency [22].
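For reference in the discussion that follows, PUE is defined as the ratio of total facility energy to the energy delivered to IT equipment; a minimal computation with illustrative figures:

```python
# PUE = total facility energy / IT equipment energy. An ideal data center
# would approach 1.0; the figures below are illustrative only.

def pue(total_facility_kwh, it_equipment_kwh):
    return total_facility_kwh / it_equipment_kwh

print(pue(1600, 1000))   # prints 1.6: 600 kWh of overhead per 1000 kWh of IT load
```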
Although PUE has become an important metric, several studies have shown that it still has significant shortcomings.
Firstly, although PUE measures overall energy efficiency, it provides little in the way of specific operational detail or practical guidance. Long et al. [
19] point out that PUE fails to adequately reflect the performance in terms of water usage, carbon emissions, and renewable energy usage. In addition, PUE also fails to effectively capture the problem of the over-provisioning of infrastructure in data centers under low load conditions, which may lead to inaccurate energy efficiency assessment.
Second, PUE has limitations in accounting for meteorological conditions. Shao et al. and Li et al. [20,122] point out that data centers with the same level of technology may exhibit different PUE values due to differences in geography and climate, which hinders a fair comparison of PUE values between data centers across regions.
In addition, PUE cannot assess the energy efficiency of the IT equipment itself, so it may fail to reflect a data center’s actual productivity. Zhou et al. [
21] highlighted that software energy efficiency optimization may significantly reduce the energy consumption of IT equipment, while the energy consumption of the infrastructure does not change much, which may lead to unexpectedly high PUE values, thus incorrectly indicating a decrease in energy efficiency.
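The effect Zhou et al. [21] describe can be illustrated numerically (all figures hypothetical): cutting IT energy through software optimization while infrastructure energy stays fixed lowers total consumption yet raises PUE.

```python
# Numeric illustration of the PUE paradox: software optimization reduces
# IT energy, infrastructure energy is unchanged, and PUE *rises* even
# though total consumption falls. All numbers are hypothetical.

def pue(it_kwh, infra_kwh):
    return (it_kwh + infra_kwh) / it_kwh

before = pue(it_kwh=1000, infra_kwh=600)   # 1.6
after = pue(it_kwh=800, infra_kwh=600)     # 1.75, despite 200 kWh saved
print(before, after)
```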
Finally, technical issues in PUE calculations also affect their accuracy and consistency. Brady et al. [
121] and Yuventi et al. [
122,
123] discuss inconsistencies in definitions, unclear calculation nodes, and the complexity of data collection and parameter monitoring in PUE calculations. These technical issues not only increase the difficulty of calculating PUE but may also lead to inaccuracies in engineering estimates. Additionally, the influence of environmental factors may prevent PUE values from being a true reflection of actual performance throughout the year. Brady et al. [
121] also mention the lack of research on the sensitivity of PUE to specific parameters and the shortcomings of attempts to use open-source information for PUE calculations.
The above analyses reveal a variety of limitations that PUE faces in practical application. These limitations not only affect the accuracy and reliability of PUE but also restrict its usefulness. In view of these issues, there is a clear need to further improve and optimize PUE. This can be achieved by developing new metrics, introducing more relevant variables and parameters, formulating policies and standards for energy efficiency assessment, and applying advanced data analytics and machine-learning techniques to enhance its effectiveness as an energy efficiency assessment tool.
6.3. Improvements in PUE
Recently, numerous scholars have proposed improvements to the PUE metric, leading to more comprehensive and accurate assessments.
To address the lack of specific operational details and practical guidance in the PUE metric, Shaikh et al. [22] tackled shortcomings of the PUE and DCiE metrics, namely that they measure only power efficiency and account for neither CO2 emissions nor the costs involved in the total power usage of the entire data center, and therefore proposed a new power efficiency and CO2 measurement calculator called PEMC.
Since the PUE metrics cannot assess the energy efficiency of IT equipment and applications, Zhou et al. [
21] analyzed the requirements of application-level metrics for data center power usage efficiency and proposed two novel energy efficiency metrics: ApPUE and AoPUE, which constitute the application-level PUE family. ApPUE reflects the energy efficiency of IT equipment and relates application characteristics to power consumption. AoPUE measures the energy efficiency of the total power of data center facilities with respect to their application performance.
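A sketch of one plausible reading of this application-level family follows; the exact formulas appear in Zhou et al. [21], and the specific ratios used here (IT power over application-attributed power for ApPUE, total facility power over the same application-level quantity for AoPUE) are our assumption for illustration only.

```python
# Hedged sketch of the application-level PUE family. The exact
# definitions are in Zhou et al. [21]; the ratio forms below are an
# assumed reading, and all numbers are hypothetical.

def classic_pue(total_kw, it_kw):
    return total_kw / it_kw

def ap_pue(it_kw, app_kw):
    # Assumed form: IT-equipment power per unit of power that actually
    # serves the application.
    return it_kw / app_kw

def ao_pue(total_kw, app_kw):
    # Assumed form: end-to-end ratio from facility power to the
    # application level (so ao_pue == classic_pue * ap_pue).
    return total_kw / app_kw

total, it, app = 1600.0, 1000.0, 500.0
print(classic_pue(total, it), ap_pue(it, app), ao_pue(total, app))
```

Under this reading, AoPUE composes the facility-level and application-level ratios, which matches the text's description of it as measuring total facility power against application performance.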
In order to address technical issues affecting the accuracy and consistency of energy efficiency in data centers, Yuventi et al. [
123] suggest renaming PUE (power usage effectiveness) to “energy usage effectiveness” (EUE), which is intended to eliminate confusion and to avoid inconsistencies in the reporting of energy ratings. Rose et al. [
124] adapted the calculation of PUE to exclude the effect of water storage systems.
Currently, there are some problems with the accuracy of PUE predictions. Lei et al. [
23] proposed a statistical framework for predicting and analyzing PUE in Hyperscale Data Centers (HDCs). Islam et al. [
125] describe a Belief Rule-Based Expert System (BRBES) model for predicting power usage effectiveness (PUE) in data centers. This model is unique in that it integrates a new learning mechanism that significantly improves the accuracy of PUE prediction through parameter and structure optimization. Avotins et al. [
24] improved the accuracy of PUE measurements and overall energy efficiency performance by deploying high-resolution sensors to analyze the thermodynamic behavior of the system in detail and using artificial intelligence algorithms to develop automated analysis tools to optimize cooling system configurations.
The above studies have effectively improved data center energy efficiency management through the introduction of advanced sensing technologies, the development of new assessment tools such as PEMC, the creation of refined energy efficiency metrics such as ApPUE and AoPUE, and the optimization of prediction models using AI algorithms. They have also adjusted existing assessment criteria, such as renaming PUE to EUE. These comprehensive improvement measures not only complement each other but also work together to enhance the science and practicality of data center energy efficiency management. However, despite the significant progress made in the field of data center energy efficiency assessment, it still faces some limitations. For example, although the proposed new technologies and methods are theoretically valid, they may encounter implementation challenges such as high cost, technical complexity, or compatibility with existing systems. High-resolution monitoring and artificial intelligence algorithms rely on a large amount of high-quality data, but obtaining such data in real-world environments can be challenging, and data incompleteness, errors, and biases may affect the accuracy of the results. Furthermore, despite the introduction of new metrics, harmonized standards for these metrics have not yet been established, which may affect the consistency and efficiency of future energy efficiency measurements.
7. Conclusions
Through bibliometric analysis, we conducted a comprehensive review of research on data center cooling systems from 2004 to 2024, with a particular focus on energy efficiency, optimization strategies, and energy management. The selected databases include Scopus, Web of Science, and IEEE. The study shows that as global attention has increasingly focused on data center energy efficiency, especially after 2012, research in this field has exhibited significant growth. Since 2019, research directions have gradually shifted towards greener and smarter data centers, as reflected by the increase in the number of publications in related fields. Among these databases, Scopus has the largest number of publications, and China has shown an outstanding research performance in this field.
By using CiteSpace-6.3.1 to visualize the literature, including keyword co-occurrence analysis, keyword clustering, and keyword burst analysis, we found that the classification of data center cooling systems, optimization strategies for different types of cooling systems, measures to improve data center energy efficiency, and the evaluation of cooling system energy efficiency are all important research directions in this field. Based on this, the study comprehensively reviews the research progress on optimization control strategies and evaluation indicators for data center cooling systems, focusing on improving system efficiency and sustainability. The key findings of the study are as follows:
- (1)
In the research domain of data center cooling system design, despite the extensive literature discussing various cooling technologies and technical details, studies often focus predominantly on the evaluation of technical performance. Particularly, when considering the influence of different climatic conditions on cooling system selection, the depth and breadth of the existing research remain insufficient. To address this, the present study proposes a systematic approach to assess and compare the adaptability, cost-effectiveness, and environmental impacts of air-cooled and liquid-cooled systems under diverse operational environments. By providing a comprehensive summary and categorization of key factors for both air-cooled and liquid-cooled systems, this paper enhances the understanding of their performance under varying conditions. These tools enable the selection of the most appropriate cooling solution based on specific thermal load requirements, available space, and climatic conditions. In the field of data center cooling, the future development of air-cooling and liquid-cooling systems will focus more on precise selection based on the specific heat load requirements, available space, and climatic conditions, while continuously improving the adaptability, cost-effectiveness, and environmental protection performance to meet growing performance demands.
- (2)
Although the evaluation indexes of data center air supply effectiveness have gradually been quantified, the evaluation indexes adopted by different research institutions are not the same, and there is a lack of a recognized evaluation index system to enable the comparison of air supply effectiveness across data centers in different regions and scales. Installing baffles in the static pressure chamber is a design optimization measure that can effectively improve the uniformity of airflow organization in the underfloor air supply system. By changing the shape and angle of the baffles, the air supply performance can be significantly enhanced. Overhead air supply shows a good performance in terms of cooling effect and robustness, and when exploring its optimized design, it is necessary to consider not only the channel sealing strategy but also the optimized design of the geometric parameters of the air supply components, such as the grille diameter and deflection angle.
- (3)
The PID control strategy is often used for temperature regulation. As data centers grow in scale and complexity, model predictive control has gradually become the mainstream approach. With the development of artificial intelligence, reinforcement learning methods are used to optimize the control strategy of the cooling system to cope with complex environments. Data centers can choose a single control method or a combination of multiple methods to adapt to complex and dynamic system environments. Diversified strategies provide more comprehensive and flexible solutions. Single control strategies are suitable for standard system optimization tasks, while composite control strategies can more effectively handle complex multi-objective optimization problems, thereby significantly improving overall optimization results.
- (4)
Future research into PUE is likely to see, on the one hand, a wider application of advanced AI and machine-learning techniques to automate the collection of data, analysis of energy efficiency, and execution of optimization strategies, thereby reducing human intervention and improving operational efficiency. On the other hand, more research is expected to focus on driving the uniform use of energy efficiency metrics to ensure consistency and standardization globally. Such efforts will help to improve the global synergy and effectiveness of data center energy efficiency management.
By systematically studying and optimizing the design and control strategies of data center cooling systems, improving the energy efficiency of cooling systems can not only significantly reduce the overall energy consumption and operating costs of data centers but also effectively extend the life of equipment and reduce carbon emissions, thereby making a positive contribution to addressing the global energy crisis and environmental issues.