Urban Building Energy Performance Prediction and Retrofit Analysis Using Data-Driven Machine Learning Approach
ARTICLE INFO

Keywords: Building energy performance; Data-driven approaches; Urban building energy modeling; Machine learning; Building retrofit

ABSTRACT

Stakeholders such as urban planners and energy policymakers use building energy performance modeling and analysis to develop strategic sustainable energy plans with the aim of reducing energy consumption and emissions from the built environment. However, inconsistent energy data and the lack of scalable building models create a gap between building energy modeling and traditional planning practices. An alternative approach is to conduct a large-scale energy usage survey, which is time-consuming. Similarly, existing studies rely on traditional machine learning or statistical approaches for calculating large-scale energy performance. This paper proposes a solution that employs a data-driven machine learning approach to predict the energy performance of urban residential buildings, using both ensemble-based machine learning and end-use demand segregation methods. The proposed methodology consists of five steps: data collection, archetype development, physics-based parametric modeling, machine learning modeling, and urban building energy performance analysis. The devised methodology is tested on the Irish residential building stock and generates a synthetic building dataset of one million buildings through the parametric modeling of 19 identified vital variables for four residential building archetypes. As a part of the machine learning modeling process, the study implemented an end-use demand segregation method, including heating, lighting, equipment, photovoltaic, and hot water, to predict the energy performance of buildings at an urban scale. Furthermore, the model’s performance is enhanced by employing an ensemble-based machine learning approach, achieving 91% accuracy compared to the traditional approach’s 76%. Accurate prediction of building energy performance enables stakeholders, including energy policymakers and urban planners, to make informed decisions when planning large-scale retrofit measures.
U. Ali, S. Bano, M.H. Shamsi et al. Energy & Buildings 303 (2024) 113768
https://doi.org/10.1016/j.enbuild.2023.113768
Received 19 September 2023; Received in revised form 1 November 2023; Accepted 17 November 2023
Available online 22 November 2023
0378-7788/© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
* Corresponding author.
Authors: U. Ali, S. Bano, M.H. Shamsi, D. Sood, C. Hoare, W. Zuo, N. Hewitt, J. O’Donnell.

1. Introduction

The operation of buildings accounted for 30% of global energy consumption and 27% of total energy sector greenhouse gas (GHG) emissions in 2021 [1]. Within this context, 8% comprised direct emissions occurring within buildings, while 19% represented indirect emissions resulting from the production of electricity and heat used in buildings. To address these environmental concerns, the member nations of the European Union (EU) have established a legislative infrastructure to advance sustainable strategic planning initiatives and strengthen energy efficiency within the building sector using the Energy Performance of Buildings Directive (EPBD). The primary objective of this directive is to facilitate the adoption of policies and measures that will enable the achievement of a highly energy-efficient and decarbonized building stock by the years 2030 and 2050, respectively [2].

The rise in annual energy consumption, especially in urban areas, is expected to increase carbon emissions significantly [1]. As a result, there is a growing focus on reducing energy use and emissions from the building sector. Urban planners and policymakers are exploring innovative strategies to make existing buildings more sustainable, including creating comprehensive sustainable energy plans. Furthermore, long-term renovation strategies are necessary to achieve a higher level of sustainability and reduce carbon emissions from buildings. These plans aim to minimize overall energy consumption and CO2 emissions by analyzing data on the energy performance of buildings on a large scale. As a result, the EU has implemented the aforementioned EPBD to ensure that member states develop a buildings database comprising Energy Performance Certificates (EPCs). However, even with this mandate, building stock databases typically cover only 30-50% of the total building stock [3].

Moreover, available data are often inadequate for stakeholders such as urban planners, energy policymakers, utility planners, and manufacturers to create effective and sustainable energy conservation measures. Gathering accurate and comprehensive data for urban modeling poses a significant challenge [4]. The limited availability and accessibility of data at the urban scale make it difficult to understand the urban context thoroughly. This poses a hurdle for researchers and practitioners who aim to develop accurate and reliable models that capture the complexities of urban systems. Overcoming this issue requires innovative approaches and collaborations to improve data collection and sharing mechanisms, ensuring more comprehensive and representative urban modeling and analysis. Similarly, estimating the energy performance of the entire building stock is challenging due to numerous factors that impact energy usage, including the building envelope, the geometry of buildings, the behavior of occupants, heating and cooling systems, and the weather conditions [5,6].

Generally, there are two main approaches to estimating building energy performance: physical and data-driven models [7]. Physical models are based on detailed building physics and are analyzed using simulation tools such as EnergyPlus, ESP-r, and TRNSYS [5]. Simulation with these tools requires extensive building characteristics, including geometric and non-geometric information [6]. On the other hand, the data-driven approach predicts energy usage based on historical data, employing statistical or machine learning algorithms [8]. Unlike the physical modeling approach, this method does not require a deep understanding of the building. This approach has gained significant popularity in the building energy sector because it allows prediction and estimation of energy consumption with limited building information [6]. Similarly, data-driven models can uncover complex relationships between various characteristics of buildings and energy consumption, which can be challenging to identify using traditional methods.

In recent years, researchers have implemented various data-driven approaches in building energy demand prediction. These approaches use historical data and employ statistical and machine learning (ML) algorithms to develop data-driven models [6,9–12]. Machine learning algorithms can be broadly classified into supervised and unsupervised learning techniques, with supervised learning further divided into regression and classification algorithms [13]. Supervised learning algorithms commonly used in building energy demand prediction include nearest neighbor, naive Bayes, rule induction, deep learning, Support Vector Machines (SVM), and neural networks [14,15,13]. On the other hand, unsupervised learning techniques are applied without any corresponding output variable for inputs [14]. Unsupervised learning algorithms commonly implemented in this domain include clustering, such as k-means, and association rules [16,11]. However, previous studies employing the data-driven methodology primarily concentrated on forecasting the energy consumption of individual buildings [17]. This limited focus is mainly due to the lack of high-quality and reliable data on a large scale. In addition, these studies have relied on only a few parameters to forecast the potential energy consumption of the building [18].

The novelty of this research lies in the integration of parametric simulations, ensemble-based machine learning approaches, and segregation methods to predict building energy performance at an urban scale using limited resources. Parametric simulation techniques can create synthetic data encompassing a wide range of relevant scenarios for stakeholders. This study implements ensemble-based machine learning algorithms to predict building energy performance on an urban scale by segregating end-use demands such as electricity, hot water, and heating. Furthermore, this research identifies the key building characteristics for each end-use demand prediction. The research additionally analyses the impact of retrofit measures and future stakeholder policies using historical and future weather data.

This paper is structured as follows. Section 2 presents an overview of the existing work on the prediction of the energy performance of urban buildings. Section 3 outlines the methodology devised, including an explanation of the steps followed in the development of the machine learning model. The results of the Irish case study are presented in Section 4, followed by discussions of possible implications and improvements in the case study in Section 5. Section 6 presents conclusions, potential challenges, and future work.

2. Literature review

Urban building energy modeling can effectively analyze building energy performance and facilitate sustainable energy planning. The most common modeling approaches, such as physics-based or data-driven approaches, differ based on implementation and data requirements, as described in the following sections.

2.1. Physics-based urban building energy modeling

The physics-based urban building energy modeling approach, also referred to as the engineering or simulation approach, uses simulation techniques along with data related to building characteristics, construction, weather conditions, and data from heating-cooling systems to compute the consumption of end-use energy [19,20]. The physics-based
approach can simulate and estimate building energy usage or production on site, incorporating renewable energy technologies [13]. These models determine the end-use energy consumption of each building by type and rating using measurable data [7].

In the context of cities, the bottom-up archetype method has been widely used to analyze the overall impact of energy efficiency strategies and new technologies at a regional or national scale [5,21]. Each building archetype is modeled in the simulation engine to estimate energy consumption, with these estimates then scaled up to represent the regional or national building stock [22]. These approaches heavily rely on quantitative data obtained from building physics. These methods require various inputs, such as the thermal properties (U-values) of the building components (walls, windows, roof, floor, doors), internal and external temperatures, heating system patterns, ventilation rates, appliance quantities, occupancy, schedules, and internal loads [7,6]. In addition, these models require numerous assumptions to establish the behavior of the occupants and a substantial amount of technical data to estimate energy consumption.

One of the most prominent projects, the City Building Energy Saver (CityBES), offers a platform for modeling and analyzing the thermal performance of different retrofit scenarios [23]. CityBES uses the EnergyPlus simulation engine to model buildings and analyze retrofits at the district or city scale [24]. Another project, CitySim, provides a decision support tool that assists energy planners and stakeholders in minimizing energy usage and emissions while incorporating various optimization and retrofit analyses [25]. Urban Modeling Interface (UMI) integrates the EnergyPlus and Daysim simulation engines and a Python module for the operational energy, daylighting, and walkability of urban buildings [26]. MIT’s UBEM (Urban Building Energy Model) platform uses the EnergyPlus simulation engine to model approximately 83,541 buildings by integrating official GIS datasets and a custom building archetype library [27]. URBANopt (Urban Renewable Building And Neighborhood Optimization) provides an EnergyPlus and OpenStudio-based simulation software development kit (SDK) to simulate the energy performance of low-energy districts and campus-scale thermal and electrical analyses [28].

One of the significant challenges in modeling at an urban scale is the availability of both building geometric and non-geometric data. A few recent studies have focused on the generation of new building geometric data. UBEM.io, a novel web-based framework, automates the generation of urban-scale building geometries based on widely available inputs such as shapefiles, LiDAR, and tax assessor data [29]. Soroush et al. developed a detailed urban building energy model using the CityGML format for 3D urban geometry and employed spatial joining to incorporate the features required for archetype selection [30]. Ali et al. proposed urban building energy and microclimate modeling by generating 3D city models from sources such as Google Earth, Microsoft Footprints, and OpenStreetMap [31]. Irene et al. developed a modeling framework to assess the potential of creating energy communities by combining UBEM capabilities with the rooftops’ potential for solar generation [32].

With increased data availability and more sophisticated modeling techniques, it has become crucial to devise a generalized UBEM framework and improve the existing work to facilitate the modeling and analysis of different use cases. Previous studies provide a limited view of the different building energy aspects in an urban setting. This stems mainly from the fact that simulating each building individually, along with their interdependencies, requires significant time and resources [33]. Furthermore, these methods usually deploy a physics-based simulation engine, which can be computationally demanding and time-consuming due to the intricate nature of urban systems.

Data-driven urban building energy modeling can address the aforementioned challenges by estimating building energy consumption using basic knowledge of the buildings’ features. However, this approach still has research gaps, as discussed in the next section.

2.2. Data-driven urban building energy modeling

In urban energy modeling, a data-driven approach can predict and assess buildings’ energy usage by considering various factors related to the characteristics of the buildings [7,19]. This approach is based on the analysis of existing data sources that include building stock datasets, billing data (such as electricity and gas consumption), survey data, and socioeconomic variables [7]. Data-driven urban energy modeling is conducted mainly using machine learning and statistical approaches. Recent studies on urban energy have increasingly focused on using machine learning algorithms over traditional statistical techniques [7].

Rahman et al. used deep recurrent neural networks to predict medium- to long-term electricity use in commercial and residential buildings [34]. Meanwhile, Kontokosta and Tull devised statistical models to determine the energy consumption of electricity and natural gas in more than a million buildings in New York City [35]. Feifeng et al. proposed a semi-supervised learning method for predicting energy use intensity (EUI) using 34,456 unlabeled samples [36]. Zhang et al. proposed a data-driven framework for the prediction of energy usage and greenhouse gas emissions, which considered various factors such as building characteristics, geometry, and urban morphology [37]. Similarly, Seo et al. developed a data-driven model to predict the energy demand for heating of 10,000 low-income households in South Korea [38]. Razak et al. developed a machine learning model that forecasts annual average energy use based on building design features in the initial development stages [18]. Ngo et al. used ensemble machine learning models to forecast building energy consumption over 24 hours [39]. Lastly, Wurm et al. developed a workflow for modeling the heat demand of building stock on an urban scale, using deep learning algorithms [40].

Although a significant amount of research has been conducted on predicting energy consumption in individual buildings using their specific characteristics, few studies have explored data-driven models for predicting energy consumption on a larger scale. The main challenge lies in the lack of high-quality data in sufficient quantities to train prediction models effectively. This underscores the need for a robust building energy modeling approach capable of accurately predicting the energy performance of entire building stocks, even when faced with limited resources for complex decision-making analysis. Furthermore, previous research on predicting building energy consumption has been limited by considering only a small set of parameters [18]. A few recent studies have started incorporating crucial factors such as U-values, HVAC systems, and renewable energy systems into their machine learning algorithms to better estimate energy performance in buildings [37]. However, only a few studies have specifically investigated the impact of parameters such as U-values, HVAC system types, and the presence of renewable energy systems on the estimation of the energy performance of buildings using machine learning algorithms [18,39–41].

Predicting the energy performance of buildings at an urban scale poses a significant challenge for urban planners and policymakers. The accurate prediction of energy consumption and the identification of opportunities for enhancing energy efficiency are crucial for fostering sustainable development in cities. There is significant potential to expand current research and establish a comprehensive methodology for data-driven building energy modeling on an urban level.

However, one major issue that arises in an urban context is the availability of data. Obtaining comprehensive and reliable data at an urban scale can be challenging, as it requires collecting and integrating information from multiple sources [4]. Addressing this issue is essential to enable effective energy planning and modeling techniques, empowering stakeholders to make informed decisions and drive positive change in urban energy management.

These findings highlight the importance of adopting a holistic approach to building energy modeling, considering all relevant factors, to accurately predict building energy performance and align with the
objectives of various stakeholders. Therefore, this research proposes a methodology that combines and harnesses the strengths of physics-based and data-driven approaches to accurately predict the energy performance of buildings on an urban scale. In the physics-based approach, parametric simulation methods are employed to generate synthetic data that encompass all possible scenarios relevant to stakeholders. Similarly, ensemble machine learning and end-use demand segregation methods are used in the data-driven approach instead of relying on a single model to achieve accurate predictions of building energy performance on an urban scale.

3. Methodology

This study proposes a novel methodology that uses supervised machine learning algorithms to predict building energy performance on a large scale. This research aims to identify the most effective model using physics-based and data-driven approaches. The prediction methodology for the energy performance of urban buildings involves five steps (Fig. 1).

1. The initial step involves collecting data from various sources such as building stock, census, weather, and geographical data.
2. The next step involves developing building archetypes using existing building stock data to identify representative baseline models.
3. The subsequent step focuses on parametric simulation to develop appropriate synthetic data.
4. The fourth step develops machine learning models that predict building energy performance on a large scale using an ensemble or segregation method.
5. Finally, the urban building energy performance analysis step analyzes the modeling process results for planning and decision-making purposes.

Fig. 1. Overarching methodology for urban building energy performance prediction using machine learning.

required for building energy modeling is gathered from building stock and energy performance certificate databases and existing construction databases such as TABULA, EPISCOPE, and building typology databases [43].

Along with geometric data, non-geometric data, such as user occupancy patterns, equipment loads, HVAC systems, and usage patterns, are also required for modeling. One of the significant challenges in this regard is the availability of non-geometric building information on a large scale. Non-geometric building data can be obtained through the building archetypes approach, using available national census databases, statistical surveys, and energy performance certificate data.

Weather data sets are essential to accurately model energy use in building thermal simulations [44]. The most commonly used climate data sets, such as the typical meteorological year (TMY) data, have been available for a long time and describe the local climate [45]. Another helpful resource is the set of EnergyPlus Weather format (EPW) files, which can be accessed online for more than 3,034 locations. These files are arranged by World Meteorological Organization region and country. Furthermore, this study incorporates future weather files to assess the impact of weather conditions on retrofit measures under various climate scenarios, aiming to achieve the energy policy targets set by policymakers, such as those for 2030 or 2050. The sources of these future weather files can vary, including resources like Meteonorm, WeatherShift, and CCWorldWeatherGen [46].

Similarly, the modeling process relies on additional sources such as census data, reports on energy policies, and construction data. These sources offer valuable insights into demographic patterns, energy consumption trends, and infrastructure development, facilitating a more comprehensive analysis and meeting the requirements of urban systems.

3.2. Building archetypes development
sis, and empirical studies of buildings within the target building stock. Moreover, simulating any building archetype requires geometric and non-geometric data for each baseline model. These building archetypes are the starting point for parametric modeling of different buildings to develop a synthetic stock.

3.3. Parametric simulation

Parametric simulation provides an optimal solution, mainly when only sparse data sets are available for energy modeling. To execute complex parametric simulations involving multiple parameters, a parametric tool is used to perform numerous simulations using a Building Energy Performance Simulator (BEPS) model [47]. This study uses jEPlus as a parametric tool for energy simulations. Furthermore, jEPlus uses EnergyPlus for simulation and incorporates DesignBuilder construction templates to integrate diverse parameter values. Parametric simulation using EnergyPlus presents a robust approach to assess the energy performance of buildings and investigate various design alternatives. In the parametric simulation, EnergyPlus facilitates a systematic exploration of the design parameters, providing insights into their impact on energy consumption, comfort, and other performance metrics.

The selection of parametric features plays a crucial role in developing parametric simulation-based models and generating synthetic datasets. The accuracy of the building energy model is highly dependent on the careful selection of each parameter in this process. These parameter values, which encompass the necessary variations for synthetic data generation, can be obtained from literature surveys that are specific to the relevant climate environments [48,3].

In the parametric simulation process, the essential parameters commonly used include construction characteristics such as walls, windows, floors, and roofs, as well as internal gains, occupancy density, and heating or cooling systems. They all contribute to the overall energy performance assessment and are integral to the parametric simulation. By considering these parameters and their variations, parametric simulation enables the exploration of different design alternatives and their impact on energy consumption, comfort levels, and other performance metrics. It allows for a comprehensive evaluation of the energy efficiency of the building and helps to make decisions about design optimizations. Therefore, selecting the appropriate parameters and their values, based on literature surveys and specific climate environments, is crucial to create accurate and representative synthetic datasets and to ensure the reliability of parametric simulation-based models.

However, dealing with the complexity of many parameters makes it nearly impossible to generate simulated data for all possible combinations. Sampling methods such as Simple Random Sampling (SRS) and Latin Hypercube Sampling (LHS) are used to generate synthetic data to address this challenge [49,50]. SRS is a straightforward method in which each sample is randomly and independently selected from the population. On the other hand, LHS is a more advanced sampling method that aims to achieve a more uniform distribution of samples across the entire range of the data. LHS ensures that each parameter value combination is balanced, allowing for a more comprehensive design space exploration. These methods allow for generating representative synthetic datasets encompassing a range of parameter combinations, facilitating a more comprehensive analysis of design alternatives and optimizing energy modeling outcomes.

3.4. Machine learning modeling

This process involves formulating machine learning models to estimate the building energy performance (Fig. 2). Synthetic building stock data, generated from the parametric simulation step, is intended to serve as input for the development of machine learning models.

Fig. 2. Process of machine learning modeling to predict Energy Use Intensity (EUI) using machine learning models.

3.4.1. Data preprocessing

The process begins with data preprocessing, during which inconsistencies within the dataset are identified and eliminated before the data are used for further analysis and model development.

3.4.2. Data splitting

The pre-processed data is divided into two subsets to ensure optimal training of the model: a training dataset used for training the model and a test dataset for evaluating the performance of the trained model. Two standard techniques for data splitting are random data splitting and cross-validation.

Random data splitting is a straightforward method in which data is randomly divided into training and testing datasets, typically in an 80/20 split ratio. However, this method may cause problems with uneven data distribution, and an incorrect selection of training and testing datasets can also adversely affect the machine learning model’s performance [51]. On the other hand, cross-validation is a more sophisticated method that is often used to strike a balance between minimal bias and variance in the trained model. This study adopts the k-fold cross-validation algorithm for data splitting to prevent overfitting or underfitting the model.

3.4.3. Non-segregation models development

This paper implements and compares three different machine learning model approaches to predict building energy performance, namely: the single model approach, the end-use demand segregation method, and the ensemble-based segregation method. In the single model approach, also referred to as the “non-segregation” method, this study conducts a comparative analysis of various machine learning algorithms, assessing their predictive accuracy, efficiency, and suitability for building energy performance modeling. Over recent years, machine learning models have garnered considerable attention in data-driven modeling. Among the most frequently used models are Linear Regression (LR), Neural Network (NN), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), Gradient Boosting (GB), and Support Vector Regression (SVR) [7]. Some of the popular implementations of gradient boosting include XGBoost (Extreme Gradient Boosting), Histogram-Based Gradient Boosting (HGB), and LGBM (Light Gradient Boosted Machine). These algorithms have performed well in energy forecasting and prediction and have been used extensively and successfully in previous energy modeling studies [17,11]. By
assessing the effectiveness of these models, this study aims to discern the most efficient approach to predict building energy performance using machine learning techniques.

3.4.4. End-use demand segregation models development

End-use demand segregation methods use different machine learning models to predict each end-use demand. This strategy diverges from the traditional approach of employing a single machine learning model. This modification aims to achieve superior predictive performance (Fig. 3). The workflow includes developing distinct regression machine learning models for each end-use demand, such as heating, cooling, lighting, and hot water. The predictions of these end-use demands are aggregated to calculate the final energy performance of the building, measured in terms of Energy Use Intensity (EUI). The prediction for each end-use demand is multiplied by its corresponding primary energy factor. The resulting values for heating, cooling, equipment, lighting, and hot water are then aggregated, and photovoltaic energy generation is deducted from them to calculate the total energy consumption of the building. This cumulative total is then divided by the building area to calculate the EUI, a measure of the energy performance of the building as defined in Equation (1). Finally, the EUI is classified into an Energy Performance Certificate (EPC) label or rating.

$$
EUI = \frac{(E_{\text{heating}} \times PEF_{\text{heating}}) + (E_{\text{cooling}} \times PEF_{\text{cooling}})}{A_{\text{total}}} + \frac{(E_{\text{lighting}} \times PEF_{\text{lighting}}) + (E_{\text{equipment}} \times PEF_{\text{equipment}})}{A_{\text{total}}} + \frac{(E_{\text{hotwater}} \times PEF_{\text{hotwater}}) - (E_{\text{PV}} \times PEF_{\text{PV}})}{A_{\text{total}}} \tag{1}
$$

where $E_{\text{heating}}$, $E_{\text{cooling}}$, $E_{\text{lighting}}$, $E_{\text{equipment}}$, $E_{\text{hotwater}}$, and $E_{\text{PV}}$ represent the energy consumption (or generation for $E_{\text{PV}}$) for each respective category in kilowatt hours per year (kWh/year). $PEF_{\text{heating}}$, $PEF_{\text{cooling}}$, $PEF_{\text{lighting}}$, $PEF_{\text{equipment}}$, $PEF_{\text{hotwater}}$, and $PEF_{\text{PV}}$ are the primary energy factors (PEFs) for each respective category. $A_{\text{total}}$

Fig. 3. Methodology for end-use demand segregation modeling to predict Energy Use Intensity (EUI) using machine learning.

improves accuracy compared to the conventional approach of using a single model. There are two main ensemble learning techniques that differ mainly by kind of model, data sampling, and decision function. Therefore, ensemble learning techniques can be classified as stacking and voting techniques.

The stacking method, also known as stacked generalization, was introduced by Wolpert [52]. The goal is to reduce the generalization error of different machine learning models. The final meta-model is trained on the predictions of “n” base machine learning models obtained through the k-fold cross-validation technique. On the other hand, the voting ensemble method is one of the most intuitive and easiest to understand. The voting ensemble method comprises “n” machine learning models, and the final prediction is the one with “the most votes” or the highest weighted and averaged probability. Generally, ensemble learning techniques combine the machine learning models with the best prediction performance. The study implements a stacking-based ensemble method to predict each end-use demand, enhancing model accuracy and predicting building energy performance. This method combines predictions from multiple models by training another model to consolidate their outputs, often resulting in more accurate and robust predictions compared to the voting ensemble method (Fig. 4).

3.4.6. Models performance

To evaluate the effectiveness of machine learning models, commonly used performance indices such as R-Squared ($R^2$), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) are employed [7,11]. The model with the lowest RMSE and MAE values and the $R^2$ value nearest to 1 is deemed superior among all models. Finally, in order to assess the model’s accuracy, the predicted value of EUI (expressed in kWh/(m²·year)) is transformed into an Energy Performance Certificate (EPC) label or rating. Furthermore, precision and recall are crucial metrics used for a detailed analysis of each class. Precision assesses the accuracy of positive predictions made by the model, whereas recall quantifies the model’s capability to detect all positive instances within the dataset [3].
represents the total floor area of the building in square meters (m2 ). 3.4.7. End-use features extraction
The final step of this process is to find the importance of features
3.4.5. Ensemble and segregation models development for each end-use demand using the developed machine learning model.
The workflow further implements ensemble machine learning meth- Feature importance refers to the determination of the relevance or con-
ods to test multiple learning algorithms and obtain better predictive per- tribution of individual features in a machine learning model to make
formance. Ensemble techniques are commonly used in machine learning accurate predictions. It helps in understanding which features have the
to enhance model accuracy by mitigating overfitting and increasing most significant impact on the model’s predictions.
generalizability. By leveraging the complementary strengths of mul- One popular method for calculating feature importance is SHAP
tiple models, ensemble learning provides more stable predictions and (SHapley Additive exPlanations). SHAP values provide a unified mea-
6
U. Ali, S. Bano, M.H. Shamsi et al. Energy & Buildings 303 (2024) 113768
Fig. 4. Methodology for ensemble machine learning modeling approach for enhanced predictive performance in machine learning models.
sure of feature importance by considering the contribution of each energy performance of buildings on an urban scale. This case study fol-
feature value to the prediction for a specific instance while also ac- lows the same structure as the proposed methodology discussed in the
counting for interactions between features. By using SHAP values, we previous section, with subsequent subsections following the same order.
can gain insight into which features impact the model’s predictions the
most. This information can be valuable for understanding the underly- 4.1. Data collection
ing relationships in the data and identifying the key drivers or factors
that influence the target variable. Collecting urban-scale building stock data is challenging as indi-
vidual building information is often unavailable [4]. The data collec-
3.5. Urban building energy performance analysis tion process involves acquiring raw building data from various sources
to implement the proposed methodology, including building stock
In the final phase of the methodology, the developed machine learn- datasets, building census datasets, weather data, and data from energy
ing model predicts the energy performance of the entire building stock. policymakers’ reports. See Table 1.
The availability of comprehensive building stock data can help stake- In Ireland, building stock data are available as Energy Performance
holders analyze the building stock at an urban scale and successfully im- Certificates (EPCs) maintained by the Sustainable Energy Authority of
plement sustainable energy policies. Furthermore, the developed model Ireland (SEAI). The EPC (also called the Building Energy Rating (BER)
can be applied to practical application scenarios, such as implementing certificate) dataset of the Irish residential stock represents the measured
and evaluating proposed retrofit measures as part of national-level pol- building stock and comprises more than 200 building characteristics.
icy decisions. These measures, often proposed at the national level, aim These features include building fabric, heating systems, estimated end-
to improve the energy performance of existing buildings through mod- use, CO2 emissions, and estimated delivered and primary energy con-
ifications and improvements. For example, this could include installing sumption. Each entry in the Irish EPC dataset contains an energy rating
heat pumps or integrating renewable energy systems like solar panels. for the respective building, ranked its energy performance on a graded
The proposed models can evaluate their impact before implementa- scale from A1 to G based on the estimated energy consumption per
tion and identify potential energy savings. This predictive capability square meter per year [53]. In 2023, the Irish EPC dataset contained
reduces the risk of implementing ineffective or inefficient measures, en- approximately 1,126,817 residential buildings, with a significant pro-
suring that resources are used optimally. It also helps fine-tune such portion of building ratings within the range of C1 to D2 (Fig. 5). The
measures to fit better the specific needs and constraints of the building dataset’s most common types of buildings are semi-detached and de-
stock. tached houses.
In general, the developed model offers a holistic approach to urban- The Irish census, conducted every four years by the Central Statis-
scale energy management and policy implementation, creating a more tics Office (CSO), collects various data points on the building where
sustainable built environment. Using modeling outcomes, stakehold- the respondent resides. Therefore, the census provides the number of
ers can navigate the complexities of urban building stock analysis and buildings in each geographic area [56]. According to the CSO 2022
energy policy implementation, even without extensive knowledge of dataset, Ireland has approximately 1,841,152 residential buildings. Sim-
building dynamics. This empowers policymakers and stakeholders alike ilarly, the GeoDirectory database provides statistical and geographical
to make informed decisions when retrofitting existing building stock to information on Ireland’s entire building stock [54]. The Q4 2022 GeoDi-
improve energy efficiency and mitigate environmental impact. rectory report, published by An Post (Irish Postal Service) and Ordnance
Survey Ireland, comprises geocoded addresses of 2,100,905 residential
4. Case study buildings in Ireland. Detached dwellings remained the most prevalent
type of residence (30.7% of the national total), followed by terraced
The primary objective of this case study is to test the proposed dwellings (28.2%) and semi-detached dwellings (24.7%). This study
methodology by calculating the energy performance of Ireland’s resi- focuses on Dublin City in Ireland and the Dublin EPC dataset, which
dential building stock. This methodology seamlessly integrates a data- includes 339,494 of the 624,758 residential buildings, representing the
driven approach with parametric simulation modeling to predict the highest proportion of the entire Irish building stock. This suggests that
Table 1
Building data requirements and associated data sources for Irish case study.
Fig. 5. Irish EPC building energy rating chart used to determine building energy performance, percentage of total EPC vs. Non-EPC residential buildings.
Fig. 6. 3D geometry of Irish residential building archetypes for energy parametric simulation [44,48].
EPC data are available for only approximately 54% of the residential building stock of Dublin City [53]. This study employs machine learning algorithms to predict the energy rating of the remaining 46% of the stock using limited variables (Fig. 5). Furthermore, the weather data for Dublin are obtained from the default EnergyPlus dataset, which includes historical data and also incorporates future weather files for 2030 by Meteonorm. This allows us to assess the impact of weather conditions on retrofit measures in various climate scenarios.

Similarly, energy policy reports are necessary to explore future scenarios. Irish national reports, such as the Climate Action Plan 2023, are used to test scenarios in this case study. This provides valuable insight into future plans and strategies for Irish residential buildings. These reports outline the goals and roadmaps set by policymakers to address climate change, reduce greenhouse gas emissions, and improve energy efficiency in the residential sector [57].

4.2. Building archetypes development

The parametric simulation framework uses each building archetype as a baseline model. In this case study, four building types are considered as archetypes of the Irish residential building stock [44]. These types are selected to represent the primary variations of building types based on data from the CSO, Irish EPC, and GeoDirectory datasets. These building archetypes serve as the starting point for the parametric modeling of different buildings, helping to develop a synthetic stock representation. These four types of residential buildings also exist in the GeoDirectory database, namely terraced houses, detached houses, semi-detached houses, and bungalows (Fig. 6).

Building archetypes require both geometric and non-geometric data to define each baseline model. The initial step involves identifying the non-geometric and geometric parameters associated with the existing building stock of Dublin. This information is essential for performing a parametric simulation using the archetypes. Geometric information for the various types of Irish buildings is based on existing studies and Irish building regulations guidelines. Non-geometric parameters, however, are determined using current building energy performance databases and literature surveys. For example, the Irish EPC provides values for essential building physics parameters, such as U-values for walls, roofs, floors, and windows, along with their respective ranges. Other relevant non-geometric parameters that impact the energy performance of the Irish building stock have been identified based on previous research [44,48]. The geometric and non-geometric parameters of baseline archetypes with default values used for the Irish case study are shown in Table 2 [44,48,62,63].

Table 2
Geometric and non-geometric parameters of baseline archetypes used in the Irish case study.

4.3. Parametric simulation

The selection of parametric features is pivotal in developing physics-based models based on parametric simulation and generating synthetic datasets after the archetype development process. The accuracy of the building energy model relies on the careful selection of each input and output parameter in this process. These parameter values embody the necessary variations for synthetic data generation. In this study, 19 input parameters are used to simulate Irish residential building archetypes. The selection of these parameters is based on existing studies on residential buildings [48,3]. However, these previous studies do not include certain advanced features. Therefore, several additional parameters, including HVAC systems, are incorporated to conduct a complete analysis of HVAC systems, primary heating factors, and renewable parameters (Table 3). Furthermore, this study employed a building feature reduction approach by integrating DesignBuilder construction templates and reducing the number of dependent features. For instance, building elements require material features such as thickness, conductivity, density, and specific heat. In this study, existing templates were used, and U-values were used to represent these features. This approach reduces the number of parameters required as inputs to the UBEM and further reduces the model computing time by eliminating dependent parameters.

Table 3
Parameters needed for parametric simulation of archetypes.

One of the primary output parameters in this study is the Energy Use Intensity (EUI), also referred to as the final primary energy use per building’s total floor area per year, measured in kWh/(m²·year). Irish EPC data provide information on building energy performance or certificate ratings in terms of EUI (kWh/(m²·year)), which is further interpreted on an A1 to G rating scale. An A1-rated building demonstrates the highest level of energy efficiency, typically associated with the lowest energy consumption and CO2 emissions. On the other hand, a G rating represents the least energy-efficient buildings (Fig. 5). Furthermore, this study focuses on the end-use demand segregation method to calculate the Energy Use Intensity. Therefore, each
Fig. 7. Distribution of 1 million residential buildings synthetic data in terms of the Irish building energy rating labels.
Table 4
Comparative analysis of machine learning models to predict end-use demand in kW h/yr using
RMSE metrics.
Models Heating Interior Lighting Interior Equipment Photovoltaic Power Water Systems
end-use demand, including heating, lighting, equipment, photovoltaic, and hot water, is considered an output parameter in the parametric simulation process.

This study employs jEPlus as a parametric tool for physics-based parametric simulation. jEPlus uses the capabilities of EnergyPlus for thermal simulation and integrates DesignBuilder construction templates to incorporate diverse parameter values. A sample of 1 million buildings is generated using the Latin hypercube sampling (LHS) method to construct a reliable machine learning model. This sampling process ensures that the resulting distribution covers all energy rating data for Irish buildings (Fig. 7).

4.4. Machine learning modeling

This process involves formulating an urban-scale building energy performance machine learning model. The process begins with the synthetic building stock data generated in the previous step, which are preprocessed to remove outliers and improve the quality of the dataset before implementing machine learning models. Subsequently, the data are divided into two subsets to create training and testing datasets. This study uses a 10-fold cross-validation method during data division to mitigate the risk of overfitting, rather than using a random data selection for training and testing.

Ten different machine learning algorithms are analyzed to assess their ability to predict EUI building energy performance based on a given dataset. These regression algorithms have shown exceptional performance in energy forecasting and prediction, particularly within the context of energy modeling [17,11,7]. The algorithms include XGBoost (XGB), LightGBM (LGBM), Gradient Boosting (GB), Histogram-based Gradient Boosting (HGB), Random Forest (RF), Neural Network (NN), Decision Tree (DT), Linear Regression (LR), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). The performance of each developed model is evaluated using metrics such as R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). A model is considered superior if it achieves values closer to zero for RMSE and MAE and a value closer to one for R². The target feature is EUI, which is used to predict building energy performance using regression models. Furthermore, the final predicted EUI is also converted into an energy rating based on the Irish EPC rating (Fig. 5). Finally, the model’s performance is further tested using an accuracy estimation of the energy rating, with the model producing the highest accuracy being considered the best learning model.

This study conducts a comparative analysis of the three machine learning approaches proposed in this research to evaluate which one is best suited for predicting building energy performance. These approaches are the single-model approach (non-segregation method), the end-use demand segregation method, and the ensemble-based segregation method. In the non-segregation method, EUI is predicted using all ten machine learning models. The workflow then develops learning models using the segregation method for each end-use demand: heating, interior lighting, interior equipment, photovoltaic power, and water systems. The process implemented and tested ten machine learning models for each end-use demand (Table 4). The results show that the XGB model performed best in predicting the demand for heating, with an RMSE of 683.17. For interior lighting, interior equipment, photovoltaic power and water systems, the XGB,
Table 5
List of important features with rank that affect end-use demand machine learning models using SHAP
method.
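Rank | Heating | Interior Lighting | Interior Equipment | Photovoltaic Power | Water Systems
1 | Air changes per hour | Lighting density | Equipment density | Renewables | Building type
2 | Heating setpoint | Building type | Building type | Orientation | Domestic hot water
3 | Wall U-value | | | Weather |
4 | Building type | | | |
5 | Occupancy | | | |
6 | Window U-value | | | |
7 | Equipment density | | | |
8 | Weather | | | |
9 | Roof U-value | | | |
10 | Lighting density | | | |
11 | Heating setback | | | |
12 | Floor U-value | | | |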
LGBM, RF and DT models reported an RMSE of 0, indicating excellent performance.

In addition, models such as LR, KNN, and SVM exhibited relatively higher RMSE values in all categories, indicating less accurate predictions. The results demonstrate that the RMSE for most end-use demands is nearly 0. This can be attributed to the fact that end-use demands calculated in EnergyPlus are derived using static calculations, meaning that values are determined based on fixed parameters and equations without accounting for variability or randomness. Therefore, machine learning models can easily learn and map these fixed relationships between input features and end-use demands, resulting in a near-perfect fit to the data. Furthermore, the SHAP method is employed to gain further insight into the main features that affect the model output (Table 5). The findings reveal significant factors that affect energy consumption in buildings. The rate of air changes per hour emerged as the most influential feature, highlighting the importance of ventilation in determining heating demand. The heating setpoint and wall U-value also ranked high, underscoring the importance of temperature control and insulation in regulating energy usage. The type of building appeared consistently throughout the ranking, indicating its substantial influence on overall energy demand and usage patterns. The relevance of orientation and weather in photovoltaic power generation emphasizes the need to consider building direction for optimal energy production. These results help stakeholders understand these critical features and design effective strategies aimed at reducing energy consumption, improving energy efficiency, and promoting sustainability in the built environment.

Finally, the prediction of each end-use demand is multiplied by its respective Irish primary energy factor, and these values are then summed to determine the total energy consumption of the building. This cumulative total is then divided by the area of the building to calculate the EUI, a measure of the energy performance of the building. The results illustrate the significant improvement in the performance of various machine learning models in predicting EUI when the segregation method is applied (Fig. 8). First, in the non-segregation scenario, the XGB model demonstrates the best performance on all metrics, with an RMSE of 13.89, an MAE of 9.72, and an accuracy of 76% in terms of building rating. LGBM follows closely in performance. However, further down the table the performance degrades, with the SVM having an RMSE of 71.96, an MAE of 50.98, an R-squared of 0.76, and an accuracy of 29%. This suggests that gradient boosting models, such as XGB and LGBM, are better suited to the non-segregation problem.

Secondly, in the EUI segregation scenario, there is a notable enhancement in the performance of several models. Specifically, the XGB and LGBM models excel, with good R-squared values and substantially lower RMSE and MAE values compared to those without the segregation method. These models achieve substantially higher accuracy, with XGB reaching 89% and LGBM reaching 87%. This indicates that segregation captures the underlying data patterns efficiently, helping these models make more precise predictions. However, it is essential to note that some models, such as NN, LR, KNN, and SVM, continue to demonstrate suboptimal performance even in the segregation scenario. The Neural Network (NN) model shows relatively less improvement compared to other models, which might suggest that it does not benefit as much from segregation in this particular context. The poor performance of SVM persisted even with segregation, indicating that this model might not be suitable for this dataset irrespective of the data processing method.

These results indicate that incorporating segregation in the analysis improves the performance of most models, particularly XGB, LGBM, and HGB. These findings highlight the importance of considering segregation in the machine learning process to obtain more accurate predictions for EUI values and emphasize the potential for future research to explore novel approaches to improve the performance of models that are lagging.

The modeling process is further improved using ensemble learning techniques to combine the best-performing models (XGB, LGBM, and HGB). By comparing these models, this study seeks to identify the most effective approach for predicting building energy performance using machine learning techniques.

These results highlight the importance of EUI segregation and the effectiveness of ensemble modeling in improving the accuracy of end-use demand prediction (Table 6). With the non-segregation method, the XGB model achieved an RMSE of 13.89, with an accuracy of 76%. In contrast, the XGB model with the segregation method achieves a significantly lower RMSE of 7.69, indicating reduced prediction errors compared to the previous method. The accuracy improves to 89%, suggesting more accurate predictions in most cases. Finally, the ensemble-based segregation approach, combining the XGB, LGBM, and HGB models, achieves the lowest RMSE of 6.48, demonstrating a further reduction in prediction errors compared to the previous methods. Accuracy reaches 91%, indicating a higher level of correct predictions than the other methods. The confusion matrix shows that the model performs well across all building energy ratings (Fig. 9). The findings suggest that the combination of models can enhance prediction capabilities and provide more reliable estimates for decision-making processes.

4.5. Urban building energy performance analysis

In the urban building energy performance analysis phase, the developed model is applied to practical scenarios, implementing retrofit measures outlined in Ireland’s National Climate Action Plan 2023. The objective is to retrofit existing residential buildings with ratings below B2 and install heat pumps. Two different scenarios are developed, improving the U-values of windows, walls and roofs as recommended by Part L of the Irish Building Regulations and upgrading the HVAC system from a boiler to a heat pump. Additionally, the scenarios include options with and without renewables (Table 7).
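The way such retrofit scenarios are applied to individual building records can be sketched as follows. This is a minimal illustration only, not the study's implementation: the `Building` record, the `apply_scenario` helper, the full list of rating labels, and in particular the target U-values are assumptions introduced here, and the placeholder U-values are not the values of Table 7. After the overrides are applied, the modified record would be passed back through the trained models to obtain a post-retrofit rating, a step that is not shown.

```python
from dataclasses import dataclass, replace

# Rating labels from best to worst. The text only names the A1-to-G range;
# the intermediate labels listed here are an assumption.
RATING_ORDER = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3",
                "D1", "D2", "E1", "E2", "F", "G"]

# Hypothetical target U-values in W/(m2*K); placeholders, NOT the Table 7 values.
TARGET_U = {"window_u": 1.4, "wall_u": 0.35, "roof_u": 0.25}


@dataclass(frozen=True)
class Building:
    rating: str      # pre-retrofit label; the post-retrofit label must be re-predicted
    hvac: str        # "boiler" or "heat_pump"
    window_u: float
    wall_u: float
    roof_u: float
    has_pv: bool


def is_below_b2(rating: str) -> bool:
    """True for labels worse than B2 (B3 to G)."""
    return RATING_ORDER.index(rating) > RATING_ORDER.index("B2")


def apply_scenario(b: Building, with_renewables: bool) -> Building:
    """Return a retrofitted copy of an eligible building, or the building unchanged."""
    if not (is_below_b2(b.rating) and b.hvac == "boiler"):
        return b
    return replace(
        b,
        hvac="heat_pump",
        # Never make the fabric worse than it already is.
        window_u=min(b.window_u, TARGET_U["window_u"]),
        wall_u=min(b.wall_u, TARGET_U["wall_u"]),
        roof_u=min(b.roof_u, TARGET_U["roof_u"]),
        has_pv=b.has_pv or with_renewables,
    )


old_house = Building("D1", "boiler", 2.8, 1.1, 0.6, False)
scenario_1 = apply_scenario(old_house, with_renewables=False)  # fabric + heat pump
scenario_2 = apply_scenario(old_house, with_renewables=True)   # fabric + heat pump + PV
```

Keeping the scenario as a pure function over immutable records makes it straightforward to compare the pre-retrofit and post-retrofit inputs for the same building.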
Fig. 8. Comparative analysis RMSE and accuracy of machine learning models using with and without end-use demand segregation method to predict EUI.
Table 6
Comparative analysis of method and machine learning models for predicting EUI
using model performance metrics.
Both retrofit scenarios are applied to a dataset of 10,000 buildings with ratings below B2 and boilers as the HVAC system. This sample of 10,000 buildings is large enough to analyze and apply the retrofit scenarios effectively, covering all inefficient building ratings from B3 to G. In general, there is a significant improvement in the distribution of energy ratings in buildings. Implementing both retrofit scenarios in the sample buildings resulted in a notable improvement, as indicated by the shift in the distribution curve from lower energy ratings to higher ones (Fig. 10). However, the results indicate that in Scenario I, where heat pumps are installed and windows, walls, and roofs are refurbished, only 2,725 buildings achieved a rating of B2 and above.

In contrast, Scenario II, which included renewable installations, showed a slight improvement, with 3,467 buildings reaching higher ratings. These results demonstrate that both scenarios could raise only a relatively small percentage of buildings to the higher ratings, ranging from 27% to 34%. This highlights the need for deeper retrofitting measures, including heat pumps and renewables, to achieve higher ratings (Fig. 10).

The results are further examined using historical and future weather conditions, using a weather file for 2030. The emission scenarios considered in this study are based on a Representative Concentration Pathway (RCP), which is a greenhouse gas concentration trajectory adopted by the IPCC [60]. The 2030 weather file is based on RCP 4.5, described by the IPCC as an intermediate scenario and the most probable baseline scenario, considering the exhaustible nature of non-renewable fuels. The study shows no significant differences when using the future weather file. However, due to global warming and projected average temperature increases of 1–1.6 °C, heating demand is expected to decrease in the future, potentially leading to an improvement in building energy ratings [61]. Furthermore, the rating distribution for buildings is expected to change, primarily through the use of photovoltaics as renewable energy sources (Fig. 11).

The results demonstrate that the proposed methodology helps urban planners, energy policymakers, utility planners, and manufacturers in evaluating the implementation of retrofit measures on a large scale. Additionally, this case study highlights that fabric renovation in buildings is insufficient as a standalone solution. In conjunction with the installation of the heat pump, it is crucial to address other factors, such as the airtightness of the building and the heating controls, to effectively improve the energy performance of the building, as evidenced by the feature importance results.
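The shares of buildings reaching B2 or better in the two retrofit scenarios follow directly from the counts reported above. The short calculation below reproduces them as a sanity check; the variable and function names are illustrative and are not taken from the study.

```python
# Counts reported in the text for the 10,000-building retrofit sample.
SAMPLE_SIZE = 10_000
REACHED_B2_OR_BETTER = {"Scenario I": 2_725, "Scenario II": 3_467}


def share_percent(count: int, total: int) -> float:
    """Share of `count` in `total`, expressed as a percentage."""
    return 100.0 * count / total


shares = {
    name: share_percent(count, SAMPLE_SIZE)
    for name, count in REACHED_B2_OR_BETTER.items()
}

# Additional buildings lifted to B2 or better when renewables are included.
extra_with_renewables = (
    REACHED_B2_OR_BETTER["Scenario II"] - REACHED_B2_OR_BETTER["Scenario I"]
)

for name, share in shares.items():
    print(f"{name}: {share:.2f}% of the sample reaches B2 or better")
print(f"Renewables add {extra_with_renewables} buildings")
```

The computed shares, 27.25% and 34.67%, are close to the 27% to 34% range quoted above, and they show that roughly two thirds of the sample remains below B2 in both scenarios.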
Fig. 9. Confusion matrix shows the performance of the ensemble-based segregation model for each building rating. (For interpretation of the colors in the figure(s),
the reader is referred to the web version of this article.)
Table 7
Retrofit scenarios to analyze the pre or post-effect on building energy performance at urban scale.
Retrofit Scenarios Window U-value Wall U-value Roof U-value HVAC Renewables
Fig. 10. Impact on the distribution of 10,000 building sample pre or post-retrofit scenarios.
Fig. 11. Impact of historical and future weather conditions on the post-retrofit scenarios.
measures. By focusing on these critical factors, policymakers can effectively allocate resources and implement targeted retrofit strategies to improve building energy efficiency. However, it should be acknowledged that the importance of characteristics may differ for different sample data, weather conditions, or urban contexts.

Finally, the proposed solution is a valuable tool for urban planners, energy policymakers, utility planners, and manufacturers in evaluating and implementing retrofit scenarios at the urban scale. However, the models inherently depend on the quality of the input data. Synthetic data that do not closely represent real-world conditions might therefore fail to capture the complexities and uncertainties of the actual urban context. Furthermore, machine learning models are often considered ‘black boxes,’ which could lead to a lack of understanding of the underlying reasons behind the predictions. This lack of knowledge makes it difficult for policymakers and planners to trust and fully understand the recommendations. Additionally, the complexity and computational requirements of machine learning models and parametric simulations can be prohibitive, necessitating significant computational resources.

6. Conclusion and future work

Stakeholders analyze the energy performance of buildings on an urban scale to develop effective policy measures that reduce energy consumption and CO2 emissions. However, collecting and analyzing building energy performance data on a large scale is complex and time-consuming, requiring multiple resources. To address this challenge, we propose a novel methodology that uses machine learning algorithms to predict the energy performance of an entire urban building stock. This methodology allows stakeholders to make informed decisions and implement targeted interventions to promote sustainable urban development. In this paper, we implement the end-use demand segregation method and the ensemble-based approach to develop a robust learning model to predict building energy performance. This approach improves the predictive performance of machine learning and supports informed decision-making in building energy performance assessment.

The methodology is tested on Dublin City by developing a synthetic building dataset of 1 million residential buildings using parametric analysis of 19 key parameters identified from four building archetypes. The results show that the segregation method is highly effective for predicting EUI based on the given dataset, compared to the traditional single-model approach. Among the ten machine learning algorithms compared, variations of the Gradient Boosting algorithm (XGB, LGBM, and HGB) are found to be the most efficient and accurate models to predict building energy performance. Furthermore, the ensemble-based approach yields an improvement in accuracy of 15%. Accurate prediction of building energy performance enables stakeholders, such as energy policymakers and urban planners, to make informed decisions when planning large-scale retrofit measures.

In general, the proposed methodology offers valuable information and tools to support urban planners and energy policymakers in addressing the challenges of sustainable planning and energy efficiency on an urban scale. The data-driven approach, coupled with feature analysis and predictive modeling, empowers decision-makers to make informed choices and drive positive change in urban energy systems. The findings of this study assist energy policymakers and urban planners by providing information that can contribute to the development of effective retrofit measures. These measures aim to decrease building energy consumption and mitigate carbon emissions. By incorporating the knowledge gained from this study, policymakers and planners can make well-informed decisions that facilitate sustainable urban development and address the pressing issue of climate change. Furthermore, the study helps policymakers and urban planners evaluate the feasibility and impact of implementing retrofit measures on a larger scale. This comprehensive approach supports the formulation and execution of strategies to address energy efficiency and environmental concerns.

Future research could investigate the influence of different mid-rise or high-rise apartment and non-residential archetype models on the predictive performance of machine learning algorithms. Furthermore, the integration of cloud computing for parametric simulation could further enhance the research results. Currently, this research focuses on annual energy use and could be expanded to analyze seasonal and monthly variations.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

This publication has emanated from research supported by Science Foundation Ireland through US-Ireland R&D Partnership Research
approach further improved the results, achieving an accuracy of 91%. Grant 20/US/3695, the U.S. National Science Foundation through
Comparing the ten different models revealed that the ensemble-based Award Number 2217410, and the Department for the Economy in
segregation method is highly effective in predicting EUI, with an im- Northern Ireland through USI 167. The opinions, findings, and conclu-
provement in the energy rating of the building resulting in an increase sions or recommendations expressed in this material are those of the
14
U. Ali, S. Bano, M.H. Shamsi et al. Energy & Buildings 303 (2024) 113768
author(s) and do not necessarily reflect the views of the Science Foun- [27] C.C. Davila, C.F. Reinhart, J.L. Bemis, Modeling Boston: a workflow for the effi-
dation Ireland or other funding agencies. cient generation and maintenance of urban building energy models from existing
geospatial datasets, Energy 117 (2016) 237–250.
[28] R. El Kontar, B. Polly, T. Charan, K. Fleming, N. Moore, N. Long, D. Goldwasser, Ur-
References banopt: an open-source software development kit for community and urban district
energy modeling, Tech. rep., National Renewable Energy Lab. (NREL), Golden, CO
[1] EU-Energy, Energy for Europe by European commission, Online; https://energy.ec. (United States), 2020.
europa.eu/index_en, 2022. (Accessed 1 December 2022). [29] Y.Q. Ang, Z.M. Berzolla, S. Letellier-Duchesne, V. Jusiega, C. Reinhart, Ubem. io:
[2] W.A. Benjamin, Revision of the energy performance of buildings directive: fit for 55 a web-based framework to rapidly generate urban building energy models for carbon
package, 2022. reduction technology pathways, Sustain. Cities Soc. 77 (2022) 103534.
[3] U. Ali, M.H. Shamsi, M. Bohacek, C. Hoare, K. Purcell, E. Mangina, J. O’Donnell, [30] S.S. Abolhassani, M. Amayri, N. Bouguila, U. Eicker, A new workflow for detailed ur-
A data-driven approach to optimize urban scale energy retrofit decisions for resi- ban scale building energy modeling using spatial joining of attributes for archetype
dential buildings, Appl. Energy 267 (2020) 114861. selection, J. Build. Eng. 46 (2022) 103661.
[4] C. Hoare, R. Aghamolaei, M. Lynch, A. Gaur, J. O’Donnell, A linked data approach [31] A. Katal, M. Mortezazadeh, L.L. Wang, H. Yu, Urban building energy and microcli-
to multi-scale energy modelling, Adv. Eng. Inform. 54 (2022) 101719. mate modeling–from 3d city generation to dynamic simulations, Energy 251 (2022)
[5] C.F. Reinhart, C.C. Davila, Urban building energy modeling–a review of a nascent 123817.
field, Build. Environ. 97 (2016) 196–202. [32] I.M. Borràs, D. Neves, R. Gomes, Using urban building energy modeling data to
[6] T. Hong, Y. Chen, X. Luo, N. Luo, S.H. Lee, Ten questions on urban building energy assess energy communities’ potential, Energy Build. 282 (2023) 112791.
modeling, Build. Environ. 168 (2020) 106508. [33] A. Nutkiewicz, Z. Yang, R.K. Jain, Data-driven urban energy simulation (due-s):
[7] U. Ali, M.H. Shamsi, C. Hoare, E. Mangina, J. O’Donnell, Review of urban build- integrating machine learning into an urban building energy simulation workflow,
ing energy modeling (UBEM) approaches, methods and tools using qualitative and Energy Proc. 142 (2017) 2114–2119.
quantitative analysis, Energy Build. 246 (2021) 111073. [34] A. Rahman, V. Srikumar, A.D. Smith, Predicting electricity consumption for com-
[8] T. Ahmad, H. Chen, Y. Guo, J. Wang, A comprehensive overview on the data driven mercial and residential buildings using deep recurrent neural networks, Appl. En-
and large scale based approaches for forecasting of building energy demand: a re- ergy 212 (2018) 372–385.
view, Energy Build. 165 (2018) 301–320. [35] C.E. Kontokosta, C. Tull, A data-driven predictive model of city-scale energy use in
[9] Y. Zhao, C. Zhang, Y. Zhang, Z. Wang, J. Li, A review of data mining technologies in buildings, Appl. Energy 197 (2017) 303–317.
building energy systems: load prediction, pattern identification, fault detection and [36] F. Jiang, J. Ma, Z. Li, Y. Ding, Prediction of energy use intensity of urban buildings
diagnosis, Energy Built Environ. 1 (2) (2020) 149–164. using the semi-supervised deep learning model, Energy 249 (2022) 123631.
[10] Y. Wang, T. Wu, H. Li, M. Skitmore, B. Su, A statistics-based method to quantify [37] Y. Zhang, B.K. Teoh, M. Wu, J. Chen, L. Zhang, Data-driven estimation of build-
residential energy consumption and stock at the city level in China: the case of the ing energy consumption and ghg emissions using explainable artificial intelligence,
Guangdong-Hong Kong-Macao Greater Bay area cities, J. Clean. Prod. 251 (2020) Energy 262 (2023) 125468.
119637. [38] J. Seo, S. Kim, S. Lee, H. Jeong, T. Kim, J. Kim, Data-driven approach to predicting
[11] Y. Sun, F. Haghighat, B.C. Fung, A review of the-state-of-the-art in data-driven ap- the energy performance of residential buildings using minimal input data, Build.
proaches for building energy prediction, Energy Build. 221 (2020) 110022. Environ. 214 (2022) 108911.
[12] C. Tian, Y. Ye, Y. Lou, W. Zuo, G. Zhang, C. Li, Daily power demand prediction for [39] N.-T. Ngo, A.-D. Pham, T.T.H. Truong, N.-S. Truong, N.-T. Huynh, T.M. Pham, An
buildings at a large scale using a hybrid of physics-based model and generative ad- ensemble machine learning model for enhancing the prediction accuracy of energy
versarial network, in: Building Simulation, vol. 15, Springer, 2022, pp. 1685–1701. consumption in buildings, Arab. J. Sci. Eng. 47 (4) (2022) 4105–4117.
[13] N. Abbasabadi, M. Ashayeri, Urban energy use modeling methods and tools; a re- [40] M. Wurm, A. Droin, T. Stark, C. Geiß, W. Sulzer, H. Taubenböck, Deep learning-
view and an outlook for future tools, Build. Environ. (2019) 106270. based generation of building stock data from remote sensing for urban heat demand
[14] P. Manandhar, H. Rafiq, E. Rodriguez-Ubinas, Current status, challenges, and modeling, ISPRS Int.l J. Geo-Inf. 10 (1) (2021) 23.
prospects of data-driven urban energy modeling: a review of machine learning meth-
[41] A.S. Mohammed, P.G. Asteris, M. Koopialipoor, D.E. Alexakis, M.E. Lemonis, D.J.
ods, Energy Rep. 9 (2023) 2757–2776.
Armaghani, Stacking ensemble tree models to predict energy performance in resi-
[15] C. Benavente-Peces, N. Ibadah, Buildings energy efficiency analysis and classifica- dential buildings, Sustainability 13 (15) (2021) 8298.
tion using various machine learning technique classifiers, Energies 13 (13) (2020)
[42] F. Johari, G. Peronato, P. Sadeghian, X. Zhao, J. Widén, Urban building energy
3497.
modeling: state of the art and future prospects, Renew. Sustain. Energy Rev. 128
[16] U. Ali, M.H. Shamsi, F. Alshehri, E. Mangina, J. O’Donnell, Comparative analysis
(2020) 109902.
of machine learning algorithms for building archetypes development inurban build-
[43] T. Loga, B. Stein, N. Diefenbach, Tabula building typologies in 20 European
ing energy modeling, in: Building Performance Modeling Conference and SimBuild,
countries—making energy-related features of residential building stocks compara-
2018.
ble, Energy Build. 132 (2016) 4–12.
[17] Y. Chen, M. Guo, Z. Chen, Z. Chen, Y. Ji, Physical energy and data-driven models in
[44] U. Ali, M.H. Shamsi, C. Hoare, E. Mangina, J. O’Donnell, A data-driven approach for
building energy prediction: a review, Energy Rep. 8 (2022) 2656–2671.
multi-scale building archetypes development, Energy Build. 202 (2019) 109364.
[18] R. Olu-Ajayi, H. Alaka, I. Sulaimon, F. Sunmola, S. Ajayi, Building energy con-
[45] W. Wang, S. Li, S. Guo, M. Ma, S. Feng, L. Bao, Benchmarking urban local weather
sumption prediction for residential buildings using deep learning and other machine
with long-term monitoring compared with weather datasets from climate station
learning techniques, J. Build. Eng. 45 (2022) 103406.
and energyplus weather (EPW) data, Energy Rep. 7 (2021) 6501–6514.
[19] Y. Pan, M. Zhu, Y. Lv, Y. Yang, Y. Liang, R. Yin, Y. Yang, X. Jia, X. Wang, F. Zeng,
[46] M.P. Tootkaboni, I. Ballarini, M. Zinzi, V. Corrado, A comparative analysis of differ-
et al., Building energy simulation and its application for building performance op-
ent future weather data for building energy performance simulation, Climate 9 (2)
timization: a review of methods, tools, and case studies, Adv. Appl. Energy (2023)
(2021) 37.
100135.
[20] M. Ferrando, F. Causone, T. Hong, Y. Chen, Urban building energy modeling (UBEM) [47] Y. Zhang, I. Korolija, Performing complex parametric simulations with jeplus, in:
tools: a state-of-the-art review of bottom-up physics-based approaches, Sustain. SET2010-9th International Conference on Sustainable Energy Technologies, 2010,
Cities Soc. 62 (2020) 102408. pp. 24–27.
[21] O. Pasichnyi, J. Wallin, O. Kordas, Data-driven building archetypes for urban build- [48] J. Egan, D. Finn, P.H.D. Soares, V.A.R. Baumann, R. Aghamolaei, P. Beagon, O.
ing energy modelling, Energy 181 (2019) 360–377. Neu, F. Pallonetto, J. O’Donnell, Definition of a useful minimal-set of accurately-
[22] L.G. Swan, V.I. Ugursal, Modeling of end-use energy consumption in the residential specified input data for building energy performance simulation, Energy Build. 165
sector: a review of modeling techniques, Renew. Sustain. Energy Rev. 13 (8) (2009) (2018) 172–183.
1819–1835. [49] Y. Choi, D. Song, S. Yoon, J. Koo, Comparison of factorial and Latin hypercube
[23] T. Hong, Y. Chen, S.H. Lee, M.A. Piette, Citybes: a web-based platform to support sampling designs for meta-models of building heating and cooling loads, Energies
city-scale building energy efficiency, Urban Comput. 14 (2016) 2016. 14 (2) (2021) 512.
[24] Y. Chen, T. Hong, M.A. Piette, Automatic generation and simulation of urban build- [50] W. Tian, Y. Heo, P. De Wilde, Z. Li, D. Yan, C.S. Park, X. Feng, G. Augenbroe,
ing energy models based on city datasets for city-scale building retrofit analysis, A review of uncertainty analysis in building energy assessment, Renew. Sustain.
Appl. Energy 205 (2017) 323–335. Energy Rev. 93 (2018) 285–301.
[25] D. Robinson, F. Haldi, P. Leroux, D. Perez, A. Rasheed, U. Wilke, Citysim: com- [51] Y. Ye, M. Strong, Y. Lou, C.A. Faulkner, W. Zuo, S. Upadhyaya, Evaluating perfor-
prehensive micro-simulation of resource flows for sustainable urban planning, in: mance of different generative adversarial networks for large-scale building power
Proceedings of the Eleventh International IBPSA Conference, no. CONF, 2009, demand prediction, Energy Build. 269 (2022) 112247.
pp. 1083–1090. [52] D.H. Wolpert, Stacked generalization, Neural Netw. 5 (2) (1992) 241–259.
[26] C. Reinhart, T. Dogan, J.A. Jakubiec, T. Rakha, A. Sang, Umi-an urban simulation [53] Building energy rating certificate database by SEAI, Online; https://ndber.seai.ie/
environment for building energy use, daylighting and walkability, in: 13th Con- BERResearchTool/ber/search.aspx. (Accessed 25 October 2023).
ference of International Building Performance Simulation Association, Chambery, [54] K. McDonagh, Geodirectory technical guide, an post and ordnance survey Ireland,
France, 2013. 2023.
15
U. Ali, S. Bano, M.H. Shamsi et al. Energy & Buildings 303 (2024) 113768
[55] Ordnance survey Ireland, Online; https://www.osi.ie. (Accessed 25 October 2023). [60] Intergovernmental panel on climate change (IPCC), Online; https://www.ipcc.ch.
[56] Census of population 2022 - profile 1 housing in Ireland by Central Statis- (Accessed 25 October 2023).
tics Office, Online; https://www.cso.ie/en/releasesandpublications/ep/p-cpsr/ [61] P. Nolan, J. Flanagan, High-resolution climate projections for Ireland–a multi-model
censusofpopulation2022-summaryresults/, 2022. (Accessed 25 October 2023). ensemble approach, Environmental Protection Agency, 2020.
[57] Ireland climate action plan 2023, Online; https://www.gov.ie/en/publication/ [62] D. Sood, I. Alhindawi, U. Ali, J.A. McGrath, M.A. Byrne, D. Finn, J. O’Donnell,
7bd8c-climate-action-plan-2023/. (Accessed 25 October 2023). Simulation-based evaluation of occupancy on energy consumption of multi-scale
[58] U. Ali, M.H. Shamsi, M. Bohacek, C. Hoare, K. Purcell, E. Mangina, J. O’Donnell, residential building archetypes, J. Build. Eng. 75 (2023) 106872.
A data-driven approach to optimize urban scale energy retrofit decisions for resi- [63] D. Sood, I. Alhindawi, U. Ali, D. Finn, J.A. McGrath, M.A. Byrne, J. O’Donnell,
dential buildings, Appl. Energy 267 (2020) 114861. Zone-wise occupancy schedules developed using Time Use Survey data for building
[59] J. Laue, Ashrae 62.1: using the ventilation rate procedure, Consult.-Specif. Eng. 55 energy performance simulations, Data Brief 49 (2023) 109453.
(2018) 14–17.
16