Machine Learning Algorithms For Classification of
Abstract. Building performance has been shown to degrade significantly after commissioning, resulting in
increased energy consumption and associated greenhouse gas emissions. Continuous Commissioning using
existing sensor networks and IoT devices has the potential to minimize this waste by continually identifying system
degradation and revising control strategies to adapt to real building performance. Because building heating contributes
significantly to GHG emissions, heating systems, particularly gas boilers, are critical targets for detecting decreased
performance. A review of boiler performance studies has been used to develop a set of common faults and degraded
performance conditions, and these have been integrated into a MATLAB Simulink emulator to create a labelled
dataset with approximately 27,500 cases for training and testing boiler fault classification models. Classification
algorithms such as k-nearest neighbour (KNN), decision tree, random forest and Naïve Bayes have been tested and the
results show that the decision tree method gave the best prediction (97.8% accuracy), followed by random forest
(95.0%) and KNN for K = 3 (88.1%). Naïve Bayes and KNN for K = 9 classification both gave poor results.
1. Introduction
Over a building's life cycle, heating, ventilation, and air conditioning (HVAC) systems often perform poorly
due to faulty equipment. Amongst all end-uses in buildings, HVAC accounts for 40% of building energy
consumption [1]. Common faults in HVAC equipment include
process parameter changes, disturbance parameter changes, actuator problems, and sensor problems [2].
These faults may accumulate over time, often undiagnosed, resulting in decreased performance and
increased energy consumption and costs. Fortunately, fault detection and diagnosis (FDD) technology
can leverage this understanding of poorly operating equipment to improve performance. The goals of
FDD include improved indoor environmental quality, reduced unscheduled equipment downtime and
maintenance costs, and increased equipment life [2]. However, accurate FDD requires detailed
knowledge of how faults affect the performance of the system, obtained either from recorded sensor data or
through fault modelling [1]. Li and O’Neill note that the development of fault data using simulation is
extremely valuable as it permits the modelling and algorithm training for complex fault scenarios
(multiple concurrent faults) and is a way to inexpensively generate the bulk data necessary for algorithm
development and testing [1]. Further, this approach permits data on rare or dangerous fault conditions
to be generated without risk to the building or its occupants. To address the gap noted in [1] regarding
a dearth of simulation-based studies, this paper studies the operation of an HVAC component, a boiler,
through simulation and modelling. Using MATLAB/Simulink and Simscape, the subcomponents
within a boiler can be implemented and configured to simulate nominal and faulty performance.
The accumulation of faults throughout an entire HVAC system has been found to contribute an
additional 40% in energy consumption [1].
A comprehensive literature review of fault modelling in HVAC systems is presented in [1]. Fault
simulation can be categorized into three groups: white-box (physics, first principles), black-box (data
driven, machine learning, empirical) and grey-box (hybrid, semi-empirical) [2]. White-box models
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd 1
IAQVEC IOP Publishing
IOP Conf. Series: Materials Science and Engineering 609 (2019) 062007 doi:10.1088/1757-899X/609/6/062007
utilize concepts of physical and chemical laws, such as mass, momentum, and energy balance to develop
relationships between inputs and outputs [3]. Simulation software exists that can model individual
HVAC components, ranging from primary components (e.g. boiler, heat pumps, and chillers) to
secondary components (e.g. dampers and air handling units), or entire HVAC systems [4]. The presented
research uses white-box modelling concepts to develop the boiler emulator. Black-box models
are data-driven, relying on empirical solutions, especially machine learning concepts. Regression, fuzzy
logic, frequency domain, and other similar approaches are most commonly used for HVAC system
modelling [4]. The third broad model type is a ‘grey box’ model, which combines both physics-based
and machine learning approaches.
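As a minimal illustration of the white-box idea, the sketch below applies a steady-state energy balance to a boiler: the heat released by the fuel, scaled by an efficiency, equals the heat gained by the water. The function name, the efficiency parameter, the heating value and all flow rates are illustrative assumptions; they are not taken from the emulator described in this paper.

```python
# Hypothetical steady-state energy balance for a boiler (illustrative only).
CP_WATER = 4186.0  # J/(kg K), approximate specific heat of liquid water


def leaving_water_temp(t_return_k, m_water, m_fuel, lhv, efficiency):
    """Return the leaving water temperature in K.

    Energy balance (assumed form):
        efficiency * m_fuel * lhv = m_water * CP_WATER * (t_leaving - t_return)
    """
    if m_water <= 0:
        raise ValueError("water mass flow must be positive")
    heat_to_water = efficiency * m_fuel * lhv  # W
    return t_return_k + heat_to_water / (m_water * CP_WATER)


# Assumed values: 5 kg/s of water, 0.01 kg/s of fuel, 50 MJ/kg heating value.
nominal = leaving_water_temp(333.0, m_water=5.0, m_fuel=0.01, lhv=50e6, efficiency=0.85)
# A fault is emulated here simply as a lower efficiency (an assumption).
degraded = leaving_water_temp(333.0, m_water=5.0, m_fuel=0.01, lhv=50e6, efficiency=0.75)
print(f"nominal: {nominal:.2f} K, degraded: {degraded:.2f} K")
```

With the same fuel input, the degraded case delivers cooler water, which is the kind of input-output relationship a physics-based emulator exposes to a classifier.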
This paper contributes to the development of improved fault detection in two ways: first, it presents
a validated physics-based boiler emulator model that can be used to generate simulated data for rare
fault conditions, and second, it explores machine learning models for fault detection using points
typically monitored by Building Automation Systems (BAS). The dataset associated with this study is
also provided as a supplemental file to complement field-collected data and support future research.
2. Methodology
This paper presents a combined approach in which the physical system is simulated within MATLAB.
The model is modified from a nominal case based on the Simscape heating system model [5] (Simscape is a toolbox within
the Simulink library capable of modelling heating systems) and validated using manufacturer data. To
simulate potential fault conditions, model parameters are varied, either individually or in combination,
and the resulting input and output data are labelled with the fault name. This results in a set of datasets,
which can be filtered to mimic the point outputs from a BAS and used to train machine learning models
to classify sets of BAS points to detect or predict fault conditions. The resultant dataset can be used to
supplement logged data where certain faults have not yet occurred and thus the real-world data is
unavailable. There were three fundamental steps in this research: (1) development and validation of a
boiler emulator capable of simulating normal operation and operation under key fault conditions; (2)
simulation of a fault condition dataset; and (3) creation of a testing and training dataset for a machine
learning model to identify fault conditions based on standard building monitoring system data points.
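Step (3) can be sketched as a stratified split, so that every fault label appears in both the training and the testing sets. This is a minimal sketch under stated assumptions: the 80/20 ratio, the function name and the toy labels are hypothetical, and the source does not state how the split was actually performed.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping each label's share roughly equal."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # At least one test case per label; a label with a single row
        # therefore ends up in the test set only.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy dataset of (features, label) pairs; the labels are hypothetical.
rows = [((i, i * 0.5), label)
        for label in ("nominal", "scaling", "excess_air")
        for i in range(10)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
print(len(train), len(test))
```

Splitting per label rather than over the whole dataset matters when some fault conditions have far fewer cases than others.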
visible to the BAS, namely the water flow rate, entering and leaving water temperatures, outdoor air and
fuel temperatures, and gas consumption rate, were output to the dataset and were labelled with the
associated fault. Iterations were performed changing the gas fuel rate from 1 kg/s to 4 kg/s, water mass
flow rate from 3 kg/s to 12.5 kg/s and combustion air temperature from 283 K to 303 K. A constant
return temperature of 333 K was used for all runs and thus omitted from the dataset. A total of 27,281
simulations were run to generate a robust dataset [11] for model training.
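The sweep described above can be sketched as a Cartesian product over the three varied inputs and the fault labels. The ranges are those stated in the text; the number of steps per variable, the fault names, the column names and the `dummy_emulator` function are assumptions made only so the sketch runs, and they do not reproduce the 27,281 runs or the Simulink emulator.

```python
from itertools import product


def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    if n < 2:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def build_dataset(emulate, faults, fuel_rates, water_flows, air_temps):
    """Run the emulator for every combination and label each row with its fault."""
    rows = []
    for fault, fuel, water, air in product(faults, fuel_rates, water_flows, air_temps):
        outputs = emulate(fuel, water, air, fault)
        rows.append({"gas_rate": fuel, "water_flow": water,
                     "air_temp": air, **outputs, "label": fault})
    return rows


def dummy_emulator(fuel, water, air, fault):
    """Hypothetical placeholder with no physical meaning; stands in for the emulator."""
    return {"leaving_water_temp": 333.0 + (0.0 if fault == "nominal" else -1.0)}


fuel_rates = linspace(1.0, 4.0, 4)     # kg/s, range from the text, 4 steps assumed
water_flows = linspace(3.0, 12.5, 5)   # kg/s, range from the text, 5 steps assumed
air_temps = linspace(283.0, 303.0, 3)  # K, range from the text, 3 steps assumed
rows = build_dataset(dummy_emulator, ["nominal", "fault_a"],
                     fuel_rates, water_flows, air_temps)
print(len(rows))
```

The constant return temperature is deliberately left out of each row, matching the text's note that it was omitted from the dataset.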
Figure 2. Confusion matrices for the 4-class dataset (left to right): KNN with k=3, DT, NB, and RF.
The full condition classification results are shown in figure 3. Of the algorithms, DT had the highest
accuracy (97.8%) followed by RF (95.0%) and then KNN with k=3 (88.1%). A thorough analysis of the
results showed that RF consistently amplified the misclassifications occurring to a lesser degree in DT.
For example, RF misclassified X=0.1 as S=0.01 for 34.0% of occurrences compared with 2.2% for DT.
In addition, when feature selection was implemented, RF misclassified adjacent excess air faults, as well
as misclassifying between faults and scaling. This may be a result of removing gas mass flow rate, as it
would have provided insight capable of distinguishing between similar fault outputs. Naïve Bayes and
KNN with a larger number of neighbours (k>5) performed poorly for all feature sets tested, likely due to
the curse of dimensionality associated with such a large number of classes. It is noteworthy that of all
the algorithms tested, the KNN model showed the most significant performance improvement with
feature selection, with the k=3 model increasing from 4.3% to 88.1% when fuel rate was removed as an
input. Beyond k=3, the accuracy for this model remained consistently poor regardless of input variables.
Conversely, the random forest model suffered, decreasing from 95.0% to 74.2% when the fuel flow rate
was omitted. The remaining algorithms showed no such sensitivity to feature selection.
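The sensitivity of KNN to its inputs can be illustrated with a small majority-vote classifier in which the feature columns are selectable. This is a sketch on toy data, not the model used in the study: the data, the labels "A" and "B", and the idea that a large-magnitude input dominates the distance are assumptions offered as one possible mechanism, which the source does not confirm.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3, features=None):
    """Predict a label by majority vote among the k nearest training points.

    train: list of (feature_tuple, label); features: column indices to use.
    """
    idx = list(features) if features is not None else list(range(len(query)))

    def dist(a, b):
        # Euclidean distance restricted to the selected columns
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: column 0 is large and unrelated to the label, column 1 is informative.
train = [
    ((1000.0, 0.10), "A"), ((4000.0, 0.12), "A"), ((2500.0, 0.11), "A"),
    ((1100.0, 0.90), "B"), ((3900.0, 0.88), "B"), ((2400.0, 0.91), "B"),
]
query = (1050.0, 0.11)  # belongs with class "A" according to column 1

all_features = knn_predict(train, query, k=3)
selected = knn_predict(train, query, k=3, features=[1])
print(all_features, selected)
```

With both columns, the large-magnitude column decides which points count as neighbours; dropping it lets the informative column drive the vote.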
Figure 3. 31-class confusion matrices (axis label: True Label) for the best feature set algorithms tested: KNN with k=3 (top left),
NB (top right), DT (bottom left), and RF with 500 trees (bottom right).
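The quantities read from such confusion matrices, overall accuracy and the share of one true class predicted as another, can be computed as in the sketch below. The function names and the toy predictions are assumptions; the class labels X=0.1 and S=0.01 are borrowed from the text only as examples, and the numbers do not reproduce the reported results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count occurrences of each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


def accuracy(counts):
    """Fraction of cases on the diagonal of the confusion matrix."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total


def misclassification_rate(counts, true_label, predicted_label):
    """Share of true_label cases that were predicted as predicted_label."""
    row_total = sum(n for (t, _), n in counts.items() if t == true_label)
    return counts[(true_label, predicted_label)] / row_total


# Toy example: one of five X=0.1 cases is predicted as S=0.01.
y_true = ["X=0.1"] * 5 + ["S=0.01"] * 5
y_pred = ["X=0.1"] * 4 + ["S=0.01"] + ["S=0.01"] * 5
counts = confusion_counts(y_true, y_pred)
overall = accuracy(counts)
x_as_s = misclassification_rate(counts, "X=0.1", "S=0.01")
print(overall, x_as_s)
```

Normalizing by the row total, rather than the grand total, is what allows a statement such as "class X was misclassified as class S for a given percentage of its occurrences".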
Despite the large number of classes, the condition prediction was deemed to be successful for fault
detection, particularly the DT model with 97.8% accuracy. This granularity in prediction is important
because it permits a more precise diagnosis of the specific fault occurring within the boiler. Further, if
a fault is left unresolved, its extent can be tracked over time, supporting future models
that provide a mean time to failure estimate. Together, these algorithms will permit an
intelligent boiler monitoring system to be developed and integrated into the building automation system,
thus providing an additional depth of insight into boiler fault progress, allowing for improved
maintenance schedules and permitting the optimization of operational costs.
4. Conclusions
This study has determined that it is possible to classify faults across a large number of conditions
with high accuracy based only on observed BAS data points. While the results are promising, there
are several limitations of this research as presented. First, the boiler validation and testing were based on
a single boiler model, and future research should repeat the validation testing for other boiler models and
create similar datasets for those boilers. Second, the classification is only performed for individual faults,
not combined/hybrid faults. Third, this research presents only simulated results, and should be extended
in the future to include field-collected data. To address the first two limitations, future work will clone
this emulator to develop datasets for other boilers and replicate this study across boiler types (condensing
and non-condensing) and sizes. Multiple concurrent faults will be simulated to permit more complex
investigations to be undertaken. To address the third limitation, the authors are obtaining real data from
in-situ boilers on campus, and these data will be used both to enhance the dataset and to further refine
and validate the fault detection models. Additional studies are investigating the impact of signal noise
on prediction accuracy and identifying signal processing techniques to increase the robustness of the model
for real-world applications.
References
[1] Li Y and O’Neill Z 2018 A critical review of fault modeling of HVAC systems in buildings.
Build. Simul. 11 953-75
[2] Lan L and Chen Y 2007 Application of modeling and simulation in fault detection and
diagnosis of HVAC systems Build. Simul. 1299-1306
[3] Homod Z R 2013 Review on the HVAC system modeling types and the shortcomings of their
application Journal of Energy 2013
[4] Afram A and Janabi-Sharifi F 2014 Review of modeling methods for HVAC systems. Appl.
Therm. Eng. 67 507-19.
[5] MathWorks 2019 House Heating System. Accessed January 1, 2019.
https://www.mathworks.com/help/physmod/hydro/examples/house-heating-system.html?searchHighlight=heating&s_tid=doc_srchtitle
[6] Turns S 1996 An Introduction to Combustion (New York: McGraw-Hill)
[7] Satoh M 2013 Atmospheric Circulation Dynamics and General Circulation Models (Springer Science &
Business Media)
[8] Viessmann 2018 Viessmann Vitorond 200 Technical Data Manual October.
https://www.viessmann.ca/content/dam/vi-brands/CA/pdfs/commercial/vitorond_200-lg_tdm.pdf/_jcr_content/renditions/original.media_file.download_attachment.file/vitorond_200-lg_tdm.pdf
[9] Shah R and Sekulić D 2003 Fundamentals of Heat Exchanger Design (Hoboken: John Wiley &
Sons)
[10] ANSI/AHRI 2014 2015 Standard 1500 for Performance Rating of Commercial Space Heating
Boilers Standard, Arlington: Air Conditioning, Heating, and Refrigeration Institute
[11] Shohet R, Kandil M and McArthur J 2019 Simulated boiler fault data Toronto: IEEE Dataport.
Available online: https://doi.org/10.21227/ye8z-z608
[12] CRAN The Comprehensive R Archive Network. Accessed 03 29, 2019.
https://cran.r-project.org