Group-3 Report
REPORT IN BUILDING SCIENCE (AR32203)
PREPARED BY GROUP-3:
RAGHUWAR SRIVASTAVA (21AR10027)
KETHOKHRUTO KAPUH (19AR10015)
SHUBHAM DILAWAR (21AR10031)
ADITYA RAJ (21AR10002)
DEBANSH ASSAWA (21AR10010)
SYED AHMAD HASAN (21AR10035)
PRERANA SINGH (21AR10025)
PALLAVI SURVASE (21AR10042)
AYUSH RAJ (21AR10039)
Problem Statement:
Time-Series Forecasting Model for
Energy Usage and Occupancy
Prediction Using IAQ and Energy
Data in an Office Room
INDEX
1. SUMMARY
2. CORRELATION ANALYSIS
3. MODEL DEVELOPMENT
4. MODEL VALIDATION
5. LIMITATIONS OF THE STUDY
6. CONCLUSIONS
7. REFERENCES
SUMMARY
Objective and Methodology
The report aimed to develop a time-series forecasting model to predict energy usage
and occupancy using Indoor Air Quality (IAQ) and energy data in an office room.
Data cleaning, standardization of timestamps, and handling of missing data were
critical initial steps to ensure data integrity.
Model Performance
The models achieved perfect scores in accuracy, precision, recall, and F1-score, which
although indicative of excellent performance, also raised concerns about potential
overfitting.
Conclusion
The report concludes that while the models show high efficiency and accuracy in
predictions, there is a need for careful consideration of model generalizability and
ongoing maintenance.
Future work should focus on addressing the identified limitations and expanding the
models' capabilities to ensure they remain robust and relevant in practical scenarios.
METHODOLOGY
1. DATA CLEANING AND FORMATTING
1.1 CREATING A STANDARD TIMESTAMP
Fig 1.1.1
Explanation (Fig 1.1.1):
The code imports the pandas library for data manipulation.
It reads three CSV files into separate DataFrames (df1, df2, df3).
It prints the columns of each DataFrame to understand their structure.
df1 contains columns related to energy consumption.
df2 contains columns related to indoor environment parameters.
df3 contains columns related to outdoor environment parameters.
Fig 1.1.2
Explanation (Fig 1.1.2):
The code merges the three DataFrames (df1, df2, df3) using the common
column 'datetime'.
It first merges df1 and df2 into a new DataFrame called merged_df using an
inner join.
Then, it merges merged_df with df3 using an inner join again.
The resulting DataFrame merged_df contains the data from all three original
DataFrames, aligned based on the 'datetime' column.
merged_df.head() is used to display the first few rows of the merged
DataFrame for inspection.
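The loading-and-merging steps above can be sketched as follows; the file contents and column names here are illustrative assumptions, not the report's actual data:

```python
import io
import pandas as pd

# Stand-ins for the three CSV files; contents and columns are illustrative only.
df1 = pd.read_csv(io.StringIO(
    "datetime,Computer - kWatts\n"
    "2023-02-01 00:00:00,0.2\n"
    "2023-02-01 01:00:00,0.3\n"))
df2 = pd.read_csv(io.StringIO(
    "datetime,Indoor Temperature\n"
    "2023-02-01 00:00:00,24.5\n"
    "2023-02-01 01:00:00,24.7\n"))
df3 = pd.read_csv(io.StringIO(
    "datetime,Outdoor Temperature\n"
    "2023-02-01 00:00:00,18.0\n"
    "2023-02-01 01:00:00,17.8\n"))

# Inner joins on the shared 'datetime' column keep only timestamps
# present in all three DataFrames.
merged_df = df1.merge(df2, on="datetime", how="inner")
merged_df = merged_df.merge(df3, on="datetime", how="inner")
print(merged_df.head())
```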
Fig 1.1.3
Fig 1.1.5
Explanation (Fig 1.1.5):
The code filters the DataFrame merged_df to include only rows where the
'datetime' values are less than '2024-01-02 00:00:00'.
It then groups consecutive NaN (missing) values in the 'Computer - kWatts'
column of the filtered DataFrame.
This grouping is done using groupby along with diff and cumsum to identify
consecutive NaN value groups.
For each consecutive group, it checks if all values in the 'Computer - kWatts'
column are NaN using .isna().all().
If all values are NaN in a group, it retrieves the start and end timestamps of
the group from the 'datetime' column of group_df.
It calculates the count of consecutive NaN values in the group using
len(group_df).
Finally, it prints the start timestamp, end timestamp, and count of consecutive
NaN values for each group.
Removing the column with 2,434 missing values suggests that it was either not useful or contained too many gaps for effective interpolation. It was therefore dropped from the DataFrame to simplify the data and the analysis.
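A minimal sketch of this consecutive-NaN grouping, on a toy series standing in for 'Computer - kWatts'; the shift/cumsum run-id trick below is one common way to implement the diff/cumsum grouping described:

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the real data runs Feb 2023 - Jan 2024.
df = pd.DataFrame({
    "datetime": pd.date_range("2023-02-01", periods=10, freq="h"),
    "Computer - kWatts": [0.2, np.nan, np.nan, 0.3, 0.4,
                          np.nan, np.nan, np.nan, 0.5, 0.6],
})

# A new group starts whenever the NaN flag changes; cumsum gives each run an id.
is_na = df["Computer - kWatts"].isna()
group_id = (is_na != is_na.shift()).cumsum()

runs = []
for _, group_df in df.groupby(group_id):
    if group_df["Computer - kWatts"].isna().all():
        start = group_df["datetime"].iloc[0]
        end = group_df["datetime"].iloc[-1]
        runs.append((start, end, len(group_df)))
        print(start, end, len(group_df))
```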
Fig 1.1.6
The graph represents the power usage of a computer measured in kilowatts
over a period from early February 2023 to early January 2024.
Fig 1.1.7
The dataset has 1,655 missing values in the 'Computer - kWatts' column.
Fig 1.1.8
Visualization of Interpolation: The graph clearly shows where data was missing
(red dots) and how the interpolated values (blue line) seamlessly connect the
data points around these gaps.
Missing Data Identification: The red dots effectively highlight the temporal
distribution of missing data, aiding in understanding when data interruptions
occurred.
Data Continuity Restored: Interpolation helps restore continuity to the
dataset, making it more useful for further analysis that requires complete
datasets, like trend analysis or modeling.
Trend Observation: Despite the missing data, the general trends of usage
spikes and downtimes are maintained and can be analyzed further for
patterns.
“This graph is useful for assessing both the quality of the data (in terms
of completeness) and the typical behavior of the power usage over time.”
1.9 WHY INTERPOLATION?
Fig 1.1.8: Temperature and pressure (mbar) plotted against date-time, illustrating linearly arranged data.
Fig 1.1.9: Relative humidity plotted against date-time, illustrating a stretch of continuously missing values.
Linearly Arranged Data: In the first part of the image, we see a graph showing
temperature over time. The data points are linearly arranged, meaning that
values change gradually over time. This makes interpolation a suitable method
because it assumes that changes between consecutive data points are
incremental and can be estimated linearly.
Filling Gaps Smoothly: Interpolation allows for a smooth transition between
known data points, preserving the inherent trend and variation in the dataset
without abrupt changes, which is ideal for time-series data like temperature
and pressure.
Handling Continuously Missing Values: The right part of the image shows a
segment where there are consecutive missing data points for relative humidity.
Interpolation is beneficial here as it provides a method to estimate these
missing values based on the surrounding data, ensuring that the dataset
remains usable for analysis despite substantial gaps.
Data Integrity and Analysis: By interpolating missing data points, the integrity
of the data is maintained, allowing for more accurate and meaningful analysis,
such as trend analysis, forecasting, or even machine learning models that
require complete datasets.
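For gaps on an irregular time grid, pandas can also weight the estimate by elapsed time rather than row position; a tiny illustration with made-up values:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2023-02-01 00:00", "2023-02-01 01:00", "2023-02-01 04:00"])
s = pd.Series([10.0, np.nan, 22.0], index=idx)

# method="time" weights the estimate by the actual time gaps, so the value
# at 01:00 lands a quarter of the way from 10.0 to 22.0.
print(s.interpolate(method="time").tolist())  # → [10.0, 13.0, 22.0]
```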
METHODOLOGY
2. CORRELATION ANALYSIS
CORRELATION MATRIX OF ENERGY DATA
Fig 2.1.1
Fig 2.1.2
Fig 2.1.3
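A correlation matrix of this kind can be produced with `DataFrame.corr()`; the sketch below uses synthetic data in which indoor temperature drives the air-conditioner load, so the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataset: AC load follows indoor
# temperature, while CO2 here is independent noise.
rng = np.random.default_rng(0)
temp = rng.normal(25, 2, 200)
df = pd.DataFrame({
    "Indoor Temperature": temp,
    "Air Conditioner - kWatts": 0.3 * temp + rng.normal(0, 0.5, 200),
    "CO2 Levels": rng.normal(600, 50, 200),
})

corr = df.corr()  # Pearson correlation matrix
print(corr.round(2))
```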
Dependent Variables
These are influenced by human actions:
CO2 Levels:
Influenced by ventilation practices, increased occupancy, and usage of
many electrical appliances over long periods.
Indoor Temperature:
Can be adjusted by humans using heating, cooling, and insulation systems.
Independent Variables
These are not directly controlled by human actions:
Atmospheric Pressure:
Not directly controlled by human actions in typical settings.
Relative Humidity:
Although human activities like using humidifiers or dehumidifiers can
affect it, it's generally considered an environmental factor.
Concentration (g/m³):
The relationship with human control is unclear without specifics. It could
represent various substances or pollutants, some of which may be
influenced by human activities while others may not.
METHODOLOGY
3. MODEL DEVELOPMENT
3.1. Time-series forecasting of energy
3.2. Occupancy prediction
3.1.1 EXPLANATION OF THE ENERGY FORECASTING
MODEL:
1. Data Preprocessing:
Handling Missing Values: The code replaces zero values with NaN and then
fills these with the mean of the respective columns, ensuring the dataset is
complete for accurate modeling.
2. Feature Selection:
Date-Time Index: The dataset is indexed by timestamp, ensuring time-based
analysis is feasible and accurate.
3. Model Training and Evaluation:
ATTEMPT 1 USING ANN: This approach uses an Artificial Neural Network to
forecast future energy demand. For this approach, the energy values must
first be binned into discrete 'Energy Class' labels and the time slots into
unique 'Time Class' labels.
ATTEMPT 2 USING PROPHET: This approach uses Meta's Prophet algorithm to
forecast future energy demand. More details of this model are discussed
below.
4. Model Performance:
The model's performance is evaluated using accuracy and a classification
report.
3.1.2
Reason for creating 'classes': Grouping the continuous energy values into discrete classes reduces the number of distinct target values, which makes it easier for the ANN to learn the pattern (function) underlying the data and hence yields better results.
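Such classes might be created with `pd.cut`; the bin edges and time-slot granularity below are hypothetical, since the report does not state the actual class boundaries:

```python
import pandas as pd

# Hypothetical bin edges; the report does not give the real class boundaries.
energy = pd.Series([0.05, 0.12, 0.4, 0.75, 1.3, 0.9])
energy_class = pd.cut(energy, bins=[0, 0.25, 0.5, 1.0, 2.0], labels=[0, 1, 2, 3])

# Time slots can be mapped to a 'Time Class' the same way, e.g. by hour of day.
ts = pd.Series(pd.date_range("2023-02-01", periods=4, freq="6h"))
time_class = ts.dt.hour

print(energy_class.tolist())  # → [0, 0, 1, 2, 3, 2]
print(time_class.tolist())    # → [0, 6, 12, 18]
```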
The plot here shows the repetitive, seasonal pattern of the energy consumption.
To train an ANN for energy forecasting, historical data is typically divided into
training, validation, and testing sets. The ANN architecture, including the
number of layers, neurons, and activation functions, is chosen based on the
complexity of the problem and the available data. The model is then trained using
optimization algorithms such as gradient descent, with the objective of
minimizing prediction errors.
Once trained, the ANN can be used to generate forecasts for future energy
demand or generation. The accuracy of the forecasts can be evaluated using
performance metrics such as Mean Absolute Error (MAE), Mean Squared Error
(MSE), or Root Mean Squared Error (RMSE). Continuous monitoring and
periodic retraining of the ANN ensure that the forecasts remain accurate and
reliable over time.
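The training-and-evaluation loop described above can be sketched with scikit-learn's MLPRegressor on synthetic data; this is not the report's actual model or dataset, only an illustration of fitting an ANN and scoring it with MAE and RMSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neural_network import MLPRegressor

# Synthetic hourly load with a daily cycle; all values are illustrative.
rng = np.random.default_rng(42)
t = np.arange(500)
y = 1.0 + 0.5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.05, 500)

# Encode the hour cyclically so the network can learn the daily pattern.
X = np.column_stack([np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)])

split = 400  # chronological split: train on the past, test on the future
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
mae = mean_absolute_error(y[split:], pred)
rmse = np.sqrt(mean_squared_error(y[split:], pred))
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}")
```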
3.1.2
Troubleshooting the problem encountered here, and searching online for a solution, also proved fruitless.
Step 2: Train/Test Split: dividing the data into training and test sets.
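A minimal sketch of such a split with scikit-learn, on synthetic data; `shuffle=False` preserves chronological order, which is the usual choice for time series:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1000 ordered observations split 80/20.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(len(X_train), len(X_test))  # → 800 200
```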
3.1.5 APPROACH II: PROPHET
Step 2: Train/Test Split: dividing the data into training and test sets.
About Prophet: let us look more closely at the Prophet model developed by Facebook (now Meta).
Here's an overview of how Facebook's Prophet works and why it's a great choice
for energy forecasting in building analysis:
1. Modeling Approach:
- Prophet is based on an additive model where non-linear trends are fit with
yearly, weekly, and daily seasonality, plus holiday effects.
- It decomposes time series data into trend, seasonal, and holiday components,
which are then combined to generate forecasts.
2. Flexibility:
- Prophet offers flexibility in modeling different types of trends, including
linear and saturating growth trends.
- It automatically detects and incorporates various seasonal patterns, such as
yearly, weekly, and daily cycles, making it suitable for capturing energy
consumption patterns in buildings.
5. Uncertainty Estimation:
- Prophet provides uncertainty intervals (prediction intervals) around the
forecasted values, allowing users to assess the reliability of the forecasts.
- This is essential for energy forecasting in building analysis, where uncertainty
in predictions can impact decision-making regarding energy efficiency measures
and resource allocation.
6. Scalability:
- Prophet is scalable and can handle large datasets efficiently.
- It is implemented in Python and can be easily integrated into existing data
analysis workflows, making it accessible to researchers and practitioners in
building analysis.
7. Ease of Use:
- One of the key advantages of Prophet is its ease of use.
- It provides a simple and intuitive interface for fitting and making forecasts,
making it accessible to users with varying levels of expertise in time series
forecasting and machine learning.
1. Data Preprocessing:
Handling Missing Values: The code replaces zero values with NaN and then
fills these with the mean of the respective columns, ensuring the dataset is
complete for accurate modeling.
Feature Engineering: The Occupancy column is created by setting a threshold
on total energy. If total energy exceeds 0.5, the space is considered occupied
(1), otherwise not occupied (0). This binary classification approach simplifies
the analysis.
2. Feature Selection:
Relevant Features: The model uses energy consumption readings from various
sources (Computer kWatts, Plug Load kWatts, Air Conditioner-kWatts, light
+ fan - kWatts, total energy) and environmental factors (Indoor Temperature,
Outdoor Temperature) to predict occupancy.
Date-Time Index: The dataset is indexed by timestamp, ensuring time-based
analysis is feasible and accurate.
3. Model Training and Evaluation:
Random Forest Classifier: This machine learning model is suitable for
classification tasks and can handle non-linear data effectively. It uses an
ensemble of decision trees to make predictions, providing robustness against
overfitting.
Scaling: Features are scaled using StandardScaler to normalize the data,
which is crucial for many machine learning algorithms that are sensitive to the
scale of input features.
4. Model Performance: The model's performance is evaluated using accuracy and
a classification report, providing insights into precision, recall, and F1-score for
the occupancy detection.
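The pipeline described above can be sketched as follows on synthetic data. Note that because 'Occupancy' is derived directly from the total-energy feature, the classifier scores near 1.0 here; this kind of label leakage is one plausible explanation for the perfect metrics the report observes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged dataset; values are illustrative.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "total energy": rng.exponential(0.4, n),
    "Indoor Temperature": 24 + rng.normal(0, 1, n),
})
# Occupied (1) when total energy exceeds the 0.5 threshold used in the report.
df["Occupancy"] = (df["total energy"] > 0.5).astype(int)

X = df[["total energy", "Indoor Temperature"]].to_numpy()
y = df["Occupancy"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()            # normalize features before training
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```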
Fig 3.2.2
Fig 3.2.2
Breakdown of the steps to build and evaluate a machine learning model for
predicting occupancy based on energy usage and environmental data:
Fig 3.2.3
1. Accuracy: The accuracy of the model is 1.0 or 100%, meaning every prediction
made by the model was correct.
2. Precision: The precision is 1.0 for both classes (0 and 1). This indicates that
every instance predicted as class 0 or 1 was truly that class. There were no
false positives.
3. Recall: The recall is also 1.0 for both classes, meaning that every instance of
class 0 and 1 was correctly identified by the model. There were no false
negatives.
4. F1-Score: The F1-score, which is a harmonic mean of precision and recall, is
1.0 for both classes, indicating perfect balance between precision and recall.
5. Support: This shows the number of actual occurrences of each class in the
dataset: 5994 for class 0 and 189 for class 1.
METHODOLOGY
4. MODEL VALIDATION
4.1 MODEL VALIDATION:
1. Data Integrity
Ensured by thorough data cleaning and formatting.
Missing values were systematically addressed by interpolation, particularly
for critical parameters like energy consumption.
2. Methodology for Analysis
Inclusion of different environmental and energy consumption variables to
ensure comprehensive analysis.
Time-series forecasting techniques and occupancy prediction models were
utilized.
3. Testing and Training Data
Data was split into 80% for training and 20% for testing, which is a
standard practice in machine learning to validate model performance.
4. Model Building
Employed multiple algorithms including Random Forest and Facebook's
Prophet for forecasting and classification tasks.
Random Forest was used for its ability to handle non-linear data
effectively.
Prophet was chosen for its robust handling of missing data, flexibility, and
ease of use in capturing and forecasting trends.
5. Performance Evaluation
Model performance was assessed through metrics like accuracy, precision,
recall, and F1-score.
Perfect scores (1.0) in these metrics suggest excellent model prediction
capabilities, although it might also indicate a need to check for overfitting.
6. Error Handling
Unknown errors encountered during model execution were noted, indicating challenges that require further troubleshooting.
7. Scalability and Robustness
Prophet's scalability and its capability to handle large datasets efficiently
were highlighted as beneficial for managing complex building analysis
data.
8. Real-World Applicability
The validation process includes considering the impact of real-world
factors like holidays and seasonality on energy consumption, which are
crucial for accurate forecasting in practical scenarios.
9. Future Enhancements
Continuous monitoring and periodic retraining are recommended to
maintain model accuracy.
Further research into challenges like data quality and model
interpretability is suggested to fully exploit the forecasting potential.
LIMITATIONS OF THE STUDY
Overfitting Concerns
The perfect scores (1.0 for accuracy, precision, recall, and F1-score) may
indicate overfitting, where the model performs exceptionally well on the
training data but may not generalize well to unseen data.
Overfitting reduces the model's predictive power in real-world applications.
Model Complexity
The use of multiple advanced algorithms like Random Forest and Prophet
requires significant computational resources and expertise, which may not be
feasible in all operational environments.
Complex models can also be difficult to interpret, making it harder for
stakeholders to understand model decisions.
Scalability Challenges
While the Prophet model is noted for its scalability, integrating such models
into existing systems or workflows can be challenging, particularly in
environments with varying data scales or formats.
Real-World Factors
The model's ability to handle real-world complexities such as unpredictable
fluctuations in occupancy and environmental conditions could be limited.
Seasonal and holiday effects are considered, but other unpredictable factors
like unexpected events or sudden changes in building usage patterns may not
be fully accounted for.
REFERENCES
Books
"Pattern Recognition and Machine Learning" by Christopher M. Bishop:
Offers comprehensive insights into statistical techniques for machine
learning.
"Forecasting: Principles and Practice" by Rob J Hyndman and George
Athanasopoulos: Great for understanding time series forecasting
methods, particularly useful if you're using the Prophet model.
YouTube Tutorials
StatQuest with Josh Starmer: Offers clear and concise statistics
tutorials, which can be helpful for understanding the data analysis
techniques used.