
DATA ANALYTICS IN BUILDING SCIENCE (AR32203)
REPORT

GROUP-3

UNDER THE GUIDANCE OF
PROF. PRASHANT ANAND

PREPARED BY:
RAGHUWAR SRIVASTAVA (21AR10027)
KETHOKHRUTO KAPUH (19AR10015)
SHUBHAM DILAWAR (21AR10031)
ADITYA RAJ (21AR10002)
DEBANSH ASSAWA (21AR10010)
SYED AHMAD HASAN (21AR10035)
PRERANA SINGH (21AR10025)
PALLAVI SURVASE (21AR10042)
AYUSH RAJ (21AR10039)

Problem Statement:
Time-Series Forecasting Model for Energy Usage and Occupancy Prediction Using IAQ and Energy Data in an Office Room
INDEX

SUMMARY
1. DATA CLEANING & FORMATTING
2. CORRELATION ANALYSIS
3. MODEL DEVELOPMENT
3.1. Time-series forecasting of energy
3.2. Occupancy prediction
4. MODEL VALIDATION
LIMITATIONS AND FUTURE SCOPE FOR STUDY
CONCLUSIONS
REFERENCES
SUMMARY
Objective and Methodology
The report aimed to develop a time-series forecasting model to predict energy usage
and occupancy using Indoor Air Quality (IAQ) and energy data in an office room.
Data cleaning, standardization of timestamps, and handling of missing data were
critical initial steps to ensure data integrity.

Data Handling and Preprocessing


The study utilized three distinct datasets covering energy consumption, indoor environment, and outdoor environment parameters.
Significant efforts were made in data preprocessing, including merging datasets on a
common datetime column and interpolating missing values to maintain data
continuity.

Model Development and Algorithms Used


Two main modeling approaches were utilized:
Random Forest: Used for its capability to handle non-linear data, making it suitable
for occupancy prediction based on environmental and energy consumption data.
Prophet: Selected for its robustness in handling missing data and its flexibility in
incorporating trends, seasonality, and holiday effects into forecasts.

Key Insights from Data Analysis


Correlation analysis revealed that AC usage had a high impact on total energy
consumption, suggesting a target area for energy conservation.
The occupancy predictions were based on energy usage patterns, with the model
showing strong predictive capabilities.

Model Performance
The models achieved perfect scores in accuracy, precision, recall, and F1-score, which
although indicative of excellent performance, also raised concerns about potential
overfitting.

Challenges and Limitations


The primary challenge noted was the potential overfitting of the model to the training
data, which might compromise its effectiveness on new, unseen datasets.
Errors encountered during model execution were significant, as they hindered further
development and application of the predictive models.

Future Scope and Recommendations


Continuous improvement and updating of the models are recommended to keep up
with changing data patterns.
Integration with real-time data and further research on model interpretability and
scalability are suggested to enhance the models’ practical applicability and flexibility.

Conclusion
The report concludes that while the models show high efficiency and accuracy in
predictions, there is a need for careful consideration of model generalizability and
ongoing maintenance.
Future work should focus on addressing the identified limitations and expanding the
models' capabilities to ensure they remain robust and relevant in practical scenarios.
METHODOLOGY

1. DATA CLEANING AND FORMATTING

1.1 CREATING STANDARD TIME STAMP

Fig 1.1.1
Explanation (Fig 1.1.1):
The code imports the pandas library for data manipulation.
It reads three CSV files into separate DataFrames (df1, df2, df3).
It prints the columns of each DataFrame to understand their structure.
df1 contains columns related to energy consumption.
df2 contains columns related to indoor environment parameters.
df3 contains columns related to outdoor environment parameters.
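The loading step described above can be sketched as follows. The file names and column headers here are stand-ins (the report does not reproduce them), and small in-memory CSVs substitute for the files on disk so the snippet is self-contained:

```python
import io
import pandas as pd

# In the report the three frames come from CSV files on disk; small in-memory
# CSVs stand in here, with illustrative (not the report's actual) headers.
energy_csv = io.StringIO("datetime,Computer - kWatts,total energy\n2023-02-01 00:00:00,0.1,0.5\n")
indoor_csv = io.StringIO("datetime,Indoor Temperature\n2023-02-01 00:00:00,24.5\n")
outdoor_csv = io.StringIO("datetime,Outdoor Temperature\n2023-02-01 00:00:00,18.2\n")

df1 = pd.read_csv(energy_csv)   # energy consumption readings
df2 = pd.read_csv(indoor_csv)   # indoor environment parameters
df3 = pd.read_csv(outdoor_csv)  # outdoor environment parameters

# Inspect each frame's structure before merging.
for name, df in [("df1", df1), ("df2", df2), ("df3", df3)]:
    print(name, "->", list(df.columns))
```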

1.2 CONSIDERATION OF COMMON DATASET

Fig 1.1.2
Explanation (Fig 1.1.2):
The code merges the three DataFrames (df1, df2, df3) using the common
column 'datetime'.
It first merges df1 and df2 into a new DataFrame called merged_df using an
inner join.
Then, it merges merged_df with df3 using an inner join again.
The resulting DataFrame merged_df contains the data from all three original
DataFrames, aligned based on the 'datetime' column.
merged_df.head() is used to display the first few rows of the merged
DataFrame for inspection.
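The two successive inner joins described above can be sketched as below; the column names are assumptions based on the report's descriptions, and tiny illustrative frames replace the real data:

```python
import pandas as pd

# Illustrative frames sharing a 'datetime' key (values are stand-ins).
df1 = pd.DataFrame({"datetime": ["2023-02-01 00:00", "2023-02-01 00:30"], "total energy": [0.5, 0.7]})
df2 = pd.DataFrame({"datetime": ["2023-02-01 00:00", "2023-02-01 00:30"], "Indoor Temperature": [24.5, 24.7]})
df3 = pd.DataFrame({"datetime": ["2023-02-01 00:00", "2023-02-01 00:30"], "Outdoor Temperature": [18.2, 18.0]})

# First merge df1 and df2 on 'datetime' (inner join), then merge df3 in.
merged_df = pd.merge(df1, df2, on="datetime", how="inner")
merged_df = pd.merge(merged_df, df3, on="datetime", how="inner")

# Inspect the first few rows of the combined frame.
print(merged_df.head())
```

An inner join keeps only timestamps present in all three datasets, which is why the merged frame is aligned on a common time base.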

1.3 CONSIDER “No CT” AS ZERO

Fig 1.1.3

Explanation (Fig 1.1.3):


The code accesses the last few rows of the merged_df DataFrame using the
.tail() method.
It replaces any missing values (NaNs) in the DataFrame with zero, specifically
for the 'ct' column.
This operation ensures that any missing values in the 'ct' column are
substituted with zeros for the displayed portion of the DataFrame.
By doing this, the code ensures consistency and prevents potential issues with
missing data when viewing the tail of the DataFrame.
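A minimal sketch of this "No CT as zero" substitution, using a synthetic frame in place of the report's data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'No CT' readings arrive as NaN in the 'ct' column.
merged_df = pd.DataFrame({
    "datetime": pd.date_range("2024-01-01", periods=4, freq="30min"),
    "ct": [1.2, np.nan, np.nan, 0.8],
})

# Treat missing 'ct' readings ('No CT') as zero.
merged_df["ct"] = merged_df["ct"].fillna(0)

# Inspect the last few rows after the substitution.
print(merged_df.tail())
```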

1.4 MERGED ALL SHEETS ON THE BASIS OF TIMESTAMP

Fig 1.1.4

Explanation (Fig 1.1.4):


The code groups consecutive NaN (missing) values in the 'Computer - kWatts'
column of the DataFrame merged_df.
It uses groupby along with diff and cumsum to create consecutive groups
based on whether each value in the column is NaN or not.
It iterates over each group, identified by group_id and group_df.
For each group, it checks if all values in the 'Computer - kWatts' column are
NaN using .isna().all().
If all values are NaN in a group, it retrieves the start and end timestamps of
the group from the 'datetime' column of group_df.
It calculates the count of consecutive NaN values in the group using
len(group_df).
Finally, it prints the start timestamp, end timestamp, and count of consecutive
NaN values for each group.
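The consecutive-NaN grouping described above can be sketched as follows, on a tiny synthetic frame (the real data is not reproduced here); the `diff`/`cumsum` trick numbers each run of NaN or non-NaN values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged frame, with two gaps in the column.
merged_df = pd.DataFrame({
    "datetime": pd.date_range("2023-02-01", periods=8, freq="30min"),
    "Computer - kWatts": [0.1, np.nan, np.nan, 0.2, 0.3, np.nan, 0.1, 0.2],
})

col = merged_df["Computer - kWatts"]
# A new group starts whenever the NaN/non-NaN state flips: diff() of the
# 0/1 mask is nonzero at each change, and cumsum() numbers the runs.
group_id_series = col.isna().astype(int).diff().ne(0).cumsum()

gaps = []
for group_id, group_df in merged_df.groupby(group_id_series):
    if group_df["Computer - kWatts"].isna().all():
        start = group_df["datetime"].iloc[0]
        end = group_df["datetime"].iloc[-1]
        gaps.append((start, end, len(group_df)))
        print(f"gap from {start} to {end}: {len(group_df)} consecutive NaN values")
```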

1.5 REMOVE COLUMN WITH 2434 MISSING VALUES

Fig 1.1.5
Explanation (Fig 1.1.5):
The code filters the DataFrame merged_df to include only rows where the
'datetime' values are less than '2024-01-02 00:00:00'.
It then groups consecutive NaN (missing) values in the 'Computer - kWatts'
column of the filtered DataFrame.
This grouping is done using groupby along with diff and cumsum to identify
consecutive NaN value groups.
For each consecutive group, it checks if all values in the 'Computer - kWatts'
column are NaN using .isna().all().
If all values are NaN in a group, it retrieves the start and end timestamps of
the group from the 'datetime' column of group_df.
It calculates the count of consecutive NaN values in the group using
len(group_df).
Finally, it prints the start timestamp, end timestamp, and count of consecutive
NaN values for each group.
Removing the column with 2,434 missing values suggests it contained too many gaps for effective interpolation and was unlikely to be useful; it was therefore dropped from the DataFrame to simplify the data and the analysis.

1.6 INSIGHTS FROM GRAPH BEFORE INTERPOLATING

Fig 1.1.6

The graph represents the power usage of a computer measured in kilowatts over a period from early February 2023 to early January 2024.

Insights (Fig 1.1.6):


Fluctuating Power Usage: The computer's power usage varies significantly
throughout the observed period, with some days experiencing much higher
usage compared to others.
Periods of High Usage: There are distinct spikes in power consumption,
indicating periods of high activity or usage of the computer. These spikes are
particularly prominent at several points such as mid-April, late August, and
mid-November.
Low Usage Baseline: When not in high use, the computer maintains a
relatively low and consistent baseline of power consumption. This could
indicate periods when the computer is either turned off, in standby mode, or
running minimal processes.
Seasonal or Periodic Trends: There appears to be a periodic pattern to the
spikes, which could suggest seasonal usage patterns, periodic tasks that
require more computing power, or scheduled processes/tests that occur at
regular intervals.
End of Year Increase: There is a noticeable increase in activity towards the end
of the year, from November to December, which could be related to year-end
processing needs or other specific tasks that occur around this time.
1.7 FINDING NO. OF MISSING VALUES

Fig 1.1.7

The dataset has 1,655 missing values in the 'Computer - kWatts' column.

1.8 INTERPOLATING THE MISSING VALUES

Fig 1.1.8

Code Explanation (Fig 1.1.8):


Data Loading: The code begins by loading the CSV file into a pandas
DataFrame.
Datetime Conversion: It converts the 'datetime' column to pandas datetime
format to facilitate time-series analysis.
Missing Values Detection: Determines which entries in the 'Computer -
kWatts' column are missing.
Data Interpolation: Interpolates missing values using linear interpolation,
which estimates missing values based on neighboring data points.
Plotting: Creates a plot that includes both the interpolated data (in blue) and
the original points where data was missing (marked in red).
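The interpolation steps above (minus the plotting) can be sketched as below; the values are synthetic stand-ins, and the mask of originally-missing rows is what the report's plot marks in red:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged data, with an interior two-point gap.
df = pd.DataFrame({
    "datetime": pd.date_range("2023-02-01", periods=6, freq="30min"),
    "Computer - kWatts": [0.10, np.nan, np.nan, 0.40, 0.20, 0.30],
})
# Ensure 'datetime' is a pandas datetime, as the report's code does.
df["datetime"] = pd.to_datetime(df["datetime"])

# Remember which rows were missing so they can be highlighted on a plot.
missing_mask = df["Computer - kWatts"].isna()
print("missing values:", missing_mask.sum())

# Linear interpolation estimates each gap from its neighbouring readings.
df["Computer - kWatts"] = df["Computer - kWatts"].interpolate(method="linear")
print(df)
```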
Fig 1.1.8

Graph Insights (Fig 1.1.8):

Visualization of Interpolation: The graph clearly shows where data was missing
(red dots) and how the interpolated values (blue line) seamlessly connect the
data points around these gaps.
Missing Data Identification: The red dots effectively highlight the temporal
distribution of missing data, aiding in understanding when data interruptions
occurred.
Data Continuity Restored: Interpolation helps restore continuity to the
dataset, making it more useful for further analysis that requires complete
datasets, like trend analysis or modeling.
Trend Observation: Despite the missing data, the general trends of usage
spikes and downtimes are maintained and can be analyzed further for
patterns.

“This graph is useful for assessing both the quality of the data (in terms
of completeness) and the typical behavior of the power usage over time.”
1.9 WHY INTERPOLATION?

Fig 1.1.9: Plots of temperature, pressure (mbar), and relative humidity against date-time, contrasting a segment of linearly arranged data with a segment of continuously missing values.

Explanation (Fig 1.1.9):

Linearly Arranged Data: In the first part of the image, we see a graph showing
temperature over time. The data points are linearly arranged, meaning that
values change gradually over time. This makes interpolation a suitable method
because it assumes that changes between consecutive data points are
incremental and can be estimated linearly.
Filling Gaps Smoothly: Interpolation allows for a smooth transition between
known data points, preserving the inherent trend and variation in the dataset
without abrupt changes, which is ideal for time-series data like temperature
and pressure.
Handling Continuously Missing Values: The right part of the image shows a
segment where there are consecutive missing data points for relative humidity.
Interpolation is beneficial here as it provides a method to estimate these
missing values based on the surrounding data, ensuring that the dataset
remains usable for analysis despite substantial gaps.
Data Integrity and Analysis: By interpolating missing data points, the integrity
of the data is maintained, allowing for more accurate and meaningful analysis,
such as trend analysis, forecasting, or even machine learning models that
require complete datasets.
METHODOLOGY

2. CORRELATION ANALYSIS

CORRELATION MATRIX OF ENERGY DATA

Fig 2.1.1

Explanation (Fig 2.1.1):

AC and Total Energy: Extremely high correlation (0.99), indicating AC is the main energy consumer.
Light + Fan and Total Energy: Moderate correlation (0.57), suggesting a
noticeable but less significant impact on total energy usage.
Low Impact of Computers and Plug Loads: Both show very low correlation
(0.09 and 0.05 respectively) with total energy, indicating minimal impact.
Minimal Cross-Device Impact: Low correlations between different devices
(mostly 0.00 to 0.13) suggest independent operation without influencing each
other's energy usage.
Focus on AC for Energy Conservation: Given its strong influence on total
energy, optimizing AC use should be a priority for energy reduction efforts.
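The matrices in this section come from pandas' Pearson correlation; a minimal sketch on synthetic data is below. Column names follow the report, but the values are random stand-ins, so the AC dominance is reproduced only qualitatively:

```python
import numpy as np
import pandas as pd

# Synthetic readings: AC dominates total energy, other loads are small.
rng = np.random.default_rng(0)
n = 200
ac = rng.uniform(0, 2, n)            # Air Conditioner - kWatts
light_fan = rng.uniform(0, 0.4, n)   # light + fan - kWatts
computer = rng.uniform(0, 0.1, n)    # Computer - kWatts
df = pd.DataFrame({
    "Air Conditioner - kWatts": ac,
    "light + fan - kWatts": light_fan,
    "Computer - kWatts": computer,
    "total energy": ac + light_fan + computer,
})

# Pearson correlation matrix; the report renders this as a heatmap.
corr = df.corr()
print(corr.round(2))
```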
CORRELATION MATRIX OF OUTDOOR AIR QUALITY

Fig 2.1.2

Explanation (Fig 2.1.2):


CO2 and Dew Point Temperature: Moderate positive correlation (0.54),
indicating that higher CO2 levels are associated with higher dew point
temperatures.
Outdoor Temperature and Atmospheric Pressure: Strong negative correlation
(-0.63), suggesting that as temperatures rise, atmospheric pressure tends to
decrease.
Dew Point and Atmospheric Pressure: Strong negative correlation (-0.73),
showing that higher dew point temperatures often occur with lower
atmospheric pressures.
Concentration and Atmospheric Pressure: Strong negative correlation (-0.74),
implying that lower atmospheric pressures are associated with higher
pollutant concentrations.
Outdoor Temperature and Dew Point Temperature: Moderate positive
correlation (0.56), indicating that increases in temperature correlate with
increases in dew point temperature.
Concentration and Dew Point Temperature: Very strong positive correlation
(0.99), suggesting that higher dew point temperatures significantly correlate
with higher pollutant concentrations.
CORRELATION MATRIX OF INDOOR AIR QUALITY

Fig 2.1.3

Explanation (Fig 2.1.3):


Temperature and Dew Point: Strong positive correlation (0.72), indicating that
higher indoor temperatures lead to higher dew point temperatures.
Temperature vs. Atmospheric Pressure: Strong negative correlation (-0.62),
showing that increases in indoor temperature are associated with decreases in
atmospheric pressure.
Dew Point and Pollutant Concentration: Nearly perfect correlation (0.99)
between dew point temperature and concentration, suggesting that moisture
levels strongly influence indoor pollutant levels.
Temperature and Pollutant Concentration: Significant positive correlation
(0.69) indicates that warmer temperatures inside may increase pollutant
concentrations.
Pressure and Concentration: Negative correlation (-0.57) implies that lower
atmospheric pressures are linked to higher pollutant concentrations.
Humidity, Dew Point, and Concentration: Strong correlations indicate that
higher humidity is associated with higher dew point temperatures and
pollutant concentrations.
DEPENDENT AND INDEPENDENT PARAMETERS

Dependent Variables
These are influenced by human actions:
CO2 Levels:
Influenced by ventilation practices, increased occupancy, and usage of
many electrical appliances over long periods.

Indoor Temperature:
Can be adjusted by humans using heating, cooling, and insulation systems.

Independent Variables
These are not directly controlled by human actions:
Atmospheric Pressure:
Not directly controlled by human actions in typical settings.

Relative Humidity:
Although human activities like using humidifiers or dehumidifiers can
affect it, it's generally considered an environmental factor.

Dew Point Temperature:
A measure of humidity influenced by atmospheric conditions, but not directly controlled by humans.

Concentration (g/m3):
The relationship with human control is unclear without specifics. It could
represent various substances or pollutants, some of which may be
influenced by human activities while others may not.
METHODOLOGY

3. MODEL DEVELOPMENT

3.1. Time-series forecasting of energy
3.2. Occupancy prediction
3.1.1 EXPLANATION OF THE ENERGY FORECASTING
MODEL:

Objective: The model aims to forecast the energy demand of spaces.
Approach: It leverages historical data to learn patterns associated with energy consumption and applies this understanding to forecast future energy demand.
Machine Learning Workflow:
1. Splitting the Data: The dataset is divided into training and test sets, with 80% used for training and 20% for testing.
2. Model Fitting: The model is trained on the scaled training data.
3. Forecasting: The trained model forecasts energy demand on the scaled test data.
4. Evaluation: Model accuracy is calculated by comparing forecasts with actual energy data in the test set.

3.1.2 SPECIFIC INFERENCES FROM ENERGY FORECASTING:

1. Data Preprocessing:
Handling Missing Values: The code replaces zero values with NaN and then
fills these with the mean of the respective columns, ensuring the dataset is
complete for accurate modeling.
2. Feature Selection:
Date-Time Index: The dataset is indexed by timestamp, ensuring time-based
analysis is feasible and accurate.
3. Model Training and Evaluation:
ATTEMPT 1 USING ANN: This approach utilizes Artificial Neural Networks to forecast future energy demands. For this approach we first need to divide the energy values into individual 'Energy Class' bins and the time slots into unique 'Time Class' bins.
ATTEMPT 2 USING PROPHET: This approach uses Meta's Prophet algorithm to forecast future energy demand. More details of this model are discussed further below.
4. Model Performance:
The model's performance is evaluated using accuracy and a classification
report.
1. Step 1: Encoding Time Class

Setting up 'Time Class': We encode the time values in Excel so that a certain integer represents a certain time of day. The division is demonstrated below.

2. Step 2: Encoding Energy Class

Setting up 'Energy Class': We encode the energy values in Excel into classes obtained from the quartile ranges. The division is demonstrated below.
Reason for creating 'classes': Dividing the continuous values into discrete classes makes it easier for the ANN to learn the pattern (function) the data follows, hence yielding better results.
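The report performs this encoding in Excel; an equivalent sketch in pandas is below. The exact time-slot boundaries are not given in the report, so six 4-hour bins are an assumption, while the quartile-based energy bins follow the report's description:

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly energy readings standing in for the real data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "datetime": pd.date_range("2023-02-01", periods=48, freq="30min"),
    "total energy": rng.uniform(0.0, 2.0, 48),
})

# Time Class: map the hour of day onto integer slots (six 4-hour bins,
# an assumed division -- the report's actual slot boundaries are not shown).
df["Time Class"] = df["datetime"].dt.hour // 4

# Energy Class: bin the energy values by their quartiles (0..3).
df["Energy Class"] = pd.qcut(df["total energy"], q=4, labels=False)

print(df[["datetime", "Time Class", "Energy Class"]].head())
```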

The graph plotted here demonstrates the repetitive pattern of the energy consumption, i.e., its seasonality.

To train an ANN for energy forecasting, historical data is typically divided into
training, validation, and testing sets. The ANN architecture, including the
number of layers, neurons, and activation functions, is chosen based on the
complexity of the problem and the available data. The model is then trained using
optimization algorithms such as gradient descent, with the objective of
minimizing prediction errors.
Once trained, the ANN can be used to generate forecasts for future energy
demand or generation. The accuracy of the forecasts can be evaluated using
performance metrics such as Mean Absolute Error (MAE), Mean Squared Error
(MSE), or Root Mean Squared Error (RMSE). Continuous monitoring and
periodic retraining of the ANN ensure that the forecasts remain accurate and
reliable over time.
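The report's own ANN could not be executed (see 3.1.4), and its implementation is not shown; the sketch below is therefore only an illustration of the described setup using scikit-learn's MLPClassifier on synthetic class-encoded data, where the Energy Class is a deterministic function of the Time Class purely to keep the example self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: encoded 'Time Class' (0-5) maps onto an
# 'Energy Class' (0-2); the real data's mapping is of course noisier.
rng = np.random.default_rng(2)
time_class = rng.integers(0, 6, 400)
energy_class = time_class // 2

X = time_class.reshape(-1, 1).astype(float)
y = energy_class

# 80/20 train/test split, as in the report's workflow.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale features before feeding them to the network.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# A small feed-forward network; the layer sizes are illustrative choices only.
ann = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print("test accuracy:", ann.score(X_test, y_test))
```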
Artificial Neural Networks offer a promising approach to energy forecasting, enabling stakeholders to make informed decisions, optimize resource utilization, and contribute to a more sustainable energy future. However, further research is needed to address challenges such as data quality, model interpretability, and uncertainty quantification to fully unlock the potential of ANNs in energy forecasting applications.

3.1.3 CODING IT OUT:
Breakdown of the steps to build and evaluate a machine learning model for
predicting occupancy based on energy usage and environmental data:

Step 1: Import Libraries


Import necessary Python libraries for data handling (pandas, numpy), data
visualization (matplotlib), and machine learning (scikit-learn).
Step 2: Load and Prepare Data
Load the dataset using pandas, and prepare the data by converting necessary
columns to appropriate formats and setting indices if required.
Handle missing values by replacing them with statistical replacements like
mean or median.
Step 3: Feature Engineering
Define the target variable (e.g., Total Energy usage) based on logical
conditions from the dataset.
Select relevant features that will be used to train the model.
Step 4: Data Encoding
Encoding the data into different numerical values to make it easier for the
Machine Learning model to understand the pattern (function) of the data as
already mentioned earlier.
Step 5: Data Splitting
Split the dataset into training and testing sets using train_test_split to provide
a basis for training and later evaluating the model.
Step 6: Model Building and Training
Construct the model using an algorithm suited to the task (here, an ANN for time-series forecasting).
Step 7: Forecasting and Evaluation
Forecasting energy consumption using the model on the test dataset.
Evaluate the model’s accuracy and performance using metrics like accuracy
score and classification report.
Step 8: Interpret Results
Analyze the outcomes to understand the model's effectiveness in predicting the
occupancy accurately.
Step 9: Iteration and Optimization
Refine the model by experimenting with different parameters, features, or
models to improve prediction accuracy.

3.1.4 MODEL ACCURACY:

Unfortunately, we were not able to execute our model because of unknown errors on every attempt. Troubleshooting the problem and searching for a solution online also bore no fruit.

3.1.5 APPROACH II: PROPHET

1. Step 1: Analyzing the Data: Analyzing the data using a box plot to understand the deviation of the values.

2. Step 2: Train/Test Split: Dividing the data into training and test sets.

3. Step 3: Prophet Model: Implementing the Prophet model.

4. Step 4: Accuracy: Evaluating the accuracy of the model.

3.1.5 APPROACH II: PROPHET

About Prophet: Let us dive deep into the details of the prophet model developed
by facebook.

Here's an overview of how Facebook's Prophet works and why it's a great choice
for energy forecasting in building analysis:

1. Modeling Approach:
- Prophet is based on an additive model where non-linear trends are fit with
yearly, weekly, and daily seasonality, plus holiday effects.
- It decomposes time series data into trend, seasonal, and holiday components,
which are then combined to generate forecasts.

2. Flexibility:
- Prophet offers flexibility in modeling different types of trends, including
linear and saturating growth trends.
- It automatically detects and incorporates various seasonal patterns, such as
yearly, weekly, and daily cycles, making it suitable for capturing energy
consumption patterns in buildings.

3. Handling of Holidays and Events:


- Prophet allows users to include custom events or holidays that may impact
energy consumption in buildings.
- By incorporating holiday effects, Prophet can accurately model changes in
energy usage during specific periods, such as vacations or public holidays.

4. Robustness to Missing Data:


- Prophet is designed to handle missing data and irregularly spaced time series
gracefully.
- This is particularly useful for energy forecasting in building analysis, where
data may be missing due to sensor failures or irregular monitoring intervals.

5. Uncertainty Estimation:
- Prophet provides uncertainty intervals (prediction intervals) around the
forecasted values, allowing users to assess the reliability of the forecasts.
- This is essential for energy forecasting in building analysis, where uncertainty
in predictions can impact decision-making regarding energy efficiency measures
and resource allocation.

6. Scalability:
- Prophet is scalable and can handle large datasets efficiently.
- It is implemented in Python and can be easily integrated into existing data
analysis workflows, making it accessible to researchers and practitioners in
building analysis.

7. Ease of Use:
- One of the key advantages of Prophet is its ease of use.
- It provides a simple and intuitive interface for fitting and making forecasts,
making it accessible to users with varying levels of expertise in time series
forecasting and machine learning.

Overall, Facebook's Prophet model is a great choice for energy forecasting in building analysis due to its flexibility, robustness, and ease of use. By accurately capturing trends, seasonality, and holiday effects in energy consumption data, Prophet enables building managers and energy analysts to make informed decisions and optimize energy usage effectively.
3.2.1 EXPLANATION OF THE OCCUPANCY PREDICTION
MODEL:

Objective: The model aims to predict whether a space is occupied based on energy usage patterns and environmental conditions.
Approach: It leverages historical data to learn patterns associated with occupancy and applies this understanding to new data to make predictions.
Machine Learning Workflow:
1. Splitting the Data: The dataset is divided into training and test sets, with 80%
used for training and 20% for testing.
2. Model Fitting: The Random Forest model is trained on the scaled training
data.
3. Prediction: The trained model predicts occupancy on the scaled test data.
4. Evaluation: Model accuracy is calculated by comparing predictions with actual
occupancy data in the test set.

3.2.2 SPECIFIC INFERENCES FROM OCCUPANCY PREDICTION:

1. Data Preprocessing:
Handling Missing Values: The code replaces zero values with NaN and then
fills these with the mean of the respective columns, ensuring the dataset is
complete for accurate modeling.
Feature Engineering: The Occupancy column is created by setting a threshold
on total energy. If total energy exceeds 0.5, the space is considered occupied
(1), otherwise not occupied (0). This binary classification approach simplifies
the analysis.
2. Feature Selection:
Relevant Features: The model uses energy consumption readings from various
sources (Computer kWatts, Plug Load kWatts, Air Conditioner-kWatts, light
+ fan - kWatts, total energy) and environmental factors (Indoor Temperature,
Outdoor Temperature) to predict occupancy.
Date-Time Index: The dataset is indexed by timestamp, ensuring time-based
analysis is feasible and accurate.
3. Model Training and Evaluation:
Random Forest Classifier: This machine learning model is suitable for
classification tasks and can handle non-linear data effectively. It uses an
ensemble of decision trees to make predictions, providing robustness against
overfitting.
Scaling: Features are scaled using StandardScaler to normalize the data,
which is crucial for many machine learning algorithms that are sensitive to the
scale of input features.
4. Model Performance: The model's performance is evaluated using accuracy and
a classification report, providing insights into precision, recall, and F1-score for
the occupancy detection.
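The occupancy pipeline described above can be sketched as follows, on synthetic data (the feature names follow the report; the readings are random stand-ins). Note that since the Occupancy label is derived from "total energy", which is also an input feature, near-perfect scores are expected by construction, which mirrors the report's own overfitting concern:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in readings for the merged dataset.
rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame({
    "total energy": rng.uniform(0.0, 2.0, n),
    "Indoor Temperature": rng.uniform(20, 30, n),
    "Outdoor Temperature": rng.uniform(10, 40, n),
})

# Feature engineering as described: occupied (1) when total energy > 0.5.
df["Occupancy"] = (df["total energy"] > 0.5).astype(int)

X = df[["total energy", "Indoor Temperature", "Outdoor Temperature"]]
y = df["Occupancy"]

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features with StandardScaler, as in the report.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Random Forest classifier for the binary occupancy task.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```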

Fig 3.2.2

Breakdown of the steps to build and evaluate a machine learning model for
predicting occupancy based on energy usage and environmental data:

Step 1: Import Libraries


Import necessary Python libraries for data handling (pandas, numpy), data
visualization (matplotlib), and machine learning (scikit-learn).
Step 2: Load and Prepare Data
Load the dataset using pandas, and prepare the data by converting necessary
columns to appropriate formats and setting indices if required.
Handle missing values by replacing them with statistical replacements like
mean or median.
Step 3: Feature Engineering
Define the target variable (e.g., Occupancy) based on logical conditions from
the dataset.
Select relevant features that will be used to train the model.
Step 4: Data Splitting
Split the dataset into training and testing sets using train_test_split to provide
a basis for training and later evaluating the model.
Step 5: Scale the Features
Normalize the feature data using StandardScaler to ensure that the model is
not skewed by the scale of the data.
Step 6: Model Building and Training
Construct the model using an algorithm suitable for classification (like
RandomForestClassifier).
Train the model on the normalized training data.
Step 7: Make Predictions and Evaluate
Predict occupancy using the model on the test dataset.
Evaluate the model’s accuracy and performance using metrics like accuracy
score and classification report.
Step 8: Interpret Results
Analyze the outcomes to understand the model's effectiveness in predicting the
occupancy accurately.
Step 9: Iteration and Optimization
Refine the model by experimenting with different parameters, features, or
models to improve prediction accuracy.

3.2.3 MODEL ACCURACY

Fig 3.2.3

1. Accuracy: The accuracy of the model is 1.0 or 100%, meaning every prediction
made by the model was correct.
2. Precision: The precision is 1.0 for both classes (0 and 1). This indicates that
every instance predicted as class 0 or 1 was truly that class. There were no
false positives.
3. Recall: The recall is also 1.0 for both classes, meaning that every instance of
class 0 and 1 was correctly identified by the model. There were no false
negatives.
4. F1-Score: The F1-score, which is a harmonic mean of precision and recall, is
1.0 for both classes, indicating perfect balance between precision and recall.
5. Support: This shows the number of actual occurrences of each class in the
dataset: 5994 for class 0 and 189 for class 1.
METHODOLOGY

4. MODEL VALIDATION

4.1 MODEL VALIDATION:

1. Data Integrity
Ensured by thorough data cleaning and formatting.
Missing values were systematically addressed by interpolation, particularly
for critical parameters like energy consumption.
2. Methodology for Analysis
Inclusion of different environmental and energy consumption variables to
ensure comprehensive analysis.
Time-series forecasting techniques and occupancy prediction models were
utilized.
3. Testing and Training Data
Data was split into 80% for training and 20% for testing, which is a
standard practice in machine learning to validate model performance.
4. Model Building
Employed multiple algorithms including Random Forest and Facebook's
Prophet for forecasting and classification tasks.
Random Forest was used for its ability to handle non-linear data
effectively.
Prophet was chosen for its robust handling of missing data, flexibility, and
ease of use in capturing and forecasting trends.
5. Performance Evaluation
Model performance was assessed through metrics like accuracy, precision,
recall, and F1-score.
Perfect scores (1.0) in these metrics suggest excellent model prediction
capabilities, although it might also indicate a need to check for overfitting.
6. Error Handling
Encountered issues with unknown errors were noted, indicating challenges
in model execution which need further troubleshooting.
7. Scalability and Robustness
Prophet's scalability and its capability to handle large datasets efficiently
were highlighted as beneficial for managing complex building analysis
data.
8. Real-World Applicability
The validation process includes considering the impact of real-world
factors like holidays and seasonality on energy consumption, which are
crucial for accurate forecasting in practical scenarios.
9. Future Enhancements
Continuous monitoring and periodic retraining are recommended to
maintain model accuracy.
Further research into challenges like data quality and model
interpretability is suggested to fully exploit the forecasting potential.
LIMITATIONS OF THE STUDY

Overfitting Concerns
The perfect scores (1.0 for accuracy, precision, recall, and F1-score) may
indicate overfitting, where the model performs exceptionally well on the
training data but may not generalize well to unseen data.
Overfitting reduces the model's predictive power in real-world applications.

Data Quality and Availability


The presence of missing values and the subsequent need for interpolation
suggest gaps in data collection processes.
Reliability of the forecasts could be compromised if data quality is not
consistently high.
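A minimal sketch of the interpolation approach, assuming five-minute IAQ sampling and pandas time-weighted interpolation; the CO2 values and timestamps are illustrative, not from the dataset.

```python
import numpy as np
import pandas as pd

# Five-minute IAQ samples with two dropped CO2 readings, as a stand-in
# for the gaps observed in the collected data.
idx = pd.date_range("2024-03-01 09:00", periods=6, freq="5min")
co2 = pd.Series([480.0, np.nan, 520.0, 540.0, np.nan, 600.0], index=idx)

# Time-weighted linear interpolation keeps the series continuous
# without inventing values far from the neighbouring samples.
filled = co2.interpolate(method="time")
print(filled)
```

Interpolation like this is only defensible for short gaps; long outages are better flagged and excluded than filled, since the filled values would carry no real information.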
Model Complexity
The use of multiple advanced algorithms like Random Forest and Prophet
requires significant computational resources and expertise, which may not be
feasible in all operational environments.
Complex models can also be difficult to interpret, making it harder for
stakeholders to understand model decisions.
Error Handling and Debugging
Errors encountered during model execution were not successfully resolved,
indicating potential issues in the code or the modeling process that could
affect deployment.
The absence of detailed troubleshooting records or error logs makes it
harder to pinpoint issues, delaying resolution and impacting model performance.
Scalability Challenges
While the Prophet model is noted for its scalability, integrating such models
into existing systems or workflows can be challenging, particularly in
environments with varying data scales or formats.
Real-World Factors
The model's ability to handle real-world complexities, such as fluctuations
in occupancy and environmental conditions, may be limited.
Seasonal and holiday effects are considered, but other factors, such as
unexpected events or sudden changes in building usage patterns, may not
be fully accounted for.
FUTURE SCOPE FOR IMPROVEMENT
Enhanced Data Collection and Quality
Implement more robust data collection systems to minimize gaps and
reduce the reliance on interpolation.
Regular audits and quality checks to ensure data integrity and reliability.
Advanced Error Handling Techniques
Develop a more comprehensive debugging and error logging system to
quickly identify and rectify issues in the modeling process.
Implement automated anomaly detection to flag potential data or model
inconsistencies.
Model Simplification and Interpretability
Explore simpler models that require fewer resources and are easier to
understand while maintaining reasonable accuracy.
Develop methods to improve the interpretability of complex models, such
as feature importance analysis or model visualization tools.
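One lightweight interpretability aid mentioned above is feature importance analysis. The sketch below reads impurity-based importances from a Random Forest fitted on synthetic stand-in data; the feature names and the fact that occupancy depends only on CO2 here are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
co2 = rng.normal(600, 150, n)
temp = rng.normal(25, 2, n)
labels = (co2 > 600).astype(int)   # occupancy driven by CO2 only, by design

X = np.column_stack([co2, temp])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# Impurity-based importances sum to 1; a higher value means the feature
# contributed more impurity reduction across the trees' splits.
for name, score in zip(["co2", "temperature"], model.feature_importances_):
    print(f"{name}: {score:.2f}")
```

Because the label is constructed from CO2 alone, the CO2 importance dominates, which is the kind of sanity check that helps stakeholders trust what the model is using.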
Cross-Validation and Generalization
Incorporate cross-validation techniques during model training to better
assess the model's ability to generalize to unseen data.
Test the model on diverse datasets from different geographic locations or
building types to validate its applicability and robustness.
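For time-series data, ordinary shuffled cross-validation leaks future information into the training folds; a rolling-origin scheme such as scikit-learn's TimeSeriesSplit avoids this by always training on the past and validating on the future, as the sketch below illustrates on a toy sequence.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 chronological observations; each fold trains on an expanding
# window of past data and tests on the block that follows it.
X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train up to t={train_idx[-1]}, "
          f"test t={test_idx[0]}..{test_idx[-1]}")
```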
Integration with Real-Time Data
Enhance the model's capability to incorporate real-time data for dynamic
forecasting and immediate occupancy detection.
Develop adaptive models that can update their parameters in response to
real-time feedback and changes in usage patterns.
Continuous Learning and Model Updates
Implement mechanisms for continuous learning where the model can learn
from new data over time without complete retraining.
Schedule periodic model re-evaluations and updates based on the latest
data and technological advancements.
Focus on Sustainability and Energy Efficiency
Expand the model's capabilities to include predictions related to energy
efficiency and sustainability metrics.
Collaborate with sustainability experts to align model outputs with energy
conservation goals and practices.
CONCLUSION
1. High Model Accuracy
The study achieved perfect model accuracy metrics (1.0 for accuracy,
precision, recall, and F1-score), indicating exceptional model performance
on the training data.
2. Effective Data Handling
Comprehensive data cleaning and formatting procedures were successfully
implemented, ensuring the integrity and usability of the dataset for
complex analyses.
Techniques such as interpolation were effectively used to manage missing
data, maintaining the continuity of data for time-series analysis.
3. Robust Modeling Techniques
The use of advanced modeling techniques like Random Forest and Prophet
proved effective in capturing complex patterns in energy usage and
occupancy.
Prophet’s ability to handle missing data, incorporate seasonal effects, and
forecast trends based on non-linear data was particularly beneficial.
4. Importance of AC in Energy Consumption
The study highlighted that air conditioning is a significant consumer of
energy within office environments, suggesting a potential target for energy
conservation efforts.
5. Predictive Capabilities for Occupancy
The model demonstrated strong predictive capabilities for determining
occupancy based on energy consumption and environmental data,
showcasing its potential for application in smart building management.
6. Handling of Seasonal and Holiday Effects
The inclusion of seasonal and holiday effects in the modeling process
helped in accurately forecasting energy consumption patterns, underlining
the importance of considering external factors in energy management.
7. Challenges with Overfitting
Despite perfect accuracy metrics, there is a concern regarding overfitting,
suggesting that the model might perform differently on unseen or new data
sets.
8. Potential for Real-Time Application
The study pointed towards the feasibility of implementing these models in
real-time systems for dynamic energy management and occupancy
detection, although this would require further developments in real-time
data integration and processing.
9. Scalability and Flexibility
The scalability of the models, especially Prophet, allows for their use in
larger or different types of buildings, providing flexibility for broader
applications in building analysis.
10. Future Research Directions
Continued research is needed to improve model interpretability, address
data quality issues, and refine error handling to enhance the robustness
and reliability of the forecasts.
REFERENCES
Books
"Pattern Recognition and Machine Learning" by Christopher M. Bishop:
Offers comprehensive insights into statistical techniques for machine
learning.
"Forecasting: Principles and Practice" by Rob J Hyndman and George
Athanasopoulos: Great for understanding time series forecasting
methods, particularly useful when working with the Prophet model.
Tools and Software Documentation
Scikit-Learn's official documentation (scikit-learn.org): Detailed
guides and tutorials on implementing machine learning algorithms.
Official documentation for Facebook Prophet: Includes
comprehensive guides, examples, and best practices for using
Prophet in time series forecasting.
Online Forums and Communities
Stack Overflow: A valuable resource for troubleshooting issues
encountered while using Python, R, or other data analysis tools.
Cross Validated (part of the Stack Exchange network): Good for
statistical questions and discussions, particularly about model
accuracy and selection.
YouTube Tutorials
StatQuest with Josh Starmer: Offers clear and concise statistics
tutorials, which can be helpful for understanding the data analysis
techniques used.