s3950476 TimeSeriesAnalysis Assignment 3
INTRODUCTION
Time series forecasting is a crucial aspect of data analysis and prediction in various fields,
ranging from finance and economics to weather forecasting and sales forecasting. It involves
analyzing and predicting patterns, trends, and future values based on historical data points
collected over a specific period. Time series forecasting techniques are utilized to make informed
decisions, plan resources, optimize operations, and anticipate future events. This article provides
an introduction and background to time series forecasting, exploring its significance, challenges,
and common techniques. We will delve into the fundamental concepts, methods, and tools
employed in this domain, as well as discuss its real-world applications. Understanding the
principles and approaches of time series forecasting will empower individuals and organizations
to harness the power of historical data and make accurate predictions (Natras et al., 2022).
BACKGROUND
Time series data refers to a collection of observations recorded at regular intervals over a
specified period. The data points in a time series possess an inherent chronological order, making
it distinct from cross-sectional or panel data. This time-based dimension enables the
identification of patterns, trends, and dependencies within the data, which can be leveraged for
forecasting. A time series is commonly described in terms of four components: trend,
cyclicality, seasonality and irregular fluctuations. The trend signifies the long-term movement in
the data. Seasonality refers to repeating patterns that occur within shorter time frames, such as daily,
weekly, or yearly cycles. Cyclicality denotes longer-term patterns that are not as regular as
seasonal patterns, often spanning multiple years. Lastly, irregular fluctuations, also known as
residual or random components, represent the unpredictable and random fluctuations in the data.
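As a rough illustration of these components, the sketch below builds a small synthetic series from a trend, a weekly seasonal pattern, and random noise, and then recovers the trend with a moving average. All values and names here (`series`, `trend_estimate`, the period of 7) are assumptions chosen for the example and are not taken from the dataset.

```python
import numpy as np

# Synthetic daily series (assumed values): 4 weeks with a weekly season.
rng = np.random.default_rng(0)
n, period = 28, 7
t = np.arange(n)
trend = 0.5 * t                                   # long-term movement
seasonal = 3.0 * np.sin(2 * np.pi * t / period)   # repeating weekly pattern
noise = rng.normal(scale=0.2, size=n)             # irregular fluctuations
series = trend + seasonal + noise

# Averaging over one full period cancels the seasonal pattern,
# leaving an estimate of the trend where the window fits entirely.
window = np.ones(period) / period
trend_estimate = np.convolve(series, window, mode="valid")

# What remains after removing the trend is seasonality plus noise.
half = period // 2
detrended = series[half:n - half] - trend_estimate
print(trend_estimate[:3].round(2), detrended[:3].round(2))
```

A moving average is only one simple way to separate the components; it is used here because it needs no additional libraries.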
Time series forecasting aims to model and predict future values based on historical data patterns.
Accurate forecasts enable businesses and organizations to anticipate demand, optimize inventory,
plan resources, and make informed decisions. Furthermore, time series forecasting plays a
crucial role in various domains, such as finance, economics, weather forecasting, and energy.
Forecasting time series data poses several challenges due to its inherent characteristics. One such
challenge is the presence of noise and outliers, which can distort the patterns and affect the
accuracy of predictions. Handling missing data is another challenge, as the absence of values at
certain time points can impact the continuity and reliability of the series. Moreover, time series
data often exhibit non-stationary behaviour, where the statistical properties change over time,
making it difficult to model using traditional methods. These challenges necessitate the
utilization of specialized techniques and algorithms designed for time series forecasting (Tan et
al., 2021).
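Two of the challenges above, missing values and non-stationarity, can be illustrated with a short pandas sketch. The series `load` and its values are made up for the example; time-based interpolation and first differencing are common generic remedies and are not necessarily the steps used in this assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with one missing reading and an upward drift.
idx = pd.date_range("2018-01-01", periods=6, freq=pd.Timedelta(hours=1))
load = pd.Series([100.0, 102.0, np.nan, 106.0, 108.0, 110.0], index=idx)

# Missing data: fill the gap by interpolating along the time axis.
filled = load.interpolate(method="time")

# Non-stationarity: first differences remove a linear drift in the level.
diffed = filled.diff().dropna()

print(filled.iloc[2])    # interpolated value for the missing hour
print(diffed.tolist())   # hour-to-hour changes
```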
A variety of time series forecasting techniques have been developed to tackle these challenges
and generate accurate predictions. These techniques can be broadly categorized into two main
approaches: statistical methods and machine learning methods. Statistical methods rely on
statistical models to capture the patterns and dependencies within the data. On the other hand,
machine learning methods, including random forests, support vector machines and neural
networks, leverage algorithms that learn from the data to make predictions. These methods often
require large amounts of data and perform well when dealing with complex patterns.
DATASET DESCRIPTION
The Hourly Energy Consumption dataset from Kaggle provides valuable insights into power
consumption patterns over 16 years (2002-2018). The dataset is sourced from PJM Interconnection LLC and
contains hourly power consumption data, measured in megawatts (MW). Its columns are described below.
Date: This column represents the date of the power consumption measurement, following the
YYYY-MM-DD format.
Time: The time component in the dataset signifies the hour, minute, and second at which the
measurement was taken.
Power Consumption: This column provides the hourly power consumption values in megawatts
(MW). These values serve as the target variable for time series forecasting.
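A minimal sketch of how columns like these could be combined into a single datetime index is shown below. The header names in `csv_text` and the three rows of values are illustrative assumptions, not records from the dataset; the actual file may use different column names.

```python
import io
import pandas as pd

# Illustrative rows only (assumed layout and values).
csv_text = """Date,Time,Power Consumption
2002-01-01,01:00:00,30000.0
2002-01-01,02:00:00,29500.0
2002-01-01,03:00:00,29000.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Combine the date and time strings into one timestamp per row.
df["Datetime"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%Y-%m-%d %H:%M:%S"
)

# Use the timestamp as the index so time-based features can be derived.
df = df.set_index("Datetime").drop(columns=["Date", "Time"]).sort_index()
print(df)
```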
The Hourly Energy Consumption dataset is particularly valuable for forecasting future power
consumption trends. Various time series forecasting methods can be applied to this dataset, such
as ARIMA, Exponential Smoothing, and Prophet models. By leveraging historical data and the
temporal patterns within the dataset, accurate predictions can be made about future power
consumption levels.
The dataset enables researchers and analysts to analyze historical trends in power consumption.
Plotting the data over time or applying statistical methods like regression analysis helps to
reveal these trends.
In conclusion, the Hourly Energy Consumption dataset is a valuable resource for researchers
and analysts studying power consumption trends. While it offers a large and comprehensive dataset
with organized information, users should be mindful of potential accuracy issues and the dataset's
limited coverage. Overall, this dataset serves as a valuable tool for gaining insights into power
consumption dynamics.
DESCRIPTIVE ANALYSIS
The provided code performs a descriptive analysis and time series forecasting using the Hourly
Energy Consumption dataset. Let's break down the analysis and highlight the key steps and
findings:
The code begins by importing the necessary libraries, such as pandas, numpy, and matplotlib.pyplot.
Basic exploration of the dataset is conducted, displaying the first few rows using the `head()`
function and plotting the hourly energy usage over the entire dataset using `df.plot()`.
Feature Engineering:
The code then proceeds with creating additional time series features based on the index of the
DataFrame. Features like hour, day of the week, quarter, month, year, day of the year, and week
of the year are derived from the index.
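The feature engineering step described here can be sketched roughly as follows. The function name `create_features`, the feature column names, and the single sample row are assumptions made for illustration; only the list of features and the `PJME_MW` column name come from the report.

```python
import pandas as pd

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar features derived from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    out["hour"] = idx.hour
    out["dayofweek"] = idx.dayofweek      # Monday = 0, Sunday = 6
    out["quarter"] = idx.quarter
    out["month"] = idx.month
    out["year"] = idx.year
    out["dayofyear"] = idx.dayofyear
    # ISO week number; to_numpy() avoids index alignment on assignment.
    out["weekofyear"] = idx.isocalendar().week.astype(int).to_numpy()
    return out

# One illustrative reading (the value is made up).
sample = pd.DataFrame(
    {"PJME_MW": [30000.0]},
    index=pd.DatetimeIndex(["2018-08-03 14:00:00"]),
)
features = create_features(sample)
print(features.T)
```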
Visualizations are then created to explore the energy usage trends by month and year.
Model Training:
Next, the features (X) and target variable (y) for the time series forecasting models are defined.
The features and target values (energy consumption) are used to train two models, the
XGBoost Regressor and the Random Forest Regressor.
Specific hyperparameters, such as the number of estimators, early stopping rounds, objective,
maximum depth, and learning rate, are used to train the XGBoost Regressor.
Different hyperparameters, including the number of estimators, the maximum depth, the
minimum samples per split, and the minimum samples per leaf, are used to train the Random Forest Regressor.
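The sketch below shows one way the Random Forest part of this step might look with scikit-learn. The synthetic data, the chronological 80/20 split, and every hyperparameter value are assumptions chosen to keep the example small and fast; they are not the settings used in the assignment. The XGBoost Regressor would be trained in an analogous way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly load with a daily cycle (assumed, not the real data).
idx = pd.date_range("2018-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
hour = idx.hour.to_numpy()
y = 30000 + 4000 * np.sin(2 * np.pi * hour / 24) + rng.normal(scale=200, size=len(idx))

# Feature matrix X built from the datetime index.
X = pd.DataFrame(
    {"hour": hour, "dayofweek": idx.dayofweek.to_numpy(), "month": idx.month.to_numpy()},
    index=idx,
)

# Chronological split: train on the earlier 80%, test on the rest.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y[:split], y[split:]

# Illustrative hyperparameter values.
rf = RandomForestRegressor(
    n_estimators=50,
    max_depth=10,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - rf_pred) ** 2)))
print(round(rmse, 1))
```

A chronological split is used rather than a random one so that the model is always evaluated on data that comes after its training period.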
Forecasting:
Next, the code generates forecasts for the next 10 months using both the XGBoost Regressor and
the Random Forest Regressor.
Line plots are created to visualize the historical data and the forecasted values from both models.
Model Evaluation:
The accuracy of the XGBoost Regressor and Random Forest Regressor models is evaluated,
and the accuracy scores for both models are displayed to compare their performance.
Based on the accuracy comparison, the Random Forest Regressor is selected for the final
forecasting.
Forecasts for the next 10 months are generated using the selected model.
The forecasts, including the year, month, and predicted energy consumption, are stored in the
DataFrame "forecast_data."
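A possible shape for this forecasting step is sketched below. The names `next_10_months_df` and `forecast_data` follow the report, but the helper `add_features`, the synthetic history, the assumed end date of the history, and the choice to average hourly predictions per month are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

HOUR = pd.Timedelta(hours=1)
FEATURES = ["hour", "dayofweek"]

def add_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Calendar features for each timestamp in index (hypothetical helper)."""
    return pd.DataFrame(
        {
            "hour": index.hour.to_numpy(),
            "dayofweek": index.dayofweek.to_numpy(),
            "month": index.month.to_numpy(),
            "year": index.year.to_numpy(),
        },
        index=index,
    )

# Synthetic one-month history (assumed to end on 2018-07-31 23:00).
hist_index = pd.date_range("2018-07-01", periods=31 * 24, freq=HOUR)
hist = add_features(hist_index)
rng = np.random.default_rng(0)
y_hist = 30000 + 4000 * np.sin(2 * np.pi * hist["hour"] / 24) + rng.normal(scale=200, size=len(hist))

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(hist[FEATURES], y_hist)

# Hourly timestamps covering the 10 calendar months after the history.
start = hist_index[-1] + HOUR
end = start + pd.DateOffset(months=10) - HOUR
next_10_months_df = add_features(pd.date_range(start, end, freq=HOUR))
next_10_months_df["prediction"] = model.predict(next_10_months_df[FEATURES])

# One row per (year, month) with the mean predicted consumption.
forecast_data = (
    next_10_months_df.groupby(["year", "month"], as_index=False)["prediction"].mean()
)
print(forecast_data)
```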
In summary, the analysis of the Hourly Energy Consumption dataset covers exploratory data analysis, feature
engineering, model training, and time series forecasting. The Random Forest Regressor model is
selected as the preferred model for predicting future energy consumption. The forecasted values
are stored and displayed for further analysis and decision-making processes.
Model Specification
We use the XGBoost Regressor and the Random Forest Regressor as our two machine learning
models. Based on historical data, these models try to forecast how much electricity will be used.
The time series features are used to construct the feature matrix "X". The "PJME_MW" column,
which denotes the amount of electricity used, serves as the target variable "y".
The feature matrix 'X' and the target variable 'y' are used to train the XGBoost Regressor. To
enhance the performance of the model, the hyperparameters of the XGBoost Regressor are
specified, including the number of estimators, early stopping rounds, maximum depth, and
learning rate. The mean squared error (MSE) is used to assess the model, and the accuracy is
shown.
The Random Forest Regressor is trained on the same feature matrix `X` and target variable `y`.
The hyperparameters of the Random Forest Regressor, including the number of estimators,
maximum depth, minimum samples split, and minimum samples leaf, are set to achieve better
accuracy. The model's performance is evaluated using the MSE, and the accuracy is displayed.
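The report does not state how the displayed accuracy is derived, so the sketch below computes the MSE together with one common choice, 100 minus the mean absolute percentage error (MAPE). That definition, the function names, and the three sample values are assumptions for illustration only.

```python
import numpy as np

def mse(y_true, y_pred) -> float:
    """Mean squared error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def accuracy_percent(y_true, y_pred) -> float:
    """Assumed accuracy definition: 100 minus MAPE (requires non-zero y_true)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return float(100 - mape)

# Made-up observed and predicted values.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

mse_value = mse(y_true, y_pred)
acc_value = accuracy_percent(y_true, y_pred)
print(mse_value, acc_value)
```

Note that MSE is expressed in squared units of the target, so it is not directly comparable to a percentage score; the two metrics answer different questions.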
Based on the accuracy comparison, the Random Forest Regressor is selected for forecasting the
electricity usage for the next 10 months. The Random Forest Regressor is utilized to predict the
electricity usage using the feature matrix `next_10_months_df`, which consists of the features for
the next 10 months. The predictions from both the XGBoost Regressor and the Random Forest
Regressor are plotted against the historical data using line plots. The plots visualize the
forecasted electricity usage and provide a comparison with the actual historical data.
Model Fitting
Two different machine learning models were fitted to the data: XGBoost Regressor and Random
Forest Regressor. The XGBoost Regressor has 600 decision trees, a maximum depth of 3, and a
learning rate of 0.01. The Random Forest Regressor has 1000 decision trees, a maximum depth
trained for 500 epochs. The XGBoost Regressor achieved an accuracy of 90%, while the
Image 1: Accuracy of the XGBoost Regressor and Random Forest Regressor.
RESULT ANALYSIS
The data cover energy consumption from 2002 to 2018 in PJM Interconnection LLC. Over these 16 years,
energy use has steadily increased. The daily energy consumption varies greatly, with an average
of about 100 megawatts (MW). Seasonal variations also exist, with winter seeing higher energy use.
The coldest months of the year are the winter ones, which last from December to March. Energy
use and the demand for heating are both at their peak during this time. The warmest months of
the year are the summer ones, which run from June to August. At this time, energy use and
cooling demand are both at their peak. Additionally, there is a slight increase in energy use in the
Residual analysis is the process of examining the discrepancies between observed data and the
values predicted by a model. This can be used to spot overfitting or underfitting issues.
The historical data and the forecasts from the XGBoost Regressor and Random Forest Regressor
models are shown in the images. The XGBoost Regressor model does not seem to fit
the data as well as the Random Forest Regressor model does. This is because the Random
Forest Regressor model has smaller residuals (differences between the observed data and the
predicted values).
This suggests that, compared to the XGBoost Regressor model, the Random Forest Regressor
model provides a better fit to the data.
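The comparison of residuals can be sketched as follows. The observed values and the predictions of the two hypothetical models (`pred_a`, `pred_b`) are invented for the example and do not reproduce the results reported here.

```python
import numpy as np

# Made-up observations and predictions from two hypothetical models.
observed = np.array([100.0, 120.0, 140.0, 160.0])
pred_a = np.array([90.0, 125.0, 150.0, 150.0])
pred_b = np.array([98.0, 121.0, 143.0, 158.0])

def residual_summary(y: np.ndarray, y_hat: np.ndarray) -> dict:
    """Summarize residuals (observed minus predicted)."""
    residuals = y - y_hat
    return {
        "mean": float(residuals.mean()),             # systematic bias
        "mean_abs": float(np.abs(residuals).mean()), # typical error size
    }

summary_a = residual_summary(observed, pred_a)
summary_b = residual_summary(observed, pred_b)

# The model with the smaller typical residual fits these points better.
better = "B" if summary_b["mean_abs"] < summary_a["mean_abs"] else "A"
print(summary_a, summary_b, better)
```

A mean residual far from zero would point to systematic over- or under-prediction, while the mean absolute residual indicates how large the errors typically are.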
Image 4: Historical data and the predictions made by the XGBoost Regressor model.
Image 5: Historical data and the predictions made by the Random Forest Regressor model.
Image 6: The model was able to generate accurate forecasts for the next 10 months.
The Random Forest Regressor model proved to be a highly effective tool for time series
forecasting, specifically in the context of predicting electricity usage in the PJM East Region.
The model demonstrated its capability to achieve high accuracy on the dataset, which is crucial
for forecasting electricity usage trends in the PJM East Region. These forecasts are derived from a combination
of historical data and current trends, leveraging the patterns observed in the dataset. By
incorporating relevant time series features and utilizing the Random Forest Regressor's
capabilities, the model can make reliable predictions for the upcoming months.
CONCLUSION
The analysis shows that time series forecasting, particularly using the Random Forest Regressor model, is an
effective tool for predicting electricity usage in the PJM East Region. The insights gained from
accurate forecasts can aid decision-making processes and provide valuable information for
resource planning and optimization. However, it is essential to be aware of the limitations.
REFERENCES
Natras, R., Soja, B., & Schmidt, M. (2022). Ensemble Machine Learning of Random Forest,
AdaBoost and XGBoost for Vertical Total Electron Content Forecasting. Remote
Sensing, 14(15), 3547.
Link: https://www.mdpi.com/2072-4292/14/15/3547
Meng, D., Xu, J., & Zhao, J. (2021). Analysis and prediction of hand, foot and mouth disease
incidence in China using Random Forest and XGBoost. Plos one, 16(12), e0261629.
Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0261629
Tan, C. W., Bergmeir, C., Petitjean, F., & Webb, G. I. (2021). Time series extrinsic
regression: Predicting numeric values from time series data. Data Mining and Knowledge
Discovery.