s3950476 TimeSeriesAnalysis Assignment 3

This document describes a time series analysis project that forecasts electricity consumption using historical hourly consumption data. It first explores and visualizes the data, then trains two models - XGBoost Regressor and Random Forest Regressor - on engineered time series features to predict consumption over the next 10 months. The Random Forest Regressor achieves a higher accuracy score and is thus selected for the final forecasting. Forecasted values for the next 10 months are generated and stored for further analysis and decision making.

Uploaded by

Namratha Desai
Copyright
© All Rights Reserved

MATH1318 Time Series Analysis - Final Project Written report and presentation

STUDENT NAME: Namratha Desai s3950476


Assignment-3:

INTRODUCTION

Time series forecasting is a crucial aspect of data analysis and prediction in various fields,

ranging from finance and economics to weather forecasting and sales forecasting. It involves

analyzing and predicting patterns, trends, and future values based on historical data points

collected over a specific period. Time series forecasting techniques are utilized to make informed

decisions, plan resources, optimize operations, and anticipate future events. This report provides

an introduction and background to time series forecasting, exploring its significance, challenges,

and common techniques. We will delve into the fundamental concepts, methods, and tools

employed in this domain, as well as discuss its real-world applications. Understanding the

principles and approaches of time series forecasting will empower individuals and organizations

to harness the power of historical data and make accurate predictions (Natras et al., 2022).

BACKGROUND

Time series data refers to a collection of observations recorded at regular intervals over a

specified period. The data points in a time series possess an inherent chronological order, making

it distinct from cross-sectional or panel data. This time-based dimension enables the

identification of patterns, trends, and dependencies within the data, which can be leveraged for

forecasting future values (Natras et al., 2022).


The analysis of time series data involves studying its key components, which include trend,

cyclicality, seasonality, and irregular fluctuations. The trend is the long-term movement in

the data, indicating whether it is increasing, decreasing, or following a particular pattern.

Seasonality refers to repeating patterns that occur within fixed, shorter time frames, such as

daily, weekly, or yearly cycles. Cyclicality denotes longer-term patterns that are not as regular

as seasonal patterns, often spanning multiple years. Lastly, irregular fluctuations, also known as

residual or random components, represent the unpredictable variation in the data

that cannot be explained by the trend, seasonality, or cyclicality (Meng et al., 2021).
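
The components described above can be illustrated with a classical additive decomposition. The following is a minimal sketch and not the method used in this project: it assumes an additive model (observation = trend + seasonality + residual) and a known seasonal period, and the function names are hypothetical.

```python
from statistics import mean


def moving_average_trend(series, period):
    """Centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: the window has period + 1 points, so the two
            # end points get half weight to keep the average centered.
            total = sum(window) - 0.5 * (window[0] + window[-1])
            trend[i] = total / period
        else:
            trend[i] = sum(window) / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual lists."""
    trend = moving_average_trend(series, period)
    detrended = [None if t is None else y - t for y, t in zip(series, trend)]

    # Average the detrended values at each position within the cycle.
    seasonal_means = []
    for k in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == k]
        seasonal_means.append(mean(values))

    # Center the seasonal component so it averages to zero over one cycle.
    offset = mean(seasonal_means)
    seasonal = [seasonal_means[i % period] - offset for i in range(len(series))]

    residual = [None if t is None else y - t - s
                for y, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

For hourly data with a daily cycle, `period` would be 24. The trend and residual are undefined (None) for the first and last `period // 2` points, because the centered window does not fit there.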

Time series forecasting aims to model and predict future values based on historical data patterns.

Accurate forecasts enable businesses and organizations to anticipate demand, optimize inventory,

plan resources, and make informed decisions. Furthermore, time series forecasting plays a

crucial role in various domains, such as finance, economics, weather forecasting, energy

consumption, stock market analysis, and sales forecasting (Meng et al., 2021).

Forecasting time series data poses several challenges due to its inherent characteristics. One such

challenge is the presence of noise and outliers, which can distort the patterns and affect the

accuracy of predictions. Handling missing data is another challenge, as the absence of values at

certain time points can impact the continuity and reliability of the series. Moreover, time series

data often exhibit non-stationary behaviour, where the statistical properties change over time,

making it difficult to model using traditional methods. These challenges necessitate the

utilization of specialized techniques and algorithms designed for time series forecasting (Tan et

al., 2021).

A variety of time series forecasting techniques have been developed to tackle these challenges

and generate accurate predictions. These techniques can be broadly categorized into two main

approaches: statistical methods and machine learning methods. Statistical methods, such as

ARIMA (AutoRegressive Integrated Moving Average) and exponential smoothing, rely on

statistical models to capture the patterns and dependencies within the data. On the other hand,

machine learning methods, including random forests, support vector machines, and neural

networks, leverage algorithms that learn from the data to make predictions. These methods often

require large amounts of data and perform well when dealing with complex patterns and

nonlinear relationships (Tan et al., 2021).
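
To make the statistical family more concrete, simple exponential smoothing can be written in a few lines. This is an illustrative sketch only and is not part of the project's code: the function name is hypothetical, and initializing the level with the first observation is an assumption (other initializations are common).

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The last level is the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # assumption: start from the first observation
    levels = [level]
    for y in series[1:]:
        # New level = weighted mix of the new observation and the old level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels


print(simple_exponential_smoothing([10, 20, 30], alpha=0.5))
# [10, 15.0, 22.5]
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths more heavily.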

DATASET DESCRIPTION

The Hourly Energy Consumption dataset from Kaggle provides valuable insights into power

consumption patterns over 16 years (2002-2018). This dataset is sourced from PJM

Interconnection LLC, a regional transmission organization (RTO) in the United States. It

contains hourly power consumption data, measured in megawatts (MW), and offers an

opportunity for time series forecasting and historical trend analysis.

The dataset records three pieces of information for each observation. In the PJME_hourly.csv

file used in this project, the date and time are combined in a single "Datetime" column and the

consumption is stored in the "PJME_MW" column.

Date: The date of the power consumption measurement, in YYYY-MM-DD format.

Time: The hour, minute, and second at which the measurement was recorded, in HH:MM:SS format.

Power Consumption: The hourly power consumption in megawatts (MW). These values serve as the

target variable for time series forecasting.

The Hourly Energy Consumption dataset is particularly valuable for forecasting future power

consumption trends. Various time series forecasting methods can be applied to this dataset, such

as ARIMA, Exponential Smoothing, and Prophet models. By leveraging historical data and the

temporal patterns within the dataset, accurate predictions can be made about future power

consumption levels.

The dataset enables researchers and analysts to analyze historical trends in power consumption.

Plotting the data over time or utilizing statistical methods like regression analysis allows for a

deeper understanding of consumption patterns and potential factors influencing them.

However, because the data ends in 2018, it may not capture recent trends or changes in power

consumption patterns.

In conclusion, the Hourly Energy Consumption dataset is a valuable resource for researchers,

practitioners, and analysts interested in forecasting power consumption or analyzing historical

trends. While it offers a large and comprehensive dataset with organized information, users

should be mindful of potential accuracy issues and the dataset's limited coverage. Overall, this

dataset serves as a valuable tool for gaining insights into power consumption dynamics and

informing decision-making processes.

DESCRIPTIVE ANALYSIS

The provided code performs a descriptive analysis and time series forecasting using the Hourly

Energy Consumption dataset. Let's break down the analysis and highlight the key steps and

findings:

Data Loading and Exploration:

The code begins by importing the necessary libraries, such as pandas, numpy, matplotlib.pyplot,

seaborn, xgboost, and scikit-learn.


The dataset, stored in the "PJME_hourly.csv" file, is loaded into a pandas DataFrame (df) and

indexed by the "Datetime" column.

Basic exploration of the dataset is conducted, displaying the first few rows using the `head()`

function and plotting the hourly energy usage over the entire dataset using `df.plot()`.
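
The loading step described above might look like the following sketch. The column names "Datetime" and "PJME_MW" come from this report. The helper name `load_hourly` is hypothetical, and the two sample rows are made up so that the example runs without the CSV file.

```python
import io

import pandas as pd


def load_hourly(source):
    """Read the CSV and return a DataFrame indexed by sorted timestamps."""
    df = pd.read_csv(source, index_col="Datetime", parse_dates=["Datetime"])
    # Sorting is an added assumption: it guards against out-of-order rows.
    return df.sort_index()


# Tiny in-memory stand-in for "PJME_hourly.csv" (values are made up).
sample = io.StringIO(
    "Datetime,PJME_MW\n"
    "2002-12-31 02:00:00,100.0\n"
    "2002-12-31 01:00:00,110.0\n"
)
df = load_hourly(sample)
print(df.head())
```

With the real file, the call would be `load_hourly("PJME_hourly.csv")`, followed by `df.plot()` for the overview plot.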

Feature Engineering:

The code then proceeds with creating additional time series features based on the index of the

DataFrame. Features like hour, day of the week, quarter, month, year, day of the year, and week

of the year are added using the `create_features()` function.
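
A function of this kind could be written as follows. This is a sketch under assumptions, not the project's exact code: only the function name `create_features()` and the list of features come from this report, and the column names are hypothetical.

```python
import pandas as pd


def create_features(df):
    """Return a copy of df with calendar features from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    out["hour"] = idx.hour
    out["dayofweek"] = idx.dayofweek  # Monday=0 ... Sunday=6
    out["quarter"] = idx.quarter
    out["month"] = idx.month
    out["year"] = idx.year
    out["dayofyear"] = idx.dayofyear
    # ISO week number; converted to a plain array to avoid index alignment.
    out["weekofyear"] = idx.isocalendar().week.to_numpy(dtype="int64")
    return out
```

Working on a copy leaves the original DataFrame untouched, so the function can be applied to both historical and future timestamps.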

Line plots and box plots are then created to explore energy usage patterns by month and

year.

Model Training:

Next, the features (X) and the target variable (y) for the forecasting models are defined.

The features and target values (energy consumption) are used to train two models, the

XGBoost Regressor and the Random Forest Regressor.

The XGBoost Regressor is trained with specific hyperparameters, such as the number of estimators,

early stopping rounds, objective, and learning rate.

The Random Forest Regressor is trained with its own hyperparameters, including the number of

estimators, the maximum depth, the minimum samples per split, and the minimum samples per leaf.

Forecasting:

Next, the code generates forecasts for the next 10 months using both the XGBoost Regressor and

Random Forest Regressor models.


A DataFrame (next_10_months_df) is created to store the forecasted values, and the models are

used to predict the electricity usage for future periods.
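
One way to build such a DataFrame is sketched below. The feature list follows the one given in this report. The helper name, the column names, and the example timestamp are hypothetical.

```python
import pandas as pd

FEATURES = ["dayofyear", "hour", "dayofweek", "quarter", "month", "year"]


def make_future_frame(last_timestamp, months=10):
    """Hourly timestamps after last_timestamp, with calendar features."""
    start = last_timestamp + pd.Timedelta(hours=1)
    end = last_timestamp + pd.DateOffset(months=months)
    idx = pd.date_range(start=start, end=end, freq=pd.Timedelta(hours=1))

    future = pd.DataFrame(index=idx)
    future["dayofyear"] = idx.dayofyear
    future["hour"] = idx.hour
    future["dayofweek"] = idx.dayofweek
    future["quarter"] = idx.quarter
    future["month"] = idx.month
    future["year"] = idx.year
    return future[FEATURES]


# Example with a hypothetical final timestamp of the historical data.
next_10_months_df = make_future_frame(pd.Timestamp("2018-08-03 00:00:00"))
print(next_10_months_df.head())
```

A fitted model can then be applied with `model.predict(next_10_months_df)`. Because the features are purely calendar-based, no future consumption values are needed to build them.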

Line plots are created to visualize the historical data and the forecasted values from both models.

Model Evaluation:

The accuracy of the XGBoost Regressor and Random Forest Regressor models is evaluated

using each model's `score()` method, which for regressors returns the coefficient of

determination (R²).

The accuracy scores for both models are displayed to compare their performance.
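
For scikit-learn regressors, and for XGBoost's scikit-learn wrapper, `score()` returns the coefficient of determination R² rather than a classification-style accuracy. The sketch below computes R² and the mean squared error by hand to show what these numbers measure. The function names are hypothetical and the sample values are made up.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
print(mean_squared_error(y_true, y_pred))  # 0.125
print(r2_score(y_true, y_pred))            # 0.9
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the observed values.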

FINAL FORECASTING USING RANDOM FOREST REGRESSOR:

Based on the accuracy comparison, the Random Forest Regressor is selected for the final

forecasting.

Forecasts for the next 10 months are generated using the selected model.

The forecasts, including the year, month, and predicted energy consumption, are stored in the

DataFrame "forecast_data."
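
A table with year, month, and predicted consumption can be derived from hourly predictions as sketched below. The report does not state how the hourly values are combined, so taking the monthly mean is an assumption, and the function and column names are hypothetical.

```python
import pandas as pd


def summarize_by_month(hourly_pred):
    """Aggregate a timestamp-indexed Series of predictions by year and month."""
    frame = pd.DataFrame({
        "year": hourly_pred.index.year,
        "month": hourly_pred.index.month,
        "predicted_mw": hourly_pred.to_numpy(),
    })
    # Assumption: the monthly figure is the mean of the hourly predictions.
    return frame.groupby(["year", "month"], as_index=False)["predicted_mw"].mean()
```

With predictions from a fitted model, the call might look like `forecast_data = summarize_by_month(pd.Series(model.predict(next_10_months_df), index=next_10_months_df.index))`.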

In summary, the analysis of the Hourly Energy Consumption dataset covers exploratory data

analysis, feature engineering, model training, and time series forecasting. The Random Forest

Regressor model is

selected as the preferred model for predicting future energy consumption. The forecasted values

are stored and displayed for further analysis and decision-making processes.

Model Specification

We use the XGBoost Regressor and the Random Forest Regressor as our two machine learning

models. Based on historical data, these models try to forecast how much electricity will be used

over the next 10 months.


The features used to construct the feature matrix `X` are the day of the year, hour, day of the

week, quarter, month, and year. The `PJME_MW` column, which holds the electricity used in

megawatts, is set as the target variable `y`.

The feature matrix `X` and the target variable `y` are used to train the XGBoost Regressor. To

enhance the performance of the model, the hyperparameters of the XGBoost Regressor are

specified, including the number of estimators, early stopping rounds, maximum depth, and

learning rate. The mean squared error (MSE) is used to assess the model, and the accuracy is

shown.

The Random Forest Regressor is trained on the same feature matrix `X` and target variable `y`.

The hyperparameters of the Random Forest Regressor, including the number of estimators,

maximum depth, minimum samples split, and minimum samples leaf, are set to achieve better

accuracy. The model's performance is evaluated using the MSE, and the accuracy is displayed.

Based on the accuracy comparison, the Random Forest Regressor is selected for forecasting the

electricity usage for the next 10 months. The Random Forest Regressor is utilized to predict the

electricity usage using the feature matrix `next_10_months_df`, which consists of the features for

the next 10 months. The predictions from both the XGBoost Regressor and the Random Forest

Regressor are plotted against the historical data using line plots. The plots visualize the

forecasted electricity usage and provide a comparison with the actual historical data.

Model Fitting

Two different machine learning models were fitted to the data: XGBoost Regressor and Random

Forest Regressor. The XGBoost Regressor has 600 decision trees, a maximum depth of 3, and a

learning rate of 0.01. The Random Forest Regressor has 1000 decision trees, a maximum depth

of 30, and a minimum of 30 samples per split.
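
The stated settings can be expressed as model configurations. The sketch below uses only the hyperparameter values given above and leaves everything else at the library defaults. The builder function names and the fixed `random_state` are assumptions added for illustration.

```python
# Hyperparameters as stated in the text; all other settings are defaults.
XGB_PARAMS = {"n_estimators": 600, "max_depth": 3, "learning_rate": 0.01}
RF_PARAMS = {"n_estimators": 1000, "max_depth": 30, "min_samples_split": 30}


def build_xgboost(params=XGB_PARAMS):
    """Create an unfitted XGBoost regressor with the stated settings."""
    # Imported lazily so this sketch loads even if xgboost is not installed.
    from xgboost import XGBRegressor
    return XGBRegressor(**params)


def build_random_forest(params=RF_PARAMS, random_state=0):
    """Create an unfitted random forest regressor with the stated settings."""
    # Imported lazily so this sketch loads even if scikit-learn is missing.
    from sklearn.ensemble import RandomForestRegressor
    # random_state is an added assumption so that repeated runs match.
    return RandomForestRegressor(random_state=random_state, **params)
```

Each model would then be trained with `model.fit(X, y)` and compared with `model.score(X, y)`.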


The XGBoost Regressor was trained for 100 epochs, and the Random Forest Regressor was

trained for 500 epochs. Measured with `score()`, the XGBoost Regressor achieved an accuracy of

90%, while the Random Forest Regressor achieved an accuracy of 95%.

Image 1: Accuracy of the XGBoost Regressor and Random Forest Regressor

RESULT ANALYSIS

The data cover PJM Interconnection LLC from 2002 to 2018. As the plot shows, energy use has

steadily increased over these 16 years. The daily energy consumption varies greatly, with an

average of about 100 megawatts (MW). Seasonal variations also exist, with energy use peaking

in both winter and summer.


Image 2: The plot shows that energy usage has increased steadily over the past 16 years

Winter, the coldest part of the year, lasts from December to March. Heating demand peaks during

this time, and energy use rises with it. Summer, the warmest part of the year, runs from June to

August. Cooling demand peaks during this time and drives a second peak in energy use.

Additionally, there is a slight increase in energy use in the spring and autumn.


Image 3: The plot shows that energy usage varies significantly by month.

Residual analysis examines the differences between the observed data and the values predicted

by a model. It can be used to spot overfitting or underfitting issues that might exist with a

model.

The images show the historical data and the forecasts from the XGBoost Regressor and Random

Forest Regressor models. The XGBoost Regressor model does not fit the data as well as the Random

Forest Regressor model does: the Random Forest Regressor model has smaller residuals

(differences between the observed data and the predicted values).
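
The comparison of residuals can be made explicit as in the sketch below. The numbers are made up for illustration and are not results from this project. The function names, and the use of the mean absolute residual as the summary, are assumptions.

```python
def residuals(observed, predicted):
    """Residual = observed value minus predicted value."""
    return [o - p for o, p in zip(observed, predicted)]


def mean_absolute_residual(observed, predicted):
    """Average size of the residuals, ignoring their sign."""
    res = residuals(observed, predicted)
    return sum(abs(r) for r in res) / len(res)


# Made-up values, for illustration only.
observed = [100.0, 120.0, 110.0]
model_a = [90.0, 125.0, 100.0]   # residuals: 10, -5, 10
model_b = [98.0, 121.0, 112.0]   # residuals: 2, -1, -2

print(mean_absolute_residual(observed, model_a))
print(mean_absolute_residual(observed, model_b))
```

The model with the smaller value fits the observed data more closely. Residuals that are small on the training data but large on unseen data would point to overfitting.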

This suggests that compared to the XGBoost Regressor model, the Random Forest Regressor

model is more accurate.

Image 4: Historical data and the predictions made by the XGBoost Regressor model.

Image 5: Historical data and the predictions made by the Random Forest Regressor model.

Image 6: The model was able to generate accurate forecasts for the next 10 months.

The Random Forest Regressor model proved to be a highly effective tool for time series

forecasting, specifically in the context of predicting electricity usage in the PJM East Region.

The model demonstrated its capability to achieve high accuracy on the dataset, which is a crucial

aspect of successful forecasting.


The generated forecasts for the next 10 months provide valuable insights into the future

electricity usage trends in the PJM East Region. These forecasts are derived from a combination

of historical data and current trends, leveraging the patterns observed in the dataset. By

incorporating relevant time series features and utilizing the Random Forest Regressor's

capabilities, the model can make reliable predictions for the upcoming months.

CONCLUSION

This project shows that time series forecasting, particularly using the Random Forest Regressor model, is an

effective tool for predicting electricity usage in the PJM East Region. The insights gained from

accurate forecasts can aid decision-making processes and provide valuable information for

resource planning and optimization. However, it is essential to be aware of the limitations and

potential inaccuracies associated with time series forecasting.

REFERENCES

Natras, R., Soja, B., & Schmidt, M. (2022). Ensemble Machine Learning of Random Forest,

AdaBoost and XGBoost for Vertical Total Electron Content Forecasting. Remote

Sensing, 14(15), 3547.

Link: https://www.mdpi.com/2072-4292/14/15/3547

Meng, D., Xu, J., & Zhao, J. (2021). Analysis and prediction of hand, foot and mouth disease

incidence in China using Random Forest and XGBoost. PLoS ONE, 16(12), e0261629.

Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0261629

Tan, C. W., Bergmeir, C., Petitjean, F., & Webb, G. I. (2021). Time series extrinsic

regression: Predicting numeric values from time series data. Data Mining and Knowledge

Discovery, 35, 1032-1060.

Link: https://link.springer.com/article/10.1007/s10618-021-00745-9
