tutorial-time-series-forecasting-with-xgboost

This document provides a tutorial on hourly time series forecasting using XGBoost with PJM's hourly energy consumption data from 2002-2018. It covers data preparation, feature creation, model training, and evaluation, including error metrics like RMSE and MAE. The tutorial also discusses the importance of certain features and suggests potential improvements, such as adding lag variables and holiday indicators.

Uploaded by

Teo Chee Kiat

February 5, 2025

1 Hourly Time Series Forecasting using XGBoost

If you haven’t already, first check out my previous notebook, which forecasts the same data using Prophet.
In this notebook we will walk through time series forecasting using XGBoost. The data we will be
using is hourly energy consumption.

[ ]: import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
plt.style.use('fivethirtyeight')

2 Data
The data we will be using is hourly power consumption data from PJM. Energy consumption has
some unique characteristics. It will be interesting to see how XGBoost picks them up.
We pull the PJM East data, which covers the entire east region from 2002 to 2018.

[ ]: pjme = pd.read_csv('../input/PJME_hourly.csv', index_col=[0], parse_dates=[0])

[ ]: color_pal = ["#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3",
             "#619CFF", "#DB72FB"]
_ = pjme.plot(style='.', figsize=(15,5), color=color_pal[0], title='PJM East')

3 Train/Test Split
Cut off the data after January 1, 2015 to use as our test set.

[ ]: split_date = '01-Jan-2015'
pjme_train = pjme.loc[pjme.index <= split_date].copy()
pjme_test = pjme.loc[pjme.index > split_date].copy()
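
As a quick sanity check on the boundary, the sketch below repeats the same `<=` / `>` logic on a small synthetic hourly index. The `toy` frame and its values are made up for illustration and are not PJM data; the point is only to show which side the midnight row of January 1, 2015 falls on.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series spanning the split date (illustrative values only).
idx = pd.date_range('2014-12-31 22:00', periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'PJME_MW': np.arange(6, dtype=float)}, index=idx)

split_date = '01-Jan-2015'
toy_train = toy.loc[toy.index <= split_date]
toy_test = toy.loc[toy.index > split_date]

# Because of `<=`, the 00:00 row of 1 Jan 2015 belongs to the training set,
# and every test timestamp is strictly later than every training timestamp.
print(toy_train.index.max(), toy_test.index.min())
```

Keeping the two sets strictly ordered in time is what stops future observations from leaking into training.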

[ ]: _ = pjme_test \
    .rename(columns={'PJME_MW': 'TEST SET'}) \
    .join(pjme_train.rename(columns={'PJME_MW': 'TRAINING SET'}), how='outer') \
    .plot(figsize=(15,5), title='PJM East', style='.')

4 Create Time Series Features

[ ]: def create_features(df, label=None):
    """
    Creates time series features from datetime index
    """
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)

    X = df[['hour','dayofweek','quarter','month','year',
            'dayofyear','dayofmonth','weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X

[ ]: X_train, y_train = create_features(pjme_train, label='PJME_MW')
X_test, y_test = create_features(pjme_test, label='PJME_MW')

5 Create XGBoost Model

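[ ]: # early_stopping_rounds is set on the estimator (xgboost 1.6 or later);
# recent xgboost releases no longer accept it as a fit() argument
reg = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=50)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=False) # Change verbose to True if you want to see it train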

5.1 Feature Importances

Feature importance is a great way to get a general idea of which features the model relies
on most to make its predictions. The metric plotted here simply counts how many times each
feature is split on.
We can see that the day of year was most commonly used to split trees, while hour and year came
in next. Quarter has low importance because it can be reproduced by different dayofyear splits.

[ ]: _ = plot_importance(reg, height=0.9)
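
To make the point about quarter concrete, here is a minimal sketch showing that quarter carries no information beyond month or day of year. It uses a synthetic daily index for 2015 (a non-leap year, chosen for illustration); the thresholds 90, 181 and 273 are the last days of March, June and September in such a year.

```python
import pandas as pd

# One timestamp per day of a non-leap year (synthetic index, for illustration).
days = pd.date_range('2015-01-01', '2015-12-31', freq='D')
feat = pd.DataFrame({'dayofyear': days.dayofyear,
                     'month': days.month,
                     'quarter': days.quarter})

# Quarter is a deterministic function of month ...
quarter_from_month = (feat['month'] - 1) // 3 + 1

# ... and, in a non-leap year, of three day-of-year thresholds,
# which is exactly the kind of rule a tree can learn from dayofyear splits.
quarter_from_doy = (1
                    + (feat['dayofyear'] > 90).astype(int)
                    + (feat['dayofyear'] > 181).astype(int)
                    + (feat['dayofyear'] > 273).astype(int))

print((quarter_from_month == feat['quarter']).all())
print((quarter_from_doy == feat['quarter']).all())
```

Since three splits on dayofyear already recover the quarter, the trees have little reason to split on quarter itself, which is consistent with its low split count.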

6 Forecast on Test Set

[ ]: pjme_test['MW_Prediction'] = reg.predict(X_test)
pjme_all = pd.concat([pjme_test, pjme_train], sort=False)

[ ]: _ = pjme_all[['PJME_MW','MW_Prediction']].plot(figsize=(15, 5))

7 Look at first month of predictions

[ ]: # Plot the forecast with the actuals
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
_ = pjme_all[['MW_Prediction','PJME_MW']].plot(ax=ax,
style=['-','.'])
ax.set_xbound(lower='01-01-2015', upper='02-01-2015')
ax.set_ylim(0, 60000)
plot = plt.suptitle('January 2015 Forecast vs Actuals')

[ ]: # Plot the forecast with the actuals

f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
_ = pjme_all[['MW_Prediction','PJME_MW']].plot(ax=ax,
style=['-','.'])
ax.set_xbound(lower='01-01-2015', upper='01-08-2015')
ax.set_ylim(0, 60000)
plot = plt.suptitle('First Week of January Forecast vs Actuals')

[ ]: f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
_ = pjme_all[['MW_Prediction','PJME_MW']].plot(ax=ax,
style=['-','.'])
ax.set_ylim(0, 60000)
ax.set_xbound(lower='07-01-2015', upper='07-08-2015')
plot = plt.suptitle('First Week of July Forecast vs Actuals')

8 Error Metrics On Test Set

Our MSE is 13780445 (an RMSE of about 3712)
Our MAE is 2848.89
Our MAPE is 8.9%

[ ]: mean_squared_error(y_true=pjme_test['PJME_MW'],
y_pred=pjme_test['MW_Prediction'])

[ ]: mean_absolute_error(y_true=pjme_test['PJME_MW'],
y_pred=pjme_test['MW_Prediction'])
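
Note that `mean_squared_error` returns the mean squared error, in squared units; the root mean squared error is its square root. The sketch below shows the relationship between the three quantities on a few made-up values (the arrays are illustrative and are not taken from the PJM data).

```python
import numpy as np

# Made-up actuals and predictions in MW (illustrative only).
y_true = np.array([30000.0, 32000.0, 31000.0, 29000.0])
y_pred = np.array([31000.0, 30000.0, 31000.0, 33000.0])

mse = np.mean((y_true - y_pred) ** 2)   # units: MW squared
rmse = np.sqrt(mse)                     # units: MW, comparable to MAE
mae = np.mean(np.abs(y_true - y_pred))  # units: MW

print(mse, rmse, mae)
```

Applied to the figure reported above, the square root of 13780445 is roughly 3712 MW, which is on the same scale as the MAE of 2848.89.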

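I like using mean absolute percentage error because it gives an easy-to-interpret percentage showing
how far off the predictions are. MAPE wasn’t included in older versions of sklearn (newer versions
provide sklearn.metrics.mean_absolute_percentage_error, which returns a fraction rather than a
percentage), so here we use a custom function.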

[ ]: def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

[ ]: mean_absolute_percentage_error(y_true=pjme_test['PJME_MW'],
y_pred=pjme_test['MW_Prediction'])

9 Look at Worst and Best Predicted Days

[ ]: pjme_test['error'] = pjme_test['PJME_MW'] - pjme_test['MW_Prediction']
pjme_test['abs_error'] = pjme_test['error'].abs()
# Select the numeric columns before averaging so the datetime 'date' column is skipped
error_by_day = pjme_test.groupby(['year','month','dayofmonth'])[
    ['PJME_MW','MW_Prediction','error','abs_error']].mean()

[ ]: # Over forecasted days

error_by_day.sort_values('error', ascending=True).head(10)

Notice anything about the over forecasted days?
• #1 worst day - July 4th, 2016 - is a holiday.
• #3 worst day - December 25, 2015 - Christmas.
• #5 worst day - July 4th, 2016 - is a holiday.
Looks like our model may benefit from adding a holiday indicator.
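
One possible way to build such an indicator is sketched below, using the US federal holiday calendar that ships with pandas. The helper name `add_holiday_flag` and the column name `is_holiday` are hypothetical, the demo frame is synthetic, and the federal calendar is only an assumption about which days matter for PJM load.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def add_holiday_flag(df):
    """Return a copy of df with an is_holiday column (1 on US federal holidays)."""
    out = df.copy()
    days = out.index.normalize()  # drop the time of day, keep the calendar date
    holidays = USFederalHolidayCalendar().holidays(start=days.min(), end=days.max())
    out['is_holiday'] = days.isin(holidays).astype(int)
    return out

# Tiny synthetic frame (illustrative values, not PJM data).
idx = pd.to_datetime(['2016-07-04 12:00', '2016-07-05 12:00', '2015-12-25 08:00'])
demo = add_holiday_flag(pd.DataFrame({'PJME_MW': [1.0, 2.0, 3.0]}, index=idx))
print(demo)
```

The resulting column could be appended to the feature list in `create_features`; whether it actually reduces the holiday errors would need to be checked by retraining.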

[ ]: # Worst absolute predicted days

error_by_day.sort_values('abs_error', ascending=False).head(10)

The best predicted days seem to include a lot of days in October (not many holidays and mild
weather), and also early May.

[ ]: # Best predicted days

error_by_day.sort_values('abs_error', ascending=True).head(10)

10 Plotting some best/worst predicted days

[ ]: f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(10)
_ = pjme_all[['MW_Prediction','PJME_MW']].plot(ax=ax,
                                              style=['-','.'])
ax.set_ylim(0, 60000)
ax.set_xbound(lower='08-13-2016', upper='08-14-2016')
plot = plt.suptitle('Aug 13, 2016 - Worst Predicted Day')

The best predicted day is pretty impressive. SPOT ON!

[ ]: f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(10)
_ = pjme_all[['MW_Prediction','PJME_MW']].plot(ax=ax,
style=['-','.'])
ax.set_ylim(0, 60000)
ax.set_xbound(lower='10-03-2016', upper='10-04-2016')
plot = plt.suptitle('Oct 3, 2016 - Best Predicted Day')

11 Up next?
• Add Lag variables
• Add holiday indicators.
• Add weather data source.
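
As a starting point for the first item, here is a minimal sketch of lag features built with `shift`. The helper name `add_lag_features`, the column names and the choice of 24-hour and 168-hour lags are assumptions for illustration, and the demo runs on a synthetic, gap-free hourly series rather than the PJM data.

```python
import numpy as np
import pandas as pd

def add_lag_features(df, target='PJME_MW', lags=(24, 168)):
    """Return a copy of df with lagged copies of the target (lags counted in rows)."""
    out = df.sort_index().copy()
    for lag in lags:
        out[f'lag_{lag}h'] = out[target].shift(lag)
    return out

# Synthetic, gap-free hourly series so that one row equals one hour.
idx = pd.date_range('2015-01-01', periods=200, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'PJME_MW': np.arange(200, dtype=float)}, index=idx)
lagged = add_lag_features(toy)
print(lagged.tail(3))
```

Two caveats apply. A row-based shift equals a time-based lag only if the index has no gaps or duplicate timestamps, so real data may need to be cleaned or reindexed first. A lag is also only usable at prediction time if it is at least as long as the forecast horizon. The leading rows have no history and are left as NaN, which XGBoost treats as missing values.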
