
035 Assignment PDF

The document discusses air quality data from Dar es Salaam, Tanzania. It explores the data from two sensor sites, cleans the data by localizing timestamps, removing outliers, resampling to hourly means, and imputing missing values. Graphs of the cleaned PM2.5 data are generated, including a time series plot, 7-day rolling average, ACF plot, and PACF plot. The cleaned data is then split into training and test sets for modeling.

Uploaded by

Tman Letswalo
Copyright
© All Rights Reserved

035-assignment

April 30, 2022

Air Quality in Dar es Salaam


[1]: import warnings

import wqet_grader

warnings.simplefilter(action="ignore", category=FutureWarning)
wqet_grader.init("Project 3 Assessment")

<IPython.core.display.HTML object>

[55]: # Import libraries here
import inspect
import time
import warnings
from pprint import PrettyPrinter

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
from IPython.display import VimeoVideo
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")

1 Prepare Data
1.1 Connect
Task 3.5.1: Connect to MongoDB server running at host "localhost" on port 27017. Then
connect to the "air-quality" database and assign the collection for Dar es Salaam to the variable
name dar.

[7]: client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
dar = db["dar-es-salaam"]

[8]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.1", [dar.name])

<IPython.core.display.HTML object>

1.2 Explore
Task 3.5.2: Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection.
Your submission should be a list of integers.
[11]: sites = dar.distinct("metadata.site")
sites

[11]: [11, 23]

[12]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.2", sites)

<IPython.core.display.HTML object>
Task 3.5.3: Determine which site in the Dar es Salaam collection has the most sensor readings
(of any type, not just PM2.5 readings). Your submission readings_per_site should be a list of
dictionaries that follows this format:
[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
Note that the values here are from the Nairobi collection, so your values will look different.
[73]: pp = PrettyPrinter(indent=2)

[74]: result = dar.aggregate(
    [
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
    ]
)

readings_per_site = list(result)
readings_per_site

[74]: [{'_id': 11, 'count': 138412}, {'_id': 23, 'count': 60020}]
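
For readers without a MongoDB instance at hand, the grouping used above can be mimicked in plain Python. This is only a sketch of what the `$group` stage computes, not of how MongoDB executes it. The `docs` list is made-up sample data and `count_by` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical documents shaped like the collection's records (values are made up).
docs = [
    {"metadata": {"site": 11, "measurement": "P2"}, "P2": 9.1},
    {"metadata": {"site": 11, "measurement": "P1"}, "P1": 12.4},
    {"metadata": {"site": 23, "measurement": "P2"}, "P2": 7.8},
]


def count_by(documents, field):
    """Count documents per value of metadata[field], like a $group with a count."""
    counts = Counter(doc["metadata"][field] for doc in documents)
    return [{"_id": key, "count": n} for key, n in sorted(counts.items())]


# Readings per site.
per_site = count_by(docs, "site")

# Filtering first plays the role of a $match stage on site 11.
site_11_docs = [d for d in docs if d["metadata"]["site"] == 11]
per_measurement = count_by(site_11_docs, "measurement")
```

The site with the largest count is the one to query in the wrangle step.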

[75]: result = dar.aggregate(
    [
        {"$match": {"metadata.site": 11}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))

[ {'_id': 'temperature', 'count': 17283},
  {'_id': 'humidity', 'count': 17283},
  {'_id': 'P2', 'count': 51923},
  {'_id': 'P1', 'count': 51923}]

[76]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.3", readings_per_site)

---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [76], in <cell line: 1>()
----> 1 wqet_grader.grade("Project 3 Assessment", "Task 3.5.3", readings_per_site)

File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:180, in grade(assessment_id, question_id, submission)
175 def grade(assessment_id, question_id, submission):
176 submission_object = {
177 'type': 'simple',
178 'argument': [submission]
179 }
--> 180 return show_score(grade_submission(assessment_id, question_id, submission_object))

File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:145, in grade_submission(assessment_id, question_id, submission_object)
143 raise Exception('Grader raised error: {}'.format(error['message']))
144 else:
--> 145 raise Exception('Could not grade submission: {}'.format(error['message']))
146 result = envelope['data']['result']
148 # Used only in testing

Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!

1.3 Import
Task 3.5.4: (5 points) Create a wrangle function that will extract the PM2.5 readings from the
site that has the most total readings in the Dar es Salaam collection. Your function should do the
following steps:
1. Localize reading time stamps to the timezone for "Africa/Dar_es_Salaam".
2. Remove all outlier PM2.5 readings that are above 100.
3. Resample the data to provide the mean PM2.5 reading for each hour.
4. Impute any missing values using the forward-fill method.
5. Return a Series y.
[32]: def wrangle(collection):
    # Query PM2.5 ("P2") readings from site 11, the site with the most readings
    results = collection.find(
        {"metadata.site": 11, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    # Read results into DataFrame
    df = pd.DataFrame(list(results)).set_index("timestamp")

    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")

    # Remove outliers
    df = df[df["P2"] < 100]

    # Resample to hourly means and forward-fill missing values
    y = df["P2"].resample("1H").mean().ffill()

    return y

Use your wrangle function to query the dar collection and return your cleaned results.

[33]: y = wrangle(dar)
y.head()

[33]: timestamp
2018-01-01 03:00:00+03:00 9.456327
2018-01-01 04:00:00+03:00 9.400833
2018-01-01 05:00:00+03:00 9.331458
2018-01-01 06:00:00+03:00 9.528776
2018-01-01 07:00:00+03:00 8.861250
Freq: H, Name: P2, dtype: float64
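
To make the cleaning steps concrete without a database or pandas, here is a minimal standard-library sketch of the same pipeline: shift UTC timestamps to local time, drop outliers, average per hour, and forward-fill empty hours. It is an illustration under stated assumptions, not the notebook's implementation: `hourly_mean_ffill` is a hypothetical name, the sample readings are made up, and a fixed UTC+3 offset stands in for the "Africa/Dar_es_Salaam" zone.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+3 offset stands in for "Africa/Dar_es_Salaam".
EAT = timezone(timedelta(hours=3))


def hourly_mean_ffill(readings, cutoff=100):
    """readings: iterable of (naive UTC datetime, PM2.5 value) pairs."""
    buckets = defaultdict(list)
    for ts, value in readings:
        if value >= cutoff:  # mirrors the filter df["P2"] < 100
            continue
        local = ts.replace(tzinfo=timezone.utc).astimezone(EAT)
        hour = local.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    if not buckets:
        return {}

    series, previous = {}, None
    hour, last = min(buckets), max(buckets)
    while hour <= last:
        if hour in buckets:
            previous = sum(buckets[hour]) / len(buckets[hour])
        series[hour] = previous  # an empty hour reuses the last known mean
        hour += timedelta(hours=1)
    return series


# Made-up sample: two readings in one hour, one outlier, then a one-hour gap.
sample = [
    (datetime(2018, 1, 1, 0, 5), 9.0),
    (datetime(2018, 1, 1, 0, 35), 11.0),
    (datetime(2018, 1, 1, 0, 50), 250.0),
    (datetime(2018, 1, 1, 2, 10), 8.0),
]
cleaned = hourly_mean_ffill(sample)
```

In the sample, the hour with no readings inherits the mean of the hour before it, which is what forward-filling means.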

[34]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.4", wrangle(dar))

<IPython.core.display.HTML object>

1.4 Explore Some More


Task 3.5.5: Create a time series plot of the readings in y. Label your x-axis "Date" and your
y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels".
[35]: fig, ax = plt.subplots(figsize=(15, 6))
y.plot(xlabel="Date", ylabel="PM2.5 Level", title="Dar es Salaam PM2.5 Levels", ax=ax)

# Don't delete the code below
plt.savefig("images/3-5-5.png", dpi=150)

[36]: with open("images/3-5-5.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.5", file)

<IPython.core.display.HTML object>
Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number
of hours in a week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar
es Salaam PM2.5 Levels, 7-Day Rolling Average".

[38]: fig, ax = plt.subplots(figsize=(15, 6))
y.rolling(168).mean().plot(
    xlabel="Date",
    ylabel="PM2.5 Level",
    title="Dar es Salaam PM2.5 Levels, 7-Day Rolling Average",
    ax=ax,
)

# Don't delete the code below
plt.savefig("images/3-5-6.png", dpi=150)
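
As a rough illustration of what a rolling mean computes, the sketch below averages each full window of consecutive values and leaves the positions before the first full window empty. `rolling_mean` is a hypothetical helper written for this example; it uses a window of 3 rather than 168 so the numbers are easy to check by hand.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean of the last `window` values; None until a full window is available."""
    out, buf, total = [], deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that left the window
        out.append(total / window if len(buf) == window else None)
    return out


smoothed = rolling_mean([1, 2, 3, 4, 5], window=3)
```

With a window of 168 hourly values, the first 167 positions would be empty in the same way, which is why the smoothed curve starts about a week into the data.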

[39]: with open("images/3-5-6.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.6", file)

<IPython.core.display.HTML object>
Task 3.5.7: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]"
and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings,
ACF".
[40]: fig, ax = plt.subplots(figsize=(15, 6))

plot_acf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, ACF")
# Don't delete the code below
plt.savefig("images/3-5-7.png", dpi=150)
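
The quantity on the y-axis of an ACF plot is the correlation between the series and a copy of itself shifted by a given lag. The sketch below uses the common sample estimator (lagged products of deviations divided by the total sum of squared deviations). Whether this matches the plotting function's default settings exactly is an assumption; `autocorr` is a hypothetical helper.

```python
def autocorr(values, lag):
    """Sample autocorrelation of `values` at the given non-negative lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf_values = [autocorr(series, k) for k in range(3)]
```

Lag 0 is always 1 because a series is perfectly correlated with itself. Slowly decaying values at higher lags suggest that past readings carry information about future ones.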

[41]: with open("images/3-5-7.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.7", file)

<IPython.core.display.HTML object>
Task 3.5.8: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]"
and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings,
PACF".
[42]: fig, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, PACF")
# Don't delete the code below
plt.savefig("images/3-5-8.png", dpi=150)
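
Partial autocorrelation measures the correlation at a lag after the influence of shorter lags has been removed. One textbook way to obtain it from autocorrelations is the Durbin-Levinson recursion, sketched below. This is not necessarily the method the plotting function uses by default, and `pacf_from_acf` is a hypothetical helper. The input is the idealized autocorrelation of an AR(1) process with coefficient 0.5.

```python
def pacf_from_acf(r):
    """Partial autocorrelations for lags 0..K from autocorrelations r[0..K] (r[0] == 1)."""
    pacf = [1.0]
    prev = []  # coefficients of the order k-1 fit
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Idealized AR(1) autocorrelations: r_k = 0.5 ** k
partial = pacf_from_acf([1.0, 0.5, 0.25, 0.125])
```

For an AR(1) process the partial autocorrelation drops to zero after lag 1, which is why a PACF plot is a common guide for choosing the number of lags p.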

[43]: with open("images/3-5-8.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.8", file)

<IPython.core.display.HTML object>

1.5 Split
Task 3.5.9: Split y into training and test sets. The first 90% of the data should be in your training
set. The remaining 10% should be in the test set.
[44]: cutoff_test = int(len(y) * 0.90)
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]

print("y_train shape:", y_train.shape)


print("y_test shape:", y_test.shape)

y_train shape: (1533,)


y_test shape: (171,)
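
The split is chronological rather than random: the model trains on the earliest observations and is tested on the most recent ones, so no future information leaks into training. A minimal sketch of the same cutoff arithmetic on a plain list follows; `chronological_split` is a hypothetical helper, and the list simply has the same length as the hourly series (1533 + 171).

```python
def chronological_split(values, train_fraction=0.9):
    """Split an ordered sequence into an early training part and a late test part."""
    cutoff = int(len(values) * train_fraction)
    return values[:cutoff], values[cutoff:]


hours = list(range(1704))  # stand-in for 1704 hourly observations
train, test = chronological_split(hours)
```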

[45]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.9a", y_train)

<IPython.core.display.HTML object>

[47]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.9b", y_test)

<IPython.core.display.HTML object>

2 Build Model
2.1 Baseline
Task 3.5.10: Establish the baseline mean absolute error for your model.
[49]: y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean P2 Reading:", y_train_mean)


print("Baseline MAE:", mae_baseline)

Mean P2 Reading: 8.617582545265433


Baseline MAE: 4.07658759405218

[50]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.10", [mae_baseline])

<IPython.core.display.HTML object>

2.2 Iterate
Task 3.5.11: You’re going to use an AR model to predict PM2.5 readings, but which hyperparameter
settings will give you the best performance? Use a for loop to train your AR model
using settings for p from 1 to 30. Each time you train a new model, calculate its mean absolute
error and append the result to the list maes. Then store your results in the Series mae_series.
[56]: p_params = range(1, 31)
maes = []
for p in p_params:
    # Train model
    model = AutoReg(y_train, lags=p).fit()
    # Generate in-sample predictions (the first p have no lagged inputs, so drop them)
    y_pred = model.predict().dropna()
    # Calculate training MAE
    mae = mean_absolute_error(y_train.iloc[p:], y_pred)
    print(f"Training MAE{p}-{mae}")
    maes.append(mae)

mae_series = pd.Series(maes, name="mae", index=p_params)
mae_series.head()

Training MAE1-0.9478883593419718
Training MAE2-0.9338936356035513
Training MAE3-0.9208499415535056
Training MAE4-0.9201530053544462
Training MAE5-0.9195193686406402
Training MAE6-0.9189138317023042
Training MAE7-0.916923142065859
Training MAE8-0.9170429247796957
Training MAE9-0.9171924197194339
Training MAE10-0.9187110436140872
Training MAE11-0.9159620710795833
Training MAE12-0.916340014046864
Training MAE13-0.9170330944012818
Training MAE14-0.9164179092924236
Training MAE15-0.9166361277977625
Training MAE16-0.9171373921454925
Training MAE17-0.917409118681427
Training MAE18-0.9187234696542569
Training MAE19-0.918293884445819
Training MAE20-0.9172220704171751
Training MAE21-0.9158723154884857
Training MAE22-0.9158006763004195
Training MAE23-0.9129308629695768
Training MAE24-0.9116939089047236
Training MAE25-0.9075634252567448
Training MAE26-0.9073326154152601
Training MAE27-0.9073145077062825
Training MAE28-0.9067760080656363
Training MAE29-0.9080260984622734
Training MAE30-0.9138332570235556

[56]: 1 0.947888
2 0.933894
3 0.920850
4 0.920153
5 0.919519
Name: mae, dtype: float64
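
To show what fitting an autoregressive model involves, the sketch below fits the simplest case, p = 1, by ordinary least squares: each value is regressed on the value one step before it. It is a hand-rolled illustration rather than the library's implementation. `fit_ar1` and `training_mae` are hypothetical helpers, and the series is generated from a known rule so the fit can be checked.

```python
def fit_ar1(values):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var
    return mean_y - slope * mean_x, slope


def training_mae(values, intercept, slope):
    """In-sample MAE; the first value has no lagged input, so it is skipped."""
    preds = [intercept + slope * v for v in values[:-1]]
    return sum(abs(a - p) for a, p in zip(values[1:], preds)) / len(preds)


# Series generated by y[t] = 1 + 0.5 * y[t-1], starting from 0.
series = [0.0]
for _ in range(11):
    series.append(1 + 0.5 * series[-1])

intercept, slope = fit_ar1(series)
mae = training_mae(series, intercept, slope)
```

The same skip-the-first-p idea explains why the loop above compares predictions with the training data from position p onward.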

[57]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.11", mae_series)

<IPython.core.display.HTML object>
Task 3.5.12: Look through the results in mae_series and determine what value for p provides
the best performance. Then build and train best_model using the best hyperparameter value.
Note: Make sure that you build and train your model in one line of code, and that the data type
of best_model is statsmodels.tsa.ar_model.AutoRegResultsWrapper.

[58]: best_p = mae_series.idxmin()
best_model = AutoReg(y_train, lags=best_p).fit()

[59]: wqet_grader.grade(
    "Project 3 Assessment", "Task 3.5.12", [isinstance(best_model.model, AutoReg)]
)

<IPython.core.display.HTML object>
Task 3.5.13: Calculate the training residuals for best_model and assign the result to
y_train_resid. Note that the name of your Series should be "residuals".
[60]: y_train_resid = best_model.resid
y_train_resid.name = "residuals"
y_train_resid.head()

[60]: timestamp
2018-01-02 07:00:00+03:00 1.732488
2018-01-02 08:00:00+03:00 -0.381568
2018-01-02 09:00:00+03:00 -0.560971
2018-01-02 10:00:00+03:00 -2.215760
2018-01-02 11:00:00+03:00 0.006468
Freq: H, Name: residuals, dtype: float64

[61]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.13", y_train_resid.tail(1500))

<IPython.core.display.HTML object>
Task 3.5.14: Create a histogram of y_train_resid. Be sure to label the x-axis as "Residuals"
and the y-axis as "Frequency". Use the title "Best Model, Training Residuals".

[62]: # Plot histogram of residuals
plt.hist(y_train_resid)
plt.ylabel("Frequency")
plt.xlabel("Residuals")
plt.title("Best Model, Training Residuals")
# Don't delete the code below
plt.savefig("images/3-5-14.png", dpi=150)

[63]: with open("images/3-5-14.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.14", file)

<IPython.core.display.HTML object>
Task 3.5.15: Create an ACF plot for y_train_resid. Be sure to label the x-axis as "Lag
[hours]" and y-axis as "Correlation Coefficient". Use the title "Dar es Salaam, Training
Residuals ACF".
[64]: fig, ax = plt.subplots(figsize=(15, 6))
plot_acf(y_train_resid, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam, Training Residuals ACF")
# Don't delete the code below
plt.savefig("images/3-5-15.png", dpi=150)

[65]: with open("images/3-5-15.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.15", file)

<IPython.core.display.HTML object>

2.3 Evaluate
Task 3.5.16: Perform walk-forward validation for your model for the entire test set y_test.
Store your model’s predictions in the Series y_pred_wfv. Make sure the name of your Series is
"prediction" and the name of your Series index is "timestamp".
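[68]: y_pred_wfv = pd.Series(dtype="float64")
history = y_train.copy()
for i in range(len(y_test)):
    # Refit on all data seen so far, using the best lag order found above (p=28)
    model = AutoReg(history, lags=28).fit()
    # Forecast one step ahead
    next_pred = model.forecast()
    # pd.concat replaces Series.append, which was removed in pandas 2.0
    y_pred_wfv = pd.concat([y_pred_wfv, next_pred])
    # Add the true observation for that hour to the history
    history = pd.concat([history, y_test[next_pred.index]])

y_pred_wfv.name = "prediction"
y_pred_wfv.index.name = "timestamp"
y_pred_wfv.head()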

[68]: timestamp
2018-03-06 00:00:00+03:00 8.056391
2018-03-06 01:00:00+03:00 8.681779
2018-03-06 02:00:00+03:00 6.268951
2018-03-06 03:00:00+03:00 6.303760
2018-03-06 04:00:00+03:00 7.171444
Freq: H, Name: prediction, dtype: float64
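
Walk-forward validation can be summarized as: predict one step, reveal the true value, add it to the history, and repeat. The sketch below shows that loop in isolation. To stay dependency-free it uses a deliberately simple stand-in forecaster (the mean of the history) in place of refitting an AR model. `walk_forward` and `mean_forecaster` are hypothetical names and the numbers are made up.

```python
def mean_forecaster(history):
    """Stand-in model: forecast the next value as the mean of everything seen so far."""
    return sum(history) / len(history)


def walk_forward(train, test, forecast_next):
    """One-step-ahead predictions for each test point, growing the history as we go."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(forecast_next(history))  # predict before seeing the truth
        history.append(actual)                      # then reveal the true value
    return predictions


preds = walk_forward([1.0, 2.0, 3.0], [4.0, 5.0], mean_forecaster)
```

Each prediction only uses data that would have been available at that moment, which is what makes the resulting test error a fair estimate of one-step-ahead performance.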

[69]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.16", y_pred_wfv)

<IPython.core.display.HTML object>
Task 3.5.17: Submit your walk-forward validation predictions to the grader to see test mean
absolute error for your model.
[70]: wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)

<IPython.core.display.HTML object>

3 Communicate Results
Task 3.5.18: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don’t
forget the index). Then plot df_pred_test using plotly express. Be sure to label the x-axis as
"Date" and the y-axis as "PM2.5 Level". Use the title "Dar es Salaam, WFV Predictions".
[71]: df_pred_test = pd.DataFrame({"y_test": y_test, "y_pred_wfv": y_pred_wfv})
fig = px.line(df_pred_test, labels={"value": "PM2.5"})
fig.update_layout(
    title="Dar es Salaam, WFV Predictions",
    xaxis_title="Date",
    yaxis_title="PM2.5 Level",
)
# Don't delete the code below
fig.write_image("images/3-5-18.png", scale=1, height=500, width=700)

fig.show()

[72]: with open("images/3-5-18.png", "rb") as file:
    wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)

---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [72], in <cell line: 1>()
1 with open("images/3-5-18.png", "rb") as file:
----> 2 wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)

File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:180, in grade(assessment_id, question_id, submission)
175 def grade(assessment_id, question_id, submission):
176 submission_object = {
177 'type': 'simple',
178 'argument': [submission]
179 }
--> 180 return show_score(grade_submission(assessment_id, question_id, submission_object))

File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:145, in grade_submission(assessment_id, question_id, submission_object)
143 raise Exception('Grader raised error: {}'.format(error['message']))
144 else:
--> 145 raise Exception('Could not grade submission: {}'.format(error['message']))
146 result = envelope['data']['result']
148 # Used only in testing

Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!

Copyright © 2022 WorldQuant University. This content is licensed solely for personal use.
Redistribution or publication of this material is strictly prohibited.
