045 Assignment PDF
import warnings

import wqet_grader

warnings.simplefilter(action="ignore", category=FutureWarning)
wqet_grader.init("Project 4 Assessment")
1 Prepare Data
1.1 Connect
Run the cell below to connect to the nepal.sqlite database.
[4]: %load_ext sql
%sql sqlite:////home/jovyan/nepal.sqlite
Task 4.5.1: What districts are represented in the id_map table? Determine the unique values in
the district_id column.
[5]: %%sql
SELECT distinct(district_id)
FROM id_map
* sqlite:////home/jovyan/nepal.sqlite
Done.
What’s the district ID for Kavrepalanchok? From the lessons, you already know that Gorkha is 4;
from the textbook, you know that Ramechhap is 2. Of the remaining districts, Kavrepalanchok is
the one with the largest number of observations in the id_map table.
Task 4.5.2: Calculate the number of observations in the id_map table associated with district 1.
[7]: %%sql
SELECT count(*)
FROM id_map
WHERE district_id = 1
* sqlite:////home/jovyan/nepal.sqlite
Done.
[7]: [(36112,)]
Task 4.5.3: Calculate the number of observations in the id_map table associated with district 3.
[9]: %%sql
SELECT count(*)
FROM id_map
WHERE district_id = 3
* sqlite:////home/jovyan/nepal.sqlite
Done.
[9]: [(82684,)]
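Counting one district at a time works, but the same comparison can be made in a single query. The sketch below uses Python's built-in `sqlite3` module and an in-memory toy table; the rows are invented for illustration and only the `id_map` and `district_id` names come from the tasks above.

```python
import sqlite3

# Toy stand-in for the id_map table (rows are invented, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (building_id INTEGER, district_id INTEGER)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 3), (4, 3), (5, 3), (6, 4)],
)

# One pass over the table: observations per district, largest first.
counts = conn.execute(
    """
    SELECT district_id, count(*) AS n_obs
    FROM id_map
    GROUP BY district_id
    ORDER BY n_obs DESC
    """
).fetchall()
conn.close()

print(counts)  # [(3, 3), (1, 2), (4, 1)]
```

With the real database, the first row of such a result would identify the district with the most observations.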
Task 4.5.4: Join the unique building IDs from Kavrepalanchok in id_map, all the columns from
building_structure, and the damage_grade column from building_damage. Make sure
you rename the building_id column in id_map as b_id and limit your results to the first five rows
of the new table.
[11]: %%sql
SELECT distinct(i.building_id) AS b_id,
s.*,
d.damage_grade
FROM id_map AS i
JOIN building_structure AS s ON i.building_id = s.building_id
JOIN building_damage AS d ON i.building_id = d.building_id
WHERE district_id = 3
LIMIT 5
* sqlite:////home/jovyan/nepal.sqlite
Done.
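One detail of this query is easy to misread: `DISTINCT` is a modifier on the whole `SELECT`, so `distinct(i.building_id)` removes duplicate rows rather than deduplicating a single column, and the parentheses are only grouping. The following sketch shows this on three tiny in-memory tables. The rows, the reduced column sets, and the `"Grade N"` values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE id_map (building_id INTEGER, district_id INTEGER);
    CREATE TABLE building_structure (building_id INTEGER, roof_type TEXT);
    CREATE TABLE building_damage (building_id INTEGER, damage_grade TEXT);

    -- Building 10 appears twice in id_map (invented data).
    INSERT INTO id_map VALUES (10, 3), (10, 3), (11, 3), (12, 4);
    INSERT INTO building_structure VALUES
        (10, 'RCC/RB/RBC'), (11, 'Bamboo/Timber-Light roof'), (12, 'RCC/RB/RBC');
    INSERT INTO building_damage VALUES
        (10, 'Grade 2'), (11, 'Grade 5'), (12, 'Grade 1');
    """
)

query = """
    SELECT DISTINCT i.building_id AS b_id, s.*, d.damage_grade
    FROM id_map AS i
    JOIN building_structure AS s ON i.building_id = s.building_id
    JOIN building_damage AS d ON i.building_id = d.building_id
    WHERE i.district_id = 3
    ORDER BY b_id
"""
rows = conn.execute(query).fetchall()
conn.close()

# Each row holds b_id, then every column of building_structure
# (including its own building_id), then damage_grade.
for row in rows:
    print(row)
```

Because `s.*` brings in its own `building_id`, the result carries the ID twice, which is one reason a cleaning step would later drop that column.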
1.2 Import
Task 4.5.5: Write a wrangle function that will use the query you created in the previous task to
create a DataFrame. In addition, your function should:
1. Create a "severe_damage" column, where all buildings with a damage grade greater than 3
should be encoded as 1. All other buildings should be encoded as 0.
2. Drop any columns that could cause issues with leakage or multicollinearity in your model.
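[13]: # Build your `wrangle` function here
def wrangle(db_path):
    # Connect to database
    conn = sqlite3.connect(db_path)
    # Construct query
    query = """
        SELECT distinct(i.building_id) AS b_id,
            s.*,
            d.damage_grade
        FROM id_map AS i
        JOIN building_structure AS s ON i.building_id = s.building_id
        JOIN building_damage AS d ON i.building_id = d.building_id
        WHERE district_id = 3
    """
    # Read query results into DataFrame
    df = pd.read_sql(query, conn, index_col="b_id")
    conn.close()
    # Identify leaky columns (recorded after the earthquake)
    drop_cols = [col for col in df.columns if "post_eq" in col]
    # Add the redundant ID column brought in by `s.*`
    drop_cols.append("building_id")
    # Create binary target (assumes grades are stored as "Grade 1" to "Grade 5")
    df["damage_grade"] = df["damage_grade"].str[-1].astype(int)
    df["severe_damage"] = (df["damage_grade"] > 3).astype(int)
    # Drop the original target to avoid leakage
    drop_cols.append("damage_grade")
    # Drop multicollinear column (assumed: floor count tracks building height)
    drop_cols.append("count_floors_pre_eq")
    # Drop columns
    df.drop(columns=drop_cols, inplace=True)
    return df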
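The encoding rule from item 1 of the task can be sketched in plain Python. This assumes the grades are stored as strings such as `"Grade 4"`, which is an assumption about the data rather than something shown in this notebook; `encode_severe_damage` is a hypothetical helper name.

```python
def encode_severe_damage(damage_grade: str) -> int:
    """Return 1 for a damage grade above 3, otherwise 0."""
    grade = int(damage_grade.split()[-1])  # "Grade 4" -> 4
    return int(grade > 3)


grades = ["Grade 1", "Grade 3", "Grade 4", "Grade 5"]
severe = [encode_severe_damage(g) for g in grades]
print(severe)  # [0, 0, 1, 1]
```

Note that grade 3 itself maps to 0, since the task asks for grades strictly greater than 3.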
Use your wrangle function to query the database at "/home/jovyan/nepal.sqlite" and return
your cleaned results.
[14]: df = wrangle("/home/jovyan/nepal.sqlite")
df.head()
land_surface_condition foundation_type \
b_id
87473 Flat Mud mortar-Stone/Brick
87479 Flat Mud mortar-Stone/Brick
87482 Flat Mud mortar-Stone/Brick
87491 Flat Mud mortar-Stone/Brick
87496 Flat Mud mortar-Stone/Brick
[15]: wqet_grader.grade(
"Project 4 Assessment", "Task 4.5.5", wrangle("/home/jovyan/nepal.sqlite")
)
1.3 Explore
Task 4.5.6: Are the classes in this dataset balanced? Create a bar chart with the normalized value
counts from the "severe_damage" column. Be sure to label the x-axis "Severe Damage" and the
y-axis "Relative Frequency". Use the title "Kavrepalanchok, Class Balance".
[16]: # Plot value counts of `"severe_damage"`
df["severe_damage"].value_counts(normalize=True).plot(
    kind="bar",
    xlabel="Severe Damage",
    ylabel="Relative Frequency",
    title="Kavrepalanchok, Class Balance",
);
# Don't delete the code below
plt.savefig("images/4-5-6.png", dpi=150)
Task 4.5.7: Is there a relationship between the footprint size of a building and the damage it
sustained in the earthquake? Use seaborn to create a boxplot that shows the distributions of the
"plinth_area_sq_ft" column for both groups in the "severe_damage" column. Label your
x-axis "Severe Damage" and y-axis "Plinth Area [sq. ft.]". Use the title "Kavrepalanchok,
Plinth Area vs Building Damage".
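# Create boxplot of plinth area for each damage class
sns.boxplot(x="severe_damage", y="plinth_area_sq_ft", data=df)
# Label axes
plt.xlabel("Severe Damage")
plt.ylabel("Plinth Area [sq. ft.]")
plt.title("Kavrepalanchok, Plinth Area vs Building Damage");
# Don't delete the code below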
plt.savefig("images/4-5-7.png", dpi=150)
Task 4.5.8: Are buildings with certain roof types more likely to suffer severe damage? Create a
pivot table of df where the index is "roof_type" and the values come from the "severe_damage"
column, aggregated by the mean.
[20]: roof_pivot = pd.pivot_table(
    df, index="roof_type", values="severe_damage", aggfunc=np.mean
).sort_values(by="severe_damage")
roof_pivot
[20]: severe_damage
roof_type
RCC/RB/RBC 0.040715
Bamboo/Timber-Heavy roof 0.569477
Bamboo/Timber-Light roof 0.604842
[21]: wqet_grader.grade("Project 4 Assessment", "Task 4.5.8", roof_pivot)
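Because "severe_damage" is a 0/1 column, its mean within each roof type is simply the share of severely damaged buildings in that group. A plain-Python sketch of that aggregation follows; the records are invented and do not reproduce the figures above.

```python
from collections import defaultdict

# Invented (roof_type, severe_damage) pairs, not taken from the dataset.
records = [
    ("RCC/RB/RBC", 0),
    ("RCC/RB/RBC", 0),
    ("Bamboo/Timber-Light roof", 1),
    ("Bamboo/Timber-Light roof", 0),
    ("Bamboo/Timber-Heavy roof", 1),
]

totals = defaultdict(lambda: [0, 0])  # roof_type -> [sum of flags, row count]
for roof_type, severe in records:
    totals[roof_type][0] += severe
    totals[roof_type][1] += 1

# Mean of a 0/1 flag = fraction of the group that is flagged.
roof_means = {roof: flagged / n for roof, (flagged, n) in totals.items()}

# Sort from the lowest to the highest share, as the pivot table does.
roof_sorted = sorted(roof_means.items(), key=lambda item: item[1])
print(roof_sorted)
```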
1.4 Split
Task 4.5.9: Create your feature matrix X and target vector y. Your target is "severe_damage".
target = "severe_damage"
X = df.drop(columns=target)
y = df[target]
Task 4.5.10: Divide your dataset into training and validation sets using a randomized split. Your
validation set should be 20% of your data.
[25]: X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
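To make the idea of a randomized 80/20 split concrete, here is a minimal plain-Python sketch. It is not scikit-learn's implementation and will not reproduce the same partition as `train_test_split`; `random_split` is a hypothetical helper written only for illustration.

```python
import random


def random_split(rows, val_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and hold out `val_fraction` for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


rows = list(range(10))
train, val = random_split(rows)
print(len(train), len(val))  # 8 2
```

Fixing the seed plays the same role as `random_state=42`: rerunning the cell yields the same partition.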
2 Build Model
2.1 Baseline
Task 4.5.11: Calculate the baseline accuracy score for your model.
[27]: acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
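The baseline here is the accuracy of a model that always predicts the majority class, which is why the largest normalized value count is used. A small sketch with invented labels, using only the standard library:

```python
from collections import Counter

labels = [1, 0, 1, 1, 0]  # invented training labels

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always guessing the majority class is right this fraction of the time.
baseline = majority_count / len(labels)
print(majority_class, round(baseline, 2))  # 1 0.6
```

Any trained model should beat this number on the validation set to be worth keeping.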
2.2 Iterate
Task 4.5.12: Create a model model_lr that uses logistic regression to predict building damage.
Be sure to include an appropriate encoder for categorical features.
[29]: model_lr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LogisticRegression(max_iter=3000)
)
# Fit model to training data
model_lr.fit(X_train, y_train)
[29]: Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
use_cat_names=True)),
('logisticregression', LogisticRegression(max_iter=3000))])
Task 4.5.13: Calculate training and validation accuracy score for model_lr.
[31]: lr_train_acc = accuracy_score(y_train, model_lr.predict(X_train))
lr_val_acc = model_lr.score(X_val, y_val)
[32]: submission = [lr_train_acc, lr_val_acc]
wqet_grader.grade("Project 4 Assessment", "Task 4.5.13", submission)
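Both `accuracy_score` and a classifier's `score` method report the same quantity: the fraction of predictions that match the true labels. A plain-Python sketch of that calculation, with invented labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)  # 0.75
```

Comparing this number on the training and validation sets gives a quick check for overfitting: a large gap suggests the model has memorized the training data.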
Task 4.5.14: Perhaps a decision tree model will perform better than logistic regression, but what’s
the best hyperparameter value for max_depth? Create a for loop to train and evaluate the model
model_dt at all depths from 1 to 15. Be sure to use an appropriate encoder for your model, and
to record its training and validation accuracy scores at every depth. The grader will evaluate your
validation accuracy scores only.
[33]: depth_hyperparams = range(1, 16)
training_acc = []
validation_acc = []
for d in depth_hyperparams:
    # Build and fit a decision tree limited to depth `d`
    model_dt = make_pipeline(
        OrdinalEncoder(),
        DecisionTreeClassifier(max_depth=d, random_state=42)
    )
    model_dt.fit(X_train, y_train)
    # Calculate training accuracy score and append to `training_acc`
    training_acc.append(model_dt.score(X_train, y_train))
    # Calculate validation accuracy score and append to `validation_acc`
    validation_acc.append(model_dt.score(X_val, y_val))
Task 4.5.15: Using the values in training_acc and validation_acc, plot the validation curve
for model_dt. Label your x-axis "Max Depth" and your y-axis "Accuracy Score". Use the title
"Validation Curve, Decision Tree Model", and include a legend.
[35]: plt.plot(depth_hyperparams, training_acc, label="training")
plt.plot(depth_hyperparams, validation_acc, label="validation")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy Score")
plt.title("Validation Curve, Decision Tree Model")
plt.legend()
# Don't delete the code below
plt.savefig("images/4-5-15.png", dpi=150)
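The best depth can also be read off the recorded scores directly instead of from the plot. The sketch below uses invented validation scores; `depths` and `val_scores` are illustrative stand-ins for `depth_hyperparams` and `validation_acc`.

```python
depths = range(1, 6)
val_scores = [0.61, 0.64, 0.66, 0.65, 0.66]  # invented validation accuracies

# max() returns the first maximal pair, so ties go to the shallower tree.
best_depth, best_score = max(zip(depths, val_scores), key=lambda pair: pair[1])
print(best_depth, best_score)  # 3 0.66
```

Preferring the shallowest depth among ties is a reasonable default, since a simpler tree is less likely to overfit.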
Task 4.5.16: Build and train a new decision tree model final_model_dt, using the value for
max_depth that yielded the best validation accuracy score in your plot above.
[37]: final_model_dt = make_pipeline(
    OrdinalEncoder(),
    DecisionTreeClassifier(max_depth=10, random_state=42)
)
# Fit model to training data
final_model_dt.fit(X_train, y_train)
[37]: Pipeline(steps=[('ordinalencoder',
OrdinalEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
mapping=[{'col': 'land_surface_condition',
'data_type': dtype('O'),
'mapping': Flat 1
Moderate slope 2
Steep slope 3
NaN -2
dtype: int64},
{'col': 'foundation_type',
'dat…
Building with Central Courtyard 9
H-shape 10
NaN -2
dtype: int64},
{'col': 'superstructure',
'data_type': dtype('O'),
'mapping': Stone, mud mortar 1
Adobe/mud 2
Brick, cement mortar 3
RC, engineered 4
Brick, mud mortar 5
Stone, cement mortar 6
RC, non-engineered 7
Timber 8
Other 9
Bamboo 10
Stone 11
NaN -2
dtype: int64}])),
('decisiontreeclassifier',
DecisionTreeClassifier(max_depth=10, random_state=42))])
2.3 Evaluate
Task 4.5.17: How does your model perform on the test set? First, read the CSV
file "data/kavrepalanchok-test-features.csv" into the DataFrame X_test. Next, use
final_model_dt to generate a list of test predictions y_test_pred. Finally, submit your test
predictions to the grader to see how your model performs.
Tip: Make sure the order of the columns in X_test is the same as in your X_train. Otherwise, it
could hurt your model’s performance.
[39]: X_test = pd.read_csv("data/kavrepalanchok-test-features.csv", index_col="b_id")
y_test_pred = final_model_dt.predict(X_test)
y_test_pred[:5]
[39]: array([1, 1, 1, 1, 0])
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [40], in <cell line: 2>()
1 submission = pd.Series(y_test_pred)
----> 2 wqet_grader.grade("Project 4 Assessment", "Task 4.5.17", submission)
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
3 Communicate Results
Task 4.5.18: What are the most important features for final_model_dt? Create a Series of Gini
importances, feat_imp, where the index labels are the feature names for your dataset and the
values are the feature importances for your model. Be sure that the Series is sorted from smallest
to largest feature importance.
[41]: features = X_train.columns
importances = final_model_dt.named_steps["decisiontreeclassifier"].feature_importances_
feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.head()
[41]: plan_configuration 0.004189
land_surface_condition 0.008599
foundation_type 0.009967
position 0.011795
ground_floor_type 0.013521
dtype: float64
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [42], in <cell line: 1>()
----> 1 wqet_grader.grade("Project 4 Assessment", "Task 4.5.18", feat_imp)
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
Task 4.5.19: Create a horizontal bar chart of feat_imp. Label your x-axis "Gini
Importance" and your y-axis "Label". Use the title "Kavrepalanchok Decision Tree, Feature
Importance".
Do you see any relationship between this plot and the exploratory data analysis you did regarding
roof type?
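[43]: # Create horizontal bar chart of feature importances
feat_imp.plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Features")
plt.title("Kavrepalanchok Decision Tree, Feature Importance")
# Don't delete the code below
plt.savefig("images/4-5-19.png", dpi=150)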
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [44], in <cell line: 1>()
1 with open("images/4-5-19.png", "rb") as file:
----> 2 wqet_grader.grade("Project 4 Assessment", "Task 4.5.19", file)
--> 180 return show_score(grade_submission(assessment_id, question_id, submission_object))
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
Copyright © 2022 WorldQuant University. This content is licensed solely for personal use.
Redistribution or publication of this material is strictly prohibited.