Random Forest Model
May 1, 2023
1.1 Introduction
As we’re learning, random forests are popular statistical learning algorithms. Some of their primary
benefits include reducing variance, bias, and the chance of overfitting.
Here, we will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Our random forest model will be used to predict whether a customer will be satisfied with their flight experience.
Note: we first perform exploratory data analysis, data cleaning, and other manipulations to prepare the data for modeling.
We import the relevant Python libraries and modules, including the numpy and pandas libraries for data processing, the pickle package to save the model, and the sklearn library, containing:
• The module ensemble, which has the function RandomForestClassifier
• The module model_selection, which has the functions train_test_split, PredefinedSplit, and GridSearchCV
• The module metrics, which has the functions f1_score, precision_score, recall_score, and accuracy_score
[1]: # Import `numpy`, `pandas`, `pickle`, and `sklearn`.
     # Import the relevant functions from `sklearn.ensemble`, `sklearn.model_selection`, and `sklearn.metrics`.
     import numpy as np
     import pandas as pd
     import pickle

     from sklearn.ensemble import RandomForestClassifier
     from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
     from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
As shown in this cell, the dataset has been automatically loaded in for us, so we do not need to download the .csv file.
[2]: # RUN THIS CELL TO IMPORT YOUR DATA.
     air_data = pd.read_csv("Invistico_Airline.csv")
     air_data.head(10)
Food and drink Gate location … Online support Ease of Online booking \
0 0 2 … 2 3
1 0 3 … 2 3
2 0 3 … 2 2
3 0 3 … 3 1
4 0 3 … 4 2
5 0 3 … 2 2
6 0 3 … 5 5
7 0 3 … 2 2
8 0 3 … 5 4
9 0 3 … 2 2
Now, we will display the variable names and their data types.
[4]: # Display variable names and types.
air_data.dtypes
[4]: satisfaction object
Customer Type object
Age int64
Type of Travel object
Class object
Flight Distance int64
Seat comfort int64
Departure/Arrival time convenient int64
Food and drink int64
Gate location int64
Inflight wifi service int64
Inflight entertainment int64
Online support int64
Ease of Online booking int64
On-board service int64
Leg room service int64
Baggage handling int64
Checkin service int64
Cleanliness int64
Online boarding int64
Departure Delay in Minutes int64
Arrival Delay in Minutes float64
dtype: object
Question: What do you observe about the differences in data types among the variables included
in the data?
Next, to understand the size of the dataset, identify the number of rows and the number of columns.
[5]: # Identify the number of rows and the number of columns.
air_data.shape
Now, we check for missing values in the rows of the data. We start with .isna() to get Booleans indicating whether each value in the data is missing, then use .any(axis=1) to get Booleans indicating whether each row contains any missing values, and finally use .sum() to count the number of rows that contain missing values.
[6]: # Count the rows that contain missing values.
     air_data.isna().any(axis=1).sum()

[6]: 393
We drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, we save the resulting pandas DataFrame in a variable named air_data_subset.
[8]: # Drop missing values.
     # Save the DataFrame in variable `air_data_subset`.
     air_data_subset = air_data.dropna(axis=0)
     air_data_subset.head(10)
Food and drink Gate location … Online support Ease of Online booking \
0 0 2 … 2 3
1 0 3 … 2 3
2 0 3 … 2 2
3 0 3 … 3 1
4 0 3 … 4 2
5 0 3 … 2 2
6 0 3 … 5 5
7 0 3 … 2 2
8 0 3 … 5 4
9 0 3 … 2 2
… (additional columns of the same 10-row preview omitted)
[10]: # Convert categorical features to one-hot encoded features.
      air_data_subset_dummies = pd.get_dummies(air_data_subset,
                                               columns=['Customer Type', 'Type of Travel', 'Class'])
Online boarding Departure Delay in Minutes Arrival Delay in Minutes \
0 2 0 0.0
1 2 310 305.0
2 2 0 0.0
3 3 0 0.0
4 5 0 0.0
5 2 0 0.0
6 3 17 15.0
7 2 0 0.0
8 4 0 0.0
9 2 30 26.0
[10 rows x 26 columns]
Question: What changes do you observe after converting the string data to dummy variables?
The first step to building the model is separating the labels (y) from the features (X).
[13]: # Separate the dataset into labels (y) and features (X).
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)
Once separated, we split the data into train, validate, and test sets.
[14]: # Separate into train, validate, test sets.
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
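The cell above produces only the train/test split; the separate validation set used below is typically carved out of the training data. A minimal sketch, assuming the illustrative variable names X_tr, X_val, y_tr, and y_val:

    # Split the training data again to obtain a separate validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)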
Now, we fit and tune a random forest model with a separate validation set. We begin by determining a set of hyperparameters for tuning the model using GridSearchCV.
[15]: # Determine set of hyperparameters.
cv_params = {'n_estimators' : [50,100],
'max_depth' : [10,50],
'min_samples_leaf' : [0.5,1],
'min_samples_split' : [0.001, 0.01],
'max_features' : ["sqrt"],
'max_samples' : [.5,.9]}
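The grid-search cell that produced the timing output below is not shown. A minimal sketch of how GridSearchCV can be run against the predefined validation fold, assuming the X_val split from above (split_index, custom_split, f1_scorer, rf, and rf_val are illustrative names):

    from sklearn.metrics import make_scorer

    # Mark validation rows with 0 and training rows with -1 for PredefinedSplit.
    split_index = [0 if idx in X_val.index else -1 for idx in X_train.index]
    custom_split = PredefinedSplit(split_index)

    # Score candidates by F1 on the "satisfied" class.
    f1_scorer = make_scorer(f1_score, pos_label="satisfied")

    rf = RandomForestClassifier(random_state=0)
    rf_val = GridSearchCV(rf, cv_params, scoring=f1_scorer, cv=custom_split, n_jobs=-1)
    rf_val.fit(X_train, y_train)
    # rf_val.best_params_ then shows the winning hyperparameter combination.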
CPU times: user 5.64 s, sys: 127 ms, total: 5.76 s
Wall time: 48.2 s
Next, we use the selected model to predict on our test data, applying the optimal parameters found via GridSearchCV.
[21]: # Use optimal parameters found via GridSearchCV.
      rf_opt = RandomForestClassifier(n_estimators=50, max_depth=50,
                                      min_samples_leaf=1, min_samples_split=0.001,
                                      max_features="sqrt", max_samples=0.9,
                                      random_state=0)
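The fitting and prediction cells are not shown between this cell and the F1-score cell below. A minimal sketch that produces the y_pred used there, plus a pickle save as mentioned in the imports (the filename rf_opt_model.pickle is an assumption):

    # Fit the model with the optimal hyperparameters on the training set.
    rf_opt.fit(X_train, y_train)

    # Predict on the test set.
    y_pred = rf_opt.predict(X_test)

    # Save the fitted model to disk with pickle.
    with open("rf_opt_model.pickle", "wb") as to_write:
        pickle.dump(rf_opt, to_write)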
Next, we compute our F1-score on the test set.
[27]: # Get F1 score.
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))
Recall the definitions of the four metrics:
• Accuracy ((TP+TN)/(TP+TN+FP+FN)): The ratio of correctly predicted observations to all observations.
• Precision (TP/(TP+FP)): The ratio of correctly predicted positive observations to all predicted positives.
• Recall (Sensitivity, TP/(TP+FN)): The ratio of correctly predicted positive observations to all observations in the actual positive class.
• F1 score: The harmonic mean of precision and recall, 2·(Precision·Recall)/(Precision+Recall), which takes into account both false positives and false negatives.
Calculate the scores: precision score, recall score, accuracy score, F1 score.
[28]: # Precision and recall scores on the test data set.
      pc_test = precision_score(y_test, y_pred, pos_label="satisfied")
      print("\nThe precision score is: {pc:.3f}".format(pc=pc_test),
            "for the test set,", "\nwhich means of all positive predictions,",
            "{pc_pct:.1f}% predictions are true positive.".format(pc_pct=pc_test * 100))

      rc_test = recall_score(y_test, y_pred, pos_label="satisfied")
      print("\nThe recall score is: {rc:.3f}".format(rc=rc_test),
            "for the test set,", "\nwhich means of all real positive cases in the test set,",
            "{rc_pct:.1f}% are predicted positive.".format(rc_pct=rc_test * 100))
The model performs well according to all four performance metrics. The model's precision score is slightly higher than the other three metrics.
Finally, we create a table of results that we can use to evaluate the performance of the model.
[32]: # Create table of results.
      # (`pd.concat` replaces the deprecated `DataFrame.append`.)
      table = pd.DataFrame()
      table = pd.concat([table,
                         pd.DataFrame([{'Model': "Tuned Decision Tree",
                                        'F1': 0.945422,
                                        'Recall': 0.935863,
                                        'Precision': 0.955197,
                                        'Accuracy': 0.940864}])],
                        ignore_index=True)
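The comparison below assumes the random forest's test scores were appended as a second row; a minimal sketch using the metric variables computed above:

    # Append the tuned random forest's test scores for comparison.
    table = pd.concat([table,
                       pd.DataFrame([{'Model': "Tuned Random Forest",
                                      'F1': f1_test,
                                      'Recall': rc_test,
                                      'Precision': pc_test,
                                      'Accuracy': ac_test}])],
                      ignore_index=True)
    table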
Question: How does the random forest model compare to the decision tree model we built in the
previous lab?
The tuned random forest has higher scores overall, so it is the better model. In particular, it shows a better F1 score than the decision tree model, which indicates that the random forest may do better at classification when taking into account false positives and false negatives.
1.6 Considerations
What are the key takeaways from this lab?
• Data exploration, cleaning, and encoding are necessary for model building.
• A separate validation set is typically used for tuning a model, rather than the test set. This also helps avoid the evaluation becoming biased.
• F1 scores are usually more useful than accuracy scores. If the costs of false positives and false negatives are very different, it's better to use the F1 score, which combines the information from precision and recall.
• The random forest model yields a more effective performance than a decision tree model.
What summary would we provide to stakeholders?
• The random forest model predicted satisfaction with more than 94.2% accuracy. The precision is over 95% and the recall is approximately 94.5%.
• The random forest model outperformed the tuned decision tree with the best hyperparameters in most of the four scores. This indicates that the random forest model may perform better.
• Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, these would be shared based on the tuned random forest (see the sketch after this list).
• In addition, we would provide details about the precision, recall, accuracy, and F1 scores to support our findings.
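For the stakeholder question about the most important factors, the tuned random forest's impurity-based feature importances can be ranked; a minimal sketch:

    # Rank features by the tuned random forest's impurity-based importances.
    importances = pd.Series(rf_opt.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))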
1.6.1 References
Accuracy, Precision, Recall & F1 Score: Interpretation of Performance Measures, Renuka Joshi
What is the Difference Between Test and Validation Datasets?, Jason Brownlee
Decision Trees and Random Forests, Neil Liberman