How good is your model?
Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Classification metrics
Measuring model performance with accuracy:
Fraction of correctly classified samples
Not always a useful metric
Class imbalance
Classification for predicting fraudulent bank transactions
99% of transactions are legitimate; 1% are fraudulent
Could build a classifier that predicts NONE of the transactions are fraudulent (see the sketch below)
99% accurate!
But terrible at actually predicting fraudulent transactions
Fails at its original purpose
Class imbalance: Uneven frequency of classes
Need a different way to assess performance
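To make this concrete, here is a minimal sketch of the majority-class trick using scikit-learn's DummyClassifier (X and y are assumed to hold the imbalanced transaction features and labels):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Always predict the most frequent class ("legitimate")
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Accuracy is ~0.99 on 99% legitimate data, yet no fraud is ever flagged
print(dummy.score(X_test, y_test))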
Confusion matrix for assessing classification performance

                     Predicted: Legitimate   Predicted: Fraudulent
Actual: Legitimate   True negative           False positive
Actual: Fraudulent   False negative          True positive
Assessing classification performance

Each cell of the confusion matrix contributes to the performance metrics: true negatives and true positives are the correct predictions, while false positives and false negatives are the errors.

Accuracy: (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)
Precision

Precision: true positives / (true positives + false positives)

High precision = lower false positive rate
High precision: Not many legitimate transactions are predicted to be fraudulent
Recall

Recall: true positives / (true positives + false negatives)

High recall = lower false negative rate
High recall: Predicted most fraudulent transactions correctly
F1 score

F1 score: 2 * (precision * recall) / (precision + recall)

The harmonic mean of precision and recall
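As a quick check on these formulas, here is a minimal sketch (assuming y_test and y_pred arrays from a fitted binary classifier) comparing the manual calculations with scikit-learn's built-in metric functions:

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts taken directly from the confusion matrix cells
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# scikit-learn's functions give the same results
print(precision, precision_score(y_test, y_pred))
print(recall, recall_score(y_test, y_pred))
print(f1, f1_score(y_test, y_pred))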
Confusion matrix in scikit-learn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))

[[1106   11]
 [ 183   34]]

Rows are actual classes and columns are predicted classes: 1106 true negatives, 11 false positives, 183 false negatives, and 34 true positives.
Classification report in scikit-learn
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334

support is the number of samples in each true class.
Let's practice!
Logistic regression and the ROC curve
Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Logistic regression for binary classification
Logistic regression is used for classification problems
Logistic regression outputs probabilities
If the probability p > 0.5: the data is labeled 1
If the probability p < 0.5: the data is labeled 0
Linear decision boundary

[Figure: logistic regression produces a linear decision boundary separating the two classes]
Logistic regression in scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
Predicting probabilities
# Probabilities of the positive class are in the second column
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[:1])

[0.08961376]
Probability thresholds
By default, logistic regression threshold = 0.5
Not specific to logistic regression
KNN classifiers also have thresholds
What happens if we vary the threshold?
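A minimal sketch of varying the threshold (using y_pred_probs from above; the 0.3 cutoff is an arbitrary illustration):

import numpy as np

# Flag as fraudulent any transaction with predicted probability >= 0.3
y_pred_lower = (y_pred_probs >= 0.3).astype(int)

# A lower threshold flags more positives: recall rises,
# while precision typically falls
print(np.sum(y_pred_lower), np.sum(y_pred_probs >= 0.5))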
The ROC curve

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate for every possible threshold:

A threshold of 1 predicts 0 for every observation: true positive rate and false positive rate are both 0
A threshold of 0 predicts 1 for every observation: true positive rate and false positive rate are both 1
Varying the threshold between these extremes traces out the full curve
Plotting the ROC curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

# Dashed diagonal shows a model that guesses at random
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
[Figure: the resulting ROC curve sits above the dashed random-guess diagonal]
ROC AUC

The area under the ROC curve (AUC) summarizes performance across all thresholds: a perfect model scores 1, while random guessing scores 0.5.
ROC AUC in scikit-learn
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))
0.6700964152663693
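Since fpr and tpr from roc_curve are already available, the same number can be obtained by integrating the curve with scikit-learn's auc function; this is just a consistency check:

from sklearn.metrics import auc

# Area under the plotted curve; matches roc_auc_score
# computed from the same predictions
print(auc(fpr, tpr))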
Let's practice!
Hyperparameter tuning
Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Hyperparameter tuning
Ridge/lasso regression: Choosing alpha
KNN: Choosing n_neighbors
Hyperparameters: Parameters we specify before fitting the model
Like alpha and n_neighbors
Choosing the correct hyperparameters
1. Try lots of different hyperparameter values
2. Fit all of them separately
3. See how well they perform
4. Choose the best performing values
This is called hyperparameter tuning
It is essential to use cross-validation to avoid overfitting to the test set
We can still split the data and perform cross-validation on the training set
We withhold the test set for final evaluation
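Before automating this with GridSearchCV, the workflow above can be sketched by hand; a minimal example, assuming X_train and y_train and an illustrative range of alpha values for Ridge:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

scores = {}
for alpha in np.linspace(0.0001, 1, 10):
    # Fit and score each candidate value with 5-fold cross-validation
    ridge = Ridge(alpha=alpha)
    scores[alpha] = np.mean(cross_val_score(ridge, X_train, y_train, cv=5))

# Choose the best-performing value
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])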
Grid search cross-validation

[Figure: a grid with one hyperparameter per axis, such as solver versus alpha; the cross-validation score is computed for each cell, and the best-performing combination is chosen]
GridSearchCV in scikit-learn
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# 10 evenly spaced alpha values between 0.0001 and 1
param_grid = {"alpha": np.linspace(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}

ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'alpha': 0.0001, 'solver': 'sag'} 0.7529912278705785
Limitations and an alternative approach

The number of fits is the number of folds multiplied by the number of hyperparameter combinations, so grid search scales poorly:

3-fold cross-validation, 1 hyperparameter, 10 values = 30 fits
10-fold cross-validation, 3 hyperparameters, 10 values each (1,000 combinations) = 10,000 fits
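The count follows from the cross product of the grid; a quick back-of-the-envelope calculation (the grid sizes here are illustrative):

from math import prod

n_folds = 10
grid_sizes = [10, 10, 10]  # candidate values per hyperparameter

# One fit per fold for every combination in the grid
print(n_folds * prod(grid_sizes))  # 10000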
RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}
ridge = Ridge()

# n_iter limits how many hyperparameter combinations are sampled
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'solver': 'sag', 'alpha': 0.0001} 0.7529912278705785
Evaluating on the test set
test_score = ridge_cv.score(X_test, y_test)
print(test_score)
0.7564731534089224
Let's practice!