supervised learning using python - chapter3

How good is your model?
SUPERVISED LEARNING WITH SCIKIT-LEARN

George Boorman
Core Curriculum Manager, DataCamp
Classification metrics
Measuring model performance with accuracy:
Fraction of correctly classified samples

Not always a useful metric


Class imbalance
Classification for predicting fraudulent bank transactions
99% of transactions are legitimate; 1% are fraudulent

Could build a classifier that predicts NONE of the transactions are fraudulent
99% accurate!
But terrible at actually predicting fraudulent transactions

Fails at its original purpose

Class imbalance: Uneven frequency of classes

Need a different way to assess performance
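The fraud example above can be checked in a few lines of Python. This is a minimal sketch on made-up labels (1,000 transactions, 1% fraudulent, matching the proportions quoted above); the variable names are illustrative only.

```python
# Hypothetical labels: 1 = fraudulent, 0 = legitimate (990 legitimate, 10 fraudulent)
y_true = [0] * 990 + [1] * 10

# A "classifier" that predicts that no transaction is fraudulent
y_pred = [0] * len(y_true)

# Accuracy: fraction of correctly classified samples
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of the fraudulent transactions that were actually caught
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraction_caught = fraud_caught / sum(y_true)

print(accuracy)         # 0.99
print(fraction_caught)  # 0.0
```

The accuracy looks excellent while not a single fraudulent transaction is detected, which is why accuracy alone is a poor guide on imbalanced data.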


Confusion matrix for assessing classification performance
Confusion matrix (2x2 table of actual class vs. predicted class; figure not preserved)

Assessing classification performance
With fraudulent transactions as the positive class:
True positives (tp): fraudulent transactions predicted as fraudulent
True negatives (tn): legitimate transactions predicted as legitimate
False positives (fp): legitimate transactions predicted as fraudulent
False negatives (fn): fraudulent transactions predicted as legitimate

Accuracy: $\frac{tp + tn}{tp + tn + fp + fn}$


Precision

Precision: $\frac{tp}{tp + fp}$, where tp = true positives and fp = false positives

High precision = lower false positive rate

High precision: Not many legitimate transactions are predicted to be fraudulent


Recall

Recall: $\frac{tp}{tp + fn}$, where tp = true positives and fn = false negatives

High recall = lower false negative rate

High recall: Predicted most fraudulent transactions correctly


F1 score
F1 score: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$


Confusion matrix in scikit-learn
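from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X (features) and y (binary target) are assumed to be defined already
knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)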

print(confusion_matrix(y_test, y_pred))

[[1106   11]
 [ 183   34]]


Classification report in scikit-learn
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334
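
The class 1 row and the accuracy row of this report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
# Counts taken from the confusion matrix [[1106, 11], [183, 34]]:
# rows are actual classes (0, 1), columns are predicted classes (0, 1)
tn, fp = 1106, 11
fn, tp = 183, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted 1, how many are really 1
recall = tp / (tp + fn)      # of the samples that are really 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.85 0.76 0.16 0.26
```

The low recall (0.16) shows that the model misses most class 1 samples even though its accuracy is 0.85.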


Let's practice!
Logistic regression and the ROC curve
Logistic regression for binary classification
Logistic regression is used for classification problems
Logistic regression outputs probabilities

If the probability p > 0.5: the data is labeled 1
If the probability p < 0.5: the data is labeled 0


Linear decision boundary


Logistic regression in scikit-learn
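from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X (features) and y (binary target) are assumed to be defined already
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)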


Predicting probabilities
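# Keep the predicted probability of the positive class (column 1)
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
# Show the probability for the first test sample
print(y_pred_probs[:1])

[0.08961376]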


Probability thresholds
By default, logistic regression threshold = 0.5
Not specific to logistic regression
KNN classifiers also have thresholds

What happens if we vary the threshold?
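
To get a feel for the question, the following sketch sweeps a few thresholds over made-up probabilities and reports the true positive rate (TPR) and false positive rate (FPR) at each one. The data and the helper `rates_at` are hypothetical and only illustrate the idea; each threshold yields one (FPR, TPR) point of the kind an ROC curve is built from.

```python
# Made-up predicted probabilities for the positive class and their true labels
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.30, 0.35, 0.45, 0.55, 0.70, 0.80, 0.90]

def rates_at(threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

for threshold in (0.0, 0.25, 0.5, 0.75, 1.0):
    tpr, fpr = rates_at(threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

A threshold of 0 labels everything as positive (TPR and FPR both 1), a threshold of 1 labels everything as negative (both 0), and values in between trade one rate against the other.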


The ROC curve


Plotting the ROC curve
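import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# False positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
# Dashed diagonal: the performance of random guessing
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()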


ROC AUC


ROC AUC in scikit-learn
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

0.6700964152663693
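
One way to read this number: ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below computes that quantity directly on made-up scores; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count one half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so the 0.67 above sits between the two.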


Let's practice!
Hyperparameter tuning
Ridge/lasso regression: Choosing alpha
KNN: Choosing n_neighbors

Hyperparameters: Parameters we specify before fitting the model, like alpha and n_neighbors


Choosing the correct hyperparameters
1. Try lots of different hyperparameter values
2. Fit all of them separately
3. See how well they perform
4. Choose the best performing values

This is called hyperparameter tuning
It is essential to use cross-validation to avoid overfitting to the test set
We can still split the data and perform cross-validation on the training set
We withhold the test set for final evaluation
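
The four steps above can be sketched without scikit-learn so that each one is visible. Everything here is an assumption made for illustration: the data is synthetic, and `knn_predict`, `accuracy` and `cross_val_accuracy` are hypothetical helpers implementing a tiny one-feature KNN classifier and a simple k-fold loop.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in test)
    return hits / len(test)

def cross_val_accuracy(data, k, n_splits=5):
    """Mean validation accuracy over n_splits folds of `data`."""
    folds = [data[i::n_splits] for i in range(n_splits)]
    scores = []
    for i, held_out in enumerate(folds):
        rest = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        scores.append(accuracy(rest, held_out, k))
    return sum(scores) / n_splits

# Made-up data: one noisy feature; label 1 tends to have larger values
rng = random.Random(42)
data = [(rng.gauss(label * 2.0, 1.0), label) for label in [0, 1] * 50]
rng.shuffle(data)

# Withhold a test set; tune only on the training portion
train, test = data[:70], data[70:]

candidates = [1, 3, 5, 7, 9]                                       # step 1
cv_scores = {k: cross_val_accuracy(train, k) for k in candidates}  # steps 2 and 3
best_k = max(cv_scores, key=cv_scores.get)                         # step 4

print(cv_scores)
print("best n_neighbors:", best_k)
print("test accuracy:", accuracy(train, test, best_k))
```

The test set is touched only once, after the best value has been chosen, so the final score is not biased by the tuning itself.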


Grid search cross-validation


GridSearchCV in scikit-learn
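import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# np.arange(0.0001, 1, 10) uses a step of 10, so this grid holds the single
# value 0.0001; np.linspace(0.0001, 1, 10) would give 10 evenly spaced values
param_grid = {"alpha": np.arange(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'alpha': 0.0001, 'solver': 'sag'} 0.7529912278705785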


Limitations and an alternative approach
Number of fits = number of folds x number of hyperparameter combinations
3-fold cross-validation, 1 hyperparameter, 10 total values = 30 fits
10-fold cross-validation, 3 hyperparameters, 30 total combinations = 300 fits
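
The cost of an exhaustive grid search is the number of folds multiplied by the number of hyperparameter combinations, as in the first example above. A minimal sketch of that count follows; `count_fits` and both grids are hypothetical and exist only to show the arithmetic.

```python
from itertools import product

def count_fits(param_grid, n_splits):
    """Number of model fits in an exhaustive grid search with n_splits folds."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * n_splits

# One hyperparameter with 10 candidate values, 3-fold cross-validation
one_param = {"alpha": [0.1 * i for i in range(1, 11)]}
print(count_fits(one_param, n_splits=3))   # 30

# Hypothetical grid: 10 alpha values x 2 solvers, 5-fold cross-validation
two_params = {"alpha": [0.1 * i for i in range(1, 11)],
              "solver": ["sag", "lsqr"]}
print(count_fits(two_params, n_splits=5))  # 100
```

Because every added hyperparameter multiplies the number of combinations, the count grows quickly. By default scikit-learn also refits the best model once on the full training data, which this count ignores.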


RandomizedSearchCV
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.arange(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}
ridge = Ridge()
# n_iter=2: only two hyperparameter combinations are sampled and evaluated
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

{'solver': 'sag', 'alpha': 0.0001} 0.7529912278705785


Evaluating on the test set
test_score = ridge_cv.score(X_test, y_test)
print(test_score)

0.7564731534089224


Let's practice!