Published in Towards Data Science

Bex T.

Jun 6, 2021 · 13 min read

Comprehensive Guide to Multiclass Classification With Sklearn


Model selection, developing a strategy, and choosing an evaluation metric

Learn how to tackle any multiclass classification problem with Sklearn. The tutorial covers how to choose a model and a binarization strategy, walks through several multiclass evaluation metrics and how to use them, and finishes with hyperparameter tuning to optimize for a user-defined metric.

Photo by Sergiu Iacob on Pexels

Introduction
Even though multiclass classification is not as common as binary classification, it certainly poses a much bigger challenge. You can literally take my word for it because this article has been the most challenging post I have ever written (and I have written close to 70).

I found that the topic of multiclass classification is deep and full of nuances. I have read so many articles, read multiple StackOverflow
threads, created a few of my own, and spent several hours exploring the Sklearn user guide and doing experiments. The core topics of
multiclass classification such as

choosing a strategy to binarize the problem

choosing a base model

understanding excruciatingly many metrics

filtering out a single metric that solves your business problem and customizing it

tuning hyperparameters for this custom metric

and finally putting all the theory into practice with Sklearn

have all been scattered in the dark, sordid corners of the Internet. This was enough to conclude that no single resource shows an end-to-end workflow for dealing with multiclass classification problems (maybe I missed it).

For this reason, this article will be a comprehensive tutorial on how to solve any multiclass supervised classification problem using
Sklearn. You will learn both the theory and the implementation of the above core concepts. It is going to be a long and technical read, so
get a coffee!

Native multiclass classifiers


Depending on the model you choose, Sklearn approaches multiclass classification problems in 3 different ways. In other words, Sklearn
estimators are grouped into 3 categories by their strategy to deal with multi-class data.

The first and the biggest group of estimators are the ones that support multi-class classification natively:

naive_bayes.BernoulliNB

tree.DecisionTreeClassifier

tree.ExtraTreeClassifier

ensemble.ExtraTreesClassifier

naive_bayes.GaussianNB

neighbors.KNeighborsClassifier

svm.LinearSVC (setting multi_class=”crammer_singer”)

linear_model.LogisticRegression (setting multi_class=”multinomial”)

linear_model.LogisticRegressionCV (setting multi_class=”multinomial”)

For an N-class problem, they produce an N-by-N confusion matrix, and most of the evaluation metrics are derived from it:

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

# Build a synthetic dataset
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=4, n_redundant=1, n_classes=4
)

# Train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1121218
)

# Fit/predict
etc = ExtraTreesClassifier()
_ = etc.fit(X_train, y_train)
y_pred = etc.predict(X_test)

# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 5))
cmp = ConfusionMatrixDisplay(
    confusion_matrix(y_test, y_pred),
    display_labels=["class_1", "class_2", "class_3", "class_4"],
)

cmp.plot(ax=ax)
plt.show()

We will focus on multiclass confusion matrices later in the tutorial.

Binary classifiers with One-vs-One (OVO) strategy


Other supervised classification algorithms were mainly designed for the binary case. However, Sklearn implements two strategies called
One-vs-One (OVO) and One-vs-Rest (OVR, also called One-vs-All) to convert a multi-class problem into a series of binary tasks.

OVO splits a multi-class problem into a single binary classification task for each pair of classes. In other words, a single binary classifier is built for each pair. For example, a target with 4 classes (brain, lung, breast, and kidney cancer) requires 6 individual classifiers to binarize the problem:

Classifier 1: lung vs. breast

Classifier 2: lung vs. kidney

Classifier 3: lung vs. brain

Classifier 4: breast vs. kidney

Classifier 5: breast vs. brain

Classifier 6: kidney vs. brain
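To make the pair enumeration concrete, here is a minimal sketch (my own, not from the original article) that lists the same 6 pairs with itertools.combinations; the class names are just the labels from the example above:

from itertools import combinations

classes = ["lung", "breast", "kidney", "brain"]

# One binary classifier per unordered pair of classes
pairs = list(combinations(classes, 2))
print(pairs)       # the 6 pairs listed above
print(len(pairs))  # 6, i.e. N * (N - 1) / 2 for N = 4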


Sklearn suggests that the following classifiers work best with the OVO approach:

svm.NuSVC

svm.SVC

gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_one”)

Sklearn also provides a wrapper estimator for the above models under sklearn.multiclass.OneVsOneClassifier:

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Don't have to set the `multi_class` argument if used with OneVsOneClassifier
ovo = OneVsOneClassifier(estimator=GaussianProcessClassifier())

>>> ovo.fit(X_train, y_train)
OneVsOneClassifier(estimator=GaussianProcessClassifier())

A major downside of this strategy is its computational workload. As each pair of classes requires a separate binary classifier, targets with high cardinality may take too long to train. For an N-class problem, the number of classifiers built is N(N − 1) / 2:

# Print the number of estimators created
print(len(ovo.estimators_))

----------------------------------------

6

In practice, the One-vs-Rest strategy is much preferred because of this disadvantage.

Binary classifiers with One-vs-Rest (OVR) strategy


Alternatively, the OVR strategy creates an individual classifier for each class in the target. Essentially, each binary classifier chooses a
single class and marks it as positive, encoding it as 1. The rest of the classes are considered negative labels and, thus, encoded with 0. For
classifying 4 types of cancer:

Classifier 1: lung vs. [breast, kidney, brain] — (lung cancer, not lung cancer)

Classifier 2: breast vs. [lung, kidney, brain] — (breast cancer, not breast cancer)

Classifier 3: kidney vs. [lung, breast, brain] — (kidney cancer, not kidney cancer)

Classifier 4: brain vs. [lung, breast, kidney] — (brain cancer, not brain cancer)
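To make the 0/1 encoding concrete, here is a minimal sketch (my own, not from the article) using sklearn.preprocessing.label_binarize on a handful of made-up labels; each column is the binary target one OVR classifier would see:

import numpy as np
from sklearn.preprocessing import label_binarize

y = np.array(["lung", "breast", "kidney", "brain", "lung"])
classes = ["lung", "breast", "kidney", "brain"]

# Column i is the 0/1 target seen by the classifier for classes[i]
Y = label_binarize(y, classes=classes)
print(Y)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]
#  [1 0 0 0]]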

Sklearn suggests that the following classifiers work best with the OVR approach:

ensemble.GradientBoostingClassifier

gaussian_process.GaussianProcessClassifier (setting multi_class = “one_vs_rest”)

svm.LinearSVC (setting multi_class=”ovr”)

linear_model.LogisticRegression (setting multi_class=”ovr”)

linear_model.LogisticRegressionCV (setting multi_class=”ovr”)

linear_model.SGDClassifier

linear_model.Perceptron

Alternatively, you can wrap the above models with the OneVsRestClassifier estimator:

from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsRestClassifier

# Init/fit
ovr = OneVsRestClassifier(estimator=Perceptron())
_ = ovr.fit(X_train, y_train)
print(len(ovr.estimators_))

---------------------------------------------------

4

Even though this strategy significantly lowers the computational cost, the fact that only one class is considered positive and the rest negative makes each binary problem an imbalanced classification task. This problem is even more pronounced for classes with low proportions in the target.
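As a quick illustration of that imbalance (made-up counts, not the diamonds data), collapsing a rare class under OVR leaves it with only a few percent of positive labels:

import numpy as np

# Hypothetical target: "kidney" makes up only 5% of the samples
y = np.array(["lung"] * 50 + ["breast"] * 30 + ["brain"] * 15 + ["kidney"] * 5)

# Positive rate of each binary OVR task
for cls in np.unique(y):
    print(cls, (y == cls).mean())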

In both approaches, depending on the passed estimator, the results of all binary classifiers can be summarized in two ways:

majority vote: each binary classifier predicts one class, and the class that receives the most votes across all classifiers is chosen

argmax of class membership probability scores: classifiers such as LogisticRegression compute probability scores for each class ( .predict_proba() ). The argmax of the sum of the scores is then chosen, as the sketch below illustrates.
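Here is a minimal sketch of that argmax step (the scores are made up):

import numpy as np

classes = np.array(["lung", "breast", "kidney", "brain"])

# Hypothetical summed class-membership scores for two samples
# (rows: samples, columns: classes)
summed_scores = np.array([
    [0.8, 0.3, 0.1, 0.4],
    [0.2, 0.9, 0.5, 0.1],
])

# The column with the highest score is the final prediction
print(classes[np.argmax(summed_scores, axis=1)])  # ['lung' 'breast']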

We will talk more about how to score each of these strategies later in the tutorial.

Sample classification problem and preprocessing pipeline


As an example problem, we will be predicting the quality of diamonds using the Diamonds dataset from Kaggle:

import pandas as pd

diamonds = pd.read_csv("data/diamonds.csv").drop("Unnamed: 0", axis=1)
diamonds.head()

>>> diamonds.shape
(53940, 10)

>>> diamonds.describe().T.round(3)

The above output shows the features are on different scales, suggesting we use some type of normalization. This step is essential for many
linear-based models to perform well.

>>> diamonds.cut.value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

The dataset contains a mixture of numeric and categorical features. I covered preprocessing steps for binary classification in my last article
in detail. You can easily apply the ideas to the multi-class case, so I will keep the explanations here nice and short.

The target is ‘cut’, which has 5 classes: Ideal, Premium, Very Good, Good, and Fair (descending quality). We will encode the textual
features with OneHotEncoder.

Let’s take a quick look at the distributions of each numeric feature to decide what type of normalization to use:

>>> diamonds.hist(figsize=(16, 12));



Price and carat show skewed distributions. We will use a power transformer (PowerTransformer) to make them as close to normally distributed as possible. For the rest, simple standardization is enough. If you are not familiar with numeric transformations, check out my article on the topic. Also, the code below contains an example of Sklearn pipelines, and you can learn all about them from here.

Let’s get to work:

from sklearn.model_selection import train_test_split

# Build feature/target arrays
X, y = diamonds.drop("cut", axis=1), diamonds["cut"].values.flatten()

# Create train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1121218, test_size=0.33, stratify=y
)

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder, PowerTransformer, StandardScaler
)

# Build categorical preprocessor
categorical_cols = X.select_dtypes(include="object").columns.to_list()
categorical_pipe = make_pipeline(
    OneHotEncoder(sparse=False, handle_unknown="ignore")
)

# Build numeric processor
to_log = ["price", "carat"]
to_scale = ["x", "y", "z", "depth", "table"]
numeric_pipe_1 = make_pipeline(PowerTransformer())
numeric_pipe_2 = make_pipeline(StandardScaler())

# Full processor
full = ColumnTransformer(
    transformers=[
        ("categorical", categorical_pipe, categorical_cols),
        ("power_transform", numeric_pipe_1, to_log),
        ("standardization", numeric_pipe_2, to_scale),
    ]
)

# Final pipeline combined with RandomForest
pipeline = Pipeline(
    steps=[
        ("preprocess", full),
        ("base", RandomForestClassifier(max_depth=13)),
    ]
)

# Fit
_ = pipeline.fit(X_train, y_train)

The first version of our pipeline uses RandomForestClassifier . Let's look at its confusion matrix by generating predictions:

import matplotlib.pyplot as plt

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_pred = pipeline.predict(X_test)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(12, 8))
# Create the matrix
cm = confusion_matrix(y_test, y_pred)
cmp = ConfusionMatrixDisplay(cm, display_labels=pipeline.classes_)
cmp.plot(ax=ax)

plt.show()

We create the matrix with confusion_matrix and plot it with ConfusionMatrixDisplay, a special Sklearn class for this purpose. ConfusionMatrixDisplay also has a display_labels argument, to which we pass the class names accessed through the pipeline.classes_ attribute.
Interpreting an N-by-N confusion matrix
If you read my other article on binary classification, you know that confusion matrices are the holy grail of supervised classification
problems. In a 2 by 2 matrix, the matrix terms are easy to interpret and locate.

Even though it gets more difficult to interpret the matrix as the number of classes increases, there are sure-fire ways to find your way
around any matrix of any shape.

The first step is always identifying your positive and negative classes. This depends on the problem you are trying to solve. As a jewelry
store owner, I may want my classifier to differentiate Ideal and Premium diamonds better than other types, making these types of
diamonds my positive class. Other classes will be considered negative.

Establishing positive and negative classes early on is very important in evaluating model performance and in hyperparameter tuning. After
doing this, you should define your true positives, true negatives, false positives, and false negatives. In our case:

Positive classes: Ideal and Premium diamonds

Negative classes: Very Good, Good, and Fair diamonds

True Positives, type 1: actual Ideal, predicted Ideal

True Positives, type 2: actual Premium, predicted Premium

True Negatives: the rest of the diamond types predicted correctly

False Positives: actual value belongs to any of the 3 negative classes but predicted either Ideal or Premium

False Negatives: actual value is either Ideal or Premium but predicted as any of the 3 negative classes.

Always list out the terms of your matrix in this manner, and the rest of your workflow will be much easier, as you will see in the next
section.
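As a quick sanity check (my own sketch, assuming the y_test and y_pred arrays from the pipeline above), you can collapse the five classes into this positive/negative grouping and look at the resulting 2-by-2 matrix; note that this collapsed view no longer distinguishes the two true-positive types:

import numpy as np
from sklearn.metrics import confusion_matrix

positive = ["Ideal", "Premium"]

# Collapse the 5 diamond classes into positive/negative
y_test_bin = np.isin(y_test, positive)
y_pred_bin = np.isin(y_pred, positive)

# Rows: actual [negative, positive]; columns: predicted [negative, positive]
print(confusion_matrix(y_test_bin, y_pred_bin))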

How Sklearn computes multiclass classification metrics — ROC AUC score


This section is only about the nitty-gritty details of how Sklearn calculates common metrics for multiclass classification. Specifically, we
will peek under the hood of the 4 most common metrics: ROC_AUC, precision, recall, and f1 score. Even though I will give a brief
overview of each metric, I will mostly focus on using them in practice. If you want a deeper explanation of what each metric measures,
please refer to this article.

The first metric we will discuss is the ROC AUC score or area under the receiver operating characteristic curve. It is mostly used when we
want to measure a classifier’s performance to differentiate between each class. This means that ROC AUC is better suited for balanced
classification tasks.

In essence, the ROC AUC score is used for binary classification and with models that can generate class membership probabilities based on
some threshold. Here is a brief overview of the steps to calculate ROC AUC for binary classification:

1. Train a binary classifier that can generate class membership probabilities, such as LogisticRegression with its predict_proba method.

2. An initial decision threshold close to 0 is chosen. For example, if the predicted probability is higher than 0.1, the sample is labeled positive; otherwise, negative.

3. Using this threshold, a confusion matrix is created.

4. True positive rate (TPR) and false positive rate (FPR) are found.

5. A new threshold is chosen, and steps 3–4 are repeated.

6. Repeat steps 2–5 for various thresholds between 0 and 1 to create a set of TPRs and FPRs.
7. Plot all TPRs vs. FPRs to generate the receiver operating characteristic curve.

8. Calculate the area under this curve.
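Here is a minimal sketch of those steps for a single binary slice (Ideal vs. the rest), assuming the fitted pipeline from earlier; sklearn.metrics.roc_curve sweeps the thresholds of steps 2-6 for us:

import numpy as np
from sklearn.metrics import auc, roc_curve

# Probability of the "Ideal" class from the fitted pipeline (step 1)
idx = np.where(pipeline.classes_ == "Ideal")[0][0]
probs = pipeline.predict_proba(X_test)[:, idx]

# TPR/FPR pairs over many thresholds (steps 2-7)
fpr, tpr, thresholds = roc_curve(y_test == "Ideal", probs)

# Area under the curve (step 8)
print(auc(fpr, tpr))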

For multiclass classification, you can calculate the ROC AUC for all classes using either OVO or OVR strategies. Since we agreed that OVR
is a better option, here is how ROC AUC is calculated for OVR classification:

1. Each binary classifier created using OVR finds the ROC AUC score for its own class using the above steps.

2. ROC AUC scores of all classifiers are then averaged using either of these 2 methods:

“macro”: this is simply the arithmetic mean of the scores

“weighted”: this takes class imbalance into account by computing a weighted average. Each class's ROC AUC is multiplied by its support (the number of samples in that class), the results are summed, and the sum is divided by the total number of samples.

As an example, let’s say there are 100 samples in the target — class 1 (45), class 2 (30), class 3 (25). OVR creates 3 binary classifiers, 1 for
each class, and their ROC AUC scores are 0.75, 0.68, 0.84, respectively. The weighted ROC AUC score across all classes will be:

ROC AUC (weighted): ((45 * 0.75) + (30 * 0.68) + (25 * 0.84)) / 100 = 0.7515
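A quick check of that arithmetic with numpy (the scores and class counts are the made-up numbers above):

import numpy as np

scores = np.array([0.75, 0.68, 0.84])  # per-class ROC AUC
counts = np.array([45, 30, 25])        # class supports

# Weighted average = sum(score * support) / total samples
print(np.average(scores, weights=counts))  # 0.7515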

Here is the implementation of all this in Sklearn:

from sklearn.metrics import roc_auc_score

# Generate membership scores with .predict_proba
y_pred_probs = pipeline.predict_proba(X_test)

# Calculate ROC AUC
roc_auc_score(
    y_test, y_pred_probs, multi_class="ovr", average="weighted"
)

--------------------------------------

0.9104965737411006

Above, we calculated ROC AUC for our diamond classification problem and got an excellent score. Don't forget to set the multi_class and average parameters properly when using roc_auc_score. If you want to generate the score for a particular class, here is how you do it:

# GENERATE ROC AUC SCORE FOR 'IDEAL' CLASS DIAMONDS
import numpy as np

# Find the index of Ideal class diamonds
idx = np.where(pipeline.classes_ == "Ideal")[0][0]

# Don't have to set multi_class and average params
>>> roc_auc_score(y_test == "Ideal", y_pred_probs[:, idx])
0.9431101165153962

The ROC AUC score only tells you how well the classifier separates the classes. A higher ROC AUC score does not necessarily mean a better model. On top of that, we care more about our model's ability to classify Ideal and Premium diamonds, so a metric like ROC AUC is not a good option for our case.

Precision, Recall and F1 scores for multiclass classification


A better way to measure our pipeline's performance is to use precision, recall, and F1 scores. For the binary case, they are easy and intuitive to understand:

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

(Images by author)

In the multiclass case, these 3 metrics are calculated on a per-class basis. For example, let's look at the confusion matrix again:

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(12, 8))
# Create the matrix
cm = confusion_matrix(y_test, y_pred)
cmp = ConfusionMatrixDisplay(cm, display_labels=pipeline.classes_)
cmp.plot(ax=ax)

plt.show()

Precision tells us what proportion of predicted positives is truly positive. If we want to calculate precision for Ideal diamonds, true
positives would be the number of Ideal diamonds predicted correctly (the center of the matrix, 6626). False positives would be any cells
that count the number of times our classifier predicted other types of diamonds as Ideal. These would be the cells above and below the
center of the matrix (1013 + 521 + 31 + 8 = 1573). Using the formula of precision, we calculate it to be:

Precision (Ideal) = TP / (TP + FP) = 6626 / (6626 + 1573) = 0.808

Recall is calculated similarly. We know the number of true positives — 6626. False negatives would be any cells that count the number of
times the classifier predicted the Ideal type of diamonds belonging to any other negative class. These would be the cells right and left to
the center of the matrix (3 + 9 + 363 + 111 = 486). Using the formula of recall, we calculate it to be:

Recall (Ideal) = TP / (TP + FN) = 6626 / (6626 + 486) = 0.93
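These two numbers can also be read straight off the matrix programmatically; a minimal sketch, assuming the cm array and pipeline from the code above:

# Position of the Ideal class in the (alphabetically ordered) matrix
idx = list(pipeline.classes_).index("Ideal")

tp = cm[idx, idx]           # Ideal predicted as Ideal
fp = cm[:, idx].sum() - tp  # other classes predicted as Ideal
fn = cm[idx, :].sum() - tp  # Ideal predicted as other classes

print("precision:", tp / (tp + fp))  # ~0.81
print("recall:   ", tp / (tp + fn))  # ~0.93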

So, how do we choose between recall and precision for the Ideal class? It depends on the type of problem you are trying to solve. If you
want to minimize the instances where other, cheaper types of diamonds are predicted as Ideal, you should optimize precision. As a jewelry
store owner, you might be sued for fraud for selling cheaper diamonds as expensive Ideal diamonds.

On the other hand, if you want to minimize the instances where you accidentally sell Ideal diamonds for a lower price, you should
optimize for recall of the Ideal class. Indeed, you won’t get sued, but you might lose money.

The third option is to have a model that is equally good at the above 2 scenarios. In other words, a model with high precision and recall.
Fortunately, there is a metric that measures just that: the F1 score. F1 score takes the harmonic mean of precision and recall and produces
a value between 0 and 1:
F1 = 2 * (Precision * Recall) / (Precision + Recall)

So, the F1 score for the Ideal class would be:

F1 (Ideal) = 2 * (0.808 * 0.93) / (0.808 + 0.93) = 0.87

Up to this point, we calculated the 3 metrics only for the Ideal class. But in multiclass classification, Sklearn computes them for all classes.
You can use classification_report to see this:

from sklearn.metrics import classification_report

>>> print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        Fair       0.92      0.78      0.85       532
        Good       0.79      0.64      0.70      1619
       Ideal       0.81      0.93      0.86      7112
     Premium       0.67      0.86      0.75      4551
   Very Good       0.73      0.36      0.48      3987

    accuracy                           0.75     17801
   macro avg       0.78      0.71      0.73     17801
weighted avg       0.76      0.75      0.74     17801

You can check that our calculations for the Ideal class were correct. The last column of the table, support, shows how many samples there are for each class. The last 2 rows show averaged scores for the 3 metrics. We already covered what macro and weighted averages are in the ROC AUC example.

For imbalanced classification tasks like this one, you rarely choose the averaged precision, recall, or F1 scores. Again, choosing one metric to optimize for a particular class depends on your business problem. For our case, we will optimize the F1 score of the Ideal and Premium classes (yes, you can choose multiple classes simultaneously). First, let's see how to calculate the weighted F1 across all classes:

from sklearn.metrics import f1_score

# Weighted F1 across all classes
>>> f1_score(y_test, y_pred, average="weighted")
0.7355520553610462

The above is consistent with the output of classification_report. To choose the F1 scores for the Ideal and Premium classes, specify the labels parameter:

# F1 score for Ideal and Premium with weighted average
>>> f1_score(
...     y_test, y_pred, labels=["Premium", "Ideal"], average="weighted"
... )
0.8205313467958754

Finally, let’s see how to optimize these metrics with hyperparameter tuning.

Hyperparameter tuning to optimize model performance for a custom metric


Optimizing the model's performance for a metric is almost the same as in the binary case. The only difference is how we pass a scoring function to a hyperparameter tuner like GridSearchCV.

Up until now, we were using the RandomForestClassifier pipeline, so we will create a hyperparameter grid for this estimator:

import numpy as np

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
min_samples_split = [2, 5, 7, 10]
min_samples_leaf = [1, 2, 3, 4]

param_grid = {
    "base__n_estimators": n_estimators,
    "base__max_depth": max_depth,
    "base__min_samples_split": min_samples_split,
    "base__min_samples_leaf": min_samples_leaf,
}


Don't forget to prepend each hyperparameter name with the step name you chose for your estimator in the pipeline. When we created our pipeline, we named the RandomForestClassifier step 'base', as the snippet below shows. See this discussion for more info.

We will use HalvingRandomSearchCV, a successive-halving search that is much faster than a regular grid search. You can read this article to see my experiments:

11 Times Faster Hyperparameter Tuning with HalvingGridSearch


towardsdatascience.com

Before we feed the above grid to the search, let's create a custom scoring function. In the binary case, we could pass the metric name as a string, such as 'precision' or 'recall'. But in the multiclass case, those metric functions accept additional parameters, and we cannot set them when passing the name as a string. To solve this, Sklearn provides the make_scorer function:

from sklearn.metrics import make_scorer

custom_f1 = make_scorer(
    f1_score, greater_is_better=True, average="weighted", labels=["Ideal", "Premium"]
)

>>> custom_f1
make_scorer(f1_score, average=weighted)

As we did in the last section, we passed custom values for the average and labels parameters.

Finally, let's initialize HalvingRandomSearchCV and fit it to the full data with 3-fold cross-validation:

# Required to enable the experimental halving search estimators
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

hrs = HalvingRandomSearchCV(
    estimator=pipeline,
    param_distributions=param_grid,
    scoring=custom_f1,
    cv=3,
    n_candidates="exhaust",
    factor=5,
    n_jobs=-1,
)
# Fit
_ = hrs.fit(X, y)

# Score the best estimator on the held-out test set
best_estimator = hrs.best_estimator_
_ = best_estimator.fit(X_train, y_train)
y_preds = best_estimator.predict(X_test)

>>> f1_score(y_test, y_preds, average="weighted", labels=["Ideal", "Premium"])
0.8136686577320091

After the search is done, you can get the best score and estimator with .best_score_ and .best_estimator_ attributes, respectively.
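For example (a minimal usage sketch, assuming the fitted hrs object above):

# Best cross-validated score for our custom F1 metric and the winning pipeline
print(hrs.best_score_)
print(hrs.best_params_)
best_pipeline = hrs.best_estimator_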

Your model is only as good as the metric you choose to evaluate it with. Hyperparameter tuning will be time-consuming, but assuming you did everything right up to this point and provided a good enough parameter grid, everything should turn out as expected. If not, it is an iterative process: tweak the preprocessing steps, take a second look at your chosen metric, and maybe widen your search grid. Thank you for reading!

Related Articles
Multi-Class Metrics Made Simple, Part I: Precision and Recall

Multi-Class Metrics Made Simple, Part II: the F1-score

How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification

Discussions
How to choose between ROC AUC and the F1 score?

What are the differences between AUC and F1-score?

API and User Guides


Classification Metrics

Multiclass and multioutput algorithms
