
Hyperparameter tuning for machine learning models

Hyperparameter tuning is crucial because hyperparameters control the overall
behavior of a machine learning model. Every machine learning model has its own
set of hyperparameters that can be set.

A hyperparameter is a parameter whose value is set before the learning process
begins.

The Titanic dataset from Kaggle is used for comparison. The purpose of this
article is to explore how the performance and the computational time of a
random forest model change with various hyperparameter tuning methods. After
all, machine learning is all about finding the right balance between computing
time and the model's performance.
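
The snippets below assume the data has already been split into X_train, X_test,
y_train and y_test. The original article does not show this step, so the
following is only a minimal preparation sketch; the file name train.csv and the
choice of features are assumptions.

# Minimal data-preparation sketch (assumed, not from the original article).
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('train.csv')  # assumed path to the Kaggle Titanic training file

# Encode 'Sex' numerically and fill missing ages with the median.
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = titanic[features]
y = titanic['Survived']

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)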

Baseline model with default parameters:

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest with default hyperparameters.
random_forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)

random_forest.score(X_test, y_test)
The accuracy of this model, when used on the testing set, is 81.56%.

We can get the default parameters used for the model with the
command random_forest.get_params()

The default parameters are:


{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion':
'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes':
None, 'max_samples': None, 'min_impurity_decrease': 0.0,
'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None,
'oob_score': False, 'random_state': 1, 'verbose': 0, 'warm_start': False}

No need to worry if you don't know these parameters and how they are used;
information about all of them can be found in the scikit-learn documentation
for the model.

Some important Parameters in Random Forest:

1. max_depth: int, default=None: This is used to select how deep you want to
make each tree in the forest. The deeper the tree, the more splits it has, and
it captures more information about the data.
2. criterion: {"gini", "entropy"}, default="gini": Measures the quality of
each split. "gini" uses the Gini impurity, while "entropy" makes the split
based on the information gain.
3. max_features: {"auto", "sqrt", "log2"}, int or float, default="auto": This
represents the number of features considered at each split when looking for
the best split. Allowing more features per split gives each tree node more
options to consider, which can improve the model's performance.
4. min_samples_leaf: int or float, default=1: The minimum number of samples
required to be at a leaf node of each decision tree in the random forest.
5. min_samples_split: int or float, default=2: The minimum number of samples
that must be present at a node for a split to occur.
6. n_estimators: int, default=100: This is perhaps the most important
parameter. It represents the number of trees built in the random forest before
the predictions are aggregated. Usually, the higher the number, the better,
but it is also more computationally expensive. A short sketch after this list
shows how a few of these parameters are set explicitly.
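
As a quick illustration (not part of the original tuning workflow), a few of
these parameters can be passed explicitly when creating the classifier; the
values below are arbitrary examples rather than tuned ones.

# Arbitrary example values, only to show how the parameters are passed.
custom_forest = RandomForestClassifier(
    n_estimators=200,      # build 200 trees instead of the default 100
    max_depth=10,          # limit how deep each tree can grow
    criterion='entropy',   # split on information gain instead of Gini impurity
    min_samples_leaf=5,    # each leaf must contain at least 5 samples
    random_state=1).fit(X_train, y_train)

custom_forest.score(X_test, y_test)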
Grid Search

One traditional and popular way to perform hyperparameter tuning is an
exhaustive grid search from scikit-learn. This method tries every possible
combination of the given sets of hyperparameter values, so we can find the
best set of values in the parameter search space. It usually uses more
computational power and takes a long time to run, since it needs to try every
combination in the grid.

GridSearchCV automates the process of performing hyperparameter tuning in
order to determine the optimal values for a given model. As mentioned above,
the performance of a model significantly depends on the values of its
hyperparameters. There is no way to know the best values in advance, so
ideally we would need to try all possible values to find the optimal ones.
Doing this manually could take a considerable amount of time and resources,
and thus we use GridSearchCV to automate the tuning of hyperparameters.

GridSearchCV comes in scikit-learn's model_selection package, so an important
point to note is that we need to have the scikit-learn library installed. It
loops through the predefined hyperparameter values and fits your estimator
(model) on the training set, so in the end we can select the best parameters
from the listed hyperparameters.

parameters = {'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
              'criterion': ['gini', 'entropy'],
              'max_features': [0.3, 0.5, 0.7, 0.9],
              'min_samples_leaf': [3, 5, 7, 10, 15],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [50, 100, 200, 400, 600]}

Using sklearn's GridSearchCV, we can search over this grid and then run the
grid search. Note that the grid above contains 10 × 2 × 4 × 5 × 3 × 5 = 6,000
combinations, and with 5-fold cross-validation that means 30,000 model fits.

from sklearn.model_selection import GridSearchCV

# Search the parameter grid with 5-fold cross-validation, using all CPU cores.
forest = RandomForestClassifier()
grid_search = GridSearchCV(
    forest,
    parameters,
    cv=5,
    scoring='accuracy',
    n_jobs=-1)

grid_result = grid_search.fit(X_train, y_train)

print('Best Params: ', grid_result.best_params_)
print('Best Score: ', grid_result.best_score_)
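
The best_params_ and best_score_ above come from cross-validation on the
training set. As an additional step not shown in the original article, the
refit best model can also be scored on the held-out test set (GridSearchCV
refits it on the full training data by default):

# Evaluate the refit best model on the held-out test set.
best_model = grid_result.best_estimator_
print('Test accuracy: ', best_model.score(X_test, y_test))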

Output

Our cross-validation score improves from 81.56% to 84.12% with the grid search
model compared with our baseline, roughly a 3% improvement. However, the
computational time is almost 5 hours, which is not feasible for a simple
problem like this one.

Randomized Search

The main difference between RandomizedSearchCV and GridSearchCV is that
instead of trying every possible combination, it samples hyperparameter
combinations randomly from the grid space. For this reason, there is no
guarantee that we will find the best result as with grid search, but this
search can be extremely effective in practice because the computational time
is much lower.

The computational time and the model's performance mainly depend on
the n_iter value, which specifies how many parameter combinations the search
samples. The higher this value, the better the chance of finding a good set of
parameters, but it also requires more computational power.

We can implement RandomizedSearchCV using scikit-learn as well.

%%time
from sklearn.model_selection import RandomizedSearchCV

# Sample 200 random combinations from the same parameter grid.
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                   param_distributions=parameters,
                                   n_jobs=-1,
                                   n_iter=200)
random_result = random_search.fit(X_train, y_train)

print('Best Score: ', random_result.best_score_*100)
print('Best Params: ', random_result.best_params_)

Output

Our cross-validation score improves from 81.56% to 83.57% with the randomized
search model compared with our baseline. That is roughly a 2.5% improvement,
slightly less than what grid search achieved, but the computational time is
under 5 minutes, which is almost 60 times faster. For most simple problems,
randomized search will be the most feasible option for hyperparameter tuning.
