Supervised ML

The document provides an introduction to machine learning, covering supervised and unsupervised algorithms, dataset splitting, and model evaluation metrics. It details various classification algorithms, particularly focusing on Random Forest, including hyperparameter tuning using GridSearchCV. The document also emphasizes the importance of metrics like accuracy, precision, recall, and F1-score for evaluating model performance.

Introduction to Machine Learning

Prof. Denio Duarte
[email protected]
Introduction
● Learning based on dataset features
○ Supervised algorithms (labelled data)
■ Depending on the type of the label:
● Classification (discrete)
● Regression (continuous)
○ Unsupervised algorithms (unlabelled data)
Introduction
● Before starting
○ ML algorithms learn by applying a given approach that
generalizes from the data to predict the correct classes (labels)
○ The dataset must be split so that the built model can be
trained and then tested on unseen examples
■ Training set (+/- 70%)
■ Test set (+/- 30%)
Introduction
● Before starting
○ Dataset splitting

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# random_state guarantees the same random split across executions
# shuffle=True (default) shuffles the examples before splitting
# stratify=y preserves the class proportions in both sets (default None: no stratification)
Introduction
● Before starting
○ A model must be evaluated using metrics
■ Classification is the simplest case
– Compare the real value against the predicted one
■ Evaluating regression models is trickier
– Subtract the real value from the predicted one (0 means
that real and predicted are the same) – the residual error
(see the sketch below)
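
A minimal sketch of the residual idea, assuming NumPy arrays of real and predicted values (the numbers below are made up for illustration):

import numpy as np

y = np.array([3.0, 5.0, 2.5])      # real values
y_hat = np.array([2.8, 5.4, 2.5])  # predicted values

residuals = y - y_hat              # 0 means real and predicted agree
print(residuals)                   # [ 0.2 -0.4  0. ]
print(np.mean(residuals ** 2))     # regression metrics aggregate residuals, e.g. MSE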
Introduction
● Before starting
○ Classification (metrics: sklearn.metrics)
■ Balanced classes
● Accuracy (metrics.accuracy_score(y, y_hat))
■ Imbalanced classes
● Precision (metrics.precision_score(y, y_hat))
● Recall (metrics.recall_score(y, y_hat))
● F1-score (metrics.f1_score(y, y_hat))
■ Getting all metrics at once (see the sketch below)
● metrics.classification_report(y, y_hat)
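
A minimal sketch of these calls, assuming y holds the real labels and y_hat the predictions of some fitted classifier (the small arrays are made up for illustration):

from sklearn import metrics

y = [0, 1, 1, 0, 1, 1]      # real labels
y_hat = [0, 1, 0, 0, 1, 1]  # predicted labels

print(metrics.accuracy_score(y, y_hat))   # fraction of correct predictions
print(metrics.precision_score(y, y_hat))  # of the predicted positives, how many are real
print(metrics.recall_score(y, y_hat))     # of the real positives, how many were found
print(metrics.f1_score(y, y_hat))         # harmonic mean of precision and recall
print(metrics.classification_report(y, y_hat))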
Introduction
● Confusion matrix
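
The figure on this slide is not reproduced; a minimal sketch of building a confusion matrix with scikit-learn, reusing the illustrative labels above:

from sklearn.metrics import confusion_matrix

y = [0, 1, 1, 0, 1, 1]
y_hat = [0, 1, 0, 0, 1, 1]

print(confusion_matrix(y, y_hat))  # rows: real classes, columns: predicted classes
# [[2 0]
#  [1 3]]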
Classification
● Some algorithms
○ Decision tree
○ Random forest
○ Support vector machines
○ Logistic regression
Random Forest
● Several decision trees (estimators) are built, and each
one predicts a value
○ The class with the most votes is chosen (see the sketch below)

Source: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
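
A minimal sketch of fitting a forest and inspecting the per-tree predictions, assuming X_train, X_test, y_train come from the earlier split (the inspection loop is illustrative, not from the slides):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# each fitted tree in rf.estimators_ can predict on its own
votes = np.array([tree.predict(X_test) for tree in rf.estimators_])
print(votes.shape)         # (100, number of test examples)
# note: scikit-learn actually averages the trees' class probabilities
# rather than counting hard votes, but the idea is the same
print(rf.predict(X_test))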
Random Forest
● The most informative attributes are placed closer to the
root (an informativeness function is applied)
○ entropy
○ gini (an alternative to entropy that is computationally
simpler, since it avoids logarithms; see the sketch below)

Source: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
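
A small worked sketch of the two impurity measures, computed from the class proportions p of a node (not from the slides):

import numpy as np

def entropy(p):
    # Shannon entropy: -sum(p_i * log2(p_i)), ignoring zero proportions
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # Gini impurity: 1 - sum(p_i ** 2); no logarithm needed
    return 1.0 - np.sum(p ** 2)

p = np.array([0.5, 0.5])  # a perfectly mixed binary node
print(entropy(p))         # 1.0, the maximum for two classes
print(gini(p))            # 0.5, the maximum for two classes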
Random Forest
● Every ML algorithm has a set of hyperparameters used to
tune the model, e.g. for RandomForestClassifier:
○ criterion ['gini', 'entropy']
○ n_estimators n – number of trees (default 100)
○ max_features ['auto', 'sqrt', 'log2'] – maximal number of
attributes considered when splitting a node
○ max_depth n – maximal depth of a tree (default None)
○ bootstrap [True, False] – whether bootstrap samples are
used to build the trees
○ Several others are worth studying (see the sketch below)
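
As a hedged illustration, these hyperparameters are set when the estimator is created (the values below are arbitrary examples, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    criterion='entropy',  # informativeness function
    n_estimators=200,     # number of trees
    max_features='sqrt',  # attributes considered at each split
    max_depth=10,         # limit the depth of each tree
    bootstrap=True,       # resample the training set for each tree
)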
Random Forest
● How to choose the best hyperparameters
○ sklearn.model_selection.GridSearchCV
○ Runs a given estimator over every combination of the
given hyperparameters and returns the best one

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'bootstrap': [True, False], 'n_estimators': [50, 100, 200, 300],
              'max_features': ['auto', 'sqrt', 'log2'], 'criterion': ['gini', 'entropy']}
# note: 'auto' was removed from RandomForestClassifier in scikit-learn 1.3;
# keep only 'sqrt' and 'log2' on recent versions
best_RF = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid)
# the grid has 2x4x3x2 = 48 combinations; each is fitted once per CV fold (5 by default)
best_RF.fit(X_train, y_train)
best_RF.best_estimator_
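
A possible follow-up to inspect and evaluate the winning model, assuming the earlier X_test/y_test split (a sketch, not from the original slides):

from sklearn import metrics

print(best_RF.best_params_)      # the winning hyperparameter combination
y_hat = best_RF.predict(X_test)  # uses the best model, refit on the whole training set
print(metrics.accuracy_score(y_test, y_hat))
print(metrics.precision_score(y_test, y_hat))  # for multiclass data, pass e.g. average='macro'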
Exercise
● Based on the previous exercise
○ Propose a set of hyperparameters and show the best
model using precision and accuracy
