Supervised Learning Algorithms cheat sheet
by Dmytro Nikolaiev
Introduction
Supervised learning is the machine learning task of learning a function that maps an input to an
output based on example input-output pairs. A supervised learning algorithm analyzes the
training data and produces an inferred function, which can be used later for mapping new
examples.
The most popular supervised learning tasks are regression and classification.
The result of solving the regression task is a model that can make numerical predictions.
For example:
o Real estate value prediction
o Predicting your company's revenue next year
The result of solving the classification task is a model that can make class predictions.
For example:
o Spam detection
o Classifying news articles
The line between these tasks is sometimes fuzzy (e.g. predicting the probability of cancer
based on blood tests).
Hard classification algorithms predict whether a data point belongs to a particular class
without producing probability estimates.
Soft classification algorithms, in turn, also estimate the class conditional probabilities.
Some algorithms are designed only for binary classification problems (SVM, for example), so
they cannot be used for multi-class classification tasks directly. Instead, heuristic methods can be
used to split a multi-class classification problem into multiple binary classification datasets and
train a binary classifier on each:
OvR (one-vs-rest) - sometimes OvA (one-vs-all) - you have to train N classifiers for N
classes, but each on the full dataset.
OvO (one-vs-one) - you have to train N*(N-1)/2 classifiers for N classes, but each on a
subsample of your dataset containing only the two classes involved. Better suited for unbalanced samples.
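A minimal sketch of both strategies using scikit-learn's wrappers around a binary SVM; the dataset here is just a placeholder:

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)  # 3 classes

    # OvR: one binary classifier per class (3 here), each trained on the full dataset
    ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

    # OvO: one binary classifier per pair of classes (3*2/2 = 3 here), each trained
    # only on the samples of the two classes involved
    ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

    print(ovr.predict(X[:3]), ovo.predict(X[:3]))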
Next, the following algorithms will be reviewed or mentioned (note that all of them solve both
classification and regression tasks, except for Linear Regression (regression only) and Logistic
Regression (classification only)):
Linear Regression
Logistic Regression
Support Vector Machines
k-Nearest Neighbors
Decision Tree
Ensemble methods:
o Bagging and Pasting
o Random Forest and Extra Trees
o Boosting
o Stacking and Blending
Simple Algorithms
I use the phrase simple algorithms not to mean that they are simple to implement (although
some of them really are), but to mean that they are standalone algorithms rather than the
ensemble methods that we will see later.
Linear Regression
In the simplest case, the regression task is to draw a line through the data points so that the error
between this line (the predictions) and the real values is minimal. In general, this is the problem of
minimizing a loss function, i.e. an optimization problem. Usually, the loss function is the MSE -
mean squared error (because of maximum likelihood estimation) - and the optimization algorithm
is gradient descent. That said, any other loss function or optimization algorithm can be used.
One of the important properties of linear regression is that the optimal parameters (according to
MSE, again because of maximum likelihood estimation) can be calculated in closed form with the simple Normal
Equation. However, this method does not scale well with a large number of features, so other
optimization methods can be applied instead.
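For reference, a standard way to write this (notation assumed here: X is the design matrix with a column of ones for the bias term, y is the target vector, theta is the parameter vector, and N is the number of samples):

    MSE(\theta) = \frac{1}{N} \| X\theta - y \|^2
    \hat{\theta} = (X^{\top} X)^{-1} X^{\top} y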
If the dependence in the data is more complex than a straight line, we can add powers of each feature
as new features (the PolynomialFeatures class from sklearn can be used) and then train a Linear
Regression model. This technique is called Polynomial Regression. The process of creating new
features (e.g. x^n, log(x), e^x, etc.) is called feature engineering and can significantly improve
linear model performance.
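A minimal sketch of Polynomial Regression with scikit-learn (the toy data and the degree are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Toy data with a quadratic dependence plus noise (illustrative only)
    rng = np.random.RandomState(42)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=100)

    # PolynomialFeatures adds x^2 as a new feature, then plain Linear Regression is fit
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X, y)
    print(model.predict([[2.0]]))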
Another popular version of this algorithm is Bayesian Linear Regression, which predicts not only
values but also the uncertainty of those values, by building a confidence interval. This is possible thanks to
Bayes' theorem.
One of the most efficient ways to reduce overfitting and the influence of outliers in regression is
regularization. A regularization term is added to the loss function so that the regression coefficients
are kept as small as possible.
Regularized regression can also be used as a feature selection tool. Thanks to the properties of its
penalty, LASSO regression, for example, can eliminate insignificant features (set their coefficients exactly to zero).
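A minimal sketch of LASSO as a feature selector (the alpha value is an assumption and should be tuned, e.g. by cross-validation):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Synthetic data where only 5 of the 20 features are informative
    X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=5.0, random_state=0)

    lasso = Lasso(alpha=1.0)  # larger alpha -> stronger penalty -> more zero coefficients
    lasso.fit(X, y)

    # Features whose coefficients were driven exactly to zero can be dropped
    print("kept features:", np.flatnonzero(lasso.coef_))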
Main hyperparameters:
Pros:
Cons:
Logistic Regression
Like Linear Regression model, Logistic Regression (is also known as logit regression) computes
a weighted sum of the input features (plus bias) but instead of outputting this result directly, it
outputs the logistic of the result. The logistic is a sigmoid function, that outputs a number
between 0 and 1, so Logistic Regression is soft binary classifier that estimates the probability
that instance belongs to the positive class. Depends of some threshold different values of
accuracy/recall can be obtained. The same types of regularization as in Linear Regression can be
used.
The very similar probit regression uses a slightly different function - the probit function instead of the
sigmoid.
The Logistic Regression model can be generalized to support multiple classes directly, without
training multiple classifiers. This is called Softmax Regression (or Multinomial Logistic
Regression). This model computes a score for each class and then estimates the probability of
each class by applying the softmax function (also called the normalized exponential).
Logistic Regression is based on Linear Regression, so it inherits all the hyperparameters, pros, and cons of that
algorithm. What can be noted separately is its high interpretability, which is why it is
widely used in credit scoring and medical diagnostics.
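A minimal sketch of soft classification with a custom decision threshold (the dataset and the 0.3 threshold are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    clf = LogisticRegression()  # L2 regularization by default, strength controlled by C
    clf.fit(X, y)

    proba = clf.predict_proba(X)[:, 1]   # estimated probability of the positive class
    y_pred = (proba >= 0.3).astype(int)  # lowering the threshold trades precision for recall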
Support Vector Machines
The Support Vector Machine (SVM) looks for a separating hyperplane with the widest possible
margin between the classes. Two settings are distinguished:
Hard Margin Classification - it is assumed that instances of the same class are on the
same side of the separating hyperplane without exceptions.
Soft Margin Classification - allows violation of the decision boundary, which is regulated
by the regularization parameter.
In the case of a regression task, the SVM instead tries to draw a line so that as many instances as possible
fit inside the margin, "on the street".
Since SVM requires calculating distances between points, it also requires feature scaling.
The most important and mathematically elegant feature of SVM is that the solution of the Dual
Problem (which is the basis of SVM) does not depend on the features directly (as vectors), but
only on their pairwise scalar products. This allows us to replace the scalar product with a certain
function k(a, b), which is called the kernel. In fact, the kernel is a scalar product in some other
space. This procedure allows you to build nonlinear classifiers (which are actually linear in a
higher-dimensional space) without explicitly adding new features, and it is called the kernel trick.
The use of different kernels allows this algorithm to recover very complex dependencies in both
classification and regression tasks. The most popular kernels are:
polynomial
RBF - Gaussian Radial Basis Function
sigmoid and others
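A minimal sketch, assuming scikit-learn's SVC on a toy nonlinear dataset; the kernel and hyperparameter values are illustrative only:

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # A nonlinear toy problem: a linear kernel struggles here, an RBF kernel does not
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    # Feature scaling matters for SVM, so it goes into the same pipeline
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X, y)
    print(model.score(X, y))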
One-class SVM can also be used for the anomaly detection problem.
Main hyperparameters:
kernel type
regularization parameter C - a penalty for each misclassified data point (usually 0.1 < C <
100)
kernel coefficient gamma (for RBF and similar kernels) - controls how tightly the regions
separating different classes fit the data; large gamma leads to overly specific class regions (overfitting)
(usually 0.0001 < gamma < 10)
Pros:
Cons:
k-Nearest Neighbors
The nearest neighbor algorithm, as a representative of metric methods, makes two hypotheses
about the data distribution:
Continuity hypothesis for regression - close objects correspond to close answers, and
Compactness hypothesis for classification - close objects correspond to the same class.
For a new object, we find its k nearest neighbors and aggregate their answers (majority vote for
classification, mean value for regression). The definition of nearest depends on the
distance metric that we want to use (Manhattan, Euclidean, etc.).
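A minimal sketch with scikit-learn; the dataset, k, and metric are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Distance-based, so feature scaling is part of the pipeline
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
    knn.fit(X, y)
    print(knn.predict(X[:3]))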
Main hyperparameters:
k - number of neighbors
distance metric
Pros:
Cons:
Decision Trees
At each step, the training set is divided into two (or more) parts according to a particular splitting rule.
Usually these algorithms are greedy, which means they look for a locally optimal
solution at each specific step. The popular algorithms for building trees are:
ID3 (one of the oldest algorithms, Iterative Dichotomiser 3 was invented by Ross
Quinlan),
C4.5, C5.0 (extensions of the ID3 algorithm developed by the same author, which prune
the tree after it is built with ID3),
CART (Classification And Regression Tree is optimized for both classification (Gini
impurity as the measure) and regression (MSE as the measure) trees and is implemented in scikit-
learn).
[Figure: Decision Tree structure, using the example Decision Tree Classifier above. Image by Author]
Different impurity measures can be used for calculating information gain. The decision tree algorithm
then uses information gain to choose how to split a particular node; the most common measures are sketched below:
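Standard definitions (notation assumed here: p_k is the fraction of class-k samples in node S, and S_j are the child nodes produced by a split):

    \text{Gini impurity:} \quad G(S) = 1 - \sum_k p_k^2
    \text{Entropy:} \quad H(S) = - \sum_k p_k \log_2 p_k
    \text{Information gain:} \quad IG(S, \text{split}) = H(S) - \sum_j \frac{|S_j|}{|S|} H(S_j)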
So-called decision tree pruning performs better than simply limiting the depth of the
tree. This is the procedure in which we build a tree of full depth and then remove its insignificant
nodes. However, this process is more resource-intensive.
Main hyperparameters:
maximum depth of the tree - the smaller it is, the less overfitting; usually 10-20
minimum number of objects in a leaf - the greater it is, the less overfitting; usually 20+
Pros:
Simple interpretation
Simple implementation
Computational simplicity
Does not require feature preprocessing and can handle missing values
Feature importance can be calculated using information gain
Cons:
Unstable and variable (a consequence of the greedy algorithm) - a small change in the input
data can completely change the structure of the tree
High sensitivity to the content of the training set and to noise
Poorly recovers complex (non-linear) dependencies
Tendency to overfit at a large tree depth
Unlike linear models, they do not extrapolate (they can only predict values in the
range between the minimum and maximum values of the training set)
Ensemble Methods
Ensemble methods (also ensemble learning) are techniques that create multiple models and then
combine them to produce improved results. Ensemble methods usually produce more accurate
solutions than a single model would.
Bagging
Bagging stands for bootstrap aggregating.
Suppose we have a training set X_train of size N x M (N data points and M features). We then train n
models (often decision trees), each on its own random subsample X (of size N x M) of X_train. When X is
formed with replacement, the algorithm is called bagging; when X is formed without
replacement, it is called pasting. When this model makes a prediction, it actually gets n
predictions from the n different models and aggregates them: classification is computed from a
simple majority vote of the models, and regression is computed from the mean of the models'
predictions.
Pasting was originally designed for large datasets, when computing power is limited. Bagging,
on the other hand, can use the same samples many times, which is great for smaller datasets,
where it improves robustness.
This approach keeps the bias the same but decreases the variance, in the spirit of the Central Limit
Theorem: averaging many weakly correlated predictions reduces their variance. The more variable the base algorithms are, the lower the correlation of their predictions and,
accordingly, the better this averaging works (decision trees are a great choice).
If we are using bagging, there is a chance that some samples are never selected for a given bootstrap, while others
may be selected multiple times. In general, for a big dataset, about 37% of its samples are never
selected (since (1 - 1/N)^N tends to e^(-1) ≈ 0.37 for large N), and we can use them to evaluate our model. This is called Out-of-Bag scoring, or OOB
scoring.
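A minimal sketch with scikit-learn's BaggingClassifier (the default base model is a decision tree; hyperparameter values are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # bootstrap=True -> bagging (sampling with replacement); bootstrap=False -> pasting
    bagging = BaggingClassifier(
        n_estimators=100,
        max_samples=1.0,   # fraction of the training set drawn for each base model
        bootstrap=True,
        oob_score=True,    # evaluate on the ~37% of samples left out of each bootstrap
        random_state=0,
    )
    bagging.fit(X, y)
    print(bagging.oob_score_)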
Main hyperparameters:
type of models
n_estimators - the number of models in the ensemble
max_samples - the number of samples to take from train set to train each base model
max_features - the number of features to take from train set to train each base model
Pros:
Cons:
Complexity of interpretation
Does not cope well with a very large number of features or with sparse data
Trains and makes predictions significantly slower than linear models
Random Forest
Although bagging can be applied to all types of algorithms, bagging over decision
trees has become the most widespread. Since trees are unstable and variable, a good result is obtained. In
fact, random forest is bagging over decision trees combined with the random subspace method.
Suppose we have a training set X_train of size N x M (N data points and M features). We then train n
trees, each on its own random subsample X (of size N x m) drawn from X_train with replacement, where we also
take only a random subset of m (m < M) features. This is called the Random Subspace Method.
When this model makes a prediction, it actually gets n predictions from the n different trees and
aggregates them: classification is computed from a simple majority vote of the trees, and
regression is computed from the mean of the trees' predictions.
As with bagging, this approach keeps the bias the same but decreases the variance.
The related Isolation Forest algorithm can also be used for the anomaly detection problem.
Main hyperparameters:
n_estimators - the number of trees in the ensemble - the more the better
max_features - the number of features to draw from the training set to train each base tree - M/3
for regression and sqrt(M) for classification is recommended (where M is the total number of features)
max_depth - the maximum depth of the tree
min_samples_leaf - the minimum number of samples required to be at a leaf node
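A minimal sketch; the hyperparameter values are illustrative assumptions and should be tuned:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=300,      # more trees rarely hurt quality, they just cost more time
        max_features="sqrt",   # random subset of features considered at each split
        max_depth=None,
        min_samples_leaf=1,
        n_jobs=-1,
        random_state=0,
    )
    forest.fit(X, y)
    print(forest.feature_importances_[:5])  # impurity-based feature importances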
Extra Trees
Extra Trees is related to the widely used random forest algorithm.
Unlike bagging and random forest, which train each decision tree on a bootstrap
sample of the training dataset, the Extra Trees algorithm trains each decision tree on the
whole training dataset.
Like random forest, the Extra Trees algorithm will randomly sample the features at
each split point of a decision tree.
Unlike random forest, which uses a greedy algorithm to select an optimal split point, the
Extra Trees algorithm randomly selects a split point.
It can often achieve performance as good as or better than the random forest algorithm,
and since it uses a simpler procedure to construct the decision trees that are members of the
ensemble, it also works faster.
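A minimal sketch comparing the two ensembles on the same toy data (illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Same number of trees for both; Extra Trees typically trains faster
    for Model in (RandomForestClassifier, ExtraTreesClassifier):
        scores = cross_val_score(Model(n_estimators=200, random_state=0), X, y, cv=5)
        print(Model.__name__, scores.mean())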
Boosting
Boosting is an ensemble of weak learners (models whose prediction accuracy is only slightly better than
random) that are trained sequentially, where each subsequent model takes the errors of the
previous one into account.
The general idea of boosting can be implemented in different ways. The three most popular types
of boosting are:
AdaBoost
AdaBoost stands for Adaptive Boosting. This is a greedy iterative algorithm. At each step
it identifies misclassified data points and adjusts their weights so that the next model focuses
on them, minimizing the training error.
This version of boosting is sensitive to outliers.
Gradient Boosting
Gradient Boosting is also called Gradient Boosting Machine - GBM. As with any boosting
implementation, at each step this algorithm tries to minimize the errors made in the
previous steps. But instead of changing the weights (like AdaBoost), GBM trains the next
model on the residual errors of its predecessor. One of the implementations of GBM is
LightGBM.
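A minimal sketch of the residual-fitting idea with two regression trees built by hand (squared-error loss assumed; real libraries add a learning rate and many more rounds):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    # The first weak model fits the targets directly
    tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)

    # The second weak model fits the residual errors of the first
    residuals = y - tree1.predict(X)
    tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

    # The ensemble prediction is the sum of the stage predictions
    y_pred = tree1.predict(X) + tree2.predict(X)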
XGBoost
XGBoost stands for eXtreme Gradient Boosting. This implementation was designed for
speed and performance - it supports parallel training, GPUs, and Hadoop. XGBFIR is a great
library for XGBoost feature importance analysis.
Since the main implementations of boosting still use decision trees as base models, boosting,
like random forest, can determine the importance of features. Moreover, the popularity of boosting
has produced many libraries that allow a more detailed analysis (for example, the XGBFIR
library lets you analyze not only individual feature importances but also the importance of pairwise
and even three-way feature combinations).
Main hyperparameters:
Pros:
Cons:
Training is slower than random forest, because the learning process has to be strictly
sequential (although implementations like XGBoost or LightGBM mitigate
this)
Prone to overfitting
Works well only with sufficiently large datasets
Stacking
The architecture of a stacking model involves two or more base models, often referred to as
level-0 models, and a meta-model that combines the predictions of the base models, referred
to as a level-1 model.
Level-0 Models (Base Models): The training data is divided into K folds, and the base models are
trained on K-1 folds at a time, so that their out-of-fold predictions can be collected.
Level-1 Model (Meta-Model): A model that learns how to combine the predictions of the
base models in the best possible way.
Unlike bagging, in stacking, the models are typically different (e.g. not all decision
trees).
Unlike boosting, in stacking, a single model is used to learn how to best combine the
predictions from the contributing models (e.g. instead of a sequence of models that
correct the predictions of prior models).
Using a simple linear model as the meta-model often gives stacking the colloquial name
blending.
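A minimal sketch with scikit-learn's StackingClassifier; the choice of base models and meta-model is an illustrative assumption:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Level-0 (base) models: deliberately different model families
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ]

    # Level-1 (meta) model combines the base models' out-of-fold predictions (cv=5 folds)
    stack = StackingClassifier(estimators=base_models,
                               final_estimator=LogisticRegression(), cv=5)
    stack.fit(X, y)
    print(stack.score(X, y))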
Main hyperparameters:
Pros:
Cons:
Conclusions
The most popular supervised learning algorithms were described here (of course, there are
others). As a conclusion, I want to describe the process of choosing an algorithm to solve a
typical supervised learning task (classification or regression). It's very simple - you just need to
answer two questions.
Is your data sparse? If yes, then you will have to use linear methods. This is usually an SVM, and
with different kernels it will allow you to recover complex dependencies. Remember that linear
methods require data preprocessing, which can be problematic in some cases.
If your data is dense, you are luckier. Now everything depends on its amount: if there is a lot of
data, use boosting; otherwise, use random forest. Both of these algorithms are powerful and
resistant to noise and will give you good quality, but they take a long time to
train and predict. Also, remember that boosting is prone to overfitting.
What does a lot of data mean? How much is it? A threshold of about 100 thousand samples is usually
mentioned, but in any case you can (and most likely will) try different algorithms.
This is just a recommendation, and you should try different algorithms with different
hyperparameters to solve your task in the best way.