PerceptiLabs-ML Handbook
Preface
The aim of this handbook is to make it easier to understand and to start using machine learning. In this handbook, one can find everything from classical machine learning algorithms to guidance on how to preprocess data appropriately. Some algorithms are described in a univariate form (one independent variable) and sometimes in a multivariate form (several independent variables). This is only for ease of describing the algorithm, and not a restriction on how they work.
Table of contents

Preface
1.0 Linear Regression
1.1 Preprocessing
1.2 Algorithms
2.0 Decision Tree
2.1 Preprocessing
2.2 Postprocessing
2.3.1 Algorithm
2.4.1 Algorithm
3.0 k-Nearest Neighbor
3.1 Preprocessing
3.2 Algorithm
4.0 Support Vector Machine
4.1 Preprocessing
4.2 Algorithm
5.0 Neural Networks
5.1 Preprocessing
5.2 Algorithm
6.0 Clustering
6.1 Preprocessing
6.2 Algorithm
7.0 Appendix
7.1.5 Neural Networks
7.1.6 Clustering
7.3 Mathematics
7.3.1 Backpropagation
1.0 Linear Regression
Linear regression is the most basic type of regression. It is often used in predictive analysis. The regression estimates are
used to explain the relation between one dependent variable (Y) and one (or more) independent variable(s) (X). The goal is
to minimize the sum of the squared errors to fit a straight line to a set of data points (see Figure 1).
Figure 1
Typical uses of linear regression include:
• Forecasting an effect
• Trend forecasting
Linear regression assumes a link or relationship between one (or more) independent variables and a dependent variable.
It is important to consider the model fit when choosing which model to use, i.e., choosing the regression coefficient b and the constant c. Adding more independent variables to a linear regression model will always increase the variance explained by the model (often expressed as $R^2$). Adding more variables makes the model more inefficient, and overfitting can occur. Occam’s
razor applies here: always choose the simplest model possible, i.e., as few independent variables as possible.
Underfitting can also occur. This happens when the model estimates are biased. It often occurs when linear regression is
used to attempt to prove a relationship that does not exist.
The standard way to evaluate the model is to use a cost function. The cost for each model is then calculated as the root mean squared error (RMSE) or the mean squared error (MSE). The chosen model is the one that has the minimum RMSE or MSE. But a model (the parameters b and c) can be chosen in different ways. The most common ways are shown in Eq. 3 and Eq. 4. This approach
should give the optimum solution. Another way is to use the stochastic gradient descent (SGD), which is the most popular
way to calculate the parameters in neural networks. By using SGD, one must set hyperparameters such as the learning rate
and the number of training iterations. To choose the perfect hyperparameters, trial and error is necessary. Also, more
preprocessing of the data, like scaling features, is often necessary.
$b = \dfrac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$ (3)

$c = \bar{y} - b\,\bar{x}$ (4)
Linear regression is a global model, in which a single predictive formula holds over the entire data space. If the data have lots
of features that interact in nonlinear ways, using a single global model is not the way to go.
1.1 Preprocessing
Remove outliers from the dataset. Outliers will affect the regression negatively by giving a large mean square error (MSE).
The red dot in Figure 2 is an outlier.
Figure 2
Be aware of missing values. A missing value is when there is a NaN or a zero instead of a sample in the dataset. One common way to handle a missing value is to replace it with, for example, a:
• Global constant
If using SGD and multiple linear regression, perform feature normalization so that the mean value of each feature in the data
is zero and the standard deviation is one.
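As a small illustration of this step, here is a minimal NumPy sketch of feature normalization (the function name standardize and the example data are only illustrative, not taken from the handbook's own listings):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant features
    return (X - mean) / std, mean, std

# Example: three samples with two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled.mean(axis=0))     # ~[0, 0]
print(X_scaled.std(axis=0))      # ~[1, 1]
```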
1.2 Algorithms
Simple Linear Regression
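The handbook's original listing is not reproduced here; the following is a minimal NumPy sketch of simple linear regression using the closed-form least-squares estimates for b and c described above (the example data are made up for illustration):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Fit y = b*x + c with ordinary least squares (closed form)."""
    x_mean, y_mean = x.mean(), y.mean()
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - b * x_mean
    return b, c

# Example: noisy points around y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.shape)
b, c = simple_linear_regression(x, y)
rmse = np.sqrt(np.mean((y - (b * x + c)) ** 2))
print(b, c, rmse)
```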
Multiple Linear Regression
SGD Simple Linear Regression
SGD Multiple Linear Regression
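A minimal sketch, assuming NumPy, of fitting a multiple linear regression model with SGD; the learning rate and number of epochs are the hyperparameters discussed in Section 1.0, and the values used here are only examples:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100):
    """Fit y = X.w + c with stochastic gradient descent on the squared error.
    Assumes the features in X have already been normalized."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    c = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n_samples):
            error = (X[i] @ w + c) - y[i]
            w -= lr * error * X[i]      # gradient of 0.5*error^2 w.r.t. w
            c -= lr * error             # gradient w.r.t. the constant c
    return w, c

# Example with two features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.5 + rng.normal(0, 0.1, size=200)
w, c = sgd_linear_regression(X, y, lr=0.05, epochs=50)
print(w, c)
```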
2.0 Decision Tree
Decision trees are very popular, since they are so easy to understand. The final decision tree can explain exactly why a
certain prediction was made. A decision tree identifies the most significant variable/attribute and the value that gives the
most homogeneous set of data.
In a decision tree, each internal node (non-leaf) is labeled with an input feature. The arcs from such a node are labeled with the possible values of that feature, and each arc leads either to a leaf labeled with an output value or to a subordinate decision node on another input feature.
To create a tree as a machine learning model, one must select input variables and split points on those variables until a suitable tree is created (see Figure 3 for a comparison between an ordinary rule-based decision tree and a decision tree used in machine learning). A tree is created from the top down.
Figure 3
The tree to the left is rule-based, and it uses all the attributes in the data to make the final decisions. The tree to the right is created by using machine
learning, and therefore it only uses as many attributes as it needs (fewer than the rule-based tree) to make an accurate prediction/classification.
To select these variables and to get each splitting point, a greedy algorithm is used. That is, the solution is based on the
benefit of the next step without considering the larger problem as a whole. The greedy algorithm is used to minimize a cost
function. The tree construction ends when a previously defined stopping criterion is met.
To perform the greedy splitting (also known as recursive binary splitting), i.e., to choose which attribute to use in the
root/node, all possible splits at that point in the training are tried and tested using a cost function. The split with the lowest
cost is selected.
Tree models in which the target variable can take a discrete set of values are called classification trees, and tree models in
which the target value can take continuous values (often real numbers) are called regression trees.
The cost function for a classification problem will look different than the function for a regression problem.
2.1 Preprocessing
When training a decision tree, it is necessary to know when to stop training. This is best done by splitting the dataset so
that cross-validation can be used. This is a way to split the dataset when the most common way of splitting a dataset
([train-set 70%, validation-set 20%, test-set 10%] or [train-set 70%, validation-set 30%]) is not viable because the dataset is
too small.
The training data are used to train the model, which should always increase the accuracy, while the validation set is used to
validate the model after a training iteration to ensure that its accuracy is also increasing. If the accuracy of the validation set
starts to decrease, the model is overfitting, and the training should stop. The test set is usually used for a final prediction.
Cross-validation is a technique that averages the prediction errors over a partitioned dataset to get a more accurate
estimate of a model's performance. The most common cross-validation method is called k-fold.
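A minimal sketch of k-fold cross-validation; the fit and evaluate callbacks are hypothetical placeholders for whatever model and error measure are being validated:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, evaluate):
    """Average the prediction error over k folds.
    `fit(X_train, y_train)` returns a model; `evaluate(model, X_val, y_val)` returns an error."""
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.mean(errors)

# Example with a trivial "predict the training mean" model
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1
fit = lambda X_tr, y_tr: y_tr.mean()
evaluate = lambda model, X_val, y_val: np.mean((y_val - model) ** 2)
print(k_fold_cv(X, y, k=5, fit=fit, evaluate=evaluate))
```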
2.2 Postprocessing
To avoid overfitting, i.e., the model learning the training data too well, we prune the tree. This can be done in several ways.
One common, simple, and fast way is to replace a node by a leaf labeled with the most common class in that node, and then to see if this change significantly affects the accuracy of the trained model. If it does not, the change is kept. The procedure is repeated until a pruning step changes the accuracy of the model significantly.
The first thing to do is to choose the best attribute to test first. To decide which attribute should be the root, one must look at the information content of every attribute and then compare them. There are different ways to do this; the most common method is to compare the entropy (Eq. 5) of the data. Another way is to compare the Gini index (i.e., to calculate the impurity).
When using the entropy as the method to select important features to use as nodes, we want to consider the information
gain (we want to maximize the information gain; Eq. 7), which is calculated from the entropy and the conditional entropy
(Eq. 6).
$H(X) = -\sum_{i=1}^{J} p_i \log_2 p_i$ (5)

$H(X \mid A) = \sum_{a \in A} P(A = a)\, H(X \mid A = a)$ (6)

$IG(X, A) = H(X) - H(X \mid A)$ (7)
Where J is the number of classes and $p_i$ are fractions that add up to 1 and represent the percentage of each class in the child node that results from a split in the tree. X is the data and A is the attribute.
It is also possible to consider the number of splits (we want as few splits as possible; Eq. 8), so that the attribute with the
highest ratio of Eq. 7 and Eq. 8 is chosen.
$SplitInfo(X, A) = -\sum_{a \in A} \dfrac{|X_a|}{|X|} \log_2 \dfrac{|X_a|}{|X|}$ (8)
The Gini index or impurity explains how often a randomly chosen element from the set would be incorrectly labeled if it was
labeled randomly according to the distribution of labels in the subset. Eq. 9 shows how it is calculated.
$G = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$ (9)
Where J is the number of classes and $p_i$ are fractions that add up to 1 and represent the percentage of each class in the child node that results from a split in the tree.
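To make Eq. 5, Eq. 7, and Eq. 9 concrete, here is a minimal NumPy sketch of the entropy, information gain, and Gini impurity computations (the small weather/play example is made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Eq. 5: entropy of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """Eq. 7: entropy minus the conditional entropy (Eq. 6) after splitting on `attribute`."""
    cond = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

def gini(labels):
    """Eq. 9: Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Example: a "weather" attribute and a "play" label
play    = np.array(["yes", "yes", "no", "no", "yes", "no"])
weather = np.array(["sun", "sun", "rain", "rain", "sun", "sun"])
print(entropy(play), information_gain(play, weather), gini(play))
```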
Common stopping criteria include, e.g., when:
• There are no more examples left
2.3.1 Algorithm
Decision Tree for Binary Classification
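The handbook's original listing is not reproduced here; the following is a minimal sketch of a decision tree for binary classification built with recursive binary splitting, using the Gini index as the cost function and simple stopping criteria (pure node, minimum node size, maximum depth). The names and the tiny example dataset are illustrative only:

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy step: try every feature and threshold, keep the split with the lowest weighted Gini cost."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            right = ~left
            if left.all() or right.all():
                continue
            cost = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / len(y)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

def build_tree(X, y, min_samples=2, depth=0, max_depth=5):
    """Recursive binary splitting; stops on pure nodes, small nodes, or maximum depth."""
    if len(np.unique(y)) == 1 or len(y) < min_samples or depth >= max_depth:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    split = best_split(X, y)
    if split is None:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    _, feature, threshold = split
    mask = X[:, feature] <= threshold
    return {"leaf": False, "feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], min_samples, depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], min_samples, depth + 1, max_depth)}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

# Example: a tiny two-class problem
X = np.array([[1.0, 5.0], [2.0, 4.0], [6.0, 1.0], [7.0, 2.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
print([predict(tree, x) for x in X])   # [0, 0, 1, 1]
```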
To handle the splits in the regression tree, look at the standard deviation (Eq. 10), rather than calculating the entropy or the
Gini index.
In Eq. 10 - Eq. 12, $\bar{x}$ is the mean, P is the probability, and N is the number of values. The split with the highest standard deviation reduction (SDR), calculated as in Eq. 12, is the one chosen.
A common stopping criterion in this case is when a node contains a minimum number of data points, e.g., when a node
contains less than 5% of the data.
$S(X) = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N}}$ (10)

$S(X, A) = \sum_{a \in A} P(a)\, S(X_a)$ (11)

$SDR(X, A) = S(X) - S(X, A)$ (12)
2.4.1 Algorithm
Decision Tree for Regression
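A minimal sketch of the standard deviation reduction (Eq. 10 - Eq. 12) used to choose a split in a regression tree; the example data are made up for illustration:

```python
import numpy as np

def sdr(y, attribute):
    """Eq. 12: standard deviation reduction for a split on a categorical attribute."""
    total_std = y.std()                                       # Eq. 10
    weighted_std = 0.0
    for value in np.unique(attribute):
        subset = y[attribute == value]
        weighted_std += len(subset) / len(y) * subset.std()   # Eq. 11
    return total_std - weighted_std

# Example: hours played (target) split by a weather attribute
hours   = np.array([25.0, 30.0, 46.0, 45.0, 52.0, 23.0, 43.0, 35.0])
weather = np.array(["sun", "sun", "rain", "rain", "rain", "sun", "over", "over"])
print(sdr(hours, weather))   # the attribute with the highest SDR would be chosen for the split
```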
3.0 k-Nearest Neighbor
k-NN is a type of lazy learning, which means that generalization beyond the training data is postponed until a query is made to the system. This is one of the simplest machine learning algorithms. It is most useful for large datasets with
few attributes. As the number of attributes (the dimensions) increases, the input space increases exponentially. This will lead
to data points with similar attributes that may be separated by large distances, which is a problem for the k-NN algorithm.
This problem is more generally known as the curse of dimensionality.
The k-NN algorithm takes the k nearest neighbors (see Figure 4) and either takes the means or the medians of those values
(regression) to predict the next sample or takes the modes of those labels (classification) to predict the label of the next
sample. In the classification case, it is wise to make k an odd number if the number of classes is even and vice versa to avoid
a tie.
Figure 4
For a more accurate prediction, in some cases it is possible to weight each neighbor by $\frac{d}{D}$, where d is the distance to the actual neighbor and D is the total distance to all neighbors.
There are several distance measures to choose from, e.g.:
• Euclidean
• Manhattan
• Minkowski
• Jaccard
• Mahalanobis
• Cosine
where the most common one is the Euclidean distance. However, it is important to use the distance measure that suits the data/problem to solve, e.g., if the input variables are similar in type, Euclidean distance might be the way to go, and if they are not, another measure such as Manhattan distance might be a better choice.
3.1 Preprocessing
The following preprocessing methods are not necessary for the use of the k-NN algorithm, but some of them might give better results:
• PCA
• ICA
• LDA
3.2 Algorithm
k-NN
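A minimal sketch of k-NN classification with Euclidean distance, where the predicted label is the mode of the k nearest neighbors' labels (the example points are made up):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Predict the label of `x` as the mode of the labels of its k nearest neighbors."""
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: two small clusters of points
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 1.0]), k=3))  # 0
print(knn_classify(X_train, y_train, np.array([5.1, 5.0]), k=3))  # 1
```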
4.0 Support Vector Machine
A support vector machine (SVM) is a supervised machine learning algorithm that is commonly used for classification,
regression, and outlier detection, and it is considered one of the best off-the-shelf algorithms. The fact that it is
supervised means that you need to have labeled data for it to learn. As with many other machine learning algorithms, some
hyperparameters need to be set.
The goal of an SVM is to maximize the margin between the decision boundary and the closest data point. The decision
boundary is the line that separates the data that belong to different classes (see Figure 5).
Figure 5
In the ideal case, the data are linearly separable, with no outliers to destroy the classification. This is not always the case,
but luckily SVM is flexible and capable of handling both nonlinear cases and cases where a few outliers do not match.
To handle nonlinear cases, SVM uses the kernel trick. A common way in machine learning to separate data that are not
linearly separable is to “send” the data to a higher dimension and try to separate them there instead. Doing this explicitly for every point requires a fair bit of calculation, though, and to avoid that, the kernel trick is used.
To deal with outliers and with margins that are too small, SVM uses slack variables to ensure that not every point needs to
be taken into account fully.
There are two different ways to approach the SVM and to calculate the decision boundary: solving the primal problem (Eq.
13) or the dual problem (Eq. 15). They are both optimization problems, which aim to maximize the margin, but they are
different approaches to the same problem.
$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \max\!\big(0,\ 1 - y_i\, f(x_i)\big)$ (13)
Here N is the number of samples, y is the label, x is the input, w is the weight vector, and f is given by Eq. 14. C is the regularization parameter, and it decides how much slack we will give our decision boundary. A small C allows constraints to be easily ignored, which leads to large margins. A large C has the opposite effect. Setting C to infinity enforces all constraints.
The decision boundary, i.e., the final model (Eq. 14), is much like the linear regression model, only now with a built-in margin.
$f(x) = w^{T}x + b$ (14)
The benefit of the dual problem is that it is often much more efficient to use in high dimensions, and it also allows for the kernel trick, which the primal problem does not. In other words, use the dual problem solution if you want nonlinear decision boundaries.
$\max_{a}\ \sum_{i=1}^{N} a_i - \tfrac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j y_i y_j\, k(x_i, x_j), \quad \text{subject to } 0 \le a_i \le C \text{ and } \sum_{i=1}^{N} a_i y_i = 0$ (15)
We can also see the kernel trick being used in calculating the decision boundary (the final model, Eq. 16). Since there are only a few support vectors ($a_i$ is often 0), the boundary is fast and efficient to calculate.
$f(x) = \sum_{i=1}^{N} a_i y_i\, k(x_i, x) + b$ (16)
(17) (18)
(19) (20)
To apply SVM on a regression problem, simply use the normal regression algorithm and replace the loss and regularization in
the cost function with hinge loss and squared regularization as in SVM.
4.1 Preprocessing
It can be a good idea to scale the inputs.
4.2 Algorithm
SVM Primal
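The handbook's own listing is not reproduced here; as a sketch, the primal problem can be attacked with stochastic sub-gradient descent on the hinge loss (a simplified, Pegasos-style update). The hyperparameters C, the learning rate, and the number of epochs are example values:

```python
import numpy as np

def svm_primal_sgd(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin SVM trained with stochastic sub-gradient descent on the hinge loss.
    Labels y must be -1 or +1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n_samples):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                      # point inside the margin: hinge loss is active
                w -= lr * (w - C * y[i] * X[i])
                b -= lr * (-C * y[i])
            else:                               # only the regularization term contributes
                w -= lr * w
    return w, b

# Example: two linearly separable blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, size=(50, 2)),
               rng.normal([2, 2], 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = svm_primal_sgd(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())   # accuracy, should be close to 1.0
```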
5.0 Neural Networks
Neural networks (see Figure 6) are mathematical models inspired by the biological neural networks, i.e., our brains.
Figure 6
Neural networks consist of multiple layers of neurons (also called units). There is an input layer, there may be one or more hidden layers, and there is an output layer. Normally, a neural network is defined as a multi-layer perceptron (MLP) if there are two or more hidden layers, and it is then also known as a deep neural network.
Perceptrons are one sort of neuron. Very briefly, the perceptron takes several binary inputs and produces a single binary output (see Figure 7).
Figure 7
The perceptron.
Each binary input is multiplied by a weight, and if the sum of all these input multiplications is greater than some threshold
value, the output will be 1; else it will be 0. A common approach is to move the threshold value to the same side as the
weight and input, and to call it the bias.
$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j + b \le 0 \\ 1 & \text{if } \sum_j w_j x_j + b > 0 \end{cases}$ (21)
Today, it is more common to have neurons that can take inputs other than binary values, and that can also give an output
that is not binary.
Each neuron has several inputs, each multiplied by a weight. These multiplications are added to the threshold value, also
called the bias, as mentioned above. The final value becomes an input to an activation function (the sigmoid function is often
used as an activation function, see Figure 8 and Eq. 22), which gives an output that goes to the next layer of neurons. This
output differs from the perceptron's, since it is not binary. The great difference is that here a small change in the weights and
bias will only cause a small change in the output, which is beneficial for the learning process.
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$ (22)
Figure 8
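A minimal sketch of a single sigmoid neuron, i.e., Eq. 22 applied to the weighted inputs plus the bias (the weights and inputs are made-up example values):

```python
import numpy as np

def sigmoid(z):
    """Eq. 22: the sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """Weighted inputs plus bias, passed through the activation function."""
    return sigmoid(np.dot(w, x) + b)

# Example: a single neuron with three inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
b = 0.1
print(neuron_output(x, w, b))
```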
The weights and biases must be initialized before training starts, usually with a bias of zero and with random weights. The neural network is evaluated by a cost function.

During the forward pass, i.e., using the data as input, the input is sent through the network, where each neuron, in each layer, either activates (sends an output) or does not (no output). This will lead to different outputs from the network, depending on the values of the weights and biases. So, for a specific input and specific weights and biases, the network gives an output that is compared with the truth, i.e., the correct classification (for example) for that input.

It is common to use mini-batches when training, instead of using the whole input at once. This leads to a higher update frequency for the variables and leads to robustness.

(23)

(24)
The most common cost function is the cross-entropy function (Eq. 25). The cross-entropy function considers two probability
distributions: the true probability (the true label) and the given distribution (the prediction). The cross-entropy function gives
a measure of similarity between these two probability distributions. The cross-entropy function is defined as:
$H(p, q) = -\sum_{x} p(x) \log q(x)$ (25)
There are several important parameters that control other parameters in the network. These hyperparameters must be
selected. It is here that the validation data are used. After each epoch (one run over the network with all training data),
when new weights and biases have been calculated, the network is tested with the validation data. To do this iteratively
after each epoch, one can plot the cost function of the training data and the validation data. From this plot, it is possible to
extract information that tells us whether a certain hyperparameter should be changed or not. This is an iterative process,
and training neural networks consists of lots of trial and error to find the optimal hyperparameters, if it is even possible.
After this whole procedure, the test data is used to evaluate the network as a final product.
5.1 Preprocessing
Two methods are common for preprocessing the data when using neural networks:
• Transformation, e.g.:
• One-hot encoding
• Normalization
5.2 Algorithm
Neural Network
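The handbook's original listing is not reproduced here; the following is a minimal NumPy sketch of a network with one hidden layer, trained with mini-batch SGD on the binary cross-entropy cost via backpropagation (see Section 7.3.1). The architecture, learning rate, and number of epochs are example values only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass through one hidden layer and a sigmoid output neuron."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(x @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def train(X, y, hidden=8, lr=0.5, epochs=2000, batch_size=16):
    """Mini-batch SGD on the binary cross-entropy cost."""
    n, d = X.shape
    params = [rng.normal(0, 1, (d, hidden)), np.zeros(hidden),
              rng.normal(0, 1, (hidden, 1)), np.zeros(1)]
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = X[batch], y[batch].reshape(-1, 1)
            W1, b1, W2, b2 = params
            a1, a2 = forward(xb, params)
            # Backpropagation: with a sigmoid output and cross-entropy, the output error is (a2 - y)
            delta2 = (a2 - yb) / len(batch)
            delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
            params = [W1 - lr * xb.T @ delta1, b1 - lr * delta1.sum(axis=0),
                      W2 - lr * a1.T @ delta2, b2 - lr * delta2.sum(axis=0)]
    return params

# Example: learn XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
params = train(X, y)
print(forward(X, params)[1].round(2).ravel())   # should be close to [0, 1, 1, 0]
```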
6.0 Clustering
Clustering algorithms are probably the most common unsupervised machine learning algorithms. Unsupervised learning is
when there is no truth/label with which to compare the prediction. They divide the data into groups based on the similarity
of the data (see Figure 9). There are several different types of clustering:
• Connectivity models
• Centroid models
• Distribution models
• Density models
• Etc.
Figure 9
The most famous and popular clustering algorithm is k-means clustering (centroid model). The downside with this algorithm
is that one must know in advance how many clusters/classes there should be (k clusters). There are clustering methods (e.g.,
density models, like DBSCAN) that do not require that knowledge, but they are more computationally heavy.
The k in k-means clustering is the number of cluster centers there will be. Each data point is assigned to the cluster whose mean has the least squared Euclidean distance to that point, and the cluster means are then recomputed from the assigned points. This process is repeated iteratively until the assignments no longer change. There is no guarantee that the optimum will be found using this algorithm.
6.1 Preprocessing
The following preprocessing methods should be considered when using k-means clustering:
• Missing value
• Data/unit normalization
• PCA
• ICA
• LDA
6.2 Algorithm
k-Means Clustering
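A minimal NumPy sketch of k-means clustering as described above (Lloyd's algorithm); the example data are two artificial blobs:

```python
import numpy as np

def k_means(X, k, iterations=100, seed=0):
    """Assign points to the nearest center, then recompute the centers, until nothing changes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break            # assignments no longer change
        centers = new_centers
    return labels, centers

# Example: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([3, 3], 0.3, size=(50, 2))])
labels, centers = k_means(X, k=2)
print(centers)               # close to [0, 0] and [3, 3]
```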
7.0 Appendix
Missing Values
Outliers
Feature Normalization
Cross-Validation (k-fold)
Scale
Missing Values
Dimensionality Reduction (PCA)
Scale
7.1.5 Neural Networks
Normalization
7.1.6 Clustering
Missing Values
Unit Normalization
Dimensionality Reduction (PCA)
Pruning
7.3 Mathematics
7.3.1 Backpropagation
Backpropagation gives an expression for the derivative of the cost function C with respect to the weights and biases. From that, we can calculate how fast the cost changes when the weights and biases in the network change.
For a cost function to be usable for backpropagation, it needs to fulfill two criteria:
• it can be written as an average over the cost functions of individual training examples
• it can be written as a function of the outputs from the neural network
We calculate the partial derivatives $\frac{\partial C}{\partial w^l_{jk}}$ and $\frac{\partial C}{\partial b^l_j}$, where $w^l_{jk}$ is the weight from neuron k in layer (l - 1) to neuron j in layer l and $b^l_j$ is the bias of the j-th neuron in layer l. Then we relate each partial derivative to $\delta^l_j$, which is the error in the j-th neuron in layer l.
What we want is for the neural network to perform with as high an accuracy as possible. To increase its accuracy, one must minimize the error. The error arises because the output from an activation function is different from what would be optimal: instead of the activation output $a^l_j = \sigma(z^l_j)$ we would want the output

$a^l_j = \sigma(z^l_j + \Delta z^l_j)$ (26)

but what we actually get is $\sigma(z^l_j)$. When this difference propagates through the network, the final cost changes by approximately $\frac{\partial C}{\partial z^l_j}\Delta z^l_j$. Therefore, we choose $\Delta z^l_j$ so that we minimize the final cost as much as possible, e.g., $\Delta z^l_j$ can be set to the opposite sign of $\frac{\partial C}{\partial z^l_j}$ to make sure that the cost becomes smaller. This tells us that $\frac{\partial C}{\partial z^l_j}$ acts as the error in each neuron, i.e.,

$\delta^l_j \equiv \dfrac{\partial C}{\partial z^l_j}$ (27)

For the output layer L, this error is

$\delta^L_j = \dfrac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)$ (28)
Or, in matrix-form:
$\delta^L = \nabla_a C \odot \sigma'(z^L)$ (29)

$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)$ (30)

$\dfrac{\partial C}{\partial b^l_j} = \delta^l_j$ (31)

$\dfrac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ (32)
where ⨀ is the Hadamard product, also known as elementwise multiplication, and $\sigma$ is the activation function.
Eq. 28 is the error in the output layer L, Eq. 30 is the error expressed in terms of the error in the next layer
(l+1), Eq. 31 is the rate of change of the cost with respect to the bias, and Eq. 32 is the rate of change of the
cost with respect to the weight.
If $\delta^l_j$ in Eq. 32 is small, it is said that the weight learns slowly. Depending on the activation function, this is a consequence of the neuron saturating (e.g., with the sigmoid as activation function).
All four equations, Eq. 28 - Eq. 32, are consequences of the chain rule:

$\delta^L_j = \dfrac{\partial C}{\partial z^L_j} = \sum_k \dfrac{\partial C}{\partial a^L_k}\dfrac{\partial a^L_k}{\partial z^L_j}$ (33)

But the output from the activation function of the k-th neuron only depends on the weighted input $z^L_j$ when $k = j$, so every term with $k \neq j$ vanishes and the sum collapses to $\delta^L_j = \frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)$. This is Eq. 29 in component form, i.e., Eq. 28. To prove Eq. 30, one can again apply the chain rule:
(36) (37)
(38)
(39)
(40)
where the first factor has already been proven to be the error and the second factor is:
(41)
(42)
For Eq. 31, the first derivative factor is the same as Eq. 40 and the second derivative factor is:
(43)
(44)
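As a concrete illustration of Eq. 29 - Eq. 32, here is a minimal NumPy sketch for a network with one hidden layer. A quadratic cost is assumed only because its gradient with respect to the output activations is simply $a^L - y$; the weight shapes follow the convention that $w^l_{jk}$ connects neuron k in layer l-1 to neuron j in layer l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input activations a^0
y = np.array([1.0, 0.0])               # target output

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 (hidden)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2 (output, layer L)

# Forward pass: store the weighted inputs z and the activations a
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Eq. 29: error in the output layer (for the quadratic cost, grad_a C = a^L - y)
delta2 = (a2 - y) * sigmoid_prime(z2)
# Eq. 30: error in the hidden layer, expressed via the error in the next layer
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
# Eq. 31: rate of change of the cost with respect to the biases
grad_b2, grad_b1 = delta2, delta1
# Eq. 32: rate of change of the cost with respect to the weights
grad_W2 = np.outer(delta2, a1)
grad_W1 = np.outer(delta1, x)

print(grad_W1.shape, grad_W2.shape)    # (4, 3) and (2, 4), matching W1 and W2
```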