Machine Learning
Author:
Jaime Pizarroso Gonzalo
Topic 1. Introduction
1 Data Mining and Machine Learning
2 Learning process
2.1 Data
2.2 Abstraction
2.3 Generalization
2.4 Steps to apply Machine Learning to data
3 Types of learning
3.1 Supervised learning
3.2 Unsupervised learning
3.3 Reinforcement learning
3.4 Evolutionary learning
Topic 2. Classification
1 Classification: problem statement
2 Preprocessing
3 Probabilistic approach
4 Training, validating and testing
4.1 Cross-validation
4.2 Measuring classification performance
4.2.1 Histograms
4.2.2 Calibration plots
4.2.3 Heat maps
4.2.4 Confusion matrix
5 Models
5.1 Logistic regression
5.2 Bayesian probabilities
5.2.1 Linear discriminant analysis for n = 1
5.2.2 Linear discriminant analysis for n > 1
5.2.3 Quadratic discriminant analysis
5.7 Support vector machines
5.7.1 Considerations
5.7.2 Linearly non-separable
5.7.3 Non-linear classification
5.7.4 Support Vector Machine
1.2.1 Quantitative methods
5.5 ARMA Model Identification
5.6 ARMA Model Diagnosis
5.6.1 Residual analysis
5.6.2 Level of significance of the coefficients
6 Clustering
6.1 Introduction
6.2 Proximity measures
6.2.1 Distance measures
6.2.2 Similarity measures
6.2.3 Measures for continuous variables
6.2.4 Measures for binary variables
6.2.5 Measures for mixed variables
Topic 1. Introduction
1 Data Mining and Machine Learning
We are in a data rich but information poor situation, and decision makers do not usually have the tools
to extract the valuable knowledge embedded in the data.
Data Mining
o Knowledge discovery from data (KDD)
o Discovering patterns and associations in large data sets
o Turning data into information
o Uncover valuable information from data and transform it into organized knowledge
Machine learning
o Field of study that gives computers the ability to learn without being explicitly programmed
o Making computers modify or adapt their actions so that these actions become more accurate
A machine learns if it is able to use experience to improve its performance on similar experiences in the future.
2 Learning process
2.1 Data
The input data is the main source of knowledge; its quality determines the quality of the final system.
This stage requires observation, memory storage and recall.
2.2 Abstraction
It’s the translation of data into broader representations. During the abstraction process, we assign
meaning to data by representing knowledge using some kind of model (equations, diagrams such as
trees and graphs, logical if/else rules or groupings of data known as clusters). The process of fitting a
particular model to a dataset is known as training.
2.3 Generalization
Uses abstracted data to form a basis for action. A model is said to generalize if it produces correct outputs
for cases not included in the training dataset. Measuring the generalization capabilities of a model is an
essential task. Our final objective is being able to generalize from a finite set of data.
3 Types of learning
3.1 Supervised learning
The aim of supervised learning is to learn an input-output mapping from a labelled dataset. Applications:
3.2 Unsupervised learning
The aim is to find the regularities in the input data by discovering patterns (characterize what generally
happens and what does not). Applications:
Density estimation
Clustering
Vector Quantization
Dimensionality reduction
3.3 Reinforcement learning
Applications:
Robot control
Games
Other activities that a software agent can learn
Topic 2. Classification
1 Classification: problem statement
Given a set of n attributes (features) which belong to an n-dimensional real space, a set of m classes, and a
set of N labeled training instances, in which every instance has n attributes and belongs to one of the m classes, determine
a classification rule that predicts the class of any instance from the values of its attributes.
The classification rule is a partition of the input space.
If observations are grouped in just two categories or classes, the problem is of binary
classification.
o The important category is described as signal and the second as background
If there are more than two categories, this is a multiclass problem.
2 Preprocessing
Many machine learning algorithms are affected by the scale of the predictors.
Standardization: standard scores are also called z-values, z-scores, normal scores, and
standardized variables:
$$x^* = \frac{x - \bar{x}}{\sigma_x}$$
To resolve skewness:
o An un-skewed distribution is one that is roughly symmetric
o A right-skewed (positive skew) distribution has a larger number of points on the left side
of the distribution (smaller values) than on the right side (larger values)
$$\text{skewness} = E\left[\left(\frac{x - \bar{x}}{\sigma_x}\right)^3\right]$$
o The Box-Cox transformation can be used to make the distribution of the variable as
normal as possible (skewness 0):
$$x^* = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$
This family can identify square transformation (λ = 2), square root (λ = 0.5), inverse (λ
= -1) and others in-between.
Using the training data, λ can be estimated by maximum likelihood. The predictor data
must contain values greater than zero.
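The standardization, skewness and Box-Cox formulas above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation: the function names and the toy data are hypothetical, and λ is fixed by hand here rather than estimated by maximum likelihood.

```python
import math
import statistics


def standardize(train, values):
    """Z-scores of `values`, using the mean and std estimated on `train`."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)  # population standard deviation
    return [(v - mean) / std for v in values]


def skewness(xs):
    """Sample version of E[((x - mean) / std)^3]."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return statistics.fmean(((x - mean) / std) ** 3 for x in xs)


def box_cox(xs, lam):
    """Box-Cox transform for a fixed lambda (assumed given, not fitted)."""
    if any(x <= 0 for x in xs):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


# Hypothetical right-skewed predictor
data = [1.0, 2.0, 2.5, 3.0, 4.0, 8.0, 20.0]
z = standardize(data, data)
logged = box_cox(data, 0)  # lambda = 0 is the log transform
```

In this toy example the log transform (λ = 0) reduces the positive skew of `data` without removing it completely; a fitted λ would normally do better.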
Outliers can be generally defined as “samples that are exceptionally far from the mainstream
of the data”
o When one or more samples are suspected to be outliers, first make sure that the values are
scientifically valid and that there are no data recording errors.
o With small sample sizes, apparent outliers might be a result of a skewed distribution
where there are not yet enough data to see the skewness.
Missing values: some predictors have no values for a given sample.
o For large data sets, removal of samples based on missing values is not a problem,
assuming that the missingness is not informative.
o In smaller data sets, removing samples can mean too big a loss of information. Two approaches:
Predict the missing data and substitute the predicted values.
Use information in the training set predictors to estimate the values of other
predictors.
Censored data. Should not be confused with missing data.
o The exact value of censored data is not known but something is known about it.
o It is common with laboratory measurements: some assays cannot measure
below their limit of detection, but we know that the value is smaller than the limit.
Dimensionality reduction: it generates a smaller set of predictors that capture a majority of the
information of the original variables.
o These methods are often called signal extraction or feature extraction techniques.
o Principal component analysis (PCA) is a commonly used data reduction technique.
o This method seeks to find linear combinations of the predictors (principal components
PCs) which capture the most possible variance and are uncorrelated.
Removing predictors: fewer predictors means decreased computational time and complexity. If
two predictors are highly correlated, this implies that they are measuring the same underlying
information so they give redundant information.
o Collinearity is the technical term for the situation where a pair of predictor variables
have a substantial correlation. Using highly correlated predictors in techniques like
linear regression can result in highly unstable models. PCA can solve this problem.
o A more heuristic approach is to remove the minimum number of predictors needed so that all
pairwise correlations are below a threshold.
3 Probabilistic approach
At any point X in the multivariate input space, class label Y (categorical) is distributed according to a
mass function P(Y = y|X=x), the probability of observing y at x. The goal of statistical classification is
to learn the distribution P(y|x).
This learning is accomplished by building (training) a predictive model on data with known class labels.
In practice, the quality of the learned model at point x is measured using a loss function l(y, f(x)). It is
similar to a distance between the true class label y and the predicted response f(x). The classification loss
for the learned model f(x) is the expected distance:
$$L(X, Y) = \sum_{y} \int l(y, f(x))\, P(x, y)\, dx$$
taken over the domain of Y and X with respect to the joint probability density function P(x, y).
The expected loss is usually estimated by averaging l(y,f(x)) over the labeled data drawn from the joint
probability density function.
The response f(x) predicted by a classification model can therefore be one of a nominal variable, a
numeric scalar or a vector with m elements for m classes. The exact meaning of f(x) depends on the
nature of the problem and properties of the classification model.
If a classifier returns only hard labels, there is only one good choice for the loss function, the 0-1 loss. The distance
is:
$$l(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{if } y \neq f(x) \end{cases}$$
The expected loss L(X,Y) is then minimized by classifying every x into the most probable class. This
loss equals the probability of observing one of the less probable classes. In the statistics literature, P(y|x)
is called the posterior probability and the minimal classification error is often called the Bayes error.
The training error is not a good estimate of the generalization error, so we need separate training and validation sets.
The tension between generalization and accuracy on the labeled data is called the bias-variance trade-off.
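The trade-off is commonly written as the following decomposition of the expected test MSE at a point x0. This is the standard textbook form; the symbols f̂ (fitted model), y0 (response at x0) and ε (noise) are assumptions here, since the original equation is not reproduced in the text.

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)$$

The left-hand side is the expected test MSE, and Var(ε) is the irreducible error that no model can remove.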
The first term refers to the average test MSE that we would obtain if we repeatedly estimated f
using a large number of training sets and tested each at x0.
Variance: amount by which f would change if we estimated it using a different training data set:
if a method has a high variance then small changes in the training data can result in large changes
in f.
Bias refers to the error that is introduced by approximating a real-life problem, which may be
extremely complicated, by a much simpler model.
As we use more flexible methods, the variance will increase and the bias will decrease.
A test dataset may be used during the learning phase of the classifier to determine the optimal
structure of the classifier. If the dataset is large enough, typical training – test – validation proportions are 50-25-25.
If the dataset is small, resampling techniques should be used.
4.1 Cross-validation
Resampling effectively increases the amount of data without incurring the full cost of data simulation
or collection. The price is that the resampled datasets are not independent, which affects the quality of the estimates.
Cross-validation works by splitting data into K disjoint subsets. Use 1 – 1/K parts of data for training
and 1/K of data for validation. Repeat this step K times, using every observation once for validation and
K-1 times for training.
This process results in K estimates of the test error (MSE), and the K-fold CV estimate is the average of
these errors.
The number of subsets can vary from 2 to N, where N is the number of available observations. K = N is
called leave-one-out-cross-validation. The most popular choice is K = 10.
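The K-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `fit` and `predict` callables, the toy mean-predictor model and the fixed shuffling seed are all hypothetical, and the error measure is the MSE.

```python
import random


def k_fold_indices(n_obs, k, seed=0):
    """Split the indices 0..n_obs-1 into k disjoint folds after shuffling."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict):
    """Average validation MSE over k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Train on the other K-1 folds, validate on the held-out fold
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - predict(model, xs[i])) ** 2 for i in held_out)
        errors.append(mse / len(held_out))
    return sum(errors) / k


# Hypothetical toy model: always predict the training mean of y
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model
```

Calling `cross_validate(xs, ys, len(xs), fit_mean, predict_mean)` gives the leave-one-out estimate, since every fold then holds a single observation.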
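A classification model usually produces a continuous-valued prediction, typically in the form of a probability. If not, it can be transformed
using the softmax transformation:
$$\hat{p}_l = \frac{e^{\hat{y}_l}}{\sum_{j=1}^{m} e^{\hat{y}_j}}$$
where ŷl is the raw model output for class l and m is the number of classes.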
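A minimal sketch of the softmax transformation, assuming the raw model outputs are given as a list of scores with one entry per class. Subtracting the maximum score before exponentiating is a common numerical safeguard and does not change the result.

```python
import math


def softmax(scores):
    """Map raw class scores to probabilities that sum to one."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
```

The ordering of the scores is preserved, so the most probable class is the one with the largest raw score.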
4.2.1 Histograms
4.2.4 Confusion matrix
Overall accuracy:
$$O = \frac{TP + TN}{\text{Total}} \times 100\%$$
The cost of False Negatives (FN) and False Positives (FP) may be different.
Other rates:
Sensitivity: true positive rate
$$\text{Sensitivity} = \frac{\#\ \text{samples with the event and predicted to have the event}}{\#\ \text{samples having the event}} = \frac{TP}{TP + FN}$$
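The confusion-matrix counts, the overall accuracy and the sensitivity can be computed as in the sketch below. The function names, the label encoding (1 for the event) and the toy labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1  # event missed by the classifier
        elif p == positive:
            fp += 1  # event predicted but not present
        else:
            tn += 1
    return tp, fp, tn, fn


def overall_accuracy(tp, fp, tn, fn):
    """Overall accuracy in percent: (TP + TN) / Total * 100."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical labels
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```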
5 Models
5.1 Logistic regression
Binary classification problem (Y ∈ {0,1}). Rather than modeling this response Y directly, logistic
regression models the probability that Y belongs to a particular category, P(Y = 1|X). The logistic
regression model is given by:
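A minimal sketch of the logistic model, assuming the usual form p(x) = 1 / (1 + exp(-(β0 + β1x1 + … + βnxn))). The coefficient names `beta0` and `betas` and the function name are hypothetical, and the coefficients are taken as given rather than fitted.

```python
import math


def logistic_probability(x, beta0, betas):
    """P(Y = 1 | X = x) under a logistic model with given coefficients."""
    s = beta0 + sum(b * xi for b, xi in zip(betas, x))
    # Two algebraically equal branches keep exp() from overflowing
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


p = logistic_probability([1.0, 2.0], beta0=-1.0, betas=[0.5, 0.25])
```

With these illustrative coefficients the linear predictor is exactly zero, so the predicted probability is 0.5, which is the decision boundary of the classifier.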
5.2.1 Linear discriminant analysis for n = 1
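If we suppose that p(X=x|Y=k) is normal or Gaussian:
$$p(X = x \mid Y = k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)$$
Where μk and σk2 are the mean and the variance parameters of the kth class. We further suppose that the
variance is constant across classes (σ12 = … = σm2 = σ2). With πk the prior probability of class k, Bayes' theorem gives:
$$P(Y = k \mid X = x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)}{\sum_{i=1}^{m} \pi_i \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{1}{2\sigma_i^2}(x - \mu_i)^2\right)}$$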
The Bayes classifier involves assigning an observation X = x to the class for which P(Y=k|X=x) is
largest.
This is equivalent to assigning the observation to the class for which:
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
is largest.
The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates
for πk, μk and σ2 into the previous equation:
$$\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)$$
and assigns an observation X=x to the class for which ẟk(x) is maximum.
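The plug-in rule for one predictor can be sketched as below. The estimators are assumptions, since the text does not state them: class means, class proportions as priors, and a pooled variance with N - m in the denominator. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_lda_1d(xs, labels):
    """Estimate class means, priors and the pooled variance."""
    groups = defaultdict(list)
    for x, k in zip(xs, labels):
        groups[k].append(x)
    n, m = len(xs), len(groups)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    priors = {k: len(v) / n for k, v in groups.items()}
    # Pooled variance shared by all classes (assumed estimator)
    var = sum((x - means[k]) ** 2 for x, k in zip(xs, labels)) / (n - m)
    return means, priors, var


def discriminant(x, k, means, priors, var):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x * means[k] / var - means[k] ** 2 / (2 * var) + math.log(priors[k])


def predict_lda_1d(x, means, priors, var):
    """Assign x to the class with the largest discriminant."""
    return max(means, key=lambda k: discriminant(x, k, means, priors, var))


xs = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
means, priors, var = fit_lda_1d(xs, labels)
```

With equal priors and a shared variance the decision boundary sits halfway between the two class means, here at x = 3.5.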
Where μk is a class-specific mean vector, and Σ is a covariance matrix that is common to all m classes
(which is a very hard constraint).
Then, it can be shown that the Bayes classifier assigns an observation X=x to the class for which the following is
largest:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)$$
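For quadratic discriminant analysis, each class has its own covariance matrix Σk. Then, it can be shown that the Bayes classifier assigns an observation X=x to the class for which the following is
largest:
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)$$
The regions of the classification problem are separated by straight lines for linear discriminant analysis
and by conic sections for quadratic discriminant analysis.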
5.4 Decision trees
Decision trees are an easy-to-understand, general representation of a discrete classifier, with fast learning algorithms.
They provide built-in feature selection (the process of selecting a subset of relevant features for use in model
construction).
5.4.3 Algorithm
If we have a continuous predictor and a categorical response, the optimal split point is found as follows:
1. The samples are sorted based on the predictor values.
2. The candidate split points are then the midpoints between each pair of consecutive unique predictor values. If the response is
binary, then the process generates a contingency table as:
The Gini index prior to the split would be 2(n1+/n)(n2+/n). The Gini index is then calculated after the split within each
of the new nodes, and the values are combined using the proportion of samples in each partition:
3. Partitioning algorithms evaluate nearly all split points and select the split point value that
minimizes the Gini index.
4. The splitting process continues until a stopping criterion is met (minimum number of samples
in a node or maximum tree depth).
The algorithm also works if the Gini index is replaced by the entropy.
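The split search for one continuous predictor and a binary response can be sketched as below. It assumes labels encoded as 0/1 and uses the two-class Gini index 2p(1 - p); the function names and toy data are hypothetical, and every midpoint is evaluated exhaustively.

```python
def gini(labels):
    """Two-class Gini index 2 * p1 * (1 - p1) of a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(1 for y in labels if y == 1) / n
    return 2 * p1 * (1 - p1)


def best_split(xs, ys):
    """Return (split point, weighted Gini) minimizing the post-split Gini."""
    pairs = sorted(zip(xs, ys))        # step 1: sort by predictor value
    values = sorted(set(xs))
    n = len(pairs)
    best = None
    for lo, hi in zip(values, values[1:]):
        split = (lo + hi) / 2          # step 2: midpoint between unique values
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        # Combine node impurities using the proportion of samples in each node
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (split, score)      # step 3: keep the minimum
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
split, score = best_split(xs, ys)
```

Replacing `gini` with an entropy function would give the entropy-based variant mentioned above without changing the search loop.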
5.4.3.1 Considerations
Trees that are constructed to have the maximum depth usually overfit the training data.
A generalizable tree is a pruned version of the initial tree, which is determined by cost-complexity
tuning, in which the purity criterion is penalized by a factor of the total number of
terminal nodes in the tree. Each terminal node produces a vector of class probabilities based on
the training set, which is then used as the prediction for a new sample.
Tree models can also bin categorical predictors.
When fitting trees and rule-based models, a choice must be made regarding the treatment of
categorical predictor data:
o Each categorical predictor can be entered into the model as a single entity so that the
model decides how to group or split the values (grouped categories). A categorical
variable X = (a, b, c) can be considered as a – ab – ac – b – bc – c.
o Categorical predictors are first decomposed into binary dummy variables. In this way,
the dummies are considered independently, forcing binary splits for the categories
(independent categories). A categorical variable X = (a, b, c) can be split into Xa, Xb,
Xc, each of which takes the value 0 or 1.
There exist different stopping criteria, but cross-validation is used for selecting optimal
complexity:
o Number of nodes/depth of tree.
o Minimum number of observations in a node.
o Entropy: stop splitting the node if the entropy is small enough.
In this case, the best size (number of leaf nodes) is 6 (lowest cross-validation error).
5.5.1 Bagging
Given a set of N independent observations Z1, …, ZN, each with variance σ2, the variance of the mean
Z of the observations is given by σ2/N, so averaging a set of observations reduces variance.
A natural way to reduce the variance and increase the prediction accuracy is to take many training sets from the
population, build a separate prediction model using each training set, and average the resulting
predictions. This is called bagging:
The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
5.5.2 Out-of-Bag Error estimation
On average, each bagged tree makes use of around two-thirds of the observations. The other third is
referred to as the out-of-bag (OOB) observations.
We can predict the response for the ith observation using each of the trees in which that observation was
OOB. This will yield around B/3 predictions for the ith observation, where B is the number of bagged trees. Averaging these predicted responses
(regression) or taking a majority vote (classification) leads to a single OOB prediction for the ith
observation. The resulting OOB error is a valid estimate of the test error for the bagged model.
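The "around one third" figure can be checked with a small simulation. A bootstrap sample of size N drawn with replacement leaves each observation out with probability (1 - 1/N)^N, which tends to 1/e ≈ 0.368. The sketch below only draws bootstrap index sets and does not fit any trees; the function names and the seed are hypothetical.

```python
import random


def bootstrap_sample(n_obs, rng):
    """Indices of one bootstrap sample (drawn with replacement)."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]


def oob_indices(n_obs, sample):
    """Observations that never appear in the bootstrap sample."""
    in_bag = set(sample)
    return [i for i in range(n_obs) if i not in in_bag]


def mean_oob_fraction(n_obs, n_trees, seed=0):
    """Average share of out-of-bag observations over n_trees bootstraps."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        total += len(oob_indices(n_obs, bootstrap_sample(n_obs, rng)))
    return total / (n_obs * n_trees)


fraction = mean_oob_fraction(n_obs=200, n_trees=200)
```

In a full bagging implementation, the trees whose `oob_indices` contain observation i are the ones used for its OOB prediction.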
A fresh sample of m predictors is taken at each split, and typically we choose 𝑚 = √𝑛.
If a predictor is very important, most of the bagged trees will use this predictor in the top split, and,
consequently, all of the bagged trees will look quite similar to each other. Averaging many highly
correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated
quantities.
5.6.1 Considerations
If a random forest is built using m = n, then this leads to bagging.
Using a small value of m in building a random forest will typically be helpful when we have a
large number of correlated predictors.
5.7 Support vector machines
SVMs are a generalization of a simple and intuitive classifier called the maximal margin classifier. The
maximal margin classifier can only solve linearly separable problems.
In an n-dimensional space, a hyperplane is a flat affine subspace of dimension n-1; in two dimensions, a
hyperplane is a flat one-dimensional subspace (a line); and, in three dimensions, it is a plane:
Hyperplane:
$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n = 0$$
Line:
$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$$
Plane:
$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 = 0$$
The sign of the left-hand side of the hyperplane equation, evaluated at an observation x, determines whether the
observation is on one side or on the other. So, the hyperplane divides the n-dimensional input space into
two halves.
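Now suppose that we have an N×n matrix of data that consists of N observations in an n-dimensional
input space (the training set). We also have a test observation x*, and supposing that this is a binary
classification problem with labels {1, -1}, our goal is to develop a classifier based on the training data that will
correctly classify the test observation.
A separating hyperplane with coefficients β0, β1, …, βn satisfies for all i:
$$y_i (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in}) > 0$$
The test observation x* is classified according to the sign of f(x*) = β0 + β1x1* + … + βnxn*.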
If f(x*) is far from zero, then it means that x* lies far from the hyperplane, so we can be
confident about a class assignment for x*.
If f(x*) is close to zero, then x* is located near the hyperplane, and we are less certain about the
class assignment for x*.
5.7.1 Considerations
If our data can be perfectly separated by a hyperplane, then there will exist an infinite number
of such hyperplanes.
A natural choice is the maximal margin hyperplane (optimal separating hyperplane), which is
the separating hyperplane that is farthest from the training observations.
o The margin is the minimal (perpendicular) distance from the observations to the
hyperplane. The maximal margin hyperplane is the separating hyperplane for which the
margin is largest.
o The support vectors are the points (no matter which class) with the minimum distance
to the hyperplane. These are called support vectors because if these points are moved,
the hyperplane moves as well. The maximal margin hyperplane depends directly on the
support vectors, but not on the other observations.
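The maximal margin hyperplane is the solution to the optimization problem:
$$\max_{\beta_0, \beta_1, \dots, \beta_n, M} M$$
$$\text{subject to } \sum_{j=1}^{n} \beta_j^2 = 1$$
$$y_i (\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}) \geq M \quad \forall i = 1, \dots, N$$
o The second constraint guarantees that each observation will be on the correct side of
the hyperplane (M is positive)
o The first constraint ensures that the perpendicular distance from the ith observation to the hyperplane is given by:
$$y_i (\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in})$$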
In these cases, in order to improve robustness to individual observations and better classification of most
of the observations, we might consider a classifier based on a hyperplane that does not perfectly separate
the classes.
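The Support Vector Classifier, sometimes called soft margin classifier, allows a few training
observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane.
The hyperplane is given by the solution of the optimization problem:
$$\max_{\beta_0, \dots, \beta_n, \epsilon_1, \dots, \epsilon_N, M} M$$
$$\text{subject to } \sum_{j=1}^{n} \beta_j^2 = 1, \quad y_i (\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}) \geq M(1 - \epsilon_i), \quad \epsilon_i \geq 0, \quad \sum_{i=1}^{N} \epsilon_i \leq C$$
Where C is a nonnegative tuning parameter, M is the width of the margin and ϵi are slack variables.
Once the optimization problem has been solved, we classify a test observation x* by the sign of:
$$f(x^*) = \beta_0 + \beta_1 x_1^* + \dots + \beta_n x_n^*$$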
The slack variable ϵi tells us where the ith observation is located:
If ϵi=0, then the ith observation is on the correct side of the margin
If ϵi>0, then the ith observation is on the wrong side of the margin
If ϵi>1, then the ith observation is on the wrong side of the hyperplane
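The three cases above can be illustrated for a given hyperplane and margin. The sketch assumes the standard soft-margin constraint y_i f(x_i) ≥ M(1 - ϵ_i) with the coefficient vector normalized to unit length, so that ϵ_i = max(0, 1 - y_i f(x_i) / M). The hyperplane, the margin and the points are hypothetical, and nothing is optimized here.

```python
import math


def slack_variables(points, labels, beta0, beta, margin):
    """Slack of each observation for a fixed hyperplane and margin M."""
    norm = math.sqrt(sum(b * b for b in beta))
    slacks = []
    for x, y in zip(points, labels):
        # Signed perpendicular distance from x to the hyperplane
        f = (beta0 + sum(b * xi for b, xi in zip(beta, x))) / norm
        slacks.append(max(0.0, 1.0 - y * f / margin))
    return slacks


def describe(eps):
    """Translate a slack value into the location of the observation."""
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"


# Hypothetical hyperplane x1 = 0 with margin M = 1
points = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (-3.0, 0.0)]
labels = [1, 1, 1, -1]
slacks = slack_variables(points, labels, beta0=0.0, beta=[1.0, 0.0], margin=1.0)
```

The sum of these slack values is the quantity that the budget C bounds in the optimization problem.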
C bounds the sum of the ϵi’s, so it determines the number and severity of the violations to the margin:
If C=0, then there is no budget for violations and the problem is reduced to the maximal margin
hyperplane.
If C>0, then no more than C observations can be on the wrong side of the hyperplane.
In practice, C is a tuning parameter that is usually chosen via cross-validation:
When C is small, we seek narrow margins that are rarely violated. This amounts to a classifier that is highly
fit to the data, which may have low bias but high variance.
When C is larger, the margin is wider and more violations are allowed. This amounts to fitting the data less hard and obtaining a
classifier that is more biased but has lower variance.
The hyperplane is only affected by the observations that either lie on the margin or that violate the
margin. Observations that lie directly on the margin or on the wrong side of the margin are support
vectors. When C is large, more observations are involved in determining the hyperplane (there are more
support vectors).
More functions can be considered but the amount of computations could become unmanageable
The linear support vector classifier can be represented by the following formula, where there are N parameters
αi:
$$f(x) = \beta_0 + \sum_{i=1}^{N} \alpha_i \langle x, x_i \rangle$$
To evaluate the function f(x) we need to compute the inner product between the new point x and the
training points xi. However, it turns out that αi is nonzero only for the support vectors. With ẟ denoting the set of indices
of the support vectors:
$$f(x) = \beta_0 + \sum_{i \in \delta} \alpha_i \langle x, x_i \rangle$$
We can replace the inner product with a generalization of the form below, where K is some function that we will refer
to as a kernel:
$$K(x_i, x_{i'})$$
The linear kernel quantifies the similarity of a pair of observations using the Pearson correlation:
$$K(x_i, x_{i'}) = \sum_{j=1}^{n} x_{ij} x_{i'j}$$
The polynomial kernel of degree d amounts to fitting a support vector classifier in a higher-dimensional
space involving polynomials of degree d:
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{n} x_{ij} x_{i'j}\right)^d$$
When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known
as a Support Vector Machine. The function has the form:
$$f(x) = \beta_0 + \sum_{i \in \delta} \alpha_i K(x, x_i)$$
The radial kernel, where σ is a positive constant, has a local behavior, as only nearby training
observations have an effect on the class label of a test observation.
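The kernels discussed in this section can be sketched as plain functions. The parameterization of the radial kernel is an assumption: the sketch uses exp(-γ‖a - b‖²) with a positive constant γ, whereas the notes name the constant σ without showing the formula. The decision function takes the support vectors, their αi and β0 as given, hypothetical values; nothing is trained here.

```python
import math


def linear_kernel(a, b):
    """Inner product of two observations."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, d):
    """(1 + <a, b>)^d for a polynomial of degree d."""
    return (1 + linear_kernel(a, b)) ** d


def radial_kernel(a, b, gamma):
    """exp(-gamma * ||a - b||^2); gamma > 0 is an assumed parameterization."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def svm_decision(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


# Hypothetical support vectors and coefficients
svs = [[1.0, 0.0], [-1.0, 0.0]]
alphas = [0.5, -0.5]
value = svm_decision([1.0, 0.0], svs, alphas, beta0=0.0, kernel=linear_kernel)
```

The local behavior of the radial kernel shows up directly: the kernel value decays towards zero as the distance between the two observations grows, so distant training points contribute almost nothing to f(x).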
5.8 Neural Networks
The idea is to mimic the structure of the brain by massively interconnecting very simple processing units
and designing learning rules to adjust their transfer functions. Artificial Neural Networks generate their
own rules by learning from examples.
A neural network is a massively parallel distributed processor made up of simple processing units that
has a natural propensity for storing experiential knowledge and making it available for use. It resembles
the brain in two respects:
Knowledge is acquired by the network from its environment through a learning process.
Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.
5.8.1 Perceptron
Activation function:
The perceptron partitions the input space in two regions, according to the hyperplane s=0
(decision boundary)
For n = 2
The perceptron is only capable of solving linearly separable problems. The threshold moves the
hyperplane away from the origin of the input space.
Rosenblatt learning rule:
If the problem is linearly separable, the algorithm converges to the solution; if the problem is not linearly
separable, the algorithm may oscillate. The value of α doesn’t affect stability, but it determines the rate
of convergence (typically: α=1)
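A minimal sketch of perceptron training on a linearly separable problem. The update used here, w ← w + α(d - y)x with a step activation and targets in {0, 1}, is the usual form of the perceptron rule and is an assumption, since the notes' own equation is not reproduced in the text. The AND-gate data and all names are hypothetical.

```python
def step(s):
    """Step activation: 1 if s >= 0, else 0."""
    return 1 if s >= 0 else 0


def predict(x, w, b):
    """Perceptron output for input x with weights w and bias b."""
    return step(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(samples, targets, alpha=1.0, max_epochs=100):
    """Update weights on every misclassified sample until an epoch is clean."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in zip(samples, targets):
            y = predict(x, w, b)
            if y != d:
                mistakes += 1
                w = [wi + alpha * (d - y) * xi for wi, xi in zip(w, x)]
                b += alpha * (d - y)
        if mistakes == 0:  # converged: every sample is classified correctly
            break
    return w, b


# Logical AND is linearly separable
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
w, b = train_perceptron(samples, targets)
```

On a problem that is not linearly separable, such as XOR, the loop would simply run until `max_epochs` without reaching a clean epoch.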
Hierarchical structure of fully interconnected layers of processing units, forming a feedforward ANN
5.8.2.1 Notation
We fit several layers of perceptrons using differentiable activation functions (tanh(s) or 1/(1+exp(-s))
instead of the step function) and gradient-based methods.
5.8.2.2 Theorem
A multilayer perceptron with only one hidden layer and sigmoidal activation functions is a universal
function approximator
A multilayer perceptron with only one hidden layer and sigmoidal activation functions is a universal
classifier.
5.8.2.3 Error bounds
The mean integrated squared error between the estimated network and a target function f is shown to be
bounded by:
$$O\left(\frac{C_f^2}{h}\right) + O\left(\frac{h\,n}{N} \log N\right)$$
Where h is the number of nodes, n the input dimension of the function, N is the number of training
observations, and Cf is the first absolute moment of the Fourier magnitude distribution of f. The principal
problem of the multilayer perceptron is the overfitting due to the complexity of the model. The only way
to reduce the error is increasing the number of nodes (decreasing bias and increasing variance) and
increasing the training dataset (reducing variance).
5.8.2.4 Backpropagation
It is an efficient algorithm for the computation of ∂E/∂w.
The gradient:
Chain rule:
First factor:
Finally:
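The chain-rule computation can be illustrated on a very small network. The architecture is an assumption made for this sketch: one scalar input, one hidden layer of tanh units, a linear output and the squared error E = ½(y - d)². The parameter names (w, b for the hidden layer, v, c for the output) and values are hypothetical. The analytic gradient is compared against a finite-difference estimate.

```python
import math


def forward(x, params):
    """Hidden activations and output of a 1-input, 1-output tanh network."""
    w, b, v, c = params
    hidden = [math.tanh(wj * x + bj) for wj, bj in zip(w, b)]
    y = c + sum(vj * hj for vj, hj in zip(v, hidden))
    return hidden, y


def loss(x, d, params):
    """Squared error E = 0.5 * (y - d)^2 for one training pair."""
    _, y = forward(x, params)
    return 0.5 * (y - d) ** 2


def backprop(x, d, params):
    """Gradient of E with respect to every weight, via the chain rule."""
    w, b, v, c = params
    hidden, y = forward(x, params)
    delta_out = y - d                               # dE/dy
    grad_v = [delta_out * hj for hj in hidden]      # dE/dv_j
    grad_c = delta_out                              # dE/dc
    # Propagate the error back through tanh: d tanh(s)/ds = 1 - tanh(s)^2
    delta_hidden = [delta_out * vj * (1 - hj ** 2) for vj, hj in zip(v, hidden)]
    grad_w = [dh * x for dh in delta_hidden]        # dE/dw_j
    grad_b = delta_hidden                           # dE/db_j
    return grad_w, grad_b, grad_v, grad_c


def numeric_grad_w(x, d, params, j, eps=1e-6):
    """Central finite-difference estimate of dE/dw_j, used as a check."""
    w, b, v, c = params
    up = list(w)
    up[j] += eps
    down = list(w)
    down[j] -= eps
    return (loss(x, d, (up, b, v, c)) - loss(x, d, (down, b, v, c))) / (2 * eps)


params = ([0.5, -0.3], [0.1, 0.2], [0.7, -0.4], 0.05)
grad_w, grad_b, grad_v, grad_c = backprop(0.8, 1.0, params)
```

The efficiency of backpropagation comes from reusing `delta_out` and `delta_hidden` for all weights of a layer, instead of perturbing each weight separately as the finite-difference check does.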
5.8.2.5 Weight initialization
The external inputs and outputs should be standardized or normalized in order to ensure a well-
conditioned optimization problem. Random small weights are used for preventing saturation of the
activation function. Assuming that the inputs have been normalized in the interval [-1;1]:
Generate initial weight vectors for the external inputs according to a uniform distribution:
Locate the center of the interval at a random location along the slice by setting:
Delta rule:
The minimization of E(w), with respect to the weight vector w in W=Rq, is an unconstrained nonlinear
optimization problem.
For classification problems (d ϵ {0,1} and y ϵ {0,1}):
Softmax (for m classes we train a neural net with m outputs):
Typically, Г = λI (eigenvalues)
5.8.2.8 Statistical Sensitivity Analysis
The objective of this analysis is reducing the complexity of the model by pruning input variables that
do not affect the output. Measure of the relevance of an input variable:
$$\varsigma_i = \frac{\partial \hat{y}}{\partial x_i}$$
Topic 3. Regression
1 Regression: problem statement
Given a set of input, independent, regressor or exogenous variables X=(X1,X2,…,Xn), which belong to
an n-dimensional real space, and an output, dependent or endogenous variable Y, which belongs to a 1-dimensional
real space, our objective is to estimate the value of E(Y|X=x) from a random sample of the
form {(x[i], y[i]) ∈ Rn × R}, i = 1, …, N.
The deterministic component can be estimated using a function approximator, but the random
component (noise) has to be characterized.
2 Linear regression model
It is composed of n explanatory variables (regressors). The coefficients measure the marginal
contribution of each input variable to the output (sensitivity).
Assumptions:
Adjusted coefficient of determination: takes into account the number of parameters, which makes it
suitable for variable selection.
2.2 Test for significance of regression
2.2.1 F-test of the overall fit
H0: β1 = β2 = … = βn = 0
H1: ∃ i such that βi ≠ 0
If the null hypothesis is true, then:
If the corresponding p-value < α, H0 is rejected and at least one coefficient is considered significant.
If the corresponding p-value < α, H0 is rejected and the coefficient βi is considered significant.
2.4 Multicollinearity
Multicollinearity appears when there is strong correlation among the input variables, which ideally should be independent.
Consequences:
Estimation of the model coefficients can be arbitrary (it is not clear which variable explains the
output)
It is an unstable model that can behave badly with new data
We detect it with the Variance Inflation Factor:
$VIF_i = \dfrac{1}{1 - R_i^2}$
where $R_i^2$ is the coefficient of determination resulting from regressing xi on the remaining n-1 regressor
variables. As a rule of thumb, if VIF > 10 then multicollinearity is a problem.
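A minimal Python sketch of the VIF for the simplest case of two regressors, where the R2 of regressing one on the other equals their squared Pearson correlation. The data and function names are made up for illustration.

```python
def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - R^2); with two regressors R^2 is the squared correlation."""
    r2 = pearson(x1, x2) ** 2
    return 1.0 / (1.0 - r2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # hypothetical data, almost 2 * x1
x2_unrelated = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]    # hypothetical data, weakly correlated

vif_high = vif_two_regressors(x1, x2_collinear)  # far above the rule-of-thumb threshold of 10
vif_low = vif_two_regressors(x1, x2_unrelated)   # close to 1
```

With more than two regressors, each R2 would come from a full multiple regression of xi on the others; the sketch only covers the two-variable case.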
2.5.1.2 Backward selection
Begins with the full least squares model containing all n predictors, and then iteratively removes the
least useful predictor, one-at-a-time
Indirectly estimate test error by making an adjustment to the training error to account for the
bias due to overfitting.
Directly estimate test error using a validation set or a cross-validation approach.
2.5.1.4.1 Mallows Cp
Estimates the size of the bias that is introduced into the predicted responses by having an underspecified
model:
where p is the number of parameters and ^σ2 is an estimate of the variance of the error (usually the mean
squared error obtained from fitting the model containing all of the candidate predictors)
2.5.1.4.2 Akaike Information Criterion AIC
In a linear regression model fit by maximum likelihood, AIC is given by:
where p is the number of parameters and ^σ2 is an estimate of the variance of the error (usually estimated
as the mean squared error obtained from fitting the model containing all the candidate predictors)
2.5.1.4.3 Schwarz’s Bayesian Information Criterion BIC
For the least squares model with p parameters, the BIC is given by:
where p is the number of parameters and ^σ2 is an estimate of the variance of the error (usually estimated
as the mean squared error obtained from fitting the model containing all the candidate predictors)
2.5.1.4.4 Adjusted R2
For selecting among a set of models that contain different number of variables.
2.5.2.2 Lasso
Ridge regression will include the n predictors in the final model. The lasso is an alternative to ridge
regression that overcomes this disadvantage. The lasso coefficients minimize the quantity:
The lasso shrinks the coefficient estimates towards zero, but the L1 penalty has the effect of forcing
some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently
large, so the lasso performs variable selection.
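The difference between the two penalties can be sketched in Python for the simplest possible case: a single predictor scaled so that the sum of its squared values is 1. Under that assumption, and taking the objectives as RSS + λβ² (ridge) and RSS + λ|β| (lasso), both solutions have a closed form in terms of the least squares estimate. The function names are illustrative.

```python
def ridge_shrink(b_ols, lam):
    """Ridge solution in the single, unit-norm predictor case: proportional shrinkage."""
    return b_ols / (1.0 + lam)

def lasso_shrink(b_ols, lam):
    """Lasso solution in the same case: soft-thresholding by lam / 2."""
    magnitude = abs(b_ols) - lam / 2.0
    if magnitude <= 0.0:
        return 0.0                      # the coefficient is set exactly to zero
    return magnitude if b_ols > 0 else -magnitude

small_ridge = ridge_shrink(0.3, 1.0)    # shrunk, but still non-zero
small_lasso = lasso_shrink(0.3, 1.0)    # exactly zero: the variable is dropped
large_lasso = lasso_shrink(2.0, 1.0)    # shrunk by a constant amount
```

Ridge never produces an exact zero for a non-zero least squares estimate, while the lasso does so as soon as |b_ols| falls below λ/2.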
2.5.2.3 Ridge regression vs Lasso
2.5.3.1 Principal Components Regression (PCR)
Involves constructing the first p principal components and then using these components as the predictors
in a linear regression model that is fit using least squares. We assume that the directions in which X1,
X2, …, Xn show the most variation are the directions associated with Y.
The principal components are mutually orthogonal, so the R2 obtained by regressing any component on the
others is 0 (there is no multicollinearity). However, the components are built without looking at Y, so
they are not optimized for regression.
3 Polynomial regression
Straightforward extension of the linear regression model:
4 Regression splines
Piecewise polynomials. A piecewise cubic polynomial (can be of a power n) with a single knot at a point
c takes the form:
If we place K different knots throughout the range of X, then we will end up fitting K+1 different cubic
polynomials. The resulting curve can be continuous or discontinuous, depending on the constraints
imposed to the model:
then we can fit a cubic spline with K knots by least squares regression using the model:
Regression splines often give superior results to polynomial regression because, unlike polynomials,
which must use a high degree to produce flexible fits, splines introduce flexibility by increasing the
number of knots while keeping the degree fixed.
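As an illustration, one common way to build the regressors of a cubic spline is the truncated power basis: the terms 1, x, x², x³ plus one term (x - ξ)³₊ per knot ξ. The Python sketch below assumes that basis; the function names and knot positions are hypothetical.

```python
def truncated_cubic(x, knot):
    """(x - knot)^3 if x is to the right of the knot, 0 otherwise."""
    return (x - knot) ** 3 if x > knot else 0.0

def cubic_spline_basis(x, knots):
    """Row of regressors for a cubic spline with K knots: K + 4 columns."""
    return [1.0, x, x ** 2, x ** 3] + [truncated_cubic(x, k) for k in knots]

row = cubic_spline_basis(2.5, [1.0, 2.0, 3.0])
# Knots at 1 and 2 are active for x = 2.5; the knot at 3 contributes 0.
```

Stacking one such row per observation gives a design matrix that can be fitted by ordinary least squares, which is why the degree stays fixed while flexibility grows with the number of knots.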
A natural spline is a regression spline which is linear at the boundary (where X is smaller than the
smallest knot and larger than the largest knot).
We can fix the degrees of freedom and then place the corresponding number of knots at uniform
quantiles of the data. To optimize the number of knots we can use cross-validation.
The function g(x) that minimizes the proposed cost function is a natural cubic spline with knots at x[1],
…, x[N]
It is called an additive model because we calculate a different fj for each Xj and then add together all the
contributions. If we use regression splines constructed with appropriate basis functions to
approximate each fj, the entire model is just a big regression onto spline basis variables.
5.1 Advantages
GAMs allow us to fit a non-linear fj to each Xj, so we can automatically model non-linear relationships.
We do not need to manually try out many different transformations on each variable individually.
Because the model is additive, we can still examine the effect of each Xj on Y individually while holding
all of the other variables fixed.
The smoothness of the function fj for the variable Xj can be summarized via degrees of freedom.
5.2 Limitations
The main limitation is that the model is restricted to be additive. With many variables, important
interactions can be missed.
However, we can manually add interaction terms to the GAM model by including additional predictors
of the form Xj x Xk.
5.3 Multilayer perceptron
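Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form
$G(x) = \sum_{j=1}^{h} \alpha_j \, \sigma(w_j^T x + \theta_j)$
are dense in C(In). In other words, given any f ϵ C(In) and ε > 0, there is a sum, G(x), of the above form
for which
$|G(x) - f(x)| < \varepsilon$ for all $x \in I_n$.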
A Multilayer Perceptron with only one hidden layer and sigmoidal activation functions in the hidden
layer is a universal function approximator.
The parameter estimates of a linear SVM can be written as functions of a set of unknown parameters αi
and the training set data points:
Other kernel functions (ϕ and σ are scaling parameters):
The extra parameters, such as the polynomial degree or the scale factors ϕ and σ, must be specified. These
parameters, along with the Cost value, constitute the tuning parameters for the model:
When the cost is large, the model becomes very flexible. When the cost is small, the model will become
less likely to overfit, but more likely to underfit.
There is a relationship between the Cost parameter and ϵ. We suggest fixing a value for ϵ and tuning
over the other kernel parameters. Centering and scaling the predictors is recommended.
The above problem is ill-posed, as there are infinitely many solutions, so we need to add constraints
to g: g must be a smooth function (similar inputs produce similar outputs).
The new error function is:
where:
Ḡ(s) is a positive function that tends to zero when |s| → ∞, in such a way that 1/Ḡ(s) is the
transfer function of a high-pass filter (Ḡ(s) itself behaves as a low-pass filter)
λ>0: regularization parameter that controls the tradeoff between closeness and smoothness.
If we select as smoothness functional:
If we limit the number of Radial Basis Functions and we consider a different scale factor in each unit:
6.1.1.4 Normalized RBFN
when the pdf pr(X,Y) is unknown, it can be estimated from a set of observations of (X,Y):
6.1.1.6 Probabilistic Radial Basis Functions Network
It is a general regression neural network plus a normalized radial basis function network:
Outputs:
PDF estimation
Function approximation
Weights initialization:
o Centers of the radial basis units (ri): clustering
o Scale factors of the radial basis units (μi): p-nearest neighbor
Topic 4. Forecasting
1 Introduction
A forecast is a prediction of some future event. Forecasting problems are classified as:
1.1 Objectives
1. Describing the evolution of a time series
2. Modelling the process that has generated the time series by means of a suitable statistical model
3. Forecasting future values of the time series
4. Control: good forecasts enable the analyst to take actions and control a given process.
Quantitative methods:
o Sufficient information about the past is available (historical record)
o This information can be set as numerical time series (decomposition methods)
o We can assume that the future behavior is similar to the observed past behavior
(continuity assumption)
Qualitative methods:
o Little or no quantitative information is available.
o These are based on expert knowledge
o Example: Delphi method
2 Fundamental concepts
2.1 Stochastic processes
A stochastic process Y(w, t) is a family of time indexed random variables. w belongs to a sample space,
t belongs to an index set.
2.2 Time series
A time series is a collection of observations made sequentially through time. Formally, a time series is
the realization of a discrete stochastic process.
2.3.1 Properties
When the process is stationary, its first and second order moments can be estimated from only one
realization of the process:
Mean:
Autocovariance:
2.5 White noise process
White noise is a sequence of uncorrelated, identically distributed random variables with zero mean
and constant variance. The general expression is y[t] = ε[t]. If, after training a model, the residuals
are still correlated with the predictions (that is, they do not behave as white noise), the model
can be improved.
Conditional mean function:
2.9.1 Cross-validation methods:
Training set (“in-sample”): parameter optimization.
Validation set (“out-of-sample”): measuring the generalization capabilities of the model.
The above transformations can be generalized in the form proposed by Box & Cox:
In R:
lambda <- BoxCox.lambda(y)
y_transf <- BoxCox(y,lambda)
In practice, the square root and the logarithm are the most commonly used transformations.
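For readers who want to see the transformation itself, the following Python sketch assumes the usual Box-Cox definition: (y^λ - 1)/λ for λ ≠ 0 and log(y) for λ = 0, with strictly positive data. The function name is illustrative and is not the R function used above.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a list of strictly positive values (assumed definition)."""
    if lam == 0:
        return [math.log(v) for v in y]          # limiting case: logarithm
    return [(v ** lam - 1.0) / lam for v in y]   # power transform

log_case = box_cox([1.0, math.e], 0)     # behaves like the logarithm
sqrt_case = box_cox([4.0, 9.0], 0.5)     # a shifted and scaled square root
identity_case = box_cox([3.0], 1.0)      # lambda = 1 only shifts the series by 1
```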
Function monthdays(ts) can be used in R to obtain the number of days for each month in the time
series.
The number of holidays per month is very different from month to month. If it is possible to classify
the working days and holidays, and all holidays have the same effect, we can adjust:
Population growth must be taken into account when predicting series such as the number of users
of public transport. If demographic studies are available, it is preferable to normalize the series and
predict the proportion of users.
3 Decomposition methods
The idea is that a time series is made up of a trend-cycle component, a seasonal component, and an
irregular component.
The additive model is appropriate if the magnitude of the seasonal fluctuations does not vary with the
level of the time series.
Multiplicative decomposition is more prevalent with economic series because most seasonal economic
series do have seasonal variation which increases with the level of the series.
Pseudo-additive decomposition:
which is useful in series where there is a short period that is much higher or lower than all the others.
3.4 Additive classical decomposition
1. The trend cycle is computed by using a low-pass filter or smoother (centered MA)
2. The de-trended series is computed as:
3. The seasonal component, which is assumed to be constant from one period to the next, is estimated as a
shorter-period average value of the de-trended series r(t).
4. The irregular component is given by:
aj are the weights of the moving average; the weights of some common weighted moving averages can be looked up in tables.
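The four steps above can be sketched in Python as follows. The sketch assumes an even seasonal period, a centered moving average with half weights at both ends (a 2 x period MA), and a seasonal component estimated as the average of the de-trended values of each season; the synthetic series and all names are illustrative.

```python
def centered_ma(y, period):
    """Centered moving average for an even period (half weights at both ends)."""
    half = period // 2
    trend = [None] * len(y)              # undefined at the first and last `half` points
    for t in range(half, len(y) - half):
        window = y[t - half:t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

def additive_decomposition(y, period):
    trend = centered_ma(y, period)                                   # step 1
    detrended = [None if tr is None else v - tr
                 for v, tr in zip(y, trend)]                         # step 2
    index = []
    for s in range(period):                                          # step 3
        vals = [r for t, r in enumerate(detrended)
                if r is not None and t % period == s]
        index.append(sum(vals) / len(vals))
    seasonal = [index[t % period] for t in range(len(y))]
    irregular = [None if tr is None else v - tr - s
                 for v, tr, s in zip(y, trend, seasonal)]            # step 4
    return trend, seasonal, irregular

# Synthetic quarterly series: linear trend plus a fixed seasonal pattern, no noise
pattern = [3.0, -1.0, -2.0, 0.0]
y = [10.0 + t + pattern[t % 4] for t in range(12)]
trend, seasonal, irregular = additive_decomposition(y, 4)
```

On this noise-free series the procedure recovers the trend and the seasonal pattern exactly, so the irregular component is zero wherever the trend is defined.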
The irregular component may be forecast as zero (for additive decomposition) or one (for
multiplicative decomposition). This assumes that the irregular component is uncorrelated,
which is not usually true.
The decomposition methods are generally used as exploratory methods.
where α is a constant between 0 and 1. The above expression can be put in weighted average form:
Properties:
Forecast: $\hat{y}(t+m) = L(t) + (\phi + \phi^2 + \dots + \phi^m)\,T(t)$
where μ is the mean of y[t], ψ0 = 1 and ε[t] is a sequence of independent and identically distributed random
variables with zero mean and a well-defined distribution. We will focus on three processes:
5.1.1 Properties
We cannot predict this series from its past, because there is no correlation. When diagnosing a model,
we should check that the residuals behave as white noise.
or:
where:
On the other hand, with δ = 0:
We can see that the simple autocorrelation decreases exponentially with time.
With a second-order autoregressive process, the autocorrelation decreases following an exponential that
changes sign.
Any AR(p) can be written as an MA(∞) model, and an MA(q) model (it must be invertible) can be written
as an AR(∞). For an MA(q) process to be invertible, the roots of the polynomial have to lie outside the
unit circle:
If we include a constant term:
Then:
Only the first term of the simple autocovariance is non-zero; beyond it, the autocovariance is zero.
5.4 ARMA processes
ARMA(p,q) process:
In order to be stationary, the roots of the polynomial have to lie outside the unit circle:
For an ARMA(p,q) process to be invertible, the roots of the polynomial have to lie outside the unit circle:
5.5 ARMA Model Identification
Covariance:
Correlation coefficient:
Autocorrelation:
For an AR(p) process, the ACF decays after k=p, but it never reaches 0 so it is not easy to identify an
AR process from its ACF. We should use the partial autocorrelation function.
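The slow decay of the ACF of an autoregressive process can be illustrated with a small simulation. The Python sketch below assumes the usual sample autocorrelation estimator (autocovariances divided by the lag-0 autocovariance, all with denominator N) and simulates an AR(1) with a hypothetical coefficient of 0.7.

```python
import random

def sample_acf(y, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of the series y."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf

random.seed(0)
phi = 0.7                      # hypothetical AR(1) coefficient
y = [0.0]
for _ in range(4999):
    y.append(phi * y[-1] + random.gauss(0.0, 1.0))

r = sample_acf(y, 3)           # decays roughly like phi, phi^2, phi^3
```

The autocorrelations shrink geometrically but do not cut off at lag 1, which is why the PACF is the more useful tool for choosing the order of an AR model.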
The PACF can be obtained by linear regression, interpreting each coefficient ϕkk as the partial correlation
between y[t] and y[t-k] after having eliminated in both variables the effects of the samples between
them:
For an MA(q) process, we should look to the ACF and not to the PACF.
We can check for heteroskedasticity (non-constant variance) and check for outliers.
We should also check the degree of significance of each autocorrelation coefficient. For a white noise
process:
We should check the coefficients individually, but we can also test them jointly with a Portmanteau test:
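As an illustration of a joint test, the Python sketch below assumes the Ljung-Box form of the statistic, Q = N (N + 2) Σ r_k² / (N - k) for k = 1..h, which may differ from the exact variant intended in these notes. Under the white noise hypothesis Q is approximately chi-square distributed, so a large Q points to remaining autocorrelation. The function name and inputs are illustrative.

```python
def ljung_box_q(acf, n, h):
    """Ljung-Box statistic from sample autocorrelations.

    acf[k] is the lag-k sample autocorrelation (acf[0] = 1),
    n is the series length and h the number of lags tested jointly.
    """
    return n * (n + 2) * sum(acf[k] ** 2 / (n - k) for k in range(1, h + 1))

q_white = ljung_box_q([1.0, 0.0, 0.0], 100, 2)   # no autocorrelation at all
q_corr = ljung_box_q([1.0, 0.5], 10, 1)          # one clearly non-zero coefficient
```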
5.6.2 Level of significance of the coefficients
We should do a t-test:
This effect is translated mathematically as:
This should be transformed by logarithms or by the Box-Cox transformation to stabilize the variance
through all the time series, in order to not have the effect of correlation between trend and seasonality.
Stationarity is reached by differencing. First-order differencing removes linear trends, and second-order
differencing removes quadratic trends. Differencing must be repeated until the series is stationary. In
case of doubt, it is usually better to over-difference than to avoid differencing.
The differencing of a previously log-transformed series is known as return.
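A short Python sketch of these operations on made-up series; the function name and the data are illustrative.

```python
import math

def difference(y, lag=1):
    """Differenced series y[t] - y[t - lag]."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

linear = [3 + 2 * t for t in range(6)]
quadratic = [t * t for t in range(6)]

d1 = difference(linear)                     # constant: the linear trend is removed
d2 = difference(difference(quadratic))      # constant: the quadratic trend is removed

prices = [100.0, 110.0, 121.0]              # hypothetical prices growing 10% per step
log_returns = difference([math.log(p) for p in prices])
```

Each log return is log(1.1), that is, approximately the relative change of the series between consecutive samples.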
5.7.2.1.1 Dickey-Fuller test
1. We fit the OLS model (with p = 3):
2. If y(t) requires differencing, then ϕ will be close to 0 (using a 5% threshold, differencing is
required if the p-value is greater than 0.05).
3. If y(t) is already stationary, then ϕ will be negative (ϕ < 0).
ARIMA(3,2,1):
5.7.4 Seasonal ARIMA models
Seasonality of period s is evidenced in the ACF and PACF when significant coefficients appear at the
multiples of the period s.
ARIMA(1,1,1)(1,1,1)4:
5.7.4.3.1 Identification
1. Plot the series and search for possible outliers
2. Stabilize the variance by transforming the data. Use the mean/std plot
3. Analyze the stationarity of the transformed series. If the data has a constant level and its ACF
and PACF cancel rapidly, then it can be considered stationary.
4. If the series is not stationary, we use differencing. For non-seasonal time series, apply regular
differencing. For seasonal time series, we first apply seasonal differencing and then apply
regular differencing (d, D ≤ 2)
5. Identify the seasonal model by analyzing the seasonal coefficients of the ACF and PACF.
6. Once the seasonal model has been identified, identify the regular component by exploring the
ACF and PACF of the residuals of the seasonal model.
7. Check the significance of the coefficients.
8. Analyze the residuals:
a. Outlier detection
b. Test for serial correlation (Ljung and Box test)
c. Plot the histogram of the residuals (Normality test)
9. Compare different models using AIC or SBC (M = p+q+P+Q):
where y[t] is the output or dependent variable, xi[t] are inputs, explanatory or independent variables and
ε[t] is the noise. In the basic hypothesis we assume linearity, independent residuals, homoscedasticity
and Gaussian residuals (white noise residuals).
where y[t] is the dependent or output variable, x[t] is the independent or explanatory input variable, v[t]
is the autocorrelated ARIMA noise, $w(L) = w_0 - w_1 L - w_2 L^2 - \dots - w_s L^s$,
$\delta(L) = \delta_0 - \delta_1 L - \delta_2 L^2 - \dots - \delta_r L^r$, and r, s, b are constant integers (b is the delay of the effect of x on y).
The dynamic regression model requires determining the orders r, s and b, and the values of p, d and
q of the ARIMA noise model. There are two methods: the traditional Box and Jenkins method and the LTF method.
with a large (8 – 10) k and a low order AR model for the noise
3. If the regression errors are not stationary, then difference y and x. Fit the model with the
differenced series.
4. If the regression errors are stationary, identify the transfer function α(L) by selecting
appropriate values for b, r and s:
a. The value of b is selected as the number of samples it takes for the output to respond
to the input.
b. The value of r determines the pattern of decay in the impulse response weights.
c. The value of s determines where the pattern of decay in the impulse response weights
begins.
5. Identify an ARMA model for the regression errors v[t]
6. Fit the complete model with the identified TF and ARMA model.
7. Analyze the residual ε[t] using the general procedure.
6.1.2.1 General rules
For the determination of b, we analyze the number of initial non-significant coefficients (α0, α1,
…, αb-1)
The value of r determines the pattern of decay of the coefficients αi:
o If there is no pattern of decay, but a set of non-zero coefficients followed by a cut to
zero, we take r = 0
o If the pattern of decay is exponential, we take r = 1
o If the pattern of decay is damped exponential or damped sine wave, we take r = 2
The value of s determines the number of non-null coefficients αi before the decay.
6.1.3 Model diagnosis
Tests on the parameters:
Check whether the model can be simplified by eliminating operators with values close in
numerator and denominator.
The roots of the AR polynomials should fulfill the stability conditions.
Check that all the coefficients are significant and have a reasonable physical meaning (in
particular the sign of the coefficients of the TF).
Tests on the residuals:
Topic 5. Unsupervised learning
1 Introduction
The probability density function (pdf) of a random variable X ∈ R gives a natural description of the
distribution of this variable in R:
Applications of the pdf are data description and characterization, discriminant analysis (classification),
clustering and simulation.
Suppose we have a set of observed data points {x[1], …, x[N]}, assumed to be a sample from an unknown
probability density function p(x) of the random variable X. Density estimation is the construction of an
estimate ^p(x) of the density function from the observed data.
2 Parametric methods
Parametric distributions can be described with a finite set of parameters. Examples: Normal(μ, σ),
Beta(α, β), … The basic procedure consists in:
1. Select a parametric family of probability density functions that is compatible with the
distribution of the data.
2. Estimate the parameters of the distribution
3. Diagnose the final model
The stats package in R contains the functions for the density function, cumulative distribution function,
quantile function and random variate generation. They are named dxxx, pxxx, qxxx, rxxx respectively.
Q-Q plot: a point (x,y) on the plot corresponds to one of the quantiles of the second distribution
(y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). If the two
distributions being compared are similar, the points lie approximately on the line y = x.
3 Non-parametric methods
The histogram is the oldest and most widely used density estimator. Given an origin x0 and a bin width δ,
the real axis can be partitioned in the form:
Pros:
o Simplicity
Cons:
o ^p(x) is not a continuous function
o Not adequate for clustering and classification
o The selection of x0 may affect the shape of the histogram
o Selection of the bin width δ
o Only for scalar random variables
3.1 The naive estimator
From the definition of a probability density, if the random variable X has density p(x), then:
The estimation is obtained by assigning a probability field a(x[i]) to each sample x[i]:
Pros:
o Simplicity
Cons:
o Stepwise predictions: discontinuities in x[i] and null derivatives
3.2 The kernel estimator
The naïve estimator can be generalized by substituting the weight function a(x) by a kernel function K()
which satisfies the condition:
Usually K(x) will be a symmetric probability density function. By analogy with the naïve estimator, the
kernel estimator is defined by the following expression where δ is the window width, smoothing
parameter or bandwidth.
The kernel estimator is a sum of bumps placed at the observations. The kernel K() determines the shape
of the bumps while the window width δ determines their width in the x axis.
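A minimal Python sketch of a kernel estimator, assuming a Gaussian kernel and the usual form ^p(x) = (1 / (N δ)) Σ K((x - x[i]) / δ). The sample values and function names are made up for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a symmetric kernel that integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(x, sample, delta):
    """Kernel estimate at x: an average of bumps of width delta centered at the data."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / delta) for xi in sample) / (n * delta)

sample = [-1.0, 0.0, 0.5, 2.0]   # hypothetical observations
delta = 0.5                      # hypothetical bandwidth

# Riemann sum over a wide grid: the estimate should integrate to about 1
step = 0.01
area = sum(kernel_density(i * step, sample, delta) for i in range(-600, 701)) * step
```

Decreasing delta makes each bump narrower and the estimate more wiggly; increasing it smooths the bumps together.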
Pros:
o ^p(x) will inherit all the continuity and differentiability properties of the kernel K.
Cons:
o The definition of a unique δ for the complete sample is not the best choice for the estimation
of distributions with heavy tails.
The smaller sigma is, the more closely the neural network fits the training set.
4.2 The probabilistic RBFN
If we define a specific scale factor for each radial unit and we limit the number of radial units to a
predefined number h:
It is necessary to adopt a learning strategy to optimize the position of the centers ri and the scale factors
σi.
5 Principal Components Analysis
5.1 Introduction
The objective of PCA is to exploit the covariance structure of a given set of variables by means of a
linear combination of them.
When faced with a large set of n correlated variables, PCA allows us to summarize this set with a smaller
number h of representative variables that collectively explain most of the variability in the original set
and are uncorrelated. This implies that we can reduce the initial set of n variables to a new set of h<n
variables, with little loss of information, simply by rotating the axis. It is also used for data visualization.
5.2 Computation
We have a sample of N elements defined by the values of n variables in a matrix X (N×n), where each column
is a variable and each row a case. Each variable must be centered, so X has zero mean and covariance matrix:
The problem consists in finding a space of smaller dimension that adequately represents the data, in
such a way that the data keep their structure (relative distances) with the least possible distortion. The
solution to the problem is formed by the directions, orthogonal to each other, that maximize the variance of the
projections.
The first PC of a set of features X1, X2, …, Xn is the normalized linear combination of the features:
$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{n1} X_n$
that has the largest variance. By normalized, we mean that $\sum_{i=1}^{n} \phi_{i1}^2 = 1$. We refer to the elements ϕi1
as the loadings of the first principal component. The loading vector ϕ1 defines a direction in feature
space along which the data vary the most. If we project the N data points onto this direction, the projected
values are the principal component scores Z11, …, ZN1 themselves.
After the first principal component Z1 of the features has been determined, we can find the second
principal component Z2. The second principal component is the linear combination of X 1, X2, …, Xn
that has maximal variance out of all linear combinations that are uncorrelated with Z1:
It can be shown that the space of dimension h that best represents the original points is defined by the
eigenvectors associated with the h largest eigenvalues of the covariance matrix S. These directions are
called principal directions of the data and the new variables are defined by the principal components.
In general, the matrix X (and S) has rank n, so there are as many principal components as original variables.
The eigenvalues are obtained as roots of the characteristic polynomial.
After obtaining the eigenvectors, and sorting them in descending order of the eigenvalues in the matrix
ϕ(nxn), the principal components Z(Nxn) are obtained from the centered original data X(Nxn) such as
Z=Xϕ
Therefore, calculating the principal components is equivalent to applying an orthogonal transformation
ϕ to the original data X.
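The computation can be sketched in Python for the two-variable case, where the eigenvalues and eigenvectors of the covariance matrix have a closed form. The sketch assumes the covariance is computed with denominator N; the data and names are illustrative.

```python
import math

def pca_2d(data):
    """PCA of N points with 2 variables: eigenvalues, loading vectors and scores."""
    n = len(data)
    m0 = sum(p[0] for p in data) / n
    m1 = sum(p[1] for p in data) / n
    X = [(p[0] - m0, p[1] - m1) for p in data]        # centered data
    a = sum(x0 * x0 for x0, _ in X) / n               # S = [[a, b], [b, c]]
    b = sum(x0 * x1 for x0, x1 in X) / n
    c = sum(x1 * x1 for _, x1 in X) / n
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = mid + rad, mid - rad                 # roots of the characteristic polynomial
    if b != 0:
        v = (b, lam1 - a)                             # eigenvector of the largest eigenvalue
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    phi1 = (v[0] / norm, v[1] / norm)                 # normalized loadings
    phi2 = (-phi1[1], phi1[0])                        # orthogonal direction
    Z = [(x0 * phi1[0] + x1 * phi1[1], x0 * phi2[0] + x1 * phi2[1]) for x0, x1 in X]
    return (lam1, lam2), (phi1, phi2), Z

data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-1.5, -1.4), (1.0, 0.8), (-1.0, -0.8)]
(lam1, lam2), (phi1, phi2), Z = pca_2d(data)
explained = lam1 / (lam1 + lam2)   # proportion of variance explained by the first PC
```

The variance of the first column of scores equals lam1 and the two columns are uncorrelated, which is exactly what the rotation Z = Xϕ is meant to achieve.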
5.3 Properties
They preserve the initial variability: the sum of variances of the n PC is equal to the sum of the variances
of the original n variables.
The variance of principal component Zi is λi
The proportion of the total variance explained by Zi is:
The first h PC provide the optimal linear prediction with h variables of the set variables X:
5.4 Interpretation
When there is a high positive correlation between all the variables, the first PC has all its coordinates of
the same sign and can be interpreted as a weighted average of all the variables (size factor).
The other components are interpreted as shape factors and have positive and negative coordinates. They can
be written as weighted means of two groups of variables of different sign, contrasting the variables of
one sign with those of the other.
5.5.1 Setup
Assume there exist independent signals S = [s1(t), …, sn(t)]. We observe only linear combinations of them,
X(t) = A S(t), where both A and S are unknown. A is called the mixing matrix. We have to recover S
from X, so we need to find a linear transformation L, ideally $A^{-1}$, such that LX(t) = S(t).
5.5.2 Computation
First, get rid of correlation (“whitening”): apply a linear transformation to decorrelate and normalize
the signals (PCA). Let Z = ϕX.
Then, address higher-order dependence: find a rotation W that makes the whitened signals independent.
The optimization problem is $\min_W \mathrm{dep}(WZ)$ subject to $W^T W = I$, where dep(M) is a measure of the
dependency between the columns of M.
5.5.3 Independence
For independent signals u and v:
Excess Kurtosis
Takes values from -2 to infinity; it is 0 for a Gaussian
Maximize the absolute value to find non-Gaussian signals
dep(M) = -1 x [excess kurtosis of columns of M]
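A small Python sketch of the excess kurtosis, assuming the moment-based definition m4 / m2² - 3 with denominators N. The samples are simulated for illustration only.

```python
import random

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0

random.seed(1)
gaussian = [random.gauss(0.0, 1.0) for _ in range(20000)]
uniform = [random.random() for _ in range(20000)]

k_gauss = excess_kurtosis(gaussian)               # close to 0
k_uniform = excess_kurtosis(uniform)              # close to -1.2 (flatter than a Gaussian)
k_two_point = excess_kurtosis([-1.0, 1.0, -1.0, 1.0])   # the lower bound, -2
```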
6 Clustering
6.1 Introduction
Cluster analysis is described in terms of internal homogeneity and external separation; that is, data
objects in the same cluster should be similar to each other, while objects in different clusters should be dissimilar.
Clustering a set of data consists of the following steps:
1. Feature selection or extraction
2. Clustering algorithm and proximity measure design or selection
3. Cluster validation
4. Result interpretation
6.2 Proximity measures
A data object is described by a set of features or variables, usually represented as a multidimensional
vector. For N data objects with n features, an N×n pattern matrix is built from the corresponding vectors.
Each row in the matrix denotes an object while each column represents a feature.
Qualitative or categorical, when they can take on one of a limited and usually fixed number of
possible values or labels and do not have a numerical or quantitative meaning. They simply
describe a quality or characteristic of something.
Quantitative, when they are measured and expressed numerically, have numerical meaning and
can be used in calculations. They can be continuous or discrete.
The Euclidean distance tends to form hyperspherical clusters and is invariant to translations and rotations
in the feature space.
The data should be normalized in order that the different units don’t affect the clusters (a bigger unit
may dominate over a smaller one). One solution is standardization (z-score):
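A minimal Python sketch of z-score standardization for one feature (one column of the pattern matrix). The use of the sample standard deviation (denominator N - 1) is an assumption; the data are made up.

```python
def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / (n - 1)) ** 0.5
    return [(v - mean) / std for v in column]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical feature in centimeters
z = standardize(heights_cm)                        # unit-free, zero mean, unit variance
```

Expressing the same feature in meters instead of centimeters gives exactly the same z values, so the choice of unit no longer influences the distances.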
The Euclidean distance is a special case (p = 2) of the Minkowski distance or Lp
norm:
The more similar the two objects, the more parallel they are in the feature space, and the greater the
cosine value.
with:
with:
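Two of the proximity measures discussed in this section can be sketched in Python as follows, assuming the standard definitions of the Minkowski distance, (Σ |x_l - y_l|^p)^(1/p), and of the cosine similarity, xᵀy / (‖x‖ ‖y‖). Function names and vectors are illustrative.

```python
import math

def minkowski(x, y, p):
    """Minkowski (Lp) distance; p = 2 is the Euclidean distance, p = 1 the city-block distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d_euclid = minkowski([0.0, 0.0], [3.0, 4.0], 2)          # Euclidean distance
d_city = minkowski([0.0, 0.0], [3.0, 4.0], 1)            # city-block distance
cos_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0]) # same direction
cos_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```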
If a categorical variable has more than two values (for example, coded as 00, 01, 10 and 11), a more effective and
commonly used method is based on the simple matching criterion:
where w is usually 1 (values greater than 1 are used when the number of possible values is high).
When the values of a discrete variable are ordered from lowest to highest, they can be compared using the
continuous dissimilarity measures, normalizing the values to the range [0;1].
where Sijl represents the similarity in component l, and δijl is a binary coefficient that indicates if the
measure is missing or not.
Partitional clustering directly divides data points into some prespecified number of clusters
without the hierarchical structure
Hierarchical clustering groups data with a sequence of nested partitions, either from singleton
clusters to a cluster including all individuals (agglomerative) or vice versa (divisive). The results
of hierarchical clustering are usually depicted by a binary tree or dendrogram.
6.3.1.1 Dendrogram
The root node of the dendrogram represents the whole data set. Each leaf node is regarded as a data
point and the height of the dendrogram expresses the distance between each pair of data points or
clusters.
6.3.2 Agglomerative hierarchical clustering
Divisive methods are very computationally intensive, so agglomerative methods are more common.
The general agglomerative clustering can be summarized as
In the single linkage algorithm, or nearest neighbor method, the distance between a pair of clusters is
determined by the two closest objects of the different clusters. Single linkage clustering tends to generate
elongated clusters, producing a chaining effect that can connect unrelated clusters because of noise.
In contrast to single linkage, the complete linkage method uses the farthest distance between a pair of
objects to define the inter-cluster distance. It is effective in uncovering small, compact clusters and tends
to generate spherical clusters.
Another method is the centroid method, which is generally applied only with continuous variables and
takes as the distance between groups the Euclidean distance between their centers.
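The three inter-cluster distances just described can be compared with a short Python sketch. The clusters are made-up points on a line and the function names are illustrative.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(A, B):
    """Distance between the two closest objects of the two clusters."""
    return min(euclidean(p, q) for p in A for q in B)

def complete_linkage(A, B):
    """Distance between the two farthest objects of the two clusters."""
    return max(euclidean(p, q) for p in A for q in B)

def centroid_distance(A, B):
    """Euclidean distance between the cluster centers."""
    center = lambda C: tuple(sum(c) / len(C) for c in zip(*C))
    return euclidean(center(A), center(B))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
d_single = single_linkage(A, B)
d_complete = complete_linkage(A, B)
d_centroid = centroid_distance(A, B)
```

At each agglomerative step the pair of clusters with the smallest such distance is merged, so the choice of linkage determines the shape of the resulting clusters.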
Ward introduced another type of method, whose objective at each stage is to minimize the increase in
the total within-cluster error sum of squares. With K the number of clusters and mk the centroid of
cluster Ck, this error is given by:
$E = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^2$
where:
The partition that minimizes the sum-of-squared-errors criterion is regarded as optimal and is called the
minimum variance partition.
6.3.3.1 K-means algorithm
The basic clustering procedure of K-means is:
1. Select K samples of the dataset as initial prototypes: randomly, K points furthest from each
other or manual selection in the PCA plane.
2. Assign each object in the data set to the nearest cluster Cl
3. Recalculate the cluster prototype matrix based on the current partition: $m_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$
4. Repeat steps 2 and 3 until there is no change for each cluster
The input space is divided into Voronoi regions corresponding to a set of prototype vectors or Voronoi
vectors. Each point in a Voronoi region is closer to its vector than to any other. The algorithm
described above performs batch mode learning, since the update occurs after all the data have been processed.
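The batch procedure can be sketched in Python as follows. The initial prototypes are passed in explicitly (here two samples of the dataset), the squared Euclidean distance is assumed, and the data and names are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, prototypes, max_iter=100):
    """Batch K-means: assign all points, then recompute every prototype."""
    prototypes = [tuple(p) for p in prototypes]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest prototype
        new_labels = [min(range(len(prototypes)),
                          key=lambda k: squared_distance(x, prototypes[k]))
                      for x in points]
        if new_labels == labels:          # Step 4: stop when nothing changes
            break
        labels = new_labels
        # Step 3: each prototype becomes the mean of its cluster
        for k in range(len(prototypes)):
            members = [x for x, l in zip(points, labels) if l == k]
            if members:
                prototypes[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return prototypes, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes, labels = k_means(points, [points[0], points[3]])   # step 1: two samples as prototypes
```

Different initial prototypes can lead to different final partitions, which is why the initialization strategy in step 1 matters.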
The on-line or incremental mode K-means adjusts the cluster centroids each time a data point is
processed. With η as learning rate:
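A minimal Python sketch of one incremental step, assuming the common update rule in which only the nearest prototype moves towards the new data point, m ← m + η (x - m). This rule and the names used are assumptions for illustration.

```python
def online_k_means_step(prototypes, x, eta):
    """Move the prototype closest to x a fraction eta of the way towards x."""
    nearest = min(range(len(prototypes)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))
    updated = list(prototypes)
    updated[nearest] = tuple(m + eta * (xi - m) for m, xi in zip(prototypes[nearest], x))
    return updated

prototypes = [(0.0, 0.0), (10.0, 10.0)]
after = online_k_means_step(prototypes, (2.0, 0.0), eta=0.5)
# Only the first prototype moves, halfway towards the new point.
```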
Connectedness relates to what extent observations are placed in the same cluster as their nearest
neighbors in the data space.
Compactness assesses cluster homogeneity, looking at intra-cluster variance.
Separation quantifies the degree of separation between clusters, usually by measuring the
distance between cluster centroids.
Since compactness and separation show opposing trends, they are usually combined into a single score.
The Dunn Index takes values between zero and infinity and should be maximized.
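As an illustration of a score that combines both criteria, the sketch below computes one common formulation of the Dunn Index: the smallest distance between points of different clusters divided by the largest cluster diameter. The choice of these two particular distance definitions is an assumption, since several variants exist. The function name `dunn_index` and the toy clusters are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """clusters: list of lists of points, each cluster holding at least 2 points."""
    # Separation: smallest distance between two points in different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Compactness: largest distance between two points of the same cluster.
    diameter = max(
        math.dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return separation / diameter


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 2.0)]]
dunn = dunn_index(clusters)
print(dunn)
```

Well-separated, tight clusters make the numerator large and the denominator small, which is why larger values indicate a better partition.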
6.4.3 Selection of the number of clusters
Different validation measures are compared in order to determine the optimal number of clusters.
6.4.3.1 The elbow method
1. Run the clustering algorithm for different values of k.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot WSS against the number of clusters k.
4. The location of a bend (elbow) in the plot is usually considered the appropriate number of clusters.
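The four steps can be sketched as follows. This is only an illustration under stated assumptions: the clustering algorithm is a basic batch K-means started from mutually distant samples, and the visual search for the bend is replaced by a simple heuristic (the k with the largest second difference of WSS). The helper names `farthest_point_init`, `kmeans_wss` and `elbow` are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Start at points[0], then repeatedly add the sample farthest from those chosen."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen


def kmeans_wss(points, k, max_iter=100):
    """Run a basic batch K-means and return the total within-cluster sum of squares."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:  # prototypes stopped moving
            break
        centroids = updated
    return sum(math.dist(p, centroids[c]) ** 2 for c, g in enumerate(groups) for p in g)


def elbow(wss_by_k):
    """Heuristic bend detector; assumes consecutive integer values of k."""
    ks = sorted(wss_by_k)
    return max(
        ks[1:-1],
        key=lambda k: wss_by_k[k - 1] - 2 * wss_by_k[k] + wss_by_k[k + 1],
    )


# Three small, well-separated groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
    (5.0, 10.0), (5.0, 11.0), (6.0, 10.0), (6.0, 11.0),
]
wss = {k: kmeans_wss(points, k) for k in range(1, 6)}  # steps 1 and 2
best_k = elbow(wss)                                     # stands in for steps 3 and 4
print(wss, best_k)
```

WSS always tends to decrease as k grows, so the useful signal is where the decrease flattens out, not the minimum itself.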
6.4.3.2 The average Silhouette method
1. Run the clustering algorithm for different values of k.
2. For each k, calculate the average silhouette of the observations.
3. Plot the average silhouette against the number of clusters k.
4. The location of the maximum is considered the appropriate number of clusters.
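A short sketch of the selection step, assuming the standard silhouette width s(i) = (b − a) / max(a, b), where a is the mean distance from observation i to the other members of its cluster and b is the mean distance to the closest other cluster. The labelings in `labels_by_k` are written by hand here and stand in for the output of any clustering algorithm. The function name `average_silhouette` is illustrative.

```python
import math


def average_silhouette(points, labels):
    """Mean silhouette width over all observations (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # common convention: a singleton contributes s(i) = 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
]
labels_by_k = {
    2: [0, 0, 0, 0, 1, 1, 1, 1],
    3: [0, 0, 2, 2, 1, 1, 1, 1],
    4: [0, 0, 2, 2, 1, 1, 3, 3],
}
scores = {k: average_silhouette(points, labels) for k, labels in labels_by_k.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Unlike WSS, the average silhouette does not improve automatically as k grows, so its maximum can be read off directly.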
6.5.1 Notation
Each model (network) consists of a set of H units $A = \{c_1, c_2, \dots, c_H\}$. Each unit c is associated with a
reference or prototype vector $w_c \in R^n$ that indicates its position in the input space.
Between the units of the network there is a set of unweighted and symmetrical neighborhood
connections C ⊂ A × A. These connections define the topological ordering of the network:
For each unit c, let $N_c$ be the set of direct neighbors of c. The input vectors are drawn from an
unknown probability density function p(x), from which the training data set has also been generated.
Given an input vector x, the winning unit s(x) is the unit of A whose prototype or reference
vector is closest to x:
$$s(x) = \arg\min_{c \in A} \| x - w_c \|$$
where ‖·‖ is the Euclidean norm. Similarly, $s_i(x)$ denotes the i-th closest unit to x.
Given a set of reference vectors $w_1, \dots, w_H$ in $R^n$, the Voronoi region $V_i$ of vector $w_i$ is the set of
points of $R^n$ for which $w_i$ is the nearest reference vector. The Voronoi region of unit c, c ∈ A, is
defined as the Voronoi region of its reference vector:
$$V_c = \{ x \in R^n \mid s(x) = c \}$$
In the case of a finite data set D, the Voronoi set of unit c is defined as the subset $R_c$ of D for which c is
the winning unit:
$$R_c = \{ x \in D \mid s(x) = c \}$$
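These definitions translate almost literally into code. The sketch below assumes the reference vectors are stored in a dictionary `w` that maps each unit to its vector; that layout, the function names and the toy values are illustrative choices, not part of the notation above.

```python
import math


def winner(x, w):
    """s(x): the unit whose reference vector is closest to x (Euclidean norm)."""
    return min(w, key=lambda c: math.dist(x, w[c]))


def ranked_units(x, w):
    """[s_1(x), s_2(x), ...]: all units sorted from closest to farthest."""
    return sorted(w, key=lambda c: math.dist(x, w[c]))


def voronoi_sets(data, w):
    """R_c for every unit c: the samples of D for which c is the winning unit."""
    sets = {c: [] for c in w}
    for x in data:
        sets[winner(x, w)].append(x)
    return sets


w = {"c1": (0.0, 0.0), "c2": (4.0, 0.0), "c3": (0.0, 4.0)}
D = [(1.0, 0.0), (3.0, 1.0), (0.0, 3.0), (0.5, 0.5)]
R = voronoi_sets(D, w)
print(R)
print(ranked_units((3.0, 1.0), w))
```

The Voronoi sets partition D: every sample appears in exactly one of them, and a unit that never wins has an empty set.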
The k-means algorithm is an example of vector quantization.
8. Prune the connections older than amax. If pruning results in disconnected units, remove them.
9. If the number of input vectors processed is a multiple of the parameter λ, insert a new unit:
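a. Determine the unit q from A with the largest error: $q = \arg\max_{c \in A} E_c$
b. Determine, among the neighbors of q, the unit f with the largest error: $f = \arg\max_{c \in N_q} E_c$
c. Add a new unit r, interpolating the reference vectors of q and f: $A = A \cup \{r\}$; $w_r = \frac{w_q + w_f}{2}$
d. Connect the new unit r with q and f, and eliminate the connection between q and f:
$C = C \cup \{(r, q), (r, f)\}$; $C = C \setminus \{(q, f)\}$
e. Decrease the errors of q and f by a fraction α: $\Delta E_q = -\alpha E_q$, $\Delta E_f = -\alpha E_f$
f. Estimate the error of unit r from the errors of q and f: $E_r = \frac{E_q + E_f}{2}$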
10. Decrease the error of all units: ΔEc = -βEc, ∀c ϵ A
11. If the stopping criterion has not been reached, go to step 2
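The unit-insertion step (step 9) can be sketched as below. The data layout is an assumption made for the example: `w` and `E` are dictionaries keyed by unit, and `C` maps each connection (a `frozenset` of two units) to its age. Giving the new connections age 0 is also an assumption. The function name `insert_unit` is hypothetical, and the sketch presumes that q has at least one neighbor, which step 8 is meant to guarantee.

```python
def insert_unit(w, E, C, alpha, new_id):
    """Insert unit new_id between q and its worst neighbor f. Mutates w, E and C."""
    # a. q: unit with the largest accumulated error.
    q = max(E, key=E.get)
    # b. f: neighbor of q with the largest error (neighbors are read from C).
    neighbors = [next(iter(edge - {q})) for edge in C if q in edge]
    f = max(neighbors, key=E.get)
    # c. Reference vector of the new unit: midpoint of w_q and w_f.
    w[new_id] = tuple((a + b) / 2 for a, b in zip(w[q], w[f]))
    # d. Replace the connection q-f by the connections r-q and r-f.
    del C[frozenset((q, f))]
    C[frozenset((new_id, q))] = 0
    C[frozenset((new_id, f))] = 0
    # e. Decrease the errors of q and f by the fraction alpha.
    E[q] -= alpha * E[q]
    E[f] -= alpha * E[f]
    # f. Error of the new unit: mean of the (already decreased) errors of q and f.
    E[new_id] = (E[q] + E[f]) / 2
    return q, f


w = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (2.0, 2.0)}
E = {"a": 1.0, "b": 4.0, "c": 2.0}
C = {frozenset(("a", "b")): 0, frozenset(("b", "c")): 3}
q, f = insert_unit(w, E, C, alpha=0.5, new_id="r")
print(q, f, w["r"], E)
```

New units are therefore placed where the accumulated error is highest, which is how the network grows toward poorly represented regions of the input space.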
Typical values:
6.6 Model-based clustering
6.6.1 The probabilistic RBFN
Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fitted to
the data. This model is equivalent to a probabilistic radial basis function network (PRBFN):
Each mixture component represents a cluster, and the mixture components and group memberships are
estimated using maximum likelihood. The parameters $r_i$ and $\sigma_i$ are optimized to maximize the log-likelihood:
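To make the objective concrete, the sketch below evaluates the log-likelihood of a data set under a mixture of isotropic Gaussians. Several things here are assumptions made for illustration: that $r_i$ plays the role of the center and $\sigma_i$ the width of component i, that each component has covariance $\sigma_i^2 I$, and that the components are combined with mixing weights (called `weights` here). The function name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(data, centers, sigmas, weights):
    """Sum over samples of log( sum_i weights[i] * N(x; centers[i], sigmas[i]^2 I) )."""
    n = len(data[0])  # dimension of the input space
    total = 0.0
    for x in data:
        density = 0.0
        for r, s, pi in zip(centers, sigmas, weights):
            sq = sum((a - b) ** 2 for a, b in zip(x, r))
            density += pi * math.exp(-sq / (2 * s * s)) / (2 * math.pi * s * s) ** (n / 2)
        total += math.log(density)
    return total


data = [(0.0, 0.0), (0.2, -0.1), (5.0, 5.0), (5.1, 4.8)]
good = log_likelihood(data, [(0.0, 0.0), (5.0, 5.0)], [0.5, 0.5], [0.5, 0.5])
bad = log_likelihood(data, [(2.0, 2.0), (3.0, 3.0)], [0.5, 0.5], [0.5, 0.5])
print(good, bad)
```

Centers that sit on the two groups of points give a higher log-likelihood than centers placed between them, which is the quantity a maximum likelihood fit would push upward.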
A possible rule for propagating the adaptation of the winning unit to the remaining units is:
where:
7.1.1 Algorithm
1. Initialize the set A with H = H1·H2 units ci, with reference vectors chosen according to p(x): A = {c1, c2, …, cH}
Initialize the set of connections C in the form of a rectangular H1 × H2 grid.
Initialize t = 0
2. Generate a new input vector x according to p(x)
3. Determine the winning unit s = s(x)
4. Adapt each unit r:
where:
7.2