Machine Learning

This document provides an overview of machine learning techniques for classification and regression. It discusses the learning process and types of learning including supervised, unsupervised, reinforcement, and evolutionary learning. For classification, it covers preprocessing, probabilistic approaches, training/validating/testing methods like cross-validation, and performance measures. Models covered include logistic regression, Bayesian probabilities, k-nearest neighbors, decision trees, bagging/random forests, support vector machines, and neural networks. For regression, it discusses linear and nonlinear regression models, assessing accuracy, and model selection/regularization techniques.

Uploaded by hiphoplistener
Copyright © All Rights Reserved
Machine Learning

Author: Jaime Pizarroso Gonzalo
Topic 1. Introduction
1 Data Mining and Machine Learning
2 Learning process
2.1 Data
2.2 Abstraction
2.3 Generalization
2.4 Steps to apply Machine Learning to data
3 Types of learning
3.1 Supervised learning
3.2 Unsupervised learning
3.3 Reinforcement learning
3.4 Evolutionary learning
Topic 2. Classification
1 Classification: problem statement
2 Preprocessing
3 Probabilistic approach
4 Training, validating and testing
4.1 Cross-validation
4.2 Measuring classification performance
4.2.1 Histograms
4.2.2 Calibration plots
4.2.3 Heat maps
4.2.4 Confusion matrix
5 Models
5.1 Logistic regression
5.2 Bayesian probabilities
5.2.1 Linear discriminant analysis for n = 1
5.2.2 Linear discriminant analysis for n > 1
5.2.3 Quadratic discriminant analysis
5.3 K-Nearest Neighbors
5.4 Decision trees
5.4.1 Gini index (CART decision trees)
5.4.2 Entropy/Information statistic (C4.5 decision trees)
5.4.3 Algorithm
5.5 Bagging and Random forests
5.5.1 Bagging
5.5.2 Out-of-Bag Error estimation
5.6 Random forests
5.6.1 Considerations
5.7 Support vector machines
5.7.1 Considerations
5.7.2 Linearly non-separable
5.7.3 Non-linear classification
5.7.4 Support Vector Machine
5.8 Neural Networks
5.8.1 Perceptron
5.8.2 Multilayer perceptron

Topic 3. Regression
1 Regression: problem statement
1.1 Regression model
1.2 Function approximators
1.2.1 Model identification
1.2.2 Model diagnosis
2 Linear regression model
2.1 Assessing the accuracy of the model
2.2 Test for significance of regression
2.2.1 F-test of the overall fit
2.3 Test on individual regression coefficients
2.3.1 t-test on individual coefficients
2.4 Multicollinearity
2.5 Linear model selection and regularization
2.5.1 Subset selection
2.5.2 Shrinkage methods
2.5.3 Dimension reduction methods
2.5.4 Regression in high dimensions
3 Polynomial regression
4 Regression splines
4.1 Smoothing spline
5 Generalized Additive Models (GAMs)
5.1 Advantages
5.2 Limitations
5.3 Multilayer perceptron
6 Support Vector Machines (SVM)
6.1 Regularization theory
6.1.1 Radial Basis Function Networks

Topic 4. Forecasting
1 Introduction
1.1 Objectives
1.2 Forecasting methods
1.2.1 Quantitative methods
2 Fundamental concepts
2.1 Stochastic processes
2.2 Time series
2.3 Means, variances and covariances
2.3.1 Properties
2.4 Stationary processes
2.5 White noise process
2.6 Random walk
2.7 Autoregressive processes
2.8 Moving Average processes
2.9 Measures of forecast accuracy
2.9.1 Cross-validation methods
2.10 Mathematical transformation and adjustments
2.10.1 Mathematical transformation
2.10.2 Calendar adjustments
2.10.3 Adjustment for inflation and population growth
3 Decomposition methods
3.1 Additive model
3.2 Multiplicative model
3.3 General formulation
3.4 Additive classical decomposition
3.5 Multiplicative classical decomposition
3.6 Moving averages
3.7 X11/X12/X13 – ARIMA
3.8 Relation between forecasting and decomposition
4 Exponential smoothing methods
4.1 Simple exponential smoothing
4.2 Trend methods
4.2.1 Holt’s Linear Trend method
4.2.2 Damped Trend method
4.3 Trend and seasonality methods
4.3.1 Holt-Winters Exponential Smoothing
5 Basic linear processes
5.1 Gaussian White Noise
5.1.1 Properties
5.2 Autoregressive processes
5.3 Moving Average processes
5.4 ARMA processes
5.5 ARMA Model Identification
5.6 ARMA Model Diagnosis
5.6.1 Residual analysis
5.6.2 Level of significance of the coefficients
5.7 ARIMA Models
5.7.1 The Box-Jenkins methodology
5.7.2 Integrated processes
5.7.3 ARIMA (p,d,q)
5.7.4 Seasonal ARIMA models
6 Dynamic regression models
6.1 Multiple regression
6.1.1 Formulation (Pankratz)
6.1.2 LTF (Linear Transfer function) method
6.1.3 Model diagnosis

Topic 5. Unsupervised learning
1 Introduction
2 Parametric methods
2.1 Selection of the family of distributions
2.1.1 Validation of the selection
2.1.2 Parameter estimation
3 Non-parametric methods
3.1 The naive estimator
3.2 The kernel estimator
4 Neural networks for density estimation
4.1 The Parzen estimator
4.2 The probabilistic RBFN
4.2.1 Learning algorithm
5 Principal Components Analysis
5.1 Introduction
5.2 Computation
5.3 Properties
5.4 Interpretation
5.5 Independent Component Analysis
5.5.1 Setup
5.5.2 Computation
5.5.3 Independence
6 Clustering
6.1 Introduction
6.2 Proximity measures
6.2.1 Distance measures
6.2.2 Similarity measures
6.2.3 Measures for continuous variables
6.2.4 Measures for binary variables
6.2.5 Measures for mixed variables
6.3 Hierarchical clustering
6.3.1 Introduction
6.3.2 Agglomerative hierarchical clustering
6.3.3 Partitional clustering
6.4 Validation measures
6.4.1 Silhouette width
6.4.2 Dunn index
6.4.3 Selection of the number of clusters
6.5 Vector quantization
6.5.1 Notation
6.5.2 Minimization of the quantization error
6.5.3 Neural gas
6.5.4 Hebbian learning
6.5.5 Neural Gas + Hebbian learning
6.5.6 Growing neural gas
6.6 Model-based clustering
6.6.1 The probabilistic RBFN
7 Kohonen Self-Organising Maps
7.1 Learning algorithm
7.1.1 Algorithm

Topic 1. Introduction
1 Data Mining and Machine Learning
We are in a data-rich but information-poor situation: decision makers usually lack the tools
to extract the valuable knowledge embedded in the data.

• Data Mining
  o Knowledge discovery from data (KDD)
  o Discovering patterns and associations in large data sets
  o Turning data into information
  o Uncovering valuable information in data and transforming it into organized knowledge
• Machine learning
  o Field of study that gives computers the ability to learn without being explicitly programmed
  o Making computers modify or adapt their actions so that these actions become more accurate
• A machine learns if it can use experience to improve its performance on similar experiences in the future

2 Learning process
2.1 Data
The input data is the main source of knowledge; its quality determines the quality of the final system.
Learning from data requires observation, memory storage and recall.

2.2 Abstraction
Abstraction is the translation of data into broader representations. During the abstraction process, we assign
meaning to data by representing knowledge using some kind of model (equations, diagrams such as
trees and graphs, logical if/else rules or groupings of data known as clusters). The process of fitting a
particular model to a dataset is known as training.

2.3 Generalization
Generalization uses abstracted data to form a basis for action. A model is said to generalize if it produces correct outputs
for cases not included in the training dataset. Measuring the generalization capabilities of a model is an
essential task. Our final objective is being able to generalize from a finite set of data.

2.4 Steps to apply Machine Learning to data


1. Collecting data: several sources of information usually have to be integrated. Data warehouses have
become a very common data repository architecture.
2. Cleaning, exploring and preparing the data: about 80% of the effort goes into this step.
3. Identifying and training a model.
4. Evaluating model performance: estimate the generalization capabilities of the model.
5. Improving model performance.
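
Steps 3 and 4 can be illustrated with a minimal sketch, assuming a toy one-dimensional dataset and a least-squares line as the model. The data values and the names `fit_line` and `mse` are invented for illustration; the point is only that the model is trained on one part of the data and its generalization is estimated on instances it has not seen.

```python
# Toy data: y is roughly 2*x + 1 (values invented for illustration).
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1), (5.0, 11.0)]

# Hold out the last two instances to estimate generalization.
train, test = data[:4], data[4:]

def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(points, a, b):
    """Mean squared error of the line a*x + b on a list of (x, y) pairs."""
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

a, b = fit_line(train)           # step 3: training
train_error = mse(train, a, b)   # error on data the model has seen
test_error = mse(test, a, b)     # step 4: error on unseen data
```

Comparing `train_error` with `test_error` gives a first indication of how well the fitted model generalizes.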

3 Types of learning
3.1 Supervised learning
The aim of supervised learning is to learn an input-output mapping from a labelled dataset. Applications:

• Classification: the output is discrete.
• Regression: the output is continuous and not time dependent.
• Forecasting: the output is continuous and time dependent.

3.2 Unsupervised learning
The aim is to find the regularities in the input data by discovering patterns (characterize what generally
happens and what does not). Applications:

• Density estimation
• Clustering
• Vector Quantization
• Dimensionality reduction

3.3 Reinforcement learning


The learner is a decision-making agent that takes actions in an environment and receives rewards for its
actions in trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy,
which is the sequence of actions that maximizes the total reward. Applications:

• Robot control
• Games
• Other activities that a software agent can learn
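
The trial-and-error idea can be sketched in a deliberately simplified setting: a single-state problem in which each action returns a fixed reward that the agent does not know in advance. The environment, the action names and the explore-then-exploit schedule below are assumptions made for illustration, not a method taken from these notes.

```python
# Hypothetical environment: each action yields a fixed reward (unknown to the agent).
REWARDS = {"left": 0.2, "stay": 0.5, "right": 1.0}

def act(action):
    """Environment step: return the reward for the chosen action."""
    return REWARDS[action]

actions = list(REWARDS)
totals = {a: 0.0 for a in actions}   # sum of rewards observed per action
counts = {a: 0 for a in actions}     # number of times each action was tried

total_reward = 0.0
for trial in range(20):
    if trial < 2 * len(actions):
        # Exploration phase: try every action in turn (trial and error).
        action = actions[trial % len(actions)]
    else:
        # Exploitation phase: pick the action with the best average reward so far.
        action = max(actions, key=lambda a: totals[a] / counts[a])
    reward = act(action)
    totals[action] += reward
    counts[action] += 1
    total_reward += reward

# The learned "policy" in this single-state setting is simply the best action.
best_action = max(actions, key=lambda a: totals[a] / counts[a])
```

In a full reinforcement learning problem the agent also has to account for states and delayed rewards, which this sketch leaves out.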

3.4 Evolutionary learning


Evolutionary learning uses ideas from natural evolution such as reproduction, mutation, recombination
and selection. One of its principles is survival of the fittest. Evolutionary computation techniques can be
used in optimization, learning and design.
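
A minimal sketch of these ideas, assuming a toy objective (counting the 1-bits in a bit string, often called OneMax) and hand-picked settings for population size, mutation rate and number of generations. None of these choices come from the notes; they only show where selection, recombination, mutation and reproduction appear in a loop.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
GENOME_LENGTH = 20

def fitness(genome):
    """OneMax: number of 1-bits; a stand-in for any objective function."""
    return sum(genome)

def crossover(a, b):
    """Recombination: one-point crossover of two parents."""
    cut = rng.randrange(1, GENOME_LENGTH)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.05):
    """Mutation: flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]

population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)] for _ in range(10)]
initial_best = max(fitness(g) for g in population)

for generation in range(30):
    # Selection: keep the fitter half (survival of the fittest).
    population.sort(key=fitness, reverse=True)
    parents = population[:5]
    # Reproduction: refill the population with mutated offspring of random parent pairs.
    children = [mutate(crossover(*rng.sample(parents, 2))) for _ in range(5)]
    population = parents + children

final_best = max(fitness(g) for g in population)
```

Because the parents are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.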

Topic 2. Classification
1 Classification: problem statement
Given a set of n attributes (features) which belong to an n-dimensional real space, a set of m classes, and a
set of N labeled training instances, in which every instance has n attributes and belongs to one of the m
classes, determine a classification rule that predicts the class of any instance from the value of its attributes.
The classification rule is a partition of the input space.

• If observations are grouped in just two categories or classes, the problem is one of binary
  classification.
  o The important category is described as signal and the second as background.
• If there are more than two categories, this is a multiclass problem.

2 Preprocessing
 Many machine learning algorithms are affected by the scale of the predictors.
 Standardization: standard scores are also called z-values, z-scores, normal scores, and
standardized variables:

$$x^* = \frac{x - \bar{x}}{\sigma_x}$$

 To resolve skewness:
o An un-skewed distribution is one that is roughly symmetric.
o A right-skewed (positive skew) distribution has a larger number of points on the left side
of the distribution (smaller values) than on the right side (larger values):

$$\text{skewness} = E\left[\left(\frac{x - \bar{x}}{\sigma_x}\right)^3\right]$$

o The Box-Cox transformation can be used to make the distribution of the variable as
normal as possible (skewness → 0):

$$x^* = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$

This family can identify the square transformation (λ = 2), square root (λ = 0.5), inverse (λ
= -1) and others in between.
Using the training data, λ can be estimated by maximum likelihood. The predictor data
must contain only values greater than zero.
 Outliers can be generally defined as “samples that are exceptionally far from the mainstream
of the data”.
o When one or more samples are suspected to be outliers, first check that the values are
scientifically valid and that no data recording errors have occurred.
o With small sample sizes, apparent outliers might be a result of a skewed distribution
where there are not yet enough data to see the skewness.
 Missing values: some predictors have no values for a given sample.
o For large data sets, removal of samples based on missing values is not a problem,
assuming that the missingness is not informative.
o In smaller data sets, removing samples may cause too great a loss of information. Two approaches:
 Predict data and substitute the missing values.
 Use information in the training set predictors to estimate the values of other
predictors.
 Censored data. Should not be confused with missing data.
o The exact value of censored data is not known, but something is known about it.
o It is common with laboratory measurements: some assays cannot measure
below their limit of detection, but we know that the value is smaller than the limit.
 Dimensionality reduction: it generates a smaller set of predictors that capture a majority of the
information of the original variables.
o These methods are often called signal extraction or feature extraction techniques.
o Principal component analysis (PCA) is a commonly used data reduction technique.
o This method seeks to find linear combinations of the predictors (principal components
PCs) which capture the most possible variance and are uncorrelated.
 Removing predictors: fewer predictors means decreased computational time and complexity. If
two predictors are highly correlated, this implies that they are measuring the same underlying
information, so they give redundant information.
o Collinearity is the technical term for the situation where a pair of predictor variables
have a substantial correlation. Using highly correlated predictors in techniques like
linear regression can result in highly unstable models. PCA can solve this problem.
o A more heuristic approach is to remove the minimum number of predictors needed so that
all pairwise correlations are below a threshold.
 Adding predictors: it is common practice to include non-linear combinations of predictors in
linear models.
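The standardization, skewness and Box-Cox formulas from the preprocessing list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`standardize`, `skewness`, `box_cox`) and the toy data are hypothetical, the population form of the standard deviation is used, and λ is supplied by hand rather than estimated by maximum likelihood.

```python
import math

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation (population form)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def skewness(values):
    """Sample estimate of E[((x - mean) / std)^3]."""
    z = standardize(values)
    return sum(v ** 3 for v in z) / len(z)

def box_cox(values, lam):
    """Box-Cox transform; every value must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires values greater than zero")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

data = [1.0, 2.0, 2.5, 3.0, 4.0, 20.0]   # right-skewed toy sample (assumed data)
z = standardize(data)
log_data = box_cox(data, 0)               # lambda = 0 gives the log transform
```

On this toy sample the log transform lowers the skewness but does not remove it, which is why λ is normally tuned on the training data.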

3 Probabilistic approach
At any point X in the multivariate input space, class label Y (categorical) is distributed according to a
mass function P(Y = y|X=x), the probability of observing y at x. The goal of statistical classification is
to learn the distribution P(y|x).
This learning is accomplished by building (training) a predictive model on data with known class labels.
In practice, the quality of the learned model at point x is measured using a loss function l(y, f(x)), which
is similar to a distance between the true class label y and the predicted response f(x). The classification
loss for the learned model f(x) is the expected distance

$$L(X, Y) = E_{X,Y}\left[\, l(Y, f(X)) \,\right]$$

taken over the domain of Y and X with respect to the joint probability density function.

The expected loss is usually estimated by averaging l(y,f(x)) over the labeled data drawn from the joint
probability density function.
The response f(x) predicted by a classification model can therefore be one of a nominal variable, a
numeric scalar or a vector with m elements for m classes. The exact meaning of f(x) depends on the
nature of the problem and properties of the classification model.
If a classifier returns only hard labels, there is only one good choice for the loss function, the 0-1 loss.
The distance is:

$$l(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x) \\ 1 & \text{if } y \neq f(x) \end{cases}$$

The expected loss L(X,Y) is then minimized by classifying every x into the most probable class. This
minimal loss equals the probability of observing one of the less probable classes. In the statistics literature,
P(y|x) is called the posterior probability and the minimal classification error is often called Bayes error.

4 Training, validating and testing


The error estimated on the training data, often called resubstitution error, is optimistic. This phenomenon
is called overtraining, and its effect is most pronounced for complex, flexible learners.

The training error is not a good estimate of the generalization error, so we need separate training and
validation sets. The tension between generalization and accuracy on the labeled data is called the
bias-variance trade-off. For a test point x0, the expected test MSE decomposes as:

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\varepsilon)$$

 The first term (the left-hand side) refers to the average test MSE that we would obtain if we
repeatedly estimated f using a large number of training sets and tested each at x0.
 Variance: amount by which f would change if we estimated it using a different training data set.
If a method has a high variance then small changes in the training data can result in large changes
in f.
 Bias refers to the error that is introduced by approximating a real-life problem, which may be
extremely complicated, by a much simpler model.
As we use more flexible methods, the variance will increase and the bias will decrease.
A test dataset may be used during the learning phase of the classifier to determine the optimal
structure of the classifier. If the dataset is large enough, typical training – test – validation proportions
are 50-25-25. If the dataset is small, resampling techniques should be used.

4.1 Cross-validation
Resampling effectively increases the amount of data without incurring the full cost of data simulation
or collection. The price is that the resampled datasets are not independent, which affects the quality of
the estimates.
Cross-validation works by splitting the data into K disjoint subsets. Use 1 – 1/K of the data for training
and 1/K of the data for validation. Repeat this step K times, using every observation once for validation and
K-1 times for training.

This process results in K estimates of the test error (MSE), and the K-fold CV estimate is the average of
these errors.
The number of subsets can vary from 2 to N, where N is the number of available observations. K = N is
called leave-one-out cross-validation. The most popular choice is K = 10.
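The K-fold procedure described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: `k_fold_indices` and `cross_validate` are hypothetical helper names, and the "model" is a deliberately trivial one (it predicts the mean of the training outputs) so that the example stays self-contained.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split the indices 0..n_samples-1 into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at position j.
    return [indices[j::k] for j in range(k)]

def cross_validate(xs, ys, k, fit, error):
    """Average validation error over k folds.

    fit(train_x, train_y) must return a model; error(model, x, y) must
    return the mean error of that model on the validation fold.
    """
    folds = k_fold_indices(len(xs), k)
    fold_errors = []
    for j, val_idx in enumerate(folds):
        # Every fold except fold j is used for training.
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(error(model, [xs[i] for i in val_idx],
                                 [ys[i] for i in val_idx]))
    return sum(fold_errors) / k

# Toy model (assumption): predict the mean of the training outputs; error is the MSE.
fit_mean = lambda tx, ty: sum(ty) / len(ty)
mse = lambda m, vx, vy: sum((y - m) ** 2 for y in vy) / len(vy)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
```

Calling `cross_validate` with `k=len(xs)` gives leave-one-out cross-validation, since each fold then holds a single observation.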

4.2 Measuring classification performance


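Classification models generate two types of predictions:

 A continuous valued prediction, usually in the form of a probability. If not, it can be transformed
using the softmax transformation, where y_l is the raw model output for class l and m is the number
of classes:

$$p_l = \frac{e^{y_l}}{\sum_{k=1}^{m} e^{y_k}}$$

 A predicted class in the form of a discrete category.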
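As a small sketch of how raw model outputs can be turned into class probabilities and then into a predicted class, the softmax transformation can be written as below. The input scores are made-up values, and subtracting the maximum score is a common numerical-stability trick that is an addition here, not something stated in the notes.

```python
import math

def softmax(scores):
    """Map arbitrary real-valued scores to probabilities that sum to 1."""
    shift = max(scores)                      # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # hypothetical raw outputs for 3 classes
predicted_class = probs.index(max(probs))    # index of the most probable class
```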

4.2.1 Histograms

4.2.2 Calibration plots

4.2.3 Heat maps

4.2.4 Confusion matrix

Overall accuracy:

$$O = \frac{TP + TN}{Total} \times 100\%$$

The cost of False Negatives (FN) and False Positives (FP) may be different.

Other rates:

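 No information rate of a class x:

$$\frac{\text{cases of class } x}{\text{Total cases}} \le 1$$

 Kappa (with O expressed as a proportion rather than a percentage):

$$Kappa = \frac{O - E}{1 - E} \le 1$$

 Random agreement:

$$E = \frac{TP + FP}{Total} \times \frac{TP + FN}{Total} + \frac{FN + TN}{Total} \times \frac{FP + TN}{Total}$$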

 Sensitivity: true positive rate

$$Sensitivity = \frac{\#\ \text{samples with the event and predicted to have the event}}{\#\ \text{samples having the event}} = \frac{TP}{TP + FN}$$

 Specificity: true negative rate

$$Specificity = \frac{\#\ \text{samples without the event and predicted as non-event}}{\#\ \text{samples without the event}} = \frac{TN}{TN + FP}$$

 Youden’s J Index:

$$J = Sensitivity + Specificity - 1$$

o Assuming a fixed level of accuracy for the model, there is a trade-off between sensitivity
and specificity. The Receiver Operating Characteristic (ROC) curve is one technique
for evaluating this trade-off.
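The rates in this section can all be computed from the four cells of a binary confusion matrix. The sketch below is an illustration with made-up counts; `binary_metrics` is a hypothetical helper, and it takes the random agreement E to be the chance that prediction and truth coincide, that is, the sum over both classes of the product of the predicted and actual class proportions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary rates for a 2x2 confusion matrix (proportions, not percentages)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                       # observed agreement O
    # Expected (random) agreement E from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_j": sensitivity + specificity - 1,
    }

m = binary_metrics(tp=40, fp=10, fn=5, tn=45)          # hypothetical counts
```

With these counts the accuracy is 0.85 while Kappa is 0.7, which shows how Kappa discounts the agreement that would be expected by chance.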
4.2.4.1 ROC curves
The ROC curve is created by evaluating the class probabilities for the model across a continuum of
thresholds. For each threshold, the resulting true-positive rate (sensitivity) and the false-positive rate
(one minus specificity) are plotted against each other.
The Area Under the Curve (AUC) is a measure of the overall performance of a classifier. An ideal ROC
curve will hug the top left corner: the larger the AUC, the better the classifier (0 ≤ AUC ≤ 1).
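A minimal sketch of how the ROC points and the AUC can be obtained from predicted probabilities is given below. The scores and labels are invented for illustration, `roc_points` and `auc` are hypothetical helper names, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # predicted P(event), assumed
labels = [1, 1, 0, 1, 1, 0, 0, 0]                     # 1 = event, 0 = non-event
area = auc(roc_points(scores, labels))
```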

5 Models
5.1 Logistic regression
Binary classification problem (Y: {0,1}). Rather than modeling the response Y directly, logistic
regression models the probability that Y belongs to a particular category, P(Y = 1|X). The logistic
regression model is given by:

$$P(Y = 1|X) = p(X) = \frac{e^{w_0 + w_1 x_1 + \dots + w_n x_n}}{1 + e^{w_0 + w_1 x_1 + \dots + w_n x_n}}$$

$$odds = \frac{p(x)}{1 - p(x)} = e^{w_0 + w_1 x_1 + \dots + w_n x_n}$$

The odds can take on any value between 0 and ∞. Odds are traditionally used instead of probabilities
since they relate more naturally to the correct betting strategy.

$$\text{log-odds or logit} = \ln\left(\frac{p(x)}{1 - p(x)}\right) = w_0 + w_1 x_1 + \dots + w_n x_n$$

The regression coefficients are estimated by maximizing the likelihood function:

$$l(w_0, w_1, \dots, w_n) = \prod_{i:\, y_i = 1} p(x_i) \times \prod_{i':\, y_{i'} = 0} \left(1 - p(x_{i'})\right)$$
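The relationship between the probability, the odds and the logit, together with the likelihood being maximized, can be checked numerically. The sketch below uses hypothetical coefficients chosen by hand (nothing is fitted), and it works with the logarithm of the likelihood, which has the same maximizer and is numerically safer.

```python
import math

def predict_proba(w, x):
    """p(x) = e^s / (1 + e^s) with s = w0 + w1*x1 + ... + wn*xn."""
    s = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-s))        # algebraically equal to e^s / (1 + e^s)

def log_likelihood(w, xs, ys):
    """Logarithm of the likelihood l(w0, ..., wn) for labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(w, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

w = [-1.0, 2.0]                  # hypothetical coefficients (w0, w1)
p = predict_proba(w, [1.5])      # s = -1 + 2 * 1.5 = 2
odds = p / (1.0 - p)
logit = math.log(odds)           # recovers the linear predictor s
```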

5.2 Bayesian probabilities


Multiclass classification problem (Y: m unordered values). Let πk be the overall or prior probability that
a randomly chosen observation comes from the kth class. Bayes' theorem states that:

$$P(Y = k|X = x) = \frac{p(X = x|Y = k)\, \pi_k}{\sum_{i=1}^{m} p(X = x|Y = i)\, \pi_i}$$

That is, the probability of X being x if Y is k, weighted by the prior probability of class k, divided by
the overall probability of X being x.

5.2.1 Linear discriminant analysis for n = 1
If we suppose that p(X=x|Y=k) is normal or Gaussian:

$$p(X = x|Y = k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)$$

Where μk and σk² are the mean and the variance parameters of the kth class. We further assume that the
variance is the same for all classes (σ1² = … = σm² = σ²), so that:

$$P(Y = k|X = x) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right) \pi_k}{\sum_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_i)^2\right) \pi_i}$$

The Bayes classifier involves assigning an observation X = x to the class for which P(Y=k|X=x) is
largest.
This is equivalent to assigning the observation to the class for which

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

is largest.
The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates
for πk, μk and σ² into the previous equation:

$$\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)$$

and assigns an observation X=x to the class for which δ̂k(x) is maximum.
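The one-dimensional LDA rule can be sketched in plain Python. The estimators used here are assumptions, since the notes do not spell them out: class proportions for the priors, class means for μk, and the pooled within-class variance with N − m degrees of freedom for σ². The function names and the toy data are hypothetical.

```python
import math

def fit_lda_1d(xs, ys):
    """Estimate class priors, class means and the pooled variance."""
    classes = sorted(set(ys))
    n, m = len(xs), len(classes)
    priors, means, pooled_ss = {}, {}, 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        priors[k] = len(xk) / n
        means[k] = sum(xk) / len(xk)
        pooled_ss += sum((x - means[k]) ** 2 for x in xk)
    variance = pooled_ss / (n - m)           # common variance shared by all classes
    return priors, means, variance

def discriminant(x, k, priors, means, variance):
    """delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k)."""
    mu = means[k]
    return x * mu / variance - mu ** 2 / (2 * variance) + math.log(priors[k])

def predict(x, priors, means, variance):
    """Assign x to the class with the largest discriminant value."""
    return max(priors, key=lambda k: discriminant(x, k, priors, means, variance))

xs = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]   # toy predictor values
ys = [0, 0, 0, 0, 1, 1, 1, 1]                   # toy class labels
params = fit_lda_1d(xs, ys)
```

With equal priors the decision boundary falls at the midpoint of the two class means (4.25 for this toy data), so `predict(4.0, *params)` and `predict(4.5, *params)` land in different classes.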

5.2.2 Linear discriminant analysis for n>1


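Suppose we assume that p(X = x|Y=k) is a multivariate Gaussian distribution N(μk,Σ):

$$p(X = x|Y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$

Where μk is a class-specific mean vector, and Σ is a covariance matrix that is common to all m classes
(which is a very hard constraint).
Then, it can be shown that the Bayes classifier assigns an observation X=x to the class for which the
following discriminant is largest:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)$$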

5.2.3 Quadratic discriminant analysis


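Suppose we assume that p(X=x|Y=k) is a multivariate Gaussian distribution N(μk,Σk):

$$p(X = x|Y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

Where μk is a class-specific mean vector, and Σk is a class-specific covariance matrix.
Then, it can be shown that the Bayes classifier assigns an observation X=x to the class for which the
following discriminant is largest:

$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)$$

The regions of the classification problem are separated by linear boundaries (straight lines in two
dimensions) for linear discriminant analysis and by quadratic boundaries (conic sections in two
dimensions) for quadratic discriminant analysis.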

5.3 K-Nearest Neighbors


The kNN algorithm gets its name from the fact that it uses information about an example’s k-nearest
neighbors to classify unlabeled examples. The letter k is a variable term implying that any number of
nearest neighbors could be used.
After choosing k, the algorithm requires a training dataset made up of examples that have been classified
into several categories, as labeled by a nominal variable. Then, for each unlabeled record in the test
dataset, k-NN identifies k records in the training data that are the nearest in similarity. The unlabeled
test instance is assigned the class of the majority of the k nearest neighbors.
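A minimal sketch of the k-NN vote follows. Euclidean distance is an assumption (the notes only say "nearest in similarity"), `knn_predict` is a hypothetical helper, and the training points are invented.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance is an assumed similarity measure.
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
label = knn_predict(train_x, train_y, query=(1.1, 1.0), k=3)
```

Because the vote depends on distances, the predictors should normally be standardized first, as discussed in the preprocessing section.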

5.4 Decision trees
Decision trees are an easy-to-understand general representation of a discrete classifier, with fast learning
algorithms. They provide built-in feature selection (the process of selecting a subset of relevant features
for use in model construction).

 Types of nodes: decision or terminal (leaf) nodes.


 Types of separators:
o For categorical input variables: Value of X?
o For continuous input variables: Value of X < threshold?
The main idea of decision trees is to split the input space recursively by minimizing the impurity
of the nodes until the terminal nodes are pure enough. There are two main measures of purity:

5.4.1 Gini index (CART decision trees)


For a two class problem, the Gini index for a given node is defined as:
𝐺𝑖𝑛𝑖 = 𝑝1 (1 − 𝑝1 ) + 𝑝2 (1 − 𝑝2 )
Where p1 and p2 are the class 1 and class 2 probabilities. Since p1 + p2 = 1, we can deduce Gini = 2p1p2

5.4.2 Entropy/Information statistic (C4.5 decision trees)


Entropy or information statistic is defined as:
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = −[𝑝1 𝑙𝑜𝑔2 (𝑝1 ) + 𝑝2 𝑙𝑜𝑔2 (𝑝2 )]
When p=0 it is customary to have 0 log2(0) = 0

5.4.3 Algorithm
If we have a continuous predictor and a categorical response, the optimal split point is found as follows:
1. The samples are sorted based on the predictor values.
2. The candidate split points are the midpoints between consecutive unique predictor values. If the
response is binary, each candidate split generates a 2×2 contingency table:

The Gini index prior to the split would be 2(n1+/n)(n2+/n). After the split, the Gini index is calculated
within each of the new nodes, and the two values are combined using the proportion of samples in
each partition:

3. Partitioning algorithms evaluate nearly all split points and select the split point value that
minimizes the Gini index.
4. The splitting process continues until a stopping criterion is met (minimum number of samples
in a node or maximum tree depth).
The algorithm works in the same way if the Gini index is replaced by the entropy.
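The split search in steps 1 to 3 can be sketched for a single continuous predictor and a binary response. This is an illustration only: `gini` and `best_split` are hypothetical helpers, the data are invented, every midpoint is evaluated exhaustively, and no stopping criterion or recursion is included.

```python
def gini(labels):
    """Gini index 2*p1*p2 of a node holding binary labels (0/1)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)

def best_split(xs, ys):
    """Return (threshold, weighted Gini) of the best split x < threshold."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # midpoints
    best = None
    for t in candidates:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        # Combine the node impurities using the proportion of samples in each.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy predictor values
ys = [0, 0, 0, 1, 1, 1]               # toy binary response
threshold, impurity = best_split(xs, ys)
```

For this toy data the Gini index before the split is 0.5, and the split at 3.5 produces two pure nodes. Swapping `gini` for an entropy function would leave the rest of the search unchanged.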
5.4.3.1 Considerations
 Trees that are constructed to have the maximum depth usually overfit the training data.
 A generalizable tree is a pruned version of the initial tree which is determined by a cost-
complexity tuning, in which the purity criterion is penalized by a factor of the total number of
terminal nodes in the tree. Each terminal node produces a vector of class probabilities based on
the training set which is then used as the prediction for a new sample.
 Tree models can also bin categorical predictors.
 When fitting trees and rule-based models, a choice must be made regarding the treatment of
categorical predictor data:
o Each categorical predictor can be entered into the model as a single entity so that the
model decides how to group or split the values (grouped categories). A categorical
variable X = (a,b,c) can be considered as a – ab – ac – b – bc – c.
o Categorical predictors are first decomposed into binary dummy variables. In this way,
the dummies are considered independently, forcing binary splits for the categories
(independent categories). A categorical variable X = (a, b, c) can be split into Xa, Xb,
Xc, each of which takes the value 0 or 1.
 There exist different stopping criteria, but cross-validation is used for selecting optimal
complexity:
o Number of nodes/depth of tree.
o Minimum number of observations in a node.
o Entropy: stop splitting the node if the entropy is small enough.

In this case, the best size (number of leaf nodes) is 6 (lowest cross-validation error).

5.5 Bagging and Random forests


Decision trees suffer from high variance: using different training sets may produce different decision
trees. A procedure with low variance will yield similar results if applied repeatedly to distinct data sets.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a
statistical learning method.

5.5.1 Bagging
Given a set of N independent observations Z1, …, ZN, each with variance σ², the variance of the mean
of the observations is given by σ²/N: averaging a set of observations reduces variance.
A natural way to reduce the variance and increase the prediction accuracy is to take many training sets
from the population, build a separate prediction model using each training set, and average the resulting
predictions. Since separate training sets are generally not available, bagging generates B bootstrapped
training sets from the single available training set instead:

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$

The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.

5.5.2 Out-of-Bag Error estimation
On average, each bagged tree makes use of around two-thirds of the observations. The remaining third
are referred to as the out-of-bag (OOB) observations.
We can predict the response for the ith observation using each of the trees in which that observation was
OOB. This will yield around B/3 predictions for the ith observation. Averaging these predicted responses
(regression) or taking a majority vote (classification) leads to a single OOB prediction for the ith
observation. The resulting OOB error is a valid estimate of the test error for the bagged model.
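The "around two-thirds" figure can be checked with a short simulation. The sketch below draws bootstrap samples and measures the share of observations left out of each one; the helper names and the sample sizes are arbitrary choices. The exact expected out-of-bag share is (1 − 1/N)^N, which tends to e⁻¹ ≈ 0.368 as N grows.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices with replacement from 0..n-1."""
    return [rng.randrange(n) for _ in range(n)]

def oob_fraction(n, n_trees, seed=0):
    """Average share of observations left out of each bootstrap sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        in_bag = set(bootstrap_sample(n, rng))
        fractions.append((n - len(in_bag)) / n)
    return sum(fractions) / n_trees

simulated = oob_fraction(n=1000, n_trees=200)
theoretical = (1 - 1 / 1000) ** 1000         # close to exp(-1)
```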

5.6 Random forests


Random forests provide an improvement over bagged trees by way of a small random tweak that decorrelates the trees.
As in bagging, we build a number of decision trees on bootstrapped training samples. When building
these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen
as split candidates from the full set of n predictors.

A fresh sample of m predictors is taken at each split, and typically we choose 𝑚 = √𝑛.
If a predictor is very important, most of the bagged trees will use this predictor in the top split, and,
consequently, all of the bagged trees will look quite similar to each other. Averaging many highly
correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated
quantities.

5.6.1 Considerations
 If a random forest is built using m = n, then this amounts to bagging.
 Using a small value of m in building a random forest will typically be helpful when we have a
large number of correlated predictors.

5.7 Support vector machines
SVM are a generalization of a simple and intuitive classifier called the maximal margin classifier. The
maximal margin classifier can only solve linearly separable problems.

In an n-dimensional space, a hyperplane is a flat affine subspace of dimension n-1; in two dimensions, a
hyperplane is a flat one-dimensional subspace (a line); and, in three dimensions, it is a plane:

Hyperplane: β0 + β1X1 + β2X2 + … + βnXn = 0

Line: β0 + β1X1 + β2X2 = 0

Plane: β0 + β1X1 + β2X2 + β3X3 = 0
The sign of the left-hand side of this equation, evaluated at an observation x, determines on which side
of the hyperplane the observation lies. So, the hyperplane divides the n-dimensional input space into
two halves.
Now suppose that we have an Nxn matrix of data that consists of N observations in a n-dimensional
input space (the training set). We also have a test observation, and supposing that is a binary
classification problem {1, -1}, our goal is to develop a classifier based on the training data that will
correctly classify the test observation.
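A separating hyperplane with coefficients β0, β1, …, βn satisfies for all i:

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in}) > 0$$

The test observation x* is classified by the sign of:

$$f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \dots + \beta_n x_n^*$$

We can also use the magnitude of f(x*):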

 If f(x*) is far from zero, then it means that x* lies far from the hyperplane, so we can be
confident about a class assignment for x*.
 If f(x*) is close to zero, then x* is located near the hyperplane, and we are less certain about the
class assignment for x*.

5.7.1 Considerations
 If our data can be perfectly separated by a hyperplane, then there will exist an infinite number
of such hyperplanes.
 A natural choice is the maximal margin hyperplane (optimal separating hyperplane), which is
the separating hyperplane that is farthest from the training observations.
o The margin is the minimal (perpendicular) distance from the observations to the
hyperplane. The maximal margin hyperplane is the separating hyperplane for which the
margin is largest.
o The support vectors are the points (three in the illustrated example, no matter which class)
with the minimum distance to the hyperplane. They are called support vectors because if these
points are moved, the hyperplane moves too. The maximal margin hyperplane depends directly on the
support vectors, but not on the other observations.
 The maximal margin hyperplane is the solution to the optimization problem:

$$\max_{\beta_0, \dots, \beta_n, M} M \quad \text{subject to} \quad \sum_{j=1}^{n} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}) \ge M \;\; \forall i$$

o The second constraint guarantees that each observation will be on the correct side of
the hyperplane (M is positive).
o The first constraint ensures that the perpendicular distance from the ith observation to the
hyperplane is given by:

$$y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in})$$

5.7.2 Linearly non-separable


In many cases, the training observations cannot be perfectly separated by a hyperplane. In these cases,
the optimization problem has no solution with M > 0.
In fact, even if a separating hyperplane exists, there are cases in which a classifier based on it might
not be desirable due to its high sensitivity to individual observations (lack of robustness and overfitting).

In these cases, in order to gain robustness to individual observations and better classification of most
of the observations, we might consider a classifier based on a hyperplane that does not perfectly separate
the classes.
The Support Vector Classifier, sometimes called soft margin classifier, allows a few training
observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane.
The hyperplane is given by the solution of the optimization problem:

$$\max_{\beta_0, \dots, \beta_n, \epsilon_1, \dots, \epsilon_N, M} M \quad \text{subject to} \quad \sum_{j=1}^{n} \beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}) \ge M(1 - \epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_{i=1}^{N} \epsilon_i \le C$$

Where C is a nonnegative tuning parameter, M is the width of the margin and ϵi are slack variables.
Once the optimization problem has been solved, we classify a test observation x* by the sign of:

$$f(x^*) = \beta_0 + \beta_1 x_1^* + \dots + \beta_n x_n^*$$
The slack variable ϵi tells us where the ith observation is located:

 If ϵi=0, then the ith observation is on the correct side of the margin
 If ϵi>0, then the ith observation is on the wrong side of the margin
 If ϵi>1, then the ith observation is on the wrong side of the hyperplane
C bounds the sum of the ϵi’s, so it determines the number and severity of the violations to the margin:

 If C=0, then there is no budget for violations and the problem is reduced to the maximal margin
hyperplane.
 If C>0, then no more than C observations can be on the wrong side of the hyperplane.
In practice, C is a tuning parameter that is usually chosen via cross-validation:

 When C is small, we seek narrow margins that are rarely violated. This amounts to a classifier that
is highly fit to the data, which may have low bias but high variance.
 When C is larger, the margin is wider and more violations are allowed. This amounts to fitting the
data less hard and obtaining a classifier that is more biased but has lower variance.
The hyperplane is only affected by the observations that either lie on the margin or that violate the
margin. Observations that lie directly on the margin or on the wrong side of the margin are support
vectors. When C is large, more observations are involved in determining the hyperplane (there are more
support vectors).

5.7.3 Non-linear classification


Non-linear classification problems can be solved in some cases by enlarging the feature space using
functions of the original predictors, such as quadratic and cubic terms. If we consider quadratic terms,
the support vector classifier is fitted on the 2n features:

$$X_1, X_1^2, X_2, X_2^2, \dots, X_n, X_n^2$$

More functions can be considered, but the amount of computation could become unmanageable.

5.7.4 Support Vector Machine


The SVM is an extension of the support vector classifier, which enlarges the feature space in a
computationally efficient way using kernels. The inner product of two observations is given by:

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{n} x_{ij}\, x_{i'j}$$

The linear support vector classifier can be represented by the following formula, where there are N
parameters αi:

$$f(x) = \beta_0 + \sum_{i=1}^{N} \alpha_i \langle x, x_i \rangle$$

To evaluate the function f(x) we need to compute the inner product between the new point x and the
training points xi. However, it turns out that αi is nonzero only for the support vectors. Letting δ be the
set of indices of the support vectors:

$$f(x) = \beta_0 + \sum_{i \in \delta} \alpha_i \langle x, x_i \rangle$$

We can replace the inner product with a generalization of the form K(xi, xi'), where K is some function
that we will refer to as a kernel.

The linear kernel quantifies the similarity of a pair of observations using the Pearson correlation:

$$K(x_i, x_{i'}) = \sum_{j=1}^{n} x_{ij}\, x_{i'j}$$

The polynomial kernel of degree d amounts to fitting a support vector classifier in a higher-dimensional
space involving polynomials of degree d:

$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{n} x_{ij}\, x_{i'j}\right)^d$$

When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known
as a Support Vector Machine. The function has the form:

$$f(x) = \beta_0 + \sum_{i \in \delta} \alpha_i K(x, x_i)$$

The radial kernel, where σ is a positive constant, has a local behavior, as only nearby training
observations have an effect on the class label of a test observation.
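The kernels discussed in this section, and the way they enter the decision function, can be sketched as follows. Everything numeric here is hypothetical: the support vectors, the coefficients and the intercept are made up rather than fitted, and the radial kernel is written with a parameter `gamma` multiplying the squared distance, which is one common parameterization and may differ from the σ-based form used in the notes.

```python
import math

def linear_kernel(a, b):
    """Inner product of two observations."""
    return sum(x * y for x, y in zip(a, b))

def polynomial_kernel(a, b, d):
    """(1 + <a, b>)^d for a positive integer degree d."""
    return (1 + linear_kernel(a, b)) ** d

def radial_kernel(a, b, gamma):
    """exp(-gamma * ||a - b||^2); gamma > 0 is an assumed parameterization."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision_function(x, support_vectors, alphas, intercept, kernel):
    """f(x) = intercept + sum_i alpha_i * K(x, x_i) over the support vectors."""
    return intercept + sum(a * kernel(x, sv)
                           for a, sv in zip(alphas, support_vectors))

# Hypothetical support vectors and coefficients, not a fitted model.
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [-1.0, 1.0]
rbf = lambda a, b: radial_kernel(a, b, gamma=0.5)
near_first = decision_function((0.1, 0.0), svs, alphas, 0.0, rbf)
near_second = decision_function((2.0, 1.9), svs, alphas, 0.0, rbf)
```

The two evaluations illustrate the local behavior of the radial kernel: each test point is dominated by the support vector closest to it, so the sign of f(x) follows that neighbor.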

If we extend it to multiclass problems:

5.8 Neural Networks
The idea is to mimic the structure of the brain by massively interconnecting very simple processing units
and designing learning rules to adjust their transfer functions. Artificial Neural Networks generate their
own rules by learning from examples.
A neural network is a massively parallel distributed processor made up of simple processing units that
has a natural propensity for storing experiential knowledge and making it available for use. It works like
a brain:

 Knowledge is acquired by the network from its environment through a learning process.
 Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

5.8.1 Perceptron

 Activation function:

 The perceptron partitions the input space in two regions, according to the hyperplane s=0
(decision boundary)

 For n = 2 

The perceptron is only capable of solving linearly separable problems. The threshold moves the
hyperplane away from the origin of the input space.

 Rosenblatt learning rule:

If the problem is linearly separable, the algorithm converges to the solution; if the problem is not linearly
separable, the algorithm may oscillate. The value of α doesn’t affect stability, but it determines the rate
of convergence (typically: α=1)
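A perceptron trained with an error-correction rule can be sketched as below. The exact form of the update is an assumption, since the formula is not reproduced in the text: the weights are changed by α·(d − y)·x whenever the output y differs from the target d, with targets in {-1, +1} and the threshold handled as a weight on a constant input. The function names and the AND example are illustrative.

```python
def step(s):
    """Step activation: +1 on or above the hyperplane s = 0, otherwise -1."""
    return 1 if s >= 0 else -1

def train_perceptron(xs, ds, alpha=1.0, max_epochs=100):
    """Error-correction learning: w <- w + alpha * (d - y) * x."""
    w = [0.0] * (len(xs[0]) + 1)             # w[0] is the threshold (bias) weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in zip(xs, ds):
            xb = [1.0] + list(x)             # constant input for the threshold
            y = step(sum(wi * xi for wi, xi in zip(w, xb)))
            if y != d:
                w = [wi + alpha * (d - y) * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:                    # converged: every sample is correct
            break
    return w

# Logical AND with targets in {-1, +1}: a linearly separable problem.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ds = [-1, -1, -1, 1]
w = train_perceptron(xs, ds)
predictions = [step(w[0] + w[1] * x1 + w[2] * x2) for x1, x2 in xs]
```

Replacing the targets with those of XOR would make the loop run until `max_epochs` without converging, which is the oscillation mentioned above for problems that are not linearly separable.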

5.8.2 Multilayer perceptron


Used to solve linearly non-separable problems:

Hierarchical structure of fully interconnected layers of processing units, forming a feedforward ANN.
5.8.2.1 Notation

We fit several layers of perceptrons using differentiable activation functions (instead of the step function,
we use tanh(s) or 1/(1+exp(-s))) and gradient-based methods.

5.8.2.2 Theorem
A multilayer perceptron with only one hidden layer and sigmoidal activation functions is a universal
function approximator.
A multilayer perceptron with only one hidden layer and sigmoidal activation functions is a universal
classifier.
5.8.2.3 Error bounds
The mean integrated squared error between the estimated network and a target function f is shown to be
bounded by:

$$O\left(\frac{C_f^2}{h}\right) + O\left(\frac{h\, n}{N} \log N\right)$$

Where h is the number of nodes, n the input dimension of the function, N is the number of training
observations, and Cf is the first absolute moment of the Fourier magnitude distribution of f. The principal
problem of the multilayer perceptron is overfitting due to the complexity of the model. The only way
to reduce the error is to increase the number of nodes (decreasing bias and increasing variance) and to
increase the size of the training dataset (reducing variance).
5.8.2.4 Backpropagation
It is an efficient algorithm for the computation of ∂E/∂w.

E is the error function:


Where

The gradient:

 Chain rule:

 First factor:

 Second factor = Backpropagated signal:

 Finally:

5.8.2.5 Weight initialization
The external inputs and outputs should be standardized or normalized in order to ensure a well-
conditioned optimization problem. Random small weights are used for preventing saturation of the
activation function. Assuming that the inputs have been normalized in the interval [-1;1]:

 Generate initial weight vectors for the external inputs according to a uniform distribution:

 The magnitude of the weight vector is adjusted as follows:


for j: 1, …, h

 Locate the center of the interval at a random location along the slice by setting:

Delta rule:

 Online learning vs Batch learning. Typical values: α=0.25 and η=0.9


Nonlinear unconstrained optimization methods: conjugate gradient, quasi-Newton methods,
Levenberg-Marquardt.

5.8.2.6 Optimization methods


If we use MSE as a loss function where y[i]=f(x[i], w) is the estimated output for training sample i:

The minimization of E(w), with respect to the weight vector w in W=Rq, is an unconstrained nonlinear
optimization problem.
For classification problems (d ϵ {0,1} and y ϵ {0,1}):

 Maximum likelihood (Kullback-Leibler distance):

 Softmax (for m classes we train a neural net with m outputs):

To solve unconstrained nonlinear optimization problems:

 Direct search methods: do not need to compute derivatives
o Hooke & Jeeves
o Genetic algorithms
 Gradient based methods: w[k+1] = w[k] + α[k]*D[k]
o Gradient descent: D[k] = -∇E[k]
o Conjugate gradient: D[k] = -∇E[k] + γ[k]*D[k-1]
o Quasi-Newton: D[k] = -B[k]*∇E[k]
 Hessian based methods: D[k] = -(∇²E[k])⁻¹*∇E[k]
 Local minima:
o Select a good initial point
o Try different initial points
o Global search methods (“simulated annealing”)
 Standardization
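The gradient-based update w[k+1] = w[k] + α[k]*D[k] with the gradient descent direction D[k] = -∇E[k] can be sketched on a toy error surface. The quadratic E(w), the fixed step size and the number of iterations are all illustrative assumptions; a real network would obtain the gradient by backpropagation.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=200):
    """Iterate w[k+1] = w[k] + alpha * D[k] with D[k] = -grad(w[k])."""
    w = list(w0)
    for _ in range(steps):
        direction = [-g for g in grad(w)]
        w = [wi + alpha * di for wi, di in zip(w, direction)]
    return w

# Toy error surface E(w) = (w1 - 3)^2 + 2 * (w2 + 1)^2, minimized at (3, -1).
grad_E = lambda w: [2 * (w[0] - 3), 4 * (w[1] + 1)]
w_min = gradient_descent(grad_E, w0=[0.0, 0.0])
```

On a non-convex error surface the same loop only finds a local minimum, which is why the list above recommends trying different initial points.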
5.8.2.7 Weight decay
It’s a regularization method that limits the growth of the weights, where Г is a diagonal matrix:

Typically, Г = λI (eigenvalues)
5.8.2.8 Statistical Sensitivity Analysis
The objective of this analysis is reducing the complexity of the model by pruning input variables that
do not affect the output. Measure of the relevance of an input variable:
$$\varsigma_i = \frac{\partial \hat{y}}{\partial x_i}$$

Topic 3. Regression
1 Regression: problem statement
Given a set of input, independent, regressor or exogenous variables X = (X1, X2, …, Xn), which belong to
an n-dimensional real space, and an output, dependent or endogenous variable Y, which belongs to a
1-dimensional real space, our objective is to estimate the value of E(Y|X=x) from a random sample of the
form {(x[i], y[i]) ∈ Rⁿ × R}, i = 1, …, N.

The deterministic component can be estimated using a function approximator, but the random
component (noise) has to be characterized.

1.1 Regression model

1.2 Function approximators


 Linear regression models
 Non-linear regression models:
o Polynomials
o Multilayer Perceptrons (MLP)
o Radial Basis Function Networks (RBFN)
o Support Vector Machines (SVM)
o Splines
o Generalized additive models

1.2.1 Model identification


We should check:

 Structure of the deterministic component (linear or nonlinear)


 Complexity of the function approximator
 Relevant input variables
 Coefficients of the function approximator → error function

1.2.2 Model diagnosis


 Significance of the model (explanatory capabilities)
 Significance of the estimated coefficients
 Analysis of the residuals

2 Linear regression model

It is composed of n explanatory variables (regressors). The coefficients measure the marginal
contribution of each input variable to the output (sensitivity).
Assumptions:

 Linearity and additivity


 Statistical independence of the errors
 Homoscedasticity (constant variance) of the errors
 Normality
If any of these assumptions is violated, then the forecasts, confidence intervals, and scientific insights
yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

2.1 Assessing the accuracy of the model


 Sum of Squared Errors or Residual Sum of Squares:

 Mean Squared Error

 Root Mean Squared Error

 Residual Standard Error

 Coefficient of determination: proportion of the variance of Y explained by the explanatory
variables. It is bounded between 0 and 1.

 Adjusted coefficient of determination: takes into account the number of parameters → variable
selection
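The accuracy measures listed in this section can be computed as in the sketch below. The denominators are assumptions, since the formulas are not reproduced in the text: the MSE divides by N, while the residual standard error and the adjusted coefficient of determination use N − n − 1 residual degrees of freedom (n predictors plus an intercept). The helper name and the data are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """SSE, MSE, RMSE, RSE, R^2 and adjusted R^2 for a fitted model."""
    N = len(y_true)
    mean_y = sum(y_true) / N
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)       # total sum of squares
    dof = N - n_predictors - 1                          # residual degrees of freedom
    r2 = 1 - sse / sst
    return {
        "sse": sse,
        "mse": sse / N,                                 # assumed: divide by N
        "rmse": math.sqrt(sse / N),
        "rse": math.sqrt(sse / dof),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (N - 1) / dof,
    }

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]       # observed outputs (toy data)
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]       # hypothetical model predictions
m = regression_metrics(y_true, y_pred, n_predictors=1)
```

The adjusted value is always at most R², and the gap widens as more predictors are added without a matching drop in the SSE.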

2.2 Test for significance of regression
2.2.1 F-test of the overall fit
H0: β1 = β2 = … = βn = 0
H1: ∃ i : βi ≠ 0
If the null hypothesis is true, then:

If the corresponding p-value < α, H0 is rejected and at least one coefficient is considered significant.

2.3 Test on individual regression coefficients


2.3.1 t-test on individual coefficients
H0: βi = 0
H1: βi ≠ 0
If the null hypothesis is true, then:

If the corresponding p-value < α, H0 is rejected and the coefficient βi is considered significant.

2.4 Multicollinearity
Appears when there is strong correlation among the input variables. They should be independent.
Consequences:

 Estimation of the model coefficients can be arbitrary (it is not clear which variable explains the
output)
 It is an unstable model that can behave badly with new data
We detect it with the Variance Inflation Factor:

$$VIF_i = \frac{1}{1 - R_i^2}$$

where Ri² is the coefficient of determination resulting from regressing xi on the remaining n-1 regressor
variables. As a rule of thumb, if VIF > 10 then multicollinearity is a problem.

2.5 Linear model selection and regularization


2.5.1 Subset selection
To perform best subset selection, we should fit a separate least squares regression for each possible
combination of the n predictors.

2.5.1.1 Forward selection


Begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until
all of the predictors are in the model.

2.5.1.2 Backward selection
Begins with the full least squares model containing all n predictors, and then iteratively removes the
least useful predictor, one-at-a-time.

2.5.1.3 Hybrid approaches


These are hybrid versions of forward and backward stepwise selection, in which variables are added to
the model sequentially, but after adding each new variable, the method may also remove any variables
that no longer provide an improvement in the model fit.
Stepwise regression requires two significance levels: one for adding variables and one for removing
variables. The cutoff probability for adding variables should be less than the cutoff probability for
removing variables so that the procedure does not get into an infinite loop.
2.5.1.4 Choosing the optimal model
In order to implement subset selection, we need to determine which of the candidate models is best.
The model containing all of the predictors will always have the smallest RSS and the largest R², since
these quantities are related to the training error. We wish to choose a model with a low test error. There
are two common approaches to estimate the test error:

 Indirectly estimate test error by making an adjustment to the training error to account for the
bias due to overfitting
 Directly estimate test error using a validation set or a cross-validation approach.
2.5.1.4.1 Mallows Cp
Estimates the size of the bias that is introduced into the predicted responses by having an underspecified
model:

where p is the number of parameters and σ̂² is an estimate of the variance of the error (usually the mean
squared error obtained from fitting the model containing all of the candidate predictors)
2.5.1.4.2 Akaike Information Criterion AIC
In a linear regression model fit by maximum likelihood, AIC is given by:

where p is the number of parameters and σ̂² is an estimate of the variance of the error (usually estimated
as the mean squared error obtained from fitting the model containing all the candidate predictors)

2.5.1.4.3 Schwarz’s Bayesian Information Criterion BIC
For the least squares model with p parameters, the BIC is given by:

where p is the number of parameters and σ̂² is an estimate of the variance of the error (usually estimated
as the mean squared error obtained from fitting the model containing all the candidate predictors)
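As a hedged summary of the three criteria, one common set of definitions for a least squares model with p parameters fitted to N observations (N is notation assumed here) is shown below. Textbooks differ in additive and multiplicative constants, so these forms should be read as one convention rather than the definitive one:

$$C_p = \frac{RSS_p}{\hat\sigma^2} - N + 2p$$
$$AIC = \frac{1}{N\hat\sigma^2}\left(RSS_p + 2\,p\,\hat\sigma^2\right)$$
$$BIC = \frac{1}{N\hat\sigma^2}\left(RSS_p + \log(N)\,p\,\hat\sigma^2\right)$$

In all three, a smaller value indicates a better model. Since log(N) > 2 for N > 7, BIC penalizes additional parameters more heavily than Cp and AIC and tends to select smaller models.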
2.5.1.4.4 Adjusted R2
Used for selecting among a set of models that contain different numbers of variables.

2.5.1.4.5 Validation and Cross-Validation


Directly estimate the generalization capabilities of the different models using the validation set and
cross-validation methods.

2.5.2 Shrinkage methods


As an alternative to subset selection, we can fit a model containing all n predictors using a technique that
constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates
towards zero. This technique can significantly reduce their variance.
2.5.2.1 Ridge regression
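Very similar to least squares, except that the coefficients are estimated by minimizing a slightly different
quantity:

$$\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{n}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{n}\beta_j^2 = RSS + \lambda\sum_{j=1}^{n}\beta_j^2$$

where λ ≥ 0 is a tuning parameter, to be determined separately.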


As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but
increased bias.

2.5.2.2 Lasso
Ridge regression will include all n predictors in the final model. The lasso is an alternative to ridge
regression that overcomes this disadvantage. The lasso coefficients minimize the quantity:

$$\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{n}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{n}|\beta_j| = RSS + \lambda\sum_{j=1}^{n}|\beta_j|$$

The lasso also shrinks the coefficient estimates towards zero, but the L1 penalty has the effect of forcing
some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently
large → performs variable selection.
2.5.2.3 Ridge regression vs Lasso
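The difference between the two penalties can be illustrated in a deliberately simplified setting. The sketch below assumes one coefficient per observation, an identity design matrix and no intercept, so that the least squares estimate of each coefficient is simply y_j. These assumptions, and the function names, are illustrative and not taken from the original notes.

```python
def ridge_shrink(y_j, lam):
    """Minimizer of (y_j - b)^2 + lam * b^2: proportional shrinkage."""
    return y_j / (1.0 + lam)

def lasso_shrink(y_j, lam):
    """Minimizer of (y_j - b)^2 + lam * |b|: soft thresholding at lam / 2."""
    if y_j > lam / 2:
        return y_j - lam / 2
    if y_j < -lam / 2:
        return y_j + lam / 2
    return 0.0  # small coefficients are set exactly to zero

least_squares = [2.0, 0.3, -2.0]
ridge = [ridge_shrink(b, 1.0) for b in least_squares]
lasso = [lasso_shrink(b, 1.0) for b in least_squares]
```

Ridge scales every coefficient by the same factor and never reaches zero, whereas the lasso subtracts a fixed amount and zeroes out the small coefficient, which is the variable selection effect.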

2.5.3 Dimension reduction methods


The previous methods have controlled variance in two different ways, either by using a subset of the
original variables or by shrinking their coefficients toward zero. Dimension reduction methods
transform the predictors and then fit a least squares model using the transformed variables.
Let Z1, Z2, …, Zp represent p<n linear combinations of our original n predictors:

we can fit the linear regression model:

2.5.3.1 Principal Components Regression (PCR)
Involves constructing the first p principal components and then using these components as the predictors
in a linear regression model that is fit using least squares. We assume that the directions in which X1,
X2, …, Xn show the most variation are the directions associated with Y.
The principal components are mutually orthogonal, so regressing any one component on the others gives
R² = 0 (there is no multicollinearity). However, the components are built without using Y, so they are
not optimized for regression.

2.5.3.2 Partial Least Squares (PLS)


Is a supervised alternative to PCR that makes use of the response Y in order to identify new features
that not only approximate the old features well, but also that are related to the response.
After standardizing the n predictors, PLS computes the first direction Z1 by setting each ϕ1j equal to the
coefficient from the simple linear regression of Y onto Xj.
To identify the second PLS direction, we first adjust each of the variables for Z1, by regressing each
variable on Z1 and taking residuals (Xj = θ̂1j Z1 + ε1j). These residuals can be interpreted as the remaining
information that has not been explained by the first PLS direction.
We then compute Z2 using this orthogonalized data in exactly the same fashion as Z1 was computed
based on the original data.

2.5.4 Regression in high dimensions


1. Regularization or shrinkage plays a key role in high-dimensional problems
2. Appropriate tuning parameter selection is crucial for good predictive performance.
3. The test error tends to increase as the dimensionality of the problem (n) increases, unless the
additional features are truly associated with the response → the curse of dimensionality.

3 Polynomial regression
Straightforward extension of the linear regression model:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_d x_i^d + \varepsilon_i$$

Usually d ≤ 4, to prevent overfitting and strange shapes.

4 Regression splines
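Piecewise polynomials. A piecewise cubic polynomial (other degrees are also possible) with a single knot
at a point c takes the form:

$$y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \varepsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \varepsilon_i & \text{if } x_i \ge c \end{cases}$$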

If we place K different knots throughout the range of X, then we will end up fitting K+1 different cubic
polynomials. The resulting curve can be continuous or discontinuous, depending on the constraints
imposed to the model:

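In general, a degree d spline is a piecewise degree d polynomial with continuity in derivatives up to
degree d-1 at each knot.
If we define the truncated power basis function with knot ξ:

$$h(x,\xi) = (x-\xi)^3_+ = \begin{cases} (x-\xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise} \end{cases}$$

then we can fit a cubic spline with K knots by least squares regression using the model:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \sum_{k=1}^{K}\beta_{3+k}\,h(x_i,\xi_k) + \varepsilon_i$$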

Regression splines often give superior results to polynomial regression because, unlike polynomials,
which must use a high degree to produce flexible fits, splines introduce flexibility by increasing the
number of knots while keeping the degree fixed.
A natural spline is a regression spline which is linear at the boundary (where X is smaller than the
smallest knot or larger than the largest knot).
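To make the truncated power basis concrete, the sketch below builds one row of the design matrix of a cubic regression spline. The function names are hypothetical; the resulting rows would then be passed to any ordinary least squares routine.

```python
def truncated_power(x, knot, degree=3):
    """(x - knot)^degree if x lies to the right of the knot, 0 otherwise."""
    return (x - knot) ** degree if x > knot else 0.0

def cubic_spline_row(x, knots):
    """Basis expansion of a single observation: 1, x, x^2, x^3 and one term per knot."""
    return [1.0, x, x ** 2, x ** 3] + [truncated_power(x, k) for k in knots]

# One observation at x = 2 with knots at 1 and 3: only the first knot is active
row = cubic_spline_row(2.0, knots=[1.0, 3.0])
```

With K knots each row has K + 4 columns, so a cubic spline uses K + 4 degrees of freedom; adding a knot adds a single column instead of raising the polynomial degree.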

We can fix the degrees of freedom and then place the corresponding number of knots at uniform
quantiles of the data. To optimize the number of knots we can use cross-validation.

4.1 Smoothing spline


Regression splines are created by specifying a set of knots, producing a sequence of basis functions and
then using least squares to estimate the spline coefficients. A smoothing spline is a function g() that
minimizes the cost function, where λ is nonnegative:

$$\sum_{i=1}^{N}\big(y_i - g(x_i)\big)^2 + \lambda\int g''(t)^2\,dt$$

The function g(x) that minimizes the proposed cost function is a natural cubic spline with knots at x[1],
…, x[N]

5 Generalized Additive Models (GAMs)


A natural way to extend the multiple linear regression model to allow for non-linear relationships is
given by:

$$y_i = \beta_0 + \sum_{j=1}^{n} f_j(x_{ij}) + \varepsilon_i$$

It is called an additive model because we calculate a different fj for each Xj and then add together all the
contributions. If we use regression splines constructed using appropriate basis functions to
approximate each fj, the entire model is a big regression onto spline basis variables.

5.1 Advantages
GAMs allow us to fit a non-linear fj to each Xj, so we can automatically model non-linear relationships.
We do not need to manually try out many different transformations on each variable individually.
Because the model is additive, we can still examine the effect of each Xj on Y individually while holding
all of the other variables fixed.
The smoothness of the function fj for the variable Xj can be summarized via degrees of freedom.

5.2 Limitations
The main limitation is that the model is restricted to be additive. With many variables, important
interactions can be missed.
However, we can manually add interaction terms to the GAM by including additional predictors
of the form Xj × Xk.

5.3 Multilayer perceptron

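Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form

$$G(x) = \sum_{j=1}^{N}\alpha_j\,\sigma(w_j^T x + \theta_j)$$

are dense in C(In). In other words, given any f ϵ C(In) and ε > 0, there is a sum, G(x), of the above form
for which

$$|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n$$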

A Multilayer Perceptron with only one hidden layer and sigmoidal activation functions in the hidden
layer is a universal function approximator.

6 Support Vector Machines (SVM)


SVMs are among the most robust models: we seek to minimize the effect of outliers. Given a threshold
ϵ > 0 set by the user, data points with residuals within the threshold do not contribute to the regression
fit, while points with an absolute residual greater than the threshold contribute a linear-scale amount
(instead of quadratic, which would be more sensitive to outliers).
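The threshold behaviour described above corresponds to the ϵ-insensitive loss. A minimal sketch follows (the function name is hypothetical), comparing it with the squared loss on a small, a medium and an outlying residual:

```python
def eps_insensitive_loss(residual, eps):
    """Zero inside the eps-tube, linear in the excess outside it."""
    return max(0.0, abs(residual) - eps)

# (residual, eps-insensitive loss with eps = 0.5, squared loss)
comparison = [(r, eps_insensitive_loss(r, 0.5), r ** 2) for r in (0.2, -1.0, 5.0)]
```

For the outlying residual of 5, the ϵ-insensitive loss is 4.5 while the squared loss is 25, which illustrates why the fit is less sensitive to outliers.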
To estimate the model parameters, we minimize the error function:

The parameter estimates of a linear SVM can be written as functions of a set of unknown parameters αi
and the training set data points:

The regression equation:

can be written as:

Other kernel functions (ϕ and σ are scaling parameters):

The extra parameters, such as the polynomial degree or the scale factors ϕ and σ, must be specified. These
parameters, along with the Cost value, constitute the tuning parameters for the model:

When the cost is large, the model becomes very flexible. When the cost is small, the model will become
less likely to overfit, but more likely to underfit.
There is a relationship between the Cost parameter and ϵ. We suggest fixing a value for ϵ and tuning
over the other kernel parameters. Centering and scaling the predictors is recommended.

6.1 Regularization theory


We want to estimate an unknown function g(x) with a function approximator f(x,w), from a set of
samples:

The above problem is an ill-posed problem as there are infinite solutions, so we need to add constraints
to g → g is a smooth function (similar inputs produce similar outputs)
The new error function is:

where:

 ϕ[f] is a smoothness functional:

 Ḡ(s) is a positive function that tends to zero when |s| → ∞, in such a way that 1/Ḡ(s) is the
transfer function of a low-pass filter
 λ>0: regularization parameter that controls the tradeoff between closeness and smoothness.
If we select as smoothness functional:

we obtain the non-parametric estimation:

If we limit the number of Radial Basis Functions and we consider a different scale factor in each unit:

6.1.1 Radial Basis Function Networks


6.1.1.1 Learning rule
1. Unsupervised placement of the h Radial Basis Functions’ centers (ri) using a clustering
algorithm with the training set. The clustering algorithm most used in practice is k-means.
2. Determination of the h scale factors (σi) using the p-nearest neighbor heuristic, that ensures the
slight overlapping of the supports of the Radial Basis Function units. For this, we take as σi the
mean Euclidean distance of ri to its p nearest centers {rj1*, rj2*, …, rjp*} (typically p=2)
3. The weights of the output layer (vi) are obtained through linear regression once the previous
parameters have been set.
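Step 2 of the learning rule can be sketched as follows. The example assumes the centers have already been placed (for instance by k-means in step 1); the function name `scale_factors` and the toy centers are illustrative only.

```python
import math

def scale_factors(centers, p=2):
    """p-nearest-neighbour heuristic: sigma_i is the mean Euclidean distance
    from center i to its p nearest other centers."""
    sigmas = []
    for i, center in enumerate(centers):
        dists = sorted(math.dist(center, other)
                       for j, other in enumerate(centers) if j != i)
        nearest = dists[:p]
        sigmas.append(sum(nearest) / len(nearest))
    return sigmas

# Three centers in the plane, as they might come out of a clustering step
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
sigmas = scale_factors(centers, p=2)
```

Once the centers and scale factors are fixed, the hidden-layer activations are known for every training point, so the output weights of step 3 reduce to an ordinary linear regression.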
6.1.1.2 Properties
1. Universal approximation capabilities, RBFNs are universal approximators in the space of
absolutely integrable functions.
2. Optimal approximation, proof of existence and uniqueness of a structure that minimizes the
approximation error. This property is not satisfied by MLP.
3. The curse of dimensionality, due to the exponential increase in the number of radial units
required when increasing the dimension of the input space. The origin lies in the construction
of the hypersurface from hyperspheres, giving equal weight to all dimensions. This problem can
be mitigated using generalized RBFN with hyperelliptical supports or normalized RBFN.
6.1.1.3 RBFN vs MLP
 RBFN have only one hidden layer, MLP may have several hidden layers.
 All the processing units of an MLP have the same structure; in an RBFN the output units are
completely different from the hidden units
 RBFN use Euclidean distance, MLP use scalar product
 RBFN use local approximations, MLP use global approximations
 RBFN learning is very fast, MLP have fewer hidden units
 RBFN has a direct interpretation, MLP has the best generalization capabilities

6.1.1.4 Normalized RBFN

A normalized RBFN is equivalent to a RBFN with lateral connections.


6.1.1.5 General Regression Neural Network
Assume that pr(X,Y) represents the known joint continuous probability density function of a vector
random variable, X, and a scalar random variable, Y.
Let x be a particular measured value of the random variable X. The conditional mean of Y given X (also
called the regression of Y on X) is given by:

when the pdf pr(X,Y) is unknown, it can be estimated from a set of observations of (X,Y):
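When the joint density is estimated with Gaussian kernels centered on the observations, the conditional mean becomes a weighted average of the observed outputs. The sketch below assumes a single smoothing parameter `sigma` shared by all kernels; that choice and the function name are assumptions made for illustration.

```python
import math

def grnn_predict(x, xs, ys, sigma):
    """Estimate E[Y | X = x] as a kernel-weighted average of the observed outputs."""
    weights = [math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

# Two one-dimensional observations
xs = [(0.0,), (1.0,)]
ys = [0.0, 2.0]
midpoint = grnn_predict((0.5,), xs, ys, sigma=0.3)  # equal weights for both samples
at_first = grnn_predict((0.0,), xs, ys, sigma=0.1)  # dominated by the first sample
```

A small `sigma` makes the estimate follow the nearest observation, while a large one approaches the global mean of the outputs. If the query point is very far from every observation, all weights can underflow to zero, so a practical implementation would need to guard the division.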

6.1.1.6 Probabilistic Radial Basis Functions Network
It is a general regression neural network plus a normalized radial basis function network:

 Define a scale factor for each radial basis unit i:

 Activation function of the radial basis unit i:

 Outputs:

6.1.1.6.1 Learning with PRBFN


Three possible applications:

 PDF estimation

 Function approximation

 PDF estimation and function approximation

 Weights initialization:
o Centers of the radial basis units (ri): clustering
o Scale factors of the radial basis units (μi): p-nearest neighbor

o Weights of the output units (vi):

6.1.1.6.2 Gradient computation for PRBFN

6.1.1.6.3 PRBFN Sensitivities

rjk expected value of xk in the support of radial basis unit j


vj expected value of y in the support of radial basis unit j

Topic 4. Forecasting
1 Introduction
A forecast is a prediction of some future event. Forecasting problems are classified as:

 Short-term: a few time periods (days, weeks) into the future.


 Medium-term: from one to two years into the future
 Long-term: many years.
Short- and medium-term forecasting is typically based on identifying patterns found in historical
data. Long-term forecasting is usually based on expert knowledge and fundamental models.

1.1 Objectives
1. Describing the evolution of a time series
2. Modelling the process that has generated the time series by means of a suitable statistical model
3. Forecasting future values of the time series
4. Control: good forecasts enable the analyst to take actions and control a given process.

1.2 Forecasting methods


There are two main groups:

 Quantitative methods:
o Sufficient information about the past is available (historical record)
o This information can be set as numerical time series (decomposition methods)
o We can assume that the future behavior is similar to the observed past behavior
(continuity assumption)
 Qualitative methods:
o Little or no quantitative information is available.
o These are based on expert knowledge
o Example: Delphi method

1.2.1 Quantitative methods


Inside this group, there are two further groups:

 Explanatory models: y = f(x1,x2, …, xn, noise)


 Time series models: y = f(y(t-1), y(t-2), …, noise)
In both cases, the observation is composed of two components: y = pattern + noise. The objective of the
modeling process is to separate both components, in order to use the pattern for forecasting and the
observed noise for characterizing prediction errors.

2 Fundamental concepts
2.1 Stochastic processes
A stochastic process Y(w, t) is a family of time indexed random variables. w belongs to a sample space,
t belongs to an index set.

 For a fixed t*, Y(w, t*) is a random variable


 For a given w*, Y(w*, t) is called a sample function or realization of the stochastic process.
The population of all possible realizations is called the ensemble.

2.2 Time series
A time series is a collection of observations made sequentially through time. Formally, a time series is
the realization of a discrete stochastic process.

2.3 Means, variances and covariances


For a stochastic process {y[t], t=0, ±1, ±2…}, the mean function is defined as μt = E(y[t])
The autocovariance function is defined as γt,s = Cov(y[t], y[s]) = E((y[t] - μt)(y[s] – μs))
The autocorrelation function is defined as ρt,s = Corr(y[t], y[s]) = γt,s / √(γt,t γs,s)

2.3.1 Properties

If ρt,s = 0 we say that y[t] and y[s] are uncorrelated.

2.4 Stationary processes


A process is said to be stationary when the properties of the underlying process do not vary with time.
We say that it is stationary in the strict sense when applying the same time shift to all the variables
of any finite joint distribution leaves that distribution unchanged:

A process is said to be first order stationary in distribution if:

A process is said to be second order or wide sense stationary if it satisfies:

When the process is stationary, its first and second order moments can be estimated from only one
realization of the process:

 Mean:
 Autocovariance:

2.5 White noise process
White noise is a sequence of uncorrelated, identically distributed random variables with zero mean
and constant variance. The general expression is y[t] = ε[t]. If, after training a model, the prediction
errors are still correlated (they are not white noise), the model can be improved.

2.6 Random walk


The general expression: y[t] = y[t-1] + ε[t]. y[t] is not a stationary process, but y[t] – y[t-1] is stationary.
The conditional variance is constant, but the marginal variance grows with time.

 Marginal mean function:


 Marginal variance function:

 Conditional mean function:

 Conditional variance function:
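These four functions follow from unrolling the recursion. As a sketch, assuming a fixed starting value y[0] and i.i.d. noise ε[t] with zero mean and variance σ² (both assumptions, not stated in the text):

$$y[t] = y[0] + \sum_{i=1}^{t}\varepsilon[i]$$
$$E(y[t]) = y[0], \qquad Var(y[t]) = t\,\sigma^2$$
$$E(y[t] \mid y[t-1]) = y[t-1], \qquad Var(y[t] \mid y[t-1]) = \sigma^2$$

The marginal variance grows with t, which is why the process is not stationary, while the increments y[t] - y[t-1] = ε[t] are.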

2.7 Autoregressive processes


AR(p) process: Related to inertia

2.8 Moving Average processes


MA(q) process: Random events

2.9 Measures of forecast accuracy

2.9.1 Cross-validation methods:
 Training set (“in-sample”) → parameter optimization
 Validation set (“out-of-sample”) → measuring the generalization capabilities of the model.
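Whichever split is used, the forecast errors on the validation set are summarized with a few standard measures. The selection below (ME, MAE, RMSE, MAPE) is a common choice and is an assumption here, since the notes do not list the specific measures; the function name is hypothetical.

```python
import math

def forecast_errors(actual, forecast):
    """Common accuracy measures computed from out-of-sample forecasts."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,                             # mean error (bias)
        "MAE": sum(abs(e) for e in errors) / n,            # mean absolute error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n), # root mean squared error
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,  # in percent
    }

acc = forecast_errors([100.0, 200.0], [110.0, 190.0])
```

MAPE is scale-free but undefined when an actual value is zero, so it is only appropriate for strictly positive series.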

2.10 Mathematical transformation and adjustments


It will often be necessary to transform and/or adjust the series under study to fulfill the model
assumptions (constant level, constant variance, normally distributed, …)

2.10.1 Mathematical transformation

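The above transformations can be generalized in the form proposed by Box & Cox:

$$w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0 \\ (y_t^{\lambda} - 1)/\lambda & \text{otherwise} \end{cases}$$

In R (forecast package):
library(forecast)
lambda <- BoxCox.lambda(y)
y_transf <- BoxCox(y,lambda)
In practice, the square root and the logarithm are the transformations most commonly used.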
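For readers who want to see the transformation spelled out, here is a minimal sketch of the Box-Cox transformation and its inverse for strictly positive data. The function names are hypothetical, and selecting λ automatically (as `BoxCox.lambda` does in R) is not attempted.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive series."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

def inv_box_cox(w, lam):
    """Back-transform, needed to express forecasts on the original scale."""
    if lam == 0:
        return [math.exp(v) for v in w]
    return [(lam * v + 1) ** (1 / lam) for v in w]

w = box_cox([1.0, 4.0, 9.0], lam=0.5)  # square-root-like transformation
back = inv_box_cox(w, lam=0.5)
```

With λ = 0.5 the transformation is a shifted and scaled square root, and with λ = 0 it is the logarithm, which matches the two cases mentioned in the text.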

2.10.2 Calendar adjustments


The length of a month can differ from one month to another by up to about 10% ((31-28)/31). This effect
can be corrected for monthly forecasts by using:

Function monthdays(ts) can be used in R to obtain the number of days for each month in the time
series.
The number of holidays per month is very different from month to month. If it is possible to classify
the working days and holidays, and all holidays have the same effect, we can adjust:

2.10.3 Adjustment for inflation and population growth


Inflation must be taken into account when predicting prices. For that purpose, prices are
referred to the same date.

Population growth must be taken into account when predicting series such as the number of users
of public transport. If demographic studies are available, it is preferable to normalize the series and
predict the proportion of users.

3 Decomposition methods
The idea is that a time series depends on a trend-cycle, a seasonal component, and an irregular
component.

3.1 Additive model

The additive model is appropriate if the magnitude of the seasonal fluctuations does not vary with the
level of the time series.

3.2 Multiplicative model

Multiplicative decomposition is more prevalent with economic series because most seasonal economic
series do have seasonal variation which increases with the level of the series.

3.3 General formulation


Logarithms turn a multiplicative relationship into an additive one:

Pseudo-additive decomposition:

which is useful in series where there is a short period that is much higher or lower than all the others.

3.4 Additive classical decomposition
1. The trend-cycle T(t) is computed by using a low-pass filter or smoother (centered MA)
2. The de-trended series is computed as: r(t) = y(t) - T(t)
3. The seasonal component S(t), which is assumed to be constant from period to period, is estimated as
the average value of the de-trended series r(t) for each season.
4. The irregular component is given by: y(t) - T(t) - S(t)
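The four steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions: the series covers at least two full periods, the centered moving average for an even period gives half weight to the two end points, and the seasonal indices are centered so that they sum to zero. Function names are hypothetical.

```python
def centered_ma(y, period):
    """Centered moving average; for an even period the end points get half weight."""
    half = period // 2
    trend = [None] * len(y)  # undefined at both ends of the series
    for t in range(half, len(y) - half):
        window = y[t - half:t + half + 1]
        if period % 2:
            trend[t] = sum(window) / period
        else:
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    return trend

def additive_decomposition(y, period):
    trend = centered_ma(y, period)                                    # step 1
    detrended = [None if tr is None else v - tr
                 for v, tr in zip(y, trend)]                          # step 2
    means = []
    for s in range(period):                                           # step 3
        vals = [d for i, d in enumerate(detrended)
                if i % period == s and d is not None]
        means.append(sum(vals) / len(vals))
    offset = sum(means) / period  # center the seasonal indices on zero
    seasonal = [means[t % period] - offset for t in range(len(y))]
    irregular = [None if tr is None else v - tr - s
                 for v, tr, s in zip(y, trend, seasonal)]             # step 4
    return trend, seasonal, irregular

# Synthetic quarterly series: linear trend plus a fixed seasonal pattern
pattern = [1.0, -1.0, 2.0, -2.0]
y = [float(t) + pattern[t % 4] for t in range(16)]
trend, seasonal, irregular = additive_decomposition(y, period=4)
```

On this noise-free example the procedure recovers the linear trend and the seasonal pattern, leaving an irregular component of zero wherever the trend is defined.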

3.5 Multiplicative classical decomposition


1. The trend-cycle is computed using a low-pass filter or smoother (centered MA)
2. The de-trended series is computed as the ratio: r(t) = y(t) / T(t)
3. The seasonal component S(t), which is assumed to be constant from year to year, is estimated as the
monthly average value of the de-trended series r(t).
4. The irregular component is given by: y(t) / (T(t) S(t))

3.6 Moving averages

In general, a weighted (odd) k-point moving average can be written as:

$$T(t) = \sum_{j=-m}^{m} a_j\,y(t+j), \qquad m = (k-1)/2$$

where the aj are the weights; the weights of some common weighted moving averages can be looked up in
tables.

3.7 X11/X12/X13 – ARIMA


These are the most widely used variants of Census II method developed by the U.S. Bureau of the
Census.
Census II decomposition is usually multiplicative, since most economic time series have seasonal
variation which increases with the level of the series. It is an iterative procedure in which the
decomposition is refined. The algorithm also minimizes the effect of outliers.

3.8 Relation between forecasting and decomposition


Forecasts based directly on a decomposition are performed by extending each of the components of the
series. In practice it rarely works well:

 The trend-cycle is the most difficult component to forecast. It is sometimes proposed to be


modeled by a simple function as a straight line, but such models are rarely adequate.
 The seasonal component for future years can be based on the seasonal component from the last
full period of data, but if it is changing over time, this is unlikely to be adequate.

 The irregular component may be forecast as zero (for additive decomposition) or one (for
multiplicative decomposition). This assumes that the irregular component is uncorrelated,
which is not usually true.
The decomposition methods are generally used as exploratory methods.

4 Exponential smoothing methods


4.1 Simple exponential smoothing
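Formulation of the method:

$$\hat{y}(t+1) = \hat{y}(t) + \alpha\,\big(y(t) - \hat{y}(t)\big)$$

where α is a constant between 0 and 1. The above expression can be put in weighted average form:

$$\hat{y}(t+1) = \alpha\,y(t) + (1-\alpha)\,\hat{y}(t)$$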

Properties:

 It is not suitable for time series with trend


 Analogy with proportional control
 Requires very low storage space → suitable when the number of series is very high
Expanding the weighted average expression shows the geometric progression of the weights:

The component expression of the model:

 Smoothing equation: L(t) = αy(t)+(1- α)L(t-1)


 Forecast equation: ŷ(t+1) = L(t)
 Multi-horizon forecast: ŷ(t+h|t) = L(t) → flat forecast
 Optimization: we need to select L(0) and α
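The component form translates almost line by line into code. In this sketch `alpha` and the initial level `level0` are supplied by hand; in practice both would be chosen by minimizing the in-sample forecast error. The function name is hypothetical.

```python
def simple_exp_smoothing(y, alpha, level0):
    """Return the smoothed level L(t) after each observation."""
    level = level0
    levels = []
    for obs in y:
        level = alpha * obs + (1 - alpha) * level  # smoothing equation
        levels.append(level)
    return levels

levels = simple_exp_smoothing([10.0, 12.0, 11.0], alpha=0.5, level0=10.0)
forecast = levels[-1]  # the same value is used for every horizon (flat forecast)
```

Only the latest level needs to be stored between updates, which is the low storage requirement mentioned in the properties.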

4.2 Trend methods


4.2.1 Holt’s Linear Trend method
Given the constants α and β between 0 and 1, Holt’s linear trend model is given by:

 Level: L(t) = αy(t)+(1-α) (L(t-1) + T(t-1))


 Trend: T(t) = β(L(t) – L(t-1)) + (1- β) T(t-1)
 Forecast: ŷ(t+m) = L(t) + T(t) m

4.2.2 Damped Trend method


Given the constants α, β and ϕ between 0 and 1, the damped linear trend model is given by:

 Level: L(t) = αy(t)+(1-α) (L(t-1) + ϕT(t-1))


 Trend: T(t) = β(L(t) – L(t-1)) + (1- β)ϕT(t-1)

 Forecast: ŷ(t+m) = L(t) + (ϕ+ϕ²+…+ϕ^m)T(t)
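Holt's method and its damped variant differ only in the factor ϕ, so a single sketch covers both: with `phi=1.0` the function below reduces to Holt's linear trend method. Initial values are passed in by hand, and the function name is hypothetical.

```python
def holt_forecast(y, alpha, beta, level0, trend0, horizon, phi=1.0):
    """Forecasts for 1..horizon steps ahead with a (possibly damped) linear trend."""
    level, trend = level0, trend0
    for obs in y:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    forecasts = []
    damp = 0.0
    for m in range(1, horizon + 1):
        damp += phi ** m  # phi + phi^2 + ... + phi^m (equals m when phi = 1)
        forecasts.append(level + damp * trend)
    return forecasts

series = [2.0, 4.0, 6.0]
linear = holt_forecast(series, 0.5, 0.5, level0=0.0, trend0=2.0, horizon=2)
damped = holt_forecast(series, 0.5, 0.5, level0=0.0, trend0=2.0, horizon=3, phi=0.5)
```

On a perfectly linear series started with the right trend, Holt's method continues the line, while the damped version produces increments that shrink geometrically with the horizon.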

4.3 Trend and seasonality methods


4.3.1 Holt-Winters Exponential Smoothing
Given the constants α, β and γ between 0 and 1, the additive model is given by:

 Level: L(t) = α(y(t)-S(t-s)) + (1-α) (L(t-1) + T(t-1))


 Trend: T(t) = β(L(t) – L(t-1)) + (1- β)T(t-1)
 Seasonality: S(t) = γ(y(t)-L(t)) + (1 - γ)S(t-s)
 Forecast: ŷ(t+m) = L(t) + T(t)m + S(t-s+m)
Given the constants α, β and γ between 0 and 1, the multiplicative model is given by:

 Level: L(t) = α(y(t)/S(t-s)) + (1-α) (L(t-1) + T(t-1))


 Trend: T(t) = β(L(t) – L(t-1)) + (1- β)T(t-1)
 Seasonality: S(t) = γ(y(t)/L(t)) + (1 - γ)S(t-s)
 Forecast: ŷ(t+m) = (L(t) + T(t)m)S(t-s+m)
Given the constants α, β, γ and ϕ between 0 and 1, the Holt-Winters method with a damped trend and
multiplicative seasonality is given by:

 Level: L(t) = α(y(t)/S(t-s)) + (1-α) (L(t-1) + ϕT(t-1))


 Trend: T(t) = β(L(t) – L(t-1)) + (1- β) ϕT(t-1)
 Seasonality: S(t) = γ(y(t)/L(t)) + (1 - γ)S(t-s)
 Forecast: ŷ(t+m) = (L(t) + (ϕ+ϕ²+…+ϕ^m)T(t))S(t-s+m)

5 Basic linear processes


A linear process can be represented as a linear combination of random variables (Box-Jenkins):

$$y[t] = \mu + \sum_{i=0}^{\infty}\psi_i\,\varepsilon[t-i]$$

where μ is the mean of y[t], ψ0 = 1 and ε[t] is a sequence of independent, identically distributed random
variables with zero mean and a well-defined distribution. We will focus on three processes:

5.1 Gaussian White Noise


General expression is y[t] = ε[t].

5.1.1 Properties

We cannot predict this series from the past, because there is no correlation. When diagnosing a model,
we should check that its residuals behave as white noise.

5.2 Autoregressive processes


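An AR(p) process models a system with inertia:

$$y[t] = \phi_1 y[t-1] + \phi_2 y[t-2] + \dots + \phi_p y[t-p] + \varepsilon[t]$$

or:

$$\phi(B)\,y[t] = \varepsilon[t]$$

where:

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$$

and B is the backshift operator: By[t] = y[t-1]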


To be stationary, the roots of its characteristic polynomial have to lie outside the unit circle:

If we include a constant term:

under the assumption of stationarity:

On the other hand, with δ = 0:

We can see that the simple autocorrelation decreases exponentially with the lag.

With a second-order autoregressive process, the autocorrelation can decrease following an exponential
that changes sign.

5.3 Moving Average processes


MA(q) process:

Any stationary AR(p) can be written as an MA(∞) model, and an MA(q) model (it must be invertible) can be
written as an AR(∞). For an MA(q) process to be invertible, the roots of the polynomial have to lie outside
the unit circle:

If we include a constant term:

Then:

On the other hand, with δ = 0:

For an MA(1), only the first term of the simple autocovariance is non-zero; in general, the autocovariance
of an MA(q) process is zero beyond lag q.

5.4 ARMA processes
ARMA(p,q) process:

In order to be stationary, the roots of the polynomial have to lie outside the unit circle:

For an ARMA(p,q) process to be invertible, the roots of the polynomial have to lie outside the unit circle:

If we include a constant term, under the assumption of stationarity:

On the other hand:

5.5 ARMA Model Identification
Covariance:

Correlation coefficient:

Correlation is strongly affected by outliers.


Autocovariance: only makes sense if the time series is stationary; if not, we need more than one
realization (time series) of the process.

Autocorrelation:

Correlogram is the plot of autocorrelation, not recommended for k>N/4
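The values plotted in a correlogram can be computed directly from the data. The sketch below uses the usual sample estimator (deviations from the overall mean, normalized by the lag-0 sum of squares) together with the approximate 95% band ±1.96/√N that applies when the true autocorrelations are zero. Function names are hypothetical.

```python
import math

def sample_acf(y, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a series."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def white_noise_band(n):
    """Half-width of the approximate 95% band for white noise of length n."""
    return 1.96 / math.sqrt(n)

acf = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
band = white_noise_band(100)
```

Coefficients that fall outside the band are candidates for being significantly different from zero, which is how the ACF is read when identifying the order of an MA(q) process.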


Under the assumption that the real autocorrelation is zero, the sample autocorrelation must follow a
normal distribution of mean zero and variance equal to $\sigma_{\rho_k}^2$. If y[t] is an MA(q) process, then:

and we can establish the 95% confidence interval:

For an AR(p) process, the ACF decays after k=p, but it never reaches 0 so it is not easy to identify an
AR process from its ACF. We should use the partial autocorrelation function.
The PACF can be obtained by linear regression, interpreting each coefficient ϕkk as the partial correlation
between y[t] and y[t-k] after having eliminated in both variables the effects of the samples between
them:

For an AR(p) process and k>p:

For an MA(q) process, we should look at the ACF and not at the PACF.

5.6 ARMA Model Diagnosis


The ideal model:

 Residuals follow a white noise distribution (Gaussian)


 Is stationary and invertible.
 Coefficients are statistically significant and uncorrelated.
 Model coefficients are sufficient to represent the series
 High degree of fit compared with other models.

5.6.1 Residual analysis


We should plot the standardized residuals with different confidence limits (±σε, ±2σε, ±3σε, ±4σε).

We can check for heteroskedasticity (non-constant variance) and check for outliers.

We should also check the degree of significance of each autocorrelation coefficient. For a white noise
process:

Therefore, we can establish a 95% confidence interval:

We should check the coefficients individually but we can do it in group by a Portmanteau test:

Residuals are not white noise

5.6.2 Level of significance of the coefficients
We should do a t-test:

5.7 ARIMA Models


5.7.1 The Box-Jenkins methodology
1. Transform the original time series in order to stabilize the variance and the mean.
2. Propose and estimate the parameters of a tentative ARMA(p,q) model for the stationary time
series.
3. Diagnose the resulting model
4. Repeat
5.7.1.1 Stabilizing the variance
In many time series, the variance (seasonal component) increases with the trend, and this increase is
often linear.

This effect is translated mathematically as:

The series should be transformed by logarithms or by the Box-Cox transformation to stabilize the variance
across the whole time series, in order to remove the correlation between trend and seasonality.

5.7.2 Integrated processes


A process may be non-stationary in the mean, variance, autocorrelations or other characteristics of the
distribution of the variables. If the level is not stable over time and shows a tendency → the process is
non-stationary in the mean. Integrated processes are non-stationary processes which become stationary
when differenced.
The ACF of an integrated process shows a slow linear decreasing pattern. Most real time series are not
stationary and their average level changes with time.
5.7.2.1 Stabilizing the mean
The stationarity of the mean requires that the series keeps oscillating around a constant level. When this
does not happen, the ACF has a very slow, linear decrease:

Stationarity is reached using differencing. First order differencing removes linear trends, and second
order differencing removes quadratic trends. Differencing must be repeated until the series becomes
stationary. In case of doubt, it is usually better to over-difference than to avoid differencing.
The difference of a previously log-transformed series is known as the (log) return.
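A short sketch makes the effect of differencing visible on noise-free toy series. The function name and the `lag` argument are illustrative; `lag=1` gives regular differencing and `lag=s` gives seasonal differencing of period s.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both terms exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

linear = [3.0 + 2.0 * t for t in range(6)]
quadratic = [float(t * t) for t in range(6)]

d1 = difference(linear)                 # first order: linear trend becomes a constant
d2 = difference(difference(quadratic))  # second order: quadratic trend becomes a constant
ds = difference([1.0, 5.0, 2.0, 6.0, 3.0, 7.0], lag=2)  # seasonal differencing, s = 2
```

Each differencing pass shortens the series by `lag` observations, which is one practical reason to keep the differencing orders small.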
5.7.2.1.1 Dickey-Fuller test
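1. We fit the OLS model (with p = 3): y'(t) = ϕ y(t-1) + β1 y'(t-1) + β2 y'(t-2) + β3 y'(t-3) + ε(t),
where y'(t) = y(t) - y(t-1) is the differenced series.
2. If y(t) requires differencing, then the estimate of ϕ will be close to 0 (using a 5% threshold,
differencing is required if p-value is greater than 0.05).
3. If y(t) is stationary, then the estimate of ϕ is negative.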

5.7.3 ARIMA (p,d,q)


This model results from applying an ARMA model to a time series that has been differenced d times:

• AR: p = autoregressive order
• I: d = regular differencing order
• MA: q = moving average order

ARIMA(3,2,1):
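As a worked instance of the notation, and assuming the usual Box-Jenkins convention (B is the backshift operator, B y[t] = y[t−1]; ϕi are the AR coefficients; θ1 is the MA coefficient, written with a minus sign; ε[t] is white noise), an ARIMA(3,2,1) model can be written as shown below. The symbol names are assumptions of this sketch.

```latex
(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3)\,(1 - B)^2\, y[t] = (1 - \theta_1 B)\,\varepsilon[t]
```

The first factor is the AR polynomial of order p = 3, the factor (1 − B)² is the regular differencing with d = 2, and the right-hand side is the MA polynomial of order q = 1.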

5.7.4 Seasonal ARIMA models
Seasonality of period s shows up in the ACF and PACF as significant coefficients at the lags that are multiples of the period s.

Seasonal non-stationary time series may require seasonal differencing.


5.7.4.1 Seasonal processes ACF
a) The first (1 to 6) coefficients of the ACF are only affected by the regular component.
b) The seasonal coefficients are basically affected by the seasonal component.
c) The ACF of the regular component is replicated at both sides of the seasonal lags.
5.7.4.2 Seasonal processes PACF
a) The first (1 to 6) coefficients of the PACF are only affected by the regular component.
b) The seasonal coefficients are basically affected by the seasonal component.
c) The PACF of the regular component is replicated at the right side of the seasonal lags.
d) The ACF of the regular component is replicated at the left side of the seasonal lags of the PACF.
5.7.4.3 ARIMA(p,d,q)(P,D,Q)s model
It is the combination of a seasonal ARIMA model and a regular ARIMA model:

• AR: p = autoregressive order
• I: d = regular differencing order
• MA: q = moving average order
• ARs: P = seasonal autoregressive order
• Is: D = seasonal differencing order
• MAs: Q = seasonal moving average order

ARIMA(1,1,1)(1,1,1)4:
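Under the same assumed convention (backshift operator B, lowercase ϕ1 and θ1 for the regular part, uppercase Φ1 and Θ1 for the seasonal part, white noise ε[t]), the multiplicative ARIMA(1,1,1)(1,1,1)4 model can be expanded as follows. The symbol names are assumptions of this sketch.

```latex
(1 - \phi_1 B)\,(1 - \Phi_1 B^{4})\,(1 - B)\,(1 - B^{4})\, y[t]
  = (1 - \theta_1 B)\,(1 - \Theta_1 B^{4})\,\varepsilon[t]
```

Each seasonal operator is the regular operator evaluated at B⁴, so the seasonal terms relate observations that are one period (four samples) apart.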
5.7.4.3.1 Identification
1. Plot the series and search for possible outliers
2. Stabilize the variance by transforming the data. Use the mean/std plot
3. Analyze the stationarity of the transformed series. If the data have a constant level and the ACF and PACF decay to zero rapidly, the series can be considered stationary.
4. If the series is not stationary, we use differencing. For non-seasonal time series, apply regular
differencing. For seasonal time series, we first apply seasonal differencing and then apply
regular differencing (d, D ≤ 2)
5. Identify the seasonal model by analyzing the seasonal coefficients of the ACF and PACF.
6. Once the seasonal model has been identified, identify the regular component by exploring the
ACF and PACF of the residuals of the seasonal model.
7. Check the significance of the coefficients.
8. Analyze the residuals:
a. Outlier detection
b. Test for serial correlation (Ljung-Box test)
c. Plot the histogram of the residuals (normality test)
9. Compare different models using AIC or SBC (M = p+q+P+Q):

6 Dynamic regression models


6.1 Multiple regression
Multiple regression model:

where y[t] is the output or dependent variable, xi[t] are the inputs (explanatory or independent variables) and ε[t] is the noise. The basic hypotheses are linearity and independent, homoscedastic, Gaussian residuals, that is, white-noise residuals.

6.1.1 Formulation (Pankratz)

where y[t] is the dependent or output variable, x[t] is the independent or explanatory input variable, v[t] is the autocorrelated ARIMA noise, $w(L) = w_0 - w_1 L - w_2 L^2 - \dots - w_s L^s$, $\delta(L) = \delta_0 - \delta_1 L - \delta_2 L^2 - \dots - \delta_r L^r$, and r, s, b are constant integers (b is the delay of the effect of x on y).
The dynamic regression model requires determining the orders r, s and b, and the orders p, d and q of the ARIMA noise model. Two methods are available: the traditional Box-Jenkins method and the LTF method.

6.1.2 LTF (Linear Transfer function) method


1. Transform the series for stabilizing the variance
2. Fit a multiple regression model of the form:

with a large (8 – 10) k and a low order AR model for the noise
3. If the regression errors are not stationary, then difference y and x. Fit the model with the differenced series.
4. If the regression errors are stationary, identify the transfer function α(L) by selecting
appropriate values for b, r and s:
a. The value of b is selected as the number of samples it takes for the output to respond
to the input.
b. The value of r determines the pattern of decay in the impulse response weights.
c. The value of s determines where the pattern of decay in the impulse response weights
begins.
5. Identify an ARMA model for the regression errors v[t]
6. Fit the complete model with the identified TF and ARMA model.
7. Analyze the residual ε[t] using the general procedure.
6.1.2.1 General rules
• To determine b, we count the number of initial non-significant coefficients (α0, α1, …, α(b−1)).
• The value of r determines the pattern of decay of the coefficients αi:
  o If there is no pattern of decay, but a set of non-zero coefficients followed by a cut to zero, we take r = 0.
  o If the pattern of decay is exponential, we take r = 1.
  o If the pattern of decay is damped exponential or damped sine wave, we take r = 2.
• The value of s determines the number of non-null coefficients αi before the decay.

6.1.3 Model diagnosis
Tests on the parameters:

• Check whether the model can be simplified by cancelling operators with similar values in the numerator and denominator.
• The roots of the AR polynomials should fulfill the stability conditions.
• Check that all the coefficients are significant and have a reasonable physical meaning (in particular the sign of the coefficients of the TF).
Tests on the residuals:

• The residuals should be Gaussian, with zero mean, and uncorrelated.

There are 4 cases to be analyzed:


1. If both the transfer function and the noise model are misspecified, the estimated residuals are autocorrelated and correlated with x[t].
2. If the TF is incorrect but the noise model is correct, the residuals are correlated with x[t] and are also autocorrelated because of the filtered effect of x[t].
3. If the TF is correct but the noise model is not, there is residual autocorrelation but no correlation between the estimated residuals and x[t].
4. If both are correct, neither cross-correlation nor autocorrelation is observed.

Topic 5. Unsupervised learning
1 Introduction
The probability density function (pdf) of a random variable X ∈ R gives a natural description of the distribution of this variable over R:

Applications of the pdf are data description and characterization, discriminant analysis (classification),
clustering and simulation.
Suppose we have a set of observed data points {x[1], …, x[N]}, assumed to be a sample from an unknown probability density function p(x) of the random variable X. Density estimation is the construction of an estimate ^p(x) of the density function from the observed data.

2 Parametric methods
Parametric distributions can be described with a finite set of parameters. Examples: Normal(μ, σ),
Beta(α, β), … The basic procedure consists in:
1. Select a parametric family of probability density functions that is compatible with the
distribution of the data.
2. Estimate the parameters of the distribution
3. Diagnose the final model

2.1 Selection of the family of distributions


The scalar random variable X, whose distribution must be estimated, can be continuous or discrete.
If X is a discrete random variable, there are a countable number of values x1, x2, …, each one with a
given probability p1, p2, …, satisfying ∑pi = 1, pi ≥ 0.
If X is a continuous random variable, the family of distributions can be selected from the histogram of the sample.

The stats package in R contains the functions for the density function, cumulative distribution function,
quantile function and random variate generation. They are named dxxx, pxxx, qxxx, rxxx respectively.

2.1.1 Validation of the selection


The validation of the model is carried out by comparing the estimated distribution with the sampling
distribution, by means of the histogram, the cumulative distribution function or the Q-Q plot.

Q-Q plot: a point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). If the two distributions being compared are similar, the points lie close to the line y = x.

2.1.2 Parameter estimation


The parameters of the distribution can be estimated by maximum likelihood. The basic idea is to maximize the joint probability of the observed sample, which is equivalent to maximizing the log-likelihood:
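For the Normal(μ, σ) family the maximization has a closed-form solution, which makes it a convenient illustration. The sketch below assumes that family and a made-up sample; the function names are hypothetical.

```python
import math

def normal_log_likelihood(sample, mu, sigma):
    """Sum of the log-densities of a Normal(mu, sigma) over the sample."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in sample
    )

def normal_mle(sample):
    """Closed-form maximum likelihood estimates for the Normal family."""
    n = len(sample)
    mu = sum(sample) / n
    # The ML estimate of the variance divides by n, not by n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / n)
    return mu, sigma

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
mu_hat, sigma_hat = normal_mle(data)
best = normal_log_likelihood(data, mu_hat, sigma_hat)
```

Any other pair (μ, σ) gives a lower log-likelihood than `best`. For families without a closed form, the same function would be handed to a numerical optimizer.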

3 Non-parametric methods
The histogram is the oldest and most widely used density estimator. Given an origin x0 and a bin width δ, the real axis can be partitioned in the form:

Given a set of N samples, the histogram is then defined by:

The histogram can be generalized by allowing the bin widths to vary:

Pros:
• Simplicity
Cons:
• ^p(x) is not a continuous function
• Not adequate for clustering and classification
• The selection of x0 may affect the shape of the histogram
• The bin width δ must be selected
• Only for scalar random variables
3.1 The naive estimator
From the definition of a probability density, if the random variable X has density p(x), then:

If we define the weight function: Then:

The estimation is obtained by assigning a probability field a(x[i]) to each sample x[i]:

Pros:
• Simplicity
Cons:
• Stepwise predictions: discontinuities in x[i] and null derivatives

3.2 The kernel estimator
The naïve estimator can be generalized by substituting the weight function a(x) by a kernel function K()
which satisfies the condition:

Usually K(x) will be a symmetric probability density function. By analogy with the naïve estimator, the
kernel estimator is defined by the following expression where δ is the window width, smoothing
parameter or bandwidth.

The kernel estimator is a sum of bumps placed at the observations. The kernel K() determines the shape
of the bumps while the window width δ determines their width in the x axis.
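The "sum of bumps" can be sketched in a few lines. The code below assumes a Gaussian kernel and the usual form ^p(x) = (1/(Nδ)) Σ K((x − x[i])/δ); the sample and the bandwidth are made up for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a symmetric kernel that integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_density(x, sample, delta):
    """Kernel estimate at x: average of bumps of width delta centered at the data."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / delta) for xi in sample) / (n * delta)

sample = [-1.0, 0.0, 0.5, 2.0]            # made-up observations
grid = [-10 + 0.01 * i for i in range(2001)]
# Riemann sum over a wide grid: the estimate should integrate to about 1.
area = sum(kernel_density(x, sample, 0.5) for x in grid) * 0.01
```

Because the kernel is itself a density, the estimate is non-negative and integrates to one, and it is as smooth as the kernel.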

Pros:
• ^p(x) inherits all the continuity and differentiability properties of the kernel K.
Cons:
• A single δ for the complete sample is not the best choice for estimating distributions with heavy tails.

4 Neural networks for density estimation


4.1 The Parzen estimator

The Parzen estimator is consistent (it converges asymptotically in probability) if the function σ = σ(N) decreases with N and satisfies:

The smaller σ is, the more closely the neural network fits the training set.

4.2 The probabilistic RBFN
If we define a specific scale factor for each radial unit and we limit the number of radial units to a
predefined number h:

It is necessary to adopt a learning strategy to optimize the positions of the centers ri and the scale factors σi.

4.2.1 Learning algorithm


4.2.1.1 Initialization of the centers of the hidden units
A clustering method is applied to allocate the hidden units in the input space (K-means).

4.2.1.2 Initialization of the scale factors


The p-nearest neighbor algorithm is applied in order to ensure some overlapping of the radial basis units:

where rij is the j-th nearest center to ri and, by default, p = 2.


4.2.1.3 Fine tuning of ri and σi
The final values of ri and σi are obtained by maximizing the log-likelihood.

5 Principal Components Analysis
5.1 Introduction
The objective of PCA is to exploit the covariance structure of a given set of variables by means of a
linear combination of them.
When faced with a large set of n correlated variables, PCA allows us to summarize this set with a smaller
number h of representative variables that collectively explain most of the variability in the original set
and are uncorrelated. This implies that we can reduce the initial set of n variables to a new set of h<n
variables, with little loss of information, simply by rotating the axes. It is also used for data visualization.

5.2 Computation
Consider a sample of N elements described by the values of n variables, arranged in a matrix X (N×n) where each column is a variable and each row a case. Each variable must be centered, so X has zero mean and covariance matrix:

The problem is to find a space of smaller dimension that adequately represents the data, in such a way that the data keep their structure (relative distances) with the least possible distortion. The solution to the problem is formed by the directions, orthogonal to each other, that maximize the variance of the projections.
The first PC of a set of features X1, X2, …, Xn is the normalized linear combination of the features:

that has the largest variance. By normalized, we mean that $\sum_{i=1}^{n} \phi_{i1}^2 = 1$. We refer to the elements ϕi1 as the loadings of the first principal component. The loading vector ϕ1 defines a direction in feature space along which the data vary the most. If we project the N data points onto this direction, the projected values are the principal component scores Z11, …, ZN1 themselves.

After the first principal component Z1 of the features has been determined, we can find the second principal component Z2. The second principal component is the linear combination of X1, X2, …, Xn that has maximal variance out of all linear combinations that are uncorrelated with Z1:

It can be shown that the space of dimension h that best represents the original points is defined by the eigenvectors associated with the h largest eigenvalues of the covariance matrix S. These directions are called the principal directions of the data, and the new variables they define are the principal components.
In general, the matrix X (and S) has rank n, so there are as many principal components as original variables.
The eigenvalues are obtained as the roots of the characteristic polynomial.

After obtaining the eigenvectors and arranging them as the columns of the matrix ϕ (n×n), sorted in descending order of their eigenvalues, the principal components Z (N×n) are obtained from the centered original data X (N×n) as Z = Xϕ.
Therefore, calculating the principal components is equivalent to applying an orthogonal transformation
ϕ to the original data X.
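The computation can be sketched without any linear-algebra library for the first component. The code below assumes a 1/N normalisation of the covariance matrix and uses power iteration as a stand-in for a full eigendecomposition (it only recovers the leading eigenvector, and it would fail if the starting vector were orthogonal to it). Data and names are made up.

```python
import math

def center(X):
    """Subtract the column means so every variable has zero mean."""
    N, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / N for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in X]

def covariance(Xc):
    """Covariance matrix S of centered data (1/N normalisation assumed)."""
    N, n = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / N for b in range(n)] for a in range(n)]

def first_component(S, iterations=200):
    """Leading eigenvector and eigenvalue of S by power iteration."""
    n = len(S)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(S[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]              # keep the loadings normalized
    Sv = [sum(S[a][b] * v[b] for b in range(n)) for a in range(n)]
    return v, sum(v[a] * Sv[a] for a in range(n))

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]   # made-up data
Xc = center(X)
S = covariance(Xc)
phi1, lambda1 = first_component(S)
scores = [sum(x * p for x, p in zip(row, phi1)) for row in Xc]   # Z1 = X phi1
explained = lambda1 / sum(S[j][j] for j in range(len(S)))
```

In this example the variance of the scores equals the leading eigenvalue, and `explained` is the share of the total variance captured by the first component.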

5.3 Properties
They preserve the initial variability: the sum of variances of the n PC is equal to the sum of the variances
of the original n variables.
The variance of principal component Zi is λi
The proportion of the total variance explained by Zi is:

The first h PCs provide the optimal linear prediction, with h variables, of the set of variables X:

5.4 Interpretation
When there is a high positive correlation between all the variables, the first PC has all its coordinates of the same sign and can be interpreted as a weighted average of all the variables (size factor).
The other components are interpreted as shape factors and have both positive and negative coordinates. They can be written as weighted means of two groups of variables of different sign, contrasting the variables of one sign with those of the other.

5.5 Independent Component Analysis


The objective is to extract the independent signals driving a process from the data obtained where they
are mixed.

5.5.1 Setup
Assume there exist independent signals S = [s1(t), …, sn(t)]. We observe only linear combinations of them, X(t) = A S(t), where both A and S are unknown. A is called the mixing matrix. We have to recover S from X, so we need to find a linear transformation L, ideally A⁻¹, such that L X(t) = S(t).

5.5.2 Computation
First, remove the correlation (“whitening”): apply a linear transformation to decorrelate and normalize the signals (PCA). Let Z = ϕX.
Then, address higher-order dependence: find a rotation W that makes the whitened signals independent. The optimization problem is to minimize dep(WZ) over W subject to WᵀW = I, where dep(M) is a measure of the dependency between the columns of M.

5.5.3 Independence
For independent signals u and v:

• E[g(u) h(v)] = E[g(u)] E[h(v)] for all functions g, h.
• dep(M) = magnitude of the difference between E[g h] and E[g] E[h] when g and h are applied to columns of M.
The central limit theorem says that adding things together makes them more Gaussian, so the unmixed signals should be less Gaussian than the mixtures:

• Excess kurtosis is used as the measure of non-Gaussianity.
• It takes values from −2 to infinity; for a Gaussian it is 0.
• Maximize its absolute value to find non-Gaussian signals.
• dep(M) = −1 × |excess kurtosis of the columns of M|

6 Clustering
6.1 Introduction
Cluster analysis is described in terms of internal homogeneity and external separation: data objects in the same cluster should be similar to each other, while objects in different clusters should be dissimilar.
Clustering a set of data consists of the following steps:
1. Feature selection or extraction
2. Clustering algorithm and proximity measure design or selection
3. Cluster validation
4. Result interpretation

6.2 Proximity measures
A data object is described by a set of features or variables, usually represented as a multidimensional vector. For N data objects with n features, an N×n pattern matrix is built from the corresponding vectors. Each row in the matrix denotes an object while each column represents a feature.

Features can be classified as:

• Qualitative or categorical, when they can take on one of a limited and usually fixed number of possible values or labels and do not have a numerical or quantitative meaning. They simply describe a quality or characteristic of something.
• Quantitative, when they are measured and expressed numerically, have numerical meaning and can be used in calculations. They can be continuous or discrete.

6.2.1 Distance measures


A distance function on a data set X is defined to satisfy the following conditions:

• Symmetry: D(xi, xj) = D(xj, xi)
• Positivity: D(xi, xj) ≥ 0
If the conditions

• Triangle inequality: D(xi, xj) ≤ D(xi, xk) + D(xk, xj)
• Reflexivity: D(xi, xj) = 0 if and only if xi = xj
also hold, then it is called a metric. If only the triangle inequality is not satisfied, the function is called a semimetric.

6.2.2 Similarity measures


A similarity function is defined to satisfy the conditions:

• Symmetry: S(xi, xj) = S(xj, xi)
• Positivity: 0 ≤ S(xi, xj) ≤ 1
If it also satisfies the following conditions:

• S(xi, xj) S(xj, xk) ≤ [S(xi, xj) + S(xj, xk)] S(xi, xk)
• S(xi, xj) = 1 if and only if xi = xj
it is called a similarity metric.

6.2.3 Measures for continuous variables


The most commonly used distance measure is the Euclidean distance, also known as the L2 norm, represented as:

The Euclidean distance tends to form hyperspherical clusters and is invariant to translations and rotations in the feature space.
The data should be normalized so that the different units do not affect the clusters (a feature measured on a larger scale may dominate one measured on a smaller scale). One solution is standardization (z-score):

• The Euclidean distance is a special case of the Minkowski distance or Lp norm:

  o When p = 2, the distance becomes the Euclidean distance:

  o When p = 1, the distance becomes the Manhattan distance:

  o When p → ∞, the distance becomes the sup distance or L∞ norm:

The similarity can be measured with the cosine similarity:

The more similar the two objects, the more parallel they are in the feature space, and the greater the
cosine value.
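The measures of this section can be sketched as follows, assuming the standard definitions of the Minkowski distance, D(x, y) = (Σ |xl − yl|^p)^(1/p), and of the cosine similarity, S(x, y) = xᵀy / (||x|| ||y||). The function names and example vectors are illustrative.

```python
import math

def minkowski(x, y, p):
    """Lp distance; p = 1 is Manhattan, p = 2 Euclidean, p = inf the sup distance."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)         # 3 + 4
euclidean = minkowski(x, y, 2)         # sqrt(9 + 16)
sup = minkowski(x, y, math.inf)        # max(3, 4)

parallel = cosine_similarity((1.0, 0.0), (2.0, 0.0))      # same direction
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 3.0))    # perpendicular
```

The cosine similarity depends only on direction, not on length, whereas the Minkowski distances depend on the scale of each feature, which is why standardization matters for them.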

6.2.4 Measures for binary variables


When the features are 0 or 1, the similarity between two vectors acquires a non-geometric interpretation in terms of the number of common features. Binary features can be classified as symmetric or asymmetric depending on whether the two values are equally important or not.
In the case of symmetric features, a measure of similarity is:

with:

• w = 1, the simple matching coefficient is obtained.
• w = 2, the Rogers and Tanimoto coefficient is obtained.
• w = 0.5, the Sokal and Sneath coefficient is obtained.
In the case of asymmetric features, a measure of similarity is:

with:

• w = 1, the Jaccard coefficient is obtained.
• w = 2, the Sokal and Sneath coefficient is obtained.
• w = 0.5, the Dice coefficient is obtained.
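A sketch of both families of coefficients, assuming the forms commonly given in the clustering literature: S = (n11 + n00) / (n11 + n00 + w(n10 + n01)) for symmetric features and S = n11 / (n11 + w(n10 + n01)) for asymmetric ones, where n11 and n00 count the matches in 1 and in 0, and n10 and n01 count the mismatches. The function names and vectors are made up.

```python
def match_counts(x, y):
    """Count the four possible (x, y) combinations of two binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return n11, n00, n10, n01

def symmetric_similarity(x, y, w=1.0):
    """Matches in 0 count as much as matches in 1 (w=1: simple matching)."""
    n11, n00, n10, n01 = match_counts(x, y)
    return (n11 + n00) / (n11 + n00 + w * (n10 + n01))

def asymmetric_similarity(x, y, w=1.0):
    """Joint absences are ignored (w=1: Jaccard). Undefined if n11+n10+n01 = 0."""
    n11, _, n10, n01 = match_counts(x, y)
    return n11 / (n11 + w * (n10 + n01))

a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 1, 0]
simple_matching = symmetric_similarity(a, b, w=1)
rogers_tanimoto = symmetric_similarity(a, b, w=2)
jaccard = asymmetric_similarity(a, b, w=1)
dice = asymmetric_similarity(a, b, w=0.5)
```

A larger w penalizes mismatches more heavily, so for the same pair of vectors the coefficients decrease as w grows.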

If a categorical variable has more than two values (for example, coded as 00, 01, 10 and 11), a more effective and commonly used method is based on the simple matching criterion:

where w is usually 1 (values greater than 1 are used when the number of possible values is high).
When the values of the variables are ordered from lowest to highest, they can be compared using the continuous dissimilarity measures, after normalizing the values to the range [0, 1].

6.2.5 Measures for mixed variables


Real data sets commonly contain both continuous and categorical features.
The Gower similarity is given by:

where Sijl represents the similarity in component l, and δijl is a binary coefficient that indicates whether the measure is missing or not.

• For discrete variables:

• For continuous variables:

6.3 Hierarchical clustering


6.3.1 Introduction
Clustering techniques are generally classified as partitional clustering and hierarchical clustering.

• Partitional clustering directly divides data points into some prespecified number of clusters without the hierarchical structure.
• Hierarchical clustering groups data with a sequence of nested partitions, either from singleton clusters to a cluster including all individuals (agglomerative) or vice versa (divisive). The results of hierarchical clustering are usually depicted by a binary tree or dendrogram.
6.3.1.1 Dendrogram
The root node of the dendrogram represents the whole data set. Each leaf node is regarded as a data
point and the height of the dendrogram expresses the distance between each pair of data points or
clusters.

6.3.2 Agglomerative hierarchical clustering
Divisive methods are very computationally intensive; therefore, agglomerative methods are more common. The general agglomerative clustering procedure can be summarized as:
1. Start with N singleton clusters and compute the proximity matrix between them.
2. Find the two closest clusters and merge them into a new cluster.
3. Update the proximity matrix with the distances between the new cluster and the remaining ones.
4. Repeat steps 2 and 3 until only one cluster remains.
In the single linkage algorithm, or nearest neighbor method, the distance between a pair of clusters is determined by the two closest objects of the different clusters. Single linkage clustering tends to generate elongated clusters and produces a chaining effect, which can connect unrelated clusters because of noise.
In contrast to single linkage, the complete linkage method uses the farthest distance between a pair of objects to define the inter-cluster distance. It is effective in uncovering small and compact clusters and tends to generate spherical clusters.
The centroid method, which is generally applied only with continuous variables, takes as the distance between groups the Euclidean distance between their centers.
Ward introduced another type of method, whose objective at each stage is to minimize the increase in the total within-cluster error sum of squares. With K the number of clusters and mk the centroid of cluster Ck, this error is given by:

$E = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^2$

6.3.3 Partitional clustering


Partitional clustering assigns a set of data points xi ∈ Rn to K clusters {C1, …, CK} without any hierarchical structure. This process is usually driven by the optimization of a criterion function J. A common criterion function is the sum-of-squared-errors:

where:

The partition that minimizes the sum-of-squared-errors criterion is regarded as optimal and is called the
minimum variance partition.
6.3.3.1 K-means algorithm
The basic clustering procedure of K-means is:
1. Select K samples of the data set as initial prototypes: randomly, as the K points furthest from each other, or by manual selection in the PCA plane.
2. Assign each object in the data set to the nearest cluster Cl.
3. Recalculate the cluster prototype matrix based on the current partition: $m_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$
4. Repeat steps 2 and 3 until no cluster changes.
The input space is divided into Voronoi regions corresponding to a set of prototype vectors or Voronoi vectors. Each point in a Voronoi region is closer to its prototype vector than to any other. The algorithm described above performs batch-mode learning, since the update occurs after all the data have been processed.
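The batch procedure above can be sketched as follows. The initial prototypes are passed in explicitly (step 1 is left to the caller), the squared Euclidean distance is assumed, and an empty cluster simply keeps its previous prototype, which is a choice of this sketch rather than part of the algorithm as stated.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(data, prototypes, max_iter=100):
    """Batch K-means: alternate assignment and prototype update until stable."""
    prototypes = [list(p) for p in prototypes]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest prototype.
        new_assignment = [
            min(range(len(prototypes)),
                key=lambda k: squared_distance(x, prototypes[k]))
            for x in data
        ]
        if new_assignment == assignment:       # Step 4: no change, stop
            break
        assignment = new_assignment
        # Step 3: each prototype becomes the mean of its cluster members.
        for k in range(len(prototypes)):
            members = [x for x, a in zip(data, assignment) if a == k]
            if members:
                prototypes[k] = [sum(col) / len(members) for col in zip(*members)]
    return prototypes, assignment

def sum_of_squared_errors(data, prototypes, assignment):
    """Criterion J: squared distance of every point to its cluster prototype."""
    return sum(squared_distance(x, prototypes[a]) for x, a in zip(data, assignment))

data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
prototypes, labels = k_means(data, prototypes=[data[0], data[3]])
J = sum_of_squared_errors(data, prototypes, labels)
```

Different initial prototypes can lead to different partitions, so in practice the procedure is usually run several times and the partition with the lowest J is kept.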

The on-line or incremental mode K-means adjusts the cluster centroids each time a data point is processed. With η as learning rate:

$m_{new} = m_{old} + \eta (x - m_{old})$


A disadvantage of this algorithm is that it assumes the number of clusters K is known, which is not usually the case.
An F-test is usually performed, comparing the sum of squared errors with K groups, J(K), with that of K+1 groups, J(K+1), by calculating the relative reduction of variability:

$F = \frac{J(K) - J(K+1)}{J(K+1)/(N-K-1)}$

The resulting F is compared with an F distribution with n and n(N−K−1) degrees of freedom. An empirical rule that gives reasonable results is to introduce one more group if this ratio is greater than 10.
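The empirical rule is easy to apply once J(K) and J(K+1) are available. The sketch below only evaluates the ratio; the function name and the numbers (a made-up data set of N = 6 observations) are illustrative.

```python
def f_ratio(j_k, j_k_plus_1, n_obs, k):
    """Relative reduction of variability when going from K to K+1 groups."""
    return (j_k - j_k_plus_1) / (j_k_plus_1 / (n_obs - k - 1))

# Hypothetical sums of squared errors for K = 1 and K = 2 groups.
J1, J2 = 449 / 3, 8 / 3
F = f_ratio(J1, J2, n_obs=6, k=1)
add_one_more_group = F > 10      # empirical threshold from the text
```

Here the second group removes almost all of the within-group variability, so the ratio is far above 10 and the rule recommends K = 2 over K = 1.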

6.4 Validation measures


These measures reflect compactness, connectedness and separation of the cluster partitions:

• Connectedness relates to the extent to which observations are placed in the same cluster as their nearest neighbors in the data space.
• Compactness assesses cluster homogeneity, looking at intra-cluster variance.
• Separation quantifies the degree of separation between clusters, usually by measuring the distance between cluster centroids.
As separation and compactness move in opposite directions, both are usually combined into a single score.

6.4.1 Silhouette width


For each observation i, the silhouette width si is calculated as:
1. For each observation i, calculate the average distance ai between i and all other points in the
same cluster.
2. For all other clusters C to which i does not belong, calculate the average distance d(i,C) of i to
all observations of C. The smallest is defined as bi. The value of bi can be seen as the
dissimilarity between i and its neighbor cluster.
3. The silhouette width is defined as: $S_i = \frac{b_i - a_i}{\max(b_i, a_i)}$
6.4.1.1 Interpretations
• The Silhouette width S is the average of each observation’s Silhouette value.
• The Silhouette width thus lies in the interval [-1, 1] and should be maximized.
• Observations with large Si are very well clustered.
• An Si close to 0 means that the observation lies between two clusters.
• Observations with a negative Si are probably placed in the wrong cluster.
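The three steps can be sketched directly. The code assumes the Euclidean distance, at least two clusters, and the convention Si = 0 for an observation that is alone in its cluster (a common convention, not stated in the notes). Data and labels are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_widths(data, labels):
    """Silhouette width of every observation, given a list of cluster labels."""
    clusters = sorted(set(labels))
    widths = []
    for i, x in enumerate(data):
        own = [euclidean(x, y) for j, y in enumerate(data)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)        # singleton cluster (assumed convention)
            continue
        a = sum(own) / len(own)       # step 1: average distance within own cluster
        b = min(                      # step 2: closest neighboring cluster
            sum(euclidean(x, y) for y, l in zip(data, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))   # step 3
    return widths

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_widths(data, [0, 0, 1, 1])
bad = silhouette_widths(data, [0, 1, 1, 1])   # the point 1.0 is misplaced
average_good = sum(good) / len(good)
```

With the sensible labelling every width is close to 1, while the misplaced observation in the second labelling gets a negative width.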

6.4.2 Dunn index


The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance, where diam(Cm) is the maximum distance between observations in cluster Cm.

The Dunn index takes values between zero and infinity and should be maximized.
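A direct sketch of this ratio, assuming the Euclidean distance and that at least one cluster contains two or more observations (otherwise the largest diameter is zero and the ratio is undefined). Data and labels are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dunn_index(data, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    between, within = [], []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            d = euclidean(data[i], data[j])
            (within if labels[i] == labels[j] else between).append(d)
    return min(between) / max(within)

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
dunn = dunn_index(data, [0, 0, 1, 1])     # separation 9, largest diameter 1
```

Compact, well-separated clusters give a large value; in this example the closest pair of points from different clusters is nine times farther apart than the widest cluster.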

6.4.3 Selection of the number of clusters
Different validation measures are compared in order to determine the optimal number of clusters.
6.4.3.1 The elbow method
1. Run the clustering algorithm for different values of k.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS according to the number k.
4. The location of a bend in the plot is usually considered the appropriate number of clusters.
6.4.3.2 The average Silhouette method
1. Run the clustering algorithm for different values of k.
2. For each k, calculate the average silhouette of observations.
3. Plot the curve of the average silhouette according to the number k.
4. The location of the maximum is considered the appropriate number of clusters.

6.5 Vector quantization


The objective is to optimally distribute a given number of vectors in the input space. This distribution
should reflect the probability density function of the input vectors, although the pdf is not explicitly
specified.

6.5.1 Notation
Each model (network) consists of a set of H units A = {c1, c2, …, cH}. Each unit c is associated with a reference or prototype vector wc ∈ Rn that indicates its position in the input space.
Between the units of the network there is a set of unweighted and symmetric neighborhood connections C ⊂ A×A. These connections define the topological ordering of the network:

For each unit c, let Nc be the set of the direct neighbors of c. The input vectors are generated according to an unknown pdf, from which the training data set has been generated.
Given an input vector x, the winning unit s(x) is the unit of A whose prototype or reference vector is closest to x: $s(x) = \arg\min_{c \in A} \lVert x - w_c \rVert$, where ||.|| is the Euclidean norm. Similarly, si(x) denotes the i-th closest unit to x.
Given a set of reference vectors w1, …, wH in Rn, we define the Voronoi region Vi of vector wi as the set of points of Rn for which wi is the nearest reference vector. The Voronoi region of unit c, c ∈ A, is the Voronoi region of its reference vector:

In the case of a finite data set D, the Voronoi set of unit c is defined as the subset Rc of D for which c is the winning unit:

6.5.2 Minimization of the quantization error


The objective is to find a set of H vectors wc that minimizes the quantization error given by:

The k-means algorithm is an example of vector quantization.

6.5.3 Neural gas


Neural gas is another algorithm, proposed by Martinetz and Schulten:
1. Initialize H << N reference vectors of A, wci ∈ Rn, according to p(x). Initialize t = 0.
2. Generate a new input vector x according to p(x).
3. Order the elements of A in ascending order of their distance to x:

let ki(x, A) be the position k of wi in that ordered list:

• ki(x, A) = 0 ↔ wi is the nearest vector to x
• ki(x, A) = 1 ↔ wi is the second nearest vector to x
4. Adapt all the reference vectors according to:

where:

with typical values of:


5. t = t+1
6. If t < tmax go to step 2
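The loop above can be sketched as follows. The adaptation rule is assumed to be the standard neural gas update, Δwi = ε(t) · exp(−ki/λ(t)) · (x − wi), with ε(t) and λ(t) decaying exponentially from an initial to a final value. The default values used here (0.5 to 0.005 for ε, 10 to 0.01 for λ) are commonly quoted settings and are assumptions of this sketch, as is drawing the inputs from a finite data set.

```python
import math
import random

def neural_gas(data, n_units, t_max, eps=(0.5, 0.005), lam=(10.0, 0.01), seed=0):
    """Neural gas on a finite data set; returns the adapted reference vectors."""
    rng = random.Random(seed)
    # Step 1: initialize the reference vectors with samples drawn from the data.
    units = [list(x) for x in rng.sample(data, n_units)]
    for t in range(t_max):                       # steps 5 and 6
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac    # decaying learning rate
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac    # decaying neighborhood range
        x = rng.choice(data)                     # step 2: new input vector
        # Step 3: rank the units by their distance to x (rank 0 is the nearest).
        order = sorted(
            range(n_units),
            key=lambda i: sum((a - w) ** 2 for a, w in zip(x, units[i])),
        )
        # Step 4: adapt every unit, the more strongly the better its rank.
        for k, i in enumerate(order):
            h = math.exp(-k / lam_t)
            units[i] = [w + eps_t * h * (a - w) for w, a in zip(units[i], x)]
    return units

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
units = neural_gas(data, n_units=2, t_max=2000)
```

Early in training almost all units move towards each input; as λ(t) shrinks, only the nearest units are adapted and the rule approaches on-line K-means.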

6.5.4 Hebbian learning


Determination of the connection matrix:
1. Initialize H << N reference vectors of A according to p(x). Initialize t = 0. Initialize the connection matrix C = {}.
2. Generate a new input vector x according to p(x).
3. Determine the two nearest units to x: s1(x) and s2(x).
4. Create a new connection between s1 and s2: C = C ∪ {(s1, s2)}.
5. Go to step 2.

6.5.5 Neural Gas + Hebbian learning


1. Initialize H << N reference vectors of A according to p(x). Initialize t = 0. Initialize the connection matrix C = {}.
2. Generate a new input vector x according to p(x).
3. Order the elements of A in ascending order of their distance to x:

let ki(x, A) be the position k of wi in that ordered list:

• ki(x, A) = 0 ↔ wi is the nearest vector to x
• ki(x, A) = 1 ↔ wi is the second nearest vector to x
4. Adapt all the reference vectors according to:

where:

with typical values of:


5. Create a new connection between i0 and i1 and initialize its age: C = C ∪ {(i0, i1)}, age(i0, i1) = 0
6. Update the age of all connections emanating from i0: age(i0, i) = age(i0, i) + 1 ∀ i ∈ Ni0, where Nc are the direct neighbors of unit c
7. Eliminate the connections above the maximum age T(t) given by $T(t) = T_i (T_f / T_i)^{t/t_{max}}$; typical values: Ti = 20; Tf = 200
8. Update t = t+1
9. If t < tmax go to step 2

6.5.6 Growing neural gas


1. Initialize the set A with two units according to p(x): A = {c1, c2}. Initialize the connection matrix C, C ⊂ A×A, with the empty set: C = {}.
2. Generate a new input vector x according to p(x).
3. Determine the two nearest units to x: s1(x) and s2(x).
4. Create a new connection between s1 and s2 and initialize its age: C = C ∪ {(s1, s2)}; age(s1, s2) = 0
5. Update the local squared error of the winning unit: ΔEs1 = ||x − ws1||²
6. Update the reference vector of the winning unit and its neighbors:

7. Update the age of all connections emanating from s1:

8. Prune the connections older than amax. If pruning results in disconnected units, remove them.
9. If the number of input vectors processed is a multiple of the parameter λ, insert a new unit:
a. Determine the unit q from A with the largest error:

b. Determine the neighbor f of unit q with largest error:

c. Add a new unit r, interpolating the reference vectors of q and f: A = A ∪ {r}; $w_r = (w_q + w_f)/2$
d. Connect the new unit r with q and f, and eliminate the connection between q and f: C = C ∪ {(r, q), (r, f)}; C = C \ {(q, f)}
e. Decrease the errors of q and f by a fraction α: ΔEq = −αEq, ΔEf = −αEf
f. Estimate the error of unit r from the errors of q and f: $E_r = (E_q + E_f)/2$
10. Decrease the error of all units: ΔEc = -βEc, ∀c ϵ A
11. If the stopping criterion has not been reached, go to step 2

Typical values:

6.6 Model-based clustering
6.6.1 The probabilistic RBFN
Under this approach, a statistical model consisting of a finite mixture of Gaussian distributions is fitted to the data; this is equivalent to the PRBFN:

Each mixture component represents a cluster, and the mixture components and group memberships are estimated using maximum likelihood: ri and σi are optimized to maximize the log-likelihood:

7 Kohonen Self-Organising Maps


Self-organising maps are a competitive learning algorithm, similar to Neural Gas, based on a decreasing neighborhood and a decreasing learning rate. They also rely on a topological ordering of the reference vectors: the reference vectors are distributed on a two-dimensional grid (aij), and similar reference vectors are housed in nearby units of the grid.

7.1 Learning algorithm


The distance d(r,s) between two units of the grid determines the extent to which a unit r = akm is adapted when unit s = aij happens to be the best-matching cell. The L1 norm is used as a measure of the distance between two units of a rectangular grid:

Other types of grid:

A possible propagation rule of the adaptation of the winning unit to the rest is:

where:

7.1.1 Algorithm
1. Initialize the set A with H=H1H2 units ci according to p(x): A = {c1, c2,…, cH}
Initialize the set of connections C in the form of a rectangular grid H1xH2.
Initialize t = 0
2. Generate a new input vector x according to p(x)
3. Determine the winning unit s = s(x)
4. Adapt each unit r:

where:

5. Update t = t+1. If t<tmax go to step 2


Typical values:

7.2
