Statistics Toolbox II
Reminder: The following exercises are supposed to be solved with the MATLAB Live Editor. All MANDATORY exercises have to be completed and uploaded through the moodle portal before the deadline. Please use the StatisticsToolboxII_Template.mlx provided in moodle.
The maximum likelihood estimate is the parameter vector θmax that maximizes the log-likelihood.
In the case of a normal distribution it is easily shown that the parameters µ and σ that maximize the likelihood are simply the sample mean and the (biased) sample standard deviation of the data set itself.
$$\mu_{\max} = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{4}$$

$$\sigma_{\max} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \tag{5}$$
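As a quick check (a minimal sketch, assuming a sample vector x), the estimates in equations (4) and (5) are simply the sample mean and the standard deviation normalized by N; the function mle returns the same values for a normal distribution:

x = 5 + 2*randn(1000,1);      % example sample with mu = 5, sigma = 2
mu_max    = mean(x);          % equation (4)
sigma_max = std(x, 1);        % equation (5): flag 1 normalizes by N instead of N-1
phat      = mle(x);           % MLE for a normal distribution: [mu_max sigma_max]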
In general, for a fixed set of data and underlying statistical model, the method of maximum
likelihood selects values of the model parameters that produce a distribution that gives
the observed data the greatest probability (i.e., parameters that maximize the likelihood
function). In essence the method selects a set of model parameters that predicts that
events that occur often in the data are very likely to occur, and events that occur seldom
in the data are predicted to occur with small probability. Maximum-likelihood estimation
gives a unified approach to estimation, which is well-defined in the case of the normal
distribution and many other problems.
MATLAB provides a graphical user interface called dfittool for fitting distributions to
data. With this tool it is possible to find the probability distribution that best fits a given
data set and to compare the fitted PDF with the histogram representation of the data.
In addition, the functions that will be used in this section are described below.
Probability distribution objects allow you to easily fit, access, and store distribution information for a given data set. The following operations are easier to perform using distribution objects:
• Grouping a single dataset in a number of different ways using group names, and then fitting a distribution to each group.
The universal functions pdf and cdf compute the probability density function and the cumulative distribution function for arbitrary distributions; they thus subsume the specific functions such as normpdf, exppdf, . . . and normcdf, expcdf, . . . .
y = pdf(obj,X)
returns a vector y of length n containing the values of the probability density function for
the general distribution object obj, evaluated at the n-by-d data matrix X.
In the same manner, the function fitdist subsumes the individual variants for fitting the parameters of a distribution to data, such as normfit, expfit, . . . .
pd = fitdist(x,distname)
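For example (a minimal sketch, assuming a data vector x):

pd = fitdist(x, 'Normal');    % fit a normal distribution object to the data
y  = pdf(pd, x);              % evaluate the fitted PDF at the data points x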
2) MANDATORY: Perform the same calculation with the Matlab function normfit.
What are the 95% confidence intervals for the mean and variance?
3) MANDATORY: Perform the same calculation with the Matlab function fitdist to
generate a distribution object.
4) MANDATORY: Calculate the probability of the data p(xi | µ, σ) w.r.t. the two estimated distributions with pdf. Which of the two probability distributions (normal, lognormal) exhibits a higher likelihood on the data?
Figure 1: Estimated normal and log normal distribution, histogram of forearm data
Figure 2: Estimated exponential and Weibull distribution with histogram of FitDist2 data
Regression
Preparation: carDataSet.mat provides data about 406 car models from the years 1970
to 1982 in several categories. We focus on the following data characteristics:
• Power in [kW]
• Year of introduction
• Weight in [kg]
The class table behaves similarly to cell arrays, with the additional feature of row and column names.
Name = {’Hank’; ’Rocky’; ’Cody’};
Weight = [2.5; 50; 25];
Age = [2; 5; 7];
Vaccinated = [false; true; true];
Breed = {’Chihuahua’;’Great Dane’;’Labrador’};
dogTable = table(Weight,Age,Vaccinated,Breed,’RowNames’,Name)
dogTable =
3x4 table
Weight Age Vaccinated Breed
______ ___ __________ ____________
Hank 2.5 2 false ’Chihuahua’
Rocky 50 5 true ’Great Dane’
Cody 25 7 true ’Labrador’
Indexing with parentheses returns a sub-table, whereas curly braces extract the raw contents:
dogTable('Hank','Weight')
ans =
  table
        Weight
        ______
Hank     2.5
dogTable{'Hank','Weight'}
ans =
    2.5000
The degree to which two variables have a linear statistical relationship with each other can be measured by the correlation coefficient. A strong correlation is an indicator of a dependence between both variables. The correlation coefficient is +1 in the case of a perfect linear relationship with positive slope, −1 in the case of a perfect linear relationship with negative slope, and 0 if there is no linear correlation.
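A minimal sketch with corrcoef (assuming two data vectors x and y):

R = corrcoef(x, y);           % symmetric 2-by-2 matrix of correlation coefficients
r = R(1, 2);                  % correlation between x and y, in [-1, +1]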
The task is to train a regression model that predicts the regressand fuel efficiency Lper100km from the regressor vector (or a subset of these features) Acceleration, Cylinders, Displacement, Power, Year, Weight. The vector of independent variables is called the predictor; the dependent variable (here Lper100km) is denoted as the response. Based on the correlation plot, which variable is likely to be the most and least useful for the prediction of Lper100km? (Ignoring Origin)
Scatter Plot
A scatter plot is a type of plot that uses Cartesian coordinates to display values for typically two variables of a data set. A scatter plot reveals correlations between variables. If the data pairs slope from lower left to upper right, this indicates a positive correlation between the variables.
scatter(x,y);
creates a scatter plot with circles at the locations specified by the vectors x and y.
scatter3(x,y,z)
Generate a 3D scatter plot with scatter3 in which you plot the two (presumably) most relevant regressors on the x- and y-axes, and Lper100km on the z-axis. Label the axes and give an appropriate title. Set the figure to hold on for superposition of subsequent plots (a possible sketch follows below).
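A possible sketch, assuming Year and Weight turn out to be the two most relevant regressors (matching the fitlm example at the end of this section) and that the data is stored in the table carTable:

figRelReg = figure;
scatter3(carTable.Year, carTable.Weight, carTable.Lper100km);
xlabel('Year');
ylabel('Weight [kg]');
zlabel('Lper100km');
title('Fuel consumption over year and weight');
hold on;                      % keep the figure open for the regression surface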
5) MANDATORY: Origin has so far not been considered as a regressor. Open a new figure and create a boxplot with the countries of origin on the x-axis and Lper100km on the y-axis, for example as sketched below.
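A minimal sketch, assuming carTable contains a column Origin:

figure;
boxplot(carTable.Lper100km, carTable.Origin);   % one box per country of origin
xlabel('Origin');
ylabel('Lper100km');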
The Regression Learner app regressionLearner trains regression models to predict data.
The app supports data exploration, feature selection, dimensionality reduction, validation
schemes, model training, and residual analysis. You can perform automated training to
search for the best regression model type, including linear regression models, regression
trees, Gaussian process regression models, support vector machines, and ensembles of
regression trees. Once you have trained a particular model on your data, you can either export the model (with or without the data) or the code for training that model together with the data.
The Regression Learner app expects the data matrix of dimension n-by-m to be arranged in n rows (xi, yi). Columns 1 to m − 1 correspond to the regressors x and the last column contains the regressand y. The Regression Learner app provides k-fold cross-validation and the holdout method. Even though k-fold cross-validation is preferable, for the sake of computational efficiency first experiment with the holdout partition.
Cross-validation is a model validation technique that analyzes how well a model generalizes to previously unseen data. It is widely employed in the context of prediction problems where a model is estimated from a dataset of known data (training dataset) and tested on unknown data (testing dataset).
The holdout method separates the overall data into two sets, called the training set and
the testing set. The parameters of the model are learned using the training set only. The
trained model is asked to predict the response for the data in the testing set. Typical
partitions are 75% training and 25% testing.
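A minimal sketch of a holdout partition with cvpartition (assuming n observations):

c = cvpartition(n, 'HoldOut', 0.25);   % 75% training, 25% testing
idxTrain = training(c);                % logical index of the training rows
idxTest  = test(c);                    % logical index of the testing rows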
k-fold cross-validation improves the holdout method and better utilizes the available data.
The data set is partitioned into k folds, and the holdout method is repeated k times. Each
time, one of the k folds serves as the testing set and the union of the remaining k − 1
subsets constitutes the training set. The prediction performance is averaged over these
k trials. The advantage of this method is that it better utilizes the data. Every data item
appears once in the test set, and k − 1 times in the training set. The variance of the
performance estimate is reduced by a factor of k over the holdout method. The drawback
of k-fold cross-validation is the computational burden of training a model k times.
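The corresponding sketch for k-fold cross-validation with cvpartition:

c = cvpartition(n, 'KFold', 5);        % partition the n observations into 5 folds
for k = 1:c.NumTestSets
    idxTrain = training(c, k);         % union of the k-1 training folds
    idxTest  = test(c, k);             % the held-out testing fold
    % train the model on idxTrain and evaluate it on idxTest here
end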
Linear Regression
Linear regression models the relationship between a scalar dependent variable y and one
or more independent variables denoted x. For more than one explanatory variable, the
process is called multiple linear regression. The relationships are modeled using linear
predictor functions
$$\hat{y} = w^\top x = \sum_{n} w_n x_n \tag{1}$$
whose unknown model parameters w = (w1 , . . . , wn ) are estimated from data pairs (xi , yi ).
Linear regression models are often fitted using the least squares approach minimizing a
cost function over the data
$$E(w) = \sum_{i} \left( y_i - \sum_{n} w_n x_{ni} \right)^2$$
The $y_i$ are called the regressand and the vector $x_i = (x_{1i}, \ldots, x_{ni})$ is called the regressor. Usually a constant is included as one of the regressors, for example $x_{1i} = 1$, to account for the mean of y.
Sometimes one of the regressors can be a non-linear function of the data $x_i$, as in polynomial regression, in which $x_{ji}^2, x_{ji}^3, \ldots, x_{ji}^k$ are additional regressors. The model remains linear as long as it is linear in the parameter vector w.
The common criterion in regression is the mean squared error between the prediction ŷi and the observed response yi
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{4}$$
The function mse computes the mean squared error of a vector or matrix. Another common criterion is the root mean squared error
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{5}$$
The RMSE represents the sample standard deviation of the differences between predicted
and observed responses. These individual differences are called residuals for the training
data set from which the model is estimated, and are denoted as prediction errors on the
unseen test data.
The normalized mean squared error
$$\mathrm{NMSE} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
with observation y, mean response ȳ and prediction ŷ is more objective, as it relates the error to the natural variance of the response around its mean. The R² = 1 − NMSE score is quite common, as it captures the predictive power of the model in terms of the amount of variation of the response explained by the regressor. An R² of one indicates a perfect model; R² equal to zero means that the regressor has no predictive power at all for the response.
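A minimal sketch of the criteria above, assuming a vector y of observed responses and a vector yhat of predictions:

res  = y - yhat;                               % residuals / prediction errors
MSE  = mean(res.^2);                           % equation (4)
RMSE = sqrt(MSE);                              % equation (5)
R2   = 1 - sum(res.^2)/sum((y - mean(y)).^2);  % R^2 = 1 - NMSE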
8) MANDATORY: Train the following models on the car dataset and observe plots of
predicted versus true response.
• Linear regression - linear: the model contains a constant term and linear terms in the predictors
$$\hat{y} = w_1^\top x + w_0 \tag{7}$$
• Linear regression - quadratic: the model additionally contains squared terms of the predictors
$$\hat{y} = w_2^\top x^2 + w_1^\top x + w_0 \tag{8}$$
• Linear regression - interactions: the model additionally contains products of pairs of predictors
$$\hat{y} = w_2^\top x x^\top + w_1^\top x + w_0 \tag{9}$$
9) MANDATORY: Select the two least relevant features (feature selection) and train a linear regression model on these bottom two regressors. Report the RMSE in the table.
10) MANDATORY: Select the two most relevant features (feature selection) and train a linear regression model on these top two regressors. Report the RMSE in the table.
The Regression Learner app supports the export of a trained model to the workspace. It also allows you to export the code for training the model from data. The trainedModel structure allows predictions on novel unseen data. The structure contains a model object and a function for prediction. The exported model predicts the response for new data:
yfit = trainedModel.predictFcn(T);
11) MANDATORY: Export the model with two predictors under the name trainedModel to the Matlab workspace. To store the model on disk, use save for the trained model and load to import the model again.
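A minimal sketch of storing and restoring the model:

save('trainedModel.mat', 'trainedModel');   % store the trained model on disk
load('trainedModel.mat');                   % restore the variable trainedModel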
12) MANDATORY: Use meshgrid to create a grid of query points for the two predictors used in the model. Use the previous correlation plot to find appropriate minimum and maximum values for the grid. 10 grid steps in each dimension are sufficient.
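One possible sketch, assuming Year and Weight are the two predictors; the grid limits for Weight are hypothetical values that should be read off the correlation plot:

yearGrid   = linspace(1970, 1982, 10);     % range for Year (see the dataset description)
weightGrid = linspace(700, 2300, 10);      % assumed range for Weight in [kg]
[X1, X2]   = meshgrid(yearGrid, weightGrid);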
13) MANDATORY: To make a prediction for the newly created query points, it is necessary to put them into a new table testTable with the same number of columns and the same column names as were used for training, but without the column Lper100km. The columns of the unused predictors are set to 0.
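One possible sketch, assuming the model was trained on the six regressors listed above, in this column order:

q = numel(X1);                             % number of query points
z = zeros(q, 1);                           % unused predictors are set to 0
testTable = table(z, z, z, z, X1(:), X2(:), 'VariableNames', ...
    {'Acceleration','Cylinders','Displacement','Power','Year','Weight'});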
14) MANDATORY: With trainedModel predict Lper100km for the vector of queries
and store the predictions in Lper100kmPRED.
Lper100kmPRED = trainedModel.predictFcn(testTable);
15) MANDATORY: Plot a mesh of the estimations stored in Lper100kmPRED into the corresponding scatter plot figRelReg created previously. To do this, it may be necessary to reshape Lper100kmPRED to the same size as the grid matrices previously generated with meshgrid.
Lper100kmPRED = reshape(Lper100kmPRED,10,10);
Inspect the original data and the predictions that can be generated based on this
data visually.
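A minimal sketch of the superposition, reusing the held figure figRelReg from the scatter plot:

figure(figRelReg);                         % make the scatter plot figure current
mesh(X1, X2, Lper100kmPRED);               % overlay the predicted regression surface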
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is often implicitly assumed that the Gaussian process has zero mean m(x) = 0. The random variables represent the value of the function f(x) for an input x.
Consider first the standard linear regression model
$$y = x^\top w + \varepsilon \tag{13}$$
in which $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, where the error variance σ² and the regression coefficients w are estimated from data by maximum likelihood and linear least squares.
Gaussian process regression augments this model with latent function values,
$$y = h(x)^\top w + f(x)$$
in which $f(x) \sim \mathcal{GP}(0, k(x, x'))$, such that the noise originates from a Gaussian process with covariance function $k(x, x')$.
The basis functions h(x) map the original feature vector x into a transformed feature vector h(x). The vector w is composed of the coefficients of the basis functions. The probability of observing a response yi given the independent variables xi is given by
$$P(y_i \mid f(x_i), x_i) \sim \mathcal{N}\!\left(y_i \mid h(x_i)^\top w + f(x_i),\; \sigma^2\right)$$
Each observation xi is associated with its own latent variable f(xi), which renders the model nonparametric. The overall likelihood of the vector of responses y is given by
$$P(y \mid f, X) \sim \mathcal{N}\!\left(y \mid Hw + f,\; \sigma^2 I\right)$$
in which
$$X = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \end{pmatrix}, \quad H = \begin{pmatrix} h(x_1)^\top \\ h(x_2)^\top \\ \vdots \end{pmatrix}, \quad f = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \end{pmatrix}$$
For practical use, one assumes a parametric covariance function $k(x_p, x_q \mid \theta)$ in which θ denotes a vector of hyper-parameters. The squared-exponential covariance function is a common choice of kernel function. In one dimension it has the following form
$$k_y(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{(x_p - x_q)^2}{2l^2}\right) + \sigma_n^2 \delta_{pq} \tag{19}$$
with the hyper-parameters length-scale l, the signal variance σf² and the noise variance σn². The signal variance σf² reflects the prior standard deviation of the process. For σn² = 0 the model would perfectly fit the data at the observed data points xi. The length scale l reflects the smoothness of the underlying model. A short length scale allows sharp variations of the function. The effect of different choices of the hyper-parameters l, σf², σn² on the Gaussian process model is shown in Figure 4.
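A minimal sketch of equation (19) for scalar inputs, drawing samples from the GP prior with the hyper-parameter values of panel (a) in Figure 4:

x   = linspace(-5, 5, 100).';              % input locations
l   = 1;  sf2 = 1;  sn2 = 0.1;             % hyper-parameters (l, sigma_f^2, sigma_n^2)
K   = sf2 * exp(-(x - x.').^2 / (2*l^2)) + sn2 * eye(numel(x));   % equation (19)
f   = mvnrnd(zeros(1, numel(x)), K, 3);    % three samples from the zero-mean GP prior
plot(x, f.');                              % compare with Figure 4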
Gaussian process regression estimates the basis function coefficients w in conjunction with
the noise variance σ 2 and the hyperparameters θ. The parameters w, σ 2 , θ are estimated
such that they maximize the log likelihood of y according to equation (16).
16) OPTIONAL: Train a Gaussian Process model with the regression learner app.
The toolbox contains a library of functions that mimics the training and validation of regression models with the Regression Learner. The function fitlm creates a linear regression model as a LinearModel object. You can create plots and do further diagnostic analysis by using methods such as plot, plotResiduals, and plotDiagnostics. With predict you can predict the response of a linear regression model.
trainedModelB = fitlm(carTable,’Lper100km~Year+Weight’)
Lper100kmPREDB = trainedModelB.predict(testTable);
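The diagnostic methods mentioned above apply directly to the fitted model, for example:

plotResiduals(trainedModelB);              % histogram of the residuals
plotDiagnostics(trainedModelB);            % leverage plot for influential observations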
² Source: C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning [1]
Figure 4: Data is generated from a GP with hyper-parameters (l, σf², σn²) = (1, 1, 0.1), as shown by the + symbols. Using Gaussian process prediction with these hyper-parameters we obtain a 95% confidence region for the underlying function f (shown in grey). Panels (b) and (c) again show the 95% confidence region, but this time for hyper-parameter values (0.3, 1.08, 0.00005) and (3.0, 1.16, 0.89), respectively.²
References
[1] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.