
Scientific Programming in Matlab: Statistics Toolbox (II)
Version 1.0

Reminder: The following exercises are supposed to be solved with the MATLAB Live Editor. All MANDATORY exercises have to be completed and uploaded through the moodle portal before the deadline. Please use the StatisticsToolboxII_Template.mlx provided in moodle.

Probability Distribution Estimation:


In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model parameters.

The method of maximum likelihood corresponds to well-known estimation techniques in statistics. For example, one might be interested in the length of adult forearms, but be unable to measure the length of every single adult due to cost or time constraints. Assuming that the lengths are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE from a set of samples {x1, ..., xn} drawn from the overall population. MLE accomplishes this task by finding the mean and variance, lumped into a parameter vector θ, for which the observations (sample set) obtain the highest likelihood. The likelihood of the data is defined by

P({x1, ..., xn}|θ) = ∏_i p(xi|θ)    (1)

In practice it is easier to investigate the log likelihood.

ln P({x1, ..., xn}|θ) = Σ_i ln p(xi|θ)    (2)

The maximum likelihood estimate is the parameter vector θmax that maximizes the log-
likelihood.

θmax = argmax_θ ln P({x1, ..., xn}|θ)    (3)

In the case of a normal distribution it is easily shown that the parameters µ and σ that maximize the likelihood are simply the mean and standard deviation of the sample set itself.

µmax = (1/N) Σ_{i=1}^N xi    (4)

σmax = √( (1/N) Σ_{i=1}^N (xi − µmax)² )    (5)
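For illustration, a minimal sketch (with synthetic data; the exercises use the forearm dataset instead) comparing the closed-form estimates of equations (4) and (5) with normfit:

rng(0);                                  % reproducible synthetic sample
x = 25 + 2*randn(1000,1);                % assumed true mean 25, std 2

muMax    = mean(x);                      % equation (4)
sigmaMax = sqrt(mean((x - muMax).^2));   % equation (5), 1/N normalization

[muHat, sigmaHat] = normfit(x);          % normfit uses the unbiased 1/(N-1) estimator
fprintf('mu: %.4f vs %.4f, sigma: %.4f vs %.4f\n', muMax, muHat, sigmaMax, sigmaHat);

For large N the two sigma estimates differ only marginally, since the 1/N and 1/(N-1) normalizations converge.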

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the values of the model parameters that produce the distribution giving the observed data the greatest probability (i.e., parameters that maximize the likelihood function). In essence, the method selects model parameters under which events that occur often in the data are predicted to be very likely, and events that occur seldom in the data are predicted to occur with small probability. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems.

MATLAB provides a graphical user interface called dfittool for fitting distributions to data. With this tool it is possible to find the probability distribution that best fits a given data set and to compare the fitted PDF with the histogram representation of the data. In addition, the following functions will be covered in this section:

- normfit: normal parameter estimates. [muhat,sigmahat] = normfit(data) returns an estimate of the mean µ in muhat and an estimate of the standard deviation σ in sigmahat.

- lognfit: lognormal parameter estimates. [parmhat,parmci] = lognfit(data) returns a vector parmhat of maximum likelihood estimates, with parmhat(1) = µ and parmhat(2) = σ, of the parameters of a lognormal distribution fitting data. µ and σ are the mean and standard deviation, respectively, of the associated normal distribution. The 2-by-2 matrix parmci contains 95% confidence intervals for the parameter estimates µ and σ: the first column contains the lower and upper confidence bounds for µ, the second column those for σ.

- expfit: exponential parameter estimates. muhat = expfit(data) estimates the mean of an exponentially distributed sample data. Each entry of muhat corresponds to the data in a column of data.

- wblfit: Weibull parameter estimates. parmhat = wblfit(data) returns the maximum likelihood estimates, parmhat, of the parameters of the Weibull distribution given the values in the vector data, which must be positive.

- gmdistribution: Gaussian mixture distribution class. To create a Gaussian mixture distribution by specifying the distribution parameters, use the gmdistribution constructor. To fit a Gaussian mixture distribution model to data, use fitgmdist.

Probability distribution objects allow you to easily fit, access, and store distribution information for a given data set. The following operations are easier to perform using distribution objects:

• Grouping a single dataset in a number of different ways using group names, and then fitting a distribution to each group.

• Fitting different distributions to the same set of data.

• Sharing fitted distributions across workspaces.



The universal functions pdf and cdf compute the probability density function and the cumulative distribution function for arbitrary distributions; they thus subsume the specific functions such as normpdf, exppdf, ... and normcdf, expcdf, ....
y = pdf(obj,X)

returns a vector y of length n containing the values of the probability density function for the general distribution object obj, evaluated at the n-by-d data matrix X.

In the same manner the function fitdist subsumes the individual variants for fitting the parameters of a distribution to data, such as normfit, expfit, ....
pd = fitdist(x,distname)

creates a probability distribution object by fitting the distribution specified by distname to the data in column vector x.
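A minimal sketch of the distribution-object workflow (synthetic data stands in for the exercise datasets):

x  = 2 + 0.5*randn(200,1);    % assumed sample as a column vector

pd = fitdist(x, 'Normal');    % distribution object, replaces normfit
y  = pdf(pd, x);              % p(xi|mu,sigma) evaluated at every sample
logLik = sum(log(y));         % log-likelihood of the whole sample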

Fitting a Distribution from Data

1) MANDATORY: Find the parameters of a normal probability density function that best fits the dataset from the file forearm.mat with the tool dfittool.

2) MANDATORY: Perform the same calculation with the Matlab function normfit. What are the 95% confidence intervals for the mean and standard deviation?

3) MANDATORY: Perform the same calculation with the Matlab function fitdist to
generate a distribution object.

4) MANDATORY: Calculate the probability of the data p(xi |µ, σ) w.r.t. the estimated
distribution with pdf.

5) MANDATORY: Estimate the parameters of a lognormal distribution with lognfit for the same dataset. Compare the estimated distribution with the histogram of the data, using histogram with probability normalization, and with the previously calculated normal distribution, as shown in figure 1. Set the bin edges of the histogram such that the bin width becomes one, in order to ensure that the normalized histogram density coincides with the probability density function.

6) MANDATORY: Calculate the likelihood of the overall data X = {x1, ..., xn}

P({x1, ..., xn}|µ, σ) = ∏_i p(xi|µ, σ)    (6)

for the two distributions with pdf. Which of the two probability distributions (normal, lognormal) exhibits a higher likelihood on the data?

Figure 1: Estimated normal and lognormal distribution, histogram of forearm data

7) MANDATORY: Calculate the parameters of an exponential distribution and a Weibull distribution that best describe the dataset in the file FitDist2.mat. Using the estimated parameters, calculate both probability distributions and plot them together with a histogram obtained from the dataset, as shown in figure 2. Set the bin edges of the histogram such that the bin width becomes one, in order to ensure that the normalized histogram density coincides with the probability density function. Which probability distribution exhibits a higher likelihood? A sketch of this fit-and-compare workflow is given below figure 2.

Figure 2: Estimated exponential and Weibull distribution with histogram of FitDist2 data
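A hedged sketch of the fit-and-compare workflow used in tasks 5 to 7, assuming a positive-valued column vector data (e.g. loaded from FitDist2.mat):

pdExp = fitdist(data, 'Exponential');
pdWbl = fitdist(data, 'Weibull');

edges = floor(min(data)):1:ceil(max(data));      % bin width 1
histogram(data, edges, 'Normalization', 'probability'); hold on;

xq = linspace(min(data), max(data), 200)';
plot(xq, pdf(pdExp, xq), 'LineWidth', 1.5);
plot(xq, pdf(pdWbl, xq), 'LineWidth', 1.5);
legend('data', 'exponential', 'Weibull');

logLExp = sum(log(pdf(pdExp, data)));   % the larger log-likelihood
logLWbl = sum(log(pdf(pdWbl, data)));   % indicates the better fit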

Regression

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').

Preparation: carDataSet.mat provides data about 406 car models from the years 1970
to 1982 in several categories. We focus on the following data characteristics:

• Acceleration from 0 to 100 km/h in [s]

• Cylinders (number of)

• Displacement the swept volume of all the pistons in [Liters]

• Power in [kW]

• Year of introduction

• Weight in [kg]

• Origin country of production

• Lper100km fuel consumption in [L/100km]



The class table behaves similarly to cell arrays, with the additional feature of row and column names.
Name = {'Hank'; 'Rocky'; 'Cody'};
Weight = [2.5; 50; 25];
Age = [2; 5; 7];
Vaccinated = [false; true; true];
Breed = {'Chihuahua';'Great Dane';'Labrador'};

dogTable = table(Weight,Age,Vaccinated,Breed,'RowNames',Name)

dogTable =

  3x4 table

             Weight    Age    Vaccinated       Breed
             ______    ___    __________    ____________
    Hank       2.5      2       false       'Chihuahua'
    Rocky       50      5       true        'Great Dane'
    Cody        25      7       true        'Labrador'

dogTable(1,1) % addressing the first field of the table

ans =

  table

            Weight
            ______
    Hank     2.5

dogTable{1,1} % addressing the content of the first field of the table

ans =
2.5000

1) MANDATORY: Create a variable carTable of class table using Acceleration, Cylinders, Displacement, Power, Year, Weight, Lper100km as columns and Model as row names. A sketch follows below.
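A sketch, assuming carDataSet.mat provides the listed characteristics as workspace variables (column vectors, and Model as a cell array of names):

load carDataSet.mat

carTable = table(Acceleration, Cylinders, Displacement, Power, ...
                 Year, Weight, Lper100km, 'RowNames', Model);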

The degree to which two variables have a linear statistical relationship with each other can be measured by the correlation coefficient. A strong correlation is an indicator of a dependence between both variables. The correlation coefficient is +1 in the case of a perfect linear relationship with positive slope, −1 in the case of a perfect linear relationship with negative slope, and 0 if there is no linear correlation.

2) MANDATORY: Use the function corrplot to visualize the correlations between all variables. The predictor Origin is excluded, because corrplot can only handle numerical and logical inputs. A sketch follows below.
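A sketch (corrplot ships with the Econometrics Toolbox; the variable selection assumes the carTable columns from task 1):

vars = {'Acceleration','Cylinders','Displacement','Power', ...
        'Year','Weight','Lper100km'};
corrplot(carTable{:, vars}, 'varNames', vars);   % Origin excluded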

The task is to train a regression model that predicts the regressand fuel consumption Lper100km from the regressor vector Acceleration, Cylinders, Displacement, Power, Year, Weight (or a subset of these features). The vector of independent variables is called the predictor; the dependent variable (here Lper100km) is denoted as the response. Based on the correlation plot, which variables are likely to be the most and least useful for the prediction of Lper100km? (Ignoring Origin)

Scatter Plot

A scatter plot is a type of plot that uses Cartesian coordinates to display values for typically two variables for a set of data. A scatter plot reveals correlations between variables. If the data pairs slope from lower left to upper right, this indicates a positive correlation between the variables.
scatter(x,y);

creates a scatter plot with circles at the locations specified by the vectors x and y.
scatter3(x,y,z)

displays circles at the locations specified by the vectors x, y, and z.
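For example, a hypothetical three-variable sketch:

x = randn(100,1);                    % hypothetical regressor 1
y = randn(100,1);                    % hypothetical regressor 2
z = 2*x - y + 0.3*randn(100,1);      % hypothetical response

scatter3(x, y, z, 'filled');
xlabel('x'); ylabel('y'); zlabel('z');
title('3D scatter plot');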

3) MANDATORY: Open a new figure and store its handle:

figRelReg = figure;

Generate a 3D scatter plot scatter3 in which you plot the two (presumably) most relevant regressors on the x- and y-axis, and Lper100km on the z-axis. Label the axes and give an appropriate title. Set the figure to hold on for superposition of subsequent plots.

4) MANDATORY: Open a new figure and a 3D scatter plot scatter3 in which you plot the two (presumably) least relevant regressors on the x- and y-axis, and Lper100km on the z-axis. Label the axes and give an appropriate title.

5) MANDATORY: Origin has so far not been considered as regressor. Open a new
figure and create a boxplot with the countries of origin on the x-axis and Lper100km
on the y-axis.

Regression Learner App

The Regression Learner app regressionLearner trains regression models to predict data. The app supports data exploration, feature selection, dimensionality reduction, validation schemes, model training, and residual analysis. You can perform automated training to search for the best regression model type, including linear regression models, regression trees, Gaussian process regression models, support vector machines, and ensembles of regression trees. Once you have trained a particular model on your data, you can export either the model (with or without data) or the code for training that model together with the data.

The Regression Learner app expects the data matrix of dimension n-by-m to be arranged in n rows of observations (xi, yi). Columns 1 to m−1 correspond to the regressors x and the last column contains the regressand y. The Regression Learner app provides k-fold cross-validation and the holdout method. Even though k-fold cross-validation is preferable, first experiment with the holdout partition for the sake of computational efficiency.

Data Partition into Training and Validation Data

Cross-validation is a model validation technique that analyzes how well a model generalizes to previously unseen data. It is widely employed in the context of prediction problems where a model is estimated from a dataset of known data (training dataset) and tested on unknown data (testing dataset).

The holdout method separates the overall data into two sets, called the training set and
the testing set. The parameters of the model are learned using the training set only. The
trained model is asked to predict the response for the data in the testing set. Typical
partitions are 75% training and 25% testing.

k-fold cross-validation improves the holdout method and better utilizes the available data. The data set is partitioned into k folds, and the holdout method is repeated k times. Each time, one of the k folds serves as the testing set and the union of the remaining k − 1 subsets constitutes the training set. The prediction performance is averaged over these k trials. The advantage of this method is that it makes better use of the data: every data item appears once in the test set, and k − 1 times in the training set. The variance of the performance estimate is reduced by a factor of k over the holdout method. The drawback of k-fold cross-validation is the computational burden of training a model k times. A sketch of both partition schemes follows.
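Outside the app, both partition schemes can be created with cvpartition; a minimal sketch:

n = 406;                                  % number of observations
cHold = cvpartition(n, 'HoldOut', 0.25);  % 75% training / 25% testing
idxTrain = training(cHold);               % logical index of training rows
idxTest  = test(cHold);                   % logical index of testing rows

cKfold = cvpartition(n, 'KFold', 5);      % 5-fold cross-validation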

Linear Regression

Linear regression models the relationship between a scalar dependent variable y and one or more independent variables denoted x. For more than one explanatory variable, the process is called multiple linear regression. The relationships are modeled using linear predictor functions

ŷ = wᵀx = Σ_n wn xn    (1)

Figure 3: Graphical user interface of the Regression Learner App



whose unknown model parameters w = (w1, ..., wn) are estimated from data pairs (xi, yi). Linear regression models are often fitted using the least squares approach, minimizing a cost function over the data

E(w) = Σ_i (ŷ(xi, w) − yi)² = Σ_i (Σ_n wn xin − yi)²    (2)

The yi are called the regressand and the vector xi = (xi1, ..., xin) is called the regressor. Usually a constant is included as one of the regressors, for example xi1 = 1, to account for the mean of y.

Sometimes one of the regressors can be a non-linear function f(x) of the data xi,

ŷ = wᵀf(x) = Σ_n wn fn(x)    (3)

as in polynomial regression, in which xji², xji³, ..., xjiᵏ are additional regressors. The model remains linear as long as it is linear in the parameter vector w.
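In MATLAB, the least-squares estimate minimizing equation (2) is obtained directly with the backslash operator; a sketch with synthetic data:

n = 200;
x = linspace(0, 1, n)';
y = 3*x + 1 + 0.1*randn(n,1);    % synthetic linear data with noise

X = [ones(n,1) x];    % the constant regressor accounts for the mean of y
w = X \ y;            % least-squares estimate, w(1) near 1, w(2) near 3

Xp = [ones(n,1) x x.^2 x.^3];    % polynomial regression: powers of x
wp = Xp \ y;                     % still linear in the parameters wp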

Performance Criteria in Regression

The common criterion in regression is the mean squared error between the prediction ŷi and the observation of the response yi

MSE = (1/N) Σ_{i=1}^N (yi − ŷi)²    (4)

The function mse computes the mean squared error of a vector or matrix. Another common criterion is the root mean squared error

RMSE = √( (1/N) Σ_{i=1}^N (yi − ŷi)² )    (5)

The RMSE represents the sample standard deviation of the differences between predicted
and observed responses. These individual differences are called residuals for the training
data set from which the model is estimated, and are denoted as prediction errors on the
unseen test data.

The normalized mean squared error

NMSE = Σ_{i=1}^N (yi − ŷi)² / Σ_{i=1}^N (yi − ȳ)²    (6)

with observation y, mean response ȳ and prediction ŷ is more objective, as it relates the error to the natural variance of the response around its mean. The R² = 1 − NMSE score is quite common, as it captures the predictive power of the model in terms of the amount of variation of the response explained by the regressor. An R² of one indicates a perfect model; R² equal to zero means that the regressor has no predictive power at all for the response.
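A sketch computing these criteria, assuming column vectors y (observed) and yhat (predicted):

res  = y - yhat;                            % residuals / prediction errors
MSE  = mean(res.^2);
RMSE = sqrt(MSE);
NMSE = sum(res.^2) / sum((y - mean(y)).^2);
R2   = 1 - NMSE;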
Scientific Programming in Matlab
Version 1.0 Scientific Programming: Statistics Toolbox (II) Page 11

Table 1: R²-score car data set

features                linear    pure quadratic    quadratic    Gaussian process (optional)
all regressors
top two regressors
bottom two regressors

6) MANDATORY: Open the regressionLearner and start a new session by importing carTable from the workspace.

7) MANDATORY: Choose holdout validation with a 75% - 25% split into training and test.

8) MANDATORY: Train the following models on the car dataset and observe plots of predicted versus true response.

• Linear regression - linear: The model contains a constant term and linear terms in the predictors

ŷ = w1ᵀx + w0    (7)

• Linear regression - pure quadratic (advanced options): The model contains a constant term, linear terms, and terms that are purely quadratic in each of the predictors

ŷ = w2ᵀx² + w1ᵀx + w0    (8)

• Linear regression - quadratic (advanced options): The model contains a constant term, linear terms, and quadratic terms including interactions between the predictors

ŷ = xᵀW2 x + w1ᵀx + w0    (9)

• Save the session after you have trained the models.

Report the R²-score of the different models in table 1.

9) MANDATORY: Select the two least relevant features (feature selection) and train a linear regression model on these bottom two regressors. Report the R²-score in the table.

10) MANDATORY: Select the two most relevant features (feature selection) and train a linear regression model on these top two regressors. Report the R²-score in the table.

The Regression Learner app supports the export of a trained model to the workspace. It also allows exporting the code that trains the model from data. The trainedModel structure allows predictions on novel unseen data. The structure contains a model object and a function for prediction. The exported model predicts the response for new data:
yfit = trainedModel.predictFcn(T);

11) MANDATORY: Export the model with two predictors under the name trainedModel to the Matlab workspace. To store the model on disk, use save for the trained model and load to import the model again.

12) MANDATORY: Use meshgrid to create a grid of query points for the two predictors used in the model. Use the previous correlation plot to find appropriate minimum and maximum values for the grid. 10 grid steps in each dimension are sufficient.

13) MANDATORY: To make a prediction for the newly created query points, it is necessary to put them into a new table testTable with the same number of columns and the same column names as was used for training, but without the column Lper100km. The columns of the unused predictors are set to 0.

14) MANDATORY: With trainedModel predict Lper100km for the vector of queries
and store the predictions in Lper100kmPRED.
Lper100kmPRED = trainedModel.predictFcn(testTable);

15) MANDATORY: Plot a mesh of the estimations stored in Lper100kmPRED into the corresponding scatter plot figRelReg, created previously. To do this, it may be necessary to reshape Lper100kmPRED to the same size as the grid matrices previously generated with meshgrid.
Lper100kmPRED = reshape(Lper100kmPRED,10,10);

Inspect the original data and the predictions that can be generated based on this data visually. A combined sketch of tasks 12 to 15 is given below.
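A combined sketch of tasks 12 to 15, assuming the exported model uses Year and Weight as its two predictors (your most relevant regressors and grid ranges may differ):

[YearG, WeightG] = meshgrid(linspace(1970, 1982, 10), ...
                            linspace(800, 2300, 10));   % assumed ranges

z = zeros(numel(YearG), 1);    % unused predictors are set to 0
testTable = table(z, z, z, z, YearG(:), WeightG(:), ...
    'VariableNames', {'Acceleration','Cylinders','Displacement', ...
                      'Power','Year','Weight'});

Lper100kmPRED = trainedModel.predictFcn(testTable);
Lper100kmPRED = reshape(Lper100kmPRED, size(YearG));

figure(figRelReg); hold on;    % superimpose on the earlier scatter plot
mesh(YearG, WeightG, Lper100kmPRED);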

Gaussian Process Regression

This section can be skipped as the description refers to an optional assignment.

Gaussian process regression (GPR) models are nonparametric kernel-based probabilistic models [1].

Definition 1 A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian process is completely specified by its mean function

m(x) = E[f(x)]    (10)

and covariance function

k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]    (11)

The Gaussian process is denoted as

f(x) ∼ GP(m(x), k(x, x′))    (12)

It is often implicitly assumed that the Gaussian process has zero mean, m(x) = 0.

The random variables represent the value of the function f (x) for an input x.

The ordinary linear regression model is given by

y = xᵀw + ε    (13)

in which ε ∼ N(0, σ²), and the error variance σ² and the regression coefficients w are estimated from data by maximum likelihood and linear least squares.

The Gaussian process regression model is given by

y = h(x)ᵀw + f(x)    (14)

in which f(x) ∼ GP(0, k(x, x′)), such that the noise originates from a Gaussian process with covariance function k(x, x′).

The basis functions h(x) map the original feature vector x into a transformed feature vector h(x). The vector w is composed of the coefficients of the basis functions. The probability of observing a response yi given the independent variables xi is given by

P(yi|f(xi), xi) ∼ N(yi | h(xi)ᵀw + f(xi), σ²)    (15)

Each observation xi is associated with its own latent variable f(xi), which renders the model nonparametric. The overall likelihood of the vector of responses y is given by

P(y|f, X) ∼ N(y|Hw + f, σ²I)    (16)

in which X = [x1ᵀ; x2ᵀ; ...; xnᵀ], y = [y1; y2; ...; yn], H = [h(x1)ᵀ; h(x2)ᵀ; ...; h(xn)ᵀ], and f = [f(x1); f(x2); ...; f(xn)], stacked row-wise.


The distribution of the latent variables is given by the normal distribution

P(f|X) ∼ N(f|0, K(X, X))    (17)



in which the covariance matrix

K(X, X) = [ k(x1, x1)  k(x1, x2)  ...  k(x1, xn)
            k(x2, x1)  k(x2, x2)  ...  k(x2, xn)
            ...
            k(xn, x1)  k(xn, x2)  ...  k(xn, xn) ]    (18)

depends on the kernel function k(·, ·) and the data X.

For practical use, one assumes a parametric covariance function k(xp, xq | θ), in which θ denotes a vector of hyper-parameters. The squared-exponential covariance function is a common choice of kernel function. In one dimension it has the form

ky(xp, xq) = σf² exp(−(xp − xq)² / (2l²)) + σn² δpq    (19)

with the hyper-parameters length-scale l, signal variance σf², and noise variance σn². The signal variance σf² reflects the prior variance of the process; for σn² = 0 the model would perfectly fit the data at the observed data points xi. The length scale l reflects the smoothness of the underlying model: a short length scale allows sharp variations of the function. The effect of different choices of the hyper-parameters l, σf², σn² on the Gaussian process model is shown in figure 4.

Gaussian process regression estimates the basis function coefficients w in conjunction with the noise variance σ² and the hyper-parameters θ. The parameters w, σ², θ are estimated such that they maximize the log likelihood of y according to equation (16).
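Outside the app, the same model family is available through fitrgp; a minimal sketch with synthetic one-dimensional data:

rng(2);
x = linspace(0, 10, 50)';
y = sin(x) + 0.1*randn(size(x));    % noisy observations of a smooth function

gprMdl = fitrgp(x, y, 'KernelFunction', 'squaredexponential');
[ypred, ~, yci] = predict(gprMdl, x);    % predictions and 95% intervals

plot(x, y, '+', x, ypred, '-', x, yci, '--');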

16) OPTIONAL: Train a Gaussian Process model with the regression learner app.

The toolbox contains a library of functions that mimic the training and validation of regression models with the Regression Learner. The function fitlm creates a linear regression model as a LinearModel object. You can create plots and do further diagnostic analysis by using methods such as plot, plotResiduals, and plotDiagnostics. With predict you can predict the response of a linear regression model.
trainedModelB = fitlm(carTable,'Lper100km~Year+Weight')
Lper100kmPREDB = trainedModelB.predict(testTable);
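The diagnostic methods mentioned above can then be called on the fitted model, for example:

plotResiduals(trainedModelB);     % histogram of the residuals
plotDiagnostics(trainedModelB);   % leverage of the individual observations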


Figure 4: Data is generated from a GP with hyper-parameters (l, σf², σn²) = (1, 1, 0.1), as shown by the + symbols. Using Gaussian process prediction with these hyper-parameters we obtain a 95% confidence region for the underlying function f (shown in grey). Panels (b) and (c) again show the 95% confidence region, but this time for hyper-parameter values (0.3, 1.08, 0.00005) and (3.0, 1.16, 0.89), respectively (source: [1]).
References

[1] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006, http://www.gaussianprocess.org/gpml/chapters/RW.pdf

