
Practical Session - Regularized Regression Models

September 22nd, 2024

Goal of the practical session


• Model selection for linear models in R: Ridge and Lasso regression.

Remarks
• The work has to be carried out by teams of 2 students, and RStudio is used to perform the practical sessions.
• A report should be written only for exercise IV, automatically generated from an R Markdown file in RStudio.
• The R Markdown file and the corresponding pdf file have to be uploaded before the next practical session
on the ENSIIE project web site, in the folder MRR2024TP2.

I. Tests of significance and model selection


a) Analyze and study the following instructions. Specify the underlying theoretical model.
n=100; X=cbind(((1:n)/n)^3,((1:n)/n)^4); Y=X%*%c(1,1)+rnorm(n)/4;
res=summary(lm(Y~X)); print(res); print(res$coef[2,4]);

Compare the results provided by the multiple regression model with the results computed independently using two
simple models. Conclusion.
reg1=lm(Y~X[,1]);print(summary(reg1));
reg2=lm(Y~X[,2]);print(summary(reg2));

b) Execute the previous instructions several times (2 or 3 times) and describe the behaviour of the coefficient
estimators. Compute the empirical correlation matrix with the cor() instruction.
cor(X[,1],X[,2])

II Model selection in a linear regression framework


The following table details several criteria used in model selection. We denote RSS = Σ_i (y_i − ŷ_i)².

Notation   Definition                             Criterion                               Objective      R Instruction
R²         R² = Σ(ŷ_i − ȳ)² / Σ(y_i − ȳ)²         R-squared                               -              lm()
R²adj      R²adj = 1 − (n−1)/(n−p) (1 − R²)       Adjusted R-squared                      Max. R²adj     lm()
σ̂²_p       σ̂²_p = RSS/(n − p)                     Unbiased residual variance estimate     Min. σ̂²_p      lm()
AIC        AIC ≈ n log(RSS/n) + 2p                Akaike Information Criterion (1971)     Min. AIC       extractAIC()
BIC        BIC ≈ n log(RSS/n) + log(n) p          Bayesian Information Criterion (1978)   Min. BIC       extractAIC(,k=log(n))
Cp         Cp = RSS(p)/σ̂² − (n − 2p)              Mallows' Cp (1973)                      Min. Cp        regsubsets()
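As a minimal illustration (assuming a model reg already fitted with lm(), as in exercise I), these quantities can be obtained in R as follows:
res=summary(reg);                                  # reg is any fitted lm() model
print(res$r.squared); print(res$adj.r.squared);    # R2 and adjusted R2
print(res$sigma^2);                                # unbiased residual variance RSS/(n-p)
print(extractAIC(reg));                            # returns (edf, AIC)
print(extractAIC(reg,k=log(nobs(reg))));           # BIC: penalty of log(n) per parameter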

The step() function is used to compare and select parsimonious models (models based on few variables). The
function starts from the global model and removes one variable at each step. The procedure stops when the
coefficient of the variable that would be removed next is significant (threshold α = 0.1).
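For instance, a minimal backward-selection sketch (the data frame mydata and its columns are hypothetical placeholders):
regfull=lm(Y~.,data=mydata);                 # full model with all covariates (hypothetical data)
s0=step(regfull,direction='backward');       # removes one variable at each step
print(formula(s0)); summary(s0);             # selected sparse model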

Applications
Analyze the files “USCrimeinfo.txt” and “UsCrime.txt”. The target variable, Y, is stored in the first column.
• Load the file into the R environment using tab=read.table(). What is the number of available observations?
Provide a scatterplot of all joint distributions. Conclusion.
• Compute the empirical correlation matrix. Conclusion. Use the corrplot() function of the corrplot library to
highlight potential linear relations between variables. (A minimal loading sketch is given after this list.)
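A minimal loading sketch for these two bullets, assuming the file is whitespace-separated with a header line (check the file format):
tab=read.table("UsCrime.txt",header=TRUE);   # adapt header/sep to the actual file
print(dim(tab));                             # number of observations and variables
pairs(tab);                                  # scatterplots of all joint distributions
C=cor(tab); print(round(C,2));               # empirical correlation matrix
library(corrplot); corrplot(C);              # graphical display of the correlations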

A. Multiple regression model.


The goal is now to assess whether a linear model can be used to explain the target variable Y. Specify the model.
a) What can you briefly say about the results provided by a linear model on the USCrime data set, using the function
reg=lm("R~.",data=tab), where Y denotes the target variable and X the explanatory variables (p = 14)?
b) Is the linear model globally relevant? Justify your answer.
c) What can you say about the significance of the coefficients? Justify your answer.
d) Compute the Residual Sum of Squares (RSS) in this case, with p = 14 variables.
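A possible sketch for questions a) to d), reusing the data frame tab loaded above (the target column is assumed to be named R, as in the lm() call of question a)):
reg=lm(R~.,data=tab);                        # R explained by the 14 other columns
summary(reg);                                # global F-test and individual t-tests
RSS=sum(residuals(reg)^2); print(RSS);       # residual sum of squares with p = 14 variables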

B. Model selection.
The goal of this section is to find a sparse model, based on a small subset of variables of size p0, to explain the target
variable Y. Before writing any R instruction, read carefully the help of the R step() function.
a) Backward regression. Study and implement the following instruction
regbackward=step(reg,direction='backward'); summary(regbackward)

Comment on the successively removed variables. What is the final model? How many variables are selected?
b) Forward regression.
regforward=step(lm(R~1,data=tab),list(upper=reg),direction='forward');
summary(regforward);

Comment on the successively added variables. Compute the AIC criterion using the instruction AIC(). What is the final
model? How many variables are selected? Compare this model with the model computed with the backward
regression method.
c) Stepwise regression:
regboth=step(reg,direction='both')
summary(regboth)

Comment on the added and removed variables for the stepwise regression. Compare the selected models obtained with
all the previous selection methods.
d) Remarks. Use the formula(s0) function, where s0 denotes the R output object computed with the step()
function. Note that the instruction reg0=lm(formula(s0),data=tab); lets you reuse the selected model for
further applications, and that summary(reg0) provides detailed information on this model.
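For instance, reusing the objects reg and tab defined above:
s0=step(reg,direction='both');               # any of the selection procedures above
reg0=lm(formula(s0),data=tab);               # refit the selected model
summary(reg0); extractAIC(reg0);             # detailed information and AIC of the selected model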

III RIDGE and LASSO penalized regression.
A. Simulated data. Illustration.
a) Execute the following instructions and comment on the results.
rm(list=ls()); n=10000; p=5;
X=matrix(rnorm(n*(p)),nrow=n,ncol=p); X=scale(X)*sqrt(n/(n-1));
beta=matrix(10*rev(1:p),nrow=p,ncol=1); print(beta)
epsi=rnorm(n,1/n^2); Y=X%*%beta +epsi;
Z=cbind(Y,data.frame(X)); Z=data.frame(Z);

b) Considering a linear model, provide an estimation of the coefficients using the X and Y data with the help of the
lm() function. Conclusion.
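A minimal sketch for question b), reusing the X and Y objects simulated in a):
reglm=lm(Y~X);                               # ordinary least squares on the simulated data
summary(reglm);                              # compare the estimates with beta = (50,40,30,20,10)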
Execute t(X)%*%Y/n and comment on the result. The lars() function of the R lars library can be used to implement
a LASSO regression, as can the glmnet() function of the R glmnet library. Load the library into your R environment
and read carefully the help of the function.
In this section, the goal is now to implement and study a linear model with an ℓ1 penalization on the coefficients,
using the X and Y data.
c) Execute:
library(lars);
modlasso=lars(X,Y,type="lasso"); attributes(modlasso);

What do the fields modlasso$meanx and modlasso$normx store? What are they for?
d) Comment the following graphs:
par(mfrow=c(1,2));
plot(modlasso); plot(c(modlasso$lambda,0),pch=16,type="b",col="blue"); grid()

e) Execute the following instructions and comment on the results. Why is it possible to guess, before any computation,
the results computed with the LASSO in this situation? Justify carefully.
print(coef(modlasso));
coef=predict.lars(modlasso,X,type="coefficients",mode="lambda",s=2500);
coeflasso=coef$coefficients;
par(mfrow=c(1,1)); barplot(coeflasso,main='lasso, lambda=2500',col='cyan');

B. Applications
The data studied in this section are indicators of development used in economy, demography and sociology in the
United States, observed over a period of 15 years. Our goal in this application is to identify the indicators which best
explain the CO2 emissions observed in the atmosphere. For this purpose, the RIDGE and the LASSO regression
are both used and studied.
a) Describe the content of the files “usa_indicators_info.txt” and “usa_indicators.txt”.
b) Load the data into the R environment using tab=read.table(). What are the number of observations and
the number of variables? Can you use a multiple linear model in this situation? Justify your answer.
c) What is the variable used for the CO2 emissions? Plot the temporal evolution of this indicator on a graph.
d) As the data correspond to various indicators, the units may also be very different. Explain why this can be a
difficulty both for regular linear models and for penalized linear models. Scale the variables of the data set using
the function scale(tab, center=FALSE).
e) Use the lm() function to estimate the parameters of a linear model. Conclusion.
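A minimal loading and scaling sketch for questions b) to d) (the separator and header options are assumptions to be checked against the file; the CO2 column asked for in question c) is deliberately not named here):
tab=read.table("usa_indicators.txt",header=TRUE,sep=';');   # adapt sep/header to the actual file
print(dim(tab));                                            # compare n (15 years) with the number of variables
tabs=data.frame(scale(tab,center=FALSE));                   # put all indicators on comparable scales
# the CO2 column identified in c) can then be plotted against the Year variable
# and used as the target of the lm() call of question e)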

RIDGE. Regression with ℓ2 penalization.
a) Recall the definition of the Ridge regression.
The function lm.ridge() of the MASS library is used to compute a Ridge regression. Load the MASS library into
your R environment, and read carefully the help of the function lm.ridge().
b) Compute a Ridge regression for values of the penalization parameter equal to λ = 0 and λ = 100, without using
the Year variable in your model. Print the computed coefficients using the instruction coef(). Plot the five
largest coefficients. What do they represent? What are the differences between the
coef(resridge) and resridge$coef instructions?
c) Compute Ridge regression models for different values of λ, from 0 to 10 with an increment of 0.01
(λ = seq(0, 10, 0.01)). Plot the performances computed by cross-validation against the values of λ (field $GCV of
the ridge R object, GCV for Generalized Cross-Validation). Plot the evolution of the values of the coefficients
as a function of λ using the instruction plot(resridge). Conclusion. Which model would you advise? Print the
corresponding value of the regularization parameter λ. Print and store automatically the parameters of the
best model with the help of the functions which.min() and coef() #coefridge=...
d) Compute the mean quadratic error between the observed target and the estimated target Ŷridge using matrix
computation, where X denotes the input matrix: Yridge=as.matrix(X)%*%as.vector(coefridge).
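A possible sketch for questions c) and d); the target name YCO2 is a hypothetical placeholder for the CO2 column of the scaled data frame tabs, and the Year column is excluded as in question b):
library(MASS);
lambdaseq=seq(0,10,0.01);
resridge=lm.ridge(YCO2~.-Year,data=tabs,lambda=lambdaseq);       # Ridge path over the grid of lambda values
plot(lambdaseq,resridge$GCV,type='l',xlab='lambda',ylab='GCV');  # cross-validation performance
kbest=which.min(resridge$GCV); print(lambdaseq[kbest]);          # selected regularization parameter
coefridge=coef(resridge)[kbest,];                                # coefficients of the best model (intercept first)
Yridge=coefridge[1]+as.matrix(tabs[,names(coefridge)[-1]])%*%coefridge[-1];  # fitted values
print(mean((tabs$YCO2-Yridge)^2));                               # mean quadratic error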

LASSO. Regression with ℓ1 penalization


a) Compute a LASSO regression using the instruction reslasso=lars(X,Y,type="lasso"), where X denotes the
input matrix and Y the target vector.
Execute both of the following instructions: plot(reslasso) and plot(reslasso$lambda). Comment.
b) Plot the values of the model coefficients for λ = 0 with the help of the instruction:
coef=predict.lars(reslasso,X,type="coefficients",mode="lambda",s=0). Conclusion.
c) Plot the values of the coefficients for λ = 100. Conclusion.
Compare these results with the results already obtained with the Ridge regression. Conclusion.
d) Compute the mean quadratic error between the observed target and the estimated target (Ŷlasso):
pY=predict.lars(reslasso,X,type="fit",mode="lambda",s=0.06).
e) How can you choose λ?
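A possible sketch for this part; as above, YCO2 is a hypothetical placeholder for the CO2 column and Year is excluded from the input matrix:
library(lars);
Xmat=as.matrix(tabs[,setdiff(names(tabs),c("YCO2","Year"))]);    # scaled covariates
Yvec=tabs$YCO2;
reslasso=lars(Xmat,Yvec,type="lasso");
par(mfrow=c(1,2)); plot(reslasso); plot(reslasso$lambda,type='b');   # regularization path and lambda values
coef0=predict.lars(reslasso,Xmat,type="coefficients",mode="lambda",s=0)$coefficients;     # lambda = 0
coef100=predict.lars(reslasso,Xmat,type="coefficients",mode="lambda",s=100)$coefficients; # lambda = 100
pY=predict.lars(reslasso,Xmat,type="fit",mode="lambda",s=0.06)$fit;
print(mean((Yvec-pY)^2));                                        # mean quadratic error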

IV. Wind Turbine Modelling
As a data scientist, you are now asked to study the ProjWindTurbine.txt dataset. The aim of this study is to
propose a sparse linear model able to explain the power produced by some wind turbines (the target variable,
Y) given other variables such as (1) the free stream velocity of some components (m/s) (FSV 1-2-3-4), (2) the
rotational speed of some components (RPM 1-2-3-4), (3) the current intensity of some components (mA) (CIN
1-2-3-4), and (4) the power (mW).

A Preliminary
Study the following empirical joint distributions between the variables: (POW, RPM1), (POW, CIN1), (RPM1,
CIN1). What can you observe?
Based on your previous observations, split the initial n = 3000 observations into 3 equal parts called D1, D2, D3 by
choosing smart frontiers using only the 2 covariables RPM1 and CIN1. Each subset contains 1000 observations.
Propose a linear regression model to explain the power given the explanatory variables for each data set D1, D2, D3.
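A minimal sketch for this preliminary study; the split thresholds t1 and t2 are hypothetical placeholders to be read off the scatterplot:
turbinedata=read.table(file="ProjWindTurbine.txt",header=TRUE,sep=',');
pairs(turbinedata[,c("POW","RPM1","CIN1")]);     # empirical joint distributions
# choose two frontiers from the plot, for instance thresholds t1 < t2 on RPM1 (hypothetical):
# D1=subset(turbinedata,RPM1<t1); D2=subset(turbinedata,RPM1>=t1 & RPM1<t2); D3=subset(turbinedata,RPM1>=t2);
# reg1=lm(POW~.,data=D1); summary(reg1);         # one regression model per subset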

B Model selection
Study the possibility of providing a sparse model using forward, backward or stepwise regression, or Ridge and LASSO
regression, for each dataset (D1, D2 or D3).
Conclusion.

C Regression model with a categorical explanatory variable


Run and study the following instructions. Conclusion.
rm(list=ls());
turbinedata=read.table(file="ProjWindTurbine.txt",header=TRUE,sep=',')
numturbine=as.factor(c(rep(1,1000),rep(2,1000),rep(3,1000)));
mydata=cbind(turbinedata,numturbine)
modlm=lm("POW~.",data=mydata);
summary(modlm)
plot(modlm$fit,mydata$POW,type="p"); abline(a=0,b=1,col="red");
