
11. Ridge Regression
Let us first implement it on the problem above and check whether it performs better than our linear regression model.

from sklearn.linear_model import Ridge
import numpy as np

## training the model
ridgeReg = Ridge(alpha=0.05, normalize=True)
ridgeReg.fit(x_train, y_train)
pred_cv = ridgeReg.predict(x_cv)

## calculating mse
mse = np.mean((pred_cv - y_cv)**2)
mse  ## 1348171.96

## calculating score
ridgeReg.score(x_cv, y_cv)  ## 0.5691

So, we can see that there is a slight improvement in our model, because the value of the R-squared has increased. Note that alpha is a hyperparameter of Ridge, which means it is not learned automatically by the model; instead, it has to be set manually.

Here we have considered alpha = 0.05. But let us consider different values of alpha and plot the coefficients for each case.

You can see that, as we increase the value of alpha, the magnitude of the coefficients decreases: the values approach zero, but never become exactly zero.

But if you calculate the R-squared for each alpha, you will see that it is maximum at alpha = 0.05. So we have to choose alpha wisely, by iterating through a range of values and using the one that gives us the lowest error, as in the sketch below.
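
A minimal sketch of that search, assuming the same x_train, y_train, x_cv and y_cv splits used above (the alpha grid here is only an illustrative choice, not a recommendation):

from sklearn.linear_model import Ridge
import numpy as np

## fit Ridge for a range of alpha values and compare the fits
alphas = [1e-4, 1e-3, 1e-2, 0.05, 0.5, 1, 5, 10]
for a in alphas:
    model = Ridge(alpha=a, normalize=True)
    model.fit(x_train, y_train)
    ## larger alpha -> stronger penalty -> smaller coefficients
    print(a, model.score(x_cv, y_cv), np.max(np.abs(model.coef_)))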

So now you have an idea of how to implement it, but let us also take a look at the mathematics. Until now, our goal was simply to minimize the cost function, so that the predicted values are as close as possible to the desired result.

Now take a look back again at the cost function for ridge regression.
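
Written out in the usual notation, with θ for the coefficients and λ for the penalty weight, the ridge cost function is

$$\min_{\theta}\; \sum_{i=1}^{m}\Big(y_i - \theta_0 - \sum_{j=1}^{n}\theta_j x_{ij}\Big)^2 \;+\; \lambda \sum_{j=1}^{n}\theta_j^{2}$$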

Here, if you notice, we come across an extra term, which is known as the penalty term. The λ shown here is what the alpha parameter in the Ridge function denotes. So by changing the value of alpha, we are basically controlling the penalty term: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced.

Important Points:

 It shrinks the parameters, and is therefore mostly used to prevent multicollinearity.
 It reduces model complexity by coefficient shrinkage.
 It uses the L2 regularization technique (which will be discussed later in this article).

Now let us consider another type of regression technique which also makes use of
regularization.

12. Lasso Regression

LASSO (Least Absolute Shrinkage and Selection Operator) is quite similar to ridge, but let's understand the difference between them by implementing it on our Big Mart problem.

from sklearn.linear_model import Lasso

## training the model
lassoReg = Lasso(alpha=0.3, normalize=True)
lassoReg.fit(x_train, y_train)
pred_cv = lassoReg.predict(x_cv)

## calculating mse
mse = np.mean((pred_cv - y_cv)**2)
mse  ## 1346205.82

## calculating score
lassoReg.score(x_cv, y_cv)  ## 0.5720
As we can see, the MSE has decreased and the value of R-squared for our model has increased. Therefore, the lasso model is predicting better than both the linear and the ridge models. Again, let's change the value of alpha and see how it affects the coefficients.

So, we can see that even at small values of alpha, the magnitude of the coefficients has reduced a lot. By looking at the plots, can you spot a difference between ridge and lasso? With ridge, as we increased the value of alpha, the coefficients approached zero; with lasso, even at smaller alphas, some coefficients are reduced to exactly zero. Therefore, lasso selects only some of the features and reduces the coefficients of the others to zero. This property is known as feature selection, and it is absent in ridge.
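
A quick way to see this numerically (a sketch assuming the lassoReg and ridgeReg models fitted above) is to count how many coefficients each model drives exactly to zero:

import numpy as np

## lasso zeroes out some coefficients entirely; ridge only shrinks them
print("lasso coefficients at zero:", np.sum(lassoReg.coef_ == 0))
print("ridge coefficients at zero:", np.sum(ridgeReg.coef_ == 0))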
The mathematics behind lasso regression is quite similar to that of ridge; the only difference is that instead of adding the squares of θ, we add the absolute values of θ.
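
In the same notation as the ridge cost above, the lasso cost function is

$$\min_{\theta}\; \sum_{i=1}^{m}\Big(y_i - \theta_0 - \sum_{j=1}^{n}\theta_j x_{ij}\Big)^2 \;+\; \lambda \sum_{j=1}^{n}\lvert\theta_j\rvert$$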

Here too, λ is the hyperparameter, whose value is equal to the alpha in the Lasso function.

Important Points:

 It uses the L1 regularization technique (which will be discussed later in this article).
 It is generally used when we have a large number of features, because it automatically performs feature selection.

Now that you have a basic understanding of ridge and lasso regression, let's think of an example where we have a large dataset; let's say it has 10,000 features. And we know that some of the independent features are correlated with other independent features. Which regression would you then use, Ridge or Lasso?

Let’s discuss it one by one. If we apply ridge regression to it, it will retain all of the
features but will shrink the coefficients. But the problem is that model will still
remain complex as there are 10,000 features, thus may lead to poor model
performance.

Instead of ridge, what if we apply lasso regression to this problem? The main problem with lasso regression is that when we have correlated variables, it retains only one of them and sets the other correlated variables to zero. That will possibly lead to some loss of information, resulting in lower accuracy for our model.

Then what is the solution to this problem? We have another type of regression, known as elastic net regression, which is basically a hybrid of ridge and lasso regression. So let's try to understand it.
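
As a first taste, here is a minimal scikit-learn sketch of elastic net on the same splits as above (the alpha and l1_ratio values are arbitrary, not tuned):

from sklearn.linear_model import ElasticNet

## l1_ratio blends the two penalties: 1.0 is pure lasso, 0.0 is pure ridge
enReg = ElasticNet(alpha=0.05, l1_ratio=0.5)
enReg.fit(x_train, y_train)
pred_cv = enReg.predict(x_cv)
print(enReg.score(x_cv, y_cv))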

Step 4: Implementation of Ridge regression

install.packages("glmnet")
library(glmnet)

train$Item_Weight[is.na(train$Item_Weight)] <- mean(train$Item_Weight, na.rm = TRUE)
train$Outlet_Size[is.na(train$Outlet_Size)] <- "Small"
train$Item_Visibility[train$Item_Visibility == 0] <- mean(train$Item_Visibility)
train$Outlet_Establishment_Year=2013 - train$Outlet_Establishment_Year

train<-train[c(-1)]
Y<-train[c(11)]

X <- model.matrix(Item_Outlet_Sales~., train)

lambda <- 10^seq(10, -2, length = 100)

set.seed(567)
part <- sample(2, nrow(X), replace = TRUE, prob = c(0.7, 0.3))
X_train<- X[part == 1,]
X_cv<- X[part == 2,]

Y_train <- Y[part == 1,]
Y_cv <- Y[part == 2,]

#ridge regression

ridge_reg <- glmnet(X_train, Y_train, alpha = 0, lambda = lambda)

summary(ridge_reg)

#find the best lambda via cross validation
ridge_reg1 <- cv.glmnet(X_train, Y_train, alpha = 0)

bestlam <- ridge_reg1$lambda.min


ridge.pred <- predict(ridge_reg, s = bestlam, newx = X_cv)

m<-mean((Y_cv - ridge.pred)^2)
m

out <- glmnet(X_train, Y_train, alpha = 0)
ridge.coef <- predict(out, type = "coefficients", s = bestlam)[1:40,]
ridge.coef

Step 5: Implementation of lasso regression


install.packages("glmnet")
library(glmnet)

train$Item_Weight[is.na(train$Item_Weight)] <- mean(train$Item_Weight, na.rm = TRUE)
train$Outlet_Size[is.na(train$Outlet_Size)] <- "Small"
train$Item_Visibility[train$Item_Visibility == 0] <- mean(train$Item_Visibility)
train$Outlet_Establishment_Year=2013 - train$Outlet_Establishment_Year

train<-train[c(-1)]
Y<-train[c(11)]

X <- model.matrix(Item_Outlet_Sales~., train)

lambda <- 10^seq(10, -2, length = 100)

set.seed(567)
part <- sample(2, nrow(X), replace = TRUE, prob = c(0.7, 0.3))
X_train<- X[part == 1,]
X_cv<- X[part == 2,]

Y_train <- Y[part == 1,]
Y_cv <- Y[part == 2,]

#lasso regression
lasso_reg <- glmnet(X_train, Y_train, alpha = 1, lambda = lambda)
#find the best lambda via cross validation
lasso_reg1 <- cv.glmnet(X_train, Y_train, alpha = 1)
bestlam <- lasso_reg1$lambda.min
lasso.pred <- predict(lasso_reg, s = bestlam, newx = X_cv)
m<-mean((lasso.pred-Y_cv)^2)
m

lasso.coef <- predict(lasso_reg, type = 'coefficients', s = bestlam)[1:40,]


lasso.coef
