Ridge Regression
Let us first implement it on the problem above and check whether it performs better than our linear regression model.
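As a minimal sketch in scikit-learn, this might look like the following, assuming the preprocessed Big Mart predictors and target have already been split into x_train, x_cv, y_train and y_cv (hypothetical names, not defined in this excerpt):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.05)           # alpha controls the regularization strength
ridge.fit(x_train, y_train)         # learn the coefficients on the training set
pred = ridge.predict(x_cv)          # predictions on the validation set
print(ridge.score(x_cv, y_cv))      # R-squared on the validation set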
We can see a slight improvement in our model, because the R-squared value has increased. Note that alpha is a hyperparameter of Ridge, which means it is not learned automatically by the model; it has to be set manually.
Here we have used alpha = 0.05. Now let us try different values of alpha and plot the coefficients for each case.
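A sketch of that sweep, using matplotlib and the same assumed x_train and y_train as above (the grid of alpha values is only illustrative):

import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = [0.01, 0.05, 0.5, 1, 5, 10]                    # illustrative grid
for a in alphas:
    coefs = Ridge(alpha=a).fit(x_train, y_train).coef_  # refit for each alpha
    plt.plot(coefs, marker='o', label='alpha = ' + str(a))
plt.xlabel('coefficient index')
plt.ylabel('coefficient value')
plt.legend()
plt.show()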
You can see that as we increase the value of alpha, the magnitude of the coefficients decreases: the values approach zero but never become exactly zero.
But if you calculate R-squared for each alpha, you will see that its value is maximum at alpha = 0.05. So we have to choose alpha wisely, by iterating through a range of values and using the one that gives us the lowest error.
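One way to do that iteration, as a sketch, is to let scikit-learn's RidgeCV pick the best value from a grid by cross-validation (the grid and the x_train / x_cv names are again only illustrative assumptions):

from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.05, 0.1, 1.0, 10.0], cv=5)
ridge_cv.fit(x_train, y_train)
print(ridge_cv.alpha_)               # alpha with the best cross-validated score
print(ridge_cv.score(x_cv, y_cv))    # R-squared of the refitted model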
So now you have an idea of how to implement it, but let us also take a look at the mathematics behind it. Until now, our goal was simply to minimize the cost function so that the predicted values come as close as possible to the desired result.
Now take another look at the cost function for ridge regression, written out below.
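In standard notation, with θ for the coefficients, m training examples and n features, the ridge cost function can be written as:

J(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - \theta^{T} x^{(i)} \right)^{2} + \lambda \sum_{j=1}^{n} \theta_{j}^{2}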
If you notice, we come across an extra term here, known as the penalty term. The λ given here is what the alpha parameter in the Ridge function denotes. So by changing the value of alpha, we are basically controlling the penalty term: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitudes of the coefficients are reduced.
Important Points:
- Ridge shrinks the coefficients: their magnitudes decrease as alpha (λ) increases, but they never become exactly zero.
- The penalty term is the sum of the squares of the coefficients (an L2 penalty), scaled by λ.
- Because no coefficient is driven exactly to zero, ridge does not perform feature selection.
Now let us consider another type of regression technique which also makes use of
regularization.
LASSO (Least Absolute Shrinkage and Selection Operator) is quite similar to ridge, but let's understand the difference between them by implementing it on our Big Mart problem.
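A sketch with scikit-learn's Lasso, on the same assumed split as before (the alpha value is only illustrative):

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(x_train, y_train)
print(lasso.score(x_cv, y_cv))                   # R-squared on the validation set
print(np.sum(lasso.coef_ == 0), 'coefficients are exactly zero')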
We can see that even at small values of alpha, the magnitudes of the coefficients have reduced a lot. By looking at the plots, can you figure out a difference between ridge and lasso? With ridge, as we increased the value of alpha, the coefficients were approaching zero; with lasso, even at smaller alphas, some of the coefficients are reduced to exactly zero. Therefore lasso selects only some of the features and reduces the coefficients of the others to zero. This property is known as feature selection, and it is absent in the case of ridge.
The mathematics behind lasso regression is quite similar to that of ridge; the only difference is that instead of adding the squares of θ, we add the absolute values of θ.
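In the same notation as before, the lasso cost function can be written as:

J(\theta) = \sum_{i=1}^{m} \left( y^{(i)} - \theta^{T} x^{(i)} \right)^{2} + \lambda \sum_{j=1}^{n} \lvert \theta_{j} \rvert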
Here too, λ is the hyperparameter, and its value corresponds to the alpha parameter in the Lasso function.
Important Points:
- The penalty term is the sum of the absolute values of the coefficients (an L1 penalty), scaled by λ (alpha).
- Even at small values of alpha, lasso can drive some coefficients to exactly zero, so it performs feature selection, which ridge does not.
Now that you have a basic understanding of ridge and lasso regression, let's think of an example where we have a large dataset, say with 10,000 features, and we know that some of the independent features are correlated with other independent features. Which regression would you use, Ridge or Lasso?
Let's discuss them one by one. If we apply ridge regression, it will retain all of the features but shrink their coefficients. The problem is that the model will still remain complex, since all 10,000 features are kept, and this may lead to poor model performance.
What if, instead of ridge, we apply lasso regression to this problem? The main problem with lasso is that when we have correlated variables, it retains only one of them and sets the coefficients of the other correlated variables to zero. That may lead to some loss of information, resulting in lower accuracy for our model.
So what is the solution to this problem? We have another type of regression, known as elastic net regression, which is basically a hybrid of ridge and lasso regression. Let's try to understand it.
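As a sketch, scikit-learn's ElasticNet mixes the two penalties: l1_ratio = 0 corresponds to pure ridge and l1_ratio = 1 to pure lasso (the values below are only illustrative, on the same assumed split as before):

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.01, l1_ratio=0.5)   # blend of L1 and L2 penalties
enet.fit(x_train, y_train)
print(enet.score(x_cv, y_cv))                 # R-squared on the validation set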
install.packages("glmnet")
library(glmnet)
train<-train[c(-1)]
Y<-train[c(11)]
set.seed(567)
part <- sample(2, nrow(X), replace = TRUE, prob = c(0.7, 0.3))
X_train<- X[part == 1,]
X_cv<- X[part == 2,]
#ridge regression
m<-mean((Y_cv - ridge.pred)^2)
m
out = glmnet(X[X_train,],Y[X_train],alpha = 0)
ridge.coef<-predict(ridge_reg, type = "coefficients", s = bestlam)[1:40,]
ridge.coef
train<-train[c(-1)]
Y<-train[c(11)]
set.seed(567)
part <- sample(2, nrow(X), replace = TRUE, prob = c(0.7, 0.3))
X_train<- X[part == 1,]
X_cv<- X[part == 2,]
#lasso regression
lasso_reg <- glmnet(X[X_train,], Y[X_train], alpha = 1, lambda = lambda)
lasso.pred <- predict(lasso_reg, s = bestlam, newx = X[X_cv,])
m<-mean((lasso.pred-Y_cv)^2)
m