
Mid Term

Abdul Rehman

2023-11-22

Abstract:
This dataset captures essential details related to real estate transactions, aiming to uncover
patterns and relationships that contribute to variations in house prices per unit area. The
information encompasses diverse factors, including transaction date, house age, proximity
to MRT stations, the presence of convenience stores, and geographical coordinates. The
dataset provides a rich source for exploring the dynamics of the real estate market,
enabling analysis and prediction of housing prices based on various influential features.
This abstract outlines the dataset’s key variables and sets the stage for a comprehensive
exploration of the factors affecting housing prices in the given context.

Introduction:
The dataset under consideration contains information related to real estate transactions,
specifically focusing on factors that may influence the price of residential units. The dataset
includes several key variables, such as transaction date, house age, distance to the nearest
MRT (Mass Rapid Transit) station, the number of convenience stores nearby, latitude,
longitude, and the house price per unit area. The data spans a range of transaction dates,
house ages, and distances to MRT stations, providing a comprehensive view of the real
estate market.

R code
The R code below performs the following tasks for the different regression models:
** Linear Regression:
Fits a linear regression model (model_lm) using all predictors in the dataset, makes predictions (predictions_lm), and creates a scatter plot of actual vs. predicted values.
** Ridge Regression:
Fits a ridge regression model (model_ridge) using the glmnet package with the specified predictors and regularization parameter (lambda = 0.01), makes predictions (predictions_ridge), and plots actual vs. predicted values.
** Lasso Regression:
Fits a lasso regression model (model_lasso) using the glmnet package with the specified predictors and regularization parameter (lambda = 0.01), makes predictions (predictions_lasso), and plots actual vs. predicted values.
** Model Comparison:
Splits the data into training and testing sets; fits a full linear regression model, a stepwise regression model for feature selection, and a partial least squares (PLS) regression model; evaluates each model on the test set using root mean squared error (RMSE); and selects the best-performing model by RMSE.
** Visualization:
Creates a bar plot and a scatter plot to explore the relationship between transaction date and house price per unit area.
** Residual Plots:
Generates residual plots for the ridge and lasso regression models to assess model fit.
In summary, the code covers linear, ridge, and lasso regression, along with model comparison and visualization techniques to analyze and interpret the relationships within the dataset.

Load necessary libraries


library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Loading required package: lattice


##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet)

## Loading required package: Matrix


##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8

library(pls)

##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings

library(MASS)

##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select

library(dplyr)
library(ggplot2)

** Load the dataset


data <- read.csv("C:/Users/lenovo/Documents/Real estate.csv")

** summary of the data


summary(data)

## No X1.transaction.date X2.house.age
## Min. : 1.0 Min. :2013 Min. : 0.000
## 1st Qu.:104.2 1st Qu.:2013 1st Qu.: 9.025
## Median :207.5 Median :2013 Median :16.100
## Mean :207.5 Mean :2013 Mean :17.713
## 3rd Qu.:310.8 3rd Qu.:2013 3rd Qu.:28.150
## Max. :414.0 Max. :2014 Max. :43.800
## X3.distance.to.the.nearest.MRT.station X4.number.of.convenience.stores
## Min. : 23.38 Min. : 0.000
## 1st Qu.: 289.32 1st Qu.: 1.000
## Median : 492.23 Median : 4.000
## Mean :1083.89 Mean : 4.094
## 3rd Qu.:1454.28 3rd Qu.: 6.000
## Max. :6488.02 Max. :10.000
## X5.latitude X6.longitude Y.house.price.of.unit.area
## Min. :24.93 Min. :121.5 Min. : 7.60
## 1st Qu.:24.96 1st Qu.:121.5 1st Qu.: 27.70
## Median :24.97 Median :121.5 Median : 38.45
## Mean :24.97 Mean :121.5 Mean : 37.98
## 3rd Qu.:24.98 3rd Qu.:121.5 3rd Qu.: 46.60
## Max. :25.01 Max. :121.6 Max. :117.50

** Remove NA values
data <- na.omit(data)

** Split the data into training and testing sets


set.seed(123)

** for reproducibility
train_index <- createDataPartition(data$Y.house.price.of.unit.area, p = 0.8,
list = FALSE)

train_data <- data[train_index, ]


test_data <- data[-train_index, ]

Fit multiple regression models and evaluate

Use stepwise regression for feature selection


model_full <- lm(Y.house.price.of.unit.area ~ ., data = train_data)
model_step <- stepAIC(model_full, direction = "both")

## Start: AIC=1475.84
## Y.house.price.of.unit.area ~ No + X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude + X6.longitude
##
## Df Sum of Sq RSS AIC
## - No 1 27.82 26991 1474.2
## - X6.longitude 1 32.12 26995 1474.2
## <none> 26963 1475.8
## - X1.transaction.date 1 827.15 27790 1483.9
## - X5.latitude 1 2010.70 28974 1497.7
## - X4.number.of.convenience.stores 1 2335.33 29299 1501.4
## - X3.distance.to.the.nearest.MRT.station 1 2939.38 29903 1508.2
## - X2.house.age 1 2994.53 29958 1508.8
##
## Step: AIC=1474.18
## Y.house.price.of.unit.area ~ X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude + X6.longitude
##
## Df Sum of Sq RSS AIC
## - X6.longitude 1 29.46 27021 1472.5
## <none> 26991 1474.2
## + No 1 27.82 26963 1475.8
## - X1.transaction.date 1 849.19 27840 1482.5
## - X5.latitude 1 2016.96 29008 1496.1
## - X4.number.of.convenience.stores 1 2339.06 29330 1499.8
## - X3.distance.to.the.nearest.MRT.station 1 2925.18 29916 1506.3
## - X2.house.age 1 2967.19 29958 1506.8
##
## Step: AIC=1472.54
## Y.house.price.of.unit.area ~ X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude
##
## Df Sum of Sq RSS AIC
## <none> 27020 1472.5
## + X6.longitude 1 29.5 26991 1474.2
## + No 1 25.2 26995 1474.2
## - X1.transaction.date 1 837.1 27858 1480.7
## - X5.latitude 1 2098.3 29119 1495.4
## - X4.number.of.convenience.stores 1 2375.3 29396 1498.5
## - X2.house.age 1 2947.1 29968 1504.9
## - X3.distance.to.the.nearest.MRT.station 1 5291.1 32312 1529.9

** LASSO regression (L1 regularization)


x_train_lasso <- subset(train_data, select = -Y.house.price.of.unit.area)
y_train_lasso <- train_data$Y.house.price.of.unit.area
model_lasso <- glmnet(as.matrix(x_train_lasso), y_train_lasso, alpha = 1)
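Note that this call fits the lasso over a whole path of lambda values rather than at a single penalty. A common alternative, sketched here under the assumption that x_train_lasso and y_train_lasso are defined as above, is to let cross-validation choose the penalty (cv_lasso, best_lambda, and model_lasso_cv are illustrative names, not part of the original analysis):

```r
# Sketch: choose the lasso penalty by 10-fold cross-validation instead of
# fixing lambda by hand (hypothetical variant of the code above).
library(glmnet)
cv_lasso <- cv.glmnet(as.matrix(x_train_lasso), y_train_lasso, alpha = 1)
best_lambda <- cv_lasso$lambda.min   # lambda with the lowest CV error
model_lasso_cv <- glmnet(as.matrix(x_train_lasso), y_train_lasso,
                         alpha = 1, lambda = best_lambda)
```

The same cv.glmnet call with alpha = 0 would tune the ridge penalty.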

** Ridge regression (L2 regularization)


x_train_ridge <- dplyr::select(train_data, -Y.house.price.of.unit.area)
y_train_ridge <- train_data$Y.house.price.of.unit.area
model_ridge <- glmnet(as.matrix(x_train_ridge), y_train_ridge, alpha = 0)

** Partial Least Squares (PLS)


model_pls <- plsr(Y.house.price.of.unit.area ~ ., data = train_data, scale = TRUE)
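As written, plsr fits every possible number of components without any validation. A hedged variant, assuming the same train_data, uses the package's built-in cross-validation so the number of components can be read off the RMSEP curve (model_pls_cv is an illustrative name):

```r
# Sketch: fit PLS with 10-fold cross-validation so the number of components
# can be chosen from the cross-validated RMSEP (not in the original code).
library(pls)
model_pls_cv <- plsr(Y.house.price.of.unit.area ~ ., data = train_data,
                     scale = TRUE, validation = "CV")
validationplot(model_pls_cv, val.type = "RMSEP")  # error vs. n. of components
```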

** Evaluate models on the test set


x_test <- subset(test_data, select = -Y.house.price.of.unit.area)

** Full Model
predictions_full <- predict(model_full, newdata = x_test)

** Stepwise Model
predictions_step <- predict(model_step, newdata = x_test)

** LASSO Model
# Fix a single penalty with s; without it, predict() returns one column per
# lambda in the path and the RMSE below would average over all of them.
predictions_lasso <- predict(model_lasso, newx = as.matrix(x_test), s = 0.01)

** Ridge Model
# As with the lasso, fix a single penalty so the prediction is a vector.
predictions_ridge <- predict(model_ridge, newx = as.matrix(x_test), s = 0.01)

** PLS Model
# Restrict to the fitted number of components so predict() returns a single
# set of predictions rather than one set per component count.
predictions_pls <- predict(model_pls, newdata = x_test, ncomp = model_pls$ncomp)

** Calculate RMSE for each model


rmse_full <- sqrt(mean((test_data$Y.house.price.of.unit.area -
predictions_full)^2))
rmse_step <- sqrt(mean((test_data$Y.house.price.of.unit.area -
predictions_step)^2))
rmse_lasso <- sqrt(mean((test_data$Y.house.price.of.unit.area -
predictions_lasso)^2))
rmse_ridge <- sqrt(mean((test_data$Y.house.price.of.unit.area -
predictions_ridge)^2))
rmse_pls <- sqrt(mean((test_data$Y.house.price.of.unit.area -
predictions_pls)^2))

** Compare RMSE values and select the best model


rmse_values <- c("Full Model" = rmse_full, "Stepwise Model" = rmse_step,
"LASSO" = rmse_lasso, "Ridge" = rmse_ridge, "PLS" = rmse_pls)
best_model <- names(rmse_values[which.min(rmse_values)])
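The comparison above stores the winner in best_model but never displays it; a small follow-up sketch (assuming rmse_values and best_model as defined above) surfaces the result described in the Results section:

```r
# Print the RMSE table and the best-performing model by test RMSE.
print(round(rmse_values, 3))
cat("Best model by test RMSE:", best_model, "\n")
```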

Results
The Full Model has the lowest RMSE, suggesting it performs best on the test data among the evaluated models. The scatter plots show the relationship between actual and predicted values for linear, ridge, and lasso regression; ideally, points should lie along the diagonal line, indicating perfect predictions. The bar plot and scatter plot visualize the relationship between transaction date and house price per unit area, and the residual plots help assess the goodness of fit for the ridge and lasso regression models.
Full Model: RMSE = 7.783; Stepwise Regression: RMSE = 7.79; PLS Regression: RMSE = 7.80; Lasso Regression: RMSE = 8.24; Ridge Regression: RMSE = 10.11.

RMSE Comparison:
The RMSE values obtained for each model are compared. Lower RMSE values indicate
better predictive performance. The Full Model, Stepwise Model, LASSO, Ridge, and PLS are
evaluated, and the model with the lowest RMSE is considered the most accurate on the test
data.
** Create a bar plot for “Y.house.price.of.unit.area” against “X1.transaction.date”
ggplot(data, aes(x = X1.transaction.date, y = Y.house.price.of.unit.area)) +
geom_bar(stat = "identity", fill = "red") +
labs(title = 'House Price vs Transaction Date') +
xlab('Transaction Date') +
ylab('House Price of Unit Area') +
theme_minimal()
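Because geom_bar(stat = "identity") stacks every sale at a given date, the bar heights show the *sum* of prices per date, which grows with the number of transactions. A hedged alternative, assuming the same data frame, plots the mean price per date instead:

```r
# Sketch: bar heights as the mean (not summed) price per transaction date.
library(ggplot2)
ggplot(data, aes(x = factor(X1.transaction.date),
                 y = Y.house.price.of.unit.area)) +
  stat_summary(fun = mean, geom = "bar", fill = "red") +
  labs(title = "Mean House Price by Transaction Date",
       x = "Transaction Date", y = "Mean House Price of Unit Area") +
  theme_minimal()
```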

** Residual plots for lasso regression


lasso_residuals <- test_data$Y.house.price.of.unit.area - predictions_lasso
plot(predictions_lasso, lasso_residuals, pch = 16, col = "green", main =
"Lasso Regression Residuals vs Fitted")
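The write-up mentions residual plots for both ridge and lasso, but only the lasso plot appears in the code. An analogous sketch for ridge, assuming predictions_ridge holds test-set predictions at a single penalty:

```r
# Sketch: ridge residuals vs. fitted values, mirroring the lasso plot above.
ridge_residuals <- test_data$Y.house.price.of.unit.area - predictions_ridge
plot(predictions_ridge, ridge_residuals, pch = 16, col = "blue",
     main = "Ridge Regression Residuals vs Fitted")
abline(h = 0, col = "red")  # reference line at zero residual
```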

** Fit linear regression model


model_lm <- lm(Y.house.price.of.unit.area ~ ., data = data)

** Assuming all variables are predictors


** Fit ridge regression model
x_ridge <- as.matrix(data[, c("X1.transaction.date", "X2.house.age",
"X3.distance.to.the.nearest.MRT.station", "X4.number.of.convenience.stores",
"X5.latitude", "X6.longitude")])
y_ridge <- data$Y.house.price.of.unit.area
model_ridge <- glmnet(x_ridge, y_ridge, alpha = 0, lambda = 0.01)

** Make predictions using linear regression


predictions_lm <- predict(model_lm, newdata = data)

** Fit lasso regression model


model_lasso <- glmnet(x_ridge, y_ridge, alpha = 1, lambda = 0.01)

Compare the Predictions


** Scatter plot of actual vs predicted values for Linear Regression
plot(test_data$Y.house.price.of.unit.area, predictions_full, pch = 16,
     col = "blue", main = "Linear Regression: Actual vs Predicted")
abline(0, 1, col = "red")


** Make predictions using ridge regression


predictions_ridge <- predict(model_ridge, newx = x_ridge, s = 0.01)

** Make predictions using lasso regression


predictions_lasso <- predict(model_lasso, newx = x_ridge, s = 0.01)

Compare the Predictions
** Scatter plot of actual vs predicted values for Linear Regression
plot(data$Y.house.price.of.unit.area, predictions_lm, pch = 16, col = "blue",
main = "Linear Regression: Actual vs Predicted")
abline(0, 1, col = "red")
** Scatter plot of actual vs predicted values for Ridge Regression
plot(data$Y.house.price.of.unit.area, predictions_ridge, pch = 16, col =
"green", main = "Ridge Regression: Actual vs Predicted")
abline(0, 1, col = "red")
** Scatter plot of actual vs predicted values for Lasso Regression
plot(data$Y.house.price.of.unit.area, predictions_lasso, pch = 16, col =
"purple", main = "Lasso Regression: Actual vs Predicted")
abline(0, 1, col = "red")
Comparison
** Ridge and LASSO Residuals: Residuals represent the differences between predicted and
actual values. Residual plots for Ridge and LASSO regression models are analyzed to assess
model performance. A random and evenly distributed pattern of residuals indicates a well-
fitted model. Patterns or outliers may suggest areas for improvement.
** Predicted vs. Actual Values: Scatter plots are created for Linear, Ridge, and LASSO
regressions to compare predicted vs. actual values. Points closely aligning along the
diagonal line indicate accurate predictions. Deviations from this line may highlight areas
where the model has difficulty making accurate predictions.
