R Markdown File Mid
Abdul Rehman
2023-11-22
Abstract:
This dataset captures essential details related to real estate transactions, aiming to uncover
patterns and relationships that contribute to variations in house prices per unit area. The
information encompasses diverse factors, including transaction date, house age, proximity
to MRT stations, the presence of convenience stores, and geographical coordinates. The
dataset provides a rich source for exploring the dynamics of the real estate market,
enabling analysis and prediction of housing prices based on various influential features.
This abstract outlines the dataset’s key variables and sets the stage for a comprehensive
exploration of the factors affecting housing prices in the given context.
Introduction:
The dataset under consideration contains information on real estate transactions, focusing
on factors that may influence the price of residential units. Its key variables are transaction
date, house age, distance to the nearest MRT (Mass Rapid Transit) station, the number of
convenience stores nearby, latitude, longitude, and the house price per unit area. According
to the summary statistics reported in the R output, the 414 records cover house ages from 0
to 43.8 years, distances to the nearest MRT station from 23.38 to 6488.02, and between 0
and 10 nearby convenience stores.
R code
The provided R code performs the following tasks for different regression models:
** Linear Regression:
Fitting a linear regression model (model_lm) using all predictors in the dataset. Making
predictions (predictions_lm) using the linear regression model. Creating a scatter plot of
actual vs predicted values for linear regression.
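As a minimal sketch of this step, the snippet below fits `model_lm` and plots actual against predicted values. The data frame is synthetic (three of the predictors, with made-up coefficients) and only stands in for the real dataset; the names `model_lm` and `predictions_lm` follow the report, everything else is an assumption.

```r
# Synthetic stand-in for the real-estate data (assumption: simulated values)
set.seed(42)
n <- 200
data <- data.frame(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
data$Y.house.price.of.unit.area <- 45 -
  0.25 * data$X2.house.age -
  0.005 * data$X3.distance.to.the.nearest.MRT.station +
  1.2 * data$X4.number.of.convenience.stores +
  rnorm(n, sd = 8)

# "." expands to every column except the response
model_lm <- lm(Y.house.price.of.unit.area ~ ., data = data)
predictions_lm <- predict(model_lm, newdata = data)

# Actual vs predicted; points on the red diagonal are perfect predictions
plot(data$Y.house.price.of.unit.area, predictions_lm,
     xlab = "Actual", ylab = "Predicted",
     main = "Linear Regression: Actual vs Predicted")
abline(a = 0, b = 1, col = "red")
```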
** Ridge Regression:
Fitting a ridge regression model (model_ridge) using the glmnet package with specified
predictors and regularization parameter (lambda = 0.01). Making predictions
(predictions_ridge) using the ridge regression model. Creating a scatter plot of actual vs
predicted values for ridge regression.
** Lasso Regression:
Fitting a lasso regression model (model_lasso) using the glmnet package with specified
predictors and regularization parameter (lambda = 0.01). Making predictions
(predictions_lasso) using the lasso regression model. Creating a scatter plot of actual vs
predicted values for lasso regression.
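In `glmnet`, the ridge and lasso fits described above differ only in the `alpha` argument. The sketch below uses a synthetic predictor matrix (an assumption, since the real data is not reproduced) and the fixed `lambda = 0.01` from the report. The cross-validation step at the end is not part of the report; it is shown only as one common way to choose `lambda` instead of fixing it.

```r
library(glmnet)

# Synthetic stand-in for the predictors and response (assumption)
set.seed(42)
n <- 200
x <- cbind(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
y <- as.numeric(45 - 0.25 * x[, 1] - 0.005 * x[, 2] + 1.2 * x[, 3] +
                  rnorm(n, sd = 8))

# alpha = 0 selects the ridge (L2) penalty, alpha = 1 the lasso (L1) penalty
model_ridge <- glmnet(x, y, alpha = 0, lambda = 0.01)
model_lasso <- glmnet(x, y, alpha = 1, lambda = 0.01)

# predict() returns a one-column matrix; as.numeric() flattens it
predictions_ridge <- as.numeric(predict(model_ridge, newx = x))
predictions_lasso <- as.numeric(predict(model_lasso, newx = x))

# Optional (not in the report): pick lambda by cross-validation
cv_lasso <- cv.glmnet(x, y, alpha = 1)
cv_lasso$lambda.min
```

With a penalty as small as 0.01, both models shrink the coefficients very little, so their predictions will usually sit close to those of ordinary least squares.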
** Model Comparison:
Splitting the data into training and testing sets. Fitting a full linear regression model and
a stepwise regression model for feature selection. Fitting a partial least squares (PLS)
regression model. Evaluating and comparing the performance of these models using root
mean squared error (RMSE). Selecting the best-performing model based on RMSE values.
** Visualization:
Creating a bar plot and scatter plot to explore the relationship between transaction date
and house price of unit area.
** Residual Plots:
Generating residual plots for both ridge and lasso regression models to assess model
performance.
In summary, the code covers linear, ridge, and lasso regression, along with stepwise and
PLS models, an RMSE-based model comparison, and visualizations used to interpret the
relationships within the dataset.
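The residual plots for the ridge and lasso models can be sketched as follows. The data is synthetic and `plot_residuals()` is a hypothetical helper introduced here for illustration; neither comes from the report.

```r
library(glmnet)

# Synthetic stand-in for the predictors and response (assumption)
set.seed(42)
n <- 200
x <- cbind(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
y <- as.numeric(45 - 0.25 * x[, 1] - 0.005 * x[, 2] + 1.2 * x[, 3] +
                  rnorm(n, sd = 8))

model_ridge <- glmnet(x, y, alpha = 0, lambda = 0.01)
model_lasso <- glmnet(x, y, alpha = 1, lambda = 0.01)

# Hypothetical helper: plots residuals (actual - predicted) against predictions
plot_residuals <- function(model, x, y, title) {
  fitted_values <- as.numeric(predict(model, newx = x))
  res <- y - fitted_values
  plot(fitted_values, res,
       xlab = "Predicted", ylab = "Residual", main = title)
  abline(h = 0, col = "red", lty = 2)  # reference line at zero error
  invisible(res)
}

par(mfrow = c(1, 2))  # two panels side by side
res_ridge <- plot_residuals(model_ridge, x, y, "Ridge residuals")
res_lasso <- plot_residuals(model_lasso, x, y, "Lasso residuals")
par(mfrow = c(1, 1))
```

A well-behaved fit shows residuals scattered evenly around the zero line; curvature or a funnel shape would suggest a missing nonlinear term or non-constant variance.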
library(caret)
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:caret':
##
## R2
##
## The following object is masked from 'package:stats':
##
## loadings
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(ggplot2)
library(glmnet)  # provides glmnet() for the ridge and lasso models
##        No         X1.transaction.date  X2.house.age
##  Min.   :  1.0   Min.   :2013         Min.   : 0.000
##  1st Qu.:104.2   1st Qu.:2013         1st Qu.: 9.025
##  Median :207.5   Median :2013         Median :16.100
##  Mean   :207.5   Mean   :2013         Mean   :17.713
##  3rd Qu.:310.8   3rd Qu.:2013         3rd Qu.:28.150
##  Max.   :414.0   Max.   :2014         Max.   :43.800
##  X3.distance.to.the.nearest.MRT.station  X4.number.of.convenience.stores
##  Min.   :  23.38                         Min.   : 0.000
##  1st Qu.: 289.32                         1st Qu.: 1.000
##  Median : 492.23                         Median : 4.000
##  Mean   :1083.89                         Mean   : 4.094
##  3rd Qu.:1454.28                         3rd Qu.: 6.000
##  Max.   :6488.02                         Max.   :10.000
##   X5.latitude     X6.longitude    Y.house.price.of.unit.area
##  Min.   :24.93   Min.   :121.5   Min.   :  7.60
##  1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70
##  Median :24.97   Median :121.5   Median : 38.45
##  Mean   :24.97   Mean   :121.5   Mean   : 37.98
##  3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60
##  Max.   :25.01   Max.   :121.6   Max.   :117.50
** Remove NA values
data <- na.omit(data)
** Train/test split (80/20); set a seed beforehand for reproducibility
train_index <- createDataPartition(data$Y.house.price.of.unit.area, p = 0.8,
                                   list = FALSE)
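A minimal sketch of how the split above might be completed is given below. The seed value, the synthetic data, and the names `train_data`, `test_data`, and `y_test` are assumptions; `train_index` and `x_test` follow the names used in the report.

```r
library(caret)

set.seed(123)  # assumed seed; the value used in the report is not shown
n <- 200
# Synthetic stand-in for the real-estate data (assumption)
data <- data.frame(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
data$Y.house.price.of.unit.area <- 45 -
  0.25 * data$X2.house.age -
  0.005 * data$X3.distance.to.the.nearest.MRT.station +
  1.2 * data$X4.number.of.convenience.stores +
  rnorm(n, sd = 8)

# createDataPartition() samples within quantile groups of the response,
# so the price distribution is similar in both subsets
train_index <- createDataPartition(data$Y.house.price.of.unit.area, p = 0.8,
                                   list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Separate test predictors and test response for later evaluation
response <- "Y.house.price.of.unit.area"
x_test <- test_data[, setdiff(names(test_data), response)]
y_test <- test_data[[response]]
```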
## Start:  AIC=1475.84
## Y.house.price.of.unit.area ~ No + X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude + X6.longitude
##
##                                          Df Sum of Sq   RSS    AIC
## - No                                      1     27.82 26991 1474.2
## - X6.longitude                            1     32.12 26995 1474.2
## <none>                                                26963 1475.8
## - X1.transaction.date                     1    827.15 27790 1483.9
## - X5.latitude                             1   2010.70 28974 1497.7
## - X4.number.of.convenience.stores         1   2335.33 29299 1501.4
## - X3.distance.to.the.nearest.MRT.station  1   2939.38 29903 1508.2
## - X2.house.age                            1   2994.53 29958 1508.8
##
## Step:  AIC=1474.18
## Y.house.price.of.unit.area ~ X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude + X6.longitude
##
##                                          Df Sum of Sq   RSS    AIC
## - X6.longitude                            1     29.46 27021 1472.5
## <none>                                                26991 1474.2
## + No                                      1     27.82 26963 1475.8
## - X1.transaction.date                     1    849.19 27840 1482.5
## - X5.latitude                             1   2016.96 29008 1496.1
## - X4.number.of.convenience.stores         1   2339.06 29330 1499.8
## - X3.distance.to.the.nearest.MRT.station  1   2925.18 29916 1506.3
## - X2.house.age                            1   2967.19 29958 1506.8
##
## Step:  AIC=1472.54
## Y.house.price.of.unit.area ~ X1.transaction.date + X2.house.age +
##     X3.distance.to.the.nearest.MRT.station + X4.number.of.convenience.stores +
##     X5.latitude
##
##                                          Df Sum of Sq   RSS    AIC
## <none>                                                27020 1472.5
## + X6.longitude                            1      29.5 26991 1474.2
## + No                                      1      25.2 26995 1474.2
## - X1.transaction.date                     1     837.1 27858 1480.7
## - X5.latitude                             1    2098.3 29119 1495.4
## - X4.number.of.convenience.stores         1    2375.3 29396 1498.5
## - X2.house.age                            1    2947.1 29968 1504.9
## - X3.distance.to.the.nearest.MRT.station  1    5291.1 32312 1529.9
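The trace above lists both removals (`-`) and re-additions (`+`) at each step, which is what a bidirectional AIC search prints. It drops the record index `No` first and `X6.longitude` second, then stops because no further move lowers the AIC. The call that produced it is not shown; the sketch below assumes `MASS::stepAIC()` with `direction = "both"` and runs on synthetic data that includes an uninformative `No` column.

```r
library(MASS)

# Synthetic stand-in for the training data (assumption), including a
# record index "No" that carries no information about the price
set.seed(42)
n <- 200
train_data <- data.frame(
  No = seq_len(n),
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
train_data$Y.house.price.of.unit.area <- 45 -
  0.25 * train_data$X2.house.age -
  0.005 * train_data$X3.distance.to.the.nearest.MRT.station +
  1.2 * train_data$X4.number.of.convenience.stores +
  rnorm(n, sd = 8)

# Full model on every column, then AIC-guided search in both directions
model_full <- lm(Y.house.price.of.unit.area ~ ., data = train_data)
model_step <- stepAIC(model_full, direction = "both", trace = FALSE)

formula(model_step)          # predictors that survived the search
extractAIC(model_full)[2]    # AIC on the scale used in the trace
extractAIC(model_step)[2]
```

Because `No` is only a row identifier, excluding it from the data before fitting would keep it out of every model, not just the stepwise one.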
** Full Model
predictions_full <- predict(model_full, newdata = x_test)
** Stepwise Model
predictions_step <- predict(model_step, newdata = x_test)
** LASSO Model
predictions_lasso <- predict(model_lasso, newx = as.matrix(x_test))
** Ridge Model
predictions_ridge <- predict(model_ridge, newx = as.matrix(x_test))
** PLS Model
predictions_pls <- predict(model_pls, newdata = x_test)
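The fit behind `model_pls` is not shown, so the sketch below is only an assumption about how it might look with `pls::plsr()`. It also illustrates a detail that matters for the comparison: without an `ncomp` argument, `predict()` on a PLS model returns predictions for every number of components as a 3-D array, so a single component count has to be chosen before computing an error metric. The data, `ncomp = 3`, and `scale = TRUE` are illustrative choices.

```r
library(pls)

# Synthetic stand-in for the data (assumption)
set.seed(42)
n <- 200
data <- data.frame(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
data$Y.house.price.of.unit.area <- 45 -
  0.25 * data$X2.house.age -
  0.005 * data$X3.distance.to.the.nearest.MRT.station +
  1.2 * data$X4.number.of.convenience.stores +
  rnorm(n, sd = 8)
train_data <- data[1:160, ]
x_test <- data[161:200, names(data) != "Y.house.price.of.unit.area"]

# scale = TRUE standardizes predictors measured on very different scales
model_pls <- plsr(Y.house.price.of.unit.area ~ ., data = train_data,
                  ncomp = 3, scale = TRUE)

# Default: one slice per component count (observations x responses x ncomp)
pred_all <- predict(model_pls, newdata = x_test)
dim(pred_all)

# Predictions from the 3-component model only, flattened to a vector
predictions_pls <- drop(predict(model_pls, newdata = x_test, ncomp = 3))
```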
Results
The Full Model has the lowest RMSE (7.783), suggesting it performs best on the test data
among the evaluated models, although the Stepwise (7.79) and PLS (7.80) models are nearly
indistinguishable from it. LASSO (8.24) and Ridge (10.11) perform worse.
The scatter plots show actual against predicted values for Linear, Ridge, and LASSO
regression; the closer the points lie to the diagonal line, the better the predictions. The bar
plot and scatter plot visualize the relationship between transaction date and house price of
unit area. Residual plots help assess the goodness of fit for the Ridge and LASSO regression
models.
Full Regression: RMSE = 7.783
Ridge Regression: RMSE = 10.11
Lasso Regression: RMSE = 8.24
PLS Regression: RMSE = 7.80
Step Regression: RMSE = 7.79
RMSE Comparison:
The RMSE values obtained for each model are compared. Lower RMSE values indicate
better predictive performance. The Full Model, Stepwise Model, LASSO, Ridge, and PLS are
evaluated, and the model with the lowest RMSE is considered the most accurate on the test
data.
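For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below defines a small `rmse()` helper (a hypothetical name, not taken from the report) and compares two illustrative linear models on synthetic data; the same pattern extends to the five models evaluated here.

```r
# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) {
  # as.numeric() also flattens one-column matrices such as glmnet output
  sqrt(mean((actual - as.numeric(predicted))^2))
}

# Synthetic stand-in for the data (assumption)
set.seed(42)
n <- 200
data <- data.frame(
  X2.house.age = runif(n, 0, 44),
  X3.distance.to.the.nearest.MRT.station = runif(n, 23, 6500),
  X4.number.of.convenience.stores = sample(0:10, n, replace = TRUE)
)
data$Y.house.price.of.unit.area <- 45 -
  0.25 * data$X2.house.age -
  0.005 * data$X3.distance.to.the.nearest.MRT.station +
  1.2 * data$X4.number.of.convenience.stores +
  rnorm(n, sd = 8)
train_data <- data[1:160, ]
test_data <- data[161:200, ]

model_full <- lm(Y.house.price.of.unit.area ~ ., data = train_data)
# Illustrative reduced model using a single predictor
model_small <- lm(Y.house.price.of.unit.area ~ X2.house.age, data = train_data)

y_test <- test_data$Y.house.price.of.unit.area
rmse_values <- c(
  Full = rmse(y_test, predict(model_full, newdata = test_data)),
  Reduced = rmse(y_test, predict(model_small, newdata = test_data))
)
rmse_values

# Name of the model with the lowest test RMSE
best_model <- names(which.min(rmse_values))
best_model
```

RMSE is expressed in the same units as the response, which makes it easier to interpret than MSE when judging how large a typical prediction error is.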
** Create a bar plot for “Y.house.price.of.unit.area” against “X1.transaction.date”
ggplot(data, aes(x = X1.transaction.date, y = Y.house.price.of.unit.area)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = 'House Price vs Transaction Date') +
  xlab('Transaction Date') +
  ylab('House Price of Unit Area') +
  theme_minimal()
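One caveat with the plot above: `geom_bar(stat = "identity")` stacks every row that shares the same transaction date, so each bar's height is the sum of prices on that date and mostly reflects how many transactions occurred. If the aim is to compare price levels across dates, one alternative is to average first. The sketch below is not part of the report; it uses synthetic data and a hypothetical `mean_price` data frame.

```r
library(ggplot2)

# Synthetic stand-in: monthly transaction dates coded as fractional years
set.seed(42)
n <- 200
data <- data.frame(
  X1.transaction.date = round(2012 + sample(8:19, n, replace = TRUE) / 12, 3),
  Y.house.price.of.unit.area = rnorm(n, mean = 38, sd = 13)
)

# Mean price per transaction date (one row per distinct date)
mean_price <- aggregate(Y.house.price.of.unit.area ~ X1.transaction.date,
                        data = data, FUN = mean)

# geom_col() draws bars whose heights are the values in the data
p <- ggplot(mean_price,
            aes(x = factor(X1.transaction.date),
                y = Y.house.price.of.unit.area)) +
  geom_col(fill = "red") +
  labs(title = "Mean House Price vs Transaction Date",
       x = "Transaction Date",
       y = "Mean House Price of Unit Area") +
  theme_minimal()
print(p)
```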