A REGRESSION MODEL OF SINGLE HOUSE PRICE IN LA
CONSTRUCTING A PREDICTED MODEL FOR HOUSE PRICES
A Project
Presented to the
Faculty of
California State Polytechnic University, Pomona
In Partial Fulfillment
Of the Requirements for the Degree
Master of Science
In
Economics
By
Lishun Yuan
2019
SIGNATURE PAGE
PROJECT: A REGRESSION MODEL OF SINGLE HOUSE PRICE IN LA
CONSTRUCTING A PREDICTED MODEL FOR HOUSE PRICES
AUTHOR: Lishun Yuan
DATE SUBMITTED: Spring 2019
Department of Economics
Dr. Craig Kerr
Project Committee Chair
Economics
Dr. Carsten Lange
Economics
Dr. Shih-Tang Hwu
Economics
ACKNOWLEDGMENTS
I want to thank Dr. Craig Kerr and Dr. Shih-Tang Hwu for help and advice that improved
this paper.
ABSTRACT
Knowing the factors influencing the real estate market is not only beneficial for realtors
completing sales, but also helpful for buyers who want a thorough view of the real estate
market and a better way to evaluate properties. Many factors can affect the real estate
market.
In this study, I collect the latest house sale prices in seven major cities in Los Angeles
County and construct a multiple linear regression model to estimate the factors
that affect house sale prices in the current real estate market. The regression is based on 140
properties currently on the market. These properties are described by seven explanatory
variables that are widely used by realtors and buyers: internal square feet, year built,
number of bedrooms, number of bathrooms, local school quality, median household income,
and city population. The regression is checked against the Gauss-Markov assumptions, with
tests for multicollinearity and heteroskedasticity. The analysis is carried out in R. The
results also provide suggestions to improve and inspire further study of the real estate
market.
Contents
Signature Page
Acknowledgments
Abstract
1 Introduction
1.1 Background and overview
1.2 Literature Review
2 Empirical Work
2.1 Data Selection and Framework
2.2 Methodology
3 Conclusion
Bibliography
Chapter 1
Introduction
1.1 Background and overview
The idea behind predictive modeling is to use a statistical approach to build a prediction
function from observed data. The function is then used to estimate the value of the
dependent variable for a new data set. Predictive modeling has been widely used in many
research areas, from business to the social and natural sciences. This paper applies
multivariate linear regression to estimate the values of single-family houses in LA County,
using a dataset composed of 140 homes.
When people consider buying homes, the location has usually been constrained to a
certain area, such as one not too far from the workplace. With the location factor largely
fixed, information about the property characteristics weighs more heavily in the home price.
Many factors determine the price of a house, and they do not weigh equally in determining
the home value. This paper presents a modeling process for estimating home values using
a multivariate linear regression model based on information about the houses, in
order to examine the key factors affecting their values. The project also provides a general
way of judging whether a transaction is a good deal based on the information provided.
Real estate economics is the application of economic techniques to the real estate market.
Real estate transactions play an essential part in the US economy. In recent years, as an
increasing number of immigrants have chosen to live in the United States, Los Angeles has
become one of the most popular areas in the country, which has significantly increased
housing demand in the local market. The main demographic variables that can affect
housing prices are population growth and population size. In other words, the more people
in a country or a region, the greater the demand for housing in that area.
However, it is an oversimplification to take only these population factors into
consideration. Conventionally, buyers determine the sale price of houses with simple
methods, such as calculating the average price of nearby properties or looking for a
reasonable local median price. This approach has a disadvantage: it relies heavily on the
subjective perception and experience of realtors and local residents to estimate house
prices, which creates bias, inaccuracy, and inconsistency in the price-determining process.
Determining the price of a single-family house can be extremely complicated, since
many factors can affect the value of a house, such as the crime rate, number
of rooms, age of the property, and school district (El Mahmah, 2012). The sale price of a
house available on the market is a fairly accurate index reflecting the intrinsic value of
the house.
1.2 Literature Review
To construct a reasonable model that can be used generally to predict single-family house
prices in a given location, researchers use econometric theory
to set up a regression model. This approach is called the econometric approach, in contrast
with the conventional approach. This study aims to construct a best linear unbiased
estimator to predict the sale price of properties in a certain area based on essential
variables, in order to help amateur house buyers understand the price-determining procedure
and the housing market thoroughly.
Different researchers use different regression models and methods in their
studies. The independent variables, however, are similar across some research
studies. Bourassa et al. (2010) use interior size, location, year built, lot size, number of
bedrooms, and number of bathrooms as the independent variables in a linear regression
model. Hu et al. (2013) use a dataset composed of 81 single houses to construct multivariate
regression models of home prices. They then apply the maximum information coefficient
statistic to the dependent variable, home value (Y), and the independent
variables as an evaluation of the regression models. The result shows a strong
relationship between the dependent and independent variables.
Case et al. (1991) and El Mahmah (2012) add median household income and local
population size to their models because these two factors play important roles in
determining property sale prices in the market. However, the impacts of the two factors
differ from region to region. Therefore, it is necessary to include these two factors in the
construction of the regression model. Neighborhood quality is also an important measure in
determining house value. Since it is hard to evaluate neighborhood quality quantitatively,
Dubin (1998) uses proxy variables such as crime rate, school quality measures, and
race to determine neighborhood quality. The study points out that if neighborhood
variables such as crime rate and school district ratings are not included in the regression
model, it is likely that the error terms from nearby houses will be correlated because they
are in the same neighborhood.
After specifying the model, it is important to test it. Based on the Gauss-Markov
assumptions of regression, various studies test and improve the quality of the regression
model. Bourassa et al. (2010) use scatter plots of the independent variables versus the
dependent variable (i.e., the sale price) to give visual representations of the relationship
between X and Y. In this way, the study identifies which pairs of variables are most strongly
correlated with each other and, furthermore, recognizes multicollinearity between
independent variables. Hu et al. (2013) also build a multiple regression and suggest
that it is necessary to diagnose the regression being built. A scatter plot of residuals
against predicted values can be used to show whether there is heteroskedasticity
in the model being constructed (Hu et al., 2013).
Chapter 2
Empirical Work
2.1 Data Selection and Framework
This study uses 140 single-family houses as 140 observations, with detailed information
on each property currently on the market. Seven common
variables might help determine the sale price of houses: internal square feet, year built,
number of bedrooms, number of bathrooms, local school quality, median household
income, and city population.
This study’s data is randomly selected from realtor.com, a professional real estate
database in the United States. The properties selected are located in seven major
cities in Los Angeles County: San Dimas, Sierra Madre, Claremont, La Verne, Pomona,
Montebello, and Glendale. Specifically, 20 properties are randomly selected in each city.
Based on current data, this study aims to construct a linear regression model that
can forecast the prices of other properties from essential and related variables. The
measurements and sources of the seven explanatory variables are listed in Table 2.1 below.
Table 2.1: Properties of Variables
Name                      Abbreviation  Measurement        Source
Internal Square Feet      Intersf       Square feet        Realtor.com
Number of Bedrooms        Bed           Units              Realtor.com
Number of Bathrooms       Bath          Units              Realtor.com
Year Built                Year          Years              Realtor.com
Median Household Income   Income        U.S. Dollars       City Officials
City Population           Ctpop         Number of People   City Officials
Median School Rating      School        School Evaluation  City Officials
The data processing in this study is carried out in R. All of the following tests and
operations are presented along with their results. The starting point of the study is the
simple linear regression model below:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{2.1}$$
where X is the single explanatory variable, ε is the regression error, and Y is the
dependent variable. This simple model is extended to a multiple linear regression model
that includes more variables, in order to forecast housing prices in LA
County efficiently and consistently.
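As an illustration of how the simple model in Equation 2.1 is estimated, the following is a minimal sketch of the closed-form least-squares solution. It is written in Python rather than the R used in the study, and the function name `fit_simple_ols` and the toy data are hypothetical, not taken from the study's dataset.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data (NOT from the study): square feet and log of price.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
log_price = [13.0, 13.2, 13.4, 13.6]
b0, b1 = fit_simple_ols(sqft, log_price)
```

With several explanatory variables the same least-squares idea applies, but the estimates come from solving a system of equations instead of a single ratio.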
The model used in this study is a multiple linear regression model with seven independent
variables and one dependent variable. It is important to consider the Gauss-Markov
assumptions and the corresponding statistical tests in this study. Under the Gauss-Markov
assumptions, the least-squares estimators are the “best linear
unbiased estimators” (BLUE).
1. Linearity in parameters. The population process has to be linear in parameters; in other
words, there is no multiplicative effect among the parameters.
2. Random sampling. Each individual in the population is equally likely to be selected for
the study. More importantly, all of the data come from the same population.
3. Zero conditional mean of the error. The expectation of the error given any independent
variable should be equal to zero, so knowing the independent variables does not help the
researcher predict whether an observation will lie above or below the population
regression line.
4. No perfect collinearity among the regressors (checked with the Variance Inflation Factor
(VIF) test).
5. No heteroskedasticity (checked with the Breusch-Pagan test).
It is necessary to check for multicollinearity among the independent variables and for
heteroskedasticity of the errors to achieve BLUE. All the data in this study are
recent and from the same period of time, so no time series issues are involved
in this dataset. Therefore, corrections for serial correlation and for non-stationary
variables do not need to be carried out. In addition, there is no two-way causality in this
topic.
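$$\log(\text{Price}_i) = \beta_0 + \beta_1\,\text{intersf}_i + \beta_2\,\text{bed}_i + \beta_3\,\text{bath}_i + \beta_4\,\text{year}_i + \beta_5\,\text{income}_i + \beta_6\,\text{ctpop}_i + \beta_7\,\text{school}_i + \varepsilon_i \tag{2.2}$$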
Equation 2.2 is the log-linear regression model used to estimate housing prices in
LA County. This model is the basis of the regression analysis of housing prices in LA
County in the following section.
2.2 Methodology
A log-linear regression is run on the variables listed in Table 2.1 above. The results are
shown in Table 2.2 below.
Table 2.2: The corrected regression model with heteroskedasticity adjusted
Coefficient  Estimate     t-value  Pr(>|t|)
Intercept    1.860e+01    11.903   <2e-16
Intersqft    3.274e-04    10.436   <2e-16
Bedroom      -6.075e-02   -1.906   0.059
Bathroom     9.004e-02    2.682    0.008
Year         -3.294e-03   -4.144   6.53e-05
Income       5.644e-06    3.334    0.001
Citypop      1.659e-06    4.885    3.34e-06
School       3.191e-02    2.677    0.008
Residual standard error: 0.1635 on 116 degrees of freedom
Multiple R-squared: 0.819, Adjusted R-squared: 0.808
F-statistic: 75.33 on 7 and 116 df, p-value: < 2.2e-16
The significance level is the probability of rejecting the null hypothesis when it is true.
In the majority of analyses, 0.05 is used as the cutoff point: we reject the null hypothesis
when the p-value is less than 0.05. According to Table 2.2, the p-value of number of
bedrooms is 0.05916, which is slightly greater than the 0.05 significance level. For all the
other variables, the p-values are less than 0.05. As a result, internal square feet, number
of bathrooms, year built, median household income, city population, and school
rating are significant in this study.
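The decision rule described above can be written out as a short check. This is only a sketch in Python (the study itself used R); the p-values are copied from Table 2.2, with "<2e-16" entered as its upper bound, and the helper name `split_by_significance` is hypothetical.

```python
ALPHA = 0.05  # conventional 5% significance level used in the text

# p-values as reported in Table 2.2; "<2e-16" is entered as its upper bound.
p_values = {
    "Intersqft": 2e-16,
    "Bedroom": 0.059,
    "Bathroom": 0.008,
    "Year": 6.53e-05,
    "Income": 0.001,
    "Citypop": 3.34e-06,
    "School": 0.008,
}


def split_by_significance(p_values, alpha=ALPHA):
    """Split variable names into those with p < alpha and those with p >= alpha."""
    significant = [name for name, p in p_values.items() if p < alpha]
    not_significant = [name for name, p in p_values.items() if p >= alpha]
    return significant, not_significant


significant, not_significant = split_by_significance(p_values)
```

Only the bedroom coefficient falls on the non-significant side of the 0.05 cutoff.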
To support this reasoning, a Variance Inflation Factor (VIF) test is also
necessary to check whether there is a multicollinearity issue in the data above. In
statistics, the VIF evaluates the severity of multicollinearity in an
OLS regression analysis. The result of the VIF test provides an index that measures how much
the variance of an estimated regression coefficient is increased because of collinearity.
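For reference, the standard definition of the VIF for the j-th explanatory variable is given below. The symbol $R_j^2$ is notation introduced here, not in the original text: it denotes the R-squared from an auxiliary regression of variable j on all the other explanatory variables.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\qquad\Longleftrightarrow\qquad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
```

As a worked example, a VIF of 4.930 (the value reported for Income in Table 2.3) corresponds to $R_j^2 = 1 - 1/4.930 \approx 0.797$, meaning that roughly 80% of the variation in that variable is explained by the other regressors.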
Table 2.3: The calculation output of VIF test
Variable VIF Result
Intersqft 3.552
Bedroom 3.289
Bathroom 3.779
Year 1.514
Income 4.930
Citypop 2.428
School 3.145
According to the results of the VIF test in Table 2.3 above, all the values are less than 5,
which suggests there is no serious multicollinearity issue in the data. Therefore, no
variable needs to be dropped from the regression model.
Although this regression has no multicollinearity problem and the model is
highly significant (i.e., most of the coefficient p-values are smaller than 0.05), it
is still not an ideal model if there is a heteroskedasticity problem. The next step
is therefore to run a Breusch-Pagan (BP) test. The BP test, developed in 1979, is a method
to test for heteroskedasticity in linear regression models. The BP test returns a
p-value that is greater than 0.05. Therefore, the null hypothesis H0 of homoskedasticity
cannot be rejected. It is
reasonable to claim that there is no heteroskedasticity problem in the dataset.
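Under the null hypothesis, the BP statistic follows a chi-square distribution with degrees of freedom equal to the number of regressors. The sketch below (Python, standard library only; the function name `chi2_sf` is hypothetical and this is not the R routine used in the study) recomputes the p-value from the statistic 5.207 and 7 degrees of freedom reported in Table 2.4, using the closed-form upper-tail formulas for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite sum over odd double factorials,
    # i.e. 1 + x/3 + x^2/(3*5) + ... with (df - 1)/2 terms.
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.erfc(math.sqrt(half))
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total


# Statistic and degrees of freedom as reported in Table 2.4.
p_value = chi2_sf(5.207, 7)  # approximately 0.635
```

Because the recomputed p-value is well above 0.05, the null hypothesis of constant error variance is not rejected.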
Table 2.4: The calculation output of BP test
Name     Result
BP       5.207
Df       7
P-value  0.635
Having finished the tests for multicollinearity and heteroskedasticity, the regression
model can be used to predict single-family house prices in LA County. The estimated
coefficients are summarized in Table 2.5.
Table 2.5: Log-linear Regression Model
Independent Variable      Coefficient (dependent variable: log of price)
Internal Square Feet      0.0003
Number of Bedrooms        -0.061
Number of Bathrooms       0.090
Year Built                -0.003
Median Household Income   0.0000056
City Population           0.0000017
Median School Rating      0.032
Constant                  18.597
Based on Table 2.5, the log-linear regression model of single-family house prices in LA
County can be formulated as follows:
$$\log(\text{Price}) = 18.597 + 0.0003X_1 - 0.061X_2 + 0.090X_3 - 0.003X_4 + 0.0000056X_5 + 0.0000017X_6 + 0.032X_7 + \varepsilon_i$$
where Price is the sale price of a single-family house; X1 is the internal
square feet; X2 is the number of bedrooms in the property; X3 is the number of bathrooms
in the property; X4 is the year the property was built; X5 is the median household income;
X6 is the city population; and X7 is the median school rating of the city.
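The fitted equation can be turned into a small price calculator. The sketch below is in Python rather than R and rests on several assumptions: the logarithm is taken to be the natural log (the default of R's `log`), the coefficients are the rounded values from Table 2.5 (so results differ slightly from the full-precision estimates in Table 2.2), and the example house is hypothetical, including its school rating, whose scale is not stated in the text. The function names are illustrative only.

```python
import math

# Rounded coefficients copied from Table 2.5; keys follow the Table 2.1 abbreviations.
COEFFICIENTS = {
    "intersf": 0.0003,
    "bed": -0.061,
    "bath": 0.090,
    "year": -0.003,
    "income": 0.0000056,
    "ctpop": 0.0000017,
    "school": 0.032,
}
INTERCEPT = 18.597


def predict_log_price(house):
    """Predicted log(price): intercept plus the sum of coefficient * value."""
    return INTERCEPT + sum(COEFFICIENTS[k] * house[k] for k in COEFFICIENTS)


def predict_price(house):
    """Predicted price, ASSUMING natural log and ignoring retransformation bias."""
    return math.exp(predict_log_price(house))


def percent_effect(coefficient, change=1.0):
    """Percent change in price implied by raising a variable by `change` units."""
    return 100.0 * (math.exp(coefficient * change) - 1.0)


# Hypothetical house (NOT an observation from the study's dataset).
example_house = {
    "intersf": 1500, "bed": 3, "bath": 2, "year": 1960,
    "income": 70000, "ctpop": 100000, "school": 6,
}
log_price = predict_log_price(example_house)
bath_effect = percent_effect(COEFFICIENTS["bath"])
```

In a log-linear model each coefficient is read as an approximate proportional effect: for example, one additional bathroom is associated with a price roughly 9% higher, holding the other variables fixed.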
Chapter 3
Conclusion
Multivariate regression has been widely used in many areas. This paper presents
a process for building a multivariate regression model for the simplified problem of
estimating private property prices. The steps of building a predictive model involve:
(a) applying a subset-selection procedure to select the best variables; (b) building a linear
regression model from the selected variables; (c) conducting diagnostics to find whether
there is a multicollinearity problem or a heteroskedasticity issue. This is a typical
process for building regression models that may apply to many applications.
More generally, the ideas of the econometric approach to the real estate market can also be
used in government property tax estimation, house buyers’ financing estimation, and
realtors’ property evaluation. Researchers are currently exploring non-linear regression
models, which might provide a better estimator of real estate prices.
Although my model cannot reliably predict all house prices, the study
provides an attempt to estimate single-family house prices in Los Angeles County by using
a log-linear regression model as a tool in the econometric approach (El
Mahmah, 2012). Since this model’s results are significant, future researchers may be
able to improve it by adding omitted variables to the equations shown above
and by adding more specific variables.
Bibliography
Bourassa, Steven, Eva Cantoni, and Martin Hoesli. 2010. “Predicting house prices with
spatial dependence: a comparison of alternative methods.” Journal of Real Estate
Research, 32(2): 139–159.
Case, Bradford, Henry O Pollakowski, and Susan M Wachter. 1991. “On choosing
among house price index methodologies.” Real Estate Economics, 19(3): 286–307.
Dubin, Robin A. 1998. “Predicting house prices using multiple listings data.” The Journal
of Real Estate Finance and Economics, 17(1): 35–59.
El Mahmah, Assil. 2012. “Constructing a real estate price index: the Moroccan
experience.” IFC Bulletin, 28: 134.
Hu, Gongzhu, Jinping Wang, and Wenying Feng. 2013. “Multivariate regression modeling
for home value estimates with evaluation using maximum information coefficient.” In
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed
Computing 2012. 69–81. Springer.