Land Use Policy 111 (2021) 104919

Contents lists available at ScienceDirect

Land Use Policy


journal homepage: www.elsevier.com/locate/landusepol

Understanding house price appreciation using multi-source big geo-data and machine learning
Yuhao Kang a,b, Fan Zhang a,*, Wenzhe Peng c, Song Gao b, Jinmeng Rao b, Fabio Duarte a,d, Carlo Ratti a

a Senseable City Lab, Department of Urban Studies and Planning, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
b Geospatial Data Science Lab, Department of Geography, University of Wisconsin, Madison, WI 53703, United States
c Department of Architecture, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
d Urban Management Program, PUCPR, Curitiba 80215-910, Brazil

ARTICLE INFO

Keywords: House price appreciation rate; Street view images; House photos; Human mobility patterns; Geographically weighted regression

ABSTRACT

Understanding house price appreciation benefits place-based decision making and real estate market analyses. Although house price modeling has received considerable attention, limited work has focused on evaluating the price appreciation rate. In this study, we propose a data-fusion framework to examine how well house price appreciation potential can be predicted by combining multiple data sources. We used data sets including house structural attributes, house photos, locational amenities, street view images, transportation accessibility, visitor patterns, and socioeconomic attributes of neighborhoods to enrich our understanding of real estate appreciation and its predictive modeling. As a case study, we investigate more than 20,000 houses in the Greater Boston Area and discuss the spatial dependency of house price appreciation, the influential variables, and their relationships. In detail, we extract deep features from street view images and house photos using a deep learning model, merge features from multi-source data, and model house price appreciation using machine learning models and geographically weighted regression at two spatial scales: the fine-scale point level and the aggregated neighborhood level. Results show that the house price appreciation rate can be modeled with high accuracy using the proposed framework (R^2 = 0.74 for gradient boosting machine at neighborhood-scale). We found that houses with low prices and small areas may have a higher appreciation potential. Our results provide insights into how multi-source big geo-data can be employed in machine learning frameworks to characterize real estate price trends and help understand human settlements for policy-making.

1. Introduction

As an important aspect of human settlement, house prices are strongly associated with economic activities (Chen et al., 2016). Understanding the trends in house prices can provide suggestions not only for house buyers but also for researchers and decision makers in the real estate market, urban planning, and development. For decades, researchers from economics, urban planning, geography, politics, and computer science have made great efforts in house price-related topics to understand the impacts of property values in different socioeconomic environments (Archer et al., 1996; Cao et al., 2019; Fu et al., 2016; Hu et al., 2019).

Despite the large number of existing studies, two aspects have received insufficient attention. First, most existing literature focuses on house price modeling but neglects the price appreciation rate (Hung and Tu, 2008; Livy, 2017). Compared with absolute house prices, which are only snapshots of property values in a specific time window, house price appreciation rates can reflect the growth or decay of property values from a long-term perspective. In addition, a high house price does not imply a high house price appreciation rate. The same variable may have totally different impacts on house prices and on appreciation rates. Therefore, examining the effect of different variables on house price appreciation is important and promising.

Second, existing models such as the hedonic pricing model proposed by Rosen (1974) typically take only structural attributes and locational amenities into consideration, which may not comprehensively describe the other factors influencing the house price appreciation rate. In practice, structural attributes comprise the tangible assets of the property, including the size of the house, the year built, the number of bedrooms and bathrooms, etc., which describe the inner characteristics of the house (Can, 1992). Locational amenities refer to

* Corresponding author.
E-mail address: [email protected] (F. Zhang).

https://doi.org/10.1016/j.landusepol.2020.104919
Received 7 September 2019; Received in revised form 6 July 2020; Accepted 9 July 2020
Available online 21 July 2020
0264-8377/© 2020 Elsevier Ltd. All rights reserved.

geographical-related variables, such as the distance to the nearest facilities, which can reflect the intangible environment nearby (Chau and Chin, 2003). However, the house price appreciation rate might be affected by other variables such as the physical appearance of the house, surrounding physical and social environment settings, and dynamic human mobility patterns (Du et al., 2018). For example, houses with exquisite decoration are intuitively worth more; houses located in districts with a visually pleasing environment, which benefits residents' physical and mental health, might have a higher appreciation rate; and regions that attract more visitors may have higher business value. However, due to the lack of quantitative measurements in conventional data collection methods, these key factors were overlooked by most previous studies.

The emergence of big data, high-performance computing, and advanced machine learning methods provides unprecedented opportunities to model those intangible assets of houses, which can enhance the estimation of house price appreciation rates. On the one hand, in contrast to previous studies that used official statistical data and manual surveys to explore house price appreciation rates (Andrew and Meen, 2003; Crone and Voith, 1992; Archer et al., 1996; Quercia et al., 2000), the larger volumes, velocities, varieties, and veracities of geo-referenced data actively and passively produced by users bring more comprehensive insights into socioeconomic environments in the era of volunteered geographic information (VGI) (Goodchild, 2007) and big geo-data (Gao et al., 2017b). For instance, house photographs that show the indoor and outdoor scenery of properties, taken by house owners and seller agents, are uploaded to websites, which enables people to understand the scenery of houses; and street view images can describe the relationships between urban physical attributes and socioeconomic environments (Gebru et al., 2017; Zhang et al., 2018b; Zhang and Dong, 2018; Liu et al., 2019b). These two data sources make it possible to characterize the living environment from a human perspective. Furthermore, the wide spread of GPS-embedded devices (e.g., mobile phones and vehicles) makes it possible to track individuals' trajectories to infer people's activities and movements. These dynamic observations of human movements may supplement locational amenities, which characterize only the static geospatial aspects of houses. Intuitively, houses located in areas with high accessibility to other places and higher attractiveness to others may have a higher price appreciation rate because of the travel convenience. A better understanding of the relationship between all these dimensions and house price appreciation rates can provide more comprehensive and valuable information for policy making to improve the overall quality of neighborhoods and stimulate social and economic balance between urban areas.

On the other hand, the development of state-of-the-art computer vision techniques enables us to extract high-level visual features from urban images. Capturing visual features to represent the scenic characteristics of houses as well as their neighborhood settings might help measure real estate appreciation values. In fact, recent works have shown the great potential of visual information in estimating house prices and in exploring cultural and socioeconomic characteristics of neighborhoods (Gebru et al., 2017; You et al., 2017; Yao et al., 2018; Law et al., 2018; Fu et al., 2019; Liu et al., 2019a; Chen et al., 2020; Zhang et al., 2020). Accordingly, modeling the house price appreciation rate with visual information is promising.

In this work, we propose a comprehensive multi-feature-fusion framework using machine learning to model the house price appreciation rate. To build the framework, multiple data sources, including house information, built environment, human mobility patterns, and socioeconomic attributes of neighborhoods, are used to understand the value of urban settlements comprehensively. We take the Greater Boston Area as an example to test the feasibility of the proposed framework and explore factors that affect house price appreciation rates.

2. Framework

2.1. Overview

The framework is composed of four stages: data collection, feature construction, model training, and mapping and analysis (Fig. 1). First, we collect multi-source datasets, including house information, built environment features, human mobility patterns, and socioeconomic attributes of neighborhoods, on a cloud server. Second, by fusing these datasets, we extract a series of features that are assumed to have an impact on price appreciation rates and represent them as a multi-dimensional vector. Then, models including machine learning algorithms and geographically weighted regression (GWR) are built using the constructed features, and metrics are defined to measure their performance. Specifically, two spatial units (points and neighborhoods) are tested with different combinations of approaches. Finally, we aim not only to explore better ways of predicting house price appreciation rates, but also to interpret the variables that are associated with real estate appreciation.

Fig. 1. The workflow of this study: (A) Data collection. (B) Feature construction. (C) Model training. (D) Mapping and analysis.

2.2. Data collection

Four categories of data are used in this study: house information, built environment, human mobility patterns, and socioeconomic attributes of the neighborhoods.

2.2.1. House information

House information consists of two subcategories: structural attributes and house photos. Both are collected from a popular online real estate website, REDFIN [1]. House owners and seller agents post information about their properties on the website for sale, with an estimated price of each house provided by the system.

Structural attributes describe the basic characteristics of the house, including the location of the property, the number of bathrooms and bedrooms, the year built, the number of floors, the size of the property, and the house type (single family residential, townhouse, etc.), which have been widely used in traditional hedonic pricing models (Rosen, 1974; Chau and Chin, 2003). Since our main focus is to predict house price appreciation rates (i.e., price changes), house prices across a five-year period from February 2014 to February 2019 are retrieved. Accordingly, the appreciation rate R of a house with market price P is defined as follows:

R = \frac{P_{2019} - P_{2014}}{P_{2014}}    (1)

House photos are downloaded from the REDFIN website as another important part of the house information (Fig. 2). For each property, sellers upload photos they have taken to show the interior and exterior appearance of the house. Because the number of photos shared by sellers varies and not all properties have house photos available, we discarded houses without photos. The remaining houses with available photos are stored in order to extract meaningful high-level visual features that describe the house scenery.

Fig. 2. Left: Study area. Red dots indicate the location of houses and blue polygons represent the boundary of census block groups. Middle: Examples of street view images. © 2019 Google. Right: Examples of house photos. © 2019 Redfin.

[1] https://www.redfin.com/.
[2] https://www.safegraph.com/.

2.2.2. Built environment

Two datasets are used to depict the built environment of a house: locational amenities and street view images.

Typically, locational amenities refer to the facilities near the house in the hedonic pricing model. Here, we use point of interest (POI) information to show the location characteristics of nearby properties. The SafeGraph POI data [2] is used to provide the location information. Besides the location coordinates, each POI has a specific category code, which

follows the standard criteria proposed by the North American Industry Classification System (NAICS) [3]. Following existing research (Cao et al., 2019), we choose the POI categories listed in Table 1.

Table 1
POI categories with NAICS code.

NAICS code   Categories
445110       Grocery Store
452319       Stores
611110       School
611310       Universities
622110       Hospital
712190       Nature Parks
713110       Amusement Parks
722511       Full-Service Restaurants
722513       Limited-Service Restaurants
722515       Snack and Nonalcoholic Beverage Bars

Street view images are downloaded using the Google Street View API [4] (Fig. 2). Street view images have been widely used to describe the physical settings of urban environments and neighborhoods, from which the relationship between human social activities and the physical environment can be inferred (Gebru et al., 2017; Zhang et al., 2018a; Chen et al., 2020). To retrieve street view data along the roads, road networks are downloaded from OpenStreetMap [5]. A set of geo-referenced sampling points is generated along the roads at a fixed interval of 100 m. For each point, eight street view images are analyzed from different angles to show the surrounding urban environment comprehensively. Note that not all collected street view images are used: only images within 50 m of each house are retrieved as descriptors to model the visual scenery of the housing built environment.

2.2.3. Human mobility patterns

Two datasets are used in this research to reflect dynamic human mobility patterns: visitor patterns and transportation accessibility. Both are aggregated at the spatial resolution of Census Block Groups (CBGs).

The visitor patterns of CBGs are retrieved from the SafeGraph mobile phone database, which covers about 10% of the total population with mobile devices in the United States [6]. SafeGraph aggregates anonymized location data from numerous mobile applications in order to provide insights about physical places. To enhance privacy, SafeGraph excludes CBG information if fewer than five devices visited an establishment in a month from a given census block group. For each CBG, the aggregated visitor pattern records show how many visitors came to the CBG during a specified time window, which could reflect the attractiveness of the CBG. The hourly visit counts are recorded as a 24-dimensional vector to show the dynamic patterns of visitors at CBGs.

The other dataset is released by the Uber Movement project [7]. This publicly available open data platform provides the observed travel times between two CBGs based on the movements of Uber vehicles. We calculate the mean travel time from each CBG to all other CBGs over the whole year of 2018. Note that the mean travel time may vary among CBGs, so the standard deviation of the travel time is also computed.

[3] https://www.naics.com.
[4] https://developers.google.com/maps/documentation/streetview/intro.
[5] https://www.openstreetmap.org/.
[6] https://www.safegraph.com/blog/what-about-bias-in-the-safegraph-dataset.
[7] https://movement.uber.com/?lang=en-US.

2.2.4. Socioeconomic attributes of neighborhoods

We used the CBG data released by the American Community Survey (ACS), which contains a wide range of demographic data. It is widely used in socioeconomic studies to estimate neighborhood social identities at the CBG level. Specifically, we retrieved the population, ethnicity, income, and unemployment rate of the CBGs. For each CBG, both the population and the ratio of each of seven ethnicity groups are computed. In addition, the average income and average unemployment rate, which could reflect the identity and class of the neighborhood, are also retrieved for further analysis.

2.3. Feature construction

Assume that each house appreciation rate r_i is influenced by a set of features from four perspectives: r_i = F(h_i, b_i, m_i, s_i), in which h_i refers to house information, b_i refers to built environment, m_i refers to human mobility patterns, and s_i refers to socioeconomic attributes. To integrate these four types of factors to predict the house price appreciation rate, a set of features is extracted and constructed following the steps below.

2.3.1. General features

Structural attributes are constructed as features from the data source and attached to each house directly. It is worth noting that we use the natural logarithm of the house price, which is normally distributed, rather than the original values, whose distribution is skewed. Features of human mobility patterns and socioeconomic attributes of the neighborhood are attached to the houses after a spatial join between house locations and CBG polygons.

2.3.2. Locational features

For locational amenities, to characterize the living convenience of neighborhoods thoroughly, we construct features from two aspects: the distance from the house to the nearest facilities, and the number of nearby amenities. The assumption is that different facilities may have different urban functions, which result in different mobility patterns and neighborhood vibrancy (Liu et al., 2012; Gao et al., 2017a; Yue et al., 2017). For example, people usually prefer to go to the nearest transportation hubs, including metro and bus stops, rather than options farther away. Thus, the distance to the nearest facility matters, while the number of transportation stations nearby may have limited impact on people's travel mode. However, for amenities such as shops and restaurants, the quantity and variety of offerings influence the convenience for people living in a certain area. Therefore, we calculate the total number of these types of POIs. As suggested by studies from urban planning and geography (Neilson and Fowler, 1972; Murray and Wu, 2003), we select 600 m as the distance threshold for our distance analysis, which is suitable to represent the preferred coverage of human physical activity by walking and is used to evaluate the walkability of the neighborhood (Ellis et al., 2016). In other words, the POIs in each category within 600 m of a house are counted as descriptors of that house.

2.3.3. Visual features

Furthermore, we extract deep features of street view imagery and house photos using a deep convolutional neural network (DCNN). The model is adapted from ResNet18, a commonly used architecture that has proven efficient in various computer vision tasks (He et al., 2016). It can extract high-dimensional visual features that reveal hidden scenery information captured in photos. To learn efficient visual features from the images, we train the model on a house price prediction task. Accordingly, we take the images as model inputs and the house price value as the output. To deal with the skewed distribution (power law) of the house prices and accelerate the training process, we discretize the house price values into 10 levels and


formulate the training as a 10-category classification task. A similar strategy was adopted in Zhang et al. (2019). The pre-trained model is then used to extract 512-dimensional features from each image, which are considered an efficient visual representation of the indoor or outdoor scene depicted in the image. Here, we take only the Greater Boston Area as a case study, but the framework is expected to be applicable to other cities. With such a high-dimensional feature representation, the scenery of all collected photos can be represented comprehensively. We conduct the training process for house photos and street view images separately. Because training with high-dimensional features (especially for the images) is time-consuming, we adopt principal component analysis (PCA) to reduce the feature dimension while preserving the major feature characteristics. For each image, the first twenty components, which account for about 60% of the total explained variance, are retained as the image feature. Note that the major feature characteristics retained may vary across cities due to different urban environments and place-specific spatial dependency. The first 20 components selected here represent the visual scenery of the built environment specifically in the Greater Boston Area. Finally, we average the image features from multiple images associated with the same house (for both house photos and street view images).

2.4. Modeling algorithms

Two spatial analysis units are used in the experiment: the fine-scale point level and the aggregated neighborhood level. We assume that the proposed framework serves two kinds of purposes, reflecting the different demands of two groups. For house buyers and the real estate industry, good machine learning models for individual house prices are more informative because of their high accuracy in appreciation estimation. The higher the model accuracy, the higher the users' satisfaction. Therefore, a fine-scale prediction of house value appreciation with advanced machine learning models is essential (Law et al., 2018; Hu et al., 2019). In comparison, economists, geographers, and policy makers are more interested in analyzing the macroscopic trend of house prices and in discussing the hidden economic and geographic factors influencing house price appreciation. Accuracy is not the only metric to consider when choosing the best model; a macroscopic perspective on the real estate appreciation rate may be more helpful. The efficacy of geographically weighted regression (GWR), which can explain the spatial heterogeneity of variables in regression, has been demonstrated in house price modeling (Cao et al., 2019; Wu et al., 2019; Liu et al., 2020). Therefore, spatially explicit models such as the GWR at the neighborhood-scale are favored.

2.4.1. Fine-scale level

At the fine-scale point level, all properties are treated equally with the entire set of the abovementioned features. We compare the multiple linear regression (MLR) approach with one machine learning approach, the gradient boosting machine (GBM) with decision trees (Friedman, 2001), to test the efficiency of the proposed framework. Although there are various machine learning methods, we use only the GBM as a representative machine learning model to compare with the linear regression model, for two reasons. First, the accuracy and efficiency of GBM have been demonstrated in various prediction tasks (Natekin and Knoll, 2013). Second, the main focus of this paper is to explore whether the extended data features can provide useful information for house price appreciation rate prediction, not to determine which machine learning algorithm performs best. We conduct k-fold cross-validation, in which the data are split into a training dataset and a testing dataset, to mitigate overfitting in model training and prediction. The importance of each variable for GBM is also recorded to provide helpful suggestions for decision making.

2.4.2. Neighborhood level

At the neighborhood level (Fig. 3), the average values of all features for the properties in a CBG are calculated as the feature set for that CBG. Ordinary least squares regression (OLS) and geographically weighted regression (GWR) are used to estimate the variables that influence the house price appreciation rate. Compared with a global regression model, which ignores spatial non-stationarity and illustrates only the global impacts of variables, the GWR model constructs spatial relationships between independent and dependent variables with the following equation (Fotheringham et al., 2003):

R_i = \alpha_0(u_i, v_i) + \sum_{k=1}^{m} a_k(u_i, v_i) X_k(u_i, v_i) + \varepsilon_i    (2)

where R_i refers to the house price appreciation rate at location i, whose coordinate is (u_i, v_i); \alpha_0(u_i, v_i) refers to the intercept parameter at location i; a_k(u_i, v_i) refers to the local regression coefficient for the kth independent variable at location i; X_k(u_i, v_i) refers to the kth attribute of location i; and \varepsilon_i indicates the random error. With this model, the derived coefficients may vary across the research area and show the spatial heterogeneity of impact factors.

The main assumption in our study is that by embedding new data sources, including house photos, street view images, human mobility data, and socioeconomic data, a model with better performance could be built. We expect such a model to achieve higher accuracy and provide better explanations of the reasons for house price appreciation. Therefore, the traditional hedonic model (Rosen, 1974) fed with only structural attributes and locational amenities is considered the baseline. New models are built by adding each extended data source to the baseline in turn. Finally, a hybrid model using all data sources is also tested.

2.5. Evaluation

Two metrics are used to evaluate model performance: the root mean square error (RMSE) and the coefficient of determination R^2. The RMSE is calculated as follows:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (r_{0i} - r_i)^2}    (3)

where r_{0i} is the actual house appreciation rate and r_i is the predicted price appreciation rate of house i. The R^2 is calculated as follows:

R^2 = \frac{1}{m} \sum_{i=1}^{m} \frac{(r_i - \bar{r})(r_{0i} - \bar{r}_0)}{\rho_r \rho_{r_0}}    (4)

where \bar{r} and \bar{r}_0 refer to the average values of the predicted and the observed house price appreciation rates, and \rho_r and \rho_{r_0} are the standard deviations of the predicted and the observed house price appreciation rates, respectively.

3. Experiment and results

We take the Greater Boston Area as the study area. As shown in Fig. 2, the red dots represent houses (fine-scale) and the blue polygons are the CBGs (neighborhood-level). In this study, there are 21,928 houses with 125,000 house photos and about 470,000 street view images in total. All the house-related datasets are spatially aggregated into the 867 CBGs based on their point-in-polygon relationship (Fig. 3).

We train the machine learning models with multi-source data. The RMSE and R^2 with k-fold cross-validation are calculated to evaluate the model performance by comparing the predicted and the actual values of the house price appreciation rate. We conduct the experiments at the fine-scale and at the neighborhood-scale, respectively.
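The evaluation protocol described above (k-fold cross-validation scored with the metrics of Eqs. (3) and (4)) can be illustrated with a minimal sketch. It is not the implementation used in this study: the one-feature least-squares fit is a stand-in for the MLR and GBM models, the data are synthetic, and the function names (`rmse`, `r2_eq4`, `fit_line`, `k_fold_scores`) are hypothetical.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error, following Eq. (3)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r2_eq4(observed, predicted):
    """Score following Eq. (4) as printed: the mean product of deviations
    divided by the product of the (population) standard deviations.
    Written this way, it coincides with the Pearson correlation between
    predicted and observed values."""
    m = len(observed)
    mean_o = sum(observed) / m
    mean_p = sum(predicted) / m
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / m)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / m)
    cov = sum((p - mean_p) * (o - mean_o) for o, p in zip(observed, predicted)) / m
    return cov / (std_p * std_o)


def fit_line(xs, ys):
    """Stand-in model (assumption): one-feature least squares, not MLR or GBM."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def k_fold_scores(xs, ys, k=5, seed=0):
    """Return the mean RMSE and mean Eq. (4) score over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
    rmses, scores = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        obs = [ys[i] for i in test_idx]
        pred = [slope * xs[i] + intercept for i in test_idx]
        rmses.append(rmse(obs, pred))
        scores.append(r2_eq4(obs, pred))
    return sum(rmses) / k, sum(scores) / k


# Synthetic example (assumption): appreciation decreases with log price.
rng = random.Random(42)
log_price = [rng.uniform(11.0, 15.0) for _ in range(200)]
rate = [1.2 - 0.06 * x + rng.gauss(0.0, 0.05) for x in log_price]
mean_rmse, mean_score = k_fold_scores(log_price, rate, k=5)
print(f"RMSE = {mean_rmse:.3f}, Eq. (4) score = {mean_score:.3f}")
```

Averaging the per-fold scores is one common convention; pooling all held-out predictions before scoring is an equally plausible reading of the protocol.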

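The neighborhood-level inputs follow from two simple steps: computing the appreciation rate of Eq. (1) for each house and averaging house-level features within each CBG. The sketch below illustrates both under stated assumptions: the point-in-polygon join is taken as already done, so each record carries a hypothetical `cbg_id` key, and the feature names and prices are invented for illustration.

```python
from collections import defaultdict


def appreciation_rate(price_2014, price_2019):
    """Eq. (1): relative price change between February 2014 and February 2019."""
    return (price_2019 - price_2014) / price_2014


def aggregate_by_cbg(houses, feature_names):
    """Average each feature over the houses that fall in the same CBG."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for house in houses:
        cbg = house["cbg_id"]  # hypothetical key: result of the spatial join
        counts[cbg] += 1
        for name in feature_names:
            sums[cbg][name] += house[name]
    return {
        cbg: {name: sums[cbg][name] / counts[cbg] for name in feature_names}
        for cbg in counts
    }


# Illustrative records only; the values are not taken from the study data.
houses = [
    {"cbg_id": "A", "rate": appreciation_rate(200000.0, 250000.0), "area": 100.0},
    {"cbg_id": "A", "rate": appreciation_rate(400000.0, 440000.0), "area": 200.0},
    {"cbg_id": "B", "rate": appreciation_rate(300000.0, 330000.0), "area": 150.0},
]
cbg_features = aggregate_by_cbg(houses, ["rate", "area"])
print(cbg_features)
```

A plain mean is used because Section 2.4.2 describes the CBG feature set as the average over its properties; CBGs with very few houses would yield noisy averages, which the sketch does not address.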

Fig. 3. Data distributions at census block group (CBG) level: (A) Average house appreciation rates. (B) The natural logarithm of house prices. (C) Average house price per square meter. (D) Number of visitors to each CBG. (E) Average mean travel time to other CBGs. (F) Population.

3.1. Fine-scale house price appreciation estimation

At the fine-scale, we take each house as the basic unit and conduct five experiments with different combinations of explanatory variables. The baseline experiment takes only house attributes and locational amenities as the explanatory variables. Then, four additional experiments are conducted by adding house photos, street view images, human mobility patterns, and socioeconomic factors to the baseline, one group at a time. Finally, we train the model using all variables.

Table 2
Model performance (RMSE) for different combinations of data sources using multiple linear regression (MLR) and gradient boosting machine (GBM) at the fine-scale point level.

RMSE                        MLR     GBM
Baseline                    0.111   0.082
Baseline + house photos     0.110   0.081
Baseline + street view      0.106   0.079
Baseline + mobility data    0.107   0.080
Baseline + socioeconomic    0.109   0.080
All data sources            0.103   0.077

Fig. 4. Model performance with R^2 in different combinations of data sources using multiple linear regression (MLR) and gradient boosting machine (GBM) at fine-scale point level.

Fig. 4 shows the scatter plots between the observed and the predicted house price appreciation rate. Table 2 lists the RMSE for all models. In general, the machine learning model using GBM (R^2 = 0.74; RMSE = 0.077) outperforms the MLR (R^2 = 0.48; RMSE = 0.103). This is expected, as the relationships between the house price appreciation rate and the features are not linear, and the decision tree-based machine learning approach can better model non-linear relationships among variables. Results also show that combining multiple data sources indeed improves the performance of the models. In particular, among the four single-source extensions, the models that incorporate street view images achieved the lowest RMSE (0.106 for MLR and 0.079 for GBM) and improved the R^2 to a large extent. Most importantly, the model incorporating all the variables achieved the best performance. This indicates that the four groups of variables characterize the appreciation value of a house from different perspectives and contribute differently to the variation of the house


price appreciation rates.

Moreover, we ask which variables contribute most to modeling the house price appreciation rate. The variable importance is calculated by the GBM. Fig. 5 ranks the top 20 variables of the model. In addition, we calculated the correlation coefficients between these variables and the house price appreciation rate to explore how these factors influence it. Fig. 6 shows the Pearson correlation coefficients of several selected variables with p-values less than 0.01, which means that they are statistically significant.

The results show that the logarithm of house price is the most important variable among all models, with a correlation coefficient of −0.55, indicating that within the last five years, low-cost houses had a higher price appreciation in the Greater Boston Area. The type of house (townhouse, single family house, etc.) and the house area (with an absolute correlation coefficient of 0.49) also have great impacts on house price appreciation. This illustrates that structural attributes can influence not only house prices, as in the traditional hedonic pricing model, but also the house price appreciation rate. Moreover, we noticed that the street view image features are among the most important variables for all the models. Among them, the average value of the third component of visual features (represented as StreetView PC3 AVG), which contributes greatly to the scenery captured by street view images, has a moderate influence on the house price appreciation rate, with a negative correlation of −0.30. In addition, the impacts of the ninth, fifth, seventh, first

higher the appreciation rate. Interestingly, the study corroborates what is widely discussed in real estate studies: proximity to amenities matters, proximity to transportation hubs matters, shorter travel time matters, and physical quality of the surroundings matters.

3.2. Neighborhood-scale house price appreciation estimation

The relationship between the independent variables and the house price appreciation rate may vary over space due to spatial non-stationarity (Fotheringham et al., 2003). To investigate how spatial relationships change across the research area, we employed geographically weighted regression (GWR) at the neighborhood-scale and compared the results with global multiple linear regression (MLR).

Fig. 7 shows the performance of the two models with data from various sources. Similar to the results at the fine-scale, the GWR achieves a better performance (R^2 = 0.774) than the MLR (R^2 = 0.608), which confirms the spatial heterogeneity of the studied phenomenon over the research area.

Fig. 8 (A) shows that the coefficient of determination (R^2) of the GWR model is generally consistent across the study area. However, in the Boston downtown area, the R^2 is slightly lower than in the surrounding area (about 0.70 vs. 0.78). This indicates that the downtown area is a more complex region, where more latent factors are required to determine house appreciation rates. Fig. 7(B)–(F) depict several selected
and thirteenth components of visual features also ranked in the top 20 correlation coefficient distributions over the study area. Results show
among all variables. Though it is hard to explain the specific meaning of that the logarithm house price (changes from − 0.30 to about − 0.55)
these visual features, it indeed indicates that high-level visual features and the distance to metro (changes from − 0.05 to about − 0.25) have
could capture parts of important perspectives that are related to real weak to moderate negative effects on the appreciation rate of house
estate appreciation rate. The results support our hypothesis that the prices and such a relationship change spatially. In contrast, the effect of
detailed visual information of the house surrounding environment plays mean travel time (vary from about − 0.05 to 0.125), distance to hospital
an important role in real estate appreciation evaluation as the street (vary from about − 0.12 to 0.04), and distance to university varied (vary
view images contain the overall environment of a neighborhood (Li from about − 0.05 to 0.125) from negatively to positively across the
et al., 2015; Gebru et al., 2017). Besides, for locational amenities, a study area. For instance, in the southeast region, the closer to a hospital,
house that is closer to a school ( − 0.13), amusement park ( − 0.21), the lower the house appreciation rate (positive correlation of about
metro station ( − 0.19), hospital ( − 0.09), or surrounded by more res­ 0.04). Whereas for other regions, a house price appreciation rate in­
taurants (0.01), may have a higher price appreciation rate. Similarly, the crease is associated with a decrease in the distance to hospital (negative
mean travel time that reflects the transportation convenience of a coefficient of about − 0.12). Results of the GWR model indicate that
neighborhood is negatively correlated with house price appreciation house price appreciation rates in Boston have spatial heterogeneous
rate, which means the less the mean travel time to other regions, the patterns. In other words, the coefficients of each variable and their

Fig. 5. Importance of top 20 variables using GBM with all data sources.
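
To make the comparison behind Figs. 4 and 5 concrete, the following is a minimal sketch, in plain Python, of a gradient boosting model built from single-split regression stumps, compared against a least-squares linear baseline, together with a simple gain-based variable importance. It is not the authors' implementation: the synthetic data, the feature names `log_price` and `noise_feature`, the number of rounds, the learning rate, the in-sample evaluation, and the importance definition (summed squared-error reduction per feature) are all assumptions made for illustration.

```python
import math

def r2_rmse(y_true, y_pred):
    """Return (R2, RMSE) of predictions against observed values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)

def fit_linear(X, y):
    """Multiple linear regression with intercept via the normal equations."""
    rows = [[1.0] + list(row) for row in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], reduced with Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(k):
            if r != i:
                A[r] = [a - A[r][i] * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] for i in range(k)]
    return lambda row: beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def best_stump(X, residuals):
    """Best single split: (gain, feature index, threshold, left mean, right mean)."""
    mean_r = sum(residuals) / len(residuals)
    sse_parent = sum((r - mean_r) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[1:]:
            left = [r for row, r in zip(X, residuals) if row[j] < thr]
            right = [r for row, r in zip(X, residuals) if row[j] >= thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse_parent - sse > best[0]:
                best = (sse_parent - sse, j, thr, lm, rm)
    return best

def fit_boosting(X, y, n_rounds=60, learning_rate=0.5):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        gain, j, thr, lm, rm = best_stump(X, residuals)
        importance[j] += gain  # credit the error reduction to the split feature
        stumps.append((j, thr, lm, rm))
        pred = [p + learning_rate * (lm if row[j] < thr else rm)
                for p, row in zip(pred, X)]
    def predict(row):
        return base + learning_rate * sum(
            lm if row[j] < thr else rm for j, thr, lm, rm in stumps)
    return predict, importance

# Synthetic data (an assumption): the target decays non-linearly with a
# price-like feature; the second feature carries almost no signal.
feature_names = ["log_price", "noise_feature"]
X = [[i / 30, (7 * i % 13) / 13] for i in range(60)]
y = [math.exp(-2.0 * x1) + 0.05 * x2 for x1, x2 in X]

linear = fit_linear(X, y)
boosted, importance = fit_boosting(X, y)
linear_r2, linear_rmse = r2_rmse(y, [linear(row) for row in X])
boosted_r2, boosted_rmse = r2_rmse(y, [boosted(row) for row in X])
print(f"linear baseline: R2={linear_r2:.3f}, RMSE={linear_rmse:.3f}")
print(f"boosted stumps : R2={boosted_r2:.3f}, RMSE={boosted_rmse:.3f}")
for name, gain in sorted(zip(feature_names, importance), key=lambda p: -p[1]):
    print(f"{name}: importance share {gain / sum(importance):.3f}")
```

Because the toy target is a curved function of the first feature, the boosted stumps should fit it more closely than the straight-line baseline, and nearly all of the importance should fall on that feature. A real analysis would evaluate on held-out data rather than in-sample.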


Fig. 6. Correlation coefficients of variables selected at fine-scale point level.
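
The screening behind Fig. 6 (Pearson correlation coefficients retained only when p < 0.01) can be sketched as follows. This is an illustrative sketch in plain Python, not the authors' code: the toy variables `log_price` and `restaurant_count`, their values, and the use of the Fisher z-transformation as a large-sample approximation of the two-sided p-value are assumptions; the paper does not state how its p-values were computed.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z-transformation (approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical toy data: the appreciation rate falls with the log price,
# plus a small deterministic wiggle standing in for noise.
n = 50
log_price = [12.0 + 0.04 * i for i in range(n)]
restaurant_count = [float(7 * i % 11) for i in range(n)]
rate = [0.9 - 0.05 * p + 0.02 * math.sin(i) for i, p in enumerate(log_price)]

variables = {"log_price": log_price, "restaurant_count": restaurant_count}
results = {}
for name, values in variables.items():
    r = pearson_r(values, rate)
    results[name] = (r, approx_p_value(r, n))

# Keep only the variables whose correlation is significant at p < 0.01.
significant = {name: rp for name, rp in results.items() if rp[1] < 0.01}
for name, (r, p) in results.items():
    print(f"{name}: r={r:.2f}, p={p:.4f}")
```

An exact test would use the t-distribution with n − 2 degrees of freedom; the normal approximation is used here only to keep the sketch dependency-free.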

Fig. 7. Model performance with R2 in different combinations of data sources using multiple linear regression (MLR) and geographically weighted regression (GWR)
at aggregated-neighborhood level.

Fig. 8. Spatial distribution of GWR coefficients at the neighborhood scale.
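
The idea behind the coefficient surfaces in Fig. 8 can be illustrated with a minimal geographically weighted regression sketch in plain Python: at each regression location, a weighted least-squares fit is computed in which nearby observations receive larger weights. The Gaussian kernel, the fixed bandwidth, the single covariate, and the synthetic grid whose true slope changes from west to east are assumptions for illustration; they are not the authors' settings or data.

```python
import math

def weighted_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

def gwr_at(location, coords, x, y, bandwidth):
    """Local fit at `location` using Gaussian distance-decay weights."""
    weights = [math.exp(-0.5 * (math.dist(location, c) / bandwidth) ** 2)
               for c in coords]
    return weighted_fit(x, y, weights)

# Synthetic 10 x 10 grid (an assumption): the true slope of the covariate
# rises linearly from -0.5 in the west (col 0) to +0.5 in the east (col 9).
coords = [(col, row) for col in range(10) for row in range(10)]
x = [(7 * row % 10) / 10 for col, row in coords]
y = [1.0 + (-0.5 + col / 9) * xi for (col, row), xi in zip(coords, x)]

# A global linear model averages the opposing local effects away.
_, global_slope = weighted_fit(x, y, [1.0] * len(x))
# Local fits recover the sign change across space.
_, west_slope = gwr_at((0, 4.5), coords, x, y, bandwidth=1.5)
_, east_slope = gwr_at((9, 4.5), coords, x, y, bandwidth=1.5)
print(f"global slope: {global_slope:.3f}")
print(f"west slope  : {west_slope:.3f}")
print(f"east slope  : {east_slope:.3f}")
```

In this toy setting the global slope is close to zero while the local slopes have opposite signs at the two edges, which mirrors the kind of sign change reported for distance to hospital. In practice the bandwidth would be chosen by cross-validation or an information criterion rather than fixed by hand.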


impacts on house price appreciation rates vary across space and should be modeled place by place. Hence, it is necessary to explicitly embed spatial relationships in the predictive modeling of the house price appreciation rate.

3.3. Model and determinants analysis

Results of this study show promising findings in the estimation of the house price appreciation rate. We compared a series of models at two spatial scales, interpreted the results of these models, and explained the spatial patterns of the house price appreciation rates. Unlike the MLR, which mostly models linear relationships, the decision tree-based machine learning method can build non-linear relationships between features and the house price appreciation rate, and the GWR can model spatial non-stationarity between the variables; together they provide better prediction and a more holistic explanation.

We also examined the importance and the impacts of the factors related to house price appreciation at the fine scale and the neighborhood scale. The emerging sources of house photos, street view images, human mobility patterns, and socioeconomic attributes enable us to examine the house price appreciation rate comprehensively from various aspects. Results show that by combining the visual scenery of a house, the built environment, dynamic human mobility patterns, and socioeconomic attributes of neighborhoods with machine learning approaches, the estimation accuracy of the price appreciation rate can be improved by a large margin. Among them, high-dimensional visual features extracted from street view images can provide important information related to the house price appreciation rate. Such visual features capture intangible information that was not explored and discussed before. By better quantifying high-level semantic information from visual features, the procedure of policy making might be improved with these new insights.

In addition, several interesting findings emerge from this study. For instance, at the fine scale, we found that houses with lower prices and smaller areas may have higher appreciation potential. We dig into this finding further and attempt to quantify and explain this relationship.

As shown in Fig. 9, the logarithm of house prices in 2014 and the actual house price appreciation rates follow an exponential decay with λ = −2.758. This means that as the logarithm of house prices increases, the rate of appreciation of house prices becomes lower and the decay slope becomes flatter. The reasons might be traced to two aspects. On the one hand, a greater percentage increase is not equal to a greater actual increase in price. Therefore, houses may have a larger price increase but a smaller percentage increase. On the other hand, there are fewer houses with high prices compared with medium- and low-priced houses. For houses with high prices, since their prices are already at a high level, there is limited room for appreciation, and their prices are therefore more stable.

In addition, we conducted a correlation analysis between house area and house prices, and the result shows that these two variables are significantly and highly correlated, with a coefficient of 0.62. Since houses with low prices typically have a small area, the correlation coefficient between house price appreciation rates and house area is negative as well (correlation coefficient −0.49). Therefore, compared with houses with high prices, houses with low prices and small areas may have greater house price appreciation rates.

Besides, the more convenient a house's access to nearby facilities and the higher its transportation accessibility, the higher the house price appreciation rate. At the neighborhood scale, the results show that the effects of the variables are spatially heterogeneous and that their influences on the distribution of the house price appreciation rate differ. The coefficients of variables such as distance to hospital and mean travel time even range from negative to positive. Therefore, it is necessary to model the spatial relationships between the house price appreciation rate and these variables to better interpret the underlying factors.

4. Discussion

4.1. Implications of policies

Understanding the variability and dynamic changes of house price appreciation rates is crucial for government policy decision making. On the one hand, house price appreciation rates are closely related to various groups of people in cities, such as newly married couples, workers, and young people who need school district housing. Hence, information related to house price appreciation rates can provide guidance for their daily lives and house buying. It also helps policy makers plan housing development to fit the distribution of job opportunities and educational facilities. On the other hand, the paper addresses the relationships between several factors and house prices, which might be helpful for sustainable city planning and urban infrastructure construction. The results and conclusions may help the government act in a more coordinated manner, with empirical data support, in urban development. In addition, the data-driven paradigm and advanced machine learning methods show potential in providing insights for decision making. For instance, street view images can be employed as a useful tool for urban environment observation and monitoring. The urban environment and neighborhood scenery can be captured comprehensively and processed efficiently with deep learning algorithms, which will benefit people in cities. The fusion of different aspects of big data also illustrates the data-driven paradigm in discovering and addressing the development of cities. Therefore, it is helpful for policy makers to understand the dynamics of housing appreciation rates when formulating housing policies.

4.2. Limitations and future directions

Here, we also discuss several limitations of this work that deserve more attention in future studies. Firstly, we only take one region (the Boston area) as a case study area. Since the built environment, human mobility patterns, and social class conditions of neighborhoods may vary in different cities, more regional factors can be taken into consideration in the future. For example, large vs. small, eastern vs. western, and coastal vs. inland cities can be considered to improve the generalization ability and replicability of the proposed framework for estimating house price appreciation rates.

Secondly, as we collected the house photos from VGI sources, the uncertainty of the data is a common concern for quality assurance. Models using house photos do not perform as well as those using the other three data sources, which might result from the low quality of sample data in capturing various scenery of a house. Instead, a better DCNN or other

Fig. 9. Fit curve between logarithm of house prices (2014) and house price appreciation rates.
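
The exponential-decay relationship summarized in Fig. 9 can be illustrated with a short log-linear least-squares fit. This is a sketch under stated assumptions, not the authors' fitting procedure: the functional form rate = a · exp(λ · x), the scale `TRUE_SCALE`, the shift of the log price to start at zero, and the synthetic, noise-free sample are hypothetical; only the decay constant λ = −2.758 is taken from the text.

```python
import math

def fit_exponential(x, y):
    """Fit y = a * exp(lam * x) by least squares on ln(y); requires y > 0."""
    n = len(x)
    log_y = [math.log(v) for v in y]
    mx, ml = sum(x) / n, sum(log_y) / n
    lam = (sum((a - mx) * (b - ml) for a, b in zip(x, log_y))
           / sum((a - mx) ** 2 for a in x))
    return math.exp(ml - lam * mx), lam

TRUE_LAMBDA = -2.758  # decay constant reported in the text
TRUE_SCALE = 0.6      # hypothetical appreciation rate at the lowest log price

# Hypothetical log prices, shifted so that the lowest value maps to x = 0.
log_price = [5.0 + 0.1 * k for k in range(16)]
x = [p - 5.0 for p in log_price]
rate = [TRUE_SCALE * math.exp(TRUE_LAMBDA * xi) for xi in x]

a, lam = fit_exponential(x, rate)
# For this functional form, d(rate)/dx = lam * rate, so the curve flattens
# as the log price grows.
slopes = [lam * r for r in rate]
print(f"fitted a={a:.3f}, lambda={lam:.3f}")
```

With noisy observations, a non-linear least-squares fit on the original scale may be preferable, because the log transform reweights the errors.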


models that are not biased to samples and can differentiate complex visual features of different houses should be trained in order to improve the accuracy of the framework.

Thirdly, urban renewal may have occurred in Boston in the past 5 years. Though the house price data were collected between 2014 and 2019, the other attributes are limited to a specific time period. For example, the POI data provided by SafeGraph were collected in 2018, only human mobility data from 2018 were collected from the Uber Movement project, and the most recent street view images (which may range from 2012 to 2018 for a specific place) were harvested. However, urban construction and development also have impacts on house price appreciation rates. These issues have not been addressed in this paper due to the restrictions of the data sources. In the future, we expect to incorporate the dynamics of urban land use changes into the framework with richer datasets.

Lastly, although we attempt to understand the patterns of the house price appreciation rate, deeper exploration and more explanations could be added in future work. For example, the differences in appreciation rates between houses in different price ranges and in different geographic regions can be compared. Also, our framework focuses more on evaluating the value of several emerging data sources in house price appreciation and discovering the spatial distribution of house price appreciation rates with their spatial dependencies, while the causal relationships between variables and house price appreciation rates are also necessary for policy decision making. In the future, we will try to incorporate more time-series data and approaches from economics to build such relationships and improve the interpretability of deep learning models.

5. Conclusion

In summary, we present a multi-source data-fusion framework to estimate house price appreciation rates from various perspectives by utilizing several big geo-data sources and state-of-the-art machine learning approaches. In particular, we extract high-level visual features from street view images and house photos using deep learning methods to depict the inner and outer appearance of houses, which have certain impacts on house price appreciation rates.

This study offers insights into the potential of machine learning and spatial statistical approaches in modeling complex urban environments using multi-source geospatial big data. The contribution of the study is threefold. First, we propose to predict the house price appreciation rate, which differs notably from existing research on absolute price estimation. Second, we build a big-data-driven multi-feature-fusion framework that utilizes various data sources from different aspects, especially visual features extracted from house photos and street views, in order to enrich the knowledge of house price appreciation modeling with state-of-the-art machine learning approaches at two spatial units. Third, we focus not only on improving the accuracy of a model, but also on seeking explanations for which factors influence the property value, to provide suggestions for housing policies. Our research integrates computer science and social science research by utilizing advanced techniques with emerging data sources, and could provide new insights for researchers in economics, geography, and urban planning towards future land-use studies.

Conflict of interest

None declared.

Acknowledgement

The funding support for this research is provided by the National Natural Science Foundation of China under Grants 41901321 and 41671378, the Office of Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation, and the Trewartha Research Award, Department of Geography, University of Wisconsin-Madison. The authors would like to thank Timothy Prestby, UW-Madison, for his generous help with proofreading. We thank SafeGraph for providing anonymous mobile location data and POI visit patterns. The authors would also like to gratefully thank the members of the MIT Senseable City Lab Consortium: RATP, Dover Corporation, Teck Resources, Lab Campus, Anas S.p.A., Ford, SNCF Gares & Connexions, Brose, Allianz, ENEL Foundation, Laval, Curitiba, Stockholm, Amsterdam, Victoria State Government, KTH Royal Institute of Technology, UTEC (Universidad de Ingeniería y Tecnología), Politecnico di Torino, Austrian Institute of Technology, Fraunhofer Institute, Kuwait-MIT Center for Natural Resources, SMART (Singapore-MIT Alliance for Research and Technology), and AMS Institute for supporting this research.

References

Andrew, M., Meen, G., 2003. House price appreciation, transactions and structural change in the British housing market: a macroeconomic perspective. Real Estate Econ. 31 (1), 99–116.
Archer, W.R., Gatzlaff, D.H., Ling, D.C., 1996. Measuring the importance of location in house price appreciation. J. Urban Econ. 40 (3), 334–353.
Can, A., 1992. Specification and estimation of hedonic housing price models. Reg. Sci. Urban Econ. 22 (3), 453–474.
Cao, K., Diao, M., Wu, B., 2019. A big data-based geographically weighted regression model for public housing prices: a case study in Singapore. Ann. Am. Assoc. Geograph. 109 (1), 173–186.
Chau, K.W., Chin, T., 2003. A critical review of literature on the hedonic price model. Int. J. Housing Sci. Appl. 27 (2), 145–165.
Chen, L., Yao, X., Liu, Y., Zhu, Y., Chen, W., Zhao, X., Chi, T., 2020. Measuring impacts of urban environmental elements on housing prices based on multisource data – a case study of Shanghai, China. ISPRS Int. J. Geo-Inform. 9 (2), 106.
Chen, M., Liu, W., Lu, D., 2016. Challenges and the way forward in China's new-type urbanization. Land Use Policy 55, 334–339.
Crone, T.M., Voith, R.P., 1992. Estimating house price appreciation: a comparison of methods. J. Housing Econ. 2 (4), 324–338.
Du, Q., Wu, C., Ye, X., Ren, F., Lin, Y., 2018. Evaluating the effects of landscape on housing prices in urban China. Tijdsch. Econ. Soc. Geogr. 109 (4), 525–541.
Ellis, G., Hunter, R., Tully, M.A., Donnelly, M., Kelleher, L., Kee, F., 2016. Connectivity and physical activity: using footpath networks to measure the walkability of built environments. Environ. Plann. B: Plann. Des. 43 (1), 130–151.
Fotheringham, A.S., Brunsdon, C., Charlton, M., 2003. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley & Sons.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189–1232.
Fu, X., Jia, T., Zhang, X., Li, S., Zhang, Y., 2019. Do street-level scene perceptions affect housing prices in Chinese megacities? an analysis using open access datasets and deep learning. PLOS ONE 14 (5), e0217505.
Fu, Y., Xiong, H., Ge, Y., Zheng, Y., Yao, Z., Zhou, Z.H., 2016. Modeling of geographic dependencies for real estate ranking. ACM Trans. Knowl. Discov. Data 11 (1), 11.
Gao, S., Janowicz, K., Couclelis, H., 2017a. Extracting urban functional regions from points of interest and human activities on location-based social networks. Trans. GIS 21 (3), 446–467.
Gao, S., Li, L., Li, W., Janowicz, K., Zhang, Y., 2017b. Constructing gazetteers from volunteered big geo-data based on hadoop. Comput. Environ. Urban Syst. 61, 172–186.
Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E.L., Fei-Fei, L., 2017. Using deep learning and Google street view to estimate the demographic makeup of neighborhoods across the united states. Proc. Natl. Acad. Sci. U.S.A. 114 (50), 13108–13113.
Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal 69 (4), 211–221.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778.
Hu, L., He, S., Han, Z., Xiao, H., Su, S., Weng, M., Cai, Z., 2019. Monitoring housing rental prices based on social media: an integrated approach of machine-learning algorithms and hedonic modeling to inform equitable housing policies. Land Use Policy 82, 657–673.
Hung, S.Y., Tu, C., 2008. An examination of housing price appreciation in California and the impact of alternative mortgage instruments. J. Housing Res. 17 (1), 33–47.
Law, S., Paige, B., Russell, C., 2018. Take a Look Around: Using Street View and Satellite Images to Estimate House Prices. arXiv:1807.07155.
Li, X., Zhang, C., Li, W., Kuzovkina, Y.A., Weiner, D., 2015. Who lives in greener neighborhoods? the distribution of street greenery and its association with residents' socioeconomic conditions in Hartford, Connecticut, USA. Urban Forest. Urban Green. 14 (4), 751–759.
Liu, F., Min, M., Zhao, K., Hu, W., 2020. Spatial-temporal variation in the impacts of urban infrastructure on housing prices in Wuhan, China. Sustainability 12 (3), 1281.


Liu, X., Andris, C., Huang, Z., Rahimi, S., 2019a. Inside 50,000 living rooms: an assessment of global residential ornamentation using transfer learning. EPJ Data Sci. 8 (1), 4.
Liu, Y., Wang, F., Xiao, Y., Gao, S., 2012. Urban land uses and traffic 'source-sink areas': evidence from GPS-enabled taxi data in Shanghai. Landsc. Urban Plann. 106 (1), 73–87.
Liu, Z., Yang, A., Gao, M., Jiang, H., Kang, Y., Zhang, F., Fei, T., 2019b. Towards feasibility of photovoltaic road for urban traffic-solar energy estimation using street view image. J. Clean. Prod. 228, 303–318.
Livy, M.R., 2017. The effect of local amenities on house price appreciation amid market shocks: the case of school quality. J. Housing Econ. 36, 62–72.
Murray, A.T., Wu, X., 2003. Accessibility tradeoffs in public transit planning. J. Geograph. Syst. 5 (1), 93–107.
Natekin, A., Knoll, A., 2013. Gradient boosting machines, a tutorial. Front. Neurorobot. 7, 21.
Neilson, G.K., Fowler, W.K., 1972. Relation between transit ridership and walking distances in a low-density Florida retirement area. Highway Res. Rec. (403).
Quercia, R., McCarthy, G., Ryznar, R., Can Talen, A., 2000. Spatio-temporal measurement of house price appreciation in underserved areas. J. Housing Res. 11 (1), 1–28.
Rosen, S., 1974. Hedonic prices and implicit markets: product differentiation in pure competition. J. Pol. Econ. 82 (1), 34–55.
Wu, C., Ren, F., Hu, W., Du, Q., 2019. Multiscale geographically and temporally weighted regression: exploring the spatiotemporal determinants of housing prices. Int. J. Geograph. Inform. Sci. 33 (3), 489–511.
Yao, Y., Zhang, J., Hong, Y., Liang, H., He, J., 2018. Mapping fine-scale urban housing prices by fusing remotely sensed imagery and social media data. Trans. GIS 22 (2), 561–581.
You, Q., Pang, R., Cao, L., Luo, J., 2017. Image-based appraisal of real estate properties. IEEE Trans. Multimedia 19 (12), 2751–2759.
Yue, Y., Zhuang, Y., Yeh, A.G., Xie, J.Y., Ma, C.L., Li, Q.Q., 2017. Measurements of poi-based mixed use and their relationships with neighbourhood vibrancy. Int. J. Geograph. Inform. Sci. 31 (4), 658–675.
Zhang, F., Wu, L., Zhu, D., Liu, Y., 2019. Social sensing from street-level imagery: a case study in learning spatio-temporal urban mobility patterns. ISPRS J. Photogram. Rem. Sens. 153, 48–58.
Zhang, F., Zhang, D., Liu, Y., Lin, H., 2018a. Representing place locales using scene elements. Comput. Environ. Urban Syst. 71, 153–164.
Zhang, F., Zhou, B., Liu, L., Liu, Y., Fung, H.H., Lin, H., Ratti, C., 2018b. Measuring human perceptions of a large-scale urban region using machine learning. Landsc. Urban Plann. 180, 148–160.
Zhang, Y., Dong, R., 2018. Impacts of street-visible greenery on housing prices: evidence from a hedonic price model and a massive street view image dataset in Beijing. ISPRS Int. J. Geo-Inform. 7 (3), 104.
Zhang, F., Zu, J., Hu, M., Zhu, D., Kang, Y., Gao, S., Zhang, Y., Huang, Z., 2020. Uncovering inconspicuous places using social media check-ins and street view images. Comput. Environ. Urban Syst. 81, 101478.
