Image Based Appraisal of Real Estate Properties: Quanzeng You, Ran Pang, Liangliang Cao, and Jiebo Luo, Fellow, IEEE
Images reviews [3] and so on. Generally speaking, there are several
different types of appraisal values. In particular, we are interested
in the market value, which refers to the trade price
in a competitive Walrasian auction setting [4]. Today, people
are likely to trade through real estate brokers, who provide
easy-to-access websites for browsing real estate properties
in an interactive and convenient way. Fig. 1 shows an example
arXiv:1611.09180v2 [cs.CV] 27 Jul 2017
performance on many computer vision related tasks, e.g. digit recognition [16], [17], image classification [18], [19], aesthetics estimation [20] and scene recognition [21]. These systems suggest that deep learning is very effective in learning robust features in a supervised or unsupervised fashion. Even though deep neural networks may be trapped in local optima [22], [23], using different optimization techniques, one can achieve state-of-the-art performance on many of the challenging tasks mentioned above.

Inspired by the recent successes of deep learning, in this work we are interested in solving the challenging real estate appraisal problem using deep visual features. In particular, for image-related tasks, Convolutional Neural Networks (CNNs) are widely used because their convolutional layers take into consideration the locations and neighbors of image pixels, which are important for capturing useful features for visual tasks. Convolutional Neural Networks [24], [18], [19] have proven very powerful in solving computer vision related tasks.

We intend to employ pictures for the task of real estate price estimation. We want to know whether visual features, which reflect a real estate property, can help estimate its price. Intuitively, if visual features can characterize a property in a way similar to human beings, we should be able to quantify the house features using those visual responses. Meanwhile, real estate properties are closely related to their neighborhood. In this work, we develop algorithms which rely only on 1) the neighbor information and 2) the attributes from pictures to estimate real estate property price. To preserve the local relation among properties, we employ a novel approach that uses random walks to generate house sequences. In building the random walk graph, only the locations of houses are utilized. In this way, the problem of real estate appraisal is transformed into a sequence learning problem. The Recurrent Neural Network (RNN) is specifically designed to solve sequence-related problems. Recently, RNNs have been successfully applied to challenging tasks including machine translation [25], image captioning [26], and speech recognition [27]. Inspired by the success of RNNs, we deploy an RNN to learn regression models on the transformed problem.

The main contributions of our work are as follows:
• To the best of our knowledge, we are the first to quantify the impact of visual content on real estate price estimation. We attribute the possibility of our work to the newly designed computer vision algorithms, in particular Convolutional Neural Networks (CNNs).
• We employ random walks to generate house sequences according to the locations of each house. In this way, we are able to transform the problem into a novel sequence prediction problem, which is able to preserve the relation among houses.
• We employ the novel Recurrent Neural Networks (RNNs) to predict real estate property prices and achieve accurate results.

II. RELATED WORK
Real estate appraisal has been studied by both real estate industry professionals and academic researchers. Earlier work focused on building price indexes for real properties. The seminal work in [6] built a price index according to the repeat prices of the same property at different times. They employed regression analysis to build the price index, which shows good performance. Another widely used regression model, Hedonic regression, is developed on the assumption that the characteristics of a house can predict its price [7], [8]. However, it is argued that the Hedonic regression model requires more assumptions in terms of explaining its target [28]. The authors of [28] also mention that for the repeat sales model, the main problem is lack of data, which may lead to failure of the model. Recent work in [9] employed locations and sale price series to build an autoregressive component. Their model is able to use both single-sale homes and repeat-sales homes, which can offer a more robust sale price index.

More studies have been conducted on employing feed-forward neural networks for real estate appraisal [29], [30], [31], [32]. However, their results suggest that neural network models are unstable, even when the same package is used across different runs [29]. The performance of neural networks is closely related to the features and data size [32]. Recently, Kontrimas and Verikas [33] empirically studied several different models on a selected 12-dimensional feature set, e.g. type of the house, size, and construction year. Their results show that linear regression outperforms neural networks on their selected 100 houses.

More recent studies in [1] propose a ranking objective, which takes geographical individual, peer and zone dependencies into consideration. Their method is able to use various estate-related data, which helps improve their ranking results based on properties' investment values. Furthermore, the work in [3] studied online users' reviews and mobile users' moving behaviors on the problem of real estate ranking. Their proposed sparsity-regularized learning model demonstrated competitive performance.

In contrast, we are trying to solve this problem using the attributes reflected in the visual appearances of houses. In particular, our model does not use the metadata of a house (e.g. size, number of rooms, and construction year). We intend to utilize the location information in a novel way such that our model is able to use state-of-the-art deep learning for feature extraction (Convolutional Neural Network) and model learning (Recurrent Neural Network).

III. RECURRENT NEURAL NETWORK FOR REAL ESTATE PRICE ESTIMATION
In this section, we present the main components of our framework. We describe how to transform the problem into one that can be solved by a Recurrent Neural Network. The architecture of our model is also presented.

A. Random Walks
One main feature of a real estate property is its location. In particular, houses in the same neighborhood tend to have similar extrinsic features including traffic, schools and so on. We build an undirected graph $G$ for all the houses collected, where each node $v_i$ represents the i-th house in our
data set. The similarity $s_{ij}$ between house $h_i$ and house $h_j$ is defined using the Gaussian kernel function, which is a widely used similarity measure¹:

$$ s_{ij} = \exp\left(-\frac{\mathrm{dist}(h_i, h_j)}{2\sigma^2}\right), \tag{1} $$

where $\mathrm{dist}(h_i, h_j)$ is the geodesic distance between houses $h_i$ and $h_j$, and $\sigma$ is a hyper-parameter that controls how quickly the similarity decays as the distance increases. In all of our experiments, we set $\sigma$ to 0.5 miles so that houses within 1.5 miles ($3\sigma$) of each other have a relatively large similarity. The ε-neighborhood graph [34] is employed to build $G$ in our implementation. We assign the weight of each edge $e_{ij}$ as the similarity $s_{ij}$ between house $h_i$ and house $h_j$.

¹ http://en.wikipedia.org/wiki/Radial_basis_function_kernel

Given this graph $G$, we can then employ random walks to generate sequences. In particular, each time we randomly choose one node $v_i$ as the root node and then jump to one of its neighboring nodes $v_j$ with probability proportional to the weight between $v_i$ and that neighbor. The probability of jumping to node $v_j$ is defined as

$$ p_j = \frac{e_{ji}}{\sum_{k \in N(i)} e_{ki}}, \tag{2} $$

where $N(i)$ is the set of neighbor nodes of $v_i$. We continue this process until we generate a sequence of the desired length. The use of random walks is mainly motivated by the recently proposed DeepWalk [35], which learns feature representations for graph nodes. It has been shown that random walks can capture the local structure of graphs. In this way, we can keep the local location structure of houses and build sequences for houses in the graph. Algorithm 1 summarizes the detailed steps for generating sequences from a similarity graph.

We have generated sequences by employing random walks. Each sequence contains a number of houses, which are related in terms of their locations. Since we build the graph on top of house locations, the houses within the same sequence are highly likely to be close to each other. In other words, the prices of houses in the same sequence are related to each other. We can employ this context for estimating real estate property prices, a task that can be solved by the recurrent neural network discussed in the following sections.

B. Recurrent Neural Network
With a Recurrent Neural Network (RNN), we are trying to predict the output sequence $\{y_1, y_2, \ldots, y_T\}$ given the input sequence $\{x_1, x_2, \ldots, x_T\}$. Between the input layer and the output layer, there is a hidden layer, which is usually estimated as in Eq.(3).

$$ h_t = \Delta(W_{hi} h_{t-1} + W_x x_t + b_h) \tag{3} $$

$\Delta$ represents some selected activation function or other complex architecture employed to process the input $x_t$ and $h_{t-1}$. One of the most widely deployed architectures is the Long Short-Term Memory (LSTM) cell [36], which can overcome the vanishing and exploding gradient problem [37] when training an RNN with gradient descent. Fig. 2 shows the details of a single Long Short-Term Memory (LSTM) block [38]. Each LSTM cell, which is also called a memory cell in that it is able to remember the error in the error propagation stage [39], contains an input gate, an output gate and a forget gate. In this way, LSTM is capable of modeling longer-range dependencies than conventional RNNs.

Fig. 2. An illustration of a single Long Short-Term Memory (LSTM) Cell.

For completeness, we give the detailed calculation of $h_t$ given input $x_t$ and $h_{t-1}$ in the following equations. Let $W_{\cdot i}$, $W_{\cdot f}$, $W_{\cdot o}$ represent the parameters related to the input, forget and output gates respectively. $\odot$ denotes the element-wise multiplication between two vectors. $\phi$ and $\psi$ are some selected activation functions and $\sigma$ is the fixed logistic sigmoid function. Following [38], [27], [40], we employ tanh for both $\phi$ in Eq.(6) and $\psi$ in Eq.(8).

$$ i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{4} $$
$$ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{5} $$
$$ c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{6} $$
$$ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{7} $$
$$ h_t = o_t \odot \psi(c_t) \tag{8} $$

C. Multi-layer Bidirectional LSTM
In previous sections, we have discussed the generation of sequences as well as the Recurrent Neural Network. Recall that we have built an undirected graph in generating the sequences, which indicates that the price of one house is related to all the houses in the same sequence, including those in the later part. The Bidirectional Recurrent Neural Network (BRNN) [41] has been proposed to enable the usage of both earlier and future contexts. In a bidirectional recurrent neural network, there is an additional backward hidden layer iterating from the last element of the sequence to the first. The output layer is calculated by employing both the forward and backward hidden layers.

Bidirectional-LSTM (B-LSTM) is a particular type of BRNN, where each hidden node is calculated by the long short-term memory as shown in Fig. 2. Graves et al. [40] have employed Bidirectional-LSTM for speech recognition. Fig. 3 shows the architecture of the bidirectional recurrent neural network. We have two Bidirectional-LSTM layers. During the forward pass of the network, we calculate the response of both the forward and the backward hidden layers in the 1st-LSTM and 2nd-LSTM layers respectively. Next, the output of each house (in our problem, the output is the price of each house) is calculated using the output of the 2nd-LSTM layer as input to the output layer.

Fig. 3. The Multi-layer Bidirectional Recurrent Neural Network (BRNN) architecture for real estate price estimation. There are two bidirectional recurrent layers in this architecture. For real estate price estimation, the price of each house is related to all houses in the same sequence, which is the main motivation to employ bidirectional recurrent layers.

generated i-th sequence and $\hat{y}_{ij}$ is the corresponding estimated price for this house.

When training our Multi-Layer B-LSTM model, we employ the RMSProp [42] optimizer, which is an adaptive method for automatically adjusting the learning rates. In particular, it normalizes the gradients by the average of their recent magnitudes.
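As an illustration of Eqs. (4)-(8) and of the RMSProp normalization described above, the following is a minimal sketch rather than the authors' implementation. It assumes a single-unit (scalar) LSTM cell so that it needs only the Python standard library; the function names `lstm_step` and `rmsprop_update`, the parameter dictionary layout, and the default values of `lr`, `decay` and `eps` are all hypothetical choices, not settings reported in the paper.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the gate activation sigma in Eqs. (4), (5) and (7)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit (scalar) LSTM cell following Eqs. (4)-(8).

    p is a dict of scalar weights and biases keyed by the subscripts used in
    the equations (e.g. p["W_xi"]); this layout is an illustrative assumption.
    """
    # Eq. (4): input gate
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    # Eq. (5): forget gate
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    # Eq. (6): cell state, with phi = tanh
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    # Eq. (7): output gate (uses the updated cell state c_t)
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    # Eq. (8): hidden output, with psi = tanh
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp step for a scalar parameter.

    cache is the running average of squared gradients; the gradient is divided
    by its square root. lr, decay and eps are assumed defaults.
    """
    cache = decay * cache + (1.0 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache
```

A bidirectional layer would run such a step once from the first house to the last and once in reverse, then combine the two hidden outputs; vector-valued cells replace the scalar products with matrix-vector products and the gate products with element-wise products.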
Fig. 4. Testing sequence $h_1 \to h_2 \to \cdots \to h_T$. Each testing sequence contains exactly one testing node. All the remaining nodes come from the training data.

IV. EXPERIMENTAL RESULTS
In this section, we discuss how to collect data and evaluate the proposed framework as well as several state-of-the-art approaches. In this work, all the data are collected from Realtor (http://www.realtor.com/), which is the largest realtor association in North America. We collect data from San Jose, CA, one of the most active cities in the U.S., and Rochester, NY, one of the least active cities in the U.S., over a period of one year. In the next section, we will discuss the details on how to preprocess the data for further experiments.

A. Data Preparation
The data collected from Realtor contains a description, school information and possible pictures of each real property, as shown in Fig. 1. We are particularly interested in employing the pictures of each house to conduct the price estimation. We filter out those houses without images in our data set. Since houses located in the same neighborhood seem to have similar prices, the location is another important feature in our data set. However, after an inspection of the data, we notice that some of the house prices are abnormal. Thus, we preprocess the data by filtering out houses with extremely high or low prices compared with their neighborhood.

TABLE I shows the overall statistics of our dataset after filtering. Overall, the city of San Jose has more houses than Rochester on the market (as expected for one of the hottest markets in the country). The house prices in the two cities also have significant differences. Fig. 5 shows some example house pictures from the two cities, respectively. From these pictures, we observe that houses whose prices are above average typically have larger yards and better curb appeal, and vice versa. The same can be observed among house interior pictures (examples not shown due to space).

Realtor does not provide the exact geo-location for each house. However, geo-location is important for us to build the ε-neighborhood graph for random walks. We employ the Microsoft Bing Map API (https://msdn.microsoft.com/en-us/library/ff701715.aspx) to obtain the latitude and longitude for each house given its collected address. Fig. 6 shows some of the houses in our collected data from San Jose and Rochester using the returned geo-locations from the Bing Map API.

According to these coordinates, we are able to calculate the distance between any pair of houses. In particular, we employ the well-known Vincenty distance (https://en.wikipedia.org/wiki/Vincenty's_formulae) to calculate the geodesic distances according to the coordinates. Fig. 7 shows the distribution of the distance between any pair of houses in our data set. The distance is less than 4 miles for most randomly picked pairs of houses. In building our ε-neighborhood graph, we assign an edge between any pair of houses whose distance is smaller than 5 miles (ε = 5 miles).

B. Feature Extraction and Baseline Algorithms
In our implementation, we experimented with the GoogLeNet model [43], which is one of the state-of-the-art deep neural architectures. In particular, we use the response from the last average-pooling layer as the visual features for each image. In this way, we obtain a 1,024-dimensional feature vector for each image. Each house may have several different pictures taken from different angles of the same property. We average the features of all the images of the same house (also known as average-pooling)² to obtain the feature representation of the house.

² We also tried max-pooling. However, the results are not as good as average-pooling. In the following experiments, we report the results using average-pooling.

We compare the proposed framework with the following algorithms.

1) Regression Model (LASSO): Regression models have been employed to analyze real estate price indexes [6]. Recently, the results in Fu et al. [3] show that sparse regularization can obtain better performance in real estate ranking. Thus, we choose to use LASSO (http://statweb.stanford.edu/~tibs/lasso.html), which is an l1-constrained regression model, as one of our baseline algorithms.

2) DeepWalk: DeepWalk [35] is another way of employing random walks for unsupervised feature learning of graphs. The main approach is inspired by distributed word representation learning. In using DeepWalk, we also use the ε-neighborhood graph with the same settings as the graph we built for generating sequences for B-LSTM. The learned features are also fed into a LASSO model for learning the regression weights. Indeed, DeepWalk can be thought of as a simpler version of our algorithm, where only the graph structure is employed to learn features. Our framework can employ both the graph structure and other features, i.e. visual attributes, for building the regression model.
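The graph construction and sequence generation described in the text (an ε-neighborhood graph with ε = 5 miles, Gaussian-kernel edge weights as in Eq. (1) with σ = 0.5 miles, and weight-proportional jumps as in Eq. (2)) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Algorithm 1: the function names `build_graph` and `random_walk` and the toy distance matrix are hypothetical, the pairwise distances are assumed to be precomputed in miles, and the kernel is written with a negative exponent so that similarity decays with distance as the text describes.

```python
import math
import random


def build_graph(dist, sigma=0.5, eps=5.0):
    """Build an epsilon-neighborhood graph from a symmetric distance matrix.

    dist[i][j] is the distance in miles between houses i and j. Returns a dict
    {i: {j: weight}} where weight = exp(-dist / (2 * sigma**2)) and an edge
    exists only if the distance is smaller than eps.
    """
    n = len(dist)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < eps:
                w = math.exp(-dist[i][j] / (2.0 * sigma ** 2))
                graph[i][j] = w  # undirected graph: store both directions
                graph[j][i] = w
    return graph


def random_walk(graph, root, length, rng):
    """Generate one house sequence of at most `length` nodes starting at root.

    At each step the walk jumps to a neighbor with probability proportional
    to the edge weight, as in Eq. (2).
    """
    walk = [root]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # isolated node: the walk cannot continue
            break
        nodes = list(neighbors)
        weights = [neighbors[v] for v in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Toy example: houses 0-2 are close together, house 3 is far from all others.
toy_dist = [
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.5, 9.5],
    [2.0, 1.5, 0.0, 8.0],
    [9.0, 9.5, 8.0, 0.0],
]
toy_graph = build_graph(toy_dist)
print(random_walk(toy_graph, root=0, length=6, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the generated sequences reproducible, which is convenient when the same walks must be reused across experiments.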
Fig. 5. Examples of house pictures of the two cities respectively. Top Row: houses whose prices (per Sqft) are above the average of their neighborhood.
Bottom Row: houses whose prices (per Sqft) are below the average of their neighborhood.
TABLE II
Prediction deviation of different models from the actual sale prices. Note that RNN-best is the upper-bound performance of the RNN-based model proposed in this work.
[Plot omitted; vertical axis: Number of House Pairs.]
With the above-mentioned similarity graph, we are able to generate sequences using random walks following the steps described in Algorithm 1. For each city, we randomly split the houses into training (80%) and testing (20%) sets. Next, we generate sequences using random walks on the training

The evaluation metrics employed are mean absolute error (MAE) and mean absolute percentage error (MAPE). Both of them are popular measures for evaluating the accuracy of prediction models. Eq.(10) and Eq.(11) give the definitions for these two metrics, where $p_i$ is the predicted value and $t_i$ is
[13] X. Jin, A. Gallagher, L. Cao, J. Luo, and J. Han, “The wisdom of social multimedia: using flickr for prediction and forecast,” in Proceedings of the international conference on Multimedia. ACM, 2010, pp. 1235–1244.
[14] Q. You, L. Cao, Y. Cong, X. Zhang, and J. Luo, “A multifaceted approach to social multimedia-based prediction of elections,” Multimedia, IEEE Transactions on, vol. 17, no. 12, pp. 2271–2280, Dec 2015.
[15] Q. You, S. Bhatia, and J. Luo, “A picture tells a thousand words - about you! user interest profiling from user generated visual content,” Signal Processing, vol. 124, pp. 45–53, 2016.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
[17] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[18] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in IJCAI. AAAI Press, 2011, pp. 1237–1242.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks.” in NIPS, vol. 1, no. 2, 2012, p. 4.
[20] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang, “Rapid: Rating pictorial aesthetics using deep learning,” in ACM MM. ACM, 2014, pp. 457–466.
[21] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014, pp. 487–495.
[22] G. Hinton, “A practical guide to training restricted boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
[23] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[25] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” ICLR, 2014.
[26] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015, pp. 3156–3164.
[27] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in ICASSP. IEEE, 2013, pp. 6645–6649.
[28] F. T. Wang and P. M. Zorn, “Estimating house price growth with repeat sales data: what’s the aim of the game?” Journal of Housing Economics, vol. 6, no. 2, pp. 93–118, 1997.
[29] E. Worzala, M. Lenk, and A. Silva, “An exploration of neural networks and its application to real estate valuation,” Journal of Real Estate Research, vol. 10, no. 2, pp. 185–201, 1995.
[30] P. Rossini, “Improving the results of artificial neural network models for residential valuation,” in Fourth Annual Pacific-Rim Real Estate Society Conference, Perth, Western Australia, 1998.
[31] P. Kershaw and P. Rossini, “Using neural networks to estimate constant quality house price indices,” Ph.D. dissertation, INTERNATIONAL REAL ESTATE SOCIETY, 1999.
[32] N. Nghiep and C. Al, “Predicting housing value: A comparison of multiple regression analysis and artificial neural networks,” Journal of Real Estate Research, vol. 22, no. 3, pp. 313–336, 2001.
[33] V. Kontrimas and A. Verikas, “The mass appraisal of the real estate by computational intelligence,” Applied Soft Computing, vol. 11, no. 1, pp. 443–448, 2011.
[34] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
[35] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in SIGKDD. ACM, 2014, pp. 701–710.
[36] F. Gers, “Long short-term memory in recurrent neural networks,” Unpublished PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2001.
[37] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in ICML, 2013, pp. 1310–1318.
[38] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with lstm recurrent networks,” The Journal of Machine Learning Research, vol. 3, pp. 115–143, 2003.
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[40] A. Graves, N. Jaitly, and A.-R. Mohamed, “Hybrid speech recognition with deep bidirectional lstm,” in Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2013, pp. 273–278.
[41] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” Signal Processing, IEEE Transactions on, vol. 45, no. 11, pp. 2673–2681, 1997.
[42] T. Tieleman and G. Hinton, “Lecture 6.5 - rmsprop, coursera: Neural networks for machine learning,” University of Toronto, Tech. Rep., 2012.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, June 2015.