Received 19 August 2022, accepted 21 September 2022, date of publication 27 September 2022, date of current version 7 October 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3210281

Graph Convolutional Network-Based Model for Megacity Real Estate Valuation
ZONGYAN YANG, ZHONGHUA HONG, (Member, IEEE), RUYAN ZHOU, AND HONG AI
College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
Corresponding author: Zhonghua Hong ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 41871325.

ABSTRACT It is challenging to assess real estate prices precisely because of high unit prices, complicated influencing factors, and ambiguous attribute selection. Owing to the high demand for owner-occupied and investment properties, real estate is also a substantial concern for society, and how to estimate its price accurately has been a hot research topic for major institutions. Real-world applications of real estate valuation impose stringent requirements on the acquisition of datasets and the generalizability of models. On the basis of SRGCNN, a spatial regression model with excellent generalizability, this paper introduces an external attention mechanism to construct the A-SRGCNN model and compares it with benchmark models using data from Shanghai, Melbourne, and San Diego. For spatial regression, A-SRGCNN employs graph convolutional neural networks, and the external attention mechanism implicitly considers the relationships between property data. Experiments indicate that the A-SRGCNN model outperforms the benchmark models and improves the accuracy of real estate price estimation. In addition, this paper employs the A-SRGCNN model to conduct zonal and time-division experiments on the secondary real estate market in Shanghai to analyze the real estate price linkages between different zones and at different times. The results reveal that Shanghai real estate prices exhibit spatial aggregation and price aggregation, with comparable prices within the same zones, and that the A-SRGCNN model is effective at predicting house prices.

INDEX TERMS Graph convolutional network, deep learning, real estate valuation, spatial analysis.

The associate editor coordinating the review of this manuscript and approving it for publication was Vlad Diaconita.

I. INTRODUCTION
Appraising real estate prices is of paramount importance for banks reviewing mortgage loans and for national real estate policy formulation. The timely and efficient valuation and forecasting of real estate prices not only brings significant direct economic benefits but also has tremendous political implications, and major banks, insurance companies, and think tanks are searching for a precise, speedy, and cost-effective mechanism for real estate valuation. Qualitative analysis and quantitative analysis are the two most common categories used to objectively assess real estate prices. Qualitative analysis concentrates on the economics of macro policies, market trends, and other factors, whereas quantitative analysis models the characteristics of house prices, such as real estate floor area, number of floors, and historical sales prices. Since qualitative analysis is influenced by subjective factors and is difficult to measure precisely and thoroughly, quantitative analysis is more accurate and credible. The two predominant research directions in quantitative analysis are the hedonic model and the machine learning model.

Hedonic models are the most frequently used models for valuating real estate prices. The hedonic model assumes that real estate consists of various characteristics that provide different utility to individuals, such as the size of the property, its location, its surroundings, and its potential for appreciation. The variations in the number of features and the manner in which they are combined determine the disparities in home prices. By decomposing the factors that influence real estate prices and calculating the prices implied by each factor, it is possible to valuate the prices of real estate based on the differences between properties. The hedonic framework was

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 104811
Z. Yang et al.: GCN-Based Model for Megacity Real Estate Valuation

originally employed by S. Rosen to examine the relationship between real estate prices and the living environment [1]. R. Meese et al. utilized hedonic regression models to valuate the dynamic impact of market fundamentals on real estate prices [2]. After many years of development, the hedonic model has become an established method of real estate price valuation, utilized in a large number of appraisal models and serving as a crucial foundation for bank loan approvals and government monetary policies. However, the hedonic method also has some drawbacks: its results can vary depending on the estimation formula or process selected, which increases the subjectivity of the appraisal, and it requires analysts with specialized knowledge as well as a large quantity of property price data.

Machine learning is the other research direction. Early machine learning models for estimating house values were relatively homogeneous and relied on straightforward statistical and mathematical techniques such as regression analysis. In multiple regression analysis, R. Dubin et al. used spatial regression techniques to estimate home prices [3]. However, this method ignores the impact of time variation. Real estate price changes can be thought of as a time series because real estate prices are affected by time characteristics as well. To forecast the growth of home prices in four US regions, R. Gupta et al. used a time series model with dynamic factor analysis and Bayesian shrinkage estimation [4]. Time series can also be incorporated into spatial regression models. To account for spatial and temporal heterogeneity, B. Huang et al. incorporated time effects into geographically weighted regression (GWR) models to assess house prices [5]. Although the forecasting and valuation performance of these methods is acceptable, their use in determining actual real estate prices is very limited. The variables influencing real estate prices are intricate, making it challenging to monitor price changes, and it is incredibly difficult for standard mathematical models to model estate prices accurately. Hedonic models have gained in popularity over the past few decades due to their affordability, accuracy, and complexity. Due to deep learning's strong computational capabilities and its many benefits in interdisciplinary fields, complex fitting is now possible, and real estate price valuation is beginning to use deep learning. Artificial neural networks (ANN) are widely used. H. Selim used an artificial neural network model to predict Turkey's real estate prices and found that it performed much better than the hedonic price model [6]. S. Peterson et al. applied ANN to a sizable sample of 46,467 residential records and found that, compared with the hedonic model, it performs better when there are many dummy variables, because parameter estimation for ANN does not depend on the rank of the regression matrix [7]. Because semi-supervised learning better takes advantage of the nonlinear relationships between the factors involved, Y. Guo et al. discovered that applying a semi-supervised learning strategy to ANN estate prediction can achieve similar or even superior performance compared to a fully supervised ANN method [8]. To estimate house prices in London, UK, S. Law et al. combined housing characteristics with a deep neural network model [9].

ANN does have some restrictions. When processing sequential data, such as changes in estate prices over time, ANN often struggles with the input order information that is required. Deeper information from estate attributes, such as the effect of the property's exterior appearance on the estate price, is also difficult for ANN to extract. Convolutional neural networks (CNN) and recurrent neural networks (RNN) can both be used to get around ANN's drawbacks. When J. Bin et al. used RNN in their estate value estimation model and contrasted it with non-machine learning models, they discovered that RNN performed better [10]. Architectural images are a significant factor in real estate prices as well. O. Poursaeed used CNN to examine the impact of the appearance of the estate on its price [11]. However, there are concerns as to the applicability of architectural images, and Stephen Law points out that different regions may have different aesthetics, which can have a varying impact on the price assessment of properties [9]. Further investigation revealed that the performance of the CNN and RNN models varied significantly depending on the type of dataset used. Long short-term memory (LSTM) and CNN were both used by L. Yu et al. to predict the price of used homes in Beijing [12]. They discovered that the LSTM model performed better when time-series data were used, while CNN performed much better when a dataset with deeply crawled feature factors was used. In other words, CNN can access more in-depth real estate data, whereas LSTM is better suited to handling time series such as estate price changes. A new trend involves combining spatial analysis and deep learning. With the use of AI techniques for geographic knowledge discovery, researchers started to look into bridging the gap between deep learning and spatial analysis methodologies, which expands the potential applications for estate price valuation methodologies. X. Xing et al. added the neighbor effect to a raster-based CNN to employ remotely sensed images for estimating the amount of human activity [13]. D. Zhu et al. theoretically demonstrated the possibility of utilizing graph convolutional neural networks (GCNN) to implement spatial regression and proposed spatial regression graph convolutional neural networks (SRGCNN) [14].

Real estate appraisals have also adopted the decision tree algorithm. Notably, Kok et al. [15] developed a decision tree valuation model that is superior to the hedonic price model (HPM) in valuing multi-family dwellings. The random forest algorithm can be constructed on the basis of the decision tree algorithm. The random forest algorithm comprises multiple regression trees, each decision tree in the forest is unrelated to the others, and the final output of the model is generated by all decision trees in the forest together. The random forest algorithm is thus a type of ensemble algorithm. Because the random forest algorithm's samples and features are
random, it is less likely to overfit than traditional decision tree algorithms and is also more precise when the sample size is large [16]. This is clearly more suitable for real estate valuation, where a large number of samples and characteristics exist, and therefore the random forest approach has become a common algorithm for real estate valuation. M. Ceh et al. [17] employed random forest machine learning techniques to make predictions on real estate sales data in the Slovenian capital from 2008 to 2013 and compared them with traditional HPM. The random forest method performed better on all effectiveness indicators. T. Dimopoulos et al. [18] compared the effectiveness of random forest and linear multiple regression in predicting apartment prices on real estate data in the Nicosia area of Cyprus. The random forest approach exhibited higher prediction accuracy, especially for models that included a sufficient number of independent variables. There is another type of ensemble algorithm, represented by the Boosting algorithm, in which there is a strong dependency between individual weak learners that must be generated serially. In their simulation of the Spanish real estate market, Alfaro-Navarro et al. [19] found that the boosting algorithm outperformed the individual tree approach, though overall the random forest approach had moderately superior performance. In addition, J.-L. Alfaro-Navarro pointed out that ensemble learning methods tend to be applied in a limited way to specific geographic areas, while the best models tend to differ from city to city. The combination of decision trees and boosting ideas gave birth to the GBDT algorithm, which inherits the advantages and mitigates the disadvantages of decision trees and boosting, and in turn handles the problem of overfitting well by integrating multiple decision trees through the gradient boosting method. Meanwhile, the dilemma of sequential training and the difficulty of parallelism common to boosting algorithms have been effectively resolved with the advancement of XGBoost and LightGBM (a framework for implementing the GBDT algorithm). Z. Peng et al. [20] utilized the XGBoost algorithm to construct a second-hand house price prediction model for Chengdu, China, and observed that the XGBoost algorithm outperformed multiple linear regression and decision tree algorithms and also had improved generalization and robustness.

When observing objects, individuals are more likely to concentrate their efforts on what is of greater interest. As research advances, some researchers have suggested that the attention mechanism, which can be considered a mechanism for reallocating resources based on the importance of activation, be added to machine learning to improve its accuracy. Attention analyzes the input content to determine the correlation between the elements, captures and amplifies the less notable but more important features to increase their influence weight in training, and allocates more computational resources to the more significant computational units. The attention mechanism quickly demonstrated significant benefits in areas such as computer vision (CV) and natural language processing (NLP). For instance, J. Liu et al. discovered that although LSTM performs excellently in recognizing 3D human actions, not all action joints have a positive effect on training, and some action joints produce a great deal of interference with training. They therefore added an attention mechanism to the original LSTM model in order to selectively focus on useful action sequences with the aid of global contextual information [21]. The attention mechanism can also be applied to real estate appraisal, as demonstrated by J. Bin et al. [22], who conducted a multimodal fusion appraisal of Los Angeles real estate based on attention and discovered that the appraisal model performed well after the introduction of the attention mechanism. A. Vaswani et al. [23] proposed a self-attention mechanism that can enhance the performance of the model, parallelize the computation, and significantly reduce the training time. This self-attention mechanism does not, however, account for the potential correlation between different samples. M.-H. Guo proposed external attention [24], which captures the global connections between data via external shared units and implicitly considers correlations between all sample data.

Taking the preceding review and the practical needs of real estate valuation into account, we presume that the A-SRGCNN model is appropriate for real estate valuation, since real estate price appraisal is a very typical spatial regression scenario and the SRGCNN appraisal model performs well in this scenario in comparison to older models [14]. Moreover, the external attention mechanism can delve deeper into the linkage between sample data, which corresponds to the close connection among real estate data, so the model should further improve the valuation on top of the SRGCNN valuation model's impressive performance. Real estate appraisal models have high requirements in terms of their ability to generalize over validation sets and different datasets. The acquisition of certain attributes for real estate data samples regularly presents some challenges. Consequently, the A-SRGCNN real estate appraisal model is constructed in this paper. The model is based on the spatial regression model SRGCNN, and the spatial regression algorithm shows good generalization ability and stable performance across different datasets. The most important parameter of the SRGCNN model is the spatial location of real estate, and the spatial information of real estate is often easier to obtain in reality. Building on the SRGCNN model, the A-SRGCNN approach incorporates an attention mechanism by adding an external attention layer before the final output. There are tight connections between real estate samples, and the external attention layer enhances the algorithm's accuracy by capturing the global connections between property samples via shared memory units. Accordingly, compared with popular regression valuation models, the A-SRGCNN proposed in this paper is more generalizable, performs stably on different samples, and the attributes of the required samples are easily available, which is more in line with the realistic needs of real
estate valuation, while taking into account the accuracy of the valuation.

The remaining sections of this paper are organized as follows. In Section II, a spatial regression graph convolution model (A-SRGCNN) for real estate price valuation is constructed based on an external attention mechanism. In Section III, the experimental dataset and parameter settings are presented. In Section IV, the experimental results are presented and analyzed. Section V concludes.

II. SPATIAL REGRESSION GRAPH CONVOLUTION NEURAL NETWORK REAL ESTATE PRICE VALUATION MODEL BASED ON EXTERNAL ATTENTION MECHANISM (A-SRGCNN)
A. SPATIAL REGRESSION GRAPH CONVOLUTION NEURAL NETWORK (SRGCNN)
The structure of our model is introduced using the settings of the Shanghai dataset as an example. Traditional spatial regression can model and interpret the spatial statistical relationships between dependent and independent variables. However, it relies on the assumption that attribute observations are complete when the spatial weight matrix is constructed, an assumption that missing data inevitably violate in practical applications. For instance, in real estate appraisal it is demanding to collect all attribute data for a property so that it can be valuated completely, which restricts the usage scenarios of spatial regression models. It is also difficult to capture nonlinear relationships among geographic attributes, and many such nonlinear relationships exist in real estate attributes [25], which unavoidably affects the accuracy of the assessment. Furthermore, the predetermined linear econometric form of spatial regression models reinforces the assumption that the spatial relationships to be studied are linear. Finally, traditional spatial regression disregards the heterogeneity between sample locations and instead learns the overall spatial relationship by sharing weights.

D. Zhu et al. [14] proposed SRGCNN and argued that spatial regression could theoretically be implemented using graph convolution. The Spatial Durbin Model (SDM) is written in matrix form as follows.

$$y = \rho W y + X\beta + W X \delta + \epsilon \tag{1}$$

The transformation is as follows.

$$
\begin{aligned}
y &= (I - \rho W)^{-1}(X\Theta + \epsilon) \\
  &= (I + \rho W + \rho^{2} W^{2} + \cdots)\, X\Theta + \tilde{\epsilon} \\
  &= \sum_{k=0}^{\infty} (\rho W)^{k} X\Theta + \tilde{\epsilon}
\end{aligned}
\tag{2}
$$

According to the forward propagation mechanism of the graph convolutional network, the dependent variable is expressed as follows.

$$y = \sigma(W_L X \Theta) \tag{3}$$

In the back propagation process, the root mean square error is used as the loss function $L$ for parameter updating, so the gradient at the $l$-th layer is represented as follows.

$$
\begin{aligned}
J^{(l)} &= \frac{\partial L}{\partial Z^{l}} \\
        &= \frac{\partial L}{\partial X^{l}} \frac{\partial X^{l}}{\partial Z^{l}} \\
        &= \frac{\partial L}{\partial Z^{l+1}} \frac{\partial Z^{l+1}}{\partial X^{l}} \frac{\partial X^{l}}{\partial Z^{l}} \\
        &= J^{(l+1)} W_L^{T} \Theta^{l} \sigma'^{l}
\end{aligned}
\tag{4}
$$

The response of the forward and backward propagation of graph convolutional neural networks to spatial lag illustrates that these networks are capable of modeling dependent and independent variables in a manner similar to spatial regression. This makes it possible to perform spatial regression using graph convolutional neural networks, or SRGCNN, by replacing the conventional spatial weight matrix with a spatial graph structure that characterizes the structural connections between spatial units.

The real estate samples are intricately connected, particularly through the implicit connection between their spatial locations. However, the majority of linear spatial models rely on a supervised approach, in which only the locations with observed labels can be incorporated into the trained model. As a result, these models are not directly applicable for valuation or exhibit poor performance when real estate sample data are missing or under-sampled [26]. To solve this issue, SRGCNN optimizes its parameters with semi-supervised learning, in which all spatial units are observed and the labels are only partially sampled, a situation that is typical in practical real estate appraisals. Semi-supervised learning allows the weights of all spatial units to be taken into account, so that the spatial dependence in real estate data is captured more fully.

The SRGCNN layer of the model developed in this paper is depicted in Fig. 1. The SRGCNN layer first extracts the location data of the real estate samples and uses it to build a spatial graph with a 6000×6000 structure. This spatial graph encodes the spatial weight matrix and cross-sectional data of the real estate samples. The SRGCNN layer then uses additional attribute data from the training set to update all of the nodes. The values are then linearly combined with the memory units WEIGHT and BIAS, passed through the ReLU activation function, and input to the subsequent attention layer. The structure of the memory unit BIAS is set to 1×56, and the structure of the memory unit WEIGHT is 7×56. Forward propagation is used to update the memory unit's parameters. SRGCNN employs a semi-supervised learning strategy to optimize the parameters: the model is trained using the spatial unit weights of all property samples, including those in both the training and validation sets, even if a sample does not appear in the training set. The semi-supervised learning strategy makes the SRGCNN model more appropriate for these valuation scenarios because in practice real estate sample data are regularly missing or under-sampled.

FIGURE 1. Structure of SRGCNN layer.

FIGURE 2. Structure of external attention layer.

TABLE 1. Attributes of the Shanghai dataset.
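
As a minimal sketch of how the two layers in Fig. 1 and Fig. 2 might fit together, the following plain-Python example runs a toy forward pass: one graph convolution with ReLU in the spirit of Eq. (3), followed by an external attention step over shared memory units. The row normalization with self-loops, the softmax normalization, the single-column value memory, and all function names and toy sizes (three properties rather than 6000) are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(adj):
    """Add self-loops, then scale each row to sum to 1 (assumed normalization)."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]


def graph_conv_layer(adj, features, weight, bias):
    """One graph convolution step: ReLU(A_hat @ X @ WEIGHT + BIAS)."""
    aggregated = matmul(row_normalize(adj), features)  # mix each node with its neighbours
    linear = matmul(aggregated, weight)
    return [[max(0.0, v + b) for v, b in zip(row, bias)] for row in linear]


def softmax_rows(matrix):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def external_attention(hidden, mem_k, mem_v):
    """Attend from every sample to shared memory slots, then read values out."""
    mem_k_t = [list(col) for col in zip(*mem_k)]
    attn = softmax_rows(matmul(hidden, mem_k_t))  # samples x slots
    return attn, matmul(attn, mem_v)              # samples x output size


# Toy data: 3 properties on a chain graph, 2 attributes each (hypothetical values).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, -1.0], [0.5, 0.5]]  # plays the role of the WEIGHT memory unit
bias = [0.0, 0.1]                   # plays the role of the BIAS memory unit
mem_k = [[1.0, 0.0], [0.0, 1.0]]    # 2 shared key slots (hypothetical)
mem_v = [[10.0], [20.0]]            # 2 shared value slots, one output per property

hidden = graph_conv_layer(adj, features, weight, bias)
attn, estimates = external_attention(hidden, mem_k, mem_v)
```

Because the memory units are shared by all samples, every property is scored against the same small set of slots, so the cost grows linearly with the number of samples rather than quadratically as in self-attention.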

B. EXTERNAL ATTENTION
To capture long-range dependencies within a single sample, the prevalent self-attention mechanism updates the features of each location by computing a weighted sum of features using the pair-wise affinities across all locations. Self-attention, however, ignores potential correlations between different samples and has quadratic complexity. In real estate valuation, the potential correlation between different samples is very high. We therefore adopt external attention in the A-SRGCNN model. Whereas the self-attention mechanism generates an attention map by calculating the similarity between self-queries and self-keys, the external attention layer of the A-SRGCNN model implements the attention mechanism by computing the relationship between the self-queries and much more compact learnable key memory units, which capture the broad connections between real estate data.

Fig. 2 depicts the structure of the external attention layer of A-SRGCNN. The layer is implemented with two linear layers. First, it accepts two-dimensional data from the preceding SRGCNN layer, that is, the data obtained after the SRGCNN layer applies spatial regression using graph convolution to the real estate samples. The memory units K and V in the external attention layer each have a 32×64 structure. The memory units store the attention weight parameters and implicitly take potential correlations between real estate data into account. The incoming two-dimensional data from the SRGCNN layer is multiplied by the weight parameters in the memory units in order to strengthen the potentially relevant and important features in the real estate data and to reduce or even eliminate the weights of the unimportant features. Finally, the attention layer outputs the appraised value of the property as a 1×1 result. The structure of the memory units of the external attention layer can be modified, or bias terms can be added, to further improve the optimization of external attention. The appraised price is compared with the actual price to calculate the error, and a back propagation algorithm then updates the parameters in the memory units.

III. DATA AND EXPERIMENT SETUP
The experiments are implemented using PyTorch, a deep learning framework with GPU acceleration. The computing environment is a Linux server with an Nvidia RTX 3070 GPU, a 2.10 GHz Intel i7-12700 CPU, and 64 GB of RAM.

Three datasets are used in this paper, namely the Shanghai real estate transaction dataset provided by a partner real estate company, the San Diego, California, USA real estate dataset on Airbnb, and the Melbourne, Australia real estate transaction data on Kaggle. The data can be accessed at https://github.com/n0away/Data-ASRGCNN/, although the Shanghai real estate transaction data cannot be released to the public because it is provided by commercial real estate appraisers and involves trade secrets.

TABLE 2. Attributes of the Melbourne dataset.

TABLE 3. Attributes of the San Diego dataset.

Table 1, Table 2, and Table 3 show the structure of the datasets, with 31605 entries in the Shanghai dataset, 8887 entries in the Melbourne dataset, and 6110 entries in the San Diego dataset. The attributes of each feature have been
given in the tables. The experiments use the logarithmic price of the property as the dependent variable and the latitude and longitude of the property to generate the adjacency graph.

Fig. 3, Fig. 4, and Fig. 5 show the real estate price distributions. The horizontal and vertical coordinates are the latitude and longitude of the real estate, and the shade of each scatter point represents the logarithmic price of the real estate, from low to high. Both high and low house price clusters can be seen in the datasets.

FIGURE 3. Distribution of Shanghai dataset.

FIGURE 4. Distribution of Melbourne dataset.

FIGURE 5. Distribution of San Diego dataset.

Fig. 6, Fig. 7, and Fig. 8 show histograms of house prices, where the horizontal coordinate represents the log price of the property and the vertical coordinate represents the number of properties at the corresponding log price. The log prices of the datasets tend to exhibit a normal distribution, as the histograms show.

FIGURE 6. Price histogram for the Shanghai dataset.

It is worth mentioning that the prices in the Shanghai dataset are final real estate transaction prices, which are more accurate than data obtained by web crawlers. In general, due to commercial practices or policy pressure, there is a discrepancy between the final real transaction price and the listed price published on real estate agent websites [27].

Seven models, LR, BP, GCN, RF, SRGCNN, XGBoost, and LightGBM, were used as the benchmark models. Because some of the models cannot accept graph parameters, the latitude and longitude of the property location were used as input variables to the models.
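
The data preparation described earlier in this section (logarithmic prices as the dependent variable and an adjacency graph generated from latitude and longitude) could be sketched as follows. How the authors actually construct the graph is not stated here, so the k-nearest-neighbour rule, the planar squared distance on raw degrees, the function name `knn_adjacency`, and the coordinates and prices are all assumptions for illustration only.

```python
import math


def knn_adjacency(coords, k):
    """Symmetric 0/1 adjacency linking each point to its k nearest neighbours."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i, (lat_i, lon_i) in enumerate(coords):
        # Squared planar distance on raw degrees; a rough proxy within one city.
        by_distance = sorted(
            ((lat_i - lat_j) ** 2 + (lon_i - lon_j) ** 2, j)
            for j, (lat_j, lon_j) in enumerate(coords)
            if j != i
        )
        for _, j in by_distance[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # keep the graph undirected
    return adj


# Hypothetical properties: (latitude, longitude) and made-up unit prices.
coords = [(31.20, 121.40), (31.21, 121.41), (31.30, 121.50), (31.31, 121.51)]
prices = [52000.0, 50000.0, 31000.0, 30000.0]

log_prices = [math.log(p) for p in prices]  # dependent variable
adjacency = knn_adjacency(coords, k=1)      # graph input for the model
```

With `k=1`, the two nearby pairs of properties end up connected to each other and not across pairs, which mirrors the idea that spatially close properties should exchange information in the graph convolution.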

FIGURE 7. Price histogram for the Melbourne dataset.

FIGURE 8. Price histogram for the San Diego dataset.

In this paper, the following models are chosen as the benchmark models.

1) LINEAR REGRESSION (LR)
A simple and easy-to-implement model in which the data is modeled using a linear prediction function.

2) BACK PROPAGATION (BP)
A multilayer feedforward network trained by error back propagation, with gradient descent as its basic idea.

3) GRAPH CONVOLUTIONAL NETWORK (GCN)
A powerful graph neural network with a wide range of applications in several fields such as computer vision and natural language processing.

4) RANDOM FOREST (RF)
Random forest regression consists of multiple regression trees. Each decision tree in the forest is uncorrelated with the others, and the model output is determined jointly by all decision trees in the forest.

5) SPATIAL REGRESSION GRAPH CONVOLUTIONAL NEURAL NETWORK (SRGCNN)
The model uses spatial regression implemented with graph convolution and performs well on multivariate prediction.

6) EXTREME GRADIENT BOOSTING (XGBOOST)
XGBoost implements machine learning algorithms in the Gradient Boosting framework with many engineering improvements and is widely used in algorithm competitions.

7) LIGHT GRADIENT BOOSTING MACHINE (LIGHTGBM)
Microsoft's framework for the GBDT algorithm, which facilitates efficient parallel training. Specifically, it has the benefits of rapid training speed, low memory consumption, and distributed support.

Real estate has attributes such as high unit price, low liquidity, and long transaction time, so it cannot be traded at high frequency in the way stocks are. Because the real estate transaction data used in this experiment has a relatively discrete time distribution, it is challenging to represent the time-series changes of real estate transactions and to model them using time-series data. Therefore, the popular LSTM model is not used as a benchmark model in this paper.

We employ the A-SRGCNN model for zonal and time-division experiments following the model comparison experiments. For the model comparison experiment, 6000 property records are randomly chosen from the Shanghai dataset as experimental data, of which 5000 serve as the training set and 1000 serve as the validation set. Because the records were chosen at random, the influence of time attributes on property prices can be disregarded. For the zonal experiments, 6000 data points from various regions are chosen, of which 5000 serve as the training set and 1000 serve as the validation set. The training set for the time-division experiments consists of 5000 real estate prices from 2020 to the first half of 2021, and the validation set consists of 1000 real estate prices from the second half of 2021. All models stop learning when the valuation metrics are stable or the model has begun to overfit the training set.

IV. RESULT
A. MODEL COMPARISON EXPERIMENTS
Table 4 summarizes the performance results of all benchmark models. The experiments use the mean absolute percentage error (MAPE) to evaluate how well the models fit the training set and how well they valuate on the validation set. The results in Table 4 indicate that the A-SRGCNN model performs impressively. Although the


TABLE 4. Performance in model comparison experiments.

TABLE 5. Friedman’s test.

TABLE 6. Wilcoxon signed rank test.

A-SRGCNN model does not perform optimally on every dataset compared to the benchmark models, performing worse than the XGBoost model on the validation set of the Shanghai and Melbourne data, it nevertheless has superior generalizability and high stability. On the validation set, the A-SRGCNN model has the highest accuracy among the eight models. This signifies that the A-SRGCNN model is more stable and more applicable to real-world use than the benchmark models. The LightGBM algorithm has an error rate of 4.21 percent on the Shanghai validation set, which is inferior to the RF algorithm’s error rate of 3.05 percent and XGBoost’s error rate of 2.81 percent. On the San Diego validation set, the errors of RF, LightGBM, and XGBoost are, respectively, 6.96 percent, 6.95 percent, and 7.06 percent, which are comparable. On the Melbourne validation set, the error rate of the LightGBM algorithm is 15.44 percent, which is slightly higher than the error rate of XGBoost (15.14 percent) and far lower than the error rate of the RF algorithm (21.05 percent). This suggests that the performance of these related algorithms is not stable and varies considerably across datasets. In contrast to the other comparison algorithms, the A-SRGCNN algorithm outperforms them with errors of 1.87 percent, 4.12 percent, and 12.80 percent on the three datasets, which shows that the A-SRGCNN algorithm has the advantage of stable performance across datasets.

Friedman’s test was performed on the results. According to Table 5, the p-value is 0.000***, which is below 0.05. The result is therefore statistically significant, indicating that there are substantial differences between LR, BP, GCN, RF, LightGBM, XGBoost, SRGCNN, and A-SRGCNN. The effect size, Cohen’s f, is 1.3, which indicates a very large difference. The box plot also shows that the A-SRGCNN model has the most stable performance among the eight models.

A Wilcoxon signed rank test was then performed on the results. According to Table 6, the p-value is 0.025**, which indicates a significant difference between the A-SRGCNN model and the SRGCNN model. The effect size of the contrast, Cohen’s d, is 1.501, which represents a very large difference. Thereby,


FIGURE 9. Comparison of box line graphs.

FIGURE 10. Heat map of group 1 in the zonal experiments.

we conclude that the A-SRGCNN model demonstrates a statistically meaningful performance advantage over the SRGCNN model without the attention mechanism.

FIGURE 11. Heat map of group 2 in the zonal experiments.

TABLE 7. Performance in zonal experiments.

B. ZONAL EXPERIMENTS
Using the A-SRGCNN model, this section conducts zonal experiments. Experiments were conducted across administrative districts to examine how well the model trains in one district, how well it valuates properties in another district, and the relationship between the two. Comparison group 1 used data from Xuhui District for its training set and data from Changning District for its validation set. See Table 7 for the training and validation sets used by the other comparison groups.
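The district-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field names `district` and `log_price`, the helper `zonal_split`, and the toy values are hypothetical and do not come from the paper's dataset.

```python
def zonal_split(records, train_district, val_district):
    """Split records into training and validation sets by district name."""
    train = [r for r in records if r["district"] == train_district]
    val = [r for r in records if r["district"] == val_district]
    return train, val


# Hypothetical records; field names and values are illustrative only.
records = [
    {"district": "Xuhui", "log_price": 15.9},
    {"district": "Xuhui", "log_price": 16.1},
    {"district": "Changning", "log_price": 16.0},
    {"district": "Jiading", "log_price": 15.2},
]

# Comparison group 1: train on Xuhui, validate on Changning.
train_set, val_set = zonal_split(records, "Xuhui", "Changning")
```

Because the two sets come from different districts, no property appears in both, which is what lets the experiment measure how well training in one district transfers to another.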
According to the experimental findings, when the A-SRGCNN model is used in the zonal experiments, the validation set error is substantially greater than in the model comparison experiments, while the training set error is comparatively lower. This is not difficult to understand: the data for the zonal experiments were restricted to one district in Shanghai, whereas the data for the model comparison experiment were chosen at random from six regions. The zonal data therefore show more geographical proximity between properties, which means there are more potential connections between properties and leads to better results on the training set. Although the validation set is also restricted to a single district, its weaker performance is not surprising, given that the correlation between distinct districts is lower than the correlation within a single district.

After analyzing the experimental data, we found that two factors determine how effective the assessment is: the relative distance between zones and the magnitude of the difference in house prices between zones. In comparison group 1, the districts of Xuhui and Changning are very close to one another and are both among Shanghai’s more expensive areas. As a result, comparison group 1 has the best prediction results, even coming close to the model comparison experiment’s performance. While it is clear that the assessment effect is influenced


FIGURE 12. Heat map of group 3 in the zonal experiments.

by the zoning distance, the assessment heat map of comparison group 1 in Fig. 10 further shows that the predicted prices are more accurate in the area of Changning District close to Xuhui District. Comparison group 2 is similar in that its two districts, Fengxian and Jiading, both have low property prices. However, due to the great physical distance between these two districts, the valuation results on the validation set of comparison group 2 are considerably worse than those of comparison group 1. This paper therefore considers that the distance between districts has an impact on the prediction effect.

Despite their close proximity, in comparison group 3 Jiading district belongs to the low house price zone while Changning district belongs to the high house price zone. As a result, the model performs very poorly on the validation set, which is the worst assessment result in the zonal experiments. We therefore think that the difference in house prices between districts also has an impact on the assessment effect, and that the magnitude of the difference in house prices has a greater influence on prediction than the distance between districts. This is further supported by comparison group 4, where the house price level of Songjiang district lies between those of Jiading district and Changning district. Although there is some distance between the two districts, the A-SRGCNN model in comparison group 4 valuates house prices better than the model in comparison group 3.

The experiment reveals that administrative clustering and price aggregation are present to a high degree in Shanghai’s housing prices. For example, although the model takes into account the number of schools in the vicinity of the property, the quality of schools varies between districts. This is evident not only in the stage of education but also in the fact that schools with a high level of education have good fitness facilities and a supportive environment that is available to the neighborhood, which can have a more beneficial effect on house prices [28].

C. TIME-DIVISION EXPERIMENTS
This part of the experiment randomly selected property prices from 2020 to the first half of 2021 as the training set and from the second half of 2021 as the validation set, in order to analyze the performance of the A-SRGCNN model in predicting future house prices.
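The chronological split used in the time-division experiments can be sketched as below. The field name `sold_on`, the helper `time_division_split`, the exact boundary dates, and the toy records are assumptions made for illustration; the paper only states that training data run from 2020 to the first half of 2021 and validation data come from the second half of 2021.

```python
from datetime import date

TRAIN_START = date(2020, 1, 1)
TRAIN_END = date(2021, 6, 30)   # assumed end of the first half of 2021
VAL_END = date(2021, 12, 31)    # assumed end of the second half of 2021


def time_division_split(records):
    """Train on sales up to mid-2021, validate on sales from the rest of 2021."""
    train = [r for r in records if TRAIN_START <= r["sold_on"] <= TRAIN_END]
    val = [r for r in records if TRAIN_END < r["sold_on"] <= VAL_END]
    return train, val


# Hypothetical transaction records; values are illustrative only.
records = [
    {"sold_on": date(2020, 3, 15), "log_price": 15.8},
    {"sold_on": date(2021, 5, 2), "log_price": 16.0},
    {"sold_on": date(2021, 9, 20), "log_price": 16.1},
    {"sold_on": date(2019, 11, 1), "log_price": 15.7},  # outside both windows
]

train_set, val_set = time_division_split(records)
```

Every validation sale is later than every training sale, so the validation error measures prediction of prices the model could not have seen.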


FIGURE 13. Heat map of group 4 in the zonal experiments.

FIGURE 14. Distribution of group 1 in the zonal experiment.

FIGURE 15. Distribution of group 2 in the zonal experiment.

FIGURE 16. Distribution of group 3 in the zonal experiment.

The experimental results indicate that the number of epochs for which the A-SRGCNN model trained in the time-division experiments is similar to that in the model comparison experiment, and the error of the time-division experiment on the validation set is 2.20%. Considering the difficulty of predicting future property prices in the real world, this is a reasonable result. It is nonetheless inferior to the 1.87% error rate that the A-SRGCNN model achieves in the model comparison experiments. We believe that the specific characteristics of the data set explain part of this difference. China’s real estate had been in a bull market for decades prior to 2020. In contrast, the outbreak of COVID-19 in China at the start of 2020 led to a stagnation of the Chinese economy, which had a dramatic impact on the real estate market, particularly a significant dampening effect on Shanghai real estate prices [29]. Then, as a result of the rapid containment of COVID-19 in China and the introduction of relatively accommodating economic policies by the Chinese government, Shanghai’s real estate prices began to recover quickly [30]. Such a large shock and upheaval severely affected the dataset, so that it no longer accurately reflects the real trend of real estate prices, and the prediction error of the model in the time-division experiment increased as a result.
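The error rates quoted here (for example 1.87% and 2.20%) are MAPE values. A minimal sketch of the metric follows, assuming the standard definition of mean absolute percentage error; the function name `mape` and the toy prices are illustrative and are not taken from the paper's data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical prices: two predictions are off by 2%, one is exact.
actual = [100.0, 200.0, 400.0]
predicted = [98.0, 204.0, 400.0]

error_percent = mape(actual, predicted)
```

Because each error is divided by the true price, the metric treats a 2% miss on a cheap property the same as a 2% miss on an expensive one, which makes results comparable across cities with different price levels.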


FIGURE 17. Distribution of group 4 in the zonal experiment.

TABLE 8. Performance in time-division experiments.

TABLE 9. Regional time-division experiments.

FIGURE 18. Fitted curve of time-division experiments.

Fig. 19 and Fig. 20 highlight additional differences between the valuation results of the A-SRGCNN model in the time-division experiments and in the model comparison experiments. The valuation heat map in Fig. 19 shows that there are more large error points in the time-division experiments, and large error points even appear sporadically in the Jiading region. The scatter plot in Fig. 20, which contains more deviation points, likewise shows that the valuation of the A-SRGCNN model in the time-division experiments is inferior to that in the comparison experiment.

Additionally, time-division experiments were conducted on individual administrative districts in the dataset. We selected data from Changning, Fengxian, and Xuhui districts, which have a larger number of samples. The results are shown in Table 9.

The results show that performance is better when data from a single district are used for the time-division experiments. The error for the Changning and Xuhui District data is only 1.99%, which is better than the 2.2% error of the data randomly selected from the six regions. The error in Fengxian District is 2.09%, which lies in between and further supports the conclusions of the previous experiments. The administrative size of Changning and Xuhui Districts is smaller than that of Jiading District, and the distribution of samples is more clustered. The more


FIGURE 19. Heat map of time-division experiments.

FIGURE 20. Scattered distribution of time-division experiments.

geographically clustered samples usually have more potential connections and tend to yield better valuation results.

V. CONCLUSION
To address the instability of appraisals and the difficulties in data acquisition that tend to exist in real estate appraisal, this paper proposes an A-SRGCNN real estate valuation model based on an external attention mechanism. The spatial regression model SRGCNN generalizes well, and the geographic information of the properties that the model requires is easy to obtain in practice, which makes the model practical and feasible for real-world real estate appraisal. In order to


model dependent and independent variables in a manner corresponding to spatial regression, the SRGCNN model employs a graph convolutional neural network. The semi-supervised learning approach accounts for nonlinear relationships between data, and the addition of an attention mechanism enables the model to recognize and remember potential relationships between property data in order to enhance model performance. According to the experiments, SRGCNN has a better ability to estimate house prices than the benchmark models, and the SRGCNN model maintains very high accuracy with stable performance on different datasets. The valuation accuracy also increases with the addition of the attention mechanism; on the validation set, the A-SRGCNN model’s error is only 1.87%. Because the data currently available from the cooperating real estate appraisal firms are limited, further research is needed before this method can be applied to more complex data sets. After further analysis of Shanghai second-hand property data using the A-SRGCNN model, we discovered that real estate prices in Shanghai exhibit regional aggregation and price aggregation, with similar prices for properties in the same region and significant differences in prices between regions even when they are geographically adjacent. The A-SRGCNN model also performs well when comparing prices in similar areas. The A-SRGCNN model predicts future house prices in the time-division experiment and achieves good results, with a prediction error of about 2.20%. Nevertheless, this is less accurate than the 1.87% error the model achieves in the model comparison experiment. We argue that this gap reflects how major factors such as public crises and policy changes can affect trends in real estate prices. Real estate prices are influenced by a variety of factors, so more thorough investigation and study are required to forecast real estate prices with greater accuracy.

REFERENCES
[1] S. Rosen, “Hedonic prices and implicit markets: Product differentiation in pure competition,” J. Political Economy, vol. 82, no. 1, pp. 34–55, 1974.
[2] R. Meese and N. Wallace, “House price dynamics and market fundamentals: The Parisian housing market,” Urban Stud., vol. 40, nos. 5–6, pp. 1027–1045, May 2003.
[3] R. Dubin, R. K. Pace, and T. G. Thibodeau, “Spatial autoregression techniques for real estate data,” J. Real Estate Literature, vol. 7, no. 1, pp. 79–96, 1999.
[4] R. Gupta, “Forecasting house prices for the four census regions and the aggregate U.S. economy in a data-rich environment,” Appl. Econ., vol. 45, no. 33, pp. 4677–4697, Nov. 2013.
[5] B. Huang, B. Wu, and M. Barry, “Geographically and temporally weighted regression for modeling spatio-temporal variation in house prices,” Int. J. Geograph. Inf. Sci., vol. 24, no. 3, pp. 383–401, Mar. 2010.
[6] H. Selim, “Determinants of house prices in Turkey: Hedonic regression versus artificial neural network,” Exp. Syst. Appl., vol. 36, no. 2, pp. 2843–2852, Mar. 2009.
[7] S. Peterson and A. Flanagan, “Neural network hedonic pricing models in mass real estate appraisal,” J. Real Estate Res., vol. 31, no. 2, pp. 147–164, Jan. 2009.
[8] Y. Guo, S. Lin, X. Ma, J. Bal, and C.-T. Li, “Homogeneous feature transfer and heterogeneous location fine-tuning for cross-city property appraisal framework,” in Proc. Australas. Conf. Data Mining. Singapore: Springer, 2018, pp. 161–174.
[9] S. Law, B. Paige, and C. Russell, “Take a look around: Using street view and satellite images to estimate house prices,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 5, pp. 1–19, Sep. 2019.
[10] J. Bin, S. Tang, Y. Liu, G. Wang, B. Gardiner, Z. Liu, and E. Li, “Regression model for appraisal of real estate using recurrent neural network and boosting tree,” in Proc. 2nd IEEE Int. Conf. Comput. Intell. Appl. (ICCIA), Sep. 2017, pp. 209–213.
[11] O. Poursaeed, T. Matera, and S. Belongie, “Vision-based real estate price estimation,” Mach. Vis. Appl., vol. 29, no. 4, pp. 667–676, May 2018.
[12] L. Yu, C. Jiao, H. Xin, Y. Wang, and K. Wang, “Prediction on housing price based on deep learning,” Int. J. Comput. Inf. Eng., vol. 12, no. 2, pp. 90–99, 2018.
[13] X. Xing, Z. Huang, X. Cheng, D. Zhu, C. Kang, F. Zhang, and Y. Liu, “Mapping human activity volumes through remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 5652–5668, 2020.
[14] D. Zhu, Y. Liu, X. Yao, and M. M. Fischer, “Spatial regression graph convolutional neural networks: A deep learning paradigm for spatial multivariate distributions,” GeoInformatica, vol. 25, pp. 1–32, Nov. 2021.
[15] N. Kok, E.-L. Koponen, and C. A. Martínez-Barbosa, “Big data in real estate? From manual appraisal to automated valuation,” J. Portfolio Manag., vol. 43, no. 6, pp. 202–211, Sep. 2017.
[16] J. Ali, R. Khan, N. Ahmad, and I. Maqsood, “Random forests and decision trees,” Int. J. Comput. Sci. Issues, vol. 9, p. 272, Sep. 2012.
[17] M. Čeh, M. Kilibarda, A. Lisec, and B. Bajat, “Estimating the performance of random forest versus multiple regression for predicting prices of the apartments,” ISPRS Int. J. Geo-Inf., vol. 7, p. 168, Oct. 2018.
[18] T. Dimopoulos, H. Tyralis, N. P. Bakas, and D. Hadjimitsis, “Accuracy measurement of random forests and linear regression for mass appraisal models that estimate the prices of residential apartments in Nicosia, Cyprus,” Adv. Geosci., vol. 45, pp. 377–382, Nov. 2018.
[19] J.-L. Alfaro-Navarro, E. L. Cano, E. Alfaro-Cortés, N. García, M. Gámez, and B. Larraz, “A fully automated adjustment of ensemble methods in machine learning for modeling complex real estate systems,” Complexity, vol. 2020, pp. 1–12, Apr. 2020.
[20] Z. Peng, Q. Huang, and Y. Han, “Model research on forecast of second-hand house price in Chengdu based on XGboost algorithm,” in Proc. IEEE 11th Int. Conf. Adv. Infocomm Technol. (ICAIT), Oct. 2019, pp. 168–172.
[21] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context-aware attention LSTM networks for 3D action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1647–1656.
[22] J. Bin, B. Gardiner, Z. Liu, and E. Li, “Attention-based multi-modal fusion for improved real estate appraisal: A case study in Los Angeles,” Multimedia Tools Appl., vol. 78, no. 22, pp. 31163–31184, Nov. 2019.
[23] M.-H. Guo, Z.-N. Liu, T.-J. Mu, and S.-M. Hu, “Beyond self-attention: External attention using two linear layers for visual tasks,” 2021, arXiv:2105.02358.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[25] T. J. Kauko, “Modelling the locational determinants of house prices: Neural network and value tree approaches,” Ph.D. dissertation, Fac. Spatial Sci., Utrecht Univ., Utrecht, The Netherlands, 2002.
[26] R. R. Vatsavai and B. Bhaduri, “A hybrid classification scheme for mining multisource geospatial data,” GeoInformatica, vol. 15, no. 1, pp. 29–47, Jan. 2011.
[27] D. A. Crozier and F. McLean, “Consumer decision-making in the purchase of estate agency services,” Service Industries J., vol. 17, no. 2, pp. 278–293, Apr. 1997.
[28] H. Wen, Y. Zhang, and L. Zhang, “Do educational facilities affect housing price? An empirical study in Hangzhou, China,” Habitat Int., vol. 42, pp. 155–163, Apr. 2014.
[29] M. Yang and J. Zhou, “The impact of COVID-19 on the housing market: Evidence from the Yangtze River delta region in China,” Appl. Econ. Lett., vol. 29, no. 5, pp. 409–412, Mar. 2022.
[30] C. Tian, X. Peng, and X. Zhang, “COVID-19 pandemic, urban resilience and real estate prices: The experience of cities in the Yangtze River delta in China,” Land, vol. 10, no. 9, p. 960, Sep. 2021.


ZONGYAN YANG is currently pursuing the degree in computer science and technology with the School of Information Technology, Shanghai Ocean University.

ZHONGHUA HONG (Member, IEEE) received the Ph.D. degree in GIS from Tongji University, Shanghai, China, in 2014. He has been an Associate Professor with the College of Information Technology, Shanghai Ocean University, since 2019. His research interests include 3D damage detection, coastal mapping, photogrammetry, GNSS-R, and deep learning.

RUYAN ZHOU received the Ph.D. degree in agricultural bio-environment and energy engineering from Henan Agricultural University, in 2007. From 2007 to 2008, she worked with the Zhongyuan University of Technology. She is currently working with Shanghai Ocean University, Shanghai, China.

HONG AI received the master’s degree in computer software and theory from Yanshan University, in 2005. Since 2005, she has been working with Shanghai Ocean University, Shanghai, China.
