1. Introduction
Driven by advertising technologies and the goal of producing targeted ads, the personalisation and customisation of websites and services have become the new norm in our society. The need for personalisation has been driven by the growth in available data and information. Information overload, which makes it challenging to find relevant information, has been a recognised phenomenon for the past two decades. For example, a study from 2003 estimated that between 1 and 2 exabytes of unique information were created, implying that each human being must process some 250 megabytes of information. Almost 20 years later, with data volumes far larger still, the need for efficient and accurate user recommendation systems that help find pertinent data and information has only grown. Personalised content delivery to any set of users may consist of multiple aspects.
A factor that plays a vital role in most personalised web interfaces is the interactivity and the “user-friendly” nature of the User Interface (UI). Every web user, be it a novice or an expert, wants the interface to deliver meaningful content without requiring much prior expertise about its functionality. This process involves a lot of work from a web developer’s perspective but should be invisible and seamless to the end-user. Therefore, various tools and techniques have been developed to implicitly collect data from users. Implicit data collection, in simpler terms, is the collection of a user’s data through “interface interactions”, without the user having to provide the data explicitly. The data is then used to determine interests and make recommendations. At the same time, another aspect growing in popularity is the use of a user’s location information to make recommendations.
Such recommender systems are widely deployed in many consumer domains, such as online shopping, although our research focuses on real estate recommendations. Real estate recommendation often hinges on the location of a property item, so we have incorporated online map interactions as a tool to understand a user’s interests. This paper presents four principal approaches for effectively identifying property items in our real estate portal: (1) analysis and implementation of content-based filtering for suggesting real estate items; (2) a collaborative filtering approach that reduces computational cost by suggesting similar items to a similar group of users; (3) a location-based approach for predicting the area of interest to the user based on geographical location and user preferences; and (4) a price prediction model to assist users in making informed decisions. The first two approaches were selected because the features of a real estate database closely resemble those of a movie database, a domain in which both content-based filtering and collaborative filtering have proven to provide precise recommendations to users [1]. Introducing a location-based approach is essential since property items have an inherent location aspect.
We have used data from the Estatech Maps portal:
https://www.the-estatech.com (accessed on 15 September 2021) for the recommendation part of the study. We also obtained data in explicit and implicit formats. In addition, historical data of properties and price listings were obtained from
Zameen.com (accessed on 15 September 2021), a real estate portal for online property listings. The techniques and methods used for the recommendation algorithms were score tree processes, TF-IDF and K-nearest neighbours. For house price prediction, we cross-compared two techniques, namely multiple linear regression and neural-network-based Keras regression.
The remainder of the article is organised as follows:
Section 2 presents the related literature review. The methodological approach is given in
Section 3.
Section 4 presents a discussion and the results, while
Section 5 concludes the study and provides future recommendations.
2. Related Work
Today’s modern recommendation engines have emerged from the domain of information filtering, a term created by [
2], whose work outlines one solution, called content filters, to the issue of retrieving the correct information from a pool of massive online data. To ascertain a user’s choice correctly, multiple visualisation tools have also been developed to accurately distinguish a user’s interests and inclinations. These tools can also be considered a form of content filter. This domain has been progressing ever since. [3] demonstrate various options for integrating a recommendation engine into a real estate portal’s user journey. Furthermore, in the same manner, the work validated how additional real estate details can provide more accurate recommendation results when integrated into the proposed model of deep learning and factorisation machines.
Another study by [
4] aims to determine if consumer loyalty can help a recommender system be more accurate. Other techniques, implemented by [5], use intelligent data analysis methods to create a recommender framework that solves the problem of recommending the most appropriate components for each user at any given time. They further addressed the problem of converting an original dataset from a real component-based application into an optimised dataset. After gathering the interaction data and developing a dataset to produce optimised recommendation results, machine learning algorithms using feature engineering and feature selection methods were also applied. Users and developers alike want information processing and its display to be swift. The system developed by [6] is based on an implicit profiling system that tracks the user’s interests through mouse movements.
A gap analysis approach by [
7] identifies the differences between theory and reality in presenting information on location choice by developing a seven-factor classification tool for evaluating property websites. To capture the relations between the latent feature vectors of real estate items, Ref. [
8] utilised average-based and individual-based geographical regularisation terms. Both terms are integrated into a weighted regularised matrix factorisation framework to model users’ implicit feedback behaviours and provide them with personalised property recommendations.
A probabilistic model for collaborative filtering by [
9] calculates the predicted values for items against active users, given that information about those active users is already available. The same research divides collaborative filtering methods into two primary modules: memory-based and model-based collaborative filtering. Additional probabilistic approaches have been presented, some more sophisticated than others, including the work of [10]. There, the recommendation procedure is treated as a sequential decision-making process, and the use of Markov decision chains has been suggested to create a model. However, they do not report any improved accuracy over Breese’s projected models. Another recommendation system by [11] applies content-based filtering, a fuzzy technique for identifying similar and different content and a prediction algorithm for identifying the right set of movie content for the user. At the same time, Ref. [12] developed item-to-item centred algorithms, which were shown to provide improved outcomes over user-based algorithms when compared against the K-nearest neighbour approach.
In the domain of GIS, a complete map personalisation system is developed by [
13] in which the users’ interests are implicitly recorded and given specific rankings, based on certain criteria being fulfilled, from the user’s mouse clicks or movements. As already mentioned, map personalisation has become an area of interest since data overload has become a common scenario in spatial information systems. In the model developed by [
14], the entire focus is to understand map usage patterns of the end-users. The goal is again focused on developing personalised maps for users on a web interface. Working on similar lines, RecoMap [
13] is a web-based platform through which each user receives customised spatial recommendations based on their preferences. The results are presented in a map interface highlighting the user’s personalised spatial recommendations. The adaptive map also shows the user’s preferences and the context in which they are used.
15] is to build a recommendation system and map interface, represented in a personalised format for the user to acquire quick results. Further inferences are made by studying the user’s behaviour for system improvement.
Another recommender system designed by [
16] is for real estate users who do not have a user profile on any real estate portal. The session-based interaction of the user is made more effective by utilising the user’s search context and ranking criteria for any suitable property item. A portal developed by [8], specifically designed for real estate, uses two basic approaches for user profiling: an ontological structure and case-based reasoning. The purpose is to save the end-user from the stress of massive online searching and deliver results where the user gets quick recommendations based on their interests. A recommendation system used by the US-based real estate website “Trulia” utilises a “square counting method” [17]. The method works well with large-scale datasets and delivers swift results per the user’s preferences based on love and hate edge configurations.
Things have changed significantly in the real estate industry during the COVID-19 era. In some regions, house prices have shown signs of stagnancy and even, in some cases, decreasing trends as people lost their livelihoods. These conditions have urged people to tread more carefully while making investments in this sector. In such a scenario, a price prediction model can help users make an informed decision. A method by [
18] for predicting house prices utilises a Mallows model averaging estimator, which is robust in terms of spatial dependence. Another study on ML models for house price prediction concludes that the random forest regressor model provides the best results amongst all other compared models, such as linear regression, decision tree and k-means regression [
19]. Another similar study carried out by [
20] applies regression as a predictive model. They use MSE, MAE and RMSE as their evaluation metrics for their model’s accuracy. Another interesting study by [
21] used Multiple Regression Analysis (MRA) to estimate property prices for mass evaluation. The structural qualities and the property’s location were viewed as the two primary micro factors of house pricing. Using a sample of 106 house sale transactions from 2011 to 2015, MRA was utilised to determine the structural characteristics and locational attributes that statistically influence house price. An alternative approach by [22] focuses on traditional solutions based on widely known methods and procedures and on faith in the infallibility and objectivity of a human analysing the real estate market, even though modern technologies are boldly entering the arena. Hence, the study’s key message is that organisations should stop viewing automated solutions (such as AVM, CAMA, and AAVM) as functioning in opposition to traditional approaches and instead embrace them as supplemental tools.
Our previous work in map personalisation discusses the initial concept of personalisation using real estate analytics [
23]. It also evaluates background research relating to the building blocks that lead to a recommendation engine for real-time analytics. Extensive research in this field has revealed gaps between real estate analytics and map-based personalisation, recommendation and prediction; we have tried to bridge this gap in our research and initial development work. We also found motivation for our study and subsequent development in the fact that map-based personalised real estate portals do not widely exist in the online real estate market. Having to sift through a plethora of online data is no longer suitable for most users, and personalisation has become a key concept in every aspect of data search. In our scenario, real estate test users have been interacting with a real estate portal, “Estatech Maps”, to search and post property items. Our recommendation system is based on three techniques: content-based, collaborative and location-based filtering. The interaction of users is captured via the map-based interface of Estatech Maps and stored in a database. Based on this data and its analysis, a user receives recommendations matching their area of interest. Along with that, we have incorporated a module based on traditional regression techniques and the Keras API for predicting future price trends of property items.
The subsequent section provides a detailed insight into the research process regarding data collection, pre-processing, runtime environment creation and model conception. It discusses the following crucial areas of the research process in detail: (1) data collection and technology; (2) property recommendation; (3) the price prediction model.
3. Methodology
The main focus of “Estatech Maps” is to provide personalised real estate listings to its users on a map-based interface by making accurate recommendations and providing insight into price trends in a user’s area of interest. Recommendation and price prediction were the key focus areas for delivering map-based personalisation to the users. In the first stage, a detailed study of the mathematical interpretation of recommendation algorithms was carried out. The second stage focused on the algorithms’ designs; in the third stage, development based on those algorithms was carried out, and the models were implemented. The validation and testing of these models were carried out in the final stage of the research. The sequence of the study is illustrated in
Figure 1.
Regarding price prediction, after researching various prediction techniques, two models were selected. One is based on a classical regression technique, and the other relies on neural networks.
3.1. Data Collection and Technology
User interaction data was extracted from the portal over a period of roughly a year (May 2020–March 2021). Data were extracted in JSON format from a MongoDB database and converted to CSV format. The data consisted of 1600 recorded user interactions with the portal. The data for house price prediction was acquired from the Pakistan-based real estate portal
Zameen.com (accessed on 15 September 2021), covering the two years 2019–2020 for Islamabad City.
Both the datasets, from Estatech Maps and
Zameen.com (accessed on 15 September 2021), were divided into training and test datasets. The
Zameen.com (accessed on 15 September 2021) data, used for the house price prediction model, was further split to create a validation dataset. The data consisted of multiple files: user login information (user demographics), interaction data (most viewed properties list) and item data (properties).
TuriCreate was used to build the recommendation engine for content-based and collaborative filtering, whereas a K-means clustering technique was employed for the location-based recommendation. TuriCreate is an open-source toolkit for building Core ML models for tasks like image recognition, object detection, style transfers, and recommendation generation, among others.
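As an illustration of this toolkit, the following minimal sketch builds an item-similarity recommender with TuriCreate over a toy interaction log; the data and the column names (user_id, property_id) are hypothetical stand-ins for the portal’s actual schema, not its implementation.

```python
import turicreate as tc

# Toy interaction log; column names are illustrative, not the portal's schema.
interactions = tc.SFrame({
    "user_id":     ["u1", "u1", "u2", "u2", "u3"],
    "property_id": ["p1", "p2", "p1", "p3", "p3"],
})

# Item-based recommender over implicit interactions (no explicit ratings).
model = tc.item_similarity_recommender.create(
    interactions, user_id="user_id", item_id="property_id")

# Top-2 property suggestions for one user.
print(model.recommend(users=["u1"], k=2))
```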
TensorFlow and the Keras API were used as baseline technologies to build the house price prediction model, and validation of model loss and model accuracy was performed through the MSE, MAE and RMSE evaluation metrics. TensorFlow is a free and open-source machine learning software library. It can be used for various activities, but it focuses on the training and inference of deep neural networks. The Google Brain team created TensorFlow for internal Google use, and in 2015 it was published under the Apache License 2.0. The reason for using TensorFlow is that it builds models using data flow graphs and enables programmers to create large-scale neural networks with multiple layers. Keras is a deep learning API written in Python that runs on top of the TensorFlow machine learning system. It was built with the objective of allowing fast experimentation.
3.2. Property Recommendation
The three areas of focus for the recommendation engine are discussed in detail in each of the following sections.
3.2.1. Content-Based Filtering
The concept behind recommender systems is data analytics. This can be achieved either by score-based algorithms or by suggesting the top N items in an item array to a user. In our scenario, the recommender system is designed for suggesting property items listed for sale or rent. If a person has interacted through the map-based interface with a property item, say in area “A” with attribute array “X”, the recommender system can display similar items for the user in an instant and accurate manner.
In content-based filtering, the angle between the user’s profile and the items the user is interested in is determined. This cosine angle determines how close in space the vectors lie to each other and is also termed cosine similarity. The closer they are, the more similar they are deemed. Let us consider a vector U of users {user1, user2, user3, …} and a vector P of property items {p1, p2, p3, p4, …}. The similarity between these two vectors can be calculated as:

$$\text{similarity}(U, P) = \cos\theta = \frac{U \cdot P}{\lVert U \rVert \, \lVert P \rVert} \quad (1)$$
The cosine value or similarity in Equation (1) can range between −1 and 1. Based on this value, the items are organised in descending order, and the top recommendations are made to the user.
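To make Equation (1) concrete, the following sketch computes cosine similarity between two toy vectors; the three attribute dimensions are invented purely for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, p: np.ndarray) -> float:
    """Cosine of the angle between a user-profile vector and an item
    vector, as in Equation (1); the result ranges between -1 and 1."""
    return float(np.dot(u, p) / (np.linalg.norm(u) * np.linalg.norm(p)))

# Hypothetical attribute dimensions, e.g. (rent, buy, commercial).
user_profile  = np.array([4.0, 1.0, 0.0])  # aggregated user interactions
property_item = np.array([1.0, 0.0, 0.0])  # one listed property
print(cosine_similarity(user_profile, property_item))  # ~0.97: very similar
```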
The approach for content-based filtering is further explained in
Figure 2, which shows how a tree-based criterion for item selection works. The concept is based on how much interactivity a user has with a specific item or category. Interest ratios are calculated between corresponding categories based on “incrementing the value of frequency”. For example, buyers’ interactions with rent or purchase categories define the interest ratio between the two categories. The flow of the function which performs frequency calculation is elaborated in
Figure 3, which details another content-based filtering process, namely TF-IDF. For example, suppose a user searches for “the rise of analytics” on Google. In that case, it is inevitable that the word “the” will occur more frequently than “analytics”, yet from the search query’s point of view the relative importance of “analytics” is higher. In such cases, TF-IDF weighting negates the effect of high-frequency words in determining the significance of an item (document).
TF(t) is simply the frequency of a term t in a document, whereas IDF(t) signifies the rarity of the term: the fewer documents the term occurs in, the higher the value of IDF. In Equation (4), the log parameter is used to dampen the effect of high-frequency words. We have utilised both the score tree process and the TF-IDF approach in formulating our content-based filtering algorithm. Initially, user-user similarity and item-item similarity are obtained in an array format. The next step in the process was the creation of the item-user similarity matrix.
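The sketch below combines TF-IDF weighting with cosine similarity over a few invented property descriptions; scikit-learn’s TfidfVectorizer is used here for brevity, which may differ from the portal’s own implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented property descriptions standing in for real listings.
docs = [
    "three bed house for rent near the park",
    "commercial plot for sale on the main boulevard",
    "house for sale with a garden near the park",
]

# TF-IDF damps ubiquitous words ("the", "for") and boosts rare,
# discriminative ones ("garden", "boulevard").
matrix = TfidfVectorizer().fit_transform(docs)

# Item-item cosine similarities; the most similar off-diagonal pair
# would be recommended together.
print(cosine_similarity(matrix).round(2))
```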
3.2.2. Collaborative Filtering Approach
In our approach towards developing a collaborative filter for the portal, the test users were divided into segments based on their preferences, and items were recommended as per the mutual choices of users belonging to that segment. The more a user interacts with the items on display and rates them, the more precisely the system can suggest appropriate items. Algorithms designed for collaborative filtering are mostly based on finding similarities between users on the grounds of the rank or rating they have given to previous items. So, for predicting an item “i” for user “u”, the interactions that other users have had with item “i” are summed, weighted by each user’s similarity to user “u”. The prediction would then be calculated as:
$$P_{u,i} = \frac{\sum_{v} s_{u,v} \, r_{v,i}}{\sum_{v} \lvert s_{u,v} \rvert}$$

where $P_{u,i}$ is the prediction term for user “u” against an item “i”, $r_{v,i}$ is the interaction by a user “v” with an item “i”, and $s_{u,v}$ is the likeness among the two users, i.e., user “u” and user “v”.
As per
Table 1, the interactions between users and properties are recorded, and suggestions for a new user “u1” are generated. The symbol “x” represents an interaction between a user and a property item. It is evident that there is more similarity between user 1 and user 2 than with user 3. Based on this, user 1 and user 2 will be grouped together for future recommendations. Algorithm 1 depicts a generalised algorithm designed for grouping user 1 and user 2 together so that the same properties get recommended to them.
Algorithm 1 Collaborative Recommendation Algorithm for New User “U1”
1: Input: Properties dataset → all properties
2: Neighbours used for ranking → K
3: New user for recommendation → U1
4: Current recommendations for new user U1 → ∅
5: Users’ location history → L
6: rank = 0
7: Output: N items to be recommended
8: for each property ∈ all properties do
9:   if (users for P1 == users for P2) then
10:    rank++
11:    Group according to the nearest neighbour in similarity (K, property, user, L) = users for P1 && users for P2
12:    Recommendations[U1] → [P3]
13: sort(properties) by descending rank
14: return Recommendations[]
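A minimal, self-contained interpretation of the weighted-sum prediction behind this grouping is sketched below; the toy interaction matrix mirrors the Table 1 pattern (users 1 and 2 overlap, user 3 does not) and is illustrative rather than the portal’s actual implementation.

```python
import numpy as np

# Toy user-property matrix; 1 marks an interaction ("x" in Table 1).
R = np.array([
    [1, 1, 0, 1],   # user 1
    [1, 1, 0, 0],   # user 2
    [0, 0, 1, 0],   # user 3
], dtype=float)

def user_similarity(R: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of user rows."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    unit = R / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def predict(R: np.ndarray, u: int, i: int) -> float:
    """Similarity-weighted sum of other users' interactions with item i."""
    S = user_similarity(R)
    others = [v for v in range(R.shape[0]) if v != u]
    den = sum(abs(S[u, v]) for v in others)
    return sum(S[u, v] * R[v, i] for v in others) / den if den else 0.0

# User 2 (index 1) has not seen property P4 (index 3); the similar
# user 1's interaction pushes the predicted score up.
print(round(predict(R, 1, 3), 2))
```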
3.2.3. Location-Based Filtering
The purpose of a location-based recommendation system is to recommend items based on the geographical location of a user. In this scenario, recommendations can also be made for a new user (the cold start problem), where items get recommended based on users in nearby locations who may align with the new user on other parameters, such as age or gender. A location-based recommendation can immensely benefit people by saving time and travel costs when displayed effectively through an interactive interface.
Equation (4) calculates the probability that a user interacts with an item “i”, based on the item’s distance from all of the user’s previous interactions, which, in our case, are other property items. Algorithm 2 presents a generalised algorithm for calculating location-based recommendations for a new test user; it considers up to 50 nearby users in a cluster for the similarity score calculation.
Algorithm 2 Location-based recommendation algorithm for New User “U1”
1: Input: A user
2: Collection of users → U
3: Users’ location history → L
4: Similarity matrix between users → M
5: Current recommendations for new user based on location → ∅
6: count = 0
7: Output: Top N location-based property recommendations based on users’ similarities and preferences
8: M = similarity matrix values
9: Number of nearby users selected for similarity score calculation ≤ 50
10: for each user ∈ U do
11:   LOC = location discovery // level of hierarchy or granularity of location
12:   Calculate similarity distance score
13:   Calculate distance from nearby users
14:   Match the similarity score of user U1’s last x interacted properties with nearby users’ similarity scores
15:   Sort properties based on count
16:   Select top N scores
17:   Select top N properties
18: return N recommendations
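The clustering step behind this algorithm can be sketched with scikit-learn’s KMeans; the coordinates below are invented, whereas in the portal they would come from the users’ location history L.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented user coordinates (latitude, longitude) forming two rough groups.
coords = np.array([
    [33.68, 73.04], [33.69, 73.05], [33.70, 73.06],
    [33.52, 73.10], [33.53, 73.11],
])

# Group users into geographic clusters, as in the K-means step of the
# location-based recommender.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# A new user is assigned to the nearest cluster; properties popular with
# up to 50 of that cluster's members would then be ranked, and the top N
# returned as recommendations.
print(kmeans.predict(np.array([[33.685, 73.045]])))
```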
3.3. Price Prediction Model
The critical aspect to note in the price prediction model is that the data used for this analysis comprises the “offered” prices listed on the real estate portal
Zameen.com (accessed on 15 September 2021). These prices can change with variations in the real estate market.
For the prediction and analysis aspect, two regression techniques, namely (1) multiple linear regression and (2) Keras regression, were selected. These techniques were cross-compared and validated, and the one that performed better in terms of variance score was selected as the final model for visualising house prices.
3.3.1. Multiple Linear Regression
This is a type of linear regression in which the supposition is that the dependent variable y and the independent variables x have a linear or direct relationship. We used the Sklearn library to import the LinearRegression module. As already mentioned, our dataset was divided into a training set and a test set.
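As a sketch of this step, the snippet below fits scikit-learn’s LinearRegression on synthetic stand-in data (three invented features rather than the full Zameen.com feature set) and reports the variance (R²) score used for model comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings data: bedrooms, bathrooms, area
# (sq. ft), with a noisy linear price.
rng = np.random.default_rng(0)
X = rng.uniform([1, 1, 500], [6, 5, 5000], size=(200, 3))
y = X @ np.array([1.5e6, 8.0e5, 2.0e3]) + rng.normal(0, 5e5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
# Variance (R^2) score on held-out data, used to compare the two models.
print(r2_score(y_test, model.predict(X_test)))
```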
3.3.2. Keras Regression
We use regression techniques to predict the dependent variable y, which is price. We have 14 features (property_id, location_id, property_type, price in PKR, price in dollars, location, city, province, bedrooms, bathrooms, area purpose, date of addition to the portal, area in Marla, area in sq. ft); therefore, we selected 14 neurons as a baseline, along with one input layer and one output layer for the model. There are 4 hidden layers.
The model was trained for 400 epochs, with the training and validation precision recorded during each cycle. Finally, the model was run on both the training and test sets, with the loss function measured at each epoch to keep track of how well the model was performing.
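A minimal sketch of the described network follows: 14 input features, four hidden layers of 14 neurons each, and a single output neuron for price; the placeholder arrays are random stand-ins for the scaled training split, so this illustrates the architecture rather than reproducing the reported results.

```python
import numpy as np
from tensorflow import keras

# Placeholder arrays standing in for the scaled training split of the
# Zameen.com data: 14 features per listing, price as the target.
X_train = np.random.rand(800, 14).astype("float32")
y_train = np.random.rand(800).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(14,)),
    keras.layers.Dense(14, activation="relu"),
    keras.layers.Dense(14, activation="relu"),
    keras.layers.Dense(14, activation="relu"),
    keras.layers.Dense(14, activation="relu"),
    keras.layers.Dense(1),  # single output neuron: predicted price
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Loss is recorded at every epoch, mirroring the 400-epoch run; MSE and
# MAE come from the returned history, and RMSE can be derived from MSE.
history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=400, verbose=0)
```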
5. Conclusions
Three different recommendation algorithms for the real estate portal “Estatech Maps” were developed, along with two different models for house price prediction. First, we set out to analyse and implement content-based filtering for suggesting real estate items. The collaborative filtering approach was used to reduce the computational cost by suggesting similar items to a similar group of users. Then, we applied the location-based approach to predict areas of interest to the user based on the user’s geographical location. All this was achieved with a minimum precision of 79%. Prediction models were created, and the results were visualised as price increase, decrease or stagnancy in multiple sectors of Islamabad city, to better assist people planning future land asset purchases. Our neural-network-based prediction model was able to predict changes in house price trends with a minimum accuracy of 80%. This work can be effectively utilised in any real estate sale and purchase domain and will improve the overall user experience of real estate portals. It demonstrates the viability of our map-based system in providing data and recommendations to users based on the popularity of an item, user similarity and geographical location.
While recommendation and predictive analysis are nowadays becoming common even in the smallest of businesses, the real estate industry in Pakistan lags in implementing these techniques, both in terms of a map-based interface and in terms of presenting items to the user effectively. Therefore, our approach of displaying items of interest to the user on a map-based interface would be among the pioneers of real estate portals in Pakistan.
We have used sequential NN models for our recommendation and prediction in this research. One area of improvement, and a basis for future work, could be exploring and implementing these as parallel models to improve response time and efficiency. Another approach could be combining multiple techniques to create a hybrid model; the same approach was used in a study where the Cobb-Douglas and linear regression models were combined to form a mathematical model [24], with GIS as an additional tool to organise the regional data of the area under study. In turn, this can cover a broader spectrum of users’ behaviours and avoid high computational costs at the server end.