Pfda Assignment Report Jackson Tai Tp059628
Programming For Data Analysis (Asia Pacific University of Technology and Innovation)
INDIVIDUAL ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT127-3-2-PFDA
Programming for Data Analysis
APU2F2209SE
Lecturer: Miss Minnu Helen Joseph & Miss Farhana Illiani Binti Hassan
Table of Contents
3.2.7 - Analysis 2.7: Find the number of the bathroom in houses in each city
3.2.8 - Conclusion for question 2
3.3 - Question 3: How expensive is it to rent a house in each city?
3.3.1 - Analysis 3.1: Find the average rent of houses in each city
3.3.2 - Analysis 3.2: Find the average rent per sqft of houses in each city
3.3.3 - Analysis 3.3: Find the average rent of houses’ room in each city
3.3.4 - Conclusion for question 3
3.4 - Question 4: Does the rent directly affect the size of houses?
3.4.1 - Analysis 4.1: Find the relationship between the rent with house size
3.4.2 - Analysis 4.2: Find relationships between average rent with bedroom number
3.4.3 - Analysis 4.3: Find relationships between average rent with bathroom number
3.4.4 - Conclusion for question 4
3.5 - Question 5: How is the property rental market performance?
3.5.1 - Analysis 5.1: Find the consistency of houses posted in the last few months
3.5.2 - Analysis 5.2: Find the house posted daily in the last few months
3.5.3 - Analysis 5.3: Find the average rent of houses in the last few months
3.5.4 - Analysis 5.4: Find the rent of houses posted in the last week
3.5.5 - Conclusion for question 5
3.6 - Question 6: What kind of houses do tenants prefer to live in?
3.6.1 - Analysis 6.1: Find the size of houses preferred by tenants
3.6.2 - Analysis 6.2: Find the number of bedrooms preferred by tenants
3.6.3 - Analysis 6.3: Find the number of bathrooms preferred by tenants
3.6.4 - Analysis 6.4: Find the area in which the tenant’s preferred house is located
3.6.5 - Analysis 6.5: Find the city in which the tenant’s preferred house is located
3.6.6 - Analysis 6.6: Find the furnishing status of houses preferred by tenants
3.6.7 - Analysis 6.7: Find the area type of houses preferred by tenants
3.6.8 - Analysis 6.8: Find the contact way of houses preferred by tenants
3.6.9 - Conclusion for question 6
3.7 - Question 7: How do different groups of tenants choose their rental houses?
3.7.1 - Analysis 7.1: Find the budget for house rent by different tenants
3.7.2 - Analysis 7.2: Find the furnishing status of houses preferred by different tenants
3.7.3 - Analysis 7.3: Find the area type of houses preferred by different tenants
3.7.4 - Analysis 7.4: Find the contact way of houses preferred by different tenants
3.7.5 - Analysis 7.5: Find the city preferred by different tenants
3.7.6 - Analysis 7.6: Find the area preferred by different tenants
3.7.7 - Analysis 7.7: Find the floor preferred by different tenants
3.7.8 - Analysis 7.8: Find the date of renting houses preferred by different tenants
3.7.9 - Conclusion for question 7
3.8 - Question 8: How big a house do different group of tenants prefer?
3.8.1 - Analysis 8.1: Find the house size preferred by different tenants
3.8.2 - Analysis 8.2: Find the number of bedrooms preferred by different tenants
3.8.3 - Analysis 8.3: Find the number of bathrooms preferred by different tenants
3.8.4 - Conclusion for question 8
3.9 - Question 9: What kind of houses are posted in July?
3.9.1 - Analysis 9.1: Find the furnishing status of houses posted in July.
3.9.2 - Analysis 9.2: Find the area type of houses posted in July.
3.9.3 - Analysis 9.3: Find the contact way of houses posted in July.
3.9.4 - Analysis 9.4: Find the size of houses posted in July.
3.9.5 - Analysis 9.5: Find the number of bedrooms in houses posted in July.
3.9.6 - Analysis 9.6: Find the number of bathrooms in houses posted in July.
3.9.7 - Analysis 9.7: Find the city of houses posted in July.
3.9.8 - Conclusion for question 9
4.0 - Extra Feature
4.1 - User-defined function
4.1.1 - Get frequencies of the same data
4.1.2 - Get percentages of data
4.2 - Additional ggplot2 features
4.2.1 - Text label on charts
4.2.2 - Trendline on graph
4.2.3 - Doughnut chart
4.2.4 - Stacked bar chart
4.2.5 - Treemap
5.0 - Conclusion
6.0 - References
1.2 - Assumption
The given dataset contains information on houses, apartments, and flats available for rent. The
data were obtained from people who posted listings of houses they wished to rent out, which
makes the dataset useful for analysing and predicting house rent. The properties posted in the
given dataset are assumed to have already been rented out. As such, this dataset can be used to
analyse the houses that are preferred by tenants.
The first and most important step before the analysis begins is to import the data from the
external file. Before importing the data, the working directory is set using the built-in setwd()
function, which takes the new working directory as its argument,
C:/JacksonTai/APU/Degree/lvl2/sem1/Assignment/PFDA/. Once the directory is set, the rent
data in a CSV file called House_Rent_Dataset.csv is imported using the built-in read.csv()
function. The first parameter of this function is the path of the file to import, while the second
parameter is the separator sep, set to a comma, which tells R that the file is comma-delimited.
Lastly, the View() function is used to display the dataset as a table. It takes the dataset
house_data as its parameter and automatically opens another tab to display it.
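The import step described above can be sketched as follows. This is a minimal sketch rather than the exact script used in this report: it writes a tiny stand-in CSV to a temporary file so that it runs anywhere, and the three rows and the reduced set of column headers are made up for illustration.

```r
# Write a tiny stand-in for House_Rent_Dataset.csv to a temporary file.
# The rows and the column subset are assumptions made for this sketch.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Posted On,BHK,Rent,Size,City,Point of Contact",
  "2022-05-18,2,10000,1100,Kolkata,Contact Owner",
  "2022-05-13,2,20000,800,Kolkata,Contact Owner",
  "2022-07-04,3,45000,1200,Mumbai,Contact Agent"
), csv_path)

# In the report, setwd() is called first and the file name is passed directly.
house_data <- read.csv(csv_path, sep = ",")

# View(house_data) would open the table viewer in an interactive session;
# print() is used here so the sketch also runs non-interactively.
print(house_data)
```

Because read.csv() checks the column names by default, a header such as "Posted On" is imported as Posted.On.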
In R, missing values are represented by the symbol NA (not available). The is.na() function
is used to check whether each value is NA, and sum() is used to count all the NA values in
the dataset. For the given dataset the count is 0, meaning that no missing values were found.
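The missing-value check can be illustrated with a small example. The data frame below is made up and deliberately contains two NA values, so it does not reflect the given dataset, which has none.

```r
# Made-up data frame with two missing values, for illustration only.
sample_data <- data.frame(
  Rent = c(10000, NA, 15000),
  City = c("Kolkata", "Mumbai", NA)
)

# is.na() returns a logical matrix; sum() counts the TRUE entries.
na_count <- sum(is.na(sample_data))
print(na_count)

# colSums() gives the same count broken down per column.
na_per_column <- colSums(is.na(sample_data))
print(na_per_column)
```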
The names() function returns the names of a data object when the variable that holds the data
is passed as the argument (Chidalu, n.d.). When the dataset is first imported into R, the column
names of the house_data dataset default to the headers in the file, with each space replaced by
a dot.
The columns of the given dataset are renamed to be more meaningful for conducting the
analysis. The names() function is also used for renaming, by assigning it a character vector
that contains the new column names.
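A minimal sketch of reading and then replacing column names with names() is shown below. The one-row data frame and the new name Bedroom are hypothetical; only Contact_Way is a column name that appears in this report.

```r
# One-row stand-in data frame; the values are made up.
house_data <- data.frame(
  Posted.On = "2022-05-18",
  BHK = 2,
  Point.of.Contact = "Contact Owner"
)

# Reading the current column names.
old_names <- names(house_data)
print(old_names)

# Renaming: the assigned vector must have one entry per column, in order.
names(house_data) <- c("Posted.On", "Bedroom", "Contact_Way")
print(names(house_data))
```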
The data type of each column in the dataset can be viewed using the str() function, which
compactly displays the internal structure of an R object, including large nested lists. It reports
the number of rows and columns, the name and class of each column, and the first few
observations of each column.
Figure 6 above shows that the Posted.On field holds dates stored as character strings. This
makes it difficult to perform calculations on the dates during analysis. Therefore, the data type
of this column is changed to Date using the as.Date() function. The first parameter is the
column whose data type will be changed, and the second parameter is the date format.
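The conversion can be sketched as below. The two dates and the format string "%Y-%m-%d" are assumptions for illustration; the format must match however the dates are actually written in the file.

```r
# Stand-in column of dates stored as character strings (made-up values).
house_data <- data.frame(
  Posted.On = c("2022-05-18", "2022-07-04"),
  stringsAsFactors = FALSE
)
class_before <- class(house_data$Posted.On)

# Convert the character column to the Date class.
house_data$Posted.On <- as.Date(house_data$Posted.On, format = "%Y-%m-%d")
class_after <- class(house_data$Posted.On)

# Date arithmetic is now possible, e.g. the number of days between postings.
days_between <- as.numeric(diff(house_data$Posted.On))
print(c(class_before, class_after))
print(days_between)
```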
The above source code removes the redundant word “Contact” from the data in the
Contact_Way column, which was named in section 2.3.1. The contact_ways vector is created to
hold the extracted values, such as “owner” and “agent”. A for loop iterates through each value
in the Contact_Way column of the house_data dataset. Within the loop, the strsplit() function
splits the contact_way variable on the word “Contact” and returns a list containing a vector of
the pieces that remain. The vector consists of an empty string and the extracted value
(owner/agent). Thus, [[1]] is used to access the first element of the list, which is the vector,
and [2] is then used to access the second element of the vector, which is the extracted value.
Once extracted, the value is appended to the contact_ways vector at the end of each iteration.
Lastly, the Contact_Way column of the dataset is overwritten with the contact_ways vector, and
the rm() function is used to remove the variable from memory to avoid a possible clash with
the same variable name later.
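The loop described above might look like the following sketch. The three sample values and the exact split pattern "Contact " (with a trailing space) are assumptions, since the original source code is not reproduced here.

```r
# Stand-in column; the values are assumed to look like "Contact Owner".
house_data <- data.frame(
  Contact_Way = c("Contact Owner", "Contact Agent", "Contact Owner"),
  stringsAsFactors = FALSE
)

contact_ways <- c()
for (contact_way in house_data$Contact_Way) {
  # strsplit() returns a list holding one vector, e.g. c("", "Owner").
  parts <- strsplit(contact_way, "Contact ")
  # [[1]] picks the vector out of the list, [2] picks the extracted word.
  contact_ways <- c(contact_ways, parts[[1]][2])
}

# Write the cleaned values back and remove the temporary vector.
house_data$Contact_Way <- contact_ways
rm(contact_ways)
print(house_data$Contact_Way)
```

A vectorised alternative that avoids the loop would be sub("^Contact ", "", house_data$Contact_Way), which removes the prefix from every value in one call.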
The class() function is used to return the class of the dataset, which is a data frame.
The dim() function is used to get the dimensions of the given dataset, returning the number
of rows and columns (Zach, 2022). The nrow() and ncol() functions can also be used to get
the number of rows and columns respectively. The given dataset has 4,746 rows and 12
columns.
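A short sketch of these structure checks, using a made-up three-row data frame in place of the given dataset:

```r
# Made-up stand-in for the dataset.
house_data <- data.frame(
  Rent = c(10000, 20000, 45000),
  Size = c(1100, 800, 1200)
)

data_class <- class(house_data)  # class of the whole object
data_dim <- dim(house_data)      # c(number of rows, number of columns)
row_count <- nrow(house_data)
col_count <- ncol(house_data)

print(data_class)
print(data_dim)
```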
The sample data of the given dataset can be obtained by using the head() function. By default,
the head() function returns the first 6 rows of the dataset if there is no value provided for the
second parameter (CN, 2022).
On the other hand, the tail() function returns the last n rows of the dataset, where the second
parameter n is the number of rows to be returned.
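The behaviour of head() and tail() can be sketched with a made-up ten-row data frame:

```r
# Made-up stand-in with ten rows.
house_data <- data.frame(
  Id = 1:10,
  Rent = seq(10000, 55000, by = 5000)
)

# With no second argument, head() returns the first 6 rows.
first_rows <- head(house_data)

# The second argument of tail() sets how many rows are returned from the end.
last_rows <- tail(house_data, 3)

print(first_rows)
print(last_rows)
```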
The summary() function is used to quickly summarize the values in the data frame. For numeric
columns it provides the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. For
categorical columns it provides the length, class, and mode.
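As an illustration, the sketch below applies summary() to a made-up data frame with one numeric column and one character column:

```r
# Made-up stand-in with one numeric and one character column.
house_data <- data.frame(
  Rent = c(10000, 20000, 45000, 12000),
  City = c("Kolkata", "Kolkata", "Mumbai", "Delhi"),
  stringsAsFactors = FALSE
)

# Whole-data-frame summary: numeric statistics for Rent,
# and length, class, and mode for the character column City.
print(summary(house_data))

# Summary of a single numeric column, returned as a named vector.
rent_summary <- summary(house_data$Rent)
print(names(rent_summary))
```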
The factor() function takes a vector as its input, in this case the bedroom column from the
house_data dataset (Johnson, 2022). In Figure 14, the data in the bedroom column is made up
of numbers from 1 to 6.
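A minimal sketch of how factor() exposes the distinct categories of a column is given below. The bedroom values are made up; only the range of 1 to 6 comes from this report.

```r
# Made-up bedroom counts covering the values 1 to 6.
bedrooms <- c(2, 1, 3, 2, 6, 4, 5, 2)

# factor() turns the vector into a categorical variable.
bedroom_factor <- factor(bedrooms)

# levels() lists the distinct categories as character strings.
bedroom_levels <- levels(bedroom_factor)
print(bedroom_levels)

# table() counts how many houses fall into each category.
bedroom_counts <- table(bedroom_factor)
print(bedroom_counts)
```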
3.1.1 - Analysis 1.1: Find the house size and rent preferred by people.
The above source code generates a scatter plot that is used to visualize the relationship between
house size and rent, grouped by contact way. The data dataset is derived from the house_data
dataset with the conditions of a maximum rent of 100,000 INR and the exclusion of the builder
contact way. The maximum rent is set to narrow the wide spread in rent, which ranges from
1,200 INR up to 3,500,000 INR.
Houses with the builder contact way are excluded from this analysis because there is only one
such house in the whole dataset (house_data), which is negligible.
In addition to the dataset, the categories of contact way are extracted using the built-in factor()
function, which takes a vector as its input, in this case the contact way from the derived data
dataset (Johnson, 2022). After the dataset to be plotted is obtained, the ggplot() function from
the ggplot2 package is used to plot the graph. The three key components of the grammar of
graphics plots are:
• Data: the observations in the dataset
• Aesthetics: mappings from the data to visual properties
• Geoms: geometric objects that represent what is in the plot
The first parameter of the ggplot() function is the dataset to be plotted. The second parameter
is aes(), which maps the x-axis (Rent) and y-axis (Size) of the graph. The geom_point()
function is used to create the scatter plot, which helps display the relationship between two
continuous variables. It takes aes() as its parameter, which maps the shape and colour of the
points to the contact way. Lastly, the labs() function is used to assign the graph title as well as
the names of the x-axis and y-axis (Madhugiri, 2022).
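The filtering and plotting steps described above can be sketched as follows, assuming the ggplot2 package is installed. The six rows, the column names Rent, Size, and Contact_Way, and the title and axis labels are assumptions for illustration, not the original source code.

```r
library(ggplot2)

# Made-up stand-in for house_data.
house_data <- data.frame(
  Rent = c(12000, 25000, 60000, 95000, 3500000, 40000),
  Size = c(600, 900, 1500, 2200, 2500, 1000),
  Contact_Way = c("Owner", "Owner", "Agent", "Agent", "Agent", "Builder")
)

# Keep rents up to 100,000 INR and drop the builder contact way.
data <- subset(house_data, Rent <= 100000 & Contact_Way != "Builder")

# Data, aesthetics, and a point geom: the three components listed above.
p <- ggplot(data, aes(x = Rent, y = Size)) +
  geom_point(aes(shape = Contact_Way, colour = Contact_Way)) +
  labs(
    title = "House size and rent by contact way",
    x = "Rent (INR)",
    y = "Size (sqft)"
  )

# print(p) draws the plot in an interactive session.
```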
From the above graph, it can be observed that owners tend to rent out houses at lower rents than agents. Most houses rented out by owners
fall within about 25,000 INR, whereas houses rented out by agents cover a wider range of rent. The vertical axis shows that agents tend to
rent out larger houses than owners, with most house sizes starting at around 500 square feet (sqft). Together, these two observations
suggest that agents tend to rent out luxury houses with higher rent, possibly because they earn higher commissions. In contrast, houses
that are rented out by owners are mostly common houses with sizes below 2,000 sqft.
The above source code generates a histogram that is used to visualize the relationship between
the preferred tenant and the contact way. The data dataset is derived from the house_data
dataset with the condition of excluding houses with the builder contact way. The reason for this
exclusion is given in analysis 1.1. The fill argument is used to group the bars by preferred
tenant so that the graph is easier to understand. The geom_histogram() function is used to
generate the histogram, where the stat argument with the value “count” counts the rows for
each x-axis value and the position argument with the value “dodge” places the bars side by
side. The labs() function is used to set the histogram title and the axis names. Lastly, the
geom_text() function is used to display the count on top of each bar.
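A grouped count chart of this kind can be sketched as below, assuming the ggplot2 package (version 3.3.0 or later, for after_stat()) is installed. Note that this sketch swaps in geom_bar(), which counts rows by default, in place of geom_histogram() with stat set to “count”; the five rows, the column names, and the labels are made up for illustration.

```r
library(ggplot2)

# Made-up stand-in for the derived data dataset.
data <- data.frame(
  Contact_Way = c("Owner", "Owner", "Agent", "Agent", "Owner"),
  Tenant_Preferred = c("Bachelors", "Family", "Bachelors/Family",
                       "Family", "Family")
)

p <- ggplot(data, aes(x = Contact_Way, fill = Tenant_Preferred)) +
  # geom_bar() counts the rows per x value; "dodge" puts bars side by side.
  geom_bar(position = "dodge") +
  # Label each bar with its count, dodged to line up with the bars.
  geom_text(
    aes(label = after_stat(count)),
    stat = "count",
    position = position_dodge(width = 0.9),
    vjust = -0.5
  ) +
  labs(
    title = "Preferred tenant by contact way",
    x = "Contact way",
    y = "Number of houses"
  )

# print(p) draws the chart in an interactive session.
```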
The result for analysis 1.2 shows that most agents and owners prefer renting out houses to
tenants in the bachelor or family group, with 849 and 2,594 of them respectively. There are 434
agents and 396 owners who prefer renting out houses to bachelors only. The houses least often
offered by both agents and owners are those for family tenants only, with a total of just 472.
The reason for families being the least preferred tenants could be attributed to the size of the
houses being rented out, as most of the houses posted are under 2,000 sqft, as shown in the
result of analysis 1.1.
3.1.3 - Analysis 1.3: Find the furnishing status of houses preferred by people
The above source code generates a stacked bar chart to visualize the relationship between the
furnishing status and the contact way. The data dataset takes only the contact way and
furnishing status columns from the house_data dataset, with the condition of excluding houses
with the builder contact way. The reason for excluding the builder is given in analysis 1.1. The
data is then converted into a table using the built-in table() function, which tabulates the
furnishing status preferred by the owner or the agent. The table is then converted back into a
data frame for renaming and plotting. The built-in names() function is used to rename the
column headings. The stat argument with the value “identity” makes the bar height take the
y-axis value supplied in the data. Lastly, the geom_text() function is used to display the count
for each bar.
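The tabulation step described above can be sketched in base R as follows. The five rows and the column names, including Count, are assumptions for illustration.

```r
# Made-up stand-in holding only the two columns of interest.
data <- data.frame(
  Contact_Way = c("Owner", "Owner", "Agent", "Agent", "Owner"),
  Furnishing_Status = c("Semi-Furnished", "Unfurnished", "Semi-Furnished",
                        "Furnished", "Semi-Furnished")
)

# table() cross-tabulates the two columns; as.data.frame() turns the
# result back into one row per combination with a Freq column.
counts <- as.data.frame(table(data))

# Rename the columns so the count column has a meaningful heading.
names(counts) <- c("Contact_Way", "Furnishing_Status", "Count")
print(counts)
```

The resulting data frame has one row per combination of contact way and furnishing status, which is the shape needed when the bar height is supplied directly through the “identity” stat.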
The result for analysis 1.3 shows that most agents and owners prefer to rent out houses that are
semi-furnished, with 777 and 1,474 of them respectively. A total of 459 agents and 1,355
owners prefer to rent out unfurnished houses. The least preferred furnishing status among both
agents and owners is furnished, with a combined total of only 680. This could mean that they
are not willing to pay the upfront cost of furnishing the houses for the tenant, as their goal is to
maximize the return on investment (ROI).
3.1.4 - Analysis 1.4: Find the city in which the houses preferred by people are located
The above source code generates a histogram to visualize the relationship between the city and
the contact way. The data dataset is derived from the house_data dataset with the condition of
excluding houses with the builder contact way. The reason for this exclusion is given in
analysis 1.1. The fill argument is used to group the bars by contact way. The
geom_histogram() function is used to generate the histogram, and the labs() function is used
to set the histogram title and the axis names. The geom_text() function is also used to display
the count on top of each bar. Lastly, the facet_wrap() function is used to split the single
histogram into multiple related panels (Haaf, 2019).
The most obvious result that can be observed from the histogram above is that most agents
prefer to rent out houses located in Mumbai, with a total of 780 of them. As for owners,
most of them prefer to rent out houses located in Chennai and Hyderabad. On the other hand,
agents least prefer renting out houses in Kolkata, with only 81 agents renting out houses
in this city. Owners least prefer renting out houses in Mumbai, with only 192 of them.
Overall, most people prefer to rent out houses in Mumbai, with a total of 972 of them. In
contrast, Kolkata is the least preferred city in which to rent out houses, with a total of only
523 people choosing to rent out houses there.
3.1.5 - Analysis 1.5: Find the area type of houses preferred by people
The above source code generates a histogram to visualize the relationship between the area
type and the contact way. Kindly refer to the source code explanation of analysis 1.4 for the
details of obtaining the data and plotting the histogram.
The above histogram shows that most agents prefer to rent out houses whose size is
calculated based on carpet area, and only a minority of them prefer super area. As for
owners, most of them prefer houses whose size is calculated based on super area. Some
owners prefer carpet area, and only two owners prefer built area.
3.1.6 - Analysis 1.6: Find the number of bedrooms in houses preferred by people
The above source code generates a histogram to visualize the relationship between the number
of bedrooms and the contact way. Kindly refer to the source code explanation of analysis 1.4
for the details of obtaining the data and plotting the histogram.
The above histogram shows that most agents and owners prefer to rent out houses with
2 bedrooms. More agents prefer to rent out houses with 3 bedrooms than with one. Conversely,
more owners prefer to rent out houses with one bedroom than with 3.
3.1.7 - Analysis 1.7: Find the number of bathrooms in houses preferred by people
The above source code generates a histogram to visualize the relationship between the number
of bathrooms and the contact way. Kindly refer to the source code explanation of analysis 1.4
for the details of obtaining the data and plotting the histogram.
The above histogram shows that most agents and owners prefer to rent out houses with
2 bathrooms. More agents prefer to rent out houses with 3 bathrooms than with one.
Conversely, more owners prefer to rent out houses with one bathroom than with 3.
3.2.1 - Analysis 2.1: Find the average house size of each city
The above source code generates a bar chart that is used to visualize the average size
of houses in each city. The built-in levels() function is used to access the levels
attribute of the city factor. A for loop is then used to obtain the average house size of each
city, storing it in the average_sizes vector. The built-in round() function is used to round
the decimal values, where the digits argument with the value 2 indicates the number of
decimal places to keep. A data frame, data, is created using data.frame() to store the city
names and average house sizes. Once the data to be plotted is obtained, geom_bar()
is used to plot a bar chart. The fill argument with the value “skyblue” sets the bar colour,
and the width argument defines the width of the bars. Lastly, coord_flip() is used to
flip the Cartesian coordinates so that the bars are drawn horizontally.
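The data preparation described above might look like the following base R sketch. The five rows and the column names City, Size, and Average_Size are assumptions; the original loop may differ in detail.

```r
# Made-up stand-in with the city stored as a factor.
house_data <- data.frame(
  City = factor(c("Kolkata", "Kolkata", "Mumbai", "Mumbai", "Delhi")),
  Size = c(700, 805, 900, 911.5, 650)
)

# levels() lists the distinct cities in alphabetical order.
cities <- levels(house_data$City)

average_sizes <- c()
for (city in cities) {
  city_sizes <- house_data$Size[house_data$City == city]
  # Round the mean to 2 decimal places and append it to the vector.
  average_sizes <- c(average_sizes, round(mean(city_sizes), digits = 2))
}

# One row per city, ready to be passed to a bar chart.
data <- data.frame(City = cities, Average_Size = average_sizes)
print(data)
```

The same averages could also be obtained without a loop using tapply(house_data$Size, house_data$City, mean).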
The above bar chart shows that most houses with bigger sizes are located in Hyderabad and
Chennai, where the average house size is above 1,000 sqft. On the other hand, most small-sized
houses are located in Kolkata and Delhi, where the average size is below 800 sqft. The medium-
sized houses are mostly located in Mumbai and Bangalore, with an average house size of 905.9
and 985.93 sqft respectively.
3.2.2 - Analysis 2.2: Find the furnishing status of houses in each city
The above source code generates a histogram that is used to visualize the furnishing
status of houses in each city. Kindly refer to the source code explanation of analysis 1.4 for
details of the built-in functions used to plot the histogram.
The histogram above shows that in the majority of the cities semi-furnished houses are the
most common, followed by unfurnished and then furnished houses. However, the percentage of
unfurnished houses in Kolkata is higher than in the other cities.
3.2.3 - Analysis 2.3: Find the area type of houses in each city
The above source code generates a histogram that is used to visualize the area type
of houses in each city. Kindly refer to the source code explanation of analysis 1.4 for
details of the built-in functions used to plot the histogram.
The above histogram shows that there are more houses in which the size is calculated based on
the super area in cities such as Bangalore, Chennai, and Hyderabad. In contrast, there are more
houses of which the size is calculated based on carpet area in cities such as Delhi, Kolkata, and
Mumbai. Most houses calculated based on the super area are located in Hyderabad and carpet
areas are mainly located in Mumbai. There are also two houses in which the size is calculated
based on the built area located in Chennai and Hyderabad.
3.2.4 - Analysis 2.4: Find the contact way of renting houses in each city
The above source code generates a histogram that is used to visualize the contact
way of renting houses in each city. Kindly refer to the source code explanation of analysis 1.4
for details of the built-in functions used to plot the histogram.
The histogram above shows that the majority of the cities have most houses that require tenants
to contact the owner. In Mumbai, most houses require the tenant to contact the agent as there
are a total of 780 houses that have the contact way of the agent and only 192 houses have the
contact way of the owner. In Hyderabad, there is one house that requires the tenant to contact
the builder.
3.2.5 - Analysis 2.5: Find the tenant preferred by houses in each city
The above source code generates a histogram that is used to visualize the tenant
preferred by houses in each city. Kindly refer to the source code explanation of analysis 1.4 for
details of the built-in functions used to plot the histogram.
The histogram above shows that houses preferred to be rented out to bachelors or family
make up the largest proportion in all cities. In most cities, more houses prefer bachelors than
family as tenants. The exception is Mumbai, where the number of houses preferred to be rented
to families is higher than the number preferred to be rented to bachelors.
3.2.6 - Analysis 2.6: Find the number of bedrooms in houses in each city
The above source code generates a histogram that is used to visualize the number of
bedrooms in houses in each city. Kindly refer to the source code explanation of analysis 1.4 for
details of the built-in functions used to plot the histogram.
The most obvious result that can be observed from the histogram above is that most of the houses in all cities have 2 bedrooms. The
second most common houses in cities such as Bangalore, Delhi, Kolkata, and Mumbai have one bedroom. As for Chennai and Hyderabad, houses
with 3 bedrooms are the second most common. Houses with more than 4 bedrooms are found in all cities except Bangalore, which only has
houses with up to 4 bedrooms.
3.2.7 - Analysis 2.7: Find the number of the bathroom in houses in each city
Due to the high similarity of analysis techniques, kindly refer to analysis 2.6 for detailed information.
The majority of houses in all cities have 2 bathrooms, except in Delhi and Kolkata, where houses with only one bathroom are the most common.
Houses with more than 3 bathrooms are found in all cities except Kolkata, which only has houses with up to 3 bathrooms.
City/Attributes  Size    Furnishing status  Area type    Contact way  Preferred tenant
Bangalore        Medium  Semi-furnished     Super area   Owner        Family/Bachelor
Chennai          Large   Semi-furnished     Super area   Owner        Family/Bachelor
Delhi            Small   Semi-furnished     Carpet area  Owner        Family/Bachelor
Hyderabad        Large   Semi-furnished     Super area   Owner        Family/Bachelor
Kolkata          Small   Unfurnished        Carpet area  Owner        Family/Bachelor
Mumbai           Medium  Semi-furnished     Carpet area  Agent        Family/Bachelor
Several interesting results were found throughout these analyses. In analysis 2.2, it was found that semi-furnished houses are the most common
in all cities except Kolkata, where unfurnished houses are the most common. Another result, found in analysis 2.4, is that most houses in
Mumbai have agents as the contact way, while houses in the other cities have owners as the most common contact way.
The above source code generates a bar chart to visualize the average rent of houses in each city.
Kindly refer to the source code explanation of analysis 2.1 for the details of obtaining the data
and the built-in functions.
The bar chart above shows that houses in Mumbai have the highest average rent among all the
other cities, which is around 85,000 INR. The average rent in cities such as Bangalore, Chennai,
Delhi, and Hyderabad ranges from 20,000 INR to 30,000 INR. Kolkata is the cheapest city to
rent a house as it has the lowest average rent of houses at around 11,600 INR.
3.3.2 - Analysis 3.2: Find the average rent per sqft of houses in each city
The above source code generates a bar chart that is used to visualize the average
rent per sqft of houses in each city. Kindly refer to the source code explanation of analysis 2.1
for the details of obtaining the data and plotting a bar chart.
The result of analysis 3.2 shows that houses in Mumbai have the highest average rent per square
foot (psf) among all the cities, at around 80 INR psf. Delhi has the second highest average
rent psf, which stands at 67.63 INR psf. The average rent psf in cities such as Bangalore,
Chennai, Hyderabad, and Kolkata ranges from 15 to 25 INR psf.
3.3.3 - Analysis 3.3: Find the average rent of houses’ room in each city
The above source code generates a bar chart that is used to visualize the average
rent per room of houses in each city. The number of rooms in a house includes both the
bathrooms and the bedrooms. Kindly refer to the source code explanation of analysis 2.1 for the
details of obtaining the data and plotting a bar chart.
The result of analysis 3.3 shows that Mumbai has the highest average rent per room, where
each room is rented at an average of 16,894.85 INR. In contrast, Kolkata has the lowest average
rent per room, at 3,318.08 INR. The average rent per room in cities such as Bangalore,
Chennai, Hyderabad, and Delhi ranges from 4,000 to 7,000 INR.
City/Rent (INR)  Average house rent  Average rent psf  Average room rent
Bangalore        INR 24,966.37       INR 21.75         INR 5,748.72
Chennai          INR 21,614.09       INR 19.79         INR 4,791.13
Delhi            INR 29,461.98       INR 67.63         INR 6,427.34
Hyderabad        INR 20,555.05       INR 23.59         INR 4,320.99
Kolkata          INR 11,645.17       INR 16.49         INR 3,318.08
Mumbai           INR 85,321.20       INR 81.67         INR 16,894.85
In short, Mumbai is the most expensive city in which to rent a house and Kolkata is the most
affordable. An interesting finding from analysis 3.2 is that the average rent psf in Delhi is
significantly higher relative to its other measures, such as average room rent. This means that
houses in Delhi can be expensive for tenants who wish to rent a larger house while remaining
affordable for tenants who are fine with living in a smaller house.
3.4 - Question 4: Does the rent directly affect the size of houses?
The house rent in each city was analysed thoroughly in question 3, and this question aims
to find out whether the rent affects the size of houses.
3.4.1 - Analysis 4.1: Find the relationship between the rent with house size
The above source code generates a scatter plot that is used to visualize the
relationship between house size and rent. The quantile() function is used to obtain the sample
quantiles of a dataset. The maximum rent in this analysis is set to the 75th percentile of the
dataset to reduce the wide spread in rent. The geom_smooth() function is used to add a
trendline over the existing plot. The default trendline is a smoothed curve (a loess line for
smaller datasets), so the method parameter with the value “lm”, which stands for linear model,
is used to add a straight-line fit instead (Ebner, 2022). The colour parameter is used to specify
the colour of the trendline. The se parameter is set to FALSE to override the default value
of TRUE, which would add a confidence interval around the trendline.
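The percentile filter and the linear trendline can be sketched as follows, assuming the ggplot2 package is installed. The eight rows, the column names, the red trendline colour, and the labels are assumptions for illustration.

```r
library(ggplot2)

# Made-up stand-in for house_data with one extreme rent.
house_data <- data.frame(
  Rent = c(5000, 8000, 10000, 12000, 15000, 20000, 30000, 100000),
  Size = c(400, 550, 700, 800, 950, 1100, 1500, 3000)
)

# Use the 75th percentile of rent as the upper limit.
max_rent <- quantile(house_data$Rent, probs = 0.75)
data <- subset(house_data, Rent <= max_rent)

p <- ggplot(data, aes(x = Rent, y = Size)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band.
  geom_smooth(method = "lm", colour = "red", se = FALSE) +
  labs(
    title = "House size against rent",
    x = "Rent (INR)",
    y = "Size (sqft)"
  )

# print(p) draws the plot in an interactive session.
```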
The scatter plot in the result of analysis 4.1 shows that the higher the rent, the more widely the
points are scattered vertically. Overall, rent is related to house size: the trendline shows that
houses with higher rent tend to be larger, and vice versa.
3.4.2 - Analysis 4.2: Find relationships between average rent with bedroom number
The above source code generates a bar chart to visualize the relationship between the average
rent and the number of bedrooms. Kindly refer to the source code explanation of analysis 2.1
for the details of obtaining data and plotting a bar chart. In this analysis, the geom_line()
function is used to connect the tops of the bars. The group parameter is set to one to override
the default grouping by the x-axis value, which is the number of bedrooms. The
scale_y_continuous() function, together with the scales package, is used to avoid axis labels
in scientific notation, such as 1e+00.
The bar chart above shows that the average rent rises with the number of bedrooms. However,
this trend stops at houses with 5 bedrooms and then starts to decline: the average rent for a
6-bedroom house is 73,125 INR.
3.4.3 - Analysis 4.3: Find relationships between average rent with bathroom number
The above source code generates a bar chart to visualize the relationship between the average
rent and the number of bathrooms. This analysis is highly similar to analysis 4.2. The only
difference is that the data excludes houses with 10 bathrooms, because there is only one such
house in the whole dataset and it is negligible.
The bar chart above shows that the average rent rises with the number of bathrooms. However,
this trend stops at houses with 5 bathrooms and then starts to decline: the average rent for
houses with 6 and 7 bathrooms is 177,500 INR and 81,666.67 INR respectively.
3.5.1 - Analysis 5.1: Find the consistency of houses posted in the last few months
The above source code generates a line graph to visualize the consistency of houses posted in
the last few months. The data variable is assigned the value returned from the table() function,
which tabulates the count of houses posted on each date. It is then converted into a data
frame and its columns are named for plotting the line graph. The Posted Date field is also
converted to the date type using the as.Date() function. The cumsum() function is used to
calculate the cumulative sum of the houses posted each day.
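A minimal sketch of this data preparation is given below. The six sample dates are made up, and the column names Posted_Date, Date, Count and Cumulative are assumptions rather than the names used in the original source code.

```r
# Hypothetical sample of posting dates; the column name is an assumption.
house_data <- data.frame(
  Posted_Date = c("2022-05-01", "2022-05-01", "2022-05-02",
                  "2022-05-04", "2022-05-04", "2022-05-04")
)

# Tabulate the number of houses posted on each date.
data <- as.data.frame(table(house_data$Posted_Date), stringsAsFactors = FALSE)
names(data) <- c("Date", "Count")

# Convert the date strings to Date objects and accumulate the daily counts.
data$Date <- as.Date(data$Date)
data$Cumulative <- cumsum(data$Count)
```

For this sample the daily counts are 2, 1 and 3, so the cumulative column holds 2, 3 and 6.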
The line graph above shows that the cumulative number of houses posted increased steadily
from the beginning of May until July. This means the market was stable throughout this period,
as there was no sudden increase or decrease in the number of houses being posted.
3.5.2 - Analysis 5.2: Find the house posted daily in the last few months
The above source code generates a line graph to visualize the number of houses posted in the
last few months. Kindly refer to the source code explanation of analysis 5.1 for detailed
information on obtaining data and built-in functions, and to the source code explanation of
geom_smooth() in analysis 4.1.
The line graph above shows that the number of houses posted from May until July has gradually
increased based on the trendline. The number of houses posted from mid-May until early June
is relatively consistent, as indicated by the narrower confidence interval (CI) for this period
compared with April and July.
3.5.3 - Analysis 5.3: Find the average rent of houses in the last few months
The above source code generates a connected scatterplot to visualize the average rent of houses
in the last few months. The data consists of each posted date and the average rent of houses
posted on that date. Dates with an average house rent above 100,000 INR are excluded to
narrow the spread of the average rent, which otherwise ranges from 10,000 INR to 260,000 INR.
The connected scatterplot above shows that the average rent of houses has gradually increased
from May until July based on the trendline. The average rent in April and July has a wider
confidence interval (CI) than the period from mid-May until early June. This means the
average rent of houses from mid-May until early June did not fluctuate as much as it did in
April and July.
3.5.4 - Analysis 5.4: Find the rent of houses posted in the last week
The above source code generates box plots to visualize the rent of houses posted in the last
week. The data is derived from the house_data dataset with the condition of having a maximum
rent of 100,000 INR. Kindly refer to the reason for doing so under this section. The data posted
in the last week is obtained by subtracting 7 days from the maximum posted date, that is the
latest date. The geom_boxplot() function is used to plot the box plot, where the fill parameter
sets the fill colour of the boxes.
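The date arithmetic can be sketched as below. The sample rows and the column names Posted_Date and Rent are assumptions, and so is the choice of a strict comparison, which keeps the seven days that end on the latest date.

```r
# Hypothetical sample data; the column names are assumptions.
house_data <- data.frame(
  Posted_Date = as.Date(c("2022-06-20", "2022-07-03", "2022-07-05",
                          "2022-07-09", "2022-07-11")),
  Rent = c(15000, 120000, 22000, 45000, 17500)
)

# Cap the rent at 100,000 INR, then find the start of the last week.
data <- house_data[house_data$Rent <= 100000, ]
last_week_start <- max(data$Posted_Date) - 7

# Keep only the houses posted after that date.
last_week <- data[data$Posted_Date > last_week_start, ]
```

With a latest date of 11 July, the cut-off is 4 July, so the rows from 5, 9 and 11 July are kept.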
The box plot above shows that the rent of houses increased from the 5th of July until the 9th
of July. The median rent remained below 25,000 INR up to the 8th of July and then spiked on
the 9th of July to around 45,000 INR. The rent started to drop over the next 2 days and settled
at 17,500 INR on the 11th of July. On the 6th and 9th of July, the house rent ranged from 4,000
INR up to 90,000 INR.
The above source code generates a bar chart to visualize the size of houses preferred by tenants.
The get_freq() function is used to obtain the frequency of values within a specific column,
that is the count of tenants within each range of house sizes. Kindly refer to the source code
explanation in analysis 2.1 for details of plotting a bar chart.
The bar chart above shows that most tenants, 4,222 of them, prefer houses with a size below
1,600 sqft. There are 472 tenants who prefer houses with sizes from 1,601 to 3,200 sqft. Only
52 tenants in total prefer houses of 3,201 sqft and above.
The above source code generates a bar chart to visualize the number of bedrooms in houses
preferred by tenants. The data variable is assigned the value returned from the table() function,
which tabulates the frequency of bedroom numbers. It is then converted into a data frame and
its columns are named for plotting the bar chart. Kindly refer to the source code explanation in
analysis 2.1 for the details of built-in functions in plotting the bar chart.
The bar chart above shows that most tenants, 2,265 of them, prefer houses with 2 bedrooms.
A total of 1,167 and 1,098 tenants prefer houses with one and 3 bedrooms respectively. Only
216 tenants in total prefer houses with more than 3 bedrooms.
The above source code generates a bar chart to visualize the number of bathrooms in houses
preferred by tenants. Kindly refer to the source code explanation of analysis 6.2, which is
highly similar to this analysis.
The bar chart above shows that most tenants, 2,291 of them, prefer houses with 2 bathrooms.
A total of 1,474 and 749 tenants prefer houses with one and 3 bathrooms respectively. Only
232 tenants in total prefer houses with more than 3 bathrooms.
3.6.4 - Analysis 6.4: Find the area in which the tenant’s preferred house is located
The above source code generates a treemap to visualize the top 10 areas of houses preferred
by tenants. The table() built-in function is used to tabulate the areas preferred by the
tenants. The table is then converted into a data frame and renamed using the names()
function. The data is reordered using the order() function with the decreasing parameter set
to true. The top 10 rows are obtained using the head() built-in function with 10 as the second
parameter. The Area column is replaced with itself concatenated with the percentage value
obtained using the get_percent() function.
The treemap() function from the treemap package is used to plot the treemap where the first
parameter is the data to be plotted. The index parameter is for the column that provides groups,
that is the Area column. The vSize parameter is for the value of each group, that is the Count
column. The type parameter is used to specify the type of the treemap, which determines how
the rectangles are coloured. The provided value for this parameter is “index”, which means the
colours are determined by the index variables, that is the Area column.
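The data preparation for the treemap can be sketched in base R as below. The sample areas are made up, Area_Locality is an assumed column name, the percentage is computed inline instead of calling the get_percent() helper, and only the top 2 rows are kept rather than the top 10 so that the sample stays small.

```r
# Hypothetical sample; Area_Locality is an assumed column name.
house_data <- data.frame(
  Area_Locality = c("Bandra West", "Gachibowli", "Bandra West",
                    "Velachery", "Bandra West", "Gachibowli")
)

# Tabulate the areas and name the resulting columns.
data <- as.data.frame(table(house_data$Area_Locality), stringsAsFactors = FALSE)
names(data) <- c("Area", "Count")

# Sort by count, largest first, and keep the top rows (10 in the report, 2 here).
data <- data[order(data$Count, decreasing = TRUE), ]
top <- head(data, 2)

# Append each area's share of the kept rows to its label.
share <- round(top$Count / sum(top$Count) * 100, 2)
top$Area <- paste0(top$Area, " (", share, "%)")
```

The resulting top data frame could then be passed to treemap() with index = "Area", vSize = "Count" and type = "index", as described above.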
The above treemap shows that most tenants prefer to rent houses located in the Bandra West
area, at around 16.16% of them. The second most preferred area is Gachibowli, at around
12.66%.
3.6.5 - Analysis 6.5: Find the city in which the tenant’s preferred house is located
The above source code generates a pie chart to visualize the percentage of houses for rent
in each city. Kindly refer to the source code explanation of analysis 6.2, which uses the same
technique to obtain the data to be plotted. The percentage of each city’s rental houses is
obtained using the get_percent() helper function, which calculates the percentage of each
value relative to the total. ggplot2 does not offer a specific geom for pie charts. The trick is to
build a stacked bar chart using the geom_col() function and make it circular with the
coord_polar() function. The theta parameter with the value of “y” maps the angle to the y
variable. The paste() function is used to concatenate the percentage with the count of rental
houses. The sep argument defaults to a space, so it is overridden with an empty string. The
theme() function is used to override the default theme elements such as the axis text, ticks,
title and panel grid.
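A minimal sketch of the pie chart technique is shown below. The counts are made-up round numbers, the percentage is computed inline rather than with get_percent(), and the label format is an assumption.

```r
library(ggplot2)

# Hypothetical counts per city (not the report's data).
data <- data.frame(City = c("Mumbai", "Delhi", "Kolkata"),
                   Count = c(50, 30, 20))
data$Percent <- round(data$Count / sum(data$Count) * 100)

# sep = "" overrides the default single-space separator of paste().
data$Label <- paste(data$Percent, "% (", data$Count, ")", sep = "")

# A single stacked bar, bent into a circle by mapping the angle to y.
p <- ggplot(data, aes(x = "", y = Count, fill = City)) +
  geom_col() +
  coord_polar(theta = "y") +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5)) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())
```

Without the sep argument, the first label would read "50 % ( 50 )" instead of "50% (50)".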
The above pie chart shows that most of the houses for rent are in Mumbai, at around 20% or
972 houses. Bangalore and Chennai share the second largest proportion, at around 19% or 886
and 891 houses respectively. Hyderabad has a similar number of houses for rent, at around
18% or 868 houses. Delhi and Kolkata have fewer houses for rent, at around 13% and 11% or
605 and 524 houses respectively.
3.6.6 - Analysis 6.6: Find the furnishing status of houses preferred by tenants
The above source code generates a 3D pie chart to visualize the furnishing status of houses
preferred by tenants. Kindly refer to the source code explanation of analysis 6.2, which uses
the same technique to obtain the data to be plotted. The pie3D() function from the plotrix
package is used to plot the 3D pie chart. The first parameter holds the data to be plotted and
the labels parameter specifies the text on each slice of the pie chart, that is the furnishing
status along with the percentage and tenant count. The percentage of each furnishing status is
obtained using the get_percent() function. The height parameter specifies the height of the
3D pie chart while the explode parameter separates the slices from the pie chart. The col
parameter specifies the colours of the slices whereas the border parameter specifies the colour
of the slices' borders.
The 3D pie chart above shows that most tenants prefer semi-furnished houses, at around
47.43% or 2,251 of them. Around 38.24% or 1,815 tenants prefer unfurnished houses, and
14.33% or 680 tenants prefer fully-furnished houses.
3.6.7 - Analysis 6.7: Find the area type of houses preferred by tenants
The above source code generates a doughnut chart to visualize the area type of houses preferred
by tenants. Kindly refer to the source code explanation of analysis 6.2, which uses the same
technique to obtain the data to be plotted. The percentage of each area type is obtained
using the get_percent() function. The hsize variable holds the value that specifies the hole
size of the doughnut chart. ggplot2 has no specific function for plotting a doughnut chart.
The trick is to build a stacked bar chart using the geom_col() function and make it circular
with the coord_polar() function. The theta parameter with the value of “y” maps the angle to
the y variable. Finally, the xlim() function turns the pie into a doughnut by adding the empty
circle in the middle. Besides, the scale_fill_brewer() function is used to change the fill
colours of the chart. The theme() function is used to override the default theme elements such
as the axis text, ticks, title and panel grid.
The above doughnut chart shows that more than half of the tenants prefer houses in which the
size is calculated based on super area, with 51.54% or 2,446 of them. There are 48.42% or
2,298 tenants who prefer houses with the area type of carpet area and only 0.04% or 2 tenants
prefer houses with the built area.
3.6.8 - Analysis 6.8: Find the contact way of houses preferred by tenants
The above source code generates a bar chart to visualize the contact way of houses preferred
by tenants. Kindly refer to the source code explanation of analysis 6.2, which is highly similar
to this analysis.
The bar chart above shows that most tenants, 3,216 of them, prefer houses for which the contact
is the owner. There are 1,529 tenants who prefer houses that require them to contact the agent,
and only one tenant prefers a house for which the contact is the builder.
3.7 - Question 7: How do different groups of tenants choose their rental houses?
Bracket subsetting can be cumbersome and difficult to read, especially for analyses such as 5.3,
which involve complicated operations. As such, the subsequent question may involve various
data manipulation techniques for modifying data to make the code more readable and organized
(Naveen, 2022).
3.7.1 - Analysis 7.1: Find the budget for house rent by different tenants
The above source code generates a bar chart to visualize the budget for house rent by different
tenants. The pipe operator from the dplyr package, written as %>%, takes the output of one
function and passes it into another function as an argument, allowing the sequence of analysis
steps to be linked (Willeams, 2017). The group_by() function is used to group the data based
on the Tenant_Preferred column in the house_data dataset. The group_by() function alone
does not produce any output unless it is used with the summarize() function, which computes
aggregations on specified columns, here the mean value of rent. Thus, the mean rent is
calculated for each group in the Tenant_Preferred column. The na.rm argument is set to true to
exclude missing values when calculating the mean rent. The color argument in the geom_bar()
function is used to specify the border colour of the bars. Kindly refer to the source code
explanation in analysis 2.1 for more details of built-in functions in plotting the bar chart.
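The grouping and aggregation can be sketched as follows. The five sample rows are made up, and the output column name Average_Rent is an assumption.

```r
library(dplyr)

# Hypothetical sample rows; one rent is missing on purpose.
house_data <- data.frame(
  Tenant_Preferred = c("Bachelors", "Family", "Bachelors",
                       "Bachelors/Family", "Family"),
  Rent = c(10000, 30000, NA, 20000, 50000)
)

# Group by tenant type, then take the mean rent per group, ignoring NA values.
data <- house_data %>%
  group_by(Tenant_Preferred) %>%
  summarize(Average_Rent = mean(Rent, na.rm = TRUE))
```

Without na.rm = TRUE, the mean for the Bachelors group would be NA because one of its rents is missing.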
The above bar chart shows that bachelors have an average budget of 42,143.79 INR for renting
a house. Tenants who are either bachelors or family have an average budget of 31,210.79 INR.
Families have the highest average budget for house rent, at 50,020.34 INR.
3.7.2 - Analysis 7.2: Find the furnishing status of houses preferred by different tenants
The above source code generates a histogram to visualize the furnishing status of houses
preferred by different tenants. Kindly refer to the source code explanation of plotting a
histogram in analysis 1.2.
The above histogram shows that most bachelors, 409 of them, choose to rent unfurnished
houses. As for tenants who are either bachelors or family, most of them prefer semi-furnished
houses, that is 1,675 of them. Most families also prefer semi-furnished houses, with 252 of
them.
3.7.3 - Analysis 7.3: Find the area type of houses preferred by different tenants
The above source code generates a histogram to visualize the area type of houses preferred by
different tenants. Kindly refer to the source code explanation of plotting a histogram in
analysis 1.2.
The above histogram shows that most bachelors, 691 of them, choose to rent houses whose size
is calculated based on carpet area. As for tenants who are either bachelors or family, most of
them prefer houses calculated based on super area, that is 2,161 of them. Most families prefer
houses calculated based on carpet area, with 326 of them.
3.7.4 - Analysis 7.4: Find the contact way of houses preferred by different tenants
The above source code generates a histogram to visualize the contact way of houses preferred
by different tenants. Kindly refer to the source code explanation of plotting a histogram in
analysis 1.2.
The above histogram shows that most bachelors and families prefer to contact the agent when
choosing a rental house, that is 434 and 246 of them respectively. As for tenants who are either
bachelors or family, most of them choose to rent a house for which the contact is the owner,
that is 2,594 of them.
The above source code generates a histogram to visualize the city preferred by different
tenants. Kindly refer to the source code explanation of plotting a histogram in analysis 1.2.
The above histogram shows that most bachelors and families prefer to rent a house located in
Mumbai, that is 172 and 186 of them respectively. As for tenants who are either bachelors or
family, most of them choose to rent a house located in Bangalore, that is 694 of them.
The above source code generates a bar chart to visualize the top 3 areas preferred by different
tenants. The table() function is used to tabulate the frequency of data based on the Area and
the Tenant_Preferred columns. The table is then converted to a data frame and its columns
are named using the set_colnames() function. The data is arranged in descending order of the
Count value using the arrange() and desc() functions together. The areas with the top three
frequencies for each tenant group are obtained using the group_by() and slice() functions.
Kindly refer to the source code explanation of plotting a histogram in analysis 1.2.
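A minimal sketch of this "top N per group" technique is given below. The sample rows are made up, the columns are named with names() instead of set_colnames() to avoid an extra package, and only the top 2 areas per group are kept so that the sample stays small.

```r
library(dplyr)

# Hypothetical sample: four areas, two tenant groups.
house_data <- data.frame(
  Area = c("A", "A", "A", "B", "B", "C", "D", "A", "B"),
  Tenant_Preferred = c(rep("Family", 7), "Bachelors", "Bachelors")
)

# Cross-tabulate area by tenant group and flatten to a data frame.
data <- as.data.frame(table(house_data$Area, house_data$Tenant_Preferred),
                      stringsAsFactors = FALSE)
names(data) <- c("Area", "Tenant", "Count")

# Sort by count, largest first, then keep the first rows of each tenant group.
top_areas <- data %>%
  arrange(desc(Count)) %>%
  group_by(Tenant) %>%
  slice(1:2)
```

Because the rows are sorted before grouping, slice(1:2) returns the two most frequent areas within each tenant group; the report uses the top three.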
The above bar chart shows that most bachelors prefer to rent houses located at the kst
chattarpur apartments, that is 13 of them. As for tenants who are either bachelors or family,
most of them prefer to rent houses in Bandra West, that is 31 of them. Most families prefer to
rent houses located in the Vadapalani area.
The above source code generates a stacked bar chart to visualize the top three floors preferred
by different tenants. The floor is extracted using several functions in a for loop and stored in
the floors vector. Kindly refer to this section, which uses the same technique for deriving the
floor data. The mutate() function is used to replace the Floor column in the house_data data
frame with the extracted floor data. Kindly refer to the source code explanation in analysis
7.6 for details of built-in functions used in obtaining data. Lastly, the select() function is used
to rearrange the columns.
The factor() function is used to rearrange the order of floors by specifying the factor levels
for the stacked bar chart. Kindly refer to the source code explanation of plotting a histogram
in analysis 1.2.
Each stacked bar chart above can be perceived as a building where each floor is shown in a
different colour. The result shows that all groups of tenants prefer to rent houses located on
the first floor, followed by the second floor, with the ground floor being the third preferred
level.
3.7.8 - Analysis 7.8: Find the date of renting houses preferred by different tenants
The above source code generates a multiline graph to visualize the date of renting houses
preferred by different tenants. The table() function is used to tabulate the frequency of data
based on the Posted_Date and the Tenant_Preferred columns. The table is then converted to a
data frame and its columns are named using the set_colnames() function. The mutate()
function is used to replace the Date column in the data frame with the same column converted
to date objects. Lastly, the filter() function is used to subset the data frame so that only rows
whose Count value is greater than 0 are kept.
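These steps can be sketched as below. The sample rows are made up, and the columns are named with names() rather than set_colnames(), which is an assumption made to keep the sketch to one package.

```r
library(dplyr)

# Hypothetical sample of postings by tenant group.
house_data <- data.frame(
  Posted_Date = c("2022-05-01", "2022-05-01", "2022-05-02"),
  Tenant_Preferred = c("Family", "Bachelors", "Family")
)

# Cross-tabulate date by tenant group; every combination gets a row,
# including combinations that never occur (Count of 0).
data <- as.data.frame(table(house_data$Posted_Date, house_data$Tenant_Preferred),
                      stringsAsFactors = FALSE)
names(data) <- c("Date", "Tenant", "Count")

# Convert the dates and drop the combinations that never occurred.
data <- data %>%
  mutate(Date = as.Date(Date)) %>%
  filter(Count > 0)
```

The filter step matters because table() creates a zero-count row for bachelors on 2 May, which would otherwise pull that group's line down to zero on the graph.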
The plotting of the multiline graph is very similar to the line graph in analysis 5.2. The only
difference is that the group and color parameters are used to group the categorical data, that is
the Tenant. Moreover, the size parameter in geom_line is used to specify the thickness of the
line. The theme_ipsum() from the hrbrthemes package is used to provide the theme colour and
appearance for the graph.
The above result shows that most tenants in every group prefer to rent houses posted most
recently, that is during July. Tenants who are either bachelors or families also rent houses
posted from the end of April onwards, whereas bachelors and families rent houses posted
from mid-May onwards.
Tenants/Preference   Budget   Furnishing status   Area type     Contact way   City        Floor   Date
Bachelors            Low      Unfurnished         Carpet area   Agent         Mumbai      1       May - July
Bachelors/Family     Medium   Semi-furnished      Super area    Owner         Bangalore   1       April - July
Family               High     Semi-furnished      Carpet area   Agent         Mumbai      1       May - July
Analysis 7.1 shows that different groups of tenants have different budgets when choosing their rental houses. Bachelors prefer to rent
an unfurnished house while the other tenants prefer a semi-furnished house. Bachelors and family tenants prefer rental houses whose
size is calculated based on carpet area, whose contact is the agent, and which are located in Mumbai. Analysis 7.7 shows that all
groups of tenants prefer to live on the first floor of the building. It is also found that bachelors and families prefer to rent houses posted from
May until July, whereas tenants who are either bachelors or families prefer houses posted from April until July.
3.8.1 - Analysis 8.1: Find the house size preferred by different tenants
The above source code generates a bar chart to visualize the average house size preferred by
different tenants. Kindly refer to analysis 7.1, which uses similar analysis techniques.
The above bar chart shows that the average house size preferred by bachelors is 1,015.11 sqft.
Tenants who are either bachelors or family prefer houses with an average size of 930.27 sqft.
Families prefer the largest average house size, which is 1,155.33 sqft.
3.8.2 - Analysis 8.2: Find the number of bedrooms preferred by different tenants
The above source code generates a histogram to visualize the number of bedrooms preferred by
different tenants. Kindly refer to the source code explanation of plotting a histogram in
analysis 1.2.
The histogram above shows that all groups of tenants mostly prefer to rent a house with 2
bedrooms. More bachelors and families prefer houses with 3 bedrooms than one, that is 227 and
163 of them respectively. As for tenants who are either bachelors or family, more of them
prefer houses with one bedroom than 3, that is 914 of them.
3.8.3 - Analysis 8.3: Find the number of bathrooms preferred by different tenants
The above source code generates a histogram to visualize the number of bathrooms preferred
by different tenants. Kindly refer to the source code explanation of plotting a histogram in
analysis 1.2.
The histogram above shows that all groups of tenants mostly prefer to rent a house with 2
bathrooms. More bachelors and tenants who are either bachelors or family prefer houses with
one bathroom than 3, that is 239 and 1,177 of them respectively. As for families, more of them
prefer houses with 3 bathrooms than one, that is 115 of them.
3.9.1 - Analysis 9.1: Find the furnishing status of houses posted in July.
The above source code generates a 3D pie chart to visualize the furnishing status of houses
posted in July. The first step is to obtain the data whose date is greater than or equal to
“2022-7-1” using the filter() function. This ensures the data obtained was posted on or after
the first of July. The table() built-in function is used to tabulate the furnishing status of each
house. The table is then converted into a data frame using the data.frame() function and its
columns are renamed using the set_colnames() function. Kindly refer to the source code
explanation of plotting a 3D pie chart in analysis 6.6.
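A minimal sketch of the July filter and tabulation is shown below. The sample rows and the column names Posted_Date, Furnishing_Status, Status and Count are assumptions, and names() stands in for set_colnames().

```r
library(dplyr)

# Hypothetical sample; the column names are assumptions.
house_data <- data.frame(
  Posted_Date = as.Date(c("2022-06-30", "2022-07-01", "2022-07-05", "2022-07-06")),
  Furnishing_Status = c("Furnished", "Semi-Furnished", "Unfurnished", "Semi-Furnished")
)

# Keep only houses posted on or after the first of July.
july <- house_data %>%
  filter(Posted_Date >= as.Date("2022-07-01"))

# Tabulate the furnishing status and name the resulting columns.
data <- data.frame(table(july$Furnishing_Status))
names(data) <- c("Status", "Count")
```

The comparison uses >=, so a house posted exactly on 1 July is included while the one posted on 30 June is dropped.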
The above 3D pie chart shows that the majority of the houses posted in July are semi-furnished,
that is 54% or 525 of them. There are also 31% or 307 unfurnished houses posted in July. The
furnished houses are the least posted ones, which is only 15% or 146 of them.
3.9.2 - Analysis 9.2: Find the area type of houses posted in July.
The above source code generates a 3D pie chart to visualize the area type of houses posted in
July. Kindly refer to the source code explanation in analysis 9.1, which is highly similar to
this analysis.
The above 3D pie chart shows that, in July, more houses have their size calculated based on
carpet area than on super area. There are 64% or 622 houses with the area type of carpet area
and only 36% or 356 with the area type of super area.
3.9.3 - Analysis 9.3: Find the contact way of houses posted in July.
The above source code generates a 3D pie chart to visualize the contact way of houses posted
in July. Kindly refer to the source code explanation in analysis 9.1, which is highly similar to
this analysis.
The above 3D pie chart shows that, in July, more houses list agents than owners as the contact.
There are 53% or 516 houses with the contact way of agents and 47% or 462 with the contact
way of owners.
The above source code generates a bar chart to visualize the size of houses posted in July. The
first step is to obtain the data whose date is greater than or equal to “2022-7-1” using the
filter() function. This ensures the data obtained was posted on or after the first of July.
The get_freq() function is used to obtain the frequency of values in ranges within the Size
column. The range of house sizes is obtained by concatenating the from and to columns
provided by the get_freq() function. The rename() function is used to change the name of the
freq column to count. Kindly refer to the source code explanation in analysis 2.1 for details of
plotting a bar chart.
The above bar chart shows that most houses posted in July are below the size of 1,750 sqft, that
is 838 of them. There are 127 houses posted in July that fall within the range of 1,751 to 3,500
sqft.
3.9.5 - Analysis 9.5: Find the number of bedrooms in houses posted in July.
The above source code generates a bar chart to visualize the number of bedrooms in houses
posted in July. Kindly refer to the source code explanation in analysis 9.1, which is highly
similar to this analysis. As for plotting a bar chart, kindly refer to the source code
explanation in analysis 2.1.
The above bar chart shows that most houses posted in July have 2 bedrooms, that is 434 of
them. There are 309 houses posted in July with 3 bedrooms and 170 houses with one bedroom.
3.9.6 - Analysis 9.6: Find the number of bathrooms in houses posted in July.
The above source code generates a bar chart to visualize the number of bathrooms in houses
posted in July. Kindly refer to the source code explanation in analysis 9.1, which is highly
similar to this analysis. As for plotting a bar chart, kindly refer to the source code
explanation in analysis 2.1.
The above bar chart shows that most houses posted in July have 2 bathrooms, that is 488 of
them. There are 227 houses posted in July with 3 bathrooms and 183 houses with one
bathroom.
The above source code generates a 3D pie chart to visualize the city of houses posted in July.
Kindly refer to the source code explanation in analysis 9.1, which is highly similar to this
analysis.
The above 3D pie chart shows that the majority of the houses posted in July are located in
Mumbai, that is 28% or 275 of them. There are also 23% or 223 houses located in Hyderabad
posted in July. Houses located in Kolkata are the least posted ones, which is only 5% or 45 of
them.
The get_freq() function is written to obtain the frequency of values occurring in a specific
column. It was created for several reasons, the first being reusability, as several analyses
involve counting the same kind of data. For attributes such as house size and floor, which
have a wide range of distinct values, it is difficult to plot a readable chart or graph. In
analysis 6.1, instead of counting the frequency of each unique house size, this function allows
the house sizes to be categorized and counted based on specific ranges.
The get_freq() function receives three parameters. The first and second parameters are
self-explanatory from their names: data_frame is the data frame in which the data is stored,
and column is the name of the column that will be checked for repeating data. The interval
parameter is the number of ranges into which the data is divided. Returning to the example of
house sizes, if the given value for the interval parameter is 5, houses are counted in 5
different size ranges, as shown in figure 119. By default, the interval parameter is set to one
if none is provided.
Figure 121: Sample output of get_freq function with the default interval value
The first part of the function makes sure that the number provided for the interval parameter
is not less than 1; otherwise a Boolean value of false is returned. If the interval number is
valid, the interval value is calculated by dividing the maximum value in the given column by
the interval number. The round() function is also used in this calculation to round off decimal
places that could lead to errors in the subsequent algorithm.
The data variable stores the values of the specified column of the given data frame. The core
idea of the algorithm is to keep track of each data range and count the number of values that
fall within it. Therefore, two variables named from and to store the starting and ending points
of each range. The initial value of from is 0, the starting value of the first range, whereas to
is initialised with the first interval value.
The frequency of data within each range is obtained in a for loop. The number of iterations
equals the provided interval number, so there are 5 iterations in total if the interval parameter
is 5. Each iteration extracts the data whose value lies between the starting and ending points
of the current range, using the condition that the value must be greater than from and less than
or equal to to. The extracted data is stored in the match_result variable. The frequency of the
data within the range is counted using the length() function and stored in the Freq variable.
Once the starting and ending points for the next range have been set, the result for the current
range is added to a data frame named result. The data frame is created with the From, To and
Freq columns during the first iteration, and each subsequent iteration adds a new row to it.
After the for loop completes all the iterations, result is returned from the function.
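The steps described above can be sketched as follows. This is a minimal reconstruction from the description rather than the original source code: the argument names df, column and interval, the sample data frame houses and its Size column are assumptions made for illustration.

```r
# Hypothetical reconstruction of get_freq(); argument names are assumed.
get_freq <- function(df, column, interval = 1) {
  # Reject interval numbers below 1, as described in the text.
  if (interval < 1) {
    return(FALSE)
  }

  data <- df[[column]]
  # Width of each range: the column maximum divided by the interval number.
  interval_value <- round(max(data) / interval)

  from <- 0
  to <- interval_value
  result <- NULL

  for (i in seq_len(interval)) {
    # Keep the values that are greater than from and at most to.
    match_result <- data[data > from & data <= to]
    row <- data.frame(From = from, To = to, Freq = length(match_result))
    # Create the data frame on the first pass, append a row afterwards.
    result <- if (is.null(result)) row else rbind(result, row)
    # Set the starting and ending points of the next range.
    from <- to
    to <- to + interval_value
  }
  result
}

# Made-up house sizes, used only to exercise the sketch.
houses <- data.frame(Size = c(100, 250, 400, 650, 800, 1000))
get_freq(houses, "Size", interval = 5)
```

Two properties follow directly from the description and are worth checking against real data: values equal to 0 fall outside the first range, and because the range width is rounded, the last range can stop short of the column maximum for some inputs.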
The get_percent() function calculates the percentage of a value. It was created for
reusability, since several analyses involve calculating percentages. One use case is obtaining
the percentage of houses in each city in analysis 6.4.
The get_percent() function receives two parameters: the value to be calculated and the
number of decimal places to round to. By default, the round_digit parameter is set to zero if
no value is provided.
Figure 125: Sample output of get_percent function with the default round_digit value
The first part of this function obtains the denominator using the sum() function, which
returns the sum of the values (Lathiya, 2021). The value is then divided by the denominator
and multiplied by 100. Lastly, the round() function rounds the result to the requested number
of decimal places.
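A minimal sketch of this calculation is shown below. It assumes that the value parameter is a numeric vector (for example, the number of houses per city) and that the parameter is named value; only round_digit is named in the text. The city labels and counts are made up.

```r
# Hypothetical reconstruction of get_percent(); the name `value` is assumed.
get_percent <- function(value, round_digit = 0) {
  # The denominator is the total of all values.
  denominator <- sum(value)
  # Share of each value in the total, as a rounded percentage.
  round(value / denominator * 100, round_digit)
}

# Made-up counts of houses in three cities.
houses_per_city <- c(CityA = 30, CityB = 50, CityC = 20)
get_percent(houses_per_city)

# Two decimal places instead of the default of zero.
get_percent(c(1, 2), round_digit = 2)
```

Because the division is vectorised, a single call returns the percentage for every city at once.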
The bar charts and histograms provided for the analysis include text labels that show the
value of each bar. This feature helps the HRM team see the exact count behind each bar and
make precise decisions. For example, they can easily tell how many tenants prefer houses with
each furnishing status, as shown in figure 127.
The text on each bar is added using the geom_text() function. Inside it, aes() maps the label
parameter to the variable on the y-axis, which is Tenant.
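The labelling technique can be illustrated with the short sketch below. It assumes the ggplot2 package is installed; the data frame tenants, its Furnishing column and the counts are made up, and only the Tenant column name comes from the text.

```r
library(ggplot2)

# Made-up counts; only the Tenant column name is taken from the report.
tenants <- data.frame(
  Furnishing = c("Furnished", "Semi-Furnished", "Unfurnished"),
  Tenant = c(12, 30, 25)
)

p <- ggplot(tenants, aes(x = Furnishing, y = Tenant)) +
  geom_col() +
  # label is mapped to the y-axis variable, so each bar shows its own value;
  # vjust = -0.5 lifts the text slightly above the top of the bar.
  geom_text(aes(label = Tenant), vjust = -0.5)
p
```

The vjust value is a presentation choice in this sketch and is not taken from the original charts.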
Graphs such as the connected scatter plot provided for the analysis include a trendline that
illustrates the prevailing direction of the data. This feature helps the HRM team see the overall
direction of the data presented in the graph and make better decisions. For example, it makes
it easy for them to visualize the trend in the number of houses posted from April until July.
The trendline is added to a graph using the geom_smooth() function. For a dataset with fewer
than 1,000 observations, the default trendline is a loess line. Setting the method parameter to
“lm”, which stands for linear model, produces a straight line instead (Ebner, 2022). The colour
of the trendline can be changed through the color parameter. By default, the se parameter is
set to TRUE, which draws the confidence interval around the trendline. The fill parameter sets
the colour that fills the area of the confidence interval.
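The parameters discussed above can be seen together in the following sketch. It assumes the ggplot2 package is installed; the data frame posts, its column names, the counts and the chosen colours are all made up for illustration.

```r
library(ggplot2)

# Made-up number of houses posted per month (4 = April, 7 = July).
posts <- data.frame(Month = 4:7, Houses = c(20, 45, 60, 85))

p <- ggplot(posts, aes(x = Month, y = Houses)) +
  geom_line() +
  geom_point() +
  # method = "lm" fits a straight line, se = TRUE keeps the confidence band,
  # color sets the line colour and fill sets the band colour.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE,
              color = "red", fill = "pink")
p
```

Setting se = FALSE would hide the confidence band and leave only the straight line.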
A doughnut chart is a pie chart with its centre removed. Instead of slices, arc segments
display the individual dimensions. Every arc has the same width but a different length. This
makes it easier for the HRM team to compare individual categories or dimensions with the
whole, because they only need to see which arc is longer to tell which value is greater.
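One common way to draw such a chart with ggplot2 is sketched below. This is an assumption about the technique, not a reproduction of the code used for the analysis, and the data frame shares with its categories and counts is made up.

```r
library(ggplot2)

# Made-up category counts.
shares <- data.frame(Category = c("A", "B", "C"), Count = c(50, 30, 20))

p <- ggplot(shares, aes(x = 2, y = Count, fill = Category)) +
  # One stacked bar whose segments become the arcs.
  geom_col(width = 1) +
  # Bend the bar into a ring; segment height becomes arc length.
  coord_polar(theta = "y") +
  # Leaving empty space below x = 1.5 creates the hole in the centre.
  xlim(0.5, 2.5) +
  theme_void()
p
```

All arcs share the same width because they come from a single bar, so only their length differs.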
In analyses such as 1.3 and 7.6, a stacked bar chart makes the comparison easier for the HRM
team to read than a combined chart would.
4.2.5 - Treemap
The treemap captures the relative sizes of data categories, giving the HRM team a quick
perception of the items that contribute most to each category.
5.0 - Conclusion
In conclusion, several analyses have been performed on the given dataset to help the HRM
team in decision-making. These analyses involved various data analytics techniques such as
data pre-processing, exploration, manipulation, and visualization. Throughout the analysis,
the relationships between multiple factors have been visualized and explained to determine
how people choose their rental houses. However, the HRM team is advised not to rely fully on
the analysis results, because the current uncertainty in interest rates makes the property
market difficult to predict.
6.0 - References
Chidalu, O. T. (n.d.). What is the names() function in R? Retrieved from Educative:
https://www.educative.io/answers/what-is-the-names-function-in-r
CN, P. (2022, August 4). The head() and tail() function in R - Detailed Reference. Retrieved
from Digital Ocean:
https://www.digitalocean.com/community/tutorials/head-and-tail-function-r
Ebner, J. (2022, July 19). How to Use geom_smooth in R. Retrieved from Sharp Sight:
https://www.sharpsightlabs.com/blog/geom_smooth/
Haaf, S. (2019, April 2). Easy multi-panel plots in R using facet_wrap() and facet_grid()
from ggplot2. Retrieved from Zev Ross:
http://zevross.com/blog/2019/04/02/easy-multi-panel-plots-in-r-using-facet_wrap-and-facet_grid-from-ggplot2/
Johnson, D. (2022, November 19). Factor in R: Categorical Variable & Continuous
Variables. Retrieved from Guru99:
https://www.guru99.com/r-factor-categorical-continuous.html#:~:text=What%20is%20Factor%20in%20R,integer%20data%20values%20as%20levels.
Kosourova, E. (2022, June 15). How to Write Functions in R (with 18 Code Examples).
Retrieved from Data Quest:
https://www.dataquest.io/blog/write-functions-in-r/#:~:text=To%20declare%20a%20user%2Ddefined,at%20each%20of%20them%20separately.
Lathiya, K. (2021, September 11). Sum in R: How to Find Sum of Vectors. Retrieved from
R-Lang: https://r-lang.com/sum-in-r/
Madhugiri, D. (2022, March 7). A Comprehensive Guide on ggplot2 in R. Retrieved from
Analytics Vidhya:
https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-guide-on-ggplot2-in-r/
Naveen. (2022, January 7). Data Manipulation in R. Retrieved from Intelli Paat:
https://intellipaat.com/blog/tutorial/r-programming/data-manipulation-in-r/
Willems, K. (2017, November). Pipes in R Tutorial For Beginners. Retrieved from Data
Camp: https://www.datacamp.com/tutorial/pipe-r-tutorial
Zach. (2022, April 19). How to Use the dim() Function in R. Retrieved from Statology:
https://www.statology.org/r-dim-function/