Introduction
Data analytics is now one of the main tools business managers use to analyse trends, optimize
operations, and improve customer experience. As one of the leading platforms for short-term
rentals, Airbnb generates large datasets containing a wealth of information for studying
pricing, occupancy and customer behaviour. Meaningful analysis is only possible, however, once
the dataset is clean, structured and free of inconsistencies.
Incorrect or missing data can generate misleading insights, which in turn negatively affect
business strategy. Data preprocessing is therefore a necessary step to assure high-quality
data. This report analyses the Barwon Southwest, Victoria, Australia dataset provided by
Inside Airbnb. The major objectives of the report are to:
3. Perform data preprocessing steps so that the dataset is ready for later analysis.
The dataset will be processed in Microsoft Excel: handling missing values, removing duplicates,
resolving inconsistencies, and identifying outliers. High-quality data will support insights
that can drive meaningful business decisions in the short-term rental market.
The last_review column has no recorded date for numerous properties in the dataset. These
missing values appear to result from data collection errors or from properties that have
never been reviewed.
The reviews_per_month column contains several empty values because some properties do not
receive regular reviews.
Several properties have missing price information, which undermines revenue calculations.
Impact on Analysis
Time-based tracking of customer engagement becomes difficult because last_review values are
missing for some entries.
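As an illustration only (the checks in this report were carried out in Excel), the following
Python sketch counts blank cells per column. The sample rows and the helper name count_missing
are hypothetical; only the column names come from the dataset described above.

```python
# Hypothetical sample rows; an empty string stands for a blank cell in the export.
rows = [
    {"id": "1", "price": "$120", "last_review": "2023-12-01", "reviews_per_month": "1.5"},
    {"id": "2", "price": "", "last_review": "", "reviews_per_month": ""},
    {"id": "3", "price": "$95", "last_review": "", "reviews_per_month": "0.4"},
]


def count_missing(rows, columns):
    """Return {column: number of rows where the cell is blank or absent}."""
    return {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }


missing = count_missing(rows, ["price", "last_review", "reviews_per_month"])
print(missing)
```

A per-column count like this makes it easier to judge whether a gap is small enough to fill
or large enough to affect the analysis.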
Impact on Analysis
Inflated listing counts mislead stakeholders about the real number of available units.
Duplicate records are double-counted in revenue projections, which produces incorrect
revenue estimates.
Biased statistical models produce wrong price recommendations, which harm business
decisions.
Inconsistent data formats make analysis difficult because they prevent effective numerical
operations and date calculations. The following problems were found in the dataset:
Price values are stored as text (for example "$120") rather than as numbers (120).
Dates appear in two different formats ("2023-12-01" and "Dec 1, 2023"), which prevents
reliable date-based analysis or sorting.
Impact on Analysis
Outliers are values or observations that lie well outside the normal range of expected
values. Two common types of outlier are present in the dataset:
Some listings set minimum_nights to more than 365 days, which is not a realistic rental
option.
Some listings show impractical nightly rates (for example $10,000), which probably result
from user entry mistakes.
Impact on Analysis
Outliers distort statistical averages and group patterns, so business analytics may show
incorrect information.
Unreasonable prices cause errors in financial forecasting and competitive market
assessments.
Techniques Used:
For numerical fields such as reviews_per_month, missing values were replaced with the
median of the column. The median is robust to extreme values that would otherwise distort
the dataset.
For the last_review field, missing values were replaced with "No Review" to indicate that
the listing has no customer reviews.
In summary, every empty reviews_per_month entry was filled with the median of the available
values, and every missing last_review field was set to "No Review" for consistency.
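The two fills described above can be sketched in Python as follows. This is only an
illustration of the logic, not the Excel procedure used in the report; the sample rows and
the function name fill_missing_reviews are assumptions.

```python
from statistics import median

# Hypothetical sample rows; empty strings stand for blank cells.
rows = [
    {"id": "1", "reviews_per_month": "0.5", "last_review": "2023-12-01"},
    {"id": "2", "reviews_per_month": "", "last_review": ""},
    {"id": "3", "reviews_per_month": "2.0", "last_review": "2023-11-15"},
    {"id": "4", "reviews_per_month": "9.0", "last_review": "2023-10-02"},
]


def fill_missing_reviews(rows):
    """Fill blank reviews_per_month with the column median and blank
    last_review with the label "No Review". Returns new row dicts."""
    observed = [float(r["reviews_per_month"]) for r in rows if r["reviews_per_month"] != ""]
    fill_value = median(observed)  # computed from non-missing values only
    cleaned = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r["reviews_per_month"] == "":
            r["reviews_per_month"] = fill_value
        else:
            r["reviews_per_month"] = float(r["reviews_per_month"])
        if r["last_review"] == "":
            r["last_review"] = "No Review"
        cleaned.append(r)
    return cleaned


cleaned = fill_missing_reviews(rows)
print(cleaned[1])
```

In this sample the mean of the observed values would be pulled upwards by the 9.0 entry,
whereas the median stays at 2.0, which is the reason for preferring the median here.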
Techniques Used:
Excel's Remove Duplicates tool was used to identify and eliminate duplicate records from the
table.
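For readers who want to reproduce this step outside Excel, a minimal Python sketch of the
same idea is shown below. It keeps the first occurrence of each record, as Excel's tool does.
The sample rows, the choice of key columns and the function name remove_duplicates are
assumptions made for illustration.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample: listing 101 appears twice.
rows = [
    {"id": "101", "price": "120"},
    {"id": "102", "price": "95"},
    {"id": "101", "price": "120"},
]

unique_rows = remove_duplicates(rows, ["id", "price"])
print(len(rows), "->", len(unique_rows))
```

Comparing the row count before and after removal gives a quick check on how many
duplicates were present.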
Techniques Used:
The price text was converted to numeric format by deleting the dollar sign ($).
Excel's Power Query Editor was used to convert all last_review values to the YYYY-MM-DD
date format.
Excel's Find and Replace function was used to remove currency symbols ($) from the price
field. The dates were reformatted so that they are presented uniformly across the entire
record set.
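The two conversions above can also be expressed in a few lines of Python. The sketch below is
an illustration, not the Excel workflow itself: the function names parse_price and
normalize_date are hypothetical, removal of thousands separators is an added assumption, and
the month-name format relies on the default English (C) locale.

```python
from datetime import datetime


def parse_price(text):
    """Convert price text such as "$120" to a float.
    Removing thousands separators is an assumption added for this sketch."""
    return float(text.replace("$", "").replace(",", "").strip())


# The two date formats observed in the dataset.
DATE_FORMATS = ("%Y-%m-%d", "%b %d, %Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or the text unchanged if it matches
    neither known format (for example the label "No Review")."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


print(parse_price("$120"))
print(normalize_date("Dec 1, 2023"))
print(normalize_date("2023-12-01"))
```

Leaving unrecognized values unchanged, rather than raising an error, means that placeholder
labels in last_review survive the conversion and can be reviewed separately.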
Techniques Used:
Z-score analysis was applied to detect extreme outliers, and implausible entries were
excluded.
Listings with minimum_nights above 365 days were excluded from the analysis because
such values indicate unrealistic availability.
Properties with nightly prices above $3000 were filtered out of the dataset to eliminate
erroneous values.
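A minimal Python sketch of both techniques follows. The 365-night and $3000 limits come from
the steps above; the z-score threshold of 2.5, the sample values and the function names are
assumptions, since the report does not state the threshold that was used.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold):
    """Return a list of booleans, True where |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


def within_limits(row, max_nights=365, max_price=3000):
    """Apply the fixed cut-offs for minimum_nights and nightly price."""
    return row["minimum_nights"] <= max_nights and row["price"] <= max_price


# Hypothetical nightly prices with one extreme entry.
prices = [100, 120, 110, 130, 90, 105, 115, 125, 95, 10000]
flags = flag_outliers(prices, threshold=2.5)

# Hypothetical listings: one extreme price, one implausible minimum stay.
listings = [
    {"id": 1, "price": 120, "minimum_nights": 2},
    {"id": 2, "price": 10000, "minimum_nights": 1},
    {"id": 3, "price": 150, "minimum_nights": 500},
]
kept = [row for row in listings if within_limits(row)]
print(flags)
print([row["id"] for row in kept])
```

With only ten values, a z-score computed from the population standard deviation cannot exceed
3, which is why this small example uses 2.5; on the full dataset a higher threshold may be
appropriate.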
4. Data Preprocessing Steps
After raw data cleaning was complete, the cleaned data still needed preprocessing to give it
a structure suitable for analysis. Preprocessing precedes analysis because it standardizes
the data, improves precision and removes inconsistencies that would obscure insights. The
main preprocessing steps applied to the Airbnb dataset were selecting essential columns,
handling missing values, converting data formats, and saving the resulting dataset for
analytical use.
The listing title recorded in the Airbnb database describes the accommodation.
The room type indicates whether a guest books an entire home, a private room or a shared
space.
The price is the nightly rental fee that customers pay to book the accommodation.
The minimum_nights attribute is the minimum number of consecutive nights a guest must
reserve.
Reviews per month is the average number of reviews a listing receives each month.
The availability_365 attribute is the number of days in a year on which the listing can be
booked.
Reviews Per Month: Serves as an indicator of a listing’s popularity and customer satisfaction.
The host_name and scrape_id metadata columns, along with other irrelevant columns, were
removed to streamline the dataset for analysis and visualization.
These unnecessary columns added complexity and made the dataset harder to analyse without
contributing anything to the core analysis.
The cleaning procedure standardized the labels in the room_type column.
Different spellings of "Entire home", such as "Entire Home" and "entire home", were
converted to a single term.
Some listings had implausible minimum_nights values above 365 days, so these records
were removed from the dataset.
Cleaning the dataset was essential for obtaining accurate results in subsequent price
prediction, demand forecasting and trend analysis.
Interpolation methods were used to estimate missing reviews_per_month values. This approach
preserves trends in the data and helps prevent sudden changes in the distribution. The median
of the available values was also substituted for missing reviews_per_month entries, so that
outlying data points would not skew the results.
Missing room_type values were assigned the default value "Unknown" so that the affected
records could be retained without information loss.
Handling missing values systematically kept all important information in the dataset
intact and kept the data valid for analysis.
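The report does not specify which interpolation method was applied. As one possible reading,
the Python sketch below performs simple linear interpolation between the nearest known
neighbours in an ordered sequence. The sample values, the use of None for gaps and the
function name interpolate_linear are all assumptions.

```python
def interpolate_linear(values):
    """Fill None gaps that lie between two known values by linear interpolation.
    Gaps at the start or end of the sequence are left as None."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            result[i] = result[left] + fraction * (result[right] - result[left])
    return result


# Hypothetical ordered reviews_per_month values with gaps.
series = [1.0, None, 3.0, None, None, 6.0, None]
filled = interpolate_linear(series)
print(filled)
```

Interpolation only makes sense when the rows have a meaningful order, for example by date.
Where no such order exists, the median fill is the safer choice.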
The price column was originally stored as text, with a dollar sign in front of each value
(for example "$138"). To convert the column to numbers, we used Excel's Find and Replace
function to remove the dollar signs. This step made mathematical computations possible, such
as average price analysis and the examination of price patterns.
The last_review column contained dates in multiple formats, such as "2023-12-01" and
"Dec 1, 2023". Power Query in Excel was used to convert all date values to the YYYY-MM-DD
format. This standardization made date-related analyses, including time-series forecasting
and review trend analysis, possible without inconsistencies.
Categorical labels were also converted to standardized labels. The room_type column included
multiple variants of equivalent categories, such as "Entire home" and "Entire house".
Standardization unified these labels so that duplicate categories would not arise during
analysis. Data transformation was required before analysis because computations and
visualizations depend on properly formatted data.
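The label standardization can be sketched in Python as a lookup on a normalized key. The
mapping below covers only the variants mentioned in this report; treating "Entire house" as
equivalent to "Entire home" follows the description above, while the function name and sample
values are hypothetical. Blank labels receive the "Unknown" default described earlier.

```python
# Variants (lower-cased) mapped to the single label used in the analysis.
CANONICAL_ROOM_TYPES = {
    "entire home": "Entire home",
    "entire house": "Entire home",
}


def standardize_room_type(label):
    """Map spelling variants to one label; blank values become "Unknown"."""
    key = " ".join((label or "").split()).lower()  # trim and ignore case
    if not key:
        return "Unknown"
    # Labels without a known variant are kept, with whitespace tidied.
    return CANONICAL_ROOM_TYPES.get(key, " ".join(label.split()))


raw = ["Entire home", "Entire Home", "entire home", "Entire house", "", "Private room"]
cleaned = [standardize_room_type(value) for value in raw]
print(cleaned)
```

Keeping the mapping in one dictionary makes it easy to add further variants if more are
found in the data.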
The cleaned dataset was saved as a CSV file because this format is compatible with
Excel.
Backup copies of the original dataset were kept for future verification and
comparison.
Documentation was written to explain every preprocessing step, including the
modifications made and the reasons for each data transformation.
Storing the processed data in a well-structured format made further analysis and
verification of results possible.
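A minimal Python sketch of the same save-and-backup routine is given below for illustration.
The file names, the "_backup" suffix and the function name save_cleaned are assumptions; the
example runs inside a temporary directory so that it does not touch any real files.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def save_cleaned(rows, fieldnames, original_path, cleaned_path):
    """Copy the original file to a backup, then write the cleaned rows as CSV.
    Returns the path of the backup copy."""
    original_path = Path(original_path)
    backup_path = original_path.with_name(
        original_path.stem + "_backup" + original_path.suffix
    )
    shutil.copy2(original_path, backup_path)
    with open(cleaned_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return backup_path


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "listings.csv"  # hypothetical file name
    original.write_text("id,price\n1,$120\n", encoding="utf-8")
    cleaned = Path(tmp) / "listings_cleaned.csv"
    backup = save_cleaned([{"id": "1", "price": 120.0}], ["id", "price"], original, cleaned)
    backup_matches = backup.read_text(encoding="utf-8") == original.read_text(encoding="utf-8")
    cleaned_text = cleaned.read_text(encoding="utf-8")

print(backup_matches)
print(cleaned_text)
```

Writing the cleaned data to a separate file, instead of overwriting the original, keeps the
raw export available for the comparison checks mentioned above.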
Data integrity improves because these processes eliminate duplicate information and fix data
errors, so the analysed results reflect actual industry patterns.
6. Conclusion
This report has shown how important data preparation is in readying a dataset for analytical
work. The key data quality problems found fell into four areas: missing values, duplicate
entries, inconsistent formats and outliers. We resolved them by handling missing values,
removing duplicate records, converting formats to standard types and removing extreme
outliers.
The cleaned dataset can be used to predict business trends, build models and create
visualizations. Future work should apply machine learning methods to identify property
pricing trends and to develop optimal rental pricing models. High-quality input data is the
essential basis for extracting meaningful, practical insights from the original data
sources.