67 Iuk

This report focuses on the analysis of the Airbnb dataset for Barwon Southwest, Victoria, emphasizing the importance of data preprocessing to ensure high-quality insights for business decisions. Key data quality issues identified include missing values, duplicate entries, inconsistent data formats, and outliers, all of which can lead to misleading analyses. Various data cleaning techniques were applied, including handling missing values, removing duplicates, standardizing formats, and filtering outliers, ultimately preparing the dataset for effective analysis and visualization.

Uploaded by

Bunny Saini

1. Introduction
Data analytics is now one of the main tools business managers use to analyse trends, optimise
operations, and improve customer experience. As one of the leading platforms for short-term
rentals, Airbnb generates large datasets that hold a great deal of information on pricing,
occupancy, and customer behaviour. Meaningful analysis is only possible, however, once the
dataset is clean, structured, and free of inconsistencies.

Incorrect or missing data can produce misleading insights, which in turn harm business
strategy. Data preprocessing is therefore a necessary step for ensuring high-quality data. This
report analyses the Barwon Southwest, Victoria, Australia dataset provided by Inside Airbnb.
Its main objectives are to:

1. Identify common data quality issues.

2. Identify appropriate patterns or rules to strengthen data reliability.

3. Perform the data preprocessing steps needed to work with the dataset later.

The dataset is processed in Microsoft Excel: handling missing values, removing duplicates,
resolving inconsistencies, and identifying outliers. High-quality data supports insights that can
drive meaningful business decisions in the short-term rental market.

2. Working with Data Quality Issues in Airbnb Dataset

A dataset is only useful if it is complete, accurate, and consistent. This section identifies the
key data quality issues found in the Airbnb dataset which, if not addressed, can lead to
analytical errors and faulty decision-making.

2.1 Missing Values

Issue
Large datasets often contain missing values, especially in fields that depend on user input.
Several crucial fields in our dataset contained a significant number of missing data points:

• last_review has no recorded date for numerous properties. Data collection errors or
properties without reviews appear to be the cause.
• reviews_per_month has several empty values because some properties do not receive
regular reviews.
• Several properties have no price information, which undermines revenue calculations.

Impact on Analysis

• Missing last_review values make time-based tracking of customer engagement difficult.
• Without reviews_per_month values, a listing’s popularity cannot be measured properly.
• Missing prices lead to erroneous revenue calculations, which hinders general market
analysis and the evaluation of rental pricing strategies.
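
A missing-value audit like the one described above can also be scripted. The following is a minimal Python sketch, not the Excel procedure used in this report. The sample rows and the helper count_missing are hypothetical and only illustrate counting blank cells per column.

```python
# Hypothetical sample rows; the column names follow the fields discussed above.
rows = [
    {"price": "$120", "last_review": "2023-12-01", "reviews_per_month": "1.5"},
    {"price": "", "last_review": "", "reviews_per_month": ""},
    {"price": "$95", "last_review": "", "reviews_per_month": "0.4"},
]


def count_missing(rows, columns):
    """Return {column: number of rows whose value is absent or blank}."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                counts[col] += 1
    return counts


missing = count_missing(rows, ["price", "last_review", "reviews_per_month"])
print(missing)
```

Counting per column first shows which fields need imputation and which are complete enough to use as they are.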

2.2 Duplicate Entries

Issue

Duplicate records are a common problem in database analysis. In this dataset they arise when:

• Hosts list the same property more than once under different listing names.
• Data retrieval problems produce repeated rows.

Impact on Analysis

• Inflated listing counts mislead stakeholders about the real number of available units.
• Duplicate records cause revenue to be double-counted, producing incorrect revenue
estimates.
• Statistical models become biased and produce wrong price recommendations, which
harms business decisions.
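
As an illustration of how such duplicates might be flagged outside Excel, here is a short Python sketch. The choice of host_id, latitude, and longitude as the identifying key is an assumption made for this example (the report does not state which fields define a duplicate), and the sample listings are invented.

```python
# Hypothetical listings: ids 1 and 2 describe the same property under different names.
listings = [
    {"id": "1", "name": "Ocean View Cottage", "host_id": "10",
     "latitude": "-38.33", "longitude": "144.32"},
    {"id": "2", "name": "Cottage with Ocean Views", "host_id": "10",
     "latitude": "-38.33", "longitude": "144.32"},
    {"id": "3", "name": "Town Studio", "host_id": "11",
     "latitude": "-38.15", "longitude": "144.36"},
]


def split_duplicates(rows, key_columns):
    """Keep the first row for each key; collect later rows with the same key."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates


# Assumed key: same host at the same coordinates counts as one property.
unique, duplicates = split_duplicates(listings, ["host_id", "latitude", "longitude"])
print([row["id"] for row in unique], [row["id"] for row in duplicates])
```

Because the listing name is left out of the key, differently named copies of one property are still caught. Passing every column as the key would detect only exact repeated rows.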

2.3 Inconsistent Data Formats

Issue

Inconsistent data formats make analysis difficult because they prevent numerical operations
and date calculations. The following problems were found in our dataset:

• Prices are stored as text such as "$120" rather than as numbers (120).
• Dates appear in two different formats ("2023-12-01" and "Dec 1, 2023"), which prevents
reliable date-based analysis and sorting.

Impact on Analysis

• Non-standardized prices cannot be used in mathematical computations such as the
average nightly rate.
• Mixed date formats prevent proper analysis and forecasting of rental patterns over time.
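
Before fixing formats it helps to know how many values deviate from the expected one. The sketch below is an illustrative Python check under assumed patterns (plain numbers for prices, YYYY-MM-DD for dates). The helper profile_format and the sample values are hypothetical.

```python
import re
from collections import Counter

# Assumed "standard" shapes: a plain number and an ISO-style date.
PLAIN_NUMBER = re.compile(r"\d+(\.\d+)?")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def profile_format(values, pattern):
    """Count how many values fully match the pattern and how many do not."""
    return Counter(
        "standard" if pattern.fullmatch(value.strip()) else "other"
        for value in values
    )


price_profile = profile_format(["$120", "95", "$1,250.00"], PLAIN_NUMBER)
date_profile = profile_format(["2023-12-01", "Dec 1, 2023", "2024-01-15"], ISO_DATE)
print(price_profile["other"], date_profile["other"])
```

A non-zero "other" count signals that a column needs conversion before any arithmetic or sorting is attempted.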

2.4 Outliers and Anomalous Data

Issue

4.1 Selecting Relevant Column

The Airbnb dataset included many columns, most of which were not needed for this analysis.
The dataset was therefore reduced to the columns relevant to pricing, room type, availability,
and user engagement. The selected columns were:

• Listing name: the title of the listing in the Airbnb database, which describes the
accommodation.
• room_type: whether the guest books an entire home, a private room, or a shared space.
• price: the nightly rate a customer pays to book the accommodation.
• minimum_nights: the minimum number of consecutive nights a guest must book.
• reviews_per_month: the average number of reviews a listing receives per month.
• availability_365: the number of days in a year on which the listing is available for
booking.
Rationale for Column Selection

Price and Room Type: Needed to evaluate pricing trends and the distribution of rental types.

Availability: Shows how frequently properties are occupied and thus supports revenue
projection.

Minimum Nights: Allows rental rules and market activity to be evaluated.

Reviews Per Month: Serves as an indicator of a listing’s popularity and customer satisfaction.

Removing host_name, scrape_id, and other irrelevant metadata columns streamlined the
dataset for analysis and visualization.
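
The column selection can be expressed in a few lines of code. This Python sketch is illustrative only. The identifier name for the listing title is an assumption, as is the sample row. The other column names are the ones used in this report.

```python
# Columns to keep; "name" is an assumed identifier for the listing title.
KEEP = ["name", "room_type", "price", "minimum_nights",
        "reviews_per_month", "availability_365"]


def select_columns(rows, keep):
    """Return new rows containing only the requested columns, in order."""
    return [{col: row.get(col, "") for col in keep} for row in rows]


# Hypothetical raw row including metadata that the analysis does not need.
raw = [{
    "name": "Town Studio", "host_name": "Alex", "scrape_id": "2023120100",
    "room_type": "Private room", "price": "$95", "minimum_nights": "2",
    "reviews_per_month": "0.4", "availability_365": "210",
}]

selected = select_columns(raw, KEEP)
print(list(selected[0].keys()))
```

Keeping an explicit list of wanted columns, rather than a list of columns to drop, means new metadata fields in a later data release do not slip into the analysis unnoticed.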

4.2 Cleaning the Dataset

Cleaning the dataset involved removing unneeded data, fixing inconsistent values, and applying
correct formatting across all records. The primary cleaning steps were:

• The unnecessary columns host_name and scrape_id were removed. They made the
dataset larger and harder to analyse without adding value to the core analysis.
• The room_type category labels were standardized. Spelling variants of "Entire home",
such as "Entire Home" and "entire home", were converted into a single term.
• Listings with implausible minimum_nights values above 365 days were removed from
the dataset.

These cleaning steps were essential for obtaining accurate results in subsequent price
prediction, demand forecasting, and trend analysis.
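
The label and minimum_nights rules above could be sketched in Python as follows. This is a hedged illustration rather than the Excel workflow used here. The helper clean_rows and the sample rows are hypothetical, and capitalising only the first letter is one assumed way of unifying case variants.

```python
def clean_rows(rows, max_nights=365):
    """Unify room_type casing and drop rows whose minimum_nights exceeds max_nights."""
    cleaned = []
    for row in rows:
        if int(row["minimum_nights"]) > max_nights:
            continue  # implausible stay requirement: exclude the listing
        fixed = dict(row)
        # Collapse extra spaces, then use "Sentence case" for every label.
        fixed["room_type"] = " ".join(row["room_type"].split()).capitalize()
        cleaned.append(fixed)
    return cleaned


sample = [
    {"room_type": "Entire Home", "minimum_nights": "2"},
    {"room_type": "entire home", "minimum_nights": "1000"},
    {"room_type": " entire  home ", "minimum_nights": "365"},
]

cleaned = clean_rows(sample)
print([(row["room_type"], row["minimum_nights"]) for row in cleaned])
```

A value of exactly 365 is kept, since only values above 365 days were treated as implausible.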

4.3 Handling Missing Data in Detail

Managing missing data correctly is essential during preprocessing because it prevents
unreliable models and wrong conclusions. Missing values were handled with different
techniques depending on whether the field was numerical or categorical.

Approach for Numerical Data

Interpolation was used to estimate missing reviews_per_month data points. This approach
preserves trends in the data and avoids sudden changes in its distribution. The median of the
available values was substituted for missing reviews_per_month values, since the median is
robust to outlying data points.
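
To show why the median is robust to outliers, here is a small Python sketch of median imputation only (interpolation is not shown). The helper impute_median and the sample values are hypothetical, and None stands in for an empty cell.

```python
from statistics import mean, median


def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [value for value in values if value is not None]
    if not known:
        return list(values)  # nothing to estimate from
    fill = median(known)
    return [fill if value is None else value for value in values]


# Hypothetical reviews_per_month column with one extreme value (12.0).
reviews_per_month = [0.5, None, 2.0, 1.0, None, 12.0]
filled = impute_median(reviews_per_month)

known = [value for value in reviews_per_month if value is not None]
print(filled, median(known), mean(known))
```

In this sample the median of the known values is 1.5, while the mean is pulled up to 3.875 by the single extreme listing, so the median gives a more typical fill value.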

Approach for Categorical Data

Missing values in last_review were given the default value "No Review" to indicate listings
that had not been reviewed.

Missing room_type values were assigned the default "Unknown" so that the records could be
retained without loss of information.

Handling missing values systematically kept all important information in the dataset intact
and kept the dataset valid for analysis.
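
The categorical defaults can be applied with a simple lookup. The following Python sketch assumes blank strings mark missing cells. The helper fill_defaults and the sample rows are hypothetical, while the default labels are the ones stated above.

```python
# Default labels taken from the approach described above.
DEFAULTS = {"last_review": "No Review", "room_type": "Unknown"}


def fill_defaults(rows, defaults):
    """Return copies of the rows with blank categorical cells replaced by defaults."""
    filled = []
    for row in rows:
        fixed = dict(row)
        for column, default in defaults.items():
            if not str(fixed.get(column) or "").strip():
                fixed[column] = default
        filled.append(fixed)
    return filled


sample = [
    {"last_review": "", "room_type": "Private room"},
    {"last_review": "2023-12-01", "room_type": ""},
]

filled_rows = fill_defaults(sample, DEFAULTS)
print(filled_rows)
```

One design consequence worth noting: once "No Review" is written into last_review, the column mixes text and dates, so any later date arithmetic has to skip that label.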

4.4 Transforming Data Formats

Transforming data formats puts values into the structure needed for analysis. Standardization
routines were applied to both the numerical and the categorical fields of the Airbnb dataset.

Standardizing the Price Column

The price column was originally stored as text, with a dollar sign in front of each value (for
example, $138). To convert the column to numbers, we used Excel’s Find and Replace
function to remove the dollar signs. This step made it possible to perform mathematical
computations such as average price analysis and the examination of price patterns.
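
The same conversion can be sketched in Python. This is an illustrative equivalent of the Find and Replace step, not the procedure actually used. The helper parse_price is hypothetical, and the removal of thousands separators is an added assumption not mentioned above.

```python
def parse_price(text):
    """Convert a price string such as "$138" to a float; blank cells become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


# Hypothetical price cells, including a blank and a value with a thousands separator.
raw_prices = ["$138", "$1,250.00", "", "95"]
prices = [parse_price(value) for value in raw_prices]

known = [price for price in prices if price is not None]
average = sum(known) / len(known)
print(prices, round(average, 2))
```

Blank cells are returned as None rather than 0 so that missing prices do not drag the average down.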

Standardizing Date Formats

The last_review column contained dates in multiple formats, such as "2023-12-01" and
"Dec 1, 2023". Power Query in Excel was used to convert all date values to the YYYY-MM-DD
format. This made it possible to run date-related analyses, including time-series forecasting
and review trend analysis, without inconsistencies.

Categorical labels were standardized in the same way. The room_type column included
several variants of equivalent categories, such as "Entire home" and "Entire house".
Standardization unified these labels so that duplicate categories would not arise during
analysis.

These transformations were required before analysis because computations and
visualizations depend on properly formatted data.
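
For readers who prefer code to Power Query, the date and label standardization might look like the Python sketch below. It assumes only the two date formats quoted above occur and that English month abbreviations are in effect (the default locale). The helpers to_iso and unify_label and the label map are hypothetical.

```python
from datetime import datetime

# The two formats quoted above: "2023-12-01" and "Dec 1, 2023".
KNOWN_FORMATS = ("%Y-%m-%d", "%b %d, %Y")

# Assumed mapping of equivalent labels to one standard label.
LABELS = {"entire house": "Entire home", "entire home": "Entire home"}


def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def unify_label(label):
    """Map known variants to the standard label; leave other labels unchanged."""
    return LABELS.get(label.strip().lower(), label.strip())


dates = [to_iso(value) for value in ["2023-12-01", "Dec 1, 2023", "No Review"]]
labels = [unify_label(value) for value in ["Entire house", "Entire home", "Private room"]]
print(dates, labels)
```

Values that match no known format come back as None instead of raising an error, which makes unexpected formats easy to count and inspect.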

4.5 Exporting and Storing the Processed Dataset

The cleaned, preprocessed data was stored in an organized format that supports future
analysis.

Steps Taken for Storage:

• The cleaned dataset was saved as a CSV file because the format is compatible with
Excel.
• Backup copies of the original dataset were kept for future verification and comparison.
• Documentation was written to describe every preprocessing step, the modifications
made, and the reasons for each data transformation.

Storing the processed data in a well-structured format made it possible to perform
additional analysis and verify research results.
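
The storage steps can be sketched in Python as a backup copy followed by a CSV export. The file names, the "_original" and "_cleaned" suffixes, and the helper export_with_backup are all assumptions for illustration. The example writes to a temporary folder so that it can be run safely.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def export_with_backup(original, cleaned_rows, out_dir):
    """Copy the untouched source file, then write the cleaned rows as CSV."""
    original = Path(original)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    backup = out_dir / f"{original.stem}_original.csv"
    shutil.copy2(original, backup)  # keep the raw data for later comparison

    cleaned = out_dir / f"{original.stem}_cleaned.csv"
    with cleaned.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(cleaned_rows[0].keys()))
        writer.writeheader()
        writer.writerows(cleaned_rows)
    return backup, cleaned


# Demonstration with a throwaway raw file in a temporary folder.
work_dir = Path(tempfile.mkdtemp())
raw_file = work_dir / "listings.csv"
raw_file.write_text("name,price\nBeach house,$138\n", encoding="utf-8")

backup_path, cleaned_path = export_with_backup(
    raw_file, [{"name": "Beach house", "price": 138.0}], work_dir / "processed"
)
print(backup_path.name, cleaned_path.name)
```

Writing the cleaned data to a new file, instead of overwriting the source, keeps the original available for the verification and comparison mentioned above.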

5. Significance of Data Cleaning and Preprocessing

Data cleaning and preprocessing improve data quality, which in turn improves operational
performance and analytical results. The essential advantages of cleaning and preprocessing
the Airbnb dataset are listed below:

5.1 Ensuring Accurate Insights

Data cleaning produces reliable information, giving businesses accurate insights on which to
base their decisions.

Data integrity improves because these processes eliminate duplicate information and fix data
errors, so the analysed results match actual industry patterns.

5.2 Enhancing Efficiency in Data Analysis

Preprocessed data streamlines analysis because analysts can dedicate their time to extracting
insights rather than correcting errors. A properly organized dataset also makes computations
more efficient, so both machine learning algorithms and statistical analyses take less time to
run.

5.3 Improving Machine Learning Model Performance

High-quality data processing results in better predictions and more accurate models.
Handling missing values and removing outliers ensures that machine learning algorithms do
not learn from unjustified or faulty data points. Preprocessing therefore leads to better
analytical results and supports accurate, fact-based decisions.

6. Conclusion
This examination stressed how important data preparation is for making a dataset ready for
analytical work. The key data quality problems we found fell into four areas: missing values,
duplicate entries, inconsistent formats, and outliers. We resolved them by handling missing
values, removing duplicated records, converting formats to standard types, and removing
extreme outliers.

The processing phase included the following steps:

• Relevant columns were selected to streamline the analysis.
• Numerical variables and categorical values were transformed to achieve uniformity
within the dataset.
• The processed dataset was exported for visualization and analysis.

The cleaned dataset can be used to predict business trends, build models, and create clear
visualizations. As a next step, the team should apply machine learning methods to identify
property pricing trends and develop optimal rental pricing models. High-quality input data is
the essential basis for extracting meaningful, practical insights from raw data sources.
