Assignment 3 Data Science

The report details the data cleaning process for a cafe sales dataset containing 10,000 entries, which faced issues like missing values, incorrect data types, and placeholder entries. After a thorough cleaning process, the dataset was reduced to 3,089 valid entries, ensuring all data is usable for analysis. The cleaned dataset is now structured and free from inconsistencies, making it suitable for further analysis and modeling.

Uploaded by

Mehar M Moeed
Copyright
© All Rights Reserved
Data Science

Instructor: Miss Andleeb Akram


Sec A
Submitted By
Abdul Moeed B-26546
Fall 2021-2025

University of South Asia


Department of Computer Science
Data Cleaning Report for Cafe Sales Dataset
Overview

The dataset under consideration consists of 10,000 entries recording sales transactions from
a cafe. Each record includes the following attributes: Transaction ID, Item, Quantity, Price Per
Unit, Total Spent, Payment Method, Location, and Transaction Date. On initial inspection, it
became clear that the dataset contained various issues related to data quality, including missing
values, incorrect data types, and invalid or placeholder entries such as "ERROR" and
"UNKNOWN". These inconsistencies rendered the raw dataset unsuitable for analysis without a
thorough data cleaning process.

Identified Data Issues

Upon loading and reviewing the dataset, several key problems were observed:

Missing values: Many columns, especially Item, Payment Method, Location, and Transaction
Date, had null or missing entries.

Placeholder entries: Values such as "ERROR" in the Total Spent column and "UNKNOWN" in
categorical columns were used to denote issues or unavailable data.

Incorrect data types: Columns expected to hold numerical values, including Quantity, Price Per
Unit, and Total Spent, were stored as text strings. This prevented numerical operations and
required type conversion.

Data inconsistencies: There was a lack of standardization in how missing or faulty entries were
represented.
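
The issues listed above can be surfaced with a quick per-column audit. The following is a minimal sketch, not the report's actual inspection code: the toy rows are invented for illustration, and the helper name audit is hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration; they are NOT taken from the real dataset.
raw = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", "TXN_4"],
    "Item": ["Coffee", "UNKNOWN", None, "Cake"],
    "Quantity": ["2", "1", "ERROR", "3"],
    "Total Spent": ["4.0", "ERROR", "6.0", "9.0"],
})

# The two placeholder strings named in the report.
PLACEHOLDERS = ["ERROR", "UNKNOWN"]


def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, null count, and placeholder count for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "placeholders": df.isin(PLACEHOLDERS).sum(),
    })


report = audit(raw)
print(report)
```

Numeric columns that show up with a text dtype in this summary are the ones that need type conversion later.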

Cleaning Process

The following steps were taken to clean the dataset:

Replacement of Invalid Entries: All occurrences of the placeholders "ERROR" and "UNKNOWN"
were replaced with NaN using pandas' replace() method. This ensured consistent handling of
missing and invalid data.
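
A minimal sketch of this replacement step, assuming the placeholders appear as exact string matches. The sample rows are invented for illustration; only replace() and the two placeholder strings come from the report.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "Item": ["Coffee", "UNKNOWN", "Cake"],
    "Total Spent": ["4.0", "ERROR", "9.0"],
    "Payment Method": ["Cash", "Card", "ERROR"],
})

# Map both placeholder strings to NaN across every column at once.
cleaned = df.replace(["ERROR", "UNKNOWN"], np.nan)

print(cleaned.isna().sum())
```

After this step, isna() and dropna() treat former placeholders and genuine nulls the same way.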

Data Type Conversion: The columns Quantity, Price Per Unit, and Total Spent were converted
to float type using pd.to_numeric() with error coercion. This step was critical for any future
calculations or statistical analysis.
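
The conversion might look like the sketch below. It assumes that "error coercion" refers to the errors="coerce" argument of pd.to_numeric(), which turns unparseable strings into NaN. The sample values are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "Quantity": ["2", "ERROR", "3"],
    "Price Per Unit": ["2.0", "1.5", None],
    "Total Spent": ["4.0", "UNKNOWN", "9.0"],
})

NUMERIC_COLUMNS = ["Quantity", "Price Per Unit", "Total Spent"]

for col in NUMERIC_COLUMNS:
    # errors="coerce" converts anything that cannot be parsed into NaN;
    # astype(float) makes the float dtype explicit even if no NaN is present.
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)

print(df.dtypes)
```

Because coercion also converts leftover placeholders to NaN, this step works as a safety net if the replacement step missed a value.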

Handling Missing Data: Rows with a missing value in any column were removed. This reduced
the dataset considerably, from 10,000 entries to 3,089 (a loss of 6,911 rows, roughly 69%), but it
guaranteed that every remaining record is complete.
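
A minimal sketch of the row removal, assuming it was done with dropna() in its "any" mode (the report does not name the call). The toy rows are invented; the final two lines only restate the report's own row counts as a retention rate.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "Item": ["Coffee", np.nan, "Cake", "Tea"],
    "Quantity": [2.0, 1.0, np.nan, 3.0],
    "Total Spent": [4.0, 3.0, 9.0, 4.5],
})

# Drop every row that has at least one missing value.
complete = df.dropna(how="any")
retained = len(complete) / len(df)
print(f"kept {len(complete)} of {len(df)} toy rows ({retained:.0%})")

# Retention rate implied by the counts stated in the report.
report_retained = 3089 / 10_000
print(f"report: {report_retained:.2%} of rows retained")
```

Dropping on any missing column is the strictest option; dropna(subset=...) would keep more rows if only some columns matter for a given analysis.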

Index Resetting and Final Export: After cleaning, the DataFrame’s index was reset, and the
cleaned dataset was saved as cleaned_cafe_sales.csv for further use.
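
The final step could be sketched as follows. The file name cleaned_cafe_sales.csv comes from the report; drop=True, index=False, and the temporary output directory are assumptions made so the example runs without touching the working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame invented for illustration, with the gapped index left behind
# after rows have been dropped.
df = pd.DataFrame(
    {"Item": ["Coffee", "Tea"], "Total Spent": [4.0, 4.5]},
    index=[0, 3],
)

# Renumber rows 0..n-1 and discard the old index (drop=True is an assumption).
df = df.reset_index(drop=True)

# Write to a temporary directory; index=False keeps the index out of the file.
out_path = Path(tempfile.mkdtemp()) / "cleaned_cafe_sales.csv"
df.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded)
```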

Outcome

After cleaning, the dataset was reduced to 3,089 rows of fully valid, usable data. The final
dataset contains no missing values or invalid entries. All numerical fields are stored as float, and
categorical data is free from placeholder values. This cleaned dataset is now ready for reliable
analysis, visualization, and modeling tasks.
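
The claims in this section can be checked programmatically. The sketch below is one possible way to do so; the validate helper and the toy frames are hypothetical and are not part of the original work.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def validate(df, numeric_columns, placeholders=("ERROR", "UNKNOWN")):
    """Return a list of problems found; an empty list means the frame passes."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    if df.isin(list(placeholders)).any().any():
        problems.append("placeholder values present")
    for col in numeric_columns:
        if not is_float_dtype(df[col]):
            problems.append(f"{col} is not stored as float")
    return problems


NUMERIC_COLUMNS = ["Quantity", "Price Per Unit", "Total Spent"]

# Toy frames invented for illustration; not taken from the real dataset.
clean = pd.DataFrame({
    "Item": ["Coffee", "Tea"],
    "Quantity": [2.0, 1.0],
    "Price Per Unit": [2.0, 1.5],
    "Total Spent": [4.0, 1.5],
})
dirty = pd.DataFrame({
    "Item": ["Coffee", None],
    "Quantity": ["2", "ERROR"],
    "Price Per Unit": [2.0, 1.5],
    "Total Spent": [4.0, 1.5],
})

clean_problems = validate(clean, NUMERIC_COLUMNS)
dirty_problems = validate(dirty, NUMERIC_COLUMNS)
print(clean_problems)
print(dirty_problems)
```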

Conclusion

Data cleaning is a crucial step in any data science or analytics project. While it often results in
the loss of some data, as was the case here, it vastly improves the overall quality and usefulness
of the dataset. The cleaning process applied to the cafe sales data has resulted in a structured,
error-free, and analysis-ready dataset. Future users of this dataset can now derive meaningful
insights without concerns about inconsistencies or corruption in the data.
