Assignment 3 Data Science
The dataset under consideration consists of 10,000 entries recording sales transactions from
a cafe. Each record includes the following attributes: Transaction ID, Item, Quantity, Price Per
Unit, Total Spent, Payment Method, Location, and Transaction Date. On initial inspection, it
became clear that the dataset contained various issues related to data quality, including missing
values, incorrect data types, and invalid or placeholder entries such as "ERROR" and
"UNKNOWN". These inconsistencies rendered the raw dataset unsuitable for analysis without a
thorough data cleaning process.
Upon loading and reviewing the dataset, several key problems were observed:
Missing values: Many columns, especially Item, Payment Method, Location, and Transaction
Date, had null or missing entries.
Placeholder entries: Values such as "ERROR" in the Total Spent column and "UNKNOWN" in
categorical columns were used to denote issues or unavailable data.
Incorrect data types: Columns expected to hold numerical values, including Quantity, Price Per
Unit, and Total Spent, were stored as text strings. This prevented numerical operations and
required type conversion.
Data inconsistencies: There was a lack of standardization in how missing or faulty entries were
represented.
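The inspection described above can be sketched as follows. This is a minimal illustration, not the code used for the assignment: the three sample rows are invented, and only a subset of the columns is shown.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real file; values are illustrative only.
raw_csv = io.StringIO(
    "Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method\n"
    "TXN_1,Coffee,2,2.0,4.0,Cash\n"
    "TXN_2,,1,3.0,ERROR,UNKNOWN\n"
    "TXN_3,Cake,UNKNOWN,3.0,,Credit Card\n"
)
df = pd.read_csv(raw_csv)

# Columns that mix numbers with text such as "ERROR" are loaded as text, not numbers.
print(df.dtypes)

# Number of null entries per column.
null_counts = df.isna().sum()

# Number of placeholder strings per column.
placeholders = ["ERROR", "UNKNOWN"]
placeholder_counts = df.isin(placeholders).sum()

print(null_counts)
print(placeholder_counts)
```

Counting nulls and placeholders separately shows that the two kinds of bad entry are distinct: isna() does not flag the string "UNKNOWN", so a null count alone would understate the problem.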
Cleaning Process
Data Type Conversion: The columns Quantity, Price Per Unit, and Total Spent were converted
to float type using pd.to_numeric() with error coercion (errors="coerce"), which replaces
entries that cannot be parsed as numbers, such as "ERROR", with NaN. This step was required
for any later calculation or statistical analysis.
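A minimal sketch of this conversion step is shown below. The sample values are hypothetical, and the explicit astype(float) is an added safeguard rather than something the report states: pd.to_numeric() alone would return an integer column if every value happened to be a whole number.

```python
import pandas as pd

# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "Quantity": ["2", "UNKNOWN", "5"],
    "Price Per Unit": ["3.0", "1.5", "ERROR"],
    "Total Spent": ["6.0", "ERROR", None],
})

numeric_cols = ["Quantity", "Price Per Unit", "Total Spent"]
for col in numeric_cols:
    # errors="coerce" turns anything that cannot be parsed as a number into NaN;
    # astype(float) makes the float dtype explicit.
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)

print(df.dtypes)
print(df)
```

After this step the placeholder strings in the numeric columns can no longer be told apart from values that were missing to begin with, since both are NaN.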
Handling Missing Data: Rows with a missing value in any column were removed. This reduced
the dataset considerably, from 10,000 entries to 3,089 (about 31% of the original), but it
ensured that every remaining record is complete.
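The row removal can be sketched as below. The report does not say how "ERROR" and "UNKNOWN" in the categorical columns were handled, so this sketch assumes they were first converted to missing values. That step is an assumption, and the sample rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are illustrative only.
df = pd.DataFrame({
    "Item": ["Coffee", "UNKNOWN", "Cake", None],
    "Quantity": [2.0, 1.0, np.nan, 3.0],
    "Payment Method": ["Cash", "Cash", "Credit Card", "ERROR"],
})

rows_before = len(df)

# Assumption: placeholder strings are treated as missing before rows are dropped.
df = df.mask(df.isin(["ERROR", "UNKNOWN"]))

# how="any" (the default) drops a row if at least one column is missing.
cleaned = df.dropna(how="any")

retained = len(cleaned) / rows_before
print(f"kept {len(cleaned)} of {rows_before} rows ({retained:.0%})")
```

Printing the retained share makes the cost of the strategy visible. Dropping any row with a single gap is simple and leaves only complete records, but it discards the valid fields in those rows as well.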
Index Resetting and Final Export: After cleaning, the DataFrame’s index was reset, and the
cleaned dataset was saved as cleaned_cafe_sales.csv for further use.
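A short sketch of the index reset and export follows. The sample frame is hypothetical, and the file is written to a temporary directory here so that the example leaves nothing behind. Only the file name cleaned_cafe_sales.csv comes from the report.

```python
import os
import tempfile
import pandas as pd

# Rows that survive cleaning keep their original labels (here 0, 3 and 7).
cleaned = pd.DataFrame(
    {"Item": ["Coffee", "Cake", "Tea"], "Total Spent": [4.0, 3.0, 1.5]},
    index=[0, 3, 7],
)

# drop=True discards the old labels instead of keeping them as a new column.
cleaned = cleaned.reset_index(drop=True)

# Temporary location for the example; the file name matches the report.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_cafe_sales.csv")

# index=False keeps the row labels out of the CSV file.
cleaned.to_csv(out_path, index=False)

# Reading the file back confirms that only the data columns were written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Without drop=True the old labels would be kept as an extra "index" column, and without index=False the row labels would be written as an unnamed first column in the CSV.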
Outcome
After cleaning, the dataset was reduced to 3,089 rows of fully valid, usable data. The final
dataset contains no missing values or invalid entries. All numerical fields are stored as float, and
categorical data is free from placeholder values. This cleaned dataset is now ready for reliable
analysis, visualization, and modeling tasks.
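The properties claimed for the cleaned dataset can be checked programmatically. The helper below, check_cleaned, is a hypothetical function written for illustration and is not part of the original work. The sample frame is also invented.

```python
import pandas as pd

def check_cleaned(df, numeric_cols, placeholders=("ERROR", "UNKNOWN")):
    """Return pass/fail results for the three stated properties (hypothetical helper)."""
    return {
        # No null values anywhere in the frame.
        "no_missing": not df.isna().any().any(),
        # Every listed numeric column has a float dtype.
        "numeric_are_float": all(
            pd.api.types.is_float_dtype(df[c]) for c in numeric_cols
        ),
        # No placeholder strings remain in any column.
        "no_placeholders": not df.isin(list(placeholders)).any().any(),
    }

# Illustrative sample that satisfies all three checks.
sample = pd.DataFrame({
    "Item": ["Coffee", "Cake"],
    "Quantity": [2.0, 1.0],
    "Price Per Unit": [2.0, 3.0],
    "Total Spent": [4.0, 3.0],
})
numeric = ["Quantity", "Price Per Unit", "Total Spent"]
report = check_cleaned(sample, numeric)
print(report)
```

Running such a check on the exported file, rather than on the in-memory DataFrame, would also catch problems introduced when the data is written out.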
Conclusion
Data cleaning is a crucial step in any data science or analytics project. Although it often
costs some data, and here it removed roughly 69% of the rows, it greatly improves the overall
quality and usefulness of the dataset. The cleaning process applied to the cafe sales data has
produced a structured, error-free, and analysis-ready dataset. Future users of this dataset can
now derive meaningful insights without concerns about inconsistencies or corruption in the data.