Module 1 Answer PDF
Thank you for providing us with the three datasets from Sprocket Central Pty Ltd. The table below
summarises key statistics for each of the three datasets received. Please let us know if the figures
do not align with your understanding.
Table name             No. of records   Distinct Customer IDs   Date Data Received
Customer Demographic   -insert value-   -insert value-          -insert value-
Customer Address       -insert value-   -insert value-          -insert value-
Transaction Data       -insert value-   -insert value-          -insert value-
The notable data quality issues we encountered, and the methods used to mitigate them, are listed
below. We have also included recommendations to prevent these issues from recurring and to improve
the accuracy of the underlying data used to drive business decisions.
● The ‘Transactions’ and ‘Customer Address’ tables contain customer_ids that do not
appear in ‘Customer Master (Customer Demographic)’
Mitigation: Please ensure that all tables are from the same period. Only customers in the Customer Master
list will be used as a training set for our model.
This indicates that the datasets received may not be in sync with one another, which could skew the
analysis results if records are missing. Please refer to the Excel file ‘data_outliers.xlsx’ for
the list of outliers between tables.
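For reference, the comparison behind this check can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the exact procedure used: the function name ids_missing_from_master and the sample IDs are hypothetical, and in practice the inputs would be the customer_id columns of the three tables.

```python
def ids_missing_from_master(master_ids, other_ids):
    """Return the sorted customer_ids found in other_ids but not in master_ids."""
    return sorted(set(other_ids) - set(master_ids))


# Illustrative values only; these are NOT taken from the Sprocket Central data.
customer_master = [1, 2, 3, 4]
transactions = [1, 2, 2, 5]      # customer 5 is not in the master list
customer_address = [2, 3, 6]     # customer 6 is not in the master list

outliers = {
    "Transactions": ids_missing_from_master(customer_master, transactions),
    "Customer Address": ids_missing_from_master(customer_master, customer_address),
}
print(outliers)
```

A set difference ignores repeated IDs, so a customer with many transactions is reported only once.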
● Various columns, such as the brand of a purchase or job title, have empty values in
certain records
Mitigation: If only a small number of rows have an empty value, exclude those records entirely from the
training set used for prediction. Otherwise, if the column is a core field, impute the missing values based
on that field's distribution in the training dataset.
For key datasets, such as transactions, fewer than 1% of transactions (accounting for less than 0.1% of
revenue) have missing fields. These records have been removed from the training dataset.
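The drop-or-impute rule described above can be sketched as follows, for a single core field. This is a hedged illustration, not the implementation used for the analysis: the function name handle_missing, the max_drop_share parameter and its 1% default are assumptions made for the example, and imputation is done here by sampling from the observed values so that the field's distribution is roughly preserved.

```python
import random


def _is_empty(value):
    """Treat None and the empty string as a missing value."""
    return value is None or value == ""


def handle_missing(rows, field, max_drop_share=0.01, seed=0):
    """Drop rows with an empty `field` if they are rare; otherwise impute.

    rows is a list of dicts. max_drop_share is an assumed cut-off for what
    counts as "a small number of rows"; it is not a figure from the analysis.
    """
    observed = [r[field] for r in rows if not _is_empty(r.get(field))]
    n_missing = len(rows) - len(observed)
    if n_missing == 0:
        return list(rows)

    # Few empty rows (or nothing to sample from): remove the records entirely.
    if n_missing / len(rows) <= max_drop_share or not observed:
        return [r for r in rows if not _is_empty(r.get(field))]

    # Otherwise fill each gap with a value drawn from the observed distribution.
    rng = random.Random(seed)
    return [
        {**r, field: rng.choice(observed)} if _is_empty(r.get(field)) else r
        for r in rows
    ]


# Illustrative rows only; not taken from the Sprocket Central data.
sample = [{"brand": "A"}, {"brand": None}, {"brand": "B"}, {"brand": ""}]
print(handle_missing(sample, "brand"))
```

Sampling from the observed values is one simple way to impute based on distribution; filling with the most frequent value would be an alternative, but it would over-represent that value.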
Kind regards,
[Junior Consultant Name]