Data Mining Group Assignment 4

Data Cleaning
Group 1:
1. Raphael Chitsva - R056165Y
2. Chipuriro Walter R217118K
3. Lazarus Mapfurira R217116X
4. Tinashe Jima R217094J
5. Gillian James R217096Y
6. Ropafadzo Jere R217084B
7. Munyaradzi J Ndhokoyo R217112E
8. Placxedece K Phiri R204482L
Definition of terms:
Data Preprocessing:
- The process of transforming raw data into a useful, understandable format, ready for further analysis.

Data Cleaning:
- the process of cleaning datasets by accounting for missing values, removing outliers, correcting
inconsistent data points, and smoothing noisy data.
- Data cleaning helps us remove inaccurate, incomplete, and incorrect records from a dataset.
Noisy Data:
- A large amount of meaningless data.
- Data that cannot be interpreted by machines and that contains unnecessary, faulty values.
Data Cleaning
What is it?
• The process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Removal of inconsistency within our data to produce a solid and profound
analysis.
• Data cleaning is a foundational process in the data science lifecycle; its
role cannot be overemphasized when trying to uncover insights and
generate reliable answers.
Why is Data Cleaning so Important?
- It helps to create a template for cleaning an organization's data.
- If the data is incorrect, the outcomes of any analysis or algorithm are unreliable, even
though they may look correct.
- False conclusions drawn from incorrect or “dirty” data can inform poor business
strategy and decision-making.
Data cleaning in 5 steps:
1. Remove duplicates
2. Fix structural errors
3. Filter unwanted outliers
4. Handle missing data
5. Validate and QA
1. Remove duplicates or irrelevant
observations
• Remove unwanted observations from your data, including duplicate
and irrelevant observations; these most often arise during data
collection.
• Irrelevant observations are those that do not fit the specific
problem you are trying to analyse.
• For example, if you want to analyse data regarding millennial
customers but your dataset includes older generations, you might
remove those irrelevant observations.
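Step 1 could be sketched in pandas as follows; the DataFrame, its column names, and the millennial birth-year cutoff are all illustrative assumptions, not part of the original slides:

```python
import pandas as pd

# Hypothetical customer dataset (names and columns are illustrative).
df = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ben", "Carl", "Dee"],
    "birth_year": [1992, 1995, 1995, 1960, 1998],
})

# Drop exact duplicate rows (the repeated "Ben" record).
df = df.drop_duplicates()

# Drop irrelevant observations: for a millennial-customer analysis,
# anyone born before 1981 does not fit the problem being analysed.
millennials = df[df["birth_year"] >= 1981].reset_index(drop=True)
print(millennials["customer"].tolist())  # → ['Ana', 'Ben', 'Dee']
```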
2. Fix structural errors
• Structural errors are things like typos, inconsistent capitalization,
stray whitespace, or mislabeled categories that arise during
measurement or data transfer.
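A minimal sketch of fixing structural errors with pandas string methods; the category values and the typo mapping are invented for illustration:

```python
import pandas as pd

# Hypothetical survey column with structural errors: the same category
# spelled with different case, stray whitespace, and a typo ("Femle").
s = pd.Series(["Male", " male", "MALE", "Femle", "female"])

# Normalise whitespace and case, then map known typos to canonical values.
clean = (s.str.strip()
          .str.lower()
          .replace({"femle": "female"}))
print(clean.tolist())  # → ['male', 'male', 'male', 'female', 'female']
```

After this step the column has exactly two consistent categories instead of five inconsistent spellings.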
3. Filter unwanted outliers
• Filter or remove observations that do not appear to fit within the range of
the data you are analysing.

• For example, if your organisation sells kids' clothes and toys, and in the
collected data you see customer ages of 50 or 87 years, those values are suspect.

• If you have a legitimate reason to remove an outlier, such as improper data
entry, doing so will improve the performance of the data you are working
with.

*NB*: just because an outlier exists doesn't mean it is incorrect.
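One common rule of thumb for step 3 (not prescribed by the slides, but a standard choice) is to flag values that fall more than 1.5 × IQR outside the quartiles; the toy ages below echo the kids-store example:

```python
import pandas as pd

# Hypothetical customer ages from a kids' clothes-and-toys store;
# 50 and 87 look like data-entry errors rather than real customers.
ages = pd.Series([4, 6, 7, 5, 8, 9, 6, 50, 87])

# Keep values within 1.5 * IQR of the first and third quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
mask = ages.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = ages[mask]
print(filtered.tolist())  # → [4, 6, 7, 5, 8, 9, 6]
```

Per the slide's *NB*, an outlier flagged this way should still be checked for a legitimate cause before being dropped.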


4. Handle missing data
• You can't ignore missing data, because many algorithms will not accept missing
values.
• How you handle missing values has a great impact on the outcome of your analysis
and on model performance.
Handling missing data
 Drop missing values:
- Effective for large datasets with few missing values, but be mindful of potential information loss.

 Impute missing values:
- Replace missing values with the mean, median, or mode of the relevant variable.
- Use the mean for normal distributions and the median for non-normal distributions.
- Caution: these assumptions may compromise data integrity.

 Predict missing values:
- Use prediction models to estimate missing values based on the available data.
- Requires careful model selection and validation for reliable results.
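The first two options could be sketched in pandas as below; the DataFrame and its columns are illustrative, and which column gets mean versus median imputation is an assumed distributional judgment:

```python
import pandas as pd

# Hypothetical dataset with missing values in two numeric columns.
df = pd.DataFrame({
    "age":    [25.0, 30.0, None, 22.0, None],
    "income": [40.0, None, 55.0, 38.0, 47.0],
})

# Option 1: drop rows containing any missing value. Fine for large
# datasets with few gaps, but here it discards 3 of 5 rows.
dropped = df.dropna()

# Option 2: impute -- mean for a roughly normal variable ("age"),
# median for a skewed one ("income").
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

print(len(dropped))                    # → 2
print(int(imputed.isna().sum().sum())) # → 0
```

The prediction-based option would typically use a regression or k-nearest-neighbours model fitted on the complete rows, which needs the model selection and validation the slide warns about.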
Impact of handling missing values

• Improved data quality:
• Addressing missing values enhances the overall quality of the dataset.
• Preservation of data integrity:
• Imputing or removing missing values ensures that the dataset remains
consistent and suitable for analysis.
• Reduced bias:
• Handling missing data allows for a more unbiased representation of the
underlying patterns in the data.
• Descriptive statistics, such as means, medians, and standard deviations, can be more
accurate when missing values are appropriately handled. This ensures a more
reliable summary of the dataset.
• Increased efficiency:
• Efficiently handling missing values can save you time and effort during data
analysis and modelling.
Advantages and benefits of data cleaning
 Enhanced Data Quality
 Increased Accuracy of Insights
 Improved Decision-Making
 Enhanced Data Consistency
 Cost and Time Savings
 Enhanced Stakeholder Confidence
