Data Mining Group Assignment 4
Data Cleaning
Group 1:
1. Raphael Chitsva - R056165Y
2. Chipuriro Walter R217118K3
3. Lazarus Mapfurira R217116X4
4. Tinashe Jima R217094J
5. Gillian James R217096Y
6. Ropafadzo Jere R217084B
7. Munyaradzi J Ndhokoyo R217112E
8. Placxedece K Phiri R204482L
Definition of terms:
Data Preprocessing:
- The conversion of raw data into an understandable format that is ready for further analysis.
- The process of transforming raw data into a useful, understandable format.
Data Cleaning:
- The process of cleaning datasets by accounting for missing values, removing outliers, correcting
inconsistent data points, and smoothing noisy data.
- Data cleaning helps us remove inaccurate, incomplete and incorrect records from a dataset.
Noisy Data:
- A large amount of meaningless data.
- Data that cannot be interpreted by machines and that contains unnecessary or faulty values.
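As an illustration of the "smoothing noisy data" idea mentioned in the definitions above, here is a minimal sketch of a moving average in plain Python. The function name `moving_average`, the window size, and the sample readings are all hypothetical and chosen only for illustration; a moving average is just one of several possible smoothing methods.

```python
def moving_average(values, window=3):
    """Replace each value with the mean of itself and its nearest neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Clamp the window at both ends of the list.
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Hypothetical sensor readings; 30 is a noisy spike.
readings = [10, 12, 30, 11, 13]
smoothed = moving_average(readings)
print(smoothed)
```

The spike is damped because it is averaged with its neighbours, at the cost of slightly raising the values next to it.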
Data Cleaning
What is it?
• The process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Removal of inconsistencies within the data so that any analysis rests on a
sound foundation.
• Data cleaning is a foundational process in the data science lifecycle and its
role cannot be overemphasized when trying to uncover insights and
generate reliable answers.
Why is Data Cleaning so Important?
- Helps to create a template for cleaning an organization's data.
- If data is incorrect, the outcomes of any analysis or algorithm are unreliable, even
though they may look correct.
- False conclusions drawn from incorrect or “dirty” data can inform poor business
strategy and decision-making.
1. Remove duplicates
2. Fix structural errors
3. Filter unwanted outliers
4. Handle missing data
5. Validate and QA
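The five steps in the diagram above (remove duplicates, fix structural errors, filter unwanted outliers, handle missing data, validate and QA) can be sketched end to end in plain Python. This is only a rough sketch under stated assumptions: the records, the `MAX_AMOUNT` limit, and every function name are hypothetical, a fixed plausible range stands in for a proper statistical outlier test, and filling with the median is just one way to handle missing values.

```python
from statistics import median

MAX_AMOUNT = 1000.0  # hypothetical upper limit for a plausible amount

# Hypothetical raw records: (customer, city, amount). None marks a missing value.
raw = [
    ("Ann", "Harare ", 20.0),
    ("Ann", "Harare ", 20.0),   # exact duplicate
    ("Ben", "harare", 25.0),    # inconsistent capitalisation
    ("Cat", "Bulawayo", None),  # missing amount
    ("Dan", "Bulawayo", 22.0),
    ("Eve", "Mutare", 24.0),
    ("Fay", "Mutare", 5000.0),  # implausibly large amount
]


def remove_duplicates(rows):
    # Step 1: keep the first occurrence of each identical record.
    return list(dict.fromkeys(rows))


def fix_structure(rows):
    # Step 2: normalise stray whitespace and capitalisation in text labels.
    return [(name, city.strip().title(), amount) for name, city, amount in rows]


def filter_outliers(rows, limit=MAX_AMOUNT):
    # Step 3: drop amounts outside the plausible range; keep missing ones for step 4.
    return [r for r in rows if r[2] is None or 0 <= r[2] <= limit]


def fill_missing(rows):
    # Step 4: replace missing amounts with the median of the known amounts.
    fill = median([amount for _, _, amount in rows if amount is not None])
    return [(n, c, fill if a is None else a) for n, c, a in rows]


def validate(rows):
    # Step 5: check that the cleaned data satisfies the rules applied above.
    assert len(set(rows)) == len(rows), "duplicates remain"
    assert all(a is not None for _, _, a in rows), "missing values remain"
    assert all(c == c.strip().title() for _, c, _ in rows), "labels not normalised"
    return rows


clean = remove_duplicates(raw)
clean = fix_structure(clean)
clean = filter_outliers(clean)
clean = fill_missing(clean)
clean = validate(clean)
print(clean)
```

Each step is a separate function so that the order can be changed or a step swapped for a different technique without touching the others.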
1. Remove duplicates or irrelevant observations
• Remove unwanted observations from your data, including
duplicate or irrelevant observations. Duplicates arise most
often during data collection.
• Irrelevant observations are observations
that do not fit the specific problem you are trying to
analyse.
• For example, your organisation sells kids' clothes and toys, but the
collected data shows an age group of 87 or 50 years.
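A minimal sketch of this step, using the kids' clothes and toys example: exact duplicate rows are dropped first, then rows whose age does not fit the problem are filtered out. The rows, the field names, and the `MAX_KID_AGE` cutoff are hypothetical and would need to be chosen for the real dataset.

```python
MAX_KID_AGE = 12  # hypothetical cutoff for "kids"; pick one that fits the real problem

# Hypothetical rows collected for a kids' clothes and toy shop.
rows = [
    {"id": 1, "age": 6, "item": "toy car"},
    {"id": 2, "age": 87, "item": "t-shirt"},  # irrelevant age group
    {"id": 1, "age": 6, "item": "toy car"},   # duplicate of the first row
    {"id": 3, "age": 9, "item": "puzzle"},
    {"id": 4, "age": 50, "item": "doll"},     # irrelevant age group
]


def drop_duplicates(rows):
    """Keep the first occurrence of each row that is identical in every field."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_irrelevant(rows, max_age=MAX_KID_AGE):
    """Keep only rows whose age fits the problem being analysed."""
    return [row for row in rows if row["age"] <= max_age]


cleaned = drop_irrelevant(drop_duplicates(rows))
print(cleaned)
```

Only the rows with ids 1 and 3 remain: the repeated row is kept once, and the rows with ages 87 and 50 are removed.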