Lesson 3. Data Preparation and Structuring 1 Data Cleaning
Overview
Data Collection
Data preparation and structuring
● Data cleaning
● Identifying and handling the missing values
● Feature engineering
● Encoding data
● Feature scaling
● Data enriching or augmentation
● Data validation
● Splitting dataset
● Data Visualization
o Box plots, Histograms, Scatter plots, Bar Charts, Pie Charts, Line Charts
● Choosing the right data model
o Types of data models
Data Publishing
● Report writing
− Outlier detection and handling: Identifying and handling extreme values that
may skew the analysis.
Mean, Median, or Mode Imputation
● What it means: You replace the missing values with the average (mean),
middle value (median), or most frequent value (mode) of that column.
● Mode Imputation: For categorical data, replace missing values
with the most frequent value (mode).
● When to use: This is useful when the missing data is random
(MCAR or MAR). Use the mean only when the column doesn't have
extreme values or outliers, since they pull the mean away from
typical values; the median is the safer choice when they are present.
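The imputation strategies described here can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation. The helper name `impute` and the sample columns `incomes` and `cities` are hypothetical and chosen only for the example; missing entries are assumed to be stored as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Return a copy of `values` with None replaced by the chosen statistic."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so no fill value can be computed
    # Pick the statistic by name and compute it from the observed values only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical numeric column with one missing value and one extreme value.
incomes = [30, 32, None, 35, 500]
# Hypothetical categorical column with one missing value.
cities = ["Pune", None, "Delhi", "Pune"]

mean_filled = impute(incomes, "mean")      # fill value is pulled up by 500
median_filled = impute(incomes, "median")  # fill value stays near typical incomes
cities_filled = impute(cities, "mode")     # most frequent category
```

In this sample the mean fill is 149.25 while the median fill is 33.5, which shows why the mean is a poor choice when the column contains outliers.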
Missing Indicator (Flag)
What it means: Instead of filling in the missing data, you create a new variable
(flag) that indicates whether the data is missing or not.
When to use: This is helpful when you want to keep track of missing values and
use that information for further analysis (e.g., analyzing whether missing data
correlates with certain patterns or outcomes).
Example: In a customer survey dataset, if some "age" entries are missing, you
could add a new column called "Age_Missing" that has a 1 for missing data and
a 0 for available data.
Advantages: Helps preserve the original data and adds valuable information
about the missingness itself.
Disadvantages: It doesn't solve the problem of missing data by itself, but it can
help in analyzing how missingness affects your model.
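The "Age_Missing" example can be sketched as follows. This is a minimal illustration that assumes each record is a dictionary and that a missing age is stored as `None`; the helper name `add_missing_flag` and the sample records are hypothetical.

```python
def add_missing_flag(rows, column, flag_name=None):
    """Return new rows with a 1/0 flag marking where `column` is missing."""
    # Default flag name follows the "Age_Missing" pattern from the example.
    flag_name = flag_name or f"{column.capitalize()}_Missing"
    # Build new dictionaries so the original rows are left untouched.
    return [
        {**row, flag_name: 1 if row.get(column) is None else 0}
        for row in rows
    ]

# Hypothetical customer survey records.
survey = [
    {"customer": "A", "age": 34},
    {"customer": "B", "age": None},
    {"customer": "C", "age": 51},
]

flagged = add_missing_flag(survey, "age")
```

The original "age" values are kept as they are, so the flag can be combined later with any imputation strategy.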
Data Cleaning - Removing Duplicates
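The slide names this step without giving details, so the following is only one possible sketch of duplicate removal, assuming records are dictionaries with hashable values. The helper name `drop_duplicates`, the `key_columns` parameter, and the sample records are hypothetical.

```python
def drop_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Compare on the chosen columns, or on every column if none are given.
        cols = key_columns or sorted(row)
        key = tuple(row.get(c) for c in cols)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical records: the third row repeats the first exactly,
# and the fourth row reuses an email address under a new id.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 1, "email": "a@example.com"},
    {"id": 3, "email": "b@example.com"},
]

exact = drop_duplicates(records)                # drops only the exact repeat
by_email = drop_duplicates(records, ["email"])  # one row per email address
```

Whether two rows count as duplicates depends on which columns identify a record, which is why the key columns are left as a parameter.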
Data Cleaning - Handling Outliers
● Outliers are data points that differ significantly from the rest of
the data. They can distort statistical analyses and model
performance, so handling them is crucial in feature engineering:
left untreated, they may lead to biased results or reduced model
accuracy.
Techniques for handling outliers include:
• Identify Outliers: Use statistical methods (e.g., IQR, Z-scores) or
visualization techniques (box plots, histograms) to identify
extreme values.
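The two statistical methods mentioned for identifying outliers, IQR and Z-scores, can be sketched as below. This is a minimal illustration with the standard library; the function names, the multiplier `k=1.5`, the default threshold of 3, and the sample data are assumptions made for the example, not values given in the lesson.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing is extreme
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

by_iqr = iqr_outliers(data)
by_z = zscore_outliers(data, threshold=2.5)
```

With only ten points the extreme value inflates the standard deviation, so its Z-score stays just under 3; the example lowers the threshold to 2.5 for that reason. The IQR rule is less affected because quartiles are not pulled by a single extreme value.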
Data Transformation
• Date Standardization:
• Convert all dates to the same format, e.g., YYYY-MM-DD or DD/MM/YYYY,
depending on the region or system used.
• Categorical Standardization:
• Standardize categories (e.g., "Male" and "M" should be the same category
for gender).
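Date and categorical standardization can be sketched as follows. This is a minimal illustration, assuming the incoming dates use only the two formats named in the lesson and that YYYY-MM-DD is the target format. The names `DATE_FORMATS`, `GENDER_MAP`, `standardize_date`, and `standardize_category` are hypothetical.

```python
from datetime import datetime

# Assumed input formats: YYYY-MM-DD and DD/MM/YYYY.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def standardize_date(text, formats=DATE_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    if text is None:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None

# Assumed mapping of spelling variants to one canonical label.
GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

def standardize_category(value, mapping=GENDER_MAP):
    """Map known variants to the canonical label; leave unknown values as-is."""
    return mapping.get(value.strip().lower(), value)

dates = [standardize_date(d) for d in ["2023-12-25", "25/12/2023", "n/a"]]
genders = [standardize_category(g) for g in ["M", " male ", "Female", "Other"]]
```

A string such as 03/04/2023 is ambiguous between day-first and month-first conventions, so the list of accepted formats has to reflect the region or system the data came from.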
• Pivoting or Reshaping:
• Reshape data using pivot tables or pivot() functions to get a
more compact and readable format, especially when analyzing
multiple dimensions, and to ensure the data structure is in a
format that is suitable for analysis.
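The reshaping step can be sketched as a small pivot from "long" records to a "wide" table. This is a minimal illustration with the standard library; the function name `pivot_rows` and the sales data are hypothetical, and a data-frame library's pivot function would normally do this job.

```python
from collections import defaultdict

def pivot_rows(rows, index, columns, values):
    """Turn long-format rows into a nested dict: {index: {column: value}}."""
    table = defaultdict(dict)
    for row in rows:
        cell = table[row[index]]
        if row[columns] in cell:
            # Two rows map to the same cell, so they must be aggregated first.
            raise ValueError("duplicate (index, column) pair; aggregate first")
        cell[row[columns]] = row[values]
    return dict(table)

# Hypothetical long-format data: one row per region and quarter.
long_rows = [
    {"region": "North", "quarter": "Q1", "sales": 100},
    {"region": "North", "quarter": "Q2", "sales": 120},
    {"region": "South", "quarter": "Q1", "sales": 90},
    {"region": "South", "quarter": "Q2", "sales": 110},
]

# One entry per region, with one value per quarter.
wide = pivot_rows(long_rows, index="region", columns="quarter", values="sales")
```

The wide form puts each region on one row with a column per quarter, which is easier to read when comparing across two dimensions.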