Week 3
Identifying Outliers:
Use statistical measures such as the IQR (Interquartile Range)
or Z-score, or visual tools such as box plots and scatter plots,
to detect outliers.
Determine whether each outlier is a genuine observation or a
data error.
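
The two statistical measures above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function names, the 1.5 x IQR fence, the Z-score thresholds, and the sample `ages` list are all assumptions chosen for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is a common convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sample: nine plausible ages and one suspicious entry.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 38, 120]

print(iqr_outliers(ages))
print(zscore_outliers(ages, threshold=2.5))
```

Both calls flag only the value 120 here. With so few values, the extreme entry inflates the standard deviation enough that its Z-score stays just under 3, so the default threshold of 3.0 would flag nothing; this is one reason the IQR rule is often preferred for small samples.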
Handling Outliers:
Deletion: Remove outliers if they are the result of data entry
errors.
Transformation: Use log or square root transformations to
reduce the impact of outliers.
Capping: Limit the maximum and minimum values for certain
attributes.
Imputation: Replace outliers with mean/median/mode values.
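
The four handling strategies can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the cut-off of 65, the capping limits 18 and 65, and the sample data are hypothetical, and the median is used for imputation (the mean or mode would follow the same pattern).

```python
import math
import statistics

def delete_outliers(values, is_outlier):
    """Deletion: drop every value flagged as an outlier."""
    return [v for v in values if not is_outlier(v)]

def log_transform(values):
    """Transformation: log(1 + v) compresses large values; requires v > -1."""
    return [math.log1p(v) for v in values]

def cap(values, low, high):
    """Capping: clamp every value into the range [low, high]."""
    return [min(max(v, low), high) for v in values]

def impute_with_median(values, is_outlier):
    """Imputation: replace outliers with the median of the remaining values."""
    median = statistics.median(delete_outliers(values, is_outlier))
    return [median if is_outlier(v) else v for v in values]

# Hypothetical sample and outlier rule.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 38, 120]
too_old = lambda v: v > 65

print(delete_outliers(ages, too_old))     # 120 is dropped
print(cap(ages, 18, 65))                  # 120 becomes 65
print(impute_with_median(ages, too_old))  # 120 becomes the median, 30
```

Deletion shortens the data set, while capping and imputation keep its length, which matters when other columns must stay aligned with the same rows.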
Data Transformation Techniques:
Normalization: Scaling numerical data to fall within a smaller,
standard range, like 0-1.
Standardization (Z-Score Normalization): Rescaling data so
it has a mean of 0 and a standard deviation of 1.
One-Hot Encoding: Converting each categorical variable into a
set of binary (0/1) columns, one per category, so that machine
learning algorithms can use it as numerical input.
Binning: Converting continuous data into discrete intervals or
bins.
Log Transformation: Taking the logarithm of positive,
right-skewed data to bring it closer to a normal (Gaussian)
distribution.
Feature Extraction: Creating new variables from existing ones,
for example with Principal Component Analysis (PCA).
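
Several of the techniques above can be sketched with the Python standard library. This is a minimal sketch: the function names, sample values, and bin edges are assumptions made for illustration, and PCA is left out because it normally relies on a linear algebra library.

```python
import bisect
import statistics

def min_max_normalize(values):
    """Normalization: rescale values linearly into the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: rescale to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]

def one_hot(labels):
    """One-hot encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def bin_index(value, edges):
    """Binning: index of the interval that value falls into, given sorted edges."""
    return bisect.bisect_right(edges, value)

print(min_max_normalize([10, 20, 30]))     # [0.0, 0.5, 1.0]
print(one_hot(["red", "green", "red"]))    # [[0, 1], [1, 0], [0, 1]]
print(bin_index(25, [18, 40, 65]))         # 1, i.e. the 18-40 bin
```

Normalization and standardization differ in what they fix: the first pins the minimum and maximum, the second pins the mean and spread, so standardized values are not confined to 0-1.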
Data Validation Checks:
Range Check: Verifying that a data value falls within a
specified range.
Format Check: Ensuring data is in a specific format, like a
valid email address or phone number.
List Check: Validating data against a predefined list of
acceptable values.
Consistency Check: Ensuring data doesn't have contradictions,
such as a date of birth indicating a person is 150 years old.
Uniqueness Check: Verifying that entries in a unique field, like
a user ID, are not duplicated.
Logical Check: Confirming that data combinations make logical
sense, such as gender and salutation alignment.
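
The checks above can be sketched as a small validator. Everything specific here is an assumption for illustration: the record fields, the allowed country list, the salutation-to-gender mapping, the age limits, and the simplified email pattern are hypothetical and not a complete validation scheme.

```python
import re
from datetime import date

ALLOWED_COUNTRIES = {"US", "CA", "GB"}                      # hypothetical list
SALUTATION_GENDER = {"Mr": "M", "Ms": "F", "Mrs": "F"}      # hypothetical mapping
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # simplified format

def validate(record, today):
    """Return a list of failed checks for one record (empty if all pass)."""
    errors = []
    if not 0 <= record["age"] <= 120:                        # range check
        errors.append("range: age")
    if not EMAIL_PATTERN.match(record["email"]):             # format check
        errors.append("format: email")
    if record["country"] not in ALLOWED_COUNTRIES:           # list check
        errors.append("list: country")
    birth = record["date_of_birth"]                          # consistency check
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    years = today.year - birth.year - (0 if had_birthday else 1)
    if years != record["age"]:
        errors.append("consistency: age vs date_of_birth")
    expected = SALUTATION_GENDER.get(record["salutation"])   # logical check
    if expected is not None and expected != record["gender"]:
        errors.append("logical: salutation vs gender")
    return errors

def duplicate_ids(records):
    """Uniqueness check: return the set of user IDs that appear more than once."""
    seen, duplicates = set(), set()
    for record in records:
        uid = record["user_id"]
        if uid in seen:
            duplicates.add(uid)
        seen.add(uid)
    return duplicates

today = date(2024, 1, 1)
good = {"user_id": 1, "age": 34, "email": "ana@example.com", "country": "CA",
        "date_of_birth": date(1989, 6, 15), "salutation": "Ms", "gender": "F"}
bad = {"user_id": 1, "age": 150, "email": "not-an-email", "country": "ZZ",
       "date_of_birth": date(1989, 6, 15), "salutation": "Mr", "gender": "F"}

print(validate(good, today))        # no failed checks
print(validate(bad, today))         # one message per failed check
print(duplicate_ids([good, bad]))   # both records share user_id 1
```

Collecting all failures in a list, rather than stopping at the first one, lets a single pass report every problem with a record.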