Module 3 Notes
2. Data Transformation:
o Normalize/Standardize data.
3. Data Reduction:
o Dimensionality reduction (e.g., PCA; sketched briefly below).
o Feature selection.
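As a quick illustration of the data-reduction step, here is a minimal PCA sketch using scikit-learn; the random matrix X is placeholder data, not from these notes:

```python
# Data-reduction sketch: PCA (assumes scikit-learn; X is placeholder data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)           # 100 samples, 10 features

pca = PCA(n_components=3)             # keep the top 3 principal components
X_reduced = pca.fit_transform(X)      # project the data onto them

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance captured per component
```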
2. Imputation:
o Mean/Median/Mode Imputation: Replace missing values with the
column's mean or median (for numerical data) or mode (for
categorical data).
o Predictive Imputation: Use other features to predict missing
values, with techniques such as regression or K-nearest neighbors
(KNN); see the sketch after this list.
o Forward/Backward Fill: In time series data, missing values can be
filled using previous or next values.
3. Using Algorithms that Handle Missing Data: Some machine learning
algorithms can handle missing values natively, e.g., decision-tree
variants with surrogate splits and gradient-boosted trees such as XGBoost.
4. Multiple Imputation: Multiple imputation creates several complete
datasets with different plausible imputed values, runs the analysis on
each, and pools the results, which captures the uncertainty introduced
by the missing values.
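A minimal sketch of the imputation techniques above, assuming pandas and scikit-learn are available; the toy DataFrame is illustrative, not data from the notes:

```python
# Imputation sketch (toy data; assumes pandas and scikit-learn).
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0],
                   "income": [50_000, 60_000, np.nan, 55_000]})

# Mean imputation: replace missing values with the column mean
# (use .median() or .mode() for the other variants).
df_mean = df.fillna(df.mean(numeric_only=True))

# Predictive imputation: KNN estimates each gap from similar rows.
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                      columns=df.columns)

# Forward fill (time series): carry the previous observation forward.
df_ffill = df.ffill()
```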
4. Handling Outliers
Outliers are extreme values that deviate significantly from the rest of the data.
They can distort statistical analyses and machine learning models.
Detecting Outliers:
1. Visual Methods:
o Boxplots: Outliers are often shown as points outside the whiskers
of a boxplot.
o Scatter Plots: For multivariate data, scatter plots help to identify
outliers.
2. Statistical Methods:
o Z-score: Outliers can be identified by calculating the Z-score (how
many standard deviations a point lies from the mean). An absolute
Z-score greater than 3 is often treated as an outlier.
o Interquartile Range (IQR): Any data point more than 1.5 times the
IQR above the third quartile or below the first quartile is considered
an outlier. (Both rules are sketched in code after this list.)
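Both detection rules take only a few lines; the series below is toy data, and the thresholds (3 and 1.5) are the conventional values mentioned above:

```python
# Outlier-detection sketch (toy data; assumes pandas).
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # 95 is an obvious outlier

# Z-score rule: flag points with |z| > 3. On a tiny sample a single
# extreme value inflates the standard deviation, so this rule is more
# reliable on larger datasets.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(iqr_outliers)                       # flags the 95
```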
Handling Outliers:
1. Removing Outliers:
o If the outliers are errors or not important for the analysis, they can
be removed.
2. Transforming Data:
o Log Transformation: Apply log transformations to reduce the
impact of outliers.
o Winsorizing: Replace extreme values with the nearest value inside a
chosen percentile range (e.g., capping at the 5th and 95th
percentiles); see the sketch after this list.
3. Capping or Truncation:
o Set a maximum or minimum value to cap outliers, bringing them
closer to the rest of the data.
4. Using Algorithms Robust to Outliers:
o Some algorithms, like decision trees, are relatively insensitive to
outliers and can handle them without removal or transformation.
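A minimal sketch of the log and winsorizing/capping options above, using NumPy on toy data; the 5th/95th percentile cutoffs are an illustrative choice:

```python
# Outlier-handling sketch (toy data; assumes NumPy).
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Log transformation: compresses large values (data must be positive).
x_log = np.log(x)

# Winsorizing / capping: pull values outside the 5th-95th percentile
# range back to the nearest boundary.
low, high = np.percentile(x, [5, 95])
x_capped = np.clip(x, low, high)
```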
5. Data Transformation
Data transformation is necessary to convert data into a suitable format and scale
for analysis. Common transformations include scaling, encoding, and
normalization.
Techniques for Data Transformation:
1. Normalization (Min-Max Scaling): Normalization rescales the data into a
specific range (usually between 0 and 1). This is especially useful when
features have different units or scales.
x' = (x - min(x)) / (max(x) - min(x))
2. Standardization (Z-score): Standardization rescales the data to have a
mean of 0 and a standard deviation of 1:
z = (x - μ) / σ
Where:
o μ is the mean
o σ is the standard deviation
3. Log Transformation:
o Apply the natural logarithm to reduce the impact of large values
and make the data more normally distributed.
4. Binning:
o Divide continuous data into bins or intervals. This is especially
useful in decision tree models and makes results less sensitive to
small fluctuations in the data.
5. Encoding Categorical Variables:
o One-Hot Encoding: Create binary columns for each category in
the categorical variable.
o Label Encoding: Assign a unique integer to each category; best
suited to ordinal variables, since it imposes an order on the
categories. (Several of these transformations are sketched below.)
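The transformations above can be sketched together with pandas and scikit-learn; the small DataFrame, its column names, and the three-bin split are illustrative assumptions:

```python
# Data-transformation sketch (toy data; assumes pandas and scikit-learn).
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [30_000, 45_000, 60_000, 250_000],
                   "city": ["Pune", "Delhi", "Pune", "Mumbai"]})

# Normalization: x' = (x - min) / (max - min), giving a 0-1 range.
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization: z = (x - mean) / std.
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation: compress the effect of the large value.
df["income_log"] = np.log(df["income"])

# Binning: split the continuous column into 3 equal-width intervals.
df["income_bin"] = pd.cut(df["income"], bins=3,
                          labels=["low", "mid", "high"])

# Label encoding: one integer per category (better for ordinal data).
df["city_code"] = df["city"].astype("category").cat.codes

# One-hot encoding: one binary column per category (replaces "city").
df = pd.get_dummies(df, columns=["city"])
```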
6. Cleaning Data
Data cleaning involves detecting and correcting errors in the dataset. It’s a vital
part of data preprocessing to improve the quality and reliability of the analysis.
Common Cleaning Steps:
1. Removing Duplicates:
o Identify and remove duplicate rows that don’t add new information
(see the sketch below).
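A minimal duplicate-removal sketch with pandas; the toy rows and the key column "id" are illustrative:

```python
# Duplicate-removal sketch (toy data; assumes pandas).
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [90, 85, 85, 70]})

df_clean = df.drop_duplicates()             # drop fully identical rows
df_by_id = df.drop_duplicates(subset="id")  # or dedupe on a key column
```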