MSDSModule 2
• Random vs. Stratified Sampling: The choice between random and stratified sampling for splitting depends
on the dataset's characteristics and the problem at hand. Stratified sampling preserves each class's proportion
in every split, so it is preferred when classes are imbalanced; simple random sampling is usually adequate for
balanced datasets.
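To make the difference concrete, here is a minimal sketch of a stratified split using only the Python standard library. The helper name `stratified_split`, its parameters, and the toy labels are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split every class separately, so a rare class cannot vanish from the test set by chance.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced dataset: 16 negatives, 4 positives.
labels = ["neg"] * 16 + ["pos"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=0)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count("neg"), test_labels.count("pos"))  # prints: 4 1
```

A purely random 25% split of the same 20 rows could easily contain no positive examples at all; splitting per class guarantees the 4:1 ratio is carried into the test set.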
Common Data Preprocessing Steps
(Continued)
5. Data Imbalance Handling: In classification tasks, deal with imbalanced datasets by oversampling the minority
class, undersampling the majority class, or using synthetic data generation techniques.
6. Data Normalization and Standardization: Normalize or standardize features so that they have similar
scales. Normalization (min-max scaling) typically rescales features to a [0, 1] range, while standardization
gives features a mean of 0 and a standard deviation of 1.
7. Handling Categorical Data: Convert categorical data into a numerical format (e.g., one-hot encoding) to make
it suitable for machine learning algorithms.
8. Handling Time-Series Data: Resample, interpolate, or aggregate time-series data to align it with the desired
frequency or to fill in missing values.
9. Encoding Text Data: Convert text data into numerical format using techniques like TF-IDF (Term Frequency-
Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) for natural language processing tasks.
10. Data Validation and Quality Checks: Ensure that the preprocessed data is free from anomalies, errors, or
inconsistencies.
11. Feature Scaling and Selection: Choose appropriate features and apply scaling methods to avoid issues with
algorithms sensitive to feature scales (e.g., gradient-based optimization algorithms).
12. Data Augmentation (for Image Data): Generate additional training examples by applying random
transformations to images (e.g., rotation, cropping, flipping) to improve model generalization.
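Several of the steps above (5, 6, and 7) can be sketched in a few lines of plain Python. The helper names `random_oversample`, `min_max_normalize`, `standardize`, and `one_hot`, along with the toy data, are assumptions for illustration only; this is a rough sketch of the ideas, not a substitute for a tested preprocessing library.

```python
import random
import statistics

def random_oversample(rows, labels, seed=0):
    """Step 5: duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def min_max_normalize(values):
    """Step 6: rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Step 6: shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def one_hot(categories):
    """Step 7: map each category to a 0/1 vector with one column per distinct value (sorted)."""
    vocab = sorted(set(categories))
    return vocab, [[1 if c == v else 0 for v in vocab] for c in categories]

# Hypothetical toy inputs.
balanced_rows, balanced_labels = random_oversample([[1], [2], [3], [4]], ["a", "a", "a", "b"])
ages = [20.0, 30.0, 40.0]
normalized = min_max_normalize(ages)      # [0.0, 0.5, 1.0]
standardized = standardize(ages)          # mean 0, standard deviation 1
vocab, encoded = one_hot(["red", "green", "red"])
print(vocab, encoded)  # prints: ['green', 'red'] [[0, 1], [1, 0], [0, 1]]
```

One design point worth noting: when these transforms are applied around a train/test split, the scaling statistics (minimum, maximum, mean, standard deviation) and any oversampling are normally computed from the training portion only, so that information from the test set does not leak into training.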