Unit 2 - Lecture 2: Data Handling

Handling Datasets for Machine Learning: Feature Sets

• Handling datasets for machine learning feature sets involves several key steps. Here's a comprehensive guide to managing and preparing your datasets effectively:

1. Data Collection

• Identify Data Sources: Determine the sources from which the data will be collected (databases, APIs, web scraping, sensors, etc.).
• Gather Data: Collect the data, ensuring you have enough examples to train a robust model.

Figure 1: Data Collection
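As a rough sketch of this step, the snippet below reads records from a local CSV file and a hypothetical REST endpoint; the file path, URL, and use of pandas/requests are illustrative assumptions, not part of the lecture.

import pandas as pd
import requests

# Load records that already exist locally (placeholder path).
local_df = pd.read_csv("data/records.csv")

# Pull additional records from a hypothetical JSON API endpoint.
response = requests.get("https://example.com/api/records", timeout=30)
response.raise_for_status()
api_df = pd.DataFrame(response.json())

# Combine both sources into a single raw dataset.
raw_df = pd.concat([local_df, api_df], ignore_index=True)
print(f"Collected {len(raw_df)} examples")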



2. Data Cleaning

• Remove Duplicates: Eliminate duplicate records to avoid redundancy.
• Handle Missing Values: Use strategies like mean/median imputation or forward/backward fill, or remove records/columns with excessive missing values.
• Correct Errors: Fix any errors in the data, such as incorrect labels, out-of-range values, etc.

Figure 2: Data Cleaning Cycle
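A minimal pandas sketch of these cleaning steps, assuming a raw DataFrame named raw_df and a numeric column named age (both placeholders):

import pandas as pd

# raw_df is assumed to come from the collection step.
clean_df = raw_df.copy()

# Remove duplicates: eliminate exact duplicate records.
clean_df = clean_df.drop_duplicates()

# Handle missing values: drop columns with more than 50% missing,
# then impute remaining numeric gaps with the column median.
clean_df = clean_df.dropna(axis=1, thresh=len(clean_df) // 2)
numeric_cols = clean_df.select_dtypes(include="number").columns
clean_df[numeric_cols] = clean_df[numeric_cols].fillna(clean_df[numeric_cols].median())

# Correct errors: filter an out-of-range value as an illustration.
if "age" in clean_df.columns:
    clean_df = clean_df[clean_df["age"].between(0, 120)]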


3. Data Transformation

• Normalization/Standardization: Scale the features so they have similar ranges. Common techniques include min-max normalization and z-score standardization.
• Encoding Categorical Variables: Convert categorical variables to numerical using methods like one-hot encoding, label encoding, or target encoding.
• Feature Engineering: Create new features from existing ones to help the model learn better. This includes creating interaction terms, polynomial features, and using domain knowledge to derive new features.

Figure 3: Data Transformation Process
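A short scikit-learn/pandas sketch of encoding and feature engineering; a recent scikit-learn (1.2+ for the sparse_output argument) is assumed, and the columns city, income, household_size, and age are hypothetical.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = clean_df.copy()  # clean_df and all column names below are illustrative assumptions

# One-hot encode a hypothetical categorical column "city".
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["city"]])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["city"]), index=df.index)
df = pd.concat([df.drop(columns=["city"]), encoded_df], axis=1)

# Feature engineering: an interaction term and a polynomial feature.
df["income_per_member"] = df["income"] / df["household_size"]
df["age_squared"] = df["age"] ** 2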



Figure 4: Data Transformation Techniques



4. Data Splitting

• Train-Test Split: Split the data into training and testing sets to evaluate the model's performance on unseen data.
• Validation Set: Further split the training data into a training set and a validation set to tune hyperparameters and avoid overfitting.
• Cross-Validation: Use k-fold cross-validation to make the best use of the data, especially when you have limited data (see the sketch after Figure 6).

Figure 5: Data Splitting

Figure 6: Cross Validation
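A minimal scikit-learn sketch of train/validation/test splitting and k-fold cross-validation; X, y, and the logistic-regression model are placeholders chosen for illustration.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the training data for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 5-fold cross-validation on the training portion.
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("Cross-validation accuracy:", scores.mean())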


5. Handling Imbalanced Data

• Resampling: Use techniques like oversampling the minority class (e.g., SMOTE) or undersampling the majority class.
• Class Weighting: Assign different weights to classes to balance the influence of each class on the model training.

Figure 7: Handling Imbalanced Data
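A brief sketch of both options, assuming the third-party imbalanced-learn package for SMOTE (class weighting needs only scikit-learn); X_train and y_train are placeholders.

from imblearn.over_sampling import SMOTE  # third-party: pip install imbalanced-learn
from sklearn.linear_model import LogisticRegression

# Option 1 - Resampling: oversample the minority class with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 2 - Class weighting: weight classes inversely to their frequency.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_model.fit(X_train, y_train)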



6. Feature Selection

• Remove Unnecessary Features: Drop features that do not contribute to the model performance.
• Use Algorithms: Employ algorithms (like LASSO, Decision Trees) that help in selecting important features.
• Correlation Analysis: Remove highly correlated features to reduce multicollinearity.

Figure 8: Feature Selection
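A short sketch combining LASSO-based selection with a correlation filter; X_train is assumed to be a pandas DataFrame with a numeric target y_train, and the 0.01/0.9 thresholds are illustrative.

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# LASSO-based selection: keep features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.01))
selector.fit(X_train, y_train)
selected_cols = X_train.columns[selector.get_support()]

# Correlation analysis: drop one feature from each pair correlated above 0.9.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

X_reduced = X_train[selected_cols.difference(to_drop)]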



Figure 9: Benefit of Feature Selection


7. Feature Scaling: Feature scaling is a technique to standardize the independent features in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to give greater weight to features with larger values and treat features with smaller values as less important, regardless of the units of those values.
• Normalization: Scale features to a range, typically [0, 1].
• Standardization: Transform features to have zero mean and unit variance.

Figure 10: Data Normalization
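A minimal scikit-learn sketch contrasting the two approaches on a small illustrative array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features with very different magnitudes.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

# Normalization: each feature scaled to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature rescaled to zero mean and unit variance.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)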



8. Data Augmentation

• Generate New Data: For image, text, or audio data, create variations of existing data to increase the dataset size.
• In machine learning, data augmentation is a common method for manipulating existing data to artificially increase the size of a training dataset. It aims to boost the variety and variability of the training data in order to enhance the efficiency and flexibility of machine learning models.
• Data augmentation can be especially beneficial when the original dataset is small, as it enables the model to learn from a larger and more varied set of samples.

Types of Data Augmentation: Data augmentation techniques can be applied to a variety of data types, including time series, text, images, and audio. Here are a few frequently used augmentation methods for image data:
• Rotation and flipping: Images can be rotated at different angles and flipped horizontally or vertically to create alternative points of view.

• Random cropping and padding: By applying random cropping or padding to the images, various scales and translations can be simulated.

• Scaling and zooming: Rescaling the images to different sizes or zooming in and out helps the model handle various object sizes and resolutions.

• Shearing and perspective transform: Changing an image's shape or perspective can imitate various viewing angles while also introducing deformations.

• Color jittering: Adjusting the color characteristics of the images, including brightness, contrast, saturation, and hue, makes the model more resilient to variations in illumination.

• Gaussian noise: Introducing random Gaussian noise to the images strengthens the model's resistance to noisy inputs.

Figure 11: Data Augmentation
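As one possible sketch of an image-augmentation pipeline, the snippet below uses torchvision transforms; torchvision is an assumed choice, and Keras preprocessing layers or Albumentations would work equally well.

from torchvision import transforms

# Pipeline combining several of the augmentations described above.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),      # flipping
    transforms.RandomRotation(degrees=15),       # rotation at small angles
    transforms.RandomResizedCrop(size=224),      # random cropping, scaling and zooming
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # color jittering
    transforms.ToTensor(),
])

# Applied to a PIL image (placeholder), this yields a randomly transformed tensor:
# augmented_tensor = augment(pil_image)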



9. Data Storage

• Save Cleaned Data: Store the cleaned and preprocessed data in an appropriate format (CSV, HDF5, etc.) for future use.
• Document the Process: Keep track of the steps and transformations applied to the data for reproducibility.
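A minimal pandas sketch of the storage step; the DataFrame df, the output paths, and the step list are placeholders, and HDF5 output additionally requires the PyTables package.

import json
import pandas as pd

# df stands in for the cleaned, transformed DataFrame; output paths are placeholders.
df.to_csv("processed/train_data.csv", index=False)
df.to_hdf("processed/train_data.h5", key="train", mode="w")  # needs: pip install tables

# Document the preprocessing steps alongside the data for reproducibility.
steps = ["drop_duplicates", "median_imputation", "one_hot_encode_city", "standard_scaling"]
with open("processed/preprocessing_steps.json", "w") as f:
    json.dump(steps, f, indent=2)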
