Unit 2 - Lecture 2: Data Handling

Handling Datasets for Machine Learning: Feature Sets

• Handling datasets for machine learning feature sets involves several key steps. Here's a comprehensive guide to managing and preparing your datasets effectively:

1. Data Collection

• Identify Data Sources: Determine the sources from which the data will be collected (databases, APIs, web scraping, sensors, etc.).
• Gather Data: Collect the data, ensuring you have enough examples to train a robust model.

Figure 1: Data Collection
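As a rough sketch of this step, the snippet below reads records from a local CSV file and a hypothetical REST endpoint; the file path, URL, and use of pandas/requests are illustrative assumptions, not part of the lecture.

import pandas as pd
import requests

# Load records that already exist locally (placeholder path).
local_df = pd.read_csv("data/records.csv")

# Pull additional records from a hypothetical JSON API endpoint.
response = requests.get("https://example.com/api/records", timeout=30)
response.raise_for_status()
api_df = pd.DataFrame(response.json())

# Combine both sources into a single raw dataset.
raw_df = pd.concat([local_df, api_df], ignore_index=True)
print(f"Collected {len(raw_df)} examples")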



2. Data Cleaning

• Remove Duplicates: Eliminate duplicate records to avoid redundancy.
• Handle Missing Values: Use strategies like mean/median imputation or forward/backward fill, or remove records/columns with excessive missing values.
• Correct Errors: Fix any errors in the data, such as incorrect labels, out-of-range values, etc.

Figure 2: Data Cleaning Cycle
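A minimal pandas sketch of these cleaning steps, assuming a raw DataFrame named raw_df and a numeric column named age (both placeholders):

import pandas as pd

# raw_df is assumed to come from the collection step.
clean_df = raw_df.copy()

# Remove duplicates: eliminate exact duplicate records.
clean_df = clean_df.drop_duplicates()

# Handle missing values: drop columns with more than 50% missing,
# then impute remaining numeric gaps with the column median.
clean_df = clean_df.dropna(axis=1, thresh=len(clean_df) // 2)
numeric_cols = clean_df.select_dtypes(include="number").columns
clean_df[numeric_cols] = clean_df[numeric_cols].fillna(clean_df[numeric_cols].median())

# Correct errors: filter an out-of-range value as an illustration.
if "age" in clean_df.columns:
    clean_df = clean_df[clean_df["age"].between(0, 120)]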


3. Data Transformation

• Normalization/Standardization: Scale the features so they have similar ranges. Common techniques include min-max normalization and z-score standardization.
• Encoding Categorical Variables: Convert categorical variables to numerical using methods like one-hot encoding, label encoding, or target encoding.
• Feature Engineering: Create new features from existing ones to help the model learn better. This includes creating interaction terms, polynomial features, and using domain knowledge to derive new features.

Figure 3: Data Transformation Process
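A short scikit-learn/pandas sketch of encoding and feature engineering; a recent scikit-learn (1.2+ for the sparse_output argument) is assumed, and the columns city, income, household_size, and age are hypothetical.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = clean_df.copy()  # clean_df and all column names below are illustrative assumptions

# One-hot encode a hypothetical categorical column "city".
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["city"]])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["city"]), index=df.index)
df = pd.concat([df.drop(columns=["city"]), encoded_df], axis=1)

# Feature engineering: an interaction term and a polynomial feature.
df["income_per_member"] = df["income"] / df["household_size"]
df["age_squared"] = df["age"] ** 2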



Figure 4: Data Transformation Techniques



4. Data Splitting

• Train-Test Split: Split the data into training and testing sets to evaluate the model's performance on unseen data.
• Validation Set: Further split the training data into a training set and a validation set to tune hyperparameters and avoid overfitting.
• Cross-Validation: Use k-fold cross-validation to make the best use of the data, especially when you have limited data (see the sketch after Figure 6).

Figure 5: Data Splitting

Figure 6: Cross Validation
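A minimal scikit-learn sketch of train/validation/test splitting and k-fold cross-validation; X, y, and the logistic-regression model are placeholders chosen for illustration.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Carve a validation set out of the training data for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 5-fold cross-validation on the training portion.
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("Cross-validation accuracy:", scores.mean())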


5. Handling Imbalanced Data

• Resampling: Use techniques like oversampling the minority class (e.g., SMOTE) or undersampling the majority class.
• Class Weighting: Assign different weights to classes to balance the influence of each class on the model training.

Figure 7: Handling Imbalanced Data
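A brief sketch of both options, assuming the third-party imbalanced-learn package for SMOTE (class weighting needs only scikit-learn); X_train and y_train are placeholders.

from imblearn.over_sampling import SMOTE  # third-party: pip install imbalanced-learn
from sklearn.linear_model import LogisticRegression

# Option 1 - Resampling: oversample the minority class with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 2 - Class weighting: weight classes inversely to their frequency.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_model.fit(X_train, y_train)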



6. Feature Selection

• Remove Unnecessary Features: Drop features that do not contribute to the model performance.
• Use Algorithms: Employ algorithms (like LASSO, Decision Trees) that help in selecting important features.
• Correlation Analysis: Remove highly correlated features to reduce multicollinearity.

Figure 8: Feature Selection
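A short sketch combining LASSO-based selection with a correlation filter; X_train is assumed to be a pandas DataFrame with a numeric target y_train, and the 0.01/0.9 thresholds are illustrative.

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# LASSO-based selection: keep features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.01))
selector.fit(X_train, y_train)
selected_cols = X_train.columns[selector.get_support()]

# Correlation analysis: drop one feature from each pair correlated above 0.9.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

X_reduced = X_train[selected_cols.difference(to_drop)]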



Figure 9: Benefit of Feature Selection


7. Feature Scaling: Feature scaling is a technique to standardize the independent features in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to give greater weight to features with larger values and treat features with smaller values as less important, regardless of the units of those values.
• Normalization: Scale features to a range, typically [0, 1].
• Standardization: Transform features to have zero mean and unit variance.

Figure 10: Data Normalization
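A minimal scikit-learn sketch contrasting the two approaches on a small illustrative array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features with very different magnitudes.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

# Normalization: each feature scaled to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature rescaled to zero mean and unit variance.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)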



8. Data Augmentation

• Generate New Data: For image, text, or audio data, create variations of existing data to increase the dataset size.
• In machine learning, data augmentation is a common method for manipulating existing data to artificially increase the size of a training dataset. It aims to boost the variety and variability of the training data in order to enhance the efficiency and flexibility of machine learning models.
• Data augmentation can be especially beneficial when the original dataset is small, as it enables the model to learn from a larger and more varied set of samples.

Types of Data Augmentation: Data augmentation techniques can be applied to a variety of data types, including time series, text, images, and audio. Here are a few frequently used augmentation methods for image data:
• Rotation and flipping: Images can be rotated at different angles and flipped horizontally or vertically to create alternative points of view.

• Random cropping and padding: By applying random cropping or padding to the images, various scales and translations can be simulated.

• Scaling and zooming: Rescaling the images to different sizes or zooming in and out helps the model handle various object sizes and resolutions.

• Shearing and perspective transform: Changing an image's shape or perspective can imitate various viewing angles while also introducing deformations.

• Color jittering: Adjusting the color characteristics of the images, including brightness, contrast, saturation, and hue, makes the model more resilient to variations in illumination.

• Gaussian noise: Introducing random Gaussian noise to the images strengthens the model's resistance to noisy inputs.

Figure 11: Data Augmentation
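As one possible sketch of an image-augmentation pipeline, the snippet below uses torchvision transforms; torchvision is an assumed choice, and Keras preprocessing layers or Albumentations would work equally well.

from torchvision import transforms

# Pipeline combining several of the augmentations described above.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),      # flipping
    transforms.RandomRotation(degrees=15),       # rotation at small angles
    transforms.RandomResizedCrop(size=224),      # random cropping, scaling and zooming
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # color jittering
    transforms.ToTensor(),
])

# Applied to a PIL image (placeholder), this yields a randomly transformed tensor:
# augmented_tensor = augment(pil_image)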



9. Data Storage

• Save Cleaned Data: Store the cleaned and preprocessed data in an appropriate format (CSV, HDF5, etc.) for future use.
• Document the Process: Keep track of the steps and transformations applied to the data for reproducibility.
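A minimal pandas sketch of the storage step; the DataFrame df, the output paths, and the step list are placeholders, and HDF5 output additionally requires the PyTables package.

import json
import pandas as pd

# df stands in for the cleaned, transformed DataFrame; output paths are placeholders.
df.to_csv("processed/train_data.csv", index=False)
df.to_hdf("processed/train_data.h5", key="train", mode="w")  # needs: pip install tables

# Document the preprocessing steps alongside the data for reproducibility.
steps = ["drop_duplicates", "median_imputation", "one_hot_encode_city", "standard_scaling"]
with open("processed/preprocessing_steps.json", "w") as f:
    json.dump(steps, f, indent=2)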
