
End-to-End Machine Learning Project
Life Cycle

Here are the main steps you will go through:

1. Question or problem definition.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Deployment.
Life Cycle

Here are the main steps you will go through:

1. Question or problem definition.

The first question to ask your boss is what exactly is the business objective; building a model is probably
not the end goal. How does the company expect to use and benefit from this model?

This is important because it will determine:

• How you frame the problem,
• Which algorithms you select and which performance measure you use to evaluate your model,
• How much effort you should spend tweaking it.
Life Cycle

Here are the main steps you will go through:

2. Get the data.


• From relational databases (SQL), e.g. MySQL
• From non-relational databases (NoSQL), e.g. MongoDB
• Scrape websites using web-scraping tools such as Beautiful Soup
• Gather data by connecting to Web APIs
Exploratory Data Analysis
Machine Learning
What is EDA?

Exploratory Data Analysis (EDA) is an approach to analyzing data. The researcher takes a bird's-eye view of the data and tries to make sense of it.

Promoted by John Tukey to encourage statisticians to explore data.

To identify outliers, trends and patterns.


Why Data Exploration?
Real Data is Messy …
• Bad Formatting
• Trailing Spaces
• Duplicates
• Empty Rows
• Synonyms and Different Abbreviations
• Difference in Scales
• Skewed Distributions and Outliers
• Missing Values
Steps of Data Exploration

Steps involved to understand, clean and prepare your data for building your predictive model:

• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing value treatment
• Outlier detection
Variable Identification

Identify
• Predictor (Input) variables
• Target (output) variable
• Data Type of variables
• Category of the variables.
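A quick sketch of variable identification with pandas, using a small hypothetical dataset (the column names 'age', 'gender' and 'survived' are illustrative):

```python
import pandas as pd

# Hypothetical dataset; replace with your own
df = pd.DataFrame({
    "age": [23.0, 35.0, 41.0],        # continuous predictor
    "gender": ["M", "F", "F"],        # categorical predictor
    "survived": [0, 1, 1],            # target (output) variable
})

target = "survived"
predictors = [c for c in df.columns if c != target]

print(df.dtypes)                                                 # data type of each variable
print(df[predictors].select_dtypes(include="number").columns)    # continuous predictors
print(df[predictors].select_dtypes(include="object").columns)    # categorical predictors
```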
Variable Identification
Univariate Analysis
Exploring variables one at a time is called univariate analysis.
How you perform univariate analysis depends on whether the variable type is categorical or continuous.

Statistical measures and Visualization for categorical and continuous variables:


1. Continuous Variables :- Descriptive Statistics, Histogram and Box plot.
2. Categorical Variables :- Frequency Table, Bar Charts to understand distribution of each category.

Note: Univariate analysis is also used to highlight missing and outlier values.
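A minimal sketch of univariate analysis with pandas and matplotlib; the DataFrame and column names here are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; replace with your own dataset
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 62, 35, None, 27],
    "gender": ["M", "F", "F", "M", "M", None, "F", "M"],
})

# Continuous variable: descriptive statistics, histogram and box plot
print(df["age"].describe())
df["age"].plot(kind="hist", title="age - histogram")
plt.show()
df["age"].plot(kind="box", title="age - box plot")
plt.show()

# Categorical variable: frequency table and bar chart
print(df["gender"].value_counts(dropna=False))
df["gender"].value_counts().plot(kind="bar", title="gender - frequency")
plt.show()

# Univariate analysis also highlights missing values
print(df.isnull().sum())
```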
Bi-variate Analysis
• Bi-variate Analysis finds out the relationship between two variables.

• We can perform bi-variate analysis for any combination of categorical and continuous variables.

• In this analysis we try to find which features in the dataset contribute significantly to our solution goal. (Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa?)

• This can be tested both for numerical and categorical features in the given dataset.

• We may also want to determine correlations among features other than the target variable for subsequent goals.

• Correlated features may help in creating, completing, or correcting features.


Bi-variate Analysis

Statistical measures and Visualization for Bi-variate analysis:

1. Continuous Variables vs Continuous Variables :- Scatter plot, covariance and Pearson correlation coefficient.
2. Categorical Variables vs Categorical Variables :- Frequency table, bar charts and Chi-square test.
3. Categorical Variables vs Continuous :- Bar chart, box plot and ANOVA.
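A sketch of the three bi-variate cases using pandas and SciPy; the dataset and column names are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical dataset; replace with your own
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 28, 54, 2, 27],
    "fare":     [7.3, 71.3, 7.9, 53.1, 8.1, 51.9, 21.1, 11.1],
    "gender":   ["M", "F", "F", "F", "M", "M", "M", "M"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Continuous vs continuous: scatter plot and Pearson correlation coefficient
df.plot(kind="scatter", x="age", y="fare")
print(df["age"].corr(df["fare"], method="pearson"))

# Categorical vs categorical: frequency (contingency) table and Chi-square test
table = pd.crosstab(df["gender"], df["survived"])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)

# Categorical vs continuous: box plot per group and one-way ANOVA
df.boxplot(column="age", by="survived")
groups = [g["age"] for _, g in df.groupby("survived")]
print(f_oneway(*groups))
```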
Outlier Detection
Most commonly visualizations used to detect outliers are
• Box-plot,
• Histogram,
• Scatter Plot

Various thumb rules to detect outliers.


• Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR (i.e. beyond the box-plot whiskers)
• Any value outside the range of the 5th and 95th percentiles can be considered an outlier
• Data points three or more standard deviations away from the mean are considered outliers
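A sketch of these three thumb rules in pandas, applied to a hypothetical numeric series:

```python
import pandas as pd

# Hypothetical numeric column; replace with your own data
fare = pd.Series([7.3, 8.1, 9.5, 10.2, 11.0, 12.4, 13.1, 14.8, 15.0, 512.3])

# IQR rule: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = fare[(fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)]

# Percentile rule: outside the 5th-95th percentile range
p5, p95 = fare.quantile([0.05, 0.95])
pct_outliers = fare[(fare < p5) | (fare > p95)]

# Z-score rule: three or more standard deviations from the mean
z = (fare - fare.mean()) / fare.std()
z_outliers = fare[z.abs() >= 3]

print(iqr_outliers, pct_outliers, z_outliers, sep="\n")
```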
Data Pre-processing
Machine Learning
CONTENT

• Overview on Data Pre-processing


• Why Data Pre-processing
• Tasks in Data Pre-processing: Normalization, Standardization, Binarization, Imputation, Polynomial Features, etc.
Why Data Pre-processing?
 To achieve better results from a machine learning model, the data has to be in a proper format.
 Unscaled or unstandardized data might give unacceptable predictions.
 Another aspect is that the dataset should be formatted so that more than one machine learning or deep learning algorithm can be run on it, and the best of them chosen.
What is Data Pre-processing?
Data preprocessing is a technique used to convert raw data into a clean data set.

In other words, whenever data is gathered from different sources it is collected in a raw format which is not feasible for analysis.

It includes :

• Feature Scaling

• Imputation

• Label Encoding

• Binarizer
Transformers

Objects which can transform data so that it can be consumed by machine learning algorithms.

Common API – fit, transform and fit_transform.

 fit() – learns the mapping from the data (fitting is analogous to training)

 transform() – applies the learned mapping to transform the data

 fit_transform() – the two steps above combined
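A minimal sketch of this common API, using scikit-learn's StandardScaler as the transformer (the data here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
scaler.fit(X_train)                           # fit(): learn the mapping (mean and std) from the training data
X_train_scaled = scaler.transform(X_train)    # transform(): apply the learned mapping
X_test_scaled = scaler.transform(X_test)      # reuse the same mapping on new data

# fit_transform(): the two steps combined in one call
X_train_scaled = scaler.fit_transform(X_train)
```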


Feature Scaling

Feature scaling is a method of limiting the range of variables so that they can be compared on common ground.
 Standard scaler

 MinMax scaler

 Robust scaler

 Normalizer
Why Feature Scaling?

Real-world datasets contain features that vary highly in magnitude, units, and range.

Formally, if a feature in the dataset is large in scale compared to the others, then in algorithms that measure Euclidean distance this large-scale feature becomes dominant and needs to be normalized.

Algorithms which use a Euclidean distance measure are sensitive to magnitudes.
Here feature scaling helps to weigh all the features equally.

Sometimes, it also helps in speeding up the calculations in an algorithm.


Feature Scaling Matters
• Linear Regression
• K-Means
• K-Nearest-Neighbours
• Principal Component Analysis (PCA): tries to find the directions of maximum variance, so feature scaling is required here too.
• Gradient Descent: calculation speed increases, as the theta (parameter) updates become faster after feature scaling.
• Deep Learning (ANN, CNN, RNN)

Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models (including XGBoost) are not affected by feature scaling.
Standard Scaler

• The Standard Scaler assumes your data is normally distributed within each feature.
• It scales the values so that the distribution is centered around 0, with a standard deviation of 1.
• The mean and standard deviation are calculated for the feature, and then the feature is scaled based on:

  z = (x - mean) / standard deviation

• If the data is not normally distributed, this is not the best scaler to use.

Standardization is also called mean removal and variance scaling.
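A minimal sketch with scikit-learn's StandardScaler on a hypothetical single-feature array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0], [180.0], [160.0], [175.0]])   # hypothetical heights

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)                    # z = (x - mean) / std

print(scaler.mean_, scaler.scale_)                    # learned mean and standard deviation
print(X_scaled.mean(), X_scaled.std())                # approximately 0 and 1
```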


MinMax Scaler

• One of the most popular scaling methods.

• Works on data which is not normally distributed or whose standard deviation is very small.

• Brings the data into the range [0, 1] or [-1, 1].

• It preserves the shape of the original distribution. It doesn't meaningfully change the information embedded in the original data.
• It is sensitive to outliers, so if there are outliers in the data it is better to consider the Robust Scaler.
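A minimal sketch with scikit-learn's MinMaxScaler on a hypothetical feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])    # hypothetical feature

scaler = MinMaxScaler(feature_range=(0, 1))       # use (-1, 1) for the [-1, 1] range
X_scaled = scaler.fit_transform(X)                # x' = (x - min) / (max - min)
print(X_scaled.ravel())                           # [0.  0.333  0.667  1.]
```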
Robust Scaler

• Most suited for data with outliers

• Rather than the min and max, it uses the interquartile range.

• This Scaler removes the median and scales the data according to the
quantile range (defaults to IQR).
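A minimal sketch with scikit-learn's RobustScaler on a hypothetical feature containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with an outlier (1000.0)
X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

scaler = RobustScaler()                  # (x - median) / IQR by default
X_scaled = scaler.fit_transform(X)
print(scaler.center_, scaler.scale_)     # learned median and interquartile range
print(X_scaled.ravel())
```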
Normalization

• The normalizer scales each sample by dividing each value by the magnitude of the sample vector in n-dimensional space, for n features, i.e. bringing the values of each feature vector onto a common scale.

• Say your features were x, y and z Cartesian co-ordinates; your scaled value for x would be:

  x_scaled = x / sqrt(x^2 + y^2 + z^2)

• Each point is now within 1 unit of the origin on this Cartesian co-ordinate system.
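A minimal sketch with scikit-learn's Normalizer on hypothetical (x, y, z) samples:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical x, y, z co-ordinates, one sample per row
X = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])

normalizer = Normalizer(norm="l2")        # divide each row by its Euclidean magnitude
X_scaled = normalizer.fit_transform(X)    # e.g. x' = x / sqrt(x^2 + y^2 + z^2)
print(X_scaled)                           # each row now has unit length
```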
Label Encoding

• Learning algorithms don’t understand strings

• Categorical columns with string values (yes / no) need to be converted to numbers.

• The label encoder encodes the values as integers between 0 and n-1 for n classes.
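A minimal sketch with scikit-learn's LabelEncoder on a hypothetical yes/no column:

```python
from sklearn.preprocessing import LabelEncoder

y = ["yes", "no", "no", "yes", "yes"]    # hypothetical categorical column

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)     # values encoded as integers between 0 and n-1
print(encoder.classes_)                  # ['no' 'yes']
print(y_encoded)                         # [1 0 0 1 1]
```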


One Hot Encoding

• One-Hot Encoding transforms each categorical feature with n possible values


into n binary features, with only one active.

• Suitable for nominal data

• e.g. Gender, Color, City, etc.
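A minimal sketch with scikit-learn's OneHotEncoder on a hypothetical nominal feature (the sparse_output argument applies to newer scikit-learn versions; older versions use sparse instead):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature
X = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn
X_encoded = encoder.fit_transform(X)
print(encoder.categories_)    # [array(['blue', 'green', 'red'], dtype=object)]
print(X_encoded)              # one binary column per category, only one active per row
```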


Binarizer

• Feature binarization is the process of thresholding numerical features to get


boolean values.

• Commonly used for text data.
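A minimal sketch with scikit-learn's Binarizer, thresholding hypothetical word counts to presence/absence values:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical word counts; threshold them to boolean-like 0/1 values
X = np.array([[0.0, 3.0, 1.0], [2.0, 0.0, 0.0]])

binarizer = Binarizer(threshold=0.0)     # values above the threshold become 1, the rest 0
X_binary = binarizer.fit_transform(X)
print(X_binary)                          # [[0. 1. 1.] [1. 0. 0.]]
```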


Imputation

• Real-world data might be incomplete; missing data is represented by NaN or Null.

• Incomplete data are incompatible with scikit-learn estimators.

• One way to deal with them is to discard them. However, removing rows and columns from our dataset is not the best option, as it can lead to loss of valuable information.

• The other way is simply substituting (imputing) the missing values in our dataset.
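A minimal sketch with scikit-learn's SimpleImputer, substituting missing values with the column mean on hypothetical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # other strategies: "median", "most_frequent", "constant"
X_imputed = imputer.fit_transform(X)
print(X_imputed)                           # NaNs replaced by the column means (4.0 and 2.5)
```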


Train and Test split

• The training set is a subset of your data on which your model will learn how to predict the dependent variable from the independent variables.

• The test set is the complementary subset to the training set, on which you will evaluate your model to see if it manages to correctly predict the dependent variable from the independent variables.
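A minimal sketch with scikit-learn's train_test_split on hypothetical data, holding out 20% of the rows as the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and target vector y
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42   # 80% train, 20% test; fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)         # (8, 2) (2, 2)
```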
