
End-to-End Machine Learning Project
Life Cycle

Here are the main steps you will go through:

1. Question or problem definition.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Deployment.
Life Cycle

Here are the main steps you will go through:

1. Question or problem definition.

The first question to ask your boss is what exactly is the business objective; building a model is probably
not the end goal. How does the company expect to use and benefit from this model?

This is important because it will determine:

• How you frame the problem,
• Which algorithms you select and which performance measure you use to evaluate your model,
• How much effort you should spend tweaking it.
Life Cycle

Here are the main steps you will go through:

2. Get the data.


• From relational databases (SQL), e.g. MySQL
• From non-relational databases (NoSQL), e.g. MongoDB
• Scrape websites using web-scraping tools such as Beautiful Soup
• Gather data by connecting to Web APIs
Exploratory Data Analysis
Machine Learning
What is EDA?

Exploratory Data Analysis (EDA) is an approach to analyzing data. The researcher takes a bird's-eye view of the data and tries to make sense of it.

Promoted by John Tukey to encourage statisticians to explore data.

To identify outliers, trends and patterns.


Why Data Exploration?
Real Data is Messy …
• Bad Formatting
• Trailing Spaces
• Duplicates
• Empty Rows
• Synonyms and Different Abbreviations
• Difference in Scales
• Skewed Distributions and Outliers
• Missing Values
Steps of Data Exploration

Steps involved to understand, clean and prepare your data for building your predictive model:

• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing value treatment
• Outlier detection
Variable Identification

Identify
• Predictor (Input) variables
• Target (output) variable
• Data Type of variables
• Category of the variables.
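A quick sketch of variable identification with pandas, using a small hypothetical dataset (the column names 'age', 'gender' and 'survived' are illustrative):

```python
import pandas as pd

# Hypothetical dataset; replace with your own
df = pd.DataFrame({
    "age": [23.0, 35.0, 41.0],        # continuous predictor
    "gender": ["M", "F", "F"],        # categorical predictor
    "survived": [0, 1, 1],            # target (output) variable
})

target = "survived"
predictors = [c for c in df.columns if c != target]

print(df.dtypes)                                                 # data type of each variable
print(df[predictors].select_dtypes(include="number").columns)    # continuous predictors
print(df[predictors].select_dtypes(include="object").columns)    # categorical predictors
```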
Variable Identification
Univariate Analysis
Exploring variables one at a time is called univariate analysis.
How you perform univariate analysis depends on whether the variable type is categorical or continuous.

Statistical measures and Visualization for categorical and continuous variables:


1. Continuous Variables :- Descriptive Statistics, Histogram and Box plot.
2. Categorical Variables :- Frequency Table, Bar Charts to understand distribution of each category.

Note: Univariate analysis is also used to highlight missing and outlier values.
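A minimal sketch of univariate analysis with pandas and matplotlib; the DataFrame and column names here are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; replace with your own dataset
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 62, 35, None, 27],
    "gender": ["M", "F", "F", "M", "M", None, "F", "M"],
})

# Continuous variable: descriptive statistics, histogram and box plot
print(df["age"].describe())
df["age"].plot(kind="hist", title="age - histogram")
plt.show()
df["age"].plot(kind="box", title="age - box plot")
plt.show()

# Categorical variable: frequency table and bar chart
print(df["gender"].value_counts(dropna=False))
df["gender"].value_counts().plot(kind="bar", title="gender - frequency")
plt.show()

# Univariate analysis also highlights missing values
print(df.isnull().sum())
```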
Bi-variate Analysis
• Bi-variate Analysis finds out the relationship between two variables.

• We can perform bi-variate analysis for any combination of categorical and continuous variables.

• In this analysis we try to find which features in the dataset contribute significantly to our solution goal. (Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa?)

• This can be tested both for numerical and categorical features in the given dataset.

• We may also want to determine correlations among features other than the target variable for subsequent goals.

• Correlated features may help in creating, completing, or correcting features.


Bi-variate Analysis

Statistical measures and Visualization for Bi-variate analysis:

1. Continuous Variables vs Continuous Variables :- Scatter plot, covariance and Pearson correlation coefficient.
2. Categorical Variables vs Categorical Variables :- Frequency table, bar charts and Chi-square test.
3. Categorical Variables vs Continuous :- Bar chart, box plot and ANOVA.
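A sketch of the three bi-variate cases using pandas and SciPy; the dataset and column names are hypothetical:

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical dataset; replace with your own
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 28, 54, 2, 27],
    "fare":     [7.3, 71.3, 7.9, 53.1, 8.1, 51.9, 21.1, 11.1],
    "gender":   ["M", "F", "F", "F", "M", "M", "M", "M"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Continuous vs continuous: scatter plot and Pearson correlation coefficient
df.plot(kind="scatter", x="age", y="fare")
print(df["age"].corr(df["fare"], method="pearson"))

# Categorical vs categorical: frequency (contingency) table and Chi-square test
table = pd.crosstab(df["gender"], df["survived"])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)

# Categorical vs continuous: box plot per group and one-way ANOVA
df.boxplot(column="age", by="survived")
groups = [g["age"] for _, g in df.groupby("survived")]
print(f_oneway(*groups))
```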
Outlier Detection
Most commonly visualizations used to detect outliers are
• Box-plot,
• Histogram,
• Scatter Plot

Various thumb rules to detect outliers.


• Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR (i.e. beyond the box-plot whiskers)
• Any value outside the range of the 5th and 95th percentiles can be considered an outlier
• Data points three or more standard deviations away from the mean are considered outliers
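A sketch of these three thumb rules in pandas, applied to a hypothetical numeric series:

```python
import pandas as pd

# Hypothetical numeric column; replace with your own data
fare = pd.Series([7.3, 8.1, 9.5, 10.2, 11.0, 12.4, 13.1, 14.8, 15.0, 512.3])

# IQR rule: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = fare[(fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)]

# Percentile rule: outside the 5th-95th percentile range
p5, p95 = fare.quantile([0.05, 0.95])
pct_outliers = fare[(fare < p5) | (fare > p95)]

# Z-score rule: three or more standard deviations from the mean
z = (fare - fare.mean()) / fare.std()
z_outliers = fare[z.abs() >= 3]

print(iqr_outliers, pct_outliers, z_outliers, sep="\n")
```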
Data Pre-processing
Machine Learning
CONTENT

• Overview on Data Pre-processing


• Why Data Pre-processing
• Tasks in Data Pre-processing: Normalization, Standardization, Binarization, Imputation, Polynomial Features, etc.
Why Data Pre-processing?
 To achieve better results from a machine learning model, the data has to be in a proper format.
 Unscaled or unstandardized data might give unacceptable predictions.
 Another aspect is that the dataset should be formatted so that more than one machine learning or deep learning algorithm can be run on it, and the best of them chosen.
What is Data Pre-processing?
Data preprocessing is a technique used to convert raw data into a clean data set.

In other words, whenever data is gathered from different sources it is collected in a raw format which is not feasible for analysis.

It includes :

• Feature Scaling

• Imputation

• Label Encoding

• Binarizer
Transformers

Objects which can transform data so that it can be consumed by machine learning algorithms.

Common API – fit, transform and fit_transform.

 fit() – learns the mapping from the data (fitting is analogous to training)

 transform() – applies the learned mapping to transform the data

 fit_transform() – the two steps above combined
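A minimal sketch of this common API, using scikit-learn's StandardScaler as the transformer (the data here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
scaler.fit(X_train)                           # fit(): learn the mapping (mean and std) from the training data
X_train_scaled = scaler.transform(X_train)    # transform(): apply the learned mapping
X_test_scaled = scaler.transform(X_test)      # reuse the same mapping on new data

# fit_transform(): the two steps combined in one call
X_train_scaled = scaler.fit_transform(X_train)
```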


Feature Scaling

Feature scaling is a method of limiting the range of variables so that they can be compared on common ground.
 Standard scaler

 MinMax scaler

 Robust scaler

 Normalizer
Why Feature Scaling?

Real-world datasets contain features that vary highly in magnitude, units, and range.

Formally, if a feature in the dataset is large in scale compared to the others, then in algorithms that measure Euclidean distance this large-scale feature becomes dominant and needs to be normalized.

Algorithms which use a Euclidean distance measure are sensitive to magnitudes.
Here feature scaling helps to weigh all the features equally.

Sometimes, it also helps in speeding up the calculations in an algorithm.


Feature Scaling Matters
• Linear Regression
• K-Means
• K-Nearest-Neighbours
• Principal Component Analysis (PCA): tries to find the directions of maximum variance, so feature scaling is required here too.
• Gradient Descent: calculation speed increases, as the theta (parameter) updates become faster after feature scaling.
• Deep Learning (ANN, CNN, RNN)

Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models (including XGBoost) are not affected by feature scaling.
Standard Scaler

• The Standard Scaler assumes your data is normally distributed within each feature.
• It scales the values so that the distribution is centered around 0, with a standard deviation of 1.
• The mean and standard deviation are calculated for the feature, and then the feature is scaled based on:

  z = (x - mean) / standard deviation

• If the data is not normally distributed, this is not the best scaler to use.

Standardization is also called mean removal and variance scaling.
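A minimal sketch with scikit-learn's StandardScaler on a hypothetical single-feature array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0], [180.0], [160.0], [175.0]])   # hypothetical heights

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)                    # z = (x - mean) / std

print(scaler.mean_, scaler.scale_)                    # learned mean and standard deviation
print(X_scaled.mean(), X_scaled.std())                # approximately 0 and 1
```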


MinMax Scaler

• One of the most popular scaling methods.

• Works on data which is not normally distributed or whose standard deviation is very small.

• Brings the data into the range [0, 1] or [-1, 1].

• It preserves the shape of the original distribution. It doesn't meaningfully change the information embedded in the original data.
• It is sensitive to outliers, so if there are outliers in the data it is better to consider the Robust Scaler.
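A minimal sketch with scikit-learn's MinMaxScaler on a hypothetical feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])    # hypothetical feature

scaler = MinMaxScaler(feature_range=(0, 1))       # use (-1, 1) for the [-1, 1] range
X_scaled = scaler.fit_transform(X)                # x' = (x - min) / (max - min)
print(X_scaled.ravel())                           # [0.  0.333  0.667  1.]
```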
Robust Scaler

• Most suited for data with outliers

• Rather than the min and max, it uses the interquartile range.

• This Scaler removes the median and scales the data according to the
quantile range (defaults to IQR).
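A minimal sketch with scikit-learn's RobustScaler on a hypothetical feature containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with an outlier (1000.0)
X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

scaler = RobustScaler()                  # (x - median) / IQR by default
X_scaled = scaler.fit_transform(X)
print(scaler.center_, scaler.scale_)     # learned median and interquartile range
print(X_scaled.ravel())
```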
Normalization

• The normalizer scales each sample by dividing each value by the magnitude of the sample vector in n-dimensional space, for n features, i.e. bringing the values of each feature vector onto a common scale.

• Say your features were x, y and z Cartesian co-ordinates; your scaled value for x would be:

  x_scaled = x / sqrt(x^2 + y^2 + z^2)

• Each point is now within 1 unit of the origin on this Cartesian co-ordinate system.
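A minimal sketch with scikit-learn's Normalizer on hypothetical (x, y, z) samples:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical x, y, z co-ordinates, one sample per row
X = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])

normalizer = Normalizer(norm="l2")        # divide each row by its Euclidean magnitude
X_scaled = normalizer.fit_transform(X)    # e.g. x' = x / sqrt(x^2 + y^2 + z^2)
print(X_scaled)                           # each row now has unit length
```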
Label Encoding

• Learning algorithms don’t understand strings

• Categorical columns with string values (yes / no) need to be converted to numbers.

• The label encoder encodes the values as integers between 0 and n-1 for n classes.
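A minimal sketch with scikit-learn's LabelEncoder on a hypothetical yes/no column:

```python
from sklearn.preprocessing import LabelEncoder

y = ["yes", "no", "no", "yes", "yes"]    # hypothetical categorical column

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)     # values encoded as integers between 0 and n-1
print(encoder.classes_)                  # ['no' 'yes']
print(y_encoded)                         # [1 0 0 1 1]
```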


One Hot Encoding

• One-Hot Encoding transforms each categorical feature with n possible values


into n binary features, with only one active.

• Suitable for nominal data

• e.g. Gender, Color, City, etc.
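A minimal sketch with scikit-learn's OneHotEncoder on a hypothetical nominal feature (the sparse_output argument applies to newer scikit-learn versions; older versions use sparse instead):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature
X = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn
X_encoded = encoder.fit_transform(X)
print(encoder.categories_)    # [array(['blue', 'green', 'red'], dtype=object)]
print(X_encoded)              # one binary column per category, only one active per row
```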


Binarizer

• Feature binarization is the process of thresholding numerical features to get


boolean values.

• Commonly used for text data.
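A minimal sketch with scikit-learn's Binarizer, thresholding hypothetical word counts to presence/absence values:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical word counts; threshold them to boolean-like 0/1 values
X = np.array([[0.0, 3.0, 1.0], [2.0, 0.0, 0.0]])

binarizer = Binarizer(threshold=0.0)     # values above the threshold become 1, the rest 0
X_binary = binarizer.fit_transform(X)
print(X_binary)                          # [[0. 1. 1.] [1. 0. 0.]]
```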


Imputation

• Real-world data might be incomplete; missing data is represented by NaN or Null.

• Incomplete data are incompatible with scikit-learn estimators.

• One way to deal with them is to discard them. However, removing rows and columns from our dataset is not the best option, as it can lead to loss of valuable information.

• The other way is simply substituting (imputing) the missing values in our dataset.
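A minimal sketch with scikit-learn's SimpleImputer, substituting missing values with the column mean on hypothetical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # other strategies: "median", "most_frequent", "constant"
X_imputed = imputer.fit_transform(X)
print(X_imputed)                           # NaNs replaced by the column means (4.0 and 2.5)
```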


Train and Test split

• The training set is a subset of your data on which your model will learn how to predict the dependent variable from the independent variables.

• The test set is the complementary subset to the training set, on which you will evaluate your model to see if it manages to correctly predict the dependent variable from the independent variables.
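A minimal sketch with scikit-learn's train_test_split on hypothetical data, holding out 20% of the rows as the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and target vector y
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42   # 80% train, 20% test; fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)         # (8, 2) (2, 2)
```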
