02. Data Preprocessing
Machine Learning Project Life Cycle
The first question to ask your boss is what exactly is the business objective; building a model is probably
not the end goal. How does the company expect to use and benefit from this model?
Exploratory Data Analysis (EDA) is an approach to analyzing data in which the researcher
takes a bird's-eye view of the data and tries to make sense of it.
Steps involved in understanding, cleaning, and preparing your data for building your predictive model:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values
• Outliers
Variable Identification
Identify
• Predictor (Input) variables
• Target (output) variable
• Data Type of variables
• Category of the variables.
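As a quick sketch, variable identification can be done in pandas by separating predictors from the target and inspecting data types and categories (the column names below, such as "price", are hypothetical):

```python
import pandas as pd

# Hypothetical toy dataset: "price" is assumed to be the target variable
df = pd.DataFrame({
    "area":  [1200, 1500, 900],        # continuous predictor
    "city":  ["A", "B", "A"],          # categorical predictor
    "price": [200_000, 260_000, 150_000]  # target
})

X = df.drop(columns=["price"])   # predictor (input) variables
y = df["price"]                  # target (output) variable

print(X.dtypes)             # data type of each variable
print(X["city"].unique())   # categories of a categorical variable
```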
Univariate Analysis
The method of exploring variables one at a time is called univariate analysis.
How univariate analysis is performed depends on whether the variable is categorical or continuous.
Note: Univariate analysis is also used to highlight missing and outlier values.
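A minimal sketch of univariate analysis with pandas, assuming a small DataFrame with one continuous and one categorical column (the names and values are made up):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"area": [1200, 1500, 900, 5000],
                   "city": ["A", "B", "A", "B"]})

# Continuous variable: central tendency, spread, quartiles
print(df["area"].describe())

# Categorical variable: frequency of each category
print(df["city"].value_counts())

# Missing values are also checked variable by variable
print(df["area"].isna().sum())
```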
Bi-variate Analysis
• Bivariate analysis finds out the relationship between two variables.
• We can perform bivariate analysis for any combination of categorical and continuous variables.
• In this analysis we try to find which features in the dataset contribute significantly (statistically) to our solution goal: as the feature values change, does the solution state change as well, and vice versa?
• This can be tested for both numerical and categorical features in the given dataset.
• We may also want to determine the correlation among features other than the target variable for subsequent goals.
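As a rough sketch, bivariate relationships can be checked with pandas, e.g. the correlation between two continuous variables and a group-by comparison between a categorical feature and a continuous target (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"area":  [1200, 1500, 900, 2000],
                   "city":  ["A", "B", "A", "B"],
                   "price": [200_000, 260_000, 150_000, 330_000]})

# Continuous vs. continuous: Pearson correlation
print(df["area"].corr(df["price"]))

# Categorical vs. continuous: does the target change across categories?
print(df.groupby("city")["price"].mean())

# Correlation matrix among all numeric features
print(df.select_dtypes("number").corr())
```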
Data Preprocessing
In other words, whenever the data is gathered from different sources it is collected in a raw format that is not feasible for analysis, so it must first be preprocessed.
It includes :
• Feature Scaling
• Imputation
• Label Encoding
• Binarizer
Transformers
Objects which can transform data so that it can be consumed by machine-learning algorithms.
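In scikit-learn, transformers share a common fit / transform interface; a minimal sketch using MinMaxScaler as the transformer (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()          # a transformer object
scaler.fit(X)                    # learn the per-feature min and max
X_scaled = scaler.transform(X)   # apply the transformation

# fit_transform() combines both steps in a single call
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```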
Feature Scaling
Feature scaling is the method of limiting the range of variables so that they can be compared on
common grounds.
Standard scaler
MinMax scaler
Robust scaler
Normalizer
Why Feature Scaling?
Real-world datasets contain features that vary highly in magnitude, units, and range.
Algorithms that use a Euclidean distance measure are sensitive to these magnitudes.
Here, feature scaling helps to weight all the features equally.
Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models (e.g. XGBoost) are
not affected by feature scaling.
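A small illustration of why magnitude matters for distance-based algorithms: without scaling, the Euclidean distance is dominated by the feature with the larger range (the numbers below are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 1: age (small range), Feature 2: income (large range)
X = np.array([[25, 50_000],
              [30, 52_000],
              [27, 90_000]], dtype=float)

# Raw Euclidean distances are driven almost entirely by income
print(np.linalg.norm(X[0] - X[1]))   # ~2000
print(np.linalg.norm(X[0] - X[2]))   # ~40000

# After scaling, both features contribute comparably to the distance
Xs = StandardScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]), np.linalg.norm(Xs[0] - Xs[2]))
```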
Standard Scaler
• The Standard Scaler assumes your data is normally distributed within each feature.
• It scales each feature so that the distribution is centered around 0, with a standard deviation of 1.
• The mean and standard deviation are calculated for the feature, and the feature is then scaled as:
  x_scaled = (x − mean) / std
• If data is not normally distributed, this is not the best scaler to use.
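A minimal sketch with scikit-learn's StandardScaler, checking that the result matches the (x − mean) / std formula (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_std = StandardScaler().fit_transform(X)

# The scaled feature now has mean ~0 and standard deviation ~1
print(X_std.mean(axis=0), X_std.std(axis=0))

# Equivalent to applying (x - mean) / std manually
print((X - X.mean(axis=0)) / X.std(axis=0))
```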
Robust Scaler
• This scaler removes the median and scales the data according to the quantile range (defaults to the IQR, the range between the 1st and 3rd quartiles), which makes it less sensitive to outliers.
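A minimal sketch with scikit-learn's RobustScaler on data containing an outlier (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The last value (1000) is an outlier; RobustScaler uses the median and IQR,
# so the outlier does not distort the scaling of the other points
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_robust = RobustScaler().fit_transform(X)   # (x - median) / IQR
print(X_robust.ravel())
```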
Normalization
• The normalizer scales each sample by dividing each of its values by the magnitude (Euclidean norm) of that sample in n-dimensional space, for n features, i.e. bringing the values of each feature vector onto a common scale.
• Say your features were x, y and z Cartesian co-ordinates; your scaled value for x would be:
  x_scaled = x / sqrt(x² + y² + z²)
• Each point is now within 1 unit of the origin on this Cartesian co-ordinate system.
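A minimal sketch with scikit-learn's Normalizer, which scales each sample (row) to unit length using the L2 norm by default (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0, 0.0],
              [1.0, 1.0, 1.0]])

# Each row is divided by its own Euclidean norm
X_norm = Normalizer().fit_transform(X)
print(X_norm)                          # first row becomes [0.6, 0.8, 0.0]
print(np.linalg.norm(X_norm, axis=1))  # each row now has length 1
```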
Label Encoding
• Label encoding converts categorical (text) labels into numeric codes so that machine-learning algorithms, which work with numbers, can use them.
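A minimal sketch using scikit-learn's LabelEncoder (the label values below are made up):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["paris", "tokyo", "paris", "cairo"]   # hypothetical categorical values

le = LabelEncoder()
codes = le.fit_transform(labels)

print(codes)                         # [1 2 1 0] -- integer code per label (sorted order)
print(le.classes_)                   # ['cairo' 'paris' 'tokyo']
print(le.inverse_transform(codes))   # back to the original labels
```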
Imputation
• Real-world data might be incomplete; missing data is typically represented by NaN or Null.
• One way to deal with missing values is to discard them. However, removing rows or columns
from the dataset is not the best option, as it can lead to loss of valuable information;
instead, missing values can be imputed (filled in) from the rest of the data.
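A minimal sketch using scikit-learn's SimpleImputer to fill missing values with the column mean (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Missing entries are represented as NaN
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace missing values with the column mean instead of dropping rows/columns
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)   # NaNs replaced by 2.0 and 15.0 respectively
```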
Train / Test Split
• The training set is a subset of your data on which your model will learn how to
predict the dependent variable from the independent variables.
• The test set is the complementary subset of the training set, on which you will
evaluate your model to see whether it correctly predicts the dependent variable
from the independent variables.
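A minimal sketch of a train/test split with scikit-learn (the 80/20 split and the random_state value are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.arange(10)                  # corresponding target values

# Hold out 20% of the data as the test set; the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```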