Machine Learning Unit 2

Exploratory Data Analysis

Dr. Sandip Bankar, STME Navi Mumbai


Concepts
• Missing Value Treatment
• Handling Categorical Data: mapping ordinal features, encoding class labels, performing one-hot encoding on nominal features
• Outlier Detection and Treatment
• Feature Engineering: variable transformation and variable creation, selecting meaningful features


Exploratory Data Analysis (EDA)
• Exploratory Data Analysis is the process of examining and understanding data in order to extract insights or the main characteristics of the data.
• EDA is generally classified into two methods: graphical analysis and non-graphical analysis.
• EDA is essential because it is good practice to first understand the problem statement and the various relationships between the data features before getting your hands dirty.
The primary motives of EDA are to:

Examine the data distribution

Handle missing values in the dataset (one of the most common issues with any dataset)

Handle the outliers

Remove duplicate data

Encode the categorical variables

Normalize and scale the data


Steps Involved in EDA
1. Data Collection
• Nowadays, data is generated in huge volumes and various forms belonging to every sector of human life, like
healthcare, sports, manufacturing, tourism, and so on.
• EDA depends on collecting the required data from various sources such as surveys, social media, and customer reviews, to name a few.
• Without collecting sufficient and relevant data, further activities cannot begin.

2. Finding all Variables and Understanding Them


• When the analysis process starts, the first focus is on the available data that gives a lot of information.
• This information helps to understand and get valuable insights from them.
• It requires first identifying the important variables which affect the outcome and their possible impact.
• This step is crucial for the final result expected from any analysis.



3. Cleaning the Dataset
• The next step is to clean the data set, which may contain null values and irrelevant information.
• These are to be removed so that data contains only those values that are relevant and important from the target
point of view.
• This not only reduces the time but also the computational power required for estimation.

4. Identify Correlated Variables


• Finding a correlation between variables helps to know how a particular variable is related to another.
• The correlation matrix method gives a clear picture of how different variables correlate, which further helps in
understanding vital relationships among them.



5. Choosing the Right Statistical Methods
• Depending on the data, categorical or numerical, the size, type of variables, and the purpose of analysis, different
statistical tools are employed.
• Statistical formulae applied for numerical outputs give fair information, but graphical visuals are more appealing
and easier to interpret.

6. Visualizing and Analyzing Results


• Once the analysis is over, the findings are to be observed cautiously and carefully so that proper interpretation can
be made.
• The trends in the spread of data and correlation between variables give good insights for making suitable changes
in the data parameters.
• The results obtained are specific to the data of that particular domain and can be applied in areas such as retail, healthcare, and agriculture.
Handling Missing Values
Type of Missing Values
• Missing Completely at Random (MCAR) - In this case, missing
values are randomly distributed across the variable, and it is
assumed that other variables can’t predict the missing value.
• Missing at Random (MAR) - In MAR, missing data is randomly
distributed, but it can be predicted by other variables/data
points. For example, sensors could not log data for a minute due
to a technical glitch, but it can be interpolated by previous and
subsequent readings.
• Missing Not at Random (MNAR) - In this case, missing values are not random and are missing systematically. For example, people may avoid providing certain information while filling out surveys.


Handling Missing Values- Categorical & Numerical
• How to Handle Missing Data?

import pandas as pd
df = pd.read_csv(".../Train.csv")

# Count of records in the data
df.shape
>> (8068, 11)

# Check which variables have missing values and how many
df.isnull().sum()

>> ID                  0
   Gender              0
   Ever_Married      140
   Age                 0
   Graduated          78
   Profession        124
   Work_Experience   829
   Spending_Score      0
   Family_Size       335
   Var_1              76
   Segmentation        0
   dtype: int64
• Removal : You can either remove rows or columns containing missing
values.

# To remove rows containing missing values


df_remove = df.dropna(axis = 0)
df_remove.shape
>> (6665, 11)

# To remove columns with missing values


df_remove = df.dropna(axis = 1)
df_remove.shape
>> (8068, 5)



Imputation
• In this approach, you can replace missing values of a column by
computing its mean/average.
• However, this approach only works for numerical variables.
• This method is simple, fast, and works well for small datasets.
• But outliers in a column can skew the mean, which can impact the accuracy of the ML model.
• Mean - the average value
• Median - the middle value
• Mode - the most common value



• You can use the fillna function to impute the missing values
by column mean.
# Impute Work_Experience feature by its mean in our dataset
df['Work_Experience'] = df['Work_Experience'].fillna(df['Work_Experience'].mean())

• Also, you can replace the NULL values with the column's median
# Impute Work_Experience feature by its median in our dataset
df['Work_Experience'] = df['Work_Experience'].fillna(df['Work_Experience'].median())


• Imputation by Mode (Most Frequent Values)
you can impute the missing or NULL values by the most frequent or common value in the
column.
• The advantage of this method is that it can work with both numerical and categorical features.
However, it ignores feature correlations, just like previous approaches.
• You can use Scikit-learn’s SimpleImputer class for this imputation method
# impute Graduated and Family_Size features with most_frequent values
from sklearn.impute import SimpleImputer
impute_mode = SimpleImputer(strategy = 'most_frequent')
impute_mode.fit(df[['Graduated', 'Family_Size']])

df[['Graduated', 'Family_Size']] = impute_mode.transform(df[['Graduated', 'Family_Size']])



• df['Age'].fillna(df['Age'].mode()[0], inplace=True)   # mode() returns a Series, so take its first value


To see the columns and their data types
• df.info()

• tail() will display the last 5 observations of the dataset

Finding the number of unique elements in our dataset. This will help us decide which type of encoding to choose for converting categorical columns into numerical columns.
• df.nunique()

Filling the missing values of gender with the string "No Gender"
• df["Gender"].fillna("No Gender", inplace = True)

Filling the Senior Management column with the mode value.
• mode = df['Senior Management'].mode().values[0]
• df['Senior Management'] = df['Senior Management'].replace(np.nan, mode)
• df.isnull().sum()


Categorical Data
➢ Categorical variables are usually represented as ‘strings’ or ‘categories’ and are finite in number.
➢ Here are a few examples:
• The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
• The department a person works in: Finance, Human resources, IT, Production.
• The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
• The grades of a student: A+, A, B+, B, B- etc.

➢ But many Machine learning models require all input and output variables to be numeric.
➢ If your data contains categorical data, you must encode it to numbers before you can fit and evaluate a
model.



There are two kinds of categorical data
• Nominal data: This type of categorical data consists of the name variable without any numerical values. For
example, in any organization, the name of the different departments like research and development
department, human resource department, accounts and billing department etc.

• Ordinal data: This type of categorical data consists of a set of orders or scales. For example, a list of patients
consists of the level of sugar present in the body of a person which can be divided into high, low and
medium classes.



Encoding Categorical Data
• Encoding converts each label into integer values and the encoded data represents the sequence of
labels.
• There are three common approaches for converting ordinal and categorical variables to numerical
values.
• They are:
• Ordinal Encoding
• One-Hot Encoding
• Dummy Variable Encoding



Ordinal Encoding
• We use this categorical data encoding technique when the categorical feature is ordinal.
• In this case, retaining the order is important. Hence encoding should reflect the sequence.
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'height':
    ['tall', 'medium', 'short', 'tall', 'medium', 'short', 'tall', 'medium', 'short']})

# Create object of ordinal encoding with an explicit mapping
encoder = ce.OrdinalEncoder(cols=['height'], return_df=True,
    mapping=[{'col': 'height',
              'mapping': {'None': 0, 'tall': 1, 'medium': 2, 'short': 3}}])

# Original data
print(df)

df['transformed'] = encoder.fit_transform(df)
print(df)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from numpy import asarray

# Create a dataset
data = {'cost': ['50', '35', '75', '42', '54', '71'],
'size': ['large', 'small', 'extra large', 'medium', 'large', 'extra large']}
df = pd.DataFrame(data)

# Initiate OrdinalEncoder
encoder = OrdinalEncoder()

# Fit the encoder
encoder.fit(asarray(df['size']).reshape(-1, 1))
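The slide stops at fit(). A minimal continuation, assuming the same df, that actually transforms the column, plus a variant where the category order is given explicitly so the integer codes follow small < medium < large < extra large (the ordered list is an assumption, not from the slides):

# Transform the fitted column; by default categories are ordered alphabetically
df['size_encoded'] = encoder.transform(asarray(df['size']).reshape(-1, 1)).ravel()

# To make the codes respect the real ordering, pass the categories explicitly
ordered = OrdinalEncoder(categories=[['small', 'medium', 'large', 'extra large']])
df['size_ordered'] = ordered.fit_transform(df[['size']]).ravel()
print(df)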
One-Hot Encoding
• We use this categorical data encoding technique when the features are nominal(do not have any
order). In one hot encoding, for each level of a categorical feature, we create a new variable. Each
category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the absence,
and 1 represents the presence of that category.



import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hydrabad','Chennai','Bangalore','Delhi','Hydrabad','Bangalore','Delhi'
]})

#Create object for one-hot encoding


encoder=ce.OneHotEncoder(cols='City',handle_unknown='return_nan',
return_df=True,use_cat_names=True)

#Original Data
data



• #Fit and transform Data
• data_encoded = encoder.fit_transform(data)
• data_encoded



# import required modules
import pandas as pd
import numpy as np
# create dataset
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold'],
})
# display dataset
print(df)
# create dummy variables
pd.get_dummies(df)



Dummy Variable Encoding
• Dummy encoding also uses dummy (binary) variables.
• Instead of creating a number of dummy variables that is equal to the number of categories (k) in
the variable, dummy encoding uses k-1 dummy variables.
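A minimal illustration with pandas, reusing the Temperature frame from the previous slide; drop_first=True produces the k-1 dummy variables described above:

import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold']})

# One-hot encoding: k = 3 dummy columns
print(pd.get_dummies(df))

# Dummy (k-1) encoding: the first category ('Cold') becomes the implicit baseline
print(pd.get_dummies(df, drop_first=True))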



Outlier Detection and Treatment
• An outlier is an observation in a dataset that lies far from the rest of the observations.
• Detecting Outliers
• Algorithm (a sketch follows below):
• Calculate the mean of each cluster
• Initialize the threshold value
• Calculate the distance of the test data from each cluster mean
• Find the nearest cluster to the test data
• If (Distance > Threshold) then, Outlier
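A minimal sketch of this cluster-distance idea, assuming scikit-learn's KMeans; the training points, the test point, and the threshold value are illustrative choices, not from the slides:

import numpy as np
from sklearn.cluster import KMeans

# Training data: two compact clusters (illustrative values)
X_train = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
                    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

threshold = 3.0                       # step 2: initialize an (arbitrary) threshold
x_test = np.array([50.0, 50.0])       # test point to check

# Steps 3-4: distance of the test point to each cluster mean, keep the nearest
dists = np.linalg.norm(kmeans.cluster_centers_ - x_test, axis=1)
nearest = dists.min()

# Step 5: flag as outlier if the nearest-cluster distance exceeds the threshold
print("outlier" if nearest > threshold else "normal")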

• Below are some of the techniques of detecting outliers


• Boxplots
• Z-score
• Inter Quantile Range(IQR)
Using pandas describe() to find outliers
• Use .describe() to generate summary statistics. Generating summary statistics is a quick way to help us determine whether or not the dataset has outliers.
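For instance, assuming the taxi-fare DataFrame used in the plots below:

# A max far above the 75% quartile (or a negative min) hints at outliers
print(df['fare_amount'].describe())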



• Data Visualization Techniques
Use data visualization techniques to inspect the data’s distribution and verify the
presence of outliers.
• Use a statistical method to calculate the outlier data points.
• Apply a statistical method to drop or transform the outliers.
• These are a few of the most popular visualization methods for finding outliers in
data:
• Histogram
• Box plot
• Scatter plot



Use px.histogram() to review the fare_amount distribution (with plotly imported as: import plotly.express as px).
• #create a histogram
• fig = px.histogram(df, x='fare_amount')
• fig.show()

Use px.box() to review the values of fare_amount.
• #create a box plot
• fig = px.box(df, y="fare_amount")
• fig.show()

Use px.scatter() to review passenger_count and fare_amount.
• fig = px.scatter(x=df['passenger_count'], y=df['fare_amount'])
• fig.show()


Detect Outliers Using Standard Deviation

• Standard deviation measures the spread of data around the mean; in essence, it captures how far away from the mean the data points are.
• The data points that fall below mean - 3*(sigma) or above mean + 3*(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
import numpy as np

# Calculate the mean and standard deviation
mean = np.mean(age_data)
std_dev = np.std(age_data)

# Set a threshold for the number of standard deviations away from the mean
threshold = 2

# Identify outliers
outliers1 = []
for age in age_data:
    if abs(age - mean) > threshold * std_dev:
        outliers1.append(age)
print("Outliers:", outliers1)


• import numpy
• arr = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430,
637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496, 398,
512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437,
483, 529, 363, 711, 543]

• elements = numpy.array(arr)

• mean = numpy.mean(elements, axis=0)


• sd = numpy.std(elements, axis=0)

• final_list = [x for x in arr if (x > mean - 2 * sd)]


• final_list = [x for x in final_list if (x < mean + 2 * sd)]
• print(final_list)



Outlier detection using Z-Score
• The Z-score for a value of x in the dataset with a
normal distribution with mean μ and standard
deviation σ is given by:
• z = (x - μ)/σ
• when the values of a variable are converted to Z-
scores, then the distribution of the variable is
called standard normal distribution with mean=0
and standard deviation=1.
• The Z-score method requires a cut-off specified
by the user, to identify outliers.
• The widely used lower end cut-off is -3 and the
upper end cut-off is +3. The reason behind using
these cut-offs is, 99.7% of the values lie between
-3 and +3 in a standard normal distribution
mean = np.mean(age_data)
std = np.std(age_data)
print('mean of the dataset is', mean)
print('std. deviation is', std)

threshold = 3
outlier = []
for i in age_data:
    z = (i - mean) / std
    if abs(z) > threshold:       # check both tails, i.e. z < -3 or z > +3
        outlier.append(i)
print('outlier in dataset is', outlier)


Outlier detection using Interquartile Range(IQR), Boxplot

• Outlier detection using the Interquartile Range (IQR)


method involves calculating the first and third
quartiles (Q1 and Q3) of a dataset and then
identifying any data points that fall beyond the range
of Q1 - 1.5 * IQR to Q3 + 1.5 * IQR, where IQR is the
difference between Q3 and Q1.
• Data points that fall outside of this range are
considered outliers.



(Boxplot example from the slides: the detected outliers are –50 and 1456.)
Steps:
• Sort the dataset in ascending order
• Calculate the 1st and 3rd quartiles (Q1, Q3)
• Compute IQR = Q3 - Q1
• Compute lower bound = (Q1 - 1.5*IQR) and upper bound = (Q3 + 1.5*IQR)
• Loop through the values of the dataset and mark those that fall below the lower bound or above the upper bound as outliers

outliers = []
def detect_outliers_iqr(data):
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    # print(q1, q3)
    IQR = q3 - q1
    lwr_bound = q1 - (1.5 * IQR)
    upr_bound = q3 + (1.5 * IQR)
    # print(lwr_bound, upr_bound)
    for i in data:
        if (i < lwr_bound or i > upr_bound):
            outliers.append(i)
    return outliers

# Driver code
sample_outliers = detect_outliers_iqr(sample)
print("Outliers from IQR method: ", sample_outliers)


How to Handle Outliers?
• Outliers are deviated values or data points that lie so far from the other data points that they badly affect the performance of the model. Outliers can be handled with this feature engineering technique, which first identifies the outliers and then removes them.
• Removal: Outlier-containing entries are deleted from the distribution. However, if there are
outliers across numerous variables, this strategy may result in a big chunk of the datasheet being
missed.
• Replacing values: Alternatively, the outliers could be handled as missing values and replaced with
suitable imputation.
• Capping: Using an arbitrary value or a value from a variable distribution to replace the maximum
and minimum values.
• Discretization : Discretization is the process of converting continuous variables, models, and
functions into discrete ones. This is accomplished by constructing a series of continuous intervals
(or bins) that span the range of our desired variable/model/function.



Trimming/Remove the outliers
• In this technique, we remove the outliers from the dataset, although it is not a good practice to follow.

Python code to delete the outlier and copy the rest of the elements to another array.

# Trimming: remove every detected outlier from the sample
a = sample
for i in sample_outliers:
    a = np.delete(a, np.where(a == i))
print(a)
# print(len(sample), len(a))

The outlier ‘101’ is deleted and the rest of the data points are copied to another array ‘a’.



• Impute outliers:
In this case, outliers are simply considered as missing values. You can employ various imputation techniques for missing values, such as mean, median, mode, nearest neighbor, etc., to impute the values for outliers.
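A minimal sketch of median imputation for outliers, assuming a 1-D numpy array age_data and the IQR bounds described earlier (the sample values are illustrative):

import numpy as np

age_data = np.array([22, 25, 27, 30, 31, 29, 28, 120])   # 120 is an outlier
median = np.median(age_data)

q1, q3 = np.percentile(age_data, [25, 75])
iqr = q3 - q1
lwr_bound, upr_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Treat the outliers as missing values and impute them with the median
age_imputed = np.where((age_data < lwr_bound) | (age_data > upr_bound), median, age_data)
print(age_imputed)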



Separately treating
• If there are a significant number of outliers and the dataset is small, we should treat them separately in the statistical model. One approach is to treat the two groups as different groups, build an individual model for each group, and then combine the outputs. However, this technique is tedious when the dataset is large.


Feature Engineering: Variable Transformation and Variable Creation

• Feature engineering is the process of transforming raw data into features that are suitable for machine learning models.
• It is the process of selecting, extracting, and transforming the most relevant features from the available data to build more accurate and efficient machine learning models.
• Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set.


• Example: a dataset of property prices in city X shows the area of each house and its total price.
• A new column can be created to display the cost per square foot.
• The data can then be visualised.
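A minimal sketch of that derived column, with assumed column names Area_sqft and Price and made-up values:

import pandas as pd

houses = pd.DataFrame({'Area_sqft': [1000, 1500, 2400],
                       'Price': [5000000, 8250000, 10800000]})

# New feature: price paid per square foot
houses['Price_per_sqft'] = houses['Price'] / houses['Area_sqft']
print(houses)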


Imputation
• Numerical Imputation: We can use a number of strategies for imputing the values of continuous variables, such as imputing with the mean, median or mode.
• #Filling all missing values with 0
• data = data.fillna(0)

• Categorical Imputation: When dealing with categorical columns, replacing missing values with the most frequent value in the column is a smart solution.
• #Max fill function for categorical columns
• data['column_name'].fillna(data['column_name'].value_counts().idxmax(), inplace=True)


• Deleting the Columns
We can fix a certain threshold value, say 70% or 80%, and if the number of null values exceeds the threshold we may want
to delete that particular column from our training dataset.
• threshold=0.7
• dataset = dataset[dataset.columns[dataset.isnull().mean() < threshold]]
• print(dataset)

• Deleting The Rows


• dataset.dropna(axis=0, subset=['Gender'], inplace=True)
• dataset.head(10)

• Assigning A New Category To The Missing Categorical Values


• Simply deleting the values which are missing, causes loss of information. To avoid that we can also replace the missing
values with a new category. For example, we may assign ‘U’ to the missing genders where ‘U’ stands for Unknown.
• dataset['Gender']= dataset['Gender'].fillna('U')
• dataset.head(10)
Log transform
• Log transform helps in handling skewed data: it makes the distribution more approximately normal after transformation. It also reduces the effect of outliers on the data, because the normalization of magnitude differences makes the model more robust.
# Log example
df['log_price'] = np.log(df['Price'])


Binning
• In machine learning, overfitting is one of the main issues that degrade the performance of a model, and it occurs due to a large number of parameters and noisy data. One popular feature engineering technique, "binning", can be used to normalize the noisy data. This process involves segmenting the different features into bins.
• Binning is the process of converting numerical or continuous variables into multiple bins or sets of intervals. This makes the data easier to analyze and understand.
• For example, an age feature can be converted into intervals such as (0-10, 11-20, ...) or (child, young, ...), as sketched below.
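A small sketch of the age example with pandas (the bin edges and labels are illustrative):

import pandas as pd

ages = pd.DataFrame({'Age': [4, 15, 27, 39, 55, 72]})

# Numerical bins: fixed intervals
ages['Age_interval'] = pd.cut(ages['Age'], bins=[0, 10, 20, 40, 60, 100])

# Labelled bins: child / young / adult / senior
ages['Age_group'] = pd.cut(ages['Age'], bins=[0, 12, 30, 60, 100],
                           labels=['child', 'young', 'adult', 'senior'])
print(ages)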
Feature Split
• As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms better understand and learn the patterns in the dataset.
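For example, splitting a hypothetical full-name column into two new features:

import pandas as pd

people = pd.DataFrame({'Name': ['Asha Patel', 'Rohit Sharma']})

# Split one feature into two new ones
people[['First_name', 'Last_name']] = people['Name'].str.split(' ', n=1, expand=True)
print(people)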



Feature Scaling – The last step of Feature Engineering
• Feature scaling is the process of scaling or converting all the values in our dataset to a given scale.
• There are two main techniques of feature scaling:
• NORMALIZATION: Normalization is the process of scaling the data values in such a way that the value of every feature lies between 0 and 1.
• This method works well when the data does not follow a Gaussian (normal) distribution.

• STANDARDIZATION: Standardization is the process of scaling the data values in such a way that they gain the properties of a standard normal distribution, i.e. the data is rescaled so that the mean becomes zero and the data has unit standard deviation.
• Standardized values do not have a fixed bounded range like normalized values.


• Encoding Dependent Variables
• Let us now have a look at our dependent variable y.
• print(y)

• Our dependent variable y is also a categorical variable. However in this case we can simply assign 0 and 1 to the two categories ‘No’ and
‘Yes’. In this case, we do not require dummy variables to encode the ‘Predicted’ variable as it is a dependent variable that will not be used
to train the model.
• To code this, we are going to need the LabelEncoder class.
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)



Data Transformation (Variable Transformation)

• A transformation is a rescaling of the data using a function or some mathematical operation on each observation.
• When data are very strongly skewed (negatively or positively), we sometimes transform the data so that they are easier to model.
• In other words, if a variable does not fit a normal distribution, one should try a data transformation to satisfy the assumptions of a parametric statistical test.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
'Income': [15000, 1800, 120000, 10000],
'Age': [25, 18, 42, 51],
'Department': ['HR','Legal','Marketing','Management']
})

We first create a copy of our dataframe and store the numerical feature names in a list, and their values as well:
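A minimal sketch of that step, using the names (df_scaled, col_names, features) that the scaler snippets below rely on:

# Copy the frame and keep only the numerical columns for scaling
df_scaled = df.copy()
col_names = ['Income', 'Age']          # numerical feature names
features = df_scaled[col_names]        # their values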
MinMax Scaler
• The MinMax scaler is one of the simplest scalers to understand. It just scales all the data between 0 and
1. The formula for calculating the scaled value is-
• x_scaled = (x – x_min)/(x_max – x_min)

• We will first need to import it:
• from sklearn.preprocessing import MinMaxScaler
• scaler = MinMaxScaler()
• Apply it on only the values of the features:
• df_scaled[col_names] = scaler.fit_transform(features.values)


• suppose we don’t want the income or age to have values like 0. Let us take the
range to be (5, 10)
• from sklearn.preprocessing import MinMaxScaler
• scaler = MinMaxScaler(feature_range=(5, 10))
• df_scaled[col_names] = scaler.fit_transform(features.values)
• df_scaled



Standard Scaler
• Standard Scaler is another popular scaler that is very easy to understand and implement.
• For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation is 1 (and hence the variance is 1).
• x_scaled = (x – mean)/std_dev
• from sklearn.preprocessing import StandardScaler
• scaler = StandardScaler()

• df_scaled[col_names] = scaler.fit_transform(features.values)
• df_scaled


MaxAbsScaler
• In simplest terms, the MaxAbs scaler takes the absolute maximum value of each column and
divides each value in the column by the maximum value.
• Thus, it first takes the absolute value of each value in the column and then takes the maximum
value out of those. This operation scales the data between the range [-1, 1]

• df["Balance"] = [100.0, -263.0, 2000.0, -5.0]


• from sklearn.preprocessing import MaxAbsScaler
• scaler = MaxAbsScaler()

• df_scaled[col_names] = scaler.fit_transform(features.values)
• df_scaled



Robust Scaler
• If there are too many outliers in the data, they will influence the
mean and the max value or the min value. Thus, even if we scale this
data using the above methods, we cannot guarantee a balanced data
with a normal distribution.
• The Robust Scaler, as the name suggests is not sensitive to outliers.
This scaler-
• removes the median from the data
• scales the data by the InterQuartile Range(IQR)
• x_scaled = (x – median)/(Q3 – Q1)
• from sklearn.preprocessing import RobustScaler
• scaler = RobustScaler()

• df_scaled[col_names] = scaler.fit_transform(features.values)
• df_scaled



Gaussian Transformation
• logarithmic transformation
• reciprocal transformation
• square root transformation
• exponential transformation (more generally, you can use any exponent)
• Box-Cox transformation


Log Transform
• It is primarily used to convert a skewed distribution to a normal (or less skewed) distribution. In this transform, we take the log of the values in a column and use these values as the column instead. For example, with base-10 logarithms:
• log(10) = 1
• log(100) = 2, and
• log(10000) = 4.


df['log_income'] = np.log(df['Income'])   # np.log is the natural log; use np.log10 for base-10


Reciprocal Transformation:
• df['Age_reciprocal']=1/df.Age
• plot_data(df,'Age_reciprocal')



• Square Root Transformation

df['Age_square'] = df.Age**(1/2)
plot_data(df, 'Age_square')


Selecting meaningful features
• Feature selection is an important process in
machine learning and data analysis.
• It involves selecting a subset of relevant features from a larger set of available features.
• These features are also known as variables, predictors, or attributes.
• Selecting the best features helps the model perform well.



1. Wrapper Methods
• Different combinations of features are made, evaluated, and compared with other combinations. A wrapper method trains the algorithm by using subsets of features iteratively.
• Forward selection - Forward selection is an iterative process which begins with an empty set of features. After each iteration, it adds a feature and evaluates the performance to check whether it improves the performance or not. The process continues until the addition of a new variable/feature no longer improves the performance of the model (see the sketch after this list).
• Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins by considering all the features and removes the least significant feature. This elimination process continues until removing a feature does not improve the performance of the model.
• Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates each feature set by brute force. It tries every possible combination of features and returns the best performing feature set.
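A minimal forward-selection sketch using scikit-learn's SequentialFeatureSelector on the Iris data (the dataset, estimator and n_features_to_select are illustrative choices, not from the slides):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add features while the CV score improves
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction='forward')   # 'backward' = backward elimination
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features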



2. Filter Methods
• Features are selected on the basis of statistics measures.
• The advantage of using filter methods is that they need little computational time and do not overfit the data.
• Some common techniques of Filter methods are as follows:
✓ Information Gain
✓ Chi-square Test
✓ Fisher's Score
✓ Missing Value Ratio



• Information Gain: Information gain measures the reduction in entropy from transforming the dataset. It can be used as a feature selection technique by calculating the information gain of each variable with respect to the target variable.
• Chi-square Test: The chi-square test is a technique to determine the relationship between categorical variables. The chi-square value is calculated between each feature and the target variable, and the desired number of features with the best chi-square values is selected (see the sketch after this list).
• Fisher's Score: Fisher's score is one of the popular supervised techniques of feature selection. It returns the rank of the variables by Fisher's criterion in descending order. We can then select the variables with a large Fisher's score.
• Missing Value Ratio: The missing value ratio can be used to evaluate a feature against a threshold value. It is computed as the number of missing values in a column divided by the total number of observations. Variables whose ratio exceeds the threshold can be dropped.
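A small sketch of the chi-square filter with scikit-learn (Iris again as illustrative data; chi2 requires non-negative feature values):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score w.r.t. the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)        # chi-square score of each feature
print(X_selected.shape)        # (150, 2)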



3. Embedded Methods: Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost.

These are fast processing methods, similar to the filter method but more accurate than the filter method.

• Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty term is applied to the coefficients; hence it shrinks some coefficients towards zero. Features with zero coefficients can be removed from the dataset. Examples of such regularization techniques are L1 regularization (Lasso) and Elastic Net (combined L1 and L2 regularization).

• Random Forest Importance - Different tree-based methods of feature selection provide feature importances as a way of selecting features. Here, feature importance specifies which features matter more in model building or have a greater impact on the target variable. Random Forest is such a tree-based method; it is a bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their performance, i.e. the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged according to their impurity values, which allows pruning of the tree below a specific node. The remaining nodes create a subset of the most important features (see the sketch below).
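A minimal sketch of Random Forest feature importances with scikit-learn (Iris data and the 0.1 cut-off are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

# Impurity-based importances are averaged over all trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")

# Keep only the features whose importance exceeds an (arbitrary) threshold
selected = [name for name, score in zip(data.feature_names, forest.feature_importances_)
            if score > 0.1]
print("Selected:", selected)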
