
EDA 2

• Encoding Methods - OHE, Label Encoders
• Outlier Detection - Isolation Forest
• Calculating the Predictive Power Score (PPS)
Handling Text and Categorical Attributes
In machine learning, depending on the problem definition, the data set might contain text or categorical values that are not numerical features.
For example, a marital status feature can have values like married, single and divorced, and a gender feature will have Male or Female.

Most machine learning algorithms prefer to work with numbers, so these non-numerical columns need to be treated before applying the algorithms to the dataset.

There are predominantly two methods for this:

• One-Hot Encoding
• Label Encoding

LabelEncoder and OneHotEncoder are part of the scikit-learn library in Python, and they are used to convert categorical (text) data into numbers, which our predictive models can better understand.
Label Encoding

Label encoding assigns an integer code to each category of a categorical variable.

In the dataset, country is a categorical variable; the country names will be replaced by numbers (0, 1, 2) when we apply label encoding.

from sklearn.preprocessing import LabelEncoder

# Replace the categories in the first column with integer codes
labelencoder = LabelEncoder()
data[:, 0] = labelencoder.fit_transform(data[:, 0])

The problem here is that, since the same column now contains different numbers, the model may misinterpret the data as having some kind of order, 0 < 1 < 2. But this isn't the case at all. To overcome this problem, we use the One-Hot Encoder.

• If the target column is categorical, we use sklearn's LabelEncoder
• If the feature column is categorical, we use sklearn's OneHotEncoder
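A minimal sketch of this rule, assuming a hypothetical toy DataFrame with a categorical feature column country and a categorical target column purchased:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain"],   # feature
    "purchased": ["yes", "no", "yes", "no"],              # target
})

# Target column -> LabelEncoder: one integer per class
y = LabelEncoder().fit_transform(df["purchased"])

# Feature column -> OneHotEncoder: one binary column per category
X = OneHotEncoder().fit_transform(df[["country"]]).toarray()

print(y)
print(X)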
One Hot Encoding

Categorical variables can be converted to numerical variables using a method called one-hot encoding, which creates one binary (0/1) column per category. In pandas this is done with:

pd.get_dummies(df)
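For example, a small sketch with a hypothetical country column:

import pandas as pd

df = pd.DataFrame({"country": ["France", "Spain", "Germany", "Spain"]})

# One binary (0/1) column per category of "country"
dummies = pd.get_dummies(df, columns=["country"])
print(dummies)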
Isolation Forest for Outlier Detection
A lot of machine learning algorithms suffer in terms of performance when outliers are not treated. To avoid this problem you could, for example, drop outliers from your sample, cap the values at some reasonable point (based on domain knowledge), or transform the data.

The Isolation Forest algorithm identifies outliers/anomalies in a multidimensional feature space.


Isolation Forest is built based on decision trees.
In these trees, partitions are created by first randomly selecting a feature and then selecting a random split
value between the minimum and maximum value of the selected feature.

Generally, outliers are fewer in number than normal observations and differ from them in their values, so they lie far away from the rest of the data points in the feature space.

That is why, with such random partitioning, they tend to be isolated closer to the root of the tree (a shorter average path length, i.e., the number of edges an observation passes in the tree from the root to the terminal node), with fewer splits necessary.
Step 1 — Sampling for Training
⮚ Sample the data for the model training.
⮚ The sampling proportion can differ depending on the underlying data set
Step 2 — Binary Decision Tree
⮚ Build a decision tree based on the data we have sampled
⮚ Randomly select a feature (e.g., Q1 or Q2) and a split value between the minimum and maximum of that feature in the sampled data
Step 3 — Repeat Step 2 Iteratively
⮚ Repeat step 2 for each of the two sub-sets created by the binary split
⮚ “Fewer and different” data points are isolated more quickly, such as a point sitting in the far lower-right corner of the feature space
⮚ In other words, it takes a shorter path for them to be isolated
⮚ Do this iteratively to create a forest, i.e., a collection of trees
Step 4 — Feeding the Data Set and Calculating the Anomaly Score
• Feed each data point through every tree of the trained forest and compute its anomaly score

•Anomaly score is defined as:
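Assuming the standard definition from the original Isolation Forest paper (Liu et al., 2008), the anomaly score of a point x over a sample of size n is:

s(x, n) = 2 ^ ( - E(h(x)) / c(n) )

where h(x) is the path length of x in a single tree, E(h(x)) is its average over all trees, and c(n) is the average path length of an unsuccessful search in a binary search tree with n nodes, used as a normalizing constant.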

We calculate this anomaly score for each tree, average it across the different trees, and obtain the final anomaly score of the entire forest for a given data point.

Mathematically, an outlier gets a score close to 1, while a value around 0.5 or lower indicates a normal data point.

In the sklearn library, a predicted value of -1 indicates an outlier and 1 indicates a normal data point.
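A minimal sketch with scikit-learn, assuming a small synthetic 2-D data set:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal points
X_outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # far-away points
X = np.vstack([X_normal, X_outliers])

# contamination is the expected share of outliers (an assumption for this toy data)
iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
pred = iso.fit_predict(X)          # -1 = outlier, 1 = normal
scores = iso.decision_function(X)  # lower (more negative) = more anomalous

print("points flagged as outliers:", int((pred == -1).sum()))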
Correlation using Predictive Power Score (PPS)
Correlation Coefficient

What is the correlation coefficient for the association shown in the figure?

If the relationship between the parameters is linear, we can use the correlation coefficient to find the strength of the association.

What if the relationship is non-linear, Gaussian, or unknown? How do we find the association then?
What if we want to detect relationships between cities and zip codes?

The expectation is that, irrespective of the form of the relationship between the parameters, we would like a score that is 0 when there is no relationship and 1 when there is a perfect relationship.

The score should also be able to handle both categorical and numeric columns.
Predictive Power Score (PPS)

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear
relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect
predictive power). It can be used as an alternative to the correlation (matrix).

Let's say we have two columns, A and B, and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We then train a cross-validated Decision Tree and calculate a suitable evaluation metric. When the target is numeric, we use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE). When the target is categorical, we use a Decision Tree Classifier and calculate the weighted F1 score.
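A rough sketch of this calculation for a categorical target, using hypothetical columns A (feature) and B (target):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: A is the single feature, B is the categorical target
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "B": ["x", "x", "x", "x", "y", "y", "y", "y", "z", "z", "z", "z"],
})

# Weighted F1 of a cross-validated Decision Tree (F1_model in the text)
f1_model = cross_val_score(DecisionTreeClassifier(random_state=0),
                           df[["A"]], df["B"],
                           scoring="f1_weighted", cv=4).mean()
print(f1_model)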
Example:

Zip codes and city names:

Calculate the PPS of zip code to city. The weighted F1 score is used because city is categorical. Let's say the cross-validated Decision Tree Classifier achieves an F1 score of 0.95.

Then calculate a baseline score by always predicting the most common city, which achieves an F1 score of 0.1 (the score of a naive model that simply predicts the mode of the dependent variable).

Normalizing gives a final PPS of 0.94 after applying the normalization formula: (0.95 - 0.1) / (1 - 0.1) ≈ 0.94.

As we can see, a PPS score of 0.94 is rather high, so the zip code seems to have a good predictive
power towards the city. However, if we calculate the PPS in the opposite direction, we might achieve a
PPS of close to 0 because the Decision Tree Classifier is not substantially better than just always
predicting the most common zip code.
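In practice, this calculation is packaged in the open-source ppscore library; a minimal sketch, assuming its score() and matrix() helpers and a hypothetical zip code / city DataFrame:

import pandas as pd
import ppscore as pps

# Hypothetical data: each zip code belongs to exactly one city,
# but each city has several zip codes
df = pd.DataFrame({
    "zip_code": ["10001", "10002", "10003", "10004",
                 "94105", "94107", "94110", "94112"],
    "city": ["New York"] * 4 + ["San Francisco"] * 4,
})

# PPS is asymmetric: zip_code -> city and city -> zip_code can differ
print(pps.score(df, "zip_code", "city")["ppscore"])
print(pps.matrix(df))   # pairwise PPS for all column pairs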
Comparing the PPS to Correlation
For the non-linear association shown in the figure, the correlation is 0, both from x to y and from y to x, because correlation is symmetric.

However, the PPS from x to y is 0.67, detecting the non-linear relationship, while the PPS from y to x is 0 because the prediction cannot be better than the naive baseline, and thus the score is 0.

That is because if y is 4, the decision tree cannot tell whether x was roughly 2 or -2.
For Regression:

In the case of regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = 1 - (MAE_model / MAE_naive)


For Classification:

If the task is classification, we compute the weighted F1 score as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contribution of precision and recall to the F1 score is equal.

The weighted F1 takes into account the precision and recall of all classes, weighted by their support. As a baseline score, we calculate the weighted F1 score of a naive model (F1_naive) that always predicts the most common class of the target column. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = (F1_model - F1_naive) / (1 - F1_naive)
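A small sketch of these two normalizations as plain Python helpers (hypothetical function names):

def pps_regression(mae_model, mae_naive):
    # PPS = 1 - MAE_model / MAE_naive, floored at 0
    return max(0.0, 1.0 - mae_model / mae_naive)

def pps_classification(f1_model, f1_naive):
    # PPS = (F1_model - F1_naive) / (1 - F1_naive), floored at 0
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

# The zip code -> city example from the earlier slide:
print(pps_classification(0.95, 0.1))   # ≈ 0.94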


Thank you
