
EDA 2

• Encoding Methods - OHE, Label Encoders
• Outlier Detection - Isolation Forest
• Calculating the Predictive Power Score (PPS)
Handling Text and Categorical Attributes
In machine learning, depending on the problem definition, the data set might contain text or categorical values that are not numerical features.
For example, a marital status feature can have values like married, single and divorced, and a gender feature will have Male or Female.

Most machine learning algorithms prefer to work with numbers, so these non-numerical columns need to be treated before applying the algorithms to the dataset.

There are predominantly two methods for this:

• One-Hot Encoding
• Label Encoding

LabelEncoder and OneHotEncoder are part of the scikit-learn library in Python, and they are used to convert categorical (text) data into numbers, which our predictive models can better understand.
Label Encoding

Label encoding assigns an integer code to each category of a categorical variable.

In the dataset, country is a categorical variable; the country names will be replaced by numbers (0, 1, 2) when we apply label encoding.

from sklearn.preprocessing import LabelEncoder

# Replace the categories in the first column with integer codes
labelencoder = LabelEncoder()
data[:, 0] = labelencoder.fit_transform(data[:, 0])

The problem here is that, since the same column now contains different numbers, the model may misinterpret the data as having some kind of order, 0 < 1 < 2. But this isn't the case at all. To overcome this problem, we use the One-Hot Encoder.

• If the target column is categorical, we use sklearn's LabelEncoder
• If the feature column is categorical, we use sklearn's OneHotEncoder
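A minimal sketch of this rule, assuming a hypothetical toy DataFrame with a categorical feature column country and a categorical target column purchased:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain"],   # feature
    "purchased": ["yes", "no", "yes", "no"],              # target
})

# Target column -> LabelEncoder: one integer per class
y = LabelEncoder().fit_transform(df["purchased"])

# Feature column -> OneHotEncoder: one binary column per category
X = OneHotEncoder().fit_transform(df[["country"]]).toarray()

print(y)
print(X)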
One Hot Encoding

Categorical variables can be converted to numerical variables using a method called one-hot encoding, which creates one binary (0/1) column per category. In pandas this is done with:

pd.get_dummies(df)
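For example, a small sketch with a hypothetical country column:

import pandas as pd

df = pd.DataFrame({"country": ["France", "Spain", "Germany", "Spain"]})

# One binary (0/1) column per category of "country"
dummies = pd.get_dummies(df, columns=["country"])
print(dummies)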
Isolation Forest for Outlier Detection
A lot of machine learning algorithms suffer in terms of performance when outliers are not treated. To avoid this problem you could, for example, drop outliers from your sample, cap the values at some reasonable point (based on domain knowledge), or transform the data.

The Isolation Forest algorithm identifies outliers/anomalies in a multidimensional feature space.


Isolation Forest is built based on decision trees.
In these trees, partitions are created by first randomly selecting a feature and then selecting a random split
value between the minimum and maximum value of the selected feature.

Generally, outliers are fewer in number than normal observations and differ from them in their values, so they lie far away from the rest of the data points in the feature space.

That is why, with such random partitioning, they tend to be isolated closer to the root of the tree (a shorter average path length, i.e., the number of edges an observation passes in the tree from the root to the terminal node), with fewer splits necessary.
Step 1 — Sampling for Training
⮚ Sample the data for the model training.
⮚ The sampling proportion can differ depending on the underlying data set
Step 2 — Binary Decision Tree
⮚ Build a decision tree based on the data we have sampled
⮚ Randomly select a feature (e.g., Q1 or Q2) and a split value between the minimum and maximum of that feature in the sampled data
Step 3 — Repeat Step 2 Iteratively
⮚ Repeat step 2 for each of the two sub-sets created by the binary split
⮚ “Fewer and different” data points are isolated more quickly, such as a point sitting in the far lower-right corner of the feature space
⮚ In other words, it takes a shorter path for them to be isolated
⮚ Do this iteratively to create a forest, i.e., a collection of trees
Step 4 — Feeding the Data Set and Calculating the Anomaly Score
• Feed each data point through every tree of the trained forest and compute its anomaly score

•Anomaly score is defined as:
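Assuming the standard definition from the original Isolation Forest paper (Liu et al., 2008), the anomaly score of a point x over a sample of size n is:

s(x, n) = 2 ^ ( - E(h(x)) / c(n) )

where h(x) is the path length of x in a single tree, E(h(x)) is its average over all trees, and c(n) is the average path length of an unsuccessful search in a binary search tree with n nodes, used as a normalizing constant.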

We calculate this anomaly score for each tree, average it across the different trees, and obtain the final anomaly score of the entire forest for a given data point.

Mathematically, an outlier gets a score close to 1, while a value around 0.5 or lower indicates a normal data point.

In the sklearn library, a predicted value of -1 indicates an outlier and 1 indicates a normal data point.
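A minimal sketch with scikit-learn, assuming a small synthetic 2-D data set:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # normal points
X_outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # far-away points
X = np.vstack([X_normal, X_outliers])

# contamination is the expected share of outliers (an assumption for this toy data)
iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
pred = iso.fit_predict(X)          # -1 = outlier, 1 = normal
scores = iso.decision_function(X)  # lower (more negative) = more anomalous

print("points flagged as outliers:", int((pred == -1).sum()))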
Correlation using Predictive Power Score (PPS)
Correlation Coefficient

What is the correlation coefficient for the association shown in the figure?

If the relationship between the parameters is linear, we can use the correlation coefficient to find the strength of the association.

What if the relationship is non-linear, Gaussian, or unknown? How do we find the association then?
What if we want to detect relationships between cities and zip codes?

The expectation is that, irrespective of the form of the relationship between the parameters, we would like a score that is 0 when there is no relationship and 1 when there is a perfect relationship.

The score should also be able to handle both categorical and numeric columns.
Predictive Power Score (PPS)

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear
relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect
predictive power). It can be used as an alternative to the correlation (matrix).

Let's say we have two columns, A and B, and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We then train a cross-validated Decision Tree and calculate a suitable evaluation metric. When the target is numeric, we use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE). When the target is categorical, we use a Decision Tree Classifier and calculate the weighted F1 score.
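A rough sketch of this calculation for a categorical target, using hypothetical columns A (feature) and B (target):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: A is the single feature, B is the categorical target
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "B": ["x", "x", "x", "x", "y", "y", "y", "y", "z", "z", "z", "z"],
})

# Weighted F1 of a cross-validated Decision Tree (F1_model in the text)
f1_model = cross_val_score(DecisionTreeClassifier(random_state=0),
                           df[["A"]], df["B"],
                           scoring="f1_weighted", cv=4).mean()
print(f1_model)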
Example:

Zip codes and city names:

Calculate the PPS of zip code to city. The weighted F1 score is used because city is categorical. Let's say the cross-validated Decision Tree Classifier achieves an F1 score of 0.95.

Then calculate a baseline score by always predicting the most common city, which achieves an F1 score of 0.1 (the score of a naive model that simply predicts the mode of the dependent variable).

Normalizing gives a final PPS of 0.94 after applying the normalization formula: (0.95 - 0.1) / (1 - 0.1) ≈ 0.94.

As we can see, a PPS score of 0.94 is rather high, so the zip code seems to have a good predictive
power towards the city. However, if we calculate the PPS in the opposite direction, we might achieve a
PPS of close to 0 because the Decision Tree Classifier is not substantially better than just always
predicting the most common zip code.
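In practice, this calculation is packaged in the open-source ppscore library; a minimal sketch, assuming its score() and matrix() helpers and a hypothetical zip code / city DataFrame:

import pandas as pd
import ppscore as pps

# Hypothetical data: each zip code belongs to exactly one city,
# but each city has several zip codes
df = pd.DataFrame({
    "zip_code": ["10001", "10002", "10003", "10004",
                 "94105", "94107", "94110", "94112"],
    "city": ["New York"] * 4 + ["San Francisco"] * 4,
})

# PPS is asymmetric: zip_code -> city and city -> zip_code can differ
print(pps.score(df, "zip_code", "city")["ppscore"])
print(pps.matrix(df))   # pairwise PPS for all column pairs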
Comparing the PPS to Correlation
For the non-linear association shown in the figure, the correlation is 0, both from x to y and from y to x, because correlation is symmetric.

However, the PPS from x to y is 0.67, detecting the non-linear relationship, while the PPS from y to x is 0 because the prediction cannot be better than the naive baseline, and thus the score is 0.

That is because if y is 4, the decision tree cannot tell whether x was roughly 2 or -2.
For Regression:

In the case of regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = 1 - (MAE_model / MAE_naive)


For Classification:

If the task is classification, we compute the weighted F1 score as the underlying evaluation metric (F1_model). The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contribution of precision and recall to the F1 score is equal.

The weighted F1 takes into account the precision and recall of all classes, weighted by their support. As a baseline score, we calculate the weighted F1 score of a naive model (F1_naive) that always predicts the most common class of the target column. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = (F1_model - F1_naive) / (1 - F1_naive)
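A small sketch of these two normalizations as plain Python helpers (hypothetical function names):

def pps_regression(mae_model, mae_naive):
    # PPS = 1 - MAE_model / MAE_naive, floored at 0
    return max(0.0, 1.0 - mae_model / mae_naive)

def pps_classification(f1_model, f1_naive):
    # PPS = (F1_model - F1_naive) / (1 - F1_naive), floored at 0
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

# The zip code -> city example from the earlier slide:
print(pps_classification(0.95, 0.1))   # ≈ 0.94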


Thank you
