EDA 2
Most machine learning algorithms prefer to work with numbers, so before applying an algorithm to the dataset, these non-numerical columns need to be treated.
Two common tools for this are the Label Encoder and the One Hot Encoder. Both are part of the scikit-learn library in Python, and they are used to convert categorical (text) data into numbers that our predictive models can understand better.
Label Encoding
Label encoding assigns an integer to each category of a categorical variable.
The problem is that, because different numbers now appear in the same column, the model can misinterpret the data as having some kind of order, 0 < 1 < 2. But this isn’t the case at all. To overcome this problem, we use the One Hot Encoder.
In pandas, one hot encoding can also be done with pd.get_dummies(df).
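As an illustration, here is a minimal sketch using scikit-learn's LabelEncoder together with pandas' get_dummies; the "city" column and its values are made up for the example.

# Minimal sketch: encoding a hypothetical categorical column "city"
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# Label encoding: each category becomes an integer, which implies a false order
le = LabelEncoder()
df["city_label"] = le.fit_transform(df["city"])

# One hot encoding: one binary column per category, no order implied
df_onehot = pd.get_dummies(df["city"], prefix="city")

print(df[["city", "city_label"]])
print(df_onehot)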
Isolation Forest for Outlier Detection
A lot of machine learning algorithms suffer in terms of performance when outliers are not treated. To avoid this kind of problem you could, for example, drop them from your sample, cap the values at some reasonable point (based on domain knowledge), or transform the data.
Generally, outliers are fewer in number than normal observations and differ from them in terms of values. They are placed far away from the rest of the data points in the feature space.
That is why, under such random partitioning, they should be identified closer to the root of the tree (a shorter average path length, i.e., the number of edges an observation must pass in the tree going from the root to the terminal node), with fewer splits necessary.
Step 1 — Sampling for Training
⮚ Sample the data for the model training.
⮚ Depending on the underlying data set, the sampling proportion can differ.
Step 2 — Binary decision tree
⮚ Build a decision tree based on the data we have sampled.
⮚ Randomly select a feature (Q1 or Q2) and a value that lies between the min and max of that feature in the sampled data.
Step 3 — Repeat step 2 iteratively
⮚ Repeat step 2 for each of the two sub-datasets created by the binary split from step 2.
⮚ “Fewer and different” data points are isolated quicker, such as the data point at the very lower right corner.
⮚ In other words, it takes a shorter path (fewer splits) for them to be isolated.
⮚ Do this iteratively to create a forest, a collection of trees (a rough sketch of the isolation idea follows below).
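As a rough sketch of this isolation idea (not the library implementation), the following recursion applies random splits until a chosen point is alone and returns its path length; averaged over several trees, the outlier typically needs fewer splits than a normal point.

# Rough sketch: isolating one point with random splits (illustration only)
import numpy as np

rng = np.random.default_rng(42)

def isolation_path_length(X, target, depth=0):
    # Recursively apply step 2 until the target point is the only one left
    if len(X) <= 1:
        return depth
    feature = rng.integers(X.shape[1])                  # random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                         # random split value between min and max
    mask = X[:, feature] < split
    side = mask if target[feature] < split else ~mask   # keep the half that contains the target
    return isolation_path_length(X[side], target, depth + 1)

# 200 normal points around the origin plus one obvious outlier
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, -8.0]]])

trees = 50
print("avg path of outlier:", np.mean([isolation_path_length(X, X[-1]) for _ in range(trees)]))
print("avg path of normal point:", np.mean([isolation_path_length(X, X[0]) for _ in range(trees)]))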
Step 4 — Feeding data set and calculating anomaly score
• Feed each data point into the trained forest and compute an anomaly score in each tree.
We calculate this anomaly score for each tree and average it across the different trees to get the final anomaly score of the entire forest for a given data point.
Mathematically, an outlier gets a score closer to 1, while a value around 0.5 or lower indicates a normal data point.
In scikit-learn's implementation, a predicted value of -1 indicates an outlier and 1 indicates a normal data point.
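A minimal usage sketch with scikit-learn's IsolationForest on made-up data (the parameter values here are only illustrative):

# Minimal sketch: outlier detection with scikit-learn's IsolationForest
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = iso.fit_predict(X)        # -1 = outlier, 1 = normal data point
scores = iso.decision_function(X)  # lower (more negative) = more anomalous

print("points flagged as outliers:", int((labels == -1).sum()))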
Correlation using Predictive Power Score (PPS)
Correlation Coefficient
Ideally, we would like a score that captures the relationship between two columns irrespective of its form: the score should be 0 if there is no relationship and 1 if there is a perfect relationship.
The score should also be able to handle both categorical and numeric columns.
Predictive Power Score (PPS)
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear
relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect
predictive power). It can be used as an alternative to the correlation (matrix).
Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now fit a cross-validated Decision Tree and calculate a suitable evaluation metric. When the target is numeric we use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE). When the target is categorical, we use a Decision Tree Classifier and calculate the weighted F1 score.
Example:
Suppose a Decision Tree Classifier predicting the city from the zip code achieves a weighted F1 score of 0.95. Then calculate a baseline score by always predicting the most common city, which achieves a weighted F1 of 0.1 (the score of a naive model that predicts the mode of the dependent variable).
Normalizing the score gives a final PPS of 0.94 after applying the following normalization formula: (0.95 – 0.1) / (1 – 0.1) ≈ 0.94.
As we can see, a PPS score of 0.94 is rather high, so the zip code seems to have good predictive power towards the city. However, if we calculate the PPS in the opposite direction, we might achieve a PPS close to 0, because the Decision Tree Classifier is not substantially better than just always predicting the most common zip code.
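A usage sketch with the ppscore package, which provides pps.score and pps.matrix; the zip code/city data below is synthetic and only meant to mirror the example above, so the exact scores will differ.

# Sketch: computing the PPS with the ppscore package (pip install ppscore)
import numpy as np
import pandas as pd
import ppscore as pps

rng = np.random.default_rng(0)
zip_code = rng.integers(100, 200, size=500).astype(str)         # many distinct zip codes
city = np.where(zip_code.astype(int) < 150, "Delhi", "Mumbai")  # city is fully determined by the zip code
df = pd.DataFrame({"zip_code": zip_code, "city": city})

print(pps.score(df, "zip_code", "city")["ppscore"])  # high: zip code predicts city
print(pps.score(df, "city", "zip_code")["ppscore"])  # much lower in the opposite direction
print(pps.matrix(df))                                # pairwise PPS for every column pair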
Comparing the PPS to correlation
For the nonlinear association below, the correlation is 0, both from x to y and from y to x, because correlation is symmetric.
However, the PPS from x to y is 0.67, detecting the non-linear relationship, while the PPS from y to x is 0, because your prediction cannot be better than the naive baseline and thus the score is 0.
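A sketch of this comparison, assuming a quadratic relationship of the kind described; the exact PPS values will differ from the 0.67 above, which comes from the original example data.

# Sketch: symmetric correlation vs. asymmetric PPS on a non-linear relationship
import numpy as np
import pandas as pd
import ppscore as pps

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=1000)
df = pd.DataFrame({"x": x, "y": x ** 2 + rng.normal(0, 0.05, size=1000)})

print(df.corr())                            # Pearson correlation is near 0 in both directions
print(pps.score(df, "x", "y")["ppscore"])   # well above 0: x predicts y
print(pps.score(df, "y", "x")["ppscore"])   # near 0: y cannot beat the naive baseline for x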
In the case of regression, the ppscore uses the Mean Absolute Error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and never smaller than 0): PPS = 1 – (MAE_model / MAE_naive).
The weighted F1 takes into account the precision and recall of all classes, weighted by their support. As a baseline score, we calculate the weighted F1 score of a naive model (F1_naive) that always predicts the most common class of the target column. The PPS is the result of the following normalization (and never smaller than 0): PPS = (F1_model – F1_naive) / (1 – F1_naive).
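As a rough sketch of these two normalizations, using hypothetical score values rather than a fitted model:

# Rough sketch of the two normalizations used by the PPS
def pps_regression(mae_model, mae_naive):
    # MAE: 0 is the best possible value, higher is worse; the score is clipped at 0
    return max(0.0, 1.0 - mae_model / mae_naive)

def pps_classification(f1_model, f1_naive):
    # Weighted F1: the baseline always predicts the most common class
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

print(pps_classification(0.95, 0.1))  # the zip code -> city example: about 0.94
print(pps_regression(12.0, 30.0))     # hypothetical MAE values: 0.6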