Efficient Python Tricks and Tools For Data Scientists
By Khuyen Tran
Feature Engineer
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
np.bincount(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
np.bincount(y_train)
np.bincount(y_test)
If you want to keep the proportion of classes in the sample the same as the proportion of classes in the entire dataset, add stratify=y.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
np.bincount(y_train)
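To see concretely what stratification preserves, here is a minimal self-contained sketch; the imbalanced toy data is my own, not from the iris example above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# With stratify=y, both splits keep the original 80/20 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(np.bincount(y_train))  # [60 15] -> still 80/20
print(np.bincount(y_test))   # [20  5] -> still 80/20
```

Without stratify, a small or imbalanced test split can easily drift away from the original class ratio.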
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# Create a toy dataset with some redundant (correlated) features
X, y = make_classification(
    n_samples=1000, n_features=6, n_redundant=2, random_state=0
)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(6)])
X.columns
tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)
Xt = tr.fit_transform(X)
tr.correlated_feature_sets_
Xt.columns
Link to feature-engine.
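Under the hood, DropCorrelatedFeatures computes pairwise correlations and drops each feature that is highly correlated with one kept earlier. A rough pandas-only sketch of the same idea (the data, column names, and threshold here are illustrative, not feature-engine's internals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.01, size=100),  # near-duplicate of "a"
    "c": rng.normal(size=100),                     # independent column
})

# Absolute pairwise Pearson correlations, upper triangle only,
# so each pair of columns is inspected exactly once
corr = df.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above the threshold with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)  # ['b']
```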
Similarity Encoding for Dirty Categories Using dirty_cat
$ pip install dirty-cat
To capture the similarities among dirty categories when encoding categorical variables, use dirty_cat's SimilarityEncoder.
To understand how SimilarityEncoder works, let's start with the employee_salaries dataset.
from dirty_cat.datasets import fetch_employee_salaries

X = fetch_employee_salaries().X
X.head(10)
dirty_column = "employee_position_title"
X_dirty = X[dirty_column].values
X_dirty[:7]
We can see that titles such as 'Master Police Officer' and 'Police Officer III' are similar. We can use SimilarityEncoder to encode these categories while capturing their similarities.
from dirty_cat import SimilarityEncoder

enc = SimilarityEncoder(similarity="ngram")
X_enc = enc.fit_transform(X_dirty[:10].reshape(-1, 1))
X_enc
Cool! Let's create a heatmap to visualize the pairwise similarities between the encoded categories.
def encode_and_plot(labels):
    enc = SimilarityEncoder(similarity="ngram")  # Encode
    X_enc = enc.fit_transform(labels.reshape(-1, 1))
    plot_similarity(labels, X_enc)  # Plot
encode_and_plot(X_dirty[:10])
As we can see from the matrix above:
- The similarity between identical strings such as 'Office Services Coordinator' and 'Office Services Coordinator' is 1.
- The similarity between somewhat similar strings such as 'Office Services Coordinator' and 'Master Police Officer' is 0.41.
- The similarity between two very different strings such as 'Social Worker IV' and 'Police Aide' is 0.028.
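The intuition behind n-gram similarity can be sketched with a simple Jaccard overlap of character trigrams. Note that dirty_cat's actual ngram similarity uses a different, count-based formula, so the numbers below will not match the matrix above exactly:

```python
def ngrams(s: str, n: int = 3) -> set:
    s = f" {s.lower()} "  # pad so word boundaries contribute n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    # Jaccard overlap: shared n-grams divided by all n-grams
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(ngram_similarity("Office Services Coordinator", "Office Services Coordinator"))  # 1.0
print(ngram_similarity("Master Police Officer", "Police Officer III"))  # fairly high
print(ngram_similarity("Social Worker IV", "Police Aide"))  # close to 0
```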
Link to dirty-cat.
Imagine you try to determine whether a job posting is fake or not. You come up with some
assumptions about a fake job posting, such as:
- If a job posting has few to no descriptions about the requirements, it is likely to be fake.
- If a job posting does not include any company profile or logo, it is likely to be fake.
- If the job posting requires some sort of education or experience, it is likely to be real.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
train_df = pd.read_pickle(
    "https://github.com/khuyentran1401/Data-science/blob/master/feature_engineering/snorkel_example/train_fake_jobs.pkl?raw=true"
)
train_df.head(5)
(Output: the first rows of train_df, with columns job_id, title, location, department, salary_range, company_profile, description, ...)
How do you test which of these features are the most accurate in predicting fraud?
That is when Snorkel comes in handy. Snorkel is an open-source Python library for
programmatically building training datasets without manual labeling.
To learn how Snorkel works, start by giving a meaningful name to each label value:
FAKE = 1
REAL = 0
ABSTAIN = -1
We assume that:
from snorkel.labeling import labeling_function


@labeling_function()
def no_company_profile(x: pd.Series):
    return FAKE if x.company_profile == "" else ABSTAIN


@labeling_function()
def no_company_logo(x: pd.Series):
    return FAKE if x.has_company_logo == 0 else ABSTAIN


@labeling_function()
def required_experience(x: pd.Series):
    return REAL if x.required_experience else ABSTAIN


@labeling_function()
def required_education(x: pd.Series):
    return REAL if x.required_education else ABSTAIN
ABSTAIN (-1) tells Snorkel not to draw any conclusion about an instance that doesn't satisfy the condition.
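To make the abstain behavior concrete, here is what one of the functions above returns on two hand-made rows (written as a plain function, without the decorator, so it runs without Snorkel installed):

```python
import pandas as pd

FAKE, REAL, ABSTAIN = 1, 0, -1

def no_company_logo(x: pd.Series) -> int:
    # Vote FAKE when the posting has no logo; otherwise abstain
    return FAKE if x.has_company_logo == 0 else ABSTAIN

row_without_logo = pd.Series({"has_company_logo": 0})
row_with_logo = pd.Series({"has_company_logo": 1})

print(no_company_logo(row_without_logo))  # 1  (FAKE)
print(no_company_logo(row_with_logo))     # -1 (ABSTAIN)
```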
Next, we will use each of these labeling functions to label our training dataset:
lfs = [
no_company_profile,
no_company_logo,
required_experience,
required_education,
]
from snorkel.labeling import PandasLFApplier

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train_df)
100%|██████████████████████████████████████████████| 13410/13410 [00:02<00:00, 5849.25it/s]
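The result, L_train, is a label matrix with one row per training example and one column per labeling function. A self-contained sketch of what the applier essentially computes (toy rows of my own making, with plain functions instead of decorated ones):

```python
import numpy as np
import pandas as pd

FAKE, REAL, ABSTAIN = 1, 0, -1

def no_company_logo(x):
    return FAKE if x.has_company_logo == 0 else ABSTAIN

def required_experience(x):
    return REAL if x.required_experience else ABSTAIN

toy_df = pd.DataFrame({
    "has_company_logo": [0, 1],
    "required_experience": ["Mid-Senior level", ""],
})

# One row per example, one column per labeling function
L = np.array([
    [lf(row) for lf in (no_company_logo, required_experience)]
    for _, row in toy_df.iterrows()
])
print(L)
# [[ 1  0]
#  [-1 -1]]
```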
Now that we have created the labels using each labeling function, we can use LFAnalysis to
determine the accuracy of these labels.
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=train_df.fraudulent.values)
(Output: a summary table with one row per labeling function and columns j, Polarity, Coverage, Overlaps, Conflicts, Correct, Incorrect, Emp. Acc.)
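Two of lf_summary's key columns, Coverage and Emp. Acc., can roughly be computed by hand as follows (the tiny label matrix below is my own toy example, not the book's data):

```python
import numpy as np

# 3 examples x 2 labeling functions; -1 means the function abstained
L = np.array([[1, 0],
              [-1, -1],
              [1, -1]])
Y = np.array([1, 0, 0])  # gold labels

# Coverage: fraction of examples where the LF voted (did not abstain)
coverage = (L != -1).mean(axis=0)

# Empirical accuracy: among the LF's votes, fraction agreeing with gold
emp_acc = [
    (L[L[:, j] != -1, j] == Y[L[:, j] != -1]).mean()
    for j in range(L.shape[1])
]
print(coverage)  # approx [0.67, 0.33]
print(emp_acc)   # [0.5, 0.0]
```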
Link to Snorkel.