
Efficient Python Tricks and Tools for Data Scientists

By Khuyen Tran

Feature Engineering

This section covers some libraries for feature engineering.


Split Data in a Stratified Fashion in scikit-learn
By default, scikit-learn's train_test_split does not preserve class proportions: the proportion of each class in the train and test sets can differ from its proportion in the entire dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)
np.bincount(y)

array([50, 50, 50])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Get count of each class in the train set
np.bincount(y_train)

array([37, 34, 41])

# Get count of each class in the test set
np.bincount(y_test)

array([13, 16, 9])

If you want to keep the proportion of classes in the sample the same as the proportion of classes in
the entire dataset, add stratify=y.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

np.bincount(y_train)

array([37, 37, 38])


np.bincount(y_test)

array([13, 13, 12])
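
To double-check that stratification preserved the class balance, you can compare the class proportions directly. A quick sanity check using only NumPy; the printed values follow from the counts shown above:

np.bincount(y) / len(y)              # array([0.333..., 0.333..., 0.333...])
np.bincount(y_train) / len(y_train)  # roughly [0.33, 0.33, 0.34]
np.bincount(y_test) / len(y_test)    # roughly [0.34, 0.34, 0.32]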


Drop Correlated Features
$ pip install feature_engine

If you want to remove correlated variables from a DataFrame, use feature_engine.selection.DropCorrelatedFeatures.

import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# make dataframe with some correlated variables
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_redundant=3,
    n_clusters_per_class=1,
    class_sep=2,
    random_state=0,
)

# transform the array into a pandas DataFrame
colnames = ["var_" + str(i) for i in range(6)]
X = pd.DataFrame(X, columns=colnames)

X.columns

Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5'], dtype='object')

X[["var_0", "var_1", "var_2"]].corr()

          var_0     var_1     var_2
var_0  1.000000  0.938936  0.874845
var_1  0.938936  1.000000  0.654745
var_2  0.874845  0.654745  1.000000
Drop the variables with a correlation above 0.8.

tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)

Xt = tr.fit_transform(X)

tr.correlated_feature_sets_

[{'var_0', 'var_1', 'var_2'}]

Xt.columns

Index(['var_0', 'var_3', 'var_4', 'var_5'], dtype='object')
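
To see exactly which columns were removed, the fitted transformer also stores them; a quick check, assuming the features_to_drop_ attribute available in recent feature_engine releases:

tr.features_to_drop_  # expected to contain 'var_1' and 'var_2', since 'var_0' is kept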

Link to feature-engine.
Similarity Encoding for Dirty Categories Using dirty_cat
$ pip install dirty-cat

To capture the similarities among dirty categories when encoding categorical variables, use dirty_cat's SimilarityEncoder.

To understand how SimilarityEncoder works, let's start with the employee_salaries dataset.

from dirty_cat.datasets import fetch_employee_salaries
from dirty_cat import SimilarityEncoder

X = fetch_employee_salaries().X
X.head(10)

[Output: the first 10 rows of the employee_salaries dataset, with columns gender, department, department_name, division, assignment_category, and employee_position_title]

dirty_column = "employee_position_title"
X_dirty = X[dirty_column].values
X_dirty[:7]

array(['Office Services Coordinator', 'Master Police Officer',
       'Social Worker IV', 'Resident Supervisor II',
       'Planning Specialist III', 'Police Officer III',
       'Accountant/Auditor II'], dtype=object)

We can see that titles such as 'Master Police Officer' and 'Police Officer III' are similar. We can use SimilarityEncoder to encode these categories while capturing their similarities.

enc = SimilarityEncoder(similarity="ngram")
X_enc = enc.fit_transform(X_dirty[:10].reshape(-1, 1))
X_enc

array([[0.05882353, 0.03125   , 0.02739726, 0.19008264, 1.        ,
        0.01351351, 0.05555556, 0.20535714, 0.08088235, 0.032     ],
       [0.008     , 0.02083333, 0.056     , 1.        , 0.19008264,
        0.02325581, 0.23076923, 0.56      , 0.01574803, 0.02777778],
       [0.03738318, 0.07317073, 0.05405405, 0.02777778, 0.032     ,
        0.0733945 , 0.        , 0.0625    , 0.06542056, 1.        ],
       [0.11206897, 0.07142857, 0.09756098, 0.01574803, 0.08088235,
        0.07142857, 0.03125   , 0.08108108, 1.        , 0.06542056],
       [0.04761905, 0.3539823 , 0.06976744, 0.02325581, 0.01351351,
        1.        , 0.02      , 0.09821429, 0.07142857, 0.0733945 ],
       [0.0733945 , 0.05343511, 0.14953271, 0.56      , 0.20535714,
        0.09821429, 0.26086957, 1.        , 0.08108108, 0.0625    ],
       [1.        , 0.05      , 0.06451613, 0.008     , 0.05882353,
        0.04761905, 0.01052632, 0.0733945 , 0.11206897, 0.03738318],
       [0.05      , 1.        , 0.03378378, 0.02083333, 0.03125   ,
        0.3539823 , 0.02631579, 0.05343511, 0.07142857, 0.07317073],
       [0.06451613, 0.03378378, 1.        , 0.056     , 0.02739726,
        0.06976744, 0.        , 0.14953271, 0.09756098, 0.05405405],
       [0.01052632, 0.02631579, 0.        , 0.23076923, 0.05555556,
        0.02      , 1.        , 0.26086957, 0.03125   , 0.        ]])
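
Each row of X_enc corresponds to one of the ten input strings and each column to one of the categories learned during fit; every entry is the n-gram similarity between the two, which is why each row contains exactly one 1 (the string compared with itself). To inspect the column order, you can look at the fitted encoder's learned categories (assuming SimilarityEncoder exposes a scikit-learn-style categories_ attribute):

enc.categories_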

Cool! Let's create a heatmap to visualize the similarities between the encoded categories.

import seaborn as sns
import numpy as np
from sklearn.preprocessing import normalize
from IPython.core.pylabtools import figsize

def plot_similarity(labels, features):
    normalized_features = normalize(features)

    # Create a cosine similarity matrix
    corr = np.inner(normalized_features, normalized_features)

    # Plot
    figsize(10, 10)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd",
        annot=True,
        annot_kws={"size": 10},
    )
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Similarity")

def encode_and_plot(labels):
    # Encode
    enc = SimilarityEncoder(similarity="ngram")
    X_enc = enc.fit_transform(labels.reshape(-1, 1))

    # Plot
    plot_similarity(labels, X_enc)

encode_and_plot(X_dirty[:10])
As we can see from the heatmap above:

The similarity between identical strings such as 'Office Services Coordinator' and 'Office Services Coordinator' is 1
The similarity between somewhat similar strings such as 'Office Services Coordinator' and 'Master Police Officer' is 0.41
The similarity between two very different strings such as 'Social Worker IV' and 'Police Aide' is 0.028
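
Since SimilarityEncoder follows the scikit-learn transformer API, you can also drop it into a ColumnTransformer and feed the encoded titles to a downstream model. A minimal sketch, assuming the salary target returned by fetch_employee_salaries().y and a simple Ridge regressor, neither of which is part of the original example:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Similarity-encode only the dirty column; drop the remaining columns for simplicity
preprocess = ColumnTransformer(
    [("title", SimilarityEncoder(similarity="ngram"), [dirty_column])],
    remainder="drop",
)
model = make_pipeline(preprocess, Ridge())

y = fetch_employee_salaries().y  # assumed salary target from the dirty_cat dataset
model.fit(X, y)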

Link to dirty-cat.

Link to my full article about dirty-cat.


Snorkel — Programmatically Build Training Data in Python
$ pip install snorkel

Imagine you are trying to determine whether a job posting is fake. You come up with some assumptions about fake job postings, such as:

If a job posting has little to no description of the requirements, it is likely to be fake.
If a job posting does not include any company profile or logo, it is likely to be fake.
If the job posting requires some sort of education or experience, it is likely to be real.

import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

train_df = pd.read_pickle(
    "https://github.com/khuyentran1401/Data-science/blob/master/feature_engineering/snorkel_example/train_fake_jobs.pkl?raw=true"
)
train_df.head(5)

[Output: the first five rows of the fake job postings dataset, with columns job_id, title, location, department, salary_range, company_profile, ...]

How do you test which of these features are the most accurate in predicting fraud?

That is when Snorkel comes in handy. Snorkel is an open-source Python library for
programmatically building training datasets without manual labeling.

To learn how Snorkel works, start by giving a meaningful name to each label value:

from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis

FAKE = 1
REAL = 0
ABSTAIN = -1

We assume that:

Fake companies don't have company profiles or logos
Fake companies are found in a lot of fake job postings
Real job postings often require a certain level of experience and education

Let's test those assumptions using Snorkel's labeling_function decorator. The labeling_function decorator allows us to quickly label instances in a dataset using functions.

@labeling_function()
def no_company_profile(x: pd.Series):
   return FAKE if x.company_profile == "" else ABSTAIN

@labeling_function()
def no_company_logo(x: pd.Series):
   return FAKE if x.has_company_logo == 0 else ABSTAIN

@labeling_function()
def required_experience(x: pd.Series):
   return REAL if x.required_experience else ABSTAIN

@labeling_function()
def required_education(x: pd.Series):
   return REAL if x.required_education else ABSTAIN

ABSTAIN (-1) tells Snorkel not to draw any conclusion about instances that don't satisfy the condition.

Next, we will use each of these labeling functions to label our training dataset:

lfs = [
   no_company_profile,
   no_company_logo,
   required_experience,
   required_education,
]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train_df)

100%|██████████████████████████████████████████████| 13410/13410 [00:02<00:00, 5849.25it/s]

Now that we have created the labels using each labeling function, we can use LFAnalysis to
determine the accuracy of these labels.

LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=train_df.fraudulent.values)

                     j Polarity  Coverage  Overlaps  Conflicts  Correct  Incorrect  Emp. Acc.
no_company_profile   0      [1]  0.186204  0.186204   0.186204      459       2038   0.183821
no_company_logo      1      [1]  0.205742  0.205742   0.205742      459       2300   0.166365
required_experience  2      [0]  1.000000  1.000000   0.244295    12741        669   0.950112
required_education   3      [0]  1.000000  1.000000   0.244295    12741        669   0.950112

Details of the statistics in the table above:

Polarity: The set of unique labels this LF outputs (excluding abstains)
Coverage: The fraction of the dataset that is labeled
Overlaps: The fraction of the dataset where this LF and at least one other LF agree
Conflicts: The fraction of the dataset where this LF and at least one other LF disagree
Correct: The number of data points this LF labels correctly
Incorrect: The number of data points this LF labels incorrectly
Empirical Accuracy: The empirical accuracy of this LF
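
Once you are satisfied with the labeling functions, the usual next step in Snorkel is to combine their (possibly conflicting) votes into a single training label per posting. A minimal sketch, assuming a recent Snorkel release where LabelModel lives in snorkel.labeling.model; it is illustrative rather than part of the original example:

from snorkel.labeling.model import LabelModel  # import path assumes Snorkel >= 0.9.5

# Learn how much to trust each labeling function, then produce one label per posting
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
train_df["label"] = label_model.predict(L=L_train)  # rows labeled -1 (ABSTAIN) are typically dropped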

Link to Snorkel.

My full article about Snorkel.
