2 - Preprocessing
2
Today
3
Where we want to get to
Instance  Age  Account activity  Owns credit card  Churn
1         24   low               yes               no
2         21   medium            no                yes
3         42   high              yes               no
4         34   low               no                yes
5         19   medium            yes               yes
6         44   medium            no                no
7         31   high              yes               no
8         29   low               no                no
9         41   high              no                no
10        29   medium            yes               no
11        52   medium            no                yes
12        55   high              no                no
13        52   medium            yes               no
14        38   low               no                yes
4
The data set
Recall, a tabular data set (“structured data”):
Target (label, class, dependent variable, response variable) can also be present
Numeric, categorical, …
Majority of tabular modeling techniques assume a single table as input (a feature matrix)
5
Constructing a data set takes work
Merging different data sources
Levels of aggregation, e.g. household versus individual customer
Linking instance identifiers
Definition of target variable
Cleaning, preprocessing, featurization
6
Data Selection
7
Data types
Master data
  Relates to the core entities the company is working with
  E.g. customers, products, employees, suppliers, vendors
  Typically stored in operational databases and data warehouses (historical view)
Transactional data
  Timing, quantity and items
  E.g. POS data, credit card transactions, money transfers, web visits, etc.
  Will typically require a featurization step (see later)
External data
  Social media data (e.g. Facebook, Twitter, e.g. for sentiment analysis)
  Macroeconomic data (e.g. GDP, inflation)
  Weather data
  Competitor data
  Search data (e.g. Google Trends)
  Web scraped data
  Open data: external data that anyone can access, use and share
    Government data (e.g. Eurostat, OECD)
    Scientific data
8
Example: Google Trends
https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4
9
Example: Google Street View
https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf
10
Data types
Tabular, imagery, time series, …
Small vs. big data
Metadata
Data that describes other data
Data about data
Data definitions
E.g. stored in DBMS catalog
Oftentimes lacking, but can help a great deal in understanding the data, and in feature extraction as well
11
Selection
As data mining can only uncover patterns actually present in the data, target data set must
be large/complete enough to contain these patterns
Make absolutely sure you’re not cheating by including explanatory variables which are too “perfectly”
correlated with the target
“Too good to be true”
Is this explanatory variable known before the target outcome, or only after it? Will it be available at the time your model is used, or only later?
Your finished model will not be able to look into the future!
12
Data Exploration
13
Exploration
Visual analytics as a means for initial exploration (EDA: exploratory data analysis)
Boxplots
Scatter plots
Histograms
Basic statistics
Correlation plots (!)
14
Exploration
Anscombe (1973): four data sets with near-identical summary statistics (means, variances, correlations, regression lines) but very different distributions, so visualize rather than rely on basic statistics alone
15
Data Cleaning
16
Cleaning
Consistency
17
Missing values
Many techniques cannot deal with missing values
Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)
Missing at random vs. missing not at random
Detection is easy
18
Missing values
“Detection is easy?”
19
Missing values
20
Missing values
21
Missing values
Common approach: delete rows with too many missing values, impute numeric features with the median and categorical features with the mode, and add a separate indicator column flagging whether the original value was missing
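A minimal pandas sketch of this common approach; the column names, toy values and the "too many missing" threshold are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [24, None, 42, None],
    "account_activity": ["low", "medium", None, "low"],
})

# 1. Delete rows with too many missing values (here: more than half)
df = df[df.isna().mean(axis=1) <= 0.5].copy()

# 2. Add indicator columns flagging the original missingness
for col in ["age", "account_activity"]:
    df[col + "_was_missing"] = df[col].isna().astype(int)

# 3. Impute: median for the numeric feature, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["account_activity"] = df["account_activity"].fillna(df["account_activity"].mode()[0])
```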
22
Intermezzo
Always keep the production setting in mind: new unseen instances can contain missing values as well
Don't impute with a newly computed median, but reuse the same "rules" as fitted on the training data
Ideally: do the same when working with validation data!
What if we've never observed a missing value for this feature before?
Use the original training data to construct an imputation
Consider rebuilding your model (monitoring!)
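A sketch of keeping imputation consistent between training and production with scikit-learn's SimpleImputer: the rule is fitted on the training data once and then reused unchanged on validation, test and new unseen instances (the toy arrays are assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[24.0], [21.0], [np.nan], [34.0]])
X_new = np.array([[np.nan], [29.0]])   # unseen instances arriving in production

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                   # learn the median from the training data only

X_train_imp = imputer.transform(X_train)
X_new_imp = imputer.transform(X_new)   # reuse the same median; do not refit on new data
```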
23
Outliers
Extreme observations (age = 241 or income = 3 million)
Treatment:
24
Outliers
25
Outliers
26
Outliers
27
Duplicate rows
Duplicate rows can be valid or invalid
Treat accordingly
28
Transformations: standardization
Standardization: rescale a feature to zero mean and unit variance (≈ N(0, 1) for Gaussian-distributed data)
Good/necessary for Gaussian-distributed data and for some techniques: SVMs, regression models, k-nearest neighbors, neural networks, everything working with Euclidean distance or similarity
Useless for other techniques: decision trees
$$x_{\text{new}} = \frac{x - \mu}{\sigma}$$
29
Transformations: normalization
Normalization: also called “feature scaling”
$$x_{\text{new}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
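A sketch of both transformations with scikit-learn scalers; as with imputation, the parameters (μ, σ, min, max) are fitted on the training data and reused on new data (the toy array is an assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[24.0], [21.0], [42.0], [34.0]])

# Standardization: (x - mu) / sigma
standardizer = StandardScaler().fit(X_train)
X_std = standardizer.transform(X_train)

# Normalization / feature scaling: (x - x_min) / (x_max - x_min)
normalizer = MinMaxScaler().fit(X_train)
X_norm = normalizer.transform(X_train)
```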
30
Transformations: categorization
Also called “coarse classification”, “classing”, “binning”, “grouping”
Continuous → nominal
31
Transformations: categorization
To treat outliers
To make final model more interpretable
To reduce curse of dimensionality following from high number of levels
To introduce non-linear effects into linear models
32
Transformations: dummyfication and other
encodings
Nominal → continuous
“Integer encoding”
Convert to:
account_activity_high = 1
account_activity_medium = 2
account_activity_low = 3
33
Transformations: dummyfication and other
encodings
Nominal → continuous
"Dummy encoding" (one-hot)
Convert to:
account_activity_high ∈ {0, 1}
account_activity_medium ∈ {0, 1}
account_activity_low ∈ {0, 1}
34
Transformations: dummyfication and other
encodings
Nominal → continuous
“Binary encoding”
Convert to:
account_activity_high = 1 → 01
account_activity_medium = 2 → 10
account_activity_low = 3 → 11
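A sketch of the three encodings on the account activity feature, using pandas and the category_encoders package that is linked further on in this deck (the toy data frame is an assumption):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"account_activity": ["high", "medium", "low", "medium"]})

# "Integer encoding": each level gets one integer code
integer_encoded = ce.OrdinalEncoder(cols=["account_activity"]).fit_transform(df)

# Dummy / one-hot encoding: one 0/1 column per level
dummy_encoded = pd.get_dummies(df, columns=["account_activity"])

# "Binary encoding": the integer code written out in bits, one 0/1 column per bit
binary_encoded = ce.BinaryEncoder(cols=["account_activity"]).fit_transform(df)
```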
35
Transformations: high-level categoricals
What if we have a categorical variable with too many levels? Dummy variables don't really solve our problem
Create a pivot table of the attribute versus target and compute the odds
Group variable values having similar odds
Example (attribute values versus the good/bad target):
48  cash  B
60  car   G
…   …     …
37
Transformations: odds based grouping
How to verify which option is better?
Consider the following example taken from the book Credit Scoring and Its Applications:
Attribute         Owner  Rent Unfurnished  Rent Furnished  Parents  Other  No Answer  Total
Goods             6000   1600              350             950      90     10         9000
Bads              300    400               140             100      50     10         1000
Goods/Bads odds   20     4                 2.5             9.5      1.8    1          9
1. Owners, Renters (Rent Unfurnished + Furnished), and Others (every other level)
2. Owners, Parents, and Others
38
Transformations: odds based grouping
Empirical frequencies for Option 1:
E.g. the expected number of good Owners, if the good/bad odds were the same as in the whole population, would be 6300/10000 × 9000/10000 × 10000 = 5670
Chi-square distance:
$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583$$
39
Transformations: odds based grouping
Chi-square distance Option 1:
$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583$$
In order to judge significance, the obtained chi-square statistic should (under independence) follow a chi-square distribution with k − 1 degrees of freedom, with k the number of levels (3 in our case). This can then be summarized by a p-value to see whether there is a statistically significant dependence or not.
The analogous computation for Option 2 (Owners, Parents, Others) yields a chi-square value of about 662. Since both options use 3 levels, we can directly compare 662 to 583, and since the former is bigger, conclude that Option 2 is the better coarse classification.
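The comparison can be reproduced programmatically; a sketch with scipy.stats.chi2_contingency on the observed goods/bads counts per grouped level (the grouping of levels follows the two options above):

```python
from scipy.stats import chi2_contingency

# Observed counts per coarse class: goods on the first row, bads on the second
option1 = [[6000, 1950, 1050],   # Owners, Renters, Others
           [300,   540,  160]]
option2 = [[6000,  950, 2050],   # Owners, Parents, Others
           [300,   100,  600]]

for name, table in [("Option 1", option1), ("Option 2", option2)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(name, round(chi2, 1), "df =", dof, "p =", p)
# Option 1 gives a chi-square of roughly 583, Option 2 roughly 662
```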
40
Weights of evidence encoding
Weights of evidence variables can be defined as follows:
$$WoE_{cat} = \ln\left(\frac{p_{c1,cat}}{p_{c2,cat}}\right)$$
$p_{c1,cat}$ = number of instances with class 1 in the category / total number of instances with class 1
$p_{c2,cat}$ = number of instances with class 2 in the category / total number of instances with class 2
41
Weights of evidence encoding
(Example WoE table with columns: Purpose, Total, Negatives, Positives, Distr. Neg., Distr. Pos., WoE)
Can also be used to screen variables: the information value $IV = \sum_{cat} (p_{c1,cat} - p_{c2,cat}) \times WoE_{cat}$; important variables typically have IV > 0.1
Note that we have now been using information coming from the target: so a proper train/test split is even more crucial!
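A sketch of computing WoE per category and the information value of a variable with pandas, on the training data only (the toy frame, the column names and the 0/1 target coding are assumptions):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "purpose": ["cash", "car", "cash", "car", "car", "cash"],
    "target": [1, 0, 1, 0, 1, 0],  # 1 = class 1, 0 = class 2
})

counts = pd.crosstab(train["purpose"], train["target"])
p_c1 = counts[1] / counts[1].sum()   # share of all class-1 instances per category
p_c2 = counts[0] / counts[0].sum()   # share of all class-2 instances per category

woe = np.log(p_c1 / p_c2)            # WoE per category
iv = ((p_c1 - p_c2) * woe).sum()     # information value of the variable

train["purpose_woe"] = train["purpose"].map(woe)  # WoE-encoded feature
```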
42
Weights of evidence encoding
Category boundaries
Optimize so as to maximize IV
https://cran.r-project.org/web/packages/smbinning/index.html
43
Weights of evidence encoding
Number of categories?
Trade-off
Fewer categories because of simplicity, interpretability and stability
More categories to keep predictive power
Laplace smoothing:
$$WoE_{cat} = \ln\left(\frac{p_{c1,cat} + n}{p_{c2,cat} + n}\right)$$
44
Weights of evidence encoding
45
Some other creative approaches
For geospatial data: group nearby communities together, spatial interpolation, …
First build a decision tree only on the one categorical or continuous variable and target
1-dimensional k-means clustering on a continuous variable to suggest groups
Probabilistic transformations and other “Kaggle”-tricks such as Leave One Out Mean (Owen Zhang)
Categorical embeddings
46
Some other creative approaches
category_encoders : https://contrib.scikit-learn.org/category_encoders/
categorical-encoding : https://github.com/alteryx/categorical_encoding
https://www.featurelabs.com/open/
https://github.com/alteryx/categorical_encoding/blob/main/guides/Categorical_Encoding_Methods.pdf
47
Hashing trick
The “hashing trick”: for categoricals with many levels and when it is expected that new
levels will occur in new instances
investment products”
48
Hashing trick
“Hash encoding”: a hash function is any function that can be used to map data of arbitrary
size to fixed-size values
Seemingly random
Designed so that collisions in the output space are rare
Used frequently in encryption / security
“One-way” function
E.g.:
h(“customer”) = 4, h(“desired”) = 2, h(“wanted”) = 37
Each hashed value switches on the corresponding position of a fixed-size vector:
index: 0 1 2 3 … 37 38 …
value: 0 0 1 0 … 1  0 …
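A minimal sketch of the hashing trick with a stable hash mapping any level, including levels never seen during training, to one of a fixed number of positions (the bucket count is an assumption; scikit-learn's FeatureHasher offers a ready-made alternative):

```python
import hashlib
import numpy as np

N_BUCKETS = 64  # fixed output dimension, chosen up front

def hash_bucket(value: str) -> int:
    # Stable (not process-seeded) hash of the level, mapped to a bucket index
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hash_encode(value: str) -> np.ndarray:
    vec = np.zeros(N_BUCKETS)
    vec[hash_bucket(value)] = 1.0   # switch on the hashed position
    return vec

vec = hash_encode("some brand-new product level")  # works for unseen levels too
```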
49
Embeddings
Other “embedding” approaches are possible as well in the textual domain
https://arxiv.org/pdf/1604.06737.pdf
50
Transformations: mathematical approaches
Logarithms, square roots, etc
51
Transformations: interaction variables
When no linear effect is present on x1 ∼ y and x2 ∼ y but there is one on f (x1 , x2 ) ∼ y
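A sketch of adding interaction features, either by hand or with scikit-learn's PolynomialFeatures; the product x1 * x2 is just one common choice of f(x1, x2):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# By hand: append the product x1 * x2 as an extra column
X_manual = np.column_stack([X, X[:, 0] * X[:, 1]])

# With scikit-learn: all pairwise interaction terms, no squares, no bias column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)  # columns: x1, x2, x1*x2
```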
52
Feature Engineering
53
Feature engineering
The goal of the transformations above: creating more informative features!
“The aim of feature engineering is to transform data set variables into features so as to help the analytical models achieve better performance in terms of either predictive performance, interpretability or both.”
54
Feature engineering
https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
55
Feature engineering: RFM features
Already popular since (Cullinan, 1977)
56
Feature engineering: RFM features
$\text{recency} = e^{-\gamma t}$
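RFM stands for recency, frequency and monetary value; a sketch of deriving all three from a transaction table with pandas, including the exponential recency transform above (the column names, reference date and γ are assumptions):

```python
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-01-20"]),
    "amount": [100.0, 40.0, 250.0],
})
reference_date = pd.Timestamp("2024-03-01")
gamma = 0.05  # decay rate of the recency transform

rfm = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)
days_since_last = (reference_date - rfm["last_purchase"]).dt.days
rfm["recency"] = np.exp(-gamma * days_since_last)  # recency = e^(-gamma * t)
```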
57
Feature engineering: time features
Capture information about time aspect by meaningful features
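A sketch of typical features derived from a timestamp with pandas; which ones are meaningful depends on the application, so the selection here is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-03-01 14:30", "2024-12-24 09:00"])})

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek          # 0 = Monday
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)
```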
59
Feature engineering: deltas, trends, windows
Evolutions over time or differences between features are crucial in many settings!
Solution 1: Keep track of an instance through time and add in as separate rows (“panel data analysis”)
Solution 2: One point in time as “base”, add in relative features
60
Feature engineering: deltas, trends, windows
Absolute trends: $\frac{F_t - F_{t-x}}{x}$
Relative trends: $\frac{F_t - F_{t-x}}{F_{t-x}}$
Can be useful for size variables (e.g., asset size, loan amounts) and ratios
Beware of denominators equal to 0!
Can put higher weight on recent values
Extension: time series analysis
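A sketch of absolute and relative trend features over a window of x periods with pandas, guarding against zero denominators (the panel layout and window size are assumptions):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 1, 2, 3],
    "balance": [100.0, 120.0, 90.0, 0.0, 50.0, 60.0],
}).sort_values(["customer_id", "month"])

x = 1  # window length in periods
previous = panel.groupby("customer_id")["balance"].shift(x)

panel["abs_trend"] = (panel["balance"] - previous) / x
panel["rel_trend"] = (panel["balance"] - previous) / previous.replace(0, np.nan)
```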
61
Feature engineering: ordinal variables
Ordinal features have intrinsic ordering in their values (e.g., credit rating, debt seniority)
Thermometer coding
Progressively codes ordinal scale of variable
F1 F2 F3 F4
AAA 0 0 0 0
AA 1 0 0 0
A 1 1 0 0
B 1 1 1 0
C 1 1 1 1
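A sketch of thermometer coding for the rating scale in the table above; the numpy/pandas implementation itself is an assumption:

```python
import numpy as np
import pandas as pd

levels = ["AAA", "AA", "A", "B", "C"]                # ordered from best to worst
rank = {level: i for i, level in enumerate(levels)}  # AAA -> 0, ..., C -> 4

ratings = pd.Series(["AA", "C", "AAA"])
codes = ratings.map(rank).to_numpy()

# F1..F4: Fk = 1 once the rating has moved at least k steps down from AAA
thermometer = (codes[:, None] >= np.arange(1, len(levels))[None, :]).astype(int)
thermometer_df = pd.DataFrame(thermometer, columns=["F1", "F2", "F3", "F4"])
```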
62
Feature engineering: relational data
63
Feature engineering: relational data
dm : https://krlmlr.github.io/dm/
64
Featurization
How do we get a tabular data set out of something non-structured?
65
Featurization
featuretools : An open source python framework for
automated feature engineering,
https://www.featuretools.com/
stumpy : https://github.com/TDAmeritrade/stumpy
tsfresh : https://tsfresh.readthedocs.io/en/latest/
FeatureSelector : https://github.com/WillKoehrsen/feature-selector
OneBM : https://arxiv.org/abs/1706.00327
67
Feature Selection
68
Feature selection
Oftentimes, you end up with lots of features
69
Principal component analysis
http://setosa.io/ev/principal-component-analysis/
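A sketch of PCA as a dimensionality reduction step with scikit-learn; standardizing first matters because PCA is driven by variances (the random data and number of components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 instances, 5 features

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)     # project onto the top 2 principal components
print(pca.explained_variance_ratio_)     # share of variance retained per component
```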
70
Conclusion
71
More on feature engineering and dimensionality
reduction
Stepwise selection
Regularization
Variable importance
Clustering
t-SNE
UMAP
Features from text
Categorical embeddings
72
Sampling
Think carefully about the population on which the model that is going to be built using the sample will be used
Timing of the sample
How far back do I go to get my sample?
Trade-off: more data versus more recent data
The sample taken must be from a normal business period, to get as accurate a picture as possible of the target population
(Sampling will return in a different context later on, when we talk more about validation:
over/under/smart sampling)
73
Conclusion
Pre-processing: many steps and checks
Depends on technique being used later on, which might not yet be certain
Can you apply your pre-processing steps next month, on a future data set, in production?
I.e.: can you apply your pre-processing steps on the test set?
Time consuming! E.g. with SQL / pandas / Spark / …: join columns, remove columns, create aggregates, sort, order, split, …
Not the “fun” aspect of data science
Easy to introduce mistakes
74