Data Preprocessing Exploratory Analysis
2021/2022
2
Why data preprocessing?
3
Why data preprocessing?
Apps
Sensors
Loyalty programs
Credit cards
Online shop
4
Why data preprocessing?
“Data have swept into every industry and business function and are now an
important factor of production, alongside labor and capital.”
McKinsey Global Institute, 2011
5
Why data preprocessing?
[Figure: the virtuous cycle of data mining — 1. Identify business opportunities; 2. Transform data into information; 3. Act on the information; 4. Measure the results. Source: Linoff et al., 2013]
6
Why data preprocessing?
7
Why data preprocessing?
Predictive Techniques
Each row is an observation and each column is a variable; the last column (Insurance Costs) is the target variable.

Height  Weight  Sex  Age  Income  Physical Activity  Insurance Costs (target)
1.60    79      M    41   3000    Y                  N
1.72    82      M    32   4000    Y                  N
1.66    65      F    28   2500    N                  N
1.82    87      M    35   2000    N                  S
1.71    66      F    42   3500    N                  S
9
Why data preprocessing?
Descriptive Techniques
[Figure: data table — rows are observations, columns are variables.]
10
Why data preprocessing?
Assumption / Requires:          KMeans     Decision Trees  Regression  K-NNs      Neural Networks
Quantitative variables          Yes        No              Desirable   Desirable  Yes
Normally distributed variables  No         No              Yes         No         No
Linearity of effects            N/A        No              Yes         No         N/A
Homoscedasticity                N/A        No              Yes         No         N/A
Independence of error terms     N/A        No              Yes         No         N/A
Low multicollinearity           Desirable  No              Yes         Yes        N/A
Only non-missing values         Yes        No              Yes         Yes        Yes
Absence of outliers             Yes        No              Yes         Yes        Yes
11
Preparing the customer signature
This chapter focuses on finding the right individual unit (e.g., customers, credit cards, households, etc.) in data, by gathering the traces they leave when they interact with an organization and its IT/IS.
In most cases the observation signatures, also known as analytic-based tables (ABTs), are the data in data mining/analytics. In other words, they are the data in the right format to be used by most methods. They define, for example, whether the task will be of a predictive or descriptive nature.
12
Preparing the customer signature
Signature tables, or analytic-based tables, have one row for each example of whatever is being studied. If the level of analysis is a town, we need a town-signature table; if the level of analysis is a credit card account, we need an account signature; and so on.
Although many types of data other than customer-related data can be analysed and monetized, the remainder of this chapter will mostly consider customers as the unit of interest in the signature table. It should be noted that sometimes the level of analysis is something other than a customer, even when, ultimately, it is customers who are of interest (e.g., developing an insurance policy-signature table for cross-selling purposes).
13
Preparing the customer signature
What is a customer?
The question seems simple, but the answer depends on the business goal. Thus, extracting and treating data involves as many technical (objective) decisions as business (subjective) ones.
Consider, for example, three types of roles that one (a so-called customer) may have:
• Payer;
• Decider;
• User.
Imagine you work for a travel agency that sells trips to business customers. Who is your customer? And what if your company sells multimedia content to individual customers (who might be underage)?
14
Preparing the customer signature
Besides the problem of the customer’s role(s), defining who the customer is also depends on the available (transactional) data, as this is usually the (main) source of available data in an organization.
Different possibilities for the transaction data exist, each leading to different approaches.
• All transactions are anonymous.
• Some transactions can be tied to a particular credit or debit card, but others cannot; some cards
can be tied to specific customers.
• Many transactions can be tied to a cookie; some cookies can be tied to accounts.
• All transactions can be tied to a particular account.
• All transactions can be tied to a particular identified customer.
• All transactions can be tied to a particular customer and household.
15
Preparing the customer signature
Anonymous Transactions
The cashless society keeps failing to arrive. When transactions are anonymous, it is only
possible to learn as much about a customer as is revealed in a single transaction. An
automated couponing system can use the contents of a single purchase transaction to
decide which of several coupons to print, for example.
16
Preparing the customer signature
Designing signatures
Preparing the signature is quite often the most time-consuming part of data mining / analytics projects. However, this only holds the first time, if things are done properly. Although, as time goes by, analysis may suggest new elements to add to the signature, most of the work clearly lies in building it the first time.
The design of a customer signature is best approached iteratively, one data source at a time, starting with the ones we know best.
17
Preparing the customer signature
Not all analysis requires customer (or other) signatures. In an analytics project, data can be turned into information “just” by conducting hypothesis tests or by applying ad hoc SQL queries to pre-existing database tables.
All the time invested in designing and writing signatures will pay off when the time to model arrives.
18
Preparing the customer signature
This is the first decision to be made when one is designing a customer signature – the level of granularity. As mentioned earlier, is it a household, which may include several individuals, each potentially with several accounts? Or is it an individual? Is it a policy? Or is it a user?
Making this decision is not always easy because different business needs raise questions at each level. Often, it makes sense to create multiple versions of the signature at different levels.
The most basic level is an account signature, followed by a customer signature and then a household one. Complexity usually lies in the account signature, which is sometimes the same as the customer and household ones.
19
Preparing the customer signature
Householding
Building signatures at the household level requires some mechanism for deciding which individuals belong together — a process called householding. It involves business rules for matching names, addresses, and other identifying features, all of which may be misspelled, out of date, or inconsistent (with techniques addressed later in this course).
It is also worth noting that different householding rules are appropriate for different cases (e.g., a customer value model vs cross-selling mobile phone plans).
For both mailing and modeling, there are also risks involved with householding, which may only be tackled with complex data treatment methods.
Another important aspect of the householding process is that households change over time, at an increasingly fast pace.
20
Signature’s fields example for retail company
This signature for catalogue shoppers was developed for a catalogue retailer. The primary
business goal was to increase the response rate for catalogue mailings. The signature also
supports modelling order size because that figure is also available in the target
timeframe. The signature is in the predictive modelling style, with the target variables in a
later timeframe than the explanatory variables.
Note: The Respond and Order Size fields are both in the target timeframe. For any given model, one is
used as the target and the other is ignored because it contains information from the target
timeframe.
21
Signature’s fields example for retail company
22
Signature’s fields example for retail company
The signature includes two two-year quarterly time series to capture customer spending
over time. Unfortunately, most customers have made very few orders. In fact, the most
common number of orders is one, with two as a distant second. That is why the signature
does not include trend variables which would ordinarily be used to summarize a time
series.
23
Preparing the customer signature
Additional questions
24
Preparing the customer signature
25
Preparing the customer signature
[Figure: example of a signature table — rows are observations, columns are variables.]
26
What signatures look like
➢ Each row is identified by a unique primary key that identifies the customer, household, or whatever is described by each row of the signature table. If such a key does not already exist, it should be created. It should be stable over the long term!
27
Process for creating signatures
28
Process for creating signatures
Some data sources may already contain data stored at the right level (usually the customer) that can simply be copied directly into the signature without additional transformation. This is a good place to start.
Many businesses send out bills on a regular schedule (e.g., phone or insurance companies).
29
Process for creating signatures
Transaction data is typically the most voluminous and potentially the most fruitful source for the customer signature. After all, transactions are where actual customer behavior is captured.
Aggregating transactions creates a regular time series that can be pivoted into the customer signature as described earlier. Business goals determine which features of the transaction data are most useful, but some typical aggregations exist. When there are multiple transaction types (e.g., in a bank: wire transfers, deposits and withdrawals, cash and credit card), these fields are computed per transaction type (see the sketch below):
• Largest transaction; Smallest transaction; Average transaction; Number of transactions; Most recent transaction; Earliest transaction.
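A minimal sketch of these aggregations with pandas, assuming a hypothetical transaction table with customer_id, tx_type, amount and tx_date columns (all names illustrative):

```python
import pandas as pd

# Hypothetical transaction data: one row per transaction.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "tx_type": ["deposit", "withdrawal", "deposit", "deposit", "card"],
    "amount": [100.0, 40.0, 250.0, 80.0, 15.0],
    "tx_date": pd.to_datetime(
        ["2021-01-05", "2021-02-10", "2021-03-01", "2021-01-20", "2021-02-02"]),
})

# Typical aggregations, computed per customer and transaction type,
# then pivoted so each (statistic, type) pair becomes one signature column.
agg = (tx.groupby(["customer_id", "tx_type"])
         .agg(largest=("amount", "max"),
              smallest=("amount", "min"),
              average=("amount", "mean"),
              n_transactions=("amount", "count"),
              most_recent=("tx_date", "max"),
              earliest=("tx_date", "min")))

signature = agg.unstack("tx_type")          # one row per customer
signature.columns = [f"{stat}_{t}" for stat, t in signature.columns]
print(signature)
```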
30
Process for creating signatures
When transactions occur infrequently and at irregular intervals, one needs to choose between a very wide aggregation period and accepting many zeroes.
31
Types of measurement scales
• Nonmetric scales: Nominal; Ordinal.
• Metric scales: Interval; Ratio.
32
Types of measurement scales
[Table: properties of each type of measurement scale.]
33
Derived variables
Derived variables
Until this point, we have focused on getting data prepared to be analyzed (either using exploratory or explanatory techniques). From this point forward we will focus on enhancing data, not by adding other data source(s) but rather by transforming and/or creating new variables based on the now existing (preliminary) version of the signature table (ABT).
Creating new (derived) variables is arguably the most human and creativity-demanding process in marketing analytics. These derived variables are not originally present in the dataset and depend completely on the analyst’s insight. Techniques use the existing variables, so if one can add new and relevant ones, techniques will inevitably perform better.
34
Derived variables
Single variables
35
Derived (single) variables
Single-variable transformations
Although making up new transformations may be fun, it is not always required. Some
transformations are done routinely to increase the usability of data. Among the most
common single-variable transformations are:
• Mean-correcting (centering) variables: $x^*_{ij} = x_{ij} - \bar{x}_j$;
• Standardizing (rescaling or normalizing) variables: $x^*_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$.
The main objective is to put variables measured in incompatible units onto a common scale, so that a one-unit difference in a given variable has the same effect as the same difference in any other. This is especially important for techniques, such as clustering and memory-based reasoning, which depend on the concept of distance, but also in linear regression to yield standardized betas – although there this transformation does not affect the model’s performance (see the sketch below).
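A minimal sketch of both transformations with pandas, on a small illustrative DataFrame X (column names are assumptions):

```python
import pandas as pd

X = pd.DataFrame({"income": [3000, 4000, 2500, 2000, 3500],
                  "age": [41, 32, 28, 35, 42]})

# Mean-correcting (centering): x*_ij = x_ij - mean_j
X_centered = X - X.mean()

# Standardizing (z-scores): x*_ij = (x_ij - mean_j) / s_j
X_standardized = (X - X.mean()) / X.std()

print(X_standardized.round(2))
```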
36
Derived (single) variables
A good example of using percentiles is the assessment of babies’ characteristics. Saying that a baby is in the 95th percentile for length, weight, or another attribute is a more effective way of measuring it than saying, “the baby is 60 cm long.” Apart from pediatricians, few people know whether that is big or small for a three-month-old. The absolute measurement does not convey as much information as the fact that this baby is bigger than 95% of others. Moreover, percentiles also allow comparing one’s evolution across time (for health, rather than “bigger is better”, percentiles are expected to remain stable).
Converting numeric values into percentiles has many of the same advantages as standardization. It translates data into a new scale that works for any numeric measure. Percentiles also do not require knowledge about the underlying distribution.
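A minimal sketch using pandas’ rank(pct=True) to convert an illustrative spending variable into empirical percentiles:

```python
import pandas as pd

spending = pd.Series([120, 45, 300, 45, 980, 15, 230])

# Rank each value as the fraction of observations at or below it (0-1],
# i.e., an empirical percentile that needs no distributional assumption.
percentile = spending.rank(pct=True)

print(pd.DataFrame({"spending": spending, "percentile": percentile}))
```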
37
Derived (single) variables
Note that a customer with 10 purchases in the one month since acquisition has the same purchase rate as one who bought 20 times in two months. It is fundamental that the new variable represents that (see the sketch below).
A word of caution: when combining data from multiple sources, you must be sure that they contain data from the same timeframe.
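A minimal sketch of such a rate variable, assuming hypothetical n_purchases and months_since_acquisition columns:

```python
import pandas as pd

customers = pd.DataFrame({
    "n_purchases": [10, 20, 3],
    "months_since_acquisition": [1, 2, 6],
})

# Purchases per month of tenure: a 10-purchase one-month-old customer and a
# 20-purchase two-month-old customer get the same rate (10 per month).
customers["purchase_rate"] = (customers["n_purchases"]
                              / customers["months_since_acquisition"])
print(customers)
```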
38
Derived (single) variables
A common mistake is to replace categorical values with arbitrary numbers, which leads to spurious information that algorithms have no way to ignore.
39
Derived (single) variables
According to the table below, Alaska is very close to Alabama and Arizona (which is false). This is meaningless, but that is what the numbers say, and data mining techniques cannot ignore what the numbers say. A good idea would be to, e.g., replace the state code with population, GDP, or another variable of interest, i.e., one hypothesized to be relevant to the problem in question (see the sketch below).
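A minimal sketch of this replacement, using a hypothetical lookup table of state populations (figures are purely illustrative):

```python
import pandas as pd

customers = pd.DataFrame({"state": ["Alaska", "Alabama", "Arizona"]})

# Hypothetical lookup table: replace the arbitrary state code with a numeric
# attribute hypothesized to be relevant (here, population); values illustrative.
state_population = {"Alaska": 730_000, "Alabama": 5_000_000, "Arizona": 7_300_000}

customers["state_population"] = customers["state"].map(state_population)
print(customers)
```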
40
Derived (single) variables
Binning/discretization
Contrary to the previous case, quite often the desired outcome is to (efficiently) transform numerical variables into categorical ones (a process called binning or discretization). Some analytic techniques (e.g., lookup models, chi-square association tests and naïve Bayesian models) only work with categorical data. Others (e.g., decision trees) can in some cases be more efficient when working with this type of variable.
This process consists of dividing the range of a numeric variable into a set of bins/modalities and replacing its (original) value by the bin or modality. This is also extremely useful for highly skewed variables, as outliers are all placed together in one bin (e.g., recency). Nevertheless, dealing with outliers will be discussed further later in this chapter.
41
Derived (single) variables
Binning/discretization
Binning methods differ in how the ranges are set. The histogram in the next slide presents an example of the distribution of months since the last purchase (recency). This variable is often used in descriptive or predictive techniques as it reflects an interesting firmographic characteristic of customers.
42
Derived (single) variables
Binning/discretization
43
Derived (single) variables
The upper histogram uses five equal-width (equal-range) bins. The lower one uses five equal-frequency (equal-size) bins. Which one is more advisable? (See the sketch below.)
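A minimal sketch of both binning strategies with pandas, on a synthetic right-skewed recency variable (pd.cut gives equal-width bins, pd.qcut equal-frequency ones):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed "months since last purchase" (recency) variable.
recency = pd.Series(rng.exponential(scale=6, size=1000).round())

# Equal-width bins: five ranges of equal length (outlier-dominated bins stay sparse).
equal_width = pd.cut(recency, bins=5)

# Equal-frequency bins: five bins with roughly the same number of customers.
equal_freq = pd.qcut(recency, q=5, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```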
44
Derived (single) variables
45
Derived (single) variables
46
Derived (single) variables
One can use the square root or the logarithm to correct right-skewed variables. The square root corrects in a softer way, whereas ln(x) and log(x) are stronger corrections.
47
Derived (single) variables
On the other hand, if we are interested in correcting left-skewed distributions, we can simply use the power transformation y = x^γ (with γ > 1) — see the sketch below.
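A minimal sketch of these corrections on synthetic skewed variables (the shift before the power transformation is an assumption to keep values non-negative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))
left_skewed = pd.Series(100 - rng.lognormal(mean=2, sigma=0.5, size=1000))

# Right skew: sqrt is a softer correction, log a stronger one.
sqrt_corrected = np.sqrt(right_skewed)
log_corrected = np.log(right_skewed)          # values must be strictly positive

# Left skew: a power transformation y = x**gamma with gamma > 1
# (the variable is first shifted to be non-negative).
shifted = left_skewed - left_skewed.min()
power_corrected = shifted ** 2

for name, s in [("original (right)", right_skewed), ("sqrt", sqrt_corrected),
                ("log", log_corrected), ("power (left)", power_corrected)]:
    print(f"{name:18s} skewness = {s.skew():.2f}")
```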
48
Derived (combined) variables
Combining variables
One of the most common ways of creating derived variables is by combining two or more
existing ones to expose information that is not present originally.
49
Derived (combined) variables
Some derived variables have become so well known that it is easy to forget that someone once had to invent them. Insurance companies track loss ratios, banks track effort rates, investors pick stocks based on price/earnings ratios, etc.
A good example is the body mass index (BMI) = weight / height². The histogram below depicts the association between type II diabetes and BMI.
50
Combining Highly Correlated Variables
Consider the case in which a retailer has two variables about their customers: sales and net sales. The difference between them is that the latter accounts for returned items (e.g., within two weeks) and non-supplied ones (because they were out of stock). Assume these two variables have a correlation of 0.99. Which one would you use? Is there a way to use both? And if so, is it worth it? Answer: consider looking at the cases where the difference between the two variables is high!
51
Combining Highly Correlated Variables
Another interesting thing to look at with correlated variables is whether the degree or
direction of correlation is different in different circumstances, or at different levels of a
hierarchy. For example:
• For new customers, credit score and involuntary attrition are strongly correlated, but the
correlation is weaker for established customers;
• For people under 18, age and height are highly correlated; for older people they are not;
• GDP per capita is highly correlated with ICT adoption, but almost exclusively for developing
countries.
52
Extracting features from time series
Most customer (or other unit) signature tables include time series data (e.g., calls sent and received, quarterly spending, etc.). A difficulty with the time series of quarterly spending is that, at the customer level, it is very sparse. Any one customer has very few quarters with a purchase. Thus, it does not make sense to consider whether a customer’s spending is increasing or decreasing over time.
So how can one extract informative variables from time series data?
53
Extracting features from time series
Seasonality
In the previous slide it is noticeable that sales in the 4th quarter are stronger than in any other. While this pattern is obvious (and plausible) for humans, it is not noticeable by most techniques without derived variables:
• At a minimum, one could create a variable indicating whether it is the 4th quarter or not;
• Even better would be to create a categorical variable with the quarter.
The ideal would be creating a variable with the expected effect on sales due to seasonality (which is not straightforward) – a stationary series.
One could represent the quarters by their average difference from the trendline — almost all below it, except the 4th quarters (Q1 = -10.633; Q2 = -37.312; Q3 = 38.923; Q4 = 82.837) – see the next slide and the sketch below.
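A minimal sketch of deriving such quarterly deviations from a fitted trendline, using illustrative sales figures rather than the ones on the slide:

```python
import numpy as np
import pandas as pd

# Hypothetical eight quarters of sales (values illustrative).
sales = pd.Series([210, 190, 230, 310, 225, 200, 245, 330],
                  index=pd.period_range("2019Q1", periods=8, freq="Q"))

# Fit a linear trendline over time and compute each quarter's deviation from it.
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales.values, deg=1)
trend = intercept + slope * t
deviation = sales.values - trend

# Average deviation per quarter-of-year: a simple derived seasonality feature.
seasonality = (pd.DataFrame({"quarter": sales.index.quarter,
                             "deviation": deviation})
               .groupby("quarter")["deviation"].mean())
print(seasonality.round(2))
```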
54
Extracting features from time series
Seasonality
55
Extracting features from geography
Location – Arguably an important characteristic of virtually every business. The key point is discovering which aspects of location are relevant for a specific problem (similarly to the handset example at the beginning of this chapter). Examples are geographic coordinates, temperature, average income, education, age, etc.
In Portugal, and other countries, data is gathered at different geographic levels, such as
NUTS (Nomenclatura das Unidades Territoriais) 1, 2 and 3, distrito (district), concelho
(municipality), freguesia (town) and quarteirão (block/street).
56
Extracting features from geography
Using geographic IS
One reason for geocoding is to be able to place things on a map, using software such as ArcGIS or Quantum GIS. Representing data in the form of maps can be extremely useful for gaining business insights.
58
Missing Values
Missing data
Missing data may be defined as the case where valid values on one or more variables are not available for analysis. This issue is quite old in the fields of data analysis and the social sciences. An important reason for this is the fact that algorithms for data analysis were originally designed for data matrices with no missing values.
Missing values are arguably one of the worst nightmares data analysts and scientists need to handle in the pre-processing stage. Missing values have many natures and sources. Unfortunately, they also have the power to “ruin” virtually every analysis, as their presence violates almost every method’s assumptions.
59
Missing Values
Missing data
Thus, when one is building a signature table, knowing how to tackle the (very likely to exist) missing values is an important issue. Usually, but not always, missing values present greater challenges in numerical data than in categorical data. The first step is always to identify and recognize missing values as such in the original data source(s), where they may have been deliberately hidden. Data sources in this context may be other datasets from the relational database or questions in surveys.
The need to focus on the reasons for missing data comes from the fact that the researcher must understand the processes leading to the missing data in order to select the appropriate course of action.
60
Missing Values
Unknown or Non-Existent?
A null or missing value can result from either of two quite different situations:
• A value exists, but it is unknown because it has not been captured. Customers really do have
dates of birth even if they choose not to reveal them;
• The variable (e.g., question) simply does not apply. The “Spending in the last six months” field is
not defined for customers who have less than six months of tenure.
This distinction is important because although imputing the unknown age of a customer
might make sense, imputing the non-existent spending of a non-customer does not. For
categorical variables, these two situations can be distinguished using different codes such
as “Unknown” and “Not Applicable.”
61
Missing Values
62
Missing Values
63
Missing Values
So, what can one do about missing values? There are several alternatives.
The goal is always to preserve the information in the non-missing fields so it can
contribute to the model/analysis. When this requires replacing the missing values with an
imputed value, the imputed values should be chosen to do the least harm.
1. Consider doing nothing (some techniques handle missing values very well);
3. Consider imputation (see the sketch below);
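A minimal imputation sketch with pandas, assuming a hypothetical table with a numeric age and a categorical sex column; the missing-value flag preserves the information that the value was absent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [41, np.nan, 28, 35, np.nan],
                   "sex": ["M", "F", np.nan, "M", "F"]})

# Keep a flag so the model can still "see" that the value was missing.
df["age_missing"] = df["age"].isna().astype(int)

# Numeric variable: impute with the median (less sensitive to outliers than the mean).
df["age"] = df["age"].fillna(df["age"].median())

# Categorical variable: an explicit "Unknown" category instead of an imputed value.
df["sex"] = df["sex"].fillna("Unknown")
print(df)
```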
64
Outliers
What is an outlier?
65
Outliers
66
Outliers
There are several remedies for coping with outliers. These inevitably vary with:
1. The type of the outlier;
2. The data in the dataset;
3. The analytic methods to be employed in the analysis stage;
4. The distribution of the respective variable;
5. The “philosophic” approach, i.e., the problem.
67
Outliers
As the name describes, one defines a floor and a ceiling; data points falling outside them are deleted or otherwise addressed (e.g., 16 <= age <= 100; 0 € <= monthly income <= 10,500 €).
[Figure: frequency histogram (Histograma de Frequência) illustrating the floor and ceiling thresholds.]
68
Outliers
Remedies for outliers – Statistical criterion
Definition, variable by variable, of the potential minimum and maximum allowed values, based on the distribution and context.
[Figures: frequency histograms used to set the minimum and maximum thresholds.]
69
Outliers
Tukey’s boxplot can be used to identify outliers in a Gaussian variable. This method
should be used with caution.
70
Outliers
If the variable follows a Normal distribution, then one can use standard deviations to define outliers’ thresholds (see the sketch below).
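A minimal sketch of both univariate criteria — Tukey's fences and a 3-standard-deviation rule — on a synthetic income variable with a few injected extreme values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(np.r_[rng.normal(3000, 600, 995),
                         [25_000, 40_000, 18_000, 22_000, 30_000]])

# Tukey's fences: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
tukey_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

# Standard-deviation rule (only sensible if the variable is roughly Normal):
# flag points more than 3 standard deviations from the mean.
z = (income - income.mean()) / income.std()
z_outlier = z.abs() > 3

print(tukey_outlier.sum(), "Tukey outliers;", z_outlier.sum(), "3-sigma outliers")
```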
71
Outliers
Multidimensional outliers
The problem of identifying outliers is much more complex in the case of multivariate outliers. Multivariate outliers are characterized by having admissible values in the individual variables, but not in two or more considered jointly. Sadly, these are usually the most interesting ones.
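The slide does not prescribe a method, but one common approach to multivariate outliers is the Mahalanobis distance; a minimal sketch on synthetic height/weight data follows (the 0.999 chi-square cutoff is an assumption):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[170, 70], cov=[[80, 40], [40, 60]], size=500)
# A point admissible on each variable separately (tall AND very light) but odd jointly.
X = np.vstack([X, [190, 45]])

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.array([mahalanobis(x, center, cov_inv) ** 2 for x in X])

# Under approximate multivariate normality, d^2 ~ chi-square with p degrees of freedom.
threshold = chi2.ppf(0.999, df=X.shape[1])
print("flagged rows:", np.where(d2 > threshold)[0])
```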
72
Data: Too much of a good thing?
However, as with most things in life, “the dose makes the poison”. In this context, the problem of too much data is much more related to an excess of variables than of observations. The first causes deep trouble to most algorithms and (human) approaches, whereas the second mostly affects performance in computing time, which is not really (that big of an) issue anymore.
When you have too many variables, the input data is very likely to be sparse: many variables are very often zero or missing. Many techniques can’t cope with this. Moreover, too many variables are likely to yield overfitting, i.e., models will memorize instead of learn the underlying patterns, as it is much more likely that spurious nuances exist due to the number of possibilities in the observations’ distribution across the space.
Note that the problem of too much data arises almost always in the context of predictive modeling/analysis.
73
Data: Too much of a good thing?
Several ways exist to counter the threats caused by a high number of dimensions, i.e., to reduce their number. A simple one is selecting only the variables with the highest explanatory power with respect to the target variable. Principal components analysis is another alternative; it combines the original variables to create new (composite) ones that condense as much variance as possible into the smallest possible number of new variables.
[Figure: Original Data → Reduced Data]
74
Problems with too many variables
While variables are necessary to find patterns in data, an excess of them may also cause:
1. Risk of correlation among input variables -> Multicollinearity;
2. Risk of overfitting -> Models will memorize rather than learn the patterns;
3. Sparse data -> Too many zeros…
75
Handling Sparse Data
What is the remedy for having too many variables (≈ sparse data)?
76
Handling Sparse Data
Many techniques exist to reduce the number of variables. They differ on whether or not they use the target variable, and on whether they keep a subset of the original variables or derive new ones.
• Using (or not) the target variable: most feature selection methods use the target variable to select the best input variables. A possible problem is that the target is leaked into the input variables which, anyway, is not a big concern if one uses a validation set. Moreover, in data mining problems, contrary to economic ones, the main goal is usually prediction and not the explanation of effects;
• Original vs. new derived variables: the main advantage of using (a subset of) the original variables is understandability. However, some methods that yield new ones can perform quite well, capturing most of the information in the original ones.
77
Reducing the Number of Variables
One possible way to create the best model, given a set of input variables, is to exhaustively try all combinations. Doing so would definitely find the best model, but it is virtually impossible: with p candidate variables there are 2^p − 1 possible subsets.
78
Reducing the Number of Variables
Selection of Features
The most popular way to select which of many input variables to use is via sequential selection methods. That is, one variable at a time is considered for either inclusion in or exclusion from the model (a sketch follows the list below):
• Forward Selection;
• Stepwise Selection;
• Backward Selection.
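A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector on synthetic data (the logistic regression estimator and the choice of 5 variables are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a signature table with many candidate inputs.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Forward selection: add one variable at a time until 5 are kept
# (direction="backward" gives backward elimination instead).
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected columns:", selector.get_support(indices=True))
```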
79
Reducing the Number of Variables
Principal Components
The first principal component is a slight variation on the best-fit line: it minimizes the sum of the squares of the perpendicular distances from the data points to the line, not only the vertical distances, as linear regression does (see the sketch below).
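A minimal sketch of principal components for dimensionality reduction with scikit-learn, on synthetic correlated inputs; the 90% retained-variance threshold is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated inputs

# Standardize first (PCA is scale-sensitive), then keep enough components
# to retain 90% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_std)

print("kept", pca.n_components_, "components explaining",
      round(pca.explained_variance_ratio_.sum(), 3), "of the variance")
```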
80
Reducing the Number of Variables
Variable Clustering
Variable clustering goes beyond the feature selection techniques mentioned up to this point, as it introduces the notion that the input variables themselves have a structure. This structure works in the same way as hierarchical cluster analysis does for observations – one can choose the number of variable clusters to keep in the model.
For each cluster of variables selected, a principal components analysis is performed, and the variables belonging to that cluster are replaced by the resulting component.
81
Bibliography
References
• Linoff, G. S., & Berry, M. J. A. (2011). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (pp. 655–774).
• Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate Data Analysis (pp. 40–62).
• Graham, J. W. (2012). Missing Data: Analysis and Design.
• Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). New York: Wiley.
• McGill, R., Tukey, J. W., & Larsen, W. A. (1978). Variations of Box Plots. The American Statistician, 32(1), 12–16.
82
Thank you!