
Marketing Engineering & Analytics

2021/2022

Frederico Cruz Jesus


[email protected]
Chapter 1 – Data preprocessing & exploratory analysis

1. Why data preprocessing?


2. Preparing the customer signature
3. Deriving new variables
4. Missing values
5. Outliers and sparse data

2
Why data preprocessing?

3
Why data preprocessing?

Apps
Sensors

Loyalty programs

Credit cards
Online shop

4
Why data preprocessing?

“Data have swept into every industry and business function and are now an
important factor of production, alongside labor and capital.”
McKinsey Global Institute, 2011

“Data is the oil of the 21st Century – Analytics is the engine”


Virginia Rometty, CEO, IBM, 2013

“The world’s most valuable resource is no longer oil, but data”

The Economist, 2017

5
Why data preprocessing?

Data Mining virtuous cycle (Source: Linoff et al., 2013):

1. Identify opportunities
2. Transform data
3. Act
4. Measure results
6
Why data preprocessing?

Virtuous cycle (2. Transform data)

Data is never clean!

7
Why data preprocessing?

Predictive vs. Descriptive Techniques


Data Mining

Predictive modelling*: to learn a decision criterion that allows us to classify new and
unknown examples;

Descriptive modelling**: to describe, to provide a summary of, a data set.

* aka supervised learning

** aka unsupervised learning
8
Why data preprocessing?

Predictive Techniques
Each row is an observation and each column a variable; Insurance Costs is the target variable.

Height  Weight  Sex  Age  Income  Physical Activity  Insurance Costs
1.60    79      M    41   3000    Y                  N
1.72    82      M    32   4000    Y                  N
1.66    65      F    28   2500    N                  N
1.82    87      M    35   2000    N                  S
1.71    66      F    42   3500    N                  S

9
Why data preprocessing?

Descriptive Techniques

Each row is an observation and each column a variable; there is no target variable.

Height  Weight  Sex  Age  Income  Physical Activity
1.60    79      M    41   3000    Y
1.72    82      M    32   4000    Y
1.66    65      F    28   2500    N
1.82    87      M    35   2000    N
1.71    66      F    42   3500    N

10
Why data preprocessing?

Overall assumptions/requirements of analytic models

Assumption / Requires:           KMeans     Decision Trees  Regression  Neural Networks  K-NNs
Quantitative variables           Yes        No              Desirable   Desirable        Yes
Normally distributed variables   No         No              Yes         No               No
Linearity of effects             N/A        No              Yes         No               N/A
Homoscedasticity                 N/A        No              Yes         No               N/A
Independence of error terms      N/A        No              Yes         No               N/A
Low multicollinearity            Desirable  No              Yes         Yes              N/A
Only non-missing values          Yes        No              Yes         Yes              Yes
Absence of outliers              Yes        No              Yes         Yes              Yes

11
Preparing the customer signature

What are signatures?

This chapter focuses on finding the right individual unit (e.g., customers, credit cards,
households, etc.) in data, by gathering the traces they leave when they interact with an
organization and its IT/IS.

In most cases the observation signatures, also known as analytic-based tables (ABTs), are
the data in data mining/analytics. In other words, they are the data in the right format to
be used by most methods. They also define, for example, whether the task will be of a
predictive or descriptive nature.

12
Preparing the customer signature

Finding “customers” in Data

Signature tables, or analytic-based tables, have one row for each example of whatever is
being studied. If the level of analysis is a town, we need a town-signature table; if the
level of analysis is a credit card account, we need an account-signature; and so on.

Although many types of data other than customer-related data can be analysed and
monetized, the remainder of this chapter will mostly consider customers as the unit of
interest in the signature table. It should be noted that sometimes the level of analysis is
something other than a customer, even when, ultimately, it is customers who are of
interest (e.g., developing an insurance policy-signature table for cross-selling purposes).

13
Preparing the customer signature

What is a customer?

The question seems simple, but the answer depends on the business goal. Thus, extracting
and treating data involves as many business (subjective) decisions as technical (objective)
ones.

Consider for example, three types of roles that one (a so-called customer) may have:
• Payer;
• Decider;
• User.

Imagine you work for a travel agency that sells travels to business customers. Who is your
customer? And if your company sells multimedia content to individual customers (might
be underaged)?

14
Preparing the customer signature

Accounts? Customers? Households?

Besides the problem of the customer's role(s), defining who the customer is also depends
on the available (transactional) data, as this is usually the main source of data available
in an organization.

Different possibilities for the transaction data exist, each leading to different approaches.
• All transactions are anonymous.
• Some transactions can be tied to a particular credit or debit card, but others cannot; some cards
can be tied to specific customers.
• Many transactions can be tied to a cookie; some cookies can be tied to accounts.
• All transactions can be tied to a particular account.
• All transactions can be tied to a particular identified customer.
• All transactions can be tied to a particular customer and household.

15
Preparing the customer signature

Anonymous Transactions

The cashless society keeps failing to arrive. When transactions are anonymous, it is only
possible to learn as much about a customer as is revealed in a single transaction. An
automated couponing system can use the contents of a single purchase transaction to
decide which of several coupons to print, for example.

Anonymous transactions, along with sociodemographic data for a store's geographical area,
can be used to assess, e.g., the effect of age on sales. Anonymous transactions can also be
used to create association rules, which can generate powerful insights (e.g., beer and
diapers). Association rules are about products, not customers.

16
Preparing the customer signature

Designing signatures

Preparing the signature is quite often the most time-consuming part of data mining /
analytics projects. However, this only holds for the first time, if things are done properly.
Although, as time goes by, analysis may suggest new elements to add to the signature,
most of the work clearly lies in building it the first time.

The design of a customer signature is best approached iteratively, one data source at a
time, starting with the sources we know best.

17
Preparing the customer signature

Is a signature always necessary?

Not every analysis requires a customer (or other) signature. In an analytics project, data can
be turned into information "just" by conducting hypothesis testing or by applying ad hoc
SQL queries to pre-existing database tables.

Nevertheless, when the time comes to build (descriptive or predictive) models, a signature is
usually required. A formal signature, or more precisely, the code to produce it, is what
makes model training and scoring a repeatable process.

All the time invested in designing and writing signatures pays off handsomely when the
time to model arrives.

18
Preparing the customer signature

What does each row represent? Granularity

This is the first decision to be made when one is designing a customer signature – the
level of granularity. As mentioned earlier, is it a household, which may include several
individuals, each potentially with several accounts? Or is it an individual? Is it a policy? Or
is it a user?

Making this decision is not always easy because different business needs result in
questions at each level. Often, it makes sense to create multiple versions of the signature
at different levels.

The most basic level is an account signature, followed by a customer signature and then a
household one. Complexity usually lies in the account signature, which is sometimes the
same as the customer and household ones.

19
Preparing the customer signature

Householding

Building signatures at the household level requires some mechanism for deciding which
individuals belong together — a process called householding. It involves business rules for
matching names, addresses, and other identifying features, all of which may be
misspelled, out of date, or inconsistent (with techniques addressed later in this course).

It is also worth noting that different householding rules are appropriate for different cases
(e.g., customer value model vs cross-sell for mobile phone plans).

For both mailing and modeling, there are also risks involved with householding, which
may only be tackled with complex data treatment methods.

Another important aspect of the householding process is that households change over
time, at an increasingly fast pace.

20
Signature’s fields example for retail company

Customer signature for retail (source: Linoff et al, 2011 pp 682)

This signature for catalogue shoppers was developed for a catalogue retailer. The primary
business goal was to increase the response rate for catalogue mailings. The signature also
supports modelling order size because that figure is also available in the target
timeframe. The signature is in the predictive modelling style, with the target variables in a
later timeframe than the explanatory variables.

Note: The Respond and Order Size fields are both in the target timeframe. For any given model, one is
used as the target and the other is ignored because it contains information from the target
timeframe.

21
Signature’s fields example for retail company

Customer signature for retail (source: Linoff et al, 2011 pp 682)

22
Signature’s fields example for retail company

Customer signature for retail (source: Linoff et al, 2011 pp 682)

The signature includes two two-year quarterly time series to capture customer spending
over time. Unfortunately, most customers have made very few orders. In fact, the most
common number of orders is one, with two as a distant second. That is why the signature
does not include trend variables which would ordinarily be used to summarize a time
series.

23
Preparing the customer signature

Additional questions

1. Will the signature be used for predictive modeling?

2. Are there constraints imposed by the particular techniques to be employed?

3. Has a target been defined?

4. Which customers will be included?

24
Preparing the customer signature

For predictive modelling


Target
Observation Variable
Variable

Height Weight Sex Age Income Physical Insurance


Activity Costs
1.60 79 M 41 3000 Y N
1.72 82 M 32 4000 Y N
1.66 65 F 28 2500 N N
1.82 87 M 35 2000 N S
1.71 66 F 42 3500 N S

25
Preparing the customer signature

For descriptive modelling

Observation Variable

Height Weight Sex Age Income Physical


Activity
1.60 79 M 41 3000 Y
1.72 82 M 32 4000 Y
1.66 65 F 28 2500 N
1.82 87 M 35 2000 N
1.71 66 F 42 3500 N

26
What signatures look like

What signatures look like

➢ Each row is identified by a unique primary key that identifies the customer, household,
or whatever is described by each row of the signature table. If such a key does not
already exist, it should be created. Should be long-term stable!

Source: Linoff (2011)

27
Process for creating signatures

Process for creating signatures

Transformations include simply copying, aggregating transactions, pivoting, translating
dates into tenures, table lookups, and combining or deriving variables (covered in a later
subsection).

Source: Linoff (2011)

28
Process for creating signatures

Some data is already at the right level of granularity

Some data sources may already contain data stored at the right level (usually customer)
that can simply be copied directly into the signature without additional transformation.
This is a good place to start.

Pivoting time series

Many businesses send out bills on a regular schedule (phone or insurance companies).

Source: Linoff (2011)

29
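A minimal sketch (not from the slides) of pivoting a regular billing time series into one row per customer, assuming a hypothetical long-format pandas DataFrame `bills` with columns `customer_id`, `month`, and `amount`:

```python
import pandas as pd

# Hypothetical long-format billing data: one row per customer per month
bills = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "month": ["2021-01", "2021-02", "2021-03", "2021-01", "2021-03"],
    "amount": [30.0, 32.5, 31.0, 55.0, 60.0],
})

# Pivot so each customer becomes one row and each month one column,
# ready to be joined into the customer signature
pivoted = (bills.pivot_table(index="customer_id", columns="month",
                             values="amount", aggfunc="sum")
                .add_prefix("bill_"))
print(pivoted)
```

Months with no bill become missing values, which then have to be handled as discussed later in this chapter.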
Process for creating signatures

Aggregating Time-Stamped Transactions

Transaction data is typically the most voluminous and potentially the most fruitful source
for the customer signature. After all, transactions are where actual customer behavior is
captured.

Creating regular time series

Aggregating transactions creates a regular time series that can be pivoted into the
customer signature as described earlier. Business goals determine which features of the
transaction data are most useful, but some typical aggregations exist. When there are
multiple transaction types (e.g., in a bank such as wire transfers, deposits and
withdrawals, cash and credit card) these fields are grouped by transaction type:
• Largest transaction; Smallest transaction; Average transaction; Number of
transactions; Most recent transaction; Earliest transaction.

30
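A minimal sketch (not from the slides) of the aggregations listed above, grouped by transaction type, assuming a hypothetical transactions DataFrame `tx` with columns `customer_id`, `tx_type`, `date`, and `amount`:

```python
import pandas as pd

# Hypothetical transaction data
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "tx_type": ["deposit", "withdrawal", "deposit", "deposit", "wire"],
    "date": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-03-01",
                            "2021-01-20", "2021-02-28"]),
    "amount": [100.0, 40.0, 250.0, 80.0, 500.0],
})

# Per customer and transaction type: largest, smallest, average, count,
# earliest and most recent transaction
agg = (tx.groupby(["customer_id", "tx_type"])
         .agg(largest=("amount", "max"), smallest=("amount", "min"),
              average=("amount", "mean"), n_tx=("amount", "count"),
              earliest=("date", "min"), most_recent=("date", "max")))

# Unstack the type level so each customer is a single signature row
signature = agg.unstack("tx_type")
print(signature)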
Process for creating signatures

Irregular and rare transactions

When transactions occur infrequently and at irregular intervals, one needs to choose
between a very wide aggregation period and accepting many zeroes.

Creating transaction summary

Besides time series data, it is useful to summarize transactions over the customer's lifetime:
• Time since first transaction;
• Time since most recent transaction;
• Proportion of transactions at each location;
• Proportion of transactions by time of day;
• Proportion of transactions by channel;
• Largest transaction;
• Smallest transaction;
• Average transaction.

31
Types of measurement scales

Measurement is a process by which numbers or symbols are attached to given
characteristics or properties. For example, customers may be described with respect to
many characteristics, such as age, education, income, sales, revenue, costs, brand
preferences, etc.

Appropriate measurement scales can be used to measure these characteristics. Although
the following specification is not completely consensual in the statistical literature, in this
course we will use the following definition:

Measurement scales can be classified into one of the following:

Nonmetric scales:
• Nominal;
• Ordinal.

Metric scales:
• Interval;
• Ratio.

32
Types of measurement scales

Properties

33
Derived variables

Derived variables

Until this point, we have focused on getting data prepared to be analyzed (either using
exploratory or explanatory techniques). From this point forward we will focus on
enhancing data, not by adding other data source(s) but rather by transforming and/or
creating new variables based on the now-existing (preliminary) version of the signature
table (ABT).

Creating new (derived) variables is arguably the most human and creativity-demanding
process in marketing analytics. These derived variables are not originally present in the
dataset and depend completely on the analyst's insight. Techniques use existing variables, so
if one can add new and relevant ones, techniques will inevitably perform better.

34
Derived variables

Single variables

Creating derived variables includes standard transformations of a single variable, e.g.,
mean correcting, rescaling, standardization, and changes in representation.

These transformations are more technical than creative.

35
Derived (single) variables

Single-variable transformations

Although making up new transformations may be fun, it is not always required. Some
transformations are done routinely to increase the usability of data. Among the most
common single-variable transformations are:
• Mean-correcting (centering) variables: x*_ij = x_ij − x̄_j (subtract the variable's mean);

• Standardizing (rescaling or normalizing) variables: x*_ij = (x_ij − x̄_j) / s_j (divide the
  centered value by the variable's standard deviation).

The main objective is to put variables measured in incompatible units onto a common
scale, so that a one-unit difference in a given variable has the same effect as the same
difference in another. This is especially important for techniques, such as clustering and
memory-based reasoning, that depend on the concept of distance, but also in linear
regression to yield standardized betas – although in regression this transformation does
not affect the model's performance.
36
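A minimal sketch (not from the slides) of the two transformations above, assuming a small hypothetical numeric pandas DataFrame `df`:

```python
import pandas as pd

# Hypothetical numeric variables on very different scales
df = pd.DataFrame({"age": [41, 32, 28, 35, 42],
                   "income": [3000, 4000, 2500, 2000, 3500]})

centered = df - df.mean()                    # x*_ij = x_ij - mean_j
standardized = (df - df.mean()) / df.std()   # x*_ij = (x_ij - mean_j) / s_j

print(standardized.round(2))
```

After standardization, a one-unit difference means "one standard deviation" for every variable, so age and income contribute comparably to distance-based techniques.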
Derived (single) variables

Turning numeric variables into percentiles

A good example of percentiles is the assessment of babies' growth. Saying that a baby is in
the 95th percentile for length, weight, or another measure is a more effective way of
describing these attributes than saying "the baby is 60 cm long". Apart from pediatricians,
few people know whether that is big or small for a three-month-old. The absolute
measurement does not convey as much information as the fact that this baby is bigger than
95% of others. Moreover, percentiles also allow one to compare a child's evolution over time
(for health, rather than "bigger is better", percentiles are expected to remain stable).

Converting numeric values into percentiles has many of the same advantages as
standardization. It translates data into a new scale that works for any numeric measure.
They also do not require knowledge about the underlying distribution.

37
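A minimal sketch (not from the slides) of converting a numeric variable into percentiles, assuming a hypothetical pandas Series `spend`:

```python
import pandas as pd

spend = pd.Series([10, 250, 40, 900, 55, 55])  # hypothetical spending values

# Percentile rank: share of observations at or below each value, scaled to 0-100
percentile = spend.rank(pct=True) * 100
print(percentile)
```

The result depends only on the ordering of the values, so no assumption about the underlying distribution is needed.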
Derived (single) variables

Turning Counts into Rates

Many databases contain counts: number of purchases, number of calls to customer


service, number of times late, number of catalogs received, and so on. Often, when these
tallies are for events that occur over time, it makes sense to convert the counts to rates
by dividing by some fixed time unit to get calls per day or withdrawals per month. This
allows customers with different tenures to be compared.

Note that a customer with 10 purchases in the one month since acquisition has the same
rate as one who bought 20 times in two months. It is fundamental that the new variable
reflects that.

A word of caution: When combining data from multiple sources, you must be sure that
they contain data from the same timeframe.

38
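A minimal sketch (not from the slides), assuming hypothetical columns `n_purchases` and `tenure_months` in a customer DataFrame `cust`:

```python
import pandas as pd

cust = pd.DataFrame({"n_purchases": [10, 20, 3],
                     "tenure_months": [1, 2, 12]})  # hypothetical values

# Purchases per month: a 10-purchase/1-month customer gets the same rate
# as a 20-purchase/2-month one
cust["purchase_rate"] = cust["n_purchases"] / cust["tenure_months"]
print(cust)
```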
Derived (single) variables

Replacing Categorical Variables with Numeric Ones

Numeric variables are sometimes preferable to categorical ones. Certain modeling


techniques — including regression, neural networks, and most implementations of
clustering — prefer numbers and cannot readily accept categorical inputs. Even when
using decision trees, which are perfectly capable of handling categorical inputs, there may
be reasons for preferring numeric ones.

A common mistake is to replace categorical values with arbitrary numbers, which leads to
spurious information that algorithms have no way to ignore.

39
Derived (single) variables

Replacing Categorical Variables with Numeric Ones

According to the table below, Alaska is very close to Alabama and Arizona (which is false).
This is meaningless, but that is what the numbers say, and data mining techniques cannot
ignore what the numbers say. A good idea would be, e.g., to replace each state's code with
its population, GDP, or another variable of interest, i.e., one hypothesized as relevant to the
problem in question.

Another popular approach is to create a separate binary variable for each category
(1 = yes; 0 = no). This works well when you have only a few categories, but representing
U.S. states this way would require around 50 indicator variables (one per state, or 51 if the
District of Columbia is counted).
Source: Linoff and Berry (2011)

40
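A minimal sketch (not from the slides) of both approaches mentioned above: replacing a state code by a meaningful numeric attribute and creating indicator (dummy) variables. The population figures are illustrative, not official:

```python
import pandas as pd

df = pd.DataFrame({"state": ["AK", "AL", "AZ", "AL"]})

# Approach 1: replace the code by a numeric attribute relevant to the problem
population = {"AK": 0.73, "AL": 5.03, "AZ": 7.15}  # millions, approximate/illustrative
df["state_population_m"] = df["state"].map(population)

# Approach 2: one binary indicator per category (works best with few categories)
dummies = pd.get_dummies(df["state"], prefix="state")
df = pd.concat([df, dummies], axis=1)
print(df)
```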
Derived (single) variables

Binning/discretization

Contrary to the previous case, quite often the desired outcome is to (efficiently)
transform numerical variables into categorical ones (a process called binning or
discretization). Some analytic techniques (e.g., lookup models, chi-square association
tests and naïve Bayesian models) only work with categorical data. Others (e.g., decision
trees) can, in some cases, be more efficient working with this type of variable.

This process consists of dividing the range of a numeric variable into a set of
bins/modalities, replacing its (original) value by the bin or modality. This is also extremely
useful for highly skewed variables, as outliers are all placed together in one bin
(e.g., recency). Nevertheless, dealing with outliers will be addressed later in this chapter.

41
Derived (single) variables

Binning/discretization

The three major approaches to creating bins are:


1. Equal width binning
2. Equal weight binning
3. Supervised binning

They differ on how the ranges are set. The histogram in the next slide presents an
example of a distribution of months since the last purchase (recency). This variable is
often used in descriptive or predictive techniques as it reflects an interesting firmographic
characteristic of customers.

42
Derived (single) variables

Binning/discretization

Source: Linoff and Berry (2011)

43
Derived (single) variables

Binning/discretization – Equal width and equal weighted binning

The upper histogram uses five bins of equal width (range); the lower one uses five bins of
equal size (frequency). Which one is more advisable?

Source: Linoff and Berry (2011)

44
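A minimal sketch (not from the slides) of equal-width and equal-weight (equal-frequency) binning, assuming a hypothetical recency Series measured in months:

```python
import pandas as pd

recency = pd.Series([1, 2, 2, 3, 5, 8, 13, 24, 36, 60])  # months since last purchase

# Equal-width binning: five bins covering equal ranges of the variable
equal_width = pd.cut(recency, bins=5)

# Equal-weight binning: five bins with roughly equal numbers of customers
equal_weight = pd.qcut(recency, q=5, duplicates="drop")

print(pd.DataFrame({"recency": recency,
                    "equal_width": equal_width,
                    "equal_weight": equal_weight}))
```

With a right-skewed variable such as recency, equal-width bins leave most customers in the first bin, whereas equal-weight bins spread them evenly.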
Derived (single) variables

Binning/discretization – Supervised binning

A more sophisticated approach would be to use supervised methods to produce the
five bins' optimal thresholds (in the case of explanatory/predictive analysis).

Source: Linoff and Berry (2011)

45
Derived (single) variables

Spreading the Histogram


Some transformations can be employed to correct variables' distributions. Look at the example below
(monetary: amount spent by the customer in the last 18 months).

46
Derived (single) variables

Correcting right-skewed distributions

One can use the square root, log(x), or ln(x) to correct right-skewed variables. The square root is a
softer correction, whereas log(x) and ln(x) are stronger ones.

Note: Mind the X axes values.

47
Derived (single) variables

Spreading the Histogram

On the other hand, if we are interested in correcting left-skewed distributions, we can simply use the
power transformation y = x^γ (with an exponent γ greater than 1, e.g., squaring).

48
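A minimal sketch (not from the slides) of the corrections discussed in the last two slides: square root and log for right-skewed variables, and a power (here, squaring) for left-skewed ones. The Series values are hypothetical:

```python
import numpy as np
import pandas as pd

monetary = pd.Series([5, 12, 20, 45, 80, 150, 400, 1200])  # right-skewed, hypothetical

right_sqrt = np.sqrt(monetary)   # softer correction
right_log = np.log1p(monetary)   # stronger correction; log1p also handles zeros safely

score = pd.Series([2.0, 7.0, 8.0, 8.5, 9.0, 9.5, 10.0])  # left-skewed, hypothetical
left_power = score ** 2          # y = x^gamma with gamma > 1 spreads the upper tail

print(right_sqrt.skew(), right_log.skew(), left_power.skew())
```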
Derived (combined) variables

Combining variables

One of the most common ways of creating derived variables is by combining two or more
existing ones to expose information that is not present in any of them alone.

Quite commonly, variables are combined by dividing one by the other, as in an existing
credit-payments-to-income ratio, or sales of a specific product type divided by total
spending; but products, sums, differences, squared differences, and even more imaginative
combinations also prove useful.

49
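A minimal sketch (not from the slides) of a ratio-type derived variable with a guard against division by zero, using hypothetical column names and values:

```python
import numpy as np
import pandas as pd

cust = pd.DataFrame({"credit_payments": [300, 0, 450],
                     "income": [3000, 2500, 0]})  # hypothetical values

# Effort-rate style ratio; a zero income is treated as missing to avoid division by zero
cust["payments_to_income"] = cust["credit_payments"] / cust["income"].replace(0, np.nan)
print(cust)
```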
Derived (combined) variables

Classic Combinations – BMI as an example

Some derived variables have become so well known that it is easy to forget that someone
once had to invent them. Insurance companies track loss ratios, banks track effort rates,
investors pick stocks based on price/earnings ratios, etc.

A good example is the body mass index (BMI): weight / height². The histogram below
depicts the association between type II diabetes and BMI.

Source: Linoff and Berry (2011)


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890993/

50
Combining Highly Correlated Variables

Combining Highly Correlated Variables

In business data, having many strongly correlated variables is relatively common.
Although for some techniques (e.g., principal components analysis or decision trees) this
does not affect performance, for others (e.g., clustering algorithms and linear regression)
it does.

Consider the case in which a retailer has two variables about their customers: sales and
net sales. The difference between them is that the latter accounts for returned items (e.g.,
within two weeks) and non-supplied ones that were out of stock. Assume these two
variables have a correlation of 0.99. Which one would you use? Is there a way to use both?
And if so, is it worth it? Answer: Consider looking at the cases where the difference
between the two variables is large!

51
Combining Highly Correlated Variables

The Degree of Correlation

Another interesting thing to look at with correlated variables is whether the degree or
direction of correlation is different in different circumstances, or at different levels of a
hierarchy. For example:
• For new customers, credit score and involuntary attrition are strongly correlated, but the
correlation is weaker for established customers;
• For people under 18, age and height are highly correlated; for older people they are not;
• GDP per capita is highly correlated with ICT adoption, but almost exclusively for developing
countries.

52
Extracting features from time series

Extracting features from time series

Most customer (or other unit) signature tables include time series data (e.g., calls sent and
received, quarterly spending, etc.). A difficulty with the time series of quarterly spending is
that, at the customer level, it is very sparse. Any one customer has very few quarters with a
purchase. Thus, it does not make sense to ask whether a customer's spending is increasing
or decreasing over time.

Thus, how can one extract informative variables from time series data?

53
Extracting features from time series

Seasonality

In the previous slide it is noticeable that sales in the 4th quarter are stronger than in any
other. While this pattern is obvious (and plausible) for humans, it is not noticeable by
most techniques without the use of derived variables:
• At the very least, one could create a variable indicating whether it is the 4th quarter or not;
• Even better would be to create a categorical variable with the quarter.

The ideal would be to create a variable with the expected effect on sales due to
seasonality (which is not straightforward) – a stationary series.

One could represent the quarters by their average difference from the trendline
(almost all below except the 4th quarters: Q1 = -10.633; Q2 = -37.312; Q3 = 38.923;
Q4 = 82.837) – see next slide.

54
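A minimal sketch (not from the slides) of the derivation just described: fit a linear trendline to quarterly sales and use each quarter's average deviation from it as a seasonal-adjustment variable. The numbers are illustrative, not the ones behind the slide's figures:

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly sales with a strong 4th-quarter effect
sales = pd.Series([100, 80, 120, 210, 115, 95, 140, 235, 130, 110, 160, 260])
quarter = pd.Series([1, 2, 3, 4] * 3)

# Fit a linear trendline over the sequence of quarters
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales, 1)
trend = intercept + slope * t

# Seasonal effect of each quarter = average deviation from the trendline
seasonal_effect = (sales - trend).groupby(quarter).mean()
print(seasonal_effect)  # e.g., Q4 well above the trend, Q1-Q2 below
```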
Extracting features from time series

Seasonality

55
Extracting features from geography

Extracting features from geography

Location – Arguably an important characteristic of virtually every business. The key point
is discovering what aspects of location are relevant for a specific problem (similarly to the
handset example in the beginning of this chapter). Examples are geographic coordinates,
temperature, average income, education, age, etc..

In Portugal, as in other countries, data is gathered at different geographic levels, such as
NUTS (Nomenclatura das Unidades Territoriais) 1, 2 and 3, distrito (district), concelho
(municipality), freguesia (civil parish) and quarteirão (block/street).

56
Extracting features from geography

Using geographic IS

One reason for geocoding is to be able to place things on a map, using software such as ArcGIS
or Quantum GIS. Representing data in the form of maps can be extremely useful for gaining
business insights.

Source: Cruz-Jesus et al. (2017)


57
Missing Values

58
Missing Values

Missing data

Missing data may be defined as the case where valid values on one or more variables are
not available for analysis. This issue is quite old in the field of data analysis and social
sciences. An important reason for this is the fact that algorithms for data analysis were
originally designed for data matrices with no missing values.

Missing values are arguably one of the worst nightmares data analysts and scientists need
to handle in the pre-processing stage. Missing values have many natures and sources.
Unfortunately, they also have the power to "ruin" virtually every analysis, as their presence
violates almost every method's assumptions.

59
Missing Values

Missing data

Thus, when one is building a signature table, knowing how to tackle the (very likely to
exist) missing values is an important issue. Usually, but not always, missing values pose
greater challenges in numerical data than in categorical data. The first step is always to
identify and recognize missing values as such in the original data source(s), where they
may have been disguised. Data sources in this context may be other datasets from the
relational database or questions in surveys.

The need to focus on the reasons for missing data comes from the fact that the
researcher must understand the processes leading to the missing data in order to select
the appropriate course of action.

60
Missing Values

Unknown or Non-Existent?

A null or missing value can result from either of two quite different situations:
• A value exists, but it is unknown because it has not been captured. Customers really do have
dates of birth even if they choose not to reveal them;
• The variable (e.g., question) simply does not apply. The “Spending in the last six months” field is
not defined for customers who have less than six months of tenure.

This distinction is important because although imputing the unknown age of a customer
might make sense, imputing the non-existent spending of a non-customer does not. For
categorical variables, these two situations can be distinguished using different codes such
as “Unknown” and “Not Applicable.”

61
Missing Values

Types of missing values

Before any decision pertaining to the handling of missing values is made, it is critical to
characterize them as one of the following:

1. Missing completely at random (MCAR)
The missingness of Y does not depend on X or on Y – "Someone simply did not answer."

2. Missing at random (MAR, i.e., missing conditionally at random)
The missingness of Y depends on X, but not on Y itself – "Men (gender, X) may be more
likely to decline to answer some questions (Y) than women, and not because of the
answer itself."

3. Missing not at random (MNAR)
The missingness of Y depends on Y itself – "Individuals with very high incomes are more
likely to decline to answer questions about their own income (not because they are men,
but because their income is very high/low)."
62
Missing Values

What not to do with missing values

1. Do not throw records away;

2. Do not replace with a "special" numeric value;

3. Do not replace with the average, median, or mode.

63
Missing Values

What to do (at least consider) with missing values

So, what can one do about missing values? There are several alternatives.

The goal is always to preserve the information in the non-missing fields so it can
contribute to the model/analysis. When this requires replacing the missing values with an
imputed value, the imputed values should be chosen to do the least harm.

1. Consider doing nothing (some techniques handle missing values very well);

2. Consider multiple models;

3. Consider imputation;

4. Remember that imputed values should never be surprising.

64
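A minimal sketch (not from the slides) of points 3 and 4 above: impute missing values from similar records, so the imputed values are not "surprising", while keeping missing-value indicator flags. Column names and values are hypothetical, and scikit-learn's KNNImputer is only one of several possible imputers:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [41, np.nan, 28, 35, np.nan],
                   "income": [3000, 4000, np.nan, 2000, 3500]})  # hypothetical

# Keep flags so the fact that a value was missing is not silently lost
df["age_missing"] = df["age"].isna().astype(int)
df["income_missing"] = df["income"].isna().astype(int)

# Impute each missing value from its nearest neighbours on the other variables,
# so the imputed values stay plausible given the rest of the record
imputer = KNNImputer(n_neighbors=2)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)
```

In practice the variables should be standardized before distance-based imputation, as discussed earlier in this chapter.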
Outliers

What is an outlier?

In statistics, an outlier is an observation point that is distant from other observations.

An outlier may be due to variability in the measurement, or it may indicate experimental
error, in which case it is sometimes excluded from the data set. Outliers are extreme cases
in one or more variables and have a great impact on the interpretation of results. They may
come from:
➢ Unusual but correct situations (the Bill Gates effect);
➢ Incorrect measurements;
➢ Errors in data collection;
➢ Lack of, or wrong, codes for missing data.

65
Outliers

Outlier – Leverage effect

66
Outliers

Remedies for outliers

There are several remedies for coping with outliers. These inevitably vary with:
1. The type of the outlier;
2. The data in the dataset;
3. The analytic methods to be employed in the analysis stage;
4. The distribution of the respective variable;
5. The “philosophic” approach, i.e., the problem.

67
Outliers

Remedies for outliers – Automatic limitation / thresholding

As the name describes, one defines a floor and a ceiling; data points falling outside them
are deleted or otherwise addressed (e.g., 16 <= age <= 100; 0 € <= monthly income <=
10,500 €).

[Figure: frequency histograms ("Histograma de Frequência") illustrating automatic thresholds]

68
Outliers

Remedies for outliers – Statistical criterion

Define, variable by variable, the potential minimum and maximum allowed values, based
on the distribution and context.

69
Outliers

Remedies for outliers - Manual limitation / thresholding.

Tukey’s boxplot can be used to identify outliers in a Gaussian variable. This method
should be used with caution.

70
Outliers

Remedies for outliers – 68 / 95 / 99.7 rule

If the variable follows a Normal distribution, then one can use standard deviations to
define outliers' thresholds.

71
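A minimal sketch (not from the slides) combining the two previous slides: Tukey's boxplot fences and a 3-standard-deviation (68/95/99.7) threshold. The Series values are hypothetical:

```python
import pandas as pd

x = pd.Series([7.6, 21, 35, 48, 62, 75, 90, 110, 640, 980])  # hypothetical, right-skewed

# Tukey's boxplot fences: 1.5 * IQR beyond the quartiles
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
tukey_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# 68/95/99.7 rule: flag values more than 3 standard deviations from the mean
# (only meaningful if the variable is roughly Normal)
z_outliers = (x - x.mean()).abs() > 3 * x.std()

print(pd.DataFrame({"x": x, "tukey": tukey_outliers, "z3": z_outliers}))
```

Note how, on skewed data like this, the two rules flag different observations, which is one reason both should be used with caution.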
Outliers

Multidimensional outliers

The problem of identifying outliers is much more complex in the case of multivariate outliers.
Multivariate outliers are characterized by having admissible values in the individual variables,
but not when two or more are considered jointly. Sadly, these are usually the most interesting.

72
Data: Too much of a good thing?

Is more data always good?

However, as with most things in life, "the dose makes the poison". In this context, the problem
with too much data is much more related to an excess of variables than of observations.
The first causes deep trouble for most algorithms and (human) approaches, whereas the
second mostly affects performance in computing time, which is not really (that big of an)
issue anymore.

When you have too many variables, input data is very likely to be sparse: many variables
are very often zero or missing. Many techniques can't cope with this. Moreover, too many
variables are likely to yield overfitting, i.e., models will memorize instead of learning the
underlying patterns, as it is much more likely that small, spurious nuances exist given the
number of ways observations can be distributed across the space.

Note that the problem of too much data arises almost always in the context of predictive
modeling/analysis.

73
Data: Too much of a good thing?

Is more data always good?

Several ways exist to address the threats caused by a high number of dimensions, i.e., to
reduce their number. A simple one is selecting only the variables with the highest
explanatory power for the target variable. Principal components analysis is another
alternative, which combines the original variables to create new (composite) ones that
condense as much variance as possible into the smallest possible number of new variables.

[Diagram: Original Data (observations × features/variables) reduced to Reduced Data with fewer features/variables]
74
Problems with too many variables

Problems with too many variables

Too many variables bring as much bad as good.

While variables are necessary to find patterns in data, in excess they may also cause:
1. Correlation among input variables -> multicollinearity;
2. Overfitting -> models will memorize, not learn, the patterns;
3. Sparse data -> too many zeros…

75
Handling Sparse Data

Handling Sparse Data

What is the remedy for having too many variables (≈ sparse data)?

Some options exist:


• Feature selection;
• Directed variable selection methods;
• Principal Components Analysis;
• Variable Clustering.

76
Handling Sparse Data

Types of Variable’s Reduction Techniques

Many techniques exist to reduce the number of variables. They differ on whether or not
they use the target variable, and on whether they keep a subset of the original variables or
derive new ones.

• Using (or not) the target variable: Most feature selection methods use the target
variable to select the best input variables. A possible problem is that the target is
being leaked into the input variables, which, in any case, is not a big concern if one uses a
validation set. Moreover, in data mining problems, contrary to economic ones, the
main goal is usually prediction rather than explanation of effects;

• Original vs new derived variables: The main advantage of using (a subset of) original
variables is understandability. However, some methods that yield new ones can
perform quite well, capturing most of the information in the original ones.

77
Reducing the Number of Variables

Exhaustive Feature Selection

One possible way to create the best model, given a set of input variables, is to
exhaustively try all combinations. Doing so would definitely create the best model but it is
virtually impossible.

Number of variables    Number of combinations (2^n - 1)


2 3
3 7
4 15
5 31
10 1 023
20 1 048 575
30 1 073 741 823
40 1 099 511 627 775
50 1 125 899 906 842 620
100 1 267 650 600 228 230 000 000 000 000 000

78
Reducing the Number of Variables

Selection of Features

The most popular way to select which of many input variables to use is via sequential
selection methods. That is, one variable at a time is considered for either inclusion in or
exclusion from the model.

Again, several methods exist:

• Forward Selection;

• Stepwise Selection;

• Backward Selection.

Hence, at least one measure of performance is needed

79
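A minimal sketch (not from the slides) of forward selection with scikit-learn's SequentialFeatureSelector (available in scikit-learn ≥ 0.24), on a synthetic feature matrix X and binary target y:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 customers, 20 candidate input variables
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Forward selection: add one variable at a time, judged by the cross-validated
# score of the estimator (accuracy by default for classifiers)
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected variables
```

Setting direction="backward" gives backward elimination instead.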
Reducing the Number of Variables

Principal Components

The first principal component is a slight variation of the best-fit line. It minimizes the sum of the
squares of the perpendicular distances from the data points to the line, not only the vertical
distances as linear regression does.

Linear Regression Principal Components

Source: Linoff and Berry (2011)

80
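A minimal sketch (not from the slides) of reducing a set of standardized variables with principal components analysis in scikit-learn, keeping enough components to explain roughly 90% of the variance:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=10, random_state=0)  # hypothetical

# Standardize first so no variable dominates because of its scale,
# then keep the components that explain ~90% of the total variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape, pca.explained_variance_ratio_)
```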
Reducing the Number of Variables

Variable Clustering

Variable clustering goes beyond the feature selection techniques mentioned up to this
point, as it introduces the notion that input variables have a structure among them.

This structure works in the same way as hierarchical cluster analysis does for observations – one
can choose the number of variable clusters to model.

For each cluster of variables, a principal components analysis is performed, and the
variables belonging to that cluster are replaced by its first component.

81
Bibliography

References
• Linoff, G. S., & Berry, M. J. A. (2011). Data Mining Techniques: For Marketing, Sales, and
  Customer Relationship Management (pp. 655-774). Wiley.
• Hair, J., Black, W., Babin, B., & Anderson, R. (2014). Multivariate Data Analysis (pp. 40-62).
• Graham, J. W. (2012). Missing Data: Analysis and Design. Springer.
• Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). New
  York: Wiley.
• McGill, R., Tukey, J. W., & Larsen, W. A. (1978). Variations of Box Plots. The American
  Statistician, 32(1), 12-16.

82
Thank you!

Address: Campus de Campolide, 1070-312 Lisboa, Portugal


Tel: +351 213 828 610 | Fax: +351 213 828 611
