Information Management School (IMS)
6. DATA PREPARATION
6.1
Introduction
Data preparation
Data preparation
[Figure: the original dataset is transformed into the modeling dataset, also known as the Analytical Base Table (ABT)]
Why data preparation
Due to their size and multiple, heterogeneous sources, real-world databases
commonly have:
§ “Noise” (random error or variance)
§ Missing data
§ Inconsistent data
For this reason, data must be preprocessed and prepared to improve the
efficiency and ease of the mining process.
Forms of data preprocessing and preparation
Han et al. (2012)
6.2
Data cleaning
Data preparation
Objective
Fix variable problems, such as:
§Duplicates
§Redundancy
§Incorrect or miscoded values
§Outliers
§Missing values
Duplicates
It is common for real-world datasets to have duplicate instances, even when
they should not exist (e.g., having two instances of the same customer profile)
§ If instances are exact-match duplicates, with all columns having the same
values, most of the time all duplicated instances could be deleted (except
one, of course)
§ If some columns are not equal (e.g., if there are two customer instances with
the same name, telephone, and address but a different volume of purchases),
aggregations may be required. In this example, sales would need to be
summed up and one of the instances deleted afterwards, as in the sketch below
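A minimal pandas sketch of both cases, using hypothetical customer data (all column names and values are illustrative):

```python
import pandas as pd

# Hypothetical customer data: customer 1 appears twice with different purchase volumes
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "name": ["Ana", "Ana", "Ana", "Bruno"],
    "purchases": [100.0, 100.0, 50.0, 80.0],
})

# Exact-match duplicates: keep a single instance of each
df = df.drop_duplicates()

# Near-duplicates (same customer, different purchases): aggregate, keeping one row
df = df.groupby(["customer_id", "name"], as_index=False)["purchases"].sum()
```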
Redundancy
When two attributes are redundant, one of them should not be included in the
modeling dataset. Removing highly correlated attributes is a typical example
(see the sketch below)
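A minimal sketch of dropping one attribute from each highly correlated pair; the 0.9 threshold is an illustrative choice, not a rule from the slides:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one attribute of each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```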
Incorrect or miscoded values
[Figure: example of a Yes/No field containing miscoded values]
Approaches to handling
outliers
Data cleaning
Approaches to handling outliers (1/6)
Remove from the modeling data
Use when outliers distort the models more than they help, namely in numeric
algorithms such as K-means clustering or Principal Component Analysis (PCA)
Approaches to handling outliers (2/6)
Separate the outliers and create models just for them
§ Relax the definition of outliers from two standard
deviations from the mean to three standard deviations
§ Create a separate model for outliers (e.g., linear
regression)
Approaches to handling outliers (3/6)
Transform the outliers so they
are no longer outliers
§ Apply skew transformation or
normalization techniques to
reduce the distance between
the outliers and the main body
of the distribution
§ Apply a MIN or MAX function
based on "valid" minimum or
maximum values (see the sketch below)
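A minimal sketch of both ideas on a hypothetical salary series; the 1st/99th percentile limits are an illustrative choice of "valid" minimum and maximum:

```python
import numpy as np
import pandas as pd

salary = pd.Series([35_000, 42_000, 51_000, 38_000, 900_000])  # one extreme value

# Skew transformation: the log pulls the outlier towards the main body
salary_log = np.log1p(salary)

# MIN/MAX capping: clip to "valid" limits (here, the 1st and 99th percentiles)
salary_capped = salary.clip(lower=salary.quantile(0.01), upper=salary.quantile(0.99))
```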
Approaches to handling outliers (4/6)
Transform the outlier and
create an indicator column
Apply skew transformation or
normalization techniques as in
the previous approach but,
additionally, create a dummy
column indicating whether the
observation is an outlier (0: no;
1: yes)
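A minimal sketch with a hypothetical salary column; defining outliers as more than 3 standard deviations from the mean follows the earlier slide, but the exact rule is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [35_000.0, 42_000.0, 51_000.0, 900_000.0]})

# Dummy column flagging outliers, here defined as |z-score| > 3 (0: no; 1: yes)
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
df["salary_outlier"] = (z.abs() > 3).astype(int)

# Then transform the variable itself, as in the previous approach
df["salary"] = np.log1p(df["salary"])
```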
Approaches to handling outliers (5/6)
Bin the data (discretize the data)
Because transformations may not tame the most extreme outliers, an
alternative is to transform the numeric variable into a categorical one
(e.g., instead of the salary amount, use low, medium, high)
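A minimal sketch with pd.cut; the bin edges and the salary values are illustrative assumptions:

```python
import pandas as pd

salary = pd.Series([25_000, 40_000, 75_000, 2_000_000])

# Explicit bin edges turn the numeric variable into low/medium/high categories;
# the extreme outlier simply falls into the top bin
salary_bin = pd.cut(
    salary,
    bins=[0, 30_000, 60_000, float("inf")],
    labels=["low", "medium", "high"],
)
```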
Approaches to handling outliers (6/6)
Leave in the data without modification
Use algorithms that are unaffected by outliers, such as decision tree-based
algorithms
Approaches to handling
missing values
Data cleaning
Approaches to handling missing values (1/6)
Listwise and column deletion
§If a small percentage of observations
have columns with missing values, just
remove those observations
§If a specific column has many missing
values, consider removing it
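A minimal pandas sketch of both deletions, on hypothetical data where 'age' is rarely missing and 'fax' is almost entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 33],
    "fax": [np.nan, np.nan, np.nan, "123"],
})

df = df.dropna(subset=["age"])   # listwise deletion: drop the few rows missing 'age'
df = df.drop(columns=["fax"])    # column deletion: 'fax' is almost entirely missing
```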
Approaches to handling missing values (2/6)
Imputation with a constant
§For categorical variables, this is as
simple as filling the missing values with a
value indicating that it is missing (e.g.,
"NULL")
§For numeric variables, if 0 (zero) makes
sense (e.g., bank balance), then fill with
0 (zero). Otherwise, try another
approach, such as mean or median
imputation
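A minimal sketch of both constants, on hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "segment": ["Corporate", np.nan, "SME"],
    "balance": [100.0, np.nan, 250.0],
})

df["segment"] = df["segment"].fillna("NULL")  # categorical: explicit "missing" level
df["balance"] = df["balance"].fillna(0)       # numeric: zero is meaningful for a balance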
Approaches to handling missing values (3/6)
Mean and median imputation (for
continuous variables)
One of the most common approaches for
continuous variables is imputing the mean
value. However, if the distribution is
skewed, the median can be a better choice
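A minimal sketch on a hypothetical right-skewed income series, showing why the median is preferable under skew:

```python
import numpy as np
import pandas as pd

income = pd.Series([1_000, 1_200, 1_100, np.nan, 50_000])  # right-skewed

income_mean = income.fillna(income.mean())      # imputed value is pulled up by the extreme
income_median = income.fillna(income.median())  # more representative of the typical value
```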
Approaches to handling missing values (4/6)
Imputation with distributions
For numeric variables, when a large
percentage of values is missing,
mean/median imputation distorts the
summary statistics. In these cases,
each missing value should be
replaced with a random draw from a
known distribution (fitted to the
variable's distribution)
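A minimal sketch, assuming the variable is approximately normal (the data and the distributional assumption are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
height = pd.Series([170.0, np.nan, 182.0, 165.0, np.nan, 175.0])

# Draw replacements from N(mean, std) estimated on the observed values
mask = height.isna()
height.loc[mask] = rng.normal(height.mean(), height.std(), size=mask.sum())
```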
Approaches to handling missing values (5/6)
Random imputation from the variable's own distribution
For each missing value, randomly
select one of the non-missing
values already present in the
column
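A minimal sketch, sampling with replacement from the observed values of a hypothetical column:

```python
import numpy as np
import pandas as pd

city = pd.Series(["Lisboa", np.nan, "Porto", "Lisboa", np.nan])

# For each missing value, draw (with replacement) from the non-missing values
mask = city.isna()
city.loc[mask] = (
    city.dropna().sample(n=mask.sum(), replace=True, random_state=42).to_numpy()
)
```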
Additional consideration on missing values
Creation of dummy variables
In some cases, the existence of missing values can be informative for the
model. In those cases, besides implementing one of the previous approaches, a
dummy variable can be created to indicate whether there is a missing value
(0: no; 1: yes)
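A minimal sketch on a hypothetical income column; note that the indicator must be created before imputing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1_000.0, np.nan, 1_200.0]})

# Create the indicator BEFORE imputing, otherwise the information is lost
df["income_missing"] = df["income"].isna().astype(int)  # 0: no; 1: yes
df["income"] = df["income"].fillna(df["income"].median())
```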
6.3
Data reduction
Data preparation
Dimensionality reduction
Data reduction
The curse of dimensionality
As the number of candidate variables for modeling increases, the number of
observations must also increase (exponentially) to capture the high-
dimensional patterns. One way to address this problem is to reduce the
number of dimensions
source: http://www.turingfinance.com
Attribute subset selection
Datasets may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant. For example, for segmenting
customers, the telephone number is likely irrelevant
Attribute subset selection types
§Filter: uses statistical tests (Pearson correlation, Chi-squared, etc.), as in the sketch below
§Wrapper: uses ML to select the features to use (forward selection,
backward selection, or another method)
§Embedded: included in the algorithm itself (e.g., Decision trees)
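A minimal sketch of the filter approach with scikit-learn, using the Iris dataset purely for illustration (chi-squared requires non-negative features, which holds here):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Filter approach: keep the 2 features with the highest chi-squared score
X_selected = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```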
Techniques for dimensionality reduction
§Principal Components Analysis (PCA): reduces dimensionality,
while retaining as much variance in data as possible (finds a new
set of variables that are a linear combination of the original
variables)
§Kernel PCA (KPCA): nonlinear variation of PCA
§Linear Discriminant Analysis (LDA): supervised learning
method that transforms a set of features into a new set that best separates the classes
§Singular Value Decomposition (SVD): extracts the important features
from the data while reconstructing the original dataset into a smaller
one (e.g., transforming a 1 024-pixel image into 66 pixels)
§Among others
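A minimal PCA sketch with scikit-learn, again using the Iris dataset only as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # reduce 4 dimensions to 2
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance retained by each component
```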
Numerosity reduction
Data reduction
Numerosity reduction (1/2)
Methods:
§Aggregations: aggregate the data at a different unit of analysis
(e.g., weekly data instead of daily data)
§Clustering: cluster representations of the data are used to replace
the actual data
§Parametric data reduction: regression and log-linear models are
used to "predict" an output based on a set of inputs (e.g., using
multivariate linear regression to transform a set of variables into only
one)
Numerosity reduction (2/2)
Methods (cont.):
§Sampling: represent the data with a random sample (RS) of the instances
(e.g., a random sample of size s = 4)
Beware of survivorship bias: concentrating on the instances
that passed some selection or sampling process and
overlooking those that did not
(one focuses on what one can see and
ignores what one cannot see)
6.4
Data transformation
Data preparation
Normalization
Data transformation
Normalization
§ Some algorithms, such as the K-MEANS algorithm, have
difficulty in covering variables in very different ranges (e.g., age
in the range of [15, 80] and salary in the range [30 000, 80 000]
§ Linear regression coefficients are also influenced
disproportionately by the large values of a skewed distribution
§ Normalization can make a continuous variable fall within a
specific range while maintaining the relative differences between
the values for the variable
Common normalization techniques
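The original slide presents these techniques graphically; as a minimal sketch, two of the most common ones, min-max normalization and z-score standardization, can be computed directly (the values are illustrative):

```python
import pandas as pd

age = pd.Series([15, 40, 80])

# Min-max normalization: x' = (x - min) / (max - min), maps to [0, 1]
age_minmax = (age - age.min()) / (age.max() - age.min())

# Z-score standardization: x' = (x - mean) / std, gives mean 0 and std 1
age_zscore = (age - age.mean()) / age.std()
```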
Measures scaling
Normalization techniques are also used to bring measurements expressed in
different scales to the same scale
Example
Tripadvisor’s reviews rating scale: [1, 5]
Booking.com’s reviews rating scale: [2.5, 10]
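A sketch of mapping the Booking.com range onto Tripadvisor's, using the general linear rescaling x' = new_min + (x − old_min) · (new_max − new_min) / (old_max − old_min):

```python
def rescale(x, old_min=2.5, old_max=10.0, new_min=1.0, new_max=5.0):
    """Map a value linearly from [old_min, old_max] to [new_min, new_max]."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

print(rescale(2.5))   # 1.0 -> Booking's minimum becomes Tripadvisor's minimum
print(rescale(10.0))  # 5.0 -> and the maximum becomes the maximum
print(rescale(7.5))   # ~3.67
```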
Feature engineering
Data transformation
Feature engineering
The creation of new features (also known as "derived variables" or
"derived attributes") adds more value to the quality of the data
than any other modeling step
Distributions and possible “corrections”
Abbott (2014)
Binning (discretizing) variables (1/2)
Abbott (2014)
www.towardsdatascience.com
Binning (discretizing) variables (2/2)
Other possible transformations
Encode categorical variables – Label encoding (1/3)
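The original slide illustrates label encoding with a figure; a minimal scikit-learn sketch, with illustrative segment values:

```python
from sklearn.preprocessing import LabelEncoder

segments = ["Corporate", "SME", "Individual", "Corporate"]

# Each level gets an integer code (alphabetical: Corporate=0, Individual=1, SME=2)
encoded = LabelEncoder().fit_transform(segments)
print(encoded)  # [0 2 1 0]
```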
Encode categorical variables – One-hot encoding (2/3)
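Likewise, a minimal one-hot encoding sketch with pandas (same illustrative values):

```python
import pandas as pd

df = pd.DataFrame({"segment": ["Corporate", "SME", "Individual", "Corporate"]})

# One 0/1 column per level of the categorical variable
dummies = pd.get_dummies(df["segment"], prefix="segment")
print(dummies.columns.tolist())
# ['segment_Corporate', 'segment_Individual', 'segment_SME']
```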
Encode categorical variables (3/3)
Approach to handling high cardinality:
§Encode categorical variables using an encoder that does not
generate a column for each value/level of the categorical
variable (e.g., the count or probability of observations that have
that value/level)
§If there is a hierarchy, consider using only the higher levels. For
example, if you have street, city, and region, consider using only
city and region, or even just region
§For values/levels present in more than a predetermined
threshold of observations (e.g., 2%), create dummy variables

Example (frequency encoding of Segment, plus a dummy variable for Corporate):

Original:
CustomerID  Spent  Segment
1           € 100  Corporate
2           € 120  SME
3           € 110  Individual
4           € 105  Corporate

Encoded:
CustomerID  Spent  Segment  Corporate
1           € 100  2/4      1
2           € 120  1/4      0
3           € 110  1/4      0
4           € 105  2/4      1
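A minimal pandas sketch reproducing the example above (frequency encoding plus a dummy for the frequent level):

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Spent": [100, 120, 110, 105],
    "Segment": ["Corporate", "SME", "Individual", "Corporate"],
})

# Dummy for a frequent level ('Corporate' appears in 50% of the observations)
df["Corporate"] = (df["Segment"] == "Corporate").astype(int)

# Frequency encoding: share of observations per level (2/4, 1/4, 1/4, 2/4)
df["Segment"] = df["Segment"].map(df["Segment"].value_counts(normalize=True))
```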
Date/time variables
Datasets are two-dimensional, so, when models require the
introduction of time, transformations are necessary to represent
this third dimension (time) in the columns
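One common way to do this (a sketch; the column names, dates, and reference date are illustrative) is to derive columns that expose the time dimension:

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-15", "2024-02-03"])})

# Derived columns expose the time dimension to the model
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.dayofweek
df["days_since_order"] = (pd.Timestamp("2024-03-01") - df["order_date"]).dt.days
```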
Multidimensional features
These are among the most powerful features. The two most common examples
are:
§Interactions: multiplication of variables
§Ratios: division of variables
Usually, domain expertise is required to understand which
interactions, and above all, which ratios may have modeling value.
Multidimensional features - ratios
Ratios are important because they are difficult for most algorithms to
uncover. Ratios can:
§Provide a normalized version of a variable. For example, a
percentage such as a customer website purchase ratio =
number of purchases / customer website visits
§Incorporate complex ideas. For example, the ratio of claims received
to premiums paid in an insurance company =
claims received / premiums paid
§Make models "live" longer. For example, a model for real
estate property value, due to the increasing trend in prices, instead of
using each property's price, could use =
property price / average property price
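A minimal sketch of both feature types on hypothetical columns, with a guard against division by zero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"purchases": [3, 10], "visits": [30, 0], "price": [2.0, 3.0]})

# Interaction: multiplication of variables
df["spend_proxy"] = df["purchases"] * df["price"]

# Ratio: division of variables, avoiding division by zero
df["purchase_ratio"] = df["purchases"] / df["visits"].replace(0, np.nan)
```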
6.5
Data integration
Data preparation
Merge data
Joining data that comes from two or more
databases about the unit of analysis under study
[Figure: example of merging multiple sources for a stock forecast]
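A minimal pandas sketch of joining two hypothetical sources on the unit of analysis:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Lisboa", "Porto"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 50, 80]})

# Join the two sources on the unit of analysis (the customer)
abt = customers.merge(orders, on="customer_id", how="left")
```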
Reformat data
Apply syntactic modifications that do not change data meaning,
but are required for modeling, for example:
§Remove commas from text fields if the dataset is supposed to be
saved as comma-separated values (CSV)
§Remove any ordering that might exist in the observations
§Trim some variables (e.g., text variables) to a certain maximum size
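A minimal sketch of these reformatting steps on a hypothetical text column (the 255-character limit and the shuffle seed are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"comment": ["good, fast delivery", "ok"]})

# Remove commas so the file can safely be saved as CSV, and trim to a maximum size
df["comment"] = df["comment"].str.replace(",", " ", regex=False).str.slice(0, 255)

# Shuffle to remove any ordering that might exist in the observations
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```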
Data Science for Marketing
© 2021-2024 Nuno António (Rev. 2024-08-28)
Instituto Superior de Estatística e Gestão da Informação
Universidade Nova de Lisboa