Applied Management Research Methods, a.y. 2024/2025
EDA & Preprocessing
SPEAKER
Dott. Federico Mangiò
ROOM
Room 21
DATE 5 March 2025; 10.30 AM-1.30 PM
Agenda
• Data Exploration Roadmap
• Descriptive Statistics Analysis: Uni vs Multivariate
• Outliers & NAs detection & management
• Data Visualization Analysis: Uni, Multivariate, Multidimensional
• Statistical Paradoxes
Data Exploration Roadmap
• Organize the dataset
• Find the central point for each attribute
• Understand the spread of each attribute
• Visualize the data
• Pivot the data
• Watch out for outliers
• Understand the relationship between attributes
• Visualize the relationship between attributes
• Visualize high-dimensional datasets
Exploratory Data Analysis
• EDA: study of the basic characteristics of a dataset
• Fundamental prior step to every research analysis
• Objectives:
1. Data understanding: overview of each variable and of their interactions
2. Data preparation: detecting and handling outliers, missing values, multicollinearity
3. Data analysis task: EDA can substitute an entire analysis process, if suitable
4. Interpreting the results
• 2 types: Descriptive statistics analysis vs Data Visualization
Metadata (example): Source: playground records; Data collector: company X; Data collection date: 01.01.18; […]
• A dataset (example set) is a collection of data with a defined structure (“dataframe”).
• A data point (record, object or example) is a single instance in the dataset.
• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset (numeric, categorical, date-time, text, or Boolean data types).
• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes (Play).
• Identifiers are special attributes that are used for locating or providing context to individual records.
Datasets
• Relational (e.g. dataframe) vs non relational (e.g. corpus)
• Sparsity (“curse of dimensionality”)
• Training dataset: dataset used to create the model, with known attributes and target
• Test/Validation dataset: known dataset against which the validity of the model is tested
Modeling steps: Training data -> Build Model; Testing data -> Evaluation -> Final Model
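The training/test distinction above can be sketched as a simple random split. This is only an illustration: the function name `train_test_split`, the 30% test share, and the seed are assumptions, not part of the course material.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly split records into (train, test); hypothetical helper."""
    shuffled = records[:]                  # copy, so the input list is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-in for 10 data points
train, test = train_test_split(records)
print(len(train), "training records,", len(test), "test records")
```

The model is built on `train` only; `test` is held back so that the evaluation uses data the model has not seen.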
Attributes
• Quantitative variables: they take on numerical values (e.g., income, stock price, etc.).
Continuous, integer, real...
• Qualitative (or categorical) variables (or factors): they take on values in one of K different
classes or categories (e.g. male/female for gender; yes/no for a fraudulent financial
transaction; increase/ decrease for a stock index; ...). Qualitative variables are typically
represented numerically by codes.
• Independent variables (or inputs or regressors or predictors or features) usually denoted by
X.
• Dependent variable (or output or response variable) usually denoted by Y.
RM data types
• numeric/continuous: can take infinitely many values and can be the object of mathematical (e.g. +, -) and logical (>, <) computations; integer: no decimals; ratio/real: the zero point is defined (e.g. income)
• categorical/nominal: treated as symbols/names; can be ordered (hot, mild, cold temperature). Not all algorithms can deal with them; they can be transformed (NB loss of information)
EDA1: Descriptive Statistics Analysis
• Study of the aggregate quantities of a dataset
• Univariate: exploration of a single attribute at a time
• Multivariate: exploration of 2+ attributes at a time
Characteristics of the Dataset | Measurement Technique
Center of the dataset | Mean, median, and mode
Spread of the dataset | Range, variance / standard deviation
Shape of the distribution of the dataset | Symmetry/skewness, and kurtosis
(Kubiak & Benbow, 2005)
Univariate
• Measures of central tendency
1. mean: arithmetic average of all observations
2. median: value of the central point in the distribution (sorting small to large--> mid-point)
3. mode: most frequently occurring observation
Insights:
• mean = median = mode -> likely a symmetric, possibly normal distribution (pdf symmetric around mu, bell-shaped)
• the mean is affected by outliers, the median much less so
• mode ≠ mean or median -> possibly more than one underlying normal distribution in the data
• mode > median > mean -> left skewed
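A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module. The income values are hypothetical toy data, chosen so that one extreme value shows how differently the mean and the median react to an outlier.

```python
from statistics import mean, median, multimode

# Hypothetical toy data: monthly incomes (thousand EUR) with one extreme value.
incomes = [2.0, 2.2, 2.2, 2.5, 2.7, 3.0, 25.0]

avg = mean(incomes)          # arithmetic average, pulled up by the outlier
mid = median(incomes)        # middle value after sorting
modes = multimode(incomes)   # most frequent value(s); a list, since ties are possible

print(f"mean={avg:.2f} median={mid:.2f} mode={modes}")

# Dropping the extreme value moves the mean a lot, the median only slightly.
trimmed = incomes[:-1]
print(f"without outlier: mean={mean(trimmed):.2f} median={median(trimmed):.2f}")
```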
Univariate
• Measures of spread
1. range: max - min
2. deviation: x - sample mean
3. variance (sigma squared): sum of the squared deviations of all data points divided by the number of data points (for the sample variance s^2, divide by n - 1 instead)
4. standard deviation: square root of the variance
Insights:
• high sd -> data widely spread around the mean; low sd -> data concentrated close to the mean
• normal distribution: about 68% of the data lies within 1 sd of the mean
• range is affected by outliers
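The measures of spread can be computed step by step as follows. The data are hypothetical; the sketch shows both the "divide by n" (population) and "divide by n - 1" (sample) variance and checks them against the `statistics` module.

```python
from math import sqrt
from statistics import mean, pvariance, variance

data = [4, 8, 6, 5, 3, 7, 9]  # hypothetical observations

value_range = max(data) - min(data)
m = mean(data)
deviations = [x - m for x in data]          # x - mean, they sum to zero
squared = sum(d ** 2 for d in deviations)   # sum of squared deviations

pop_var = squared / len(data)         # population variance (divide by n)
samp_var = squared / (len(data) - 1)  # sample variance s^2 (divide by n - 1)
pop_sd = sqrt(pop_var)                # standard deviation = square root of variance

print(f"range={value_range} variance={pop_var:.2f} sd={pop_sd:.2f} s^2={samp_var:.2f}")
```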
Univariate
• Shape of the distribution
1. Skewness: measure of symmetry (or lack of it)
2. Kurtosis: measure of peakedness of the data
[Figure: two distributions with df = 5 (kurtosis 5.4, skewness 1.27) and df = 10 (kurtosis 4.2, skewness 0.9)]
Insights:
• Skewness = 0: distribution is perfectly symmetric
• Skewness < 0: distribution is skewed to the left
• Skewness > 0: distribution is skewed to the right
• Kurtosis gives information about the tails of the data distribution: useful for outlier detection!
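A minimal sketch of moment-based skewness and kurtosis. The slides do not say which estimator is meant, so the population (uncorrected) formulas are an assumption here; statistical packages often apply small-sample corrections and may report excess kurtosis (kurtosis minus 3). The data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment divided by sd^3 (population form)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by sd^4; equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 10]   # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirrored: long tail to the left

for name, sample in [("symmetric", symmetric),
                     ("right", right_skewed),
                     ("left", left_skewed)]:
    print(f"{name}: skewness={skewness(sample):.2f} kurtosis={kurtosis(sample):.2f}")
```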
Univariate Descriptive Statistics & Viz
• Frequency distribution («statistical variable»): set of classes and their frequencies (qual)
• Absolute frequency: number of observations belonging to a measurement class (qual, quant)
• Cumulative frequency distribution: frequency distribution that shows the running total of frequencies up to a certain point in the data set (quant)
excel_tutorial
[Figures: Bar chart (qual); Frequency diagram (discrete); CDF (continuous variable)]
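The frequency notions above can be sketched with the standard library. The ratings are hypothetical data on a 1-5 scale; the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # hypothetical discrete variable

counts = Counter(ratings)
classes = sorted(counts)                      # measurement classes
absolute = [counts[c] for c in classes]       # absolute frequencies
n = len(ratings)
relative = [a / n for a in absolute]          # relative frequencies
cumulative = list(accumulate(absolute))       # running total of frequencies
cumulative_rel = [c / n for c in cumulative]  # empirical CDF at each class

for c, a, r, cum in zip(classes, absolute, relative, cumulative_rel):
    print(f"class {c}: abs={a} rel={r:.0%} cum={cum:.0%}")
```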
Multivariate- qual
• Crosstabulation analysis: table showing the absolute joint frequencies of 2(+) qualitative variables
excel_tutorial
Contingency table: absolute frequencies (Count)
Gender | Shopping? Yes | Shopping? No | Tot
Man | 445 | 51 | 496
Woman | 266 | 37 | 303
Tot | 711 | 88 | 799
Contingency table: relative joint frequencies (% of total)
Gender | Shopping? Yes | Shopping? No | Tot
Man | 56% | 6% | 62%
Woman | 33% | 5% | 38%
Tot | 89% | 11% | 100%
Contingency table: row-subordinated (conditional) frequencies (% of row total)
Gender | Shopping? Yes | Shopping? No | Tot
Man | 90% | 10% | 100%
Woman | 88% | 12% | 100%
Tot | 89% | 11% | 100%
Statistical independence: if the subordinated (conditional) frequencies of Y remain the same when X changes, the distribution of Y does not depend on X,
i.e. for independent variables each relative joint frequency is equal to the product of the corresponding marginal relative frequencies
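The three contingency tables and the independence check can be reproduced from the counts on the slide (Gender x Shopping?). This is a sketch in plain Python; the variable names are illustrative, and the comparison with the product of the marginals is a descriptive check, not a formal test.

```python
# Absolute joint frequencies from the slide.
counts = {"Man": {"Yes": 445, "No": 51}, "Woman": {"Yes": 266, "No": 37}}
answers = ("Yes", "No")

row_tot = {g: sum(row.values()) for g, row in counts.items()}
col_tot = {a: sum(counts[g][a] for g in counts) for a in answers}
total = sum(row_tot.values())

# Relative joint frequencies: each cell divided by the grand total.
joint = {g: {a: counts[g][a] / total for a in answers} for g in counts}
# Row-conditional frequencies: each cell divided by its row total.
conditional = {g: {a: counts[g][a] / row_tot[g] for a in answers} for g in counts}
# Under independence, joint frequency = product of the two marginal frequencies.
expected = {g: {a: (row_tot[g] / total) * (col_tot[a] / total) for a in answers}
            for g in counts}

max_gap = max(abs(joint[g][a] - expected[g][a]) for g in counts for a in answers)
for g in counts:
    print(g, {a: f"{conditional[g][a]:.0%}" for a in answers})
print(f"largest gap between joint and product of marginals: {max_gap:.4f}")
```

The row-conditional shares are almost identical for men and women, and the joint frequencies stay close to the product of the marginals, which is what near-independence looks like in a table.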
Multivariate- quant
• Correlation: measure of the linear dependence of
one variable on another variable. Highly correlated
variables vary at the same rate in the same or
opposite direction
• Pearson correlation coefficient (r): linear
correlation, -1<=r<=1
• Limitations: not able to identify non-linear
relationships + affected by outliers
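A minimal sketch of Pearson's r written out by hand, with hypothetical toy data that illustrate both limitations listed above: a perfect non-linear (parabolic) relationship gives r = 0, and a single extreme point can push r close to 1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]        # exact linear relationship
y_parabola = [(v - 3) ** 2 for v in x]   # exact but non-linear relationship

r_linear = pearson_r(x, y_linear)
r_parabola = pearson_r(x, y_parabola)
# Adding one extreme point to the parabola data changes the picture completely.
r_outlier = pearson_r(x + [20], y_parabola + [40])

print(f"linear: {r_linear:.2f}  parabola: {r_parabola:.2f}  with outlier: {r_outlier:.2f}")
```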
2. Data Preparation
• Handling missing values (“NAs”):
1. understanding the source of missing values (e.g. recording error vs count data, like a document-term matrix)
2. data substitution (mean, min, max, depending on the characteristics of the attribute): only if missing values occur randomly and rarely
3. alternatively, excluding records with missing values (NB this reduces the size of the dataset)
• Data type conversion: fit attributes' values to the specific model (e.g. categorical variable -> numeric variable in regression, factor)
• Transformation: some problems require normalizing attributes to prevent one from dominating the others (e.g. distance-based algorithms)
• Handling outliers: 1. understanding the source of the outlier (e.g. error); 2. management
• Feature selection: reducing the number of attributes without significant loss in the performance of the model
• Sampling: process of selecting a subset of records as a representation of the original dataset for use in data
analysis or modeling. The sample data serve as a representative of the original dataset with similar properties,
such as a similar mean.
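Three of the preparation steps above (substituting missing values, converting a categorical attribute, normalizing) can be sketched as follows. The records, attribute names, and the choice of mean imputation and min-max scaling are all assumptions made for illustration; the right choice depends on the attribute and on the model.

```python
from statistics import mean

# Hypothetical records with one missing income (None plays the role of an NA).
records = [
    {"age": 25, "income": 30000.0, "segment": "retail"},
    {"age": 40, "income": None, "segment": "business"},
    {"age": 35, "income": 50000.0, "segment": "retail"},
    {"age": 50, "income": 70000.0, "segment": "business"},
]

# 1. Missing values: substitute the mean of the known values.
fill_value = mean(r["income"] for r in records if r["income"] is not None)
incomes = [fill_value if r["income"] is None else r["income"] for r in records]

# 2. Data type conversion: one 0/1 column per category (one-hot encoding).
categories = sorted({r["segment"] for r in records})
one_hot = [[1 if r["segment"] == c else 0 for c in categories] for r in records]

# 3. Transformation: min-max normalization puts every attribute on a 0-1 scale.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

age_scaled = min_max([r["age"] for r in records])
income_scaled = min_max(incomes)

print(categories, one_hot)
print(age_scaled, income_scaled)
```

After scaling, age and income both range from 0 to 1, so neither dominates a distance computation simply because of its unit of measurement.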
Correlations - context
•Sarah works as a regional sales manager for a national supplier specializing in fossil fuels used for home heating.
•Lately, fluctuating market prices for heating oil, along with significant variations in the volume of individual orders,
have raised concerns for her.
•She recognizes the importance of understanding the behaviors and other elements that drive demand for heating oil
in the domestic market.
•What factors are related to heating oil usage, and how might she use a knowledge of such factors to better
manage her inventory and anticipate demand?
(North, 2018)
Correlations - deployment
•Correlation is not causation: the two most strongly correlated attributes in our data set are
Heating_Oil_Used and Home_Age, with a coefficient of 0.848. We don’t know why.
•Correlation coefficients are not percentages: reading a correlation coefficient of 0.776 between two attributes as an indication that there is 77.6% shared variability between those two attributes is incorrect.
•Only linearity is modeled. For non-linear (monotonic) relationships, consider Kendall’s Tau or Spearman’s Rho
(North, 2018)
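Two of the points above can be illustrated with a short sketch. First, the usual measure of shared variability in a linear relationship is the squared coefficient, so r = 0.776 corresponds to roughly 60%, not 77.6%. Second, Spearman's rho is Pearson's r computed on ranks, which captures monotonic but non-linear relationships. The data are hypothetical and the rank function assumes no ties, to keep the example short.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

shared_variance = 0.776 ** 2   # r squared, about 0.60

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]        # monotonic but clearly non-linear

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"r^2 for r=0.776: {shared_variance:.3f}")
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}")
```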
Are descriptive stats enough? The Anscombe’s Quartet
Load the Anscombe_quartet_dataset.xlsx dataset into your RM
Extract descriptive statistics for all datasets
Compare the aggregated features of the four datasets: are we looking
at the same data?
(Anscombe 1973)
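The exercise can also be sketched outside RM. The values below are the quartet as commonly reproduced from Anscombe (1973), typed in by hand; that the course file Anscombe_quartet_dataset.xlsx contains exactly these values is an assumption.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x, mean of y and Pearson's r, rounded to two decimals, per dataset.
summary = {}
for name, (x, y) in quartet.items():
    summary[name] = (round(mean(x), 2), round(mean(y), 2), round(pearson_r(x, y), 2))
    print(name, summary[name])
```

The four summaries coincide at two decimals, yet plotting the four datasets shows very different shapes, which is the point of the exercise.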
Data visualization
• Set of methods of expressing data in an abstract visual form.
• Univariate vs Multivariate
• Pros of data viz:
1. comprehension of dense information (the “big picture”)
2. relationships (cartesian coordinates + creative tactics, like colours)
• Why data visualization is mandatory for exploratory purposes:
1. The relationship between two variables might increase, decrease, or even change
direction depending on the set of variables being controlled
2. causal inferences, particularly in nonexperimental studies, can be hazardous.
Uncontrolled and even unobserved variables that would eliminate or reverse the association
observed between two variables might exist
Univariate visualization
• 1. Histogram
• Plotting the frequency of occurrence in a range
• Used to find the central location, range, and shape of
distribution.
• For continuous values: bins need to be specified
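Why bins need to be specified can be seen in a small sketch that counts observations per equal-width bin. The helper `histogram` and the data are hypothetical; plotting libraries do the same counting before drawing the bars.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

values = [1.0, 1.5, 2.0, 2.5, 2.7, 3.0, 3.2, 3.5, 4.0, 5.0]  # hypothetical data

edges4, counts4 = histogram(values, 4)
edges2, counts2 = histogram(values, 2)

for left, right, count in zip(edges4, edges4[1:], counts4):
    print(f"[{left:.1f}, {right:.1f}) {'#' * count}")
print("with 2 bins:", counts2)
```

The same data look different with 4 bins and with 2 bins, so the number of bins is a choice that affects how the shape of the distribution is read.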
Univariate visualization
• 2. Box plots
• Plotting the distribution of a continuous variable
with information such as quartiles, median, and
outliers, overlaid by mean and standard deviation
• Used to compare distributions of multiple
attributes side by side (e.g. ANOVA).
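The numbers behind a box plot can be sketched with `statistics.quantiles`. The data are hypothetical, and the 1.5 x IQR fence used to flag outliers is a common plotting convention (Tukey's rule) assumed here, not something stated on the slide; quartile definitions also vary slightly between tools.

```python
from statistics import median, quantiles

data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 45]  # hypothetical observations

# Quartiles; "inclusive" treats the data as the whole population of interest.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # interquartile range: height of the box

lower_fence = q1 - 1.5 * iqr        # assumed convention for the whiskers
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```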
Univariate visualization
• 3. Distribution chart
• Gaussian curve showing the probability of occurrence of a data point within a range of values
• NB assumption of normality
• Used for “fast” predictions
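A “fast” prediction of this kind can be sketched with `statistics.NormalDist`: fit a Gaussian to a sample and read off the probability of a range as a difference of two CDF values. The sample is hypothetical, and the result is only as good as the normality assumption.

```python
from statistics import NormalDist

sample = [48, 52, 50, 47, 53, 49, 51, 50]  # hypothetical measurements

# Fit a normal distribution using the sample mean and standard deviation.
fitted = NormalDist.from_samples(sample)

def prob_between(dist, low, high):
    """Probability that a value falls in [low, high] under the fitted Gaussian."""
    return dist.cdf(high) - dist.cdf(low)

p_range = prob_between(fitted, 49, 52)
p_one_sd = prob_between(fitted, fitted.mean - fitted.stdev, fitted.mean + fitted.stdev)

print(f"mean={fitted.mean:.1f} sd={fitted.stdev:.2f}")
print(f"P(49 <= x <= 52) = {p_range:.2f}")
print(f"P(within 1 sd)   = {p_one_sd:.4f}")
```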
Multivariate visualization
• 1. Scatterplot
• Cartesian space where 2 attributes are
plotted as coordinates
• Used to identify relationships between
attributes (e.g. linear relationship, presence
of clusters, presence of outliers)
Multivariate visualization
• 2. Scatter matrix
• comparing all combinations of attributes
with individual scatterplots and arranging
these plots in a matrix.
Multidimensional visualization
• 1. Parallel chart
• projecting multi-dimensional data onto a two-dimensional chart, where the dimensions are linearly arranged along one coordinate (x-axis) and all the measures are arranged along the other coordinate (y-axis).
• Since the x-axis is multivariate, each data point is
represented as a line in a parallel space.
• only for attributes measured with the same metrics,
or standardized
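The standardization requirement can be sketched with z-scores. The two attributes are hypothetical and measured in very different units; after standardization they share one scale and can sit on the same y-axis.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

income = [30000, 45000, 60000, 75000]  # hypothetical, in EUR
age = [25, 35, 45, 55]                 # hypothetical, in years

z_income = z_scores(income)
z_age = z_scores(age)

print([round(z, 2) for z in z_income])
print([round(z, 2) for z in z_age])
```

Each standardized attribute has mean 0 and standard deviation 1, so no attribute stretches the chart just because its raw numbers are larger.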
Multidimensional visualization
• 2. Andrews curves
• visualization technique where the high-dimensional data are projected into a vector space so that each data point takes the form of a curve (Fourier series).
• If two data points are similar, then the curves for the data points are closer to each other. If curves are far apart and belong to different classes, then this information can be used to classify the data
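A minimal sketch of the curve behind each data point, using the usual Andrews formula f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t between -pi and pi. The three data points are hypothetical, and the "gap" between curves is just one simple way, assumed here, to express how far apart two curves are.

```python
from math import cos, pi, sin, sqrt

def andrews_curve(point, t):
    """Value at t of the Fourier-series curve that represents one data point."""
    total = point[0] / sqrt(2)
    for i, value in enumerate(point[1:], start=1):
        k = (i + 1) // 2                        # frequencies 1, 1, 2, 2, 3, 3, ...
        total += value * (sin(k * t) if i % 2 == 1 else cos(k * t))
    return total

grid = [-pi + 2 * pi * i / 50 for i in range(51)]  # 51 points on [-pi, pi]

def curve_gap(p, q):
    """Largest vertical distance between the curves of p and q on the grid."""
    return max(abs(andrews_curve(p, t) - andrews_curve(q, t)) for t in grid)

a = [5.1, 3.5, 1.4, 0.2]  # hypothetical data points with four attributes
b = [4.9, 3.0, 1.4, 0.2]  # similar to a
c = [6.3, 3.3, 6.0, 2.5]  # quite different from a

print(f"gap(a, b) = {curve_gap(a, b):.2f}")
print(f"gap(a, c) = {curve_gap(a, c):.2f}")
```

Similar points give curves that stay close together, while dissimilar points give curves that separate, which is what makes the plot useful for spotting classes.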
Data Viz–in-class exercise
• Retrieve the «CaliforniaDDSdatav2.xslx» dataset, and answer the following questions:
- Is the allegation of ethnicity-based discrimination in funding allocation grounded?
I. Which cohort benefits the most from fund allocation (“expenditure”)?
II. Why does the overall average expenditure differ significantly, suggesting ethnic discrimination against Hispanics, when in all but one age cohort (18-21) the average expenditure for Hispanic consumers is greater than that for the White non-Hispanic population?
II. Why does the overall average expenditure differ significantly, suggesting ethnic discrimination against Hispanics, when in all but one age cohort (18-21) the average expenditure for Hispanic consumers is greater than that for the White non-Hispanic population?
The overall Hispanic consumer population is relatively younger than the White non-Hispanic consumer population. Since expenditures for younger consumers are lower, the overall average of expenditures for Hispanics (vs White non-Hispanics) is lower.
Simpson’s Paradox
• The marginal association between two categorical variables might be qualitatively different from the partial association between the same two variables after controlling for one or more other “lurking” variables (reversal or amalgamation paradox)
[Figure: Pearson’s R(A) = Pearson’s R(C) = .81 (!!!)]
(Matejka and Fitzmaurice, 2017: 1293)
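The reversal can be reproduced with a few lines of arithmetic. All numbers below are invented for illustration and are not taken from the California DDS data: group A receives more than group B within each age cohort, yet less on average overall, because group A is mostly young and young consumers receive less.

```python
# Hypothetical data: (number of consumers, average expenditure) per group and cohort.
data = {
    "A": {"young": (80, 4000), "old": (20, 40000)},
    "B": {"young": (20, 3500), "old": (80, 38000)},
}

def overall_average(cohorts):
    """Average over all consumers: cohort averages weighted by cohort size."""
    total_spent = sum(n * avg for n, avg in cohorts.values())
    total_people = sum(n for n, _ in cohorts.values())
    return total_spent / total_people

overall = {group: overall_average(cohorts) for group, cohorts in data.items()}

for cohort in ("young", "old"):
    print(f"{cohort}: A={data['A'][cohort][1]} B={data['B'][cohort][1]}")
print(f"overall: A={overall['A']:.0f} B={overall['B']:.0f}")
```

Age is the lurking variable here: the comparison within cohorts and the comparison of overall averages point in opposite directions.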
Assignment 1
• 1. In RM, retrieve the “HR_performance” and “HR_salary” datasets. For each
dataset:
• Are personality traits (i.e. neuroticism) a good way to evaluate an employee’s
performance (salary)? Why and how?
• What kind of statistical paradox are we facing, and why?