Lecture - Exploratory Data Analysis

The document outlines the Applied Management Research Methods course for the academic year 2024/2025, focusing on Exploratory Data Analysis (EDA) and data preprocessing techniques. Key topics include data exploration, descriptive statistics, outlier detection, data visualization, and the importance of understanding relationships between attributes. The agenda also emphasizes the significance of handling missing values and preparing data for analysis to ensure accurate results.

Uploaded by

simonecarloseghezzi


Applied Management Research Methods, a.y. 2024/2025

EDA & Preprocessing

SPEAKER: Dott. Federico Mangiò

ROOM: Room 21

DATE: 5 March 2025; 10.30-1.30 PM

Agenda

• Data Exploration Roadmap

• Descriptive Statistics Analysis: Uni vs Multivariate
• Outlier & NA detection & management
• Data Visualization Analysis: Uni, Multivariate, Multidimensional
• Statistical Paradoxes
Data Exploration Roadmap

• Organize the dataset
• Find the central point for each attribute
• Understand the spread of each attribute
• Watch out for outliers
• Pivot the data
• Visualize the data
• Understand the relationship between attributes
• Visualize the relationship between attributes
• Visualize high-dimensional datasets
Exploratory Data Analysis

• EDA: study of the basic characteristics of a dataset

• Fundamental step prior to every research analysis
• Objectives:
1. Data understanding: overview of each variable and of their interactions
2. Data preparation: detecting and handling outliers, missing values, multicollinearity
3. Data analysis task: EDA can substitute for an entire analysis process, if suitable
4. Interpreting the results
• 2 types: Descriptive Statistics Analysis vs Data Visualization
Metadata: Source: playground records; Data collector: company X; Data collection date: 01.01.18; […]

• A dataset (example set) is a collection of data with a defined structure (“dataframe”).

• A data point (record, object or example) is a single instance in the dataset.

• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset (numeric, categorical, date-time, text, or Boolean data types).

• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes (Play).

• Identifiers are special attributes that are used for locating or providing context to individual records.
Datasets

• Relational (e.g. dataframe) vs non-relational (e.g. corpus)

• Sparsity: “curse of dimensionality”

• Training dataset: dataset used to create the model, with known attributes and target

• Test/Validation dataset: known dataset against which the validity of the model is tested

[Figure: Modeling steps. Training data -> Build Model; Testing data -> Evaluation -> Final Model]
Attributes

• Quantitative variables: they take on numerical values (e.g., income, stock price, etc.).
Continuous, integer, real...
• Qualitative (or categorical) variables (or factors): they take on values in one of K different
classes or categories (e.g. male/female for gender; yes/no for a fraudulent financial
transaction; increase/decrease for a stock index; ...). Qualitative variables are typically
represented numerically by codes.
• Independent variables (or inputs or regressors or predictors or features) usually denoted by
X.
• Dependent variable (or output or response variable) usually denoted by Y.

RM data types
• numeric/continuous: can take infinitely many values and can be the object of mathematical (e.g. +, -) and logical (>, <) computations; integer: no decimals; ratio/real: the zero point is defined (e.g. income)

• categorical/nominal: treated as symbols/names; can be ordered (hot, mild, cold temperature). Not all algorithms can deal with them; they can be transformed (NB loss of information)
EDA1: Descriptive Statistics Analysis

• Study of the aggregate quantities of a dataset

• Univariate: exploration of a single attribute at a time
• Multivariate: exploration of 2+ attributes at a time

Characteristics of the dataset and their measurement techniques (Kubiak & Benbow, 2005):

• Center of the dataset: mean, median, and mode
• Spread of the dataset: range, variance / standard deviation
• Shape of the distribution of the dataset: symmetry/skewness, and kurtosis

Univariate

• Measures of central tendency

1. mean: arithmetic average of all observations

2. median: value of the central point in the distribution (sorting small to large--> mid-point)

3. mode: most frequently occurring observation

Insights:

• mean = median = mode -> symmetric, unimodal distribution, likely normal (pdf symmetric around mu, bell-shaped)

• the mean is affected by outliers; the median is robust to them

• mode =/= mean or median -> possibly more than one underlying normal distribution (multimodality)

• mode > median > mean -> left skewed
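
The insights above can be checked on a toy example. The following is a minimal sketch using Python's standard `statistics` module; the income values are hypothetical and chosen only to show how a single outlier moves the mean but barely moves the median.

```python
import statistics

# Hypothetical incomes (in thousands); the value 250 is a deliberate outlier.
incomes = [21, 22, 22, 23, 24, 25, 26]
with_outlier = incomes + [250]

def summarize(values):
    """Return (mean, median, mode) of a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

mean_a, median_a, mode_a = summarize(incomes)
mean_b, median_b, mode_b = summarize(with_outlier)

print(f"without outlier: mean={mean_a:.2f} median={median_a} mode={mode_a}")
print(f"with outlier:    mean={mean_b:.2f} median={median_b} mode={mode_b}")
```

In the second list the mean more than doubles while the median shifts by half a unit, which is why the median is usually preferred as a measure of center when outliers are suspected.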

Univariate

• Measures of spread

1. range: max - min

2. deviation: x - sample mean

3. variance (sigma squared): sum of the squared deviations of all data divided by the number of data (the sample variance s^2 divides by n - 1 instead)

4. std dev = square root of the variance

Insights:

• high sd --> data widely spread around the mean; low sd --> data narrowly concentrated around the mean

• normal distribution: about 68% of the data lies within 1 sd of the mean

• the range is affected by outliers
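
As a rough illustration of the measures of spread, the sketch below computes them step by step on a small hypothetical list of observations and cross-checks the result against the standard `statistics` module. Both the population form (divide by n) and the sample form (divide by n - 1) of the variance are shown, since software packages differ in which one they report by default.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

data_range = max(data) - min(data)
mean = statistics.mean(data)
deviations = [x - mean for x in data]

# Population variance: squared deviations divided by the number of data points.
pop_var = sum(d ** 2 for d in deviations) / len(data)
# Sample variance: squared deviations divided by n - 1.
sample_var = sum(d ** 2 for d in deviations) / (len(data) - 1)

pop_sd = pop_var ** 0.5

# Share of observations lying within one standard deviation of the mean.
within_1sd = sum(abs(d) <= pop_sd for d in deviations) / len(data)

print(data_range, pop_var, round(sample_var, 3), pop_sd, within_1sd)
```

On such a small, non-normal sample the share within one standard deviation need not be close to 68%; that figure only holds for (approximately) normal data.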

Univariate

• Shape of the distribution

1. Skewness: measure of the asymmetry of the distribution

2. Kurtosis: measure of the peakedness and tail weight of the data

df          5      10
Kurtosis    5.4    4.2
Skewness    1.27   0.9

Insights:

• Skewness = 0: distribution is perfectly symmetric

• Skewness < 0: distribution is skewed to the left
• Skewness > 0: distribution is skewed to the right

• Kurtosis gives information about the tails of the data distribution: useful for outlier detection!
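
A minimal sketch of how skewness and kurtosis can be computed from central moments is given below. It assumes the moment-based (population) definitions, in which a normal distribution has kurtosis 3; statistical packages may report bias-corrected or "excess" versions instead. The three small datasets are hypothetical.

```python
def moments_shape(values):
    """Return (skewness, kurtosis) computed from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis: 3 for a normal distribution
    return skewness, kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail on the left

skew_sym, kurt_sym = moments_shape(symmetric)
skew_right, kurt_right = moments_shape(right_skewed)
skew_left, kurt_left = moments_shape(left_skewed)

print(skew_sym, skew_right, skew_left)
```

Mirroring a dataset flips the sign of its skewness and leaves its kurtosis unchanged, which matches the reading of the sign given in the insights.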
Univariate Descriptive Statistics & Viz

• Frequency distribution («statistical variable»): set of classes and their frequencies

• Absolute frequency: number of observations belonging to a measurement class

• Cumulative frequency distribution: frequency distribution that shows the running total of frequencies up to a certain point in the data set

• Qual • Qual, quant

• Quant excel_tutorial

[Figure: three panels. Bar chart (qual); Frequency diagram (discrete); CDF (continuous variable)]
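
The three notions of frequency can be sketched in a few lines of Python. The ratings below are hypothetical answers on a 1-6 scale, used only to show how absolute, relative, and cumulative frequencies relate to each other.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical answers on a 1-6 rating scale.
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]

counts = Counter(ratings)
classes = sorted(counts)
absolute = [counts[c] for c in classes]              # observations per class
relative = [a / len(ratings) for a in absolute]      # share of the total
cumulative = list(accumulate(absolute))              # running total
cumulative_pct = [c / len(ratings) for c in cumulative]

for c, a, r, cp in zip(classes, absolute, relative, cumulative_pct):
    print(f"class {c}: abs={a} rel={r:.0%} cum={cp:.0%}")
```

The last cumulative value always equals the number of observations (100% in relative terms), which is a quick sanity check on any frequency table.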
Multivariate- qual

• Crosstabulation analysis: table showing the absolute joint frequencies of 2(+) qualitative variables

excel_tutorial

Contingency table: absolute frequencies

Count     Shopping?
Gender    Yes    No     Tot
Man       445    51     496
Woman     266    37     303
Tot       711    88     799

Contingency table: relative joint frequencies

% tot     Shopping?
Gender    Yes    No     Tot
Man       56%    6%     62%
Woman     33%    5%     38%
Tot       89%    11%    100%

Contingency table: row-subordinated frequencies

% row     Shopping?
Gender    Yes    No     Tot
Man       90%    10%    100%
Woman     88%    12%    100%
Tot       89%    11%    100%

Statistical independence: if the row-subordinated frequencies remain the same when X changes, the distribution of Y does not depend on X, i.e. the relative joint frequency of independent distributions is equal to the product of the corresponding marginal relative frequencies.
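
The Gender by Shopping figures above can be reworked in code. This is a minimal sketch in plain Python (the dictionary layout is an illustrative choice, not the format of the Excel tutorial): it derives the relative joint and row-subordinated frequencies from the absolute counts, and the joint frequencies that would be expected under independence.

```python
# Absolute frequencies from the Gender x Shopping contingency table.
counts = {
    "Man":   {"Yes": 445, "No": 51},
    "Woman": {"Yes": 266, "No": 37},
}

total = sum(sum(row.values()) for row in counts.values())
row_tot = {g: sum(row.values()) for g, row in counts.items()}
col_tot = {s: sum(counts[g][s] for g in counts) for s in ("Yes", "No")}

# Relative joint frequencies: share of the grand total.
joint = {g: {s: n / total for s, n in row.items()} for g, row in counts.items()}
# Row-subordinated frequencies: share of each row total.
by_row = {g: {s: n / row_tot[g] for s, n in row.items()} for g, row in counts.items()}
# Joint frequencies expected under independence: product of the marginals.
expected = {g: {s: (row_tot[g] / total) * (col_tot[s] / total) for s in col_tot}
            for g in counts}

for g in counts:
    for s in col_tot:
        print(f"{g:5} {s:3} joint={joint[g][s]:.3f} "
              f"expected={expected[g][s]:.3f} row={by_row[g][s]:.2f}")
```

Here the observed and expected joint frequencies differ by less than one percentage point in every cell, which is consistent with the nearly identical row profiles (90%/10% vs 88%/12%).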
Multivariate- quant

• Correlation: measure of the linear dependence of one variable on another variable. Highly correlated variables vary at the same rate in the same or opposite direction

• Pearson correlation coefficient (r): linear correlation, -1 <= r <= 1

• Limitations: not able to identify non-linear relationships + affected by outliers
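
Both limitations can be made concrete with a short sketch. The function below implements the usual product-moment formula for Pearson's r; all data are hypothetical and constructed so that the effects are easy to see.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfectly linear relationship
parabola = [x ** 2 for x in xs]    # perfectly dependent, but not linear

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)

# Weakly (negatively) related data, then the same data plus one extreme point.
noisy = [0, 1, -1, 0, 1, -1, 0]
r_without = pearson_r(xs, noisy)
r_with = pearson_r(xs + [30], noisy + [30])

print(r_linear, r_parabola, round(r_without, 3), round(r_with, 3))
```

The parabola yields r = 0 even though y is fully determined by x, and a single extreme observation is enough to turn a weak negative correlation into a strong positive one.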
2. Data Preparation
• Handling missing values (“NAs”):

1. understanding the source of missing values (e.g. recording error vs count data, like a document-term matrix)
2. data substitution (mean, min, max, depending on the characteristics of the attribute), only if missing values occur randomly and rarely
3. alternatively, excluding records with missing values (NB reduces the size of the dataset)

• Data type conversion: fit the attributes' values to the specific model (e.g. categorical variable --> numeric variable in regression, factor)

• Transformation: some problems require normalizing attributes to prevent one from dominating the others (e.g. distance-based algorithms)

• Handling outliers: 1. understanding the source of the outlier (e.g. error) 2. management

• Feature selection: reducing the number of attributes without significant loss in the performance of the model
• Sampling: process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample data serve as a representative of the original dataset with similar properties, such as a similar mean.
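
Two of the preparation steps, handling missing values and normalizing attributes, can be sketched as follows. The records are hypothetical, `None` stands in for an NA, and min-max scaling to [0, 1] is only one of several possible normalizations (an assumption of this sketch, not a prescription of the course).

```python
# Hypothetical records: None marks a missing value ("NA").
ages = [25, 32, None, 41, 38, None, 29]
incomes = [30000, 42000, 39000, 80000, 51000, 46000, 35000]

def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def drop_missing(*columns):
    """Alternative: keep only the records without any missing value."""
    return [row for row in zip(*columns) if None not in row]

def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages_filled = impute_mean(ages)
complete_rows = drop_missing(ages, incomes)
ages_scaled = min_max(ages_filled)
incomes_scaled = min_max(incomes)

print(ages_filled)
print(len(complete_rows), "complete records out of", len(ages))
```

After scaling, age and income both range from 0 to 1, so a distance-based algorithm no longer treats a difference of a few thousand in income as overwhelmingly larger than a difference of a few years in age.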
Correlations - context

• Sarah works as a regional sales manager for a national supplier specializing in fossil fuels used for home heating.

• Lately, fluctuating market prices for heating oil, along with significant variations in the volume of individual orders, have raised concerns for her.

• She recognizes the importance of understanding the behaviors and other elements that drive demand for heating oil in the domestic market.

• What factors are related to heating oil usage, and how might she use a knowledge of such factors to better manage her inventory and anticipate demand?

(North, 2018)
Correlations - deployment

• Correlation is not causation: the two most strongly correlated attributes in our data set are Heating_Oil_Used and Home_Age, with a coefficient of 0.848. We don’t know why.

• Correlation coefficients are not percentages. Reading a correlation coefficient of 0.776 between two attributes as an indication that there is 77.6% shared variability between those two attributes is incorrect.

• Only linearity is modeled. For non-linear correlations, consider Kendall’s Tau or Spearman’s Rho

(North, 2018)
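
To illustrate the last point, here is a minimal sketch of Spearman's rho computed as Pearson's r on ranks (ties receive their average rank). The variable names borrow from the heating oil example, but the values are hypothetical and are not taken from the course dataset; the exponential relationship is an assumption made to obtain a monotonic, non-linear pattern.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

home_age = [1, 2, 3, 4, 5, 6, 7, 8]           # hypothetical values
oil_used = [math.exp(a) for a in home_age]    # monotonic but non-linear

r = pearson_r(home_age, oil_used)
rho = spearman_rho(home_age, oil_used)
print(round(r, 3), round(rho, 3))
```

Because the relationship is perfectly monotonic, rho equals 1, while r stays clearly below 1 since the points do not lie on a straight line.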
Are descriptive stats enough? Anscombe’s Quartet

Load the Anscombe_quartet_dataset.xlsx dataset into your RM

Extract descriptive statistics for all datasets

Compare the aggregated features of the four datasets: are we looking at the same data?

(Anscombe 1973)
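
For readers without access to the course file, the comparison can be sketched in Python. The values below are the published Anscombe (1973) figures typed in directly, on the assumption that Anscombe_quartet_dataset.xlsx contains the same numbers; the layout of that file is not known, so nothing is read from it.

```python
import math
import statistics

# Published Anscombe (1973) values; datasets I-III share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "var_x": statistics.variance(xs),   # sample variance
        "r": pearson_r(xs, ys),
    }
    s = summary[name]
    print(f"{name:>3}: mean_x={s['mean_x']} mean_y={s['mean_y']:.2f} "
          f"var_x={s['var_x']} r={s['r']:.3f}")
```

The four summaries are practically indistinguishable, yet plotting the four datasets reveals a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point.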
Data visualization

• Set of methods for expressing data in an abstract visual form.

• Univariate vs Multivariate
• Pros of data viz:
1. comprehension of dense information (the “big picture”)
2. relationships (Cartesian coordinates + creative tactics, like colours)
• Why data visualization is mandatory for exploratory purposes:
1. The relationship between two variables might increase, decrease, or even change direction depending on the set of variables being controlled
2. Causal inferences, particularly in nonexperimental studies, can be hazardous. Uncontrolled and even unobserved variables that would eliminate or reverse the association observed between two variables might exist
Univariate visualization

• 1. Histogram

• Plotting the frequency of occurrence in a range

• Used to find the central location, range, and shape of the distribution.

• For continuous values: bins need to be specified
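
The role of the bin specification can be sketched without any plotting library. The function below is a hypothetical helper (not part of any tool used in the course) that counts values in equal-width bins; the data are made up.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is assigned to the last bin, not to a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical observations

edges_4, counts_4 = histogram(values, 4)
edges_2, counts_2 = histogram(values, 2)

for edges, counts in ((edges_4, counts_4), (edges_2, counts_2)):
    for left, right, c in zip(edges, edges[1:], counts):
        print(f"[{left:.0f}, {right:.0f}) {'#' * c}")
    print()
```

With four bins the isolated value 9 is visible as a separate bar after an almost empty bin; with two bins it merges into the upper half of the data, so the choice of bins changes what the histogram reveals.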

Univariate visualization

• 2. Box plots

• Plotting the distribution of a continuous variable with information such as quartiles, median, and outliers, overlaid by mean and standard deviation

• Used to compare distributions of multiple attributes side by side (e.g. ANOVA).
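
The numbers behind a box plot can be computed directly. The sketch below uses `statistics.quantiles` from the Python standard library and flags outliers with the 1.5 x IQR fences, a common convention that is assumed here; the slides do not state which rule the plotting tool applies. The data are hypothetical.

```python
import statistics

values = [12, 15, 14, 10, 18, 13, 16, 11, 17, 45]  # hypothetical observations

# Quartiles; the "inclusive" method treats the data as the whole population.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Quantities the slide mentions as overlays on the box plot.
mean = statistics.mean(values)
sd = statistics.stdev(values)

print(q1, median, q3, iqr, outliers, round(mean, 2), round(sd, 2))
```

The mean (17.1) sits above the third quartile because of the single value 45, whereas the median stays at 14.5: the same robustness argument made for measures of central tendency.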

Univariate visualization

• 3. Distribution chart

• Gaussian curve showing the probability of occurrence of a data point within a range of values

• NB assumption of normality

• Used for “fast” predictions
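
A “fast” prediction of this kind amounts to reading an area under the fitted Gaussian curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the sample is hypothetical and the result is only meaningful if the normality assumption is reasonable.

```python
from statistics import NormalDist, mean, stdev

sample = [48, 52, 50, 47, 53, 51, 49, 50]  # hypothetical measurements

mu, sigma = mean(sample), stdev(sample)
dist = NormalDist(mu, sigma)  # Gaussian fitted to the sample

def prob_between(a, b):
    """Probability that a new data point falls in [a, b] under the fitted curve."""
    return dist.cdf(b) - dist.cdf(a)

p_within_1sd = prob_between(mu - sigma, mu + sigma)
p_above_54 = 1 - dist.cdf(54)

print(round(p_within_1sd, 4), round(p_above_54, 4))
```

The first probability reproduces the 68% rule of thumb for any normal distribution, regardless of its mean and standard deviation.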

Multivariate visualization

• 1. Scatterplot

• Cartesian space where 2 attributes are plotted as coordinates

• Used to identify relationships between attributes (e.g. linear relationship, presence of clusters, presence of outliers)

Multivariate visualization

• 2. Scatter matrix

• Comparing all combinations of attributes with individual scatterplots and arranging these plots in a matrix.

Multidimensional visualization

• 1. Parallel chart

• Projects multi-dimensional data onto a two-dimensional chart, where the dimensions are linearly arranged on one coordinate (x-axis) and all the measures are arranged on the other coordinate (y-axis).

• Since the x-axis is multivariate, each data point is represented as a line in a parallel space.

• Only for attributes measured with the same metrics, or standardized
Multidimensional visualization

• 2. Andrews curves

• Visualization technique where the high-dimensional data are projected into a vector space so that each data point takes the form of a curve (Fourier series).

• If two data points are similar, then the curves for the data points are closer to each other. If curves are far apart and belong to different classes, then this information can be used to classify the data
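
The curve associated with a data point x = (x1, x2, x3, ...) is usually written as f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., for t between -pi and pi. The sketch below evaluates that series in plain Python; the three data points and the "largest gap" measure of closeness are hypothetical choices made for illustration.

```python
import math

def andrews_curve(point, t):
    """Value at t of the Andrews curve (finite Fourier series) of a data point."""
    value = point[0] / math.sqrt(2)
    for i, x in enumerate(point[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        # Odd positions use sine terms, even positions use cosine terms.
        value += x * (math.sin(harmonic * t) if i % 2 == 1
                      else math.cos(harmonic * t))
    return value

def max_gap(p, q, steps=200):
    """Largest vertical distance between two curves on a grid over [-pi, pi]."""
    ts = [-math.pi + 2 * math.pi * k / steps for k in range(steps + 1)]
    return max(abs(andrews_curve(p, t) - andrews_curve(q, t)) for t in ts)

# Hypothetical four-attribute records: a and b are similar, c is different.
a = [5.1, 3.5, 1.4, 0.2]
b = [4.9, 3.0, 1.4, 0.2]
c = [6.3, 3.3, 6.0, 2.5]

gap_ab = max_gap(a, b)
gap_ac = max_gap(a, c)
print(round(gap_ab, 3), round(gap_ac, 3))
```

The curves of the two similar records stay close over the whole interval, while the curve of the third record departs by several units, which is the visual cue the slide refers to.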

Data Viz–in-class exercise

• Retrieve the «CaliforniaDDSdatav2.xslx» dataset, and answer the following questions:

- Is the allegation of ethnicity-based discrimination in funding allocation grounded?

I. Which cohort benefits the most from fund allocation (“expenditure”)?

II. Why is the overall average for all consumers significantly different, indicating ethnic discrimination against Hispanics, yet in all but one age cohort (18-21) the average of expenditures for Hispanic consumers is greater than that of the White non-Hispanic population?

The overall Hispanic consumer population is relatively younger than the White non-Hispanic consumer population. Since expenditures for younger consumers are lower, the overall average of expenditures for Hispanics (vs White non-Hispanics) is lower.
Simpson’s Paradox

• The marginal association between two categorical variables might be qualitatively different from the partial association between the same two variables after controlling for one or more other “lurking” variables (reversal or amalgamation paradox)

Pearson’s R(A) = Pearson’s R(C) = .81 (!!!)

(Matejka and Fitzmaurice, 2017: 1293)
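
The reversal can be reproduced with a few lines of arithmetic. All figures below are hypothetical (two groups labelled A and B, two age cohorts) and are not taken from the California DDS data; they are chosen so that the cohort mix, the “lurking” variable, differs strongly between the groups.

```python
# Hypothetical data: (number of consumers, average expenditure) per cohort.
groups = {
    "A": {"young": (80, 2000), "old": (20, 40000)},
    "B": {"young": (20, 1500), "old": (80, 38000)},
}

def overall_average(cohorts):
    """Average over all consumers: cohort averages weighted by cohort size."""
    total_n = sum(n for n, _ in cohorts.values())
    total_spend = sum(n * avg for n, avg in cohorts.values())
    return total_spend / total_n

overall = {g: overall_average(cohorts) for g, cohorts in groups.items()}

for cohort in ("young", "old"):
    a_avg, b_avg = groups["A"][cohort][1], groups["B"][cohort][1]
    print(f"{cohort:5}: A={a_avg} B={b_avg} -> A higher: {a_avg > b_avg}")
print(f"overall: A={overall['A']} B={overall['B']} "
      f"-> A higher: {overall['A'] > overall['B']}")
```

Group A has the higher average within each cohort but the lower overall average, because 80% of its members belong to the low-expenditure cohort: the marginal and the partial associations point in opposite directions.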

Assignment 1

• 1. In RM, retrieve the “HR_performance” and “HR_salary” datasets. For each dataset:

• Are personality traits (i.e. neuroticism) a good way to evaluate an employee’s performance (salary)? Why and how?

• What kind of statistical paradox are we facing, and why?
