M3 Exploratory Data Analysis
• Pairs of variables
Types of questions about the entire data
• Number of samples
• Corrupted samples
Hypothetical dataset
Make     Model   Year  kmpl  Top-speed  0-60 kmph  Drivability
Hyundai  i-20    2017  18    120        13s        “3”
Hyundai  i-20    2018  17    130        11s        “4”
Hyundai  i-20    2019  19    130        13         “3”
Hyundai  i-10    2017  20    120        12s        “4”
Hyundai  i-10    2018  19    130        10         “5”
Hyundai  i-10    2019  20    120        12         “4”
…        …       …     …     …          …          …
…        …       …     …     …          …          …
Datsun           2019  20    110        15         “2”
         Baleno  2019  20    120        17         “3”
         Nano    2018  30    80         55         “2”
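The two dataset-level questions (how many samples, and which are corrupted) can be sketched in Python. The rows are copied from the hypothetical table. The field names, the use of None for an empty cell, and the rule that a clean acceleration value is a bare number are assumptions made for illustration only.

```python
# A few rows from the hypothetical table; None marks a cell that is empty there.
rows = [
    {"make": "Hyundai", "model": "i-20", "year": 2017, "accel": "13s"},
    {"make": "Hyundai", "model": "i-20", "year": 2019, "accel": "13"},
    {"make": "Datsun", "model": None, "year": 2019, "accel": "15"},
    {"make": None, "model": "Nano", "year": 2018, "accel": "55"},
]

def problems(row):
    """Return the reasons a row looks corrupted (empty list if it looks clean)."""
    found = []
    for field in ("make", "model"):
        if row[field] is None:
            found.append(f"missing {field}")
    # Assumed rule: a unit suffix such as "s" means inconsistent coding.
    if not row["accel"].isdigit():
        found.append("acceleration is not a bare number")
    return found

n_samples = len(rows)
corrupted = {i: problems(r) for i, r in enumerate(rows) if problems(r)}
print(n_samples, corrupted)
```

Whether a flagged row should be repaired or dropped depends on the dataset; the sketch only reports it.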
Types of questions about each variable
• Type and coding
– Nominal (may be coded as numerical)
– Ordinal (may be coded as numerical)
– True numerical (integer, quantized, float)
• Distribution
– Descriptive statistics
– Histograms
• Utility and ethics
– Variability
– Availability
– Should it be used?
Type and coding of variables can be different
• Integers can be used to code:
– Nominal / Categorical (species, postal codes)
– Binary categorical (face or not-face)
– Ordinal (very good, good, normal, bad, very bad)
– Numerical (age in years)
– Temporal (date)
• Text can be used to code:
– Nominal / Categorical (species, postal codes)
– Numerical saved as text
– Temporal saved as text (“Sept 5, 2020”)
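As a minimal sketch of why type and coding must be kept apart, the helpers below recover the underlying type from a stored code. All function names and the date layout (month name, day, year) are assumptions chosen for illustration.

```python
from datetime import date

# Ordinal labels in increasing order; the integer code is the position in the list.
LEVELS = ["very bad", "bad", "normal", "good", "very good"]
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def ordinal_code(label):
    """Map an ordinal label to an integer that preserves its rank."""
    return LEVELS.index(label)

def numeric_from_text(text):
    """Recover a number that was saved as text, e.g. ' 18 ' -> 18.0."""
    return float(text.strip())

def date_from_text(text):
    """Parse dates written like 'Sept 5, 2020' (month name, day, year)."""
    month, day, year = text.replace(",", " ").split()
    # Only the first three letters are used, so 'Sep' and 'Sept' both work.
    return date(int(year), MONTHS.index(month[:3].lower()) + 1, int(day))

print(ordinal_code("good"), numeric_from_text("18"), date_from_text("Sept 5, 2020"))
```

Comparing ordinal codes is meaningful, but arithmetic on them (or on nominal codes such as postal codes) generally is not.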
Description of discrete variables
[Figure: bar chart over the values of Variable X (Value 1 to Value 4); vertical axis from 0 to 300]
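A discrete variable can be described by counting how often each value occurs. The sketch below uses made-up observations; the variable name x and the value labels are illustrative.

```python
from collections import Counter

# Hypothetical observations of a discrete variable X.
x = ["Value 1", "Value 2", "Value 2", "Value 3", "Value 2", "Value 1"]

counts = Counter(x)                 # value -> number of samples
total = sum(counts.values())
for value in sorted(counts):
    share = counts[value] / total
    print(f"{value}: {counts[value]} samples ({share:.0%})")
```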
Histogram can indicate problems
[Figure: bar chart of Number of Samples (0 to 500) against the values of Variable X (Value 1 to Value 4)]
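Which problem the original figure showed cannot be recovered here. Two problems that value counts commonly expose are values that never occur and values that occur very rarely. The sketch below checks for both; the function name, the 5% threshold, and the sample data are assumptions.

```python
from collections import Counter

def count_problems(samples, expected_values, min_share=0.05):
    """Return (values never observed, values below min_share of all samples)."""
    counts = Counter(samples)
    total = len(samples)
    missing = [v for v in expected_values if counts[v] == 0]
    rare = [v for v in expected_values if 0 < counts[v] / total < min_share]
    return missing, rare

# Hypothetical, strongly imbalanced variable.
samples = ["A"] * 95 + ["B"] * 3 + ["C"] * 2
missing, rare = count_problems(samples, ["A", "B", "C", "D"])
print("never observed:", missing)
print("rare:", rare)
```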
A continuous variable is described by its probability density function
[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5); vertical axis from 0.0 to 1.0]
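The CDF is the integral of the PDF, so F(x) is the probability of observing a value at most x. With only samples available, F can be estimated as the fraction of samples that are at most x. A minimal sketch, with made-up sample values:

```python
import bisect

def empirical_cdf(samples):
    """Return a function F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F

samples = [0.2, 0.5, 0.5, 0.9, 1.4, 2.1]
F = empirical_cdf(samples)
print(F(0.0), F(0.5), F(2.5))
```

The average density over an interval [a, b] can then be approximated as (F(b) - F(a)) / (b - a).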
Mean is the center of gravity of the PDF;
the sample mean only estimates the population mean
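To illustrate the gap between the two, the sketch below draws samples from an exponential distribution with rate 1, whose population mean is exactly 1. The choice of distribution, the seed, and the sample sizes are assumptions for illustration.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
population_mean = 1.0       # mean of an exponential distribution with rate 1

small = [rng.expovariate(1.0) for _ in range(10)]
large = [rng.expovariate(1.0) for _ in range(100_000)]

small_mean = sum(small) / len(small)
large_mean = sum(large) / len(large)

print(f"population mean:        {population_mean}")
print(f"sample mean, n=10:      {small_mean:.3f}")
print(f"sample mean, n=100000:  {large_mean:.3f}")
```

The sample mean changes from one draw to the next; it tends to move closer to the population mean as the number of samples grows.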
Median divides the PDF into two equal areas
[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5); vertical axis from 0.0 to 1.0]
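For a set of samples, the median is the middle value after sorting, so equally many samples fall on each side. A small sketch with made-up values, which also shows that one extreme value does not move the median:

```python
from statistics import median

samples = [0.1, 0.3, 0.4, 0.6, 0.9, 1.2, 2.4]
m = median(samples)

below = sum(s < m for s in samples)   # samples left of the median
above = sum(s > m for s in samples)   # samples right of the median

# Replace the largest value with an extreme one: the median is unchanged.
m_with_outlier = median(samples[:-1] + [24.0])
print(m, below, above, m_with_outlier)
```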
Quartiles divide the area under the PDF into four equal parts
[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5); vertical axis from 0.0 to 1.0]
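Sample quartiles can be computed with Python's statistics.quantiles. The sketch uses its default "exclusive" method; other conventions give slightly different cut points on small samples. The sample values are made up.

```python
from statistics import quantiles

samples = [0.1, 0.3, 0.4, 0.6, 0.9, 1.2, 2.4]

# n=4 returns three cut points that split the samples into four groups.
q1, q2, q3 = quantiles(samples, n=4)
iqr = q3 - q1   # interquartile range: the width of the middle half

print(f"Q1={q1:.2f}  median={q2:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```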
A box-and-whisker plot summarizes the PDF
[Figure: box-and-whisker plot drawn with the PDF, CDF, and samples against x (0.0 to 2.5); vertical axis from 0.0 to 1.0]
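The numbers a box plot draws can be computed directly. The sketch below assumes Tukey's convention, in which whiskers extend to the most extreme samples within 1.5 times the interquartile range of the box and anything beyond is marked as an outlier. The slide does not say which convention it uses, and the function name and data are illustrative.

```python
from statistics import quantiles

def box_summary(samples, k=1.5):
    """Return the values a box plot draws, plus the points beyond the whiskers."""
    ordered = sorted(samples)
    q1, q2, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [s for s in ordered if low_fence <= s <= high_fence]
    outliers = [s for s in ordered if s < low_fence or s > high_fence]
    return {
        "whisker_low": inside[0],    # smallest sample within the fences
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_high": inside[-1],  # largest sample within the fences
        "outliers": outliers,
    }

summary = box_summary([1, 2, 3, 4, 5, 6, 20])
print(summary)
```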
Histogram divides the range into discrete bins for counting samples
[Figure: histograms of the same samples, one labeled "Too few bins" and one labeled "Too many bins"]
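The effect of the bin count can be seen by counting the same samples with different numbers of equal-width bins. This is a minimal sketch that assumes the samples are not all identical; the function name and data are illustrative.

```python
def histogram(samples, n_bins):
    """Count samples in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        # The maximum would fall just past the last bin, so clamp it into it.
        i = min(int((s - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

samples = [0.0, 0.25, 0.25, 0.5, 0.75, 1.0, 1.0, 1.25, 1.75, 2.0]
few = histogram(samples, 2)     # too few bins: the shape is hidden
mid = histogram(samples, 4)
many = histogram(samples, 16)   # too many bins: mostly empty or single counts
print(few, mid, many, sep="\n")
```

With too few bins every shape looks flat; with too many, most bins hold zero or one sample and the counts mostly reflect noise.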
Types of questions about pairs of variables
[Figure: plot with horizontal axis "Fuel efficiency (km/l)" from 15 to 29; vertical axis from 0 to 30]
       TP53   CDH2   CD55   BRCA1  BRCA2  ERBB2  AURKA
TP53   1.00   0.62   0.73   0.74   0.37   0.53   0.37
CDH2   0.62   1.00   0.90   0.30   0.67   0.93  -0.92
CD55   0.73   0.90   1.00   0.63   0.70   0.58   1.00
BRCA1  0.74   0.30   0.63   1.00   0.95   0.90   0.59
BRCA2  0.37   0.67   0.70   0.95   1.00   0.60   0.16
ERBB2  0.53   0.93   0.58   0.90   0.60   1.00   0.66
AURKA  0.37  -0.92   1.00   0.59   0.16   0.66   1.00
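A matrix like this can be built by computing a correlation coefficient for every pair of variables. The sketch below assumes Pearson correlation, which the slide does not name, and uses the kmpl and top-speed values from the Hyundai rows of the hypothetical dataset. Function and key names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns a nested dict of r values."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}

# Values from the six Hyundai rows of the hypothetical dataset.
cars = {
    "kmpl": [18, 17, 19, 20, 19, 20],
    "top_speed": [120, 130, 130, 120, 130, 120],
}
r = correlation_matrix(cars)
for name, row in r.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

The result is symmetric with ones on the diagonal, and every entry lies between -1 and 1.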