M3 Exploratory Data Analysis


Exploratory Data Analysis

DS 203 Programming for Data Science


Amit Sethi, EE, IITB
Learning outcomes
• Define exploratory data analysis

• Perform basic EDA of single variables

• Perform basic EDA of pairs of variables

• Select EDA appropriate to the type of variable


EDA is about taking stock of data
• Understand the main characteristics of your data to plan
for downstream analyses

• Spot any issues with the data early

• Think about the type of analysis techniques,
approaches, and the experts needed
What do we analyze in EDA

• The entire data at a glance

• Each variable in isolation

• Pairs of variables
Types of questions about the entire data

• Number of samples

• Number of variables per sample

• Samples with missing variables

• Corrupted samples
Hypothetical dataset
Make     Model   Year  kmpl  Top-speed  0-60 kmph  Drivability
Hyundai  i-20    2017  18    120        13s        “3”
Hyundai  i-20    2018  17    130        11s        “4”
Hyundai  i-20    2019  19    130        13         “3”
Hyundai  i-10    2017  20    120        12s        “4”
Hyundai  i-10    2018  19    130        10         “5”
Hyundai  i-10    2019  20    120        12         “4”
…        …       …     …     …          …          …
…        …       …     …     …          …          …
Datsun           2019  20    110        15         “2”
w•ÿ      Baleno  2019  20    120        17         “3”
         Nano    2018  30    80         55         “2”
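The whole-dataset questions can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it transcribes four rows of the table above, uses `None` for a missing field, and the lowercase column names are assumptions made for the example.

```python
# Column names are illustrative; the slides use Make, Model, Year, etc.
COLUMNS = ["make", "model", "year", "kmpl", "top_speed", "t_0_60", "drivability"]

# Four rows transcribed from the table; None marks a missing field.
rows = [
    ("Hyundai", "i-20", 2017, 18, 120, "13s", "3"),
    ("Hyundai", "i-20", 2018, 17, 130, "11s", "4"),
    ("Datsun", None, 2019, 20, 110, "15", "2"),
    (None, "Nano", 2018, 30, 80, "55", "2"),
]

n_samples = len(rows)
n_variables = len(COLUMNS)

# Indices of samples with at least one missing variable.
incomplete = [i for i, row in enumerate(rows) if any(v is None for v in row)]

# Number of missing entries per variable.
missing_per_column = {
    name: sum(row[j] is None for row in rows) for j, name in enumerate(COLUMNS)
}

print("samples:", n_samples, "variables:", n_variables)
print("incomplete samples:", incomplete)
print("missing per column:", missing_per_column)
```

Only the standard library is used so the sketch runs anywhere; a dataframe library would normally do the same job.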
Types of questions about each variable
• Type and coding
– Nominal (may be coded as numerical)
– Ordinal (may be coded as numerical)
– True numerical (integer, quantized, float)
• Distribution
– Descriptive statistics
– Histograms
• Utility and ethics
– Variability
– Availability
– Should it be used?
Type and coding of variables can be different
• Integers can be used to code:
– Nominal / Categorical (species, postal codes)
– Binary categorical (face or not-face)
– Ordinal (very good, good, normal, bad, very bad)
– Numerical (age in years)
– Temporal (date)
• Text can be used to code:
– Nominal / Categorical (species, postal codes)
– Numerical saved as text
– Temporal saved as text (“Sept 5, 2020”)
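As an illustration of coding versus type, the hypothetical dataset stores acceleration times both as "13s" and as "13", and stores the ordinal drivability rating as quoted text. The sketch below shows one way to recover the intended values; the helper names `parse_seconds` and `parse_rating` are assumptions for this example.

```python
QUOTE_CHARS = "\"'“” "  # straight quotes, curly quotes, and spaces


def parse_seconds(text):
    """Convert '13s' or '13' to 13.0 (a true numerical value in seconds)."""
    cleaned = text.strip().lower().rstrip("s").strip()
    return float(cleaned)


def parse_rating(text):
    """Convert a quoted rating such as '“3”' to the integer 3.

    The result is ordinal: the order of the codes is meaningful,
    but differences between codes need not be.
    """
    return int(text.strip(QUOTE_CHARS))


for raw in ["13s", "11s", "13", "10"]:
    print(repr(raw), "->", parse_seconds(raw))

for raw in ["“3”", "“5”"]:
    print(repr(raw), "->", parse_rating(raw))
```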
Description of discrete variables

• List of unique values

• Order of values for ordinal


Histogram of a discrete variable is like a
probability mass function
[Figure: bar chart of Number of Samples (0 to 500) against Values of Variable X (Value 1 to Value 4)]
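A minimal sketch of describing a discrete variable: list its unique values, count each one, and divide by the number of samples to get relative frequencies, which behave like an estimated probability mass function. The sample values below are made up for illustration.

```python
from collections import Counter

# Hypothetical observations of a discrete variable X.
values = [
    "Value 1", "Value 2", "Value 2", "Value 3",
    "Value 2", "Value 4", "Value 1", "Value 2",
]

counts = Counter(values)        # histogram: number of samples per value
unique_values = sorted(counts)  # list of unique values
n = len(values)

# Relative frequencies sum to 1, like a probability mass function.
pmf = {value: counts[value] / n for value in unique_values}

for value in unique_values:
    print(value, counts[value], pmf[value])
```

A value with a very small or very large share in this table is the kind of issue a histogram can reveal early.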
Histogram can indicate problems
[Figure: bar chart of Number of Samples (0 to 500) against Values of Variable X (Value 1 to Value 4)]
A continuous variable is described by its
probability density function
[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5), vertical axis 0.0 to 1.0]

CDF is the integral of the PDF


It is monotonic, and rises from 0 to 1
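The relation between PDF and CDF can be checked numerically. The sketch below assumes a standard normal density (the slides do not say which distribution is plotted) and accumulates its integral with the trapezoidal rule; the running total never decreases and approaches 1.

```python
import math


def normal_pdf(x):
    """Standard normal density, chosen only as an example PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cumulative_trapezoid(f, lo, hi, steps):
    """Return grid points and the running trapezoidal integral of f from lo."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    cdf = [0.0]
    for left, right in zip(xs, xs[1:]):
        cdf.append(cdf[-1] + 0.5 * h * (f(left) + f(right)))
    return xs, cdf


xs, cdf = cumulative_trapezoid(normal_pdf, -6.0, 6.0, 1200)

is_monotonic = all(b >= a for a, b in zip(cdf, cdf[1:]))
print("monotonic:", is_monotonic)
print("CDF at left end:", cdf[0], "CDF at right end:", cdf[-1])
```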
A continuous variable is sampled
[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5), vertical axis 0.0 to 1.0]
Mean is the center of gravity of the PDF;
the sample mean only estimates the population mean
Median divides the PDF into two equal areas

[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5), vertical axis 0.0 to 1.0]
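To see the difference between sample and population statistics, the sketch below draws samples from an exponential distribution with rate 1. That distribution is an assumption for the example; its population mean is 1 and its population median is ln 2. A small sample typically misses both, while a large sample comes close.

```python
import math
import random
import statistics

POPULATION_MEAN = 1.0            # mean of an exponential with rate 1
POPULATION_MEDIAN = math.log(2)  # point that splits the PDF into two equal areas

rng = random.Random(0)  # fixed seed so the run is repeatable
small = [rng.expovariate(1.0) for _ in range(20)]
large = [rng.expovariate(1.0) for _ in range(20000)]

for name, sample in (("small", small), ("large", large)):
    print(
        name,
        "mean:", round(statistics.mean(sample), 3),
        "median:", round(statistics.median(sample), 3),
    )
print("population mean:", POPULATION_MEAN)
print("population median:", round(POPULATION_MEDIAN, 3))
```

For this skewed distribution the mean lies above the median, because the long right tail pulls the center of gravity to the right.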
Quartiles divide the PDF into four equal areas

[Figure: PDF, CDF, and samples plotted against x (0.0 to 2.5), vertical axis 0.0 to 1.0]
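For samples, quartiles are estimated as cut points that split the sorted data into four groups of roughly equal size. A minimal sketch with Python's `statistics.quantiles` on made-up data follows; note that different tools interpolate quartiles slightly differently, so values can vary between libraries.

```python
import statistics

# Made-up sample: the integers 1 to 11.
data = list(range(1, 12))

# n=4 requests the three cut points between four equal-probability groups.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)

# Count how many samples fall strictly between consecutive cut points.
groups = [
    sum(x < q1 for x in data),
    sum(q1 < x < q2 for x in data),
    sum(q2 < x < q3 for x in data),
    sum(x > q3 for x in data),
]
print("samples per group:", groups)
```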
A box-and-whisker plot summarizes the PDF
[Figure: box-and-whisker plot shown with the PDF, CDF, and samples against x (0.0 to 2.5)]
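The numbers behind a box-and-whisker plot can be computed directly. The sketch below uses the 0-60 times from the visible rows of the hypothetical dataset, assuming entries such as "13s" have already been cleaned to numbers. Placing the fences at 1.5 times the interquartile range is a common convention and an assumption here; the slides do not state which rule their plot uses.

```python
import statistics

# 0-60 times in seconds from the visible rows of the hypothetical dataset.
times = [13, 11, 13, 12, 10, 12, 15, 17, 55]

q1, median, q3 = statistics.quantiles(times, n=4)  # box edges and center line
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are drawn separately.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [t for t in times if lower_fence <= t <= upper_fence]
outliers = [t for t in times if t < lower_fence or t > upper_fence]

# Whiskers extend to the most extreme points that are still inside the fences.
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```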
Histogram divides the range into discrete bins
for counting samples
[Figure: two histograms of the same samples, one labeled "Too few bins" and one labeled "Too many bins"]
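The effect of the bin count can be explored with a small equal-width binning function. This is a sketch under assumptions: the samples are drawn from a standard normal distribution with a fixed seed, and the function name `histogram` is chosen for the example.

```python
import random


def histogram(samples, n_bins):
    """Count samples in n_bins equal-width bins spanning min to max."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        raise ValueError("all samples are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # The maximum lands exactly on the last edge, so clamp it into the final bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(200)]

for n_bins in (3, 15, 100):
    counts = histogram(samples, n_bins)
    print(n_bins, "bins: tallest bin holds", max(counts),
          "samples,", counts.count(0), "bins are empty")
```

With too few bins the shape of the distribution is hidden inside a handful of tall bars; with too many, most bins hold few or no samples and the outline becomes noisy.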
Types of questions about pairs of variables

• Relation between variables
– Are some variables correlated?
– Are there other strong relations between variables?
– Are some variables redundant?


A cross-tab compares a pair of discrete variables

Example: “Economic satisfaction” survey respondents by income and gender

Income ↓ / Gender →   Male     Female
None                  0        0
Low                   300      0
Medium                10,000   3,000
High                  5,000    2,000

• No women with “low” income?


• Very few men with “low” income?
• Is there a sampling bias (e.g. email survey)?
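A cross-tab is a count of how often each pair of values occurs together. The sketch below builds one from a short list of made-up respondents (not the survey counts above) and lists the empty cells, which are the first thing to question when looking for sampling bias.

```python
from collections import Counter


def crosstab(records):
    """Count how often each (row_value, column_value) pair occurs."""
    return Counter(records)


# Hypothetical respondents as (income, gender) pairs.
records = [
    ("Low", "Male"), ("Medium", "Male"), ("Medium", "Female"),
    ("Medium", "Male"), ("High", "Female"), ("High", "Male"),
]

incomes = ["None", "Low", "Medium", "High"]
genders = ["Male", "Female"]
table = crosstab(records)

print("Income", genders)
for income in incomes:
    # Looking up a pair that never occurred returns 0.
    print(income, [table[(income, gender)] for gender in genders])

empty_cells = [(i, g) for i in incomes for g in genders if table[(i, g)] == 0]
print("empty cells:", empty_cells)
```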
Correlation and scatter plots apply to pairs
of continuous variables
[Figure: scatter plot of Acceleration time 0-60 km/h (s) against Fuel efficiency (km/l), with fitted line y = 3.35x - 51.683, R² = 0.8313]

• There is an outlier; apart from it, the relation is
not strong, which suggests hidden factors
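The influence of a single outlier on correlation can be shown with the visible rows of the hypothetical dataset. This is a sketch: the elided rows are not available, so the numbers will not match the fitted line on the slide, and the 0-60 times are assumed to be already cleaned to seconds. The last pair (30 km/l, 55 s) is the Nano row.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Fuel efficiency (km/l) and 0-60 time (s) from the visible table rows.
kmpl = [18, 17, 19, 20, 19, 20, 20, 20, 30]
t_0_60 = [13, 11, 13, 12, 10, 12, 15, 17, 55]

r_all = pearson_r(kmpl, t_0_60)
r_without_outlier = pearson_r(kmpl[:-1], t_0_60[:-1])  # drop the last pair

print("r with outlier:   ", round(r_all, 3))
print("r without outlier:", round(r_without_outlier, 3))
```

The correlation drops sharply once the outlying pair is removed, which is why a scatter plot should be inspected alongside any correlation value.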
Correlation matrix can be computed for all
continuous variables together

        TP53   CDH2   CD55   BRCA1  BRCA2  ERBB2  AURKA
TP53    1.00   0.62   0.73   0.74   0.37   0.53   0.37
CDH2    0.62   1.00   0.90   0.30   0.67   0.93  -0.92
CD55    0.73   0.90   1.00   0.63   0.70   0.58   1.00
BRCA1   0.74   0.30   0.63   1.00   0.95   0.90   0.59
BRCA2   0.37   0.67   0.70   0.95   1.00   0.60   0.16
ERBB2   0.53   0.93   0.58   0.90   0.60   1.00   0.66
AURKA   0.37  -0.92   1.00   0.59   0.16   0.66   1.00

• Highly (positively or negatively) correlated variables
can create problems later
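One way to act on a correlation matrix is to list the variable pairs whose correlation is large in magnitude. The sketch below uses the matrix values from the slide, with column order taken to match the row order (the matrix is symmetric). The cutoff of 0.9 is an arbitrary assumption for illustration.

```python
NAMES = ["TP53", "CDH2", "CD55", "BRCA1", "BRCA2", "ERBB2", "AURKA"]

# Correlation matrix from the slide; rows and columns follow NAMES.
CORR = [
    [1.00, 0.62, 0.73, 0.74, 0.37, 0.53, 0.37],
    [0.62, 1.00, 0.90, 0.30, 0.67, 0.93, -0.92],
    [0.73, 0.90, 1.00, 0.63, 0.70, 0.58, 1.00],
    [0.74, 0.30, 0.63, 1.00, 0.95, 0.90, 0.59],
    [0.37, 0.67, 0.70, 0.95, 1.00, 0.60, 0.16],
    [0.53, 0.93, 0.58, 0.90, 0.60, 1.00, 0.66],
    [0.37, -0.92, 1.00, 0.59, 0.16, 0.66, 1.00],
]

THRESHOLD = 0.9  # assumed cutoff, not from the slides


def strongly_correlated(names, corr, threshold):
    """Return (name_a, name_b, r) for each pair above the diagonal with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) >= threshold:
                pairs.append((names[i], names[j], corr[i][j]))
    return pairs


pairs = strongly_correlated(NAMES, CORR, THRESHOLD)
for a, b, r in pairs:
    print(a, b, r)
```

Taking the absolute value matters: a strongly negative pair such as CDH2 and AURKA carries as much redundancy as a strongly positive one.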
