Eda PDF
LEARNING COURSE
https://www.facebook.com/diceanalytics/
Sensitivity: Internal
Data Organization
Data is stored in the form of a Data Matrix
[Figure: data matrix with variable names as column headers; each row is an observation and each column is a variable]
Types of Variables
Variables: two main types
• Numerical: arithmetic operations can be performed
  - Distinct set of values (discrete)
  - Infinite values within a range (continuous)
• Categorical: qualitative; limited number of distinct categories
  - Levels with inherent ordering (ordinal)
  - No order, incomparable (nominal)
Types of Variables
http://www.statisticshowto.com/types-variables/
https://statistics.laerd.com/statistical-guides/types-of-variable.php
Types of Variables
• Response Variable: It is the focus of a question in a
study or experiment. It is the variable we want to
predict or observe. It is the dependent variable.
Relationship between Variables
Data Visualisation
Visualising Numerical Data
Scatterplot
[Figure: scatterplot of Income (response variable) against Age (explanatory variable)]
Characteristics of Relationship
• Direction: +ve or -ve
• Shape: linear or curved
• Strength: strong or weak
• Outliers
Correlation (example)
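As a minimal sketch of how a correlation coefficient summarises the direction and strength of a linear relationship, the snippet below computes Pearson's r by hand. The age and income values are hypothetical, chosen only for illustration, and the helper name `pearson_r` is an assumption rather than anything from the course material.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: age (explanatory) and income (response)
age = [22, 25, 30, 35, 40, 45, 50]
income = [20, 24, 31, 38, 41, 50, 52]

r = pearson_r(age, income)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive linear relationship
```

The sign of r gives the direction and its magnitude gives the strength; it says nothing reliable about curved relationships.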
Histograms
1) Skewness
2) Modality
[Figure: histogram with Age Group on the x-axis]
Skewness
• Left Skewed: -ve Skewness
• Symmetric: Zero Skewness
• Right Skewed: +ve Skewness
Modality
[Figure: four example histograms with different modality]
Modality (Example)
[Figure: three example histograms]
Binwidth
[Figure: two histograms with different bin widths; x-axis ticks every 10 and every 5]
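To illustrate how the choice of bin width changes what a histogram shows, here is a small sketch that counts the same values with widths of 10 and 5. The ages are hypothetical and the helper `bin_counts` is introduced only for this example.

```python
from collections import Counter

# Hypothetical ages, for illustration only
ages = [21, 23, 24, 27, 31, 32, 33, 36, 38, 41, 42, 44, 47, 49]

def bin_counts(values, width):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

print(bin_counts(ages, 10))  # {20: 4, 30: 5, 40: 5}
print(bin_counts(ages, 5))   # {20: 3, 25: 1, 30: 3, 35: 2, 40: 3, 45: 2}
```

Wider bins smooth the picture; narrower bins reveal more detail but also more noise.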
Measures of Center
Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
Mean (Arithmetic Average)
Mean = (56 + 87 + 34 + 65 + 77 + 62 + 90 + 45 + 77 + 79) / 10
Mean = 67.2
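The same calculation can be checked in a few lines of Python. This is only a sketch using the slide's data; the variable names are arbitrary.

```python
from statistics import mean

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

# Arithmetic average: sum of the observations divided by their count
manual_mean = sum(data) / len(data)

print(manual_mean)  # 67.2
print(mean(data))   # 67.2
```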
Box Plots
[Figure: box plot with labelled parts: box (IQR), whiskers, outliers]
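The quantities a box plot draws can be sketched numerically. The example below reuses the data set from the Measures of Center slide and makes two assumptions that the slides do not state: quartiles are computed with the default "exclusive" method of `statistics.quantiles`, and points more than 1.5 × IQR beyond the box are treated as outliers (a common convention).

```python
from statistics import quantiles

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

# Quartiles; other tools may use a different method and give slightly different values
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # length of the box

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)  # 53.25 71.0 81.0
print(iqr)         # 27.75
print(outliers)    # []
```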
Box Plots & Skewness
Skewness vs Measures of Center
• Left skewed: Mean < Median < Mode
• Symmetric: Mean = Median = Mode
• Right skewed: Mean > Median > Mode
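As a quick check of this ordering, the sketch below compares the three measures for the data set from the Measures of Center slide. The ordering is a rule of thumb rather than a guarantee, so the final branch is left deliberately non-committal.

```python
from statistics import mean, median, mode

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

m, med, mo = mean(data), median(data), mode(data)
print(m, med, mo)  # 67.2 71.0 77

if m < med < mo:
    shape = "left skewed"
elif m > med > mo:
    shape = "right skewed"
else:
    shape = "no clear skew from this rule"
print(shape)  # left skewed
```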
Intensity/Heat Maps
Time Plots
[Figure: time plot with months (Jan to Dec) on the x-axis]
Measures of Spread
• Range
• Variance
• Standard Deviation
• Inter-quartile Range
Range
• Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
• Range = 90 - 34 = 56
Variance
• A measure of how much data (a variable)
varies; how spread out a data set is about the
mean.
• Average squared deviation from mean; has
squared units of the variable
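• Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$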
Variance (Example)
• Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
• Sum of Squares = 2995.6
• Sample Variance = 2995.6 / 9 = 332.8
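The figures above can be reproduced step by step. This is a minimal sketch with arbitrary variable names; it also shows the population variance for comparison, which the slide does not compute.

```python
data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
sum_of_squares = sum((x - mean) ** 2 for x in data)

sample_variance = sum_of_squares / (n - 1)  # divide by n - 1
population_variance = sum_of_squares / n    # divide by n

print(round(sum_of_squares, 1))       # 2995.6
print(round(sample_variance, 1))      # 332.8
print(round(population_variance, 2))  # 299.56
```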
Why Square The Differences?
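One way to see the point, sketched below with the data set from the earlier slides: the raw deviations from the mean always cancel out, so their sum carries no information about spread, whereas squared deviations cannot cancel.

```python
data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]

# Positive and negative deviations cancel (up to floating-point rounding)
sum_of_deviations = sum(deviations)
print(abs(sum_of_deviations) < 1e-9)  # True

# Squaring makes every term non-negative, so the total grows with the spread
sum_of_squares = sum(d ** 2 for d in deviations)
print(round(sum_of_squares, 1))  # 2995.6
```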
Standard Deviation (SD)
• Square root of Variance
• It has the same units as the variable, which
makes it useful in comparisons and
calculations
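• Sample SD: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
• Population SD: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$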
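For the data set used in the earlier slides, both versions can be computed with Python's standard `statistics` module. A brief sketch:

```python
from statistics import stdev, pstdev

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

sample_sd = stdev(data)       # square root of the sample variance (n - 1)
population_sd = pstdev(data)  # square root of the population variance (n)

print(round(sample_sd, 2))      # 18.24
print(round(population_sd, 2))  # 17.31
```

Both results are in the same units as the data, unlike the variance.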
Spread
[Figure: two distributions on the same scale with different spread]
Robust Statistics
• Measures on which extreme observations or
outliers have little effect
• Robust measures: for skewed data
• Non-robust measures: for symmetric data
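A small sketch of the idea, comparing the median (commonly treated as robust) with the mean (non-robust). It reuses the data set from the earlier slides and appends one hypothetical extreme value, 900, which is not part of the course data.

```python
from statistics import mean, median

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
with_outlier = data + [900]  # hypothetical extreme observation

# How far each measure moves when the outlier is added
mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)

print(round(mean_shift, 1))  # 75.7
print(median_shift)          # 6.0
```

The single extreme value drags the mean far more than the median.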
Data Transformations
• Applying a function f(x) to adjust the scale of the data.
• Usually done when data is skewed, so that modelling
becomes easier.
• Also done to convert a non-linear relationship into a
linear relationship.
Data → f(x) → Transformed Data
(Natural) Log Transformation
• To transform data that is positively skewed
• Usually done when data is concentrated near
zero (relative to the few large values in the data)
[Figure: a distribution before and after the natural log transformation]
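The effect can be sketched numerically by comparing skewness before and after taking natural logs. The data values are hypothetical, and the `skewness` helper (third central moment divided by the cube of the population SD) is one common definition, assumed here for illustration.

```python
import math

def skewness(values):
    """Population skewness: third central moment over SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed data: most values near zero, a few large
raw = [1, 2, 2, 3, 4, 5, 8, 20, 100, 400]
logged = [math.log(v) for v in raw]  # natural log; requires values > 0

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

For this sample the skewness drops noticeably after the transformation, although it does not reach zero.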
Log Transformation
• To make the relationship between two variables
more linear
• Most of the simple methods for modelling work
only when the relationship is linear
[Figure: relationship between two variables before and after the log transformation]
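A minimal sketch of why this works, assuming a hypothetical exponential relationship y = 2·e^(0.5x): taking the log of y turns it into ln 2 + 0.5x, which is a straight line in x. The numbers below are illustrative only.

```python
import math

# Hypothetical exponential relationship: y = 2 * e^(0.5 * x)
xs = list(range(10))
ys = [2 * math.exp(0.5 * x) for x in xs]
log_ys = [math.log(y) for y in ys]

# Step-to-step changes: growing for y, constant for log(y)
steps_y = [b - a for a, b in zip(ys, ys[1:])]
steps_log_y = [b - a for a, b in zip(log_ys, log_ys[1:])]

print([round(s, 2) for s in steps_y])      # increases at every step
print([round(s, 2) for s in steps_log_y])  # 0.5 at every step
```

A constant step in log(y) for each unit of x is exactly what a linear relationship looks like.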
Other Transformation
Visualising Categorical Data
Bar Plot
[Figure: bar plot; y-axis: Frequency]
Bar Plot vs Histogram
Pie Chart
[Figure: pie chart; legend entries: Cricket, Football, Hockey, Squash; Very, Somewhat, Not Very, Not At All, Not Sure; slice labels: 3%, 12%, 46%, 39%]
Segmented Bar Plot
• For visualising conditional frequency distributions
• For comparing relative frequencies to explore
relationship between variables
[Figure: segmented bar plot; categories: Cricket, Football, Hockey, Squash]
Relative Frequency Segmented Bar Plot
[Figure: relative frequency segmented bar plot; x-axis: Cricket, Football, Hockey, Squash]
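The numbers behind a relative frequency segmented bar plot are conditional relative frequencies: within each bar, every count is divided by that bar's total. The sketch below uses hypothetical counts of interest level within each sport; none of these figures come from the course data.

```python
# Hypothetical counts: interest level within each sport (illustrative only)
counts = {
    "Cricket":  {"Very": 120, "Somewhat": 60, "Not Very": 20},
    "Football": {"Very": 40,  "Somewhat": 50, "Not Very": 10},
    "Hockey":   {"Very": 15,  "Somewhat": 30, "Not Very": 15},
    "Squash":   {"Very": 5,   "Somewhat": 10, "Not Very": 25},
}

# Divide each count by its sport's total so every bar sums to 1
relative = {}
for sport, levels in counts.items():
    total = sum(levels.values())
    relative[sport] = {level: c / total for level, c in levels.items()}

for sport, levels in relative.items():
    print(sport, {level: round(p, 2) for level, p in levels.items()})
```

Because every bar has the same height, differences in the segment sizes reflect differences in the distributions rather than in the group sizes.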
Side-by-Side Box Plots
[Figure: side-by-side box plots of AGE]
Bubble Plot
[Figure: bubble plot of Weight (kg) against Height; groups: Bowler, Batsman]
Outliers
Why do EDA
• To understand data properties
• To find patterns in data
• To suggest modelling strategies
• To "debug" analyses
• To communicate results
(From JHU)
Why do EDA
https://www.youtube.com/watch?v=jbkSRLYSojo