Eda PDF

The document discusses various concepts related to data science and machine learning including: 1) Types of variables such as numerical, categorical, ordinal, nominal variables and different types like response and explanatory variables. 2) Relationships between variables and how correlated or independent variables can be. 3) Visualizing data through scatterplots, histograms, boxplots and other charts to understand patterns, distributions, outliers and relationships. 4) Key statistical measures used to summarize data like mean, median, mode, variance, standard deviation, range and interquartile range.

Uploaded by

La Magnifico
Copyright: © All Rights Reserved

DATA SCIENCE & MACHINE LEARNING COURSE
https://www.facebook.com/diceanalytics/
Sensitivity: Internal
Data Organization
Data is stored in the form of a Data Matrix
• Each row is an Observation
• Each column is a Variable
• The header row holds the Variable Names

Types of Variables
Two main types of variables:
• Numerical: arithmetic operations can be performed
   1) Discrete: distinct set of values
   2) Continuous: infinite values within a range
• Categorical: qualitative; limited number of distinct categories
   1) Ordinal: levels with inherent ordering
   2) Nominal: no order, incomparable

Types of Variables

http://www.statisticshowto.com/types-variables/
https://statistics.laerd.com/statistical-guides/types-of-variable.php

Types of Variables
• Response Variable: It is the focus of a question in a study or experiment. It is the variable we want to predict or observe. It is the dependent variable.
• Explanatory Variable: It is the variable on which the response variable depends, or the variable which ‘explains’ the response variable. It is assumed to be the independent variable.

Relationship between Variables
• Two variables that show a connection with each other are called Associated/Correlated (Dependent)
• Two variables that do not show a connection with each other are called Independent
• An observation that lies far from the majority of the data is called an Outlier
[Figure: two example plots labelled A and B]

Data Visualisation

Visualising Numerical Data

Scatterplot
[Figure: scatterplot of Income (response variable, y-axis) against Age (explanatory variable, x-axis)]

Characteristics of Relationship
• Direction: positive (+ve) or negative (-ve)
• Shape: linear or curved
• Strength: strong or weak
• Outliers

Correlation (example)
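The direction and strength of a linear relationship can be summarised with Pearson's correlation coefficient. The following is a minimal sketch, not taken from the slides: the age and income values are hypothetical, and the helper name pearson_r is chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: age (explanatory) and income (response).
age = [22, 25, 30, 35, 40, 45, 50]
income = [20, 24, 31, 38, 41, 50, 54]

r = pearson_r(age, income)
print(f"r = {r:.3f}")  # a value near +1 means strong, positive, roughly linear
```

The sign of r gives the direction and its magnitude (between 0 and 1) gives the strength. It only measures linear association, so a curved relationship can have a small r even when the variables are clearly related.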

Histograms
• Help to view data density
• Help to see shape of distribution
   1) Skewness
   2) Modality
[Figure: histogram of Age Group (x-axis, 20 to 60); y-axis from 0 to 100]

Skewness
[Figure: three histograms over 20 to 60: Left Skewed (-ve skewness), Symmetric (zero skewness), Right Skewed (+ve skewness)]
• Draw a smooth curve to see skewness
• Don’t rely on jagged edges

Modality
[Figure: four histograms over 20 to 60: unimodal, bimodal, uniform, multimodal]

Modality (Example)
[Figure: three histograms over 20 to 60: Normal Distribution, Two separate groups, No trend]

Binwidth
[Figure: two histograms of the same kind of data, one with axis ticks every 10 (20 to 50) and one with ticks every 5 (20 to 50)]
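To see how the choice of binwidth changes the picture, the sketch below bins the same values twice and prints a text histogram for each. It is only an illustration: the ages are hypothetical, and the histogram helper is an assumed name, not something from the slides.

```python
from collections import Counter

def histogram(values, binwidth, start):
    """Count values per bin; each bin covers [left, left + binwidth)."""
    counts = Counter(start + ((v - start) // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, not taken from the slides.
ages = [21, 23, 24, 27, 28, 31, 33, 34, 36, 38, 39, 41, 42, 47, 49]

wide = histogram(ages, binwidth=10, start=20)
narrow = histogram(ages, binwidth=5, start=20)

for label, bins in (("binwidth 10", wide), ("binwidth 5", narrow)):
    print(label)
    for left, count in bins.items():
        # One '#' per observation in the bin that starts at `left`.
        print(f"  {left:>2}: {'#' * count}")
```

A wide bin hides detail inside each bin, while a very narrow bin produces jagged bars that are hard to read, so it is worth trying more than one binwidth.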

Measures of Center
Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
Sorted : 34, 45, 56, 62, 65, 77, 77, 79, 87, 90

Mean : Arithmetic average
Mean = (56 + 87 + 34 + 65 + 77 + 62 + 90 + 45 + 77 + 79) / 10
Mean = 672 / 10 = 67.2

Mode : Most frequent value/observation
Mode = 77

Median : Midpoint of the distribution (50th percentile), taken from the sorted data
Median = (65 + 77) / 2 = 71
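The three measures of center can be checked with Python's standard statistics module. This is a small sketch using the slide's data. Note that the median is taken from the sorted values, so the two middle values are 65 and 77.

```python
import statistics

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

mean = statistics.mean(data)      # arithmetic average
mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # sorts first, then averages the two middle values

print(sorted(data))
print(mean, mode, median)
```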

Box Plots
[Figure: box plot with the box (spanning the IQR), the whiskers and the outliers marked; labelled points from left to right: Min. Value, Q1, Q2, Q3, Max. Value]

Min. Value : Lower Extreme (that’s not an outlier)
Q1 : Lower Quartile (25% of observations lie at or below it)
Q2 : Median (50% of observations lie at or below it)
Q3 : Upper Quartile (75% of observations lie at or below it)
Max. Value : Upper Extreme (that’s not an outlier)
IQR : Inter-Quartile Range = Q3 - Q1 (middle 50% of observations)
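The quantities behind a box plot can be computed as in the sketch below, which reuses the data set from the Measures of Center slide. Two assumptions are not from the slides: the quartile convention (the default "exclusive" method of statistics.quantiles) and the 1.5 × IQR rule for flagging outliers, which is a common choice but not the only one.

```python
import statistics

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
# Other quartile conventions give slightly different Q1 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumption: the common 1.5 * IQR rule for flagging outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme values that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]

print("Q1, Q2, Q3, IQR:", q1, q2, q3, iqr)
print("whisker ends:", min(inside), max(inside))
print("outliers:", outliers)
```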

Box Plots & Skewness
[Figure: three box plots: Left Skewed, Symmetric, Right Skewed]

Skewness vs Measures of Center
Typical ordering of Mean, Median and Mode:
• Left Skewed: Mean < Median < Mode
• Symmetric: Mean = Median = Mode
• Right Skewed: Mean > Median > Mode

Intensity/Heat Maps

Time Plots
[Figure: time plot with months (Jan to Dec) on the x-axis and values from 0 to 225 on the y-axis]

Measures of Spread
• Range
• Variance
• Standard Deviation
• Inter-quartile Range

Range
• Range = Max. Value - Min. Value
• Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
• Range = 90 - 34 = 56

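Variance
• A measure of how much data (a variable) varies; how spread out a data set is about the mean.
• Average squared deviation from the mean; has squared units of the variable
• Sample Variance: s^2 = Σ (x_i - x̄)^2 / (n - 1), where x̄ is the sample mean and n the sample size
• Population Variance: σ^2 = Σ (x_i - μ)^2 / N, where μ is the population mean and N the population size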

Variance (Example)
• Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79
• Mean = 67.2
• Sum of Squares = (56 - 67.2)^2 + (87 - 67.2)^2 + … + (79 - 67.2)^2 = 2995.6
• Sample Variance = 2995.6 / (10 - 1) = 2995.6 / 9 = 332.8
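The same calculation can be reproduced step by step in Python. This sketch uses the slide's data and also shows the population version (dividing by n instead of n - 1) for comparison. The variable names are illustrative.

```python
import statistics

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in data)

sample_variance = sum_of_squares / (n - 1)  # divide by n - 1
population_variance = sum_of_squares / n    # divide by n

print(round(sum_of_squares, 1))
print(round(sample_variance, 1))
print(round(population_variance, 2))

# The standard library functions agree with the manual calculation.
assert abs(sample_variance - statistics.variance(data)) < 1e-9
assert abs(population_variance - statistics.pvariance(data)) < 1e-9
```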

Why Square The Differences?
• Get rid of negatives, so that the negatives and positives do not cancel each other during addition. For example, for the points 2 and 6 on a number line with mean 4: (2-4) + (6-4) = -2 + 2 = 0
• Increase larger deviations more than smaller ones so that they are weighted more heavily.

Standard Deviation (SD)
• Square root of Variance
• It has the same units as the variable, which makes it useful in comparisons and calculations
• Sample SD: s = √( Σ (x_i - x̄)^2 / (n - 1) )
• Population SD: σ = √( Σ (x_i - μ)^2 / N )

Spread
[Figure: two sets of points on a number line from 0 to 8, one tightly clustered and one widely scattered]
• Less Spread: Low Variance, Low Deviation
• More Spread: High Variance, High Deviation

Robust Statistics
• Measures on which extreme observations or outliers have little effect

              Robust    Non-Robust
Spread        IQR       SD, Range
Center        Median    Mean
Suited to     Skewed    Symmetric
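A quick way to see what "robust" means is to add one extreme observation and compare how much each measure moves. The sketch below does this with the data set used earlier in the deck. The added value 500 is hypothetical, and the quartiles use the default convention of statistics.quantiles, which is an assumption rather than something the slides specify.

```python
import statistics

def summary(values):
    """Center and spread measures for one list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
# Hypothetical extreme observation added for illustration.
with_outlier = data + [500]

before = summary(data)
after = summary(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The median and IQR shift only slightly, while the mean, SD and range change a lot, which is why the robust measures are preferred for skewed data or data with outliers.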

Data Transformations
• Applying a function f(x) to adjust the scale of the data.
• Usually done when data is skewed, so that modelling becomes easier.
• Also done to convert a non-linear relationship into a linear relationship.
[Diagram: Data -> f(x) -> Transformed Data]

(Natural) Log Transformation
• To transform data that is positively (right) skewed
• Usually done when data is concentrated near zero (relative to the few large values in the data)
[Figure: a Right Skewed histogram becomes Symmetric after taking the Natural Log]
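The effect of the natural log on right-skewed data can be sketched as follows. The values are hypothetical and deliberately idealised (each one doubles the previous), so that the logged data comes out exactly symmetric. The skew indicator used here, (mean - median) / SD, is a rough illustrative measure and not one defined in the slides.

```python
import math
import statistics

# Hypothetical right-skewed data: most values near zero, a few very large.
values = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Natural log; only defined for values greater than zero.
logged = [math.log(v) for v in values]

def skew_indicator(xs):
    """(mean - median) / SD: positive when the mean is pulled to the right."""
    return (statistics.mean(xs) - statistics.median(xs)) / statistics.stdev(xs)

gap_raw = skew_indicator(values)
gap_logged = skew_indicator(logged)

print(f"raw data:    {gap_raw:.3f}")
print(f"logged data: {abs(gap_logged):.3f}")
```

Real data will rarely become perfectly symmetric, but the mean and median usually move much closer together after the transformation.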

Log Transformation
• To make the relationship between two variables more linear
• Most of the simple methods for modelling work only when the relationship is linear
[Figure: relationship between two variables before and after applying Log]

Other Transformations
• You may use other transformations or create your own
• For instance: Square Root, Square, Inverse

Visualising Categorical Data

Bar Plot
[Figure: bar plot with Frequency on the y-axis (0 to 300)]

Bar Plot vs Histogram
• Bar Plot for Categorical Variables, Histogram for Numerical Variables
• X-axis in a Histogram must be a Number Line
• Bars in a Histogram cannot be reordered, whereas bars in a Bar Plot can

Pie Chart
[Figure: pie chart with slices of 46%, 39%, 12% and 3%. Legend text on the slide: Cricket, Football, Hockey, Squash; Very, Somewhat, Not Very, Not At All, Not Sure]
Use Bar Plot instead

Segmented Bar Plot
• For visualising conditional frequency distributions
• For comparing relative frequencies to explore the relationship between variables
[Figure: segmented bar plot, y-axis from 0 to 225, with legend entries Cricket, Football, Hockey, Squash]

Relative Frequency Segmented Bar Plot
[Figure: relative frequency segmented bar plot, y-axis from 0 to 1.2, with bars for Cricket, Football, Hockey, Squash]
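The heights of the segments in a relative frequency segmented bar plot are proportions within each bar, so every bar sums to 1. A minimal sketch of that conversion is shown below. The counts and the two age groups are hypothetical and are not read from the slide's figure.

```python
# Hypothetical counts per sport, split by an assumed second variable (age group).
counts = {
    "Cricket": {"Under 30": 120, "30 and over": 80},
    "Football": {"Under 30": 90, "30 and over": 30},
    "Hockey": {"Under 30": 20, "30 and over": 30},
    "Squash": {"Under 30": 5, "30 and over": 15},
}

def relative_frequencies(segments):
    """Convert one bar's segment counts to proportions that sum to 1."""
    total = sum(segments.values())
    return {name: count / total for name, count in segments.items()}

relative = {sport: relative_frequencies(seg) for sport, seg in counts.items()}

for sport, shares in relative.items():
    line = ", ".join(f"{name}: {share:.0%}" for name, share in shares.items())
    print(f"{sport:<8} {line}")
```

Because every bar has the same total height, differences in the segment proportions between bars point to a relationship between the two variables, even when the raw counts per bar are very different.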

Side-by-Side Box Plots
[Figure: side-by-side box plots of AGE (y-axis, 20 to 120) for Cricket, Football, Hockey, Squash, Badminton, Other Sports]

Bubble Plot
[Figure: bubble plot of Weight (kg) against Height (4’8 to 7’0) for two groups, Bowler and Batsman]

Outliers

Why do EDA
• To understand data properties
• To find patterns in data
• To suggest modelling strategies
• To "debug" analyses
• To communicate results

(From JHU)

Why do EDA

https://www.youtube.com/watch?v=jbkSRLYSojo
