Data Analysis and Visualization EDA
DATA
ANALYSIS
Lecture 2
EXPLORATORY DATA
ANALYSIS
• EDA is an approach to analyzing data in order to
summarize its main characteristics, gain a better
understanding of the data set, uncover relationships
between variables, and extract the variables that matter
for the problem we're trying to solve.
Types of Measures
• Central Tendency
• Variability
• Relative Standing
TYPES AND FORMS OF DATA
Data: Encoded Knowledge (numbers, text, sound, colors, images)
Data Types:
• Independent – given (Input)
• Dependent – observed (Output)
Data Forms:
• Discrete – finite possible values (yes/no; Republican/Democrat; satisfied/not satisfied, etc.)
• Continuous – infinite possible values (height, weight, length, time, etc.)
FORMS OF DATA
DIMENSIONALITY OF DATA SETS
• Univariate: f(x)
• Bivariate: f(x1, x2)
• Multivariate: f(x1, x2, x3, …, xn)
[Figure: data table with columns X1, X2, X3, X4, …, Xn]
UNDERSTANDING DESCRIPTIVE DATA ANALYSIS
• Describing and summarizing data.
• Uses two main approaches:
• Quantitative: numerically.
• Visual: charts and other graphs.
• The goal is only to describe the data at hand, not to
infer or estimate properties of a larger population.
TYPES OF MEASURES
• Central Tendency: locates the center of the data using the mean, median,
and mode.
CENTRAL TENDENCY
CENTRAL LIMIT THEOREM
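• As the sample size increases, the mean of a random sample more closely
resembles the mean of the whole dataset, and the distribution of sample means
approaches a normal distribution, regardless of the shape of the original
distribution (provided its variance is finite).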
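The behavior described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population (mean 1), the sample sizes, the 2000 repetitions, and the fixed seed are all choices made here, not part of the lecture.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed so the sketch is repeatable


def sample_mean(sample_size):
    """Mean of one random sample from a skewed (exponential) population with mean 1."""
    draws = [random.expovariate(1.0) for _ in range(sample_size)]
    return sum(draws) / sample_size


center = {}  # average of the sample means, per sample size
spread = {}  # standard deviation of the sample means, per sample size
for size in (2, 30, 200):
    means = [sample_mean(size) for _ in range(2000)]
    center[size] = sum(means) / len(means)
    spread[size] = statistics.stdev(means)
    print(size, round(center[size], 3), round(spread[size], 3))
```

As the sample size grows, the sample means should cluster more tightly around the population mean, even though the population itself is strongly skewed.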
NORMAL DISTRIBUTION
[Figure: normal curve with regions labeled 68%, 95%, and 99%]
THE BELL CURVE
[Figure: bell curve centered at Mean = 70, with both tails labeled "Significant"]
MEASURES OF CENTRAL TENDENCY
• Mean: the arithmetic average
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
ARITHMETIC MEAN EXAMPLE
Scores: 98, 88, 81, 74, 72, 72, 70, 69, 65, 52
Sum = 741
Mean = 741 / 10 = 74.1
MODE EXAMPLE
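The mode is the value that occurs most often. As a minimal sketch, assuming the ten scores from the arithmetic mean example are reused here, the standard library can find it directly:

```python
import statistics

# Scores reused from the arithmetic mean example in this deck
scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

mode_ = statistics.mode(scores)        # the most frequent value
modes = statistics.multimode(scores)   # every value tied for most frequent

print(mode_)
print(modes)
```

`statistics.multimode()` is useful when several values share the highest frequency, since it returns all of them in a list.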
MEDIAN EXAMPLE
Odd Number (N = 9): 98, 88, 81, 74, 72, 70, 69, 65, 52
Midpoint = 72
Even Number (N = 10): 98, 88, 81, 74, 72, 71, 70, 69, 65, 52
Midpoint = (72 + 71) / 2 = 71.5
The two most important steps of this implementation are as follows:
• Sorting the elements of the dataset
• Finding the middle element(s) in the sorted dataset
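The two steps (sort, then pick the middle element or average the two middle elements) can be sketched in pure Python. The function name `median` is a choice made for this illustration; the data are the odd and even lists from the median example.

```python
def median(values):
    """Return the median: sort, then take the middle element(s)."""
    ordered = sorted(values)          # step 1: sort the dataset
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle element
        return ordered[mid]
    # even count: average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [98, 88, 81, 74, 72, 70, 69, 65, 52]
even_scores = [98, 88, 81, 74, 72, 71, 70, 69, 65, 52]

print(median(odd_scores))   # 72
print(median(even_scores))  # 71.5
```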
IMPLEMENTATION OF CONCEPTS USING PYTHON
# Import libraries
import math
import statistics
import numpy as np
# Define a list
x = [8.0, 1, 2.5, 4, 28.0]  # Sample data
# Print values
print(x)
# Calculate the mean in pure Python
mean_ = sum(x) / len(x)  # 8.7
# Using the statistics library
mean_ = statistics.mean(x)  # 8.7
# Using NumPy
y = np.array(x)
mean_ = np.mean(y)  # 8.7
IMPLEMENTATION OF CONCEPTS USING PYTHON
median_ = statistics.median(x)  # x is the sample list defined on the previous slide
print(median_)  # 4
VARIABILITY
• Measures of central tendency alone aren't sufficient to describe data.
• We also need measures of variability that quantify the spread of the data points.
• Variability measures:
• Range
• Variance
• Standard deviation
• Skewness
• Percentiles
THE RANGE AS A MEASURE OF SPREAD
• Range = largest value – smallest value
Group 1: 100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42
Group 2: 91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60
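A short sketch of the range calculation for the two groups above. The helper name `value_range` is an assumption made for this illustration.

```python
group_1 = [100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42]
group_2 = [91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60]


def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


range_1 = value_range(group_1)
range_2 = value_range(group_2)
mean_1 = sum(group_1) / len(group_1)
mean_2 = sum(group_2) / len(group_2)

print(range_1, range_2)  # 58 31
print(mean_1, mean_2)    # 75.5 75.5
```

Both groups have the same mean, yet Group 1 is spread over a much wider range, which is why a measure of center alone does not describe the data.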
Population Variance: S^2 = \frac{\sum (X_i - \bar{X})^2}{N}
Sample Variance: s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}
VARIANCE EXAMPLE
 X      X̄       X − X̄     (X − X̄)²
98     74.1     23.90      571.21
88     74.1     13.90      193.21
81     74.1      6.90       47.61
74     74.1     -0.10        0.01
72     74.1     -2.10        4.41
72     74.1     -2.10        4.41
70     74.1     -4.10       16.81
69     74.1     -5.10       26.01
65     74.1     -9.10       82.81
52     74.1    -22.10      488.41
Mean = 74.1     Sum of squared deviations = 1,434.90
Population Variance (N): 1,434.90 / 10 = 143.49
Sample Variance (n − 1): 1,434.90 / 9 = 159.43
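The table can be reproduced in a few lines. This is a sketch using the ten scores from the example; the variable names are choices made here.

```python
import statistics

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

n = len(scores)
mean_ = sum(scores) / n                                       # 74.1
squared_deviations = sum((x - mean_) ** 2 for x in scores)    # about 1434.90

population_variance = squared_deviations / n        # divide by N
sample_variance = squared_deviations / (n - 1)      # divide by n - 1

print(round(population_variance, 2))  # 143.49
print(round(sample_variance, 2))      # 159.43

# The standard library gives the same results
print(round(statistics.pvariance(scores), 2))
print(round(statistics.variance(scores), 2))
```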
WHY FIND VARIANCE AND STD?
• The variance is used in many higher-order calculations, including:
• T-test (inferential, mean, two samples)
• Analysis of Variance (ANOVA) (inferential, variance, two or more samples)
• Regression (inferential, cause-effect correlation)
• Variance = zero: all values in the set are identical.
• Every other variance is a positive number. Why? Because each deviation from
the mean is squared before it is summed.
• A large variance indicates that the numbers in the set are far from the
mean and from each other, while a small variance indicates the opposite.
STANDARD DEVIATION
• Once you have the variance, you can calculate the standard deviation by taking
its square root.
• The higher the standard deviation, the greater the variability (spread)
of the scores.
• Why std instead of variance?
• The std is more convenient because it has the same unit as
the data points, i.e., s instead of s^2.
• It helps to localize a data item relative to the mean (z-score).
s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}
STANDARD DEVIATION EXAMPLE
 X      X̄       X − X̄     (X − X̄)²
98     74.1     23.90      571.21
88     74.1     13.90      193.21
81     74.1      6.90       47.61
74     74.1     -0.10        0.01
72     74.1     -2.10        4.41
72     74.1     -2.10        4.41
70     74.1     -4.10       16.81
69     74.1     -5.10       26.01
65     74.1     -9.10       82.81
52     74.1    -22.10      488.41
Mean = 74.1     Sum of squared deviations = 1,434.90
Population STD: 1,434.90 / 10 = 143.49, and \sqrt{143.49} = 11.98
Sample STD: 1,434.90 / 9 = 159.43, and \sqrt{159.43} = 12.63
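A brief sketch of the same calculation, taking the square root of each variance and cross-checking against the standard library. The variable names are choices made for this illustration.

```python
import math
import statistics

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

n = len(scores)
mean_ = sum(scores) / n
squared_deviations = sum((x - mean_) ** 2 for x in scores)

population_std = math.sqrt(squared_deviations / n)        # divide by N
sample_std = math.sqrt(squared_deviations / (n - 1))      # divide by n - 1

print(round(population_std, 2))  # 11.98
print(round(sample_std, 2))      # 12.63

# statistics.pstdev() and statistics.stdev() compute the same quantities
print(round(statistics.pstdev(scores), 2))
print(round(statistics.stdev(scores), 2))
```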
Z-SCORE FORMULA
The z-score is simply a way of telling how far a score is from the mean in
standard deviation units.
z = \frac{X - \bar{X}}{S}
Z-scores with positive values are above the mean, while z-scores
with negative values are below the mean.
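COMPARING Z-SCORES
Mathematics: z = \frac{X - \mu}{\sigma} = \frac{78 - 75}{6} = \frac{3}{6} = 0.5
English: z = \frac{X - \mu}{\sigma} = \frac{57 - 52}{4} = \frac{5}{4} = 1.25
Although the Mathematics score is higher in raw points, the English score is
relatively better: it lies 1.25 standard deviations above its mean, versus 0.5.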
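The comparison can be sketched as a one-line function. The name `z_score` and its parameter names are assumptions made for this illustration; the numbers are the ones from the slide.

```python
def z_score(value, mean, std):
    """Distance of a value from the mean, in standard deviation units."""
    return (value - mean) / std


math_z = z_score(78, mean=75, std=6)
english_z = z_score(57, mean=52, std=4)

print(math_z)     # 0.5
print(english_z)  # 1.25
```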
AREA UNDER THE NORMAL CURVE
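[Figure: area under the normal curve. Each half of the curve holds 50% of the
area. Moving outward from the mean, the successive bands on each side hold
34.1%, 13.5%, and 2.2%, giving 68.2%, 95.2%, and 99.6% within 1, 2, and 3
standard deviations.]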
AREA UNDER THE NORMAL CURVE
• For many lists of observations, especially if their histogram is bell-shaped:
• Roughly 68% of the observations lie within 1 std of the mean
• About 95% of the observations lie within 2 std of the mean
• About 99.7% of the observations lie within 3 std of the mean (the rounded
band percentages in the figure add up to 99.6%)
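These percentages can be checked against an exact normal distribution. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `area_within` is an assumption made here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Fraction of a normal distribution lying within k std of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} std: {area:.4f}")
```

Real data only follow these percentages approximately, and only when the histogram is close to bell-shaped.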
SKEWNESS
• Negative skewness: the bulk of the values lies toward the right side (higher numbers), with a longer tail extending to the left.
• Positive skewness: the bulk of the values lies toward the left side (lower numbers), with a longer tail extending to the right.
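The sign of the skewness can be computed from the data. The lecture gives no formula, so the sketch below assumes the population (moment-based) definition, the third central moment divided by the 1.5 power of the second; the sample list is the one used in the Python slides, and the mirrored and symmetric lists are made up here purely to show the other two cases.

```python
def skewness(values):
    """Population skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean_ = sum(values) / n
    m2 = sum((v - mean_) ** 2 for v in values) / n
    m3 = sum((v - mean_) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


x = [8.0, 1, 2.5, 4, 28.0]        # sample data from the Python slides
mirrored = [-v for v in x]        # hypothetical: same data reflected around zero
symmetric = [1, 2, 3, 4, 5]       # hypothetical: perfectly symmetric data

print(skewness(x) > 0)            # long tail to the right: positive skewness
print(skewness(mirrored) < 0)     # long tail to the left: negative skewness
print(skewness(symmetric))        # symmetric data: skewness of 0
```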
SYMMETRIC VS. SKEWED DATA
WHEN THE DISTRIBUTION MAY NOT BE NORMAL
[Figure: histogram "Salary Sample Data", frequency versus annual salary in
thousands of dollars (25 to 175), annotated Average = 62K, Mode = 45K, and
"Median =" (value missing)]
QUARTILES
• Each dataset has three quartiles, which are the percentiles that divide the
dataset into four parts:
• Q1: a value for which 25% of the observations are smaller and 75% are
larger
• Q2: the same as the median (50% are smaller, 50% are larger)
• Q3: a value for which 75% of the observations are smaller and 25% are larger
PERCENTILES
• In general, the nth percentile is a value such that n% of the observations fall at
or below it.
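A minimal sketch of quartiles and percentiles using the standard library. Two assumptions are made here: the ten scores from the mean example are reused as data, and the "inclusive" interpolation method is chosen (other methods give slightly different cut points). The helper `percent_at_or_below` is a name invented for this illustration.

```python
import statistics

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

# n=4 splits the data into four parts and returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
print(q1, q2, q3)  # 69.25 72.0 79.25


def percent_at_or_below(values, threshold):
    """Percentage of observations that fall at or below the threshold."""
    return 100 * sum(v <= threshold for v in values) / len(values)


print(percent_at_or_below(scores, 81))  # 80.0
```

Q2 matches the median of the same data, as the quartile definition requires.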
IMPLEMENTATION OF CONCEPTS USING PYTHON
# Import libraries
import math
import statistics
import numpy as np
# Define a list
x = [8.0, 1, 2.5, 4, 28.0]  # Sample data
# Calculate the sample variance in pure Python
n = len(x)
mean_ = sum(x) / n  # 8.7
var_ = sum((item - mean_)**2 for item in x) / (n - 1)  # 123.2
# Using the statistics library
var_ = statistics.variance(x)  # 123.2
# Using NumPy
y = np.array(x)
var_ = np.var(y, ddof=1)  # 123.2
It's very important to specify the parameter ddof=1.
That's how you set the delta degrees of freedom to 1.
This parameter allows the proper calculation of s^2,
with (n − 1) in the denominator instead of n.
For whole population variance:
• Replace (n - 1) with n in the pure Python implementation.
• Use statistics.pvariance() instead of statistics.variance().
• Specify the parameter ddof=0 if you use NumPy or Pandas. In
NumPy, you can omit ddof because its default value is 0.
RELATIVE STANDING
COVARIANCE
• Signifies the direction of the linear relationship between two variables.
• Direction means whether the variables tend to move together (positive
covariance) or in opposite directions (negative covariance).
• The value of covariance can be any number (it is not scaled).
• It only measures how two variables change together, not the dependency of
one variable on the other.
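A minimal sketch of the sample covariance in pure Python. It assumes the age and glucose values from the correlation example later in this deck as data, and it divides by n − 1 (sample covariance); dividing by n instead would give the population covariance.

```python
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]


def sample_covariance(xs, ys):
    """Average product of paired deviations from the means, using n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_xy = sample_covariance(age, glucose)
print(round(cov_xy, 2))  # 95.6
```

The positive sign says the two variables tend to increase together; the magnitude depends on the units, so it says little about the strength of the relationship.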
COVARIANCE EXAMPLE
CORRELATION
CORRELATION VS COVARIANCE
Covariance | Correlation
Covariance is nothing but a measure of correlation. | Correlation refers to the scaled form of covariance.
Covariance indicates the direction of the linear relationship between variables. | Correlation, on the other hand, measures both the strength and direction of the linear relationship between two variables.
Covariance can vary between -∞ and +∞. | Correlation ranges between -1 and +1.
Covariance is affected by a change in scale. If all the values of one variable are multiplied by a constant and all the values of another variable are multiplied by a similar or different constant, then the covariance changes. | Correlation is not influenced by a change in scale.
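The effect of scale can be demonstrated numerically. This sketch assumes the age and glucose values from the correlation example in this deck; the scale factors (12 and 2) are arbitrary positive constants chosen here for illustration.

```python
import math

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]


def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance scaled by the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical rescaling: multiply each variable by a positive constant
age_scaled = [12 * a for a in age]
glucose_scaled = [2 * g for g in glucose]

cov, cov_scaled = covariance(age, glucose), covariance(age_scaled, glucose_scaled)
corr, corr_scaled = correlation(age, glucose), correlation(age_scaled, glucose_scaled)

print(cov, cov_scaled)    # covariance grows by a factor of 12 * 2 = 24
print(corr, corr_scaled)  # correlation is unchanged
```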
CORRELATION EXAMPLE
SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
From our table:
\sum x = 247
\sum y = 486
\sum xy = 20,485
\sum x^2 = 11,409
\sum y^2 = 40,022
n is the sample size, in our case n = 6
R = \frac{6(20,485) - (247 \times 486)}{\sqrt{[6(11,409) - 247^2] \times [6(40,022) - 486^2]}}
= 0.5298
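The calculation above can be reproduced step by step. This is a sketch that builds the five sums from the table and plugs them into the same formula; the variable names are choices made for this illustration.

```python
import math

age = [43, 21, 25, 42, 57, 59]          # x
glucose = [99, 65, 79, 75, 87, 81]      # y

n = len(age)
sum_x = sum(age)                                     # 247
sum_y = sum(glucose)                                 # 486
sum_xy = sum(x * y for x, y in zip(age, glucose))    # 20485
sum_x2 = sum(x * x for x in age)                     # 11409
sum_y2 = sum(y * y for y in glucose)                 # 40022

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))  # 0.5298
```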
EXAMPLE
• A survey was given to students to find out how many hours per week they would listen to a
particular radio station.
• The data collected was distributed by gender.
• Determine the mean, range, variance and standard deviation of each group.
• Find the correlation coefficient R
Group A (Female) Group B (Male)
15 30
25 15
12 21
7 12
3 25
33 20
18 5
16 24
9 17
24 11
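One possible way to work the exercise is sketched below. Several choices here are assumptions rather than part of the lecture: the variance and standard deviation use the sample form (n − 1), the helper names `describe` and `pearson_r` are invented for this sketch, and the correlation pairs the two groups row by row as they appear in the table.

```python
import math

group_a = [15, 25, 12, 7, 3, 33, 18, 16, 9, 24]    # Female
group_b = [30, 15, 21, 12, 25, 20, 5, 24, 17, 11]  # Male


def describe(values):
    """Mean, range, sample variance, and sample standard deviation."""
    n = len(values)
    mean_ = sum(values) / n
    variance = sum((v - mean_) ** 2 for v in values) / (n - 1)
    return {
        "mean": mean_,
        "range": max(values) - min(values),
        "variance": variance,
        "std": math.sqrt(variance),
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


summary_a = describe(group_a)
summary_b = describe(group_b)
r = pearson_r(group_a, group_b)

print(summary_a)
print(summary_b)
print(r)
```

Because the two groups are different people, the row-by-row pairing has no natural meaning; R is computed here only because the exercise asks for it.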
SOLUTION - EXAMPLE
51