Data Analysis and Visualization EDA

Exploratory Data Analysis (EDA) is a method for summarizing data characteristics, understanding relationships between variables, and identifying the key factors that affect outcomes such as car prices. The lecture covers data types, descriptive analysis, measures of central tendency and variability, and the implementation of these concepts in Python. Key topics include the importance of variance and standard deviation in data analysis and the use of z-scores for comparing distributions.

Uploaded by

usairashahbaz152
Copyright
© © All Rights Reserved

EXPLORATORY DATA ANALYSIS
Lecture 2

EXPLORATORY DATA ANALYSIS
• EDA is an approach to analyzing data in order to summarize its main characteristics, gain a better understanding of the data set, uncover relationships between different variables, and extract the variables that matter most for the problem we're trying to solve.

• Question: which characteristics have the most impact on the car price?
TOPICS COVERED
Types and Forms of Data

Understand Descriptive Data Analysis

Types of Measures

• Central Tendency
• Variability
• Relative Standing

3
TYPES AND FORMS OF DATA
Data: Encoded Knowledge (numbers, text, sound, colors, images)

Dataset: Table (Rows, Columns, Cell (Data Items))

Data Types:
• Independent – given (Input)
• Dependent – observed (Output)
Data Forms:
• Discrete – finite possible values (yes/no; republican/democrat; satisfied/not-satisfied, etc.)
• Continuous – infinite possible values (height, weight, length, time, etc.)

4
FORMS OF DATA

5
DIMENSIONALITY OF DATA SETS
• Univariate: f(x)
• Bivariate: f(x1, x2)
• Multivariate: f(x1, x2, x3, ….., xn)

[Figure: variables X1, X2, X3, X4, …, Xn]

6
UNDERSTANDING DESCRIPTIVE DATA ANALYSIS
• Describing and summarizing data.
• Uses two main approaches:
• Quantitative: numerically.
• Visual: charts and other graphs.
• It does not infer or estimate properties of a larger population; it only describes the data at hand.

7
TYPES OF MEASURES
• Central Tendency. To find the center of the data, i.e., mean, median, and mode.

• Variability. To find the "data spread" or distance from the center, e.g. variance and standard deviation (std).

• Relative Standing. To find the relative position of specific data items, i.e. covariance and correlation coefficient.

8
CENTRAL TENDENCY
CENTRAL LIMIT THEOREM
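• The mean of a random sample will more closely resemble the mean of the whole population as the sample size increases, regardless of the shape of the distribution. Strictly speaking, this convergence is the law of large numbers.
• The central limit theorem adds that, for large samples from a distribution with finite variance, the distribution of the sample mean is approximately normal.

• Mean of Sample ≈ Mean of whole population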
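
The convergence of the sample mean can be checked with a small simulation. This is only an illustrative sketch: the exponential population, its mean of 2.0, the seed, and the sample sizes are all assumptions chosen to give a visibly skewed distribution, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential distribution with mean 2.0.
POPULATION_MEAN = 2.0

# Draw samples of increasing size and compare each sample mean
# with the known population mean.
for n in (10, 1_000, 100_000):
    sample = rng.exponential(scale=POPULATION_MEAN, size=n)
    error = abs(sample.mean() - POPULATION_MEAN)
    print(f"n = {n:>7}: sample mean = {sample.mean():.4f}, error = {error:.4f}")

# Keep one large-sample mean for inspection.
large_sample_mean = rng.exponential(scale=POPULATION_MEAN, size=100_000).mean()
```

With a large sample the printed error should be small even though the population is far from bell-shaped.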

10
NORMAL DISTRIBUTION

Symmetrical (50% above and below mean)

[Figure: bell curve with bands around the mean marked 68%, 95%, and 99%]

11
THE BELL CURVE

Histograms (Frequency of items)


[Figure: bell-shaped histogram with Mean = 70; each tail is labeled .01 and marked Significant]

12
MEASURES OF CENTRAL TENDENCY
•Mean - arithmetic average
–Σ (𝑥ᵢ / 𝑛), where 𝑖 = 1, 2, …, 𝑛.

•Median - midpoint of the distribution (data items need to be sorted in ascending order first)
•Mode - the value that occurs most often

13
ARITHMETIC MEAN EXAMPLE

Scores: 98, 88, 81, 74, 72, 72, 70, 69, 65, 52

Sum = 741
Mean = 741 / 10 = 74.1

14
MODE EXAMPLE

Find the score that occurs most frequently


Scores: 98, 88, 81, 74, 72, 72, 70, 69, 65, 52

Mode = 72

15
MEDIAN EXAMPLE

Arrange in descending order and find the midpoint

Odd Number (N = 9): 98, 88, 81, 74, 72, 70, 69, 65, 52
Midpoint = 72

Even Number (N = 10): 98, 88, 81, 74, 72, 71, 70, 69, 65, 52
Midpoint = (72 + 71) / 2 = 71.5

The two most important steps of this implementation are as follows:
• Sorting the elements of the dataset
• Finding the middle element(s) in the sorted dataset
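
The three worked examples above can be cross-checked in a few lines. This is a minimal sketch using Python's standard statistics module; the variable names are illustrative and not part of the lecture.

```python
import statistics

# Scores from the mean and mode examples (N = 10).
scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]
# Scores from the median example, odd and even cases.
odd_scores = [98, 88, 81, 74, 72, 70, 69, 65, 52]
even_scores = [98, 88, 81, 74, 72, 71, 70, 69, 65, 52]

mean_ = statistics.mean(scores)             # arithmetic average
mode_ = statistics.mode(scores)             # most frequent value
median_odd = statistics.median(odd_scores)  # single middle element
median_even = statistics.median(even_scores)  # average of the two middle elements

print(mean_, mode_, median_odd, median_even)  # 74.1 72 72 71.5
```

statistics.median sorts the data internally, so the input order does not matter.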

16
IMPLEMENTATION OF CONCEPTS USING PYTHON
#Import Libraries
import math
import statistics
import numpy as np
# Define a list
x = [8.0, 1, 2.5, 4, 28.0] # Sample data
# Print values
print(x)
# Calculate mean simple
MEAN = sum(x) / len(x) # 8.7
# using statistics library
MEAN = statistics.mean(x) # 8.7
# if you are using numpy then
y = np.array(x)
MEAN = np.mean(y) #8.7

17
IMPLEMENTATION OF CONCEPTS USING PYTHON

# Calculate median simple
n = len(x)
if n % 2: # For ODD number of data items
    median_ = sorted(x)[(n - 1) // 2]
else: # For EVEN number of data items
    x_ord, index = sorted(x), n // 2
    median_ = 0.5 * (x_ord[index - 1] + x_ord[index])

print(median_) # 4

# using statistics library
median_ = statistics.median(x) # 4

# if you are using numpy then
y = np.array(x)
median_ = np.median(y) # 4.0

18
IMPLEMENTATION OF CONCEPTS USING PYTHON

u = [2, 3, 2, 8, 12] # Let's change the data to this one

# Calculate mode simple
mode_ = max((u.count(item), item) for item in set(u))[1]
print(mode_) # 2

# using statistics library
mode_ = statistics.mode(u) # 2

# NumPy has no mode function, so count the unique values instead
w = np.array(u)
values, counts = np.unique(w, return_counts=True)
mode_ = values[np.argmax(counts)] # 2

19
VARIABILITY
VARIABILITY
• The measures of central tendency aren't sufficient to describe data.
• We also need measures of variability that quantify the spread of data points.

• Variability measures:
• Range
• Variance
• Standard deviation
• Skewness
• Percentiles

21
THE RANGE AS A MEASURE OF SPREAD
• Range = largest value – smallest value

Group 1 Group 2
100, 100 91, 85
99, 98 81, 79
88, 77 78, 77
72, 68 73, 75
67, 52 72, 70
43, 42 65, 60

Range G1: 100 – 42 = 58 Range G2: 91 – 60 = 31
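
The two ranges can be reproduced directly from the group values. A minimal sketch; value_range is a hypothetical helper name introduced here for illustration.

```python
# Values copied from the two groups in the table above.
group_1 = [100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42]
group_2 = [91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60]


def value_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)


range_1 = value_range(group_1)
range_2 = value_range(group_2)
print(range_1, range_2)  # 58 31
```

Because the range depends on only two data items, a single extreme value can change it substantially, which is one reason the variance and standard deviation are usually preferred.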


VARIANCE

• The sample variance quantifies the spread of the data.


• It shows numerically how far the data points are from the mean.

Population Variance:
$$ S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} $$

Sample Variance:
$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$

23
VARIANCE EXAMPLE

X      X̄       X − X̄     (X − X̄)²
98     74.1     23.90     571.21
88     74.1     13.90     193.21
81     74.1      6.90      47.61
74     74.1     -0.10       0.01
72     74.1     -2.10       4.41
72     74.1     -2.10       4.41
70     74.1     -4.10      16.81
69     74.1     -5.10      26.01
65     74.1     -9.10      82.81
52     74.1    -22.10     488.41
Mean = 74.1           Sum = 1,434.90

Population Variance (N): 1,434.90 / 10 = 143.49
Sample Variance (n-1): 1,434.90 / 9 = 159.43
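
As a cross-check of the table, the same two variances can be computed with the standard statistics module. A minimal sketch; the variable names are illustrative.

```python
import statistics

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

# Sum of squared deviations from the mean, as in the last table column.
mean_ = sum(scores) / len(scores)
sum_sq_dev = sum((x - mean_) ** 2 for x in scores)

pop_var = statistics.pvariance(scores)    # divides by N
sample_var = statistics.variance(scores)  # divides by n - 1

print(round(sum_sq_dev, 2))                      # 1434.9
print(round(pop_var, 2), round(sample_var, 2))   # 143.49 159.43
```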

24
WHY FIND VARIANCE AND STD?
• The variance is used in many higher-order calculations including:
• T-test (inferential, mean, two samples)
• Analysis of Variance (ANOVA) (inferential, variance, two or more samples)
• Regression (inferential, cause-effect correlation)
• Variance = zero: all values within the set are identical.
• All other variances are positive numbers. Why? Because the variance is an average of squared deviations, and a square is never negative.
• A large variance indicates that numbers in the set are far from the
mean and each other, while a small variance indicates the opposite.

25
STANDARD DEVIATION

• Once you get the variance, you can calculate the standard deviation by taking
its square root.
• The higher the standard deviation, the greater the variability and/or spread
of scores
• Why std instead of variance?
• The std is more convenient because it has the same unit as the data points, i.e. s instead of s².
• Helps to localize the data item (z-score)

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
26
STANDARD DEVIATION EXAMPLE

X      X̄       X − X̄     (X − X̄)²
98     74.1     23.90     571.21
88     74.1     13.90     193.21
81     74.1      6.90      47.61
74     74.1     -0.10       0.01
72     74.1     -2.10       4.41
72     74.1     -2.10       4.41
70     74.1     -4.10      16.81
69     74.1     -5.10      26.01
65     74.1     -9.10      82.81
52     74.1    -22.10     488.41
Mean = 74.1           Sum = 1,434.90

Population STD: 1,434.90 / 10 = 143.49, and √143.49 = 11.98
Sample STD: 1,434.90 / 9 = 159.43, and √159.43 = 12.63
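
The square-root relationship between variance and standard deviation can be confirmed on the same scores. A minimal sketch using the standard statistics module; the variable names are illustrative.

```python
import math
import statistics

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]

pop_std = statistics.pstdev(scores)    # square root of the population variance
sample_std = statistics.stdev(scores)  # square root of the sample variance

# Each std is the square root of the matching variance.
same_pop = math.isclose(pop_std, math.sqrt(statistics.pvariance(scores)))
same_sample = math.isclose(sample_std, math.sqrt(statistics.variance(scores)))

print(round(pop_std, 2), round(sample_std, 2))  # 11.98 12.63
print(same_pop, same_sample)                    # True True
```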

27
Z-SCORE FORMULA
The z-score is simply a way of telling how far a score is from the mean in
standard deviation units.

Item localization in terms of std from the mean in either direction.

$$ z = \frac{X - \bar{X}}{S} $$
Z-Scores with positive numbers are above the mean while Z-Scores
with negative numbers are below the mean.

28
COMPARING Z-SCORES

• Z-scores allow the researcher to make comparisons between different distributions.

       Mathematics   English
µ      75            52
σ      6             4
X      78            57

Mathematics: $$ z = \frac{X - \mu}{\sigma} = \frac{78 - 75}{6} = \frac{3}{6} = 0.5 $$

English: $$ z = \frac{X - \mu}{\sigma} = \frac{57 - 52}{4} = \frac{5}{4} = 1.25 $$
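
The comparison can be written as a short function. A minimal sketch; z_score is a hypothetical helper name, and the inputs are the means, standard deviations, and scores from the example above.

```python
def z_score(x, mean, std):
    """Distance of x from the mean, measured in standard deviation units."""
    return (x - mean) / std


z_math = z_score(78, mean=75, std=6)
z_english = z_score(57, mean=52, std=4)

print(z_math, z_english)  # 0.5 1.25

# The larger z-score marks the score that stands further above its own mean.
better_subject = "English" if z_english > z_math else "Mathematics"
print(better_subject)     # English
```

Although the raw Mathematics score (78) is higher, the English score sits further above its own distribution's mean once the spread is taken into account.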

29
AREA UNDER THE NORMAL CURVE

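[Figure: area under the normal curve. 50% of the area lies on each side of the mean. As labeled on the slide, the successive one-std bands on each side hold 34.1%, 13.5%, and 2.2%, which sum to 68.2%, 95.2%, and 99.6% within 1, 2, and 3 std of the mean.]
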
30
AREA UNDER THE NORMAL CURVE
• For many lists of observations, especially if their histogram is bell-shaped:
• Roughly 68% of the observations in the list lie within 1 std of the mean
• Roughly 95% of the observations lie within 2 std of the mean
• Roughly 99.7% of the observations lie within 3 std of the mean
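
For an exactly normal distribution these shares can be computed rather than read off a chart. A minimal sketch using statistics.NormalDist from the standard library (Python 3.8 or later); share_within is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution lying within k std of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} std: {share:.2%}")
```

The exact values are about 68.27%, 95.45%, and 99.73%; real data that is only roughly bell-shaped will deviate from them.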

31
SKEWNESS

• It measures the asymmetry of a data sample.
• Asymmetry means the scores are concentrated toward one side of the x-axis rather than spread evenly around the center.

• Negative skewness: the bulk of the distribution is tilted toward the right side (higher numbers), with a longer tail on the left.
• Positive skewness: the bulk of the distribution is tilted toward the left side (lower numbers), with a longer tail on the right.

32
SYMMETRIC VS. SKEWED DATA

• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three curves labeled positively skewed, negatively skewed, and symmetric]

33
WHEN THE DISTRIBUTION MAY NOT BE NORMAL
[Figure: histogram titled "Salary Sample Data". x-axis: Annual Salary in Thousands of Dollars (25 to 175); y-axis: Frequency (0 to 9). Marked on the chart: Average = 62K, Mode = 45K, Median = (value not recoverable).]

34
QUARTILES
• Each dataset has three quartiles, which are the percentiles that divide the
dataset into four parts:
• Q1, A value for which 25% of the observations are smaller and 75% are
larger
• Q2, same as median (50% are smaller, 50% are larger)
• Q3, (75% are smaller, 25% are larger)

35
PERCENTILES

• In general, the nth percentile is a value such that n% of the observations fall at or below it.

IMPLEMENTATION OF CONCEPTS USING PYTHON
#Import Libraries
import math
import statistics
import numpy as np
# Define a list
x = [8.0, 1, 2.5, 4, 28.0] # Sample data

# Calculate variance simple
n = len(x)
mean_ = sum(x) / n # 8.7
var_ = sum((item - mean_)**2 for item in x) / (n - 1) # 123.2
# using statistics library
var_ = statistics.variance(x) # 123.2
# if you are using numpy then
y = np.array(x)
var_ = np.var(y, ddof=1)

It's very important to specify the parameter ddof=1. That's how you set the delta degrees of freedom to 1. This parameter allows the proper calculation of 𝑠², with (𝑛 − 1) in the denominator instead of 𝑛.

For whole population variance:
•Replace (n - 1) with n in the pure Python implementation.
•Use statistics.pvariance() instead of statistics.variance().
•Specify the parameter ddof=0 if you use NumPy or Pandas. In NumPy, you can omit ddof because its default value is 0.

37
IMPLEMENTATION OF CONCEPTS USING PYTHON

# Calculate std simple
std_ = var_ ** 0.5

# using statistics library
std_ = statistics.stdev(x)

# if you are using numpy then
y = np.array(x)
std_ = np.std(y, ddof=1)

38
IMPLEMENTATION OF CONCEPTS USING PYTHON

# Calculate skewness simple
x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x)
         * n / ((n - 1) * (n - 2) * std_**3))
print(skew_) # 1.9470432273905929
# The skewness is positive, so x has a right-side tail.

# using scipy library
import scipy.stats
y = np.array(x)
skew_ = scipy.stats.skew(y, bias=False)

39
IMPLEMENTATION OF CONCEPTS USING PYTHON

# Calculate quartiles (25th, 50th and 75th percentiles) simple
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
perc = statistics.quantiles(x, n=4, method='inclusive')
print(perc) # [0.1, 8.0, 21.0]
# In this example, 8.0 is the median of x,
# while 0.1 and 21.0 are the sample 25th and 75th percentiles, respectively.

# using numpy library
y = np.array(x)
perc = np.percentile(y, [25, 50, 75])
print(perc) # array holding 0.1, 8.0 and 21.0

40
IMPLEMENTATION OF CONCEPTS USING PYTHON

# Calculate Range simple
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
rng = max(x) - min(x)
print(rng) # 46.0

# using numpy library
y = np.array(x)
rng = np.ptp(y)
print(rng) # 46.0
print(np.amax(y) - np.amin(y)) # 46.0

41
RELATIVE STANDING
COVARIANCE

• Signifies the direction of the linear relationship between two variables.
• Direction means whether the variables tend to move together (positive covariance) or in opposite directions (negative covariance).
• The value of covariance can be any number (not scaled).
• It only measures how two variables change together, not the dependency of one variable on the other.

43
COVARIANCE EXAMPLE

• Mean ABC = (1.1 + 1.7 + 2.1 + 1.4 + 0.2) / 5 = 1.30
• Mean XYZ = (3 + 4.2 + 4.9 + 4.1 + 2.5) / 5 = 3.74
• Sum of products = [(1.1 - 1.30) × (3 - 3.74)] + [(1.7 - 1.30) × (4.2 - 3.74)] + [(2.1 - 1.30) × (4.9 - 3.74)] + … = 2.66
• Cov = 2.66 / (5 - 1) = 0.665
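
The same calculation can be sketched in Python, once by hand and once with NumPy as a cross-check. The variable names abc and xyz simply mirror the labels in the example.

```python
import numpy as np

abc = [1.1, 1.7, 2.1, 1.4, 0.2]
xyz = [3, 4.2, 4.9, 4.1, 2.5]

n = len(abc)
mean_abc = sum(abc) / n
mean_xyz = sum(xyz) / n

# Sum of the products of paired deviations from each mean.
sum_of_products = sum((a - mean_abc) * (b - mean_xyz) for a, b in zip(abc, xyz))

# Sample covariance: divide by n - 1.
cov_manual = sum_of_products / (n - 1)

# np.cov returns a 2 x 2 matrix; the off-diagonal entry is the covariance.
# Its default normalization is also n - 1.
cov_numpy = np.cov(abc, xyz)[0, 1]

print(round(sum_of_products, 2), round(cov_manual, 3))  # 2.66 0.665
```

The positive result says the two variables tend to rise and fall together; its size alone says little, because covariance is not scaled.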

44
CORRELATION

• It is used to study the strength of a relationship between two numerically measured, continuous variables.
• To determine whether the covariance of the two variables is large or small,
we need to assess it relative to the standard deviations of the two variables.
• To do so we have to normalize the covariance by dividing it with the product
of the standard deviations of the two variables, thus providing a correlation
between the two variables.
• The main result of a correlation is called the correlation coefficient.
• The correlation coefficient is a dimensionless metric and its value ranges
from -1 to +1.
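
The normalization step described above can be sketched as follows. The data reuses the ABC and XYZ values from the covariance example; the variable names are illustrative, and np.corrcoef is used only as a cross-check.

```python
import statistics
import numpy as np

abc = [1.1, 1.7, 2.1, 1.4, 0.2]
xyz = [3, 4.2, 4.9, 4.1, 2.5]

# Sample covariance (n - 1 in the denominator).
cov = np.cov(abc, xyz)[0, 1]

# Normalize by the product of the two sample standard deviations.
r_manual = cov / (statistics.stdev(abc) * statistics.stdev(xyz))

# NumPy computes the same coefficient directly.
r_numpy = np.corrcoef(abc, xyz)[0, 1]

print(r_manual, r_numpy)
```

Both values agree and lie close to +1, which indicates a strong positive linear relationship for this small sample.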

45
CORRELATION VS COVARIANCE

Covariance: Covariance is nothing but a measure of correlation.
Correlation: Correlation refers to the scaled form of covariance.

Covariance: Covariance indicates the direction of the linear relationship between variables.
Correlation: Correlation, on the other hand, measures both the strength and direction of the linear relationship between two variables.

Covariance: Covariance can vary between -∞ and +∞.
Correlation: Correlation ranges between -1 and +1.

Covariance: Covariance is affected by the change in scale. If all the values of one variable are multiplied by a constant and all the values of another variable are multiplied by a similar or different constant, then the covariance is changed.
Correlation: Correlation is not influenced by the change in scale.

48
CORRELATION EXAMPLE

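SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81

From our table:
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case = 6

$$ R = \frac{n\,\Sigma xy - \Sigma x\,\Sigma y}{\sqrt{[n\,\Sigma x^2 - (\Sigma x)^2]\,[n\,\Sigma y^2 - (\Sigma y)^2]}} = \frac{6(20{,}485) - (247 \times 486)}{\sqrt{[6(11{,}409) - 247^2]\,[6(40{,}022) - 486^2]}} = 0.5298 $$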
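
The sums and the resulting coefficient can be recomputed from the six rows of the table. A minimal sketch of the computational formula; the variable names are illustrative.

```python
import math

age = [43, 21, 25, 42, 57, 59]       # x
glucose = [99, 65, 79, 75, 87, 81]   # y
n = len(age)

sum_x = sum(age)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(age, glucose))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in glucose)

# Note that the whole numerator is divided by the whole square root.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 247 486 20485 11409 40022
print(round(r, 4))                           # 0.5298
```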

49
EXAMPLE
• A survey was given to students to find out how many hours per week they would listen to a
particular radio station.
• The data collected was distributed by gender.
• Determine the mean, range, variance and standard deviation of each group.
• Find the correlation coefficient R
Group A (Female) Group B (Male)

15 30
25 15
12 21
7 12
3 25
33 20
18 5
16 24
9 17
24 11

50
SOLUTION - EXAMPLE

Female Group Mean = 16.2 h/w


Male Group Mean = 18 h/w
Female Group Range = 30
Male Group Range = 25
Female Group Sample Variance / Std = 83.73 / 9.15
Female Group Population Variance / Std = 75.36 / 8.68
Male Group Sample Variance / Std = 56.22 / 7.5
Male Group Population Variance / Std = 50.6 / 7.11
Correlation Coefficient = -0.2089
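
The solution can be recomputed with a short script. This is a sketch under one stated assumption: for the correlation coefficient, the two groups are paired row by row as they appear in the table. The helper name describe is hypothetical.

```python
import statistics
import numpy as np

female = [15, 25, 12, 7, 3, 33, 18, 16, 9, 24]   # Group A, hours per week
male = [30, 15, 21, 12, 25, 20, 5, 24, 17, 11]   # Group B, hours per week


def describe(values):
    """Collect the descriptive measures asked for in the example."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "sample_var": statistics.variance(values),   # n - 1 denominator
        "sample_std": statistics.stdev(values),
        "pop_var": statistics.pvariance(values),     # n denominator
        "pop_std": statistics.pstdev(values),
    }


female_stats = describe(female)
male_stats = describe(male)

# Assumption: observations are paired by table row.
r = np.corrcoef(female, male)[0, 1]

for name, stats in (("Female", female_stats), ("Male", male_stats)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
print("Correlation coefficient:", round(r, 4))
```

Since the two groups contain different students, the row pairing is arbitrary, so the coefficient is best read as an arithmetic exercise rather than as evidence of a relationship.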

51
