Data Analysis - Statistics
13-DEC-21
What is Statistics & Its Types
Categorical data
• Examples of categorical
variables are race, sex, age
group, and educational level.
Key characteristics of
Categorical data
Categorical - Nominal
Categorical - Ordinal
Numerical Data – Discrete
We speak of discrete data if its values are distinct and separate.
Example
Numerical data – continuous values
A variable is continuous if the possible values of the variable form an interval.
Example: Height of students
Interval data
Interval data is defined as data measured along a scale on which the distance between two values is equal.
Characteristics
• Measured
• Equidistant
• Ordered
• Can be negative
Numerical data – Ratio
Ratio data is defined as quantitative data having the same properties as interval data, with the addition of a true zero point.
Example
Answer the level of measurement
1. High school men soccer players classified by their athletic ability: Superior, Average, Above average.
2. Baking temperatures for various main dishes: 350, 400, 325, 250, 300
3. The colors of crayons in a 24-crayon box.
4. A satisfaction survey of a social website by number: 1 very satisfied, 2 somewhat satisfied, 3 not satisfied.
Ans:
1. Ordinal
2. Interval (a temperature scale in degrees has no true zero)
3. Nominal
4. Ordinal
Exploratory/Descriptive Data Analysis - Univariate
• Only one dependent variable
• Example: IQ of different people
• Can be analyzed by frequency distribution tables, central tendency and bar graphs
• Four types of exploratory/descriptive data analysis, including univariate non-graphical and univariate graphical
• Data analysis done through frequency tables and charts
Central Tendency - Univariate
Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution.”
Central Tendency
Mean
Median
Mode
Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
The important disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small.
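The calculation above can be reproduced in a few lines of Python. This is only a sketch: the 13 values come from the slide, while the extra value of 500 is a made-up outlier added here purely to illustrate the sensitivity mentioned above.

```python
from statistics import mean

# The 13 values from the slide.
values = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Mean = sum of all values divided by the number of values.
average = sum(values) / len(values)
print(round(average, 2))  # 89.77

# Hypothetical outlier (not in the slide) to show how far one
# extreme value can pull the mean of a small sample.
with_outlier = values + [500]
print(round(mean(with_outlier), 2))  # 119.07
```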
• In Tim's office, there are 25 employees. Each employee travels to work every morning in his or her
own car. The distribution of the driving times (in minutes) from home to work for the employees is
shown in the table below.
How to handle grouped data
Median
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
The median is the middle (6th) score, which is 56.
What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.
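As a quick check of both cases, here is a small sketch using Python's standard library. The variable names are chosen here for illustration; the scores are the ones on the slide.

```python
from statistics import median

# Eleven scores (odd count): the median is the single middle score.
eleven_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
# Ten scores (even count): the median is the average of the two middle scores.
ten_scores = eleven_scores[:-1]

print(sorted(eleven_scores))
print(median(eleven_scores))  # 56
print(sorted(ten_scores))
print(median(ten_scores))     # 55.5
```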
Median of grouped data
Covid infection
No of people
Median of grouped data
N = sum of frequencies = 80
N/2 = 40
Median of grouped data
Covid infection   Frequency   CF
0-8               8           8
8-16              10          18
16-24             16          34
24-32             24          58
32-40             15          73
40-58             7           80
N/2 = 40 falls in the class 24-32 (the first class whose CF reaches 40), so:
Median = 24 + {8 × (40 − 34) / 24} = 26
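The grouped-median calculation can be sketched as a small function. The function name and the (lower, upper, frequency) tuple layout are choices made here, not something the slides define. It applies Median = L + h × (N/2 − CF before the class) / f, where L is the lower limit of the median class, h its width and f its frequency.

```python
def grouped_median(classes):
    """classes: list of (lower, upper, frequency), sorted by lower limit."""
    n = sum(freq for _, _, freq in classes)
    half = n / 2
    cumulative = 0  # cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            # First class whose cumulative frequency reaches N/2.
            width = upper - lower
            return lower + width * (half - cumulative) / freq
        cumulative += freq
    raise ValueError("frequency table is empty")

# Class limits and frequencies exactly as given in the slide's table.
covid_classes = [(0, 8, 8), (8, 16, 10), (16, 24, 16),
                 (24, 32, 24), (32, 40, 15), (40, 58, 7)]
print(grouped_median(covid_classes))  # 26.0
```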
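Important to note
When the classes are inclusive, convert them to continuous class boundaries before applying the grouped-data formulas:
0.5-10.5
10.5-20.5
20.5-30.5
30.5-40.5
40.5-50.5
For example, the class 41-50 (frequency 2, cumulative frequency 17) becomes 40.5-50.5.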
Mode
Mode of grouped data
Covid infection   Frequency   CF
0-8               8           8
8-16              10          18
16-24             16          34
24-32             24          58
32-40             15          73
40-58             7           80
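The slide shows only the table, so the sketch below assumes the usual textbook formula for the mode of grouped data: Mode = L + h × (f1 − f0) / (2·f1 − f0 − f2). Here L is the lower limit of the modal class, h its width, f1 its frequency, and f0 and f2 the frequencies of the classes before and after it. The function name and data layout are illustrative choices.

```python
def grouped_mode(classes):
    """classes: list of (lower, upper, frequency), sorted by lower limit.

    Assumes a single modal class whose frequency differs from at least
    one neighbour (otherwise the denominator would be zero).
    """
    freqs = [freq for _, _, freq in classes]
    i = freqs.index(max(freqs))          # modal class = highest frequency
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0    # frequency of the class before
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # class after
    return lower + (upper - lower) * (f1 - f0) / (2 * f1 - f0 - f2)

# Same table as on the slide; the modal class is 24-32 (frequency 24).
covid_classes = [(0, 8, 8), (8, 16, 10), (16, 24, 16),
                 (24, 32, 24), (32, 40, 15), (40, 58, 7)]
print(round(grouped_mode(covid_classes), 2))  # 27.76
```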
Standard Deviation
A standard deviation is a statistic that measures the dispersion of a dataset relative to its mean.
The standard deviation is calculated as the square root of the variance, by determining each data point's deviation relative to the mean.
Example: Suppose people have Rs 100, 200, 350, 500, 300 in their pockets. The mean is 290, so what will the SD be?
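One way to answer the question is with Python's statistics module. The slide does not say whether the five amounts are a whole population or a sample, so this sketch prints both. That choice is an assumption left to the reader.

```python
from statistics import mean, pstdev, stdev

amounts = [100, 200, 350, 500, 300]  # Rs in each person's pocket

print(mean(amounts))               # 290
# Population SD: divide the sum of squared deviations by n.
print(round(pstdev(amounts), 2))   # 135.65
# Sample SD: divide by n - 1 instead.
print(round(stdev(amounts), 2))    # 151.66
```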
Why Standard Deviation
Imagine we have two samples of chocolate cake eaters, each sample with 10 people, self-reporting how many pieces of chocolate cake they've eaten in the last seven days.
In dataset #1, we have five people that report eating 4 pieces of cake and five people that report eating 6 pieces of cake, for a mean of 5 pieces of cake:
(4+4+4+4+4+6+6+6+6+6)/10 = 5
Standard Deviation example
A class of students took a math test. Their teacher wants to know whether most students are performing at
the same level, or if there is a high standard deviation.
The scores for the test were 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89.
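Calculate Standard Deviation
• Scores: 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89.
• Mean ≈ 85.2 (1279 / 15, truncated to one decimal)
• Deviation of each score from the mean, and its square:
85 - 85.2 = -0.2      0.04
86 - 85.2 = 0.8       0.64
100 - 85.2 = 14.8     219.04
76 - 85.2 = -9.2      84.64
81 - 85.2 = -4.2      17.64
93 - 85.2 = 7.8       60.84
84 - 85.2 = -1.2      1.44
99 - 85.2 = 13.8      190.44
71 - 85.2 = -14.2     201.64
69 - 85.2 = -16.2     262.44
93 - 85.2 = 7.8       60.84
85 - 85.2 = -0.2      0.04
81 - 85.2 = -4.2      17.64
87 - 85.2 = 1.8       3.24
89 - 85.2 = 3.8       14.44
• Variance – average of the squared deviations:
(0.04 + 0.64 + 219.04 + 84.64 + 17.64 + 60.84 + 1.44 + 190.44 + 201.64 + 262.44 + 60.84 + 0.04 + 17.64 + 3.24 + 14.44) / 15 = 1135 / 15 ≈ 75.67
• Standard deviation = √75.67 ≈ 8.7
• With this result the teacher knows that most students are performing around the same level.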
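The same steps can be sketched in Python. The code uses the unrounded mean (the slide works with 85.2) and divides by n, treating the class as the whole population. That treatment is an assumption made here, so the intermediate figures may differ slightly from a hand calculation.

```python
from math import sqrt

scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]

# Step 1: mean of the scores.
mean_score = sum(scores) / len(scores)

# Step 2: squared deviation of each score from the mean.
squared_deviations = [(s - mean_score) ** 2 for s in scores]

# Step 3: variance = average of the squared deviations (dividing by n).
variance = sum(squared_deviations) / len(scores)

# Step 4: standard deviation = square root of the variance.
std_dev = sqrt(variance)

print(round(mean_score, 2))  # 85.27
print(round(std_dev, 1))     # 8.7
```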
Why Standard Deviation (continued)
In dataset #2, we have five people that report eating 0 pieces of cake and five people that report eating 10 pieces of cake, for a mean of 5 pieces of cake:
(0+0+0+0+0+10+10+10+10+10)/10 = 5
In this example the SD of the first dataset will be 1 and the SD of the second will be 5.
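A short check of the cake example, as a sketch. Both datasets have the same mean, and the stated SDs of 1 and 5 correspond to the population standard deviation (dividing by n).

```python
from statistics import mean, pstdev

dataset_1 = [4] * 5 + [6] * 5    # five people ate 4 pieces, five ate 6
dataset_2 = [0] * 5 + [10] * 5   # five people ate 0 pieces, five ate 10

# Same mean, very different spread.
print(mean(dataset_1), pstdev(dataset_1))
print(mean(dataset_2), pstdev(dataset_2))
```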
How to analyze Standard Deviation
Empirical rule
• The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
• About 68% of the data falls within one σ of µ, about 95% within two σ, and about 99.7% within three σ.
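The 68-95-99.7 figures can be checked against the normal distribution itself. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k sigma of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```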
Standard Normal Distribution
• A distribution which has a mean of 0 and a standard deviation of 1
Z-Score (Standard Score) - How many standard deviations is it from the mean
• z = (x − µ) / σ
• Equal to 0 at the mean
Python Code
Sarah’s Z Score = (70-60)/15=0.6667
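The original code on this slide did not survive, so the following is only a minimal sketch of a z-score calculation using the numbers from Sarah's example (score 70, mean 60, standard deviation 15). The function name is chosen here. The last step, converting the z-score into a share of the population, assumes the scores are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

sarah_z = z_score(70, 60, 15)
print(round(sarah_z, 4))  # 0.6667

# The z-score is 0 exactly at the mean.
print(z_score(60, 60, 15))  # 0.0

# Assuming normally distributed scores, the share of the population
# scoring below Sarah is the standard normal CDF at her z-score.
share_below = NormalDist().cdf(sarah_z)
print(round(share_below, 2))
```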
To find how much of the population scored less than 49
• Please try
• http://www.math.odu.edu/stat130/normal-tables.pdf
Solution
Frequency tables – for univariate variables
Frequency tables are a basic tool you can use to explore data and get an idea of the relationships between variables.
A frequency table is just a data table that shows the counts of one or more categorical variables.
How to make Frequency tables
Tally marks are often used to make a frequency distribution table.
For example, let’s say you survey a number of households and find out how many pets they own.
The results are 3, 0, 1, 4, 4, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 1, 2, 1, 1, 3.
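The tallying step can be sketched with `collections.Counter`, which counts how often each value occurs. The printed layout is just one possible way to show the table.

```python
from collections import Counter

# Number of pets reported by each of the 20 surveyed households.
pets = [3, 0, 1, 4, 4, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 1, 2, 1, 1, 3]

frequency = Counter(pets)

print("Pets  Tally    Frequency")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:<5} {'|' * count:<8} {count}")
```

For this data the frequencies are 4, 6, 5, 3 and 2 for 0, 1, 2, 3 and 4 pets respectively, which add up to the 20 households surveyed.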
Data Visualization through charts
Line Graph – for continuous series
SNO   Year   Sales
0     2014   2000
1     2015   3000
2     2016   4000
3     2017   3500
4     2018   6000
Bar Graph
Pie chart
Scatter plot
• Despite their simplicity, scatter plots are a powerful tool for visualising data
• Scatter plots’ primary uses are to observe and show relationships between two numeric variables.
Online Store   Monthly E-commerce Sales (in 1000s)   Online Advertising Dollars (1000s)
1              368                                   1.7
2              340                                   1.5
3              665                                   2.8
4              954                                   5
5              331                                   1.3
6              556                                   2.2
7              376                                   1.3
Histogram (Univariate graphical)
• Histograms group the data into bins and are the fastest way to get an idea of the distribution of each attribute in a dataset.
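To make the idea of binning concrete, here is a small text-only sketch. It reuses the 15 test scores from the standard deviation example as illustrative data. The bin width of 10 is an arbitrary choice made here, not something the slides specify.

```python
from collections import Counter

scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]
bin_width = 10  # assumed bin width, chosen for illustration

# Map each score to the lower edge of its bin (e.g. 85 -> 80) and count.
bins = Counter((score // bin_width) * bin_width for score in scores)

# Print one row of '#' marks per bin, like a sideways histogram.
for start in sorted(bins):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:<8} {'#' * bins[start]}")
```

Most scores land in the 80-89 bin, which is the kind of pattern a histogram makes visible at a glance.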
Which chart will you use?
Graph Interpretation – story that graph tells us
• Center
  • Graphically, the center of a distribution is located at the median of the distribution.
• Spread
  • The spread of a distribution refers to the variability of the data.
• Outlier
  • An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
• Peak
  • Highest point in the graph
Shapes of Distributions in Graphs
The shape of a distribution is described by its number of peaks, by its possession of symmetry, its tendency to skew, or its uniformity.
Number of peaks
Histogram charts often display peaks, or local maximums. It can be seen from the graph that the data count is visibly higher in certain sections of the graph.
https://www.youtube.com/watch?v=2oJldeE4JcU
Terms learnt above
Symmetry Skewness
Bivariate - Correlation
Correlation is a relation between quantitative variables.
Examples
How is Regression different from Correlation
• The word correlation is used in everyday life to denote some form of association.
• Correlation describes the strength of an association between two variables, and is completely symmetrical.
• Regression has one dependent and one independent variable (example: age and height).
Correlation – Bivariate
Examples
Pearson’s Coefficient – r = covariance(x, y) / (sd(x) × sd(y))
Example of calculation of Pearson’s Coefficient
Step 1
Make a chart. Use the given data, and add three more columns: xy, x², and y².
SNO   AGE X   AVERAGE GLUCOSE LEVEL Y   XY      X²      Y²
1     43      99                        4257    1849    9801
2     21      65                        1365    441     4225
3     25      79                        1975    625     6241
4     42      75                        3150    1764    5625
5     57      87                        4959    3249    7569
6     59      81                        4779    3481    6561
∑     247     486                       20485   11409   40022
Step 2
Find the sum of each column. N is 6.
Step 3
Substitute values.
The correlation coefficient =
[6(20,485) – (247 × 486)] / √{[6(11,409) – 247²] × [6(40,022) – 486²]}
= 0.5298
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which means the variables have a moderate positive correlation.
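The three steps can be reproduced with a short script. This is a sketch using the age and glucose values from the table; the variable names are chosen here for readability.

```python
from math import sqrt

age = [43, 21, 25, 42, 57, 59]        # X
glucose = [99, 65, 79, 75, 87, 81]    # Y
n = len(age)

# Steps 1 and 2: build the xy, x^2 and y^2 columns and sum each column.
sum_x = sum(age)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(age, glucose))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in glucose)

# Step 3: substitute the sums into the formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 247 486 20485 11409 40022
print(round(r, 4))                           # 0.5298
```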
Example using second formula
Step 1
Find the mean of the X and Y columns: X’ = 41.167, Y’ = 81
Step 2
Find X-X’ and Y-Y’
Step 3
Find (X-X’)² and (Y-Y’)²
Step 4
Find the product (X-X’)(Y-Y’)
Step 5
Sum columns 6, 7 and 8
Step 6
Substitute values in the second formula
SNO   AGE X   AVERAGE GLUCOSE LEVEL Y   X-X’     Y-Y’   (X-X’)²   (Y-Y’)²   PRODUCT
1     43      99                        1.83     18     3.361     324       33
2     21      65                        -20.17   -16    406.7     256       322.7
3     25      79                        -16.17   -2     261.4     4         32.33
4     42      75                        0.83     -6     0.694     36        -5
5     57      87                        15.83    6      250.7     36        95
6     59      81                        17.83    0      318       0         0
∑                                                       1241      656       478
Pearson’s Coefficient = 478 / √(1241 × 656) = 0.5298
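The deviation-based route can be sketched the same way. The slides do not spell out the second formula, so this code assumes the standard form r = Σ(X−X’)(Y−Y’) / √(Σ(X−X’)² × Σ(Y−Y’)²), which is consistent with the column sums in the table.

```python
from math import sqrt

age = [43, 21, 25, 42, 57, 59]        # X
glucose = [99, 65, 79, 75, 87, 81]    # Y

# Step 1: means of X and Y.
mean_x = sum(age) / len(age)
mean_y = sum(glucose) / len(glucose)

# Step 2: deviations from the means.
dev_x = [x - mean_x for x in age]
dev_y = [y - mean_y for y in glucose]

# Steps 3 to 5: squared deviations, products, and their sums.
sum_dev_x2 = sum(d * d for d in dev_x)
sum_dev_y2 = sum(d * d for d in dev_y)
sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Step 6: substitute into the (assumed) second formula.
r = sum_products / sqrt(sum_dev_x2 * sum_dev_y2)

print(round(mean_x, 3), mean_y)   # 41.167 81.0
print(round(r, 4))                # 0.5298
```

Both routes give the same coefficient, which is a useful sanity check when working by hand.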
Assumptions for Pearson Coefficient (r)
• For a Pearson correlation, each variable should be continuous.
• Each observation should have a pair of values (X and Y).
To summarize