Data Analysis - Statistics

13-DEC-21
What is Statistics & Its Types

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

Types: Descriptive and Predictive

Variables can be of different types

Categorical data

• Categorical data represents characteristics.

• Categorical variables represent types of data which may be divided into groups.

• Examples of categorical variables are race, sex, age group, and educational level.

Key characteristics of Categorical data

• Categorical data is divided into groups or categories.

• The categories are based on qualitative characteristics.

• Nominal categories have no inherent order; ordinal categories are ordered, but the gaps between them are not measurable.

• Categorical data can take numerical values, but those numbers don’t have any mathematical meaning.

• Categorical data is displayed graphically by bar charts and pie charts.

Categorical - Nominal

• A nominal variable is made up of various categories that have no order.

• Nominal categories have no quantitative value.

Categorical - Ordinal

Ordinal values represent discrete and ordered units.

Ordinal data is therefore nearly the same as nominal data, except that its ordering matters.

There is a clear ordering of the categories, i.e. they differ by degree.

Numerical Data – Discrete data

We speak of discrete data if its values are distinct and separate.

Example
• Number of heads in 100 coin flips.

Numerical data – continuous values

A variable is continuous if the possible values of the variable form an interval.

Examples
• Height of students
• Average daily temperature of a state

Numerical data – Interval

Interval data is a data type measured along a scale on which the distance between any two adjacent values is equal (equidistant).

Characteristics
• Measured
• Ordered
• Equal distance between values
• No true zero point
• Can be negative
• Supports trend analysis

Numerical data – Ratio

Ratio data is defined as quantitative data having the same properties as interval data, with the addition of a true zero.

Ratio data can only be 0 and above.

Example: amount of money different people have

• There is an order
• It's quantitative
• Amount cannot be less than 0
• Ratios can be calculated

Answer the level of measurement

1. High school men soccer players classified by their athletic ability: Superior, Average, Above average.
2. Baking temperatures for various main dishes: 350, 400, 325, 250, 300.
3. The colors of crayons in a 24-crayon box.
4. A satisfaction survey of a social website by number: 1 very satisfied, 2 somewhat satisfied, 3 not satisfied.

Ans:
1. Ordinal
2. Interval (temperatures on a degree scale have no true zero)
3. Nominal
4. Ordinal

Four types of Exploratory/Descriptive Data Analysis

Univariate: only one dependent variable (example: IQ of different people)
• Univariate non-graphical: can be analyzed by frequency distribution tables and central tendency
• Univariate graphical: bar graphs

Multivariate: two or more dependent variables (example: a group of college students, to find out their average SAT score and their age)
• Multivariate non-graphical: Z test, Chi-Square test, linear correlation
• Multivariate graphical: using a scatter plot

Exploratory/Descriptive Data Analysis is done through
• Central Tendency
• Standard Deviation and Z Score
• Frequency Tables
• Charts

Central Tendency
Univariate (Single variable)

Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution.”

It is the tendency for the values of a random variable to cluster round its mean, mode, or median.

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

Central Tendency

Mean

Median

Mode

Mean
The mean value is the average value.

To calculate the mean, find the sum of all values, and divide the sum by the number of values:

(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77

An important disadvantage of the mean is that it is sensitive to extreme values/outliers, especially when the sample size is small.

Example: if salaries (in K) are 25, 95, 18, 20, 23, 90, 92, the mean salary comes out to be about 51.9, whereas most people are in the 18-25 range.

How can a 6-feet tall person drown in a river of average depth 5-feet?

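The two mean calculations on this slide can be checked with a short Python sketch. It uses only the standard-library `statistics` module; the variable names are illustrative, and the median is printed alongside the salary mean only to show how much less the high salaries affect it.

```python
from statistics import mean, median

values = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
salaries_k = [25, 95, 18, 20, 23, 90, 92]

mean_value = mean(values)            # sum of all values / number of values
mean_salary = mean(salaries_k)       # pulled upward by the three high salaries
median_salary = median(salaries_k)   # middle value after sorting

print(round(mean_value, 2), round(mean_salary, 2), median_salary)
```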
Example

• Two weeks before Mark opened Technology Titans, he launched his company Web site.

• During those 14 days, Mark had an average of 24.5 hits on his Web site per day.

• In the first two days that Technology Titans was open for business, the Web site received 42 and 53 hits respectively. Determine the new average for hits on the Web site.

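One way to work the Technology Titans example is to rebuild the total number of hits from the old average and then divide by the new number of days. A minimal sketch, with illustrative variable names:

```python
days_before = 14
average_before = 24.5
new_hits = [42, 53]

# Total hits = old average x old number of days, plus the two new days.
total_hits = days_before * average_before + sum(new_hits)
new_average = total_hits / (days_before + len(new_hits))

print(new_average)
```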
Mean for grouped data

• In Tim's office, there are 25 employees. Each employee travels to work every morning in his or her
own car. The distribution of the driving times (in minutes) from home to work for the employees is
shown in the table below.

How to handle grouped data

• The first step is to determine the midpoint of each interval or class.

• These midpoints must then be multiplied by the frequencies of the corresponding classes.

• The sum of the products divided by the total number of values will be the value of the mean.

Midpoint (MP)   MP × freq
5               15
15              150
25              150
35              140
45              90

Add the results from Step 2 and divide the sum by 25:
15 + 150 + 150 + 140 + 90 = 545
Mean = 545 / 25 = 21.8 minutes
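The grouped-mean steps can be sketched in Python as follows. The slide lists only the midpoints and the midpoint × frequency products, so the class frequencies below are an assumption recovered by dividing each product by its midpoint.

```python
midpoints = [5, 15, 25, 35, 45]
products = [15, 150, 150, 140, 90]   # midpoint x frequency, as listed on the slide

# Assumption: frequencies are not listed, so recover them as product / midpoint.
frequencies = [p // m for p, m in zip(products, midpoints)]

# Grouped mean = sum(midpoint x frequency) / total number of values.
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)

print(frequencies, grouped_mean)
```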

Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

The median is the middle (6th) score, which is 56.

What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.
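Both median cases (odd and even number of scores) can be reproduced with the standard-library `statistics.median`, which sorts the data and averages the two middle values when the count is even. Variable names are illustrative.

```python
from statistics import median

eleven_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

median_odd = median(eleven_scores)   # middle (6th) value of the sorted list
median_even = median(ten_scores)     # average of the 5th and 6th sorted values

print(sorted(eleven_scores), median_odd)
print(sorted(ten_scores), median_even)
```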

Median of grouped data

Make a table with 3 columns: the first column for the class interval, the second column for the frequency, f, and the third column for the cumulative frequency, cf.

Write the class intervals and the corresponding frequencies in the respective columns.

Write the cumulative frequency in the column cf. It is done by adding the frequency in each step.

Covid infection   Frequency (No of people)   CF
0-8               8                          8
8-16              10                         18
16-24             16                         34
24-32             24                         58
32-40             15                         73
40-58             7                          80

Median of grouped data

N = sum of frequencies = 80
N/2 = 40

The cumulative frequency just greater than 40 is 58, so the median class is 24-32.

Covid infection   Frequency   CF
0-8               8           8
8-16              10          18
16-24             16          34
24-32             24          58    (median class)
32-40             15          73
40-58             7           80    (N)

l = 24 (lower limit of the median class)
h = 32 - 24 = 8 (width of the median class)
f = 24 (frequency of the median class)
CF of previous class = 34

Median of grouped data

Median = l + h × (N/2 − CF of previous class) / f
       = 24 + 8 × (40 − 34) / 24
       = 26

Here l = 24, h = 8, f = 24 and the CF of the previous class is 34, taken from the median class 24-32 of the Covid infection table.
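The grouped-median procedure (find the first class whose cumulative frequency exceeds N/2, then interpolate inside it) can be sketched as below. The function name `grouped_median` and the `(lower, upper, frequency)` tuple layout are illustrative choices, not part of the slides.

```python
# (lower limit, upper limit, frequency) for each class in the Covid infection table
classes = [(0, 8, 8), (8, 16, 10), (16, 24, 16), (24, 32, 24), (32, 40, 15), (40, 58, 7)]

def grouped_median(classes):
    n = sum(freq for _, _, freq in classes)
    cf_previous = 0
    for lower, upper, freq in classes:
        # Median class: first class whose cumulative frequency exceeds N/2.
        if cf_previous + freq > n / 2:
            width = upper - lower
            return lower + width * (n / 2 - cf_previous) / freq
        cf_previous += freq
    raise ValueError("classes must not be empty")

median_value = grouped_median(classes)
print(median_value)
```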

Important to note

Marks out of 50   Frequency   Cumulative frequency
0-10              2           2
11-20             4           6
21-30             5           11
31-40             4           15
41-50             2           17

Here the upper limit of each class is not equal to the lower limit of the next class.
To make them equal, we adjust each limit by (11 - 10)/2 = 0.5.
So the intervals now become:
-0.5 to 10.5
10.5 to 20.5
20.5 to 30.5
30.5 to 40.5
40.5 to 50.5

Mode

• The mode is the most frequent score in our data set.

• The mode is used for categorical data where we wish to know which is the most common category.

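Mode of grouped data

Covid infection   Frequency   CF
0-8               8           8
8-16              10          18
16-24             16          34
24-32             24          58
32-40             15          73
40-58             7           80

The modal class is the class having the maximum frequency; here it is 24-32.

Xk = 24 (lower limit of the modal class)
Fk = 24 (frequency of the modal class)
Fk-1 = 16 (frequency of the previous class)
Fk+1 = 15 (frequency of the next class)
H = 8 (class width)

Mode = Xk + H × (Fk − Fk-1) / (2Fk − Fk-1 − Fk+1)
     = 24 + 8 × (24 − 16) / (48 − 16 − 15)
     = 24 + 8 × (8 / 17)
     = 27.76
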
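The grouped-mode calculation can be sketched in the same style. The function name `grouped_mode` and the tuple layout are illustrative; treating a missing neighbouring class as frequency 0 is an assumption for the edge cases, which this example does not exercise.

```python
# (lower limit, upper limit, frequency) for each class in the Covid infection table
classes = [(0, 8, 8), (8, 16, 10), (16, 24, 16), (24, 32, 24), (32, 40, 15), (40, 58, 7)]

def grouped_mode(classes):
    freqs = [freq for _, _, freq in classes]
    k = freqs.index(max(freqs))                 # modal class = highest frequency
    lower, upper, fk = classes[k]
    # Assumption: a missing neighbour (first or last class) counts as frequency 0.
    f_prev = freqs[k - 1] if k > 0 else 0
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0
    width = upper - lower
    return lower + width * (fk - f_prev) / (2 * fk - f_prev - f_next)

mode_value = grouped_mode(classes)
print(round(mode_value, 2))
```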
Standard Deviation

Standard deviation tells us the amount of variability in our data set.

A standard deviation is a statistic that measures the dispersion of a dataset relative to its mean.

The standard deviation is calculated as the square root of the variance, which is determined from each data point's deviation relative to the mean.

Example: suppose people have Rs 100, 200, 350, 500, 300 in their pockets. The mean is 290, so what will the SD be?

Why Standard Deviation

Imagine we have two samples of chocolate cake eaters, each sample with 10 people, self-reporting how many pieces of chocolate cake they've eaten in the last seven days.

In dataset #1, we have five people that report eating 4 pieces of cake and five people that report eating 6 pieces of cake, for a mean of 5 pieces of cake:

(4+4+4+4+4+6+6+6+6+6)/10 = 5

In dataset #2, we have five people that report eating 0 pieces of cake and five people that report eating 10 pieces of cake, for a mean of 5 pieces of cake:

(0+0+0+0+0+10+10+10+10+10)/10 = 5
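A short sketch of why the mean alone is not enough here: both cake datasets share the same mean, but their spread differs. It uses `statistics.pstdev` (the population standard deviation, which divides by n); variable names are illustrative.

```python
from statistics import mean, pstdev

dataset_1 = [4] * 5 + [6] * 5     # five people ate 4 pieces, five ate 6
dataset_2 = [0] * 5 + [10] * 5    # five people ate 0 pieces, five ate 10

mean_1, mean_2 = mean(dataset_1), mean(dataset_2)
sd_1, sd_2 = pstdev(dataset_1), pstdev(dataset_2)

print(mean_1, sd_1)
print(mean_2, sd_2)
```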

Standard Deviation example
A class of students took a math test. Their teacher wants to know whether most students are performing at
the same level, or if there is a high standard deviation.
The scores for the test were 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89.

To find the standard deviation

Determine the mean

Subtract the mean from each value

Square each of those differences.

Determine the average of the squared numbers to find the variance


Find the square root of the variance. That’s the standard deviation!
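The five steps above can be followed line by line in Python. This is a minimal sketch that divides by the number of scores (population standard deviation), as the steps describe; it keeps the unrounded mean, so intermediate values may differ slightly from a hand calculation with a rounded mean.

```python
from math import sqrt

scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]

mean_score = sum(scores) / len(scores)             # Step 1: determine the mean
differences = [x - mean_score for x in scores]     # Step 2: subtract the mean from each value
squared = [d ** 2 for d in differences]            # Step 3: square each difference
variance = sum(squared) / len(squared)             # Step 4: average of the squared differences
std_dev = sqrt(variance)                           # Step 5: square root of the variance

print(round(mean_score, 2), round(variance, 2), round(std_dev, 1))
```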

Calculate Standard Deviation

• Scores: 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89.
• Mean is 1279/15 = 85.27, taken as 85.2 in the calculations below
• Deviation for every child: subtract the mean from each score
• Square each difference
• Variance: average of the squared differences

Score - Mean    Deviation   Squared
85 - 85.2       -0.2        0.04
86 - 85.2       0.8         0.64
100 - 85.2      14.8        219.04
76 - 85.2       -9.2        84.64
81 - 85.2       -4.2        17.64
93 - 85.2       7.8         60.84
84 - 85.2       -1.2        1.44
99 - 85.2       13.8        190.44
71 - 85.2       -14.2       201.64
69 - 85.2       -16.2       262.44
93 - 85.2       7.8         60.84
85 - 85.2       -0.2        0.04
81 - 85.2       -4.2        17.64
87 - 85.2       1.8         3.24
89 - 85.2       3.8         14.44

• Sum of squares: 0.04 + 0.64 + 219.04 + 84.64 + 17.64 + 60.84 + 1.44 + 190.44 + 201.64 + 262.44 + 60.84 + 0.04 + 17.64 + 3.24 + 14.44 = 1135
• Variance = 1135/15 = 75.67
• Standard deviation = sqrt(75.67) = 8.7

The standard deviation of these tests is 8.7 points out of 100. Since the variance is somewhat low, the teacher knows that most students are performing around the same level.

Why Standard Deviation

In the chocolate cake example, both datasets have a mean of 5 pieces of cake, but the SD of the first will be 1 and the SD of the second will be 5: the second dataset is far more spread out around the same mean.

How to analyze Standard Deviation

A low standard deviation tells us the data is closely clustered around the mean.

A high standard deviation tells us the data is dispersed over a wide range.

Standard deviation can be computed for any numeric data, but it is easiest to interpret when the data distribution is approximately normal.

Empirical rule
• The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).

• Under this rule, 68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations from the mean.
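The 68-95-99.7 figures can be checked numerically with the standard-library `statistics.NormalDist` (Python 3.8+). This sketch computes the share of a standard normal distribution that lies within k standard deviations of the mean; the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution between -k and +k standard deviations.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.4f}")
```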

Standard Normal Distribution
• A distribution which has a mean of 0 and a standard deviation of 1

• This means it will always be centered at 0

• The Z score tells us how many standard deviations away we are from the mean

• For example, if I am at -2 then I am 2 SD below the mean

Z-Score (Standard Score) - How many standard deviations a value is from the mean

• Equal to 0 at the mean

• Negative to the left of the mean

• Positive to the right of the mean

Z Score values are looked up in a Z Score card (a standard normal table).

Z Score calculation / Standardization: converting any case study with a non-zero mean and a non-unit SD to mean 0 and SD 1

A tutor sets a piece of English Literature coursework for the 50 students in his class.

The mean score is 60 out of 100 and the standard deviation (in other words, the variation in the scores) is 15 marks.

Sarah has obtained 70 marks.

How well did Sarah perform in her English Literature coursework compared to the other students?

Sarah’s Z Score = (70-60)/15 = 0.6667
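Sarah's Z score can be computed directly, and, under the assumption that the class scores are approximately normally distributed, converted into the share of students scoring below her. A minimal sketch using the standard-library `statistics.NormalDist`; variable names are illustrative.

```python
from statistics import NormalDist

class_mean = 60
class_sd = 15
sarah_marks = 70

z_score = (sarah_marks - class_mean) / class_sd

# Assumption: scores are roughly normal, so the standard normal CDF applies.
share_below_sarah = NormalDist().cdf(z_score)

print(round(z_score, 4), round(share_below_sarah, 4))
```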

To find how much of the population scored less than 49

We find the Z score, i.e. (49-60)/15 = -0.733

Now we look at the Z Score card

Find the row -0.7 and the column 0.03

The value comes to be 0.23270

So the proportion of students who scored less than 49 is about 0.233

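Instead of reading a printed Z Score card, the proportion below 49 marks can be computed from the normal CDF. This sketch assumes the scores follow a normal distribution with mean 60 and SD 15; because it does not round the Z score to two decimals, the result can differ slightly from a table lookup.

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=15)   # assumed distribution of the class scores

z_score = (49 - 60) / 15
share_below_49 = scores.cdf(49)

print(round(z_score, 3), round(share_below_49, 4))
```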
What if I want to know how much of the population lies between 35 and 50 marks?

• Please try

• You can check the negative z score card at http://www.math.odu.edu/stat130/normal-tables.pdf

Solution

35 marks means (35-60)/15 = -25/15 = -1.667

50 marks means (50-60)/15 = -10/15 = -0.667

-1.67 means 0.0475

-0.67 means 0.2514

So the proportion of the class between 35 and 50 marks is 0.2514 - 0.0475 = 0.204, i.e. about 20%
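The proportion between two marks is the difference of the two cumulative proportions. A minimal sketch, again assuming normally distributed scores with mean 60 and SD 15 and using the exact CDF rather than a rounded table value:

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=15)   # assumed distribution of the class scores

share_below_35 = scores.cdf(35)
share_below_50 = scores.cdf(50)
share_between = share_below_50 - share_below_35   # area between the two marks

print(round(share_below_35, 4), round(share_below_50, 4), round(share_between, 4))
```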

Frequency tables – for univariate variables

Frequency tables are a basic tool you can use to explore data and get an idea of the relationships between variables.

A frequency table is just a data table that shows the counts of one or more categorical variables.

How to make Frequency tables

Tally marks are often used to make a frequency distribution table.

For example, let’s say you survey a number of households and find out how many pets they own.

The results are 3, 0, 1, 4, 4, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 1, 2, 1, 1, 3.

Step 1: To make the frequency distribution table, first write the categories in one column (number of pets).

Step 2: Next, tally the numbers in each category (from the results above). For example, the number zero appears four times in the list, so put four tally marks “||||”.

Step 3: Finally, count up the tally marks and write the frequency in the final column. The frequency is just the total. You have four tally marks for “0”, so put 4 in the last column.
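The tallying steps can be reproduced with `collections.Counter` from the standard library, which counts how often each value occurs. The printed layout (category, tally marks, frequency) is an illustrative choice.

```python
from collections import Counter

pets = [3, 0, 1, 4, 4, 1, 2, 0, 2, 2, 0, 2, 0, 1, 3, 1, 2, 1, 1, 3]

frequency = Counter(pets)   # maps number of pets -> number of households

print("Pets  Tally   Frequency")
for category in sorted(frequency):
    count = frequency[category]
    print(f"{category:<5} {'|' * count:<7} {count}")
```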

Data Visualization through charts

Charts are a storytelling tool.

Different types of charts
• Bar Graph
• Pie Chart
• Line Graph
• Histogram
• Scatterplot

Line Graph – for continuous series

SNO   Year   Sales
0     2014   2000
1     2015   3000
2     2016   4000
3     2017   3500
4     2018   6000

Bar Graph

• A bar chart uses bars to show comparisons between categories of data.

Pie chart

• Pie charts can be used to show percentages of a whole.

• A pie chart represents percentages at a set point in time.

• Unlike bar graphs and line graphs, pie charts do not show changes over time.

Scatter plot
• Despite their simplicity, scatter plots are a powerful tool for visualising data.

• Scatter plots’ primary uses are to observe and show relationships between two numeric variables.

Online Store   Monthly E-commerce Sales (in 1000s)   Online Advertising Dollars (1000s)
1              368                                   1.7
2              340                                   1.5
3              665                                   2.8
4              954                                   5
5              331                                   1.3
6              556                                   2.2
7              376                                   1.3

Histogram (Univariate graphical)
• Histograms group the data in bins and are the fastest way to get an idea about the distribution of each attribute in a dataset.

• The primary use of a histogram chart is to display the distribution (or “shape”) of the values in a data series.

• For example, we might know that normal human oral body temperature is approx 98.6 degrees Fahrenheit.

• And we might presume that healthy body temperature is approximately normally distributed, with most people having body temps close to 98.6 and progressively fewer healthy people with body temps lower or higher than 98.6.

• To test this, we might sample 300 healthy persons and measure their oral temperature.

Difference between bar and histogram

Which chart will you use?

• Comparison of 5 languages spoken

• Sleep needed per day for age group

• Time of day and number of calories consumed

• Age wise distribution of students

Graph Interpretation – the story that the graph tells us

• Center: graphically, the center of a distribution is located at the median of the distribution.

• Spread: the spread of a distribution refers to the variability of the data.

• Outlier: an outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

• Peak: the highest point in the graph.

Shapes of Distributions in Graphs

When graphed, the data in a set is arranged to show how the points are distributed throughout the set.

The shape of a distribution is described by its number of peaks and by its possession of symmetry, its tendency to skew, or its uniformity.

Number of peaks

Histogram charts often display peaks, or local maximums. It can be seen from the graph that the data count is visibly higher in certain sections of the graph.

• One clear peak is called a unimodal distribution.

• Two clear peaks are called a bimodal distribution.

• A single peak at the center is called a bell-shaped distribution.

• If the data set has no clear peaks, it is a uniform distribution.

https://www.youtube.com/watch?v=2oJldeE4JcU

Terms learnt above

Symmetry
• If the data is symmetrical: mean = mode = median

Skewness
• Positively skewed: Mode < Median < Mean
• Negatively skewed: Mean < Median < Mode

Extended reading: https://mathbitsnotebook.com/Algebra1/StatisticsData/STShapes.html


Establishing relation between variables

• Linear Correlation
• Linear Regression (predictive statistics)

Bivariate - Correlation

Correlation is a relation between quantitative variables.

Examples

• Your caloric intake and weight increase – Positive correlation
• Amount you study and percentage – Positive correlation
• Your income and happiness – No correlation
• Expenditure and your saving – Negative correlation

Correlation is measured by the correlation coefficient r.

How is Regression different from Correlation

Correlation
• The word correlation is used in everyday life to denote some form of association.
• Correlation describes the strength of an association between two variables, and is completely symmetrical.

Regression
• A statistical technique for estimating the change in the metric dependent variable due to the change in one or more independent variables.
• It has one dependent and one independent variable (example: age and height).
• Regression is used for predictions.
• On the basis of past records, a business’s future profit can be estimated.

Correlation – Bivariate

Correlation is of three types

• Positive (+1 is the strongest possible value)
• No correlation (0)
• Negative (-1 is the strongest negative value, or inverse correlation)

A question to gauge what you understood @ https://www.menti.com/v3zocvp1qt

Examples

• Large positive correlation: children's age and shoe size
• Moderate positive correlation: as the number of automobiles increases, so does the demand for fuel
• Low negative correlation: eating and hunger

Pearson’s Coefficient: r = covariance(x, y) / (sd(x) × sd(y))

Example of calculation of Pearson’s Coefficient

Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Step 2: Find the sum of each column.
Step 3: Substitute the values (N is 6).

SNO   AGE X   AVERAGE GLUCOSE LEVEL Y   XY      X²      Y²
1     43      99                        4257    1849    9801
2     21      65                        1365    441     4225
3     25      79                        1975    625     6241
4     42      75                        3150    1764    5625
5     57      87                        4959    3249    7569
6     59      81                        4779    3481    6561
Sum   247     486                       20485   11409   40022

The correlation coefficient =
[6(20,485) – (247 × 486)] / √([6(11,409) – 247²] × [6(40,022) – 486²])
= 0.5298

The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which means the variables have a moderate positive correlation.

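The three steps of this worked example (build the xy, x², y² columns, sum them, substitute) can be sketched in Python with only the standard library. The variable names are illustrative.

```python
from math import sqrt

age = [43, 21, 25, 42, 57, 59]          # X
glucose = [99, 65, 79, 75, 87, 81]      # Y
n = len(age)

# Steps 1 and 2: column sums for x, y, xy, x^2 and y^2.
sum_x, sum_y = sum(age), sum(glucose)
sum_xy = sum(x * y for x, y in zip(age, glucose))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in glucose)

# Step 3: substitute the sums into the formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))
```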
Example using second formula

Step 1: Find the mean of the X and Y columns (X’ = 41.167, Y’ = 81)
Step 2: Find X-X’ and Y-Y’
Step 3: Find (X-X’)² and (Y-Y’)²
Step 4: Find the product (X-X’)(Y-Y’)
Step 5: Sum columns 6, 7 and 8
Step 6: Substitute values

SNO   AGE X   AVERAGE GLUCOSE LEVEL Y   X-X’     Y-Y’   (X-X’)²   (Y-Y’)²   PRODUCT
1     43      99                        1.83     18     3.361     324       33
2     21      65                        -20.17   -16    406.7     256       322.7
3     25      79                        -16.17   -2     261.4     4         32.33
4     42      75                        0.83     -6     0.694     36        -5
5     57      87                        15.83    6      250.7     36        95
6     59      81                        17.83    0      318       0         0
Sum                                                     1241      656       478

Pearson’s Coefficient = 478 / √(1241 × 656) = 0.5298

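The deviation-based version of the calculation can be sketched the same way: subtract the column means, then combine the squared deviations and their products. Variable names are illustrative; because the code keeps full precision, the intermediate sums can differ in the last digit from the rounded table values.

```python
from math import sqrt

age = [43, 21, 25, 42, 57, 59]          # X
glucose = [99, 65, 79, 75, 87, 81]      # Y

mean_x = sum(age) / len(age)            # Step 1: column means
mean_y = sum(glucose) / len(glucose)

dev_x = [x - mean_x for x in age]       # Step 2: deviations from the mean
dev_y = [y - mean_y for y in glucose]

sum_sq_x = sum(d * d for d in dev_x)                     # Steps 3 and 5
sum_sq_y = sum(d * d for d in dev_y)
sum_products = sum(a * b for a, b in zip(dev_x, dev_y))  # Steps 4 and 5

r = sum_products / sqrt(sum_sq_x * sum_sq_y)             # Step 6

print(round(sum_sq_x), round(sum_sq_y), round(sum_products), round(r, 4))
```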
In the second formula

• Numerator: called the covariance

• Denominator: calculated from the standard deviations of X and Y

Assumptions for Pearson Coefficient (r)

• For a Pearson correlation, each variable should be continuous.

• Each observation should have a pair of values (X and Y).

• There should be no outliers.

• For linearity, a “straight line” relationship between the variables should be formed.

To summarize

Correlation is used when the researcher wants to know whether the variables under study are correlated or not and, if yes, what the strength of their association is.

In regression analysis, a functional relationship between two variables is established so as to make future projections on events.

