Module 2

Uploaded by

Parthib Basak

CSE 2027-Fundamental of Data Analysis

Module: 2: Statistical functions

Sampling Techniques: Fundamental Definitions,
Important sampling distributions, concept of
standard error, Descriptive Statistics, Inferential
Statistics (T test, Z test), Probability Uses in
Business and Calculating Probability from
Contingency Tables.
Definition
• Sampling is the process of selecting a subset (a
predetermined number of observations) from a larger
population. It is a common technique in which we run
experiments and draw conclusions about the population
without having to study the entire population.
We will go through two types of sampling methods:
• Probability Sampling: we choose a sample based on
the theory of probability, so selection is random.
• Non-Probability Sampling: we choose a sample
based on non-random criteria, and not every member of the
population has a chance of being included.

2
Sampling techniques

3
Sampling types
• Probability Sampling
• Random Sampling
• Stratified Sampling
• Cluster Sampling
• Systematic Sampling
• Multistage Sampling
• Non-Probability Sampling
• Convenience Sampling
• Voluntary Sampling
• Snowball Sampling

4
Random Sampling
• Under Random sampling, every element of the
population has an equal probability of getting
selected. Below fig. shows the pictorial view of the
same — All the points collectively represent the
entire population wherein every point has an
equal chance of getting selected.
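As a minimal sketch of simple random sampling, the snippet below draws 10 units without replacement from a population of 100. The population, the sample size and the seed are assumptions chosen only for illustration.

```python
import random

# Hypothetical population: 100 numbered units (an assumption for illustration).
population = list(range(1, 101))

# A fixed seed makes the draw reproducible; any seed would do.
rng = random.Random(42)

# random.sample draws without replacement, and every unit has the
# same chance of ending up in the sample.
sample = rng.sample(population, k=10)
print(sample)
```

Each run with the same seed gives the same 10 units; removing the seed gives a fresh random sample each time.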

5
Random Sampling

6
Stratified Sampling
• Under stratified sampling, we group the entire
population into subpopulations by some common
property. For example — Class labels in a typical ML
classification task. We then randomly sample from those
groups individually, such that the groups are still
maintained in the same ratio as they were in the entire
population. Below fig. shows a pictorial view of the same
— We have two groups with a count ratio of x and 3x
based on the color, we randomly sample from yellow and
green sets separately and represent the final set in the
same ratio of these groups.
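The idea can be sketched in a few lines of Python. The population below (25 yellow and 75 green units, a 1:3 ratio) and the 20% sampling fraction are assumptions for illustration; `stratified_sample` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical population with two strata in a 1:3 ratio (x yellow, 3x green).
population = [("yellow", i) for i in range(25)] + [("green", i) for i in range(75)]


def stratified_sample(units, fraction, rng):
    """Sample the same fraction from every stratum so group ratios are preserved."""
    strata = {}
    for label, value in units:
        strata.setdefault(label, []).append((label, value))
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # random draw inside each stratum
    return sample


sample = stratified_sample(population, fraction=0.2, rng=rng)
counts = Counter(label for label, _ in sample)
print(counts)  # 5 yellow and 15 green: still a 1:3 ratio
```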

7
Stratified Sampling

8
Cluster Sampling
• In cluster sampling, we divide the entire
population into subgroups (clusters), each of
which has characteristics similar to those of
the population as a whole. Instead of sampling
individuals, we randomly select entire subgroups.
In the figure below there are 4 clusters
with similar properties (size and shape); we
randomly select two clusters and treat them as
the sample.
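A minimal sketch of cluster sampling, assuming a hypothetical population that is already split into 4 clusters of 5 units each. Two whole clusters are drawn at random and all of their members form the sample.

```python
import random

rng = random.Random(1)

# Hypothetical population already divided into 4 clusters (names and
# contents are assumptions for illustration).
clusters = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}

# Randomly pick 2 whole clusters, not individual units.
chosen = rng.sample(sorted(clusters), k=2)

# Every member of a chosen cluster goes into the sample.
sample = [unit for name in chosen for unit in clusters[name]]
print(chosen, sample)
```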

9
Cluster Sampling

10
Systematic Sampling
• Systematic sampling selects items
from the population at regular, predefined
intervals (fixed and periodic). For example,
every 5th element or every 21st element.
It is usually simpler to carry out than simple
random sampling, but it can give a biased sample
if the ordering of the population has a periodic
pattern that lines up with the interval. The figure
below shows the idea: we sample every 9th and 7th
element in order and then repeat this pattern.
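A minimal sketch of systematic sampling with a single fixed interval. The population of 100 units, the interval of 5 and the random starting offset are assumptions for illustration.

```python
import random

# Hypothetical ordered population of 100 units.
population = list(range(1, 101))
k = 5  # sampling interval: take every 5th unit

rng = random.Random(7)
start = rng.randrange(k)  # random starting position within the first interval

# Slice with a step of k to pick units at regular intervals.
sample = population[start::k]
print(start, sample)
```

Choosing the starting point at random keeps the selection from always favouring the same positions in the list.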

11
Systematic Sampling

12
Multistage sampling
• Under multistage sampling, we stack multiple
sampling methods one after the other. For
example, at the first stage, cluster sampling can
be used to choose clusters from the population
and then we can perform random sampling to
choose elements from each cluster to form the
final set. The figure below shows a pictorial view
of the same.
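The two-stage example described above can be sketched as follows. The clusters, their sizes and the number of units drawn at each stage are assumptions for illustration.

```python
import random

rng = random.Random(3)

# Hypothetical population: 4 clusters of 10 labelled units each.
clusters = {name: [f"{name}{i}" for i in range(1, 11)] for name in "ABCD"}

# Stage 1: cluster sampling, choose 2 of the 4 clusters at random.
stage1 = rng.sample(sorted(clusters), k=2)

# Stage 2: simple random sampling of 3 units inside each chosen cluster.
sample = [unit for name in stage1 for unit in rng.sample(clusters[name], k=3)]
print(stage1, sample)
```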

13
Multistage Sampling

14
Convenience Sampling
• Under convenience sampling, the researcher
includes only those individuals who are most
accessible and available to participate in the
study. The figure below shows the idea: the blue
dot is the researcher and the orange dots
are the most accessible people in the researcher’s
vicinity.

15
Convenience Sampling

16
Voluntary Sampling
• Under voluntary sampling, interested people
take part on their own initiative, usually by filling in
some sort of survey form. A good example is
the YouTube survey asking “Have you seen
any of these ads”, which has been shown frequently.
Here, the researcher conducting
the survey does not choose the participants.
The figure below shows the idea: the
blue dot is the researcher, and the orange ones are those
who voluntarily agreed to take part in the study.

17
Voluntary Sampling

18
Snowball Sampling
• Under snowball sampling, the final set is
chosen via other participants, i.e. the
researcher asks known contacts to find
people who would like to participate in the study.
The figure below shows the idea: the
blue dot is the researcher, orange ones are known
contacts (of the researcher), and yellow ones
(the orange dots’ contacts) are other people who
agreed to participate in the study.

19
Snowball Sampling

20
Sample-Outlook

21
What is a Sampling Distribution?
If we take many random samples of equal size and calculate the mean
value from each sample, we would begin to form a frequency
distribution. The distribution of a statistic computed for each of many
random samples is called a sampling distribution.

22
Population Vs Sample

[Figure: a population of individuals labelled a to z (left panel, "Population") and a smaller set of those individuals drawn from it (right panel, "Sample").]
Copyright © 2010 Pearson Education, Inc. Publishing as 23
The mean of the sampling distribution of the sample mean is called the
expected value of the mean: it is equal to the mean of the population.

24
Variance
The variance describes the spread of the data and
measures how much the values of a variable differ
from the mean. For variables that represent only a
sample of some population and not the population
as a whole, the variance formula is

25
• Consider the data 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9,
where the mean is
x̄ = (3 + 4 + 4 + 5 + 5 + 5 + 6 + 6 + 6 + 7 + 7 + 8 +
9)/ 13
x̄ = 75/13 ≈ 5.8

26
• To calculate variance, we substitute the values
into the variance formula:

$s^2 = \frac{\sum_{i=1}^{13} (x_i - \bar{x})^2}{13 - 1} = \frac{34.31}{12} \approx 2.86$

27
Standard Deviation
• The standard deviation is the square root of the
variance.
• The standard deviation is the most widely used
measure of the spread of a variable.
• The higher the value, the more widely
distributed the variable’s data values are around
the mean.
• For the data 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, the
standard deviation is √2.86 ≈ 1.69.
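The worked figures above can be checked with Python's standard `statistics` module. This is only a verification sketch; `variance` and `stdev` use the sample (n − 1) formulas, which matches the calculation on the slides.

```python
import statistics

data = [3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

mean = statistics.mean(data)          # 75 / 13
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(round(mean, 2), round(variance, 2), round(std_dev, 2))  # 5.77 2.86 1.69
```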

28
Example:
• A variable has a mean value of 45 with a
standard deviation of 6. If the variable is
approximately normally distributed, about
68% of the observations should be in the range
39–51 (45 ± one standard deviation) and
about 95% of all observations fall within
two standard deviations of the mean (between
33 and 57).

29
Sampling Distribution

• The sampling distribution of a statistic is the
distribution of values taken by the statistic in all possible
samples of the same size from the same population.
• In practice, it’s difficult to take all possible samples of
size n to obtain the actual sampling distribution of a
statistic. Instead, we can use simulation to imitate the
process of taking many, many samples.
• One of the uses of probability theory in statistics is to
obtain sampling distributions without simulation.

30
Developing a Sampling
Distribution
• Assume there is a population …
• Population size N = 4
• Random variable, X,
is age of individuals
• Values of X:
18, 20, 22, 24 (years)

Ch. 6-31
Developing a Sampling
Distribution

$\mu = \frac{\sum_i X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$

$\sigma = \sqrt{\frac{\sum_i (X_i - \mu)^2}{N - 1}} = 2.5819$

Ch. 6-32
Sampling Distribution

33
Developing a Sampling
Distribution

16 possible samples (sampling with replacement):

1st Obs    2nd Observation
           18      20      22      24
18         18,18   18,20   18,22   18,24
20         20,18   20,20   20,22   20,24
22         22,18   22,20   22,22   22,24
24         24,18   24,20   24,22   24,24

Mean of each of the 16 samples:

1st Obs    2nd Observation
           18   20   22   24
18         18   19   20   21
20         19   20   21   22
22         20   21   22   23
24         21   22   23   24

Ch. 6-34
Sampling Distribution

35
Developing a Sampling
Distribution
(continued)
Sampling Distribution of All Sample Means

16 Sample Means

1st Obs    2nd Observation
           18   20   22   24
18         18   19   20   21
20         19   20   21   22
22         20   21   22   23
24         21   22   23   24

Sample Means Distribution

[Figure: bar chart of P(X̄) against X̄ = 18, 19, ..., 24, with the vertical axis running from 0 to .3; the distribution is no longer uniform.]
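As a sketch of how the sampling distribution above comes about, the code below enumerates all 16 samples of size 2 (with replacement) from the four ages and tabulates the probability of each sample mean. Exact fractions are used so that no rounding is involved.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ages = [18, 20, 22, 24]

# All ordered samples of size 2 drawn with replacement: 4 * 4 = 16 samples.
samples = list(product(ages, repeat=2))

# Mean of each sample, kept exact with Fraction.
means = [Fraction(a + b, 2) for a, b in samples]

# Probability of each distinct sample mean = count / 16.
counts = Counter(means)
distribution = {int(m): Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for value, prob in distribution.items():
    print(value, float(prob))
```

The probabilities rise from 1/16 at 18 to 4/16 at 21 and fall back to 1/16 at 24, which is why the distribution of the sample mean is no longer uniform even though the population is.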
Ch. 6-36
Describing Sampling Distributions:
Spread
• The variability of a statistic is described by the
spread of its sampling distribution. This spread is
determined primarily by the size of the random
sample.
• Larger samples give smaller spread. The spread
of the sampling distribution does not depend on
the size of the population, as long as the
population is at least 10 times larger than the
sample.

37
38
Standard Error of the mean
• Different samples of the same size from the same
population will yield different sample means
• A measure of the variability in the mean from sample to
sample is given by the Standard Error of the Mean:

$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

• Note that the standard error of the mean decreases
as the sample size increases
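A small sketch of the standard error formula σ/√n, showing how it shrinks as the sample size grows. The population standard deviation of 10 and the sample sizes are assumptions for illustration; `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation over sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 10.0  # hypothetical population standard deviation
for n in (4, 25, 100):
    print(n, standard_error(sigma, n))  # 5.0, 2.0, 1.0
```

Quadrupling the sample size only halves the standard error, because n enters through its square root.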
Ch. 6-39
Descriptive Statistics

40
Descriptive Statistics

41
Understanding Descriptive Statistics

• Descriptive statistics, in short, help describe and
understand the features of a specific data set by
giving short summaries about the sample and
measures of the data.
• The most recognized types of descriptive
statistics are measures of center: the mean,
median, and mode, which are used at almost all
levels of math and statistics.

42
Mean

X̄ = (Sum of values ÷ Number of values)


X̄ = (x1 + x2 + x3 +….+xn)/n

43
• Example:
• What is the mean of 2, 4, 6, 8 and 10?

44
• Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6

45
Weighted Mean

46
the average number of TVs each household owns. The data show a large number of households
with two or three TVs and a smaller number with one or four.
Every household in the sample has at least one TV and no household has more than four.
Find the mean number of TVs per household.

Number of TVs per Household    Number of Households
1                              73
2                              378
3                              459
4                              90

The mean number of TVs per household in this sample is 2.566.
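The result above is a weighted mean: each TV count is weighted by the number of households reporting it. A short sketch using the figures from the table:

```python
# Number of TVs per household -> number of households (from the table above).
tv_counts = {1: 73, 2: 378, 3: 459, 4: 90}

total_households = sum(tv_counts.values())
total_tvs = sum(tvs * households for tvs, households in tv_counts.items())

# Weighted mean = sum(value * weight) / sum(weights)
mean_tvs = total_tvs / total_households
print(total_households, total_tvs, mean_tvs)  # 1000 2566 2.566
```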

47
Categorical Data Set

48
Arithmetic Mean

49
Harmonic Mean

50
• Example 1:
• Find the harmonic mean for data 2, 5, 7, and 9.

51
• Solution:
• Given data: 2, 5, 7, 9
• Step 1: Finding the reciprocal of the values:
• ½ = 0.5
• ⅕ = 0.2
• 1/7 ≈ 0.143
• 1/9 ≈ 0.11
• Step 2: Calculate the average of the reciprocal values obtained from step 1.
• Here, the total number of data values is 4.
• Average = (0.5 + 0.2 + 0.143 + 0.11)/4
• Average = 0.953/4 ≈ 0.238
• Step 3: Finally, take the reciprocal of the average value obtained from step 2.
• Harmonic Mean = 1/Average
• Harmonic Mean = 4/0.953
• Harmonic Mean ≈ 4.19
• Hence, the harmonic mean for the data 2, 5, 7, 9 is approximately 4.19.
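The three steps above can be checked in Python, both by hand and with the standard library's `statistics.harmonic_mean`. This is a verification sketch only.

```python
import statistics

data = [2, 5, 7, 9]

# Steps 1 and 2: average of the reciprocals.
average_of_reciprocals = sum(1 / x for x in data) / len(data)

# Step 3: reciprocal of that average.
hm_manual = 1 / average_of_reciprocals

# The standard library gives the same value.
hm_library = statistics.harmonic_mean(data)

print(round(hm_manual, 2), round(hm_library, 2))  # 4.19 4.19
```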

52
Geometric Mean

53
• Question 1: Find the G.M of the values 10, 25, 5,
and 30

54
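• Solution : Given 10, 25, 5, 30

• We know that the geometric mean of n values is the n-th root of their product:

$G.M. = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$

$= (10 \times 25 \times 5 \times 30)^{1/4}$

$= (37500)^{1/4}$

$\approx 13.916$

Therefore, the geometric mean ≈ 13.916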
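A quick check of this calculation, computing the fourth root of the product directly and comparing it with `statistics.geometric_mean` (available from Python 3.8):

```python
import math
import statistics

data = [10, 25, 5, 30]

product = math.prod(data)              # 10 * 25 * 5 * 30 = 37500
gm_manual = product ** (1 / len(data))  # n-th root of the product
gm_library = statistics.geometric_mean(data)

print(product, round(gm_manual, 3), round(gm_library, 3))
```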

55
Median

56
57
Mode

58
59
Percentiles

60
Example 1

61
Example 2

62
Quantiles

63
Solution

64
Interquartile range

65
Range and IQR
The interquartile range formula is the first quartile subtracted from the third quartile: IQR = Q3 – Q1.
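A minimal sketch of the range and IQR, reusing the 13-value data set from the variance example. Quartile conventions differ between textbooks and software; the `method="inclusive"` option of `statistics.quantiles` is an assumption here, and other conventions can give slightly different quartiles.

```python
import statistics

data = [3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

# Range: largest value minus smallest value.
value_range = max(data) - min(data)

# Quartiles; n=4 splits the data into four parts and returns Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # IQR = Q3 - Q1
print(value_range, q1, q3, iqr)  # 6 5.0 7.0 2.0
```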

66
Variance and Standard deviation

67
Problem

68
Solution

69
Solution

70
What is the Standard Deviation?
• The standard deviation indicates the spread of a variable around
its mean value. Roughly speaking, it is the typical distance
of the measured values of a variable from the mean value of the
distribution (more precisely, the square root of the average squared distance).
• If the individual values scatter strongly around the mean value,
the standard deviation is large. There are two slightly different
equations for the calculation: one for when the entire population is
available, and one for when only a sample is available. If all values
of the population are available, the standard deviation is

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

71
What is the Variance?
• In statistics, variance measures variability around the
mean. To calculate the variance, the sum of the squared
deviations from the mean is divided by the number of values
(or by the number of values minus one when only a sample is available).
• The variance thus describes the average squared distance
from the mean. Because the values are squared, the result
has a different unit (the unit squared) than the original
values, which makes it difficult to interpret directly.

72
• The coefficient of variation is a dimensionless
relative measure of dispersion that is defined as the
ratio of the standard deviation to the mean.
• If there are data sets that have different units then
the best way to draw a comparison between them is
by using the coefficient of variation.
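A short sketch of the coefficient of variation (standard deviation divided by mean). The height and weight figures are hypothetical values chosen only to show two data sets in different units; `coefficient_of_variation` is an illustrative helper, not a library function.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical data in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 95]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 3), round(cv_weight, 3))
```

Because the units cancel, the two values can be compared directly: in this made-up data the weights vary more, relative to their mean, than the heights do.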

73
Co-efficient of Variation

74
Example 2

75
Example 3:HW

76
Skewness
• Skewness refers to a distortion or asymmetry that deviates
from the symmetrical bell curve, or normal distribution, in
a set of data.
• If the curve is shifted to the left or to the right, it is said to
be skewed.
• Skewness can be quantified as a representation of the
extent to which a given distribution varies from a normal
distribution.
• A normal distribution has a skew of zero, while a
lognormal distribution, for example, would exhibit some
degree of right-skew.

77
Zero-Skewness

78
Positive-Skew

79
Negative-Skew

80
Kurtosis
Kurtosis describes how heavy the tails of a distribution are, that is, how prone it is to producing outliers.
It is a statistical measure of whether the data are heavy-tailed or
light-tailed relative to a normal distribution.

81
Types of excess kurtosis
• Leptokurtic or heavy-tailed distribution (kurtosis
more than normal distribution).
• Mesokurtic (kurtosis same as the normal
distribution).
• Platykurtic or short-tailed distribution (kurtosis
less than normal distribution).

82
Standard Error

83
Moments
• Moments are a set of statistical parameters used to describe a
distribution. Four moments are commonly used:
• Mean: the average
• Variance: the average squared deviation from the mean.
The standard deviation is the square root of the variance:
an indication of how closely the values are spread about
the mean. A small standard deviation means the values
are all similar. If the distribution is normal, about 68% of the
values will be within 1 standard deviation of the mean.

84
Moments
• Skewness: measure the asymmetry of a distribution about its peak;
It is a number that describes the shape of the distribution.
It is often approximated by Skew = (Mean - Median) / (Std dev).
If skewness is positive, the mean is bigger than the median and the
distribution has a large tail of high values.
If skewness is negative, the mean is smaller than the median and the
distribution has a large tail of small values.
• Kurtosis: measures the peakedness or flatness of a distribution relative to a normal distribution.
Positive excess kurtosis indicates a thin pointed distribution with heavy tails.
Negative excess kurtosis indicates a broad flat distribution with light tails.
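The moment-based definitions can be sketched in plain Python. The helper names and the two small data sets are assumptions for illustration; the formulas are the population (divide by n) versions, and `median_skew` implements the (Mean − Median)/(Std dev) approximation quoted above.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    m = statistics.fmean(values)
    return sum((x - m) ** k for x in values) / len(values)


def skewness(values):
    """Third standardized moment."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (a normal distribution scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def median_skew(values):
    """Approximation from the text: (mean - median) / standard deviation."""
    return (statistics.fmean(values) - statistics.median(values)) / statistics.pstdev(values)


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data
right_tailed = [1, 1, 1, 2, 10]  # hypothetical data with a tail of high values

print(skewness(symmetric), skewness(right_tailed))
print(median_skew(right_tailed), excess_kurtosis(symmetric))
```

The symmetric data set has zero skewness, while the right-tailed one has positive skewness and a mean above its median, in line with the description above.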

85
86
Inferential Statistics

87
What is Inferential Statistics?
• Descriptive statistics describe the important
characteristics of data by using mean, median, mode,
variance etc. It summarises the data through numbers
and graphs.
• In Inferential statistics, we make an inference from a
sample about the population. The main aim of inferential
statistics is to draw some conclusions from the sample
and generalise them for the population data. E.g. we have
to find the average salary of a data analyst across India.
There are two options.

88
Importance of Inferential Statistics

• Drawing conclusions about the population from a sample
• Deciding whether a result observed in a sample is
statistically significant for the whole population
• Comparing two models to find whether one performs
significantly better than the other
• In feature selection, deciding whether adding or removing
a variable helps improve the model

89
Type of Hypothesis Test

The purpose of hypothesis testing is to decide whether an observed
effect could plausibly be due to randomness alone.
What is the t-Test?
We use a t-test when:
• We do not know the population variance
• Our sample size is small, n < 30
One-Sample t-Test
• We perform a One-Sample t-test when we want to
compare a sample mean with the population mean.
The difference from the Z Test is that we do not have
information on the population variance here. We use the
sample standard deviation instead of the population
standard deviation in this case.

91
92
Here’s an Example to Understand a One Sample
t-Test
• Let’s say we want to determine if on average girls
score more than 600 in the exam. We do not have
the information related to variance (or standard
deviation) for girls’ scores. To perform a t-test,
we randomly collect the data of 10 girls with
their marks and choose our ⍺ value (significance
level) to be 0.05 for Hypothesis Testing.

93
94
In this example:
• Mean Score for Girls is 606.8
• The size of the sample is 10
• The population mean is 600
• Standard Deviation for the sample is 13.14

95
Our P-value is greater than 0.05 thus we fail to reject the null
hypothesis and don’t have enough evidence to support the hypothesis
that on average, girls score more than 600 in the exam.

96
To find critical value
1. Degree of freedom(DF) = sample size -1
= 10 -1=9
2. Significance level (SL)= 0.05
3. Refer table 2 pdf file- DF->9 & SL->0.05

Therefore critical value of T is 1.833

97
Finding P Value
• We know that
T = 1.64
DF = 9
Significance = 0.05
Refer table 3 pdf file pg no:2
Therefore the p-value is 0.068
This p-value can also be used to assess the null hypothesis: the
null hypothesis is rejected if the p-value is less than the value of alpha.
The p-value can be looked up in a standard t-distribution table, found
by an online search, or computed with readily available software.

98
Conclusion
• T Score < Critical value
• P Value > Significance level
• Therefore we fail to reject the null hypothesis
• There is not enough evidence to support the alternative
hypothesis
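The one-sample t statistic for this example can be reproduced from the summary figures on the slides. This is a sketch under the assumption of a one-tailed test (girls score more than 600); the critical value 1.833 is the table value quoted above, not computed here.

```python
import math

# Summary statistics from the example.
sample_mean = 606.8
population_mean = 600  # value under the null hypothesis
sample_sd = 13.14
n = 10

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
standard_error = sample_sd / math.sqrt(n)
t_score = (sample_mean - population_mean) / standard_error
degrees_of_freedom = n - 1

critical_value = 1.833  # one-tailed, alpha = 0.05, DF = 9 (from the table)
reject_null = t_score > critical_value

print(round(t_score, 2), degrees_of_freedom, reject_null)  # 1.64 9 False
```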

99
Two-Sample t-Test
• We perform a Two-Sample t-test when we want
to compare the mean of two samples.

100
Here’s an Example to Understand a Two-
Sample t-Test
• Here, let’s say we want to determine if on
average, boys score 15 marks more than girls in
the exam. We do not have the information related
to variance (or standard deviation) for girls’
scores or boys’ scores. To perform a t-test, we
randomly collect the data of 10 girls and 10 boys
with their marks. We choose our ⍺ value
(significance level) to be 0.05 as the criterion for
Hypothesis Testing.

101
102
In this example:

• Mean Score for Boys is 630.1
• Mean Score for Girls is 606.8
• Hypothesized difference between the population means is 15
• Standard Deviation for Boys’ score is 13.42
• Standard Deviation for Girls’ score is 13.14
• Size of each sample is 10 (20 students in total)

103
The P-value is greater than 0.05, so we fail to reject
the null hypothesis: we do not have enough evidence to conclude
that on average boys score 15 marks more than girls in the exam.

104
To find critical value
1. Degrees of freedom (DF) = n1 + n2 - 2
= 10 + 10 - 2 = 18
2. Significance level (SL) = 0.05
3. Refer table 2 pdf file- DF->18 & SL->0.05 (Two
tail)

Therefore critical value of T is 2.101

105
Finding P Value
• We know that
T = 1.40
DF = 18
Significance = 0.05
Refer table 3 pdf file pg no:2
Therefore the p-value is 0.089

106
Conclusion
• T Score < Critical value
• P Value> Significance level

Therefore we fail to reject the Null Hypothesis
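The two-sample t statistic can be reproduced from the summary figures. This sketch assumes two independent samples of 10 students each and uses n1 + n2 − 2 degrees of freedom, the usual choice for a pooled two-sample t-test with equal sample sizes; the critical value 2.101 is the standard two-tailed table value for 18 degrees of freedom at α = 0.05.

```python
import math

# Summary statistics from the example.
mean_boys, sd_boys, n_boys = 630.1, 13.42, 10
mean_girls, sd_girls, n_girls = 606.8, 13.14, 10
hypothesized_difference = 15

# Standard error of the difference between the two sample means.
standard_error = math.sqrt(sd_boys**2 / n_boys + sd_girls**2 / n_girls)

# t = (observed difference - hypothesized difference) / standard error
t_score = ((mean_boys - mean_girls) - hypothesized_difference) / standard_error

degrees_of_freedom = n_boys + n_girls - 2  # assumption: pooled two-sample test
critical_value = 2.101                     # two-tailed, alpha = 0.05, DF = 18
reject_null = abs(t_score) > critical_value

print(round(t_score, 2), degrees_of_freedom, reject_null)  # 1.4 18 False
```

The t score is well below the critical value, so the conclusion is to not reject the null hypothesis.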

107
What is the Z Test?
• z tests are a statistical way of testing a
hypothesis when either:
• We know the population variance, or
• We do not know the population variance but our
sample size is large n ≥ 30
• If we have a sample size of less than 30 and do not
know the population variance, then we must use a
t-test.

108
One-Sample Z test
We perform the One-Sample Z test when we want
to compare a sample mean with the population
mean.

109
Z test

110
Z Test problem 1
• The population of all verbal GRE scores is
known to have a standard deviation of 8.5. The
UW Psychology department hopes to receive
applicants with verbal GRE scores over 210.
This year, the mean verbal GRE score for the 42
applicants was 212.79. Using a value of α = 0.05, is
this new mean significantly greater than the
desired mean of 210?

111
Z value

112
You can see that the CRITICAL VALUE of Z is 1.64
Z Score > Critical Value
Therefore we reject Null hypothesis

Z       Area between mean and Z    Area beyond Z
….      ….                         ….
1.62    0.4474                     0.0526
1.63    0.4484                     0.0516
1.64    0.4495                     0.0505
1.65    0.4505                     0.0495
1.66    0.4515                     0.0485

113
Our observed value of z is 2.13 which is greater than the critical value of 1.64.
We therefore reject H_0.

114
You can see that our p-value is p = 0.0166.

Z       Area between mean and Z    Area beyond Z
….      ….                         ….
2.11    0.4826                     0.0174
2.12    0.4830                     0.0170
2.13    0.4834                     0.0166
2.14    0.4838                     0.0162
2.15    0.4842                     0.0158

115
Our p-value is less than alpha (0.05). Therefore we reject H_0
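The one-tailed Z test above can be reproduced with `statistics.NormalDist` from the Python standard library instead of a printed table. The figures are those of the GRE problem; the software values differ slightly from the table because z is not rounded to two decimals first.

```python
import math
from statistics import NormalDist

# Figures from the GRE problem.
sample_mean = 212.79
hypothesized_mean = 210
population_sd = 8.5
n = 42
alpha = 0.05

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z_score = (sample_mean - hypothesized_mean) / (population_sd / math.sqrt(n))

standard_normal = NormalDist()                     # mean 0, standard deviation 1
critical_value = standard_normal.inv_cdf(1 - alpha)  # one-tailed critical z
p_value = 1 - standard_normal.cdf(z_score)           # area beyond z

reject_null = z_score > critical_value
print(round(z_score, 2), round(critical_value, 2), round(p_value, 3), reject_null)
```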

116
Conclusion

117
Two Sample Z Test
We perform a Two Sample Z test when we want to
compare the mean of two samples.

118
Tail
• A one-tailed test allows for the possibility of an
effect in one direction.
• A two-tailed test allows for the possibility of an effect in
two directions.

119
Z Test Problem 2
• Suppose you start up a company that has
developed a drug that is supposed to increase IQ.
You know that the standard deviation of IQ in the
general population is 15. You test your drug on
36 patients and obtain a mean IQ of 97.65. Using
an alpha value of 0.05, is this IQ significantly
different from the population mean of 100?

120
Z value

121
• We then compare our observed value of z to the critical
values of z for alpha = 0.05. We are looking for a
significant difference, so this will be a two-tailed test.
• We reject the null hypothesis if our observed mean is
either significantly larger or significantly smaller than 100.
• Our critical values of z are therefore the two values that
span the middle 95% of the area under the standard
normal distribution.
• This means that the area in each of the two tails is
0.05/2 = 0.025

122
Z       Area between mean and Z    Area beyond Z
….      ….                         ….
1.94    0.4738                     0.0262
1.95    0.4744                     0.0256
1.96    0.4750                     0.0250
1.97    0.4756                     0.0244
1.98    0.4761                     0.0239
….      ….                         ….

123
• Which corresponds to a critical value of z = 1.96

124
Conclusions

125
P Value
Z        Area between mean and Z    Area beyond Z
….       ….                         ….
-0.96    -0.3315                    0.8315
-0.95    -0.3289                    0.8289
-0.94    -0.3264                    0.8264
-0.93    -0.3238                    0.8238
-0.92    -0.3212                    0.8212
….       ….                         ….
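The two-tailed Z test for the IQ problem can be sketched with `statistics.NormalDist`. The two-tailed p-value is taken as twice the area in the tail beyond |z|, which is the usual convention and an assumption about how the slides proceed.

```python
import math
from statistics import NormalDist

# Figures from the IQ problem.
sample_mean = 97.65
population_mean = 100
population_sd = 15
n = 36
alpha = 0.05

# z = (sample mean - population mean) / (sigma / sqrt(n))
z_score = (sample_mean - population_mean) / (population_sd / math.sqrt(n))

standard_normal = NormalDist()
critical_value = standard_normal.inv_cdf(1 - alpha / 2)  # 0.025 in each tail
p_value = 2 * standard_normal.cdf(-abs(z_score))          # two-tailed p-value

reject_null = abs(z_score) > critical_value
print(round(z_score, 2), round(critical_value, 2), round(p_value, 3), reject_null)
```

Since |z| = 0.94 is smaller than 1.96 and the p-value is far above 0.05, the observed mean is not significantly different from 100.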

126
DA vs IA

127
DS vs IS

128
Probability Uses in Business and Calculating
Probability from Contingency Tables
• A contingency table provides a way of
portraying data that can facilitate calculating
probabilities. The table helps in determining
conditional probabilities quite easily.
• The table displays sample values in relation to
two different variables that may be dependent or
contingent on one another.
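A minimal sketch of reading marginal, joint and conditional probabilities off a contingency table. The table below (two groups answering Yes or No, 100 people in total) is hypothetical and is not taken from the examples in this module.

```python
# Hypothetical 2 x 2 contingency table of counts (rows: group, columns: answer).
table = {
    "Group A": {"Yes": 30, "No": 20},
    "Group B": {"Yes": 10, "No": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal probabilities: row total or column total over the grand total.
p_group_a = sum(table["Group A"].values()) / grand_total
p_yes = sum(row["Yes"] for row in table.values()) / grand_total

# Joint probability: a single cell over the grand total.
p_a_and_yes = table["Group A"]["Yes"] / grand_total

# Conditional probability: P(Yes | Group A) = P(Group A and Yes) / P(Group A),
# which is the cell count divided by the row total.
p_yes_given_a = table["Group A"]["Yes"] / sum(table["Group A"].values())

print(p_group_a, p_yes, p_a_and_yes, p_yes_given_a)  # 0.5 0.4 0.3 0.6
```

Here P(Yes | Group A) = 0.6 differs from P(Yes) = 0.4, so in this made-up table the answer depends on the group.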

129
Example 1

130
Calculate the following probabilities using the table.

131
Calculate the following probabilities using the table.

132
Calculate the following probabilities using the table.

133
Try it

134
Example 2:
• This table shows a random sample of 100 hikers
and the areas of hiking they prefer.
Hiking Area Preference

135
Solution:

136
137
138
139
140
141
142
143
144
Try it

145
146
147
