Week 3 Notes
• They have several questions such as: Are all the values relatively similar?
• And does any variable have outlier values that are either extremely small or
extremely large?
• While doing a complete search of the retirement funds data could lead to
answers to the preceding questions, you wonder if there are better ways
than extensive searching to uncover those answers
Introduction
• Descriptive analytics is a commonly used form of data analysis whereby
historical data is collected, organized, and then presented in a way that is
easily understood
• Measures of central tendency
• Measures of variation
• Measures of shape
Measures of central tendency
Measures of central tendency
• A measure of central tendency is a summary statistic that represents the
center point or typical value of a dataset
• In statistics, the most common measures of central tendency are the
mean, median, mode, and quartiles
• Mean: It is the sum of observations divided by the total number of
observations
• Median: It is the middle value of the data set. It splits the data into two
halves
• Mode: It is the value that has the highest frequency in the given data set
• Quartiles: Quartiles are measures of central tendency that divide a group
of data into four equal parts using three cut points (Q1, Q2, Q3)
Measures of central tendency: Mean
• The arithmetic mean (in everyday usage, the mean) is the most common
measure of central tendency
• To calculate a mean, sum the values in a set of data and then divide that sum
by the number of values in the set
• X̄ = (sum of n values)/n = (X₁ + X₂ + ⋯ + Xₙ)/n = (Σᵢ₌₁ⁿ Xᵢ)/n
• Consider the following data on typical time-to-get-ready for the office in the
morning
Day:            1   2   3   4   5   6   7   8   9   10
Time (minutes): 39  29  43  52  39  44  40  31  44  35
• Mean = (39 + 29 + ⋯ + 35)/10 = 396/10 = 39.6 minutes
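As a quick check, the mean of these ten times can be computed in a few lines of Python (a minimal sketch; the variable names are illustrative):

```python
# Time-to-get-ready values (minutes) for the 10 days
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# Arithmetic mean: sum of the n values divided by n
mean = sum(times) / len(times)
print(mean)  # 39.6
```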
Measures of central tendency: Median
• The median is the middle value of the ranked data set
• Rule 1: If the data set contains an odd number of values, the median is the
middle-ranked value
• Rule 2: If the data set contains an even number of values, the median is the
average of the two middle-ranked values
Measures of central tendency: Median
• We will again use the example of 10 time-to-get-ready values; first we rank
them from low to high
Day:           1   2   3   4   5   6   7   8   9   10
Ranked values: 29  31  35  39  39  40  43  44  44  52
• With n = 10 (even), the median is the average of the 5th and 6th ranked
values: (39 + 40)/2 = 39.5 minutes
Measures of central tendency: Mode
• The mode is the value that appears most frequently
• Like the median and unlike the mean, extreme values do not affect the mode
Day:           1   2   3   4   5   6   7   8   9   10
Ranked values: 29  31  35  39  39  40  43  44  44  52
• There are two modes, 39 minutes and 44 minutes, because each of these
values occurs twice
Observed data: 1 3 0 3 26 2 7 4 0 2 3 3 6 3
Ranked values: 0 0 1 2 2 3 3 3 3 3 4 6 7 26
• Because 3 occurs five times, more times than any other value, the mode is 3
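Both the median rule for an even n and the mode can be sketched with the stdlib (`Counter` tallies frequencies; this handles multiple modes, as in the time data):

```python
from collections import Counter

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
ranked = sorted(times)

# Median (even n): average of the two middle-ranked values
n = len(ranked)
median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2  # (39 + 40) / 2

# Mode(s): the value(s) with the highest frequency
counts = Counter(ranked)
top = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top)

print(median, modes)  # 39.5 [39, 44]
```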
Measures of central tendency: Quartiles
• Quartiles are measures of central tendency that divide a group of data into
four subgroups or parts
• The three quartiles (Q1, Q2, Q3) split a set of data into four equal parts.
• First quartile, Q1, Q1 = (n + 1)/4th ranked value
• Third quartile, Q3, Q3 =3(n + 1)/4th ranked value
• The second quartile (Q2), the median, divides the set such that 50% of the
values are smaller than or equal to the median, and 50% are larger than or
equal to the median
Measures of central tendency: Quartiles
• Rules for Calculating the Quartiles from a Set of Ranked Values
• Rule 1: If the ranked value is a whole number, the quartile is equal to the
measurement that corresponds to that ranked value
• Rule 2: If the ranked value is a fractional half (2.5, 4.5, etc.), the quartile is
equal to the measurement that corresponds to the average of the
measurements corresponding to the two ranked values involved
• Rule 3: If the ranked value is neither a whole number nor a fractional half,
round the result to the nearest integer and select the measurement
corresponding to that ranked value
Measures of central tendency: Quartiles
• Consider our example of time-to-get-ready values
Day:           1   2   3   4   5   6   7   8   9   10
Ranked values: 29  31  35  39  39  40  43  44  44  52
• Q1 rank = (10 + 1)/4 = 2.75, which is neither a whole number nor a fractional
half, so by Rule 3 we round to 3 and take the 3rd ranked value: Q1 = 35
• Q3 rank = 3(10 + 1)/4 = 8.25, rounded to 8 by Rule 3, giving the 8th ranked
value: Q3 = 44
• IQR = Q3 − Q1 = 44 − 35 = 9
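The three rules above can be turned into a small helper (a sketch; the function name `quartile` is illustrative):

```python
def quartile(ranked, q):
    """Return quartile q (1, 2, or 3) of an already-ranked list,
    using the (n + 1)-based ranked-value rules."""
    rank = q * (len(ranked) + 1) / 4
    if rank == int(rank):                      # Rule 1: whole number
        return ranked[int(rank) - 1]
    if rank * 2 == int(rank * 2):              # Rule 2: fractional half
        return (ranked[int(rank) - 1] + ranked[int(rank)]) / 2
    return ranked[round(rank) - 1]             # Rule 3: round to nearest

ranked = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
q1, q3 = quartile(ranked, 1), quartile(ranked, 3)
print(q1, q3, q3 - q1)  # 35 44 9
```

Note that Q2 falls under Rule 2 here (rank 5.5), reproducing the median of 39.5.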
Measures of variation
Measures of variability
• Measures of variability describe the spread or the dispersion of a data set
• Measures of variability are
• Range: The Range describes the difference between the largest and
smallest data point in our data set
• Variance: The variance is the average of the squared deviations about
the arithmetic mean for a set of numbers
• Standard Deviation (SD): Standard deviation measures the dispersion of
a dataset relative to its mean. It is defined as the square root of the
variance
• Mean Absolute deviation: The mean absolute deviation (MAD) is the
average of the absolute values of the deviations around the mean for a
set of numbers.
Measures of variability: Range
• A simple measure of variation, the range is the difference between the
largest and smallest value and is the simplest descriptive measure of
variation for a numerical variable
• Range = X_largest − X_smallest
Day:           1   2   3   4   5   6   7   8   9   10
Ranked values: 29  31  35  39  39  40  43  44  44  52
• Range = 52 − 29 = 23 minutes
• Sample SD: S = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) )
• Population SD: σ = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / n )
• For the time-to-get-ready data, S = 6.77
Measures of variability: Variance or
Standard Deviation
• Consider the example of 10 observations from time-to-get-ready
Time (X)    Step 1: (Xᵢ − X̄)    Step 2: (Xᵢ − X̄)²
39          −0.60               0.36
29          −10.60              112.36
43          3.40                11.56
52          12.40               153.76
39          −0.60               0.36
44          4.40                19.36
40          0.40                0.16
31          −8.60               73.96
44          4.40                19.36
35          −4.60               21.16
Mean = 39.6                     Sum = 412.40
                                Sum/(n − 1) = 45.82
• Sample SD: S = √45.82 = 6.77; population SD: σ = √(412.40/10) = 6.42
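The stdlib `statistics` module reproduces both figures (`stdev` divides by n − 1, `pstdev` by n):

```python
import statistics

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

s = statistics.stdev(times)       # sample SD, divides by n - 1
sigma = statistics.pstdev(times)  # population SD, divides by n

print(round(s, 2), round(sigma, 2))  # 6.77 6.42
```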
Measures of variability: MAD
• The steps to calculate the mean absolute deviation are provided here
• Step 1: Calculate the mean of the data
• Step 2: Calculate how far away each data point is from the mean using positive
distances; these are called absolute deviations
• Step 3: Add those absolute deviations
• Step 4: Divide the sum by the number of data points
• MAD = ( Σᵢ₌₁ⁿ |xᵢ − x̄| ) / n
Measures of variability: MAD
• Consider the example of 10 time-to-get-ready values and MAD computation
for the data
Time (X)        S2: |Xᵢ − X̄|
39              0.60
29              10.60
43              3.40
52              12.40
39              0.60
44              4.40
40              0.40
31              8.60
44              4.40
35              4.60
S1: Mean = 39.6 S3: Sum = 50.00
                S4: Sum/10 = 5
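The four MAD steps above map directly onto a few lines of Python (a minimal sketch):

```python
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

mean = sum(times) / len(times)               # S1: mean = 39.6
deviations = [abs(x - mean) for x in times]  # S2: absolute deviations
mad = sum(deviations) / len(times)           # S3 and S4: sum, then divide by n

print(round(mad, 2))  # 5.0
```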
Measures of shape
• A measure of shape is a tool that can be used to describe the shape of a
distribution of data
Measures of shape: Skewness and Kurtosis
• The distribution of data in which the right half is a mirror image of the left half
is said to be a symmetrical distribution
• Leptokurtic: A distribution that has a sharper-rising center peak than the peak of
a normal distribution has positive kurtosis
• Platykurtic: A distribution that has a slower-rising (flatter) center peak than the
peak of a normal distribution has negative kurtosis
Thanks!
Statistical Inference: Sampling
and Confidence Interval
Estimation
Prof. Abhinava Tripathi
Introduction: Statistical Inference
Introduction
• Since one rarely has the luxury of working with entire populations, one has to
make inferences; exact population values are not available
• You are working as part of a food regulator to examine the quality of food
• You cannot go to all factory outlets and check each packet
• You find that the lead content in these packets is 2.2 ppm with a standard
deviation of 0.7 ppm
• Can we say that the population mean and standard deviation parameters
would be the same as the sample's?
• In the food sample problem, suppose that you pick all the 100 samples from
a single factory
• It may be possible that the higher lead content is specific to that factory
• This requires that sampling procedure is fair and unbiased so that inferences
are accurate
Simple Random Sampling
• Let us discuss some of the ways in which we can select a sample of 100 noodle
packets
• One, though rather inefficient, way is to gather all the 30,000 packets together
and randomly select 100 out of these
• This is like a blindfolded person picking sample units from population: the process
is completely random
• You may not want to collect your sample from every single warehouse (i.e.,
consider each warehouse as a cluster)
• Instead, you can select 3-4 warehouses as clusters (possibly through random
sampling) and then draw the desired number of samples from these clusters
• Cluster sampling is usually used when the population can be divided
into different groups or clusters that have different characteristics
• But here we study only the selected clusters, not all the clusters
Systematic Sampling
• Let say you label the samples and select
every third packet starting from the second
packet, as shown in the figure here
• For many food regulators, stratified sampling and simple random sampling
are considered more suitable
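Both simple random and systematic sampling are easy to sketch with the stdlib (the labels 0..29999 stand in for the 30,000 packets; a sketch, not a production sampler):

```python
import random

random.seed(0)
packets = list(range(30000))   # population of packet labels

# Simple random sampling: every packet has an equal chance of selection
srs = random.sample(packets, 100)

# Systematic sampling: every 3rd packet starting from the 2nd
systematic = packets[1::3]

print(len(srs), systematic[:4])  # 100 [1, 4, 7, 10]
```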
Heterogeneous population
• What about cases where the population is not heterogeneous in nature?
• If all the noodle packets are of the same nature and manufactured in a single
factory, then simple random sampling would be the most straightforward choice
• All these sampling techniques fall under the category of probability sampling
• In these sampling techniques, every unit of the population has a certain known
chance of being included in the sample
Types of Sampling: Non-Random Sampling
Non-Random Sampling: Convenient sampling
• There is another sampling method called non-random sampling
• Here, the odds of a sample unit getting selected cannot be calculated
• Convenience sampling: You choose the 100 packets that were closest to you and
most easily available
• Often the survey questions and responses require highly specialized skillset
• In this discussion, we learned about two sampling techniques that fall under
non-random sampling
• A subset of the population selected for study is called a sample
• The population size is denoted by capital N, its mean by µ and its standard
deviation by σ
• Consider that you have all the N=30000 packets, that is the population data
• If the mean of this data is µ= 2.199 and the standard deviation is σ=0.132;
these are essentially population parameters
• As the sample size is increased, the distribution keeps getting closer and
closer to a normal distribution
• Moreover, as the sample size increases to more than 30, the mean of the
sample distribution approaches the population mean
Central Limit Theorem (CLT)
• The central limit theorem states that when you take a large number of
samples, the mean of the sampling distribution thus formed, will be
approximately equal to the population mean
• The second part of the theorem states that the standard deviation of this
sampling distribution will be equal to sigma, which is our population standard
deviation, divided by the square root of n where n is the sample size
• Finally, the central limit theorem states that if the sample size that you take is
greater than 30, the sampling distribution will become normally distributed
Central Limit Theorem (CLT)
• To concretize your understanding of the central limit theorem, let's try to
visualize it
• We took a sample of 100 packets and found out that its sample mean was 2.2
ppm and standard deviation was 0.7 ppm
• Sampling distribution is nothing but the distribution of all the possible sample
means that can be generated from this population
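The CLT can be checked by simulation; a sketch assuming a normally distributed population with the parameters µ = 2.199 and σ = 0.132 quoted earlier in these notes:

```python
import random

random.seed(42)
mu, sigma, n = 2.199, 0.132, 100

# Draw many samples of size n and record each sample mean
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(2000)
]

mean_of_means = sum(sample_means) / len(sample_means)
sd_of_means = (sum((m - mean_of_means) ** 2 for m in sample_means)
               / len(sample_means)) ** 0.5

print(round(mean_of_means, 3))  # close to mu = 2.199
print(round(sd_of_means, 4))    # close to sigma / sqrt(n) = 0.0132
```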
Introduction to Confidence Intervals
• With the help of CLT, we have some idea about the properties of this sampling
distribution
• If the sample size is greater than 30, then the sampling distribution is
approximately normal, with a mean equal to the population mean and a standard
deviation equal to the population SD divided by the square root of the sample size
• There are cases where you have some idea about the population standard
deviation; in other cases you don't, and you employ the sample SD in its place
Introduction to Confidence Intervals
• With sample standard deviation s = 0.7 and n = 100, we get the SD of the
sampling distribution as 0.7/10 = 0.07
• Here, we are using the sample standard deviation as the substitute for
population standard deviation
• For example, using the empirical rule, we can say that the probability that the
sample mean lies between µ − 2×0.07 and µ + 2×0.07 is about 95%
Introduction to Confidence Intervals
• While we do not know the population mean µ, we do know the standard
deviation of the sampling distribution: 0.07
• Or the probability that population mean ‘µ’ lies in the interval P(2.2-2*0.07 to
2.2+2*0.07) is 95%
• Or you can say with 95% probability that the population mean will lie between
2.06 ppm and 2.34 ppm
Introduction to Confidence Intervals
• The probability associated with this claim is called the confidence level
• Since we are concluding about the population mean with 95% probability, we
can say that the confidence level is 95% or alternatively level of significance
or alpha value =5% (i.e., 1- confidence level)
• Next, you have the margin of error, which is the maximum error= 2*0.07=0.14
• Since the upper bound of the confidence interval is less than 2.5, we can
conclude with 95% confidence that the noodles do not contain lead content
above the prescribed limit of 2.5 ppm
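The interval above can be reproduced directly; a sketch using the rule-of-thumb multiplier 2 from these notes:

```python
n = 100
x_bar, s = 2.2, 0.7   # sample mean and SD of lead content (ppm)
se = s / n ** 0.5     # SD of the sampling distribution = 0.07

z = 2                 # rule-of-thumb multiplier for ~95% confidence
lower, upper = x_bar - z * se, x_bar + z * se

print(round(lower, 2), round(upper, 2))  # 2.06 2.34
print(upper < 2.5)                       # True: within the 2.5 ppm limit
```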
Statistical Inference III: Confidence Interval
Construction
Confidence Interval Construction
• Till now, we have understood estimation of the population mean through the
construction of an unbiased confidence interval
• Often getting population data is not feasible and you need to rely on
inferential statistics
• The objective here is to estimate population mean; the population need not
be normal
• To solve the problem we start with the sample, using appropriate sampling
technique
Confidence Interval Construction
• You select a sample of size n and calculate the sample mean x̄ and sample
standard deviation s
• CLT suggests that sampling distribution will behave like a normal distribution
as you increase the sample size (>30), with a mean of µ (population mean)
and SD of 𝜎/ 𝑛
• For confidence levels 90%, 95%, and 99%, the z-values are 1.65, 1.96, and 2.58
• In the noodle packet example you may want to be highly confident, so 99%
confidence makes more sense; or maybe you have higher tolerance levels and are
OK going ahead with a 90% confidence level
Confidence Interval Construction
• Collect a sample of n>=30 from the population
• Based on the CLT, assume that the sampling distribution is normal with a mean
same as the population mean and SD which is same as population SD divided
by square root of n; population SD is proxied by sample SD
• Select the appropriate confidence level and, based on that, decide the
confidence interval: x̄ − z·s/√n to x̄ + z·s/√n
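The recipe above can be wrapped in a small helper (a sketch; the name `conf_interval` and the z-value table are taken from the levels quoted in these notes):

```python
Z = {90: 1.65, 95: 1.96, 99: 2.58}   # z-values quoted in the notes

def conf_interval(x_bar, s, n, level=95):
    """Large-sample CI: x_bar +/- z * s / sqrt(n)."""
    margin = Z[level] * s / n ** 0.5
    return x_bar - margin, x_bar + margin

# Noodle example: x_bar = 2.2 ppm, s = 0.7, n = 100
lo, hi = conf_interval(2.2, 0.7, 100, level=95)
print(round(lo, 3), round(hi, 3))  # 2.063 2.337
```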
Statistical Inference IV: Interval Estimation for
Small Samples
Interval Estimation for Small Samples
• Oftentimes, large samples are not available and one has to work with small
samples
• For example, you are in a pharma company and in a medicine trial you only have
15 volunteers
• In such cases with less than 30 sample size, you work with t-distribution, where
population SD is not known and the same is proxied using the sample standard
deviation
• A t-distribution is similar to the z-distribution, except that it has a lower peak
and heavier tails
Interval Estimation for Small Samples
• You work for a pharma company and are testing the effects of a medicine on 15
volunteers
• For smaller sample sizes, t-distributions are flatter than for larger sample sizes
• For large DOF, the t-distribution approaches the standard normal distribution
(it is already close at sample size n = 30)
Interval Estimation for Small Samples
• For this case (n=15) DOF=14, and we will use t-distribution
• We select a confidence level of 95%; the relevant confidence interval will be
x̄ − t·s/√n to x̄ + t·s/√n; the corresponding t-value is 2.145
• For a large sample size, both the t-distribution and normal distribution give
nearly the same interval
• The lower and upper bounds are given by x̄ − t·s/√n and x̄ + t·s/√n; using
these we obtain the interval estimate
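A minimal small-sample sketch using the t-value 2.145 quoted above; the sample figures x̄ = 10 and s = 2 are made up for illustration:

```python
n = 15
x_bar, s = 10.0, 2.0   # hypothetical sample mean and SD from the trial
t = 2.145              # t-value for 14 DOF at 95% confidence

margin = t * s / n ** 0.5
print(round(x_bar - margin, 2), round(x_bar + margin, 2))  # 8.89 11.11
```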
• How can this be extrapolated to the entire population, given that sample-mean
and standard-deviation based approaches are not valid?
• Consider for example, you are working as part of a political science company
that specializes in voter polls and designs surveys to keep political office
seekers informed of their position in a race
Interval Estimation for proportions
• Through these surveys you found that 220 registered voters, out of 500
contacted, favor a particular candidate. You want to develop a 95%
confidence interval estimate for the population proportion of registered voters
who favor this candidate
• To apply the sampling distribution of the sample proportion, we need n·p > 5
and n·(1 − p) > 5
• Both of these values are considerably greater than 5, so we can assume that
sampling distribution follows normal distribution and go ahead with the
formula for 95% confidence interval estimation
• The appropriate confidence interval here is p̄ ± z₍α/2₎·√( p(1 − p)/n ); here the
SD of the sampling distribution is taken as √( p(1 − p)/n )
Interval Estimation for proportions
• In this case, 𝑝ഥ =0.44, n=500, z=1.96
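Plugging these numbers into the proportion formula above (a sketch; rounding to 3 decimals):

```python
p_bar, n, z = 0.44, 500, 1.96   # 220 / 500 = 0.44

# Check the normal-approximation conditions n*p > 5 and n*(1 - p) > 5
print(n * p_bar > 5 and n * (1 - p_bar) > 5)  # True

se = (p_bar * (1 - p_bar) / n) ** 0.5
lower, upper = p_bar - z * se, p_bar + z * se
print(round(lower, 3), round(upper, 3))  # 0.396 0.484
```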
• Chances are that some of these phones may take less than 30-mins and
some may take more than 30-mins
• With this information you can come up with a confidence interval: if the
charging time is in the range of 24-29 minutes with 95% confidence, I can
say that the charging time is less than 30 mins
• But what if the confidence interval is 26-32 mins? Then I cannot say that the
charging time is less than 30 mins
Introduction
• This confidence interval approach is perfectly valid
• This is so because chances are that 30-min value may fall in this interval
• You gather the evidence from the sample and check if the claim can be
rejected or not
• A pizza delivery firm plans to test the effectiveness of their campaign in the
test population and control population
Applications of Hypothesis Testing
• These kind of problems are referred to as AB testing: for example, you want
to compare the response rate of two different webpages
• You divide the population in two groups (Group 1 and 2) that are exposed to
different versions of the product: version A and version B
• Let us start by assuming that the claim is true: It takes 30-minutes to charge
the phone
• If the null is rejected then either it takes more or less than 30-minutes to
charge the phone
Hypothesis Testing
• This was a simple case of framing null and alternate
hypothesis
• The null (H0) and alternate (Ha) hypotheses are always mutually exclusive
and exhaustive events
• Null H0: Charging time is less than equal to (≤) 30-mins; Alternate H1: Charging
time is more than (>) 30-mins
• To summarize: (a) the null hypothesis captures the status quo; (b) the alternate,
which we are trying to prove, is the complement of the null; (c) as a convention,
the null contains equality (=, ≤, ≥)
Hypothesis Testing
• In our earlier example, H0 is that ‘Charging time is = 30 min’
• If the computations favor alternate hypothesis, then you reject the null or
accept the alternate hypothesis
• That is, the claim that charging time is not equal to 30-mins is true
• And the required action is based on whether the charging time is more or
less than 30-mins
Hypothesis Testing
• If the calculations favor the null hypothesis that charging time is equal to 30-
mins that means you fail to reject the null (not that null is correct or
accepted)
• Sample properties may differ from the population's, and as more and more
samples arrive, the null may be disproved
• For example, all swans are white, till the one single black swan was found to
reject the null
Hypothesis Testing Part II: Critical Value
Method
Hypothesis Testing: Critical Value Method (CVM)
• We will discuss the CVM method for testing the claim that charging time is
equal to 30-mins
• You sample 100 phones and find that their mean charging time is 30.37-
mins and population standard deviation (σ) is equal to 2.477
• Please note that these 100 data points make-up for one sample
• You are happy as long as the sample mean lies in the 95% confidence level
interval of 30-mins
Hypothesis Testing: Critical Value Method (CVM)
• You are happy as long as the sample mean lies in the 95% confidence level
interval of 30-mins
• The z-score corresponding to 30.37 is z = (30.37 − 30)/0.2477 = 1.4937,
where 0.2477 = σ/√n = 2.477/√100
• Since this is less than the critical z-value of 1.96, you fail to reject the null
Hypothesis Testing: Critical Value Method (CVM)
• For example, if the sample mean was 30.62, then z = (30.62 − 30)/0.2477 = 2.5
• This would fall outside the 95% region and you would be able to reject the null
• Let us recap the problem: (1) Frame the null and alternate hypothesis; (2)
Decide the appropriate confidence interval; (3) Calculate the critical z value; (4)
Compute the sample z-score; (5) Compare the sample z-score with the critical z
value
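The five recap steps map onto a few lines of Python (a sketch using the figures from this example):

```python
mu0 = 30.0                 # Step 1: null value, H0: mean charging time = 30 mins
x_bar, sigma, n = 30.37, 2.477, 100

se = sigma / n ** 0.5      # SD of the sampling distribution = 0.2477
z = (x_bar - mu0) / se     # Step 4: sample z-score
z_crit = 1.96              # Steps 2-3: two-tailed critical value at 95%

print(round(z, 4))         # 1.4937
print(abs(z) > z_crit)     # Step 5: False -> fail to reject the null
```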
Hypothesis Testing Part III: One Tailed Test
(CVM)
One Tailed Test
• It is sometimes sufficient to test only one side of the sample mean
distribution; this is called a one-tailed test
• In the previous example, we performed the test on both sides of the normal
distribution, that is, we would have rejected the null hypothesis if the sample
was significantly different in either direction
• As a customer, you want to test whether the charging time is more than 30-mins
One Tailed Test
• Step 1 is the same, i.e., frame the null and alternate hypothesis
• Step 3: Find the corresponding critical z-value; here, we can reject the null if
we can prove that the sample mean is significantly greater than 30-minutes
• Hence the rejection region is on the right side of the curve: right tailed test
One Tailed Test
• Let us compare the rejection regions
in one-tailed vs two-tailed test
• In the two tailed test, the unshaded region is the 2.5% on the right and left
• In the single tailed test, the entire 5% is on the right side, thus the critical value will
not be the same
• In the previous case, the critical z-value was 1.96. Now the complete area to the
left of the critical value is 95%, and the corresponding z-value is 1.645
• For our sample of 100 phones, the sample mean was 30.79, and we found the
population SD to be 2.477
• Sample z-score = (30.79 − 30)/(2.477/√100) = 3.19, which is much larger
than 1.645
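The same comparison in code, with the one-tailed critical value of 1.645:

```python
mu0 = 30.0
x_bar, sigma, n = 30.79, 2.477, 100

z = (x_bar - mu0) / (sigma / n ** 0.5)
z_crit = 1.645             # right-tailed critical value at 95% confidence

print(round(z, 2))         # 3.19
print(z > z_crit)          # True -> reject the null
```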
• Cadbury states that the average weight of a particular brand of its chocolate is
60g; as an analyst, you want to test if the weight is less than 60g or not at 2%
significance
• Here H0: Weight is greater than or equal to (≥) 60g; H1: Weight is less than (<) 60g
• In the same mobile phone example, first we formulate the null and alternate
hypotheses
• The p-value is the probability, assuming the null is true, of obtaining a result at
least as extreme as the one observed; graphically, it is the area in the tail of the
normal probability distribution
P-value method
• In our mobile phone example, p-value represents area from -∞ to -1.4937
and 1.4937 to + ∞
• Since the curve is symmetric, we can calculate one tail and multiply it by 2
• The value can be computed with R software; the value corresponding to one
tail works out to 0.068, and thus the total p-value becomes 2 × 0.068 = 0.136
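The same tail area can be computed in Python (a sketch using the stdlib `statistics.NormalDist`, in place of the R call mentioned above):

```python
from statistics import NormalDist

z = 1.4937
one_tail = 1 - NormalDist().cdf(z)   # area from z to +infinity
p_value = 2 * one_tail               # two-tailed p-value

print(round(one_tail, 3))  # 0.068
print(p_value > 0.05)      # True -> fail to reject at 5% significance
```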
• You computed the z-score from sample data points and compared it to the
critical z-value
• You also saw the p-value method, which was the probability at tails
(significance level)
• A lower p-value meant stronger evidence against the null, i.e., a higher
confidence in rejecting it
Thanks!