Stats_Notes
Statistics is the science of gathering, reviewing, analyzing and drawing conclusions from data. With the right statistical tools in hand we can derive many key observations and make predictions from the data we have.
In the real world we deal with many cases where we use statistics, knowingly or unknowingly.
Let's talk about one classic use of statistics in the most famous sport in India. Yes, you guessed it right: Cricket.
What makes Virat Kohli the best batsman in ODIs, or Jasprit Bumrah the best bowler in ODIs?
We have all heard of cricketing terms like batting average, bowler's economy and strike rate, and we often see graphs built from such numbers.
We see and talk about statistics all the time but very few of us know the science behind it.
Using different statistical methods, the ICC compares players and teams and ranks them. So if we learn the science behind it, we can create our own rankings and comparisons; better still, if we debate with someone over who is the better player, we can now debate with facts and figures, because we will understand the statistics behind such graphs.
We will dive further into the various methods and terminologies that help answer the question above, and also see the vast uses of statistics in far more complex areas such as medical science, drug research, stock markets, economics, marketing etc.
Types of Statistics
1. Descriptive Statistics:
The type of statistics dealing with numbers (numerical facts, figures, or information) to describe a phenomenon; these numbers are descriptive statistics.
e.g. reports of industry production, cricket batting averages, government deficits, movie ratings etc.
2. Inferential statistics
Inferential statistics is a decision, estimate, prediction, or generalization about a population, based on a sample.
Inferential statistics is used to make inferences from data whereas descriptive statistics simply
describes what’s going on in our data.
Let’s clear our understanding about the above types with a basic scenario:
Suppose there are 1000 students in your college. You are interested in finding out how many students prefer eating in the college canteen over the college mess.
A random group of 100 students is selected. Here our population size is of 1000 students and the
sample size is of 100 students. You surveyed the sample group and got the following results:
[Bar chart: number of students (y-axis, 0 to 80) preferring the canteen vs. the mess, broken down by year of study (1st year to 4th year).]
The chart above shows how the preference varies across students; we are simply using numbers and figures to describe the data. This is descriptive statistics.
Now, suppose you get a contract to open a canteen in the college. With the above data, you can make the following inferences:
1) 3rd and 4th year students are the main target for canteen sales.
2) You can offer discounts to 1st year students to increase their numbers.
3) Since most students prefer eating in the canteen, opening one can be a profitable business.
You made these inferences/estimations for the whole college based on sample data. This is inferential statistics, where you make decisions based on the descriptive statistics of a sample.
Though the above example is very basic and real scenarios are much more complex, it should convey the underlying difference. We will see more complex examples ahead.
Q) Is the following statement descriptive or inferential? "The average salary of the employees of a company in 2017 is greater than the average salary of the teachers of a school in 2017."
Types of Data
i) Categorical data represent characteristics such as a person's gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as "1" indicating male and "2" indicating female), but those numbers don't have mathematical meaning; you couldn't meaningfully add them together.
ii) Numerical data have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure, or as a count, such as the number of stock shares a person owns.
* Numerical data can be further broken into two types: discrete and continuous.
Discrete data represent items that can be counted; they take on possible values that can be listed out.
The list of possible values may be fixed (also called finite), or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (the finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get that 100th head); its possible values are listed as 100, 101, 102, 103, ... (the countably infinite case).
In all these examples the values can be 1, 2, 3 and so on, but never 1.2, 4.6, 8.7 etc., making the results countable.
Continuous data represent measurements; their possible values cannot be counted and can only be
described using intervals on the real number line.
For example, the exact amount of petrol purchased at a petrol pump for bikes with 20-liter tanks would be continuous data from 0 liters to 20 liters, represented by the interval [0, 20], inclusive. You might pump 8.40 liters, or 8.41, or 8.414863 liters, or any possible number from 0 to 20. In this way, continuous data can be thought of as being uncountably infinite.
As another example, suppose you purchase a light bulb and are informed that its life is 2000 hours. The life of the bulb is continuous data, since it can take any value such as 1998 hours, 1998.56 hours, 1896.34 hours, or exactly 2000 hours.
Levels of measurement
1. Qualitative Data
A nominal variable is one in which values serve only as labels, even if those values are numbers.
For example, if we want to categorize male and female respondents, we could use a number of 1 for
male, and 2 for female. However, the values of 1 and 2 in this case do not represent any meaningful
order or carry any mathematical meaning. They are simply used as labels. Nominal data cannot be used
to perform many statistical computations, such as mean and standard deviation, because such statistics
do not have any meaning when used with nominal variables.
However, nominal variables can be used to do cross tabulations. The chi-square test can be performed
on a cross-tabulation of nominal data.
Examples: gender, marital status, hometown.
Here, even though we code the different categories as numbers, we cannot say that since 2 > 1, female > male.
Values of ordinal variables have a meaningful order to them. For example, education level (with
possible values of high school, undergraduate degree, and graduate degree) would be an ordinal
variable. There is a definitive order to the categories (i.e., graduate is higher than undergraduate, and
undergraduate is higher than high school), but we cannot make any other arithmetic assumptions
beyond that. For instance, we cannot assume that the difference in education level between
undergraduate and high school is the same as the difference between graduate and undergraduate.
We can use frequencies, percentages, and certain non-parametric statistics with ordinal data. However,
means, standard deviations, and parametric statistical tests are generally not appropriate to use with
ordinal data.
2. Quantitative Data
For interval variables, we can make arithmetic assumptions about the degree of difference between
values. An example of an interval variable would be temperature. We can correctly assume that the
difference between 70 and 80 degrees is the same as the difference between 80 and 90 degrees.
However, the mathematical operations of multiplication and division do not apply to interval variables.
For instance, we cannot accurately say that 100 degrees is twice as hot as 50 degrees. Additionally,
interval variables often do not have a meaningful zero-point. For example, a temperature of zero
degrees (on Celsius and Fahrenheit scales) does not mean a complete absence of heat.
An interval variable can be used to compute commonly used statistical measures such as the average
(mean), standard deviation etc. Many other advanced statistical tests and techniques also require
interval or ratio data.
Example: the timing of a historical event falls on an interval scale, since the year has no fixed origin, i.e. year 0 differs across calendars, religions and countries.
All arithmetic operations are possible on a ratio variable. An example of a ratio variable would be
weight (e.g., in pounds). We can accurately say that 20 pounds is twice as heavy as 10 pounds.
Additionally, ratio variables have a meaningful zero-point (e.g., exactly 0 pounds means the object has
no weight). Other examples of ratio variables include gross sales of a company, the expenditure of a
company, the income of a company, etc.
Examples: temperature measured on the Kelvin scale (since Kelvin has an absolute zero), and the average height of students in a class.
A ratio variable can be used as a dependent variable for most parametric statistical tests such as t-tests,
F-tests, correlation, and regression.
We can summarize different levels of measurements as below:
Offers | Nominal | Ordinal | Interval | Ratio
Sequence of variables is established | - | yes | yes | yes
Mode | yes | yes | yes | yes
Median | - | yes | yes | yes
Mean | - | - | yes | yes
Difference between variables can be evaluated | - | - | yes | yes
Addition and subtraction of variables | - | - | yes | yes
Multiplication and division of variables | - | - | - | yes
Absolute zero | - | - | - | yes
Measures of Central Tendency
A measure of central tendency is a summary statistic that represents the center point or typical value of
a dataset. These measures indicate where most values in a distribution fall and are also referred to as
the central location of a distribution. You can think of it as the tendency of data to cluster around a
middle value. In statistics, the three most common measures of central tendency are the mean, median,
and mode. Each of these measures calculates the location of the central point using a different method.
Mean: The mean is the arithmetic average, for calculating the mean just add up all of the values and
divide by the number of observations in your dataset.
Median: The median is the middle value. It is the value that splits the dataset in half. To find the
median, order your data from smallest to largest, and then find the data point that has an equal amount
of values above it and below it. The method for locating the median varies slightly depending on
whether your dataset has an even or odd number of values.
Frequency: The number of times a variable occurs in the data set is called its frequency.
Mode: The mode is the value that occurs the most frequently in your data set i.e. has the highest
frequency. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for
occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not
have a mode.
Q) Given a dataset of heights of students in a class, find the mean, mode and median.
Heights (in cm) = {180, 167, 154, 142, 181, 145, 143, 145, 167, 145}
Mean = (180 + 167 + 154 + 142 + 181 + 145 + 143 + 145 + 167 + 145) / 10 = 156.9 cm
Rearranged heights = {142, 143, 145, 145, 145, 154, 167, 167, 180, 181}
Since n is even, the median is the average of the (n/2)th and (n/2 + 1)th values:
Median = (5th value + 6th value) / 2 = (145 + 154) / 2 = 149.5 cm
To calculate the mode, we build a frequency table of the values.

Height | Frequency
142 | 1
143 | 1
145 | 3
154 | 1
167 | 2
180 | 1
181 | 1

Mode = 145 (highest frequency, 3)
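As a quick check, the same three measures can be reproduced with Python's built-in statistics module, which these notes lean on later for such calculations; a minimal sketch:

import statistics

heights = [180, 167, 154, 142, 181, 145, 143, 145, 167, 145]
print(statistics.mean(heights))    # 156.9
print(statistics.median(heights))  # 149.5 (average of the 5th and 6th sorted values)
print(statistics.mode(heights))    # 145 (the value with the highest frequency, 3)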
Let's take the above example, change the values of some observations, and check the effect on our measurements:
Heights (in cm) = {180, 167, 154, 122, 181, 135, 123, 145, 166, 145}
Rearranged heights = {122, 123, 135, 145, 145, 154, 166, 167, 180, 181}
Note: we have lowered several of the values below the median.
Mean = (122 + 123 + 135 + 145 + 145 + 154 + 166 + 167 + 180 + 181) / 10 = 151.8 cm
Median = (145 + 154) / 2 = 149.5 cm, unchanged.
Note: we see a significant change in the mean, whereas the median does not change at all.
That's because the calculation of the mean incorporates all values in the data. If you change any value,
the mean changes.
Unlike the mean, the median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course, with other types of
changes, the median can change.
Example: given a grouped-frequency dataset (the frequency table is not reproduced in these notes), find the mean, median and mode.
Solution: since there is no value of 9.5 in the cumulative frequency column, we take the next cumulative frequency, which is 12, to locate the median class.
What is Skewness?
Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed
either to the left or to the right. Skewness can be quantified to define the extent to which a distribution
differs from a normal distribution.
In a normal distribution, the graph appears as a classical, symmetrical "bell-shaped curve." The mean, or
average, and the mode, or maximum point on the curve, are equal.
In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each
other.
When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on
the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.
When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail
on the left-hand side, and the mean is greater than the mode. This situation is also called positive
skewness.
In a symmetric distribution, the mean and median both locate the center accurately; they are approximately equal.
In a skewed distribution, however, the mean can miss the mark: it starts to fall outside the central area. This happens because outliers have a substantial impact on the mean; extreme values in an extended tail pull the mean away from the center, and the more skewed the distribution becomes, the further the mean is drawn from the center. Here the median better represents the central tendency of the distribution.
Uses of Mean, Median and Mode
When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal.
In this case, use the mean because it includes all of the data in the calculations. However, if you have a
skewed distribution, the median is often the best measure of central tendency.
When you have ordinal data, the median or mode is usually the best choice. For categorical data, you
have to use the mode.
Measures of Dispersion
A measure of dispersion shows the scattering of the data: how much the values vary from one another. It gives a clear idea of the distribution of the data and shows the homogeneity or heterogeneity of the observations.
How is it useful?
1) Measures of dispersion show the variation in the data, which tells us how well the average of the sample represents the entire data. With less variation the average is a close representation; with larger variation the average may not closely represent all the values in the sample.
2) Measures of dispersion enable us to compare two or more series with regard to their variation. This helps to determine consistency.
3) By checking the variation in the data, we can try to control the causes behind the variation.
1) Range: the range is the most common and most easily understood measure of dispersion. It is the difference between the two extreme observations of the dataset. If Xmax and Xmin are the two extreme observations, then
Range = Xmax - Xmin
2) Standard Deviation
In statistics, the standard deviation is a very common measure of dispersion. It measures how spread out the values in a dataset are around the mean; more precisely, it is a measure of the typical distance between the values in the set and the mean. If the data values are all similar, the standard deviation will be low (close to zero). If the data values are highly variable, the standard deviation will be high (far from zero).
The standard deviation is always a non-negative number and is always measured in the same units as the original data. Squaring the deviations overcomes the drawback of signs in mean deviations, i.e. the distance of a point from the mean must always be counted as positive.
3) Variance:
The variance is defined as the average of the squared differences from the mean: σ² = Σ(x - μ)² / N. The standard deviation is its square root.
Example:
1. A class of students took a test in Language Arts. The teacher determines that the mean grade on the
exam is a 65%. She is concerned that this is very low, so she determines the standard deviation to see if
it seems that most students scored close to the mean, or not. The teacher finds that the standard
deviation is high. After closely examining all of the tests, the teacher is able to determine that several
students with very low scores were the outliers that pulled down the mean of the entire class’s scores.
2. An employer wants to determine if the salaries in one department seem fair for all employees, or if
there is a great disparity. He finds the average of the salaries in that department and then calculates the
variance, and then the standard deviation. The employer finds that the standard deviation is slightly
higher than he expected, so he examines the data further and finds that while most employees fall
within a similar pay bracket, three loyal employees who have been in the department for 20 years or
more, far longer than the others, are making far more due to their longevity with the company. Doing
the analysis helped the employer to understand the range of salaries of the people in the department.
Coefficient of Variation (CV)
The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation (σ) to the mean (μ), often expressed as a percentage: CV = (σ/μ) × 100%. Being unit-free, it gives a comparable measure of variability.
Let’s take one more example to try and understand how standard deviation and CV is helpful:
Consider the scores of two batsmen over 10 matches, shown in the first two columns below. Looking only at the averages, we might say that Batsman 1 is the better batsman and should be preferred, since his mean is greater. But is that really true?

Match | Batsman 1 | Batsman 2 | diff_1 = mean1 - x1 | diff_2 = mean2 - x2 | (diff_1)² | (diff_2)²
Match 1 | 54 | 45 | -3 | -1.5 | 9 | 2.25
Match 2 | 35 | 42 | 16 | 1.5 | 256 | 2.25
Match 3 | 68 | 25 | -17 | 18.5 | 289 | 342.25
Match 4 | 12 | 53 | 39 | -9.5 | 1521 | 90.25
Match 5 | 13 | 75 | 38 | -31.5 | 1444 | 992.25
Match 6 | 120 | 12 | -69 | 31.5 | 4761 | 992.25
Match 7 | 6 | 28 | 45 | 15.5 | 2025 | 240.25
Match 8 | 0 | 27 | 51 | 16.5 | 2601 | 272.25
Match 9 | 18 | 85 | 33 | -41.5 | 1089 | 1722.25
Match 10 | 184 | 43 | -133 | 0.5 | 17689 | 0.25
Sum | 510 | 435 | | | 31684 | 4656.5

Mean: Batsman 1 = 510/10 = 51; Batsman 2 = 435/10 = 43.5
Variance (÷ n): Batsman 1 = 31684/10 = 3168.4; Batsman 2 = 4656.5/10 = 465.65
Standard deviation: Batsman 1 ≈ 56.3; Batsman 2 ≈ 21.6
Coefficient of variation: Batsman 1 ≈ 110.4%; Batsman 2 ≈ 49.6%
We can clearly see that the standard deviation gives a different picture of the two batsmen. Batsman 1 has a higher average, but his variance is very high, so he is less reliable. Batsman 2, on the other hand, has a lower average but is much more consistent.
Also, the coefficient of variation for Batsman 2 is lower than for Batsman 1, which indicates lower variability and higher consistency.
If we only had taken mean into account, we wouldn’t have gotten the true picture. This problem is
solved by the dispersion measures.
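The whole comparison takes only a few lines of Python; a sketch using the statistics module (pstdev divides by n, matching the hand calculation above):

import statistics

batsman_1 = [54, 35, 68, 12, 13, 120, 6, 0, 18, 184]
batsman_2 = [45, 42, 25, 53, 75, 12, 28, 27, 85, 43]

for name, scores in [("Batsman 1", batsman_1), ("Batsman 2", batsman_2)]:
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)       # population formula: divide by n
    cv = sd / mean * 100                 # coefficient of variation, in percent
    print(f"{name}: mean={mean:.1f}, sd={sd:.2f}, CV={cv:.1f}%")
# Batsman 1: mean=51.0, sd=56.29, CV=110.4%
# Batsman 2: mean=43.5, sd=21.58, CV=49.6%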
Q) What is the variance and standard deviation of the outcomes of rolling a fair die?
The outcomes 1-6 are equally likely and their mean is 3.5, so
variance = ((1-3.5)² + (2-3.5)² + (3-3.5)² + (4-3.5)² + (5-3.5)² + (6-3.5)²)/6 = (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)/6 = 2.917
standard deviation = √2.917 ≈ 1.708
Q) The following dataset has a mean of 14.7 and a variance of 10.01, with two unknown values a and b (the eight known values sum to 114).
From the mean: 14.7 = (114 + a + b)/10, so a + b = 147 - 114 = 33, i.e. a = 33 - b.
Substituting a = 33 - b into the variance equation gives a quadratic in b whose roots are 13 and 20.
So the two unknown values in the dataset are 13 and 20. We cannot tell which is a and which is b, since the mean and variance tell us nothing about the order of the data.
The formulas for standard deviation and variance differ between population data and sample data:
Population variance: σ² = Σ(x - μ)² / N
Sample variance: s² = Σ(x - x̄)² / (n - 1)
1. Compute the square of the difference between each value and the sample mean.
2. Add those squared differences and divide by n - 1.
In step 1, you compute the difference between each value and the mean of those values. You don't know the true mean of the population; all you know is the mean of your sample. Except in the rare case where the sample mean happens to equal the population mean, the data will be closer to the sample mean than to the true population mean. So the sum you compute in step 2 will probably be a bit smaller (and can't be larger) than it would be if you used the true population mean in step 1. To make up for this, we divide by n - 1 rather than n.
But why n - 1? If you knew the sample mean and all but one of the values, you could calculate what that last value must be. Only n - 1 values are free to vary; statisticians say there are n - 1 degrees of freedom.
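In Python's statistics module, pstdev implements the population formula (divide by n) and stdev the sample formula (divide by n - 1); a quick illustration on a hypothetical sample:

import statistics

sample = [54, 35, 68, 12, 13, 120, 6, 0, 18, 184]  # hypothetical sample data
print(statistics.pstdev(sample))  # divides by n: population formula
print(statistics.stdev(sample))   # divides by n-1: sample formula (slightly larger)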
Covariance
1. It is a measure of the relationship between a pair of random variables, where a change in one variable is associated with a change in the other.
2. It can take any value between -infinity and +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of the relationship between variables.
5. It has dimensions (the units of the two variables multiplied together).
Cov(X, Y) = Σ(xi - x̄)(yi - ȳ) / n
Correlation
1. It shows whether, and how strongly, pairs of variables are related to each other.
2. Correlation takes values between -1 and +1: values close to +1 represent a strong positive correlation, and values close to -1 a strong negative correlation.
3. Like covariance, it applies to linear relationships between variables.
4. It gives both the direction and the strength of the relationship between variables.
5. It is the scaled version of covariance.
6. It is dimensionless.
Corr(X, Y) = Cov(X, Y) / (σx · σy)
Positive Correlation
When the values of the variables deviate in the same direction, i.e. when the value of one variable increases (decreases), the value of the other variable also increases (decreases).
Examples: height and weight of a person; hours studied and marks scored.
Negative Correlation
When the values of the variables deviate in opposite directions, i.e. when the value of one variable increases (decreases), the value of the other variable decreases (increases).
Examples: price of a product and quantity demanded; speed of a vehicle and time taken to cover a fixed distance.
Zero Correlation
When two variables are independent of each other, they will have a zero correlation.
Note: for standardized (scaled) data, covariance and correlation give the same value. Also, correlation and causation are not the same thing.
Example:
x = [1,2,3,4,5,6,7,8,9]
y = [9,8,7,6,5,4,3,2,1]
Ans: We can clearly see in the dataset that as x increases, y decreases, and vice versa, so we expect a strong negative correlation.
Solution:
As we proceed further we will use statistical tools such as Python's statistics libraries to do the complex calculations and derive our descriptive statistics. For example, solving the above problem in Python is quite easy.
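A sketch of that calculation; statistics.covariance and statistics.correlation require Python 3.10+, and covariance here uses the sample (n - 1) formula:

import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 7, 6, 5, 4, 3, 2, 1]

print(statistics.covariance(x, y))   # -7.5 (negative: they move in opposite directions)
print(statistics.correlation(x, y))  # -1.0 (perfect negative linear relationship)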
Probability Distribution
A probability distribution is a function that shows all the possible values a variable can take and how often each occurs. The distribution is defined by the underlying probabilities; a graph is just its visual representation.
Let’s take an example and find out probability distribution of all possible outcomes of rolling a die:
Possible outcome | Probability
1 | 1/6
2 | 1/6
3 | 1/6
4 | 1/6
5 | 1/6
6 | 1/6
The above table represents the probability distribution of the variable (possible outcomes).
When we plot the above table, we get the probability distribution graph as below.
Let’s take another example and find the probability distribution of getting sums of rolling two dice: -
Possible outcomes =
{(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(3,1),(3,2),(3,3),(3,4),(3,5),(3,6),(4,1),(4,2)
,(4,3),(4,4),(4,5),(4,6),(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)}
Number of outcomes = 36
Possible sum | Occurrences | Probability
2 | 1 | 1/36 ≈ 0.03
3 | 2 | 2/36 ≈ 0.06
4 | 3 | 3/36 ≈ 0.08
5 | 4 | 4/36 ≈ 0.11
6 | 5 | 5/36 ≈ 0.14
7 | 6 | 6/36 ≈ 0.17
8 | 5 | 5/36 ≈ 0.14
9 | 4 | 4/36 ≈ 0.11
10 | 3 | 3/36 ≈ 0.08
11 | 2 | 2/36 ≈ 0.06
12 | 1 | 1/36 ≈ 0.03
This table represents the Probability distribution of different outcomes and if we plot the above table,
we get the below Probability distribution graph.
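The same table can be generated by enumerating all 36 equally likely outcomes in Python; a small sketch:

from collections import Counter

counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for total in sorted(counts):
    print(total, counts[total], round(counts[total] / 36, 2))
# e.g. a sum of 7 occurs 6 times -> probability 6/36 ≈ 0.17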
Probability Distributions: Discrete vs. Continuous
For example, when we toss a coin or roll a die, we know the possible outcomes: heads or tails for the coin, and 1, 2, ..., 6 for the die. We can never get values like 1.5 or 3.2. Such variables are discrete variables.
On the other hand, say the criterion for selection in a basketball team is a height between 170 cm and 200 cm. Here height is a continuous variable, as it can take any value between 170 and 200.
Different types of Discrete Probability Distribution
1) Binomial distribution
2) Poisson distribution
3) Bernoulli distribution
The example we took above of rolling a die is a discrete probability distribution.
In a continuous probability distribution, unlike a discrete probability distribution, the probability that a
continuous random variable will assume a particular value is zero. Thus, we cannot express a continuous
probability distribution in tabular form. We describe it using an equation or a formula also known as
Probability Density Function (pdf).
For a continuous probability distribution, the probability density function has the following properties:
• The graph of the density function will always be continuous over the range in which the random
variable is defined.
• The area bounded by the curve of the density function and the x-axis is equal to 1, when
computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area
under the density function bounded by a and b.
Different types of Continuous Probability Distribution
1) Normal distribution
2) Student's t distribution
3) Chi-squared distribution
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric about the mean, showing that data near the mean are more frequent than data far from the mean. In graph form it appears as a bell curve. Its probability density function is
f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))
Note: for the examples below we will use a "Normal Distribution Calculator", as it is very convenient for calculation purposes. The calculations can also be done with the formula above, but that is not necessary; we will learn how to solve problems using the z-score table in the next section.
Examples:
Q) The Light Bulb Company has found that an average light bulb lasts 900 hours with a standard
deviation of 80 hours. Assuming that bulb life is normally distributed. What is the probability that a
randomly selected light bulb will burn out in 1000 hours or less?
Answer:
Mean = 900
Standard deviation = 80
x(a) = 1000
We could use the formula above to find the probability, but for the time being we take the help of the normal distribution calculator.
P(X ≤ 1000) = 89.4%, i.e. there is an 89.4% chance that the bulb will burn out within 1000 hours.
Q) Suppose scores on a mathematics test are normally distributed. If the test has a mean of 55 and a
standard deviation of 10, what is the probability that a person who takes the test will score between
45 and 65?
Ans.
If we find the cumulative probability for X ≤ 45 and the cumulative probability for X ≤ 65, we can subtract them to find the required probability.
Mean = 55
Standard deviation = 10
Again using the normal distribution calculator, we find both values:
P(X ≤ 65) - P(X ≤ 45) = 0.841 - 0.159 = 0.682
So a person taking the test has a 68.2% probability of scoring between 45 and 65 marks.
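Both calculator lookups can be reproduced with Python's statistics.NormalDist; a minimal sketch:

from statistics import NormalDist

bulbs = NormalDist(mu=900, sigma=80)
print(bulbs.cdf(1000))  # ≈ 0.894 -> 89.4% chance of burning out within 1000 hours

scores = NormalDist(mu=55, sigma=10)
print(scores.cdf(65) - scores.cdf(45))  # ≈ 0.683 -> P(45 <= X <= 65)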
-----------------------------------------------------------------------------------------------------------------
When we look at the heights or weights of people, we see that they roughly follow a normal distribution. Why does this seem natural? Simply because it is more probable to find people with heights near the average than to find very short or very tall people. Just look around your class: the majority of people will fall in a range near the average height of the class.
Income is often given as another everyday example ("most people are middle class"), though in reality wealth and income distributions are strongly right-skewed, so that example should be taken loosely.
The empirical rule for a normal distribution:
1) About 68% of the data falls within one standard deviation of the mean.
2) About 95% falls within two standard deviations.
3) About 99.7% falls within three standard deviations.
Why the normal distribution matters:
1) Distributions of sample means with large sample sizes can be approximated by a normal distribution.
2) Decisions based on normal-distribution insights have proven to be of good value.
3) Its statistics are mathematically tractable.
4) It approximates a wide variety of random variables.
The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score.
Every normal random variable X can be transformed into a z score via the following equation:
z = (X - μ) / σ
Example: standardize X = [1, 2, 2, 3, 3, 4, 4].
Steps:
1) mean = 2.71
2) standard deviation (sample formula) = 1.11
3) z = (X - 2.71) / 1.11 for each value, giving approximately [-1.54, -0.64, -0.64, 0.26, 0.26, 1.16, 1.16]
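The same standardization in Python (stdev uses the n - 1 sample formula, which reproduces the 1.11 above):

import statistics

X = [1, 2, 2, 3, 3, 4, 4]
mu = statistics.mean(X)   # ≈ 2.71
sd = statistics.stdev(X)  # ≈ 1.11
z = [(x - mu) / sd for x in X]
print([round(v, 2) for v in z])  # [-1.54, -0.64, -0.64, 0.26, 0.26, 1.16, 1.16]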
Let’s take another example and see how the transformation works:
Given a data set, we were provided with the initial data, we transformed the data and found the z score
as shown in the below table.
We can see from the graphs that our normal distribution when transformed has zero mean and
deviation of 1.
z score
The z-score is a measure of position that indicates the number of standard deviations a data value lies from the mean. z-scores may be positive or negative: a positive value indicates the score is above the mean, and a negative value that it is below the mean.
The z-score is a very powerful tool for finding probabilities using the z-score table; with it, we do not need the normal distribution calculator at all.
Z-score table:
let’s see the application of z-score table and see how to use it:
We will use the same example from above where we used the Normal Distribution calculator, let’s try to
calculate the same using z-score table and compare the results.
Q) The Light Bulb Company has found that an average light bulb lasts 900 hours with a standard
deviation of 80 hours. Assuming that bulb life is normally distributed. What is the probability that a
randomly selected light bulb will burn out in 1000 hours or less?
Mean = 900
Std. deviation = 80
x = 1000
z = (1000 - 900) / 80 = 1.25 = 1.2 + 0.05 (in the table we look up the value at the intersection of row 1.2 and column 0.05)
P(Z ≤ 1.25) = 0.8944, i.e. an 89.44% chance the bulb burns out within 1000 hours.
This is exactly what we found with the normal distribution calculator. Thus, once we standardize a normal distribution, the z-score becomes a very important tool for finding probabilities.
Q) Ravi scored 980 in a Physics Olympiad. The mean test score was 870 with a standard deviation of
120. How many students scored more than Ravi? (Assume that test scores are normally distributed.)
Mean = 870
Standard deviation = 120
z = (980 - 870) / 120 = 0.92
From the z-table, P(Z ≤ 0.92) = 0.8212, so P(Z > 0.92) = 1 - 0.8212 = 0.1788.
Thus, we can estimate that 17.88% of students scored more than Ravi in the test.
Note: when you encounter a negative z-score, you can use a negative z-score table, or find the value for the positive z-score and subtract it from 1.
e.g. p(-2.5) = 1 - p(2.5) = 1 - 0.99379 = 0.00621, which is the same as the value you will find in a negative z-score table.
Central Limit Theorem
In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample
means approximates a normal distribution (also known as a “bell curve”), as the sample size becomes
larger, assuming that all samples are identical in size, and regardless of the population distribution
shape.
CLT is a statistical theory stating that given a sufficiently large sample size from a population with a finite
level of variance, the mean of all samples from the same population will be approximately equal to the
mean of the population. Furthermore, all the samples will follow an approximate normal distribution
pattern, with all variances being approximately equal to the variance of the population, divided by each
sample's size. The samples extracted should be bigger than 30 observations.
Let's visualize the CLT with a few datasets of different sample sizes:
x1 = [9, 2, 1]
x2 = [6, 6, 8, 3, 8]
x3 = [5, 3, 6, 4, 7, 2, 6, 9, 7, 1, 1, 7]
x4 = [8, 1, 7, 1, 4, 3, 1, 7, 8, 9, 8, 3, 1, 6, 8, 3, 4]
Plotting the distributions for samples of increasing size, we can clearly see that as the sample size increases, the distribution of sample means moves closer to a normal distribution, as the simulation sketched below also shows.
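A minimal simulation of the idea: sample means drawn from a decidedly non-normal population pile up into a bell shape, and their spread shrinks, as n grows. The population and sample counts here are arbitrary choices for illustration:

import random
import statistics

random.seed(1)
population = [random.randint(1, 9) for _ in range(10_000)]  # uniform digits, not normal

for n in (3, 5, 12, 30):
    # 2000 samples of size n; record each sample's mean.
    means = [statistics.mean(random.choices(population, k=n)) for _ in range(2_000)]
    print(f"n={n:2d}  mean of sample means={statistics.mean(means):.2f}  "
          f"spread={statistics.stdev(means):.2f}")
# The mean of the sample means stays near the population mean,
# while their spread shrinks roughly like sigma/sqrt(n): the CLT in action.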
We will talk more about the central limit theorem and see its uses, but first let's discuss one more important concept.
Standard Error
The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample
population. The standard error is a statistical term that measures the accuracy with which a sample
distribution represents a population by using standard deviation. In statistics, a sample mean deviates
from the actual mean of a population—this deviation is the standard error of the mean.
When a population is sampled, the mean, or average, is generally calculated. The standard error can
include the variation between the calculated mean of the population and one which is considered
known, or accepted as accurate. This helps compensate for any incidental inaccuracies related to the
gathering of the sample.
In cases where multiple samples are collected, the mean of each sample may vary slightly from the
others, creating a spread among the variables. This spread is most often measured as the standard
error, accounting for the differences between the means across the datasets.
The more data points involved in the calculations of the mean, the smaller the standard error tends to
be. When the standard error is small, the data is said to be more representative of the true mean. In
cases where the standard error is large, the data may have some notable irregularities.
The standard deviation is a representation of the spread of each of the data points. The standard
deviation is used to help determine the validity of the data based on the number of data points
displayed at each level of standard deviation. Standard errors function more as a way to determine the
accuracy of the sample or the accuracy of multiple samples by analyzing deviation within the means.
SE = σ / √n
where σ is the standard deviation of the population and n is the sample size; σ/√n is the standard deviation of the sample mean. We can see that as the size of our sample increases, the standard error decreases.
Now, that we know the Standard error, let’s rephrase our Central Limit theorem as:
The central limit theorem states that the sample mean follows approximately the normal distribution
with mean(μ) and standard deviation (σ/√n), where μ and σ are the mean and standard deviation of
the population from where the sample was selected. The sample size n has to be large (usually n≥30)
if the population from where the sample is taken is non normal.
So, when we transform our sample data, we will use following formula for the z-score:
z = (X - μ) / (σ/√n)
Q) Let X be a random variable with μ = 10 and σ = 4. A sample of size 100 is taken from this population. Find the probability that the sample mean of these 100 observations is less than 9.
SE = σ/√n = 4/√100 = 0.4
z = (9 - 10) / 0.4 = -2.5
From the z-score table, P(X̄ < 9) = P(Z < -2.5) = 0.0062
Q) A large freight elevator can transport a maximum of 9800 pounds. Suppose a load of cargo
containing 49 boxes must be transported via the elevator. Experience has shown that the weight of
boxes of this type of cargo follows a distribution with mean= 205 pounds and standard deviation = 15
pounds. Based on this information, what is the probability that all 49 boxes can be safely loaded onto
the freight elevator and transported?
Ans: For all the boxes to be loaded, the total weight must be at most 9800 pounds, i.e. the mean weight of the 49 boxes must be at most 9800/49 = 200 pounds.
Mean = 205, std. deviation = 15, n = 49
SE = 15/√49 = 15/7 ≈ 2.14
z = (200 - 205) / 2.14 ≈ -2.33
P(X̄ ≤ 200) = P(Z ≤ -2.33) ≈ 0.0099, so there is only about a 1% chance that all 49 boxes can be safely loaded and transported.
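Both questions reduce to a z-score for the sample mean; a sketch using NormalDist for the standard normal CDF:

from statistics import NormalDist
from math import sqrt

std_normal = NormalDist()  # mean 0, sd 1

# Q1: mu=10, sigma=4, n=100 -> SE = 4/sqrt(100) = 0.4
z1 = (9 - 10) / (4 / sqrt(100))
print(std_normal.cdf(z1))  # ≈ 0.0062

# Q2: boxes, mu=205, sigma=15, n=49 -> sample mean must be <= 9800/49 = 200
z2 = (200 - 205) / (15 / sqrt(49))
print(std_normal.cdf(z2))  # ≈ 0.0098: safe loading only ~1% of the time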
Bernoulli Distribution
It is a type of Discrete Probability distribution. The Bernoulli distribution essentially models a single trial
of flipping a weighted coin. It is the probability distribution of a random variable taking on only two
values, 1 ("success") and 0 ("failure"), with complementary probabilities p and 1 - p respectively. The Bernoulli distribution therefore describes events with exactly two outcomes, which are common in real life.
Suppose we have a single trial with only two possible outcomes, success or failure:
P(success) = p
P(failure) = 1 - p
The probability mass function is P(X = x) = p^x (1 - p)^(1-x) for x ∈ {0, 1}.
The expected value (mean) of the Bernoulli distribution is E[X] = p, and its variance is p(1 - p).
We will see more examples when the Bernoulli trial is repeated many times.
Binomial Distribution
A binomial experiment is a series of n Bernoulli trials, whose outcomes are independent of each other. A
random variable, X, is defined as the number of successes in a binomial experiment.
For example, consider a fair coin. Flipping the coin once is a Bernoulli trial, since there are exactly two
complementary outcomes (flipping a head and flipping a tail), and they are both 1/2 no matter how
many times the coin is flipped. Note that the fact that the coin is fair is not necessary; flipping a
weighted coin is still a Bernoulli trial.
A binomial experiment might consist of flipping the coin 100 times, with the resulting number of heads
being represented by the random variable X. The binomial distribution of this experiment is the
probability distribution of X.
If X is the number of successes in a binomial experiment with n independent trials, with probability of success p and probability of failure 1 - p, then the probability of exactly k successes is
P(X = k) = C(n, k) · p^k · (1 - p)^(n-k)
where C(n, k) = n! / (k!(n - k)!) is the number of ways to choose which k of the n trials are successes.
Q) Let's flip a coin 6 times, with the probability of getting a tail being 0.3 and "tail" counted as a success. What is the probability of getting exactly 2 tails?
Ans:
P(X = 2) = C(6, 2) · (0.3)² · (0.7)⁴ = 15 × 0.09 × 0.2401 ≈ 0.324
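A quick check of that value with the binomial formula in Python (math.comb is available from Python 3.8):

from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1-p)^(n-k)"""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_pmf(2, 6, 0.3), 3))  # 0.324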
The mean and variance of a binomial distribution are:
Mean = n·p
Variance = n·p·(1 - p)
where n is the number of trials, p is the probability of success and 1 - p is the probability of failure.
Poisson Distribution
The Poisson distribution is the discrete probability distribution of the number of events occurring in a
given time period, given the average number of times the event occurs over that time period.
Example
A certain car wash shop gets an average of 3 visitors to the center per hour. This is just an average,
however. The actual amount can vary.
A Poisson distribution can be used to analyze the probability of various events regarding how many
customers go to the center. It can allow one to calculate the probability of a dull activity (when there are
0 customers coming) as well as the probability of a high activity (when there are 5 or more customers
coming). This information can, in turn, help the owner to plan for these events with staffing and
scheduling.
If X is the number of events observed over a given time period, and λ is the average number of events over that period, then the probability of observing k events is
P(X = k) = λ^k e^(-λ) / k!
The Poisson distribution is often used as an approximation for binomial probabilities when n is large and p is small (with λ = np).
Q) In a coffee shop, the average number of customers per hour is 2. Find the probability of getting k customers in an hour.
Using the formula with λ = 2 for k = 0, 1, 2, ..., the probability peaks around k = 1 and 2 and becomes negligible beyond about 6 customers.
Q) Suppose the average number of elephants seen on a 1-day safari is 6. What is the probability that
tourists will see fewer than 4 elephants on the next 1-day safari?
Solution: with λ = 6,
P(X < 4) = P(X=0) + P(X=1) + P(X=2) + P(X=3)
= 0.0025 + 0.0149 + 0.0446 + 0.0892
= 0.1512
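The same sum computed from the Poisson pmf in Python; a small sketch:

from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

print(sum(poisson_pmf(k, 6) for k in range(4)))  # ≈ 0.1512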
Hypothesis
In our daily life we often hear statements like "Dhoni is a better captain than his contemporaries", or a motorcycle company claiming that a certain model gives an average mileage of 100 km per liter, or a toothpaste company claiming to be the number one brand suggested by dentists.
Suppose you must purchase a motorcycle and you heard the above claim made by the motorcycle company. Would you just go and buy it, or would you look for proof? There must be a parameter on which to judge the correctness of the statement. In this case the parameter is the average mileage, which you can use to check whether the statement is true or just a hoax.
A hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be testable,
either by experiment or observation.
For instance, if we make the statement "Dhoni is the best Indian captain ever", it is an assumption we are making based on the average wins and losses the team had under his captaincy. We can test this statement using all the match data.
When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and if it
specifies a range of values then it is called a composite hypothesis.
e.g., a motorcycle company claiming that a certain model gives an average mileage of exactly 100 km per liter is a simple hypothesis.
"The average age of students in a class is greater than 20" is a composite hypothesis.
Null Hypothesis
The null hypothesis is the hypothesis to be tested for possible rejection, under the assumption that it is true. The concept is like "innocent until proven guilty": we assume innocence until we have enough evidence to prove a suspect guilty.
It is denoted by H0.
Alternate Hypothesis
The alternative hypothesis complements the Null hypothesis. It is opposite of the null hypothesis
such that both Alternate and null hypothesis together cover all the possible values of the population
parameter.
It is denoted by H1.
A soap company claims that its product kills, on average, 99% of germs. To test the claim we formulate the null and alternate hypotheses:
H0: the average kill rate is 99% (μ = 99)
H1: the average kill rate is not 99% (μ ≠ 99)
Note: the rule of thumb is that the statement containing equality is the null hypothesis.
Hypothesis Testing
When we test a hypothesis, we assume the null hypothesis to be true until there is sufficient
evidence in the sample to prove it false. In that case we reject the null hypothesis and support the
alternate hypothesis.
If the sample fails to provide sufficient evidence for us to reject the null hypothesis, we cannot say
that the null hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.
If the alternate hypothesis gives the alternate in both directions (less than and greater than) of the
value of the parameter specified in null hypothesis, it is called Two tailed test.
If the alternate hypothesis gives the alternate in only one direction (either less than or greater than)
of the value of the parameter specified in null hypothesis, it is called One tailed test.
For example, with H0: μ = 100 and H1: μ ≠ 100, the mean under H1 can be greater than or less than 100, so this is a two-tailed test.
The critical region is that region in the sample space in which if the calculated value lies then we
reject the null hypothesis.
Suppose you are looking to rent an apartment. You list all the available apartments from different real-estate websites. You have a budget of Rs. 15,000/month and cannot spend more than that. The list of apartments has prices ranging from Rs. 7,000/month to Rs. 30,000/month. You select a random apartment from the list and set up the hypotheses:
H0: price ≤ 15,000 (the apartment is within budget)
H1: price > 15,000 (the apartment is out of budget)
Since your budget is 15,000, you must reject all the apartments above that price. All prices greater than 15,000 form your critical region: if the random apartment's price lies in this region you reject your null hypothesis, and if it doesn't, you do not reject it.
The critical region lies in one tail or two tails on the probability distribution curve according to the
alternative hypothesis. Critical region is a pre-defined area corresponding to a cut off value in
probability distribution curve. It is denoted by α.
Critical values are values separating the values that support or reject the null hypothesis and are
calculated based on alpha.
We will see more examples later and it will be clear how do we choose α.
Based on the alternative hypothesis, three cases of critical region arise: left-tailed (H1: parameter < value), right-tailed (H1: parameter > value), and two-tailed (H1: parameter ≠ value).
A false positive (type I error) — when you reject a true null hypothesis.
A false negative (type II error) — when you accept a false null hypothesis.
The probability of committing Type I error (False positive) is equal to the significance level or size of
critical region α.
The probability of committing a Type II error (false negative) is denoted by β; the quantity 1 - β is called the 'power of the test'.
Example:
A person is arrested on a charge of burglary. A jury has to decide guilty or not guilty, with H0: the person is innocent.
A Type I error occurs if the jury convicts the person [rejects H0] although the person is innocent [H0 is true].
A Type II error occurs if the jury releases the person [does not reject H0] although the person is guilty [H1 is true].
Level of Significance(α) :
It is the probability of type 1 error. It is also the size of the critical region.
Generally a strong control on α is desired, and in tests it is fixed in advance at very low levels such as 0.05 (5%) or 0.01 (1%).
If H0 is not rejected at a significance level of 5%, we can say the data are consistent with the null hypothesis at the 95% confidence level (this does not prove H0 true).
p-value
Suppose H0: mean < X (assume a one-tailed test for this scenario). We obtain our critical value (based on the type of test we are using) and find that our test statistic is greater than the critical value, so it lies in the rejection region and we must reject the null hypothesis. Now, if the null hypothesis is rejected at 1%, it will certainly also be rejected at higher significance levels, say 5% or 10%.
But what if we had chosen a significance level lower than 1%; would we still have to reject? There may well be levels at which we would not, and this is where the p-value comes into play.
The p-value is the smallest level of significance at which the null hypothesis can be rejected (we reject H0 when p < alpha).
That's why many tests nowadays report a p-value; it is preferred since it carries more information than a comparison against a single critical value. The p-value is compared to the significance level (alpha) when deciding on the null hypothesis.
Confidence Intervals
A confidence interval, in statistics, refers to the probability that a population parameter will fall
between two set values. Confidence intervals measure the degree of uncertainty or certainty in a
sampling method. A confidence interval can take any number of probabilities, with the most
common being a 95% or 99% confidence level.
Suppose a group of researchers is studying the heights of high school basketball players. The
researchers take a random sample from the population and establish a mean height of 74 inches.
The mean of 74 inches is a point estimate of the population mean. A point estimate by itself is of
limited usefulness because it does not reveal the uncertainty associated with the estimate; you do
not have a good sense of how far away this 74-inch sample mean might be from the population
mean. What's missing is the degree of uncertainty in this single sample.
Confidence intervals provide more information than point estimates. By establishing a 95%
confidence interval using the sample's mean and standard deviation, and assuming a normal
distribution as represented by the bell curve, the researchers arrive at an upper and lower bound
that contains the true mean 95% of the time. Assume the interval is between 72 inches and 76
inches. If the researchers take 100 random samples from the population of high school basketball
players, the mean should fall between 72 and 76 inches in 95 of those samples.
If the researchers want even greater confidence, they can expand the interval to 99% confidence.
Doing so invariably creates a broader range, as it makes room for a greater number of sample
means. If they establish the 99% confidence interval as being between 70 inches and 78 inches, they
can expect 99 of 100 samples evaluated to contain a mean value between these numbers. Likewise, a 90% confidence level means that we would expect 90% of the interval estimates to include the population parameter, and a 99% confidence level means that 99% of the intervals would include the parameter.
The confidence interval is based on the mean and standard deviation:
For n > 30: CI = x̄ ± z* · (σ/√n), where the z critical value z* is derived from the z-score table based on the confidence level. Since the confidence levels are usually fixed at standard values (90%, 95%, 99%), a small table of critical values suffices.
For n < 30: CI = x̄ ± t* · (s/√n), where the t critical value t* is derived from the t-score table based on the confidence level and n - 1 degrees of freedom.
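A sketch of the large-sample interval in Python; NormalDist().inv_cdf plays the role of the z-table lookup. The mean, standard deviation and sample size below are hypothetical numbers for illustration:

from statistics import NormalDist
from math import sqrt

def z_confidence_interval(mean, sd, n, level=0.95):
    # Two-tailed z critical value, e.g. 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf((1 + level) / 2)
    margin = z_crit * sd / sqrt(n)
    return mean - margin, mean + margin

# Hypothetical: 100 players, sample mean 74 inches, sample sd 4 inches.
print(z_confidence_interval(74, 4, 100))  # ≈ (73.2, 74.8)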
Now that we have got all the theory behind Hypothesis testing, let’s see different types of tests
that are used for testing. We have already seen examples on finding z-score and t-score, we will
see how they are used in the testing scenario.
General points for selecting the type of test:

Sample size | Population variance | Normality of sample | Test
Large (>30) | Known | Normal or non-normal | z-test
Large (>30) | Unknown | Normal | t-test (use the sample variance to calculate the t-score)
Large (>30) | Unknown | Unknown | z-test (use the sample variance to calculate the z-score)
Small (<30) | Known | Normal | z-test
Small (<30) | Unknown | Normal | t-test (use the sample variance to calculate the t-score)
Note: We will learn about other non-parametric tests and their cases later.
Thumb rule: A sample of size greater than 30 is considered a large sample and as per central limit
theorem we will assume that all sampling distributions follow a normal distribution.
We are familiar with the steps of hypothesis testing as shown earlier. We also know, from the above
table, when to use which type of test.
Let’s start with a few practical examples to help our understanding more.
Note: We have learned in the previous section how to use the z-score table to calculate
probabilities, in this section we have some standard Significance level for which we need to find the
critical value(z-score). So instead of going through the whole table, we will just use the below
standardized critical value table for calculation purposes.
Q) A manufacturer of printer cartridges claims that a certain cartridge manufactured by him has a mean printing capacity of at least 500 pages. A wholesale purchaser selects a sample of 100 cartridges and tests them. The mean printing capacity of the sample came out to be 490 pages with a standard deviation of 30 pages. Should the purchaser reject the claim of the manufacturer at a significance level of 5%?
H0: μ ≥ 500 (the manufacturer's claim); H1: μ < 500 (a one-tailed test)
SE = s/√n = 30/√100 = 3
z = (490 - 500)/3 = -3.33
Let's find the critical value at the 5% significance level using the critical value table: z(0.05) = -1.645 for a left-tailed test.
We can clearly see that z(test) < z(0.05), which means our test value lies in the rejection region. Thus we reject the null hypothesis, i.e. the manufacturer's claim, at the 5% significance level.
p-value = P[Z ≤ -3.33] = 1 - P[Z ≤ 3.33] = 1 - 0.9996 ≈ 0.0004 (recall p(-x) = 1 - p(x), where p(x) is the cumulative probability from the z-table)
The p-value is far below the significance level of 5%, so we are right to reject the null hypothesis.
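The same test end-to-end in Python; NormalDist().cdf replaces the z-table lookup for the p-value:

from statistics import NormalDist
from math import sqrt

mu0, x_bar, s, n = 500, 490, 30, 100
se = s / sqrt(n)               # 3.0
z = (x_bar - mu0) / se         # ≈ -3.33
p_value = NormalDist().cdf(z)  # left-tailed: P(Z <= -3.33) ≈ 0.0004
print(z, p_value, p_value < 0.05)  # True -> reject H0 at the 5% level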
Q) A company used a specific brand of tube lights in the past, with an average life of 1000 hours. A new brand has approached the company with tube lights of the same power at a lower price. A sample of 120 of the new lights was tested, yielding an average life of 1010 hours with a standard deviation of 90 hours. Should the company give the contract to this new brand at a 1% significance level?
Here the sample is large with an unknown population variance. Since we don't know about the normality of the data, we use the z-test (from the table above).
H0: μ = 1000 (the new lights last as long as the old ones); H1: μ ≠ 1000 (a two-tailed test)
SE = 90/√120 ≈ 8.22
z = (1010 - 1000)/8.22 = 1.22
The critical value at the 1% significance level (two-tailed) is ±2.575. Since |1.22| < 2.575, the test value does not lie in the rejection region.
Thus we cannot reject the null hypothesis: there is no evidence the new lights last any less long, so the company can give the contract at the 1% significance level.
p-value = 2 · P[Z > 1.22] = 2 × (1 - 0.888) ≈ 0.22
The p-value is greater than the significance level of 1%, so we do not reject the null hypothesis.
Two-sample z-test
The comparison of two population means is very common. The difference between two samples depends on both the means and the standard deviations: very different means can occur by chance if there is great variation among the individual samples. To account for the variation, we take the difference of the sample means, X̄1 - X̄2, and divide by the standard error to standardize the difference:
SE = √(σ1²/n1 + σ2²/n2)
Because we usually do not know the population standard deviations, we estimate them using the two sample standard deviations from our independent samples, and use this estimated standard error in the hypothesis test.
Q) In two samples of men from two different states A and B, the mean heights of 1000 men and 2000 men respectively are 76.5 and 77 inches. If the population standard deviation for both states is the same, 7 inches, can the mean heights of both states be regarded as equal at the 5% level of significance?
Ans. n1 = 1000, n2 = 2000
X̄1 = 76.5, X̄2 = 77
σ1 = σ2 = 7
H0: μ1 = μ2 (the mean heights of men from states A and B are equal); H1: μ1 ≠ μ2
SE = 7 × √(1/1000 + 1/2000) ≈ 0.271
z = (76.5 - 77)/0.271 ≈ -1.84
Since it is a two-tailed test, we need the critical value for 2.5% in each tail: ±1.96. As |-1.84| < 1.96, the test value does not lie in the rejection region.
p-value = 2 × P[Z > 1.84] = 2 × 0.0329 ≈ 0.066, which is greater than 0.05; thus we cannot reject the null hypothesis.
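The height comparison as a short Python sketch:

from statistics import NormalDist
from math import sqrt

x1, x2, sigma, n1, n2 = 76.5, 77.0, 7, 1000, 2000
se = sigma * sqrt(1 / n1 + 1 / n2)       # ≈ 0.271 (same sigma for both states)
z = (x1 - x2) / se                       # ≈ -1.84
p_value = 2 * NormalDist().cdf(-abs(z))  # two-tailed, ≈ 0.066
print(z, p_value)  # |z| < 1.96 and p > 0.05 -> do not reject H0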
t-tests
In real-world scenarios, large sample sizes are often not possible because of limited resources such as money. We then do hypothesis testing based on small samples, the only extra assumption being the normality of the sample data.
We will see how to use t-tests in this section and how to use the t-score table (continuing from the topic of Student's t distribution). All the steps are similar to the z-test; we simply calculate a t-score instead of a z-score.
Q) A tyre manufacturer claims that the average life of a particular category of its tyres is 18000 km under normal driving conditions. A random sample of 16 tyres was tested; the mean and SD of life in the sample were 20000 km and 6000 km respectively. Assuming that the life of the tyres is normally distributed, test the claim of the manufacturer at the 1% level of significance, and construct the confidence interval.
Ans: population mean = 18000 km
Sample size = 16
H0: population mean = 18000 km
H1: population mean is not equal to 18000 km (a two-tailed test)
Since the sample size is small, the population variance is unknown and the sample is normally distributed, we use the t-test.
SE = s/√n = 6000/√16 = 1500
t = (20000 - 18000)/1500 = 1.33
Let's find the critical t-value for significance level 1% (two-tailed) and degrees of freedom = 16 - 1 = 15: t_crit = 2.947.
Since |1.33| < 2.947, the value lies in the non-rejection region and we cannot reject our null hypothesis.
p-value = P[|t| > 1.33], degrees of freedom = 15
From the t-table: 0.20 < p < 0.30
Here p > the significance level (1%), thus we cannot reject the null hypothesis.
99% confidence interval = 20000 ± 2.947 × 1500 ≈ [15580, 24420]
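A sketch of the tyre example in Python; the standard library has no t-distribution, so the critical value 2.947 (1% two-tailed, df = 15) is taken from the t-table as in the text:

from math import sqrt

mu0, x_bar, s, n = 18000, 20000, 6000, 16
se = s / sqrt(n)          # 1500
t = (x_bar - mu0) / se    # ≈ 1.33
t_crit = 2.947            # from the t-table, 1% two-tailed, df = n - 1 = 15
print(t, abs(t) < t_crit)             # True -> cannot reject H0
print(x_bar - t_crit * se, x_bar + t_crit * se)  # 99% confidence interval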
Two-sample t-test
Just as with the z-test, the t-test is actually better suited to comparing two population samples, because in practice the population standard deviations are not usually known. We assume the samples are normally distributed, and though the population standard deviations are unknown, we assume them to be equal.
Degrees of freedom = n1 + n2 - 2
Standard error: SE = Sp × √(1/n1 + 1/n2), where the pooled variance is Sp² = ((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2)
Ans. (for the worked example comparing samples of size 10 and 8; the question data is not reproduced in these notes)
Degrees of freedom = 10 + 8 - 2 = 16
Looking up the critical value in the t-table for significance 5% (two-tailed) and d.o.f. 16: t_crit = 2.120.
The computed value lies in the non-rejection region and we cannot reject our null hypothesis.
Paired Sample t-Tests
A paired t-test is used to compare two population means where you have two samples which are not
independent e.g. Observations recorded on a patient before and after taking medicine, weight of a
person before and after they started working out etc.
Now, instead of two separate populations, we create a new column with the differences of the paired observations, and instead of testing equality of two population means we test the hypothesis that the mean of the population of differences is zero. The two samples are of the same size by construction; the population variances are not known and need not be equal.
Q) A group of 20 students was tested to see whether their marks improved after a special lecture on the subject.
Marks before | Marks after | Difference (D) | (D - D̄)²
18 | 22 | 4 | 3.24
21 | 25 | 4 | 3.24
16 | 17 | 1 | 1.44
22 | 24 | 2 | 0.04
19 | 15 | -4 | 38.44
24 | 26 | 2 | 0.04
17 | 20 | 3 | 0.64
21 | 23 | 2 | 0.04
13 | 18 | 5 | 7.84
18 | 20 | 2 | 0.04
15 | 15 | 0 | 4.84
16 | 15 | -1 | 10.24
18 | 21 | 3 | 0.64
14 | 16 | 2 | 0.04
19 | 22 | 3 | 0.64
20 | 24 | 4 | 3.24
12 | 18 | 6 | 14.44
22 | 25 | 3 | 0.64
14 | 18 | 4 | 3.24
19 | 18 | -1 | 10.24
Sum | | 44 | 103.2

Mean difference D̄ = 44/20 = 2.2
Variance of D (÷ n-1) = 103.2/19 ≈ 5.43
Standard deviation of D ≈ 2.33
H0: mean difference ≥ 0 (marks improved or stayed the same on average); H1: mean difference < 0
df (degrees of freedom) = 19
t = D̄ / (sD/√n) = 2.2 / (2.33/√20) ≈ 4.22
At the 5% significance level, 19 df and a one-tailed (left-tail) test, the critical value is t(5%) = -1.729.
Since t = 4.22 is greater than the critical t, it lies in the non-rejection region and hence we cannot reject the null hypothesis that the marks improved.
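The paired test computed directly from the two mark columns; a sketch:

import statistics
from math import sqrt

before = [18, 21, 16, 22, 19, 24, 17, 21, 13, 18, 15, 16, 18, 14, 19, 20, 12, 22, 14, 19]
after  = [22, 25, 17, 24, 15, 26, 20, 23, 18, 20, 15, 15, 21, 16, 22, 24, 18, 25, 18, 18]

d = [a - b for a, b in zip(after, before)]  # per-student differences
d_mean = statistics.mean(d)                 # 2.2
d_sd = statistics.stdev(d)                  # ≈ 2.33 (n-1 formula)
t = d_mean / (d_sd / sqrt(len(d)))          # ≈ 4.22
print(d_mean, d_sd, t)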
Till now we have been dealing with hypothesis tests for the means of various samples, but
sometimes it is also necessary to test the variance of the population under study. For example,
suppose we obtain a sample variance that differs from the known population variance; we then
need to find out whether the difference is within acceptable limits or whether the variability is
greater than desired.
Chi-Square Test for Variance
The chi-square test for variance is a statistical procedure with a chi-square-distributed test statistic
that is used to determine whether the variance of a variable obtained from a particular sample
differs significantly from the known population variance of the same variable.
The test statistic of the chi-square test for variance is calculated as follows:
χ² = (n − 1)s² / σ², with n − 1 degrees of freedom, where s² is the sample variance and σ² the
known population variance.
As similar with other tests, the critical value is obtained through a chi table on the basis of degree of
freedom and significance level.
Q) The variance of a certain size of towel produced by a machine is 7.2 over a long period of time.
A random sample of 20 towels gave a variance of 8. You need to check whether the variability of
the towels has increased at the 5% level of significance, assuming a normally distributed population.
Ans.
n = 20
sample variance = 8, population variance = 7.2
χ² = (n − 1)s²/σ² = 19 × 8 / 7.2 ≈ 21.11
Critical χ² for 5% significance (upper tail) and 19 degrees of freedom = 30.14
Here, the chi value is less than the critical value, thus we do not reject the null hypothesis.
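A minimal sketch of the same test in Python, using only the summary numbers from the question:

from scipy import stats

sigma0_sq, s_sq, n = 7.2, 8.0, 20              # population variance, sample variance, size
chi_stat = (n - 1) * s_sq / sigma0_sq          # = 19 * 8 / 7.2 ~ 21.11
chi_crit = stats.chi2.ppf(1 - 0.05, df=n - 1)  # 5% upper-tail critical value ~ 30.14

print(chi_stat, chi_crit)                      # 21.11 < 30.14 -> do not reject H0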
The chi-square test is widely used to estimate how closely the distribution of a categorical variable
matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical
variables are independent of one another (the test of independence).
In mathematical terms, the χ² variable is the sum of the squares of a set of independent, standard
normally distributed variables.
Suppose that a particular value Z1 is randomly selected from a standardized normal distribution.
Then suppose another value Z2 is selected from the same standardized normal distribution. If there
are d degrees of freedom, then let this process continue until d different Z values are selected from
this distribution. The χ2 variable is defined as the sum of the squares of these Z values.
This sum of squares of d normally distributed variables has a distribution which is called
the χ² distribution with d degrees of freedom.
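This definition is easy to check numerically; here is a small simulation sketch (assuming numpy is available) that builds chi-square variables from squared standard normal draws:

import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # degrees of freedom
z = rng.standard_normal((100_000, d))    # 100,000 sets of d standard normal values
chi_sq_samples = (z ** 2).sum(axis=1)    # each row sums to one chi-square variable

print(chi_sq_samples.mean())   # ~5: the mean of a chi-square distribution equals d
print(chi_sq_samples.var())    # ~10: its variance equals 2d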
Chi-Square Test for Goodness of Fit
The chi-square test for goodness of fit is used to decide whether there is any significant difference
between the observed (experimental) values and the expected (theoretical) values.
A goodness of fit test is a test that is concerned with the distribution of one categorical variable.
H0: The population distribution of the variable is the same as the proposed distribution.
χ² = Σ (Observed − Expected)² / Expected
where Expected = the predicted (expected) counts in each category if the null hypothesis were true.
Q) A survey conducted by a Pet Food Company determined that 60% of dog owners have only one
dog, 28% have two dogs, and 12% have three or more. You were not convinced by the survey and
decided to conduct your own survey and have collected the data below,
Data: Out of 129 dog owners, 73 had one dog, 38 had two dogs, and the remaining 18 had three or
more.
Determine whether your data supports the results of the survey by the pet food company.
Expected counts under H0: 0.60 × 129 = 77.4, 0.28 × 129 = 36.12, 0.12 × 129 = 15.48
χ² = (73 − 77.4)²/77.4 + (38 − 36.12)²/36.12 + (18 − 15.48)²/15.48 ≈ 0.25 + 0.10 + 0.41 = 0.76
Let's see the critical value using d.o.f. 2 and significance 5%: χ²-critical = 5.991
Here, our chi statistic is less than the critical chi. Thus, we will not reject the null hypothesis.
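The same goodness-of-fit calculation can be verified with scipy's chisquare function:

from scipy import stats

observed = [73, 38, 18]                          # one dog, two dogs, three or more
expected = [0.60 * 129, 0.28 * 129, 0.12 * 129]  # 77.4, 36.12, 15.48 under H0

chi_stat, p_value = stats.chisquare(observed, f_exp=expected)
print(chi_stat, p_value)   # ~0.76, p ~ 0.68 -> do not reject H0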
ANOVA
Analysis of variance (ANOVA) is a statistical technique used to check whether the means of two or
more groups are significantly different from each other by analyzing and comparing variance
estimates. ANOVA checks the impact of one or more factors by comparing the means of different
samples.
When we have only two samples, t-test and ANOVA give the same results. However, using a t-test
would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests
for comparing more than two samples, it will have a compounded effect on the type 1 error.
Assumptions in ANOVA
1) Assumption of Randomness: The samples should be selected in a random way such that
there is no dependence among the samples.
2) The experimental errors of the data are normally distributed.
3) Assumption of equality of variance (homoscedasticity) and zero correlation: the variance
should be constant across all the groups and the covariances among them zero, although
the means may vary from group to group.
One Way ANOVA
When we are comparing groups based on only one factor variable, it is said to be a one-way
analysis of variance (ANOVA).
For example, if we want to compare whether or not the mean output of three workers is the same
based on the working hours of the three workers.
xij = μi + εij
where x are the individual data points (i and j denote the group and the individual observation), ε is
the unexplained variation and the parameters of the model (μ) are the population means of each
group. Thus, each data point (xij) is its group mean plus error.
Sample(k)   Obs 1   Obs 2   Obs 3   Mean
1           x11     x12     x13     Xm1
2           x21     x22     x23     Xm2
3           x31     x32     x33     Xm3
4           x41     x42     x43     Xm4
Suppose we are given the above data set: we have 4 samples, each with 3 observations of the
variable x, and each sample has its respective mean, as shown in the last column.
Grand Mean
Mean is a simple or arithmetic average of a range of values. There are two kinds of means that we
use in ANOVA calculations, which are separate sample means and the grand mean.
The grand mean (Xgm) is the mean of the sample means, or equivalently the mean of all
observations combined, irrespective of the sample (the two coincide when all samples have the
same size).
Xgm = (Xm1 + Xm2 + Xm3 + Xm4 + … + Xmk)/k, where k is the number of samples.
Between-Group Variability
It refers to the variation between the distributions of the individual groups (or levels), as the values
within each group differ from those in other groups.
Each sample is looked at and the difference between its mean and the grand mean is calculated to
measure this variability. If the distributions overlap or are close, the grand mean will be similar to
the individual means, whereas if the distributions are far apart, the differences between the sample
means and the grand mean will be large.
SSbetween = Σ ni(Xmi − Xgm)², summed over the k samples
MeanSSbetween = SSbetween / (k − 1)
Within-Group Variability
It refers to the variation caused by differences within the individual groups (or levels), as not all the
values within a group are the same. Each sample is looked at on its own and the variability between
the individual points in the sample is calculated. In other words, no interactions between samples
are considered.
We can measure within-group variability by looking at how much each value in each sample differs
from its respective sample mean. So, first, we take the squared deviation of each value from its
respective sample mean and add them up. This is the sum of squares for within-group variability
(SSwithin).
MeanSSwithin (MSSE) = SSwithin / (N − k), where N is the total number of observations; for the
table above this is 12 − 4 = 8.
The Null hypothesis in ANOVA is valid when all the sample means are equal, or they don’t have any
significant difference. Thus, they can be considered as a part of a larger set of the population. On the
other hand, the alternate hypothesis is valid when at least one of the sample means is different from
the rest of the sample means. In mathematical form, they can be represented as:
H0: μ1 = μ2 = … = μk
H1: μl ≠ μm for at least one pair (l, m)
where μl and μm are any two of the sample means considered in the test. In other words, the null
hypothesis states that all the sample means are equal, or the factor did not have any significant
effect on the results, whereas the alternate hypothesis states that at least one of the sample means
is different from another.
F-Statistic
The statistic which measures whether the means of different samples are significantly different or
not is called the F-ratio. The lower the F-ratio, the more similar the sample means are; in that case,
we cannot reject the null hypothesis.
F = MeanSSbetween / MeanSSwithin
The formula is intuitive. The numerator of the F-statistic measures the between-group variability.
As we read earlier, as between-group variability increases, the sample means grow further apart
from each other; in other words, the samples become more likely to belong to entirely different
populations.
This F-statistic calculated here is compared with the F-critical value for making a conclusion.
F-critical is calculated using the F-table, degree of freedoms and Significance level.
If the observed value of F is greater than the F-critical value then we reject the null hypothesis.
Q) The marks of students from four schools are given below. Test at the 5% level of significance
whether the mean marks of the four schools differ.
School      Marks
School 1    8 6 7 5 9
School 2    6 4 6 5 6 7
School 3    6 5 5 6 7 8 5
School 4    5 6 6 7 6 7
Ans:
k = 4
N = 24
Sample means: 7, 5.67, 6, 6.17; grand mean = 148/24 ≈ 6.17
SSbetween = 5(7 − 6.17)² + 6(5.67 − 6.17)² + 7(6 − 6.17)² + 6(6.17 − 6.17)² ≈ 5.17, so
MeanSSbetween = 5.17/3 ≈ 1.72
SSwithin = 10 + 5.33 + 8 + 2.83 ≈ 26.17, so MeanSSwithin = 26.17/20 ≈ 1.31
F = 1.72/1.31 ≈ 1.32
F-critical (5%, df 3 and 20) = 3.098
Clearly, our F-statistic is less than F-critical. So, we cannot reject our null hypothesis.
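The entire table can be fed to scipy's one-way ANOVA, which reproduces the hand calculation:

from scipy import stats

school1 = [8, 6, 7, 5, 9]
school2 = [6, 4, 6, 5, 6, 7]
school3 = [6, 5, 5, 6, 7, 8, 5]
school4 = [5, 6, 6, 7, 6, 7]

f_stat, p_value = stats.f_oneway(school1, school2, school3, school4)
f_crit = stats.f.ppf(1 - 0.05, dfn=3, dfd=20)   # ~3.098

print(f_stat, p_value, f_crit)   # F ~ 1.32 < 3.098 -> cannot reject H0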
Two Way ANOVA
Two-way ANOVA allows us to compare population means when the populations are classified
according to two independent factors.
Example: We might like to look at SAT scores of students who are male or female (first factor) and
either have or have not had a preparatory course (second factor).
xij = μ + αi + βj + εij
where xij are the individual data points (i and j denote the levels of the two factors), αi and βj are
the effects of the two factors, and ε is the unexplained variation. Thus, each data point is the overall
mean plus the two factor effects plus error.
Just like the one-way model, we calculate the sums of squares between groups; in this case there
will be two SSTs, one for each factor, plus the sum of squares of errors (within).
We then calculate an F-statistic for each MSST and compare each with the F-critical value to
determine the effect of each factor on our outcome.
Example:
Below is the data of the yield of a crop at three temperature levels and three salinity levels.
Calculate the ANOVA for the table (the 3×3 table below is implied by the sums of squares used in
the answer).
Temperature↓ / Salinity→   S1   S2   S3
T1                          3    5    4
T2                         11   10   12
T3                         16   21   17
Ans:
Grand mean = 11
N = 9, k = 3, nt = 3, ns = 3
Temperature (row) means: 4, 11, 18, so SSbetween_temp = 3[(4 − 11)² + (11 − 11)² + (18 − 11)²]
= 294 and MSSTtemp = 294/2 = 147
Salinity (column) means: 10, 12, 11, so SSbetween_salinity = 3[(10 − 11)² + (12 − 11)² + (11 − 11)²]
= 6 and MSSTsalinity = 6/2 = 3
In such questions, calculating SSE directly can be tricky, so instead of computing SSE term by term,
let's calculate TSS first; we can then subtract the SST values from it to get SSE.
To calculate Total sum of squares, we need to find sum of the squares of difference of each value
from the grand mean.
TSS = (3 − 11)² + (5 − 11)² + (4 − 11)² + (11 − 11)² + (10 − 11)² + (12 − 11)² + (16 − 11)² +
(21 − 11)² + (17 − 11)²
TSS = 312
SSE = TSS − SSbetween_temp − SSbetween_salinity = 312 − 294 − 6 = 12
MSSE = SSE/4 = 3
Ftemp = MSSTtemp/MSSE = 147/3 = 49 and Fsalinity = MSSTsalinity/MSSE = 3/3 = 1
F-critical for 5% significance and degrees of freedom (k − 1, (p − 1)(q − 1)) i.e. (2, 4):
F-critical = 6.94
Clearly, we can see that Ftemp is greater than F-critical, so we reject the null hypothesis and support
that temperature has a significant effect on yield.
On the other hand, Fsalinity is less than the F-critical value, so we do not reject the null hypothesis and
support that salinity doesn’t affect the yield.
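Here is a sketch of the same two-way calculation with numpy, using the reconstructed yield table from the answer above (rows = temperature levels, columns = salinity levels):

import numpy as np
from scipy import stats

yields = np.array([[ 3.0,  5.0,  4.0],
                   [11.0, 10.0, 12.0],
                   [16.0, 21.0, 17.0]])

grand = yields.mean()                                       # 11
ss_temp = 3 * ((yields.mean(axis=1) - grand) ** 2).sum()    # 294 (between rows)
ss_sal  = 3 * ((yields.mean(axis=0) - grand) ** 2).sum()    # 6   (between columns)
tss     = ((yields - grand) ** 2).sum()                     # 312
sse     = tss - ss_temp - ss_sal                            # 12

ms_temp, ms_sal, mse = ss_temp / 2, ss_sal / 2, sse / 4     # 147, 3, 3
f_temp, f_sal = ms_temp / mse, ms_sal / mse                 # 49 and 1
f_crit = stats.f.ppf(1 - 0.05, dfn=2, dfd=4)                # ~6.94

print(f_temp, f_sal, f_crit)   # temperature is significant, salinity is not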
Bayes' Theorem
Bayes' theorem applies when the following hold:
• The sample space is divided (partitioned) into a set of events { A1, A2, …, An }.
• An event B exists within the sample space, for which P(B) > 0.
• The goal is to compute a conditional probability of the form P( Ak | B ).
• One of the following two sets of probabilities is known:
- P( Ak ∩ B ) for each Ak
- P( Ak ) and P( B | Ak ) for each Ak
Bayes' theorem then states:
P( Ak | B ) = P( Ak ∩ B ) / [ P( A1 ∩ B ) + P( A2 ∩ B ) + … + P( An ∩ B ) ]
where each term can be computed as P( Ak ∩ B ) = P( Ak ) P( B | Ak ).
Example:
Problem
There is a marriage ceremony in the desert and Marie is getting married tomorrow. In past years, it
has rained only five days a year. The weatherman has forecast rain for tomorrow. When it actually
rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly
forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding day?
Solution: The sample space is defined by two mutually exclusive events: it rains (A1) or it does not
rain (A2). In addition, a third event occurs when the weatherman predicts rain (B). We know:
P( A1 ) = 5/365 ≈ 0.014 (it rains 5 days out of the year)
P( A2 ) = 360/365 ≈ 0.986 (it does not rain 360 days out of the year)
P( B | A1 ) = 0.9 (when it rains, the weatherman predicts rain 90% of the time)
P( B | A2 ) = 0.1 (when it does not rain, the weatherman predicts rain 10% of the time)
We want to know P( A1 | B ), the probability that it will rain on the day of Marie's wedding, given a
forecast of rain. By Bayes' theorem:
P( A1 | B ) = P( A1 ) P( B | A1 ) / [ P( A1 ) P( B | A1 ) + P( A2 ) P( B | A2 ) ]
= (0.014)(0.9) / [ (0.014)(0.9) + (0.986)(0.1) ]
≈ 0.111
So even when the weatherman predicts rain, it rains only about 11% of the time. There is a good
chance that Marie will not get rained on at her wedding.
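The whole example reduces to a few lines of Python:

p_rain = 5 / 365              # P(A1): it rains on a given day
p_dry = 360 / 365             # P(A2): it does not rain
p_forecast_given_rain = 0.9   # P(B | A1): forecast is correct when it rains
p_forecast_given_dry = 0.1    # P(B | A2): false alarm rate when it is dry

# Bayes' theorem: posterior probability of rain given a rain forecast
posterior = (p_rain * p_forecast_given_rain) / (
    p_rain * p_forecast_given_rain + p_dry * p_forecast_given_dry)
print(posterior)              # ~0.111, i.e. about an 11% chance of rain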