
STATISTICS

Statistics is the science that deals with methodologies to gather, review, analyze, and draw conclusions from data. With specific statistical tools in hand, we can derive many key observations and make predictions from the data at hand.

In the real world we deal with many cases where we use statistics, knowingly or unknowingly.

Let's talk about one classic use of statistics in the most famous sport in India. Yes, you guessed it right: cricket.

What makes Virat Kohli the best batsman in ODIs, or Jasprit Bumrah the best bowler in ODIs?

We have all heard of cricketing terms like batting average, bowler's economy, strike rate, etc., and we often see graphs built from them.

We see and talk about statistics all the time but very few of us know the science behind it.

Using different statistical methods, the ICC compares players and teams and ranks them. So, if we learn the science behind it, we can create our own rankings and compare players and teams. Better still, if we debate with someone over who is the better player, we can now debate with facts and figures, because we will understand the statistics behind it better.

We will dive further into the various methods and terminologies that will help answer the question above, and we will also see the vast uses of statistics in much more complex scenarios such as medical science, drug research, stock markets, economics, marketing, etc.
Types of Statistics

1. Descriptive Statistics:
The type of statistics dealing with numbers (numerical facts, figures, or information) used to describe a phenomenon. These numbers are descriptive statistics.

e.g., reports of industry production, cricket batting averages, government deficits, movie ratings, etc.

2. Inferential statistics
Inferential statistics is a decision, estimate, prediction, or generalization about a population, based on a sample.

A population is a collection of all possible individual, objects, or measurements of interest.

A sample is a portion, or part, of the population of interest.

Inferential statistics is used to make inferences from data whereas descriptive statistics simply
describes what’s going on in our data.

Let’s clear our understanding about the above types with a basic scenario:

Suppose your college has 1000 students, and you are interested in finding out how many students prefer eating in the college canteen over the college mess.

A random group of 100 students is selected. Here our population size is of 1000 students and the
sample size is of 100 students. You surveyed the sample group and got the following results:

Preference | 1st Year | 2nd Year | 3rd Year | 4th Year | Total
Canteen    | 7        | 13       | 20       | 32       | 72
Mess       | 12       | 8        | 5        | 3        | 28
Let’s analyze the data:

[Bar chart: canteen vs. mess preference, by year]

1) 72 % of the students prefer eating in the canteen.


2) Of the total students who prefer canteen, 44.4 % are from the 4th year.
3) Of the total students who prefer canteen, 72% are from the 3rd year and 4th year.
4) 1st year students are more inclined towards eating in the mess.

The above statistics give us the trend of variation in preference among the students. We are using numbers and figures to assess the data. This is the domain of descriptive statistics.

Now, suppose you got a contract to open a canteen in the college. With the above data, you can make the following inferences:

1) 3rd and 4th year students are the main target for sales.
2) You can give discounts to 1st year students to increase their numbers.
3) Since most students prefer eating in the canteen, opening a canteen can be a profitable business.

You made the above inferences/estimates for the whole college based on the sample data. This is the domain of inferential statistics, where you make decisions based on the descriptive statistics of sample data.
Though the above example is very basic and real scenarios are much more complex, it should help convey the underlying difference. We will see more complex examples ahead.

Q) The average salary of employees of a company in 2017 is greater than the average salary of teachers of a school in 2017.

Is the above statement an example of descriptive or inferential statistics?

Ans. Descriptive statistics because it summarizes the information in two samples.

Q) By 2030, the world will face a shortage of water. ---- Inferential statistics

Types of Data

i) Categorical data represent characteristics such as a person's gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as "1" indicating male and "2" indicating female), but those numbers don't have mathematical meaning; you couldn't meaningfully add them together.

ii) Numerical data have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure; or they're a count, such as the number of stock shares a person owns.

* Numerical data can be further broken into two types: discrete and continuous.

Discrete data represent items that can be counted; they take on possible values that can be listed out.
The list of possible values may be fixed (also called finite), or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (the finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get that 100th head). Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).

Some more examples: -

1) Number of children in a school


2) Number of books in your library
3) Number of cases a Lawyer has won

In all the above examples the values can be 1, 2, 3, and so on, but can never be 1.2, 4.6, 8.7, etc., thus making the results countable.

Continuous data represent measurements; their possible values cannot be counted and can only be
described using intervals on the real number line.

For example, the exact amount of petrol purchased at the petrol pump for bikes with 20-liter tanks
would be continuous data from 0 liters to 20 liters, represented by the interval [0, 20], inclusive. You
might pump 8.40 liters, or 8.41, or 8.414863 liters, or any possible number from 0 to 20. In this way,
continuous data can be thought of as being uncountably infinite.

Another example: you purchase a light bulb and are informed that the life of the light bulb is 2000 hours. The life of the bulb is continuous data, as it can take any value, such as 1998 hours, 1998.56 hours, 1896.34 hours, or exactly 2000 hours.

Levels of measurement
1. Qualitative Data

a) Nominal data levels of measurement:

A nominal variable is one in which values serve only as labels, even if those values are numbers.

For example, if we want to categorize male and female respondents, we could use the number 1 for male and 2 for female. However, the values 1 and 2 in this case do not represent any meaningful order or carry any mathematical meaning. They are simply used as labels. Nominal data cannot be used
to perform many statistical computations, such as mean and standard deviation, because such statistics
do not have any meaning when used with nominal variables.

However, nominal variables can be used to do cross tabulations. The chi-square test can be performed
on a cross-tabulation of nominal data.

Examples:

Category | Code        Category  | Code
Male     | 1           Delhi     | 1
Female   | 2           Mumbai    | 2
                       Bangalore | 3

Here, even though we code the different categories, we cannot say that since 2 > 1, Female > Male.

b) Ordinal data levels of measurement:

Values of ordinal variables have a meaningful order to them. For example, education level (with
possible values of high school, undergraduate degree, and graduate degree) would be an ordinal
variable. There is a definitive order to the categories (i.e., graduate is higher than undergraduate, and
undergraduate is higher than high school), but we cannot make any other arithmetic assumptions
beyond that. For instance, we cannot assume that the difference in education level between
undergraduate and high school is the same as the difference between graduate and undergraduate.

We can use frequencies, percentages, and certain non-parametric statistics with ordinal data. However,
means, standard deviations, and parametric statistical tests are generally not appropriate to use with
ordinal data.

Category     | Code        Category         | Code
Upper class  | 1           Highly satisfied | 5
Middle class | 2           Satisfied        | 4
Lower class  | 3           Average          | 3
                           Below average    | 2
                           Very bad         | 1
In the above examples of ordinal data, the data give a sense of comparability, i.e., we can say that in the second table Highly satisfied is better than Average. However, we cannot say that the difference between Highly satisfied and Satisfied is the same as the difference between Below average and Very bad.

2. Quantitative Data

a) Interval scale data levels of measurement

For interval variables, we can make arithmetic assumptions about the degree of difference between
values. An example of an interval variable would be temperature. We can correctly assume that the
difference between 70 and 80 degrees is the same as the difference between 80 and 90 degrees.
However, the mathematical operations of multiplication and division do not apply to interval variables.
For instance, we cannot accurately say that 100 degrees is twice as hot as 50 degrees. Additionally,
interval variables often do not have a meaningful zero-point. For example, a temperature of zero
degrees (on Celsius and Fahrenheit scales) does not mean a complete absence of heat.

An interval variable can be used to compute commonly used statistical measures such as the average
(mean), standard deviation etc. Many other advanced statistical tests and techniques also require
interval or ratio data.

Example: Measurement of the time of a historical event comes under the interval scale, since the year has no fixed origin, i.e., year 0 is different for different religions and countries.

b) Ratio scale data levels of measurement

All arithmetic operations are possible on a ratio variable. An example of a ratio variable would be
weight (e.g., in pounds). We can accurately say that 20 pounds is twice as heavy as 10 pounds.
Additionally, ratio variables have a meaningful zero-point (e.g., exactly 0 pounds means the object has
no weight). Other examples of ratio variables include gross sales of a company, the expenditure of a
company, the income of a company, etc.

Example: measurement of temperature on the Kelvin scale (since Kelvin has an absolute 0), or measurement of the average height of students in a class.

A ratio variable can be used as a dependent variable for most parametric statistical tests such as t-tests,
F-tests, correlation, and regression.
We can summarize different levels of measurements as below:
Offers:                                        | Nominal | Ordinal | Interval | Ratio
Sequence of variables is established           |         | Yes     | Yes      | Yes
Mode                                           | Yes     | Yes     | Yes      | Yes
Median                                         |         | Yes     | Yes      | Yes
Mean                                           |         |         | Yes      | Yes
Difference between variables can be evaluated  |         |         | Yes      | Yes
Addition and subtraction of variables          |         |         | Yes      | Yes
Multiplication and division of variables       |         |         |          | Yes
Absolute zero                                  |         |         |          | Yes

Measures of Central Tendency

A measure of central tendency is a summary statistic that represents the center point or typical value of
a dataset. These measures indicate where most values in a distribution fall and are also referred to as
the central location of a distribution. You can think of it as the tendency of data to cluster around a
middle value. In statistics, the three most common measures of central tendency are the mean, median,
and mode. Each of these measures calculates the location of the central point using a different method.

Mean: The mean is the arithmetic average. To calculate the mean, just add up all of the values and divide by the number of observations in your dataset.

Suppose you have a dataset with n values x1, x2, ..., xn. The mean for the data is given as:

mean = (x1 + x2 + ... + xn) / n

Median: The median is the middle value. It is the value that splits the dataset in half. To find the
median, order your data from smallest to largest, and then find the data point that has an equal amount
of values above it and below it. The method for locating the median varies slightly depending on
whether your dataset has an even or odd number of values.

Suppose you have a dataset with n values, ordered from smallest to largest: x1, x2, ..., xn.

Case I) When n is odd: the median is the value at position (n + 1)/2.

Case II) When n is even: the median is the average of the values at positions n/2 and n/2 + 1.

Frequency: The number of times a variable occurs in the data set is called its frequency.

Mode: The mode is the value that occurs the most frequently in your data set i.e. has the highest
frequency. On a bar chart, the mode is the highest bar. If the data have multiple values that are tied for
occurring the most frequently, you have a multimodal distribution. If no value repeats, the data do not
have a mode.

Q) Given a data set of heights of student in a class. Find out the mean, mode and median for the
dataset. Heights (in cm) = {180, 167, 154, 142, 181, 145, 143, 145, 167, 145}

Solution: no. of observations (n)= 10

Mean= (180 + 167 +154 + 142 + 181 + 145 + 143 + 145 + 167 + 145) /10 = 156.9 cm

Rearranged Heights= {142, 143, 145, 145, 145, 154, 167, 167, 180, 181}

Since n is even, the median is the average of the values at positions n/2 = 5 and n/2 + 1 = 6.

Value at 5th position = 145

Value at 6th position = 154

median = (145+154)/2 = 149.5

For calculating Mode, we will create a Frequency table for all the variables.
Variables Frequency
180 1
167 2
154 1
142 1
181 1
145 3
143 1

The number with the highest frequency of occurrence is 145.

mode= 145
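As a quick check, here is a minimal sketch using Python's built-in statistics module (these notes use Python libraries for such calculations later on):

import statistics

heights = [180, 167, 154, 142, 181, 145, 143, 145, 167, 145]

print(statistics.mean(heights))    # 156.9
print(statistics.median(heights))  # 149.5 (average of 145 and 154)
print(statistics.mode(heights))    # 145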

Let's take the above example, change the values of some observations, and check the effect on our measurements:

Heights (in cm) = {180, 167, 154, 122, 181, 135, 123, 145, 166, 145}

Rearranged Heights= {122, 123, 135, 145, 145, 154, 166, 167, 180, 181}

Note: we have lowered several of the values, mostly those to the left of the median.

Mean= (122 + 123 + 135 + 145 + 145 + 154 + 166 + 167 + 180 + 181)/10 = 151.8

Median = (145 + 154)/2 = 149.5

Note: We can see a significant change in the Mean, whereas the Median does not change at all.

That's because the calculation of the mean incorporates all values in the data. If you change any value,
the mean changes.

Unlike the mean, the median value doesn’t depend on all the values in the dataset. Consequently, when
some of the values are more extreme, the effect on the median is smaller. Of course, with other types of
changes, the median can change.
Examples:

Given the below dataset, find out the mean, median and mode.

Variable | Frequency | Cumulative Frequency
20       | 7         | 7
40       | 5         | 12
60       | 4         | 16
80       | 3         | 19

Solution:

Variable (x) | Frequency (f) | Cumulative Frequency (c.f.) | f*x
20           | 7             | 7                           | 140
40           | 5             | 12                          | 200
60           | 4             | 16                          | 240
80           | 3             | 19                          | 240
Total        | 19            |                             | 820

Mean = sum(f*x) / sum(f) = 820/19 ≈ 43.16

Mode = 20 (highest frequency)

Median = the value of the variable corresponding to the (19/2)th = 9.5th cumulative frequency.

Since no cumulative frequency equals 9.5 exactly, we take the next cumulative frequency, which is 12.

Median = value of variable corresponding to the 12th cumulative frequency = 40


SKEWNESS EFFECTS AND USES OF CENTRAL TENDENCIES

What is Skewness?

Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed
either to the left or to the right. Skewness can be quantified to define the extent to which a distribution
differs from a normal distribution.

In a normal distribution, the graph appears as a classical, symmetrical "bell-shaped curve." The mean, or
average, and the mode, or maximum point on the curve, are equal.

In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each
other.

When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on
the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.

When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail
on the left-hand side, and the mean is greater than the mode. This situation is also called positive
skewness.

**In a symmetric distribution, the mean and median both find the center accurately. They are
approximately equal.
**However, in a skewed distribution, the mean can miss the mark: it starts to fall outside the central area. This problem occurs because outliers have a substantial impact on the mean. Extreme values in an extended tail pull the mean away from the center. As the distribution becomes more skewed, the mean is drawn further away from the center.

Here the median better represents the central tendency for the distribution.
Uses of Mean, Median and Mode

When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal.
In this case, use the mean because it includes all of the data in the calculations. However, if you have a
skewed distribution, the median is often the best measure of central tendency.

When you have ordinal data, the median or mode is usually the best choice. For categorical data, you
have to use the mode.
Measures of Dispersion

A measure of dispersion shows the scattering of the data. It tells us how much the data vary from one another and gives a clear idea about the distribution of the data. A measure of dispersion shows the homogeneity or heterogeneity of the distribution of the observations.

How is it useful?

1) Measures of dispersion show the variation in the data, which tells us, for example, how well the average of the sample represents the entire data. Less variation gives a close representation, while with larger variation the average may not closely represent all the values in the sample.
2) Measures of dispersion enable us to compare two or more series with regard to their variations. This helps determine consistency.
3) By checking for variation in the data, we can try to control the causes behind the variations.

1) Range: The range is the most common and most easily understood measure of dispersion. It is the difference between the two extreme observations of the data set. If Xmax and Xmin are the two extreme observations, then

Range = X max – X min

Since it is based on the two extreme observations, it is affected by fluctuations.

Thus, the range is not a reliable measure of dispersion.

2) Standard Deviation
In statistics, the standard deviation is a very common measure of dispersion. The standard deviation measures how spread out the values in a data set are around the mean. More precisely, it is a measure of the average distance between the values in the data set and the mean. If the data values are all similar, the standard deviation will be low (close to zero). If the data values are highly variable, the standard deviation will be high (far from zero).

The standard deviation is always a positive number and is always measured in the same units as the
original data. Squaring the deviations overcomes the drawback of ignoring signs in mean deviations i.e.
distance of points from mean must always be positive.

3) Variance:

The Variance is defined as the average of the squared differences from the Mean.

Example:

1. A class of students took a test in Language Arts. The teacher determines that the mean grade on the
exam is a 65%. She is concerned that this is very low, so she determines the standard deviation to see if
it seems that most students scored close to the mean, or not. The teacher finds that the standard
deviation is high. After closely examining all of the tests, the teacher is able to determine that several
students with very low scores were the outliers that pulled down the mean of the entire class’s scores.

2. An employer wants to determine if the salaries in one department seem fair for all employees, or if
there is a great disparity. He finds the average of the salaries in that department and then calculates the
variance, and then the standard deviation. The employer finds that the standard deviation is slightly
higher than he expected, so he examines the data further and finds that while most employees fall
within a similar pay bracket, three loyal employees who have been in the department for 20 years or
more, far longer than the others, are making far more due to their longevity with the company. Doing
the analysis helped the employer to understand the range of salaries of the people in the department.
Coefficient of Variation (CV)

The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage and is defined as the ratio of the standard deviation (σ) to the mean (μ). It gives a unit-free measure of variability.

CV = Standard Deviation / Mean

Let's take one more example to understand how the standard deviation and CV are helpful:

We are given the batting scores made by two batsmen in 10 matches:

Batsman   | Match 1 | Match 2 | Match 3 | Match 4 | Match 5 | Match 6 | Match 7 | Match 8 | Match 9 | Match 10 | Sum | Mean
Batsman 1 | 54      | 35      | 68      | 12      | 13      | 120     | 6       | 0       | 18      | 184      | 510 | 51
Batsman 2 | 45      | 42      | 25      | 53      | 75      | 12      | 28      | 27      | 85      | 43       | 435 | 43.5

Looking at the above data, we might say that Batsman 1 is the better batsman and should be given preference, since his mean is greater. But is that really true?

Let’s check the variance of the data:

Here diff_i = mean_i - score (the sign does not matter once squared).

Match    | Batsman 1 | Batsman 2 | diff_1 | diff_2 | (diff_1)^2 | (diff_2)^2
Match 1  | 54        | 45        | -3     | -1.5   | 9          | 2.25
Match 2  | 35        | 42        | 16     | 1.5    | 256        | 2.25
Match 3  | 68        | 25        | -17    | 18.5   | 289        | 342.25
Match 4  | 12        | 53        | 39     | -9.5   | 1521       | 90.25
Match 5  | 13        | 75        | 38     | -31.5  | 1444       | 992.25
Match 6  | 120       | 12        | -69    | 31.5   | 4761       | 992.25
Match 7  | 6         | 28        | 45     | 15.5   | 2025       | 240.25
Match 8  | 0         | 27        | 51     | 16.5   | 2601       | 272.25
Match 9  | 18        | 85        | 33     | -41.5  | 1089       | 1722.25
Match 10 | 184       | 43        | -133   | 0.5    | 17689      | 0.25
Sum      | 510       | 435       |        |        | 31684      | 4656.5

Variance (Batsman 1)= 31684/10 = 3168.4


Standard deviation(batsman1) = Variance (Batsman 1) ^ ½ = 56.288

Coeff. Of Variation (batsman1) = 56.288/51 = 1.10

Variance (Batsman 2)= 4656.5/10 = 465.65

Standard deviation (batsman 2) = Variance (Batsman 2) ^ ½ = 21.58

Coeff. of Variation (batsman 2) = 21.58/43.5 = 0.50

We can clearly see that the standard deviation gives a different picture of the two batsmen. Though Batsman 1 has a higher average, his variance is very high, so Batsman 1 is less reliable.

On the other hand, Batsman 2 has lower average but is much more consistent than Batsman 1.

Also, the coefficient of variation for Batsman 2 is lower than for Batsman 1, which indicates lower variability and higher consistency.

If we only had taken mean into account, we wouldn’t have gotten the true picture. This problem is
solved by the dispersion measures.
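Here is a minimal sketch reproducing the batsman comparison with NumPy (assumed available), using the population formulas that divide by n, to match the hand calculation:

import numpy as np

batsman1 = np.array([54, 35, 68, 12, 13, 120, 6, 0, 18, 184])
batsman2 = np.array([45, 42, 25, 53, 75, 12, 28, 27, 85, 43])

for name, scores in [("Batsman 1", batsman1), ("Batsman 2", batsman2)]:
    mean = scores.mean()
    std = scores.std()  # ddof=0 by default: divide by n, as above
    print(name, mean, scores.var(), round(std, 2), round(std / mean, 2))

# Batsman 1: mean 51.0, variance 3168.4, std 56.29, CV 1.10
# Batsman 2: mean 43.5, variance 465.65, std 21.58, CV about 0.50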

Q) What is the variance and standard deviation of the possibilities associated with rolling a fair die?

Ans: - Possible outcomes = {1,2,3,4,5,6}

mean = (6+5+4+3+2+1)/6 = 3.5

variance = (6.25+2.25+0.25+0.25+2.25+6.25)/6=2.917

std. deviation = (2.917) ^0.5 = 1.71

Q) The following data set has a mean of 14.7 and a variance of 10.01.

18, 11, 12, a, 16, 11, 19, 14, b, 13

Compute the values of a and b.

Ans: -From the formula of the mean we have

14.7 = (114+a+b)/10

a + b = 147 - 114 = 33

so a = 33 - b

From the formula of the variance we have

10.01 = (69.12+(a−14.7) ^2+(b−14.7) ^2)/10

Substituting a = 33 - b and simplifying gives 2b^2 - 66b + 520 = 0, i.e., b^2 - 33b + 260 = 0, so

b = 13 or b = 20

Since a=33−b

we have a=20 or a=13.

So, the two unknown values in the data set are 13 and 20

We do not know which of these is a and which is b since the mean and variance tell us nothing about the
order of the data.

Standard Deviation and Variance for Population and Sample Data


When you have "N" data values that are:

1) The Population: divide by N when calculating Variance

2) A Sample: divide by N-1 when calculating Variance

The formulas for calculating the variance (and hence the standard deviation) change when dealing with population vs. sample data:

Population variance: σ^2 = Σ(x - μ)^2 / N
Sample variance: s^2 = Σ(x - x̄)^2 / (n - 1)

In each case the standard deviation is the square root of the variance.

Why do we divide by (n-1) instead of n?

How to calculate the standard deviation?

1. Compute the square of the difference between each value and the sample mean.

2. Add those values up.

3. Divide the sum by n-1. This is called the variance.

4. Take the square root to obtain the Standard Deviation.


Why n-1?

In step 1, you compute the difference between each value and the mean of those values. You don't
know the true mean of the population; all you know is the mean of your sample. Except for the rare
cases where the sample mean happens to equal the population mean, the data will be closer to the
sample mean than it will be to the true population mean. So, the value you compute in step 2 will
probably be a bit smaller (and can't be larger) than what it would be if you used the true population
mean in step 1. To make up for this, divide by n-1 rather than n.

**This is called Bessel's correction. **

But why n-1? If you knew the sample mean, and all but one of the values, you could calculate what that
last value must be. Statisticians say there are n-1 degrees of freedom.
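A minimal sketch of Bessel's correction using NumPy's ddof parameter, reusing the data set from the earlier mean/variance exercise:

import numpy as np

data = np.array([18, 11, 12, 20, 16, 11, 19, 14, 13, 13])

print(data.var(ddof=0))  # 10.01    - population formula, divide by N
print(data.var(ddof=1))  # 11.12... - sample formula, divide by N-1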

Covariance and Correlation


Covariance
It is a method of measuring the joint variability of two variables.

1. It describes the relationship between a pair of random variables, where a change in one variable is associated with a change in the other.
2. It can take any value from -infinity to +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of the relationship between variables.
5. It has units (the product of the units of the two variables).
Covariance Relationship

Correlation
1. It shows whether and how strongly pairs of variables are related to each other.
2. Correlation takes values between -1 and +1, where values close to +1 represent strong positive correlation and values close to -1 represent strong negative correlation.
3. It gives the direction and strength of the relationship between variables.
4. It is the scaled version of covariance.
5. It is dimensionless.
Correlation Relationship

Positive Correlation

When the values of the variables deviate in the same direction, i.e., when the value of one variable increases (decreases), the value of the other variable also increases (decreases).

Examples:

1) Height and weight of persons


2) Amount of rainfall and crops yield
3) Income and Expenditure of Households
4) The speed of a wind turbine and the amount of electricity generated
5) The more years of education you complete, the higher your earning potential
6) As the temperature goes up, ice cream sales also go up
7) The more it rains, the more umbrella sales go up

Negative Correlation

When the values of the variables deviate in opposite directions, i.e., when the value of one variable increases (decreases), the value of the other variable decreases (increases).
Examples:

1) Price and demand of goods


2) Poverty and literacy
3) Sine function and cosine function
4) If a train increases speed, the length of time to get to the final point decreases
5) The more one works out at the gym, the less body fat one may have
6) As the temperature decreases, sale of heaters increases

Zero Correlation

When two variables are independent of each other, they will have a zero correlation.

Note: When data is standardized, covariance and correlation give the same value. Also, correlation and causality are not the same thing.

Example:

x = [1,2,3,4,5,6,7,8,9]

y = [9,8,7,6,5,4,3,2,1]

Find the correlation between x and y.

Ans: We can clearly see in the dataset that as x increases, y decreases, and vice versa.

Let’s prove this with the formula we have studied above.

Solution:

x | y | x - x_mean | y - y_mean | (x - x_mean)^2 | (y - y_mean)^2 | (x - x_mean)*(y - y_mean)
1 | 9 | -4         | 4          | 16             | 16             | -16
2 | 8 | -3         | 3          | 9              | 9              | -9
3 | 7 | -2         | 2          | 4              | 4              | -4
4 | 6 | -1         | 1          | 1              | 1              | -1
5 | 5 | 0          | 0          | 0              | 0              | 0
6 | 4 | 1          | -1         | 1              | 1              | -1
7 | 3 | 2          | -2         | 4              | 4              | -4
8 | 2 | 3          | -3         | 9              | 9              | -9
9 | 1 | 4          | -4         | 16             | 16             | -16

x_mean = 5, y_mean = 5; column sums: 60, 60, -60

Corr(x, y) = -60 / [(60 * 60)^(1/2)] = -1. As expected, we get a perfect negative correlation between x and y.


------------------------------------------------------------------------------------------------------------------------------------------

As we proceed further, we will use statistical tools such as Python's statistics libraries to do the complex calculations and derive our descriptive statistics.

For example, solving the above problem using Python is quite easy.
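A minimal sketch with NumPy (assumed available) reproducing the hand calculation above:

import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 7, 6, 5, 4, 3, 2, 1]

print(np.corrcoef(x, y))
# [[ 1. -1.]
#  [-1.  1.]]
# The off-diagonal entry is Corr(x, y) = -1, a perfect negative correlation.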

Here is another example of correlation coefficient calculation using Python:


In the example below we find the correlation between three sets of variables. The resulting matrix represents the correlations between them: the 2nd value in the 1st row gives the correlation between the 1st and 2nd sets of variables, the 3rd value in the 1st row gives the correlation between the 1st and 3rd sets, and so on.
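A minimal sketch of such a 3-variable correlation matrix; the three series here are hypothetical stand-ins, since the original screenshot is missing:

import numpy as np

a = [10, 20, 30, 40, 50]
b = [12, 24, 33, 41, 55]
c = [50, 40, 30, 20, 10]

print(np.corrcoef([a, b, c]))
# Entry (i, j) of the 3x3 matrix is the correlation between series i and j;
# the diagonal is all 1s, since each series is perfectly correlated with itself.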
Probability Distribution

A probability distribution is a function that shows all the possible values a variable can take and how often they occur.

The probability distribution is defined by the underlying probabilities; the graph is just a visual representation.

Let’s take an example and find out probability distribution of all possible outcomes of rolling a die:

Possible Outcome | Probability
1                | 1/6
2                | 1/6
3                | 1/6
4                | 1/6
5                | 1/6
6                | 1/6

The above table represents the probability distribution of the variable (possible outcomes).

When we plot the above table, we get the probability distribution graph as below.
Let's take another example and find the probability distribution of the sums obtained when rolling two dice:

Possible outcomes =
{(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),(3,1),(3,2),(3,3),(3,4),(3,5),(3,6),(4,1),(4,2)
,(4,3),(4,4),(4,5),(4,6),(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)}

Number of outcomes = 36

Possible Sum | Occurrences | Probability
2            | 1           | 0.03
3            | 2           | 0.06
4            | 3           | 0.08
5            | 4           | 0.11
6            | 5           | 0.14
7            | 6           | 0.17
8            | 5           | 0.14
9            | 4           | 0.11
10           | 3           | 0.08
11           | 2           | 0.06
12           | 1           | 0.03

This table represents the Probability distribution of different outcomes and if we plot the above table,
we get the below Probability distribution graph.
Probability Distributions: Discrete vs. Continuous

Variables can be of two types: discrete and continuous.

For example, when we toss a coin, we know the possible outcomes are heads or tails, and when we roll a die, the possible outcomes are 1, 2, ..., 6. We can never get values like 1.5 or 3.2. Such variables are discrete variables.

On the other hand, say the criterion for selection in a basketball team is a height between 170 cm and 200 cm. Here height is a continuous variable, as it can take any value between 170 and 200.

Discrete Probability Distribution


If a random variable is discrete, its probability distribution is said to be a discrete probability distribution.

Different types of discrete probability distributions:

1) Binomial distribution
2) Poisson distribution
3) Bernoulli distribution

The example we took above of rolling a die is a discrete probability distribution.

Continuous Probability Distribution


Similarly, when a random variable is continuous, its probability distribution is said to be continuous
Probability Distribution.

In a continuous probability distribution, unlike a discrete probability distribution, the probability that a
continuous random variable will assume a particular value is zero. Thus, we cannot express a continuous
probability distribution in tabular form. We describe it using an equation or a formula also known as
Probability Density Function (pdf).

For a continuous probability distribution, the probability density function has the following properties:

• The graph of the density function will always be continuous over the range in which the random
variable is defined.
• The area bounded by the curve of the density function and the x-axis is equal to 1, when
computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area
under the density function bounded by a and b.
Different types of continuous probability distributions:

1) Normal distribution
2) Student's t distribution
3) Chi-squared distribution

Normal Distribution

Normal distribution, also known as the Gaussian distribution, is a continuous probability distribution
that is symmetric about the mean, showing that data near the mean are more frequent in occurrence
than data far from the mean. In graph form, normal distribution will appear as a bell curve (as in the
figure below).

The probability density function for the normal distribution is given as:

f(x) = (1 / (σ √(2π))) * e^(-(x - μ)^2 / (2σ^2))

Given a value x(a), the probability that the random variable X, which follows a normal distribution, is less than or equal to x(a) is:

P(X ≤ x(a)) = the integral of f(x) from -∞ to x(a)

Here we integrate the pdf to get the cumulative probability.

Note: For all the below examples we will use the “Normal Distribution Calculator” as it is very easy for
calculation purposes. We can do the calculations with the above given formula also but it’s not
necessary. We will learn how to solve problems using z-score table in the next section.

Examples:

Q) The Light Bulb Company has found that an average light bulb lasts 900 hours, with a standard deviation of 80 hours. Assuming that bulb life is normally distributed, what is the probability that a randomly selected light bulb will burn out in 1000 hours or less?

Answer:

Mean = 900

Standard deviation = 80

x(a) = 1000

We could use the formula given above to find the probability, but for the time being we take the help of the normal distribution calculator (shown below).

Putting the above values, we get a probability of:

P(X <= 1000) = 89.4%, i.e., there is an 89.4% chance that the bulb will burn out within 1000 hours.
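The same probability can be computed with SciPy (assumed available) instead of the online calculator; a minimal sketch:

from scipy.stats import norm

p = norm.cdf(1000, loc=900, scale=80)  # P(X <= 1000) with mean 900, sd 80
print(round(p, 3))                     # 0.894, i.e. about 89.4%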
Q) Suppose scores on a mathematics test are normally distributed. If the test has a mean of 55 and a
standard deviation of 10, what is the probability that a person who takes the test will score between
45 and 65?

Ans.

Here we need to find the probability for 45<= x(a) <=65.

If we find the cumulative probability for x<=45 and cumulative probability for x<=65, we can subtract
them to find the required probability.

Mean = 55

Standard deviation = 10

Again, using the normal distribution calculator, we find both the values as:

P(x(a)< = 45) = 0.159

P(x(a)< = 65) = 0.841

So, P (45< X(a) <65) = 0.841 – 0.159

= 0.682

So, there is a 68.2% probability that a person taking the test scores between 45 and 65 marks.
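The same interval probability, sketched with SciPy's norm.cdf:

from scipy.stats import norm

p = norm.cdf(65, loc=55, scale=10) - norm.cdf(45, loc=55, scale=10)
print(round(p, 3))  # 0.683 (the hand calculation above rounds to 0.682)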

-----------------------------------------------------------------------------------------------------------------

The Normal Distribution has following properties:


1) mean = median = mode
2) symmetry about the center
3) 50% of values less than the mean and 50% greater than the mean
4) The probability that X is greater than ‘a’ is equal to the area under the normal curve as
shown by the non-shaded area in the figure below.
5) The probability that X is less than ‘a’ is equal to the area under the normal curve as
shown by the shaded area in the figure below.
Let’s see some real-life examples which follows normal distribution

When we talk about the heights or weights of people in the world, it is seen that they follow a normal distribution. Why does this seem obvious? Simply because it is more probable to find people with heights near the average than to find very short or very tall people.

Just look around in your class, you will find the majority of people will fall in the range near the average
height of the class.

Similar is the case with the weights and IQ of people.

The distribution of wealth is often cited as another example, with most people falling in the middle ("middle class"), though real wealth distributions are typically right-skewed rather than perfectly normal.

The empirical rule (Three Sigma Rule)


It states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean. The empirical rule can be broken down into three parts:

1) 68% of the data falls within the first standard deviation from the mean.

2) 95% fall within two standard deviations.

3) 99.7% fall within three standard deviations.


Why do we use Normal Distribution?

1) Distributions of sample means with large sample sizes can be approximated by a normal distribution
2) Decisions based on normal-distribution insights have proven to be of good value
3) Statistics computed from it have simple, well-understood forms
4) It approximates a wide variety of random variables

Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.

The normal random variable of a standard normal distribution is called a standard score or a z score.

Every normal random variable X can be transformed into a z score via the following equation:

z = (X - μ) / σ

where, X is a normal random variable,

μ is the mean, and σ is the standard deviation.


Steps to transform into Standard N-Distribution

Suppose we have a dataset with elements

X = [1,2,2,3,3,4,4]

The data is roughly symmetric about its mean.

steps:

1) mean = 2.71

2) std_deviation = 1.11 (the sample standard deviation)

3) Transform to z-scores using z = (x - mean) / std_deviation

4) After transforming, X = {-1.54, -0.64, -0.64, 0.26, 0.26, 1.16, 1.16}


Plotting new data:

mean=0, standard deviation=1
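A minimal sketch of the same standardization with NumPy (assumed available):

import numpy as np

X = np.array([1, 2, 2, 3, 3, 4, 4])

mean = X.mean()        # 2.71...
std = X.std(ddof=1)    # 1.11..., the sample standard deviation used above
z = (X - mean) / std

print(np.round(z, 2))  # [-1.54 -0.64 -0.64  0.26  0.26  1.16  1.16]
# The transformed data has mean approximately 0 and standard deviation 1.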

Let’s take another example and see how the transformation works:

Given a data set, we were provided with the initial data, we transformed the data and found the z score
as shown in the below table.

Initial Data | Initial data - mean | z-score = (Initial data - mean) / St. Deviation


45.363 17.295 1.727449369
33.435 5.367 0.536092603
17.713 -10.355 -1.034292115
26.019 -2.049 -0.204671347
27.508 -0.560 -0.055885355
11.194 -16.874 -1.685409951
19.894 -8.174 -0.816445735
23.133 -4.935 -0.492868567
29.126 1.058 0.105720916
30.427 2.359 0.235629855
44.729 16.661 1.664114735
7.393 -20.675 -2.065026984
26.983 -1.085 -0.108386711
30.616 2.548 0.254473261
32.662 4.594 0.458842761
23.013 -5.055 -0.504921837
30.949 2.881 0.287795782
37.189 9.121 0.910991079
21.990 -6.078 -0.607100949
42.025 13.957 1.393988737

Initial data: mean = 28.068, St. deviation = 10.012. z-scores: mean = 0, St. deviation = 1.
Let’s see the plots for both the initial data and transformed data:

[Plots: initial data (left) and transformed data (right)]

We can see from the graphs that our normal distribution, when transformed, has a mean of zero and a standard deviation of 1.

z-score
The z-score is a measure of position that indicates the number of standard deviations a data value lies from the mean. z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative value indicating it is below the mean.

The z-score is a very powerful tool for finding probabilities using the z-score table.

With it, we do not need the normal distribution calculator and can simply use the z-score table.

Z-score table:
let’s see the application of z-score table and see how to use it:

We will use the same example from above where we used the Normal Distribution calculator, let’s try to
calculate the same using z-score table and compare the results.

Q) The Light Bulb Company has found that an average light bulb lasts 900 hours with a standard
deviation of 80 hours. Assuming that bulb life is normally distributed. What is the probability that a
randomly selected light bulb will burn out in 1000 hours or less?

Answer: let's convert our data into standard normal form.

Mean = 900

Std. deviation = 80

x(a) = 1000

standardized x(a) = (1000-900)/80 = 1.25

z-score = 1.25 = 1.2 + 0.05 (in the table we look up the row for 1.2 and the column for 0.05)

Let’s use the z-score table for this:

We find that the probability comes out to 89.44%.

This is exactly same as we found out with the Normal distribution calculator.

Thus, once we standardize a normal distribution, the z-score becomes a very important tool for finding probabilities.
Q) Ravi scored 980 in a Physics Olympiad. The mean test score was 870 with a standard deviation of
120. How many students scored more than Ravi? (Assume that test scores are normally distributed.)

Answer: Let’s standardize the test score

Mean = 870

St. deviation= 120

z-score = (980-870)/120 = 0.917 (we will approximate it to 0.92 )

So, P(x<=980) = 0.8212

We need to find the probability of scoring more than 980,

P(x>980) = 1 – 0.8212 = 0.1788

Thus, we can estimate 17.88% students scored more than Ravi in the test.

Note: When you encounter a negative z-score, you can use a negative z-score table, or find the value for the positive z-score and subtract it from 1.

e.g., p(-2.5) = 1 - p(2.5) = 1 - 0.99379 = 0.00621, which is the same value you will find in a negative z-score table.
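A quick SciPy check of the negative z-score identity in the note above:

from scipy.stats import norm

print(norm.cdf(-2.5))     # 0.00620...
print(1 - norm.cdf(2.5))  # the same value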
Central Limit Theorem

In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample
means approximates a normal distribution (also known as a “bell curve”), as the sample size becomes
larger, assuming that all samples are identical in size, and regardless of the population distribution
shape.

CLT is a statistical theory stating that given a sufficiently large sample size from a population with a finite
level of variance, the mean of all samples from the same population will be approximately equal to the
mean of the population. Furthermore, all the samples will follow an approximate normal distribution
pattern, with all variances being approximately equal to the variance of the population, divided by each
sample's size. The samples extracted should generally contain more than 30 observations.

Let’s visualize the CLT with few data sets with different sample sizes:

x= [9,2,1]

x2= [6,6,8,3,8]

x3= [5,3,6,4,7,2,6,9,7,1,1,7]

x4 = [8,1,7,1,4,3,1,7,8,9,8,3,1,6,8,3,4]
Plotting the distribution graphs for the above samples with different sample sizes:

We can clearly see that, among the samples extracted from the population, as the sample size increases the distribution moves closer to a normal distribution.
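A minimal simulation sketch of the CLT (not from the original notes): means of repeated samples from a non-normal population look increasingly normal, and less spread out, as the sample size n grows.

import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(0, 10, size=100_000)  # a flat, non-normal population

for n in [2, 5, 30]:
    # 10,000 sample means, each computed from a random sample of size n
    idx = rng.integers(0, population.size, size=(10_000, n))
    means = population[idx].mean(axis=1)
    print(n, round(means.mean(), 2), round(means.std(), 2))

# The mean of the sample means stays near the population mean (5), while
# their spread shrinks roughly like sigma/sqrt(n); a histogram of `means`
# looks more bell-shaped as n increases.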

We will talk more about the central limit theorem and see its uses, but first let's discuss one more important concept.

Standard Error

The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample
population. The standard error is a statistical term that measures the accuracy with which a sample
distribution represents a population by using standard deviation. In statistics, a sample mean deviates
from the actual mean of a population—this deviation is the standard error of the mean.

When a population is sampled, the mean, or average, is generally calculated. The standard error can
include the variation between the calculated mean of the population and one which is considered
known, or accepted as accurate. This helps compensate for any incidental inaccuracies related to the
gathering of the sample.
In cases where multiple samples are collected, the mean of each sample may vary slightly from the
others, creating a spread among the variables. This spread is most often measured as the standard
error, accounting for the differences between the means across the datasets.

The more data points involved in the calculations of the mean, the smaller the standard error tends to
be. When the standard error is small, the data is said to be more representative of the true mean. In
cases where the standard error is large, the data may have some notable irregularities.

The standard deviation is a representation of the spread of each of the data points. The standard
deviation is used to help determine the validity of the data based on the number of data points
displayed at each level of standard deviation. Standard errors function more as a way to determine the
accuracy of the sample or the accuracy of multiple samples by analyzing deviation within the means.

The standard error is given by the following formula:

SE = σ / √n

Here σ is the standard deviation of the population and n is the sample size; SE itself is the standard deviation of the sample mean.

We can see that as the size of our sample increases, the standard error decreases.

Now, that we know the Standard error, let’s rephrase our Central Limit theorem as:

The central limit theorem states that the sample mean follows approximately the normal distribution
with mean(μ) and standard deviation (σ/√n), where μ and σ are the mean and standard deviation of
the population from where the sample was selected. The sample size n has to be large (usually n≥30)
if the population from where the sample is taken is non normal.

So, when we transform our sample data, we will use following formula for the z-score:

z = (X - μ) / (σ/√n)

where, X is the sample mean,

μ is the mean of the population,

and σ is the standard deviation of the population.


Let’s see an example based on the above explanation.

Q) Let X be a random variable with μ= 10 and σ= 4. A sample of size 100 is taken from this population.
Find the probability that the sample mean of these 100 observations is less than 9.

Ans: population mean = 10 population std. deviation = 4 sample size(n) = 100

Sample mean = 9

z= (9-10)/ (4/ (100) ^0.5) = -2.5

We will use the z-score table and find the value to be 0.0062

P(X<9) = 0.0062
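The same sample-mean probability, sketched with SciPy:

from scipy.stats import norm

mu, sigma, n = 10, 4, 100
se = sigma / n ** 0.5  # standard error = 0.4

print(round(norm.cdf(9, loc=mu, scale=se), 4))  # 0.0062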

Q) A large freight elevator can transport a maximum of 9800 pounds. Suppose a load of cargo
containing 49 boxes must be transported via the elevator. Experience has shown that the weight of
boxes of this type of cargo follows a distribution with mean= 205 pounds and standard deviation = 15
pounds. Based on this information, what is the probability that all 49 boxes can be safely loaded onto
the freight elevator and transported?

Ans: For all the boxes to be loaded, the total weight must be at most 9800 pounds.

So the sample mean must be at most 9800/49 = 200 pounds; sample size (n) = 49

Population mean = 205

Std. deviation = 15

z-score = (200-205)/(15/(49)^0.5) = -2.33

using z-score table:


P(X<200) = 0.0099

Bernoulli Distribution

It is a type of discrete probability distribution. The Bernoulli distribution essentially models a single trial of flipping a weighted coin. It is the probability distribution of a random variable taking on only two values, 1 ("success") and 0 ("failure"), with complementary probabilities p and 1 - p respectively. The Bernoulli distribution therefore describes events having exactly two outcomes, which are common in real life.

Suppose we have a single trial with only two possible outcomes, success or failure:

P(Success) = p

P(Failure)= 1-p

Let, X=1 when Success and X=0 when failure,

Then the probability mass function is given as:

P(X = x) = p^x * (1 - p)^(1 - x), for x in {0, 1}

So, P(1) = p^1 * (1 - p)^0 = p

P(0) = p^0 * (1 - p)^1 = 1 - p

A simple graphical representation of Bernoulli’s distribution will look like this.


Here, p =0.3

The expected value (mean) of the Bernoulli distribution is given as:

E[X] = p

The variance of the Bernoulli distribution is given as:

Var(X) = p(1 - p)

Some real-life cases that follow a Bernoulli distribution:

1) Results of Exam (Pass or Fail)


2) Gender of New born baby (Male or Female)
3) Result of Cricket World Cup (Win or Lose)
4) Tossing a coin (Heads or Tails)
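A minimal sketch of a Bernoulli distribution with SciPy (assumed available), using the p = 0.3 case pictured above:

from scipy.stats import bernoulli

p = 0.3
print(bernoulli.pmf(1, p))  # P(X=1) = 0.3
print(bernoulli.pmf(0, p))  # P(X=0) = 0.7
print(bernoulli.mean(p))    # expected value = p = 0.3
print(bernoulli.var(p))     # variance = p*(1-p) = 0.21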

We will see some more examples when Bernoulli trial is repeated for many times.

Binomial Distribution

A binomial experiment is a series of n Bernoulli trials, whose outcomes are independent of each other. A
random variable, X, is defined as the number of successes in a binomial experiment.

For example, consider a fair coin. Flipping the coin once is a Bernoulli trial, since there are exactly two complementary outcomes (flipping a head and flipping a tail), and each has probability 1/2 no matter how many times the coin is flipped. Note that the fairness of the coin is not necessary: flipping a weighted coin is still a Bernoulli trial.

A binomial experiment might consist of flipping the coin 100 times, with the resulting number of heads
being represented by the random variable X. The binomial distribution of this experiment is the
probability distribution of X.

If X is the number of successes in a binomial experiment with n independent trials, probability of success p, and probability of failure 1 - p, then the probability of exactly k successes is given as:

P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)

Here C(n, k), also written (n k), is the number of ways of choosing k items from n.

Q) Let's flip a coin 6 times, with the probability of getting a tail being 0.3. Let's write out the binomial distribution for this experiment.

Ans:

Possible outcomes (k) | Probability | Binomial formula             | Value
0 tails               | P(X=0)      | C(6,0) * (0.30^0) * (0.70)^6 | 0.118
1 tail                | P(X=1)      | C(6,1) * (0.30^1) * (0.70)^5 | 0.302
2 tails               | P(X=2)      | C(6,2) * (0.30^2) * (0.70)^4 | 0.324
3 tails               | P(X=3)      | C(6,3) * (0.30^3) * (0.70)^3 | 0.185
4 tails               | P(X=4)      | C(6,4) * (0.30^4) * (0.70)^2 | 0.06
5 tails               | P(X=5)      | C(6,5) * (0.30^5) * (0.70)^1 | 0.01
6 tails               | P(X=6)      | C(6,6) * (0.30^6) * (0.70)^0 | 0.0007

**How to calculate C(6,1): C(6,1) = 6! / [(6-1)! * 1!] = 6**

So, looking at the above table, we can find the probability of obtaining any number k of tails.

e.g., what is the probability of getting exactly 2 tails in the above experiment?

P(X=2) = 0.324
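A minimal sketch verifying the table above with scipy.stats.binom (assumed available):

from scipy.stats import binom

n, p = 6, 0.3  # 6 flips, P(tail) = 0.3

for k in range(n + 1):
    print(k, round(binom.pmf(k, n, p), 4))
# 0.1176, 0.3025, 0.3241, 0.1852, 0.0595, 0.0102, 0.0007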

Let’s plot the above table,


Q) A basketball player takes 5 independent free throws with a probability of 0.65 of getting a basket
on each shot. Let X=the number of baskets he gets. Show the probability distribution for X.

Solution:

Possible outcomes (k) | Probability | Binomial formula             | Value
0 baskets             | P(X=0)      | C(5,0) * (0.65^0) * (0.35)^5 | 0.0052
1 basket              | P(X=1)      | C(5,1) * (0.65^1) * (0.35)^4 | 0.049
2 baskets             | P(X=2)      | C(5,2) * (0.65^2) * (0.35)^3 | 0.181
3 baskets             | P(X=3)      | C(5,3) * (0.65^3) * (0.35)^2 | 0.336
4 baskets             | P(X=4)      | C(5,4) * (0.65^4) * (0.35)^1 | 0.312
5 baskets             | P(X=5)      | C(5,5) * (0.65^5) * (0.35)^0 | 0.116

Probability distribution Graph:

Mean and variance for Binomial Distribution

Mean = n*p

Variance = n*p*(1 - p)

where n is the number of trials, p is the probability of success, and 1 - p is the probability of failure.
Poisson Distribution

The Poisson distribution is the discrete probability distribution of the number of events occurring in a
given time period, given the average number of times the event occurs over that time period.

Example

A certain car wash shop gets an average of 3 visitors to the center per hour. This is just an average,
however. The actual amount can vary.

A Poisson distribution can be used to analyze the probability of various events regarding how many
customers go to the center. It can allow one to calculate the probability of a dull activity (when there are
0 customers coming) as well as the probability of a high activity (when there are 5 or more customers
coming). This information can, in turn, help the owner to plan for these events with staffing and
scheduling.

If X is the number of events observed over a given time period and λ is the average number of events over that period, then the probability of observing exactly k events over the time period is:

P(X = k) = (λ^k * e^(-λ)) / k!

The Poisson distribution is often used as an approximation for binomial probabilities when n is large and
p is small.
Q) In a coffee shop, the average number of customers per hour is 2. Find the probability of getting k
number of customers in the shop.

Let’s plot the probability distribution:

We can see that the probability of a given number of customers becomes negligible beyond about 6.
Q) Suppose the average number of elephants seen on a 1-day safari is 6. What is the probability that
tourists will see fewer than 4 elephants on the next 1-day safari?

Solution:

Mean (λ) = 6

Number of Elephants | Probability | Poisson value
0                   | P(X=0)      | 0.0025
1                   | P(X=1)      | 0.0149
2                   | P(X=2)      | 0.0446
3                   | P(X=3)      | 0.0892
4                   | P(X=4)      | 0.1339

We need values P(X<4) = P(X<=3) = P(X=0) + P(X=1) + P(X=2) + P(X=3)

= 0.0025+0.0149+0.0446+0.0892
= 0.1512
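A quick check of the safari answer with scipy.stats.poisson (assumed available):

from scipy.stats import poisson

lam = 6  # average elephants per 1-day safari
print(round(poisson.cdf(3, lam), 4))  # P(X <= 3) = P(X < 4) = 0.1512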

Some applications that obey a Poisson distribution are listed below:

a. the number of mutations on a given strand of DNA per time unit


b. the number of bankruptcies that are filed in a month
c. the number of arrivals at a car wash in one hour
d. the number of network failures per day
e. the number of file server virus infections at a data center during a 24-hour period
f. the number of Airbus 330 aircraft engine shutdowns per 100,000 flight hours
g. the number of asthma patient arrivals in a given hour at a walk-in clinic
h. the number of hungry persons entering McDonald's restaurant per day
Hypothesis

In our daily life, we often hear statements like: Dhoni is a better captain than his contemporaries; a motorcycle company claiming that a certain model gives an average mileage of 100 km per liter; or a toothpaste company claiming to be the number one brand suggested by dentists.

Let's suppose you want to purchase a motorcycle and you have heard the above claim made by the motorcycle company. Would you just go and buy it, or would you rather look for proof? There must be a parameter on the basis of which one judges the correctness of the statement made. In this case our parameter is the average mileage, which you will use to check whether the statement made is true or just a hoax.

A hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).

A hypothesis is an educated guess about something in the world around you. It should be testable,
either by experiment or observation.

For example, if we make the statement "Dhoni is the best Indian captain ever," this is an assumption we are making based on the average wins and losses of the team under his captaincy. We can test this statement using all the match data.

Simple and Composite Hypothesis

When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and if it
specifies a range of values then it is called a composite hypothesis.

e.g., a motorcycle company claiming that a certain model gives an average mileage of 100 km per liter is a case of a simple hypothesis.

The average age of students in a class is greater than 20. This statement is a composite hypothesis.

Null Hypothesis
The null hypothesis is the hypothesis to be tested for possible rejection under the assumption that it is true. The concept of the null is like "innocent until proven guilty": we assume innocence until we have enough evidence to prove that a suspect is guilty.

It is denoted by H0.
Alternate Hypothesis
The alternative hypothesis complements the Null hypothesis. It is opposite of the null hypothesis
such that both Alternate and null hypothesis together cover all the possible values of the population
parameter.

It is denoted by H1.

Let’s understand this with an example:

A soap company claims that its product kills on an average 99% of the germs. To test the claim of
this company we will formulate the null and alternate hypothesis.

Null Hypothesis(H0): Average =99%

Alternate Hypothesis(H1): Average is not equal to 99%.

Note: The thumb rule is that statement containing equality is the null hypothesis.

Hypothesis Testing

When we test a hypothesis, we assume the null hypothesis to be true until there is sufficient
evidence in the sample to prove it false. In that case we reject the null hypothesis and support the
alternate hypothesis.

If the sample fails to provide sufficient evidence for us to reject the null hypothesis, we cannot say
that the null hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.

One Tailed and Two Tailed Tests

If the alternate hypothesis gives the alternate in both directions (less than and greater than) of the
value of the parameter specified in null hypothesis, it is called Two tailed test.

If the alternate hypothesis gives the alternate in only one direction (either less than or greater than)
of the value of the parameter specified in null hypothesis, it is called One tailed test.

e.g., if H0: mean = 100 and H1: mean not equal to 100,

here according to H1 the mean can be greater than or less than 100. This is an example of a two-tailed
test.

Similarly, if H0: mean >= 100, then H1: mean < 100.

Here the alternative lies in only one direction (mean less than 100), so this is a one-tailed test.


Critical Region

The critical region is the region of the sample space in which, if the calculated value lies, we
reject the null hypothesis.

Let’s understand this with an example:

Suppose you are looking to rent an apartment. You have listed all the available apartments from
different real estate websites. You have a budget of Rs. 15000/month and cannot spend more than
that. The list of apartments you have made has prices ranging from 7000/month to 30,000/month.

You select a random apartment from the list and assume below hypothesis:

H0: You will rent the apartment.

H1: You won’t rent the apartment.

Now, since your budget is 15000, you must reject all the apartments above that price.

Here all the Prices greater than 15000 become your critical region. If the random apartment’s price
lies in this region, you must reject your null hypothesis and if the random apartment’s price doesn’t
lie in this region, you do not reject your null hypothesis.

The critical region lies in one tail or both tails of the probability distribution curve, according to the
alternative hypothesis. The critical region is a pre-defined area corresponding to a cut-off value in the
probability distribution curve, and its size is denoted by α.

Critical values are values separating the values that support or reject the null hypothesis and are
calculated based on alpha.

We will see more examples later, and it will become clear how we choose α.
Based on the alternative hypothesis, three cases of critical region arise:

Case 1) The critical region is split across both tails: a two-tailed test.

Case 2) The critical region lies entirely in the left tail: a left-tailed test.

Case 3) The critical region lies entirely in the right tail: a right-tailed test.

Type I and Type II Error

Decision              H0 True             H0 False

Reject H0             Type I error        Correct decision
Do not reject H0      Correct decision    Type II error

A false positive (type I error) — when you reject a true null hypothesis.

A false negative (type II error) — when you accept a false null hypothesis.

The probability of committing Type I error (False positive) is equal to the significance level or size of
critical region α.

α= P [rejecting H0 when H0 is true]

The probability of committing a Type II error (false negative) is denoted by β, and the quantity
(1 - β) is called the 'power of the test'.

β = P [not rejecting H0 when H1 is true]

Example:

A person is arrested on the charge of being guilty of burglary. A jury of judges has to decide if guilty
or not guilty.

H0: Person is innocent.

H1: Person is guilty.


Type I error will be if the Jury convicts the person [rejects H0] although the person was innocent [H0
is true].

Type II error will be the case when Jury released the person [Do not reject H0] although the person is
guilty [H1 is true].

Level of Significance(α) :

It is the probability of type 1 error. It is also the size of the critical region.

Generally, a strong control on α is desired, and in tests it is prefixed at very low levels like 0.05 (5%) or
0.01 (1%).

If H0 is not rejected at a significance level of 5%, we do not declare the null hypothesis true; we can
only say that the sample data are consistent with it at that level.

Steps involved in Hypothesis testing:


1) Setup the null hypothesis and the alternate hypothesis.
2) Decide a level of significance i.e., alpha = 5% or 1%
3) Choose the type of test you want to perform as per the sample data (z test, t test, chi
squared etc.) (we will study all the tests in next section)
4) Calculate the test statistics (z-score, t-score etc.) using the respective formula of test chosen.
5) Obtain the critical value in the sampling distribution to construct the rejection region of
size alpha using the z-table, t-table, chi table etc.
6) Compare the test statistic with the critical value and locate the position of the calculated
test statistic, i.e., is it in the rejection region or the non-rejection region.
7) I) If the test statistic lies in the rejection region, we reject the null hypothesis, i.e., the sample
data provide sufficient evidence against the null hypothesis and there is a significant
difference between the hypothesized value and the observed value of the parameter.
II) If the test statistic lies in the non-rejection region, we do not reject the null hypothesis, i.e.,
the sample data do not provide sufficient evidence against the null hypothesis and the
difference between the hypothesized value and the observed value of the parameter is due to
fluctuation of the sample.

p-value

Let’s suppose we are conducting a hypothesis test at a significance level of 1%.

Where, H0: mean <= X (we are just assuming a one-tailed test scenario.)

We obtain our critical value (based on the type of test we are using) and find that our test statistics
are greater than the critical value. So, we must reject the null hypothesis here since it lies in the
rejection region. Now if the null hypothesis is rejected at 1%, then for sure it will get rejected at the
higher values of significance level, say 5% or 10%.

What if we take a significance level lower than 1%? Would we have to reject our hypothesis then also?
Yes, there is a chance that this can happen, and this is where the "p-value" comes into play.

p-value is the smallest level of significance at which a null hypothesis can be rejected. (p < alpha)

That is why many tests nowadays report a p-value, and it is preferred since it gives more
information than the critical value alone.

For right tailed test:

p-value = P [Test statistics >= observed value of the test statistic]

For left tailed test:

p-value = P [Test statistics <= observed value of the test statistic]

For two tailed tests:

p-value = 2 * P[Test statistics >= |observed value of the test statistic|]

Decision making with p-value.

The p-value is compared to the significance level(alpha) for decision making on null hypothesis.

If p-value is greater than alpha, we do not reject the null hypothesis.

If p-value is smaller than alpha, we reject the null hypothesis.

Confidence Intervals
A confidence interval, in statistics, refers to the probability that a population parameter will fall
between two set values. Confidence intervals measure the degree of uncertainty or certainty in a
sampling method. A confidence interval can be constructed at any confidence level, with the most
common being 95% or 99%.

Calculating a Confidence Interval (Theory)

Suppose a group of researchers is studying the heights of high school basketball players. The
researchers take a random sample from the population and establish a mean height of 74 inches.
The mean of 74 inches is a point estimate of the population mean. A point estimate by itself is of
limited usefulness because it does not reveal the uncertainty associated with the estimate; you do
not have a good sense of how far away this 74-inch sample mean might be from the population
mean. What's missing is the degree of uncertainty in this single sample.

Confidence intervals provide more information than point estimates. By establishing a 95%
confidence interval using the sample's mean and standard deviation, and assuming a normal
distribution as represented by the bell curve, the researchers arrive at an upper and lower bound
that contains the true mean 95% of the time. Assume the interval is between 72 inches and 76
inches. If the researchers take 100 random samples from the population of high school basketball
players, the mean should fall between 72 and 76 inches in 95 of those samples.
If the researchers want even greater confidence, they can expand the interval to 99% confidence.
Doing so invariably creates a broader range, as it makes room for a greater number of sample
means. If they establish the 99% confidence interval as being between 70 inches and 78 inches, they
can expect 99 of 100 samples evaluated to contain a mean value between these numbers. A 90%
confidence level means that we would expect 90% of the interval estimates to include the
population parameter. Likewise, a 99% confidence level means that 95% of the intervals would
include the parameter.

The Confidence Interval is based on Mean and Standard Deviation and is given as:

For n>30

Confidence interval = X ± (z * s/√n)

where z critical value is derived from the z score table based on the confidence level.

X is the sample mean

s is sample standard deviation.

n is the sample size

We obtain the z critical values from the z-score table; since the confidence levels are usually fixed
at the standard values, the corresponding two-sided critical values (for example, 1.645 for 90%,
1.96 for 95% and 2.576 for 99%) can be used directly.

For n<30

Confidence interval = X ± (t * s/√n)

where t critical value is derived from the t score table based on the confidence level.

X is the sample mean

s is sample standard deviation.

n is the sample size.

We will see how to create confidence intervals in the examples to follow.

Now that we have got all the theory behind Hypothesis testing, let’s see different types of tests
that are used for testing. We have already seen examples on finding z-score and t-score, we will
see how they are used in the testing scenario.
General points for selecting the type of test:

Sample size    Population variance   Normality of sample   Sample variance                 Type of test
Large (>30)    Known                 Normal/Non-normal     -                               Z-test
Large (>30)    Unknown               Normal                Used to calculate the t-score   t-test
Large (>30)    Unknown               Unknown               Used to calculate the z-score   Z-test
Small (<30)    Known                 Normal                -                               Z-test
Small (<30)    Unknown               Normal                Used to calculate the t-score   t-test

Note: We will learn about other non-parametric tests and their use cases later.

Hypothesis Testing for Large Size Samples

Thumb rule: A sample of size greater than 30 is considered a large sample and as per central limit
theorem we will assume that all sampling distributions follow a normal distribution.

We are familiar with the steps of hypothesis testing as shown earlier. We also know, from the above
table, when to use which type of test.

Let’s start with a few practical examples to help our understanding more.

Note: We have learned in the previous section how to use the z-score table to calculate
probabilities. In this section we have some standard significance levels for which we need to find the
critical value (z-score), so instead of going through the whole table we will use a standardized
critical value table for those levels.
Q) A manufacturer of printer cartridges claims that a certain cartridge manufactured by him has a
mean printing capacity of at least 500 pages. A wholesale purchaser selects a sample of 100
cartridges and tests them. The mean printing capacity of the sample came out to be 490 pages with
a standard deviation of 30 pages.

Should the purchaser reject the claim of the manufacturer at a significance level of 5%?

Ans. population mean = 500


Sample mean = 490
Sample standard deviation = 30
Significance level(alpha) = 5% = 0.05
Sample size = 100
H0: Mean printing capacity >=500
H1: Mean printing capacity < 500
We can clearly see it is one tailed test (left tail).
Here, the sample is large with an unknown population variance. Since we don’t know about the
normality of the data, we will use the Z-test (from the table above).

We will use the sample variance to calculate the critical value.

Standard error (SE) = Sample standard deviation / (sample size)^0.5

= 30 / (100)^0.5 = 3

Z(test) = (Sample mean - population mean)/ (SE)

= (490-500)/3 = -3.33

Let’s find out the critical value at 5% significance level using the above Critical value table.

Z(0.05) = -1.645 (since it is a left-tailed test).

We can clearly see that Z(test) < Z(0.05), which means our test value lies in the rejection region.

Thus, we can reject the null hypothesis i.e. the manufacturer’s claim at 5% significance level.

Using p-value to test the above hypothesis:

p-value = P[Z <= -3.33] (we know P[Z <= -x] = 1 - P[Z <= x]; also remember that P[Z <= x] is the

cumulative probability up to x)

let’s use z-table to find the p-value:


p-value = 1 – 0.9996 = 0.0004

Here, the p-value is less than the significance level of 5%. So, we are right to reject the null
hypothesis.

Q) A company used a specific brand of tube lights in the past, which has an average life of 1000
hours. A new brand has approached the company with new tube lights of the same power at a
lower price. A sample of 120 of the new tube lights was tested, which yielded an average of 1010
hours with a standard deviation of 90 hours. Should the company give the contract to this new
brand at a 1% significance level?

Also, find the confidence interval.

Ans. Population mean = 1000

Sample mean = 1010

Significance level = 1% = 0.01

Sample size = 120

Sample standard deviation = 90

H0: average life of tube lights >= 1000

H1: average life of tube lights < 1000

Here, the sample is large with an unknown population variance. Since, we don’t know about the
normality of the data, we will use the Z-test (from the table above).

Standard error (SE) = Sample standard deviation / (sample size)^0.5

= 90 / (120)^0.5 = 8.22

Z(test) = (Sample mean - population mean)/ (SE)

= (1010-1000)/8.22 = 1.22

Let’s find out the critical value at 1% significance level using the above Critical value table.

Z(0.01) = -2.33 (since it is a left-tailed test).


We can clearly see that Z(test) > Z(0.01), which means our test value does not lie in the rejection
region.

Thus, we cannot reject the null hypothesis i.e. the company can give the contract at 1% significance
level.

Using p-value to test the above hypothesis:

p-value = P[Z <= 1.22]

p-value = 0.88

Here, the p-value is greater than the significance level of 1%. So, we do not reject the null
hypothesis.

The confidence interval (at the 99% level, matching the 1% significance) = 1010 ± 2.576 * 8.22
= [988.8, 1031.2] hours.

Comparing two population samples mean using Z-test

The comparison of two population means is very common. The difference between the two samples
depends on both the means and the standard deviations. Very different means can occur by chance
if there is great variation among the individual samples. In order to account for the variation, we
take the difference of the sample means, X1(mean) - X2(mean), and divide by the standard error
(shown below) in order to standardize the difference.

Because we do not know the population standard deviations, we estimate them using the two
sample standard deviations from our independent samples. For the hypothesis test, we calculate the
estimated standard deviation i.e., standard error.

The standard error (SE) is:

SE = [ (S1^2 / n1) + (S2^2 / n2) ]^0.5

Z is given as:

Z = [ (X1(mean) - X2(mean)) - (µ(1) - µ(2)) ] / SE

In this comparison case, our null assumption is that µ(1) = µ(2).

So, Z becomes (X1(mean) - X2(mean)) / SE

Q) Two samples of men are taken from two different states A and B. The mean heights of 1000 men
from state A and 2000 men from state B are 76.5 and 77 inches respectively. If the population
standard deviation for both states is the same, 7 inches, can we regard the mean heights of both
states as the same at a 5% level of significance?

Ans. n1 = 1000

n2 = 2000

X1(mean) = 76.5

X2(mean) = 77

S1=S2= 7

Let µ(1) and µ(2) be the mean heights of men from states A and B.

H0: µ(1) = µ(2)

H1: µ(1) is not equal to µ(2)

Standard error(SE) = [((S1)^2/n1 )+((S2)^2/n2)]^0.5 = 0.27

Z(test) = (X1(mean) - X2(mean)) / SE = (76.5 - 77)/0.27 = -1.85

Since, it is a two tailed test, we need to find critical value for 2.5% on each tail.

Z(2.5%) = 1.96 and Z(-2.5%) = -1.96

We can clearly see, Z(-2.5%) < Z(test) <Z(2.5%)

Thus, we cannot reject the null hypothesis.


Using p-value

p-value = 2 * P[Z >= |-1.85|] = 2 * P[Z >= 1.85]

p-value = 2 * (1 - 0.9678) (since we want Z > 1.85) = 0.0644

We can clearly see the p-value is greater than 0.05, thus we cannot reject the null hypothesis.

Hypothesis Testing for Small Size Samples

In real-world scenarios, large sample sizes are often not possible because of limited
resources such as money and time. We generally do hypothesis testing based on small samples, the
only assumption being the normality of the sample data.

We will see how to use t- tests in this section and how to use the t-score table (continued from the
topic of student t’s distribution).

All the steps involved are similar to the z-test, only we will calculate t-score instead of z-score.

Let’s start with an example:

Q) A tyre manufacturer claims that the average life of a particular category of its tyres is
18000 km when used under normal driving conditions. A random sample of 16 tyres was tested.
The mean and SD of the lives of the tyres in the sample were 20000 km and 6000 km respectively.
Assuming that the life of the tyres is normally distributed, test the claim of the manufacturer at a 1%
level of significance. Construct the confidence interval also.
Ans: population mean = 18000 km

Sample mean = 20000 km

Standard deviation = 6000 km

Sample size = 16

H0: population mean = 18000km

H1: population mean is not equal to 18000km (It will be a two tailed test.)

Since the sample size is small, the population variance is unknown and the sample is normally
distributed, we will use the t-test for this.

Standard error = [6000/(16)^0.5] = 1500

t-score(test) = (20000 - 18000)/1500 = 1.33

Let’s find out the critical t- value, for significance level 1% (two tailed) and degree of freedom = 16-1
= 15

t(0.005) = 2.947 and t(-0.005) = -2.947

We can see that, t (- 0.005) < t-score(test) = 1.33 < t (0.005)

So, the value lies in non-rejection region and we cannot reject our null hypothesis.

Using the p-value

p-value = 2 * P[t > |1.33|] (two-tailed)

degree of freedom = 15

let’s see the p-value from the table for the above values:
from the table we can see: 0.20 < p < 0.30

Here, p > significance level (1%), thus we cannot reject the null hypothesis.

Confidence interval = [20000 - 2.947*1500, 20000 + 2.947*1500]

= [15579.5, 24420.5]

Comparing two population samples mean using t-test

Just like the case we saw with the z-test, the t-test is actually more suitable for the comparison of
two population samples, because in practice the population standard deviations of the two
populations are rarely known.

We assume a normal distribution of samples and though the population standard deviations are
unknown, we assume them to be equal.

Also, samples are independent to each other.

Let’s assume two independent samples with size n1 and n2:

Degree of freedom = n1 + n2 -2

Standard Error (SE):

SE = S * [ (1/n1) + (1/n2) ]^0.5

where S^2 is the pooled sample variance:

Variance(sample) = S^2 = ( ∑[X - X(mean)]^2 + ∑[Y - Y(mean)]^2 ) / (n1 + n2 - 2)

The test statistic t in this case is given as:

t = ( X(mean) - Y(mean) ) / SE


Q) The means of two random samples of sizes 10 and 8 from two normal populations are 210.40 and
208.92. The sums of squares of deviations from their means are 26.94 and 24.50
respectively. Assuming populations with equal variances, can we consider that the normal
populations have equal means? (Significance level = 5%)

Ans.

n1 =10 , n2= 8 , X(mean) = 210.40 , Y(mean) = 208.92

Pooled std. deviation (S) = [ (26.94 + 24.50)/(10 + 8 - 2) ]^0.5 = 1.79

H0: Population means are equal

H1: Population means are not equal (two tailed test)

Standard error = 1.79 * (1/10 + 1/8)^0.5 = 0.84

t(test) = (X(mean) - Y(mean))/0.84 = 1.48/0.84 = 1.76

Degree of freedom = 10 +8 -2 = 16

Let's look for the critical value in the t-table for significance 5% (two-tailed) and d.o.f. 16:

t(0.025) = 2.120 and t(-0.025) = -2.120

We can see that t(-0.025) < t-score(test) = 1.76 < t(0.025)

So, the value lies in non-rejection region and we cannot reject our null hypothesis.
Paired Sample t-Tests
A paired t-test is used to compare two population means where you have two samples which are not
independent e.g. Observations recorded on a patient before and after taking medicine, weight of a
person before and after they started working out etc.

Now, instead of two separate populations, we create a new column with difference of the
populations, and instead of testing equality of two population mean we test the hypothesis that
mean of the population difference is zero. Also, we assume the samples are of same size. Population
variances are not known and not necessarily equal.

Standard error = standard deviation of the differences / (n^0.5)

t = D(mean) / standard error, where D(mean) is the mean of the differences.

Q) A group of 20 students were tested to see whether their marks improved after a
special lecture on the subject.

Marks before the lecture   Marks after the lecture   Difference (D)   (D - D_mean)^2
18                         22                        4                3.24
21                         25                        4                3.24
16                         17                        1                1.44
22                         24                        2                0.04
19                         15                        -4               38.44
24                         26                        2                0.04
17                         20                        3                0.64
21                         23                        2                0.04
13                         18                        5                7.84
18                         20                        2                0.04
15                         15                        0                4.84
16                         15                        -1               10.24
18                         21                        3                0.64
14                         16                        2                0.04
19                         22                        3                0.64
20                         24                        4                3.24
12                         18                        6                14.44
22                         25                        3                0.64
14                         18                        4                3.24
19                         18                        -1               10.24
Total                                                44               103.2

Difference mean D(mean) = 44/20 = 2.2
Variance of differences = 103.2/19 = 5.43
Standard deviation of differences = 2.33
H0: Mean difference <= 0 (the lecture did not improve marks)

H1: Mean difference > 0 (right-tailed test)

Standard error = 2.33 / (20)^0.5 = 0.52

t = 2.2 / 0.52 = 4.23

df (degrees of freedom) = 19

At a significance level of 5%, 19 df and a one-tailed test, the critical value is:

t(5%) = 1.729

Since t = 4.23 is greater than the critical t, it lies in the rejection region; hence we reject
the null hypothesis and conclude that the marks improved significantly after the lecture.

Testing of Hypothesis for Population Variance Using the Chi-Squared Test

Till now we were dealing with hypothesis testing for the means of various samples, but sometimes it
is also necessary or desirable to test the variance of the population under study. Suppose we obtain a
certain variance for a sample which is different from the population variance; we then need to find
out whether the variance is within an acceptable limit or whether it deviates more than the desired
variance of the population.

The chi-square test for variance is a statistical procedure with a chi-square-distributed test statistic
that is used for determining whether the variance of a variable obtained from a particular sample
has the same size as the known population variance of the same variable. The test statistic of the
chi-square test for variance is calculated as follows:

χ² = (n - 1) * s² / σ²

where n is the sample size, s is the sample standard deviation, and σ is the population std. deviation

As similar with other tests, the critical value is obtained through a chi table on the basis of degree of
freedom and significance level.

We will see about it with an example:

Q) The variance of a certain size of towel produced by a machine has been 7.2 over a long period of
time. A random sample of 20 towels gave a variance of 8. You need to check whether the variability
of the towels has increased, at a 5% level of significance, assuming a normally distributed sample.

Ans.

n = 20

sample variance = 8

population variance = 7.2

H0: variance <= 7.2


H1: variance > 7.2 (Right tailed test)

Using chi squared test,

χ-square = (20-1) * 8/7.2 = 21.11

Critical value for D.o.f = 19 and 5% significance level,

Critical value = 30.14

Here, the chi value is less than the critical value, thus we do not reject the null hypothesis.

Chi-Squared Test for Categorical Variables

The chi-square test is widely used to estimate how closely the distribution of a categorical variable
matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical
variables are independent of one another (the test of independence).

In mathematical terms, the χ2 variable is the sum of the squares of a set of normally distributed
variables.

Suppose that a particular value Z1 is randomly selected from a standardized normal distribution.
Then suppose another value Z2 is selected from the same standardized normal distribution. If there
are d degrees of freedom, then let this process continue until d different Z values are selected from
this distribution. The χ2 variable is defined as the sum of the squares of these Z values.
This sum of squares of d normally distributed variables has a distribution which is called
the χ2 distribution with d degrees of freedom.

Chi Squared test For Goodness Of fit

Chi Square test for testing goodness of fit is used to decide whether there is any difference between
the observed (experimental) value and the expected (theoretical) value.

A goodness of fit test is a test that is concerned with the distribution of one categorical variable.

The null and alternative hypotheses reflect this focus:

H0: The population distribution of the variable is the same as the proposed distribution.

HA: The distributions are different

The chi-square statistic is calculated as:

χ² = ∑ (Observed - Expected)² / Expected

Where, Observed = actual count values in each category

Expected = the predicted (expected) counts in each category if the null hypothesis were true.

Let’s see an example for better understanding:

Q) A survey conducted by a pet food company determined that 60% of dog owners have only one
dog, 28% have two dogs, and 12% have three or more. You were not convinced by the survey and
decided to conduct your own survey, collecting the data below.

Data: Out of 129 dog owners, 73 had one dog, 38 had two dogs, and the remaining 18 had three or more.

Determine whether your data support the results of the survey by the pet food company.

Use a significance level of 0.05

Ans: E(1 dog) =0.60

E(2 dog) = 0.28

E(3 dogs) = .12

H0: proportions of dogs is equal to survey data

H1: proportions of dogs is not equal to survey data


                          1 Dog              2 Dogs             3+ Dogs            Total
Observed                  73                 38                 18                 129
Expected                  0.60*129 = 77.4    0.28*129 = 36.12   0.12*129 = 15.48   129
Observed - Expected       -4.4               1.88               2.52
(Observed - Expected)^2   19.36              3.53               6.35

Chi statistic = 19.36/77.4 + 3.53/36.12 + 6.35/15.48 = 0.758

Let’s see the critical value using d.o.f 2 and significance 5%:

Critical chi = 5.99

Here, our chi statistic is less than the critical chi. Thus, we will not reject the null hypothesis.

Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or
more groups are significantly different from each other by analyzing comparisons of variance
estimates. ANOVA checks the impact of one or more factors by comparing the means of different
samples.

When we have only two samples, t-test and ANOVA give the same results. However, using a t-test
would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests
for comparing more than two samples, it will have a compounded effect on the type 1 error.

Assumptions in ANOVA

1) Assumption of Randomness: The samples should be selected in a random way such that
there is no dependence among the samples.
2) The experimental errors of the data are normally distributed.
3) Assumption of equality of variance (homoscedasticity) and zero correlation: the variance
should be constant in all the groups and all the covariances among them zero, although the
means may vary from group to group.
One Way ANOVA

When we are comparing groups based on only one factor variable, it is said to be a one-way
analysis of variance (ANOVA).

For example, if we want to compare whether or not the mean output of three workers is the same
based on the working hours of the three workers.

The ANOVA model:

Mathematically, ANOVA can be written as:

xij = μi + εij

where x are the individual data points (i and j denote the group and the individual observation), ε is
the unexplained variation and the parameters of the model (μ) are the population means of each
group. Thus, each data point (xij) is its group mean plus error.

Let’s understand the working procedure of One-way Anova with an example:

Sample (k)   Obs 1   Obs 2   Obs 3   Mean
1            x11     x12     x13     Xm1
2            x21     x22     x23     Xm2
3            x31     x32     x33     Xm3
4            x41     x42     x43     Xm4

Suppose we are given the above data set: we have an independent variable x and 4 samples
with different values of x, and each sample has its respective mean as shown in the last column.

Grand Mean

Mean is a simple or arithmetic average of a range of values. There are two kinds of means that we
use in ANOVA calculations, which are separate sample means and the grand mean.

The grand mean (Xgm) is the mean of the sample means, or equivalently (when all the samples have
the same size) the mean of all observations combined, irrespective of the sample.

Xgm = (Xm1 + Xm2 + Xm3 + Xm4 +………. Xmk)/k where, k is the number of samples.

For our dataset, k = 4

Xgm = (Xm1 + Xm2 + Xm3 + Xm4)/4


Between Group Variability (SST)

It refers to variations between the distributions of individual groups (or levels) as the values within
each group are different.

Each sample is looked at and the difference between its mean and grand mean is calculated to
calculate the variability. If the distributions overlap or are close, the grand mean will be similar to
the individual means whereas if the distributions are far apart, difference between means and grand
mean would be large.

Let’s calculate Sum of Squares for between group variability:

SSbetween = n1 * (Xm1 - Xgm)2 + n2 * (Xm2 - Xgm)2 + n3 * (Xm3 - Xgm)2 + . . . . . . . . . . . + nk * (Xmk - Xgm)2

where, n1, n2,....,nk are the number of observations in each sample

Degree of freedom for between group variability = number of samples – 1 = k-1

MeanSSbetween = SSbetween/(k-1)

In our dataset example we have k =4 and nk = 3, so for our dataset:

SSbetween = 3 * (Xm1 - Xgm)2 + 3 * (Xm2 - Xgm)2 + 3 * (Xm3 - Xgm)2 + 3 * (Xm4 - Xgm)2

MeanSSbetween(MSST) = SSbetween/ (4-1) = SSbetween/3


Within Group Variability (SSE)

It refers to variations caused by differences within individual groups (or levels) as not all the values
within each group are the same. Each sample is looked at on its own and variability between the
individual points in the sample is calculated. In other words, no interactions between samples are
considered.

We can measure Within-group variability by looking at how much each value in each sample differs
from its respective sample mean. So, first, we’ll take the squared deviation of each value from its
respective sample mean and add them up. This is the sum of squares for within-group variability.

Degree of freedom for within-group variability = N - k

where N is the total number of observations and k is the number of samples.

In our dataset example we have k =4 and N =12, so for our dataset:

SSwithin = (X11 - Xm1)2 + (X12 - Xm1)2 + (X13 - Xm1)2 +

(X21 - Xm2)2 + (X22 - Xm2)2 + (X23 - Xm2)2 +

(X31 - Xm3)2 + (X32 – Xm3)2 + (X33 – Xm3)2 +

(X41 - Xm4)2 + (X42 - Xm4)2 + (X43 - Xm4)2

Degree of freedom = N-k = 12 - 4 = 8

MeanSSwithin(MSSE) = SSwithin/ 8

Total Sum of Squares (TSS)

TSS = SSbetween + SSwithin = SST + SSE


Hypothesis In ANOVA

The Null hypothesis in ANOVA is valid when all the sample means are equal, or they don’t have any
significant difference. Thus, they can be considered as a part of a larger set of the population. On the
other hand, the alternate hypothesis is valid when at least one of the sample means is different from
the rest of the sample means. In mathematical form, they can be represented as:

H0: µ1 = µ2 = ... = µk

H1: µl ≠ µm for at least one pair (l, m)

where µl and µm belong to any two sample means out of all the samples considered for the test. In
other words, the null hypothesis states that all the sample means are equal, or the factor did not
have any significant effect on the results. Whereas the alternate hypothesis states that at least one
of the sample means is different from another.

To test the null hypothesis, test statistics is given by the F-statistic.

F-Statistic

The statistic which measures whether the means of different samples are significantly different is
called the F-ratio. The lower the F-ratio, the more similar the sample means; in that case, we cannot
reject the null hypothesis.

F = MeanSSbetween / MeanSSwithin

F = MSST / MSSE with k-1 and N-k degrees of freedom.

This above formula is intuitive. The numerator term in the F-statistic calculation defines the
between-group variability. As we read earlier, as between group variability increases, sample means
grow further apart from each other. In other words, the samples are more probable to belong to
totally different populations.

This F-statistic calculated here is compared with the F-critical value for making a conclusion.

F-critical is calculated using the F-table, degree of freedoms and Significance level.

If the observed value of F is greater than the F-critical value then we reject the null hypothesis.

Let’s see an example on One-way ANOVA analysis:


Q) A survey was conducted to test the knowledge of Mathematics among 4 different schools in a
city. The sample data collected for the marks of students (out of 10) are below:

School Marks
School 1 8 6 7 5 9
School 2 6 4 6 5 6 7
School 3 6 5 5 6 7 8 5
School 4 5 6 6 7 6 7

Ans:

H0: All the schools have equal means.

H1: Difference in means of schools is significant.

k=4

N = 24

Marks and squared deviations from each school's mean:

School 1 (S1)   School 2 (S2)   School 3 (S3)   School 4 (S4)   (S1-S1_mean)^2   (S2-S2_mean)^2   (S3-S3_mean)^2   (S4-S4_mean)^2
8               6               6               5               1                0.111            0                1.361
6               4               5               6               1                2.778            1                0.028
7               6               5               6               0                0.111            1                0.028
5               5               6               7               4                0.444            0                0.694
9               6               7               6               4                0.111            1                0.028
                7               8               7                                1.778            4                0.694
                                5                                                                 1
Total: 35       34              42              37              10               5.333            8                2.833
Mean:  7        5.667           6               6.167

Grand mean = (35 + 34 + 42 + 37)/24 = 6.167 (since the sample sizes differ, we take the mean of all
24 observations rather than the mean of the four sample means)

SSbetween = 5 * (7 - 6.167)^2 + 6 * (5.667 - 6.167)^2 + 7 * (6 - 6.167)^2 + 6 * (6.167 - 6.167)^2

= 3.47 + 1.50 + 0.19 + 0 = 5.17

MSST = 5.17 / (4-1) = 1.72

SSwithin = 10 + 5.33 + 8 + 2.83 = 26.17

MSSE = 26.17/(N-k) = 26.17/20 = 1.31

F-statistic = MSST/MSSE = 1.72 / 1.31 = 1.32


Critical F-value

At 5% significance and degree of freedom (3, 20):

F-critical = 3.098

Clearly, our F-statistic is less than F-critical. So, we cannot reject our null hypothesis.

Two Way ANOVA

Two-way ANOVA allows to compare population means when the populations are classified according
to two independent factors.

Example: We might like to look at SAT scores of students who are male or female (first factor) and
either have or have not had a preparatory course (second factor).

The Two-way ANOVA model:

Mathematically, the two-way ANOVA model can be written as:

xij = μ + αi + βj + εij

where xij is the observation in level i of the first factor and level j of the second, μ is the overall
mean, αi and βj are the effects of the two factors, and ε is the unexplained variation. Thus, each data
point is the overall mean plus the two factor effects plus error.

Just like the one-way model, we calculate the sums of squares between groups; in this case there
will be two SSTs, one for each factor, plus the sum of squares of errors (within).

We calculate the F-statistic for each MSST, compare each with the F-critical value, and thereby
determine the effect of each factor on our assumption.
Example:

Below is the data of yield of crops based on temperature and salinity. Calculate the ANOVA for the
table.

Yield of crops by temperature and salinity (the categorical variable):

Temperature (in F)   Salinity 700   Salinity 1400   Salinity 2100   Total   Mean (temp)
60                   3              5               4               12      4
70                   11             10              12              33      11
80                   16             21              17              54      18
Total                30             36              33              99
Mean (salinity)      10             12              11                      Grand mean = 11

Ans:

Hypothesis for Temperature:

H0: Yield is the same for all temperatures.

H1: Yield varies with temperature with a significant difference.

Hypothesis for Salinity:

H0: Yield is the same for all salinity levels.

H1: Yield varies with salinity with a significant difference.

Grand mean = 11

N = 9, K =3, nt= 3, ns = 3

SSbetween_temp = 3 *(4-11)^2 + 3*(11-11)^2 + 3*(18-11)^2 = 294

MSSTtemp = 294 / 2 = 147

SSbetween_salinity = 3 * (10-11)^2 + 3 * (12-11)^2 + 3 * (11-11)^2 = 6

MSSTsalinity = 6 / 2 = 3

In such questions calculating SSE directly can be tricky, so instead let's calculate TSS; then we
can subtract the SST values from it to get SSE.

To calculate the total sum of squares, we need to find the sum of the squares of the difference of
each value from the grand mean.

TSS = (3-11)^2 + (5-11)^2 + (4-11)^2 + (11-11)^2 + (10-11)^2 + (12-11)^2 + (16-11)^2 + (21-11)^2 +
(17-11)^2

TSS = 312
SSE = TSS - SSbetween_temp - SSbetween_salinity = 312 - 294 - 6 = 12

Degree of freedom for SSE = (nt-1)( ns-1) =(3-1)(3-1) = 4

MSSE = SSE/4 = 3

F-Test For temperature

Ftemp = MSSTtemp / MSSE = 147/3 = 49

F-Test For Salinity

Fsalinity = MSSTsalinity/MSSE = 3/3 = 1

F-critical for 5% significance and degrees of freedom (k-1, (p-1)(q-1)) i.e. (2, 4):

F-critical = 6.944

Clearly, we can see that Ftemp is greater than F-critical, so we reject the null hypothesis and support
that temperature has a significant effect on yield.

On the other hand, Fsalinity is less than the F-critical value, so we do not reject the null hypothesis and
conclude that salinity does not have a significant effect on the yield.

Bayes Statistics (Bayes Theorem)


Bayes statistics is used for calculating conditional probabilities.

P(Ak | B) = P(Ak ∩ B) / [ P(A1 ∩ B) + P(A2 ∩ B) + ... + P(An ∩ B) ]

where P(Ak ∩ B) = P(Ak) P(B | Ak)

It can also be written as follows:

P(Ak | B) = P(Ak) P(B | Ak) / [ P(A1) P(B | A1) + P(A2) P(B | A2) + ... + P(An) P(B | An) ]

When to Apply Bayes' Theorem:

• The sample space is divided (partitioned) into a set of events { A1, A2, ..., An }.
• An event B is present within the sample space, for which P(B) > 0.
• We want to compute a conditional probability of the form P(Ak | B).
• One of the two sets of probabilities is known:
  - for each Ak, the probability P(Ak ∩ B), or
  - for each Ak, the probabilities P(Ak) and P(B | Ak).

Example:

Problem

There is a marriage ceremony in the desert and Marie is getting married tomorrow. In past years, it
has rained only five days a year. The weatherman has forecast rain for tomorrow. When it actually
rains, the weatherman correctly forecasts rain 90% of the time. When it does not rain, he incorrectly
forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding day?

Solution: The sample space is defined by two events - it rains or it does not rain. Furthermore, a
third event occurs when the weatherman predicts rain.

Event A. It rains on Marie's wedding day.

Event B. It does not rain on Marie's wedding day.

Event C. The weatherman predicts rain.

• P(A) = 5/365 = 0.0137 [In a year it rains on only 5 days.]
• P(B) = 360/365 = 0.9863 [It does not rain 360 days out of the year.]
• P(C | A) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]
• P(C | B) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]

We want to know P(A | C), the probability that it will rain on Marie's wedding day given the forecast:

P(A | C) = P(A) P(C | A) / [ P(A) P(C | A) + P(B) P(C | B) ]

P(A | C) = (0.0137)(0.9) / [ (0.0137)(0.9) + (0.9863)(0.1) ]

P(A | C) = 0.111

Even when the weatherman predicts rain, it rains only about 11% of the time. So, there is a
good chance that Marie will not get rained on at her wedding.
