Advanced Analytics Complete Notes March 24
Session 1 & 2
Analytics is the systematic exploration and analysis of data to uncover meaningful patterns,
insights, and trends that can inform decision-making and drive improvements in various
aspects of business, science, and other domains. It involves the use of statistical,
mathematical, and computational techniques to extract actionable insights from data.
Understanding Data: Analytics begins with understanding the data available. This includes
both structured data (such as databases and spreadsheets) and unstructured data (such as text
documents, images, and videos). The data can come from various sources, including business
transactions, customer interactions, sensors, social media, and more.
Data Preparation: Before analysis can begin, the raw data often needs to be cleaned,
transformed, and formatted to make it suitable for analysis. This process involves tasks such
as removing duplicates, handling missing values, standardizing formats, and integrating data
from different sources.
Descriptive Analytics: Descriptive analytics focuses on summarizing and describing the
characteristics of the data. This may involve generating summary statistics, visualizing data
through charts and graphs, and exploring relationships between variables. Descriptive
analytics helps stakeholders understand what has happened in the past and provides context
for further analysis.
Diagnostic Analytics: Diagnostic analytics aims to understand why certain events occurred
by identifying patterns and correlations in the data. It involves digging deeper into the data to
uncover the root causes of observed phenomena or trends. Diagnostic analytics often involves
hypothesis testing, correlation analysis, and causal inference techniques.
Predictive Analytics: Predictive analytics leverages historical data to forecast future
outcomes or trends. This involves building statistical or machine learning models that can
make predictions based on patterns observed in the data. Predictive analytics can be used for
various purposes, such as sales forecasting, customer churn prediction, risk assessment, and
demand forecasting.
Prescriptive Analytics: Prescriptive analytics goes beyond prediction to recommend actions
or decisions that can optimize outcomes. This involves using optimization and simulation
techniques to explore different scenarios and identify the best course of action given specific
constraints and objectives. Prescriptive analytics helps organizations make data-driven
decisions to improve efficiency, minimize risks, and maximize outcomes.
Continuous Improvement: Analytics is an iterative process that requires continuous
improvement and refinement. As new data becomes available and business conditions
change, analytics models and strategies need to be updated and adapted accordingly.
Organizations should establish feedback loops to incorporate insights gained from analytics
into decision-making processes and drive ongoing improvement.
Overall, analytics enables organizations to leverage data as a strategic asset to gain
competitive advantage, improve operational efficiency, enhance customer experiences, and
drive innovation. By harnessing the power of analytics, businesses and other entities can
make smarter decisions and achieve better outcomes in an increasingly data-driven world.
Data Analytics Life Cycle
The data analytics lifecycle is a structured approach to extracting insights and value from
data. It typically consists of several interconnected stages that guide the process from
defining the problem to implementing solutions. Here's a breakdown of the data analytics
lifecycle:
Problem Definition:
• Identify the business problem or opportunity that analytics can address.
• Define clear objectives and key performance indicators (KPIs) to measure success.
• Ensure alignment with organizational goals and stakeholder needs.
Data Collection:
• Identify relevant data sources both internal and external to the organization.
• Gather data from databases, spreadsheets, files, APIs, sensors, social media, etc.
• Ensure data quality, completeness, and relevance for analysis.
Data Preparation:
• Cleanse the data by removing duplicates, correcting errors, and handling missing or
inconsistent values.
• Transform the data into a suitable format for analysis (e.g., normalization,
aggregation, or feature engineering).
• Integrate data from multiple sources if necessary.
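For illustration, here is a minimal pandas sketch of the cleansing steps above; the DataFrame and its column names are hypothetical, invented for this example.

import pandas as pd

# Hypothetical raw data; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 100.0, None, 250.0],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # handle missing values
df["date"] = pd.to_datetime(df["date"])                     # standardize formats
print(df)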
Exploratory Data Analysis (EDA):
• Explore the dataset to understand its structure, distribution, and relationships.
• Visualize data using charts, graphs, and statistical summaries.
• Identify patterns, trends, outliers, and potential insights.
Feature Engineering:
• Select, create, or transform features that are relevant and predictive for the analysis.
• Apply techniques such as dimensionality reduction, encoding categorical variables, or
deriving new features.
Modeling:
• Select appropriate analytical techniques or algorithms based on the problem and data
characteristics.
• Split the data into training, validation, and testing sets.
• Train machine learning or statistical models using the training data.
• Tune hyperparameters and evaluate model performance using validation data.
• Validate the model's performance on unseen data using the testing set.
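A minimal scikit-learn sketch of the split/train/validate/test workflow above; the synthetic dataset, the model choice (logistic regression) and the split ratios are illustrative assumptions, not prescribed by these notes.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)

# Split off a test set first, then carve a validation set out of the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))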
Interpretation and Evaluation:
• Interpret the model results in the context of the problem and business objectives.
• Evaluate the model's performance using relevant metrics (e.g., accuracy, precision,
recall, or AUC).
• Assess the impact of the analytics solution on the business problem and its alignment
with KPIs.
Deployment:
• Deploy the analytics solution into production or operational systems.
• Integrate the model into decision-making processes or business workflows.
• Monitor the model's performance in real-world scenarios and collect feedback for
continuous improvement.
Monitoring and Maintenance:
• Establish monitoring mechanisms to track the performance and behavior of the
deployed model.
• Monitor data quality, model drift, and other relevant metrics over time.
• Retrain or update the model periodically with new data to ensure relevance and
accuracy.
Iterative Improvement:
• Continuously refine and improve the analytics solution based on feedback, changing
business requirements, and new data.
• Iterate through the lifecycle stages as needed to address evolving challenges and
opportunities.
By following the data analytics lifecycle, organizations can systematically leverage data to
derive actionable insights, make informed decisions, and drive business value.
Session 3 & 4
Prerequisites:
Factorial Notation:
We define, 0! = 1
For any positive integer n,
n! = n(n − 1)(n − 2)….1
e.g.
1! = 1
2! = 2×1 = 2
3! = 3×2×1 = 6
4! = 4×3×2×1 = 24, and so on.
Consider 6! = 6×5×4×3×2×1, which we can write as
6! = 6×(5×4×3×2×1) = 6×5!, i.e. n! = n×[(n − 1)!]
6! = 6×5×(4×3×2×1) = 6×5×4!, i.e. n! = n×(n − 1)×[(n − 2)!] and so on.
Permutation:
Consider selection of ‘r’ objects out of n (r ≤ n). If the order in which the objects are selected
is important then such a selection is called as a Permutation.
The number of such permutations is denoted by nPr and
nPr = n! / (n − r)!
e.g.
nP0 = n! / (n − 0)! = n!/n! = 1
nP1 = n! / (n − 1)! = n(n − 1)! / (n − 1)! = n
nPn = n! / (n − n)! = n!/0! = n!
Combination:
Consider selection of ‘r’ objects out of n (r ≤ n). If the order in which the objects are selected
is not important then such a selection is called as a Combination.
The number of such combinations is denoted by nCr and
nCr = n! / [r! (n − r)!]
e.g.
nC0 = n! / [0! (n − 0)!] = n!/n! = 1
nC1 = n! / [1! (n − 1)!] = n(n − 1)! / (n − 1)! = n
nCn = n! / [n! (n − n)!] = n! / (n! 0!) = 1
10C3 = 10! / [3! (10 − 3)!] = 10! / (3! 7!) = (10×9×8×7!) / (3! 7!) = (10×9×8) / 3! = 120
10C7 = 10! / [7! (10 − 7)!] = 10! / (7! 3!) = 10C3 = 120
12C4 = __________
100C97 = 100C3 = __________
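These values can be checked with Python's standard math module (math.perm and math.comb are available from Python 3.8 onwards):

import math

print(math.factorial(6))   # 720
print(math.perm(10, 3))    # 10P3 = 10!/7! = 720
print(math.comb(10, 3))    # 10C3 = 120
print(math.comb(10, 7))    # 10C7 = 120, since nCr = nC(n-r)
print(math.comb(12, 4))    # 12C4
print(math.comb(100, 97))  # 100C97 = 100C3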
Basic Terms:
Random Experiment:
Consider an action which is repeated under essentially identical conditions. If it results in any
one of the several possible outcomes, but it is not possible to predict which outcome will
appear, then such an action is called as a Random Experiment
Sample Space:
The set of all possible outcomes of a random experiment is called the sample space. The
sample space is denoted by S or Greek letter omega (Ω). The number of elements in S is
denoted by n(S). A possible outcome is also called a sample point since it is an element in the
sample space.
All the elements of the sample space together are called as ‘exhaustive cases’.
Event:
Any subset of the sample space is called as an ‘Event’ and is denoted by any capital letter like
A, B, C or A1, A2, A3,..
Favourable cases:
The cases which ensure the happening of an event A, are called as the cases favourable to the
event A. The number of cases favourable to event A is denoted by n(A).
Types of Events
Elementary Event: An event consisting of a single outcome is called an elementary event.
Certain Event: The sample space is called the certain event if all possible outcomes are
favourable outcomes. i.e. the event consists of the whole sample space.
Impossible Event: The empty set is called impossible event as no possible outcome is
favorable
Union of Two Events
Let A and B be two events in the sample space S. The union of A and B is denoted by A∪B
and is the set of all possible outcomes that belong to at least one of A and B.
Let S = Set of all positive integers not exceeding 50;
Event A = Set of elements of S that are divisible by 6;
Event B = Set of elements of S that are divisible by 9.
A = {6,12,18,24,30,36,42,48}
B = {9,18,27,36,45}
∴ A∪B = {6,9,12,18,24,27,30,36,42,45, 48} is the set of elements of S that are divisible by 6
or 9.
Exhaustive Events
Two events A and B in the sample space S are said to be exhaustive if A∪B = S
Intersection of Two Events
Let A and B be two events in the sample space S.
The intersection of A and B is the event consisting of outcomes that belong to both the events
A and B.
Let S = Set of all positive integers not exceeding 50,
Event A = Set of elements of S that are divisible by 3,
Event B = Set of elements of S that are divisible by 5.
Then A = {3,6,9,12,15,18,21,24,27,30,33, 36,39,42,45,48},
B = {5,10,15,20,25,30,35,40,45,50}
∴ A∩B = {15,30,45} is the set of elements of S that are divisible by both 3 and 5.
Chance is the occurrence of events in the absence of any obvious intention or cause. It is,
simply, the possibility of something happening. When the chance is defined in Mathematics,
it is called probability.
Probability is the extent to which an event is likely to occur, measured by the ratio of the
favourable cases to the whole number of cases possible.
Mathematically, the probability of an event occurring is equal to the ratio of a number of cases
favourable to a particular event to the number of all possible cases.
Importance of Probability
The concept of probability is of great importance in everyday life. Statistical analysis is based
on this valuable concept. Infact the role played by probability in modern science is that of a
substitute for certainty.
i. Probability theory is very helpful for making predictions. Estimates and predictions form
an important part of research investigation, and with the help of statistical methods we
make such estimates for further analysis. Thus, statistical methods depend largely on the
theory of probability.
ii. It is one of the inseparable tools for all types of formal studies that involve uncertainty.
iii. The concept of probability is applied not only in business and commercial lines, but also
in many other fields.
iv. Before learning statistical decision procedures, one must know the theory of probability.
v. The characteristics of the Normal Probability Curve are based upon the theory of
probability.
Operation            Interpretation
A', Ā or Ac          Not A
A∪B                  At least one of A and B
A∩B                  Both A and B
(A'∩B) ∪ (A∩B')      Exactly one of A and B
(A'∩B') = (A∪B)'     Neither A nor B
Elementary Properties of Probability:
1) A' is complement of A and therefore P(A') = 1 − P(A)
Ex-2: If three coins are tossed simultaneously, find the probability of getting
a) exactly one head
b) at least one head
c) no head.
Ex-3: Find the probability that a leap year selected at random contains 53 Sundays.
Ex-4: In a housing society, half of the families have a single child per family, while the
remaining half have two children per family. If a child is picked at random, find the
probability that the child has a sibling.
Ex-5: A box contains 6 white and 4 black balls. 2 balls are selected at random and the colour
is noted. Find the probability that
a) Both balls are white
b) Both balls are black.
c) Balls are of different colours.
Ex-6: If all the letters of the word EAR are arranged at random. Find the probability that the
word begins and ends with a vowel.
Ex-7: If all the letters of the word EYE are arranged at random find the probability that the
word begins and ends with vowels.
Ex-8: If all the letters of word EQUATION are arranged at random, find the probability that
the word begins and ends with a vowel.
Ex-9: 7 boys and 3 girls are arranged in a row. Find the probability that there is at least one
boy between 2 girls.
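For word-arrangement exercises like Ex-6, the sample space is small enough to enumerate directly; a short Python check:

from itertools import permutations

word = "EAR"
vowels = set("AEIOU")
arrangements = list(permutations(word))
favourable = [a for a in arrangements if a[0] in vowels and a[-1] in vowels]
# n(A)/n(S) = 2/6 = 1/3 for the word EAR
print(len(favourable), "/", len(arrangements))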
Note:
P(A) + P(𝐴̅) = 1
i.e. P(A) = 1 − P(𝐴̅)
Ex-11: If 3 dice are tossed simultaneously, find the probability that the sum of the 3 numbers
is less than 17.
Note:
1) A∪B : either A or B or both, i.e. at least one of A & B
Ex-13: The probability that a particular film gets award for best direction is 0.7. The
probability that it gets award for best acting is 0.4. The probability that the film gets award
for both is 0.2. Find the probability that the film gets
a) at least one award
b) no award
Ex-14: Two cards are selected at random from a pack of 52 cards, find the probability that
two cards are
a) Red or face cards
b) Aces or jacks.
Conditional Probability
Let S be a sample space associated with the given random experiment.
Let A and B be any two events defined on the sample space S.
Then the probability of occurrence of event A under the condition that event B has already
occurred and P(B) ≠0 is called conditional probability of event A given B and is denoted by
P(A/B).
P(A/B) = P(A ∩ B) / P(B), P(B) ≠ 0
Multiplication theorem
Let S be sample space associated with the given random experiment.
Let A and B be any two events defined on the sample space S.
Then the probability of occurrence of both the events is denoted by P(A∩B)
and is given by P(A∩B) = P(A).P(B/A)
Independent Events
Let S be sample space associated with the given random experiment.
Let A and B be any two events defined on the sample space S. If the occurrence of either
event, does not affect the probability of the occurrence of the other event, then the two events
A and B are said to be independent.
Thus, if A and B are independent events then,
P(A/B) = P(A/B') = P(A) and P(B/A) = P(B/A') = P(B)
If A and B are independent events then P(A∩B) = P(A).P(B)
(P(A∩B) = P(A).P(B/A) = P(A).P(B) ∴ P(A∩B) = P(A).P(B))
If A and B are independent events then
a) A and B' are also independent events
b) A' and B' are also independent events
Ex-15: 2 shooters are firing at target. The probability that they hit the target are 1/3 and 1/2
respectively. If they fire independently find the probability that
a) both hit the target.
b) Nobody hits the target.
c) At least one hits the target.
d) Exactly one hits the target.
Ex-18: Three vendors were asked to supply a component. The respective probabilities that
the component supplied by them is ‘good’ are 0.8, 0.7 and 0.5. Each vendor supplies only one
component. Find the probability that at least one component is ‘good’.
Ex-19: The chance of a student passing a test is 20%. The chance of student passing the test
and getting above 90% marks is 5%. Given that a student passes the test, find the probability
that the student gets above 90% marks.
Ex-20: A box contains 6 white and 4 black balls. One ball is selected at random and its colour
is noted. The ball is replaced and two balls of the opposite colour are added and then second
ball is selected at random find the probability that both balls are white.
Ex-21: A shop has equal number of LED bulbs of two different types. The probability that the
life of an LED bulb is more than 100 hours given that it is of type-1 is 0.7 and given that it is
of type-2 is 0.4. If an LED bulb is selected at random, find the probability that the life of the
bulb is more than 100 hours.
Bayes Theorem
Bayes' Theorem, named after 18th-century British mathematician Thomas Bayes, is a
mathematical formula for determining conditional probability. Conditional probability is the
likelihood of an outcome occurring, based on a previous outcome having occurred in similar
circumstances. Bayes' theorem provides a way to revise existing predictions or theories
(update probabilities) given new or additional evidence
Posterior probability is the revised probability of an event occurring after taking into
consideration the new information. Posterior probability is calculated by updating the prior
probability using Bayes' theorem. In statistical terms, the posterior probability is the
probability of event A occurring given that event B has occurred.
Ex-22: In a bolt factory, three machine P, Q and R produce 25%, 35% and 40% of the total
output respectively. It is found that in their production, respectively 5%, 4% and 2% are
defective bolts. If a bolt is selected at random and found defective, find the probability that it
is produced by machine Q.
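A small Python sketch of the Bayes computation for Ex-22, with the figures taken from the problem statement above:

priors = {"P": 0.25, "Q": 0.35, "R": 0.40}       # share of total output
defect_rate = {"P": 0.05, "Q": 0.04, "R": 0.02}  # P(defective | machine)

# Total probability that a randomly selected bolt is defective.
p_defective = sum(priors[m] * defect_rate[m] for m in priors)

# Posterior P(machine | defective) by Bayes' theorem.
posterior = {m: priors[m] * defect_rate[m] / p_defective for m in priors}
print(posterior["Q"])   # ≈ 0.4058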
Ex-23: A certain test for a particular cancer is known to be 95% accurate. A person submits to
the test and the results are positive. Suppose that the person comes from a population of
1,00,000 where 2,000 people suffer from that disease. What can we conclude about the
probability that the person under test has that particular disease?
ODDS (Ratio of two complementary probabilities):
Let n be number of distinct sample points in the sample space S. Out of n sample points, m
sample points are favourable for the occurrence of event A. Therefore remaining (n-m)
sample points are favourable for the occurrence of its complementary event A'.
∴ P(A) = m/n and P(A') = (n − m)/n
Odds in favour of A = P(A) : P(A') = m : (n − m), and odds against A = (n − m) : m.
What Is Correlation?
Correlation refers to the statistical relationship between the two entities. It measures the
extent to which two variables are linearly related. For example, the height and weight of a
person are related, and taller people tend to be heavier than shorter people.
You can apply correlation to a variety of data sets. In some cases, you may be able to predict
how things will relate, while in others, the relation will come as a complete surprise. It's
important to remember that just because something is correlated doesn't mean it's causal.
Ex: Draw a scatter diagram for the following data and give your comments.
x 30 40 50 65 70 75
y 70 65 60 55 50 40
[Scatter diagram omitted: the plotted points fall from upper left to lower right, i.e. y decreases as x increases, indicating a negative correlation.]
Ex: Draw a scatter diagram for the following data and comment.
Demand 15 20 18 22 25 30
Price 32 19 25 15 12 10
Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the statistical
relationship between two interval or ratio level variables. The correlation coefficient is scaled
so that it is always between -1 and +1. When r is close to 0 this means that there is little
relationship between the variables and the farther away from 0 r is, in either the positive or
negative direction, the greater the relationship between the two variables.
r = [n Σxy − (Σx)(Σy)] / [√(n Σx² − (Σx)²) × √(n Σy² − (Σy)²)]
Note:
1) r lies between −1 & 1 i.e. −1 ≤ r ≤ 1
2) If r = 1, there is perfect positive correlation
3) If 0 < r < 1, there is positive correlation
4) If r = −1, there is perfect negative correlation
5) If −1 < r < 0, there is negative correlation
6) If r = 0, there is no correlation
7) Correlation Coefficient is independent of change of origin & change of scale.
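The formula can be verified numerically with NumPy; here using the x/y data from the first scatter-diagram exercise above:

import numpy as np

x = np.array([30, 40, 50, 65, 70, 75])
y = np.array([70, 65, 60, 55, 50, 40])

# Pearson correlation coefficient from the raw-sums formula above.
n = len(x)
r = (n * np.sum(x * y) - x.sum() * y.sum()) / (
    np.sqrt(n * np.sum(x**2) - x.sum()**2) * np.sqrt(n * np.sum(y**2) - y.sum()**2))
print(r)                        # ≈ -0.96, a strong negative correlation
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy's built-in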
Ex: Calculate correlation coefficient for the following data. Comment on your findings.
Marks in Statistics 53 59 72 43 93 35 55 70
Marks in Economics 35 49 63 36 75 28 38 76
Ex: Calculate Karl Pearson’s Coefficient of correlation for the following data.
X 17 8 12 13 10 12
Y 13 7 10 11 8 11
Ex: Find the Karl Pearson’s correlation coefficient for the following data.
x 10 14 12 18 20 16
y 20 30 20 35 25 20
Spearman’s Rank Correlation Coefficient:
In this method, ranks are assigned to the data. The ranks are given to the x-series & y-
series separately. The highest observation is given rank ‘1’, the next highest observation is
given rank ‘2’ and so on. Suppose, R1 & R2 are the ranks of the x & y respectively and
d = R1 − R2 then
r = 1 − (6 Σd²) / [n(n² − 1)]
where n = number of pairs of observations
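A numerical sketch of the rank-correlation formula, using the data of the first exercise below; scipy's rankdata does the ranking (negated so that the highest observation gets rank 1, as in the notes):

import numpy as np
from scipy.stats import rankdata

x = np.array([15, 12, 16, 13, 17, 14, 18, 11])
y = np.array([17, 14, 20, 25, 23, 24, 22, 21])

R1 = rankdata(-x)   # rank 1 = highest observation
R2 = rankdata(-y)
d = R1 - R2
n = len(x)
r_s = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
print(r_s)   # ≈ 0.14 for this data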
Ex: Calculate the Spearman’s rank correlation coefficient for the following data.
x 15 12 16 13 17 14 18 11
y 17 14 20 25 23 24 22 21
Ex: Calculate the Spearman’s rank correlation coefficient for the following data.
x 50 63 40 70 45 65 38 53 52
y 48 30 35 60 55 33 25 54 50
Let's say we observe a strong positive correlation between ice cream sales and the number of
drownings at the beach. During the summer months, both ice cream sales and drownings tend
to increase. However, it would be incorrect to conclude that eating ice cream causes people to
drown or vice versa.
There could be a third variable at play here, such as temperature. Warmer temperatures in the
summer lead to increased ice cream consumption as well as more people going to the beach
and swimming, which in turn increases the risk of drownings. So, in this example,
temperature is the common cause behind both variables—ice cream sales and drownings—
rather than one causing the other directly.
Covariance is a measure of the relationship between two random variables and to what
extent, they change together. Or we can say, in other words, it defines the changes between
the two variables, such that change in one variable is equal to change in another variable.
This is the property of a function of maintaining its form when the variables are linearly
transformed. Covariance is measured in units, which are calculated by multiplying the units
of the two variables.
Types of Covariance
Covariance can have both positive and negative values. Based on this, it has two types:
1. Positive Covariance
2. Negative Covariance
Positive Covariance
If the covariance for any two variables is positive, that means, both the variables move in the
same direction. Here, the variables show similar behaviour. That means, if the values (greater
or lesser) of one variable corresponds to the values of another variable, then they are said to
be in positive covariance.
Negative Covariance
If the covariance for any two variables is negative, that means, both the variables move in the
opposite direction. It is the opposite case of positive covariance, where greater values of
one variable correspond to lesser values of another variable and vice-versa.
Covariance Formula
Covariance formula is a statistical formula, used to evaluate the relationship between two
variables. It is one of the statistical measurements to know the relationship between the
variance between the two variables. Let us say X and Y are any two variables, whose
relationship has to be calculated. Thus the covariance of these two variables is denoted by
Cov(X,Y). The formula is given below for both population covariance and sample
covariance.
If cov(X, Y) is greater than zero, then we can say that the covariance for any two variables is
positive and both the variables move in the same direction.
If cov(X, Y) is less than zero, then we can say that the covariance for any two variables is
negative and both the variables move in the opposite direction.
If cov(X, Y) is zero, then we can say that there is no relation between two variables.
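A minimal NumPy sketch of sample covariance; the data values are invented for illustration:

import numpy as np

x = np.array([2.1, 2.5, 4.0, 3.6])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Sample covariance: average product of deviations, with n - 1 in the denominator.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_manual)
print(np.cov(x, y)[0, 1])   # same value from NumPy's covariance matrix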
Regression
Linear regression predicts the relationship between two variables by assuming a linear
connection between the independent and dependent variables. It seeks the optimal line
that minimizes the sum of squared differences between predicted and actual values.
Applied in various domains like economics and finance, this method analyzes and
forecasts data trends. It can extend to multiple linear regression involving several
independent variables, and to logistic regression, which is suitable for binary classification problems.
In a simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables. The slope represents the change
in the dependent variable for each unit change in the independent variable, while
the intercept represents the predicted value of the dependent variable when the
independent variable is zero.
Linear regression is the simplest statistical regression method used for predictive
analysis in machine learning. It models the linear relationship between the independent
variable(s) and the dependent variable.
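As a sketch, here is a simple linear regression fitted with scikit-learn to the demand/price data from the earlier scatter exercise; treating demand as the independent variable is an assumption made purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

demand = np.array([15, 20, 18, 22, 25, 30]).reshape(-1, 1)  # independent variable
price = np.array([32, 19, 25, 15, 12, 10])                  # dependent variable

model = LinearRegression().fit(demand, price)
print("slope:", model.coef_[0])        # change in price per unit change in demand
print("intercept:", model.intercept_)  # predicted price when demand is zero
print("prediction at demand = 24:", model.predict([[24]])[0])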
An outlier is an observation “that appears to deviate markedly from other members of the sample in
which it occurs”
A dilemma
▪ Outliers can be genuine values
▪ The trade-off is between the loss of accuracy if we throw away “good”
observations, and the bias of our estimates if we keep “bad” ones
▪ The challenge is twofold:
1. to figure out whether an extreme value is good (genuine) or bad (error)
2. to assess its impact on the statistics of interest
Outlier treatment is the process of identifying and handling outliers in a dataset. Outliers are defined
as observations that fall outside of the general pattern of the data and can have a significant impact on
the analysis and modeling of the data.
There are several methods for identifying and treating outliers, including:
Z-score method: This method calculates the standard deviation and mean of the data, and any
observation that falls more than 3 standard deviations away from the mean is considered an outlier.
Interquartile range method: This method calculates the interquartile range (IQR) of the data, and
any observation that falls outside of the lower and upper limits of the box plot, which are defined as
Q1–1.5 * IQR and Q3 + 1.5 * IQR, respectively, is considered an outlier.
Clustering methods: Clustering methods such as DBScan and KMeans can also be used to identify
outliers by grouping similar data points together and identifying any data points that do not belong to
any cluster.
Visualization techniques: Visualization techniques such as box plots and scatter plots can also be
used to identify outliers by visually identifying any points that fall outside of the general pattern of the
data.
There are many other advanced methods that we will read about in the following section of the article.
Once outliers have been identified, there are several methods for handling them. The appropriate
method for handling outliers will depend on the specific dataset and the goals of the analysis. It is
important to carefully consider the impact of outliers on the data and the appropriate method for
handling them before proceeding with any analysis or modeling. Some common techniques include:
Deleting the outlier observations: This is a simple method, but it can lead to a loss of information if
the outliers are actually meaningful observations.
Trimming the data: This method involves removing a certain percentage of the largest and smallest
observations.
Winsorizing: This method replaces the outliers with a value that is closer to the center of the data.
Log transformation: This method can be used when the data is positively skewed and the outliers are
on the high end of the distribution.
Z-score standardization: This method replaces each observation with its z-score, which is the
number of standard deviations away from the mean.
Cap and floor: This method replaces outliers above the upper limit with a maximum (cap) value and outliers below the lower limit with a minimum (floor) value.
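A short sketch of the IQR method and the cap-and-floor treatment described above, applied to the ATM-transactions data that appears later in these notes:

import numpy as np

data = np.array([35, 49, 225, 50, 30, 65, 40, 55, 52, 76, 48, 325, 47, 32, 60])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("limits:", lower, upper)
print("outliers:", outliers)   # 225 and 325 fall outside the fences

# Cap and floor: clamp every value to the two fences.
treated = np.clip(data, lower, upper)
print("treated:", treated)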
Session 7 & 8
A random experiment is an experiment whose outcome cannot be predicted in advance.
For example, if we toss a coin, we cannot predict whether it will come up Head or Tail. A
possible result of a random experiment is called an outcome, and a single outcome is also
called a sample point; the set of all outcomes is the sample space. With the help of these
experiments or events, we can always create a probability pattern table in terms of variables
and probabilities.
Two random variables with the same probability distribution can still differ in their
relationships with other random variables, or in whether they are independent of them. The
realizations of a random variable, i.e. the outcomes of randomly choosing values according to
the variable's probability distribution function, are called random variates.
Random Variable
A random variable is a variable whose values can be determined from the outcomes of a
random experiment. For example, in an experiment of throwing a pair of dice,
S = {(1,1), (1,2), …, (6,6)}. If we define a variable
X = sum of the numbers on the uppermost faces, then from the outcomes of the sample space
we can determine the values assumed by the variable: X assumes the values 2, 3, …, 12 and
hence X is a random variable.
According to the values assumed by the random variable, we have two types of random
variable.
• A discrete random variable is one whose set of assumed values is countable (arises
from counting).
Examples:
(1) If the random experiment is throwing an unbiased dice, the random variable
associated with this experiment can be defined as the number on the uppermost face, then
possible values of X are 1, 2, 3, 4, 5, 6.
(2) If two cards are taken from a pack of fifty-two cards, X = number of red cards, then X
can take the values 0, 1, 2.
(3) If a coin is tossed until a head appears, X = number of tosses required, then X can take
the values 1, 2, 3, … (a countably infinite set).
(4) Suppose milk contents of one-litre bags are measured, X = milk content in a bag; then X
can take any value in the interval 990 ml < X ≤ 1100 ml.
In examples (1) and (2) the sample space of the random variable is finite and in (3) it is
countably infinite, so these are discrete random variables; whereas in (4) the values form an
interval and are uncountable, hence X is a continuous random variable.
Notation: The random variable is denoted by uppercase letter and its value in lowercase
letter. E.g. X denotes a random variable whereas x denotes its value.
Discrete variable: If the number of possible values of the variable is finite or countably
infinite then the variable is called as a discrete variable. Thus discrete random variable
takes only isolated values.
For example: number of classes missed last week (possible outcomes are 0, 1, 2, 3, …, up to
the maximum number of classes).
Let X be a discrete random variable and let S denote the set of all possible values of X.
Then the function P(X = x), for all x in S, is known as the probability mass function if it
satisfies the following two conditions:
(i) P(X = x) ≥ 0 for all x in S
(ii) ΣP(X = x) = 1
Ex-2: If three coins are tossed simultaneously, find the probability distribution of “number of
heads obtained in 3 tosses”. Also find the mean, variance and standard deviation of “number
of heads obtained in 3 tosses”.
Ex-3: In the following table, X is a discrete random variable and p(x) is the probability mass
function of X.
x 1 2 3
p(x) 0.3 0.6 0.1
Find the standard deviation and cumulative distribution function of X.
Ex-4: Suppose X represents the minimum of the two numbers when a pair of fair dice is
rolled once. Find the probability distribution of X.
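Ex-4 can be checked by brute-force enumeration of the 36 equally likely outcomes:

from itertools import product
from fractions import Fraction
from collections import Counter

# X = minimum of the two faces, over all 36 equally likely outcomes.
counts = Counter(min(a, b) for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
print(pmf)   # P(X=1) = 11/36, P(X=2) = 9/36, ..., P(X=6) = 1/36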
Discrete Uniform Distribution
A discrete uniform is the simplest type of all the probability distributions. Discrete uniform
distribution is a symmetric probability distribution whereby a finite number of values are
equally likely to be observed; such that each value among n possible values has equal
probability 1/n. In other words discrete uniform distribution could be expressed as "a known,
finite number of outcomes equally likely to happen".
Consider the case of throwing a die. The following observations can be made with this action.
(1) The die is numbered from 1 to 6. In probability language, we call it six outcomes.
(2) The total number of possible events is 6. That is, the die can roll on any of the numbers
from 1 to 6. In other words, the events are 'equally likely', each with the same probability of 1/6.
(3) The numbers engraved on the die are 6. In other words, the number of values of the
random variable is finite.
(4) The numbers are equally spaced, or we can say the values of the variable are equally
spaced.
The experiment leads to the discrete Uniform distribution, satisfying all the above stated
characteristics. Thus a discrete uniform distribution is a probability distribution of equally
likely events with equal probability and with a finite number of equally spaced outcomes. The
discrete uniform distribution is essentially non-parametric, i.e. it does not involve any
parameter.
A discrete random variable is said to follow the uniform distribution over the range 1, 2, …, n
if its pmf is given by
P(X = x) = 1/n, x = 1, 2, 3, …, n
         = 0 otherwise
Mean = (n + 1)/2
Variance = (n² − 1)/12
Binomial Distribution
The binomial experiment means Bernoulli experiment which is repeated n times. The
binomial distribution is used to obtain the probability of observing x successes in n trials,
with the probability of success on a single trial denoted by p. The binomial distribution
assumes that p is fixed for all trials. Here n and p are called as parameters of binomial
distribution.
In the above definition notice that the following conditions need to be satisfied for a binomial
experiment:
(1) The experiment consists of a fixed number (n) of trials.
(2) Each trial results in one of only two possible outcomes, labelled 'success' and 'failure'.
(3) The probability of success (p) remains constant from trial to trial.
(4) The trials are independent; the outcome of a trial is not affected by the outcome of any
other trial.
A discrete random variable X is said to follow the Binomial distribution with parameters (n, p)
if its pmf is given by
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n
         = 0 otherwise, where 0 < p < 1 and p + q = 1
Mean = np
Variance = npq
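A minimal sketch of the binomial pmf P(X = x) = nCx p^x q^(n−x); the values n = 10 and p = 0.3 are arbitrary illustrative choices:

from math import comb

n, p = 10, 0.3
q = 1 - p

pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
print(sum(pmf))   # 1.0 (up to floating point)
print(n * p)      # mean = np = 3.0
print(n * p * q)  # variance = npq = 2.1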
Real Life Applications of Binomial Distributions
• There are a lot of areas where the application of binomial theorem is inevitable, even
in the modern world areas such as computing. In computing areas, binomial theorem
has been very useful such as in distribution of IP addresses. With binomial
distribution, the automatic allocation of IP addresses is possible.
• Another field that uses Binomial distribution as the important tools is the nation's
economic prediction. Economists use binomial theorem to count probabilities to
predict the way the economy will behave in the next few years. To be able to come up
with realistic predictions, binomial theorem is used in this field.
• Binomial distribution has also been of great use in the architecture industry in the design
of infrastructure. It allows engineers to estimate the magnitudes of projects and
thus deliver accurate estimates of not only the costs but also the time required to
construct them. For contractors, it is a very important tool to help ensure that project
costing is sound enough to deliver profits.
• The binomial distribution is used when a researcher is interested in the occurrence of
an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or
die. The researcher studies the number of survivors, and not how long the patient
survives after treatment.
• Another example is whether a person is ambitious or not. Here, the binomial
distribution describes the number of ambitious persons, and not how ambitious they
are.
• Other situations in which binomial distributions arise are quality control, public
opinion surveys, medical research, and insurance problems.
Poisson Distribution
The Poisson Distribution was developed by the French mathematician Simeon Denis Poisson
in 1837. The Poisson distribution is a discrete probability distribution for the counts of events
that occur randomly in a given interval of time (or space). If we let X = the number of events
in a given interval, with the mean number of events per interval λ, then the distribution of X
is given by the Poisson distribution.
A discrete random variable is said to follow Poisson Distribution with parameter λ, if its p.m.f
is given by
P(X = x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …
         = 0 otherwise
Mean = λ
Variance = λ
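A small sketch of the Poisson pmf; λ = 2.5 is an arbitrary illustrative value:

from math import exp, factorial

lam = 2.5   # mean number of events per interval

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x!
    return exp(-lam) * lam**x / factorial(x)

print(poisson_pmf(0, lam))                             # probability of no events
print(sum(poisson_pmf(x, lam) for x in range(50)))     # ≈ 1, as a pmf must
print(1 - sum(poisson_pmf(x, lam) for x in range(3)))  # P(X >= 3)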
Comparison between Binomial and Poisson Distribution
Even though both Binomial and Poisson distribution give probability of X successes, the
differences between binomial and Poisson distribution can be drawn clearly on the following
grounds:
(1) The binomial distribution is one in which the probability of X successes among n
Bernoulli trials is studied. A probability distribution that gives the probability of the count of
a number of independent events that occur randomly within a given period, is called Poisson
distribution.
(2) Binomial Distribution is biparametric, i.e. it is marked by two parameters n and p whereas
Poisson distribution is uniparametric, i.e. described by a single parameter λ.
(3) There are a fixed number of attempts in the binomial distribution. On the other hand, an
unlimited number of trials are there in a Poisson distribution.
(4) The success probability is constant in the binomial distribution, but in the Poisson
distribution the probability of success is extremely small (success is a rare event).
(5) Poisson distribution can be considered as limiting form of binomial distribution. When
success is a rare event i.e. probability of success p is small, p→0, number of trials is large i.e.
n→∞, but mean np is finite, binomial distribution tends to Poisson distribution.
(6) In binomial distribution Mean > Variance while in Poisson distribution mean = variance.
Apart from the above differences, there are a number of similar aspects between these two
distributions: both are discrete theoretical probability distributions, and, depending on the
values of their parameters, both can be unimodal or bimodal. Moreover, the binomial
distribution tends to the Poisson distribution when n tends to infinity and the success
probability (p) tends to 0, with λ = np held constant.
Applications of Poisson distribution:
Poisson distribution is applied whenever we observe rare events. Some examples are given
below:
• The number of deaths by horse kicking in the Prussian army (first application).
• Birth defects and genetic mutations.
• Rare diseases (like leukemia, but not AIDS, because it is infectious and so not
independent). Car accidents.
• Traffic flow and ideal gap distance.
• Number of typing errors on a page. Hairs found in McDonald's hamburgers. Spread of
an endangered animal in Africa.
• Failure of a machine in one month.
Geometric Distribution
Geometric distribution is a type of discrete probability distribution that represents
the probability of the number of successive failures before a success is obtained in
a Bernoulli trial. A Bernoulli trial is an experiment that can have only two possible
outcomes, ie., success or failure. In other words, in a geometric distribution, a
Bernoulli trial is repeated until success is obtained and then stopped.
The geometric probability distribution is widely used in several real-life scenarios.
For example, in financial industries, geometric distribution is used to do a cost-
benefit analysis to estimate the financial benefits of making a certain decision. In
this article, we will study the meaning of geometric distribution, examples, and
certain related important aspects.
What is Geometric Distribution?
Suppose a die is repeatedly rolled until "3" is obtained. We know that the
probability of getting "3" is p = 1/6. Let the random variable X be the number of the
trial on which "3" appears for the first time, so X can take the values 1, 2, 3, …
• The probability of rolling a 3 in the first trial is 1/6.
• The probability of rolling a 3 in the second trial for the first time is 5/6 × 1/6 = 5/36.
Here, 5/6 is the probability of rolling a number that is NOT 3 in the first trial.
• Similarly, the probability of rolling a 3 in the third trial for the first time is (5/6)² × 1/6 =
25/216.
Mean of Geometric Distribution
The mean of geometric distribution is also the expected value of the geometric
distribution. The expected value of a random variable, X, can be defined as the
weighted average of all values of X. The formula for the mean of a geometric
distribution is given as follows:
E[X] = 1 / p
Variance of Geometric Distribution
Variance can be defined as a measure of dispersion that checks how far the data in a
distribution is spread out with respect to the mean. The formula for the variance of
a geometric distribution is given as follows:
Var[X] = (1 − p) / p²
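A quick numerical check of both formulas for the die example (p = 1/6), summing the pmf over a long but finite range:

p = 1 / 6   # probability of rolling a 3

def geom_pmf(x, p):
    # P(X = x): x - 1 failures followed by the first success.
    return (1 - p) ** (x - 1) * p

mean = sum(x * geom_pmf(x, p) for x in range(1, 2000))
var = sum((x - 1 / p) ** 2 * geom_pmf(x, p) for x in range(1, 2000))
print(mean, 1 / p)          # both ≈ 6, matching E[X] = 1/p
print(var, (1 - p) / p**2)  # both ≈ 30, matching Var[X] = (1-p)/p²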
Binomial Vs Geometric Distribution
In both geometric distribution and binomial distribution, there can be only two
outcomes of a trial, either success or failure. Furthermore, the probability of
success will be the same for each trial. The difference between binomial
distribution and geometric distribution is given in the table below.
Geometric Distribution: A geometric distribution is concerned with the first success only.
The random variable, X, counts the number of trials required to obtain that first success.
Mean = 1/p, Variance = (1 − p)/p².
Binomial Distribution: In a binomial distribution, there are a fixed number of trials and the
random variable, X, counts the number of successes in those trials.
Mean = np, Variance = np(1 − p).
A Continuous variable takes all possible values in a range set which is in the form of interval.
On the other hand discrete random variable takes only specific or isolated values. Generally
values of discrete random variable are obtained by counting while values of continuous
variable are obtained by measuring to any degree of accuracy. The values of continuous
variable move continuously from one possible value to another, without having to jump as in
case of a discrete variable. E.g. Random variable X which is number of F.Y.B.Sc. students is
discrete. While random variable Y which is height of a student is continuous variable.
Continuous variables would (literally) take forever to count. In fact, you would get to
"forever" and never finish counting them. For example, take age. You can't count "age". Why
not? Because it would literally take forever. For example, you could be: 20 years, 10 months,
2 days, 5 hours, 10 minutes, 4 seconds, 4 milliseconds, 8 nanoseconds, 99 picoseconds... and
so on. Even though age is a continuous variable, you could turn age into a discrete variable
and then you could count it. For example:
• A person's age in years.
• A baby's age in months.
Hence age, height, weight, time, income are continuous variables but we can turn them into
discrete variable by appropriate definition.
Examples of continuous random variable:
• Consumption of cooking oil in a house.
• Waiting time on a ticket window.
• Life of an electronic gadget.
• Yield of a crop in certain area.
• Temperature in Mumbai
• Time it takes a computer to complete a task.
Properties of pdf of continuous random variable
i) f(x) ≥ 0 for all x belonging to the sample space S, since a probability cannot be negative.
ii) ∫ f(x) dx over (−∞, ∞) = 1, implying that the total area bounded by the curve of the
density function and the X-axis equals 1 when computed over the entire range of the
variable X.
Continuous Uniform Distribution
The simplest continuous probability distribution is the Uniform distribution also known as
Continuous Uniform distribution. It is observed in many situations. In general, if the
probabilities of various classes of a continuous variable are more or less the same, the
situation is best described by uniform distribution. In this distribution the p.d.f. of random
variable remains constant over the range space of variable.
Definition:
A continuous random variable is said to follow uniform distribution over the interval (a, b), if
its p.d.f. is given by
f(x) = 1/(b − a); a ≤ x ≤ b
     = 0 otherwise
Mean = (a + b)/2
Variance = (b − a)²/12
Normal Distribution
Normal distributions are also called Gaussian distributions or bell curves because of their
shape.
Because normally distributed variables are so common, many statistical tests are designed for
normally distributed populations.
Understanding the properties of normal distributions means you can use inferential
statistics to compare different groups and make estimates about populations using samples.
The mean determines where the peak of the curve is centered. Increasing the mean moves the
curve right, while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a
narrow curve, while a large standard deviation leads to a wide curve.
Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a
normal distribution:
• Around 68% of values are within 1 standard deviation from the mean.
• Around 95% of values are within 2 standard deviations from the mean.
• Around 99.7% of values are within 3 standard deviations from the mean.
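The empirical rule can be checked with scipy.stats; the mean and standard deviation below are arbitrary:

from scipy.stats import norm

mu, sigma = 100, 15   # hypothetical mean and standard deviation

# Probability mass within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    p = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(p, 4))   # ≈ 0.6827, 0.9545, 0.9973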
In research, to get a good idea of a population mean, ideally you’d collect data from
multiple random samples within the population. A sampling distribution of the mean is the
distribution of the means of these different samples.
• Law of Large Numbers: As you increase sample size (or the number of samples), then
the sample mean will approach the population mean.
• With multiple large samples, the sampling distribution of the mean is normally
distributed, even if your original variable is not normally distributed.
Parametric statistical tests typically assume that samples come from normally distributed
populations, but the central limit theorem means that this assumption isn’t necessary to meet
when you have a large enough sample.
You can use parametric tests for large samples from populations with any kind of distribution
as long as other important assumptions are met. A sample size of 30 or more is generally
considered large.
For small samples, the assumption of normality is important because the sampling
distribution of the mean isn’t known. For accurate results, you have to be sure that the
population is normally distributed before you can use parametric tests with small samples.
EXPONENTIAL DISTRIBUTION
A continuous random variable is said to follow the exponential distribution with parameter
λ > 0 if its p.d.f. is given by
f(x) = λ e^(−λx); x ≥ 0
     = 0 otherwise
Mean = 1/λ
Variance = 1/λ²
The exponential distribution is commonly used to model waiting times between randomly
occurring events, e.g. the time between arrivals at a ticket window.
Measures of Central Tendency and Variability
The mean of a data set is the sum of the observations divided by their number. For example,
the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5).
The mode of a data set is the value appearing most often, and the median is the figure
situated in the middle of the data set. It is the figure separating the higher figures from the
lower figures within a data set.
Central Tendency
Measures of central tendency focus on the average or middle values of data sets, whereas
measures of variability focus on the dispersion of data. These two measures use graphs,
tables and general discussions to help people understand the meaning of the analysed data.
Measures of central tendency describe the centre position of a distribution for a data set. A
person analyses the frequency of each data point in the distribution and describes it using
the mean, median, or mode, which measures the most common patterns of the analysed data
set.
Measures of Variability
Measures of variability (or the measures of spread) aid in analysing how dispersed the
distribution is for a set of data. For example, while the measures of central tendency may
give a person the average of a data set, it does not describe how the data is distributed within
the set.
So, while the average of the data may be 65 out of 100, there can still be data points at both 1
and 100. Measures of variability help communicate this by describing the shape and spread
of the data set. Range, quartiles, absolute deviation, and variance are all examples of
measures of variability.
Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95,
which is calculated by subtracting the lowest number (5) in the data set from the highest
(100).
Types of Averages:
1) Arithmetic Mean (A.M.)
2) Weighted Arithmetic Mean
3) Median
4) Mode
5) Geometric Mean (G.M.)
6) Harmonic Mean (H.M.)
A.M. = X̄ = (X1 + X2 + … + XN) / N = ΣX / N
Ex: The grades of a student on six examinations were 84, 91, 72, 68, 87 and 78. Find the
arithmetic mean of the grades.
Ex: The mean of 10 observations was found to be 20. Later on it was discovered that the
observations 24 and 34 were wrongly noted as 42 and 54. Find the corrected mean.
A.M. = X̄ = (f1X1 + f2X2 + … + fnXn) / (f1 + f2 + … + fn) = ΣfX / Σf
We denote Σf by N, called the total frequency.
A.M. = X̄ = ΣfX / N
Ex: The following table gives the monthly income of 20 families in a city. Calculate the
arithmetic mean.
Income (in ’00 Rs.) 16 20 30 35 45 50
No. of families 2 5 4 6 2 1
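A one-line check of the frequency-distribution mean for this table:

income = [16, 20, 30, 35, 45, 50]   # in '00 Rs.
families = [2, 5, 4, 6, 2, 1]

N = sum(families)
mean = sum(f * x for f, x in zip(families, income)) / N
print(N, mean)   # N = 20 families, mean income = 30.1 ('00 Rs.)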
Ex: Use the following frequency distribution of heights to find the arithmetic mean of height
of 100 students at XYZ university.
Height (inches) Number of students
60 – 62 5
63 – 65 18
66 – 68 42
69 – 71 27
72 – 74 8
Ex: Use the following frequency distribution of weekly wages to find the arithmetic mean of
wage of employees at P & R company.
Weekly Wage ($) Number of employees
250.00 – 259.99 8
260.00 – 269.99 10
270.00 – 279.99 16
280.00 – 289.99 14
290.00 – 299.99 10
300.00 – 309.99 5
310.00 – 319.99 2
Combined Mean:
Ex: The average marks of a group of 100 students in Statistics are 60 and for other group of
50 students, the average marks are 90. Find the average marks of the combined group of 150
students.
Combined Mean = X̄ = (N1X̄1 + N2X̄2) / (N1 + N2)
= ((100 × 60) + (50 × 90)) / (100 + 50) = 70
Weighted Arithmetic Mean:
Sometimes we associate certain weighting factors (or weights) with the numbers X1,
X2, …, Xn. Suppose the weights are w1, w2, …, wn. The weighted arithmetic mean is denoted
by X̄ and is defined as follows.
Weighted A.M. = X̄ = (w1X1 + w2X2 + … + wnXn) / (w1 + w2 + … + wn) = ΣwX / Σw
Ex: A student’s grades in laboratory, lecture and recitation parts of a Physics course were 71,
78 and 89 respectively. If the weights of these grades are 2, 4 and 5 respectively, what is the
appropriate average grade?
Answer:
X     w     wX
71    2     142
78    4     312
89    5     445
      Σw = 11   ΣwX = 899
Weighted A.M. = X̄ = ΣwX / Σw = 899 / 11 = 81.7273
MERITS AND DEMERITS OF MEAN
Merits:
1. It is rigidly defined.
5. Of all the averages, A.M. is least affected by sampling fluctuations, i.e. it is a stable
average.
Demerits:
4. It cannot be calculated for frequency distributions having open-end class intervals, e.g.
class intervals like 'below 10' or 'above 50', etc.
6. Sometimes it gives absurd results, e.g. average number of children per family is 1.28.
7. It cannot be used for the study of qualitative data such as intelligence, honesty, beauty, etc.
Even though A.M. has various demerits, it is considered to be the best of all averages as it
satisfies most of the requisites of a good average. A.M. is called the Ideal Average.
Median:
The Median of a set of numbers arranged in order of magnitude is either the middle
value or arithmetic mean of the two middle values.
If N is an odd number,
median = ((N + 1)/2)th observation in the ordered sample
If N is an even number,
median = average of the (N/2)th and (N/2 + 1)th observations in the ordered sample
Ex: The number of ATM transactions per day was recorded at 15 locations in a large city. The
data were as follows.
35, 49, 225, 50, 30, 65, 40, 55, 52, 76, 48, 325, 47, 32, 60.
Find the median number of transactions.
(N + 1)/2 = (15 + 1)/2 = 8
median = ((N + 1)/2)th observation in the ordered sample
= 8th observation in the ordered sample = 50
Ex: The following table gives the data of weight of 20 students at a University. Find the
median weight.
138, 146, 168, 146, 161, 164, 158, 126, 173, 145,
150, 140, 138, 142, 135, 132, 147, 176, 147, 142
median = L1 + ((L2 − L1)/f) × (N/2 − c.f.)
L1 = lower class boundary of median class
L2 = upper class boundary of median class
f = frequency of median class
c.f. = less-than cumulative frequency of the class preceding the median class
Ex: Find the median of the following data.
Weight (pounds) Frequency
118 – 126 3
127 – 135 5
136 – 144 9
145 – 153 12
154 – 162 5
163 – 171 4
172 – 180 2
median = L1 + ((L2 − L1)/f) × (N/2 − c.f.)
= 144.5 + ((153.5 − 144.5)/12) × (20 − 17) = 146.75
MERITS AND DEMERITS OF MEDIAN
Merits:
6. Since median is a positional average, it can be computed even if the observations at the
extremes are unknown.
Demerits:
2. It is not based on all observations and hence may not be a proper representative.
4. Since it does not require information about all the observations, it is insensitive to some
changes.
Mode:
The Mode of a set of numbers is that value which occurs with the greatest frequency,
that is, it is the most common value.
Note:
1) Sometimes the Mode may not exist.
2) Even if the Mode exists, sometimes it may not be unique.
Ex: The reaction times of an individual to a certain stimulus were measured as 0.53, 0.46,
0.50, 0.49, 0.52, 0.44, 0.55, 0.53, 0.40 and 0.56. Find the mode.
Ex: Three teachers of a subject reported examination grade of 79, 74 and 82 in their classes,
which consisted of 32, 25 and 17 students respectively. Determine the mode.
mode = L1 + (L2 − L1) × (f1 − f0) / (2f1 − f0 − f2)
L1 = lower class boundary of modal class
L2 = upper class boundary of modal class
f0 = frequency of the class preceding the modal class
f1 = frequency of the modal class
f2 = frequency of the class next to the modal class
MERITS AND DEMERITS OF MODE
Merits:
5. It is always present within the data and is the most typical value of the given set of data.
Demerits
5. If the sample of data for which mode is obtained is small, then such mode has no
significance.
Geometric Mean (G.M.):
Geometric Mean for Raw Data:
The G.M. of N observations X1, X2, …, XN is denoted by G and is defined as follows.
G.M. = G = (X1 × X2 × … × XN)^(1/N), i.e. the Nth root of the product of the observations.
Ex: Find the Geometric Mean of the following data.
28.5, 73.6, 47.2, 31.5 and 64.8.
Answer: N = 5
G.M. = G = (28.5 × 73.6 × 47.2 × 31.5 × 64.8)^(1/5) = 45.8258
Geometric Mean for a frequency distribution:
G.M. = G = [(X1)^f1 × (X2)^f2 × … × (Xn)^fn]^(1/N), where N = Σf
Harmonic Mean(H.M.):
Harmonic mean for Raw Data:
The H.M. of N observations X1, X2, …, XN is denoted by H and is defined as follows.
H = N / (1/X1 + 1/X2 + … + 1/XN)
i.e. 1/H = (1/N) × (1/X1 + 1/X2 + … + 1/XN) = (1/N) Σ(1/X)
Ex: Cities A, B and C are equidistant from each other. A motorist travels from A to B at 30
mph, from B to C at 40 mph and from C to A at 50 mph. Find his average speed.
Answer:
1/H = (1/N) Σ(1/X) = (1/3) × (1/30 + 1/40 + 1/50) = 47/1800
H = Harmonic Mean = 1800/47 = 38.2979 mph
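Python's statistics module gives the same result directly:

from statistics import harmonic_mean

speeds = [30, 40, 50]
print(harmonic_mean(speeds))   # 1800/47 ≈ 38.2979 mph, as computed above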
Harmonic Mean for a frequency distribution:
Consider a data of n observations X1, X2, …, Xn occurring with respective frequencies
f1, f2, …, fn. Then the H.M. is denoted by H and is defined as follows.
H = N / (f1/X1 + f2/X2 + … + fn/Xn), where N = Σf
Ex: An airplane travels distances of 2500, 1200 and 500 miles at speeds 500, 400 and 250
mph respectively. Find the Harmonic mean.
Answer:
Here the distances act as weights, so we use the weighted harmonic mean.
Harmonic Mean = Σw / Σ(w/X)
= (2500 + 1200 + 500) / (2500/500 + 1200/400 + 500/250)
= 4200 / (5 + 3 + 2) = 420 mph
Quartiles, Deciles and Percentiles:
Suppose we arrange a set of data in order of magnitude. The values which divide the
set into four equal parts are denoted by Q1, Q2, Q3 and are called as the 1st, 2nd, 3rd Quartiles
respectively. Similarly, the values which divide the set into ten equal parts are denoted by D1,
D2, ..., D9 and are called as the 1st, 2nd, ..., 9th Deciles respectively. And, the values which
divide the set into hundred equal parts are denoted by P1, P2, ..., P99 and are called as the 1st,
2nd, ..., 99th Percentiles respectively. Collectively, Quartiles, Deciles and Percentiles are called
as Quantiles.
Qi = ith Quartile = L1 + (c/f) × (iN/4 − c.f.), where i = 1, 2, 3
L1 = lower class boundary of ith quartile class
f = frequency of ith quartile class
c.f. = less-than cumulative frequency of the class preceding the ith quartile class
c = class width
Di = ith Decile = L1 + (c/f) × (iN/10 − c.f.), where i = 1, 2, ..., 9
L1 = lower class boundary of ith decile class
f = frequency of ith decile class
c.f. = less-than cumulative frequency of the class preceding the ith decile class
c = class width
Pi = ith Percentile = L1 + (c/f) × (iN/100 − c.f.), where i = 1, 2, ..., 99
L1 = lower class boundary of ith percentile class
f = frequency of ith percentile class
c.f. = less-than cumulative frequency of the class preceding the ith percentile class
c = class width
Ex: For the following data of age, calculate the 1st Quartile, 5th Decile and 54th Percentile.
Age in years Number of persons
0 – 10 6
10 – 20 8
20 – 30 13
30 – 40 18
40 – 50 16
50 – 60 13
60 – 70 12
70 – 80 9
80 – 90 4
90 – 100 1
Answer: N = Σf = 100
Class Interval Frequency Less Than Cumulative Frequency
0 – 10 6 6
10 – 20 8 14
20 – 30 13 27
30 – 40 18 45
40 – 50 16 61
50 – 60 13 74
60 – 70 12 86
70 – 80 9 95
80 – 90 4 99
90 – 100 1 100
1) To find Q1:
Consider iN/4 with i = 1: N/4 = 25, so the class containing Q1 is 20-30.
Q1 = L1 + (c/f) × (N/4 − c.f.) = 20 + (10/13) × (25 − 14) = 28.4615
2) To find D5:
Consider iN/10 with i = 5: 5N/10 = 50, so the class containing D5 is 40-50.
D5 = L1 + (c/f) × (5N/10 − c.f.) = 40 + (10/16) × (50 − 45) = 43.125
3) To find P54:
Consider iN/100 with i = 54: 54N/100 = 54, so the class containing P54 is 40-50.
P54 = L1 + (c/f) × (54N/100 − c.f.) = 40 + (10/16) × (54 − 45) = 45.625
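All three grouped quantiles can be computed with one small helper implementing the formula L1 + (c/f) × (iN/k − c.f.); the data is the age table above:

def grouped_quantile(boundaries, freqs, frac):
    # boundaries: class boundaries [b0, b1, ..., bk]; freqs: class frequencies.
    # frac: 1/4 for Q1, 5/10 for D5, 54/100 for P54, etc.
    N = sum(freqs)
    target = frac * N
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= target:
            L1 = boundaries[i]
            c = boundaries[i + 1] - boundaries[i]
            return L1 + (c / f) * (target - cum)
        cum += f

bounds = list(range(0, 101, 10))                  # 0, 10, ..., 100
freqs = [6, 8, 13, 18, 16, 13, 12, 9, 4, 1]
print(grouped_quantile(bounds, freqs, 1 / 4))     # Q1  = 28.4615...
print(grouped_quantile(bounds, freqs, 5 / 10))    # D5  = 43.125
print(grouped_quantile(bounds, freqs, 54 / 100))  # P54 = 45.625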
Comparison between Central Tendencies
We have studied five different measures of central tendency. It is obvious that no single
measure can be the best for all situations. The most commonly used measures are mean,
median and mode. It is not desirable to consider any one of them to be superior or inferior in
all situations. The selection of appropriate measure of central tendency would largely depend
upon the nature of the data; more specifically, on the scale of measurement used for
representing the data and the purpose on hand.
For data obtained on a nominal scale, we can count the number of cases in each category and obtain the frequencies. We may then be interested in knowing which class is most popular, or the most typical value in the data. In such cases, the mode can be used as the appropriate measure of central tendency.
e.g. Suppose in a genetic study of a group of 50 family members, we want to know the most common colour of eyes. We count the number of persons with each different eye colour. Suppose 3 persons have light eyes, 6 have brown eyes, 12 have dark grey eyes and 29 have black eyes. Then the most common eye colour (i.e. the mode) for this group of people is 'black'.
When the data is available on an ordinal scale of measurement, i.e. the data is provided in rank order, the use of the median as a measure of central tendency is appropriate. Suppose in a group of
75 students, 10 students have failed, 15 get pass class, 20 secure second class and 30 are in
first class. The average performance of the students will be the performance of the
middlemost student (arranged as per rank) i.e. the performance of 38th student i.e. second
class; which is the median of the data. Median is only a point on the scale of measurement,
below and above which lie exactly 50% of the data. Median can also be used (i) for truncated
(incomplete) data, provided we know the total number of cases and their positions on the
scale and (ii) when the distribution is markedly skewed.
Arithmetic Mean is the most commonly used measure of central tendency. It can be
calculated when the data is complete and is represented on interval or ratio scale. It represents
the centre of gravity of the data i.e. the measurements in any sample are perfectly balanced
about the mean. In computation of simple A.M., equal importance is given to all observations
in the data. It is preferred because of its high reliability and its applicability to inferential
statistics. Thus, the A.M. is a more precise, reliable and stable measure of central tendency.
What are the objectives of computing dispersion?
• To judge the reliability of an average: a small value of dispersion means low variation between the observations and the average, i.e. the average is a good representative of the observations and very reliable; a higher value of dispersion means greater deviation among the observations, in which case the average is not a good representative and cannot be considered reliable.
• To control the variability.
• To provide the basis for further statistical analysis like computing correlation, regression, tests of hypothesis, etc.
Range
Definition: If L is the largest observation in the data and S is the smallest observation, then
range is the difference between L and S. Thus,
Range = L-S
For a frequency distribution, range may be considered as the difference between the largest
and the smallest class-boundaries.
Range is a crude and simplest measure of dispersion. It measures the scatter of observations
among themselves and not about any average.
The corresponding relative measure is
Coefficient of range = (L − S) / (L + S)
Note:
1.Range is a suitable measure of dispersion in case of small groups. In the branch of statistics
known as Statistical Quality Control, range is widely used. It is also used to measure the
changes in the prices of shares. Variation in daily temperatures at a certain place is measured
by recording maximum temperature and minimum temperature. Range is also used in
medical sciences to check whether blood pressure, haemoglobin count etc. are normal.
2. The main drawback of this measure is that it is based on only two extreme values, the
maximum and the minimum, and completely ignores all the remaining observations.
Quartile Deviation
We have seen earlier that range, as a measure of dispersion, is based only on two extreme
values and fails to take into account the scatter of remaining observations within the range.
To overcome this drawback to an extent, we use another measure of dispersion called Inter-
Quartile Range. It represents the range which includes middle 50% of the distribution. Hence,
Inter-Quartile Range = Q3 – Q1
where, Q3 and Q1 represent upper and lower quartiles respectively.
Half of the Inter-Quartile Range, i.e. the Semi-Inter-Quartile Range = (Q3 − Q1)/2, is also called the Quartile Deviation (Q.D.).
Note:
1. Q.D. is independent of extreme values. It is a better representative and more reliable than
range.
2. Q.D. gives an idea about the distribution of middle half of the observations around the
median.
3. Whenever median is preferred as a measure of central tendency, quartile deviation is
preferred as a measure of dispersion. However, like median, quartile deviation is also not
capable of further algebraic treatment, as it does not take into consideration all the values of
the distribution.
4. For a symmetric distribution,
Q1 = Median - Q.D. and Q3 = Median + Q.D.
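For raw (ungrouped) data, the quartiles and hence the quartile deviation can be computed directly. A small sketch with numpy on invented data; note that numpy's default linear interpolation of percentiles can give slightly different quartiles than the grouped-data formula used earlier:

```python
import numpy as np

data = np.array([12, 15, 17, 20, 22, 25, 28, 31, 35, 40])  # invented sample
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1          # inter-quartile range
qd = iqr / 2           # semi-inter-quartile range = quartile deviation
print(q1, q3, iqr, qd)
```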
STANDARD DEVIATION (S.D.)
Karl Pearson introduced the concept of standard deviation in 1893. It is the most important
measure of dispersion and is widely used in many statistical techniques.
Definition:
Standard Deviation (S.D.) is defined as the positive square root of the arithmetic mean of the
squares of the deviations of the observations from their arithmetic mean.
The arithmetic mean of the squares of the deviations of the observations from their A.M. is called the variance.
Thus, S.D. = +√(variance)
Note: The coefficient of variation is considered to be the most appropriate measure for
comparing variability of two or more distributions. The importance of C.V. can be explained
with the following example: Suppose milk bags are filled by an automatic machine, the target amount of milk being 1 litre per bag. Setting the machine for C.V. = 0 (i.e. zero variability) is impossible, because of chance causes which exist in any process and are beyond human control. Let us assume that the machine is set for a C.V. less than or equal to 1. Then, using the statistical 3-sigma rule, we can expect approximately 99.73% of the bags to contain a quantity of milk between at least 970 mls. and at most 1030 mls. Usually, this variation is not noticeable and hence acceptable to customers. But if the machine is set for a C.V. of, say, 5, about 16% of the bags will contain 950 mls. or less of milk, which is definitely not acceptable. Thus, one has to take utmost care to reduce the C.V.
In manufacturing process, with reference to quality control section and in pharmaceutical
industries, C.V. plays a very important role. In quality control section, efforts are made to
improve the quality by producing items as per given specifications. The extent of deviation
from given specifications can be measured using C.V. The lower the value of C.V., the better the quality of the items produced. Due to competition, almost all industries have reduced the C.V. of their goods to a considerable extent in the last few years.
In pharmaceutical industries, C.V. is as low as 1 or less than 1. The variation in the weights of
tablets is almost negligible.
In industrial production, C.V. depends upon raw material used. A good quality of raw material
will result in homogeneous end product. In chemical and pharmaceutical industries, C.V. can
be reduced by thorough mixing and pounding of the raw material.
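The C.V. used throughout this discussion is the standard ratio 100·(S.D.)/(mean). A minimal sketch with invented fill volumes for the milk-bag setting:

```python
import numpy as np

fills = np.array([998, 1004, 995, 1001, 999, 1003, 1000, 996])  # mls per bag (invented)
mean = fills.mean()
sd = fills.std()                    # population S.D. = sqrt(variance)
cv = 100 * sd / mean                # coefficient of variation, in percent
print(f"mean={mean:.1f}, variance={sd**2:.2f}, sd={sd:.2f}, CV={cv:.2f}%")
```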
MERITS AND DEMERITS OF STANDARD DEVIATION
Merits:
1.It is rigidly defined.
2.It is based on all observations.
3.It is capable of further algebraic treatment.
4.It is least affected by sampling fluctuations.
Demerits:
1.As compared to other measures it is difficult to calculate.
2.It cannot be calculated for distribution with open-end class-intervals.
3.It gives more importance (weightage) to extreme values and less importance to the values close to the A.M.; that is, it is unduly affected by extreme observations.
4.It cannot be calculated for qualitative data.
Skewness and Kurtosis:
Introduction
Understanding the shape of data is crucial while practicing data science. It helps us understand where most of the information lies and to analyze the outliers in a given data set. In this section, we'll learn about the shape of data, the importance of skewness and kurtosis in statistics, the types of skewness and kurtosis, and how to analyze the shape of data in a given dataset. Let's first understand what skewness and kurtosis are.
What Is Skewness?
Skewness is a statistical measure of the asymmetry of a probability distribution. It quantifies the extent to which the data is skewed or shifted to one side. Positive skewness indicates a longer tail on the right side of the distribution, while negative skewness indicates a longer tail on the left side. Skewness helps in understanding the shape of a distribution and in identifying where the bulk of the observations and the outliers lie.
A probability distribution that deviates from the symmetrical normal distribution (bell curve) is said to be skewed; such distributions occur frequently in statistics.
In a skewed data set, typical values fall between the first quartile (Q1) and the third quartile (Q3).
The normal distribution is the reference point for judging skewness: being perfectly symmetric, it has zero skewness. In a symmetrically distributed dataset, both the left-hand side and the right-hand side have an equal number of observations. (If the dataset has 90 values, then the left-hand side has 45 observations, and the right-hand side has 45 observations.) But what if the data is not symmetrically distributed? Such data is called asymmetrical data, and that is where skewness comes into the picture.
Types of Skewness
1. Positively Skewed Distribution
A positively skewed (right-skewed) distribution is a sort of distribution where the measures of central tendency disperse, unlike symmetrically distributed data where all measures of central tendency (mean, median, and mode) equal each other. In a positively skewed distribution, the mean is greater than the median, which in turn is greater than the mode.
In a positively skewed distribution, the mean of the data is greater than the median (a large number of data points are pulled to the right-hand side by the long tail). In other words, the bulk of the results is bent towards the lower side. The mean will be more than the median, as the median is the middle value and the mode is always the most frequent value.
Extreme positive skewness is not desirable for a distribution, as a high level of skewness can cause misleading results. Data transformation tools help to bring skewed data closer to a normal distribution. For positively skewed distributions, the most famous transformation is the log transformation, which replaces each value by its logarithm and thereby compresses the long right tail.
2. Negatively Skewed Distribution
A distribution with a long left tail, known as negatively skewed or left-skewed, stands in contrast to the positively skewed case. A negatively skewed distribution refers to the distribution model where more values are plotted on the right side of the graph, and the tail of the distribution spreads out on the left side.
In a negatively skewed distribution, the mean of the data is less than the median (a large number of data points are pulled to the left-hand side by the long tail), so the ordering is typically mean < median < mode. The median is the middle value, and the mode is the most frequent value; due to the unbalanced left tail, the mean is dragged below both.
Various methods can calculate skewness, with Pearson's coefficients being the most common. For Pearson's first coefficient, subtract the mode from the mean, and then divide the difference by the standard deviation:
Sk1 = (Mean − Mode) / Standard Deviation
Dividing by the standard deviation makes the coefficient unit-free, so the degree of skewness can be compared across data sets.
Pearson's first coefficient of skewness is helpful if the data present a high (pronounced) mode. However, if the data exhibit a low mode or multiple modes, it is preferable not to use Pearson's first coefficient; Pearson's second coefficient may be superior, as it does not depend on the mode: subtract the median from the mean, multiply the difference by 3, and divide the product by the standard deviation:
Sk2 = 3(Mean − Median) / Standard Deviation
Rule of thumb:
• For skewness values between -0.5 and 0.5, the data exhibit approximate symmetry.
• Skewness values within the range of -1 and -0.5 (negatively skewed) or 0.5 and 1 (positively skewed) indicate moderately skewed data.
• Data with skewness values less than -1 (negatively skewed) or greater than 1 (positively skewed) are highly skewed.
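Pearson's two coefficients are straightforward to compute. A sketch on an invented right-skewed sample; taking the mode as the most frequent value only makes sense when the data have one clear mode:

```python
import numpy as np
from statistics import mode, median

data = [2, 3, 3, 3, 4, 5, 6, 8, 12]          # invented, right-skewed sample
mean, sd = np.mean(data), np.std(data)
sk1 = (mean - mode(data)) / sd                # first coefficient: (mean - mode)/sd
sk2 = 3 * (mean - median(data)) / sd          # second coefficient: 3(mean - median)/sd
print(sk1, sk2)                               # both positive -> right-skewed
```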
What Is Kurtosis?
Kurtosis is a statistical measure that provides information about the tails and peakedness of the distribution compared to a normal distribution.
Positive kurtosis indicates heavier tails and a more peaked distribution, while negative kurtosis suggests lighter tails and a flatter distribution. Kurtosis helps in analyzing the likelihood of extreme values (outliers) in the data.
Peakedness in a data distribution is the degree to which data values are concentrated around the mean. Datasets with high kurtosis tend to have a distinct peak near the mean, decline rapidly, and have heavy tails. Datasets with low kurtosis tend to have a flatter peak and lighter tails. In finance, a large kurtosis is associated with a high level of risk for an investment, because it indicates that there are high probabilities of extremely large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.
Types of Kurtosis
Kurtosis describes the heaviness of a distribution's tails relative to its peak. There are three main types of kurtosis:
1. Mesokurtic: A distribution with mesokurtic kurtosis has a peak and tails similar to the normal distribution, meaning that its tails are neither too heavy nor too light compared to a normal distribution.
2. Leptokurtic: A distribution with leptokurtic kurtosis has heavier tails and a sharper peak than the normal distribution. It has a positive (excess) kurtosis value, indicating that it has more extreme outliers than a normal distribution.
3. Platykurtic: A distribution with platykurtic kurtosis has lighter tails and a flatter peak than the normal distribution. It has a negative (excess) kurtosis value, indicating that it has fewer extreme outliers than a normal distribution. This type of distribution has its values spread more evenly, with fewer extreme values.
Leptokurtic distributions have very long and thick tails, which means there are more chances of outliers. Positive values of (excess) kurtosis indicate that a distribution is peaked and possesses thick tails. Extremely positive kurtosis indicates a distribution where more of the values lie in the tails, far from the mean.
Platykurtic distributions have thin tails and are stretched around the center, meaning most data points are present in high proximity to the mean. A platykurtic distribution is flatter (less peaked) than a normal distribution.
Mesokurtic (Kurtosis = 3)
A mesokurtic distribution has the same kurtosis as the normal distribution: its kurtosis is 3, i.e. its excess kurtosis is near 0. Mesokurtic distributions are moderate in breadth, and their curves have a medium peaked height.
Difference Between Skewness and Kurtosis
1. Skewness measures the asymmetry of a distribution, whereas Kurtosis measures the heaviness of its tails.
2. Skewness is a measure derived from the third moment, whereas Kurtosis stems from the fourth moment.
3. Skewness can take any value from negative infinity to positive infinity, while excess Kurtosis is bounded below (it cannot fall under -2) but unbounded above.
4. Perfect symmetry is indicated by zero skewness, and normal tails by zero excess kurtosis.
5. Skewness can impact the central tendency of a distribution, whereas kurtosis affects the tails and the concentration of values around the mean.
6. Both Skewness and Kurtosis provide insights into the shape characteristics of distributions.
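In practice, moment-based skewness and kurtosis are usually computed with a library. A sketch using scipy, which by default reports Fisher's (excess) kurtosis, so a normal distribution scores near 0; the simulated samples are for illustration only:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)

print(skew(normal), kurtosis(normal))              # both near 0 (mesokurtic)
print(skew(right_skewed), kurtosis(right_skewed))  # near 2 and 6 for an exponential
```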
Session 11, 12, 13 & 14
Sampling
When you conduct research about a group of people, it’s rarely possible to collect data from
every person in that group. Instead, you select a sample. The sample is the group of
individuals who will actually participate in the research.
To draw valid conclusions from your results, you have to carefully decide how you will select
a sample that is representative of the group as a whole. This is called a sampling method.
There are two primary types of sampling methods that you can use in your research:
• Probability sampling involves random selection, allowing you to make strong statistical
inferences about the whole group.
• Non-probability sampling involves non-random selection based on convenience or other
criteria, allowing you to easily collect data.
First, you need to understand the difference between a population and a sample, and identify
the target population of your research.
• The population is the entire group that you want to draw conclusions about.
• The sample is the specific group of individuals that you will collect data from.
The population can be defined in terms of geographical location, age, income, or many other
characteristics.
It can be very broad or quite narrow: maybe you want to make inferences about the whole
adult population of your country; maybe your research focuses on customers of a certain
company, patients with a specific health condition, or students in a single school.
It is important to carefully define your target population according to the purpose and
practicalities of your project.
Example
You are doing research on working conditions at a social media marketing company. Your
population is all 1000 employees of the company. Your sampling frame is the company’s HR
database, which lists the names and contact details of every employee.
Sample size
The number of individuals you should include in your sample depends on various factors,
including the size and variability of the population and your research design. There are
different sample size calculators and formulas depending on what you want to achieve
with statistical analysis.
Probability sampling methods
Probability sampling means that every member of the population has a chance of being
selected. It is mainly used in quantitative research. If you want to produce results that are
representative of the whole population, probability sampling techniques are the most valid
choice.
To conduct this type of sampling, you can use tools like random number generators or other
techniques that are based entirely on chance.
1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being selected.
Example: Simple random sampling. You want to select a simple random sample of 100 employees of a social media marketing company. You assign a number to every employee in the company database from 1 to 1000, and use a random number generator to select 100 numbers.
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to
conduct. Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
Example: Systematic sampling. All employees of the company are listed in alphabetical order. From
the first 10 numbers, you randomly select a starting point: number 6. From number 6 onwards, every
10th person on the list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100
people.
If you use this technique, it is important to make sure that there is no hidden pattern in the list
that might skew the sample. For example, if the HR database groups employees by team, and
team members are listed in order of seniority, there is a risk that your interval might skip over
people in junior roles, resulting in a sample that is skewed towards senior employees.
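A sketch contrasting the two selection schemes on an invented list of 1,000 employee IDs, matching the running example (sample size 100, interval 10):

```python
import random

population = list(range(1, 1001))        # invented employee IDs

# Simple random sampling: 100 IDs chosen purely by chance.
srs = random.sample(population, 100)

# Systematic sampling: random start among the first 10, then every 10th person.
start = random.randint(0, 9)
systematic = population[start::10]       # also yields exactly 100 people
print(len(srs), len(systematic))
```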
3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in
important ways. It allows you to draw more precise conclusions by ensuring that every
subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based
on the relevant characteristic (e.g., gender identity, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be
sampled from each subgroup. Then you use random or systematic sampling to select a sample
from each subgroup.
Example: Stratified sampling. The company has 800 female employees and 200 male employees. You
want to ensure that the sample reflects the gender balance of the company, so you sort the population
into two strata based on gender. Then you use random sampling on each group, selecting 80 women
and 20 men, which gives you a representative sample of 100 people.
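A sketch of proportional stratified sampling with pandas; the DataFrame and column names are invented, the 800/200 split mirrors the example above, and groupby(...).sample requires pandas 1.1 or later:

```python
import pandas as pd

employees = pd.DataFrame({
    "id": range(1000),
    "gender": ["F"] * 800 + ["M"] * 200,
})
# frac=0.1 draws 10% from each stratum: 80 women and 20 men.
sample = employees.groupby("gender").sample(frac=0.1, random_state=42)
print(sample["gender"].value_counts())
```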
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup
should have similar characteristics to the whole sample. Instead of sampling individuals from
each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If
the clusters themselves are large, you can also sample individuals from within each cluster
using one of the techniques above. This is called multistage sampling.
This method is good for dealing with large and dispersed populations, but there is more risk
of error in the sample, as there could be substantial differences between clusters. It’s difficult
to guarantee that the sampled clusters are really representative of the whole population.
Example: Cluster sampling. The company has offices in 10 cities across the country (all with roughly
the same number of employees in similar roles). You don’t have the capacity to travel to every office
to collect your data, so you use random sampling to select 3 offices – these are your clusters.
Non-probability sampling methods
Non-probability sampling is a sampling method that uses non-random criteria like the
availability, geographical proximity, or expert knowledge of the individuals you want to
research in order to answer a research question.
Non-probability sampling is used when the population parameters are either unknown or not
possible to individually identify. For example, visitors to a website that doesn’t require users
to create an account could form part of a non-probability sample.
Note that this type of sampling is at higher risk for research biases than probability sampling,
particularly sampling bias.
Note
Be careful not to confuse probability and non-probability sampling.
• In non-probability sampling, each unit in your target population does not have an
equal chance of being included. Here, you can form your sample using other
considerations, such as convenience or a particular characteristic.
• In probability sampling, each unit in your target population must have an equal
chance of selection.
Convenience sampling
Convenience sampling is primarily determined by convenience to the researcher. This can include factors such as:
• Ease of access
• Geographical proximity
• Existing contact within the population of interest
Convenience samples are sometimes called “accidental samples,” because participants can be
selected for the sample simply because they happen to be nearby when the researcher is
conducting the data collection.
Example: Convenience sampling. You are investigating the association between daily weather
and daily shopping patterns. To collect insight into people’s shopping patterns, you decide to
stand outside a major shopping mall in your area for a week, stopping people as they exit and
asking them if they are willing to answer a few questions about their purchases.
Quota sampling
In quota sampling, you select a predetermined number or proportion of units, called a quota.
Your quota should comprise subgroups with specific characteristics (e.g., individuals, cases,
or organizations) and should be selected in a non-random manner.
Your subgroups, called strata, should be mutually exclusive. Your estimation can be based
on previous studies or on other existing data, if there are any. This helps you determine how
many units should be chosen from each subgroup. In the data collection phase, you continue
to recruit units until you reach your quota.
Tip: Your respondents should be recruited non-randomly, with the end goal being that the
proportions in each subgroup coincide with the estimated proportions in the population.
There are two types of quota sampling:
1. Proportional quota sampling is used when the size of the population is known. This
allows you to determine the quota of individuals that you need to include in your
sample in order to be representative of your population.
Example: Proportional quota sampling. Let's say that in a certain company there are 1,000
employees. They are split into 2 groups: 600 people who drive to work, and 400 who take the
train.
You decide to draw a sample of 100 employees. You would need to survey 60 drivers and 40
train-riders for your sample to reflect the proportion seen in the company.
2. Non-proportional quota sampling is used when the size of the population is unknown, so the quotas are set directly rather than in proportion to the population.
Example: Non-proportional quota sampling. Let's say you are seeking opinions about the design choices on a website, but do not know how many people use it. You may decide to draw a sample of 100 people, including a quota of 50 people under 40 and a quota of 50 people over 40. This way, you get the perspective of both age groups.
Note that quota sampling may sound similar to stratified sampling, a probability sampling
method where you divide your population into subgroups that share a common characteristic.
The key difference here is that in stratified sampling, you take a random sample from each
subgroup, while in quota sampling, the sample selection is non-random, usually via
convenience sampling. In other words, who is included in the sample is left up to the
subjective judgment of the researcher.
Example: Quota sampling. You work for a market research company. You are seeking to
interview 20 homeowners and 20 tenants between the ages of 45 and 60 living in a certain
suburb.
You stand at a convenient location, such as a busy shopping street, and randomly select
people to talk to who appear to satisfy the age criterion. Once you stop them, you must first
determine whether they do indeed fit the criteria of belonging to the predetermined age range
and owning or renting a property in the suburb.
Sampling continues until the quotas for the various subgroups have been filled. If contacted individuals are unwilling to participate or do not meet one of the conditions (e.g., they are over 60 or they do not live in the suburb), they are simply replaced by those who do. Note that this replacement keeps the quotas filled, but it can mask, rather than remove, nonresponse bias.
Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other participants. To conduct a snowball sample, you start by finding one person who is willing to participate in your research. You then ask them to introduce you to others.
Alternatively, your research may involve finding people who use a certain product or have
experience in the area you are interested in. In these cases, you can also use networks of
people to gain access to your population of interest.
Example: Snowball sampling. You are studying homeless people living in your city. You start
by attending a housing advocacy meeting, striking up a conversation with a homeless woman.
You explain the purpose of your research and she agrees to participate. She invites you to a
parking lot serving as temporary housing and offers to introduce you around.
In this way, the process of snowball sampling begins. You started by attending the meeting,
where you met someone who could then put you in touch with others in the group.
Purposive sampling
Purposive sampling involves the researcher using their expertise to select the sample that is most useful for answering the research questions. It is common in qualitative and mixed methods research designs, especially when considering specific issues with unique cases.
Note:
Unlike random samples—which deliberately include a diverse cross-section of ages,
backgrounds, and cultures—the idea behind purposive sampling is to concentrate on people
with particular characteristics, who will enable you to answer your research questions.
The sample being studied is not representative of the population, but for certain qualitative
and mixed methods research designs, this is not an issue.
Hypothesis Testing
Hypothesis testing is a tool for making statistical inferences about the population data. It is an
analysis tool that tests assumptions and determines how likely something is within a given
standard of accuracy. Hypothesis testing provides a way to verify whether the results of an
experiment are valid.
A null hypothesis and an alternative hypothesis are set up before performing the hypothesis
testing. This helps to arrive at a conclusion regarding the sample obtained from the
population. In this article, we will learn more about hypothesis testing, its types, steps to
perform the testing, and associated examples.
Hypothesis testing uses sample data from the population to draw useful conclusions regarding
the population probability distribution. It tests an assumption made about the data using
different types of hypothesis testing methodologies. The hypothesis testing results in either
rejecting or not rejecting the null hypothesis.
Hypothesis testing can be defined as a statistical tool that is used to identify if the results of
an experiment are meaningful or not. It involves setting up a null hypothesis and an
alternative hypothesis. These two hypotheses will always be mutually exclusive. This means
that if the null hypothesis is true then the alternative hypothesis is false and vice versa. An
example of hypothesis testing is setting up a test to check if a new medicine works on a
disease in a more efficient manner.
Null Hypothesis
The null hypothesis is a concise mathematical statement that is used to indicate that there is
no difference between two possibilities. In other words, there is no difference between certain
characteristics of data. This hypothesis assumes that the outcomes of an experiment are based
on chance alone. It is denoted as H0. Hypothesis testing is used to conclude if the null
hypothesis can be rejected or not. Suppose an experiment is conducted to check if girls are
shorter than boys at the age of 5. The null hypothesis will say that they are the same height.
Alternative Hypothesis
The alternative hypothesis is an alternative to the null hypothesis. It is used to show that the
observations of an experiment are due to some real effect. It indicates that there is a statistical
significance between two possible outcomes and can be denoted as H1 or Ha. For the above-
mentioned example, the alternative hypothesis would be that girls are shorter than boys at the
age of 5.
Hypothesis Testing P Value
In hypothesis testing, the p value is used to indicate whether the results obtained after conducting a test are statistically significant or not. It also indicates the probability of making an error in rejecting or not rejecting the null hypothesis. This value is always a number between 0 and 1. The p value is compared to an alpha level, α, also called the significance level. The alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis. The alpha level is usually chosen between 1% and 5%.
All sets of values that lead to rejecting the null hypothesis lie in the critical region.
Furthermore, the value that separates the critical region from the non-critical region is known
as the critical value.
One Tailed Hypothesis Testing
One tailed hypothesis testing is done when the rejection region is only in one direction. It can
also be known as directional hypothesis testing because the effects can be tested in one
direction only. This type of testing is further classified into the right tailed test and left tailed
test.
Right Tailed Hypothesis Testing
The right tail test is also known as the upper tail test. This test is used to check whether the population parameter is greater than some value. The null and alternative hypotheses for this test are given as follows:
H0: μ = μ0 versus H1: μ > μ0
If the test statistic has a greater value than the critical value, then the null hypothesis is rejected.
Left Tailed Hypothesis Testing
The left tail test is also known as the lower tail test. It is used to check whether the population
parameter is less than some value. The hypotheses for this hypothesis testing can be written
as follows: H0: μ = μ0 versus H1: μ < μ0.
The null hypothesis is rejected if the test statistic has a value less than the critical value.
Two Tailed Hypothesis Testing
In this hypothesis testing method, the critical region lies on both sides of the sampling
distribution. It is also known as a non - directional hypothesis testing method. The two-tailed
test is used when it needs to be determined if the population parameter is assumed to be
different from some value. The hypotheses can be set up as follows: H0: μ = μ0 versus H1: μ ≠ μ0.
The null hypothesis is rejected if the absolute value of the test statistic is greater than the critical value.
The goal of any hypothesis testing is to make a decision. In particular, we will decide
whether to reject the null hypothesis, H0, in favor of the alternative hypothesis, H1.
Although we would always like to make a correct decision, we must remember that the decision will be based on sample information, and thus we are subject to making one of two types of error. A Type I error is the error of rejecting the null hypothesis when it is true. The probability of committing a Type I error is usually denoted by α. A Type II error is the error of accepting the null hypothesis when it is false. The probability of making a Type II error is usually denoted by β. The null hypothesis can be either true or false, and based on the sample drawn we make a conclusion either to reject or not to reject the null hypothesis. Thus, there are four possible situations that may arise in testing a hypothesis.
Hypothesis Testing Steps
Hypothesis testing can be easily performed in five simple steps. The most important step is to
correctly set up the hypotheses and identify the right method for hypothesis testing. The basic
steps to perform hypothesis testing are as follows:
• Step 1: Set up the null hypothesis by correctly identifying whether it is the left-
tailed, right-tailed, or two-tailed hypothesis testing.
• Step 2: Set up the alternative hypothesis.
• Step 3: Choose the correct significance level, α, and find the critical value.
• Step 4: Calculate the correct test statistic and p-value.
• Step 5: Compare the test statistic with the critical value, or compare the p-value with α, to arrive at a conclusion. In other words, decide if the null hypothesis is to be rejected or not.
The best way to solve a problem on hypothesis testing is by applying the 5 steps mentioned in
the previous section. Suppose a researcher claims that the mean weight of men is greater than 100 kgs, with a standard deviation of 15 kgs. 30 men are chosen, with an average weight of 112.5 kgs. Using hypothesis testing, check if there is enough evidence to support the researcher's claim. The confidence level is given as 95%.
Step 1: This is an example of a right-tailed test. Set up the null hypothesis as H0: μ = 100.
Step 2: The alternative hypothesis is given by H1: μ > 100.
Step 3: As this is a one-tailed test, α = 100% - 95% = 5%. This can be used to determine the
critical value. 1 - α = 1 - 0.05 = 0.95
0.95 gives the required area under the curve. Now using a normal distribution table, the area
0.95 is at z = 1.645. A similar process can be followed for a t-test. The only additional
requirement is to calculate the degrees of freedom given by n - 1.
Step 4: Calculate the z test statistic. This is because the sample size is 30. Furthermore, the
sample and population means are known along with the standard deviation.
z = (x̄ − μ) / (σ/√n)
μ = 100, x̄ = 112.5, n = 30, σ = 15
z = (112.5 − 100) / (15/√30) = 4.56
Step 5: Conclusion. As 4.56 > 1.645, the null hypothesis can be rejected.
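The same five steps can be scripted. A sketch of this worked example using scipy's normal-distribution helpers:

```python
from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n = 100, 112.5, 15, 30
z = (xbar - mu0) / (sigma / sqrt(n))   # test statistic
z_crit = norm.ppf(0.95)                # 1.645 for a right-tailed 5% test
p_value = 1 - norm.cdf(z)
print(z, z_crit, p_value)              # 4.56 > 1.645 -> reject H0
```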
Z Test
Z test is a statistical test that is conducted on data that approximately follows a normal
distribution. The z test can be performed on one sample, two samples, or on proportions for
hypothesis testing. It checks if the means of two large samples are different or not when the
population variance is known.
A z test can further be classified into left-tailed, right-tailed, and two-tailed hypothesis tests
depending upon the parameters of the data. In this article, we will learn more about the z test,
its formula, the z test statistic, and how to perform the test for different types of data using
examples.
What is Z Test?
A z test is a test that is used to check if the means of two populations are different or not
provided the data follows a normal distribution. For this purpose, the null hypothesis and the
alternative hypothesis must be set up and the value of the z test statistic must be calculated.
The decision criterion is based on the z critical value.
Z Test Definition
A z test is conducted on a population that follows a normal distribution with independent data
points and has a sample size that is greater than or equal to 30. It is used to check whether the
means of two populations are equal to each other when the population variance is known. The
null hypothesis of a z test can be rejected if the z test statistic is statistically significant when
compared with the critical value.
Z Test Formula
The z test formula compares the z statistic with the z critical value to test whether there is a
difference in the means of two populations. In hypothesis testing, the z critical value divides
the distribution graph into the acceptance and the rejection regions. If the test statistic falls in
the rejection region then the null hypothesis can be rejected otherwise it cannot be rejected.
The z test formula to set up the required hypothesis tests for a one sample and a two-sample z
test are given below.
One-Sample Z Test
A one-sample z test is used to check if there is a difference between the sample mean and the
population mean when the population standard deviation is known. The formula for the z test
statistic is given as follows:
z = (x̄ − μ) / (σ/√n)
𝑥̅ is the sample mean, 𝜇 is the population mean, 𝜎 is the population standard deviation and n
is the sample size.
The decision criterion for a one-sample z test based on the z test statistic is as follows:
Decision Criteria: If the z statistic > z critical value, then reject the null hypothesis.
Two Sample Z Test
A two-sample z test is used to check if there is a difference between the means of two
samples. The z test statistic formula is given as follows:
z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
x̄1, μ1, σ1² are the sample mean, population mean and population variance respectively for the first sample; x̄2, μ2, σ2² are the sample mean, population mean and population variance respectively for the second sample; n1 and n2 are the two sample sizes.
The two-sample z test can be set up in the same way as the one-sample test. However, this
test will be used to compare the means of the two samples. For example, the null hypothesis
is given as H0 : μ1=μ2
One Proportion Z Test
A one proportion z test compares the value of an observed proportion to a theoretical one. The z test statistic for a one proportion z test is given as follows:
z = (p − p0) / √(p0(1 − p0)/n)
Here, p is the observed value of the proportion, p0 is the theoretical proportion value and n is the sample size.
The null hypothesis is that the two proportions are the same while the alternative hypothesis
is that they are not the same.
Two Proportion Z Test
A two-proportion z test is conducted on two proportions to check if they are the same or not (whereas the two sample z test above checks if the means of two different groups are equal). The test statistic formula is given as follows:
z = (p1 − p2) / √(p̂(1 − p̂)(1/n1 + 1/n2))
where p1 = x1/n1 and p2 = x2/n2 are the observed sample proportions, p̂ = (x1 + x2)/(n1 + n2) is the pooled proportion, and n1 and n2 are the sample sizes.
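For proportion tests, statsmodels offers a ready-made routine. A sketch with invented counts; note that by default proportions_ztest estimates the standard error from the sample proportion, so the one-sample statistic can differ slightly from the textbook formula above:

```python
from statsmodels.stats.proportion import proportions_ztest

# One proportion: 58 successes out of 100 trials against p0 = 0.5.
z, p = proportions_ztest(count=58, nobs=100, value=0.5)
print(z, p)

# Two proportions: 45/200 in group 1 versus 30/200 in group 2.
z, p = proportions_ztest(count=[45, 30], nobs=[200, 200])
print(z, p)
```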
t-test Formula
The t-test formula helps us to compare the average values of two data sets and determine if they belong to the same population or not. The t-score is compared with the critical value obtained from the t-table. A large t-score indicates that the groups are different, and a small t-score indicates that the groups are similar.
The t-test formula is applied to the sample population. The t-test formula depends on
the mean, variance, and standard deviation of the data being compared. There are 3 types of t-
tests that could be performed on the n number of samples collected.
• One-sample test,
• Independent sample t-test and
• Paired samples t-test
The critical value is obtained from the t-table by looking up the degrees of freedom (df = n − 1) and the corresponding α value (usually 0.05 or 0.1). If the obtained t-statistic is greater than the critical value, the null hypothesis is rejected and we conclude that the results are significantly different.
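All three t-test variants are available in scipy; a sketch on invented measurements:

```python
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

before = [72, 75, 71, 78, 74, 77, 73, 76]   # invented paired data
after = [70, 74, 70, 75, 72, 74, 71, 74]
other = [68, 71, 69, 73, 70, 72, 69, 71]    # an independent group

print(ttest_1samp(before, popmean=70))      # one-sample test against mean 70
print(ttest_ind(before, other))             # independent samples t-test
print(ttest_rel(before, after))             # paired samples t-test
```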
ANOVA
ANOVA is used to analyze the differences among the means of various groups using certain
estimation procedures. ANOVA means analysis of variance. ANOVA is a statistical
significance test that is used to check whether the null hypothesis can be rejected or not
during hypothesis testing.
An ANOVA can be either one-way or two-way depending upon the number of independent
variables. In this article, we will learn more about an ANOVA test, the one-way ANOVA and
two-way ANOVA, its formulas and see certain associated examples.
What is ANOVA?
ANOVA, in its simplest form, is used to check whether the means of three or more
populations are equal or not. The ANOVA applies when there are more than two independent
groups. The goal of the ANOVA is to check for variability within the groups as well as the
variability among the groups. The ANOVA test statistic is the F statistic.
ANOVA Definition
ANOVA can be defined as a type of test used in hypothesis testing to compare whether the
means of two or more groups are equal or not. This test is used to check if the null hypothesis
can be rejected or not depending upon the statistical significance exhibited by the parameters.
The decision is made by comparing the ANOVA statistic with the critical value.
ANOVA Example
Suppose it needs to be determined if consumption of a certain type of tea will result in a mean
weight loss. Let there be three groups using three types of tea - green tea, earl grey tea, and
jasmine tea. Thus, to compare if there was any mean weight loss exhibited by a certain group,
the ANOVA (one way) will be used.
Suppose a survey was conducted to check if there is an interaction between income and
gender with anxiety level at job interviews. To conduct such a test a two-way ANOVA will
be used.
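A sketch of the one-way tea example with scipy; the weight-loss figures are invented for illustration:

```python
from scipy.stats import f_oneway

green = [2.1, 1.8, 2.4, 2.0, 1.9]       # weight loss in kgs (invented)
earl_grey = [1.2, 1.5, 1.1, 1.4, 1.3]
jasmine = [1.6, 1.7, 1.5, 1.8, 1.6]

f_stat, p_value = f_oneway(green, earl_grey, jasmine)
print(f_stat, p_value)   # small p-value -> at least one group mean differs
```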
Chi Square Formula
Chi-square formula is used to compare two or more statistical data sets. The chi-square
formula is used in data that consist of variables distributed across various categories and
helps us to know whether that distribution is different from what one would expect by
chance.
Example: You research two groups of women and put them in categories of student,
employed or self-employed.
The numbers collected are different, but you now want to know whether that difference is statistically significant or could simply be due to chance.
The chi-squared test checks the difference between the observed values and the expected values. Chi-Square checks the relationship between two categorical variables, and it can be calculated from the given observed frequencies and expected frequencies as
χ² = Σ (O − E)² / E
where O is an observed frequency and E is the corresponding expected frequency.
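A sketch of the two-groups-of-women example as a chi-square test of independence, with invented counts; scipy computes the expected frequencies from the table's margins:

```python
from scipy.stats import chi2_contingency

#           student  employed  self-employed
observed = [[30,      50,       20],          # group 1 (invented counts)
            [25,      40,       35]]          # group 2
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)      # small p-value -> the distributions differ
```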
A non-parametric test is a statistical analysis method that does not assume the population data
belongs to some prescribed distribution which is determined by some parameters. Due to this,
a non-parametric test is also known as a distribution-free test. These tests are usually based
on distributions that have unspecified parameters.
A non-parametric test in statistics does not assume that the data has been taken from
a normal distribution. A normal distribution belongs to a parametrized family
of probability distributions and includes parameters such as mean, variance, standard
deviation, etc. Thus, a non-parametric test does not make assumptions about the
probability distribution's parameters.
A non-parametric test can be defined as a test that is used in statistical analysis when
the data under consideration does not belong to a parametrized family of distributions. When
the data does not meet the requirements to perform a parametric test, a non-parametric test is
used to analyze it.
Reasons to Use Non-Parametric Tests
It is important to assess when to apply parametric and non-parametric tests in order to arrive
at the correct statistical inference. The reasons to use a non-parametric test are given below: