
MODULE 4: Applied Statistics

Chapter 1: Maths Refresher


Foundational Maths for DS

What is Linear Algebra?


Linear algebra is the branch of mathematics concerned with mathematical structures that are closed under the
operations of addition and scalar multiplication. It includes the theory of systems of linear equations,
matrices, determinants, vector spaces, and linear transformations. It underpins a wide range of Data Science
algorithms and applications, and its fundamentals are an essential prerequisite for fully understanding
machine learning.

Why do you need to learn Linear Algebra?


Linear algebra is a foundation of machine learning.
In machine learning, data is most often represented as vectors, matrices, or tensors. Therefore, machine
learning relies heavily on linear algebra.
1. A vector is a 1D array. For instance, a point in space can be defined as a vector of three coordinates (x,
y, z). It is usually defined so that it has both a magnitude and a direction.
2. A matrix is a two-dimensional array of numbers with a fixed number of rows and columns. It
contains a number at the intersection of each row and each column and is usually denoted by
square brackets [].
3. A tensor is a generalization of vectors and matrices. A tensor of dimension one is a vector, a tensor of
two dimensions is a matrix, and a three-dimensional tensor can represent, for example, an RGB image.
This continues to expand to four-dimensional tensors and so on.

Applications of Linear Algebra in Data Science


Coordinate Transformations
Linear Regression
Dimensionality Reduction
Natural Language Processing
Computer Vision
Network Graphs

What is a vector?
*A vector is a mathematical quantity with both magnitude and direction.*
The magnitude of vector u, denoted |u|, is its length.
One way to express its direction is to give the angle it makes with a horizontal ray (pointing to the
right) that is parallel to the positive x-axis. This is called the vector's direction angle.

from IPython import display

display.IFrame(src='https://www.geogebra.org/classic/xbunqxhv', width=1000,
height=800)

Vector Addition
The sum, u + v, of two vectors, u and v, is constructed by placing u, at some arbitrary location, and then placing
v such that v's tail point coincides with u's tip point, and u + v is the vector that starts at u's tail point, and ends at
v's tip point.
Adding Vectors Geometrically
Let us discover how to geometrically add any 2 vectors.
1. Determine the components of vectors u and v by using the purple and green sliders.
2. Slide the black slider (bottom right) to geometrically form vectors u and v.
3. Move the YELLOW POINT (initial point of v) ON TOP OF the ORANGE POINT (terminal point of
u).
4. Slide the additional slider that appears once you complete the previous step. The vector that appears
will be the RESULTANT VECTOR.
5. In this case, the RESULTANT VECTOR is the sum of vectors u and v.

from IPython import display


display.IFrame(src='https://www.geogebra.org/classic/wvfxkuvc', width=1000,
height=800)

Vector Subtraction
display.IFrame(src='https://www.geogebra.org/classic/thkzrnqa', width=1000,
height=800)

Dot Product Insight


For two vectors u = (a, b) and v = (c, d), the dot product u · v is (a)(c) + (b)(d).
The dot product of two vectors is a scalar quantity.
display.IFrame(src='https://www.geogebra.org/classic/n2evzwqn', width=1000,
height=800)
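
As a quick illustration in code (a minimal sketch; the vectors u and v below are arbitrary example values, not taken from the applet), the dot product can be computed with NumPy:

import numpy as np

u = np.array([2, 3])   # u = (a, b)
v = np.array([4, 1])   # v = (c, d)

# Dot product: (a)(c) + (b)(d) = 2*4 + 3*1 = 11
print(np.dot(u, v))    # 11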

Vector Projections
The projection of u(tree) onto v(ground) is another vector that is parallel to v and has a length equal to what
vector u's shadow would be (if it were cast onto the ground).
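
A minimal NumPy sketch of this idea, assuming example vectors u and v (the standard projection formula proj_v(u) = (u·v / v·v)·v is used):

import numpy as np

u = np.array([3, 4])   # the "tree"
v = np.array([1, 0])   # the "ground"

# Projection of u onto v: (u.v / v.v) * v
proj = (np.dot(u, v) / np.dot(v, v)) * v
print(proj)            # [3. 0.]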

Orthogonality
Two vectors are orthogonal if they are perpendicular to each other, i.e., the dot product of the two vectors
is zero.
display.IFrame(src='https://www.geogebra.org/classic/hz92uxy3', width=1000,
height=800)

Vector Norms
L1 norm
It is defined as the sum of the magnitudes of each component. For a = (a1, a2, a3):
L1 norm of vector a = |a1| + |a2| + |a3|
L2 norm
It is defined as the square root of the sum of squares of each component:
L2 norm of vector a = √(a1² + a2² + a3²)
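
Both norms can be computed with NumPy's np.linalg.norm (a small sketch with an example vector a):

import numpy as np

a = np.array([3, -4, 12])

l1 = np.linalg.norm(a, ord=1)   # |3| + |-4| + |12| = 19
l2 = np.linalg.norm(a)          # sqrt(9 + 16 + 144) = 13
print("L1 norm:", l1)
print("L2 norm:", l2)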

What is the Theory of Matrices?


Matrix theory is a branch of mathematics that deals with the study of matrices, which are rectangular arrays of
numbers or variables. Matrices are used in various fields such as physics, engineering, and data science for
solving problems that involve linear equations, linear transformations, and systems of equations.

Types of Matrices:
Square matrices: A square matrix has an equal number of rows and columns.
Symmetric matrices: A symmetric matrix is a square matrix that is equal to its transpose.
Diagonal matrices: A diagonal matrix is a square matrix in which all the off-diagonal elements are zero.
Identity matrices: An identity matrix is a square matrix in which all the diagonal elements are equal to one, and
all the off-diagonal elements are equal to zero.
Singular matrices: A singular matrix is a matrix that does not have an inverse.
Matrix Operations:
There are several operations that can be performed on matrices:
Addition: To add two matrices, we add their corresponding elements.
Subtraction: To subtract two matrices, we subtract their corresponding elements.
Multiplication: To multiply two matrices, each entry of the product is the dot product of the corresponding row of the first matrix with the corresponding column of the second (the number of columns of the first matrix must equal the number of rows of the second).
Transpose: To find the transpose of a matrix, we interchange its rows and columns.
Inverse: To find the inverse of a matrix, we use a formula that involves the determinant of the matrix.

Examples
Addition:
A =[2 4
1 3]
B =[5 1
2 7]
A + B =[2+5 4+1
1+2 3+7]= [7 5
3 10]
Subtraction:
A =[1 2
3 4]
B =[5 6
7 8]
A - B =[1−5 2−6
3−7 4−8]= [−4 −4
−4 −4]
Multiplication:
A =[1 2
3 4]
B =[5 6
7 8]
A * B =[1×5+2×7 1×6+2×8
3×5+4×7 3×6+4×8]= [19 22
43 50]
Transpose:
A = [1 2
3 4
5 6]
Aᵀ = [1 3 5
2 4 6]
Inverse:
A = [2 1
4 3]
We first calculate the determinant:
det(A) = 2×3 − 1×4 = 2
Since the determinant is nonzero, we know that A is invertible. We can then use the following formula to find the
inverse of A:
A⁻¹ = (1/det(A)) × adj(A)
= (1/2) × [3 −1
−4 2] = [3/2 −1/2
−2 1]
Therefore, the inverse of A is:
[3/2 −1/2
−2 1]
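
The worked examples above can be reproduced with NumPy (a short sketch; np.linalg.inv is used in place of the manual adjoint calculation):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)             # element-wise addition
print(A - B)             # element-wise subtraction
print(A @ B)             # matrix multiplication -> [[19 22] [43 50]]
print(A.T)               # transpose

C = np.array([[2, 1], [4, 3]])
print(np.linalg.inv(C))  # inverse -> [[ 1.5 -0.5] [-2.   1. ]]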

What is the Determinant of a Matrix?


The determinant of a matrix is a scalar value that can be computed from the elements of a square matrix. The
determinant provides important information about the properties of the matrix, such as whether the matrix has
an inverse or not. It also plays a crucial role in solving systems of linear equations, calculating eigenvalues and
eigenvectors, and in many other applications in data science.

Formula for Determinant:


The formula for finding the determinant of a 2x2 matrix with rows [a b] and [c d] is:
|A| = ad - bc
The formula for finding the determinant of a 3x3 matrix with rows [a b c], [d e f], and [g h i] is:
|A| = a(ei - fh) - b(di - fg) + c(dh - eg)

The formula for finding the determinant of a larger square matrix is more complex and involves expanding the
matrix along any row or column using a method called cofactor expansion.
Example:
Suppose we have a 3x3 matrix A, where:
A = [1 2 3
4 5 6
7 8 9]
To find the determinant of A using the formula for a 3x3 matrix, we have:
|A| = 1(5×9 − 6×8) − 2(4×9 − 6×7) + 3(4×8 − 5×7) = 0
*Since the determinant is zero, we know that the matrix A is singular and does not have an inverse.*
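
The same result can be checked with NumPy (a minimal sketch):

import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.linalg.det(A))   # approximately 0 (up to floating-point error)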

Inverse of a matrix

The inverse of a matrix is a matrix that, when multiplied by the original matrix, results in the identity matrix.
The identity matrix is a square matrix that has ones on the main diagonal and zeros elsewhere. The inverse of a
matrix is important in many applications in data science, such as solving systems of linear equations, finding
eigenvalues and eigenvectors, and inverting matrices to transform data.
Formula for Inverse:
The formula for finding the inverse of a matrix A is
A⁻¹ = (1/det(A)) × adj(A)
where |A| or det(A) is the determinant of A, and adj(A) is the adjoint (adjugate) of A. The adjoint of A is the transpose of
the cofactor matrix of A, where each cofactor is the determinant of the corresponding minor of A multiplied by the appropriate sign.
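
As a sketch of this formula in code, using the 2x2 matrix from the earlier example (for a 2x2 matrix with rows [a b] and [c d], the adjoint is simply the matrix with rows [d −b] and [−c a]):

import numpy as np

A = np.array([[2, 1], [4, 3]])

det_A = np.linalg.det(A)                  # 2
adj_A = np.array([[A[1, 1], -A[0, 1]],
                  [-A[1, 0], A[0, 0]]])   # adjoint of a 2x2 matrix
A_inv = adj_A / det_A

print(A_inv)               # [[ 1.5 -0.5] [-2.   1. ]]
print(np.linalg.inv(A))    # same result using NumPy directly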

Eigenvalues and Eigenvectors

🔹 Eigenvalues and eigenvectors are important concepts in linear algebra and have various applications in data
science and machine learning.

🔹 An eigenvector of a square matrix A is a non-zero vector v such that when A is multiplied by v, the result is a
scalar multiple of v. The scalar multiple is called the eigenvalue of the matrix A.

🔹 Eigenvalues and eigenvectors are often used to transform and reduce the dimensionality of data in various
algorithms like PCA, SVD, and others.

🔹 The eigenvalues of a matrix can be found by solving the characteristic equation of the matrix, which is det(A -
λI) = 0, where det is the determinant of the matrix, A is the matrix, λ is the eigenvalue and I is the identity
matrix of the same size as A.

🔹 Once we find the eigenvalues, we can find the corresponding eigenvectors by solving the system of linear
equations (A - λI)x = 0.

🔹 Eigenvectors corresponding to different eigenvalues are linearly independent; when a matrix A has a full set of them,
they form a basis of the space on which A acts.

🔹 Eigenvectors can also be normalized to have unit length, and thus form an orthonormal basis for the space.

🔹 A matrix is diagonalizable if it has n linearly independent eigenvectors, where n is the size of the matrix.
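
A small NumPy sketch of finding eigenvalues and eigenvectors (the matrix A below is an arbitrary example):

import numpy as np

A = np.array([[4, 1],
              [2, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)    # 5 and 2 (order may vary)
print("Eigenvectors (columns):")
print(eigenvectors)

# Check the definition A v = lambda v for the first pair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True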

Example (SVD): Consider the matrix A = [[3, 1], [1, 3], [2, 2]]. To find the SVD of A, we can follow these steps:
Compute AᵀA and AAᵀ:
AᵀA = [[14, 10], [10, 14]]
AAᵀ = [[10, 6, 8], [6, 10, 8], [8, 8, 8]]
Find the eigenvalues and eigenvectors of AᵀA and AAᵀ:
Eigenvalues of AᵀA: 24, 4
Eigenvectors of AᵀA: [1/√2, 1/√2], [1/√2, −1/√2]
Eigenvalues of AAᵀ: 24, 4, 0
Eigenvectors of AAᵀ (for the nonzero eigenvalues): [1/√3, 1/√3, 1/√3], [1/√2, −1/√2, 0]
The nonzero eigenvalues of AᵀA and AAᵀ coincide, and their square roots (√24 ≈ 4.899 and 2) are the singular values of A.
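
This decomposition can be checked with NumPy (a minimal sketch; the singular values should be the square roots of the nonzero eigenvalues found above, √24 ≈ 4.899 and √4 = 2):

import numpy as np

A = np.array([[3, 1],
              [1, 3],
              [2, 2]])

U, S, Vt = np.linalg.svd(A)
print("Singular values:", S)        # approximately [4.899, 2.0]

# Check the eigen-decomposition used above
print(np.linalg.eigvals(A.T @ A))   # approximately 24 and 4 (order may vary)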

Advanced Maths for DS


Calculus
Calculus is a branch of mathematics that deals with the study of rates of change and how things change over
time. It is widely used in various fields such as science, engineering, economics, and finance. In advanced maths
for data science, calculus is an essential topic as it provides the foundation for many statistical and machine
learning models.
Here are some important topics to cover in the introduction to calculus:
Functions:
A function is a rule that assigns to each input value exactly one output value. It is represented by a graph, an
equation, or a table. Some examples of functions are linear, quadratic, exponential, and trigonometric functions.
Limits:
A limit is the value that a function approaches as the input approaches a certain value. It is denoted by the
symbol "lim" and is used to describe the behavior of a function near a specific point. The concept of limits is
crucial in calculus as it forms the basis for derivatives and integrals.
Continuity:
A function is said to be continuous if its graph can be drawn without lifting the pen from the paper. A function
is continuous at a point if it has a limit at that point and the limit is equal to the function value. Continuity is an
essential concept in calculus as it helps in determining the behavior of a function near a particular point.
Derivatives:
A derivative is the rate at which a function changes with respect to its input. It measures the slope of the tangent
line at a particular point on the graph of the function. The derivative is denoted by the symbol "dy/dx" or "f'(x)"
and is used to solve optimization problems and to analyze the behavior of a function.

What is Differentiation?
Differentiation is a method of finding the rate at which a function changes with respect to its input (for
example, over time). It is an important concept in calculus and is used, among other things, to determine
the maximum and minimum values of a function.
The derivative of a function f(x) with respect to x is denoted by f'(x) or dy/dx.
Formulas:
1. Power Rule: d/dx [x^n] = n·x^(n−1)
2. Product Rule: d/dx [f(x)g(x)] = f′(x)g(x) + f(x)g′(x)
3. Quotient Rule: d/dx [f(x)/g(x)] = (f′(x)g(x) − f(x)g′(x)) / (g(x))²
4. Chain Rule: d/dx [f(g(x))] = f′(g(x))·g′(x)
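
These rules can be checked symbolically with SymPy (a small sketch with example functions):

import sympy as sp

x = sp.symbols('x')

print(sp.diff(x**5, x))               # power rule: 5*x**4
print(sp.diff(x**2 * sp.sin(x), x))   # product rule: x**2*cos(x) + 2*x*sin(x)
print(sp.diff(sp.sin(x**2), x))       # chain rule: 2*x*cos(x**2)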

Integration
Integration is the inverse process of differentiation. It involves finding the function whose derivative is a given
function. The symbol used to represent integration is the integral sign (∫), and the function to be integrated is
placed after the integral sign. The limits of integration are also specified, which determine the range of values
over which the integration is to be performed.
There are different techniques for integration, including:
Integration by Substitution:
Example: Evaluate the integral ∫ 2x·(x² + 1)³ dx.
Let u = x² + 1, so du = 2x dx.
The integral becomes ∫ u³ du = u⁴/4 + C = (x² + 1)⁴/4 + C.
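
SymPy can confirm this result (a quick sketch; SymPy may return the expanded polynomial, which equals (x² + 1)⁴/4 up to a constant, and it omits the constant of integration):

import sympy as sp

x = sp.symbols('x')
print(sp.integrate(2*x * (x**2 + 1)**3, x))   # equivalent to (x**2 + 1)**4 / 4 up to a constant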

Integration by parts:
∫ u dv = uv − ∫ v du.
Example: Evaluate the integral ∫ x·eˣ dx.
Solution: Integration by parts involves choosing two functions, one to differentiate and one to integrate.
Let's choose u = x and dv = eˣ dx.
Differentiating u gives du = dx, and integrating dv gives v = eˣ. Then
∫ x·eˣ dx = x·eˣ − ∫ eˣ dx = x·eˣ − eˣ + C
Partial fraction decomposition:
This technique is used for integrating rational functions, where the function is expressed as a sum of simpler
fractions.
Example: Decompose the rational function (3x + 1)/(x² + 2x) into partial fractions.
Solution: To decompose the rational function, we want to express it as a sum of simpler fractions. First, we
factor the denominator:
x² + 2x = x(x + 2).
Now, we want to express the given rational function as a sum of two partial fractions:
(3x + 1)/(x² + 2x) = A/x + B/(x + 2),
where A and B are constants that we need to find.
To solve for A and B, we'll find a common denominator on the right side:
(3x + 1)/(x² + 2x) = A(x + 2)/[x(x + 2)] + Bx/[x(x + 2)].
Now, we'll combine the fractions:
(3x + 1)/(x² + 2x) = [A(x + 2) + Bx] / [x(x + 2)].
We want the numerators to be the same, so we'll set up an equation:
3x + 1 = A(x + 2) + Bx.
Now, let's solve for A and B:
3x + 1 = Ax + 2A + Bx.
Equating coefficients of the x terms and the constant terms:
3 = A + B,
1 = 2A.
From the second equation, we find A = 1/2.
Substitute A back into the first equation to find B:
3 = 1/2 + B,
B = 3 − 1/2,
B = 5/2.
So, we have the partial fraction decomposition:
(3x + 1)/(x² + 2x) = 1/(2x) + 5/(2(x + 2)).
This is the partial fraction decomposition of the given rational function.
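
SymPy's apart function performs this decomposition directly (a minimal sketch):

import sympy as sp

x = sp.symbols('x')
expr = (3*x + 1) / (x**2 + 2*x)
print(sp.apart(expr))   # 1/(2*x) + 5/(2*(x + 2)), possibly printed in a different order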

Trigonometric substitution:
This technique is used for integrating expressions that can be simplified using trigonometric identities.
Example: Evaluate the integral
∫ x / √(1 + x⁴) dx
Solution: We can use trigonometric substitution to simplify this integral. Let's use the
substitution x² = tan(θ), which gives us
2x dx = sec²(θ) dθ, i.e., x dx = (1/2) sec²(θ) dθ.
Now, we express the radical in terms of θ:
√(1 + x⁴) = √(1 + tan²(θ)) = √(sec²(θ)) = sec(θ).
The integral becomes:
∫ x / √(1 + x⁴) dx = ∫ (1/sec(θ)) · (1/2) sec²(θ) dθ = (1/2) ∫ sec(θ) dθ = (1/2) ln|sec(θ) + tan(θ)| + C.
However, we need to convert our answer back to the original variable x. Recall that we used x² = tan(θ), so
sec(θ) = √(1 + x⁴).
Substituting back:
(1/2) ln|sec(θ) + tan(θ)| + C = (1/2) ln(x² + √(1 + x⁴)) + C.
So, the final result is:
∫ x / √(1 + x⁴) dx = (1/2) ln(x² + √(1 + x⁴)) + C,
where C is the constant of integration.
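
As a quick check (a sketch; SymPy may express the answer using the inverse hyperbolic sine, asinh(x²)/2, which is equal to (1/2)·ln(x² + √(1 + x⁴))):

import sympy as sp

x = sp.symbols('x', positive=True)
print(sp.integrate(x / sp.sqrt(1 + x**4), x))   # asinh(x**2)/2 or an equivalent logarithmic form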
Practice Problem:
Q. ∫ x/(x² + 1) dx
Answer. Let u = x² + 1; then du/dx = 2x, which implies dx = du/(2x).
Substituting these values in the integral, we get:

∫ x/(x² + 1) dx = (1/2) ∫ (1/u) du = (1/2) ln|u| + C = (1/2) ln(x² + 1) + C

where C is the constant of integration.

Chain Rule:
The chain rule is a formula for computing the derivative of the composition of two or more functions. In other
words, it tells us how to take the derivative of a function that is formed by nesting one function inside another.
Formula:
Let f and g be functions, and let y = f(g(x)) be their composite function. Then the derivative of y with respect
to x is given by:
If y = f(g(x)) and u = g(x), then
dy/dx = (dy/du)·(du/dx) = f′(g(x))·g′(x)
and, equivalently, dx/dy = 1/(dy/dx) = 1/[f′(g(x))·g′(x)].

The chain rule is used to compute the derivative of composite functions.


The formula for the chain rule involves taking the derivative of the outer function evaluated at the inner
function, multiplied by the derivative of the inner function.
The chain rule is an important tool in calculus and is used frequently in many areas of math and science.
Eg: Let's say we have the function f(x) = (x² + 3x)⁴, and we want to find the derivative of f with respect to x.
Using the chain rule, we can break down the problem into two smaller problems:
First, we define g(x) = x² + 3x and h(u) = u⁴.
Then we can write f(x) = h(g(x)),
which means that f is the composition of g and h.
Now we can apply the chain rule to find f′(x):
df/dx = (dh/du)·(dg/dx) = 4u³·(2x + 3)
But we're not done yet! We need to substitute u = g(x) = x² + 3x back into this expression to get our final answer:
df/dx = 4(x² + 3x)³·(2x + 3)
So the derivative of f(x) = (x² + 3x)⁴ is
f′(x) = 4(x² + 3x)³·(2x + 3)
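
SymPy confirms this derivative (a minimal sketch):

import sympy as sp

x = sp.symbols('x')
f = (x**2 + 3*x)**4
print(sp.diff(f, x))   # equivalent to 4*(2*x + 3)*(x**2 + 3*x)**3 (SymPy may fold the constant into the linear factor)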

Maxima and Minima using derivatives


Maxima and minima are critical points in a function that represent its highest and lowest values, respectively. In
calculus, we use derivatives to find these points.
To find the maxima and minima of a function, we first take the first derivative of the function and set it equal to
zero. We then solve for the values of x that make the derivative zero. These values are known as critical points.
We then analyze the second derivative of the function at these critical points. If the second derivative is positive
at a critical point, then the point is a local minimum. If the second derivative is negative, then the point is a local
maximum. If the second derivative is zero, then further analysis is required.
Example:
Find the maximum and minimum values of the function f(x) = 2x³ − 9x² + 12x + 4 on the interval [0, 4].
Solution:
To find the critical points of f(x), we take the first derivative and set it equal to zero:
f′(x) = 6x² − 18x + 12 = 0
Solving for x, we get:
x = 1 or x = 2
Now, we need to check the values of f(x) at the critical points and at the endpoints of the interval [0, 4]:
f(0) = 4, f(1) = 9, f(2) = 8, f(4) = 36
Therefore, the minimum value of f(x) is 4, which occurs at x = 0, and the maximum value of f(x) is 36, which
occurs at x = 4.
To summarize:
To find the critical points of a function, we take the first derivative and set it equal to zero.
To determine if a critical point is a maximum, minimum, or neither, we use the second derivative test. If
f′′(x) > 0, the critical point is a minimum; if f′′(x) < 0, the critical point is a maximum; and if f′′(x) = 0, the test is
inconclusive.
To find the maximum and minimum values of a function on a closed interval, we evaluate the function at the
critical points and at the endpoints of the interval. The largest of these values is the maximum value, and the
smallest is the minimum value.
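
The worked example above can be reproduced with SymPy (a short sketch of the procedure described in this section):

import sympy as sp

x = sp.symbols('x')
f = 2*x**3 - 9*x**2 + 12*x + 4

# Critical points: solve f'(x) = 0
critical_points = sp.solve(sp.diff(f, x), x)
print("Critical points:", critical_points)      # [1, 2]

# Second derivative test: negative -> local maximum, positive -> local minimum
for c in critical_points:
    print(c, sp.diff(f, x, 2).subs(x, c))

# Evaluate f at the critical points and at the endpoints of [0, 4]
for point in [0] + critical_points + [4]:
    print(point, f.subs(x, point))              # f(0)=4, f(1)=9, f(2)=8, f(4)=36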

Partial Derivatives:
A partial derivative of a multivariable function is the derivative with respect to one of its variables, keeping all
other variables constant. For example, if we have a function f(x, y), the partial derivative of f with respect to x is
denoted as ∂f/∂x and is defined as the rate of change of f as x changes, with y held constant.
Notation:
The partial derivative of a function f(x, y) with respect to x is denoted by ∂f/∂x. The symbol ∂ is used to indicate
a partial derivative, and it is pronounced "dee" or "del". To denote the partial derivative of f with respect to y,
we write ∂f/∂y.
Interpretation:
The partial derivative of a function gives us the rate at which the function changes as one of its input variables
changes, while holding all other input variables constant. It tells us how much the output of the function changes
when we change only one of the input variables.
Rules:
Just like ordinary derivatives, partial derivatives satisfy many of the same rules. Here are a few of the most
important ones:
1. The sum rule: ∂(f + g)/∂x = ∂f/∂x + ∂g/∂x
2. The product rule: ∂(fg)/∂x = f·∂g/∂x + g·∂f/∂x
3. The chain rule: ∂f(u, v)/∂x = (∂f/∂u)(∂u/∂x) + (∂f/∂v)(∂v/∂x)
Application:
Partial derivatives are used in many areas of mathematics and science, including optimization, physics,
economics, and engineering. For example, in optimization problems, we may need to find the maximum or
minimum value of a function of several variables. To do this, we can take partial derivatives of the function with
respect to each variable, set them equal to zero, and solve for the variables.
Suppose we have the function f(x, y) = x²y + y³, and we want to find the partial derivatives with respect to
x and y at the point (1, 2).
The partial derivative with respect to x can be found by treating y as a constant and taking the derivative of the
function with respect to x:
∂f/∂x = 2xy
At the point (1, 2), this gives us:
∂f/∂x(1, 2) = 2(1)(2) = 4
The partial derivative with respect to y can be found by treating x as a constant and taking the derivative of the
function with respect to y:
∂f/∂y = x² + 3y²
At the point (1, 2), this gives us:
∂f/∂y(1, 2) = (1)² + 3(2)² = 13
Therefore, the partial derivatives of f(x,y) with respect to x and y at the point (1,2) are 4 and 13, respectively.
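
The same computation can be done with SymPy (a minimal sketch):

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3

df_dx = sp.diff(f, x)   # 2*x*y
df_dy = sp.diff(f, y)   # x**2 + 3*y**2

print(df_dx.subs({x: 1, y: 2}))   # 4
print(df_dy.subs({x: 1, y: 2}))   # 13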
Chapter 2: Descriptive Statistics
Probability Theory
Probability theory is a branch of mathematics that deals with the study of randomness and
uncertainty. It provides a framework for understanding and quantifying the likelihood of events,
making it an essential tool in many fields such as statistics, finance, and engineering.

In probability theory, we define events as subsets of a sample space, which is the set of all possible
outcomes of a random experiment. We use probabilities to assign a numerical measure to the
likelihood of each event occurring. Probability is always a number between 0 and 1, with 0 indicating
that an event is impossible and 1 indicating that an event is certain.

Probability theory provides us with many useful tools for analyzing and quantifying uncertainty, such
as probability distributions, random variables, and expected values. By understanding these concepts,
we can make more informed decisions and better manage risk in our daily lives.

One common example of probability theory in action is in casino games such as roulette or blackjack.
The rules of these games are based on the laws of probability, and understanding these laws can help
players make more strategic bets and increase their chances of winning.

import random
# Simulate rolling a six-sided die
die_roll = random.randint(1, 6)
print("You rolled a", die_roll)
# Simulate flipping a coin
coin_flip = random.choice(["heads", "tails"])
print("The coin landed on", coin_flip)

You rolled a 2
The coin landed on heads
This code uses the random library in Python to simulate the outcomes of rolling a die or flipping a coin, which
are classic examples of random experiments. By repeating these experiments many times, we can estimate the
probabilities of different outcomes and gain a deeper understanding of probability theory.
Probability axioms:
Non-negativity: The probability of an event can never be negative, i.e., P(A) >= 0.
Normalization: The sum of probabilities of all possible outcomes of an experiment is equal to 1, i.e., P(S) = 1
where S is the sample space.
Additivity: The probability of the union of two mutually exclusive events is equal to the sum of their individual
probabilities, i.e., P(A or B) = P(A) + P(B) for A and B that are mutually exclusive.

Probability rules:
Complement rule: The probability of an event not occurring is 1 minus the probability of the event occurring,
i.e., P(A') = 1 - P(A).
Union rule: The probability of the union of two events A and B is given by P(A or B) = P(A) + P(B) - P(A and
B), where P(A and B) is the probability of the intersection of A and B.
Conditional probability rule: The probability of an event A given that event B has occurred is given by P(A|B)
= P(A and B) / P(B), where P(A and B) is the joint probability of A and B occurring together and P(B) is the
probability of event B occurring.
Multiplication rule: The probability of two events A and B occurring together is given by P(A and B) = P(A|B)
* P(B) or P(B|A) * P(A), where P(A|B) and P(B|A) are the conditional probabilities of A given B and B given
A, respectively.

# Complement rule example


total = 10
event_A = 3
# Probability of event A occurring
p_A = event_A / total
print("Probability of event A:", p_A)
# Probability of event A not occurring
p_A_complement = 1 - p_A
print("Probability of event A not occurring:", p_A_complement)

Probability of event A: 0.3


Probability of event A not occurring: 0.7

Conditional Probability:
Conditional probability is the probability of an event A given that another event B has already occurred. It is
denoted by P(A|B), which means the probability of A given B. The formula for conditional probability is:
P(A|B) = P(A ∩ B) / P(B)
Where P(A ∩ B) is the probability of both A and B occurring, and P(B) is the probability of B occurring.

Bayes' Theorem:
Bayes' theorem is a fundamental concept in probability theory, and it is used to calculate conditional
probabilities. It states that the probability of an event A given that another event B has occurred can be
calculated as:
P(A|B) = P(B|A) * P(A) / P(B)
Where P(B|A) is the probability of B given A has occurred, P(A) is the prior probability of A, and P(B) is the
prior probability of B. Bayes' theorem can be used to update our beliefs about the probability of an event as new
information becomes available.

Suppose you have two decks of cards, one red and one blue, with 52 cards each. You draw a card from the red
deck and observe that it is a heart. What is the probability that the card is an ace?

Here, A is the event of drawing an ace, and B is the event of drawing a heart. The probability of drawing an ace
and a heart is P(A ∩ B) = 1/52, and the probability of drawing a heart is P(B) = 13/52. Using the formula for
conditional probability, we get P(A|B) = P(A ∩ B) / P(B) = (1/52) / (13/52) = 1/13, which is the probability of
drawing an ace given that the card is a heart.

Suppose a test for a disease is 95% accurate, meaning that if a person has the disease, the test will correctly
identify it 95% of the time. However, the test also has a false positive rate of 5%, meaning that if a person does
not have the disease, the test will incorrectly identify them as having it 5% of the time. If 1% of the population
has the disease, what is the probability that a person who tests positive actually has the disease?
Here, A is the event of having the disease, and B is the event of testing positive. The probability of having the
disease is P(A) = 0.01, and the probability of testing positive given that the person has the disease is P(B|A) =
0.95. The probability of testing positive given that the person does not have the disease is P(B|¬A) = 0.05. Using
Bayes' theorem, we can calculate the probability of having the disease given that the person tests positive as
P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|¬A) * P(¬A)) = (0.95 * 0.01) / (0.95 * 0.01 + 0.05 * 0.99) =
0.16, which is the probability of having the disease given that the person tests positive.
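
The same calculation can be written directly in Python (a small sketch using the numbers from this example):

# Disease-testing example using Bayes' theorem
p_disease = 0.01             # P(A): prior probability of having the disease
p_pos_given_disease = 0.95   # P(B|A): sensitivity of the test
p_pos_given_healthy = 0.05   # P(B|not A): false positive rate

p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))   # P(B)

p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 3))   # approximately 0.161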

Random variables and their properties


Random variables are used to describe the possible outcomes of a random process. A random variable is a
function that maps each outcome of a random process to a numerical value. For example, if we roll a die, the
random variable can be defined as the number that appears on the top face.

Random variables can be classified into two types: discrete and continuous. A discrete random variable can take
on a countable number of values, while a continuous random variable can take on any value in a continuous
range. Examples of discrete random variables include the number of heads in multiple coin flips, while an
example of a continuous random variable is the height of a randomly selected person.
The properties of a random variable can be described by its probability distribution, which specifies the
probabilities of each possible value of the variable. The probability distribution of a discrete random variable
can be represented by a probability mass function (PMF), while the probability distribution of a continuous
random variable can be represented by a probability density function (PDF).
The expected value of a random variable is the weighted average of its possible values, with the weights given
by their respective probabilities. It is a measure of the central tendency of the variable's probability distribution.
The variance of a random variable measures how much its values deviate from its expected value, and it is a
measure of the variability of the variable's probability distribution.
In Python, we can use libraries like NumPy and SciPy to work with random variables and their properties. For
example, we can generate random samples from a given probability distribution using the random module of
NumPy, and we can calculate the expected value and variance of a given distribution using the functions in the
SciPy library.
Here's an example code snippet to generate a random sample of size 10 from a normal distribution with
mean 0 and variance 1 using NumPy:
import numpy as np
sample = np.random.normal(loc=0, scale=1, size=10)
print(sample)
Here's an example code snippet to calculate the expected value and variance of a normal distribution with mean
0 and variance 1 using SciPy:
from scipy.stats import norm
mu, var = norm.stats(loc=0, scale=1, moments='mv')
print(f"Expected value: {mu}, Variance: {var}")
Law of large numbers:
The law of large numbers is a fundamental concept in probability theory that describes the relationship between
the sample size of a random variable and its expected value.
1. The law of large numbers states that as the sample size of a random variable increases, the sample
mean will approach the expected value of the variable. This means that the more data you have, the
more accurate your estimate of the true underlying probability distribution will be.
2. This law applies to both discrete and continuous random variables, and it is a key concept in many
areas of statistics and machine learning.
3. The law of large numbers is closely related to the central limit theorem, which states that the
distribution of the sample means approaches a normal distribution as the sample size increases.
4. The law of large numbers has important implications for decision-making and risk management. It
suggests that making decisions based on a large sample size is generally more reliable and accurate
than making decisions based on a small sample size.
5. In practice, the law of large numbers is often used in simulations and statistical modeling to generate
more accurate estimates of probabilities and other statistical measures. For example, if we want to
estimate the probability of a rare event occurring, we can use the law of large numbers to simulate
many trials and calculate the proportion of trials in which the event occurs.
Here is an example of using the law of large numbers to estimate the probability of rolling a 6 on a fair six-
sided die:
import random
def roll_die(n):
    count = 0
    for i in range(n):
        if random.randint(1, 6) == 6:
            count += 1
    return count / n
print(roll_die(10)) # output: 0.2
print(roll_die(100)) # output: 0.12
print(roll_die(1000)) # output: 0.173
print(roll_die(10000)) # output: 0.1676

Central Limit Theorem:


The central limit theorem is one of the most important theorems in probability theory and statistics. It states that,
under certain conditions, the sum of a large number of independent and identically distributed random variables
will be approximately normally distributed, regardless of the underlying distribution of the individual random
variables. Here are some important points about the central limit theorem:
1. Definition:
The central limit theorem (CLT) states that the distribution of the sample mean of a large number of
independent and identically distributed random variables approaches a normal distribution, regardless of the
underlying distribution of the individual random variables.
2. Conditions:
The central limit theorem holds under certain conditions, including that the sample size is sufficiently large, the
observations are independent, and the population distribution has a finite variance.
3. Importance:
The central limit theorem is important because it allows us to make inferences about the population mean based
on a sample mean, even when the underlying distribution is unknown or non-normal.
4. Applications:
The central limit theorem has many applications in statistics and data analysis, including hypothesis testing,
confidence intervals, and regression analysis.
5. Example:
Here's an example of how the central limit theorem works in practice. Suppose we want to know the average
height of all students in a school, but it's not practical to measure the height of every student. Instead, we can
take a random sample of students and calculate the sample mean. By the central limit theorem, as the sample
size increases, the distribution of the sample mean approaches a normal distribution, allowing us to make
inferences about the population mean.
6. Code example:
Here's an example of how to simulate the central limit theorem in Python. We can generate a large number of
random samples from a non-normal distribution (e.g., a uniform distribution), calculate the sample means, and
plot the distribution of the sample means. As the sample size increases, the distribution of the sample means
becomes increasingly normal.

import numpy as np
import matplotlib.pyplot as plt
# Generate random samples
n_samples = 100000
sample_sizes = [1, 2, 5, 10, 50]
samples = np.random.uniform(size=(n_samples, max(sample_sizes)))
# Calculate sample means
sample_means = [np.mean(samples[:, :n], axis=1) for n in sample_sizes]
# Plot distribution of sample means
fig, axs = plt.subplots(ncols=len(sample_sizes), figsize=(15, 5))
for i, ax in enumerate(axs):
    ax.hist(sample_means[i], bins=50, density=True)
    ax.set_title(f"Sample size = {sample_sizes[i]}")
plt.show()
This code generates 100,000 random samples from a uniform distribution, calculates the sample means for
various sample sizes, and plots the distribution of the sample means. As you can see from the resulting plots, as
the sample size increases, the distribution of the sample means becomes increasingly normal, demonstrating the
central limit theorem in action.

Data Summarization

Data summarization is the process of presenting a large dataset in a concise and meaningful way.
Here are some important concepts and techniques in data summarization:
Descriptive statistics:
These are statistical measures that describe the main features of a dataset, including measures of
central tendency (mean, median, mode) and measures of dispersion (range, variance, standard
deviation).
Data visualization:
This involves representing data in a graphical or pictorial form, which can help to reveal patterns and
relationships that may not be apparent from numerical summaries alone. Examples include
histograms, scatter plots, and box plots.
Aggregation:
This involves combining data into groups or categories to facilitate analysis. Common methods of
aggregation include grouping by time periods, geographic regions, or other relevant factors.
Sampling:
This involves selecting a subset of data from a larger dataset in order to gain insights about the larger
population. Various sampling techniques can be used depending on the nature of the data and the
research question.
Dimensionality reduction:
This involves reducing the number of variables in a dataset while still retaining as much information as
possible. Techniques such as principal component analysis and factor analysis can be used for this
purpose.
Machine learning:
This involves using algorithms to automatically identify patterns and relationships in a dataset.
Techniques such as clustering, regression, and classification can be used to summarize data and make
predictions.
Here's an example of using Python's pandas library to calculate some basic descriptive statistics for a
dataset:

import pandas as pd

data = pd.read_csv('mydata.csv') # mydata.csv can be any numerical data


print('Mean:', data.mean())
print('Standard deviation:', data.std())
print('Range:', data.max() - data.min())

Here's an example of using Python's matplotlib library to create a histogram of a small dataset:


import matplotlib.pyplot as plt
data = [1, 2, 3, 3, 4, 5, 5, 5, 6, 7]
plt.hist(data)
plt.show()
Descriptive statistics and their uses
Descriptive statistics are used to summarize and describe the important characteristics of a dataset, including
measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard
deviation).
These statistics help to provide a quick overview of the data, identify patterns or trends, and determine if the
data is skewed or has outliers.
Descriptive statistics can be calculated using various Python libraries such as NumPy, Pandas, and SciPy.
The most common descriptive statistics include:
Mean: the arithmetic average of a set of values.
Median: the middle value in a set of values when arranged in order.
Mode: the value that appears most frequently in a set of values.
Range: the difference between the maximum and minimum values in a set of values.
Variance: a measure of how spread out the values are from the mean.
Standard deviation: the square root of the variance, providing a measure of the spread of the data around the
mean.

*These statistics can be calculated using Python libraries such as NumPy, Pandas, and SciPy. Here's an
example using NumPy:*
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# Mean
mean = np.mean(data)
print("Mean:", mean)

# Median
median = np.median(data)
print("Median:", median)

# Range (np.ptp returns max - min; a separate name avoids shadowing Python's built-in range)
data_range = np.ptp(data)
print("Range:", data_range)

# Variance
variance = np.var(data)
print("Variance:", variance)

# Standard deviation
std_dev = np.std(data)
print("Standard deviation:", std_dev)
Measures of central tendency: mean, median, and mode:
Measures of central tendency are statistical measures that determine the center of a dataset. The three most
common measures of central tendency are mean, median, and mode.
Here are some important points to consider about these measures:
Mean:
It is the arithmetic average of a dataset and is calculated by summing all values in the dataset and dividing by
the number of observations. Mean can be sensitive to outliers and extreme values in a dataset. The formula for
calculating the mean is:
mean = (sum of all values) / (number of observations)
Median:
It is the middle value in a dataset when the values are arranged in ascending or descending order. Median is less
sensitive to outliers compared to mean. In case of even number of observations, the median is calculated as the
average of the two middle values.
Mode:
It is the most frequently occurring value in a dataset. Mode can be used for both numerical and categorical data.
A dataset can have one or more modes, or it may have no mode at all.

*Here are some code samples in Python to calculate these measures using NumPy (and the built-in statistics module for the mode):*
import numpy as np
import statistics
# Example dataset
data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8]
# Mean
mean = np.mean(data)
print("Mean:", mean)
# Median
median = np.median(data)
print("Median:", median)
# Mode (NumPy has no mode function, so we use the statistics module)
mode = statistics.mode(data)
print("Mode:", mode)
Measures of dispersion: Variance, Standard Deviation, and Range:
Measures of dispersion are statistical values that help to describe how spread out a dataset is.
Range is the simplest measure of dispersion and is defined as the difference between the maximum and
minimum values in a dataset.
Variance is a measure of how much the data points are dispersed around the mean value. It is calculated by
taking the average of the squared differences between each data point and the mean.
Standard deviation is another measure of dispersion that indicates how much the data deviates from the mean
value. It is the square root of the variance and is often used as a more interpretable measure of dispersion
compared to variance.

*Here are some code examples in Python:*


To calculate range:
data = [5, 10, 15, 20, 25]
range_value = max(data) - min(data)
print("Range value:", range_value)
To calculate variance and standard deviation using Python's statistics module:
import statistics
data = [5, 10, 15, 20, 25]
variance = statistics.variance(data)
std_deviation = statistics.stdev(data)
print("Variance:", variance)
print("Standard deviation:", std_deviation)
Alternatively, you can also calculate the variance and standard deviation manually. Note that this computes the
population variance (dividing by len(data)), whereas statistics.variance above computes the sample variance
(dividing by len(data) - 1), so the two results differ:
data = [5, 10, 15, 20, 25]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_deviation = variance ** 0.5
print("Variance:", variance)
print("Standard deviation:", std_deviation)
Percentiles and Quartiles:
Percentiles and quartiles are important measures of relative standing or position of a value in a data set.
Percentiles divide a dataset into 100 equal parts, each representing 1% of the data. For example, the 75th
percentile is the value below which 75% of the data falls.
Quartiles divide a dataset into four equal parts, each representing 25% of the data. The first quartile (Q1) is the
25th percentile, the second quartile (Q2) is the 50th percentile (also known as the median), and the third quartile
(Q3) is the 75th percentile.
Percentiles and quartiles can be calculated using various methods, such as interpolation, nearest rank, and the
inverse of the cumulative distribution function.
The most commonly used method for calculating percentiles and quartiles is the interpolation method. This
method estimates the percentile value by interpolating between the two nearest values in the dataset.
In Python, you can use the NumPy library to calculate percentiles and quartiles. The percentile() function can be
used to calculate any percentile value, and the quantile() function can be used to calculate quartiles.

*Example code for calculating percentiles:*


import numpy as np
data = [3, 1, 4, 2, 5, 7, 6, 8, 9, 10]
p75 = np.percentile(data, 75)
print("75th percentile value is:", p75)
Example code for calculating quartiles:
import numpy as np
data = [3, 1, 4, 2, 5, 7, 6, 8, 9, 10]
q1 = np.quantile(data, 0.25)
q2 = np.quantile(data, 0.5)
q3 = np.quantile(data, 0.75)
print("Q1 value is:", q1)
print("Q2 value is:", q2)
print("Q3 value is:", q3)
More Examples:
import matplotlib.pyplot as plt
import numpy as np
# example data
data = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
# calculate quartiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
# create box plot
plt.boxplot(data)
plt.title("Box Plot with Quartiles")
plt.text(1.05, q1, f"Q1: {q1:.2f}", fontsize=10, color="r")
plt.text(1.05, q2, f"Q2: {q2:.2f}", fontsize=10, color="r")
plt.text(1.05, q3, f"Q3: {q3:.2f}", fontsize=10, color="r")
plt.show()
import matplotlib.pyplot as plt
import numpy as np
# example data
data = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
# calculate percentiles
p25, p50, p75 = np.percentile(data, [25, 50, 75])
# create histogram
plt.hist(data, bins=5)
plt.title("Histogram with Percentiles")
plt.axvline(p25, color="r", linestyle="--")
plt.axvline(p50, color="r", linestyle="--")
plt.axvline(p75, color="r", linestyle="--")
plt.text(p25, 2, f"P25: {p25:.2f}", fontsize=10, color="r")
plt.text(p50, 2, f"P50: {p50:.2f}", fontsize=10, color="r")
plt.text(p75, 2, f"P75: {p75:.2f}", fontsize=10, color="r")
plt.show()

Skewness and Kurtosis


Skewness and Kurtosis are important measures of the shape of a distribution in descriptive statistics.
Skewness refers to the degree of asymmetry in a distribution. A positive skew indicates that the distribution has
a longer tail on the right side, while a negative skew indicates a longer tail on the left side.
Kurtosis, on the other hand, measures the degree of peakedness of a distribution. A high kurtosis indicates that
the distribution has a sharper peak and heavier tails, while a low kurtosis indicates a flatter peak and lighter tails.
Skewness can be calculated using the skew() function in the scipy.stats module in Python. The function takes an
array of numbers as input and returns the skewness value.
Example-
import numpy as np
from scipy.stats import skew
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
skewness = skew(data)
print("Skewness:", skewness)

Kurtosis can be calculated using the kurtosis() function in the scipy.stats module in Python. The
function takes an array of numbers as input and returns the kurtosis value.
Example-
import numpy as np
from scipy.stats import kurtosis
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
kurt = kurtosis(data)
print("Kurtosis:", kurt)
Skewness and kurtosis are useful in understanding the shape of a distribution and can help in
choosing appropriate statistical methods for data analysis.

Types of Skewness:
Positive skewness:
A distribution is said to be positively skewed if its tail is longer on the positive side (to the right) of the
distribution. This means that the majority of the data is on the left side of the distribution.
Negative skewness:
A distribution is said to be negatively skewed if its tail is longer on the negative side (to the left) of the
distribution. This means that the majority of the data is on the right side of the distribution.
Zero skewness:
A distribution is said to have zero skewness if it is symmetric around its mean. This means that the left
and right tails are of equal length.

Types of Kurtosis:
Leptokurtic:
A distribution is said to be leptokurtic if it has a high degree of peakedness. This means that the data
is heavily concentrated around the mean, and the tails are relatively thin.
Mesokurtic:
A distribution is said to be mesokurtic if it has a moderate degree of peakedness. This means that the
data is moderately concentrated around the mean, and the tails are neither too thick nor too thin.
Platykurtic:
A distribution is said to be platykurtic if it has a low degree of peakedness. This means that the data is
widely spread out, and the tails are relatively thick.

More Examples:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# example data
data = [1, 2, 3, 4, 5]
# calculate mean and standard deviation
mu, std = norm.fit(data)
# plot histogram with normal distribution curve
plt.hist(data, bins=10, density=True, alpha=0.6, color='g')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
plt.title("Histogram with Normal Distribution Curve")
plt.show()

import matplotlib.pyplot as plt


# example data
data = [1, 2, 3, 4, 5]
# create box plot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()

Outlier Detection and Treatment:


Outliers are data points that are significantly different from other data points in a dataset. Outliers can
significantly affect the results of data analysis, so it is important to detect and treat them
appropriately.
Here are some important points to consider for outlier detection and treatment:
1. Identify outliers:
Use statistical methods such as box plots, scatter plots, or z-scores to identify potential outliers in your
dataset.
2. Determine the cause:
Investigate the cause of outliers, whether it is due to data entry errors or a legitimate deviation from
the norm.
3. Decide on treatment:
Decide whether to remove outliers or to treat them. Depending on the cause and impact of outliers,
different treatments can be applied, such as replacing the outlier with a more appropriate value,
removing the outlier entirely, or keeping the outlier in the analysis but using robust statistical
methods.
4. Apply treatment:
Apply the chosen treatment method to the outliers.
5. Re-evaluate the dataset:
After treating the outliers, re-evaluate the dataset to ensure that the outliers no longer significantly
impact the results of the analysis.

*Here's an example of outlier detection using Z-score in Python:*


import numpy as np
# Sample dataset (with outliers)
data = np.array([25, 30, 35, 40, 45, 50, 1000])
# Calculate the z-scores for each data point
z_scores = (data - np.mean(data)) / np.std(data)
# Define a threshold for outlier detection (e.g., 2 standard deviations from the mean)
threshold = 2
# Identify outliers
outliers = np.abs(z_scores) > threshold
# Print the original dataset and the outliers
print("Original data:", data)
print("Z-scores:", z_scores)
print("Outliers:", data[outliers])
Task 1:
Activity 1:
Suppose you are given the following dataset representing the number of hours slept by a group of students in a
week:
{6, 7, 8, 5, 6, 7, 9, 8, 6, 7, 5, 4, 8}
a) Find the mean, median, and mode of the dataset.
b) Which measure of central tendency would you use to summarize the data if you were to report it to someone
who wants to know the typical amount of sleep the students get in a week? Explain your answer
Activity 2
Suppose you are given the following dataset representing the weights (in pounds) of a group of dogs:
{22, 24, 20, 18, 26, 25, 23}
a) Find the range, variance, and standard deviation of the dataset.
b) Interpret the values of the range, variance, and standard deviation in the context of the weights of the dogs.
Activity 3
Suppose you are given the following dataset representing the test scores of a group of students:
{75, 80, 85, 90, 95, 100, 100, 100}
a) Calculate the skewness and kurtosis of the dataset.
b) Interpret the values of the skewness and kurtosis in the context of the distribution of test scores.
Activity 4
In this activity, we will explore Outlier Detection and Treatment techniques on a dataset containing information about
diamonds.
Here is the link to the dataset
The goal is to identify outliers in the dataset and apply appropriate techniques to treat them.
Here are the tasks for this assignment:
Load the dataset from the Kaggle link provided above into a Pandas DataFrame.
Conduct exploratory data analysis (EDA) to gain insights into the dataset.
Identify outliers in the dataset using appropriate techniques.
Treat the identified outliers using appropriate techniques.
Perform EDA again to see the impact of outlier treatment.
You are free to use any technique you deem appropriate to identify and treat outliers in the dataset.

Discrete Probability Distributions

A discrete probability distribution is a statistical function that describes the likelihood of obtaining a
particular value or set of values from a discrete set of possible values.
Examples of discrete probability distributions include the binomial distribution, the Poisson
distribution, and the geometric distribution.
The binomial distribution models the probability of a binary outcome (success/failure) given a fixed
number of trials and a known probability of success.
The Poisson distribution models the probability of a certain number of events occurring in a fixed
interval of time or space, given a known average rate of occurrence.
The geometric distribution models the probability of the number of trials required to obtain the first
success in a sequence of independent trials, given a known probability of success.
Discrete probability distributions can be used in a wide range of fields, including finance, engineering,
and biology, to model real-world phenomena and make predictions based on probability.

Here's an example code snippet in Python for generating a random sample of 100 values from
a binomial distribution with 10 trials and a probability of success of 0.3:
import numpy as np
sample = np.random.binomial(n=10, p=0.3, size=100)
print(sample)

Probability Mass Function (PMF) :


The probability mass function is a function that maps each possible outcome of a discrete random variable to its
probability of occurrence. It is denoted by P(X = x), where X is a discrete random variable and x is a possible
outcome of X.
Some important points to note about PMF are:
1. The PMF gives the probability of each possible value of the random variable.
2. The sum of the probabilities for all possible outcomes is equal to 1.
3. The PMF is only defined for discrete random variables.
Example code in Python:
Suppose we have a dice that is rolled and we want to calculate the PMF for the number of spots that come up.
from collections import Counter
rolls = [1, 2, 3, 4, 5, 6]
# Calculate the PMF
pmf = Counter(rolls)
total_rolls = len(rolls)
for outcome, count in pmf.items():
    pmf[outcome] = count / total_rolls
print(pmf)

Cumulative Distribution Function:


Definition:
The Cumulative Distribution Function (CDF) is a function that describes the probability that a random variable
X takes on a value less than or equal to x. It is denoted by F(x).
Formula:
F(x)=P(X≤x)
Properties:
1. The CDF is a non-decreasing function.
2. The CDF ranges from 0 to 1.
3. The CDF is a continuous function for continuous random variables and a step function for discrete
random variables.
Uses:
The CDF can be used to find the probability of a range of values for a random variable, as well as to calculate
the expected value and variance.
Example Code:

Here's an example code in Python that calculates the CDF of a normal distribution:
import numpy as np
from scipy.stats import norm
mu, sigma = 0, 1 # mean and standard deviation
x = np.linspace(-3,3,1000) # create 1000 evenly spaced points from -3 to 3
cdf = norm.cdf(x, mu, sigma) # calculate the CDF
import matplotlib.pyplot as plt
plt.plot(x, cdf)
plt.title('Normal Cumulative Distribution Function')
plt.xlabel('X')
plt.ylabel('F(X)')
plt.show()

This code generates a plot of the CDF of a normal distribution with mean 0 and standard deviation 1, for values
of X ranging from -3 to 3.
Expected Value and Variance:
Expected value and variance are important concepts in probability theory and statistics. They are used to
measure the central tendency and variability of random variables. The expected value is the average value of a
random variable, while variance measures how spread out the data is around the expected value.
Expected Value:
The expected value of a discrete random variable is the sum of the product of each possible value of the variable
and its probability. Mathematically, it can be represented as E(X) = Σ xᵢ · P(X = xᵢ), where xᵢ is a possible value
of X and P(X = xᵢ) is the probability of X taking the value xᵢ.
Variance:
The variance of a random variable measures how much the values of the variable deviate from its expected
value. It is calculated by taking the sum of the squared differences of each value from the expected value,
multiplied by the probability of that value. Mathematically, it can be represented
as Var(X) = Σ (xᵢ − E(X))² · P(X = xᵢ), where xᵢ is a possible value of X, E(X) is the expected value of X, and
P(X = xᵢ) is the probability of X taking the value xᵢ.
Let's say we have a dice that we roll and get a random number between 1 and 6. We can calculate the
expected value and variance of this random variable as follows:
import numpy as np
# Define the possible values and their probabilities
values = [1, 2, 3, 4, 5, 6]
probs = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# Calculate the expected value
expected_value = np.sum(np.multiply(values, probs))
print("Expected Value:", expected_value)
# Calculate the variance
variance = np.sum(np.multiply(np.square(np.subtract(values,
expected_value)), probs))
print("Variance:", variance)

Binomial Distribution:
Binomial distribution is a type of discrete probability distribution that deals with the probability of a certain
number of successes in a fixed number of independent trials, each with the same probability of success.
Properties:
1. The trials must be independent.
2. The probability of success, denoted by p, must be constant for each trial.
3. The number of trials, denoted by n, must be fixed.
4. The random variable X, which represents the number of successes, can take on integer values from 0 to
n.
Probability Mass Function (PMF):
The PMF of a binomial distribution is given by:
P(X = k) = [n! / (k!(n − k)!)] · p^k · (1 − p)^(n−k)
Mean and Variance:
The mean of a binomial distribution is given by:
μ = np
The variance of a binomial distribution is given by:
σ² = np(1 − p)

Example:
Suppose we have a coin that has a probability of landing heads (success) of 0.6. We flip the coin 10 times. What
is the probability of getting exactly 5 heads?
Solution:

P(X = 5) = C(10, 5)·(0.6)⁵·(0.4)⁵ = [10!/(5!·5!)]·(0.6)⁵·(0.4)⁵ = 252·(0.6)⁵·(0.4)⁵ ≈ 0.2007

from scipy.stats import binom
n = 10
p = 0.6
prob = binom.pmf(5, n, p)
print(prob)  # Output: approximately 0.2007

Poisson Distribution:
Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed
interval of time or space, given the average rate of occurrence of the events.
Parameters:
The Poisson distribution is determined by a single parameter, λ, which represents the average rate of occurrence
of the event in the given interval.
Probability Mass Function:
The probability mass function (PMF) of a Poisson distribution is given by:
P(X = k) = (λ^k · e^(−λ)) / k!
where X is the random variable representing the number of events, k is the number of events, λ is the average
rate of occurrence, and e is the base of the natural logarithm.
Uses:
The Poisson distribution is commonly used in many fields such as biology, physics, finance, and economics to
model the occurrence of rare events.
Some examples of the Poisson distribution in real-life applications include:
1. The number of customers arriving at a service desk in a given time interval.
2. The number of defects in a manufacturing process.
3. The number of earthquakes in a region over a given time period.
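To make the PMF formula above concrete, here is a minimal sketch (not part of the original notes; the rate λ = 4 and the count k = 2 are arbitrary illustrative values) that evaluates the Poisson PMF with scipy.stats:
from scipy.stats import poisson
lam = 4   # assumed average rate of occurrence (illustrative)
k = 2     # number of events whose probability we want
# P(X = 2) for a Poisson(4) variable, i.e. (4**2 * e**-4) / 2!
prob = poisson.pmf(k, mu=lam)
print(prob)  # approximately 0.1465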

Geometric Distribution:
Geometric Distribution is a probability distribution that represents the number of Bernoulli trials needed to
obtain the first success in a sequence of independent trials.
Formula:
The probability mass function (PMF) of the Geometric Distribution is given by P(X = k) = q^(k − 1) · p, where X is
the number of trials needed to obtain the first success, p is the probability of success in each trial, and q = 1 − p is
the probability of failure.
Mean and Variance:
The mean of the Geometric Distribution is μ = 1/p and the variance is σ² = q/p²
Example:
Suppose we have a coin with probability of heads p = 0.3. We want to find the probability of getting heads for the
first time on the third flip. Using the Geometric Distribution, we can calculate P(X = 3) = (1 − 0.3)² × 0.3 = 0.147
Uses:
The Geometric Distribution is commonly used in situations where we are interested in the number of trials
needed to obtain the first success, such as in models for the time to failure of a product or the number of calls
before a customer reaches a call center.
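As a quick check of the worked example above (p = 0.3, first success on the third trial), here is a small sketch using scipy.stats.geom, which follows the same "number of trials until the first success" convention:
from scipy.stats import geom
p = 0.3
# P(X = 3) = failure, failure, success = (0.7**2) * 0.3
prob = geom.pmf(3, p)
print(prob)  # approximately 0.147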

To generate a random sample of size n from a Geometric Distribution with probability of success p in Python,
we can use the numpy.random.geometric function:
import numpy as np
p = 0.3
n = 1000
samples = np.random.geometric(p, size=n)

Hypergeometric Distribution
Hypergeometric Distribution is a probability distribution used to model the probability of drawing a specified
number of objects from a finite population without replacement. It is a discrete probability distribution and is
widely used in statistical inference, sampling theory, and quality control.
Definition:
The hypergeometric distribution describes the probability of k successes in n draws from a finite population size
N with M total successes and N - M total failures, where the draws are made without replacement.
Formula:
The probability mass function (PMF) for Hypergeometric Distribution is given by:
P(X = k) = [C(M, k) · C(N − M, n − k)] / C(N, n), where C(a, b) = a! / (b!(a − b)!) is the binomial coefficient.
Where, k = number of successes in the sample
n = number of draws
M = total number of successes in the population
N = population size
Properties:
The mean of the hypergeometric distribution is given by E(X) = n * M / N.
The variance of the hypergeometric distribution is given by
Var(X) = n · M · (N − M) · (N − n) / (N² · (N − 1))

In Python, you can use the scipy.stats module to calculate the PMF, mean, and variance of the hypergeometric
distribution.
from scipy.stats import hypergeom
# Define the parameters
M = 10 # Total number of successes
N = 15 # Population size
n = 3 # Number of draws
k = 2 # Number of successes
# Calculate the PMF (scipy's argument order is k, population size, number of successes, number of draws)
pmf = hypergeom.pmf(k, N, M, n)
print("PMF:", pmf)
# Calculate the mean and variance
mean = hypergeom.mean(N, M, n)
var = hypergeom.var(N, M, n)
print("Mean:", mean)
print("Variance:", var)

Here is an example code snippet in Python that calculates the probability mass function of a Poisson
distribution with a given mean:
import math
def poisson_pmf(mean, k):
    return math.exp(-mean) * (mean ** k) / math.factorial(k)
mean = 3
k = 2
pmf = poisson_pmf(mean, k)
print(pmf)

Activity 1:
A fair six-sided die is rolled two times. What is the probability of getting a sum of 7?
Activity 2:
Suppose the heights of a group of people are normally distributed with a mean of 68 inches and a standard
deviation of 3 inches. What is the probability that a randomly selected person from this group is shorter than 70
inches tall?
Activity 3:
A call center receives an average of 30 calls per hour. What is the probability that the call center will receive
exactly 40 calls in a given hour, assuming that the number of calls follows a Poisson distribution?
Activity 4:
A coin is tossed 12 times. What is the probability of getting exactly 7 heads?

Continuous Probability Distributions


Introduction on Continuous Probability Distributions
Continuous Probability Distributions are used to model continuous random variables. Unlike discrete random
variables, continuous random variables take on a range of values, and the probability of any one value occurring
is infinitesimally small. As a result, continuous probability distributions are represented by a probability density
function (PDF) instead of a probability mass function (PMF).

Here are some important points regarding Continuous Probability Distributions:


Probability Density Function (PDF):
The probability density function represents the probability distribution of a continuous random variable. The
area under the PDF curve between two points represents the probability of the random variable taking a value
between those two points. The PDF is continuous and non-negative and its total area under the curve is equal to
1.
Cumulative Distribution Function (CDF):
The cumulative distribution function represents the probability that a continuous random variable takes a value
less than or equal to a given value. The CDF is the integral of the PDF and is a monotonically increasing
function.
Expected Value:
The expected value of a continuous random variable is the average value of the variable, weighted by its
probability density function. It is also called the mean of the random variable.
Variance:
The variance of a continuous random variable is a measure of how spread out the values of the variable are. It is
calculated as the expected value of the squared deviation of the variable from its mean.

Some examples of continuous probability distributions are Normal Distribution, Uniform Distribution,
Exponential Distribution, Beta Distribution, Gamma Distribution, and Weibull Distribution.
*Here is an example of calculating the PDF and CDF of a Normal Distribution using Python:*
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mu = 0
sigma = 1
x = np.linspace(-5, 5, 1000)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)
plt.plot(x, pdf, label='PDF')
plt.plot(x, cdf, label='CDF')
plt.legend()
plt.show()

This code generates a plot of the PDF and CDF of a Normal Distribution with mean 0 and standard
deviation 1, for values of x ranging from -5 to 5.

Probability Density Function:


The Probability Density Function (PDF) is a fundamental concept in continuous probability distributions. It
describes the relative likelihood of a continuous random variable taking values near a given point (the probability
of any single exact value is zero). Unlike discrete probability distributions, where the PMF gives the probability
of each possible outcome, the PDF gives the probability density at each point in a continuous range of values.

Here are some important points to note about the PDF:


1. The PDF is always non-negative. It describes the probability density, so it cannot be negative.
2. The area under the PDF curve is equal to 1. Since a continuous random variable must take on some
value, the total probability of all possible outcomes is 1.
3. The PDF gives the slope of the CDF at each point. The CDF is the integral of the PDF, and its slope at
any given point is equal to the PDF at that point.
4. The mode of a PDF is the value at which it reaches its maximum. This is the value with the highest
probability density.
5. The PDF can be used to calculate the expected value and variance of a continuous random variable.
Here is an example of a PDF plot using Python:
import numpy as np
import matplotlib.pyplot as plt
# Define the parameters of the normal distribution
mu = 0
sigma = 1
# Define the range of x values to plot
x = np.linspace(-5, 5, 100)
# Calculate the PDF
pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
# Plot the PDF
plt.plot(x, pdf)
plt.xlabel('x')
plt.ylabel('PDF')
plt.title('Normal Distribution')
plt.show()

This code generates a plot of the PDF of a normal distribution with mean 0 and standard deviation 1 over the
range of x values from -5 to 5. The resulting plot shows the characteristic bell shape of the normal distribution,
with the highest probability density at the mean (0) and decreasing probability density as you move away from
the mean in either direction.

Cumulative Distribution Function


What is CDF?
CDF stands for Cumulative Distribution Function. It is a function that describes the probability of observing a
random variable X up to a certain point. The CDF is defined for both continuous and discrete probability
distributions.
How is CDF calculated?
The CDF is calculated as the cumulative sum (in the discrete case) or the integral (in the continuous case) of the
Probability Density Function (PDF) over a range of values of the random variable X. It represents the
probability that the random variable X takes a value less than or equal to x, for all possible values of x.
What does CDF show?
The CDF shows how the probability of observing a particular value of the random variable X changes as we
move along the x-axis. The CDF curve is a step function in the case of discrete distributions and a continuous
curve in the case of continuous distributions.
What are the properties of CDF?
Some of the important properties of CDF are:
1. The CDF of a random variable X is a non-decreasing function.
2. The CDF of a random variable X approaches 0 as x approaches negative infinity, and approaches 1 as x
approaches positive infinity.
3. The CDF is a right-continuous function.

How to plot CDF?


To plot the CDF, we can use the cumulative distribution function (cdf) method from the probability
distribution class in the SciPy library.

Here's an example code snippet to plot the CDF of a normal distribution with mean 0 and
standard deviation 1
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
x = np.linspace(-4, 4, num=100)
y = norm.cdf(x, loc=0, scale=1)
plt.plot(x, y)
plt.title('Cumulative Distribution Function of Normal Distribution')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.show()

This will plot the CDF of the normal distribution. We can see how the probability of observing a value of X less
than or equal to a certain value changes as we move along the x-axis.
Why is CDF important?
The CDF is an important concept in probability theory and statistics because it allows us to calculate various
properties of a probability distribution. For example, we can use the CDF to calculate the probability of
observing a value of the random variable X within a certain range of values. We can also use the CDF to
calculate the median and quartiles of a probability distribution.

Expected Value and Variance


Expected Value:
The expected value, also known as the mean, of a continuous probability distribution is the weighted average of
all possible values of the random variable.
It is calculated by integrating the product of the value of the random variable and its probability density function
over the entire range of the variable.
Symbolically, the expected value of a continuous random variable X with probability density function f(x) is
denoted by E(X) and is given by:
E(X) = ∫ x · f(x) dx
The expected value represents the center of mass of the probability distribution.
Variance:
The variance of a continuous probability distribution is a measure of how spread out the distribution is.
It is defined as the expected value of the squared deviation of the random variable from its expected value.
Symbolically, the variance of a continuous random variable X with probability density function f(x) and
expected value E(X) is denoted by Var(X) and is given by:
Var(X) = ∫ (x − E(X))² · f(x) dx
The standard deviation, which is the square root of the variance, is also commonly used as a measure of the
spread of the distribution.
Example code in Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# create a normal distribution with mean = 0 and std dev = 1
x = np.linspace(-5, 5, num=1000)
pdf = norm.pdf(x, loc=0, scale=1)
# plot the pdf and shade the area under the curve around the expected value
plt.plot(x, pdf)
plt.fill_between(x, pdf, where=(x >= -0.5) & (x <= 0.5), color='grey', alpha=0.5)
plt.axvline(x=0, linestyle='--', color='red')
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.title('Normal Distribution with Expected Value = 0')
plt.show()
# calculate the variance of the distribution
variance = norm.var(loc=0, scale=1)
print("Variance of the Normal Distribution:", variance)
Uniform Distribution
Uniform distribution is a continuous probability distribution that is used to model random variables that have an
equal chance of taking on any value within a specified range. Here are some important points to know about
uniform distribution:
1. In uniform distribution, the probability of a random variable taking any value within a given range is
the same.
2. The probability density function (PDF) for uniform distribution is a horizontal line over the range of
values, with a height equal to 1 divided by the range.
3. The cumulative distribution function (CDF) for uniform distribution is a straight line that increases
linearly from 0 to 1 over the range of values.
4. The expected value (mean) of a uniform distribution is the average of the minimum and maximum
values in the range.
5. The variance of a uniform distribution is calculated as range²/12, i.e., (b − a)²/12 for a distribution on [a, b].

Here is an example code snippet in Python to generate random numbers from a uniform distribution:

import random
# generate a random number between 0 and 1 from a uniform distribution
random.uniform(0, 1)
# generate a list of 10 random numbers between 1 and 100 from a uniform distribution
[random.uniform(1, 100) for _ in range(10)]
Uniform distribution is commonly used in simulations, games, and optimization problems.
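To connect the formulas above to the code, here is a small sketch (the range a = 1, b = 100 and the sample size are arbitrary choices) comparing sample estimates from a large uniform sample with the theoretical mean (a + b)/2 and variance (b − a)²/12:
import numpy as np
a, b = 1, 100                                 # assumed range of the uniform distribution
samples = np.random.uniform(a, b, size=100000)
# Theoretical values for Uniform(a, b)
print("Theoretical mean:", (a + b) / 2)             # 50.5
print("Theoretical variance:", (b - a) ** 2 / 12)   # about 816.75
# Sample estimates should be close for a large sample
print("Sample mean:", samples.mean())
print("Sample variance:", samples.var())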
Normal Distribution
Normal distribution, also known as Gaussian distribution, is a continuous probability distribution that describes
the probability of a random variable taking on a range of values.
The probability density function (PDF) of a normal distribution is defined by two parameters: the mean (μ) and
the standard deviation (σ).
The shape of the normal distribution is symmetric around the mean, with the highest point of the curve
occurring at the mean.
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
The cumulative distribution function (CDF) of the normal distribution is given by the area under the normal
curve to the left of a given value.
The normal distribution is widely used in statistical analysis and modeling, as many natural phenomena follow
this distribution.
The normal distribution can be simulated in Python using the numpy and scipy libraries.
Here's an example code snippet to generate random numbers from a normal distribution with a mean of 0 and a
standard deviation of 1:
import numpy as np
import scipy.stats as stats
# generate 1000 random numbers from a standard normal distribution
samples = np.random.normal(loc=0, scale=1, size=1000)
# calculate the mean and standard deviation of the samples
mean = np.mean(samples)
std_dev = np.std(samples)
# calculate the probability of a value falling within 2 standard deviations of the mean
prob = stats.norm.cdf(2) - stats.norm.cdf(-2)
print(f"Mean: {mean:.2f}, Standard Deviation: {std_dev:.2f}, Probability: {prob:.2f}")

The normal distribution is also used in hypothesis testing, as it is often assumed that the distribution of sample
means is approximately normal, thanks to the central limit theorem.

Exponential Distribution
Exponential Distribution is a type of continuous probability distribution that describes the time between events
in a Poisson process, where events occur continuously and independently at a constant average rate.
Probability Density Function:
The probability density function of Exponential Distribution is given by f(x) = λ e^(−λx), where λ is the rate
parameter.
Cumulative Distribution Function:
The cumulative distribution function of Exponential Distribution is given by F(x) = 1 − e^(−λx)
Expected Value and Variance:
The expected value of Exponential Distribution is E(X) = 1/λ, and the variance is Var(X) = 1/λ²
Applications:
Exponential Distribution is used in a wide range of applications, such as:
Modeling the time between events, such as the time between calls in a call center.
Reliability analysis, such as predicting the time until a component fails.
Financial modeling, such as modeling the time between trades in financial markets.
To generate random numbers from Exponential Distribution in Python, we can use the
numpy.random.exponential() function. For example, the following code generates 1000 random numbers from
Exponential Distribution with a rate parameter of 0.5:
import numpy as np
x = np.random.exponential(scale=1/0.5, size=1000)

We can also plot the probability density function and cumulative distribution function using scipy.stats.expon
module:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 100)
pdf = stats.expon.pdf(x, scale=1/0.5)
cdf = stats.expon.cdf(x, scale=1/0.5)
plt.plot(x, pdf, label='PDF')
plt.plot(x, cdf, label='CDF')
plt.legend()
plt.show()
Gamma Distribution
Gamma Distribution is a continuous probability distribution that is used to model the waiting time until a
specified number of events occur in a Poisson process. It is a two-parameter family of distributions that has
many applications in fields such as physics, engineering, and finance.
Probability Density Function:
The probability density function (PDF) of the Gamma Distribution is given
by f(x) = (1 / (Γ(α) β^α)) · x^(α − 1) · e^(−x/β), where α and β are the shape and scale parameters, respectively, and Γ is
the gamma function.
Cumulative Distribution Function:
The cumulative distribution function (CDF) of the Gamma Distribution is given
by F(x) = P(X ≤ x) = I(α, x/β), where I is the incomplete gamma function.
Expected Value and Variance:
The expected value of a Gamma Distribution is given by E(X) = αβ, and the variance is given by Var(X) = αβ²
Notes:
1. When α is an integer, the Gamma Distribution reduces to the Erlang Distribution, which models the
waiting time until a specified number of events occur in a Poisson process with a constant rate.
2. When α = 1, the Gamma Distribution reduces to the Exponential Distribution.

Python: To generate a random sample from a Gamma Distribution using NumPy:


import numpy as np
alpha = 2
beta = 1
sample_size = 1000
samples = np.random.gamma(alpha, beta, sample_size)
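As a rough check of the formulas above, the sample mean and variance of a large generated sample should be close to the theoretical E(X) = αβ = 2 and Var(X) = αβ² = 2 (a sketch, reusing the parameters from the snippet above but with a larger sample size):
import numpy as np
alpha, beta = 2, 1
samples = np.random.gamma(alpha, beta, 100000)
# Compare sample statistics with E(X) = αβ and Var(X) = αβ²
print("Sample mean:", samples.mean(), "vs theoretical:", alpha * beta)
print("Sample variance:", samples.var(), "vs theoretical:", alpha * beta ** 2)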

Beta Distribution
Beta distribution is a continuous probability distribution that is widely used in Bayesian statistics, modeling of
proportions and probabilities, and machine learning algorithms. It has two shape parameters, commonly denoted
by α and β, which determine the shape and spread of the distribution.
Probability Density Function (PDF):
The probability density function of the Beta distribution is given by the following equation:
f(x; α, β) = (1 / B(α, β)) · x^(α − 1) · (1 − x)^(β − 1)
where x ∈ [0, 1], α > 0 and β > 0 are the shape parameters, and B(α, β) is the beta function.
Cumulative Distribution Function (CDF):
The cumulative distribution function of the Beta distribution does not have a closed-form solution. However, it
can be computed numerically using various numerical integration methods.
Important Properties:
The Beta distribution is a flexible distribution that can take on a wide range of shapes, from U-shaped to J-
shaped to bell-shaped, depending on the values of α and β.
When α = β = 1, the Beta distribution reduces to a uniform distribution over [0, 1].
The mean of the Beta distribution is given by α/(α + β), and the variance is given by α·β / [(α + β)² · (α + β + 1)]
The mode of the Beta distribution is given by (α−1)/(α+β−2), provided that α > 1 and β > 1. If either α or β is
less than or equal to 1, the mode does not exist.
The Beta distribution is conjugate to the binomial distribution, which means that if the prior distribution of the
probability parameter of a binomial distribution is a Beta distribution, then the posterior distribution is also a
Beta distribution.
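To illustrate the conjugacy property just described, here is a minimal hedged sketch of a Beta-Binomial update; the prior parameters (α = 2, β = 5) and the observed data (7 successes in 10 trials) are made-up values for illustration only:
from scipy.stats import beta
# Prior belief about an unknown success probability: Beta(2, 5)
alpha_prior, beta_prior = 2, 5
# Observed binomial data: 7 successes out of 10 trials (illustrative)
successes, trials = 7, 10
# Conjugate update: the posterior is Beta(alpha + successes, beta + failures)
alpha_post = alpha_prior + successes
beta_post = beta_prior + (trials - successes)
print("Posterior: Beta(%d, %d)" % (alpha_post, beta_post))
print("Posterior mean:", beta.mean(alpha_post, beta_post))  # α/(α+β) = 9/17 ≈ 0.529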

Here is an example of how to generate random samples from a Beta distribution using Python's NumPy library:
import numpy as np
# Set the shape parameters
alpha = 2
beta = 5
# Generate 1000 random samples from the Beta distribution
samples = np.random.beta(alpha, beta, size=1000)
# Compute the mean and variance of the samples
mean = np.mean(samples)
variance = np.var(samples)
print("Mean:", mean)
print("Variance:", variance)

Activity 1:
The lifetimes of a certain brand of light bulbs are normally distributed with a mean of 800 hours and a standard
deviation of 100 hours. What is the probability that a randomly selected light bulb will last more than 900
hours?
Activity 2:
The time between arrivals at a certain store follows an exponential distribution with a mean of 8 minutes. What
is the probability that the time between two consecutive arrivals will be less than 5 minutes?
Activity 3:
The scores on a standardized test are normally distributed with a mean of 75 and a standard deviation of 10.
What is the probability that a randomly selected student scores between 70 and 80 on the test?
Activity 4:
Let X be a continuous random variable with the PDF given by:

f(x) = x for 0 < x < 1; f(x) = 2 − x for 1 < x < 2; f(x) = 0 otherwise

Find P(0.5 < x < 1.5).

Joint Distribution Concept

Joint distribution is an important concept in probability theory and statistics, and it refers to the distribution of
two or more random variables together.
Joint probability mass function:
If we have two discrete random variables, their joint distribution can be described by a joint probability mass
function, which gives the probability of each possible combination of values for the two variables.
Joint probability density function:
If we have two continuous random variables, their joint distribution can be described by a joint probability
density function, which gives the probability density at each point in the joint space.
Here is an example of how to define a joint probability mass function for two discrete random variables X and Y in
Python:
import numpy as np
# Define the joint probability mass function
joint_pmf = np.array([[0.1, 0.2, 0.05],
[0.05, 0.15, 0.2],
[0.05, 0.1, 0.1]])

# Print the joint probability mass function
print(joint_pmf)
Here is an example of how to define a joint probability density function for two continuous random variables X and Y in
Python:
import numpy as np
# Define the joint probability density function (standard bivariate normal density)
def joint_pdf(x, y):
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)
# Evaluate the joint probability density function at some points
print(joint_pdf(0.5, 0.5))
print(joint_pdf(1, 2))

Marginal distribution:
The marginal distribution of a single random variable can be obtained from the joint distribution by summing (in
the case of discrete variables) or integrating (in the case of continuous variables) over the other variable(s).
Let's break it down further:
Discrete Variables: If you have a joint probability distribution for two or more discrete random variables, you
can obtain the marginal distribution of one of those variables by summing the joint probabilities over all
possible values of the other variable(s). This effectively "marginalizes out" the other variable(s), leaving you
with the distribution of the variable of interest.
Mathematically, if you have a joint distribution P(X, Y) for discrete variables X and Y, then the marginal
distribution P(X) can be obtained by summing over all possible values of Y:
P(X) = ∑ P(X, y) for all y
Continuous Variables: Similarly, for continuous random variables, you would integrate the joint probability
density function over the entire range of values of the other variable(s) to obtain the marginal density function of
the variable of interest.
Mathematically, if you have a joint density function f(X, Y) for continuous variables X and Y, then the marginal
density function f(X) can be obtained by integrating over the entire range of Y:
f(X) = ∫ f(X, y) dy for all y
In both cases, the resulting marginal distribution or density function will satisfy the properties of a valid
probability distribution: it will be non-negative, and the total probability (or density) will sum (or integrate) to 1.

Conditional distribution:
The conditional distribution of one random variable given another can also be obtained from the joint
distribution by dividing by the marginal distribution of the conditioning variable.
Let's delve into this further:
Discrete Variables: Suppose you have two discrete random variables, X and Y, and you want to find the
conditional distribution of X given a particular value y of Y. You can do this by dividing the joint probability
P(X, Y) by the marginal probability P(Y = y):
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Here, P(X = x, Y = y) is the joint probability of X and Y taking on specific values x and y, and P(Y = y) is the
marginal probability of Y taking on the value y.
Continuous Variables: Similarly, if you have two continuous random variables, X and Y, and you want to find
the conditional density of X given a particular value y of Y, you can divide the joint density f(X, Y) by the
marginal density f_Y(y):
f(X = x | Y = y) = f(X = x, Y = y) / f_Y(y)
Here, f(X = x, Y = y) is the joint density of X and Y, and f_Y(y) is the marginal density of Y.
This process of obtaining conditional distributions is known as conditional probability or conditional density,
and it allows you to analyze how one random variable behaves given specific information about another random
variable.
Applications:
Joint distribution is used in a variety of fields, including engineering, physics, and finance, to model the
relationships between multiple random variables and make predictions or decisions based on those relationships.

Joint Probability Mass Function (Joint PMF)


In probability theory, a joint probability mass function is a function that gives the probability that two discrete
random variables X and Y are equal to particular values, denoted by P(X = x, Y = y). Here are some important
points to know about Joint PMF:

🔹 Joint PMF is a function that maps each pair of values (x,y) from the joint domain of X and Y to the probability
that X = x and Y = y.

🔹 The joint PMF must satisfy two properties: non-negativity and normalization (the probabilities must sum to 1).

🔹 The sum of all the probabilities in the joint PMF over all possible values of X and Y must be equal to 1.

🔹 Joint PMF can be used to calculate marginal probabilities, which are probabilities of individual random
variables.

🔹 Joint PMF can also be used to calculate conditional probabilities, which are probabilities of one random
variable given that another random variable has a particular value.

🔹 Joint PMF is often visualized using a joint probability table or a heatmap.

🔹 In code, joint PMF can be defined using a two-dimensional array or a dictionary of tuples.

Here is an example code snippet in Python to calculate a joint PMF:


import numpy as np
# Define the joint PMF as a 2D array
joint_pmf = np.array([[0.1, 0.2, 0.05],
[0.05, 0.15, 0.2],
[0.05, 0.1, 0.1]])
# Calculate the marginal PMF of X by summing over Y
marginal_pmf_x = np.sum(joint_pmf, axis=1)
# Calculate the conditional PMF of Y given X=2
conditional_pmf_y_given_x2 = joint_pmf[2, :] / marginal_pmf_x[2]
# Print the results
print("Joint PMF:")
print(joint_pmf)
print("Marginal PMF of X:")
print(marginal_pmf_x)
print("Conditional PMF of Y given X=2:")
print(conditional_pmf_y_given_x2)

Joint Cumulative Distribution Function:


In probability theory and statistics, the joint cumulative distribution function (CDF) is a function that gives the
probability that two or more random variables in a probability distribution take on specific values or fall within
specified ranges simultaneously. It is defined as the probability of the event {X ≤ x, Y ≤ y}, where X and Y are
two random variables, and x and y are values.
Properties of Joint Cumulative Distribution Function:
1. Joint CDF is always non-negative and takes values between 0 and 1.
2. It is a monotonically non-decreasing function in both of its arguments.
3. The joint CDF approaches 0 as both of its arguments approach negative infinity.
4. The joint CDF approaches 1 as both of its arguments approach positive infinity.
Examples of Joint Cumulative Distribution Function:
Suppose we have two discrete random variables X and Y. The joint CDF for X and Y can be defined as:
F(x,y) = P(X ≤ x, Y ≤ y)

The joint CDF can be calculated using Python's scipy.stats module. Here's an example code:

from scipy.stats import randint


# X and Y are independent discrete uniform random variables on {1, ..., 6}
X = randint(low=1, high=7)
Y = randint(low=1, high=7)
# Calculate joint CDF for X and Y (the product form is valid because X and Y are independent)
F = lambda x, y: X.cdf(x) * Y.cdf(y)
# Calculate joint CDF for x = 3, y = 4
F(3, 4)

This code creates two independent discrete random variables X and Y, each uniformly distributed on the integers 1 to
6. It then defines a lambda function F that calculates the joint CDF as the product of the marginal CDFs, which is
valid because X and Y are independent. Finally, it calculates the joint CDF for x = 3 and y = 4.

Joint Probability Density Function (Joint PDF)
The Joint Probability Density Function (PDF) is a function that describes the probability density of multiple
continuous random variables taking values near specific points simultaneously. It is used in continuous probability
distributions to model the likelihood of two or more variables occurring together.
Here are some important points regarding Joint PDF:
1. Definition:
The Joint Probability Density Function is a function of two or more variables that define the probability of those
variables taking specific values simultaneously.
2. Properties:
The joint PDF must satisfy certain properties, such as non-negativity and the integral over the entire domain
equals to one.
3. Interpretation:
The value of the joint PDF at any point (x, y) represents the likelihood of the variables X and Y taking on values
close to x and y, respectively.
4. Relationship with Marginal PDFs:
By integrating the joint PDF over one of the variables, we obtain the marginal PDF for the other variable.
5. Relationship with Joint CDF:
The joint PDF is related to the joint CDF by taking the partial derivative with respect to each variable.
Examples:
Some examples of probability distributions that have a joint PDF include the bivariate normal distribution and
the joint uniform distribution.

import numpy as np
import matplotlib.pyplot as plt
# Define the joint PDF function (here a standard bivariate normal density, used for illustration)
def joint_pdf(x, y):
    return np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)
# Create a grid of x and y values
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
# Evaluate the joint PDF at each point in the grid
Z = joint_pdf(X, Y)
# Plot the Joint PDF as a contour plot
plt.contour(X, Y, Z)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Joint PDF of X and Y')
plt.show()

This code creates a contour plot of the joint PDF of two random variables X and Y, with X values ranging from
-3 to 3 and Y values ranging from -3 to 3. The joint PDF function is defined using the input variables x and y,
and the plot is generated using the contour function from the matplotlib library. The resulting plot shows contour
lines of the joint PDF, tracing regions of equal probability density around the peak at the origin.

Marginal Distribution:
Marginal distribution refers to the probability distribution of one or more variables in a joint probability
distribution.
The marginal distribution is obtained by summing (or integrating, in the case of continuous variables) the joint
probability distribution over the variables not of interest.
Marginal distributions can be calculated for both discrete and continuous variables.
Marginal distribution is useful in simplifying the analysis of complex problems that involve multiple variables.
The marginal distribution of a single variable can be obtained by summing (or integrating) the joint distribution
over all possible values of the other variables.
The marginal distribution of multiple variables can be obtained by summing (or integrating) the joint
distribution over all possible values of the variables not of interest.
Marginal distributions are important in statistics, as they allow us to study the behavior of individual variables in
a multivariate distribution.

In Python, the marginal distribution can be calculated using the numpy.sum() function for discrete variables and
the scipy.integrate.simpson() function for continuous variables. Here's an example:
import numpy as np
import scipy.integrate as spi
# Define the joint probability density function
def joint_pdf(x, y):
    return x * y * np.exp(-x * y)
# Define the marginal probability density function for x
def marginal_pdf_x(x):
    return spi.simpson(joint_pdf(x, y_vals), x=y_vals)
# Define the marginal probability density function for y
def marginal_pdf_y(y):
    return spi.simpson(joint_pdf(x_vals, y), x=x_vals)
# Define the range of values for x and y
x_vals = np.linspace(0, 5, 50)
y_vals = np.linspace(0, 5, 50)
# Calculate the marginal PDFs for x and y
marginal_x = np.array([marginal_pdf_x(x) for x in x_vals])
marginal_y = np.array([marginal_pdf_y(y) for y in y_vals])
The resulting arrays marginal_x and marginal_y contain the marginal probability density functions for the
variables x and y, respectively.

Conditional Distribution:
Conditional distribution is a concept in probability theory that involves calculating the probability distribution of
a random variable given that another random variable takes a certain value. It is a way to understand how the
probability distribution of one variable changes when another variable is known or fixed.
Conditional Probability:
Conditional probability is the probability of an event occurring given that another event has occurred. It is
defined as the probability of event A occurring given that event B has already occurred and is denoted by P(A|
B).
Conditional Probability Mass Function (PMF):
In the case of discrete random variables, the conditional probability mass function (PMF) gives the probability
of a particular value of the random variable, given that another random variable takes a certain value. It is
denoted by P(X = x|Y = y).
Conditional Probability Density Function (PDF):
For continuous random variables, the conditional probability density function (PDF) gives the probability
density of a particular value of the random variable, given that another random variable takes a certain value. It
is denoted by f(x|y).
Examples:
Suppose we have two random variables X and Y, and we want to find the conditional probability of X given that
Y takes a certain value y. We can use the conditional probability formula:
P(X=x|Y=y) = P(X=x, Y=y) / P(Y=y)
where P(X=x, Y=y) is the joint probability of X and Y taking the values x and y respectively, and P(Y=y) is the
marginal probability of Y taking the value y.
Example code for calculating the conditional PMF of X given Y = y:
# Define the joint PMF of X and Y
pmf_XY = {(0, 1): 0.1, (1, 1): 0.3, (0, 2): 0.2, (1, 2): 0.4}
# Define the marginal PMF of Y
pmf_Y = {1: 0.4, 2: 0.6}
# Calculate the conditional PMF of X given Y=1
pmf_X_given_Y1 = {0: pmf_XY[(0, 1)] / pmf_Y[1], 1: pmf_XY[(1, 1)] / pmf_Y[1]}
# Calculate the conditional PMF of X given Y=2
pmf_X_given_Y2 = {0: pmf_XY[(0, 2)] / pmf_Y[2], 1: pmf_XY[(1, 2)] / pmf_Y[2]}

Covariance and correlation are two statistical concepts that measure the relationship between two variables.
Covariance is a measure of how two variables change together, while correlation is a measure of the strength of
their linear relationship. Both are important tools in data analysis and can help us understand the relationship
between different variables.
Covariance:
Covariance is a measure of how two variables vary together. It measures the degree to which two variables are
related to each other. A positive covariance means that the two variables tend to move in the same direction,
while a negative covariance means they tend to move in opposite directions. Covariance is sensitive to the scale
of the variables and is not standardized.
Covariance can be calculated using the following formula: cov(X,Y)=E[(X−E[X])(Y−E[Y])]
Correlation:
Correlation measures the strength of the linear relationship between two variables. It ranges from -1 to 1, where
-1 indicates a perfectly negative correlation, 0 indicates no correlation, and 1 indicates a perfectly positive
correlation. Correlation is standardized, which means it is not sensitive to the scale of the variables.
Correlation can be calculated using the following formula: corr(X,Y)=cov(X,Y)/(std(X)∗std(Y))
Interpretation:
A positive covariance or correlation indicates that two variables tend to move in the same direction, while a
negative covariance or correlation indicates that they tend to move in opposite directions. A correlation of 0
indicates no linear relationship between the variables. The strength of the correlation can be interpreted using
the correlation coefficient.

Here is an example of how to calculate covariance and correlation using Python:


import numpy as np
# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([5, 4, 3, 2, 1])
# Calculate covariance
covariance = np.cov(X, Y)[0][1]
# Calculate correlation
correlation = np.corrcoef(X, Y)[0][1]
print("Covariance:", covariance)
print("Correlation:", correlation)

Multivariate Normal Distribution


Multivariate normal distribution is a powerful concept in probability theory, which allows us to model the
behavior of multiple random variables that are jointly normally distributed.
Definition:
Multivariate normal distribution is a continuous probability distribution of multiple random variables, where
each variable follows a normal distribution, and all variables are correlated with each other. It is also known as a
multivariate Gaussian distribution.
Parameters:
A multivariate normal distribution is characterized by two parameters: a mean vector μ, which specifies the
center of the distribution, and a covariance matrix Σ, which describes the spread and correlation of the variables.
Probability Density Function: The probability density function (PDF) of a multivariate normal distribution is
given by:
f(x) = (2π)^(−k/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
where x is a k-dimensional vector of random variables, |Σ| is the determinant of the covariance matrix, Σ⁻¹ is the
inverse of the covariance matrix, and ᵀ denotes the transpose.
Properties:
Some important properties of multivariate normal distributions include:
1. Any linear transformation of a multivariate normal distribution is also a multivariate normal
distribution.
2. The conditional distribution of a subset of variables given the remaining variables is also a multivariate
normal distribution.
3. The marginal distribution of a subset of variables is also a multivariate normal distribution.
Applications:
Multivariate normal distributions are widely used in many fields, including finance, economics, and
engineering, for modeling complex systems with multiple variables.
Here is an example of how to generate a multivariate normal distribution in Python using the numpy library:
import numpy as np
# define the mean vector and covariance matrix
mu = np.array([0, 0])
cov = np.array([[1, 0.5], [0.5, 2]])
# generate a random sample from the distribution
sample = np.random.multivariate_normal(mu, cov, size=1000)
# plot the sample using matplotlib
import matplotlib.pyplot as plt
plt.scatter(sample[:, 0], sample[:, 1])
plt.show()

🔹 Joint Distribution is an important concept in probability theory and statistics that deals with the probability of
two or more random variables occurring together.

🔹 It involves understanding the joint probability mass function, joint probability density function, joint
cumulative distribution function, and conditional distribution.

🔹 The relationship between random variables can be explored using covariance and correlation.

🔹 Multivariate normal distribution is an example of joint distribution, which is commonly used in statistical
inference and modeling.
🔹 Joint distribution can be used in various applications such as in finance, economics, engineering, and social
sciences.

🔹 Understanding joint distribution is crucial for conducting hypothesis testing, building regression models, and
making predictions in various fields.

🔹 Python provides several packages such as NumPy, SciPy, and Pandas for working with joint distributions,
calculating their properties, and visualizing them.

import numpy as np
from scipy.stats import multivariate_normal
# Generate random data
x = np.random.normal(0, 1, 100)
y = np.random.normal(0, 1, 100)
data = np.column_stack((x, y))
# Calculate mean and covariance matrix
mean = np.mean(data, axis=0)
cov = np.cov(data.T)
# Create a multivariate normal distribution
mnorm = multivariate_normal(mean=mean, cov=cov)
# Calculate probability density function at a point
point = [0.5, 0.5]
pdf = mnorm.pdf(point)
print(f"PDF at {point}: {pdf}")
# Generate random samples from the distribution
samples = mnorm.rvs(10)
print(f"Random samples:\n{samples}")

This code generates random data from a multivariate normal distribution, calculates its mean and covariance
matrix, creates a multivariate_normal object using these parameters, calculates the probability density function
at a point, and generates random samples from the distribution. This is just a simple example to demonstrate
how joint distribution can be used in Python programming.

Activity 1:
A group of students measured the height and weight of 10 classmates. The data is given in the following table:
Student   Height (inches)   Weight (pounds)
1         68                165
2         71                201
3         61                140
4         72                210
5         67                165
6         64                125
7         65                150
8         62                120
9         66                160
10        69                186
a) Calculate the covariance between height and weight.
Activity 2:
Suppose you are a teacher and you have a class of 30 students. You know that the students' heights are normally
distributed with mean μ=64 inches and standard deviation σ=3 inches. Additionally, you know that the heights
of the female students in the class are also normally distributed with mean μf=62 inches and standard
deviation σf=2.5 inches.
a) If a student is chosen at random from the class, what is the probability that their height is between 62 and 66
inches?
b) If you randomly select a female student from the class, what is the probability that her height is between 60
and 64 inches?
Activity 3:
Suppose the heights and weights of a group of 50 students are jointly distributed according to a bivariate normal
distribution with mean vector μ = [65, 150] and covariance matrix Σ = [[9, 24], [24, 64]].
What is the probability that a randomly selected student has a height between 62 and 68 inches and a weight
between 140 and 160 pounds?
Activity 4:
Given the following set of data:
X = {2, 4, 6, 8, 10}
Y = {1, 3, 5, 7, 9}
Calculate the covariance and correlation between X and Y.
Activity 5: Insurance Cost Prediction:
In this assignment, we will explore some statistical concepts using the Insurance Cost Prediction dataset from
Kaggle. The dataset contains information on insurance cost for individuals based on their age, sex, BMI, number
of children, smoking habit, and geographic region.
1. Load the dataset into a Pandas dataframe and display the first five rows of the dataset. Dataset
link: https://www.kaggle.com/mirichoi0218/insurance
2. Plot the distribution of insurance charges for male and female policyholders.
3. Calculate the conditional distribution of insurance charges given the policyholder's smoking habit.
4. Create a scatter plot to visualize the relationship between BMI and insurance charges.
5. Compute the covariance and correlation between the policyholder's age and insurance charges.
6. Create a heatmap to visualize the covariance matrix of the dataset.
7. Fit a multivariate normal distribution to the dataset using the maximum likelihood estimation and
visualize the contours of the distribution in a 2D plot.

# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from scipy.stats import multivariate_normal

# Load the dataset


# Display the first five rows of the dataset
# Task 2: Plot the distribution of insurance charges for male and female policyholders
# Task 3: Calculate the conditional distribution of insurance charges given the policyholder's smoking habit
# Task 4: Create a scatter plot to visualize the relationship between BMI and insurance charges
# Task 5: Compute the covariance and correlation between the policyholder's age and insurance charges
# Task 6: Create a heatmap to visualize the covariance matrix of the dataset
# Task 7: Fit a multivariate normal distribution and visualize the contours in a 2D plot
# Select the features for the multivariate normal distribution
# Estimate the mean and covariance matrix using maximum likelihood estimation

Chapter 4: Inferential Statistics


Sampling & Statistical Inference

Sampling refers to the process of selecting a subset, known as a sample, from a larger group, known
as a population. The sample is carefully chosen to be representative of the population's characteristics,
allowing researchers to draw meaningful conclusions about the entire population based on the
analysis of the sample data. Sampling involves techniques that aim to minimize bias and ensure the
sample accurately reflects the diversity and variability present in the population.

Importance of Sampling in Statistics:


Time and Cost Efficiency: Collecting data from an entire population can be impractical, time-consuming, and
expensive. Sampling allows researchers to obtain accurate information using a smaller budget and within a
reasonable timeframe.
Inference: Statistical analysis of a well-selected sample can provide insights and draw conclusions about the
entire population. This is the basis of inferential statistics, which generalize findings from the sample to the
larger population.
Feasibility: In cases where the population is too large or inaccessible, sampling provides a feasible way to study
and understand the population's characteristics.
Reduced Data Collection Effort: Instead of gathering data from all individuals, researchers can focus on
collecting data from a smaller group, making data collection more manageable.
Risk Reduction: Sampling allows researchers to evaluate hypotheses and test new ideas on a smaller scale
before implementing them on the entire population, reducing potential risks and errors.
Ethics: In cases where it is not feasible or ethical to collect data from every individual, sampling provides a way
to gather relevant information without invading privacy or causing harm.

Probability Sampling
Probability sampling involves the selection of elements from the population using random selection, in which each
element of the population has an equal and independent chance of being chosen.
To put it simply, it is a sampling technique wherein the samples are gathered in a process that gives all the
individuals in the population equal chances of being selected.
Imagine you have a big bag of candies, and you want to know what flavors are inside without eating all of them.
Probability sampling is like reaching into the bag and picking out candies in a way that gives every candy an
equal chance to be chosen. This method helps you get a good idea of what flavors are in the bag without having
to check every single candy.
It is also called random sampling.
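As a small illustration (not in the original notes), drawing a simple random sample without replacement from a hypothetical population of 1000 labelled units with NumPy, so that every unit has the same chance of being selected:
import numpy as np
population = np.arange(1, 1001)   # hypothetical population of 1000 labelled units
# replace=False means no unit can be picked twice; each unit is equally likely to be chosen
sample = np.random.choice(population, size=50, replace=False)
print(sample[:10])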

Random Sample
A set of items selected from a parent population is a random sample if:
the probability that any item in the population is included in the sample is proportional to its frequency in the
parent population, and
the inclusion/exclusion of any item in the sample operates independently of the inclusion/exclusion of any other
item.
Random Sample Notation
A random sample is made up of independent and identically distributed (iid) random variables and so they are denoted by capital X’s.
We will use the shorthand notation X to denote a random sample, that is, X = (X1, X2, ..., Xn).
An observed sample will be denoted by x=(x1,x2,…,xn).
The population distribution will be specified by a density (or probability function) denoted by f(x; θ), where θ
denotes the parameter(s) of the distribution, such as the mean (denoted by μ) and variance (denoted by σ²).

Normal Approximations Using CLT


In many cases, especially when dealing with large sample sizes, the normal distribution is widely used in
statistical analysis due to its well-understood properties. The approximations you've mentioned are specific
scenarios where other probability distributions are approximated by the normal distribution using the Central
Limit Theorem.

Binomial Distribution Approximation


The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials,
where p is the probability of success and q = 1 − p is the probability of failure in each trial.
For a binomial distribution Bin(n, p) with a large sample size n and a probability of success p, if
both np and n(1 − p) are greater than 5, the distribution can be approximated by a normal distribution.
Approximation Formula
Bin(n, p) ≈ N(np, np(1 − p)), if np > 5 and n(1 − p) > 5
This means that when n is large and p is not too close to 0 or 1, you can use the normal distribution with the
same mean (np) and standard deviation √(np(1 − p)) to approximate the binomial distribution.
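As a quick illustration (not in the original notes; the values n = 100, p = 0.4 and the cut-off 45 are arbitrary), comparing the exact binomial CDF with its normal approximation:
import numpy as np
from scipy.stats import binom, norm
n, p = 100, 0.4                     # np = 40 and n(1 - p) = 60, both greater than 5
mean = n * p
sd = np.sqrt(n * p * (1 - p))
exact = binom.cdf(45, n, p)                  # exact binomial probability P(X <= 45)
approx = norm.cdf(45, loc=mean, scale=sd)    # normal approximation N(np, np(1 - p))
print(exact, approx)  # the two values should be close; a continuity correction (using 45.5) brings them closer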

Poisson Distribution Approximation


The Poisson distribution models the number of events occurring in a fixed interval of time or space, given a
known average rate (λ).
For a Poisson distribution with a large average rate (λ), the distribution can be approximated by a normal
distribution.
Approximation Formula
Poi(λ) ≈ N(λ, λ), if λ is large
This approximation is valid when the average rate (λ) is relatively large. The normal distribution with the same
mean (λ) and standard deviation (√λ) can be used to approximate the Poisson distribution.
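A similar sketch for the Poisson case, with an assumed rate λ = 50 and cut-off 55:
import numpy as np
from scipy.stats import poisson, norm
lam = 50                                             # assumed large average rate
exact = poisson.cdf(55, mu=lam)                      # exact Poisson probability P(X <= 55)
approx = norm.cdf(55, loc=lam, scale=np.sqrt(lam))   # N(λ, λ) approximation
print(exact, approx)  # the two values should be close for large λ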

Chi-Square Distribution Approximation


The chi-square distribution is often used in statistical inference, especially in hypothesis testing and confidence
interval calculations.
For a chi-square distribution with a large number of degrees of freedom (n), the distribution can be
approximated by a normal distribution.
Approximation Formula
χ²(n) ≈ N(n, 2n), if n is large
This approximation holds when the number of degrees of freedom (n) is large. The normal distribution with the
same mean (n) and standard deviation (√(2n)) can be used to approximate the chi-square distribution.
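And for the chi-square case, a sketch assuming n = 200 degrees of freedom and evaluating both CDFs at 220:
import numpy as np
from scipy.stats import chi2, norm
n = 200                                                # assumed large number of degrees of freedom
x = 220
exact = chi2.cdf(x, df=n)                              # exact chi-square probability P(X <= 220)
approx = norm.cdf(x, loc=n, scale=np.sqrt(2 * n))      # N(n, 2n) approximation
print(exact, approx)  # the two values should be close for large n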

Non-Probability Sampling:
Non-probability sampling is a type of sampling technique where not all members of the population have a
known and equal chance of being selected into the sample.
Unlike probability sampling, non-probability sampling methods do not rely on random selection and may
introduce various forms of bias into the sample.
Non-probability sampling is often used when it is not feasible or practical to implement probability sampling
methods.
Types of Non-Probability Sampling
• Judgment or purposive or deliberate sampling
• Convenience sampling
• Quota sampling
• Snowball sampling

Judgment or purposive or deliberate sampling


• Judgment sampling, also known as purposive or deliberate sampling, involves selecting specific individuals or cases based on the researcher's judgment or specific criteria.
• This method is often used when the researcher believes that certain participants can provide valuable insights into the research topic.
Example: Imagine a study focusing on the experiences of successful entrepreneurs. The researcher selects
participants who have a track record of starting and managing successful businesses. In this case, the
participants are chosen deliberately because they possess the expertise and knowledge relevant to the research
question.
Convenience Sampling
• Convenience sampling involves selecting individuals who are easily accessible or convenient for the researcher to reach.
• This method is simple and quick but may lead to biased results.
Example: Consider a study investigating student opinions about a new school policy. The researcher collects
responses from students who happen to be in the same room during a break. While this method is convenient, it
may not represent the entire student population's opinions accurately.
Quota Sampling
• Quota sampling involves dividing the population into subgroups (quotas) based on specific characteristics and then selecting participants from each subgroup.
• Quota sampling aims to ensure that the sample reflects the population's diversity in terms of the selected characteristics.
Example: Suppose a survey aims to understand preferences for different types of smartphone brands across age
groups. The researcher sets quotas for age groups (e.g., 18-25, 26-40, 41-60), and within each quota, selects
participants randomly or through convenience sampling. This approach ensures representation from different
age segments.

Snowball Sampling
• Snowball sampling is used when the population is hard to reach, and participants are recruited through referrals from existing participants.
• It's often used in studies involving hidden or marginalized populations.

Example: Imagine a study investigating the experiences of individuals who have experienced homelessness.
The researcher starts with a small group of participants who are willing to share their experiences. Then, these
initial participants refer the researcher to others within the homeless community, forming a "snowball" effect.
This method allows researchers to access a population that may not be easily accessible through traditional
sampling methods.
Sample Size Determination:
Sum of Independent R.V.
Expected Value (Mean) of the Sum of Independent Random Variables
Suppose we have n independent random variables, X1, X2, ..., Xn. Each of these random variables has its own
expected value (mean), denoted as E[Xi], where i ranges from 1 to n.
The expected value of the sum of these independent random variables, denoted as E[X1 + X2 + ... + Xn], is the
sum of their individual expected values:
E[X1 + X2 + ... + Xn] = E[X1] + E[X2] + ... + E[Xn]
This means that the expected value of the sum of independent random variables is simply the sum of their
individual expected values.
Variance of the Sum of Independent Random Variables
Similarly, the variance of the sum of independent random variables, denoted as var[X1 + X2 + ... + Xn], is the
sum of their individual variances:
var[X1 + X2 + ... + Xn] = var[X1] + var[X2] + ... + var[Xn]
Again, this equation states that the variance of the sum of independent random variables is the sum of their
individual variances.
Example:

Let's say you're rolling two fair six-sided dice 🎲. Each die has an expected value of (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
and a variance of ((1 − 3.5)² + (2 − 3.5)² + ... + (6 − 3.5)²)/6 ≈ 2.9167.
Now, if you want to find the expected value and variance of the sum of the two dice rolls:
Expected value: E[X1 + X2] = E[X1] + E[X2] = 3.5 + 3.5 = 7
Variance: var[X1 + X2] = var[X1] + var[X2] = 2.9167 + 2.9167 = 5.8334
This demonstrates how you can use the properties of expected values and variances to analyze the combined
outcomes of independent random variables.
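A quick simulation sketch that checks these numbers by rolling two dice many times (the number of repetitions, 100000, is arbitrary):
import numpy as np
rolls = np.random.randint(1, 7, size=(100000, 2))  # two independent dice per row
totals = rolls.sum(axis=1)
# Sample estimates should be close to the theoretical E = 7 and Var ≈ 5.8334
print("Mean of the sum:", totals.mean())
print("Variance of the sum:", totals.var())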
Sampling Bias:
Definition of a statistic
A statistic is a function of X only and does not involve any unknown parameters.
Examples:
Sample Mean, X̄ = (1/n) · ΣXᵢ
Sample Variance, S² = (1/(n − 1)) · Σ(Xᵢ − X̄)²
Non-Example:
(1/n) · Σ(Xᵢ − μ)² is not a statistic, unless μ is known.
A statistic can be generally denoted by g(X). Since a statistic is a function of random variables, it will be a
random variable itself and will have a distribution, its sampling distribution.

The Sample Mean


Suppose that we have n independent and identically distributed random variables, Xᵢ, i = 1, 2, ..., n, each with
mean μ and variance σ².
Sample Mean, X̄ = (1/n) · ΣXᵢ
E[X̄] = μ
Var[X̄] = σ²/n
SD[X̄] = σ/√n
For large n, X̄ ~ N(μ, σ²/n)
This result is known as the Central Limit Theorem (CLT) or the z result.
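A minimal simulation sketch of this z result (the population, Exp(0.5) with μ = 2 and σ² = 4, and the sample size n = 30 are arbitrary choices): the sample means should have mean close to μ and variance close to σ²/n.
import numpy as np
n = 30                     # sample size (assumed)
mu, sigma2 = 2, 4          # Exp(0.5) has mean 2 and variance 4
sample_means = np.random.exponential(scale=2, size=(10000, n)).mean(axis=1)
print("Mean of sample means:", sample_means.mean())      # close to μ = 2
print("Variance of sample means:", sample_means.var())   # close to σ²/n ≈ 0.133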

The Sample Variance


Suppose that we have n independent and identically distributed random variables, Xᵢ, i = 1, 2, ..., n, each with
mean μ and variance σ².
Sample Variance, S² = (1/(n − 1)) · Σ(Xᵢ − X̄)²
E[S²] = σ²
var[S²] depends on the population distribution.
The sampling distribution of S² when sampling from a normal population, with mean μ and variance σ²,
is: (n − 1)S²/σ² ~ χ²(n − 1)

The t result
In most cases, σ² is not known and so the z result cannot be used. We use the t result in such cases.
t = (X̄ − μ) / (S/√n) ~ t(n − 1)
This is the t result or the t sampling distribution.
This result is valid for samples from a normal distribution only.
The t distribution is symmetrical about zero.

The F result
If independent random samples of size n1 and n2 respectively are taken from normal populations with
variances σ1² and σ2², then
F = (S1²/σ1²) / (S2²/σ2²) ~ F(n1−1, n2−1)
This result is valid for samples from normal distribution only.

Point Estimation:
Method of Moments
The basic principle is to equate population moments (i.e. the means, variances, etc. of the theoretical model) to
corresponding sample moments (i.e. the means, variances, etc. of the sample data observed) and solve for the
parameter(s).
One parameter case:
E(X) = (1/n) ∑ Xi
Two parameter case:
E(X) = (1/n) ∑ Xi AND E(X²) = (1/n) ∑ Xi²
Method of Moments Example
A random sample from an Exp(λ) distribution is as follows:
14.84, 0.19, 11.75, 1.18, 2.44, 0.53
Calculate the method of moments estimate for λ.
x̄ = (14.84+0.19+11.75+1.18+2.44+0.53)/6 = 5.155
E[X] = 1/λ
According to method of moments: E[X] = x̄ -> 1/λ = 5.155 -> λ = 0.1940
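The same estimate can be reproduced in a couple of lines of Python (a sketch, assuming NumPy):

import numpy as np

sample = np.array([14.84, 0.19, 11.75, 1.18, 2.44, 0.53])

# method of moments for Exp(lambda): E[X] = 1/lambda, so lambda_hat = 1 / x_bar
x_bar = sample.mean()
lambda_hat = 1 / x_bar
print(x_bar, lambda_hat)   # 5.155 and roughly 0.194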

Method of Maximum Likelihood:


The method of maximum likelihood is widely regarded as the best general method of finding estimators.
Likelihood of a sample x1,x2,...,xn from a population with density or probability function f (x; θ) is given by:
L(θ) = ∏ f(xi; θ) (product over i = 1, ..., n)
Differentiating the likelihood or log likelihood with respect to the parameter and setting the derivative to zero
gives the maximum likelihood estimator (MLE) for the parameter (denoted by θ^ ).
Method of Maximum Likelihood Example:
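As an illustrative sketch (an assumed example, reusing the Exp(λ) sample from the method of moments section), the MLE can be found by maximising the log-likelihood numerically with SciPy; for the exponential distribution the closed-form answer is λ̂ = 1/x̄, which the numerical search should reproduce.

import numpy as np
from scipy.optimize import minimize_scalar

sample = np.array([14.84, 0.19, 11.75, 1.18, 2.44, 0.53])

# negative log-likelihood for Exp(lambda): -(n*log(lambda) - lambda*sum(x))
def neg_log_likelihood(lam):
    return -(len(sample) * np.log(lam) - lam * sample.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method='bounded')
print(result.x)            # numerical MLE
print(1 / sample.mean())   # closed-form MLE, 1/x_bar, roughly 0.194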
Properties of Estimators
Let us take a random sample X = (X1,X2,...,Xn) from a distribution with an unknown parameter θ and g(X) is an
estimator of θ.
Bias = E[g(X)] - θ
If bias of an estimator is zero, it is said to be an unbiased estimator.
Important: S2 is an unbiased estimator of σ2. Hence, we take (n-1) instead of n.
Mean Squared Error, MSE(g(X)) = E[(g(X) − θ)²] = bias² + variance
Estimator with a lower MSE is said to be more efficient.

Summary:

🔹 Random Sample: Understanding the characteristics and importance of random sampling in statistical
analysis.

🔹 Random Sample Notation: Familiarizing yourself with the notation and representation of random samples
using capital X's.

🔹 Sum of Independent R.V.: Learning the properties of sums of independent random variables, facilitating
calculations in statistical studies.

🔹 Normal Approximations Using CLT: Exploring the Central Limit Theorem and its application to
approximating distributions.

🔹 The Sample Variance: Understanding the sample variance as an unbiased estimator of population variance
and its distribution.

🔹 The t result: Recognizing the t-distribution as an alternative when population variance is unknown.

🔹 The F result: Exploring the F-distribution in scenarios with two independent samples from normal
populations.

🔹 Point Estimation: Embracing the method of moments and maximum likelihood as techniques for estimating
parameters.

🔹 Method of Moments Example: Applying the method of moments to estimate parameters using a sample from
an exponential distribution.

🔹 Method of Maximum Likelihood: Understanding the widely acclaimed method for finding estimators based
on the likelihood of samples.

🔹 Properties of Estimators: Uncovering the properties of estimators, including bias, variance, and mean
squared error.

Activity 1:
Consider a sample of 10 students' test scores: {75, 80, 92, 68, 85, 77, 81, 79, 90, 88}. Calculate the sample
mean.
Activity 2:
Consider a sample of 10 students' test scores: {75, 80, 92, 68, 85, 77, 81, 79, 90, 88}. Calculate the sample
variance.
Activity 3:
A sample of 12 observations is collected from a normally distributed population. The sample mean is 68, and
the sample standard deviation is 5. Test the hypothesis that the population mean is 65 at a significance level of
0.01.
Activity 4:
Calculate the probability that, for a random sample of 5 values taken from a N(100, 25²) population
(i) X̄ will be between 80 and 120, and
(ii) S will exceed 41.7.
Activity 5:
A random sample from an exponential distribution is given as follows:
8.21, 3.47, 5.92, 2.14, 1.05
Calculate the method of moments estimate for the parameter λ of the exponential distribution.

Concept of Confidence

Confidence levels
The confidence level is the probability that an interval estimate built around a sample statistic (e.g., a mean or
proportion) contains the true population parameter.
The width of a confidence interval is affected by sample size, confidence level, and standard deviation.
Confidence levels in statistics are like a rollercoaster ride. We start with a hypothesis and take a leap of faith by
making a prediction about our population parameter. Then, we collect data and take a thrilling ride through the
statistical analysis process.
Along the way, we use tools like sample means and standard deviations to estimate the population parameter,
but we know that our estimate may not be perfect due to sampling variability. This is where confidence levels
come in.
Think of a confidence level like a safety harness on a rollercoaster. It helps us to stay within a certain range of
our estimated value, keeping us safe from extreme values that could throw off our analysis. A wider confidence
interval suggests that the estimate is less precise, as there is more variability in the data.
Example:
A 95% confidence level means that we are 95% certain that the true population parameter falls within our
calculated confidence interval. It's like taking 100 rides on a rollercoaster and being confident that 95 of them
will keep us safe within the bounds of our confidence level.

To understand confidence levels better, imagine that you want to estimate the average height of all the people in
your city. You can't measure everyone's height, so you take a sample of people and calculate their average
height. But you know that your sample may not be representative of the entire population, so you want to be
sure that your estimate is accurate within a certain range.

This is where confidence levels come in. You can calculate a confidence interval that gives you a range of
possible values for the true population mean. The confidence level tells you how confident you are that the true
mean falls within that interval.

A higher confidence level means that you're more certain that the true mean falls within the interval, but it also
means that the interval will be wider. A lower confidence level means that the interval will be narrower, but
you'll be less certain that the true mean falls within it.

So, understanding confidence levels in statistics is like knowing the limits of our rollercoaster ride. We can take
risks and make predictions, but with the help of confidence levels, we can stay safe and confident in our results.
Note: If a confidence interval for a difference includes the value of zero, we cannot conclude that there is a
statistically significant effect or difference between the groups.
The level of confidence c is the probability that the interval estimate contains the population parameter. The
remaining area, split between the two tails, is 1-c
Interpreting Confidence Intervals:

One sample mean (σ Known):


Constructing confidence intervals using one sample mean (σ known) in statistics is like baking a cake. We start
with a recipe that tells us how much of each ingredient we need, and we follow a specific process to create a
delicious dessert.
Similarly, constructing a confidence interval for one sample mean (σ known) requires a specific formula and a
set process.
For example, when sampling from a N(μ, σ²) distribution where σ² is known:
X̄ ~ N(μ, σ²/n) => Z = (X̄ − μ)/(σ/√n) ~ N(0,1)

 If we require a 95% confidence interval, then we can read off the 2.5% and 97.5% z-values.
 This gives us z0.025 = -1.96 and z0.975 = +1.96.
 Substituting the values of z0.025 and z0.975 and rearranging, we get:

(X̄ − 1.96σ/√n, X̄ + 1.96σ/√n)
This is the 95% confidence interval of the population mean, μ. It is also expressed as:
X̄ ± 1.96σ/√n

Example
The average IQ of a sample of 50 university students was found to be 132. Calculate a symmetrical 95%
confidence interval for the average IQ of university students, assuming that IQs are normally distributed. It is
known from previous studies that the standard deviation of IQs among students is approximately 20.
Here, X ~ N(μ, 20²).
Given, n = 50, x̄ = 132, α = 0.05
=> 95% confidence interval for μ is (132 - 1.96 × 20/√50, 132 + 1.96 × 20/√50),
i.e., (126.5, 137.5) => We are 95% confident that the average IQ of the population lies between 126.5 and 137.5
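A sketch of the same calculation in Python, assuming SciPy is available:

import numpy as np
from scipy.stats import norm

n, x_bar, sigma, alpha = 50, 132, 20, 0.05

z = norm.ppf(1 - alpha / 2)                 # 1.96
margin = z * sigma / np.sqrt(n)
print(x_bar - margin, x_bar + margin)       # roughly (126.5, 137.5)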

Finding sample size:


A very common question asked of a statistician is:
“How large a sample is needed?”
Effectively, a confidence interval consists of two parts: a point estimate and a margin of error (MoE).
Width = 2 × MoE
Given the required MoE and the value of σ, we can solve for n, as sketched below.
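Rearranging MoE = z × σ/√n gives n = (zσ/MoE)². A minimal sketch, assuming SciPy and illustrative values (σ = 20 and a required margin of error of 5):

import math
from scipy.stats import norm

sigma = 20        # assumed population standard deviation
moe = 5           # required margin of error
alpha = 0.05

z = norm.ppf(1 - alpha / 2)
n = (z * sigma / moe) ** 2
print(math.ceil(n))   # round up to the next whole observation (62 here)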

One sample mean (σ unknown):
The 95% confidence interval of the population mean, μ is expressed as:
X̄ ± t(0.025, n−1) S/√n
t(0.025, n−1) is the value k such that P(t(n−1) < k) = 0.025.
Since, t distribution is also symmetric about zero, this confidence interval is also symmetric.
Example

Example Calculate a 95% confidence interval for the average height of 10-year-old children 👫 , assuming that
heights have a N(μ, σ2) distribution (where μ and σ are unknown), based on a random sample of 5 children
whose heights are: 124cm, 122cm, 130cm, 125cm and 132cm.
Here, n = 5, x̄ = 126.6, s = 4.22, α = 0.05
=> t(0.025, 4) = −2.776 => 95% confidence interval for μ is
(126.6 - 2.776 × 4.22/√5, 126.6 + 2.776 × 4.22/√5), i.e., (121.4, 131.8)
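The same interval in Python (a sketch, assuming SciPy):

import numpy as np
from scipy.stats import t

heights = np.array([124, 122, 130, 125, 132])
n = len(heights)
x_bar = heights.mean()
s = heights.std(ddof=1)                 # sample standard deviation

t_crit = t.ppf(0.975, df=n - 1)         # 2.776
margin = t_crit * s / np.sqrt(n)
print(x_bar - margin, x_bar + margin)   # roughly (121.4, 131.8)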

One sample variance:


The 95% confidence interval of the population variance, σ², is expressed as:
( (n−1)S²/χ²(0.975, n−1) , (n−1)S²/χ²(0.025, n−1) )
χ²(0.025, n−1) is the value k1 such that P(χ²(n−1) < k1) = 0.025
χ²(0.975, n−1) is the value k2 such that P(χ²(n−1) < k2) = 0.975
Due to the skewness of the χ² distribution these confidence intervals are not symmetrical about the point
estimator S². So, we can't write these using the "±" notation.
Example
Calculate a 95% confidence interval for the standard deviation of the heights of 10-year-old children, assuming that heights
have a N(μ, σ2) distribution (where μ and σ are unknown), based on a random sample of 5 children whose
heights are: 124cm, 122cm, 130cm, 125cm and 132cm.
Here, n = 5, s = 4.22, α = 0.05 => χ²(0.025, 4) = 0.4844 ; χ²(0.975, 4) = 11.14 => 95% confidence interval for σ² is
(4 × 4.22²/11.14, 4 × 4.22²/0.4844), i.e., (6.39, 147.0)
=> 95% confidence interval for σ is (√6.39, √147.0), i.e., (2.53, 12.1)
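A sketch of the same calculation, assuming SciPy:

import numpy as np
from scipy.stats import chi2

heights = np.array([124, 122, 130, 125, 132])
n = len(heights)
s2 = heights.var(ddof=1)                   # sample variance

lower = (n - 1) * s2 / chi2.ppf(0.975, df=n - 1)
upper = (n - 1) * s2 / chi2.ppf(0.025, df=n - 1)
print(lower, upper)                        # variance CI, roughly (6.39, 147.0)
print(np.sqrt(lower), np.sqrt(upper))      # standard deviation CI, roughly (2.53, 12.1)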

One sample proportion:


The 95% confidence interval is a range of values in which the true value of a population parameter (in this case,
the population proportion) is likely to fall, based on a sample from that population. It is a measure of the
precision of our estimate of the population proportion.
The 95% confidence interval of the population proportion, θ is expressed as:
θ̂ ± 1.96 √(θ̂(1−θ̂)/n), where θ̂ = X/n
where:
 θ^ is the sample proportion
 n is the sample size
 1.96 is the z-value for a 95% confidence level
This confidence interval is also symmetric. It is valid only when the binomial distribution can be approximated
by the normal distribution.
Example
In a one-year mortality investigation, 45 of the 250 ninety-year-olds present at the start of the investigation died
before the end of the year. Assuming that the number of deaths has a binomial distribution with parameters
n=250 and p, calculate a symmetrical 90% confidence interval for the unknown mortality rate p.
Here, n = 250, p̂ = 45/250 = 0.18, α = 0.1
=> z0.05 = -1.6449 ; z0.95 = 1.6449
=> √(p̂(1−p̂)/n) = √(0.18 × (1−0.18)/250) = 0.024
=> 90% confidence interval for p is (0.18 - 1.6449 × 0.024, 0.18 + 1.6449 × 0.024), i.e., (0.140, 0.220)
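The same interval in Python (a sketch, assuming SciPy):

import numpy as np
from scipy.stats import norm

n, deaths, alpha = 250, 45, 0.10
p_hat = deaths / n

z = norm.ppf(1 - alpha / 2)                 # 1.6449
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - z * se, p_hat + z * se)       # roughly (0.140, 0.220)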

Two sample means (σ1,σ2 known):


A 100(1-α)% confidence interval of the difference in the population means (μ1-μ2) of two normal populations is
an interval estimate of the difference between the population means, where 100(1-α)% represents the level of
confidence we have in the estimate, and α represents the significance level of the test.
The 100(1-α)% confidence interval of the difference in the population means (μ1−μ2) of two normal
populations is:
(X̄1 − X̄2) ± z(α/2) × √(σ1²/n1 + σ2²/n2)
where:
 x̄1 and x̄2 are the sample means of the two populations
 σ1 and σ2 are the population standard deviations of the two populations
 n1 and n2 are the sample sizes of the two populations
 z(α/2) is the z-score corresponding to the α/2 level of significance, i.e. the value k such that P(Z < k) = α/2
Example
Suppose we want to estimate the difference in the mean heights of two populations, population 1 and population
2. We take a random sample of 50 individuals from population 1, and a random sample of 40 individuals from
population 2. The sample means are x̄ 1 = 68 inches and x̄ 2 = 70 inches, and the sample standard deviations are
s1 = 2 inches and s2 = 3 inches. We want to construct a 95% confidence interval for the difference in the
population means.
Using the formula above, we can calculate the confidence interval as:
(68 − 70) ± 1.96 × √(2²/50 + 3²/40) => −2 ± 1.082 = (−3.082, −0.918)
Therefore, we can say with 95% confidence that the true difference in the mean heights of the two populations
lies between -3.082 and -0.918 inches. This means that we are 95% confident that the mean height of population
1 is lower than the mean height of population 2 by somewhere between 0.918 and 3.082 inches.
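A sketch of the same calculation (assuming SciPy, and treating the sample standard deviations as the known population values, as the example does):

import numpy as np
from scipy.stats import norm

x1_bar, x2_bar = 68, 70
s1, s2 = 2, 3
n1, n2 = 50, 40

z = norm.ppf(0.975)                         # 1.96
se = np.sqrt(s1**2 / n1 + s2**2 / n2)
diff = x1_bar - x2_bar
print(diff - z * se, diff + z * se)         # roughly (-3.08, -0.92)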

Two sample mean (σ1,σ2 unknown):


The 100(1-α)% confidence interval of the difference in the population means (μ1−μ2) of two normal
populations is:
(X̄1 − X̄2) ± t(α/2, n1+n2−2) × Sp × √(1/n1 + 1/n2)
Where:
Sp² = [(n1−1)s1² + (n2−1)s2²] / (n1 + n2 − 2)
Requirement: The two variances should be nearly equal.
Example

A motor company runs tests to investigate the fuel consumption of cars 🚜 using a newly developed fuel additive.
Sixteen cars of the same make and age are used, eight with the new additive and eight as controls. The results, in
miles per gallon over a test track under regulated conditions, are as follows:

Obtain a 95% confidence interval for the increase in miles per gallon achieved by cars with the additive. State
clearly any assumptions required for this analysis.

Two sample variances:


The 100(1-α)% confidence interval of the ratio of the population variances (σ1²/σ2²) of two normal
populations is:
(S1²/S2²) × 1/F(n1−1, n2−1) < σ1²/σ2² < (S1²/S2²) × F(n2−1, n1−1)
Two sample proportions:
The 100(1-α)% confidence interval of the difference in the population proportions (θ1−θ2) of two binomial
populations (approx. by normal) is:
(θ̂1 − θ̂2) ± z(α/2) × √(θ̂1(1−θ̂1)/n1 + θ̂2(1−θ̂2)/n2)
Example

In a one-year mortality investigation, 25 of the 100 ninety-year-old males 🧑 and 20 of the 150 ninety-year-old

females 👩 present at the start of the investigation died before the end of the year. Assuming that the numbers of
deaths follow binomial distributions, calculate a symmetrical 95% confidence interval for the difference

between male 🧑 and female 👩 mortality rates at this age.


First, let's calculate the proportions and sample sizes:
θ̂1 = 25/100 = 0.25
θ̂2 = 20/150 = 0.133
n1 = 100
n2 = 150
Next, we can plug these values into the formula and calculate the confidence interval:
=> (0.25 − 0.133) ± 1.96 × √(0.25(1−0.25)/100 + 0.133(1−0.133)/150)
= (0.016, 0.218)
Therefore, the symmetrical 95% confidence interval for the difference between male and female mortality rates
at the age of 90 is (0.016, 0.218). This means that we can be 95% confident that the true difference between
male and female mortality rates falls within this interval. Since the interval does not include zero, we can
conclude that there is a statistically significant difference between male and female mortality rates at this age.
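The same calculation as a short Python sketch (assuming SciPy):

import numpy as np
from scipy.stats import norm

p1, n1 = 25 / 100, 100
p2, n2 = 20 / 150, 150

z = norm.ppf(0.975)                         # 1.96
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
print(diff - z * se, diff + z * se)         # roughly (0.016, 0.218)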

Choosing Appropriate Confidence Levels:

Choosing the appropriate confidence level in statistics is like choosing the right door to open. You want to be
sure that the door you choose leads to the right answer, but you also don't want to waste time and resources on
unnecessary or inaccurate information.

When choosing a confidence level, you need to balance the trade-off between being confident in your results
and having a narrow interval. Generally, higher confidence levels provide greater confidence in your results, but
they also require larger sample sizes and lead to wider intervals. Lower confidence levels provide narrower
intervals, but with less confidence in your results.
For example - let's say you want to estimate the mean height of all the trees in a large forest. If you choose a
90% confidence level, you can be 90% confident that the true population mean falls within your calculated
confidence interval. However, the interval may be wider, requiring a larger sample size. On the other hand, if
you choose a 99% confidence level, you can be more confident in your results, but the interval may be too wide
to be practical.

In general, a confidence level of 95% is commonly used in many statistical analyses as a balance between
precision and confidence. This means that we can be 95% confident that the true population parameter falls
within our calculated interval. However, the appropriate confidence level depends on the specific context and
goals of the analysis.

To choose an appropriate confidence level, consider the following factors:


The level of risk: If a wrong decision based on the analysis could result in serious consequences, a higher
confidence level should be used to ensure accuracy and avoid errors.
The sample size: If the sample size is small, a higher confidence level should be used to ensure that the estimate
is accurate.
The cost of sampling: A lower confidence level can be used if a larger sample size is too costly or time-
consuming.
The context of the analysis: The appropriate confidence level may depend on the field of study or the specific
research question being investigated.
For example - if a pharmaceutical company is testing a new drug for safety, a higher confidence level should be
used to ensure accuracy and avoid any potential risks to human health.
On the other hand, if a marketing research firm is testing consumer preferences for a new product, a lower
confidence level may be used if the cost of sampling a large population is too high.
Ultimately, the appropriate confidence level depends on the specific context and goals of the analysis. By
understanding the trade-offs between precision and confidence, we can choose an appropriate confidence level
that balances accuracy with practicality.
Confidence level is the probability that the true population parameter falls within the interval estimate of the
sample statistic. It is important to choose an appropriate confidence level for an interval estimate because it
determines the range of values that we can be reasonably confident contains the true population parameter.
The most common confidence levels used in statistics are 90%, 95%, and 99%. To choose an appropriate
confidence level, we need to consider the desired level of accuracy and the potential consequences of making a
Type I or Type II error.
The formula for a confidence interval is:
CI = point estimate ± margin of error
where the point estimate is the sample statistic that estimates the population parameter and the margin of error is
the maximum expected difference between the point estimate and the true population parameter.
The formula for the margin of error depends on the sample size, the level of confidence, and the standard
deviation or standard error of the sample statistic.
For example, if we want to estimate the mean income of a population with a 95% confidence level and a sample
of size 100, we can use the following formula:
CI = X̄ ± t(α/2) × (s/√n)
where,
 X¯ is the sample mean
 s is the sample standard deviation
 n is the sample size
 tα2 is the critical t-value for a two-tailed t-distribution with α/2 level of significance and n-1 degrees of
freedom.
For a 95% confidence level and 99 degrees of freedom, the tα/2 value is 1.984.
Example
Suppose we obtain a sample mean of 50,000 and a sample standard deviation of 5,000. Using the formula, we
can calculate the 95% confidence interval as:
CI = 50,000 ± (1.984 × (5,000/√100))
CI = 50,000 ± 992
The 95% confidence interval for the population mean income is therefore between 49,008 and 50,992.
This means that if we were to take repeated samples and construct 95% confidence intervals for each sample,
about 95% of those intervals would contain the true population mean income.
In general, a higher confidence level requires a larger sample size or a larger margin of error. However, a higher
confidence level also increases the risk of making a Type I error, which is the probability of rejecting a true null
hypothesis.
Conversely, a lower confidence level decreases the risk of making a Type I error, but increases the risk of
making a Type II error, which is the probability of failing to reject a false null hypothesis. Therefore, it is
important to choose an appropriate confidence level based on the desired level of accuracy and the potential
consequences of making a Type I or Type II error.

Activity 1:
A pharmaceutical company is conducting a study to compare the effectiveness of two different medications, A
and C, in treating a specific medical condition. The company randomly assigns patients to two groups: Group A
receives Medication A, and Group C receives Medication C. The company measures a certain outcome variable
before and after the treatment period for each group. The results are as follows:
Group A: Sample size = 30, Sample mean = 85, Sample standard deviation = 10 Group C: Sample size = 35,
Sample mean = 75, Sample standard deviation = 12
Using the provided information, calculate the test statistic for comparing the means of Medication A and
Medication C.
A) 0.97
B) 1.35
C) -0.97
D) -1.35
Activity 2:
A research study investigated the mortality rates of two different groups, Group X and Group Y. In Group X, out
of 80 individuals aged 60, 15 died within a year. In Group Y, out of 120 individuals aged 60, 25 died within a
year. The researchers assume that the number of deaths follows a binomial distribution.
Calculate a symmetrical 95% confidence interval for the difference between the mortality rates of Group X and
Group Y at this age.
A) (-0.060, 0.164)
B) (0.023, 0.126)
C) (-0.105, 0.219)
D) (0.042, 0.107)
Activity 3:
A motor company runs tests to investigate the fuel consumption of cars using a newly developed fuel additive.
Sixteen cars of the same make and age are used, eight with the new additive and eight as controls. The results, in
miles per gallon over a test track under regulated conditions, are as follows:

Obtain a 95% confidence interval for the increase in miles per gallon achieved by cars with the additive. State
clearly any assumptions required for this analysis.
Activity 4:
In a mortality investigation, 25 of the 100 ninety-year-old males and 20 of the 150 ninety-year-old females
present at the start of the investigation died before the end of the year. Assuming that the numbers of deaths
follow binomial distributions, calculate a symmetrical 95% confidence interval for the difference between male
and female mortality rates at this age.
Activity 5:
A pharmaceutical company wants to determine the average effectiveness of a new pain medication. A random
sample of 100 patients who were administered the medication is selected. The company wants to estimate the
average pain reduction with 95% confidence. The sample mean pain reduction is found to be 2.5 units with a
standard deviation of 0.8 units.
What is the 95% confidence interval for the true mean pain reduction of the medication?

Hypothesis Testing

Hypothesis:
A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the absolute truth but a
provisional working assumption. In statistics, a hypothesis is considered to be a particular assumption
about a set of parameters of a population distribution. It is called a hypothesis because it is not known
whether or not it is true.
For example, imagine you notice that your plants are growing taller when you play classical music for
them. You might come up with a hypothesis that the music helps the plants grow. This is just an initial idea

that you can test by playing different genres of music for different groups of plants, measuring their
growth, and comparing the results. If you find that the plants do indeed grow better with classical music,
you can refine your hypothesis and test it further with more experiments.
In statistics, a hypothesis is similar in that it's an assumption that we make about a population based on a
sample of data. We can use statistical tests to evaluate the likelihood that our hypothesis is true, or to
identify other possible explanations for our observations.

Hypothesis Test:
A hypothesis test is a standard procedure for testing a claim about a property of a population.
Rare Event Rule for Inferential Statistics:
If, under a given assumption, the probability of a particular observed event is exceptionally small, we
conclude that the assumption is probably not correct.

For Example:
ProCare Industries, Ltd., once provided a product called “Gender Choice,” which, according to advertising
claims, allowed couples to “increase your chances of having a boy up to 85%, a girl up to 80%.” Gender Choice
was available in blue packages for couples wanting a baby boy and (you guessed it) pink packages for couples
wanting a baby girl.
Suppose we conduct an experiment with 100 couples who want to have baby girls, and they all follow the
Gender Choice “easy-to-use in-home system” described in the pink package. For the purpose of testing the
claim of an increased likelihood for girls, we will assume that Gender Choice has no effect.

Components of a formal hypothesis test


Given a claim, identify the null hypothesis and the alternative hypothesis, and express them both in
symbolic form.
Given a claim and sample data, calculate the value of the test statistic.
Given a significance level, identify the critical value(s).
Given a value of the test statistic, identify the P-value.
State the conclusion of a hypothesis test in simple, non-technical terms.

Null Hypothesis : Ho
The null hypothesis (denoted by Ho) is a statement that the value of a population parameter (such as proportion,
mean, or standard deviation) is equal to some claimed value.
We test the null hypothesis directly.
Either reject Ho or fail to reject Ho.
The null hypothesis is what we are willing to assume is the case until proven otherwise. We can never claim that
the null hypothesis has been actually proved.

Alternative Hypothesis : HA
The alternative hypothesis (denoted by H1, Ha, or HA) is the statement that the parameter has a value that
somehow differs from the null hypothesis.
The symbolic form of the alternative hypothesis must use one of these symbols: ≠, <, >.
The alternative hypothesis is typically what researchers are hoping to find evidence for, as it represents a new
theory or idea that could advance our understanding of the world.
There are three types of alternative hypotheses:
One-tailed alternative hypothesis: This type of alternative hypothesis proposes that there is a difference
between two groups in a specific direction.
For example, if we wanted to test the hypothesis that a new medication reduces pain better than a placebo, we
might use a one-tailed alternative hypothesis that says the medication reduces pain more than the placebo.

Two-tailed alternative hypothesis: This type of alternative hypothesis proposes that there is a difference
between two groups, but does not specify a particular direction.
For example, if we wanted to test the hypothesis that a new medication has a different effect on pain than a
placebo, we might use a two-tailed alternative hypothesis that says the medication has a different effect on pain
than the placebo.

Non-inferiority or equivalence alternative hypothesis: This type of alternative hypothesis proposes that there is
no significant difference between two groups or treatment, or that the difference is not clinically meaningful.
For example, if we wanted to test the hypothesis that a new medication is no worse than an existing medication
for treating a certain condition, we might use a non-inferiority alternative hypothesis.

For example: suppose we want to test the hypothesis that a new pain medication is more effective than an
existing medication. We could set up our hypotheses as follows:
Null hypothesis: The new pain medication is no more effective than the existing medication.
One-tailed alternative hypothesis: The new pain medication is more effective than the existing medication.
Two-tailed alternative hypothesis: The new pain medication has a different effect on pain than the existing
medication. Non-inferiority or equivalence alternative hypothesis: The new pain medication is not significantly
worse than the existing medication.
We could then conduct a statistical test to determine whether the data supports one of these hypotheses over the
others. By carefully choosing our hypotheses and analyzing the data, we can draw conclusions about the
effectiveness of the new medication and potentially make improvements to patient care.
How to form your claim or hypothesis?
If you are conducting a study and want to use a hypothesis test to support your claim, the claim must be worded
so that it becomes the alternative hypothesis.
Step 1 : Identify the specific claim or hypothesis to be tested and express it in symbolic form
Step 2 : Give the symbolic form that must be true when the original claim is false
Step 3 : Of the two symbolic expressions obtained so far, let the alternative hypothesis HA be the one not
containing equality so that HA uses the symbol < or > or ≠. Let the null hypothesis Ho be the symbolic
expression that the statistic equals the fixed value being considered
Identify the null and alternative hypothesis:
The proportion of drivers who admit to running red lights is greater than 0.5.
Step 1 : We express the given claim as p > 0.5.
Step 2 : We see that if p > 0.5 is false, then p ≤ 0.5 must be true.
Step 3 : We let the alternative hypothesis HA be p > 0.5, and we let Ho be p = 0.5.

The mean height of professional basketball players is at most 7 ft.


Step 1 : We express “a mean of at most 7 ft” in symbols as μ ≤ 7.
Step 2 : We see that if μ ≤ 7 is false, then μ > 7 must be true.
Step 3 : We let the alternative hypothesis HA be μ > 7, and we let H0 be μ = 7.

The standard deviation of IQ scores of actors is equal to 15.


Step 1 : We express the given claim as σ = 15.
Step 2 : We see that if σ = 15 is false, then σ ≠ 15 must be true.
Step 3 : We let the alternative hypothesis HA be σ ≠ 15 and we let H0 be σ = 15.

Type - I error
A Type I error is the mistake of rejecting the null hypothesis when it is true.
The symbol α (alpha) is used to represent the probability of a type I error.
Type - II error
A Type II error is the mistake of failing to reject the null hypothesis when it is false.
The symbol β (beta) is used to represent the probability of a type II error.

Test Statistic
The test statistic is a value used in making a decision about the null hypothesis, and is found by converting the
sample statistic to a score with the assumption that the null hypothesis is true.
Test Statistic - Formula
Test statistics for proportions:
z = (p̂ − p) / √(pq/n)
Test statistic for mean
z = (X̄ − μ) / (σ/√n)
Test statistic for variance
χ² = (n−1)S²/σ²
Test Statistic - Example
Problem :
A survey of n = 880 randomly selected adult drivers showed that 56% (or p̂ = 0.56) of those respondents
admitted to running red lights. Find the value of the test statistic for the claim that the majority of all adult
drivers admit to running red lights.

The preceding example showed that the given claim results in the following null and alternative
hypotheses: H0 : p = 0.5 and HA : p > 0.5. Because we work under the assumption that the null hypothesis is
true with p = 0.5, we get the following test statistic:
z = (p̂ − p)/√(pq/n) = (0.56 − 0.5)/√((0.5)(0.5)/880) = 3.56
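A sketch of this calculation in Python, assuming SciPy:

import numpy as np
from scipy.stats import norm

n, p_hat, p0 = 880, 0.56, 0.5

z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print(z)             # about 3.56
print(norm.sf(z))    # one-sided P-value (see the P-value section below)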

Significance Level
The significance level (denoted by α) defines how much evidence we require to reject H0 in favor of HA
Critical Region
The critical region (or rejection region) is the set of all values of the test statistic that cause us to reject the null
hypothesis.
Critical Value
A critical value is any value that separates the critical region (where we reject the null hypothesis) from the
values of the test statistic that do not lead to rejection of the null hypothesis. The critical values depend on the
nature of the null hypothesis, the sampling distribution that applies, and the significance level α.
Finding Critical Value for α=0.01 (Two-tailed test)

One-Tailed and Two-Tailed Tests:

In hypothesis testing, a null hypothesis (H₀) is tested against an alternative hypothesis (H ₁). The alternative
hypothesis can be one-tailed or two-tailed, which determines the direction of the test.

One-tailed test
 A one-tailed test is when the alternative hypothesis is directional, meaning it specifies a particular direction
of difference between the sample mean and the population mean.
 The critical region is located entirely in one tail of the distribution.
 For example, if we want to test whether a new drug increases blood pressure, the one-tailed alternative
hypothesis would be that the drug increases blood pressure.
A one-tailed test, showing the p-value as the size of one tail.
Two-tailed test :
 A two-tailed test is when the alternative hypothesis is non-directional, meaning it specifies a difference
between the sample mean and the population mean without specifying a particular direction.
 The critical region is located in both tails of the distribution.
 For example, if we want to test whether a coin is biased, the two-tailed alternative hypothesis would be
that the coin is not fair.
A two-tailed test applied to the normal distribution.

One-Tailed Test:
 Null Hypothesis (H0): μ≤X
 Alternative Hypothesis (Ha): μ>X
 Test Statistic: z = (X̄ − μ)/(σ/√n)
where, X¯ is the sample mean,
μ is the hypothesized population mean,
σ is the population standard deviation,
n is the sample size.
Critical Value: zα
where, α is the significance level and can be found in a z-table or calculated using a calculator.
 Rejection Region:
If, z>zα reject the null hypothesis.
Example:
A shoe manufacturer claims that their shoes have an average lifespan of 12 months. You take a random sample
of 100 shoes and find that the average lifespan is 13 months with a standard deviation of 2 months. Test the
manufacturer's claim at a 5% level of significance.
H0:μ≤12
Ha:μ>12
Test Statistic: z = (13 − 12)/(2/√100) = 5
Critical Value: z0.05=1.645
Rejection Region: If z>1.645, reject the null hypothesis.
Since the calculated test statistic (z=5) is greater than the critical value (z0.05=1.645), we reject the null
hypothesis and conclude that the shoe manufacturer's claim is not supported by the data.
Two-Tailed Test
 Null Hypothesis (H0): μ=X
 Alternative Hypothesis (Ha): μ≠X
 Test Statistic: t = (X̄ − μ)/(s/√n)
where, X̄ is the sample mean, μ is the hypothesized population mean,
s is the sample standard deviation, and n is the sample size.
 Degrees of Freedom: n−1
 Critical Value: t(α/2)
where, α is the significance level and can be found in a t-table or calculated using a calculator.
 Rejection Region: If t < −t(α/2) or t > t(α/2), reject the null hypothesis.
Example:
A grocery store claims that their organic apples weigh an average of 150 grams. You take a random sample of
30 organic apples and find that the average weight is 155 grams with a standard deviation of 10 grams. Test the
grocery store's claim at a 5% level of significance.
H0:μ=150
Ha:μ≠150
Test Statistic: t = (155 − 150)/(10/√30) = 2.74
Degrees of Freedom: 30−1=29
Critical Value: t0.025=±2.045
Rejection Region: If t<−2.045 or t>2.045, reject the null hypothesis.
Since the calculated test statistic (t = 2.74) is greater than the critical value (2.045), it falls in the rejection
region, so we reject the null hypothesis and conclude that the grocery store's claim is not supported by the data.
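The calculation can be checked with SciPy working from the summary statistics (a sketch; SciPy's ttest_1samp needs the raw data, so the statistic and P-value are computed directly here):

import numpy as np
from scipy.stats import t

x_bar, mu0, s, n = 155, 150, 10, 30

t_stat = (x_bar - mu0) / (s / np.sqrt(n))
p_value = 2 * t.sf(abs(t_stat), df=n - 1)   # two-tailed P-value
t_crit = t.ppf(0.975, df=n - 1)             # about 2.045

print(t_stat, t_crit, p_value)
# reject H0 if |t_stat| > t_crit (equivalently, if p_value < 0.05)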
Right and left-tailed tests are two types of hypothesis tests that are commonly used in statistical analysis.
Right-tailed test
A right-tailed test is used to determine if a sample mean is significantly greater than a hypothesized population
mean. The null hypothesis assumes that the sample mean is less than or equal to the hypothesized population
mean. The alternative hypothesis assumes that the sample mean is greater than the hypothesized population
mean.
Formula:
 Null Hypothesis (H0): μ≤X
 Alternative Hypothesis (Ha): μ>X
 Test Statistic: z = (X̄ − μ)/(s/√n)
where,X¯ is the sample mean,
μ is the hypothesized population mean,
s is the sample standard deviation,
n is the sample size.
 Critical Value: zα ,where, α is the significance level and can be found in a z-table or calculated using a
calculator.
 Rejection Region: If, z>zα, reject the null hypothesis.
Example:
A car manufacturer claims that a new engine design can achieve an average fuel efficiency of 30 miles per
gallon (MPG) or more. You take a random sample of 50 cars with the new engine and find that the average fuel
efficiency is 31 MPG with a standard deviation of 2 MPG. Test the manufacturer's claim at a 5% level of
significance.
H0:μ≤30
Ha:μ>30
Test Statistic: z = (31 − 30)/(2/√50) = 3.54
Critical Value: z0.05=1.645
Rejection Region: If z>1.645, reject the null hypothesis.
Since the calculated test statistic (z = 3.54) is greater than the critical value (z0.05 = 1.645), we reject the null
hypothesis and conclude that the car manufacturer's claim is supported by the data.

Left-tailed test
A left-tailed test is used to determine if a sample mean is significantly less than a hypothesized population mean.
The null hypothesis assumes that the sample mean is greater than or equal to the hypothesized population mean.
The alternative hypothesis assumes that the sample mean is less than the hypothesized population mean.
Formula:
 Null Hypothesis (H0): μ≥X
 Alternative Hypothesis (Ha): μ<X
 Test Statistic: z = (X̄ − μ)/(s/√n)
where, X¯ is the sample mean,
μ is the hypothesized population mean,
s is the sample standard deviation,
n is the sample size.
 Critical Value: −zα, where α is the significance level and can be found in a z-table or calculated using
a calculator.
 Rejection Region: If, z<−zα, reject the null hypothesis.
Example:
A bakery claims that their new recipe for muffins contains no more than 10 grams of sugar. You take a random
sample of 25 muffins and find that the average sugar content is 9 grams with a standard deviation of 1 gram.
Test the bakery's claim at a 1% level of significance.
H0:μ≥10
Ha:μ<10
Test Statistic: z = (9 − 10)/(1/√25) = −5
Critical Value: −z0.01=−2.326
Rejection Region: If z<−2.326, reject the null hypothesis.
Since -5 is less than -2.326, we reject the null hypothesis and conclude that the bakery's claim of no more than
10 grams of sugar per muffin is supported by the data.

Two-tailed, Right-tailed, Left-tailed Tests


The tails in a distribution are the extreme regions bounded by critical values.
❏ Two-tailed test : H0: θ = θ0 , HA: θ ≠ θ0
❏ Right tailed test : H0: θ = θ0 , HA: θ > θ0
❏ Left tailed test : H0: θ = θ0 , HA: θ < θ0
Difference between one and two tailed test and right and left tailed test
In hypothesis testing, one-tailed and two-tailed tests refer to the direction of the alternative hypothesis, while
left-tailed and right-tailed tests refer to the direction of the critical region.

A one-tailed test is used when the alternative hypothesis specifies a direction, either an increase or a decrease, in
the parameter being tested. A two-tailed test is used when the alternative hypothesis does not specify a direction,
but only that the parameter is not equal to the hypothesized value.
In a right-tailed test, the critical region is in the right tail of the distribution and the null hypothesis is rejected if
the test statistic falls in the critical region to the right of the mean. In a left-tailed test, the critical region is in the
left tail of the distribution and the null hypothesis is rejected if the test statistic falls in the critical region to the
left of the mean.
In summary, one-tailed and two-tailed tests are concerned with the directionality of the alternative hypothesis,
while left-tailed and right-tailed tests are concerned with the location of the critical region.
P-Values:
The P-value (or p-value or probability value) is the probability of getting a value of the test statistic that is at
least as extreme as the one representing the sample data, assuming that the null hypothesis is true. The null
hypothesis is rejected if the P-value is very small, such as 0.05 or less.
If a P-value is small enough, then we say the results are statistically significant.

Conclusions in Hypothesis Testing based on P-value


We always test the null hypothesis. The initial conclusion will always be one of the following:
Reject the null hypothesis - if the P-value ≤ α (where α is the significance level, such as 0.05).
Fail to reject the null hypothesis - if the P-value > α
Example 1:
An article distributed by the Associated Press included these results from a nationwide survey: Of 880 randomly
selected drivers, 56% admitted that they run red lights. The claim is that the majority of all Americans run red
lights. That is, p > 0.5. The sample data are n = 880, and p̂ = 0.56.
H0 : p = 0.5 , HA : p > 0.5 , α = 0.05
z = (p̂ − p)/√(pq/n) = (0.56 − 0.5)/√((0.5)(0.5)/880) = 3.56

We see that for values of z = 3.50 and higher, we use 0.9999 for the cumulative area to the left of the test
statistic.
The P-value is 1 – 0.9999 = 0.0001. Since the P-value of 0.0001 is less than the significance level of α = 0.05, we
reject the null hypothesis. There is sufficient evidence to support the claim.
Example 2:
We have a sample of 106 body temperatures having a mean of 98.20°F. Assume that the sample is a simple
random sample and that the population standard deviation is known to be 0.62°F. Use a 0.05 significance level
to test the common belief that the mean body temperature of healthy adults is equal to 98.6°F.
H0 : μ = 98.6 , HA : μ ≠ 98.6, α = 0.05 , X¯ = 98.2 , σ = 0.62
z = (X̄ − μ)/(σ/√n) = (98.2 − 98.6)/(0.62/√106) = −6.64
This is a two-tailed test and the test statistic is to the left of the center, so the P-value is twice the area to the left
of z = –6.64.
Using a z-table (or scipy.stats.norm.cdf), the area to the left of z = –6.64 is less than 0.0001, so the P-value is less than 2(0.0001) = 0.0002.
Because the P-value is less than the significance level of α = 0.05, we reject the null hypothesis. There
is sufficient evidence to conclude that the mean body temperature of healthy adults differs from 98.6°F.
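The corresponding computation in Python (a sketch, assuming SciPy):

import numpy as np
from scipy.stats import norm

x_bar, mu0, sigma, n = 98.2, 98.6, 0.62, 106

z = (x_bar - mu0) / (sigma / np.sqrt(n))
p_value = 2 * norm.cdf(z)     # two-tailed: twice the area to the left of a negative z
print(z, p_value)             # z is about -6.64 and the P-value is far below 0.05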

Activity 1
A researcher wants to investigate whether the average IQ of university students is greater than 100. A sample of
50 university students was selected, and their average IQ was found to be 105. The researcher assumes that IQs
are normally distributed, and previous studies suggest that the standard deviation of IQs among students is
approximately 20.
Is there sufficient evidence to conclude that the average IQ of university students is greater than 100, using a
significance level of 0.05?
Activity 2
A researcher wants to assess whether the standard deviation of the heights of 10-year-old children is equal to
3cm. The researcher randomly selects a sample of 5 heights (in cm): 124, 122, 130, 125, and 132. What
statistical test can be conducted to determine if the standard deviation of heights is equal to 3cm?
Activity 3
In a one-year mortality investigation, 60 out of 300 ninety-year-olds present at the start of the investigation died
before the end of the year. The investigation assumes that the number of deaths follows a binomial distribution
with parameters N = 300 and q = 0.25. What conclusion can be drawn from this data regarding the mortality rate
for this age?
Activity 4
A new gene has been identified that makes individuals particularly susceptible to a specific food allergy. In a
random sample of 100 children from a certain region, 15 were found to be carriers of the gene. Test whether the
proportion of children in that region carrying the gene is greater than 20%.
Null Hypothesis (H0): The proportion of children carrying the gene in the region is equal to or less than 20%.
Alternative Hypothesis (Ha): The proportion of children carrying the gene in the region is greater than 20%.
You are conducting a hypothesis test using the provided information with a significance level of 0.05. What can
you conclude based on the test results regarding the alternative hypothesis?

Experiment Design
Experiment Design:
Experimental design in statistics refers to the process of planning and conducting experiments to answer
research questions or test hypotheses.
Let's say you're a scientist studying the effects of a new fertilizer on the growth of tomato plants. To conduct
your experiment, you need to design a plan that controls for variables that could affect your results, like the
amount of water and sunlight each plant receives.
First, you decide to divide your tomato plants into two groups: an experimental group that will receive the new
fertilizer and a control group that won't receive any fertilizer.
Next, you randomly assign each plant to one of the two groups to avoid any bias in the results.
You also decide to measure the height of each plant at regular intervals over the course of the experiment and
record the data in a spreadsheet.
After the experiment is complete, you can use inferential statistics to analyze your data and determine if there
was a significant difference in the growth of the two groups of tomato plants.
By carefully designing your experiment and controlling for variables, you can draw accurate conclusions about
the effects of the new fertilizer on tomato plant growth.
Overall, experimental design is a crucial step in statistical analysis as it allows researchers to control the
variables and obtain meaningful results that can be used to make informed decisions.

Types of Experimental Design:


Type of Design | Category | Examples
Randomized designs | Designs where subjects are randomly assigned to groups | Post-test only design, pretest-post-test only design, Solomon four group design, factorial design
Quasi-experimental designs | Designs where subjects are not randomly assigned to groups | Non-equivalent control group designs, time series design
Pre-experimental designs | Designs with no control group and no pretest | One shot case design, one group pretest post test design

Randomized designs
 In this type of experimental design, the subjects are randomly assigned to different treatment groups.
 One example of a randomized design is a clinical trial for a new medication, where patients are randomly
assigned to either the treatment group (receiving the new medication) or the control group (receiving a
placebo).
Post-test only design

In this design, subjects are randomly assigned to either the treatment or control group and are measured on the
dependent variable after the treatment is given.

An example of this could be a study measuring the effectiveness of a new exercise program by comparing the
fitness levels of participants who completed the program with those who did not.

Formula:
The formula for the post-test only design is Y=X+ε, Where:
 Y = the dependent variable measured after the intervention
 X = the intervention or treatment
 ε = the error term
Example:
Suppose we want to evaluate the effectiveness of a new pain medication. We randomly select 100 patients with
chronic pain and divide them into two groups: treatment and control. The treatment group receives the new
medication while the control group receives a placebo. After two weeks, we measure the level of pain using a
pain scale from 1 to 10.
The data we collected is as follows:
Treatment Group: Y1=X1+ε1
Control Group: Y2=X2+ε2
Suppose we found that the average pain score for the treatment group is 4.5, and the average pain score for the
control group is 6.5. Using the formula above, we can test whether there is a significant difference between the
two groups:
Y1−Y2=(X1+ε1)−(X2+ε2)
Y1−Y2=X1−X2+(ε1−ε2)
If the difference between the two groups is statistically significant, we can conclude that the new medication is
effective in reducing chronic pain.

Pretest-post-test only design

This design is similar to the post-test only design, but with the additional step of measuring the dependent
variable before the treatment is given.
An example of this could be a study measuring the effect of a new teaching method on student test scores by
comparing their scores before and after the new method is implemented.
Formula:
Treatment effect = (Posttest score of treatment group - Pretest score of treatment group) - (Posttest score of
control group - Pretest score of control group)
Example:
Suppose a researcher wants to test the effectiveness of a new teaching method on students' test scores. They
randomly assign half of the students to receive the new teaching method (treatment group) and half to receive
the traditional teaching method (control group). Before the teaching methods are implemented, all students take
a pretest. After the teaching methods are implemented, all students take a posttest. The results are as follows:
Treatment group: Pretest mean = 65, Posttest mean = 85
Control group: Pretest mean = 62, Posttest mean = 75
Using the formula, the treatment effect can be calculated as:
Treatment effect = (85 - 65) - (75 - 62) = 20 - 13 = 7
Therefore, the new teaching method resulted in a treatment effect of 7 points on the posttest scores.
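The treatment-effect formula is easy to wrap in a small helper function (a sketch):

def treatment_effect(pre_treat, post_treat, pre_control, post_control):
    # difference-in-differences style effect for a pretest-post-test design
    return (post_treat - pre_treat) - (post_control - pre_control)

# teaching-method example from the text
print(treatment_effect(65, 85, 62, 75))   # 7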

Activity :
A researcher conducts a study to evaluate the effectiveness of a new exercise program on weight loss. The study
includes two groups: the exercise group and the control group. The researcher measures the participants' weights
before and after the program. The results are as follows:
Exercise group: Pretest mean = 75 kg, Posttest mean = 70 kg
Control group: Pretest mean = 73 kg, Posttest mean = 72 kg
Using the treatment effect formula, calculate the treatment effect of the exercise program on weight loss.
A) 1 kg
B) -2 kg
C) 3 kg
D) -5 kg

Solomon four group design:


Solomon four group design is a type of experimental design that combines the post-test only and pretest-post-
test only designs to increase the reliability of the experiment. In this design, there are four groups:
 Group 1 receives the treatment and takes the post-test
 Group 2 receives the treatment and takes the pretest and post-test
 Group 3 takes the pretest and post-test but does not receive the treatment
 Group 4 does not receive the treatment and only takes the post-test
Formula:
Y1=μ+τ+η1+ϵ1
Y2=μ+τ+η2+ϵ2
Y3=μ+η3+ϵ3
Y4=μ+η4+ϵ4, Where:
 Y1 = the score of Group 1 (treatment and post-test)
 Y2 = the score of Group 2 (treatment, pretest, and post-test)
 Y3 = the score of Group 3 (pretest and post-test only)
 Y4 = the score of Group 4 (no treatment and post-test only)
 μ = the overall mean score of all the groups
 τ = the effect of the treatment
 η1,η2,η3,η4 = the random error for each group
 ϵ1,ϵ2,ϵ3,ϵ4 = the residual error for each group

Example:
A company wants to test a new training program to see if it improves employees' productivity. They randomly
assign their employees to one of the four groups. The scores of the groups are as follows:
Group 1 (treatment and post-test): 80
Group 2 (treatment, pretest, and post-test): 75
Group 3 (pretest and post-test only): 70
Group 4 (no treatment and post-test only): 72
The overall mean score of all the groups is (80 + 75 + 70 + 72) / 4 = 74.25.
The effect of the treatment (τ) is calculated as follows:
τ = [(Y1 − μ) − (Y2 − μ)] − [(Y4 − μ) − (Y3 − μ)]
= (80 − 74.25) − (75 − 74.25) − [(72 − 74.25) − (70 − 74.25)]
= 5.75 − 0.75 − (−2.25 + 4.25) = 5 − 2 = 3
The result suggests that the new training program has a positive effect on employees' productivity, with an
estimated effect size of 3.
Factorial Design
Factorial design is a type of experimental design that involves the manipulation of two or more independent
variables to study their effects on the dependent variable. In other words, it is a design that looks at how
different levels of multiple factors (variables) affect the outcome.
Formula:
Y=μ+A+B+AB+e, where

 Y = the dependent variable


 μ = the overall mean of the dependent variable
 A = the first independent variable
 B = the second independent variable
 AB = the interaction effect between the first and second independent variables
 e = the error term
Example:
Suppose a company wants to test the effectiveness of two advertising strategies, A and B, on the sales of their
product. They also want to test whether the effectiveness of the two strategies is different in different regions of
the country. They decide to use a 2 x 2 factorial design, with A and B as the two factors, and region (East vs
West) as a blocking variable. They randomly assign their customers to one of the four groups as follows:
 Group 1: Strategy A in East
 Group 2: Strategy A in West
 Group 3: Strategy B in East
 Group 4: Strategy B in West
They collect data on the sales of the product for each group and use the formula Y = μ + A + B + AB + e to
analyze the data, where:
 Y = Sales of the product
 μ = The overall mean sales of the product
 A = Effect of advertising strategy A
 B = Effect of advertising strategy B
 AB = Interaction effect between A and B
 e = Error term
They obtain the following results:
Mean sales in Group 1: 100
Mean sales in Group 2: 120
Mean sales in Group 3: 110
Mean sales in Group 4: 130
Using these cell means, they can estimate the effects as follows:
Effect of strategy A = (Group 1 sales + Group 2 sales) / 2 - μ = (100 + 120) / 2 - 115 = -5
Effect of strategy B = (Group 3 sales + Group 4 sales) / 2 - μ = (110 + 130) / 2 - 115 = +5
Interaction (strategy × region) = (Group 4 sales - Group 3 sales) - (Group 2 sales - Group 1 sales) = (130 - 110) - (120 - 100) = 0
From these results, they can conclude that:
 Sales under advertising strategy B average about 5 units above the overall mean, while sales under strategy A average about 5 units below it, so strategy B appears more effective.
 The interaction effect between strategy and region is zero, so the relative effectiveness of the two strategies is the same in the East and West regions (even though overall sales are higher in the West).
These conclusions can help the company make more informed decisions about their advertising strategies in
different regions of the country.
Quasi-experimental designs
This type of design is used when it is not possible to randomly assign subjects to groups. Instead, groups are
chosen based on some pre-existing characteristic.
An example of a quasi-experimental design is studying the effect of a new teaching method on student
performance by comparing two different schools, one using the new method and the other using the traditional
method.

Non-equivalent control group designs


Non-equivalent control group design is a type of quasi-experimental design where subjects are not randomly
assigned to groups, and the groups are not initially equivalent. In this design, a treatment group and a control
group are used, but they are not equivalent at the start of the study.
Formula:
YT=a+b1X1+b2X2+b3X3+b4X4+e Where:

 YT is the post-test score of the treatment group


 a is the constant or intercept
 b1−b4 are the regression coefficients for the independent variables X1-X4
 X1−X4 are the independent variables or predictors
 e is the error term
Example:
Suppose a school district implements a new reading program in half of its schools, while the other half do not
use the new program. To evaluate the effectiveness of the new program, a test is given to students in both
groups before and after the program is implemented.
The post-test scores for the treatment group and the control group are recorded as follows:
 Treatment group: 75, 80, 85, 90
 Control group: 70, 75, 80, 85
Using the non-equivalent control group design formula, we can calculate the effectiveness of the reading
program as follows:
YT=a+b1X1+e
Where:
 YT is the post-test score of the treatment group
 X1 is the independent variable, which represents whether the school used the new reading program or
not (0 = no, 1 = yes)
 a is the intercept, which represents the mean post-test score of the control group
 b1 is the difference in mean post-test scores between the treatment group and the control group
Plugging in the numbers (treatment group mean = 82.5, control group mean = 77.5), we get:
YT = 77.5 + (82.5 - 77.5)X1 + e
YT = 77.5 + 5X1 + e
The coefficient 5 indicates that the treatment group scored, on average, 5 points higher than the control
group on the post-test.

Time series design


Time series design is a type of experimental design in which the same group is measured multiple times over a
period of time. It is commonly used in economics, finance, and other fields to track changes over time.
Formula:
Y1,Y2,Y3,...Yn=μ+T+e, Where:
Y1,Y2,Y3,...Yn are the dependent variables measured at different time points
μ is the mean of the dependent variable
T is the trend or effect of time on the dependent variable
e is the random error
Example:
A company wants to study the effectiveness of their marketing campaign over the last 12 months. They measure
the sales of their product every month for a year. The dependent variable is the sales of the product, and the
independent variable is time. They can use the time series design to analyze the data and see if there is a trend in
the sales over the 12-month period.
In the given problem, we have:
 Dependent variable: Sales of the product
 Independent variable: Time (measured in months)
Using this formula, we can analyze the data collected by the company to see if there is a trend in the sales over
the 12-month period.
Let's assume that the company collected data on sales every month for a year and recorded the following values:
Y1=100,Y2=120,Y3=140,Y4=130,Y5=150,Y6=170,Y7=180,Y8=190,Y9=200,Y10=210,Y11=230,Y12=
240
To apply the formula, we first need to find the overall mean of the dependent variable (μ):
μ = (Y1 + Y2 + Y3 + ... + Y12) / 12
μ = (100 + 120 + 140 + ... + 240) / 12 = 2060 / 12 ≈ 171.7
Next, we need to calculate the trend over time (T). One way to do this is by fitting a linear regression line to the
data. We can use statistical software to do this, and the output will give us the slope of the line (which is the
trend) and the error term.
Assuming that the linear regression model is a good fit for the data, let's say we obtain the following equation:
Sales = 95 + 8.5*time + error
where Sales is the dependent variable (Y), time is the independent variable (T), 95 is the intercept (the expected
value of sales when time is zero), and 8.5 is the slope (the increase in sales per month).
Using this equation, we can estimate the sales for each month:
Y1 = 95 + 8.5*1 + error1
Y2 = 95 + 8.5*2 + error2
Y3 = 95 + 8.5*3 + error3
:
:
Y12 = 95 + 8.5*12 + error12
We can then compare the observed values (Y1 to Y12) with the estimated values to see if there is a trend in the
sales over time. If the estimated values follow a similar pattern to the observed values, we can conclude that
there is a trend in the sales. If there is no pattern or the pattern is different, we cannot conclude that there is a
trend.
This is an example of how the time series design formula can be used to analyze data and identify trends over
time.
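The trend T can also be estimated directly from the monthly data. Below is a minimal sketch using NumPy's polyfit; note that the equation quoted above (95 + 8.5*time) was only illustrative, and fitting the twelve observed values gives a somewhat steeper slope.

import numpy as np

months = np.arange(1, 13)
sales = np.array([100, 120, 140, 130, 150, 170, 180, 190, 200, 210, 230, 240])

# Fit a straight trend line: sales = intercept + slope * month + error.
slope, intercept = np.polyfit(months, sales, deg=1)
fitted = intercept + slope * months
residuals = sales - fitted            # the error term e for each month

print(f"Sales = {intercept:.1f} + {slope:.1f} * month")  # roughly 92.1 + 12.2 * month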
Pre-experimental designs
This type of design is the most basic and involves only one group.
An example of a pre-experimental design is when a company introduces a new training program for their
employees and compares their performance before and after the training.

One-shot case design
One-shot case design is a pre-experimental design that involves measuring the effect of an independent variable
on a dependent variable, where a single group of subjects is tested after the introduction of an independent
variable. This design lacks a control group, and there is no pre-test to determine if the groups are equivalent.
Therefore, it can be difficult to establish causality.
Formula: Y=X+e, where:
 Y is the dependent variable,
 X is the independent variable,
 e is the error term.

One group pretest posttest design
One group pretest posttest design is a type of experimental design in which only one group of subjects is used,
and they are tested twice - once before a treatment or intervention (pretest) and once after (posttest). This design
is used to measure the change in the dependent variable that is a result of the treatment or intervention.
Formula: ΔY=Y2−Y1, where:
 ∆Y represents the change in the dependent variable,
 Y2 is the posttest score,
 Y1 is the pretest score.
Formula summary:

Experimental Design / Formula
 Post-test only design: Treatment effect = Y1 − Y2
 Pretest-post-test only design: Treatment effect = (Y3 − Y4) − (Y1 − Y2)
 Solomon four group design: Treatment effect = (Y3 − Y4) − (Y1 − Y2)
 Factorial design: Treatment effect = main effect A + main effect B + interaction AB
 Randomized block design: Treatment effect = (Y1 − Y2) − (Y3 − Y4)
 Repeated measures design: Treatment effect = Y1 − Y2
 Non-equivalent control group designs: Treatment effect = (Y1 − Y3) − (Y2 − Y4)
 Time series design: Treatment effect = Y1 − Y2
 Counterbalanced designs: Treatment effect = (Y1 − Y2) − (Y3 − Y4)
 One-shot case design: Treatment effect = Y1 − Y2
 One group pretest-posttest design: Treatment effect = Y2 − Y1
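For convenience, a couple of these formulas are written out as small Python helpers below; the function names and the numbers used to call them are illustrative.

def posttest_only_effect(y1_treatment_post, y2_control_post):
    # Post-test only design: treatment effect = Y1 - Y2.
    return y1_treatment_post - y2_control_post

def one_group_pretest_posttest_effect(y1_pretest, y2_posttest):
    # One group pretest-posttest design: change score = Y2 - Y1.
    return y2_posttest - y1_pretest

print(posttest_only_effect(35, 30))                 # 5
print(one_group_pretest_posttest_effect(60, 75))    # 15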


Hypothesis Testing:
Hypothesis testing is commonly applied in experimental design to determine whether there is a significant
difference between two groups or conditions. The process involves setting up a null hypothesis (H0) and an
alternative hypothesis (Ha). The null hypothesis assumes that there is no significant difference between the
groups, while the alternative hypothesis assumes that there is a significant difference.
Steps to apply Hypothesis testing in Experimental design
 Step - 1: Formulate the null and alternative hypotheses
 Step - 2: Select a level of significance (alpha level)
 Step - 3: Choose an appropriate statistical test
 Step - 4: Collect and analyze the data
 Step - 5: Calculate the test statistic and p-value
 Step - 6: Make a decision to reject or fail to reject the null hypothesis

Example:
A new fertilizer has been developed for a specific type of plant, and a group of researchers wants to test whether
this fertilizer has a significant effect on plant growth compared to the standard fertilizer. They plan to conduct a
randomized controlled trial, where they randomly assign 50 plants to either the new fertilizer group or the
standard fertilizer group.

Formulate the null and alternative hypotheses:
Null hypothesis: The mean growth of plants treated with the new fertilizer is not significantly different from the
mean growth of plants treated with the standard fertilizer.
Alternative hypothesis:
The mean growth of plants treated with the new fertilizer is significantly different from the mean growth of
plants treated with the standard fertilizer.

Select a level of significance (alpha level):
Let's assume the researchers choose a significance level of 0.05.

Choose an appropriate statistical test:
Since there are two independent groups (new fertilizer group and standard fertilizer group) and we are
comparing their means, we can use a two-sample t-test.

Collect and analyze the data:
After 4 weeks of treatment, the researchers measure the height of each plant in both groups. They then calculate
the mean and standard deviation of each group.
New fertilizer group: mean = 35 cm, standard deviation = 4 cm
Standard fertilizer group: mean = 30 cm, standard deviation = 3 cm
Calculate the test statistic and p-value:
Using a two-sample t-test with a significance level of 0.05, we get the following results:
t-statistic = 2.33, p-value = 0.023

Make a decision to reject or fail to reject the null hypothesis:
Since the p-value (0.023) is less than the significance level (0.05), we reject the null hypothesis and conclude
that the mean growth of plants treated with the new fertilizer is significantly different from the mean growth of
plants treated with the standard fertilizer.
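For reference, here is a hedged sketch of how such a two-sample t-test could be run from summary statistics with SciPy. The per-group sample size is an assumption (the text only states that 50 plants were assigned in total), so the computed t-statistic and p-value will not necessarily match the illustrative figures quoted above.

from scipy import stats

n_per_group = 25                    # assumed: 50 plants split evenly between groups
new_mean, new_sd = 35, 4            # new fertilizer group
std_mean, std_sd = 30, 3            # standard fertilizer group

# Two-sample (Welch) t-test computed directly from the summary statistics.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=new_mean, std1=new_sd, nobs1=n_per_group,
    mean2=std_mean, std2=std_sd, nobs2=n_per_group,
    equal_var=False,
)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")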

Pros:
Provides a standardized approach to analyzing data
Allows for objective decision-making based on statistical evidence
Provides a measure of confidence in the results
Cons:
Can be affected by sample size, outliers, and other factors that may not be relevant to the research question
May not provide a complete picture of the data and may oversimplify complex relationships between variables
May be influenced by researcher bias or other factors that are difficult to control for.

Sampling Methods:
Sampling is the process of selecting a representative subset of the population for study in order to make
inferences about the entire population. In experimental design, sampling refers to the process of selecting
individuals or units from the population to be included in the study.

Types of sampling Methods:
There are two main types of sampling methods:
 Probability sampling
 Non-probability sampling.

Probability Sampling:
Probability sampling involves selecting individuals or units from the population at random.
Example:
simple random sampling -> Involves selecting individuals or units at random from the population.
stratified random sampling -> Involves dividing the population into strata based on some characteristic and then
randomly selecting individuals or units from each stratum.
cluster sampling -> Involves dividing the population into clusters and then randomly selecting entire clusters for
inclusion in the study.

Systematic Sampling -> involves selecting every nth item from a population. The value of n is determined by
dividing the population size by the desired sample size.

Non-Probability Sampling:
Non-probability sampling involves selecting individuals or units based on some non-random criterion.
Example:
 convenience sampling -> Involves selecting individuals or units based on their availability and
accessibility.
 purposive sampling -> Involves selecting individuals or units based on some specific criterion, such as
age, gender, or occupation.
 snowball sampling -> Involves selecting individuals or units based on referrals from other individuals
or units already included in the study.

Sampling Method / Formula / Example
 Simple Random Sampling: sampling fraction = n / N. Example: Suppose there are 100 students in a school and you want to select a sample of 20 students to survey. You can assign each student a number from 1 to 100, and use a random number generator to select 20 students.
 Stratified sampling: nk = (Nk / N) * n (proportional allocation). Example: Suppose you want to survey the student body of a school, but you know that the school has 60% female students and 40% male students. You can use stratified sampling to ensure that your sample reflects the gender distribution of the school. First, divide the population into two strata: female and male. Then, randomly select a proportionate number of students from each stratum.
 Systematic sampling: K = N / n (sampling interval). Example: Suppose you want to survey every 10th person in a line. You can use systematic sampling by selecting a random starting point between 1 and 10, and then selecting every 10th person from there on.
 Cluster sampling: overall sampling fraction = (1/m) Σ (ni / Ni), summed over the m clusters sampled from the M clusters in the population. Example: Suppose you want to survey the employees of a large company, but it is difficult to reach every employee. You can use cluster sampling by dividing the company into clusters (e.g. departments), and randomly selecting a proportionate number of clusters to survey. Then, you can survey all the employees in the selected clusters.
Note:
 n represents the sample size
 N represents the population size
 nk represents the sample size for the kth stratum
 Nk represents the population size for the kth stratum
 M represents the number of clusters
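The probability sampling schemes above can be sketched in a few lines of Python with the standard library's random module. The population and strata below are hypothetical and only meant to illustrate the formulas (n/N, nk = n * Nk / N, and K = N / n).

import random

random.seed(42)                               # reproducible illustration

population = list(range(1, 101))              # e.g. 100 students numbered 1..100
n = 20                                        # desired sample size

# Simple random sampling: every unit has the same chance of selection (n/N).
srs_sample = random.sample(population, n)

# Systematic sampling: random start, then every K-th unit, where K = N / n.
K = len(population) // n
start = random.randint(0, K - 1)
systematic_sample = population[start::K]

# Stratified sampling with proportional allocation: nk = n * Nk / N.
strata = {"female": list(range(1, 61)), "male": list(range(61, 101))}   # 60% / 40%
stratified_sample = []
for members in strata.values():
    nk = round(n * len(members) / len(population))
    stratified_sample += random.sample(members, nk)

print(len(srs_sample), len(systematic_sample), len(stratified_sample))  # 20 20 20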

Recipe for Types of Non-Probability Sampling:

Sampling Method / Formula / Example
 Convenience Sampling: no formula (N/A). Example: Conducting a survey on the street by asking passersby to participate.
 Quota Sampling: n = (Nstratum / N) * ntotal. Example: Conducting a survey in a university by quota sampling 100 students, with 30 from the School of Business, 40 from the School of Arts, and 30 from the School of Sciences.
 Purposive sampling: no formula (N/A). Example: Conducting a study on the experiences of cancer patients by selecting participants who have been diagnosed with cancer.
 Snowball Sampling: no formula (N/A). Example: Conducting a study on homeless youth by recruiting participants through referrals from other homeless youth.
 Judgmental Sampling: no formula (N/A). Example: Conducting a study on the effectiveness of a new teaching method by selecting a group of teachers who have been recognized for their exceptional teaching skills.
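The quota sampling formula can be illustrated with the university example. The school sizes below are hypothetical, but they are chosen so that the proportional quotas come out to the 30/40/30 split described above.

# Quota allocation: n_stratum = (N_stratum / N) * n_total.
school_sizes = {"Business": 1500, "Arts": 2000, "Sciences": 1500}   # hypothetical N_stratum
N = sum(school_sizes.values())
n_total = 100

quotas = {school: round(size / N * n_total) for school, size in school_sizes.items()}
print(quotas)   # {'Business': 30, 'Arts': 40, 'Sciences': 30}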
Power Analysis:
Power analysis is a statistical technique used in experimental design to determine the sample size needed to
detect an effect of a certain magnitude with a given level of confidence.
It helps researchers determine the statistical power of their study, which is the probability of correctly rejecting
the null hypothesis when the alternative hypothesis is true.
Formula: Power=1−β
where:
 Power is the statistical power
 β is the probability of committing a Type II error (false negative), which is failing to reject the null
hypothesis when the alternative hypothesis is true.
Example:
Let's say a researcher wants to investigate the effect of a new teaching method on student performance. They
hypothesize that the new method will lead to a 10% improvement in test scores compared to the traditional
method. To determine the sample size needed for adequate power, the researcher conducts a power analysis.

 Based on previous studies and preliminary data, the researcher estimates the standard deviation of test
scores to be 12.
 Using a desired power of 0.80 and a significance level of 0.05, the researcher performs a power
analysis and determines that a sample size of 100 participants is needed to detect a 10% difference in
test scores with sufficient power.
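A power analysis like this can be sketched with statsmodels. The standardized effect size below is an assumption: if the 10% improvement corresponds to a mean difference of about 6 points, then with a standard deviation of 12 the effect size is d = 0.5. The required sample size changes quickly as this assumption changes, which is why the result here need not match the figure of 100 participants quoted above.

from statsmodels.stats.power import TTestIndPower

effect_size = 6 / 12        # assumed Cohen's d (hypothetical mean difference 6, sd 12)
alpha = 0.05                # significance level
power = 0.80                # desired power

n_per_group = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f"Required sample size per group: {n_per_group:.1f}")   # about 64 per group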

Pros:
Helps researchers determine an appropriate sample size, ensuring adequate statistical power.
Enhances the reliability and validity of research findings by minimizing the risk of Type II errors.
Allows researchers to optimize the allocation of resources, such as time and funding, by focusing on studies that
have a high likelihood of detecting an effect.

Cons:
Power analysis requires assumptions about effect size, variability, and significance level, which may not always
accurately reflect the true population parameters.
Power analysis is based on statistical assumptions and can be sensitive to deviations from those assumptions.
Power analysis does not guarantee that a study will detect an effect if it truly exists, as other factors (e.g., study
design, measurement error) can also influence the results.
Ethical considerations:
Ethical considerations play a crucial role in experimental design as they ensure that the rights and well-being of
participants are protected throughout the study.
Let's explore ethical considerations in experimental design, along with an example and their pros and cons.
Informed Consent:
Researchers must obtain informed consent from participants, ensuring they have a clear understanding of the
study's purpose, procedures, risks, and benefits before voluntarily agreeing to participate. This can be obtained
through written consent forms or verbal agreements.
Example:
Before conducting a study on the effects of a new medication, researchers should inform participants about the
potential side effects, benefits, and any alternative treatments available. Participants can then provide informed
consent to participate in the study.
Pros:
 Protects participants' autonomy and right to make decisions about their involvement in research.
 Ensures transparency and fosters trust between researchers and participants.
Cons:
 Obtaining informed consent may introduce selection bias if certain individuals choose not to participate
based on the study's nature or requirements.

Privacy and Confidentiality:
Researchers must ensure the privacy and confidentiality of participant data. Personal identifying information
should be kept confidential and stored securely to prevent unauthorized access or disclosure.
Example:
In a study examining sensitive topics such as mental health, researchers should implement strict confidentiality
measures to safeguard participants' personal information and ensure their privacy.
Pros:
 Protects participants' privacy and fosters trust in the research process.
 Encourages participants to provide honest and accurate information.
Cons:
 Maintaining confidentiality can be challenging, especially when dealing with electronic data storage
and potential breaches of security.

Minimization of Harm:
Researchers should minimize any potential physical or psychological harm to participants. They should
carefully assess the risks and benefits of the study and take appropriate measures to protect participants' well-
being.
Example:
In an experimental study involving medical interventions, researchers must ensure that potential risks to
participants' health are minimized, and proper medical supervision is provided throughout the study.
Pros:
 Prioritizes participant safety and well-being. Demonstrates ethical responsibility towards the
participants.
Cons:
 It can be challenging to anticipate and mitigate all potential risks, especially in complex research
designs or when studying vulnerable populations.

Debriefing and Participant Welfare:
Researchers should provide debriefing sessions to participants after the study, explaining the purpose,
procedures, and results. They should address any questions or concerns raised by participants and ensure their
emotional well-being during and after the study.
Example:
After a psychological study involving exposure to stressful stimuli, researchers should conduct a debriefing
session to explain the study's purpose, reassure participants, and offer appropriate resources or referrals if
necessary.
Pros:
 Allows participants to gain a better understanding of the study and its implications.
 Provides an opportunity for researchers to address any potential negative emotions or concerns.
Cons:
 Some participants may experience emotional distress as a result of their involvement in the study, even
with appropriate debriefing.

Activity 1:
A group of researchers is studying the effect of a new exercise routine on weight loss compared to the standard
routine. They randomly assign 60 participants to either the new exercise group or the standard exercise group.
After 8 weeks of treatment, they record the weight of each participant and calculate the mean and standard
deviation for both groups.
New exercise group: mean = 10 kg, standard deviation = 2 kg Standard exercise group: mean = 8 kg, standard
deviation = 3 kg
Answer the following questions based on the given information:
Question 1: What is the alternative hypothesis in the study comparing weight loss between two exercise groups?
Question 2: Based on the calculated p-value of 0.012, what should be done with the null hypothesis?
Question 3: If the researchers had chosen a significance level of 0.01 instead of 0.05, what decision would they
make regarding the null hypothesis?
Answer to Question 3: If the researchers had chosen a significance level of 0.01, the decision regarding the null
hypothesis would be to fail to reject the null hypothesis, since the calculated p-value (0.012) is greater than the
significance level (0.01).

Activity 2:
A researcher wants to determine whether the average height of a sample of N individuals is greater than the
average height of the general population, which is μ . The heights are assumed to be normally distributed, and it
is known from previous studies that the standard deviation of heights among individuals is approximately σ .
Write a Python function that performs a statistical test to determine whether the average height of the sample is
greater than μ , given the sample size (N) , the sample mean (x¯) , the population mean (μ) , the population
standard deviation (σ) , and the significance level. The function should return True if the null hypothesis can be
rejected (indicating that the average height of the sample is indeed greater than μ ), and False otherwise. Use
only the functions imported above to perform the tests. To check the function, you can use the following input
combinations:
N = 100, x¯ = 175, μ = 170, σ = 5, significance level = 0.05
N = 50, x¯ = 163, μ = 160, σ = 4, significance level = 0.01
N = 200, x¯ = 180, μ = 175, σ = 6, significance level = 0.05

Activity 3:
Problem:
A researcher wants to evaluate the effectiveness of a new exercise program on improving cardiovascular fitness.
They conduct a pretest-post-test only design study where they measure the cardiovascular fitness of two groups:
the treatment group, which participates in the exercise program, and the control group, which does not
participate in any specific exercise program. The researcher measures the cardiovascular fitness level of each
participant before and after the study.
The following table shows the pretest and post-test scores for both the treatment and control groups:
Group - Pretest Score, Post-test Score
 Treatment - 60, 75
 Control- 62, 65
Using the formula: Treatment effect = (Posttest score of treatment group - Pretest score of treatment group) -
(Posttest score of control group - Pretest score of control group)
Design a Python function to calculate the treatment effect based on the pretest-post-test only design. The
function should take the pretest and post-test scores of both the treatment and control groups as input and output
the treatment effect.
Use the provided data to check the function output.

Activity 4:
A study was conducted to investigate the effect of a new drug on the cure rates of a disease. The study enrolled
200 patients, out of which 100 were randomly assigned to receive the drug and 100 were assigned to the control
group. After the treatment, it was found that 60 patients in the drug group were cured, while only 40 patients in
the control group were cured.
Write a Python function to test whether the cure rates of the two groups are significantly different. Assume that
the cure rates follow binomial distributions and the null hypothesis is that the cure rates in both groups are
equal.
Formally:
H0: P(drug) = P(control)
Ha: P(drug) ≠ P(control)
The function should take in the sample sizes and cure rates of the two groups as arguments and output the p-
value for the test. Use a significance level of 0.05.
