Project Report
Harshith Pendela
Declaration
I, Harshith Pendela, hereby declare that the project report titled "Probability and Statistics" is the result of my own independent work. I have dedicated considerable time and effort to studying Probability and Statistics from some of the most widely used and respected resources, covering the fundamental concepts, methodologies, and practical applications of the subject. This study has enabled me to analyze, interpret, and present the information contained in this report. I have ensured that all the work presented is original and based on my own comprehension of the material. By signing this declaration, I affirm the authenticity and integrity of the work submitted in this project.
Signed
Contents

1 Statistical Inference
  1.1 Introduction
2 Random Processes
  2.1 Introduction
  2.5.1 Introduction

List of Figures

2.2 LTI-System
Chapter 1
Statistical Inference
1.1 Introduction
Statistical inference is the study of methods for drawing conclusions from data that are subject to random variation. In our daily lives, most of the data that we study is prone to such variation. In statistical inference, we would like to estimate an unknown quantity from the data that we are provided with, and based on our approach, we have two types of inference:
1. Frequentist (classical) inference: the unknown quantity θ is assumed to be a fixed, non-random quantity.
2. Bayesian inference: the unknown quantity is assumed to be a random variable Θ, and we assume that we initially know something about the distribution of Θ. After observing the data, we update the distribution of Θ using Bayes' rule.
When we do sampling, we prefer sampling with replacement to sampling without replacement, because with replacement all the samples are independent; in a large population the two are almost the same, since the probability of choosing the same item twice is very small.
The collection of random variables X₁, X₂, X₃, …, Xₙ is said to be a simple random sample of size n if:
1. The Xᵢ are independent of each other.
2. They are identically distributed, i.e., they have the same distribution:
F_X₁(x) = F_X₂(x) = ⋯ = F_Xₙ(x), for all x ∈ ℝ.
Together, such X₁, X₂, X₃, …, Xₙ are i.i.d. Assuming they come from a distribution with mean μ and variance σ², the sample mean X̄ satisfies:
1. E[X̄] = μ
2. Var(X̄) = σ²/n
Order Statistics
When we arrange a random sample X₁, X₂, X₃, …, Xₙ from the smallest to the largest, the i-th value in this ordering is a random variable denoted X₍ᵢ₎, the i-th order statistic.
The probability density function (PDF) of the i-th order statistic X₍ᵢ₎ is given by:
f_X₍ᵢ₎(x) = n! / ((i − 1)!(n − i)!) · fX(x) [FX(x)]^(i−1) [1 − FX(x)]^(n−i)
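The formula above can be checked numerically. The sketch below (written for this report, not part of the original text) evaluates it for the maximum of n i.i.d. Uniform(0,1) variables, where the known answer is n·x^(n−1):

```python
from math import factorial

def order_stat_pdf(x, i, n, f, F):
    """PDF of the i-th order statistic of n i.i.d. samples with PDF f and CDF F."""
    coef = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return coef * f(x) * F(x) ** (i - 1) * (1 - F(x)) ** (n - i)

# Uniform(0, 1): f(x) = 1, F(x) = x on [0, 1]
n, x = 3, 0.5
# Maximum (i = n): the formula reduces to n * x**(n - 1) = 0.75 here
val = order_stat_pdf(x, n, n, lambda t: 1.0, lambda t: t)
```

For the minimum (i = 1) the same function reduces to n(1 − x)^(n−1), which also equals 0.75 at x = 0.5 for n = 3.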
The cumulative distribution function (CDF) of the i-th order statistic X₍ᵢ₎ is given by:
F_X₍ᵢ₎(x) = Σ_{k=i}^{n} C(n, k) [FX(x)]^k [1 − FX(x)]^(n−k)
The joint PDF of the order statistics X₍₁₎, X₍₂₎, …, X₍ₙ₎ is given by:
f_{X₍₁₎,…,X₍ₙ₎}(x₁, x₂, …, xₙ) = n! fX(x₁) fX(x₂) ⋯ fX(xₙ) for x₁ ≤ x₂ ≤ ⋯ ≤ xₙ, and 0 otherwise.
1.2 Point Estimation
A point estimator is a function of random variables that is used to estimate the unknown quantity. The bias of a point estimator Θ̂ of θ is
B(Θ̂) = E[Θ̂] − θ.
We would like the bias to be as close to zero as possible; when it is 0 for all values of θ, the estimator is said to be unbiased.
Mean Squared Error: The mean squared error (MSE) is an indicator of estimator quality, i.e., a lower MSE means a better estimator. If Θ̂ is a point estimator for θ, the MSE is given by:
MSE(Θ̂) = E[(Θ̂ − θ)²].
Consistency: Let Θ̂₁, Θ̂₂, …, Θ̂ₙ, … be a sequence of point estimators of θ. We say that Θ̂ₙ is a consistent estimator of θ if:
lim_{n→∞} P(|Θ̂ₙ − θ| ≥ ε) = 0, for all ε > 0,
or if:
lim_{n→∞} MSE(Θ̂ₙ) = 0.
The sample variance is defined as
S² = (1/(n − 1)) Σ_{k=1}^{n} (Xₖ − X̄)² = (1/(n − 1)) ( Σ_{k=1}^{n} Xₖ² − n X̄² ).
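A quick numerical check of the two equivalent forms of S² (a small sketch written for this report):

```python
def sample_variance_direct(xs):
    """S^2 computed from squared deviations about the sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """S^2 computed from the sum-of-squares shortcut formula."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

data = [1.0, 2.0, 3.0, 4.0]
# Both forms give 5/3 for this data set.
```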
Its square root, S, is called the sample standard deviation.

1.3 Maximum Likelihood Estimation
The maximum likelihood estimate of θ is given by the value of θ for which the likelihood function L(x₁, x₂, …, xₙ; θ) is maximum. In some problems, it is easier to work with the log-likelihood function, given by ln L(x₁, x₂, …, xₙ; θ).
The maximum likelihood estimator (MLE) of θ is denoted Θ̂ML. Under certain regularity conditions, it has the following property:
• Asymptotically unbiased: as the sample size n becomes large, the expected value of Θ̂ML approaches θ:
lim_{n→∞} E[Θ̂ML] = θ.
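As an illustration (a sketch written for this report; the data are made up), the MLE of a Bernoulli parameter can be found by maximizing the log-likelihood over a grid. The analytical answer here is the sample mean, 3/5 = 0.6:

```python
from math import log

def bernoulli_log_likelihood(p, data):
    """ln L(data; p) for i.i.d. Bernoulli(p) observations."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

data = [1, 1, 0, 1, 0]                    # three successes out of five
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))
# p_hat coincides with the sample mean 0.6
```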
1.4 Interval Estimation
Instead of giving a single point estimate θ̂ of θ, we estimate a range in which the real θ may fall. An interval estimator with confidence level 1 − α consists of two estimators Θ̂l(X₁, X₂, …, Xₙ) and Θ̂h(X₁, X₂, …, Xₙ) such that
P(Θ̂l ≤ θ ≤ Θ̂h) ≥ 1 − α,
for every possible value of θ. We call [Θ̂l, Θ̂h] a (1 − α) × 100% confidence interval for θ.
Pivotal Quantity
A pivotal quantity is a random variable
Q = Q(X₁, X₂, …, Xₙ, θ)
such that:
1. Q is a function of the data X₁, …, Xₙ and the unknown parameter θ.
2. The probability distribution of Q does not depend on θ or any other unknown parameters.
By the CLT, if σ² < ∞ and n is large, X̄ is approximately N(μ, σ²/n). This gives the confidence interval
[ X̄ − z_{α/2} σ/√n , X̄ + z_{α/2} σ/√n ].
If σ is unknown, it can be replaced by an upper bound σmax or by the sample standard deviation. Here, z_{α/2} is the value from the standard normal distribution such that the probability of a standard normal variable being within ±z_{α/2} is 1 − α.
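This z-based interval can be computed with the Python standard library alone; the sketch below (function names are mine, not from the report) uses statistics.NormalDist to obtain z_{α/2}:

```python
from statistics import NormalDist, mean

def z_confidence_interval(data, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for the mean when sigma is known."""
    n = len(data)
    xbar = mean(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, ~1.96 for alpha = 0.05
    half = z * sigma / n ** 0.5
    return xbar - half, xbar + half

sample = [4.9, 5.1, 5.0, 4.8, 5.2] * 20      # n = 100, sample mean 5.0
lo, hi = z_confidence_interval(sample, sigma=1.0)
```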
Chi-Squared Distribution
If Z₁, Z₂, …, Zₙ are independent standard normal random variables, then the random variable Y defined as
Y = Z₁² + Z₂² + ⋯ + Zₙ²
has a chi-squared distribution with n degrees of freedom, written Y ∼ χ²(n).
Equivalently,
Y ∼ Gamma(n/2, 1/2),
with PDF
fY(y) = 1/(2^(n/2) Γ(n/2)) · y^(n/2 − 1) e^(−y/2), for y > 0.
For any p ∈ [0, 1] and n ∈ ℕ, we define χ²_{p,n} as the value for which
P(Y > χ²_{p,n}) = p,
where Y ∼ χ²(n).
Let X₁, X₂, …, Xₙ be i.i.d. N(μ, σ²) random variables and let S² be the sample variance. Then the random variable
Y = (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^{n} (Xᵢ − X̄)²
has a chi-squared distribution with n − 1 degrees of freedom: Y ∼ χ²(n − 1).
The t-Distribution
Let Z ∼ N(0, 1) and Y ∼ χ²(n), where n ∈ ℕ, and assume that Z and Y are independent. The random variable
T = Z / √(Y/n)
is said to have a t-distribution with n degrees of freedom, written T ∼ t(n).
Properties:
• The t-distribution has a bell-shaped PDF centered at 0, but its PDF is more spread out than the standard normal PDF.
• Var(T) = n/(n − 2) for n > 2; Var(T) is undefined for n = 1, 2.
• As n becomes large, the t-density approaches the standard normal PDF. More formally, T(n) →d N(0, 1) (convergence in distribution).
For any p ∈ [0, 1] and n ∈ ℕ, we define t_{p,n} as the real value for which
P(T > t_{p,n}) = p.
By symmetry, t_{1−p,n} = −t_{p,n}.
How does this help in the interval estimation of normal random variables? Let X₁, X₂, …, Xₙ be i.i.d. N(μ, σ²) random variables and let S² be the sample variance. Then the random variable
T = (X̄ − μ) / (S/√n)
has a t(n − 1) distribution. We use this to estimate the mean of normal random variables when we do not know σ. The resulting confidence interval is
[ X̄ − t_{α/2,n−1} S/√n , X̄ + t_{α/2,n−1} S/√n ],
where X̄ is the sample mean, S is the sample standard deviation, and t_{α/2,n−1} is the critical value of the t-distribution with n − 1 degrees of freedom.
The random variable Q defined as
Q = (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^{n} (Xᵢ − X̄)²
has a chi-squared distribution with n − 1 degrees of freedom. Q is a pivotal quantity, since it is a function of the Xᵢ's and σ², and its distribution does not depend on σ² or any other unknown parameters. Using the definition of χ²_{p,n}, a (1 − α) interval for Q can be stated as:
P( χ²_{1−α/2,n−1} ≤ Q ≤ χ²_{α/2,n−1} ) = 1 − α.
Therefore,
P( χ²_{1−α/2,n−1} ≤ (n − 1)S²/σ² ≤ χ²_{α/2,n−1} ) = 1 − α,
which is equivalent to
P( (n − 1)S²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,n−1} ) = 1 − α.
Thus,
[ (n − 1)S²/χ²_{α/2,n−1} , (n − 1)S²/χ²_{1−α/2,n−1} ]
is a (1 − α) × 100% confidence interval for σ².
Special note: the t-distribution method and the confidence interval for the variance are applicable only when the samples are drawn from a (approximately) normal distribution.
1.5 Hypothesis Testing
In hypothesis testing, we want to comment on a parameter θ based on observed data. The set of all possible values of θ is denoted by S. We partition S into two disjoint subsets S₀ and S₁, and set up the hypotheses H₀: θ ∈ S₀ (the null hypothesis) and H₁: θ ∈ S₁ (the alternative hypothesis).
A hypothesis is called simple if the subset contains only one value of θ, and composite if it contains more than one. For example, suppose
S₀ = {1/2}, S₁ = [0, 1] − {1/2}.
Here:
• H₀: θ ∈ {1/2} (simple hypothesis)
• H₁: θ ∈ [0, 1] − {1/2} (composite hypothesis)
A statistic is a function of the sample data. It is used to estimate a population parameter or to describe some aspect of the sample. For example, the sample mean
X̄ = (X₁ + X₂ + ⋯ + Xₙ)/n,
where n is the sample size, provides an estimate of the population mean.
A test statistic is a specific type of statistic that is used in the context of hypothesis testing. It
is a function of the sample data that is used to decide whether to reject the null hypothesis.
The choice of test statistic depends on the hypothesis being tested and the distribution of
the data.
For example, in a hypothesis test comparing the population mean to a specified value, the test statistic might be the sample mean, the t-statistic, or the z-statistic, depending on the sample size and variance properties. The range A in which the test statistic leads us to accept H₀ is called the acceptance region. Rejecting H₀ when H₀ is in fact true is called the Type I error; if P(Type I error) ≤ α for every θ ∈ S₀, then we say the test has significance level α, or simply that the test is a level α test. Note that it is often the case that the null hypothesis is a simple hypothesis, i.e., S₀ has only one element. The second possible error that we can make is to accept H₀ when H₀ is false. This is called the Type II error. Since the alternative hypothesis, H₁, is usually a composite hypothesis (so it includes more than one value of θ), the probability of Type II error is usually a function of θ.
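A minimal two-sided z-test can be sketched with the standard library (written for this report; it assumes a known σ):

```python
from statistics import NormalDist, mean

def z_test_two_sided(data, mu0, sigma):
    """Return (z statistic, p-value) for H0: mu = mu0 vs. H1: mu != mu0."""
    n = len(data)
    z = (mean(data) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# The sample mean here sits 3 standard errors away from mu0.
sample = [5.3] * 25
z, p = z_test_two_sided(sample, mu0=5.0, sigma=0.5)
```

With α = 0.05 the p-value (about 0.0027) leads us to reject H₀.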
Fun fact: just replace α/2 with α to obtain the corresponding one-sided results. If H₀ and H₁ are interchanged, the result is just the negative of the present values, by symmetry.
Consider the simple hypotheses:
H₀: θ = θ₀,
H₁: θ = θ₁.
The likelihood ratio is defined as
λ(x₁, x₂, …, xₙ) = L(x₁, x₂, …, xₙ; θ₀) / L(x₁, x₂, …, xₙ; θ₁),
where L(x₁, x₂, …, xₙ; θ) is the likelihood function given the data x₁, x₂, …, xₙ and parameter θ.
To perform a likelihood ratio test (LRT), we choose a constant c. We reject the null hypothesis H₀ if λ < c and accept it if λ ≥ c. The choice of the constant c depends on the significance level α, which represents the probability of rejecting the null hypothesis when it is true. When the hypotheses are not simple, we take the supremum of the likelihood function over each set of parameter values.
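A tiny numeric illustration of the likelihood ratio for Bernoulli data (a sketch written for this report; θ₀, θ₁, and the data are made up):

```python
def bernoulli_likelihood(theta, data):
    """L(data; theta) for i.i.d. Bernoulli(theta) observations."""
    L = 1.0
    for x in data:
        L *= theta if x == 1 else (1 - theta)
    return L

data = [1, 0, 1, 1]            # three heads, one tail
theta0, theta1 = 0.5, 0.8
lam = bernoulli_likelihood(theta0, data) / bernoulli_likelihood(theta1, data)
# lam = 0.5**4 / (0.8**3 * 0.2) ~ 0.61; with c = 1 we would reject H0
```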
1.6 Linear Regression
Linear regression is a method used to model the relationship between a dependent variable and one or more explanatory variables. Consider a simple linear regression model to understand the methods of finding the model parameters. We assume that the xᵢ are observed values of a random variable X. The linear regression model is written as:
Y = β₀ + β₁X + ϵ
where:
• β₀ is the intercept,
• β₁ is the slope,
• ϵ is a random error term with E[ϵ] = 0.
Taking expectations,
E[Y] = β₀ + β₁E[X].
Thus:
β₀ = E[Y] − β₁E[X].
Taking the covariance with X,
Cov(X, Y) = Cov(X, β₀ + β₁X + ϵ) = β₁Cov(X, X) = β₁Var(X).
Estimating β₀ and β₁
Given data (x₁, y₁), …, (xₙ, yₙ), define
x̄ = (1/n) Σ_{i=1}^{n} xᵢ,
ȳ = (1/n) Σ_{i=1}^{n} yᵢ,
sxx = Σ_{i=1}^{n} (xᵢ − x̄)²,
sxy = Σ_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ).
The estimated coefficients are
β̂₁ = sxy / sxx,
β̂₀ = ȳ − β̂₁x̄,
and the fitted regression line is
ŷ = β̂₀ + β̂₁x.
The residuals are eᵢ = yᵢ − ŷᵢ.
The coefficient of determination, r², measures how well the observed data are represented by the fitted line:
r² = s²xy / (sxx · syy) = Σ_{i=1}^{n} (ŷᵢ − ȳ)² / Σ_{i=1}^{n} (yᵢ − ȳ)²,
where
sxx = Σ_{i=1}^{n} (xᵢ − x̄)²,
syy = Σ_{i=1}^{n} (yᵢ − ȳ)²,
sxy = Σ_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ).
The value of r² ranges from 0 to 1. A larger value of r² indicates that the linear model explains more of the variation in the data.
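The estimation formulas above translate directly into code; this sketch (written for this report) fits the line and computes r² for a small synthetic data set lying exactly on y = 2x + 1, so r² = 1:

```python
def fit_simple_regression(xs, ys):
    """Return (b0, b1, r2) from the closed-form least-squares formulas."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    r2 = sxy ** 2 / (sxx * syy)
    return b0, b1, r2

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
b0, b1, r2 = fit_simple_regression(xs, ys)
```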
Previously, our model had only one predictor (explanatory variable), x. We can consider models with more than one explanatory variable. For example, to predict a student's final exam score based on several factors such as the number of study hours, attendance rate, and number of assignments completed, we may write
y = β₀ + β₁x + β₂z + ⋯ + βk w + ϵ,
where x, z, …, w are the explanatory variables (number of study hours, attendance rate, and number of assignments completed). This is a multiple linear regression model. The method of least squares extends directly to this setting.
It is worth noting that when we say linear regression, we mean linear in the unknown parameters βᵢ. For example,
y = β₀ + β₁x + β₂x² + ϵ
is still a linear regression model.
When running regression algorithms, one needs to be mindful of some practical considerations in regression analysis.
1.7 Bayesian Inference
In the Bayesian framework, we treat the unknown quantity, Θ, as a random variable. More specifically, we assume that we have some initial guess about the distribution of Θ. This distribution is called the prior distribution. After observing some data, we update the distribution
of Θ based on the observed data using Bayes’ Rule. This approach is known as the Bayesian
approach.
Bayesian inference is widely used in various fields. In medical diagnosis, it helps update the
probability of diseases based on new test results. In machine learning, it improves models
by incorporating new data. In economics, it updates forecasts with new economic indicators.
Additionally, it plays a crucial role in robotics for localization and mapping using sensor data.
To estimate a random variable X after observing a related random variable Y, we start with a prior on X: its PDF fX(x) if X is continuous, or its PMF PX(x) if X is discrete. After observing Y, we update our guess about X using Bayes' formula:
f_{X|Y}(x|y) = fY(y|x) fX(x) / fY(y)
for continuous random variables, or
P_{X|Y}(x|y) = PY(y|x) PX(x) / PY(y)
for discrete random variables.
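A discrete example of this update (a sketch written for this report; the numbers are made up): suppose X is which of two coins we hold, fair (P(H) = 0.5) or biased (P(H) = 0.9), with a uniform prior, and we observe Y = heads:

```python
def bayes_update(prior, likelihood):
    """Posterior P(X = x | y), proportional to P(y | x) P(x), over discrete x."""
    unnorm = {x: likelihood[x] * prior[x] for x in prior}
    total = sum(unnorm.values())
    return {x: v / total for x, v in unnorm.items()}

prior = {"fair": 0.5, "biased": 0.5}
likelihood_heads = {"fair": 0.5, "biased": 0.9}
posterior = bayes_update(prior, likelihood_heads)
# posterior["fair"] = 0.25 / 0.70 ~ 0.357: heads shifts belief toward the biased coin
```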
The MAP estimate of the random variable X, given that we have observed Y = y, is the value of x that maximizes f_{X|Y}(x|y) if X is a continuous random variable, or P_{X|Y}(x|y) if X is discrete. The minimum mean squared error (MMSE) estimate of X, given Y = y, is
x̂ = E[X | Y = y].
Proof: the MMSE estimate x̂ minimizes
E[(X − x̂)² | Y = y] = ∫ (x − x̂)² f_{X|Y}(x|y) dx.
Differentiating with respect to x̂ and setting the derivative to zero,
−2 ∫ (x − x̂) f_{X|Y}(x|y) dx = 0,
so
x̂ ∫ f_{X|Y}(x|y) dx = ∫ x f_{X|Y}(x|y) dx,
and since the conditional PDF integrates to 1, x̂ = E[X | Y = y].
Properties of the MMSE estimator XM = E[X | Y], with estimation error X̃ = X − XM:
1. Expectation: the MMSE estimator has the same expectation as X: E[XM] = E[X].
2. The error is uncorrelated with the estimator:
Cov(X̃, XM) = 0.
3. Decomposition of the second moment:
E[X²] = E[XM²] + E[X̃²].
Linear MMSE Estimation
We restrict the estimator to be a linear function of the observation:
X̂L = g(Y) = aY + b,
and choose a and b to minimize
MSE = E[(X − X̂L)²].
Setting the partial derivatives to zero:
∂MSE/∂a = −2E[XY] + 2aE[Y²] + 2bE[Y] = 0,
∂MSE/∂b = −2E[X] + 2aE[Y] + 2b = 0.
From the second equation, b = E[X] − aE[Y]. Substituting into the first,
a = (E[XY] − E[X]E[Y]) / (E[Y²] − (E[Y])²) = Cov(X, Y)/Var(Y).
The linear MMSE estimator of the random variable X, given that we have observed Y, is therefore
X̂L = (Cov(X, Y)/Var(Y)) (Y − E[Y]) + E[X] = ρ (σX/σY) (Y − E[Y]) + E[X].
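The coefficients a and b can be computed from moments; this sketch (written for this report; the joint distribution is an arbitrary example) does so for a small discrete joint PMF and checks that the linear estimator beats the best constant guess E[X]:

```python
# Joint PMF over (x, y) pairs — an arbitrary example distribution.
pmf = {(0.0, 0.0): 0.3, (1.0, 1.0): 0.3, (1.0, 2.0): 0.2, (2.0, 2.0): 0.2}

def E(fn):
    """Expectation of fn(X, Y) under the joint PMF."""
    return sum(p * fn(x, y) for (x, y), p in pmf.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
a = (E(lambda x, y: x * y) - EX * EY) / (E(lambda x, y: y * y) - EY ** 2)
b = EX - a * EY

mse_linear = E(lambda x, y: (x - (a * y + b)) ** 2)
mse_const = E(lambda x, y: (x - EX) ** 2)   # Var(X): MSE of the constant guess E[X]
```

By construction mse_linear can never exceed mse_const, since the constant estimator is the special case a = 0.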
In our daily life, we usually have to estimate several random variables at once, so we use random vectors, and we estimate X̂L componentwise.

Orthogonality Principle
To minimize the mean squared error (MSE) for a random vector, it is sufficient to minimize each E[(Xₖ − X̂ₖ)²] individually. This implies that we only need to focus on estimating a random variable X given the observations of the random vector Y. Since we want our estimator to be linear, we write
X̂L = Σ_{k=1}^{n} aₖYₖ + b.
The optimal coefficients are characterized by the conditions E[X̃] = 0 and Cov(X̃, Yⱼ) = 0 for j = 1, …, n, where X̃ = X − X̂L is the estimation error. These conditions are known as the orthogonality principle: the estimation error X̃ must be orthogonal to each of the observations Y₁, Y₂, …, Yₙ. Given that there are n + 1 unknown parameters (a₁, a₂, …, aₙ and b) and n + 1 corresponding equations, we can solve for the estimator.
In Bayesian hypothesis testing, suppose we must decide between H₀ and H₁ based on the observation Y = y, where C₁₀ is the cost of choosing H₁ when H₀ is true and C₀₁ is the cost of choosing H₀ when H₁ is true. The minimum-cost test accepts H₀ if
fY(y | H₀) / fY(y | H₁) ≥ P(H₁)C₀₁ / (P(H₀)C₁₀).
Given the observation Y = y, the interval [a, b] is said to be a 100(1 − α)% credible interval for X if
P(a ≤ X ≤ b | Y = y) = 1 − α.
Chapter 2
Random Processes
2.1 Introduction
A random process is a collection of random variables indexed by time or space. Example: let T(t) be the temperature in India at time t ∈ [0, ∞). We can assume here that t is measured in hours and t = 0 refers to the time we start measuring the temperature.
If the random variables in the random process are countable, the random process is called a discrete-time random process. Suppose you start learning English on Duolingo, and the random variable is 1 if you have completed that day's streak and 0 otherwise; this random process can be written as {Xₙ, n = 1, 2, 3, …}, with n being the number of days from the start. This is a countable (discrete-time) random process. In a continuous-time random process, the index set J is an interval on the real line such as [−1, 1], [0, ∞), (−∞, ∞), etc. Each realization of the process is a function of time. If a random process is of the form {X(t), t ∈ J}, then the set of all possible realizations of the process is sometimes called the ensemble.
The mean function of a random process is
μX(t) = E[X(t)].
The cross-covariance function of two processes X(t) and Y(t) is given by
CXY(t₁, t₂) = Cov(X(t₁), Y(t₂)) = RXY(t₁, t₂) − μX(t₁)μY(t₂), for t₁, t₂ ∈ J,
where RXY(t₁, t₂) = E[X(t₁)Y(t₂)] is the cross-correlation function. Two random processes {X(t), t ∈ J} and {Y(t), t ∈ J′} are said to be independent if, for all t₁, t₂, …, tm ∈ J and t′₁, t′₂, …, t′ₙ ∈ J′, the set of random variables X(t₁), X(t₂), …, X(tm) is independent of the set Y(t′₁), Y(t′₂), …, Y(t′ₙ).
2.2 Stationary Processes
A continuous-time random process {X(t), t ∈ R} is strict-sense stationary if, for all times t₁, t₂, …, tr ∈ J and every shift ∆, the joint distribution of X(t₁), X(t₂), …, X(tr) is the same as that of X(t₁ + ∆), X(t₂ + ∆), …, X(tr + ∆). Similarly, it can be defined for a discrete-time random process by replacing ∆ with an integer. A random process is wide-sense stationary (WSS) if:
1. μX(t) = μX does not depend on t, and
2. RX(t₁, t₂) depends only on the difference t₁ − t₂.
Two random processes {X(t), t ∈ R} and {Y(t), t ∈ R} are said to be jointly wide-sense stationary if each of them is WSS and their cross-correlation RXY(t₁, t₂) depends only on t₁ − t₂. Here, μX(t) denotes the mean of X(t), and RX(t₁, t₂) denotes the autocorrelation function of X(t) at times t₁ and t₂. Similarly, joint wide-sense stationarity can be defined for discrete-time random processes, with the time shifts restricted to integers.
Let X(t) be a continuous-time random process. We say that X(t) is mean-square continuous at time t if
lim_{δ→0} E[ |X(t + δ) − X(t)|² ] = 0.
It is worth noting that there are jumps in a Poisson process; however, those jumps are not very dense in time, so the random process is still continuous in the mean-square sense.
A random process {X(t), t ∈ J} is said to be a Gaussian (normal) random process if, for
all t1 , t2 , . . . , tn ∈ J, the random variables X(t1 ), X(t2 ), . . . , X(tn ) are jointly normal.
Two random processes {X(t), t ∈ J} and {Y(t), t ∈ J′} are considered jointly Gaussian if, for any selections of t₁, t₂, …, tm ∈ J and t′₁, t′₂, …, t′ₙ ∈ J′, the random variables X(t₁), …, X(tm), Y(t′₁), …, Y(t′ₙ) are jointly normal.
A random process X(t) may have a derivative Y(t) = (d/dt)X(t), which is also a random process. For smooth processes, the derivative is straightforward: for example, if X(t) = A + Bt + Ct², where A, B, and C are random variables, then Y(t) = B + 2Ct. To handle derivatives and integrals of random processes, we assume some regularity conditions:
Integration can be interchanged with expectation:
E[ ∫₀ᵗ X(u) du ] = ∫₀ᵗ E[X(u)] du,
and so can differentiation:
E[ (d/dt) X(t) ] = (d/dt) E[X(t)].
The regularity conditions ensure that these operations on random processes are well-behaved.
2.3 Processing of Random Signals
For a WSS random process X(t), the power spectral density (PSD) is defined as the Fourier transform of the autocorrelation function RX(τ):
SX(f) = F{RX(τ)} = ∫_{−∞}^{∞} RX(τ) e^(−2jπfτ) dτ, where j = √−1.
The expected power satisfies
E[X(t)²] = RX(0) = ∫_{−∞}^{∞} SX(f) df.
For two jointly WSS processes, we define the cross spectral density SXY(f) as the Fourier transform of the cross-correlation function RXY(τ):
SXY(f) = F{RXY(τ)} = ∫_{−∞}^{∞} RXY(τ) e^(−2jπfτ) dτ.
A linear time-invariant (LTI) system is a type of system used in signal processing. It has two defining properties:
1. Linearity: if you provide the system with a linear combination of two signals, the resulting output is the same linear combination of the outputs of the individual signals.
2. Time Invariance: If you shift a signal in time before feeding it into the system, the output
will be the same as if you had shifted the output in time after the signal had passed through
the system.
The impulse response h(t) is the output of the system when the input is a very short signal (an impulse).
When you provide any input signal X(t) to the system, the output Y(t) can be found by combining (convolving) the input signal with the impulse response. This is done using a mathematical operation called convolution. The convolution of the input signal X(t) and the impulse response h(t) gives you the output Y(t):
Y(t) = ∫_{−∞}^{∞} X(τ) h(t − τ) dτ,
where h(t) is the impulse response of the system. If the input X(t) is WSS, then X(t) and Y(t) are jointly WSS.
Moreover,
μY(t) = μY = μX ∫_{−∞}^{∞} h(α) dα;
RXY(τ) = h(−τ) ∗ RX(τ) = ∫_{−∞}^{∞} h(−α) RX(τ − α) dα;
RY(τ) = h(τ) ∗ h(−τ) ∗ RX(τ).
Proof:
μY(t) = E[Y(t)] = E[ ∫_{−∞}^{∞} h(α) X(t − α) dα ]
      = ∫_{−∞}^{∞} h(α) E[X(t − α)] dα
      = ∫_{−∞}^{∞} h(α) μX dα
      = μX ∫_{−∞}^{∞} h(α) dα.
Since μY(t) is not a function of t, μY(t) = μY = μX ∫_{−∞}^{∞} h(α) dα. Similarly,
RXY(τ) = ∫_{−∞}^{∞} h(α) RX(τ + α) dα = h(−τ) ∗ RX(τ),
and
RY(τ) = h(τ) ∗ h(−τ) ∗ RX(τ).
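The relation μY = μX ∫h can be seen in discrete time with a constant input (a sketch written for this report, not from the original text): every fully-overlapped output sample of the convolution equals μX · Σh.

```python
def convolve_valid(x, h):
    """'Valid' discrete convolution: outputs where h fully overlaps x."""
    n, m = len(x), len(h)
    return [sum(h[k] * x[i + m - 1 - k] for k in range(m)) for i in range(n - m + 1)]

x = [2.0] * 50        # constant input signal with mean 2
h = [0.5, 0.3, 0.2]   # impulse response; its coefficients sum to 1
y = convolve_valid(x, h)
# every output sample equals 2.0 * sum(h) = 2.0, matching mu_Y = mu_X * sum(h)
```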
There are some advantages to working in the frequency domain, so we convert these time-domain relations using the transfer function
H(f) = F{h(t)} = ∫_{−∞}^{∞} h(t) e^(−2jπft) dt.
We know that, for a real impulse response,
F{h(−t)} = H(−f) = H*(f).
Taking Fourier transforms of RXY(τ) = h(−τ) ∗ RX(τ) and RY(τ) = h(τ) ∗ h(−τ) ∗ RX(τ):
SXY(f) = SX(f) H(−f) = SX(f) H*(f),
and similarly
SY(f) = SX(f) |H(f)|².
Consider the ideal filter
H(f) = 1 if f₁ < |f| < f₂, and 0 otherwise.
This is, in fact, a bandpass filter, as it eliminates every frequency outside of the frequency band f₁ < |f| < f₂. Thus, the resulting random process Y(t) is a filtered version of X(t) in which frequency components in the band f₁ < |f| < f₂ are preserved:
SY(f) = SX(f)|H(f)|² = SX(f) if f₁ < |f| < f₂, and 0 otherwise.
E[Y(t)²] = ∫_{−∞}^{∞} SY(f) df = ∫_{−f₂}^{−f₁} SX(f) df + ∫_{f₁}^{f₂} SX(f) df.
Since SX(−f) = SX(f),
E[Y(t)²] = 2 ∫_{f₁}^{f₂} SX(f) df.
Let X(t) be a stationary Gaussian process. If X(t) is the input to an LTI system, then the output random process, Y(t), is also a stationary Gaussian process. Moreover, X(t) and Y(t) are jointly stationary and jointly Gaussian.
White Noise
A WSS process X(t) is called white noise if its PSD is flat:
SX(f) = N₀/2, for all f.
White noise has infinite power, as the integral of a constant SX(f) over all frequencies is infinite. The PSD of thermal noise is similar to the PSD of white noise over a frequency range, and it decreases outside of that frequency range. Usually, thermal noise is modeled as Gaussian white noise, which adds the condition that the process X(t) is a stationary Gaussian random process.
2.4 Important Random Processes
A random process {N(t), t ∈ [0, ∞)} is said to be a counting process if N(t) is the number of events that have occurred from time 0 up to and including time t. For a counting process, we assume:
1. N(0) = 0;
2. N(t) ∈ {0, 1, 2, …} for all t, and N(t) is nondecreasing in t;
3. for 0 ≤ s < t, N(t) − N(s) is the number of events that occur in the interval (s, t].
Independent Increments
Let {X(t), t ∈ [0, ∞)} be a continuous-time random process. We say that X(t) has independent increments if, for all 0 ≤ t₁ < t₂ < t₃ < ⋯ < tₙ, the random variables
X(t₂) − X(t₁), X(t₃) − X(t₂), …, X(tₙ) − X(tₙ₋₁)
are independent.
Stationary Increments
A continuous-time random process X(t) is said to have stationary increments if, for any t₂ > t₁ ≥ 0 and any shift r > 0, the random variables X(t₂) − X(t₁) and X(t₂ + r) − X(t₁ + r) have the same probability distribution. This means that the probability distribution of an increment depends solely on the length of the interval (t₁, t₂] and is unaffected by the position of the interval on the time axis.
Let λ > 0 be a fixed number. The counting process {N(t), t ∈ [0, ∞)} is called a Poisson process with rate λ if:
1. N(t) has independent increments;
2. the number of arrivals in any interval of length τ > 0 has a Poisson(λτ) distribution.
If N(t) is a Poisson process with rate λ, then the interarrival times X₁, X₂, … are independent. The first arrival time X₁ is the time of the first arrival. The probability that no arrival happens in (0, t] is P(X₁ > t) = P(N(t) = 0) = e^(−λt), so
X₁ ∼ Exponential(λ).
The probability that no event happens in an interval (s, s + t], given that the first event occurred at time s, is also e^(−λt), by the independent and stationary increments. Hence the time between the first and second arrivals, X₂, is also exponentially distributed with rate λ and is independent of X₁. The same reasoning applies to all subsequent interarrival times X₃, X₄, …. Each Xᵢ is independent of the others, with
Xᵢ ∼ Exponential(λ), for i = 1, 2, 3, …
The arrival times are
Tₙ = X₁ + X₂ + ⋯ + Xₙ.
Since Tₙ is a sum of n i.i.d. Exponential(λ) random variables, the arrival times T₁, T₂, … have a Gamma distribution: Tₙ ∼ Gamma(n, λ), with
E[Tₙ] = n/λ and Var(Tₙ) = n/λ².
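The interarrival description gives a direct way to simulate a Poisson process (a sketch written for this report): draw Exponential(λ) gaps until the horizon T is passed. Over many runs, the average number of arrivals in [0, T] should be close to λT.

```python
import random

def poisson_process_arrivals(lam, T, rng):
    """Arrival times in [0, T] of a rate-lam Poisson process."""
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(lam)   # Exponential(lam) interarrival time
        if t > T:
            return arrivals
        arrivals.append(t)

rng = random.Random(0)
counts = [len(poisson_process_arrivals(2.5, 2.0, rng)) for _ in range(4000)]
# The average count should be close to lam * T = 5.
```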
Merging
Let N₁(t), N₂(t), …, Nₘ(t) be m independent Poisson processes with rates λ₁, λ₂, …, λₘ. Define N(t) = N₁(t) + N₂(t) + ⋯ + Nₘ(t). Then N(t) is a Poisson process with rate λ₁ + λ₂ + ⋯ + λₘ:
Initial condition: N(0) = Σᵢ Nᵢ(0) = 0.
Independent increments: since the Nᵢ(t) are independent, N(t) − N(s) = Σ_{i=1}^{m} (Nᵢ(t) − Nᵢ(s)) inherits independent increments.
The fact that the sum of m independent Poisson random variables is another Poisson random variable (with rate equal to the sum of the individual rates) can be shown using their moment generating functions (MGFs).
Splitting
Let N(t) be a Poisson process with rate λ. We split N(t) into two processes, N₁(t) and N₂(t), as follows: each arrival is assigned randomly to either N₁(t) or N₂(t) based on the outcome of an independent coin toss with probability p for heads (H). If heads, the arrival goes to N₁(t); otherwise, it goes to N₂(t). The coin tosses are independent of each other and of N(t). Consequently, N₁(t) and N₂(t) are independent Poisson processes with rates λp and λ(1 − p), respectively.
If the rate of the Poisson process changes with time, the process is called a nonhomogeneous Poisson process. Let λ(t): [0, ∞) → [0, ∞) be an integrable function. The counting process {N(t), t ∈ [0, ∞)} is called a nonhomogeneous Poisson process with rate λ(t) if all of the following conditions hold:
• N(0) = 0;
• N(t) has independent increments;
• P(N(t + ∆) − N(t) = 1) = λ(t)∆ + o(∆);
• P(N(t + ∆) − N(t) ≥ 2) = o(∆).
If we replace λ(t) with a constant λ, then we get the second definition of the Poisson process.
2.5 Discrete-Time Markov Chains
2.5.1 Introduction
Consider the random process {Xₙ, n = 0, 1, 2, …}, where R_Xᵢ = S ⊂ {0, 1, 2, …}. We say that this process is a Markov chain if, for all n,
P(Xₙ₊₁ = j | Xₙ = i, Xₙ₋₁, …, X₀) = P(Xₙ₊₁ = j | Xₙ = i) = pᵢⱼ.
If the number of states is finite, e.g., S = {0, 1, 2, …, r}, we call it a finite Markov chain.
We store the transition probabilities in a matrix called the state transition matrix or transition probability matrix. For a finite Markov chain with r states,
P = [ p₁₁ p₁₂ ⋯ p₁ᵣ
      p₂₁ p₂₂ ⋯ p₂ᵣ
      ⋮    ⋮   ⋱   ⋮
      pᵣ₁ pᵣ₂ ⋯ pᵣᵣ ].
We usually show a Markov process using a state transition diagram. Consider a Markov chain with three possible states, 1, 2, and 3, and the following transition probability matrix:
P = [ 1/4  1/4  1/2
      1/3   0   2/3
      1/4  2/3  1/12 ].
In this diagram, there are three possible states 1, 2, and 3, and the arrows from each state to other states show the transition probabilities pᵢⱼ. When there is no arrow from state i to state j, it means pᵢⱼ = 0. Each arrow is labeled with the corresponding probability pᵢⱼ; that is, pᵢⱼ gives us the probability of going from state i to state j in one step. Now suppose that we are interested in finding the probability of going from state i to state j in two steps,
pᵢⱼ⁽²⁾ = P(X₂ = j | X₀ = i).
We can find this probability by applying the law of total probability. In particular, we argue
that X1 can take one of the possible values in S. Thus, we can write
pᵢⱼ⁽²⁾ = Σ_{k∈S} P(X₂ = j | X₁ = k, X₀ = i) P(X₁ = k | X₀ = i) = Σ_{k∈S} P(X₂ = j | X₁ = k) P(X₁ = k | X₀ = i).
We conclude
pᵢⱼ⁽²⁾ = P(X₂ = j | X₀ = i) = Σ_{k∈S} pᵢₖ pₖⱼ.
We can explain the above formula as follows: in order to get to state j, we need to pass through some intermediate state k. The probability of this path is pᵢₖpₖⱼ. To obtain pᵢⱼ⁽²⁾, we sum over all possible intermediate states. Accordingly, we can define the two-step transition matrix as follows:
P⁽²⁾ = [ p₁₁⁽²⁾ p₁₂⁽²⁾ ⋯ p₁ᵣ⁽²⁾
        p₂₁⁽²⁾ p₂₂⁽²⁾ ⋯ p₂ᵣ⁽²⁾
        ⋮       ⋮      ⋱   ⋮
        pᵣ₁⁽²⁾ pᵣ₂⁽²⁾ ⋯ pᵣᵣ⁽²⁾ ].
Looking at the previous equation, we notice that pᵢⱼ⁽²⁾ is in fact the element in the i-th row and j-th column of the matrix product P · P. Thus, we conclude that the two-step transition matrix can be obtained by squaring the state transition matrix: P⁽²⁾ = P².
More generally, we can define the n-step transition probabilities pᵢⱼ⁽ⁿ⁾ as
pᵢⱼ⁽ⁿ⁾ = P(Xₙ = j | X₀ = i), for n = 0, 1, 2, …,
and the corresponding n-step transition matrix P⁽ⁿ⁾ with entries pᵢⱼ⁽ⁿ⁾.
We can now generalize the previous equation. Let m and n be two positive integers and assume X₀ = i. In order to get to state j in m + n steps, the chain will be at some intermediate state k after m steps. To obtain pᵢⱼ⁽ᵐ⁺ⁿ⁾, we sum over all possible intermediate states:
pᵢⱼ⁽ᵐ⁺ⁿ⁾ = Σ_{k∈S} pᵢₖ⁽ᵐ⁾ pₖⱼ⁽ⁿ⁾.
It follows that P⁽ⁿ⁾ = Pⁿ, for n = 1, 2, 3, …
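For the three-state example above, P⁽²⁾ = P² can be computed directly (a pure-Python sketch written for this report):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

P = [[1/4, 1/4, 1/2],
     [1/3, 0.0, 2/3],
     [1/4, 2/3, 1/12]]
P2 = mat_mul(P, P)
# P2[0][0] = 1/4*1/4 + 1/4*1/3 + 1/2*1/4 = 13/48, and each row of P2 still sums to 1
```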
Two states i and j are said to communicate, denoted by i ↔ j, if each state is reachable from the other. In other words, i ↔ j means i → j and j → i.
A Markov chain is said to be irreducible if it has only one communicating class. This means that every state in the Markov chain can be reached from every other state, ensuring that all states communicate with each other. This property is desirable because it simplifies the analysis of the chain's long-run behavior.
A state is said to be recurrent if, any time we leave that state, we will certainly return to it in the future. Conversely, if the probability of returning to the state is less than one, the state is called transient.
If two states are in the same class, they are either both recurrent or both transient. By the same reasoning, each communicating class is either entirely recurrent or entirely transient:
1. Recurrent Class: If the Markov chain enters this class, it will always stay in that class.
2. Transient Class: The Markov chain might enter and stay in this class for a while, but
at some point, it will leave and never return. All the states are transient.
Consider a Markov chain starting from state X₀ = i. If i is a recurrent state, the chain will return to state i every time it leaves that state; consequently, the chain will visit state i an infinite number of times. Conversely, if i is a transient state, the chain will return to state i with probability fᵢᵢ < 1. In this case, the total number of visits to state i follows a Geometric distribution with success probability 1 − fᵢᵢ (returning to the same state is treated as a failure).
For a discrete-time Markov chain, let V be the total number of visits to state i. Then:
• if i is recurrent, P(V = ∞ | X₀ = i) = 1;
• if i is transient, V | X₀ = i ∼ Geometric(1 − fᵢᵢ).
Periodicity
The period of a state i in a Markov chain is the largest number d such that pᵢᵢ⁽ⁿ⁾ = 0 whenever n is not divisible by d. This period is denoted d(i). If pᵢᵢ⁽ⁿ⁾ = 0 for all n > 0, then d(i) = ∞. All states in the same communicating class have the same period. So, a class is periodic if its states are periodic, and aperiodic if its states are aperiodic.
• If there is a self-transition (pii > 0 for some i), the chain is aperiodic.
• If state i can return to itself in l steps (pᵢᵢ⁽ˡ⁾ > 0) and also in m steps (pᵢᵢ⁽ᵐ⁾ > 0), with gcd(l, m) = 1, then state i is aperiodic.
• The chain is aperiodic if and only if there is a positive integer n such that all elements pᵢⱼ⁽ⁿ⁾ > 0 for all i, j ∈ S.
Absorption Probabilities
Consider a finite Markov chain {Xₙ} with states S = {0, 1, 2, …, r}, where each state is either absorbing or transient. Let l be an absorbing state, and let aᵢ denote the probability that the chain, starting from state i, will eventually be absorbed into state l. If we enter some other absorbing state, we get stuck there and will never be able to reach l, so aₗ = 1 and aⱼ = 0 for every other absorbing state j ≠ l. For the remaining states,
aᵢ = Σ_{k∈S} aₖ pᵢₖ.
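The absorption equations can be solved by simple iteration; the sketch below (written for this report) treats a symmetric random walk on {0, 1, 2, 3} with absorbing ends, where the known answer is that absorption at state 3, starting from i, has probability i/3:

```python
def absorption_probs(P, target, absorbing, iters=500):
    """Iterate a_i = sum_k p_ik a_k, with a fixed at the absorbing states."""
    n = len(P)
    a = [1.0 if i == target else 0.0 for i in range(n)]
    for _ in range(iters):
        new = list(a)
        for i in range(n):
            if i not in absorbing:
                new[i] = sum(P[i][k] * a[k] for k in range(n))
        a = new
    return a

# Symmetric gambler's-ruin walk: states 0 and 3 are absorbing.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
a = absorption_probs(P, target=3, absorbing={0, 3})
# a converges to [0, 1/3, 2/3, 1]
```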
Mean Hitting Times
Consider a finite Markov chain {Xₙ} with states S = {0, 1, 2, …, r}. Let A ⊆ S be a subset of states, and let T be the first time the chain visits any state in A. For each state i ∈ S, tᵢ denotes the expected number of steps until the chain first visits a state in A, starting from i. Then:
tⱼ = 0 for all j ∈ A,
tᵢ = 1 + Σ_{k∈S∖A} tₖ pᵢₖ, for i ∉ A.
Mean Return Times
Consider a finite irreducible Markov chain {Xₙ} with state space S = {0, 1, 2, …, r}. Let l ∈ S be a state. The mean return time to state l, rₗ, is the expected number of steps it takes the chain to return to l, starting from l: rₗ = 1 + Σ_{k∈S} tₖ pₗₖ, where the tₖ are found from:
• tₗ = 0;
• if k ≠ l, tₖ = 1 + Σ_{j∈S} tⱼ pₖⱼ.
Limiting Distribution
For Markov chains with a finite number of states, there can be transient and recurrent
states. As time progresses, the chain will eventually enter a recurrent class and stay there
permanently. Hence, for long-term behavior, we focus on these recurrent classes. If the
Markov chain has multiple recurrent classes, it will eventually get absorbed into one of them.
Assuming that the chain is irreducible and aperiodic, the limiting distribution π = [π₀, π₁, π₂, …] can be found in a simple manner: solve π = πP together with Σ_{j∈S} πⱼ = 1. Intuitively, the equation π = πP means that if the distribution of Xₙ is π (that is, the probabilities of being in each state are given by π), then the distribution of Xₙ₊₁ is also π. Component-wise,
πⱼ = Σ_{k∈S} πₖ Pₖⱼ
for all j ∈ S. This equation states that the probability πⱼ of being in state j in the long run is equal to the sum of the probabilities πₖ of being in each state k, weighted by the probability Pₖⱼ of moving from k to j.
Summarizing everything: π describes the Markov chain's behavior in the long run. π is the limiting distribution if
πⱼ = lim_{n→∞} P(Xₙ = j | X₀ = i), for all i, j ∈ S,
with Σ_{j∈S} πⱼ = 1. For a finite, irreducible, aperiodic chain, the system
π = πP, Σ_{j∈S} πⱼ = 1
has a unique solution, called the limiting distribution of the Markov chain.
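For the earlier three-state chain, the limiting distribution can be approximated by repeatedly applying π ← πP (a sketch written for this report); since the chain is irreducible and aperiodic, the iteration converges to the unique fixed point:

```python
def step(pi, P):
    """One application of pi <- pi P."""
    n = len(P)
    return [sum(pi[k] * P[k][j] for k in range(n)) for j in range(n)]

P = [[1/4, 1/4, 1/2],
     [1/3, 0.0, 2/3],
     [1/4, 2/3, 1/12]]
pi = [1.0, 0.0, 0.0]   # any starting distribution works
for _ in range(500):
    pi = step(pi, P)
# pi now satisfies pi = pi P (up to numerical error) and sums to 1
```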
For a Markov chain with a countably infinite state space, suppose that the chain is irreducible and aperiodic. Depending on the nature of the states, the long-run behavior falls into one of three cases:

1. All states are transient: the chain will eventually leave these states and never return, so no limiting distribution exists.
2. All states are null recurrent: the chain returns to these states eventually, but the expected return time is infinite; again there is no proper limiting distribution.
3. All states are positive recurrent (the expected return time is finite): in this case, there exists a unique limiting distribution satisfying

$$\pi_j = \sum_{k=0}^{\infty} \pi_k \, P_{kj}, \quad j = 0, 1, 2, \ldots, \qquad \sum_{j=0}^{\infty} \pi_j = 1.$$
In this case, the mean return times satisfy

$$r_j = \frac{1}{\pi_j} \quad \text{for all } j = 0, 1, 2, \ldots$$
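A quick sanity check of rj = 1/πj on a two-state chain (the transition probabilities are illustrative numbers):

```python
import numpy as np

# Two-state chain: 0 -> 1 with prob p, 1 -> 0 with prob q.
p, q = 0.3, 0.6
P = np.array([[1 - p, p],
              [q, 1 - q]])

# Its limiting distribution is pi = (q, p) / (p + q).
pi = np.array([q, p]) / (p + q)

# Mean return time to state 0 by first-step analysis: starting
# from 1, the time to hit 0 is geometric with mean 1/q, so
# r_0 = 1 + p * (1 / q).
r0 = 1.0 + p / q
print(r0, 1.0 / pi[0])   # both equal 1.5
```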
2.6 Continuous-Time Markov Chains

A continuous-time Markov chain X(t) is a process that moves between states over time.
Markov Property
The process has the Markov property, meaning that to predict the future state of the pro-
cess, we only need to know the current state, not the history of how we arrived there.
Jump Chain
The jump chain is a discrete-time Markov chain with state space S ⊂ {0, 1, 2, . . . } and transition probabilities pij . These probabilities tell us the likelihood of jumping from one state i to another state j: if we are in state i, pij is the probability that the next state visited is j.
Holding Times
The amount of time spent in state i before transitioning is memoryless because of the Markov property. The exponential distribution is the only continuous distribution with this property, so the holding time in state i, which is the time the process remains in that state before jumping, follows an Exponential(λi) distribution.
For a continuous-time Markov chain, we define the transition matrix P(t). The (i, j)th entry is Pij(t) = P(X(t + s) = j | X(s) = i), and each row sums to one:
$$\sum_{j \in S} P_{ij}(t) = 1 \quad \text{for all } t \ge 0.$$
Stationary and Limiting Distribution

In a continuous-time Markov chain X(t) with an irreducible positive recurrent jump chain, the stationary distribution π̃ = [π̃0 , π̃1 , π̃2 , . . .] of the jump chain determines the long-term behavior of the chain. Suppose π̃ is the unique stationary distribution of the jump chain, and suppose

$$0 < \sum_{k \in S} \frac{\tilde{\pi}_k}{\lambda_k} < \infty.$$
Then,

$$\pi_j = \lim_{t \to \infty} P(X(t) = j \mid X(0) = i) = \frac{\tilde{\pi}_j / \lambda_j}{\sum_{k \in S} \tilde{\pi}_k / \lambda_k} \quad \text{for all } i, j \in S.$$
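As a sketch, this formula can be applied numerically. The jump-chain matrix and holding rates below are made-up values for illustration:

```python
import numpy as np

# Illustrative jump chain (note the zero diagonal) and holding rates.
Ptilde = np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
])
lam = np.array([2.0, 1.0, 3.0])    # lambda_i for each state

# Stationary distribution of the jump chain: pi~ = pi~ P~, sum = 1.
n = len(lam)
A = (np.eye(n) - Ptilde).T
A[-1] = 1.0
b = np.zeros(n)
b[-1] = 1.0
pit = np.linalg.solve(A, b)

# Limiting distribution of the CTMC: pi_j proportional to pi~_j / lambda_j.
pi = (pit / lam) / np.sum(pit / lam)
print(pit, pi)   # pi~ = [0.4, 0.4, 0.2], pi = [0.3, 0.6, 0.1]
```

Note how state 1, which has the smallest rate λ and hence the longest holding times, picks up the largest share of the long-run probability.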
Suppose we start in state i. The time T1 until we jump to the next state follows an exponential distribution with rate λi:

$$T_1 \sim \text{Exponential}(\lambda_i).$$

Equivalently, the rate λi can be characterized as

$$\lambda_i = \lim_{\delta \to 0^+} \frac{P(X(\delta) \ne i \mid X(0) = i)}{\delta}.$$
The probability of moving from state i to state j is pij . The transition rate from i to j is:

$$g_{ij} = \lambda_i \, p_{ij}.$$
In particular,

$$g_{ii} = -\sum_{j \ne i} g_{ij} = -\sum_{j \ne i} \lambda_i p_{ij} = -\lambda_i.$$

This holds because if λi = 0, then $\lambda_i \sum_{j \ne i} p_{ij} = 0$, and if λi ≠ 0, then pii = 0 and $\sum_{j \ne i} p_{ij} = 1$.
Consequently, each row of the generator matrix sums to zero:

$$\sum_{j} g_{ij} = 0.$$
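These relations give a direct recipe for building G from the jump-chain probabilities and the rates; the values below are illustrative:

```python
import numpy as np

# Illustrative jump-chain probabilities (zero diagonal) and rates.
Ptilde = np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
])
lam = np.array([2.0, 1.0, 3.0])

# g_ij = lambda_i * p_ij for i != j, and g_ii = -lambda_i.
G = lam[:, None] * Ptilde
np.fill_diagonal(G, -lam)

print(G)
print(G.sum(axis=1))   # every row of G sums to 0
```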
The generator matrix G is very useful for analyzing the behavior of continuous-time Markov chains, especially when calculating probabilities and expectations over time. It also helps us find the stationary distribution of the chain.
Consider a very small time interval δ. For such a small δ, the probability pjj(δ) that the system remains in state j satisfies

$$p_{jj}(\delta) \approx 1 - \lambda_j \delta.$$
Next, let’s consider the probability pkj(δ) that the system transitions from state k to state j (k ≠ j) within time δ.
For small δ, the probability that the chain leaves state k (to any state other than k) within time δ is approximately λk δ. If we also want the transition to end up in state j, we multiply by pkj, since the chain must first leave k and then jump to j. Hence

$$p_{kj}(\delta) \approx \lambda_k \, p_{kj} \, \delta = g_{kj} \, \delta \quad (k \ne j).$$
Conditioning on the state at time t,

$$P_{ij}(t + \delta) = \sum_{k} P_{ik}(t) \, p_{kj}(\delta),$$

and separating out the k = j term,

$$P_{ij}(t + \delta) \approx P_{ij}(t) \, p_{jj}(\delta) + \sum_{k \ne j} P_{ik}(t) \, p_{kj}(\delta).$$
Using our previous expressions for pjj (δ) and pkj (δ):
$$P_{ij}(t + \delta) \approx P_{ij}(t)\,(1 + g_{jj}\delta) + \sum_{k \ne j} P_{ik}(t) \, g_{kj} \, \delta.$$
Simplifying, we get:
$$P_{ij}(t + \delta) \approx P_{ij}(t) + \delta P_{ij}(t) g_{jj} + \delta \sum_{k \ne j} P_{ik}(t) g_{kj}$$

$$P_{ij}(t + \delta) \approx P_{ij}(t) + \delta \sum_{k} P_{ik}(t) g_{kj}$$
Finally, if we rearrange and take the limit as δ approaches zero, we get the (forward) differential equation

$$P'(t) = P(t)\,G.$$

A symmetric argument, conditioning on the first small interval instead of the last, gives the backward equation

$$P'(t) = G\,P(t).$$
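With P(0) = I, the forward equation has the solution P(t) = e^{tG}. A sketch with a hand-rolled matrix exponential (a truncated Taylor series, adequate for a small, well-scaled matrix; the generator values are illustrative):

```python
import numpy as np

def expm(A, terms=30):
    """Matrix exponential by truncated Taylor series; fine for
    small, well-scaled matrices like this example."""
    out = np.zeros_like(A)
    term = np.eye(A.shape[0])
    for k in range(terms):
        out = out + term
        term = term @ A / (k + 1)
    return out

# Generator of an illustrative 2-state chain.
G = np.array([[-2.0, 2.0],
              [1.0, -1.0]])

t = 0.5
P_t = expm(t * G)
print(P_t.sum(axis=1))             # rows of P(t) sum to 1

# Finite-difference check of the forward equation P'(t) = P(t) G.
h = 1e-6
deriv = (expm((t + h) * G) - P_t) / h
print(np.max(np.abs(deriv - P_t @ G)))   # close to 0
```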
Consider a continuous-time Markov chain X(t) with state space S and generator matrix G. A probability distribution π on S is a stationary distribution of X(t) if and only if it satisfies πG = 0. We verify the equivalence in two cases.
Case-1: Suppose π is a stationary distribution; it means π = πP(t) for all t, where P(t) is the transition matrix. Differentiating both sides of π = πP(t) with respect to t gives

$$0 = \frac{d}{dt}\left[\pi P(t)\right] = \pi P'(t) = \pi G P(t),$$

and evaluating at t = 0 (where P(0) = I) yields πG = 0.
Case-2: Conversely, suppose πG = 0. By the backward equation, πP′(t) = πGP(t) = 0. Since πP′(t) is the derivative of πP(t), πP(t) is constant over time. Thus, for any t ≥ 0:

$$\pi P(t) = \pi P(0) = \pi.$$
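A numerical check of the equivalence, on an illustrative 2-state generator (here π = [1/3, 2/3] solves πG = 0):

```python
import numpy as np

G = np.array([[-2.0, 2.0],
              [1.0, -1.0]])
pi = np.array([1/3, 2/3])
print(pi @ G)                      # [0, 0]: pi G = 0

def expm(A, terms=30):
    """Truncated-Taylor matrix exponential (small example only)."""
    out = np.zeros_like(A)
    term = np.eye(A.shape[0])
    for k in range(terms):
        out = out + term
        term = term @ A / (k + 1)
    return out

# pi G = 0 implies pi P(t) = pi for every t, since P(t) = e^{tG}.
P_t = expm(0.7 * G)
print(pi @ P_t)                    # again [1/3, 2/3]
```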