
Probability and Statistics

Harshith Pendela
Declaration

I, Harshith Pendela, hereby declare that the project report titled "Probability and Statistics" is

the result of my own independent work. I have dedicated considerable time and effort to studying

the topic of Probability and Statistics from some of the most widely-used and respected resources

available in the field. Throughout my studies, I have developed a thorough understanding

of the fundamental concepts, methodologies, and practical applications of this subject. This

comprehensive understanding has equipped me with the necessary knowledge to accurately

analyze, interpret, and present the information contained within this report. I have ensured

that all the work presented is original and based on my own comprehension of the material.

By signing this declaration, I affirm the authenticity and integrity of the work submitted in this

project.

Signed

Contents

1 Statistical Inference
  1.1 Introduction
    1.1.1 Simple Random Sample
  1.2 Point Estimation
  1.3 Maximum Likelihood Estimation
    1.3.1 Asymptotic Properties of MLEs
  1.4 Interval Estimation
  1.5 Hypothesis Testing
    1.5.1 General Definitions
    1.5.2 Hypothesis Testing: Type I and Type II Errors
    1.5.3 Likelihood Ratio Tests
  1.6 Linear Regression
  1.7 Bayesian Inference
    1.7.1 Maximum A Posteriori (MAP) Estimation
    1.7.2 Minimum Mean Squared Error (MMSE) Estimation
    1.7.3 Linear MMSE Estimation of Random Variables
    1.7.4 Bayesian Hypothesis Testing
    1.7.5 Bayesian Interval Estimation

2 Random Processes
  2.1 Introduction
  2.2 Stationary Processes
    2.2.1 Gaussian Random Processes
    2.2.2 Integration and Differentiation
  2.3 Processing of Random Signals
    2.3.1 Power Spectral Density
    2.3.2 Linear Time-Invariant (LTI) Systems
    2.3.3 Power in a Frequency Band
    2.3.4 Gaussian Processes as Input to an LTI System
    2.3.5 White Noise
  2.4 Important Random Processes
    2.4.1 Poisson Process
    2.4.2 Merging and Splitting Poisson Processes
    2.4.3 Nonhomogeneous Poisson Processes
  2.5 Discrete Time Markov Chains
    2.5.1 Introduction
    2.5.2 State Transition Matrix and Diagram
    2.5.3 Classification of States
    2.5.4 Law of Total Probability with Recursion
    2.5.5 Stationary and Limiting Distributions
  2.6 Continuous-Time Markov Chains
    2.6.1 The Generator Matrix
    2.6.2 Uses of the Generator Matrix
    2.6.3 How do we use the Generator Matrix to find the Stationary Distribution?

List of Figures

2.1 A sample function of a Poisson random process
2.2 LTI system
2.3 Combining Poisson processes
2.4 Splitting Poisson processes
2.5 State transition diagram
Chapter 1

Statistical Inference

1.1 Introduction

Statistical inference is the study of methods for drawing conclusions from data that are subject to random variation. In our daily lives, most of the data we study is subject to such variation. In statistical inference, we would like to estimate an unknown quantity from the data we are provided with, and based on our approach, we have two types of inference:

1. Classical/Frequentist Inference - the unknown quantity is assumed to be a fixed (non-random) quantity.

2. Bayesian Inference - in the Bayesian approach, the unknown quantity Θ is assumed to be a random variable, and we assume that we initially know something about the distribution of Θ. After observing the data, we update the distribution of Θ using Bayes' rule.

When we sample, we prefer sampling with replacement to sampling without replacement, because with replacement all the samples are independent; in a large population the two schemes are almost the same, since the probability of choosing the same item twice is very small.

1.1.1 Simple Random Sample

The collection of random variables X_1, X_2, X_3, \ldots, X_n is said to be a simple random sample of size n if the variables are

1. Independent random variables.

2. Identically distributed, i.e., they have the same distribution:

F_{X_1}(x) = F_{X_2}(x) = \cdots = F_{X_n}(x), \quad \text{for all } x \in \mathbb{R}.

Together, these two conditions say the X_i are independent and identically distributed (i.i.d.). We generally mean a simple random sample when we say random sample.

Assuming that X_1, X_2, X_3, \ldots, X_n form a random sample, the sample mean is given by

\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}.

Properties of the sample mean:

1. E[\bar{X}] = \mu.

2. \mathrm{Var}(\bar{X}) = \sigma^2 / n.

3. \bar{X} satisfies the Weak Law of Large Numbers (WLLN).

4. The random variable

Z_n = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}

converges to a standard normal random variable as n becomes large.

Order Statistics

When we sort a random sample X_1, X_2, X_3, \ldots, X_n from the smallest to the largest random variable,

X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(n)},

the resulting sequence is called the order statistics.

The probability density function (PDF) of the i-th order statistic X_{(i)} is given by:

f_{X_{(i)}}(x) = \frac{n!}{(i-1)!\,(n-i)!}\, f_X(x)\,[F_X(x)]^{i-1}\,[1 - F_X(x)]^{n-i}

The cumulative distribution function (CDF) of the i-th order statistic X_{(i)} is given by:

F_{X_{(i)}}(x) = \sum_{k=i}^{n} \binom{n}{k} [F_X(x)]^k [1 - F_X(x)]^{n-k}

The joint PDF of the order statistics X_{(1)}, X_{(2)}, \ldots, X_{(n)} is given by:

f_{X_{(1)},\ldots,X_{(n)}}(x_1, x_2, \ldots, x_n) =
\begin{cases}
n!\, f_X(x_1) f_X(x_2) \cdots f_X(x_n) & \text{for } x_1 \le x_2 \le \cdots \le x_n \\
0 & \text{otherwise}
\end{cases}
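As a sanity check on the order-statistic PDF, the following sketch (Python with numpy/scipy, which are assumptions of this illustration and not part of the report) compares the formula against a Monte Carlo estimate for Uniform(0, 1) samples, where f_X(x) = 1 and F_X(x) = x, so X_{(i)} has a Beta(i, n − i + 1) density.

    import numpy as np
    from math import factorial
    from scipy.stats import beta

    n, i = 5, 2
    rng = np.random.default_rng(0)

    # Empirical: sort 100,000 samples of size n and keep the i-th smallest
    samples = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, i - 1]

    # Formula: f_{X_(i)}(x) = n!/((i-1)!(n-i)!) x^(i-1) (1-x)^(n-i) for Uniform(0,1)
    x = 0.3
    pdf_formula = factorial(n) / (factorial(i - 1) * factorial(n - i)) \
        * x**(i - 1) * (1 - x)**(n - i)
    print(pdf_formula)                 # equals the Beta(i, n-i+1) density
    print(beta.pdf(x, i, n - i + 1))   # scipy cross-check, same value
    hist, edges = np.histogram(samples, bins=50, range=(0, 1), density=True)
    print(hist[np.searchsorted(edges, x) - 1])  # Monte Carlo estimate near x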

1.2 Point Estimation

A point estimator is a function of Random Variables that is used to estimate the unknown

quantity.

Let Θ̂ = h(X1 , X2 , · · · , Xn ) be a point estimator for θ-

The bias of the point estimator Θ̂ is defined by:

B(Θ̂) = E[Θ̂] − θ

We would like the bias to be as close to zero as possible; when the bias is 0 for all values of θ, Θ̂ is said to be an unbiased estimator of θ.

Mean Squared Error

The mean squared error (MSE) of a point estimator Θ̂, denoted by MSE(Θ̂), is defined as:

\mathrm{MSE}(\hat{\Theta}) = E[(\hat{\Theta} - \theta)^2]

A lower MSE indicates a better estimator. If Θ̂ is a point estimator for θ, the MSE decomposes as:

\mathrm{MSE}(\hat{\Theta}) = \mathrm{Var}(\hat{\Theta}) + B(\hat{\Theta})^2

Let Θ̂_1, Θ̂_2, \ldots, Θ̂_n, \ldots be a sequence of point estimators of θ. We say that Θ̂_n is a consistent estimator of θ if:

\lim_{n \to \infty} P(|\hat{\Theta}_n - \theta| \ge \epsilon) = 0, \quad \text{for all } \epsilon > 0,

or if:

\lim_{n \to \infty} \mathrm{MSE}(\hat{\Theta}_n) = 0,

then Θ̂_n is a consistent estimator of θ. (The MSE condition is stronger than the first one: by Chebyshev's inequality, MSE convergence implies convergence in probability.)

Sample Variance

The sample variance S² is

S^2 = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})^2 = \frac{1}{n-1} \left( \sum_{k=1}^{n} X_k^2 - n\bar{X}^2 \right)

The divisor n − 1 is chosen precisely so that the bias of S² is 0.

The sample standard deviation is S = \sqrt{S^2}.

The sample standard deviation, unlike the sample variance, is a biased estimator of the standard deviation.
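A quick simulation (a Python sketch, assuming numpy; the choice of σ² = 4 and n = 10 is arbitrary) illustrates why the n − 1 divisor matters: averaged over many samples, S² recovers σ², while the n-divisor version systematically underestimates it.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma2 = 4.0
    n, trials = 10, 200_000
    x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

    s2_unbiased = x.var(axis=1, ddof=1)   # divide by n - 1
    s2_biased = x.var(axis=1, ddof=0)     # divide by n

    print(s2_unbiased.mean())  # ~4.0, matching sigma^2
    print(s2_biased.mean())    # ~3.6 = (n-1)/n * sigma^2, biased low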

1.3 Maximum Likelihood Estimation

Let X1 , X2 , X3 , . . . , Xn be a random sample from a distribution with a parameter θ. Suppose

that we have observed X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n. The maximum likelihood estimate of θ is given by the value of θ that maximizes the likelihood function.

If Xi ’s are discrete, then the likelihood function is defined as

L(x1 , x2 , . . . , xn ; θ) = PX1 X2 ···Xn (x1 , x2 , . . . , xn ; θ).

If Xi ’s are jointly continuous, then the likelihood function is defined as

L(x1 , x2 , . . . , xn ; θ) = fX1 X2 ···Xn (x1 , x2 , . . . , xn ; θ).

In some problems, it is easier to work with the log-likelihood function, given by ln L(x_1, x_2, \ldots, x_n; θ). Here, we took θ to be a single parameter, but it can also be a vector (θ_1, θ_2, \ldots, θ_k).
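For instance, for an i.i.d. Exponential(λ) sample, ln L = n ln λ − λ Σ x_i, which is maximized at λ̂_ML = 1/x̄. A minimal numerical check (a Python sketch, assuming numpy/scipy; the true rate 2.5 and sample size are made up for illustration):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1 / 2.5, size=5000)  # true rate lambda = 2.5

    # Negative log-likelihood of Exponential(lam): -(n*log(lam) - lam*sum(x))
    def nll(lam):
        return -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(nll, bounds=(1e-6, 50), method="bounded")
    print(res.x, 1 / x.mean())  # numerical MLE matches the closed form 1/x-bar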


1.3.1 Asymptotic Properties of MLEs

Let X_1, X_2, X_3, \ldots, X_n be a random sample from a distribution with parameter θ. The maximum likelihood estimator (MLE) of θ is denoted by \hat{\Theta}_{ML}. Under certain regularity conditions, \hat{\Theta}_{ML} has the following properties:

• Asymptotically Consistent: As the sample size n approaches infinity, the probability that \hat{\Theta}_{ML} deviates from θ by more than ϵ approaches zero:

\lim_{n \to \infty} P(|\hat{\Theta}_{ML} - \theta| > \epsilon) = 0

• Asymptotically Unbiased: As the sample size n becomes large, the expected value of \hat{\Theta}_{ML} approaches θ:

\lim_{n \to \infty} E[\hat{\Theta}_{ML}] = \theta

• Asymptotically Normal: For large n, \hat{\Theta}_{ML} is approximately normally distributed. Specifically, the random variable

\frac{\hat{\Theta}_{ML} - \theta}{\sqrt{\mathrm{Var}(\hat{\Theta}_{ML})}}

converges in distribution to the standard normal distribution N(0, 1).

1.4 Interval Estimation

Instead of giving a single point estimate θ̂ of θ, we estimate a range in which the true θ is likely to fall.

Let X1 , X2 , X3 , . . . , Xn be a random sample from a distribution with parameter θ. An interval

estimator with confidence level 1 − α consists of two estimators Θ̂l (X1 , X2 , . . . , Xn ) and

Θ̂h (X1 , X2 , . . . , Xn ) such that

P (Θ̂l ≤ θ ≤ Θ̂h ) ≥ 1 − α,

for every possible value of θ. We call [Θ̂l , Θ̂h ] a (1 − α) × 100% confidence interval for θ.

How do we calculate the interval? With the help of a pivotal quantity.


Pivotal quantity

Let X1 , X2 , X3 , . . . , Xn be a random sample from a distribution with parameter θ. The random

variable Q is called a pivotal quantity if:

1. It is a function of the observed data X1 , X2 , . . . , Xn and the unknown parameter θ,

but does not depend on any other unknown parameters:

Q = Q(X1 , X2 , . . . , Xn , θ).

2. The probability distribution of Q does not depend on θ or any other unknown parameters.

Steps to find the interval

By the CLT: a random sample X_1, X_2, X_3, \ldots, X_n is given from a distribution with known variance Var(X_i) = σ² < ∞, and n is large.

Parameter to be estimated: θ = E[X_i].

Confidence Interval:

\left[ \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right]

is approximately a (1 − α) × 100% confidence interval for θ. Sometimes we may replace σ with σ_max or with the sample standard deviation. Here, z_{α/2} is the value from the standard normal distribution such that a standard normal variable lies within ±z_{α/2} with probability 1 − α.
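A minimal sketch of this CLT-based interval (Python, assuming numpy/scipy; the exponential data are just an arbitrary non-normal example with large n, and σ is replaced by the sample standard deviation):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=2.0, size=400)   # true mean theta = 2, n large

    alpha = 0.05
    z = norm.ppf(1 - alpha / 2)                # z_{alpha/2} ~ 1.96
    s = x.std(ddof=1)                          # sample standard deviation
    half = z * s / np.sqrt(len(x))
    print(x.mean() - half, x.mean() + half)    # approximate 95% CI for the mean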

Chi-Squared Distribution

If Z1 , Z2 , . . . , Zn are independent standard normal random variables, the random variable Y

defined as

Y = Z12 + Z22 + · · · + Zn2

has a chi-squared distribution with n degrees of freedom:

Y ∼ χ2 (n).


The chi-squared distribution is a special case of the gamma distribution:

Y \sim \mathrm{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right).

Thus, the probability density function of Y is

f_Y(y) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{n/2 - 1} e^{-y/2}, \quad \text{for } y > 0.

The expected value and variance of Y are:

E[Y ] = n, Var(Y ) = 2n.

For any p ∈ [0, 1] and n ∈ ℕ, we define χ²_{p,n} as the value for which

P(Y > \chi^2_{p,n}) = p,

where Y ∼ χ²(n).

Let X_1, X_2, \ldots, X_n be i.i.d. N(µ, σ²) random variables, and let S² be the sample variance. Then the random variable Y defined as

Y = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2

has a chi-squared distribution with n − 1 degrees of freedom:

Y \sim \chi^2(n - 1).

Moreover, \bar{X} and S² are independent random variables.

The t-Distribution

Let Z ∼ N(0, 1) and Y ∼ χ²(n), where n ∈ ℕ, and assume that Z and Y are independent. The random variable T defined as

T = \frac{Z}{\sqrt{Y/n}}


is said to have a t-distribution with n degrees of freedom:

T ∼ t(n).

Properties:

The t-distribution has a bell-shaped PDF centered at 0, but its PDF is more spread out than the normal PDF.

E[T] = 0 for n > 1; E[T] is undefined for n = 1.

\mathrm{Var}(T) = \frac{n}{n-2}, \quad \text{for } n > 2; \quad \mathrm{Var}(T) \text{ is undefined for } n = 1, 2.

As n becomes large, the t-density approaches the standard normal PDF. More formally, we can write

T(n) \xrightarrow{d} N(0, 1).

For any p ∈ [0, 1] and n ∈ ℕ, we define t_{p,n} as the real value for which

P(T > t_{p,n}) = p.

Since the t-distribution has a symmetric PDF, we have

t_{1-p,n} = -t_{p,n}.

How does this help in the interval estimation of normal random variables?

Let X_1, X_2, \ldots, X_n be i.i.d. N(µ, σ²) random variables, and let S² be the sample variance. The random variable T defined as

T = \frac{\bar{X} - \mu}{S / \sqrt{n}}

has a t-distribution with n − 1 degrees of freedom, i.e., T ∼ t(n − 1).

We use this to estimate the mean of normal random variables when we don't know σ: for X_1, X_2, \ldots, X_n, the (1 − α) × 100% confidence interval for µ is given by

\left[ \bar{X} - t_{\alpha/2, n-1} \frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2, n-1} \frac{S}{\sqrt{n}} \right],

where \bar{X} is the sample mean, S is the sample standard deviation, and t_{α/2,n−1} is the critical value from the t-distribution with n − 1 degrees of freedom.

Confidence Intervals for the Variance of Normal Random Variables

The random variable Q defined as

Q = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2

has a chi-squared distribution with n − 1 degrees of freedom, i.e., Q ∼ χ²(n − 1). In particular, Q is a pivotal quantity, since it is a function of the X_i and σ², and its distribution does not depend on σ² or any other unknown parameters. Using the definition of χ²_{p,n}, a (1 − α) interval for Q can be stated as:

P\left( \chi^2_{1-\alpha/2,\,n-1} \le Q \le \chi^2_{\alpha/2,\,n-1} \right) = 1 - \alpha.

Therefore,

P\left( \chi^2_{1-\alpha/2,\,n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2,\,n-1} \right) = 1 - \alpha,

which is equivalent to

P\left( \frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}} \right) = 1 - \alpha.

Hence,

\left[ \frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}},\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}} \right]

is a (1 − α) × 100% confidence interval for σ².

Special note: the t-distribution method and the confidence interval for the variance apply to a normal random sample.
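Both normal-sample intervals can be computed directly; a sketch (Python, assuming numpy/scipy, with made-up data from N(10, 9)). Note that in the report's convention χ²_{p,n} is an upper-tail quantile, which is chi2.ppf(1 − p, n) in scipy:

    import numpy as np
    from scipy.stats import t, chi2

    rng = np.random.default_rng(4)
    x = rng.normal(10.0, 3.0, size=25)          # small normal sample
    n, alpha = len(x), 0.05
    xbar, s2 = x.mean(), x.var(ddof=1)

    # t-interval for the mean (sigma unknown)
    tc = t.ppf(1 - alpha / 2, df=n - 1)         # t_{alpha/2, n-1}
    print(xbar - tc * np.sqrt(s2 / n), xbar + tc * np.sqrt(s2 / n))

    # chi-squared interval for the variance
    lo = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)  # divide by chi^2_{alpha/2,n-1}
    hi = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)      # divide by chi^2_{1-alpha/2,n-1}
    print(lo, hi)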

1.5 Hypothesis Testing

1.5.1 General Definitions

In hypothesis testing, we want to make a decision about a parameter θ based on observed data. The set of all possible values of θ is denoted by S. We partition S into two disjoint subsets S_0 and S_1:


• S0 : The subset of S under the null hypothesis H0 .

• S1 : The subset of S under the alternative hypothesis H1 .

The hypotheses are defined as follows:

• Null Hypothesis (H0 ): θ ∈ S0

• Alternative Hypothesis (H1 ): θ ∈ S1

A hypothesis is called simple if the subset contains only one value of θ, and composite if it

contains more than one value.

Consider the set S = [0, 1]. We partition S into two subsets:

S_0 = \left\{ \tfrac{1}{2} \right\}, \qquad S_1 = [0, 1] \setminus \left\{ \tfrac{1}{2} \right\}.

Here:

• H_0: θ ∈ S_0, i.e., θ = 1/2 (simple hypothesis)

• H_1: θ ∈ S_1 = [0, 1] \setminus \{1/2\} (composite hypothesis)

Let X1 , X2 , . . . , Xn be a random sample. A statistic is a real-valued function of the sample

data. It is used to estimate a population parameter or to describe some aspect of the sample.

For example, the sample mean \bar{X} is a common statistic. It is defined as:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n},

where n is the sample size. The sample mean provides an estimate of the population mean.

A test statistic is a specific type of statistic that is used in the context of hypothesis testing. It

is a function of the sample data that is used to decide whether to reject the null hypothesis.

The choice of test statistic depends on the hypothesis being tested and the distribution of

the data.

For example, in a hypothesis test comparing the population mean to a specified value, the test statistic might be the sample mean, the t-statistic, or the z-statistic, depending on the sample size and variance properties. The set A of values of the test statistic for which we accept H_0 is called the acceptance region, and its complement ℝ \ A is the rejection region.

1.5.2 Hypothesis Testing: Type I and Type II Errors

We define Type I error as the event that we reject H0 when H0 is true.

P (Type I error | θ) = P (Reject H0 | θ) = P (W ∈ R | θ), for θ ∈ S0 .

If the probability of Type I error satisfies

P (Type I error) ≤ α, for all θ ∈ S0 ,

then we say that the test has significance level α, or simply that the test is a level α test. Note that it is often the case that the null hypothesis is a simple hypothesis, i.e., S_0 has only one element.

The second possible error that we can make is to accept H0 when H0 is false. This is called

the Type II error. Since the alternative hypothesis, H1 , is usually a composite hypothesis (so

it includes more than one value of θ), the probability of Type II error is usually a function of

θ. The probability of Type II error is usually denoted by β:

β(θ) = P (Accept H0 | θ), for θ ∈ S1 .

Summary of Two-Sided Hypothesis Testing for the Mean

In two-sided hypothesis testing, the null hypothesis is H_0: µ = µ_0 and the alternative hypothesis is H_1: µ ≠ µ_0.

Case | Test Statistic | Acceptance Region
X_i ∼ N(µ, σ²), σ known | W = (X̄ − µ_0)/(σ/√n) | |W| ≤ z_{α/2}
n large, X_i non-normal | W = (X̄ − µ_0)/(S/√n) | |W| ≤ z_{α/2}
X_i ∼ N(µ, σ²), σ unknown | W = (X̄ − µ_0)/(S/√n) | |W| ≤ t_{α/2,n−1}

One-sided hypothesis testing for the mean: H_0: µ ≤ µ_0, H_1: µ > µ_0.

Case | Test Statistic | Acceptance Region
X_i ∼ N(µ, σ²), σ known | W = (X̄ − µ_0)/(σ/√n) | W ≤ z_α
n large, X_i non-normal | W = (X̄ − µ_0)/(S/√n) | W ≤ z_α
X_i ∼ N(µ, σ²), σ unknown | W = (X̄ − µ_0)/(S/√n) | W ≤ t_{α,n−1}

Fun fact: just replace α/2 with α, and drop the absolute value (we reject only for large positive W), to obtain these one-sided results.

With H_0 and H_1 interchanged (H_0: µ ≥ µ_0, H_1: µ < µ_0), the acceptance regions are simply the negatives of the present ones (W ≥ −z_α, etc.), by symmetry.

The p-value is the lowest significance level α that results in rejecting H_0.
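As an illustration of the third row of the two-sided table (normal data, σ unknown), a Python sketch (assuming numpy/scipy; the data, µ_0 = 10, and α = 0.05 are hypothetical):

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(5)
    mu0, alpha = 10.0, 0.05
    x = rng.normal(10.8, 3.0, size=25)      # normal data, sigma unknown

    n = len(x)
    W = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # t-statistic

    # Two-sided test: accept H0 iff |W| <= t_{alpha/2, n-1}
    crit = t.ppf(1 - alpha / 2, df=n - 1)
    print("reject H0" if abs(W) > crit else "accept H0")

    # Two-sided p-value: smallest alpha at which H0 would be rejected
    print(2 * (1 - t.cdf(abs(W), df=n - 1)))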

1.5.3 Likelihood Ratio Tests

Let X1 , X2 , X3 , . . . , Xn be a random sample from a distribution with a parameter θ. Sup-

pose that we have observed X1 = x1 , X2 = x2 , . . . , Xn = xn . To decide between two simple

hypotheses:

H0 : θ = θ 0 ,

H1 : θ = θ 1 ,

we define the likelihood ratio λ as follows:

\lambda(x_1, x_2, \ldots, x_n) = \frac{L(x_1, x_2, \ldots, x_n; \theta_0)}{L(x_1, x_2, \ldots, x_n; \theta_1)},

where L(x_1, x_2, \ldots, x_n; θ) is the likelihood function given the data x_1, x_2, \ldots, x_n and parameter θ.

To perform a likelihood ratio test (LRT), we choose a constant c: we reject the null hypothesis H_0 if λ < c and accept it if λ ≥ c. The choice of the constant c depends on the significance level α, which is the probability of rejecting the null hypothesis when it is actually true.

When the hypotheses are not simple, we take the supremum of the likelihood function over each hypothesis set and define the likelihood ratio as the ratio of these suprema.
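As a minimal illustration for two simple hypotheses about the mean of a normal sample with known σ, a Python sketch (assuming numpy; the means, σ, data, and threshold c are all made up — in practice c would be chosen to meet a target significance level α):

    import numpy as np

    rng = np.random.default_rng(6)
    mu0, mu1, sigma = 0.0, 1.0, 2.0
    x = rng.normal(mu1, sigma, size=50)   # data actually drawn under H1

    def log_lik(mu):
        # log L(x; mu) for i.i.d. N(mu, sigma^2), dropping mu-free constants
        return -np.sum((x - mu) ** 2) / (2 * sigma**2)

    log_lambda = log_lik(mu0) - log_lik(mu1)   # log of the likelihood ratio
    c = 1.0                                    # illustrative threshold
    print("reject H0" if log_lambda < np.log(c) else "accept H0")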


1.6 Linear Regression

Linear regression is a method used to model the relationship between a dependent variable

(target) and one or more independent variables (predictors).

A simple linear regression model illustrates the methods used to fit the model. We assume that the x_i are observed values of a random variable X. The linear regression model is

written as:

Y = β0 + β1 X + ϵ

where:

• Y is the dependent variable.

• X is the independent variable.

• β0 is the intercept.

• β1 is the slope.

• ϵ is the error term.

Taking the expectation of both sides:

E[Y] = \beta_0 + \beta_1 E[X] + E[\epsilon].

If ϵ is normally distributed with mean 0 and variance σ², then E[ϵ] = 0 and

E[Y] = \beta_0 + \beta_1 E[X].

Thus:

\beta_0 = E[Y] - \beta_1 E[X].

Considering the covariance Cov(X, Y), and assuming ϵ is uncorrelated with X:

\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, \beta_0 + \beta_1 X + \epsilon) = \beta_1 \mathrm{Cov}(X, X).

Since Cov(X, X) = Var(X):

\beta_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}.

Estimating β_0 and β_1

Given observed pairs (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), we compute:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,

s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Using these, we estimate the coefficients:

\hat{\beta}_1 = \frac{s_{xy}}{s_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.

The Regression Line

The estimated regression line is:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.

For each x_i, the fitted value \hat{y}_i is:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.

The residuals are:

e_i = y_i - \hat{y}_i.

The Coefficient of Determination

The coefficient of determination, r², measures how well the observed data are represented by the linear model. It is defined as:

r^2 = \frac{s_{xy}^2}{s_{xx} s_{yy}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},

where

s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Here, \bar{x} and \bar{y} are the sample estimates of E[X] and E[Y], respectively.

The value of r² ranges from 0 to 1. A larger value of r² indicates that the linear model \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i is a good fit for the data.
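The least-squares formulas above translate directly into code; a sketch (Python, assuming numpy, with synthetic data whose true coefficients β_0 = 2 and β_1 = 0.5 are chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)   # true beta0=2, beta1=0.5

    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    syy = np.sum((y - ybar) ** 2)
    sxy = np.sum((x - xbar) * (y - ybar))

    beta1 = sxy / sxx                 # hat(beta)_1 = s_xy / s_xx
    beta0 = ybar - beta1 * xbar       # hat(beta)_0 = y-bar - hat(beta)_1 x-bar
    r2 = sxy**2 / (sxx * syy)         # coefficient of determination
    print(beta0, beta1, r2)           # ~2, ~0.5, close to 1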

Previously, our model had only one predictor (explanatory variable), x. We can consider models

with more than one explanatory variable. For example, if we want to predict a student’s

final exam score based on several factors such as the number of study hours, attendance rate,

and number of assignments completed.

y = β0 + β1 x + β2 z + · · · + βk w + ϵ,

where x, z, \ldots, w are the explanatory variables (number of study hours, attendance rate, and number of assignments completed). This is a multiple linear regression model. The method of least squares can be extended to compute estimates of β_0, β_1, \ldots, β_k.

It is worth noting that when we say linear regression, we mean linear in the unknown

parameters βi . For example, the model

y = β 0 + β 1 x + β 2 x2 + ϵ

is a linear regression model since it is linear in β0 , β1 , and β2 .

When running regression algorithms, one needs to be mindful of some practical considerations.

Issues such as overfitting, heteroscedasticity, and multicollinearity might cause problems

in regression analysis.

1.7 Bayesian Inference

In the Bayesian framework, we treat the unknown quantity, Θ, as a random variable. More

specifically, we assume that we have some initial guess about the distribution of Θ. This distri-

bution is called the prior distribution. After observing some data, we update the distribution

of Θ based on the observed data using Bayes’ Rule. This approach is known as the Bayesian

approach.

Bayesian inference is widely used in various fields. In medical diagnosis, it helps update the

probability of diseases based on new test results. In machine learning, it improves models

by incorporating new data. In economics, it updates forecasts with new economic indicators.

Additionally, it plays a crucial role in robotics for localization and mapping using sensor data.

We want to figure out an unknown variable X by looking at a related random variable

Y . We start with a prior guess about X, given by a distribution fX (x) if X is continuous, or

PX (x) if X is discrete. After observing Y , we update our guess about X using Bayes’ formula:

f_{X|Y}(x|y) = \frac{f_Y(y|x)\, f_X(x)}{f_Y(y)}

P_{X|Y}(x|y) = \frac{P_Y(y|x)\, P_X(x)}{P_Y(y)}

This updated guess, called the posterior distribution, helps us estimate X.


1.7.1 Maximum A Posteriori (MAP) Estimation

The MAP estimate of the random variable X, given that we have observed Y = y, is given by

the value of x that maximizes fX|Y (x|y) if X is a continuous random variable, or PX|Y (x|y)

if X is a discrete random variable. The MAP estimate is shown by x̂MAP .

1.7.2 Minimum Mean Squared Error (MMSE) Estimation

The MMSE estimate x̂_M given Y = y is:

\hat{x}_M = E[X \mid Y = y].

Proof:

1. Start with the expression for the expected MSE:

E[(X - \hat{x})^2 \mid Y = y] = \int (x - \hat{x})^2 f_{X|Y}(x|y)\, dx,

where f_{X|Y}(x|y) is the conditional probability density function of X given Y = y.

2. Differentiate with respect to x̂ and set the derivative to zero:

-2 \int (x - \hat{x}) f_{X|Y}(x|y)\, dx = 0.

3. Solve the resulting equation for x̂:

\hat{x} \int f_{X|Y}(x|y)\, dx = \int x\, f_{X|Y}(x|y)\, dx,

and since \int f_{X|Y}(x|y)\, dx = 1, we get x̂ = E[X | Y = y].

Properties of the MMSE estimator (writing X_M = E[X | Y] for the estimator as a random variable):

1. Expectation of the MMSE estimator: the MMSE estimator has the same expectation as X:

E[X_M] = E[X] \tag{1.1}

2. Uncorrelated estimation error: the estimation error X̃ = X − X_M and the MMSE estimator X_M are uncorrelated:

\mathrm{Cov}(\tilde{X}, X_M) = 0 \tag{1.2}

3. Decomposition of variance: the total variance of X decomposes into the sum of the variances of X_M and X̃:

\mathrm{Var}(X) = \mathrm{Var}(X_M) + \mathrm{Var}(\tilde{X}) \tag{1.3}

E[X^2] = E[X_M^2] + E[\tilde{X}^2] \tag{1.4}

1.7.3 Linear MMSE Estimation of Random Variables

The linear MMSE estimator of X given Y is of the form:

\hat{X}_L = g(Y) = aY + b.

To minimize the mean squared error (MSE),

\mathrm{MSE} = E[(X - \hat{X}_L)^2],

we find the optimal values of a and b as follows:

a = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} = \frac{E[XY] - E[X]E[Y]}{E[Y^2] - (E[Y])^2}, \qquad b = E[X] - aE[Y].

Derivation:

1. Expression for the MSE:

\mathrm{MSE} = E[(X - (aY + b))^2]

2. Expand the squared term:

\mathrm{MSE} = E[X^2 - 2X(aY + b) + (aY + b)^2]

3. Simplify using linearity of expectation:

\mathrm{MSE} = E[X^2] - 2aE[XY] - 2bE[X] + a^2 E[Y^2] + 2abE[Y] + b^2

4. Differentiate with respect to a:

\frac{\partial\, \mathrm{MSE}}{\partial a} = -2E[XY] + 2aE[Y^2] + 2bE[Y]

Setting the derivative to zero:

aE[Y^2] + bE[Y] = E[XY]

5. Differentiate with respect to b:

\frac{\partial\, \mathrm{MSE}}{\partial b} = -2E[X] + 2aE[Y] + 2b

Setting the derivative to zero:

b = E[X] - aE[Y]

6. Solving the system of equations, we get:

a = \frac{E[XY] - E[X]E[Y]}{E[Y^2] - (E[Y])^2}

7. Substituting a back into the equation for b:

b = E[X] - aE[Y]

Some properties of the linear MMSE estimator

The linear MMSE estimator of the random variable X, given that we have observed Y, is given by

\hat{X}_L = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} (Y - E[Y]) + E[X] = \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y]) + E[X].

The estimation error, defined as X̃ = X − X̂_L, satisfies the orthogonality principle:

E[\tilde{X}] = 0, \qquad \mathrm{Cov}(\tilde{X}, Y) = E[\tilde{X} Y] = 0.

The MSE of the linear MMSE estimator is given by

E[(X - \hat{X}_L)^2] = E[\tilde{X}^2] = (1 - \rho^2)\,\mathrm{Var}(X).
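A numerical check of these properties (a Python sketch, assuming numpy; the joint distribution used here — Y as a noisy observation of X — is an arbitrary example):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.normal(1.0, 2.0, size=200_000)
    y = x + rng.normal(0.0, 1.0, size=200_000)   # Y = X + noise

    a = np.cov(x, y)[0, 1] / y.var()             # Cov(X,Y) / Var(Y)
    b = x.mean() - a * y.mean()                  # b = E[X] - a E[Y]
    x_hat = a * y + b                            # linear MMSE estimate

    err = x - x_hat
    print(err.mean())                            # ~0 (E[X~] = 0)
    print(np.cov(err, y)[0, 1])                  # ~0 (X~ orthogonal to Y)
    rho2 = np.corrcoef(x, y)[0, 1] ** 2
    print(err.var(), (1 - rho2) * x.var())       # MSE = (1 - rho^2) Var(X)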

In practice, we usually have to estimate several random variables at once, so we work with random vectors. We estimate X̂_L by

\hat{\mathbf{X}}_L = \mathbf{C}_{XY} \mathbf{C}_Y^{-1} (\mathbf{Y} - E[\mathbf{Y}]) + E[\mathbf{X}].

In the above equation, C_Y is the covariance matrix of Y, defined as

\mathbf{C}_Y = E[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T],

and C_{XY} is the cross-covariance matrix of X and Y, defined as

\mathbf{C}_{XY} = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{Y} - E[\mathbf{Y}])^T].

Orthogonality Principle

To minimize the mean squared error (MSE) for a random vector, it is sufficient to minimize each E[(X_k − X̂_k)²] individually. This implies that we only need to focus on estimating a single random variable X given the observations of the random vector Y. Since we want our estimator to be linear, we can express it as:

\hat{X}_L = \sum_{k=1}^{n} a_k Y_k + b.

We observed these properties of the linear MMSE estimator:

E[\tilde{X}] = 0, \qquad \mathrm{Cov}(\tilde{X}, Y_j) = E[\tilde{X} Y_j] = 0, \quad \text{for all } j = 1, 2, \ldots, n.

These conditions are known as the orthogonality principle: the estimation error X̃ must be orthogonal to each of the observations Y_1, Y_2, \ldots, Y_n. Given that there are n + 1 unknown parameters (a_1, a_2, \ldots, a_n and b) and n + 1 corresponding equations, we can use the orthogonality principle to determine these parameters.

1.7.4 Bayesian Hypothesis Testing

The average error probability for a hypothesis test can be written as

Pe = P (choose H1 | H0 )P (H0 ) + P (choose H0 | H1 )P (H1 ).

The MAP decision rule achieves the minimum possible average error probability.

Minimum Cost Hypothesis Test:

Let

• C10 : The cost of choosing H1 , given that H0 is true.

• C01 : The cost of choosing H0 , given that H1 is true.


The total cost C can be written as:

C = P (choose H1 | H0 ) · [P (H0 )C10 ] + P (choose H0 | H1 ) · [P (H1 )C01 ] .

We choose H_0 if and only if

\frac{f_Y(y \mid H_0)}{f_Y(y \mid H_1)} \ge \frac{P(H_1)\, C_{01}}{P(H_0)\, C_{10}}.

Equivalently, we choose H_0 if and only if

P(H_0 \mid y)\, C_{10} \ge P(H_1 \mid y)\, C_{01}.

We call P(H_0 | y) C_{10} the posterior risk of accepting H_1.

1.7.5 Bayesian Interval Estimation

Given the observation Y = y, the interval [a, b] is said to be a 100(1 − α)% credible interval for

X, if the posterior probability of X being in [a, b] is equal to 1 − α,

P (a ≤ X ≤ b | Y = y) = 1 − α.

Chapter 2

Random Processes

2.1 Introduction

A random process (stochastic process) is a collection of random values that evolve over time or space. Example: let T(t) be the temperature in India at time t ∈ [0, ∞); we can assume here that t is measured in hours and that t = 0 refers to the time we start measuring the temperature.

If the random process consists of a countable collection of random variables, it is known as a discrete-time random process.

For example, suppose you start learning English on Duolingo, and the random variable is 1 if you have completed that day's streak and 0 otherwise. This random process can be written as {X_n, n = 1, 2, 3, \ldots}, with n being the number of days from the start. Since the index set is countable, this is a discrete-time random process.

A Continuous-time random process is a random process {X(t), t ∈ J}, where J is an

interval on the real line such as [−1, 1], [0, ∞), (−∞, ∞), etc.

A random process (called a random signal in engineering) can be thought of as a random function of time. For a random process {X(t), t ∈ J}, each possible realization of the function t ↦ X(t) is called a sample function, sample path, or realization; {X(t), t ∈ J} will equal one of many possible sample functions.

For a random process {X(t), t ∈ J}


The mean function µX (t) : J → R, is defined as

µX (t) = E[X(t)]

The autocorrelation function, or simply, the correlation function, RX (t1 , t2 ), is defined

by

RX (t1 , t2 ) = E[X(t1 )X(t2 )], for t1 , t2 ∈ J.

The autocovariance function, or simply, the covariance function, CX (t1 , t2 ), is defined by

CX (t1 , t2 ) = Cov(X(t1 ), X(t2 )) = RX (t1 , t2 ) − µX (t1 )µX (t2 ), for t1 , t2 ∈ J.

The cross-correlation function RXY (t1 , t2 ) is defined by

RXY (t1 , t2 ) = E[X(t1 )Y (t2 )], for t1 , t2 ∈ J.

The cross-covariance function CXY (t1 , t2 ) is defined by

CXY (t1 , t2 ) = Cov(X(t1 ), Y (t2 )) = RXY (t1 , t2 ) − µX (t1 )µY (t2 ), for t1 , t2 ∈ J.

Two random processes {X(t), t ∈ J} and {Y (t), t ∈ J ′ } are said to be independent if, for all

t1 , t2 , . . . , tm ∈ J and t′1 , t′2 , . . . , t′n ∈ J ′ , the set of random variables X(t1 ), X(t2 ), . . . , X(tm ) are

independent of the set of random variables Y (t′1 ), Y (t′2 ), . . . , Y (t′n ).

A continuous-time random process {X(t), t ∈ R} is called strict-sense stationary, or sta-

tionary, if for all t1 , t2 , . . . , tr ∈ R and all ∆ ∈ R

FX(t1 ),X(t2 ),...,X(tr ) (x1 , x2 , . . . , xr ) = FX(t1 +∆),X(t2 +∆),...,X(tr +∆) (x1 , x2 , . . . , xr ).

Similarly, strict-sense stationarity can be defined for a discrete-time random process by replacing ∆ with an integer and taking the times t_1, t_2, \ldots, t_r in the index set J.

A random process is weak-sense stationary or wide-sense stationary (WSS) if its mean is constant and its autocorrelation depends only on the time difference.

If the random process is discrete, {X(n), n ∈ ℤ}:

\mu_X(n) = \mu_X, \quad \text{for all } n \in \mathbb{Z},

R_X(n_1, n_2) = R_X(n_1 - n_2), \quad \text{for all } n_1, n_2 \in \mathbb{Z}.

If the random process is continuous, {X(t), t ∈ ℝ}:

\mu_X(t) = \mu_X, \quad \text{for all } t \in \mathbb{R},

R_X(t_1, t_2) = R_X(t_1 - t_2), \quad \text{for all } t_1, t_2 \in \mathbb{R}.

The expected (average) power of X(t) at time t is E[X(t)²].

For a WSS random process:

R_X(\tau) = E[X(t)X(t - \tau)] = E[X(t + \tau)X(t)]

1. By the Cauchy-Schwarz inequality, for any random variables X and Y, we have:

|E[XY]| \le \sqrt{E[X^2]\, E[Y^2]}

2. Applying the Cauchy-Schwarz inequality with X = X(t) and Y = X(t − τ):

|R_X(\tau)| = |E[X(t)X(t - \tau)]| \le \sqrt{E[X(t)^2]\, E[X(t - \tau)^2]}

3. For a wide-sense stationary process, E[X(t)²] = E[X(t − τ)²] = R_X(0):

|R_X(\tau)| \le \sqrt{R_X(0) \cdot R_X(0)} = R_X(0)

4. Hence:

|R_X(\tau)| \le R_X(0), \quad \text{for all } \tau \in \mathbb{R}


2.2 Stationary Processes

Two random processes {X(t), t ∈ R} and {Y (t), t ∈ R} are said to be jointly wide-sense

stationary if X(t) and Y (t) are each wide-sense stationary and

RXY (t1 , t2 ) = RXY (t1 − t2 ).

A continuous-time random process {X(t), t ∈ ℝ} is weak-sense cyclostationary if there exists a positive real number T such that:

\mu_X(t + T) = \mu_X(t), \quad \text{for all } t \in \mathbb{R},

R_X(t_1 + T, t_2 + T) = R_X(t_1, t_2), \quad \text{for all } t_1, t_2 \in \mathbb{R},

where µ_X(t) denotes the mean of X(t), and R_X(t_1, t_2) denotes the autocorrelation function of X(t) at times t_1 and t_2. Similarly, cyclostationarity can be defined for a discrete-time random process, but there T would be a natural number and t_1, t_2, t are integers.

Mean Square Continuity

Let X(t) be a continuous-time random process. We say that X(t) is mean-square continuous at time t if

\lim_{\delta \to 0} E\left[ |X(t + \delta) - X(t)|^2 \right] = 0.

It is worth noting that a Poisson process has jumps; however, those jumps are not very dense in time, so the random process is still continuous in the mean-square sense.


Figure 2.1: A sample function of a Poisson random process

2.2.1 Gaussian Random Processes

A random process {X(t), t ∈ J} is said to be a Gaussian (normal) random process if, for

all t1 , t2 , . . . , tn ∈ J, the random variables X(t1 ), X(t2 ), . . . , X(tn ) are jointly normal.

Two random processes {X(t), t ∈ J} and {Y (t), t ∈ J ′ } are considered jointly Gaussian

if, for any selections of t1 , t2 , . . . , tm ∈ J and t′1 , t′2 , . . . , t′n ∈ J ′ , the set of random variables

X(t1 ), X(t2 ), . . . , X(tm ), Y (t′1 ), Y (t′2 ), . . . , Y (t′n ) is jointly normal.

2.2.2 Integration and Differentiation

A random process X(t) can have a derivative Y(t) = \frac{d}{dt} X(t), which is also a random process. For smooth processes, the derivative is straightforward. For example, if X(t) = A + Bt + Ct², where A, B, and C are random variables, then X'(t) = B + 2Ct.

To handle derivatives and integrals of random processes, we assume some regularity conditions:

1. Continuity: X(t) should be continuous in t.

2. Mean-Square Integrability: E\left[ \int_0^t X(u)^2\, du \right] < \infty.

3. Mean-Square Differentiability: E\left[ \left( \frac{X(t+h) - X(t)}{h} - X'(t) \right)^2 \right] \to 0 as h \to 0.

These conditions ensure well-defined differentiation and integration.

• Linearity: Differentiation and integration are linear operations, allowing interchange with expectation:

E\left[ \int_0^t X(u)\, du \right] = \int_0^t E[X(u)]\, du,

E\left[ \frac{d}{dt} X(t) \right] = \frac{d}{dt} E[X(t)].

The regularity conditions ensure that operations on random processes are well-behaved and that expectations can be interchanged with differentiation and integration.

2.3 Processing of Random Signals

2.3.1 Power Spectral Density

For a WSS random process X(t):

The power spectral density S_X(f) is the Fourier transform of R_X(τ):

S_X(f) = \mathcal{F}\{R_X(\tau)\} = \int_{-\infty}^{\infty} R_X(\tau)\, e^{-2j\pi f \tau}\, d\tau, \quad \text{where } j = \sqrt{-1}.

Setting τ = 0 in the inverse transform, we get the expected power:

E[X(t)^2] = R_X(0) = \int_{-\infty}^{\infty} S_X(f)\, df.

For two jointly WSS random processes X(t) and Y(t), we define the cross spectral density S_{XY}(f) as the Fourier transform of the cross-correlation function R_{XY}(τ):

S_{XY}(f) = \mathcal{F}\{R_{XY}(\tau)\} = \int_{-\infty}^{\infty} R_{XY}(\tau)\, e^{-2j\pi f \tau}\, d\tau.
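As a numerical check (a Python sketch, assuming numpy), take the example autocorrelation R_X(τ) = e^{−|τ|}, whose Fourier transform is S_X(f) = 2/(1 + (2πf)²); direct quadrature of the defining integral reproduces the closed form:

    import numpy as np

    f = 0.3
    tau = np.linspace(-50, 50, 200_001)
    dtau = tau[1] - tau[0]
    Rx = np.exp(-np.abs(tau))             # example: R_X(tau) = e^{-|tau|}

    # S_X(f) = integral of R_X(tau) e^{-2j pi f tau} d tau (Riemann sum)
    Sx = (Rx * np.exp(-2j * np.pi * f * tau)).sum().real * dtau
    print(Sx, 2 / (1 + (2 * np.pi * f) ** 2))   # numerical vs. closed form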

2.3.2 Linear Time-Invariant (LTI) Systems

A linear time-invariant (LTI) system is a type of system used in signal processing. It has

two important properties:

1. Linearity: If the input is the sum of two signals, the output is the sum of the outputs that each signal would produce individually (and scaling the input scales the output by the same factor).


2. Time Invariance: If you shift a signal in time before feeding it into the system, the output

will be the same as if you had shifted the output in time after the signal had passed through

the system.

The impulse response h(t) is the output of the system when the input is a very short signal

called an impulse δ(t).

When you provide any input signal X(t) to the system, the output Y (t) can be found by combin-

ing (convolving) the input signal with the impulse response. This is done using a mathematical

operation called convolution.

The convolution of the input signal X(t) with the impulse response h(t) gives the output Y(t):

Y(t) = \int_{-\infty}^{\infty} X(\tau)\, h(t - \tau)\, d\tau

Figure 2.2: LTI-System

Let X(t) be a WSS random process and Y (t) be given by

Y (t) = h(t) ∗ X(t),

where h(t) is the impulse response of the system. Then X(t) and Y (t) are jointly WSS.


Moreover,

\mu_Y(t) = \mu_Y = \mu_X \int_{-\infty}^{\infty} h(\alpha)\, d\alpha;

R_{XY}(\tau) = h(-\tau) * R_X(\tau) = \int_{-\infty}^{\infty} h(-\alpha) R_X(\tau - \alpha)\, d\alpha;

R_Y(\tau) = h(\tau) * h(-\tau) * R_X(\tau).

Proof

We begin with the calculation of µ_Y(t):

\mu_Y(t) = E[Y(t)] = E\left[ \int_{-\infty}^{\infty} h(\alpha) X(t - \alpha)\, d\alpha \right]
= \int_{-\infty}^{\infty} h(\alpha) E[X(t - \alpha)]\, d\alpha
= \int_{-\infty}^{\infty} h(\alpha)\, \mu_X\, d\alpha
= \mu_X \int_{-\infty}^{\infty} h(\alpha)\, d\alpha.

Since µ_Y(t) is not a function of t, µ_Y(t) = µ_Y = µ_X \int_{-\infty}^{\infty} h(α) dα.

R_{XY}(t_1, t_2) = E[X(t_1) Y(t_2)]
= E\left[ X(t_1) \int_{-\infty}^{\infty} h(\alpha) X(t_2 - \alpha)\, d\alpha \right]
= \int_{-\infty}^{\infty} h(\alpha) E[X(t_1) X(t_2 - \alpha)]\, d\alpha
= \int_{-\infty}^{\infty} h(\alpha) R_X(t_1, t_2 - \alpha)\, d\alpha
= \int_{-\infty}^{\infty} h(\alpha) R_X(t_1 - t_2 + \alpha)\, d\alpha \quad \text{(since } X(t) \text{ is WSS)}.

We note that R_{XY}(t_1, t_2) is only a function of τ = t_1 − t_2, so we may write

R_{XY}(\tau) = \int_{-\infty}^{\infty} h(\alpha) R_X(\tau + \alpha)\, d\alpha = h(\tau) * R_X(-\tau) = h(-\tau) * R_X(\tau).

Similarly, for the autocorrelation function of Y(t), we have:

R_Y(\tau) = h(\tau) * h(-\tau) * R_X(\tau).


There are some advantages to working in the frequency domain, so we convert time-domain functions to frequency-domain functions via the Fourier transform.

The Fourier transform of h(t) is given by:

H(f) = \mathcal{F}\{h(t)\} = \int_{-\infty}^{\infty} h(t)\, e^{-2j\pi f t}\, dt.

H(f) is called the transfer function of the system.

We know that

\mathcal{F}\{h(-t)\} = H(-f) = H^*(f).

Applying the Fourier transform to R_{XY}(τ) = h(−τ) ∗ R_X(τ) gives

S_{XY}(f) = S_X(f) H(-f) = S_X(f) H^*(f),

and similarly, from R_Y(τ) = h(τ) ∗ h(−τ) ∗ R_X(τ),

S_Y(f) = S_X(f)\, |H(f)|^2.

2.3.3 Power in a Frequency Band

Let us take the transfer function to be

H(f) = \begin{cases} 1 & \text{if } f_1 < |f| < f_2 \\ 0 & \text{otherwise} \end{cases}

This is, in fact, a bandpass filter, as it eliminates every frequency outside of the band f_1 < |f| < f_2. Thus, the resulting random process Y(t) is a filtered version of X(t) in which frequency components in the band f_1 < |f| < f_2 are preserved.

Now, let's find the expected power in Y(t). We have:

S_Y(f) = S_X(f)\, |H(f)|^2 = \begin{cases} S_X(f) & \text{if } f_1 < |f| < f_2 \\ 0 & \text{otherwise} \end{cases}

Thus, the power in Y(t) is:

E[Y(t)^2] = \int_{-\infty}^{\infty} S_Y(f)\, df = \int_{-f_2}^{-f_1} S_X(f)\, df + \int_{f_1}^{f_2} S_X(f)\, df.

Since S_X(−f) = S_X(f),

E[Y(t)^2] = 2 \int_{f_1}^{f_2} S_X(f)\, df.

2.3.4 Gaussian Processes as input to LTI System

Let X(t) be a stationary Gaussian process. If X(t) is the input to an LTI system, then the output random process Y(t) is also a stationary Gaussian process. Moreover, X(t) and Y(t) are jointly Gaussian.

2.3.5 White Noise

The random process X(t) is called a white noise process if

S_X(f) = \frac{N_0}{2}, \quad \text{for all } f.

White noise has infinite power, since the integral of a constant S_X(f) over all frequencies diverges. The PSD of thermal noise is similar to the PSD of white noise over a certain frequency range and decreases outside that range. Usually, thermal noise is modeled as Gaussian white noise, which adds the condition that X(t) is a stationary Gaussian random process with zero mean, µ_X = 0.

2.4 Important Random Processes

2.4.1 Poisson Process

A random process {N(t), t ∈ [0, ∞)} is said to be a counting process if N(t) is the number of events that have occurred from time 0 up to and including time t. For a counting process, we assume:

1. N(0) = 0;

2. N(t) ∈ {0, 1, 2, \ldots}, for all t ∈ [0, ∞);

3. For 0 ≤ s < t, N(t) − N(s) is the number of events that occur in the interval (s, t].

Independent Increments

Let {X(t), t ∈ [0, ∞)} be a continuous-time random process. We say that X(t) has indepen-

dent increments if, for all 0 ≤ t1 < t2 < t3 < · · · < tn , the random variables

X(t2 ) − X(t1 ), X(t3 ) − X(t2 ), ··· , X(tn ) − X(tn−1 )

are independent.

Stationary Increments

A continuous-time random process X(t) is said to have stationary increments if, for any t_2 > t_1 ≥ 0 and any shift r > 0, the random variables X(t_2) − X(t_1) and X(t_2 + r) − X(t_1 + r) have the same probability distribution. This means that the probability distribution of the increment depends solely on the length of the interval (t_1, t_2] and is unaffected by the position of the interval along the time axis.

Let λ > 0 be a fixed number. The counting process {N (t), t ∈ [0, ∞)} is called a Poisson

process with rate λ if the following conditions hold:

1. N (t) has independent increments

2. The number of arrivals in any interval of length τ > 0 has a Poisson(λτ ) distribution.

If N (t) is a Poisson process with rate λ, then the interarrival times X1 , X2 , . . . are inde-

pendent and Xi ∼ Exponential(λ) for i = 1, 2, 3, . . ..

1. First Arrival Time X1

The first arrival time X1 is the time of the first arrival. The probability that no arrival happens

in time t is given by:

P (X1 > t) = P (N (t) = 0) = e−λt


This means X1 follows an exponential distribution with rate λ:

X1 ∼ Exponential(λ)

2. Second Arrival Time X_2

Given that the first arrival occurred at time s, the probability that no arrival happens in the interval (s, s + t] is:

P(X_2 > t \mid X_1 = s) = P(\text{no arrival in } (s, s+t]) = e^{-\lambda t}.

The time between the first and second arrivals, X_2, is therefore also exponentially distributed with rate λ, and it does not depend on X_1. Hence X_2 is independent of X_1 and X_2 ∼ Exponential(λ).

3. General Case for Xi

The same reasoning applies to all subsequent interarrival times X3 , X4 , . . .. Each Xi is inde-

pendent and follows an exponential distribution with rate λ:

Xi ∼ Exponential(λ) for i = 1, 2, 3, . . .

The arrival time T_n is the sum of the first n interarrival times:

T_n = X_1 + X_2 + \cdots + X_n.

By this, we can say that the arrival times T_1, T_2, \ldots have a Gamma distribution with parameters n and λ. Specifically, for n = 1, 2, 3, \ldots, we have:

E[T_n] = \frac{n}{\lambda}, \quad \text{and} \quad \mathrm{Var}(T_n) = \frac{n}{\lambda^2}.
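The interarrival characterization gives a direct way to simulate a Poisson process; a sketch (Python, assuming numpy; the rate λ = 2 and horizon are arbitrary):

    import numpy as np

    rng = np.random.default_rng(9)
    lam, t_max = 2.0, 10.0

    # Simulate by accumulating Exponential(lambda) interarrival times
    interarrivals = rng.exponential(scale=1 / lam, size=1000)
    arrivals = np.cumsum(interarrivals)        # arrival times T_1, T_2, ...
    arrivals = arrivals[arrivals <= t_max]     # keep arrivals in [0, t_max]

    print(len(arrivals))    # N(t_max): approximately Poisson(lam * t_max), mean 20
    print(arrivals[:5])     # first few arrival times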

2.4.2 Merging and Splitting Poisson Processes

Let N1 (t), N2 (t), . . . , Nm (t) be m independent Poisson processes with rates λ1 , λ2 , . . . , λm . De-

fine N (t) as:

N (t) = N1 (t) + N2 (t) + · · · + Nm (t) for all t ∈ [0, ∞).


We want to show that N(t) is a Poisson process with rate λ = λ_1 + λ_2 + \cdots + λ_m.

Initial Condition:

N(0) = N_1(0) + N_2(0) + \cdots + N_m(0) = 0.

Independent Increments: Since the N_i(t) are independent, the increments N(t) − N(s) = \sum_{i=1}^{m} (N_i(t) − N_i(s)) are independent for disjoint intervals.

Stationary Increments: For any t ≥ 0 and ∆t > 0:

N_i(t + \Delta t) - N_i(t) \sim \mathrm{Poisson}(\lambda_i \Delta t).

Thus, N(t + \Delta t) - N(t) \sim \mathrm{Poisson}\left( \sum_{i=1}^{m} \lambda_i \Delta t \right) = \mathrm{Poisson}((\lambda_1 + \lambda_2 + \cdots + \lambda_m)\Delta t).

The fact that the sum of m independent Poisson random variables is another Poisson random variable (with rate equal to the sum of the individual rates) can be shown using their moment generating functions (MGFs).

Figure 2.3: Combining Poisson Processes

Splitting

Let N (t) be a Poisson process with rate λ. We split N (t) into two processes, N1 (t) and N2 (t),

as follows: Each arrival is assigned randomly to either N1 (t) or N2 (t) based on the outcome

of an independent coin toss with probability p for heads (H). If heads, the arrival goes to

N1 (t); otherwise, it goes to N2 (t). The coin tosses are independent of each other and of N (t).

Consequently,

• N1 (t) behaves as a Poisson process with rate λp,


• N2 (t) behaves as a Poisson process with rate λ(1 − p),

• N1 (t) and N2 (t) are independent of each other.

Figure 2.4: Splitting Poisson Processes

2.4.3 Nonhomogeneous Poisson Processes

If the rate of the Poisson process changes with time, the process is called a nonhomogeneous Poisson process.

The number of arrivals in an interval (s, s + t] is distributed as \mathrm{Poisson}\left( \int_s^{s+t} \lambda(\theta)\, d\theta \right).

Analogous to the second definition of the Poisson process, we have the following:

Let λ(t) : [0, ∞) → [0, ∞) be an integrable function. The counting process {N (t), t ∈ [0, ∞)}

is called a Nonhomogeneous Poisson process with rate λ(t) if all the following conditions

hold:

• N (0) = 0,

• N (t) has independent increments,

• For any t ∈ [0, ∞), we have:

P (N (t + ∆) − N (t) = 0) = 1 − λ(t)∆ + o(∆),

P (N (t + ∆) − N (t) = 1) = λ(t)∆ + o(∆),

P (N (t + ∆) − N (t) ≥ 2) = o(∆).


If we replace λ(t) with a constant λ, we recover the second definition of the (homogeneous) Poisson process.

2.5 Discrete Time Markov Chains

2.5.1 Introduction

Consider the random process {X_n, n = 0, 1, 2, \ldots}, where the range of each X_n is S ⊂ {0, 1, 2, \ldots}. We say that this process is a Markov chain if

P (Xm+1 = j | Xm = i, Xm−1 = im−1 , . . . , X0 = i0 ) = P (Xm+1 = j | Xm = i),

for all m, j, i, i0 , i1 , . . . , im−1 .

If the number of states is finite, e.g., S = {0, 1, 2, . . . , r}, we call it a finite Markov chain.

The probabilities pij = P (Xm+1 = j | Xm = i) are called transition probabilities.

2.5.2 State Transition Matrix and Diagram

We store the transition probabilities in a matrix called the state transition matrix or transition probability matrix. For a finite Markov chain with r states,

P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1r} \\ p_{21} & p_{22} & \cdots & p_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1} & p_{r2} & \cdots & p_{rr} \end{bmatrix}

We usually show a Markov process using a state transition diagram. Consider a Markov chain with three possible states, 1, 2, and 3, and the following transition probability matrix:

P = \begin{bmatrix} 1/4 & 1/4 & 1/2 \\ 1/3 & 0 & 2/3 \\ 1/4 & 2/3 & 1/12 \end{bmatrix}

We represent this by the state transition diagram below.

Figure 2.5: State transition diagram

In this diagram, there are three possible states, 1, 2, and 3, and the arrows from each state to other states show the transition probabilities p_{ij}. When there is no arrow from state i to state j, it means that p_{ij} = 0.

n-Step Transition Matrix

Consider a Markov chain {X_n, n = 0, 1, 2, \ldots}, where X_n ∈ S. If X_0 = i, then X_1 = j with probability p_{ij}. That is, p_{ij} gives us the probability of going from state i to state j in one step. Now suppose that we are interested in finding the probability of going from state i to state j in two steps, i.e.,

p_{ij}^{(2)} = P(X_2 = j \mid X_0 = i).

We can find this probability by applying the law of total probability. In particular, we argue that X_1 can take one of the possible values in S. Thus, we can write

p_{ij}^{(2)} = \sum_{k \in S} P(X_2 = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) = \sum_{k \in S} P(X_2 = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i).

We conclude

p_{ij}^{(2)} = P(X_2 = j \mid X_0 = i) = \sum_{k \in S} p_{ik}\, p_{kj}.

We can explain the above formula as follows: in order to get to state j, we need to pass through some intermediate state k. The probability of this event is p_{ik} p_{kj}. To obtain p_{ij}^{(2)}, we sum over all possible intermediate states. Accordingly, we can define the two-step transition matrix as follows:

P^{(2)} = \begin{bmatrix} p_{11}^{(2)} & p_{12}^{(2)} & \cdots & p_{1r}^{(2)} \\ p_{21}^{(2)} & p_{22}^{(2)} & \cdots & p_{2r}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1}^{(2)} & p_{r2}^{(2)} & \cdots & p_{rr}^{(2)} \end{bmatrix}

Looking at the previous equation, we notice that p_{ij}^{(2)} is in fact the element in the i-th row and j-th column of the matrix product P · P. Thus, we conclude that the two-step transition matrix can be obtained by squaring the state transition matrix, i.e., P^{(2)} = P².

More generally, we can define the n-step transition probabilities p_{ij}^{(n)} as

p_{ij}^{(n)} = P(X_n = j \mid X_0 = i), \quad \text{for } n = 0, 1, 2, \ldots,

and the n-step transition matrix, P^{(n)}, as the matrix with entries p_{ij}^{(n)}:

P^{(n)} = \begin{bmatrix} p_{11}^{(n)} & p_{12}^{(n)} & \cdots & p_{1r}^{(n)} \\ p_{21}^{(n)} & p_{22}^{(n)} & \cdots & p_{2r}^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ p_{r1}^{(n)} & p_{r2}^{(n)} & \cdots & p_{rr}^{(n)} \end{bmatrix}

We can now generalize the previous equation. Let m and n be two positive integers and assume X_0 = i. In order to get to state j in m + n steps, the chain will be at some intermediate state k after m steps. To obtain p_{ij}^{(m+n)}, we sum over all possible intermediate states. This is the Chapman-Kolmogorov equation:

p_{ij}^{(m+n)} = P(X_{m+n} = j \mid X_0 = i) = \sum_{k \in S} p_{ik}^{(m)}\, p_{kj}^{(n)}.

The n-step transition matrix is given by

P^{(n)} = P^n, \quad \text{for } n = 1, 2, 3, \ldots.
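The relation P^{(n)} = P^n is easy to exercise numerically with the three-state example above (a Python sketch, assuming numpy):

    import numpy as np

    # Transition matrix of the three-state example in Section 2.5.2
    P = np.array([
        [1/4, 1/4, 1/2],
        [1/3, 0.0, 2/3],
        [1/4, 2/3, 1/12],
    ])

    P2 = P @ P                                  # two-step matrix P^(2)
    print(P2[0, 2])                             # p_13^(2): 1 -> 3 in two steps
    print(np.linalg.matrix_power(P, 5))         # five-step matrix P^(5)
    print(np.linalg.matrix_power(P, 5).sum(axis=1))  # rows still sum to 1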

2.5.3 Classification of States


We say that state j is reachable from state i, denoted by i → j, if p_{ij}^{(n)} > 0 for some n. Every state is assumed to be reachable from itself, since p_{ii}^{(0)} = 1. Two states i and j are said to communicate, denoted by i ↔ j, if each state is reachable from the other. In other words, i ↔ j means i → j and j → i.

Note: communication is an equivalence relation.

A Markov chain is said to be irreducible if it has only one communicating class. This means

that every state in the Markov chain can be reached from every other state, ensuring that

all states communicate with each other. This property is desirable because it simplifies the

analysis of the limiting behavior of the Markov chain.

A state is said to be recurrent if, any time we leave that state, we will return to it in the future with probability one. Conversely, if the probability of returning to the state is less than one, the state is called transient.

For any state i, we define

fii = P (Xn = i for some n ≥ 1 | X0 = i).

• The state i is recurrent if fii = 1.

• The state i is transient if fii < 1.

If two states are in the same class, they are either both recurrent or both transient. (By a class, we mean a communicating class.)


There are two kinds of classes:

1. Recurrent Class: If the Markov chain enters this class, it will always stay in that class.

All the states are recurrent.

2. Transient Class: The Markov chain might enter and stay in this class for a while, but

at some point, it will leave and never return. All the states are transient.

Consider a Markov chain starting from state X_0 = i. If i is a recurrent state, the chain will return to state i every time it leaves that state. Consequently, the chain will visit state i an infinite number of times. Conversely, if i is a transient state, the chain will return to state i with probability f_{ii} < 1. In this case, the total number of visits to state i follows a Geometric distribution with success probability 1 − f_{ii} (returning to the same state is treated as a failure).

For a discrete-time Markov chain, let V be the total number of visits to state i. Then:

• If i is a recurrent state, then

P (V = ∞ | X0 = i) = 1.

• If i is a transient state, then

V | X0 = i ∼ Geometric(1 − fii ).

Periodicity

The period of a state i in a Markov chain is the largest number d such that p_{ii}^{(n)} = 0 whenever n is not divisible by d. This period is denoted d(i). If p_{ii}^{(n)} = 0 for all n > 0, then d(i) = ∞.

• A state i is periodic if d(i) > 1.

• A state i is aperiodic if d(i) = 1.

All states in the same communicating class have the same period. So, a class is periodic if

its states are periodic, and aperiodic if its states are aperiodic.

For a finite irreducible Markov chain Xn :

• If there is a self-transition (pii > 0 for some i), the chain is aperiodic.


• If state i can return to itself in l steps (p_{ii}^{(l)} > 0) and also in m steps (p_{ii}^{(m)} > 0), with gcd(l, m) = 1, then state i is aperiodic.

• The chain is aperiodic if and only if there is a positive integer n such that all elements of the matrix P^n are positive:

p_{ij}^{(n)} > 0 \quad \text{for all } i, j \in S.

2.5.4 Law of Total Probability with Recursion

Absorption Probabilities

Consider a finite Markov chain {Xn } with states S = {0, 1, 2, . . . , r}, where each state is either

absorbing or transient. Let l ∈ S be an absorbing state. ai is defined as the probability

that the chain, starting from state i, will eventually be absorbed into state l.

ai = P (absorption in l | X0 = i), for all i ∈ S.

al = 1 and aj = 0 for all other absorbing states j ̸= l.

If the chain enters some other absorbing state j ≠ l (the case above), it gets stuck there and can never reach state l, so a_j = 0.

To find a_i for each i ∈ S, we use the equation:

a_i = \sum_{k \in S} a_k\, p_{ik},

where p_{ik} denotes the transition probability from state i to state k.
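Solving these equations is a small linear-algebra problem; a sketch (Python, assuming numpy) for a hypothetical four-state chain in which states 0 and 3 are absorbing and we want absorption into l = 0:

    import numpy as np

    # Hypothetical chain: states 0 and 3 absorbing, 1 and 2 transient
    P = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.3, 0.0, 0.7, 0.0],
        [0.0, 0.5, 0.0, 0.5],
        [0.0, 0.0, 0.0, 1.0],
    ])
    l = 0                                  # target absorbing state: a_0 = 1, a_3 = 0

    # a_i = sum_k p_ik a_k for transient i gives (I - P_TT) a_T = P[T, l]
    trans = [1, 2]
    A = np.eye(len(trans)) - P[np.ix_(trans, trans)]
    b = P[np.ix_(trans, [l])].ravel()      # one-step absorption into l
    a = np.linalg.solve(A, b)
    print(a)                               # [a_1, a_2] ~ [0.4615, 0.2308]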

Mean Hitting Times

Consider a finite Markov chain {X_n} with states S = {0, 1, 2, \ldots, r}. Let A ⊆ S be a subset of states, and let T be the first time the chain visits a state in A. For each state i ∈ S, let t_i denote the expected number of steps until the chain first visits a state in A, starting from i.

t_j = 0 \quad \text{for all } j \in A,

since if j is already in A, the expected time to first visit A is zero.

To find t_i for all i ∈ S, we use the equation:

t_i = 1 + \sum_{k \in S \setminus A} t_k\, p_{ik},

where p_{ik} is the probability of transitioning from state i to state k.

Mean Return Times

Consider a finite irreducible Markov chain {Xn } with state space S = {0, 1, 2, . . . , r}. Let l ∈ S

be a state. The mean return time to state l, rl , is the expected number of steps it takes

to return to l after leaving it.

The mean return time r_l is:

r_l = 1 + \sum_{k \in S} t_k\, p_{lk}.

For finding the t_k:

• t_l = 0.

• If k ≠ l, then t_k = 1 + \sum_{j \in S} t_j\, p_{kj}.

Here p_{ij} denotes the transition probability from state i to state j.

2.5.5 Stationary and Limiting Distributions

Limiting Distribution

For Markov chains with a finite number of states, there can be transient and recurrent

states. As time progresses, the chain will eventually enter a recurrent class and stay there

permanently. Hence, for long-term behavior, we focus on these recurrent classes. If the

Markov chain has multiple recurrent classes, it will eventually get absorbed into one of them; the probability of ending up in each recurrent class can be computed with the absorption probabilities above.

Assuming that the chain is aperiodic, the limiting distribution π = [π0 , π1 , π2 , . . .] can be found in a simple manner:

π = lim_{n→∞} π(n) = lim_{n→∞} [π(0)P^n]


Similarly,

π = lim_{n→∞} [π(0)P^{n+1}] = lim_{n→∞} [π(0)P^n P ] = [ lim_{n→∞} π(0)P^n ] P = πP

Intuitively, the equation π = πP means if the distribution of Xn (that is, the probabilities of

ending up in different states) is π, then the distribution of Xn+1 is also π.

πj = ∑_{k∈S} πk Pkj

for all j in S. This equation states that the probability πj of being in state j in the long run

is equal to the sum of the probabilities πk of being in each state k, weighted by the probability

Pkj of moving from state k to state j.

To summarize:

In a Markov chain, the limiting distribution π = [π0 , π1 , π2 , . . .] is a set of probabilities that

describe the Markov chain’s behavior in the long run. π is the limiting distribution if:

πj = lim_{n→∞} P (Xn = j | X0 = i)

for all states i and j in the state space S, and

∑_{j∈S} πj = 1

Consider a finite Markov chain {Xn , n = 0, 1, 2, . . .} where Xn ∈ S = {0, 1, 2, . . . , r}. Assume

the chain is irreducible and aperiodic.

The set of equations

π = πP, ∑_{j∈S} πj = 1

has a unique solution, where π is a 1 × (r + 1) row vector of probabilities. This unique solution is called the limiting distribution of the Markov chain:

πj = lim_{n→∞} P (Xn = j | X0 = i)


for all states i and j in S.

We can also say

rj = 1/πj

for all j in S, where rj is the average return time to state j.
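Numerically, π can be found by solving the singular system π(P − I) = 0 together with the normalization ∑_{j} πj = 1. A minimal sketch with a made-up 3-state chain, which also checks the claim rj = 1/πj against a large matrix power:

```python
import numpy as np

# Hypothetical irreducible, aperiodic 3-state chain.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.3, 0.3, 0.4]])

# (P^T - I) pi^T = 0 is singular, so replace one equation by sum(pi) = 1.
A = P.T - np.eye(3)
A[-1, :] = 1.0
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

print("limiting distribution:", pi)
print("a row of P^100:      ", np.linalg.matrix_power(P, 100)[0])
print("mean return times:   ", 1 / pi)       # r_j = 1 / pi_j
```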

Countably Infinite Markov Chains

Consider an infinite Markov chain {Xn , n = 0, 1, 2, . . .} where Xn ∈ S = {0, 1, 2, . . .}. Assume

that the chain is irreducible and aperiodic. Depending on the nature of the states, the behavior

of the chain can fall into one of three cases:

1. All states are transient:

lim_{n→∞} P (Xn = j | X0 = i) = 0 for all i, j ∈ S

Transient states mean the chain will eventually leave these states and never return.

2. All states are null recurrent:

lim_{n→∞} P (Xn = j | X0 = i) = 0 for all i, j ∈ S

Null recurrent states mean the chain returns to these states eventually, but the expected return

time is infinite, making long-term probabilities zero.

3. All states are positive recurrent (the expected return time is finite): In this case, there exists a limiting distribution π = [π0 , π1 , . . .] such that:

πj = lim_{n→∞} P (Xn = j | X0 = i) > 0 for all i, j ∈ S

The limiting distribution π is the unique solution to:

πj = ∑_{k=0}^{∞} πk Pkj , for j = 0, 1, 2, . . .

∑_{j=0}^{∞} πj = 1


The mean return time to state j, rj , is given by:

rj = 1/πj for all j = 0, 1, 2, . . .

2.6 Continuous-Time Markov Chains

A continuous-time Markov chain X(t) is a process that moves between states over time.

Markov Property

The process has the Markov property, meaning that to predict the future state of the process, we only need to know the current state, not the history of how we arrived there. Mathematically, for times 0 ≤ t1 < t2 < · · · < tn < tn+1 :

P (X(tn+1 ) = j | X(tn ) = i, X(tn−1 ) = in−1 , . . . , X(t1 ) = i1 ) = P (X(tn+1 ) = j | X(tn ) = i).

It is defined by two main parts: a jump chain and holding times.

Jump Chain

The jump chain is a discrete-time Markov chain on the state space S ⊂ {0, 1, 2, . . . } with transition probabilities pij . These probabilities tell us the likelihood of jumping from one state i to another state j. If we are in state i, we do not stay there indefinitely unless it is an absorbing state.

Holding Times

The amount of time spent in state i before transitioning is memoryless because of the Markov property. The exponential distribution is the only continuous distribution with this property, so the holding time in state i, which is the time the process remains in that state before jumping to another state, follows an exponential distribution with parameter λi .
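The two ingredients suggest a direct way to simulate such a chain: draw an Exponential(λi ) holding time, then pick the next state from the jump chain. A minimal sketch with made-up rates and jump probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical jump chain (zero diagonal) and holding-time rates lambda_i.
p_jump = np.array([[0.0, 0.7, 0.3],
                   [0.4, 0.0, 0.6],
                   [0.5, 0.5, 0.0]])
lam = np.array([1.0, 2.0, 0.5])

def simulate(t_end, state=0):
    """Return the list of (jump time, new state) pairs up to time t_end."""
    t, path = 0.0, [(0.0, state)]
    while True:
        t += rng.exponential(1.0 / lam[state])   # Exponential(lambda_i) holding time
        if t >= t_end:
            return path
        state = rng.choice(3, p=p_jump[state])   # jump-chain transition
        path.append((t, state))

print(simulate(5.0))
```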

For a continuous-time Markov chain, we define the transition matrix P (t). The (i,j)th entry

of the transition matrix is given by Pij (t) = P (X(t) = j | X(0) = i).

The transition matrix satisfies the following properties:

1. P (0) is equal to the identity matrix, P (0) = I.


2. The rows of the transition matrix must sum to 1:

∑_{j∈S} Pij (t) = 1 for all t ≥ 0.

3. For all s, t ≥ 0, we have P (s + t) = P (s)P (t).

Stationary Distribution

A probability distribution π on S, represented as π = [π0 , π1 , π2 , . . .], where πi ∈ [0, 1] and ∑_{i∈S} πi = 1, is called a Stationary distribution of the continuous-time Markov chain X(t) if it remains unchanged over time:

π = πP (t), for all t ≥ 0.

Limiting Distribution

The probability distribution π = [π0 , π1 , π2 , . . .] is called the Limiting distribution of the

continuous-time Markov chain X(t) if

πj = lim_{t→∞} P (X(t) = j | X(0) = i)

for all i, j ∈ S, and we have

∑_{j∈S} πj = 1.

In a continuous-time Markov chain with an irreducible positive recurrent jump chain, the

stationary distribution π̃ = [π̃0 , π̃1 , π̃2 , . . .] of the jump chain determines the long-term behavior

of the chain.

Let {X(t), t ≥ 0} be a continuous-time Markov chain with an irreducible positive recurrent jump chain. Suppose the unique stationary distribution of the jump chain is π̃ = [π̃0 , π̃1 , π̃2 , · · · ], and

0 < ∑_{k∈S} π̃k /λk < ∞.


Then,

πj = lim_{t→∞} P (X(t) = j | X(0) = i) = (π̃j /λj ) / ∑_{k∈S} (π̃k /λk )

for all i, j ∈ S.

π = [π0 , π1 , π2 , · · · ] is the limiting distribution of X(t).
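In code, the recipe is: find the jump chain's stationary distribution, weight each state by its mean holding time 1/λk , and renormalize. A sketch reusing the hypothetical rates from the simulation above:

```python
import numpy as np

p_jump = np.array([[0.0, 0.7, 0.3],
                   [0.4, 0.0, 0.6],
                   [0.5, 0.5, 0.0]])
lam = np.array([1.0, 2.0, 0.5])

# Stationary distribution of the discrete jump chain: pi_tilde = pi_tilde P.
A = p_jump.T - np.eye(3)
A[-1, :] = 1.0
pi_tilde = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

# Weight by mean holding times 1/lambda_k, then renormalize.
w = pi_tilde / lam
pi = w / w.sum()
print("limiting distribution of X(t):", pi)
```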

2.6.1 The Generator Matrix

Transition Rate (λi )

Suppose we start in state i. The time T1 until we jump to the next state follows an exponential

distribution with rate λi . This means:

T1 ∼ Exponential(λi )

For a very small time interval δ > 0:

P (T1 < δ) = 1 − e^{−λi δ} ≈ λi δ

So, the probability of leaving state i in a short time δ is approximately λi δ. Hence, λi is called the transition rate out of state i:

λi = lim_{δ→0+} P (X(δ) ≠ i | X(0) = i) / δ

Transition Rate from State i to State j (gij )

The probability of moving from state i to state j is pij . The transition rate from i to j is:

gij = λi pij


Generator Matrix (G)

The element gij in G is the transition rate from i to j.

We define the diagonal elements gii as:

gii = − ∑_{j≠i} gij = − ∑_{j≠i} λi pij = −λi

This holds because if λi = 0, then λi ∑_{j≠i} pij = 0, and if λi ≠ 0, then pii = 0 and ∑_{j≠i} pij = 1.

The elements in any row sum to 0:

∑_{j} gij = 0

The generator matrix G is very useful for analyzing the behavior of continuous-time Markov chains, especially when calculating probabilities and expectations over time. It also helps us find the stationary distribution.
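Constructing G from the rates and jump probabilities is a one-liner per rule; a sketch with the same hypothetical chain as before:

```python
import numpy as np

p_jump = np.array([[0.0, 0.7, 0.3],
                   [0.4, 0.0, 0.6],
                   [0.5, 0.5, 0.0]])
lam = np.array([1.0, 2.0, 0.5])

# g_ij = lambda_i * p_ij for i != j, and g_ii = -lambda_i.
G = lam[:, None] * p_jump
np.fill_diagonal(G, -lam)

print(G)
print("row sums:", G.sum(axis=1))   # each row sums to 0
```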

2.6.2 Uses of Generator Matrix

Consider a very small time interval δ. For such a small δ, the probability pjj (δ) that the system

stays in state j can be approximated as:

pjj (δ) ≈ 1 − λj δ

Since gjj = −λj , this can also be written as:

pjj (δ) ≈ 1 + gjj δ

Next, let’s consider the probability pkj (δ) that the system transitions from state k to state j within the small time interval δ, for k ≠ j:

pkj (δ) ≈ λk δpkj = δgkj


The probability that the chain leaves state k (for some state other than k) within time δ is approximately λk δ. If we additionally require that the jump lands in state j, we multiply by pkj , since the time of leaving state k and the destination of the jump are independent.

By the Chapman-Kolmogorov equation, we can write:

Pij (t + δ) = ∑_{k} Pik (t) pkj (δ)

Plugging in our approximations, we get:

Pij (t + δ) ≈ Pij (t) pjj (δ) + ∑_{k≠j} Pik (t) pkj (δ)

Using our previous expressions for pjj (δ) and pkj (δ):

Pij (t + δ) ≈ Pij (t)(1 + gjj δ) + ∑_{k≠j} Pik (t) δ gkj

Simplifying, we get:

Pij (t + δ) ≈ Pij (t) + δ Pij (t) gjj + δ ∑_{k≠j} Pik (t) gkj

Pij (t + δ) ≈ Pij (t) + δ ∑_{k} Pik (t) gkj

Finally, if we rearrange and take the limit as δ approaches zero, we get the Kolmogorov forward equation:

P ′ (t) = P (t)G

Similarly, we can also derive the backward equation:

P ′ (t) = GP (t)
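For a finite state space, both equations are solved by the matrix exponential P (t) = e^{tG} with P (0) = I. The sketch below checks this numerically with scipy's expm and a finite-difference derivative, using the toy generator built above:

```python
import numpy as np
from scipy.linalg import expm

G = np.array([[-1.0,  0.7,  0.3],
              [ 0.8, -2.0,  1.2],
              [ 0.25, 0.25, -0.5]])

t, dt = 1.0, 1e-6
P = expm(t * G)                           # P(t) = e^{tG}
P_dot = (expm((t + dt) * G) - P) / dt     # finite-difference P'(t)

print(np.allclose(P_dot, P @ G, atol=1e-4))   # forward:  P'(t) = P(t) G
print(np.allclose(P_dot, G @ P, atol=1e-4))   # backward: P'(t) = G P(t)
```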

2.6.3 How do we use the Generator Matrix to find the Stationary Distribution?

Consider a continuous-time Markov chain X(t) with state space S and generator matrix G. A

probability distribution π on S is called a stationary distribution for X(t) if and only if it

satisfies πG = 0.


Case-1

Let’s assume S is finite, so π = [π0 , π1 , . . . , πr ] for some r ∈ N. If π is a stationary distribution,

it means π = πP (t), where P (t) is the transition matrix. Differentiating both sides of π = πP (t)

with respect to t, we get:

0 = d/dt [πP (t)] = πP ′ (t) = πGP (t)

Since P (0) is the identity matrix I, we have:

0 = πGP (0) = πGI = πG

This shows that πG = 0.

Case-2

Suppose π is a probability distribution on S that satisfies πG = 0. From the backward equation P ′ (t) = GP (t), multiplying both sides on the left by π gives:

πP ′ (t) = πGP (t) = 0

Since πP ′ (t) is the derivative of πP (t), πP (t) is constant over time. Thus, for any t ≥ 0:

πP (t) = πP (0) = π

This confirms that π is a stationary distribution.
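In practice this gives the standard recipe: solve the linear system πG = 0 together with ∑_{i} πi = 1. A sketch using the toy generator from earlier (its answer should match the jump-chain computation above):

```python
import numpy as np

G = np.array([[-1.0,  0.7,  0.3],
              [ 0.8, -2.0,  1.2],
              [ 0.25, 0.25, -0.5]])

# pi G = 0 is singular, so replace one equation by the normalization sum(pi) = 1.
A = G.T.copy()
A[-1, :] = 1.0
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))
print("stationary distribution:", pi)
```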
