Information & Communication


Probability Theory

Universal set, denoted by Ω: contains all objects that could conceivably be of interest in a particular context. Having specified the context in terms of a universal set Ω, we only consider sets S that are subsets of Ω.

Two sets are said to be disjoint if their intersection is empty.

Elements of a Probabilistic Model →

The sample space Ω, which is the set of all possible outcomes of an experiment.

The probability law, which assigns to a set A of possible outcomes (also called an event) a non-negative number P(A), called the probability of A.

Sample Space: Every probabilistic model involves an underlying process, called the experiment, that will produce exactly one out of several possible outcomes. The set of all possible outcomes is called the sample space of the experiment, and is denoted by Ω.


Event: A subset of the sample space, that is, a collection of possible
outcomes, is called an event.

Note: For example, three tosses of a coin constitute a single experiment, rather than three experiments.

Note: The sample space of an experiment may consist of a finite or an infinite number of possible outcomes.

Probability Axioms →

Non-negativity: P(A) ≥ 0, for every event A.

Additivity: If A1, A2, ..., An is a sequence of disjoint events, then the probability of their union satisfies:

P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An)

Normalization: The probability of the entire sample space Ω is equal to 1, that is, P(Ω) = 1.

Example: 1 = P(Ω) = P(Ω ∪ Ø) = P(Ω) + P(Ø) = 1 + P(Ø), which shows that the probability of the empty event is P(Ø) = 0.

Some Properties of Probability Laws →

If A ⊂ B, then P(A) ≤ P(B)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

P(A ∪ B) ≤ P(A) + P(B) (can be further generalized)

Conditional Probability:

Denoted by P(A|B) and defined as P(A|B) = P(A ∩ B) / P(B)

We assume that P(B) > 0; the conditional probability is undefined if the conditioning event has zero probability.

In words, out of the total probability of the elements of B, P(A|B) is the fraction that is assigned to possible outcomes that also belong to A.

For a fixed event B, it can be verified that the conditional probabilities P(A|B) form a legitimate probability law that satisfies the three axioms.


Multiplication Rule: P(A ∩ B) = P(B) · P(A|B), and more generally, P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) · P(A2|A1) ··· P(An|A1 ∩ ... ∩ An−1), provided the conditioning events have positive probability.

Total Probability Theorem: Let A1, A2, ..., An be disjoint events that form a partition of the sample space (each possible outcome is included in one and only one of the events A1, A2, ..., An) and assume that P(Ai) > 0, for all i = 1, 2, ..., n. Then, for any event B, we have:

P(B) = P(A1 ∩ B) + ··· + P(An ∩ B)
     = P(A1) · P(B|A1) + ··· + P(An) · P(B|An)

Bayes’ Rule: Let A1, A2, ..., An be disjoint events that form a partition of the sample space and assume that P(Ai) > 0, for all i = 1, 2, ..., n. Then, for any event B, we have:

P(Ai|B) = P(Ai) · P(B|Ai) / P(B)

Bayes’ rule is used for inference. There are a number of “causes” that may result in a certain “effect.” We observe the effect, and we wish to infer the cause.
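As a quick illustration, here is a minimal sketch (not from the notes) that applies the Total Probability Theorem and Bayes’ Rule to a hypothetical two-cause example; the prior and likelihood values are made-up numbers chosen only for illustration.

```python
# Minimal sketch (assumed example): Total Probability Theorem and Bayes' Rule
# for a hypothetical partition into two "causes". All numbers are illustrative.

priors = {"cause_1": 0.7, "cause_2": 0.3}          # P(Ai), a partition of the sample space
likelihoods = {"cause_1": 0.1, "cause_2": 0.8}     # P(B | Ai)

# Total Probability Theorem: P(B) = sum_i P(Ai) * P(B | Ai)
p_b = sum(priors[a] * likelihoods[a] for a in priors)

# Bayes' Rule: P(Ai | B) = P(Ai) * P(B | Ai) / P(B)
posteriors = {a: priors[a] * likelihoods[a] / p_b for a in priors}

print(p_b)         # 0.31
print(posteriors)  # cause_1 ≈ 0.226, cause_2 ≈ 0.774: the observed effect favors cause_2
```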

Independent Events:

When the occurrence of B provides no information and does not alter the probability that A has occurred, i.e., P(A|B) = P(A), we say that A is independent of B.

Equivalently, P(A ∩ B) = P(A) · P(B)

The definition of independence can be extended to multiple events (more than two).

Independence is a symmetric property; that is, if A is independent of B, then B is independent of A, and we can unambiguously say that A and B are independent events.

Misconception: A common first thought is that two events are independent if they are disjoint, but in fact the opposite is true: two disjoint events A and B with P(A) > 0 and P(B) > 0 are never independent, since their intersection A ∩ B is empty and has probability 0.

If A and B are independent, so are A and Bᶜ (the complement of B).

Pairwise independence does not imply mutual independence. Conversely, the single condition P(A ∩ B ∩ C) = P(A) · P(B) · P(C) does not by itself imply pairwise independence.
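A minimal sketch (not from the notes) of checking independence by enumeration, using the two-fair-coin-tosses experiment; the events A, B, and D below are assumed for illustration.

```python
# Minimal sketch (assumed example): test independence of events in the
# two-fair-coin-tosses experiment by comparing P(A ∩ B) with P(A) * P(B).
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))                  # 4 equally likely outcomes

def P(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"        # first toss is heads
B = lambda w: w[1] == "H"        # second toss is heads
D = lambda w: w[0] == "T"        # first toss is tails (disjoint from A)

print(P(lambda w: A(w) and B(w)) == P(A) * P(B))    # True: A and B are independent
print(P(lambda w: A(w) and D(w)), P(A) * P(D))      # 0 vs 1/4: disjoint but not independent
```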

Discrete Random Variables

Definition: Given an experiment and the corresponding set of possible outcomes (the sample space), a random variable associates a particular number with each outcome. We refer to this number as the numerical value or the experimental value of the random variable. Mathematically, a random variable is a real-valued function of the experimental outcome.

We can associate with each random variable certain “averages” of interest, such as the mean and the variance.

A random variable is called discrete if its range (the set of values that it can take) is finite or at most countably infinite.

A random variable that can take an uncountably infinite number of values is not discrete.

A discrete random variable has an associated probability mass function (PMF), which gives the probability of each numerical value that the random variable can take.

Probability Mass Function → If x is any possible value of X, the probability mass of x, denoted p_X(x), is the probability of the event {X = x} consisting of all outcomes that give rise to a value of X equal to x:

p_X(x) = P({X = x})



X → denotes the random variable; x → denotes a real number, such as a numerical value of the random variable.

Note: Σ_x p_X(x) = 1

Calculation of the PMF of a Random Variable X → For each possible value x of X (a short worked example in code follows these two steps):

1. Collect all the possible outcomes that give rise to the event {X = x}.

2. Add their probabilities to obtain p_X(x).
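A minimal sketch (not from the notes) of the two-step recipe above, for X = number of heads in three fair coin tosses; the experiment and values are assumed for illustration.

```python
# Minimal sketch (assumed example): compute the PMF of X = number of heads in
# three fair coin tosses by enumerating the sample space.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))          # sample space: 8 equally likely outcomes
p_outcome = Fraction(1, len(omega))            # probability of each outcome

pmf = {}
for outcome in omega:
    x = outcome.count("H")                     # value of the random variable X
    pmf[x] = pmf.get(x, 0) + p_outcome         # add probabilities of outcomes with X = x

print(pmf)                 # {3: 1/8, 2: 3/8, 1: 3/8, 0: 1/8}
print(sum(pmf.values()))   # 1, the normalization property
```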


Different DRVs →

The Bernoulli Random Variable:

It is used to model generic probabilistic situations with just two outcomes. The Bernoulli random variable takes the two values 1 and 0, depending on the outcome.

Its PMF is:

p_X(x) = { p,      if x = 1
           1 − p,  if x = 0

PMF of a Bernoulli(p) random variable.

Mean = p; Variance = p(1 − p)


The Binomial Random Variable:

We refer to X as a binomial random variable with parameters n and p. The PMF of X consists of the binomial probabilities:

p_X(k) = P(X = k) = (n choose k) · p^k · (1 − p)^(n−k),  k = 0, 1, ..., n

A Binomial PMF

The normalization property Σ_x p_X(x) = 1, specialized to the binomial random variable, is written as:

Σ_{k=0}^{n} (n choose k) · p^k · (1 − p)^(n−k) = 1

Mean = n · p; Variance = n · p · (1 − p)


The Poisson Random Variable:

A Poisson random variable takes non-negative integer values. Its PMF is given by:

p_X(k) = e^(−λ) · λ^k / k!,  k = 0, 1, 2, ...

Mean = λ; Variance = λ
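A minimal sketch (not from the notes) that evaluates the Binomial and Poisson PMFs above and checks normalization and the stated means numerically; the parameter values n, p, and λ are assumed for illustration.

```python
# Minimal sketch (assumed example): Binomial(n, p) and Poisson(lambda) PMFs,
# with numerical checks of normalization and the stated means.
from math import comb, exp, factorial

n, p, lam = 10, 0.3, 2.0

binom_pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
poisson_pmf = {k: exp(-lam) * lam**k / factorial(k) for k in range(50)}   # truncated support

print(sum(binom_pmf.values()))                      # ~1.0 (normalization)
print(sum(k * q for k, q in binom_pmf.items()))     # ~3.0 = n * p (mean)
print(sum(k * q for k, q in poisson_pmf.items()))   # ~2.0 = lambda (mean)
```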


Expectation/Mean and Variance

Expectation of X, which is a weighted (in proportion to probabilities) average of the possible values of X.

We define the expected value (also called the expectation or the mean) of a random variable X, with PMF p_X(x), by E[X] = Σ_x x · p_X(x)

It is useful to view the mean of X as a “representative” value of X, which lies somewhere in the middle of its range.

Moment: We define the nth moment as E[Xⁿ], the expected value of the random variable Xⁿ. With this terminology, the 1st moment of X is just the mean.

Variance: Denoted by Var(X) and defined as the expected value of the random variable (X − E[X])², i.e., Var(X) = E[(X − E[X])²]

The variance is always non-negative. The variance provides a measure of dispersion of X around its mean. Another measure of dispersion is the standard deviation of X, which is defined as the square root of the variance and is denoted by σ_X: σ_X = √Var(X)

Variance in Terms of Moments Expression: Var(X) = E[X²] − (E[X])²
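A minimal sketch (not from the notes) computing the mean and variance of a discrete random variable from its PMF, and checking that the definition and the moments expression agree; the PMF is the three-coin-toss example assumed earlier.

```python
# Minimal sketch (assumed example): mean and variance from a PMF, comparing
# Var(X) = E[(X - E[X])^2] with Var(X) = E[X^2] - (E[X])^2.
pmf = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}   # heads in three fair tosses

mean = sum(x * p for x, p in pmf.items())                       # E[X]
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())      # E[(X - E[X])^2]
second_moment = sum(x**2 * p for x, p in pmf.items())           # E[X^2]
var_moments = second_moment - mean**2                           # E[X^2] - (E[X])^2

print(mean)                   # 1.5
print(var_def, var_moments)   # 0.75 0.75, the two expressions agree
```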

Continuous Random Variable

A random variable X is called continuous if its probability law can be described in terms of a nonnegative function f_X, called the probability density function of X, or PDF for short, which satisfies:

P(X ∈ B) = ∫_B f_X(x) dx   for every subset B of the real line.

In particular, P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx

For any single value a, we have P(X = a) = ∫_a^a f_X(x) dx = 0. For this reason, including or excluding the endpoints of an interval has no effect on its probability:

P(a ≤ X ≤ b) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b)

Note that to qualify as a PDF, a function f_X must be non-negative, i.e., f_X(x) ≥ 0 for every x, and must also satisfy the normalization equation: ∫_{−∞}^{∞} f_X(x) dx = P(−∞ < X < ∞) = 1

Graphically, this means that the entire area under the graph of the PDF must be equal to 1.


IMPORTANT: Even though a PDF is used to calculate event probabilities, f_X(x) is not the probability of any particular event. In particular, it is not restricted to be ≤ 1; it only needs to satisfy f_X(x) ≥ 0 for all x.

Uniform or uniformly distributed Random Variable: Its PDF has the form:

f_X(x) = { c,  if a ≤ x ≤ b
           0,  otherwise

where c is a constant. By the normalization property, c = 1/(b − a).

The PDF of a uniform Random Variable.

Expectation/Mean: The expected value or mean of a continuous random variable X is defined by: E[X] = ∫_{−∞}^{∞} x · f_X(x) dx

This is similar to the discrete case except that the PMF is replaced by the PDF, and summation is replaced by integration.
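A minimal sketch (not from the notes) checking the normalization and mean of a Uniform(a, b) PDF by a simple numerical integration; the interval endpoints are assumed for illustration and no external libraries are used.

```python
# Minimal sketch (assumed example): midpoint-rule integration of a Uniform(a, b)
# PDF, checking that the area is ~1 and the mean is ~(a + b) / 2.
a, b = 2.0, 5.0
c = 1.0 / (b - a)                      # constant fixed by the normalization property

def pdf(x):
    return c if a <= x <= b else 0.0

N = 100_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]          # midpoints over [a, b]

area = sum(pdf(x) * dx for x in xs)                  # ~1, the area under the PDF
mean = sum(x * pdf(x) * dx for x in xs)              # ~(a + b) / 2 = 3.5

print(area, mean)
```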

Cumulative Distribution Functions

The CDF of a random variable X is denoted by F_X and provides the probability P(X ≤ x). In particular, for every x we have:

F_X(x) = P(X ≤ x) = { Σ_{k ≤ x} p_X(k),        if X is discrete
                      ∫_{−∞}^{x} f_X(t) dt,    if X is continuous

Loosely speaking, the CDF “accumulates” probability “up to” the value x.
Any random variable associated with a given probability model has a CDF, regardless of whether it is discrete, continuous, or other. This is because {X ≤ x} is always an event and therefore has a well-defined probability.
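A minimal sketch (not from the notes) of a CDF built from a discrete PMF, showing how it accumulates probability and how the PMF is recovered from it (the recovery identity appears after the property list below); the PMF is the assumed three-coin-toss example.

```python
# Minimal sketch (assumed example): CDF of an integer-valued discrete random
# variable, and recovery of the PMF as F_X(k) - F_X(k - 1).
pmf = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}       # heads in three fair tosses

def cdf(x):
    """F_X(x): accumulate probability 'up to' the value x."""
    return sum(p for k, p in pmf.items() if k <= x)

print([cdf(x) for x in (-1, 0, 1, 2, 3, 4)])         # [0, 0.125, 0.5, 0.875, 1.0, 1.0]
print({k: cdf(k) - cdf(k - 1) for k in pmf})         # recovers the original PMF
```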

Properties of CDF →

1. F_X is monotonically nondecreasing: if x ≤ y, then F_X(x) ≤ F_X(y).

2. F_X(x) tends to 0 as x → −∞, and to 1 as x → ∞.

3. If X is discrete, then F_X has a piecewise constant and staircase-like form.

4. If X is continuous, then F_X has a continuously varying form.

For an integer-valued discrete X: p_X(k) = P(X ≤ k) − P(X ≤ k − 1) = F_X(k) − F_X(k − 1)

For a continuous X: f_X(x) = dF_X(x)/dx
Entropy

Entropy is a measure of the uncertainty of a random variable.

The entropy H(X) of a discrete random variable X is defined by →
H(X) = − Σ_{x∈X} p(x) · log(p(x)), where the base of the logarithm is 2 (entropy measured in bits).

Adding terms of zero probability does not change the entropy (using the convention 0 · log 0 = 0).

Lemma: H(X) ≥ 0
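A minimal sketch (not from the notes) computing entropy in bits for a few assumed distributions, illustrating the zero-probability convention and the lemma H(X) ≥ 0.

```python
# Minimal sketch (assumed example): entropy in bits of a discrete distribution,
# skipping terms with p(x) = 0 per the convention 0 * log 0 = 0.
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

print(entropy({"H": 0.5, "T": 0.5}))   # 1.0 bit: a fair coin
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.469 bits: less uncertain
print(entropy({"H": 1.0, "T": 0.0}))   # 0.0 bits: no uncertainty, so H(X) >= 0 holds
```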


The definition above is for a single random variable. We will now extend it to a pair of random variables.

The joint entropy: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as:

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) · log(p(x, y))

Conditional Entropy: The expected value of the entropies of the conditional distributions, averaged over the conditioning random variable:

H(Y|X) = Σ_{x∈X} p(x) · H(Y|X = x)

Intuitively (and it can be proved formally), the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.

Chain Rule: H(X, Y) = H(X) + H(Y|X)

Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)

H(X, Y|Z) refers to the conditional entropy of random variables X and Y given random variable Z.

Note: Generally, H(Y|X) ≠ H(X|Y). However, H(X) − H(X|Y) = H(Y) − H(Y|X).
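A minimal sketch (not from the notes) verifying the chain rule H(X, Y) = H(X) + H(Y|X) on a small, made-up joint PMF; the joint distribution below is assumed for illustration.

```python
# Minimal sketch (assumed example): numerical check of H(X, Y) = H(X) + H(Y|X).
from math import log2

joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

def H(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Marginal p(x), then H(Y|X) = sum_x p(x) * H(Y | X = x)
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

H_Y_given_X = 0.0
for x, px in p_x.items():
    cond = {y: joint[(x, y)] / px for (x2, y) in joint if x2 == x}   # p(y | X = x)
    H_Y_given_X += px * H(cond)

print(H(joint))               # joint entropy H(X, Y), ~1.861 bits
print(H(p_x) + H_Y_given_X)   # H(X) + H(Y|X), equal to the line above
```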

Relative Entropy: The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p. Relative entropy is always nonnegative and is zero if and only if p = q.

D(p||q) = Σ_{x∈X} p(x) · log(p(x) / q(x))

Mutual Information: Mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other.

Consider two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x) · p(y):

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) · log( p(x, y) / (p(x) · p(y)) )

Note that D(p||q) ≠ D(q||p) in general.
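A minimal sketch (not from the notes) computing D(p||q) in bits for two made-up distributions, illustrating both its nonnegativity and the asymmetry noted above; the distributions p and q are assumed for illustration.

```python
# Minimal sketch (assumed example): relative entropy D(p||q) in bits, showing
# D(p||q) >= 0 and that D(p||q) != D(q||p) in general.
from math import log2

def D(p, q):
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(D(p, p))   # 0.0: zero if and only if the distributions are equal
print(D(p, q))   # ~0.737
print(D(q, p))   # ~0.531, not equal to D(p||q)
```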

Relation between entropy and mutual information →

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)

The mutual information I(X; Y) is the reduction in the uncertainty of X due to the knowledge of Y, or the reduction in the uncertainty of Y due to the knowledge of X.

X says as much about Y as Y says about X; thus I(X; Y) = I(Y; X).
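A minimal sketch (not from the notes) computing I(X; Y) on a small, made-up joint PMF in two ways, directly from the definition and via H(X) + H(Y) − H(X, Y), and checking that they agree; the joint distribution is assumed for illustration.

```python
# Minimal sketch (assumed example): I(X; Y) computed directly and via entropies.
from math import log2

joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

def H(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

I_direct = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
I_entropy = H(p_x) + H(p_y) - H(joint)

print(I_direct, I_entropy)   # equal (up to floating point), and nonnegative
```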


Relationship between entropy and mutual information.

Chain rule for entropy: Let X1, X2, ..., Xn be drawn according to p(x1, x2, ..., xn). Then:

H(X1, X2, ..., Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, ..., X1)
                   = H(X1) + H(X2|X1) + ··· + H(Xn|Xn−1, ..., X1)

Conditional Mutual Information: It is defined as the reduction in the uncertainty of X due to knowledge of Y when Z is given.

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

Chain rule for information: I(X1, X2, ..., Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, Xi−2, ..., X1)

The conditional relative entropy:

D(p(y|x) || q(y|x)) = Σ_x p(x) Σ_y p(y|x) · log( p(y|x) / q(y|x) )

Chain rule for relative entropy:

D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))


Jensen’s Inequality: If f is a convex function and X is a random variable, then E[f(X)] ≥ f(E[X]).
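A minimal sketch (not from the notes) of Jensen’s inequality for the convex function f(x) = x², on a small assumed discrete distribution.

```python
# Minimal sketch (assumed example): E[f(X)] >= f(E[X]) for the convex f(x) = x^2.
pmf = {-1: 0.2, 0: 0.5, 2: 0.3}

E_X = sum(x * p for x, p in pmf.items())
E_fX = sum(x**2 * p for x, p in pmf.items())

print(E_fX, E_X**2, E_fX >= E_X**2)   # 1.4, 0.16, True
```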


Information inequality: Let p(x), q(x), x ∈ X, be two probability mass functions. Then D(p||q) ≥ 0, with equality if and only if p(x) = q(x) for all x.


Non-negativity of mutual information: For any two random variables X, Y, we have I(X; Y) ≥ 0, with equality if and only if X and Y are independent.
