
Course Taught at IIFT

Day 2: Descriptive Statistics and Probability Distributions

Dr. Tanujit Chakraborty


Centre for Data Sciences
IIIT Bangalore
Today’s Topics…
 Data summarization
 Graphical summarization
 Probability vs. Statistics
 Concept of random variable
 Probability distribution concept
 Discrete probability distribution
 Continuous probability distribution
TRP: An example
 Television rating point (TRP) is a tool used to judge which programmes are viewed the most.
 It gives an index of viewers' choices and of the popularity of a particular channel.
 For the calculation, a device is attached to the TV sets in a few thousand viewers' houses
in different geographic and demographic sectors.
 The device is called the People's Meter. It records the time and the programme that a
viewer watches on a particular day over a certain period.
 An average is then taken, for example, over a 30-day period.
 This can further be augmented with a personal interview survey (PIS), which becomes the
basis for many studies and decisions.
 Essentially, we are to analyze data for TRP estimation.


Data
Definition : Data

A set of data is a collection of observed values representing one or more
characteristics of some objects or units.

Example: For TRP, the data collected consists of the following attributes.


 Age: A viewer’s age in years
 Sex: A viewer’s gender coded 1 for male and 0 for female
 Happy: A viewer’s general happiness
 NH for not too happy
 PH for pretty happy
 VH for very happy
 TVHours: The average number of hours a respondent watched TV during a day
Data : Example
Viewer# Age Sex Happy TVHours
… … … … …
… … … … …
55 34 F VH 5
… … … … …

Note:
 A data set is composed of information from a set of units.

 Information from a unit is known as an observation.

 An observation consists of one or more pieces of information about a unit; these are
called variables.
Population
Definition : Population

A population is a data set representing the entire entities of interest.

Example: All TV Viewers in the country/world.

Note:
1. All people in the country/world do not form this population; only TV viewers do.

2. For different surveys, the population may be completely different.

3. For statistical learning, it is important to define the population that we intend to study
very carefully.
Sample
Definition : Sample

A sample is a data set consisting of a portion (a subset) of a population.

Example: All students studying in MBA (IB) 2020-2022 form a sample, whereas
all the students of IIFT form the population.

Note:
 Normally a sample is obtained in such a way as to be representative of the population.
Statistic
Definition : Statistic

A statistic is a quantity calculated from data that describes a particular
characteristic of a sample.

Example: The sample mean (denoted by $\bar{y}$) is the arithmetic mean of a
variable over all the observations in a sample.
Statistical Inference
Definition : Statistical inference

Statistical inference is the process of using sample statistics to make decisions
about a population.

Example: In the context of TRP


 Overall frequency of the various levels of happiness.

 Is there a relationship between the age of a viewer and his/her general happiness?

 Is there a relationship between the age of the viewer and the number of TV hours
watched?
Data Summarization
 To identify the typical characteristics of data (i.e., to have an overall picture).

 To identify which data should be treated as noise or outliers.

 The data summarization techniques can be classified into two broad


categories:

 Measures of location

 Measures of dispersion
Measurement of location
 It is also called measuring the central tendency.
 A function of the sample values that summarizes the location information into a single
number is known as a measure of location.

 The most popular measures of location are


 Mean
 Median
 Mode
 Midrange

 These can be measured in three ways


 Distributive measure
 Algebraic measure
 Holistic measure
Distributive measure
 It is a measure (i.e. function) that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each
subset, and then merging the results in order to arrive at the measure’s value
for the original (i.e. entire) data set.

Example
 sum(), count()
Algebraic measure
 It is a measure that can be computed by applying an algebraic function to one or
more distributive measures.

 Example
    average = sum() / count()
Holistic measure
 It is a measure that must be computed on the entire data set as a whole.

 Example
Calculating median
What about mode?
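To make the distinction concrete, here is a minimal Python sketch (not from the slides; the data values are made up) that computes sum() and count() partition-wise, derives the average from them, and contrasts this with the median, which needs the whole data set at once.

# A minimal Python sketch (not from the slides) illustrating the three kinds
# of measures on a data set split into partitions.

data = [12, 7, 3, 25, 14, 9, 18, 21, 6, 11]
partitions = [data[:4], data[4:7], data[7:]]   # pretend the data is stored in chunks

# Distributive measures: compute per partition, then merge the partial results.
total = sum(sum(part) for part in partitions)          # sum() is distributive
count = sum(len(part) for part in partitions)          # count() is distributive

# Algebraic measure: an algebraic function of distributive measures.
average = total / count

# Holistic measure: needs the whole data set at once (e.g., the median).
whole = sorted(x for part in partitions for x in part)
n = len(whole)
median = whole[n // 2] if n % 2 == 1 else (whole[n // 2 - 1] + whole[n // 2]) / 2

print(average, median)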
Mean of a sample
 The mean of a sample is denoted by $\bar{x}$. The different mean measurements
known are:
 Simple mean
 Weighted mean
 Trimmed mean

 In the next few slides, we shall learn how to calculate the mean of a sample.

 We assume that given 𝑥1 , 𝑥2 , 𝑥3 ,….., 𝑥𝑛 are the sample values.


Simple mean of a sample
 Simple mean
It is also simply called the arithmetic mean or average, abbreviated AM.

Definition : Simple mean

If $x_1, x_2, x_3, \ldots, x_n$ are the sample values, the simple mean is defined as

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Weighted mean of a sample
 Weighted mean
It is also called weighted arithmetic mean or weighted average.

Definition : Weighted mean

When each sample value $x_i$ is associated with a weight $w_i$, for i = 1, 2, …, n,
the weighted mean is defined as

    $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Note: When all weights are equal, the weighted mean reduces to the simple mean.
Trimmed mean of a sample
 Trimmed Mean
If there are extreme values (also called outliers) in a sample, then the mean is
influenced greatly by those values. To offset the effect caused by such extreme
values, we can use the trimmed mean.

Definition : Trimmed mean

Trimmed mean is defined as the mean obtained after chopping off


values at the high and low extremes.
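The three means above can be compared with a minimal Python sketch (not from the slides; the numbers are made-up illustrations).

# Simple, weighted and trimmed means of a small made-up sample.

def simple_mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after chopping off the k smallest and k largest values."""
    xs = sorted(xs)
    return simple_mean(xs[k:len(xs) - k])

x = [3, 5, 7, 9, 11, 200]        # 200 acts as an outlier
w = [1, 1, 2, 2, 3, 1]

print(simple_mean(x))            # pulled up by the outlier
print(weighted_mean(x, w))
print(trimmed_mean(x, 1))        # drops 3 and 200, closer to the bulk of the data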
Properties of mean
 Lemma 1:
If $\bar{x}_i$, i = 1, 2, …, m, are the means of m samples of sizes $n_1, n_2, \ldots, n_m$
respectively, then the mean of the combined sample is given by

    $\bar{x} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i}$        (a distributive measure)

 Lemma 2:
If a new observation $x_k$ is added to a sample of size n with mean $\bar{x}$, the new
mean is given by

    $\bar{x}' = \frac{n\bar{x} + x_k}{n + 1}$
Properties of mean
 Lemma 3:
If an existing observation $x_k$ is removed from a sample of size n with mean $\bar{x}$,
the new mean is given by

    $\bar{x}' = \frac{n\bar{x} - x_k}{n - 1}$

 Lemma 4:
If m observations with mean $\bar{x}_m$ are added to (removed from) a sample of size n
with mean $\bar{x}_n$, then the new mean is given by

    $\bar{x} = \frac{n\bar{x}_n \pm m\bar{x}_m}{n \pm m}$
Properties of mean
 Lemma 5:
If a constant c is subtracted from (or added to) each sample value, then the mean
of the transformed variable is displaced by c. That is,

    $\bar{x}' = \bar{x} \mp c$

 Lemma 6:
If each observation is scaled by multiplying (dividing) by a non-zero constant c,
then the altered mean is given by

    $\bar{x}' = \bar{x} * c$

where * denotes the × (multiplication) or ÷ (division) operator.
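A quick numerical check (not from the slides, using a made-up sample) of Lemmas 2 and 5 above:

import math

x = [4.0, 8.0, 15.0, 16.0, 23.0]
n = len(x)
mean = sum(x) / n

# Lemma 2: mean after adding a new observation x_k.
x_k = 42.0
lemma2 = (n * mean + x_k) / (n + 1)
direct = sum(x + [x_k]) / (n + 1)
assert math.isclose(lemma2, direct)

# Lemma 5: subtracting a constant c from every value shifts the mean by c.
c = 3.0
shifted_mean = sum(v - c for v in x) / n
assert math.isclose(shifted_mean, mean - c)

print(lemma2, shifted_mean)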
Mean with grouped data
Sometimes data are given in the form of classes and a frequency for each class.

    Class      $x_1 - x_2$   $x_2 - x_3$   …   $x_i - x_{i+1}$   …   $x_{n-1} - x_n$
    Frequency  $f_1$         $f_2$         …   $f_i$             …   $f_n$

There are three methods to calculate the mean of such grouped data:


• Direct method
• Assumed mean method
• Step deviation method
Direct method

 Direct Method

    $\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$

where $x_i = \frac{1}{2}$ (lower limit + upper limit) of the ith class, i.e., $x_i = \frac{x_i + x_{i+1}}{2}$
(the class mark), and $f_i$ is the frequency of the ith class.

Note: $\sum_i f_i (x_i - \bar{x}) = 0$
Assumed mean method
 Assumed Mean Method

    $\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$

where A is the assumed mean (usually a class mark $\frac{x_i + x_{i+1}}{2}$ chosen near
the middle of the groups) and $d_i = x_i - A$ for each i.
Step deviation method

 Step Deviation Method

    $\bar{x} = A + \frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i} \cdot h$

where
    A = assumed mean
    h = class size (i.e., $x_{i+1} - x_i$ for the ith class)
    $u_i = \frac{x_i - A}{h}$
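The three grouped-data methods above can be checked against each other with a minimal Python sketch (not from the slides; the classes and frequencies are made up). All three must agree.

classes = [(10, 20), (20, 30), (30, 40), (40, 50)]   # class boundaries
freq    = [5, 8, 12, 7]

marks = [(lo + hi) / 2 for lo, hi in classes]        # class marks x_i
N = sum(freq)

# Direct method
direct = sum(f * x for f, x in zip(freq, marks)) / N

# Assumed-mean method (A chosen as a middle class mark)
A = marks[len(marks) // 2]
d = [x - A for x in marks]
assumed = A + sum(f * di for f, di in zip(freq, d)) / N

# Step-deviation method (h = common class width)
h = classes[0][1] - classes[0][0]
u = [(x - A) / h for x in marks]
step = A + (sum(f * ui for f, ui in zip(freq, u)) / N) * h

print(direct, assumed, step)    # all three print the same value (31.5625)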
Mean for a group of data
 For the above methods, we assume that
  All classes are of equal size
  Classes are exclusive (continuous), i.e., the lower limit of a class is the same as
   the upper limit of the previous class

    10 - 19   20 - 29   30 - 39   40 - 49
    Data with inclusive classes

    9.5 - 19.5   19.5 - 29.5   29.5 - 39.5   39.5 - 49.5
    The same data converted to exclusive classes


Ogive: Graphical method to find the median
 Ogive (pronounced as O-Jive) is a cumulative frequency polygon graph.
 When cumulative frequencies are plotted against the upper (lower) class
limit, the plot resembles one side of an Arabesque or ogival architecture,
hence the name.
 There are two types of Ogive plots
 Less-than (upper class vs. cumulative frequency)
 More than (lower class vs. cumulative frequency)

Example:
Suppose, there is a data relating the marks obtained by 200 students in an
examination

444, 412, 478, 467, 432, 450, 410, 465, 435, 454, 479, …….

(Further, suppose it is observed that the minimum and maximum marks


are 410, 479, respectively.)
Ogive: Cumulative frequency table
444, 412, 478, 467, 432, 450, 410, 465, 435, 454, 479, …….

Step 1: Draw a cumulative frequency table

    Marks (x)   Exclusive series   No. of students (f)   Cumulative frequency (c.f.)
    410-419     409.5-419.5        14                    14
    420-429     419.5-429.5        20                    34
    430-439     429.5-439.5        42                    76
    440-449     439.5-449.5        54                    130
    450-459     449.5-459.5        45                    175
    460-469     459.5-469.5        18                    193
    470-479     469.5-479.5        7                     200
Ogive: Graphical method to find the median

Step 2: Less-than Ogive graph

    Upper class        Cumulative frequency
    Less than 419.5    14
    Less than 429.5    34
    Less than 439.5    76
    Less than 449.5    130
    Less than 459.5    175
    Less than 469.5    193
    Less than 479.5    200
Ogive: Graphical method to find the median

Step 3: More-than Ogive graph

    Lower class        Cumulative frequency
    More than 409.5    200
    More than 419.5    186
    More than 429.5    166
    More than 439.5    124
    More than 449.5    70
    More than 459.5    25
    More than 469.5    7
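To make the construction concrete, here is a minimal Python sketch (not from the slides) that builds both cumulative-frequency columns from the table above and locates the median class, applying the grouped-data median formula that appears later in the deck.

boundaries = [409.5, 419.5, 429.5, 439.5, 449.5, 459.5, 469.5, 479.5]
freq = [14, 20, 42, 54, 45, 18, 7]
N = sum(freq)                                   # 200 students

# Less-than ogive: cumulative frequency up to each upper boundary.
less_than, running = [], 0
for f in freq:
    running += f
    less_than.append(running)                   # [14, 34, 76, 130, 175, 193, 200]

# More-than ogive: cumulative frequency above each lower boundary.
more_than = [N - c for c in [0] + less_than[:-1]]   # [200, 186, 166, 124, 70, 25, 7]

# Median class: first class whose less-than cumulative frequency exceeds N/2.
i = next(k for k, c in enumerate(less_than) if c > N / 2)
l, h, cf, f = boundaries[i], boundaries[i + 1] - boundaries[i], ([0] + less_than)[i], freq[i]
median = l + (N / 2 - cf) / f * h
print(more_than, median)                        # median ≈ 443.94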
Information from Ogive
 Median from the Less-than Ogive            Median from the More-than Ogive

 A cumulative relative frequency of 0.65 for the class 439.5-449.5 means that 65%
of all scores are found in this class or below.
Information from Ogive
 Less-than and more-than Ogive approach

The crossing point of the two Ogive plots gives the median of the sample.
Some other measures of mean
 Arithmetic Mean (AM)
   S: $x_1, x_2$
   $\bar{x} = \frac{x_1 + x_2}{2}$, i.e., $\bar{x} - x_1 = x_2 - \bar{x}$

 Geometric Mean (GM)
   S: $x_1, x_2$
   $\bar{x} = \sqrt{x_1 x_2}$, i.e., $\frac{x_1}{\bar{x}} = \frac{\bar{x}}{x_2}$

 Harmonic Mean (HM)
   S: $x_1, x_2$
   $\bar{x} = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}$, i.e., $\frac{2}{\bar{x}} = \frac{1}{x_1} + \frac{1}{x_2}$
Geometric mean
Definition : Geometric mean

The geometric mean of n observations (none of which is zero) is defined as:

    $\bar{x} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$,   n ≠ 0

Note
 GM is the arithmetic mean in "log space", because, equivalently,

    $\log \bar{x} = \frac{1}{n} \sum_{i=1}^{n} \log x_i$

 This summary measure is meaningful only when all observations are > 0.
 If at least one observation is zero, the product itself is zero; for a negative value, the
root is not real.
Harmonic mean
Definition : Harmonic mean

If all observations are non-zero, the reciprocal of the arithmetic mean of the
reciprocals of the observations is known as the harmonic mean.

For ungrouped data:

    $\bar{x} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$

For grouped data:

    $\bar{x} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \frac{f_i}{x_i}}$

where $f_i$ is the frequency of the ith class and $x_i$ is the centre value (class mark)
of the ith class.
Significance of different mean calculations
 There are two things involved when we consider a sample:
  Observation
  Range

Example: Rainfall data

    Rainfall (in mm)    r1   r2   …   rn
    Days (in number)    d1   d2   …   dn

 Here, rainfall is the observation and days is the range for each element in the sample.

 We are to measure the mean "rate of rainfall" as the measure of location.
Significance of different mean calculations
 Case 1: The range remains the same for each observation

Example: Amount of rainfall per week, say.

    Rainfall (in mm)    35   18   …   22
    Days (in number)    7    7    …   7
Significance of different mean calculations
 Case 2: The ranges are different, but the observation remains the same

Example: The same amount of rainfall over different numbers of days, say.

    Rainfall (in mm)    50   50   …   50
    Days (in number)    1    2    …   7
Significance of different mean calculations
 Case 3: The ranges are different, as well as the observations

Example: Different amounts of rainfall over different numbers of days, say.

    Rainfall (in mm)    21   34   …   18
    Days (in number)    5    3    …   7
Rules of thumb for means
 AM: When the range remains the same for each observation
Example: Case 1

    Rainfall (in mm)    35   18   …   22
    Days (in number)    7    7    …   7

    $\bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i$
Rules of thumb for means
 HM: When the range is different but each observation is the same
 Example: Case 2

    Rainfall (in mm)    50   50   …   50
    Days (in number)    1    2    …   7

    $\bar{r} = \frac{n}{\sum_{i=1}^{n} \frac{1}{r_i}}$
Rules of thumb for means
 GM: When the ranges are different, as well as the observations
 Example: Case 3

    Rainfall (in mm)    21   34   …   18
    Days (in number)    5    3    …   7

    $\bar{r} = \left( \prod_{i=1}^{n} r_i \right)^{1/n}$
Rules of thumb for means
 The important thing to recognize is that all three means are simply the
arithmetic mean in disguise!

 Each mean follows the same "additive structure".

 Suppose we are given some abstract quantities {x1, x2, …, xn}.

 Each of the three means can be obtained with the following steps:

1. Transform each xi into some yi

2. Take the arithmetic mean of all the yi's

3. Transform the result back to the original scale of measurement


Rules of thumb for means
 For the arithmetic mean
   Use the transformation $y_i = x_i$
   Take the arithmetic mean of all the $y_i$'s to get $\bar{y}$
   Finally, $\bar{x} = \bar{y}$

 For the geometric mean
   Use the transformation $y_i = \log x_i$
   Take the arithmetic mean of all the $y_i$'s to get $\bar{y}$
   Finally, $\bar{x} = e^{\bar{y}}$

 For the harmonic mean
   Use the transformation $y_i = \frac{1}{x_i}$
   Take the arithmetic mean of all the $y_i$'s to get $\bar{y}$
   Finally, $\bar{x} = \frac{1}{\bar{y}}$

In general, AM ≥ GM ≥ HM.
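The "transform, average, transform back" recipe above can be written directly as a minimal Python sketch (not from the slides; the sample values are made up and must be positive).

import math

def mean_via_transform(xs, fwd, inv):
    return inv(sum(fwd(x) for x in xs) / len(xs))

x = [2.0, 4.0, 8.0]          # made-up positive values

am = mean_via_transform(x, lambda v: v,        lambda m: m)        # 4.666...
gm = mean_via_transform(x, math.log,           math.exp)           # 4.0
hm = mean_via_transform(x, lambda v: 1.0 / v,  lambda m: 1.0 / m)  # 3.428...

assert am >= gm >= hm        # AM >= GM >= HM
print(am, gm, hm)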
Median of a sample
Definition : Median of a sample

The median of a sample is the middle value when the data are arranged in
increasing (or decreasing) order. Symbolically,

    $\tilde{x} = x_{(n+1)/2}$                               if n is odd
    $\tilde{x} = \frac{1}{2}\left( x_{n/2} + x_{(n/2)+1} \right)$   if n is even
Median of a sample
Definition : Median of a grouped data

The median of grouped data is given by

    $\tilde{x} = l + \frac{\frac{N}{2} - cf}{f} \cdot h$

where
    l  = lower limit of the median class
    h  = width of the median class
    N  = $\sum_{i=1}^{n} f_i$, the total number of observations
         ($f_i$ is the frequency of the ith class and n is the number of classes)
    cf = cumulative frequency of the class preceding the median class
    f  = frequency of the median class

Note
A class is called the median class if its cumulative frequency is just greater
than N/2.
Mode of a sample
 Mode is defined as the observation which occurs most frequently.

 For example, the number of wickets obtained by a bowler in 10 test matches is as
follows:

    1  2  0  3  2  4  1  1  2  2

 In other words, the above data can be represented as:

    # of wickets    0   1   2   3   4
    # of matches    1   3   4   1   1

 Clearly, the mode here is 2.


Mode of a grouped data
Definition : Mode of grouped data

Select the modal class (the class with the highest frequency). Then the mode $\hat{x}$
is given by:

    $\hat{x} = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \cdot h$

where
    l is the lower boundary of the modal class
    h is the class width
    $\Delta_1$ is the difference between the frequency of the modal class and the
    frequency of the class just before the modal class
    $\Delta_2$ is the difference between the frequency of the modal class and the
    frequency of the class just after the modal class

Note
If each data value occurs only once, then there is no mode!
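A minimal Python sketch (not from the slides) applying the grouped-data mode formula above to the marks distribution used in the Ogive example:

boundaries = [409.5, 419.5, 429.5, 439.5, 449.5, 459.5, 469.5, 479.5]
freq = [14, 20, 42, 54, 45, 18, 7]

i = freq.index(max(freq))                 # modal class: 439.5 - 449.5
l = boundaries[i]
h = boundaries[i + 1] - boundaries[i]
d1 = freq[i] - freq[i - 1]                # modal frequency minus preceding class
d2 = freq[i] - freq[i + 1]                # modal frequency minus succeeding class

mode = l + d1 / (d1 + d2) * h
print(mode)                               # ≈ 445.2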
Relation between mean, median and mode
 There is an empirical relation, valid for moderately skewed data

Mean – Mode = 3 * (Mean – Median)


 A given set of data can be categorized into three categories:-
 Symmetric data
 Positively skewed data
 Negatively skewed data

 To understand the above three categories, let us consider the following:

 Given a set of m objects, where any object can take one of the values $v_1, v_2, \ldots, v_k$,
the frequency of a value $v_i$ is defined as

    Frequency($v_i$) = $\frac{\text{Number of objects with value } v_i}{m}$,   for i = 1, 2, …, k
Symmetric data
 For symmetric data, the mean, median and mode all lie at the same point.
Positively & Negatively Skewed data
 Positively Skewed Data: Mode occurs at a value smaller than the median.
 Negatively Skewed Data: Mode occurs at a value greater than the median.
Midrange
• The midrange is the average of the largest and smallest values in the set.

 Steps for computing a trimmed mean (for comparison)
1. A percentage p between 0 and 100 is specified.
2. The top (p/2)% and the bottom (p/2)% of the data are thrown out.
3. The mean of the remaining values is then calculated in the normal way.

• Thus, the median is the trimmed mean with p approaching 100%, while the
traditional mean corresponds to p = 0%.

Note
• The median and the traditional mean are therefore limiting cases of the trimmed
mean; the midrange, in contrast, uses only the two extreme values.
Measures of dispersion
 Location measures alone are insufficient to understand data.
 Another set of commonly used summary statistics for continuous data are those
that measure dispersion.
 A dispersion measure captures the extent of spread of the observations in a sample.

 Some important measures of dispersion are:


 Range
 Variance and Standard Deviation
 Mean Absolute Deviation (MAD)
 Absolute Average Deviation (AAD)
 Interquartile Range (IQR)
Measures of dispersion
Example
 Suppose we have two samples of fruit juice bottles from two companies, A and B. The
content of each bottle is measured in litres.

    Sample A    0.97   1.00   0.94   1.03   1.06

    Sample B    1.06   1.01   0.88   0.91   1.14

 Both samples have the same mean. However, the bottles from company A have more
uniform content than those from company B.
 We say that the dispersion (or variability) of the observations about the average is
less for sample A than for sample B.
 The variability in a sample shows how the observations spread out from the average.
 When buying juice, a customer would feel more confident buying from A than from B.
Range of a sample
Definition : Range of a sample

Let X = $x_1, \ldots, x_n$ be n sample values arranged in increasing order.

The range R of these samples is then defined as:

    R = max(X) – min(X) = $x_n - x_1$

 Although the range identifies the maximum spread, it can be misleading if most of the
values are concentrated in a narrow band while a relatively small number of values
are more extreme.

 The variance is another measure of dispersion to deal with such a situation.


Variance and Standard Deviation
Definition : Variance and Standard Deviation

Let X = {$x_1, \ldots, x_n$} be the values of n samples. Then the variance,
denoted $\sigma^2$, is defined as:

    $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

where $\bar{x}$ denotes the mean of the sample.

The standard deviation, σ, of the samples is the square root of the variance $\sigma^2$.
Coefficient of Variation
 Basic properties
 σ measures spread about mean and should be chosen only when the mean is
chosen as the measure of central tendency
 σ = 0 only when there is no spread, that is, when all observations have the
same value, otherwise σ > 0

Definition : Coefficient of variation

A related measure is the coefficient of variation, CV, which is defined as follows:

    CV = $\frac{\sigma}{\bar{x}} \times 100$

This gives a ratio (unit-free) measure of spread.
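A minimal Python sketch (not from the slides) computing the range, sample variance, standard deviation and CV for the two juice samples used earlier:

import math

def dispersion(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance (n - 1)
    std = math.sqrt(var)
    cv = std / mean * 100
    return max(xs) - min(xs), var, std, cv

sample_a = [0.97, 1.00, 0.94, 1.03, 1.06]
sample_b = [1.06, 1.01, 0.88, 0.91, 1.14]

print(dispersion(sample_a))   # smaller spread
print(dispersion(sample_b))   # larger spread, same mean (1.00)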


Median Absolute Deviation (MAD)
 Since the mean can be distorted by outliers, and the variance is computed using the
mean, the variance is also sensitive to outliers. To reduce the effect of outliers, two
more robust measures of dispersion are used:

 Median Absolute Deviation (MAD)

    MAD(X) = $\mathrm{median}\left( |x_1 - \bar{x}|, \ldots, |x_n - \bar{x}| \right)$

 Absolute Average Deviation (AAD)

    AAD(X) = $\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

where X = {$x_1, x_2, \ldots, x_n$} is the sample of n observations.
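As a quick check of these definitions, here is a minimal Python sketch (not from the slides; the sample is made up and contains one outlier):

import statistics

def mad(xs):
    m = statistics.mean(xs)
    return statistics.median(abs(x - m) for x in xs)

def aad(xs):
    m = statistics.mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

x = [10, 12, 11, 13, 12, 95]      # 95 is an outlier
print(mad(x), aad(x))             # MAD (14.0) is less influenced by the outlier than AAD (≈23.2)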


Interquartile Range
 Like MAD and AAD, there is another robust measure of dispersion, called the
interquartile range and denoted IQR.
 To understand the IQR, let us first define percentiles and quartiles.

 Percentile
 The percentile of a set of ordered data can be defined as follows:

o Given an ordinal or continuous attribute x and a number p between 0 and


100, the pth percentile 𝐱 𝐩 is a value of x such that p% of the observed
values of x are less than 𝐱 𝐩

o Example: The 50th percentile is that value 𝐱 𝟓𝟎% such that 50% of all values
of x are less than 𝐱 𝟓𝟎% .

 Note: The median is the 50th percentile.


Interquartile Range
 Quartile
 The most commonly used percentiles are quartiles.
 The first quartile, denoted by 𝐐𝟏 is the 25th percentile.
 The third quartile, denoted by 𝐐𝟑 is the 75th percentile
 The median, 𝐐𝟐 is the 50th percentile.

 The quartiles, including the median, give some indication of the centre, spread
and shape of a distribution.

 The distance between 𝐐𝟏 and 𝐐𝟑 is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = 𝐐𝟑 - 𝐐𝟏
Application of IQR
 Outlier detection using five-number summary

 A common rule of thumb for identifying suspected outliers is to single out
values falling at least 1.5 × IQR above $Q_3$ or at least 1.5 × IQR below $Q_1$.

 In other words, observations are treated as extreme if they fall more than
1.5 × IQR outside the quartiles.
Application of IQR
 Five Number Summary
 Since $Q_1$, $Q_2$ and $Q_3$ together contain no information about the endpoints
of the data, a complete summary of the shape of a distribution can be
obtained by providing the lowest and highest data value as well. This is
known as the five-number summary
 The five-number summary of a distribution consists of :
 The Median 𝐐𝟐
 The first quartile 𝐐𝟏
 The third quartile 𝐐𝟑
 The smallest observation
 The largest observation
These, when written in order, give the five-number summary:
Minimum, 𝐐𝟏 , Median (𝐐𝟐 ), 𝐐𝟑 , Maximum
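A minimal Python sketch (not from the slides, assuming numpy is available; the data values are made up) that produces the five-number summary and applies the 1.5 × IQR outlier rule:

import numpy as np

x = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])

q1, q2, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, q2, q3, x.max())

iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

print(five_number)
print(outliers)        # values flagged as suspected outliers (here 7, 15 and 120)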
Box plot
 Graphical view of the five-number summary

[Figure: a box plot marking the Minimum, $Q_1$, Median, $Q_3$ and Maximum, shown
alongside the corresponding histogram.]
Probability and Statistics
Probability is the chance of an outcome (also called an event) in an experiment.

    Experiment: Tossing a fair coin
    Outcomes: Head, Tail

Probability deals with predicting the likelihood of future events, whereas statistics
involves the analysis of the frequency of past events.

Example: Consider there is a drawer containing 100 socks: 30 red, 20 blue and
50 black socks.
We can use probability to answer questions about the selection of a
random sample of these socks.
 PQ1. What is the probability that we draw two blue socks or two red socks from
the drawer?
 PQ2. What is the probability that we pull out three socks and have a matching pair?
 PQ3. What is the probability that we draw five socks and they are all black?
Statistics
Instead, if we have no knowledge about the types of socks in the drawer, then we
enter the realm of statistics. Statistics helps us infer properties of the
population on the basis of a random sample.

Questions that would be statistical in nature are:

 Q1: A random sample of 10 socks from the drawer produced one blue, four red, five
black socks. What is the total population of black, blue or red socks in the drawer?

 Q2: We randomly sample 10 socks, write down the number of black socks, and then
return the socks to the drawer. This process is repeated five times. The mean number
of black socks over these trials is 7. What is the true number of black socks in
the drawer?
 etc.
Probability vs. Statistics
In other words:
 In probability, we are given a model and asked what kind of data we are likely to
see.
 In statistics, we are given data and asked what kind of model is likely to have
generated it.

Example: Measles Study


 A health study is concerned with the incidence of childhood measles in parents of
childbearing age in a city. For each couple, we would like to know how likely it is that
either the mother or the father or both have had childhood measles.
 The current census data indicate that 20% of adults between the ages of 17 and 35
(regardless of sex) have had childhood measles.
 This gives us the probability that an individual in the city has had childhood measles.
Defining Random Variable
Definition: Random Variable
A random variable is a rule that assigns a numerical value to an outcome of
interest.

Example: In the measles study, we define a random variable X as the number of
parents in a married couple who have had childhood measles.
This random variable can take the values 0, 1 and 2.
Note:
 A random variable is not exactly the same as a variable describing the data.
 The probability that the random variable takes a given value can be computed
using the rules governing probability.
 For example, X = 1 means that either the mother or the father, but not both, has had
measles; the probability of this event is 0.32. Symbolically, P(X = 1) = 0.32.
Probability Distribution
Definition : Probability distribution
A probability distribution specifies the probabilities of the values of a
random variable.

Example: Given that 0.2 is the probability that a person (aged between 17 and 35)
has had childhood measles, the probability distribution of X is:

    X    Probability
    0    0.64
    1    0.32
    2    0.04
Probability Distribution
 In data analytics, the probability distribution is important: from it, many statistics
used for making inferences about the population can be derived.

 In general, a probability distribution function takes the following form:

    x                  x1      x2      ……      xn
    f(x) = P(X = x)    f(x1)   f(x2)   ……      f(xn)

Example: Measles study

    x       0       1       2
    f(x)    0.64    0.32    0.04

[Figure: bar chart of f(x) against x with bars of height 0.64, 0.32 and 0.04.]
Usage of Probability Distribution
 Distribution (discrete/continuous) function is widely used in simulation
studies.
 A simulation study uses a computer to simulate a real phenomenon or process as
closely as possible.

 The use of simulation studies can often eliminate the need of costly experiments
and is also often used to study problems where actual experimentation is
impossible.

Examples :
1) In a study testing the effectiveness of a new drug, the number of cured patients
among all the patients who use the drug approximately follows a binomial distribution.

2) In the operation of a ticketing system in a busy public establishment (e.g., an
airport), the arrival of passengers can be simulated using a Poisson distribution.
Binomial Distribution
 In many situations, an experiment has only two possible outcomes: success and failure.
 Such an outcome is called a dichotomous outcome.
 An experiment consisting of repeated trials, each with a dichotomous outcome, is called a
Bernoulli process. Each trial in it is called a Bernoulli trial.

Example : Firing bullets to hit a target.

 Suppose, in a Bernoulli process, we define a random variable X ≡ the number of successes
in n trials.
 Such a random variable obeys the binomial probability distribution if the experiment
satisfies the following conditions:
1) The experiment consists of n trials.
2) Each trial results in one of two mutually exclusive outcomes, one labelled a “success” and
the other a “failure”.
3) The probability of a success on a single trial is equal to 𝒑. The value of 𝑝 remains constant
throughout the experiment.
4) The trials are independent.
Defining Binomial Distribution
Definition: Binomial distribution
The function for computing the probability for the binomial probability
distribution is given by

    $f(x) = \frac{n!}{x!\,(n-x)!} \, p^x (1-p)^{n-x}$,   for x = 0, 1, 2, …, n

Here, $f(x) = P(X = x)$, where X denotes "the number of successes" and X = x
denotes that there are x successes in the n trials.
Binomial Distribution
Example : Measles study
    X = having had childhood measles (a "success")
    p = 0.2, the probability that a parent had childhood measles
    n = 2; here a couple is an experiment, an individual is a trial, and the number
        of trials is two.

Thus,
    $P(X = 0) = \frac{2!}{0!\,2!} (0.2)^0 (0.8)^2 = 0.64$

    $P(X = 1) = \frac{2!}{1!\,1!} (0.2)^1 (0.8)^1 = 0.32$

    $P(X = 2) = \frac{2!}{2!\,0!} (0.2)^2 (0.8)^0 = 0.04$
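A minimal Python sketch (not from the slides, assuming scipy is available) reproducing the measles probabilities with the binomial pmf, both by the formula and via scipy:

from math import comb
from scipy.stats import binom     # assumes scipy is available

n, p = 2, 0.2

for x in range(n + 1):
    by_formula = comb(n, x) * p**x * (1 - p)**(n - x)
    by_scipy = binom.pmf(x, n, p)
    print(x, round(by_formula, 2), round(by_scipy, 2))   # 0.64, 0.32, 0.04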
Binomial Distribution
Example : Verify with a real-life experiment
Suppose 10 pairs of random digits are generated by a computer (Monte Carlo method):

    15  38  68  39  49  54  19  79  38  14

If the value of a digit is 0 or 1, the outcome is "had childhood measles"; otherwise
(digits 2 to 9), the outcome is "did not".
For example, the first pair (i.e., 15) represents a couple, and for this couple x = 1. The
frequency distribution for this sample is

    x                0     1     2
    f(x) = P(X=x)    0.7   0.3   0.0

Note: This has a close similarity with the binomial probability distribution!
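The Monte Carlo idea can be scaled up with a minimal Python sketch (not from the slides): simulate many couples and compare the empirical frequencies of X with the binomial pmf f(0) = 0.64, f(1) = 0.32, f(2) = 0.04.

import random
from collections import Counter

random.seed(0)
p, couples = 0.2, 100_000

counts = Counter(
    sum(random.random() < p for _ in range(2))   # X = number of parents with measles
    for _ in range(couples)
)

for x in range(3):
    print(x, counts[x] / couples)   # approaches 0.64, 0.32, 0.04 as couples grows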


The Multinomial Distribution
The binomial experiment becomes a multinomial experiment if we let each trial have more
than two possible outcomes.

Definition: Multinomial distribution

If a given trial can result in the k outcomes $E_1, E_2, \ldots, E_k$ with probabilities
$p_1, p_2, \ldots, p_k$, then the probability distribution of the random variables
$X_1, X_2, \ldots, X_k$, representing the number of occurrences of $E_1, E_2, \ldots, E_k$ in
n independent trials, is

    $f(x_1, x_2, \ldots, x_k) = \binom{n}{x_1, x_2, \ldots, x_k} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$

where $\binom{n}{x_1, x_2, \ldots, x_k} = \frac{n!}{x_1!\, x_2! \cdots x_k!}$,

    $\sum_{i=1}^{k} x_i = n$   and   $\sum_{i=1}^{k} p_i = 1$
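A minimal Python sketch (not from the slides) evaluating the multinomial pmf. As an illustration only, it uses the sock-drawer proportions 0.3, 0.2, 0.5 and assumes draws with replacement, asking for P(2 red, 1 blue, 2 black) in 5 draws.

from math import factorial

def multinomial_pmf(xs, ps):
    n = sum(xs)
    coeff = factorial(n)
    for x in xs:
        coeff //= factorial(x)          # n! / (x1! x2! ... xk!)
    prob = coeff
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

print(multinomial_pmf([2, 1, 2], [0.3, 0.2, 0.5]))   # = 0.135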
The Poisson Distribution
There are some experiments which involve counting the number of outcomes
occurring during a given time interval (or in a region of space).
Such a process is called a Poisson process.

Example :
Number of clients visiting a ticket selling counter in a metro station.
The Poisson Distribution
Properties of Poisson process
 The number of outcomes in one time interval is independent of the number that occurs
in any other disjoint interval [Poisson process has no memory]
 The probability that a single outcome will occur during a very short interval is
proportional to the length of the time interval and does not depend on the number of
outcomes occurring outside this time interval.
 The probability that more than one outcome will occur in such a short time interval is
negligible.

Definition : Poisson distribution

The probability distribution of the Poisson random variable X, representing the
number of outcomes occurring in a given time interval t, is

    $f(x; \lambda t) = P(X = x) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}$,   x = 0, 1, 2, ……

where λ is the average number of outcomes per unit time and e = 2.71828…
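A minimal Python sketch (not from the slides, assuming scipy is available) of the Poisson pmf for a hypothetical ticket counter where clients arrive at a rate of λ = 4 per minute, over an interval of t = 1 minute:

from math import exp, factorial
from scipy.stats import poisson      # assumes scipy is available

lam_t = 4 * 1                         # lambda * t

for x in range(8):
    by_formula = exp(-lam_t) * lam_t**x / factorial(x)
    by_scipy = poisson.pmf(x, lam_t)
    print(x, round(by_formula, 4), round(by_scipy, 4))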
Descriptive measures
Given a random variable X in an experiment, we have denoted f(x) = P(X = x), the
probability that X = x. For discrete events, f(x) = 0 for all values of x except
x = 0, 1, 2, …

Properties of a discrete probability distribution

1. $0 \le f(x) \le 1$
2. $\sum f(x) = 1$
3. $\mu = \sum x \cdot f(x)$              [the mean]
4. $\sigma^2 = \sum (x - \mu)^2 \cdot f(x)$    [the variance]

In 2, 3 and 4 the summation extends over all possible discrete values of x.

Note: For the discrete uniform distribution, $f(x) = \frac{1}{n}$ with x = 1, 2, ……, n,

    $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$   and   $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Descriptive measures
Binomial distribution
The binomial probability distribution is characterized by p (the probability of
success) and n (the number of trials). Then

    $\mu = np$
    $\sigma^2 = np(1 - p)$

Poisson distribution
The Poisson distribution is characterized by λ (the mean number of outcomes per
unit time) and t (the time interval). Then

    $\mu = \lambda t$
    $\sigma^2 = \lambda t$
Discrete Vs. Continuous Probability Distributions

[Figure: a discrete probability distribution places probability mass f(x) at isolated
points x1, x2, x3, x4, whereas a continuous probability distribution is described by a
smooth density curve f(x) over X = x.]
Continuous Probability Distributions

 When the random variable of interest can take any value in an interval, it is
called continuous random variable.
 Every continuous random variable has an infinite, uncountable number of possible
values (i.e., any value in an interval)

 Consequently, a continuous random variable differs from a discrete random variable
in how probabilities are assigned: to intervals rather than to individual values.
Properties of Probability Density Function
The function f(x) is a probability density function for the continuous random
variable X, defined over the set of real numbers R, if

1. $f(x) \ge 0$, for all $x \in R$

2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

3. $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

4. $\mu = \int_{-\infty}^{\infty} x f(x)\,dx$

5. $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$

[Figure: a density curve f(x) with the area between a and b shaded, representing
P(a ≤ X ≤ b).]
Continuous Uniform Distribution
 One of the simplest continuous distributions in all of statistics is the continuous
uniform distribution.

Definition : Continuous Uniform Distribution

The density function of the continuous uniform random variable X on the
interval [A, B] is:

    $f(x; A, B) = \frac{1}{B - A}$   for $A \le x \le B$,   and 0 otherwise
Continuous Uniform Distribution

[Figure: the density is a flat line at height 1/(B − A) between A and B.]

Note:
a) $\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{B-A} \times (B - A) = 1$

b) $P(c < x < d) = \frac{d - c}{B - A}$, where both c and d lie in the interval (A, B)

c) $\mu = \frac{A + B}{2}$

d) $\sigma^2 = \frac{(B - A)^2}{12}$
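These facts can be checked by simulation with a minimal Python sketch (not from the slides; the interval [A, B] = [2, 10] and the sub-interval (4, 7) are made up):

import random

random.seed(1)
A, B = 2.0, 10.0
samples = [random.uniform(A, B) for _ in range(200_000)]

c, d = 4.0, 7.0
p_cd = sum(c < x < d for x in samples) / len(samples)      # should be (d-c)/(B-A) = 0.375

mean = sum(samples) / len(samples)                          # should be (A+B)/2 = 6
var = sum((x - mean) ** 2 for x in samples) / len(samples)  # should be (B-A)^2/12 ≈ 5.33

print(p_cd, mean, var)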
Normal Distribution
 The most often used continuous probability distribution is the normal
distribution; it is also known as Gaussian distribution.

 Its graph called the normal curve is the bell-shaped curve.

 Such a curve approximately describes many phenomena that occur in nature,
industry and research.
 Physical measurements in areas such as meteorological experiments, rainfall
studies and the measurement of manufactured parts are often more than adequately
explained by a normal distribution.

 A continuous random variable X having the bell-shaped distribution is called a


normal random variable.
Normal Distribution
• The mathematical equation for the probability distribution of the normal variable
depends upon two parameters, μ and σ, its mean and standard deviation.

[Figure: the bell-shaped normal density f(x), centred at μ with spread σ.]

Definition : Normal distribution

The density of the normal variable x with mean μ and variance $\sigma^2$ is

    $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$,   $-\infty < x < \infty$

where π = 3.14159… and e = 2.71828…, the Napierian constant.

Normal Distribution
[Figures comparing normal curves:
 - Normal curves with µ1 < µ2 and σ1 = σ2 (same shape, shifted centres)
 - Normal curves with µ1 = µ2 and σ1 < σ2 (same centre, different spreads)
 - Normal curves with µ1 < µ2 and σ1 < σ2]
Normal Curve (6-sigma)
Properties of Normal Distribution
 The curve is symmetric about a vertical axis through the mean μ.
 The random variable x can take any value from −∞ to ∞.
 The most frequently used descriptive parameters, μ and σ, define the curve itself.
 The mode, the point on the horizontal axis where the curve is a maximum, occurs
at x = μ.
 The total area under the curve and above the horizontal axis is equal to 1:

    $\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1$

 $\mu = \int_{-\infty}^{\infty} x f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$

 $\sigma^2 = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$

 $P(x_1 < x < x_2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$

denotes the probability of x lying in the interval $(x_1, x_2)$.
Standard Normal Distribution
 With the normal distribution, it is computationally hard to calculate
$P(x_1 < x < x_2)$ directly for arbitrary $(x_1, x_2)$ and given μ and σ.
 To avoid this difficulty, the z-transformation is used:

    $z = \frac{x - \mu}{\sigma}$        [z-transformation]

 X: normal distribution with mean μ and variance $\sigma^2$.
 Z: standard normal distribution with mean μ = 0 and variance $\sigma^2 = 1$.
 Probabilities under f(x; μ, σ) then correspond to probabilities under f(z; 0, 1):

    $P(x_1 < x < x_2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx
                      = \frac{1}{\sqrt{2\pi}} \int_{z_1}^{z_2} e^{-\frac{z^2}{2}}\,dz
                      = P(z_1 < z < z_2)$

where $z_1 = \frac{x_1 - \mu}{\sigma}$ and $z_2 = \frac{x_2 - \mu}{\sigma}$.
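A minimal Python sketch (not from the slides, assuming scipy is available): the parameters μ = 100, σ = 15 and the interval (90, 120) are made-up illustrations of the z-transformation.

from scipy.stats import norm      # assumes scipy is available

mu, sigma = 100.0, 15.0
x1, x2 = 90.0, 120.0

z1, z2 = (x1 - mu) / sigma, (x2 - mu) / sigma

p_via_z = norm.cdf(z2) - norm.cdf(z1)                  # standard normal "table" lookup
p_direct = norm.cdf(x2, mu, sigma) - norm.cdf(x1, mu, sigma)

print(round(p_via_z, 4), round(p_direct, 4))           # both ≈ 0.656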
Standard Normal Distribution
Definition : Standard normal distribution
The distribution of a normal random variable with mean 0 and variance 1 is called
a standard normal distribution.

[Figure: the general normal density f(x; µ, σ), centred at x = µ with spread σ, shown
alongside the standard normal density f(z; 0, 1), which has µ = 0 and σ = 1.]
Reference Book

 Probability and Statistics for Engineers and Scientists (8th Ed.) by Ronald E. Walpole,
Sharon L. Myers, Keying Ye (Pearson), 2013.
Any question?
You may also send your question(s) at [email protected]
