Da Session 2
Note:
A data set is composed of information from a set of units.
An observation consists of one or more pieces of information about a unit; these are
called variables.
Population
Definition : Population
Note:
1. All people in the country/world is not a population.
3. For statistical learning, it is important to define the population that we intend to study
very carefully.
Sample
Definition : Sample
Note:
Normally a sample is obtained in such a way as to be representative of the population.
Statistic
Definition : Statistic
Is there a relationship between the age of a viewer and his/her general happiness?
Is there a relationship between the age of the viewer and the number of TV hours
watched?
Data Summarization
To identify the typical characteristics of data (i.e., to have an overall picture).
Measures of location
Measures of dispersion
Measurement of location
It is also called measuring the central tendency.
A function of the sample values that summarizes the location information into a single
number is known as a measure of location.
Example
sum(), count()
Algebraic measure
It is a measure that can be computed by applying an algebraic function to one or
more distributive measures.
Example
average = sum( ) / count( )
Holistic measure
It is a measure that must be computed on the entire data set as a whole.
Example
Calculating median
What about mode?
Mean of a sample
The mean of sample data is denoted by $\bar{x}$. The commonly used mean
measurements are:
Simple mean
Weighted mean
Trimmed mean
In the next few slides, we shall learn how to calculate the mean of a sample.
Note: When all weights are equal, the weighted mean reduces to simple mean.
Trimmed mean of a sample
Trimmed Mean
If there are extreme values (also called outliers) in a sample, then the mean is
strongly influenced by those values. To offset the effect of those
extreme values, we can use the concept of a trimmed mean.
\[ \bar{x} = \frac{n\,\bar{x}_n \pm m\,\bar{x}_m}{n \pm m} \]
Properties of mean
Lemma 5:
If a constant c is subtracted (or added) from each sample value, then the mean
of the transformed variable is linearly displaced by c. That is,
\[ \bar{x}' = \bar{x} \mp c \]
Lemma 6:
If each observation is scaled by multiplying (or dividing) by a non-zero constant c,
then the altered mean is given by
\[ \bar{x}' = \bar{x} \ast c \]
where $\ast$ is the × (multiplication) or ÷ (division) operator.
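Lemma 5 and Lemma 6 can be checked numerically. The sketch below is only an illustration; the sample values and the constant c are hypothetical and not taken from the slides.

```python
from statistics import mean

sample = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]  # hypothetical sample values
c = 3.0                                       # hypothetical non-zero constant

x_bar = mean(sample)

# Lemma 5: subtracting c from every observation shifts the mean by c.
shifted_mean = mean(x - c for x in sample)

# Lemma 6: multiplying (or dividing) every observation by c
# multiplies (or divides) the mean by c.
scaled_mean = mean(x * c for x in sample)
divided_mean = mean(x / c for x in sample)

print(x_bar, shifted_mean, scaled_mean, divided_mean)
```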
Mean with grouped data
Sometimes data is given in the form of classes and frequency for each class.
Class 𝑥1 - 𝑥2 𝑥2 - 𝑥3 ….. 𝑥𝑖 - 𝑥𝑖+1 ….. 𝑥𝑛−1 - 𝑥𝑛
Frequency 𝑓1 𝑓2 ….. 𝑓𝑖 ….. 𝑓𝑛
Direct Method
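\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \]
where $x_i$ is the midpoint of the $i$-th class, $x_i = \frac{1}{2}(\text{lower limit} + \text{upper limit})$,
i.e., $\frac{x_i + x_{i+1}}{2}$ in terms of the class limits in the table
(also called the class mark), and $f_i$ is the frequency of the $i$-th class.
Note: $\sum_{i=1}^{n} f_i (x_i - \bar{x}) = 0$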
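A minimal sketch of the direct method, assuming hypothetical class limits and frequencies (none of these numbers come from the slides). It also checks the note that the frequency-weighted deviations from the mean sum to zero.

```python
# Hypothetical grouped data: classes 10-20, 20-30, 30-40, 40-50.
limits = [10, 20, 30, 40, 50]
freqs = [5, 8, 12, 5]

# Midpoint of each class: (lower limit + upper limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in zip(limits, limits[1:])]

# Direct method: sum of f_i * x_i divided by sum of f_i
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

# The weighted deviations from the mean should sum to (numerically) zero.
residual = sum(f * (x - grouped_mean) for f, x in zip(freqs, midpoints))

print(grouped_mean, residual)
```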
Assumed mean method
Assumed Mean Method
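\[ \bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} \]
where A is the assumed mean (it is usually a class midpoint $\frac{x_i + x_{i+1}}{2}$
chosen in the middle of the groups) and $d_i = x_i - A$ for each $i$, with $x_i$ the
midpoint of the $i$-th class.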
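The assumed mean method can be compared against the direct method. This sketch takes the deviation as d_i = x_i - A, which is the sign convention under which the formula with "A +" holds; the midpoints, frequencies and the choice of A are hypothetical.

```python
midpoints = [15.0, 25.0, 35.0, 45.0]  # hypothetical class midpoints x_i
freqs = [5, 8, 12, 5]                 # hypothetical class frequencies f_i
A = 25.0                              # assumed mean, picked from the midpoints

# Deviation of each midpoint from the assumed mean: d_i = x_i - A
deviations = [x - A for x in midpoints]

assumed_mean_result = A + sum(f * d for f, d in zip(freqs, deviations)) / sum(freqs)
direct_result = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

print(assumed_mean_result, direct_result)
```

Both results agree whatever value of A is chosen; a value near the middle only keeps the deviations small for hand calculation.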
Step deviation method
Groups are with inclusive classes, i.e., $x_i = x_{i-1}$ (the lower limit of a class
is the same as the upper limit of the previous class)
10 - 19 20 - 29 30 - 39 40 − 49
Example:
Suppose, there is a data relating the marks obtained by 200 students in an
examination
444, 412, 478, 467, 432, 450, 410, 465, 435, 454, 479, …….
A cumulative percentage (% C) frequency of 0.65 for the third class, 439.5 - 449.5, means that 65%
of all scores are found in this class or below.
Information from Ogive
Less-than and more-than Ogive approach
The crossing point of the less-than and more-than Ogive plots gives the median of the sample.
Some other measures of mean
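For a sample of two values, S: $x_1, x_2$:
Arithmetic Mean (AM)
\[ \bar{x} = \frac{x_1 + x_2}{2}, \quad \text{equivalently} \quad \bar{x} - x_1 = x_2 - \bar{x} \]
Harmonic Mean (HM)
\[ \bar{x} = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}, \quad \text{equivalently} \quad \frac{2}{\bar{x}} = \frac{1}{x_1} + \frac{1}{x_2} \]
Geometric mean (GM)
\[ \bar{x} = \sqrt{x_1 \cdot x_2}, \quad \text{equivalently} \quad \frac{x_1}{\bar{x}} = \frac{\bar{x}}{x_2} \]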
Geometric mean
Definition : Geometric mean
\[ \bar{x} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \]
where $n \neq 0$
Note
GM is the arithmetic mean in “log space”. This is because, alternatively,
\[ \log \bar{x} = \frac{1}{n} \sum_{i=1}^{n} \log x_i \]
This summary of measurement is meaningful only when all observations are > 0.
If at least one observation is zero, the product will itself be zero! For a negative value, root is
not real
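The "log space" remark can be illustrated by computing the geometric mean two ways: as the n-th root of the product and as the exponential of the mean logarithm. The observations below are hypothetical and strictly positive.

```python
import math

values = [2.0, 8.0, 4.0]  # hypothetical positive observations
n = len(values)

# n-th root of the product of the observations
gm_product = math.prod(values) ** (1 / n)

# Arithmetic mean of the logarithms, mapped back with exp
gm_log = math.exp(sum(math.log(x) for x in values) / n)

print(gm_product, gm_log)
```

The log form is often preferred for long samples because the raw product can overflow or underflow.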
Harmonic mean
Definition : Harmonic mean
If all observations are non zero, the reciprocal of the arithmetic mean of the
reciprocals of observations is known as harmonic mean.
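A short sketch of this definition, using hypothetical non-zero observations. The hand-written version is compared with `statistics.harmonic_mean` from the Python standard library.

```python
from statistics import harmonic_mean, mean

values = [40.0, 60.0]  # hypothetical non-zero observations

# Reciprocal of the arithmetic mean of the reciprocals
hm_manual = 1 / mean(1 / x for x in values)

# Standard-library equivalent
hm_library = harmonic_mean(values)

print(hm_manual, hm_library)
```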
Rainfall (in mm)     35   18   …   22
Days (in number)      7    7   …    7
Significance of different mean calculations
Case 2: Ranges are different, but the observation remains the same
Rainfall (in mm)     50   50   …   50
Days (in number)      1    2   …    7
Significance of different mean calculations
Case 3: Ranges are different, as well as the observations
Rainfall (in mm)     21   34   …   18
Days (in number)      5    3   …    7
Rules of thumb for means
AM: When the range remains the same for each observation
Example: Case 1
Rainfall (in mm)     35   18   …   22
Days (in number)      7    7   …    7
\[ \bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i \]
Rules of thumb for means
HM: When the range is different but each observation is the same
Example: Case 2
Rainfall (in mm)     50   50   …   50
Days (in number)      1    2   …    7
\[ \bar{r} = \frac{n}{\sum_{i=1}^{n} \frac{1}{r_i}} \]
Rules of thumb for means
GM: When the ranges are different as well as the observations
Example: Case 3
Rainfall (in mm)     21   34   …   18
Days (in number)      5    3   …    7
\[ \bar{r} = \left( \prod_{i=1}^{n} r_i \right)^{1/n} \]
Rules of thumb for means
The important thing to recognize is that all three means are simply
arithmetic means in disguise!
Each of the three means can be obtained with the following steps
Median of a sample is the middle value when the data are arranged in
increasing (or decreasing) order. Symbolically,
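\[ \tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left\{ x_{n/2} + x_{(n/2)+1} \right\} & \text{if } n \text{ is even} \end{cases} \]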
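The odd and even cases of the sample median can be sketched as follows. The function name `sample_median` is made up for this illustration; the formula uses 1-based positions, while Python lists are 0-based.

```python
def sample_median(data):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n+1)/2 in 1-based terms is index mid
        return ordered[mid]
    # n even: positions n/2 and n/2 + 1 are indices mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 1, 5]), sample_median([4, 1, 3, 2]))
```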
Median of a sample
Definition : Median of a grouped data
Select the modal class (it is the class with the highest frequency). Then
the mode $\hat{x}$ is given by:
\[ \hat{x} = l + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, h \]
where,
h is the class width
$\Delta_1$ is the difference between the frequency of the modal class and the
frequency of the class just before the modal class
$\Delta_2$ is the difference between the frequency of the modal class and the
frequency of the class just after the modal class
l is the lower boundary of the modal class
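A sketch of the grouped-data mode formula. It takes Δ1 as the modal frequency minus the frequency of the preceding class and Δ2 as the modal frequency minus the frequency of the following class, which is the usual convention for this formula. The function name, the data, and the choice to treat a missing neighbouring class as frequency 0 are assumptions made for the illustration.

```python
def grouped_mode(lower_bounds, width, freqs):
    """Mode of grouped data: l + d1 / (d1 + d2) * h for the modal class."""
    # Index of the modal class (first class with the highest frequency)
    k = max(range(len(freqs)), key=freqs.__getitem__)
    f_prev = freqs[k - 1] if k > 0 else 0               # assumption: 0 if no class before
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0  # assumption: 0 if no class after
    d1 = freqs[k] - f_prev
    d2 = freqs[k] - f_next
    if d1 + d2 == 0:
        raise ValueError("modal class is not unique enough to interpolate")
    return lower_bounds[k] + d1 / (d1 + d2) * width


# Hypothetical classes 10-20, 20-30, 30-40, 40-50 with width 10
mode_estimate = grouped_mode([10, 20, 30, 40], 10, [5, 8, 12, 5])
print(mode_estimate)
```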
Note
If each data value occurs only once, then there is no mode!
Relation between mean, median and mode
There is an empirical relation, valid for moderately skewed data
Steps
1. A percentage ‘p’ between 0 and 100 is specified.
2. The top and bottom (p/2)% of the data are thrown out
3. The mean is then calculated in the normal way
Note
• Trimmed mean is a special case of Midrange.
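The trimming steps listed above can be sketched as below. The function name and data are hypothetical, and rounding the number of discarded values down to a whole number is an assumption, since the steps do not say how to handle fractional counts.

```python
def trimmed_mean(data, p):
    """Mean after discarding the top and bottom (p/2)% of the sorted data."""
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    if not data:
        raise ValueError("sample must not be empty")
    ordered = sorted(data)
    # Number of values dropped from each end (assumption: rounded down)
    k = int(len(ordered) * (p / 2) / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]  # hypothetical data with one outlier
print(trimmed_mean(sample, 0), trimmed_mean(sample, 20))
```

With p = 0 nothing is discarded and the ordinary mean is recovered; with p = 20 the single large value no longer dominates.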
Measures of dispersion
Location measures alone are insufficient to understand data.
Another set of commonly used summary statistics for continuous data are
those that measure the dispersion.
A dispersion measure quantifies the extent of spread of observations in a sample.
Both samples have the same mean. However, the bottles from company A have more
uniform content than those from company B.
We say that the dispersion (or variability) of the observations from the average is
less for sample A than for sample B.
The variability in a sample should display how the observations spread out from the average.
In buying juice, a customer should feel more confident buying it from A than from B.
Range of a sample
Definition : Range of a sample
R = max(X) - min(X) = $x_n - x_1$
Percentile
The percentile of a set of ordered data can be defined as follows:
Example: The 50th percentile is that value $x_{50\%}$ such that 50% of all values
of x are less than $x_{50\%}$.
The quartiles, including the median, give some indication of the center, spread
and shape of a distribution.
The distance between $Q_1$ and $Q_3$ is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = $Q_3 - Q_1$
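A small sketch of computing the quartiles and the IQR with `statistics.quantiles` from the Python standard library. The data are hypothetical, and the library's default "exclusive" interpolation is only one of several quartile conventions; other conventions can give slightly different values.

```python
from statistics import quantiles

data = [7, 1, 5, 3, 6, 2, 4]  # hypothetical sample

# n=4 splits the data into four equal-probability groups and
# returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1  # range covered by the middle half of the data

print(q1, q2, q3, iqr)
```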
Application of IQR
Outlier detection using five-number summary
Maximum
Q3
Median
Q1
Minimum
Box plot
Histogram
Probability and Statistics
Probability is the chance of an outcome in an experiment (also called event).
Probability deals with predicting the likelihood of future events.
Statistics involves the analysis of the frequency of past events.
Example: Consider there is a drawer containing 100 socks: 30 red, 20 blue and
50 black socks.
We can use probability to answer questions about the selection of a
random sample of these socks.
PQ1. What is the probability that we draw two blue socks or two red socks from
the drawer?
PQ2. What is the probability that we pull out three socks and have a matching pair?
PQ3. What is the probability that we draw five socks and they are all black?
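As an illustration, PQ3 can be worked out directly. The sketch assumes the five socks are drawn at random without replacement, which the question does not state explicitly.

```python
from math import comb

total_socks = 100
black_socks = 50
draws = 5

# Ways to choose 5 black socks divided by ways to choose any 5 socks
p_all_black = comb(black_socks, draws) / comb(total_socks, draws)

# Same probability as a chain of conditional draws: 50/100 * 49/99 * ...
p_sequential = 1.0
for i in range(draws):
    p_sequential *= (black_socks - i) / (total_socks - i)

print(p_all_black, p_sequential)
```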
Statistics
Instead, if we have no knowledge about the type of socks in the drawer, then we
enter into the realm of statistics. Statistics helps us to infer properties about the
population on the basis of the random sample.
Q1: A random sample of 10 socks from the drawer produced one blue, four red and five
black socks. What is the total population of black, blue or red socks in the drawer?
Q2: We randomly sample 10 socks, write down the number of black socks and
then return the socks to the drawer. The process is repeated five times. The mean
number of black socks over these trials is 7. What is the true number of black socks in
the drawer?
etc.
Probability vs. Statistics
In other words:
In probability, we are given a model and asked what kind of data we are likely to
see.
In statistics, we are given data and asked what kind of model is likely to have
generated it.
Example : Given that 0.2 is the probability that a person (aged between 17
and 35) has had childhood measles, the probability distribution of X, the number of
persons in a couple who have had childhood measles, is given by
X    Probability
0    0.64
1    0.32
2    0.04
Probability Distribution
In data analytics, the probability distribution is important because many
statistics used for making inferences about a population can be derived from it.
x                  x_1      x_2      …   x_n
f(x) = P(X = x)    f(x_1)   f(x_2)   …   f(x_n)

x       0      1      2
f(x)    0.64   0.32   0.04
[Figure: plot of f(x) against x for the table above]
Usage of Probability Distribution
Distribution functions (discrete or continuous) are widely used in simulation
studies.
A simulation study uses a computer to simulate a real phenomenon or process as
closely as possible.
The use of simulation studies can often eliminate the need of costly experiments
and is also often used to study problems where actual experimentation is
impossible.
Examples :
1) In a study testing the effectiveness of a new drug, the number of cured
patients among all the patients who use the drug approximately follows a
binomial distribution.
Thus,
\[ P(x = 0) = \frac{2!}{0!\,(2-0)!} (0.2)^0 (0.8)^{2-0} = 0.64 \]
\[ P(x = 1) = \frac{2!}{1!\,(2-1)!} (0.2)^1 (0.8)^{2-1} = 0.32 \]
\[ P(x = 2) = \frac{2!}{2!\,(2-2)!} (0.2)^2 (0.8)^{2-2} = 0.04 \]
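The three probabilities above can be reproduced with a few lines of code. The helper name `binomial_pmf` is made up for this sketch; n = 2 and p = 0.2 are the values used in the measles example.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 2, 0.2
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, prob in distribution.items():
    print(x, round(prob, 2))
```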
Binomial Distribution
Example : Verify with real-life experiment
Suppose 10 pairs of random digits are generated by a computer (Monte-Carlo method)
15 38 68 39 49 54 19 79 38 14
If the value of a digit is 0 or 1, the outcome is “had childhood measles”; otherwise
(digits 2 to 9), the outcome is “did not”.
For example, the first pair (i.e., 15) represents a couple, and for this couple x = 1. The
frequency distribution for this sample is
x 0 1 2
f(x)=P(X=x) 0.7 0.3 0.0
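The tabulation above can be reproduced from the ten digit pairs given in the example. The helper name `measles_count` is made up for this sketch.

```python
from collections import Counter

pairs = ["15", "38", "68", "39", "49", "54", "19", "79", "38", "14"]


def measles_count(pair):
    """Number of digits in the pair that are 0 or 1 (outcome: had childhood measles)."""
    return sum(1 for digit in pair if digit in "01")


counts = Counter(measles_count(pair) for pair in pairs)

# Relative frequency of x = 0, 1, 2 over the ten simulated couples
frequency = {x: counts[x] / len(pairs) for x in (0, 1, 2)}

print(frequency)
```

With only ten pairs the relative frequencies differ from the theoretical values 0.64, 0.32 and 0.04; a larger simulation would be expected to get closer.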
where
\[ \binom{n}{x_1, x_2, \ldots, x_k} = \frac{n!}{x_1!\,x_2! \cdots x_k!}, \quad \sum_{i=1}^{k} x_i = n \quad \text{and} \quad \sum_{i=1}^{k} p_i = 1 \]
The Poisson Distribution
There are some experiments that involve counting the number of
outcomes occurring during a given time interval (or in a region of space).
Such a process is called Poisson process.
Example :
Number of clients visiting a ticket selling counter in a metro station.
The Poisson Distribution
Properties of Poisson process
The number of outcomes in one time interval is independent of the number that occurs
in any other disjoint interval [Poisson process has no memory]
The probability that a single outcome will occur during a very short interval is
proportional to the length of the time interval and does not depend on the number of
outcomes occurring outside this time interval.
The probability that more than one outcome will occur in such a short time interval is
negligible.
\[ f(x, \lambda t) = P(X = x) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}, \quad x = 0, 1, \ldots \]
where 𝜆 is the average number of outcomes per unit time and 𝑒 = 2.71828 …
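A minimal sketch of evaluating this formula. The function name `poisson_pmf` and the rate used in the example (4 clients per minute, observed for half a minute) are hypothetical.

```python
import math


def poisson_pmf(x, lam, t=1.0):
    """P(X = x) when outcomes occur at an average rate lam per unit time over duration t."""
    mu = lam * t
    return math.exp(-mu) * mu**x / math.factorial(x)


# Hypothetical ticket counter: 4 clients per minute, observed for 0.5 minute
p_no_client = poisson_pmf(0, lam=4.0, t=0.5)
p_total = sum(poisson_pmf(x, lam=4.0, t=0.5) for x in range(50))

print(p_no_client, p_total)
```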
Descriptive measures
Given a random variable X in an experiment, we have denoted $f(x) = P(X = x)$, the
probability that $X = x$. For discrete events, $f(x) = 0$ for all values of $x$ except
$x = 0, 1, 2, \ldots$
For the binomial distribution:
\[ \mu = np \]
\[ \sigma^2 = np(1 - p) \]
Poisson Distribution
The Poisson distribution is characterized by $\lambda$, where $\lambda$ = the mean
number of outcomes per unit time and $t$ = the time interval.
\[ \mu = \lambda t \]
\[ \sigma^2 = \lambda t \]
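The mean and variance formulas for the binomial and Poisson distributions can be checked by summing over the probability function. The parameter values are hypothetical (n = 2, p = 0.2 as in the measles example, and λt = 2), and the Poisson sum is truncated at 60 terms as an approximation of the infinite sum.

```python
import math


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution from its probability function."""
    mu = sum(x * pmf(x) for x in support)
    var = sum((x - mu) ** 2 * pmf(x) for x in support)
    return mu, var


n, p = 2, 0.2   # binomial parameters
lam_t = 2.0     # hypothetical value of lambda * t


def binomial(x):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson(x):
    return math.exp(-lam_t) * lam_t**x / math.factorial(x)


binom_mu, binom_var = mean_and_variance(binomial, range(n + 1))
pois_mu, pois_var = mean_and_variance(poisson, range(60))  # truncated support

print(binom_mu, binom_var, pois_mu, pois_var)
```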
Discrete Vs. Continuous Probability Distributions
[Figure: a discrete probability distribution, f(x) at points x1, x2, x3, x4, beside a continuous probability distribution, f(x) against X = x]
Continuous Probability Distributions
When the random variable of interest can take any value in an interval, it is
called continuous random variable.
Every continuous random variable has an infinite, uncountable number of possible
values (i.e., any value in an interval)
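1. $f(x) \geq 0$, for all $x \in R$
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
3. $P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$
4. $\mu = \int_{-\infty}^{\infty} x f(x)\,dx$
5. $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$
[Figure: density f(x) against X = x, with the area between a and b shaded]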
Continuous Uniform Distribution
One of the simplest continuous distribution in all of statistics is the continuous
uniform distribution.
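[Figure: continuous uniform density f(x) with constant height c over the interval from A to B]
Note:
a) $\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{B-A} \times (B - A) = 1$
b) $P(c < x < d) = \frac{d-c}{B-A}$, where both $c$ and $d$ are in the interval (A, B)
c) $\mu = \frac{A+B}{2}$
d) $\sigma^2 = \frac{(B-A)^2}{12}$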
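Notes b) to d) can be illustrated numerically. The interval end points A = 0 and B = 10 and the function name are hypothetical; the variance is also checked with a simple midpoint-rule integration of (x - μ)² f(x) as a sanity check, not as a recommended method.

```python
A, B = 0.0, 10.0          # hypothetical interval end points
density = 1 / (B - A)     # constant height of the uniform density on (A, B)


def prob_between(c, d):
    """P(c < x < d) for c and d inside the interval (A, B)."""
    if not A <= c <= d <= B:
        raise ValueError("c and d must satisfy A <= c <= d <= B")
    return (d - c) * density


mu = (A + B) / 2
variance = (B - A) ** 2 / 12

# Midpoint-rule approximation of the variance integral over (A, B)
steps = 100_000
h = (B - A) / steps
variance_numeric = sum(
    ((A + (i + 0.5) * h) - mu) ** 2 * density * h for i in range(steps)
)

print(prob_between(2.0, 5.0), mu, variance, variance_numeric)
```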
Normal Distribution
The most often used continuous probability distribution is the normal
distribution; it is also known as Gaussian distribution.
[Figure: normal curve f(x) against x with mean μ and standard deviation σ; normal curves with µ1 < µ2 and σ1 < σ2]
Normal Curve (6-sigma)
Properties of Normal Distribution
The curve is symmetric about a vertical axis through the mean 𝜇.
The random variable $x$ can take any value from $-\infty$ to $\infty$.
The most frequently used descriptive parameters define the curve itself.
The mode, which is the point on the horizontal axis where the curve is a
maximum, occurs at $x = \mu$.
The total area under the curve and above the horizontal axis is equal to 1.
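\[ \int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx = 1 \]
\[ \mu = \int_{-\infty}^{\infty} x\,f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx \]
\[ \sigma^2 = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x-\mu)^2\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx \]
\[ P(x_1 < x < x_2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx \]
denotes the probability of x in the interval $(x_1, x_2)$.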
Standard Normal Distribution
Calculating $P(x_1 < x < x_2)$ directly from the normal distribution is computationally
difficult for arbitrary $(x_1, x_2)$ and given $\mu$ and $\sigma$.
To avoid this difficulty, the concept of $z$-transformation is followed.
\[ z = \frac{x - \mu}{\sigma} \quad \text{[Z-transformation]} \]
= 𝑓(𝑧: 0, 𝜎)
Standard Normal Distribution
Definition : Standard normal distribution
The distribution of a normal random variable with mean 0 and variance 1 is called
a standard normal distribution.
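A sketch of how the z-transformation is used in practice: any normal probability is reduced to the standard normal distribution function, written here with `math.erf` from the Python standard library. The function names and the example parameters (μ = 50, σ = 10) are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def normal_probability(x1, x2, mu, sigma):
    """P(x1 < x < x2) for a normal variable, via z = (x - mu) / sigma."""
    z1 = (x1 - mu) / sigma
    z2 = (x2 - mu) / sigma
    return standard_normal_cdf(z2) - standard_normal_cdf(z1)


# Hypothetical example: probability of falling within one sigma of the mean
p_one_sigma = normal_probability(40.0, 60.0, mu=50.0, sigma=10.0)
print(round(p_one_sigma, 4))
```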
[Figure: normal curve f(x: µ, σ) centered at x = µ, beside the standard normal curve f(z: 0, 1) centered at µ = 0 with σ = 1]