Introduction to Statistical Methods (ISM)
Slides Modified by
Dr YVK Ravi Kumar
[email protected]
Session 1
(3rd / 4th May 2025)
Overview of the course & Basics of Statistics
Overview of the course
Module 1 : Basic Probability & Statistics
Module 2 : Conditional Probability & Bayes’ theorem
Module 3 : Probability Distributions
Module 4 : Hypothesis Testing
Module 5 : Prediction & Forecasting
Module 6 : Prediction & Forecasting; Gaussian Mixture Model & Expectation Maximization
TEXT BOOKS
T1 : Statistics for Data Scientists: An Introduction to Probability, Statistics and Data Analysis, Maurits Kaptein et al., Springer, 2022
T2 : Probability and Statistics for Engineering and Sciences, 8th Edition, Jay L Devore, Cengage Learning
T3 : Introduction to Time Series and Forecasting, Second Edition, Peter J Brockwell, Richard A Davis, Springer
Evaluation Components
Name                               Type          Weightage
EC 1  Quiz 1 & Quiz 2              Online        10%
      Assignment 1 & Assignment 2  Online        20%
EC 2  Mid semester Exam            Closed Book   30%
EC 3  Comprehensive Exam           Open Book     40%
Module 1: (Basic Probability & Statistics)
Contact Session   List of Topic Title                              Reference
CS - 1            Measures of Central Tendency & Measures of       T1 & T2
                  Variability, Data – Symmetric & Asymmetric,
                  outlier detection, 5 point summary
Statistics
Statistics may be defined as the science that is employed to
o Collect the data
o Present and organize the data in a systematic manner
o Analyse the data
o Draw inferences from the data
o Make decisions based on the data
Data - Types
Data
o Categorical (defined categories)
  Examples: Marital Status, Political Party, Eye Color
o Numerical
  o Discrete (counted items)
    Examples: Number of Children, Defects per hour
  o Continuous (measured characteristics)
    Examples: Weight, Voltage
Levels of Data Measurement
Nominal — Lowest level of measurement
Ordinal
Interval
Ratio — Highest level of measurement
Nominal scale
o A nominal scale classifies data into distinct categories in which no ranking is implied.
o Example: Gender, Marital Status, voting for different parties
Ordinal scale
o An ordinal scale classifies data into distinct categories in which ranking is implied.
o Example:
1. Product satisfaction: Satisfied, Neutral, Unsatisfied
2. Faculty rank: Professor, Associate Professor, Assistant Professor
3. Student Grades: A, B, C, D, F
4. Medals won: Gold, Silver, Bronze
Interval scale
o An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
o Example:
1. Temperature in Fahrenheit and Celsius
2. Months of the Year: there’s no month called zero and we can’t say January is twice as much as June.
Ratio scale
o A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
o Examples: Weight, Age, Salary
Types of Variable
Qualitative (Categorical): expresses a qualitative attribute such as hair color, eye color, religion.
o Nominal: no ordering is possible, such as hair color, eye color, religion.
o Ordinal: ordering is possible, such as health, which can take values such as poor, reasonable, good, or excellent.
Quantitative (Numerical): measured in terms of numbers, such as height, weight, number of people.
o Discrete: countable, with a finite (or countably infinite) number of possibilities, such as number of people.
o Continuous: not countable, with an infinite number of possibilities, such as height.
INTERVAL: ratios of values of the variable do not have any meaning, and the variable does not have an inherently defined zero value, such as temperature.
RATIO: ratios of values of the variable have meaning, and the variable has an inherently defined zero value, such as length.
Example
Question                                                   Type of Variable
Types of vehicle owned                                     Categorical: Nominal
  (Two wheeler, four wheeler)
Product satisfaction                                       Categorical: Ordinal
  (Unsatisfied, neutral, fairly satisfied, satisfied)
To how many magazines do you currently subscribe           Discrete
  (Zero, One, Two, Three, Four)
How tall are you (in inches)                               Continuous: Ratio
Weight (in Kilograms)                                      Continuous: Ratio
Temperature (in degrees Celsius or degrees Fahrenheit)     Continuous: Interval
Measures of Central Tendency
A measure of central tendency provides a very convenient way of
describing a set of scores with a single number that describes the
PERFORMANCE of the group.
It is also defined as a single value that is used to describe the “center”
of the data.
Three commonly used measures of central tendency:
1. Mean
2. Median
3. Mode
Mean
Also referred to as the “arithmetic average”
The most commonly used measure of the center of data
A number that describes what is average or typical of the distribution
Computation of Sample Mean:
$$\bar{Y} = \frac{\sum Y}{N}$$
Computation of the Mean for grouped Data:
$$\bar{Y} = \frac{\sum fY}{N}$$
where $fY$ = a score multiplied by its frequency
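As a rough illustration of the grouped-data formula, the sketch below computes sum(f * Y) / N for a small frequency table. The table values are hypothetical and are not taken from the slides; `grouped_mean` is an illustrative helper name.

```python
# Hypothetical frequency table (NOT from the slides): score -> frequency
frequency_table = {1: 4, 2: 6, 3: 7, 4: 3}


def grouped_mean(table):
    """Return sum(f * Y) / N, where N is the total of all frequencies."""
    total_count = sum(table.values())  # N
    weighted_sum = sum(score * freq for score, freq in table.items())  # sum of fY
    return weighted_sum / total_count


print(grouped_mean(frequency_table))  # 49 / 20
```

Here N is taken to be the total number of scores, that is, the sum of the frequencies.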
Mean: Ungrouped Data
Suppose you define the time to get ready as the time (rounded to the nearest minute) from
when you get out of bed to when you leave your home. You collect the times shown below for
10 consecutive work days:
Day         1   2   3   4   5   6   7   8   9   10
Time (min)  39  29  43  52  39  44  40  31  44  35
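A minimal sketch of the sample-mean computation for these ten times, using only built-in Python; the variable names are illustrative.

```python
# Times to get ready (minutes) for the 10 consecutive work days listed above
times_min = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

total = sum(times_min)                 # sum of Y
sample_mean = total / len(times_min)   # sum of Y divided by N

print(total, sample_mean)  # 396 39.6
```

The mean time to get ready is therefore 39.6 minutes, a value that does not occur on any single day.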
Mean: Grouped Scores
Data of Children watching TV in Bengaluru
Mean
PROPERTIES
o It is stable. The mean is the most stable of the measures of central tendency because
every score contributes to its value.
o It is easily affected by extreme scores.
o The sum of each score’s signed distance (deviation) from the mean is zero.
o It can be applied to data measured at the interval or ratio level.
o It may not be an actual score in the distribution.
o It is very easy to compute.
When to Use the Mean
o When other measures are to be computed, such as standard deviation, coefficient of variation and
skewness
o Sampling stability is desired.
The Mode
The category or score with the largest frequency (or percentage) in the distribution.
The mode can be calculated for variables with levels of measurement that are: nominal,
ordinal, or interval-ratio.
Example:
A systems manager in charge of a company’s network keeps track of the number of server
failures that occur in a day. Compute the mode for the following data, which represents the
number of server failures in a day for the past two weeks:
1 3 0 3 26 2 7 4 2 3 3 6 3
Because 3 appears five times, more times than any other value, the mode is 3. Thus, the
systems manager can say that the most common occurrence is having three server failures in
a day.
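The counting behind this example can be sketched with `collections.Counter` from the Python standard library. The list is copied from the slide as printed; returning a list of modes is a choice made here so that ties (a non-unique mode) are visible.

```python
from collections import Counter

# Server failures per day, exactly as listed in the example above
failures = [1, 3, 0, 3, 26, 2, 7, 4, 2, 3, 3, 6, 3]

counts = Counter(failures)
highest_count = max(counts.values())
# Every value that reaches the highest count is a mode (there may be several)
modes = sorted(value for value, count in counts.items() if count == highest_count)

print(modes, highest_count)  # [3] 5
```

Note that the extreme value 26 has no influence on the result.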
Mode
Properties
o It can be used when the data are qualitative as well as quantitative.
o It may not be unique.
o It is not affected by extreme values.
When to Use the Mode
o When the data set is measured on a nominal scale
The Median
The score that divides the distribution into two equal parts, so that half the cases are above
it and half below it.
You compute the median value by following one of two rules:
o Rule 1 If there are an odd number of values in the data set, the median is the middle-ranked
value.
o Rule 2 If there are an even number of values in the data set, then the median is the average
of the two middle ranked values.
The median is the middle score, or average of middle scores in a distribution.
o Fifty percent (50%) lies below the median value and 50% lies above the median value.
o It is also known as the middle score or the 50th percentile.
The Median
Example-1: The three-year annualized returns for the seven small-cap growth
funds with low risk are ranked from the smallest to the largest:
19.0 20.8 22.3 22.4 24.9 26.0 29.9
The Median is the 4th position value i.e. 22.4.
Example-2: Consider the time taken for reaching office from home in minutes
arranged from lowest to highest
29 31 35 39 39 40 43 44 44 52
The number of observations is 10. Hence the median is the average of the values at
the 5th and 6th positions, i.e. the average of 39 and 40 = 39.5
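Both rules can be sketched in a few lines of Python and checked against Example-1 and Example-2. The function name `median_by_rules` is illustrative.

```python
def median_by_rules(values):
    """Rule 1: odd count -> middle value. Rule 2: even count -> mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


returns = [19.0, 20.8, 22.3, 22.4, 24.9, 26.0, 29.9]   # Example-1 (7 values)
travel_times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]  # Example-2 (10 values)

print(median_by_rules(returns))       # 22.4
print(median_by_rules(travel_times))  # 39.5
```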
Data : Symmetrical and Asymmetrical
The median can be used as the measure of central tendency for both symmetrical and asymmetrical data,
while the mean, mode, or median can be used as the measure of central tendency for symmetrical data.
Data distribution
Skewed
o Negatively (Left): mean < median
• Distributions with a left skew have long left tails.
o Positively (Right): mean > median
• Distributions with a right skew have long right tails.
Bimodal: has two distinct modes
Multi-modal: has more than 2 distinct modes
Note: In science, an empirical relationship is a relationship that is supported by experiment and observation
but not necessarily supported by theory. For moderately skewed data (skewness between -0.5 and 0.5), the
empirical relationship is approximately 3 Median = Mode + 2 Mean.
Why mean is not enough?
Group 1
Sl. No.   X1
1         2
2         8
3         5
4         3
5         7
6         8
7         5
8         2
9         5
Total     45
Statistical measures (Group 1): Mean = 5, Median = 5, Mode = 5
Group 2
Sl. No.   X2
1         1
2         15
3         5
4         5
5         6
6         3
7         5
8         2
9         3
Total     45
Statistical measures (Group 2): Mean = 5, Median = 5, Mode = 5
Statistical measures (Group 1 & 2): Mean = 5, Median = 5, Mode = 5
Do we need any other measure?
Answer: Yes
Measures of variability
Three Measures of Variability:
o The Range
o The Variance
o The Standard Deviation
Measure of Variability
Variability can be defined in several ways:
o A quantitative distance measure based on
the differences between scores
o Describes distance of the spread of scores
or distance of a score from the mean
Purposes of Measure of Variability:
o Describe the distribution
o Measure how well an individual score
represents the distribution
The Range
The distance covered by the scores in a distribution, from the smallest value to the
highest value
For continuous data, real limits are used:
Range = URL for Xmax - LRL for Xmin (URL = upper real limit, LRL = lower real limit)
Based on two scores, not all the data, so it is an imprecise, unreliable measure of
variability
Example: For a set of scores: 7, 2, 7, 6, 5, 6, 2
Range = Highest Score minus Lowest score = 7 - 2 = 5
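The same arithmetic as a tiny Python sketch. It uses the simple highest-minus-lowest definition from the example, not the real-limits version for continuous data.

```python
scores = [7, 2, 7, 6, 5, 6, 2]

# Only the two extreme scores enter the calculation
score_range = max(scores) - min(scores)

print(score_range)  # 5
```

Changing any of the five middle scores leaves the result unchanged, which is why the range is considered an imprecise measure.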
The Standard Deviation
Most common and most important measure of variability is the standard
deviation
o A measure of the standard, or average, distance from the mean
o Describes whether the scores are clustered closely around the mean or
are widely scattered
Calculation differs for population and samples
Variance is a necessary companion concept to standard deviation but not the
same concept
The Standard Deviation
Exercise: Find the deviation of each data point from
the mean, and then find the ‘mean deviation’.
The Standard Deviation
The mean of the signed deviations will always be ‘zero’!
(because the mean is a balance point)
Then, how do you find the ‘Standard Deviation’?
A new strategy is needed.
The Standard Deviation
New Strategy :
a) First square each deviation score
b) Then sum the Squared Deviations (SS)
c) Average the squared deviations
• Mean Squared Deviation is known as “Variance”
• Variability is now measured in squared units
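The steps above can be sketched in Python using the Group 1 scores (X1) from the "Why mean is not enough?" example. The sketch divides by N, that is, it treats the nine scores as a whole population; the variable names are illustrative.

```python
scores = [2, 8, 5, 3, 7, 8, 5, 2, 5]   # X1 from the earlier example

n = len(scores)
mean = sum(scores) / n                  # 45 / 9 = 5.0

deviations = [x - mean for x in scores]
print(sum(deviations))                  # 0.0: signed deviations cancel out

squared = [d ** 2 for d in deviations]  # a) square each deviation
ss = sum(squared)                       # b) sum of squared deviations (SS)
variance = ss / n                       # c) average the squared deviations

print(ss, variance)
```

Because the deviations are squared, the variance is expressed in squared units of the original scores.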
The Variance
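Variance equals the mean (average) squared deviation (distance) of the
scores from the mean:
$$\text{Variance} = \frac{SS}{N}$$
where SS is the sum of the squared deviations and N is the number of scores.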
The Population Variance
Population variance equals mean (average) squared deviation
(distance) of the scores from the population mean
Variance is the average of squared deviations, so we identify
population variance with a lowercase Greek letter sigma squared: $\sigma^2$
Standard deviation is the square root of the variance, so we identify
it with a lowercase Greek letter sigma: $\sigma$
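Group 1
Sl. No.   X1
1         2
2         8
3         5
4         3
5         7
6         8
7         5
8         2
9         5
Total     45
$$\bar{X} = \frac{\sum_{i=1}^{9} x_i}{N} = \frac{45}{9} = 5$$
$$S = \sqrt{\frac{44}{8}} = 2.345$$
Group 2
Sl. No.   X2
1         1
2         15
3         5
4         5
5         6
6         3
7         5
8         2
9         3
Total     45
$$\bar{X} = \frac{\sum_{i=1}^{9} x_i}{N} = \frac{45}{9} = 5$$
$$S = \sqrt{\frac{134}{8}} = 4.093$$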
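A short sketch that reproduces the two standard deviations, dividing the sum of squared deviations by 8 (one less than the number of scores), as in the figures for the two groups. The helper name `sample_sd` is illustrative.

```python
import math


def sample_sd(values):
    """Square root of SS / (n - 1), where SS is the sum of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return math.sqrt(ss / (n - 1))


group_1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]    # X1, SS = 44
group_2 = [1, 15, 5, 5, 6, 3, 5, 2, 3]   # X2, SS = 134

print(round(sample_sd(group_1), 3))  # 2.345
print(round(sample_sd(group_2), 3))  # 4.093
```

Both groups share the same mean, median, and mode, yet Group 2 is spread out far more widely, which is what the standard deviation captures.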
Standard Deviation and Variance for a Sample
Goal of inferential statistics:
o Draw general conclusions about population based on limited information
from a sample
Samples differ from the population
o Samples have less variability
o Computing the Variance and Standard Deviation in the same way as for a
population would give a biased estimate of the population values
Sample Standard Deviation and Variance
Sum of Squares (SS) is computed as before
Formula for Variance has n-1 rather than N in the denominator
Notation uses s instead of σ
Degrees of Freedom
Population variance
o Mean is known
o Deviations are computed from a known mean
Sample variance as estimate of population
o Population mean is unknown
o Using sample mean restricts variability
Degrees of freedom
o Number of scores in sample that are independent and free to vary
o Degrees of freedom (df) = n – 1
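To make the difference between the two denominators concrete, the sketch below uses the standard-library functions `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n - 1). Treating the Group 1 scores from the earlier example first as a population and then as a sample is purely for illustration.

```python
import statistics

scores = [2, 8, 5, 3, 7, 8, 5, 2, 5]   # Group 1 (X1); SS = 44

population_variance = statistics.pvariance(scores)  # SS / N       = 44 / 9
sample_variance = statistics.variance(scores)       # SS / (n - 1) = 44 / 8
degrees_of_freedom = len(scores) - 1

print(degrees_of_freedom, sample_variance)  # 8 5.5
print(population_variance)
```

Dividing by the degrees of freedom gives a slightly larger value, which compensates for the tendency of samples to show less variability than the population.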
Learning Check
Select the correct option
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
(2) The variance is 7
(3) The variance is 8
(4) The variance is 12
b) A sample systematically has less variability than a population True / False ?
c) The standard deviation is the distance from the Mean to the
farthest point on the distribution curve True / False ?
Learning Check: Answers
a) A sample of four scores has SS = 24. What is the variance?
(3) The variance is 8, because SS / (n - 1) = 24 / 3 = 8
b) A sample systematically has less variability than a population: True
c) The standard deviation is the distance from the Mean to the
farthest point on the distribution curve: False
Descriptive Statistics
A standard deviation describes scores in terms of distance from the mean
Describe an entire distribution with just two numbers (M and s)
Reference to both allows reconstruction of the measurement scale from just these two
numbers
Means and standard deviations together provide extremely useful descriptive statistics for
characterizing distributions
Five-point summary of Data
The five number summary of data includes 5 items:
o Minimum.
o Q1 (the first quartile, or the 25% mark).
o Median.
o Q3 (the third quartile, or the 75% mark).
o Maximum.
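A minimal sketch of a five-number summary, applied to the get-ready times from the "Mean: Ungrouped Data" example. It assumes the quartile convention implemented by `statistics.quantiles(..., method="exclusive")`, which places quartiles at positions based on (n + 1) and interpolates linearly; other textbooks and tools use slightly different conventions and may report different Q1 and Q3 values.

```python
import statistics

times_min = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# quantiles() sorts the data itself; n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(times_min, n=4, method="exclusive")
five_number_summary = (min(times_min), q1, median, q3, max(times_min))

print(five_number_summary)  # (29, 34.0, 39.5, 44.0, 52)
```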
Interquartile range (IQR)
It is a measure of variation
Also known as the mid-spread: the spread in the middle 50%
Difference between the third and first quartiles
Not affected by extreme values
Interquartile Range = IQR = Q3 – Q1
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
Position of Q1 = $\frac{1 \cdot (9 + 1)}{4} = 2.50$, Q1 = 12.5
Position of Q3 = $\frac{3 \cdot (9 + 1)}{4} = 7.50$, Q3 = 17.5
Interquartile Range = IQR = Q3 – Q1 = 17.5 - 12.5 = 5
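The position rule used above can be sketched as follows. The slide only shows positions ending in .5; interpolating linearly for any fractional position is an assumption of this sketch, and `quartile_by_position` is an illustrative helper that expects the position to fall inside the data.

```python
def quartile_by_position(ordered, k):
    """Value at 1-based position k * (n + 1) / 4 in an ordered list (k = 1, 2, 3)."""
    n = len(ordered)
    position = k * (n + 1) / 4
    lower = int(position)        # whole part of the position
    fraction = position - lower  # how far to move toward the next value
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [11, 12, 13, 16, 16, 17, 17, 18, 21]   # already in ascending order

q1 = quartile_by_position(data, 1)   # position 2.5 -> halfway between 12 and 13
q3 = quartile_by_position(data, 3)   # position 7.5 -> halfway between 17 and 18
iqr = q3 - q1

print(q1, q3, iqr)  # 12.5 17.5 5.0
```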
Box and Whisker plot
[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max, with IQR = Q3 – Q1; points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are labelled as outliers.]
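Box and Whisker plot
[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max, with IQR = Q3 – Q1 and limits at Q1 - 1.5 IQR and Q3 + 1.5 IQR; points below Q1 - 3 IQR or above Q3 + 3 IQR are labelled as major outliers.]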
Potential outliers
The lower limit and upper limit of a data set are given by:
Lower limit = Q1 - 1.5 x IQR
Upper limit = Q3 + 1.5 x IQR
Data points that lie below the lower limit or above the upper limit are
potential outliers.
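A sketch of the 1.5 × IQR rule, applied to the server-failure counts from the mode example. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which is one common convention and an assumption here; a different quartile convention could shift the limits slightly.

```python
import statistics

failures = [1, 3, 0, 3, 26, 2, 7, 4, 2, 3, 3, 6, 3]

q1, _, q3 = statistics.quantiles(failures, n=4, method="exclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside [lower_limit, upper_limit] is flagged as a potential outlier
potential_outliers = [x for x in failures if x < lower_limit or x > upper_limit]

print(lower_limit, upper_limit, potential_outliers)  # -2.5 9.5 [26]
```

The day with 26 failures is flagged, while every other day falls within the limits.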
Summary
1. Measures of Central Tendency: Mean, Median, Mode
2. Measures of Variability: Range, Standard Deviation, Variance
3. Symmetric and Asymmetric distribution
4. Five Point summary
5. Outliers
Practice Problem:
• Q.1 A sample of 77 individuals working at a particular office was selected, and the noise level (dBA)
experienced by each individual was recorded. The data are as follows:
55.3, 55.3, 55.3, 55.9, 55.9, 55.9, 55.9, 56.1, 56.1, 56.1, 56.1, 56.1, 56.1,
56.8, 56.8, 57.0, 57.0, 57.0, 57.8, 57.8, 57.8, 57.9, 57.9, 57.9, 58.8, 58.8,
58.8, 59.8, 59.8, 59.8, 62.2, 62.2, 63.8, 63.8, 63.8, 63.9, 63.9, 63.9, 64.7,
64.7, 64.7, 65.1, 65.1, 65.1, 65.3, 65.3, 65.3, 65.3, 67.4, 67.4, 67.4, 67.4,
68.7, 68.7, 68.7, 68.7, 69.0, 70.4, 70.4, 71.2, 71.2, 71.2, 73.0, 73.0, 73.1,
73.1, 74.6, 74.6, 74.6, 74.6, 79.3, 79.3, 79.3, 79.3, 83.0, 83.0, 83.0.
Find a) Arithmetic Mean, SD, variance, and IQR
b) Draw box and whisker plot
c) Comment on the outliers, if any.
Discussion
Practice Problems:
Q.2 The data given below is the total fat, in grams per serving, for a sample of 20 chicken
sandwiches from fast-food chains.
7 8 4 5 16 20 20 24 19 30 23 30 25 19 29 29 30 30 40 56
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the variance, standard deviation, range, and interquartile range. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) through (c), what conclusions can you reach concerning the total fat of chicken
sandwiches?
Practice Problem:
Q.3 The following data represent the battery life (in shots) for three pixel digital cameras:
300 180 85 170 380 460 260 35 380 120 110 240
List the Five-point summary.
Q.4 For the data set below:
82 45 64 80 82 74 79 80 80 78 80 80 48 73 80 79 81 70 78 73
a. Obtain and interpret the quartiles.
b. Determine and interpret the interquartile range.
c. Find and interpret the five-number(point) summary.
d. Identify potential outliers, if any.
e. Construct and interpret a boxplot.
Practice Problem:
Q.5 A bank branch located in a commercial place of a city has developed an improved process for
serving customers during the noon-to-1:00 p.m. lunch period. The waiting time, in minutes (defined
as the time the customer enters the line to when he or she reaches the teller window), of a sample of
15 customers during this hour is recorded over a period of one week. The results are: 4.21, 5.55,
3.02, 5.13, 4.77, 2.34, 3.54, 3.20, 4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79.
Another branch, located in a residential area, is also concerned with the noon-to-1 p.m. lunch hour.
The waiting time, in minutes, of a sample of 15 customers during this hour is recorded over a period of one
week. The results are listed below: 9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35, 10.49, 6.68, 5.64,
4.08, 6.17, 9.91, 5.47.
a. List the five-number summaries of the waiting times at the two bank branches.
b. Construct box-and-whisker plots and describe the shape of the distribution of each for the two
bank branches.
c. What similarities and differences are there in the distributions of the waiting time at the two bank
branches?
Thank You!