ISM - Session 1 - May 2025

The document outlines an introductory course on statistical methods, detailing modules covering basic probability, hypothesis testing, and forecasting, among others. It includes evaluation components such as quizzes, assignments, and exams, along with a list of recommended textbooks. Key statistical concepts such as measures of central tendency, types of data, and variability measures are also discussed.

Uploaded by

AtindranathGhosh
Copyright
© © All Rights Reserved
Introduction to Statistical Methods (ISM)

Slides Modified by
Dr YVK Ravi Kumar
[email protected]
Section:
Faculty Name & Mail ID:
LF details:

Session 1
(3rd / 4th May 2025)

Overview of the course & Basics of Statistics


Overview of the course

Module 1 : Basic Probability & Statistics

Module 2 : Conditional Probability & Bayes’ theorem

Module 3 : Probability Distributions

Module 4 : Hypothesis Testing

Module 5 : Prediction & Forecasting

Module 6 : Prediction & Forecasting; Gaussian Mixture Model & Expectation Maximization


TEXT BOOKS

• T1 : Statistics for Data Scientists: An Introduction to Probability, Statistics and Data Analysis, Maurits Kaptein et al., Springer, 2022
• T2 : Probability and Statistics for Engineering and the Sciences, 8th Edition, Jay L. Devore, Cengage Learning
• T3 : Introduction to Time Series and Forecasting, Second Edition, Peter J. Brockwell, Richard A. Davis, Springer
Evaluation Components

        Name                          Type          Weightage
EC 1    Quiz 1 & Quiz 2               Online        10%
        Assignment 1 & Assignment 2   Online        20%
EC 2    Mid semester Exam             Closed Book   30%
EC 3    Comprehensive Exam            Open Book     40%



Module 1: (Basic Probability & Statistics)

Contact Session: CS - 1
List of Topic Title: Measures of Central Tendency & Measures of Variability; Data – Symmetric & Asymmetric; outlier detection; 5 point summary
Reference: T1 & T2
Statistics

Statistics may be defined as the science that is employed to

o Collect the data
o Present and organize the data in a systematic manner
o Analyse the data
o Draw inferences from the data
o Make decisions based on the data
Data - Types

Data
o Categorical (defined categories)
  Examples: Marital Status, Political Party, Eye Color
o Numerical
  - Discrete (counted items)
    Examples: Number of Children, Defects per hour
  - Continuous (measured characteristics)
    Examples: Weight, Voltage
Levels of Data Measurement
• Nominal (lowest level of measurement)
• Ordinal
• Interval
• Ratio (highest level of measurement)

Nominal scale
o A nominal scale classifies data into distinct categories in which no ranking is implied.
o Examples: Gender, Marital Status, Political Party voted for

Ordinal scale
o An ordinal scale classifies data into distinct categories in which ranking is implied.
o Examples:
1. Product satisfaction: Satisfied, Neutral, Unsatisfied
2. Faculty rank: Professor, Associate Professor, Assistant Professor
3. Student Grades: A, B, C, D, F
4. Medals won: Gold, Silver, Bronze

Interval scale
o An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
o Examples:
1. Temperature in Fahrenheit and Celsius
2. Months of the Year: there’s no month called zero and we can’t say January is twice as much as June.

Ratio scale
o A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
o Examples: Weight, Age, Salary
Types of Variable

Quantitative (Numerical): measured in terms of numbers, such as height, weight, number of people.
o Discrete: countable, with a finite or countably infinite number of possible values, such as number of people.
o Continuous: not countable, with an infinite number of possible values, such as height.

Qualitative (Categorical): expresses a qualitative attribute, such as hair color, eye color, religion.
o Nominal: no ordering is possible, such as hair color, eye color, religion.
o Ordinal: ordering is possible, such as health, which can take values such as poor, reasonable, good, or excellent.

Interval: ratios of values of the variable have no meaning, and there is no inherently defined zero value, such as temperature.
Ratio: ratios of values of the variable have meaning, and there is an inherently defined zero value, such as length.
Example

Question                                                        Type of Variable
Types of vehicle owned                                          Categorical: Nominal
  (Two wheeler, four wheeler)
Product satisfaction                                            Categorical: Ordinal
  (Unsatisfied, neutral, fairly satisfied, satisfied)
To how many magazines do you currently subscribe                Discrete
  (Zero, One, Two, Three, Four)
How tall are you (in inches)                                    Continuous: Ratio
Weight (in kilograms)                                           Continuous: Ratio
Temperature (in degrees Celsius or degrees Fahrenheit)          Continuous: Interval


Measures of Central Tendency
A measure of central tendency provides a very convenient way of describing a set of scores with a single number that describes the PERFORMANCE of the group.

• Also defined as a single value that is used to describe the “center” of the data.
• Three commonly used measures of central tendency:
1. Mean
2. Median
3. Mode
Mean
Also referred to as the “arithmetic average”
The most commonly used measure of the center of data
A number that describes what is average or typical of the distribution
Computation of the Sample Mean:
$\bar{Y} = \frac{\sum Y}{N}$
Computation of the Mean for Grouped Data:
$\bar{Y} = \frac{\sum fY}{N}$, where $fY$ = a score multiplied by its frequency
Mean: Ungrouped Data

 Suppose you define the time to get ready as the time (rounded to the nearest minute) from
when you get out of bed to when you leave your home. You collect the times shown below for
10 consecutive work days:

Day          1   2   3   4   5   6   7   8   9   10
Time (min)   39  29  43  52  39  44  40  31  44  35
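A minimal Python sketch of the sample mean for the ten getting-ready times listed above. The variable names are illustrative and not part of the slides.

```python
# Getting-ready times (minutes) for the 10 consecutive work days
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# Sample mean: sum of the scores divided by the number of scores
mean_time = sum(times) / len(times)
print(mean_time)  # 39.6
```

The sum of the ten times is 396, so the mean is 396 / 10 = 39.6 minutes.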
Mean: Grouped Scores
Data of Children watching TV in Bengaluru
Mean
PROPERTIES
o It measures stability. The mean is the most stable of the measures of central tendency because every score contributes to its value.
o It may easily be affected by extreme scores.
o The sum of the signed deviations of the scores from the mean is zero.
o It can be applied to interval and ratio levels of measurement.
o It may not be an actual score in the distribution.
o It is very easy to compute.

When to Use the Mean
o When other measures are to be computed, such as standard deviation, coefficient of variation and skewness
o When sampling stability is desired


The Mode
 The category or score with the largest frequency (or percentage) in the distribution.
 The mode can be calculated for variables with levels of measurement that are: nominal,
ordinal, or interval-ratio.

Example:
A systems manager in charge of a company’s network keeps track of the number of server
failures that occur in a day. Compute the mode for the following data, which represents the
number of server failures in a day for the past two weeks:

1 3 0 3 26 2 7 4 2 3 3 6 3
Because 3 appears five times, more times than any other value, the mode is 3. Thus, the
systems manager can say that the most common occurrence is having three server failures in
a day.
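As a rough illustration, the mode of the server-failure counts above can be found by counting how often each value occurs. This sketch uses Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Number of server failures per day, as listed in the example
failures = [1, 3, 0, 3, 26, 2, 7, 4, 2, 3, 3, 6, 3]

# Count occurrences of each value and take the most frequent one
counts = Counter(failures)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 3 5
```

Note that the extreme value 26 has no effect on the mode.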
Mode
Properties
o It can be used when the data are qualitative as well as quantitative.
o It may not be unique.
o It is not affected by extreme values.

When to Use the Mode


o When the data set is measured on a nominal scale
The Median

The score that divides the distribution into two equal parts, so that half the cases are above
it and half below it.
You compute the median value by following one of two rules:
o Rule 1 If there are an odd number of values in the data set, the median is the middle-ranked
value.
o Rule 2 If there are an even number of values in the data set, then the median is the average
of the two middle ranked values.
The median is the middle score, or average of middle scores in a distribution.
o Fifty percent (50%) lies below the median value and 50% lies above the median value.
o It is also known as the middle score or the 50th percentile.
The Median

Example-1: The three-year annualized returns for the seven small-cap growth
funds with low risk are ranked from the smallest to the largest:
19.0 20.8 22.3 22.4 24.9 26.0 29.9
The Median is the 4th position value i.e. 22.4.
Example-2: Consider the time taken for reaching office from home in minutes
arranged from lowest to highest
29 31 35 39 39 40 43 44 44 52
The number of observations is 10. Hence the median is the average of the values at
the 5th and 6th positions, i.e. the average of 39 and 40 = 39.5
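The two rules can be checked with a short sketch, assuming Python's standard `statistics` module; `statistics.median` sorts the data and applies the odd/even rule described above.

```python
import statistics

# Example-1: seven values (odd count), so the median is the middle-ranked value
returns = [19.0, 20.8, 22.3, 22.4, 24.9, 26.0, 29.9]
median_returns = statistics.median(returns)

# Example-2: ten values (even count), so the median is the average of the two middle values
commute = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
median_commute = statistics.median(commute)

print(median_returns, median_commute)  # 22.4 39.5
```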
Data : Symmetrical and Asymmetrical

• The median can be used as the measure of central tendency for both symmetrical and asymmetrical data.
• The mean, mode or median can be used as the measure of central tendency for symmetrical data.
Data distribution
• Skewed
o Negatively (left) skewed: mean < median. Distributions with a left skew have long left tails.
o Positively (right) skewed: mean > median. Distributions with a right skew have long right tails.
• Bimodal: has two distinct modes
• Multi-modal: has more than two distinct modes

Note: In science, an empirical relationship is a relationship that is supported by experiment and observation
but not necessarily supported by theory. For moderately skewed data (skewness between -0.5 and 0.5), the
empirical relationship is 3 Median = Mode + 2 Mean.
Why mean is not enough?

Sl. No.   X1   X2
1         2    1
2         8    15
3         5    5
4         3    5
5         7    6
6         8    3
7         5    5
8         2    2
9         5    3
Total     45   45

Statistical measures   Group 1   Group 2
Mean                   5         5
Median                 5         5
Mode                   5         5
Do we need any other measure?
Answer: Yes
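A small sketch of why the answer is yes, using the X1 and X2 values from the tables above and Python's standard `statistics` module. The dictionary layout is only an illustration.

```python
import statistics

# The two groups from the tables above
x1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]
x2 = [1, 15, 5, 5, 6, 3, 5, 2, 3]

summary = {}
for name, data in (("Group 1", x1), ("Group 2", x2)):
    summary[name] = {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "min": min(data),
        "max": max(data),
    }

for name, stats in summary.items():
    print(name, stats)
```

Both groups share a mean, median and mode of 5, yet Group 1 spans 2 to 8 while Group 2 spans 1 to 15, so the measures of center alone do not distinguish them.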

Measures of variability

Three Measures of Variability:

o The Range
o The Variance
o The Standard Deviation
Measure of Variability
Variability can be defined in several ways:
o A quantitative distance measure based on
the differences between scores
o Describes distance of the spread of scores
or distance of a score from the mean
Purposes of Measure of Variability:
o Describe the distribution
o Measure how well an individual score
represents the distribution
The Range
The distance covered by the scores in a distribution, from the smallest value to the highest value
• For continuous data, real limits are used:

Range = URL for Xmax - LRL for Xmin (URL = upper real limit, LRL = lower real limit)

Because it is based on only two scores, not all the data, it is an imprecise, unreliable measure of variability
Example: For a set of scores: 7, 2, 7, 6, 5, 6, 2

Range = Highest score minus lowest score = 7 - 2 = 5
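The example can be reproduced in a few lines of Python. The real-limits version is a sketch under the assumption that the scores are measured to the nearest whole unit, so each real limit lies 0.5 beyond the recorded value; that assumption is not stated in the slides.

```python
scores = [7, 2, 7, 6, 5, 6, 2]

# Simple range: highest score minus lowest score
score_range = max(scores) - min(scores)

# Range using real limits, ASSUMING a measurement unit of 1
# (upper real limit = Xmax + 0.5, lower real limit = Xmin - 0.5)
real_limit_range = (max(scores) + 0.5) - (min(scores) - 0.5)

print(score_range, real_limit_range)  # 5 6.0
```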


The Standard Deviation

The most common and most important measure of variability is the standard deviation
o A measure of the standard, or average, distance from the mean
o Describes whether the scores are clustered closely around the mean or are widely scattered
Calculation differs for populations and samples
Variance is a necessary companion concept to standard deviation but not the same concept
The Standard Deviation

Exercise : Find out the deviations of all the data points with
the mean….and then find the ‘mean deviation’.
The Standard Deviation

• The mean of the signed deviations will always be zero (because the mean is a balance point).
• Then, how do you find the ‘Standard Deviation’?

Need a new strategy
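A quick check of the exercise, using the Group 1 scores (X1) from the earlier table. This is only an illustrative sketch of why averaging the signed deviations is not useful.

```python
# Group 1 scores (X1) from the earlier table
x1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]

mean_x1 = sum(x1) / len(x1)

# Signed deviation of every score from the mean
deviations = [x - mean_x1 for x in x1]

# The positive and negative deviations cancel out exactly
mean_deviation = sum(deviations) / len(deviations)
print(deviations)
print(mean_deviation)  # 0.0
```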


The Standard Deviation
New Strategy :
a) First square each deviation score
b) Then sum the Squared Deviations (SS)
c) Average the squared deviations

• Mean Squared Deviation is known as “Variance”


• Variability is now measured in squared units
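Steps (a) to (c) can be sketched as follows for the Group 1 scores (X1), treating them as a complete population so that the divisor is N. The variable names are illustrative.

```python
import math

# Group 1 scores (X1), treated here as a whole population
x1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]
mean_x1 = sum(x1) / len(x1)

# a) square each deviation score
squared_deviations = [(x - mean_x1) ** 2 for x in x1]

# b) sum the squared deviations (SS)
ss = sum(squared_deviations)

# c) average the squared deviations: the population variance
variance = ss / len(x1)

# The square root returns the measure to the original units
std_dev = math.sqrt(variance)
print(ss, variance, std_dev)
```

Here SS = 44, so the population variance is 44 / 9 (about 4.89) and the population standard deviation is about 2.21.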
The Variance

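Variance equals the mean (average) squared deviation (distance) of the scores from the mean:

$\sigma^2 = \frac{SS}{N}$

where $SS = \sum (X - \mu)^2$ is the sum of squared deviations and $N$ is the number of scores.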
The Population Variance

• Population variance equals the mean (average) squared deviation (distance) of the scores from the population mean
• Variance is the average of squared deviations, so we identify population variance with a lowercase Greek letter sigma squared: σ²
• Standard deviation is the square root of the variance, so we identify it with a lowercase Greek letter sigma: σ
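Sl. No.   X1   X2
1         2    1
2         8    15
3         5    5
4         3    5
5         7    6
6         8    3
7         5    5
8         2    2
9         5    3
Total     45   45

Group 1 (X1): $\bar{X} = \frac{\sum_{i=1}^{9} x_i}{N} = \frac{45}{9} = 5$, $S = \sqrt{\frac{44}{8}} = 2.345$
Group 2 (X2): $\bar{X} = \frac{\sum_{i=1}^{9} x_i}{N} = \frac{45}{9} = 5$, $S = \sqrt{\frac{134}{8}} = 4.093$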
Standard Deviation and Variance for a Sample

Goal of inferential statistics:


o Draw general conclusions about population based on limited information
from a sample
Samples differ from the population
o Samples have less variability
o Computing the Variance and Standard Deviation in the same way as for a
population would give a biased estimate of the population values
Sample Standard Deviation and Variance

• Sum of Squares (SS) is computed as before
• Formula for variance has n - 1 rather than N in the denominator
• Notation uses s instead of σ
Degrees of Freedom
Population variance
o Mean is known
o Deviations are computed from a known mean

 Sample variance as estimate of population


o Population mean is unknown
o Using sample mean restricts variability

 Degrees of freedom
o Number of scores in sample that are independent and free to vary
o Degrees of freedom (df) = n – 1
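The effect of dividing by n – 1 instead of N can be seen with Python's standard `statistics` module, which provides both versions: `pvariance` divides SS by N and `variance` divides SS by n – 1. The sketch below uses the X1 and X2 scores from the earlier tables.

```python
import statistics

x1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]
x2 = [1, 15, 5, 5, 6, 3, 5, 2, 3]

# Population formula: SS / N
pop_var_x1 = statistics.pvariance(x1)

# Sample formula: SS / (n - 1), with df = n - 1 = 8
sample_var_x1 = statistics.variance(x1)
sample_sd_x1 = statistics.stdev(x1)
sample_sd_x2 = statistics.stdev(x2)

print(pop_var_x1, sample_var_x1)
print(round(sample_sd_x1, 3), round(sample_sd_x2, 3))  # 2.345 4.093
```

The sample values match the S = 2.345 and S = 4.093 shown for the two groups, and the sample variance (5.5) is larger than the population-formula variance (about 4.89), which compensates for the reduced variability of a sample.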
Learning Check
Select the correct option
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
(2) The variance is 7
(3) The variance is 8
(4) The variance is 12

b) A sample systematically has less variability than a population True / False ?

c) The standard deviation is the distance from the mean to the farthest point on the distribution curve True / False ?
Learning Check
Select the correct option
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
(2) The variance is 7
(3) The variance is 8
(4) The variance is 12
Answer: (3), since s² = SS / (n – 1) = 24 / 3 = 8

b) A sample systematically has less variability than a population: True

c) The standard deviation is the distance from the mean to the farthest point on the distribution curve: False
Descriptive Statistics
 A standard deviation describes scores in terms of distance from the mean
 Describe an entire distribution with just two numbers (M and s)
 Reference to both allows reconstruction of the measurement scale from just these two
numbers
 Means and standard deviations together provide extremely useful descriptive statistics for
characterizing distributions
Five-point summary of Data
The five-number summary of data includes 5 items:

o Minimum
o Q1 (the first quartile, or the 25% mark)
o Median
o Q3 (the third quartile, or the 75% mark)
o Maximum
Interquartile range (IQR)
• It is a measure of variation
• Also known as the mid-spread: the spread in the middle 50%
• Difference between the third and first quartiles
• Not affected by extreme values
• Interquartile Range = IQR = Q3 – Q1

Data in Ordered Array: 11 12 13 16 16 17 17 18 21

Position of Q1 = 1•(9 + 1)/4 = 2.50, Q1 = 12.5
Position of Q3 = 3•(9 + 1)/4 = 7.50, Q3 = 17.5
Interquartile Range = IQR = Q3 – Q1 = 17.5 - 12.5 = 5
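The same quartiles can be obtained in Python. This sketch assumes the standard `statistics.quantiles` function with `method="exclusive"`, which places the k-th quartile at position k(n + 1)/4 and interpolates between neighbouring values, matching the position rule used above. Other software may use a different quartile convention and give slightly different values.

```python
import statistics

# Ordered array from the example
data = [11, 12, 13, 16, 16, 17, 17, 18, 21]

# Three cut points dividing the data into four parts: Q1, median, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Five-point summary: minimum, Q1, median, Q3, maximum
five_point = (min(data), q1, q2, q3, max(data))
print(five_point, iqr)
```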
Box and Whisker plot

[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max, with IQR = Q3 – Q1, fences at Q1 - 1.5 IQR and Q3 + 1.5 IQR, and outliers beyond the fences]
Box and Whisker plot

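[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max, with IQR = Q3 – Q1, inner fences at Q1 - 1.5 IQR and Q3 + 1.5 IQR, outer fences at Q1 - 3 IQR and Q3 + 3 IQR, and major outliers beyond the outer fences]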
Potential outliers
 The lower limit and upper limit of a data set are given by:
Lower limit = Q1 - 1.5 x IQR
Upper limit = Q3 + 1.5 x IQR
 Data points that lie below the lower limit or above the upper limit are
potential outliers.
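As an illustration of the limits, the sketch below applies them to the server-failure counts from the mode example. It assumes quartiles computed with `statistics.quantiles(..., method="exclusive")`, i.e. the (n + 1)/4 position rule; a different quartile convention could shift the limits slightly.

```python
import statistics

# Server failures per day, from the mode example
failures = sorted([1, 3, 0, 3, 26, 2, 7, 4, 2, 3, 3, 6, 3])

q1, _, q3 = statistics.quantiles(failures, n=4, method="exclusive")
iqr = q3 - q1

# Limits beyond which a point is flagged as a potential outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in failures if x < lower_limit or x > upper_limit]
print(lower_limit, upper_limit, outliers)  # -2.5 9.5 [26]
```

With Q1 = 2 and Q3 = 5, the IQR is 3, so only the day with 26 failures falls outside the limits.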
Summary

1. Measures of Central Tendency: Mean, Median, Mode
2. Measures of Variability: Range, Standard Deviation, Variance
3. Symmetric and Asymmetric distribution
4. Five Point summary
5. Outliers
Practice Problem:

• Q.1 A sample of 77 individuals working at a particular office was selected, and the noise level (dBA)
experienced by each individual was recorded. The data are:

55.3, 55.3, 55.3, 55.9, 55.9, 55.9, 55.9, 56.1, 56.1, 56.1, 56.1, 56.1, 56.1,
56.8, 56.8, 57.0, 57.0, 57.0, 57.8, 57.8, 57.8, 57.9, 57.9, 57.9, 58.8, 58.8,
58.8, 59.8, 59.8, 59.8, 62.2, 62.2, 63.8, 63.8, 63.8, 63.9, 63.9, 63.9, 64.7,
64.7, 64.7, 65.1, 65.1, 65.1, 65.3, 65.3, 65.3, 65.3, 67.4, 67.4, 67.4, 67.4,
68.7, 68.7, 68.7, 68.7, 69.0, 70.4, 70.4, 71.2, 71.2, 71.2, 73.0, 73.0, 73.1,
73.1, 74.6, 74.6, 74.6, 74.6, 79.3, 79.3, 79.3, 79.3, 83.0, 83.0, 83.0.

Find
a) Arithmetic mean, SD, variance, and IQR
b) Draw a box-and-whisker plot
c) Comment on the outliers, if any.
Discussion
Practice Problems:
Q.2 The data given below is the total fat, in grams per serving, for a sample of 20 chicken
sandwiches from fast-food chains.
7 8 4 5 16 20 20 24 19 30 23 30 25 19 29 29 30 30 40 56
a. Compute the mean, median, first quartile, and third quartile.
b. Compute the variance, standard deviation, range, and interquartile range. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) through (c), what conclusions can you reach concerning the total fat of chicken
sandwiches?
Practice Problem:

Q.3 The following data represent the battery life (in shots) for three-pixel digital cameras:
300 180 85 170 380 460 260 35 380 120 110 240
• List the five-point summary.

Q.4 For the data set below:

82 45 64 80 82 74 79 80 80 78 80 80 48 73 80 79 81 70 78 73

a. Obtain and interpret the quartiles.
b. Determine and interpret the interquartile range.
c. Find and interpret the five-number (point) summary.
d. Identify potential outliers, if any.
e. Construct and interpret a boxplot.
Practice Problem:

Q.5 A bank branch located in a commercial place of a city has developed an improved process for
serving customers during the noon-to-1:00 p.m. lunch period. The waiting time, in minutes (defined
as the time the customer enters the line to when he or she reaches the teller window), of a sample of
15 customers during this hour is recorded over a period of one week. The results are: 4.21, 5.55,
3.02, 5.13, 4.77, 2.34, 3.54, 3.20, 4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79.
Another branch, located in a residential area, is also concerned with the noon-to-1 p.m. lunch hour.
The waiting time, in minutes, of a sample of 15 customers during this hour is recorded over a period of one
week. The results are listed below: 9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35, 10.49, 6.68, 5.64,
4.08, 6.17, 9.91, 5.47.

a. List the five-number summaries of the waiting times at the two bank branches.
b. Construct box-and-whisker plots and describe the shape of the distribution of each for the two
bank branches.
c. What similarities and differences are there in the distributions of the waiting time at the two bank
branches?
Thank You!