Introduction To Basic Statistics

Uploaded by

Taher Khan
Introduction to Statistics

Abdur Rahman (Aakash)


Lecturer
Statistics Discipline, KU.
Statistics
▪ The word "statistics" comes from the German Statistik,

▪ meaning "description of a state, a country".

▪ It is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

AR, Lecturer, Statistics Discipline, KU 2


Statistics
▪ Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting, and analyzing sample data from a specified population of interest, in order to draw valid conclusions, make inferences about the population's characteristics, and finally reach a reasonable decision.


Types of Statistics

Descriptive statistics summarize the characteristics of a data set.

Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.
Descriptive Statistics
• Descriptive statistics are a collection of
numerical values that reveal important
characteristics of your data. They don't make
any inferences about a larger population
(which is the job of inferential statistics), but
instead focus on painting a clear picture of the
specific data you have.
Types of descriptive statistics



Descriptive Statistics
o Main categories

  o Central tendency: These measures tell you where the "typical" value lies in your data. Examples include:
    o Mean: The average of all values.
    o Median: The middle value when data is arranged in ascending order.
    o Mode: The most frequent value.

  o Variability (spread): These measures tell you how spread out your data is. Examples include:
    o Range: The difference between the highest and lowest values.
    o Standard deviation: The typical distance of the values from the mean (the square root of the variance).
    o Variance: The average squared deviation from the mean (the square of the standard deviation).

  o Distribution: This describes the overall shape of your data. Is it symmetric? Bell-shaped? Skewed towards one side? Histograms and boxplots are handy tools for visualizing this.
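As a rough illustration of the measures listed on this slide, the sketch below computes them with Python's standard `statistics` module. The data are the five exam scores from the mean example in these notes; the dictionary name `summary` is just a label chosen for this sketch, and the population (divide by N) versions of variance and SD are an assumption.

```python
import statistics

# Exam scores of five students (taken from the mean example in these notes).
scores = [75, 82, 90, 85, 95]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "modes": statistics.multimode(scores),      # every value ties when nothing repeats
    "range": max(scores) - min(scores),
    "variance": statistics.pvariance(scores),   # population variance (divide by N)
    "sd": statistics.pstdev(scores),            # square root of the variance
}

for name, value in summary.items():
    print(name, value)
```

Because no score repeats, `multimode` returns all five values, which is one way to see that the mode is not informative for this dataset.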
Importance of descriptive statistics
• Gain quick insights: Quickly grasp key features of your data without drowning in individual points.

• Identify patterns and trends: See if your data is clustered around a central value, spread out evenly, or skewed in a specific direction.

• Compare datasets: Analyze similarities and differences between different datasets using the same descriptive statistics.

• Prepare for further analysis: Descriptive statistics often serve as the foundation for more complex statistical methods.


Frequency distribution
A frequency distribution, a basic tool of descriptive statistics, focuses on how often values occur in your data. It shows how many times each value (or range of values) appears in your dataset. This helps you understand the spread and distribution of your data, beyond just knowing the central tendency or variability.
Types of Frequency distribution
• Frequency tables: These organize data into categories or intervals, showing the count of observations falling within each category. This is a simple and efficient way to present basic frequency information.

• Histograms: These visualize the frequency distribution in a bar chart, where the bar height represents the number of observations for each interval. Histograms are great for understanding the overall shape of your data, like symmetry or skewness.

• Relative frequency tables/histograms: These express frequencies as proportions or percentages, allowing for easier comparison across datasets with different sizes.


Benefits of using frequency distributions
• Identify patterns and trends: See if specific values or ranges are more
common, revealing potential patterns or biases in your data.
• Compare groups: Visually compare frequency distributions across different
groups or conditions within your dataset.
• Identify outliers: Values that occur infrequently might stand out in a frequency
distribution, prompting further investigation as potential outliers.
• Prepare for further analysis: Frequency distributions serve as a foundation for
more advanced statistical methods like hypothesis testing or regression analysis.



Frequency distribution
• For the variable gender, you list all possible answers in the left-hand column. You count the number or percentage of responses for each answer and display it in the right-hand column.

Gender    Number
Male      182
Female    235
Other     27
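The counts in the gender table can be turned into relative frequencies with a few lines of Python. This is only a sketch: the counts are those of the table, while the variable names are chosen here for illustration.

```python
# Counts from the gender frequency table.
counts = {"Male": 182, "Female": 235, "Other": 27}

total = sum(counts.values())

# Relative frequency = count / total, expressed as a percentage.
percentages = {category: 100 * n / total for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category:<8}{n:>6}{percentages[category]:>8.1f}%")
```

Expressing the counts as percentages makes the table comparable with surveys of a different size.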


• In a grouped frequency distribution, you group numerical response values and add up the number of responses in each group. You can also convert these counts to percentages.

Library visits in the past year    Percent
0–4                                6%
5–8                                20%
9–12                               42%
13–16                              24%
17+                                8%

• From this table, you can see that most people (86%) visited the library between 5 and 16 times in the past year.
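A grouped frequency distribution like the one on this slide can be built by mapping each raw value to its class interval and counting. The sketch below uses made-up raw data (the slide only gives percentages), and the helper name `visit_group` is hypothetical.

```python
from collections import Counter

# Hypothetical raw data: library visits in the past year for 12 respondents.
visits = [2, 5, 7, 9, 10, 11, 12, 12, 13, 15, 18, 6]

def visit_group(v):
    """Map a visit count to one of the class intervals used on the slide."""
    if v <= 4:
        return "0-4"
    if v <= 8:
        return "5-8"
    if v <= 12:
        return "9-12"
    if v <= 16:
        return "13-16"
    return "17+"

# Count how many respondents fall in each class, then convert to percentages.
grouped = Counter(visit_group(v) for v in visits)
percent = {group: 100 * count / len(visits) for group, count in grouped.items()}

for group in ["0-4", "5-8", "9-12", "13-16", "17+"]:
    print(group, grouped[group], f"{percent[group]:.1f}%")
```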


Measures of central location/tendency
In general, central tendency refers to a single value that
best represents the "center" or "typical" value of a dataset. It
aims to summarize the overall data by pinpointing a
representative number.
• Think of it like finding the central point of a see-saw:
even though individual weights might be distributed on
either side, the central tendency reflects the point where
the see-saw balances.
Key features of Measures of central location/tendency
• Summarizes data: It doesn't provide details about every data point, but
offers a condensed overview of the whole set.
• Multiple measures: Different measures exist, each with its strengths and
weaknesses:
• Mean: Sum of all values divided by the number of values (sensitive to outliers).
• Median: Middle value when data is ordered (less sensitive to outliers).
• Mode: Most frequent value (not informative if multiple modes or none).

• Depends on data and question: Choice of measure depends on the data type
(numerical, categorical) and the research goal.
• Not infallible: It's just one aspect of data; understanding how individual
points deviate from the center is also crucial.



Central tendency types



Mean
• The mean, also known as the arithmetic mean, is arguably the most
frequently used measure of central tendency. It aims to represent the
"average" value in a dataset by summing up all the values and
dividing by the number of values.

Formula: mean = (sum of all values) / (number of values)


• Example: Imagine you have the exam scores of 5 students: 75, 82, 90, 85,
and 95.

Their mean score would be: mean = (75 + 82 + 90 + 85 + 95) / 5 = 85.4



Strength and weakness of Mean
• Strengths:

• Easy to understand and calculate.

• Widely used and familiar across various disciplines.

• Sensitive to changes in individual values: if everyone's score goes up by 5, the mean also increases by 5.

• Weaknesses:

• Sensitive to outliers: a single extreme value can significantly distort the mean.

• Not robust for skewed data: if more values lie on one side of the distribution, the mean might not accurately reflect the "typical" value.
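Both properties of the mean can be checked numerically. The sketch below reuses the five exam scores from the mean example; the outlier (95 mistyped as 950) is a made-up scenario added for illustration.

```python
import statistics

scores = [75, 82, 90, 85, 95]        # exam scores from the mean example
shifted = [s + 5 for s in scores]    # everyone's score goes up by 5
with_outlier = scores[:-1] + [950]   # hypothetical data-entry error: 95 typed as 950

mean_original = statistics.mean(scores)
mean_shifted = statistics.mean(shifted)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean_original, mean_shifted)   # the mean moves by exactly 5
print(mean_outlier, median_outlier)  # the mean is distorted, the median is not
```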


When to use the mean
• When dealing with symmetrical, normally distributed data with no significant outliers.

• When you want to understand the "average" value and how changes in individual data points affect it.


Mean
• Other types of mean used to measure central tendency include:

• Geometric Mean (GM), Harmonic Mean (HM), Weighted Mean (WM)


Geometric Mean
Formula: the nth root of the product of all values (where n is the number of values).

\[ GM = (x_1 x_2 x_3 \cdots x_n)^{1/n} \]

Example: Calculate the average annual growth rate of an investment that started at $1000 and grew to $2500 in 5 years.

Strengths: Useful for multiplicative growth/decline; works well for positive data.

Weaknesses: Can't handle zero or negative values; values very close to zero influence it heavily.

Use cases: Finance, economics, measuring exponential growth/decline.
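The investment example can be worked through as follows. This is a sketch under the assumption that growth compounds once per year; the list `factors` in the second half is made-up data included only to show `statistics.geometric_mean`.

```python
import statistics

start_value = 1000.0
end_value = 2500.0
years = 5

# The overall growth factor is end/start; the average yearly factor is its 5th root.
yearly_factor = (end_value / start_value) ** (1 / years)
cagr_percent = (yearly_factor - 1) * 100
print(f"average annual growth: {cagr_percent:.2f}%")

# Same idea with explicit yearly growth factors (hypothetical numbers): their
# geometric mean is the constant factor that gives the same total growth.
factors = [1.10, 1.25, 0.95, 1.30, 1.20]
average_factor = statistics.geometric_mean(factors)
```

The average annual growth comes out at roughly 20% per year, noticeably less than the 30% obtained by dividing the total 150% gain by five, because growth compounds.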


Harmonic Mean
• Formula: the number of values divided by the sum of the reciprocals of all values.

\[ HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} \]

• Example: You travel the first half of a journey at 60 km/h and the second half (the same distance) at 40 km/h. What is your average speed for the entire trip?

• Strengths: Useful for averaging rates or ratios; less affected by unusually large values than the arithmetic mean.

• Weaknesses: Can't handle zero or negative values; sensitive to very small values.

• Use cases: Physics, engineering, averaging speeds, rates, or efficiencies.


Harmonic mean calculation
You travel the first half of a journey at 60 km/h and the second half (the same distance) at 40 km/h. What is your average speed for the entire trip?
Solution: We cannot simply average the two speeds, (60 km/h + 40 km/h) / 2 = 50 km/h, because you spend more time travelling at the slower speed.
Instead, because the two distances are equal, use the harmonic mean:

\[ \text{Average speed} = \frac{2}{\frac{1}{60} + \frac{1}{40}} = \frac{2}{\frac{2+3}{120}} = \frac{240}{5} = 48 \text{ km/h} \]
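The result of 48 km/h can be cross-checked in code. The sketch assumes the two legs of the journey cover the same distance, which is the condition under which the harmonic mean of the speeds equals the true average speed; the 100 km leg length is an arbitrary value chosen for the check.

```python
import statistics

speeds = [60, 40]  # km/h on two legs of equal distance

# Harmonic mean: number of values divided by the sum of their reciprocals.
hm_manual = len(speeds) / sum(1 / s for s in speeds)
hm_library = statistics.harmonic_mean(speeds)

# Cross-check from first principles with a hypothetical leg length of 100 km:
leg_km = 100
total_time_h = leg_km / 60 + leg_km / 40      # hours spent on each leg
average_speed = 2 * leg_km / total_time_h     # total distance / total time

print(hm_manual, hm_library, average_speed)
```

If the legs had different lengths, total distance divided by total time would still be correct, but it would no longer match the plain harmonic mean of the two speeds.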


Weighted Mean
Formula: Each value is multiplied by its weight, then all products are summed and divided by the sum of all weights.

\[ WM = \frac{\sum C_i G_i}{\sum C_i} \]

Here \(G_i\) is the i-th value and \(C_i\) is its weight.

Example: Calculate the average exam score in a class where grades have different weights (e.g., final exam counts for 40%, quizzes for 30%).

Strengths: Captures different levels of importance for different data points.

Weaknesses: Requires clear justification for weights.

Use cases: Education, research, combining data with varying significance.
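A minimal sketch of the weighted mean for a course grade. The 40% and 30% weights come from the example on this slide; the third component ("assignments", 30%) and all three scores are assumptions made up so that the weights sum to 100%.

```python
# Hypothetical component scores. Only the 40% and 30% weights come from the slide;
# the "assignments" component and every score are assumptions for illustration.
scores = {"final exam": 80.0, "quizzes": 90.0, "assignments": 70.0}
weights = {"final exam": 0.40, "quizzes": 0.30, "assignments": 0.30}

# Sum of (weight * value), divided by the sum of the weights.
weighted_sum = sum(weights[k] * scores[k] for k in scores)
weighted_mean = weighted_sum / sum(weights.values())

print(weighted_mean)
```

Dividing by the sum of the weights means the same code works when the weights are credit hours or sample sizes rather than fractions of 1.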


Trimmed mean
Formula: Excludes a certain percentage of extreme values (e.g., top
and bottom 5%) before calculating the mean.
Example: Find the average income in a city, excluding the top 1% of
earners to reduce skew.
Strengths: Robust against outliers, good for skewed data.
Weaknesses: Choice of trimming percentage needs justification.
Use cases: Social sciences, economics, studying central tendency
while mitigating outliers.
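One possible way to compute a trimmed mean is sketched below. The function name `trimmed_mean`, the convention of cutting the same proportion from each end, and the income figures are all assumptions for illustration.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical monthly incomes (in thousands) with one extreme earner.
incomes = [20, 22, 25, 27, 30, 31, 33, 35, 40, 900]

plain = statistics.mean(incomes)
trimmed = trimmed_mean(incomes, 0.10)    # drops the lowest and the highest value

print(plain, trimmed)
```

With ten observations and 10% trimming, one value is removed from each end, so the single extreme income no longer dominates the result.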



Different means for different situations
1. Geometric Mean:

• Investing: When calculating compound annual growth rate (CAGR) of an investment over multiple years.

• Biology: When measuring cell growth or bacterial reproduction rates (assuming exponential growth).

• Economics: When analyzing price-to-earnings (P/E) ratios of stocks or index funds.

2. Harmonic Mean:

• Speed/Rate calculations: Finding the average speed when you travel equal distances at different speeds.

• Physics: Calculating the effective resistance of multiple resistors connected in parallel.

• Machining: Determining the average cutting speed when using tools with different diameters.

3. Weighted Mean:

• Grading: Calculating a student's overall grade when different assessments have different weights.

• Surveys: Combining ratings from different groups with varying sample sizes.

• Meta-analysis: Combining results from multiple studies with different sample sizes and methodologies.

4. Trimmed Mean:

• Income/wealth distribution: Analyzing average income/wealth excluding extreme outliers like the top or bottom percentiles.

• Pollution measurements: Calculating average air quality when occasional spikes might distort the regular pattern.

• Sports statistics: Determining an athlete's "typical" performance by excluding their best and worst scores.
Median
The median, a vital measure of central tendency, stands apart from the mean by
focusing on the middlemost value in a sorted dataset, rather than the average. It
shines in various situations, especially when outliers or skewed data are present.

Understanding the Median:

Imagine arranging your data like a number line, from least to greatest. The median is
the value that divides the data into two halves, with an equal number of data points
on either side. Here's how it works:

For odd-numbered datasets: The median is simply the middle value.

For even-numbered datasets: The median is the average of the two middle values.



Median (ungrouped)
• For an even number of observations:

\[ \tilde{m} = \frac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2} + 1\right)\text{th observation}}{2} \]

• For an odd number of observations:

\[ \tilde{m} = \left(\frac{n+1}{2}\right)\text{th observation} \]
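The two rules for ungrouped data translate directly into code. The sketch below is one straightforward implementation (the standard library's `statistics.median` does the same job); the test data reuse the exam scores from the mean example.

```python
def median(values):
    """Median of ungrouped data using the odd/even rules for the middle position."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (1-based), i.e. index n // 2.
        return ordered[n // 2]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_case = median([75, 82, 90, 85, 95])   # five observations
even_case = median([75, 82, 90, 85])      # four observations

print(odd_case, even_case)
```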


Median (grouped)

\[ \tilde{m} = l_m + \frac{h}{f_m}\left(\frac{n}{2} - F_{m-1}\right) \]

• The median class is the class that contains the \(\frac{n}{2}\)th observation
• \(l_m\) is the lower limit of the median class
• \(n\) is the total frequency, \(n = \sum f_i\)
• \(f_m\) is the frequency of the median class
• \(F_{m-1}\) is the cumulative frequency of the class preceding the median class
• \(h\) is the width of the median class
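The grouped-median formula can be sketched as a short function. The input layout (a list of `(lower_limit, width, frequency)` tuples), the function name `grouped_median`, and the frequency table are all assumptions made for this illustration.

```python
def grouped_median(classes):
    """classes: list of (lower_limit, width, frequency), sorted by lower limit."""
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, width, freq in classes:
        if cumulative + freq >= n / 2:
            # Median class found: apply l_m + (h / f_m) * (n/2 - F_{m-1}).
            return lower + (width / freq) * (n / 2 - cumulative)
        cumulative += freq

# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40.
table = [(0, 10, 5), (10, 10, 8), (20, 10, 12), (30, 10, 5)]
result = grouped_median(table)

print(result)
```

Here n = 30, so the 15th observation falls in the class 20-30 (cumulative frequency 13 before it), giving 20 + (10/12)(15 - 13), about 21.67.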


Strength and limitations of median
Strengths of the Median:
✓ Robustness against outliers: Unlike the mean, the median is largely unaffected by
extreme values, making it a reliable choice for skewed or contaminated data.
✓ Easy interpretation: It represents the value that "half the data falls below and half
falls above," providing a clear picture of the data's center.
✓ Applicable to non-numerical data: While the mean requires numerical values, the
median can be calculated for ordinal data (ranked categories) too.
Weaknesses of the Median:
✓ Less informative than the mean: It uses only the order of the values, not their magnitudes, so it discards some information about the data.
✓ Not sensitive to all changes: Changing values without moving them across the middle of the ordered data leaves the median unchanged, whereas the mean responds to every change.


When to use median?
• Presence of outliers or skewed data: When extreme values or an uneven distribution exist, the median offers a more accurate representation of the "typical" value.

• Ordinal data: If your data involves categories or ranks, the median provides a meaningful way to identify the central value.

• Need for a quick and interpretable measure: When you want a simple understanding of the data's center without getting bogged down in details, the median is a good choice.


Mode
The mode, another prominent measure of central tendency, takes a different approach
by identifying the most frequently occurring value in a dataset. While it can be
insightful in certain situations, its limitations also need to be considered.

Understanding the Mode: Think of the mode as the "popular kid" in the data set.
It's the value that shows up the most often, regardless of where it falls within the
data's spread.

Calculating the Mode:

• For unordered data: Simply count the frequency of each value, and the mode is
the one with the highest count.

• For ordered data: Identify the value that repeats the most times.



Mode
• The most frequently occurring value is the mode, i.e. the value with the highest frequency.
• For nominal data, the mean and median cannot be computed; only the mode is available.
• A distribution can be unimodal, bimodal or multimodal.
• If more than one mode exists, the mode is ill-defined and another measure should be chosen.
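A short sketch of finding the mode, including the bimodal case, using Python's `statistics` module. The colour answers and number lists are made-up data.

```python
import statistics

# Hypothetical survey answers (nominal data): only the mode is meaningful here.
colours = ["red", "blue", "red", "green", "blue", "red"]
mode_colour = statistics.mode(colours)

# multimode returns every value that shares the highest frequency.
unimodal = statistics.multimode([1, 2, 2, 3])
bimodal = statistics.multimode([1, 1, 2, 3, 3])

print(mode_colour, unimodal, bimodal)
```

Checking the length of the list returned by `multimode` is a simple way to detect when the mode is not unique.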


Strength and weakness of Mode
Strengths of the Mode:

o Simplicity: It's straightforward to calculate and understand, requiring no complex formulas.

o Useful for categorical data: Unlike the mean and median, it can be applied to categorical data where
ordering isn't meaningful.

o Highlights patterns: It can reveal dominant categories or preferences within the data.

Weaknesses of the Mode:

o Not always unique: A dataset can have multiple modes, or even no mode at all, making it less
informative than the mean or median.

o Sensitive to sample size: A larger sample size is more likely to have a distinct mode, while smaller
samples might be misleading.

o Not representative of central tendency: The mode doesn't necessarily reflect the "typical"
value, especially in skewed or multimodal data.



When to Use the Mode
• Identifying dominant categories: When understanding the most common category or value in categorical data is crucial.

• Exploring data patterns: If you want to see if specific values are significantly more frequent, the mode can help uncover these patterns.

• Preliminary data exploration: As a starting point for understanding data distribution, the mode can offer quick insights before using more rigorous measures.


Graphical presentation of central tendency


Skewness
• The term refers to a lack of symmetry, i.e. a departure from symmetry.

• It is defined relative to the normal distribution, which is symmetric.

• A graph is generally required to identify skewness.


Skewness
• Pearson skewness:

\[ Sk_p = \frac{\text{Mean} - \text{Mode}}{SD} \]
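The Pearson coefficient can be computed as in the sketch below. It assumes the data have a single mode and uses the population standard deviation; the slide does not say which SD is meant, so that choice, the function name, and the data are assumptions.

```python
import statistics

def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation, for data with a single mode."""
    modes = statistics.multimode(values)
    if len(modes) != 1:
        raise ValueError("mode is not unique; this coefficient is not defined")
    # Assumption: population SD (divide by N).
    return (statistics.mean(values) - modes[0]) / statistics.pstdev(values)

# Hypothetical right-skewed data: a long tail of large values.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9]
sk = pearson_mode_skewness(data)

# Hypothetical symmetric data: mean and mode coincide.
sk_symmetric = pearson_mode_skewness([1, 2, 2, 2, 3])

print(sk, sk_symmetric)
```

A positive value indicates that the mean lies above the mode (right skew); a value of zero is what a symmetric unimodal dataset produces.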


Kurtosis
• Deals with the peakedness of symmetrical
distribution. Thus it measures the degree of
peakedness.

Leptokurtic (L) Mesokurtic (M) Platykurtic (P)



Kurtosis
• Kurtosis:

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} \]

where \(\mu_2\) and \(\mu_4\) are the second and fourth central moments.

• \(\beta_2 = 3\): Mesokurtic

• \(\beta_2 < 3\): Platykurtic

• \(\beta_2 > 3\): Leptokurtic
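A minimal sketch of the kurtosis coefficient, using the standard moment definition (fourth central moment divided by the square of the second). The helper names, the `tolerance` used to decide what counts as "equal to 3", and the data are assumptions for illustration.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def kurtosis_beta2(values):
    """beta_2 = mu_4 / mu_2 ** 2 (equal to 3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def classify(beta2, tolerance=0.1):
    # `tolerance` is an arbitrary choice for this sketch, not a standard cut-off.
    if abs(beta2 - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical, evenly spread data
beta2_flat = kurtosis_beta2(flat)

print(beta2_flat, classify(beta2_flat))
```

Evenly spread data have no pronounced peak, so the coefficient falls well below 3.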


Mean vs Median vs Mode
Topic          | Mean                          | Median                                | Mode
Definition     | Typical value of distribution | Middlemost observation                | Most frequent observation
Data type      | Numerical data                | Ordinal (ranked) or numerical data    | All types of data
Outliers       | Extremely affected            | Not affected                          | Not affected
Open-end data  | Can't be calculated           | Can usually be calculated             | Can usually be calculated
Positive skew  | Greater than median and mode  | In the middle                         | Lower than mean and median
Negative skew  | Lower than median and mode    | In the middle                         | Higher than mean and median


Measures of dispersion
Measures of dispersion describe how the observations are scattered around the centre (e.g. the mean) of a distribution.

Dispersion is often called scatter or variation.

Absence of dispersion means perfect uniformity.

9/9/2024 AR, Lecturer, Statistics Discipline, KU 43


Measures of dispersion
When a measure is expressed in the same unit as the data values, it is an absolute measure.

When a measure is expressed as a unit-free ratio or percentage, it is a relative measure.
Measures of dispersion
Absolute measure     | Relative measure
Range                | Coefficient of range
Quartile deviation   | Coefficient of quartile deviation
Mean deviation       | Coefficient of mean deviation
Variance             |
Standard deviation   | Coefficient of variation (CV)


Range

The gap between the largest and the smallest value is called the range. Mathematically,

\[ R(x_1, x_2, \ldots, x_n) = \max(x_1, x_2, \ldots, x_n) - \min(x_1, x_2, \ldots, x_n) \]

• Extremely influenced by outliers.
• Trimmed range: if outliers are present, trim 5% or 10% of the data from the beginning or end before taking the range.
• IQR (interquartile range): the difference between the third and the first quartile.

\[ IQR = Q_3 - Q_1 \]

Coefficient of Range

The unit-invariant relative measure of range:

\[ CR = \frac{L - S}{L + S} \times 100 \]

where \(L\) is the largest and \(S\) the smallest value.

Interpretation:
• CR = 0 means no relative variability
• CR = 100 means maximum relative variability

Mostly used to compare two datasets.
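The range and its coefficient can be sketched as two small functions. The function names and the two datasets are made up; the coefficient follows the (L - S) / (L + S) × 100 form given on this slide.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """(L - S) / (L + S) * 100, where L and S are the largest and smallest values."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest) * 100

# Hypothetical datasets measured in different units.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 80]

print(value_range(heights_cm), coefficient_of_range(heights_cm))
print(value_range(weights_kg), coefficient_of_range(weights_kg))
```

Both datasets have the same range of 30, but the coefficient shows that the weights vary more relative to their size, which is the kind of comparison the relative measure is meant for.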


Quartile
Quartiles are the three points that divide the ordered observations into four equal parts. Their positions in the ordered data are:

• First quartile (Q1) = \(\frac{n+1}{4}\)th observation

• Second quartile (Q2) = \(\frac{n+1}{2}\)th observation

• Third quartile (Q3) = \(\frac{3(n+1)}{4}\)th observation
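The (n + 1)-based positions correspond to the "exclusive" method of `statistics.quantiles` in Python's standard library. The data below are hypothetical and chosen with n = 7 so that the positions 2, 4 and 6 are whole numbers.

```python
import statistics

# Hypothetical data with n = 7, so (n+1)/4, (n+1)/2 and 3(n+1)/4 are 2, 4 and 6.
data = [3, 7, 8, 12, 13, 14, 18]

# method="exclusive" places each quartile at the (n + 1) * p-th ordered observation,
# interpolating linearly when that position is not a whole number.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Other software may use slightly different interpolation rules, so quartiles of small samples can differ between tools.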


Quartile deviation

• Also called the semi-IQR:

\[ QD = \frac{IQR}{2} = \frac{Q_3 - Q_1}{2} \]

• Since it doesn't rely on extreme observations, it is more robust than the range.
• For a normal distribution, approximately \( QD = \frac{2}{3} SD \).

Coefficient of Quartile deviation

\[ CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100 \]

Interpretation:
• CQD = 0 indicates no quartile-based variability (Q1 and Q3 coincide)
• CQD = 100 indicates maximum quartile-based variability
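Given the two quartiles, both measures are one-line computations. The function names and the quartile values (7 and 14) are hypothetical.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2

def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1) * 100, a unit-free percentage."""
    return (q3 - q1) / (q3 + q1) * 100

# Hypothetical quartiles.
qd = quartile_deviation(7, 14)
cqd = coefficient_of_quartile_deviation(7, 14)

print(qd, cqd)
```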


Mean deviation

• The mean deviation (average deviation) is the average absolute difference between the observations and a central value. Values near the centre contribute small differences; values far from it contribute large ones.
• Mean deviation about the mean:

\[ M_d(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n} \]

• Mean deviation about an arbitrary value \(a\):

\[ M_d(a) = \frac{\sum |x_i - a|}{n} \]

Coefficient of Mean deviation

\[ CMD = \frac{M_d(a)}{a} \times 100 \]

"a" can be the mean, median, mode or any arbitrary value.
• CMD = 0% indicates all values are exactly the same as the mean
• CMD close to 100% means the data points are more dispersed from the mean

Approximate relations between the measures (for a normal distribution):

\[ MD = \frac{4}{5} SD, \qquad QD = \frac{2}{3} SD, \qquad MD = \frac{6}{5} QD \]
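The mean deviation and its coefficient can be sketched as below. The function names are hypothetical, and the data [3, 5, 7] are the small example used for the variance elsewhere in these notes.

```python
import statistics

def mean_deviation(values, a=None):
    """Average absolute deviation about `a` (defaults to the arithmetic mean)."""
    if a is None:
        a = statistics.mean(values)
    return sum(abs(x - a) for x in values) / len(values)

def coefficient_of_mean_deviation(values, a=None):
    """Mean deviation about `a`, expressed as a percentage of `a`."""
    if a is None:
        a = statistics.mean(values)
    return mean_deviation(values, a) / a * 100

data = [3, 5, 7]
md_mean = mean_deviation(data)                               # about the mean
md_median = mean_deviation(data, statistics.median(data))    # about the median
cmd = coefficient_of_mean_deviation(data)

print(md_mean, md_median, cmd)
```

As with the coefficient of variation, the coefficient is undefined when the chosen central value `a` is zero.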
Variance and SD

• Variance measures how far each number in the set is from the mean (average), and thus from every other number in the set.
• For a population:

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

• For a sample:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

• \( SD = \sqrt{\text{Variance}} \)

Coefficient of Variation (CV)

When two datasets differ widely in scale, the SD alone doesn't help much in comparing their variability.

\[ CV = \frac{s}{\bar{x}} \times 100 \]

NB: if the mean is zero, the CV is undefined.


Variance and SD

• Considers every observation of the data set.
• Helps in identifying skewness and kurtosis.
• Examples (population variance):
  [5, 5, 5]; var = 0 (no spread)
  [3, 5, 7]; var = 2.67 (some spread)
  [1, 5, 99]; var = 2050.67 (lot of spread)

NB: Since the variance squares the distances, in real life it is better to use the SD to understand the average spread.

Coefficient of Variation (CV)

• Since it is dimensionless, it is the most commonly used relative measure of variation.
• Used when the data being compared are in different units.
• Used when most of the observations are positive.
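The three small datasets on this slide can be checked with Python's `statistics` module. The quoted variances match the population formula (dividing by N); the sample variance (dividing by n - 1) is shown alongside for comparison. The dictionary labels are chosen here for illustration.

```python
import statistics

datasets = {
    "no spread": [5, 5, 5],
    "some spread": [3, 5, 7],
    "lots of spread": [1, 5, 99],
}

results = {}
for label, values in datasets.items():
    results[label] = {
        "population variance": statistics.pvariance(values),  # divide by N
        "sample variance": statistics.variance(values),       # divide by n - 1
        "population sd": statistics.pstdev(values),           # same unit as the data
    }

for label, row in results.items():
    print(label, row)
```

The SD of the third dataset is about 45, which is far easier to relate to the raw values than a variance of about 2051.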


Application of CV instead of SD

Measure | Systolic blood pressure (mmHg) | Diastolic blood pressure (mmHg)
Mean    | 130                            | 60
SD      | 15                             | 8
CV      | 11.5%                          | 13.3%
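The CV row of the blood pressure table can be reproduced from the mean and SD rows. The function name is chosen for this sketch; the guard for a zero mean reflects the fact that the ratio is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = sd / mean * 100; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / mean * 100

# Mean and SD values from the blood pressure table.
cv_systolic = coefficient_of_variation(15, 130)
cv_diastolic = coefficient_of_variation(8, 60)

print(f"systolic: {cv_systolic:.1f}%  diastolic: {cv_diastolic:.1f}%")
```

Although the systolic readings have the larger SD (15 versus 8), the diastolic readings are relatively more variable, which the SD alone would not reveal.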


Study materials
• Additional materials can be found from
▪ Google
▪ Wikipedia
▪ Slideshare
▪ Recommended Books: An Introduction to
Statistics and Probability
