2nd Unit - Statistics
Descriptive Statistics
Measures of central tendency are statistical indicators that represent the "typical"
value of a dataset. They provide a single summary of the data's center or middle
point, helping us condense large datasets into a more manageable form.
1. Mean:
● The sum of all values in the dataset divided by the number of values.
● Uses every observation, but is sensitive to outliers, since extreme values
pull it toward them.
2. Median:
● The middle value of a dataset when arranged in ascending or descending
order.
● If the dataset has an even number of values, the median is the average of the
two middle values.
● Less sensitive to outliers than the mean, making it a preferred choice for
skewed data distributions.
3. Mode:
● The value that appears most frequently in a dataset.
● Can be multimodal if there are multiple values with the highest frequency.
● Useful for identifying the most common value in a dataset, but doesn't
necessarily represent the center.
Choosing the right measure of central tendency depends on the nature of your data
and what you want to understand.
● Mean: Use it for normally distributed data where outliers are minimal, and you
want an overall "average" value.
● Median: Use it for skewed data or data with outliers, as it's less affected by
extreme values.
● Mode: Use it to identify the most frequent value in the data, especially for
categorical data.
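As a quick check of the guidelines above, Python's standard-library `statistics` module computes all three measures; the sample data here is hypothetical, chosen so that a single outlier (40) separates the mean from the median.

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 40]  # hypothetical sample with one outlier (40)

print(statistics.mean(data))    # 10 — pulled upward by the outlier
print(statistics.median(data))  # 5  — middle value, robust to the outlier
print(statistics.mode(data))    # 3  — most frequent value
```

Note how the outlier drags the mean well above the median, which is exactly why the median is preferred for skewed data.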
Mean
The mean in statistics, often called the arithmetic mean or simply the average, is a
measure of central tendency that represents the sum of all values in a data set
divided by the number of values. It gives you a general sense of the "middle" of the
data.
Formula:
The most common formula for the mean (denoted by x̄) is:
x̄ = (Σxᵢ) / n
where Σxᵢ is the sum of all values and n is the number of values.
Advantages:
● Uses every value in the dataset, making it a comprehensive summary.
● Simple to calculate and widely understood.
Limitations:
● Sensitive to outliers: a single extreme value can pull the mean away from
the bulk of the data.
● Can be misleading for skewed distributions, where it no longer represents a
"typical" value.
Median
Definition:
● The median is the middle value in a set of data when arranged in order, from
smallest to largest.
● If the data set has an odd number of observations, the median is the middle
one.
● If the data set has an even number of observations, the median is the average
of the two middle values.
Formula:
For n ordered values, the median is the ((n + 1) / 2)-th value when n is odd,
and the average of the (n / 2)-th and (n / 2 + 1)-th values when n is even.
Advantages:
● Robust to outliers: extreme values do not shift the median.
● Well suited to skewed distributions and to ordinal data.
Limitations:
● Less precise than the mean: The median doesn't utilize all the information in
the data set, potentially leading to less precise estimates of central tendency
compared to the mean.
● Difficult to interpret with grouped data: When working with grouped data
(data presented in frequency tables), calculating the median can be more
complex and require additional manipulation.
● Limited information about the distribution: Unlike the standard deviation,
the median doesn't provide information about the spread of the data.
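The odd/even rule above can be verified directly with the standard library (the sample lists are hypothetical):

```python
import statistics

odd = [7, 1, 5]      # sorted: [1, 5, 7] -> single middle value
even = [1, 3, 5, 7]  # average of the two middle values: (3 + 5) / 2

print(statistics.median(odd))   # 5
print(statistics.median(even))  # 4.0
```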
Mode
Definition: The mode in statistics is the value that appears most frequently in a data
set. It is a measure of central tendency, along with the mean and median.
Formula:
For ungrouped data there is no algebraic formula; the mode is found by
inspection as the most frequent value. For grouped data, the standard formula
is:
Mode = L + ((f₁ − f₀) / (2f₁ − f₀ − f₂)) × h
where L is the lower boundary of the modal class, f₁ its frequency, f₀ and f₂
the frequencies of the preceding and following classes, and h the class width.
Advantages:
● The only measure of central tendency that works for nominal (categorical)
data.
● Unaffected by extreme values and easy to identify.
Limitations:
● May not be unique: It is possible for a data set to have multiple modes,
especially if there are several values that appear with the same high
frequency.
● Can be misleading for small data sets: For small data sets, random
fluctuations can lead to false modes.
● Not suitable for continuous data: in continuous measurements, values rarely
repeat exactly, so a meaningful mode may not exist without grouping the data
first.
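Python's `statistics.multimode` (available since Python 3.8) illustrates the multiple-modes case directly, including for categorical data:

```python
import statistics

data = [1, 1, 2, 2, 3]                # 1 and 2 tie for highest frequency
print(statistics.multimode(data))     # [1, 2] — a bimodal dataset
print(statistics.multimode("aabbc"))  # ['a', 'b'] — works for categories too
```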
Measures of Dispersion
Types of Measures:
● Range
● Quartile Deviation
● Mean Deviation
● Standard Deviation
● Coefficient of Variation
Range
The range is a simple yet powerful measure of dispersion in statistics. It tells you
how "spread out" your data is by simply calculating the difference between the
maximum and minimum values in your dataset.
Formula:
Range = Maximum value − Minimum value
Interpretation:
● The range gives you an absolute measure of dispersion, meaning it tells you
the exact distance covered by your data points.
● A higher range indicates greater spread, while a lower range indicates more
closely clustered data points.
Limitations:
● The range can be easily skewed by outliers, as a single extreme value can
significantly inflate the range.
● It doesn't take into account the distribution of your data. Two datasets with the
same range can have very different underlying structures.
Applications:
● The range is a quick and easy way to get a preliminary sense of how spread
out your data is.
● It can be useful for comparing the variation of small datasets where outliers
are less likely to distort the picture.
● It can be used in conjunction with other measures of dispersion, like standard
deviation and quartile deviation, to provide a more comprehensive
understanding of data variability.
Examples:
● Consider a dataset of the ages of students in a class: {12, 13, 14, 15, 16}. The
range would be 16 - 12 = 4.
● In a dataset of exam scores: {70, 80, 85, 90, 95}, the range would be 95 - 70
= 25.
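A minimal sketch of the range calculation, reproducing the two examples above (`value_range` is a hypothetical helper name):

```python
def value_range(data):
    """Range = maximum value - minimum value."""
    return max(data) - min(data)

print(value_range([12, 13, 14, 15, 16]))  # 4
print(value_range([70, 80, 85, 90, 95]))  # 25
```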
Quartile Deviation
What it is:
● Half of the Interquartile Range (IQR): The IQR is the difference between the
third quartile (Q3) and the first quartile (Q1). Quartile deviation takes this
range and divides it by two.
● Focuses on the middle 50%: Unlike measures like variance and standard
deviation that consider all data points, quartile deviation only focuses on the
central 50% of your data. This makes it less sensitive to outliers.
Formula:
Quartile Deviation = (Q3 − Q1) / 2
Interpretation:
● A higher quartile deviation indicates that the middle 50% of your data is more
spread out, with larger differences between values.
● A lower quartile deviation indicates that the middle 50% of your data is more
tightly clustered, with smaller differences between values.
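A short sketch of the quartile-deviation calculation using the standard library. Note that `statistics.quantiles` supports several interpolation methods and different methods can give slightly different quartiles; the `inclusive` method is used here.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]

# quantiles() with n=4 returns the three cut points [Q1, Q2, Q3]
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
qd = (q3 - q1) / 2  # quartile deviation = half the IQR

print(q1, q3, qd)  # 2.5 5.5 1.5
```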
Advantages:
● Robust to outliers: Because it only considers the middle 50% of the data,
quartile deviation is less affected by outliers than other measures of
dispersion.
● Simple to calculate: The formula is straightforward and easy to apply, even
with basic calculations.
● Interpretable: The units of quartile deviation are the same as the units of your
data, making it easier to understand.
Disadvantages:
● Ignores half of the data: values below Q1 and above Q3 have no effect on it.
● Not suitable for further algebraic treatment, unlike the standard deviation.
Mean Deviation
Mean deviation, simply put, is the average of the absolute deviations (distances) of
all data points from the mean (or sometimes, the median) of the dataset. This means
we calculate the difference between each individual value and the central value, take
the absolute value (making negative differences positive), and then average them all.
● Mean deviation from the mean: This is the most common type, where the
central value is the mean of the dataset. It gives an average absolute distance
from the "typical" value, indicating how spread out the data is.
● Mean deviation from the median: In this case, the median, another measure
of central tendency, acts as the reference point. This is useful when the data
has outliers that skew the mean, as the median is less sensitive to extreme
values.
Formula:
Mean Deviation = Σ|xᵢ − A| / n
where A is the chosen central value (the mean or the median) and n is the
number of values.
Applications:
● Comparing the variability of different datasets on the same scale, even if their
units are different.
● Identifying outliers that significantly deviate from the average.
● Assessing the "typical" spread of data around a central value.
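The definition above translates directly into code; `mean_deviation` is a hypothetical helper name, and the `center` parameter selects the mean or the median as the reference point.

```python
import statistics

def mean_deviation(data, center=None):
    """Average absolute distance of each point from a central value
    (the mean by default; pass the median for the robust variant)."""
    if center is None:
        center = statistics.mean(data)
    return sum(abs(x - center) for x in data) / len(data)

data = [2, 4, 6, 8, 10]                               # mean = median = 6
print(mean_deviation(data))                           # 2.4
print(mean_deviation(data, statistics.median(data)))  # 2.4 (same, as mean == median)
```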
Standard Deviation
Mean deviation and standard deviation are both measures of dispersion, but
they differ in terms of calculation and interpretation:
Standard deviation (SD) is the most common and widely used measure of
dispersion. It tells us, on average, how far individual data points deviate from the
mean. Essentially, it quantifies the "spread" of the data.
Formula:
Population SD: σ = √( Σ(xᵢ − μ)² / N )
Sample SD: s = √( Σ(xᵢ − x̄)² / (n − 1) )
where:
● xᵢ = each individual value
● μ (or x̄) = the mean of the data
● N (or n) = the number of values
Interpretation:
● A higher SD indicates that the data is more spread out and deviates more
from the mean.
● A lower SD indicates that the data is more clustered around the mean.
● In a normal distribution, approximately 68% of the data points will fall within 1
SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs.
Limitations:
● Sensitive to outliers.
● Cannot be directly compared across datasets with different units.
● Not a good measure for skewed distributions.
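A brief illustration with the standard library, which distinguishes the population SD (`pstdev`, divide by N) from the sample SD (`stdev`, divide by n − 1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5

print(statistics.pstdev(data))  # 2.0   — population SD (divide by N)
print(statistics.stdev(data))   # ~2.14 — sample SD (divide by n - 1)
```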
Coefficient of Variation
Calculating CV:
CV = (Standard Deviation / Mean) × 100%
Because the units cancel, the CV is a unit-free percentage, which makes it
suitable for comparing variability across datasets measured in different units.
Interpreting CV:
There are no universal thresholds for interpreting CV, but as a rule of thumb
a lower CV indicates data that is more consistent relative to its mean, while
a higher CV indicates greater relative variability.
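Assuming the usual formula CV = (SD / mean) × 100, a small sketch comparing two hypothetical samples measured in different units (`coefficient_of_variation` is a hypothetical helper name):

```python
import statistics

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100 — a unit-free percentage."""
    return statistics.pstdev(data) / statistics.mean(data) * 100

heights_cm = [160, 170, 180]  # hypothetical sample
weights_kg = [55, 70, 85]     # hypothetical sample in different units

# Despite the different units, the CVs are directly comparable:
print(coefficient_of_variation(heights_cm))  # ~4.8%  — relatively consistent
print(coefficient_of_variation(weights_kg))  # ~17.5% — relatively more variable
```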
Moments
Moments are quantitative measures that describe the shape of a distribution;
the first four correspond to the mean, variance, skewness, and kurtosis.
● First moment (mean): Measures the central tendency of the data. For a
symmetrical distribution, the mean coincides with the median and mode.
● Second moment (variance): Measures the spread of the data around the
mean. A higher variance indicates greater dispersion of the data points.
● Third moment (skewness): Measures the asymmetry of the distribution. A
positive skewness indicates a longer tail on the right side, while a negative
skewness indicates a longer tail on the left side. A zero skewness means the
distribution is symmetrical.
● Fourth moment (kurtosis): Measures the "tailedness" of the distribution. A
higher kurtosis indicates heavier tails than a normal distribution, while a lower
kurtosis indicates lighter tails. A kurtosis of 3 corresponds to a normal
distribution.
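The four moments above can be computed from their definitions with plain Python; `central_moment` is a hypothetical helper, and the sample is deliberately symmetric so the skewness comes out to zero.

```python
import statistics

def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = statistics.mean(data)
    return sum((x - m) ** k for x in data) / len(data)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # symmetric around 3

var = central_moment(data, 2)              # second moment: variance
skew = central_moment(data, 3) / var**1.5  # standardized third moment
kurt = central_moment(data, 4) / var**2    # standardized fourth moment

print(var)   # ~1.33
print(skew)  # 0.0  — symmetric distribution
print(kurt)  # 2.25 — lighter tails than normal (kurtosis < 3)
```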
Skewness
Types of Skewness:
● Positive Skewness: The tail of the distribution stretches out to the right, with
more data points clustered on the left side. Imagine a bunch of kids on a
seesaw – more on the left side makes the right side rise higher.
● Negative Skewness: The tail extends to the left, with more data points on the
right side. Picture the seesaw tipping the other way.
● No Skewness: The distribution is perfectly symmetrical, a bell-shaped curve
with the mean, median, and mode all coinciding. The seesaw is perfectly
balanced.
Understanding Skewness:
● Implications: Knowing the skewness helps interpret your data better. For
example, skewed income data might show more low earners than high
earners, highlighting income inequality.
● Impact on Statistics: Some statistical tests rely on normal distributions (zero
skewness). If your data is highly skewed, these tests might not be reliable.
● Measuring Skewness: Several methods exist, like Pearson's coefficient and
the moment skewness. These values indicate the direction and magnitude of
the asymmetry.
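One of the measures mentioned above, Pearson's second skewness coefficient, is 3 × (mean − median) / SD; a positive result indicates right (positive) skew. A minimal sketch with a hypothetical sample:

```python
import statistics

def pearson_skew(data):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3 * (statistics.mean(data) - statistics.median(data)) / statistics.pstdev(data)

right_skewed = [1, 2, 2, 3, 3, 4, 5, 9]  # hypothetical sample, long right tail
print(pearson_skew(right_skewed))        # positive value -> right (positive) skew
```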
Kurtosis
Types of kurtosis:
● Leptokurtic: This type of distribution has heavy tails and a high kurtosis value
(greater than 3). This means that there are many more extreme values than in
a normal distribution.
● Mesokurtic: This type of distribution has medium tails and a kurtosis value of
3. This is the same as a normal distribution.
● Platykurtic: This type of distribution has light tails and a low kurtosis value
(less than 3). This means that there are fewer extreme values than in a
normal distribution.
How is kurtosis used?
Kurtosis is used to assess how prone a distribution is to producing extreme
values (outliers). High kurtosis signals heavy tails, which matters in fields
like finance, where it indicates a greater chance of extreme gains or losses;
it is also examined alongside skewness when checking whether data is
approximately normal.
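One common use is classifying a distribution's tail behavior against the normal benchmark of 3. A minimal sketch (helper names are hypothetical):

```python
def kurtosis(data):
    """Standardized fourth central moment (normal distribution has ~3)."""
    n = len(data)
    m = sum(data) / n
    var = sum((x - m) ** 2 for x in data) / n
    return sum((x - m) ** 4 for x in data) / n / var**2

def classify(k):
    """Label a kurtosis value relative to the normal benchmark of 3."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"

print(classify(kurtosis([1, 2, 3, 4, 5])))   # platykurtic — flat, light tails
print(classify(kurtosis([0, 0, 0, 0, 10])))  # leptokurtic — heavy-tailed
```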