
2nd Unit - Environmental Statistics

Descriptive Statistics

Measures of Central Tendency - Mean, Median, and Mode

Measures of central tendency are statistical indicators that represent the "typical"
value of a dataset. They provide a single summary of the data's center or middle
point, helping us condense large datasets into a more manageable form.

Here are the three most common measures of central tendency:

1. Mean:

● The most well-known measure, calculated by summing all values in a dataset and dividing by the total number of values.
● Considered the "balancing point" of the data: the deviations of values above the mean exactly cancel the deviations of values below it.
● Sensitive to outliers (extreme values) that can pull the mean away from the center.

2. Median:
● The middle value of a dataset when arranged in ascending or descending
order.
● If the dataset has an even number of values, the median is the average of the
two middle values.
● Less sensitive to outliers than the mean, making it a preferred choice for
skewed data distributions.

3. Mode:
● The value that appears most frequently in a dataset.
● Can be multimodal if there are multiple values with the highest frequency.
● Useful for identifying the most common value in a dataset, but doesn't
necessarily represent the center.

Choosing the right measure of central tendency depends on the nature of your data
and what you want to understand.

● Mean: Use it for normally distributed data where outliers are minimal, and you
want an overall "average" value.
● Median: Use it for skewed data or data with outliers, as it's less affected by
extreme values.
● Mode: Use it to identify the most frequent value in the data, especially for
categorical data.

Mean

The mean in statistics, often called the arithmetic mean or simply the average, is a
measure of central tendency that represents the sum of all values in a data set
divided by the number of values. It gives you a general sense of the "middle" of the
data.

Formula:

The most common formula for the mean (denoted by x̄) is:

x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n

where xᵢ is each individual value and n is the number of values.

Advantages:

● Simple and easy to calculate: Requires basic arithmetic operations, making it accessible to beginners.
● Intuitive interpretation: Can be easily understood as the "typical" value in
the data set.
● Widely used: Found in various statistical analyses and can be compared to
other means for insights.

Limitations:

● Sensitive to outliers: Extreme values can significantly affect the mean, making it less reliable for skewed data.
● May not represent the data accurately: In skewed distributions, the mean
can be pulled toward the tail and not reflect the center of the data.
● Ignores information about spread: Doesn't tell you anything about the
variability of the data, like how closely values are clustered around the mean.

Here are some examples to illustrate the advantages and limitations:

● Example 1 (Symmetrical data): Consider the test scores of 5 students: {70, 75, 80, 85, 90}. The mean is (70 + 75 + 80 + 85 + 90) / 5 = 80. This accurately reflects the typical score and is a good summary of the data.
● Example 2 (Skewed data): Imagine income levels of residents in a town:
{10k, 20k, 30k, 40k, 100k}. The mean is (10k + 20k + 30k + 40k + 100k) / 5 =
40k. However, this doesn't represent the typical income because of the single
high earner (outlier). The median (30k) would be a better measure of central
tendency in this case.
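The two examples above can be checked with Python's built-in statistics module; this is a minimal sketch, and the variable names are illustrative:

```python
from statistics import mean, median

# Income levels from Example 2 above
incomes = [10_000, 20_000, 30_000, 40_000, 100_000]

print(mean(incomes))    # 40000 - pulled upward by the single high earner
print(median(incomes))  # 30000 - a better "typical" income here
```

The single outlier (100k) drags the mean well above what most residents earn, while the median stays at the middle value.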

Median

The median is one of the most fundamental measures of central tendency in statistics. While the mean ("average") is often the first measure that comes to mind, the median offers a different perspective and comes with its own set of advantages and limitations.

Definition:

● The median is the middle value in a set of data when arranged in order, from
smallest to largest.
● If the data set has an odd number of observations, the median is the middle
one.
● If the data set has an even number of observations, the median is the average
of the two middle values.
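Both cases of the definition can be demonstrated with a short sketch using Python's statistics module (the data sets are illustrative):

```python
from statistics import median

odd_data = [70, 75, 80, 85, 90]   # odd count: median is the middle value
even_data = [70, 75, 80, 85]      # even count: average of the two middle values

print(median(odd_data))   # 80
print(median(even_data))  # 77.5
```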

Formula:

For a dataset arranged in ascending order with n values:
If n is odd: Median = the value at position (n + 1) / 2
If n is even: Median = the average of the values at positions n / 2 and (n / 2) + 1

Advantages:

● Less sensitive to outliers: Unlike the mean, which can be skewed by extreme values, the median is less influenced by outliers. This makes it a better measure of central tendency when dealing with skewed data sets.
● Easy to understand and interpret: The median represents the "middle
value" and is often easier to explain and understand than the mean, especially
for non-statistical audiences.
● Robust to ordinal data: While the mean requires interval or ratio data, the
median can be calculated for ordinal data (data with ranked categories) as
well.

Limitations:

● Less precise than the mean: The median doesn't utilize all the information in
the data set, potentially leading to less precise estimates of central tendency
compared to the mean.
● Difficult to interpret with grouped data: When working with grouped data
(data presented in frequency tables), calculating the median can be more
complex and require additional manipulation.
● Limited information about the distribution: Unlike the mean and standard
deviation, the median doesn't provide information about the spread of the
data.

Mode

Definition: The mode in statistics is the value that appears most frequently in a data
set. It is a measure of central tendency, along with the mean and median.

Formula:

For ungrouped data, the mode is simply the most frequent value and needs no formula. For grouped (frequency-table) data:

Mode = L + [(f₁ - f₀) / (2f₁ - f₀ - f₂)] × h

where L is the lower boundary of the modal class, f₁ is its frequency, f₀ and f₂ are the frequencies of the classes before and after it, and h is the class width.

Advantages:

● Simple to calculate and understand: Even people with no background in statistics can easily understand the concept of the mode.
● Robust to outliers: The mode is not affected by outliers in the data set,
making it a good choice for data sets with extreme values.
● Useful for nominal data: The mode is particularly useful for nominal data,
where the order of the values does not matter.

Limitations:

● May not be unique: It is possible for a data set to have multiple modes,
especially if there are several values that appear with the same high
frequency.
● Can be misleading for small data sets: For small data sets, random
fluctuations can lead to false modes.
● Ignores ordering and magnitude: The mode uses only frequency, so for ordinal or numeric data it discards the ordering information that the median and mean make use of.
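Python's statistics module handles both the single-mode and multimodal cases described above; a brief sketch with illustrative data:

```python
from statistics import mode, multimode

# Nominal data: the mode is the most frequent category
colors = ["red", "blue", "blue", "green", "red", "blue"]
print(mode(colors))  # 'blue' (appears 3 times)

# A bimodal dataset: both 2 and 5 appear twice
readings = [2, 2, 3, 5, 5, 7]
print(multimode(readings))  # [2, 5]
```

Note that `multimode` (Python 3.8+) returns every value tied for the highest frequency, which matches the "may not be unique" limitation.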

Measures of Dispersion

In statistics, measures of dispersion tell us how spread out or scattered data is around a central point, like the mean or median. This helps us understand the variability within a dataset and compare the spread of different datasets.

Here's a breakdown of key points:

Types of Measures:

● Absolute Measures: These express dispersion in the original units of the data. They include:
○ Range: Difference between the maximum and minimum values. Simple
but sensitive to outliers.
○ Mean Deviation (MD): Average of absolute deviations from the mean.
More robust than range.
○ Variance: Average of squared deviations from the mean. Sensitive to
outliers.
○ Standard Deviation (SD): Square root of variance. Most commonly
used, shows how much data deviates from the mean on average.
● Relative Measures: These are unitless and allow comparison across
datasets with different units. They include:
○ Coefficient of Range (CR): Range divided by the sum of the maximum and minimum values, i.e. (Max − Min) / (Max + Min), often expressed as a percentage.
○ Coefficient of Variation (CV): Standard deviation divided by the
mean, expressed as a percentage. Useful for comparing spread across
datasets with different means.
○ Coefficient of Mean Deviation (CMD): MD divided by the mean,
expressed as a percentage.

Range

The range is a simple yet powerful measure of dispersion in statistics. It tells you
how "spread out" your data is by simply calculating the difference between the
maximum and minimum values in your dataset.

Here's a breakdown of the range:

Formula:

Range = Maximum Value - Minimum Value

Interpretation:

● The range gives you an absolute measure of dispersion, meaning it tells you
the exact distance covered by your data points.
● A higher range indicates greater spread, while a lower range indicates more
closely clustered data points.

Limitations:

● The range can be easily skewed by outliers, as a single extreme value can
significantly inflate the range.
● It doesn't take into account the distribution of your data. Two datasets with the
same range can have very different underlying structures.

Applications:

● The range is a quick and easy way to get a preliminary sense of how spread
out your data is.
● It can be useful for comparing the variation of small datasets where outliers
are less likely to distort the picture.
● It can be used in conjunction with other measures of dispersion, like standard
deviation and quartile deviation, to provide a more comprehensive
understanding of data variability.

Examples:

● Consider a dataset of the ages of students in a class: {12, 13, 14, 15, 16}. The
range would be 16 - 12 = 4.
● In a dataset of exam scores: {70, 80, 85, 90, 95}, the range would be 95 - 70
= 25.
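The two examples above reduce to a one-line helper; the function name is illustrative:

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

print(data_range([12, 13, 14, 15, 16]))  # 4  (student ages)
print(data_range([70, 80, 85, 90, 95]))  # 25 (exam scores)
```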

Quartile Deviation

Quartile deviation, also known as the semi-interquartile range, is a measure of dispersion in statistics. It tells you how spread out the middle 50% of your data is. Here's a breakdown of everything you need to know about it:

What it is:

● Half of the Interquartile Range (IQR): The IQR is the difference between the
third quartile (Q3) and the first quartile (Q1). Quartile deviation takes this
range and divides it by two.
● Focuses on the middle 50%: Unlike measures like variance and standard
deviation that consider all data points, quartile deviation only focuses on the
central 50% of your data. This makes it less sensitive to outliers.

Formula:

Quartile Deviation (QD) = (Q3 - Q1) / 2

Interpretation:

● A higher quartile deviation indicates that the middle 50% of your data is more
spread out, with larger differences between values.
● A lower quartile deviation indicates that the middle 50% of your data is more
tightly clustered, with smaller differences between values.
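As a sketch, the quartiles can be obtained with Python's statistics.quantiles and plugged into the QD formula. Note that different quartile conventions give slightly different Q1/Q3 values; method="inclusive" below treats the data as a whole population:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7]

# n=4 splits the data at the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
qd = (q3 - q1) / 2  # Quartile Deviation = (Q3 - Q1) / 2
print(q1, q3, qd)   # 2.5 5.5 1.5
```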

Advantages:

● Robust to outliers: Because it only considers the middle 50% of the data,
quartile deviation is less affected by outliers than other measures of
dispersion.
● Simple to calculate: The formula is straightforward and easy to apply, even
with basic calculations.
● Interpretable: The units of quartile deviation are the same as the units of your
data, making it easier to understand.

Disadvantages:

● Limited information: Unlike measures like standard deviation, quartile deviation doesn't give you information about the spread of the entire dataset, only the middle 50%.
● Not good for normal distributions: If your data is normally distributed,
quartile deviation may not be the most informative measure of dispersion.

Mean Deviation

Mean deviation, simply put, is the average of the absolute deviations (distances) of
all data points from the mean (or sometimes, the median) of the dataset. This means
we calculate the difference between each individual value and the central value, take
the absolute value (making negative differences positive), and then average them all.

Types of Mean Deviation:

● Mean deviation from the mean: This is the most common type, where the
central value is the mean of the dataset. It gives an average absolute distance
from the "typical" value, indicating how spread out the data is.
● Mean deviation from the median: In this case, the median, another measure
of central tendency, acts as the reference point. This is useful when the data
has outliers that skew the mean, as the median is less sensitive to extreme
values.

Calculating Mean Deviation:

1. Calculate the central value (mean or median) of your dataset.


2. For each data point, subtract the central value to get the deviation.
3. Take the absolute value of each deviation (to ignore negative signs).
4. Sum all the absolute deviations.
5. Divide the sum by the number of data points (n) to get the average absolute
deviation.
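The five steps above translate directly into a short function; this is a minimal sketch, and the function name is illustrative:

```python
from statistics import mean, median

def mean_deviation(values, center=None):
    """Average absolute deviation from the mean (default) or any given center."""
    if center is None:
        center = mean(values)                       # step 1: central value
    deviations = [abs(x - center) for x in values]  # steps 2-3: absolute deviations
    return sum(deviations) / len(values)            # steps 4-5: average them

data = [2, 4, 6, 8]
print(mean_deviation(data))                # 2.0 (mean is 5)
print(mean_deviation(data, median(data)))  # 2.0 (median is also 5 here)
```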

Formula:

MD = Σ |xᵢ - A| / n

where A is the chosen central value (mean or median), xᵢ is each data point, and n is the number of data points.

Interpreting Mean Deviation:

● Higher mean deviation: Indicates a wider spread of data, meaning individual values are further away from the central value.
● Lower mean deviation: Suggests a more compact dataset, where values are
closer to the center.

Applications of Mean Deviation:

● Comparing the variability of different datasets on the same scale, even if their
units are different.
● Identifying outliers that significantly deviate from the average.
● Assessing the "typical" spread of data around a central value.

Mean Deviation vs. Standard Deviation:

Both are measures of dispersion, but they differ in terms of calculation and
interpretation:

● Mean deviation: Uses absolute deviations, giving equal weight to all distances from the center. It's easier to understand but less statistically efficient.
● Standard deviation: Takes squares of deviations, giving more weight to larger distances. It's more statistically efficient for normally distributed data, but more sensitive to outliers and less intuitive to interpret.

Standard Deviation

Standard deviation (SD) is the most common and widely used measure of
dispersion. It tells us, on average, how far individual data points deviate from the
mean. Essentially, it quantifies the "spread" of the data.

How is Standard Deviation Calculated?

There are two types of standard deviation:

● Population Standard Deviation (σ): Used when the entire population is available.

σ = √( Σ (xᵢ - μ)² / N )
where:

● x_i is each data point
● μ is the population mean
● N is the total number of data points
● Sample Standard Deviation (s): Used when only a sample of the population is available.

s = √( Σ (xᵢ - x̄)² / (n - 1) )

where:

● x_i is each data point
● x̄ is the sample mean
● n is the sample size

Interpreting Standard Deviation:

● A higher SD indicates that the data is more spread out and deviates more
from the mean.
● A lower SD indicates that the data is more clustered around the mean.
● In a normal distribution, approximately 68% of the data points will fall within 1
SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs.
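The population/sample distinction maps to two functions in Python's statistics module; a brief sketch with an illustrative dataset whose mean is 5:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

print(pstdev(data))  # 2.0 (population SD: divides by N)
print(stdev(data))   # ~2.138 (sample SD: divides by n - 1)
```

The sample version is slightly larger because dividing by n − 1 instead of N corrects for the bias of estimating the mean from the same sample.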

Uses of Standard Deviation:

● Comparing variability between different datasets.
● Statistical hypothesis testing.
● Quality control and process monitoring.
● Risk assessment and prediction.

Limitations of Standard Deviation:

● Sensitive to outliers.
● Cannot be directly compared across datasets with different units.
● Not a good measure for skewed distributions.

Coefficient of Variation

The coefficient of variation (CV) is a relative measure of dispersion used to compare the variability of data sets with different units or means. It tells you how much, on average, individual values in a data set deviate from the mean, expressed as a percentage. This makes it extremely useful for comparing the stability or consistency of different sets.

Key features of CV:

● Relative measure: Unlike standard deviation, CV is dimensionless and allows comparison across data sets with different units or scales. For example, you can easily compare the variability of height in centimeters (cm) and weight in kilograms (kg) using CV.
● Expressed as a percentage: CV is multiplied by 100 to give a percentage
value, making it easier to interpret and communicate. A lower CV indicates
less variability, and a higher CV indicates more variability.
● Used with ratio data: CV should only be used with data measured on a ratio
scale, where values have a true zero and meaningful ratios can be formed.
Interval data with arbitrary zeros can lead to misleading interpretations of CV.

Calculating CV:

The formula for CV is:

CV = (σ / μ) × 100 for a population, or CV = (s / x̄) × 100 for a sample

where:

● σ is the population standard deviation and μ is the population mean
● s is the sample standard deviation and x̄ is the sample mean

Interpreting CV:

There are no universal thresholds for interpreting CV, but some general guidelines
exist:

● CV < 20%: Low variability, considered stable or uniform
● 20% ≤ CV ≤ 50%: Moderate variability
● CV > 50%: High variability, considered scattered or inconsistent
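As a sketch, the CV formula in population form is one line of Python; the function name and datasets are illustrative:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, as a percentage."""
    return pstdev(values) / mean(values) * 100

# Two datasets on very different scales can still be compared relatively
daily_rainfall_mm = [8, 12]        # mean 10, SD 2
yearly_rainfall_mm = [900, 1100]   # mean 1000, SD 100

print(coefficient_of_variation(daily_rainfall_mm))   # 20.0
print(coefficient_of_variation(yearly_rainfall_mm))  # 10.0
```

Even though the yearly figures have a much larger absolute spread (SD of 100 vs 2), their CV is lower, so they are relatively more consistent.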

Moments

In statistics, moments are descriptive measures of the probability distribution of a random variable. They tell us about the shape, center, and spread of the data. The first four moments are particularly important:

● First moment (mean): Measures the central tendency of the data. For a
symmetrical distribution, the mean coincides with the median and mode.
● Second moment (variance): Measures the spread of the data around the
mean. A higher variance indicates greater dispersion of the data points.
● Third moment (skewness): Measures the asymmetry of the distribution. A
positive skewness indicates a longer tail on the right side, while a negative
skewness indicates a longer tail on the left side. A zero skewness means the
distribution is symmetrical.
● Fourth moment (kurtosis): Measures the "tailedness" of the distribution. A
higher kurtosis indicates heavier tails than a normal distribution, while a lower
kurtosis indicates lighter tails. A kurtosis of 3 corresponds to a normal
distribution.
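The four moments above can be computed from central moments about the mean; this sketch uses the standardized definitions (skewness = m₃/m₂^1.5, kurtosis = m₄/m₂²), with function names of my choosing:

```python
from statistics import mean

def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    m = mean(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2

data = [1, 2, 3, 4, 5]  # perfectly symmetrical around 3
print(skewness(data))   # 0.0 (symmetrical: zero skewness)
print(kurtosis(data))   # 1.7 (< 3: lighter tails than a normal distribution)
```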

Skewness

Skewness is a vital concept in statistics, measuring the asymmetry of a data distribution. It tells you how much the "weight" of your data points is unevenly distributed around the central point (mean or median). Think of it like a seesaw: a perfectly balanced seesaw with equal weights on both sides represents a symmetrical distribution with zero skewness. If one side dips lower than the other, you have skewness.

Types of Skewness:

● Positive Skewness: The tail of the distribution stretches out to the right, with
more data points clustered on the left side. Imagine a bunch of kids on a
seesaw – more on the left side makes the right side rise higher.
● Negative Skewness: The tail extends to the left, with more data points on the
right side. Picture the seesaw tipping the other way.
● No Skewness: The distribution is perfectly symmetrical, a bell-shaped curve
with the mean, median, and mode all coinciding. The seesaw is perfectly
balanced.

Understanding Skewness:

● Implications: Knowing the skewness helps interpret your data better. For
example, skewed income data might show more low earners than high
earners, highlighting income inequality.
● Impact on Statistics: Some statistical tests rely on normal distributions (zero
skewness). If your data is highly skewed, these tests might not be reliable.
● Measuring Skewness: Several methods exist, like Pearson's coefficient and
the moment skewness. These values indicate the direction and magnitude of
the asymmetry.
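One of the measures mentioned above, Pearson's second skewness coefficient, is easy to sketch from quantities already covered in this unit (the function name is illustrative):

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

incomes = [10, 20, 30, 40, 100]   # long right tail (one high earner)
print(pearson_skewness(incomes))  # ~0.95 (> 0: positive skew)
```

Because the mean (40) sits above the median (30), the coefficient comes out positive, confirming the right-skewed shape.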

Kurtosis

Kurtosis is a statistical measure that quantifies the "tailedness" of a probability distribution. In simpler terms, it tells you how much weight is in the tails of the distribution, relative to the center.

Imagine a bell curve representing a normal distribution. A normal distribution has a kurtosis of 3. If a distribution has a higher kurtosis than 3, it has heavier tails, which indicates that there are more extreme values (outliers) than in a normal distribution. Conversely, a distribution with a lower kurtosis than 3 has lighter tails, which means that there are fewer extreme values.

Types of kurtosis:

There are three main types of kurtosis:

● Leptokurtic: This type of distribution has heavy tails and a high kurtosis value
(greater than 3). This means that there are many more extreme values than in
a normal distribution.
● Mesokurtic: This type of distribution has medium tails and a kurtosis value of
3. This is the same as a normal distribution.
● Platykurtic: This type of distribution has light tails and a low kurtosis value
(less than 3). This means that there are fewer extreme values than in a
normal distribution.

How is kurtosis used?

Kurtosis is used in a variety of applications, including:

● Financial analysis: Kurtosis is used to measure the risk of an investment. A high kurtosis investment is more likely to have large price swings, both positive and negative.
● Quality control: Kurtosis is used to monitor the quality of a manufacturing
process. A high kurtosis process is more likely to produce defective products.
● Scientific research: Kurtosis is used to analyze data from experiments. A
high kurtosis data set may indicate that there are outliers that need to be
investigated.
