
Unit 4 Class Notes

The document discusses measures of central tendency in statistics, including mean, median, and mode, which represent the central values of data sets. It also covers measures of variability such as range, variance, and standard deviation, as well as concepts like skewness and kurtosis that describe the shape of data distributions. Additionally, it explains hypothesis testing, including its steps, significance levels, and limitations.

Uploaded by

sashantnipate

Unit 4

Measures of Central Tendency in Statistics

Central tendencies in Statistics are the numerical values used to represent the mid-value or central value of a large
collection of numerical data. These values are called central or average values in Statistics. A
central or average value of any statistical data or series is the value of the variable that is representative of the entire data
or its associated frequency distribution. Such a value is of great significance because it depicts the nature or
characteristics of the entire data, which is otherwise very difficult to observe.

Measures of Central Tendency Meaning


The representative value of a data set, generally the central value or the most frequently occurring value, which gives a
general idea of the whole data set, is called a Measure of Central Tendency.
Measures of Central Tendency
Some of the most commonly used measures of central tendency are:
● Mean
● Median
● Mode
Mean
Mean in general terms refers to the arithmetic mean of the data, but besides the arithmetic mean there are also the
geometric mean and the harmonic mean, which are calculated using different formulas. Here in this article, we discuss the
arithmetic mean.

Mean for Ungrouped Data

Arithmetic mean (x̄) is defined as the sum of the individual observations (xᵢ) divided by the total number of observations N:

x̄ = (x₁ + x₂ + … + x_N) / N = (Σ xᵢ) / N

In other words, the mean is given by the sum of all observations divided by the total number of observations.
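The mean, together with the median and mode listed above, can be computed with Python's built-in statistics module; a minimal sketch, with the data list invented for illustration:

```python
import statistics

# Illustrative ungrouped data: 7 individual observations (x_i)
scores = [4, 8, 6, 5, 3, 8, 9]

mean = statistics.mean(scores)      # sum of observations / N = 43 / 7
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequently occurring value

print(mean, median, mode)  # 6.142857142857143 6 8
```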
Measures of Variability
a. Measures of central tendency (e.g., mean, median, mode) provide useful but
limited information. They say nothing about the dispersion (i.e., variability) of
the scores in a distribution.

b. Three measures of variability that researchers typically examine: range,
variance, and standard deviation. Standard deviation is the most informative and
widely used of the three.
Range
a. Definition: The range is the difference between the largest score (maximum value) and the
smallest score (minimum value) of a distribution.
b. It gives researchers a sense of how spread out the scores of a distribution are, but it can be
impractical and misleading at times, since it depends on only the two extreme scores.
c. When it may be used: Researchers may want to know whether all of the response categories
on a survey question have been used and/or to get a sense of the overall balance in the
distribution.
Interquartile Range (IQR)
a. Definition: The difference between the 75th percentile (third quartile) and the 25th
percentile (first quartile) scores in a distribution.
b. When the scores in a distribution are arranged in order, from smallest to largest (or
vice versa), the IQR contains the scores in the two middle quartiles.
Variance
a. Definition: The sum of the squared deviations (between the individual
scores and the mean of a distribution) divided by the number of cases in the
population, or by the number of cases minus one in the sample.
b. Provides a squared statistical average of the amount of dispersion in a
distribution of scores. Variance is rarely looked at by itself because, being
squared, it is not on the same scale as the original measure of the variable.
Why have variance? Why not go straight to standard deviation?
1. We need to calculate the variance before finding the standard deviation,
because we square the deviation scores so they will not sum to zero. These
squared deviations produce the variance; we then take the square root to find
the standard deviation.
c. The fundamental piece of the variance formula, the sum of the squared
deviations, is used in a number of other statistics, most notably analysis of
variance (ANOVA).
Standard Deviation
a. Definition: The average deviation between the individual scores in the
distribution and the mean of the distribution.
1. Note that this is not technically correct, particularly for sample data, where the
sum of squared deviations is divided by n – 1, not N. But the sample estimate of
the standard deviation is an estimate of the average (absolute) deviation
between the mean of a distribution and the scores in the distribution.
b. A useful statistic; it provides a handy measure of how spread out the scores in
the distribution are.
c. Combined, the mean and standard deviation provide a pretty good picture
of what the distribution of the scores is like.
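The variability measures above can be sketched with Python's standard statistics module; the data set here is invented purely for illustration:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative distribution, mean = 5

# Range: maximum score minus minimum score
rng = max(scores) - min(scores)          # 9 - 2 = 7

# IQR: 75th percentile (Q3) minus 25th percentile (Q1)
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Variance: population (divide by N) vs. sample (divide by n - 1)
pop_var = statistics.pvariance(scores)   # sum of squared deviations / N
samp_var = statistics.variance(scores)   # sum of squared deviations / (n - 1)

# Standard deviation: square root of the variance
pop_sd = statistics.pstdev(scores)
samp_sd = statistics.stdev(scores)
```

Note how the sample variance (dividing by n – 1) comes out slightly larger than the population variance, exactly as the definitions above predict.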
What is Skewness?
Skewness is an important statistical measure that describes the asymmetry of a frequency distribution, or
more precisely, the lack of symmetry between the left and right tails of the frequency curve. A distribution or dataset is
symmetric if it looks the same to the left and right of the center point.
1. Symmetric Distribution: A perfectly symmetric distribution is one in which the frequency distribution is the same on both sides of the
center point of the frequency curve. In this case, Mean = Median = Mode, and there is no skewness.
2. Asymmetric (Skewed) Distribution: An asymmetrical or skewed distribution is one in which the spread of the frequencies differs on the
two sides of the center point, or the frequency curve is stretched more towards one side. Here the values of Mean, Median, and Mode fall at
different points.

● Positive Skewness: In this, the concentration of frequencies is more towards higher values of the variable i.e. the right tail is
longer than the left tail.
● Negative Skewness: In this, the concentration of frequencies is more towards the lower values of the variable i.e. the left tail is
longer than the right tail.
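One common way to quantify this asymmetry is the moment-based skewness coefficient, m₃ / m₂^(3/2); a small sketch in plain Python, with both data sets invented for illustration:

```python
def skewness(data):
    """Moment-based skewness: third central moment divided by
    the second central moment raised to the 3/2 power."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Right tail longer than the left -> positive skewness
print(skewness([1, 2, 2, 3, 3, 3, 10]) > 0)  # True

# Symmetric around the center -> skewness of zero
print(skewness([1, 2, 3, 4, 5]) == 0)        # True
```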
What is Kurtosis?
It is also a characteristic of the frequency distribution and gives an idea about the shape of a frequency distribution. Basically, the
measure of kurtosis is the extent to which a frequency distribution is peaked in comparison with the normal curve. It is the degree of
peakedness of a distribution.
1. Leptokurtic: A leptokurtic curve has a higher peak than the normal curve. In this curve, there is a heavy
concentration of items near the central value.
2. Mesokurtic: A mesokurtic curve has a peak similar to that of the normal curve. In this curve, items are distributed
around the central value as in a normal distribution.
3. Platykurtic: A platykurtic curve has a lower, flatter peak than the normal curve. In this curve, there is less
concentration of items around the central value.
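Peakedness can be quantified with the moment-based kurtosis m₄ / m₂², usually reported as excess kurtosis (the value minus 3, so a mesokurtic, normal-shaped curve scores about 0); a sketch in plain Python with invented data:

```python
def excess_kurtosis(data):
    """Moment-based kurtosis (m4 / m2^2) minus 3, so that a
    mesokurtic (normal-shaped) distribution scores about 0."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

# Flat, evenly spread scores -> platykurtic (negative excess kurtosis)
print(excess_kurtosis(list(range(1, 11))) < 0)               # True

# Heavy concentration at the center with long tails -> leptokurtic
print(excess_kurtosis([5, 5, 5, 5, 5, 5, 5, 5, 0, 10]) > 0)  # True
```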
Skewness vs. Kurtosis
1. Skewness indicates the shape and size of variation on either side of the central value, while kurtosis indicates the
concentration of frequencies at the central value of the distribution.
2. The measures of skewness tell us about the magnitude and direction of the asymmetry of a distribution, while kurtosis
indicates the concentration of items at the central part of a distribution.
3. Skewness indicates how far the distribution differs from the normal distribution in symmetry, while kurtosis studies the
divergence of the given distribution from the normal distribution in peakedness.
4. The measure of skewness studies the extent to which deviations cluster above or below the average, while kurtosis
indicates the concentration of items around the center.
5. In an asymmetrical distribution, the deviations below and above the average are not equal; no such comparison arises
for kurtosis.
Hypothesis Testing
Hypothesis testing compares two opposite statements about a population and uses
sample data to decide which one is more likely to be correct. To test an assumption, we
first take a sample from the population, analyze it, and use the results of the analysis
to decide whether the claim is valid.
Suppose a company claims that its website gets an average of 50 user visits per day. To
verify this, we use hypothesis testing to analyze past website traffic data and determine whether
the claim is accurate. This helps us decide whether the observed data supports the
company's claim or whether there is a significant difference.
Key Terms of Hypothesis Testing
● Level of significance: The degree of significance at which we accept or reject the null
hypothesis. Since 100% accuracy is not possible when testing a hypothesis, we select a level of
significance. It is normally denoted by α and is generally 0.05 (5%), which means the result
should be 95% likely to hold in each sample.
● P-value: When analyzing data, the p-value tells you the likelihood of seeing a result at least as
extreme as yours if the null hypothesis is true. If the p-value is less than the chosen significance
level, you reject the null hypothesis; otherwise you fail to reject it.
● Test statistic: The number that helps you decide whether your result is significant. It is
calculated from the sample data you collect; for example, it could be used to test whether a
machine learning model performs better than a random guess.
● Critical value: A boundary or threshold that helps you decide whether your test statistic
is extreme enough to reject the null hypothesis.
● Degrees of freedom: Important when we conduct statistical tests; they describe how many
values in the calculation are free to vary.
Types of Hypothesis Testing
It basically involves two types of testing:
1. One-Tailed Test

A one-tailed test is used when we expect a change in only one direction, either an increase or a decrease, but not
both. Say we are analyzing data to see whether a new algorithm improves accuracy; we would only focus on whether the
accuracy goes up, not down.
The test looks at just one side of the distribution to decide whether the result is extreme enough to reject the null
hypothesis. If the test statistic falls in the critical region on that side, we reject the null hypothesis.
2. Two-Tailed Test

A two-tailed test is used when we care about a change in either direction, an increase or a decrease. The critical region
is split between both tails of the distribution, and we reject the null hypothesis if the test statistic falls in either tail.
How does Hypothesis Testing work?
Hypothesis testing involves the following steps:
Step 1: Define Null and Alternative Hypothesis

We start by defining the null hypothesis (H₀) which represents the assumption that there is no difference. The
alternative hypothesis (H₁) suggests there is a difference. These hypotheses should be contradictory to one
another. Imagine we want to test if a new recommendation algorithm increases user engagement.
● Null Hypothesis (H₀): The new algorithm has no effect on user engagement.
● Alternative Hypothesis (H₁): The new algorithm increases user engagement.

Step 2: Choose Significance Level

● Next we choose a significance level (α) commonly set at 0.05. This level defines the threshold for
deciding if the results are statistically significant. It also tells us the probability of making a Type I
error—rejecting a true null hypothesis.
● In this step we also calculate the p-value which is used to assess the evidence against the null
hypothesis.

Step 3: Collect and Analyze Data

● Now we gather data; this could come from user observations or an experiment. Once collected, we analyze the
data using appropriate statistical methods to calculate the test statistic.
● Example: We collect data on user engagement before and after implementing the algorithm. We can also find
the mean engagement scores for each group.

Step 4: Calculate Test Statistic

The test statistic is a measure used to determine whether the sample data supports rejecting the null hypothesis. The
choice of test statistic depends on the type of hypothesis test being conducted; it could be a Z-test, Chi-square test,
T-test, and so on. For our example we use a t-test because:
● We have a smaller sample size.
● The population standard deviation is unknown.
Step 5: Compare Test Statistic

Now we compare the test statistic to either the critical value or the p-value to decide whether to
reject the null hypothesis.
Method A: Using critical values: We refer to a statistical distribution table (the t-distribution
in this case) to find the critical value based on the chosen significance level (α).
● If Test Statistic > Critical Value, we reject the null hypothesis.
● If Test Statistic ≤ Critical Value, we fail to reject the null hypothesis.
Method B: Using the p-value: We compare the p-value to the chosen significance level (α). If the
p-value is less than α, we reject the null hypothesis.

Example: If the p-value is 0.03 and α is 0.05, we reject the null hypothesis because the
p-value is smaller than the significance level.
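The two decision rules in Step 5 can be written as small helper functions; a minimal sketch, with the function names chosen here just for illustration:

```python
def reject_by_critical_value(test_statistic, critical_value):
    # Method A: reject H0 when the test statistic exceeds the critical value
    return test_statistic > critical_value

def reject_by_p_value(p_value, alpha=0.05):
    # Method B: reject H0 when the p-value is below the significance level
    return p_value < alpha

# The example from the text: p-value 0.03 with alpha = 0.05 -> reject H0
print(reject_by_p_value(0.03, alpha=0.05))  # True
```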
Real life Examples of Hypothesis Testing
Let’s understand hypothesis testing using real life situations. Imagine a pharmaceutical company has developed a new
drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to
market they need to conduct a study to see its impact on blood pressure.
Data:
● Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
● After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1: Define the Hypothesis


● Null Hypothesis (H₀): The new drug has no effect on blood pressure.
● Alternate Hypothesis (H₁): The new drug has an effect on blood pressure.
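Using the blood-pressure data above, the remaining steps can be sketched by hand in Python. This assumes a paired (dependent) t-test is the appropriate choice, since the before/after measurements come from the same patients, and takes the two-tailed critical value t(0.025, df = 9) ≈ 2.262 from a standard t-table:

```python
import math

before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

# A paired t-test works on the per-patient differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_diff = sum(diffs) / n

# Sample standard deviation of the differences (divide by n - 1)
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))

# t statistic: mean difference divided by its standard error
t_stat = mean_diff / (sd_diff / math.sqrt(n))

# Two-tailed critical value for alpha = 0.05 with df = n - 1 = 9
critical_value = 2.262

print(round(t_stat, 2))              # 9.0
print(abs(t_stat) > critical_value)  # True -> reject H0
```

The test statistic far exceeds the critical value, so under these assumptions we would reject H₀ and conclude the drug does affect blood pressure.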
Limitations of Hypothesis Testing
Although hypothesis testing is a useful technique, it has some limitations as well:
● Limited Scope: Hypothesis testing focuses on specific questions or assumptions and may not capture the full
complexity of the problem being studied.
● Data Quality Dependence: The accuracy of the results depends on the quality of the data. Poor-quality or
inaccurate data can lead to incorrect conclusions.
● Missed Patterns: By focusing only on testing specific hypotheses, important patterns or relationships in the
data might be missed.
● Context Limitations: It doesn't always consider the bigger picture, which can oversimplify results and lead to
incomplete insights.
● Need for Additional Methods: To get a better understanding of the data, hypothesis testing should be
combined with other analytical methods, such as data visualization or machine learning techniques, which we
will study later.
