Lesson 3 4
Sample size is a research term for the number of individuals included in a research study to
represent a population.
Determining the appropriate sample size is one of the most important factors in statistical analysis. If the sample
size is too small, it will not yield valid results or adequately represent the realities of the population being
studied. On the other hand, while larger sample sizes yield smaller margins of error and are more representative,
a sample size that is too large may significantly increase the cost and time taken to conduct the research.
When selecting a sample, multiple factors can affect the reliability and validity of the results. The two
measures most closely tied to sample size are the confidence interval and the confidence level.
Confidence Level
The confidence level is the probability, expressed as a percentage, that the confidence interval would
contain the true population parameter if you drew a random sample many times. For a survey, it represents
how often the true percentage of the population who would pick an answer lies within the confidence
interval. For example, a 99% confidence level means that if you repeated an experiment or survey over and
over again, about 99 percent of the resulting confidence intervals would contain the true population value.
The larger your sample size, the more confident you can be that the respondents' answers truly reflect the
population. In other words, for a given confidence level, the larger your sample, the narrower your
confidence interval.
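The link between sample size and the width of the confidence interval can be illustrated with a short sketch. It assumes the common normal-approximation formula for the margin of error of a sample proportion, z * sqrt(p(1 - p) / n), which these notes do not state; the function name margin_of_error and the default p = 0.5 are illustrative choices only.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, confidence=0.95, p=0.5):
    """Approximate margin of error for a sample proportion.

    Assumes the normal approximation; p=0.5 is the most conservative choice.
    """
    # Two-sided critical value (about 1.96 for a 95% confidence level).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# Quadrupling the sample size roughly halves the margin of error.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(n), 3))
```

Under these assumptions, a higher confidence level at the same sample size gives a wider interval, while a larger sample at the same confidence level gives a narrower one.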
Lesson 4: Measures of Central Tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the
central position within that set of data.
Mean
- The mean (or average) is the most popular and well known measure of central tendency. It can be used
with both discrete and continuous data, although its use is most often with continuous data.
- The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.
Consider the retirement ages of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623) and
dividing by the number of observations (11) which equals 56.6 years.
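The same calculation can be written as a short Python sketch. The variable names are illustrative; the data are the retirement ages listed above.

```python
# Retirement ages of the 11 people, in whole years.
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

total = sum(ages)         # sum of all observations
count = len(ages)         # number of observations
mean_age = total / count  # arithmetic mean

print(total, count, round(mean_age, 1))  # 623 11 56.6
```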
The mean is sensitive to unusually large or small values. Consider the salaries of ten staff:
Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value
might not be the best way to reflect the typical salary of a worker, as most workers have salaries in
the $12k to $18k range. The mean is being skewed by the two large salaries. In this situation, the median
would be a better measure of central tendency.
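A brief sketch makes the effect of the two large salaries concrete. It uses Python's standard statistics module; the variable names are illustrative and the salaries are in thousands of dollars.

```python
from statistics import mean, median

# Salaries of the ten staff, in thousands of dollars.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)      # pulled upward by the 90k and 95k salaries
median_salary = median(salaries_k)  # middle of the ordered salaries

print(mean_salary, median_salary)  # 30.7 15.5
```

The median of 15.5k sits inside the $12k to $18k range where most salaries fall, whereas the mean does not.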
Median
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is
less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56. It is the middle mark because there
are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what
happens when you have an even number of scores? What if you had only 10 scores? You simply take the
middle two scores and average them. Consider the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange the data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Now we take the 5th and 6th scores in the data set (55 and 56) and average them to get a median of 55.5.
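The odd and even cases can be sketched in one small function. The name median_of is hypothetical and chosen only for this illustration; Python's built-in statistics.median behaves the same way.

```python
def median_of(scores):
    """Return the median of a non-empty list of numbers."""
    if not scores:
        raise ValueError("median_of requires at least one score")
    ordered = sorted(scores)  # arrange in order of magnitude, smallest first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle score exists.
        return ordered[mid]
    # Even count: average the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median_of(odd_scores))   # 56
print(median_of(even_scores))  # 55.5
```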
Mode
The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
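The table below shows a simple frequency distribution of the retirement age data.
Age        54  55  56  57  58  60
Frequency  3   1   1   2   2   2
The most commonly occurring value is 54, so the mode is 54 years.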
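A frequency count for the retirement age data, and the mode that follows from it, can be sketched with collections.Counter from the Python standard library. The variable names are illustrative.

```python
from collections import Counter

# Retirement ages of the 11 people, in whole years.
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

frequency = Counter(ages)

# Print each age with the number of times it occurs.
for age in sorted(frequency):
    print(age, frequency[age])

# The mode is the value with the highest frequency.
mode_age, mode_count = frequency.most_common(1)[0]
print("mode:", mode_age, "occurs", mode_count, "times")
```

For this dataset, 54 occurs three times and no other age occurs more than twice, so the mode is 54 years.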
Skewed distributions
When a distribution is skewed, the mode remains the most commonly occurring value and the median remains the
middle value in the distribution, but the mean is generally ‘pulled’ in the direction of the longer tail. In a skewed
distribution, the median is often a preferred measure of central tendency, as the mean is not usually in the
middle of the distribution.
A distribution is said to be positively or right skewed when the tail on the right side of the distribution is longer
than the left side. In a positively skewed distribution it is common for the mean to be ‘pulled’ toward the right
tail of the distribution. Although there are exceptions to this rule, generally, most of the values, including the
median value, tend to be less than the mean value.
The following graph shows a larger retirement age data set with a distribution which is right skewed. The data
has been grouped into classes, as the variable being measured (retirement age) is continuous. The mode is 54
years, the modal class is 54-56 years, the median is 56 years, and the mean is 57.2 years.
Retirement age: Positive (right) skew
A distribution is said to be negatively or left skewed when the tail on the left side of the distribution is longer
than the right side. In a negatively skewed distribution, it is common for the mean to be ‘pulled’ toward the left
tail of the distribution. Although there are exceptions to this rule, generally, most of the values, including the
median value, tend to be greater than the mean value.
The following graph shows a larger retirement age dataset with a distribution which is left skewed. The mode is
65 years, the modal class is 63-65 years, the median is 63 years, and the mean is 61.8 years.
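The pull of the mean toward the longer tail can be checked by comparing the mean with the median. The sketch below is a rule of thumb only, with the same exceptions noted above. The function name skew_hint and both small datasets are hypothetical, made up for illustration; they are not the larger retirement age datasets described in the text.

```python
from statistics import mean, median


def skew_hint(values):
    """Suggest a skew direction by comparing mean and median (rule of thumb)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"  # mean pulled toward a longer right tail
    if m < med:
        return "left"   # mean pulled toward a longer left tail
    return "none"       # mean equals median; no direction suggested


# Hypothetical ages with a few unusually high values (long right tail).
right_tailed = [54, 54, 55, 55, 56, 57, 58, 62, 70]
# Hypothetical ages with a few unusually low values (long left tail).
left_tailed = [50, 58, 62, 63, 64, 65, 65, 66, 66]

print(skew_hint(right_tailed))  # right
print(skew_hint(left_tailed))   # left
```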