Descriptive Statistics
What are Descriptive Statistics?
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can
represent either an entire population or a sample of it.
Descriptive statistics are broken down into measures of central tendency and measures of
variability (spread).
Measures of central tendency include the mean, median, and mode, while measures of
variability include the standard deviation, variance, and the minimum and maximum values. Kurtosis
and skewness are often reported alongside them to describe the shape of the distribution.
What is a Measure of Central Tendency?
A measure of central tendency is a summary statistic that represents the centre point or typical
value of a dataset.
These measures indicate where most values in a distribution fall and are also referred to as the
central location of a distribution.
Mean (Arithmetic)
The mean is the arithmetic average, and it is probably the measure of central tendency that you
are most familiar with. Calculating the mean is very simple. You just add up all of the values and
divide by the number of observations in your dataset.
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
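The calculation described above (add up the values, divide by the number of observations) can be sketched in R as follows. The vector x holds hypothetical values chosen only for illustration; they are not from the original notes.

```r
# Hypothetical values, chosen only to illustrate the calculation
x <- c(2, 4, 6, 8)

# Mean from the definition: sum of the values divided by the number of observations
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)

print(manual_mean)   # 5
print(builtin_mean)  # 5
```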
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
These are values that are unusual compared to the rest of the data set by being especially small
or large in numerical value.
For example, consider the wages of staff at a factory below:
Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that
this mean value might not be the best way to accurately reflect the typical salary of a worker, as
most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation, we would like to have a better measure of central tendency.
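As a quick check of the figures above, the following R sketch recomputes the mean for the ten salaries and then again without the two large ones. The cutoff of 50 (in thousands) used to separate the large salaries is an arbitrary assumption made only for this illustration.

```r
# Salaries from the table above, in thousands of dollars
salary_k <- c(15, 18, 16, 14, 15, 15, 12, 17, 90, 95)

# Mean of all ten salaries
mean_all <- mean(salary_k)
print(mean_all)       # 30.7

# Mean after dropping the two large salaries
# (the cutoff of 50 is an assumption for illustration only)
typical <- salary_k[salary_k < 50]
mean_typical <- mean(typical)
print(mean_typical)   # 15.25
```

The two outliers roughly double the mean, which is why it no longer describes a typical worker here.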
Median
The median is the middle value. It is the value that splits the dataset in half.
To find the median, order your data from smallest to largest, and then find the data point that
has an equal number of values above it and below it.
The method for locating the median varies slightly depending on whether your dataset has an
even or odd number of values.
• If n is odd, the median is the middle number.
• If n is even, the median is the average of the 2 middle numbers.
Suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56 (the sixth value). It is the middle mark
because there are 5 scores before it and 5 scores after it.
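This works when the data set has an odd number of scores. With an even number of scores there is
no single middle value. Suppose we drop the score of 92, leaving 10 scores:
65 55 89 56 35 14 56 55 87 45
Rearranged in order of magnitude:
14 35 45 55 55 56 56 65 87 89
Now we take the 5th and 6th scores (55 and 56) and average them to get a median of 55.5.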
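Both cases can be reproduced in R. This is a minimal sketch using the scores from the example; the variable names are illustrative, and the manual calculation at the end assumes an even number of values.

```r
# Scores from the example: 11 values (odd) and the same data without 92 (even)
odd_scores  <- c(65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92)
even_scores <- c(65, 55, 89, 56, 35, 14, 56, 55, 87, 45)

# median() sorts internally and handles both cases
median_odd  <- median(odd_scores)
median_even <- median(even_scores)
print(median_odd)    # 56
print(median_even)   # 55.5

# Manual version of the even case: average the two middle values
s <- sort(even_scores)
n <- length(s)
manual_even <- (s[n / 2] + s[n / 2 + 1]) / 2
print(manual_even)   # 55.5
```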
Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds
to the highest bar. You can, therefore, sometimes consider the mode as being the most popular
option.
Normally, the mode is used for categorical data where we wish to know which is the most
common category.
Example:
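x <- c(8, 2, 7, 1, 2, 9, 8, 2, 10, 9, 8)
sort(x)                                            # view the values in order
counts <- table(x)                                 # frequency of each distinct value
as.numeric(names(counts)[counts == max(counts)])   # most frequent value(s): 2 and 8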
Measures of Variability:
• Range
• Interquartile Range
• Variance
• Standard Deviation
What are Measures of Variability?
A measure of variability is a summary statistic that represents the amount of dispersion in a
dataset. How spread out are the values? While a measure of central tendency describes the
typical value, measures of variability define how far away the data points tend to fall from the
centre. We talk about variability in the context of a distribution of values. A low dispersion
indicates that the data points tend to be clustered tightly around the centre. High dispersion
signifies that they tend to fall further away.
Range
The range of a dataset is the difference between the largest and smallest values in that dataset.
For example, if dataset 1 runs from 20 to 38, its range is 38 − 20 = 18, while dataset 2, which
runs from 11 to 52, has a range of 52 − 11 = 41. Dataset 2 has a broader range and, hence, more
variability than dataset 1.
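The range calculation can be sketched in R as below. Only the smallest and largest values (20 and 38, 11 and 52) come from the text; the values in between are hypothetical and chosen only so the vectors have the stated extremes.

```r
# Hypothetical datasets: only the minimum and maximum match the text
dataset1 <- c(20, 24, 27, 31, 35, 38)
dataset2 <- c(11, 19, 26, 33, 45, 52)

# Range = largest value minus smallest value
range1 <- max(dataset1) - min(dataset1)
range2 <- max(dataset2) - min(dataset2)
print(range1)   # 18
print(range2)   # 41

# range() returns c(min, max), so diff(range(x)) gives the same number
range1_alt <- diff(range(dataset1))
```

Because the range depends only on the two most extreme values, it is sensitive to outliers in the same way the mean is.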
Interquartile Range
The interquartile range is the middle half of the data. To visualize it, think about the median
value that splits the dataset in half. Similarly, you can divide the data into quarters. Statisticians
refer to these quarters as quartiles and denote them from low to high as Q1, Q2, Q3, and Q4. The
lowest quartile (Q1) contains the quarter of the dataset with the smallest values. The upper
quartile (Q4) contains the quarter of the dataset with the highest values. The interquartile range
is the middle half of the data that is in between the upper and lower quartiles. In other words,
the interquartile range includes the 50% of data points that fall between Q1 and Q3.
Example:
Consider a dataset representing the salaries of employees in a company:
Salaries (in dollars): 40000, 45000, 50000, 55000, 60000, 70000, 90000, 150000
Step 1: Calculate the quartiles
• Arrange the data in ascending order: 40000, 45000, 50000, 55000, 60000, 70000, 90000, 150000
• Calculate the median (Q2): Q2 = (55000 + 60000) / 2 = 57500
• Split the dataset into two halves. Lower half: 40000, 45000, 50000, 55000. Upper half: 60000,
70000, 90000, 150000
• Calculate the median of the lower half (Q1): Q1 = (45000 + 50000) / 2 = 47500
• Calculate the median of the upper half (Q3): Q3 = (70000 + 90000) / 2 = 80000
Step 2: Calculate the IQR
IQR = Q3 − Q1 = 80000 − 47500 = 32500 dollars
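The steps above can be sketched in R by taking the median of each half directly. The variable names are illustrative. R's built-in IQR() and quantile() interpolate between data points by default, so they may return a different value from this median-of-halves method on the same data.

```r
# Salaries from the example, in dollars
salaries <- c(40000, 45000, 50000, 55000, 60000, 70000, 90000, 150000)

s <- sort(salaries)
n <- length(s)

# Split into lower and upper halves (for odd n this leaves out the middle value)
lower_half <- s[1:(n %/% 2)]
upper_half <- s[(n - n %/% 2 + 1):n]

q1 <- median(lower_half)   # 47500
q2 <- median(s)            # 57500
q3 <- median(upper_half)   # 80000

iqr_manual <- q3 - q1
print(iqr_manual)          # 32500
```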
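Variance
Variance is the average squared difference between the values and the mean.
Population variance
$$ \sigma^2 = \frac{\sum (X - \mu)^2}{N} $$
In the equation, σ² is the population parameter for the variance, μ is the parameter for the
population mean, and N is the number of data points, which should include the entire
population.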
Sample variance
To use a sample to estimate the variance for a population, use the following formula.
$$ s^2 = \frac{\sum (X - M)^2}{N - 1} $$
In the equation, s² is the sample variance, M is the sample mean, and N is the number of
observations in the sample. N − 1 in the denominator corrects for the tendency of a sample to
underestimate the population variance.
Example of calculating the sample variance
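A minimal R sketch of the calculation, step by step. The data below are hypothetical and chosen only so the arithmetic is easy to follow; they are not the values from the original example, so the result is not the 201 quoted later.

```r
# Hypothetical sample (not the data from the original example)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
m <- mean(x)                  # sample mean M = 5

# Squared differences from the mean, then their sum
squared_diffs <- (x - m)^2
sum_sq <- sum(squared_diffs)  # 32

# Sample variance divides by N - 1; population variance divides by N
sample_var <- sum_sq / (n - 1)
pop_var    <- sum_sq / n      # 4

print(sample_var)
print(pop_var)

# R's var() uses the N - 1 denominator, so it matches the sample variance
builtin_var <- var(x)
```

Dividing by N − 1 rather than N always gives the larger of the two values, which is the correction described above.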
Standard Deviation
The standard deviation is the standard or typical difference between each data point and the
mean. When the values in a dataset are grouped closer together, you have a smaller standard
deviation. On the other hand, when the values are spread out more, the standard deviation is
larger because the standard distance is greater.
The standard deviation is just the square root of the variance:
$$ s = \sqrt{s^2} $$
In the variance section, we calculated a variance of 201 in the table, so the standard deviation
is √201 ≈ 14.18.
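The square-root relationship can be checked in R. The first part uses the variance of 201 quoted in the text. The second part uses a hypothetical vector, not data from the notes, to show that sd() equals the square root of var().

```r
# Variance quoted in the text
variance_from_text <- 201
sd_from_text <- sqrt(variance_from_text)
print(round(sd_from_text, 2))   # 14.18

# Hypothetical sample, used only to compare sd() with sqrt(var())
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd_builtin  <- sd(x)
sd_from_var <- sqrt(var(x))
print(isTRUE(all.equal(sd_builtin, sd_from_var)))   # TRUE
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to interpret.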