Basic Statistical Description of Data
Mean, Median, and Mode are the three most common measures of central tendency. These descriptive statistics
summarize the data with a single value (the central value) that represents the center point of the data.
1. Mean
Mean is the most commonly used measure of central tendency.
Mean is equal to the sum of all the values divided by the total number of values.
Mean is also known as Arithmetic Average.
Mean includes all the values in the data.
Mean is impacted by outlier (extreme) values.
Mean cannot be used for categorical data.
Practice Example
There are 15 students in a preschool and their age in months is given below. Calculate the mean age of the students.
Mean = 37.13
Interpretation: The average age at which parents send their children to preschool is around 37 months.
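As a minimal sketch, the calculation can be checked in Python. The list `ages` is an assumption: it is rebuilt from the age frequency table used in the weighted mean example of this section, and the order of the values is arbitrary.

```python
# Ages in months of the 15 students, rebuilt from the frequency table
# (24 x1, 36 x2, 37 x4, 38 x3, 39 x2, 40 x2, 41 x1). The order is an assumption.
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]

total = sum(ages)        # sum of all the values
count = len(ages)        # total number of values
mean_age = total / count

print(total, count, round(mean_age, 2))  # 557 15 37.13
```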
Histogram: A histogram is a commonly used graphical chart to depict numerical variables. The histogram plot of
the age of the students is shown below:
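A rough text rendering of such a histogram can be sketched in Python. The `ages` list is rebuilt from the frequency table in this section, and the bin width of 5 months (`BIN_WIDTH`) is an assumption chosen only for illustration.

```python
from collections import Counter

# Ages in months, rebuilt from the frequency table in this section (assumed order).
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]
BIN_WIDTH = 5  # assumed bin width in months

# Map each age to the lower edge of its bin, e.g. 37 -> 35.
bin_counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# Print one bar per bin, including empty bins between the smallest and largest.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    count = bin_counts.get(lower, 0)
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")
```

Most of the bars fall in the 35-39 bin, while the single value 24 sits far to the left, which is the kind of shape a histogram makes visible at a glance.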
Practice Example
Let us trim 5% of the observations from each end of the data below. Trimming 5% of 15 values means removing
0.05 * 15 = 0.75 observations, which is rounded up to 1 observation from each extreme. The sorted data is shown below:
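The trimming step can be sketched in Python as follows. The helper `trimmed_mean` is hypothetical, the `ages` list is rebuilt from the frequency table in this section, and rounding the number of trimmed observations up follows the text above (other tools may round down instead).

```python
import math

def trimmed_mean(values, proportion):
    """Mean after dropping ceil(proportion * n) values from each end."""
    ordered = sorted(values)
    k = math.ceil(proportion * len(ordered))  # 0.05 * 15 = 0.75, rounded up to 1
    trimmed = ordered[k:len(ordered) - k]     # drop k smallest and k largest
    return sum(trimmed) / len(trimmed)

# Ages in months, rebuilt from the frequency table in this section (assumed order).
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]

result = trimmed_mean(ages, 0.05)
print(round(result, 2))
```

Here the values 24 and 41 are dropped, so the trimmed mean is computed from the remaining 13 observations and is less affected by the unusually low age of 24.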
Example
The sample data of 15 students could also have been shown as below. To compute the mean age, we weight each age
value by its frequency of occurrence; the mean so computed is the weighted mean.
Age        24  36  37  38  39  40  41
Frequency   1   2   4   3   2   2   1
Weighted Mean = ((24*1) + (36*2) + (37*4) + (38*3) + (39*2) + (40*2) + (41*1)) / (1+2+4+3+2+2+1)
Weighted Mean = 557 / 15
Weighted Mean = 37.13
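The same weighted mean can be reproduced with a short Python sketch; the variable names are illustrative only.

```python
# Age values and their frequencies, taken from the table above.
ages = [24, 36, 37, 38, 39, 40, 41]
frequencies = [1, 2, 4, 3, 2, 2, 1]

# Each age is weighted by how often it occurs.
weighted_sum = sum(age * freq for age, freq in zip(ages, frequencies))
total_weight = sum(frequencies)
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, round(weighted_mean, 2))  # 557 15 37.13
```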
2. Median
Median is the middle value of the data when the observations are sorted (ascending or descending order).
When sorted (ascending or descending), the median splits the data into two equal halves (upper and lower
halves).
The percentile rank of median = 50%
When sorted,
o If the number of observations (n) is odd, then the median is the value of the middle observation at
position (n + 1) / 2.
o If the number of observations (n) is even, then the median is the mean of the two middle-most
values at positions n/2 and (n/2) + 1.
Example
There are 15 students in a preschool and their age in months is given below. Calculate median:
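To find the median, first we sort the values in ascending (or descending) order. Using the ages from the frequency
table given earlier, the sorted values are: 24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41
As n = 15 (n is odd), the median is the value at the 8th position [(15 + 1)/2 = 8].
Median = 38
Interpretation
Roughly half of the students in the preschool are aged 38 months or younger, and the other half are aged 38 months or older.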
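The odd and even rules can be written as a small Python function. This is a sketch: the function name `median_by_position` is hypothetical, and the `ages` list is rebuilt from the frequency table in this section.

```python
def median_by_position(values):
    """Median using the position rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: middle observation at 1-based position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the values at 1-based positions n/2 and (n/2) + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Ages in months, rebuilt from the frequency table in this section (assumed order).
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]
median_age = median_by_position(ages)
print(median_age)  # 38
```

Python's built-in `statistics.median` follows the same rules and can be used instead of a hand-written function.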
3. Mode
The most frequently occurring value in data is called the mode.
We can use mode as the measure of central tendency for both categorical and numerical variables.
The data distribution can have more than one mode.
Example
The age in months of 15 students from a preschool is given in the table below. Compute the mode.
Let’s create a frequency distribution table for the above data.
Value      24  36  37  38  39  40  41
Frequency   1   2   4   3   2   2   1
The value 37 appears the maximum number of times (four times) in the data distribution.
Hence, Mode = 37
Types of Mode
Unimodal: There is only one mode in the data distribution, e.g., x = 1,2,2,3 (mode = 2).
Bimodal: There are two modes in the data distribution, e.g., x = 1,2,2,3,3,4 (mode = [2,3]).
Trimodal: There are three modes in the data distribution, e.g., x = 1,2,2,3,3,4,4,5 (mode = [2,3,4]).
Multimodal: There are more than three modes in the data distribution, e.g., x = 1,2,2,3,3,4,4,5,5,6 (mode =
[2,3,4,5]).
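The mode examples above can be checked with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. The `ages` list is rebuilt from the frequency table in this section, and the `colors` list is hypothetical data added only to show that the mode also works for categorical variables.

```python
from statistics import multimode

# Ages in months, rebuilt from the frequency table in this section (assumed order).
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]
age_modes = multimode(ages)
print(age_modes)  # [37]

# The unimodal, bimodal and trimodal examples from the text.
print(multimode([1, 2, 2, 3]))              # [2]
print(multimode([1, 2, 2, 3, 3, 4]))        # [2, 3]
print(multimode([1, 2, 2, 3, 3, 4, 4, 5]))  # [2, 3, 4]

# Hypothetical categorical data: the mode is the most frequent category.
colors = ["red", "blue", "red", "green"]
color_modes = multimode(colors)
print(color_modes)  # ['red']
```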
4. Range
In statistics, the range is one of the most common measures of dispersion. It is the difference between the largest and
the smallest observation in the data distribution. The range has the same unit as the data variable.
Formula: For the values of X, the range is
Range = max(X) - min(X)
Solution: Sort the values in ascending order. The difference between the Max and Min is the range.
Range = 18 (i.e., the maximum observed dispersion in the data is 18)
5. Quartiles
Quartiles divide the rank-ordered data distribution into four equal parts. The three values that separate the parts are
called the first, second, and third quartiles.
First Quartile (Q1): It is the median of the lower half of the data distribution (25th percentile)
Second Quartile (Q2): It is the median of the entire data distribution (50th percentile)
Third Quartile (Q3): It is the median of the upper half of the data distribution (75th percentile)
Example
We will use the small start-up example having 10 employees as discussed earlier. The monthly salary of the
employees is given in the table below. Find the quartiles and inter-quartile range of the salary.
Emp. No.             1   2   3   4   5   6   7   8   9  10
Monthly Salary (k)  90  80  18  18  17  16  16  16  15  14
Second Quartile
Let us first calculate the second quartile (Median).
Sort the values in ascending order: 14, 15, 16, 16, 16, 17, 18, 18, 80, 90
The number of observations is n = 10 (even), so Q2 is the mean of the (n/2)th observation and the ((n/2) + 1)th observation.
Q2(median) = (1/2) * (5th observation + 6th observation)
Q2 = (16 + 17) / 2
Q2 = 16.5
First Quartile
Now, let’s calculate the first quartile (Q1)
Q2 is the median. It splits the dataset into the upper and lower half of the distribution.
Q1 is the median of the lower half of the sorted distribution (14, 15, 16, 16, 16). The number of observations in the
lower half is 5, an odd number, so Q1 is the value at the 3rd position, (n + 1) / 2.
Q1 = Value at 3rd observation
Q1 = 16
Third Quartile
Q3 is the median of the upper half of the sorted distribution (17, 18, 18, 80, 90). The number of observations in the
upper half is also 5, so Q3 is the value at the 3rd position in the upper half of the data.
Therefore Q3 = 18
6. Interquartile Range
Interquartile Range (IQR) is the range of the middle 50% of the values in the data distribution. It is the difference
between the third quartile (Q3) and the first quartile (Q1).
Formula:
IQR = Q3 – Q1
Interquartile Range (IQR)
The three quartiles that divide the data distribution into four equal parts are:
Q1 = 16; Q2 = 16.5; Q3 = 18;
IQR = Q3 – Q1 = 18 – 16
IQR = 2
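The median-of-halves procedure used in this example can be sketched in Python. The function name `quartiles` is hypothetical, and leaving the middle observation out of both halves when n is odd is an assumption, since the example only covers an even n.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, the middle observation is left out of both halves (an assumption).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Monthly salary (k) of the 10 employees from the table above.
salaries = [90, 80, 18, 18, 17, 16, 16, 16, 15, 14]

q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 16 16.5 18 2
```

Library routines such as `statistics.quantiles` or `numpy.percentile` interpolate between observations by default, so they may return slightly different quartile values for the same data.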
7. Standard Deviation
Standard Deviation is often denoted by the symbol SD or the Greek symbol σ or the Latin letter ‘s’. SD or σ is used
for population standard deviation and ‘s’ is used for sample standard deviation.
Extreme values and outliers will impact the standard deviation.
Standard Deviation can be zero (if all the values in the variable are the same)
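Formula
Population standard deviation: σ = sqrt( Σ (x_i - μ)^2 / N ), where μ is the population mean and N is the population size.
Sample standard deviation: s = sqrt( Σ (x_i - x̄)^2 / (n - 1) ), where x̄ is the sample mean and n is the sample size.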
Let us calculate the standard deviation.
The total number of observations, n = 15. Hence,
8. Variance
Variance is the square of the standard deviation. Being a squared term, it is non-negative.
Standard deviation is generally preferred over variance for interpretation because it is expressed in the same unit as
the data and can therefore be compared directly with the mean.
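A minimal Python sketch of both measures, using the standard library. It assumes the 15 preschool ages rebuilt from the frequency table in this section; whether the sample or the population form is the appropriate one depends on how the data was collected, so both are shown.

```python
from statistics import pstdev, pvariance, stdev, variance

# Ages in months, rebuilt from the frequency table in this section (assumed order).
ages = [24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41]

sample_sd = stdev(ages)         # 's': divides the squared deviations by n - 1
population_sd = pstdev(ages)    # sigma: divides the squared deviations by n
sample_var = variance(ages)     # square of the sample standard deviation
population_var = pvariance(ages)

print(round(sample_sd, 2), round(sample_var, 2))
print(round(population_sd, 2), round(population_var, 2))

# If all the values are the same, the standard deviation is zero.
constant_sd = pstdev([5, 5, 5])
print(constant_sd)  # 0.0
```

The single low value of 24 months contributes most of the squared deviation here, which illustrates how extreme values inflate the standard deviation.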
*) Graphic display of basic statistical description of data:
Plot Type: Bar Plot
Variable Type: Only one categorical variable, or one categorical variable and one continuous measure
Description: A bar plot is a chart that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. Visually represents frequency distribution.
Plot Type: Distribution Plot (Density Plot)
Variable Type: Only one continuous variable
Description: It is a smoothed version of the histogram. Visually shows skewness in data.
Plot Type: Box Plot (Box and Whisker Plot)
Variable Type: Only one continuous variable, or one continuous and one categorical variable
Description: The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. Quickly helps find outliers in data.
Plot Type: Line Plot
Variable Type: One of the dimensions has to be time and the second dimension a continuous variable
Description: A line plot is a type of chart that displays information as a series of data points called ‘markers’ connected by straight line segments. Visually shows trends in time series data.
Plot Type: Pie Chart
Variable Type: One categorical variable associated with a continuous measure
Description: A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportions. Quickly helps compare parts of a whole.
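To make the box plot entry concrete, the five-number summary it is built on can be sketched in Python for the start-up salary data. The 1.5 * IQR fences used to flag outliers are an assumption: they are a common box plot convention, not something stated in this section.

```python
from statistics import median

# Monthly salary (k) of the 10 employees from the quartile example.
salaries = [90, 80, 18, 18, 17, 16, 16, 16, 15, 14]
ordered = sorted(salaries)
half = len(ordered) // 2          # n = 10 is even, so each half has 5 values

q1 = median(ordered[:half])       # median of the lower half
q2 = median(ordered)              # median of all values
q3 = median(ordered[half:])       # median of the upper half
five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < low_fence or x > high_fence]

print(five_number_summary)  # (14, 16, 16.5, 18, 90)
print(outliers)             # [80, 90]
```

Under that convention the two salaries of 80k and 90k would be drawn as individual points beyond the whiskers, which is how a box plot helps find outliers quickly.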