Basic Statistical Description of Data

The document introduces three common measures of central tendency: mean, median, and mode. It provides definitions and examples to explain how to calculate each measure. The mean is the average value and can be impacted by outliers. The median is the middle value when data is sorted. The mode is the most frequent value. It also introduces several measures of dispersion, including range, quartiles, interquartile range, standard deviation, and variance. These measures help describe how spread out the values in a data set are from the central tendency.

Uploaded by

Tarika Saij
Copyright
© All Rights Reserved

Introduction to Measuring the Central Tendency:

Mean, median, and mode are the three most common measures of central tendency. Each is a descriptive statistic that
summarizes the data with a single value (the central value) representing the center point of the data.

1. Mean
 Mean is the most commonly used measure of central tendency.
 Mean is equal to the sum of all the values divided by the total number of values.
 Mean is also known as Arithmetic Average.
 Mean includes all the values in the data.
 Mean is impacted by outlier (extreme) values.
 Mean cannot be used for categorical data.

Practice Example
There are 15 students in a preschool, and their ages in months are given below. Calculate the mean age of the students.

Mean = (24+37+38+38+36+39+40+37+38+41+40+36+37+37+39) / 15

Mean = 557 / 15 = 37.13

Interpretation: The average age of the students in this preschool is around 37 months.
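The calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the variable names `ages` and `mean_age` are chosen here for illustration and do not come from the original notes.

```python
from statistics import mean

# Ages in months of the 15 preschool students, as listed in the example.
ages = [24, 37, 38, 38, 36, 39, 40, 37, 38, 41, 40, 36, 37, 37, 39]

# Mean = sum of all values / total number of values.
mean_age = sum(ages) / len(ages)
print(round(mean_age, 2))  # 37.13

# The standard-library helper gives the same result.
assert abs(mean(ages) - mean_age) < 1e-9
```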
Histogram: A histogram is a commonly used chart for depicting a numerical variable. The histogram of
the ages of the students is shown below:

From the histogram, we can observe that:

 Most of the data points are distributed around the mean age (37).
 The age value 24 is a potential outlier. (Outliers are values that are extreme and lie far away from the
central tendency.)
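A histogram groups numerical values into bins and counts how many values fall in each bin. The sketch below prints a rough text histogram of the ages. The bin width of 5 months is an assumption made only for this illustration; the original plot may use different bins.

```python
from collections import Counter

ages = [24, 37, 38, 38, 36, 39, 40, 37, 38, 41, 40, 36, 37, 37, 39]
bin_width = 5  # assumed bin width, chosen only for illustration

# Map each age to the lower edge of its bin, e.g. 37 -> 35.
bins = Counter((age // bin_width) * bin_width for age in ages)

# Walk over every bin between the smallest and largest one so that
# empty bins (the gap between 24 and the other ages) stay visible.
for lower in range(min(bins), max(bins) + bin_width, bin_width):
    upper = lower + bin_width - 1
    print(f"{lower}-{upper}: {'*' * bins[lower]}")
```

The single star in the 20-24 bin, separated from the rest by two empty bins, is the potential outlier noted above.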

1.1 Truncated Mean or Trimmed Mean


 Truncated mean is a mean obtained after trimming off values at the high and low extremes.
 Example: In a 5% trimmed mean, the mean is computed after removing the highest 5% and the lowest 5% of the
values from the data sample.

Practice Example
Let us remove the highest 5% and the lowest 5% of the values from the data above. Trimming 5% of 15 values means
removing 0.75 observations, which is rounded up to 1 observation from each extreme. The sorted data is shown below:

24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41

The values 24 and 41 at the two extremes are trimmed.

Trimmed Mean = (36+36+37+37+37+37+38+38+38+39+39+40+40) / 13

Trimmed Mean = 492 / 13 = 37.85
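The trimming procedure can be sketched as a small function. The helper name `trimmed_mean` is hypothetical, and rounding the number of trimmed values up (0.75 becomes 1) follows this example's convention; other tools may round down instead.

```python
import math

def trimmed_mean(values, proportion):
    """Mean after dropping ceil(proportion * n) values from each end."""
    data = sorted(values)
    # 0.05 * 15 = 0.75, rounded up to 1 observation per side (as in the example).
    k = math.ceil(proportion * len(data))
    if 2 * k >= len(data):
        raise ValueError("trimming would remove every value")
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

ages = [24, 37, 38, 38, 36, 39, 40, 37, 38, 41, 40, 36, 37, 37, 39]
print(round(trimmed_mean(ages, 0.05), 2))  # 37.85
```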

1.2 Weighted Mean


 In a simple mean, we give equal weight to each value.
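 However, in some cases certain observations must be given more weight than others when computing the
mean. The result is called the weighted mean.
 The weighted mean is calculated as:

Weighted Mean = (w1*x1 + w2*x2 + ... + wn*xn) / (w1 + w2 + ... + wn)

where x1, ..., xn are the values and w1, ..., wn are their weights.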

Example
The sample data of the 15 students can also be presented as a frequency table, as shown below. To compute the mean
age, we weight each age value by its frequency of occurrence; the mean computed this way is the weighted mean.

Age (months)   24  36  37  38  39  40  41
Frequency       1   2   4   3   2   2   1

Weighted Mean = ((24*1) + (36*2) + (37*4) + (38*3) + (39*2) + (40*2) + (41*1)) / (1+2+4+3+2+2+1)
Weighted Mean = 557 / 15 = 37.13
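The same weighted mean can be computed from the frequency table directly. In this sketch the dictionary `freq` (age mapped to frequency) is an illustrative representation of the table, not something defined in the notes.

```python
# Frequency table from the example: age in months -> number of students.
freq = {24: 1, 36: 2, 37: 4, 38: 3, 39: 2, 40: 2, 41: 1}

# Numerator: each value multiplied by its weight; denominator: sum of weights.
weighted_sum = sum(age * weight for age, weight in freq.items())
total_weight = sum(freq.values())

weighted_mean = weighted_sum / total_weight
print(round(weighted_mean, 2))  # 37.13
```

Because the weights are the frequencies, the result equals the simple mean of the 15 individual ages.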

2. Median
 Median is the middle value of the data when the observations are sorted (ascending or descending order)
 When sorted (ascending or descending), the median splits the data into two halves equally (upper and lower
halves).
 The percentile rank of median = 50%
 When sorted,
o If the number of observations (n) is odd, then the median is the value of the middle observation at
position (n + 1) / 2.
o If the number of observations (n) is even, then the median is the mean of the two middle-most
values, at positions n/2 and (n/2) + 1.

Example

There are 15 students in a preschool, and their ages in months are given below. Calculate the median:

 To find the median, first sort the values in ascending (or descending) order:
24, 36, 36, 37, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41
 As n = 15 (n is odd), the median is the value at the 8th position [(15 + 1)/2 = 8].
 The value at the 8th position is 38, therefore Median = 38

Interpretation
Roughly half of the students in the preschool are aged 38 months or younger, and roughly half are aged 38 months or older.
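The odd and even cases of the median rule can be written out as follows. This is a sketch; the function name `median_by_position` is hypothetical, and Python's built-in `statistics.median` applies the same rule.

```python
def median_by_position(values):
    """Median using the position rule for odd and even n."""
    if not values:
        raise ValueError("median of empty data is undefined")
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return data[mid]
    # Even n: mean of the values at 1-based positions n/2 and n/2 + 1.
    return (data[mid - 1] + data[mid]) / 2

ages = [24, 37, 38, 38, 36, 39, 40, 37, 38, 41, 40, 36, 37, 37, 39]
print(median_by_position(ages))  # 38
```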

3. Mode
 The most frequently occurring value in data is called the mode.
 We can use mode as the measure of central tendency for both categorical and numerical variables.
 The data distribution can have more than one mode.
Example

The ages in months of the 15 students from a preschool are given below. Compute the mode.
Let's create a frequency distribution table for the data.

Value       24  36  37  38  39  40  41
Frequency    1   2   4   3   2   2   1

 The value 37 appears the greatest number of times (four times) in the data distribution.
 Hence, Mode = 37
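The frequency distribution and the mode can be obtained with `collections.Counter` from the Python standard library. This is a minimal sketch using the 15 ages from the earlier examples.

```python
from collections import Counter

ages = [24, 37, 38, 38, 36, 39, 40, 37, 38, 41, 40, 36, 37, 37, 39]

# Build the frequency distribution: value -> number of occurrences.
freq = Counter(ages)
for value in sorted(freq):
    print(value, freq[value])

# The mode is the value with the highest frequency.
mode_value, mode_count = freq.most_common(1)[0]
print(mode_value, mode_count)  # 37 4
```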

Types of Mode

 Unimodal: There is only one mode in the data distribution, e.g., x = 1,2,2,3 (mode = 2).
 Bimodal: There are two modes in the data distribution, e.g., x = 1,2,2,3,3,4 (mode = [2,3]).
 Trimodal: There are three modes in the data distribution, e.g., x = 1,2,2,3,3,4,4,5 (mode = [2,3,4]).
 Multimodal: There are more than three modes in the data distribution, e.g., x = 1,2,2,3,3,4,4,5,5,6 (mode =
[2,3,4,5]).
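When a distribution can have several modes, `statistics.multimode` (available in Python 3.8 and later) returns all of them. The sketch below runs it on the four example samples; the dictionary name `samples` is illustrative only.

```python
from statistics import multimode

# The four example samples from the list above.
samples = {
    "unimodal": [1, 2, 2, 3],
    "bimodal": [1, 2, 2, 3, 3, 4],
    "trimodal": [1, 2, 2, 3, 3, 4, 4, 5],
    "multimodal": [1, 2, 2, 3, 3, 4, 4, 5, 5, 6],
}

# multimode returns every value that shares the highest frequency.
modes = {label: multimode(x) for label, x in samples.items()}
for label, found in modes.items():
    print(label, found)
```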

Measuring the dispersion of data:


4. Range

In statistics, the range is one of the most common measures of dispersion. It is the difference between the largest and
the smallest observation in the data distribution. The range has the same unit as the data variable.
Formula: For the values of X, the range is

Range = Largest Value of X – Smallest Value of X

Example: The sample age data (in years) of the 15 students of the Data Science Executive Course is given below.
Calculate the range of the age of the students.

Solution: Sort the values in ascending order: 22, 23, 24, 25, 26, 27, 27, 28, 28, 29, 30, 31, 32, 35, 40. The
difference between the maximum and the minimum is the range.

Range = 40 – 22 = 18 (i.e., the maximum observed dispersion in the data is 18 years)
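A short sketch of the range calculation follows. It assumes the age sample is the one listed in the standard deviation example of this document (15 values between 22 and 40 years), which is consistent with the stated range of 18.

```python
# Ages in years of the 15 course students (assumed to be the same sample
# that the standard deviation example lists).
ages_years = [22, 23, 25, 27, 28, 35, 32, 28, 30, 40, 24, 26, 27, 29, 31]

# Range = largest value - smallest value; same unit as the data (years).
data_range = max(ages_years) - min(ages_years)
print(data_range)  # 18
```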

5. Quartiles

Quartiles divide the rank-ordered data distribution into four equal parts. The three values that separate the parts are
called the first, second, and third quartiles.
 First Quartile (Q1): It is the median of the lower half of the data distribution (25th percentile)
 Second Quartile (Q2): It is the median of the entire data distribution (50th percentile)
 Third Quartile (Q3): It is the median of the upper half of the data distribution (75th percentile)

Example
We will use the example of a small start-up with 10 employees, as discussed earlier. The monthly salary of the
employees is given in the table below. Find the quartiles and the interquartile range of the salary.

Emp. No.             1   2   3   4   5   6   7   8   9  10
Monthly Salary (k)  90  80  18  18  17  16  16  16  15  14

Second Quartile
Let us first calculate the second quartile (median).
 Sort the values in ascending order: 14, 15, 16, 16, 16, 17, 18, 18, 80, 90
 The number of observations is n = 10 (even), therefore Q2 is the mean of the (n/2)th observation and the ((n/2) + 1)th observation.
 Q2 (median) = (1/2) * (5th observation + 6th observation)
Q2 = (16 + 17) / 2
Q2 = 16.5

First Quartile
Now, let's calculate the first quartile (Q1).
 Q2 is the median. It splits the sorted dataset into the lower and upper halves of the distribution.
 Q1 is the median of the lower half of the distribution (14, 15, 16, 16, 16). The number of observations is 5, which is
an odd number. As such, Q1 is the value at the 3rd position, (n + 1) / 2.
 Q1 = Value of the 3rd observation
Q1 = 16

Third Quartile
 Q3 is the median of the upper half of the distribution (17, 18, 18, 80, 90). The number of observations in the upper
half is also 5. As such, Q3 is the value at the 3rd position in the upper half of the data.
 Therefore Q3 = 18
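The median-of-halves procedure used above can be sketched as a small function. The name `quartiles_by_halves` is hypothetical. For an odd number of observations the sketch leaves the middle value out of both halves; that is an assumption, since the example here only covers an even n.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    # Assumption: for odd n, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

salaries = [90, 80, 18, 18, 17, 16, 16, 16, 15, 14]  # monthly salary (k)
q1, q2, q3 = quartiles_by_halves(salaries)
print(q1, q2, q3)  # 16 16.5 18
```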

6. Interquartile Range
Interquartile Range (IQR) is the range of the middle 50% of the values in the data distribution. It is the difference
between the third quartile (Q3) and the first quartile (Q1).
Formula:

IQR = Q3 – Q1

Example
 The three quartiles that divide the data distribution into four equal parts are:
 Q1 = 16; Q2 = 16.5; Q3 = 18
 IQR = Q3 – Q1 = 18 – 16
 IQR = 2
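The IQR can also be computed with `statistics.quantiles` (Python 3.8 and later). For this salary sample, the `"inclusive"` method reproduces the quartiles of the worked example. That agreement is specific to this data; the default `"exclusive"` method uses a different interpolation rule and returns different quartiles here, so treat the choice of method as an assumption.

```python
from statistics import quantiles

salaries = [90, 80, 18, 18, 17, 16, 16, 16, 15, 14]  # monthly salary (k)

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 16.0 18.0 2.0

# For comparison, the range is dominated by the two extreme salaries.
salary_range = max(salaries) - min(salaries)
print(salary_range)  # 76
```

The IQR of 2 describes the spread of the middle 50% of salaries and is unaffected by the two extreme values (80 and 90), whereas the range of 76 is driven entirely by them.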

7. Standard Deviation
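Standard deviation is often denoted by SD, the Greek letter σ, or the Latin letter 's'. SD or σ is used for the
population standard deviation, and 's' is used for the sample standard deviation.
 Extreme values and outliers will impact the standard deviation.
 Standard deviation can be zero (if all the values in the variable are the same).

Formula

Population standard deviation: σ = sqrt( Σ (xi – μ)^2 / N )
Sample standard deviation: s = sqrt( Σ (xi – x̄)^2 / (n – 1) )

where μ is the population mean, x̄ is the sample mean, and N and n are the number of observations.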

Standard Deviation – Example

We will now take another example to explain the calculation of the standard deviation.
Calculate the standard deviation for the sample age data of the 15 students from Data Science Certification Program
as given below.
Calculations
STEP – 1: Calculate the Mean
 The mean age of the 15 Data Science Executive Course students is,
 Mean (Age) = (22 + 23 + 25 + 27 + 28 + 35 + 32 + 28 + 30 + 40 + 24 + 26 + 27 + 29 + 31) / 15 = 427 / 15
 Mean (Age) = 28.47

STEP – 2: Calculate the Standard Deviation
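 Let X be the age of the sample of 15 Data Science Executive Course students. Then the standard deviation of sample X is,

s = sqrt( Σ (xi – x̄)^2 / (n – 1) )

 Let us calculate the standard deviation.
 The total number of observations is n = 15, and the sum of squared deviations from the mean is 311.73. Hence,

s = sqrt(311.73 / 14) = sqrt(22.27) ≈ 4.72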
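The two steps can be checked with a short script. This sketch computes the sample standard deviation (dividing by n – 1), on the assumption that the 15 ages are treated as a sample, and shows the population version (dividing by n) for comparison. The variable names are illustrative.

```python
import math
from statistics import pstdev, stdev

ages_years = [22, 23, 25, 27, 28, 35, 32, 28, 30, 40, 24, 26, 27, 29, 31]

# STEP 1: mean of the sample.
n = len(ages_years)
mean_age = sum(ages_years) / n

# STEP 2: sum of squared deviations from the mean, then the square root.
squared_devs = sum((x - mean_age) ** 2 for x in ages_years)
s = math.sqrt(squared_devs / (n - 1))  # sample standard deviation
sigma = math.sqrt(squared_devs / n)    # population standard deviation

print(round(mean_age, 2), round(s, 2), round(sigma, 2))  # 28.47 4.72 4.56

# The standard-library functions agree with the manual calculation.
assert abs(s - stdev(ages_years)) < 1e-9
assert abs(sigma - pstdev(ages_years)) < 1e-9
```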

8. Variance
Variance is the square of the standard deviation. Being a squared term, it is non-negative, and it is expressed in
squared units of the data.
Standard deviation is usually preferred over variance because it is in the same unit as the data and can therefore be
compared directly with the mean.
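The relationship between variance and standard deviation can be verified numerically. This sketch uses the sample versions from Python's `statistics` module and the 15 course ages from the standard deviation example.

```python
from statistics import stdev, variance

ages_years = [22, 23, 25, 27, 28, 35, 32, 28, 30, 40, 24, 26, 27, 29, 31]

var_s = variance(ages_years)  # sample variance, in years squared
sd_s = stdev(ages_years)      # sample standard deviation, in years

print(round(var_s, 2), round(sd_s, 2))  # 22.27 4.72

# Variance is the square of the standard deviation.
assert abs(sd_s ** 2 - var_s) < 1e-9
```

The standard deviation of about 4.72 years can be read directly against the mean age of 28.47 years, while the variance of 22.27 is in years squared.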
Graphic display of basic statistical description of data:

Plot Type: Bar Plot
Variable Type: Only one categorical variable, or one categorical variable and one continuous measure.
Description: A bar plot is a chart that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. Visually represents frequency distribution.

Plot Type: Stacked Bar Plot
Variable Type: Two or more categorical variables.
Description: A stacked bar chart, also known as a stacked bar graph, is a graph that is used to break down a category by another category and compare parts of a whole. Each bar in the chart represents one category as a whole, and segments in the bar represent different parts or categories of that whole. Visually represents cross-tabulation data.

Plot Type: Histogram
Variable Type: Only one continuous variable.
Description: A histogram is an approximate representation of the distribution of numerical data. It is created by converting a continuous variable into a categorical one by binning/bucketing it.

Plot Type: Distribution Plot (Density Plot)
Variable Type: Only one continuous variable.
Description: It is a smoothed version of the histogram. Visually shows skewness in data.

Plot Type: Box Plot (Box and Whisker Plot)
Variable Type: Only one continuous variable, or one continuous and one categorical variable.
Description: The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. Quickly helps find outliers in data.

Plot Type: Line Plot
Variable Type: One of the dimensions has to be time and the second dimension a continuous variable.
Description: A line plot is a type of chart that displays information as a series of data points called 'markers' connected by straight line segments. Visually shows trends in time series data.

Plot Type: Scatter Plot
Variable Type: Two continuous variables.
Description: A graph in which the values of two variables are plotted along two axes. The pattern of the resulting points on the plot visually depicts the existence of correlation between the two variables. Quickly helps find correlation.

Plot Type: Pie Chart
Variable Type: One categorical variable associated with a continuous measure.
Description: A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportions. Quickly helps compare parts of a whole.
