Exploring Numerical Data - Students
Exploring numerical data
Central Tendency and Dispersion for
Numerical data
Lecture objectives
• Central tendency
• Measures of Dispersion
Measures of Central Tendency
• Measures of central tendency yield information about “particular places or
locations in a group of numbers.”
• A single number to describe the characteristics of a set of data
Summary statistics
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set
Population Mean
$$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$
$$\mu = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$$
Sample Mean
$$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$$
$$\bar{X} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$$
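As a quick check of the two mean calculations, here is a minimal Python sketch. The helper name `arithmetic_mean` is illustrative and not part of the slides; the data are the population and sample values from the slides.

```python
# Arithmetic mean: sum all values and divide by the number of values.
population = [24, 13, 19, 26, 11]      # population example from the slides
sample = [57, 86, 42, 38, 90, 66]      # sample example from the slides


def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    return sum(values) / len(values)


population_mean = arithmetic_mean(population)
sample_mean = arithmetic_mean(sample)

print(population_mean)
print(round(sample_mean, 3))
```

The same function serves both cases because the population and sample means differ only in notation (N versus n), not in how they are computed.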
Median
• Middle value in an ordered array of numbers
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.
Median: Example with an Odd Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.
Median: Example with an Even Number of Terms
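Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
• There are 16 terms in the ordered array.
• Position of median = (n+1)/2 = (16+1)/2 = 8.5
• The median is between the 8th and 9th terms: (14 + 15)/2 = 14.5.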
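The (n+1)/2 position rule can be sketched in Python as follows. The function name `median_by_position` is illustrative; the arrays are the ones used in the odd and even examples, and the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even_array = odd_array[:-1]  # the same array without 22, leaving 16 terms


def median_by_position(values):
    """Locate the median at position (n + 1) / 2 of the ordered array."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2          # 1-based position
    lower = ordered[int(position) - 1]
    if position.is_integer():
        return lower                            # odd n: the middle term
    upper = ordered[int(position)]
    return (lower + upper) / 2                  # even n: average the middle two


odd_median = median_by_position(odd_array)
even_median = median_by_position(even_array)

# Replacing an extreme value does not move the median.
with_extreme = median_by_position(odd_array[:-1] + [100])

print(odd_median, even_median, with_extreme)
print(median(odd_array), median(even_array))
```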
Mode -- Example
• The mode is 44
• There are more 44s
Data:
35 41 44 45
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
Percentiles
• Measures of central tendency that divide a group of data into 100 parts
• Example: 90th percentile indicates that at most 90% of the data lie
below it, and at least 10% of the data lie above it
• The median and the 50th percentile have the same value
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
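• Calculate the pth percentile location:
$$i = \frac{P}{100}(n)$$
• Determine the percentile’s location and its value.
• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is the value at the whole-number part of (i+1)
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile:
$$i = \frac{30}{100}(8) = 2.4$$
• Since i is not a whole number, the location is the whole-number part of i + 1 = 3.4, i.e. the 3rd position
• The 30th percentile is the 3rd value in the ordered array, 13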
Dispersion
Variability
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation
Range – ungrouped data
40 43 45 48
Quartiles
Q1 Q2 Q3
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
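• Q1:
$$i = \frac{25}{100}(8) = 2, \qquad Q_1 = \frac{109 + 114}{2} = 111.5$$
• Q2:
$$i = \frac{50}{100}(8) = 4, \qquad Q_2 = \frac{116 + 121}{2} = 118.5$$
• Q3:
$$i = \frac{75}{100}(8) = 6, \qquad Q_3 = \frac{122 + 125}{2} = 123.5$$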
Interquartile Range
$$\text{Interquartile Range} = Q_3 - Q_1$$
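The percentile location rule, the quartile example, and the interquartile range can be reproduced with a short Python sketch. The function name `percentile` is illustrative. For a location i that is not a whole number, the sketch assumes the common textbook convention of taking the value at the whole-number part of i + 1. Other tools, including Python's `statistics.quantiles`, interpolate differently and can give different answers.

```python
def percentile(values, p):
    """p-th percentile (integer 0 < p < 100) using the location i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises when testing for a whole i.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is whole: average the values at positions i and i + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole (assumed convention): value at the whole-number part of i + 1.
    return ordered[whole]


raw_data = [14, 12, 19, 23, 5, 13, 28, 17]
p30 = percentile(raw_data, 30)

quartile_data = [106, 109, 114, 116, 121, 122, 125, 129]
q1 = percentile(quartile_data, 25)
q2 = percentile(quartile_data, 50)
q3 = percentile(quartile_data, 75)
iqr = q3 - q1

print(p30, q1, q2, q3, iqr)
```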
Deviation from the Mean
• Data set: 5, 9, 16, 17, 18
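• Mean:
$$\mu = \frac{\sum X}{N} = \frac{65}{5} = 13$$
• Deviations from the mean: -8, -4, 3, 4, 5
[Figure: number line from 0 to 20 marking the deviations -8, -4, +3, +4, and +5 from the mean]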
Mean Absolute Deviation
• Average of the absolute deviations from the mean
X      X − μ    |X − μ|
5      -8       +8
9      -4       +4
16     +3       +3
17     +4       +4
18     +5       +5
Total   0       24
$$M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8$$
Population Variance
• Average of the squared deviations from the arithmetic mean
X      X − μ    (X − μ)²
5      -8       64
9      -4       16
16     +3       9
17     +4       16
18     +5       25
Total   0       130
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
Population Standard Deviation
• Square root of the variance
X      X − μ    (X − μ)²
5      -8       64
9      -4       16
16     +3       9
17     +4       16
18     +5       25
Total   0       130
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1$$
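The deviation, mean absolute deviation, population variance, and population standard deviation slides all use the same five values, so they can be checked together. This is a minimal sketch with illustrative variable names.

```python
from math import sqrt

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N                           # population mean
deviations = [x - mu for x in data]          # deviations from the mean (sum to zero)

mad = sum(abs(d) for d in deviations) / N    # mean absolute deviation
variance = sum(d ** 2 for d in deviations) / N   # population variance (divide by N)
std_dev = sqrt(variance)                     # population standard deviation

print(mu, deviations)
print(mad, variance, round(std_dev, 1))
```

Because the deviations always sum to zero, they have to be made non-negative (absolute value or squaring) before averaging, which is what separates the M.A.D. from the variance.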
Sample Variance
• Average of the squared deviations from the arithmetic mean
X        X − X̄    (X − X̄)²
2,398    625      390,625
1,844    71       5,041
1,539    -234     54,756
1,311    -462     213,444
Total 7,092   0   663,866
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
Sample Standard Deviation
• Square root of the sample variance
X        X − X̄    (X − X̄)²
2,398    625      390,625
1,844    71       5,041
1,539    -234     54,756
1,311    -462     213,444
Total 7,092   0   663,866
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
$$S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41$$
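The sample formulas divide by n − 1 rather than N. A minimal sketch using Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 denominator while `pvariance` divides by N, shows the difference on the slide data.

```python
from statistics import mean, pvariance, stdev, variance

data = [2398, 1844, 1539, 1311]

x_bar = mean(data)                      # sample mean
sample_variance = variance(data)        # divides by n - 1
sample_std_dev = stdev(data)            # square root of the sample variance
population_variance = pvariance(data)   # divides by N, shown for comparison only

print(x_bar)
print(round(sample_variance, 2), round(sample_std_dev, 2))
print(population_variance)
```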
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants
Central Tendency and Dispersion- II
The Empirical Rule… If the histogram is bell shaped
Distance from the mean    Percentage of values within the distance
μ ± 1σ                    68
μ ± 2σ                    95
μ ± 3σ                    99.7
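The empirical rule percentages come from the normal distribution. As a sketch, they can be recovered from the standard normal cumulative distribution function using `statistics.NormalDist` (Python 3.8 or later). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The exact proportions are close to, but not identical with, the rounded 68, 95, and 99.7 figures, and they apply only when the data are approximately bell shaped.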
Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion
$$C.V. = \frac{\sigma}{\mu}(100)$$
Coefficient of Variation
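$$\mu_1 = 29, \quad \sigma_1 = 4.6 \qquad\qquad \mu_2 = 84, \quad \sigma_2 = 10$$
$$C.V._1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$$
$$C.V._2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$$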
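The comparison above can be sketched in Python. The function name `coefficient_of_variation` is illustrative; the inputs are the two means and standard deviations from the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_1 = coefficient_of_variation(4.6, 29)
cv_2 = coefficient_of_variation(10, 84)

print(round(cv_1, 2), round(cv_2, 2))
```

Although the second data set has the larger standard deviation (10 versus 4.6), its dispersion relative to its mean is smaller, which is what the coefficient of variation is meant to reveal.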
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
Skewness
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
– Mean = Median = Mode
• Distribution skewed right
– Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
– Median lies between mode and mean, and mode is greater than mean
Skewness
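$$S = \frac{3(\mu - M_d)}{\sigma}$$
where M_d is the median.
• If S < 0, the distribution is negatively skewed (skewed to the left)
• If S = 0, the distribution is symmetric (not skewed)
• If S > 0, the distribution is positively skewed (skewed to the right)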
Coefficient of Skewness
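$$\mu_1 = 23, \qquad \mu_2 = 26, \qquad \mu_3 = 29$$
$$M_{d1} = 26, \qquad M_{d2} = 26, \qquad M_{d3} = 26$$
$$\sigma_1 = 12.3, \qquad \sigma_2 = 12.3, \qquad \sigma_3 = 12.3$$
$$S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73$$
$$S_2 = \frac{3(\mu_2 - M_{d2})}{\sigma_2} = \frac{3(26 - 26)}{12.3} = 0$$
$$S_3 = \frac{3(\mu_3 - M_{d3})}{\sigma_3} = \frac{3(29 - 26)}{12.3} = +0.73$$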
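The three cases in this example can be checked with a small Python sketch of the coefficient of skewness, S = 3(mean − median) / standard deviation. The function name `skewness_coefficient` is illustrative.

```python
def skewness_coefficient(mean, median, std_dev):
    """Coefficient of skewness: three times (mean - median), divided by the standard deviation."""
    return 3 * (mean - median) / std_dev


# (mean, median, standard deviation) for the three distributions in the example
cases = [(23, 26, 12.3), (26, 26, 12.3), (29, 26, 12.3)]
coefficients = [skewness_coefficient(*case) for case in cases]

for s in coefficients:
    if s < 0:
        label = "negatively skewed (skewed left)"
    elif s > 0:
        label = "positively skewed (skewed right)"
    else:
        label = "symmetric"
    print(round(s, 2), label)
```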
[Figure: leptokurtic, mesokurtic, and platykurtic distribution shapes]
Box and Whisker Plot
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
Box and Whisker Plot
Minimum Q1 Q2 Q3 Maximum
Skewness: Box and Whisker Plots, and
Coefficient of Skewness
[Figure: box-and-whisker plots illustrating S < 0, S = 0, and S > 0]
1. Introduction to Box Plots
Box Plots are graphical representations that display the distribution of
data based on a five-number summary. These plots are particularly
useful for comparing distributions across different groups and
identifying potential outliers. The five-number summary consists of:
1. Minimum: The smallest data point.
2. First Quartile (Q1): The 25th percentile, or the value below which 25% of the data fall.
3. Median (Q2): The 50th percentile, or the middle value of the data set.
4. Third Quartile (Q3): The 75th percentile, or the value below which 75% of the data fall.
5. Maximum: The largest data point.
The plot visually represents these statistics with a box and whiskers.
2. Components of a Box Plot
A Box Plot consists of several key components:
• Box: The box represents the interquartile range (IQR), which is the range between the first and third quartiles (Q1 and Q3). It shows the middle 50% of the data.
• Whiskers: The whiskers extend from the edges of the box to the minimum and maximum values within 1.5 times the IQR from the quartiles. They help to identify the spread of the data outside the IQR.
• Median Line: A line inside the box denotes the median (Q2) of the dataset.
• Outliers: Data points that fall outside the range defined by the whiskers are considered outliers and are typically plotted as individual points.
3. Steps to Create a Box Plot
To construct a Box Plot, follow these steps:
1. Order the Data: Arrange the data points in ascending order.
2. Calculate Quartiles:
– Q1 (First Quartile): The median of the lower half of the dataset.
– Q2 (Median): The middle value of the dataset.
– Q3 (Third Quartile): The median of the upper half of the dataset.
3. Determine the IQR: Calculate the interquartile range by subtracting Q1 from Q3 (IQR = Q3 - Q1).
4. Identify Whisker Limits: Calculate the lower and upper whisker limits as follows:
– Lower Whisker Limit: Q1 - 1.5 × IQR
– Upper Whisker Limit: Q3 + 1.5 × IQR
5. Plot the Box and Whiskers: Draw a box from Q1 to Q3. Draw a line inside the box at the median (Q2). Extend whiskers from Q1 to the smallest data value within the lower limit and from Q3 to the largest data value within the upper limit.
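The construction steps can be sketched in Python without any plotting library. The function name `box_plot_summary` and the data values are hypothetical and chosen only for illustration. Q1 and Q3 are taken as the medians of the lower and upper halves, with the middle value left out of both halves when the count is odd (one common convention, assumed here).

```python
from statistics import median


def box_plot_summary(values):
    """Compute the numbers needed to draw a box plot by hand."""
    ordered = sorted(values)                       # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower_half)                        # step 2: quartiles
    q2 = median(ordered)
    q3 = median(upper_half)
    iqr = q3 - q1                                  # step 3: interquartile range
    lower_limit = q1 - 1.5 * iqr                   # step 4: whisker limits
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        # step 5: whiskers end at the most extreme values inside the limits
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < lower_limit or x > upper_limit],
    }


hypothetical_data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 20]
summary = box_plot_summary(hypothetical_data)
print(summary)
```

In this made-up data set the value 20 lies above the upper limit of 14, so it is reported as an outlier and the upper whisker stops at 9.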
Interpreting a Box Plot
Box Plots offer a wealth of information at a glance:
• Central Tendency: The median line inside the box indicates the center of the data distribution.
• Spread: The length of the box (IQR) shows the spread of the middle 50% of the data. A larger box indicates greater variability.
• Skewness: The relative lengths of the whiskers can indicate skewness. If one whisker is noticeably longer than the other, the data may be skewed in that direction.
• Outliers: Points plotted outside the whiskers are considered outliers and may warrant further investigation.
Applications of Box Plots
Now that we have a fairly clear understanding of the data set attributes in terms of spread and central tendency, let’s visualize them as a box plot. A box plot is an extremely effective way to get a one-shot view of the nature of the data. Before reviewing the box plots for the different attributes of the Auto MPG data set, let’s first understand a box plot in general and how to interpret its different aspects. The box plot (also called box and whisker plot) gives a standard visualization of the five-number summary statistics of the data, namely minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
The central rectangle or the box spans from the first to the third quartile (i.e. Q1 to Q3), thus giving the inter-quartile range (IQR).
The median is given by the line or band within the box.
The lower whisker extends up to 1.5 times the inter-quartile range (IQR) below the bottom of the box, i.e. the first quartile or Q1.
However, the actual length of the lower whisker depends on the lowest data value that falls within (Q1 − 1.5 × IQR).
Let’s try to understand this with an example. Say for a specific set of data, Q1 = 73, median = 76 and Q3 = 79. Hence, IQR will be 6 (i.e. Q3 – Q1).
So, the lower whisker can extend at most to (Q1 – 1.5 × IQR) = 73 – 1.5 × 6 = 64. However, say there are lower range data values such as 70, 63, and 60. The lower whisker will then end at 70, as this is the lowest data value larger than 64.
The upper whisker extends up to 1.5 times the inter-quartile range (IQR) above the top of the box, i.e. the third quartile or Q3. Similar to the lower whisker, the actual length of the upper whisker depends on the highest data value that falls within (Q3 + 1.5 × IQR).
For the same set of data, the upper whisker can extend at most to (Q3 + 1.5 × IQR) = 79 + 1.5 × 6 = 88. Say there are higher range data values such as 82, 84, and 89. The upper whisker will then end at 84, as this is the highest data value lower than 88. The data values beyond the lower or upper whiskers are unusually low or high, respectively. These are the outliers, which may deserve special consideration.
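The whisker example can be traced in a few lines of Python. The variable names are illustrative. Only the quartiles and the six extreme data values mentioned in the text are used, since the rest of the data set is not given.

```python
q1, q3 = 73, 79
iqr = q3 - q1                               # 6

lower_limit = q1 - 1.5 * iqr                # furthest the lower whisker may reach
upper_limit = q3 + 1.5 * iqr                # furthest the upper whisker may reach

low_values = [70, 63, 60]                   # lower range data values from the example
high_values = [82, 84, 89]                  # higher range data values from the example

# Each whisker ends at the most extreme data value that is still inside its limit.
lower_whisker = min(v for v in low_values if v >= lower_limit)
upper_whisker = max(v for v in high_values if v <= upper_limit)

# Values beyond the limits are the outliers.
outliers = sorted(
    v for v in low_values + high_values if v < lower_limit or v > upper_limit
)

print(lower_limit, upper_limit)
print(lower_whisker, upper_whisker, outliers)
```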
Boxplot
Consider MPG dataset
Plot the boxplot for cylinders from MPG dataset
Interpreting through Frequency and Cumulative Frequency
Let’s analyze this scenario using the Cylinders attribute in the MPG dataset:
1. Frequency Distribution:
   1. Create a frequency table for the Cylinders attribute, counting how many times each category (e.g., 4, 6, 8 cylinders) occurs.
   2. If the frequency of the lower categories (e.g., 4 cylinders) is disproportionately high, this could cause Q1 and Q2 to align because the majority of the lower half of the data is dominated by a single value.
2. Cumulative Frequency:
   1. The cumulative frequency helps identify the point where 25% and 50% of the data lie.
   2. If Q1 = Q2, it suggests that both the 25th and 50th percentiles fall in the same category because of the high cumulative frequency of that category.
# Calculate the frequency of cylinders
# Calculate the cumulative frequency
Boxplot for origin
# Calculate the frequency of origin
# Calculate the cumulative frequency of origin
As can be observed in the table, the frequency is extremely
high for data value 1. Since the total frequency is 398, the first
quartile (Q1), median (Q2), and third quartile (Q3) will be at a
cumulative frequency 99.5 (i.e. average of 99th and 100th
observation), 199 and 298.5 (i.e. average of 298th and 299th
observation), respectively. This way Q1 = 1, median = 1, and Q3
= 2. Since Q1 and median are same in value, the band for
median falls on the bottom of the box. There is no data value
lower than Q1. Hence, the lower whisker is missing.
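The reasoning above can be sketched with a frequency table and its running total. The frequency table itself is not reproduced in the text, so the counts below (249, 70, and 79 for origin values 1, 2, and 3) are an assumption; they were chosen to add up to the stated total of 398 and should be replaced with the actual table. The helper names are illustrative.

```python
from itertools import accumulate
from math import ceil, floor

# Assumed frequency table for origin (value -> count); not given in the text.
origin_counts = {1: 249, 2: 70, 3: 79}

values = sorted(origin_counts)
cumulative = list(accumulate(origin_counts[v] for v in values))
total = cumulative[-1]


def value_at_rank(rank):
    """Value of the rank-th observation (1-based) when the data are sorted."""
    for value, running_total in zip(values, cumulative):
        if rank <= running_total:
            return value
    raise ValueError("rank exceeds the number of observations")


def value_at_position(position):
    """Average the neighbouring observations when the position falls between two ranks."""
    return (value_at_rank(floor(position)) + value_at_rank(ceil(position))) / 2


q1 = value_at_position(total / 4)        # position 99.5
q2 = value_at_position(total / 2)        # position 199
q3 = value_at_position(3 * total / 4)    # position 298.5

print(cumulative, q1, q2, q3)
```

With these counts the first 249 observations all equal 1, so positions 99.5 and 199 both land on the value 1. That is why Q1 and the median coincide and the median band sits on the bottom edge of the box.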
Boxplot for displacement
Boxplot for Model year
Plotting and exploring numerical data using
Histograms
Plotting numerical data
Histograms
A histogram is a graphical representation of the distribution of numerical data. It groups the
data into intervals, called bins, and visualizes how frequently the data values fall within each
bin.
Key Features of Histograms
1. Representation:
   1. Histograms use bars to represent the frequency or proportion of data within each bin.
   2. Bars are adjacent (no gaps) to emphasize the continuous nature of numerical data.
2. Purpose:
   1. To understand the shape of the data distribution (e.g., normal, skewed, uniform, etc.).
   2. To identify patterns like skewness, multimodality, or outliers.
X-axis and Y-axis in Histograms
1. X-axis (Horizontal Axis):
   1. Represents the range of data values (grouped into bins).
   2. Each bin corresponds to an interval of the data.
   3. Example: If the data is test scores, the x-axis may show intervals like 0–10, 10–20, etc.
2. Y-axis (Vertical Axis):
   1. Represents the frequency (count) or density of data values within each bin.
   2. Frequency: Number of data points in each bin.
   3. Density: The proportion of data points relative to the total, often used for normalized histograms.
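The binning behind a histogram can be sketched in plain Python. The function name `histogram_table` and the test scores are hypothetical. Each bin is treated as including its left edge and excluding its right edge, which is an assumption because intervals written as 0–10, 10–20 leave the boundary case open. Density follows the definition above: the bin's share of all data points.

```python
def histogram_table(values, bin_width, start=0):
    """Map each non-empty bin (left, right) to a (frequency, density) pair."""
    counts = {}
    for x in values:
        index = int((x - start) // bin_width)     # which bin the value falls in
        left = start + index * bin_width
        bin_edges = (left, left + bin_width)      # assumed convention: left <= x < right
        counts[bin_edges] = counts.get(bin_edges, 0) + 1
    total = len(values)
    return {edges: (counts[edges], counts[edges] / total) for edges in sorted(counts)}


hypothetical_scores = [5, 12, 18, 25, 27, 29, 33, 38, 41, 55]
table = histogram_table(hypothetical_scores, bin_width=10)

for (left, right), (frequency, density) in table.items():
    print(f"{left}-{right}: frequency={frequency}, density={density:.1f}")
```

Bins with no observations are simply absent from the returned table; a plotted histogram would show them as gaps of zero height.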
Visualizations of histograms
Uniform distribution: is a type of probability distribution where all outcomes are equally likely. In a uniform distribution,
the probability of any particular value occurring is constant across its range.
Symmetric and unimodal distribution: is a type of data distribution that has a single peak (unimodal) and is symmetric around its central value. This means the left and right halves of the distribution are mirror images of each other.
Left skewed data: refers to a distribution where the tail on the left side is longer than the right. In this type of data, most
of the values cluster toward the higher (right) end, and the mean is typically less than the median.
Right skewed data: refers to a distribution where the tail on the right side of the distribution is longer than the left. In this
type of data, most of the values cluster toward the lower (left) end, and the mean is typically greater than the median.
Bimodal distribution: is a type of probability distribution that has two distinct peaks (or modes) in its frequency
distribution.
Multimodal distribution: is a type of probability distribution that has more than one peak (mode) in its frequency
distribution.
Consider MPG dataset
Plotting histograms for numeric attributes
model_year: Uniform or nearly uniform distribution.
MPG: Right skewed
Origin: Categorical distribution. The data show how many cars come from each origin. The distribution may be skewed or balanced depending on the dataset; for example, there may be more cars from the USA than from Europe or Japan, especially if the dataset was compiled in a country like the United States.
Acceleration: Normal distribution
Displacement: Right skewed data
Horsepower: Right-skewed
Weight: Right skewed
Difference between bar chart and histogram
1. Bar Chart:
• Data type: Used for categorical data (data that can be divided into distinct categories or groups).
• Bars: Each bar represents a category or group, and the height of the bar corresponds to the value or frequency of that category.
• Spacing: The bars are typically discrete and have space between them, as the categories are distinct and not related in a continuous manner.
• Example use case: Displaying the count of different car types in a dataset (e.g., the number of cars from different countries: USA, Europe, Japan).
2. Histogram:
• Data type: Used for continuous numerical data (data that can take any value within a range).
• Bars: Each bar represents a range of values (called a bin or class interval), and the height of the bar represents the frequency of data points falling within that range.
• Spacing: The bars in a histogram are typically continuous, with no space between them, as the data represents a continuous range of values.
• Example use case: Displaying the distribution of car weights or the distribution of MPG values in a dataset.