
Exploring Numerical Data - Students

The document discusses measures of central tendency and dispersion for numerical data, including the arithmetic mean, median, mode, and percentiles. It also covers measures of variability such as range, variance, and standard deviation, as well as the empirical rule for normally distributed data. Additionally, it introduces concepts of skewness and kurtosis to describe the shape of data distributions.

Uploaded by

Rohith Saindla
Copyright
© All Rights Reserved

EXPLORING STRUCTURE OF DATA
Exploring numerical data
Central Tendency and Dispersion for Numerical data

2
Lecture objectives

• Central tendency
• Measures of Dispersion

3
Measures of Central Tendency
• Measures of central tendency yield information about “particular places or
locations in a group of numbers.”
• A single number to describe the characteristics of a set of data

4
Summary statistics

• Central tendency or measures of location
– Arithmetic mean
– Weighted mean
– Median
– Percentile

• Dispersion
– Skewness
– Kurtosis
– Range
– Interquartile range
– Variance
– Standard score
– Coefficient of variation

5
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set

6
Population Mean

$$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$

$$\mu = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$$

7
Sample Mean

$$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$$

$$\bar{X} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$$
8
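The two worked means on the preceding slides can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the slides.

```python
from statistics import mean

# Values taken from the population-mean and sample-mean slides.
population = [24, 13, 19, 26, 11]
sample = [57, 86, 42, 38, 90, 66]

# The arithmetic is the same in both cases: sum of values / number of values.
population_mean = sum(population) / len(population)
sample_mean = mean(sample)

print(population_mean)        # 18.6
print(round(sample_mean, 3))  # 63.167
```

The formula is identical for a population and a sample; only the notation differs.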
Median
• Middle value in an ordered array of numbers

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

• Unaffected by extremely large and extremely small values

9
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.

10
Median: Example with an Odd Number of Terms

Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.

11
Median: Example with an Even Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21

• There are 16 terms in the ordered array

• Position of median = (n+1)/2 = (16+1)/2 = 8.5

• The median is between the 8th and 9th terms, 14.5

• If the 21 is replaced by 100, the median is 14.5

• If the 3 is replaced by -88, the median is 14.5
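The odd and even examples above, including the claim that extreme values do not move the median, can be reproduced with the standard library. A minimal sketch; the list names are illustrative.

```python
from statistics import median

# Ordered arrays from the two median examples.
odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = odd[:-1]  # the same array without the final 22 (16 terms)

median_odd = median(odd)    # middle (9th) term
median_even = median(even)  # average of the 8th and 9th terms

# Replacing an extreme value leaves the median unchanged.
odd_with_large_max = odd[:-1] + [100]
even_with_small_min = [-88] + even[1:]

print(median_odd, median_even)                                  # 15 14.5
print(median(odd_with_large_max), median(even_with_small_min))  # 15 14.5
```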


12
Mode

• The most frequently occurring value in a data set

• Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
• Bimodal -- Data sets that have two modes

• Multimodal -- Data sets that contain more than two modes

17
Mode -- Example
• The mode is 44
• There are more 44s than any other value

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
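The mode of the data set above can be found by counting occurrences. The sketch below returns every value that shares the highest count, so it would also report both modes of a bimodal data set; the variable names are illustrative.

```python
from collections import Counter

# The 24 values from the mode example, read row by row.
values = [35, 41, 44, 45, 37, 41, 44, 46, 37, 43, 44, 46,
          39, 43, 44, 46, 40, 43, 44, 46, 40, 43, 45, 48]

counts = Counter(values)
highest_count = max(counts.values())

# Every value that occurs highest_count times is a mode.
modes = sorted(v for v, c in counts.items() if c == highest_count)

print(modes, highest_count)  # [44] 5
```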

18
Percentiles
• Measures of location that divide an ordered group of data into 100 parts

• Example: 90th percentile indicates that at most 90% of the data lie
below it, and at least 10% of the data lie above it
• The median and the 50th percentile have the same value

• Applicable for ordinal, interval, and ratio data

• Not applicable for nominal data

19
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the pth percentile location:

$$i = \frac{P}{100}(n)$$

• Determine the percentile’s location and its value.
• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is at the position given by the whole-number part of (i+1) in the ordered array

20
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile:

$$i = \frac{30}{100}(8) = 2.4$$

• The location index, i, is not a whole number; i+1 = 2.4+1 = 3.4; the whole number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13.
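The location procedure used in this example can be sketched as a small function. It follows the slide's rule only (other software uses different interpolation rules and may give different answers). The function name `location_percentile` is hypothetical, and the sketch assumes an integer p strictly between 0 and 100.

```python
def location_percentile(values, p):
    """Percentile by the slide's location rule (assumes integer 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # i = (p / 100) * n, split into its whole part and the leftover
    # so that the whole-number test avoids floating-point error.
    whole, leftover = divmod(p * n, 100)
    if leftover == 0:
        # i is a whole number: average the values at positions i and i + 1.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at the whole-number part of i + 1.
    return ordered[whole]

raw = [14, 12, 19, 23, 5, 13, 28, 17]
p30 = location_percentile(raw, 30)
print(p30)  # 13
```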

21
Dispersion

• Measures of variability describe the spread or the dispersion of a set of data
• They indicate how reliable a measure of central tendency is
• They allow the dispersion of different samples to be compared

22
Variability

[Figure: cash flow with no variability around the mean, compared with cash flow that varies around the mean]

23
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation

24
Range – ungrouped data

• The difference between the largest and the smallest values in a set of data
• Simple to compute
• Ignores all data points except the two extremes
• Example:

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48

Range = Largest – Smallest = 48 - 35 = 13
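As a quick check of the range computed above, a minimal Python sketch (variable names are illustrative):

```python
# The 24 values from the range example, read row by row.
values = [35, 41, 44, 45, 37, 41, 44, 46, 37, 43, 44, 46,
          39, 43, 44, 46, 40, 43, 44, 46, 40, 43, 45, 48]

# Only the two extremes matter; every other value is ignored.
data_range = max(values) - min(values)
print(data_range)  # 13
```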

25
Quartiles

Quartiles are primarily considered a measure of spread in a dataset. They divide the data into four equal parts, providing insight into the distribution and variability of the data, rather than focusing on the "center" like measures of central tendency.
Quartiles
• Measures of location that divide a group of data into four subgroups

• Q1: 25% of the data set is below the first quartile

• Q2: 50% of the data set is below the second quartile

• Q3: 75% of the data set is below the third quartile

• Q1 is equal to the 25th percentile

• Q2 is located at 50th percentile and equals the median

• Q3 is equal to the 75th percentile

• Quartile values are not necessarily members of the data set


27
Quartiles

[Figure: ordered data divided by Q1, Q2, and Q3 into four parts, each containing 25% of the values]

28
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1:

$$i = \frac{25}{100}(8) = 2 \qquad Q_1 = \frac{109 + 114}{2} = 111.5$$

• Q2:

$$i = \frac{50}{100}(8) = 4 \qquad Q_2 = \frac{116 + 121}{2} = 118.5$$

• Q3:

$$i = \frac{75}{100}(8) = 6 \qquad Q_3 = \frac{122 + 125}{2} = 123.5$$

30
Interquartile Range

• Range of values between the first and third quartiles
• Range of the “middle half”
• Less influenced by extremes

$$\text{Interquartile Range} = Q_3 - Q_1$$
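The quartiles of the ordered array 106, 109, 114, 116, 121, 122, 125, 129 and the resulting interquartile range can be sketched as follows. The helper `quartile_by_location` is a hypothetical name; it applies the same percentile location rule used earlier in these slides, and other quartile conventions can give slightly different values.

```python
def quartile_by_location(ordered, p):
    """Value at percentile p using the slides' location rule (integer p)."""
    n = len(ordered)
    whole, leftover = divmod(p * n, 100)
    if leftover == 0:
        # Whole-number location: average positions i and i + 1.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise use the whole-number part of i + 1.
    return ordered[whole]

ordered = [106, 109, 114, 116, 121, 122, 125, 129]

q1 = quartile_by_location(ordered, 25)
q2 = quartile_by_location(ordered, 50)
q3 = quartile_by_location(ordered, 75)
iqr = q3 - q1  # spread of the middle half of the data

print(q1, q2, q3)  # 111.5 118.5 123.5
print(iqr)         # 12.0
```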

31
Deviation from the Mean
• Data set: 5, 9, 16, 17, 18
• Mean:

$$\mu = \frac{\sum X}{N} = \frac{65}{5} = 13$$

• Deviations from the mean: -8, -4, 3, 4, 5

[Figure: number line from 0 to 20 showing the deviations -8, -4, +3, +4, and +5 from the mean]


32
Mean Absolute Deviation
• Average of the absolute deviations from the mean

X      X − μ    |X − μ|
5      -8       8
9      -4       4
16     +3       3
17     +4       4
18     +5       5
Sum    0        24

$$M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8$$

33
Population Variance
• Average of the squared deviations from the arithmetic mean

X      X − μ    (X − μ)²
5      -8       64
9      -4       16
16     +3       9
17     +4       16
18     +5       25
Sum    0        130

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$

34
Population Standard Deviation
• Square root of the variance

X      X − μ    (X − μ)²
5      -8       64
9      -4       16
16     +3       9
17     +4       16
18     +5       25
Sum    0        130

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} \approx 5.1$$
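The deviation, mean absolute deviation, population variance, and population standard deviation calculations for the data set 5, 9, 16, 17, 18 can be reproduced step by step. This is a minimal sketch that mirrors the columns of the tables; the variable names are illustrative.

```python
from math import sqrt

data = [5, 9, 16, 17, 18]
n = len(data)

mu = sum(data) / n                    # population mean
deviations = [x - mu for x in data]   # X - mu; these always sum to zero

# Mean absolute deviation: average of |X - mu|.
mad = sum(abs(d) for d in deviations) / n

# Population variance: average of (X - mu)^2, dividing by N.
variance = sum(d ** 2 for d in deviations) / n

# Population standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(mu, mad, variance)  # 13.0 4.8 26.0
print(round(std_dev, 1))  # 5.1
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same two population quantities directly.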

35
Sample Variance
• Sum of the squared deviations from the sample mean, divided by n − 1

X        X − X̄    (X − X̄)²
2,398    625      390,625
1,844    71       5,041
1,539    -234     54,756
1,311    -462     213,444
7,092    0        663,866   (column sums)

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$
36
Sample Standard Deviation
• Square root of the sample variance

X        X − X̄    (X − X̄)²
2,398    625      390,625
1,844    71       5,041
1,539    -234     54,756
1,311    -462     213,444
7,092    0        663,866   (column sums)

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663{,}866}{3} = 221{,}288.67$$

$$S = \sqrt{S^2} = \sqrt{221{,}288.67} = 470.41$$

37
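The sample variance and sample standard deviation from the two preceding slides can be checked as follows. The only change from the population formulas is the divisor n − 1. A minimal sketch with illustrative variable names, cross-checked against the standard library:

```python
from math import sqrt
from statistics import stdev, variance

sample = [2398, 1844, 1539, 1311]
n = len(sample)

x_bar = sum(sample) / n  # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in sample]

# Sample variance divides by n - 1 rather than n.
s_squared = sum(squared_deviations) / (n - 1)
s = sqrt(s_squared)

print(x_bar)                # 1773.0
print(round(s_squared, 2))  # 221288.67
print(round(s, 2))          # 470.41

# statistics.variance and statistics.stdev use the same n - 1 divisor.
assert abs(variance(sample) - s_squared) < 1e-6
assert abs(stdev(sample) - s) < 1e-6
```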
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants

38
Central Tendency and Dispersion- II

39
The Empirical Rule… If the histogram is bell shaped

• Approximately 68% of all observations fall within one standard deviation of the mean.

• Approximately 95% of all observations fall within two standard deviations of the mean.

• Approximately 99.7% of all observations fall within three standard deviations of the mean.

40
Empirical Rule

• Data are normally distributed (or approximately normal)

Distance from the Mean    Percentage of Values Falling Within Distance
μ ± 1σ                    68
μ ± 2σ                    95
μ ± 3σ                    99.7

41
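The 68%, 95%, and 99.7% figures of the empirical rule come from the normal distribution itself. As a hedged illustration, the sketch below uses `statistics.NormalDist` from the Python standard library to compute the share of a normal distribution lying within k standard deviations of the mean; the dictionary name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Share of values within k standard deviations: P(-k < Z < k).
share_within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in share_within.items():
    print(f"within {k} standard deviation(s): {100 * share:.1f}%")
```

The printed percentages are close to 68, 95, and 99.7. The rule is only an approximation for real data that is roughly bell shaped.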
Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion


$$C.V. = \frac{\sigma}{\mu}(100)$$
42
Coefficient of Variation

$$\mu_1 = 29, \quad \sigma_1 = 4.6 \qquad\qquad \mu_2 = 84, \quad \sigma_2 = 10$$

$$C.V._1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$$

$$C.V._2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$$
43
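The two coefficients of variation computed on the preceding slide can be reproduced with a one-line function. A minimal sketch; the function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv1 = coefficient_of_variation(4.6, 29)
cv2 = coefficient_of_variation(10, 84)

print(round(cv1, 2))  # 15.86
print(round(cv2, 2))  # 11.9
```

Although the second data set has the larger standard deviation (10 versus 4.6), its dispersion relative to its mean is smaller.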
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness

44
Skewness

[Figure: three distribution curves: negatively skewed, symmetric (not skewed), and positively skewed]

45
Skewness..
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
• Mean = Median = Mode
• Distribution skewed right
• Median lies between mode and mean, and mode is less than
mean
• Distribution skewed left
• Median lies between mode and mean, and mode is greater than
mean

46
Skewness

[Figure: relative positions of the mean, median, and mode in negatively skewed, symmetric (not skewed), and positively skewed distributions]
47
Coefficient of Skewness

• Summary measure for skewness

$$S = \frac{3(\mu - M_d)}{\sigma}$$

where M_d is the median.
• If S < 0, the distribution is negatively skewed (skewed to the left)

• If S = 0, the distribution is symmetric (not skewed)

• If S > 0, the distribution is positively skewed (skewed to the right)
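The coefficient of skewness and its sign rule can be sketched in Python. The three (mean, median, standard deviation) triples are taken from the worked example on the next slide; the function names are hypothetical.

```python
def skewness_coefficient(mean, median, std_dev):
    """Coefficient of skewness: S = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

def describe(s):
    # Sign rule from the slide.
    if s < 0:
        return "negatively skewed"
    if s > 0:
        return "positively skewed"
    return "symmetric"

# (mean, median, standard deviation) for the three example distributions.
examples = [(23, 26, 12.3), (26, 26, 12.3), (29, 26, 12.3)]
results = [skewness_coefficient(*e) for e in examples]

for s in results:
    print(round(s, 2), describe(s))
```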

48
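Coefficient of Skewness

$$\mu_1 = 23, \quad M_{d1} = 26, \quad \sigma_1 = 12.3 \qquad S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73$$

$$\mu_2 = 26, \quad M_{d2} = 26, \quad \sigma_2 = 12.3 \qquad S_2 = \frac{3(\mu_2 - M_{d2})}{\sigma_2} = \frac{3(26 - 26)}{12.3} = 0$$

$$\mu_3 = 29, \quad M_{d3} = 26, \quad \sigma_3 = 12.3 \qquad S_3 = \frac{3(\mu_3 - M_{d3})}{\sigma_3} = \frac{3(29 - 26)}{12.3} = +0.73$$

49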
Kurtosis
• Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal in shape
– Platykurtic: flat and spread out

[Figure: three curves labelled Leptokurtic, Mesokurtic, and Platykurtic]

50
Box and Whisker Plot

Box and Whisker Plots are a versatile and informative tool for visualizing the distribution of data. By representing key statistical measures such as the median, quartiles, and potential outliers, these plots provide valuable insights into the spread and central tendency of datasets. Whether used for comparing different groups or analyzing a single distribution, Box Plots offer a clear and concise summary of the data, making them an essential tool in both exploratory data analysis and statistical reporting.

By understanding and utilizing Box Plots effectively, you can gain deeper insights into your data and make more informed decisions based on statistical evidence.

A Box and Whisker Plot, often referred to simply as a Box Plot, is a powerful tool in statistics for visualizing the distribution of a dataset. It provides a succinct summary of the data's central tendency, variability, and the presence of outliers.
Box and Whisker Plot

• Five specific values are used:

– Median, Q2

– First quartile, Q1

– Third quartile, Q3

– Minimum value in the data set

– Maximum value in the data set

53
Box and Whisker Plot

[Figure: box and whisker plot with the minimum, Q1, Q2, Q3, and maximum marked]

54
Skewness: Box and Whisker Plots, and
Coefficient of Skewness
[Figure: box and whisker plots for negatively skewed (S < 0), symmetric or not skewed (S = 0), and positively skewed (S > 0) distributions]

55
1. Introduction to Box Plots
Box Plots are graphical representations that display the distribution of
data based on a five-number summary. These plots are particularly
useful for comparing distributions across different groups and
identifying potential outliers. The five-number summary consists of:
1. Minimum: The smallest data point.
2. First Quartile (Q1): The 25th percentile, or the value below which 25% of the data fall.
3. Median (Q2): The 50th percentile, or the middle value of the data set.
4. Third Quartile (Q3): The 75th percentile, or the value below which 75% of the data fall.
5. Maximum: The largest data point.
The plot visually represents these statistics with a box and whiskers,
2. Components of a Box Plot
A Box Plot consists of several key components:
• Box: The box represents the interquartile range (IQR), which is the range between the first and third quartiles (Q1 and Q3). It shows the middle 50% of the data.
• Whiskers: The whiskers extend from the edges of the box to the minimum and maximum values within 1.5 times the IQR from the quartiles. They help to identify the spread of the data outside the IQR.
• Median Line: A line inside the box denotes the median (Q2) of the dataset.
• Outliers: Data points that fall outside the range defined by the whiskers are considered outliers and are typically
3. Steps to Create a Box Plot
To construct a Box Plot, follow these steps:
1. Order the Data: Arrange the data points in ascending order.
2. Calculate Quartiles:
   – Q1 (First Quartile): The median of the lower half of the dataset.
   – Q2 (Median): The middle value of the dataset.
   – Q3 (Third Quartile): The median of the upper half of the dataset.
3. Determine the IQR: Calculate the interquartile range by subtracting Q1 from Q3 (IQR = Q3 - Q1).
4. Identify Whisker Limits: Calculate the lower and upper whisker limits as follows:
   – Lower Whisker Limit: Q1 - 1.5 × IQR
   – Upper Whisker Limit: Q3 + 1.5 × IQR
5. Plot the Box and Whiskers: Draw a box from Q1 to Q3. Draw a line inside the box at the median (Q2). Extend whiskers from Q1 to the
Interpreting a Box Plot
Box Plots offer a wealth of information at a glance:
• Central Tendency: The median line inside the box indicates the center of the data distribution.
• Spread: The length of the box (IQR) shows the spread of the middle 50% of the data. A larger box indicates greater variability.
• Skewness: The relative lengths of the whiskers can indicate skewness. If one whisker is noticeably longer than the other, the data may be skewed in that direction.
• Outliers: Points plotted outside the whiskers are considered outliers and may warrant further investigation.
Applications of Box Plots

Box Plots are widely used in various fields:


• Education: To compare student performance across different classes or schools.
• Healthcare: To analyze the distribution of patient measurements, such as blood pressure or cholesterol levels.
• Finance: To compare the performance of different investment portfolios.
• Research: To visualize experimental data and compare results across different conditions or groups.
Plotting and exploring numerical data

Mathematical plots to explore numerical data:


a) Boxplot
b) Histogram
Box plots

Now that we have a fairly clear understanding of the data set attributes in terms of spread and central tendency, let’s visualize them as a box plot. A box plot is an extremely effective mechanism to get a one-shot view and understand the nature of the data. But before we review the box plot for different attributes of the Auto MPG data set, let’s first understand a box plot in general and how to interpret its different aspects. The box plot (also called box and whisker plot) gives a standard visualization of the five-number summary statistics of a data set, namely minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
The central rectangle, or box, spans from the first to the third quartile (i.e. Q1 to Q3), thus giving the inter-quartile range (IQR).
The median is given by the line or band within the box.
The lower whisker can extend at most 1.5 times the inter-quartile range (IQR) below the bottom of the box, i.e. the first quartile or Q1.
However, the actual length of the lower whisker depends on the lowest data value that falls within (Q1 − 1.5 × IQR).
Let’s try to understand this with an example. Say for a specific set of data, Q1 = 73, median = 76 and Q3 = 79. Hence, IQR will be 6 (i.e. Q3 − Q1).
So, the lower whisker can extend at most to (Q1 − 1.5 × IQR) = 73 − 1.5 × 6 = 64. However, say the lower range of data values includes 70, 63, and 60. Then the lower whisker will end at 70, as this is the lowest data value larger than 64.
The upper whisker can extend at most 1.5 times the inter-quartile range (IQR) above the top of the box, i.e. the third quartile or Q3. Similar to the lower whisker, the actual length of the upper whisker depends on the highest data value that falls within (Q3 + 1.5 × IQR).
For the same set of data, the upper whisker can extend at most to (Q3 + 1.5 × IQR) = 79 + 1.5 × 6 = 88. Say the higher range of data values includes 82, 84, and 89. Then the upper whisker will end at 84, as this is the highest data value lower than 88. The data values lying beyond the lower or upper whiskers are unusually low or high, respectively. These are the outliers, which may deserve special consideration.
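The whisker logic in this example can be sketched in Python. The quartiles (Q1 = 73, Q3 = 79) and the low and high values (70, 63, 60 and 82, 84, 89) come from the text; the function name `whisker_summary` and the way the values are gathered into one list are assumptions made for illustration, and the quartiles are passed in rather than recomputed from that list.

```python
def whisker_summary(values, q1, q3):
    """Whisker limits, whisker ends, and outliers for given quartiles."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values inside the limits.
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = sorted(v for v in values if v < lower_limit or v > upper_limit)
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Illustrative list built from the values mentioned in the example.
values = [60, 63, 70, 73, 76, 79, 82, 84, 89]
summary = whisker_summary(values, q1=73, q3=79)

print(summary["lower_limit"], summary["upper_limit"])      # 64.0 88.0
print(summary["lower_whisker"], summary["upper_whisker"])  # 70 84
print(summary["outliers"])                                 # [60, 63, 89]
```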
Boxplot
Consider MPG dataset
Plot the boxplot for cylinders from MPG dataset
Interpreting through Frequency and Cumulative Frequency
Let’s analyze this scenario using the Cylinders attribute in the MPG dataset:
1. Frequency Distribution:
   1. Create a frequency table for the Cylinders attribute, counting how many times each category (e.g., 4, 6, 8 cylinders) occurs.
   2. If the frequency of the lower categories (e.g., 4 cylinders) is disproportionately high, this could cause Q1 and Q2 to align because the majority of the lower half of the data is dominated by a single value.
2. Cumulative Frequency:
   1. The cumulative frequency helps identify the point where 25% and 50% of the data lie.
   2. If Q1 = Q2, it suggests that both the 25th and 50th percentiles fall in the same category because of the high cumulative frequency of that category.
# Calculate the frequency of cylinders
# Calculate the cumulative frequency
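The code that belonged to these two comments is not preserved in the text. As a stand-in, here is a minimal standard-library sketch of a frequency table and cumulative frequency. The `cylinders` list is a small made-up sample, not the actual MPG data, and `category_at` is a hypothetical helper that finds the category in which a given fraction of the data is reached.

```python
from collections import Counter
from itertools import accumulate

# Made-up sample for illustration only (NOT the real MPG column).
cylinders = [4, 4, 4, 4, 4, 4, 6, 6, 8, 8]

# Calculate the frequency of cylinders (sorted by category).
frequency = dict(sorted(Counter(cylinders).items()))

# Calculate the cumulative frequency.
categories = list(frequency)
cumulative = list(accumulate(frequency.values()))

def category_at(fraction):
    """First category whose cumulative frequency reaches fraction * n."""
    target = fraction * len(cylinders)
    for category, running_total in zip(categories, cumulative):
        if running_total >= target:
            return category

q1, q2, q3 = (category_at(f) for f in (0.25, 0.50, 0.75))

print(frequency)   # {4: 6, 6: 2, 8: 2}
print(cumulative)  # [6, 8, 10]
print(q1, q2, q3)  # 4 4 6
```

In this made-up sample the category 4 already covers 60% of the observations, so both the 25% and 50% points fall inside it and Q1 equals Q2, which is the situation described in the list above.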
Boxplot for origin
# Calculate the frequency of origin
# Calculate the cumulative frequency of origin
As can be observed in the table, the frequency is extremely high for data value 1. Since the total frequency is 398, the first quartile (Q1), median (Q2), and third quartile (Q3) will be at a cumulative frequency of 99.5 (i.e. average of 99th and 100th observation), 199, and 298.5 (i.e. average of 298th and 299th observation), respectively. This way Q1 = 1, median = 1, and Q3 = 2. Since Q1 and the median are the same in value, the band for the median falls on the bottom of the box. There is no data value lower than Q1. Hence, the lower whisker is missing.
Boxplot for displacement
Boxplot for Model year
Plotting and exploring numerical data using
Histograms
Plotting numerical data

Histograms
A histogram is a graphical representation of the distribution of numerical data. It groups the
data into intervals, called bins, and visualizes how frequently the data values fall within each
bin.
Key Features of Histograms
1. Representation:
   1. Histograms use bars to represent the frequency or proportion of data within each bin.
   2. Bars are adjacent (no gaps) to emphasize the continuous nature of numerical data.
2. Purpose:
   1. To understand the shape of the data distribution (e.g., normal, skewed, uniform, etc.).
   2. To identify patterns like skewness, multimodality, or outliers.
X-axis and Y-axis in Histograms
1. X-axis (Horizontal Axis):
   1. Represents the range of data values (grouped into bins).
   2. Each bin corresponds to an interval of the data.
   3. Example: If the data is test scores, the x-axis may show intervals like 0–10, 10–20, etc.
2. Y-axis (Vertical Axis):
   1. Represents the frequency (count) or density of data values within each bin.
   2. Frequency: Number of data points in each bin.
   3. Density: The proportion of data points in each bin divided by the bin width, so that the bar areas sum to 1; used for normalized histograms.
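The binning behind a histogram can be sketched without any plotting library. The test scores below are made up for illustration, and the bin layout (four bins of width 10 starting at 0, each including its left edge but not its right edge) is an assumption.

```python
# Made-up test scores; every value must lie in the range [0, 40).
scores = [3, 9, 12, 15, 18, 21, 27, 29, 35, 38]

start, width, n_bins = 0, 10, 4
counts = [0] * n_bins

for x in scores:
    # A score in [start + k*width, start + (k+1)*width) goes to bin k.
    index = (x - start) // width
    counts[index] += 1

# Density: proportion in the bin divided by the bin width.
densities = [c / (len(scores) * width) for c in counts]

for k, (c, d) in enumerate(zip(counts, densities)):
    low = start + k * width
    print(f"{low}-{low + width}: frequency={c}, density={d}")
```

The frequencies come out as 2, 3, 3, and 2, and the densities multiplied by the bin width add up to 1, which is what makes a density histogram comparable across data sets of different sizes.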
Visualizations of histograms

Uniform distribution: a type of probability distribution where all outcomes are equally likely. In a uniform distribution, the probability of any particular value occurring is constant across its range.
Symmetric and unimodal distribution: a type of data distribution that has a single peak (unimodal) and is symmetrical around its central value. This means the left and right halves of the distribution are mirror images of each other.
Left skewed data: a distribution where the tail on the left side is longer than the right. In this type of data, most of the values cluster toward the higher (right) end, and the mean is typically less than the median.
Right skewed data: a distribution where the tail on the right side is longer than the left. In this type of data, most of the values cluster toward the lower (left) end, and the mean is typically greater than the median.
Bimodal distribution: a type of probability distribution that has two distinct peaks (or modes) in its frequency distribution.
Multimodal distribution: a type of probability distribution that has more than one peak (mode) in its frequency distribution.
Consider MPG dataset
Plotting histograms for numeric attributes
model_year: Uniform or nearly uniform distribution.
MPG: Right skewed
Origin: Categorical distribution. The data will show how many cars come from each origin. Depending on the dataset, it could be skewed or balanced, for example, with more cars from the USA than from Europe or Japan, especially if the dataset was compiled in a country like the United States.
Acceleration: Normal distribution
Displacement: Right skewed data
Horsepower: Right-skewed
Weight: Right skewed
Difference between bar chart and histogram
1. Bar Chart:
• Data type: Used for categorical data (data that can be divided into distinct categories or groups).
• Bars: Each bar represents a category or group, and the height of the bar corresponds to the value or frequency of that category.
• Spacing: The bars are typically discrete and have space between them, as the categories are distinct and not related in a continuous manner.
• Example use case: Displaying the count of different car types in a dataset (e.g., the number of cars from different countries: USA, Europe, Japan).
2. Histogram:
• Data type: Used for continuous numerical data (data that can take any value within a range).
• Bars: Each bar represents a range of values (called a bin or class interval), and the height of the bar represents the frequency of data points falling within that range.
• Spacing: The bars in a histogram are typically continuous, with no space between them, as the data represents a continuous range of values.
• Example use case: Displaying the distribution of car weights or the distribution of MPG values in a dataset.
THANK YOU

97
