Descriptive Statistics

The document discusses descriptive statistics, focusing on measures of location (mean, median, mode) and measures of dispersion (range, variance, standard deviation). It explains the importance of feature engineering in machine learning and provides insights into percentiles and quartiles, including the interquartile range (IQR) for identifying outliers. Examples illustrate the concepts, including how to calculate outliers and visualize data using box plots.

Uploaded by

rgrewal112233
Copyright: © All Rights Reserved
Descriptive Statistics

Measures of Location
• Measures of Location / Measures of Central Tendency: A single
value that represents the “centering” of a set of data, e.g. average
• Example: Marks obtained by 11 students, arranged in ascending
order … 45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91
• Possible measure of location: the middle value, 71 (the median)

Measures of Location: Mean, Median, Mode


Basic Usage
• Mean: Better if the data is normally distributed and there are no
outliers … Used for interval and ratio data
• Median: Better when the data is skewed (has extreme values) …
Used for ordinal, interval, and ratio data
• Mode: Useful for identifying the most common value or values in a
dataset … Used with all four scales (nominal, ordinal, interval, ratio) … Best for categorical data
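The contrast between mean and median can be sketched with Python's standard `statistics` module. This is a minimal illustration that reuses the marks from the earlier example; the extreme value 500 is a hypothetical outlier added only to show the effect, not data from the slides.

```python
import statistics

# Marks from the earlier example (11 students)
marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

mean_marks = statistics.mean(marks)      # roughly 70.8
median_marks = statistics.median(marks)  # 71, the middle value

# Hypothetical skewed version: replace the top mark with an extreme value
skewed = marks[:-1] + [500]

mean_skewed = statistics.mean(skewed)      # pulled up to 108 by one value
median_skewed = statistics.median(skewed)  # still 71

print(mean_marks, median_marks)
print(mean_skewed, median_skewed)
```

The median does not move when one value becomes extreme, which is why it is preferred for skewed data.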

[Figure: normally distributed data vs. skewed data]


Mean

Median

Mode
• Mode: The value that occurs most frequently in a dataset
• Data: 62, 78, 84, 89, 91, 95, 97, 89, 91, 89
• Frequency: 62: 1, 78: 1, 84: 1, 89: 3, 91: 2, 95: 1, 97: 1
• Mode = 89
• What if there are multiple values with the same highest frequency?:
Multimodal data
• If we have two modes: bi-modal
• If we have three modes: tri-modal
• Not used much in practice
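The frequency count and mode above can be reproduced with the standard library. This is a small sketch; the second list (`two_modes`) is a hypothetical example added to show the bi-modal case.

```python
from collections import Counter
import statistics

# Data from the mode example
scores = [62, 78, 84, 89, 91, 95, 97, 89, 91, 89]

# Frequency of each value
frequency = Counter(scores)

# Single most frequent value
mode = statistics.mode(scores)

# All values that share the highest frequency (handles multimodal data)
modes = statistics.multimode(scores)

# Hypothetical bi-modal data: 1 and 2 both occur twice
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print(frequency[89], mode, modes, two_modes)
```

`statistics.multimode` returns a list, so it is the safer choice when the data may have more than one mode.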
Feature Engineering
• Feature engineering: Transform raw data into meaningful features
• Why? Improve the performance of the machine learning models
• How?
• Create new columns (From Date of purchase, create weekday/weekend)
• Scale features (Bring features on the same scale, e.g. age and income)
• Encode categorical features (Gender: Convert F = 0, M = 1), since ML models
work with numeric data
• Handle missing data (Drop, Indicate using a Missing flag, or Impute with
mean/mode/median)
• Feature selection (Keep only the most relevant features)
• Feature interaction (From unit price and quantity, create bill amount)
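Several of the steps listed above can be sketched in plain Python. The records, field names, and values below are hypothetical and exist only for illustration; in practice a library such as pandas or scikit-learn would usually do this work.

```python
from datetime import date
from statistics import median

# Hypothetical raw records (not from the slides)
rows = [
    {"purchase_date": date(2024, 1, 6), "gender": "F", "age": 25,
     "unit_price": 10.0, "quantity": 3},
    {"purchase_date": date(2024, 1, 8), "gender": "M", "age": None,
     "unit_price": 4.5, "quantity": 2},
    {"purchase_date": date(2024, 1, 9), "gender": "F", "age": 45,
     "unit_price": 20.0, "quantity": 1},
]

# Median of the known ages, used to impute missing values
age_median = median(r["age"] for r in rows if r["age"] is not None)

for r in rows:
    # Create a new column: weekday() is 5 for Saturday and 6 for Sunday
    r["is_weekend"] = r["purchase_date"].weekday() >= 5
    # Encode a categorical feature as a number
    r["gender_code"] = {"F": 0, "M": 1}[r["gender"]]
    # Handle missing data: keep a flag, then impute with the median
    r["age_missing"] = r["age"] is None
    if r["age"] is None:
        r["age"] = age_median
    # Feature interaction: bill amount from unit price and quantity
    r["bill_amount"] = r["unit_price"] * r["quantity"]

# Scale a feature to the range 0..1 (min-max scaling)
lowest = min(r["age"] for r in rows)
highest = max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lowest) / (highest - lowest)

for r in rows:
    print(r)
```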
Measures of Dispersion
• Spread / Measures of Dispersion / Scatter: How, and by how much,
is our data set spread out around its center?

Measures of Dispersion: Range, Variance, Standard Deviation


Range
• Range: Difference between the maximum value and the minimum
value in the data set
• Affected by outliers

• Example: 8, 11, 5, 9, 7, 6, 2500


• Range = Max – Min = 2500 – 5 = 2495, which says little about how the typical values (5 to 11) are spread
• Solution: Inter Quartile Range (IQR)
• But first, we need to understand percentile and quartile
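The effect of a single outlier on the range in the example above can be checked with a few lines of Python. Dropping the largest value here is only for comparison and is not a recommended way to treat outliers.

```python
# Data from the range example
data = [8, 11, 5, 9, 7, 6, 2500]

# Range with the outlier included
full_range = max(data) - min(data)

# Range of the same data without the largest value, for comparison only
trimmed = sorted(data)[:-1]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range)     # 2495
print(trimmed_range)  # 6
```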
Percentile
• Percentile (relative) ≠ Percentage (absolute)
• Percentile: A value below which a certain percentage of observations lie
• Slices the data into two parts: Below a certain cut off, Above the
same cut off
• kth percentile = k% of the data is below it, and the rest is above it
• Examples:
• If you are in the 90th percentile in an examination, 90% of the students are below you and
10% of the students are above you
• If a patient’s blood pressure is in the 60th percentile, 60% of the patients have a blood
pressure lower than this patient, and 40% of the patients have a higher blood pressure than
this patient
• Median = 50th percentile
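The idea of "the percentage of observations below a value" can be sketched as a small helper. The function name `percent_below` is made up for this illustration, and the marks are reused from the earlier example.

```python
import statistics

def percent_below(values, x):
    """Percentage of observations strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

# A student who scored 88 has 9 of the 11 marks below them
rank_of_88 = percent_below(marks, 88)

# The median splits the data in half: 5 marks below it and 5 above it
median_mark = statistics.median(marks)
count_below = sum(1 for m in marks if m < median_mark)
count_above = sum(1 for m in marks if m > median_mark)

print(round(rank_of_88, 1), median_mark, count_below, count_above)
```

Definitions differ on whether the value itself counts as "below"; this sketch uses a strict comparison.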
Percentile Example
• General graph: Score at the 62nd percentile
• In some references, we might see Number of observations, rather than
Number of observations + 1 … Generally does not make a big difference
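The formula on the original slide is not available here, so the sketch below assumes a common textbook convention: the position of the kth percentile in the sorted data is k/100 × (Number of observations + 1), with linear interpolation between neighbouring values. The function name and the `plus_one` switch are illustrative only.

```python
def value_at_percentile(values, k, plus_one=True):
    """Value at the kth percentile, using a 1-based position in sorted data.

    Assumed convention: position = k/100 * (n + 1), or k/100 * n when
    plus_one is False, with linear interpolation between neighbours.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = k / 100 * (n + 1 if plus_one else n)
    # Keep the position inside the data
    position = min(max(position, 1), n)
    lower = int(position)          # 1-based index of the value at or below
    fraction = position - lower    # how far to move towards the next value
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

with_plus_one = value_at_percentile(marks, 50)                     # position 6
without_plus_one = value_at_percentile(marks, 50, plus_one=False)  # position 5.5

print(with_plus_one, without_plus_one)
```

With only 11 observations the two conventions give 71 and 69.5; the gap shrinks as the number of observations grows.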


Percentile Example

US Household Net Worth and Percentile (Source:
https://finance.yahoo.com/news/wealthy-net-worth-considered-poor-190014440.html)

Category        Percentile    Net Worth
Poor            20th          $10,000
Middle class    50th          $281,000
Wealthy         90th          $1.9 million
Quartile
• Q1 = 25th percentile, Q2 = 50th percentile (median), Q3 = 75th percentile


Inter Quartile Range (IQR)
• Inter Quartile Range (IQR) = Q3 – Q1 = Range of the middle 50% of the data
• In the given example: IQR = Q3 – Q1 = 95.5 – 82 = 13.5
• Handles outliers better than range, since the extreme values at both the
ends are ignored in IQR
• Since it uses percentiles rather than actual values, it is less affected by
skewed data (See Skewness)
• Outliers: Data points that are significantly outside of the typical range of
values
• Lower bound: Q1 – (1.5 * IQR) = 82 – (1.5 * 13.5) = 61.75
• Upper bound: Q3 + (1.5 * IQR) = 95.5 + (1.5 * 13.5) = 115.75
• Points below the lower bound or above the upper bound are outliers
• In our example, there are no such points, so we do not have any outliers
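The bounds above follow directly from the two quartiles, so they can be wrapped in a small helper. The function name `iqr_bounds` is made up for this sketch; the quartile values are the ones quoted in the example.

```python
def iqr_bounds(q1, q3, factor=1.5):
    """Lower and upper outlier bounds from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Quartiles quoted in the example: Q1 = 82, Q3 = 95.5
lower_bound, upper_bound = iqr_bounds(82, 95.5)

print(lower_bound, upper_bound)  # 61.75 115.75
```

Any data point below `lower_bound` or above `upper_bound` would be flagged as an outlier.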
Outlier Example
• Commute times for 14 randomly selected adults in minutes: 16, 8, 35, 17,
13, 15, 15, 5, 16, 25, 20, 20, 12, 10
• Find outliers and draw a box plot
• Solution: First sort them: 5, 8, 10, 12, 13, 15, 15, 16, 16, 17, 20, 20, 25, 35
• Create a 5-number summary: Minimum, Q1, Q2, Q3, Maximum = 5, 12,
15.5, 20, and 35
• Outliers
• First calculate 1.5 * IQR = 1.5 x (20 – 12) = 1.5 x 8 = 12
• Outlier bounds: Q1 – 12 = 12 – 12 = 0 and Q3 + 12 = 20 + 12 = 32
• So, outliers = Commute time < 0 or > 32; only 35 lies outside these bounds, so 35 is the one outlier
• Boxplot: Draw a box from 12 to 20; Draw a median line at 15.5; Draw whiskers from the box
out to 5 and to 25 (the most extreme values inside the bounds); Show 35 as an outlier point (See next slide)
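The hand calculation above can be checked in code. The slide's quartiles (12 and 20) match the "median of each half" method, so this sketch assumes that method; library functions such as NumPy's percentile use interpolation by default and may return slightly different quartiles for the same data.

```python
import statistics

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the median position
    upper_half = ordered[(n + 1) // 2 :]  # values above the median position
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )

commute_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

minimum, q1, q2, q3, maximum = five_number_summary(commute_times)

# Outlier bounds: 1.5 * IQR beyond each quartile
step = 1.5 * (q3 - q1)
lower_bound, upper_bound = q1 - step, q3 + step
outliers = [t for t in commute_times if t < lower_bound or t > upper_bound]

print(minimum, q1, q2, q3, maximum)  # 5 12 15.5 20 35
print(lower_bound, upper_bound)      # 0.0 32.0
print(outliers)                      # [35]
```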
Outlier Code
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Data: commute times in minutes
    commuter_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

    # Create the box plot (horizontal orientation)
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=commuter_times, orient='h')

    # Add titles and labels
    plt.title('Box Plot of Commuter Times')
    plt.xlabel('Minutes')

    # Show the plot
    plt.show()
Resulting Boxplot
