Descriptive Statistics

The document discusses descriptive statistics, focusing on measures of location (mean, median, mode) and measures of dispersion (range, variance, standard deviation). It explains the importance of feature engineering in machine learning and provides insights into percentiles and quartiles, including the interquartile range (IQR) for identifying outliers. Examples illustrate the concepts, including how to calculate outliers and visualize data using box plots.

Uploaded by

rgrewal112233


Measures of Location
• Measures of Location / Measures of Central Tendency : A single
value that represents the “centering” of a set of data, e.g. average
• Example: Marks obtained by 11 students, arranged in ascending
order … 45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91
• Possible measure of location: the middle value, 71

Measures of Location
• Three common measures: Mean, Median, Mode

Basic Usage
• Mean: Better if the data is normally distributed and there are no
outliers … Used for interval and ratio data
• Median: Better when the data is skewed (has extreme values) …
Used for ordinal, interval, and ratio data
• Mode: Useful for identifying the most common value or values in a
dataset … Used in all the four scales … Best for categorical data

[Figure: normally distributed data vs. skewed data]

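Mean
• Mean: The sum of all values divided by the number of values
Median
• Median: The middle value of the sorted data (the average of the two
middle values when the number of values is even)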
Mode
• Mode: The value that occurs most frequently in a dataset
• Data: 62, 78, 84, 89, 91, 95, 97, 89, 91, 89
• Frequency: 62: 1, 78: 1, 84: 1, 89: 3, 91: 2, 95: 1, 97: 1
• Mode = 89
• What if there are multiple values with the same highest frequency?:
Multimodal data
• If we have two modes: bi-modal
• If we have three modes: tri-modal
• Not used much in practice
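
The three measures of location can be checked on the mode example data with a minimal Python sketch. It assumes only the standard-library `statistics` module; the variable name `marks` and the small bimodal list are illustrative choices, not from the slides.

```python
from statistics import mean, median, multimode

# Data from the mode example above
marks = [62, 78, 84, 89, 91, 95, 97, 89, 91, 89]

print(mean(marks))       # 86.5
print(median(marks))     # 89.0 (average of the two middle values, 89 and 89)
print(multimode(marks))  # [89]

# multimode returns every value tied for the highest frequency,
# so a bi-modal list yields two modes (illustrative data).
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```

`multimode` is used rather than `mode` because it reports all modes when the data is multimodal.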
Feature Engineering
• Feature engineering: Transform raw data into meaningful features
• Why? Improve the performance of the machine learning models
• How?
• Create new columns (From Date of purchase, create weekday/weekend)
• Scale features (Bring features on the same scale, e.g. age and income)
• Encode categorical features (Gender: Convert F = 0, M = 1), since ML models
work with numeric data
• Handle missing data (Drop, Indicate using a Missing flag, or Impute with
mean/mode/median)
• Feature selection (Keep only the most relevant features)
• Feature interaction (From unit price and quantity, create bill amount)
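
Several of the steps listed above can be sketched in plain Python. The purchase records, field names, and the choice of median imputation and min-max scaling are all assumptions made for illustration; a real pipeline would more likely use a library such as pandas or scikit-learn.

```python
from datetime import date
from statistics import median

# Hypothetical purchase records; None marks a missing age.
purchases = [
    {"date": date(2024, 1, 6), "gender": "F", "age": 34, "unit_price": 20.0, "quantity": 3},
    {"date": date(2024, 1, 8), "gender": "M", "age": None, "unit_price": 5.5, "quantity": 2},
    {"date": date(2024, 1, 9), "gender": "F", "age": 52, "unit_price": 12.0, "quantity": 1},
]

# Handle missing data: impute with the median of the known ages
known_ages = [p["age"] for p in purchases if p["age"] is not None]
age_fill = median(known_ages)

features = []
for p in purchases:
    features.append({
        # Create a new column: weekday() is 5 for Saturday and 6 for Sunday
        "is_weekend": int(p["date"].weekday() >= 5),
        # Encode a categorical feature (F = 0, M = 1, as in the slide)
        "gender_code": {"F": 0, "M": 1}[p["gender"]],
        "age": p["age"] if p["age"] is not None else age_fill,
        # Keep a flag so the model can tell which ages were imputed
        "age_missing": int(p["age"] is None),
        # Feature interaction: bill amount from unit price and quantity
        "bill_amount": p["unit_price"] * p["quantity"],
    })

# Scale a feature: min-max scaling maps age onto the range 0 to 1
ages = [f["age"] for f in features]
lo, hi = min(ages), max(ages)
for f in features:
    f["age_scaled"] = (f["age"] - lo) / (hi - lo)

for f in features:
    print(f)
```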
Measures of Dispersion
• Spread / Measures of Dispersion / Scatter: How, and by how much, is
our data set spread out around its center?

Measures of Dispersion
• Three common measures: Range, Variance, Standard Deviation

Range
• Range: Difference between the maximum value and the minimum
value in the data set
• Affected by outliers, since it depends only on the minimum and the
maximum

• Example: 8, 11, 5, 9, 7, 6, 2500

• Range = Max – Min = 2500 – 5 = 2495, which is quite meaningless: one
extreme value dominates it, while the other values lie between 5 and 11
• Solution: Inter Quartile Range (IQR)
• But first, we need to understand percentile and quartile
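
The effect of a single extreme value on the range can be seen in a short Python sketch using the example data above. The variable names are illustrative, and dropping the largest value is done only to show the contrast, not as a recommended way to treat outliers.

```python
values = [8, 11, 5, 9, 7, 6, 2500]

# Range = maximum - minimum
data_range = max(values) - min(values)
print(data_range)  # 2495

# Same calculation with the largest value left out, for comparison only
without_max = sorted(values)[:-1]
print(max(without_max) - min(without_max))  # 6
```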
Percentile
• Percentile (relative) ≠ Percentage (absolute)
• Percentile: A value below which a certain percentage of observations lie
• Splits the data into two parts: below a certain cut-off, and above the
same cut-off
• kth percentile = k% of the data is below it, and the rest is above it
• Examples:
• If you are in the 90th percentile in an examination, 90% of students are below you and
10% of students are above you
• If a patient’s blood pressure is in the 60th percentile, 60% of patients have a blood
pressure lower than this patient, and 40% of patients have a higher blood pressure than
this patient
• Median = 50th percentile
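
A minimal Python sketch of these ideas, using the marks from the measures of location example. The helper `percent_below` is a hypothetical name introduced here; `statistics.quantiles` is the standard-library function, and its default "exclusive" method is one of several percentile conventions, so other tools may give slightly different cut points.

```python
from statistics import median, quantiles

marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

def percent_below(values, x):
    """Percentage of observations strictly below x (hypothetical helper)."""
    return 100 * sum(v < x for v in values) / len(values)

# 9 of the 11 marks are below 88
print(round(percent_below(marks, 88), 1))  # 81.8

# quantiles(..., n=100) returns the 99 cut points between percentiles;
# index 49 is the 50th percentile, which matches the median.
p50 = quantiles(marks, n=100)[49]
print(p50, median(marks))  # 71.0 71
```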
Percentile Example
• [Figure: general graph marking the score at the 62nd percentile]
• In some references, we might see Number of observations, rather than
Number of observations + 1, in the percentile formula … Generally does
not make a big difference
Percentile Example
• US Household Net Worth and Percentile (Source:
https://finance.yahoo.com/news/wealthy-net-worth-considered-poor-190014440.html)

Category        Percentile    Net Worth
Poor            20th          $10,000
Middle class    50th          $281,000
Wealthy         90th          $1.9 million
Quartile
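• Quartiles divide the sorted data into four equal parts
• Q1 = 25th percentile, Q2 = 50th percentile (median), Q3 = 75th percentile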

Inter Quartile Range (IQR)
• Inter Quartile Range (IQR) = Q3 – Q1 = Spread of the middle 50% of the data
• In the given example: IQR = Q3 – Q1 = 95.5 – 82 = 13.5
• Handles outliers better than range, since the extreme values at both the
ends are ignored in IQR
• Since it uses percentiles rather than actual values, it is less affected by
skewed data (See Skewness)
• Outliers: Data points that are significantly outside of the typical range of
values
• Lower bound: Q1 – (1.5 * IQR) = 82 – (1.5 * 13.5) = 61.75
• Upper bound: Q3 + (1.5 * IQR) = 95.5 + (1.5 * 13.5) = 115.75
• Points below the lower bound or above the upper bound are outliers
• In our example, there are no such points, so we do not have any outliers
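
The bound calculation can be written as a small Python helper. The function name `iqr_fences` and its `k` parameter are illustrative choices; the inputs are the Q1 = 82 and Q3 = 95.5 quoted above. Note that the upper bound is measured from Q3, not Q1.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower bound, upper bound) for flagging outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(82, 95.5)
print(lower, upper)  # 61.75 115.75
```

Any data point below `lower` or above `upper` would be flagged as an outlier.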
Outlier Example
• Commute times for 14 randomly selected adults in minutes: 16, 8, 35, 17,
13, 15, 15, 5, 16, 25, 20, 20, 12, 10
• Find outliers and draw a box plot
• Solution: First sort them: 5, 8, 10, 12, 13, 15, 15, 16, 16, 17, 20, 20, 25, 35
• Create a 5-number summary: Minimum, Q1, Q2, Q3, Maximum = 5, 12,
15.5, 20, and 35
• Outlier
• First calculate 1.5 * IQR = 1.5 x (20 – 12) = 1.5 x 8 = 12
• Outliers calculation: Q1 – 12 = 12 – 12 = 0 and Q3 + 12 = 20 + 12 = 32
• So, outliers = Commute time < 0 or > 32; here, only 35 is an outlier
• Boxplot: Draw a vertical line between 5 and 35; Draw a box with 12 and 20;
Draw a median line at 15.5, Show outlier points (See next slide)
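
The hand calculation above can be reproduced with a short Python sketch. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which matches the Q1 = 12 and Q3 = 20 in this example; other quartile conventions can give slightly different values. The function name `five_number_summary` is an illustrative choice.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, with quartiles as medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # For an odd count, the middle value is left out of both halves
    upper_half = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]

commute_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]
minimum, q1, q2, q3, maximum = five_number_summary(commute_times)
print(minimum, q1, q2, q3, maximum)  # 5 12 15.5 20 35

# Outlier bounds: 1.5 * IQR below Q1 and above Q3
step = 1.5 * (q3 - q1)
low, high = q1 - step, q3 + step
outliers = [t for t in commute_times if t < low or t > high]
print(low, high, outliers)  # 0.0 32.0 [35]
```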
Outlier Code
import matplotlib.pyplot as plt
import seaborn as sns

# Data: commute times in minutes for 14 adults
commuter_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

# Create the box plot; passing the list as x draws it horizontally
plt.figure(figsize=(10, 6))
sns.boxplot(x=commuter_times)

# Add titles and labels
plt.title('Box Plot of Commuter Times')
plt.xlabel('Minutes')

# Show the plot
plt.show()
Resulting Boxplot
