Lecture 3

Uploaded by ghania azhar

Data Science Tools

And Techniques
Arfa Hassan
Standard Deviation Method
1. Any data point that lies more than a certain number of standard deviations away from the mean is considered an outlier. The
threshold can be set as multiples of the standard deviation (e.g., 2 or 3 standard deviations away).

2. k represents the number of standard deviations away from the mean beyond which a data point is considered an outlier. For
example, if k = 2, any data point lying more than 2 standard deviations away from the mean is flagged as an outlier.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.
• Let's use 𝑘=2.
• Calculate the mean (𝜇) and standard deviation (𝜎) of the dataset.
• μ = (65+70+72+75+78+80+82+85+90+95+150)/11 = 942/11 ≈ 85.636
• σ ≈ 21.997 (population standard deviation)
• Calculate the lower and upper thresholds:
• Lower Threshold: 85.636 − 2 × 21.997 ≈ 41.642
• Upper Threshold: 85.636 + 2 × 21.997 ≈ 129.630
• The data point 150 lies beyond the upper threshold, so it is considered an outlier.
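The steps above can be sketched in plain Python (a minimal illustration, not from the lecture; it assumes the population standard deviation, as in the worked example):

```python
# Standard-deviation method: flag points more than k standard
# deviations from the mean.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
# Population standard deviation (divide by n, not n - 1).
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

k = 2
lower, upper = mean - k * std, mean + k * std
outliers = [x for x in scores if x < lower or x > upper]
print(outliers)  # → [150]
```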
Interquartile Range (IQR) Method
Outliers are identified based on the interquartile range, which is the difference between the third quartile (Q3) and the first quartile (Q1).

k determines the width of the "fences" used to identify outliers. Typically, k is set to 1.5, which means that the upper and lower fences are positioned at 1.5 times the IQR above Q3 and below Q1, respectively.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

1. Calculating Q1:
1. Position of Q1 = (11+1) × 0.25 = 3
2. Since 3 is an integer, Q1 is the value at the 3rd position in the sorted dataset, which is 72.
2. Calculating Q3:
1. Position of Q3 = (11+1) × 0.75 = 9
2. Since 9 is an integer, Q3 is the value at the 9th position in the sorted dataset, which is 90.
3. IQR = Q3 − Q1 = 90 − 72 = 18
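The (n+1)-position rule used above can be sketched as a small Python helper (an illustration, not lecture code; the function name is mine):

```python
scores = sorted([65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150])

def quartile(data, p):
    """(n+1)-position rule: if the position is fractional,
    interpolate between the neighbouring order statistics."""
    pos = (len(data) + 1) * p        # 1-based position
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return data[lo - 1]
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

q1 = quartile(scores, 0.25)   # position 3 → 72
q3 = quartile(scores, 0.75)   # position 9 → 90
iqr = q3 - q1                 # 18
```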
Z-Score Method

1. Calculates how many standard deviations a data point is from the mean.

2. Data points with |Z| > k (where k is typically set to 2 or 3) are considered outliers.
3. k sets the threshold for how many standard deviations away from the mean a data point must be to be
considered an outlier. For instance, if k = 3, any data point with a Z-score greater than 3 or less than -3 is
considered an outlier.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.
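Since the slides give no worked numbers for this method, here is a minimal Python sketch (mine, not from the lecture; it uses k = 2 and the population standard deviation):

```python
# Z-score method: Z = (x - mean) / std; flag |Z| > k.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

k = 2
z_scores = {x: (x - mean) / std for x in scores}
outliers = [x for x, z in z_scores.items() if abs(z) > k]
print(outliers)  # → [150], since Z(150) ≈ 2.93
```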
Modified Z-Score Method

1. Similar to the Z-Score method but more robust to outliers.


2. Formula: Modified Z-score = 0.6745 × (xᵢ − median) / MAD

3. where MAD is the median absolute deviation.


4. Data points with modified Z-scores greater than a threshold (e.g., 3.5) are considered outliers.
5. Similar to the Z-Score Method, k sets the threshold for identifying outliers based on modified Z-scores. Typically, a value of
k = 3.5 is used as a cutoff.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

• Calculate the median and median absolute deviation (MAD).


• Median: 80 (the 6th value in the sorted dataset)
• MAD: MAD = median(|xᵢ − median|) = median(|xᵢ − 80|) = 8
• Let's use k = 3.5.
• Calculate the modified Z-score for 150.
• Modified Z-Score = 0.6745 × (150 − 80) / 8 ≈ 5.902
• The modified Z-score for 150 is greater than 3.5, so it is considered an outlier.
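The same computation can be sketched with the standard library (an illustration, not lecture code):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

med = statistics.median(scores)                        # 80
mad = statistics.median(abs(x - med) for x in scores)  # 8

# Modified Z-score: 0.6745 * (x - median) / MAD; flag |M| > 3.5.
mod_z = {x: 0.6745 * (x - med) / mad for x in scores}
outliers = [x for x, z in mod_z.items() if abs(z) > 3.5]
print(outliers)  # → [150]
```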
Tukey's Fences

1. Another method based on the interquartile range.


2. Formula: Lower Fence = Q1 − k × IQR; Upper Fence = Q3 + k × IQR

3. Data points outside the fences are considered outliers.


4. Again, k determines how far out the fences extend from the quartiles. The standard choice for k is 1.5, which places
the fences at 1.5 times the IQR from the quartiles.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

• Using the quartiles and IQR calculated earlier.


• Let's use k = 1.5.
• Calculate the lower and upper fences:
• Lower Fence: 72 − 1.5 × 18 = 45
• Upper Fence: 90 + 1.5 × 18 = 117
• The data point 150 lies beyond the upper fence, so it is considered an outlier.
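A minimal Python sketch of Tukey's fences (mine, not from the lecture; `statistics.quantiles` with its default "exclusive" method follows the same (n+1)-position rule as the quartile example):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

# Default method='exclusive' uses (n+1)-based quartile positions.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

k = 1.5
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(outliers)  # → [150]
```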
Normalization of Dataset
Standardization (Z-score normalization)
• Standardization transforms the data to have a mean of 0 and a standard
deviation of 1.
• It is less affected by outliers compared to other normalization
methods.
• Formula: z = (x − μ) / σ, where x is the original value, μ is the mean, and σ is
the standard deviation.
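Standardization on the example dataset can be sketched as follows (an illustration, not lecture code; it uses the population standard deviation):

```python
# Z-score normalization: subtract the mean, divide by the std.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

standardized = [(x - mean) / std for x in scores]
# The result has mean 0 and standard deviation 1 (up to rounding).
```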
Robust Scaling

1. Robust scaling is similar to standardization but uses the interquartile range (IQR) instead of the standard deviation.
2. It is robust to outliers because it uses median and IQR instead of mean and
standard deviation.
3. Formula: X_scaled = (X − median(X)) / IQR(X)
4. where X is the feature set.
5. It's suitable for datasets with outliers.
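A minimal sketch of robust scaling with the standard library (mine, not from the lecture; `statistics.quantiles` defaults to the "exclusive" (n+1)-position method):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

q1, med, q3 = statistics.quantiles(scores, n=4)  # 72, 80, 90
iqr = q3 - q1                                    # 18

# Centre on the median, scale by the IQR: the outlier 150 barely
# distorts the scale, unlike mean/std-based standardization.
robust_scaled = [(x - med) / iqr for x in scores]
```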
Clipping or Winsorization
1. Clipping or Winsorization involves setting a threshold and capping the
outliers to a certain value (e.g., the 95th or 99th percentile).
2. It reduces the impact of extreme outliers on normalization.
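A minimal sketch of upper-tail winsorization (mine, not from the lecture; it caps at the 90th percentile using a simple nearest-rank index, since the exact percentile definition is an implementation choice):

```python
scores = sorted([65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150])

# Nearest-rank index for the 90th percentile of 11 sorted values.
idx = int(0.90 * (len(scores) - 1))  # index 9
cap = scores[idx]                    # 95

# Cap (winsorize) everything above the threshold; 150 becomes 95.
winsorized = [min(x, cap) for x in scores]
```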
Min-Max Scaling:

1. Min-Max scaling rescales the data to a fixed range (e.g., [0, 1]).
2. It is sensitive to outliers because it uses the minimum and maximum values to
scale the data.
3. Formula: X_scaled = (X − X_min) / (X_max − X_min)
4. where X is the feature set.
5. It may not be suitable for datasets with outliers.
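A minimal sketch of min-max scaling on the example dataset (mine, not from the lecture), which also shows why the outlier is a problem: 150 pins the top of the range, squeezing all the other scores into a narrow band:

```python
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

lo, hi = min(scores), max(scores)
scaled = [(x - lo) / (hi - lo) for x in scores]

# 150 maps to 1.0, but 95 only maps to ~0.35: the outlier
# compresses the rest of the data toward 0.
```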
Power Transformation And Log
Transformation
1. Power transformation (e.g., Box-Cox or Yeo-Johnson transformation) adjusts
the skewness of the data.
2. It can handle non-normality and reduce the impact of outliers.
3. It's suitable for datasets with skewed distributions.
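As a simple illustration of the idea (a log transform, the simplest member of this family; Box-Cox and Yeo-Johnson are parameterized generalizations not shown here):

```python
import math

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

# The log compresses large values more than small ones, pulling
# the right-skewed tail (150) closer to the rest of the data.
log_scores = [math.log(x) for x in scores]
# Use math.log1p(x) instead if the data can contain zeros.
```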
Sparse Data Handling
1. For sparse datasets (e.g., text data, high-dimensional data), normalization
techniques such as TF-IDF (Term Frequency-Inverse Document Frequency)
or L2 normalization can be applied.
2. TF-IDF assigns weights to terms based on their frequency and inverse
document frequency.
3. L2 normalization scales feature vectors to have a Euclidean norm of 1.
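L2 normalization can be sketched in a few lines (an illustration on a made-up feature vector, not lecture code):

```python
# L2 normalization: divide a vector by its Euclidean norm so the
# result has norm 1.
vec = [3.0, 4.0, 0.0]

norm = sum(v * v for v in vec) ** 0.5  # Euclidean norm = 5.0
unit = [v / norm for v in vec]         # [0.6, 0.8, 0.0]
```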
