Lecture 3

Uploaded by ghania azhar

Data Science Tools

And Techniques
Arfa Hassan
Standard Deviation Method
1. Any data point that lies more than a certain number of standard deviations away from the mean is considered an outlier. The
threshold can be set as multiples of the standard deviation (e.g., 2 or 3 standard deviations away).

2. k represents the number of standard deviations away from the mean beyond which a data point is considered an outlier. For
example, if k = 2, any data point lying more than 2 standard deviations away from the mean is flagged as an outlier.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.
• Let's use 𝑘=2.
• Calculate the mean (𝜇) and standard deviation (𝜎) of the dataset.
• μ = (65+70+72+75+78+80+82+85+90+95+150)/11 = 942/11 ≈ 85.636
• σ ≈ 21.997 (population standard deviation)
• Calculate the lower and upper thresholds:
• Lower Threshold: 85.636 − 2 × 21.997 ≈ 41.642
• Upper Threshold: 85.636 + 2 × 21.997 ≈ 129.630
• The data point 150 lies beyond the upper threshold, so it is considered an outlier.
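The steps above can be sketched in plain Python (a minimal illustration, not from the lecture; it assumes the population standard deviation, as in the worked example):

```python
# Standard-deviation method: flag points more than k standard
# deviations from the mean.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
# Population standard deviation (divide by n, not n - 1).
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

k = 2
lower, upper = mean - k * std, mean + k * std
outliers = [x for x in scores if x < lower or x > upper]
print(outliers)  # → [150]
```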
Interquartile Range (IQR) Method
Outliers are identified based on the interquartile range, which is the difference between the third quartile (Q3) and the first quartile (Q1).

k determines the width of the "fences" used to identify outliers. Typically, k is set to 1.5, which means that the upper and lower fences are positioned at 1.5 times the IQR above Q3 and below Q1, respectively.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

1. Calculating Q1:
1. Position of Q1 = (11+1) × 0.25 = 3
2. Since 3 is an integer, Q1 is the value at the 3rd position in the sorted dataset, which is 72.
2. Calculating Q3:
1. Position of Q3 = (11+1) × 0.75 = 9
2. Since 9 is an integer, Q3 is the value at the 9th position in the sorted dataset, which is 90.
3. IQR = Q3 − Q1 = 90 − 72 = 18
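The (n+1)-position rule used above can be sketched as a small Python helper (an illustration, not lecture code; the function name is mine):

```python
scores = sorted([65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150])

def quartile(data, p):
    """(n+1)-position rule: if the position is fractional,
    interpolate between the neighbouring order statistics."""
    pos = (len(data) + 1) * p        # 1-based position
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return data[lo - 1]
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

q1 = quartile(scores, 0.25)   # position 3 → 72
q3 = quartile(scores, 0.75)   # position 9 → 90
iqr = q3 - q1                 # 18
```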
Z-Score Method

1. Calculates how many standard deviations a data point is from the mean.

2. Data points with |Z| > k (where k is typically set to 2 or 3) are considered outliers.
3. k sets the threshold for how many standard deviations away from the mean a data point must be to be
considered an outlier. For instance, if k = 3, any data point with a Z-score greater than 3 or less than -3 is
considered an outlier.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.
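Since the slides give no worked numbers for this method, here is a minimal Python sketch (mine, not from the lecture; it uses k = 2 and the population standard deviation):

```python
# Z-score method: Z = (x - mean) / std; flag |Z| > k.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

k = 2
z_scores = {x: (x - mean) / std for x in scores}
outliers = [x for x, z in z_scores.items() if abs(z) > k]
print(outliers)  # → [150], since Z(150) ≈ 2.93
```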
Modified Z-Score Method

1. Similar to the Z-Score method but more robust to outliers.


2. Formula: Modified Z-score = 0.6745 × (xᵢ − median) / MAD

3. where MAD is the median absolute deviation.


4. Data points with modified Z-scores greater than a threshold (e.g., 3.5) are considered outliers.
5. Similar to the Z-Score Method, k sets the threshold for identifying outliers based on modified Z-scores. Typically, a value of
k = 3.5 is used as a cutoff.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

• Calculate the median and median absolute deviation (MAD).


• Median: 80 (the 6th value in the sorted dataset)
• MAD: MAD = median(|xᵢ − median|) = median(|xᵢ − 80|) = 8
• Let's use k = 3.5.
• Calculate the modified Z-score for 150.
• Modified Z-Score = 0.6745 × (150 − 80) / 8 ≈ 5.902
• The modified Z-score for 150 is greater than 3.5, so it is considered an outlier.
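The same computation can be sketched with the standard library (an illustration, not lecture code):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

med = statistics.median(scores)                        # 80
mad = statistics.median(abs(x - med) for x in scores)  # 8

# Modified Z-score: 0.6745 * (x - median) / MAD; flag |M| > 3.5.
mod_z = {x: 0.6745 * (x - med) / mad for x in scores}
outliers = [x for x, z in mod_z.items() if abs(z) > 3.5]
print(outliers)  # → [150]
```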
Tukey's Fences

1. Another method based on the interquartile range.


2. Formula: Lower Fence = Q1 − k × IQR; Upper Fence = Q3 + k × IQR

3. Data points outside the fences are considered outliers.


4. Again, k determines how far out the fences extend from the quartiles. The standard choice for k is 1.5, which places
the fences at 1.5 times the IQR from the quartiles.
Example Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.

• Using the quartiles and IQR calculated earlier.


• Let's use k = 1.5.
• Calculate the lower and upper fences:
• Lower Fence: 72 − 1.5 × 18 = 45
• Upper Fence: 90 + 1.5 × 18 = 117
• The data point 150 lies beyond the upper fence, so it is considered an outlier.
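A minimal Python sketch of Tukey's fences (mine, not from the lecture; `statistics.quantiles` with its default "exclusive" method follows the same (n+1)-position rule as the quartile example):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

# Default method='exclusive' uses (n+1)-based quartile positions.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

k = 1.5
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
print(outliers)  # → [150]
```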
Normalization of Dataset
Standardization (Z-score normalization)
• Standardization transforms the data to have a mean of 0 and a standard
deviation of 1.
• It is less affected by outliers compared to other normalization
methods.
• Formula: z = (x − μ) / σ, where x is the original value, μ is the mean, and σ is
the standard deviation.
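Standardization on the example dataset can be sketched as follows (an illustration, not lecture code; it uses the population standard deviation):

```python
# Z-score normalization: subtract the mean, divide by the std.
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

mean = sum(scores) / len(scores)
std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5

standardized = [(x - mean) / std for x in scores]
# The result has mean 0 and standard deviation 1 (up to rounding).
```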
Robust Scaling

1. Robust scaling is similar to standardization but uses the interquartile range (IQR) instead of the standard deviation.
2. It is robust to outliers because it uses median and IQR instead of mean and
standard deviation.
3. Formula: X_scaled = (X − median(X)) / IQR(X)
4. where X is the feature set.
5. It's suitable for datasets with outliers.
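A minimal sketch of robust scaling with the standard library (mine, not from the lecture; `statistics.quantiles` defaults to the "exclusive" (n+1)-position method):

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

q1, med, q3 = statistics.quantiles(scores, n=4)  # 72, 80, 90
iqr = q3 - q1                                    # 18

# Centre on the median, scale by the IQR: the outlier 150 barely
# distorts the scale, unlike mean/std-based standardization.
robust_scaled = [(x - med) / iqr for x in scores]
```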
Clipping or Winsorization
1. Clipping or Winsorization involves setting a threshold and capping the
outliers to a certain value (e.g., the 95th or 99th percentile).
2. It reduces the impact of extreme outliers on normalization.
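A minimal sketch of upper-tail winsorization (mine, not from the lecture; it caps at the 90th percentile using a simple nearest-rank index, since the exact percentile definition is an implementation choice):

```python
scores = sorted([65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150])

# Nearest-rank index for the 90th percentile of 11 sorted values.
idx = int(0.90 * (len(scores) - 1))  # index 9
cap = scores[idx]                    # 95

# Cap (winsorize) everything above the threshold; 150 becomes 95.
winsorized = [min(x, cap) for x in scores]
```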
Min-Max Scaling:

1. Min-Max scaling rescales the data to a fixed range (e.g., [0, 1]).
2. It is sensitive to outliers because it uses the minimum and maximum values to
scale the data.
3. Formula: X_scaled = (X − X_min) / (X_max − X_min)
4. where X is the feature set.
5. It may not be suitable for datasets with outliers.
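A minimal sketch of min-max scaling on the example dataset (mine, not from the lecture), which also shows why the outlier is a problem: 150 pins the top of the range, squeezing all the other scores into a narrow band:

```python
scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

lo, hi = min(scores), max(scores)
scaled = [(x - lo) / (hi - lo) for x in scores]

# 150 maps to 1.0, but 95 only maps to ~0.35: the outlier
# compresses the rest of the data toward 0.
```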
Power Transformation And Log
Transformation
1. Power transformation (e.g., Box-Cox or Yeo-Johnson transformation) adjusts
the skewness of the data.
2. It can handle non-normality and reduce the impact of outliers.
3. It's suitable for datasets with skewed distributions.
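As a simple illustration of the idea (a log transform, the simplest member of this family; Box-Cox and Yeo-Johnson are parameterized generalizations not shown here):

```python
import math

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

# The log compresses large values more than small ones, pulling
# the right-skewed tail (150) closer to the rest of the data.
log_scores = [math.log(x) for x in scores]
# Use math.log1p(x) instead if the data can contain zeros.
```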
Sparse Data Handling
1. For sparse datasets (e.g., text data, high-dimensional data), normalization
techniques such as TF-IDF (Term Frequency-Inverse Document Frequency)
or L2 normalization can be applied.
2. TF-IDF assigns weights to terms based on their frequency and inverse
document frequency.
3. L2 normalization scales feature vectors to have a Euclidean norm of 1.
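L2 normalization can be sketched in a few lines (an illustration on a made-up feature vector, not lecture code):

```python
# L2 normalization: divide a vector by its Euclidean norm so the
# result has norm 1.
vec = [3.0, 4.0, 0.0]

norm = sum(v * v for v in vec) ** 0.5  # Euclidean norm = 5.0
unit = [v / norm for v in vec]         # [0.6, 0.8, 0.0]
```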
