Lecture 3
And Techniques
Arfa Hassan
Standard Deviation Method
1. Any data point that lies more than a certain number of standard deviations away from the mean is considered an outlier. The
threshold can be set as multiples of the standard deviation (e.g., 2 or 3 standard deviations away).
2. k represents the number of standard deviations away from the mean beyond which a data point is considered an outlier. For
example, if k = 2, any data point lying more than 2 standard deviations away from the mean is flagged as an outlier.
Example: Let's consider the following dataset representing the scores of students in a class:
{65,70,72,75,78,80,82,85,90,95,150}
We will use this dataset to illustrate each outlier detection technique.
• Let's use 𝑘=2.
• Calculate the mean (𝜇) and standard deviation (𝜎) of the dataset.
• μ = (65+70+72+75+78+80+82+85+90+95+150) / 11 = 942 / 11 ≈ 85.636
• σ ≈ 21.997 (population standard deviation)
• Calculate the lower and upper thresholds:
• Lower Threshold: 85.636 − 2×21.997 = 41.642
• Upper Threshold: 85.636 + 2×21.997 = 129.630
• The data point 150 lies beyond the upper threshold, so it is considered an outlier.
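The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name std_outliers is hypothetical, and it assumes the population standard deviation (dividing by n) is the one intended by σ.

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

def std_outliers(data, k=2.0):
    """Return (lower, upper, outliers) for the rule mean ± k * std.

    Assumption: sigma is the population standard deviation (pstdev).
    """
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    lower = mu - k * sigma
    upper = mu + k * sigma
    # A point is flagged when it falls outside [lower, upper].
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

lower, upper, outliers = std_outliers(scores, k=2)
print(f"lower={lower:.3f}, upper={upper:.3f}, outliers={outliers}")
```

With k = 2, the only score flagged is 150. Switching to the sample standard deviation (statistics.stdev) widens the thresholds slightly but does not change the result for this dataset.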
Interquartile Range (IQR) Method
Outliers are identified based on the interquartile range, which is the difference between the third quartile (Q3) and the first quartile (Q1).
k determines the width of the "fences" used to identify outliers. Typically, k is set to 1.5, which means that the upper and lower fences are positioned at 1.5 times the IQR above Q3 and below Q1, respectively.
Example: Consider the same dataset of student scores used for the standard deviation method:
{65,70,72,75,78,80,82,85,90,95,150}
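1. Calculating Q1:
   1. Position of Q1 = (11+1)×0.25 = 12×0.25 = 3
   2. Since 3 is an integer, Q1 is the value at the 3rd position in the sorted dataset, which is 72.
2. Calculating Q3:
   1. Position of Q3 = (11+1)×0.75 = 12×0.75 = 9
   2. Since 9 is an integer, Q3 is the value at the 9th position in the sorted dataset, which is 90.
3. Calculating the IQR and the fences with k = 1.5:
   1. IQR = Q3 − Q1 = 90 − 72 = 18
   2. Lower fence: 72 − 1.5×18 = 45
   3. Upper fence: 90 + 1.5×18 = 117
   4. The data point 150 lies above the upper fence, so it is considered an outlier.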
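As a rough sketch, the IQR rule can be checked with Python's standard library. The function name iqr_outliers is hypothetical. The "exclusive" method of statistics.quantiles places quartiles at position (n+1)×p, which is the position rule used in this example; other tools default to different interpolation rules and may report slightly different quartiles.

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

def iqr_outliers(data, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence, outliers) for the IQR rule."""
    # method="exclusive" uses the (n + 1) * p position rule.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return q1, q3, lower_fence, upper_fence, outliers

q1, q3, lower_fence, upper_fence, outliers = iqr_outliers(scores)
print(q1, q3, lower_fence, upper_fence, outliers)
```

Because the fences depend only on the quartiles, the extreme score of 150 does not stretch them the way it stretches the standard deviation.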
Z-Score Method
1. The Z-score measures how many standard deviations a data point lies from the mean: Z = (x − μ) / σ.
2. Data points with |Z| > k (where k is typically set to 2 or 3) are considered outliers.
3. k sets the threshold for how many standard deviations away from the mean a data point must be to be
considered an outlier. For instance, if k = 3, any data point with a Z-score greater than 3 or less than −3 is
considered an outlier.
Example: Consider the same dataset of student scores:
{65,70,72,75,78,80,82,85,90,95,150}
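A short sketch of the Z-score rule on this dataset follows. The helper names z_scores and z_outliers are hypothetical, and the population standard deviation is assumed for σ.

```python
import statistics

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

def z_scores(data):
    """Z = (x - mean) / std for every point (population std assumed)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [(x - mu) / sigma for x in data]

def z_outliers(data, k=3.0):
    """Return the points whose absolute Z-score exceeds k."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > k]

print(z_outliers(scores, k=2))
print(z_outliers(scores, k=3))
```

On this small dataset the choice of k matters: the Z-score of 150 falls between 2 and 3, so it is flagged with k = 2 but not with k = 3. The outlier itself inflates the mean and the standard deviation, which pulls its own Z-score down.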
Modified Z-Score Method
Min-Max Scaling
1. Min-Max scaling rescales the data to a fixed range (e.g., [0, 1]).
2. It is sensitive to outliers because it uses the minimum and maximum values to
scale the data.
3. Formula: X_scaled = (X − X_min) / (X_max − X_min)
4. where X is the feature set, and X_min and X_max are its minimum and maximum values.
5. It may not be suitable for datasets with outliers.
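The sensitivity to outliers is easy to see in a small sketch. The function min_max_scale and its feature_range parameter are illustrative names, not part of any particular library.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so min maps to a and max maps to b."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [a for _ in values]
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]

scaled = min_max_scale(scores)
scaled_without_outlier = min_max_scale(scores[:-1])  # drop the score of 150

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in scaled_without_outlier])
```

With 150 included, the other ten scores are squeezed into roughly the lower third of [0, 1]. Without it, the same scores spread across the whole range.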
Power Transformation And Log Transformation
1. Power transformation (e.g., Box-Cox or Yeo-Johnson transformation) adjusts
the skewness of the data.
2. It can handle non-normality and reduce the impact of outliers.
3. It's suitable for datasets with skewed distributions.
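The effect on skewness can be illustrated with a hand-written Box-Cox transform. This is a sketch under stated assumptions: λ is fixed by hand here (λ = 0 gives the log transformation), whereas in practice λ is usually estimated from the data. The helper names box_cox and skewness are hypothetical, and the student scores from the outlier examples are reused as sample data.

```python
import math
import statistics

def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda; requires strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(x) for x in values]  # the log transformation
    return [(x ** lam - 1) / lam for x in values]

def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90, 95, 150]
log_scores = box_cox(scores, lam=0)

print(f"skewness before: {skewness(scores):.2f}")
print(f"skewness after log: {skewness(log_scores):.2f}")
```

The log transformation compresses the long right tail, so the skewness drops, although a single extreme value still leaves the data noticeably skewed. Yeo-Johnson extends the same idea to zero and negative values, which Box-Cox cannot handle.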
Sparse Data Handling
1. For sparse datasets (e.g., text data, high-dimensional data), normalization
techniques such as TF-IDF (Term Frequency-Inverse Document Frequency)
or L2 normalization can be applied.
2. TF-IDF assigns weights to terms based on their frequency and inverse
document frequency.
3. L2 normalization scales feature vectors to have a Euclidean norm of 1.
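Both ideas can be sketched on a toy corpus. The three documents below are invented purely for illustration, and the weighting tf × log(N / df) is only one common TF-IDF variant; libraries often add smoothing terms, so their numbers will differ. The function names tf_idf and l2_normalize are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted

def l2_normalize(vector):
    """Scale a sparse {term: weight} vector to Euclidean norm 1."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return dict(vector)  # leave an all-zero vector unchanged
    return {term: w / norm for term, w in vector.items()}

# Invented sample documents, already tokenized.
docs = [
    "the cat sat".split(),
    "the dog sat down".split(),
    "the cat chased the dog".split(),
]

weights = tf_idf(docs)
unit_vectors = [l2_normalize(w) for w in weights]
print(unit_vectors[0])
```

The word "the" appears in every document, so its inverse document frequency is log(1) = 0 and it receives zero weight. Storing each vector as a dictionary keeps only the terms that occur in a document, which suits sparse data.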
Normalization Technique
References
• https://developers.google.com/machine-learning/data-prep/transform/transform-categorical