ISAT 600 Progress Report 3
This week, several methods for outlier detection and treatment were investigated. These
methods were briefly tested to identify the best-suited ones. Nevertheless, more experiments
need to be carried out to weigh the pros and cons of each method. The findings from
researching existing methods for dealing with outliers are discussed below:
B. Outliers
Outliers are data points that differ significantly from the rest of a dataset. These
data points often lie far outside the range of typical observations. They can result from
variability in the data, measurement errors, or unusual but valid observations. Outliers can
skew statistical analyses and models, leading to misleading conclusions if not properly
addressed. In some cases, they may represent critical insights, such as identifying rare events
or anomalies, while in others, they might need to be removed or adjusted to ensure accurate
analysis. Detecting and managing outliers is an essential step in data preprocessing,
particularly in machine learning and statistical modeling.
Outlier detection
Types of outliers
Outliers can be broadly classified into three categories: global outliers, contextual
outliers, and collective outliers. Global outliers are data points that deviate from the rest of
the dataset based on the overall data distribution. These are the most common type of outlier,
where a single point stands out as anomalous. Contextual outliers, also known as conditional
outliers, appear anomalous in a specific context but may not be considered unusual in another
context. For example, a temperature of 40°C would be considered an outlier in a winter
dataset but not in a summer dataset. Collective outliers are a subset of data points that
collectively deviate from the overall dataset, even though individual points might not be
outliers by themselves. This is common in time series or sequential data, where certain
patterns of data points together indicate an anomaly.
Statistical methods for outlier detection
Several classical statistical methods are used for detecting outliers, with some of the
most common being Z-score and IQR (Interquartile Range).
Z-score (Standard score) method: This method relies on the mean and standard deviation of
the data. The Z-score measures how many standard deviations a data point is from the mean.
A point is flagged as an outlier if the absolute value of its Z-score exceeds a certain
threshold, typically set at 3. The Z-score formula is given by:
Zi = (Xi − μ) / σ
where Xi is the data point, μ is the mean, and σ is the standard deviation. This method works
best for data that is normally distributed but may fail when applied to skewed or non-
Gaussian distributions.
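As a minimal illustration of this rule, the sketch below flags points whose absolute Z-score exceeds 3 on a synthetic one-dimensional array with two injected outliers; the data and the threshold are assumptions made for the example, not values from the project dataset.

```python
import numpy as np

# Z-score sketch: flag points more than 3 standard deviations from the mean.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 500), [95.0, 4.0]])  # two injected outliers

z = (x - x.mean()) / x.std()      # Z_i = (x_i - mean) / std
outlier_mask = np.abs(z) > 3      # conventional threshold of 3
print(x[outlier_mask])
```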
IQR (Interquartile range) method: The IQR method uses the spread of the middle 50% of
the data to detect outliers. The interquartile range is calculated as the difference between the
third quartile (Q3) and the first quartile (Q1). Any data point that falls below Q1 − 1.5 × IQR
or above Q3 + 1.5 × IQR is considered an outlier. This approach is robust against non-normal
distributions and skewed data, making it more versatile than the Z-score method.
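As a small illustration, the sketch below applies the 1.5 × IQR fences to a made-up seven-point array; note that the extreme value 25.0 is flagged here even though its Z-score on the same small sample stays below 3, which reflects the robustness advantage mentioned above.

```python
import numpy as np

# IQR sketch: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 25.0, 10.3])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)   # 25.0 lies well above Q3 + 1.5*IQR
```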
Model-based methods
Model-based approaches, including Gaussian Mixture Models (GMM) and isolation
forests, provide more sophisticated techniques for detecting outliers by modeling the
underlying distribution of the data.
Gaussian mixture models (GMM): GMMs assume that the data is a mixture of several
Gaussian distributions, each corresponding to a different cluster or subpopulation within the
dataset. The likelihood of each data point belonging to one of these distributions is calculated,
and points with low likelihoods are flagged as outliers. This method is particularly useful
when dealing with multimodal data, where the presence of multiple distributions makes
simpler statistical methods ineffective.
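A hedged sketch of this idea using scikit-learn's GaussianMixture is given below: per-sample log-likelihoods are computed with score_samples and the lowest 1% are flagged. The two-component model, the synthetic bimodal data, and the 1% cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GMM sketch: fit a 2-component mixture and flag the lowest-likelihood points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(8, 1, (200, 2)),
               [[4.0, -6.0]]])                      # one point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_likelihood = gmm.score_samples(X)               # per-sample log-likelihood
threshold = np.percentile(log_likelihood, 1)        # bottom 1% treated as outliers
outliers = X[log_likelihood < threshold]
```

In practice, the number of components would be chosen with a criterion such as BIC rather than fixed in advance.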
Isolation Forest: The Isolation Forest algorithm is a machine learning approach specifically
designed for anomaly detection. Unlike traditional clustering algorithms, which rely on
density or distance measures, Isolation Forest works by randomly partitioning the data and
isolating observations. The basic principle is that outliers, being rare and distinct, require
fewer partitions to be isolated compared to normal data points. This method is highly
efficient and works well in high-dimensional datasets.
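A brief sketch using scikit-learn's IsolationForest follows; the contamination rate and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Isolation Forest sketch: -1 labels mark points that are easy to isolate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 5)),          # inliers
               rng.uniform(-8, 8, (5, 5))])         # a few scattered anomalies

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)                              # -1 = outlier, +1 = inlier
outliers = X[labels == -1]
```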
Distance-based methods
Distance-based approaches rely on the proximity of data points to each other, often
using distance metrics such as Euclidean or Mahalanobis distance to detect outliers.
K-Nearest Neighbors (KNN) method: In the KNN-based approach, the distance between a
point and its k-nearest neighbors is calculated. Points with large distances from their
neighbors are identified as outliers. This method is intuitive and works well when the dataset
has a relatively uniform distribution. However, it may struggle with high-dimensional data
where the concept of "distance" becomes less meaningful due to the curse of dimensionality.
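The sketch below scores each point by its mean distance to its k nearest neighbours using scikit-learn's NearestNeighbors; k = 5 and the 99th-percentile cutoff are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# KNN-distance sketch: large average distance to neighbours => likely outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), [[7.0, 7.0]]])   # one isolated point

nn = NearestNeighbors(n_neighbors=6).fit(X)          # 6 = the point itself + 5 neighbours
distances, _ = nn.kneighbors(X)
score = distances[:, 1:].mean(axis=1)                # drop the zero self-distance
outliers = X[score > np.percentile(score, 99)]
```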
Mahalanobis distance: This is a distance measure that accounts for the correlation between
variables, making it more appropriate than the Euclidean distance in multivariate settings.
The Mahalanobis distance between a point and the center of the data distribution is
calculated, and points with large Mahalanobis distances are considered outliers. This method
assumes that the data follows a multivariate normal distribution, and thus it works well when
that assumption holds.
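A short sketch of this computation is given below; the chi-square cutoff on the squared distance and the synthetic correlated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# Mahalanobis-distance sketch: compare squared distances against a chi-square cutoff.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 300),
               [[3.0, -3.0]]])                        # unusual given the positive correlation

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)    # squared Mahalanobis distance
cutoff = chi2.ppf(0.999, df=X.shape[1])               # 99.9% chi-square quantile
outliers = X[d2 > cutoff]
```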
Autoencoders: Autoencoders are neural networks trained to compress the data into a lower-
dimensional representation and then reconstruct it. The reconstruction error, which measures
the difference between the original and reconstructed data, is used to detect outliers. Points
with large reconstruction errors are considered anomalies. This method is highly effective for
high-dimensional data, particularly in domains such as image and signal processing.
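The sketch below is a minimal PyTorch illustration of this idea: a small autoencoder is trained on the data and points with the largest reconstruction errors are flagged. The architecture, training length, synthetic data, and 99th-percentile cutoff are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Autoencoder sketch: large reconstruction error => likely outlier.
torch.manual_seed(0)
Z = torch.randn(500, 4)
X = Z @ torch.randn(4, 20) + 0.1 * torch.randn(500, 20)   # low-dimensional structure + noise
X[:5] += 6.0                                               # a few injected anomalies

model = nn.Sequential(nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):                            # brief full-batch training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    err = ((model(X) - X) ** 2).mean(dim=1)     # per-sample reconstruction error
outlier_idx = torch.nonzero(err > torch.quantile(err, 0.99)).flatten()
```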
One-Class SVM: One-Class SVM is a type of support vector machine that is trained only on
normal data. It creates a boundary that encompasses the majority of the data points, and
points that fall outside this boundary are considered outliers. This method is well-suited for
datasets where outliers are rare or hard to define explicitly.
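A short sketch using scikit-learn's OneClassSVM is shown below; the nu value (the assumed fraction of outliers) and the RBF kernel settings are illustrative, not tuned.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-Class SVM sketch: train on "normal" data only, then flag points outside the boundary.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))                 # normal data only
X_test = np.vstack([rng.normal(0, 1, (50, 2)), [[6.0, 6.0]]])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
labels = ocsvm.predict(X_test)                       # -1 = outlier, +1 = inlier
outliers = X_test[labels == -1]
```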
Treating outliers
Treating outliers in a dataset is essential for improving the quality and reliability of analyses.
The approach to handling outliers depends on the nature of the data, the context, and the purpose of
the analysis. There are several methods for addressing outliers, each with its strengths and limitations.
Below are the most common techniques for treating outliers; a combined code sketch of several of them follows the list:
1. Removing outliers: In some cases, simply removing outliers from the dataset is the
most appropriate solution. This is typically done when the outliers are likely the result
of data entry errors, measurement inaccuracies, or irrelevant observations that do not
represent the underlying data distribution.
2. Transforming data: If outliers cannot be removed because they represent valid data
points, transforming the data is an effective way to reduce their impact. Data
transformations can make the distribution more symmetric and reduce the influence of
extreme values.
3. Imputation of outliers: When outliers are likely due to data entry or measurement
errors but should not be removed entirely, imputation is a viable option. Imputation
involves replacing the outliers with more reasonable values based on the rest of the
dataset.
4. Treating outliers as a separate category: In some scenarios, outliers represent an
important subgroup within the dataset. Instead of removing or modifying them, these
outliers can be treated as a separate category for analysis.
5. Robust statistical models: When outliers cannot be easily removed or transformed,
using robust statistical techniques that are less sensitive to extreme values is a good
approach.
6. Trimming (truncation): Trimming involves removing a fixed percentage of the most
extreme data points at both the high and low ends of the data distribution. This is
similar to Winsorization, but instead of capping extreme values, the extreme data
points are entirely removed.
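The combined sketch below illustrates several of these treatments on a small made-up array: removal with the IQR fences, a log transformation, median imputation, winsorization, and trimming. The fence multiplier and the capping/trimming levels are illustrative choices, not project settings.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Treatment sketch on a 1-D array containing one extreme value (25.0).
x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 25.0, 10.3, 9.7, 10.0, 10.4])

q1, q3 = np.percentile(x, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
mask = (x >= lower) & (x <= upper)                    # True for inliers

removed = x[mask]                                     # 1. removal: keep only inliers
transformed = np.log1p(x)                             # 2. transformation: compress extremes
imputed = np.where(mask, x, np.median(x))             # 3. imputation: replace with the median
winsorized = winsorize(x, limits=(0.1, 0.1))          # capping at the 10th/90th percentiles
trimmed = np.sort(x)[1:-1]                            # 6. trimming: drop one value from each end
```

Which of these is appropriate depends on whether the flagged values are errors or genuine observations, as discussed above.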
WEEKLY PROGRESS
The tasks carried out this week are as follows:
- Tested several methods for missing data imputation.
- Experimented with different outlier detection and treatment procedures.