ISAT 600 Progress Report 3
This week, several methods for outlier detection and treatment were investigated. These
methods were briefly tested to identify the best-suited ones. Nevertheless, more experiments
need to be carried out to weigh the pros and cons of each method. The findings from
researching existing methods for dealing with outliers are discussed below:
B. Outliers
Outliers are data points that differ significantly from the rest of a dataset. These
data points often lie far outside the range of typical observations. They can result from
variability in the data, measurement errors, or unusual but valid observations. Outliers can
skew statistical analyses and models, leading to misleading conclusions if not properly
addressed. In some cases, they may represent critical insights, such as identifying rare events
or anomalies, while in others, they might need to be removed or adjusted to ensure accurate
analysis. Detecting and managing outliers is an essential step in data preprocessing,
particularly in machine learning and statistical modeling.
Outlier detection
Types of outliers
Outliers can be broadly classified into three categories: global outliers, contextual
outliers, and collective outliers. Global outliers are data points that deviate from the rest of
the dataset based on the overall data distribution. These are the most common type of outlier,
where a single point stands out as anomalous. Contextual outliers, also known as conditional
outliers, appear anomalous in a specific context but may not be considered unusual in another
context. For example, a temperature of 40°C would be considered an outlier in a winter
dataset but not in a summer dataset. Collective outliers are a subset of data points that
collectively deviate from the overall dataset, even though individual points might not be
outliers by themselves. This is common in time series or sequential data, where certain
patterns of data points together indicate an anomaly.
Statistical methods for outlier detection
Several classical statistical methods are used for detecting outliers, with some of the
most common being Z-score and IQR (Interquartile Range).
Z-score (Standard score) method: This method relies on the mean and standard deviation of
the data. The Z-score measures how many standard deviations a data point is from the mean.
A point is flagged as an outlier if the absolute value of its Z-score exceeds a certain
threshold, typically set at 3. The Z-score formula is given by:
Zi = (Xi − μ) / σ
where Xi is the data point, μ is the mean, and σ is the standard deviation. This method works
best for data that is normally distributed but may fail when applied to skewed or non-
Gaussian distributions.
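As a minimal illustration of this rule, the sketch below flags points whose absolute Z-score exceeds 3 on a synthetic one-dimensional array with two injected outliers; the data and the threshold are assumptions made for the example, not values from the project dataset.

```python
import numpy as np

# Z-score sketch: flag points more than 3 standard deviations from the mean.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 500), [95.0, 4.0]])  # two injected outliers

z = (x - x.mean()) / x.std()      # Z_i = (x_i - mean) / std
outlier_mask = np.abs(z) > 3      # conventional threshold of 3
print(x[outlier_mask])
```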
IQR (Interquartile range) method: The IQR method uses the spread of the middle 50% of
the data to detect outliers. The interquartile range is calculated as the difference between the
third quartile (Q3) and the first quartile (Q1). Any data point that falls below Q1 − 1.5 × IQR
or above Q3 + 1.5 × IQR is considered an outlier. This approach is robust against non-normal
distributions and skewed data, making it more versatile than the Z-score method.
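As a small illustration, the sketch below applies the 1.5 × IQR fences to a made-up seven-point array; note that the extreme value 25.0 is flagged here even though its Z-score on the same small sample stays below 3, which reflects the robustness advantage mentioned above.

```python
import numpy as np

# IQR sketch: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 25.0, 10.3])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)   # 25.0 lies well above Q3 + 1.5*IQR
```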
Model-based methods
Model-based approaches, including Gaussian Mixture Models (GMM) and isolation
forests, provide more sophisticated techniques for detecting outliers by modeling the
underlying distribution of the data.
Gaussian mixture models (GMM): GMMs assume that the data is a mixture of several
Gaussian distributions, each corresponding to a different cluster or subpopulation within the
dataset. The likelihood of each data point belonging to one of these distributions is calculated,
and points with low likelihoods are flagged as outliers. This method is particularly useful
when dealing with multimodal data, where the presence of multiple distributions makes
simpler statistical methods ineffective.
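A hedged sketch of this idea using scikit-learn's GaussianMixture is given below: per-sample log-likelihoods are computed with score_samples and the lowest 1% are flagged. The two-component model, the synthetic bimodal data, and the 1% cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# GMM sketch: fit a 2-component mixture and flag the lowest-likelihood points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(8, 1, (200, 2)),
               [[4.0, -6.0]]])                      # one point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_likelihood = gmm.score_samples(X)               # per-sample log-likelihood
threshold = np.percentile(log_likelihood, 1)        # bottom 1% treated as outliers
outliers = X[log_likelihood < threshold]
```

In practice, the number of components would be chosen with a criterion such as BIC rather than fixed in advance.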
Isolation Forest: The Isolation Forest algorithm is a machine learning approach specifically
designed for anomaly detection. Unlike traditional clustering algorithms, which rely on
density or distance measures, Isolation Forest works by randomly partitioning the data and
isolating observations. The basic principle is that outliers, being rare and distinct, require
fewer partitions to be isolated compared to normal data points. This method is highly
efficient and works well in high-dimensional datasets.
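A brief sketch using scikit-learn's IsolationForest follows; the contamination rate and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Isolation Forest sketch: -1 labels mark points that are easy to isolate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 5)),          # inliers
               rng.uniform(-8, 8, (5, 5))])         # a few scattered anomalies

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)                              # -1 = outlier, +1 = inlier
outliers = X[labels == -1]
```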
Distance-based methods
Distance-based approaches rely on the proximity of data points to each other, often
using distance metrics such as Euclidean or Mahalanobis distance to detect outliers.
K-Nearest Neighbors (KNN) method: In the KNN-based approach, the distance between a
point and its k-nearest neighbors is calculated. Points with large distances from their
neighbors are identified as outliers. This method is intuitive and works well when the dataset
has a relatively uniform distribution. However, it may struggle with high-dimensional data
where the concept of "distance" becomes less meaningful due to the curse of dimensionality.
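The sketch below scores each point by its mean distance to its k nearest neighbours using scikit-learn's NearestNeighbors; k = 5 and the 99th-percentile cutoff are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# KNN-distance sketch: large average distance to neighbours => likely outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), [[7.0, 7.0]]])   # one isolated point

nn = NearestNeighbors(n_neighbors=6).fit(X)          # 6 = the point itself + 5 neighbours
distances, _ = nn.kneighbors(X)
score = distances[:, 1:].mean(axis=1)                # drop the zero self-distance
outliers = X[score > np.percentile(score, 99)]
```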
Mahalanobis distance: This is a distance measure that accounts for the correlation between
variables, making it more appropriate than the Euclidean distance in multivariate settings.
The Mahalanobis distance between a point and the center of the data distribution is
calculated, and points with large Mahalanobis distances are considered outliers. This method
assumes that the data follows a multivariate normal distribution, and thus it works well when
that assumption holds.
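A short sketch of this computation is given below; the chi-square cutoff on the squared distance and the synthetic correlated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# Mahalanobis-distance sketch: compare squared distances against a chi-square cutoff.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 300),
               [[3.0, -3.0]]])                        # unusual given the positive correlation

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)    # squared Mahalanobis distance
cutoff = chi2.ppf(0.999, df=X.shape[1])               # 99.9% chi-square quantile
outliers = X[d2 > cutoff]
```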
Autoencoders: Autoencoders are neural networks trained to compress the data into a lower-
dimensional representation and then reconstruct it. The reconstruction error, which measures
the difference between the original and reconstructed data, is used to detect outliers. Points
with large reconstruction errors are considered anomalies. This method is highly effective for
high-dimensional data, particularly in domains such as image and signal processing.
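The sketch below is a minimal PyTorch illustration of this idea: a small autoencoder is trained on the data and points with the largest reconstruction errors are flagged. The architecture, training length, synthetic data, and 99th-percentile cutoff are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Autoencoder sketch: large reconstruction error => likely outlier.
torch.manual_seed(0)
Z = torch.randn(500, 4)
X = Z @ torch.randn(4, 20) + 0.1 * torch.randn(500, 20)   # low-dimensional structure + noise
X[:5] += 6.0                                               # a few injected anomalies

model = nn.Sequential(nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):                            # brief full-batch training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    err = ((model(X) - X) ** 2).mean(dim=1)     # per-sample reconstruction error
outlier_idx = torch.nonzero(err > torch.quantile(err, 0.99)).flatten()
```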
One-Class SVM: One-Class SVM is a type of support vector machine that is trained only on
normal data. It creates a boundary that encompasses the majority of the data points, and
points that fall outside this boundary are considered outliers. This method is well-suited for
datasets where outliers are rare or hard to define explicitly.
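A short sketch using scikit-learn's OneClassSVM is shown below; the nu value (the assumed fraction of outliers) and the RBF kernel settings are illustrative, not tuned.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-Class SVM sketch: train on "normal" data only, then flag points outside the boundary.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))                 # normal data only
X_test = np.vstack([rng.normal(0, 1, (50, 2)), [[6.0, 6.0]]])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
labels = ocsvm.predict(X_test)                       # -1 = outlier, +1 = inlier
outliers = X_test[labels == -1]
```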
Treating outliers
Treating outliers in a dataset is essential for improving the quality and reliability of analyses.
The approach to handling outliers depends on the nature of the data, the context, and the purpose of
the analysis. There are several methods for addressing outliers, each with its strengths and limitations.
Below are the most common techniques for treating outliers; a combined code sketch of several of them follows the list:
1. Removing outliers: In some cases, simply removing outliers from the dataset is the
most appropriate solution. This is typically done when the outliers are likely the result
of data entry errors, measurement inaccuracies, or irrelevant observations that do not
represent the underlying data distribution.
2. Transforming data: If outliers cannot be removed because they represent valid data
points, transforming the data is an effective way to reduce their impact. Data
transformations can make the distribution more symmetric and reduce the influence of
extreme values.
3. Imputation of outliers: When outliers are likely due to data entry or measurement
errors but should not be removed entirely, imputation is a viable option. Imputation
involves replacing the outliers with more reasonable values based on the rest of the
dataset.
4. Treating outliers as a separate category: In some scenarios, outliers represent an
important subgroup within the dataset. Instead of removing or modifying them, these
outliers can be treated as a separate category for analysis.
5. Robust statistical models: When outliers cannot be easily removed or transformed,
using robust statistical techniques that are less sensitive to extreme values is a good
approach.
6. Trimming (truncation): Trimming involves removing a fixed percentage of the most
extreme data points at both the high and low ends of the data distribution. This is
similar to Winsorization, but instead of capping extreme values, the extreme data
points are entirely removed.
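The combined sketch below illustrates several of these treatments on a small made-up array: removal with the IQR fences, a log transformation, median imputation, winsorization, and trimming. The fence multiplier and the capping/trimming levels are illustrative choices, not project settings.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Treatment sketch on a 1-D array containing one extreme value (25.0).
x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 25.0, 10.3, 9.7, 10.0, 10.4])

q1, q3 = np.percentile(x, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
mask = (x >= lower) & (x <= upper)                    # True for inliers

removed = x[mask]                                     # 1. removal: keep only inliers
transformed = np.log1p(x)                             # 2. transformation: compress extremes
imputed = np.where(mask, x, np.median(x))             # 3. imputation: replace with the median
winsorized = winsorize(x, limits=(0.1, 0.1))          # capping at the 10th/90th percentiles
trimmed = np.sort(x)[1:-1]                            # 6. trimming: drop one value from each end
```

Which of these is appropriate depends on whether the flagged values are errors or genuine observations, as discussed above.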
WEEKLY PROGRESS
The tasks carried out this week are as follows:
- Tested several methods for missing data imputation.
- Experimented with different outlier detection and treatment procedures.