Ads Exp 7
THEORY
Outliers are data points that significantly differ from the rest of the dataset. They can
arise due to measurement errors, natural variability, or external factors. Detecting and
handling outliers is essential for ensuring data quality and improving model accuracy.
1. Types of Outliers
A) Global Outliers
Global outliers, also known as point anomalies, are individual data points that deviate
significantly from the rest of the dataset.
Example:
● A student's test score of 10 in a class where all other scores range between 70
and 100.
● A house priced at $10 million in a neighborhood where the average price is
$300,000.
B) Contextual Outliers
Contextual outliers depend on the specific context of the data. A value may be an outlier
in one scenario but not in another.
Example:
● A temperature of 30 °C is typical in summer but would be an outlier in winter.
● A $500 purchase is routine for an electronics store but unusual for a coffee shop.
C) Collective Outliers
Collective outliers occur when a group of data points deviates from the expected
pattern, even though individual points may not appear anomalous.
Example:
● A burst of many small credit-card transactions within a few minutes, each
unremarkable on its own but anomalous as a group.
2. Importance of Outlier Detection
● Data quality: Outliers can arise due to errors in data collection or recording.
Identifying and correcting these errors improves data reliability.
● Business insight: Outliers often represent unusual events that may have business
significance, such as sudden demand surges, equipment failures, or fraudulent activities.
● Fraud detection: Detecting anomalous transactions in financial data can help identify
fraudulent activities and prevent financial losses.
3. Methods of Detecting Outliers
Several statistical and machine learning techniques can be used for outlier detection:
A) Statistical Methods
1. Z-Score Analysis: Flags data points that lie more than a chosen number of standard
deviations (commonly about 3) from the mean.
2. Interquartile Range (IQR): Flags points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR,
where Q1 and Q3 are the first and third quartiles (see the sketch after this list).
3. Box Plot Analysis: Visual representation of data distribution highlighting potential
outliers.
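The Z-score and IQR rules above can be sketched in a few lines of Python. The scores
array and the cutoff values below are illustrative assumptions, not data from this
experiment.

```python
import numpy as np

# Illustrative data: one extreme score among otherwise typical values
scores = np.array([72, 85, 90, 78, 95, 88, 10, 82, 91, 76], dtype=float)

# Z-score rule: flag points far from the mean in standard-deviation units.
# The textbook cutoff is 3; 2.5 is used here because z-scores are compressed
# in very small samples.
z = (scores - scores.mean()) / scores.std()
z_outliers = scores[np.abs(z) > 2.5]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = scores[(scores < lower) | (scores > upper)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```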
B) Machine Learning Methods
1. Isolation Forest: Detects outliers by randomly partitioning the dataset and
identifying points that require fewer splits to isolate (see the sketch after this list).
2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies
dense regions and marks low-density points as outliers.
3. Autoencoders: A neural-network-based method that reconstructs normal data well but
produces a high reconstruction error for anomalies.
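A minimal scikit-learn sketch of the Isolation Forest and DBSCAN approaches. The
synthetic 2-D data and the contamination, eps, and min_samples settings are assumptions
chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN

# Synthetic data: a dense cluster plus a few far-away points
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([inliers, anomalies])

# Isolation Forest: points isolated with fewer random splits are labelled -1
iso = IsolationForest(contamination=0.05, random_state=0)
iso_labels = iso.fit_predict(X)      # -1 = outlier, 1 = inlier

# DBSCAN: points that do not belong to any dense region are labelled -1 (noise)
db = DBSCAN(eps=0.8, min_samples=5)
db_labels = db.fit_predict(X)

print("Isolation Forest outliers:", np.sum(iso_labels == -1))
print("DBSCAN noise points:", np.sum(db_labels == -1))
```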
4. Handling Outliers
A) Removing Outliers
● Data points that are clearly the result of measurement or recording errors can be
dropped from the dataset.
B) Transforming Data
● Transformations such as a log or square-root transform reduce the influence of
extreme values.
C) Imputation
● Outlier values can be replaced with a more representative statistic, such as the
median (a sketch of options A-C follows this section).
D) Treating Separately
● In cases where outliers represent rare but important occurrences (e.g., fraud
detection), they should be analyzed separately rather than removed.
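A pandas sketch of options A-C above, applied to a single numeric column. The column
name "value", the sample data, and the IQR-based mask are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one extreme value
df = pd.DataFrame({"value": [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 400.0, 14.0]})

# Build an IQR-based outlier mask
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)

# A) Removing outliers: drop the flagged rows
removed = df[~mask]

# B) Transforming data: a log transform compresses extreme values
transformed = np.log1p(df["value"])

# C) Imputation: replace flagged values with the median of the remaining data
imputed = df["value"].mask(mask, df.loc[~mask, "value"].median())

print(removed.shape, transformed.max(), imputed.tolist())
```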
CONCLUSION
Outlier detection is a fundamental step in data analysis that helps ensure data quality
and improves decision-making; the right handling strategy depends on whether the
outliers reflect errors or genuinely significant events.