Advanced Data Analysis Techniques 3

The document discusses advanced data analysis techniques, focusing on anomaly detection and forecasting methods. Anomaly detection identifies patterns that deviate from normal behavior, while forecasting predicts future events based on historical data. It highlights various techniques, their applications, challenges, and the importance of data quality and appropriate methods for effective analysis.

Uploaded by

gurjibecha88


Advanced Data Analysis Techniques (Part 3)

Overview

Forecasting Techniques
Anomaly Detection
Anomaly Detection
What is Anomaly Detection?
Anomaly detection, sometimes called outlier detection, is the process of finding patterns or instances in a dataset that deviate significantly from the expected or “normal” behavior.

The definition of both “normal” and anomalous data varies significantly depending on the context. Below are a few examples of anomaly detection in action.

[Figure: normal data points versus an anomaly/outlier]

Examples of Anomaly Detection

Financial transactions
Normal: Routine purchases and consistent spending by an individual in London.
Outlier: A massive withdrawal from Ireland from the same account, hinting at potential fraud.

Network traffic in cybersecurity
Normal: Regular communication, steady data transfer, and adherence to protocol.
Outlier: Abrupt increase in data transfer or use of unknown protocols, signaling a potential breach or malware.

Patient vital signs monitoring
Normal: Stable heart rate and consistent blood pressure.
Outlier: Sudden increase in heart rate and decrease in blood pressure, indicating a potential emergency or equipment failure.
Types of Anomalies

Outliers
Outliers are abnormal or extreme data points that exist only in the training data.
For example, consider a dataset of daily temperatures in a city. Most days, the temperatures range between 20°C and 30°C. However, one day, there is a spike to 40°C.

Novelties
Novelties are new or previously unseen instances compared to the original (training) data.
For example, a city installs a new, more accurate weather monitoring station. As a result, the dataset starts consistently recording slightly higher temperatures, ranging from 25°C to 35°C.
Classification of Outliers

Univariate Outliers
Exist in a single variable or feature in isolation.
Univariate outliers are extreme or abnormal values that deviate from the typical range of values for that specific feature.
Example: comparing the price of houses in a community.

Multivariate Outliers
Found by combining the values of multiple variables at the same time.
Example: comparing the price of houses in a community while considering the number of bedrooms.

[Figure: two charts of prices (1,000 to 10,000) for houses A to D; the multivariate chart also labels each house with its number of bedrooms]
Techniques for Anomaly Detection

Statistical Methods
Z-score: Measures how many standard deviations a data point is from the mean.
Box Plot: Visualizes the distribution of data and identifies outliers based on quartiles.
Interquartile Range (IQR): A measure of statistical dispersion, often used to identify outliers.

Machine Learning Methods
K-Nearest Neighbors (KNN): Calculates the distance between a data point and its k-nearest neighbors. Anomalies are typically far from their neighbors.
Isolation Forest: Isolates anomalies by randomly partitioning the data space. Anomalies are typically isolated with fewer partitions.

Clustering Methods
DBSCAN: Groups data points based on density. Anomalies are typically in low-density regions.
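The KNN idea can be illustrated with a minimal sketch that scores each point by its mean distance to its k nearest neighbours. The function name `knn_scores` and the five 2-D points are hypothetical and chosen only for illustration; real projects would normally rely on a library implementation.

```python
import math

def knn_scores(points, k=2):
    """Score each point by the mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: a tight cluster plus one distant point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
scores = knn_scores(points, k=2)

# The point with the largest score is the strongest anomaly candidate.
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

A larger score means the point is farther from its neighbours; where to draw the cut-off between normal and anomalous is a judgment call that depends on the data.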
Z Score (standard score)
The z-score measures how many standard deviations a data point is away from the mean.
It allows us to compare data points from different distributions or to understand the relative position of a data point within a single distribution.

Generally, instances with an absolute z-score over 3 are treated as outliers.

z = (x - μ) / σ

Where:
x is the data point
μ is the mean of the distribution
σ is the standard deviation of the distribution
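Z Score (Example)

Age: 15 16 16 17 18 16 17 16 25 17

Mean = 17.3
Standard deviation (population) ≈ 2.685
z = (x - μ) / σ
Z = (25 – 17.3) / 2.685
Z ≈ 2.87

Z Score Table
For z = 2.87, the standard normal table gives 0.9979.
If ages were normally distributed, about 99.79% of the students would be below the age of 25.
Because 2.87 is below the usual threshold of 3, the z-score rule on its own does not flag age 25 as an outlier.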
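The arithmetic in this example can be checked with a short sketch using only Python's standard library. It assumes the population standard deviation (`pstdev`), which is what reproduces the slide's figure of about 2.68; the sample standard deviation would give a slightly larger value and a smaller z-score.

```python
from statistics import NormalDist, mean, pstdev

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

mu = mean(ages)        # mean of the ages
sigma = pstdev(ages)   # population standard deviation (assumption, see above)

# z-score of every observation.
z_scores = [(x - mu) / sigma for x in ages]
z_25 = z_scores[ages.index(25)]

# Share of a standard normal distribution lying below z_25 (the z-table lookup).
share_below = NormalDist().cdf(z_25)

# Observations whose absolute z-score exceeds the usual threshold of 3.
flagged = [x for x, z in zip(ages, z_scores) if abs(z) > 3]

print(round(mu, 1), round(sigma, 3), round(z_25, 2), round(share_below, 4), flagged)
```

With this data the list `flagged` is empty, which shows that a single extreme value in a small sample can inflate the standard deviation enough to stay under the threshold.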
Interquartile range (IQR)

The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution. It covers the middle 50% of the data.

IQR = Q3 – Q1
Calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

An instance is considered an outlier when it lies below Q1 or above Q3 by more than some multiple of the IQR.

The most common multiplier is 1.5, so values outside the range [Q1 – 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
Inter Quartile Range (Example)

Age: 15 16 16 17 18 16 17 16 25 17
Sorted: 15 16 16 16 16 17 17 17 18 25

Q1 = 16, Q2 = 16.5, Q3 = 17, Max = 25
IQR = Q3 – Q1
IQR = 17 – 16
IQR = 1
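A minimal sketch of the IQR rule applied to the same ages follows. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces the values in this example; other quartile conventions (for instance, interpolation-based percentiles) can give slightly different results. The helper names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention (an assumption)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For an odd number of values, the middle value is left out of both halves.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two limits."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high], (low, high)

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
q1, q2, q3 = quartiles(ages)
outliers, limits = iqr_outliers(ages)
print(q1, q2, q3, limits, outliers)
```

Unlike the z-score rule, the IQR rule flags age 25 here, because quartiles are not pulled upward by the extreme value.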
Boxplots

Box plots, also known as box-and-whisker plots, are a graphical representation of the distribution of a dataset.

They provide a concise summary of five key statistics:

Minimum: The smallest value in the dataset.
First Quartile (Q1): The value below which 25% of the data falls.
Median (Q2): The middle value of the dataset.
Third Quartile (Q3): The value below which 75% of the data falls.
Maximum: The largest value in the dataset.
Structure of Boxplots

A box plot consists of a box and two whiskers (lines) extending from the box.

Box: Represents the interquartile range (IQR), spanning Q1 to Q3.
Median Line: A line within the box that indicates the median value.
Whiskers: Extend from the box to the minimum and maximum values within a certain range.
Outliers: Data points that fall outside the range of the whiskers, often marked with dots or asterisks.
Boxplot (Example)

Age: 15 16 16 17 18 16 17 16 25 17
Sorted: 15 16 16 16 16 17 17 17 18 25

Min = 15
Q1 = 16
Q2 = 16.5
Q3 = 17
Max = 25

Lower whisker limit = Q1 – 1.5 * IQR = 16 – 1.5 * 1 = 14.5
Upper whisker limit = Q3 + 1.5 * IQR = 17 + 1.5 * 1 = 18.5

The value 25 lies above the upper limit of 18.5, so it is plotted as an outlier.

[Figure: box plot of the ages on an axis from 14.5 to 18.5]
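The numbers behind a box plot of these ages can be sketched as follows. The values 14.5 and 18.5 are the 1.5 × IQR limits; plotting libraries commonly draw each whisker only as far as the most extreme data point that still lies inside those limits, and the sketch reports both. The function name `boxplot_stats` and the median-of-halves quartile convention are assumptions made for illustration.

```python
from statistics import median

def boxplot_stats(values, k=1.5):
    """Five-number summary, 1.5*IQR limits, whisker ends and outliers."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    low_limit, high_limit = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in s if low_limit <= v <= high_limit]
    return {
        "min": s[0], "q1": q1, "median": median(s), "q3": q3, "max": s[-1],
        "limits": (low_limit, high_limit),
        # Whiskers end at the most extreme observations inside the limits.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in s if v < low_limit or v > high_limit],
    }

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
stats = boxplot_stats(ages)
print(stats)
```

For this data the whiskers would be drawn at 15 and 18, with 25 shown as a separate outlier marker.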
Importance of Anomaly Detection

Proactive Issue Resolution

Improved Efficiency and Productivity

Enhanced Security

Better Decision-Making

Cost Savings
Applications of Anomaly Detection

Fraud Detection

Network Intrusion Detection

Quality Control

Sensor Data Analysis

System Health Monitoring

Challenges of Anomaly Detection

Defining “normal” behavior

Imbalanced Data

High Dimensionality (handling high-dimensional data)

Evolving Data Patterns

Forecasting
Techniques
What is Forecasting?

Forecasting is the process of making predictions about future events based on historical data and assumptions about future trends.

It is a crucial tool for businesses, governments, and individuals to make informed decisions and plan for the future.
Main Types of Forecasting Methods

Straight-line Method: One of the simplest and easiest-to-follow forecasting methods. Uses historical figures and trends to predict future revenue growth.

Moving Average: Calculates the average of a specific number of past data points to predict the future value.

Simple Linear Regression: Predicts the value of a dependent variable based on the value of an independent variable.

Multiple Linear Regression: Predicts the value of a dependent variable based on the values of multiple independent variables.
Straight-Line Method

Revenues for 2015 and 2016 are historical; the remaining years are forecast by applying the 4.0% growth rate to the previous year's revenue.

Year   Period       Sales Growth Percent   Revenues
2015   Historical   4.0%                   81,422
2016   Historical   4.0%                   84,698
2017   Forecast     4.0%                   88,086
2018   Forecast     4.0%                   91,609
2019   Forecast     4.0%                   95,274
2020   Forecast     4.0%                   99,085
2021   Forecast     4.0%                   103,048
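The forecast revenues can be reproduced with a minimal sketch that compounds the last historical figure at a constant growth rate. It assumes the 2016 revenue of 84,698 as the starting point and keeps unrounded intermediate values, rounding only for display; the function name is hypothetical.

```python
def straight_line_forecast(last_value, growth_rate, periods):
    """Compound last_value at a constant growth_rate for the given periods."""
    forecasts = []
    value = last_value
    for _ in range(periods):
        value *= 1 + growth_rate   # carry the unrounded value forward
        forecasts.append(value)
    return forecasts

# 2016 revenue from the table, 4.0% growth, five forecast years (2017-2021).
forecast = straight_line_forecast(84_698, 0.04, 5)
rounded = [round(v) for v in forecast]
print(rounded)  # [88086, 91609, 95274, 99085, 103048]
```

Rounding each year before compounding would drift by a unit in some years, which is why the sketch rounds only at the end.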

Moving Average

Month   Revenues   3-month MA   5-month MA
Jan     $ 5.0
Feb     $ 8.0
Mar     $ 7.0      $ 6.7
Apr     $ 8.0      $ 7.7
May     $ 8.0      $ 7.7        $ 7.2
Jun     $ 9.0      $ 8.3        $ 8.0
Jul     $ 7.0      $ 8.0        $ 7.8
Aug     $ 9.0      $ 8.3        $ 8.2
Sep     $ 5.0      $ 7.0        $ 7.6
Oct     $ 7.0      $ 7.0        $ 7.4
Nov     $ 5.0      $ 5.7        $ 6.6
Dec     $ 8.0      $ 6.7        $ 6.8
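The two moving-average columns can be recomputed with a short sketch. It assumes a trailing window, meaning each average covers the current month and the months immediately before it, which is the alignment the table uses; the function name is hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[i + 1 - window:i + 1]) / window)
    return result

# Monthly revenues, January to December, from the table.
revenues = [5.0, 8.0, 7.0, 8.0, 8.0, 9.0, 7.0, 9.0, 5.0, 7.0, 5.0, 8.0]

ma3 = moving_average(revenues, 3)
ma5 = moving_average(revenues, 5)

print([round(v, 1) for v in ma3 if v is not None])
print([round(v, 1) for v in ma5 if v is not None])
```

The longer window reacts more slowly to individual months, so the 5-month series is smoother than the 3-month series. Used as a forecast, the latest average serves as the prediction for the next month.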
Simple Linear Regression

[Figure: Salary plotted against Hours]
Multiple Linear Regression

Simple linear regression: y = a + b * x
Multiple linear regression: y = a + b1 * x1 + b2 * x2 + ... + bk * xk

Coefficients are interpreted in the same way as in the simple linear regression equation.

If all independent variables are 0, the predicted value is the intercept a.

If an independent variable changes by one unit while the others are held constant, the associated coefficient b indicates by how much the dependent variable changes.
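The meaning of a and b can be illustrated for the one-variable case with a minimal ordinary least-squares sketch. The hours and salary figures are hypothetical, constructed so that salary = 50 + 20 * hours exactly, and the function name `fit_line` is an assumption; the multiple-variable case needs matrix algebra and is usually handled by a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, not taken from the slides.
hours = [10, 20, 30, 40]
salary = [250, 450, 650, 850]

a, b = fit_line(hours, salary)
predicted = a + b * 25   # predicted salary for 25 hours
print(a, b, predicted)
```

Here a is the predicted salary at zero hours and b is the change in predicted salary for each additional hour.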
Key Considerations in Forecasting

Data Quality

Uncertainty

Monitoring and Evaluation

Appropriate Forecasting Techniques

Assumptions
Conclusion

By effectively leveraging anomaly detection techniques and forecasting methods, organizations can:
• Proactively identify and mitigate risks.
• Improve operational efficiency and reduce costs.
• Gain a competitive edge through data-driven decision-making.
• Better prepare for future challenges and opportunities.

Anomaly detection helps identify unusual patterns and outliers in data.

Forecasting techniques predict future outcomes based on historical data and identified trends.

The choice of techniques depends on the specific application and available data.
Assignment

Salary: 1000 500 800 700 700 800 700 900 750 850
The above data set shows the salary distribution in a company.

01. Calculate, generate and interpret the box plot.
02. Calculate and interpret the z score.
03. Are there outliers? Why?
