Advanced Data Analysis Techniques 3
Overview
Forecasting Techniques
Anomaly Detection
Anomaly Detection
What is Anomaly Detection?
Anomaly detection, sometimes called outlier detection, is the process of
finding patterns or instances in a dataset that deviate significantly from
the expected or “normal” behavior.
[Figure: normal points versus an anomaly/outlier]
Key Clustering Concepts
Normal:
Routine purchases and consistent spending by an individual in London.
Regular communication, steady data transfer, and adherence to protocol.
Stable heart rate and consistent blood pressure.
Techniques for Anomaly Detection
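Z-score
z = (x - μ) / σ
Where: x is the data value, μ is the mean, and σ is the standard deviation.
Age: 15 16 16 17 18 16 17 16 25 17
Mean (μ) = 17.3
Standard deviation (σ) ≈ 2.685 (population formula)
For x = 25: z = (25 - 17.3) / 2.685 ≈ 2.87
Z-score table value for z = 2.87: 0.9979
If ages follow a normal distribution, this means about 99.79% of the students
are expected to be below the age of 25, so an age of 25 is unusually high.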
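The z-score calculation above can be checked with a short Python sketch. It assumes the population standard deviation (dividing by n), since that is the variant that yields z = 2.87 for this data. The helper name z_score is illustrative only, and the normal-distribution lookup stands in for reading the z-score table.

```python
import math
from statistics import NormalDist

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

mean = sum(ages) / len(ages)
# Population standard deviation: divide the squared deviations by n, not n - 1.
sigma = math.sqrt(sum((x - mean) ** 2 for x in ages) / len(ages))


def z_score(x, mu, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sd


z_25 = z_score(25, mean, sigma)
# Cumulative probability under a standard normal curve, as a z table would give
# for the z value rounded to two decimals.
p_below = NormalDist().cdf(round(z_25, 2))

print(f"mean = {mean:.1f}, sigma = {sigma:.3f}")
print(f"z(25) = {z_25:.2f}, P(Z < z) = {p_below:.4f}")
```

With the sample standard deviation (dividing by n - 1) the z-score would come out somewhat smaller, so it is worth stating which variant is used.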
Interquartile range (IQR)
Age (original): 15 16 16 17 18 16 17 16 25 17
Age (sorted): 15 16 16 16 16 17 17 17 18 25
Q1 = 16, Q2 (median) = 16.5, Q3 = 17, Q4 (maximum) = 25
IQR = Q3 – Q1
IQR = 17 – 16
IQR = 1
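As a rough illustration, the quartile values above can be reproduced in Python by taking the median of each half of the sorted data. The function name quartiles is hypothetical, and the 1.5 × IQR cutoff used to flag outliers is a common convention assumed here; the slides do not state a cutoff.

```python
from statistics import median

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (needs n >= 2)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values before the median position
    upper = data[-half:]   # values after the median position
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles(ages)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the box is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```

Other quartile conventions (for example, interpolation-based ones) can give slightly different Q1 and Q3 values for the same data.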
Boxplots
A box plot consists of a box and two whiskers (lines) extending from the box.
Box: Represents the interquartile range (IQR).
Median Line: A line within the box that indicates the median value.
Whiskers: Extend from the box to the minimum and maximum values within a certain range.
Outliers: Data points that fall outside the range of the whiskers, often marked with dots or asterisks.
Boxplot (Example)
Age (original): 15 16 16 17 18 16 17 16 25 17
Age (sorted): 15 16 16 16 16 17 17 17 18 25
Min = 15
Q1 = 16
Q2 (median) = 16.5
Q3 = 17
Max = 25
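The numbers a box plot is drawn from can be sketched as follows. This is a minimal illustration, not a plotting routine: box_plot_summary is a hypothetical helper, and the whisker rule (the most extreme data points within k = 1.5 × IQR of the box) is an assumed convention for the "certain range" mentioned earlier.

```python
from statistics import median

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]


def box_plot_summary(values, k=1.5):
    """Collect the values needed to draw a box plot by hand."""
    data = sorted(values)
    half = len(data) // 2
    q1, q2, q3 = median(data[:half]), median(data), median(data[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": data[-1],
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


summary = box_plot_summary(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under this assumed rule the upper whisker ends at 18 rather than at the maximum of 25, and 25 would be drawn as a separate outlier point.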
Benefits of Anomaly Detection
Enhanced Security
Better Decision-Making
Cost Savings
Applications of Anomaly Detection
Fraud Detection
Quality Control
Challenges of Anomaly Detection
Imbalanced Data
High Dimensionality
Historical Forecast
Sales Growth Percent 4.0% 4.0% 4.0% 4.0% 4.0% 4.0% 4.0%
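A constant growth rate like the 4.0% shown above can be projected forward with a few lines of Python. This is only a sketch: the starting sales value of 100.0 is a made-up figure (the slide's base value is not available here), and straight_line_forecast is a hypothetical helper name.

```python
def straight_line_forecast(last_value, growth_rate, periods):
    """Project a value forward by applying the same growth rate each period."""
    forecasts = []
    value = last_value
    for _ in range(periods):
        value = value * (1 + growth_rate)
        forecasts.append(round(value, 2))
    return forecasts


last_actual_sales = 100.0  # hypothetical base value, not taken from the slides
forecast = straight_line_forecast(last_actual_sales, growth_rate=0.04, periods=7)

for period, value in enumerate(forecast, start=1):
    print(f"Period {period}: {value}")
```

Each forecast builds on the previous period's value, so the growth compounds rather than adding a fixed amount each period.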
Month   Sales   3-Month Moving Avg   5-Month Moving Avg
Jan     $ 5.0
Feb     $ 8.0
Mar     $ 7.0   $ 6.7
Apr     $ 8.0   $ 7.7
May     $ 8.0   $ 7.7                $ 7.2
Jun     $ 9.0   $ 8.3                $ 8.0
Jul     $ 7.0   $ 8.0                $ 7.8
Aug     $ 9.0   $ 8.3                $ 8.2
Sep     $ 5.0   $ 7.0                $ 7.6
Oct     $ 7.0   $ 7.0                $ 7.4
Nov     $ 5.0   $ 5.7                $ 6.6
Dec     $ 8.0   $ 6.7                $ 6.8
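The averaged columns in the table above match trailing moving averages over 3 and 5 months: each value is the mean of the current month and the months immediately before it. A minimal Python sketch, where moving_average is an illustrative helper name:

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [5.0, 8.0, 7.0, 8.0, 8.0, 9.0, 7.0, 9.0, 5.0, 7.0, 5.0, 8.0]


def moving_average(values, window):
    """Trailing moving average; None until a full window of data is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


def fmt(value):
    """Format one table cell, leaving a dash where no average exists yet."""
    return "   -" if value is None else f"{value:4.1f}"


ma3 = moving_average(sales, 3)
ma5 = moving_average(sales, 5)

for month, actual, a3, a5 in zip(months, sales, ma3, ma5):
    print(f"{month}  {fmt(actual)}  {fmt(a3)}  {fmt(a5)}")
```

The longer window smooths the series more but reacts more slowly, which is visible in how the 5-month column changes less from month to month than the 3-month column.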
Simple Linear Regression
[Figure: plot of Salary against Hours]
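Simple linear regression fits a straight line, y = intercept + slope × x, through one predictor and one outcome. The sketch below uses the ordinary least-squares formulas. The hours and salary figures are invented for illustration (the slide's data is not available), and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours worked and the salary paid for them.
hours = [10, 20, 30, 40, 50]
salary = [250, 400, 610, 790, 950]

slope, intercept = fit_line(hours, salary)
predicted = intercept + slope * 60  # forecast for 60 hours

print(f"salary ≈ {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted salary for 60 hours: {predicted:.1f}")
```

Predicting for 60 hours extrapolates beyond the observed range, so the result depends on the linear relationship continuing to hold there.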
Multiple Linear Regression
Simple linear regression vs. multiple linear regression
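Multiple linear regression extends the same idea to several predictors. The following sketch solves the least-squares normal equations with plain Gaussian elimination so that it needs no external libraries. The second predictor (years of experience) and all data values are made up for illustration, and both function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system has a unique solution (no collinear predictors).
    """
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 fits the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: (hours worked, years of experience) and salary.
predictors = [(10, 1), (20, 3), (30, 2), (40, 5), (50, 4)]
salary = [170, 310, 390, 550, 630]

intercept, b_hours, b_years = fit_multiple(predictors, salary)
print(f"salary ≈ {intercept:.1f} + {b_hours:.1f} * hours + {b_years:.1f} * years")
```

The made-up salaries were generated from an exact linear rule, so the fit recovers its coefficients; real data would leave residual error around the fitted plane.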
Challenges of Forecasting
Data Quality
Uncertainty
Assumptions
Conclusion
Forecasting techniques predict future outcomes based on historical data and identified trends.
The choice of techniques depends on the specific application and available data.
Assignment
Salary 1000 500 800 700 700 800 700 900 750 850
The above data set shows the salary distribution in a company.
1. Calculate, generate and interpret the box plot.
2. Calculate and interpret the z score.
3. Are there outliers? Why?