Advanced Data Analysis Techniques 3

The document discusses advanced data analysis techniques, focusing on anomaly detection and forecasting methods. Anomaly detection identifies patterns that deviate from normal behavior, while forecasting predicts future events based on historical data. It highlights various techniques, their applications, challenges, and the importance of data quality and appropriate methods for effective analysis.

Uploaded by

gurjibecha88
Copyright
© All Rights Reserved

Advanced Data Analysis Techniques (Part 3)

Overview

Anomaly Detection
Forecasting Techniques
Anomaly Detection
What is Anomaly Detection?
Anomaly detection, sometimes called outlier detection, is the process of finding patterns or instances in a dataset that deviate significantly from the expected or "normal" behavior.

The definition of both "normal" and "anomalous" data varies significantly depending on the context. Below are a few examples of anomaly detection in action.

[Figure: normal data points vs. an anomaly/outlier]
Key Clustering Concepts

Financial transactions
Normal: Routine purchases and consistent spending by an individual in London.
Outlier: A massive withdrawal from Ireland from the same account, hinting at potential fraud.

Network traffic in cybersecurity
Normal: Regular communication, steady data transfer, and adherence to protocol.
Outlier: Abrupt increase in data transfer or use of unknown protocols, signaling a potential breach or malware.

Patient vital signs monitoring
Normal: Stable heart rate and consistent blood pressure.
Outlier: Sudden increase in heart rate and decrease in blood pressure, indicating a potential emergency or equipment failure.
Types of Anomalies
Outliers
Outliers are abnormal or extreme data points that exist only in the training data.
For example, consider a dataset of daily temperatures in a city. Most days, the temperatures range between 20°C and 30°C. However, one day, there is a spike to 40°C.

Novelties
Novelties are new or previously unseen instances compared to the original (training) data.
For example, a city installs a new, more accurate weather monitoring station. As a result, the dataset starts consistently recording slightly higher temperatures, ranging from 25°C to 35°C.
Classification of Outliers

Univariate Outliers
Exist in a single variable or feature in isolation. Univariate outliers are extreme or abnormal values that deviate from the typical range of values for that specific feature.
Example: comparing the prices of houses in a community.

Multivariate Outliers
Found by combining the values of multiple variables at the same time.
Example: comparing the prices of houses in a community while considering the number of bedrooms.
[Figure: charts of house prices (1,000 / 2,000 / 10,000) for houses A to D, one of them annotated with bedroom counts (2, 3, 6, 9)]
Techniques for Anomaly Detection

Statistical Methods
Z-score: Measures how many standard deviations a data point is from the mean.
Box Plot: Visualizes the distribution of data and identifies outliers based on quartiles.
Interquartile Range (IQR): A measure of statistical dispersion, often used to identify outliers.

Machine Learning Methods
K-Nearest Neighbors (KNN): Calculates the distance between a data point and its k-nearest neighbors. Anomalies are typically far from their neighbors.
Isolation Forest: Isolates anomalies by randomly partitioning the data space. Anomalies are typically isolated with fewer partitions.

Clustering Methods
DBSCAN: Groups data points based on density. Anomalies are typically in low-density regions.
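
The K-Nearest Neighbors idea listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `knn_scores`, the choice k = 3, and the use of the age data from this deck's worked examples as input are choices made here, not part of the slides.

```python
def knn_scores(values, k=3):
    """Return, for each value, the mean distance to its k nearest neighbors."""
    scores = []
    for i, v in enumerate(values):
        # Distances to every other point; the point itself is skipped by index.
        distances = sorted(abs(v - other) for j, other in enumerate(values) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
scores = knn_scores(ages)

# The point with the largest score is the one farthest from its neighbors.
most_anomalous = ages[scores.index(max(scores))]
print(most_anomalous, round(max(scores), 2))
```

A higher score means the point sits farther from its neighbors; how large a score must be before a point counts as an anomaly is a threshold the analyst has to choose.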
Z Score (standard score)
The z-score measures how many standard deviations a data point is away from the mean.
It allows us to compare data points from different distributions or to understand the relative position of a data point within a single distribution.

Generally, instances whose z-score is above 3 in absolute value are treated as outliers.

z = (x - μ) / σ

Where:
x is the data point
μ is the mean of the distribution
σ is the standard deviation of the distribution
Z Score (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17

Mean = 17.3
Standard deviation = 2.68

z = (x - μ) / σ
z = (25 - 17.3) / 2.68
z = 2.87
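
The calculation above can be reproduced with Python's standard library. This is a minimal sketch: the helper name `z_score` is made up here, and it assumes the slide used the population standard deviation (`pstdev`), which is the variant that matches the value of about 2.68.

```python
from statistics import mean, pstdev

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]

mu = mean(ages)       # 17.3
sigma = pstdev(ages)  # population standard deviation, about 2.685

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z = z_score(25, mu, sigma)
is_outlier = abs(z) > 3  # the common threshold of 3

print(round(z, 2), is_outlier)
```

With the threshold of 3, the age 25 falls just short of being flagged, because the extreme value itself inflates the standard deviation of such a small sample.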
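Z Score Table

For z = 2.87, the z-score table gives 0.9979.
Assuming the ages are approximately normally distributed, this means about 99.79% of the students are below the age of 25.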
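
The table lookup can also be done in code. The sketch below assumes a standard normal distribution and uses `statistics.NormalDist` (available in Python 3.8 and later); the variable name `p_below` is chosen here for illustration.

```python
from statistics import NormalDist

# Cumulative probability of the standard normal distribution at z = 2.87,
# i.e. the share of values expected to fall below that z-score.
p_below = NormalDist().cdf(2.87)

print(round(p_below, 4))
```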
Interquartile range (IQR)

The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a distribution. It is the range of the middle 50% of the data.

IQR = Q3 - Q1

It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

An instance that lies below Q1 or above Q3 by more than some multiple of the IQR is considered an outlier.

The most common multiplier is 1.5, so values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
Interquartile Range (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Sorted: 15, 16, 16, 16, 16, 17, 17, 17, 18, 25

Q1 = 16, Q2 = 16.5, Q3 = 17, Q4 = 25

IQR = Q3 - Q1
IQR = 17 - 16
IQR = 1
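
The quartiles and the 1.5 * IQR rule can be sketched as follows. The function names `quartiles` and `iqr_outliers` are made up for this example, and the code assumes the median-of-halves convention for quartiles, which reproduces the values above; other conventions (for instance the default of `statistics.quantiles`) can give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    # For an odd number of values, the middle value is left out of both halves.
    lower, upper = data[:half], data[len(data) - half:]
    return median(lower), median(data), median(upper)

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
q1, q2, q3 = quartiles(ages)
outliers = iqr_outliers(ages)

print(q1, q2, q3, q3 - q1, outliers)
```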
Boxplots

Box plots, also known as box-and-whisker plots, are a graphical representation of the distribution of a dataset.

They provide a concise summary of five key statistics:

Minimum: The smallest value in the dataset.
First Quartile (Q1): The value below which 25% of the data falls.
Median (Q2): The middle value of the dataset.
Third Quartile (Q3): The value below which 75% of the data falls.
Maximum: The largest value in the dataset.
Structure of Boxplots

A box plot consists of a box and two whiskers (lines) extending from the box.

Box: Represents the interquartile range (IQR).
Median Line: A line within the box that indicates the median value.
Whiskers: Extend from the box to the minimum and maximum values within a certain range.
Outliers: Data points that fall outside the range of the whiskers, often marked with dots or asterisks.
Boxplot (Example)

Age: 15, 16, 16, 17, 18, 16, 17, 16, 25, 17
Sorted: 15, 16, 16, 16, 16, 17, 17, 17, 18, 25

Min = 15
Q1 = 16
Q2 = 16.5
Q3 = 17
Max = 25
Boxplot (Example)

[Figure: box plot of the ages on an axis from 14.5 to 18.5, with 25 shown separately as an outlier]

Lower whisker = Q1 - 1.5 * IQR = 16 - 1.5 * 1 = 14.5
Upper whisker = Q3 + 1.5 * IQR = 17 + 1.5 * 1 = 18.5

Min = 15
Q1 = 16
Q2 = 16.5
Q3 = 17
Max = 25
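
The numbers a box plot needs can be gathered in one place. This is a sketch under stated assumptions: `box_plot_stats` is a hypothetical helper, quartiles use the median-of-halves convention, and the whiskers are drawn to the most extreme data points that still lie inside the 1.5 * IQR limits, which is a common plotting convention rather than something the slides specify.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Five-number summary plus whisker ends and outliers for a box plot."""
    data = sorted(values)
    half = len(data) // 2
    q1, q2, q3 = median(data[:half]), median(data), median(data[len(data) - half:])
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - k * iqr, q3 + k * iqr
    # Points inside the limits decide where the whiskers end.
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    return {
        "min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1],
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }

ages = [15, 16, 16, 17, 18, 16, 17, 16, 25, 17]
stats = box_plot_stats(ages)
print(stats)
```

Under that convention the whiskers end at 15 and 18, while 14.5 and 18.5 are the limits beyond which a point such as 25 is plotted individually.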
Importance of Anomaly Detection

Proactive Issue Resolution

Improved Efficiency and Productivity

Enhanced Security

Better Decision-Making

Cost Savings
Applications of Anomaly Detection

Fraud Detection

Network Intrusion Detection

Quality Control

Sensor Data Analysis

System Health Monitoring


Challenges of Anomaly Detection

Defining “normal” behavior

Imbalanced Data

Handling high-dimensional data

Evolving Data Patterns


Forecasting Techniques
What is Forecasting?

Forecasting is the process of making predictions about future events based on historical data and assumptions about future trends.

It is a crucial tool for businesses, governments, and individuals to make informed decisions and plan for the future.
Main Types of Forecasting Methods
Straight-line Method: One of the simplest and easiest-to-follow forecasting methods. It uses historical figures and trends to predict future revenue growth.

Moving Average: Calculates the average of a specific number of past data points to predict the future value.

Simple Linear Regression: Predicts the value of a dependent variable based on the value of an independent variable.

Multiple Linear Regression: Predicts the value of a dependent variable based on the values of multiple independent variables.
Straight-Line Method

Historical: 2015 to 2016. Forecast: 2017 to 2021.

                      2015     2016     2017     2018     2019     2020     2021
Sales Growth Percent  4.0%     4.0%     4.0%     4.0%     4.0%     4.0%     4.0%
Revenues              81,422   84,698

With the forecast revenues filled in:

                      2015     2016     2017     2018     2019     2020     2021
Sales Growth Percent  4.0%     4.0%     4.0%     4.0%     4.0%     4.0%     4.0%
Revenues              81,422   84,698   88,086   91,609   95,274   99,085   103,048
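
The forecast row can be reproduced by compounding the last historical revenue at the assumed growth rate. The function name `straight_line_forecast` is made up for this sketch; it keeps the unrounded value between periods and rounds only for display, which is an assumption about how the table was built.

```python
def straight_line_forecast(last_value, growth_rate, periods):
    """Grow the last known value by a constant rate for a number of periods."""
    forecasts = []
    value = last_value
    for _ in range(periods):
        value *= 1 + growth_rate        # carry the unrounded value forward
        forecasts.append(round(value))  # round only the reported figure
    return forecasts

# 2016 revenue of 84,698 projected at 4.0% for 2017 to 2021.
forecast = straight_line_forecast(84_698, 0.04, 5)
print(forecast)
```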


Moving Average
Month   Revenues   3-month MA   5-month MA
Jan     $ 5.0
Feb     $ 8.0
Mar     $ 7.0      $ 6.7
Apr     $ 8.0      $ 7.7
May     $ 8.0      $ 7.7        $ 7.2
Jun     $ 9.0      $ 8.3        $ 8.0
Jul     $ 7.0      $ 8.0        $ 7.8
Aug     $ 9.0      $ 8.3        $ 8.2
Sep     $ 5.0      $ 7.0        $ 7.6
Oct     $ 7.0      $ 7.0        $ 7.4
Nov     $ 5.0      $ 5.7        $ 6.6
Dec     $ 8.0      $ 6.7        $ 6.8
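
Each entry in the table is the average of the current month and the months just before it. A minimal sketch, assuming a trailing window and one-decimal rounding (the function name `moving_average` is chosen here for illustration):

```python
def moving_average(values, window):
    """Trailing moving average; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(round(sum(chunk) / window, 1))
    return result

# Monthly revenues from January to December, as in the table.
revenues = [5.0, 8.0, 7.0, 8.0, 8.0, 9.0, 7.0, 9.0, 5.0, 7.0, 5.0, 8.0]

ma3 = moving_average(revenues, 3)
ma5 = moving_average(revenues, 5)
print(ma3)
print(ma5)
```

The longer window reacts more slowly to month-to-month swings, which is why the 5-month column is smoother than the 3-month one.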
Simple Linear Regression

[Figure: plot of Salary against Hours]
Multiple Linear Regression
[Equations: simple linear regression and multiple linear regression]

Coefficients are interpreted similarly to those in the simple linear regression equation.

If all independent variables are 0, the value a is obtained.

If an independent variable changes by one unit while the others are held constant, the associated coefficient b indicates by how much the dependent variable changes.
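
To make the roles of a and b concrete, here is a least-squares fit for the simple (one-variable) case. The slide's own equations did not survive extraction, so the usual form y = a + b * x is assumed here. The function name `fit_line` and the hours and salary figures are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # change in y per one-unit change in x
    a = mean_y - b * mean_x  # predicted y when x is 0
    return a, b

# Made-up data: hours worked and salary (not from the slides).
hours = [1, 2, 3, 4, 5]
salary = [300, 500, 700, 900, 1100]

a, b = fit_line(hours, salary)
predicted = a + b * 6  # prediction for 6 hours
print(a, b, predicted)
```

In this invented data, each extra hour adds b to the predicted salary, and a is the prediction at zero hours. Multiple linear regression extends the same reading to one coefficient per independent variable.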
Key Considerations in Forecasting

Data Quality

Uncertainty

Monitoring and Evaluation

Appropriate Forecasting Techniques

Assumptions
Conclusion

By effectively leveraging anomaly detection techniques and forecasting methods, organizations can:
• Proactively identify and mitigate risks.
• Improve operational efficiency and reduce costs.
• Gain a competitive edge through data-driven decision-making.
• Better prepare for future challenges and opportunities.

Anomaly detection helps identify unusual patterns and outliers in data.

Forecasting techniques predict future outcomes based on historical data and identified trends.

The choice of techniques depends on the specific application and available data.
Assignment

Salary: 1000, 500, 800, 700, 700, 800, 700, 900, 750, 850
The above data set shows the salary distribution in a company.

01. Calculate, generate and interpret the box plot.
02. Calculate and interpret the z score.
03. Are there outliers? Why?
