Module 1.2 Data Preprocessing

The document outlines the data modeling approach, detailing the process of organizing and structuring data through various models such as conceptual, logical, and physical modeling. It also discusses the importance of data organization, types of data analytics frameworks, and the distinction between structured and unstructured data. Additionally, it highlights licensed and open-source analytics tools available for data analysis and visualization.

DATA MODELLING APPROACH

Prepared by
Fathima Shana E
Assistant Professor
Dept. of ADS
Data Modelling Approach
• Data are individual facts, statistics, or items of information, often numeric.
• A data model is a conceptual framework that organizes and structures data to represent how data is stored, managed, and processed in a system.
• In relational systems, a data model is expressed as a set of tables and the relationships between them.
• Data modeling is the process of creating a blueprint of how different pieces of data relate to each other.
• Align data: it brings order to raw data by defining relations and structures.
• It involves making a visual representation of all or part of an information system to show how different data points and organizational structures are linked.
• It aims to define and organize how data is stored, accessed, and used in systems, how different types of data are connected, how data can be grouped and organized, and what its formats and features are.
• E.g.: A data analyst in an e-commerce company makes use of data modeling to create meaningful insights from jumbled data such as customer information and sales reports.
Types of Data Model
1. Relational Model
• A relation is a table with rows and columns.
• Each row represents a record (or tuple).
• Each column represents an attribute (or field) of the relation.
• Schema: the structure of a relation, including table names, column names, and data types.
• Example: A "Student" table might have attributes like Student ID, Name, Age, and Course.

2. Object-Oriented Model
• An object consists of:
  – Attributes: the data describing the object.
  – Methods: the functions or procedures associated with the object.
• E.g.: Object: Car
  – Attributes: Make, Model, Year
  – Methods: Start(), Stop()

3. Entity-Relationship (ER) Model
• Entities: objects or concepts that have significance in the domain being modeled.
• Attributes: properties or characteristics of an entity or relationship.
• Relationships: describe how entities are related to each other.
• The ER model helps decision-makers visualize and structure data for better insights and understanding.
Data Modelling Process/Types of Data Modelling
1.Conceptual Modeling:
• Provides a high-level view of the data, focusing on business
requirements rather than technical details.
• Identifies key entities and their relationships
• Audience: Business stakeholders and analysts
• E.g.: It focuses on entities (e.g., Customer, Product) and relationships between entities (e.g., Customer purchases Products).
2.Logical Modeling:
Defines the detailed structure of data, including attributes and relationships
• Refining the conceptual model by defining data types, constraints, and
specific details about each attribute, preparing the design for
implementation in a database.
• Specifies entities, attributes,
and relationships.
• Includes primary and foreign keys
• Audience: Data architects and analysts.

Example: A schema showing tables Customer (with attributes like CustomerID, Name) and Order (with attributes like OrderID, OrderDate, and a foreign key CustomerID).
3.Physical Modeling:
Describes how data will be stored and managed in a specific database system.
Translating the logical model into the specific database schema, considering storage
requirements, indexing, and optimization techniques for efficient data access.
• It focuses on Tables, columns, indexes, and database-specific constructs, storage,
performance, and optimization.

• Audience: Database administrators and developers.


• Example: A schema for an SQL database with detailed data types (VARCHAR, INT),
primary/foreign key constraints, and indexing.
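
As a rough illustration of the jump from the logical to the physical level, the Customer/Order example above could be realized as the following schema, created here from Python with SQLite. The table and column names come from the example in this section; the data types, the index, and the choice of SQLite are assumptions made only for this sketch.

import sqlite3

# Throwaway in-memory database used only to show the physical schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);

CREATE TABLE "Order" (
    OrderID    INTEGER PRIMARY KEY,
    OrderDate  TEXT NOT NULL,
    CustomerID INTEGER NOT NULL,
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
);

-- Index to speed up lookups of a customer's orders (a physical-level choice)
CREATE INDEX idx_order_customer ON "Order"(CustomerID);
""")
conn.commit()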
Data Organization
• The structured method of storing and categorizing data collected from
various sources within a business, allowing for efficient access,
analysis, and interpretation to support informed decision-making.

• Well-organized data ensures accessibility, accuracy, and relevance.


Key Aspects

1.Data Sources
• Data in business analytics typically comes from various sources:
– Internal: Sales, marketing, operations, HR, finance.
– External: Social media, market research, government reports.
– Real-time: IoT devices, sensors, website interactions.

2. Data Storage
• Businesses use different methods to store and organize data:
– Databases: Structured storage using relational or non-relational
models.
– Data Warehouses: Centralized repositories for analytical data.
– Data Lakes: Storage for raw, unstructured, or semi-structured data.
3. Data Structuring
• Data must be organized into a logical structure to
facilitate analysis:
– Relational Databases: Tables with relationships (SQL
databases).
– Hierarchical Models: Parent-child relationships.
– Dimensional Models: Fact and dimension tables
4. Data Cleaning
• Ensuring the data is free from errors and inconsistencies:
– Removing duplicates.
– Handling missing values.
– Standardizing formats.
5. Data Categorization
• Classifying data to improve organization and retrieval:
– By Type: Numerical, categorical, textual, temporal.
– By Source: Customer data, product data, sales data.
– By Purpose: Operational, analytical, strategic.
6. Data Integration
• Combining data from different sources into a single, coherent view

7. Metadata Management
• Maintaining information about the data (e.g., source, definitions,
ownership) to ensure traceability and governance.
8. Data Security and Privacy
• Organizing data with appropriate access controls to protect
sensitive information.
9. Data Accessibility
• Ensuring stakeholders can access the right data efficiently
5 V’s of Business Analytics
1. Velocity is the speed at which the data is created and how fast it moves.
2. Volume is the amount of data qualifying as big data.
E.g.: A retail chain generating millions of sales transactions daily requires scalable storage solutions like data warehouses or lakes.
3. Value is the business impact derived from the data. Data analytics should drive actionable insights and measurable outcomes.
Example: Using sales data to predict future demand and optimize inventory.
4. Variety is the diversity that exists in the types of data.
• Structured: databases, spreadsheets.
• Semi-structured: JSON, XML.
• Unstructured: text, images, videos.
• Example: Combining customer demographics (structured), social media posts (unstructured), and web clickstream data (semi-structured).
5. Veracity is the data's quality and accuracy. It ensures data is accurate, clean, and consistent for meaningful analysis.
Example: A company struggling with duplicate or inconsistent customer records in its CRM; data cleaning and validation become critical steps in analytics.
STRUCTURED DATA VS UNSTRUCTURED DATA
Definition
• Structured data: organized in predefined formats, such as rows and columns.
• Unstructured data: does not follow a specific structure or predefined format.

Example
• Structured data: customer database with columns for Name, Age, and Email.
• Unstructured data: emails, images, videos, social media posts.

Format
• Structured data: tabular, with clearly defined fields (e.g., spreadsheets, databases).
• Unstructured data: freeform, lacks a rigid structure (e.g., text files, multimedia).

Schema
• Structured data: relies on fixed schemas (e.g., relational models).
• Unstructured data: no predefined schema; format is flexible.

Ease of Analysis
• Structured data: easily queried using tools like SQL.
• Unstructured data: requires advanced tools and algorithms (e.g., AI, NLP).

Storage Systems
• Structured data: relational databases (e.g., MySQL).
• Unstructured data: data lakes (e.g., Hadoop).

Examples
• Structured data: tables with rows and columns.
• Unstructured data: file systems, object storage.

Processing Tools
• Structured data: SQL-based tools, BI platforms (e.g., Tableau, Power BI).
• Unstructured data: machine learning, NLP, big data tools.

Industries
• Structured data: banking, e-commerce, logistics.
• Unstructured data: marketing, healthcare (e.g., medical images), social media analytics.

Example Scenarios
• Customer Data: structured example is a database with Name, Email, Purchase History; unstructured examples are emails and customer reviews.
• Healthcare: structured example is patient records in tables (e.g., age, diagnosis); unstructured examples are X-ray images and doctor's notes.
• Marketing: structured example is sales data by product and region; unstructured examples are social media posts and advertisements.
Data Analytics framework
• Data analytics is the process of examining data to uncover useful information
and support decision-making.
• It involves collecting raw data from different sources, cleaning and organizing
it, and using tools and techniques to analyze it.

Steps in a Data Analytics Framework


1. Define Objectives
• Identify the business problem or question to be solved.
• Set goals and metrics (KPIs) to measure success.
2. Data Collection
• Determine required data sources (internal or external).
• Collect raw data through databases, APIs, surveys, or other methods.
• Ensure data privacy, security, and compliance with relevant regulations.
• Ensure that you have access to accurate, reliable, and comprehensive data that is relevant to your analysis objectives.
3. Data Preparation (Preprocessing)
• Clean the data (remove duplicates, handle missing values, correct errors).
• Transform data into a consistent format (e.g., normalization, encoding).
• Integrate and store data in a centralized location (e.g., data warehouse).
4. Exploratory Data Analysis (EDA)
• Use statistical methods to understand data distributions and relationships.
• Visualize data patterns using graphs, plots, and charts.
• Identify trends, outliers, and potential issues.

5. Data Modeling
• Select the appropriate analytical approach (descriptive, diagnostic,
predictive, or prescriptive).
• Apply statistical methods or machine learning algorithms.
• Train and validate models to ensure accuracy and reliability.

6. Interpret Results
• Analyze model outputs to derive insights.
• Relate findings back to business goals and KPIs.
• Use visualization tools to communicate insights effectively (dashboards,
reports).

7. Decision Making
• Use insights to support or refine business strategies.
• Develop action plans based on the analysis.
Types of Data Analytics Frameworks
1. Descriptive Analytics:
• Descriptive analytics is a branch of data analytics that focuses on
summarizing historical data to gain insights into past events or
phenomena. It involves organizing and presenting data in a meaningful
way through visualization techniques, such as charts, graphs, and
dashboards. Descriptive analytics aims to provide a clear and concise
snapshot of what has happened.

2. Diagnostic Analytics:
• Diagnostic analytics is a form of data analytics that delves deeper into
understanding the root causes and reasons behind specific events or
outcomes. It goes beyond descriptive analytics by investigating the
relationships between variables to uncover insights and explanations.
Diagnostic analytics involves conducting exploratory analysis and
applying statistical techniques to identify patterns, correlations, and
anomalies within the data.
3. Predictive Analytics:
Predictive analytics is a field within data analytics that employs historical data
and statistical modelling methods to predict future outcomes or trends. Its
objective is to make well-informed forecasts and estimations based on the
analysis of patterns, correlations, and connections present in the data. By
utilizing a range of statistical and machine learning algorithms, predictive
analytics creates predictive models that enable organizations to anticipate
customer behavior, market trends, demand patterns, and other important
factors.

4. Prescriptive Analytics:
Prescriptive analytics is an advanced field in data analytics that
employs historical data, mathematical models, optimization algorithms,
and simulation methods to offer guidance on the best actions or decisions
to attain desired outcomes. Unlike descriptive and predictive analytics,
which concentrate on understanding past occurrences and forecasting
future trends, prescriptive analytics takes an additional step by
proposing precise courses of action.
5. Cognitive Analytics :
• Cognitive analytics refers to the application of advanced
technologies and techniques that enable systems and machines
to mimic human cognitive abilities, such as perception, learning,
reasoning, and problem-solving.

• It combines elements of artificial intelligence, machine learning,


natural language processing, and other cognitive computing
technologies to analyze and interpret complex data sets.

• Cognitive analytics enables organizations to unveil patterns,


trends, and relationships within large volumes of structured and
unstructured data, leading to improved decision-making,
enhanced customer experiences, and the discovery of valuable
business opportunities
Licensed Analytics Tools
• These tools are proprietary software solutions designed to enable businesses to
collect, analyze, and visualize data effectively.
• These tools are typically provided by vendors under a licensing agreement,
which might involve subscription fees, pay-per-use models, or one-time
licensing costs.

Key Features of Licensed Analytics Tools

• Pre-Built Dashboards: Ready-to-use dashboards that provide quick insights.


• Advanced Analytics: Support for predictive analytics, machine learning (ML),
and artificial intelligence (AI) models.
• Seamless Integrations: Compatibility with various enterprise systems like CRMs,
ERPs, and data warehouses.
• Scalability: Ability to handle increasing data volumes as businesses grow.
• Customer Support: Dedicated support teams, regular updates, and training
programs.
• Security: Built-in compliance with regulations and advanced data encryption.
Popular Licensed Analytics Tools
1.Tableau
• Known for its powerful visualization capabilities.
• Enables drag-and-drop analysis and interactive dashboards.
2.Power BI (Microsoft)
• Integrates well with Microsoft Office and Azure services.
• Offers robust real-time analytics and collaboration features.
3.Google Analytics 360
• Provides advanced attribution modeling and integrations with
Google Ads.
• Best for organizations heavily invested in digital marketing.
4. Adobe Analytics
• Focused on digital marketing and customer journey analytics.
Open-Source Analytics Tools
• These tools are software solutions that provide data
analysis, visualization, and reporting capabilities.
• They are freely available and allow users to access, modify,
and distribute the source code.
1. Apache Superset
2. Metabase
3. BIRT (Business Intelligence and Reporting Tools)
4. Pentaho
5. Jaspersoft
6. Helical Insight
7. Redash
Apache Superset
• Apache Superset is an open-source data exploration and visualization platform designed for ease of use, empowering users to make data-driven decisions.
• A modern, lightweight BI tool for data visualization and exploration.
  – Connects to a wide range of databases.
  – Intuitive drag-and-drop interface for building dashboards.
  – Highly customizable with SQL support.
  – Supports dynamic filtering and time-series analysis.
Metabase
• Metabase is a user-friendly open-source BI tool focusing on data querying and visualization, making it accessible to non-technical users.
• This tool is the fastest way to share data and analytics with your team
members.
• Users can install this tool in less than five minutes and connect to
MySQL, PostgreSQL, MongoDB, and more.

Features:
1. Dashboards with automatic refreshing and full-screen view
2. SQL Mode for data analysts and professionals
3. Establish standardized segments and metrics for team-wide use
4. Schedule data delivery to Slack or email through dashboard
subscriptions
5. Access data in Slack at any time using MetaBot
6. Simplify data for your team by renaming, annotating, and concealing
fields
7. Great UX and user-friendly interface.
BIRT (Business Intelligence and Reporting Tools)
• BIRT is a versatile open-source business intelligence tool focusing on reporting and data visualization.
• It enables users to design, generate, and view reports with customizable templates and interactive charts, making it a robust choice for data reporting.
• Strong focus on report generation.
• BIRT has two main components:
• BIRT designer: A graphical tool for designing and developing reports
• BIRT runtime engine: Provides support for running reports and rendering
published report output
Pentaho
• Pentaho is an open-source Business Intelligence (BI) suite that
offers comprehensive data integration and analytics capabilities.
• It is used for data management, reporting, data mining, and
dashboarding.
• Pentaho's main strength is its end-to-end BI functionality,
allowing users to collect, store, analyze, and visualize data in
various ways.
• Pentaho Data Integration (PDI): integrates data from various sources like databases and transforms the data using complex operations like filtering, aggregating, and merging.
• It then loads the data into databases, data warehouses, or other destinations.
• Pentaho Business Analytics: Provides tools
for reporting, dashboards, and visualizations. It includes a Report
Designer and an Interactive Dashboard.
Jaspersoft
• Jaspersoft is a renowned open-source business intelligence tool
with robust reporting, dashboards, and data analysis
capabilities. It is widely used for creating and delivering reports
and interactive data visualizations, making it a valuable choice
for data-driven decision-making.
Features:
1.Comprehensive reporting and dashboard creation
2.Rich library of data visualizations and chart types
3. Integration with various data sources and databases
4. Multi-tenancy support for secure sharing
5. Advanced reporting features like ad-hoc reporting and
scheduling
Helical Insight
• Helical Insight is a self-service open-source BI and reporting
tool.
• It empowers businesses to explore data, create customized
reports, and share insights using Machine Learning and NLP.
• Its focus on self-service makes it accessible to users across the
organization.
Features:
1. Web-based Business Intelligence software
2. Interaction with organizational data
3. Utilizes Machine Learning and NLP (Natural Language
Processing)
4. Custom workflow specification
Redash
• Redash is an open-source tool for data visualization and
collaboration.
• It facilitates organizations in taking a more data-driven approach
by providing them with tools for democratizing data access. It
also has a good range of out-of-the-box dashboards.
Features:
1. Data source connections for querying and visualization
2. Interactive and shareable dashboards
3. Collaboration and sharing features for reports and queries
4. Scheduled and automated report generation
5. Customizable visualizations and chart types
6. Extensible through plugins and API integration
Data Cleaning
• Data cleaning is the process of removing incorrect, inconsistent, incomplete, and inaccurate data from datasets; it also replaces missing values.
• It is the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset.
Steps in Data Cleaning
1. Handling Missing Values
2. Noisy Data
3. Data Cleaning as a Process
1. Handling Missing Values
• Deletion: remove rows or columns with excessive missing data.
• Imputation: fill missing values using:
  – Mean/Median/Mode: for numerical data.
  – Forward/Backward Fill: use adjacent values to fill gaps (e.g., time-series data).
  – Predictive Imputation: use machine learning models to estimate missing values based on other features.
Steps in Handling Missing Values
1. Ignore the tuple.
2. Fill in the missing value manually.
3. Use a global constant to fill in the missing value.
4. Use a measure of central tendency for the attribute (e.g., the mean, median, or mode) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.
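
A minimal pandas sketch of a few of these options (deletion, mean/median/mode imputation, and forward fill); the DataFrame and its column names are invented purely for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [25, np.nan, 31, 40, np.nan],
    "city":  ["Kochi", "Kochi", None, "Delhi", "Delhi"],
    "sales": [100.0, 120.0, np.nan, 90.0, 95.0],
})

# Deletion: drop rows in which every value is missing (or set a threshold)
df_dropped = df.dropna(how="all")

# Central tendency: mean/median for numerical attributes, mode for categorical
df["age"] = df["age"].fillna(df["age"].mean())
df["sales"] = df["sales"].fillna(df["sales"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill: use the previous value to fill gaps (useful for time series)
df_ffill = df.ffill()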
2. Noisy Data
• Noisy data refers to data that contains errors, outliers, or
irrelevant information, making it difficult to analyze and interpret.
• Noise in data can distort the results of analysis, affect model
performance, and compromise decision-making.
• Noisy data has a low Signal-to-Noise Ratio.
• Data = true signal + noise
• Noisy data unnecessarily increases the amount of storage space
required and can adversely affect any data mining analysis results.

Sources of Noisy Data


• Human Errors: Mistakes during data entry, such as typos or incorrect
values.
• Communication Errors: Data corruption during transmission or storage.
• Environmental Factors: External conditions affecting measurements (e.g.,
background noise in audio data).
• Outliers: Extreme values that deviate significantly from other
observations.

Types of Noisy Data


• Random Noise: Unpredictable fluctuations in data values.
• Systematic Noise: Errors that follow a consistent pattern or bias.
• Outliers: Values that fall outside the expected range, potentially caused
by errors or true variability.
• Missing or Incomplete Data: Gaps that can mislead analysis if untreated
Ways to Remove Noise
1. Binning
2. Regression

1. Binning is a technique where we sort the data and then partition it into equal-frequency bins. Then you may replace the noisy data with the bin mean, bin median, or the bin boundaries.
• It smooths data by grouping values into "bins" and replacing each value with a representative value for the bin.
Steps in Binning
1. Sort the data: arrange the data in ascending or descending order.
2. Divide data into bins: split the dataset into equal-sized bins (e.g., groups of 5 or 10).
3. Smooth the data:
   a. Smoothing by bin means: replace each value in the bin with the mean (average) of the bin.
      Example: bin values [7, 8, 9, 10, 11] → mean = 9 → replace all with 9.
   b. Smoothing by bin medians: replace each value in the bin with the median of the bin.
      Example: bin values [7, 8, 9, 10, 11] → median = 9 → replace all with 9.
   c. Smoothing by bin boundaries: replace each value in the bin with the nearest boundary value of the bin.
      Example: bin values [7, 8, 9, 10, 11] → replace 7, 8 with 7 (lower boundary), and 10, 11 with 11 (upper boundary).
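
A short NumPy sketch of equal-frequency binning with mean and boundary smoothing; the data values and the choice of three bins are illustrative assumptions.

import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))  # step 1: sort
bins = np.array_split(data, 3)                                 # step 2: three equal-frequency bins

# Step 3a: smoothing by bin means -> [9 9 9 22 22 22 29 29 29]
by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Step 3c: smoothing by bin boundaries (replace each value with the nearer boundary)
by_boundary = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])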
2. Regression is used to smooth the data and helps handle data when unnecessary data is present.
• Regression techniques fit a mathematical model to the data and
can be used to smooth noisy data by approximating the
underlying trend.

Types of Regression for Noise Removal:


• Linear Regression: To model the relationship between a
dependent variable (target) and one or more independent
variables (predictors).
• It assumes a linear relationship between the variables and
predicts the target variable by fitting a straight line to the data.
– Use the line to predict and smooth noisy values.
Example: y=mx+c (where m is the slope and c is the
intercept).
• Multiple linear regression models the relationship between one
dependent variable (target) and multiple independent variables
(predictors).
• It assumes that the dependent variable is a linear combination of
the independent variables.
• The general form of a multiple linear regression model is:
  Y = c + m1x1 + m2x2 + ... + mnxn + ε
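
As an illustration of regression-based smoothing, the sketch below fits a straight line y = mx + c to synthetic noisy data with NumPy and replaces the observed values with the fitted trend; the data and the noise level are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=8.0, size=50)  # true signal + random noise

m, c = np.polyfit(x, y, deg=1)  # fit y = m*x + c by least squares
y_smoothed = m * x + c          # smoothed (de-noised) values on the fitted line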
Data Cleaning as a Process

• Assess Data Quality


– Examine the dataset for issues like missing values, duplicates, outliers,
and formatting inconsistencies.
• Handle Missing Data
– Remove, replace, or impute missing values based on the context and
data relevance.
• Remove Duplicates
– Identify and eliminate redundant rows or records to maintain data
integrity.
• Standardize and Correct Errors
– Ensure uniform formats for dates, numbers, and text.
– Fix typos, misclassifications, and invalid entries.
• Handle Outliers
– Identify and decide whether to keep, adjust, or remove outliers based on
their impact.
• Validate and Document
– Check cleaned data for accuracy and consistency.
– Document all changes made for transparency and reproducibility.
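
A condensed pandas sketch of this process (assess, standardize, de-duplicate, validate); the columns and the validation rule are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "name":  ["Anu", "anu ", "Ben", "Ben"],
    "date":  ["2024-01-05", "2024-01-05", "2024-02-10", "2024-02-10"],
    "sales": [100, 100, 250, 250],
})

df.info()                                        # assess data quality
df["name"] = df["name"].str.strip().str.title()  # standardize text formats
df["date"] = pd.to_datetime(df["date"])          # uniform date type
df = df.drop_duplicates()                        # remove redundant records
assert df["sales"].ge(0).all()                   # simple validation check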

OUTLIERS
• Outliers are data points that significantly deviate from the overall
pattern or distribution of a dataset.
• They can arise due to errors, variability, or rare events and can
influence statistical analyses and model performance.
• Identifying and handling outliers is important in data analysis as they can:
• Skew Results: Outliers can distort statistical measures like the
mean and standard deviation, leading to incorrect conclusions.
• Affect Model Performance: In machine learning, they can lead to
poor model predictions or overfitting
• Outliers are different from noisy data:
  – Noise is a random error or variance in a measured variable.
  – Noise should be removed before outlier detection.
• Example: a 2-D plot of customer data with respect to customer locations in a city may show three data clusters; outliers may be detected as values that fall outside of the cluster sets.
TYPES OF OUTLIERS
There are 3 types of outliers:
1. Global outliers(or point Anomaly)
2. Contextual outlier(or conditional outlier)
3. Collective outlier
1. Global Outliers (or Point Anomalies)
• A global outlier is a data point that deviates significantly from the overall distribution of the dataset.
• These outliers stand out in the entire dataset without being dependent on any specific context.
• Large deviation: the value is far removed from the central tendency (mean, median) or lies in an unlikely range.
• Independent of context: it is considered unusual across the whole dataset, unlike contextual outliers.
• E.g.:
  1. Most temperatures in a dataset are between 20°C and 30°C, but one value is 100°C.
  2. In a dataset of human heights (mostly 150–200 cm), a height of 500 cm is a global outlier.
2. Contextual Outliers (or Conditional Outliers)
• A contextual outlier is a data point that is considered anomalous only in a specific context.
• While it may fall within a typical range of values overall, it deviates
significantly in relation to its context or surrounding conditions.
• Contextual outliers may not be outliers when considered in the entire
dataset, but they exhibit unusual behavior within a specific context or
subgroup.
Detection: Techniques for detecting contextual outliers include
contextual clustering, contextual anomaly detection, and context-aware
machine learning approaches.
Contextual Specific: Contextual information such as time, location, or
other relevant factors are crucial in identifying contextual outliers.
E.g.:1. A temperature of 40°C is normal
in summer but anomalous in winter.
2.A sale of $1,000 might be typical
on a weekend but unusual on
a weekday.
3. Collective Outliers
• A collective outlier refers to a group of data points that, as a
whole, deviate significantly from the expected pattern of the
dataset.
• Individually, these points may not appear unusual, but their
combined behavior is anomalous.
• Detection: Techniques for detecting collective outliers include
clustering algorithms, density-based methods, and subspace-
based approaches
• Patterns of Anomaly: The group forms an unexpected pattern,
such as spikes, dips, or unusual correlations.

• E.g.: Sensor data: multiple sensors reporting slightly abnormal readings together could indicate a system malfunction.
OUTLIER DETECTION
Two ways to categorize outlier detection methods:
1. Based on whether user-labelled examples of outliers can be obtained:
   – supervised, semi-supervised, and unsupervised methods.
2. Based on assumptions about normal data and outliers:
   – statistical, proximity-based, and clustering-based methods.

OUTLIER DETECTION: SUPERVISED METHODS
• Labelled data: uses a labelled dataset where each point is classified as either normal or an outlier.
• Models are trained to learn the distinction between the normal and anomalous classes.
• Classification task: models learn to classify data points based on features and labels.
• Samples examined by domain experts are used for training and testing.
• The model either models normal objects and reports those not matching the model as outliers, or models outliers and reports those not matching the model as normal.
• Methods:
  – Logistic Regression
  – Decision Trees
  – Support Vector Machines (SVM)
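
A small scikit-learn sketch of the supervised setting, treating outlier detection as a classification task with one of the methods listed above (a decision tree). The labelled data here is synthetic; in practice labels would come from domain experts.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # labelled normal points
outliers = rng.uniform(low=5.0, high=8.0, size=(10, 2))    # labelled outliers
X = np.vstack([normal, outliers])
y = np.array([0] * 200 + [1] * 10)                         # 0 = normal, 1 = outlier

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))                           # accuracy on held-out points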
OUTLIER DETECTION :SEMI- SUPERVISED METHODS
• Semi-supervised learning for outlier detection involves using a small amount of
labelled data (normal or outlier) and a larger amount of unlabelled data to train a
model.
• These methods are especially useful when labelling outliers is time-consuming or
expensive, and most data is unlabelled
• For example, a semi-supervised outlier detection algorithm may use clustering to group similar data points together and then use the labelled data to identify outliers within the clusters.

1.Assume Normality:
• Semi-supervised methods typically assume that most of the data points are
normal, and only a small fraction are outliers.
2.Train on Labeled Normal Data:
• Use the labeled normal data to learn the patterns or boundaries of normal
behavior.
3.Analyze Unlabeled Data:
• Predict the likelihood of data points in the unlabeled dataset being normal or
anomalous based on the learned model.
4.Identify Outliers: Points that deviate significantly from the learned normal
behavior are flagged as outliers.
• E.g.: DBSCAN, K-Means
OUTLIER DETECTION :UNSUPERVISED METHODS
• These methods use only unlabeled data to identify outliers.
• There are no predefined labels for normal or anomalous data
points.
• These methods identify outliers by analyzing the data's inherent
patterns, structures, and statistical properties without the need
for labeled examples.
• For example, unsupervised outlier detection methods can use
density-based or distance-based methods to identify data points
that are far away from the rest of the data.
• Some popular unsupervised methods include, k-nearest
neighbor (KNN) based method, DBSCAN, and Isolation Forest.
• No Labels Required: Works with unlabeled data, assuming that
most data points represent normal behavior
• Assumption: Outliers are rare and differ substantially from the
rest of the data.
OUTLIER DETECTION :STATISTICAL METHODS
• Statistical methods for outlier detection rely on assumptions
about the distribution of the data to identify points that deviate
significantly from the expected pattern.
• These methods are particularly useful when you know or
assume that the data follows a certain distribution, such as
normal distribution or other well-understood statistical
properties.
• The data not following the model are outliers.
1. Z-Score (Standard Score)
• This method calculates the standard deviation of the data points
and identifies outliers as those with Z-scores exceeding a certain
threshold (typically 3 or -3).
• It measures how far a data point is from the mean in terms of
standard deviations.
• If the Z-score of a point is greater than a threshold (commonly |Z| > 3), it is considered an outlier.
• A high Z-score means the data point is far from the mean, while a Z-score
close to 0 means it’s near the mean.
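
A minimal NumPy sketch of the |Z| > 3 rule; the temperature data is synthetic, with one injected 100°C reading.

import numpy as np

rng = np.random.default_rng(0)
temps = np.append(rng.normal(loc=25.0, scale=2.0, size=200), 100.0)

z = (temps - temps.mean()) / temps.std()   # distance from the mean in standard deviations
outliers = temps[np.abs(z) > 3]            # flags the injected 100°C reading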
2. Interquartile Range (IQR)
• The IQR method divides the data into quartiles and identifies outliers as points that fall outside the range defined by these quartiles, where:
  – Q1 is the 25th percentile (the median of the lower half of the data) and Q3 is the 75th percentile (the median of the upper half).
  – IQR = Q3 - Q1.
• Outliers are defined as values that fall outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR].
• Points below the lower bound or above the upper bound are considered outliers.
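
A corresponding NumPy sketch of the IQR rule with the conventional 1.5 × IQR bounds; the data values are illustrative.

import numpy as np

data = np.array([12, 14, 14, 15, 16, 16, 17, 18, 19, 45])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]   # flags the value 45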
OUTLIER DETECTION:PROXIMITY BASED METHODS
• An object is considered an outlier if its nearest neighbors are
significantly far away, meaning the proximity of the object deviates
substantially from the proximity of most
other objects in the same dataset.
• Example:
– Model the proximity of an object using its 3 nearest neighbors.
– Objects in region R (as shown in the figure) are substantially
different from other objects in the dataset.
– Thus, the objects in region R are considered outliers.
• Two Major Types of Proximity-Based Outlier Detection:
1.Distance-based E.g.: KNN
2.Density-based E.g.: DBSCAN
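
A small distance-based sketch in the spirit of the k-nearest-neighbour idea above: points whose average distance to their 3 nearest neighbours is unusually large are flagged. The synthetic data and the 95th-percentile threshold are assumptions made for this example.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0], [9.0, 7.5]]])  # two far-away points

nn = NearestNeighbors(n_neighbors=4).fit(X)   # 4 = the point itself + 3 neighbours
distances, _ = nn.kneighbors(X)
knn_dist = distances[:, 1:].mean(axis=1)      # ignore the zero distance to self
outlier_mask = knn_dist > np.percentile(knn_dist, 95)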
OUTLIER DETECTION: CLUSTERING BASED METHODS
• Clustering-based methods are used for outlier detection by
utilizing clustering algorithms to group data into clusters.
• Objects that do not fit well into any cluster or belong to very
small clusters are considered outliers.
• In clustering, data points are grouped based on similarity
(distance or density).
• Outliers are points that:
– Do not belong to any cluster or
– Are assigned to small or sparse clusters that significantly
deviate from larger cluster
Examples
1. K-Means-Based Outlier Detection:
• In K-Means clustering, data points are assigned to the nearest cluster center.
• Outliers can be detected by calculating the distance of each point to its nearest cluster center.
  – If the distance exceeds a certain threshold, the point is labeled as an outlier (see the sketch after this list).
2. DBSCAN-Based Outlier Detection:
• Points that do not belong to any cluster (i.e., they are in low-density regions) are marked as noise or outliers.
3. Hierarchical Clustering-Based Outlier Detection:
• In hierarchical clustering, clusters are formed in a hierarchical
manner (either bottom-up or top-down).
• Outliers can be identified as:
– Points that are far from their respective clusters.
– Points that form singleton clusters (clusters with only one
or very few points).
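
A short scikit-learn sketch of the K-Means-based approach described above: points are assigned to clusters and those unusually far from their assigned centre are flagged. The two synthetic clusters, the injected point, and the 95th-percentile threshold are assumptions made for this example.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
    [[2.5, 10.0]],                                   # injected outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centre = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dist_to_centre > np.percentile(dist_to_centre, 95)]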
