DW&DM Material
Data Mining
III B.Tech – I Semester
COURSE SYLLABUS
Unit-1: Data Warehouse and OLAP Technology: An Overview: What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to Data Mining.
Unit-2: Data Mining: Introduction, What is Data Mining?, Motivating Challenges, The Origins of Data Mining, Data Mining Tasks, Types of Data, Data Quality. Data Preprocessing: Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Discretization and Binarization, Variable Transformation, Measures of Similarity and Dissimilarity.
UNIT –I:
Data Warehouse and OLAP Technology: An Overview: What Is a Data Warehouse? A
Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse
Implementation, From Data Warehousing to Data Mining. (Han & Kamber)
2. Integrated – A data warehouse integrates data from multiple heterogeneous sources and must store that data in a simple and universally acceptable manner. It must also be consistent in terms of nomenclature and layout. This makes it suitable for analysing large volumes of data.
3. Time variant – Data in the warehouse is maintained over different intervals of time, such as weekly, monthly, or annual periods. The time horizon of a data warehouse is much wider than that of operational (OLTP) systems: every record in the warehouse is associated with a specific period of time, so the warehouse delivers information from a historical perspective.
4. Non-volatile – The data warehouse is also non-volatile, which means that past data is not erased or overwritten when new data is inserted. The information is essentially read-only and is refreshed only at scheduled intervals. This supports statistical evaluation of the data and an understanding of what happened and when, without requiring any other complicated procedure.
What is OLAP?
OLAP stands for Online Analytical Processing. It's a technology used in data analytics and
business intelligence that enables users to extract and view data from multiple perspectives.
OLAP systems are designed for complex queries and data analysis, allowing users to analyse
different dimensions of data, such as time, geography, or product hierarchies, in a dynamic
and multidimensional way.
Applications of OLAP:
1. Business Intelligence
2. Financial Analysis
3. Sales & Marketing
4. Supply Chain Management
5. Healthcare Analysis
6. Educational Institutions
What is OLTP?
OLTP stands for Online Transaction Processing. It's a type of system and database designed
for managing and processing transaction-oriented applications. Unlike OLAP (Online
Analytical Processing) that focuses on data analysis and reporting, OLTP systems are
optimized for managing day-to-day, routine transactions in real-time.
Applications of OLTP:
1. Banking & Financial Transactions
2. Airline & Travel Management
3. Telecommunications
4. Government Systems
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
1. Star schema:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.
A fact table in the middle connected to a set of dimension tables.
Example for star schema regarding sales,
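As a rough illustration of the idea, the sketch below builds a tiny hypothetical sales fact table and two dimension tables in pandas and joins them the way a star-schema query would; all table names, column names, and figures are invented for this example.

```python
# Hypothetical star-schema tables: one sales fact table joined to two
# dimension tables on their surrogate keys (names are illustrative only).
import pandas as pd

fact_sales = pd.DataFrame({
    "item_key": [1, 2, 1],
    "location_key": [10, 10, 20],
    "units_sold": [5, 3, 7],
    "dollars_sold": [50.0, 90.0, 70.0],
})
dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Shirt", "Pant"]})
dim_location = pd.DataFrame({"location_key": [10, 20], "city": ["Delhi", "Mumbai"]})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate a measure by dimension attributes.
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_location, on="location_key")
          .groupby(["city", "item_name"])["dollars_sold"].sum())
print(report)
```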
2. Snowflake Schema:
The snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to the third
normal form.
The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake.
Example for snowflake schema regarding sales
3. Fact constellation:
A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.
The fact constellation schema is a sophisticated design in which it is more difficult to summarize information. It can be implemented by sharing dimension tables between aggregate fact tables, or by decomposing a complex fact table into independent, simpler fact tables.
Example for fact constellation regarding sales,
OLAP provides various operations to gain insights from the data stored in
multidimensional hypercube.
OLAP operations include:
1. Drill down
2. Roll up
3. Dice
4. Slice
5. Pivot
1. Drill down:
Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is
converted into highly detailed data. It can be implemented by either stepping down a concept
hierarchy for a dimension or adding additional dimensions to the hypercube.
Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2, Q3,
Q4) of various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities
(Delhi, Mumbai, Las Vegas, New York) as shown below:
Here, the drill-down operation is applied on the time dimension and the quarter Q1 is
drilled down to January, February, and March. Hence, by applying the drill-down
operation, we can move down from quarterly sales in a year to monthly or weekly
records.
2. Roll up:
It is the opposite of the drill-down operation and is also known as a drill-up or aggregation
operation. It is a dimension reduction technique that performs aggregation on a data cube. It
makes the data less detailed and it can be performed by combining similar dimensions across
any axis.
Here, we are performing the Roll-up operation on the given data cube by combining
categorizing the sales based on the countries instead of cities.
3. Dice:
Dice operation is used to generate a new sub-cube from the existing hypercube. It selects two
or more dimensions from the hypercube to generate a new sub-cube for the given data.
Here, we are using the dice operation to retrieve the sales done by the company in the
first half of the year i.e., the sales in the first two quarters.
4. Slice:
The slice operation performs a selection on one dimension of the given cube, producing a new sub-cube. It presents the information from another point of view.
Here, the sales done by the company during the first quarter are retrieved by
performing the slice operation on the given hypercube.
5. Pivot:
The pivot (rotate) operation rotates the data axes of the cube to provide an alternative presentation of the data.
Here, we are using the pivot operation to view the sub-cube from a different perspective.
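As a rough illustration (not tied to the exact cube in the figures above), the sketch below imitates roll-up, slice, dice, and pivot on a small flat sales table using pandas; the quarter, city, and item values are invented.

```python
# Imitating OLAP operations on a flat hypothetical sales table with pandas.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "item":    ["Shirt", "Pant", "Shirt", "Pant"],
    "amount":  [100, 150, 120, 130],
})

# Roll-up: aggregate away the city dimension (less detail).
rollup = sales.groupby(["quarter", "item"])["amount"].sum()

# Slice: select a single value on one dimension (quarter = Q1).
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions to form a sub-cube.
dice = sales[(sales["quarter"].isin(["Q1", "Q2"])) & (sales["city"] == "Delhi")]

# Pivot: rotate the axes to view the same data from another perspective.
pivoted = sales.pivot_table(index="city", columns="quarter",
                            values="amount", aggfunc="sum")
print(rollup, slice_q1, dice, pivoted, sep="\n\n")
```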
Top-down view:
o Allows selection of the relevant information necessary for the data warehouse.
Data source view:
o Exposes the information being captured, stored, and managed by operational
systems.
Data warehouse view:
o Consists of fact tables and dimension tables.
Business query view:
o Sees the perspectives of data in the warehouse from the view of end-user.
Data Warehouse is referred to the data repository that is maintained separately from the
organization’s operational data.
Multi-Tier Data Warehouse Architecture consists of the following components:
1. Bottom tier
2. Middle tier
3. Top tier
1. Bottom tier:
The bottom Tier usually consists of Data Sources and Data Storage.
It is the data warehouse database server, for example an RDBMS.
In the bottom tier, data is extracted from operational and external sources using application program interfaces called gateways.
APIs such as ODBC (Open Database Connectivity), OLE-DB (Object Linking and Embedding, Database), and JDBC (Java Database Connectivity) are supported.
ETL stands for Extract, Transform, and Load.
Several popular ETL tools include:
IBM Infosphere
Informatica
Microsoft SSIS
Confluent
2. Middle tier:
The middle tier is an OLAP server that is typically implemented using either :
o A relational OLAP (ROLAP) model.
o A multidimensional OLAP (MOLAP) model.
OLAP server models come in three different categories: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP), which combines the two.
3. Top tier:
The top tier is a front-end client layer, which includes query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction).
Here are a few Top Tier tools that are often used:
SAP BW
IBM Cognos
Microsoft BI Platform
1. Enterprise Warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire organization.
It usually contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
It is typically implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modelling and may take years to design and build.
2. Data Mart:
A data mart contains a subset of corporate-wide data that is important to a specific
group of users.
The scope is limited to specific selected subjects.
For example, a marketing data mart may limit its topics to customers, goods, and
sales.
The data contained in data marts is usually summarized. Data marts are typically implemented on low-cost departmental servers that are Unix/Linux or Windows based.
3. Virtual Warehouse:
A virtual warehouse is a group of views on an operational database.
For efficient query processing, only some of the possible summary views may be materialized.
Creating a virtual warehouse is easy, but requires additional capacity on
operational database servers.
3. Modularity: The architecture supports modular design, which facilitates the creation,
testing, and deployment of separate components.
4. Security: The data warehouse’s overall security can be improved by applying various
security measures to various layers.
5. Improved Resource Management: Different tiers can be tuned to use the proper
hardware resources, cutting expenses overall and increasing effectiveness.
6. Easier Maintenance: Maintenance is simpler because individual components can be
updated or maintained without affecting the data warehouse as a whole.
7. Improved Reliability: Using many tiers can offer redundancy and failover
capabilities, enhancing the data warehouse’s overall reliability.
Data extraction:
Get data from multiple, heterogeneous, and external sources
Data cleaning:
Detect errors in the data and rectify them when possible
Data transformation:
Convert data from legacy or host format to warehouse format
Load:
Sort, summarize, consolidate, compute views, check integrity, and build indices
and partitions
Refresh:
Propagate the updates from the data sources to the warehouse
If the dimensions have concept hierarchies, then the total number of cuboids that can be generated for an n-dimensional cube is
Total cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1),
where Li is the number of levels associated with dimension i (the extra 1 accounts for the virtual top level "all"). For example, with time (4 levels), item (3 levels), and location (4 levels), the cube contains 5 × 4 × 5 = 100 cuboids.
2. Access Methods:
There are two access methods:
a. Bitmap Index
b. Join Index
a. Bitmap Index:
An indexing technique known as Bitmap Indexing enables data to be retrieved
quickly from columns that are frequently used and have low cardinality.
Cardinality is the count of distinct elements in a column.
In general, Bitmap combines the terms Bit and Map, where bit represents the
smallest amount of data on a computer, which can only hold either 0 or 1 and
map means transforming and organizing the data according to what value
should be assigned to 0 and 1.
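A bitmap index can be sketched directly in Python: one bit vector per distinct value of a low-cardinality column, with predicates answered by bitwise operations. The column and values below are purely illustrative.

```python
# Minimal bitmap-index sketch for a low-cardinality column (illustrative data).
region = ["North", "South", "North", "East", "South", "North"]

# One bit vector per distinct value: bit i is 1 if row i holds that value.
bitmaps = {value: [1 if r == value else 0 for r in region]
           for value in set(region)}

# Answering "region = North OR region = East" is a bitwise OR of two bitmaps.
selected = [n | e for n, e in zip(bitmaps["North"], bitmaps["East"])]
rows = [i for i, bit in enumerate(selected) if bit]
print(bitmaps["North"])  # [1, 0, 1, 0, 0, 1]
print(rows)              # row ids matching the predicate
```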
b. Join indices:
Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relations.
For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, the join index records can identify joinable tuples without performing costly join operations.
The star schema model of data warehouses makes join indexing attractive for cross-table search, because the linkage between a fact table and its corresponding dimension tables comprises the fact table's foreign key and the dimension table's primary key.
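A join index is essentially a precomputed list of (RID, SID) pairs. The toy sketch below, using made-up record identifiers, builds such an index for the R and S relations described above and uses it to identify joinable tuples without re-running the join.

```python
# Toy join index between a fact relation R(RID, foreign key) and a
# dimension relation S(primary key, SID); all data is illustrative.
R = [("r1", 1), ("r2", 2), ("r3", 1)]          # (RID, foreign key)
S = [(1, "s1"), (2, "s2")]                     # (primary key, SID)

# Precompute the join index: pairs of record identifiers that join.
key_to_sid = {key: sid for key, sid in S}
join_index = [(rid, key_to_sid[key]) for rid, key in R if key in key_to_sid]
print(join_index)   # [('r1', 's1'), ('r2', 's2'), ('r3', 's1')]

# At query time, joinable tuples are read straight from the index,
# without performing the join again.
```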
Information processing:
o Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs.
Analytical processing:
o Multidimensional analysis of data warehouse data.
o Supports basic OLAP operations, slice-dice, drilling, pivoting.
Data mining:
o Knowledge discovery from hidden patterns
o Supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools
*****
UNIT –II:
Data Mining: Introduction, What is Data Mining?, Motivating challenges, The origins of
Data Mining, Data Mining Tasks, Types of Data, Data Quality. Data Preprocessing:
Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature
creation, Discretization and Binarization, Variable Transformation, Measures of Similarity
and Dissimilarity. (Tan & Vipin)
Data Mining
2.1.1 Introduction: Data mining is a process of discovering patterns, trends, insights, and
knowledge from large volumes of data. It involves the use of various techniques and tools to
analyze and extract valuable information from datasets, with the goal of making informed
decisions and predictions. Data mining is an integral part of the broader field of data science
and plays a crucial role in industries such as business, healthcare, finance, and more.
Applications:
Business: Data mining is used in customer relationship management, market basket
analysis, and fraud detection.
Healthcare: It helps in disease prediction, patient diagnosis, and medical research.
Finance: In financial services, it's used for credit scoring, risk assessment, and stock
market prediction.
Retail: Retailers use data mining to optimize inventory management and product
recommendations.
Challenges: Data mining faces challenges such as dealing with big data, ensuring
data privacy and security, and selecting the right algorithms and parameters for a
given task.
Machine Learning Connection: Data mining often overlaps with machine learning,
as machine learning algorithms are frequently used for predictive modeling and
pattern recognition in data mining tasks.
KDD Process:
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process: the steps listed below may need to be repeated several times to extract accurate knowledge from the data.
The following steps are included in the KDD process:
Data Integration
Data Selection
Data Transformation
Data Mapping
Code generation
Data Mining
Pattern Evaluation
Knowledge Representation
Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can
help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes
the data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer
service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying
patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast
future trends and patterns.
Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about
individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new unseen data.
The origins of data mining can be traced back to various fields and disciplines, including
computer science, statistics, and database management. Data mining is essentially the
process of discovering patterns, trends, and valuable insights from large and complex
datasets. Here's a brief overview of its origins:
a. Statistics: Data mining has strong roots in statistical analysis. Statisticians have been
working on methods for analyzing and extracting meaningful information from data for
centuries. Techniques such as regression analysis, hypothesis testing, and clustering can
be seen as precursors to modern data mining methods.
b. Machine Learning: Machine learning, a subfield of artificial intelligence, has
contributed significantly to data mining. Techniques such as decision trees, neural
networks, and support vector machines have been adapted and incorporated into data
mining algorithms.
c. Database Management: The field of database management also played a crucial role
in the development of data mining. The emergence of large relational databases in the
1970s and 1980s paved the way for data mining by providing structured data for
analysis. SQL queries and other database-related technologies were essential for data
retrieval.
Data mining involves a variety of tasks aimed at discovering patterns, relationships, and
useful information within large datasets. These tasks can be categorized into several
fundamental areas.
I. Classification: Classification is the process of assigning data points to predefined categories or classes based on their attributes. This is commonly used in tasks like spam email detection, sentiment analysis, and disease diagnosis. Popular algorithms for classification include decision trees, support vector machines, and neural networks.
Applications:
1. Fraud Detection:
a. Goal: Predict fraudulent cases in credit card transactions.
b. Approach:
Use credit card transactions and the information on the account holder as attributes.
o When does the customer buy, what does the customer buy, how often does the customer pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions
on an account.
2. Sky Survey Cataloging
a. Goal: To predict class (star or galaxy) of sky objects, especially visually
faint ones, based on the telescopic survey images (from Palomar Observatory)
b. Approach:
Segment the image.
Measure image attributes (features) - 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars, some of the
farthest objects that are difficult to find!
II. Regression: Regression is used to predict a numerical value or continuous variable based
on other attributes or variables. It is often employed in applications like sales forecasting,
price prediction, and risk assessment.
Linear regression, polynomial regression, and regression trees are common
techniques.
The primary objective of regression is to model the relationship between one
or more independent variables (predictors or features) and a dependent
variable (the target or outcome) to make predictions or estimate values.
Examples:
Predicting sales amounts of new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure,
etc.
Time series prediction of stock market indices.
Applications:
Predictive Modeling: Regression is commonly used in fields such as finance,
sales forecasting, and epidemiology to predict future values or outcomes.
Risk Assessment: It is used to assess risk in insurance, investment, and loan
approval by estimating the likelihood of specific outcomes.
Quality Control: Regression can help analyze the relationships between
variables in manufacturing processes, identifying factors that affect product
quality.
III. Clustering: Clustering is a key data mining task that involves grouping similar data points or objects into clusters, such that the data points within a cluster share similar characteristics. It is an unsupervised learning technique, meaning that the algorithm doesn't rely on predefined labels or categories but instead seeks to discover the inherent structure or patterns within the data.
The main objective of clustering is to find natural groupings or structures in a
dataset. These groupings can help in data exploration, pattern recognition, and
understanding the underlying structure of the data.
The choice of clustering algorithm and parameters depends on the nature of the
data and the goals of the analysis.
Applications:
Customer Segmentation: Clustering helps businesses group customers with
similar purchasing behaviors and preferences, which can inform targeted
marketing strategies.
Image Segmentation: In image processing, clustering is used to segment
images into meaningful regions or objects.
Anomaly Detection: Clustering can help identify anomalies or outliers by
considering data points that do not fit well into any cluster.
Document Categorization: In text mining, clustering can group similar
documents together based on their content, facilitating document
categorization.
Network Analysis: Clustering is used in social network analysis to detect
communities or groups of closely connected individuals.
IV. Association: Association in data mining refers to the process of discovering interesting
and meaningful relationships or associations between items or attributes in a dataset. This
task is particularly useful for market basket analysis, where the goal is to find patterns in
customer purchasing behavior.
Association rules are written in the form X → Y, with an antecedent (left-hand side) and a consequent (right-hand side). These rules show the relationships between items. For example, a simple association rule could be "If a customer buys item A, they are likely to buy item B." Support, confidence, and lift values are calculated to evaluate the quality of association rules.
Applications:
Market Basket Analysis: One of the most common applications of
association mining is in retail for understanding customer purchasing behavior
and optimizing product placement and promotions.
Cross-Selling: E-commerce and online platforms use association rules to
recommend products or services to customers based on their purchase history.
Healthcare: Association rules can be applied to medical data to discover
associations between medical conditions, symptoms, and treatments.
Fraud Detection: Detecting fraudulent activities, such as credit card fraud,
can benefit from association rule mining to identify patterns in transaction
data.
Types of Attributes: Nominal, ordinal, interval, and ratio are different levels of measurement or data types used in statistics and data analysis. They represent a hierarchy of measurement scales, each with distinct characteristics: nominal attributes are unordered categories (e.g., colour), ordinal attributes are ordered categories (e.g., grades), interval attributes have meaningful differences but no true zero (e.g., temperature in °C), and ratio attributes have both meaningful differences and a true zero (e.g., length or counts).
Data quality refers to the degree to which data is accurate, complete, consistent, reliable, and
suitable for its intended purpose. High data quality ensures that data is free from errors,
omissions, and inconsistencies, making it a valuable asset for organizations. Poor data quality
can lead to incorrect analyses, flawed decision-making and operational inefficiencies.
Accuracy: Data accuracy means that the information contained in the dataset
is correct and free from errors or mistakes. Accurate data is vital for making
informed decisions and avoiding costly errors.
Completeness: Data completeness indicates that all the required data points or
attributes are present in the dataset. Incomplete data can lead to gaps in
information and hinder meaningful analysis.
Consistency: Data consistency refers to the uniformity and coherence of data
across different sources or within the same dataset. Inconsistent data can result
from conflicting information or varying formats.
Reliability: Reliable data can be consistently depended upon for accuracy and
consistency. Data should maintain its quality over time, and it should be
reliable for its intended use.
Relevance: Relevance relates to the suitability of data for the specific task or
purpose at hand. Irrelevant data, even if accurate and complete, can lead to
poor decision-making.
Data quality problems can manifest in various ways, such as missing values, duplicate records, inconsistent formats, and outdated information, and they can impact an organization's decision-making, operations, and overall efficiency. These problems are typically addressed through the data cleaning and preprocessing techniques discussed in the next section.
Data Preprocessing
2.2 Data Preprocessing:
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
2.2.1 Aggregation:
Imagine that you have collected the data for your analysis. These data consist of the All Electronics sales per quarter, for the years 2002 to 2004. You are, however, interested in the annual sales (total per year), rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
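A minimal pandas sketch of this quarterly-to-annual aggregation, using made-up sales figures:

```python
# Aggregating quarterly sales (made-up figures) into annual totals.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 412, 380, 600],
})

# Aggregation: summarize total sales per year instead of per quarter.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)
```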
Data cubes store multidimensional aggregated information. For example, a data cube can hold the multidimensional analysis of sales data with respect to annual sales per item type for each All Electronics branch. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple
levels of abstraction. For example, a hierarchy for branch could allow branches to be
grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting on-line analytical processing as well
as data mining.
The cube created at the lowest level of abstraction is referred to as the base
cuboid. The base cuboid should correspond to an individual entity of interest,
such as sales or customer.
A cube at the highest level of abstraction is the apex cuboid. The apex
cuboid would give one total—the total sales for all three years for all item
types, and for all branches.
Data cubes created for varying levels of abstraction are often referred to as
cuboids, so that a data cube may instead refer to a lattice of cuboids. Each
higher level of abstraction further reduces the resulting data size.
When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
2.2.2 Sampling:
Sampling is typically used in data mining because processing the entire set
of data of interest is too expensive or time consuming
Types of Sampling:
o Simple random sampling: every item has an equal probability of being selected; it may be done without replacement (an item cannot be picked twice) or with replacement (the same item may be picked more than once).
o Stratified sampling: the data is split into partitions (for example, by class), and random samples are drawn from each partition so that every group is represented.
o Progressive (adaptive) sampling: start with a small sample and keep increasing the sample size until a sample of sufficient size is obtained.
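Assuming pandas is available, the sketch below shows simple random sampling (with and without replacement) and a stratified sample on a tiny synthetic table; the column names and sizes are arbitrary.

```python
# Simple random and stratified sampling sketches on a tiny synthetic data set.
import pandas as pd

df = pd.DataFrame({"x": range(10),
                   "label": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"]})

# Simple random sampling without replacement (fixed seed for repeatability).
srs = df.sample(n=4, replace=False, random_state=42)

# Sampling with replacement: the same row may appear more than once.
srs_wr = df.sample(n=4, replace=True, random_state=42)

# Stratified sampling: draw 50% from each class so both are represented.
stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=42)
print(srs, srs_wr, stratified, sep="\n\n")
```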
Dimensionality Reduction:
Irrelevant features
Contain no information that is useful for the data mining task at hand.
Example: students' ID is often irrelevant to the task of predicting students'
GPA.
Create new attributes (features) that can capture the important information in a data set more
effectively than the original ones.
Three general methodologies
Attribute extraction
Domain-specific
Mapping data to new space (see: data reduction)
o Example: Fourier transformation, wavelet transformation, manifold approaches (not covered)
Attribute construction
Combining features
Data discretization
Binarization:
Binarization is the process of transforming continuous and discrete attributes into binary attributes. It is used to convert numerical data into categorical data. Binarization is often used for machine learning algorithms that require categorical (or binary) input data.
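A brief sketch of binarization with pandas: a continuous attribute is thresholded into a 0/1 attribute, and a discrete attribute is expanded into binary (one-hot) columns. The threshold and column names are illustrative.

```python
# Binarization of a continuous attribute and of a categorical attribute.
import pandas as pd

data = pd.DataFrame({"income": [25_000, 60_000, 42_000],
                     "grade": ["low", "high", "medium"]})

# Continuous -> binary: 1 if income exceeds an illustrative threshold.
data["high_income"] = (data["income"] > 40_000).astype(int)

# Discrete -> asymmetric binary attributes (one column per category value).
binary_grade = pd.get_dummies(data["grade"], prefix="grade")
print(pd.concat([data, binary_grade], axis=1))
```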
Measures of Similarity and Dissimilarity:
Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects. The most popular distance measures used are:
1. Euclidean Distance: Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary straight-line distance between two points, and it is one of the most used measures in cluster analysis (the K-means algorithm, for example, relies on it). Mathematically, it computes the square root of the sum of squared differences between the coordinates: for points P(x1, y1) and Q(x2, y2) in the plane, d(P, Q) = sqrt((x1 − x2)^2 + (y1 − y2)^2).
2. Manhattan Distance: This determines the sum of the absolute differences between the pairs of coordinates. Suppose we have two points P and Q; to determine the distance between them, we add up the absolute differences of their coordinates along each axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2), d(P, Q) = |x1 − x2| + |y1 − y2|.
3. Minkowski Distance: It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given as:
d(P1, P2) = (|X1 − Y1|^p + |X2 − Y2|^p + ... + |XN − YN|^p)^(1/p)
When p = 1, the Minkowski distance is the same as the Manhattan distance, and when p = 2, it is the same as the Euclidean distance.
4. Cosine Index: The cosine similarity measure for clustering determines the cosine of the angle between two vectors:
cos(θ) = (A · B) / (||A|| · ||B||)
Here θ gives the angle between the two vectors, and A, B are n-dimensional vectors. The cosine distance is then taken as 1 − cos(θ).
5. Jaccard Distance: The Jaccard index measures the similarity of two sets as J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J(A, B). It is commonly used when the objects are sets or binary vectors.
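The measures above can be checked with a short numpy sketch; the two example vectors (and the two example sets for Jaccard) are arbitrary.

```python
# Computing the distance and similarity measures above for two small vectors.
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))           # root of squared differences
manhattan = np.sum(np.abs(p - q))                   # sum of absolute differences
minkowski3 = np.sum(np.abs(p - q) ** 3) ** (1 / 3)  # p = 3 as an illustration
cosine_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Jaccard works on sets (or binary vectors): |A ∩ B| / |A ∪ B|.
A, B = {"bread", "milk", "beer"}, {"bread", "diapers", "beer"}
jaccard_sim = len(A & B) / len(A | B)
jaccard_dist = 1 - jaccard_sim

print(euclidean, manhattan, minkowski3, cosine_sim, jaccard_dist)
```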
*****
UNIT –III:
Classification: Basic Concepts, General Approach to solving a classification problem,
Decision Tree Induction: Working of Decision Tree, building a decision tree, methods for
expressing an attribute test conditions, measures for selecting the best split, Algorithm for
decision tree induction. Model Overfitting: Due to presence of noise, due to lack of
representation samples, evaluating the performance of classifier: holdout method, random
sub sampling, cross-validation, bootstrap. Bayes Theorem, Naïve Bayes Classifier.
3.1 Classification
3.1.1 Basic Concepts
3.1.2 General Approach to solving a classification problem
3.1.3 Decision Tree Induction
3.1.3.1 Working of Decision Tree
3.1.3.2 building a decision tree
3.1.3.3 methods for expressing an attribute test conditions
3.1.3.4 measures for selecting the best split
3.1.3.5 Algorithm for decision tree induction.
3.2 Model Overfitting
3.2.1 Due to presence of noise
3.2.2 Due to lack of representation samples
3.2.3 Evaluating the performance of classifier
3.2.3.1 Holdout method
3.2.3.2 Random sub sampling
3.2.3.3 Cross-Validation
3.2.3.4 Bootstrap
3.2.4 Bayes Theorem
3.2.5 Naïve Bayes Classifier (Tan &Vipin)
3.1 CLASSIFICATION:
Classification is a task in data mining that involves assigning a class label to each instance
in a dataset based on its features. The goal of classification is to build a model that
accurately predicts the class labels of new instances based on their features
Table 3.1 A sample data for the loan borrower classification problem
The performance of a model (classifier) can be evaluated by comparing the predicted labels
against the true labels of instances. This information can be summarized in a table called a
confusion matrix. Each entry fij denotes the number of instances from class i predicted to be
of class j. For example, f01 is the number of instances from class 0 incorrectly predicted as
class 1. The number of correct predictions made by the model is (f11 + f00) and the number
of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information into a single number makes it
more convenient to compare the relative performance of different models. This can be done
using an evaluation metric such as accuracy, which is computed in the following way:
For binary classification problems, the accuracy of a model is given by
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00).
Error rate is another related metric, which is defined as follows for binary classification problems:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00).
The learning algorithms of most classification techniques are designed to learn models that
attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
3.1.3 Decision Tree Induction:
Decision tree induction builds a classifier by repeatedly asking questions about the attributes of the data. At each step, the algorithm chooses the question that provides the best separation of the data. It keeps doing this until it creates a tree structure that can make accurate predictions on new, unseen data.
In a nutshell, decision tree induction is a powerful tool in data mining, providing a clear and understandable way to make decisions based on complex data.
3.1.3.1 Working of Decision Tree:
To illustrate how a decision tree works, consider the classification problem of distinguishing mammals from non-mammals using a vertebrate data set. Suppose a new species is discovered by scientists.
How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series
of questions about the characteristics of the species. The first question we may ask is whether
the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal.
Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up
question: Do the females of the species give birth to their young? Those that do give birth are
definitely mammals, while those that do not are likely to be non-mammals (with the
exception of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by asking a
series of carefully crafted questions about the attributes of the test instance. Each time we
receive an answer, we could ask a follow-up question until we can conclusively decide on its
class label. The series of questions and their possible answers can be organized into a
hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree
for the mammal classification problem. The tree has three types of nodes:
• A root node, with no incoming links and zero or more outgoing links.
• Internal nodes, each of which has exactly one incoming link and two or more outgoing
links.
• Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing
links.
Every leaf node in the decision tree is associated with a class label. The non-terminal nodes,
which include the root and internal nodes, contain attribute test conditions that are typically
defined using a single attribute.
Each possible outcome of the attribute test condition is associated with exactly one child of
this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body
Temperature to define an attribute test condition that has two outcomes, warm and cold,
resulting in two child nodes.
Given a decision tree, classifying a test instance is straightforward. Starting from the root
node, we apply its attribute test condition and follow the appropriate branch based on the
outcome of the test. This will lead us either to another internal node, for which a new
attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the
class label associated with the node to the test instance. As an illustration, Figure 3.5 traces
the path used to predict the class label of a flamingo. The path terminates at a leaf node
labeled as Non-mammals.
Figure 3.5.Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of
applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is
eventually assigned to the Non-mammals class.
3.1.3.2 Building a decision tree
Many possible decision trees can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space.
One of the earliest methods is Hunt’s algorithm, which is the basis for many current
implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection
presents Hunt’s algorithm.
Hunt’s Algorithm:
In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree initially contains
a single root node that is associated with all the training instances. If a node is associated
with instances from more than one class, it is expanded using an attribute test condition that
is determined using a splitting criterion. A child leaf node is created for each outcome of the
attribute test condition and the instances associated with the parent node are distributed to the
children based on the test outcomes. This node expansion step can then be recursively
applied to each child node, as long as it has labels of more than one class. If all the instances
associated with a leaf node have identical class labels, then the node is not expanded any
further. Each leaf node is assigned a class label that occurs most frequently in the training
instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in Table 3.1 for the
loan borrower classification problem. Suppose we apply Hunt’s algorithm to fit the training
data. The tree initially contains only a single leaf node as shown in Figure 3.6(a). This node
is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan
payments. The training error of this tree is 30% as three out of the ten training instances have
the class label Defaulted = Yes. The leaf node can therefore be further expanded because it
contains training instances from more than one class. Let Home Owner be the attribute
chosen to split the training instances. The justification for choosing this attribute as the
attribute test condition will be discussed later. The resulting binary split on the Home Owner
attribute is shown in Figure 3.6(b). All the training instances for which Home Owner= Yes
are propagated to the left child of the root node and the rest are propagated to the right child.
Hunt’s algorithm is then recursively applied to each child. The left child becomes a leaf node
labeled Defaulted = No, since all instances associated with this node have identical class
label Defaulted=No. The right child has instances from each class label. Hence, we split it
further. The resulting subtrees after recursively expanding the right child are shown in
Figures 3.6(c) and (d).
Hunt’s algorithm, as described above, makes some simplifying assumptions that are often not
true in practice. In the following, we describe these assumptions and briefly discuss some of
the possible ways for handling them.
1. Some of the child nodes created in Hunt’s algorithm can be empty if none of the training
instances have the particular attribute values. One way to handle this is by declaring each of
them as a leaf node with a class label that occurs most frequently among the training
instances associated with their parent nodes.
2. If all training instances associated with a node have identical attribute values but different
class labels, it is not possible to expand this node any further. One way to handle this case is
to declare it a leaf node and assign it the class label that occurs most frequently in the training
instances associated with this node.
3.1.3.3 Methods for expressing attribute test conditions:
Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split.
Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
Continuous Attributes: For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.
3.1.3.4 Measures for selecting the best split:
1. Gini impurity: It measures the frequency at which a randomly selected element would be incorrectly classified. The goal is to minimize the Gini impurity, and a split with lower impurity is considered better.
2. Entropy and information gain: Entropy measures the disorder (impurity) of the class labels in a set; information gain is the reduction in entropy produced by a split. The split with the highest information gain is preferred.
3. Misclassification error: This measures the error rate by calculating the proportion of misclassified instances in a set. The split with the lowest misclassification error is chosen.
4. Gain ratio: It is based on information gain but takes into account the intrinsic information
of a split. It penalizes splits that result in a large number of subsets.
5. Chi-square: It is used for categorical target variables and evaluates the independence of
two variables. A lower chi-square value indicates a better split.
6. Variance reduction (for regression trees): In regression problems, the goal is to
minimize the variance of the target variable within each split.
The choice of the measure depends on the specific problem, data, and algorithm used. Some
algorithms, like CART (Classification and Regression Trees), use Gini impurity or entropy,
while others may use different criteria.
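A short sketch that computes Gini impurity, entropy, and misclassification error for the class labels reaching one candidate node; the class counts are made up.

```python
# Gini impurity, entropy, and misclassification error for one node's labels.
import numpy as np

def impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    gini = 1.0 - np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    misclass = 1.0 - p.max()
    return gini, entropy, misclass

# A node holding 7 instances of class 0 and 3 of class 1 (illustrative).
labels = np.array([0] * 7 + [1] * 3)
print(impurities(labels))   # (0.42, 0.881..., 0.3)
```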
3.1.3.5 Algorithm for decision tree induction:
The skeleton of the decision tree induction algorithm uses the following functions:
1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. Popular measures include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node.
4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class labels or attribute values.
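A minimal, runnable sketch of this skeleton is given below. The helper names deliberately mirror the functions described above (createNode, find_best_split, Classify, stopping_cond), but it is only a teaching sketch: it handles numeric attributes with binary splits and uses the Gini index as the impurity measure.

```python
# Recursive decision tree induction sketch (numeric attributes, binary splits,
# Gini index); helper names mirror the functions described in the text.
import numpy as np

def create_node():
    # A node holds either a test condition (internal) or a class label (leaf).
    return {"test_cond": None, "label": None, "left": None, "right": None}

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def find_best_split(X, y):
    # Try every attribute and every candidate threshold; keep the split with
    # the lowest weighted Gini index of the resulting children.
    best_attr, best_thr, best_imp = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for thr in values[:-1]:                      # split as  x <= thr
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if w < best_imp:
                best_attr, best_thr, best_imp = j, thr, w
    return best_attr, best_thr

def stopping_cond(X, y):
    # Stop when all labels are identical or all attribute values are identical.
    return len(np.unique(y)) == 1 or len(np.unique(X, axis=0)) == 1

def classify(y):
    # Majority class of the training instances reaching this leaf.
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def tree_growth(X, y):
    node = create_node()
    if stopping_cond(X, y):
        node["label"] = classify(y)
        return node
    j, thr = find_best_split(X, y)
    node["test_cond"] = (j, thr)
    mask = X[:, j] <= thr
    node["left"] = tree_growth(X[mask], y[mask])
    node["right"] = tree_growth(X[~mask], y[~mask])
    return node

# Tiny illustrative data: one numeric attribute, two classes.
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1])
tree = tree_growth(X, y)
print(tree["test_cond"])   # (0, 3.0): split on attribute 0 at value 3.0
```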
EXAMPLE:
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the
+ class labels of that region
Confusion Matrix:
A confusion matrix is a tabular representation that shows the true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN).
TP: Correctly predicted positive instances.
TN: Correctly predicted negative instances.
FP: Incorrectly predicted positive instances
FN: Incorrectly predicted negative instances
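Assuming scikit-learn is available, a brief sketch computing the confusion matrix, accuracy, and error rate for some hypothetical predictions:

```python
# Confusion matrix, accuracy, and error rate for hypothetical predictions.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
tn, fp, fn, tp = cm.ravel()             # layout for binary labels {0, 1}
accuracy = accuracy_score(y_true, y_pred)
error_rate = 1 - accuracy

print(cm)
print(tp, tn, fp, fn)        # 3 3 1 1
print(accuracy, error_rate)  # 0.75 0.25
```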
Sample Size: You can control the size of the random sub-sample, which may be a fixed
number of data points or a specific percentage of the original dataset. The sample size is often
determined by the requirements of your analysis or modeling task.
Repeatability: If repeatability is important, you can set a random seed before sampling to
ensure that the same sub-sample is obtained when needed. Random sub-sampling can be
useful for various purposes
k-fold Cross-Validation:
In k-fold cross-validation, the data is divided into k equal-sized folds. The model is trained on k − 1 folds and tested on the remaining fold, and this procedure is repeated k times so that each fold serves as the test set exactly once; the k performance estimates are then averaged.
Stratified Cross-Validation:
In stratified cross-validation, the data is divided into folds in such a way that each fold
maintains the same class distribution as the overall dataset. This is particularly important
when dealing with imbalanced datasets to ensure that each class is adequately represented in
each fold.
Leave-One-Out Cross-Validation (LOOCV):
In LOOCV, each data point is treated as a separate test set, while the rest of the data is used for training.
LOOCV is a special case of k-fold cross-validation where k is equal to the number of data
points. It can be computationally intensive for large datasets but provides a rigorous estimate
of model performance.
Repeated Cross-Validation:
To reduce the potential impact of the initial random partitioning of the data, repeated
cross-validation involves running the cross-validation process multiple times with
different random splits.
This provides more robust performance estimates.
The main advantages of cross-validation in data mining are:
It provides a more reliable estimate of a model's performance compared to a single
train-test split.
It helps detect issues like overfitting, as models that perform well on one set of data
but not on others may indicate overfitting.
It ensures that the model's performance is assessed on a variety of data points,
improving its generalization ability.
Cross-validation is a crucial step in model selection, hyper parameter tuning, and assessing
the overall quality of a predictive model in data mining and machine learning. It is a valuable
tool for ensuring that your model will perform well on new, unseen data.
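Assuming scikit-learn is available, the sketch below runs plain and stratified 5-fold cross-validation for a decision tree on a built-in dataset:

```python
# 5-fold and stratified 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print(cross_val_score(clf, X, y, cv=kfold).mean())    # plain k-fold estimate
print(cross_val_score(clf, X, y, cv=skfold).mean())   # preserves class ratios
```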
Bayes' Theorem, named after the Reverend Thomas Bayes, is a fundamental concept in
probability theory and statistics. It provides a way to update and revise probabilities for a
hypothesis or event based on new evidence or information. Bayes' Theorem is particularly
useful in various fields, including machine learning, data science, and decision making. The
theorem is
Bayes' theorem is also known as Bayes' rule or Bayes' law, and it is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of A, and P(B) is the probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and corresponding target variable "Play".
So using this dataset we need to decide that whether we should play or not on a particular day
according to the weather conditions. So to solve this problem, we need to follow the below
steps:
Problem: If the weather is sunny, then the Player should play or not?
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41
So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny); hence, on a sunny day, the player can play the game.
Steps to implement:
Step1: Data Pre-processing step
Step2: Fitting Naive Bayes to the Training set
Step3: Predicting the test result
Step4: Test accuracy of the result (Creation of Confusion matrix)
Step5: Visualizing the test set result.
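Assuming scikit-learn is available, the steps can be sketched with a categorical Naive Bayes model on a tiny, made-up weather table; the rows and the integer encoding below are invented for illustration and are not the exact table of the worked example.

```python
# Naive Bayes on a tiny, made-up weather table (0 = Sunny, 1 = Overcast,
# 2 = Rainy for Outlook; target 1 = Play, 0 = Don't play).
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import confusion_matrix

X = np.array([[0], [0], [1], [2], [2], [1], [0], [2]])   # Outlook only
y = np.array([ 0,   1,   1,   1,   0,   1,   1,   0 ])

model = CategoricalNB()            # Step 2: fit Naive Bayes to the training set
model.fit(X, y)

pred = model.predict(X)            # Step 3: predict (here, on the same toy data)
print(confusion_matrix(y, pred))   # Step 4: confusion matrix for the result
print(model.predict_proba([[0]]))  # posterior P(No | Sunny), P(Yes | Sunny)
```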
Bootstrap:
Resampling: Start with a dataset of size N. To create a bootstrap sample, you randomly
select N data points from the original dataset with replacement. This means that a single data
point can be selected multiple times in a bootstrap sample, and some data points may not be
selected at all.
Statistical Estimation: Calculate the statistic of interest on each bootstrap sample. This
statistic can be a mean, median, standard deviation, correlation coefficient, or any other
parameter you want to estimate.
Repeat: Repeat steps 1 and 2 a large number of times (often thousands or tens of
thousands) to create a distribution of the statistic of interest.
Analyze the Distribution: With the collection of statistics obtained from the bootstrap
samples, you can analyze their distribution. This distribution provides insights into the
variability and uncertainty associated with the original statistic. You can calculate confidence
intervals, perform hypothesis testing, or assess the stability of model parameters.
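A minimal numpy sketch of the bootstrap procedure described above, estimating the variability of a sample mean on arbitrary toy data:

```python
# Bootstrap estimate of the sampling distribution of the mean (toy data).
import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.1, 5.6, 3.8, 7.2, 6.0, 5.5, 4.9, 6.3])

# Steps 1-3: resample with replacement many times, recording the statistic.
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(10_000)])

# Step 4: analyze the distribution, e.g. a 95% percentile confidence interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), boot_means.std(), (low, high))
```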
*****
UNIT –IV:
Association Analysis: Basic Concepts and Algorithms: Problem Definition, Frequent Item
Set Generation, Apriori Principle, Apriori Algorithm, Rule Generation, Compact
Representation of Frequent Item sets, FP-Growth Algorithm.
Association Analysis:
{Diapers} → {Beer}
The rule suggests a relationship between the sale of diapers and beer because many
customers who buy diapers also buy beer. Retailers can use these types of rules to
help them identify new opportunities for cross-selling their products to the
customers.
REPRESENTATION:
Binary Representation: Market basket data can be represented in a binary format, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise.
This representation is a simplistic view of real market basket data because it ignores
important aspects of the data such as the quantity of items sold or the price paid to purchase
them.
The support count for {Beer, Diapers, Milk} is equal to two because there are only two
transactions that contain all three items. Often, the property of interest is the support, which
is the fraction of transactions in which an itemset occurs:
s(X) = σ(X)/N,
where σ(X) is the support count of the itemset X (the number of transactions that contain X) and N is the total number of transactions.
An itemset X is called frequent if s(X) is greater than some user-defined threshold, min
support.
NOTE:
Association Rule: An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are:
Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)
Example:
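As a worked sketch (using a small hypothetical set of five transactions), the following code computes the support and confidence of the rule {Milk, Diapers} → {Beer} directly from the definitions above:

```python
# Support and confidence of {Milk, Diapers} -> {Beer} on five toy transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
X, Y = {"Milk", "Diapers"}, {"Beer"}
N = len(transactions)

sigma_X  = sum(X <= t for t in transactions)          # support count of X
sigma_XY = sum((X | Y) <= t for t in transactions)    # support count of X ∪ Y

support = sigma_XY / N            # s(X -> Y) = σ(X ∪ Y) / N
confidence = sigma_XY / sigma_X   # c(X -> Y) = σ(X ∪ Y) / σ(X)
print(support, confidence)        # 0.4, 0.666...
```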
Support is an important measure because a rule that has very low support might occur
simply by chance.
Support also has a desirable property that can be exploited for the efficient discovery
of association rules.
Confidence, on the other hand, measures the reliability of the inference made by a
rule.
For a given rule X −→ Y, the higher the confidence, the more likely it is for Y to be
present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
Formulation of the Association Rule Mining Problem:
The association rule mining problem can be formally stated as follows: given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
Brute-force approach:
A brute-force approach works as follows:
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the minsup and minconf thresholds.
A brute-force approach for mining association rules is to compute the support and confidence
for every possible rule. This approach is prohibitively expensive because there are
exponentially many rules that can be extracted from a data set. More specifically, assuming
that neither the left nor the right-hand side of the rule is an empty set, the total number of
possible rules, R, extracted from a data set that contains d items is
R = 3^d − 2^(d+1) + 1.
Here we take d = 6 items.
Total number of itemsets = 2^d = 64.
Total number of possible association rules: R = 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602.
Computational Complexity: The number of possible rules grows exponentially with the number of items, d.
The computational requirements for frequent itemset generation are generally more
expensive than those of rule generation.
Definition: If an itemset is frequent, then all of its subsets must also be frequent.
An algorithm known as Apriori is a common one in data mining. It's used to identify the
most frequently occurring elements and meaningful associations in a dataset. As an example,
products brought in by consumers to a shop may all be used as inputs in this system.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets, because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6.
Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number decreases to 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.
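The level-wise candidate generation and support-based pruning described above can be sketched in plain Python. The transactions and minimum support below are chosen to be consistent with the counts quoted in this example, but the code is only an illustration, not the textbook algorithm verbatim.

```python
# Level-wise (Apriori) frequent itemset generation with support-based pruning.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
minsup_count = 3   # 60% of 5 transactions

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

# Frequent 1-itemsets.
items = sorted(set().union(*transactions))
frequent = [{frozenset([i]) for i in items
             if support_count(frozenset([i])) >= minsup_count}]

# Generate candidate (k+1)-itemsets from frequent k-itemsets, prune, count.
k = 1
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k + 1}
    # Apriori pruning: every k-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1]
                         for s in combinations(c, k))}
    frequent.append({c for c in candidates
                     if support_count(c) >= minsup_count})
    k += 1

for level, sets in enumerate(frequent, start=1):
    print(level, [sorted(s) for s in sets])
```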
Fk−1 × F1 Method: This candidate generation method extends each frequent (k − 1)-itemset with frequent items that are not already part of it. Every frequent k-itemset is part of the candidate k-itemsets generated by this procedure. In the example above, the Fk−1 × F1 candidate generation method only produces four candidate 3-itemsets, instead of the C(6, 3) = 20 itemsets produced by the brute-force method.
Fk−1×Fk−1 Method: This candidate generation procedure, which is used in the candidate-
gen function of the Apriori algorithm, merges a pair of frequent (k −1)-itemsets only if their
first k −2 items, arranged in lexicographic order, are identical. Let A = {a1, a2,...,ak−1} and
B = {b1, b2,...,bk−1} be a pair of frequent (k − 1)-itemsets, arranged lexicographically. A
and B are merged if they satisfy the following conditions: ai = bi (for I = 1, 2, ..., k − 2)
A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B
consists of the first k − 2 common items followed by ak−1 and bk−1 in lexicographic order.
This candidate generation procedure is complete, because for every lexicographically ordered
frequent k-itemset, there exists two lexicographically ordered frequent (k − 1)-itemsets that
have identical items in the first k – 2 positions.
Candidate Pruning: Candidate pruning eliminates some of the candidate k-itemsets by checking whether all of their subsets of size k − 1 are frequent; a candidate is pruned as soon as one of its subsets is found to be infrequent.
Support Counting: Support counting is performed by enumerating the itemsets contained in each transaction t. Some of these itemsets may correspond to the candidate 3-itemsets under investigation, in which case their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.
In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored
in a hash tree. During support counting, itemsets contained in each transaction are also
hashed into their appropriate buckets. That way, instead of comparing each itemset in the
transaction with every candidate itemset, it is matched only against candidate itemsets that
belong to the same bucket.
Consider the transaction, t = {1, 2, 3, 5, 6}. To update the support counts of the candidate
itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing
candidate 3-itemsets belonging to t must be visited at least once.
Computational Complexity:
The computational complexity of the Apriori algorithm, which includes both its runtime and
storage, can be affected by the following factors. Support Threshold Lowering the support
threshold often results in more itemsets being declared as frequent. This has an adverse effect
on the computational complexity of the algorithm because more candidate itemsets must be
generated and counted at every level.
The maximum size of frequent itemsets also tends to increase with lower support thresholds.
This increases the total number of iterations to be performed by the Apriori algorithm, further
increasing the computational cost. Number of Items (Dimensionality): As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the computation and I/O costs will increase because of the larger number of candidate itemsets generated.
Number of Transactions: Because the Apriori algorithm makes repeated passes over the transaction data set, its run time increases with a larger number of transactions.
Average Transaction Width: For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the average transaction width increases, so more candidate itemsets must be examined during candidate generation and support counting. Second, as the transaction width increases, more itemsets are contained in the transaction. This will increase the number of hash tree traversals performed during support counting. A detailed analysis of the time complexity for the Apriori algorithm is presented next.
Generation of frequent 1-itemsets for each transaction, we need to update the support count
for every item present in the transaction. Assuming that w is the average transaction width,
this operation requires O(Nw) time, where N is the total number of transactions.
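As a small illustration of this O(Nw) pass, the snippet below counts 1-itemset supports in a single scan; the three transactions mirror the first transactions of the FP-tree example discussed later and are only an example.

from collections import Counter

transactions = [("a", "b"), ("b", "c", "d"), ("a", "c", "d", "e")]
support = Counter(item for t in transactions for item in t)   # one pass over all items
print(support)        # Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2, 'e': 1})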
Rule generation is a process of finding interesting and useful patterns or rules from large
sets of data. It is one of the main tasks of data mining, which aims to discover hidden
knowledge from data.
One of the most common methods of rule generation is association rule mining, which
finds frequent itemsets and then derives rules that imply the co-occurrence of items in the
itemsets.
For example, if a customer buys bread and milk, they are also likely to buy eggs. This can
be expressed as an association rule: {bread, milk} -> {eggs}.
Given a frequent itemset L, the goal is to find all non-empty subsets f ⊂ L such that the rule
f → L − f satisfies the minimum confidence requirement.
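A hedged sketch of this step: for a frequent itemset L with known support counts for all of its subsets, enumerate every non-empty proper subset f and keep the rule f → L − f whose confidence support(L)/support(f) meets the threshold. The support_counts dictionary and the 80% threshold below are illustrative values, not figures from the text.

from itertools import combinations

def generate_rules(L, support_counts, min_conf):
    # Return rules f -> L - f with confidence support(L) / support(f) >= min_conf.
    L = tuple(sorted(L))
    rules = []
    for r in range(1, len(L)):                    # all non-empty proper subsets of L
        for f in combinations(L, r):
            conf = support_counts[L] / support_counts[f]
            if conf >= min_conf:
                rules.append((f, tuple(x for x in L if x not in f), conf))
    return rules

# Assumed support counts for {bread, milk, eggs} and its subsets
support_counts = {("bread",): 4, ("milk",): 4, ("eggs",): 3,
                  ("bread", "milk"): 3, ("bread", "eggs"): 3, ("eggs", "milk"): 3,
                  ("bread", "eggs", "milk"): 3}
print(generate_rules(("bread", "milk", "eggs"), support_counts, 0.8))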
The number of frequent itemsets produced from a transaction data set can be very large. It is
useful to identify a small representative set of frequent itemsets from which all other frequent
itemsets can be derived. Two such representations are presented in this section in the form of
maximal and closed frequent itemsets.
Maximal Frequent Itemsets:
A frequent itemset is maximal if none of its immediate supersets is frequent. Maximal
frequent itemsets provide a valuable representation for data sets that can produce very long
frequent itemsets, as there are exponentially many frequent itemsets in such data.
Closed Itemsets:
Closed itemsets provide a minimal representation of all itemsets without losing their
support information.
An itemset X is closed if none of its immediate supersets has exactly the same
support count as X.
Conversely, X is not closed if at least one of its immediate supersets has the same
support count as X.
FP-Tree Representation:
An FP-tree is a compressed representation of the input data. It is constructed by reading the
data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As
different transactions can have several items in common, their paths might overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us
to extract frequent itemsets directly from the structure in memory instead of making repeated
passes over the data stored on disk.
1) The data set is scanned once to determine the support count of each item. Infrequent items
are discarded, while the frequent items are sorted in decreasing order of support count
within every transaction of the data set. For the data set shown in Figure 5.24, a is the
most frequent item, followed by b, c, d, and e.
2) The algorithm makes a second pass over the data to construct the FP-tree. After reading
the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then
formed from null → a → b to encode the transaction. Every node along the path has a
frequency count of 1.
3) After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null →
b → c → d. Every node along this path also has a frequency count equal to one. Although
the first two transactions have an item in common (b), their paths are disjoint because the
transactions do not share a common prefix.
4) The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e,
overlaps with the path for the first transaction, null → a → b. Because of their
overlapping path, the frequency count for node a is incremented to two, while the
frequency counts for the newly created nodes, c, d, and e, are equal to one.
5) This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree. The resulting FP-tree after reading all the transactions
Algorithm by Han: The original algorithm to construct the FP-Tree defined by Han is given
below:
1. The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in
the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The
root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken
at the top, and then the next itemset with the lower count. It means that the branch of
the tree is constructed with transaction itemsets in descending order of count.
4. The next transaction in the database is examined, and its items are again ordered in
descending order of count. If a prefix of this transaction is already present in an existing
branch, then the branch for this transaction shares that common prefix starting from the
root, and the remaining items of the transaction are linked as new nodes below the shared
prefix.
5. The counts of the nodes are updated as transactions are inserted: the count of every
shared (common) node is increased by 1, while each newly created node starts with a
count of 1.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined
first, along with the links of the lowest nodes. The lowest node represents the
frequency pattern length 1. From this, traverse the path in the FP Tree. This path or
paths is called a conditional pattern base. A conditional pattern base is a sub-database
consisting of prefix paths in the FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects
and sorts the set of frequent items, and the second constructs the FP-Tree.
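The following is a minimal Python sketch of this two-scan construction; the class and variable names are my own, the header table and node-links used for mining are omitted, and only the first three transactions of the example ({a, b}, {b, c, d}, {a, c, d, e}) are shown.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count item supports and keep only the frequent items.
    support = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in support.items() if c >= min_support}
    root = FPNode(None, None)
    # Scan 2: insert each transaction with its items in decreasing support order.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1    # shared prefix: increment the count
            else:
                node.children[item] = FPNode(item, node)   # new node with count 1
            node = node.children[item]
    return root, support

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}]
root, support = build_fp_tree(transactions, min_support=1)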
*******************
What is Apriori Algorithm?
It is a classic algorithm used in data mining for finding association rules based on the principle "Any
subset of a large item set must be large". It uses a generate-and-test approach – generates candidate
itemsets and tests if they are frequent.
Given a minimum support threshold, it generates the large (frequent) itemsets, keeping only
those itemsets whose support meets the threshold.
Illustration:
Consider the below transaction in which B = Bread, J = Jelly, P = Peanut Butter, M = Milk and E =
Eggs. Given that minimum threshold support = 40% and minimum threshold confidence = 80%.
Step-1: Count the number of transactions in which each item occurs (Bread B occurs in 4 transactions
and so on).
Step-2: As the minimum support threshold is 40%, in this step we remove all items whose
support is below 40%, i.e., items bought in fewer than 2 transactions.
The above table has single items that are bought frequently. Now let’s find a pair of items that are
bought frequently. We continue from the above table (Table in step 2)
Step-3: We start making pairs from the first item with the items below it, such as {B,P}, {B,M},
{B,E}, and then continue with the second item and the items below it, such as {P,M}, {P,E}.
We do not make the pair {P,B} because it is the same as {B,P}, which was already formed
when making pairs with B: buying bread and peanut butter together is the same as buying
peanut butter and bread together. After making all the pairs we get,
Step-4: As the minimum support threshold is 40%, in this step we remove all pairs whose
support is below 40% (fewer than 2 transactions), and we are left with
The above table has two items {B, P} that are bought together frequently.
Step-5: We cannot generate larger frequent itemsets (itemsets of size 3) because only one
frequent 2-itemset remains. We therefore start generating association rules from this frequent
itemset. Since the frequent itemset has two items, only two association rules can be generated,
as shown below:
As P -> B has confidence 100%, which is greater than the minimum confidence threshold of
80%, P -> B is a strong association rule.
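The transaction table itself is not reproduced in the text above, so the sketch below uses a hypothetical set of five transactions chosen to be consistent with the stated counts (Bread appears in 4 transactions, {B, P} is the only frequent pair, and confidence(P -> B) = 100%); it simply enumerates itemsets with itertools rather than implementing full Apriori.

from itertools import combinations

# Hypothetical transactions consistent with the worked example
T = [{"B", "J", "P"}, {"B", "P"}, {"B", "P", "M"}, {"B", "E"}, {"M", "E"}]
min_support, min_conf = 2, 0.8        # 40% of 5 transactions, 80% confidence

def support(itemset):
    return sum(itemset <= t for t in T)           # number of transactions containing itemset

items = {i for t in T for i in t}
F1 = {i for i in items if support({i}) >= min_support}                           # Steps 1-2
F2 = [p for p in combinations(sorted(F1), 2) if support(set(p)) >= min_support]  # Steps 3-4
print(F2)                                          # only ('B', 'P') survives

for pair in F2:                                    # Step 5: association rules
    for a in pair:
        antecedent = {a}
        conf = support(set(pair)) / support(antecedent)
        if conf >= min_conf:
            print(a, "->", (set(pair) - antecedent).pop(), "confidence =", conf)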
*******************
What is FP Growth Algorithm?
An efficient and scalable method to find frequent patterns. It allows frequent itemset discovery
without candidate itemset generation.
Illustration:
Consider the below transaction in which B = Bread, J = Jelly, P = Peanut Butter, M = Milk and E =
Eggs. Given that minimum threshold support = 40% and minimum threshold confidence = 80%.
Step-1: Count the number of transactions in which each item occurs.
Step-2: As the minimum support threshold is 40%, remove all items whose support is below
40%, i.e., items bought in fewer than 2 transactions.
Step-3: Create an F-list in which the frequent items are sorted in descending order of support.
Step-4: Sort frequent items in transactions based on F-list. It is also known as FPDP.
Read transaction 1: {B,P} -> Create 2 nodes B and P. Set the path as null -> B -> P and the count of B
and P as 1 as shown below :
Read transaction 2: {B,P} -> The path will be null -> B -> P. Since transactions 1 and 2 share
the same path, the counts of B and P are incremented to 2.
Read transaction 3: {B,P,M} -> The path will be null -> B -> P -> M. Since transactions 2 and 3
share the same path up to node P, the counts of B and P become 3, and a new node M is
created with a count of 1.
Step-6: Construct the conditional FP-tree for each item in the reverse order of the F-list
{E, M, P, B} and generate the frequent itemsets. A conditional FP-tree is a subtree built by
considering only the transactions containing a particular item and then removing that item
from those transactions.
The table above shows the conditional pattern base and conditional FP-tree for each item.
For items E and M, the nodes in the conditional FP-tree have a count (support) of 1, which is
less than the minimum support threshold of 2; therefore no frequent itemsets are generated.
For item P, node B in the conditional FP-tree has a count (support) of 3, which satisfies the
minimum support threshold. Hence the frequent itemset {B, P} is generated by appending
item P to B.
*****
UNIT –V:
Cluster Analysis: Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means: The Basic K-means
Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses;
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering
Algorithm; DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm,
Strengths and Weaknesses.
Cluster analysis can be used as a stand-alone tool to gain insight into the distribution of
data, to observe the characteristics of each cluster, and to focus on a particular set of
clusters for further analysis.
Clustering may serve as a preprocessing step for other algorithms, such as
characterization and classification, which would then operate on detected clusters.
Clustering is an example of unsupervised learning. Unlike classification, clustering and
unsupervised learning do not rely on predefined classes and class-labeled training
examples.
Clustering is a challenging field of research, and its potential applications pose their own
special requirements. The following are typical requirements of clustering in
data mining:
o Scalability
o Ability to deal with different types of attributes
o Discovery of clusters with arbitrary shape
o Minimal requirements for domain knowledge to determine input parameters
o Ability to deal with noisy data
o Insensitivity to the order of input records
o High dimensionality
o Constraint-based clustering
o Interpretability and usability
Partitioning methods:
Partitioning methods construct k partitions of a given database consisting of n objects or
data tuples, where each partition represents a cluster and k ≤ n.
They classify the data into k groups, which together satisfy the following requirements:
o Each group must contain at least one object
o Each object must belong to exactly one group
A partitioning method creates an initial partitioning; it then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group
to another.
The objects in the same cluster are "close" or related to each other, whereas objects in
different clusters are "far apart" or very different.
Most of the applications use two popular heuristic methods for partitioning
o The k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster.
o The k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.
Hierarchical methods:
Hierarchical method creates a hierarchical decomposition of the given set of data
objects.
A Hierarchical method can be classified as being either agglomerative or divisive,
based on how hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition
holds.
The divisive approach, also called the top-down approach, starts with all the objects in
the same cluster. In each successive iteration, a cluster is split into smaller clusters,
until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it
can never be undone.
There are two approaches to improving the quality of hierarchical clustering:
o Perform careful analysis of object “linkages” at each hierarchical partitioning,
such as in CURE and Chameleon.
o Integrate hierarchical agglomeration and iterative relocation by first using a
hierarchical agglomerative algorithm and then refining the result using
iterative relocation.
Density-based methods:
This approach continues growing a cluster as long as the density (number of
objects or data points) in the "neighborhood" exceeds some threshold.
For each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
This method can be used to filter out noise (outliers) and discover the clusters of
arbitrary shape.
Different Types of Clusters:
1. Well-separated clusters
2. Prototype-based clusters
3. Contiguity-based clusters
4. Density-based clusters
5. Clusters described by an objective function
Well-separated clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster.
Prototype-based Clusters:
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the
prototype or “center” of a cluster, than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in the cluster, or a
medoid, the most “representative” point of a cluster.
Contiguity-based Clusters:
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or
more other points in the cluster than to any point not in the cluster.
Clusters Defined by an Objective Function:
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the
'goodness' of each potential set of clusters by using the given objective function.
(This is NP-hard.)
Can have global or local objectives.
o Hierarchical clustering algorithms typically have local objectives
o Partition algorithms typically have global objectives
A variation of the global objective function approach is to fit the data to a
parameterized model.
o Parameters for the model are determined from the data.
o Mixture models assume that the data is a 'mixture' of a number of statistical
distributions.
In the objective-function view, points in a cluster share some general property that derives
from the entire set of points.
5.2 K-Means:
o Now we need to find the new centroid point for each cluster from the clusters obtained
in the previous assignment step
o We use the centroid formula for a given set of points,
o i.e. centroid G = ( (x1+x2+…+xn)/n, (y1+y2+…+yn)/n )
o For cluster-1: The centroid is (2, 10) because it consists of single point
o For cluster-2: ( (8+5+7+6+4)/5 , (4+8+5+4+9)/5 ) = (6, 6)
o For cluster-3: ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
o The centroid points are (2, 10), (6, 6), (1.5, 3.5)
o Now again we need to find the Euclidean distance and allocate the points to the
respective cluster.
You can observe that the H data point is moved from Cluster-2 to Cluster-1.
Now again we need to find the Centroid points for the next iteration
For Cluster-1: ( (2+4)/2, (10+9)/2 ) =(3, 9.5)
For Cluster-2: ( (8+5+7+6)/4, (4+8+5+4)/4 )=(6.5, 5.25)
For Cluster-3: ( (2+1)/2, (5+2)/2 )=(1.5, 3.5)
o The new centroid points for the next iteration are (3, 9.5), (6.5, 5.25), (1.5, 3.5)
You can observe that the D data point is moved from Cluster-2 to Cluster-1.
Now again we need to find the Centroid points for the next iteration
o For Cluster-1: ( (2+5+4)/3, (10+8+9)/3 ) =(3.67, 9)
o For Cluster-2: ( (8+7+6)/3, (4+5+4)/3 )=(7, 4.33)
o We can observe that there is no change in the clusters, and hence no change in the
centroid points either. Therefore we can conclude that:
Data points A, D, H belong to one cluster.
Data points C, E, F belong to one cluster.
Data points B, G belong to one cluster.
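A compact pure-Python K-means sketch follows. The eight points A–H and the three initial centroids (A, D and G) are assumptions inferred from the centroid calculations above, e.g. A = (2, 10), D = (5, 8), H = (4, 9); with these assumptions the loop reproduces the iterations described in the example.

import math

points = {"A": (2, 10), "B": (2, 5), "C": (8, 4), "D": (5, 8),
          "E": (7, 5), "F": (6, 4), "G": (1, 2), "H": (4, 9)}
centroids = [(2, 10), (5, 8), (1, 2)]            # assumed initial centroids A, D, G

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])  # Euclidean distance

for _ in range(10):                               # iterate until assignments stabilise
    clusters = {i: [] for i in range(len(centroids))}
    for name, p in points.items():                # assign each point to the nearest centroid
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(name)
    new_centroids = [                             # recompute centroids (assumes no empty cluster)
        (sum(points[n][0] for n in members) / len(members),
         sum(points[n][1] for n in members) / len(members))
        for members in clusters.values()]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(clusters)    # expected grouping: {A, D, H}, {C, E, F}, {B, G}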
K-Medoids Algorithm:
o K-Medoids and K-Means are two types of clustering mechanisms in Partition Clustering.
A commonly cited difference is that K-means represents each cluster by the mean
(centroid) of its points and typically uses Euclidean distance, whereas K-medoids
represents each cluster by an actual data point (the medoid) and commonly uses
Manhattan distance.
o K-medoids is an unsupervised method that clusters unlabeled data. It is an improved
variant of the K-Means algorithm, designed mainly to reduce sensitivity to outlier data.
Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to
implement.
o Medoid: A medoid is a point in the cluster from which dissimilarities with all other
points in the clusters are minimal.
o Let us consider the example having data points (2, 6), (3, 4), (3, 8), (4,7), (6, 2), (6, 4),
(7, 3), (7, 4), (8, 5), (7, 6) naming x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 respectively.
o The number of clusters required K=2.
o Step1: we need to select 2 medoids.
C1 =(3, 4)
C2= (7, 4)
o Step2: we need to find Manhattan distance for each point to the medoid point
The formula for Manhattan distance is
d(p1, p2) = |x1 – x2| + |y1 – y2|
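A short sketch of the resulting assignment step: each of the ten points is assigned to the nearer of the two chosen medoids C1 = (3, 4) and C2 = (7, 4) using the Manhattan distance above (the full K-medoids swap phase, which tries replacing a medoid with a non-medoid to reduce total cost, is not shown).

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4),
          (7, 3), (7, 4), (8, 5), (7, 6)]         # x1 .. x10
medoids = [(3, 4), (7, 4)]                        # C1, C2

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

total_cost = 0
for p in points:
    d = [manhattan(p, m) for m in medoids]
    total_cost += min(d)
    print(p, "d(C1) =", d[0], " d(C2) =", d[1], " -> cluster", 1 if d[0] <= d[1] else 2)
print("total cost =", total_cost)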
K-means Additional Issues:
1. Handling Empty Clusters − One issue with the basic K-means algorithm given earlier is
that empty clusters can be obtained if no points are allocated to a cluster during the
assignment step. If this occurs, a strategy is needed to choose a replacement centroid,
since otherwise the squared error will be larger than necessary.
2. Outliers − When the squared-error criterion is used, outliers can unduly influence the
clusters that are found. In particular, when outliers are present, the resulting cluster
centroids (prototypes) may not be as representative as they otherwise would be, and the
SSE will be higher as well.
3. Reducing the SSE with Post-processing − An obvious way to reduce the SSE is to find
more clusters, i.e., to use a larger K. In many cases, however, we would like to improve
the SSE without increasing the number of clusters. This is often possible because
K-means typically converges to a local minimum.
Bisecting K-means:
Step 1: Initialize the list of clusters with a single cluster containing all points.
Step 2: Repeat: remove a cluster from the list (for example, the one with the highest SSE),
split it into two clusters using the basic K-means algorithm, and add the two resulting
clusters back to the list, until the list contains the desired number of clusters.
Example: Apply bisecting K-means with K = 3. The initial cluster 'GFG' is split into two
clusters 'GFG1' and 'GFG2'. The required number of clusters is not yet obtained, so 'GFG1'
is split further into two (since it has the higher SSE (formula to calculate SSE is explained
below)).
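The sketch below follows this split-the-highest-SSE-cluster loop, reusing scikit-learn's KMeans for the two-way splits; the sample data, the target of K = 3, and the use of scikit-learn are all assumptions made for illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
K = 3

clusters = [X]                                    # start with one cluster holding all points
while len(clusters) < K:
    # pick the cluster with the highest SSE (sum of squared distances to its centroid)
    sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
    worst = clusters.pop(int(np.argmax(sse)))
    # split it into two clusters with basic K-means
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(worst)
    clusters.extend([worst[labels == 0], worst[labels == 1]])

for i, c in enumerate(clusters, 1):
    print("cluster", i, c.tolist())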
Strengths:
It is simple, highly flexible, and efficient. The simplicity of k-means makes it easy to
explain the results in contrast to Neural Networks.
The flexibility of k-means allows for easy adjustment if there are problems.
The efficiency of k-means implies that the algorithm is good at segmenting a dataset.
An instance can change cluster (move to another cluster) when the centroids are
recomputed
Easy to interpret the clustering results.
Weakness:
It does not guarantee the most optimal set of clusters, and the number of clusters
must be decided before the analysis. How many clusters to include is left to the
discretion of the researcher.
This involves a combination of common sense, domain knowledge, and statistical
tools. Too many clusters tell you little, because the groups become very small and
there are too many of them.
There are statistical tools that measure within-group homogeneity and between-group
heterogeneity, and methods such as the elbow method to decide the value of k.
Additionally, there is a technique called a dendrogram; the results of a dendrogram
analysis provide a recommendation of how many clusters to use. However,
calculating a dendrogram for a large dataset could potentially crash a computer due to
the computational load and the limits of RAM.
When running the analysis, the k-means algorithm randomly selects several different
starting places from which to develop clusters. This can be good or bad depending on
where the algorithm chooses to begin. From there, the centres of the clusters are
recalculated until an adequate "centre" is found for the number of clusters requested.
The order of the data has an impact on the final results.
Agglomerative Hierarchical Clustering:
Strengths
It is simple to implement and gives the best output in some cases.
It is easy and results in a hierarchy, a structure that contains more information.
It does not need us to pre-specify the number of clusters.
Weakness
It breaks large clusters.
It is difficult to handle clusters of different sizes and convex shapes.
It is sensitive to noise and outliers.
Once a merge (or split) has been performed, it can never be undone.
Basic algorithm
Step-1 : Compute the proximity matrix
Step-2 : Let each data point be a cluster
Step-3 : Repeat
Merge the two closest clusters
Update the proximity matrix
Step-4 : Until only a single cluster remains
Different approaches to defining the distance between clusters distinguish the different
algorithms.
Step 1: Consider the points (1,2,3,4,5,6) as an individual cluster and find the distance
between the individual cluster from all other clusters.
Step 2: Now, merge the comparable clusters into a single cluster. As the clusters (2&3) and
(4&5) are similar to each other, we can merge them in the second step.
Finally, we get the clusters [ (1), (2,3), (4,5), (6)]
Step 3: Here, we recalculate the proximity as per the algorithm and combine the two closest
clusters [(4,5), (6)] together to form new clusters as [(1), (2,3), (4,5,6)]
Step 4: Repeat the same process. The clusters (4,5,6) and (2,3) are comparable and combined
together to form a new cluster. Now we have [(1), (2,3,4,5,6)].
Step 5: Finally, the remaining two clusters are merged together to form a single cluster
[(1,2,3,4,5,6)]
Space complexity
The space required for the Hierarchical clustering Technique is very high when the number
of data points is high as we need to store the similarity matrix in the RAM. The space
complexity is the order of the square of n.
Space complexity = O(n²) where n is the number of data points.
Time complexity
Since we’ve to perform n iterations and in each iteration, we need to update the similarity
matrix and restore the matrix, the time complexity is also very high. The time complexity is
the order of the cube of n.
Time complexity = O(n³) where n is the number of data points.
Complexity can be reduced to O(n² log n) time with some cleverness
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined as the minimum of the distance between any two points in the two different clusters.
Using graph terminology, if you start with all points as singleton clusters and add links
between points one at a time, shortest links first, then these single links combine the
points into clusters.
Let us consider the example
Point a b
P1 0.07 0.83
P2 0.85 0.14
P3 0.66 0.89
P4 0.49 0.64
P5 0.80 0.46
Step-1: Compute the distance matrix by: d[(x, y), (a, b)] = √((x − a)² + (y − b)²)
So we have to find the Euclidean distance between each and every pair of points; we first find
the Euclidean distance between P1 and P2.
d(P1, P2) = √((0.07 − 0.85)² + (0.83 − 0.14)²) = 1.04139
P1 P2 P3 P4 P5
P1 0 1.0413
P2 1.0413 0
P3 0
P4 0
P5 0
P1 P2 P3 P4 P5
P1 0 1.0413 0.59304 0.46098 0.81841
P2 1.0413 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0
Step-2: Merging the two closest members of the two clusters and finding the minimum
element in distance matrix.
Here the minimum value is 0.30232 and hence we combine P3 and P4. Now, form clusters of
elements corresponding to the minimum value and update the proximity matrix. To update
the proximity matrix:
min ((P3, P4), P1) = min ((P3, P1), (P4, P1)) = min (0.59304, 0.46098) = 0.46098
min ((P3, P4), P2) = min ((P3, P2), (P4, P2)) = min (0.77369, 0.61612) = 0.61612
min ((P3, P4), P5) = min ((P3, P5), (P4, P5)) = min (0.45222, 0.35847) = 0.35847
Now we will update the proximity Matrix:
P1 P2 P3, P4 P5
P1 0 1.0413 0.46098 0.81841
P2 1.0413 0 0.61612 0.32388
P3, P4 0.46098 0.61612 0 0.35847
P5 0.81841 0.32388 0.35847 0
Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
min ((P2, P5), P1) = min ((P2, P1), (P5, P1)) = min (1.04139, 0.81841) = 0.81841
min ((P2, P5), (P3, P4)) = min ((P2, (P3, P4)), (P5, (P3, P4))) = min (0.61612, 0.35847) =
0.35847
update the proximity Matrix:
P1 P2, P5 P3, P4
P1 0 0.81841 0.46098
P2, P5 0.81841 0 0.35847
P3, P4 0.46098 0.35847 0
The next minimum value is 0.35847 and hence we combine (P2, P5) and (P3, P4).
min ((P2, P5, P3, P4), P1) = min (((P3, P4), P1), ((P2, P5), P1))
= min (0.46098, 0.81841) = 0.46098
P1 P2, P5, P3, P4
P1 0 0.46098
P2, P5, P3, P4 0.46098 0
Finally the cluster (P3, P4, P2, P5) is merged with the datapoint P1
The single link technique is good at handling non-elliptical shapes, but is sensitive to
noise and outliers.
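To cross-check these merges (and the complete-link and group-average variants that follow), scipy's hierarchical clustering can be run directly on the five points; this assumes scipy is available.

import numpy as np
from scipy.cluster.hierarchy import linkage

P = np.array([[0.07, 0.83], [0.85, 0.14], [0.66, 0.89], [0.49, 0.64], [0.80, 0.46]])

for method in ("single", "complete", "average"):
    Z = linkage(P, method=method)     # each row: [cluster i, cluster j, merge distance, size]
    print(method, "merge distances:", np.round(Z[:, 2], 5))

# For "single" the merge distances should match the worked example above:
# 0.30232, 0.32388, 0.35847, 0.46098.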
Complete Link (MAX):
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters
is defined as the maximum of the distance between any two points in the two different clusters.
Step-1: Compute the distance matrix by: d[(x, y), (a, b)] = √((x − a)² + (y − b)²)
So we have to find the Euclidean distance between each and every pair of points; we first find
the Euclidean distance between P1 and P2.
d(P1, P2) = √((0.07 − 0.85)² + (0.83 − 0.14)²) = 1.04139
P1 P2 P3 P4 P5
P1 0 1.04139
P2 1.04139 0
P3 0
P4 0
P5 0
P1 P2 P3 P4 P5
P1 0 1.04139 0.59304 0.46098 0.81841
P2 1.04139 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0
Step-2: Merge the two closest clusters, i.e., the pair with the minimum value in the distance
matrix, and use the maximum pairwise distance when updating the proximity matrix.
Here the minimum value is 0.30232, and hence we combine P3 and P4. Now, form a cluster of
the elements corresponding to this minimum value and update the proximity matrix using the
maximum:
max ((P3, P4), P1) = max ((P3, P1), (P4, P1)) = max (0.59304, 0.46098) = 0.59304
max ((P3, P4), P2) = max ((P3, P2), (P4, P2)) = max (0.77369, 0.61612) = 0.77369
max ((P3, P4), P5) = max ((P3, P5), (P4, P5)) = max (0.45222, 0.35847) = 0.45222
Now we will update the proximity Matrix:
P1 P2 P3, P4 P5
P1 0 1.04139 0.59304 0.81841
P2 1.0413 0 0.77369 0.32388
P3, P4 0.59304 0.77369 0 0.45222
P5 0.81841 0.32388 0.45222 0
Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
max ((P2, P5), P1) = max ((P2, P1), (P5, P1)) = max (1.04139, 0.81841) = 1.04139
max ((P2, P5), (P3, P4)) = max ((P2, (P3, P4)), (P5, (P3, P4))) = max (0.77369, 0.45222) =
0.77369. Then, update the proximity matrix:
P1 P2, P5 P3, P4
P1 0 1.04139 0.59304
P2, P5 1.04139 0 0.77369
P3, P4 0.59304 0.77369 0
The next minimum value is 0.59304 and hence we combine P1 and (P3, P4).
max ((P1, P3, P4), (P2, P5)) = max (((P3, P4), (P2, P5)), (P1, (P2, P5)))
= max (0.77369, 1.04139) = 1.04139
Finally the cluster (P1, P3, P4) is merged with the cluster (P2, P5)
Complete link is less susceptible to noise and outliers, but it can break large clusters and it
favors globular shapes.
Nested Clusters:
Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is
defined as the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches.
Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which
are of size mi and mj, respectively, is expressed by the following equation:
proximity(Ci, Cj) = ( Σ x∈Ci, y∈Cj proximity(x, y) ) / (mi × mj)
Average-average distance or average linkage is the method that looks at the distances between
all pairs of points in the two clusters and averages all of these distances. This is also called the
Unweighted Pair Group Method using Arithmetic Mean (UPGMA).
Let us consider the example
Point a b
P1 0.07 0.83
P2 0.85 0.14
P3 0.66 0.89
P4 0.49 0.64
P5 0.80 0.46
Step-1: Compute the distance matrix by: d[(x, y), (a, b)] = √((x − a)² + (y − b)²)
So we have to find the Euclidean distance between each and every pair of points.
P1 P2 P3 P4 P5
P1 0 1.04139 0.59304 0.46098 0.81841
P2 1.04139 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0
Step-2: Merge the two closest clusters, i.e., the pair with the minimum value in the distance
matrix, and use the average pairwise distance when updating the proximity matrix.
Here the minimum value is 0.30232, and hence we combine P3 and P4. Now, form a cluster of
the elements corresponding to this minimum value and update the proximity matrix using the
average:
avg ((P3, P4), P1) = avg ((P3, P1), (P4, P1)) = avg (0.59304, 0.46098) = 0.52701
avg ((P3, P4), P2) = avg ((P3, P2), (P4, P2)) = avg (0.77369, 0.61612) = 0.69490
avg ((P3, P4), P5) = avg ((P3, P5), (P4, P5)) = avg (0.45222, 0.35847) = 0.40534
P1 P2 P3, P4 P5
P1 0 1.04139 0.52701 0.81841
P2 1.0413 0 0.69490 0.32388
P3, P4 0.52701 0.69490 0 0.40534
P5 0.81841 0.32388 0.40534 0
Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
avg ((P2, P5), P1) = avg ((P2, P1), (P5, P1)) = avg (1.04139, 0.81841) = 0.9299
avg ((P2, P5), (P3, P4)) = avg ((P2, (P3, P4)), (P5, (P3, P4)))
= avg (0.69490, 0.40534) = 0.55012
Update the proximity Matrix:
P1 P2, P5 P3, P4
P1 0 0.9299 0.52701
P2, P5 0.9299 0 0.55012
P3, P4 0.52701 0.55012 0
The next minimum value is 0.52701 and hence we combine P1 and (P3, P4).
avg ((P1, P3, P4), (P2, P5)) is the average of all six pairwise distances between the two clusters
= (1.04139 + 0.81841 + 0.77369 + 0.45222 + 0.61612 + 0.35847) / 6 = 0.67672
P1, P3, P4 P2, P5
P1, P3, P4 0 0.67672
P2, P5 0.67672 0
Finally the cluster (P1, P3, P4) is merged with the cluster (P2, P5)
Nested Clusters:
Ward's Method:
For Ward’s method, the proximity between two clusters is defined as the increase in
the squared error that results when two clusters are merged. Thus, this method uses
the same objective function as K-means clustering.
While it might seem that this feature makes Ward’s method somewhat distinct from
other hierarchical techniques, it can be shown mathematically that Ward’s method is
very similar to the group average method when the proximity between two points is
taken to be the square of the distance between them.
DBSCAN:
DBSCAN classifies points as core, border, and noise points; these concepts can be illustrated
using a collection of two-dimensional points.
Core points:
A point is a core point if it has at least a specified number of points (MinPts) within a
distance Eps.
These are points in the interior of a cluster.
The count includes the point itself.
Border points:
A border point is not a core point, but falls within the neighborhood of a core point.
Noise points:
A noise point is any point that is neither a core point nor a border point.
DBSCAN algorithm.
Step-1 : Label all points as core, border, or noise points.
Step-2 : Eliminate noise points.
Step-3 : Put an edge between all core points within a distance Eps of each other.
Step-4 : Make each group of connected core points into a separate cluster.
Step-5 : Assign each border point to one of the clusters of its associated core points
Let us Consider example, Create the clusters with minpts = 4 and ε = 1.9
Data Points P1(3,7), P2(4,6), P3(5,5), P4(6,4), P5(7,3), P6(6,2), P7(7,2), P8(8,4),
P9(3,3), P10(2,6), P11(3,5), P12(2,4)
Step-1 : Calculate the distance between each pair of points using Euclidean distance
Distance (A(x1, y1), B(x2, y2)) = √((x2 − x1)² + (y2 − y1)²)
Step - 2 : Map the points where the distances are less than ε=1.9
P1 : P2, P10
P2 : P1, P3, P11
P3 : P2, P4
P4 : P3, P5
P5 : P4, P6, P7, P8
P6 : P5, P7
P7 : P5, P6
P8 : P5
P9 : P12
P10 : P1, P11
P11 : P2, P10, P12
P12 : P9, P11
Step-3: Identify the core points: P2, P5 and P11 each have at least MinPts = 4 points
(including the point itself) within ε, so they are core points.
Step-4: Put an edge between core points that are within ε of each other. Only P2 and P11 are
within ε of each other, so they form one cluster, while P5 forms a separate cluster.
Step-5: Assign each border point to the cluster of an associated core point: P1, P3, P10 and
P12 join the cluster of P2 and P11, while P4, P6, P7 and P8 join the cluster of P5. P9 is neither
a core point nor within ε of a core point, so it is a noise point.
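A quick cross-check of this example with scikit-learn's DBSCAN, assuming it is installed; eps = 1.9 and min_samples = 4 correspond to the ε and MinPts above (min_samples counts the point itself).

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[3, 7], [4, 6], [5, 5], [6, 4], [7, 3], [6, 2],
              [7, 2], [8, 4], [3, 3], [2, 6], [3, 5], [2, 4]])   # P1 .. P12

labels = DBSCAN(eps=1.9, min_samples=4).fit(X).labels_
for i, label in enumerate(labels, start=1):
    print("P" + str(i) + ":", "noise" if label == -1 else "cluster " + str(label))
# Expected: P1, P2, P3, P10, P11, P12 in one cluster; P4-P8 in another; P9 labelled as noise.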
Strengths
DBSCAN works very well when there is a lot of noise in the dataset.
It can handle clusters of different shapes and sizes.
Unlike many other clustering algorithms, we need not specify the number of clusters in advance.
We just need two parameters, MinPts and Eps, which can be set by a domain expert.
Weakness
If the dataset contains clusters of widely differing densities, the algorithm fails to give an
accurate result.
It is very sensitive to its hyper-parameters.
With high-dimensional data and metrics such as Euclidean distance, the algorithm easily
runs into the curse of dimensionality.
If the data is not well understood by the domain expert, it is very difficult to find good
values for MinPts and Eps.
*****