
Data Warehousing and

Data Mining
III B.Tech – I Semester
COURSE SYLLABUS

Unit-1: Data Warehouse and OLAP Technology: An Overview: What Is a Data Warehouse? A
Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation,
From Data Warehousing to Data Mining.

Unit-2: Data Mining: Introduction, What is Data Mining?, Motivating Challenges, The Origins of
Data Mining, Data Mining Tasks, Types of Data, Data Quality. Data Preprocessing: Aggregation,
Sampling, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Discretization
and Binarization, Variable Transformation, Measures of Similarity and Dissimilarity.

Unit-3: Classification: Basic Concepts, General Approach to Solving a Classification Problem,
Decision Tree Induction: Working of Decision Tree, Building a Decision Tree, Methods for
Expressing an Attribute Test Condition, Measures for Selecting the Best Split, Algorithm for
Decision Tree Induction. Model Overfitting: Due to Presence of Noise, Due to Lack of
Representative Samples, Evaluating the Performance of a Classifier: Holdout Method, Random
Subsampling, Cross-Validation, Bootstrap. Bayes Theorem, Naïve Bayes Classifier.

Unit-4: Association Analysis: Basic Concepts and Algorithms: Problem Definition, Frequent Item
Set Generation, Apriori Principle, Apriori Algorithm, Rule Generation, Compact Representation
of Frequent Item Sets, FP-Growth Algorithm.

Unit-5: Cluster Analysis: Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means: The Basic K-means
Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses;
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm;
DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses.

DATA WAREHOUSING & DATA MINING

UNIT –I:
Data Warehouse and OLAP Technology: An Overview: What Is a Data Warehouse? A
Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse
Implementation, From Data Warehousing to Data Mining. (Han & Kamber)

Data Warehouse and OLAP Technology


1.1 What Is a Data Warehouse?
1.2 A Multidimensional Data Model
1.3 Data Warehouse Architecture
1.4 Data Warehouse Implementation
1.5 From Data Warehousing to Data Mining

1.1 What is Data Warehouse?

Data warehouse can be defined as,


 “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision-making process.”
 This definition was given by W. H. Inmon.
 The key features of data warehouse are:
1. Subject-oriented
2. Integrated
3. Time variant
4. Non-volatile

1. Subject-oriented – A data warehouse is subject-oriented because it delivers
information about a theme (subject) rather than about the organization's ongoing operations.
It is organized around specific, well-defined themes, which means the data warehousing
process is designed to handle one particular subject at a time. These themes can be sales,
distribution, marketing, etc.
2. Integrated – Integration means establishing a common format and unit of measurement for
all related data drawn from different databases. The data stored in the warehouse must be
kept in a simple and universally acceptable manner and must be consistent in terms of
nomenclature and layout. This makes the warehouse well suited to analysing large volumes of data.

3. Time variant – Data is maintained for different intervals of time, such as weekly,
monthly, or annually. The time horizon of a data warehouse is much wider than that of
operational systems such as online transaction processing (OLTP) systems. The data residing
in the data warehouse is associated with a specific interval of time and delivers information
from a historical perspective.

4. Non-volatile – The data warehouse is also non-volatile, which means that past data is not
erased or deleted when new data is inserted. The data is essentially read-only and is
refreshed only on a routine schedule. This helps with statistical data evaluation and with
understanding what happened and when, without requiring any complicated update procedures.

What is OLAP?

OLAP stands for Online Analytical Processing. It's a technology used in data analytics and
business intelligence that enables users to extract and view data from multiple perspectives.
OLAP systems are designed for complex queries and data analysis, allowing users to analyse
different dimensions of data, such as time, geography, or product hierarchies, in a dynamic
and multidimensional way.

Applications of OLAP:
1. Business Intelligence
2. Financial Analysis
3. Sales & Marketing
4. Supply Chain Management
5. Healthcare Analysis
6. Educational Institutions

What is OLTP?

OLTP stands for Online Transaction Processing. It's a type of system and database designed
for managing and processing transaction-oriented applications. Unlike OLAP (Online
Analytical Processing) that focuses on data analysis and reporting, OLTP systems are
optimized for managing day-to-day, routine transactions in real-time.

Applications of OLTP:
1. Banking & Financial Transactions
2. Airline & Travel Management
3. Telecommunications
4. Government Systems

Differences between OLTP & OLAP:

Parameter        | OLTP                                                     | OLAP
Users            | Clerk, IT professional                                   | Knowledge worker
Function         | Day-to-day operations                                    | Decision support
DB design        | Application-oriented                                     | Subject-oriented
Data             | Current, up-to-date, detailed, flat relational, isolated | Historical, summarized, multidimensional, integrated, consolidated
Usage            | Repetitive                                               | Ad-hoc
Access           | Read/write, index/hash on primary key                    | Lots of scans
Unit of work     | Short, simple transaction                                | Complex query
Records accessed | Tens                                                     | Millions
Number of users  | Thousands                                                | Hundreds
DB size          | 100 MB to GB                                             | 100 GB to TB
Metric           | Transaction throughput                                   | Query throughput, response time

1.2 A Multi-dimensional Data Model:


 A multi-dimensional data model is a fundamental concept used to organize and
represent data for efficient querying and analysis. It structures data in multiple
dimensions to facilitate complex queries and analytics. This model primarily revolves
around the concept of a data cube.
 A data cube is a multidimensional representation of data, allowing for the analysis of
data along multiple dimensions.
 The multi-dimensional data model in DWDM typically includes the following
components:
1. Dimensions
2. Facts/Measures
3. Hierarchies
4. Data Aggregation
 In DWDM, the multi-dimensional data model facilitates the design and
implementation of data warehouses, making it easier for analysts, decision-makers,
and data scientists to explore and derive insights from large volumes of data by
organizing it along different dimensions and levels of granularity.
 A data cube, such as sales, allows data to be modelled and viewed in multiple
dimensions
o Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year)
o Fact table contains measures (such as dollars_sold) and keys to each of the
related dimension tables

 In data warehousing literature, an n-D base cube is called a base cuboid. The top most
0-D cuboid, which holds the highest-level of summarization, is called the apex
cuboid. The lattice of cuboids forms a data cube.

1.2.1 Conceptual Modelling of Data Warehouse:

 Modelling data warehouses: dimensions & measures


1. Star schema
2. Snowflake schema
3. Fact constellations

1. Star schema:
 A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.
 A fact table in the middle connected to a set of dimension tables.
 Example for star schema regarding sales,

2. Snowflake Schema:

 The snowflake schema consists of one fact table which is linked to many dimension
tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to the third
normal form.
 The snowflake schema is an expansion of the star schema where each point of the star
explodes into more points. It is called snowflake schema because the diagram of
snowflake schema resembles a snowflake.
 Example for snowflake schema regarding sales

3. Fact constellation:

 A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.
 Fact Constellation Schema is a sophisticated database design in which it is more
difficult to summarize information. A fact constellation can be implemented by
sharing aggregate fact tables, or by decomposing a complex fact table into
independent, simpler fact tables.
 Example for fact constellation regarding sales,

Defining Star Schema in DMQL:


define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Defining Snowflake Schema in DMQL:


define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key,
supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining Fact Constellation Schema in DMQL:

define cube sales [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales,
shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures of Data Cube:

 Distributive: If the result derived by applying the function to n aggregate values is


the same as that derived by applying the function on all the data without partitioning.
E.g., count(), sum(), min(), max()

 Algebraic: If it can be computed by an algebraic function with M arguments (where


M is a bounded integer), each of which is obtained by applying a distributive
aggregate function.
E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage size needed to describe a sub-
aggregate.
E.g., median(), mode(), rank()
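As a small illustration (not from the source text), the Python sketch below partitions a toy data set and shows why sum() is distributive, why avg() is algebraic (it can be rebuilt from the distributive pair sum and count), and why median() is holistic (it needs all underlying values).

# Illustrative sketch: aggregates over partitioned data.
import statistics

data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]          # pretend the data lives in two chunks

# Distributive: partial sums can simply be summed again.
total = sum(sum(p) for p in partitions)
assert total == sum(data)

# Algebraic: avg() can be rebuilt from two distributive values (sum, count).
partials = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in partials) / sum(n for _, n in partials)
assert avg == sum(data) / len(data)

# Holistic: median() cannot be derived from bounded per-partition summaries;
# it needs access to all of the underlying values.
median = statistics.median(data)
print(total, avg, median)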

1.2.2 OLAP Operations:

 OLAP provides various operations to gain insights from the data stored in
multidimensional hypercube.
 OLAP operations include:
1. Drill down
2. Roll up
3. Dice
4. Slice
5. Pivot

1. Drill down:
Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is
converted into highly detailed data. It can be implemented by either stepping down a concept
hierarchy for a dimension or adding additional dimensions to the hypercube.

 Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2, Q3,
Q4) of various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities
(Delhi, Mumbai, Las Vegas, New York) as shown below:
 Here, the drill-down operation is applied on the time dimension and the quarter Q1 is
drilled down to January, February, and March. Hence, by applying the drill-down
operation, we can move down from quarterly sales in a year to monthly or weekly
records.

2. Roll up:
It is the opposite of the drill-down operation and is also known as a drill-up or aggregation
operation. It is a dimension reduction technique that performs aggregation on a data cube. It
makes the data less detailed and it can be performed by combining similar dimensions across
any axis.
 Here, we are performing the Roll-up operation on the given data cube by aggregating
the sales by country instead of by city.

3. Dice:
Dice operation is used to generate a new sub-cube from the existing hypercube. It selects two
or more dimensions from the hypercube to generate a new sub-cube for the given data.

 Here, we are using the dice operation to retrieve the sales done by the company in the
first half of the year i.e., the sales in the first two quarters.

4. Slice:
Slice operation is used to select a single dimension from the given cube to generate a new
sub-cube. It represents the information from another point of view.
 Here, the sales done by the company during the first quarter are retrieved by
performing the slice operation on the given hypercube.

5. Pivot:

 It is used to provide an alternate view of the data available to the users.


 It is also known as rotate operation as it rotates the cube’s orientation to view the data
from different perspectives.

 Here, we are using the Pivot operation to view the sub-cube from a different
perspective.
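The following sketch mimics these OLAP operations on a toy sales table using pandas; the column names and figures are illustrative assumptions, and a real OLAP server would execute the same operations against a data cube rather than a flat table.

# Minimal sketch of OLAP-style roll-up, slice, dice, and pivot with pandas.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "New York", "New York"],
    "country": ["India", "India", "India", "India", "USA", "USA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["Shirt", "Pant", "Shirt", "Tees", "Shorts", "Shirt"],
    "sales":   [120, 90, 150, 80, 200, 170],
})

# Roll-up: aggregate from the city level up to the country level.
rollup = sales.groupby(["country", "quarter"])["sales"].sum()

# Slice: fix a single dimension value (quarter = Q1).
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions (first half of the year, two cities).
dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Delhi", "Mumbai"])]

# Pivot: rotate the view so items become columns and quarters become rows.
pivot = sales.pivot_table(index="quarter", columns="item", values="sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")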

1.3 Data warehouse architecture:

1.3.1 Four views regarding the design of a data warehouse:

 Top-down view:
o Allows selection of the relevant information necessary for the data warehouse.
 Data source view:
o Exposes the information being captured, stored, and managed by operational
systems.
 Data warehouse view:
o Consists of fact tables and dimension tables.
 Business query view:
o Sees the perspectives of data in the warehouse from the view of end-user.

1.3.2 Data Warehouse: A Multi-tiered Architecture:



 Data Warehouse is referred to the data repository that is maintained separately from the
organization’s operational data.
 Multi-Tier Data Warehouse Architecture consists of the following components:
1. Bottom tier
2. Middle tier
3. Top tier
1. Bottom tier:

 The bottom Tier usually consists of Data Sources and Data Storage.
 It is a warehouse database server. For Example RDBMS.
 In Bottom Tier, using the application program interface (called gateways), data is
extracted from operational and external sources.
 Application program interfaces such as ODBC (Open Database Connectivity), OLE-DB
(Object Linking and Embedding, Database), and JDBC (Java Database Connectivity) are
supported.
 ETL stands for Extract, Transform, and Load.
 Several popular ETL tools include:
 IBM Infosphere
 Informatica
 Microsoft SSIS
 Confluent
2. Middle tier:

 The middle tier is an OLAP server that is typically implemented using either :
o A relational OLAP (ROLAP) model.
o A multidimensional OLAP (MOLAP) model.
 OLAP server models come in three different categories, including:

a. ROLAP: A relational database is not converted into a multidimensional database;


rather, a relational database is actively broken down into several dimensions as part
of relational online analytical processing (ROLAP). This is used when everything
that is contained in the repository is a relational database system.

b. MOLAP: A different type of online analytical processing called multidimensional


online analytical processing (MOLAP) includes directories and catalogues that are
immediately integrated into its multidimensional database system. This is used
when all that is contained in the repository is the multidimensional database
system.

c. HOLAP: A combination of relational and multidimensional online analytical


processing paradigms is hybrid online analytical processing (HOLAP). HOLAP is
the ideal option for a seamless functional flow across the database systems when
the repository houses both the relational database management system and the
multidimensional database management system.

3. Top tier:

 The top tier is a front-end client layer, which includes query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, etc.).
 Here are a few Top Tier tools that are often used:
 SAP BW
 IBM Cognos
 Microsoft BI Platform

Data Warehouse Models:


From the architecture point of view, there are three warehouse models-

1. Enterprise Warehouse:
 An enterprise warehouse collects all information topics spread throughout the
organization.
 It usually contains detailed data as well as summarized data and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
 It is traditionally implemented on mainframes, computer superservers, or parallel
architecture platforms. It requires extensive business modelling and may take years
to design and build.
2. Data Mart:
 A data mart contains a subset of corporate-wide data that is important to a specific
group of users.
 The scope is limited to specific selected subjects.
 For example, a marketing data mart may limit its topics to customers, goods, and
sales.
 The data contained in the data marts are summarized. Data marts are typically
applied to low-cost departmental servers that are Unix/Linux or Windows-based.
3. Virtual Warehouse:
 A virtual warehouse is a group of views on an operational database.
 For efficient query processing, only some of the possible summary views may be
materialized.
 Creating a virtual warehouse is easy, but requires additional capacity on
operational database servers.

Advantages of Multi-Tier Architecture of Data warehouse:


1. Scalability: Various components can be added, deleted, or updated in accordance with
the data warehouse’s shifting needs and specifications.
2. Better Performance: The several layers enable parallel and efficient processing,
which enhances performance and reaction times.

3. Modularity: The architecture supports modular design, which facilitates the creation,
testing, and deployment of separate components.
4. Security: The data warehouse’s overall security can be improved by applying various
security measures to various layers.
5. Improved Resource Management: Different tiers can be tuned to use the proper
hardware resources, cutting expenses overall and increasing effectiveness.
6. Easier Maintenance: Maintenance is simpler because individual components can be
updated or maintained without affecting the data warehouse as a whole.
7. Improved Reliability: Using many tiers can offer redundancy and failover
capabilities, enhancing the data warehouse’s overall reliability.

Data Warehouse Back-End Tools and Utilities:

 Data extraction:
 Get data from multiple, heterogeneous, and external sources
 Data cleaning:
 Detect errors in the data and rectify them when possible
 Data transformation:
 Convert data from legacy or host format to warehouse format
 Load:
 Sort, summarize, consolidate, compute views, check integrity, and build indices
and partitions
 Refresh:
 Propagate the updates from the data sources to the warehouse
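A minimal extract-transform-load sketch in Python/pandas is shown below. The data, column names, and output file are hypothetical illustrations of the steps listed above; real deployments would use one of the dedicated ETL tools mentioned earlier.

# Toy ETL sketch: extract, clean, transform, and load a small sales table.
import pandas as pd

# Extract: in practice this would come from operational sources, e.g. pd.read_csv(...)
raw = pd.DataFrame({
    "item":       ["TV", "TV", "Phone", "Phone", "Phone"],
    "quantity":   [2, 2, None, 1, 3],
    "unit_price": [300.0, 300.0, 150.0, 150.0, 150.0],
})

# Clean: remove duplicate records and fill missing quantities.
clean = raw.drop_duplicates().fillna({"quantity": 0})

# Transform: convert to the warehouse format (derive a revenue measure).
clean["revenue"] = clean["quantity"] * clean["unit_price"]

# Load: sort, summarize, and write to a (hypothetical) warehouse staging file.
summary = clean.groupby("item")["revenue"].sum().sort_values(ascending=False)
summary.to_csv("warehouse_item_revenue.csv")
print(summary)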

1.4 Data Warehouse Implementation:

 A data warehouse is typically modelled by a data cube.


 The following issues must be considered when implementing a data warehouse:
1. Efficient cube computation techniques
2. Access methods
3. Query processing techniques

1. Efficient cube computation:


 Data cube computation is an essential task in data warehouse implementation. The
precomputation of all or part of a data cube can greatly reduce the response time
and enhance the performance of on-line analytical processing.
 Compute cube operator is used to apply for cube computation.
 It computes aggregates of overall subsets of the dimensions specified in the
operation.
 The storage requirements become excessive when many of the dimensions have
associated concept hierarchies; this problem is referred to as the curse of dimensionality.
 The total number of cuboids for an n-dimensional data cube (without hierarchies) is 2^n.

 If the dimensions have concept hierarchies, then the total number of cuboids is

   Total number of cuboids = (L1 + 1) x (L2 + 1) x ... x (Ln + 1)

where Li is the number of levels associated with dimension i.
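As a quick illustration of this formula, the following Python sketch (with made-up level counts) computes the number of cuboids for a cube whose dimensions have concept hierarchies and, for comparison, the 2^n count without hierarchies.

# Count cuboids for a cube with concept hierarchies (illustrative level counts).
from math import prod

levels = {"time": 4, "item": 3, "location": 4, "supplier": 2}   # L_i per dimension

num_cuboids = prod(L + 1 for L in levels.values())
print(num_cuboids)        # (4+1)*(3+1)*(4+1)*(2+1) = 300 cuboids

# Without hierarchies, an n-dimensional cube has 2**n cuboids.
print(2 ** len(levels))   # 16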


 There are three choices for data cube materialization;
1. No materialization
2. Full materialization
3. Partial materialization

2. Access Methods:
 There are two access methods:
a. Bitmap Index
b. Join Index

a. Bitmap Index:
 An indexing technique known as Bitmap Indexing enables data to be retrieved
quickly from columns that are frequently used and have low cardinality.
Cardinality is the count of distinct elements in a column.
 In general, Bitmap combines the terms Bit and Map, where bit represents the
smallest amount of data on a computer, which can only hold either 0 or 1 and
map means transforming and organizing the data according to what value
should be assigned to 0 and 1.

b. Join indices:
 Join indexing is especially useful for maintaining the relationship between a
foreign key and its matching primary keys, from the joinable relation.
 For example, if two relations R(RID, A) and S(B, SID) join on the attributes
A and B, then the join index record contains the pair (RID, SID), where RID
and SID are record identifiers from the R and S relations, respectively. Hence,
the join index records can identify joinable tuples without performing costly
join operations.
 The star schema model of data warehouses makes join indexing attractive for
cross-table search, because the linkage between a fact table and its
corresponding dimension tables comprises the fact table's foreign key and the
dimension table's primary key.
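The following Python sketch illustrates these two index structures on toy data; it is not the implementation of any particular warehouse server. It builds a bitmap index over a low-cardinality column and a join index of (fact RID, dimension RID) pairs.

# Toy bitmap index and join index.
items = ["TV", "Phone", "TV", "Laptop", "Phone", "TV"]          # fact-table column

# Bitmap index: one bit vector per distinct value of the column.
bitmap_index = {
    value: [1 if v == value else 0 for v in items]
    for value in set(items)
}
# e.g. bitmap_index["TV"] == [1, 0, 1, 0, 0, 1]

# Join index: pairs (fact RID, dimension RID) precomputed from the join keys,
# so joinable tuples can be found without re-running the join.
fact_rows = [(0, "I1"), (1, "I2"), (2, "I1"), (3, "I3")]        # (RID, item_key)
dim_rows  = [("D0", "I1"), ("D1", "I2"), ("D2", "I3")]          # (RID, item_key)

dim_by_key = {key: rid for rid, key in dim_rows}
join_index = [(fact_rid, dim_by_key[key]) for fact_rid, key in fact_rows]

print(bitmap_index)
print(join_index)   # [(0, 'D0'), (1, 'D1'), (2, 'D0'), (3, 'D2')]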

3. Query processing techniques:

 Determine which operations should be performed on the available cuboids


o Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP op.
o Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
 Explore indexing structures and compressed vs. dense array structs in MOLAP
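A rough Python sketch of this cuboid-selection step is shown below. The cuboid names, sizes, and the simple "smallest covering cuboid" rule are illustrative assumptions; the discussion above also lets a cuboid qualify through roll-ups (e.g., item_name generalized to brand) and selections such as year = 2004, which this simplified check ignores.

# Choose a materialized cuboid that can answer a query on {brand, province_or_state}.
query_dims = {"brand", "province_or_state"}

materialized = {
    "cuboid1": ({"year", "item_name", "city"},           5_000_000),
    "cuboid2": ({"year", "brand", "country"},               50_000),
    "cuboid3": ({"year", "brand", "province_or_state"},    200_000),
    "cuboid4": ({"item_name", "province_or_state"},        800_000),  # year = 2004 only
}

# Simplified rule: a cuboid qualifies only if it directly contains every query dimension.
candidates = {name: size for name, (dims, size) in materialized.items()
              if query_dims <= dims}

best = min(candidates, key=candidates.get)
print(candidates)   # {'cuboid3': 200000}
print(best)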

1.5 From Data Warehousing to Data Mining:


Data Warehouse Usage: Three kinds of data warehouse applications

 Information processing:
o Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs.
 Analytical processing:
o Multidimensional analysis of data warehouse data.
o Supports basic OLAP operations, slice-dice, drilling, pivoting.
 Data mining:
o Knowledge discovery from hidden patterns
o Supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools

 From On-Line Analytical Processing (OLAP) to On Line Analytical Mining


(OLAM):

Why online analytical mining?

 High quality of data in data warehouses


o DW contains integrated, consistent, cleaned data.
 Available information processing structure surrounding data
warehouses.
o ODBC, OLEDB, Web accessing, service facilities, reporting and
OLAP tools
 OLAP-based exploratory data analysis.
o Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions.
o Integration and swapping of multiple mining functions,
algorithms, and tasks.

OLAM System Architecture:

*****

UNIT –II:
Data Mining: Introduction, What is Data Mining?, Motivating challenges, The origins of
Data Mining, Data Mining Tasks, Types of Data, Data Quality. Data Preprocessing:
Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection, Feature
creation, Discretization and Binarization, Variable Transformation, Measures of Similarity
and Dissimilarity. (Tan & Vipin)

2.1 Data Mining


2.1.1 Introduction
2.1.2 What is Data Mining?
2.1.3 Motivating Challenges
2.1.4 The Origins of Data Mining
2.1.5 Data Mining Tasks
2.1.6 Types of Data
2.1.7 Data Quality.
2.2 Data Pre-processing
2.2.1 Aggregation
2.2.2 Sampling
2.2.3 Dimensionality Reduction
2.2.4 Feature Subset Selection
2.2.5 Feature Creation
2.2.6 Discretization and Binarization
2.2.7 Variable Transformation
2.2.8 Measures of Similarity and Dissimilarity.

Data Mining

2.1.1 Introduction: Data mining is a process of discovering patterns, trends, insights, and
knowledge from large volumes of data. It involves the use of various techniques and tools to
analyze and extract valuable information from datasets, with the goal of making informed
decisions and predictions. Data mining is an integral part of the broader field of data science
and plays a crucial role in industries such as business, healthcare, finance, and more.

Here is an introduction to the key concepts and components of data mining:


Data: Data mining begins with collecting and organizing data. This data can come
from various sources, including databases, spreadsheets, text documents, sensor data,
and more. The quality and quantity of data are essential for effective data mining.
Data Preprocessing: Before data mining can take place, the data often needs to be
cleaned and preprocessed. This involves tasks like handling missing values, removing
duplicates, and transforming data into a suitable format.
Data Exploration: Exploratory Data Analysis (EDA) is a critical step to understand
the characteristics of the data. Data mining practitioners use various statistical and
visualization techniques to identify patterns, outliers, and correlations in the dataset.

Data Mining Techniques:


Association Rule Mining: This technique identifies relationships and patterns in
data, such as "customers who buy X also tend to buy Y."
Clustering: Clustering techniques group similar data points together, helping to
identify natural groupings within the data.
Classification: Classification algorithms assign data points to predefined categories
or classes.
Regression: Regression models are used to predict a continuous numerical value
based on other data variables.
Time Series Analysis: This is used for data that is collected over time, such as stock
prices or weather data.
Anomaly Detection: It focuses on identifying unusual patterns that do not conform to
expected behavior.

Applications:
Business: Data mining is used in customer relationship management, market basket
analysis, and fraud detection.
Healthcare: It helps in disease prediction, patient diagnosis, and medical research.
Finance: In financial services, it's used for credit scoring, risk assessment, and stock
market prediction.
Retail: Retailers use data mining to optimize inventory management and product
recommendations.
Challenges: Data mining faces challenges such as dealing with big data, ensuring
data privacy and security, and selecting the right algorithms and parameters for a
given task.
Machine Learning Connection: Data mining often overlaps with machine learning,
as machine learning algorithms are frequently used for predictive modeling and
pattern recognition in data mining tasks.

2.1.2 What is Data Mining?


Data mining is the process of extracting useful, previously unknown, and potentially
actionable information or patterns from large datasets. It involves a range of techniques and
methods for discovering hidden relationships, trends, and insights within data. The primary
goal of data mining is to transform raw data into valuable knowledge that can be used for
decision making, prediction, and problem-solving. Here are some key aspects of data mining

KDD Process in Data Mining:


Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data
stored in databases.
The need for data mining is to extract useful information from large datasets and use it to
make predictions or support better decision-making. Nowadays, data mining is used in almost
all places where a large amount of data is stored and processed.

KDD Process:
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets. The
KDD process is iterative and may require multiple passes through the following steps to
extract accurate knowledge from the data.
The following steps are included in KDD process:
 Data Integration
 Data Selection
 Data Transformation
 Data Mapping
 Code generation
 Data Mining
 Pattern Evaluation

 Knowledge Representation

Advantages of KDD:
1. Improves decision-making: KDD provides valuable insights and knowledge that can
help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes
the data ready for analysis, which saves time and money.

3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer
service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying
patterns and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast
future trends and patterns.

Disadvantages of KDD:
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about
individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: KDD process heavily depends on the quality of data, if data is not
accurate or consistent, the results can be misleading
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new unseen data.

Difference between KDD and Data Mining:


Parameter       | KDD                                                                                  | Data Mining
Definition      | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective       | To find useful knowledge from data.                                                  | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output          | Structured information, such as rules and models that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus           | The overall process of discovering useful knowledge from data.                       | The discovery of patterns or relationships in data.

2.1.3 Motivating Challenges:


Data mining presents several motivating challenges that make it an exciting and
dynamic field:
Scalability: Data mining algorithms need to scale efficiently with the size of the dataset.
Scalability is a significant challenge when dealing with large and streaming data.
High dimensionality: Many datasets have a high number of features or attributes, which
can lead to the curse of dimensionality. Analyzing and extracting meaningful patterns
from high-dimensional data is a challenging task.
Data Heterogeneity: Data can come from diverse sources and formats, making it
challenging to integrate and analyze. Techniques for handling heterogeneous data, such as
text and image data are important.
Complex data: As the volume of data generated continues to grow exponentially, data
mining faces the challenge of efficiently processing and analyzing massive datasets.
Handling big data requires scalable algorithms, distributed computing, and specialized
tools.
Data ownership & distribution: Data ownership and distribution are significant
challenges in data mining, particularly when dealing with sensitive or proprietary data.
These challenges encompass issues related to who owns the data, how it can be
distributed, and how data mining can be conducted while respecting data ownership and
distribution rights.
Nontraditional analysis: This refers to situations where the data, tasks, or
objectives depart from conventional or well-established data mining paradigms. These
challenges often require creative and innovative approaches to extract valuable insights
and knowledge from the data.

2.1.4 Origins of Data Mining:

The origins of data mining can be traced back to various fields and disciplines, including
computer science, statistics, and database management. Data mining is essentially the
process of discovering patterns, trends, and valuable insights from large and complex
datasets. Here's a brief overview of its origins:
a. Statistics: Data mining has strong roots in statistical analysis. Statisticians have been
working on methods for analyzing and extracting meaningful information from data for
centuries. Techniques such as regression analysis, hypothesis testing, and clustering can
be seen as precursors to modern data mining methods.
b. Machine Learning: Machine learning, a subfield of artificial intelligence, has
contributed significantly to data mining. Techniques such as decision trees, neural
networks, and support vector machines have been adapted and incorporated into data
mining algorithms.
c. Database Management: The field of database management also played a crucial role
in the development of data mining. The emergence of large relational databases in the
1970s and 1980s paved the way for data mining by providing structured data for
analysis. SQL queries and other database-related technologies were essential for data
retrieval.

d. Knowledge Discovery in Databases (KDD): The term "data mining" gained


prominence in the 1990s with the introduction of the concept of Knowledge Discovery
in Databases (KDD). KDD is a comprehensive process that encompasses data selection,
preprocessing, transformation, data mining, pattern evaluation, and knowledge
presentation. It provides a structured framework for data mining.
e. Computer Science: Advances in computer hardware and software have significantly
contributed to the growth of data mining. The availability of powerful computers and
storage systems has made it feasible to process and analyze vast amounts of data.
f. Business and Industry: The practical applications of data mining became evident in
various industries, including marketing, finance, healthcare, and retail. Businesses
began to use data mining techniques to gain insights, make predictions, and improve
decision-making.
g. Academic Research: Data mining research has been an active area in academia, with
researchers and scholars developing new algorithms and methodologies. This research
has further advanced the field.

2.1.5 Data Mining Tasks:

Data mining involves a variety of tasks aimed at discovering patterns, relationships, and
useful information within large datasets. These tasks can be categorized into several
fundamental areas.
I. Classification: Classification is the process of assigning data points to predefined
categories or classes based on their attributes. This is commonly used in tasks like spam
email detection, sentiment analysis, and disease diagnosis. Popular algorithms for
classification include decision trees, support vector machines, and neural networks.

Applications:

1. Fraud Detection:
a. Goal: Predict fraudulent cases in credit card transactions.
b. Approach:
 Use credit card transactions and the information on its account-
holder as attributes.
o When does the customer buy, what do they buy, how often do they pay
on time, etc.
 Label past transactions as fraud or fair transactions. This forms the
class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card transactions
on an account.
2. Sky Survey Cataloging
a. Goal: To predict class (star or galaxy) of sky objects, especially visually
faint ones, based on the telescopic survey images (from Palomar Observatory)
b. Approach:
 Segment the image.
 Measure image attributes (features) - 40 of them per object.
 Model the class based on these features.
 Success Story: Could find 16 new high red-shift quasars, some of the
farthest objects that are difficult to find!

II. Regression: Regression is used to predict a numerical value or continuous variable based
on other attributes or variables. It is often employed in applications like sales forecasting,
price prediction, and risk assessment.
 Linear regression, polynomial regression, and regression trees are common
techniques.
 The primary objective of regression is to model the relationship between one
or more independent variables (predictors or features) and a dependent
variable (the target or outcome) to make predictions or estimate values.
Examples:
 Predicting sales amounts of new product based on advertising expenditure.
 Predicting wind velocities as a function of temperature, humidity, air pressure,
etc.
 Time series prediction of stock market indices.
Applications:
 Predictive Modeling: Regression is commonly used in fields such as finance,
sales forecasting, and epidemiology to predict future values or outcomes.
 Risk Assessment: It is used to assess risk in insurance, investment, and loan
approval by estimating the likelihood of specific outcomes.
 Quality Control: Regression can help analyze the relationships between
variables in manufacturing processes, identifying factors that affect product
quality.

 Marketing and Sales: Regression analysis can be used to understand how


marketing expenditures and strategies affect sales and customer behavior.

III. Clustering: Clustering is a key data mining task that involves grouping similar data
points or objects into clusters, such that the data points within a cluster share similar
characteristics. It is an unsupervised learning technique, meaning that the algorithm doesn't
rely on predefined labels or categories but instead seeks to discover the inherent structure
or patterns within the data.
 The main objective of clustering is to find natural groupings or structures in a
dataset. These groupings can help in data exploration, pattern recognition, and
understanding the underlying structure of the data.
 The choice of clustering algorithm and parameters depends on the nature of the
data and the goals of the analysis.

Applications:
 Customer Segmentation: Clustering helps businesses group customers with
similar purchasing behaviors and preferences, which can inform targeted
marketing strategies.
 Image Segmentation: In image processing, clustering is used to segment
images into meaningful regions or objects.
 Anomaly Detection: Clustering can help identify anomalies or outliers by
considering data points that do not fit well into any cluster.
 Document Categorization: In text mining, clustering can group similar
documents together based on their content, facilitating document
categorization.
 Network Analysis: Clustering is used in social network analysis to detect
communities or groups of closely connected individuals.

IV. Association: Association in data mining refers to the process of discovering interesting
and meaningful relationships or associations between items or attributes in a dataset. This
task is particularly useful for market basket analysis, where the goal is to find patterns in
customer purchasing behavior.

 The primary objective of association mining is to find patterns, dependencies, or


associations among items in a dataset. These patterns are typically expressed as
rules that describe the likelihood of one item being associated with another.
 Association rule mining starts with the identification of frequent item sets, which
are sets of items that frequently appear together in the dataset. The frequency of
an item set is measured by a support threshold, which is the proportion of
transactions in which the item set appears.
 Once frequent item sets are identified, association rules are generated. Association
rules consist of two parts: an antecedent (the left-hand side) and a consequent (the
right-hand side). These rules show the relationships between items. For example, a
simple association rule could be "If a customer buys item A, they are likely to buy
item B." Support, confidence, and lift values are calculated to evaluate the quality of
association rules, as illustrated in the sketch below.
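The sketch below uses a small made-up set of transactions to show how support, confidence, and lift for a rule such as {A} -> {B} can be computed.

# Toy computation of support, confidence, and lift for the rule {A} -> {B}.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

sup_a, sup_b, sup_ab = support({"A"}), support({"B"}), support({"A", "B"})

confidence = sup_ab / sup_a          # estimate of P(B | A)
lift = confidence / sup_b            # > 1 suggests A and B co-occur more than by chance

print(f"support={sup_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")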

Applications:
 Market Basket Analysis: One of the most common applications of
association mining is in retail for understanding customer purchasing behavior
and optimizing product placement and promotions.
 Cross-Selling: E-commerce and online platforms use association rules to
recommend products or services to customers based on their purchase history.
 Healthcare: Association rules can be applied to medical data to discover
associations between medical conditions, symptoms, and treatments.
 Fraud Detection: Detecting fraudulent activities, such as credit card fraud,
can benefit from association rule mining to identify patterns in transaction
data.

2.1.6 Types of Data:


Data: Collection of data objects and their attributes
 An attribute is a property or characteristic of an object

Examples: eye color of a person, temperature, etc.


 Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describe an object
 Object is also known as record, point, case, sample, entity, or instance

Types of Attributes: Nominal, ordinal, interval, and ratio are different levels of
measurement or data types used in statistics and data analysis. They represent a hierarchy of
measurement scales, each with distinct characteristics and properties.

Nominal Data (=, ≠):


 Nominal data is the simplest level of measurement.
 It categorizes data into distinct, non-overlapping categories or labels with no
inherent order or ranking.
 Examples: Gender (male, female), colors (red, blue, green), and types of
animals (dog, cat, bird).
 Arithmetic operations like addition, subtraction, or multiplication are not
meaningful for nominal data.

Ordinal Data (<,>):


 Ordinal data represents categories with a meaningful order or ranking, but the
intervals between values are not well-defined.
 It tells you that one value is greater or smaller than another, but it doesn't
specify how much greater or smaller.
 Examples: Education levels (e.g., high school, college, postgraduate),
customer satisfaction ratings (e.g., very satisfied, satisfied, neutral,
dissatisfied, very dissatisfied).
 Ordinal data allows for ranking and comparison but not precise mathematical
operations.

Interval Data (+,-):


 Interval data possesses all the characteristics of ordinal data, but it also has
equal intervals between values, which means the intervals are meaningful and
consistent.
 It lacks a true zero point, meaning that a value of zero doesn't imply the
absence of the quantity being measured. Negative values are possible.
 Examples: Temperature measured in Celsius or Fahrenheit, IQ scores, and
calendar dates. In these examples, differences between values are meaningful,
but ratios are not.

Ratio Data (*, /):


 Ratio data is the highest level of measurement, possessing all the properties of
nominal, ordinal, and interval data, along with a true zero point.
 A true zero point means that a value of zero implies the complete absence of
the quantity being measured. In ratio data, ratios and proportions are
meaningful.
 Examples: Height, weight, age, income, and distances. These measurements
allow for meaningful mathematical operations, such as addition, subtraction,
multiplication, and division.

2.1.7 Data Quality:

Data quality refers to the degree to which data is accurate, complete, consistent, reliable, and
suitable for its intended purpose. High data quality ensures that data is free from errors,
omissions, and inconsistencies, making it a valuable asset for organizations. Poor data quality
can lead to incorrect analyses, flawed decision-making and operational inefficiencies.

Key aspects of data quality:

 Accuracy: Data accuracy means that the information contained in the dataset
is correct and free from errors or mistakes. Accurate data is vital for making
informed decisions and avoiding costly errors.
 Completeness: Data completeness indicates that all the required data points or
attributes are present in the dataset. Incomplete data can lead to gaps in
information and hinder meaningful analysis.
 Consistency: Data consistency refers to the uniformity and coherence of data
across different sources or within the same dataset. Inconsistent data can result
from conflicting information or varying formats.
 Reliability: Reliable data can be consistently depended upon for accuracy and
consistency. Data should maintain its quality over time, and it should be
reliable for its intended use.
 Relevance: Relevance relates to the suitability of data for the specific task or
purpose at hand. Irrelevant data, even if accurate and complete, can lead to
poor decision-making.

Data quality problems can manifest in various ways and can impact an organization's
decision-making, operations, and overall efficiency. Here are some common data quality
problems and solutions to address them:

Data Quality Problem 1: Missing Data


Solution: Data Imputation
For missing data, one common solution is data imputation. Depending on the context, you
can impute missing values using statistical methods, such as mean, median, or mode
imputation, or more advanced techniques like regression imputation, k-nearest neighbors
(KNN) imputation, or multiple imputation. The choice of method depends on the data and the
nature of the missingness.

Data Quality Problem 2: Duplicate Data


Solution: Data Deduplication
Duplicate data can be addressed through data deduplication. Identify duplicate records and
remove or merge them as necessary. Utilize deduplication algorithms and tools to automate
the process.
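The short pandas sketch below (with assumed column names and toy values) illustrates both fixes: imputing missing values and then removing duplicate records.

# Imputation and deduplication on a toy customer table.
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "age":      [34, None, None, 29, 41],
    "city":     ["Delhi", "Pune", "Pune", None, "Mumbai"],
})

# Data imputation: fill numeric gaps with the median, categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Data deduplication: drop exact duplicate rows.
df = df.drop_duplicates()
print(df)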

Data Quality Problem 3: Inconsistent Data Formats


Solution: Data Standardization
Inconsistencies in data formats, such as date formats, units of measurement, or naming
conventions, can be addressed through data standardization. Define and enforce data format
standards to ensure consistency across the organization.

Data Quality Problem 4: Inaccurate Data Entry


Solution: Data Entry Validation and Training
Implement data entry validation checks to prevent inaccuracies at the source. This can
include enforcing data format rules, range checks, and validation against predefined
constraints. Additionally, provide training and guidance to data entry personnel to reduce
errors.

Data Quality Problem 5: Incomplete Data


Solution: Data Enrichment and Data Collection
Incomplete data can be addressed through data enrichment. Integrate additional information
from external sources to fill in missing details. Additionally, ensure that data collection
processes are comprehensive and collect all necessary information.

Data Quality Problem 6: Data Consistency Issues


Solution: Data Transformation and Data Validation
Data consistency issues can be addressed through data transformation and validation.
Transform data to ensure that it adheres to consistent standards. Implement data validation
checks to identify and correct inconsistent data.

Data Quality Problem 7: Outliers and Anomalies


Solution: Outlier Detection and Data Cleansing
Outliers and anomalies can be addressed by implementing outlier detection techniques. Once
identified, determine if they are valid or errors. Valid outliers can be retained, while
erroneous ones should be corrected or removed through data cleansing.

Data Preprocessing
2.2 Data Preprocessing:

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation

2.2.1 Aggregation :

 Imagine that you have collected the data for your analysis. These data consist of the All
Electronics sales per quarter, for the years 2002 to 2004. You are, however, interested in
the annual sales (total per year), rather than the total per quarter. Thus the data can be
aggregated so that the resulting data summarize the total sales per year instead of per
quarter.

 Data cubes store multidimensional aggregated information. For example, a data cube can hold
a multidimensional analysis of sales data with respect to annual sales per item type for
each All Electronics branch. Each cell holds an aggregate data value, corresponding to
the data point in multidimensional space.

 Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple
levels of abstraction. For example, a hierarchy for branch could allow branches to be
grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting on-line analytical processing as well
as data mining.

Data cube Aggregation:

 The cube created at the lowest level of abstraction is referred to as the base
cuboid. The base cuboid should correspond to an individual entity of interest,
such as sales or customer.
 A cube at the highest level of abstraction is the apex cuboid. The apex
cuboid would give one total—the total sales for all three years for all item
types, and for all branches.
 Data cubes created for varying levels of abstraction are often referred to as
cuboids, so that a data cube may instead refer to a lattice of cuboids. Each
higher level of abstraction further reduces the resulting data size.
 When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
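A minimal pandas sketch of this quarterly-to-annual aggregation is given below; the sales figures are made up for illustration.

# Roll quarterly sales up to annual sales.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 425, 390, 610],
})

# Aggregate away the quarter dimension to obtain total sales per year.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)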
2.2.2 Sampling:

 Sampling is the main technique employed for data reduction


 It is often used for both the preliminary investigation of the data and the
final data analysis.
 Statisticians often sample because obtaining the entire set of data of
interest is too expensive or time-consuming.

 Sampling is typically used in data mining because processing the entire set
of data of interest is too expensive or time consuming

Types of Sampling:

 Simple Random Sampling


There is an equal probability of selecting any particular item
 Sampling without replacement
As each item is selected, it is removed from the population
 Sampling with replacement
o Objects are not removed from the population as they are selected for the
sample.
o In sampling with replacement, the same object can be picked up more than
once
 Stratified sampling
Split the data into several partitions; then draw random samples from each partition.
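The following pandas sketch (toy data, illustrative sample sizes, and an assumed "segment" column as the strata) shows the three sampling schemes just described.

# Simple random sampling (with and without replacement) and stratified sampling.
import pandas as pd

df = pd.DataFrame({
    "id":      range(1, 11),
    "segment": ["gold"] * 3 + ["silver"] * 7,
})

# Simple random sampling without replacement.
without_repl = df.sample(n=4, replace=False, random_state=0)

# Simple random sampling with replacement (the same row may be picked twice).
with_repl = df.sample(n=4, replace=True, random_state=0)

# Stratified sampling: draw a fraction from each partition (stratum).
stratified = (
    df.groupby("segment", group_keys=False)
      .apply(lambda g: g.sample(frac=0.5, random_state=0))
)

print(without_repl, with_repl, stratified, sep="\n\n")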

2.2.3 Dimensionality Reduction:

Dimensionality reduction, or dimension reduction, is the transformation of data from a
high-dimensional space into a low-dimensional space so that the low-dimensional
representation retains some meaningful properties of the original data, ideally close to
its intrinsic dimension.

Purpose:

 Avoid curse of dimensionality


 Reduce amount of time and memory required by data mining algorithms
 Allow data to be more easily visualized
 May help to eliminate irrelevant features or reduce noise

Methods

 Principal Component Analysis (PCA)


 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
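As a brief illustration of dimensionality reduction, the sketch below applies PCA from scikit-learn (assumed to be installed) to random data, reducing 10 attributes to 2 principal components.

# Reduce 10 attributes to 2 principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))         # 100 objects, 10 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component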


2.2.4 Feature Subset Selection:


Another way to reduce the dimensionality of data is to select a subset of the
variables, or features, that describe the data, in order to obtain a more essential and
compact representation of the available information.
 Redundant features
 Duplicate much or all of the information contained in one or more other
attributes.
 Example: purchase price of a product and the amount of sales tax paid.

 Irrelevant features
 Contain no information that is useful for the data mining task at hand.
 Example: students' ID is often irrelevant to the task of predicting students'
GPA.

2.2.5 Feature Creation:

Create new attributes (features) that can capture the important information in a data set more
effectively than the original ones.
 Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
o Example: Fourier transformation, wavelet transformation, manifold
approaches (not covered)
 Attribute construction
 Combining features
 Data discretization

2.2.6 Discretization and Binarization:


Discretization is the process of converting a continuous attribute into an ordinal attribute.

 A potentially infinite number of values are mapped into a small number of


categories.
 Discretization is used in both unsupervised and supervised settings.
 Discretization can be performed recursively on an attribute.
Data Discretization methods (all of the following can be applied recursively):

 Binning: Top-down split, unsupervised


 Histogram analysis: Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

Binarization:
 Binarization is the process of transforming continuous and discrete attributes into
binary attributes. It is used to convert numerical data into categorical data.
 Binarization is often used in machine learning algorithms that require categorical (binary) data.
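A small pandas sketch of these ideas is given below: equal-width and equal-frequency binning as examples of discretization, and a simple threshold as an example of binarization. The attribute values, bin counts, and threshold are illustrative assumptions.

# Discretize a continuous attribute and binarize it against a threshold.
import pandas as pd

age = pd.Series([15, 22, 37, 45, 63, 71])

# Discretization: equal-width binning into three ordinal categories.
age_binned = pd.cut(age, bins=3, labels=["young", "middle", "senior"])

# Discretization: equal-frequency (quantile) binning.
age_quantile = pd.qcut(age, q=3, labels=["low", "mid", "high"])

# Binarization: map the continuous attribute to 0/1 using a threshold.
is_senior = (age >= 60).astype(int)

print(age_binned.tolist())
print(age_quantile.tolist())
print(is_senior.tolist())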

2.2.7 Variable Transformation:


An attribute transform is a function that maps the entire set of values of a given attribute to
a new set of replacement values such that each old value can be identified with one of the
new values
Simple functions: x^k, log(x), e^x, |x|
 Normalization:
 Refers to various techniques to adjust to differences among attributes in
terms of frequency of occurrence, mean, variance, range
 Take out unwanted, common signal, e.g., seasonality

Standardization:
 In statistics, standardization refers to subtracting off the means and
dividing by the standard deviation
 Methods:
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
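
A minimal sketch of these normalization methods with NumPy follows (illustrative only);
the sample values are assumptions.

# Illustrative sketch: min-max normalization, z-score standardization, and
# normalization by decimal scaling.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: scale values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# z-score normalization: subtract the mean and divide by the standard deviation
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# the largest absolute scaled value is below 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
x_decimal = x / (10 ** j)

print(x_minmax, x_zscore, x_decimal, sep="\n")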

2.2.8 Measures of Similarity and Dissimilarity:


 Clustering consists of grouping objects that are similar to each other; similarity and
dissimilarity measures can be used to decide whether two items are similar or dissimilar
in their properties.
 In a Data Mining sense, the similarity measure is a distance with dimensions
describing object features. That means if the distance among two data points is small
then there is a high degree of similarity among the objects and vice versa.
 The similarity is subjective and depends heavily on the context and application. For
example, similarity among vegetables can be determined from their taste, size, colour
etc.
 Similarity measure is a way of measuring how data samples are related or closed to
each other.
 On the other hand, the dissimilarity measure is to tell how much the data objects
distinct.

Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects, the most popular distance measures used are:

1. Euclidean Distance: Euclidean distance is considered the traditional metric for problems
with geometry. It can be simply explained as the ordinary distance between two points. It is
one of the most used measures in cluster analysis; one of the algorithms that uses this
formula is K-means. Mathematically, it computes the square root of the sum of the squared
differences between the coordinates of the two points:
d(P, Q) = sqrt((x1 – x2)^2 + (y1 – y2)^2)

2. Manhattan Distance: This determines the absolute difference among the pair of the
coordinates. Suppose we have two points P and Q to determine the distance between these
points we simply have to calculate the perpendicular distance of the points from X-Axis and
Y- Axis.
 In a plane with P at coordinate (x1, y1) and Q at (x2, y2),
 Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

3. Minkowski distance: It is the generalized form of the Euclidean and Manhattan distance
measures. In an N-dimensional space, a point is represented by its N coordinates
(x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then the Minkowski distance of order p between P1 and P2 is given as:
D(P1, P2) = (|X1 – Y1|^p + |X2 – Y2|^p + ... + |XN – YN|^p)^(1/p)
When p = 2, the Minkowski distance is the same as the Euclidean distance.
When p = 1, the Minkowski distance is the same as the Manhattan distance.

4. Cosine Index: The cosine distance measure for clustering determines the cosine of the
angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (||A|| ||B||)
Here θ (theta) gives the angle between the two vectors and A, B are n-dimensional vectors.

5. Jaccard Distance: The Jaccard distance measure is also used for such distance
calculations; it compares two objects as sets and measures the fraction of elements that the
two sets do not share.
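
A small sketch of these distance measures using SciPy is given below (illustrative only); the
two sample vectors are assumptions.

# Illustrative sketch: Euclidean, Manhattan, Minkowski, cosine and Jaccard
# distances between two vectors.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 2.0, 5.0, 1.0])

print("Euclidean:", distance.euclidean(x, y))
print("Manhattan:", distance.cityblock(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine distance:", distance.cosine(x, y))       # 1 - cosine similarity

# Jaccard distance on binary (set-membership) vectors
a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)
print("Jaccard distance:", distance.jaccard(a, b))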

*****

UNIT –III:
Classification: Basic Concepts, General Approach to solving a classification problem,
Decision Tree Induction: Working of Decision Tree, building a decision tree, methods for
expressing an attribute test conditions, measures for selecting the best split, Algorithm for
decision tree induction. Model Overfitting: Due to presence of noise, due to lack of
representation samples, evaluating the performance of classifier: holdout method, random
sub sampling, cross-validation, bootstrap. Bayes Theorem, Naïve Bayes Classifier.

3.1 Classification
3.1.1 Basic Concepts
3.1.2 General Approach to solving a classification problem
3.1.3 Decision Tree Induction
3.1.3.1 Working of Decision Tree
3.1.3.2 building a decision tree
3.1.3.3 methods for expressing an attribute test conditions
3.1.3.4 measures for selecting the best split
3.1.3.5 Algorithm for decision tree induction.
3.2 Model Overfitting
3.2.1 Due to presence of noise
3.2.2 Due to lack of representation samples
3.2.3 Evaluating the performance of classifier
3.2.3.1 Holdout method
3.2.3.2 Random sub sampling
3.2.3.3 Cross-Validation
3.2.3.4 Bootstrap
3.2.4 Bayes Theorem
3.2.5 Naïve Bayes Classifier (Tan &Vipin)

3.1 CLASSIFICATION:
Classification is a task in data mining that involves assigning a class label to each instance
in a dataset based on its features. The goal of classification is to build a model that
accurately predicts the class labels of new instances based on their features

A schematic illustration of a classification task.


3.1.1 Basic Concepts
The data for a classification task consists of a collection of instances (records). Each such
instance is characterized by the tuple (x, y), where x is the set of attribute values that describe
the instance and y is the class label of the instance. The attribute set x can contain attributes
of any type, while the class label y must be categorical.

A classification model is an abstract representation of the relationship between the attribute
set and the class label. The model can be represented in many ways, e.g., as a tree, a
probability table, or simply, a vector of real-valued parameters. More formally, we can
express it mathematically as a target function f that takes as input the attribute set x and
produces an output corresponding to the predicted class label. The model is said to classify
an instance (x, y) correctly if f(x) = y.
A classification model serves two important roles in data mining. First, it is used as a
predictive model to classify previously unlabeled instances. A good classification model
must provide accurate predictions with a fast response time. Second, it serves as a descriptive
model to identify the characteristics that distinguish instances from different classes. This is
particularly useful for critical applications, such as medical diagnosis, where it is insufficient
to have a model that makes a prediction without justifying how it reaches such a decision.

Examples of classification tasks:

Example [Loan Borrower Classification]: Consider the problem of predicting whether a
loan borrower will repay the loan or default on the loan payments. The data set used to build
the classification model is shown in the below Table.
The attribute set includes personal information of the borrower such as marital status and
annual income, while the class label indicates whether the borrower had defaulted on the loan
payments.

Table 3.1 A sample data for the loan borrower classification problem

3.1.2 General Approach to solving a classification problem


Classification is the task of assigning labels to unlabeled data instances and classifier is used
to perform such a task. A classifier is typically described in terms of a model as illustrated in
the previous section. The model is created using a given a set of instances, known as the
training set, which contains attribute values as well as class labels for each instance. The
systematic approach for learning a classification model given a training set is known as a
learning algorithm. The process of using a learning algorithm to build a classification model
from the training data is known as induction. This process is also often described as
“learning a model” or “building a model.” This process of applying a classification model on
unseen test instances to predict their class labels is known as deduction. Thus, the process of
classification involves two steps: applying a learning algorithm to training data to learn a
model, and then applying the model to assign labels to unlabeled instances.
General Approach to solving a classification problem

The performance of a model (classifier) can be evaluated by comparing the predicted labels
against the true labels of instances. This information can be summarized in a table called a
confusion matrix. Each entry fij denotes the number of instances from class i predicted to be
of class j. For example, f01 is the number of instances from class 0 incorrectly predicted as
class 1. The number of correct predictions made by the model is (f11 + f00) and the number
of incorrect predictions is (f10 + f01).

Confusion matrix for a binary classification problem.


Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information into a single number makes it
more convenient to compare the relative performance of different models. This can be done
using an evaluation metric such as accuracy, which is computed in the following way:
For binary classification problems, the accuracy of a model is given by
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)

Error rate is another related metric, which is defined as follows for binary classification
problems:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)

The learning algorithms of most classification techniques are designed to learn models that
attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
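
A minimal sketch of computing the confusion matrix, accuracy, and error rate for a binary
classifier with scikit-learn follows (illustrative only); the label vectors are assumptions.

# Illustrative sketch: confusion matrix, accuracy and error rate.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # true class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # labels predicted by a model

cm = confusion_matrix(y_true, y_pred)      # rows = true class, columns = predicted class
print(cm)

accuracy = accuracy_score(y_true, y_pred)  # number of correct predictions / total
print("accuracy:", accuracy, "error rate:", 1 - accuracy)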

3.1.3 Decision Tree Induction:


The goal is to ask a series of yes-or-no questions that lead you to the correct answer or
decision. In data mining, it's a popular method for classification and regression tasks.
Here's the basic idea: Imagine you have a dataset with different features and a target variable
(what you're trying to predict). The decision tree algorithm looks at the features and decides
which questions to ask in order to split the data into subsets that are as pure as possible in
terms of the target variable.

At each step, the algorithm chooses the question that provides the best separation of the data.
It keeps doing this until it creates a tree structure that can make accurate predictions on new,
unseen data.

In a nutshell, decision tree induction is a powerful tool in data mining, providing a clear and
understandable way to make decisions based on complex data.
3.1.3.1 Working of Decision Tree:
To illustrate how a decision tree works, consider the classification problem of distinguishing
mammals from non-mammals using the vertebrate data set .Suppose a new species is
discovered by scientists.
How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series
of questions about the characteristics of the species. The first question we may ask is whether
the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal.
Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up
question: Do the females of the species give birth to their young? Those that do give birth are
definitely mammals, while those that do not are likely to be non-mammals (with the
exception of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by asking a
series of carefully crafted questions about the attributes of the test instance. Each time we
receive an answer, we could ask a follow-up question until we can conclusively decide on its
class label. The series of questions and their possible answers can be organized into a
hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree
for the mammal classification problem. The tree has three types of nodes:
• A root node, with no incoming links and zero or more outgoing links.

• Internal nodes, each of which has exactly one incoming link and two or more outgoing
links.
• Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing
links.
Every leaf node in the decision tree is associated with a class label. The non-terminal nodes,
which include the root and internal nodes, contain attribute test conditions that are typically
defined using a single attribute.
Each possible outcome of the attribute test condition is associated with exactly one child of
this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body
Temperature to define an attribute test condition that has two outcomes, warm and cold,
resulting in two child nodes.
Given a decision tree, classifying a test instance is straightforward. Starting from the root
node, we apply its attribute test condition and follow the appropriate branch based on the
outcome of the test. This will lead us either to another internal node, for which a new
attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the
class label associated with the node to the test instance. As an illustration, Figure 3.5 traces
the path used to predict the class label of a flamingo. The path terminates at a leaf node
labeled as Non-mammals.

Figure 3.4. A decision tree for the mammal classification problem.


Figure 3.5.Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of
applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is
eventually assigned to the Non-mammals class.
3.1.3.2 Building a decision tree
There are many possible decision trees that can be constructed from a particular data set. While some
trees are better than others, finding an optimal one is computationally expensive due to the
exponential size of the search space.
One of the earliest methods is Hunt’s algorithm, which is the basis for many current
implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection
presents Hunt’s algorithm.

Hunt’s Algorithm:
In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree initially contains
a single root node that is associated with all the training instances. If a node is associated
with instances from more than one class, it is expanded using an attribute test condition that
is determined using a splitting criterion. A child leaf node is created for each outcome of the
attribute test condition and the instances associated with the parent node are distributed to the
children based on the test outcomes. This node expansion step can then be recursively
applied to each child node, as long as it has labels of more than one class. If all the instances
associated with a leaf node have identical class labels, then the node is not expanded any
further. Each leaf node is assigned a class label that occurs most frequently in the training
instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in Table 3.1 for the
loan borrower classification problem. Suppose we apply Hunt’s algorithm to fit the training
data. The tree initially contains only a single leaf node as shown in Figure 3.6(a). This node
is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan
payments. The training error of this tree is 30% as three out of the ten training instances have
the class label Defaulted = Yes. The leaf node can therefore be further expanded because it
contains training instances from more than one class. Let Home Owner be the attribute
chosen to split the training instances. The justification for choosing this attribute as the
attribute test condition will be discussed later. The resulting binary split on the Home Owner
attribute is shown in Figure 3.6(b). All the training instances for which Home Owner= Yes
are propagated to the left child of the root node and the rest are propagated to the right child.
Hunt’s algorithm is then recursively applied to each child. The left child becomes a leaf node
labeled Defaulted = No, since all instances associated with this node have identical class
label Defaulted=No. The right child has instances from each class label. Hence, we split it
further. The resulting subtrees after recursively expanding the right child are shown in
Figures 3.6(c) and (d).

Hunt’s algorithm, as described above, makes some simplifying assumptions that are often not
true in practice. In the following, we describe these assumptions and briefly discuss some of
the possible ways for handling them.
1. Some of the child nodes created in Hunt’s algorithm can be empty if none of the training
instances have the particular attribute values. One way to handle this is by declaring each of
them as a leaf node with a class label that occurs most frequently among the training
instances associated with their parent nodes.
2. If all training instances associated with a node have identical attribute values but different
class labels, it is not possible to expand this node any further. One way to handle this case is
to declare it a leaf node and assign it the class label that occurs most frequently in the training
instances associated with this node.

3.1.3.3 Methods for expressing an attribute test conditions


Decision tree induction algorithms must provide a method for expressing an attribute test
condition and its corresponding outcomes for different attribute types.

Binary Attributes The test condition for a binary attribute generates two potential outcomes.

Attribute test condition for a binary attribute.

Nominal Attributes Since a nominal attribute can have many values; its attribute test
condition can be expressed in two ways, as a multiway split or a binary split.
Ordinal Attributes Ordinal attributes can also produce binary or multi-way splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the order property of
the attribute values.

Continuous Attributes For continuous attributes, the attribute test condition can be
expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of
the form vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.

3.1.3.4 Measures for selecting the best split


When it comes to decision tree algorithms, selecting the best split is crucial for building an
effective and accurate model. The commonly used measures for selecting the best split
include:

1. Gini impurity: It measures the frequency at which a randomly selected element would be
incorrectly classified. The goal is to minimize the Gini impurity, and a split with lower
impurity is considered better.

2. Entropy: It calculates the information gain by measuring the amount of uncertainty or
disorder in a set of data. Higher information gain implies a better split.

3. Misclassification error: This measures the error rate by calculating the proportion of
misclassified instances in a set. The split with the lowest misclassification error is chosen.
4. Gain ratio: It is based on information gain but takes into account the intrinsic information
of a split. It penalizes splits that result in a large number of subsets.
5. Chi-square: It is used for categorical target variables and evaluates the independence of
two variables. A lower chi-square value indicates a better split.
6. Variance reduction (for regression trees): In regression problems, the goal is to
minimize the variance of the target variable within each split.
The choice of the measure depends on the specific problem, data, and algorithm used. Some
algorithms, like CART (Classification and Regression Trees), use Gini impurity or entropy,
while others may use different criteria.
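
As a small illustration, the sketch below computes Gini impurity, entropy, and
misclassification error for a node's class distribution (illustrative only); the class counts are
assumptions.

# Illustrative sketch: impurity measures for a node with a given class distribution.
import numpy as np

def impurities(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                    # class proportions p(i|t)
    gini = 1.0 - np.sum(p ** 2)                        # Gini impurity
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))    # entropy in bits
    misclass = 1.0 - p.max()                           # misclassification error
    return gini, entropy, misclass

print(impurities([4, 6]))     # impure node: 4 instances of class 0, 6 of class 1
print(impurities([0, 10]))    # pure node: all impurity measures are 0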

3.1.3.5 Algorithm for decision tree induction.


Algorithm 3.1 presents pseudocode for the decision tree induction algorithm. The input to this
algorithm is a set of training instances E along with the attribute set F. The algorithm works
by recursively selecting the best attribute to split the data (Step 7) and expanding the nodes of
the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this
algorithm are explained below.

1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted
as node.label.
2. The find_best_split() function determines the attribute test condition for partitioning the
training instances associated with a node. The splitting attribute chosen depends on the
impurity measure used. The popular measures include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a leaf node. For each
leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the
node t. The label assigned to the leaf node is typically the one that occurs most frequently in
the training instances that are associated with this node

4. The stopping_cond() function is used to terminate the tree-growing process by checking
whether all the instances have identical class label or attribute values.
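
The pseudocode itself is not reproduced here, but the following is a hedged scikit-learn
sketch of inducing and applying a decision tree classifier; the synthetic data, the Gini
criterion, and the depth limit are assumptions for illustration, not the algorithm listing from
the text.

# Illustrative sketch: decision tree induction and deduction with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion selects the impurity measure used to find the best split (gini or entropy)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X_train, y_train)              # induction: learn the model from training data

print("test accuracy:", tree.score(X_test, y_test))   # deduction: label unseen instances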

3.2 Model Overfitting:


When the model is most suitable for training data but when comes to testing data if that
model gives less accuracy then it is said to be Model Overfitting
Errors that causes Model Overfitting
The error committed by a classification model is generally two types. They are
1. Training error
2. Generalization error
1. Training error: Also known as apparent error, the training error is defined as the average
loss that occurred during the training process.
2. Generalization error: Generalization error is a measure of how accurately an algorithm
is able to predict outcome values for previously unseen data.
 A good model must have low training error as well as low generalization error. This is
important because a model that fits the training data too well can have a poorer
generalization error than a model with high training error; such a situation is called Model
Overfitting.
How to Address Overfitting?
Overfitting is addressed by tree pruning. When a decision tree is built, many of the
branches will reflect anomalies in the training data due to noise or outliers.
 Tree pruning methods address the problem of overfitting the data. Such methods
typically use statistical measures to remove the least reliable branches.
 Pruned trees tend to be smaller and less complex and thus easier to comprehend.
 They are usually faster and better at correctly classifying independent test data than
unpruned trees.
Tree pruning classified into
1. Pre pruning
2. Post pruning
Pre Pruning:
 Stop the algorithm before it becomes a fully-grown tree
 Typical stopping conditions for a node:
1. Stop if all instances belong to the same class
2. Stop if all the attribute values are the same
 More restrictive conditions:
1. Stop if number of instances is less than some user-specified threshold
2. Stop if class distribution of instances is independent of the available features (e.g.,
using chi-square (χ2) tests)
3. Stop if expanding the current node does not improve impurity measures (e.g., Gini or
information gain).
Post Pruning:
 Grow decision tree to its entirety
 Trim the nodes of the decision tree in a bottom-up fashion
 If generalization error improves after trimming, replace sub-tree by a leaf node.
 Class label of leaf node is determined from majority class of instances in the sub-tree
 Can use MDL for post-pruning

3.2.1 Due to Presence of noise:


Model overfitting due to noise is a common issue in machine learning. Noise refers to
random or irrelevant variations in the data that do not represent actual patterns or
relationships. Overfitting occurs when a machine learning model captures this noise in the
training data, leading to poor generalization on unseen data.

3.2.2 Due to lack of representation samples:


 Insufficient data in data mining can pose significant challenges and limitations. Data
mining relies on large and diverse datasets to discover patterns, trends, and insights.
 When you have insufficient data, you may encounter the following issues:
 Incomplete Patterns: With limited data, you may not be able to identify and
analyze complete patterns or trends within your dataset. This can result in
inaccurate or incomplete conclusions.
 Insufficient number of training records in the region causes the decision tree to
predict the test examples using other training records that are irrelevant to the
classification task

EXAMPLE:
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the
+ class labels of that region

3.2.3 Evaluating the performance of classifier:


There are various ways of evaluating the performance of a classifier; we mainly use the
confusion matrix.

Confusion Matrix:
A confusion matrix is a tabular representation that shows the true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN).
TP: Correctly predicted positive instances.
TN: Correctly predicted negative instances.
FP: Incorrectly predicted positive instances
FN: Incorrectly predicted negative instances

3.2.3.1 Hold Out Method:


It is a straightforward approach for evaluating classifier performance. It involves splitting
the data into two subsets:
1. Training set
2. Test set (or holdout set, or validation set)
The training set is used to train the classifier while the test set is used to evaluate its
performance. Typically the data is randomly partitioned into these two sets, with a common
split ratio being 70% for training and 30% for testing, although the exact split can vary
based on the dataset size and requirements. A minimal sketch is given below.
Advantages:
 Simplicity
 Speed
 Useful for large datasets
Disadvantages:
 Variability
 Limited data for training
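
A minimal holdout sketch with scikit-learn is given below (illustrative only); the 70/30 split
mirrors the ratio mentioned above, while the synthetic data and the choice of classifier are
assumptions.

# Illustrative sketch: holdout method with a 70% training / 30% test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))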

3.2.3.2 Random Sub Sampling:


Random sub-sampling involves selecting a random subset of data points from the original
dataset. This can be done with or without replacement, depending on the specific context and
purpose:
Sampling without Replacement: In this case, each data point is selected only once. This is
often used for creating training and testing datasets where you want non-overlapping sets.
Sampling with Replacement: With replacement, data points can be selected more than once,
which is typical in bootstrapping and certain cross-validation techniques.

Sample Size: You can control the size of the random sub-sample, which may be a fixed
number of data points or a specific percentage of the original dataset. The sample size is often
determined by the requirements of your analysis or modeling task.
Repeatability: If repeatability is important, you can set a random seed before sampling to
ensure that the same sub-sample is obtained when needed. Random sub-sampling can be
useful for various purposes

3.2.3.3 Cross Validation:

Cross-validation is a fundamental technique in data mining and machine learning used to
assess the performance and generalization of predictive models. It involves partitioning a
dataset into multiple subsets to evaluate a model's performance, particularly its ability to
generalize to new, unseen data. Cross-validation helps ensure that the model's evaluation is
not overly optimistic or pessimistic due to the specific random split of the data. Several
common methods of cross-validation are used in data mining, including k-fold cross-
validation, stratified cross-validation, and leave-one-out cross-validation.
Here's an overview of cross-validation in data mining:

k-fold Cross-Validation:

In k-fold cross-validation, the dataset is divided into k approximately equal-sized "folds" or
subsets.
 The model is trained and evaluated k times, where each fold is used as the testing set
exactly once, and the remaining k-1 folds are used for training.
 The results of each fold (e.g., performance metrics like accuracy or error) are typically
averaged to provide an overall assessment of the model's performance.
Common values for k are 5 and 10, but the choice of k depends on the specific dataset and
problem. Smaller values of k lead to more repetitions but potentially higher variance in the
performance estimate. Larger values of k lead to less variance but can be computationally
expensive.
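
A minimal k-fold cross-validation sketch with scikit-learn and k = 5 follows (illustrative
only); the synthetic data and the choice of classifier are assumptions.

# Illustrative sketch: 5-fold cross-validation; each fold is used once as the
# test set and the fold accuracies are averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = cross_val_score(GaussianNB(), X, y, cv=5)    # accuracy on each of the 5 folds
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())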

Stratified Cross-Validation:

In stratified cross-validation, the data is divided into folds in such a way that each fold
maintains the same class distribution as the overall dataset. This is particularly important
when dealing with imbalanced datasets to ensure that each class is adequately represented in
each fold.

Leave-One-Out Cross-Validation (LOOCV):

In LOOCV, each data point is treated as a separate test set, while the rest of the data is used
for training.
LOOCV is a special case of k-fold cross-validation where k is equal to the number of data
points. It can be computationally intensive for large datasets but provides a rigorous estimate
of model performance.

Repeated Cross-Validation:

 To reduce the potential impact of the initial random partitioning of the data, repeated
cross-validation involves running the cross-validation process multiple times with
different random splits.
 This provides more robust performance estimates.
 The main advantages of cross-validation in data mining are:
 It provides a more reliable estimate of a model's performance compared to a single
train-test split.
 It helps detect issues like overfitting, as models that perform well on one set of data
but not on others may indicate overfitting.
 It ensures that the model's performance is assessed on a variety of data points,
improving its generalization ability.
Cross-validation is a crucial step in model selection, hyper parameter tuning, and assessing
the overall quality of a predictive model in data mining and machine learning. It is a valuable
tool for ensuring that your model will perform well on new, unseen data.

3.2.4 Bayes Theorem:

Bayes' Theorem, named after the Reverend Thomas Bayes, is a fundamental concept in
probability theory and statistics. It provides a way to update and revise probabilities for a
hypothesis or event based on new evidence or information. Bayes' Theorem is particularly
useful in various fields, including machine learning, data science, and decision making. The
theorem is
 Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
 The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)

 P(A|B) is Posterior probability:
o Probability of hypothesis A on the observed event B.
 P(B|A) is Likelihood probability:
o Probability of the evidence given that the probability of a hypothesis is true.
 P(A) is Prior Probability:
o Probability of hypothesis before observing the evidence.
 P(B) is Marginal Probability:
o Probability of Evidence.

3.2.5 Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below
steps:

Convert the given dataset into frequency tables.

Generate Likelihood table by finding the probabilities of given features.

Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:



P(Yes/Sunny)= P(Sunny/Yes)*P(Yes)/P(Sunny)

P(Sunny/Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes/Sunny) = 0.3*0.71/0.35= 0.60

P(No/Sunny)= P(Sunny/No)*P(No)/P(Sunny)

P(Sunny/NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No/Sunny)= 0.5*0.29/0.35 = 0.41

So, as we can see from the above calculation, P(Yes/Sunny) > P(No/Sunny).

Hence on a Sunny day, Player can play the game.



Advantages of Naïve Bayes Classifier:


 Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
 It can be used for Binary as well as Multi-class Classifications.
 It performs well in Multi-class predictions as compared to the other Algorithms.
 It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


 Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:


 It is used for Credit Scoring.
 It is used in medical data classification.
 It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
 It is used in Text classification such as Spam filtering and Sentiment analysis.

Python Implementation of the Naïve Bayes algorithm:


Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user data" dataset, which we have used in our other classification model. Therefore we can
easily compare the Naive Bayes model with the other models.

Steps to implement:
Step1: Data Pre-processing step
Step2: Fitting Naive Bayes to the Training set
Step3: Predicting the test result
Step4: Test accuracy of the result (Creation of Confusion matrix)
Step5: Visualizing the test set result.

Program to implement the data preprocessing step:


# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
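
The notes above show only the data preprocessing step; the following is a hedged sketch of
the remaining steps (fitting, prediction, and the confusion matrix), assuming the same
x_train/x_test variables and the scikit-learn GaussianNB estimator. The visualization step is
omitted here.

# Step 2: Fitting Naive Bayes to the Training set (sketch, assuming the variables above)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Step 3: Predicting the test set results
y_pred = classifier.predict(x_test)

# Step 4: Test accuracy of the result (creation of the confusion matrix)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("accuracy:", accuracy_score(y_test, y_pred))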

Bootstrap:

"Bootstrap" is a resampling technique used for estimating the sampling distribution of a


statistic by repeatedly resampling the dataset with replacement. This method is especially
useful when you want to assess the variability and uncertainty of a statistical estimate or
model performance, such as calculating confidence intervals or conducting hypothesis tests.
Here's how the bootstrap method works:

Resampling: Start with a dataset of size N. To create a bootstrap sample, you randomly
select N data points from the original dataset with replacement. This means that a single data
point can be selected multiple times in a bootstrap sample, and some data points may not be
selected at all.

Statistical Estimation: Calculate the statistic of interest on each bootstrap sample. This
statistic can be a mean, median, standard deviation, correlation coefficient, or any other
parameter you want to estimate.

Repeat: Repeat steps 1 and 2 a large number of times (often thousands or tens of
thousands) to create a distribution of the statistic of interest.

Analyze the Distribution: With the collection of statistics obtained from the bootstrap
samples, you can analyze their distribution. This distribution provides insights into the
variability and uncertainty associated with the original statistic. You can calculate confidence
intervals, perform hypothesis testing, or assess the stability of model parameters.
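
A minimal bootstrap sketch with NumPy follows (illustrative only); the sample data, the
number of resamples, and the use of the mean as the statistic of interest are assumptions.

# Illustrative sketch: bootstrap estimate of the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=200)    # original dataset of size N = 200

n_resamples = 5000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    sample = rng.choice(data, size=data.size, replace=True)  # bootstrap sample of size N
    boot_means[i] = sample.mean()                            # statistic of interest

# 95% confidence interval for the mean from the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("bootstrap 95% CI for the mean:", (lower, upper))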

*****

UNIT –IV:
Association Analysis: Basic Concepts and Algorithms: Problem Definition, Frequent Item
Set Generation, Apriori Principle, Apriori Algorithm, Rule Generation, Compact
Representation of Frequent Item sets, FP-Growth Algorithm.

Association Analysis:

4.1.1.1 Basic Concepts and Algorithms


4.1.1.2 Problem Definition
4.1.1.3 Frequent Item Set Generation
4.1.1.4 Apriori Principle
4.1.1.5 Apriori Algorithm
4.1.1.6 Rule Generation
4.1.1.7 Compact Representation of Frequent Item sets
4.1.1.8 FP-Growth Algorithm. (Tan & Vipin)

4.1 Basic Concepts and Algorithms:

Association Analysis:

 It is useful for discovering interesting relationships hidden in large datasets.


 The uncovered relationships can be represented in the form of association rules or
sets of frequent items.
 These ideas are elaborated below.

 This chapter presents a methodology known as association analysis, which is
useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of sets of items present in
many transactions, which are known as frequent item sets. Or association rules,
that represent relationships between two item sets. For example, the following rule
can be extracted from the data set.

{Diapers}  {Beer}.

 The rule suggests a relationship between the sale of diapers and beer because many
customers who buy diapers also buy beer. Retailers can use these types of rules to
help them identify new opportunities for cross-selling their products to the
customers.

REPRESENTATION:

Binary Representation Market basket data can be represented in a binary format. Where each
row corresponds to a transaction and each column corresponds to an item. An item can be
treated as a binary variable whose value is one if the item is present in a transaction and zero
otherwise.

This representation is a simplistic view of real market basket data because it ignores
important aspects of the data such as the quantity of items sold or the price paid to purchase
them.

Itemset and Support Count:


 Let I = {i1, i2,..., id} be the set of all items in a market basket data.
 T = {t1, t2, ... , tN } be the set of all transactions.
 If an itemset contains k items, it is called a k-itemset.
 For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
 The null (or empty) set is an itemset that does not contain any items.
An important property of an itemset is its support count, which refers to the number of
transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an
itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|
Where the symbol |·| denotes the number of elements in a set. In the market basket data set
used in this example, the support count for {Beer, Diapers, Milk} is equal to two because
there are only two transactions that contain all three items. Often, the property of interest is
the support, which is the fraction of transactions in which an itemset occurs:
s(X) = σ(X)/N
where s(X) is the support of itemset X,
σ(X) is the support count (the number of transactions that contain X), and
N is the total number of transactions.

An itemset X is called frequent if s(X) is greater than some user-defined threshold, min
support.

NOTE:
Association Rule: An association rule is an implication expression of the form X → Y,
where X and Y are disjoint item sets, i.e., X ∩ Y = ∅. The strength of an association rule can
be measured in terms of its support and confidence. Support determines how often a rule is
applicable to a given data set, while confidence determines how frequently items in Y appear
in transactions that contain X. The formal definitions of these metrics are:
Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

Example:

 Consider the rule {Milk, Diapers} → {Beer}.


 Because the support count for {Milk, Diapers, Beer} is 2 and the total number of
transactions is 5.
 The rule’s support is 2/5=0.4.
 The rule’s confidence is obtained by dividing the support count for {Milk, Diapers,
Beer} by the support count for {Milk, Diapers}.
 Since there are 3 transactions that contain milk and diapers, the confidence for this
rule is 2/3=0.67.
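
The sketch below reproduces this calculation in Python (illustrative only); the five
transactions are a reconstruction of the market basket example used in this unit and are an
assumption consistent with the support counts quoted above.

# Illustrative sketch: support and confidence of the rule {Milk, Diapers} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

def support_count(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = support_count(X | Y) / N                       # 2/5 = 0.4
confidence = support_count(X | Y) / support_count(X)     # 2/3 ≈ 0.67
print("support:", support, "confidence:", round(confidence, 2))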

Why Use Support and Confidence?

 Support is an important measure because a rule that has very low support might occur
simply by chance.
 Support also has a desirable property that can be exploited for the efficient discovery
of association rules.
 Confidence, on the other hand, measures the reliability of the inference made by a
rule.
 For a given rule X −→ Y, the higher the confidence, the more likely it is for Y to be
present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
Formulation of the Association Rule Mining Problem
The association rule mining problem can be formally stated as follows:

4.2 Problem Definition:


Given a set of transactions T, find all the rules having support ≥ min_sup and confidence ≥
min_conf, where min_sup and min_conf are the corresponding support and confidence
thresholds.

Brute-force approach:

A brute-force approach works as follows:
 List all possible association rules.
 Compute the support and confidence for each rule.
 Prune rules that fail the minsup and minconf thresholds.

Explanation for Brute-Force Approach:

A brute-force approach for mining association rules is to compute the support and confidence
for every possible rule. This approach is prohibitively expensive because there are
exponentially many rules that can be extracted from a data set. More specifically, assuming
that neither the left nor the right-hand side of the rule is an empty set, the total number of
possible rules, R, extracted from a data set that contains d items is
R = 3^d – 2^(d+1) + 1
For example, if the number of items d = 6:
Total number of item sets = 2^d = 2^6 = 64
Total number of possible association rules: R = 3^6 – 2^7 + 1 = 729 – 128 + 1 = 602

Computational Complexity: A plot of R against d shows how rapidly the number of possible
rules grows with the number of items d.

Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation: whose objective is to find all the itemsets that satisfy the
minsup threshold?
2. Rule Generation: whose objective is to extract all the high confidence rules from the
frequent item sets found in the previous step. These rules are called strong rules.

The computational requirements for frequent itemset generation are generally more
expensive than those of rule generation.

4.3 Frequent Itemset Generation:


Definition: Frequent item set generation is the process of identifying sets of items that
occur together frequently in a dataset.
The frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set.

Given d items, there are 2^d possible candidate itemsets.
For example, d = 5 gives 2^5 = 32 candidate itemsets.
Brute-force approach:
 Each itemset in the lattice is a candidate frequent itemset
 Count the support of each candidate by scanning the database
 Match each transaction against every candidate
 Complexity ~ O(NMw) => expensive, since M = 2^d !

4.4 Apriori Principle:

Definition: If an itemset is frequent, then all of its subsets must also be frequent.

An algorithm known as Apriori is a common one in data mining. It's used to identify the
most frequently occurring elements and meaningful associations in a dataset. As an example,
products brought in by consumers to a shop may all be used as inputs in this system.

Frequent itemset generation in the Apriori Algorithm:


Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets. The following
provides a high-level illustration of the frequent itemset generation part of the Apriori
algorithm. We assume that the support threshold is 60%, which is equivalent to a minimum
support count equal to 3.

 Initially, every item is considered as a candidate 1-itemset.
 After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded
because they appear in fewer than three transactions.

 In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets, because the Apriori principle ensures that all supersets of the infrequent 1-
itemsets must also be infrequent.
 Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets
generated by the algorithm is C(4, 2) = 6.
 Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found
to be infrequent after computing their support values. The remaining four candidates
are frequent, and thus will be used to generate candidate 3-itemsets. Without support-
based pruning, there are C(6, 3) = 20 candidate 3-itemsets that can be formed using
the six items given in this example. With the Apriori principle, we only need to keep
candidate 3-itemsets whose subsets are frequent. The only candidate that has this
property is {Bread, Diapers, Milk}. However, even though the subsets of
{Bread, Diapers, Milk} are frequent, the itemset itself is not.

Frequent Repeated itemset:

Frequent Non-Repeated itemset:



The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3)
as candidates will produce C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates. With the
Apriori principle, this number decreases to 6 + 6 + 1 = 13 candidates, which represents a 68%
reduction in the number of candidate itemsets even in this simple example.

4.5 Apriori Algorithm:


The Apriori algorithm is a machine learning algorithm used for association rule learning and
frequent item set mining. It works on databases that contain transactions and is designed to
find frequent item sets from large datasets.
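
The following is a small, hedged pure-Python sketch of the level-wise Apriori procedure
(candidate generation, candidate pruning, and support counting); the toy transactions are a
reconstruction of this unit's market basket example and the support threshold of 3 is an
assumption.

# Illustrative Apriori-style sketch: level-wise frequent itemset generation.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
min_support_count = 3

def support_count(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets
items = sorted({item for t in transactions for item in t})
levels = [{frozenset([i]) for i in items if support_count({i}) >= min_support_count}]

k = 2
while levels[-1]:
    prev = levels[-1]
    # Candidate generation: merge frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Candidate pruning (Apriori principle): every (k-1)-subset must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    # Support counting: keep only candidates that meet the minimum support count
    levels.append({c for c in candidates if support_count(c) >= min_support_count})
    k += 1

for level in levels:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), "support count =", support_count(itemset))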

Candidate Generation & Candidate Pruning:


1) Candidate Generation: This operation generates new candidate k-itemsets based on the
frequent (k − 1)-itemsets found in the previous iteration.
2) Candidate Pruning: This operation eliminates some of the candidate itemsets using
support-based pruning, i.e. by removing k-itemsets whose subsets are known to be
infrequent in previous iterations.
Note: This pruning is done without computing the actual support of these k-itemsets (which
could have required comparing them against each transaction).

Candidate generation can be done by using brute force & Fk methods


Brute-Force Method: The brute-force method considers every k-itemset as a potential
candidate and then applies the candidate pruning step to remove any unnecessary candidates
whose subsets are infrequent. The number of candidate itemsets generated at level k is equal
to C(d, k), where d is the total number of items. Although candidate generation is rather
trivial, candidate pruning becomes extremely expensive because a large number of itemsets
must be examined.
Fk−1 × F1 Method: An alternative method for candidate generation is to extend each
frequent (k − 1)-itemset with frequent items that are not part of the (k − 1)-itemset. Illustrates
how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item
such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.

All frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. The
Fk−1 × F1 candidate generation method only produces four candidate 3-itemsets, instead of
the C(6, 3) = 20 itemsets produced by the brute-force method.

Fk−1×Fk−1 Method: This candidate generation procedure, which is used in the candidate-
gen function of the Apriori algorithm, merges a pair of frequent (k −1)-itemsets only if their
first k −2 items, arranged in lexicographic order, are identical. Let A = {a1, a2,...,ak−1} and
B = {b1, b2,...,bk−1} be a pair of frequent (k − 1)-itemsets, arranged lexicographically. A
and B are merged if they satisfy the following condition: ai = bi (for i = 1, 2, ..., k − 2).

A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B
consists of the first k − 2 common items followed by ak−1 and bk−1 in lexicographic order.
This candidate generation procedure is complete, because for every lexicographically ordered
frequent k-itemset, there exists two lexicographically ordered frequent (k − 1)-itemsets that
have identical items in the first k – 2 positions.

Candidate Pruning:

Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.
L4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated.
Candidate pruning:
 Prune ABCE because ACE and BCE are infrequent.
 Prune ABDE because ADE is infrequent.
After candidate pruning: L4 = {ABCD}.

Support Counting:

Support counting is the process of determining the frequency of occurrence for every
candidate itemset that survives the candidate pruning step. An alternative approach is to
enumerate the itemsets contained in each transaction and use them to update the support
counts of their respective candidate itemsets. To illustrate, consider a transaction t that
contains five items, {1, 2, 3, 5, 6}. There are C(5, 3) = 10 itemsets of size 3 contained in this
transaction. Some of the itemsets may correspond to the candidate 3-itemsets under
investigation, in which case, their support counts are incremented. Other subsets of t that do
not correspond to any candidates can be ignored.

Support Counting Using a Hash Tree*

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored
in a hash tree. During support counting, itemsets contained in each transaction are also
hashed into their appropriate buckets. That way, instead of comparing each itemset in the
transaction with every candidate itemset, it is matched only against candidate itemsets that
belong to the same bucket.

Consider the transaction, t = {1, 2, 3, 5, 6}. To update the support counts of the candidate
itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing
candidate 3-itemsets belonging to t must be visited at least once.

Computational Complexity:

The computational complexity of the Apriori algorithm, which includes both its runtime and
storage, can be affected by the following factors. Support Threshold Lowering the support
threshold often results in more itemsets being declared as frequent. This has an adverse effect
on the computational complexity of the algorithm because more candidate itemsets must be
generated and counted at every level.
The maximum size of frequent itemsets also tends to increase with lower support thresholds.
This increases the total number of iterations to be performed by the Apriori algorithm, further
increasing the computational cost. Number of Items (Dimensionality): As the number of
items increases, more space will be needed to store the support counts of items. If the number
of frequent items also grows with the dimensionality of the data, the computation and I/O
costs will also increase because of the larger number of candidate itemsets generated.

Number of Transactions Because the Apriori algorithm makes repeated passes over the
transaction data set, its run time increases with a larger number of transactions.

Average Transaction Width For dense data sets, the average transaction width can be very
large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum
size of frequent itemsets tends to increase as the average transaction width increases. As a
result, more candidate itemsets must be examined during candidate generation and support
counting. Second, as the transaction width increases, more itemsets are
contained in the transaction. This will increase the number of hash tree traversals performed
during support counting. A detailed analysis of the time complexity for the Apriori algorithm
is presented next.

Generation of frequent 1-itemsets: For each transaction, we need to update the support count
for every item present in the transaction. Assuming that w is the average transaction width,
this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation: To generate candidate k-itemsets, pairs of frequent (k − 1)-itemsets
are merged to determine whether they have at least k − 2 items in common. Each merging
operation requires at most k − 2 equality comparisons. Every merging step can produce at
most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge
every pair of frequent (k − 1)-itemsets found in the previous iteration. Therefore, the overall
cost of merging frequent itemsets is at most the sum over k = 2, ..., w of (k − 2)|Fk−1|^2,
where w is the maximum transaction width.

4.6 Rule Generation:

 Rule generation is a process of finding interesting and useful patterns or rules from large
sets of data. It is one of the main tasks of data mining, which aims to discover hidden
knowledge from data.
 One of the most common methods of rule generation is association rule mining, which
finds frequent itemsets and then derives rules that imply the co-occurrence of items in the
itemsets.
 For example, if a customer buys bread and milk, they are likely to buy eggs as well. This
can be expressed as an association rule: {bread, milk} -> {eggs}.

 Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies
the minimum confidence requirement

 If {A,B,C,D} is a frequent itemset, candidate rules:


ABC →D, ABD →C, ACD →B, BCD →A,
A →BCD, B →ACD, C →ABD, D →ABC
AB →CD, AC → BD, AD → BC, BC →AD,
BD →AC, CD →AB,
 Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules
that have empty antecedents or consequents (∅ → Y or Y → ∅).
In general, confidence does not have an anti-monotone property
c(ABC →D) can be larger or smaller than c(AB →D)
 But confidence of rules generated from the same itemset has an anti-monotone property
E.g., Suppose {A,B,C,D} is a frequent 4-itemset:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
 Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
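
The sketch below illustrates rule generation from a single itemset by enumerating its
non-empty proper subsets and computing the confidence of each candidate rule (illustrative
only); the transactions are the same reconstructed toy data used earlier and the confidence
threshold is an assumption.

# Illustrative sketch: generate rules f -> L - f from an itemset L and keep the
# high-confidence ones. L is used purely to illustrate rule enumeration.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

L = frozenset({"Bread", "Milk", "Diapers"})
min_conf = 0.6

for r in range(1, len(L)):                  # all non-empty proper subsets f of L
    for f in combinations(L, r):
        antecedent = frozenset(f)
        consequent = L - antecedent
        conf = support_count(L) / support_count(antecedent)
        if conf >= min_conf:
            print(sorted(antecedent), "->", sorted(consequent),
                  "confidence =", round(conf, 2))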

Rule Generation in Apriori Algorithm:


The Apriori algorithm uses a level-wise approach for generating association rules, where
each level corresponds to the number of items that belong to the rule consequent. Initially, all
the high confidence rules that have only one item in the rule consequent are extracted. These
rules are then used to generate new candidate rules.

4.7 Compact Representation of Frequent Itemsets:

The number of frequent itemsets produced from a transaction data set can be very large. It is
useful to identify a small representative set of frequent itemsets from which all other frequent
itemsets can be derived. Two such representations are presented in this section in the form of
maximal and closed frequent itemsets.

Maximal Frequent Itemsets:

 Maximal frequent itemsets effectively provide a compact representation of frequent
itemsets. In other words, they form the smallest set of itemsets from which all frequent
itemsets can be derived.
 For example, every frequent itemset in Figure is a subset of one of the three maximal
frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}.
 If an itemset is not a proper subset of any of the maximal frequent itemsets, then it is
either infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}).
 Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a
compact representation of the frequent itemsets

Maximal frequent itemsets provide a valuable representation for data sets that can produce
very long, frequent itemsets, as there are exponentially many frequent itemsets in such data.

Closed Itemsets:
 Closed itemsets provide a minimal representation of all itemsets without losing their
support information.
 An itemset X is closed if none of its immediate supersets has exactly the same
support count as X.
 X is not closed if at least one of its immediate supersets has the same support count
as X.

Closed frequent itemsets:



Maximal vs Closed Itemsets:

Maximal Frequent vs Closed Frequent Itemsets:



4.8 FP-Growth Algorithm:

 FP growth algorithm is a method for finding frequent itemsets in a transaction database
without using candidate generation. It uses a special data structure called an FP-tree to
store the association information between the items. It is faster and more efficient than
the Apriori algorithm.
 The algorithm does not subscribe to the generate-and-test paradigm of Apriori.
o Instead, it encodes the data set using a compact data structure called an FP-tree and
extracts frequent itemsets directly from this structure.

FP-Tree Representation:
An FP-tree is a compressed representation of the input data. It is constructed by reading the
data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As
different transactions can have several items in common, their paths might overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us
to extract frequent itemsets directly from the structure in memory instead of making repeated
passes over the data stored on disk.
1) The data set is scanned once to determine the support count of each item. Infrequent items
   are discarded, while the frequent items are sorted in decreasing order of their support counts inside
   every transaction of the data set. For the example data set, a is the most
   frequent item, followed by b, c, d, and e.
2) The algorithm makes a second pass over the data to construct the FP-tree. After reading
the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then
formed from null → a → b to encode the transaction. Every node along the path has a
frequency count of 1.
3) After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null →
b → c → d. Every node along this path also has a frequency count equal to one. Although
the first two transactions have an item in common (which is b), their paths are disjoint
because the transactions do not share a common prefix.

4) The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e,
overlaps with the path for the first transaction, null → a → b. Because of their
overlapping path, the frequency count for node a is incremented to two, while the
frequency counts for the newly created nodes, c, d, and e, are equal to one.
5) This process continues until every transaction has been mapped onto one of the paths
   given in the FP-tree. The resulting FP-tree, obtained after reading all the transactions, is the
   complete compressed representation of the data set.

Algorithm by Han: The original algorithm to construct the FP-Tree defined by Han is given
below:

Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold (minsup).

Output: FP-tree, the frequent-pattern tree of DB.

Method: The FP-tree is constructed as follows.

1. The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in
the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The
root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken
at the top, and then the next itemset with the lower count. It means that the branch of
the tree is constructed with transaction itemsets in descending order of count.
4. The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch, then this transaction branch would share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in
this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The
common node and new node count are increased by 1 as they are created and linked
according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined
first, along with the links of the lowest nodes. The lowest node represents the
frequency pattern length 1. From this, traverse the path in the FP Tree. This path or
paths is called a conditional pattern base. A conditional pattern base is a sub-database
consisting of prefix paths in the FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.

Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects
and sorts the set of frequent items, and the second constructs the FP-Tree.
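A compact Python sketch of this two-scan construction is given below. It is a simplified illustration (it builds only the tree, without the header-table links used later for mining); the five transactions and the minimum support count of 2 are assumptions chosen to mirror the description above.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, minsup=2):
    # Pass 1: count item supports and keep only the frequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= minsup}

    root = Node(None, None)
    # Pass 2: insert each transaction, items sorted by decreasing support.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:                 # start a new branch
                child = Node(item, node)
                node.children[item] = child
            child.count += 1                  # shared prefix: just bump the count
            node = child
    return root

def show(node, depth=0):
    if node.item is not None:
        print('  ' * depth + f'{node.item}:{node.count}')
    for child in node.children.values():
        show(child, depth + 1)

# Hypothetical transactions, similar in spirit to the description above.
transactions = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
                ['a', 'd', 'e'], ['a', 'b', 'c']]
show(build_fptree(transactions))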

Advantages of FP Growth Algorithm:


Here are the following advantages of the FP growth algorithm, such as:
 This algorithm needs to scan the database only twice, whereas Apriori scans the transactions once for every iteration.

 The pairing of items is not done in this algorithm, making it faster.


 The database is stored in a compact version in memory.
 It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm:


This algorithm also has some disadvantages, such as:
 FP Tree is more cumbersome and difficult to build than Apriori.
 It may be expensive.
 The algorithm may not fit in the shared memory when the database is large.

Difference between Apriori and FP Growth Algorithm:


Apriori and FP-Growth algorithms are the most basic FIM algorithms. There are some basic
differences between these algorithms, such as:
1. Apriori generates frequent patterns by forming itemsets of increasing size (single itemsets, pairs, triples, and so on), whereas FP Growth builds an FP-Tree and generates the frequent patterns from it.
2. Apriori uses candidate generation, where frequent subsets are extended one item at a time, whereas FP-Growth generates a conditional FP-Tree for every item in the data.
3. Apriori scans the database at every step, which becomes time-consuming when the number of items is large, whereas FP-Growth requires only two database scans, so it consumes less time.
4. With Apriori, a converted version of the database is saved in memory, whereas with FP-Growth a set of conditional FP-trees (one per item) is saved in memory.
5. Apriori uses a breadth-first search, whereas FP-Growth uses a depth-first search.
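Both algorithms are also available off the shelf. The sketch below assumes the third-party mlxtend library and a small hypothetical transaction list; it simply shows that apriori and fpgrowth return the same frequent itemsets, differing only in how they search.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

# Hypothetical market-basket transactions.
transactions = [['Bread', 'Peanut Butter', 'Milk'],
                ['Bread', 'Jelly', 'Peanut Butter'],
                ['Bread', 'Milk', 'Eggs'],
                ['Peanut Butter', 'Bread'],
                ['Milk', 'Eggs']]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Both calls return the same frequent itemsets; only the search strategy differs.
print(apriori(onehot, min_support=0.4, use_colnames=True))
print(fpgrowth(onehot, min_support=0.4, use_colnames=True))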

*******************
What is Apriori Algorithm?

It is a classic algorithm used in data mining for finding association rules based on the principle "Any
subset of a large item set must be large". It uses a generate-and-test approach – generates candidate
itemsets and tests if they are frequent.

Frequent Itemset Generation:

Given the minimum support threshold, generate the large (frequent) itemsets – i.e., keep only the itemsets whose support meets the threshold.

Illustration:
Consider the below transaction in which B = Bread, J = Jelly, P = Peanut Butter, M = Milk and E =
Eggs. Given that minimum threshold support = 40% and minimum threshold confidence = 80%.

Step-1: Count the number of transactions in which each item occurs (Bread B occurs in 4 transactions
and so on).

Step-2: As minimum threshold support = 40%, So in this step we will remove all the items that are
bought less than 40% of support or support less than 2.

The above table has single items that are bought frequently. Now let’s find a pair of items that are
bought frequently. We continue from the above table (Table in step 2)

Step-3: We start making pairs from the first item with the items below it, like {B,P}, {B,M}, {B,E}, and then
we start with the second item and the items below it, like {P,M}, {P,E}. We do not make the pair {P,B} because
we already made the pair {B,P} when we were making pairs of B: buying bread and peanut butter
together is the same as buying peanut butter and bread together. After making all the pairs we get,

Step-4: As minimum threshold support = 40%, So in this step we will remove all the items that are
bought less than 40% of support and we are left with

The above table has two items {B, P} that are bought together frequently.

Association Rule Generation:

Step-5: We cannot generate a larger frequent itemset (an itemset of 3) because we are left with only one
frequent 2-itemset, so we start generating association rules from the frequent itemsets. As we have a
frequent itemset of two items, only two association rules can be generated, as shown below:

As P -> B has confidence 100% which is greater than minimum confidence threshold 80%, thus P ->
B is a Strong Association Rule.
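The same generate-and-test idea can be written as a short program. The sketch below is illustrative: the five transactions are hypothetical (the original transaction table is not reproduced here), so its counts differ slightly from the walk-through, and a minimum support count of 2 is assumed.

from itertools import combinations

def apriori_frequent(transactions, min_count=2):
    # Generate-and-test: candidate (k+1)-itemsets are built from frequent
    # k-itemsets, pruned with the Apriori principle, and then counted.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items}
    all_frequent, k = {}, 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        # join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in all_frequent
                             for s in combinations(c, k))}
        k += 1
    return all_frequent

# Hypothetical transactions (B=Bread, J=Jelly, P=Peanut Butter, M=Milk, E=Eggs).
T = [{'B', 'J', 'P'}, {'B', 'P', 'M'}, {'B', 'M', 'E'}, {'B', 'P'}, {'M', 'E'}]
for itemset, count in sorted(apriori_frequent(T).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), 'count =', count)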

Disadvantages of Apriori Algorithm:


 Generation of itemsets is expensive (in both space and time)
 Support counting is expensive

*******************
What is FP Growth Algorithm?

An efficient and scalable method to find frequent patterns. It allows frequent itemset discovery
without candidate itemset generation.

Following are the steps for FP Growth Algorithm


 Scan DB once, find frequent 1-itemset (single item pattern)
 Sort frequent items in frequency descending order, f-list
 Scan DB again, construct FP-tree
 Construct the conditional FP tree in the sequence of reverse order of F - List - generate
frequent item set

Illustration:
Consider the below transaction in which B = Bread, J = Jelly, P = Peanut Butter, M = Milk and E =
Eggs. Given that minimum threshold support = 40% and minimum threshold confidence = 80%.

Step-1: Scan DB once, find frequent 1-itemset (single item in itemset)

Step-2: As minimum threshold support = 40%, So in this step we will remove all the items that are
bought less than 40% of support or support less than 2.

Step-3: Create a F -list in which frequent items are sorted in the descending order based on the
support.

Step-4: Sort frequent items in transactions based on F-list. It is also known as FPDP.

Step-5: Construct the FP tree

Read transaction 1: {B,P} -> Create 2 nodes B and P. Set the path as null -> B -> P and the count of B
and P as 1 as shown below :

Read transaction 2: {B,P} -> The path will be null -> B -> P. As transaction 1 and 2 share the same
path. Set counts of B and P to 2.

Read transaction 3: {B,P,M} -> The path will be null -> B -> P -> M. As transaction 2 and 3 share the
same path till node P. Therefore, set count of B and P as 3 and create node M having count 1.

Continue until all the transactions are mapped to a path in FP-tree.

Step-6: Construct the conditional FP tree for each item in the reverse order of the F-list {E, M, P, B} and
generate the frequent itemsets. A conditional FP tree is a sub-tree built by considering only the
transactions that contain a particular item (the suffix) and then removing that item from those transactions.

For items E and M, the nodes in their conditional FP trees have a count (support) of 1, which is less than the
minimum threshold support of 2, so no frequent itemsets are generated for them. In the case of item P, node B in its
conditional FP tree has a count (support) of 3, which satisfies the minimum threshold support. Hence the frequent
itemset {B, P} is generated by adding item B to the suffix P – the same two items found to be bought together
frequently by the Apriori illustration above.
*****

UNIT –V:
Cluster Analysis: Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means: The Basic K-means
Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses;
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering
Algorithm DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm,
Strengths and Weaknesses.

5.1 Cluster Analysis: Basic Concepts and Algorithms: Overview


5.1.1 What Is Cluster Analysis?
5.1.2 Different Types of Clustering
5.1.3 Different Types of Clusters
5.2 K-means:
5.2.1 The Basic K-means Algorithm
5.2.2 K-means Additional Issues
5.2.3 Bisecting K-means
5.2.4 Strengths and Weaknesses
5.3 Agglomerative Hierarchical Clustering:
5.3.1 Basic Agglomerative Hierarchical Clustering
5.4 Algorithm DBSCAN:
5.4.1 Traditional Density Center-Based Approach
5.4.2 DBSCAN Algorithm
5.4.3 Strengths and Weaknesses. (Tan &Vipin)

5.1 Cluster Analysis:

5.1.1 What is Cluster Analysis?


The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. A cluster of data
objects can be treated collectively as one group in many applications.

Here are some applications of clustering in daily life applications


1. In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
2. In biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionality, and gain insight into structures inherent in populations.
3. Clustering may also help in the identification of similar land use in an earth
observation database.
4. Clustering is helpful in identification of groups of automobile insurance holders with
high average claim cost.
5. Clustering helps in identification of groups of houses in a city according to the house
type, value, and geographical location.

 Cluster analysis can be used as a stand-alone tool to gain insight into distribution of
data, to observe the characteristics of each cluster, and to focus on a particular set of
cluster for further analysis.
 Clustering may serve as a preprocessing step for other algorithms, such as
characterization and classification, which would then operate on detected clusters.
 Clustering is an example of unsupervised learning. Unlike classification, clustering and
unsupervised learning do not rely on predefined classes and class-labeled training
examples.

 Clustering is a challenging field of research where its potential applications pose with
their own special requirements. The following are typical requirements of clustering in
data mining:
o Scalability
o Ability to deal with different types of attributes
o Discovery of clusters with arbitrary shape
o Minimal requirements for domain knowledge to determine input parameters
o Ability to deal with noisy data
o Insensitivity to the order of input records
o High dimensionality
o Constraint-based clustering
o Interpretability and usability

5.1.2 Different types of clustering:


The choice of clustering algorithm depends both on the type of data available and on
particular purpose and application. Clustering methods can be classified into following
categories:
 Partitioning methods
 Hierarchical methods
 Density-based methods
 Grid-based methods
 Model-based methods

Partitioning methods:
 Partitioning methods construct k partitions of a given database consisting of n
objects or data tuples, where each partition represents a cluster and k ≤ n.
 It classifies the data into k groups, which together satisfy the following requirements:
o Each group must contain at least one object
o Each object must belong to exactly one group
 Partitioning method creates an initial partitioning; it then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group
to another.

 The objects in the same cluster are “close” or related to each other, whereas objects of
different clusters are “far apart” or very different.
 Most of the applications use two popular heuristic methods for partitioning
o The k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster.
o The k-medoids algorithm, where each cluster is represented by one of the
objects located near the cluster.

Hierarchical methods:
 Hierarchical method creates a hierarchical decomposition of the given set of data
objects.
 A Hierarchical method can be classified as being either agglomerative or divisive,
based on how hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to one
another, until all of the groups are merged into one or until a termination condition
holds.

 The divisive approach, also called the top-down approach, starts with all the objects in
the same cluster. In each successive iteration, a cluster is split into smaller clusters,
until eventually each object is in its own cluster, or until a termination condition holds.
 Hierarchical methods suffer from the fact that once a step (merge or split) is done, it
can never be undone.
 There are two approaches to improving the quality of hierarchical clustering:
o Perform careful analysis of object “linkages” at each hierarchical partitioning,
such as in CURE and Chameleon.
o Integrate hierarchical agglomeration and iterative relocation by first using a
hierarchical agglomerative algorithm and then refining the result using
iterative relocation.

Density-based methods:
 This approach is to continue growing the cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.
 For each data-point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
 This method can be used to filter out noise (outliers) and discover the clusters of
arbitrary shape.

 DBSCAN is a typical density-based method that grows clusters according to a density


threshold.
 OPTICS is a density-based method that computes an augmented clustering ordering
for automatic and interactive cluster analysis.

5.1.3 Different Types of Clusters:


The clusters are classified into following types:

1. Well-separated clusters
2. Prototype-based clusters
3. Contiguity-based clusters
4. Density-based clusters
5. Described by an Objective Function

Well-separated clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster.

Prototype-based Clusters:
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the
prototype or “center” of a cluster, than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in the cluster, or a
medoid, the most “representative” point of a cluster.

Contiguity Cluster:
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or
more other points in the cluster than to any point not in the cluster.

Density Based Cluster:


Density based clusters are employed when the clusters are irregular, intertwined and when
noise and outliers are present.

Objective Function
 Clusters Defined by an Objective Function
 Finds clusters that minimize or maximize an objective function.
 Enumerate all possible ways of dividing the points into clusters and evaluate the
`goodness' of each potential set of clusters by using the given objective function. (NP
Hard)
 Can have global or local objectives.
o Hierarchical clustering algorithms typically have local objectives
o Partition algorithms typically have global objectives
 A variation of the global objective function approach is to fit the data to a
parameterized model.
o Parameters for the model are determined from the data.
o Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.

More generally, a cluster can be defined as a set of points that share some general property that derives from the
entire set of points; an objective function is one precise way of expressing such a property.

5.2 K-Means:

5.2.1 K-Means Clustering Algorithm:


 K-means clustering is an unsupervised learning that is used to solve the clustering
problems in machine learning or data science.
 K-Means clustering groups the unlabeled dataset into different clusters. Here K
means number of predefined clusters.
 For example if k=2 it means the dataset will be grouped into 2 different clusters
and if k=3 it means the dataset will be grouped into 3 different clusters.
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters
in such a way that each data point belongs to only one group and the points within a group have similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
 K means algorithm mainly performs 2 tasks:
o Determine the best value for k center points or centroids by an iterative process
o Assigns each data point to its closest k-center. Those data points which are near
to the particular k-center, create a cluster.

Algorithm for K-Means:


Step1: Select K points as the initial centroids
Step2: Repeat
Step3: Form K clusters by assigning all points to the closest centroid
Step4: Recompute the centroid of each cluster
Step5: Until the Centroids Don’t Change

Let us see the Example


Let us consider the data points A(2, 10), B(2, 5), C(8, 4), D(5, 8), E(7, 5), F(6, 4), G(1, 2),
H(4, 9)
Step1: Let us consider k=3 i.e. 3 clusters we need to form
Let the initial centroids be A, D, G
Step2: Repeat
Step3: We need to find distance between each point to the centroid and we need to allocate
to the respective cluster i.e to which centroid that point has minimum distance.
Where we need to find the Euclidean distance between the points
Formula: d(p1, p2) = √((x2 − x1)² + (y2 − y1)²)

Point   X   Y   Dist. to A(2, 10)   Dist. to D(5, 8)   Dist. to G(1, 2)   Cluster
A 2 10 0.00 3.61 8.06 1
B 2 5 5.00 4.24 3.16 3
C 8 4 8.49 5.00 7.28 2
D 5 8 3.61 0.00 7.21 2
E 7 5 7.07 3.61 6.71 2
F 6 4 7.21 4.12 5.39 2
G 1 2 8.06 7.21 0.00 3
H 4 9 2.24 1.41 7.62 2

o Now we need to find the new centroid points for each clusters from the obtained
clusters
o Since we know the centroid formula for the given set of points
o i.e. centroid G = ( (x1+x2+…+xn)/n, (y1+y2+ .. +yn)/n)

o For cluster-1: The centroid is (2, 10) because it consists of single point
o For cluster-2: ( (8+5+7+6+4)/5 , (4+8+5+4+9)/5 ) = (6, 6)
o For cluster-3: ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
o The centroid points are (2, 10), (6, 6), (1.5, 3.5)
o Now again we need to find the Euclidean distance and allocate the points to the
respective cluster.

Point   X   Y   Dist. to (2, 10)   Dist. to (6, 6)   Dist. to (1.5, 3.5)   Cluster   New Cluster
A 2 10 0.00 5.66 6.52 1 1
B 2 5 5.00 4.12 1.58 3 3
C 8 4 8.49 2.83 6.52 2 2
D 5 8 3.61 2.24 5.70 2 2
E 7 5 7.07 1.41 5.70 2 2
F 6 4 7.21 2.00 4.53 2 2
G 1 2 8.06 6.40 1.58 3 3
H 4 9 2.24 3.61 6.04 2 1

 You can observe that the H data point is moved from Cluster-2 to Cluster-1.
 Now again we need to find the Centroid points for the next iteration
 For Cluster-1: ( (2+4)/2, (10+9)/2 ) =(3, 9.5)
 For Cluster-2: ( (8+5+7+6)/4, (4+8+5+4)/4 )=(6.5, 5.25)
 For Cluster-3: ( (2+1)/2, (5+2)/2 )=(1.5, 3.5)
o The new centroid points for the next iteration are (3, 9.5), (6.5, 5.25), (1.5, 3.5)

Point   X   Y   Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)   Cluster   New Cluster
A 2 10 1.12 6.54 6.52 1 1
B 2 5 4.61 4.51 1.58 3 3
C 8 4 7.43 1.95 6.52 2 2
D 5 8 2.50 3.13 5.70 2 1
E 7 5 6.02 0.56 5.70 2 2
F 6 4 6.26 1.35 4.53 2 2
G 1 2 7.76 6.39 1.58 3 3
H 4 9 1.12 4.51 6.04 1 1

 You can observe that the D data point is moved from Cluster-2 to Cluster-1.
 Now again we need to find the Centroid points for the next iteration
o For Cluster-1: ( (2+5+4)/3, (10+8+9)/3 ) =(3.67, 9)
o For Cluster-2: ( (8+7+6)/3, (4+5+4)/3 )=(7, 4.33)

o For Cluster-3: ( (2+1)/2, (5+2)/2 )=(1.5, 3.5)


 The new centroid points for the next iteration are (3.67, 9), (7, 4.33), (1.5, 3.5)

Point   X   Y   Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)   Cluster   New Cluster
A 2 10 1.94 7.56 6.52 1 1
B 2 5 4.33 5.04 1.58 3 3
C 8 4 6.62 1.05 6.52 2 2
D 5 8 1.67 4.18 5.70 2 1
E 7 5 5.21 0.67 5.70 2 2
F 6 4 5.52 1.05 4.53 2 2
G 1 2 7.49 6.44 1.58 3 3
H 4 9 0.33 5.55 6.04 1 1

o We can observe that there is no change in the clusters. Such that there is no change in
the centroid points also. Therefore we can conclude that
 Data points A, D, H belongs to one cluster.
 Data points C, E, F belongs to one cluster
 Data points B, G belongs to one cluster
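The same result can be checked in a few lines of Python, assuming scikit-learn is available and seeding K-means with the same initial centroids A, D, G used above.

import numpy as np
from sklearn.cluster import KMeans

# The eight points A..H from the worked example above.
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init_centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A, D, G

km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)
print("labels   :", km.labels_)           # groups {A, D, H}, {C, E, F}, {B, G}
print("centroids:", km.cluster_centers_)  # (3.67, 9), (7, 4.33), (1.5, 3.5)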

K-Medoids Algorithm:

o K-Medoids and K-Means are two types of clustering mechanisms in partitional clustering. The essential
  difference is that K-medoids uses an actual data point (a medoid) as each cluster's representative instead of a
  centroid; in the example below, K-means uses the Euclidean distance whereas K-medoids uses the Manhattan distance.
o K-medoids is an unsupervised method with unlabeled data to be clustered. It is an
improvised version of the K-Means algorithm mainly designed to deal with outlier data
sensitivity. Compared to other partitioning algorithms, the algorithm is simple, fast, and
easy to implement.
o Medoid: A medoid is a point in the cluster from which dissimilarities with all other
points in the clusters are minimal.

o Instead of centroids as reference points in K-Means algorithms, the K-Medoids


algorithm takes a Medoid as a reference point.

Algorithm for K-Medoids:


Step1: Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.
Step2: For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.
Step3: Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
Step4: Select a random point as the new medoid and swap it with the previous medoid.
Repeat 2 and 3 steps.
Step5: If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.
Step6: If the total cost of the new medoid is greater than the cost of the previous
medoid, undo the swap and repeat step 4.
Step7: The Repetitions have to continue until no change is encountered with new
medoids to classify data points.

o Let us consider the example having data points (2, 6), (3, 4), (3, 8), (4,7), (6, 2), (6, 4),
(7, 3), (7, 4), (8, 5), (7, 6) naming x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 respectively.
o The number of clusters required K=2.
o Step1: we need to select 2 medoids.
 C1 =(3, 4)
 C2= (7, 4)

o Step2: we need to find Manhattan distance for each point to the medoid point
 The formula for Manhattan distance is
 d(p1, p2) = |x1 – x2| + |y1 – y2|

Point   X   Y   Dist. to C1(3, 4)   Dist. to C2(7, 4)   Cluster
X1 2 6 3 7 C1
X2 3 4 0 4 C1
X3 3 8 4 8 C1
X4 4 7 4 6 C1
X5 6 2 5 3 C2
X6 6 4 3 1 C2
X7 7 3 5 1 C2
X8 7 4 4 0 C2
X9 8 5 6 2 C2
X10 7 6 6 2 C2

 The obtained clusters are


C1= { (2, 6), (3, 4), (3, 8), (4, 7) }
C2= { (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6) }
 Now we need to find the total cost that is sum of all the cost of 2 clusters from their
respective medoids
 Total Cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 4), (6, 2)) +
   cost((7, 4), (6, 4)) + cost((7, 4), (7, 3)) + cost((7, 4), (8, 5)) + cost((7, 4), (7, 6))
Total Cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20
 Now we need to select one non-medoid point as a medoid point and recalculate the cost.
 Let the other medoid be (7, 3) such that the medoid (7, 4) is replaced with (7, 3) and again
calculate the total cost.
 Now the medoids are C1= (3, 4) and O=(7, 3)

Point   X   Y   Dist. to C1(3, 4)   Dist. to O(7, 3)   Cluster
X1 2 6 3 8 C1
X2 3 4 0 5 C1
X3 3 8 4 9 C1
X4 4 7 4 7 C1
X5 6 2 5 2 O
X6 6 4 3 2 O
X7 7 3 5 0 O
X8 7 4 4 1 O
X9 8 5 6 3 O
X10 7 6 6 3 O
 The obtained clusters are
C1= { (2, 6), (3, 4), (3, 8), (4, 7) }
C2= { (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6) }
 Total Cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 3), (6, 2)) +
   cost((7, 3), (6, 4)) + cost((7, 3), (7, 4)) + cost((7, 3), (8, 5)) + cost((7, 3), (7, 6))
Total Cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
 Now we need to find the value S, which is the difference between the new (swapped) cost and the previous cost.
   If S < 0, the swap reduces the cost, so we make the new medoid permanent and continue searching; otherwise we
   undo the swap and try another non-medoid point. The process stops when no swap lowers the total cost.
 S = current cost – previous cost = 22 – 20 > 0
 Swapping O with C2 is not a good idea therefore the final medoids are C1(3, 4) and C2(7,
4).
 The obtained clusters are
C1= {(2, 6), (3, 4), (3, 8), (4, 7)}
C2= {(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)}
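The cost computation that drives the swap test can be written directly. The sketch below is a minimal pure-Python illustration using the Manhattan distance and the ten points above.

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids, points):
    # each point is assigned to its nearest medoid; cost = sum of those distances
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

current = [(3, 4), (7, 4)]
print(total_cost(current, points))     # 20
candidate = [(3, 4), (7, 3)]           # try swapping (7, 4) with (7, 3)
print(total_cost(candidate, points))   # 22  -> S = 22 - 20 > 0, so the swap is rejected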

5.2.2 K-means Additional issues:

1. Handling Empty Clusters − One issue with the basic K-means algorithm is that empty clusters can be
   obtained if no points are allocated to a cluster during the assignment step. If this happens, a strategy
   is needed to choose a replacement centroid; otherwise, the squared error will be larger than necessary.
2. Outliers − When the squared-error criterion is used, outliers can unduly influence the clusters that are
   discovered. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not
   be as representative as they otherwise would be, and the SSE will be higher as well.
3. Reducing the SSE with Post-processing − An obvious way to reduce the SSE is to find more clusters, i.e.,
   to use a larger K. However, in many cases we would like to improve the SSE without increasing the number
   of clusters. This is often possible because K-means typically converges to a local minimum, which can be
   refined by post-processing steps such as splitting loose clusters and merging close ones.

5.2.3 Bisecting K-means clustering algorithm:


 Bisecting K-Means Algorithm is a modification of the K-Means algorithm. It is a hybrid
approach between partitional and hierarchical clustering. It can recognize clusters of any
shape and size.
 This algorithm is convenient because:
 It beats K-Means in entropy measurement.
 When K is big, bisecting k-means is more effective. Every data point in the data
collection and k centroids are used in the K-means method for computation. On the
other hand, only the data points from one cluster and two centroids are used in each
Bisecting stage of Bisecting k-means. As a result, computation time is shortened.
 While k-means is known to yield clusters of varied sizes, bisecting k-means results in
clusters of comparable sizes.

Algorithm for Bisecting K-Means:

Step1: Initialize the list of clusters to accommodate the cluster consisting of all points.

Step2: repeat

o Discard a cluster from the list of clusters.


o {Perform several “trial” bisections of the selected cluster.}
o for i = 1 to number of trials do
o Bisect the selected clusters using basic K-means.
o end for
o Select the 2 clusters from the bisection with the least total SSE.

Step3: Until the list of clusters contain ‘K’ clusters



Step 1: All points/objects/instances are put into one cluster.

Step 2: Apply K-means with K = 2 to bisect this cluster into two sub-clusters, say C1 and C2. If the required
number of clusters has not yet been reached, the sub-cluster with the higher SSE (the sum of squared distances
from its points to its centroid) is selected and split again, and so on, as sketched below.
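A minimal sketch of this procedure is shown below; it assumes scikit-learn's KMeans for the two-way splits, uses the SSE (inertia) to pick which cluster to bisect, and runs on hypothetical random 2-D data.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5):
    clusters = [X]                                     # start with one big cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE to split
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        best = None
        for _ in range(n_trials):                      # several trial bisections
            km = KMeans(n_clusters=2, n_init=1).fit(target)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        clusters.append(target[best.labels_ == 0])
        clusters.append(target[best.labels_ == 1])
    return clusters

X = np.random.RandomState(0).rand(60, 2)               # hypothetical 2-D data
for i, c in enumerate(bisecting_kmeans(X, k=3)):
    print('cluster', i, 'size', len(c))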

5.2.4 Strengths and Weakness of K-means:

Strengths:
 It is simple, highly flexible, and efficient. The simplicity of k-means makes it easy to
explain the results in contrast to Neural Networks.
 The flexibility of k-means allows for easy adjustment if there are problems.
 The efficiency of k-means implies that the algorithm is good at segmenting a dataset.
 An instance can change cluster (move to another cluster) when the centroids are
recomputed
 Easy to interpret the clustering results.

Weakness:
 It does not guarantee the most optimal set of clusters, and the number of clusters must be decided before
the analysis. How many clusters to include is left to the discretion of the researcher.
 Choosing K involves a combination of common sense, domain knowledge, and statistical tools. Too many
clusters tell you little, because the groups become very small and there are too many of them.

 There are statistical tools that measure within-group homogeneity and group
heterogeneity. There are methods like the elbow method to decide the value of k.
additionally, there is a technique called a dendrogram. The results of a dendrogram
analysis provide a recommendation of how many clusters to use. However,
calculating a dendrogram for a large dataset could potentially crash a computer due to
the computational load and the limits of RAM.
 When doing the analysis, the k-means algorithm randomly selects several different starting points from
which to develop clusters. This can be good or bad depending on where the algorithm happens to begin. From
there, the centres of the clusters are recalculated until adequate centres are found for the number of
clusters requested.
 The order of the data has an impact on the final results.

5.3 Agglomerative Hierarchical Clustering


A Hierarchical clustering method works via grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly
executes the subsequent steps:
 Identify the 2 clusters which can be closest together, and
 Merge the 2 maximum comparable clusters. We need to continue these steps until all
the clusters are merged together.

There are two basic approaches for generating a hierarchical clustering:


 Agglomerative: Start with the points as individual clusters and, at each step, merge
the closest pair of clusters. This requires defining a notion of cluster proximity.
 Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until
only singleton clusters of individual points remain. In this case, we need to decide
which cluster to split at each step and how to do the splitting.
In Hierarchical Clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits)
graphically represents this hierarchy; it is an inverted tree that describes the order in which
points are merged (bottom-up) or clusters are broken up (top-down).


Strengths
 It is simple to implement and gives the best output in some cases.
 It is easy and results in a hierarchy, a structure that contains more information.
 It does not need us to pre-specify the number of clusters.

Weakness
 It breaks the large clusters.
 It is Difficult to handle different sized clusters and convex shapes.
 It is sensitive to noise and outliers.
 Once a merge or split has been performed, it can never be undone.

5.3.1 Basic Agglomerative Hierarchical Clustering Algorithm


Agglomerative clustering is a bottom-up approach. It starts clustering by treating the
individual data points as a single cluster then it is merged continuously based on similarity
until it forms one big cluster containing all objects.

Basic algorithm
Step-1 : Compute the proximity matrix
Step-2 : Let each data point be a cluster
Step-3 : Repeat
 Merge the two closest clusters
 Update the proximity matrix
Step-4 : Until only a single cluster remains

The above steps can be illustrated with the following six-point example (the pictorial representation is omitted here).

Different approaches to defining the distance between clusters distinguish the different
algorithms.
Step 1: Consider the points (1,2,3,4,5,6) as an individual cluster and find the distance
between the individual cluster from all other clusters.
Step 2: Now, merge the comparable clusters into a single cluster. As the clusters (2&3) and (4&5) are similar
to each other, we can merge them in the second step. Finally, we get the clusters [(1), (2,3), (4,5), (6)].
Step 3: Here, we recalculate the proximity as per the algorithm and combine the two closest
clusters [(4,5), (6)] together to form new clusters as [(1), (2,3), (4,5,6)]
Step 4: Repeat the same process. The clusters (4,5,6) and (2,3) are comparable and combined
together to form a new cluster. Now we have [(1), (2,3,4,5,6)].
Step 5: Finally, the remaining two clusters are merged together to form a single cluster
[(1,2,3,4,5,6)]

Defining Proximity between Clusters


The key operation of Algorithm is the computation of the proximity between two clusters,
and it is the definition of cluster proximity that differentiates the various agglomerative
hierarchical techniques. Cluster proximity is typically defined with a particular type of
cluster in mind. Many agglomerative hierarchical clustering techniques, such as MIN, MAX,
and Group Average, come from a graph-based view of clusters.
MIN : It defines the cluster proximity as the proximity between the closest two points that
are in different clusters, or using graph terms, the shortest edge between two nodes in
different subsets of nodes.
MAX: It defines the cluster proximity between the farthest two points in different clusters to
be the cluster proximity, or using graph terms, the longest edge between two nodes in
different subsets of nodes.
Group average: It defines cluster proximity to be the average pairwise proximities (average
length of edges) of all pairs of points from different clusters.

Time and Space Complexity

Space complexity
The space required for the Hierarchical clustering Technique is very high when the number
of data points is high as we need to store the similarity matrix in the RAM. The space
complexity is the order of the square of n.
Space complexity = O(n²) where n is the number of data points.

Time complexity
Since we’ve to perform n iterations and in each iteration, we need to update the similarity
matrix and restore the matrix, the time complexity is also very high. The time complexity is
the order of the cube of n.
Time complexity = O(n³) where n is the number of data points.
Complexity can be reduced to O(n² log n) time with some cleverness.

Single Link or MIN

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined as the minimum of the distance between any two points in the two different clusters.
Using graph terminology, if you start with all points as singleton clusters and add links
between points one at a time, shortest links first, and then these single links combine the
points into clusters.
Let us consider the example
Point a b
P1 0.07 0.83
P2 0.85 0.14
P3 0.66 0.89
P4 0.49 0.64
P5 0.80 0.46

Step-1: Compute the distance matrix using d((x, y), (a, b)) = √((x − a)² + (y − b)²).
So we have to find the Euclidean distance between every pair of points; we first find the
Euclidean distance between P1 and P2:
d(P1, P2) = √((0.07 − 0.85)² + (0.83 − 0.14)²) = 1.04139
P1 P2 P3 P4 P5
P1 0 1.0413
P2 1.0413 0
P3 0
P4 0
P5 0

Similarly, calculate the values in the proximity matrix


d(P3, P1) = √((0.66 − 0.07)² + (0.89 − 0.83)²) = 0.59304
d(P3, P2) = √((0.66 − 0.85)² + (0.89 − 0.14)²) = 0.77369
d(P4, P1) = √((0.49 − 0.07)² + (0.64 − 0.83)²) = 0.46098
d(P4, P2) = √((0.49 − 0.85)² + (0.64 − 0.14)²) = 0.61612
d(P4, P3) = √((0.49 − 0.66)² + (0.64 − 0.89)²) = 0.30232
d(P5, P1) = √((0.80 − 0.07)² + (0.46 − 0.83)²) = 0.81841
d(P5, P2) = √((0.80 − 0.85)² + (0.46 − 0.14)²) = 0.32388
d(P5, P3) = √((0.80 − 0.66)² + (0.46 − 0.89)²) = 0.45222
d(P5, P4) = √((0.80 − 0.49)² + (0.46 − 0.64)²) = 0.35847

P1 P2 P3 P4 P5
P1 0 1.0413 0.59304 0.46098 0.81841
P2 1.0413 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0

Step-2: Merging the two closest members of the two clusters and finding the minimum
element in distance matrix.
Here the minimum value is 0.30232 and hence we combine P3 and P4. Now, form clusters of
elements corresponding to the minimum value and update the proximity matrix. To update
the proximity matrix:
min ((P3, P4), P1) = min ((P3, P1), (P4, P1)) = min (0.59304, 0.46098) = 0.46098
min ((P3, P4), P2) = min ((P3, P2), (P4, P2)) = min (0.77369, 0.61612) = 0.61612
min ((P3, P4), P5) = min ((P3, P5), (P4, P5)) = min (0.45222, 0.35847) = 0.35847
Now we will update the proximity Matrix:
P1 P2 P3, P4 P5
P1 0 1.0413 0.46098 0.81841
P2 1.0413 0 0.61612 0.32388
P3, P4 0.46098 0.61612 0 0.35847
P5 0.81841 0.32388 0.35847 0

Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
min ((P2, P5), P1) = min ((P2, P1), (P5, P1)) = min (1.04139, 0.81841) = 0.81841
min ((P2, P5), (P3, P4)) = min ((P2, (P3, P4)), (P5, (P3, P4))) = min (0.61612, 0.35847) =
0.35847
update the proximity Matrix:
P1 P2, P5 P3, P4
P1 0 0.81841 0.46098
P2, P5 0.81841 0 0.35847
P3, P4 0.46098 0.35847 0

The next minimum value is 0.35847 and hence we combine (P2, P5) and (P3, P4).
min((P2, P5, P3, P4), P1) = min( ((P3, P4), P1), ((P2, P5), P1) ) = min(0.46098, 0.81841) = 0.46098
P1 P2, P5, P3, P4
P1 0 0.46098
P2, P5, P3, P4 0.46098 0

Finally the cluster (P3, P4, P2, P5) is merged with the datapoint P1

 The single link technique is good at handling non-elliptical shapes, but is sensitive to
noise and outliers.
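The merge sequence above can be verified with SciPy's hierarchical clustering routines (assuming SciPy is installed).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# P1..P5 from the example above.
X = np.array([[0.07, 0.83], [0.85, 0.14], [0.66, 0.89],
              [0.49, 0.64], [0.80, 0.46]])

# Single-link (MIN) agglomerative clustering; each row of Z records one merge:
# [cluster_i, cluster_j, distance, size of the new cluster]
Z = linkage(X, method='single', metric='euclidean')
print(Z)
# The first merge joins P3 and P4 (indices 2 and 3) at distance 0.30232,
# matching the hand computation. Other linkages can be tried by changing
# method to 'complete' or 'average'.

# Cutting the dendrogram into two clusters:
print(fcluster(Z, t=2, criterion='maxclust'))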

Complete Link or MAX or CLIQUE


For the complete link or MAX version of hierarchical clustering, the proximity of two
clusters is defined as the maximum of the distance between any two points in the two
different clusters.
Using graph terminology, if you start with all points as singleton clusters and add links
between points one at a time, shortest links first, then a group of points is not a cluster until
all the points in it are completely linked.
Let us consider the example:
Point a b
P1 0.07 0.83
P2 0.85 0.14
P3 0.66 0.89
P4 0.49 0.64
P5 0.80 0.46

Step-1: Compute the distance matrix using d((x, y), (a, b)) = √((x − a)² + (y − b)²).
So we have to find the Euclidean distance between every pair of points; we first find the
Euclidean distance between P1 and P2:
d(P1, P2) = √((0.07 − 0.85)² + (0.83 − 0.14)²) = 1.04139
P1 P2 P3 P4 P5
P1 0 1.04139
P2 1.04139 0
P3 0
P4 0
P5 0

Similarly, calculate the values in the proximity matrix


d(P3, P1) = √((0.66 − 0.07)² + (0.89 − 0.83)²) = 0.59304
d(P3, P2) = √((0.66 − 0.85)² + (0.89 − 0.14)²) = 0.77369
d(P4, P1) = √((0.49 − 0.07)² + (0.64 − 0.83)²) = 0.46098
d(P4, P2) = √((0.49 − 0.85)² + (0.64 − 0.14)²) = 0.61612
d(P4, P3) = √((0.49 − 0.66)² + (0.64 − 0.89)²) = 0.30232
d(P5, P1) = √((0.80 − 0.07)² + (0.46 − 0.83)²) = 0.81841
d(P5, P2) = √((0.80 − 0.85)² + (0.46 − 0.14)²) = 0.32388
d(P5, P3) = √((0.80 − 0.66)² + (0.46 − 0.89)²) = 0.45222
d(P5, P4) = √((0.80 − 0.49)² + (0.46 − 0.64)²) = 0.35847

P1 P2 P3 P4 P5
P1 0 1.04139 0.59304 0.46098 0.81841
P2 1.04139 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0

Step-2: Merge the two closest clusters by finding the minimum element in the distance matrix.
Here the minimum value is 0.30232 and hence we combine P3 and P4. Now, form a cluster of the
elements corresponding to the minimum value and update the proximity matrix using the maximum
(complete link) of the distances. To update the proximity matrix:
max ((P3, P4), P1) = max ((P3, P1), (P4, P1)) = max (0.59304, 0.46098) = 0.59304
max ((P3, P4), P2) = max ((P3, P2), (P4, P2)) = max (0.77369, 0.61612) = 0.77369
max ((P3, P4), P5) = max ((P3, P5), (P4, P5)) = max (0.45222, 0.35847) = 0.45222
Now we will update the proximity Matrix:
P1 P2 P3, P4 P5
P1 0 1.04139 0.59304 0.81841
P2 1.0413 0 0.77369 0.32388
P3, P4 0.59304 0.77369 0 0.45222
P5 0.81841 0.32388 0.45222 0

Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
max ((P2, P5), P1) = max ((P2, P1), (P5, P1)) = max (1.04139, 0.81841) = 1.04139
max ((P2, P5), (P3, P4)) = max ((P2, (P3, P4)), (P5, (P3, P4))) = max (0.77369, 0.45222) = 0.77369. Then, update the proximity matrix:
P1 P2, P5 P3, P4
P1 0 1.04139 0.59304
P2, P5 1.04139 0 0.77369
P3, P4 0.59304 0.77369 0

The next minimum value is 0.59304 and hence we combine P1 and (P3, P4).
max ((P1, P3, P4), (P2, P5)) = max (((P3, P4), (P2, P5)), (P1, (P2, P5)))
= max (0.77369, 1.04139) = 1.04139
Finally the cluster (P1, P3, P4) is merged with the cluster (P2, P5).

P1, P3, P4 P2, P5


P1, P3, P4 0 1.04139
P2, P5 1.04139 0

Complete link is less susceptible to noise and outliers, but it can break large clusters and it
favors globular shapes.


Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is
defined as the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches.
Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which
are of size mi and mj respectively, is expressed by the following equation:

proximity(Ci, Cj) = ( Σ x∈Ci, y∈Cj proximity(x, y) ) / (mi × mj)

Average linkage looks at the distances between all pairs of points drawn from the two clusters and
averages all of these distances. This is also called the Unweighted Pair Group Method with
Arithmetic mean (UPGMA).
Let us consider the example
Point a b
P1 0.07 0.83
P2 0.85 0.14
P3 0.66 0.89
P4 0.49 0.64
P5 0.80 0.46

Step-1: Compute the distance matrix using d((x, y), (a, b)) = √((x − a)² + (y − b)²).
So we have to find the Euclidean distance between every pair of points.
P1 P2 P3 P4 P5
P1 0 1.04139 0.59304 0.46098 0.81841
P2 1.04139 0 0.77369 0.61612 0.32388
P3 0.59304 0.77369 0 0.30232 0.45222
P4 0.46098 0.61612 0.30232 0 0.35847
P5 0.81841 0.32388 0.45222 0.35847 0

Step-2: Merge the two closest clusters by finding the minimum element in the distance matrix.
Here the minimum value is 0.30232 and hence we combine P3 and P4. Now, form a cluster of the
elements corresponding to the minimum value and update the proximity matrix using the average
(group average) of the distances. To update the proximity matrix:
avg ((P3, P4), P1) = avg ((P3, P1), (P4, P1)) = avg (0.59304, 0.46098) = 0.52701
avg ((P3, P4), P2) = avg ((P3, P2), (P4, P2)) = avg (0.77369, 0.61612) = 0.69490
avg ((P3, P4), P5) = avg ((P3, P5), (P4, P5)) = avg (0.45222, 0.35847) = 0.40534
P1 P2 P3, P4 P5
P1 0 1.04139 0.52701 0.81841
P2 1.0413 0 0.69490 0.32388
P3, P4 0.52701 0.69490 0 0.40534
P5 0.81841 0.32388 0.40534 0

Now we will repeat the same process. The next minimum value is 0.32388 and hence we
combine P2 and P5.
avg ((P2, P5), P1) = avg ((P2, P1), (P5, P1)) = avg (1.04139, 0.81841) = 0.9299
avg ((P2, P5), (P3, P4)) = average of all pairwise distances between the two clusters
= (0.77369 + 0.61612 + 0.45222 + 0.35847) / 4 = 0.55012
Update the proximity Matrix:
P1 P2, P5 P3, P4
P1 0 0.9299 0.52701
P2, P5 0.9299 0 0.55012
P3, P4 0.52701 0.55012 0

The next minimum value is 0.55012 and hence we combine (P2, P5) and (P3, P4).
avg ((P2, P5, P3, P4), P1) = avg (((P3, P4), P1), ((P2, P5), P1)) = avg (0.52701, 0.9299) = 0.72846
(both sub-clusters contain two points, so this simple average equals the average of all pairwise distances to P1)
P1 P2, P5, P3, P4
P1 0 0.72846
P2, P5, P3, P4 0.72846 0

Finally the cluster (P2, P5, P3, P4) is merged with the datapoint P1


Ward’s Method and Centroid Methods

 For Ward’s method, the proximity between two clusters is defined as the increase in
the squared error that results when two clusters are merged. Thus, this method uses
the same objective function as K-means clustering.
 While it might seem that this feature makes Ward’s method somewhat distinct from
other hierarchical techniques, it can be shown mathematically that Ward’s method is
very similar to the group average method when the proximity between two points is
taken to be the square of the distance between them.

5.4 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


 Density-based clustering locates regions of high density that are separated from one
another by regions of low density. DBSCAN is a simple and effective density-based
clustering algorithm.

5.4.1 Traditional Density: Centre - Based Approach


 Density-Based Clustering refers to one of the most popular unsupervised learning
methodologies used in model building and machine learning algorithms.
 The data points in the region separated by two clusters of low point density are
considered as noise.
 The surroundings with a radius ε of a given object are known as the ε neighborhood of
the object.
 If the ε neighborhood of the object comprises at least a minimum number, MinPts
(minimum points) of objects, then it is called a core object.

Centre-based density, MinPts = 7 (figure omitted)

Classification of Points According to Centre-Based Density

 The centre-based approach to density allows us to classify a point as being


o in the interior of a dense region (a core point),
o on the edge of a dense region (a border point), or
o in a sparsely occupied region (a noise or background point).

The concepts of core, border, and noise points using a collection of two-dimensional points.
Core points:
A point is a core point if it has at least a specified number of points (MinPts) within Eps
 These are points that are at the interior of a cluster
 Counts the point itself

Border points:
A border point is not a core point, but falls within the neighborhood of a core point...
Noise points:
A noise point is any point that is neither a core point nor a border point.

5.4.2 The DBSCAN Algorithm


 Any two core points that are close enough within a distance Eps of one another are put in
the same cluster.
 Likewise, any border point that is close enough to a core point is put in the same cluster
as the core point.
 Noise points are discarded.
 This algorithm uses the same concepts and finds the same clusters as the original
DBSCAN, but is optimized for simplicity, not efficiency.

DBSCAN algorithm.
Step-1 : Label all points as core, border, or noise points.
Step-2 : Eliminate noise points.
Step-3 : Put an edge between all core points within a distance Eps of each other.
Step-4 : Make each group of connected core points into a separate cluster.
Step-5 : Assign each border point to one of the clusters of its associated core points

Let us consider an example: create the clusters with MinPts = 4 and ε = 1.9.
Data Points: P1(3,7), P2(4,6), P3(5,5), P4(6,4), P5(7,3), P6(6,2), P7(7,2), P8(8,4),
P9(3,3), P10(2,6), P11(3,5), P12(2,4)
Step-1: Calculate the distance between each pair of points using the Euclidean distance
d(A(x1, y1), B(x2, y2)) = √((x2 − x1)² + (y2 − y1)²)

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12


P1 0
P2 1.41 0
P3 2.83 1.41 0
P4 4.24 2.83 1.41 0
P5 5.66 4.24 2.83 1.41 0
P6 5.83 4.47 3.16 2.00 1.41 0
P7 6.40 5.00 3.61 2.24 1.00 1.00 0
P8 5.83 4.47 3.16 2.00 1.41 2.83 2.24 0
P9 4.00 3.16 2.83 3.16 4.00 3.16 4.12 5.10 0
P10 1.41 2.00 3.16 4.47 5.83 5.66 6.40 6.32 3.16 0
P11 2.00 1.41 2.00 3.16 4.47 4.24 5.00 5.10 2.00 1.41 0
P12 3.16 2.83 3.16 4.00 5.10 4.47 5.39 6.00 1.41 2.00 1.41 0

Step - 2 : Map the points where the distances are less than ε=1.9
P1 : P2, P10
P2 : P1, P3, P11
P3 : P2, P4
P4 : P3, P5
P5 : P4, P6, P7, P8
P6 : P5, P7

P7 : P5, P6
P8 : P5
P9 : P12
P10 : P1, P11
P11 : P2, P10, P12
P12 : P9, P11

Step-3: Identify the core, border, and noise points (MinPts = 4, counting the point itself):

Point Status
P1 Border
P2 Core
P3 Border
P4 Border
P5 Core
P6 Border
P7 Border
P8 Border
P9 Noise
P10 Border
P11 Core
P12 Border

Step-4: Put an edge between core points that are within Eps of each other. P2 and P11 are within Eps of each
other (distance 1.41 ≤ 1.9), so they belong to the same cluster; P5 is not within Eps of any other core point,
so it starts a cluster of its own.

Step-5: Identify the clusters by assigning each border point to the cluster of one of its associated core points:
Cluster-1 = {P2, P11} together with the border points P1, P3, P10, P12
Cluster-2 = {P5} together with the border points P4, P6, P7, P8
P9 is a noise point and is discarded.
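The same labelling can be obtained with scikit-learn, whose min_samples parameter, like MinPts above, counts the point itself. A minimal sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.cluster import DBSCAN

# P1..P12 from the example above.
X = np.array([[3, 7], [4, 6], [5, 5], [6, 4], [7, 3], [6, 2],
              [7, 2], [8, 4], [3, 3], [2, 6], [3, 5], [2, 4]], dtype=float)

db = DBSCAN(eps=1.9, min_samples=4).fit(X)
print("labels:", db.labels_)                            # -1 marks noise (P9)
print("core point indices:", db.core_sample_indices_)   # P2, P5, P11 -> 1, 4, 10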

Time and Space Complexity


 Time Complexity of the DBSCAN algorithm is O(n log n) when a spatial index is used (O(n²) in the worst case).
 Space Complexity of the DBSCAN algorithm is O(n).

Determining MinPts and Eps


Now, to determine correct values for the parameters MinPts and Eps:
MinPts: It helps us in removing the outliers. The rule of thumb is that MinPts should always be greater
than the dimensionality of your dataset; typically MinPts = 2 × (dimensionality of the data).
Choose a larger value of MinPts if your dataset is noisy.
Eps: Suppose the value of MinPts is 4. For each point xi, compute the distance from xi to its 4th nearest
neighbour, then plot these sorted distances in increasing order; Eps is usually chosen at the "knee"
(elbow) of this curve.

5.4.3 Strengths and Weaknesses:

Strengths
 DBSCAN works very well when there is a lot of noise in the dataset.
 It can handle clusters of different shapes and sizes.
 We need not specify the no. of clusters just like any other clustering algorithm.
 We just need two parameters MinPts and Eps which can be set by a domain expert.

Weakness
 If the dataset contains clusters of very different densities, the algorithm fails to give an accurate result.
 It is very sensitive to its hyper-parameters (MinPts and Eps).
 If you have high-dimensional data and use metrics like the Euclidean distance, you quickly run into the
problem of the curse of dimensionality.
 If domain expertise is lacking and the data is not well understood, it is very difficult to find optimal
values for MinPts and Eps.

*****

♫ ALL THE BEST ♫
