Data Warehousing and Mining Complete Notes
DATA WAREHOUSING
1
DATA WAREHOUSING (Unit - I)
2
Data Warehouse Overview
3
What is Data Warehouse?
■ Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use their data
to make strategic decisions.
■ Data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases.
4
Data Warehouse—Subject-Oriented
5
Data Warehouse—Integrated
■ Constructed by integrating multiple, heterogeneous data
sources
■ relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
6
Data Warehouse—Time Variant
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse contains an
element of time, explicitly or implicitly. But the key of
operational data may or may not contain “time element”
7
Data Warehouse—Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
8
OLTP vs OLAP
Database design:
■ OLTP: ER data model (application-oriented database design)
■ OLAP: star or snowflake model (subject-oriented database design)
9
Data Warehouse Architecture
15
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS
16
Why Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: decision support requires historical data, which operational databases do not typically maintain
■ Bottom Tier:
■ Warehouse Database server
■ data extraction
■ by using API gateways(ODBC, JDBC & OLEDB)
■ cleaning
■ transformation
■ load & refresh
19
Data Warehousing: A Multitiered Architecture
23
Data Warehousing: A Multitiered Architecture
Metadata Repository:
metadata are the data that define warehouse
objects
It consists of:
1) Data warehouse structure
2) Operational metadata
3) algorithms used for summarization
4) Mapping from the operational environment to
the data warehouse
5) Data related to system performance
6) Business metadata
25
Data Warehousing: A Multitiered Architecture
Metadata Repository:
■ data warehouse structure
i) warehouse schema,
ii) view, dimensions,
iii) hierarchies, and
iv) derived data definitions,
v) data mart locations and contents.
■ Operational metadata
i) data lineage (history of migrated data and the
sequence of transformations applied to it),
ii) currency of data (active, archived, or purged),
iii) monitoring information (warehouse usage
statistics, error reports, and audit trails).
26
Data Warehousing: A Multitiered Architecture
Metadata Repository:
■ The algorithms used for summarization,
27
Data Warehousing: A Multitiered Architecture
Metadata Repository:
1) Mapping from the operational environment to the data warehouse
i) source databases and their contents,
ii) gateway descriptions,
iii) data partitions,
iv) data extraction, cleaning, and transformation rules and defaults,
v) data refresh and purging rules, and
vi) security (user authorization and access control).
28
Data Warehousing: A Multitiered Architecture
Metadata Repository:
■ Data related to system performance
■ indices and profiles that improve data access and
retrieval performance,
■ rules for the timing and scheduling of refresh,
update, and replication cycles.
■ Business metadata,
■ business terms and definitions,
■ data ownership information, and
■ charging policies
29
A Multidimensional Data Model
30
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
■ A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
■ Dimensions are the perspectives or entities with
respect to which an organization wants to keep
records.
■ Example:-
■ AllElectronics may create a sales data warehouse
32
Data Cube: A Multidimensional Data Model
■ A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
33
Data Cube: A Multidimensional Data Model
34
Data Cube: A Multidimensional Data Model
■ AllElectronics sales data for items sold per quarter in the city of Vancouver.
■ a simple 2-D data cube that is a table or spreadsheet for sales data from
AllElectronics
35
Data Cube: A Multidimensional Data Model
The 3-D data in the table are represented as a series of 2-D tables
36
Data Cube: A Multidimensional Data Model
we may also represent the same data in the form of a 3D data cube
37
Data Cube: A Multidimensional Data Model
38
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, and supplier, from the 0-D (apex) cuboid "all" at the top, through the 1-D, 2-D, and 3-D cuboids (e.g., (time, item, location), (time, item, supplier), (item, location, supplier), (time, location, supplier)), down to the 4-D (base) cuboid (time, item, location, supplier).]
39
■ In data warehousing literature, an n-D base cube is called a base
cuboid.
40
Schemas for Multidimensional Data Models
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
41
Schemas for Multidimensional Data Models
43
Snow flake schema
44
Snowflake schema
45
Fact Constellation
46
Fact Constellation: this schema specifies two fact tables, sales and shipping.
47
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
■ Just as relational query languages like SQL can be used
to specify relational queries, a data mining query
language (DMQL) can be used to specify data mining
tasks.
48
Syntax for Cube and Dimension
Definition in DMQL
■ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
■ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
■ Special Case (Shared Dimension Tables)
■ The dimension is defined the first time as part of a "cube definition"; later cubes reuse it:
define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>
49
Defining Star Schema in DMQL
50
Defining Snowflake Schema in DMQL
51
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
52
Concept Hierarchies
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 53
Concept Hierarchies
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 54
Concept Hierarchies
55
Measures of Data Cube: Three
Categories
58
Fig. 3.10 Typical OLAP
Operations
59
Typical OLAP Operations
■ Roll-up (drill-up): summarize data
■ by climbing up a concept hierarchy, or
■ by dimension reduction
■ Drill-down: reverse of roll-up
■ from a higher-level summary to a lower-level summary or detailed data, or
■ by introducing new dimensions
■ Slice: selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube.
■ Dice: selects a sub-cube from the OLAP cube by selecting on two or more dimensions.
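As an illustration (not part of the original slides), the four operations can be mimicked on a flat sales table with pandas; the DataFrame and its column names below are assumptions.

import pandas as pd

# Toy sales fact table; the columns are illustrative assumptions.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item":    ["TV", "TV", "Phone", "TV"],
    "dollars_sold": [100, 150, 200, 120],
})

# Roll-up: climb the time hierarchy (quarter -> year), reducing a dimension.
rollup = sales.groupby(["year", "city"])["dollars_sold"].sum()

# Drill-down: reverse of roll-up, back to the finer quarter level.
drilldown = sales.groupby(["year", "quarter", "city"])["dollars_sold"].sum()

# Slice: fix a single dimension value (city = "Vancouver").
slice_ = sales[sales["city"] == "Vancouver"]

# Dice: select on two or more dimensions.
dice = sales[(sales["city"] == "Vancouver") & (sales["item"] == "TV")]

print(rollup, drilldown, slice_, dice, sep="\n\n")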
67
A Star-Net Query Model
68
A Star-Net Query Model
■ Four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time,
respectively
■ footprints representing abstraction levels of the
dimension - time line has four footprints: “day,”
“month,” “quarter,” and “year.”
■ Concept hierarchies can be used to generalize data by
replacing low-level values (such as “day” for the time
dimension) by higher-level abstractions (such as “year”)
or
■ to specialize data by replacing higher-level abstractions
with lower-level values.
69
Data Warehouse Design and Usage
70
Data Warehouse Design and Usage
■ Four different views regarding a data warehouse design
must be considered:
■ Top-down view
■ allows the selection of the relevant information
necessary for the data warehouse (matches current
and future business needs).
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems.
■ Documented at various levels of detail and accuracy,
from individual data source tables to integrated data
source tables.
■ Modeled in ER model or CASE (computer-aided
software engineering).
71
Data Warehouse Design and Usage
■ Data warehouse view
■includes fact tables and dimension tables.
■It represents the information that is stored inside the
data warehouse, including
■precalculated totals and counts,
■information regarding the source, date, and time
of origin, added to provide historical context.
■ Business query view
■is the data perspective in the data warehouse from
the end-user’s viewpoint.
72
Data Warehouse Design and Usage
■ Skills required to build & use a Data warehouse
■ Business Skills
■ how systems store and manage their data,
■ how to build extractors (operational DBMS to DW)
■ how to build warehouse refresh software(update)
■ Technology skills
■ the ability to discover patterns and trends,
■ to extrapolate trends based on history and look
for anomalies or paradigm shifts, and
■ to present coherent managerial recommendations
based on such analysis.
■ Program management skills
■ Interface with many technologies, vendors, and end-users in order to deliver results in a timely and cost-effective manner
Data Warehouse Design and Usage
Data Warehouse Design Process
■ A data warehouse can be built using
■ Top-down approach (overall design and planning)
■ It is useful in cases where the technology is mature and well known
■ Bottom-up approach (starts with experiments & prototypes)
■ a combination of both
■ From a software engineering point of view, either the waterfall model or the spiral model can be used:
■ Waterfall model: structured and systematic analysis at each step, one step leading to the next: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse.
■ Spiral model: rapid generation of increasingly functional systems, with short intervals between successive releases; turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner, making it a good choice for data warehouse development.
74
Data Warehouse Design and Usage
Data Warehouse Design Process
■4 major Steps involved in Warehouse design are:
■1. Choose a business process to model (e.g., orders,
invoices, shipments, inventory, account administration,
sales, or the general ledger).
■Data warehouse model - If the business process is
organizational and involves multiple complex object
collections
■Data mart model - if the process is departmental and
focuses on the analysis of one kind of business
process
75
Data Warehouse Design and Usage
76
Data Warehouse Design and Usage
Data Warehouse Usage for Information Processing
■ Evolution of DW takes place throughout a number of
phases.
■ Initial Phase - DW is used for generating reports and
answering predefined queries.
■ Progressively - to analyze summarized and detailed data,
(results are in the form of reports and charts)
■ Later - for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-
dice operations.
■ Finally - for knowledge discovery and strategic decision
making using data mining tools.
77
Data Warehouse Implementation
78
Data warehouse implementation
Key implementation issues: efficient data cube computation, OLAP data indexing, and efficient OLAP query processing.
79
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
Example 4.6
■ Create a data cube for AllElectronics sales that contains the following: city, item, year, and sales_in_dollars.
81
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
82
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
83
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
84
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ zero-dimensional operation:
■ An SQL query containing no group-by
■ Example - “compute the sum of total sales”
■ one-dimensional operation:
■ An SQL query containing one group-by
■ Example - “compute the sum of sales group-by city”
85
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ The data cube could be defined as:
"define cube sales_cube [city, item, year]: sum(sales_in_dollars)"
■ For a cube with n dimensions, there are 2^n cuboids.
■ The statement "compute cube sales_cube" computes the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.
■ In OLAP, different queries need to access different cuboids.
■ Precomputation: compute in advance all or at least some of the cuboids in a data cube.
■ Curse of dimensionality: the required storage space may explode if all the cuboids in a data cube are precomputed, especially as the number of dimensions grows.
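A rough sketch (not from the slides) of what "compute cube" has to enumerate: all 2^n group-bys of the n dimensions. The DataFrame and dimension names are assumptions for illustration.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "city": ["Vancouver", "Vancouver", "Toronto"],
    "item": ["TV", "Phone", "TV"],
    "year": [2023, 2024, 2024],
    "sales_in_dollars": [100, 200, 150],
})

dims = ["city", "item", "year"]
cuboids = {}
# All 2**n subsets of the dimensions, including the empty (apex) cuboid.
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        if subset:
            cuboids[subset] = sales.groupby(list(subset))["sales_in_dollars"].sum()
        else:
            cuboids[()] = sales["sales_in_dollars"].sum()   # 0-D apex cuboid

print(len(cuboids))   # 8 cuboids for n = 3 dimensions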
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ 2^n cuboids, when no concept hierarchy is associated with any dimension
87
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ 3 factors to consider:
■ (1) identify the subset of cuboids or subcubes to
materialize;
■ (2) exploit the materialized cuboids or subcubes
during query processing; and
■ (3) efficiently update the materialized cuboids or
subcubes during load and refresh.
89
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
91
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index
Index structures - To facilitate efficient data accessing
■ Bitmap indexing method - it allows quick searching in
data cubes.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the attribute’s
domain.
■ If a given attribute’s domain consists of n values, then n
bits are needed for each entry in the bitmap index (i.e.,
there are n bit vectors).
■ If the attribute has the value v for a given row in the
data table, then the bit representing that value is set to 1
in the corresponding row of the bitmap index. All other
bits for that row are set to 0.
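A minimal sketch of the idea in Python (not from the slides); the rows and attribute values are made up for illustration.

# One bit vector per distinct value v in the attribute's domain.
rows = ["V", "C", "V", "T", "C"]      # attribute value for each row, e.g., city codes
domain = sorted(set(rows))

bitmap = {v: [1 if r == v else 0 for r in rows] for v in domain}
for v, bits in bitmap.items():
    print(v, bits)
# A selection such as city = "V" now scans one bit vector instead of the table.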
92
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index
94
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Traditional indexing maps the value in a given
column to a list of rows having that value.
■ Join indexing registers the joinable rows of
two relations from a relational database.
■ For example,
■ two relations - R(RID, A) and S(B, SID)
■ join on the attributes A and B,
■ join index record contains the pair (RID, SID),
■ where RID and SID are record identifiers from
the R and S relations, respectively
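A small illustrative sketch (assumed data, not from the slides) of building such a join index as (RID, SID) pairs:

R = [(1, "x"), (2, "y"), (3, "x")]        # (RID, A)
S = [("x", 10), ("z", 11), ("y", 12)]     # (B, SID)

# Join index: record identifiers of rows joinable on A = B.
join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]
print(join_index)   # [(1, 10), (2, 12), (3, 10)]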
95
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Advantage:-
■ Identification of joinable tuples without performing
costly join operations.
■ Useful:-
■ To maintain the relationship between a foreign
key(fact table) and its matching primary
keys(dimension table), from the joinable relation.
■ Indexing maintains relationships between attribute
values of a dimension (e.g., within a dimension table)
and the corresponding rows in the fact table.
■ Composite join indices: Join indices with multiple
dimensions.
96
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Example:-Star Schema
■ “sales_star [time, item, branch, location]: dollars_sold
= sum (sales_in_dollars).”
■ join index is relationship between
■ Sales fact table and
■ the location, item dimension tables
99
Data warehouse implementation:
Efficient processing of OLAP queries
Example:
■ Define a data cube for AllElectronics of the form "sales_cube [time, item, location]: sum(sales_in_dollars)".
■ Dimension hierarchies:
■ "day < month < quarter < year" for time;
■ "street < city < province_or_state < country" for location.
■ Query: compute the sum of sales grouped by {brand, province_or_state}, with a selection constant on a particular year.
101
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
103
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
105
From Data Warehousing to Data Mining
Data Warehouse Usage
■Data warehouses and data marts are used in a
wide range of applications.
■ Business executives use the data in data warehouses
and data marts to perform data analysis and make
strategic decisions.
■ data warehouses are used as an integral part of a
plan-execute-assess “closed-loop” feedback
system for enterprise management.
■ Data warehouses are used extensively in banking and
financial services, consumer goods and retail
distribution sectors, and controlled manufacturing,
such as demand-based production.
106
Data Warehouse Usage
■ There are three kinds of data warehouse
applications:
■information processing
■analytical processing
■data mining
107
Data Warehouse Usage
110
Architecture for On-Line Analytical
Mining
■ An OLAM server performs analytical mining in data
cubes in a similar manner as an OLAP server performs
on-line analytical processing.
■ An integrated OLAM and OLAP architecture is shown in
Figure, where the OLAM and OLAP servers both accept
user on-line queries (or commands) via a graphical user
interface API and work with the data cube in the data
analysis via a cube API.
■ The data cube can be constructed by accessing and/or
integrating multiple databases via an MDDB API and/or
by filtering a data warehouse via a database API that may
support OLE DB or ODBC connections.
111
112
Data Mining
&
Motivating Challenges
UNIT - II
By
M. Rajesh Reddy
WHAT IS DATA MINING?
• Post Processing:
• only valid and useful results are incorporated into the
decision support system.
• Visualization
• allows analysts to explore the data and the data
mining results from a variety of viewpoints.
• High Dimensionality
• Non-traditional Analysis
Motivating Challenges:
• Scalability
• Sizes of data sets are on the order of gigabytes, terabytes, or petabytes.
• High Dimensionality
• common today - data sets with hundreds or thousands
of attributes
• Example
• Bio-Informatics - microarray technology has
produced gene expression data involving
thousands of features.
• Data sets with temporal or spatial components
also tend to have high dimensionality.
• a data set that contains measurements of
temperature at various locations.
Motivating Challenges:
• Non-traditional Analysis:
• Traditional statistical approach: hypothesize-and-test paradigm.
• A hypothesis is proposed,
• an experiment is designed to gather the data, and
• then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks
• Generation and evaluation of thousands of hypotheses,
• Some DM techniques automate the process of hypothesis
generation and evaluation.
• Some data sets frequently involve non-traditional types of data
and data distributions.
Origins of Data mining,
Data mining Tasks
&
Types of Data
Unit - II
DWDM
The Origins of Data Mining
https://www.javatpoint.com/data-mining-cluster-analysis
Data Mining Tasks …
Example (predictive modeling): predicting the type of a flower.
Data Mining Tasks
■ Association analysis
– used to discover patterns that describe strongly associated features in the
data.
– Discovered patterns are represented in the form of implication rules or
feature subsets.
– Goal of association analysis:
■ To extract the most interesting patterns in an efficient manner.
– Example
■ finding groups of genes that have related functionality,
■ identifying Web pages that are accessed together, or
■ understanding the relationships between different elements of Earth’s climate
system.
Data Mining Tasks
■ Association analysis
■ Example (Market Basket Analysis).
– AIM: find items that are frequently bought together by customers.
– Association rule {Diapers} −→ {Milk},
■ suggests that customers who buy diapers also tend to buy milk.
■ This rule can be used to identify potential cross-selling opportunities among related
items.
https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png
Data Mining Tasks
■ Anomaly Detection:
– Example 1.4 (Credit Card Fraud Detection).
– A credit card company records the transactions made by every credit card
holder, along with personal information such as credit limit, age, annual income,
and address.
– Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be
applied to build a profile of legitimate transactions for the users.
– When a new transaction arrives, it is compared against the profile of the user. If
the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
Types of Data
■ Example:-
■ Dataset - Student Information.
■ Each row corresponds to a student.
■ Each column is an attribute that describes some aspect of a
student.
Types of Data
Data
Substructure mining
Types of Data - Types of Dataset
■ Ordered Data:
– In some data, the attributes have relationships that involve order in time or
space.
– Sequential Data
■ Sequential data / temporal data
■ extension of record data - each record has a time associated with it.
■ Ex:- Retail transaction data set - stores the time of transaction
– time information used to find patterns
■ “candy sales peak before Halloween.”
■ Each attribute - also - time associated
– Record - purchase history of a customer
■ with a listing of items purchased at different times.
– find patterns
■ “people who buy DVD players tend to buy DVDs in the period
immediately following the purchase.”
Types of Data - Types of Dataset
■ Ordered Data: Sequential
Types of Data - Types of Dataset
■ Ordered Data: Sequence Data
– consists of a data set that is a sequence
of individual entities,
– Example
■ sequence of words or letters.
– Example:
■ Genetic information of plants and
animals can be represented in the
form of sequences of nucleotides that
are known as genes.
■ Predicting similarities in the structure
and function of genes from similarities
in nucleotide sequences.
– Ex:- Human genetic code expressed
using the four nucleotides from which all
DNA is constructed: A, T, G, and C.
Types of Data - Types of Dataset
■ Ordered Data: Time Series Data
– Special type of sequential data in
which each record is a time series,
– A series of measurements taken over
time.
– Example:
■ Financial data set might contain
objects that are time series of the
daily prices of various stocks.
– Temporal autocorrelation; i.e., if two
measurements are close in time, then
the values of those measurements are
often very similar. Time series of the average
monthly temperature for
Minneapolis during the years
1982 to 1994.
Types of Data - Types of Dataset
■ Ordered Data: Spatial Data
■ Some objects have spatial attributes,
such as positions or areas, as well as
other types of attributes.
■ An example of spatial data is
– weather data (precipitation,
temperature, pressure) that is
collected for a variety of geographical
locations.
■ spatial autocorrelation; i.e., objects that
are physically close tend to be similar in
other ways as well.
■ Example Average Monthly
– two points on the Earth that are close Temperature of land and
to each other usually have similar ocean
values for temperature and rainfall.
Data Quality
Unit – II- DWDM
Data Quality
● Data mining applications are applied to data that was collected for another purpose, or for
future, but unspecified applications.
● Data mining focuses on
(1) the detection and correction of data quality problems - Data Cleaning
(2) the use of algorithms that can tolerate poor data quality.
• “less is more”
• Aggregation - combining of two or more objects into a single object.
• In Example,
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces number of records (1 record per store).
• How an aggregate transaction is created
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Can also be viewed as a multidimensional array, where each attribute is a dimension.
• Used in OLAP
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION
average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/Technique – which gives high probability of getting a
representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simpler version -equal numbers of objects are drawn from each group even though the groups are of
different sizes.
• Other - the number of objects drawn from each group is proportional to the size of that group.
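An illustrative sketch of these sampling schemes with pandas (the toy data and the 20% sampling fraction are assumptions):

import pandas as pd

# Toy data set with two groups of very different sizes (illustrative values).
df = pd.DataFrame({"group": ["A"] * 90 + ["B"] * 10, "value": range(100)})

# Simple random sampling, without and with replacement.
without_repl = df.sample(n=20, replace=False, random_state=1)
with_repl    = df.sample(n=20, replace=True,  random_state=1)

# Stratified sampling: draw from each group in proportion to its size.
stratified = df.groupby("group").sample(frac=0.2, random_state=1)

print(without_repl["group"].value_counts())
print(stratified["group"].value_counts())   # about 18 A's and 2 B's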
SAMPLING
• Adaptive/Progressive Sampling:
• Proper sample size - Difficult to determine
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• Initial correct sample size is eliminated
• Stop increasing the sample size at leveling-off point(where no
improvement in the outcome is identified).
DIMENSIONALITY REDUCTION
• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.
• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.
• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.
• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: running the target data mining application, measure the result of the data mining.
• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.
• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another Approach – automatic
• Example – Classification Scheme - Support vector machines
• Other Approach
• The normalization of objects – Cosine Similarity – used as weights
FEATURE CREATION
• Create a new set of attributes that captures the important
information in a data set from the original attributes
• much more effective.
• No. of new attributes < No. of original attributes
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• Higher level features( presence or absence of certain types of edges and areas that are highly correlated with
the presence of human faces), then a much broader set of classification techniques can be applied to this
problem.
• Feature Construction
• Features in the original data sets consists necessary information, but not suitable for the data mining
algorithm.
• If new features constructed out of the original features can be more useful than the original features.
• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
DISCRETIZATION AND BINARIZATION
Original Data
DISCRETIZATION AND BINARIZATION
Unsupervised Discretization
• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of "standardizing a variable" in statistics.
• If x̄ is the mean (average) of the attribute values and
• s_x is the standard deviation,
• the transformation x′ = (x − x̄) / s_x creates a new variable with a mean of 0 and a standard deviation of 1.
• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• Income values(higher values) will dominate the calculation.
Variable Transformation
• Normalization or Standardization
• Mean and standard deviation are strongly affected by outliers
• Mean is replaced by the median, i.e., the middle value.
• For a variable x, the absolute standard deviation is
σ_A = (1/m) Σ_{i=1}^{m} |x_i − µ|, where
• x_i is the i-th value of the variable,
• m is the number of objects, and
• µ is the mean or median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
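A small sketch of both transformations with NumPy (the toy values, including the outlier 400, are assumptions):

import numpy as np

x = np.array([25.0, 32.0, 47.0, 51.0, 60.0, 400.0])   # toy values with one outlier

# Classical standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Outlier-resistant variant: median and absolute deviation, as defined above.
median = np.median(x)
abs_dev = np.mean(np.abs(x - median))
z_robust = (x - median) / abs_dev

print(z.round(2))
print(z_robust.round(2))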
Measures of
Similarity and
Dissimilarity
Unit - II
Data Mining
Measures of Similarity and
Dissimilarity
Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity
○ to transform a proximity measure to fall within a particular range, such as [0,1].
● Example
○ Similarities between objects range from 1 (not at all similar) to 10 (completely
similar)
○ we can make them fall within the range [0, 1] by using the transformation
■ s’ = (s−1)/9
■ s - Original Similarity
■ s’ - New similarity values
Measures of Similarity and Dissimilarity
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Euclidean Distance
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
2. Symmetry
3. Triangle Inequality
If d(A, B) = size(A − B), then it does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.
Jaccard Coefficient
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
If x and y are two document vectors, then
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
print("A:", A)
print("B:", B)
Note:-
Dividing x and y by their lengths normalizes them to have a length of 1 ( means magnitude is not
considered)
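Completing the fragment above, a minimal cosine-similarity sketch in NumPy; the two term-frequency vectors A and B are illustrative.

import numpy as np

# Toy document term-frequency vectors; the values are illustrative.
A = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
B = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print("A:", A)
print("B:", B)
print("cos(A, B) =", round(cos_sim, 3))   # about 0.31 for these vectors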
Measures of Similarity and Dissimilarity
Examples of proximity measures
• Learning algorithm
• Used by the classifier
• To identify a model
• That best fits the relationship between the
attribute set and class label of the input data.
General approach to solving a classification problem
• Model
• Generated by a learning algorithm
• Should satisfy the following:
• Fit the input data well
• Correctly predict the class labels of
records it has never seen before.
• Training set
• Consisting of records whose class labels are
known
• used to build a classification model
General approach to solving a classification problem
• Confusion Matrix
• Used to evaluate the performance of a classification model
• Holds details about
• counts of test records correctly and incorrectly predicted by the model.
• Table 4.2 depicts the confusion matrix for a binary classification problem.
• fij – no. of records from class i predicted to be of class j.
• f01 – no. of records from class 0 incorrectly predicted as class 1.
• total no. of correct predictions made (f11 + f00)
• total number of incorrect predictions (f10 + f01).
General approach to solving a classification problem
• Performance Metrics:
1. Accuracy
2. Error rate
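A tiny sketch of both metrics computed from the confusion-matrix counts f11, f10, f01, f00 defined above (the counts themselves are made-up values):

# Confusion counts for a binary problem (illustrative values).
f11, f00 = 40, 45    # correctly predicted class 1 and class 0
f10, f01 = 5, 10     # class 1 predicted as 0, class 0 predicted as 1

total = f11 + f00 + f10 + f01
accuracy   = (f11 + f00) / total
error_rate = (f10 + f01) / total
print(accuracy, error_rate)   # 0.85 0.15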
DECISION TREE INDUCTION
Working of Decision Tree
• We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
• Each time we receive an answer, a follow-
up question is asked until we reach a
conclusion about the class label of the
record.
• The series of questions and their possible
answers can be organized in the form of a
decision tree
• Decision tree is a hierarchical structure
consisting of nodes and directed edges.
DECISION TREE INDUCTION
Working of Decision Tree
• Three types of nodes:
• Root node
• No incoming edges
• Zero or more outgoing edges.
• Internal nodes
• Exactly one incoming edge and
• Two or more outgoing edges.
• Leaf or terminal nodes
• Exactly one incoming edge and
• No outgoing edges.
● A brute-force method: take every value of the attribute in the N records as a candidate split position.
● Count the number of records with annual income less than or greater than v (computationally expensive).
● To reduce the complexity, the training records are sorted based on their annual income,
● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
○ Problem:
■ A test condition on Customer ID produces purer partitions,
■ but Customer ID is not a predictive attribute because its value is unique for each record.
○ Two Strategies:
■ First strategy(used in CART)
● restrict the test conditions to binary splits only.
■ Second Strategy(used in C4.5 - Gain Ratio - to determine goodness
of a split)
● modify the splitting criterion
● consider - number of outcomes produced by the attribute test
condition.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
Tree-Pruning
• After building the decision tree,
• Tree-pruning step - to reduce the size of the decision
tree.
• Pruning -
• trims the branches of the initial tree
• improves the generalization capability of the
decision tree.
• Decision trees that are too large are susceptible to a
phenomenon known as overfitting.
Model Overfitting
DWDM Unit-III
Model Overfitting
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30%
Model Overfitting
https://www.datavedas.com/holdout-cross-validation/
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Random Sub Sampling
○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance.
○ Overall accuracy,
○ Problems:
■ Does not utilize as much data as possible for training.
■ No control over the number of times each record is used for testing and training.
https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ Alternative to Random Subsampling
○ Each record is used the same number of times for training and exactly once for testing.
○ Two fold cross-validation
■ Partition the data into two equal-sized subsets.
■ one of the subsets for training and the other for testing.
■ Then swap the roles of the subsets
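A minimal sketch of k-fold cross-validation index splitting (k = 2 reproduces the two-fold case above; the record count is an assumption):

import numpy as np

def k_fold_indices(n_records, k, seed=0):
    """Split record indices into k folds; each record is tested exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)
    return np.array_split(idx, k)

folds = k_fold_indices(n_records=10, k=2)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={sorted(train_idx)}, test={sorted(test_idx)}")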
● The .632 bootstrap computes the overall accuracy as acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × ε_i + 0.368 × acc_s), where
● b is the number of bootstrap samples,
● ε_i is the accuracy of the model built from the i-th bootstrap sample, and acc_s is the accuracy on the full training data
Picture reference - https://bradleyboehmke.github.io/HOML/process.html
Bayesian Classifiers
DWDM Unit - III
Bayesian Classifiers
● Consider a football game between two rival teams: Team 0 and Team 1.
● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining
matches.
● Among the games won by Team 0, only 30% of them come from playing on
Team 1’s football field.
● On the other hand, 75% of the victories for Team 1 are obtained while playing at
home.
● If Team 1 is to host the next match between the two teams, which team will
most likely emerge as the winner?
● This Problem can be solved by Bayes Theorem
Bayesian Classifiers
● Bayes Theorem
○ X and Y are random variables.
○ A conditional probability is the probability that a random variable will take on a
particular value given that the outcome for another random variable is known.
○ Example:
■ conditional probability P(Y = y|X = x) refers to the probability that the variable
Y will take on the value y, given that the variable X is observed to have the
value x.
Bayesian Classifiers
● Bayes Theorem
If {X1, X2,..., Xk} is the set of mutually exclusive and exhaustive outcomes of a
random variable X, then the denominator of the previous slide equation can be
expressed as follows:
Bayesian Classifiers
● Bayes Theorem
Bayesian Classifiers
● Bayes Theorem
○ Using the Bayes Theorem for Classification
■ X - attribute set
■ Y - class variable.
○ Treat X and Y as random variables -for non-deterministic relationship
○ Capture relationship probabilistically using P(Y |X) - Posterior Probability or Conditional Probability
○ P(Y) - prior probability
○ Training phase
■ Learn the posterior probabilities P(Y |X) for every combination of X and Y
○ Use these probabilities and classify test record X` by finding the class Y` (max posterior probability - P(y`/x`))
Bayesian Classifiers
Using the Bayes Theorem for Classification
Example:-
● test record
X= (Home Owner = No, Marital Status = Married, Annual Income = $120K)
● Y=?
● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X)
● Y= Yes, if P(Yes|X) > P(No|X)
● Y= No, Otherwise
Bayesian Classifiers
● assumes that the attributes are conditionally independent, given the class label y.
● The conditional independence assumption can be formally stated as follows:
Bayesian Classifiers
● Discretization
● Probability Distribution
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
○ Gaussian distribution can be used to represent the class-conditional probability for continuous
attributes.
○ The distribution is characterized by two parameters,
■ mean, µ
■ variance, σ²
µ_ij - sample mean of X_i for all training records that belong to the class y_j.
σ²_ij - sample variance (s²) of such training records.
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
sample mean and variance for this attribute with respect to the class No
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● P(no|x)= ?
● P(yes|x) = ?
● Large value is the class label
● X = (Home Owner=No, Marital Status = Married, Income = $120K)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) = ?
● P(Y|X) ∝ P(Y) × P(X|Y)
● P(no | Home Owner=No, Marital Status=Married, Income=$120K) ∝
P(DB=no) × P(Home Owner=No, Marital Status=Married, Income=$120K | DB=no)
● P(X|Y=no) = P(HM=no|DB=no) × P(MS=married|DB=no) × P(Income=$120K|DB=no)
= 4/7 × 4/7 × 0.0072
= 0.0024
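A short sketch of the final comparison; P(X|No) follows the computation above, P(Married|Yes) = 0 in this training set, and the priors 7/10 and 3/10 are assumed from the 10-record example.

p_x_given_no  = (4/7) * (4/7) * 0.0072   # = 0.0024, as computed above
p_x_given_yes = 0.0                      # P(Marital Status=Married | DB=yes) = 0 here
p_no, p_yes   = 7/10, 3/10               # class priors (assumed from the example data)

# Compare P(Y) * P(X|Y); the larger product gives the predicted class label.
score_no, score_yes = p_no * p_x_given_no, p_yes * p_x_given_yes
print("Predicted class:", "no" if score_no > score_yes else "yes")   # -> "no"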
Bayesian Classifiers
Example of the Naïve Bayes Classifier
https://www.geeksforgeeks.org/naive-bayes-classifiers/
Association Analysis:
Basic Concepts and Algorithms
DWDM Unit - IV
Basic Concepts
Problem Definition:
● Binary Representation Market basket data
● each row - transaction
● each column - item
● value is one if the item is present in a transaction and
zero otherwise.
● item is an asymmetric binary variable because the
presence of an item in a transaction is often considered
more important than its absence
Basic Concepts
Association Rule:
● An association rule is an implication expression of
the form X → Y, where X and Y are disjoint itemsets
○ i.e., X ∩ Y = ∅.
● The strength of an association rule can be measured
in terms of its support and confidence.
Basic Concepts
● Support
○ determines how often a rule is applicable to
a given data set
○
● Confidence
○ determines how frequently items in Y appear
in transactions that contain X
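A small sketch computing support and confidence over a toy transaction database (the five transactions are the usual market-basket example and are assumptions here):

# Support and confidence of the rule {Milk, Diapers} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support    = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)            # 2/3 = 0.67
print(support, confidence)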
Basic Concepts
● Example: for the rule {Milk, Diapers} → {Beer},
confidence = σ({Milk, Diapers, Beer}) / σ({Milk, Diapers}) = 2/3 = 0.67.
Basic Concepts
The Apriori
Principle
If an itemset is
frequent, then all
of its subsets must
also be frequent.
Frequent Itemset Generation
Support-based pruning:
I - set of items
J = 2^I - power set of I
A measure f is monotone if X ⊆ Y implies f(X) ≤ f(Y), and anti-monotone if X ⊆ Y implies f(Y) ≤ f(X), for all X, Y ∈ J.
https://www.softwaretestinghelp.com/apriori-algorithm/
Frequent Itemset Generation in the Apriori Algorithm
Example
Example
Apriori in Python
https://intellipaat.com/blog/data-science-apriori-algorithm/
Apriori in Python
https://intellipaat.com/blog/data-science-apriori-algorithm/
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets
1. Brute-Force Method
2. Fk−1 × F1 Method
3. Fk−1×Fk−1 Method
Frequent Itemset Generation in the Apriori Algorithm
overall complexity
● The procedure is complete.
● But the same candidate itemset will be generated more than once ( duplicates).
● Example:
○ {Bread, Diapers, Milk} can be generated
○ by merging {Bread, Diapers} with {Milk},
○ {Bread, Milk} with {Diapers}, or
○ {Diapers, Milk} with {Bread}.
● One Solution
○ Generate candidate itemset by joining items
in lexicographical order only
● {Bread, Diapers} is joined with {Milk} (Milk follows Diapers in lexicographic order);
● {Diapers, Milk} is not joined with {Bread}, and {Bread, Milk} is not joined with {Diapers}, because those merges would violate the ordering.
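For comparison, a sketch of the Fk−1 × Fk−1 method listed earlier: two frequent (k−1)-itemsets are merged only if their first k−2 items agree, so each candidate is generated exactly once. The itemsets used are illustrative.

from itertools import combinations

def generate_candidates(freq_kminus1):
    """F(k-1) x F(k-1) candidate generation with lexicographically sorted itemsets."""
    freq = sorted(tuple(sorted(s)) for s in freq_kminus1)
    candidates = []
    for a, b in combinations(freq, 2):
        if a[:-1] == b[:-1]:                      # identical prefix of length k-2
            candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

frequent_2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(generate_candidates(frequent_2))
# [('Bread', 'Diapers', 'Milk')] is generated exactly once, with no duplicates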
● An itemset is a closed
frequent itemset if it is
closed and its support is
greater than or equal to
minsup.
Compact Representation of Frequent Itemsets
● An association rule X → Y is redundant if there exists another rule X′ → Y′,
where X is a subset of X′ and Y is a subset of Y′,
such that the support and confidence for both rules are identical.
● From table 6.5 {b} is not a closed frequent itemset while {b, c} is closed.
● The association rule {b} → {d, e} is therefore redundant because it has the same
support and confidence as {b, c} → {d, e}.
● Such redundant rules are not generated if closed frequent itemsets are used
for rule generation.
● All maximal frequent itemsets are closed, because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
FP Growth Algorithm
Association Analysis (Unit - IV)
DWDM
FP Growth Algorithm
● FP-growth algorithm takes a radically different approach for discovering frequent itemsets.
● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts
frequent itemsets directly from this structure
FP-Tree Representation
● An FP-tree is a compressed representation of the input data. It is constructed by reading the data
set one transaction at a time and mapping each transaction onto a path in the FP-tree.
● As different transactions can have several items in common, their paths may overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract
frequent itemsets directly from the structure in memory instead of making repeated passes over
the data stored on disk.
FP Tree Representation
FP Tree Representation
● Figure 6.24 shows a data set that
contains ten transactions and five
items.
● The structures of the FP-tree after
reading the first three
transactions are also depicted in
the diagram.
● Each node in the tree contains the
label of an item along with a
counter that shows the number of
transactions mapped onto the
given path.
● Initially, the FP-tree contains only
the root node represented by the
null symbol.
FP Tree Representation
1. The data set is scanned once to
determine the support count of
each item. Infrequent items are
discarded, while the frequent
items are sorted in decreasing
support counts. For the data set
shown in Figure, a is the most
frequent item, followed by b, c, d,
and e.
FP Tree Representation
2. The algorithm makes a second
pass over the data to construct
the FP-tree. After reading the
first transaction, {a, b}, the nodes
labeled as a and b are created. A
path is then formed from null →
a → b to encode the transaction.
Every node along the path has a
frequency count of 1.
FP Tree Representation
3. After reading the second transaction, {b,c,d}, a new set of
nodes is created for items b, c, and d. A path is then
formed to represent the transaction by connecting the
nodes null → b → c → d. Every node along this path
also has a frequency count equal to one.
4. The third transaction, {a,c,d,e}, shares a common prefix
item (which is a) with the first transaction. As a result,
the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction,
null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two,
while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been
mapped onto one of the paths given in the FP-tree.
The resulting FP-tree after reading all the transactions
is shown in Figure 6.24.
FP Tree Representation
● The size of an FP-tree is typically smaller
than the size of the uncompressed data
because many transactions in market
basket data often share a few items in
common.
● In the best-case scenario, where all the
transactions have the same set of items,
the FP-tree contains only a single branch
of nodes.
● The worst-case scenario happens when
every transaction has a unique set of
items.
FP Tree Representation
● The size of an FP-tree also
depends on how the items are
ordered.
● If the ordering scheme in the
preceding example is reversed,
i.e., from lowest to highest
support item, the resulting FP-
tree is shown in Figure 6.25.
● An FP-tree also contains a list
of pointers connecting
between nodes that have the
same items.
● These pointers, represented as
dashed lines in Figures 6.24
and 6.25, help to facilitate the
rapid access of individual
items in the tree.
Frequent Itemset Generation using FP-Growth Algorithm
Steps in FP-Growth Algorithm:
Step-1: Scan the database to build Frequent 1-item set which will contain all
the elements whose frequency is greater than or equal to the minimum
support. These elements are stored in descending order of their
respective frequencies.
Step-2: For each transaction, the respective Ordered-Item set is built.
Step-3: Construct the FP tree by scanning each Ordered-Item set (see the sketch after this list).
Step-4: For each item, the Conditional Pattern Base is computed which is
path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Step-5: For each item, the Conditional Frequent Pattern Tree is built.
Step-6: Frequent Pattern rules are generated by pairing the items of the
Conditional Frequent Pattern Tree set to each corresponding item.
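A minimal sketch of Step 3 (FP-tree construction by inserting ordered-item sets); the FPNode class and the three transactions are illustrative assumptions.

# Minimal FP-tree insertion sketch; transactions are assumed already ordered
# by decreasing support (the "Ordered-Item sets" from Step 2).
class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def insert(root, ordered_items):
    node = root
    for item in ordered_items:
        node = node.children.setdefault(item, FPNode(item, node))
        node.count += 1            # shared prefixes just increment the counter

root = FPNode(None)
for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"]]:
    insert(root, t)

print({item: child.count for item, child in root.children.items()})   # {'K': 3}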
Frequent Itemset Generation in FP-Growth Algorithm
Example:
Given Database: min_support=3 The frequency of each individual
item is computed:-
Frequent Itemset Generation in FP-Growth Algorithm
● A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies.
● L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
● Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained
in the transaction. The following table is built for all the transactions:
Frequent Itemset Generation in FP-Growth Algorithm
Now, all the Ordered-Item sets
are inserted into a Trie Data
Structure.
a) Inserting the set {K, E, M, O,
Y}:
All the items are simply
linked one after the other in
the order of occurrence in
the set and initialize the
support count for each item
as 1.
Frequent Itemset Generation in FP-Growth Algorithm
b) Inserting the set {K, E, O, Y}: the shared prefix K → E is reused and its support counts are incremented; new nodes are created for O and Y.
Building the Conditional Frequent Pattern Tree (Step 5):
It is done by taking the set of elements that is common in all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
The itemsets whose support count >= min_support value are retained in the Conditional Frequent Pattern Tree and the rest are discarded.
Frequent Itemset Generation in FP-Growth Algorithm
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred for example for the first
row which contains the element, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is calculated and the one
with confidence greater than or equal to the minimum confidence value is retained.
Data Mining
Cluster Analysis: Basic Concepts
and Algorithms
Intra-cluster distances are minimized; inter-cluster distances are maximized.
2
Applications of Cluster Analysis
● Understanding
– Group related documents
for browsing(Information
Retrieval),
– group genes and proteins
that have similar
functionality(Biology),
– group stocks with similar
price fluctuations
(Business)
– Climate
– Psychology & Medicine
Clustering precipitation
in Australia
3
Applications of Cluster Analysis
Clustering precipitation
in Australia
4
Notion of a Cluster can be Ambiguous
5
Types of Clusterings
6
Partitional Clustering
7
Hierarchical Clustering
8
Other Distinctions Between Sets of Clusters
9
Types of Clusters
● Well-separated clusters
● Prototype-based clusters
● Contiguity-based clusters
● Density-based clusters
10
Types of Clusters: Well-Separated
● Well-Separated Clusters:
– A cluster with a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.
3 well-separated clusters
11
Types of Clusters: Prototype-Based
4 center-based clusters
12
Types of Clusters: Contiguity-Based ( Graph)
8 contiguous clusters
● Useful when clusters are irregular or intertwined
● Trouble when noise is present
– a small bridge of points can merge two distinct clusters.
13
Types of Clusters: Density-Based
● Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
The two circular clusters are not merged, as in the figure, because the bridge between them (in the previous slide's figure) fades into the noise.
6 density-based clusters
The curve present in the previous slide's figure also fades into the noise and does not form a cluster.
14
Types of Clusters: Density-Based
A clustering algorithm would need a very specific concept (sophisticated) of a cluster to successfully
detect these clusters. The process of finding such clusters is called conceptual clustering.
15
Clustering Algorithms
● Hierarchical clustering
● Density-based clustering
16
K-means
● Prototype-based, partitional clustering
technique
● Attempts to find a user-specified number of
clusters (K)
17
Agglomerative Hierarchical Clustering
● Hierarchical clustering
● Starts with each point as a singleton cluster
● Repeatedly merges the two closest clusters
until a single, all encompassing cluster
remains.
● Some Times - graph-based clustering
● Others - prototype-based approach.
18
DBSCAN
● Density-based clustering algorithm
● The number of clusters is determined automatically by the algorithm.
● Noise: points in low-density regions are omitted.
19
K-means Clustering
20
Example of K-means Clustering
Example of K-means Clustering
22
K-means Clustering – Details
● Simple iterative algorithm.
– Choose initial centroids;
– repeat {assign each point to a nearest centroid; re-compute cluster centroids}
– until centroids stop changing.
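A compact NumPy sketch of this loop (the toy points, k = 2, and the random initialization are assumptions):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                      # centroids stopped changing
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")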
23
K-means Clustering – Details
24
K-means Clustering – Details
25
Centroids and Objective Functions
26
K-means Objective Function
Document Data
● Cosine Similarity
27
Two different K-means Clusterings
Original Points
28
Importance of Choosing Initial Centroids …
The below 2 figures show the clusters that result from two particular choices of initial centroids.
(For both figures, the positions of the cluster centroids in the various iterations are indicated by
crosses.)
Fig-1
Fig-2
30
Problems with Selecting Initial Points
● Figure 5.7 shows that if a pair of clusters has only one initial
centroid and the other pair has three, then two of the true
clusters will be combined and one true cluster will be split.
31
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
32
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
33
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while other
have only one.
34
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while other have only one.
35
Solutions to Initial Centroids Problem
● Multiple runs
● K-means++
● Bisecting K-means
36
Multiple Runs
38
K-means++
39
Bisecting K-means
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
40
https://www.geeksforgeeks.org/bisecting-k-means-algorithm-introduction/
41
Limitations of K-means
42
Limitations of K-means: Differing Sizes
43
Limitations of K-means: Differing Density
44
Limitations of K-means: Non-globular Shapes
45
Overcoming K-means Limitations
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
46
Hierarchical Clustering
49
Strengths of Hierarchical Clustering
50
Hierarchical Clustering
– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
51
Agglomerative Clustering Algorithm
52
Steps 1 and 2
[Figure: initial proximity matrix for points p1, p2, p3, p4, p5, ...; each point starts as an individual cluster.]
53
Intermediate Situation
[Figure: after some merging steps, clusters C1, C2, C3, C4, C5 remain, with their proximity matrix.]
54
Step 4
● We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: proximity matrix over C1, C2, C3, C4, C5 before merging C2 and C5.]
55
Step 5
● After merging C2 and C5, the question is how to compute the proximity between the new cluster C2 ∪ C5 and the remaining clusters.
[Figure: updated proximity matrix in which the row and column for C2 ∪ C5 are marked "?".]
56
How to Define Inter-Cluster Distance
[Figure: two clusters of points p1, p2, ..., with their proximity matrix; which pair-wise proximities define the proximity between the clusters?]
● MIN
● MAX
● Group Average
● Distance Between Centroids
● Other methods driven by an objective function
– Ward’s Method uses squared error
57
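In practice these inter-cluster measures are available as the "method" argument of SciPy's agglomerative clustering (assuming SciPy is installed): 'single' corresponds to MIN, 'complete' to MAX, 'average' to group average, and 'ward' to Ward's method. The points below are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.0], [4.1, 3.9], [8.0, 0.1]])

Z = linkage(X, method="single")           # agglomerative merge tree (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                             # cluster label per point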
MIN or Single Link
62
Hierarchical Clustering: MIN
[Figure: nested clusters and dendrogram produced by single-link (MIN) clustering on six points.]
63
Strength of MIN
64
Limitations of MIN
Two Clusters
Original Points
• Sensitive to noise
Three Clusters
65
MAX or Complete Linkage
Distance Matrix:
66
Hierarchical Clustering: MAX
[Figure: nested clusters and dendrogram produced by complete-link (MAX) clustering on the same six points.]
67
Strength of MAX
68
Limitations of MAX
69
Group Average
Distance Matrix:
70
Hierarchical Clustering: Group Average
[Figure: nested clusters and dendrogram produced by group-average clustering on the same six points.]
71
Hierarchical Clustering: Group Average
● Strengths
– Less susceptible to noise
● Limitations
– Biased towards globular clusters
72
Cluster Similarity: Ward’s Method
73
Hierarchical Clustering: Comparison
[Figure: side-by-side comparison of the nested clusters produced by MIN, MAX, Group Average, and Ward’s Method on the same six points.]
74
Hierarchical Clustering: Time and Space requirements
75
Hierarchical Clustering: Problems and Limitations
76
Density Based Clustering
77
DBSCAN
78
DBSCAN: Core, Border, and Noise Points
MinPts = 7
79
DBSCAN: Core, Border and Noise Points
81
When DBSCAN Works Well
82
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: original points and DBSCAN results with (MinPts=4, Eps=9.92) and with (MinPts=4, Eps=9.75).]
83