Data Warehousing and Mining Complete Notes

UNIT I

DATA WAREHOUSING

1
DATA WAREHOUSING (Unit - I)

❑ Data Warehouse and OLAP Technology:


○ 1.1 An Overview: Data Warehouse
○ 1.2 Data Warehouse Architecture
○ 1.3 A Multidimensional Data Model
○ 1.4 Data Warehouse Implementation
○ 1.5 From Data Warehousing to Data
Mining. (Han & Kamber)

2
Data Warehouse Overview

3
What is Data Warehouse?
■ Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use their data
to make strategic decisions.
■ Data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases.

■ “A data warehouse is a subject-oriented, integrated,


time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”

■ Data warehousing: The process of constructing and using data


warehouses

4
Data Warehouse—Subject-Oriented

■ Organized around major subjects, such as customer,


product, sales
■ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process

5
Data Warehouse—Integrated
■ Constructed by integrating multiple, heterogeneous data
sources
■ relational databases, flat files, on-line transaction

records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different


data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.

6
Data Warehouse—Time Variant
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse contains an
element of time, explicitly or implicitly. But the key of
operational data may or may not contain “time element”

7
Data Warehouse—Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data

8
OLTP vs OLAP
■ User & system orientation: OLTP is customer oriented (transaction and query processing); OLAP is market oriented (data analysis by managers, executives, and analysts)
■ Data contents: OLTP holds current, very detailed data; OLAP holds large amounts of historical data (summarization and aggregation)
■ Database design: OLTP uses an ER data model (application-oriented database design); OLAP uses a star or snowflake model (subject-oriented database design)
■ View: OLTP focuses on current data within an enterprise or department; OLAP spans multiple versions of a database schema (an evolutionary process), data from different organizations, and many data stores
■ Access patterns: OLTP uses short, atomic transactions (requiring concurrency control and recovery); OLAP uses mostly read-only operations (complex queries)

9
Data Warehouse Architecture

15
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS

■ Day-to-day operations: purchasing, inventory, banking,


manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
■ Major task of data warehouse system

■ Data analysis and decision making

■ Distinct features (OLTP vs. OLAP):


■ User and system orientation: customer vs. market

■ Data contents: current, detailed vs. historical, consolidated

■ Database design: ER + application vs. star + subject


■ View: current, local vs. evolutionary, integrated
■ Access patterns: update vs. read-only but complex queries

16
Why Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency

control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which

operational DBs do not typically maintain


■ data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from heterogeneous
sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
17
18
Data Warehousing: A Multitiered Architecture

■ Bottom Tier:
■ Warehouse Database server

■ a relational database system

■ Back-end tools and utilities

■ data extraction
■ by using API gateways(ODBC, JDBC & OLEDB)
■ cleaning
■ transformation
■ load & refresh

19
Data Warehousing: A Multitiered Architecture

■ Middle Tier (OLAP server)


■ ROLAP - Relational OLAP

■ extended RDBMS that maps operations on


multidimensional data to standard relational
operations.
■ MOLAP - Multidimensional OLAP
■ Special-purpose server that directly implements
multidimensional data and operations.
■ Top Tier
■ Front-end Client Layer

■ Query and reporting tools, analysis tools and

data mining tools.


20
Data Warehousing: A Multitiered Architecture

■ Data Warehouse Models:


■ Enterprise warehouse:

■ collects all of the information about subjects


spanning the entire organization.

■ corporate-wide data integration

■ can range in size from a few gigabytes to


hundreds of gigabytes, terabytes, or beyond.

■ implemented on mainframes, computer


superservers, or parallel architecture platforms
21
Data Warehousing: A Multitiered Architecture

■ Data Warehouse Models:


■ Data mart:a subset of corporate-wide data that is of value
to a specific group of users
■ confined to specific selected subjects.
■ Example - marketing data mart may confine its subjects to
customer, item, and sales.
■ implemented on low-cost departmental servers
■ Independent Data mart - data captured from
■ one or more operational systems or external information
providers,
or
■ from data generated locally within a particular department or
geographic area.
■ Dependent Data mart - sourced directly from enterprise data
warehouses.
22
Data Warehousing: A Multitiered Architecture

■ Data Warehouse Models:


■ Virtual warehouse:
■ A virtual warehouse is a set of views over operational
databases.
■ easy to build but requires excess capacity on operational
database servers.

23
Data Warehousing: A Multitiered Architecture

■ Data extraction: gathers data from multiple,


heterogeneous, and external sources.
■ Data Cleaning: detects errors in the data and
rectifies them when possible
■ Data transformation: converts data from
legacy or host format to warehouse format.
■ Load: sorts, summarizes, consolidates,
computes views, checks integrity, and builds
indices and partitions.
■ Refresh: propagates the updates from the data
sources to the warehouse.
24
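To make these back-end steps concrete, here is a minimal sketch in Python that strings together extract, clean, transform, and load over two made-up operational rows; the field names and the cleaning rule are hypothetical illustrations, not taken from the text.

# Minimal ETL sketch (hypothetical data): extract rows, clean them,
# transform to the warehouse layout, then "load" into an in-memory list.

source_rows = [                     # extract: rows pulled from an operational source
    {"cust": " Alice ", "amount": "120.50", "date": "2024-01-03"},
    {"cust": "Bob",     "amount": None,     "date": "2024-01-04"},   # dirty row
]

def clean(row):
    """Drop rows with a missing measure and trim stray whitespace."""
    if row["amount"] is None:
        return None
    return {"cust": row["cust"].strip(), "amount": row["amount"], "date": row["date"]}

def transform(row):
    """Convert to the warehouse format: numeric measure plus a year key."""
    return {"customer": row["cust"],
            "dollars_sold": float(row["amount"]),
            "year": int(row["date"][:4])}

warehouse = []                      # load: append transformed rows
for r in source_rows:
    c = clean(r)
    if c is not None:
        warehouse.append(transform(c))

index_by_year = {}                  # build a simple index as part of the load step
for i, row in enumerate(warehouse):
    index_by_year.setdefault(row["year"], []).append(i)

print(warehouse)
print(index_by_year)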
Data Warehousing: A Multitiered Architecture

Metadata Repository:
metadata are the data that define warehouse
objects
It consists of:
1) Data warehouse structure
2) Operational metadata
3) algorithms used for summarization
4) Mapping from the operational environment to
the data warehouse
5) Data related to system performance
6) Business metadata
25
Data Warehousing: A Multitiered Architecture

Metadata Repository:
■ data warehouse structure

i) warehouse schema,
ii) view, dimensions,
iii) hierarchies, and
iv) derived data definitions,
v) data mart locations and contents.
■ Operational metadata
i) data lineage (history of migrated data and the
sequence of transformations applied to it),
ii) currency of data (active, archived, or purged),
iii) monitoring information (warehouse usage
statistics, error reports, and audit trails).
26
Data Warehousing: A Multitiered Architecture

Metadata Repository:
■ The algorithms used for summarization,

i) measure and dimension definition algorithms,


ii) data on granularity,
iii) partitions,
iv) subject areas,
v) aggregation,
vi) summarization, and
vii) predefined queries and reports.

27
Data Warehousing: A Multitiered Architecture

Metadata Repository:
■ Mapping from the operational environment to the data warehouse
i) source databases and their contents,
ii) gateway descriptions,
iii) data partitions,
iv) data extraction, cleaning, and transformation rules and defaults,
v) data refresh and purging rules, and
vi) security (user authorization and access control).

28
Data Warehousing: A Multitiered Architecture

Metadata Repository:
■ Data related to system performance
■ indices and profiles that improve data access and
retrieval performance,
■ rules for the timing and scheduling of refresh,
update, and replication cycles.
■ Business metadata,
■ business terms and definitions,
■ data ownership information, and
■ charging policies

29
A Multidimensional Data Model

30
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
■ A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
■ Dimensions are the perspectives or entities with
respect to which an organization wants to keep
records.
■ Example:-
■ AllElectronics may create a sales data warehouse

■ time, item, branch, and location - These


dimensions allow the store to keep track of things
like monthly sales of items and the branches and
locations at which the items were sold.
31
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
■ Each dimension may have a table associated with it, called
a dimension table, which further describes the
dimension.
■ For example - a dimension table for item may contain the
attributes item name, brand, type.
■ A multidimensional data model is typically organized
around a central theme, such as sales. This theme is
represented by a fact table.
■ Facts are numeric measures.
■ The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension
tables.

32
Data Cube: A Multidimensional Data Model
■ A data warehouse is based on a multidimensional data model which
views data in the form of a data cube

■ A data cube, such as sales, allows data to be modeled and viewed in


multiple dimensions

■ Dimension tables, such as item (item_name, brand, type), or


time(day, week, month, quarter, year)

■ Fact table contains measures (such as dollars_sold) and keys to


each of the related dimension tables

33
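As a rough illustration (not from the text), the following Python/pandas sketch builds a tiny sales fact table with invented numbers and views it as a 2-D slice of the cube, with time and item as dimensions and dollars_sold as the measure.

# A toy fact table viewed as a 2-D slice of a data cube (hypothetical figures).
import pandas as pd

fact = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["home entertainment", "computer", "home entertainment", "computer"],
    "location": ["Vancouver"] * 4,
    "dollars_sold": [605, 825, 680, 952],
})

# Dimensions on the axes, the measure aggregated in the cells:
cube_2d = fact.pivot_table(index="time", columns="item",
                           values="dollars_sold", aggfunc="sum")
print(cube_2d)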
Data Cube: A Multidimensional Data Model

■ A data cube is a lattice of cuboids


■ A data warehouse is usually modeled by a multidimensional data
structure, called a data cube, in which
■ each dimension corresponds to an attribute or a set of
attributes in the schema, and
■ each cell stores the value of some aggregate measure such as
count or sum(sales_amount).
■ A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.

34
Data Cube: A Multidimensional Data Model

2-D View of Sales data

■ AllElectronics sales data for items sold per quarter in the city of Vancouver.
■ a simple 2-D data cube that is a table or spreadsheet for sales data from
AllElectronics
35
Data Cube: A Multidimensional Data Model

3-D View of a Sales data

The 3-D data in the table are represented as a series of 2-D tables

36
Data Cube: A Multidimensional Data Model

3D Data Cube Representation of Sales data

we may also represent the same data in the form of a 3D data cube

37
Data Cube: A Multidimensional Data Model

4-D Data Cube Representation of Sales Data

we may display any n-dimensional data as a series of (n − 1)-dimensional


“cubes.”

38
Cube: A Lattice of Cuboids
(Dimensions: time, item, location, supplier)
■ 0-D (apex) cuboid: all
■ 1-D cuboids: (time), (item), (location), (supplier)
■ 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
■ 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
■ 4-D (base) cuboid: (time, item, location, supplier)
39
■ In data warehousing literature, an n-D base cube is called a base
cuboid.

■ The top most 0-D cuboid, which holds the highest-level of


summarization, is called the apex cuboid.
■ In our example, this is the total sales, or dollars sold,
summarized over all four dimensions.
■ The apex cuboid is typically denoted by all.

■ The lattice of cuboids forms a data cube.

40
Schemas for Multidimensional Data Models
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

41
Schemas for Multidimensional Data Models

■ Star schema: In this, a data warehouse contains


(1) a large central table (fact table) containing the bulk
of the data, with no redundancy, and
(2) a set of smaller attendant tables
(dimension tables), one for each dimension.
■ Each dimension is represented by only one table.
■ Each table contains a set of attributes
■ Problem: redundancy in dimension tables.
■ ex:- location dimension table will create redundancy
among the attributes province or state and country; that
is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).
42
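A minimal sketch of the star-schema idea using pandas (invented rows): a central sales fact table references item and location dimension tables through surrogate keys, and a query joins them and rolls up by country. Note how IL and USA repeat inside the denormalized location table, which is exactly the redundancy mentioned above.

# Star-schema sketch: a central sales fact table plus two dimension tables,
# joined on surrogate keys and rolled up by country (made-up rows).
import pandas as pd

location = pd.DataFrame({
    "location_key": [1, 2, 3],
    "city": ["Chicago", "Urbana", "Vancouver"],
    "province_or_state": ["IL", "IL", "BC"],
    "country": ["USA", "USA", "Canada"],
})
item = pd.DataFrame({
    "item_key": [10, 11],
    "item_name": ["Sony TV", "IBM laptop"],
    "type": ["home entertainment", "computer"],
})
sales_fact = pd.DataFrame({
    "location_key": [1, 2, 3, 3],
    "item_key": [10, 10, 11, 10],
    "dollars_sold": [250.0, 120.0, 900.0, 310.0],
})

# Join the fact table with each dimension table, then aggregate the measure.
joined = sales_fact.merge(location, on="location_key").merge(item, on="item_key")
print(joined.groupby(["country", "type"])["dollars_sold"].sum())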
Star schema

43
Snow flake schema

■ Variant of the star schema model


■ Dimension tables are normalized ( to remove
redundancy)
■ Dimension table is splitted into additional tables.
■ The resulting schema graph forms a shape similar to a
snowflake.
■ Problem
■ more joins will be needed to execute a query ( affects
system performance)
■ so this is not as popular as the star schema in data
warehouse design.

44
Snowflake schema

45
Fact Constellation

● A fact constellation schema allows dimension tables to be


shared between fact tables
● A data warehouse collects information about subjects that
span the entire organization, such as customers, items,
sales, assets, and personnel, and thus its scope is
enterprise-wide.
● For data warehouses, the fact constellation
schema is commonly used.
● For data marts, the star or snowflake schema is
commonly used

46
Fact Constellation
■ This schema specifies two fact tables, sales and shipping.
■ The dimension tables for time, item, and location are shared between the sales and shipping fact tables.

47
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
■ Just as relational query languages like SQL can be used
to specify relational queries, a data mining query
language (DMQL) can be used to specify data mining
tasks.

■ Data warehouses and data marts can be defined using


two language primitives, one for cube definition and
one for dimension definition.

48
Syntax for Cube and Dimension
Definition in DMQL
■ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
■ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
■ Special Case (Shared Dimension Tables)
■ First time as “cube definition”

■ define dimension <dimension_name> as

<dimension_name_first_time> in cube
<cube_name_first_time>

49
Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)

50
Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))

51
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

52
Concept Hierarchies

■ A concept hierarchy defines a sequence of mappings


from a set of low-level concepts to higher-level.
■ concept hierarchy for the dimension location

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 53
Concept Hierarchies

■ A concept hierarchy that is a total or partial order among


attributes in a database schema is called a schema
hierarchy.

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 54
Concept Hierarchies

■ Concept hierarchies may also be defined by discretizing


or grouping values for a given dimension or attribute,
resulting in a set-grouping hierarchy.
■ A total or partial order can be defined among groups of
values.

55
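A small sketch, with invented values, of how the two kinds of hierarchies can be represented in Python: a schema hierarchy for location as a chain of mappings, and a set-grouping hierarchy that buckets raw prices into ordered ranges.

# Concept hierarchies as plain mappings (invented values).

# Schema hierarchy: street < city < province_or_state < country
city_of = {"1 Main St": "Urbana", "5 Oak Ave": "Vancouver"}
province_of = {"Urbana": "Illinois", "Vancouver": "British Columbia"}
country_of = {"Illinois": "USA", "British Columbia": "Canada"}

def roll_up_location(street):
    """Climb the location hierarchy from street level up to country."""
    city = city_of[street]
    return country_of[province_of[city]]

print(roll_up_location("5 Oak Ave"))      # -> Canada

# Set-grouping hierarchy: group raw prices into ordered ranges.
def price_group(price):
    if price < 100:
        return "inexpensive"
    elif price < 500:
        return "moderately_priced"
    return "expensive"

print([price_group(p) for p in (49, 250, 999)])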
Measures of Data Cube: Three
Categories

■ A multidimensional point in the data cube space can be


defined by a set of dimension-value pairs,
for example, 〈time = “Q1”, location = “Vancouver”,
item = “computer”〉.
■ A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension-value pairs defining the given point.
■ Based on the kind of aggregate functions used, measures
can be organized into three categories : distributive,
algebraic, holistic
56
Measures of Data Cube: Three
Categories
■ Distributive: An aggregate function is distributive if the result
derived by applying the function to n aggregate values is same
as that derived by applying the function on all the data without
partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments (where M
is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
■ E.g., avg()=sum()/count(), min_N(), standard_deviation()
■ Holistic: An aggregate function is holistic if there is no constant
bound on the storage size and there does not exist an algebraic
function with M arguments (where M is a constant) that
characterizes the computation.
■ E.g., median(), mode(), rank() 57
Typical OLAP Operations
■ Roll up (drill-up):
■ Drill down (roll down):
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: Allows users to analyze the same data through
different reports, analyze it with different features and even display it
through different visualization methods

58
Fig. 3.10 Typical OLAP
Operations

59
Typical OLAP Operations:Roll Up/Drill Up

■ summarize data
■ by climbing up
hierarchy
or
■ by dimension
reduction

Source & Courtesy: https://www.javatpoint.com/olap-operations


60
Typical OLAP Operations:Roll Down

■ reverse of roll-up
■ from higher
level summary
to lower level
summary or
detailed data, or
introducing new
dimensions

Source & Courtesy: https://www.javatpoint.com/olap-operations


61
Typical OLAP Operations:Slicing
● Slice is the act of picking a rectangular subset of a cube by choosing a single
value for one of its dimensions, creating a new cube with one fewer
dimension.
● Example: The sales figures of all sales regions and all product categories of
the company in the year 2005 and 2006 are "sliced" out of the data cube.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube


62
Typical OLAP Operations:Slicing

Slicing: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube.

Source & Courtesy: https://www.javatpoint.com/olap-operations 63


Typical OLAP Operations:Dice
● Dice: The dice operation produces a subcube by allowing the analyst to pick
specific values of multiple dimensions
● The picture shows a dicing operation: The new cube shows the sales figures
of a limited number of product categories, the time and region dimensions
cover the same range as before.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube


64
Typical OLAP Operations:Dicing

Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.

Source & Courtesy: https://www.javatpoint.com/olap-operations 65


Typical OLAP Operations:Pivot
Pivot allows an analyst to rotate the cube in space to see its various faces. For
example, cities could be arranged vertically and products horizontally while viewing
data for a particular quarter.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube


66
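A compact pandas sketch (made-up sales figures) that mimics the basic operations on one small table: roll-up via groupby, slice by fixing one dimension, dice by restricting two dimensions, and pivot via pivot_table.

# Toy sales data used to mimic the basic OLAP operations (made-up numbers).
import pandas as pd

df = pd.DataFrame({
    "year":     [2005, 2005, 2006, 2006, 2005, 2006],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "region":   ["East", "East", "East", "West", "West", "West"],
    "category": ["Phones", "Phones", "TVs", "TVs", "Phones", "TVs"],
    "sales":    [100, 120, 90, 150, 80, 60],
})

# Roll-up: climb from quarter level to year level (aggregation reduces detail).
rollup = df.groupby(["year", "region"])["sales"].sum()

# Slice: fix one dimension to a single value, yielding a cube with one fewer dimension.
slice_2005 = df[df["year"] == 2005]

# Dice: pick specific values on two or more dimensions.
dice = df[(df["region"] == "East") & (df["category"].isin(["Phones", "TVs"]))]

# Pivot: rotate the axes to view regions against categories for one year.
pivot = slice_2005.pivot_table(index="region", columns="category",
                               values="sales", aggfunc="sum")
print(rollup, slice_2005, dice, pivot, sep="\n\n")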
A Star-Net Query Model

● The querying of multidimensional databases can be based


on a starnet model.
● It consists of radial lines emanating from a central point,
where each line represents a concept hierarchy for a
dimension.
● Each abstraction level in the hierarchy is called a footprint.

● These represent the granularities available for use by OLAP


operations such as drill-down and roll-up.

67
A Star-Net Query Model

68
A Star-Net Query Model
■ Four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time,
respectively
■ footprints representing abstraction levels of the
dimension - time line has four footprints: “day,”
“month,” “quarter,” and “year.”
■ Concept hierarchies can be used to generalize data by
replacing low-level values (such as “day” for the time
dimension) by higher-level abstractions (such as “year”)
or
■ to specialize data by replacing higher-level abstractions
with lower-level values.

69
Data Warehouse Design and Usage

A Business Analysis Framework for Data


Warehouse Design:
■ To design an effective data warehouse we need to
understand and analyze business needs and construct a
business analysis framework.

■ Different views are combined to form a complex


framework.

70
Data Warehouse Design and Usage
■ Four different views regarding a data warehouse design
must be considered:
■ Top-down view
■ allows the selection of the relevant information
necessary for the data warehouse (matches current
and future business needs).
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems.
■ Documented at various levels of detail and accuracy,
from individual data source tables to integrated data
source tables.
■ Modeled in ER model or CASE (computer-aided
software engineering).
71
Data Warehouse Design and Usage
■ Data warehouse view
■includes fact tables and dimension tables.
■It represents the information that is stored inside the
data warehouse, including
■precalculated totals and counts,
■information regarding the source, date, and time
of origin, added to provide historical context.
■ Business query view
■is the data perspective in the data warehouse from
the end-user’s viewpoint.

72
Data Warehouse Design and Usage
■ Skills required to build & use a Data warehouse
■ Business Skills
■ how systems store and manage their data,
■ how to build extractors (operational DBMS to DW)
■ how to build warehouse refresh software(update)
■ Technology skills
■ the ability to discover patterns and trends,
■ to extrapolate trends based on history and look
for anomalies or paradigm shifts, and
■ to present coherent managerial recommendations
based on such analysis.
■ Program management skills
■ Interface with many technologies, vendors, and end-
users in order to deliver results in a timely and cost
effective manner 73
Data Warehouse Design and Usage
Data Warehouse Design Process
■ A data warehouse can be built using
■ Top-down approach (overall design and planning)
■ It is useful in cases where the technology is
mature and well known
■ Bottom-up approach(starts with experiments & prototypes)
■ a combination of both.
■ From a software engineering point of view, a data warehouse can be developed using the waterfall model or the spiral model
■ Waterfall model: structured and systematic analysis at each step, proceeding from one step to the next: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse
■ Spiral model: rapid generation of increasingly functional systems, with short intervals between successive releases; turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner, making it a good choice for data warehouse development
74
Data Warehouse Design and Usage
Data Warehouse Design Process
■4 major Steps involved in Warehouse design are:
■1. Choose a business process to model (e.g., orders,
invoices, shipments, inventory, account administration,
sales, or the general ledger).
■Data warehouse model - If the business process is
organizational and involves multiple complex object
collections
■Data mart model - if the process is departmental and
focuses on the analysis of one kind of business
process

75
Data Warehouse Design and Usage

■ 2. Choose the business process grain


■ Fundamental, atomic level of data to be represented
in the fact table
■ (e.g., individual transactions, individual daily
snapshots, and so on).
■ 3. Choose the dimensions that will apply to each
fact table record.
■ Typical dimensions are time, item, customer, supplier,
warehouse, transaction type, and status.
■ 4. Choose the measures that will populate each
fact table record.
■ Typical measures are numeric additive quantities like
dollars sold and units sold.

76
Data Warehouse Design and Usage
Data Warehouse Usage for Information Processing
■ Evolution of DW takes place throughout a number of
phases.
■ Initial Phase - DW is used for generating reports and
answering predefined queries.
■ Progressively - to analyze summarized and detailed data,
(results are in the form of reports and charts)
■ Later - for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-
dice operations.
■ Finally - for knowledge discovery and strategic decision
making using data mining tools.

77
Data Warehouse Implementation

78
Data warehouse implementation

■ OLAP servers demand that decision support queries be


answered in the order of seconds.

■ Methods for the efficient implementation of data


warehouse systems.
■ 1. Efficient data cube computation.

■ 2. OLAP data indexing (bitmap or join indices )

■ 3. OLAP query processing

■ 4. Various types of warehouse servers for OLAP

processing.

79
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ Requires efficient computation of aggregations


across many sets of dimensions.
■ In SQL terms:
■ Aggregations are referred to as group-by’s.
■ Each group-by can be represented by a cuboid,
■ set of group-by’s forms a lattice of cuboids
defining a data cube.
■ Compute cube Operator - computes
aggregates over all subsets of the dimensions
specified in the operation.
■ require excessive storage space for large
number of dimensions.
80
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

Example 4.6
■create a data cube for AllElectronics sales that
contains the following:
city, item, year, and sales in dollars.

81
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ What is the total number of cuboids, or group-


by’s, that can be computed for this data cube?
■ 3 attributes: city, item, and year (3 dimensions)
■ sales in dollars: the measure
■ the total number of cuboids, or group-by’s, is 2^3 = 8
■ The possible group-by’s are the following:
■ {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}
■ () - group-by is empty (i.e., the dimensions are not
grouped) - all.
■ group-by’s form a lattice of cuboids for the data cube

82
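The eight group-by’s above can be generated mechanically; the sketch below (made-up rows) brute-forces the compute cube operator by aggregating sales over every subset of {city, item, year}, which amounts to full materialization for this tiny example.

# Brute-force "compute cube": aggregate sales over every subset of the
# dimensions {city, item, year} -- 2**3 = 8 cuboids (made-up rows).
from itertools import combinations
from collections import defaultdict

rows = [
    {"city": "Vancouver", "item": "computer", "year": 2011, "sales": 825},
    {"city": "Vancouver", "item": "phone",    "year": 2011, "sales": 14},
    {"city": "Chicago",   "item": "computer", "year": 2012, "sales": 968},
]
dims = ["city", "item", "year"]

cube = {}
for k in range(len(dims) + 1):
    for group_by in combinations(dims, k):           # one cuboid per subset
        cells = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in group_by)       # () is the apex cuboid ("all")
            cells[key] += r["sales"]
        cube[group_by] = dict(cells)

print(len(cube))          # 8 cuboids
print(cube[()])           # apex: total sales over all dimensions
print(cube[("city",)])    # 1-D cuboid grouped by city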
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

83
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ Base cuboid contains all three dimensions(city, item, year)


■ returns - total sales for any combination of the three
dimensions.
■ This is least generalized (most specific) of the cuboids.
■ Apex cuboid, or 0-D cuboid, refers to the case where
the group-by is empty (contains total sum of all sales)
■ This is most generalized (least specific) of the cuboids
■ Drill Down equivalent
■ start at the apex cuboid and explore downward in the
lattice
■ akin to rolling up
■ start at the base cuboid and explore upward

84
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ zero-dimensional operation:
■ An SQL query containing no group-by
■ Example - “compute the sum of total sales”
■ one-dimensional operation:
■ An SQL query containing one group-by
■ Example - “compute the sum of sales group-by city”

■ A cube operator on n dimensions is equivalent to a


collection of group-by statements, one for each subset of
the n dimensions.

85
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ data cube could be defined as:
■ “define cube sales_cube [city, item, year]:
sum(sales_in_dollars)”
■ 2 power n cuboids - For a cube with n dimensions
■ “compute cube sales_cube” - statement
■ computes the sales aggregate cuboids for all eight
subsets of the set {city, item, year}, including the
empty subset.
■ In OLAP, for diff. queries diff. cuboids need to be
accessed.
■ Precomputation - compute in advance all or at least
some of the cuboids in a data cube
■ curse of dimensionality - required storage space
may explode if all the cuboids in a data cube are
precomputed ( for more dimensions) 86
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ 2^n cuboids, when no concept hierarchy is associated with any dimension

■ How many cuboids are there in an n-dimensional cube where each dimension has an associated concept hierarchy?

Total number of cuboids T = (L1 + 1) × (L2 + 1) × … × (Ln + 1)

■ where Li is the number of levels associated with dimension i (the +1 accounts for the virtual top level, all)

■ If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 5^10 ≈ 9.8 × 10^6.

87
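The cuboid-count formula translates directly into code; the one-line helper below (assuming Python 3.8+ for math.prod) reproduces the 10-dimension, five-level example.

from math import prod

def total_cuboids(levels_per_dim):
    """Total cuboids = product over dimensions of (L_i + 1),
    where L_i excludes the virtual top level 'all'."""
    return prod(l + 1 for l in levels_per_dim)

# 10 dimensions, 5 levels each *including* 'all'  ->  L_i = 4, so (4 + 1) ** 10
print(total_cuboids([4] * 10))      # 9765625, i.e. about 9.8 x 10**6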
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

There are three choices for data cube


materialization for a given base cuboid:

■ 1. No materialization: Do not precompute -


expensive multidimensional aggregates -
extremely slow.

■ 2. Full materialization: Precompute all of the


cuboids - full cube - requires huge amounts of
memory space in order to store all of the
precomputed cuboids.
88
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ 3. Partial materialization: Selectively compute a


proper subset of the whole set of possible cuboids.
■ compute a subset of the cube, which contains only those
cells that satisfy some user-specified criterion - subcube

■ 3 factors to consider:
■ (1) identify the subset of cuboids or subcubes to
materialize;
■ (2) exploit the materialized cuboids or subcubes
during query processing; and
■ (3) efficiently update the materialized cuboids or
subcubes during load and refresh.

89
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ Partial Materialization: Selected Computation of Cuboids

■ Following should take into account during selection of


the subset of cuboids or subcubes

■ the queries in the workload, their frequencies, and


their accessing costs

■ workload characteristics, the cost for incremental


updates, and the total storage requirements.

■ physical database design such as the generation and


selection of indices.
90
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation

■ Heuristic approaches for cuboid and subcube


selection
■ Iceberg cube:
■ data cube that stores only those cube cells with
an aggregate value (e.g., count) that is above
some minimum support threshold.
■ shell cube:
■ precomputing the cuboids for only a small number
of dimensions

91
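A toy sketch of the iceberg-cube idea: compute (city, item) cell counts from a handful of invented transactions and keep only the cells whose count reaches a minimum support threshold.

# Iceberg-cube idea: keep only cube cells whose aggregate (here, a count)
# clears a minimum support threshold (toy transactions, threshold = 2).
from collections import Counter

transactions = [("Vancouver", "computer"), ("Vancouver", "computer"),
                ("Chicago", "phone"), ("Vancouver", "phone")]

min_support = 2
cell_counts = Counter(transactions)                      # (city, item) cells
iceberg = {cell: c for cell, c in cell_counts.items() if c >= min_support}
print(iceberg)    # only ('Vancouver', 'computer') survives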
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index
Index structures - To facilitate efficient data accessing
■ Bitmap indexing method - it allows quick searching in
data cubes.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the attribute’s
domain.
■ If a given attribute’s domain consists of n values, then n
bits are needed for each entry in the bitmap index (i.e.,
there are n bit vectors).
■ If the attribute has the value v for a given row in the
data table, then the bit representing that value is set to 1
in the corresponding row of the bitmap index. All other
bits for that row are set to 0.

92
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index

● Example:- AllElectronics data warehouse


● dim(item)={H,C,P,S} - 4 values - 4 bit vectors
● dim(city)= {V,T} - 2 values - 2 bit vectors
● Better than Hash & Tree Indices but good for low
cardinality only (cardinality:number of unique items in the database column)
93
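The bitmap index can be sketched in a few lines of Python (toy rows, not the slide’s exact table): one bit vector per distinct value of an attribute, and a query such as item = 'H' AND city = 'T' becomes a bitwise AND of two vectors.

# Bitmap index sketch: one bit vector per distinct attribute value
# (toy base table with item and city columns).

base_table = [
    {"item": "H", "city": "V"},
    {"item": "C", "city": "V"},
    {"item": "P", "city": "T"},
    {"item": "S", "city": "T"},
    {"item": "H", "city": "T"},
]

def bitmap_index(rows, attr):
    """Return {value: bit_vector}; bit i is 1 iff row i has that value."""
    values = sorted({r[attr] for r in rows})
    return {v: [1 if r[attr] == v else 0 for r in rows] for v in values}

item_idx = bitmap_index(base_table, "item")   # 4 distinct values -> 4 bit vectors
city_idx = bitmap_index(base_table, "city")   # 2 distinct values -> 2 bit vectors

# The query "item = 'H' AND city = 'T'" becomes a bitwise AND of two vectors.
answer = [a & b for a, b in zip(item_idx["H"], city_idx["T"])]
print(answer)     # [0, 0, 0, 0, 1] -> only the last row qualifies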
Exercise: Bitmap index on CITY

94
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Traditional indexing maps the value in a given
column to a list of rows having that value.
■ Join indexing registers the joinable rows of
two relations from a relational database.
■ For example,
■ two relations - R(RID, A) and S(B, SID)
■ join on the attributes A and B,
■ join index record contains the pair (RID, SID),
■ where RID and SID are record identifiers from
the R and S relations, respectively

95
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Advantage:-
■ Identification of joinable tuples without performing
costly join operations.
■ Useful:-
■ To maintain the relationship between a foreign
key(fact table) and its matching primary
keys(dimension table), from the joinable relation.
■ Indexing maintains relationships between attribute
values of a dimension (e.g., within a dimension table)
and the corresponding rows in the fact table.
■ Composite join indices: Join indices with multiple
dimensions.

96
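A minimal join-index sketch with invented rows: the index records which sales fact rows join with which location dimension rows, so a per-city aggregate can be answered from the index without redoing the join.

# Join-index sketch: precompute which fact rows join with which dimension rows,
# so the join can be answered without re-scanning both tables (toy data).

location_dim = [                       # RID -> dimension row
    {"location_key": 1, "city": "Vancouver"},
    {"location_key": 2, "city": "Chicago"},
]
sales_fact = [                         # SID -> fact row (foreign key: location_key)
    {"location_key": 1, "dollars_sold": 250.0},
    {"location_key": 2, "dollars_sold": 120.0},
    {"location_key": 1, "dollars_sold": 980.0},
]

# Build the join index: pairs (dimension RID, fact SID) that join on location_key.
join_index = [(rid, sid)
              for rid, d in enumerate(location_dim)
              for sid, f in enumerate(sales_fact)
              if d["location_key"] == f["location_key"]]
print(join_index)                      # [(0, 0), (0, 2), (1, 1)]

# Using it: total sales for Vancouver without performing the join again.
vancouver_rid = 0
total = sum(sales_fact[sid]["dollars_sold"]
            for rid, sid in join_index if rid == vancouver_rid)
print(total)                           # 1230.0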
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Example:-Star Schema
■ “sales_star [time, item, branch, location]: dollars_sold
= sum (sales_in_dollars).”
■ join index is relationship between
■ Sales fact table and
■ the location, item dimension tables

To speed up query processing - join indexing & bitmap indexing methods


can be integrated to form bitmapped join indices. 97
Data warehouse implementation:
Efficient processing of OLAP queries
Given materialized views, query processing should proceed as
follows:
■ 1. Determine which operations should be performed
on the available cuboids:
■ This involves transforming any selection, projection,
roll-up (group-by), and drill-down operations specified
in the query
into
corresponding SQL and/or OLAP operations.
■ Example:
■ slicing and dicing a data cube may correspond to
selection and/or projection operations on a
materialized cuboid.
98
Data warehouse implementation:
Efficient processing of OLAP queries
■ 2. Determine to which materialized cuboid(s) the
relevant operations should be applied:
■ pruning the set using knowledge of
“dominance” relationships among the cuboids,
■ estimating the costs of using the remaining
materialized cuboids, and selecting the cuboid with
the least cost.

99
Data warehouse implementation:
Efficient processing of OLAP queries
Example:-
■define a data cube for AllElectronics of the
form “sales cube [time, item, location]:
sum(sales in dollars).”
■ dimension hierarchies
“day < month < quarter < year” for time;

“item_name < brand < type” for item


“street < city < province or state < country”


for location
■ Query: {brand, province or state}, with the selection constant “year = 2010.”
100


Data warehouse implementation:
Efficient processing of OLAP queries
■ suppose that there are four materialized cuboids available, as follows:
■ cuboid 1: {year, item_name, city}
■ cuboid 2: {year, brand, country}
■ cuboid 3: {year, brand, province_or_state}
■ cuboid 4: {item_name, province_or_state}, where year = 2010
■ Which of these four cuboids could be used to process the query? Ans: 1, 3, 4
■ Lowest-cost cuboid to process the query? Ans: 4

101
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

■ Relational OLAP (ROLAP) servers:


■ROLAP uses relational tables to store data for
online analytical processing
■Intermediate servers that stand in
between a relational back-end server and
client front-end tools.
■Operation:
■ use a relational or extended-relational DBMS to
store and manage warehouse data
■ OLAP middleware to support missing pieces
■ ROLAP has greater scalability than MOLAP.
■ Example:-
■ DSS server of Microstrategy 102
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

■ Multidimensional OLAP (MOLAP) servers:


support multidimensional data views through array-

based multidimensional storage engines


■maps multidimensional views directly to data cube
array structures.
■ Advantage:
■ fast indexing to precomputed summarized data.
■ adopt a two-level storage representation
■ Denser subcubes are stored as array structures
■ Sparse subcubes employ compression
technology
A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non-
zeros.

103
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

■ Hybrid OLAP (HOLAP) servers:


■ Combines ROLAP and MOLAP technology
■ benefits
■ greater scalability from ROLAP and
■ faster computation of MOLAP.
■ HOLAP server may allow
■ large volumes of detailed data to be stored in a
relational database,
■ while aggregations are kept in a separate MOLAP store.
■ Example:- Microsoft SQL Server 2000 (supports)
■ Specialized SQL servers:
■ provide advanced query language and query
processing support for SQL queries over star and
snowflake schemas in a read-only environment. 104
From Data Warehousing to Data Mining

105
From DataWarehousing to Data Mining
DataWarehouse Usage
■Data warehouses and data marts are used in a
wide range of applications.
■ Business executives use the data in data warehouses
and data marts to perform data analysis and make
strategic decisions.
■ data warehouses are used as an integral part of a
plan-execute-assess “closed-loop” feedback
system for enterprise management.
■ Data warehouses are used extensively in banking and
financial services, consumer goods and retail
distribution sectors, and controlled manufacturing,
such as demand-based production.
106
DataWarehouse Usage
■ There are three kinds of data warehouse
applications:
■information processing
■analytical processing
■data mining

107
DataWarehouse Usage

■ Information processing supports


■querying,
■basic statistical analysis, and
■reporting using crosstabs, tables, charts, or
graphs.
■ Analytical processing supports
■basic OLAP operations,
■ slice-and-dice, drill-down, roll-up, and pivoting.
■ It generally operates on historic data in both
summarized and detailed forms.
■ multidimensional data analysis
108
DataWarehouse Usage

■ Data mining supports


■knowledge discovery by finding hidden
patterns and associations,
■constructing analytical models,
■performing classification and prediction, and
■presenting the mining results using
visualization tools.
■ Note:-
■Data Mining is different with Information
Processing and Analytical processing
109
From Online Analytical Processing
to Multidimensional Data Mining
■ On-line analytical mining (OLAM) (also called OLAP
mining) integrates on-line analytical processing (OLAP)
with data mining and mining knowledge in
multidimensional databases.
■ OLAM is particularly important for the following reasons:
■ High quality of data in data warehouses.
■ Available information processing infrastructure
surrounding data warehouses
■ OLAP-based exploratory data analysis:
■ On-line selection of data mining functions

110
Architecture for On-Line Analytical
Mining
■ An OLAM server performs analytical mining in data
cubes in a similar manner as an OLAP server performs
on-line analytical processing.
■ An integrated OLAM and OLAP architecture is shown in
Figure, where the OLAM and OLAP servers both accept
user on-line queries (or commands) via a graphical user
interface API and work with the data cube in the data
analysis via a cube API.
■ The data cube can be constructed by accessing and/or
integrating multiple databases via an MDDB API and/or
by filtering a datawarehouse via a database API that may
support OLE DB or ODBC connections.

111
112
Data Mining
&
Motivating Challenges

UNIT - II

By
M. Rajesh Reddy
WHAT IS DATA MINING?

• Data mining is the process of automatically discovering


useful information in large data repositories.
• To find novel and useful patterns that might
otherwise remain unknown.
• provide capabilities to predict the outcome of a future
observation,
• Example
• predicting whether a newly arrived customer will spend
more than $100 at a department store.
WHAT IS DATA MINING?

• Not all information discovery tasks are considered to be


data mining.
• For example, tasks related to the area of information
retrieval.
• looking up individual records using a database
management system
or
• finding particular Web pages via a query to an Internet
search engine
• To enhance information retrieval systems.
WHAT IS DATA MINING?

Data Mining and Knowledge


• Data mining is an integral part of Knowledge Discovery in
Databases (KDD),
• process of converting raw data into useful
information
• This process consists of a series of transformation
steps
WHAT IS DATA MINING?

• Preprocessing - to transform the raw input data into an


appropriate format for subsequent analysis.
• Steps involved in data preprocessing
• Fusing (joining) data from multiple sources,
• cleaning data to remove noise and duplicate
observations
• selecting records and features that are relevant to the
data mining task at hand.
• most laborious and time-consuming step
WHAT IS DATA MINING?

• Post Processing:
• only valid and useful results are incorporated into the
decision support system.

• Visualization
• allows analysts to explore the data and the data
mining results from a variety of viewpoints.

• Statistical measures or hypothesis testing methods can


also be applied
• to eliminate spurious (false or fake) data mining
results.
Motivating Challenges:

• challenges that motivated the development of data


mining.
• Scalability

• High Dimensionality

• Heterogeneous and Complex Data

• Data Ownership and Distribution

• Non-traditional Analysis
Motivating Challenges:

• Scalability
• Size of datasets are in the order of GB, TB or PB.

• special search strategies

• implementation of novel data structures ( for efficient

access)

• out-of-core algorithms - for large datasets

• sampling or developing parallel and distributed algorithms.


Motivating Challenges:

• High Dimensionality
• common today - data sets with hundreds or thousands
of attributes
• Example
• Bio-Informatics - microarray technology has
produced gene expression data involving
thousands of features.
• Data sets with temporal or spatial components
also tend to have high dimensionality.
• a data set that contains measurements of
temperature at various locations.
Motivating Challenges:

Heterogeneous and Complex Data


• Traditional data analysis methods - data sets - attributes
of the same type - either continuous or categorical.
• Examples of such non-traditional types of data include
• collections of Web pages containing semi-structured
text and hyperlinks;
• DNA data with sequential and three-dimensional
structure and
• climate data with time series measurements
• DM should maintain relationships in the data, such as
• temporal and spatial autocorrelation,
• graph connectivity, and
• parent-child relationships between the elements in
semi-structured text and XML documents.
Motivating Challenges:

• Data Ownership and Distribution


• Data is not stored in one location or owned by one organization
• geographically distributed among resources belonging to multiple
entities.
• This requires the development of distributed data mining techniques.
• key challenges in distributed data mining algorithms
• (1) reduction in the amount of communication needed
• (2) effective consolidation of data mining results obtained from
multiple sources, and
• (3) Data security issues.
Motivating Challenges:

• Non-traditional Analysis:
• Traditional statistical approach: hypothesize-and-test paradigm.
• A hypothesis is proposed,
• an experiment is designed to gather the data, and
• then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks
• Generation and evaluation of thousands of hypotheses,
• Some DM techniques automate the process of hypothesis
generation and evaluation.
• Some data sets frequently involve non-traditional types of data
and data distributions.
Origins of Data mining,
Data mining Tasks
&
Types of Data
Unit - II

DWDM
The Origins of Data Mining

Data mining draws upon ideas, such as


■ (1) sampling, estimation, and hypothesis testing from statistics and
■ (2) search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning.
The Origins of Data Mining

■ adopt ideas from other areas, including


– optimization,
– evolutionary computing,
– information theory,
– signal processing,
– visualization, and
– information retrieval
The Origins of Data Mining

■ An optimization algorithm is a procedure which is executed iteratively by


comparing various solutions till an optimum or a satisfactory solution is
found.
■ Evolutionary Computation is a field of optimization theory where instead of
using classical numerical methods to solve optimization problems, we use
inspiration from biological evolution to ‘evolve’ good solutions
– Evolution can be described as a process by
which individuals become ‘fitter’ in different
environments through adaptation,
natural selection, and selective breeding.

picture of the famous finches Charles Darwin depicted


in his journal
The Origins of Data Mining

■ Information theory is the scientific study of the quantification, storage,


and communication of digital information.
■ The field was fundamentally established by the works of Harry
Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s.
■ The field is at the intersection of probability theory, statistics, computer
science, statistical mechanics, information engineering, and electrical
engineering.
The Origins of
Data Mining
■ Other Key areas:
– database systems
■ to provide support for efficient storage, indexing, and query processing.
– Techniques from high performance (parallel) computing
■ addressing the massive size of some data sets.
– Distributed techniques
■ also help address the issue of size and are essential when the data cannot
be gathered in one location.
Data Mining Tasks
■ Data mining tasks are generally divided into two major categories:
– Predictive tasks. - Use some variables to predict unknown or future
values of other variables
■ Task Objective: predict the value of a particular attribute based on the
values of other attributes.
■ Target/Dependent Variable: attribute to be predicted
■ Explanatory or independent variables: attributes used for making the
prediction
– Descriptive tasks. - Find human-interpretable patterns that
describe the data.
■ Task objective: derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data.
■ Descriptive data mining tasks are often exploratory in nature and
frequently require post processing techniques to validate and explain the
results.
Trajectory data mining enables prediction of the moving-location details of humans, vehicles, animals, and so on.
Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior.

Data Mining Tasks

■ Correlation is a statistical term describing the degree to which two variables


move in coordination with one another.
■ Trends: a general direction in which something is developing or
changing.(meaning)
■ Clusters
– Clustering is the task of
data points into a number of groups
such that data points in the same groups
are more similar to other data points
in the same group
than those in other groups

https://www.javatpoint.com/data-mining-cluster-
analysis
Data Mining Tasks …

[Figure: the core data mining tasks applied to a data set. Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar]


Data Mining Tasks
■ Predictive modeling refers to the task of building a model for the target variable as a
function of the explanatory variables.
■ 2 types of predictive modeling tasks:
– Classification: Used for discrete target variables
– Regression: used for continuous target variables.
– Example:
■ Classification Task : predicting whether a Web user will make a purchase at an online
bookstore is a classification task because the target variable is binary-valued.
■ Regression Task: forecasting the future price of a stock is a regression task because price
is a continuous-valued attribute.
– Goal of both tasks: learn a model that minimizes the error between the predicted and
true values of the target variable.
– Predictive modeling can be used to:
■ identify customers that will respond to a marketing campaign,
■ predict disturbances in the Earth’s ecosystem, or
■ judge whether a patient has a particular disease based on the results of medical tests.
Data Mining Tasks
■ Example: (Predicting the Type of a Flower): the task of predicting a species of flower
based on the characteristics of the flower.
■ Iris species: Setosa, Versicolour, or Virginica.
■ Requirement: need a data set containing the characteristics of various flowers of these
three species.
■ 4 other attributes(dataset): sepal width, sepal length, petal length, and petal width.
■ Petal width is broken into the categories low, medium, and high, which correspond to the
intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively.
■ Also, petal length is broken into categories low, medium, and high, which correspond to the
intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.
■ Based on these categories of petal width and length, the following rules can be derived:
– Petal width low and petal length low implies Setosa.
– Petal width medium and petal length medium implies Versicolour.
– Petal width high and petal length high implies Virginica.
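The three rules just listed can be written as a tiny rule-based classifier; the sketch below is only an illustration of the slide’s rule set, with the interval boundaries taken from the categories above.

# The slide's three petal-width/petal-length rules written as a tiny classifier.
def classify_iris(petal_width, petal_length):
    """Map petal measurements (cm) to a species using the category rules above."""
    def bucket(value, medium_cut, high_cut):
        if value < medium_cut:
            return "low"
        elif value < high_cut:
            return "medium"
        return "high"

    width = bucket(petal_width, 0.75, 1.75)    # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = bucket(petal_length, 2.5, 5.0)    # [0, 2.5),  [2.5, 5),     [5, inf)

    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified by these rules"

print(classify_iris(0.2, 1.4))   # Setosa
print(classify_iris(1.3, 4.5))   # Versicolour
print(classify_iris(2.1, 5.8))   # Virginica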
Data Mining Tasks

■ Example: (Predicting the Type of a Flower):


Data Mining Tasks

Example: (Predicting the Type of a Flower)
Data Mining Tasks
■ Association analysis
– used to discover patterns that describe strongly associated features in the
data.
– Discovered patterns are represented in the form of implication rules or
feature subsets.
– Goal of association analysis:
■ To extract the most interesting patterns in an efficient manner.
– Example
■ finding groups of genes that have related functionality,
■ identifying Web pages that are accessed together, or
■ understanding the relationships between different elements of Earth’s climate
system.
Data Mining Tasks
■ Association analysis
■ Example (Market Basket Analysis).
– AIM: find items that are frequently bought together by customers.
– Association rule {Diapers} −→ {Milk},
■ suggests that customers who buy diapers also tend to buy milk.
■ This rule can be used to identify potential cross-selling opportunities among related
items.

The transactions data collected at the checkout counters of a grocery store.
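A minimal sketch of how support and confidence for the rule {Diapers} → {Milk} would be computed over a handful of invented baskets (the transactions table itself is not reproduced here).

# Support and confidence for the rule {Diapers} -> {Milk} on toy transactions.
transactions = [
    {"Bread", "Butter"},
    {"Diapers", "Milk", "Beer"},
    {"Diapers", "Milk"},
    {"Diapers", "Bread"},
    {"Milk", "Butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)
diapers = sum(1 for t in transactions if "Diapers" in t)

support = both / n            # fraction of all baskets containing both items
confidence = both / diapers   # of the diaper baskets, how many also contain milk
print(f"support={support:.2f}, confidence={confidence:.2f}")   # 0.40, 0.67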


Data Mining Tasks
■ Cluster analysis
– Cluster analysis seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar than
observations that belong to other clusters.
– Clustering has been used to
■ group sets of related customers,
■ find areas of the ocean that have a significant impact on the Earth’s climate, and
■ compress data.
Data Mining Tasks
■ Cluster analysis
– Example 1.3 (Document Clustering)
– Each article is represented as a set of word-frequency pairs (w, c),
■ where w is a word and
■ c is the number of times the word appears in the article.
– There are two natural clusters in the data set.
– First cluster -> first four articles (news about the economy)
– Second cluster-> last four articles ( news about health care)
– A good clustering algorithm should be able to identify these two clusters
based on the similarity between words that appear in the articles.
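As a sketch of the idea (word counts invented), each article below is a dict of (word, count) pairs; cosine similarity then assigns every article to whichever of two seed articles it most resembles, recovering the economy and health-care groups.

# Each article is a set of (word, count) pairs; cosine similarity then groups
# the articles around two seed articles (all counts invented).
from math import sqrt

articles = {
    "econ1":   {"dollar": 5, "growth": 3, "market": 4},
    "econ2":   {"market": 6, "growth": 2, "trade": 3},
    "health1": {"patient": 4, "care": 5, "hospital": 2},
    "health2": {"care": 3, "patient": 6, "drug": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

seeds = ["econ1", "health1"]
clusters = {s: [] for s in seeds}
for name, vec in articles.items():
    best = max(seeds, key=lambda s: cosine(vec, articles[s]))
    clusters[best].append(name)
print(clusters)   # economy articles group together, health articles group together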
Data Mining Tasks
■ Anomaly Detection:
– Task of identifying observations whose characteristics are significantly
different from the rest of the data.
– Such observations are known as anomalies or outliers.
– A good anomaly detector must have a high detection rate and a low false alarm
rate.
– Applications of anomaly detection include
■ the detection of fraud,
■ network intrusions,
■ unusual patterns of disease, and
■ ecosystem disturbances

https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffi
c.png
Data Mining Tasks

■ Anomaly Detection:
– Example 1.4 (Credit Card Fraud Detection).
– A credit card company records the transactions made by every credit card
holder, along with personal information such as credit limit, age, annual income,
and address.
– Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be
applied to build a profile of legitimate transactions for the users.
– When a new transaction arrives, it is compared against the profile of the user. If
the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
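A minimal sketch of the profile idea with invented amounts: the user’s historical transaction amounts define a mean and standard deviation, and a new amount is flagged when it falls more than a chosen number of standard deviations away.

# Flag a new credit-card transaction whose amount deviates strongly from the
# user's historical profile (toy amounts, threshold chosen arbitrarily).
from statistics import mean, stdev

history = [23.5, 41.0, 18.2, 35.9, 29.4, 52.1, 26.8]   # user's past amounts
mu, sigma = mean(history), stdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag the transaction if it lies more than `threshold` std devs from the mean."""
    z = abs(amount - mu) / sigma
    return z > threshold

print(is_suspicious(30.0))     # False: consistent with the profile
print(is_suspicious(900.0))    # True: flagged as potentially fraudulent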
Types of Data

■ Data set - collection of data objects.


■ Other names for a data object are:-
– record,
– point,
– vector,
– pattern,
– event,
– case,
– sample,
– observation, or
– entity.
Types of Data

■ Data objects are described by a number of attributes that


capture the basic characteristics of an object.
■ Example:-
– mass of a physical object or
– time at which an event occurred.
■ Other names for an attribute are:-
– variable,
– characteristic,
– field,
– feature, or
– dimension.
Types of Data

■ Example:-
■ Dataset - Student Information.
■ Each row corresponds to a student.
■ Each column is an attribute that describes some aspect of a
student.
Types of Data

■ Attributes and Measurement


– An attribute is a property or characteristic of an object
that may vary, either from one object to another or from
one time to another.
– Example,
■ eye color varies from person to person, while the
temperature of an object varies over time.
– Eye color is a symbolic attribute with a small number of
possible values {brown, black, blue, green, hazel, etc.},
– Temperature is a numerical attribute with a potentially
unlimited number of values.
Types of Data

■ Attributes and Measurement


– A measurement scale is a rule (function) that associates
a numerical or symbolic value with an attribute of an
object.
– process of measurement
■ application of a measurement scale to associate a
value with a particular attribute of a specific object.
Properties of Attribute Values
■ The type of an attribute depends on which of the following
properties it possesses:
■ Distinctness: = ≠
■ Order: < >
■ Addition: + ‐
■ Multiplication: * /

■ Nominal attribute: distinctness


■ Ordinal attribute: distinctness & order
■ Interval attribute: distinctness, order & addition
■ Ratio attribute: all 4 properties
Types of Data
■ Properties of Attribute Values
– Nominal - attributes to differentiate between one object
and another.
– Roll, EmpID
– Ordinal - attributes to order the objects.
– Rankings, Grades, Height
– Interval - measured on a scale of equal size units
– no true (absolute) zero point
– Temperatures in C & F, Calendar Dates
– Ratio - numeric attribute with an inherent zero-point.
– value as being a multiple (or ratio) of another
value.
– Weight, No. of Staff, Income/Salary
Types of Data Properties of Attribute Values
Types of Data
Properties of Attribute Values - Transformations
– yielding the same results when the attribute is
transformed using a transformation that preserves
the attribute’s meaning.
– Example:-
■ the average length of a set of objects is different
when measured in meters rather than in feet, but
both averages represent the same length.
Types of Data
Properties of Attribute Values - Transformations
Types of Data
Attribute Types

Data
■ Qualitative / Categorical (no properties of integers): Nominal, Ordinal
■ Quantitative / Numeric (properties of integers): Interval, Ratio
Types of Data
■ Describing Attributes by the Number of Values
a. Discrete
■ finite or countably infinite set of values.
■ Categorical - zip codes or ID numbers, or
■ Numeric - counts.
■ Binary attributes (special case of discrete)
– assume only two values,
– e.g., true/false, yes/no, male/female, or 0/1.
b. Continuous
■ values are real numbers.
■ Ex:- temperature, height, or weight.
Any of the measurement scale types—nominal, ordinal, interval, and ratio—could be combined
with any of the types based on the number of attribute values—binary, discrete, and continuous.
Types of Data - Types of Dataset
General Characteristics of Data Sets
■ 3 characteristics that apply to many data sets are:-
– dimensionality,
– sparsity, and
– resolution.
■ Dimensionality - number of attributes that the objects in the data set possess.
– data sets with a small number of dimensions tend to be qualitatively
different from moderate or high-dimensional data.
– curse of dimensionality & dimensionality reduction.
■ Sparsity - data sets, with asymmetric features, most attributes of an object
have values of 0;
– fewer than 1% of the entries are non-zero.
■ Resolution - Data will be gathered at different levels of resolution
– Example:- the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens of kilometers.
Types of Data - Types of Dataset
■ Record Data
– data set is a collection of records (data objects), each of which consists of
a fixed set of data fields (attributes).
– No relationships b/w records
– Same attributes for all records
– Flat files or relational DB.
Types of Data - Types of Dataset
■ Transaction or Market Basket Data
– special type of record data
– Each record (transaction) involves a set of items.
– Also called market basket data because the items in each record are the
products in a person’s “market basket.”
– Can be viewed as a set of records whose fields are asymmetric attributes.
Types of Data - Types of Dataset
■ Data Matrix / Pattern Matrix
– fixed set of numeric attributes,
– Data objects = points (vectors) in a multidimensional space
– each dimension = a distinct attribute describing the object.
– A set of such data objects can be interpreted as
■ an m by n matrix,
– where there are
– m rows, one for each object,
– and n columns, one for each attribute.
– Standard matrix operation can be applied to transform and manipulate the
data.
Types of Data - Types of Dataset
■ Sparse Data Matrix:
– Special case of a data matrix

– attributes are of the


■ same type and
■ asymmetric; i.e., only non-zero values are important.
– Example:-
■ Transaction data which has only 0–1 entries.
■ Document Term Matrix - collection of term vector
– One Term vector represents - one document ( one row in matrix)
– Attribute of vector - each term in the document ( one col in matrix)
– value in term vector under an attribute is number of times the
corresponding term occurs in the document.
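A minimal sketch of building such a document-term (sparse) matrix, assuming a recent scikit-learn is available; the sample sentences are made up for illustration:

# build a document-term matrix with scikit-learn (illustrative sketch)
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the economy is slowing",                     # hypothetical documents
        "interest rates fall as the economy slows",
        "new vaccine reduces flu cases"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)                  # stored as a sparse matrix

print(vectorizer.get_feature_names_out())             # the terms (one column per term)
print(dtm.toarray())                                  # term counts per document (one row per document)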
Types of Data - Types of Dataset
■ Graph based Data:
– Data can be represented in the form of Graph.
– Graphs are used for 2 specific reasons
■ (1) the graph captures relationships among data objects and
■ (2) the data objects themselves are represented as graphs.
– Data with Relationships among Objects
■ Relationships among objects also convey important information.
■ Relationships among objects are captured by the links between objects
and link properties, such as direction and weight.
■ Example:
– Web page in www contain both text and links to other pages.
– Web search engines collect and process Web pages to extract their
contents.
– Links to and from each page provide a great deal of information
about the relevance of a Web page to a query, and thus, must also
be taken into consideration.
Types of Data - Types of Dataset
■ Graph based Data:
– Data with Relationships among Objects
■ Example:
– Web page in www contain both text and links to other pages.
Types of Data - Types of Dataset
■ Graph based Data:
– Data with Objects That Are Graphs
■ When objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs.
■ Example:-Structure of chemical compounds
■ Atoms are - nodes
■ Chemical Bonds - links between nodes
– ball-and-stick diagram of the chemical compound benzene,
which contains atoms of carbon (black) and hydrogen (gray).

Substructure mining
Types of Data - Types of Dataset
■ Ordered Data:
– In some data, the attributes have relationships that involve order in time or
space.
– Sequential Data
■ Sequential data / temporal data
■ extension of record data - each record has a time associated with it.
■ Ex:- Retail transaction data set - stores the time of transaction
– time information used to find patterns
■ “candy sales peak before Halloween.”
■ Each attribute - also - time associated
– Record - purchase history of a customer
■ with a listing of items purchased at different times.
– find patterns
■ “people who buy DVD players tend to buy DVDs in the period
immediately following the purchase.”
Types of Data - Types of Dataset
■ Ordered Data: Sequential
Types of Data - Types of Dataset
■ Ordered Data: Sequence Data
– consists of a data set that is a sequence
of individual entities,
– Example
■ sequence of words or letters.
– Example:
■ Genetic information of plants and
animals can be represented in the
form of sequences of nucleotides that
are known as genes.
■ Predicting similarities in the structure
and function of genes from similarities
in nucleotide sequences.
– Ex:- Human genetic code expressed
using the four nucleotides from which all
DNA is constructed: A, T, G, and C.
Types of Data - Types of Dataset
■ Ordered Data: Time Series Data
– Special type of sequential data in
which each record is a time series,
– A series of measurements taken over
time.
– Example:
■ Financial data set might contain
objects that are time series of the
daily prices of various stocks.
– Temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar.
(Figure: time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.)
Types of Data - Types of Dataset
■ Ordered Data: Spatial Data
■ Some objects have spatial attributes,
such as positions or areas, as well as
other types of attributes.
■ An example of spatial data is
– weather data (precipitation,
temperature, pressure) that is
collected for a variety of geographical
locations.
■ spatial autocorrelation; i.e., objects that
are physically close tend to be similar in
other ways as well.
■ Example:
– two points on the Earth that are close to each other usually have similar values for temperature and rainfall.
(Figure: average monthly temperature of land and ocean.)
Data Quality
Unit – II- DWDM
Data Quality

● Data mining applications are applied to data that was collected for another purpose, or for
future, but unspecified applications.
● Data mining focuses on

(1) the detection and correction of data quality problems - Data Cleaning

(2) the use of algorithms that can tolerate poor data quality.

● Measurement and Data Collection Issues


● Issues Related to Applications
Data Quality
● Measurement and Data Collection Issues
● problems due to human error,
● limitations of measuring devices, or
● flaws in the data collection process.
● Values or even entire data objects may be missing.
● Spurious or duplicate objects; i.e., multiple data objects that all correspond to a
single “real” object.
○ Example - there might be two different records for a person who has recently lived at two
different addresses.
● Inconsistencies—
○ Example - a person has a height of 2 meters, but weighs only 2 kilograms.
Data Quality
● Measurement and Data Collection Errors
○ Measurement error - any problem resulting from the measurement process.
■ Value recorded differs from the true value to some extent.
■ Continuous attributes:
● Numerical difference of the measured and true value is called the
error.
○ Data collection error - errors such as omitting data objects or attribute
values, or inappropriately including a data object.
■ For example, a study of animals of a certain species might include animals
of a related species that are similar in appearance to the species of
interest.
Data Quality

● Measurement and Data Collection Errors


○ Noise and Artifacts:
○ Noise is the random component of a measurement error.
○ It may involve the distortion of a value or the addition of spurious objects.
Data Quality
Data Quality

● Measurement and Data Collection Errors


○ Noise and Artifacts:
○ used in connection with data that has a spatial or temporal component.
○ Techniques from signal or image processing can frequently be used to reduce
noise
■ These will help to discover patterns (signals) that might be “lost in the
noise.”
○ Note:Elimination of noise - difficult
■ robust algorithms - produce acceptable results even when noise is present.
Data Quality

● Measurement and Data Collection Errors


○ Noise and Artifacts:
■ Artifacts: Deterministic distortions of the data
■ Data errors may be the result of a more deterministic phenomenon, such
as a streak in the same place on a set of photographs.
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Precision:
■ The closeness of repeated measurements (of the same quantity) to one another.
■ Precision is often measured by the standard deviation of a set of values
○ Bias:
■ A systematic variation of measurements from the quantity being measured.
■ Bias is measured by taking the difference between the mean of the set of values and the
known value of the quantity being measured.
○ Example:
■ standard laboratory weight with a mass of 1g and want to assess the precision and bias of our
new laboratory scale.
■ weigh the mass five times & values are: {1.015, 0.990, 1.013, 1.001, 0.986}.
■ The mean of these values is 1.001, and hence, the bias is 0.001.
■ The precision, as measured by the standard deviation, is 0.013.
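The same numbers can be checked with a few lines of NumPy (a sketch of the calculation, not part of the original example):

# verify the bias and precision of the laboratory scale example
import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                          # mass of the standard weight in grams

bias = measurements.mean() - true_value   # systematic deviation -> 0.001
precision = measurements.std(ddof=1)      # sample standard deviation -> ~0.013

print(bias, precision)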
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Accuracy:
■ The closeness of measurements to the true value of the
quantity being measured.
Data Quality
● Measurement and Data Collection Errors
● Outliers:
○ Outliers are either
■ (1) data objects that, in some sense, have characteristics that
are different from most of the other data objects in the data set,
or
■ (2) values of an attribute that are unusual with respect to the
typical values for that attribute.
○ Alternatively - anomalous objects or values.
Data Quality
● Measurement and Data Collection Errors
● Missing Values:
○ Eliminate Data Objects or Attributes
○ Estimate Missing Values
○ Ignore the Missing Value during Analysis
○ Inconsistent Values
Data Quality
● Measurement and Data Collection Errors
● Duplicate Data: Same Data in multiple Data Objects
○ To detect and eliminate such duplicates, two main issues
must be addressed.
■ First - if two objects represent a single object, then the values of
corresponding attributes may differ, and these inconsistent
values must be resolved
■ Second - care needs to be taken to avoid accidentally combining
data objects that are similar - deduplication
Data Quality: “data is of high quality if it is suitable for its intended use.”
● Issues Related to Applications:
● Timeliness:
○ If the data is out of date, then so are the models and patterns that are based on it.
● Relevance:
○ The available data must contain the information necessary for the application.
○ Consider the task of building a model that predicts the accident rate for drivers. If information about the age and
gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information
is indirectly available through other attributes.
● Knowledge about the Data:
○ Data sets are accompanied documentation that describes different aspects of the data;
○ the quality of this documentation can help in the subsequent analysis.
○ For example,
■ If the documentation is poor, however, and fails to tell us, for example, that the missing values for a
particular field are indicated with a -9999, then our analysis of the data may be faulty.
○ Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval,
ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
DATA PREPROCESSING
Datamining
Unit - II
AGGREGATION

• “less is more”
• Aggregation - combining of two or more objects into a single object.
• In Example,
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces number of records (1 record per store).
• How an aggregate transaction is created
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Can also be viewed as a multidimensional array, where each attribute is a dimension.
• Used in OLAP
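A small pandas sketch of the store-wide aggregation described above; the column names and values are hypothetical:

# aggregate transaction records into one record per store (illustrative)
import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["milk", "bread", "milk", "soda", "bread"],
    "price": [30, 20, 32, 15, 22],
})

storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),           # quantitative attribute: aggregated by sum
    items_sold=("item", lambda s: set(s)),  # qualitative attribute: summarized as the set of items
)
print(storewide)                            # one aggregate record per store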
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION

average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/Technique – which gives high probability of getting a
representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simpler version -equal numbers of objects are drawn from each group even though the groups are of
different sizes.
• Other - the number of objects drawn from each group is proportional to the size of that group.
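A sketch of simple random and stratified sampling with pandas; the data set and group sizes here are hypothetical:

# simple random and stratified sampling with pandas (illustrative)
import pandas as pd

df = pd.DataFrame({"group": ["A"] * 90 + ["B"] * 10,
                   "value": range(100)})

without_replacement = df.sample(n=10, replace=False, random_state=1)
with_replacement    = df.sample(n=10, replace=True,  random_state=1)

# stratified: draw the same number of objects from each prespecified group
stratified_equal = df.groupby("group").sample(n=5, random_state=1)

# stratified: draw a number of objects proportional to the size of each group
stratified_prop = df.groupby("group").sample(frac=0.1, random_state=1)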
SAMPLING

Sampling and Loss of Information


• Larger sample sizes increase the probability that a sample will be representative, but they also eliminate
much of the advantage of sampling.
• Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
SAMPLING

Determining the Proper Sample Size


• Desired outcome: at least one point will be obtained from each cluster.
• Probability of getting one object from each of the 10 groups increases as the sample size runs from 10
to 60.
SAMPLING

• Adaptive/Progressive Sampling:
• Proper sample size - Difficult to determine
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• Eliminates the need to determine the correct sample size initially
• Stop increasing the sample size at leveling-off point(where no
improvement in the outcome is identified).
DIMENSIONALITY REDUCTION

• Data sets can have a large number of features.


• Example
• a set of documents, where each document is represented by a vector
whose components are the frequencies with which each word occurs in
the document.
• thousands or tens of thousands of attributes (components), one for each
word in the vocabulary.
DIMENSIONALITY REDUCTION

• Benefits to dimensionality reduction.


• Data mining algorithms work better if the dimensionality is lower.
• It eliminates irrelevant features and reduce noise
• Lead to a more understandable model
• fewer attributes
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• Reduce the dimensionality of a data set by creating new attributes that are a combination of the old
attributes.
• Feature subset selection or feature selection:
• The reduction of dimensionality by selecting new attributes that are a subset of the old.
DIMENSIONALITY REDUCTION

• The Curse of Dimensionality


• Data analysis become significantly harder as the dimensionality of the data
increases.
• data becomes increasingly sparse
• Classification
• there are not enough data objects to model a class to all possible objects.
• Clustering
• density and the distance between points - becomes less meaningful
DIMENSIONALITY REDUCTION

• Linear Algebra Techniques for Dimensionality Reduction


• Principal Components Analysis (PCA)
• for continuous attributes
• finds new attributes (principal components) that
• (1) are linear combinations of the original attributes,
• (2) are orthogonal (perpendicular) to each other, and
• (3) capture the maximum amount of variation in the data.
• Singular Value Decomposition (SVD)
• Related to PCA
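A minimal PCA sketch with scikit-learn, assuming continuous attributes; the data here is random and only shows the call pattern:

# reduce a 10-dimensional data set to its first 2 principal components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))          # 100 objects, 10 continuous attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # new attributes: linear combinations of the old ones

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variation captured by each component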
FEATURE SUBSET SELECTION
• Another way to reduce the dimensionality - use only a subset of the features.
• Redundant Features
• Example:
• Purchase price of a product and the amount of sales tax paid
• Redundant to each other
• contain much of the same information.

• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.

• Redundant and irrelevant features


• reduce classification accuracy and the quality of the clusters that are found.
• can be eliminated immediately by using common sense or domain knowledge,
• systematic approach - for selecting the best subset of features
• Best approach - try all possible subsets of features as input to the data mining algorithm of interest, and
then take the subset that produces the best results.
FEATURE SUBSET SELECTION

• 3 standard approaches to feature


selection:
• Embedded
• Filter
• Wrapper
FEATURE SUBSET SELECTION
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm.
• During execution of algorithm, the Algorithm itself decides which attributes to use
and which to ignore.
• Example:- Algorithms for building decision tree classifiers

• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.

• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
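As one concrete illustration of a filter approach (features selected before the mining algorithm is run), scikit-learn's SelectKBest scores attributes independently of the final classifier; this is only a sketch of one possible filter, not the only technique:

# filter-style feature selection: keep the k best-scoring attributes
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # score each feature against the class label
X_selected = selector.fit_transform(X, y)

print(selector.scores_)       # per-feature scores (higher = more relevant)
print(X_selected.shape)       # (150, 2) - only the 2 best features are kept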
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.

• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: running the target data mining application, measure the result of the data mining.

• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.

• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another Approach – automatic
• Example – Classification Scheme - Support vector machines
• Other Approach
• The normalization of objects – Cosine Similarity – used as weights
FEATURE CREATION
• Create a new set of attributes that captures the important
information in a data set from the original attributes
• much more effective.
• No. of new attributes < No. of original attributes
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• Higher level features( presence or absence of certain types of edges and areas that are highly correlated with
the presence of human faces), then a much broader set of classification techniques can be applied to this
problem.

• Feature extraction is highly domain-specific


• New area means development of new features and feature extraction
methods.
FEATURE CREATION

Mapping the Data to a New Space


• A totally different view of the data can reveal important and interesting features.
• If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
• If, there are a number of periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect.
• Such patterns can be detected by applying a Fourier transform to the time series in order to
change to a representation in which frequency information is explicit.
• Example:
• Power spectrum that can be computed after applying a Fourier transform to the original time series.
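A sketch of mapping a noisy time series to frequency space with NumPy's FFT; the periodic signal below is synthetic:

# reveal hidden periodic patterns by computing a power spectrum
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500) / 500.0                       # one second sampled 500 times
signal = (np.sin(2 * np.pi * 7 * t)              # 7 Hz periodic pattern
          + 0.5 * np.sin(2 * np.pi * 13 * t)     # weaker 13 Hz pattern
          + rng.normal(0, 1, t.size))            # noise

power = np.abs(np.fft.rfft(signal)) ** 2         # power spectrum of the series
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])   # frequency of each component

# the two strongest frequencies should be close to 7 and 13 Hz
print(sorted(freqs[np.argsort(power)[-2:]]))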
FEATURE CREATION

• Feature Construction
• Features in the original data sets consists necessary information, but not suitable for the data mining
algorithm.
• If new features constructed out of the original features can be more useful than the original features.

• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
DISCRETIZATION AND BINARIZATION

• Classification algorithms, require that the data be in the form of


categorical attributes.
• Algorithms that find association patterns, require that the data be in
the form of binary attributes.
• Discretization - transforming a continuous attribute into a
categorical attribute
• Binarization - transforming both continuous and discrete attributes
into one or more binary attributes
DISCRETIZATION AND BINARIZATION
• Binarization of a categorical attribute (Simple technique):
• If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained
by the assignment.
• Next, convert each of these m integers to a binary number
using n binary attributes
• n = ⌈log2(m)⌉ (ceiling) binary digits are required to represent these
integers
DISCRETIZATION AND BINARIZATION
Example: a categorical variable with 5 values
{awful, poor, OK, good, great}
requires n = ⌈log2(5)⌉ = 3 binary variables x1, x2, and x3.
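A sketch of the technique for this 5-value example in plain Python; the variable name and mapping order are assumptions for illustration:

# binarize an ordinal attribute: value -> integer -> n = ceil(log2(m)) binary digits
import math

values = ["awful", "poor", "OK", "good", "great"]     # m = 5 categorical values
to_int = {v: i for i, v in enumerate(values)}         # order-preserving integer assignment

m = len(values)
n = math.ceil(math.log2(m))                           # 3 binary attributes

def binarize(value):
    bits = format(to_int[value], "0{}b".format(n))    # e.g. 'OK' -> 2 -> '010'
    return [int(b) for b in bits]                     # [x1, x2, x3]

print(binarize("OK"))      # [0, 1, 0]
print(binarize("great"))   # [1, 0, 0]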
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes ( classification or
association analysis)
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• decide no. of categories
• decide how to map the values of the continuous attribute to these
categories.
• Step I: Sort Attribute Values and divide into n intervals by specifying n−1
split points.
• Step II : all the values in one interval are mapped to the same categorical
value.
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes
• Problem of discretization is
• Deciding how many split points to choose and
• where to place them.
• The result can be represented either as
• a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},
where x0 and xn may be −∞ or +∞, respectively,
or
• as a series of inequalities x0 < x ≤ x1, ..., xn−1 < x < xn.
DISCRETIZATION AND BINARIZATION
• UnSupervised Discretization
• Discretization methods for Classification
• Supervised - known class information
• Unsupervised - unknown class information
• Equal width approach:
• divides the range of the attribute into a user-specified number of
intervals each having the same width.
• problem with outliers
• Equal frequency (equal depth) approach:
• Puts same number of objects into each interval
• K-means Clustering method
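A sketch of unsupervised discretization with pandas (equal width via cut, equal frequency via qcut); the data is random and only illustrates the calls:

# equal-width vs. equal-frequency discretization of a continuous attribute
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(50, 15, 200))

equal_width = pd.cut(values, bins=4)       # 4 intervals of the same width
equal_freq  = pd.qcut(values, q=4)         # 4 intervals with roughly the same count

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())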
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization

Original Data
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization

Equal Width Discretization


DISCRETIZATION AND BINARIZATION
UnSupervised Discretization

Equal Frequency Discretization


DISCRETIZATION AND BINARIZATION
UnSupervised Discretization

K-means Clustering (better result)


DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• When additional information (class labels) are used then it
produces better results.
• Some Concerns: purity of an interval and the minimum size of
an interval.
• statistically based approaches:
• start with each attribute value as a separate interval and
create larger intervals by merging adjacent intervals that are
similar according to a statistical test.
• Entropy based approaches:
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Entropy Definition: ei = − Σ (j = 1 to k) pij log2(pij)

• ei - entropy of the i th interval
• pij = mij/mi - probability (fraction) of class j in the i th interval
• k - no. of different class labels
• mi - no. of values in the i th interval of a partition,
• mij - no. of values of class j in interval i.
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Total entropy, e, of the partition is the
• weighted average of the individual interval entropies: e = Σ (i = 1 to n) wi ei
• m - total no. of values,
• wi = mi/m - fraction of values in the i th interval
• n - no. of intervals.
• Perfectly Pure Interval:entropy is 0
• If an interval contains only values of one class
• Impure Interval: entropy is maximum
• classes of values in an interval occur equal
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Simple approach for partitioning a continuous attribute:
• starts by bisecting the initial values so that the resulting
two intervals give minimum entropy.
• consider each value as a possible split point
• Repeat splitting process with another interval
• choosing the interval with the worst (highest) entropy,
• until a user-specified number of intervals is reached,
or
• stopping criterion is satisfied.
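A sketch of the entropy computation used to score a candidate split point; the per-interval class counts below are made up:

# entropy of the intervals produced by a candidate split point
import math

def entropy(class_counts):
    m_i = sum(class_counts)
    e = 0.0
    for m_ij in class_counts:
        if m_ij > 0:
            p_ij = m_ij / m_i              # fraction of class j in the interval
            e -= p_ij * math.log2(p_ij)
    return e

def total_entropy(intervals):
    m = sum(sum(c) for c in intervals)
    return sum((sum(c) / m) * entropy(c) for c in intervals)   # weighted average

# two intervals, counts of [class0, class1] in each (hypothetical)
split = [[9, 1], [2, 8]]
print(entropy([9, 1]))        # nearly pure interval -> low entropy
print(total_entropy(split))   # overall quality of this split point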
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 3 categories for
both x & y
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 5 categories for
both x & y
• Observation:
• no improvement
for 6 categories
DISCRETIZATION AND BINARIZATION
• Categorical Attributes with Too Many Values
• If categorical attribute is an ordinal,
• techniques similar to those for continuous attributes
• If the categorical attribute is nominal,
• Example:-
• University that has a large number of departments.
• department name attribute - dozens of diff. values.
• combine departments into larger groups, such as
• engineering,
• social sciences, or
• biological sciences.
Variable Transformation

• Transformation that is applied to all the values of a variable.


• Example: magnitude of a variable is important
• then the values of the variable can be transformed by taking the absolute
value.
• Simple Function Transformation:
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of such transformations include
• x^k,
• log x,
• e^x,
• √x,
• 1/x,
• sin x, or |x|
Variable Transformation

• Variable transformations should be applied with caution since they


change the nature of the data.
• Example:-
• transformation fun. is 1/x
• if value is 1 or >1 then reduces the magnitude of values
• values {1, 2, 3} go to {1, 1/ 2, 1/3}
• if value is b/w 0 & 1 then increases the magnitude of values
• values {1, 1/2, 1/3} go to {1, 2, 3}.
• so better ask questions such as the following:
• Does the order need to be maintained?
• Does the transformation apply to all values( -ve & 0)?
• What is the effect of the transformation on the values between 0 & 1?
Variable Transformation

• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of “standardizing a variable” in statistics.
• x̄ - mean (average) of the attribute values and
• sx - standard deviation,
• Transformation: x' = (x − x̄) / sx
• creates a new variable that has a mean of 0 and a standard deviation of 1.
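The transformation can be checked with a couple of NumPy lines (a quick sketch; the values are arbitrary):

# standardize a variable to mean 0 and standard deviation 1
import numpy as np

x = np.array([23.0, 31.0, 45.0, 27.0, 52.0, 38.0])   # hypothetical ages
x_std = (x - x.mean()) / x.std()

print(x_std.mean().round(10))   # 0.0
print(x_std.std())              # 1.0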
Variable Transformation

• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• Income values(higher values) will dominate the calculation.
Variable Transformation

• Normalization or Standardization
• Mean and standard deviation are strongly affected by outliers
• Mean is replaced by the median, i.e., the middle value.
• x - variable
• absolute standard deviation of x is σA = Σ (i = 1 to m) |xi − µ|
• xi - i th value of the variable,
• m - number of objects, and
• µ - mean or median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
Measures of
Similarity and
Dissimilarity
Unit - II
Datamining
Measures of Similarity and
Dissimilarity

● Similarity and dissimilarity are important because they are used by a


number of data mining techniques
○ such as
■ clustering,
■ nearest neighbor classification, and
■ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity.
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.
Measures of Similarity and Dissimilarity

● Similarity between two objects is a numerical measure of the


degree to which the two objects are alike.
○ Similarity - high -objects that are more alike.
○ Non-negative
○ between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different.
○ Dissimilarity - low - objects are more similar.
○ Distance - synonym for dissimilarity
Measures of Similarity and Dissimilarity

Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity
○ to transform a proximity measure to fall within a particular range, such as [0,1].
● Example
○ Similarities between objects range from 1 (not at all similar) to 10 (completely
similar)
○ we can make them fall within the range [0, 1] by using the transformation
■ s’ = (s−1)/9
■ s - Original Similarity
■ s’ - New similarity values
Measures of Similarity and Dissimilarity
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects

Euclidean Distance: d(x, y) = sqrt( Σ (k = 1 to n) (xk − yk)^2 ), where n is the number of dimensions (attributes) and xk and yk are the k th attributes of x and y.
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity

(a) d(x, y) ≥ 0 for all x and y,

(b) d(x, y) = 0 only if x = y.

2. Symmetry

d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality

d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Note:-Measures that satisfy all three properties are known as metrics.


Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Set Differences

A = {1, 2, 3, 4} and B = {2, 3, 4},


then A − B = {1} and
B − A = ∅, the empty set.

If d(A, B) = size(A − B), then it does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.

d(A, B) = size(A − B) + size(B − A) (modified which follows all properties)


Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Time
Dissimilarity measure that is not a metric,but still useful.

d(1PM, 2PM) = 1 hour


d(2PM, 1PM) = 23 hours

● Example:- when answering the question: “If an event occurs at 1PM


every day, and it is now 2PM, how long do I have to wait for that event to
occur again?”
Distance in python
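A sketch of computing Euclidean (and general Minkowski) distances with NumPy/SciPy; the two points are arbitrary examples:

# Euclidean and Minkowski distances between two points
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(np.sqrt(np.sum((x - y) ** 2)))      # Euclidean distance from the definition
print(distance.euclidean(x, y))           # same value via SciPy
print(distance.minkowski(x, y, p=1))      # p = 1 -> Manhattan (city block) distance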
Measures of Similarity and Dissimilarity
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
● A Non-symmetric Similarity Measure
○ Classify a small set of characters which is flashed on a screen.
○ Confusion matrix - records how often each character is classified as itself,
and how often each is classified as another character.
○ “0” appeared 200 times but classified as
■ “0” 160 times,
■ “o” 40 times.
○ ‘o’ appeared 200 times and was classified as
■ “o” 170 times
■ “0” only 30 times.
● similarity measure can be made symmetric by setting
○ S`(x, y) = S`(y, x) = (s(x, y)+s(y, x))/2,
■ S` - new similarity measure.
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary
attributes are called similarity coefficients

○ Let x and y be two objects that consist of n binary attributes.

○ The comparison of two objects (or two binary vectors), leads to


the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0


f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
Simple Matching Coefficient (SMC) = (f11 + f00) / (f00 + f01 + f10 + f11)

Jaccard Coefficient (J) = f11 / (f01 + f10 + f11)
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
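A sketch computing the four frequencies and both coefficients for two binary vectors; the vectors are arbitrary examples:

# SMC and Jaccard coefficient for two binary vectors
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)

print(smc, jaccard)    # SMC counts 0-0 matches, Jaccard ignores them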
Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖), where · denotes the vector dot product and ‖x‖ is the length (norm) of vector x.
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)

# import required libraries


import numpy as np
from numpy.linalg import norm

# define two lists or array


A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])

print("A:", A)
print("B:", B)

# compute cosine similarity


cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)

● Cosine similarity - measure of angle between x and y.


● Cosine similarity = 1 (angle is 0◦, and x & y are the same except for magnitude or length)
● Cosine similarity = 0 (angle is 90◦, and x & y do not share any terms (words))
Measures of Similarity and Dissimilarity
Examples of proximity measures

cosine similarity (Document similarity)

Note:-
Dividing x and y by their lengths normalizes them to have a length of 1 ( means magnitude is not
considered)
Measures of Similarity and Dissimilarity
Examples of proximity measures

Extended Jaccard Coefficient (Tanimoto Coefficient)

EJ(x, y) = (x · y) / (‖x‖^2 + ‖y‖^2 − x · y)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation: corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = sxy / (sx sy)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
● The more tightly linear the relationship between two variables X and Y is,
the closer Pearson's correlation coefficient (PCC) is to −1 or +1.
○ PCC = −1, if the relationship is perfectly negative
■ an increase in the value of one variable decreases the value of the other variable.
○ PCC = +1, if the relationship is perfectly positive
■ an increase in the value of one variable increases the value of the other variable.
○ PCC = 0, if the variables are perfectly linearly uncorrelated (no linear relationship).
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation ( scipy.stats.pearsonr() - automatic)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation (manual in python)
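A sketch of both routes mentioned above, scipy.stats.pearsonr() and a manual NumPy computation; the sample vectors are arbitrary:

# Pearson's correlation: library call vs. manual computation
import numpy as np
from scipy.stats import pearsonr

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

r_auto, p_value = pearsonr(x, y)                 # "automatic"

# manual: covariance(x, y) / (std(x) * std(y))
cov = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov / (x.std() * y.std())

print(r_auto, r_manual)                          # both give the same value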
CLASSIFICATION
DATAMINING
UNIT III
BASIC CONCEPTS
• Input data -> a collection of records.
• Record / instance / example -> tuple (x, y)
• x - attribute set
• y - special attribute (class label / category / target attribute)
• Attribute set - properties of a Data Object – Discrete / Continuous
• Class label (y) determines the modeling task:
• Classification – y is a discrete attribute
• Regression (Predictive Modeling Task) – y is a continuous attribute.
BASIC CONCEPTS
• Definition:
• Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y.
• The target function is also known informally as a classification model.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Descriptive modeling: A classification model can serve as an explanatory tool to distinguish
between objects of different classes.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Predictive Modeling:
• A classification model can also be used to predict the class label of unknown records.
• Automatically assigns a class label when presented with the attribute set of an
unknown record.

• Classification techniques are best suited for binary or nominal categories.

• They do not consider the implicit order among ordinal categories
• Relationships among categories (e.g., superclass and subclass) are also ignored
General approach to solving a classification problem
• Classification technique (or classifier)
• Systematic approach to building classification models
from an input data set.
• Examples
• Decision tree classifiers,
• Rule-based classifiers,
• Neural networks,
• Support vector machines, and
• Naive bayes classifiers.

• Learning algorithm
• Used by the classifier
• To identify a model
• That best fits the relationship between the
attribute set and class label of the input data.
General approach to solving a classification problem
• Model
• Generated by a learning algorithm
• Should satisfy the following:
• Fit the input data well
• Correctly predict the class labels of
records it has never seen before.
• Training set
• Consisting of records whose class labels are
known
• used to build a classification model
General approach to solving a classification problem
• Confusion Matrix
• Used to evaluate the performance of a classification model
• Holds details about
• counts of test records correctly and incorrectly predicted by the model.
• Table 4.2 depicts the confusion matrix for a binary classification problem.
• fij – no. of records from class i predicted to be of class j.
• f01 – no. of records from class 0 incorrectly predicted as class 1.
• total no. of correct predictions made (f11 + f00)
• total number of incorrect predictions (f10 + f01).
General approach to solving a classification problem
• Performance Metrics:
1. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
   = number of correct predictions / total number of predictions

2. Error Rate = (f10 + f01) / (f11 + f10 + f01 + f00)
   = number of wrong predictions / total number of predictions
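A sketch of the two metrics computed from the confusion-matrix counts; the counts below are hypothetical:

# accuracy and error rate from a binary confusion matrix
f11, f10, f01, f00 = 45, 5, 10, 40     # hypothetical counts fij

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total         # fraction of correct predictions
error_rate = (f10 + f01) / total       # fraction of wrong predictions

print(accuracy, error_rate)            # 0.85 0.15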
DECISION TREE INDUCTION
Working of Decision Tree
• We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
• Each time we receive an answer, a follow-
up question is asked until we reach a
conclusion about the class label of the
record.
• The series of questions and their possible
answers can be organized in the form of a
decision tree
• Decision tree is a hierarchical structure
consisting of nodes and directed edges.
DECISION TREE INDUCTION
Working of Decision Tree
• Three types of nodes:
• Root node
• No incoming edges
• Zero or more outgoing edges.
• Internal nodes
• Exactly one incoming edge and
• Two or more outgoing edges.
• Leaf or terminal nodes
• Exactly one incoming edge and
• No outgoing edges.

• Each leaf node is assigned a class label.


• Non-terminal nodes (root & other internal nodes)
contain attribute test conditions to separate
records that have different characteristics.
DECISION TREE INDUCTION
Working of Decision Tree
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm:
• basis of many existing decision tree induction algorithms, including
• ID3,
• C4.5, and
• CART.
• Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets.
• Dt - set of training records with node t

• y = {y1, y2,..., yc} -> class labels.


• Hunt’s algorithm.
• Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
• Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the
test condition and the records in Dt are distributed to the children based on the outcomes.
• Note:-algorithm is then recursively applied to each child node.
DECISION TREE INDUCTION
• Example:-predicting whether a loan applicant will repay or not (defaulted)
Building Decision Tree
• Construct a training set by examining the records of previous
borrowers.
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm will work fine
• if every combination of attribute values is present in the training data and
• if each combination has a unique class label.
• Additional conditions
1. If a child nodes is empty(no records in training set) then declare it as a leaf node with the same class
label as the majority class of training records associated with its parent node.
2. Identical attribute values and diff. class label. Not possible to further split. declare node as leaf with the
same class label as the majority class of training records associated with this node.
• Design Issues of Decision Tree Induction
• 1. How should the training records be split?
• Test condition to divide the records into smaller subsets.
• provide a method for specifying the test condition
• measure for evaluating the goodness of each test condition.

• 2. How should the splitting procedure stop?


• A stopping condition is needed
• stop when either all the records belong to the same class or all the records have identical attribute
values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Binary Attributes
• The test condition for a binary attribute generates two potential outcomes.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Nominal Attributes
• nominal attribute can have many values
• Test condition can be expressed in two ways
• Multiway split - number of outcomes depends on the number of distinct values
• Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of
creating a binary partition of k attribute values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Ordinal Attributes
• Ordinal attributes can also produce binary or multiway splits.
• values can be grouped without violating the order property.
• 4.10© is invalid
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Continuous Attributes
• Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes,
or
• Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k.
• Multiway split
• Apply the discretization strategies
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• p(i|t) - fraction of records belonging to class i at a given node t.
• Sometimes – used as only Pi
• Two-class problem
• (p0, p1) - class distribution at any node
• p1 = 1 − p0
• (0.5, 0.5) because there are an equal number of records from each class
• Car Type, will result in purer partitions
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• selection of best split is based on the degree of impurity of the child nodes
• Node with class distribution (0, 1) has zero impurity,
• Node with uniform class distribution (0.5, 0.5) has the highest impurity.
• p - fraction of records that belong to one of the two classes.
• P – maximum(0.5) – class distribution is even
• P- min. (0 or 1)– all records belong to the same class
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• Node N1 has the lowest impurity value, followed by N2 and N3.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• To Determine the performance of test condition – compare the degree of
impurity of the parent node (before splitting) with the degree of impurity
of the child nodes (after splitting).
• The larger their difference, the better the test condition.
• Information Gain: ∆ = I(parent) − Σ (j = 1 to k) [N(vj)/N] · I(vj)

• I(·) - impurity measure of a given node,
• N - total no. of records at the parent node,
• k - no. of attribute values,
• N(vj) - no. of records associated with the child node vj.
• When entropy is used as the impurity measure I(·), the difference in entropy is known as the Information
gain, ∆info
Calculate Impurity using Gini
Find out, which attribute
is selected?
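A sketch of the Gini computation used for such exercises, assuming child-node class counts of [4, 3] and [2, 3] for attribute A (these counts reproduce the 0.4898, 0.480, and 0.486 figures quoted later); the counts for the second split are purely hypothetical:

# Gini index of a node and weighted Gini of a candidate split
def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def weighted_gini(children):
    total = sum(sum(c) for c in children)
    return sum((sum(c) / total) * gini(c) for c in children)

# attribute A splits the 12 records into children with [class0, class1] counts
split_A = [[4, 3], [2, 3]]
print(round(gini([4, 3]), 4))             # 0.4898
print(round(weighted_gini(split_A), 3))   # 0.486

split_B = [[1, 4], [5, 2]]                # another hypothetical split for comparison
print(round(weighted_gini(split_B), 3))   # the smaller value indicates the better split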
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Binary Attributes
○ Before splitting, the Gini index is 0.5
■ because equal number of records
from both classes.
○ If attribute A is chosen to split the
data,
■ Gini index
● node N1 = 0.4898, and
● node N2 = 0.480.
■ Weighted average of the Gini index
for the descendent nodes is
● (7/12) × 0.4898 + (5/12) × 0.480
= 0.486.
○ Weighted average of the Gini index for
attribute B is 0.375.
○ B is selected because of small value
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Nominal Attributes
○ First Binary Grouping
■ Gini index of {Sports, Luxury} is 0.4922 and
■ the Gini index of {Family} is 0.3750.
■ The weighted average Gini index
16/20 × 0.4922 + 4/20 × 0.3750 =
0.468.
○ Second binary grouping of {Sports} and {Family, Luxury},
■ weighted average Gini index is 0.167.

● The second grouping has a


lower Gini index
because its corresponding subsets
are much purer.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Continuous Attributes

● A brute-force method -Take every value of the attribute in the N records as a candidate split position.
● Count the number of records with annual income less than or greater than
v(computationally expensive).
● To reduce the complexity, the training records are sorted based on their annual income,
● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
○ Problem:
■ Customer ID - produce purer partitions.
■ Customer ID is not a predictive attribute because its value is
unique for each record.
○ Two Strategies:
■ First strategy(used in CART)
● restrict the test conditions to binary splits only.
■ Second Strategy(used in C4.5 - Gain Ratio - to determine goodness
of a split)
● modify the splitting criterion
● consider - number of outcomes produced by the attribute test
condition.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio = ∆info / Split Info, where Split Info = − Σ (i = 1 to k) P(vi) log2 P(vi), k is the total number of splits, and P(vi) is the fraction of records assigned to child node vi.
Tree-Pruning
• After building the decision tree,
• Tree-pruning step - to reduce the size of the decision
tree.
• Pruning -
• trims the branches of the initial tree
• improves the generalization capability of the
decision tree.
• Decision trees that are too large are susceptible to a
phenomenon known as overfitting.
Model Overfitting
DWDM Unit-III
Model Overfitting

● Errors generally occur in classification Model are:-


○ Training Errors ( or Resubstitution Error or Apparent Error)
■ No. of misclassification errors Committed on Training data
○ Generalization Errors
■ Expected Error of the model on previously unused records.
● Model Overfitting:
○ Model is overfitting your training data when you see that the model performs well on the training data but
does not perform well on the evaluation (Test) data.
○ This is because the model is memorizing the data it has seen and is unable to generalize to unseen
examples.
Model Overfitting
● Model Underfitting:
● Model is underfitting the training data when the model performs poorly on the training data.
● Model is unable to capture the relationship between the input examples (X) and the target values (Y).

https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting

https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30%

● Humans and dolphins were misclassified


● Spiny anteaters (exceptional case)
● Errors due to exceptional cases are often
Unavoidable and establish the minimum error
rate achievable by any classifier.
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 20%, Test Error - 10%

Model Overfitting

Overfitting Due to Lack of Representative Samples


Overfitting occurs when small number of data training records are available

● Training error is zero, Test Error is 30%


● Humans, elephants, and dolphins are misclassified
● Decision tree classifies all warm-blooded vertebrates
that do not hibernate as non-mammals(because of
eagle - Lack of representative samples).
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Methods commonly used to evaluate the performance of a classifier
○ Hold Out method
○ Random Sub Sampling
○ Cross Validation
■ K-fold
■ Leave-one-out
○ Bootstrap
■ .632 Bootstrap
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Hold Out method
○ Original data - partitioned into two disjoint sets
■ training set
■ test sets
○ A classification model is then induced from the training set
○ Model performance is evaluated on the test set.
○ Analysts can decide the proportion of data reserved for training and for testing
■ e.g., 50-50 or
■ twothirds - training & one-third - testing
○ Limitations
1. Model may not be good because only few records are for Model induction
2. Model may be highly dependent on the composition of the training and test sets.
● training set size=small, then larger the variance of the model.
● training set =too large, then the estimated accuracy of small test set is less reliable.
3. training and test sets are no longer independent

https://www.datavedas.com/holdout-cross-validation/
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Random Sub Sampling
○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance.
○ Overall accuracy: acc_sub = ( Σ (i = 1 to k) acc_i ) / k, where acc_i is the accuracy of the model during the i th iteration and k is the number of iterations.
○ Problems:
■ Does not utilize as much data as possible for training.
■ No control over the number of times each record is used for testing and training.

https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ Alternative to Random Subsampling
○ Each record is used the same number of times for training and exactly once for testing.
○ Two fold cross-validation
■ Partition the data into two equal-sized subsets.
■ one of the subsets for training and the other for testing.
■ Then swap the roles of the subsets

https://fengkehh.github.io/post/introduction-to-cross-validation/ - picture reference


Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ K-Fold Cross Validation
■ k equal-sized partitions
■ During each run,
● one of the partitions is chosen for testing,
● while the rest of them are used for training.
■ Total error is found by summing up the errors for all k runs.

Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
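A sketch of k-fold cross-validation with scikit-learn; the dataset and classifier are just placeholders for illustration:

# 5-fold cross-validation of a decision tree classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # each fold is the test set exactly once

print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # overall cross-validated accuracy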


Model Overfitting - Evaluating the Performance of a Classifier
Cross-validation
● Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly
split up into ‘k’ groups. One of the groups is used as the test set and the rest are
used as the training set. The model is trained on the training set and scored on
the test set. Then the process is repeated until each unique group has been used
as the test set.
● For example, for 5-fold cross validation, the dataset would be split into 5 groups,
and the model would be trained and tested 5 separate times so each group would
get a chance to be the test set. This can be seen in the graph below.
● 5-fold cross validation (image credit)

Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR


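A short k-fold cross-validation sketch with scikit-learn (the dataset and the decision tree model are illustrative assumptions); leave-one-out is shown as the special case k = N.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                     # stand-in dataset (assumption)
model = DecisionTreeClassifier()

# 5-fold: each record is tested exactly once and used for training in the other 4 runs
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print("5-fold accuracies:", scores, "mean:", scores.mean())

# Leave-one-out: k = N, one record per test set (computationally expensive for large N)
print("leave-one-out accuracy:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())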
Model Overfitting - Evaluating the Performance of a Classifier

5-fold cross validation

Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR


Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ leave-one-out approach
■ A special case of the k-fold cross-validation
● sets k = N ( Dataset size)
■ Size of test set = 1 record
■ All remaining records = Training set
■ Advantage
● Utilizing as much data as possible for training
● Test sets are mutually exclusive and they effectively cover the entire data set.
■ Drawback
● computationally expensive

Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR


Model Overfitting
Evaluating the Performance of a Classifier
● Bootstrap
○ Training records are sampled with replacement;
■ A record already chosen for training is put back into the original pool of records so that it is equally likely
to be redrawn.
○ Probability that a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N
■ When N is sufficiently large, this probability asymptotically approaches 1 − e^(−1) ≈ 0.632.
○ On average, a bootstrap sample contains 63.2% of the records of the original data.

● .632 bootstrap accuracy: acc_boot = (1/b) Σ_{i=1}^{b} [0.632 × ε_i + 0.368 × acc_s]
● b - number of bootstrap rounds
● ε_i - accuracy of the ith bootstrap sample, acc_s - accuracy on the full training data
Picture reference - https://bradleyboehmke.github.io/HOML/process.html
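A rough sketch of the .632 bootstrap described above. It assumes ε_i is measured on the records left out of the ith bootstrap sample and acc_s on a model trained on the full data; the dataset and classifier are again illustrative.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # stand-in dataset (assumption)
N, b = len(y), 20
rng = np.random.default_rng(0)

# Expected fraction of distinct records in one bootstrap sample: 1 - (1 - 1/N)^N -> 0.632
print("expected fraction:", 1 - (1 - 1/N) ** N)

acc_s = accuracy_score(y, DecisionTreeClassifier().fit(X, y).predict(X))   # accuracy on the full training data

eps = []
for _ in range(b):
    idx = rng.integers(0, N, size=N)                  # draw N records with replacement
    oob = np.setdiff1d(np.arange(N), idx)             # records not drawn; used here to estimate eps_i
    m = DecisionTreeClassifier().fit(X[idx], y[idx])
    eps.append(accuracy_score(y[oob], m.predict(X[oob])))

acc_boot = np.mean([0.632 * e + 0.368 * acc_s for e in eps])
print(".632 bootstrap accuracy estimate:", acc_boot)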
Bayesian Classifiers
DWDM Unit - III
Bayesian Classifiers

● In many applications the relationship between the attribute set and


the class variable is non-deterministic.
● Example:
○ Risk for heart disease based on the person’s diet and workout
frequency.
● So, Modeling probabilistic relationships between the attribute
set and the class variable.
● Bayes Theorem
Bayesian Classifiers

● Consider a football game between two rival teams: Team 0 and Team 1.
● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining
matches.
● Among the games won by Team 0, only 30% of them come from playing on
Team 1’s football field.
● On the other hand, 75% of the victories for Team 1 are obtained while playing at
home.
● If Team 1 is to host the next match between the two teams, which team will
most likely emerge as the winner?
● This Problem can be solved by Bayes Theorem
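A worked solution sketch for this question using Bayes' theorem (plain arithmetic on the probabilities stated above; nothing extra is assumed).

# Y = winning team (0 or 1), X = 1 if the game is played on Team 1's field
p_y0, p_y1 = 0.65, 0.35            # Team 0 wins 65% of the time, Team 1 the rest
p_x1_given_y0 = 0.30               # 30% of Team 0's wins come from playing on Team 1's field
p_x1_given_y1 = 0.75               # 75% of Team 1's wins are obtained at home

p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0    # law of total probability
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1           # Bayes' theorem
print(round(p_y1_given_x1, 4))     # ~0.5738 > 0.5, so Team 1 is the more likely winner at home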
Bayesian Classifiers

● Bayes Theorem
○ X and Y are random variables.
○ A conditional probability is the probability that a random variable will take on a
particular value given that the outcome for another random variable is known.
○ Example:
■ conditional probability P(Y = y|X = x) refers to the probability that the variable
Y will take on the value y, given that the variable X is observed to have the
value x.
Bayesian Classifiers

● Bayes Theorem: P(Y|X) = P(X|Y) P(Y) / P(X)

If {X1, X2,..., Xk} is the set of mutually exclusive and exhaustive outcomes of a
random variable X, then the law of total probability gives P(Y) = Σ_{i=1}^{k} P(Y|Xi) P(Xi);
the denominator of Bayes' theorem can be expanded in the same way.
Bayesian Classifiers
● Bayes Theorem
Bayesian Classifiers
● Bayes Theorem
○ Using the Bayes Theorem for Classification
■ X - attribute set
■ Y - class variable.
○ Treat X and Y as random variables -for non-deterministic relationship
○ Capture relationship probabilistically using P(Y |X) - Posterior Probability or Conditional Probability
○ P(Y) - prior probability
○ Training phase
■ Learn the posterior probabilities P(Y |X) for every combination of X and Y
○ Use these probabilities to classify a test record X′ by finding the class Y′ that maximizes the posterior probability P(Y′|X′)
Bayesian Classifiers
Using the Bayes Theorem for Classification

Example:-

● test record
X= (Home Owner = No, Marital Status = Married, Annual Income = $120K)

● Y=?
● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X)
● Y= Yes, if P(Yes|X) > P(No|X)
● Y= No, Otherwise
Bayesian Classifiers

Computing P(X|Y) - Class Conditional Probability

Naïve Bayes Classifier

● assumes that the attributes are conditionally independent, given the class label y.
● The conditional independence assumption can be formally stated as follows:
P(X|Y = y) = Π_{i=1}^{d} P(Xi|Y = y), where the attribute set X = {X1, X2, ..., Xd}
Bayesian Classifiers

How a Naïve Bayes Classifier Works

● Assumption - conditional independence


● Estimate the conditional probability of each Xi, given Y
○ (instead of computing the class-conditional probability for every combination of X)
○ No need of very large training set to obtain a good estimate of the probability.
● To classify a test record,
○ Compute the posterior probability for each class Y: P(Y|X) = P(Y) Π_{i=1}^{d} P(Xi|Y) / P(X)
■ P(X) can be ignored
● Since it is fixed for every Y, it is sufficient to choose the class that maximizes the
numerator term P(Y) Π P(Xi|Y)
Bayesian Classifiers
Estimating Conditional Probabilities for Binary Attributes

Xi - categorical attribute , xi - one of the value under attribute Xi


Y - Target Attribute ( for Class Label), y- one class Label
conditional probability P(Xi = xi |Y = y) = fraction of training instances in class y that take on
attribute value xi.
(DB = Defaulted Borrower, the class label in this example data set)
P(Home Owner=yes|DB=no) =
(No. of records with Home Owner=yes and DB=no)/(Total no. of records with DB=no)
=3/7
P(Home Owner=no|DB=no)=4/7
P(Home Owner=yes|DB=yes)=0
P(Home Owner=no|DB=yes)=3/3
Bayesian Classifiers
Estimating Conditional Probabilities for Categorical Attributes
P(MS=single|DB=no) = 2/7
P(MS=married|DB=no) = 4/7
P(MS=divorced|DB=no) =1/7
P(MS=single|DB=yes) = 2/3
P(MS=married|DB=yes) = 0/3
P(MS=divorced|DB=yes) =1 /3
Bayesian Classifiers

Estimating Conditional Probabilities for Continuous Attributes

● Discretization
● Probability Distribution
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes

● Discretization (Transforming continuous attributes into ordinal attributes)


○ Replace the continuous attribute value with its corresponding discrete interval.
○ Estimation error depends on
■ the discretization strategy
■ the number of discrete intervals.
○ If the number of intervals is too large, there are too few training records in
each interval
○ If the number of intervals is too small, then some intervals may aggregate
records from different classes and we may miss the correct decision boundary.
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes

● Probability Distribution
○ Gaussian distribution can be used to represent the class-conditional probability for continuous
attributes.
○ The distribution is characterized by two parameters,
■ mean, µ
■ variance, σ²
○ Class-conditional probability: P(Xi = xi | Y = yj) = (1 / (√(2π) σij)) · exp( −(xi − µij)² / (2σ²ij) )

µij - sample mean of Xi over all training records that belong to class yj
σ²ij - sample variance (s²) of those training records
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes

● Probability Distribution

sample mean and variance for this attribute with respect to the class No
Bayesian Classifiers
Example of the Naïve Bayes Classifier

● Compute the class conditional probability for each categorical attribute


● Compute sample mean and variance for the continuous attribute
● Predict the class label of a test record

X = (Home Owner=No, Marital Status = Married,


Income = $120K)

● compute the posterior probabilities


○ P(No|X)
○ P(Yes|X)
Bayesian Classifiers
Example of the Naïve Bayes Classifier

● P(yes) = 3/10 =0.3 P(no) =7/10 = 0.7


Bayesian Classifiers
Example of the Naïve Bayes Classifier

● P(no|x)= ?
● P(yes|x) = ?
● Large value is the class label
● X = (Home Owner=No, Marital Status = Married, Income = $120K)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) = ?
● P(Y|X) ∝ P(Y) * P(X|Y)  (the denominator P(X) is the same for both classes and can be ignored)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) =
P(DB=no) * P(Home Owner=No, Marital Status = Married, Income = $120K | DB=no)
● P(X|DB=no) = P(HO=no|DB=no) * P(MS=married|DB=no) * P(Income=$120K|DB=no)
= 4/7 * 4/7 * 0.0072
≈ 0.0024
Bayesian Classifiers
Example of the Naïve Bayes Classifier

P(DB=no | X) ∝ P(DB=no)*P(X | DB=no) = 7/10 * 0.0024 ≈ 0.0016


P(DB=yes | X) ∝ P(DB=yes)*P(X | DB=yes) = 3/10 * 0 = 0

Since P(DB=no | X) > P(DB=yes | X), the class label for the record is NO


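The same worked example written as a short computation; the class-conditional values are the ones given in the preceding slides (P(Income=$120K|DB=no) = 0.0072 is the Gaussian estimate for the 'No' class), so only rounding is added.

# Test record X = (Home Owner = No, Marital Status = Married, Income = $120K)
p_no, p_yes = 7/10, 3/10                        # prior probabilities

p_x_given_no  = (4/7) * (4/7) * 0.0072          # P(HO=no|no) * P(MS=married|no) * P(Income=120K|no)
p_x_given_yes = (3/3) * (0/3)                   # P(MS=married|yes) = 0, so the product is 0
                                                # (income factor omitted: it cannot change a zero product)

score_no  = p_no  * p_x_given_no                # proportional to P(no | X)
score_yes = p_yes * p_x_given_yes               # proportional to P(yes | X)
print(score_no, score_yes)                      # ~0.0016 vs 0  ->  predicted class label: No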
Bayesian Classifiers

Find out Class Label ( Play Golf ) for

today = (Sunny, Hot, Normal, False)

https://www.geeksforgeeks.org/naive-bayes-classifiers/
Association Analysis:
Basic Concepts and Algorithms

DWDM Unit - IV
Basic Concepts

● Retailers are interested in analyzing the data to learn


about the purchasing behavior of their customers.
● Such Information is used in marketing promotions, inventory
management, and customer relationship management.
● Association analysis - useful for discovering interesting
relationships hidden in large data sets.
● The uncovered relationships can be represented in the form
of association rules or sets of frequent items.
Basic Concepts

● Example Association Rule


○ {Diapers} → {Beer}
● rule suggests - strong relationship exists between the sale of
diapers and beer
● many customers who buy diapers also buy beer.
● Association analysis is also applicable to
○ Bioinformatics,
○ Medical diagnosis,
○ Web mining, and
○ Scientific data analysis
● Example - analysis of Earth science data(ocean, land, &
atmospheric processes)
Basic Concepts

Problem Definition:
● Binary Representation Market basket data
● each row - transaction
● each column - item
● value is one if the item is present in a transaction and
zero otherwise.
● item is an asymmetric binary variable because the
presence of an item in a transaction is often considered
more important than its absence
Basic Concepts

Itemset and Support Count:


I = {i1,i2,.. .,id} - set of all items
T = {t1, t2,..., tN} - set of all transactions

Each transaction ti contains a subset of items chosen from I


Itemset - collection of zero or more items
K-itemset - itemset contains k items
Example:-
{Beer, Diapers, Milk} - 3-itemset
null (or empty) set - no items
Basic Concepts
Itemset and Support Count:
● Transaction width - number of items present in a
transaction.
● A transaction tj contain an itemset X if X is a subset of
tj.
● Example:
○ t2 contains itemset {Bread, Diapers} but not {Bread, Milk}.
● support count,σ(X) - number of transactions that contain a
particular itemset.
● σ(X) = |{ti |X ⊆ ti, ti ∈ T}|,
○ symbol | · | denote the number of elements in a set.
● support count for {Beer, Diapers, Milk} =2
○ ( 2 transactions contain all three items)
Basic Concepts

Association Rule:
● An association rule is an implication expression of
the form X → Y, where X and Y are disjoint itemsets
○ i.e., X ∩ Y = ∅.

● The strength of an association rule can be measured
in terms of its support and confidence.
Basic Concepts

● Support
○ determines how often a rule is applicable to
a given data set
○ s(X → Y) = σ(X ∪ Y) / N, where N is the total number of transactions

● Confidence
○ determines how frequently items in Y appear
in transactions that contain X
○ c(X → Y) = σ(X ∪ Y) / σ(X)
Basic Concepts

● Example:

○ Consider the rule {Milk, Diapers} → {Beer}


○ support count for {Milk, Diapers, Beer}=2
○ total number of transactions=5,
○ rule’s support is 2/5 = 0.4.
○ rule’s confidence =
(support count for {Milk, Diapers, Beer})/(support count for {Milk, Diapers})

= 2/3 = 0.67.
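The same numbers computed directly in a small sketch; the five transactions below are the standard market-basket example assumed to accompany these slides.

transactions = [                                # assumed example transactions
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):                             # support count: transactions that contain the itemset
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)      # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)            # 2/3 ≈ 0.67
print(support, confidence)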
Basic Concepts

Formulation of Association Rule Mining Problem

Association Rule Discovery

Given a set of transactions T, find all the rules having


support ≥ minsup and confidence ≥ minconf, where minsup and
minconf are the corresponding support and confidence
thresholds.
Basic Concepts

Formulation of Association Rule Mining Problem


Association Rule Discovery
● Brute-force approach: compute the support and confidence for every
possible rule (expensive)
● Total number of possible rules extracted from a data set that
contains d items is R = 3^d − 2^(d+1) + 1
● For a data set of 6 items, the number of possible rules is 3^6 − 2^7 + 1 = 602
rules.
● More than 80% of the rules are discarded after applying minsup=20% &
minconf=50%
● most of the computations become wasted.
● Prune the rules early without having to compute their support and
confidence values.
Basic Concepts

Formulation of Association Rule Mining Problem

Association Rule Discovery

● Common strategy - decompose the problem into two major


subtasks: (separate support & confidence)
1. Frequent Itemset Generation:
■ Objective:Find all the itemsets that satisfy the minsup threshold.
2. Rule Generation:
■ Objective: Extract all the high-confidence rules from the frequent
itemsets found in the previous step.
■ These rules are called strong rules.
Frequent Itemset Generation

● Lattice structure - list of all


possible itemsets
● itemset lattice for
○ I = {a, b, c, d, e}
● A data set with k items can generate
up to 2^k − 1 frequent itemsets
(excluding the null set)
○ Example: 2^5 − 1 = 31
● So, search space of itemsets in
practical applications is
exponentially large
Frequent Itemset Generation

● A brute-force approach for finding frequent itemsets


○ determine the support count for every candidate
itemset in the lattice structure.
● compare each candidate against every transaction
● Very expensive
○ requires O(NMw) comparisons,
○ N- No. of transactions,
○ M = 2^k − 1 is the number of candidate itemsets
○ w - maximum transaction width.
Frequent Itemset Generation

several ways to reduce the computational complexity of


frequent itemset generation.

Reduce the number of candidate itemsets (M)


The Apriori principle
Reduce the number of comparisons
by using more advanced data structures
Frequent Itemset Generation

The Apriori
Principle
If an itemset is
frequent, then all
of its subsets must
also be frequent.
Frequent Itemset Generation

Support-based pruning:

● strategy of trimming the exponential search space based on the


support measure is known as support-based pruning.
● It uses anti-monotone property of the support measure.
● Anti-monotone property of the support measure
○ support for an itemset never exceeds the support for its subsets.
● Example:
○ {a, b} is infrequent,
○ then all of its supersets must be infrequent too.
○ entire subgraph containing the supersets of {a, b} can be pruned immediately
Frequent Itemset Generation
Let,

I - set of items

J = 2^I - power set of I

A measure f is monotone/anti-monotone if

Monotonicity Property(or upward closed):

∀X, Y ∈ J: (X ⊆ Y) → f(X) ≤ f(Y)

Anti-monotone (or downward closed):

∀X, Y ∈ J: (X ⊆ Y) → f(Y) ≤ f(X)

means that if X is a subset of Y, then f(Y) must not exceed f(X).


Frequent Itemset Generation in the Apriori Algorithm
Identify Frequent Itemset
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets

Fk - set of k-frequent itemsets


Frequent Itemset Generation in the Apriori Algorithm

https://www.softwaretestinghelp.com/apriori-
algorithm/#:~:text=Apriori%20algorithm%20is%20a%20sequence,is%20assumed%20by%20the%20user.
Frequent Itemset Generation in the Apriori Algorithm
Example
Example
Apriori in Python

https://intellipaat.com/blog/data-science-apriori-algorithm/
Apriori in Python

https://intellipaat.com/blog/data-science-apriori-algorithm/
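For reference, a library-based sketch of the same workflow; mlxtend is just one common choice (the linked tutorials may use a different package), and the transactions and thresholds below are illustrative assumptions.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [                                # assumed example transactions
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

te = TransactionEncoder()                       # binary (one-hot) representation of the market basket data
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(df, min_support=0.4, use_colnames=True)                     # itemsets with support >= minsup
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)    # strong rules
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])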
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets

Fk - set of k-frequent itemsets


Frequent Itemset Generation in the Apriori Algorithm

Candidate Generation and Pruning


The apriori-gen function shown in Step 5 of Algorithm 6.1
generates candidate itemsets by performing the following two
operations:
1. Candidate Generation (join)
a. Generates new candidate k-itemsets
b. based on the frequent (k − 1)-itemsets found in the previous
iteration.
2. Candidate Pruning
a. Eliminates some of the candidate k-itemsets using the support-based
pruning strategy.
Frequent Itemset Generation in the Apriori Algorithm

Candidate Generation and Pruning

Requirements for an effective candidate generation


procedure:

1. It should avoid generating too many unnecessary


candidates
2. It must ensure that the candidate set is complete,
i.e., no frequent itemsets are left out
3. It should not generate the same candidate itemset more
than once (no duplicates).
Frequent Itemset Generation in the Apriori Algorithm

Candidate Generation and Pruning

Candidate Generation Procedures

1. Brute-Force Method
2. Fk−1 × F1 Method
3. Fk−1×Fk−1 Method
Frequent Itemset Generation in the Apriori Algorithm

Candidate Generation and Pruning


Candidate Generation Procedures
1. Brute-Force Method
a. considers every k-itemset as a
potential candidate
b. candidate pruning ( to remove
unnecessary candidates) becomes
extremely expensive
c. No. of candidate itemsets generated at level k = C(d, k), i.e., "d choose k"
d - no. of items
2. Fk−1 × F1 Method

O(|Fk−1| × |F1|) candidate k-itemsets,

|Fj | = no. of frequent j-itemsets.

overall complexity: O( Σ_k k |Fk−1| |F1| )
● The procedure is complete.
● But the same candidate itemset will be generated more than once ( duplicates).
● Example:
○ {Bread, Diapers, Milk} can be generated
○ by merging {Bread, Diapers} with {Milk},
○ {Bread, Milk} with {Diapers}, or
○ {Diapers, Milk} with {Bread}.
● One Solution
○ Generate candidate itemset by joining items
in lexicographical order only
● {Bread, Diapers} join with {Milk}

Don’t join

● {Diapers, Milk} with {Bread}


● {Bread, Milk} with {Diapers}
because violation of lexicographic ordering
Problem:
Large no. of unnecessary candidates
3. Fk−1×Fk−1 Method (used in the apriori-gen function)

● merges a pair of frequent (k−1)-itemsets only if their


first k−2 items are identical.
● Let A = {a1, a2,..., ak−1} and B = {b1, b2,..., bk−1} be a
pair of frequent (k−1)-itemsets.
● A and B are merged if they satisfy the following
conditions:
○ ai = bi (for i = 1, 2,..., k−2) and
○ ak−1 != bk−1.
Merge {Bread, Diapers} & {Bread, Milk} to form a candidate 3-
itemset {Bread, Diapers, Milk}
Don’t merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different.
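A tiny sketch of the F_{k−1} × F_{k−1} merge condition (itemsets kept as lexicographically sorted tuples); it only illustrates the merge step and not the full apriori-gen procedure, which would also prune candidates having infrequent (k−1)-subsets.

def merge_candidates(freq_k_minus_1):
    """Merge pairs of frequent (k-1)-itemsets whose first k-2 items are identical."""
    candidates = set()
    itemsets = sorted(freq_k_minus_1)
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            if a[:-1] == b[:-1] and a[-1] != b[-1]:     # identical prefix, different last item
                candidates.add(a + (b[-1],))            # candidate k-itemset in lexicographic order
    return candidates

f2 = {("Bread", "Diapers"), ("Bread", "Milk"), ("Beer", "Diapers"), ("Diapers", "Milk")}
print(merge_candidates(f2))   # {('Bread', 'Diapers', 'Milk')}; {Beer, Diapers} and {Diapers, Milk} are not merged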
Support Counting

● Support counting is the process of


determining the frequency of
occurrence for every candidate
itemset that survives the candidate
pruning step.
● One approach for doing this is to
compare each transaction against
every candidate itemset (see Figure
6.2) and to update the support
counts of candidates contained in
the transaction.
● This approach is computationally
expensive, especially when the
numbers of transactions and
candidate itemsets are large.
Support Counting

● An alternative approach is to enumerate the itemsets contained in


each transaction and use them to update the support counts of
their respective candidate itemsets.
● To illustrate, consider a transaction t that contains five
items,{1, 2, 3, 5, 6}.
● Assuming that each itemset keeps its items in increasing
lexicographic order, an itemset can be enumerated by specifying
the smallest item first,followed by the larger items.
● For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets
contained in t must begin with item 1, 2, or 3. It is not
possible to construct a 3-itemset that begins with items 5 or 6
because there are only two items in t whose labels are greater
than or equal to 5.
Support Counting
● The number of ways to specify the first item of a 3-itemset
contained in t is illustrated by the Level 1 prefix structures.
For instance, 1 2 3 5 6 represents a 3-itemset that begins with
item 1, followed by two more items chosen from the set {2, 3, 5, 6}
● After fixing the first item, the prefix structures at Level 2
represent the number of ways to select the second item.
For example, 1 2 3 5 6 corresponds to itemsets that begin with
prefix (1 2) and are followed by items 3, 5, or 6.
● Finally, the prefix structures at Level 3 represent the complete
set of 3-itemsets contained in t.
For example, the 3-itemsets that begin with prefix {1 2} are
{1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with
prefix {2 3} are {2, 3, 5} and {2, 3, 6}.
Support Counting

(steps 6 through 11 of Algorithm 6.1. )

● Enumerate the itemsets contained in


each transaction
● Figure 6.9 demonstrate how itemsets
contained in a transaction can be
systematically enumerated, i.e., by
specifying their items one by one,
from the leftmost item to the
rightmost item.
● If enumerated item of transaction
matches one of the candidates, then
the support count of the
corresponding candidate is
incremented.(line 9 in algo.)

For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t can be enumerated in this way (a short code sketch follows below).


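A quick sketch of this enumeration using itertools as a stand-in for the prefix-structure walk of Figure 6.9 (the candidate set below is a made-up example).

from itertools import combinations

t = (1, 2, 3, 5, 6)
candidates = {(1, 2, 3): 0, (1, 2, 5): 0, (3, 5, 6): 0, (2, 4, 6): 0}   # assumed candidate 3-itemsets with counts

for subset in combinations(t, 3):      # all 3-itemsets contained in t, items in increasing order
    if subset in candidates:           # a match increments the support count of that candidate
        candidates[subset] += 1
print(candidates)                      # (2, 4, 6) is not contained in t, so its count stays 0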
Support Counting Using a Hash Tree

● Candidate itemsets are


partitioned into different
buckets and stored in a
hash tree.
● Itemsets contained in
each transaction are also
hashed into their
appropriate buckets.
● Instead of comparing each
itemset in the transaction
with every candidate
itemset
● Matched only against
candidate itemsets that
belong to the same bucket
https://www.youtube.com/watch?v=btW-uU1dhWI
Hash Tree from a Candidate Itemset
Hash function= p mod 3
Rule generation
&
Compact representation of frequent
itemsets
DWDM
Unit - IV
Association Analysis
Rule Generation

● Each frequent k-itemset can produce up to 2^k − 2 association
rules, ignoring rules that have empty antecedents or
consequents.
● An association rule can be extracted by partitioning the
itemset Y into two non-empty subsets, X and Y −X, such that
X → Y −X satisfies the confidence threshold.
Confidence-Based Pruning
Theorem:
If a rule X → Y −X does not satisfy the confidence
threshold, then
any rule X` → Y − X`, where X` is a subset of X, must
not satisfy the confidence threshold as well.
Rule Generation in Apriori Algorithm

● The Apriori algorithm uses a level-wise approach for generating


association rules, where each level corresponds to the
number of items that belong to the rule consequent.
● Initially, all the high-confidence rules that have only one
item in the rule consequent are extracted.
● These rules are then used to generate new candidate rules.
For example, if {acd} →{b} and {abd} →{c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
Rule Generation in Apriori Algorithm
● Figure 6.15 shows a lattice
structure for the association
rules generated from the
frequent itemset {a, b, c, d}.
● If any node in the lattice has
low confidence, then
according to Theorem, the
entire sub-graph spanned by
the node can be pruned
immediately.
● Suppose the confidence for
{bcd} → {a} is low. All the rules
containing item a in its
consequent, can be discarded.
In rule generation, we do not have to make additional passes over the data set
to compute the confidence of the candidate rules.
Instead, we determine the confidence of each rule by using the support counts
computed during frequent itemset generation.
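A small sketch of that idea: the confidence of a candidate rule is read off the support counts already gathered during frequent itemset generation (the counts below are those of the earlier example transactions; frozensets are used only as convenient dictionary keys).

support_count = {                                       # from frequent itemset generation
    frozenset({"Milk", "Diapers"}): 3,
    frozenset({"Milk", "Diapers", "Beer"}): 2,
}

def confidence(antecedent, consequent):
    # conf(X -> Y) = sigma(X u Y) / sigma(X); no extra pass over the transactions is needed
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support_count[xy] / support_count[x]

minconf = 0.6
c = confidence({"Milk", "Diapers"}, {"Beer"})
print(c, "keep rule" if c >= minconf else "prune rule and the rules below it in the lattice")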
Compact Representation of Frequent Itemsets

Maximal Frequent Itemsets


Definition
A maximal frequent itemset is defined as a frequent itemset for
which none of its immediate supersets are frequent.
Compact Representation of Frequent Itemsets
● The itemsets in the lattice are divided
into two groups: those that are
frequent and those that are
infrequent.
● A frequent itemset border, which is
represented by a dashed line, is also
illustrated in the diagram.
● Every itemset located above the
border is frequent, while those
located below the border (the shaded
nodes) are infrequent.
● Among the itemsets residing near the
border, {a, d}, {a, c, e}, and {b, c, d, e} are
considered to be maximal frequent
itemsets because their immediate
supersets are infrequent.
● Maximal frequent itemsets do not
contain the support information of
their subsets.
Compact Representation of Frequent Itemsets
● Maximal frequent itemsets effectively provide a compact representation of
frequent itemsets.
● They form the smallest set of itemsets from which all frequent itemsets can
be derived.
● For example, the frequent itemsets shown in Figure 6.16 can be divided into
two groups:
○ Frequent itemsets that begin with item a and that may contain items c, d, or e. This group
includes itemsets such as {a}, {a, c}, {a, d}, {a, e} and {a, c, e}.
○ Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b,
c}, {c, d},{b, c, d, e}, etc.
● Frequent itemsets that belong in the first group are subsets of either {a, c, e}
or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
Compact Representation of Frequent Itemsets

● Closed Frequent Itemsets


○ Closed itemsets provide a minimal representation of itemsets
without losing their support information.
○ An itemset X is closed if none of its immediate supersets has
exactly the same support count as X.
Or
○ X is not closed if at least one of its immediate supersets has the
same support count as X.
Compact Representation of Frequent Itemsets

Closed Frequent Itemsets

● An itemset is a closed
frequent itemset if it is
closed and its support is
greater than or equal to
minsup.
Compact Representation of Frequent Itemsets

Closed Frequent Itemsets


● Determine the support counts for the non-closed by using the closed frequent
itemsets
● consider the frequent itemset {a, d} - is not closed, its support count must be
identical to one of its immediate supersets {a, b, d}, {a, c, d}, or {a, d, e}.
● Apriori principle states
○ any transaction that contains the superset of {a, d} must also contain {a, d}.
○ any transaction that contains {a, d} does not have to contain the supersets of {a, d}.
● So, the support for {a, d} = largest support among its supersets = support of
{a,c,d}
● Algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the
smallest frequent itemsets.
● The items can be divided into
three groups: (1) Group A,
which contains items a1
through a5; (2) Group B, which
contains items b1 through b5;
and (3) Group C, which
contains items c1 through c5.
● The items within each group
are perfectly associated with
each other and they do not
appear with items from
another group. Assuming the
support threshold is 20%,
the total number of frequent
itemsets is 3 × (2^5 − 1) = 93.
● There are only three closed
frequent itemsets in the
data: ({a1, a2, a3, a4, a5}, {b1,
b2, b3, b4, b5}, and {c1, c2, c3,
c4, c5})
● Redundant association rules can be removed by using Closed frequent itemsets
● An association rule X → Y is redundant if there exists another rule X`→ Y`,

where

X is a subset of X` and

Y is a subset of Y `

such that the support and confidence for both rules are identical.
● From table 6.5 {b} is not a closed frequent itemset while {b, c} is closed.
● The association rule {b} → {d, e} is therefore redundant because it has the same
support and confidence as {b, c} → {d, e}.

● Such redundant rules are not generated if closed frequent itemsets are used
for rule generation.
● All maximal frequent itemsets are closed, because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
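A brief sketch that identifies maximal and closed frequent itemsets from a table of support counts; the itemsets and counts are made-up illustrative values, not those of Figure 6.16 or Table 6.5.

support = {                                             # assumed support counts of all frequent itemsets
    frozenset("a"): 4, frozenset("b"): 5, frozenset("c"): 3,
    frozenset("ab"): 4, frozenset("ac"): 2, frozenset("bc"): 3,
    frozenset("abc"): 2,
}
frequent = set(support)

def immediate_supersets(s):
    return [t for t in frequent if s < t and len(t) == len(s) + 1]

maximal = {s for s in frequent if not immediate_supersets(s)}                    # no frequent immediate superset
closed = {s for s in frequent
          if all(support[t] != support[s] for t in immediate_supersets(s))}      # no superset with equal support
print("maximal:", sorted(map(sorted, maximal)))
print("closed:", sorted(map(sorted, closed)))

Every maximal itemset also appears in the closed list, matching the remark above.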
FP Growth Algorithm
Association Analysis (Unit - IV)
DWDM
FP Growth Algorithm
● FP-growth algorithm takes a radically different approach for discovering frequent itemsets.
● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts
frequent itemsets directly from this structure
FP-Tree Representation
● An FP-tree is a compressed representation of the input data. It is constructed by reading the data
set one transaction at a time and mapping each transaction onto a path in the FP-tree.
● As different transactions can have several items in common, their paths may overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract
frequent itemsets directly from the structure in memory instead of making repeated passes over
the data stored on disk.
FP Tree Representation
FP Tree Representation
● Figure 6.24 shows a data set that
contains ten transactions and five
items.
● The structures of the FP-tree after
reading the first three
transactions are also depicted in
the diagram.
● Each node in the tree contains the
label of an item along with a
counter that shows the number of
transactions mapped onto the
given path.
● Initially, the FP-tree contains only
the root node represented by the
null symbol.
FP Tree Representation
1. The data set is scanned once to
determine the support count of
each item. Infrequent items are
discarded, while the frequent
items are sorted in decreasing
support counts. For the data set
shown in Figure, a is the most
frequent item, followed by b, c, d,
and e.
FP Tree Representation
2. The algorithm makes a second
pass over the data to construct
the FP-tree. After reading the
first transaction, {a, b}, the nodes
labeled as a and b are created. A
path is then formed from null →
a → b to encode the transaction.
Every node along the path has a
frequency count of 1.
FP Tree Representation
3. After reading the second transaction, {b,c,d}, a new set of
nodes is created for items b, c, and d. A path is then
formed to represent the transaction by connecting the
nodes null → b → c → d. Every node along this path
also has a frequency count equal to one.
4. The third transaction, {a,c,d,e}, shares a common prefix
item (which is a) with the first transaction. As a result,
the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction,
null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two,
while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been
mapped onto one of the paths given in the FP-tree.
The resulting FP-tree after reading all the transactions
is shown in Figure 6.24.
FP Tree Representation
● The size of an FP-tree is typically smaller
than the size of the uncompressed data
because many transactions in market
basket data often share a few items in
common.
● In the best-case scenario, where all the
transactions have the same set of items,
the FP-tree contains only a single branch
of nodes.
● The worst-case scenario happens when
every transaction has a unique set of
items.
FP Tree Representation
● The size of an FP-tree also
depends on how the items are
ordered.
● If the ordering scheme in the
preceding example is reversed,
i.e., from lowest to highest
support item, the resulting FP-
tree is shown in Figure 6.25.
● An FP-tree also contains a list
of pointers connecting
between nodes that have the
same items.
● These pointers, represented as
dashed lines in Figures 6.24
and 6.25, help to facilitate the
rapid access of individual
items in the tree.
Frequent Itemset Generation using FP-Growth Algorithm
Steps in FP-Growth Algorithm:
Step-1: Scan the database to build Frequent 1-item set which will contain all
the elements whose frequency is greater than or equal to the minimum
support. These elements are stored in descending order of their
respective frequencies.
Step-2: For each transaction, the respective Ordered-Item set is built.
Step-3: Construct the FP tree. by scanning each Ordered-Item set
Step-4: For each item, the Conditional Pattern Base is computed which is
path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Step-5: For each item, the Conditional Frequent Pattern Tree is built.
Step-6: Frequent Pattern rules are generated by pairing the items of the
Conditional Frequent Pattern Tree set to each corresponding item.
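A library-based sketch of these steps (mlxtend is an assumption; the transactions are the ordered item sets of the example that follows, and min_support = 3/5 corresponds to min_support = 3 over five transactions).

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# fpgrowth builds the FP-tree internally and mines the frequent itemsets from it
print(fpgrowth(df, min_support=3/5, use_colnames=True))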
Frequent Itemset Generation in FP-Growth Algorithm
Example:
Given Database: min_support=3 The frequency of each individual
item is computed:-
Frequent Itemset Generation in FP-Growth Algorithm
● A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies.
● L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
● Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained
in the transaction. The following table is built for all the transactions:
Frequent Itemset Generation in FP-Growth Algorithm
Now, all the Ordered-Item sets
are inserted into a Trie Data
Structure.
a) Inserting the set {K, E, M, O,
Y}:
All the items are simply
linked one after the other in
the order of occurrence in
the set and initialize the
support count for each item
as 1.
Frequent Itemset Generation in FP-Growth Algorithm
b) Inserting the set {K, E, O, Y}:

Till the insertion of the elements K and


E, simply the support count is increased
by 1.

There is no direct link between E and O,


therefore a new node for the item O is
initialized with the support count as 1
and item E is linked to this new node.

On inserting Y, we first initialize a new


node for the item Y with support count
as 1 and link the new node of O with the
new node of Y.
Frequent Itemset Generation in FP-Growth Algorithm
c) Inserting the set {K, E, M}:
● Here simply the support
count of each element is
increased by 1.
Frequent Itemset Generation in FP-Growth Algorithm
d) Inserting the set {K, M, Y}:
● Similar to step b), first the
support count of K is
increased, then new nodes
for M and Y are initialized
and linked accordingly.
Frequent Itemset Generation in FP-Growth Algorithm
e) Inserting the set {K, E, O}:
● Here simply the support
counts of the respective
elements are increased.
Frequent Itemset Generation in FP-Growth Algorithm
Now, for each item starting from leaf, the Conditional Pattern Base is computed
which is path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Frequent Itemset Generation in FP-Growth Algorithm
Now for each item, the Conditional Frequent Pattern Tree is built.

It is done by taking the set of elements that is common in all the paths in the Conditional
Pattern Base of that item and calculating its support count by summing the support counts of all
the paths in the Conditional Pattern Base.

The itemsets whose support count >= min_support value are retained in the Conditional
Frequent Pattern Tree and the rest are discarded.
Frequent Itemset Generation in FP-Growth Algorithm
From the Conditional Frequent Pattern tree, the Frequent Pattern rules are
generated by pairing the items of the Conditional Frequent Pattern Tree set to
the corresponding to the item as given in the below table.

For each row, two types of association rules can be inferred for example for the first
row which contains the element, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is calculated and the one
with confidence greater than or equal to the minimum confidence value is retained.
Data Mining
Cluster Analysis: Basic Concepts
and Algorithms

Introduction to Data Mining, 2nd Edition


Tan, Steinbach, Karpatne, Kumar
What is Cluster Analysis?

● Given a set of objects, place them in groups such that the


objects in a group are similar (or related) to one another and
different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized, inter-cluster distances are maximized]
Applications of Cluster Analysis

● Understanding
– Group related documents
for browsing(Information
Retrieval),
– group genes and proteins
that have similar
functionality(Biology),
– group stocks with similar
price fluctuations
(Business)
– Climate
– Psychology & Medicine

Clustering precipitation
in Australia

Applications of Cluster Analysis

● Clustering for Utility


– Summarization
– Compression
– Efficiently finding Nearest
Neighbors

Clustering precipitation
in Australia

Notion of a Cluster can be Ambiguous

[Figure] How many clusters? The same points can be seen as two clusters, four clusters, or six clusters.

Types of Clusterings

● A clustering is a set of clusters

● Important distinction between hierarchical and


partitional sets of clusters
– Partitional Clustering (unnested)
◆ A division of data objects into non-overlapping subsets (clusters)

– Hierarchical clustering (nested)


◆ A set of nested clusters organized as a hierarchical tree

Partitional Clustering

Original Points A Partitional Clustering

Hierarchical Clustering

Traditional Hierarchical Clustering Traditional Dendrogram

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Other Distinctions Between Sets of Clusters

● Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple
clusters.
◆ Can belong to multiple classes or could be ‘border’ points
– Fuzzy clustering (one type of non-exclusive)
◆ In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
◆ Weights must sum to 1
◆ Probabilistic clustering has similar characteristics

● Partial versus complete


– In some cases, we only want to cluster some of the data

Types of Clusters

● Well-separated clusters

● Prototype-based clusters

● Contiguity-based clusters

● Density-based clusters

● Described by an Objective Function

Types of Clusters: Well-Separated

● Well-Separated Clusters:
– A cluster with a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.

3 well-separated clusters

Types of Clusters: Prototype-Based

● Prototype-based ( or center based)


– A cluster with set of points such that a point in a cluster is
closer (more similar) to the prototype or “center” of the
cluster, than to the center of any other cluster
– If Data is Continuous – Center will be Centroid /mean
– If Data is Categorical - Center will be Medoid ( Most
Representative point)

4 center-based clusters

Types of Clusters: Contiguity-Based ( Graph)

● Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster with set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
– Graph ( Data-Nodes, links - Connections),Cluster is group of
connected objects.No connections with outside group.

8 contiguous clusters
● Useful when clusters are irregular or intertwined
● Trouble when noise is present
– a small bridge of points can merge two distinct clusters.

Types of Clusters: Density-Based

● Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

The two circular clusters are not merged here, as they were in the previous slide's figure, because the bridge between them fades into the noise.
6 density-based clusters

The curve present in the previous slide's figure also fades into the noise and does not form a
cluster.

A density based definition of a cluster is often employed when the clusters are irregular or intertwined,
and when noise and outliers are present.

Types of Clusters: Density-Based

● Shared property(Conceptual Clusters)


– a cluster as a set of objects that share some
property.

A clustering algorithm would need a very specific concept (sophisticated) of a cluster to successfully
detect these clusters. The process of finding such clusters is called conceptual clustering.

Clustering Algorithms

● K-means and its variants

● Hierarchical clustering

● Density-based clustering

K-means
● Prototype-based, partitional clustering

technique
● Attempts to find a user-specified number of

clusters (K)

Agglomerative Hierarchical Clustering
● Hierarchical clustering
● Starts with each point as a singleton cluster
● Repeatedly merges the two closest clusters
until a single, all encompassing cluster
remains.
● Some Times - graph-based clustering
● Others - prototype-based approach.

DBSCAN
● Density-based clustering algorithm

● Produces a partitional clustering,

● No. of clusters is automatically determined by

the algorithm.
● Noise - Points in low-density regions (omitted)

● Not a complete clustering.

K-means Clustering

● Partitional clustering approach


● Number of clusters, K, must be specified
● Each cluster is associated with a centroid (center point)
● Each point is assigned to the cluster with the closest
centroid
● The basic algorithm is very simple

Example of K-means Clustering
Example of K-means Clustering

K-means Clustering – Details
● Simple iterative algorithm.
– Choose initial centroids;
– repeat {assign each point to a nearest centroid; re-compute cluster centroids}
– until centroids stop changing.

● Initial centroids are often chosen randomly.


– Clusters produced can vary from one run to another
● The centroid is (typically) the mean of the points in the cluster,
but other definitions are possible
● Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
● Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

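A minimal NumPy sketch of this basic loop (random blob data and random initial centroids are assumptions; ties and empty clusters are ignored for brevity). The SSE computed at the end is the objective discussed on the next slides.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # choose K initial centroids at random
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                               # assign each point to the nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])   # re-compute centroids
        if np.allclose(new_centroids, centroids):                   # stop when centroids stop changing
            break
        centroids = new_centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()                      # sum of squared errors
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 3))])
labels, centroids, sse = kmeans(X, k=3)
print(centroids, sse)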
K-means Clustering – Details

● Centroid can vary, depending on the proximity


measure for the data and the goal of the
clustering.
● The goal of the clustering is typically expressed
by an objective function that depends on the
proximities of the points to one another or to the
cluster centroids.
● e.g., minimize the squared distance of each point
to its closest centroid

K-means Clustering – Details

Centroids and Objective Functions

Data in Euclidean Space


● A common objective function (used with Euclidean
distance measure) is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
center
– To get SSE, we square these errors and sum them:
SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(mi, x)²

– x is a data point in cluster Ci and mi is the centroid (mean) for cluster Ci
– The K-means run that produces the minimum SSE is preferred.
– Centroid (mean) of the ith cluster: mi = (1/|Ci|) Σ_{x ∈ Ci} x

K-means Objective Function

Document Data
● Cosine Similarity

● Document data is represented as Document Term Matrix

● Objective (Cohesion of the cluster)

– Maximize the similarity of the documents in a cluster


to the cluster centroid; which is called cohesion of
the cluster

Two different K-means Clusterings

Original Points

Figure (a) shows a clustering solution that is the global minimum of the SSE for three clusters; Figure (b) shows a suboptimal clustering that is only a local minimum.

Fig a: Optimal Clustering      Fig b: Sub-optimal Clustering

Importance of Choosing Initial Centroids …
The below 2 figures show the clusters that result from two particular choices of initial centroids.
(For both figures, the positions of the cluster centroids in the various iterations are indicated by
crosses.)
Fig-1

In Figure1, even though all the


initial centroids are from one
natural cluster, the minimum
SSE clustering is still found

Fig-2

In Figure 2, even though the initial


centroids seem to be better
distributed, we obtain a
suboptimal clustering, with higher
squared error. This is considered a
poor choice of initial centroids.
Importance of Choosing Initial Centroids …

Problems with Selecting Initial Points

● Figure 5.7 shows that if a pair of clusters has only one initial
centroid and the other pair has three, then two of the true
clusters will be combined and one true cluster will be split.
10 Clusters Example

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others
have only one.

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while others have only one.

Solutions to Initial Centroids Problem

● Multiple runs

● K-means++

● Use hierarchical clustering to determine initial


centroids

● Bisecting K-means

Multiple Runs

● One technique that is commonly


used to address the problem of
choosing initial centroids is to
perform multiple runs, each with
a different set of randomly
chosen initial centroids, and then
select the set of clusters with the
minimum SSE
● In Figure 5.6(a), the data
consists of two pairs of clusters,
where the clusters in each (top-
bottom) pair are closer to each
other than to the clusters in the
other pair.
● Figure 5.6 (b–d) shows that if we
start with two initial centroids per
pair of clusters, then even when
both centroids are in a single
cluster, the centroids will
redistribute themselves so that
the “true” clusters are found.
K-means++

K-means++

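The K-means++ slides above are figure-only; as a hedged summary, K-means++ chooses the first centroid uniformly at random and each subsequent centroid with probability proportional to the squared distance from a point to its nearest already-chosen centroid, which tends to spread the initial centroids out. A rough sketch of that seeding step:

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                       # first centroid: a uniformly random point
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)   # squared distance to nearest centroid
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])               # far-away points are more likely picked
    return np.array(centroids)

X = np.random.default_rng(2).normal(size=(200, 2))              # stand-in data (assumption)
print(kmeans_pp_init(X, k=3))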
Bisecting K-means

● Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a
hierarchical clustering

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

https://www.geeksforgeeks.org/bisecting-k-means-algorithm-introduction/

Limitations of K-means

● K-means has problems when clusters are of


differing
– Sizes
– Densities
– Non-globular shapes

● K-means has problems when the data contains


outliers.
– One possible solution is to remove outliers before
clustering

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to find a large number of clusters such that each of them represents a part of
a natural cluster. But these small clusters need to be put together in a post-processing step.

Hierarchical Clustering

● Produces a set of nested clusters organized as a


hierarchical tree
● Can be visualized as a dendrogram
– A tree like diagram that records the sequences of
merges or splits

Strengths of Hierarchical Clustering

● Do not have to assume any particular number of


clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

Hierarchical Clustering

● Two main types of hierarchical clustering


– Agglomerative:
◆ Start with the points as individual clusters
◆ At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)

● Traditional hierarchical algorithms use a similarity or


distance matrix
– Merge or split one cluster at a time

Agglomerative Clustering Algorithm

● Key Idea: Successively merge closest clusters


● Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

● Key operation is the computation of the proximity of two clusters


– Different approaches to defining the distance between clusters
distinguish the different algorithms

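A short sketch using SciPy (the data is an assumption). The method argument selects the inter-cluster proximity definition discussed in the following slides: 'single' = MIN, 'complete' = MAX, 'average' = group average, 'ward' = Ward's method.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))       # stand-in data (assumption)

Z = linkage(X, method="single")                         # successive merges of the two closest clusters (MIN)
labels = fcluster(Z, t=3, criterion="maxclust")         # 'cut' the dendrogram to obtain 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree of merges (requires matplotlib)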
Steps 1 and 2

● Start with clusters of individual points and a proximity matrix
[Figure: proximity matrix over points p1, p2, ..., p5]

Intermediate Situation

● After some merging steps, we have some clusters


[Figure: proximity matrix over the current clusters C1, C2, C3, C4, C5]
Step 4

● We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: C2 and C5 are the closest pair; proximity matrix over C1, ..., C5]
Step 5

● The question is “How do we update the proximity matrix?”


[Figure: proximity matrix after merging C2 and C5 into C2 ∪ C5; the entries involving the merged cluster are marked '?']
How to Define Inter-Cluster Distance

[Figure: proximity matrix over points p1, p2, ..., p5]
● MIN
● MAX
● Group Average
● Distance Between Centroids
● Other methods driven by an objective function
– Ward’s Method uses squared error

MIN or Single Link

● Proximity of two clusters is based on the two


closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph
● Example:
Distance Matrix:

Hierarchical Clustering: MIN

[Figure: nested clusters and dendrogram produced by MIN (single link)]
Strength of MIN

Original Points Six Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Two Clusters

Original Points

• Sensitive to noise
Three Clusters

MAX or Complete Linkage

● Proximity of two clusters is based on the two


most distant points in the different clusters
– Determined by all pairs of points in the two clusters

Distance Matrix:

Hierarchical Clustering: MAX

[Figure: nested clusters and dendrogram produced by MAX (complete link)]
Strength of MAX

Original Points Two Clusters

• Less susceptible to noise

Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters


• Biased towards globular clusters

Group Average

● Proximity of two clusters is the average of pairwise proximity


between points in the two clusters.

Distance Matrix:

Hierarchical Clustering: Group Average

[Figure: nested clusters and dendrogram produced by Group Average]
Hierarchical Clustering: Group Average

● Compromise between Single and Complete


Link

● Strengths
– Less susceptible to noise

● Limitations
– Biased towards globular clusters

Cluster Similarity: Ward’s Method

● Similarity of two clusters is based on the increase


in squared error when two clusters are merged
– Similar to group average if distance between points is
distance squared

● Less susceptible to noise

● Biased towards globular clusters

● Hierarchical analogue of K-means


– Can be used to initialize K-means

Hierarchical Clustering: Comparison

[Figure: comparison of the nested clusters produced by MIN, MAX, Group Average, and Ward’s Method on the same data]
Hierarchical Clustering: Time and Space requirements

● O(N²) space since it uses the proximity matrix.

– N is the number of points.

● O(N³) time in many cases

– There are N steps, and at each step a proximity matrix of size N² must be updated and searched
– Complexity can be reduced to O(N² log N) time with some cleverness

Hierarchical Clustering: Problems and Limitations

● Once a decision is made to combine two clusters,


it cannot be undone

● No global objective function is directly minimized

● Different schemes have problems with one or


more of the following:
– Sensitivity to noise
– Difficulty handling clusters of different sizes and non-
globular shapes
– Breaking large clusters

Density Based Clustering

● Clusters are regions of high density that are


separated from one another by regions of low
density.

DBSCAN

● DBSCAN is a density-based algorithm.


– Density = number of points within a specified radius (Eps)

– A point is a core point if it has at least a specified number of


points (MinPts) within Eps
◆ These are points that are at the interior of a cluster
◆ Counts the point itself

– A border point is not a core point, but is in the neighborhood


of a core point

– A noise point is any point that is not a core point or a border


point

DBSCAN: Core, Border, and Noise Points

MinPts = 7

DBSCAN: Core, Border and Noise Points

Original Points; point types: core, border, and noise (Eps = 10, MinPts = 4)


DBSCAN Algorithm

● Form clusters using core points, and assign


border points to one of its neighboring clusters

1: Label all points as core, border, or noise points.


2: Eliminate noise points.
3: Put an edge between all core points within a distance Eps of each
other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core
points

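A library-based sketch of these steps using scikit-learn's DBSCAN (the data, Eps, and MinPts values are assumptions); points labelled -1 are the noise points, so the clustering is not necessarily complete.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (100, 2)),        # two dense blobs ...
               rng.normal((3, 3), 0.3, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])             # ... plus scattered low-density points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # eps plays the role of Eps, min_samples of MinPts
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int((labels == -1).sum()))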
When DBSCAN Works Well

Original Points Clusters (dark blue points indicate noise)

• Can handle clusters of different shapes and sizes


• Resistant to noise

When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.92).

Original Points

• Varying densities
• High-dimensional data
(MinPts=4, Eps=9.75)
