
RAGHU ENGINEERING COLLEGE
(Autonomous)
(Approved by AICTE, New Delhi, Permanently Affiliated to JNTU Kakinada,
Accredited by NBA & Accredited by NAAC with A grade)

DATA WAREHOUSING AND MINING


III Year – II Semester                          L  T  P  C
                                                3  1  0  3

Course Code: 17CS603          Internal Marks: 40
Credits: 3                    External Marks: 60

OBJECTIVES:
 Students will be enabled to understand and implement classical models and algorithms in
data warehousing and data mining.
 They will learn how to analyze the data, identify the problems, and choose the relevant
models and algorithms to apply.
 They will further be able to assess the strengths and weaknesses of various methods
and algorithms and to analyze their behavior.

Unit I: Introduction to Data Mining: What is data mining, motivating challenges, origins of data
mining, data mining tasks , Types of Data-attributes and measurements, types of data sets, Data
Quality ( Tan)
Unit II: Data pre-processing, Measures of Similarity and Dissimilarity: Basics, similarity and
dissimilarity between simple attributes, dissimilarities between data objects, similarities between
data objects, examples of proximity measures: similarity measures for binary data, Jaccard
coefficient, Cosine similarity, Extended Jaccard coefficient, Correlation, Exploring Data : Data Set,
Summary Statistics (Tan)
Unit III: Data Warehouse: basic concepts, Data Warehouse Modeling: Data Cube and OLAP,
Data Warehouse implementation: efficient data cube computation, partial materialization, indexing
OLAP data, efficient processing of OLAP queries. (H & C)
Unit IV: Classification: Basic Concepts, General approach to solving a classification problem,
Decision Tree induction: working of decision tree, building a decision tree, methods for expressing
attribute test conditions, measures for selecting the best split, Algorithm for decision tree induction.
Model overfitting: due to presence of noise, due to lack of representative samples; evaluating the
performance of a classifier: holdout method, random subsampling, cross-validation, bootstrap. (Tan)
Unit V:
Association Analysis: Problem Definition, Frequent Itemset generation: the Apriori principle,
frequent itemset generation in the Apriori algorithm, candidate generation and pruning, support
counting (including support counting using a hash tree), rule generation, compact representation of
frequent itemsets, FP-Growth algorithm. (Tan)
Unit VI:
Overview: types of clustering, Basic K-means, K-means: additional issues, Bisecting K-means, K-
means and different types of clusters, strengths and weaknesses, K-means as an optimization
problem. Agglomerative Hierarchical clustering, basic agglomerative hierarchical clustering
algorithm, specific techniques, DBSCAN: Traditional density: centre-based approach, strengths and
weaknesses (Tan)

Course outcomes:
 Understand stages in building a Data Warehouse
 Understand the need and importance of preprocessing techniques
 Understand the need and importance of Similarity and dissimilarity techniques
 Analyze and evaluate performance of algorithms for Association Rules.
 Analyze Classification and Clustering algorithms
Text Books:
1. Introduction to Data Mining: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson.
2. Data Mining: Concepts and Techniques, 3/e, Jiawei Han, Micheline Kamber, Elsevier.
Reference Books:
1. Introduction to Data Mining with Case Studies, 2nd ed.: G. K. Gupta, PHI.
2. Data Mining: Introductory and Advanced Topics: Dunham, Sridhar, Pearson.
3. Data Warehousing, Data Mining & OLAP: Alex Berson, Stephen J. Smith, TMH.
4. Data Mining Theory and Practice: Soman, Diwakar, Ajay, PHI, 2006.

UNIT – 1:
Data Mining:
Introduction to Data Mining: What is data mining, motivating challenges, origins of data mining, data
mining tasks, Types of Data-attributes and measurements, types of data sets, Data Quality (Tan)

Data Mining
Data mining is the process of automatically discovering useful information in large data repositories. Data
mining techniques find useful patterns or predict the outcome of a future observation.

knowledge discovery in databases (KDD) process


Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of
converting raw data into useful information, as shown in the figure.

 This process consists of a series of transformation steps, from data preprocessing to postprocessing
of data mining results.
 The input data can be stored in a variety of formats (flat files, spread-sheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.
 The purpose of preprocessing is to transform the raw input data into an appropriate format for
subsequent analysis.
 The steps involved in the data preprocessing include fusing data from multiple sources, cleaning
data to remove noise and duplicate observations, and selecting the records and features that are
relevant to the data mining task at hand.
 Because of the many ways in which data can be collected and stored, data preprocessing is a time-consuming step
in the overall knowledge discovery process.
 The postprocessing step ensures that only valid and useful results are incorporated into the
decision support systems.
 An example of postprocessing is visualization, which allows analysts to explore the data and the
data mining results from a variety of viewpoints.
 Statistical measures or hypothesis testing methods can also be applied during the postprocessing to
eliminate spurious data mining results.

Steps involved in KDD process.


KDD is the overall process of converting raw data into useful information.
The steps that are involved in the KDD Process are
 Data cleaning
 Data integration
 Data selection
 Data transformation
 Data Mining


 Pattern evaluation
 Knowledge presentation
Steps involved in KDD process.

Data Mining is an integral part of knowledge discovery in database (KDD) which is the overall process of
converting raw data into useful information.
KDD consists of sequence of the following steps:
1) Data cleaning: It is a process of removing noise and inconsistent data.
2) Data integration: It is a process of integrating multiple data sources.
3) Data selection: It is the process in which data relevant to the analysis task are retrieved from the
database.
4) Data transformation: It is a process of transforming and consolidating the data into forms
appropriate for mining.
5) Data Mining: It is an essential process in which intelligent methods are applied to extract the
hidden data patterns.
6) Pattern evaluation: To identify the truly interesting patterns representing knowledge, based on
interestingness measures.
7) Knowledge presentation: In this step, visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
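As a rough illustration of these steps (not part of the original notes), the following Python sketch walks a tiny, made-up pandas data set through cleaning, integration, selection and transformation; the table and column names are invented.

import pandas as pd
import numpy as np

# Tiny made-up raw sources standing in for real repositories (illustration only).
sales = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3],
                      "amount": [20.0, 20.0, 35.0, -5.0, 50.0]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "age": [25, 40, 31]})

# 1) Data cleaning: remove duplicate and clearly invalid records.
sales = sales.drop_duplicates()
sales = sales[sales["amount"] > 0]

# 2) Data integration: combine the two sources on a shared key.
data = sales.merge(profiles, on="customer_id", how="inner")

# 3) Data selection: keep only the attributes relevant to the analysis task.
data = data[["customer_id", "age", "amount"]]

# 4) Data transformation: derive an attribute in a form better suited to mining.
data["log_amount"] = np.log(data["amount"])

# Steps 5-7 (data mining, pattern evaluation, knowledge presentation) would follow;
# here a simple printout stands in for them.
print(data)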


Challenges in data mining that motivated its development


Some of the specific challenges that motivated the development of data mining are
 Scalability
 High dimensionality
 Heterogeneous and Complex data
 Data ownership and Distribution
 Non-Traditional Analysis

Scalability:
 Because of advances in data generation and collection, datasets with sizes of gigabytes, terabytes or
even petabytes are common.
 Massive datasets cannot fit into main memory
 Need to develop scalable data mining algorithms to mine massive datasets
 Scalability can also be improved by using sampling or developing parallel and distributed algorithms.

High dimensionality:
 Nowadays, data sets commonly have hundreds or thousands of attributes.
o Example: a dataset that contains measurements of temperature at various locations
 Traditional data analysis techniques that were developed for low-dimensional data do not work
well for such high-dimensional data.
 Need to develop data mining algorithms to handle high dimensionality.

Heterogeneous and Complex Data:

 Traditional data analysis methods deal with datasets containing attributes of the same type (continuous
or categorical).
 Complex data sets contain images, video, text, etc.
 Need to develop mining methods to handle complex datasets.

Data Ownership and Distribution:


 Data is not stored in one location or owned by one organization.
 Data is geographically distributed among resources belonging to multiple entities.
 Need to develop distributed data mining algorithms to handle distributed datasets.
 Key challenges:
o How to reduce the amount of communication needed for distributed data.
o How to effectively consolidate the data mining results from multiple sources
o How to address data security issues.

Non-Traditional Analysis:


 Traditional statistical approach is based on a hypothesize-and-test paradigm.
 A hypothesis is proposed, an experiment is designed to gather the data, and then data is analyzed
with respect to the hypothesis.
 This process is extremely labor-intensive.
 Need to develop mining methods to automate the process of hypothesis generation and evaluation

Origins of data mining


Data mining draws on ideas such as:
o Sampling, estimation and hypothesis testing from statistics.

Raghu Engineering College Dept. of CSE Data Warehousing & Mining Unit - 1
4

o Search algorithms, modeling techniques and learning theories from artificial intelligence,
machine learning, and pattern recognition.
 Database systems are needed to provide support for efficient storage, Indexing and query
processing.
 The Techniques from parallel (high performance) computing are addressing the massive size of
some datasets.
 Distributed Computing techniques are used to gather information from different locations
 The below figure shows the relationship of data mining to other areas.

Data mining tasks


Data Mining tasks are divided into two major categories:
 Predictive Tasks: Predict the value of a particular attribute based on the values of other attributes.
The attribute to be predicted is known as the target or dependent variable, and the other attributes
are known as the explanatory or independent variables.
 Descriptive Tasks: Derive patterns that summarize the underlying relationships in data. Characterize
the general properties of the data in the database (Correlations, Trends, Clusters, Trajectories and
anomalies).
Four of the core data mining tasks:
1) Classification & Regression
2) Association Analysis
3) Cluster Analysis
4) Anomaly Detection


The below figure illustrates four of the core data mining tasks.

Predictive Modeling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modeling tasks:
 Classification, which is used for Discrete Target Variables.
Ex: Predicting whether a web user will make a purchase at an online book store (Target
variable is binary valued).
 Regression, which is used for Continuous Target Variables.
Ex: Forecasting the future price of a stock (Price is a continuous-valued attribute)

Association Analysis:
 Used to discover patterns that describe strongly associated features in the data.

 The discovered patterns are typically represented in the form of implication rules or feature
subsets.
Example:
Transaction ID Items
1 {Bread, Butter, Diapers, Milk}
2 {Coffee, Sugar, Cookies, Salmon}
3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4 {Bread, Butter, Salmon, Chicken}
5 {Eggs, Bread, Butter}
6 {Salmon, Diapers, Milk}
7 {Bread, Tea, Sugar, Eggs}
8 {Coffee, Sugar, Chicken, Eggs}
9 {Bread, Diapers, Milk, Salt}
10 {Tea, Eggs, Cookies, Diapers, Milk}
Market Basket Analysis
The above table illustrates data collected at a supermarket.
 Association analysis can be applied to find items that are frequently bought together by customers.
 Discovered Association Rule is {Diapers} → {Milk} (Customers who buy diapers also tend to buy milk)
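A short Python sketch (added here for illustration, not from the textbook) that verifies the support and confidence of the rule {Diapers} → {Milk} on the ten transactions above:

# The ten market basket transactions from the table above.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # fraction of all transactions containing Diapers and Milk
confidence = both / ante  # of the transactions containing Diapers, fraction also containing Milk

print(support, confidence)  # 0.5 and 1.0 for this data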

Cluster Analysis:
 Grouping of similar things is called a cluster.
 The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity (within a cluster) and minimizing the inter-class similarity (cluster to cluster).
Example:
Article Word
1 Dollar : 1, Industry : 4, Country : 2, Loan : 3, Deal : 2, Government : 2
2 Machinery : 2, Labor : 3, Market : 4, Industry : 2, Work : 3, Country : 1
3 Domestic: 4, Forecast : 2, Gain : 1, Market : 3, Country : 2, Index : 3
4 Patient : 4, Symptom : 2, Drug : 3, Health : 2, Clinic : 2, Doctor : 2
5 Death : 2, Cancer : 4, Drug : 3, Public : 4, Health : 3, Director : 2
6 Medical : 2, Cost : 3, Increase : 2, Patient : 2, Health : 3, Care : 1
Document Clustering
 The collection of news articles shown in the above table can be grouped based on their respective
topics.
 Each Article is represented as a set of word frequency pairs (w, c), Where w is a word and c is the
number of times the word appears in the article.
 There are 2 natural clusters in the above dataset:
o The first cluster consists of the first 3 articles (news about the economy)
o The second cluster contains the last 3 articles (news about health care)

Anomaly detection:
 The task of identifying observations whose characteristics are significantly different from the rest of
the data.
 Such observations are known as anomalies or Outliers.
 A good anomaly detector must have a high detection rate and a low false alarm rate.
 Applications: Detection of fraud, Network Intrusions etc…


Example:
Credit Card Fraud Detection:
 A Credit Card Company records the transactions made by every credit card holder, along with the
personal information such as credit limit, age, annual income and address.
 When a new transaction arrives, it is compared against the profile of the user.
 If the characteristics of the transaction are very different from the previously created profile, then
the transaction is flagged as potentially fraudulent.

Data Set (OR) Data


In simpler terms, Data Set is a collection of data objects and their attributes. Other names for a data object
are record, point, vector, pattern, event, case, sample, observation, or entity.

Attributes
An attribute is a property or characteristic of an object that may vary, either from one object to another or
from one time to another.
Example: Eye colour of a person, mass of a physical object, the time at which an event occurred, etc.
This is also known by other names such as variable, field, characteristic, dimension, feature etc.

Measurement scale
To analyse the characteristics of objects, we assign numbers or symbols to them. To do this in a
well-defined way, we need a measurement scale.
A measurement scale is a rule (function) that associates a numerical or symbolic value(attribute
values) with an attribute of an object.
For instance, we classify someone as male or female.
The “physical value” of an attribute of an object is mapped to a numerical or symbolic value

Differences between attributes and attribute values:


 The values used to represent an attribute may have properties that are not properties of the
attribute itself, and vice versa. This can be understood with the help of an example:
– Same attribute can be mapped to different attribute values
 Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
 Example: Attribute values for ID and age are integers
 The type of an attribute should tell us what properties of the attribute are reflected in the values
used to measure it.
 Knowing the type of an attribute is important because it tells us which properties of the measured
values are consistent with the underlying property of the attribute.

Properties (operations) of numbers, which are used to describe attributes


The distribution of attributes into different types is done, based on their characteristics. The following
properties (operations) of numbers are typically used to describe attributes.
1. Distinctness : = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and -
4. Multiplication: * and /
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio.


Types of attributes
We can define four types of attributes: nominal, ordinal, interval, and ratio. The table below describes
these attribute types, with examples and the operations that are meaningful for each.

Categorical (Qualitative):

Nominal
  Description: The values of a nominal attribute are just different names; i.e., nominal values provide
  only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
  Operations: mode, entropy, contingency correlation, chi-square (χ²) test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Quantitative (Numeric):

Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of
  measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio attributes, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
  Operations: geometric mean, harmonic mean, percent variation

Different attribute types

Transformations that define attribute levels

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers.
Even if they are represented by numbers, i.e., integers, they should be treated more like symbols.

The remaining two types of attributes, interval and ratio, are collectively referred to
as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most
of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.
The different types of attributes adhere to different properties mentioned above.
 Nominal adheres to Distinctness.
 Ordinal adheres to Distinctness and Order.
 Interval adheres to Distinctness, Order and meaningful difference.
 Ratio adheres to all the four properties (Distinctness, Order, Addition and Multiplication).

Discrete Attribute
 A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical,
such as Zip codes or ID numbers, or numeric.
 These are often represented using integer values.
 Binary attributes are a special case of discrete attributes and assume only two values. Example:
true/false, yes/no, or 0/1.
 Nominal and ordinal attributes are binary or discrete

Continuous Attribute
 Continuous Attribute has real numbers as attribute values.
 Example: Temperature, height etc.
 In practice, these values can only be measured and represented using a finite number of
digits, so they are typically represented as floating-point variables.
 Interval and ratio attributes are typically continuous.

Asymmetric Attribute
For Asymmetric Attributes, Only non-zero values are important.
For Instance, consider a data set where each object is a student and each attribute records whether or not
a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the
student took the course associated with that attribute and a value of 0 otherwise. Because students take
only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it
is more meaningful and more efficient to focus on non-zero values.
Binary attributes where only non-zero values are important are called asymmetric binary
attributes. This type of attribute is important for association analysis.

General characteristics of Data sets


There are three general characteristics of Data Sets namely: Dimensionality, Sparsity, and Resolution.
1) Dimensionality:
The dimensionality of a data set is the number of attributes that the objects in the data set have.
If a data set has a high number of attributes (also called high dimensionality), then
it can become difficult to analyse such a data set. This problem is referred to
as the Curse of Dimensionality.
2) Sparsity:
For some data sets, such as those with asymmetric features, most attributes of an object have
values of 0; in many cases, fewer than 1% of the entries are non-zero. Such a data is called sparse
data or it can be said that the data set has Sparsity. Sparsity is an advantage because only non-zero
values need to be stored and manipulated, which results in savings w.r.t computation time and
storage.
3) Resolution:

The patterns in the data depend on the level of resolution. If the resolution is too fine, a pattern
may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear. For example, variations in atmospheric pressure on a scale of hours reflect the
movement of storms and other weather systems. On a scale of months, such phenomena are not
detectable.

Types of data sets


Different types of data sets are grouped into three groups.
1) Record data
2) Graph-based data, and
3) Ordered data

Record data:
 Majority of Data Mining work assumes that data is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes).
 The most basic form of record data has no explicit relationship among records or data fields, and
every record (object) has the same set of attributes.
 Record data is usually stored either in flat files or in relational databases.

Different variations of record data

There are a few variations of Record Data, which have some characteristic properties.
1) Transaction or Market Basket Data: It is a special type of record data, in which each record contains
a set of items. For example, shopping in a supermarket or a grocery store. For any particular
customer, a record will contain a set of items purchased by the customer in that respective visit to
the supermarket or the grocery store. This type of data is called Market Basket Data. Transaction
data is a collection of sets of items, but it can be viewed as a set of records whose fields are
asymmetric attributes. Most often, the attributes are binary, indicating whether or not an item was
purchased.

2) The Data Matrix: If the data objects in a collection of data all have the same fixed set of numeric
attributes, then the data objects can be thought of as points (vectors) in a multidimensional space,
where each dimension represents a distinct attribute describing the object. A set of such data
objects can be interpreted as an m x n matrix, where there are m rows, one for each object, and n
columns, one for each attribute. This matrix is called a data matrix or a pattern matrix. Standard
matrix operations can be applied to transform and manipulate the data. Therefore, the data matrix is
the standard data format for most statistical data.
3) The Sparse Data Matrix: A sparse data matrix (sometimes also called document-data matrix) is a
special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e.,
only non-zero values are important. In document data, if the order of the terms (words) in a
document is ignored, then a document can be represented as a term vector, where each term is a
component (attribute) of the vector and the value of each component is the number of times the
corresponding term occurs in the document. This representation of a collection of documents is
called a document-term matrix.

Graph-based Data:
This can be further divided into types:
1. Data with Relationships among Objects: The data objects are mapped to nodes of the graph, while
the relationships among objects are captured by the links between objects and link properties, such
as direction and weight. Consider Web pages on the World Wide Web, which contain both text and
links to other pages. In order to process search queries, Web search engines collect and process
Web pages to extract their contents.
2. Data with Objects That Are Graphs: If objects have structure, that is, the objects contain sub
objects that have relationships, then such objects are frequently represented as graphs. For
example, the structure of chemical compounds can be represented by a graph, where the nodes are
atoms and the links between nodes are chemical bonds.

Different variations of graph data


Ordered Data
For some types of data, the attributes have relationships that involve order in time or space. As you can see
in the picture below, it can be segregated into four types:

Different variations of ordered data

1. Sequential Data: Also referred to as temporal data, can be thought of as an extension of record
data, where each record has a time associated with it. Consider a retail transaction data set that
also stores the time at which the transaction took place
2. Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such
as a sequence of words or letters. It is quite similar to sequential data, except that there are no time
stamps; instead, there are positions in an ordered sequence. For example, the genetic information
of plants and animals can be represented in the form of sequences of nucleotides that are known as
genes.
3. Time Series Data: Time series data is a special type of sequential data in which each record is a time
series, i.e., a series of measurements taken over time. For example, a financial data set might
contain objects that are time series of the daily prices of various stocks.


4. Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types
of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) that
is collected for a variety of geographical locations.

Data Quality
Data quality is assessed using several characteristics; incomplete and inconsistent information are
common properties of large real-world databases. Factors used for data
quality assessment are:
 Accuracy:
There are many possible reasons for flawed or inaccurate data, i.e., attribute values that are
incorrect because of human or computer errors.
 Completeness:
Incomplete data can occur for many reasons; attributes of interest, such as customer information
for sales and transaction data, may not always be available.
 Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from
inconsistent input formats. Duplicate tuples also require data cleaning.
 Timeliness:
Timeliness also affects the quality of the data. For example, at the end of the month several sales
representatives fail to file their sales records on time, and several corrections and adjustments flow in
after the end of the month. The data stored in the database are therefore incomplete for a time after each
month.
 Believability:
It is reflective of how much users trust the data.
 Interpretability:
It is a reflection of how easily the users can understand the data.

Why data quality is important?


Data mining applications are often applied to data that was collected for another purpose, or for future,
but unspecified applications. For that reason data mining cannot usually take advantage of the significant
benefits of “addressing quality issues at the source.”
Since preventing data quality problems is not an option in such a case, Data Mining mainly focuses on:
1. The detection and correction of data quality problems (often called data cleaning), and
2. The use of algorithms that can tolerate poor data quality.

Measurement Error
Measurement Error refers to any problem resulting from the measurement process. In other words, the
recorded data values differ from true values to some extent. For continuous attributes, the numerical
difference between measured and true value is called the error.

Data Collection Error


Data Collection Error refers to errors such as omitting data objects or attributes values, or including an
unnecessary data object.

Measurement and data collection issues of data quality


It is not a good idea to assume that the data will be perfect. There can be incorrectness in data due to
multiple reasons such as human error, limitations of measuring devices, or flawed data collection process.


There can be data sets in which some values are missing, sometimes even some data objects are not
present or there are redundant/duplicate data objects.

The following are the variety of problems that involve measurement error:
 Noise
 Artifacts
 Bias
 Precision
 Accuracy
The following issues involve both measurement and data collection problems. They are
 Outliers
 Missing and inconsistent values, and
 Duplicate data

Noise:
Noise is the random component of a measurement error. It involves either the distortion of a value
or addition of objects that are not required. The following figure shows a time series before and
after disruption by some random noise. If a bit more noise were added to the time series, its shape would
be lost.

The term noise is often connected with data that has a spatial (space related) or temporal (time related)
component. In these cases, techniques from signal and image processing are used in order to reduce noise.
But, the removal of noise is a difficult task; hence much of the data mining work involves use of Robust
Algorithms that can produce acceptable results even in the presence of noise.

Outliers:
Outliers are either
1. Data objects that, in some sense, have characteristics that are different from most of the other data
objects in the data set, or
2. Values of an attribute that are unusual with respect to the most common (typical) values for that
attribute.
Additionally, it is important to distinguish between noise and outliers. Outliers can be legitimate data
objects or values. Thus, unlike noise, outliers may sometimes be of interest. In fraud and network intrusion
detection, the goal is to find unusual objects or events from among a large number of normal ones.

Missing values:
It is not unusual to have data objects that have missing values for some of the attributes. The reasons can
be:
1. The information was not collected.

2. Some attributes are not applicable to all objects.


Regardless, missing values should be handled during the data analysis. The following are some of the
strategies for handling missing data.
 Eliminate Data Objects or Attributes: A simple and effective strategy is to eliminate objects with
missing values. If a data set has only a few objects that have missing values, it may be expedient to
omit them. A related strategy is to eliminate attributes that have missing values. This should be
done with caution, however, since the eliminated attributes may be the ones that are critical to the
analysis.
 Estimate Missing Values: Some missing data can be estimated reliably. If the attribute is continuous
in nature, then the average of that attribute can be used in place of missing values. If the data is
categorical, then the most occurring value can replace the missing values.
 Ignore the Missing Values during Analysis: Many data mining approaches can be modified to ignore
missing values. For example, suppose that objects are being clustered and the similarity between
pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for
some attributes, then the similarity can be calculated by using only the attributes that do not have
missing values. It is true that the similarity will only be approximate, but unless the total number of
attributes is small or the number of missing values is high, this degree of inaccuracy may not matter
much.
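The first two strategies can be sketched in a few lines of pandas (my own illustration; the data frame and its column names are invented):

import pandas as pd

# A tiny data set with some missing values.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "height": [170.0, None, 165.0, 180.0],
                   "eye_colour": ["brown", "blue", None, "brown"]})

# 1) Eliminate data objects (rows) or attributes (columns) that have missing values.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# 2) Estimate missing values: the mean for a continuous attribute,
#    the most frequent value (mode) for a categorical attribute.
filled = df.copy()
filled["height"] = filled["height"].fillna(filled["height"].mean())
filled["eye_colour"] = filled["eye_colour"].fillna(filled["eye_colour"].mode()[0])

# 3) Ignoring missing values during analysis is handled inside the mining algorithm
#    itself, e.g., by computing proximities only over attributes present in both objects.
print(filled)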

Inconsistent Values:
Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city. It may be that the individual entering
this information transposed two digits, or perhaps a digit was misread when the information was
scanned from a handwritten form. Some types of inconsistencies are easy to detect. For instance, a
person's height should not be negative. The correction of an inconsistency requires additional or
redundant information.

Duplicate Data: A data set may include data objects that are duplicates, or almost duplicates, of one
another. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two
objects that actually represent a single object, then the values of corresponding attributes may differ, and
these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining
data objects that are similar, but not duplicates, such as two distinct people with identical names. The term
deduplication is used to refer to the process of dealing with these issues.

Techniques to measures data quality


The quality of measurement process and the resulting data are measured by Precision and Bias.
Precision:
The closeness of repeated measurements (of the same quantity) to one another. It is often
measured by the standard deviation of a set of values.
Bias:
A systematic variation of measurements from the quantity being measured. It is measured by taking
the difference between the mean of the set of values and the known values of the quantity being
measured. It can only be determined for those objects whose measured quantity is already known.
For example, we have a standard laboratory weight with a mass of 1g and want to assess the precision and
bias of our new laboratory scale. We weigh the mass five times, and obtain the following five values:
{1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, and hence, the bias is 0.001. The
precision, as measured by the standard deviation, is 0.013.
It is common to use the more general term, accuracy, to refer to the degree of measurement error in data.

Accuracy:
The closeness of measurements to the true value of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept, there is no specific formula for
accuracy in terms of these two quantities.
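The numbers in the laboratory-scale example above can be checked with a short NumPy sketch (added for illustration):

import numpy as np

true_mass = 1.0
measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])

bias = measurements.mean() - true_mass    # 1.001 - 1.000 = 0.001
precision = measurements.std(ddof=1)      # sample standard deviation, about 0.013

print(round(bias, 3), round(precision, 3))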

UNIT – 2:
Data Preprocessing and Proximity measures:
Unit II: Data pre-processing, Measures of Similarity and Dissimilarity: Basics, similarity and dissimilarity
between simple attributes, dissimilarities between data objects, similarities between data objects,
examples of proximity measures: similarity measures for binary data, Jaccard coefficient, Cosine similarity,
Extended Jaccard coefficient, Correlation, Exploring Data: Data Set, Summary Statistics (Tan)

Data preprocessing:
Data Preprocessing refers to the steps applied to make data more suitable for data mining. The steps used
for Data Preprocessing usually fall into two categories:
1) Selecting data objects and attributes for the analysis.
2) Creating/changing the attributes.
The goal is to improve the data mining analysis with respect to time, cost and quality

Techniques and strategies for data preprocessing:


The following are the approaches of Data Preprocessing:
1) Aggregation
2) Sampling
3) Dimensionality Reduction
4) Feature Subset Selection
5) Feature Creation
6) Discretization and Binarization
7) Variable Transformation

Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object).
– For example, consider the below data set consisting of data objects recording the daily sales
of products in various store locations for different days over the course of a year.

– One way to aggregate the data objects is to replace all the data objects of a single store with
a single store wide data object.
– This reduces the hundreds or thousands of transactions that occur daily at a specific store to
a single daily transaction, and the number of data objects is reduced to the number of
stores.
There are several motivations for aggregation.
1) Data Reduction: Reduce the number of objects or attributes. This results in smaller data sets that
require less memory and processing time; hence, aggregation may permit the use of
more expensive data mining algorithms.
2) Change of Scale: Aggregation can act as a change of scope or scale by providing a high-level view of
the data instead of a low-level view. For example,
– Cities aggregated into regions, states, countries etc.

– Days aggregated into weeks, months and years.
3) More “Stable” Data: Aggregated Data tends to have less variability.
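A small pandas sketch of the store-level aggregation described above (the stores, dates and amounts are made up):

import pandas as pd

daily_transactions = pd.DataFrame({
    "store":  ["S1", "S1", "S2", "S2", "S2"],
    "date":   ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "amount": [120.0, 80.0, 60.0, 90.0, 30.0],
})

# Replace the many transactions of each store on each day with one aggregated object.
store_daily = daily_transactions.groupby(["store", "date"], as_index=False)["amount"].sum()
print(store_daily)   # one row per (store, date): fewer objects, a higher-level view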

Sampling
 Sampling is the main technique employed for data reduction. Sampling is a commonly used
approach for selecting a subset of the data objects to be analysed.
 It is often used for both the preliminary investigation of the data and the final data analysis.
 Statisticians often sample because obtaining the entire set of data of interest is too expensive or
time consuming.
 Sampling is typically used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
The key aspect of sampling is to use a sample that is representative. A sample is representative if it has
approximately the same property (of interest) as the original set of data. If the mean (average) of the data
objects is the property of interest, then a sample is representative if it has a mean that is close to that of
the original data.
Types of Sampling:
1) Simple Random Sampling:
There is an equal probability of selecting any particular item
→ Sampling without replacement: As each item is selected, it is removed from the
population.
→ Sampling with replacement: Objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked up
more than once.
2) Stratified sampling: Split the data into several partitions, and then draw random samples from each
partition.
3) Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained. Here, no need
to determine the correct sample size initially.
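A brief pandas sketch of the first two sampling schemes (illustration only; the data and group labels are invented):

import pandas as pd

data = pd.DataFrame({"x": range(100), "group": ["A"] * 80 + ["B"] * 20})

# Simple random sampling, without and with replacement.
without_repl = data.sample(n=10, replace=False, random_state=1)
with_repl = data.sample(n=10, replace=True, random_state=1)

# Stratified sampling: draw the same fraction from every partition (stratum),
# so the smaller group "B" is still represented.
stratified = data.groupby("group", group_keys=False).sample(frac=0.1, random_state=1)

print(len(without_repl), len(with_repl), len(stratified))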

Curse of dimensionality
Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set
have.
If a data set has a high number of attributes (also called high dimensionality), then it can
become difficult to analyse such a data set. This problem is referred to as the Curse of
Dimensionality.

Dimensionality reduction
The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of
a data set by creating new attributes that are a combination of the old attributes.
Purpose:
 Avoid curse of dimensionality
 Reduce amount of time and memory required by data mining algorithms.
 Allow data to be more easily visualized.
 May help to eliminate irrelevant features or reduce noise.
Techniques:
There are some linear algebra techniques for dimensionality reduction, particularly for continuous data, to
project the data from a high-dimensional space into a low dimensional space.
 Principal Components Analysis (PCA)
 Singular Value Decomposition(SVD)
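A minimal NumPy sketch of PCA, the first technique listed above, projecting made-up 3-attribute data onto its two principal components (an illustration under simple assumptions, not a full implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # 50 objects described by 3 attributes (made up)

# Centre the data, take the eigenvectors of the covariance matrix,
# and project onto the components with the largest variance.
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]        # largest eigenvalue (variance) first
components = eigvecs[:, order[:2]]

X_reduced = X_centred @ components       # shape (50, 2): the low-dimensional view
print(X_reduced.shape)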

Feature Subset Selection


Another way to reduce dimensionality of data is to use only a subset of the features.
Redundant features
– Duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting students' GPA
 Redundant features and irrelevant features can reduce classification accuracy and the quality of the
clusters that are found.
 While some of these features can be eliminated immediately by using common sense or domain knowledge,
selecting the best subset of features frequently requires a systematic approach.
 The ideal approach to feature selection is to try all possible subsets of features as input to the data
mining algorithm of interest, and then take the subset that produces the best results.
 There are three standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches
– Feature selection occurs naturally as part of the data mining algorithm. Specifically, during
the operation of the data mining algorithm, the algorithm itself decides which attributes to
use and which to ignore. Algorithms for building decision tree classifiers operate in this
manner.
Filter approaches
– Features are selected before the data mining algorithm is run, using some approach that is
independent of the data mining task. For example, we might select sets of attributes whose
pairwise correlation is as low as possible.
Wrapper approaches
– These methods use the target data mining algorithm as a black box to find the best subset of
attributes, in a way similar to that of the ideal algorithm described above, but typically
without enumerating all possible subsets.
Feature Weighting
– Feature weighting is an alternative to keeping or eliminating features. More important
features are assigned a higher weight, while less important ones are given a lower weight. These
weights are sometimes assigned based on domain knowledge about the relative importance
of features. Alternatively, they may be determined automatically.

Feature Creation
It involves creation of new attributes that can capture the important information in a data set much more
efficiently than the original attributes. The number of new attributes can be smaller than the number of
original attributes. There are three methodologies for creating new attributes:
1) Feature extraction:
 The creation of a new set of features from the original raw data is known as feature
extraction.
 Consider a set of photographs, where each photograph is to be classified according to
whether or not it contains a human face.
 The raw data is a set of pixels, and as such, is not suitable for many types of classification
algorithms.

 However, if the data is processed to provide higher-level features, such as the presence or
absence of certain types of edges and areas that are highly correlated with the presence of
human faces, then a much broader set of classification techniques can be applied to this
problem.
 This method is highly domain specific.
2) Feature construction:
 Sometimes the features in the original data sets have the necessary information, but it is not
in a form suitable for the data mining algorithm.
 In this situation, one or more new features constructed out of the original features can be
more useful than the original features.
 Example: dividing mass by volume to get density
3) Mapping the data to a new space
 A totally different view of the data can reveal important and interesting features.
 Consider, for example, time series data, which often contains periodic patterns. If there is
only a single periodic pattern and not much noise then the pattern is easily detected.
 If, on the other hand, there are a number of periodic patterns and a significant amount of
noise is present, then these patterns are hard to detect.
 Such patterns can, nonetheless, often be detected by applying a Fourier transform to the
time series in order to change to a representation in which frequency information is explicit.

Discretization and Binarization


Discretization is the process of converting a continuous attribute into an ordinal attribute
 A potentially infinite number of values are mapped into a small number of categories
 Discretization is used in both unsupervised and supervised settings
 Discretization is commonly used in classification.
 Many classification algorithms work best if both the independent and dependent variables have
only a few values.
Binarization maps a continuous or categorical attribute into one or more binary variables
 Typically used for association analysis
 Often convert a continuous attribute to a categorical attribute and then convert a categorical
attribute to a set of binary attributes
 Association analysis needs asymmetric binary attributes
 Examples: eye colour and height measured as {low, medium, high}
 Conversion of a categorical attribute to three binary attributes

 Conversion of a categorical attribute to five asymmetric binary attributes
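A short pandas sketch of both operations (my own example; the attribute names and values are invented):

import pandas as pd

df = pd.DataFrame({"height_cm": [150, 162, 171, 185, 158],
                   "eye_colour": ["brown", "blue", "green", "brown", "blue"]})

# Discretization: map a continuous attribute into a small number of ordered categories
# using equal-width bins.
df["height_band"] = pd.cut(df["height_cm"], bins=3, labels=["low", "medium", "high"])

# Binarization: convert a categorical attribute into a set of binary (0/1) attributes,
# one per category value.
binary = pd.get_dummies(df["eye_colour"], prefix="eye")

print(df.join(binary))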

Variable Transformation
A variable transform is a function that maps the entire set of values of a given attribute to a new set of
replacement values such that each old value can be identified with one of the new values.
A simple mathematical function is applied to each value individually. If x is a variable, then examples of
such transformations include power(x, k), log(x), power(e, x), |x|
 Normalization: It refers to various techniques to adjust to differences among attributes in terms of
frequency of occurrence, mean, variance, range
 Standardization: In statistics, it refers to subtracting off the means and dividing by the standard
deviation.
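A few lines of NumPy illustrating such transformations, including standardization and min-max normalization (an added sketch, not from the text):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Simple functions applied to each value individually.
x_log = np.log(x)
x_abs = np.abs(x)

# Standardization: subtract the mean and divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Min-max normalization: rescale the values into the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std.round(2))
print(x_norm.round(2))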

Proximity and Proximity Measure


 The term proximity refers to either similarity or dissimilarity; the proximity between two objects is a function of the proximity between the corresponding
attributes of the two objects.
 Proximity measures refer to the measures of similarity and dissimilarity.
 Similarity and Dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbour classification, and anomaly detection.

Techniques for Measures of Similarity and Dissimilarity


 Proximity measure for similarity and dissimilarity with one attribute
 Proximity measure for similarity and dissimilarity with multiple attribute
o Euclidean distance and Correlation measures are used for dense data, such as time series or
two-dimensional points.
o Jaccard and cosine similarity measures, which are useful for sparse data

 Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1] ; 0 - no similarity and 1 - complete similarity
 Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Transformation:
– It is a function used to convert similarity to dissimilarity and vice versa, or to transform a
proximity measure to fall into a particular range. For instance:
 s’ = (s - min(s)) / (max(s) - min(s))
where,
s’ = new transformed proximity measure value,
s = current proximity measure value,
min(s) = minimum of proximity measure values,
max(s) = maximum of proximity measure values
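As a tiny Python sketch of these transformations (illustration only):

def to_unit_range(s, s_values):
    # s' = (s - min(s)) / (max(s) - min(s)): map a proximity value into [0, 1].
    return (s - min(s_values)) / (max(s_values) - min(s_values))

def similarity_from_dissimilarity(d):
    # One common choice when d already lies in [0, 1]: s = 1 - d.
    return 1.0 - d

dissimilarities = [0.0, 2.0, 5.0, 10.0]
rescaled = [to_unit_range(d, dissimilarities) for d in dissimilarities]
similarities = [similarity_from_dissimilarity(d) for d in rescaled]
print(rescaled, similarities)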

Similarity and Dissimilarity between Simple Attributes


The proximity of objects with a number of attributes is typically defined by combining the proximities of
individual attributes. Based on the attribute type, we can measure the similarity and dissimilarity.

Nominal Attribute:
– Consider objects described by one nominal attribute.
– How to compare similarity of two objects like this?
– Nominal attributes only tell us about the distinctness of objects.
– Hence, in this case similarity is defined as 1 if attribute values match, and 0 otherwise and
oppositely defined would be dissimilarity.
Ordinal Attribute:
– For objects with a single ordinal attribute, the situation is more complicated because information
about order needs to be taken into account.
– Consider an attribute that measures the quality of a product, on the scale {poor, fair, OK, good,
wonderful}.
– We have 3 products P1, P2, & P3 with quality wonderful, good, & OK respectively. In order to
compare ordinal quantities, they are mapped to successive integers.
– In this case, the scale is mapped to {0, 1, 2, 3, 4} respectively. Then, dissimilarity d(P1, P2) = 4 – 3 = 1
or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 – 3)/4 = 0.25. A similarity for
ordinal attributes can be defined as s = 1 – d.
Interval or Ratio attributes:
– For interval or ratio attributes, the natural measure of dissimilarity between two objects is the
absolute difference of their values.
– For example, we might compare our current weight and our weight a year ago by saying “I am ten
pounds heavier.” The dissimilarity range from 0 to ∞, rather than from 0 to 1.
– Similarities for such attributes are typically obtained by transforming a dissimilarity into a similarity,
as summarized in the table below, in which x and y are two objects that have one attribute of the indicated type,
and s(x, y) and d(x, y) are the similarity and dissimilarity between x and y.

Dissimilarities between Data Objects with multiple attributes


We begin with a discussion about distances, which are dissimilarities with certain properties.
Euclidean Distance
The Euclidean distance, d, between two points, x and y, in one, two, three, or higher-dimensional space, is
given by the following formula:

    d(x, y) = sqrt( (x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2 )
where n is the number of dimensions, and xk and yk are respectively, the kth attributes (components)
of x and y.
We illustrate this formula with below figure, which shows a set of points, the x and y coordinates of these
points, and the distance matrix containing the pairwise distances of these points.


point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1
x and y coordinates of four points

D(p1, p2) = sqrt((0 – 2)^2 + (2 – 0)^2) = sqrt(4+4) = sqrt(8) = 2.828
D(p1, p3) = sqrt((0 – 3)^2 + (2 – 1)^2) = sqrt(9+1) = sqrt(10) = 3.162
D(p1, p4) = sqrt((0 – 5)^2 + (2 – 1)^2) = sqrt(25+1) = sqrt(26) = 5.099
D(p2, p3) = sqrt((2 – 3)^2 + (0 – 1)^2) = sqrt(1+1) = sqrt(2) = 1.414
D(p2, p4) = sqrt((2 – 5)^2 + (0 – 1)^2) = sqrt(9+1) = sqrt(10) = 3.162
D(p3, p4) = sqrt((3 – 5)^2 + (1 – 1)^2) = sqrt(4+0) = sqrt(4) = 2

Finally, the distance matrix is


p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0

The Euclidean distance measure given in the above equation is generalized by the Minkowski distance metric
shown in the equation below.

    d(x, y) = ( |x1 – y1|^r + |x2 – y2|^r + … + |xn – yn|^r )^(1/r)
Where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the kth
attributes (components) or data objects x and y.
The following are the three most common examples of Minkowski distances.
 r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example of this for binary
vectors is the Hamming distance, which is just the number of bits that are different between two
binary vectors
 r = 2. Euclidean distance(L2 norm).
 r → ∞. “supremum” (Lmax norm, L∞ norm) distance. This is the maximum difference between any
component of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Similarly, compute the distance matrices for the L1 norm and the L∞ norm using the above data.
D(p1, p2) = |(0 – 2)| + |(2 – 0)| = 2+2 = 4
D(p1, p3) = |(0 – 3)| + |(2 – 1)| = 3+1 = 4
D(p1, p4) = |(0 – 5)| + |(2 – 1)| = 5+1 = 6
D(p2, p3) = |(2 – 3)| + |(0 – 1)| = 1+1 = 2
D(p2, p4) = |(2 – 5)| + |(0 – 1)| = 3+1 = 4
D(p3, p4) = |(3 – 5)| + |(1 – 1)| = 2+0 = 2


L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L1 distance matrix

D(p1, p2) = max{|(0 – 2)| , |(2 – 0)|} = max{2, 2} =2


D(p1, p3) =max{ |(0 – 3)| , |(2 – 1)|} = max{3, 1} =3
D(p1, p4) = max{|(0 – 5)| , |(2 – 1)|} = max{5, 1} =5
D(p2, p3) = max{|(2 – 3)| , |(0 – 1)|} = max{1, 1} =1
D(p2, p4) = max{|(2 – 5)| , |(0 – 1)|} = max{3, 1} =3
D(p3, p4) = max{|(3 – 5)| , |(1 – 1)|} = max{2, 0} =2
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Lmax or L∞ distance matrix
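The three distance matrices above can be reproduced with a short NumPy sketch of the Minkowski distance (r = 1 and r = 2) and the supremum (L∞) distance:

import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

def minkowski(x, y, r):
    # d(x, y) = (sum_k |xk - yk|^r)^(1/r); r = 1 is city block, r = 2 is Euclidean.
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def supremum(x, y):
    # L-infinity norm: the maximum difference over all components.
    return np.max(np.abs(x - y))

n = len(points)
for name, dist in [("L1", lambda a, b: minkowski(a, b, 1)),
                   ("L2", lambda a, b: minkowski(a, b, 2)),
                   ("Linf", supremum)]:
    D = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    print(name)
    print(np.round(np.array(D), 3))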
Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance
between two points, x and y, then the following properties hold.
1. Positivity
a) d(x, y) > 0 for all x and y,
b) d(x, y) = 0 only if x = y
2. Symmetry
d(x, y) = d(y, x) for all x and y
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y and z
The measures that satisfy all three properties are called metrics.

Similarities between Data Objects


For similarities, the triangle inequality typically does not hold, but symmetry and positivity typically do. To
be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are
the following:
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
There is no general analog of the triangle inequality for similarity measures. The following are the some of
the similarity measures.
1) Jaccard Similarity
2) Cosine Similarity

Similarity measures for Binary Data


Similarity measures between objects that contain only binary attributes are called similarity
coefficients and typically have values between 0 and 1. A value of 1 indicates that the two objects are
completely similar, while a value of 0 indicates that the objects are not at all similar.
Let x and y be two objects that consist of n binary attributes. The comparison between these two binary
objects, i.e., two binary vectors, is done using the following four quantities (frequencies):
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1
Simple Matching Coefficients (SMC):
SMC = number of matching attribute values / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
This measure counts both presences and absences equally. The SMC could be used to find students who had
answered questions similarly on a test that consisted only of true/false questions.

The Jaccard Coefficients:


Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. If
each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was
purchased, while a 0 indicates that the product was not purchased. Since the number of products not
purchased by any customer far outnumbers the number of products that were purchased, a similarity
measure like Jaccard Coefficient is used, to handle objects consisting of asymmetric binary attributes.
J = number of 11 matches / number of non-zero attributes
= number of matching presences/ number of attributes not involved in 00 matches
= (f11) / (f01 + f10 + f11)

To illustrate the difference between these two similarity measures, we calculate SMC and Jaccard for the
following two binary vectors.
x= 1000000000
y= 0000001001
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
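A minimal Python sketch of the same SMC and Jaccard computation is given below (ours); the binary vectors are the x and y from the example, and the variable names are illustrative.

# SMC and Jaccard coefficient for the two binary vectors above.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)

smc = (f11 + f00) / (f01 + f10 + f11 + f00)                      # counts matches of both kinds
jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0  # ignores 0-0 matches

print(smc, jaccard)   # 0.7 0.0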

18) Explain the Cosine Similarity with an example.


Documents are often represented as vectors, where each attribute represents the frequency with which a
particular term (word) occurs in the document. The cosine similarity is one of the most common measure
of document similarity. If x and y are two document vectors, then
cos(x, y) = (x . y) / (||x|| ||y||)   (or)   cos(x, y) = <x, y> / (||x|| ||y||)
where . (equivalently <x, y>) indicates the dot product and ||x|| denotes the length (Euclidean norm) of vector x, i.e., ||x|| = (x . x)^0.5.

The below example calculates the cosine similarity for the following two data objects, which might
represent document vectors.
x= 3205000200
y= 1000000102
<x, y> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| x || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| y || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(x, y) = 5 / (6.481 * 2.449) = 0.3150

Geometrically, cosine similarity is a measure of the (cosine of the) angle between x and y. Thus, if the cosine
similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If
the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Extended Jaccard Coefficient (Tanimoto Coefficient)


The extended Jaccard coefficient can be used for document data and reduces to the Jaccard
coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto
coefficient. This coefficient, denoted EJ, is defined by the following equation:
EJ(x, y) = (x . y) / (||x||^2 + ||y||^2 – x . y)
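The following Python sketch (ours) computes the cosine similarity and the extended Jaccard (Tanimoto) coefficient for the two document vectors of the example above; the printed EJ value is simply what the formula yields for these vectors.

import math

# Document vectors from the cosine-similarity example above.
x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))       # <x, y> = 5
norm_x = math.sqrt(sum(a * a for a in x))    # ||x|| = sqrt(42)
norm_y = math.sqrt(sum(b * b for b in y))    # ||y|| = sqrt(6)

cosine = dot / (norm_x * norm_y)
ej = dot / (norm_x ** 2 + norm_y ** 2 - dot)  # extended Jaccard (Tanimoto)

print(round(cosine, 4), round(ej, 4))         # ~0.315 and ~0.1163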
Correlation
The correlation between two data objects that have binary or continuous attributes is a measure of the
linear relationship between the attributes of the objects (e.g., height, weight).
Pearson's correlation coefficient between two objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (sx sy) = sxy / (sx sy)
where, using standard statistical notation,
sxy = covariance(x, y) = (1/(m – 1)) Σ (k=1 to m) (xk – x̄)(yk – ȳ)
sx = √( (1/(m – 1)) Σ (k=1 to m) (xk – x̄)² ) is the standard deviation of x (sy is defined analogously for y)
x̄ = (1/m) Σ xk and ȳ = (1/m) Σ yk are the means of x and y.
 Positive correlation is a relationship between two variables in which both variables move in the
same direction; that is, when one variable increases the other increases, and vice versa. For
example, a positive correlation may be that the more you exercise, the more calories you will burn.
 Whilst negative correlation is a relationship where one variable increases as the other decreases,
and vice versa.

 The correlation coefficient is a value that indicates the strength of the relationship between
variables. The coefficient can take any values from -1 to 1. The interpretations of the values are:
– -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when
one variable increases, the other variable decreases).
– 0: No correlation. The variables do not have a relationship with each other.
– 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when
one variable increases, the other variable also increases).
To understand this, consider the following example. The data below show the daily temperature (x) and the corresponding number of customers (y):
Temperature (x): 97 86 89 84 94 74
Customers (y): 14 11 9 9 15 7
1) First find the mean of each variable, subtract each value from its respective mean, and
multiply the deviations together, as follows:
Mean of X, X̄ = (97+86+89+84+94+74)/6 = 524/6 = 87.333
Mean of Y, Ȳ = (14+11+9+9+15+7)/6 = 65/6 = 10.833
2) Now, find the covariance:
COV(x, y) = Σ (k=1 to m) (xk – X̄)(yk – Ȳ) / (m – 1) = 112.33/(6 – 1) = 112.33/5 = 22.46


3) Now, find the standard deviations of x and y:
Sx = √( Σ (xk – X̄)² / (m – 1) )   and   Sy = √( Σ (yk – Ȳ)² / (m – 1) )
Sx = √(331.28/5) = √66.25 ≈ 8.13
Sy = √(48.78/5) = √9.75 ≈ 3.1
4) Finally, the correlation = 22.46/(8.13 × 3.1) = 22.46/25.20 ≈ 0.89
5) A correlation of about 0.89 shows that the (positive) relationship between temperature and the number of
customers is very strong.
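The calculation above can be verified with the short Python sketch below (ours); it reports a correlation of about 0.88 because it does not round the intermediate values, whereas the hand calculation rounds Sx and Sy.

# Pearson correlation for the temperature / customer-count example above.
x = [97, 86, 89, 84, 94, 74]   # temperature
y = [14, 11, 9, 9, 15, 7]      # number of customers

m = len(x)
mean_x = sum(x) / m
mean_y = sum(y) / m

cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (m - 1)
sx = (sum((xi - mean_x) ** 2 for xi in x) / (m - 1)) ** 0.5
sy = (sum((yi - mean_y) ** 2 for yi in y) / (m - 1)) ** 0.5

corr = cov / (sx * sy)
# corr comes out ~0.88; the hand calculation above gives ~0.89 due to intermediate rounding.
print(round(cov, 2), round(sx, 2), round(sy, 2), round(corr, 2))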

Data Exploration
 Data exploration can aid in selecting the appropriate preprocessing and data analysis techniques.
 Summary statistics, such as the mean and standard deviation of a set of values, and visualization
techniques, such as histograms and scatter plots, are standard methods that are widely employed
for data exploration.
 Many of the exploratory data techniques are illustrated with the Iris Plant data set, which can be
obtained from the UCI Machine Learning Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html).
 It consists of information on 150 Iris flowers, 50 each from one of three Iris species: Setosa,
Versicolour, and Virginica.
 Each flower is characterized by five attributes:
1) Sepal length in centimeters
2) Sepal width in centimeters
3) Petal length in centimeters
4) Petal width in centimeters
5) Class ( Setosa, Virginica, Versicolour)
The sepals of a flower are the outer structures that protect the more fragile parts of the flower, such as the
petals.

Summary Statistics
 Summary statistics are quantities, such as mean and standard deviation that capture various
characteristics of a potentially large set of values with a single number or a small set of numbers.
That is, Summary statistics are numbers that summarize properties of the data
 Summarized properties include
– Frequencies and Mode
– Percentiles
– Measures of Location: Mean and Median
– Measures of Spread: Range and Variance
– Other ways to summarize the data
 Most summary statistics can be calculated in a single pass through the data

Frequency and the Mode


 Given a set of unordered categorical values, we can compute the frequency with which each value
occurs for a particular set of data.
 Given a categorical attribute x, which can take values {v1, v2, …., vi, ….., vk} and a set of m objects,
the frequency of a value vi is defined as
Frequency(vi) = (number of objects with attribute value vi) / m
Example:
Consider a set of students who have an attribute, class, which can take values from the set
{freshman, sophomore, junior, senior}. The below table shows the number of students for each
value of the class attribute.
Class Size Frequency
freshman 140 0.23
sophomore 160 0.27
junior 130 0.22
senior 170 0.28
 The mode of a categorical attribute is the value that has the highest frequency.
 In above example, the mode of the class attribute is senior, with a frequency of 0.28.
 The notions of frequency and mode are typically used with categorical data.
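A small Python sketch (ours) that computes the frequencies and the mode for the class attribute of the example above; the dictionary layout is just one convenient representation of the table.

# Class attribute values weighted by the sizes in the table above.
sizes = {"freshman": 140, "sophomore": 160, "junior": 130, "senior": 170}
m = sum(sizes.values())                       # total number of students (600)

frequency = {v: round(n / m, 2) for v, n in sizes.items()}
mode = max(sizes, key=sizes.get)              # the value with the highest frequency

print(frequency)   # {'freshman': 0.23, 'sophomore': 0.27, 'junior': 0.22, 'senior': 0.28}
print(mode)        # 'senior'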

Percentile
 For continuous data, the notion of a percentile is more useful.
 Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile xp
is a value of x such that p% of the observed values of x are less than xp.
 For instance, the 50th percentile is the value x50% such that 50% of all values of x are less than x50%.

Measures of location: Mean and Median


For continuous data, two of the most widely used summary statistics are the mean and median, which are
measures of the location of a set of values.
Consider a set of m objects and an attribute x. let {x1, x2, … , xm} be the attribute values of x for these m
objects. Assume that these values might be the heights of m children. Let {x(1), x(2), ……., x(m)} represent the
values of x after they have been sorted in non-decreasing order (i.e. ascending order). Thus, x(1) = min(x)
and x(m) = max(x). Then, the mean and median are defined as follows:
mean(x) = x̄ = (1/m) Σ (k=1 to m) xk
median(x) = x(r+1) if m is odd, i.e., m = 2r + 1; median(x) = (x(r) + x(r+1))/2 if m is even, i.e., m = 2r
The median is the middle value if there is an odd number of values, and the average of the two middle values
if the number of values is even. Thus, for seven values, the median is x(4), while for ten values, the median is
(x(5) + x(6))/2.

Example:
Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the
typical house price. In $100,000 the prices were
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
Using the mean, we would say that the average house price is $744,000:
Mean = (2.7+2.9+3.1+3.4+3.7+4.1+4.3+4.7+4.7+40.8)/10 = 7.44
Since there is an even number of outcomes, the median is the average of the middle two values: (3.7 + 4.1)/2 = 3.9.
The median house price is $390,000.

Trimmed Mean
A trimmed mean (sometimes called a truncated mean) is similar to a “regular” mean (average), but it trims
any outliers.
 These means are expressed in percentages. The percentage tells you what percentage of data to
remove.
 A percentage p between 0 and 100 is specified, the top and bottom (p/2)% of the data is thrown
out, and the mean is calculated in the normal way.
 For example, with a 10% trimmed mean, the lowest 5% and highest 5% of the data are excluded.
The mean is calculated from the remaining 90% of data points.

Example: Find the trimmed 40% mean for the following test scores: 60, 81, 83, 91, 99.
Step 1: Trim the top and bottom 20% of the data (one value from each end). That leaves us with the middle
three values: 81, 83, 91.
Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.

The median is a trimmed mean with p = 100%, while the standard mean corresponds to p = 0%
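The house-price and trimmed-mean examples above can be reproduced with the following Python sketch (ours); the helper trimmed_mean() is illustrative and assumes the trimmed percentage p is split evenly between the two ends.

import statistics

# House-price example (values in $100,000) from above.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]
print(statistics.mean(prices))     # 7.44 -> $744,000, pulled up by the outlier 40.8
print(statistics.median(prices))   # 3.9  -> $390,000

def trimmed_mean(values, p):
    """Mean after discarding the lowest and highest (p/2)% of the values."""
    values = sorted(values)
    k = int(len(values) * (p / 100) / 2)   # number of values trimmed at each end
    return statistics.mean(values[k:len(values) - k])

scores = [60, 81, 83, 91, 99]
print(trimmed_mean(scores, 40))    # 85.0 -> mean of the middle three values 81, 83, 91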

Measures of Spread: Range and Variance


Another set of commonly used summary statistics for continuous data are those that measure the
dispersion or spread of a set of values. Such measures indicate if the attribute values are widely spread out
or if they are relatively concentrated around a single point such as the mean.
Range: The simplest measure of spread is the range, which, given an attribute x with a set of m values {x1,
x2, …, xm}, is defined as
Range(x) = max(x) – min(x) = x(m) – x(1)
The variance is often preferred as a measure of spread. The variance of the values of an attribute x is typically
written as sx² and is defined below. The standard deviation, which is the square root of the variance, is
written as sx and has the same units as x.
variance(x) = sx² = (1/(m – 1)) Σ (k=1 to m) (xk – x̄)²
However, the variance, like the mean, is sensitive to outliers, so other, more robust measures of spread are often used:
These include the Absolute Average Deviation (AAD), the Median Absolute Deviation (MAD), and the Interquartile Range (IQR):
AAD(x) = (1/m) Σ (k=1 to m) |xk – x̄|
MAD(x) = median( {|x1 – x̄|, …, |xm – x̄|} )
IQR(x) = x75% – x25%

Example:
The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He
examines 10 randomly selected receipts for parties of four and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
Step1: Now, calculate the mean by adding and dividing by 10 to get
Mean(x) = (44+50+38+96+42+47+40+39+46+50)/10 = 49.2
Step2: Below is the table for getting the standard deviation:
x x - 49.2 (x - 49.2 )2
44 -5.2 27.04
50 0.8 0.64
38 -11.2 125.44
96 46.8 2190.24
42 -7.2 51.84
47 -2.2 4.84
40 -9.2 84.64
39 -10.2 104.04
46 -3.2 10.24
50 0.8 0.64
Total 2599.6

Step3: Now, calculate the variance: sx² = (1/(10 – 1)) * 2599.6 = 288.8 ≈ 289


Step4: Hence the variance is 289 and the standard deviation is the square root of 289 = 17.
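The restaurant example can be checked with Python's statistics module, as in the sketch below (ours); note that statistics.variance() uses the (m – 1) divisor, matching the formula above.

import statistics

# Restaurant-receipt example from above (amounts spent by parties of four).
receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean = statistics.mean(receipts)            # 49.2
var = statistics.variance(receipts)         # sample variance, divides by (m - 1)
std = statistics.stdev(receipts)            # square root of the variance

print(mean, round(var, 1), round(std, 1))   # 49.2 288.8 17.0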

Multivariate Summary Statistics


 Measures of locations for data that consists of several attributes (multivariate data) can be obtained
by computing the mean and median separately for each attribute.
 Thus, given a data set with n attributes, the mean of the data objects, x̄, is given by
x̄ = (x̄1, x̄2, …, x̄n)
where x̄i is the mean of the ith attribute xi.
 For data with continuous variables, the spread of the data is most commonly captured by the
covariance matrix S, whose ijth entry sij is the covariance of the ith and jth attributes of the data.
Thus, if xi and xj are the ith and jth attributes, then
sij = covariance(xi, xj)
 In turn, covariance(xi, xj) is given by
covariance(xi, xj) = (1/(m – 1)) Σ (k=1 to m) (xki – x̄i)(xkj – x̄j)
 Where xki and xkj are the values of the ith and jth attributes for the kth object.
 Notice that covariance(xi, xi) = variance(xi). Thus, the covariance matrix has the variances of the
attributes along the diagonal.
 The covariance of two variables is a measure of the degree to which two attributes vary together
and depends on the magnitudes of the variables.
 A value near 0 indicates that two attributes do not have a relationship.
 The ijth entry of the correlation matrix R, is the correlation between the ith and jth attributes of the
data.
 If xi and xj are the ith and jth attributes, then
rij = correlation(xi, xj) = covariance(xi, xj) / (si sj)
where si and sj are the standard deviations of xi and xj, respectively.
 The diagonal entries of R are correlation(xi, xi) = 1, while other entries are between -1 and 1.
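A minimal NumPy sketch (ours) of these multivariate summaries; the four 3-attribute objects are made-up, Iris-like measurements used only for illustration.

import numpy as np

# Toy multivariate data set: rows are objects, columns are attributes
# (e.g., sepal length, sepal width, petal length for a few flowers).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
])

mean_vec = X.mean(axis=0)          # mean of each attribute: (x1bar, x2bar, x3bar)
S = np.cov(X, rowvar=False)        # covariance matrix; diagonal holds the variances
R = np.corrcoef(X, rowvar=False)   # correlation matrix; diagonal entries are 1

print(mean_vec)
print(S)
print(R)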
UNIT – 3:
Data Warehouse:
Unit III: Data Warehouse: basic concepts, Data Warehousing Modeling: Data Cube and OLAP, Data
Warehouse implementation: efficient data cube computation, partial materialization, indexing OLAP data,
efficient processing of OLAP queries. (H & C)

Data warehouse:
 Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.

It is defined in many different ways, but not rigorously.


 Data warehouse is a decision support database that is maintained separately from the
organization's operational database. It supports information processing by providing a solid
platform of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision-making process.”—William. H. Inmon
 We view data warehousing as the process of constructing and using data warehouses.

Four keywords of the data ware house:


 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process.”—William. H. Inmon
 The four keywords—subject-oriented, integrated, time-variant, and non-volatile—distinguish data
warehouses from other data repository systems, such as relational database systems, transaction
processing systems, and file systems.

Subject-oriented:
 A data warehouse is organized around major subjects such as customer, supplier, product, and
sales.
 Rather than concentrating on the day-to-day operations and transaction processing of an
organization, a data warehouse focuses on the modelling and analysis of data for decision
makers.
 Data warehouses typically provide a simple and concise view of particular subject issues by
excluding data that are not useful in the decision support process.
Integrated:
 A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as
relational databases, flat files, and online transaction records.
 Data cleaning and data integration techniques are applied to ensure consistency in naming
conventions, encoding structures, attribute measures, and so on.
Time-variant:
 Data are stored to provide information from an historic perspective (e.g., the past 5–10 years).
Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
Non-volatile:
 A data warehouse is always a physically separate store of data transformed from the application
data found in the operational environment.
 Due to this separation, a data warehouse does not require transaction processing, recovery, and
concurrency control mechanisms.
 It usually requires only two operations in data accessing: initial loading of data and access of
data.

How are organizations using the information from data warehouses?


Many organizations using this information to support business decision-making activities, including:
 increasing customer focus, which includes the analysis of customer buying patterns (such as buying
preference, buying time, budget cycles, and appetites for spending);
 Repositioning products and managing product portfolios by comparing the performance of sales by
quarter, by year, and by geographic regions in order to fine-tune production strategies
 Analyzing operations and looking for sources of profit
 Managing customer relationships, making environmental corrections, and managing the cost of
corporate assets

Differentiate the data warehousing and traditional database approach for heterogeneous database
integration?
Traditional heterogeneous DB integration: Query driven approach
 The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases.
 When a query is posed to a client site, a metadata dictionary is used to translate the query into
queries appropriate for the individual heterogeneous sites involved. These queries are then mapped
and sent to local query processors.
 The results returned from the different sites are integrated into a global answer set.
 This query-driven approach requires complex information filtering and integration processes, and
competes with local sites for processing resources.
 It is inefficient and potentially expensive for frequent queries, especially queries requiring
aggregations.
Data warehouse: update-driven, high performance
 Data warehousing employs an update-driven approach in which information from multiple,
heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and
analysis.
 A data warehouse brings high performance to the integrated heterogeneous database system
because data are copied, pre-processed, integrated, annotated, summarized, and restructured into
one semantic data store.

Advantages of a Data warehouse compared to traditional database approach for heterogeneous


database integration:
Advantages of a Data Warehouse:
 The information is integrated in advance, therefore there is no overhead for (i) querying the sources
and (ii) combining the results
 There is no interference with the processing at local sources (a local source may go offline)
 Some information is already summarized in the warehouse, so query effort is reduced.
When should mediators be used?
 When queries apply on current data and the information is highly dynamic (changes are very
frequent).
 When the local sources are not collaborative.

Differences between operational database system and data warehouses:


Online Transaction Processing (OLTP):
The major task of on-line operational database systems is to perform online transaction and query
processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of
the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking,
payroll, registration, and accounting.

Online Analytical Processing (OLAP):


Data warehouse systems serve users or knowledge workers in the role of data analysis and decision
making. Such systems can organize and present data in various formats in order to accommodate the
diverse needs of the different users. These systems are known as on-line analytical processing (OLAP)
systems.

The major distinguishing features between OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented
and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that are too detailed to be easily used for decision
making. An OLAP system manages large amounts of historical data, provides facilities for summarization
and aggregation, and stores and manages information at different levels of granularity. These features
make the data easier to use in informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-
oriented database design. An OLAP system typically adopts either a star or snowflake model and a subject
oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or Department, without
referring to historic data or data in different organizations. OLAP systems deal with information that
originates from different organizations, integrating information from many data stores. Because of their
huge volume, OLAP data are stored on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a
system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are
mostly read-only operations (because most data warehouses store historical rather than up-to-date
information).

Comparison of OLTP and OLAP Systems:
Why Data warehouse is kept separate from operational databases?


 A major reason for such a separation is to help promote the high performance of both systems.
 An operational database is designed and tuned from known tasks and workloads like indexing and
hashing using primary keys, searching for particular records, and optimizing “canned” queries.
 Data warehouse queries are often complex. They involve the computation of large data groups at
summarized levels, and may require the use of special data organization, access, and
implementation methods based on multidimensional views.
 Processing OLAP queries in operational databases degrades the performance of operational tasks.
 An operational database supports the Concurrency control and recovery mechanisms (e.g., locking
and logging) to ensure the consistency and robustness of transactions.
 An OLAP query often needs read-only access of data records for summarization and aggregation.
Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize
the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP
system.
 Finally, the separation of operational databases from data warehouses is based on the different
structures, contents, and uses of the data in these two systems.
 Decision support requires historic data, whereas operational databases do not typically maintain
historic data.
 Decision support requires consolidation (e.g., aggregation and summarization) of data from
heterogeneous sources; operational databases contain only detailed raw data, such as transactions,
which need to be consolidated before analysis.
 The two systems provide quite different functionalities and require different kinds of data, it is
presently necessary to maintain separate databases.
 However, many vendors of operational relational database management systems are beginning to
optimize such systems to support OLAP queries. As this trend continues, the separation between
OLTP and OLAP systems is expected to decrease.

Multitier data ware house architecture:

Data Warehouse can be visualized as three-tier architecture


1) Bottom tier represents Data warehouse server.
2) Middle tier represents the OLAP engine or OLAP server.
3) Top tier represents the front end client layer.

1) Bottom Tier:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases
or other external sources (e.g., customer profile information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway
is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a
server. Examples of gateways include ODBC (Open Database Connection) and OLEDB (Object Linking and
Embedding Database) by Microsoft and JDBC (Java Database Connection).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
2) Middle Tier:
The middle tier is an OLAP server.
An OLAP server is a set of specifications that acts as a gateway between the user and the data warehouse
(database).
OLAP Server is typically implemented using either
1) a relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements
multidimensional data and operations).
3) Top Tier:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data warehouse models:


From the architecture point of view, there are three data warehouse models:
1) The enterprise warehouse
2) The data mart, and
3) The virtual warehouse.
1) Enterprise warehouse:
 An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one or more operational
systems or external information providers.
 It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
 An enterprise data warehouse may be implemented on traditional mainframes, computer super-
servers, or parallel architecture platforms.
 It requires extensive business modeling and may take years to design and build.
2) Data mart:
 A data mart contains a subset of corporate-wide data that is of value to a specific group of users.
 The scope is confined to specific selected subjects. For example, a marketing data mart may confine
its subjects to customer, item, and sales.
 The data contained in data marts tend to be summarized.
 Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or
Windows based.
 The implementation cycle of a data mart is measured in weeks rather than months or years.
 Depending on the source of data, data marts can be categorized as
1) Independent data marts:
2) Dependent data marts.
1) Independent data marts: Independent data marts are sourced from data captured from one
or more operational systems or external information providers, or from data generated locally
within a particular department or geographic area.
2) Dependent data marts: Dependent data marts are sourced directly from enterprise data
warehouses.
3) Virtual warehouse:
 A virtual warehouse is a set of views over operational databases.
 For efficient query processing, only some of the possible summary views may be materialized.
 A virtual warehouse is easy to build but requires excess capacity on operational database servers.

What are the pros and cons of the top-down and bottom-up approaches to data warehouse
development?”
The top-down development of an enterprise warehouse serves as a systematic solution and
minimizes integration problems. However, it is expensive, takes a long time to develop, and lacks flexibility
due to the difficulty in achieving consistency
The bottom up approach to the design, development, and deployment of independent data
marts provides flexibility, low cost, and rapid return of investment. It can lead to problems when
integrating various disparate data marts into a consistent enterprise data warehouse.

Approach for data warehouse development:

A recommended method for the development of data warehouse systems is to implement


the warehouse in an incremental and evolutionary manner, as shown in above Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among different
subjects and potential usages. This high-level model, although it will need to be refined in the further
development of enterprise data warehouses and departmental data marts, will greatly reduce future
integration problems.
Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.
Third, distributed data marts can be constructed to integrate different data marts via hub
servers.
Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent data marts.

Functions of back end tools and utilities:


Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools
and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
Refresh, which propagates the updates from the data sources to the warehouse.

Types of OLAP servers:


There are different types of OLAP servers
1) Relational OLAP servers.
2) Multidimensional OLAP servers.
3) Hybrid OLAP servers.
4) Specialized SQL servers.
Relational OLAP (ROLAP) servers:
These are the intermediate servers that stand in between a relational back-end server and
client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse
data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS
back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP
technology tends to have greater scalability than MOLAP technology.
Example: The DSS server of Microstrategy adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers:
These servers support multidimensional data views through array-based multidimensional
storage engines. They map multidimensional views directly to data cube array structures. The advantage of
using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with
multidimensional data stores, the storage utilization may be low if the data set is sparse.
Hybrid OLAP (HOLAP) servers:
The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP.
Specialized SQL servers:
To meet the growing demand of OLAP processing in relational databases, some database
system vendors implement specialized SQL servers that provide advanced query language and query
processing support for SQL queries over star and snowflake schemas in a read-only environment.

Metadata repository in data warehouse:


Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given warehouse.
Additional metadata are created and captured for time-stamping any extracted data, the source of the
extracted data, and missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
 A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
 Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
 The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
 Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user authorization
and access control).
 Data related to system performance, which include indices and profiles that improve data access
and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and
replication cycles.
 Business metadata, which include business terms and definitions, data ownership information, and
charging policies.

Multidimensional data model:


 Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
Data cube:
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
Dimensions:
Dimensions are the perspectives or entities with respect to which an organization wants to
keep records.
For example, AllElectronics may create a sales data warehouse in order to keep records of
the store’s sales with respect to the dimensions time, item, branch, and location. These dimensions allow
the store to keep track of things like monthly sales of items and the branches and locations at which the
items were sold.
Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension. For example, a dimension table for item may contain the attributes item name,
brand, and type.
Dimension tables can be specified by users or experts, or automatically generated and
adjusted based on data distributions.
Facts:
A multidimensional data model is typically organized around a central theme, such as sales.
This theme is represented by a fact table.
Facts are numeric measures. Think of them as the quantities by which we want to analyze
relationships between dimensions.
Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars),
units sold (number of units sold), and amount budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the
related dimension tables.

2D and 3D Data cubes:

Example:
Consider AllElectronics store. In below 2-D representation, the sales for Vancouver are
shown with respect to the time dimension (organized in quarters) and the item dimension (organized
according to the types of items sold). The fact or measure displayed is dollars sold (in thousands).

Suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table 4.3. The 3-D data
in the table are represented as a series of 2-D tables. Conceptually, we may also represent the same data in
the form of a 3-D data cube, as in Figure 4.3.
Suppose that we would now like to view our sales data with an additional fourth dimension such as
supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D
cubes, as shown in Figure 4.4.

In general, it is possible to display n-D data cubes as a series of (n–1)-D cubes. A data cube
such as each of the above is often referred to as a cuboid.
Given a set of dimensions, we can generate a cuboid for each of the possible subset of the
given dimensions. The result would form a lattice of cuboids, each showing the data at different level of
summarization, or group-by.
The below Figure shows a lattice of cuboids forming a data cube for the dimensions time, item, location,
and supplier.
Apex cuboid:
The 0-D cuboid, which holds the highest level of summarization, is called as the apex cuboid
and it is denoted by all.
Base cuboid:
The cuboid that holds the lowest level of summarization is called the base cuboid.

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models

A data warehouse requires a concise, subject-oriented schema that facilitates online data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist in the
form of
1) a star schema,
2) a snowflake schema,
3) a fact constellation schema
Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the
central fact table.

Example:
A star schema for AllElectronics sales is shown in below Figure. Sales are considered along
four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that
contains keys to each of the four dimensions, along with two measures: dollars sold and units sold.
In the star schema, each dimension is represented by only one table, and each table contains
a set of attributes.
For example, the location dimension table contains the attribute set {location key, street,
city, province or state, country}. This constraint may introduce some redundancy.
For example, “Kakinada” and “Visakhapatnam” are both cities in the province of Andhra
Pradesh. Entries for such cities in the location dimension table will create redundancy among the attributes
province or state and country, that is, (..., Visakhapatnam, Andhra Pradesh, India) and (..., Kakinada, Andhra
Pradesh, India).
The attributes within a dimension table may form either a hierarchy (total order) or a lattice
(partial order).
Snowflake schema:
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a
shape similar to a snowflake.
Example:
A snowflake schema for AllElectronics sales is given in below Figure. The sales table
definition is identical to that of the star schema. The single dimension table for item is normalized, resulting
in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing
supplier key and supplier type information.
Similarly, the single dimension table for location can be normalized into two new tables:
location and city. The city key in the new location table links to the city dimension.

Fact constellation:
Sophisticated applications may require multiple fact tables to share dimension tables. This
kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
Example:
A fact constellation schema is shown in below Figure. This schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has
five dimensions, or keys—item key, time key, shipper key, from location, and to location—and two
measures—dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared
between fact tables.
For example, the dimensions tables for time, item, and location are shared between the sales
and shipping fact tables.
Snowflake and star schema models:


 The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in normalized form to reduce redundancies. Such a table is
easy to maintain and saves storage space.
 However, this space savings is negligible in comparison to the typical magnitude of the fact table.
 Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins
will be needed to execute a query.
 Consequently, the system performance may be adversely impacted.
 Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema
in data warehouse design.

Differences between Data mart and Data warehouse:


Data mart | Data warehouse
1. It is designed to store department-specific business information. | 1. It is designed to store enterprise-wide business information.
2. It is designed for middle management. | 2. It is designed for top management.
3. It is a single-subject-specific database. | 3. It is an integration of multi-subject databases.
4. For data marts, the star or snowflake schema is commonly used. | 4. For data warehouses, the fact constellation schema is commonly used.
5. Example: department-wide data, such as sales. | 5. Example: data for the entire organization, such as customers, items, sales, assets, and personnel.

Defining data warehouses and data marts in SQL-based data mining query language (DMQL):
Data warehouses and data marts can be defined using two language primitives, one for cube definition and
one for dimension definition.
The cube definition syntax is:
define cube <cube_name> [<dimension_list>]: <measure_list>


The dimension definition syntax is:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Star Schema: The Star Schema of above example is defined in DMQL as follows:
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Snowflake Schema: The Snowflake Schema of above example is defined in DMQL as follows:
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Fact constellation schema: The Fact constellation Schema of above example is defined in DMQL as follows:
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Categories of measures and their computation:


 A multidimensional point in the data cube space can be defined by a set of dimension–value pairs;
 A data cube measure is a numeric function that can be evaluated at each point in the data cube
space.
 A measure value is computed for a given point by aggregating the data corresponding to the
respective dimension–value pairs defining the given point.
Measures can be organized into three categories based on the kind of aggregate functions used.
1) Distributive measure
2) Algebraic measure
3) Holistic measure

Distributive: An aggregate function is distributive if it can be computed in a distributed manner as follows.


I. Partitioning the entire data set into smaller data subsets.
II. Applying the aggregate functions on each individual subset such as sum(),count().
III. Combining the outcomes of each subset so as to get a value which is equal to the measure value
computed on entire data set after applying the same aggregate functions. Some of the distributive
measures are sum(),count(),max(),min().
Example: sum() can be computed for a data cube by first partitioning the cube into a set of subcubes,
computing sum() for each subcube, and then summing up the partial sums obtained for each subcube. Hence,
sum() is a distributive aggregate function.

Algebraic:
An aggregate function is algebraic if it can be computed by an algebraic function with M
arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive
aggregate function.
For example, avg() (average) can be computed by sum()/count(), where both sum() and
count() are distributive aggregate functions.
A measure is algebraic if it is obtained by applying an algebraic aggregate function.

Holistic:
An aggregate function is holistic if there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the computation. Common examples of holistic
functions include median(), mode(), and rank().
A measure is said to be holistic if it must be computed on the entire data set rather than on partitioned data subsets
by applying holistic aggregate functions. In short, a holistic measure is not computed in a distributive manner.

Concept hierarchies:
 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts.
 Consider a concept hierarchy for the dimension location. City values for location include Vancouver,
Toronto, NewYork, and Chicago.
 Each city, however, can be mapped to the province or state to which it belongs. For example,
Vancouver can be mapped to British Columbia, and Chicago to Illinois.
 The provinces and states can in turn be mapped to the country to which they belong, such as
Canada or the USA. These mappings forma concept hierarchy for the dimension location, mapping a
set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
 For example, suppose that the dimension location is described by the attributes number, street,
city, province or state, zipcode, and country. These attributes are related by a total order, forming a
concept hierarchy such as “street < city < province or state < country”.
 Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice.
 An example of a partial order for the time dimension based on the attributes day, week, month,
quarter, and year is "day < {month < quarter; week} < year".
The two major types of concept hierarchies are


1) Schema hierarchies
2) Set-grouping hierarchies

Schema hierarchies:
A concept hierarchy that is a total or partial order among attributes in a database schema is
called a schema hierarchy.

Set-grouping hierarchies:
Concept hierarchy may also be defined by grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among group of
values.

OLAP operations in the multidimensional data model:


In multidimensional models, data are organized into multiple dimensions, and each dimension contains
multiple levels of abstraction defined by concept hierarchies. This organization provides users with the
flexibility to view data from different perspectives. Hence OLAP provides a user friendly environment for
interactive data analysis.
The different OLAP operations are


1) Roll-Up operation
2) Drill-Down operation
3) Slice and Dice operation
4) Pivot(Rotate) operation
Example: Each of the operations described below is illustrated in figure. At the center of the figure is a data
cube for AllElectronics sales. The cube contains the dimensions location, time, and item, where location is
aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated
with respect to item types.

Roll-up:
The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction.
The below Figure shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location given in Figure. This hierarchy was defined as the total order
“street < city < province or state < country.” The roll-up operation shown aggregates the data by ascending
the location hierarchy from the level of city to the level of country. In other words, rather than grouping the
data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed
from the given cube. For example, consider a sales data cube containing only the location and time
dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of
the total sales by location, rather than by location and by time.

Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions.
The below Figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as “day < month < quarter < year.” Drill-down
occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The
resulting data cube details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be performed by adding
new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by
introducing an additional dimension, such as customer group.

Slice and dice:


The slice operation performs a selection on one dimension of the given cube, resulting in a
subcube. The below Figure shows a slice operation where the sales data are selected from the central cube
for the dimension time using the criterion time = "Q1".
The dice operation defines a subcube by performing a selection on two or more dimensions.
Figure 4.12 shows a dice operation on the central cube based on the following selection criteria that involve
three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home
entertainment" or "computer").

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation. The below Figure shows a pivot operation, where the item and
location axes in a 2-D slice are rotated.
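These OLAP operations are not tied to any particular tool, but their effect can be sketched on a toy relational table using pandas, as below (our sketch); the sales figures and column names are invented for illustration and are not the AllElectronics data from the figures.

import pandas as pd

# Toy AllElectronics-style sales data; values are illustrative only.
sales = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto", "Chicago", "Chicago"],
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":    ["computer", "phone", "computer", "phone", "computer", "phone"],
    "dollars_sold": [1000, 400, 800, 350, 900, 300],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = sales.groupby(["country", "quarter", "item"])["dollars_sold"].sum()

# Drill-down would go the other way (e.g., from quarter to month, if months were stored).

# Slice: select a single value on one dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[sales["city"].isin(["Toronto", "Vancouver"]) & sales["quarter"].isin(["Q1", "Q2"])]

# Pivot (rotate): swap the axes used to display a 2-D slice.
pivot = slice_q1.pivot_table(index="item", columns="city", values="dollars_sold", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")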
Other OLAP operations:


Drill-across executes queries involving (i.e., across) more than one fact table.
Drill-through operation uses relational SQL facilities to drill through the bottom level of a data cube down
to its back-end relational tables.
Starnet Query model for querying multidimensional databases:


The querying of multidimensional databases can be based on a starnet model. A starnet
model consists of radial lines emanating from a central point, where each line represents a concept
hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the
granularities available for use by OLAP operations such as drill-down and roll-up.
Example:
A starnet query model for the AllElectronics data warehouse is shown in below figure. This starnet consists
of four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time,
respectively. Each line consists of footprints representing abstraction levels of the dimension. For example,
the time line has four footprints: “day,” “month,” “quarter,” and “year.” A concept hierarchy may involve a
single attribute (like date for the time hierarchy) or several attributes (e.g.,the concept hierarchy for
location involves the attributes street, city, province or state, and country). In order to examine the item
sales at AllElectronics, users can roll up along the time dimension from month to quarter, or, say, drill down
along the location dimension from country to city. Concept hierarchies can be used to generalize data by
replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as
“year”), or to specialize data by replacing higher-level abstractions with lower-level values.

Techniques involved in data warehouse implementation:


 Data warehouses contain huge volumes of data.
 OLAP servers demand that decision support queries be answered in the order of seconds.
 Data warehouse is represented in the form of multi-dimensional data model that is data cube.
 The following things, we need to consider the following to implement the data warehouse systems.
o Efficient cube computation techniques
o Access methods, and
o Query processing techniques
Efficient computation of data cubes:


 At the core of multidimensional data analysis is the efficient computation of aggregations across
many sets of dimensions.
 In SQL terms, these aggregations are referred to as group-by’s.
 Each group-by can be represented by a cuboid, where the set of group-by’s forms a lattice of
cuboids defining a data cube.

The compute cube Operator and the Curse of Dimensionality


The compute cube operator computes aggregates over all subsets of the dimensions
specified in the operation.
Taking the three attributes city, item, and year as the dimensions for the data cube below, and sales in
dollars as the measure, the total number of cuboids, or group-by's, that can be computed for this data
cube is 2^3 = 8. The possible group-by's are the following: {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions
are not grouped). These group-by's form a lattice of cuboids for the data cube, as shown in the below
Figure.

The base cuboid contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains
the total sum of all sales.
The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is
the most generalized (least specific) of the cuboids, and is often denoted as all.
If we start at the apex cuboid and explore downward in the lattice, this is equivalent to
drilling down within the data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.
An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero
dimensional operation.
An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a
one-dimensional operation.


A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions. Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
Similar to the SQL syntax, the data cube could be defined as
define cube sales cube [city, item, year]: sum(sales in dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid.
A statement such as
compute cube sales cube
would explicitly instruct the system to compute the sales aggregate cuboids for all
eight subsets of the set {city, item, year}, including the empty subset.

Online analytical processing may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids in a data
cube. Precomputation leads to fast response time and avoids some redundant computation.
A major challenge related to this precomputation is that it requires huge storage space if all the cuboids in a data cube are precomputed, especially when the cube has many dimensions. The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of dimensionality.

If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube is 2^n.
But many dimensions do have hierarchies. If each dimension has a concept hierarchy with multiple levels, then the total number of cuboids for n dimensions is computed as

Total number of cuboids = (L1 + 1) × (L2 + 1) × · · · × (Ln + 1)

where Li is the number of levels associated with dimension i. One is added to Li in the above equation to include the virtual top level, all.
For example, the time dimension may contain multiple conceptual levels, such as in the hierarchy "day < month < quarter < year". Since the time dimension has 4 conceptual levels, it contributes 4 + 1 = 5 choices to the product above.
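As a quick check of this formula, a minimal Python sketch is shown below. The hierarchy depths used in the example calls (time, location, and item as described elsewhere in this unit) are assumptions for illustration only.

from math import prod

def total_cuboids(levels_per_dimension):
    # Each dimension contributes (L_i + 1) choices: one per hierarchy level
    # plus the virtual top level 'all'.
    return prod(levels + 1 for levels in levels_per_dimension)

# Hypothetical cube: time (day<month<quarter<year), location
# (street<city<province or state<country), item (item name<brand<type).
print(total_cuboids([4, 4, 3]))   # 5 * 5 * 4 = 100 cuboids
print(total_cuboids([1, 1, 1]))   # no hierarchies: 2^3 = 8 cuboids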
Because of the curse of dimensionality, it is not practical to precompute and materialize all of the cuboids that can be generated for a data cube. To avoid this problem, a different method called "partial materialization" is used, that is, to materialize only some of the possible cuboids that can be generated.

Partial Materialization: Selected Computation of Cuboids


There are three choices for data cube materialization given a base cuboid:
1) No materialization:
Do not precompute any of the “nonbase” cuboids. This leads to computing expensive
multidimensional aggregates on-the-fly, which can be extremely slow.
2) Full materialization:
Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the
full cube. This choice typically requires huge amounts of memory space in order to store all of the
precomputed cuboids.
3) Partial materialization:
Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we
may compute a subset of the cube, which contains only those cells that satisfy some user-specified
criterion, such as where the tuple count of each cell is above some threshold.

The partial materialization of cuboids or subcubes should consider three factors:


1) Identify the subset of cuboids or subcubes to materialize;
2) Exploit the materialized cuboids or subcubes during query processing; and
3) Efficiently update the materialized cuboids or subcubes during load and refresh.

The selection of the subset of cuboids or subcubes to materialize should take into account
the queries in the workload, their frequencies, and their accessing costs. In addition, it should
consider workload characteristics, the cost for incremental updates, and the total storage
requirements.
Several OLAP products have adopted heuristic approaches for cuboid and subcube selection.
They are
1) Materialize the cuboids set on which other frequently referenced cuboids are based.
Alternatively, we can compute an iceberg cube, which is a data cube that stores only those cube
cells with an aggregate value (e.g., count) that is above some minimum support threshold.
2) Materialize a shell cube which involves precomputing the cuboids for only a small number of
dimensions (e.g., three to five) of a data cube.
3) Finally, during load and refresh, the materialized cuboids should be updated efficiently.

Indexing OLAP Data: Bitmap Index and Join Index


The indexing methods that are used to access OLAP data efficiently are,
1) Bit map indexing method.
2) Join indexing method.

Bit map indexing method:


 The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes.
 The bitmap index is an alternative representation of the record ID (RID) list.
 There are two tables associated with bit map indexing method.
i. Base table.
ii. Bit map index table.
 In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the
domain of the attribute.
 If the domain of a given attribute consists of n values, then n bits are needed for each entry in the
bitmap index (i.e., there are n bit vectors).
 If the attribute has the value v for a given row in the data table, then the bit representing that value
is set to 1 in the corresponding row of the bitmap index.
 All other bits for that row are set to 0.
Example:
 In the AllElectronics data warehouse, suppose the dimension item at the top level has four values
(representing item types): “home entertainment,” “computer,” “phone,” and “security.”
 Each value (e.g., “computer”) is represented by a bit vector in the bitmap index table for item.
 Suppose that the cube is stored as a relation table with 100,000 rows.
 Because the domain of item consists of four values, the bitmap index table requires four bit vectors
(or lists), each with 100,000 bits.
 The below figure shows a base (data) table containing the dimensions item and city, and its
mapping to bitmap index tables for each of the dimensions.

The advantages of bitmap indexing are:


i) It does not consume much processing time, because comparison, join, and aggregation operations are reduced to bit arithmetic.
ii) It does not require a lot of storage space, because each attribute value is represented by a single bit per row rather than by the value itself.
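The following minimal Python sketch illustrates the idea of a bitmap index on a tiny in-memory base table; the rows, city values, and the AND-query at the end are made-up examples, not part of the textbook material.

def build_bitmap_index(rows, attribute):
    # Return {value: bit_vector} where bit i is 1 iff row i has that value.
    domain = sorted({row[attribute] for row in rows})
    index = {v: [0] * len(rows) for v in domain}
    for rid, row in enumerate(rows):
        index[row[attribute]][rid] = 1
    return index

base_table = [
    {"item": "home entertainment", "city": "Vancouver"},
    {"item": "computer",           "city": "Vancouver"},
    {"item": "phone",              "city": "Toronto"},
    {"item": "security",           "city": "Toronto"},
]

item_index = build_bitmap_index(base_table, "item")
city_index = build_bitmap_index(base_table, "city")

# Selecting rows with item = 'computer' AND city = 'Vancouver' becomes a
# bitwise AND of the two bit vectors.
matches = [a & b for a, b in zip(item_index["computer"], city_index["Vancouver"])]
print(matches)  # [0, 1, 0, 0] -> row 1 qualifies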

Join Indexing method:


 The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having the value.
 For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join
index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S
relations, respectively.
 Join index is especially useful for maintaining the relationship between a foreign key and its matching
primary keys from the joinable relation.
 The star schema model of data warehouses makes join indexing attractive, because the linkage between a fact
table and its corresponding dimension tables comprises the foreign key of the fact table and the
primary key of the dimension table.
 Join indexing maintains relationships between values (eg., within a dimension table) and the
corresponding rows in the fact table.
 Join indices may span multiple dimensions to form composite join indices.

Example:
 We defined a star schema for AllElectronics of the form "sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars)".
 An example of a join index relationship between the sales fact table and the dimension tables for
location and item is shown in below Figure. For example, the “Main Street” value in the location
dimension table joins with tuples T57, T238, and T884 of the sales fact table.
 Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the
sales fact table.
 The corresponding join index tables are shown in below figure.
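A small illustrative sketch of a join index is given below. The fact-table contents are invented around the tuple IDs mentioned in the example (T57, T238, T459, T884); the items and locations not named in the text are assumed placeholders.

sales_fact = {
    "T57":  {"location": "Main Street", "item": "Sony-TV"},
    "T238": {"location": "Main Street", "item": "Panasonic-TV"},
    "T459": {"location": "5th Avenue",  "item": "Sony-TV"},
    "T884": {"location": "Main Street", "item": "IBM-PC"},
}

def build_join_index(fact_table, dimension_attribute):
    # Map each dimension value to the fact-table tuple IDs it joins with.
    index = {}
    for tid, row in fact_table.items():
        index.setdefault(row[dimension_attribute], []).append(tid)
    return index

location_join_index = build_join_index(sales_fact, "location")
item_join_index = build_join_index(sales_fact, "item")

print(location_join_index["Main Street"])  # ['T57', 'T238', 'T884']
print(item_join_index["Sony-TV"])          # ['T57', 'T459']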


Efficient processing of OLAP queries:


The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should proceed as follows:

1) Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the
query into corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube
may correspond to selection and/or projection operations on a materialized cuboid.


2) Determine to which materialized cuboid(s) the relevant operations should be applied: This
involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the above set using knowledge of “dominance” relationships among the cuboids,
estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the
least cost.

Example:
Suppose that we define a data cube for All Electronics of the form “sales cube [time, item, location]:
sum(sales in dollars)”. The dimension hierarchies used are “day < month < quarter < year” for time, “item
name < brand < type” for item, and “street < city < province or state < country” for location.
Suppose that the query to be processed is on {brand, province or state}, with the selection constant "year =
2004". Also, suppose that there are four materialized cuboids available, as follows:
 cuboid 1: {year, item name, city}
 cuboid 2: {year, brand, country}
 cuboid 3: {year, brand, province or state}
 cuboid 4: {item name, province or state} where year = 2004
“Which of the above four cuboids should be selected to process the query?”
Finer granularity data cannot be generated from coarser-granularity data.
Therefore, cuboid 2 cannot be used because country is a more general concept than province or state.
Cuboids 1, 3, and 4 can be used to process the query because
1) They have the same set or a superset of the dimensions in the query.
2) The selection clause in the query can imply the selection in the cuboid.
3) The abstraction levels for the item and location dimensions in these cuboids are at a finer level than
brand and province or state, respectively.

“How would the costs of each cuboid compare if used to process the query?”
It is likely that using cuboid 1 would cost the most because both item name and city are at a lower level
than the brand and province or state concepts specified in the query.
If there are not many year values associated with items in the cube, but there are several
item names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be
chosen to process the query. However, if efficient indices are available for cuboid 4, then cuboid 4 may be a
better choice.
Therefore, some cost-based estimation is required in order to decide which set of cuboids
should be selected for query processing.
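The granularity-based pruning step of this example can be sketched as follows. This is only an illustration under simplifying assumptions: the hierarchies are taken from the example above, a depth-based "same or finer level" test stands in for the dominance relationships, and handling of the selection constant "year = 2004" and of cost estimation is omitted.

HIERARCHIES = {
    "item":     ["item name", "brand", "type"],                        # fine -> coarse
    "location": ["street", "city", "province or state", "country"],
    "time":     ["day", "month", "quarter", "year"],
}

def level(dim, attr):
    return HIERARCHIES[dim].index(attr)

def can_answer(cuboid, query):
    # A cuboid can answer the query if, for every queried dimension, it keeps
    # that dimension at the same or a finer abstraction level.
    for dim, wanted in query.items():
        if dim not in cuboid:
            return False
        if level(dim, cuboid[dim]) > level(dim, wanted):   # coarser than needed
            return False
    return True

query = {"item": "brand", "location": "province or state"}
cuboids = {
    "cuboid 1": {"time": "year", "item": "item name", "location": "city"},
    "cuboid 2": {"time": "year", "item": "brand", "location": "country"},
    "cuboid 3": {"time": "year", "item": "brand", "location": "province or state"},
    "cuboid 4": {"item": "item name", "location": "province or state"},
}
usable = [name for name, c in cuboids.items() if can_answer(c, query)]
print(usable)  # ['cuboid 1', 'cuboid 3', 'cuboid 4']

For the example query, the sketch keeps cuboids 1, 3, and 4 and discards cuboid 2, matching the discussion above.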


UNIT – 4:
Classification:
Unit IV: Classification: Basic Concepts, General approach to solving a classification problem, Decision Tree
induction: working of decision tree, building a decision tree, methods for expressing attribute test
conditions, measures for selecting the best split, Algorithm for decision tree induction.
Model over fitting: Due to presence of noise, due to lack of representation samples, evaluating the
performance of classifier: holdout method, random sub sampling, cross-validation, bootstrap. (Tan)

Classification:
Classification is the task of assigning objects to one of several predefined categories.
Examples:
 Detecting spam email messages based upon the message header and content
 Categorizing cells as malignant or benign based upon the results of MRI scans
 Classifying galaxies based upon their shapes

Classification is the task of learning a target function f that maps each attribute set to one of the
predefined class labels y.
The target function is also known informally as a classification model.

A classification model is useful for the following purposes


Descriptive Modelling:
A classification model can serve as an explanatory tool to distinguish between objects of different classes.

For example, it would be useful for both biologists and others to have a descriptive model that summarizes the data and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.

Predictive modelling:
A classification model can also be used to predict the class label of unknown records.
A classification model can be treated as a black box that automatically assigns a class label when presented
with the attribute set of an unknown record.


Suppose we are given the following characteristics of a creature known as a gila monster:
We can use a classification model built from the data set shown in above Table 4.1 to determine the class
to which the creature belongs.

Classification techniques are most suited for predicting or describing data sets with binary or nominal
categories. They are less effective for ordinal categories (e.g., to classify a person as a member of the high-, medium-, or low-income group) because they do not consider the implicit order among the categories.

General approach to solve a classification problem:

 A classification technique (or classifier) is a systematic approach to building classification models


from an input data set.
 Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector
machines, and naive Bayes classifiers.
 Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data.
 The model generated by a learning algorithm should both fit the input data well and correctly
predict the class labels of records it has never seen before.
 Therefore, a key objective of the learning algorithm is to build models with good generalization
capability; i.e., models that accurately predict the class labels of previously unknown records.
 The above figure shows a general approach for solving classification problems.


 First, a training set consisting of records whose class labels are known must be provided.
 The training set is used to build a classification model, which is subsequently applied to the test set,
which consists of records with unknown class labels.

Classification is the task of assigning labels to unlabeled data instances and a classifier is used to perform
such a task.
A classifier is typically described in terms of a model. The model is created using a given set of instances,
known as the training set, which contains attribute values as well as class labels for each instance. The
systematic approach for learning a classification model given a training set is known as a learning
algorithm. The process of using a learning algorithm to build a classification model from the training data is
known as induction. This process is also often described as “learning a model” or “building a model.”
This process of applying a classification model on unseen test instances to predict their class labels is known
as deduction.
Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn
a model, and then applying the model to assign labels to unlabeled instances.

Confusion Matrix:
 The performance of a classification model (classifier) can be evaluated by comparing the predicted
labels against the true labels of instances.
 That is based on the counts of test records correctly and incorrectly predicted by the model.
 This information can be summarized in a table called a confusion matrix.

 The above table depicts the confusion matrix for a binary classification problem.
 Each entry fij denotes the number of instances from class i predicted to be of class j.
 For example, f01 is the number of instances from class 0 incorrectly predicted as class 1.
 The number of correct predictions made by the model is (f11 + f00) and the number of incorrect
predictions is (f10 + f01).
 A confusion matrix provides the information needed to determine how well a classification model
performs, summarizing this information with a single number would make it more convenient to
compare the performance of different models.
 This can be done using an evaluation metric such as accuracy, which is computed as:
Accuracy = Number of correct predictions / Total number of predictions
 For binary classification problems, the accuracy of a model is given by
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
 Error rate is another related metric, which is defined as follows for binary classification problems:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)

 The learning algorithms of most classification techniques are designed to learn models that attain
the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
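A minimal sketch of building a binary confusion matrix and deriving accuracy and error rate from it is shown below; the label vectors are made-up examples, assuming class labels 0 and 1.

def confusion_matrix(true_labels, predicted_labels):
    # f[i][j] = number of instances of class i predicted as class j.
    f = [[0, 0], [0, 0]]
    for t, p in zip(true_labels, predicted_labels):
        f[t][p] += 1
    return f

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

f = confusion_matrix(y_true, y_pred)
total = sum(sum(row) for row in f)
accuracy = (f[1][1] + f[0][0]) / total
error_rate = (f[1][0] + f[0][1]) / total
print(f, accuracy, error_rate)   # [[3, 1], [1, 3]] 0.75 0.25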

Decision tree induction:


How a Decision Tree works?
 A Decision Tree is constructed by asking a series of questions with respect to a record of the
dataset we have got.
 Each time an answer is received, a follow-up question is asked until a conclusion is reached about the class
label of the record.
 The series of questions and their possible answers can be organised in the form of a decision tree,
which is a hierarchical structure consisting of nodes and directed edges.
The tree has three types of nodes:
 A root node that has no incoming edges and zero or more outgoing edges.
 Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
 Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The nonterminal nodes, which include the root
and other internal nodes, contain attribute test conditions to separate records that have different
characteristics.

 Classifying a test record is straightforward once a decision tree has been constructed.
 Starting from the root node, we apply the test condition to the record and follow the appropriate
branch based on the outcome of the test.
 This will lead us either to another internal node, for which a new test condition is applied, or to a
leaf node. The class label associated with the leaf node is then assigned to the record.
 Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various
attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the
Non-mammal class.

 The path terminates at a leaf node labeled Non-mammals.

How to build a decision tree?


A decision tree is built using an efficient decision tree induction algorithm, such as Hunt's algorithm. Many decision tree induction algorithms, including ID3, C4.5, and CART, were developed based on it.

Hunt’s Algorithm:
In hunt’s algorithm, a decision tree is grown in recursive fashion by partitioning the training records into
successively purer subsets. Let Dt be the set of training records that are associated with node t and y={ y1,
y2, ….., yc} be the class labels. The following is a recursive definition of Hunt’s algorithm.
Step1:
If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step2:
If Dt contains records that belong to more than one class, an attribute test condition is selected to partition
the records into smaller subsets.
 A child node is created for each outcome of the test condition and the records in D t are distributed
to the children based on the outcomes.
 The algorithm is then recursively applied to each node.

Consider the following training set for predicting borrowers who will default on loan payments.
 A training set for this problem can be constructed by examining the records of previous borrowers.
 In the below example Figure, each record contains the personal information of a borrower along
with a class label indicating whether the borrower has defaulted on loan payments.


The goal is to predict whether a loan applicant will repay the loan or not. Using the training set table, a classification model can be built by examining the records of previous borrowers.
a) The initial tree for the classification problem contains a single node with class label “defaulted =
No”. It means that most of the borrowers successfully repaid their loans.

Defaulted = No
Fig (a)
b) The records are subsequently divided into smaller subsets based on the outcomes of the "Home Owner" test condition. Here, if Home Owner = Yes, then all of the corresponding records belong to the same class.

c) If the test condition is No (i.e., "Home Owner = No"), then Hunt's algorithm applies its step recursively to that child, because the records with Home Owner = No belong to more than one class.

 In the tree, the left child of the root node (Home Owner = Yes) is labeled "Defaulted = No", so it is not extended recursively.

d) The right child of the root node is expanded by applying the recursive step of Hunt's algorithm until all the records in each leaf belong to the same class. The tree resulting from the recursive steps is shown in fig (d).


Design Issues of Decision Tree Induction:


A learning algorithm for inducing decision trees must address the following two issues.
1. How should the training records be split?
 Each recursive step of the tree-growing process must select an attribute test condition to divide
the records into smaller subsets.
 To implement this step, the algorithm must provide a method for specifying the test condition
for different attribute types as well as an objective measure for evaluating the goodness of each
test condition.
2. How should the splitting procedure stop?
 A stopping condition is needed to terminate the tree-growing process.
 A possible strategy is to continue expanding a node until either all the records belong to the
same class or all the records have identical attribute values.

Methods for expressing attribute test conditions:


Decision tree induction algorithms must provide a method for expressing an attribute test condition and its
corresponding outcomes for different attribute types.
• Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
1. Binary attribute: The test condition for a binary attribute generates two potential outcomes, shown
in below figure.

2. Nominal Attribute: A nominal attribute can have many related values.


 Its test condition can be expressed in two ways. 1. Multi-way, 2. Two-way.
 For a multi-way split the number of outcomes depends on the number of distinct values for the
corresponding attribute.


 For example, if marital status has three distinct values such as single, married, and divorced, its test condition will produce a three-way split.

 The test condition can also be split two ways, i.e., as a binary split. Some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values. This is shown in fig.

3. Ordinal Attribute (Group): Ordinal attributes can also produce binary (or) multi-way splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the order property of the
attribute values. The figure shows two-way split.

4. Continuous Attributes: For continuous attributes, the test condition can be expressed as a
comparison test with binary outcomes (yes or no) (A < v) or (A ≥ v)

For a multi-way split, the test condition is expressed as a set of ranges on Annual Income, with outcomes of the form vi ≤ A < vi+1, for i = 1, . . . , k.


Measures for selecting the Best split:


Define the terms entropy, information gain, and Gini index. How are they useful for attribute selection?
There are many measures that can be used to determine the best way to split the records. These measures
are defined in terms of the class distribution of the records before and after splitting.
 Let p(i|t) denote the fraction of records belonging to class i at a given node t.
 In a two-class problem, the class distribution at any node can be written as (p0,p1), where p1=1- p0.

The measures developed for selecting the best split are often based on the degree of impurity
(i.e., the impurity of a node measures how dissimilar the class labels are for the data instances belonging to
a common node) of the child nodes. The smaller the degree of impurity, the more skewed the class
distribution.
For example, a node with class distribution (0, 1) has zero impurity, whereas a node with
uniform class distribution (0.5, 0.5) has the highest impurity.

The following measures can be used to evaluate the impurity of a node t:


 Gini: Gini(t) = 1 − Σi [p(i|t)]^2
 Entropy: Entropy(t) = − Σi p(i|t) log2 p(i|t)
 Classification Error: Error(t) = 1 − maxi [p(i|t)]

1. Compute impurity measure (P) before splitting


2. Compute impurity measure (M) after splitting
 Compute impurity measure of each child node
 M is the weighted impurity of child nodes
3. Choose the attribute test condition that produces the highest gain,
Gain=P-M
or equivalently, lowest impurity measure after splitting (M)
In the below examples,
 The Node N1 has the lowest impurity value.
 The Node N2 has a relatively low impurity value.
 The Node N3 has an equal number of records from each class, so it has the highest impurity.


Example to illustrate the best split using following fig.


The class distribution before splitting is (0.5, 0.5): the parent node contains 6 records of class C0 and 6 of class C1. After splitting, the records are distributed over the two child nodes N1 and N2 as follows:

        N1   N2
C0       4    2
C1       3    3

Before Split (Gini of the parent node):
 1 - (6/12)^2 - (6/12)^2
 1 - 0.25 - 0.25
 0.5
After Split (Gini of each child node):
N1  1 - (4/7)^2 - (3/7)^2
N1  1 - (0.3265) - (0.1837)
N1  0.4898
N2  1 - (2/5)^2 - (3/5)^2
N2  1 - (0.16) - (0.36)
N2  0.48
The weighted Gini index of the split  7/12 × 0.4898 + 5/12 × 0.48  0.486
The attribute chosen as the test condition may vary depending on the choice of impurity measure.
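The impurity measures and the weighted impurity of a split can be reproduced with the short Python sketch below; it recomputes the numbers from the worked example above (parent counts (6, 6), children (4, 3) and (2, 3)).

from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

def weighted_impurity(children, measure=gini):
    # Weighted impurity M of the child nodes produced by a split.
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)

parent = [6, 6]                 # (C0, C1) counts before the split
children = [[4, 3], [2, 3]]     # N1 and N2 after the split

print(round(gini(parent), 3))                                  # 0.5
print(round(weighted_impurity(children), 3))                   # 0.486
print(round(gini(parent) - weighted_impurity(children), 3))    # gain = 0.014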


How to determine the performance of test condition: To determine how well a test condition performs,
we need to compare the degree of impurity of the parent node (before splitting) with the degree of
impurity of the child nodes (after splitting). The larger their difference, the better the test condition. The
gain, Δ, is a criterion that can be used to determine the goodness of a split:

Δ = I(parent) − Σ j=1..k [ N(vj) / N ] · I(vj)

o Where I(·) is the impurity measure of a given node,
o N is the total number of records at the parent node,
o k is the number of attribute values, and
o N(vj) is the number of records associated with the child node vj.
 Decision tree induction algorithms often choose a test condition that maximizes the gain Δ.
 Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing
the weighted average impurity measures of the child nodes.
 Finally, when entropy is used as the impurity measure in above Equation, the difference in entropy
is known as the information gain, Δinfo.
 Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
 Information Gain used in the ID3 and C4.5 decision tree algorithms

Splitting the attributes into subsets (nodes):


Splitting of Binary Attributes
Consider the diagram shown below.

 Suppose there are two ways to split the data into smaller subsets (i.e., if an attribute has only two categorical values, then that attribute splits the data into two subsets).
 Before splitting, the Gini index is 0.5 since there are an equal number of records from both classes.
 If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898, and for node N2, it is
0.480.


 The weighted average of the Gini index for the descendent nodes is (7/12) × 0.4898 + (5/12) ×
0.480 = 0.486.
 Similarly, we can show that the weighted average of the Gini index for attribute B is 0.375. Since the
subsets for attribute B have a smaller Gini index, it is preferred over attribute A.

Splitting of Nominal Attributes


An attribute with two or more distinct categorical values can be split using either a two-way or a multiway split.
That is, a nominal attribute can produce either binary or multiway splits, as shown in Figure.

 This can be split into binary grouping of car type attribute with 3 categories such as (sports, luxury, and
family).
 For the first binary grouping of the Car Type attribute,
o The Gini index of {Sports,Luxury} is 0.4922 and
o The Gini index of {Family} is 0.3750.
o The weighted average Gini index for the grouping is equal to
 16/20 × 0.4922 + 4/20 × 0.3750 = 0.468.
 Similarly, for the second binary grouping of {Sports} and {Family, Luxury}, the weighted average Gini
index is 0.167.
 The second grouping has a lower Gini index because its corresponding subsets are much purer.
 For the multiway split, the Gini index is computed for every attribute value.
o Since Gini({Family}) = 0.375, Gini({Sports}) = 0, and Gini({Luxury}) = 0.219,
o The overall Gini index for the multiway split is equal to 4/20 × 0.375 + 8/20 × 0 + 8/20 ×
0.219 = 0.163.

Splitting of Continuous Attributes


For a continuous attribute, which takes a range of values, a test condition of the form Annual Income ≤ v is used to split the training records for the loan default classification problem.
Brute-Force Method for finding v:
- Consider every value of the attribute in the N records as a candidate split position.
- For each candidate v, the data set is scanned once to count the number of records with annual
income less than or greater than v.


- We then compute the Gini index for each candidate and choose the one that gives the lowest value.
This approach is expensive because it requires O(N) operations to compute the Gini index at each candidate split position. Since there are N candidates, the overall complexity of this task is O(N^2).

To reduce the complexity, follow the below procedure for splitting the continuous attribute:
- The training records are sorted based on their annual income(attribute)
- Candidate split positions are identified by taking the midpoints between two adjacent sorted values
- For each candidate split position, compute the Gini index values
- Now, select the best split position corresponds to the one that produces the smallest Gini index
- It can be further optimized by considering only candidate split positions located between two
adjacent records with different class labels.
- In the below example figure, this approach allows us to reduce the number of candidate split
positions from 11 to 2.
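A minimal sketch of this sorted-midpoint procedure is given below. The income values and class labels are an assumed sample, and the further optimization of skipping midpoints between records with the same class label is left out for brevity.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    # Sort once, try midpoints between adjacent values, return (v, weighted Gini).
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (None, float("inf"))
    for i in range(1, len(data)):
        v = (data[i - 1][0] + data[i][0]) / 2          # candidate split point
        left = [l for x, l in data if x <= v]
        right = [l for x, l in data if x > v]
        w = (len(left) * gini([left.count(c) for c in classes]) +
             len(right) * gini([right.count(c) for c in classes])) / len(data)
        if w < best[1]:
            best = (v, w)
    return best

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, defaulted))   # (97.5, 0.3) for this assumed sample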

Gain Ratio
 Impurity measures such as entropy and Gini index tend to favor attributes that have a large number
of distinct values.
 That is, a test condition that results in a large number of outcomes may not be desirable because
the number of records associated with each partition is too small to enable us to make any reliable
predictions.
 There are two strategies for overcoming this problem.
 The first strategy is to restrict the test conditions to binary splits only. This strategy is employed by
decision tree algorithms such as CART.
 Another strategy is to modify the splitting criterion to take into account the number of outcomes
produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a
splitting criterion known as gain ratio is used to determine the goodness of a split.
 This criterion is defined as follows:

Gain Ratio = Δinfo / Split Info,   where Split Info = − Σ i=1..k P(vi) log2 P(vi)

Here the parent node is split into k partitions (children), P(vi) = N(vi)/N is the fraction of records assigned to child node vi, and N(vi) is the number of records in child node vi.
 This suggests that if an attribute produces a large number of splits, its split information will also be
large, which in turn reduces its gain ratio.


Algorithm for decision tree induction:


 A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm 4.1.
 The input to this algorithm consists of the training records E and the attribute set F.
 The algorithm works by recursively selecting the best attribute to split the data (Step 7) and
expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1).

The details of this algorithm are explained below:


1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree
has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
2. The find_best_split() function determines which attribute should be selected as the test condition for
splitting the training records. The choice of test condition depends on which impurity measure is used to
determine the goodness of a split.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority number of training records: leaf.label = argmax_i p(i|t).
4. The stopping_cond() function is used to terminate the tree-growing process by testing whether all the
records have either the same class label or the same attribute values. Another way to terminate the
recursive function is to test whether the number of records have fallen below some minimum threshold.
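A simplified Python sketch of the TreeGrowth skeleton is shown below. The helper names mirror the functions described above (classify, stopping_cond, find_best_split), but their bodies are illustrative stand-ins (multiway splits scored with the Gini index), not the textbook implementation, and the four training records are made up.

from collections import Counter

def classify(records):
    # Majority class label among the records (the Classify() step).
    return Counter(r["label"] for r in records).most_common(1)[0][0]

def stopping_cond(records, attributes):
    same_label = len({r["label"] for r in records}) == 1
    return same_label or not attributes

def find_best_split(records, attributes):
    # Pick the attribute whose multiway split gives the lowest weighted Gini.
    def gini(rs):
        counts = Counter(r["label"] for r in rs)
        return 1 - sum((c / len(rs)) ** 2 for c in counts.values())
    def weighted(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())
    return min(attributes, key=weighted)

def tree_growth(records, attributes):
    if stopping_cond(records, attributes):
        return {"label": classify(records)}                  # leaf node
    attr = find_best_split(records, attributes)
    node = {"test": attr, "children": {}}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["children"][value] = tree_growth(subset, remaining)
    return node

records = [
    {"home_owner": "Yes", "marital": "Single",  "label": "No"},
    {"home_owner": "No",  "marital": "Married", "label": "No"},
    {"home_owner": "No",  "marital": "Single",  "label": "Yes"},
    {"home_owner": "Yes", "marital": "Married", "label": "No"},
]
print(tree_growth(records, ["home_owner", "marital"]))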

Characteristics of decision tree induction:


1. Decision tree induction is a nonparametric approach for building classification models.
2. Techniques developed for constructing decision trees are computationally inexpensive, making it possible to build models quickly even for large training sets.
3. Small sized trees are relatively easy to interpret.
4. Decision trees provide an expressive representation for learning discrete-valued functions.
5. Decision tree algorithms are quite robust to the presence of noise.
6. Decision tree algorithms employ a top-down, recursive partitioning approach.
7. Sub-trees can be replicated multiple times in a decision tree.


Advantages and disadvantages of decision trees:


Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are interacting)
Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
– Each decision boundary involves only a single attribute

Classification Errors:
 The errors committed by a classification model are generally divided into two types:
training errors and generalization errors.
 Training error, also known as resubstitution error or apparent error, is the number of misclassification errors committed on training records; that is, it is obtained by calculating the classification error of a model on the same data the model was trained on.
 Generalization error is the expected error of the model on previously unseen records.

Overfitting and underfitting:


Model overfitting:
 A good classification model not only fits the training data well, it also accurately classifies records it has never seen before.
 In other words, a good model must have low training error as well as low generalization error.
 A model that fits the training data too well can have a poorer generalization error than a model with
a higher training error. Such a situation is known as model overfitting.
 Overfitting results in decision trees that are more complex than necessary
 Once the tree becomes too large, its test error rate begins to increase even though its training error
rate continues to decrease. This phenomenon is known as model overfitting.
 Training error does not provide a good estimate of how well the tree will perform on previously
unseen records

Model underfitting:
 The training and test error rates of the model are large when the size of the tree is very small. This
situation is known as model underfitting.
 Underfitting occurs because the model has yet to learn the true structure of the data. As a result, it
performs poorly on both the training and the test sets.
 As the number of nodes in the decision tree increases, the tree will have fewer training and test
errors.
 The training error of a model can be reduced by increasing the model complexity.

For example, the leaf nodes of the tree can be expanded until the tree perfectly fits the training data. Although the training error for such a complex tree is zero, the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data. Such nodes can degrade the


performance of the tree because they do not generalize well to the test examples. The below figure shows
the structure of two decision trees with different number of nodes.

The tree that contains the smaller number of nodes has a higher training error rate, but a lower test error
rate compared to the more complex tree.

Causes of model overfitting:


 Overfitting and underfitting are two pathologies that are related to the model complexity.
 Causes of model overfitting:
 Overfitting Due to Presence of Noise
 Overfitting Due to Lack of Representative Samples

Overfitting Due to Presence of Noise:


 Consider the training and test sets shown in tables for the mammal classification problem.


 Two of the training records are mislabelled: Bats and whales are classified as non-mammals instead
of mammals.
 A decision tree that perfectly fits the training data is shown in figure(a). Although the training error
for the tree is zero, its error rate on the test set is 30%.

 Both humans and dolphins were misclassified as non-mammals because their attribute values for
body temperature, gives-birth and four legged are identical to the mislabelled records in the
training set.
 Errors due to exceptional cases are often unavoidable and establish the minimum error rate
achievable by any classifier.
 The decision tree M2 has a lower test error rate (10%) even though the training error rate is
somewhat higher (20%).
 The four legged attribute test condition in model M1 is spurious because it fits the mislabelled training records, which leads to the misclassification of records in the test set.


Overfitting Due to Lack of Representative Samples:


 Models that make their classification decision based on a smaller number of training records are
also susceptible to overfitting.
 Such models can be generated because of lack of representative samples in the training data and
learning algorithms that continue to refine their models even when few training records are
available.
 For example consider five training records shown in table.

 All of these training records are labeled correctly and the corresponding decision tree is shown in
figure.

 Although its training error is zero, its error rate on the test set is 30%.
 Humans, elephants and dolphins are misclassified because the decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals.
 The tree arrives at this classification decision because there is only one training record, which is an
eagle, with such characteristics.
 This example clearly shows the danger of making wrong predictions when there are not enough representative examples at the leaf nodes of a decision tree.

Evaluating the performance of a classifier


Some of the methods commonly used to evaluate the performance of a classifier are,
1) Holdout method.
2) Random Subsampling
3) Cross-validation.
4) Bootstrap.

Holdout method:
 Split the learning sample into a training set and a test data set.
– A model is induced on the training data set
– Performance is evaluated on the test data set
 Limitations:
– Too few data for learning: The more data used for testing, the more reliable the performance
estimation but more data is missing (less data available) for learning.
– Interdependence of training and test data set: If a class is underrepresented in the training
data set, it will be overrepresented in the test data set and vice versa.

Random Subsampling
 The holdout method can be repeated several times to improve the estimation of a classifier’s
performance. If the estimation is performed k times, then the overall performance is taken to be the average of the k estimates.

 This method also encounters some of the problems associated with the holdout method because it
does not utilize as much data as possible for training.
 It also has no control over the number of times each record is used for training and testing.

Cross-validation
 In this approach each record is used the same number of times for training and exactly once for
testing.
 To illustrate this method, suppose we partition the data into two equal-sized subsets.
– First we choose one of the subsets for training and the other for testing.
– We then swap the roles of the subsets so that the previous training set becomes the test set
and vice versa. This approach is called a two-fold cross validation. The total error is obtained
by summing up the errors for both runs.
– In this example, each record is used exactly once for training and once for testing.
 Core idea:
– use each record k times for training and once for testing
– aggregate the performance values over all k tests
 k-fold cross validation
– split the dataset into k equi-sized subsets
– for i = 1, …, k: use the other k−1 folds for training and the i-th fold for testing
– aggregate the performance values over all k tests
 Leave-one-out cross validation
– In k-fold cross validation, if k = N where N is the number of records in the learning dataset
– Each test set will contain only one record
– Computationally expensive
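A minimal sketch of k-fold cross-validation is given below. The train/model interface and the majority-class "learner" used in the demonstration are placeholders, not a real classifier.

def k_fold_indices(n_records, k):
    # Split record indices 0..n-1 into k roughly equal folds.
    folds = [[] for _ in range(k)]
    for i in range(n_records):
        folds[i % k].append(i)
    return folds

def cross_validate(records, labels, train, k=5):
    errors = 0
    for test_fold in k_fold_indices(len(records), k):
        test = set(test_fold)
        train_idx = [i for i in range(len(records)) if i not in test]
        model = train([records[i] for i in train_idx], [labels[i] for i in train_idx])
        errors += sum(1 for i in test_fold if model(records[i]) != labels[i])
    return errors / len(records)      # aggregated error rate over all k tests

# Usage with a trivial majority-class "learner" as a stand-in classifier:
def majority_trainer(train_records, train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda record: majority

records = list(range(10))
labels = ["No"] * 7 + ["Yes"] * 3
print(cross_validate(records, labels, majority_trainer, k=5))   # 0.3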

Bootstrap
 The methods presented so far assume that the training records are sampled without replacement.
 It means that there are no duplicate records in the training and test set.
 In the bootstrap approach, the training records are sampled with replacement.


 It means that a record already chosen for training is put back into the original pool of records so
that it is equally likely to be redrawn.
 There are several bootstrap methods. A commonly used one is the .632 bootstrap.


UNIT – 5:
Association Analysis:
Association Analysis: Problem Definition, Frequent Item-set generation- The Apriori principle, Frequent
Item set generation in the Apriori algorithm, candidate generation and pruning, support counting (eluding
support counting using a Hash tree), Rule generation, compact representation of frequent item sets, FP-
Growth Algorithms. (Tan)

Association rule mining:


Association rule mining aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Given a set of transactions, find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Association analysis:
 Association analysis is useful for discovering interesting relationships hidden in large data sets.
 The uncovered relationships can be represented in the form of association rules or sets of frequent
items.
 Association analysis is applicable to application domains such as market basket data, bioinformatics, medical diagnosis, Web mining, and scientific data analysis.

Problem definition:
The following are the basic terminology used in association analysis. Consider the following example of
market basket transactions.

Binary Representation
 Market basket data can be represented in a binary format as shown in Table, where each row
corresponds to a transaction and each column corresponds to an item.
 An item can be treated as a binary variable whose value is one if the item is present in a transaction
and zero otherwise.
 Because the presence of an item in a transaction is often considered more important than its
absence, an item is an asymmetric binary variable.


Itemset:
 Let I = {i1, i2, . . . ,id} be the set of all items in a market basket data and T = {t1, t2, . . . , tN} be the set of
all transactions.
 Each transaction ti contains a subset of items chosen from I.
 In association analysis, a collection of zero or more items is termed an itemset.
 If an itemset contains k items, it is called a k-itemset.
– For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
 The null (or empty) set is an itemset that does not contain any items.
 The transaction width is defined as the number of items present in a transaction.
 A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 6.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}.

Support count:
Support count refers to the number of transactions that contain a particular itemset. Mathematically, the
support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|, where the symbol | · | denotes the number of elements in a set.
In the data set shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to two because there
are only two transactions that contain all three items.

Association Rule:
An association rule is an implication expression of the form X → Y , where X and Y are disjoint itemsets, i.e.,
X ∩ Y = ∅.
The strength of an association rule can be measured in terms of its support and confidence.
Support determines how often a rule is applicable to a given data set, while confidence determines how
frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

where N is the total number of transactions. In other words, support is the fraction of transactions that contain both X and Y, and confidence is the conditional probability that a transaction contains Y given that it contains X.
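The two metrics can be computed directly from a transaction table, as in the short sketch below; the transactions reproduce the market basket example referenced above, and the function names are arbitrary.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)      # sigma(X)

def rule_metrics(X, Y, transactions):
    n = len(transactions)
    sigma_xy = support_count(X | Y, transactions)
    support = sigma_xy / n
    confidence = sigma_xy / support_count(X, transactions)
    return support, confidence

# Rule {Milk, Diapers} -> {Beer}
print(rule_metrics({"Milk", "Diapers"}, {"Beer"}, transactions))  # (0.4, 0.666...)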

Why use Support and Confidence?


Support
 Support is an important measure because a rule that has very low support may occur simply by
chance.


 A low support rule is also likely to be uninteresting from a business perspective because it may not
be profitable to promote items that customers seldom buy together.
 For these reasons, support is often used to eliminate uninteresting rules.
Confidence
 Confidence, on the other hand, measures the reliability of the inference made by a rule.
 For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in
transactions that contain X.
 Confidence also provides an estimate of the conditional probability of Y given X.
 An association rule suggests a strong co-occurrence relationship between items in the antecedent and the consequent of the rule.

Formal definition of association rule mining


The association rule mining problem can be formally stated as follows:
Definition (Association Rule Discovery):
Given a set of transactions T, find all the rules (association rules) having support ≥ minsup and
confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence
thresholds.
minsup: This is the minimal support used as a threshold.
minconf: This is the minimal confidence used as a threshold
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold
Strong Association Rules: rules whose confidence is greater than or equal to a minconf threshold
Association Rule Mining uses these thresholds to reduce the time complexity of the computations and find
strong association rules in the data set.

Association Rule Mining can be viewed as a two-step process:


1. Frequent Itemset Generation, whose objective is to find all the item- sets that satisfy the minsup
threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules (whose confidence
greater than or equal to minconf) from the frequent itemsets found in the previous step. These
rules are called strong rules.

Frequent Itemset Generation


A lattice structure can be used to enumerate the list of all possible itemsets. The below figure shows an
itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up
to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical
applications, the search space of itemsets that need to be explored is exponentially large.


Brute-force approach:
 A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure.
 To do this, we need to compare each candidate against every transaction, an operation that is
shown in below figure.

 If the candidate is contained in a transaction, its support count will be incremented.


 For example, the support for {Bread, Milk} is incremented 3 times because the itemset is contained
in transactions 1, 4, and 5.
 Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width.

There are several ways to reduce the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M):


The Apriori principle is an effective way to eliminate some of the candidate itemsets without
counting their support values.
2. Reduce the number of comparisons (MN):
Instead of matching each candidate itemset against every transaction, we can reduce the
number of comparisons by using more advanced data structures, either to store the candidate
itemsets or to compress the data set.

Apriori Principle
The use of support for pruning candidate itemsets is guided by the following Apriori principle.
Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Example:
Consider the itemset lattice shown in below figure. Suppose {c, d, e} is a frequent itemset. Clearly, any
transaction that contains {c, d, e} must also contain its subsets, {c, d} {c, e}, {d, e}, {c}, {d}, and {e}. As a
result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be
frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
This strategy of trimming the exponential search space based on the support measure is known as support-
based pruning.
Such a pruning strategy is made possible by a key property of the support measure, namely, that the
support for an itemset never exceeds the support for its subsets. This property is also known as the anti-
monotone property of the support measure.


Monotonicity Property:
Let I be a set of items, and J = 2^I be the power set of I. A measure f is monotone (or upward closed)
if
∀X, Y ∈ J : (X ⊆ Y ) → f(X) ≤ f(Y ),
which means that if X is a subset of Y , then f(X) must not exceed f(Y ). On the other hand, f is anti-
monotone (or downward closed) if
∀X, Y ∈ J : (X ⊆ Y ) → f(Y ) ≤ f(X),
which means that if X is a subset of Y , then f(Y ) must not exceed f(X).

Frequent itemset generation in the Apriori algorithm:


 Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets.
 Consider the following example for illustrating the frequent itemset generation using Apriori
algorithm.

 Above figure provides a high-level illustration of the frequent itemset generation part of the Apriori
algorithm for the transactions shown in below table.

 We assume that the support threshold is 60%, which is equivalent to a minimum support count
equal to 3.
 Initially, every item is considered as a candidate 1-itemset.


 After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they
appear in fewer than three transactions.
 In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets
because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be
infrequent.
 Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6.
 Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent
after computing their support values.
 The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets.
 With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm
6.1.

Let denote the set of candidate k-itemsets and denote the set of frequent k-itemsets:
 The algorithm initially makes a single pass over the data set to determine the support of each item.
Upon completion of this step, the set of all frequent 1-itemsets, , will be known (steps 1 and 2).
 Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k − 1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen, which is described in the candidate generation and pruning section below.
 To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 6–10). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t; this is the support counting step.
 After counting their supports, the algorithm eliminates all candidate itemsets whose support counts
are less than minsup (step 12).
 The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 13).


Characteristics of the Apriori algorithm:


The frequent itemset generation part of the Apriori algorithm has two important characteristics.
 First, it is a level-wise algorithm; i.e.,
– It traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum
size of frequent itemsets.
 Second, it employs a generate-and-test strategy for finding frequent itemsets.
– At each iteration, new candidate itemsets are generated from the frequent itemsets found
in the previous iteration. The support for each candidate is then counted and tested against
the minsup threshold.
 The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.

Apriori algorithm:
– Fk: the set of frequent k-itemsets
– Ck: the set of candidate k-itemsets
Algorithm:
– Let k = 1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
 Candidate Generation: Generate Ck+1 from Fk
 Candidate Pruning: Prune candidate itemsets in Ck+1 containing subsets of length k that are infrequent
 Support Counting: Count the support of each candidate in Ck+1 by scanning the DB
 Candidate Elimination: Eliminate candidates in Ck+1 that are infrequent, leaving only those that are frequent => Fk+1
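The level-wise generate-and-test loop above can be sketched in Python as follows. This is a minimal illustration rather than the textbook's exact pseudocode: candidates are generated by merging pairs of frequent (k − 1)-itemsets (an Fk−1 × Fk−1-style merge), and the function and variable names are our own.

    from itertools import combinations

    def apriori(transactions, minsup_count):
        # Level-wise frequent itemset generation; `transactions` is a list of sets.
        transactions = [frozenset(t) for t in transactions]

        # Steps 1-2: one pass over the data to find the frequent 1-itemsets F1.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Fk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent = dict(Fk)

        k = 2
        while Fk:
            # Candidate generation: merge frequent (k-1)-itemsets into k-itemsets.
            prev = list(Fk)
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            # Candidate pruning: drop candidates that have an infrequent (k-1)-subset.
            candidates = {c for c in candidates
                          if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
            # Support counting: one additional pass over the data set.
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            # Candidate elimination: keep only the candidates that meet minsup.
            Fk = {c: n for c, n in counts.items() if n >= minsup_count}
            frequent.update(Fk)
            k += 1
        return frequent                               # maps frozenset -> support count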

Example for finding the frequent itemsets using Apriori algorithm:

Example: Suppose we have the following dataset that has various transactions, and from this dataset, we
need to find the frequent itemsets and generate the association rules using the Apriori algorithm:

Solution:
Step-1: Candidate Generation C1 and F1:
o In the first step, we create a table that contains the support count (the frequency of each itemset individually in the dataset) of each itemset in the given dataset. This table is called the candidate set C1.


o Now, we keep only the itemsets whose support count is greater than or equal to the minimum support (2). This gives us the table for the frequent itemset F1.
All the itemsets meet the minimum support except {E}, so the itemset {E} is removed.

Step-2: Candidate Generation C2, and F2:


o In this step, we generate C2 with the help of F1. C2 contains all the pairs that can be formed from the itemsets in F1.
o After creating the subsets, we will again find the support count from the main transaction table of
datasets, i.e., how many times these pairs have occurred together in the given dataset. So, we will
get the below table for C2:

o Again, we compare each support count in C2 with the minimum support count; the itemsets with a lower support count are eliminated from C2. This gives us the table for F2.

Step-3: Candidate generation C3, and F3:


o For C3, we repeat the same two steps, but now we form the C3 table from candidate itemsets of three items and calculate their support counts from the dataset. It will give the below table:

o Now we create the F3 table. As we can see from the C3 table above, there is only one itemset whose support count meets the minimum support count. So, F3 will have only one itemset, i.e., {A, B, C}.
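A hypothetical run of the apriori() sketch shown earlier. The transaction table of this worked example is given only as a figure, so the transactions below are illustrative stand-ins chosen to behave the same way (only {E} is infrequent at the first level, and {A, B, C} is the only frequent 3-itemset).

    transactions = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "C", "E"}, {"B", "C", "D"}]
    result = apriori(transactions, minsup_count=2)
    for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)                 # lists F1, F2, and F3 level by level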

Requirements for effective candidate generation:


Requirements for an effective candidate generation procedure:
1. It should avoid generating too many unnecessary candidates.
– A candidate itemset is unnecessary if at least one of its subsets is infrequent. Such a
candidate is guaranteed to be infrequent according to the anti-monotone property of
support.
2. It must ensure that the candidate set is complete,
– i.e., no frequent itemsets are left out by the candidate generation procedure. To ensure
completeness, the set of candidate itemsets must subsume the set of all frequent itemsets,
i.e., ∀k : Fk ⊆ Ck.
3. It should not generate the same candidate itemset more than once.
– For example, the candidate itemset {a, b, c, d} can be generated in many ways—by merging
{a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates
leads to wasted computations and thus should be avoided for efficiency reasons.

Candidate Generation and Pruning:


The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate itemsets by performing the
following two operations:
1. Candidate Generation.
This operation generates new candidate k-itemsets based on the frequent (k − 1)-itemsets
found in the previous iteration.
2. Candidate Pruning.
This operation eliminates some of the candidate k-itemsets using the support-based pruning
strategy.

There are several candidate generation procedures, including the one used by the apriori-gen function.
1) Brute-Force Method
2) Fk−1 × F1 Method
3) Fk−1 × Fk−1 Method

Brute-Force Method:
– Considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates; an example is shown in the figure below.
– The number of candidate itemsets generated at level k is equal to (d choose k), where d is the total number of items.
– Candidate pruning becomes expensive because a large number of itemsets must be examined.


– Given that the amount of computation needed for each candidate is O(k), the overall complexity of this method is O(Σk=1..d k · (d choose k)) = O(d · 2^(d−1)).

Fk−1 × F1 Method
 Extends each frequent (k − 1)-itemset with other frequent items.
 Every frequent k-itemset is composed of a frequent (k − 1)-itemset and a frequent 1-itemset.
 All frequent k-itemsets are therefore among the candidate k-itemsets generated.
 This method will produce O(|Fk−1| × |F1|) candidate k-itemsets, where |Fj| is the number of frequent j-itemsets.
 The overall complexity of this step is O(Σk k · |Fk−1| · |F1|).
Drawbacks: Produces a large number of unnecessary candidates.


The above figure illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a
frequent item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.
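The Fk−1 × F1 generation step, followed by support-based pruning, can be sketched as below. This is an illustrative helper of our own, seeded with the frequent itemsets from the market-basket example discussed earlier; only {Bread, Diapers, Milk} should survive the pruning step.

    from itertools import combinations

    def candidate_gen_fk1_f1(F_k_minus_1, F_1, k):
        # Candidate generation: extend each frequent (k-1)-itemset with every frequent item.
        candidates = set()
        for itemset in F_k_minus_1:
            for item in F_1:
                merged = itemset | item
                if len(merged) == k:
                    candidates.add(merged)
        # Candidate pruning: every (k-1)-subset of a surviving candidate must be frequent.
        return {c for c in candidates
                if all(frozenset(s) in F_k_minus_1 for s in combinations(c, k - 1))}

    F2 = {frozenset(p) for p in [("Beer", "Diapers"), ("Bread", "Diapers"),
                                 ("Bread", "Milk"), ("Diapers", "Milk")]}
    F1 = {frozenset([i]) for i in ["Beer", "Bread", "Diapers", "Milk"]}
    print(candidate_gen_fk1_f1(F2, F1, 3))            # {frozenset({'Bread', 'Diapers', 'Milk'})}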

Compact Representation of Frequent Itemsets:


In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived. There are two such representations: maximal frequent itemsets and closed frequent itemsets.
1) Maximal Frequent Itemsets:
A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent.

Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal
frequent itemsets because their immediate supersets are infrequent. An itemset such as {a,d} is maximal
frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast,
{a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent. Maximal frequent
itemsets effectively provide a compact representation of frequent itemsets.


2) Closed Frequent Itemsets:


Closed Itemset: An itemset X is closed if none of its immediate supersets has exactly the same support
count as X. Consider the following lattice.

For example, since the node {b, c} is associated with transaction IDs 1,2, and 3, its support count is equal to
three. From the transactions given in this diagram, notice that every transaction that contains b also
contains c. Consequently, the support for {b} is identical to {b, c} and {b} should not be considered a closed
itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not
closed. On the other hand, {b, c} is a closed itemset because it does not have the same support count as
any of its supersets ({a, b, c}, {b, c, d}, {b, c, e}).
Closed Frequent Itemset: An itemset is a closed frequent itemset if it is closed and its support is greater
than or equal to minsup.
In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset
because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded nodes.
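Given a dictionary of frequent itemsets and their support counts (for instance, the output of the apriori() sketch shown earlier), both compact representations can be identified with a short helper of our own. It is enough to inspect the frequent immediate supersets of each itemset: an immediate superset with the same support as a frequent itemset would itself be frequent.

    def maximal_and_closed(frequent):
        # `frequent` maps frozenset -> support count for all frequent itemsets.
        maximal, closed = set(), set()
        for X, sup in frequent.items():
            supersets = [Y for Y in frequent if X < Y and len(Y) == len(X) + 1]
            if not supersets:                              # no frequent immediate superset
                maximal.add(X)
            if all(frequent[Y] != sup for Y in supersets): # no immediate superset with identical support
                closed.add(X)
        return maximal, closed

Every maximal frequent itemset returned this way is also closed, which matches the fact that the maximal frequent itemsets are a subset of the closed frequent itemsets.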


FP-Growth Algorithm:
It encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets
directly from this structure.
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
a) FP-Tree Representation:
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree.


Figure 6.24 shows a data set that contains ten transactions and five items. Each node in the tree contains
the label of an item along with a counter that shows the number of transactions mapped onto the given
path. Initially, the FP-tree contains only the root node represented by the null symbol.
The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in
Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.
The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items: since none of the transactions share any items, the FP-tree is effectively as large as the original data. In that case, the physical storage requirement for the FP-tree is even higher, because it needs additional space to store pointers between nodes and counters for each item.
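The two-pass construction described above can be sketched in Python as follows. This is a simplified illustration (class and function names are our own); the node-link pointers and header table that the full FP-growth algorithm also maintains are omitted for brevity.

    class FPNode:
        # One node of the FP-tree: an item label, a count, a parent link, and child links.
        def __init__(self, item=None, parent=None):
            self.item, self.count, self.parent, self.children = item, 0, parent, {}

    def build_fptree(transactions, minsup_count):
        # Pass 1: determine support counts; keep frequent items, ranked by decreasing support.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        ranked = sorted(counts.items(), key=lambda kv: -kv[1])
        order = {item: rank for rank, (item, n) in enumerate(ranked) if n >= minsup_count}

        # Pass 2: map each transaction onto a path, incrementing counts along the way.
        root = FPNode()
        for t in transactions:
            items = sorted((i for i in t if i in order), key=lambda i: order[i])
            node = root
            for item in items:
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1
        return root, order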
b) Frequent Itemset Generation in FP-Growth Algorithm:
After the FP-tree is generated, to find the frequent itemsets we need the following steps:
1) The Conditional Pattern Base is computed; it consists of the path labels of all the paths that lead to a node of the given item in the frequent-pattern tree.
2) For each item, the Conditional Frequent Pattern Tree is then built. It is constructed by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
3) From the Conditional Frequent Pattern Tree, the frequent patterns are generated by pairing the items of the Conditional Frequent Pattern Tree with the corresponding suffix item.


UNIT – 6:
Cluster Analysis:
Overview- types of clustering, Basic K-means, K –means –additional issues, Bisecting k-means, k-means and
different types of clusters, strengths and weaknesses, k-means as an optimization problem. Agglomerative
Hierarchical clustering, basic agglomerative hierarchical clustering algorithm, specific techniques, DBSCAN:
Traditional density: centre-based approach, strengths and weaknesses (Tan)

Clustering:

Overview:

Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Clustering for understanding: classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). In this context, clusters are potential classes and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:

Biology: Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, and species. For example, clustering has been used to find groups of genes that have similar functions.

Information Retrieval: The World Wide Web consists of billions of Web pages, and the results of a query
to a search engine can return thousands of pages. Clustering can be used to group these search results
into a small number of clusters, each of which captures a particular aspect of the query. For instance, a
query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars, and
theatres.

Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional analysis and
marketing activities.

We explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2)
types of clusters.

What Is Cluster Analysis?

Cluster analysis groups data objects based only on information found in the data that describes the objects
and their relationships. The goal is that the objects within a group be similar (or related) to one another and
different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity)
within a group and the greater the difference between groups, the better or more distinct the clustering.

Consider Figure 8.1, which shows twenty points and three different ways of dividing them into clusters.
Figures 8.1(b) and 8.1(d) divide the data into two and six parts, respectively.


Classification is a supervised learning approach; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. In contrast, cluster analysis derives labels only from the data, so it is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

Different Types of Clusterings:

An entire collection of clusters is commonly referred to as a clustering. There are various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

Hierarchical versus Partitional

A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset. Taken individually, each collection of clusters in Figures
8.1 (b-d) is a partitional clustering.

If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested
clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union
of its children (subclusters), and the root of the tree is the cluster containing all the objects.

Exclusive versus Overlapping versus Fuzzy

The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are
many situations in which a point could reasonably be placed in more than one cluster, and these situations
are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-
exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one
group (class). For instance, a person at a university can be both an enrolled student and an employee of the
university.

In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets.


(Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.)

Complete versus Partial: A complete clustering assigns every object to a cluster, whereas a partial clustering
does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-
defined groups. Many times, objects in the data set may represent noise, outliers, or "uninteresting
background."

Different Types of Clusters:

The different types of clusters include

1. Well-separated
2. Prototype-based
3. Graph-based
4. Density-based
5. Shared-property
1) Well separated:

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other.

This figure gives an example of well-separated clusters consisting of two groups of points in a two-dimensional space. The distance between any two points in different groups is larger than the distance between any two points within a group.

2) Prototype-based:

 A cluster is a set of objects in which each object is closer to the prototype that defines the cluster
than to the prototype of any other cluster.
 For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster.
 When a centroid is not meaningful, such as when the data has categorical attributes, the prototype
is often a medoid i.e., the most representative point of a cluster.

 For many types of data, the prototype can be regarded as the most central point, and in such instances we commonly refer to prototype-based clusters as centroid clusters.

3) Graph based:

If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component, i.e., a group of objects that are connected to one another but have no connection to objects outside the group.

An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.

Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is
useful when clusters are irregular or intertwined, but can have trouble when noise is present since, as
illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct
clusters.

4) Density Based: A cluster is a dense region of objects that is surrounded by a region of low density. The figure (d) shows some density-based clusters for data created by adding noise to the data of figure (c). The
two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the
noise. Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a
cluster in Figure 8.2(d).

A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and
when noise and outliers are present

5) Shared-property (conceptual clusters): More generally, we can define a cluster as a set of objects that share a common property. This definition encompasses all the previous definitions of a cluster; for example, objects in a center-based cluster share the property that they are closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters.

Consider the clusters shown in the figure. A triangular area (cluster) is adjacent to a rectangular one, and there are two irregular circles (clusters). In both cases, a clustering algorithm would need a very specific concept
of a cluster to successfully detect these clusters. This process of finding such clusters is called conceptual
clustering.


K-means:

Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoids.

 K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points,
and is typically applied to objects in a continuous n-dimensional space.
 K-medoids defines a prototype in terms of a medoid.
In this section we will focus on K-means, which is one of the oldest and most widely used clustering algorithms.

The basic K-means algorithm:

The K-means clustering technique is simple, and we begin with a description of the basic algorithm. We
first choose k initial centroids, where k is a user specified parameter, namely, the number of clusters
desired. Each point is then assigned to the closest centroid and each collection of points assigned to a
centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the
cluster.

We repeat the assignment and update steps until no points changes clusters, or equivalently, until the
centroids remain the same.

Algorithm basic k-means algorithm:

1. Select k points as initial centroids.


2. Repeat
3. Form k clusters by assigning each point to its closest centroid.
4. Recompute the centroid of each cluster.
5. Until the centroids do not change.
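A minimal NumPy sketch of the basic algorithm, assuming Euclidean distance and mean centroids. The function name, the fixed random seed, and the empty-cluster handling (keeping the old centroid) are our own choices rather than part of the textbook algorithm.

    import numpy as np

    def kmeans(X, k, max_iter=100, rng=np.random.default_rng(0)):
        # 1. Select k data points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 3. Form k clusters by assigning each point to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 4. Recompute each centroid as the mean of the points assigned to it.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j]        # keep the old centroid if a cluster is empty
                                      for j in range(k)])
            # 5. Stop when the centroids no longer change.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels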

The operation of k-means is illustrated in figure, which shows how, starting from three centroids, the final
clusters are found in four assignment-update steps. In these and other figures displaying k-means
clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the
points to those centroids. The centroids are indicated by the “+” symbol, all points belonging to the same
cluster have the same marker shape.


In the first step, shown in fig(a), points are assigned to the initial centroids, which are all in the larger group
of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the
centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step
and the assignment of points to those centroids.

In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in (b), (c), and (d) respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the K-means algorithm terminates in figure (d), because no more changes occur, the centroids have identified the natural grouping of points.

Assigning points to the closest centroid:

To assign a point to the closest centroid, we need a proximity measure. Euclidean (L2) distance is often
used for data points in Euclidean space, while cosine similarity is more appropriate for documents.
However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.

Centroids and Objective Functions:


Step 4 of the K-means algorithm was stated rather generally as “recompute the centroid of each cluster”, since the centroid can vary depending on the proximity measure for the data (for example, Euclidean or Manhattan distance) and the goal of the clustering.

The goal of the clustering is typically expressed by an objective function that depends on the proximities of
the points to one another or to the cluster centroids.

Data in Euclidean Space:

Consider data whose proximity measure is Euclidean distance. For our objective function, which measures
the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In
other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid,
and then compute the total sum of the squared errors.

Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one
with the smallest squared error since this means that the prototypes (centroids) of this clustering are a
better representation of the points in their cluster.

Formally, SSE = Σ (i = 1 to K) Σ (x ∈ Ci) dist(ci, x)², where dist is the standard Euclidean (L2) distance between two objects in Euclidean space, Ci is the ith cluster, and ci is its centroid.

Given these assumptions, it can be shown that the centroid that minimizes the SSE of the cluster is the mean. The centroid (mean) ci of the ith cluster Ci is defined as ci = (1/mi) Σ (x ∈ Ci) x, where mi is the number of points in Ci.

To illustrate, the centroid of a cluster containing the three two-dimensional points (1,1), (2,3), and (6,2) is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3,2).
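A quick numerical check of this example, assuming NumPy is available:

    import numpy as np

    points = np.array([[1, 1], [2, 3], [6, 2]])   # the three points above
    centroid = points.mean(axis=0)                # array([3., 2.])
    sse = ((points - centroid) ** 2).sum()        # this cluster's contribution to the SSE: 16.0
    print(centroid, sse)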


Document Data:

To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the
cosine similarity measure. Here we assume that the document data is represented as a document-term
matrix as shown in the diagram.

Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Total Cohesion = Σ (i = 1 to K) Σ (x ∈ Ci) cosine(x, ci).

Time and Space Complexity:

The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points, K is the number of clusters, and n is the number of attributes.

The time requirements for K-means are also modest: basically linear in the number of data points. In particular, the time required is O(I × K × m × n), where I is the number of iterations required for convergence.

Handling empty clusters:

One of the problems with the basic k-means algorithm is that empty clusters can be obtained if no points
are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a
replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to
choose the point that is farthest away from any current centroid.

Outliers:


When outliers are present, the resulting cluster centroids(prototypes) may not be as representative as they
otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover
outliers and eliminate them beforehand.

It is important to note, however, that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, for example, every point must be clustered.

An issue is how to identify outliers:

There are a number of techniques for identifying outliers. If we remove outliers before clustering, we avoid clustering points that will not cluster well. Alternatively, outliers can also be identified in a postprocessing step.

Two strategies that decrease the total SSE (Sum of Squares Error) by increasing the number of clusters are
the following:

1. Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster
with the largest standard deviation for one particular attribute.
2. Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.

Two strategies that decrease the number of clusters, while trying to minimize the increase in total SSE, are the following:

1. Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster
and reassigning the points to other clusters.
2. Merge two clusters: The clusters with the closest centroids are typically chosen; alternatively, merge the two clusters that result in the smallest increase in total SSE.


Bisecting K-means:

The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced. The details of bisecting K-means are given in the following algorithm.

Bisecting K-means algorithm:

1. Initialize the list of clusters to contain the cluster consisting of all points.
2. Repeat.
3. Remove a cluster from the list of clusters.
4. {Perform several “trial” bisections of the chosen cluster}.
5. for i = 1 to number of trials do
6. Bisect the selected cluster using basic K-means.
7. End for.
8. Select the two clusters from the bisection with the lowest SSE.
9. Add these two clusters to the list of clusters.
10. Until the list of clusters contains K clusters.
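A sketch of the algorithm above, reusing kmeans() and numpy (np) from the earlier K-means sketch. Splitting the cluster with the largest SSE is just one of the selection criteria mentioned below, and the helper names are our own.

    import numpy as np

    def bisecting_kmeans(X, K, n_trials=5):
        def sse(points, centroid):
            return ((points - centroid) ** 2).sum()

        clusters = [X]                                   # step 1: one all-inclusive cluster
        while len(clusters) < K:                         # step 10: until K clusters remain
            # Step 3: remove a cluster from the list (here, the one with the largest SSE).
            idx = max(range(len(clusters)),
                      key=lambda i: sse(clusters[i], clusters[i].mean(axis=0)))
            target = clusters.pop(idx)
            # Steps 4-8: perform several trial bisections and keep the pair with the lowest SSE.
            best, best_sse = None, np.inf
            for trial in range(n_trials):
                centroids, labels = kmeans(target, 2, rng=np.random.default_rng(trial))
                pair = [target[labels == 0], target[labels == 1]]
                total = sum(sse(p, c) for p, c in zip(pair, centroids))
                if total < best_sse:
                    best, best_sse = pair, total
            clusters.extend(best)                        # step 9: add the two clusters to the list
        return clusters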

There are a number of different ways to choose which cluster to split. We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices result in different clusters.

Example:

To illustrate that bisecting K-means is less susceptible to initialization problems, we show in the figure how bisecting K-means finds four clusters in the data set shown earlier.

In iteration 1, two pairs of clusters are found. In iteration 2, the rightmost pair of clusters is split. In
iteration 3, the leftmost pair of clusters is split.

K-means and Different Types of Clusters:


K-means and its variations have a number of limitations with respect to finding different types of clusters.
In particular, K-means has difficulty detecting the "natural" clusters, when clusters have non-spherical
shapes or widely different sizes or densities. This is illustrated by Figures 8.9, 8.10, and 8.11.

In Figure 8.9, K-means cannot find the three natural clusters because one of the clusters is much larger than
the other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined with a
portion of the larger cluster.

In Figure 8.10, K-means fails to find the three natural clusters because the two smaller clusters are much
denser than the larger cluster.


In Figure 8.11, K-means finds two clusters that mix portions of the two natural clusters because the shape
of the natural clusters is not globular.

The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of
clusters we are trying to find since it is minimized by globular clusters of equal size and density or by
clusters that are well separated. However, these limitations can be overcome, in some sense, if the user is
willing to accept a clustering that breaks the natural clusters into a number of subclusters.


Agglomerative Hierarchical Clustering:

There are two basic approaches for generating a hierarchical clustering:

Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of
clusters.

Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of
individual points remain.

A hierarchical clustering is often displayed graphically using a tree like diagram called a dendrogram, which
displays both the cluster-subcluster relationships and the order in which the clusters were merged
(agglomerative view) or split (divisive view).

Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This approach is expressed more formally in the following algorithm.

Algorithm: Basic agglomerative hierarchical clustering algorithm.

1. Compute the proximity matrix if necessary.


2. repeat.
a. Merge the closest two clusters.
b. Update the proximity matrix to reflect the proximity between the new cluster and the
original clusters.
3. Until only one cluster remains.

The key operation of the above algorithm is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques. For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters.
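Assuming SciPy is available, the MIN, MAX, and Group Average definitions discussed next correspond to the 'single', 'complete', and 'average' linkage methods; the six two-dimensional points below are hypothetical.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(0).random((6, 2))       # six hypothetical two-dimensional points
    for method in ("single", "complete", "average"):  # MIN, MAX, Group Average
        Z = linkage(X, method=method)                 # repeatedly merges the two closest clusters
        print(method, Z[:, 2])                        # merge heights (cluster proximities at each step)
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram (needs matplotlib).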

MIN (Single Link):


It defines cluster proximity as the proximity between the closest two points that are in different clusters or
using graph terms, the shortest edge between two nodes in different subsets of nodes.

Using graph terminology, if you start with all points as singleton clusters and add links between points one
at a time, shortest links first, then these single links combine the points into clusters.

The single link technique is good at handling non-elliptical shapes, but it is sensitive to noise and outliers.

MAX (Complete Link) or CLIQUE: It takes the proximity between the farthest two points in different clusters to be the cluster proximity, or, using graph terms, the longest edge between two nodes in different subsets of nodes.

Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all points in it are completely linked, i.e., they form a clique.

Complete link is less susceptible to noise and outliers.

Group Average: It defines cluster proximity to be the average pairwise proximity (average length of edges) over all pairs of points from different clusters.

Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj respectively, is expressed by the following equation:
proximity(Ci, Cj) = ( Σ (x ∈ Ci, y ∈ Cj) proximity(x, y) ) / (mi × mj)



Example for Single Link or MIN:

From Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which
they are joined into one cluster in the dendrogram. As another example, the distance between clusters
{3,6} and {2,5} is given by

dist({3,6}, {2,5}) = min (dist (3, 2), dist (6, 2), dist (3, 5), dist (6, 5))

= min (0.15, 0.25, 0.28, 0.39)

= 0.15

Figure 8.16 shows the result of applying the single link technique to our example data set of six points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, while Figure 8.16(b) shows the same information as a dendrogram.

Example for Complete Link or MAX or CLIQUE:

As with single link, points 3 and 6 are merged first. However, {3,6} is merged with {4}, instead of {2,5} or {1}
because


Example for Group Average:


Time and Space Complexity:

Overall, the time required for a hierarchical clustering based on the above algorithm is O(m² log m).

The total space complexity is O(m²).

DBSCAN (Density Based Spatial Clustering of Application with Noise):

In the center-based approach, the density at a point is estimated by counting the number of points in the data set that lie within a specified radius (Eps, or epsilon) of that point.

DBSCAN is based on this approach. The center-based approach to density allows us to classify a point as being:

1. Interior of a dense region (a core point).


2. On the edge of a dense region (a border point).
3. In a sparsely occupied region (a noise or background point).

Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighbourhood around the point, as determined by the distance function and the radius Eps, exceeds a threshold (MinPts).

Border points: A border point is not a core point, but falls within the neighbourhood of a core point.

Noise points: A noise point is any point that is neither a core point nor a border point.


For example, if MinPts is 7 in the following diagram, then point A is a core point, B is a border point, and C is a noise point.
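A minimal NumPy sketch of this center-based labelling (the function name is ours, and cluster formation itself is not shown): each point is labelled core, border, or noise from the pairwise Euclidean distances.

    import numpy as np

    def classify_points(X, eps, minpts):
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbours = dists <= eps                                  # Eps-neighbourhoods (a point counts itself)
        core = neighbours.sum(axis=1) >= minpts                    # core: at least minpts points within Eps
        border = ~core & (neighbours & core[None, :]).any(axis=1)  # border: non-core, but near a core point
        return np.where(core, "core", np.where(border, "border", "noise"))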

Time complexity:

The basic time complexity of the DBSCAN algorithm is

O(m × time to find points in the Eps-neighbourhood),

where m is the number of points.

In the worst case, this complexity is O(m²).

Space complexity:

The space requirement of DBSCAN, even for high-dimensional data, is O(m), because it is only necessary to keep a small amount of data for each point.
