dwm NOTES
RAGHU ENGINEERING COLLEGE
(Autonomous)
(Approved by AICTE, New Delhi, Permanently Affiliated to JNTU Kakinada,
Accredited by NBA & Accredited by NAAC with A grade)
OBJECTIVES:
Students will be enabled to understand and implement classical models and algorithms in
data warehousing and data mining.
They will learn how to analyze the data, identify the problems, and choose the relevant
models and algorithms to apply.
They will further be able to assess the strengths and weaknesses of various methods
and algorithms and to analyze their behavior.
Unit I: Introduction to Data Mining: What is data mining, motivating challenges, origins of data
mining, data mining tasks, Types of Data: attributes and measurements, types of data sets, Data
Quality (Tan)
Unit II: Data pre-processing, Measures of Similarity and Dissimilarity: Basics, similarity and
dissimilarity between simple attributes, dissimilarities between data objects, similarities between
data objects, examples of proximity measures: similarity measures for binary data, Jaccard
coefficient, Cosine similarity, Extended Jaccard coefficient, Correlation, Exploring Data : Data Set,
Summary Statistics (Tan)
Unit III: Data Warehouse: basic concepts, Data Warehousing Modeling: Data Cube and OLAP,
Data Warehouse implementation: efficient data cube computation, partial materialization, indexing
OLAP data, efficient processing of OLAP queries. (H & C)
Unit IV: Classification: Basic Concepts, General approach to solving a classification problem,
Decision Tree induction: working of decision tree, building a decision tree, methods for expressing
attribute test conditions, measures for selecting the best split, Algorithm for decision tree induction.
Model overfitting: due to presence of noise, due to lack of representative samples, evaluating the
performance of a classifier: holdout method, random subsampling, cross-validation, bootstrap. (Tan)
Unit V:
Association Analysis: Problem Definition, Frequent Itemset generation: the Apriori principle,
frequent itemset generation in the Apriori algorithm, candidate generation and pruning, support
counting (including support counting using a hash tree), Rule generation, compact representation of
frequent itemsets, FP-Growth Algorithm. (Tan)
Unit VI:
Overview: types of clustering, Basic K-means, K-means: additional issues, Bisecting K-means, K-means
and different types of clusters, strengths and weaknesses, K-means as an optimization
problem. Agglomerative Hierarchical Clustering: basic agglomerative hierarchical clustering
algorithm, specific techniques, DBSCAN: traditional density: center-based approach, strengths and
weaknesses (Tan)
Course outcomes:
Understand stages in building a Data Warehouse
Understand the need and importance of preprocessing techniques
Understand the need and importance of Similarity and dissimilarity techniques
Analyze and evaluate performance of algorithms for Association Rules.
Analyze Classification and Clustering algorithms
Text Books:
1. Introduction to Data Mining: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson
2. Data Mining: Concepts and Techniques, 3/e, Jiawei Han, Micheline Kamber, Elsevier
Reference Books:
1. Introduction to Data Mining with Case Studies 2nd ed: GK Gupta; PHI.
2. Data Mining : Introductory and Advanced Topics : Dunham, Sridhar, Pearson.
3. Data Warehousing, Data Mining & OLAP, Alex Berson, Stephen J Smith, TMH
4. Data Mining Theory and Practice, Soman, Diwakar, Ajay, PHI, 2006.
UNIT – 1:
Data Mining:
Introduction to Data Mining: What is data mining, motivating challenges, origins of data mining, data
mining tasks, Types of Data-attributes and measurements, types of data sets, Data Quality (Tan)
Data Mining
Data mining is the process of automatically discovering useful information in large data repositories. Data
mining techniques find useful patterns or predict the outcome of a future observation.
This process consists of a series of transformation steps, from data preprocessing to postprocessing
of data mining results.
The input data can be stored in a variety of formats (flat files, spread-sheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.
The purpose of preprocessing is to transform the raw input data into an appropriate format for
subsequent analysis.
The steps involved in the data preprocessing include fusing data from multiple sources, cleaning
data to remove noise and duplicate observations, and selecting the records and features that are
relevant to the data mining task at hand.
Because of the many ways data can be collected and stored, data preprocessing is often a time-consuming step
in the overall knowledge discovery process.
The postprocessing step ensures that only valid and useful results are incorporated into the
decision support systems.
An example of postprocessing is visualization, which allows analysts to explore the data and the
data mining results from a variety of viewpoints.
Statistical measures or hypothesis testing methods can also be applied during the postprocessing to
eliminate spurious data mining results.
Steps involved in KDD process.
Data Mining is an integral part of knowledge discovery in database (KDD) which is the overall process of
converting raw data into useful information.
KDD consists of a sequence of the following steps:
1) Data cleaning: It is a process of removing noise and inconsistent data.
2) Data integration: It is a process of integrating multiple data sources.
3) Data selection: It is a process in which data relevant to the analysis task are retrieved from the
database.
4) Data transformation: It is a process of transforming and consolidating data into forms appropriate for mining.
5) Data Mining: It is an essential process where intelligent methods are applied so as to extract the
hidden data patterns.
6) Pattern evaluation: To identify the truly interesting patterns representing knowledge based on
interestingness measures.
7) Knowledge presentation: In this step, visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
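As an illustration only (not from the textbook), the following Python sketch shows how the cleaning, integration, selection, and transformation steps might look on a toy sales table using the pandas library; the table and column names (customer_id, amount, region) are hypothetical.

import pandas as pd

# Hypothetical raw data from two sources (data integration)
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3, 4],
                      "amount": [250.0, 100.0, 100.0, None, 9000.0],
                      "region": ["north", "south", "south", "east", "west"]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                         "age": [34, 51, 28, 45]})
data = sales.merge(profiles, on="customer_id")      # 2) data integration

data = data.drop_duplicates()                        # 1) data cleaning: remove duplicates
data = data.dropna(subset=["amount"])                #    and records with missing amounts

selected = data[["amount", "age", "region"]]         # 3) data selection: keep relevant attributes

# 4) data transformation: aggregate amounts per region (a summary operation)
summary = selected.groupby("region")["amount"].sum()
print(summary)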
Motivating Challenges:
Scalability:
Because of advances in data generation and collection, datasets with sizes of gigabytes, terabytes or
even petabytes are common.
Massive datasets cannot fit into main memory
Need to develop scalable data mining algorithms to mine massive datasets
Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
High dimensionality:
Nowadays, data sets commonly have hundreds or thousands of attributes.
o Example: A dataset that contains measurements of temperature at various locations
Traditional data analysis techniques that were developed for low-dimensional data do not work
well for such high-dimensional data.
Need to develop data mining algorithms to handle high dimensionality.
Origins of Data Mining:
Data mining draws upon search algorithms, modeling techniques, and learning theories from
Artificial Intelligence, Machine Learning, and Pattern Recognition.
Database systems are needed to provide support for efficient storage, Indexing and query
processing.
Techniques from parallel (high-performance) computing help address the massive size of
some datasets.
Distributed computing techniques are used to gather information from different locations.
The below figure shows the relationship of data mining to other areas.
The below figure illustrates four of the core data mining tasks.
Predictive Modeling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modeling tasks:
Classification, which is used for Discrete Target Variables.
Ex: Predicting whether a web user will make a purchase at an online book store (Target
variable is binary valued).
Regression, which is used for Continuous Target Variables.
Ex: Forecasting the future price of a stock (Price is a continuous-valued attribute)
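For illustration only (not from the text), here is a minimal scikit-learn sketch of the two predictive modeling tasks; the tiny data sets and the feature names are made up, and scikit-learn is assumed to be installed.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: discrete target (will the web user buy? 0 = no, 1 = yes)
# Hypothetical features: [pages_viewed, minutes_on_site]
X_cls = [[2, 1], [15, 20], [3, 2], [22, 35]]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[18, 25]]))          # predicted class label

# Regression: continuous target (tomorrow's stock price)
# Hypothetical feature: today's price
X_reg = [[100.0], [102.0], [101.0], [105.0]]
y_reg = [102.0, 101.0, 105.0, 107.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[106.0]]))           # predicted continuous value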
Association Analysis:
Used to discover patterns that describe strongly associated features in the data.
The discovered patterns are typically represented in the form of implication rules or feature
subsets.
Example:
Transaction ID Items
1 {Bread, Butter, Diapers, Milk}
2 {Coffee, Sugar, Cookies, Salmon}
3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4 {Bread, Butter, Salmon, Chicken}
5 {Eggs, Bread, Butter}
6 {Salmon, Diapers, Milk}
7 {Bread, Tea, Sugar, Eggs}
8 {Coffee, Sugar, Chicken, Eggs}
9 {Bread, Diapers, Milk, Salt}
10 {Tea, Eggs, Cookies, Diapers, Milk}
Market Basket Analysis
The above table illustrates data collected at a supermarket.
Association analysis can be applied to find items that are frequently bought together by customers.
Discovered Association Rule is {Diapers} → {Milk} (Customers who buy diapers also tend to buy milk)
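As a quick check of the rule above, the following sketch (illustrative only, not part of the text) counts the support and confidence of {Diapers} -> {Milk} directly from the ten transactions in the table.

transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

n = len(transactions)
diapers = sum(1 for t in transactions if "Diapers" in t)            # 5 transactions contain Diapers
both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)     # 5 transactions contain both

print("support({Diapers, Milk}) =", both / n)           # 0.5
print("confidence(Diapers -> Milk) =", both / diapers)  # 1.0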
Cluster Analysis:
A grouping of similar objects is called a cluster.
The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity (within a cluster) and minimizing the inter-class similarity (cluster to cluster).
Example:
Article Word
1 Dollar : 1, Industry : 4, Country : 2, Loan : 3, Deal : 2, Government : 2
2 Machinery : 2, Labor : 3, Market : 4, Industry : 2, Work : 3, Country : 1
3 Domestic: 4, Forecast : 2, Gain : 1, Market : 3, Country : 2, Index : 3
4 Patient : 4, Symptom : 2, Drug : 3, Health : 2, Clinic : 2, Doctor : 2
5 Death : 2, Cancer : 4, Drug : 3, Public : 4, Health : 3, Director : 2
6 Medical : 2, Cost : 3, Increase : 2, Patient : 2, Health : 3, Care : 1
Document Clustering
The collection of news articles shown in the above table can be grouped based on their respective
topics.
Each Article is represented as a set of word frequency pairs (w, c), Where w is a word and c is the
number of times the word appears in the article.
There are 2 natural clusters in the above dataset
o First Cluster consists of the first 3 articles (News about the Economy)
o Second cluster consists of the last 3 articles (News about Health Care)
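The grouping described above can be reproduced with a short sketch (illustrative only, assuming scikit-learn is installed): each article's (word, count) pairs are turned into a vector and K-means with k = 2 separates the economy articles from the health-care articles.

from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"domestic": 4, "forecast": 2, "gain": 1, "market": 3, "country": 2, "index": 3},
    {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
]

# Build a document-term matrix from the (word, count) pairs
X = DictVectorizer(sparse=False).fit_transform(articles)

# Two clusters: economy articles vs. health-care articles
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g., [0 0 0 1 1 1]: articles 1-3 and 4-6 fall into different clusters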
Anomaly detection:
The task of identifying observations whose characteristics are significantly different from the rest of
the data.
Such observations are known as anomalies or Outliers.
A good anomaly detector must have a high detection rate and a low false alarm rate.
Applications: Detection of fraud, Network Intrusions etc…
Example:
Credit Card Fraud Detection:
A Credit Card Company records the transactions made by every credit card holder, along with the
personal information such as credit limit, age, annual income and address.
When a new transaction arrives, it is compared against the profile of the user.
If the characteristics of the transaction are very different from the previously created profile, then
the transaction is flagged as potentially fraudulent.
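A very simplified sketch of this idea (illustrative only; real systems use far richer profiles) flags a new transaction whose amount lies far from the card holder's historical mean, measured in standard deviations. The history values below are made up.

import statistics

# Hypothetical transaction history (amounts) for one card holder
history = [450, 520, 610, 480, 700, 530, 590, 620, 510, 560]

mean = statistics.mean(history)
std = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag the transaction if it is more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) / std > threshold

print(is_suspicious(575))     # False: close to the usual spending profile
print(is_suspicious(25000))   # True: very different from the profile, flag for review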
Attributes
An attribute is a property or characteristic of an object that may vary, either from one object to another or
from one time to another.
Example: Eye colour of a person, mass of a physical object, the time at which an event occurred, etc.
This is also known by other names such as variable, field, characteristic, dimension, feature etc.
Measurement scale
To analyse the characteristics of objects, we assign numbers or symbols to them. To do this in a well-
defined way, we need a measurement scale.
A measurement scale is a rule (function) that associates a numerical or symbolic value(attribute
values) with an attribute of an object.
For instance, we classify someone as male or female.
The “physical value” of an attribute of an object is mapped to a numerical or symbolic value
Types of attributes
We can define four types of attributes: nominal, ordinal, interval, and ratio. These types, and the
properties of numbers that each of them satisfies, are described below.
Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers.
Even if they are represented by numbers, i.e., integers, they should be treated more like symbols.
The remaining two types of attributes, interval and ratio, are collectively referred to
as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most
of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.
The different types of attributes adhere to different subsets of the following properties of numbers: distinctness, order, addition (meaningful differences), and multiplication (meaningful ratios).
Nominal adheres to Distinctness.
Ordinal adheres to Distinctness and Order.
Interval adheres to Distinctness, Order and meaningful difference.
Ratio adheres to all the four properties (Distinctness, Order, Addition and Multiplication).
Discrete Attribute
A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical,
such as Zip codes or ID numbers, or numeric.
These are often represented using integer values.
Binary attributes are a special case of discrete attributes and assume only two values. Example:
true/false, yes/no, or 0/1.
Nominal and ordinal attributes are binary or discrete
Continuous Attribute
Continuous Attribute has real numbers as attribute values.
Example: Temperature, height etc.
Practically, such values can be measured and represented only with a finite number of
digits, so they are typically represented as floating-point variables.
Interval and ratio numbers are continuous.
Asymmetric Attribute
For Asymmetric Attributes, Only non-zero values are important.
For Instance, consider a data set where each object is a student and each attribute records whether or not
a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the
student took the course associated with that attribute and a value of 0 otherwise. Because students take
only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it
is more meaningful and more efficient to focus on non-zero values.
Binary attributes where only non-zero values are important are called asymmetric binary
attributes. This type of attribute is important for association analysis.
The patterns in the data depend on the level of resolution. If the resolution is too fine, a pattern
may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear. For example, variations in atmospheric pressure on a scale of hours reflect the
movement of storms and other weather systems. On a scale of months, such phenomena are not
detectable.
Record data:
Majority of Data Mining work assumes that data is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes).
The most basic form of record data has no explicit relationship among records or data fields, and
every record (object) has the same set of attributes.
Record data is usually stored either in flat files or in relational databases.
There are a few variations of Record Data, which have some characteristic properties.
1) Transaction or Market Basket Data: It is a special type of record data, in which each record contains
a set of items. For example, shopping in a supermarket or a grocery store. For any particular
customer, a record will contain a set of items purchased by the customer in that respective visit to
the supermarket or the grocery store. This type of data is called Market Basket Data. Transaction
data is a collection of sets of items, but it can be viewed as a set of records whose fields are
asymmetric attributes. Most often, the attributes are binary, indicating whether or not an item was
purchased.
2) The Data Matrix: If the data objects in a collection of data all have the same fixed set of numeric
attributes, then the data objects can be thought of as points (vectors) in a multidimensional space,
where each dimension represents a distinct attribute describing the object. A set of such data
objects can be interpreted as an m x n matrix, where there are m rows, one for each object, and n
columns, one for each attribute. This matrix is called a data matrix or a pattern matrix. Standard
matrix operation can be applied to transform and manipulate the data. Therefore, the data matrix is
the standard data format for most statistical data.
3) The Sparse Data Matrix: A sparse data matrix (sometimes also called document-data matrix) is a
special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e.,
only non-zero values are important. In document data, if the order of the terms (words) in a
document is ignored, then a document can be represented as a term vector, where each term is a
component (attribute) of the vector and the value of each component is the number of times the
corresponding term occurs in the document. This representation of a collection of documents is
called a document-term matrix.
Graph-based Data:
This can be further divided into two types:
1. Data with Relationships among Objects: The data objects are mapped to nodes of the graph, while
the relationships among objects are captured by the links between objects and link properties, such
as direction and weight. Consider Web pages on the World Wide Web, which contain both text and
links to other pages. In order to process search queries, Web search engines collect and process
Web pages to extract their contents.
2. Data with Objects That Are Graphs: If objects have structure, that is, the objects contain sub
objects that have relationships, then such objects are frequently represented as graphs. For
example, the structure of chemical compounds can be represented by a graph, where the nodes are
atoms and the links between nodes are chemical bonds.
Ordered Data
For some types of data, the attributes have relationships that involve order in time or space. Such data can
be segregated into four types:
1. Sequential Data: Also referred to as temporal data, can be thought of as an extension of record
data, where each record has a time associated with it. Consider a retail transaction data set that
also stores the time at which the transaction took place
2. Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such
as a sequence of words or letters. It is quite similar to sequential data, except that there are no time
stamps; instead, there are positions in an ordered sequence. For example, the genetic information
of plants and animals can be represented in the form of sequences of nucleotides that are known as
genes.
3. Time Series Data: Time series data is a special type of sequential data in which each record is a time
series, i.e., a series of measurements taken over time. For example, a financial data set might
contain objects that are time series of the daily prices of various stocks.
4. Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types
of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) that
is collected for a variety of geographical locations.
Data Quality
Many characteristics act as deciding factors for data quality, such as incompleteness and inconsistency,
which are common properties of large real-world databases. Factors used for data
quality assessment are:
Accuracy:
There are many possible reasons for flawed or inaccurate data, i.e., attribute values that are incorrect
because of human or computer errors.
Completeness:
Incomplete data can occur for several reasons; attributes of interest, such as customer information
for sales and transaction data, may not always be available.
Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from
inconsistent input field formats. Duplicate tuples also need to be cleaned.
Timeliness:
It also affects the quality of the data. For example, at the end of the month several sales representatives may
fail to file their sales records on time, and several corrections and adjustments flow in
after the end of the month. The data stored in the database are then incomplete for a time after each
month.
Believability:
It is reflective of how much users trust the data.
Interpretability:
It is a reflection of how easy the users can understand the data.
Measurement Error
Measurement Error refers to any problem resulting from the measurement process. In other words, the
recorded data values differ from true values to some extent. For continuous attributes, the numerical
difference between measured and true value is called the error.
There can be data sets in which some values are missing, sometimes even some data objects are not
present or there are redundant/duplicate data objects.
The following are the variety of problems that involve measurement error:
Noise
Artifacts
Bias
Precision
Accuracy
The following issues involve both measurement and data collection problems. They are
Outliers
Missing and inconsistent values, and
Duplicate data
Noise:
Noise is the random component of a measurement error. It involves either the distortion of a value
or addition of objects that are not required. The following figure shows a time series before and
after disruption by some random noise. If a bit more noise were added to the time series, its shape would
be lost.
The term noise is often connected with data that has a spatial (space related) or temporal (time related)
component. In these cases, techniques from signal and image processing are used in order to reduce noise.
But, the removal of noise is a difficult task; hence much of the data mining work involves use of Robust
Algorithms that can produce acceptable results even in the presence of noise.
Outliers:
Outliers are either
1. Data objects that, in some sense, have characteristics that are different from most of the other data
objects in the data set, or
2. Values of an attribute that are unusual with respect to the most common (typical) values for that
attribute.
Additionally, it is important to distinguish between noise and outliers. Outliers can be legitimate data
objects or values. Thus, unlike noise, outliers may sometimes be of interest. In fraud and network intrusion
detection, the goal is to find unusual objects or events from among a large number of normal ones.
Missing values:
It is not unusual to have data objects that have missing values for some of the attributes. The reasons can
be:
1. The information was not collected.
Inconsistent Values:
Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city. It may be that the individual entering
this information transposed two digits, or perhaps a digit was misread when the information was
scanned from a handwritten form. Some types of inconsistencies are easy to detect. For instance, a
person's height should not be negative. The correction of an inconsistency requires additional or
redundant information.
Duplicate Data: A data set may include data objects that are duplicates, or almost duplicates, of one
another. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two
objects that actually represent a single object, then the values of corresponding attributes may differ, and
these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining
data objects that are similar, but not duplicates, such as two distinct people with identical names. The term
deduplication is used to refer to the process of dealing with these issues.
Accuracy:
The closeness of measurements to the true value of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept, there is no specific formula for
accuracy in terms of these two quantities.
UNIT – 2:
Data Preprocessing and Proximity measures:
Unit II: Data pre-processing, Measures of Similarity and Dissimilarity: Basics, similarity and dissimilarity
between simple attributes, dissimilarities between data objects, similarities between data objects,
examples of proximity measures: similarity measures for binary data, Jaccard coefficient, Cosine similarity,
Extended Jaccard coefficient, Correlation, Exploring Data: Data Set, Summary Statistics (Tan)
Data preprocessing:
Data Preprocessing refers to the steps applied to make data more suitable for data mining. The steps used
for Data Preprocessing usually fall into two categories:
1) Selecting data objects and attributes for the analysis.
2) Creating/changing the attributes.
The goal is to improve the data mining analysis with respect to time, cost and quality
Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object).
– For example, consider the below data set consisting of data objects recording the daily sales
of products in various store locations for different days over the course of a year.
– One way to aggregate the data objects is to replace all the data objects of a single store with
a single store-wide data object.
– This reduces the hundreds or thousands of transactions that occur daily at a specific store to
a single daily transaction, and the number of data objects is reduced to the number of
stores.
There are several motivations for aggregation.
1) Data Reduction: Reduce the number of objects or attributes. This results in smaller data sets that
require less memory and processing time; hence, aggregation may permit the use of
more expensive data mining algorithms.
2) Change of Scale: Aggregation can act as a change of scope or scale by providing a high-level view of
the data instead of a low-level view. For example,
– Cities aggregated into regions, states, countries etc.
– Days aggregated into weeks, months and years.
3) More “Stable” Data: Aggregated Data tends to have less variability.
Sampling
Sampling is the main technique employed for data reduction. Sampling is a commonly used
approach for selecting a subset of the data objects to be analysed.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians often sample because obtaining the entire set of data of interest is too expensive or
time consuming.
Sampling is typically used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
The key aspect of sampling is to use a sample that is representative. A sample is representative if it has
approximately the same property (of interest) as the original set of data. If the mean (average) of the data
objects is the property of interest, then a sample is representative if it has a mean that is close to that of
the original data.
Types of Sampling:
1) Simple Random Sampling:
There is an equal probability of selecting any particular item
→ Sampling without replacement: As each item is selected, it is removed from the
population.
→ Sampling with replacement: Objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked up
more than once.
2) Stratified sampling: Split the data into several partitions, and then draw random samples from each
partition.
3) Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained. Here, there is no need
to determine the correct sample size initially.
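The three schemes above can be sketched in a few lines of Python (illustrative only); the population below is just the integers 0-99, and the stratified example assumes the data already carries a group label.

import random

population = list(range(100))

# 1) Simple random sampling
without_repl = random.sample(population, 10)    # sampling without replacement
with_repl = random.choices(population, k=10)    # sampling with replacement (items may repeat)

# 2) Stratified sampling: draw objects from every partition (stratum)
strata = {"low": list(range(0, 50)), "high": list(range(50, 100))}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# 3) Progressive sampling: grow the sample until an estimate (here, the mean) stabilizes
size, prev_mean = 10, None
while True:
    sample = random.sample(population, size)
    mean = sum(sample) / size
    if prev_mean is not None and abs(mean - prev_mean) < 1.0:
        break                                    # sample is "large enough" for this estimate
    prev_mean, size = mean, min(size + 10, len(population))

print(without_repl, with_repl, stratified, size)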
Curse of dimensionality
Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set
have.
If a particular data set has a high number of attributes (also called high dimensionality), then it can
become difficult to analyse. This problem is referred to as the Curse of
Dimensionality.
Dimensionality reduction
The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of
a data set by creating new attributes that are a combination of the old attributes.
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms.
Allow data to be more easily visualized.
May help to eliminate irrelevant features or reduce noise.
Techniques:
There are some linear algebra techniques for dimensionality reduction, particularly for continuous data, to
project the data from a high-dimensional space into a low dimensional space.
Principal Components Analysis (PCA)
Singular Value Decomposition(SVD)
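A minimal sketch of PCA-based dimensionality reduction (illustrative, assuming NumPy and scikit-learn are available); the random 50 x 10 data matrix is made up.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))        # 50 objects, 10 attributes (high-dimensional data)

pca = PCA(n_components=2)            # project onto the 2 directions of largest variance
X_reduced = pca.fit_transform(X)     # the new attributes are combinations of the old ones

print(X_reduced.shape)               # (50, 2)
print(pca.explained_variance_ratio_) # fraction of variance captured by each new attribute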
Feature Creation
It involves creation of new attributes that can capture the important information in a data set much more
efficiently than the original attributes. The number of new attributes can be smaller than the number of
original attributes. There are three methodologies for creating new attributes:
1) Feature extraction:
The creation of a new set of features from the original raw data is known as feature
extraction.
Consider a set of photographs, where each photograph is to be classified according to
whether or not it contains a human face.
The raw data is a set of pixels, and as such, is not suitable for many types of classification
algorithms.
However, if the data is processed to provide higher-level features, such as the presence or
absence of certain types of edges and areas that are highly correlated with the presence of
human faces, then a much broader set of classification techniques can be applied to this
problem.
This method is highly domain specific.
2) Feature construction:
Sometimes the features in the original data sets have the necessary information, but it is not
in a form suitable for the data mining algorithm.
In this situation, one or more new features constructed out of the original features can be
more useful than the original features.
Example: dividing mass by volume to get density
3) Mapping the data to a new space
A totally different view of the data can reveal important and interesting features.
Consider, for example, time series data, which often contains periodic patterns. If there is
only a single periodic pattern and not much noise then the pattern is easily detected.
If, on the other hand, there are a number of periodic patterns and a significant amount of
noise is present, then these patterns are hard to detect.
Such patterns can, nonetheless, often be detected by applying a Fourier transform to the
time series in order to change to a representation in which frequency information is explicit.
Variable Transformation
A variable transform is a function that maps the entire set of values of a given attribute to a new set of
replacement values such that each old value can be identified with one of the new values.
A simple mathematical function is applied to each value individually. If x is a variable, then examples of
such transformations include power(x, k), log(x), power(e, x), |x|
Normalization: It refers to various techniques to adjust to differences among attributes in terms of
frequency of occurrence, mean, variance, range
Standardization: In statistics, it refers to subtracting off the means and dividing by the standard
deviation.
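The two transformations mentioned above can be written directly (a sketch using NumPy; the sample values are made up):

import numpy as np

x = np.array([12.0, 15.0, 9.0, 22.0, 18.0])

# Normalization (min-max): rescale the values into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean and divide by the standard deviation
x_std = (x - x.mean()) / x.std(ddof=1)

print(x_norm)
print(x_std)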
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1] ; 0 - no similarity and 1 - complete similarity
Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Transformation:
– It is a function used to convert similarity to dissimilarity and vice versa, or to transform a
proximity measure to fall into a particular range. For instance:
s’ = (s - min(s)) / (max(s) - min(s))
where,
s’ = new transformed proximity measure value,
s = current proximity measure value,
min(s) = minimum of proximity measure values,
max(s) = maximum of proximity measure values
Nominal Attribute:
– Consider objects described by one nominal attribute.
– How to compare similarity of two objects like this?
– Nominal attributes only tell us about the distinctness of objects.
– Hence, in this case similarity is defined as 1 if the attribute values match and 0 otherwise; dissimilarity
is defined in the opposite way.
Ordinal Attribute:
– For objects with a single ordinal attribute, the situation is more complicated because information
about order needs to be taken into account.
– Consider an attribute that measures the quality of a product, on the scale {poor, fair, OK, good,
wonderful}.
– We have 3 products P1, P2, & P3 with quality as wonderful, good, & OK respectively. In order to
compare ordinal quantities, they are mapped to successive integers.
– In this case, the scale is mapped to {0, 1, 2, 3, 4} respectively. Then, dissimilarity d(P1, P2) = 4 - 3 = 1
or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 - 3)/4 = 0.25. A similarity for
ordinal attributes can be defined as s = 1 - d.
Interval or Ratio attributes:
– For interval or ratio attributes, the natural measure of dissimilarity between two objects is the
absolute difference of their values.
– For example, we might compare our current weight and our weight a year ago by saying “I am ten
pounds heavier.” The dissimilarity range from 0 to ∞, rather than from 0 to 1.
– The similarity for these attributes is typically expressed by transforming the dissimilarity, for example
s = 1/(1 + d) or s = -d. Here, x and y are two objects that have one attribute of the indicated type,
and s(x, y) and d(x, y) are the similarity and dissimilarity between x and y.
Euclidean Distance:
The Euclidean distance, d, between two points x and y is defined as
d(x, y) = sqrt( sum for k = 1..n of (xk - yk)^2 )
where n is the number of dimensions, and xk and yk are, respectively, the kth attributes (components)
of x and y.
We illustrate this formula with below figure, which shows a set of points, the x and y coordinates of these
points, and the distance matrix containing the pairwise distances of these points.
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1
X and y coordinates of four points
The Euclidean distance measure given in the above equation is generalized by the Minkowski distance metric:
d(x, y) = ( sum for k = 1..n of |xk - yk|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the kth
attributes (components) of data objects x and y.
The following are the three most common examples of Minkowski distances.
r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example of this for binary
vectors is the Hamming distance, which is just the number of bits that are different between two
binary vectors
r = 2. Euclidean distance(L2 norm).
r = ∞. “supremum” (Lmax norm, L∞ norm) distance. This is the maximum difference between any
component of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Similarly, compute the distance matrices for the L1 norm and the L∞ norm using the above data.
D(p1, p2) = |(0 – 2)| + |(2 – 0)| = 2+2 = 4
D(p1, p3) = |(0 – 3)| + |(2 – 1)| = 3+1 = 4
D(p1, p4) = |(0 – 5)| + |(2 – 1)| = 5+1 = 6
D(p2, p3) = |(2 – 3)| + |(0 – 1)| = 1+1 = 2
D(p2, p4) = |(2 – 5)| + |(0 – 1)| = 3+1 = 4
D(p3, p4) = |(3 – 5)| + |(1 – 1)| = 2+0 = 2
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L1 distance matrix
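The three Minkowski distances for the four points above can be checked with a short sketch (illustrative, assuming SciPy is installed); the L1 matrix it prints matches the table above, and it also produces the Euclidean (L2) and supremum (L∞) matrices.

import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],    # p1
                   [2, 0],    # p2
                   [3, 1],    # p3
                   [5, 1]])   # p4

print(cdist(points, points, metric="cityblock"))   # r = 1: L1 / Manhattan distance matrix
print(cdist(points, points, metric="euclidean"))   # r = 2: L2 / Euclidean distance matrix
print(cdist(points, points, metric="chebyshev"))   # r -> infinity: supremum (Lmax) distance matrix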
Similarity Measures for Binary Data:
The Simple Matching Coefficient (SMC) and the Jaccard coefficient (J) for two binary vectors x and y are defined as
SMC = (f11 + f00) / (f00 + f01 + f10 + f11) and J = f11 / (f01 + f10 + f11),
where fpq is the number of attributes for which x has value p and y has value q.
To illustrate the difference between these two similarity measures, we calculate SMC and Jaccard for the
following two binary vectors.
x= 1000000000
y= 0000001001
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)
SMC = (0 + 7) / (7 + 2 + 1 + 0) = 7/10 = 0.7
J = 0 / (2 + 1 + 0) = 0
Cosine Similarity:
The cosine similarity between two vectors x and y is defined as cos(x, y) = (x . y) / (||x|| ||y||),
where . indicates dot product and ||x|| defines the length of vector x.
The below example calculates the cosine similarity for the following two data objects, which might
represent document vectors.
x= 3205000200
y= 1000000102
<x, y> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| x || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| y || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(x, y) = 5 / (6.481 × 2.449) = 0.3150
As indicated by the above figure, cosine similarity is a measure of the angle between x and y. Thus, if the cosine
similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If
the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).
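The hand calculation above can be reproduced with a few lines of NumPy (a sketch, not from the text):

import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# cos(x, y) = (x . y) / (||x|| ||y||)
cos_xy = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos_xy, 4))   # 0.315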
Correlation
The correlation between two data objects that have binary or continuous attributes is a measure of the
linear relationship between the attributes of the objects (e.g., height, weight).
Pearson’s Correlation Coefficient between two objects, x and y, is defined by the following equation:
correlation(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = sxy / (sx sy)
Positive correlation is a relationship between two variables in which both variables move in the
same direction. This is when one variable increases while the other also increases, and vice versa. For
example, positive correlation may be that the more you exercise, the more calories you will burn.
Whilst negative correlation is a relationship where one variable increases as the other decreases,
and vice versa.
The correlation coefficient is a value that indicates the strength of the relationship between
variables. The coefficient can take any values from -1 to 1. The interpretations of the values are:
– -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when
one variable increases, the other variable decreases).
– 0: No correlation. The variables do not have a relationship with each other.
– 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when
one variable increases, the other variable also increases).
To understand this, we will consider an example.
Following data shows the number of customers with their corresponding temperature.
1) First find means of both the variables, subtract each of the item with its respective mean and
multiply it together as follows
Mean of X, X̄ = (97+86+89+84+94+74)/6 = 524/6 = 87.333
Mean of Y, Ȳ = (14+11+9+9+15+7)/6 = 65/6 = 10.833
2) Now, find the covariance:
Sxy = Σ(x - X̄)(y - Ȳ) / (n - 1) = 112.3/5 ≈ 22.46
3) Next, find the standard deviations:
Sx = √(331.28/5) = √66.26 ≈ 8.14
Sy = √(48.78/5) = √9.76 ≈ 3.12
4) Finally, the correlation = 22.46/(8.14 × 3.12) = 22.46/25.4 ≈ 0.88
5) A correlation of about 0.88 shows that the strength of the relationship between temperature and the number of
customers is very strong.
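The same coefficient can be checked with NumPy (a sketch; the exact value, about 0.88, may differ slightly from the hand calculation above because of rounding of the intermediate values):

import numpy as np

temperature = np.array([97, 86, 89, 84, 94, 74])
customers = np.array([14, 11, 9, 9, 15, 7])

r = np.corrcoef(temperature, customers)[0, 1]
print(round(r, 2))   # 0.88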
Data Exploration
Data exploration can aid in selecting the appropriate preprocessing and data analysis techniques.
Summary statistics, such as the mean and standard deviation of a set of values, and visualization
techniques, such as histograms and scatter plots, are standard methods that are widely employed
for data exploration.
Many of the exploratory data techniques are illustrated with the Iris Plant data set, that can be
obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
It consists of information on 150 Iris flowers, 50 from each of three Iris species: Setosa,
Versicolour, and Virginica.
Each flower is characterized by five attributes:
1) Sepal length in centimeters
2) Sepal width in centimeters
3) Petal length in centimeters
4) Petal width in centimeters
5) Class ( Setosa, Virginica, Versicolour)
The sepals of a flower are the outer structures that protect the more fragile parts of the flower, such as the
petals.
Summary Statistics
Summary statistics are quantities, such as mean and standard deviation that capture various
characteristics of a potentially large set of values with a single number or a small set of numbers.
That is, Summary statistics are numbers that summarize properties of the data
Summarized properties include
– Frequencies and Mode
– Percentiles
– Measures of Location: Mean and Median
– Measures of Spread: Range and Variance
– Other ways to summarize the data
Most summary statistics can be calculated in a single pass through the data
Percentile
For continuous data, the notion of a percentile is more useful.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile xp
is a value of x such that p% of the observed values of x are less than xp.
For instance, the 50th percentile is the value x50% such that 50% of all values of x are less than x50%.
The median is the middle value if there are an odd number of values and average of the two middle values
if the number of values is even. Thus, for seven values, the median is x (4), while for ten values, the median is
(x(5) + x(6))/2.
Example:
Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the
typical house price. In $100,000 the prices were
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
Using the mean, we would say that the average house price is $744,000.
Mean = (2.7+2.9+3.1+3.4+3.7+4.1+4.3+4.7+4.7+40.8)/10 = 7.44
Since there is an even number of outcomes, we take the average of the middle two = (3.7 + 4.1)/2 = 3.9.
The median house price is $390,000.
Trimmed Mean
A trimmed mean (sometimes called a truncated mean) is similar to a “regular” mean (average), but it trims
any outliers.
These means are expressed in percentages. The percentage tells you what percentage of data to
remove.
A percentage p between 0 and 100 is specified, the top and bottom (p/2)% of the data is thrown
out, and the mean is calculated in the normal way.
For example, with a 10% trimmed mean, the lowest 5% and highest 5% of the data are excluded.
The mean is calculated from the remaining 90% of data points.
Example: Find the trimmed 40% mean for the following test scores: 60, 81, 83, 91, 99.
Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle three values:
81, 83, 91.
Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85.
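SciPy provides this directly (a sketch; scipy.stats.trim_mean takes the fraction to cut from each end):

from scipy import stats

scores = [60, 81, 83, 91, 99]

# Cut 20% of the values from each end (40% in total), then average the rest
print(stats.trim_mean(scores, 0.2))   # 85.0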
The median is a trimmed mean with p = 100%, while the standard mean corresponds to p = 0%
However, measures of spread such as the range and the variance are also sensitive to outliers, so other, more robust measures are often used.
These include the Absolute Average Deviation (AAD), the Median Absolute Deviation (MAD) and the Interquartile Range
(IQR).
Example:
The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He
examines 10 randomly selected receipts for parties of four and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
Step1: Now, calculate the mean by adding and dividing by 10 to get
Mean(x) = (44+50+38+96+42+47+40+39+46+50)/10 = 49.2
Step2: Below is the table for getting the standard deviation:
x x - 49.2 (x - 49.2 )2
44 -5.2 27.04
50 0.8 0.64
38 -11.2 125.44
96 46.8 2190.24
42 -7.2 51.84
47 -2.2 4.84
40 -9.2 84.64
39 -10.2 104.04
46 -3.2 10.24
50 0.8 0.64
Total 2599.6
Step3: The variance is 2599.6/9 ≈ 288.8, so the standard deviation is √288.8 ≈ 17.0.
The covariance between attributes xi and xj is
covariance(xi, xj) = (1/(m - 1)) × sum for k = 1..m of (xki - x̄i)(xkj - x̄j),
where xki and xkj are the values of the ith and jth attributes for the kth object, and x̄i and x̄j are their means.
Notice that covariance(xi, xi) = variance(xi). Thus, the covariance matrix has the variances of the
attributes along the diagonal.
The covariance of two variables is a measure of the degree to which two attributes vary together
and depends on the magnitudes of the variables.
A value near 0 indicates that two attributes do not have a relationship.
The ijth entry of the correlation matrix R, is the correlation between the ith and jth attributes of the
data.
If xi and xj are the ith and jth attributes, then
rij = correlation(xi, xj) = covariance(xi, xj) / (si sj),
where si and sj are the standard deviations of xi and xj, respectively.
The diagonal entries of R are correlation(xi, xi) = 1, while other entries are between -1 and 1.
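A short NumPy sketch (illustrative; the three-attribute data matrix is made up) computes both matrices and shows the properties just described: the attribute variances appear on the diagonal of the covariance matrix, and the diagonal of the correlation matrix is all 1s.

import numpy as np

# 5 objects x 3 attributes (hypothetical data matrix D)
D = np.array([[1.0, 2.0, 0.5],
              [2.0, 4.1, 0.4],
              [3.0, 6.2, 0.7],
              [4.0, 7.9, 0.2],
              [5.0, 9.8, 0.6]])

S = np.cov(D, rowvar=False)        # covariance matrix; diagonal holds the attribute variances
R = np.corrcoef(D, rowvar=False)   # correlation matrix; diagonal entries are all 1

print(np.round(S, 3))
print(np.round(R, 3))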
UNIT – 3:
Data Warehouse:
Unit III: Data Warehouse: basic concepts, Data Warehousing Modeling: Data Cube and OLAP, Data
Warehouse implementation: efficient data cube computation, partial materialization, indexing OLAP data,
efficient processing of OLAP queries. (H & C)
Data warehouse:
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
Subject-oriented:
A data warehouse is organized around major subjects such as customer, supplier, product, and
sales.
Rather than concentrating on the day-to-day operations and transaction processing of an
organization, a data warehouse focuses on the modelling and analysis of data for decision
makers.
Data warehouses typically provide a simple and concise view of particular subject issues by
excluding data that are not useful in the decision support process.
Integrated:
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as
relational databases, flat files, and online transaction records.
Data cleaning and data integration techniques are applied to ensure consistency in naming
conventions, encoding structures, attribute measures, and so on.
Time-variant:
Data are stored to provide information from an historic perspective (e.g., the past 5–10 years).
Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
Non-volatile:
A data warehouse is always a physically separate store of data transformed from the application
data found in the operational environment.
Due to this separation, a data warehouse does not require transaction processing, recovery, and
concurrency control mechanisms.
It usually requires only two operations in data accessing: initial loading of data and access of
data.
How do the data warehousing approach and the traditional database approach differ for heterogeneous database
integration?
Traditional heterogeneous DB integration: Query driven approach
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases.
When a query is posed to a client site, a metadata dictionary is used to translate the query into
queries appropriate for the individual heterogeneous sites involved. These queries are then mapped
and sent to local query processors.
The results returned from the different sites are integrated into a global answer set.
This query-driven approach requires complex information filtering and integration processes, and
competes with local sites for processing resources.
It is inefficient and potentially expensive for frequent queries, especially queries requiring
aggregations.
Data warehouse: update-driven, high performance
Data warehousing employs an update-driven approach in which information from multiple,
heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and
analysis.
A data warehouse brings high performance to the integrated heterogeneous database system
because data are copied, pre-processed, integrated, annotated, summarized, and restructured into
one semantic data store.
The major task of on-line operational database systems is to perform online transaction and query
processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of
the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking,
payroll, registration, and accounting.
The major distinguishing features between OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented
and is used for data analysis by knowledge workers, including managers, executives, and analysts.
Data contents: An OLTP system manages current data that are too detailed to be easily used for decision
making. An OLAP system manages large amounts of historical data, provides facilities for summarization
and aggregation, and stores and manages information at different levels of granularity. These features
make the data easier to use in informed decision making.
Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-
oriented database design. An OLAP system typically adopts either a star or snowflake model and a subject
oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or Department, without
referring to historic data or data in different organizations. OLAP systems deal with information that
originates from different organizations, integrating information from many data stores. Because of their
huge volume, OLAP data are stored on multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a
system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are
mostly read-only operations (because most data warehouses store historical rather than up-to-date
information).
However, many vendors of operational relational database management systems are beginning to
optimize such systems to support OLAP queries. As this trend continues, the separation between
OLTP and OLAP systems is expected to decrease.
1) Bottom Tier:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases
or other external sources (e.g., customer profile information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway
is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a
server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
2) Middle Tier:
The middle tier is an OLAP server.
An OLAP server is a set of specifications that acts as a gateway between the user and the data warehouse
(database).
OLAP Server is typically implemented using either
1) a relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements
multidimensional data and operations).
3) Top Tier:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
2) Dependent data marts: Dependent data marts are sourced directly from enterprise data
warehouses.
3) Virtual warehouse:
A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
What are the pros and cons of the top-down and bottom-up approaches to data warehouse
development?
The top-down development of an enterprise warehouse serves as a systematic solution and
minimizes integration problems. However, it is expensive, takes a long time to develop, and lacks flexibility
due to the difficulty in achieving consistency
The bottom-up approach to the design, development, and deployment of independent data
marts provides flexibility, low cost, and rapid return on investment. However, it can lead to problems when
integrating various disparate data marts into a consistent enterprise data warehouse.
Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent data marts.
Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user authorization
and access control).
Data related to system performance, which include indices and profiles that improve data access
and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and
replication cycles.
Business metadata, which include business terms and definitions, data ownership information, and
charging policies.
Example:
Consider AllElectronics store. In below 2-D representation, the sales for Vancouver are
shown with respect to the time dimension (organized in quarters) and the item dimension (organized
according to the types of items sold). The fact or measure displayed is dollars sold (in thousands).
Suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table 4.3. The 3-D data
in the table are represented as a series of 2-D tables. Conceptually, we may also represent the same data in
the form of a 3-D data cube, as in Figure 4.3.
Suppose that we would now like to view our sales data with an additional fourth dimension such as
supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D
cubes, as shown in Figure 4.4.
In general, it is possible to display an n-dimensional data cube as a series of (n−1)-dimensional cubes. A data
cube such as each of the above is often referred to as a cuboid.
Given a set of dimensions, we can generate a cuboid for each possible subset of the
given dimensions. The result forms a lattice of cuboids, each showing the data at a different level of
summarization, or group-by.
The below Figure shows a lattice of cuboids forming a data cube for the dimensions time, item, location,
and supplier.
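To make the lattice concrete, the following short Python sketch (illustrative only; the dimension names are the ones from this example) enumerates every cuboid, i.e., every possible group-by subset of the dimensions:

from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Every subset of the dimension set corresponds to one cuboid (group-by).
cuboids = []
for k in range(len(dimensions) + 1):
    for subset in combinations(dimensions, k):
        cuboids.append(subset)

print(len(cuboids))   # 2^4 = 16 cuboids in the lattice
print(cuboids[0])     # () -> the apex cuboid ("all")
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> the base cuboid

The empty subset corresponds to the apex cuboid and the full subset to the base cuboid, as defined next.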
Apex cuboid:
The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
and is typically denoted by all.
Base cuboid:
The cuboid that holds the lowest level of summarization is called the base cuboid.
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
A data warehouse requires a concise, subject-oriented schema that facilitates online data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist in the
form of
1) a star schema,
2) a snowflake schema,
3) a fact constellation schema
Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the
central fact table.
Example:
A star schema for AllElectronics sales is shown in below Figure. Sales are considered along
four dimensions: time, item, branch, and location. The schema contains a central fact table for sales that
contains keys to each of the four dimensions, along with two measures: dollars sold and units sold.
In the star schema, each dimension is represented by only one table, and each table contains
a set of attributes.
For example, the location dimension table contains the attribute set {location key, street,
city, province or state, country}. This constraint may introduce some redundancy.
For example, “Kakinada” and “Visakhapatnam” are both cities in the province of Andhra
Pradesh. Entries for such cities in the location dimension table will create redundancy among the attributes
province or state and country, that is, (..., Visakhapatnam, Andhra Pradesh, India) and (..., Kakinada, Andhra
Pradesh, India).
The attributes within a dimension table may form either a hierarchy (total order) or a lattice
(partial order).
Snowflake schema:
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a
shape similar to a snowflake.
Example:
A snowflake schema for AllElectronics sales is given in below Figure. The sales table
definition is identical to that of the star schema. The single dimension table for item is normalized, resulting
in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing
supplier key and supplier type information.
Similarly, the single dimension table for location can be normalized into two new tables:
location and city. The city key in the new location table links to the city dimension.
Fact constellation:
Sophisticated applications may require multiple fact tables to share dimension tables. This
kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
Example:
A fact constellation schema is shown in below Figure. This schema specifies two fact tables,
sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has
five dimensions, or keys—item key, time key, shipper key, from location, and to location—and two
measures—dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared
between fact tables.
For example, the dimensions tables for time, item, and location are shared between the sales
and shipping fact tables.
Defining data warehouses and data marts in SQL-based data mining query language (DMQL):
Data warehouses and data marts can be defined using two language primitives, one for cube definition and
one for dimension definition.
The cube definition syntax is:
Snowflake Schema: The Snowflake Schema of above example is defined in DMQL as follows:
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
Fact constellation schema: The Fact constellation Schema of above example is defined in DMQL as follows:
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
III. Combining the outcomes of the subsets gives a value equal to the measure value computed on the
entire data set with the same aggregate function. Some of the distributive
measures are sum(), count(), max(), and min().
Example: sum() can be computed for a data cube by first partitioning the cube into a set of subcubes,
computing sum() for each subcube, and then summing up the partial sums obtained for each subcube. Hence,
sum() is a distributive aggregate function.
Algebraic:
An aggregate function is algebraic if it can be computed by an algebraic function with M
arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive
aggregate function.
For example, avg() (average) can be computed by sum()/count(), where both sum() and
count() are distributive aggregate functions.
A measure is algebraic if it is obtained by applying an algebraic aggregate function.
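As a small illustration of the difference, the sketch below (hypothetical partition values, not from the text) computes sum() and count() per partition and then derives avg() from those distributive partial results:

# Partition the data into subcubes and compute distributive measures per subcube;
# an algebraic measure such as avg() is then derived from the partial results.
subcubes = [[3, 5, 2], [7, 1], [4, 4, 4]]    # hypothetical partitions of the data

partial_sums   = [sum(part) for part in subcubes]
partial_counts = [len(part) for part in subcubes]

total_sum   = sum(partial_sums)          # distributive: sum of partial sums
total_count = sum(partial_counts)        # distributive: sum of partial counts
average     = total_sum / total_count    # algebraic: avg() = sum() / count()

print(total_sum, total_count, average)   # 30 8 3.75

A holistic measure such as median() cannot be combined from such partial results, which is exactly the distinction drawn next.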
Holistic:
An aggregate function is holistic if there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the computation. Common examples of holistic
functions include median(), mode(), and rank().
A measure is said to be holistic if it is obtained by applying a holistic aggregate function to the entire data set
rather than to partitioned subsets. In short, a holistic measure cannot be computed in a distributive manner.
Concept hierarchies:
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts.
Consider a concept hierarchy for the dimension location. City values for location include Vancouver,
Toronto, NewYork, and Chicago.
Each city, however, can be mapped to the province or state to which it belongs. For example,
Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country to which they belong, such as
Canada or the USA. These mappings form a concept hierarchy for the dimension location, mapping a
set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
For example, suppose that the dimension location is described by the attributes number, street,
city, province or state, zipcode, and country. These attributes are related by a total order, forming a
concept hierarchy such as “street < city < province or state < country”.
Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice.
An example of a partial order for the time dimension based on the attributes day, week, month,
quarter, and year is “day < {month < quarter; week} < year”.
Schema hierarchies:
A concept hierarchy that is a total or partial order among attributes in a database schema is
called a schema hierarchy.
Set-grouping hierarchies:
Concept hierarchy may also be defined by grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among group of
values.
Roll-up:
The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction.
The below Figure shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location given in Figure. This hierarchy was defined as the total order
“street < city < province or state < country.” The roll-up operation shown aggregates the data by ascending
the location hierarchy from the level of city to the level of country. In other words, rather than grouping the
data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed
from the given cube. For example, consider a sales data cube containing only the location and time
dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of
the total sales by location, rather than by location and by time.
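A minimal Python sketch of both kinds of roll-up, using a few hypothetical sales records (the city and country values follow the AllElectronics example; the dollar figures are made up):

from collections import defaultdict

# Hypothetical sales facts: (city, country, quarter, dollars_sold)
sales = [
    ("Vancouver", "Canada", "Q1", 605), ("Toronto",  "Canada", "Q1", 818),
    ("Chicago",   "USA",    "Q1", 854), ("New York", "USA",    "Q1", 1087),
]

# Roll-up along the location hierarchy: group by country instead of city.
by_country = defaultdict(int)
for city, country, quarter, dollars in sales:
    by_country[country] += dollars
print(dict(by_country))     # {'Canada': 1423, 'USA': 1941}

# Roll-up by dimension reduction: drop the time dimension entirely,
# aggregating total sales by location only.
by_city = defaultdict(int)
for city, country, quarter, dollars in sales:
    by_city[city] += dollars
print(dict(by_city))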
Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions.
The below Figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as “day < month < quarter < year.” Drill-down
occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The
resulting data cube details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be performed by adding
new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by
introducing an additional dimension, such as customer group.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation. The below Figure shows a pivot operation, where the item and
location axes in a 2-D slice are rotated.
The base cuboid contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains
the total sum of all sales.
The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is
the most generalized (least specific) of the cuboids, and is often denoted as all.
If we start at the apex cuboid and explore downward in the lattice, this is equivalent to
drilling down within the data cube. If we start at the base cuboid and explore upward, this is akin to
rolling up.
An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero
dimensional operation.
An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a
one-dimensional operation.
Online analytical processing may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids in a data
cube. Precomputation leads to fast response times and avoids some redundant computation.
A major challenge of such precomputation is that it requires huge storage space if all the
cuboids in a data cube are precomputed, especially when the cube has many dimensions. The storage
requirements are even more excessive when many of the dimensions have associated concept hierarchies,
each with multiple levels. This problem is referred to as the curse of dimensionality.
If there were no hierarchies associated with each dimension, then the total number of
cuboids for an n-dimensional data cube would be 2^n.
But many dimensions do have hierarchies. If the dimensions have concept hierarchies with
multiple levels, then the total number of cuboids for an n-dimensional cube is computed as

    Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

where Li is the number of levels associated with dimension i. One is added to Li in the above equation
to include the virtual top level, all.
For example, the time dimension may have the concept hierarchy
“day < month < quarter < year”, i.e., 4 conceptual levels.
Including the virtual level all, the time dimension therefore contributes a factor of Li + 1 = 5 to the
product above.
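A one-line check of this formula, assuming a hypothetical cube with three dimensions whose hierarchies have 4, 3, and 4 levels:

# Total number of cuboids = product of (Li + 1) over all dimensions,
# where the "+ 1" accounts for the virtual top level 'all'.
levels = {"time": 4, "item": 3, "location": 4}   # hypothetical Li values

total_cuboids = 1
for L in levels.values():
    total_cuboids *= (L + 1)

print(total_cuboids)   # (4+1) * (3+1) * (4+1) = 100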
Because of this curse of dimensionality, it is usually not practical to precompute and materialize all the
cuboids of a data cube. To avoid this problem, a different method called “partial
materialization” is used, that is, to materialize only some of the possible cuboids that can be generated.
The selection of the subset of cuboids or subcubes to materialize should take into account
the queries in the workload, their frequencies, and their accessing costs. In addition, it should
consider workload characteristics, the cost for incremental updates, and the total storage
requirements.
Several OLAP products have adopted heuristic approaches for cuboid and subcube selection.
They are
1) Materialize the cuboids set on which other frequently referenced cuboids are based.
Alternatively, we can compute an iceberg cube, which is a data cube that stores only those cube
cells with an aggregate value (e.g., count) that is above some minimum support threshold.
2) Materialize a shell cube which involves precomputing the cuboids for only a small number of
dimensions (e.g., three to five) of a data cube.
3) Finally, during load and refresh, the materialized cuboids should be updated efficiently.
Example:
We defined a star schema for AllElectronics of the form “sales star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars)”.
An example of a join index relationship between the sales fact table and the dimension tables for
location and item is shown in below Figure. For example, the “Main Street” value in the location
dimension table joins with tuples T57, T238, and T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the
sales fact table.
The corresponding join index tables are shown in below figure.
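Conceptually, a join index is a precomputed mapping from a dimension value to the fact-table tuple IDs it joins with. A minimal sketch using the tuple IDs mentioned above:

# Join index tables as dictionaries: dimension value -> fact-table tuple IDs.
location_index = {"Main Street": ["T57", "T238", "T884"]}
item_index     = {"Sony-TV":     ["T57", "T459"]}

# A composite join index for (location, item) is the intersection of the two lists,
# giving the fact tuples that join with both dimension values at once.
composite = set(location_index["Main Street"]) & set(item_index["Sony-TV"])
print(composite)   # {'T57'}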
1) Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the
query into corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube
may correspond to selection and/or projection operations on a materialized cuboid.
2) Determine to which materialized cuboid(s) the relevant operations should be applied: This
involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the above set using knowledge of “dominance” relationships among the cuboids,
estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the
least cost.
Example:
Suppose that we define a data cube for All Electronics of the form “sales cube [time, item, location]:
sum(sales in dollars)”. The dimension hierarchies used are “day < month < quarter < year” for time, “item
name < brand < type” for item, and “street < city < province or state < country” for location.
Suppose that the query to be processed is on {brand, province_or_state}, with the selection constant “year =
2004”. Also, suppose that there are four materialized cuboids available, as follows:
cuboid 1: {year, item name, city}
cuboid 2: {year, brand, country}
cuboid 3: {year, brand, province or state}
cuboid 4: {item name, province or state} where year = 2004
“Which of the above four cuboids should be selected to process the query?”
Finer granularity data cannot be generated from coarser-granularity data.
Therefore, cuboid 2 cannot be used because country is a more general concept than province or state.
Cuboids 1, 3, and 4 can be used to process the query because
1) They have the same set or a superset of the dimensions in the query.
2) The selection clause in the query can imply the selection in the cuboid.
3) The abstraction levels for the item and location dimensions in these cuboids are at a finer level than
brand and province or state, respectively.
“How would the costs of each cuboid compare if used to process the query?”
It is likely that using cuboid 1 would cost the most because both item name and city are at a lower level
than the brand and province or state concepts specified in the query.
If there are not many year values associated with items in the cube, but there are several
item names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be
chosen to process the query. However, if efficient indices are available for cuboid 4, then cuboid 4 may be a
better choice.
Therefore, some cost-based estimation is required in order to decide which set of cuboids
should be selected for query processing.
UNIT – 4:
Classification:
Unit IV: Classification: Basic Concepts, General approach to solving a classification problem, Decision Tree
induction: working of decision tree, building a decision tree, methods for expressing attribute test
conditions, measures for selecting the best split, Algorithm for decision tree induction.
Model over fitting: Due to presence of noise, due to lack of representation samples, evaluating the
performance of classifier: holdout method, random sub sampling, cross-validation, bootstrap. (Tan)
Classification:
Classification is the task of assigning objects to one of several predefined categories.
Examples:
Detecting spam email messages based upon the message header and content
Categorizing cells as malignant or benign based upon the results of MRI scans
Classifying galaxies based upon their shapes
Classification is the task of learning a target function f that maps each attribute set to one of the
predefined class labels y.
The target function is also known informally as a classification model.
For example, it would be useful for both biologists and others to have a descriptive model that summarizes
the data and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.
Predictive modelling:
A classification model can also be used to predict the class label of unknown records.
A classification model can be treated as a black box that automatically assigns a class label when presented
with the attribute set of an unknown record.
Suppose we are given the following characteristics of a creature known as a gila monster:
We can use a classification model built from the data set shown in above Table 4.1 to determine the class
to which the creature belongs.
Classification techniques are most suited for predicting or describing data sets with binary or nominal
categories. They are less effective for ordinal categories (e.g., classifying a person as a member of the high-,
medium-, or low-income group) because they do not consider the implicit order among the categories.
First, a training set consisting of records whose class labels are known must be provided.
The training set is used to build a classification model, which is subsequently applied to the test set,
which consists of records with unknown class labels.
Classification is the task of assigning labels to unlabeled data instances and a classifier is used to perform
such a task.
A classifier is typically described in terms of a model. The model is created using a given set of instances,
known as the training set, which contains attribute values as well as class labels for each instance. The
systematic approach for learning a classification model given a training set is known as a learning
algorithm. The process of using a learning algorithm to build a classification model from the training data is
known as induction. This process is also often described as “learning a model” or “building a model.”
This process of applying a classification model on unseen test instances to predict their class labels is known
as deduction.
Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn
a model, and then applying the model to assign labels to unlabeled instances.
Confusion Matrix:
The performance of a classification model (classifier) can be evaluated by comparing the predicted
labels against the true labels of instances.
That is based on the counts of test records correctly and incorrectly predicted by the model.
This information can be summarized in a table called a confusion matrix.
The above table depicts the confusion matrix for a binary classification problem.
Each entry f_ij denotes the number of instances from class i predicted to be of class j.
For example, f_01 is the number of instances from class 0 incorrectly predicted as class 1.
The number of correct predictions made by the model is (f_11 + f_00) and the number of incorrect
predictions is (f_10 + f_01).
A confusion matrix provides the information needed to determine how well a classification model
performs. However, summarizing this information with a single number makes it more convenient to
compare the performance of different models.
This can be done using an evaluation metric such as accuracy, which is computed in the following
way:

    Accuracy = (number of correct predictions) / (total number of predictions)
             = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00)

Error rate is another related metric, which is defined as follows for binary classification problems:

    Error rate = (number of wrong predictions) / (total number of predictions)
               = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00)
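The sketch below (hypothetical labels, not from the text) builds the four confusion-matrix counts for a binary problem and computes both metrics from them:

# Hypothetical true and predicted class labels for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

f11 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
f10 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
f01 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
f00 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy   = (f11 + f00) / (f11 + f10 + f01 + f00)
error_rate = (f10 + f01) / (f11 + f10 + f01 + f00)
print(f11, f10, f01, f00, accuracy, error_rate)   # 4 2 1 3 0.7 0.3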
The learning algorithms of most classification techniques are designed to learn models that attain
the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
In a decision tree, each leaf node is assigned a class label. The nonterminal nodes, which include the root
and other internal nodes, contain attribute test conditions to separate records that have different
characteristics.
Classifying a test record is straightforward once a decision tree has been constructed.
Starting from the root node, we apply the test condition to the record and follow the appropriate
branch based on the outcome of the test.
This will lead us either to another internal node, for which a new test condition is applied, or to a
leaf node. The class label associated with the leaf node is then assigned to the record.
Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various
attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the
Non-mammal class.
Hunt’s Algorithm:
In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into
successively purer subsets. Let Dt be the set of training records that are associated with node t and y={ y1,
y2, ….., yc} be the class labels. The following is a recursive definition of Hunt’s algorithm.
Step1:
If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step2:
If Dt contains records that belong to more than one class, an attribute test condition is selected to partition
the records into smaller subsets.
A child node is created for each outcome of the test condition and the records in Dt are distributed
to the children based on the outcomes.
The algorithm is then recursively applied to each node.
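A compact Python sketch of this recursion is given below. It is illustrative only: choose_split is a hypothetical user-supplied function that picks an attribute test condition (e.g., by Gini index or information gain), which Hunt's algorithm itself leaves unspecified.

def hunt(records, labels, choose_split):
    """Grow a decision tree by recursively partitioning the training records.

    records      : list of attribute dictionaries
    labels       : list of class labels, parallel to records
    choose_split : function(records, labels) -> (attribute, split_function) or None
    """
    # Step 1: if all records belong to the same class, return a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = choose_split(records, labels)
    if split is None:                       # no further split possible: majority-class leaf
        return {"leaf": max(set(labels), key=labels.count)}
    attribute, split_fn = split
    # Step 2: create one child per outcome of the test condition and recurse.
    partitions = {}
    for rec, lab in zip(records, labels):
        outcome = split_fn(rec[attribute])
        partitions.setdefault(outcome, ([], []))
        partitions[outcome][0].append(rec)
        partitions[outcome][1].append(lab)
    children = {outcome: hunt(sub_recs, sub_labels, choose_split)
                for outcome, (sub_recs, sub_labels) in partitions.items()}
    return {"attribute": attribute, "children": children}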
Consider the following training set for predicting borrowers who will default on loan payments.
A training set for this problem can be constructed by examining the records of previous borrowers.
In the below example Figure, each record contains the personal information of a borrower along
with a class label indicating whether the borrower has defaulted on loan payments.
The goal is to predict whether a loan applicant will repay the loan or default on it. Hunt’s algorithm builds
the tree from this training set as follows:
a) The initial tree for the classification problem contains a single node with the class label “Defaulted =
No” (fig (a)), which means that most of the borrowers successfully repaid their loans.
b) The records are then divided into smaller subsets based on the outcomes of the “Home Owner”
test condition (fig (b)). The subset with Home Owner = Yes contains records that all belong to the
same class.
d) The right child of the root node is expanded by applying the recursive step of Hunt’s algorithm
until all the records belong to the same class. The tree resulting from the recursive steps is
shown in fig (d).
For example, if marital status has three distinct values such as single, married, and
divorced, its test condition will produce a three-way split.
The test condition can also be split two ways, i.e., as a binary split. Some decision tree
algorithms, such as CART, produce only binary splits by considering all 2^(k−1) − 1 ways of
creating a binary partition of k attribute values. This is shown in the figure.
3. Ordinal Attribute (Group): Ordinal attributes can also produce binary (or) multi-way splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the order property of the
attribute values. The figure shows two-way split.
4. Continuous Attributes: For continuous attributes, the test condition can be expressed as a
comparison test with binary outcomes (yes or no) (A < v) or (A ≥ v)
The measures developed for selecting the best split are often based on the degree of impurity
(i.e., the impurity of a node measures how dissimilar the class labels are for the data instances belonging to
a common node) of the child nodes. The smaller the degree of impurity, the more skewed the class
distribution.
For example, a node with class distribution (0, 1) has zero impurity, whereas a node with
uniform class distribution (0.5, 0.5) has the highest impurity.
          N1   N2
    C0     4    2
    C1     3    3

Before Split:
    Gini(parent) = 1 − (6/12)^2 − (6/12)^2 = 1 − 0.25 − 0.25 = 0.5

After Split:
    Gini(N1) = 1 − (4/7)^2 − (3/7)^2 = 1 − 0.3265 − 0.1837 = 0.4898
    Gini(N2) = 1 − (2/5)^2 − (3/5)^2 = 1 − 0.16 − 0.36 = 0.480

Weighted Gini index of the split = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486
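The same numbers can be reproduced with a small Python helper (a sketch, not part of the text):

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [6, 6]            # C0 = 6, C1 = 6 before the split
n1, n2 = [4, 3], [2, 3]    # class counts of the two child nodes

g1, g2 = gini(n1), gini(n2)
weighted = (7 / 12) * g1 + (5 / 12) * g2
print(round(gini(parent), 3), round(g1, 4), round(g2, 3), round(weighted, 3))
# 0.5 0.4898 0.48 0.486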
The attribute chosen as the test condition may vary depending on the choice of impurity measure.
How to determine the performance of a test condition: To determine how well a test condition performs,
we need to compare the degree of impurity of the parent node (before splitting) with the degree of
impurity of the child nodes (after splitting). The larger their difference, the better the test condition. The
gain, Δ, is a criterion that can be used to determine the goodness of a split:

    Δ = I(parent) − Σ_{j=1..k} [ N(v_j) / N ] × I(v_j)

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node,
k is the number of attribute values, and N(v_j) is the number of records associated with child node v_j.
Suppose there are two ways to split the data into smaller subsets (i.e., if the attribute is having only
two categorical values, then that attribute will split into two subsets).
Before splitting, the Gini index is 0.5 since there are an equal number of records from both classes.
If attribute A is chosen to split the data, the Gini index for node N1 is 0.4898, and for node N2, it is
0.480.
The weighted average of the Gini index for the descendent nodes is (7/12) × 0.4898 + (5/12) ×
0.480 = 0.486.
Similarly, we can show that the weighted average of the Gini index for attribute B is 0.375. Since the
subsets for attribute B have a smaller Gini index, it is preferred over attribute A.
A categorical attribute such as Car Type, with the three categories {Sports, Luxury, Family}, can also be split
using binary groupings of its attribute values.
For the first binary grouping of the Car Type attribute,
o The Gini index of {Sports,Luxury} is 0.4922 and
o The Gini index of {Family} is 0.3750.
o The weighted average Gini index for the grouping is equal to
16/20 × 0.4922 + 4/20 × 0.3750 = 0.468.
Similarly, for the second binary grouping of {Sports} and {Family, Luxury}, the weighted average Gini
index is 0.167.
The second grouping has a lower Gini index because its corresponding subsets are much purer.
For the multiway split, the Gini index is computed for every attribute value.
o Since Gini({Family}) = 0.375, Gini({Sports}) = 0, and Gini({Luxury}) = 0.219,
o The overall Gini index for the multiway split is equal to 4/20 × 0.375 + 8/20 × 0 + 8/20 ×
0.219 = 0.163.
- We then compute the Gini index for each candidate and choose the one that gives the lowest value.
This approach is expensive because it requires O(N) operations to compute the Gini index at each
candidate split position. Since there are N candidates, the overall complexity of this task is O(N2).
To reduce the complexity, follow the below procedure for splitting the continuous attribute:
- The training records are sorted based on their annual income(attribute)
- Candidate split positions are identified by taking the midpoints between two adjacent sorted values
- For each candidate split position, compute the Gini index values
- Now, select the best split position corresponds to the one that produces the smallest Gini index
- It can be further optimized by considering only candidate split positions located between two
adjacent records with different class labels.
- In the below example figure, this approach allows us to reduce the number of candidate split
positions from 11 to 2.
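A minimal sketch of the sorted-midpoint procedure just described (the annual incomes and class labels below are illustrative; the further optimization of skipping midpoints between records with identical labels is omitted for brevity):

def gini(class_labels):
    """Gini index of a list of class labels."""
    n = len(class_labels)
    return 1.0 - sum((class_labels.count(c) / n) ** 2 for c in set(class_labels))

def best_split(values, labels):
    """Sort the records, take midpoints of adjacent values as candidate split
    positions, and return the one with the smallest weighted Gini index."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_pos, best_gini = None, float("inf")
    for i in range(n - 1):
        v = (pairs[i][0] + pairs[i + 1][0]) / 2.0      # midpoint between adjacent values
        left  = [lab for val, lab in pairs if val <= v]
        right = [lab for val, lab in pairs if val > v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_gini:
            best_pos, best_gini = v, weighted
    return best_pos, best_gini

incomes   = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # annual income in thousands
defaulted = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, defaulted))   # (97.5, 0.3): best midpoint and its weighted Gini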
Gain Ratio
Impurity measures such as entropy and Gini index tend to favor attributes that have a large number
of distinct values.
That is, a test condition that results in a large number of outcomes may not be desirable because
the number of records associated with each partition is too small to enable us to make any reliable
predictions.
There are two strategies for overcoming this problem.
The first strategy is to restrict the test conditions to binary splits only. This strategy is employed by
decision tree algorithms such as CART.
Another strategy is to modify the splitting criterion to take into account the number of outcomes
produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a
splitting criterion known as gain ratio is used to determine the goodness of a split.
This criterion is defined as follows:

    Gain ratio = Δ_info / Split Info,   where   Split Info = − Σ_{i=1..k} P(v_i) log2 P(v_i)

Here the parent node is split into k partitions (children) and P(v_i) is the fraction of records assigned to
child node v_i.
This suggests that if an attribute produces a large number of splits, its split information will also be
large, which in turn reduces its gain ratio.
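A small Python sketch of Split Info and gain ratio (the information gain value of 0.3 used below is a made-up number, chosen only to show how a many-way split is penalized):

import math

def split_info(child_sizes):
    """Split Info = -sum P(vi) * log2 P(vi) over the children of a split."""
    n = sum(child_sizes)
    return -sum((s / n) * math.log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(info_gain, child_sizes):
    return info_gain / split_info(child_sizes)

# The same hypothetical information gain of 0.3 yields a lower gain ratio for a
# 4-way split than for a balanced 2-way split, because its Split Info is larger.
print(split_info([6, 6]))         # 1.0
print(split_info([3, 3, 3, 3]))   # 2.0
print(gain_ratio(0.3, [6, 6]), gain_ratio(0.3, [3, 3, 3, 3]))   # 0.3 vs 0.15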
Classification Errors:
The errors committed by a classification model are generally divided into two types:
training errors and generalization errors.
Training error, also known as resubstitution error or apparent error, is the number of
misclassification errors committed on the training records; that is, it is obtained by calculating the
classification error of a model on the same data the model was trained on.
Generalization error is the expected error of the model on previously unseen records.
Model underfitting:
The training and test error rates of the model are large when the size of the tree is very small. This
situation is known as model underfitting.
Underfitting occurs because the model has yet to learn the true structure of the data. As a result, it
performs poorly on both the training and the test sets.
As the number of nodes in the decision tree increases, the tree will have fewer training and test
errors.
The training error of a model can be reduced by increasing the model complexity.
For example, the leaf nodes of the tree can be expanded until it perfectly fits the training data. Although the
training error for such a complex tree is zero, the test error can be large because the tree may contain
nodes that accidentally fit some of the noise points in the training data. Such nodes can degrade the
performance of the tree because they do not generalize well to the test examples. The below figure shows
the structure of two decision trees with different number of nodes.
The tree that contains the smaller number of nodes has a higher training error rate, but a lower test error
rate compared to the more complex tree.
Two of the training records are mislabelled: Bats and whales are classified as non-mammals instead
of mammals.
A decision tree that perfectly fits the training data is shown in figure(a). Although the training error
for the tree is zero, its error rate on the test set is 30%.
Both humans and dolphins were misclassified as non-mammals because their attribute values for
body temperature, gives-birth and four legged are identical to the mislabelled records in the
training set.
Errors due to exceptional cases are often unavoidable and establish the minimum error rate
achievable by any classifier.
The decision tree M2 has a lower test error rate (10%) even though the training error rate is
somewhat higher (20%).
The four-legged attribute test condition in model M1 is spurious because it fits the mislabelled
training records, which leads to the misclassification of records in the test set.
All of these training records are labeled correctly and the corresponding decision tree is shown in
figure.
Although its training error is zero, its error rate on the test set is 30%.
Humans, elephants and dolphins are misclassified because the decision tree classifies all warm-
blooded vertebrates that do not hibernate as non-mammals.
The tree arrives at this classification decision because there is only one training record, which is an
eagle, with such characteristics.
This example clearly shows the danger of making wrong predictions when there are not
enough representative examples at the leaf nodes of a decision tree.
Holdout method:
Split the learning sample into a training set and a test data set.
– A model is induced on the training data set
– Performance is evaluated on the test data set
Limitations:
– Too little data for learning: the more data used for testing, the more reliable the performance
estimate, but the less data is available for learning.
– Interdependence of training and test data set: If a class is underrepresented in the training
data set, it will be overrepresented in the test data set and vice versa.
Random Subsampling
The holdout method can be repeated several times to improve the estimation of a classifier’s
performance. If the estimation is performed k times, then the overall performance is taken to be the
average of the k estimates.
This method also encounters some of the problems associated with the holdout method because it
does not utilize as much data as possible for training.
It also has no control over the number of times each record is used for training and testing.
Cross-validation
In this approach each record is used the same number of times for training and exactly once for
testing.
To illustrate this method, suppose we partition the data into two equal-sized subsets.
– First we choose one of the subsets for training and the other for testing.
– We than swap the roles of the subsets so that the previous training set becomes the test set
and vice versa. This approach is called a two-fold cross validation. The total error is obtained
by summing up the errors for both runs.
– In this example, each record is used exactly once for training and once for testing.
Core idea:
– use each record (k − 1) times for training and exactly once for testing
– aggregate the performance values over all k tests
k-fold cross validation
– split the dataset into k equi-sized subsets
– for i = 1, …, k: use the other k − 1 folds for training and the i-th fold for testing (see the sketch below)
– aggregate the performance values over all k tests
Leave-one-out cross validation
– In k-fold cross validation, if k = N where N is the number of records in the learning dataset
– Each test set will contain only one record
– Computationally expensive
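A minimal, stdlib-only sketch of k-fold cross-validation; train_model and evaluate are hypothetical placeholders for whichever learning algorithm and error measure are being used:

import random

def k_fold_indices(n_records, k, seed=42):
    """Shuffle record indices and split them into k (nearly) equal folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, labels, k, train_model, evaluate):
    """Each record is used exactly once for testing and (k-1) times for training."""
    folds = k_fold_indices(len(records), k)
    errors = []
    for i in range(k):
        test_idx  = set(folds[i])
        train_idx = [j for j in range(len(records)) if j not in test_idx]
        model = train_model([records[j] for j in train_idx],
                            [labels[j]  for j in train_idx])
        errors.append(evaluate(model,
                               [records[j] for j in folds[i]],
                               [labels[j]  for j in folds[i]]))
    return sum(errors) / k      # aggregate the k performance estimates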
Bootstrap
The methods presented so far assume that the training records are sampled without replacement.
It means that there are no duplicate records in the training and test set.
In the bootstrap approach, the training records are sampled with replacement.
It means that a record already chosen for training is put back into the original pool of records so
that it is equally likely to be redrawn.
There are several bootstrap methods. A commonly used one is the .632 bootstrap.
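A sketch of drawing a single bootstrap sample (sampling with replacement); on average roughly 63.2% of the distinct original records appear in such a sample, which is where the name of the .632 bootstrap comes from:

import random

def bootstrap_sample(records, seed=0):
    """Sample len(records) records with replacement; records never drawn
    form the test set for this bootstrap round."""
    rng = random.Random(seed)
    n = len(records)
    chosen = [rng.randrange(n) for _ in range(n)]   # indices sampled with replacement
    chosen_set = set(chosen)
    train = [records[i] for i in chosen]
    test  = [records[i] for i in range(n) if i not in chosen_set]
    return train, test

train, test = bootstrap_sample(list(range(1000)))
print(len(set(train)) / 1000.0)   # typically close to 0.632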
UNIT – 5:
Association Analysis:
Association Analysis: Problem Definition, Frequent Item-set generation- The Apriori principle, Frequent
Item set generation in the Apriori algorithm, candidate generation and pruning, support counting (eluding
support counting using a Hash tree), Rule generation, compact representation of frequent item sets, FP-
Growth Algorithms. (Tan)
Association analysis:
Association analysis is useful for discovering interesting relationships hidden in large data sets.
The uncovered relationships can be represented in the form of association rules or sets of frequent
items.
Association analysis is applicable to application domains such as market basket data, bioinformatics,
medical diagnosis, Web mining, and scientific data analysis.
Problem definition:
The following are the basic terminology used in association analysis. Consider the following example of
market basket transactions.
Binary Representation
Market basket data can be represented in a binary format as shown in Table, where each row
corresponds to a transaction and each column corresponds to an item.
An item can be treated as a binary variable whose value is one if the item is present in a transaction
and zero otherwise.
Because the presence of an item in a transaction is often considered more important than its
absence, an item is an asymmetric binary variable.
Itemset:
Let I = {i1, i2, . . . ,id} be the set of all items in a market basket data and T = {t1, t2, . . . , tN} be the set of
all transactions.
Each transaction ti contains a subset of items chosen from I.
In association analysis, a collection of zero or more items is termed an itemset.
If an itemset contains k items, it is called a k-itemset.
– For instance, {Beer, Diapers, Milk} is an example of a 3-itemset.
The null (or empty) set is an itemset that does not contain any items.
The transaction width is defined as the number of items present in a transaction.
A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second
transaction shown in Table 6.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}.
Support count:
Support count refers to the number of transactions that contain a particular itemset. Mathematically, the
support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|, where the symbol | · | denotes the number of elements in a set.
In the data set shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to two because there
are only two transactions that contain all three items.
Association Rule:
An association rule is an implication expression of the form X → Y , where X and Y are disjoint itemsets, i.e.,
X ∩ Y = ∅.
The strength of an association rule can be measured in terms of its support and confidence.
Support determines how often a rule is applicable to a given data set, while confidence determines how
frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

    Support,    s(X → Y) = σ(X ∪ Y) / N
    Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

In other words, support is the fraction of all N transactions that contain both X and Y, and
confidence is the conditional probability that a transaction contains Y given that it contains X.
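The sketch below computes both metrics for the rule {Milk, Diapers} → {Beer}, assuming the five market basket transactions used as the running example in this unit:

# Transactions of the running market basket example (items as Python sets).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    return support_count(X | Y) / len(transactions)

def confidence(X, Y):
    return support_count(X | Y) / support_count(X)

X, Y = {"Milk", "Diapers"}, {"Beer"}
print(support_count({"Beer", "Diapers", "Milk"}))   # 2, as stated above
print(support(X, Y), confidence(X, Y))              # 0.4 and 0.666...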
A low support rule is also likely to be uninteresting from a business perspective because it may not
be profitable to promote items that customers seldom buy together.
For these reasons, support is often used to eliminate uninteresting rules.
Confidence
Confidence, on the other hand, measures the reliability of the inference made by a rule.
For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in
transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
An association rule suggests a strong co-occurrence relationship between items in the antecedent
and consequent of the rule.
Brute-force approach:
A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure.
To do this, we need to compare each candidate against every transaction, an operation that is
shown in below figure.
There are several ways to reduce the computational complexity of frequent itemset generation.
Apriori Principle
The use of support for pruning candidate itemsets is guided by the following Apriori principle.
Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Example:
Consider the itemset lattice shown in below figure. Suppose {c, d, e} is a frequent itemset. Clearly, any
transaction that contains {c, d, e} must also contain its subsets, {c, d} {c, e}, {d, e}, {c}, {d}, and {e}. As a
result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be
frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
This strategy of trimming the exponential search space based on the support measure is known as support-
based pruning.
Such a pruning strategy is made possible by a key property of the support measure, namely, that the
support for an itemset never exceeds the support for its subsets. This property is also known as the anti-
monotone property of the support measure.
Monotonicity Property:
Let I be a set of items, and J = 2I be the power set of I. A measure f is monotone (or upward closed)
if
∀X, Y ∈ J : (X ⊆ Y ) → f(X) ≤ f(Y ),
which means that if X is a subset of Y , then f(X) must not exceed f(Y ). On the other hand, f is anti-
monotone (or downward closed) if
∀X, Y ∈ J : (X ⊆ Y ) → f(Y ) ≤ f(X),
which means that if X is a subset of Y , then f(Y ) must not exceed f(X).
Above figure provides a high-level illustration of the frequent itemset generation part of the Apriori
algorithm for the transactions shown in below table.
We assume that the support threshold is 60%, which is equivalent to a minimum support count
equal to 3.
Initially, every item is considered as a candidate 1-itemset.
After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they
appear in fewer than three transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets
because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be
infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by
the algorithm is C(4, 2) = 6.
Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent
after computing their support values.
The remaining four candidates are frequent, and thus will be used to generate candidate
3-itemsets.
With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm
6.1.
Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-itemsets:
The algorithm initially makes a single pass over the data set to determine the support of each item.
Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 and 2).
Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k − 1)-
itemsets found in the previous iteration (step 5). Candidate generation is implemented using a
function called apriori-gen, which is described later.
To count the support of the candidates, the algorithm needs to make an additional pass over the
data set (steps 6–10). The subset function is used to determine all the candidate itemsets in Ck that
are contained in each transaction t. The implementation of this function is described in the section
on support counting.
After counting their supports, the algorithm eliminates all candidate itemsets whose support counts
are less than minsup (step 12).
The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 13).
Apriori algorithm:
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
Candidate Generation: Generate Lk+1 from Fk
Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are
infrequent
Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those
that are frequent => Fk+1
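A compact, stdlib-only Python sketch of this loop (a simplified illustration of candidate generation, pruning, support counting, and elimination, not an optimized implementation):

from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support count >= minsup."""
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {c for c in items
            if sum(1 for t in transactions if c <= t) >= minsup}
    all_frequent = set(freq)
    k = 2
    while freq:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Candidate pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Support counting and candidate elimination.
        freq = {c for c in candidates
                if sum(1 for t in transactions if c <= t) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent

transactions = [{"Bread", "Milk"}, {"Bread", "Diapers", "Beer", "Eggs"},
                {"Milk", "Diapers", "Beer", "Cola"}, {"Bread", "Milk", "Diapers", "Beer"},
                {"Bread", "Milk", "Diapers", "Cola"}]
print(apriori(transactions, minsup=3))   # four frequent 1-itemsets and four frequent 2-itemsets

Run on the five transactions above with a support threshold of 3, this reproduces the walkthrough earlier in the section: {Cola} and {Eggs} are dropped, {Beer, Bread} and {Beer, Milk} are found infrequent, and {Bread, Diapers, Milk} is the only candidate 3-itemset.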
Example: Suppose we have the following dataset that has various transactions, and from this dataset, we
need to find the frequent itemsets and generate the association rules using the Apriori algorithm:
Solution:
Step-1: Candidate Generation C1 and F1:
o In the first step, we will create a table that contains support count (The frequency of each itemset
individually in the dataset) of each itemset in the given dataset. This table is called the Candidate
set or C1.
o Now, we take out all the itemsets that have a support count greater than or equal to the minimum
support (2). This gives us the table for the frequent itemset F1.
Since all the itemsets except E have a support count greater than or equal to the minimum support,
the itemset E will be removed.
o Again, we need to compare the C2 Support count with the minimum support count, and after
comparing, the itemset with less support count will be eliminated from the table C2. It will give us
the below table for F2.
o Now we will create the F3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So, the F3 will
have only one combination, i.e., {A, B, C}.
There are several candidate generation procedures, including the one used by the apriori-gen function.
1) Brute-Force Method
2) F(k−1) × F1 Method
3) F(k−1) × F(k−1) Method
Brute-Force Method:
– Considers every k-itemset as a potential candidate and then applies the candidate pruning step to
remove any unnecessary candidates, example shown in below figure.
– The number of candidate itemsets generated at level k is equal to C(d, k), where d is the total number
of items.
– Candidate pruning becomes expensive because a large number of itemsets must be examined.
– Given that the amount of computation needed for each candidate is O(k), the overall complexity of
this method is O( Σ_{k=1..d} k × C(d, k) ) = O(d × 2^(d−1)).
F(k−1) × F1 Method
Extends each frequent (k − 1)-itemset with other frequent items.
Every frequent k-itemset is composed of a frequent (k − 1)-itemset and a frequent 1-itemset.
All frequent k-itemsets are part of the candidate k-itemsets generated.
This method will produce O(|F(k−1)| × |F1|) candidate k-itemsets, where |Fj| is the number of
frequent j-itemsets.
The overall complexity of this step is O( Σ_k k × |F(k−1)| × |F1| ).
Drawbacks: Produces a large number of unnecessary candidates.
The above figure illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a
frequent item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.
Maximal Frequent Itemset:
Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal
frequent itemsets because their immediate supersets are infrequent. An itemset such as {a, d} is maximal
frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast,
{a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent. Maximal frequent
itemsets effectively provide a compact representation of frequent itemsets.
For example, since the node {b, c} is associated with transaction IDs 1,2, and 3, its support count is equal to
three. From the transactions given in this diagram, notice that every transaction that contains b also
contains c. Consequently, the support for {b} is identical to {b, c} and {b} should not be considered a closed
itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not
closed. On the other hand, {b, c} is a closed itemset because it does not have the same support count as
any of its supersets ({a, b, c}, {b, c, d}, {b, c, e}).
Closed Frequent Itemset: An itemset is a closed frequent itemset if it is closed and its support is greater
than or equal to minsup.
In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset
because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded nodes.
FP-Growth Algorithm:
It encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets
directly from this structure.
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
a) FP-Tree Representation:
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one
transaction at a time and mapping each transaction onto a path in the FP-tree.
Figure 6.24 shows a data set that contains ten transactions and five items. Each node in the tree contains
the label of an item along with a counter that shows the number of transactions mapped onto the given
path. Initially, the FP-tree contains only the root node represented by the null symbol.
The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in
Figure 6.24, a is the most frequent item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path
is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along
this path also has a frequency count equal to one. Although the first two transactions have an item in
common, which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a
result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first
transaction, null → a → b. Because of their overlapping path, the frequency count for node a is
incremented to two, while the frequency counts for the newly created nodes, c, d, and e are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-
tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 6.24.
The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common. In the best-case scenario, where all
the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-
case scenario happens when every transaction has a unique set of items. However, the physical storage
requirement for the FP-tree is higher because it requires additional space to store pointers between nodes
and counters for each item.
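A minimal sketch of FP-tree construction; it assumes the items in each transaction are already ordered by decreasing support, as produced by step 1 above, and it omits the node-link pointers used by the header table:

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}        # item label -> child FPNode

def build_fp_tree(transactions):
    """Map each transaction onto a path from the root, sharing common prefixes."""
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in t:                        # items already ordered by support
            child = node.children.get(item)
            if child is None:                 # new node for an unseen prefix
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1                  # overlapping prefixes just increment counts
            node = child
    return root

root = build_fp_tree([["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"]])
print(root.children["a"].count)   # 2: transactions 1 and 3 share the prefix 'a'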
b) Frequent Itemset Generation in FP-Growth Algorithm:
After the FP-tree is generated, to find the frequent itemsets we need to perform the following steps:
1) The Conditional Pattern Base is computed: the set of prefix paths (path labels) of all the paths in the frequent-pattern tree that lead to a node of the given item.
2) For each item, the Conditional Frequent Pattern Tree is then built by taking the elements that are common to all the paths in the Conditional Pattern Base of that item and calculating their support counts by summing the support counts of the paths in the Conditional Pattern Base.
3) From the Conditional Frequent Pattern Tree, the frequent patterns are generated by pairing each itemset of the Conditional Frequent Pattern Tree with the corresponding item.
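Step 1 can be illustrated by walking up from every node carrying a chosen item and recording its prefix path together with that node's count. This is a hedged sketch: the helper functions are hypothetical and assume the FP-tree built in the previous sketch.

def find_nodes(node, item, found):
    """Collect every node in the tree that is labeled with `item`."""
    for child in node.children.values():
        if child.item == item:
            found.append(child)
        find_nodes(child, item, found)
    return found

def conditional_pattern_base(root, item):
    """Prefix paths leading to `item`, each weighted by that node's count."""
    base = []
    for node in find_nodes(root, item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Assumes `tree` from the FP-tree construction sketch above.
print(conditional_pattern_base(tree, 'e'))   # e.g. [(['a', 'c', 'd'], 1), (['a', 'd'], 1)]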
UNIT – 6:
Cluster Analysis:
Overview- types of clustering, Basic K-means, K –means –additional issues, Bisecting k-means, k-means and
different types of clusters, strengths and weaknesses, k-means as an optimization problem. Agglomerative
Hierarchical clustering, basic agglomerative hierarchical clustering algorithm, specific techniques, DBSCAN:
Traditional density: centre-based approach, strengths and weaknesses (Tan)
Clustering:
Overview:
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Clustering for Understanding: Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). Clusters are potential classes, and cluster analysis is the study of techniques for automatically finding classes. The following are some examples:
Biology: Biologists have spent many years creating a taxonomy (hierarchical classification) of all living things, down to the level of family, genus, and species. For example, clustering has been used to find groups of genes that have similar functions.
Information Retrieval: The World Wide Web consists of billions of Web pages, and the results of a query
to a search engine can return thousands of pages. Clustering can be used to group these search results
into a small number of clusters, each of which captures a particular aspect of the query. For instance, a
query of "movie" might return Web pages grouped into categories such as reviews, trailers, stars, and
theatres.
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional analysis and
marketing activities.
We explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2)
types of clusters.
Cluster analysis groups data objects based only on information found in the data that describes the objects
and their relationships. The goal is that the objects within a group be similar (or related) to one another and
different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity)
within a group and the greater the difference between groups, the better or more distinct the clustering.
Consider Figure 8.1, which shows twenty points and three different ways of dividing them into clusters.
Figures 8.1(b) and 8.1(d) divide the data into two and six parts, respectively.
Classification is a supervised learning approach; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.
An entire collection of clusters is commonly referred to as a clustering. There are various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.
A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset. Taken individually, each collection of clusters in Figures
8.1 (b-d) is a partitional clustering.
If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested
clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union
of its children (subclusters), and the root of the tree is the cluster containing all the objects.
The clusterings shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster. There are
many situations in which a point could reasonably be placed in more than one cluster, and these situations
are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-
exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one
group (class). For instance, a person at a university can be both an enrolled student and an employee of the
university.
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets.
(Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and
1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object
must equal 1.)
Complete versus Partial: A complete clustering assigns every object to a cluster, whereas a partial clustering
does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-
defined groups. Many times, objects in the data set may represent noise, outliers, or "uninteresting
background."
Types of Clusters: The commonly used notions of a cluster include:
1. Well-separated
2. Prototype-based
3. Graph-based
4. Density-based
5. Shared-property (conceptual)
1) Well-separated:
A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This idealistic definition of a cluster is satisfied only when the data contains natural clusters that are quite far from each other.
This figure gives an example of well-separated clusters consisting of two groups of points in a two-dimensional space. The distance between any two points in different groups is larger than the distance between any two points within a group.
2) Prototype-based:
A cluster is a set of objects in which each object is closer to the prototype that defines the cluster than to the prototype of any other cluster.
For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster.
When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of the cluster.
For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters.
3) Graph-based:
If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component, i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group.
An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. Figure 8.2(c) shows an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can have trouble when noise is present since, as illustrated by the two spherical clusters of Figure 8.2(c), a small bridge of points can merge two distinct clusters.
4) Density-based: A cluster is a dense region of objects that is surrounded by a region of low density. Figure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the noise. Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d).
A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and
when noise and outliers are present
5) Shared-property (conceptual clusters): More generally, we can define a cluster as a set of objects that share some common property. This definition encompasses all the previous definitions of a cluster; for example, objects in a center-based cluster share the property that they are closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters.
Consider the clusters shown in the figure: a triangular area (cluster) is adjacent to a rectangular one, and there are two irregular circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering.
K-means:
Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoids.
K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space.
K-medoids defines a prototype in terms of a medoid.
In this section we will focus on K-means, which is one of the oldest and most widely used clustering algorithms.
The k-means clustering technique is simple, and we begin with a description of the basic algorithm. We
first choose k initial centroids, where k is a user specified parameter, namely, the number of clusters
desired. Each point is then assigned to the closest centroid and each collection of points assigned to a
centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the
cluster.
We repeat the assignment and update steps until no point changes clusters, or equivalently, until the
centroids remain the same.
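The basic algorithm just described can be written as a short NumPy sketch. This is an illustrative sketch only; the synthetic data and the value of k are made up, and the random initialization is one of several possible choices.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # choose k initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):          # centroids unchanged: stop
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative data: three loose groups of two-dimensional points.
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)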
The operation of k-means is illustrated in figure, which shows how, starting from three centroids, the final
clusters are found in four assignment-update steps. In these and other figures displaying k-means
clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the
points to those centroids. The centroids are indicated by the “+” symbol; all points belonging to the same cluster have the same marker shape.
In the first step, shown in fig(a), points are assigned to the initial centroids, which are all in the larger group
of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the
centroid is then updated. Again, the figure for each step shows the centroid at the beginning of the step
and the assignment of points to those centroids.
In the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in (b), (c), and (d) respectively, two of the centroids move to the two small groups of points at the bottom of the figures. When the k-means algorithm terminates in Figure (d), because no more changes occur, the centroids have identified the natural grouping of points.
To assign a point to the closest centroid, we need a proximity measure. Euclidean (L2) distance is often
used for data points in Euclidean space, while cosine similarity is more appropriate for documents.
However, there may be several types of proximity measures that are appropriate for a given type of data.
For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.
Step 4 of the k-means algorithm was stated rather generally as “recompute the centroid of each cluster”, since the centroid can vary depending on the proximity measure (e.g., Euclidean or Manhattan) for the data and the goal of the clustering.
The goal of the clustering is typically expressed by an objective function that depends on the proximities of
the points to one another or to the cluster centroids.
Consider data whose proximity measure is Euclidean distance. For our objective function, which measures
the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter. In
other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid,
and then compute the total sum of the squared errors.
Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one
with the smallest squared error since this means that the prototypes (centroids) of this clustering are a
better representation of the points in their cluster.
Formally, for K clusters C_1, ..., C_K with centroids c_1, ..., c_K,

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2

where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.
Given these assumptions, it can be shown that the centroid that minimizes the SSE of the cluster is the
mean. The centroid (mean) of the ith cluster is defined as

c_i = \frac{1}{m_i} \sum_{x \in C_i} x

where m_i is the number of objects in the ith cluster.
To illustrate, the centroid of a cluster containing the three two-dimensional points, (1,1), (2,3), and (6,2), is
((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
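The centroid calculation above can be checked directly; the short snippet below (illustrative only) also computes the SSE of that single cluster.

import numpy as np

points = np.array([[1, 1], [2, 3], [6, 2]], dtype=float)
centroid = points.mean(axis=0)                            # -> [3., 2.]
sse = np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)
print(centroid, sse)                                      # SSE = 5 + 2 + 9 = 16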
Document Data:
To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the
cosine similarity measure. Here we assume that the document data is represented as a document-term
matrix as shown in the diagram.
Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this
quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid
is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by

Total Cohesion = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{cosine}(x, c_i)
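As a small illustration of the cohesion objective, the following sketch computes the cohesion of one cluster of toy document vectors. The term counts are made up; total cohesion over a clustering would sum this quantity over all clusters.

import numpy as np

def cluster_cohesion(docs):
    """Sum of cosine similarities between each document and the cluster centroid (mean)."""
    centroid = docs.mean(axis=0)
    cos = docs @ centroid / (np.linalg.norm(docs, axis=1) * np.linalg.norm(centroid))
    return cos.sum()

docs = np.array([[2, 0, 1, 0], [1, 1, 0, 0], [3, 0, 2, 1]], dtype=float)  # toy term counts
print(cluster_cohesion(docs))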
The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points, K is the number of clusters, and n is the number of attributes.
The time requirements for K-means are also modest; basically linear in the number of data points. In particular, the time required is O(I × K × m × n), where I is the number of iterations required for convergence.
One of the problems with the basic k-means algorithm is that empty clusters can be obtained if no points
are allocated to a cluster during the assignment step. If this happens, then a strategy is needed to choose a
replacement centroid, since otherwise, the squared error will be larger than necessary. One approach is to
choose the point that is farthest away from any current centroid.
Outliers:
When outliers are present, the resulting cluster centroids(prototypes) may not be as representative as they
otherwise would be and thus, the SSE will be higher as well. Because of this, it is often useful to discover
outliers and eliminate them beforehand.
It is important to note, however, that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, for example, every point must be clustered.
There are a number of techniques for identifying outliers. If we use approaches that remove outliers before clustering, we avoid clustering points that will not cluster well. Alternatively, outliers can also be identified in a postprocessing step.
Two strategies that decrease the total SSE (sum of squared errors) by increasing the number of clusters are
the following:
1. Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster
with the largest standard deviation for one particular attribute.
2. Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE.
Two strategies that decrease the number of clusters, while trying to minimize the total SSE, are the following:
1. Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster
and reassigning the points to other clusters.
2. Merge two clusters: The clusters with the closest centroids are typically chosen, although we could also merge the two clusters that result in the smallest increase in total SSE.
Bisecting K-means:
The bisecting K-means algorithm is a straight forward extension of the basic K-means algorithm that is
based on a simple idea: to obtain k clusters, split the set of all points into two clusters, select one of these
clusters to split and so on, until K clusters have been produced. The details of bisecting K-means are given
in algorithm.
1. Initialize the list of clusters to contain the cluster consisting of all points.
2. Repeat.
3. Remove a cluster from the list of clusters.
4. {Perform several “trial” bisections of the chosen cluster}.
5. for i = 1 to number of trials do
6. Bisect the selected cluster using basic k-means
7. End for.
8. Select the two clusters from the bisection with the lowest SSE.
9. Add these two clusters to the list of clusters.
10. Until the list of clusters contains K clusters.
There are a number of different ways to choose which cluster to split. We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices result in different clusters. A sketch of the procedure is given below.
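This is a compact, hedged sketch of the bisecting procedure, reusing the kmeans function from the earlier K-means sketch. Splitting the cluster with the largest SSE and the number of trial bisections are illustrative parameter choices, not the only possible ones.

import numpy as np

def sse(points, centroid):
    return np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)

def bisecting_kmeans(X, K, n_trials=5):
    clusters = [X]                                   # start with one cluster of all points
    while len(clusters) < K:
        # Choose the cluster with the largest SSE to split.
        worst = max(range(len(clusters)),
                    key=lambda i: sse(clusters[i], clusters[i].mean(axis=0)))
        target = clusters.pop(worst)
        best_split, best_sse = None, np.inf
        for trial in range(n_trials):                # several "trial" bisections
            cents, labels = kmeans(target, 2, seed=trial)   # assumes the earlier kmeans sketch
            split = [target[labels == 0], target[labels == 1]]
            total = sum(sse(c, c.mean(axis=0)) for c in split if len(c))
            if total < best_sse:
                best_split, best_sse = split, total
        # Keep the bisection with the lowest SSE (non-empty halves only).
        clusters.extend([c for c in best_split if len(c)])
    return clusters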
Example:
To illustrate that bisecting K-means is less susceptible to initialization problems, we show in the figure how bisecting K-means finds four clusters in the data set shown earlier.
In iteration 1, two pairs of clusters are found. In iteration 2, the rightmost pair of clusters is split. In
iteration 3, the leftmost pair of clusters is split.
K-means and its variations have a number of limitations with respect to finding different types of clusters.
In particular, K-means has difficulty detecting the "natural" clusters, when clusters have non-spherical
shapes or widely different sizes or densities. This is illustrated by Figures 8.9, 8.10, and 8.11.
In Figure 8.9, K-means cannot find the three natural clusters because one of the clusters is much larger than
the other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined with a
portion of the larger cluster.
In Figure 8.10, K-means fails to find the three natural clusters because the two smaller clusters are much
denser than the larger cluster.
In Figure 8.11, K-means finds two clusters that mix portions of the two natural clusters because the shape
of the natural clusters is not globular.
The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of
clusters we are trying to find since it is minimized by globular clusters of equal size and density or by
clusters that are well separated. However, these limitations can be overcome, in some sense, if the user is
willing to accept a clustering that breaks the natural clusters into a number of subclusters.
Agglomerative Hierarchical Clustering:
There are two basic approaches for generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of
individual points remain.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which displays both the cluster-subcluster relationships and the order in which the clusters were merged (agglomerative view) or split (divisive view).
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains. This approach is expressed more formally in the following algorithm:
1. Compute the proximity matrix, if necessary.
2. Repeat.
3. Merge the closest two clusters.
4. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5. Until only one cluster remains.
The key operation of the above algorithm is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques. For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters.
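The basic procedure can be sketched directly, here using MIN (single link) as the cluster proximity. The function name and the random data are illustrative; the merge history it returns corresponds to what a dendrogram would display.

import numpy as np

def agglomerative_min(X):
    """Basic agglomerative clustering with single-link (MIN) proximity."""
    clusters = [[i] for i in range(len(X))]          # start: each point is its own cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # proximity (distance) matrix
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under MIN: the shortest edge between them.
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record which clusters merged, and at what distance
        clusters[a] = clusters[a] + clusters[b]       # merge the closest pair
        del clusters[b]
    return merges

print(agglomerative_min(np.random.rand(6, 2)))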
MIN (Single Link): It defines cluster proximity as the proximity between the closest two points that are in different clusters, or, using graph terms, the shortest edge between two nodes in different subsets of nodes.
Using graph terminology, if you start with all points as singleton clusters and add links between points one
at a time, shortest links first, then these single links combine the points into clusters.
The single link technique is good at handling non-elliptical shapes, but it is sensitive to noise and outliers.
MAX (Complete Link or CLIQUE): It defines cluster proximity as the proximity between the farthest two points in different clusters, or, using graph terms, the longest edge between two nodes in different subsets of nodes.
Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all points in it are completely linked, i.e., until they form a clique.
Group Average: It defines cluster proximity to be the average of the pairwise proximities (average length of edges) over all pairs of points from different clusters.
Thus, for group average, the cluster proximity proximity(C_i, C_j) of clusters C_i and C_j, which are of size m_i and m_j respectively, is expressed by the following equation:

proximity(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} \mathrm{proximity}(x, y)}{m_i \times m_j}
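For reference, these three proximity definitions are available directly in SciPy's hierarchical clustering routines; the short example below is a sketch assuming six random two-dimensional points.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(6, 2)                    # six random 2-D points
Z_single   = linkage(X, method='single')    # MIN (single link)
Z_complete = linkage(X, method='complete')  # MAX (complete link / CLIQUE)
Z_average  = linkage(X, method='average')   # Group Average
# dendrogram(Z_single) would plot the corresponding tree of merges.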
For example:
From Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which
they are joined into one cluster in the dendrogram. As another example, the distance between clusters
{3,6} and {2,5} is given by
dist({3,6}, {2,5}) = min (dist (3, 2), dist (6, 2), dist (3, 5), dist (6, 5))
= 0.15
Figure 8.16 shows the result of applying the single link technique to our example data set of six points.
Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, while Figure 8.16(b) shows the same information as a dendrogram.
For the complete link (MAX) technique, points 3 and 6 are again merged first, as with single link. However, {3,6} is merged with {4}, instead of {2,5} or {1}, because under the MAX measure the distance from {3,6} to {4} is smaller than its distance to {2,5} or to {1}.
The overall time required for a hierarchical clustering based on the above algorithm is O(m² log m).
DBSCAN: Traditional Density (Center-Based Approach):
In the center-based approach, density is estimated for a particular point by counting the number of points within a specified radius, Eps (epsilon), of that point. DBSCAN is based on this approach. The center-based approach to density allows us to classify a point as one of the following:
Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighbourhood around the point, as determined by the distance function and the parameter Eps, exceeds a threshold, MinPts.
Border points: A border point is not a core point, but falls within the neighbourhood of a core point.
Noise points: A noise point is any point that is neither a core point nor a border point.
For example, if MinPts is 7, then in the figure point A is a core point, B is a border point, and C is a noise point.
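A small NumPy sketch of this center-based classification into core, border, and noise points; the data, Eps, and MinPts values are made up for illustration.

import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' using the center-based density."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = D <= eps                        # Eps-neighbourhood (includes the point itself)
    is_core = neighbours.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append('core')
        elif neighbours[i, is_core].any():       # within Eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels

# Illustrative data: one dense blob plus a few scattered points.
X = np.vstack([np.random.randn(40, 2) * 0.3, np.random.uniform(-3, 3, (10, 2))])
print(classify_points(X, eps=0.5, min_pts=7))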
Time complexity: The basic time complexity of DBSCAN is O(m × time to find points in the Eps-neighbourhood), where m is the number of points. In the worst case, this complexity is O(m²). In low-dimensional spaces, however, data structures such as kd-trees allow efficient retrieval of all points within a given distance of a specified point, and the time complexity can be as low as O(m log m).
Space complexity: The space requirement of DBSCAN, even for high-dimensional data, is O(m), because it is only necessary to keep a small amount of data for each point, i.e., its cluster label and its identification as a core, border, or noise point.