Data mining module - New

The document outlines a course on data mining methodologies at Debre Tabor University, focusing on practical techniques and hands-on experience with software like Weka. It details course goals, required texts, teaching methods, assessment strategies, and policies regarding attendance and assignments. Additionally, it covers the fundamentals of data mining, including its processes, tasks, and major issues, while emphasizing the importance of knowledge discovery in databases.

DEBRE TABOR UNIVERSITY

Faculty of Technology and Related Sciences


Department of Computer Science
Course Description:

This course will provide participants with an understanding of fundamental data mining methodologies and with the ability to formulate and solve problems with them. Particular attention will be paid to practical, efficient and statistically sound techniques, capable of providing not only the requested discoveries, but also estimates of their utility. The lectures will be complemented with hands-on experience with data mining software, primarily Weka, to allow development of basic execution skills.

Course Goals or Learning Outcomes:

On completion of this course students should be able to:

 Explain the basic concepts of data mining and data warehousing
 Describe the central concepts of association rule mining
 Apply classification and prediction techniques
 Identify different cluster analysis tasks
 Explain the main clustering methods
 Design different data mining tasks

Required Texts:
1. Text Book: Data Mining Concepts and Techniques, Jiawei Han and Micheline
Kamber, Second Edition, Morgan Kaufmann Publishers, Elsevier
2. M. H. Dunham, 2003, Data Mining: Introductory and Advanced Topics, Pearson
Education, Delhi.
Summary of Teaching Learning Methods:

The learning–teaching methodology will be student-centered, with appropriate guidance from the instructor(s) during the students’ activities. There will be:
 Lectures
 Presentations
 Tutorials
 Reading assignments and group discussions



Summary of Assessment Methods:
The course will be assessed using different assessment methods, such as:
 Quizzes
 Reading assessments
 Assignments
 Presentations
 Mid exam and final exam

Policies

Attendance: It is compulsory to come to class on time and every time. If you are going to
miss more than a couple of classes during the term, you should not take this course.

Assignments: You should submit individual and group assignments on the due date; late assignments will not be entertained.

Tests/Quizzes: You should take all quizzes and assignments as scheduled. If you miss a quiz or an assignment without a valid reason, no make-up will be arranged for you.

Cheating/plagiarism: You must do your own work; do not copy answers or obtain them from someone else.

Assignments: Both individual and group assignments should be submitted on time. If a copied assignment is submitted, all students involved will score 0; in serious cases of copying they may be given negative marks.

Student Workload: Taking into consideration that 1 ECTS accounts for 27 hours of student work, the course Introduction to Data Mining and Warehousing (5 ECTS) amounts to 5 * 27 hr = 135 hours of student workload.

Grading policies
 Student grades and performance will be evaluated over the whole set of activities: tests (30%) + lab exam (10%) + quizzes (10%) + attendance (5%) + assignments (5%) + final exam (40%) = total (100%).



Chapter-1: Introduction to Data Mining and Data warehouse
1.1. Introduction to Data Mining

Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Thus, data mining should have been more appropriately named knowledge mining, which emphasizes mining knowledge from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are:

 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases
1.2. The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require
either sifting through an immense amount of material, or intelligently probing it to find
exactly where the value resides. Given databases of sufficient size and quality, data mining
technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours: Data mining automates the process of
finding predictive information in large databases. Questions that traditionally required
extensive hands-on analysis can now be answered directly from the data — quickly. A typical
example of a predictive problem is targeted marketing. Data mining uses data on past
promotional mailings to identify the targets most likely to maximize return on investment in
future mailings. Other predictive problems include forecasting bankruptcy and other forms of
default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns: Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern
discovery is the analysis of retail sales data to identify seemingly unrelated products that are
often purchased together. Other pattern discovery problems include detecting fraudulent
credit card transactions and identifying anomalous data that could represent data entry keying
errors.

1.3. Tasks of Data Mining


Data mining involves six common classes of tasks:
Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.



Association rule learning (Dependency modelling) – Searches for relationships between
variables. For example a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently
bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – is a data mining function that is used to determine the relationship between the dependent variable (target field) and one or more independent variables. The dependent variable is the one whose values you want to predict, whereas the independent variables are the variables on which your prediction is based.
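
As an illustration of the regression task described above, the short sketch below (Python with NumPy) fits a straight line relating a single independent variable to a dependent variable by least squares. The advertising-spend and sales figures are invented for the example and are not taken from the module.

```python
import numpy as np

# Hypothetical data: advertising spend (independent variable) and sales (dependent variable).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit sales ~ slope * spend + intercept by least squares (degree-1 polynomial fit).
slope, intercept = np.polyfit(spend, sales, deg=1)

# Use the fitted line to predict the dependent variable for a new observation.
new_spend = 6.0
predicted_sales = slope * new_spend + intercept
print(f"sales ~ {slope:.2f} * spend + {intercept:.2f}")
print(f"predicted sales for spend={new_spend}: {predicted_sales:.2f}")
```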

Summarization – providing a more compact representation of the data set, including visualization and report generation.

1.4 Architecture of Data Mining


A typical data mining system may have the following major components:

1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:



This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
4. User interface:
This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

1.5 Data Mining Process:


Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data. The general experimental procedure adapted to data-mining
problems involves the following steps:
1. State the problem and formulate the hypothesis
Most data-based modelling studies are performed in a particular application domain. Hence,
domain-specific knowledge and experience are usually necessary in order to come up with a
meaningful problem statement. Unfortunately, many application studies tend to focus on the
data-mining technique at the expense of a clear problem statement.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the control of
an expert (modeller): this approach is known as a designed experiment. The second
possibility is when the expert cannot influence the data- generation process: this is known as
the observational approach. It is very important to understand how data collection affects its
theoretical distribution, since such a priori knowledge can be very useful for modelling and,
later, for the final interpretation of results.
3. Pre-processing the data
In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data pre-processing usually includes at least two common tasks:
i. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement errors,
coding and recording errors, and, sometimes, are natural, abnormal values. Such non



representative samples can seriously affect the model produced later. There are two strategies
for dealing with outliers:
a) Detect and eventually remove outliers as a part of the pre-processing phase, or
b) Develop robust modelling methods that are insensitive to outliers.
ii. Scaling, encoding, and selecting features – Data pre-processing includes several
steps such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and the other with the range [−100, 1000] will not have the same weights in
the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight for
further analysis.
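
To make the scaling step above concrete, the following minimal sketch (Python with NumPy) rescales two features with very different ranges — one in roughly [0, 1] and one in roughly [−100, 1000], as in the example — to a common [0, 1] scale. The data values themselves are invented for illustration.

```python
import numpy as np

# Toy data: column 0 lies roughly in [0, 1], column 1 in [-100, 1000] (values invented).
X = np.array([[0.20,  -40.0],
              [0.90,  650.0],
              [0.50,  120.0],
              [0.05,  980.0]])

# Min-max scaling: bring every feature (column) into the range [0, 1]
# so that no single feature dominates the subsequent mining step.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```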
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task
in this phase. This process is not straightforward; usually, in practice, the implementation is
based on several models, and selecting the best one is an additional task.
5. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision making. Modern data-mining
methods are expected to yield highly accurate results using high dimensional models. The
problem of interpreting these models, also very important, is considered a separate task, with
specific techniques to validate the results. A user does not want hundreds of pages of numeric
results.

1.6 Classification of Data mining Systems:


The data mining system can be classified according to the following criteria:
 Database Technology

 Statistics

 Machine Learning



 Information Science

 Visualization

 Other Disciplines

Some Other Classification Criteria:


 Classification according to kind of databases mined

 Classification according to kind of knowledge mined

 Classification according to kinds of techniques utilized

 Classification according to applications adapted

Classification according to kind of databases mined


We can classify data mining systems according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

Classification according to kind of knowledge mined


We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis

Classification according to kinds of techniques utilized

We can classify the data mining system according to the kinds of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.

Classification according to applications adapted

We can classify the data mining system according to the applications adapted. These applications are as follows:
 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail

1.7 Major Issues in Data Mining:

 Mining different kinds of knowledge in databases: The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.


 Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
 Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results: Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
 Handling noisy or incomplete data: Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not there, then the accuracy of the discovered patterns will be poor.
 Pattern evaluation: This refers to the interestingness of the problem. The patterns discovered should be interesting, and not merely represent common knowledge or lack novelty (originality).
 Efficiency and scalability of data mining algorithms: In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged.

1.8 Knowledge Discovery in Databases (KDD)

 Some people treat data mining as the same as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:
 Data Cleaning - In this step the noise and inconsistent data are removed.
 Data Integration - In this step multiple data sources are combined.
 Data Selection - In this step data relevant to the analysis task are retrieved from the database.


 Data Transformation - In this step data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
 Data Mining - In this step intelligent methods are applied in order to extract data patterns.
 Pattern Evaluation - In this step, data patterns are evaluated.
 Knowledge Presentation - In this step, knowledge is represented.

1.9 Data Warehouse:

 A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
 Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
 Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
 Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
 Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

1.9.1 Data Warehouse Design Process:

 A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
 The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.
 The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modelling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.


 In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

1.9.2 Three Tier Data Warehouse Architecture

(A figure illustrating the three-tier data warehouse architecture appears here in the original document.)

 Tier-1: The bottom tier is a data warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
 Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
 A relational OLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
 A multidimensional OLAP (MOLAP) model is a special-purpose server that directly implements multidimensional data and operations.
 Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

1.9.3 Data Warehouse Models:

 There are three data warehouse models.
 1. Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.


 It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
 2. Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
 Depending on the source of data, data marts can be categorized as independent or dependent.
 Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
 Dependent data marts are sourced directly from enterprise data warehouses.
 3. Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

1.9.4 Meta Data Repository:

 Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.

1.10 OLAP (Online Analytical Processing):

 OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.
 OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.
 OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
 OLAP consists of three basic analytical operations:
 Consolidation (Roll-Up)
 Drill-Down
 Slicing and Dicing
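
As a rough illustration of these operations, the sketch below uses pandas on a tiny, invented sales table. The column names and figures are assumptions made only for the example and are not taken from the module.

```python
import pandas as pd

# A toy fact table with dimensions (year, quarter, city) and a sales measure (values invented).
sales = pd.DataFrame({
    "year":    [2015, 2015, 2015, 2015, 2016, 2016],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "city":    ["Bahir Dar", "Bahir Dar", "Gondar", "Gondar", "Bahir Dar", "Gondar"],
    "sales":   [100, 120, 80, 90, 130, 95],
})

# Roll-up (consolidation): aggregate from (year, quarter, city) up to the year level.
rollup = sales.groupby("year")["sales"].sum()

# Drill-down: move back to a finer level of detail, e.g. year and quarter.
drilldown = sales.groupby(["year", "quarter"])["sales"].sum()

# Slice: fix one dimension (year = 2015) and keep the remaining ones.
slice_2015 = sales[sales["year"] == 2015]

# Dice: restrict two or more dimensions at once.
dice = sales[(sales["year"] == 2015) & (sales["city"] == "Gondar")]

print(rollup, drilldown, slice_2015, dice, sep="\n\n")
```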
1.11 DATA PRE-PROCESSING

1.11.1 Data Integration:

 Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
 Data integration systems are formally defined as a triple <G, S, M>, where G is the global schema, S is the heterogeneous set of source schemas, and M is the mapping between queries over the source and global schemas.

1.11.2 Issues in Data Integration:

 1. Schema integration and object matching: How can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
 2. Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
 3. Detection and resolution of data value conflicts: For the same real-world entity, attribute values from different sources may differ.

1.11.3 Data Transformation:

 In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
 Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
 Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
 Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
 Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
 Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

1.11.4 Data Reduction:

 Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following:


 Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
 Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
 Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
 Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
 Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels.

Chapter-2: Mining Frequent Patterns, Associations, and Correlations

2.1 Association Rule Mining:

 Association rule mining is a popular and well researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness.
 Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
 Problem Definition: The problem of association rule mining is defined as follows:
 Let I = {i1, i2, i3, ..., in} be a set of binary attributes called items.
 Let D = {t1, t2, t3, ..., tn} be a set of transactions called the database.
 Each transaction in D has a unique transaction ID and contains a subset of the items in I.
 A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅.
 The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side, LHS) and consequent (right-hand side, RHS) of the rule, respectively.
 Example: To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table. (The transaction table itself appears as a figure in the original document.) An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if butter and bread are bought, customers also buy milk.


2.1.1 Important Concepts of Association Rule Mining:

 The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).
 The confidence of a rule is defined as conf(X => Y) = supp(X U Y) / supp(X).
 For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Confidence can be interpreted as an estimate of the conditional probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
 The lift of a rule is defined as lift(X => Y) = supp(X U Y) / (supp(X) * supp(Y)), or the ratio of the observed support to that expected if X and Y were independent. The rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 * 0.4) = 1.25.
 The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)).
 The rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions.
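
The small script below computes these measures for the supermarket example. Because the transaction table is not reproduced in this module, the five transactions used here are an assumption, chosen to be consistent with the figures quoted in the text (supp({milk, bread, butter}) = 0.2, conf({butter, bread} => {milk}) = 1.0, lift({milk, bread} => {butter}) = 1.25, conviction = 1.2).

```python
# Five transactions over I = {milk, bread, butter, beer}; this particular table is an
# assumption, chosen so that the measures match the numbers quoted in the text.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    """Support: fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    """Lift: observed support divided by the support expected under independence."""
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    """Conviction of the rule lhs => rhs."""
    return (1 - supp(rhs)) / (1 - conf(lhs, rhs))

print(supp({"milk", "bread", "butter"}))          # 0.2
print(conf({"butter", "bread"}, {"milk"}))         # 1.0
print(lift({"milk", "bread"}, {"butter"}))         # 1.25
print(conviction({"milk", "bread"}, {"butter"}))   # 1.2
```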
2.2 Frequent Pattern Mining:

 Frequent pattern mining can be classified in various ways, based on the following criteria:

 1. Based on the completeness of patterns to be mined:
 We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold.
 We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets and so on.


 2. Based on the levels of abstraction involved in the rule set:
 Some methods for association rule mining can find rules at differing levels of abstraction.
 For example, suppose that a set of association rules mined includes the following rules, where X is a variable representing a customer:
 buys(X, "computer") => buys(X, "HP printer")   (1)
 buys(X, "laptop computer") => buys(X, "HP printer")   (2)
 In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer").

 3. Based on the number of data dimensions involved in the rule:
 If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:
 buys(X, "computer") => buys(X, "antivirus software")
 If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:
 age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "high resolution TV")

 4. Based on the types of values handled in the rule:
 If a rule involves associations between the presence or absence of items, it is a Boolean association rule.
 If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule.

 5. Based on the kinds of rules to be mined:
 Frequent pattern analysis can generate various kinds of rules and other interesting relationships.
 Association rule mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets.
 The discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.

 6. Based on the kinds of patterns to be mined:
 Many kinds of frequent patterns can be mined from different kinds of data sets.
 Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.
 For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card.


 Structured pattern mining searches for frequent substructures in a structured data set.
 Single items are the simplest form of structure.
 Each element of an itemset may contain a subsequence, a subtree, and so on.
 Therefore, structured pattern mining can be considered as the most general form of frequent pattern mining.

2.3 Efficient Frequent Itemset Mining Methods:

2.3.1 Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
 The finding of each Lk requires one full scan of the database.
 A two-step process, consisting of a join action and a prune action, is followed in Apriori.

 Example: (The example transaction database and the candidate and frequent itemset tables appear as figures in the original document.)
 There are nine transactions in this database, that is, |D| = 9.
 Steps:
 1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.


 2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
 3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join of L1 with L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
 4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
 5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
 6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.
 7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
 8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4.

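The following is a compact, illustrative implementation of the level-wise search just described, together with the straightforward rule-generation step discussed in the next section. It is only a minimal sketch, not the optimized algorithm from the textbook, and the nine-transaction database is assumed for illustration (the original example table is shown only as a figure).

```python
from itertools import combinations

# An assumed nine-transaction database over items I1..I5, used only for illustration.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2          # minimum support count
min_conf = 0.7       # minimum confidence used when generating rules

def support_count(itemset):
    return sum(itemset <= t for t in D)

# L1: the frequent 1-itemsets, found by one scan of the database.
items = {i for t in D for i in t}
L = [{frozenset([i]) for i in items if support_count({i}) >= min_sup}]

# Level-wise search: use the frequent k-itemsets to generate candidate (k+1)-itemsets.
while L[-1]:
    prev = L[-1]
    # Join step: combine frequent k-itemsets whose union has exactly k+1 items.
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    # Prune step: every k-subset of a candidate must itself be frequent (Apriori property).
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    # Scan D to keep only the candidates that satisfy minimum support.
    L.append({c for c in candidates if support_count(c) >= min_sup})
L.pop()  # drop the final, empty level

frequent = [s for level in L for s in level]
print("Frequent itemsets:", [sorted(s) for s in frequent])

# Rule generation: for each frequent itemset, emit rules lhs => rhs with enough confidence.
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = support_count(itemset) / support_count(lhs)
            if conf >= min_conf:
                print(sorted(lhs), "=>", sorted(rhs), f"(conf={conf:.2f})")
```

With min sup = 2 this sketch reproduces the behaviour described in the steps above: L3 contains {I1, I2, I3} and {I1, I2, I5}, and the candidate 4-itemset is pruned.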


2.4.2 Generating Association Rules from Frequent Itemsets:

 Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them.
 Example: (The worked example appears as a figure in the original document.)

2.5 Mining Multilevel Association Rules:

 For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data at those levels.
 Strong associations discovered at high levels of abstraction may represent common sense knowledge.
 Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
 Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
 Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
 In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found.
 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.
 (The concept hierarchy used in this example appears as a figure in the original document.) The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with level 0 at the root node for all.
 Here, Level 1 includes computer, software, printer & camera, and computer accessory.
 Level 2 includes laptop computer, desktop computer, office software, and antivirus software.


 Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
 Level 4 is the most specific abstraction level of this hierarchy.

2.5.1 Approaches for Mining Multilevel Association Rules:

 1. Uniform Minimum Support:
 The same minimum support threshold is used when mining at each level of abstraction.
 When a uniform minimum support threshold is used, the search procedure is simplified.
 The method is also simple in that users are required to specify only one minimum support threshold.
 The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction.
 If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.

 2. Reduced Minimum Support:
 Each level of abstraction has its own minimum support threshold.
 The deeper the level of abstraction, the smaller the corresponding threshold is.


 For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, "computer", "laptop computer", and "desktop computer" are all considered frequent.

 3. Group-Based Minimum Support:
 Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based or group-based minimum support thresholds when mining multilevel rules.
 For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.

2.6 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:

 A single-dimensional or intra-dimensional association rule contains a single distinct predicate (e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the rule. For example:
 buys(X, "digital camera") => buys(X, "HP printer")
 Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules. For example:
 age(X, "20...29") ^ occupation(X, "student") => buys(X, "laptop")
 The above rule contains three predicates (age, occupation, and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated predicates.
 Multidimensional association rules with no repeated predicates are called interdimensional association rules.
 We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated:
 age(X, "20...29") ^ buys(X, "laptop") => buys(X, "HP printer")

2.7 Mining Quantitative Association Rules

 Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules mined.


 In this section, we focus specifically on how to mine quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side of the rule. That is:
 Aquan1 ^ Aquan2 => Acat
 where Aquan1 and Aquan2 are tests on quantitative attribute intervals and Acat tests a categorical attribute from the task-relevant data.
 Such rules have been referred to as two-dimensional quantitative association rules, because they contain two quantitative dimensions.
 For instance, suppose you are curious about the association relationship between pairs of quantitative attributes, like customer age and income, and the type of television (such as high-definition TV, i.e., HDTV) that customers like to buy.
 An example of such a 2-D quantitative association rule is age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "HDTV").

2.8 From Association Mining to Correlation Analysis

 A correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form
 A => B [support, confidence, correlation]
 That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.
 Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A U B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets.
 The lift between the occurrence of A and B can be measured by computing lift(A, B) = P(A U B) / (P(A) P(B)).
 If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
 If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.


 If the resulting value is equal to 1,
then A and B are independent and
there is no correlation between
them.

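As a tiny numerical illustration of the lift interpretation above (the probabilities used here are invented):

```python
# Invented probabilities for two itemsets A and B.
p_a, p_b, p_a_and_b = 0.60, 0.75, 0.40

lift = p_a_and_b / (p_a * p_b)
if lift > 1:
    print(f"lift = {lift:.2f} > 1: A and B are positively correlated")
elif lift < 1:
    print(f"lift = {lift:.2f} < 1: A and B are negatively correlated")
else:
    print(f"lift = {lift:.2f} = 1: A and B are independent")
```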


Chapter-3: Classification and Prediction

3.1 Classification and Prediction

 Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical (discrete, unordered) labels, while prediction models continuous-valued functions.
 For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures of potential customers on computer equipment given their income and occupation.
 A predictor is constructed that predicts a continuous-valued function, or ordered value, as opposed to a categorical label. Regression analysis is a statistical methodology that is most often used for numeric prediction. Many classification and prediction methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large disk-resident data.

3.1.1 Issues Regarding Classification and Prediction:

 1. Preparing the Data for Classification and Prediction: The following pre-processing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
 (i) Data cleaning: This refers to the pre-processing of data in order to remove or reduce noise (by applying smoothing techniques) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
 Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.
 (ii) Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
 For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis.
 A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.


 Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task.
 Such analysis can help improve classification efficiency and scalability.
 (iii) Data Transformation and Reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step.
 Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1 to +1 or 0 to 1.
 The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes.
 For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city.
 Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.

3.1.2 Comparing Classification and Prediction Methods:

 Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
 Speed: This refers to the computational costs involved in generating and using the given classifier or predictor.
 Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
 Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
 Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.

3.2 Classification by Decision Tree Induction:

 Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure where:
 Each internal node denotes a test on an attribute.


 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.
 (An example decision tree appears as a figure in the original document.)

 Advantages of decision tree induction:
 The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
 Decision trees can handle high-dimensional data.
 Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans.
 The learning and classification steps of decision tree induction are simple and fast.
 In general, decision tree classifiers have good accuracy.
 Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
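The brief sketch below illustrates decision tree induction using scikit-learn's DecisionTreeClassifier on a toy loan-risk data set. The attribute values and class labels are invented, and scikit-learn is used here only as a convenient reference implementation; it is not part of the module itself.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy class-labeled training tuples: [age, income] -> loan decision ("safe" or "risky").
# All values are invented for illustration.
X_train = [[25, 20], [30, 45], [45, 80], [50, 30], [35, 60], [23, 15], [40, 90], [60, 25]]
y_train = ["risky", "safe", "safe", "risky", "safe", "risky", "safe", "risky"]

# Learn a decision tree from the class-labeled tuples.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Each internal node tests an attribute; each leaf holds a class label.
print(export_text(tree, feature_names=["age", "income"]))

# Classify a new, previously unseen tuple.
print(tree.predict([[28, 70]]))
```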


3.3 Bayesian Classification:

 Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem.

3.3.1 Bayes' Theorem:

 Let X be a data tuple. In Bayesian terms, X is considered "evidence" and it is described by measurements made on a set of n attributes.
 Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
 For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X.
 P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
 Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X):
 P(H|X) = P(X|H) P(H) / P(X)

3.3.2 Naïve Bayesian Classification:

 The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
 P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.
 Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,
 P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,
 P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci)
 We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples.
 For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by
 g(x, μ, σ) = (1 / (sqrt(2π) σ)) exp( -(x - μ)² / (2σ²) ), so that P(xk|Ci) = g(xk, μCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
 P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i.
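The minimal sketch below implements the calculation just described for categorical attributes only; a hypothetical weather/play data set is used, and the continuous (Gaussian) case and Laplace smoothing are left out for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: (outlook, windy) -> play. Values are invented.
D = [
    (("sunny", "false"), "no"),     (("sunny", "true"), "no"),
    (("overcast", "false"), "yes"), (("rainy", "false"), "yes"),
    (("rainy", "true"), "no"),      (("overcast", "true"), "yes"),
    (("sunny", "false"), "yes"),    (("rainy", "false"), "yes"),
]

# Class counts for the priors P(Ci), and per-attribute value counts for P(xk | Ci).
class_counts = Counter(label for _, label in D)
value_counts = defaultdict(Counter)   # key: (class, attribute index) -> Counter of values
for attrs, label in D:
    for k, value in enumerate(attrs):
        value_counts[(label, k)][value] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        # P(Ci) times the product of P(xk | Ci) over all attributes
        # (the class-conditional independence assumption).
        score = n_c / len(D)
        for k, value in enumerate(x):
            score *= value_counts[(c, k)][value] / n_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(("sunny", "true")))   # predicted class label for an unseen tuple
```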




3.4 A Multilayer Feed-Forward Neural Network:

 The backpropagation algorithm performs learning on a multilayer feed-forward neural network.
 It iteratively learns a set of weights for prediction of the class label of tuples.
 A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer.
 (A figure of a multilayer feed-forward neural network appears here in the original document.)
 The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer known as a hidden layer.
 The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers is arbitrary.
 The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given tuples.

3.4.1 Classification by Backpropagation:

 Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it.


 During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
 Neural network learning is also referred to as connectionist learning due to the connections between units.
 Neural networks involve long training times and are therefore more suitable for applications where this is feasible.
 Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known target value.
 The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction).
 For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer; hence the name backpropagation.
 Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.

 Advantages:
 Advantages of neural networks include their high tolerance of noisy data as well as their ability to classify patterns on which they have not been trained.
 They can be used when you may have little knowledge of the relationships between attributes and classes.
 They are well-suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.
 They have been successful on a wide array of real-world data, including handwritten character recognition, pathology and laboratory medicine, and training a computer to pronounce English text.
 Neural network algorithms are inherently parallel; parallelization techniques can be used to speed up the computation process.

 Process:
 Initialize the weights: The weights in the network are initialized to small random numbers, ranging for example from -1.0 to 1.0, or -0.5 to 0.5. Each unit has a bias associated with it. The biases are similarly initialized to small random numbers.
 Each training tuple, X, is processed by the following steps.
 Propagate the inputs forward: First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij.


 Propagate the inputs forward: First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs. Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer. Each connection has a weight. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed:
 Ij = Σi (wij * Oi) + Ɵj
 where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and Ɵj is the bias of the unit, which acts as a threshold in that it serves to vary the activity of the unit. Each unit in the hidden and output layers takes its net input and then applies an activation function to it.
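The following Python sketch puts the initialization and forward-propagation steps together for a small fully connected network. The layer sizes, the use of the logistic (sigmoid) activation function, and every function and variable name are illustrative assumptions, not details taken from the notes.

import math
import random

def init_layer(n_inputs, n_units, low=-0.5, high=0.5):
    # Each unit gets one small random weight per input plus a small random bias.
    return [{"weights": [random.uniform(low, high) for _ in range(n_inputs)],
             "bias": random.uniform(low, high)}
            for _ in range(n_units)]

def sigmoid(x):
    # A commonly used activation function (assumed here); the notes only say that
    # each hidden/output unit applies an activation function to its net input.
    return 1.0 / (1.0 + math.exp(-x))

def forward_propagate(layers, tuple_x):
    # Inputs pass through the input layer unchanged; every later unit computes
    # its net input Ij = sum_i(wij * Oi) + bias_j and then its output Oj = f(Ij).
    outputs = list(tuple_x)
    for layer in layers:
        next_outputs = []
        for unit in layer:
            net = sum(w * o for w, o in zip(unit["weights"], outputs)) + unit["bias"]
            next_outputs.append(sigmoid(net))
        outputs = next_outputs
    return outputs  # prediction emitted by the output layer

# Example: 3 input units, one hidden layer with 2 units, and 1 output unit.
random.seed(1)
network = [init_layer(3, 2), init_layer(2, 1)]
print(forward_propagate(network, [0.4, 0.7, 0.1]))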
 3.5 k-Nearest-Neighbour Classifier:
 Nearest-neighbour classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it.
 The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbours of the unknown tuple.
 Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
 dist(X1, X2) = sqrt( Σi=1..n (x1i - x2i)^2 )
 In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count.
 Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing
 v' = (v - minA) / (maxA - minA)
 where minA and maxA are the minimum and maximum values of attribute A.
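As a small illustration, the two formulas above translate directly into Python; the function names and the sample values below are only illustrative.

import math

def euclidean_distance(x1, x2):
    # dist(X1, X2): square root of the accumulated squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def min_max_normalize(v, min_a, max_a):
    # Map a value v of attribute A into [0, 1]: v' = (v - minA) / (maxA - minA).
    return (v - min_a) / (max_a - min_a)

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
print(min_max_normalize(73000, 12000, 98000))      # a hypothetical income value scaled to [0, 1]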
 For k-nearest-neighbour classification, the unknown tuple is assigned the most common class among its k nearest neighbours.
 When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
 Nearest-neighbour classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple.
 In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbours of the unknown tuple.
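Putting these pieces together, a minimal k-nearest-neighbour classifier and predictor might look as follows; the training data, the choice of k, and all helper names are assumptions made for illustration.

import math
from collections import Counter

def euclidean_distance(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def k_nearest(train, unknown, k):
    # Return the k (tuple, label) pairs whose tuples are closest to the unknown tuple.
    ranked = sorted(train, key=lambda pair: euclidean_distance(pair[0], unknown))
    return ranked[:k]

def knn_classify(train, unknown, k=3):
    # Classification: assign the most common class among the k nearest neighbours.
    labels = [label for _, label in k_nearest(train, unknown, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_predict(train, unknown, k=3):
    # Prediction: return the average of the real-valued labels of the k neighbours.
    values = [label for _, label in k_nearest(train, unknown, k)]
    return sum(values) / len(values)

# Hypothetical training data: ([attribute values], class label)
train_c = [([1.0, 1.1], "A"), ([0.9, 1.0], "A"), ([4.0, 4.2], "B"), ([4.1, 3.9], "B")]
print(knn_classify(train_c, [1.2, 0.8], k=3))  # -> "A"

# Hypothetical training data with real-valued labels, used for prediction
train_r = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0)]
print(knn_predict(train_r, [2.2], k=2))        # average of the two closest labels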
 Chapter-4: CLUSTERING
 4.1 Cluster Analysis:
 The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.
 4.1.1 Applications:
 Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing.
 In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
 In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
 Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost.
 Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.
 Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
 4.1.2 Typical Requirements of Clustering in Data Mining:
 Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
 Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
 Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
 High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
 Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.
 4.2 Major Clustering Methods:
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Methods
 4.2.1 Partitioning Methods:
 A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:
 Each group must contain at least one object, and
 Each object must belong to exactly one group.
 A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.
 4.2.2 Hierarchical Methods:
 A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.
 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
 Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
 There are two approaches to improving the quality of hierarchical clustering:
 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into micro clusters, and then performing macro clustering on the micro clusters using another clustering method such as iterative relocation.
 4.2.3 Density-Based Methods:
 Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighbourhood exceeds some threshold; that is, for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
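The density criterion just described, that the neighbourhood of a given radius around each point in a cluster must contain at least a minimum number of points, can be sketched in Python as follows; the parameter names eps and min_pts and the sample data are assumptions for illustration, not part of the notes.

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def neighbours(points, p, eps):
    # All points whose distance from p does not exceed the radius eps.
    return [q for q in points if euclidean(p, q) <= eps]

def is_dense(points, p, eps, min_pts):
    # Density test: the eps-neighbourhood of p must contain at least min_pts points.
    return len(neighbours(points, p, eps)) >= min_pts

data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (5.0, 5.0)]
print(is_dense(data, (1.0, 1.0), eps=0.5, min_pts=3))  # True: part of a dense region
print(is_dense(data, (5.0, 5.0), eps=0.5, min_pts=3))  # False: likely noise or an outlier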
 4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
 All of the clustering operations are performed on the grid structure, i.e., on the quantized space. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.
 4.2.5 Model-Based Methods:
 Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods.
 4.3 Tasks in Data Mining:
1. Clustering High-Dimensional Data
2. Constraint-Based Clustering
 4.3.1 Clustering High-Dimensional Data:
 It is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or dimensions.
 For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions.
 Clustering high-dimensional data is challenging due to the curse of dimensionality.
 Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
 CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.
 Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.
 4.3.2 Constraint-Based Clustering:
 It is a clustering approach that performs clustering by incorporating user-specified or application-oriented constraints.
 A constraint expresses a user's expectation or describes properties of the desired clustering results, and provides an effective means for communicating with the clustering process.
 Various kinds of constraints can be specified, either by a user or as per application requirements.
 Spatial clustering may be performed in the presence of obstacles or under user-specified constraints. In addition, semi-supervised clustering uses pair-wise constraints in order to improve the quality of the resulting clustering.
 4.4 Classical Partitioning Methods:
 The most well-known and commonly used partitioning methods are
1. The k-Means Method
2. The k-Medoids Method
 4.4.1 Centroid-Based Technique: The k-Means Method:
 The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or centre of gravity. The k-means algorithm proceeds as follows.
 First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.
 For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
 It then computes the new mean for each cluster.
 This process iterates until the criterion function converges.
 Typically, the square-error criterion is used, defined as
 E = Σi=1..k Σp∈Ci |p - mi|^2
 where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
 The k-means partitioning algorithm: The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
 (Figure: Clustering of a set of objects based on the k-means method.)
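The k-means steps just described (arbitrary initial means, assignment of each object to the closest mean, recomputation of the means, and iteration until the criterion stabilizes) can be sketched in Python as follows; the sample data, the value of k, and the exact stopping test are assumptions for illustration.

import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    # The cluster mean (centroid): attribute-wise average of the member objects.
    return tuple(sum(vals) / len(vals) for vals in zip(*points))

def k_means(data, k, max_iters=100):
    means = random.sample(data, k)  # randomly select k objects as the initial cluster means
    clusters = []
    for _ in range(max_iters):
        # Assign each object to the cluster whose mean it is most similar (closest) to.
        clusters = [[] for _ in range(k)]
        for p in data:
            idx = min(range(k), key=lambda i: euclidean(p, means[i]))
            clusters[idx].append(p)
        # Recompute the new mean of each (non-empty) cluster.
        new_means = [mean_point(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # stop when the means no longer change (a common convergence test)
            break
        means = new_means
    return means, clusters

random.seed(0)
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
means, clusters = k_means(data, k=2)
print(means)
print(clusters)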
 4.4.2 The k-Medoids Method:
 The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the square-error function.
 Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar.
 The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as
 E = Σj=1..k Σp∈Cj |p - oj|
 where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
 The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering is improved.
 This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
 To determine whether a non-representative object, orandom, is a good replacement for a current representative object, oj, the following four cases are examined for each of the non-representative objects.
 Case 1: p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects, oi, i≠j, then p is reassigned to oi.
 Case 2: p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
 Case 3: p currently belongs to representative object oi, i≠j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
 Case 4: p currently belongs to representative object oi, i≠j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
 The k-Medoids Algorithm: The k-medoids algorithm for partitioning based on medoid or central objects.
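To make the case analysis concrete, the sketch below evaluates a single candidate swap: every object is assigned to its closest representative before and after replacing oj with orandom, and the swap is kept only if the total absolute-error cost improves. This is a simplified view of one k-medoids (PAM-style) step under assumed names and sample data, not the full algorithm from the notes.

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_cost(data, medoids):
    # Absolute-error criterion: sum of distances from each object to its closest representative.
    return sum(min(euclidean(p, m) for m in medoids) for p in data)

def try_swap(data, medoids, o_j, o_random):
    # Replace representative o_j by the non-representative candidate o_random
    # only if doing so lowers the total cost of the clustering.
    candidate = [o_random if m == o_j else m for m in medoids]
    if total_cost(data, candidate) < total_cost(data, medoids):
        return candidate, True
    return medoids, False

data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9), (4.0, 4.0)]
medoids = [(4.0, 4.0), (8.0, 8.0)]  # arbitrarily chosen initial representative objects
medoids, swapped = try_swap(data, medoids, (4.0, 4.0), (1.0, 1.0))
print(medoids, swapped)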
 The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
 4.5 Hierarchical Clustering Methods:
 A hierarchical clustering method works by grouping data objects into a tree of clusters.
 The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
 Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.
 4.5.1 Agglomerative hierarchical clustering:
 This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
 Most hierarchical clustering methods belong to this category. They differ only in their definition of inter-cluster similarity.
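As an illustration of this bottom-up strategy, the following Python sketch repeatedly merges the two closest clusters until a desired number of clusters remains. Single linkage (the smallest distance between members of two clusters) is used as the inter-cluster similarity; the linkage choice, the termination condition, and the sample data are all assumptions made for illustration.

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # Inter-cluster distance: the smallest distance between any pair of members.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(data, target_clusters=2):
    clusters = [[p] for p in data]  # start by placing each object in its own (atomic) cluster
    while len(clusters) > target_clusters:  # termination condition (assumed)
        # Find the pair of clusters that are closest to one another ...
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]))
        # ... and merge them into a single, larger cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

data = [(1.0, 1.0), (1.1, 0.9), (1.2, 1.1), (5.0, 5.0), (5.1, 4.9)]
print(agglomerative(data, target_clusters=2))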
 4.5.2 Divisive hierarchical clustering:
 This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
 It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.