
BUSINESS DATA MINING

Session 14

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Advanced Classification Methods

1. Artificial Neural Networks


As per session plan
• Classification: Advanced Methods

o Artificial Neural Networks

2 Dr. D. Dutta BITS Pilani, Pilani Campus


Neural Networks

▪ The human brain is made up of billions of cells called neurons.


▪ The cell body of the neuron gets signals sent to it from dendrites
(input).
▪ The neuron then essentially sums up the inputs to it, and will
subsequently fire, or output a signal, if a certain level is reached.
▪ This output goes through the axon and across to other cells by way
of connections called synapses.
▪ These synapses (output) are then connected to the dendrites of
other cells (input).
▪ Eventually the information sent will reach a destination where a
reaction may occur.

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Neural Networks

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Artificial Neural Networks

▪ Artificial Neural Networks (ANNs) are computational models inspired


by the structure and functioning of the human brain.
▪ These networks are a fundamental component of machine learning
and are designed to recognize patterns, make decisions, and
perform tasks that typically require human intelligence.
▪ The basic building block of an ANN is the artificial neuron. Neurons
are organized into layers: the input layer receives data, hidden
layers process information, and the output layer produces the
network's final output.
▪ Neurons in one layer are connected to neurons in the next layer
through weighted connections. Each connection has an associated
weight that influences the strength of the signal. Learning in ANNs
involves adjusting these weights based on training data.
▪ Artificial Neural Networks continue to evolve with advancements
such as deep learning, which involves training deep neural networks
with multiple hidden layers.
5 Dr. D. Dutta BITS Pilani, Pilani Campus
Artificial Neural Networks

6 Dr. D. Dutta BITS Pilani, Pilani Campus


McCulloch-Pitts model

[Figure: McCulloch-Pitts neuron with inputs x1 … xn, weights w1 … wn, bias b, summation Z, activation f(Z) and output y]
7 Dr. D. Dutta BITS Pilani, Pilani Campus
McCulloch-Pitts model
Warren McCulloch (a psychiatrist and neuroanatomist) and Walter
Pitts (a mathematician) started the modern era of neural networks
with their 1943 model:

• spikes are interpreted as spike rates;
• synaptic strengths are translated as synaptic weights;
• it does not require learning or adaptation;
• excitation means a positive product between the incoming spike
rate and the corresponding synaptic weight; excitatory connections
have positive weights, and all excitatory connections into a particular
neuron have the same weight;
• inhibition means a negative product between the incoming spike
rate and the corresponding synaptic weight; inhibitory connections
have negative weights.
8 Dr. D. Dutta BITS Pilani, Pilani Campus
McCulloch-Pitts model

Each neuron has a fixed threshold such that if the net input to the
neuron is greater than the threshold, the neuron fires.
The threshold is set such that the inhibition is absolute. This means any
non-zero inhibitory input will prevent the neuron from firing.

9 Dr. D. Dutta BITS Pilani, Pilani Campus


McCulloch-Pitts model

[Figure: McCulloch-Pitts neuron with n excitatory inputs x1 … xn, each with weight w, and m inhibitory inputs xn+1 … xn+m, each with weight −p; summation Z, activation f(Z) and output y]
10 Dr. D. Dutta BITS Pilani, Pilani Campus
McCulloch-Pitts model

A neuron that receives n signals through excitatory connections and m
signals through inhibitory connections is shown in the figure.
Each excitatory path has weight w > 0 and each inhibitory path has
weight −p.
F(Z) = 1 if Z ≥ θ and F(Z) = 0 if Z < θ
where Z is the total input signal received by the neuron and θ is the
threshold.
The condition for absolute inhibition is θ > nw − p.
The neuron fires if it receives k or more excitatory signals and no
inhibitory inputs, i.e. kw ≥ θ > (k − 1)w.

11 Dr. D. Dutta BITS Pilani, Pilani Campus


McCulloch-Pitts model
Truth table for AND:

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

The AND function is realised with both weights equal to 1 and threshold θ = 2:
F(Z) = 1 if Z ≥ 2 and F(Z) = 0 if Z < 2

[Figure: two-input McCulloch-Pitts neuron with weights 1 and 1, summation Z, activation f(Z) and output y]

12 Dr. D. Dutta BITS Pilani, Pilani Campus
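A minimal Python sketch of this AND unit (illustrative only; the function name is mine, not from the slides):

def mcculloch_pitts_and(x1, x2, theta=2):
    # McCulloch-Pitts neuron for AND: both weights are 1, threshold theta = 2
    z = 1 * x1 + 1 * x2              # weighted sum of the two excitatory inputs
    return 1 if z >= theta else 0    # fire only when the threshold is reached

# Reproduces the truth table above
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcculloch_pitts_and(x1, x2))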


Perceptron

Proposed by Rosenblatt, 1958.


Basic building block of nearly all ANNs.
Z =I=1Nwixi and y = fN(Z)
where wi is the weight at the inputs xi where Z is the node (summation)
output and fN is a nonlinear operator.
y = sign(z) i.e y=f(z)=1 if z>0 and f(z)=-1 if z<=0
The output of the network thus is either +1 or -1, depending on the
input.

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

[Figure: perceptron with inputs x1 … xn, weights w1 … wn, summation Z, activation f(Z) and output y]
14 Dr. D. Dutta BITS Pilani, Pilani Campus
Perceptron

[Figure: two-input perceptron with weights w1 and w2, a bias weight w0 on a constant input x0 = 1, summation Z, activation f(Z) and output y]
15 Dr. D. Dutta BITS Pilani, Pilani Campus
Perceptron

The network can now be used for a classification task: it can decide
whether an input pattern belongs to one of two classes.
If the total input is positive, the pattern will be assigned to class +1; if
the total input is negative, the sample will be assigned to class -1.
The separation between the two classes in this case is a straight line,
given by the equation:
w1x1 + w2x2 + w0= 0
The single layer network represents
a linear discriminant function.
x2 = −(w1/w2) x1 − w0/w2
The weights determine the slope of the line
and the bias determines the `offset’
(in the example below the bias weight w0 is written as θ).

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

How do we learn the weights and biases in the network?


Iterative procedures exist: the new value of a weight is computed by
adding a correction to the old value
wi(t + 1) = wi(t) + Δwi(t)
θ(t + 1) = θ(t) + Δθ(t)
How do we compute Δwi(t) and Δθ(t) in order to classify the
learning patterns correctly?

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

We have a set of learning samples consisting of an input vector x and a
desired output d(x).
For a classification task, d(x) is usually +1 or -1.
The perceptron learning rule is very simple and can be stated as follows:
1. Start with random weights for the connections.
2. Select an input vector x from the set of training samples.
3. If y ≠ d(x) (the perceptron gives an incorrect response), modify all
connections wi according to Δwi = d(x)xi (and the bias according to
Δθ = d(x)).
4. Go back to 2.

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron: Example

Initially:w1 = 1;w2 = 2; = -2. x0 is always 1.

For sample A, x = (0.5; 1.5) and target value d(x) = +1


The network output is +1, so no weights are adjusted.

For sample B, x = (-0.5; 0.5)and target value d(x) = -1


The network output is -1, so no weights are adjusted.

For sample C, x = (0.5; 0.5)and target value d(x) = +1


The network output is -1, so weights are to be adjusted.
According to the perceptron learning rule, the weight changes are:  w1
= 0.5,  w2 = 0.5, = 1.
The new weights are now: w1 = 1.5, w2 = 2.5,  = -1, and sample C is
classified correctly.

19 Dr. D. Dutta BITS Pilani, Pilani Campus
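A small Python sketch that replays this example with the perceptron learning rule (helper names are mine; the samples and initial values are taken from the slide):

def predict(w1, w2, theta, x1, x2):
    # Perceptron output: +1 if w1*x1 + w2*x2 + theta > 0, otherwise -1
    return 1 if w1 * x1 + w2 * x2 + theta > 0 else -1

w1, w2, theta = 1.0, 2.0, -2.0          # initial weights and bias
samples = [((0.5, 1.5), +1),            # sample A
           ((-0.5, 0.5), -1),           # sample B
           ((0.5, 0.5), +1)]            # sample C

for (x1, x2), d in samples:
    y = predict(w1, w2, theta, x1, x2)
    if y != d:                          # incorrect response: apply the learning rule
        w1 += d * x1                    # delta w_i = d(x) * x_i
        w2 += d * x2
        theta += d                      # bias update with x0 = 1
    print((x1, x2), "output:", y, "-> w1, w2, theta =", w1, w2, theta)

# Only sample C triggers an update, ending with w1 = 1.5, w2 = 2.5, theta = -1.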


Perceptron: Example

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Architecture

21 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Notation
▪ i an input unit;
▪ h a hidden unit;
▪ o an output unit;
▪ xp the pth input pattern vector;
▪ xpj the jth element of the pth input pattern vector;
▪ sp the input to a set of neurons when input pattern vector p is
clamped (i.e., presented to the network); often: the input of the
network by clamping input pattern vector p;
▪ dp the desired output of the network when input pattern vector p was
input to the network ;
▪ dpj the jth element of the desired output of the network when input
pattern vector p was input to the network;
▪ yp the activation values of the network when input pattern vector p
was input to the network;

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Notation
• ypj the activation values of element j of the network when input
pattern vector p was input to the network;
• W the matrix of connection weights;
• Wj the weights of the connections which feed into unit j;
• wjk the weight of the connection from unit j to unit k;
• Fj the activation function associated with unit j;
• γjk the learning rate associated with weight wjk;
• θ the biases to the units;
• θj the bias input to unit j;
• Uj the threshold of unit j in Fj ;
• Ep the error in the output of the network when input pattern vector p
is input;
• E the total error.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Feed
Forward
A feed-forward network has a layered structure.
Each layer consists of units that receive their input from units in the
layer directly below (left) and send their output to units in the layer
directly above (right).
The Ni inputs are fed into the first layer of Nh,1 hidden units
The input units are merely 'fan-out' units; no processing takes place in
these units.
The activation of a hidden unit is a function Fi of the weighted inputs
plus a bias

The output of the hidden units is distributed over the next layer of Nh,2
hidden units, until the last layer of hidden units, of which the outputs
are fed into a layer of No output units

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Delta
Rule
The activation is a differentiable function of the total input, given
by
ypk = F(spk)      (1)
in which
spk = Σj wjk ypj + θk      (2)
we must set
Δpwjk = −γ ∂Ep/∂wjk      (3)
The error measure Ep is defined as the total quadratic error for
pattern p at the output units:
Ep = ½ Σo (dpo − ypo)²      (4)
where dpo is the desired output for unit o when pattern p is
clamped.
Here the total error is E = Σp Ep.
We can write
∂Ep/∂wjk = (∂Ep/∂spk)(∂spk/∂wjk)      (5)
From equation 2 we get
∂spk/∂wjk = ypj      (6)


25 Dr. D. Dutta BITS Pilani, Pilani Campus
Back Propagation ANN: Delta
Rule
When we define
δpk = −∂Ep/∂spk      (7)
we make the weight changes according to:
Δpwjk = γ δpk ypj      (8)
The trick is to figure out what δpk should be for each unit k in the
network. The interesting result, which we now derive, is that there is
a simple recursive computation of these δ's which can be
implemented by propagating error signals backward through the
network.
To compute δpk we apply the chain rule to write this partial derivative as
the product of two factors, one factor reflecting the change in error
as a function of the output of the unit and one reflecting the change
in the output as a function of changes in the input:
δpk = −(∂Ep/∂ypk)(∂ypk/∂spk)      (9)
26 Dr. D. Dutta BITS Pilani, Pilani Campus
Back Propagation ANN: Delta
Rule
▪ By Equation 1,
∂ypk/∂spk = F′(spk)      (10)
▪ To compute the first factor of equation 9, we consider two cases.
First, assume that unit k is an output unit k = o of the network. In this
case, it follows from the definition of Ep that
∂Ep/∂ypo = −(dpo − ypo)      (11)
▪ Substituting 11 and equation 10 in equation 9, we get
δpo = (dpo − ypo) F′o(spo)      (12)
▪ Secondly, if k is not an output unit but a hidden unit k = h, we do not
readily know the contribution of the unit to the output error of the
network. However, the error measure can be written as a function of
the net inputs from hidden to output layer.

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Delta
Rule
We use the chain rule to write
∂Ep/∂yph = Σo (∂Ep/∂spo)(∂spo/∂yph) = −Σo δpo who      (13)
Substituting this in equation 9 yields
δph = F′(sph) Σo δpo who      (14)
Equations 12 and 14 give a recursive procedure for computing the δ's
for all units in the network, which are then used to compute the
weight changes according to equation 8.

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Understanding
Step one
• A learning pattern is clamped
• The activation values are propagated to the output units
• The actual network output is compared with the desired output values
• We usually end up with an error in each of the output units. Let's call this
error eo for a particular output unit o. We have to bring eo to zero.
• We know from the delta rule that, in order to reduce this error, we have
to adapt the incoming weights of the output unit according to Δwho = γ δo yh
Step two
• But it alone is not enough: when we only apply this rule, the weights
from input to hidden units are never changed.
• In order to adapt the weights from input to hidden units, we again want
to apply the delta rule.
• we do not have a value for  for the hidden units

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Understanding
• This is solved by the chain rule
• Distribute the error of an output unit o to all the hidden units that it is
connected to, weighted by the corresponding connection
• Put differently, a hidden unit h receives a delta from each output unit o
equal to the delta of that output unit weighted with (= multiplied by)
the weight of the connection between those units.
• The derivative of the activation function of the hidden unit, F′, has to be
applied to this delta before the back-propagation process can continue.

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Working with Back
Propagation ANN
The application of the generalized delta rule involves two phases
First phase
The input x is presented and propagated forward through the network
to compute the output values ypo for each output unit
This output is compared with its desired value do, resulting in an error
signal po for each output unit.
Second phase
Involves a backward pass through the network during which the error
signal is passed to each unit in the network and appropriate weight
changes are calculated.

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

The weight of a connection is adjusted by an amount proportional to the
product of an error signal δk on the unit k receiving the input and the
output yj of the unit j sending this signal along the connection:
Δwjk = γ δk yj
If the unit is an output unit, the error signal is given by
δo = (do − yo) F′(so)
Take as the activation function F the 'sigmoid' function
y = F(s) = 1/(1+e-s)

32 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

In this case the derivative is equal to
F′(s) = y (1 − y)
so the error signal for an output unit can be written as
δo = (do − yo) yo (1 − yo)

33 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

The error signal for a hidden unit is determined recursively in terms of
the error signals of the units to which it directly connects and the
weights of those connections. For the sigmoid activation function:
δh = F′(sh) Σo δo who = yh (1 − yh) Σo δo who

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
Given
x1 = 1st input pattern vector = [1 1 1]
d1 = desired output = 1
γ = learning rate = 0.1
F = activation function = bipolar sigmoid: F(x) = 2/(1+e-x) – 1
F’(x) = 0.5 (1 + F(x)) (1 – F(x))
θ = biases to the units = 0.1

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
[Figure: 3–2–1 network. Input units i1, i2, i3 (all inputs equal to 1) feed hidden unit h1 with weight 0.2 each and hidden unit h2 with weight 0.1 each; h1 and h2 feed output unit o1 with weight 0.1 each; the biases are 0.1.]

36 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
sh1=0.2*1+ 0.2*1+ 0.2*1+0.1=0.7
sh2=0.1*1+ 0.1*1+ 0.1*1+0.1=0.4

yh1=F(sh1)=2/(1+e-0.7) –1=0.336
yh2=F(sh2)=2/(1+e-0.4) –1=0.1974
so1 =0.1* 0.336 + 0.1* 0.1974 =0.0536
yo1=F(so1)=2/(1+e-0.0536) –1=0.02678
F’(so1)=0.5(1+ F(so1)) (1- F(so1))

δo1 = (1 − 0.02678) × 0.5 (1 + 0.02678) (1 − 0.02678) = 0.4863

37 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
δh1 = F’(sh1) × δo1 × wh1o1 = 0.5 (1 + 0.336) (1 − 0.336) × 0.4863 × 0.1 = 0.0216
δh2 = F’(sh2) × δo1 × wh2o1 = 0.5 (1 + 0.1974) (1 − 0.1974) × 0.4863 × 0.1 ≈ 0.0234

Δwi1h1 = 0.1 × 0.0216 × 1 = 0.00216
Δwi2h1 = 0.1 × 0.0216 × 1 = 0.00216
Δwi3h1 = 0.1 × 0.0216 × 1 = 0.00216
Δwi1h2 = 0.1 × 0.0234 × 1 = 0.00234
Δwi2h2 = 0.1 × 0.0234 × 1 = 0.00234
Δwi3h2 = 0.1 × 0.0234 × 1 = 0.00234

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
Δwh1o1 = 0.1 × 0.4863 × 0.336 ≈ 0.0163
Δwh2o1 = 0.1 × 0.4863 × 0.1974 ≈ 0.0096

wi1h1(new) = wi1h1(old) + Δwi1h1 = 0.2 + 0.00216 = 0.20216
wi2h1(new) = wi2h1(old) + Δwi2h1 = 0.2 + 0.00216 = 0.20216
wi3h1(new) = wi3h1(old) + Δwi3h1 = 0.2 + 0.00216 = 0.20216
wi1h2(new) = wi1h2(old) + Δwi1h2 = 0.1 + 0.00234 = 0.10234
wi2h2(new) = wi2h2(old) + Δwi2h2 = 0.1 + 0.00234 = 0.10234
wi3h2(new) = wi3h2(old) + Δwi3h2 = 0.1 + 0.00234 = 0.10234
wh1o1(new) = wh1o1(old) + Δwh1o1 = 0.1 + 0.0163 = 0.1163
wh2o1(new) = wh2o1(old) + Δwh2o1 = 0.1 + 0.0096 = 0.1096

39 Dr. D. Dutta BITS Pilani, Pilani Campus
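A compact NumPy sketch of this single training step (variable names are mine; it reproduces the forward pass, the deltas and the weight changes above and, like the slides, does not add a bias to the output unit's input so1):

import numpy as np

def f(x):                                     # bipolar sigmoid
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def f_prime(y):                               # derivative written via the output: 0.5(1+F)(1-F)
    return 0.5 * (1.0 + y) * (1.0 - y)

gamma = 0.1                                   # learning rate
x = np.array([1.0, 1.0, 1.0])                 # input pattern x1
d = 1.0                                       # desired output
W_ih = np.array([[0.2, 0.1],                  # input-to-hidden weights (columns: h1, h2)
                 [0.2, 0.1],
                 [0.2, 0.1]])
W_ho = np.array([0.1, 0.1])                   # hidden-to-output weights
b_h = np.array([0.1, 0.1])                    # hidden-unit biases

s_h = x @ W_ih + b_h                          # [0.7, 0.4]
y_h = f(s_h)                                  # [0.336, 0.1974]
s_o = y_h @ W_ho                              # 0.0534 (no output bias, as on the slide)
y_o = f(s_o)                                  # about 0.0268

delta_o = (d - y_o) * f_prime(y_o)            # about 0.4863
delta_h = f_prime(y_h) * delta_o * W_ho       # about [0.0216, 0.0234]

W_ho = W_ho + gamma * delta_o * y_h           # hidden-to-output changes ~ [0.0163, 0.0096]
W_ih = W_ih + gamma * np.outer(x, delta_h)    # input-to-hidden changes ~ 0.00216 / 0.00234
print(y_o, delta_o, delta_h, W_ho, W_ih, sep="\n")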


Back Propagation ANN:
Example
[Figure: the same 3–2–1 network after the first update, with the new weights listed on the previous slide and the biases unchanged at 0.1.]

40 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

41 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 15

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Data Mining Applications

1. Recommendation Systems.

2. Sentiment Analysis

3. Fraud Detection
As per session plan
• Data Mining Applications

o Recommendation Systems

o Sentiment Analysis

o Fraud Detection
2 Dr. D. Dutta BITS Pilani, Pilani Campus
Recommender System

A recommender system is a type of information filtering


system designed to suggest relevant items to users
based on their preferences, behavior, or other contextual
information. The goal is to help users discover products,
services, or content that they may find interesting or
useful. Recommender systems are commonly used in a
variety of applications, including e-commerce, streaming
services, and social media.

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Recommender System

Makes recommendations!
E.g. music, books and movies
In eCommerce recommend items
In eLearning recommend content
In search and navigation recommend links

Use items as generic term for what is recommended

Help people (customers, users) make decisions


Recommendation is based on preferences
– Of an individual
– Of a group or community

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Recommender
Systems
Content-Based Filtering (CBF) – It is a recommender
system approach that suggests items similar to those a
user has previously shown interest in. This method
focuses on the characteristics of items and matches
them to a user’s known preferences or past behavior. It’s
commonly used when a recommender system has
detailed information about item attributes and can map
these to user profiles or preferences.
If a user has previously shown interest in thriller novels, the
system will recommend other thriller novels with similar
themes, keywords, or authors.

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Recommender
Systems
Content-Based Filtering (CBF)
Finding Preferred Items Based on Past Behavior
A machine learning model learns user preferences by gathering
feedback and creating a profile based on user behavior.
• Types of Feedback:
• Explicit Feedback: Users rate items directly.
• Implicit Feedback: The system observes user actions without
explicit ratings, such as:
• Tracking clicks and categorizing the type of page visited
(e.g., browsing a product page).
• Measuring time spent on activities like browsing certain
pages.
• Recommendation as a Search Process:
• The user's profile acts as a "query," while the available items act
as "documents" to match with the profile.
6 Dr. D. Dutta BITS Pilani, Pilani Campus
Types of Recommender
Systems
Collaborative Filtering (CF) – match `like-minded’ people
– E.g. if two people have similar ‘taste’ they can
recommend items to each other
– Based on the idea that users who have shown similar
preferences in the past will likely continue to share
similar tastes.
– Types include user-based (recommends items liked
by similar users) and item-based (recommends items
similar to ones the user liked).

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Recommender
Systems
Collaborative Filtering (CF)
Recording User Interests Through Ratings
Users interact with items, and their interests are captured through two
types of feedback:
• Explicit Feedback: For instance, buying or directly rating an item.
• Implicit Feedback: Includes indirect signals like time spent
browsing or the number of mouse clicks on an item.
Recommending Items Using Nearest Neighbor Matching
The system identifies users with similar interests through nearest
neighbor matching. It then recommends items that these similar
users (neighbors) have rated highly but that you haven't interacted
with yet. The user can further refine these recommendations by
rating the suggested items.

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Recommender
Systems
Collaborative Filtering (CF)
Example of CF: an M×N matrix with M users and N items (an empty cell is
an unrated item)

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Recommender
Systems
Collaborative Filtering (CF)
Can construct a vector for each user (where 0 implies an item is
unrated)
– E.g. for Alex: <1,0,5,4>
– E.g. for Peter <0,0,4,5>
On average, user vectors are sparse, since users rate (or buy) only a
few items.
Vector similarity or correlation can be used to find nearest neighbor.
– E.g. Alex closest to Peter, then to George.

10 Dr. D. Dutta BITS Pilani, Pilani Campus
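A short Python sketch of this nearest-neighbour step, using cosine similarity on the two user vectors quoted above (George's vector is not shown on the slide, so it is left out):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two rating vectors (0 = unrated item)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

alex  = np.array([1, 0, 5, 4])
peter = np.array([0, 0, 4, 5])

print(round(cosine_similarity(alex, peter), 2))   # about 0.96: Alex and Peter are close neighbours
# Items that Peter has rated highly but Alex has not yet rated would then be recommended to Alex.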


Types of Recommender
Systems
Hybrid Recommender Systems:
• Combines collaborative and content-based methods to
provide more accurate recommendations.
• Addresses some limitations of each method individually,
such as the cold-start problem (lack of user or item
data).

11 Dr. D. Dutta BITS Pilani, Pilani Campus


Case Study: Recommender
Systems
Customers Who Bought This Item Also Bought:
Item-to-Item Collaborative Filtering
• Instead of finding similar customers, the system finds similar items.
• It tracks pairs of items frequently bought together by the same
customers to determine item similarity.
• These item similarities are computed offline in advance for all items.
• Based on this information, the system can quickly recommend
similar or popular items bought by others, such as books.
• This recommendation process is fast and done in real time.

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Use of Data Mining in building
Recommender Systems
1. Pattern Recognition and User Behavior Analysis
• Data mining helps analyze user behavior patterns, such as browsing
history, purchase history, ratings, and other interactions with items.
• By identifying these patterns, a recommender system can predict
which items a user might be interested in, based on their past
behavior or on similar behaviors among other users.
2. Segmentation and Clustering
• Clustering techniques, a part of data mining, allow grouping users or
items based on similarities in characteristics or behavior.
• For example, users with similar interests can be clustered together,
making it easier to recommend items that are popular or highly rated
within the same group.
• Similarly, items can be grouped based on common features, which
helps in item-to-item recommendations.

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Use of Data Mining in building
Recommender Systems
3. Association Rule Mining
• Association rule mining helps identify relationships between different
items. For instance, it can find associations such as, "Users who
bought item A also bought item B."
• This information is useful for generating "frequently bought together"
recommendations, as it allows the system to suggest items often
purchased in combination.
4. Classification and Prediction Models
• Classification algorithms, such as decision trees, k-nearest
neighbors (KNN), and neural networks, help predict whether a user
will like a certain item.
• Predictive modeling, a core part of data mining, can help
recommend new items to users by predicting their interests based
on past interactions or similar profiles.

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Use of Data Mining in building
Recommender Systems
5. Anomaly Detection for Outlier Recommendations
• Anomaly detection techniques can be used to identify unique
preferences or niche interests of a user. This enables the system to
occasionally recommend items that might be outside the user’s
typical choices but are still relevant.
• These "outlier" recommendations can help users discover new
interests.
6. Evaluating and Improving Recommendation Quality
• Data mining helps in evaluating the performance of different
recommendation algorithms through techniques like cross-
validation, error measurement, and statistical analysis.
• By mining feedback data (such as clicks, ratings, and purchase
behavior after recommendation), the system can improve the
recommendation model and adapt to new trends.

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a


natural language processing (NLP) technique used to
determine the emotional tone behind a body of text. It
aims to identify and categorize opinions expressed in text
into various sentiments, such as positive, negative,
neutral, or even more specific emotions like happiness,
anger, or sadness.

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

1. Data Collection
• Purpose: Gather the text data that will be analyzed for sentiment.
• Sources: Data can come from various sources like social media
posts, customer reviews, survey responses, emails, or any other text
where people express opinions.
• Methods: Data can be collected manually or through APIs (e.g.,
Twitter API, web scraping tools).

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

2. Text Preprocessing
• Purpose: Clean and prepare the text data for analysis. Raw text often
contains unnecessary elements that can affect analysis accuracy.
• Steps:
• Lowercasing: Convert text to lowercase to maintain consistency.
• Removing Punctuation and Special Characters: Clean up
punctuation, emojis, URLs, and other symbols that may not
contribute to sentiment.
• Removing Stop Words: Remove common words (like “the,” “is,”
“and”) that don’t carry sentiment.
• Tokenization: Split the text into individual words or tokens.
• Stemming and Lemmatization: Reduce words to their root forms
(e.g., “playing” to “play”) for better generalization.
• Outcome: A clean and structured dataset that can be used for analysis.

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

3. Feature Extraction
• Purpose: Transform text data into numerical features that can be
processed by machine learning or deep learning models.
• Common Techniques:
• Bag-of-Words (BoW): Converts text into a set of word counts or
frequencies without considering word order.
• TF-IDF (Term Frequency-Inverse Document Frequency):
Weighs words by their importance across multiple documents.
• Word Embeddings: Techniques like Word2Vec, GloVe, or BERT
that convert words into dense vectors capturing semantic
meaning.
• Outcome: Numerical representation of text that can be fed into a
model.

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

4. Sentiment Classification
• Purpose: Classify the sentiment of each text input as positive,
negative, or neutral (or sometimes with more specific emotions).
• Methods:
• Rule-Based Approach: Uses predefined lists of positive and
negative words (sentiment lexicons) to determine sentiment.
Simple but less flexible.
• Machine Learning Models: Trains classifiers like Naive Bayes,
Support Vector Machine (SVM), or logistic regression on labeled
sentiment data to learn patterns.
• Deep Learning Models: Uses advanced models like LSTMs,
RNNs, or transformer-based models (BERT, GPT) for better
accuracy, especially with context-dependent or complex
sentiment.
• Outcome: Predicted sentiment labels for each input text.
20 Dr. D. Dutta BITS Pilani, Pilani Campus
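A hedged illustration of steps 3 and 4 with scikit-learn, assuming TF-IDF features and a Naive Bayes classifier (the tiny labelled dataset is made up for the example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up labelled reviews (1 = positive, 0 = negative)
texts  = ["great service and friendly staff",
          "terrible experience, very slow delivery",
          "loved the product, works perfectly",
          "awful quality, waste of money"]
labels = [1, 0, 1, 0]

# TF-IDF turns each review into a weighted term vector (step 3);
# multinomial Naive Bayes learns the sentiment classes from those vectors (step 4).
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved it, great and friendly"]))   # -> [1] (positive)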
Steps of Sentiment Analysis

5. Evaluation of Model Performance


• Purpose: Assess the accuracy and effectiveness of the sentiment
analysis model.
• Metrics:
• Accuracy: Measures the percentage of correct predictions.
• Precision, Recall, and F1 Score: Used to evaluate the model's
performance on each class, especially for unbalanced datasets.
• Confusion Matrix: Shows how often the model classifies each
sentiment correctly or incorrectly.
• Cross-Validation: Splits the dataset into training and testing
sets to check model reliability.
• Outcome: Metrics that show how well the model performs on
sentiment classification.

21 Dr. D. Dutta BITS Pilani, Pilani Campus
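A minimal sketch of this evaluation step with scikit-learn metrics, assuming y_true and y_pred come from a held-out test set (the values below are made up):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # hypothetical test-set labels (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]      # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))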


Steps of Sentiment Analysis

6. Fine-Tuning and Optimization (if needed)


• Purpose: Improve the model’s performance by tweaking certain
parameters or adjusting the preprocessing and feature extraction
steps.
• Approaches:
• Hyperparameter Tuning: Adjust parameters in the machine
learning model (e.g., learning rate, number of layers).
• Data Augmentation: Increase the dataset size with paraphrased
or synthetically generated examples.
• Handling Imbalanced Classes: Adjust class weights or
resample data if certain sentiments (e.g., negative) are
underrepresented.
• Outcome: A more accurate and robust model for sentiment analysis.

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

7. Interpretation and Visualization


• Purpose: Present the results in a way that stakeholders can easily
understand.
• Techniques:
• Sentiment Scores: Assign scores to show sentiment intensity.
• Charts and Graphs: Use bar charts, pie charts, or word clouds
to visualize the distribution of sentiments.
• Trend Analysis: Track sentiment trends over time (useful for
monitoring changes in customer opinions).
• Outcome: Actionable insights that can be used for decision-making.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Sentiment Analysis

8. Deployment
• Purpose: Integrate the sentiment analysis model into a production
environment so it can analyze new data in real time.
• Methods:
• API Integration: Deploy as a REST API so other applications
can send data for sentiment analysis.
• Automation: Automate data collection, analysis, and reporting
for continuous monitoring.
• Outcome: A fully functional sentiment analysis tool that provides
real-time insights.

24 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
1. Data Collection and Extraction
• Data mining techniques can be used to collect and extract text
data from various sources such as social media platforms, online
reviews, blogs, and forums.
• Web scraping and API calls are common data mining methods to
gather large volumes of text data, which are then used as the input
for sentiment analysis.

25 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
2. Data Cleaning and Preprocessing
• Text data often includes noise, such as punctuation, special
characters, irrelevant words, or symbols. Data mining helps clean
this raw data to make it usable.
• Tokenization, stop word removal, and stemming/lemmatization
are data preprocessing techniques that fall under data mining and
are essential to prepare text data for sentiment analysis models.

26 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
3. Feature Extraction and Selection
• Data mining helps transform unstructured text data into structured
features that can be used by machine learning models.
• Feature extraction techniques, such as Bag-of-Words (BoW), TF-
IDF, and word embeddings (Word2Vec, GloVe, BERT), are used to
create numerical representations of the text data, capturing
important words, phrases, and semantic meaning.
• Feature selection is also applied to select the most relevant words
or phrases that best represent the sentiment in the text, which
reduces dimensionality and improves model performance.

27 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
4. Pattern Discovery
• Data mining helps identify patterns and relationships in the text
data that correlate with positive, negative, or neutral sentiments.
• For instance, association rule mining can find frequently co-
occurring words or phrases that indicate certain sentiments. This is
useful for understanding common phrases or expressions that carry
sentiment within specific domains (e.g., “great service” in customer
reviews).

28 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
5. Sentiment Lexicon Creation
• Data mining techniques are used to build sentiment lexicons—
dictionaries of positive, negative, and neutral words and phrases.
• Lexicons can be created manually or automatically mined from data
by analyzing the frequency and context in which words appear,
thereby assigning them sentiment scores.
• This is useful in rule-based sentiment analysis, where words from
the lexicon are matched against text to determine sentiment.

29 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
6. Training Machine Learning Models
• Machine learning, which is a branch of data mining, is used
extensively in sentiment analysis to train models that classify text
as positive, negative, or neutral.
• Supervised machine learning techniques (e.g., Naive Bayes,
Support Vector Machines, logistic regression) are trained on labeled
sentiment datasets.
• Deep learning models like LSTM, GRU, and transformer-based
models (BERT, GPT) are also trained to learn complex patterns in
text, which are particularly useful for capturing context-sensitive
sentiment.

30 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
7. Clustering and Topic Modeling
• Unsupervised data mining techniques, like clustering and topic
modeling (e.g., LDA - Latent Dirichlet Allocation), are used to group
similar pieces of text and identify prevalent topics in the dataset.
• By clustering similar reviews or comments, sentiment analysis can
identify specific aspects or issues (e.g., customer service, product
quality) that people feel positively or negatively about, providing
more nuanced insights.

31 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
8. Sentiment Trend Analysis and Visualization
• Data mining tools help analyze sentiment trends over time, such as
changes in customer opinion or shifts in brand perception.
• This can be visualized through time series plots, sentiment heat
maps, or word clouds, allowing organizations to spot trends and
make timely decisions.

32 Dr. D. Dutta BITS Pilani, Pilani Campus


How Data Mining is helping to
do sentiment analysis
9. Handling Large-Scale Data with Big Data Techniques
• When dealing with large datasets from social media, e-commerce
sites, or streaming platforms, big data mining techniques (e.g.,
MapReduce, Hadoop, Spark) are used to process and analyze data
efficiently.
• This allows sentiment analysis to be conducted in real time or on
massive datasets, helping businesses get timely insights from
extensive feedback.

33 Dr. D. Dutta BITS Pilani, Pilani Campus


Our work on Sentiment
Analysis

34 Dr. D. Dutta BITS Pilani, Pilani Campus


What is Fraud Detection?

Fraud detection is the process of identifying and preventing


fraudulent activities or transactions within various industries, such
as finance, insurance, healthcare, e-commerce, and
telecommunications.

Applications of Fraud Detection


• Banking and Finance: Detecting credit card fraud, loan fraud,
money laundering, and insider trading.
• Insurance: Identifying fraudulent claims in health, auto, or life
insurance.
• E-Commerce: Detecting fake accounts, false reviews, account
takeovers, and fraudulent orders.
• Healthcare: Preventing fraudulent billing, claims, and prescription
abuse.

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Frauds

36 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

Data mining techniques are widely used in fraud


detection to analyze large volumes of data and detect
unusual patterns. Here’s how data mining helps in fraud
detection:
1. Data Collection and Integration
• Data mining starts with collecting data from various
sources, such as transaction records, user profiles, and
past fraud cases.
• Integrating data from multiple channels (e.g., credit card
transactions, login data, and location) provides a holistic
view, which is critical in identifying fraud patterns.

37 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

2. Data Cleaning and Preprocessing


• Since real-world data is often noisy, incomplete, or
inconsistent, data cleaning is essential for accurate
fraud detection.
• Outlier detection can help identify irregularities, such as
incorrect data entries or unusual transaction amounts,
which might be signs of fraudulent behavior.

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

3. Pattern Recognition and Anomaly Detection


• Anomaly detection is one of the core techniques used
in fraud detection. It helps find transactions or behaviors
that deviate significantly from typical patterns.
• For example, if a credit card is used in two different
locations in a very short time, it may be flagged as an
anomaly.

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

4. Classification and Prediction


• Supervised machine learning algorithms, such as
logistic regression, decision trees, and neural
networks, can classify transactions as either normal or
potentially fraudulent.
• These models are trained on labeled data (where fraud
instances are identified) to predict whether new
transactions are fraudulent or legitimate.

40 Dr. D. Dutta BITS Pilani, Pilani Campus
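A hedged sketch of such a supervised fraud classifier, using scikit-learn logistic regression on made-up transaction features (amount, hour of day, foreign-location flag):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up labelled transactions: [amount, hour_of_day, is_foreign_location]
X = np.array([[25, 14, 0], [40, 10, 0], [3200, 3, 1], [15, 18, 0],
              [2800, 2, 1], [60, 12, 0], [4500, 4, 1], [32, 16, 0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])          # 1 = fraudulent, 0 = legitimate

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

new_tx = np.array([[3900, 3, 1]])               # large night-time foreign transaction
print(clf.predict(new_tx), clf.predict_proba(new_tx))   # flagged as likely fraud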


Steps of Fraud Detection

5. Clustering and Profiling


• Clustering groups transactions or users into clusters
with similar characteristics, helping to identify unusual
activity within a group.
• Profiling involves building behavior profiles for users and
identifying activities that deviate from typical behavior,
which could indicate fraud.

41 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

6. Association Rule Mining


• Association rule mining identifies relationships
between actions or transactions that may indicate fraud.
For example, certain purchase patterns could be
associated with fraudulent behavior.
• It’s particularly useful for detecting collusion, such as
when a group of users or accounts is involved in
organized fraud.

42 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

7. Time Series Analysis


• Analyzing data over time helps detect temporal
patterns associated with fraud. For instance, sudden
changes in transaction volume or frequency could
indicate fraudulent activity.
• Time series analysis is useful in scenarios where
fraudsters attempt to stay undetected by spreading their
activities over time.

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

8. Social Network Analysis


• In cases of organized fraud or collusion, social network
analysis can identify connections between fraudsters.
• This involves mapping relationships between users,
accounts, or devices to detect clusters of linked entities
that are involved in fraud.

44 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

9. Risk Scoring and Real-Time Detection


• Fraud detection systems often assign a risk score to
each transaction, based on various factors like amount,
location, and user behavior. High-risk scores trigger
alerts for further investigation.
• Real-time fraud detection uses fast processing
techniques to flag and block potentially fraudulent
transactions before they’re completed.

45 Dr. D. Dutta BITS Pilani, Pilani Campus


Steps of Fraud Detection

10. Visualization and Reporting


• Data visualization helps fraud analysts understand
patterns and trends more clearly. Dashboards and visual
tools display real-time alerts and summarize detected
fraud instances.
• Reporting is also essential for compliance and regulatory
purposes, as it provides evidence and insights on fraud
cases.

46 Dr. D. Dutta BITS Pilani, Pilani Campus


Different applications of
Data Mining
•Marketing and Customer Relationship Management (CRM)
•Healthcare and Medicine
•Finance and Banking
•Retail and E-commerce
•Telecommunications
•Manufacturing and Industry
•Education
•Energy and Utilities
•Government and Public Sector
•Social Media and Sentiment Analysis
•Transportation and Logistics
•Cybersecurity

47 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

48 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 13

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Outlier Detection

1. What are Outliers?

2. Outlier Detection Methods

3. Statistical Approach
As per session plan
• Outliers

o What are Outliers?

o Outlier Detection Methods

o Statistical Approach
2 Dr. D. Dutta BITS Pilani, Pilani Campus
What Are Outliers?
Outlier: A data object that deviates significantly from the normal objects as if it
were generated by a different mechanism
– Ex.: Unusual credit card purchase; sports: Michael Jordan, Wayne Gretzky, ...
– In a group, all are moving in the forward direction and one person is moving
in the backward direction.
Outliers are different from the noise data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
Outliers are interesting: they violate the mechanism that generates the normal data
Novelty detection is different from Outlier detection: Focuses on identifying new,
previously unseen patterns or examples once the model is trained.
Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
3 Dr. D. Dutta BITS Pilani, Pilani Campus
Types of Outliers

Three kinds: global, contextual and collective outliers


Global outlier (or point anomaly)
– Object is Og if it significantly deviates from the rest of the data set
– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Outliers

Contextual outlier (or conditional outlier)


– Object is Oc if it deviates significantly based on a selected context
– Ex. 40°C in Calcutta: outlier? (depends on whether it is summer or winter)

Attributes of data objects should be divided into two groups


Contextual attributes: defines the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
Can be viewed as a generalization of local outliers - whose density
significantly deviates from its local area
Issue: How to define or formulate meaningful context?
5 Dr. D. Dutta BITS Pilani, Pilani Campus
Types of Outliers

Collective Outliers
▪ A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
▪ Applications: E.g., intrusion detection:
• When a number of computers keep sending
denial-of-service packets to each other
◼ Detection of collective outliers
◼ Consider not only behavior of individual objects, but also that of groups

of objects
◼ Need to have the background knowledge on the relationship among

data objects, such as a distance or similarity measure on objects.


◼ A data set may have multiple types of outlier
◼ One object may belong to more than one type of outlier

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Challenges of Outlier
Detection
◼ Modeling normal objects and outliers properly
◼ Hard to enumerate all possible normal behaviors in an application

◼ The border between normal and outlier objects is often a gray area

◼ Application-specific outlier detection


◼ Choice of distance measure among objects and the model of

relationship among objects are often application-dependent


◼ E.g., in clinical data a small deviation could be an outlier, while in
marketing analysis only much larger fluctuations would be


◼ Handling noise in outlier detection
◼ Noise may distort the normal objects and blur the distinction between

normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
◼ Understandability
◼ Understand why these are outliers: Justification of the detection

◼ Specify the degree of an outlier: the unlikelihood of the object being

generated by a normal mechanism


7 Dr. D. Dutta BITS Pilani, Pilani Campus
Outlier Detection: Supervised
Methods
Two ways to categorize outlier detection methods:
– Based on whether user-labeled examples of outliers can be obtained:
• Supervised, semi-supervised vs. unsupervised methods
– Based on assumptions about normal data and outliers:
• Statistical, proximity-based, and clustering-based methods
Outlier Detection I: Supervised Methods
– Modeling outlier detection as a classification problem
• Samples examined by domain experts used for training & testing
– Methods for Learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as
outliers, or
• Model outliers and treat those not matching the model as normal

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Supervised
Methods
– Challenges
• Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
• Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)

• Recall measures the ability of the model to find all the relevant
(true) outliers. A higher recall means the model is successfully
identifying more outliers.

• Accuracy represents the proportion of total correct predictions


(both outliers and normal points) made by the model out of all
predictions.
9 Dr. D. Dutta BITS Pilani, Pilani Campus
Outlier Detection:
Unsupervised Methods
Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each
having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
– Normal objects may not share any strong patterns, but the collective outliers
may share high similarity in a small area
Ex. In some intrusion or virus detection, normal activities are diverse
– Unsupervised methods may have a high false positive rate but still miss
many real outliers.
– Supervised methods can be more effective, e.g., identify attacking some key
resources
Many clustering methods can be adapted for unsupervised methods
– Find clusters, then outliers: not belonging to any cluster
– Problem 1: Hard to distinguish noise from outliers
– Problem 2: Costly, since clustering is done first even though there are far fewer
outliers than normal objects
• Newer methods: tackle outliers directly

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection:
Unsupervised Methods

11 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Semi-
Supervised Methods
Situation: In many applications, the number of labeled data is often small: Labels
could be on outliers only, normal objects only, or both
Semi-supervised outlier detection: Regarded as applications of semi-supervised
learning
If some labeled normal objects are available
– Use the labeled examples and the proximate unlabeled objects to train a
model for normal objects
– Those not fitting the model of normal objects are detected as outliers
If only some labeled outliers are available, a small number of labeled outliers may
not cover the possible outliers well
– To improve the quality of outlier detection, one can get help from models for
normal objects learned from unsupervised methods

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Semi-
Supervised Methods

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Statistical
Methods
Statistical methods (also known as model-based methods) assume that
the normal data follow some statistical model (a stochastic model)
– The data not following the model are outliers.
◼ Example (right figure): First use Gaussian distribution
to model the normal data
◼ For each object y in region R, estimate gD(y), the

probability of y fits the Gaussian distribution


◼ If gD(y) is very low, y is unlikely generated by the

Gaussian model, thus an outlier

◼ Effectiveness of statistical methods: highly depends on whether the


assumption of statistical model holds in the real data
◼ There are rich alternatives to use various statistical models
◼ E.g., parametric vs. non-parametric

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Statistical
Methods

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Detection: Proximity-
Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object deviates significantly from the proximity of most of
the other objects in the same data set
◼ Example (right figure): Model the proximity of an object
using its 3 nearest neighbors
◼ Objects in region R are substantially different from
other objects in the data set.
◼ Thus the objects in R are outliers
◼ The effectiveness of proximity-based methods highly relies on the proximity
measure.
◼ In some applications, proximity or distance measures cannot be obtained
easily.
◼ Often have a difficulty in finding a group of outliers which stay close to each
other
◼ Two major types of proximity-based outlier detection
◼ Distance-based vs. density-based
16 Dr. D. Dutta BITS Pilani, Pilani Campus
Outlier Detection: Clustering-
Based Methods
Normal data belong to large and dense clusters, whereas outliers belong
to small or sparse clusters, or do not belong to any clusters

◼ Example (right figure): two clusters


◼ All points not in R form a large cluster
◼ The two points in R form a tiny cluster, thus
are outliers

◼ Since there are many clustering methods, there are many clustering-
based outlier detection methods as well
◼ Clustering is expensive: straightforward adaption of a clustering
method for outlier detection can be costly and does not scale up well
for large data sets

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Statistical Approaches
• Statistical methods are divided into two categories: parametric vs.
non-parametric
• Parametric method
– Assumes that the normal data is generated by a parametric
distribution with parameter θ
– The probability density function of the parametric distribution f(x,
θ) gives the probability that object x is generated by the
distribution
– The smaller this value, the more likely x is an outlier
• Non-parametric method
– Not assume an a-priori statistical model and determine the
model from the input data
– Not completely parameter free but consider the number and
nature of the parameters are flexible and not fixed in advance
– Examples: histogram and kernel density estimation

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Parametric Methods: Detection Univariate
Outliers Based on Normal Distribution
Univariate data: A data set involving only one attribute or variable
Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low
probability as outliers
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}

◼ For the above data with n = 10, we have mean x̄ = 28.61 and standard
deviation s = 1.51
◼ Then (24 – 28.61) / 1.51 = –3.04 < –3, so 24 is an outlier: it lies more than
3 standard deviations from the mean (under a normal distribution only about
0.3% of values fall outside μ ± 3σ)

19 Dr. D. Dutta BITS Pilani, Pilani Campus
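A small NumPy check of this example, using the mean and standard deviation quoted on the slide:

import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
mu, sigma = 28.61, 1.51            # values quoted on the slide

z = (temps - mu) / sigma           # z-score of every reading
print(temps[np.abs(z) > 3])        # [24.] -- the only value more than 3 sigma from the mean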


Parametric Methods: Detection
of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or variables
Transform the multivariate outlier detection task into a univariate outlier
detection problem
Method 1. Compute the Mahalanobis distance
– Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o – ō)T S –1(o – ō),
where S is the covariance matrix

Method 2. Use the χ2–statistic:
χ2 = Σi=1..n (oi – Ei)2 / Ei
– where Ei is the mean of the i-th dimension among all objects, and n is
the dimensionality
– If the χ2–statistic is large, then object oi is an outlier
20 Dr. D. Dutta BITS Pilani, Pilani Campus
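A brief NumPy sketch of the Mahalanobis-distance computation on made-up 2-D data (a comparatively large distance suggests an outlier):

import numpy as np

# Made-up 2-D data: most points near (10, 20), one far away
X = np.array([[10.1, 20.2], [ 9.8, 19.9], [10.3, 20.1], [ 9.9, 20.0],
              [10.2, 19.8], [10.0, 20.3], [25.0, 40.0]])

o_bar = X.mean(axis=0)                           # mean vector
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

diff = X - o_bar
mdist = np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))  # MDist(o, o_bar) per object
print(np.round(mdist, 2))                        # the last object has the largest distance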
Parametric Methods: Using Mixture of
Parametric Distributions
Assuming data generated by a normal distribution
could be sometimes overly simplified
Example (right figure): The objects between the two
clusters cannot be captured as outliers since they
are close to the estimated mean
◼ To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
Pr(o | θ1, θ2) = fθ1(o) + fθ2(o)
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
◼ Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
◼ An object o is an outlier if it does not belong to any cluster
21 Dr. D. Dutta BITS Pilani, Pilani Campus
Parametric Methods: Using Mixture of
Parametric Distributions

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Non-Parametric Methods:
Detection Using Histogram
The model of normal data is learned from the input
data without any a priori structure.
Often makes fewer assumptions about the data, and
thus can be applicable in more scenarios
Outlier detection using histogram:

◼ Figure shows the histogram of purchase amounts in transactions


◼ A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
◼ Problem: Hard to choose an appropriate bin size for histogram
◼ Too small bin size → normal objects in empty/rare bins, false positive
◼ Too big bin size → outliers in some frequent bins, false negative
◼ Solution: Adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density function is high, the object is
likely normal. Otherwise, it is likely an outlier.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Proximity-Based Approaches: Distance-
Based vs. Density-Based Outlier Detection

Intuition: Objects that are far away from the others are outliers
Assumption of proximity-based approach: The proximity of an outlier
deviates significantly from that of most of the others in the data set
Two types of proximity-based outlier detection methods
– Distance-based outlier detection: An object o is an outlier if its
neighborhood does not have enough other points
– Density-based outlier detection: An object o is an outlier if its
density is relatively much lower than that of its neighbors

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Distance-Based Outlier
Detection
For each object o, examine the # of other objects in the r-neighborhood of o,
where r is a user-specified distance threshold
An object o is an outlier if most (taking π as a fraction threshold) of the objects in
D are far away from o, i.e., not in the r-neighborhood of o

An object o is a DB(r, π) outlier if
‖{o′ | dist(o, o′) ≤ r}‖ / ‖D‖ ≤ π

Equivalently, one can check the distance between o and its k-th nearest neighbor
ok, where k = ⌈π‖D‖⌉. o is an outlier if dist(o, ok) > r
Efficient computation: Nested loop algorithm
– For any object oi, calculate its distance from other objects, and count the #
of other objects in the r-neighborhood.
– If π∙n other objects are within r distance, terminate the inner loop
– Otherwise, oi is a DB(r, π) outlier
Efficiency: Actually CPU time is not O(n2) but linear to the data set size since for
most non-outlier objects, the inner loop terminates early

25 Dr. D. Dutta BITS Pilani, Pilani Campus
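A compact Python sketch of the nested-loop DB(r, π) test described above (the data set and thresholds are illustrative):

import numpy as np

def db_outliers(D, r, pi):
    # Return the DB(r, pi) outliers of data set D using the nested-loop test
    n = len(D)
    need = pi * n                            # enough r-neighbours to stop early
    outliers = []
    for i, o in enumerate(D):
        count = 0
        for j, other in enumerate(D):        # inner loop over the rest of the data set
            if i != j and np.linalg.norm(o - other) <= r:
                count += 1
                if count >= need:            # o has enough close neighbours: not an outlier
                    break
        else:                                # inner loop never broke early
            outliers.append(o)               # o is a DB(r, pi) outlier
    return outliers

D = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [5.0, 5.0]])
print(db_outliers(D, r=0.5, pi=0.4))         # only [5.0, 5.0] is reported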


Distance-Based Outlier Detection:
A Grid-Based Method
Why efficiency is still a concern? When the complete set of objects cannot be
held into main memory, cost I/O swapping
The major cost: (1) each object tests against the whole data set, why not only its
close neighbor? (2) check objects one by one, why not group by group?
Grid-based method (CELL): Data space is partitioned into a multi-D grid. Each
cell is a hyper cube with diagonal length r/2
◼ Pruning using the level-1 & level 2 cell properties:
◼ For any possible point x in cell C and any possible point
y in a level-1 cell, dist(x,y) ≤ r
◼ For any possible point x in cell C and any point y such
that dist(x,y) ≥ r, y is in a level-2 cell

◼ Thus we only need to check the objects that cannot be pruned, and even for
such an object o, only need to compute the distance between o and the
objects in the level-2 cells (since beyond level-2, the distance from o is more
than r)
26 Dr. D. Dutta BITS Pilani, Pilani Campus
Density-Based Outlier
Detection
Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data distribution
In the figure, O1 and O2 are local outliers relative to C1, O3 is a global
outlier, but O4 is not an outlier. However, proximity-based methods
using the global distance distribution cannot find that O1 and O2 are
outliers (e.g., when comparing them with O4).

◼ Intuition (density-based outlier detection): The density around an outlier object is
significantly different from the density around its neighbors
◼ Method: Use the relative density of an object against its neighbors as the
indicator of the degree of the object being outliers
◼ k-distance of an object o, distk(o): distance between o and its k-th NN
◼ k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
◼ Nk(o) could be bigger than k since multiple objects may have identical
distance to o
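A minimal sketch of relative-density scoring with scikit-learn's LocalOutlierFactor (assumed tooling; the clusters and injected points are illustrative, loosely following the C1/C2 example above):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
C1 = rng.normal(0.0, 2.0, size=(400, 2))       # loosely distributed cluster
C2 = rng.normal(10.0, 0.3, size=(100, 2))      # tightly condensed cluster
O = np.array([[10.0, 3.0], [0.0, 12.0]])       # two injected outlier points
X = np.vstack([C1, C2, O])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 marks points judged outliers
print("LOF of the injected points:", -lof.negative_outlier_factor_[-2:])
print("labels of the injected points:", labels[-2:])
```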

27 Dr. D. Dutta 27
BITS Pilani, Pilani Campus
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
An object is an outlier if (1) it does not belong to any cluster, (2) there is a large
distance between the object and its closest cluster, or (3) it belongs to a small
or sparse cluster
◼ Case I: Not belong to any cluster
◼ Identify animals not part of a flock (group): Using a
density-based clustering method such as DBSCAN
◼ Case 2: Far from its closest cluster
◼ Using k-means, partition the data points into clusters
◼ For each object o, assign an outlier score based on its
distance from its closest center
◼ If dist(o, co)/avg_dist(co) is large, likely an outlier

◼ Ex. Intrusion detection: Consider the similarity between data points and the clusters in a training data set
◼ Use a training set to find patterns of “normal” data, e.g., frequent itemsets in
each segment, and cluster similar connections into groups
◼ Compare new data points with the clusters mined - Outliers are possible
attacks
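A minimal sketch of Case 2, assuming scikit-learn's KMeans and synthetic data (both illustrative): score each object by dist(o, c_o) / avg_dist(c_o).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(200, 2)),
               [[4.0, 20.0]]])                              # one far-away point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)   # dist(o, c_o)
avg = np.array([dist[km.labels_ == c].mean() for c in range(2)])     # avg_dist per cluster
score = dist / avg[km.labels_]                              # large ratio: likely outlier
print("score of the injected point:", score[-1])
```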
28 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering-Based Outlier Detection:
Detecting Outliers in Small Clusters

FindCBLOF: Detect outliers in small clusters


– Find clusters, and sort them in decreasing size
– To each data point, assign a cluster-based local
outlier factor (CBLOF):
– If obj p belongs to a large cluster, CBLOF =
cluster_size X similarity between p and cluster
– If p belongs to a small one, CBLOF = cluster size
X similarity between p and the closest large
cluster
◼ Ex. In the figure, o is outlier since its closest large cluster is C1, but the
similarity between o and C1 is small. For any point in C3, its closest
large cluster is C2 but its similarity from C2 is low, plus |C3| = 3 is small

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Clustering-Based Method:
Strength and Weakness
Strength
– Detect outliers without requiring any labeled data
– Work for many types of data
– Clusters can be regarded as summaries of the data
– Once the clusters are obtained, one need only compare any object against
the clusters to determine whether it is an outlier (fast)
Weakness
– Effectiveness depends highly on the clustering method used—they
may not be optimized for outlier detection
– High computational cost: Need to first find clusters
– A method to reduce the cost: Fixed-width clustering
• A point is assigned to a cluster if the center of the cluster is within a
pre-defined distance threshold from the point
• If a point cannot be assigned to any existing cluster, a new cluster
is created and the distance threshold may be learned from the
training data under certain conditions
30 Dr. D. Dutta BITS Pilani, Pilani Campus
Classification-Based Method:
One-Class Model
Idea: Train a classification model that can distinguish
“normal” data from outliers
A brute-force approach: Consider a training set that contains
samples labeled as “normal” and others labeled as
“outlier”
– But, the training set is typically heavily biased: # of
“normal” samples likely far exceeds # of outlier
samples
– Cannot detect unseen anomalies (i.e., anomalous
data points that were not present in the training
dataset)
◼ One-class model: A classifier is built to describe only the normal class.
◼ Learn the decision boundary of the normal class using classification methods
such as SVM
◼ Any samples that do not belong to the normal class (not within the decision
boundary) are declared as outliers
◼ Adv: can detect new outliers that may not appear close to any outlier objects
in the training set
◼ Extension: Normal objects may belong to multiple classes
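A minimal sketch of a one-class model with scikit-learn's OneClassSVM (assumed tooling; the training data and parameters are illustrative): the classifier is trained on normal samples only, and anything outside the learned boundary is declared an outlier.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
train_normal = rng.normal(0, 1, size=(500, 2))      # training set: "normal" class only
test = np.array([[0.2, -0.1], [6.0, 6.0]])          # a normal-looking point and an unseen anomaly

oc = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train_normal)
print(oc.predict(test))                             # +1 = within the boundary, -1 = outlier
```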
31 Dr. D. Dutta BITS Pilani, Pilani Campus
Classification-Based Method:
Semi-Supervised Learning
Semi-supervised learning: Combining classification-based and
clustering-based methods
Method
– Using a clustering-based approach, find a large cluster, C,
and a small cluster, C1
– Since some objects in C carry the label “normal”, treat all
objects in C as normal
– Use the one-class model of this cluster to identify normal
objects in outlier detection
– Since some objects in cluster C1 carry the label “outlier”,
declare all objects in C1 as outliers
– Any object that does not fall into the model for C (such as
a) is considered an outlier as well
◼ Comments on classification-based outlier detection methods
◼ Strength: Outlier detection is fast
◼ Bottleneck: Quality heavily depends on the availability and quality of the
training set, but often difficult to obtain representative and high-quality
training data
32 Dr. D. Dutta BITS Pilani, Pilani Campus
Mining Contextual Outliers: Transform into
Conventional Outlier Detection
• If the contexts can be clearly identified, transform it to conventional outlier
detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
• Ex. Detect outlier customers in the context of customer groups
– Contextual attributes: age group, postal code
– Behavioral attributes: # of trans/yr, annual total trans. amount
• Steps: (1) locate c’s context, (2) compare c with the other customers in the
same group, and (3) use a conventional outlier detection method
• If the context contains very few customers, generalize contexts
– Ex. Learn a mixture model U on the contextual attributes, and another
mixture model V of the data on the behavior attributes
– Learn a mapping p(Vi|Uj): the probability that a data object o belonging to
cluster Uj on the contextual attributes is generated by cluster Vi on the
behavior attributes
– Outlier score:

33 Dr. D. Dutta BITS Pilani, Pilani Campus


Mining Contextual Outliers II: Modeling
Normal Behavior with Respect to Contexts

In some applications, one cannot clearly partition the data into contexts
– Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products browsed
earlier should be considered as the context
Model the “normal” behavior with respect to contexts
– Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute values
– An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
Using a prediction model that links the contexts and behavior, these methods
avoid the explicit identification of specific contexts
Methods: A number of classification and prediction techniques can be used to
build such models, such as regression, Markov Models, and Finite State
Automaton

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Mining Collective Outliers I: On
the Set of “Structured Objects”
Collective outlier: If objects as a group deviate significantly from
the entire data
Need to examine the structure of the data set, i.e, the
relationships between multiple data objects
◼ Each of these structures is inherent to its respective type of data
◼ For temporal data (such as time series and sequences), we explore the
structures formed by time, which occur in segments of the time series or
subsequences
◼ For spatial data, explore local areas
◼ For graph and network data, we explore subgraphs
◼ Difference from the contextual outlier detection: the structures are often not
explicitly defined, and have to be discovered as part of the outlier detection
process.
◼ Collective outlier detection methods: two categories
◼ Reduce the problem to conventional outlier detection
◼ Identify structure units, treat each structure unit (e.g., subsequence,
time series segment, local area, or subgraph) as a data object, and
extract features
◼ Then outlier detection on the set of “structured objects” constructed as
such using the extracted features
35 Dr. D. Dutta 35
BITS Pilani, Pilani Campus
Mining Collective Outliers: Direct Modeling of
the Expected Behavior of Structure Units
• Models the expected behavior of structure units directly
• Ex. 1. Detect collective outliers in online social network of customers
– Treat each possible subgraph of the network as a structure unit
– Collective outlier: An outlier subgraph in the social network
• Small subgraphs that are of very low frequency
• Large subgraphs that are surprisingly frequent
• Ex. 2. Detect collective outliers in temporal sequences
– Learn a Markov model from the sequences
– A subsequence can then be declared as a collective outlier if it
significantly deviates from the model
• Collective outlier detection is subtle due to the challenge of exploring the
structures in data
– The exploration typically uses heuristics, and thus may be application
dependent
– The computational cost is often high due to the sophisticated mining
process
36 Dr. D. Dutta 36
BITS Pilani, Pilani Campus
Challenges for Outlier Detection
in High-Dimensional Data
Interpretation of outliers
– Detecting outliers without saying why they are outliers is not very useful
in high-D, because many features (or dimensions) are involved in a high-
dimensional data set
– E.g., which subspaces that manifest the outliers or an assessment
regarding the “outlier-ness” of the objects
Data sparsity
– Data in high-D spaces are often sparse
– The distance between objects becomes heavily dominated by noise as
the dimensionality increases
Data subspaces
– Adaptive to the subspaces signifying the outliers (i.e., the model is
able to identify and analyze the most relevant features or
subspaces)
– Capturing the local behavior of data
Scalable with respect to dimensionality
– # of subspaces increases exponentially
37 Dr. D. Dutta BITS Pilani, Pilani Campus
Approach I: Extending
Conventional Outlier Detection
• Method 1: Detect outliers in the full space, e.g., HilOut Algorithm
– Find distance-based outliers, but use the ranks of distance instead of
the absolute distance in outlier detection
– For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)
– The weight of object o: w(o) = Σ(i = 1..k) dist(o, nni(o)), the sum of its
distances to its k nearest neighbors
– All objects are ranked in weight-descending order


– Top-l objects in weight are output as outliers (l: user-specified parm)
– Employ space-filling curves for approximation: scalable in both time
and space w.r.t. data size and dimensionality
• Method 2: Dimensionality reduction
– Works only when in lower-dimensionality, normal instances can still be
distinguished from outliers
– PCA(Principal Component Analysis): Heuristically, the principal
components with low variance are preferred because, on such
dimensions, normal objects are likely close to each other and outliers
often deviate from the majority
38 Dr. D. Dutta BITS Pilani, Pilani Campus
Approach II: Finding Outliers
in Subspaces
• Extending conventional outlier detection: Hard for outlier interpretation
• Find outliers in much lower dimensional subspaces: easy to interpret why
and to what extent the object is an outlier
– E.g., find outlier customers in certain subspace: average transaction
amount >> avg. and purchase frequency << avg.
• Ex. A grid-based subspace outlier detection method
– Project data onto various subspaces to find an area whose density is
much lower than average
– Discretize the data into a grid with φ equi-depth (why?) regions
– Search for regions that are significantly sparse
• Consider a k-d cube: k ranges on k dimensions, with n objects
• If objects are independently distributed, the expected number of
objects falling into a k-dimensional region is (1/φ)^k · n = f^k · n, where f = 1/φ;
the standard deviation is sqrt(f^k (1 − f^k) n)
• The sparsity coefficient of cube C: S(C) = (n(C) − f^k n) / sqrt(f^k (1 − f^k) n),
where n(C) is the number of objects in C
• If S(C) < 0, C contains less objects than expected
• The more negative, the sparser C is and the more likely the objects
in C are outliers in the subspace
39 Dr. D. Dutta BITS Pilani, Pilani Campus
Approach III: Modeling
High-Dimensional Outliers
◼ Develop new models for high-dimensional outliers directly
◼ Avoid proximity measures and adopt new heuristics that do not deteriorate in
high-dimensional data
(Figure: a set of points form a cluster, except c, which is an outlier)
Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
For each point o, examine the angle ∆xoy for every pair of points x, y.
– For a point in the center (e.g., a), the angles formed differ widely
– For an outlier (e.g., c), the angle variance is substantially smaller
Use the variance of angles for a point to determine outlier
Combine angles and distance to model outliers
– Use the distance-weighted angle variance as the outlier score
– Angle-based outlier factor (ABOF):

– Efficient approximation computation method is developed


– It can be generalized to handle arbitrary types of data
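A simplified sketch of the angle-based intuition (not the full ABOF: the distance weighting is omitted and all pairs are enumerated, which is only feasible for small data): for each point, compute the variance of the angles ∆xoy; a far-away point sees the rest of the data in a narrow cone, so its angle variance is small.

```python
import numpy as np
from itertools import combinations

def angle_variance(X, o_idx):
    """Variance of the angles <(x, o, y) over all pairs (x, y) of other points."""
    o = X[o_idx]
    others = np.delete(X, o_idx, axis=0)
    cosines = []
    for x, y in combinations(others, 2):
        a, b = x - o, y - o
        cosines.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.var(np.arccos(np.clip(cosines, -1.0, 1.0)))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), [[10.0, 10.0]]])   # a cluster plus one far point
scores = [angle_variance(X, i) for i in range(len(X))]
print("angle variance of the far point:", scores[-1], "median:", np.median(scores))
```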
40 Dr. D. Dutta BITS Pilani, Pilani Campus
Thanks

Any Question?

41 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 12

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Clustering

1. Cluster analysis concepts.

2. Partitioning methods

3. Hierarchical methods for cluster analysis

4. Considerations for cluster analysis

5. Outlier analysis
As per session plan

• Clustering

o Cluster analysis concepts.

o Partitioning methods

o Hierarchical methods for cluster analysis

o Considerations for cluster analysis

o Outlier analysis

2 Dr. D. Dutta BITS Pilani, Pilani Campus


Model-Based Clustering

What is model-based clustering?


– Attempt to optimize the fit between the given data and some
mathematical model
– Based on the assumption: Data are generated by a mixture of
underlying probability distributions
Typical methods
– Statistical approach
• EM (Expectation maximization), AutoClass
– Machine learning approach
• COBWEB, CLASSIT
– Neural network approach
• SOM (Self-Organizing Feature Map)

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Mixture Models

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Gaussian mixture models

A Gaussian mixture model is a probabilistic model that assumes all the


data points are generated from a mixture of a finite number of
Gaussian distributions with unknown parameters.

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Gaussian mixture models

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Gaussian mixture models

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Mixture Models in 1 D

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Mixture Models in 1 D

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Expectation Maximization
(EM)

10 Dr. D. Dutta BITS Pilani, Pilani Campus


EM- 1 D Example
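A compact 1-D EM sketch for a two-component Gaussian mixture (NumPy assumed; the data, initial values and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.8, 200), rng.normal(3, 1.2, 300)])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = np.vstack([w[k] * gauss(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate mixing weights, means and standard deviations
    nk = resp.sum(axis=1)
    w = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

print("weights:", np.round(w, 2), "means:", np.round(mu, 2), "std devs:", np.round(sigma, 2))
```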

11 Dr. D. Dutta BITS Pilani, Pilani Campus


How to pick k

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Gaussian mixture model: d>1

13 Dr. D. Dutta BITS Pilani, Pilani Campus


ANEMIA PATIENTS AND CONTROLS; EM ITERATIONS 1, 3, 5, 10, 15, 25
(Figure slides 14-20: scatter plots of Red Blood Cell Volume against Red Blood Cell Hemoglobin Concentration for anemia patients and controls, with the fitted mixture components shown after successive EM iterations. From P. Smyth, ICML 2001.)
Neural Network Approach
Neural network approaches
– Represent each cluster as an exemplar acting as a “prototype” of
the cluster, i.e., select a representative object or data point for each
cluster that summarizes that cluster.
– New objects are distributed to the cluster whose exemplar is the
most similar according to some distance measure
Typical methods
– SOM (Self-Organizing feature Map)
• Competitive learning
– Involves a hierarchical architecture of several units
(neurons)
– Neurons compete in a “winner-takes-all” fashion for the
object currently being presented

21 Dr. D. Dutta BITS Pilani, Pilani Campus


Self-Organizing Feature Map
(SOM)
• SOMs, also called topological ordered maps, or Kohonen Self-Organizing
Feature Map (KSOMs)
• It maps all the points in a high-dimensional source space into a 2 to 3-d
target space, s.t., the distance and proximity relationship (i.e., topology) are
preserved as much as possible
• Similar to k-means: cluster centers tend to lie in a low-dimensional manifold
in the feature space
• Clustering is performed by having several units competing for the current
object
– The unit whose weight vector is closest to the current object wins
– The winner and its neighbors learn by having their weights adjusted
• SOMs are believed to resemble processing that can occur in the brain
• Useful for visualizing high-dimensional data in 2- or 3-D space

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Kohonen network
• The Kohonen network (Kohonen, 1982, 1984) can be seen as an
extension to the competitive learning network, although this is
chronologically incorrect.
• The output units in S are ordered in some fashion.
• When learning patterns are presented to the network, the weights to
the output units are thus adapted such that the order present in the
input space is preserved in the output, i.e., the neurons in S.
• This means that learning patterns which are near to each other in the
input space (where 'near' is determined by the distance measure used
in finding the winning unit) must be mapped on output units which are
also near to each other, i.e., the same or neighboring units.
• The mapping, which represents a discretization of the input space, is
said to be topology preserving.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Kohonen network
• Dimensionality of S must be at most N.
• For example: data on a two dimensional manifold in a high dimensional
input space can be mapped onto a two-dimensional Kohonen network.
• The learning patterns are random samples from the input space ℜN.
• At time t, a sample x(t) is generated and presented to the network.
• The winning unit k is determined
• The weights to this winning unit as well as its neighbors are adapted using
the learning rule
wo(t + 1) = wo(t) + γ · g(o, k) · (x(t) − wo(t))
• Here, g(o, k) is a decreasing function of the grid-distance between units o


and k, such that g(k, k) = 1.
• For example, for g() a Gaussian function can be used, such that (in one
dimension!)
g(o, k) = exp(−(o − k)²)
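A minimal sketch of this update for a 1-D grid of units (NumPy assumed; the grid size, learning rate and training data are illustrative), using the Gaussian neighbourhood above:

```python
import numpy as np

rng = np.random.default_rng(7)
n_units, dim, eta = 10, 2, 0.1
W = rng.uniform(-1, 1, size=(n_units, dim))        # one weight vector per output unit

for t in range(1000):
    x = rng.uniform(-1, 1, size=dim)               # learning pattern x(t)
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # winning unit: closest weight vector
    g = np.exp(-(np.arange(n_units) - k) ** 2)     # neighbourhood function, g(k, k) = 1
    W += eta * g[:, None] * (x - W)                # move winner and neighbours toward x

print(np.round(W, 2))   # nearby units end up with nearby weights (topology preserving)
```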

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Feature-mapping Kohonen
model
(Figure: two feature-mapping Kohonen models, (a) with input 1 0 and (b) with input 0 1; in each, an input layer feeds a Kohonen layer)
25 Dr. D. Dutta BITS Pilani, Pilani Campus
The Kohonen Network

• The Kohonen model provides a topological mapping. It places a


fixed number of input patterns from the input layer into a higher-
dimensional output or Kohonen layer.

• Training in the Kohonen network begins with the winner’s


neighbourhood of a fairly large size. Then, as training proceeds, the
neighbourhood size gradually decreases.

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Architecture of the Kohonen
Network

(Figure: input signals x1 and x2 enter the input layer, which is fully connected to the output layer producing output signals y1, y2 and y3)
27 Dr. D. Dutta BITS Pilani, Pilani Campus
The Kohonen Network

• The lateral connections are used to create a competition between


neurons. The neuron with the largest activation level among all
neurons in the output layer becomes the winner. This neuron is the
only neuron that produces an output signal. The activity of all other
neurons is suppressed in the competition.
• The lateral feedback connections produce excitatory or inhibitory
effects, depending on the distance from the winning neuron. This is
achieved by the use of a Mexican hat function which describes
synaptic weights between neurons in the Kohonen layer.

28 Dr. D. Dutta BITS Pilani, Pilani Campus


The Mexican hat function of
lateral connection

(Figure: connection strength plotted against distance from the winning neuron; an excitatory effect close to the winner, with inhibitory effects farther away)

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Network after 100, 1,000 and 10,000 iterations
(Figure slides 30-32: the weight vectors of the Kohonen layer, plotted as W(1,j) against W(2,j) on axes from -1 to 1, become progressively more ordered as training proceeds)
An Example

• Suppose, for instance, that the 2-dimensional input vector X is


presented to the three-neuron Kohonen network,

0.52
X= 
 0.12 

• The initial weight vectors, Wj, are given by

0.27 0.42 0.43


W1 =   W2 =   W3 =  
0.81 0.70 0.21

33 Dr. D. Dutta BITS Pilani, Pilani Campus


An Example
• We find the winning (best-matching) neuron jX using the minimum-
distance Euclidean criterion:

d1 = sqrt((x1 − w11)² + (x2 − w21)²) = sqrt((0.52 − 0.27)² + (0.12 − 0.81)²) = 0.73
d2 = sqrt((x1 − w12)² + (x2 − w22)²) = sqrt((0.52 − 0.42)² + (0.12 − 0.70)²) = 0.59
d3 = sqrt((x1 − w13)² + (x2 − w23)²) = sqrt((0.52 − 0.43)² + (0.12 − 0.21)²) = 0.13

• Neuron 3 is the winner and its weight vector W3 is updated according


to the competitive learning rule.

Δw13 = α (x1 − w13) = 0.1 (0.52 − 0.43) = 0.01
Δw23 = α (x2 − w23) = 0.1 (0.12 − 0.21) = −0.01


34 Dr. D. Dutta BITS Pilani, Pilani Campus
An Example
The updated weight vector W3 at iteration (p + 1) is determined as:

0.43  0.01 0.44


W3 ( p + 1) = W3 ( p) + W3 ( p) =   +  = 
0.21 − 0.01 0.20

The weight vector W3 of the wining neuron 3 becomes closer to the


input vector X with each iteration.
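The same worked example can be re-checked numerically; a small NumPy sketch using the numbers above:

```python
import numpy as np

X = np.array([0.52, 0.12])
W = np.array([[0.27, 0.81], [0.42, 0.70], [0.43, 0.21]])

d = np.linalg.norm(W - X, axis=1)            # Euclidean distances: about 0.73, 0.59, 0.13
winner = int(np.argmin(d))                   # neuron 3 (index 2) wins
W[winner] += 0.1 * (X - W[winner])           # competitive learning rule with alpha = 0.1
print(np.round(d, 2), winner + 1, np.round(W[winner], 2))   # updated W3 is about [0.44, 0.20]
```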

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Web Document Clustering
Using SOM
The result of SOM
clustering of
12088 Web
articles
The picture on the
right: drilling
down on the
keyword
“mining”
Based on
websom.hut.fi
Web page
36 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering High-Dimensional
Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• Principal Component Analysis (PCA) & singular value
decomposition (SVD) useful only when features are highly
correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering

37 Dr. D. Dutta BITS Pilani, Pilani Campus


Dimension reduction
Problems of high dimensional data, “the curse of dimensionality”
– running time
– overfitting
– number of samples required

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Complexity
Complexity (running time) increases with dimension d
A lot of methods have at least O(nd²) complexity, where n is the number
of samples
For example if we need to estimate the covariance matrix
So as d becomes large, O(nd²) complexity may be too costly

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Complexity
If d is large, n, the number of samples, may be too small for accurate
parameter estimation
For example, covariance matrix has d² parameters:

For accurate estimation, n should be much bigger than d²


Otherwise model is too complicated for the data, overfitting:

40 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Complexity
Paradox: If n < d² we are better off assuming that features are
uncorrelated, even if we know this assumption is wrong
In this case, the covariance matrix has only d parameters:

We are likely to avoid overfitting because we fit a model with less


parameters:

41 Dr. D. Dutta 41
BITS Pilani, Pilani Campus
Curse of Dimensionality:
Number of Samples
Suppose we want to use the nearest neighbor approach with k = 1
(1NN)
Suppose we start with only one feature
This feature is not discriminative, i.e. it does not separate the classes
well

We decide to use 2 features. For the 1NN method to work well, need a
lot of samples, i.e. samples have to be dense
To maintain the same density as in 1D (9 samples per unit length), how
many samples do we need?

42 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Number of Samples
We need 9² = 81 samples to maintain the same density as in 1D

43 Dr. D. Dutta 43
BITS Pilani, Pilani Campus
Curse of Dimensionality:
Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more
samples, we still have 9

This is way too sparse for 1NN to work well


44 Dr. D. Dutta BITS Pilani, Pilani Campus
Curse of Dimensionality:
Number of Samples
Things go from bad to worse if we decide to use 3 features:

If 9 was dense enough in 1D, in 3D we need 9³ = 729 samples!

45 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Number of Samples
In general, if n samples are dense enough in 1D
Then in d dimensions we need n^d samples!
And n^d grows really fast as a function of d
Common pitfall:
– If we can’t solve a problem with a few features, adding more
features seems like a good idea
– However the number of samples usually stays the same
– The method with more features is likely to perform worse, not
better as one might expect
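A quick numeric illustration of this growth, reusing the 9-samples-per-unit-length figure from the earlier slides (a sketch, nothing more):

```python
# Samples needed in d dimensions to keep the 1-D density of 9 samples per unit length
for d in (1, 2, 3, 5, 10):
    print(d, 9 ** d)    # 9, 81, 729, 59049, 3486784401
```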

46 Dr. D. Dutta BITS Pilani, Pilani Campus


Curse of Dimensionality:
Number of Samples
For a fixed number of samples, as we add features, the graph of
classification error:

Thus for each fixed sample size n, there is the optimal number of
features to use

47 Dr. D. Dutta BITS Pilani, Pilani Campus


The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)

Data in only one dimension is relatively packed


Adding a dimension “stretches” the points across
that dimension, making them further apart
Adding more dimensions will make the points
further apart - high dimensional data is
extremely sparse
Distance measure becomes meaningless - due to
equi-distance

48 Dr. D. Dutta 48
BITS Pilani, Pilani Campus
Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)

Clusters may exist only in some subspaces


Subspace-clustering: find clusters in all the subspaces

49 Dr. D. Dutta BITS Pilani, Pilani Campus
CLIQUE (Clustering In
QUEst)
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal-length
intervals
– It partitions an m-dimensional data space into non-overlapping
rectangular units
– A unit is dense if the fraction of total data points contained in the
unit exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a
subspace
50 Dr. D. Dutta BITS Pilani, Pilani Campus
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside
each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected
dense units for each cluster
– Determination of minimal cover for each cluster

51 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
• Let us say that we want to cluster a set of records that have three
attributes, namely, salary, vacation and age.
• The data space for the this data would be 3-dimensional. age
salary vacation

52 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
• After plotting the data objects, each dimension, (i.e., salary,
vacation and age) is split into intervals of equal length.
• Then we form a 3-dimensional grid on the space, each unit of
which would be a 3-D rectangle.
• Now, our goal is to find the dense 3-D rectangular units.
• To do this, we find the dense units of the subspaces of this 3-D
space.
• So, we find the dense units with respect to salary and age. This
means that we look at the salary-age plane and find all the 2-D
rectangular units that are dense.
• We also find the dense 2-D rectangular units for the vacation - age
plane.

53 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
• We can extend the dense areas in the vacation-age plane inwards.
• We can extend the dense areas in the salary-age plane upwards.
• The intersection of these two spaces would give us a candidate
search space in which 3-dimensional dense units exist.
• We then find the dense units in the salary-vacation plane and we
form an extension of the subspace that represents these dense
units.
• Now, we perform an intersection of the candidate search space
with the extension of the dense units of the salary-vacation plane,
in order to get all the 3-d dense units.
• So, What was the main idea?
• We used the dense units in subspaces in order to find the dense
units in the 3-dimensional space.
• After finding the dense units, it is very easy to find clusters.

54 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
(Figure: the salary-age and vacation-age planes, with salary in units of $10,000 and vacation in weeks, are each divided into a grid over age 20-60; the dense 2-D units (density threshold 3) intersect around age 30-50, giving the candidate 3-D dense region)
55 Dr. D. Dutta BITS Pilani, Pilani Campus
Strength and Weakness of
CLIQUE
• Strength
– automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
– insensitive to the order of records in input and does not
presume some canonical data distribution
– scales linearly with the size of input and has good scalability as
the number of dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded at the
expense of simplicity of the method

56 Dr. D. Dutta BITS Pilani, Pilani Campus


Frequent Pattern-Based
Approach
• Clustering high-dimensional space (e.g., clustering text documents,
micro-array data)
– Projected subspace-clustering: which dimensions to be
projected on?
• CLIQUE, ProClus
– Feature extraction: costly and may not be effective (?)
– Using frequent patterns as “features”
• “Frequent” are inherent features
• Mining freq. patterns may not be so expensive
• Typical methods
– Frequent-term-based document clustering
– Clustering by pattern similarity in micro-array data
(p-Clustering)
57 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering by Pattern
Similarity (p-Clustering)
Right: The micro-array “raw” data shows 3
genes and their values in a multi-
dimensional space
– Difficult to find their patterns
Bottom: Some subsets of dimensions form
nice shift and scaling patterns

58 Dr. D. Dutta BITS Pilani, Pilani Campus
Why Constraint-Based
Cluster Analysis?
• Need user feedback: Users know their applications the best
• Less parameters but more user-desired constraints, e.g., an ATM
allocation problem: obstacle & desired clusters

59 Dr. D. Dutta BITS Pilani, Pilani Campus


A Classification of Constraints in
Cluster Analysis
• Clustering in applications: desirable to have user-guided (i.e.,
constrained) cluster analysis
• Different constraints in cluster analysis:
– Constraints on individual objects (do selection first)
• Cluster on houses worth over $300K
– Constraints on distance or similarity functions
• Weighted functions, obstacles (e.g., rivers, lakes)
– Constraints on the selection of clustering parameters
• Number of clusters, Minimum number of Points, etc.
– User-specified constraints
• Contain at least 500 valued customers and 5000 ordinary
ones
– Semi-supervised: giving small training sets as “constraints” or
hints
60 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering With Obstacle
Objects
K-medoids is more preferable since k-means
may locate the ATM center in the middle of
a lake
Visibility graph and shortest path
Triangulation and micro-clustering
Two kinds of join indices (shortest-paths)
worth pre-computation
– VV index: indices for any pair of
obstacle vertices
– MV index: indices for any pair of micro-
cluster and obstacle indices

61 Dr. D. Dutta BITS Pilani, Pilani Campus


An Example: Clustering With
Obstacle Objects

Not Taking obstacles into account Taking obstacles into account


62 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering with User-
Specified Constraints
• Example: Locating k delivery centers, each serving at least m
valued customers and n ordinary ones
• Proposed approach
– Find an initial “solution” by partitioning the data set into k groups
and satisfying user-constraints
– Iteratively refine the solution by micro-clustering relocation (e.g.,
moving δ μ-clusters from cluster Ci to Cj) and “deadlock”
handling (break the microclusters when necessary)
– Efficiency is improved by micro-clustering
• How to handle more complicated constraints?
– E.g., having approximately same number of valued customers
in each cluster?! — Can you solve it?

63 Dr. D. Dutta BITS Pilani, Pilani Campus


What Is Outlier Discovery?
• What are outliers?
– The set of objects are considerably dissimilar from the
remainder of the data
– Example: Sports: Michael Jordan, Wayne Gretzky, ...
• Problem: Define and find outliers in large data sets
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis

64 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Discovery: Statistical
Approaches

• Assume a model of the underlying distribution that generates the data set
(e.g. normal distribution)
• Use discordancy tests depending on
– data distribution
– distribution parameter (e.g., mean, variance)
– number of expected outliers
• Drawbacks
– most tests are for single attribute
– In many cases, data distribution may not be known
65 Dr. D. Dutta BITS Pilani, Pilani Campus
Outlier Discovery: Distance-
Based Approach
• Introduced to counter the main limitations imposed by statistical
methods
– We need multi-dimensional analysis without knowing data
distribution
• Distance-based outlier: A DB(p, D)-outlier is an object O in a
dataset T such that at least a fraction p of the objects in T lies at a
distance greater than D from O
• Algorithms for mining distance-based outliers
– Index-based algorithm
– Nested-loop algorithm
– Cell-based algorithm

66 Dr. D. Dutta BITS Pilani, Pilani Campus


Density-Based Local Outlier
Detection
Distance-based outlier detection is
based on global distance
distribution
It encounters difficulties to identify
outliers if data is not uniformly
distributed
Ex. C1 contains 400 loosely
distributed points, C2 has 100
tightly condensed points, 2
outlier points o1, o2
Distance-based method cannot identify o2 as an outlier
Need the concept of local outlier
◼ Local outlier factor (LOF)
◼ Assume outlier is not crisp
◼ Each point has a LOF

67 Dr. D. Dutta BITS Pilani, Pilani Campus


Outlier Discovery: Deviation-
Based Approach
• Identifies outliers by examining the main characteristics of objects
in a group
• Objects that “deviate” from this description are considered outliers
• Sequential exception technique
– simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in large
multidimensional data

68 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

69 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 11

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Clustering

1. Cluster analysis concepts.

2. Partitioning methods

3. Hierarchical methods for cluster analysis

4. Considerations for cluster analysis

5. Outlier analysis
As per session plan

• Clustering

o Cluster analysis concepts.

o Partitioning methods

o Hierarchical methods for cluster analysis

o Considerations for cluster analysis

• Outlier analysis

2 Dr. D. Dutta BITS Pilani, Pilani Campus


Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not


require the number of clusters k as an input, but needs a termination
condition

Step 0 Step 1 Step 2 Step 3 Step 4


agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
3 Dr. D. Dutta BITS Pilani, Pilani Campus
Dendrogram

A binary tree that shows how clusters are merged/split hierarchically


Each node on the tree is a cluster; each leaf node is a singleton cluster

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Dendrogram

A clustering of the data objects is obtained by cutting the dendrogram


at the desired level, then each connected component forms a cluster

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Dendrogram

A clustering of the data objects is obtained by cutting the dendrogram


at the desired level, then each connected component forms a cluster

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Dendrogram

A good clustering method will produce high quality clusters


– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– Its ability to discover some or all of the hidden patterns

7 Dr. D. Dutta BITS Pilani, Pilani Campus


AGNES (Agglomerative
Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
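A minimal AGNES-style sketch using SciPy's hierarchical clustering (assumed tooling; single linkage and the toy data are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)), rng.normal(3, 0.3, size=(5, 2))])

Z = linkage(X, method="single")                    # merge least-dissimilar clusters first
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the hierarchy into 2 clusters
print(labels)
```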

(Figure: three scatter plots of the same points on 0-10 axes, showing clusters being merged step by step)

8 Dr. D. Dutta BITS Pilani, Pilani Campus


DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

(Figure: three scatter plots of the same points on 0-10 axes, showing one cluster being split step by step)

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Recent Hierarchical
Clustering Methods
Major weakness of agglomerative clustering methods
– do not scale well: time complexity of at least O(n²), where n is
the number of total objects
– can never undo what was done previously
Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
– ROCK (1999): clustering categorical data by neighbor and link
analysis
– CHAMELEON (1999): hierarchical clustering using dynamic
modeling

10 Dr. D. Dutta BITS Pilani, Pilani Campus


An Example of the Agglomerative
Hierarchical Clustering Algorithm

• For the following data set, we will get different clustering results with
the single-link and complete-link algorithms.

(Figure: six data points labeled 1-6)

11 Dr. D. Dutta BITS Pilani, Pilani Campus


Result of the Single-Link algorithm
(Figure: dendrogram over the six points with leaf order 1 3 4 5 2 6)
Result of the Complete-Link algorithm
(Figure: dendrogram over the six points with leaf order 1 3 2 4 5 6)
12
Hierarchical Clustering: Comparison
Single-link, Complete-link, Average-link, Centroid distance
(Figure: nested cluster diagrams of the six points under the four linkage criteria; the groupings produced differ from one criterion to another)
13
Compare Dendrograms
Single-link, Complete-link, Average-link, Centroid distance
(Figure: the corresponding dendrograms over points 1-6; leaf orders and merge heights differ across the four linkage criteria)
14 Dr. D. Dutta BITS Pilani, Pilani Campus
Effect of Bias towards
Spherical Clusters

Single-link (2 clusters) Complete-link (2 clusters)


15 Dr. D. Dutta BITS Pilani, Pilani Campus
Strength of Single-link

Original Points Two Clusters

• Can handle non-global shapes

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Limitations of Single-Link

Original Points Two Clusters

• Sensitive to noise and outliers

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Strength of Complete-link

Original Points Two Clusters

• Less susceptible to noise and outliers


18 Dr. D. Dutta BITS Pilani, Pilani Campus
Density-Based Clustering
Methods
• Density-based Clustering locates regions of high density that are
separated from one another by regions of low density
• Density = number of points within a specified radius (Eps)
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
19
– CLIQUE: Agrawal, et al. (SIGMOD’98)
Dr. D. Dutta
(more grid-based)
BITS Pilani, Pilani Campus
Density-Based Clustering:
Basic Concepts
Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| >= MinPts
(Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm)

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Density-Based Clustering:
Basic Concepts
• A point is a core point if it has more than a specified number of
points (MinPts) within Eps
• These are points that are at the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
• A noise point is any point that is not a core point or a border point

21 Dr. D. Dutta BITS Pilani, Pilani Campus


Density-Based Clustering:
Basic Concepts
• Any two core points are close enough – within a distance Eps of one
another – are put in the same cluster
• Any border point that is close enough to a core point is put in the
same cluster as the core point
• Noise points are discarded

22 Dr. D. Dutta BITS Pilani, Pilani Campus


DBSCAN: Core, Border, and
Noise Points

23 Dr. D. Dutta BITS Pilani, Pilani Campus


DBSCAN Algorithm
(simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c create an edge from c to every point p in the
Eps-neighborhood of c
3. Set N to the nodes of the graph;
4. If N does not contain any core points terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going
forward;
1. create a cluster containing X ∪ {c}
2. N = N \ (X ∪ {c})
7. Continue with step 4

Remark: points that are not assigned to any cluster are outliers;
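A minimal sketch with scikit-learn's DBSCAN (assumed tooling; eps and min_samples play the roles of Eps and MinPts, and the data are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
               rng.normal(3, 0.2, size=(50, 2)),
               [[10.0, 10.0]]])                     # one isolated point

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                                  # cluster ids plus -1 for noise/outliers
```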

24 Dr. D. Dutta BITS Pilani, Pilani Campus


DBSCAN Algorithm

25 Dr. D. Dutta BITS Pilani, Pilani Campus


DBSCAN: Core, Border and
Noise Points

Original Points Point types: core,


border and noise

Eps = 10, MinPts = 4


26 Dr. D. Dutta BITS Pilani, Pilani Campus
When DBSCAN Works Well

Original Points Clusters


• Resistant to Noise
• Can handle clusters of different shapes and sizes

27 Dr. D. Dutta BITS Pilani, Pilani Campus


DBSCAN: Sensitive to
Parameters

28 Dr. D. Dutta BITS Pilani, Pilani Campus


When DBSCAN Does NOT
Work Well

(Figure: the original points and two DBSCAN results, one with MinPts = 4, Eps = 9.75 and one with MinPts = 4, Eps = 9.92)
• Varying densities
• High-dimensional data
29 Dr. D. Dutta BITS Pilani, Pilani Campus
OPTICS: A Cluster-
Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database w.r.t. its density-based
clustering structure
– This cluster-ordering contains info equivalent to the density-based
clustering corresponding to a broad range of parameter settings
– Good for both automatic and interactive cluster analysis, including
finding intrinsic clustering structure
– Can be represented graphically or using visualization techniques

30 Dr. D. Dutta BITS Pilani, Pilani Campus


OPTICS: Some Extension
from DBSCAN
Index-based:
• k = number of dimensions
• N = 20
• p = 75%
• M = N(1 − p) = 5
– Complexity: O(kN²)
Core Distance
Reachability Distance = max(core-distance(o), d(o, p))
(Figure: with MinPts = 5 and ε = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm)
31 Dr. D. Dutta BITS Pilani, Pilani Campus
OPTICS: Some Extension
from DBSCAN
(Figure: the OPTICS reachability plot: reachability-distance, undefined for the first object, plotted in the cluster-order of the objects)
32 Dr. D. Dutta BITS Pilani, Pilani Campus
Density-Based Clustering:
OPTICS & Its Applications

33 Dr. D. Dutta BITS Pilani, Pilani Campus
Grid-Based Clustering
Method
This is the approach in which we quantize space into a finite number
of cells that form a grid structure on which all of the operations for
clustering is performed.
So, for example assume that we have a set of records and we want to
cluster with respect to two attributes, then, we divide the related
space (plane), into a grid structure and then we find the clusters.

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Grid-Based Clustering
Method
Several interesting methods
– STING (a STatistical INformation Grid approach) by Wang, Yang
and Muntz (1997)
– WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
– CLIQUE: Agrawal, et al. (SIGMOD’98)
• On high-dimensional data (thus put in the section of clustering
high-dimensional data

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Grid-Based Clustering
Method

36 Dr. D. Dutta BITS Pilani, Pilani Campus


What is Spatial Data?
Spatial data may be thought of as features located on or referenced to
the Earth's surface, such as roads, streams, political boundaries,
schools, land use classifications, property ownership parcels, drinking
water intakes, pollution discharge sites - in short, anything that can be
mapped.
Spatial Area: The area that encompasses the locations of all the spatial
data is called spatial area.

37 Dr. D. Dutta BITS Pilani, Pilani Campus


STING Overview
STING is used for performing clustering on spatial data.
STING uses a hierarchical multi resolution grid data structure to partition
the spatial area.
STING's big benefit is that it processes many common “region oriented”
queries on a set of points, efficiently.
We want to cluster the records that are in a spatial table in terms of
location.
Placement of a record in a grid cell is completely determined by its
physical location.

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Grid Cell Hierarchy
The spatial area is divided into rectangular cells. (Using latitude and
longitude.)
Each cell forms a hierarchical structure.
This means that each cell at a higher level is further partitioned into 4
smaller cells in the lower level.
In other words each cell at the ith level (except the leaves) has 4 children
in the (i+1)th level.
The union of the 4 children cells would give back the parent cell in the
level above them.

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Grid Cell Hierarchy
The size of the leaf level cells and the number of layers depends upon
how much granularity the user wants.
So, why do we have a hierarchical structure for cells?
– We have them in order to provide a better granularity, or higher
resolution.

40 Dr. D. Dutta BITS Pilani, Pilani Campus


STING: A Statistical
Information Grid Approach
Wang, Yang and Muntz (VLDB’97)

41 Dr. D. Dutta BITS Pilani, Pilani Campus


Statistical Parameters
Stored in each Cell
For each cell in each layer we have:
– Attribute Independent Parameter:
• Count : number of records in this cell.
– Attribute Dependent Parameter:
• (We are assuming that our attribute values are real numbers.)

42 Dr. D. Dutta BITS Pilani, Pilani Campus


Statistical Parameters
For each attribute of each cell we store the following parameters:
– M → mean of all values of each attribute in this cell.
– S → Standard Deviation of all values of each attribute in this cell.
– Min → The minimum value for each attribute in this cell.
– Max → The maximum value for each attribute in this cell.
– Distribution → The type of distribution that the attribute value in
this cell follows. (e.g. normal, exponential, etc.) None is assigned
to “Distribution” if the distribution is unknown.

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Storing of Statistical
Parameters
Statistical information regarding the attributes in each grid cell, for each
layer are pre-computed and stored before hand.
The statistical parameters for the cells in the lowest layer is computed
directly from the values that are present in the table.
The Statistical parameters for the cells in all the other levels are
computed from their respective children cells that are in the lower
level.

44 Dr. D. Dutta BITS Pilani, Pilani Campus


Query Types
SQL like Language used to describe queries
– Two types of common queries found:
• find region specifying certain constraints
• take in a region and return some attribute of the region
A top-down approach is used to answer spatial data queries.

45 Dr. D. Dutta BITS Pilani, Pilani Campus


Query Processing
1. Start from a pre-selected layer-typically with a small number of cells.
(The pre-selected layer does not have to be the top most layer.)
2. For each cell in the current layer compute the confidence interval (or
estimated range of probability) reflecting the cells relevance to the
given query.
3. The confidence interval is calculated by using the statistical
parameters of each cell.
4. Remove irrelevant cells from further consideration.
5. When finished with the current layer, proceed to the next lower level.
6. Processing of the next lower level examines only the remaining
relevant cells.
7. Repeat this process until the bottom layer is reached.
8. Return the regions of relevant cells that satisfy the query

46 Dr. D. Dutta BITS Pilani, Pilani Campus


Different Grid Levels during
Query Processing

47 Dr. D. Dutta BITS Pilani, Pilani Campus


Sample Query Examples

Select the maximal regions that have at least 100 houses per unit area
and at least 70% of the house prices are above $400K and with
total area at least 100 units with 90% confidence.
Select the range of age of houses in those maximal regions where
there are at least 100 houses per unit area and at least 70% of the
houses have price between $150K and $300K with area at least 100
units in California.

48 Dr. D. Dutta BITS Pilani, Pilani Campus


Advantages and
Disadvantages of STING
• ADVANTAGES:
– Very efficient.
– The computational complexity is O(k) where k is the number of
grid cells at the lowest level
– Usually k << N, where N is the number of records.
– STING is a query independent approach, since statistical
information exists independently of queries.
– Incremental update.
• DISADVANTAGES:
– All Cluster boundaries are either horizontal or vertical, and no
diagonal boundary is selected.

49 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

50 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 10

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Clustering

1. Cluster analysis concepts.

2. Partitioning methods

3. Hierarchical methods for cluster analysis

4. Considerations for cluster analysis

5. Outlier analysis
As per session plan

• Clustering

o Cluster analysis concepts.

o Partitioning methods

o Hierarchical methods for cluster analysis

o Considerations for cluster analysis

• Outlier analysis

2 Dr. D. Dutta BITS Pilani, Pilani Campus


Clustering

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Clustering
(Figure: a two-dimensional data set plotted on X-Y axes; a solution is written as the three points (x1, y1), (x2, y2), (x3, y3))
4 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering

The term clustering refers to the identification of natural groups within


data sets such that instances in the same groups are more similar
than instances in different groups.
From the definition of clustering two objectives of clustering are very
clear
– intra-cluster distance (Homogeneity) (H) should be low
– inter-cluster distances (Separation) (S) should be high

5 Dr. D. Dutta BITS Pilani, Pilani Campus


What is Cluster Analysis?

Cluster: A collection of data objects


– similar (or related) to one another within the same group
– dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
– Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
– As a stand-alone tool to get insight into data distribution
– As a pre-processing step for other algorithms

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Uses of Clustering

Biology: taxonomy of living things: kingdom, phylum, class, order,


family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric and
ocean
Economic Science: market research
7 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering as a Pre-
processing Tool (Utility)
Summarization:
– Pre-processing for regression, PCA, classification, and
association analysis
Compression:
– Image processing: vector quantization
Finding K-Nearest Neighbors (k-NN)
– Localizing search to one or a small number of clusters
Outlier detection
– Outliers are often viewed as those “far away” from any cluster

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Quality: What Is Good
Clustering?
A good clustering method will produce high quality clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– Its ability to discover some or all of the hidden patterns

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Measure the Quality of
Clustering
Dissimilarity/Similarity metric
– Similarity is expressed in terms of a distance function, typically
metric: d(i, j)
– The definitions of distance functions are usually different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables
– Weights should be associated with different variables based on
applications and data semantics
Quality of clustering:
– There is usually a separate “quality” function that measures the
“goodness” of a cluster.
– It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Considerations for Cluster
Analysis
Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
Similarity measure
– Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
– Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
11 Dr. D. Dutta BITS Pilani, Pilani Campus
Requirements and Challenges

Scalability
– The clustering algorithm should be suitable for large data sets as well
Ability to deal with different types of attributes
– Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
Interpretability and usability
Others
– Discovery of clusters with arbitrary shape
– Ability to deal with noisy data
– Incremental clustering and insensitivity to input order
– High dimensionality
12 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Structures

◼ Data matrix (two modes): an n × p matrix in which row i holds the p
  attribute values of object i:
    [ x11 ... x1f ... x1p ]
    [ ... ... ... ... ... ]
    [ xi1 ... xif ... xip ]
    [ ... ... ... ... ... ]
    [ xn1 ... xnf ... xnp ]
◼ Dissimilarity matrix (one mode): an n × n matrix in which entry d(i, j)
  is the dissimilarity between objects i and j:
    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [   :       :       :        ]
    [ d(n,1)  d(n,2)  ...  ...  0 ]
13 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Structures

• A (2-dimensional) matrix is said to be 2-mode if the rows and


columns index different sets of entities (e.g., the rows might
correspond to persons while the columns correspond to
organizations).
• In contrast, a matrix is 1-mode if the rows and columns refer to the
same set of entities, such as a city-by-city matrix of distances.

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Type of data in clustering
analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Interval-valued variables

• An interval scale is one where there is order and the difference


between two values is meaningful.
• Examples of interval variables include:
temperature (Fahrenheit), temperature (Celsius), pH, credit score (300-
850).

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Interval-valued variables

• Standardize data
– Calculate the mean absolute deviation:
    s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f| )
  where m_f = (1/n) ( x_1f + x_2f + ... + x_nf )
– Calculate the standardized measurement (z-score):
    z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard
deviation

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Interval-valued variables

• Distances are normally used to measure the similarity or dissimilarity


between two data objects
• Some popular ones include: Minkowski distance:
    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-
dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance
    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Interval-valued variables

• If q = 2, d is Euclidean distance:
    d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2 )
– Properties
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
• Also, one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures
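As an illustration (not part of the original slides), a minimal Python sketch of the Minkowski family of distances; the two sample vectors are made up for demonstration.

# Minkowski, Manhattan and Euclidean distances between two p-dimensional objects.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):          # q = 1
    return minkowski(x, y, 1)

def euclidean(x, y):          # q = 2
    return minkowski(x, y, 2)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(manhattan(i, j))        # 7.0
print(euclidean(i, j))        # 5.0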

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Binary Variables
• A binary variable is a categorical variable that can only take one of
two values, usually represented as a Boolean - True or False - or an
integer variable - 0 or 1 - where 0 typically indicates that the attribute
is absent, and 1 indicates that it is present.
• If there is no preference for which outcome should be coded as 0
and which as 1, the binary variable is called symmetric. For
example, Gender - Male or Female.
• If the outcomes of a binary variable are not equally important, the
binary variable is called asymmetric. For example, A rare disease is
positive or negative.

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Binary Variables
• A contingency table for binary data (object i vs. object j):
                     Object j
                     1       0       sum
    Object i   1     a       b       a + b
               0     c       d       c + d
             sum    a + c   b + d     p
• Distance measure for symmetric binary variables:
    d(i, j) = (b + c) / (a + b + c + d)
• Distance measure for asymmetric binary variables:
    d(i, j) = (b + c) / (a + b + c)
• Jaccard coefficient (similarity measure for asymmetric binary variables):
    sim_Jaccard(i, j) = a / (a + b + c)
21 Dr. D. Dutta BITS Pilani, Pilani Campus
Dissimilarity between Binary
Variables
◼ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

◼ gender is a symmetric attribute


◼ the remaining attributes are asymmetric binary
◼ let the values Y and P be set to 1, and the value N be set to 0
    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
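A small Python sketch (not from the slides) that reproduces these three dissimilarities; Y and P are coded as 1, N as 0, and the symmetric attribute gender is excluded, as stated above.

# Dissimilarity for asymmetric binary attributes: d = (b + c) / (a + b + c),
# where a = 1/1 matches, b = 1/0 mismatches, c = 0/1 mismatches (0/0 matches are ignored).
def asym_binary_dissim(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1, Test-2, Test-3, Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75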

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Nominal Variables
• A nominal scale describes a variable with categories that do not have
a natural order or ranking. You can code nominal variables with
numbers if you want, but the order is arbitrary and any calculations,
such as computing a mean, median, or standard deviation, would be
meaningless.
• Examples of nominal variables include:
genotype, blood type, zip code, gender, race, eye color, political party

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Nominal Variables

• A generalization of the binary variable in that it can take more than 2


states, e.g., red, yellow, blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables
    d(i, j) = (p − m) / p

• Method 2: use a large number of binary variables


– creating a new binary variable for each of the M nominal states
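A minimal sketch of Method 1 (simple matching) in Python; the two sample objects and their attribute values are arbitrary and only serve to illustrate the formula.

# Simple matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
# where p = number of attributes and m = number of matching values.
def nominal_dissim(x, y):
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

obj_i = ["red", "A", "delhi"]
obj_j = ["red", "B", "mumbai"]
print(round(nominal_dissim(obj_i, obj_j), 2))   # 0.67 (one match out of three)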

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Ordinal Variables
• An ordinal scale is one where the order matters but not the
difference between values.
• Examples of ordinal variables include:
socio economic status (“low income", "middle income", "high income”),
education level (“high school”,”BS”,”MS”,”PhD”), income level (“less
than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely
dislike”, “dislike”, “neutral”, “like”, “extremely like”).
• Note the differences between adjacent categories do not necessarily
have the same meaning. For example, the difference between the
two income levels “less than 50K” and “50K-100K” does not have
the same meaning as the difference between the two income levels
“50K-100K” and “over 100K”.
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank

25 Dr. D. Dutta BITS Pilani, Pilani Campus


Ratio-Scaled Variables
• A ratio variable, has all the properties of an interval variable, and
also has a clear definition of 0.0. When the variable equals 0.0,
there is none of that variable.
• Examples of ratio variables include:
enzyme activity, dose amount, reaction rate, flow rate, concentration,
pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does
mean “no heat”), survival time.
• When working with ratio variables, but not interval variables, the
ratio of two measurements has a meaningful interpretation. For
example, because weight is a ratio variable, a weight of 4 grams is
twice as heavy as a weight of 2 grams.

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately at exponential scale, such as AeBt or Ae-Bt
Methods:
– treat them like interval-scaled variables—not a good choice!
(why?—the scale can be distorted)
– apply logarithmic transformation
yif = log(xif)
– treat them as continuous ordinal data treat their rank as interval-
scaled

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Relationships

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Variables of Mixed Types
A database may contain all the six types of variables
– symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio
One may use a weighted formula to combine their effects

 pf = 1 ij( f ) dij( f )
d (i, j) =
– f is binary or nominal:  pf = 1 ij( f )
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled
• compute ranks rif and
• and treat zif as interval-scaled r −1
zif =
if

M f −1

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays,
etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure

A variant: Tanimoto coefficient
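The slide names the cosine measure and the Tanimoto coefficient without showing the formulas; the sketch below uses their usual definitions (an assumption), on two made-up keyword-count vectors.

# Cosine measure and Tanimoto coefficient for two vector objects
# (e.g., keyword-count vectors of two documents).
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

d1 = [3, 2, 0, 5]
d2 = [1, 0, 0, 2]
print(round(cosine(d1, d2), 3))    # approx. 0.943
print(round(tanimoto(d1, d2), 3))  # approx. 0.433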

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Major Clustering Approaches
Partitioning approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects)
using some criterion
– Typical methods: Diana, Agnes, BIRCH, CAMELEON
Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
31 Dr. D. Dutta BITS Pilani, Pilani Campus
Major Clustering Approaches
Model-based:
– A model is hypothesized for each of the clusters and the method tries
to find the best fit of the data to that model
– Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
– Based on the analysis of frequent patterns
– Typical methods: p-Cluster
User-guided or constraint-based:
– Clustering by considering user-specified or application-specific
constraints
– Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
– Objects are often linked together in various ways
– Massive links can be used to cluster objects: SimRank, LinkClus
32 Dr. D. Dutta BITS Pilani, Pilani Campus
Typical Alternatives to Calculate
the Distance between Clusters
Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one cluster and
an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
Average: avg distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj)
= dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) =
dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster

33 Dr. D. Dutta BITS Pilani, Pilani Campus


How to measure the distance
between clusters?
Single-link
Complete-link
Average-link Distance?

Centroid distance

Hint: Distance between clusters is usually


defined on the basis of distance between
objects.


34 Dr. D. Dutta BITS Pilani, Pilani Campus


How to Define Inter-Cluster
Distance?

Single-link:  d_min(C_i, C_j) = min over p ∈ C_i, q ∈ C_j of d(p, q)
Complete-link
Average-link
Centroid distance
The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
35 Dr. D. Dutta BITS Pilani, Pilani Campus
How to Define Inter-Cluster
Distance?

Single-link
Complete-link:  d_max(C_i, C_j) = max over p ∈ C_i, q ∈ C_j of d(p, q)
Average-link
Centroid distance
The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters.

36 Dr. D. Dutta BITS Pilani, Pilani Campus


How to Define Inter-Cluster
Distance?

Single-link
Complete-link
Average-link:  d_avg(C_i, C_j) = avg over p ∈ C_i, q ∈ C_j of d(p, q)
Centroid distance
The distance between two clusters is represented by the
average distance of all pairs of data objects belonging to
different clusters.
37 Dr. D. Dutta BITS Pilani, Pilani Campus
How to Define Inter-Cluster
Distance?

 
m_i, m_j are the means of C_i, C_j
Single-link
Complete-link
Average-link
Centroid distance:  d_mean(C_i, C_j) = d(m_i, m_j)
The distance between two clusters is represented by the
distance between the means of the clusters.

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Centroid, Radius and Diameter of
a Cluster (for numerical data sets)
Centroid: the “middle” of a cluster
    C_m = ( Σ_{i=1..N} t_ip ) / N
Radius: square root of average distance from any point of the cluster to
its centroid
    R_m = sqrt( Σ_{i=1..N} (t_ip − c_m)^2 / N )
Diameter: square root of average mean squared distance between all
pairs of points in the cluster
    D_m = sqrt( Σ_{i=1..N} Σ_{j=1..N} (t_ip − t_jq)^2 / ( N (N − 1) ) )
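A minimal Python sketch of these three quantities for a small made-up cluster (not part of the original slides).

# Centroid, radius and diameter of a cluster of numeric points.
import math

points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
N = len(points)
dim = len(points[0])

centroid = tuple(sum(p[d] for p in points) / N for d in range(dim))

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

radius = math.sqrt(sum(sqdist(p, centroid) for p in points) / N)
diameter = math.sqrt(sum(sqdist(p, q) for p in points for q in points) / (N * (N - 1)))

print(centroid)            # (1.5, 1.5)
print(round(radius, 3))    # 0.707
print(round(diameter, 3))  # 1.155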

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Partitioning Algorithms: Basic
Concept
Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
    E = Σ_{i=1..k} Σ_{p ∈ C_i} (p − c_i)^2
Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
– Global optimal: exhaustively enumerate all partitions: Exhaustively
enumerating all possible partitions means considering all possible ways
to divide the set into the desired number of partitions, and evaluating the
objective function for each partition.
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
40 Dr. D. Dutta BITS Pilani, Pilani Campus
K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be
{x1, x2, …, xn},
where x_i = (x_i1, x_i2, ..., x_ir) is a vector in a real-valued space X ⊆ R^r,
and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user

41 Dr. D. Dutta BITS Pilani, Pilani Campus


K-means clustering
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids,
cluster centers
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
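A minimal k-means sketch following these four steps (not from the original slides). The seven two-dimensional points are an assumption reconstructed to be consistent with the worked example on the following slides, whose initial centroids are (1.0, 1.0) and (5.0, 7.0) and whose final centroids come out to (1.25, 1.5) and (3.9, 5.1); the original data table did not survive extraction.

# A minimal k-means implementation in pure Python (real work would
# typically use scikit-learn's KMeans).
import random
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def kmeans(data, k, init=None, max_iter=100):
    centroids = list(init) if init else random.sample(data, k)   # step 1: initial seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:                                            # step 2: assign to closest centroid
            j = min(range(k), key=lambda c: euclidean(x, centroids[c]))
            clusters[j].append(x)
        new_centroids = []
        for j, cluster in enumerate(clusters):                    # step 3: re-compute centroids
            if cluster:
                new_centroids.append(tuple(sum(col) / len(cluster) for col in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:                            # step 4: convergence check
            break
        centroids = new_centroids
    return centroids, clusters

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, clusters = kmeans(data, 2, init=[(1.0, 1.0), (5.0, 7.0)])
print(centroids)   # [(1.25, 1.5), (3.9, 5.1)] -- matches the worked example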

42 Dr. D. Dutta BITS Pilani, Pilani Campus


K-means clustering

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Stopping/convergence
criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),
    SSE = Σ_{j=1..k} Σ_{x ∈ C_j} dist(x, m_j)^2        (1)
– C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean
vector of all the data points in C_j), and dist(x, m_j) is the distance
between data point x and centroid m_j.

44 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=2)

45 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=2)

Step 1:
Initialization: Randomly we choose following two centroids (k=2) for two
clusters.
In this case the 2 centroid are: m1=(1.0,1.0) and m2=(5.0,7.0).

46 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=2)

Step 2:
Thus, we obtain two clusters containing:
{1,2,3} and {4,5,6,7}.
Their new centroids are:

47 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=2)

Step 3:
Now using these centroids we
compute the Euclidean distance of
each object, as shown in table.

Therefore, the new clusters are:


{1,2} and {3,4,5,6,7}

Next centroids are: m1=(1.25,1.5) and


m2 = (3.9,5.1)

48 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=2)

Step 4 :
The clusters obtained are:
{1,2} and {3,4,5,6,7}

Therefore, there is no change in the


cluster.
Thus, the algorithm comes to a halt
here and final result consist of 2
clusters {1,2} and {3,4,5,6,7}.

49 Dr. D. Dutta BITS Pilani, Pilani Campus


PLOT

50 Dr. D. Dutta BITS Pilani, Pilani Campus


A Simple example showing the implementation of k-means algorithm

(using K=3)

Step 1 Step 2
51 Dr. D. Dutta BITS Pilani, Pilani Campus
PLOT

52 Dr. D. Dutta BITS Pilani, Pilani Campus


Strengths of k-means
Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O (tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are usually small, k-means is considered a linear
algorithm.
K-means is the most popular clustering algorithm.
Note that: it terminates at a local optimum if SSE is used. The global
optimum is hard to find due to complexity.

53 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means
1. When there are only a few data points, the initial grouping largely
determines the resulting clusters.
2. The number of clusters, K, must be determined beforehand. A further
disadvantage is that the algorithm does not yield the same result on
each run, since the resulting clusters depend on the initial random
assignments.
3. We never know the true clusters: with the same data, a different
input order may produce different clusters when the number of data
points is small.
4. It is sensitive to the initial condition. A different initial condition may
produce a different clustering, and the algorithm may be trapped in
a local optimum.

54 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means:
Problems with outliers

55 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means: To
deal with outliers
One method is to remove some data points in the clustering process
that are much further away from the centroids than other data points.
– To be safe, we may want to monitor these possible outliers over a
few iterations and then decide to remove them.
Another method is to perform random sampling. Since in sampling we
only choose a small subset of the data points, the chance of
selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification

56 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means
The algorithm is sensitive to initial seeds.

57 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means
If we use different seeds: good results
There are some
methods to help
choose good
seeds

58 Dr. D. Dutta BITS Pilani, Pilani Campus


Weaknesses of k-means
The k-means algorithm is not suitable for discovering clusters that are
not hyper-ellipsoids (or hyper-spheres).

59 Dr. D. Dutta BITS Pilani, Pilani Campus


The K-Medoids Clustering
Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw
1987)
• Starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
Efficiency improvement on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): Randomized re-sampling

60 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

61 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 9

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Advanced Classification Methods

1. Bayesian Belief Networks

2. Artificial Neural Networks

3. Support Vector Machines

4. Lazy Learner

5. Multiclass Classification
As per session plan
• Classification: Advanced Methods

o Bayesian Belief Networks

o Artificial Neural Networks

o Support Vector Machines

o Lazy Learner

o Multiclass Classification
2 Dr. D. Dutta BITS Pilani, Pilani Campus
Introduction to Bayesian
Network
Suppose you are trying to determine if a
patient has Covid. You observe the
following symptoms:
• The patient has a cough
• The patient has a fever
• The patient has difficulty breathing

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Introduction to Bayesian
Network
You would like to determine how
likely the patient is infected with
Covid given that the patient has a
cough, a fever, and difficulty
breathing

We are not 100% certain that the


patient has Covid because of these
symptoms. We are dealing with
uncertainty!

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Introduction to Bayesian
Network
Now suppose you order an x-ray and
observe that the patient has a wide
mediastinum.
Your belief that that the patient is
infected with Covid is now much higher.

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Introduction to Bayesian
Network
In the previous slides, what you observed affected your belief that the
patient is infected with Covid
This is called reasoning with uncertainty
Wouldn’t it be nice if we had some methodology for reasoning with
uncertainty? Why in fact, we do…

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayesian Network
HasDifficultyBreathing
HasFever HasWideMediastinum
HasCough

Has Covid

In the opinion of many AI researchers, Bayesian networks are the most


significant contribution in AI in the last 10 years.
They are used in many applications eg. spam filtering, speech
recognition, robotics, diagnostic systems and even syndromic
surveillance.

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayesian Network

Bayesian Networks (BN) are directed acyclic graphs (DAGs) with an


associated set of probability tables.

The nodes are random variables.

Certain independence relations can be induced by the topology of the


graph.

The intuitive meaning of an arrow from a parent to a child is that the


parent directly influences the child. These influences are quantified
by conditional probabilities.

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayesian Network

Deal with uncertainty in inference via probability − Bayes.

Handle incomplete data set, e.g., classification, regression.

Model the domain knowledge, e.g., causal relationships.

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Example of Bayesian Network

BNs are graphical representations of joint distributions. The BN for the


medical expert system mentioned previously represents a joint
distribution over 8 binary random variables {A,T,E,L,S,B,D,X}.

10 Dr. D. Dutta BITS Pilani, Pilani Campus


A Bayesian Network

A Bayesian network is made up of:
1. A Directed Acyclic Graph (here: A → B, B → C, B → D)
2. A set of tables for each node in the graph

  A      P(A)      A      B      P(B|A)     B      C      P(C|B)     B      D      P(D|B)
  false  0.6       false  false  0.01       false  false  0.4        false  false  0.02
  true   0.4       false  true   0.99       false  true   0.6        false  true   0.98
                   true   false  0.7        true   false  0.9        true   false  0.05
                   true   true   0.3        true   true   0.1        true   true   0.95

11 Dr. D. Dutta BITS Pilani, Pilani Campus


A Bayesian Network

Each node in the graph is a random variable.
A node X is a parent of another node Y if there is an arrow from node X
to node Y, e.g., A is a parent of B.
Informally, an arrow from node X to node Y means X has a direct
influence on Y.
(Graph: A → B, B → C, B → D.)

12 Dr. D. Dutta BITS Pilani, Pilani Campus


A set of Tables for each
node
Each node X_i has a conditional probability distribution
P(X_i | Parents(X_i)) that quantifies the effect of the parents on the
node.
The parameters are the probabilities in these conditional probability
tables (CPTs).
(Graph: A → B, B → C, B → D, with the CPTs listed below.)

  A      P(A)      A      B      P(B|A)     B      C      P(C|B)     B      D      P(D|B)
  false  0.6       false  false  0.01       false  false  0.4        false  false  0.02
  true   0.4       false  true   0.99       false  true   0.6        false  true   0.98
                   true   false  0.7        true   false  0.9        true   false  0.05
                   true   true   0.3        true   true   0.1        true   true   0.95

13 Dr. D. Dutta BITS Pilani, Pilani Campus


A set of Tables for each
node
Conditional Probability
Distribution for C given B

B C P(C|B)

false false 0.4

false true 0.6

true false 0.9

true true 0.1

For a given combination of values of the parents (B


in this example), the entries for P(C=true | B) and
P(C=false | B) must add up to 1
eg. P(C=true | B=false) + P(C=false |B=false )=1

If you have a Boolean variable with k Boolean parents, this table has 2^(k+1)
probabilities (but only 2^k need to be stored)

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayesian Network

Two important properties:


1. Encodes the conditional independence relationships between the
variables in the graph structure
2. Is a compact representation of the joint probability distribution
over the variables

15 Dr. D. Dutta BITS Pilani, Pilani Campus


The Joint Probability
Distribution
Due to the Markov condition, we can compute the joint probability
distribution over all the variables X1, …, Xn in the Bayesian net using
the formula:

    P(X_1 = x_1, ..., X_n = x_n) = Π_{i=1..n} P(X_i = x_i | Parents(X_i))

Where Parents(Xi) means the values of the Parents of the node Xi with
respect to the graph

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Example of Bayesian Network

Using the network in the example, suppose you want to calculate:


P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) *
P(C = true | B = true) P( D = true | B = true)
= (0.4)*(0.3)*(0.1)*(0.95)

C D

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Example of Bayesian Network

Using the network in the example, suppose you want to calculate:


This is from the
P(A = true, B = true, C = true, D = true) graph structure
= P(A = true) * P(B = true | A = true) *
P(C = true | B = true) P( D = true | B = true)
= (0.4)*(0.3)*(0.1)*(0.95)

These numbers are from the conditional probability tables.
(Graph: A → B, B → C, B → D.)
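A small Python sketch (not from the slides) that computes the same joint probability directly from the CPTs given earlier for the network A → B → {C, D}.

# CPTs encoded as dictionaries keyed by the parent value(s).
P_A = {True: 0.4, False: 0.6}
P_B_given_A = {True: {True: 0.3, False: 0.7}, False: {True: 0.99, False: 0.01}}
P_C_given_B = {True: {True: 0.1, False: 0.9}, False: {True: 0.6, False: 0.4}}
P_D_given_B = {True: {True: 0.95, False: 0.05}, False: {True: 0.98, False: 0.02}}

def joint(a, b, c, d):
    # P(A, B, C, D) = P(A) * P(B|A) * P(C|B) * P(D|B)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c] * P_D_given_B[b][d]

print(joint(True, True, True, True))   # 0.4 * 0.3 * 0.1 * 0.95 = 0.0114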

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Inference

Using a Bayesian network to compute probabilities is called inference


In general, inference involves queries of the form:
P( X | E )

E = The evidence variable(s)

X = The query variable(s)

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Inference
HasDifficultyBreathing
HasFever HasWideMediastinum
HasCough

Has Covid
An example of a query would be:
P( Has Covid = true | HasFever = true, HasCough = true)
Note: Even though HasDifficultyBreathing and HasWideMediastinum
are in the Bayesian network, they are not given values in the query
(i.e. they do not appear either as query variables or evidence
variables)
They are treated as unobserved variables

20 Dr. D. Dutta BITS Pilani, Pilani Campus


The bad news

Exact inference is feasible in small to medium-sized networks


Exact inference in large networks takes a very long time
We resort to approximate inference techniques which are much faster
and give pretty good results

21 Dr. D. Dutta BITS Pilani, Pilani Campus


One last Unresolved Issue

We still haven’t said where we get the Bayesian network from. There
are two options:
1. Get an expert to design it
2. Learn it from data

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
Use a Directed Acyclic Graph (DAG)
to model the causality.
Martin Train Norman
Oversleep Strike Oversleep

Martin Norman
Late Late

Boss Project Office


Failure-in-Love Delay Dirty

Boss
Angry
23 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Attach prior probabilities to all root nodes:
  P(Martin oversleep):        T 0.01,  F 0.99
  P(Train strike):            T 0.1,   F 0.9
  P(Norman oversleep):        T 0.2,   F 0.8
  P(Boss failure-in-love):    T 0.01,  F 0.99
(Figure: the DAG of the previous slide, with nodes Martin Oversleep,
Train Strike, Norman Oversleep, Martin Late, Norman Late, Boss
Failure-in-Love, Project Delay, Office Dirty and Boss Angry.)
Angry
24 Dr. D. Dutta BITS Pilani, Pilani Campus
Attach prior probabilities to non-root nodes
Example
Each column is summed to 1.
Train strike
T F
Martin oversleep
T F T F
Martin Train Norman
T 0.95 0.8Strike 0.7 0.05
Oversleep
Martin Late Oversleep
F 0.05 0.2 0.3 0.95

Martin Norman Norman


Late Late untidy

Norman oversleep
Boss Project Office
Failure-in-Love Delay Dirty T F

Norman T 0.6 0.2

untidy F 0.4 0.8

Boss
Angry
25 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Attach prior probabilities to non-root nodes
Each column is summed to 1. Boss Failure-in-love

T F

Project Delay

T F T F
Martin Train Norman
Oversleep Strike
Office Dirty Oversleep
T F T F T F T F

very 0.98 0.85 0.6 0.5 0.3 0.2 0 0.01


Norman
Martin Norman
mid 0.02 0.15 Late
0.3 0.25 0.5 0.5Late 0.2 untidy
0.02
Boss Angry
little 0 0 0.1 0.25 0.2 0.3 0.7 0.07

no 0 0 0 0 0 0 0.1 0.9
Boss Project Office
Failure-in-Love Delay Dirty

Boss
Angry
26 Dr. D. Dutta BITS Pilani, Pilani Campus
Definition of Bayesian
Network
A Bayesian network is a directed acyclic graph with
the following properties:
Each node represents a random variable.
Each node representing a variable A with parent nodes representing
variables B1, B2,..., Bn is assigned a conditional probability table
(CPT):

    P(A | B_1, B_2, ..., B_n)

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Problems

How to inference?
How to learn the probabilities from data?
How to learn the structure from data?

Bad news: All of them are NP-Hard


What applications we may have?

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Example

P(Train Strike):  T 0.1,  F 0.9
P(Martin Late | Train Strike):        P(Norman Late | Train Strike):
  Train Strike:   T      F              Train Strike:   T      F
  Late = T:       0.6    0.5            Late = T:       0.8    0.1
  Late = F:       0.4    0.5            Late = F:       0.2    0.9
(Graph: Train Strike → Martin Late and Train Strike → Norman Late.)

Questions:
P (“Martin Late”, “Norman Late”, “Train Strike”)=? Joint distribution

P(“Martin Late”)=? Marginal distribution

P(“Matrin Late” | “Norman Late ”)=? Conditional distribution


29 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Joint distribution over A = “Martin Late”, B = “Norman Late”,
C = “Train Strike” (network and CPTs as on the previous slide):
  A  B  C  Probability
  T  T  T  0.048
  F  T  T  0.032
  T  F  T  0.012
  F  F  T  0.008
  T  T  F  0.045
  F  T  F  0.045
  T  F  F  0.405
  F  F  F  0.405
Questions:
P(“Martin Late”, “Norman Late”, “Train Strike”) = ?   Joint distribution
  P(A, B, C) = P(A | B, C) P(B | C) P(C) = P(A | C) P(B | C) P(C)
  e.g., P(A=T, B=T, C=T) = 0.6 × 0.8 × 0.1 = 0.048


30 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Marginal distribution over A and B, obtained by summing the joint over C
(joint table, network and CPTs as on the previous slides):
  A  B  Probability
  T  T  0.093
  F  T  0.077
  T  F  0.417
  F  F  0.413
Questions:
P(“Martin Late”, “Norman Late”) = ?   Joint (marginal) distribution
  P(A, B) = Σ_C P(A, B, C)
  e.g., P(A=T, B=T) = 0.048 + 0.045 = 0.093


31 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Marginal distribution over A (joint and P(A, B) tables, network and CPTs
as on the previous slides):
  A  Probability
  T  0.51
  F  0.49
Questions:
P(“Martin Late”) = ?   Marginal distribution
  P(A) = Σ_{B,C} P(A, B, C) = Σ_B P(A, B)
  e.g., P(A=T) = 0.093 + 0.417 = 0.51


32 Dr. D. Dutta BITS Pilani, Pilani Campus
Example
Conditional distribution (joint and marginal tables, network and CPTs
as on the previous slides):
  A  Probability        B  Probability
  T  0.51               T  0.17
  F  0.49               F  0.83
Questions:
P(“Martin Late” | “Norman Late”) = ?   Conditional distribution
  P(A | B) = P(A, B) / P(B)
  e.g., P(A=T | B=T) = 0.093 / 0.17 = 0.5471
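A Python sketch (not from the slides) that answers the three questions by enumerating the joint distribution of this example: A = "Martin Late", B = "Norman Late", C = "Train Strike".

from itertools import product

P_C = {True: 0.1, False: 0.9}
P_A_given_C = {True: 0.6, False: 0.5}   # P(A = T | C); P(A = F | C) = 1 - value
P_B_given_C = {True: 0.8, False: 0.1}

def joint(a, b, c):
    pa = P_A_given_C[c] if a else 1 - P_A_given_C[c]
    pb = P_B_given_C[c] if b else 1 - P_B_given_C[c]
    return pa * pb * P_C[c]

# Marginals are obtained by summing the joint distribution.
P_AB_TT = sum(joint(True, True, c) for c in (True, False))                        # 0.093
P_A_T = sum(joint(True, b, c) for b, c in product((True, False), repeat=2))       # 0.51
P_B_T = sum(joint(a, True, c) for a, c in product((True, False), repeat=2))       # 0.17

print(round(P_AB_TT, 3), round(P_A_T, 2), round(P_B_T, 2))
print(round(P_AB_TT / P_B_T, 4))    # P(A=T | B=T) ≈ 0.5471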
33 Dr. D. Dutta BITS Pilani, Pilani Campus
Neural Networks

▪ The human brain is made up of billions of cells called neurons.


▪ The cell body of the neuron gets signals sent to it from dendrites
(input).
▪ The neuron then essentially sums up the inputs to it, and will
subsequently fire, or output a signal, if a certain level is reached.
▪ This output goes through the axon and across to other cells by way
of connections called synapses.
▪ These synapses (output) are then connected to the dendrites of
other cells (input).
▪ Eventually the information sent will reach a destination where a
reaction may occur.

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Neural Networks

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Artificial Neural Networks

▪ Artificial Neural Networks (ANNs) are computational models inspired


by the structure and functioning of the human brain.
▪ These networks are a fundamental component of machine learning
and are designed to recognize patterns, make decisions, and
perform tasks that typically require human intelligence.
▪ The basic building block of an ANN is the artificial neuron. Neurons
are organized into layers: the input layer receives data, hidden
layers process information, and the output layer produces the
network's final output.
▪ Neurons in one layer are connected to neurons in the next layer
through weighted connections. Each connection has an associated
weight that influences the strength of the signal. Learning in ANNs
involves adjusting these weights based on training data.
▪ Artificial Neural Networks continue to evolve with advancements
such as deep learning, which involves training deep neural networks
with multiple hidden layers.
36 Dr. D. Dutta BITS Pilani, Pilani Campus
Artificial Neural Networks

37 Dr. D. Dutta BITS Pilani, Pilani Campus


McCullogh-Pitts model

(Figure: inputs x1, x2, x3, ..., xn with weights w1, w2, w3, ..., wn and a
bias b feed the summation Z of the neuron; the output is y = f(Z).)
38 Dr. D. Dutta BITS Pilani, Pilani Campus
McCullogh-Pitts model
In 1943, Warren McCulloch (a psychiatrist and neuroanatomist) and
Walter Pitts (a mathematician) started the modern era of neural
networks with the following model:

• spikes are interpreted as spike rates;


• synaptic strength are translated as synaptic weights;
• it does not require learning or adaptation.
• excitation means positive product between the incoming spike
rate and the corresponding synaptic weight; excitatory connection
has positive weights. All excitatory connections in a particular
neuron have the same weight.
• inhibition means negative product between the incoming spike
rate and the corresponding synaptic weight; inhibitory connection
has negative weights.
39 Dr. D. Dutta BITS Pilani, Pilani Campus
McCullogh-Pitts model

Each neuron has a fixed threshold such that if the net input to the
network is greater than the threshold the neuron should fire.
The threshold is set such that the inhibition is absolute. This means any
non-zero inhibitory input will prevent the neuron from firing.

40 Dr. D. Dutta BITS Pilani, Pilani Campus


McCullogh-Pitts model

(Figure: n excitatory inputs x1, ..., xn, each with weight w, and m
inhibitory inputs x_{n+1}, ..., x_{n+m}, each with weight −p, feed the
summation Z of the neuron; the output is y = f(Z).)
41 Dr. D. Dutta BITS Pilani, Pilani Campus
McCullogh-Pitts model

A neuron that receives n signals through excitatory connections and m
signals through inhibitory connections is shown in the figure.
The excitatory paths have weight w > 0 and the inhibitory connection
paths have weight −p.
F(Z) = 1 if Z ≥ θ and F(Z) = 0 if Z < θ,
where Z is the total input signal received by the neuron and θ is the
threshold.
The condition for absolute inhibition is θ > nw − p.
The neuron will fire if it receives k or more excitatory signals and no
inhibitory inputs, i.e. kw ≥ θ > (k − 1)w.

42 Dr. D. Dutta BITS Pilani, Pilani Campus


McCullogh-Pitts model
Truth table for AND:
  x1  x2  y
  0   0   0
  0   1   0
  1   0   0
  1   1   1
(Figure: a neuron with inputs x1 and x2, both with weight 1, computing
Z = x1 + x2 and output y = F(Z).)
F(Z) = 1 if Z ≥ 2 and F(Z) = 0 if Z < 2
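A minimal Python sketch (not from the slides) of this McCulloch-Pitts AND neuron, with both weights equal to 1 and threshold θ = 2.

def mcculloch_pitts(inputs, weights, theta):
    # Fires (outputs 1) only if the weighted input sum reaches the threshold.
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= theta else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mcculloch_pitts((x1, x2), (1, 1), theta=2))
# prints the AND truth table: only (1, 1) fires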

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

Proposed by Rosenblatt, 1958.


Basic building block of nearly all ANNs.
Z =I=1Nwixi and y = fN(Z)
where wi is the weight at the inputs xi where Z is the node (summation)
output and fN is a nonlinear operator.
y = sign(z) i.e y=f(z)=1 if z>0 and f(z)=-1 if z<=0
The output of the network thus is either +1 or -1, depending on the
input.

44 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

(Figure: inputs x1, x2, x3, ..., xn with weights w1, w2, w3, ..., wn feed
the summation Z of the neuron; the output is y = f(Z).)
45 Dr. D. Dutta BITS Pilani, Pilani Campus
Perceptron

(Figure: a two-input perceptron; x1 and x2 with weights w1 and w2, plus
a bias input x0 = 1 with weight w0, feed the summation Z; the output is
y = f(Z).)
46 Dr. D. Dutta BITS Pilani, Pilani Campus
Perceptron

The network can now be used for a classification task: it can decide
whether an input pattern belongs to one of two classes.
If the total input is positive, the pattern will be assigned to class +1, if
the total input is negative, the sample will be assigned to class -1.
The separation between the two classes in this case is a straight line,
given by the equation:
    w1 x1 + w2 x2 + w0 = 0
The single layer network represents a linear discriminant function:
    x2 = −(w1 / w2) x1 − w0 / w2
The weights determine the slope of the line and the bias determines the
`offset’, considering w0 = θ.

47 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

How do we learn the weights and biases in the network?


Iterative procedures are there.
Weights: the new value is computed by adding a correction to the old
value
    w_i(t + 1) = w_i(t) + Δw_i(t)
    θ(t + 1) = θ(t) + Δθ(t)
How do we compute Δw_i(t) and Δθ(t) in order to classify the learning
patterns correctly?

48 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron

We have a set of learning samples consisting of an input vector x and a


desired output d(x)
Classification task the d(x) is usually +1 or -1.
Perceptron learning rule is very simple and can be stated as follows:
1. Start with random weights for the connections.
2. Select an input vector x from the set of training samples.
3. If y ≠ d(x) (the perceptron gives an incorrect response), modify all
connections w_i according to Δw_i = d(x) x_i and the bias according to
Δθ = d(x);
4. Go back to 2.

49 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron: Example

Initially: w1 = 1; w2 = 2; θ = −2. x0 is always 1.

For sample A, x = (0.5; 1.5) and target value d(x) = +1


The network output is +1, so no weights are adjusted.

For sample B, x = (-0.5; 0.5)and target value d(x) = -1


The network output is -1, so no weights are adjusted.

For sample C, x = (0.5; 0.5)and target value d(x) = +1


The network output is -1, so weights are to be adjusted.
According to the perceptron learning rule, the weight changes are: Δw1
= 0.5, Δw2 = 0.5, Δθ = 1.
The new weights are now: w1 = 1.5, w2 = 2.5, θ = −1, and sample C is
classified correctly.
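A Python sketch (not from the slides) that applies the perceptron learning rule to the three samples of this example, using the activation y = +1 if w1 x1 + w2 x2 + θ > 0 and −1 otherwise.

def predict(w, theta, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + theta > 0 else -1

w, theta = [1.0, 2.0], -2.0
samples = [((0.5, 1.5), +1),    # A: classified correctly, no update
           ((-0.5, 0.5), -1),   # B: classified correctly, no update
           ((0.5, 0.5), +1)]    # C: misclassified, weights are adjusted

for x, d in samples:
    if predict(w, theta, x) != d:
        w[0] += d * x[0]        # delta w_i = d(x) * x_i
        w[1] += d * x[1]
        theta += d              # delta theta = d(x)
print(w, theta)                 # [1.5, 2.5] -1.0, as in the example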

50 Dr. D. Dutta BITS Pilani, Pilani Campus


Perceptron: Example

51 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Architecture

52 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Notation
▪ i an input unit;
▪ h a hidden unit;
▪ o an output unit;
▪ xp the pth input pattern vector;
▪ xpj the jth element of the pth input pattern vector;
▪ sp the input to a set of neurons when input pattern vector p is
clamped (i.e., presented to the network); often: the input of the
network by clamping input pattern vector p;
▪ dp the desired output of the network when input pattern vector p was
input to the network ;
▪ dpj the jth element of the desired output of the network when input
pattern vector p was input to the network;
▪ yp the activation values of the network when input pattern vector p
was input to the network;

53 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Notation
• ypj the activation values of element j of the network when input
pattern vector p was input to the network;
• W the matrix of connection weights;
• Wj the weights of the connections which feed into unit j;
• wjk the weight of the connection from unit j to unit k;
• Fj the activation function associated with unit j;
• jk the learning rate associated with weight wjk;
•  the biases to the units;
• j the bias input to unit j;
• Uj the threshold of unit j in Fj ;
• Ep the error in the output of the network when input pattern vector p
is input;
• E the total error.

54 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Feed
Forword
A feed-forward network has a layered structure.
Each layer consists of units which receive their input from units from a
layer directly below and send their output to units in a layer directly
above the unit
The Ni inputs are fed into the first layer of Nh,1 hidden units
The input units are merely 'fan-out' units; no processing takes place in
these units.
The activation of a hidden unit is a function Fi of the weighted inputs
plus a bias

The output of the hidden units is distributed over the next layer of Nh 2
hidden units, until the last layer of hidden units, of which the outputs
are fed into a layer of No output units

55 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Delta
Rule
The activation is a differentiable function of the total input, given
by 1 in which 2
we must set 3
The error measure Ep is defined as the total quadratic error for
pattern p at the output units: 4

where dpo is the desired output for unit o when pattern p is


clamped.
Here total error
We can write 5

From equation 2 we can get 6


56 Dr. D. Dutta BITS Pilani, Pilani Campus
Back Propagation ANN: Delta
Rule
When we define 7

We make the weight changes according to: 8

The trick is to figure out what pk should be for each unit k in the
network. The interesting result, which we now derive, is that there is
a simple recursive computation of these  's which can be
implemented by propagating error signals backward through the
network.
To compute pk we apply the chain rule to write this partial derivative as
the product of two factors, one factor reflecting the change in error
as a function of the output of the unit and one reflecting the change
in the output as a function of changes in the input
9
57 Dr. D. Dutta BITS Pilani, Pilani Campus
Back Propagation ANN: Delta
Rule
▪ By Equation 1
10

▪ To compute the first factor of equation 9, we consider two cases.


First, assume that unit k is an output unit k = o of the network. In this
case, it follows from the definition of Ep that
11

▪ Substituting 11 and equation 10 in equation 9, we get


12

▪ Secondly, if k is not an output unit but a hidden unit k = h, we do not


readily know the contribution of the unit to the output error of the
network. However, the error measure can be written as a function of
the net inputs from hidden to output layer;

58 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Delta
Rule
We use the chain rule to write

13

Substituting this in equation 9 yields


14
Equations 12 and 14 give a recursive procedure for computing the  's
for all units in the network, which are then used to compute the
weight changes according to equation 8.

59 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Understanding
Step one
• A learning pattern is clamped
• The activation values are propagated to the output units
• The actual network output is compared with the desired output values
• It usually end up with an error in each of the output units. Let's call this
error eo for a particular output unit o. We have to bring eo to zero.
• We know from the delta rule that, in order to reduce an error, we have
to adapt its incoming weights according to
Step two
• But it alone is not enough: when we only apply this rule, the weights
from input to hidden units are never changed.
• In order to adapt the weights from input to hidden units, we again want
to apply the delta rule.
• we do not have a value for  for the hidden units

60 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Understanding
• This is solved by the chain rule
• Distribute the error of an output unit o to all the hidden units that is it
connected to, weighted by this connection
• Differently , a hidden unit h receives a delta from each output unit o
equal to the delta of that output unit weighted with (= multiplied by)
the weight of the connection between those units.
• The activation function of the hidden unit; F’ has to be applied to the
delta, before the back-propagation process can continue.

61 Dr. D. Dutta BITS Pilani, Pilani Campus


Working with Back
Propagation ANN
The application of the generalized delta rule involves two phases
First phase
The input x is presented and propagated forward through the network
to compute the output values ypo for each output unit
This output is compared with its desired value do, resulting in an error
signal po for each output unit.
Second phase
Involves a backward pass through the network during which the error
signal is passed to each unit in the network and appropriate weight
changes are calculated.

62 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

The weight of a connection is adjusted by an amount proportional to the
product of an error signal δ on the unit k receiving the input and the
output of the unit j sending this signal along the connection:
    Δw_jk = γ δ_k y_j
If the unit is an output unit, the error signal is given by
    δ_o = (d_o − y_o) F'(s_o)
Take as the activation function F the 'sigmoid' function
    y = F(s) = 1 / (1 + e^(−s))
63 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

In this case the derivative is equal to
    F'(s) = y (1 − y)
The error signal for an output unit can be written as
    δ_o = (d_o − y_o) y_o (1 − y_o)
64 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN: Weight adjustment
with Sigmoid activation function

The error signal for a hidden unit is determined recursively in terms of
the error signals of the units to which it directly connects and the
weights of those connections. For the sigmoid activation function:
    δ_h = y_h (1 − y_h) Σ_o δ_o w_ho
65 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
Given
x1 =1st input pattern vector=[111]
d1 =desired output=1
 =learning rate =0.1
F= Activation function= Bipolar Sigmoidal= F(x)=2/(1+e-x) –1
F’(x)=0.5(1+ F(x)) (1- F(x))
 =biases to the units=0.1

66 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
(Figure: a 3–2–1 network. The three inputs are all 1; the weights from
the inputs to hidden unit h1 are 0.2 each and to hidden unit h2 are 0.1
each; both hidden biases are 0.1; the weights from h1 and h2 to the
output unit o1 are 0.1 each.)

67 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
sh1=0.2*1+ 0.2*1+ 0.2*1+0.1=0.7
sh2=0.1*1+ 0.1*1+ 0.1*1+0.1=0.4

y_h1 = F(s_h1) = 2/(1 + e^(−0.7)) − 1 = 0.336
y_h2 = F(s_h2) = 2/(1 + e^(−0.4)) − 1 = 0.1974
s_o1 = 0.1 × 0.336 + 0.1 × 0.1974 = 0.0534
y_o1 = F(s_o1) = 2/(1 + e^(−0.0534)) − 1 = 0.0267
F'(s_o1) = 0.5 (1 + F(s_o1)) (1 − F(s_o1))
δ_o1 = (1 − 0.0267) × 0.5 (1 + 0.0267) (1 − 0.0267) = 0.4863

68 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
δ_h1 = F'(s_h1) × δ_o1 × w_h1o1 = 0.5 (1 + 0.336)(1 − 0.336) × 0.4863 × 0.1 = 0.0216
δ_h2 = F'(s_h2) × δ_o1 × w_h2o1 = 0.5 (1 + 0.1974)(1 − 0.1974) × 0.4863 × 0.1 = 0.0234

Δw_i1h1 = 0.1 × 0.0216 × 1 = 0.00216
Δw_i2h1 = 0.1 × 0.0216 × 1 = 0.00216
Δw_i3h1 = 0.1 × 0.0216 × 1 = 0.00216
Δw_i1h2 = 0.1 × 0.0234 × 1 = 0.00234
Δw_i2h2 = 0.1 × 0.0234 × 1 = 0.00234
Δw_i3h2 = 0.1 × 0.0234 × 1 = 0.00234

69 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
Δw_h1o1 = 0.1 × 0.4863 × 0.336 = 0.01634
Δw_h2o1 = 0.1 × 0.4863 × 0.1974 = 0.0096

w_i1h1(new) = w_i1h1(old) + Δw_i1h1 = 0.2 + 0.00216 = 0.20216
w_i2h1(new) = w_i2h1(old) + Δw_i2h1 = 0.2 + 0.00216 = 0.20216
w_i3h1(new) = w_i3h1(old) + Δw_i3h1 = 0.2 + 0.00216 = 0.20216
w_i1h2(new) = w_i1h2(old) + Δw_i1h2 = 0.1 + 0.00234 = 0.10234
w_i2h2(new) = w_i2h2(old) + Δw_i2h2 = 0.1 + 0.00234 = 0.10234
w_i3h2(new) = w_i3h2(old) + Δw_i3h2 = 0.1 + 0.00234 = 0.10234
w_h1o1(new) = w_h1o1(old) + Δw_h1o1 = 0.1 + 0.01634 = 0.11634
w_h2o1(new) = w_h2o1(old) + Δw_h2o1 = 0.1 + 0.0096 = 0.1096
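A Python sketch (not from the slides) reproducing one forward and backward pass of this example with the same learning rate and bipolar sigmoid; since the hand calculation rounds intermediate values, small differences in the last digits are expected.

import math

def F(x):                       # bipolar sigmoid
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def dF(fx):                     # derivative expressed via the output: 0.5 * (1 + F) * (1 - F)
    return 0.5 * (1.0 + fx) * (1.0 - fx)

x = [1.0, 1.0, 1.0]; d = 1.0; lr = 0.1
w_ih = [[0.2, 0.2, 0.2], [0.1, 0.1, 0.1]]      # weights input -> hidden (h1, h2)
theta_h = [0.1, 0.1]
w_ho = [0.1, 0.1]                              # weights hidden -> output

s_h = [sum(w * xi for w, xi in zip(w_ih[h], x)) + theta_h[h] for h in range(2)]
y_h = [F(s) for s in s_h]                      # approx. [0.336, 0.197]
s_o = sum(w * y for w, y in zip(w_ho, y_h))
y_o = F(s_o)                                   # approx. 0.027

delta_o = (d - y_o) * dF(y_o)                  # approx. 0.4863
delta_h = [dF(y_h[h]) * delta_o * w_ho[h] for h in range(2)]   # approx. [0.0216, 0.0234]

w_ho = [w + lr * delta_o * y_h[h] for h, w in enumerate(w_ho)]
w_ih = [[w + lr * delta_h[h] * xi for w, xi in zip(w_ih[h], x)] for h in range(2)]
print([round(v, 5) for v in w_ho])             # approx. [0.11634, 0.1096]
print([[round(v, 5) for v in row] for row in w_ih])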

70 Dr. D. Dutta BITS Pilani, Pilani Campus


Back Propagation ANN:
Example
(Figure: the updated network. The input-to-h1 weights are now 0.20216
each, the input-to-h2 weights 0.10234 each, the h1-to-o1 weight 0.11634
and the h2-to-o1 weight 0.1096; the biases are unchanged at 0.1.)

71 Dr. D. Dutta BITS Pilani, Pilani Campus



Linear Classifiers
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1

How would you


classify this data?

Linear Classifiers
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1

How would you


classify this data?

Linear Classifiers
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1

How would you


classify this data?

Linear Classifiers
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1

How would you


classify this data?

Linear Classifiers
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1

Any of these would


be fine..

..but which is best?



Classifier Margin
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1 Define the margin


of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.

Maximum Margin
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1 The maximum


margin linear
classifier is the
linear classifier
with the maximum
margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Linear SVM

Maximum Margin
x f yest
f(x,w,b) = sign(w. x - b)
denotes +1

denotes -1 The maximum


margin linear
classifier is the
linear classifier
Support Vectors
are those
with the maximum
datapoints that the margin.
margin pushes up This is the
against
simplest kind of
SVM (Called an
LSVM)
Linear SVM
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small
f(x,w,b) error xin- the
= sign(w. b)
denotes +1
location of the boundary (it’s been
denotes -1 jolted in its perpendicular
The maximumdirection)
this gives us leastmargin
chance linear
of causing a
misclassification. classifier is the
3. Leave-one-out cross validation
linear classifier
Support Vectors (LOOCV) is easy since the model
with the, um, is
are those immune to removal of any non-
datapoints that the maximum margin.
support-vector datapoints.
margin pushes up This is the
against 4. There’s some theory (using kind
simplest Vapnik–
of
Chervonenkis (VC) dimension) that is
SVM (Called an
related to (but not the same as) the
LSVM)
proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin
Plus-Plane
Classifier Boundary
Minus-Plane

• How do we represent this mathematically?


• …in m input dimensions?
Specifying a line and margin
Plus-Plane
Classifier Boundary
Minus-Plane

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as.. +1 if w . x + b >= 1


-1 if w . x + b <= -1
Universe if -1 < w . x + b < 1
explodes
Computing the margin width
M = Margin Width

How do we compute
M in terms of w
and b?

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Computing the margin width
M = Margin Width

How do we compute
M in terms of w
and b?

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Let u and v be two vectors on the Plus
Plane. What is w . ( u – v ) ?

And so of course the vector w is also


perpendicular to the Minus Plane
Computing the margin width
x+ M = Margin Width

x-
How do we compute
M in terms of w
and b?

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
Any location in R^m: not
necessarily a
datapoint
Computing the margin width
x+ M = Margin Width

x-
How do we compute
M in terms of w
and b?

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- +  w for some value of . Why?
Computing the margin width
x+ M = Margin Width
The line from x− to x+ is perpendicular to the planes.
So to get from x− to x+, travel some distance in direction w.
How do we compute M in terms of w and b?
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- +  w for some value of . Why?
Computing the margin width
x+ M = Margin Width

x-

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- +  w
• |x+ - x- | = M
It’s now easy to get M
in terms of w and b
Computing the margin width
x+ M = Margin Width

x-

What we know:
• w · x+ + b = +1
• w · x− + b = −1
• x+ = x− + λw
• |x+ − x−| = M
It’s now easy to get M in terms of w and b:
  w · (x− + λw) + b = 1
  =>  w · x− + b + λ w·w = 1
  =>  −1 + λ w·w = 1
  =>  λ = 2 / (w·w)
Computing the margin width
M = Margin Width = 2 / sqrt(w·w)
What we know:
• w · x+ + b = +1
• w · x− + b = −1
• x+ = x− + λw
• |x+ − x−| = M
• λ = 2 / (w·w)
M = |x+ − x−| = |λw| = λ|w| = λ sqrt(w·w) = 2 sqrt(w·w) / (w·w)
  = 2 / sqrt(w·w)
Learning the Maximum Margin Classifier
M = Margin Width = 2 / sqrt(w·w)

Given a guess of w and b we can


• Compute whether all data points in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space
of w’s and b’s to find the widest margin that matches all the
datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion?
EM? Newton’s Method?
Learning via Quadratic Programming
• QP is a well-studied class of optimization
algorithms to maximize a quadratic function
of some real-valued variables subject to
linear constraints.
Quadratic Programming
Find  arg max_u  c + d^T u + (u^T R u) / 2        (quadratic criterion)
Subject to n additional linear inequality constraints:
  a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
  a_21 u_1 + a_22 u_2 + ... + a_2m u_m <= b_2
  ...
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n
And subject to e additional linear equality constraints:
  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
  ...
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
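In practice the QP is usually handed to a library. A hedged sketch using scikit-learn's linear SVC (the tool choice is an assumption, not prescribed by the slides) on made-up toy data, recovering the margin as 2 / sqrt(w·w):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],     # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)     # a large C approximates the hard-margin LSVM
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)      # M = 2 / sqrt(w . w)
print(w, b, round(margin, 3))
print(clf.support_vectors_)           # the datapoints the margin pushes up against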
Lazy Learner: Instance-
Based Methods
• Instance-based learning:
– Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
• Typical approaches
– k-nearest neighbor (KNN) approach
• Instances represented as points in a Euclidean space.
– Locally weighted regression
• Constructs local approximation
– Case-based reasoning
• Uses symbolic representations and knowledge-based
inference

95 Dr. D. Dutta BITS Pilani, Pilani Campus


The k-Nearest Neighbor
(kNN) Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance,
dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common value among
the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical
set of training examples

(Figure: a query point xq surrounded by + and − training examples; its
k nearest neighbours determine the predicted class.)
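A minimal kNN sketch in Python (not from the slides), using Euclidean distance and a majority vote among the k closest training tuples; the toy data is made up.

import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (vector, label) pairs; math.dist needs Python 3.8+
    neighbors = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "-"), ((1.5, 1.8), "-"), ((5.0, 5.0), "+"),
         ((5.5, 4.5), "+"), ((4.8, 5.2), "+")]
print(knn_classify(train, (5.1, 4.9), k=3))   # "+"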
96 Dr. D. Dutta BITS Pilani, Pilani Campus
Discussion on the k-NN
Algorithm
k-NN for real-valued prediction for a given unknown tuple
– Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to
their distance to the query x_q:   w = 1 / d(x_q, x_i)^2
• Give greater weight to closer neighbors
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be
dominated by irrelevant attributes
– To overcome it, axes stretch or elimination of the least relevant
attributes

97 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiclass Classification

Multiclass classification refers to categorizing instances into one of


three or more distinct classes, as opposed to binary classification
which only deals with two classes.
In multiclass classification, the goal is to assign each input instance
(such as an image, text, or tabular data) to one, and only one, out of
multiple classes.
• Example: Predicting whether an animal is a cat, dog, or bird based
on certain features.

98 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiclass Classification
Common Algorithms Used:
• Logistic Regression (Multinomial): A variant of logistic regression
used for multiclass problems.
• Decision Trees: Can handle both binary and multiclass classification
by splitting data based on feature values.
• Random Forest: An ensemble method that creates multiple decision
trees and aggregates their predictions for multiclass classification.
• Support Vector Machines (SVM): Usually used for binary
classification, but can be adapted to multiclass problems using
methods like "One-vs-One" or "One-vs-All."
• Neural Networks: Particularly useful for large datasets and complex
problems; commonly used in image classification.
• K-Nearest Neighbors (KNN): Classifies new instances by looking at
the "k" closest examples in the training set.

99 Dr. D. Dutta BITS Pilani, Pilani Campus
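As a rough illustration of the "One-vs-All" (one-vs-rest) strategy mentioned above, the sketch below trains one simple logistic-regression model per class against all the other classes and predicts the class whose model scores highest. The gradient-descent trainer, the tiny made-up data set, and all variable names are assumptions for illustration, not a specific library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary_logreg(X, y, lr=0.1, epochs=2000):
    """Plain gradient-descent logistic regression for labels y in {0, 1}."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def fit_one_vs_rest(X, y):
    """One binary model per class: class c vs. the rest."""
    return {c: fit_binary_logreg(X, (y == c).astype(float)) for c in np.unique(y)}

def predict_one_vs_rest(models, X):
    classes = list(models)
    scores = np.vstack([sigmoid(X @ w + b) for c in classes for (w, b) in [models[c]]])
    return np.array(classes)[np.argmax(scores, axis=0)]

# toy usage with made-up 2-D points for three classes
X = np.array([[0, 0], [0.2, 0.1], [3, 3], [3.1, 2.8], [0, 3], [0.2, 3.1]], dtype=float)
y = np.array(["cat", "cat", "dog", "dog", "bird", "bird"])
models = fit_one_vs_rest(X, y)
print(predict_one_vs_rest(models, np.array([[0.1, 0.0], [2.9, 3.0]])))  # expected: ['cat' 'dog']
```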


Thanks

Any Question?

10 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 8

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Classification and Prediction
1. Concepts of classification and prediction
2. Decision trees for classification
3. Rule based classification, Bayesian classification
4. Evaluation of classification techniques
5. Prediction Techniques
As per session plan
• Classification and Prediction
o Concepts of classification and prediction
o Decision trees for classification
o Rule based classification
o Bayesian classification,
o Evaluation of classification techniques

2
o Prediction Techniques Dr. D. Dutta BITS Pilani, Pilani Campus
Using IF-THEN Rules for
Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule are triggered, need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification
cost per class
– Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts

3 Dr. D. Dutta BITS Pilani, Pilani Campus
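A minimal sketch of how coverage(R) and accuracy(R) could be computed for one IF-THEN rule over a tuple list. The dictionary-based data layout, the predicate form, and the three example tuples are assumptions for illustration.

```python
def rule_stats(D, antecedent, consequent):
    """coverage(R) = |covered| / |D|; accuracy(R) = |correctly classified| / |covered|."""
    covered = [t for t in D if antecedent(t)]
    correct = [t for t in covered if consequent(t)]
    coverage = len(covered) / len(D)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R: IF age = youth AND student = yes THEN buys_computer = yes
D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
    {"age": "senior", "student": "no", "buys_computer": "no"},
]
cov, acc = rule_stats(
    D,
    antecedent=lambda t: t["age"] == "youth" and t["student"] == "yes",
    consequent=lambda t: t["buys_computer"] == "yes",
)
print(cov, acc)   # 0.666..., 0.5
```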


Rule Extraction from a
Decision Tree
◼ Rules are easier to understand than large trees
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
◼ Rules are mutually exclusive and exhaustive

[Figure: buys_computer decision tree — root age? with branches <=30, 31..40, >40; the <=30 branch tests student? (no → no, yes → yes); the 31..40 branch predicts yes; the >40 branch tests credit rating? (excellent → no, fair → yes)]

Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
4 Dr. D. Dutta BITS Pilani, Pilani Campus
Rule Induction: Sequential
Covering Method
Sequential covering algorithm: Extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially, each for a given class Ci will cover many
tuples of Ci but none (or few) of the tuples of other classes
Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
Compared with decision-tree induction: learning a set of rules
simultaneously

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Sequential Covering Method
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: positive examples in the data space, with regions showing the examples covered by Rule 1, Rule 2, and Rule 3]
6 Dr. D. Dutta BITS Pilani, Pilani Campus


Simultaneous Rules
Generation

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Model Evaluation and
Selection
Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000
• Given m classes, an entry, CMi,j in a confusion matrix indicates # of
tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
A\P    C     ¬C
C      TP    FN    P
¬C     FP    TN    N
       P’    N’    All

◼ Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
   Accuracy = (TP + TN)/All
◼ Error rate: 1 – accuracy, or Error rate = (FP + FN)/All
◼ Class Imbalance Problem:
   ◼ One class may be rare, e.g. fraud, or HIV-positive
   ◼ Significant majority of the negative class and minority of the positive class
   ◼ Sensitivity: True Positive recognition rate; Sensitivity = TP/P
   ◼ Specificity: True Negative recognition rate; Specificity = TN/N

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive

Recall: completeness – what % of positive tuples did the classifier label


as positive?
Perfect score is 1.0
Inverse relationship between precision & recall

F measure (F1 or F-score): harmonic mean of precision and recall,

Fß: weighted measure of precision and recall


– assigns ß times as much weight to recall as to precision

11 Dr. D. Dutta BITS Pilani, Pilani Campus


Classifier Evaluation Metrics:
Example

Actual Class\Predicted class    cancer = yes    cancer = no    Total    Recognition (%)
cancer = yes                    90              210            300      30.00 (sensitivity)
cancer = no                     140             9560           9700     98.56 (specificity)
Total                           230             9770           10000    96.50 (accuracy)

– Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%

12 Dr. D. Dutta BITS Pilani, Pilani Campus
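The numbers above can be reproduced with a few lines of arithmetic; this minimal sketch simply plugs the cancer confusion matrix into the metric definitions from the previous slides (note that (TP + TN)/All works out to 96.50%).

```python
# Confusion-matrix entries from the cancer example
TP, FN = 90, 210      # actual cancer = yes
FP, TN = 140, 9560    # actual cancer = no
P, N = TP + FN, FP + TN            # 300 positives, 9700 negatives

accuracy    = (TP + TN) / (P + N)  # 0.9650
error_rate  = 1 - accuracy         # 0.0350
sensitivity = TP / P               # 0.3000 (same as recall)
specificity = TN / N               # 0.9856
precision   = TP / (TP + FP)       # 0.3913
recall      = sensitivity
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"acc={accuracy:.4f} sens={sensitivity:.4f} spec={specificity:.4f} "
      f"prec={precision:.4f} F1={f1:.4f}")
```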


Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data

13 Dr. D. Dutta BITS Pilani, Pilani Campus
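A minimal sketch of k-fold cross-validation written against a generic pair of callables for training and scoring a model; the shuffling seed, the majority-class toy model, and the made-up data are assumptions for illustration.

```python
import random
from collections import Counter

def k_fold_cross_validation(data, k, train_fn, accuracy_fn, seed=0):
    """Partition data into k folds; at iteration i use fold Di as the test set
    and the remaining folds as the training set; return the average accuracy."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal-sized folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k

# toy usage: a majority-class "model" over (features, label) tuples
data = [((x,), "yes" if x > 5 else "no") for x in range(20)]
majority_train = lambda train: Counter(label for _, label in train).most_common(1)[0][0]
accuracy = lambda model, test: sum(label == model for _, label in test) / len(test)
print(k_fold_cross_validation(data, k=10, train_fn=majority_train, accuracy_fn=accuracy))
```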


Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics)
curves: for visual comparison of
classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a
measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
◼ Vertical axis represents the true positive rate
◼ Horizontal axis represents the false positive rate
◼ The plot also shows a diagonal line
◼ A model with perfect accuracy will have an area of 1.0
14 Dr. D. Dutta 14
BITS Pilani, Pilani Campus
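A sketch of how the ROC points and the area under the curve could be computed by ranking test tuples on a model's score for the positive class. The score and label arrays below are made-up values for illustration only.

```python
def roc_points(scores, labels):
    """labels: 1 = positive, 0 = negative. Returns (FPR, TPR) pairs as the
    decision threshold sweeps down the list ranked by decreasing score."""
    P = sum(labels)
    N = len(labels) - P
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    # trapezoidal area under the ROC curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # illustrative model scores
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```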
Issues Affecting Model
Selection
Accuracy
– classifier accuracy: predicting class label
Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
– understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Numeric Prediction

Models continuous-valued functions, i.e., predicts unknown or


missing values

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Linear Regression

It is a statistical and machine learning technique used to model the


relationship between a dependent variable and one or more
independent variables. The objective of linear regression is to find
the linear relationship that best fits the observed data. This
relationship is represented by a linear equation in the form:
Y=b0​+b1⋅X1​+b2⋅X2​+…+bn⋅Xn
• Y is the dependent variable (the variable we are trying to predict).
• b0 is the y-intercept, the value of Y when all X's are 0.
• b1​,b2​,…,bn are the coefficients, representing the change in Y for a
one-unit change in X1​,X2​,…,Xn respectively.
• X1​,X2​,…,Xn are the independent variables.

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

In a simple linear regression (with one independent variable), the


equation becomes:
Y=b0​+b1⋅X
The goal of linear regression is to determine the values of b0 and b1
that minimize the difference between the predicted values (Ŷ) and
the actual values (Y). In other words it seeks the optimal line that
minimizes the sum of squared differences between predicted and
actual values. The blue line is referred to as the best-fit straight line.

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

But how does linear regression find out which is the best fit line?
The goal of the linear regression algorithm is to get the best values for
B0 and B1 to find the best fit line. The best fit line is a line that has
the least error which means the error between predicted values and
actual values should be minimum.
Random Error(Residuals)
In regression, the difference between the observed value of the
dependent variable (yi) and the predicted value (predicted) is called
the residuals.
εi = ypredicted – yi
where ypredicted = B0 + B1 Xi

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

$y_i = b_0 + b_1 x_i + \varepsilon_i$

[Figure: scatter plot with the fitted regression line; $b_0$ is the intercept and $b_1$ the slope]
20 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

What is the best fit line?


In simple terms, the best fit line is a line that fits the given scatter plot in
the best way. Mathematically, the best fit line is obtained by
minimizing the Residual Sum of Squares (RSS).
Cost Function for Linear Regression
It helps to work out the optimal values for B0 and B1, which provides the
best fit line for the data points.
In Linear Regression, generally Mean Squared Error (MSE) cost
function is used, which is the average of squared error that occurred
between the ypredicted and yi.

Using the MSE function, we’ll update the values of B0 and B1 such that
the MSE value settles at the minima. These parameters can be
determined using the gradient descent method such that the value
for the cost function is minimum.
21 Dr. D. Dutta BITS Pilani, Pilani Campus
Simple Linear Regression

Gradient Descent for Linear Regression


Gradient Descent is one of the optimization algorithms that optimize the
cost function (objective function) to reach the optimal minimal
solution.
A regression model optimizes the gradient descent algorithm to update
the coefficients of the line by reducing the cost function by randomly
selecting coefficient values and then iteratively updating the values
to reach the minimum cost function.

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

In the gradient descent algorithm, the number of steps you’re taking


can be considered as the learning rate, and this decides how fast
the algorithm converges to the minima.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

Gradient Descent for Linear Regression


Linear Regression Model:
Y=b0​+b1⋅X
Objective:
Minimize the cost function, often represented as the Mean Squared
Error (MSE):
$J(b_0, b_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)^2$
• m is the number of training examples.
• Yi​ is the actual output for the i-th example.
• Ŷi​ is the predicted output for the i-th example.

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

Gradient Descent Algorithm:


1. Initialize Coefficients:
Start with some initial values for b0​ and b1​.
2. Calculate Predictions:
For each training example, calculate the predicted output Ŷi​ using
the current values of b0​ and b1​.
3. Compute Gradients:
Calculate the partial derivatives of the cost function with respect to
each coefficient: ​
∂J​/∂b0
∂J​/∂b1

25 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

4. Update Coefficients:
Update the coefficients using the gradients and a learning rate (α):
b0​=b0​−α*∂J​/∂b0
b1​=b1​−α*∂J​/∂b1
5. Repeat:
Repeat steps 2-4 until convergence (the cost function reaches a
minimum or changes very slowly).
Partial Derivatives:
The partial derivatives with respect to b0​ and b1 ​ are calculated as
follows:
$\frac{\partial J}{\partial b_0} = -\frac{1}{m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)$
$\frac{\partial J}{\partial b_1} = -\frac{1}{m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)\cdot X_i$

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Linear Regression

Learning Rate (α):


The learning rate is a hyperparameter that controls the size of the steps
taken during the optimization process. It is crucial to choose an
appropriate learning rate to ensure convergence and prevent
overshooting the minimum.
Convergence:
The algorithm is considered to have converged when the change in the
cost function becomes very small, indicating that further iterations
are not significantly improving the model.

27 Dr. D. Dutta BITS Pilani, Pilani Campus
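Putting the steps above together, here is a minimal NumPy sketch of simple linear regression fitted with batch gradient descent. The toy data, the learning rate, and the epoch count are illustrative assumptions, not values from the slides.

```python
import numpy as np

def fit_simple_linear_regression(X, Y, alpha=0.02, epochs=5000):
    """Batch gradient descent on J(b0, b1) = (1/2m) * sum((Y - (b0 + b1*X))^2)."""
    b0, b1 = 0.0, 0.0                                 # step 1: initialize coefficients
    m = len(Y)
    for _ in range(epochs):
        Y_hat = b0 + b1 * X                           # step 2: predictions
        grad_b0 = -(1.0 / m) * np.sum(Y - Y_hat)      # step 3: dJ/db0
        grad_b1 = -(1.0 / m) * np.sum((Y - Y_hat) * X)  #         dJ/db1
        b0 -= alpha * grad_b0                         # step 4: update with learning rate alpha
        b1 -= alpha * grad_b1
    return b0, b1

# toy data roughly on the line Y = 2 + 3X, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
Y = 2 + 3 * X + rng.normal(0, 1, 100)
print(fit_simple_linear_regression(X, Y))   # roughly (2, 3)
```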


Polynomial Regression

It is a type of regression analysis in which the relationship between the


independent variable (X) and the dependent variable (Y) is modeled
as an n-th degree polynomial. In contrast to simple linear regression,
which assumes a linear relationship, polynomial regression allows
for a more flexible curve to fit the data.
The general form of a polynomial regression equation is given by:
Y=b0​+b1⋅X+b2⋅X2+…+bn⋅Xn
• Y is the dependent variable.
• X is the independent variable.
• b0​,b1​,…,bn​ are the coefficients.
• n is the degree of the polynomial.

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Polynomial Regression

Key Concepts:
1. Degree of the Polynomial (n):
The degree of the polynomial determines the complexity of the curve.
Higher degrees allow the model to capture more intricate patterns in the
data but may also lead to overfitting.
2. Overfitting:
Overfitting occurs when the model fits the training data too closely,
capturing noise rather than the underlying pattern. Regularization
techniques or model selection methods may be used to address overfitting.
3. Underfitting:
Underfitting occurs when the model is too simple to capture the underlying
pattern in the data. Increasing the degree of the polynomial may help
address underfitting.
4. Model Interpretation:
Polynomial regression can complicate the interpretation of individual
coefficients. The relationship between X and Y becomes more complex as
the degree of the polynomial increases.
29 Dr. D. Dutta BITS Pilani, Pilani Campus
Polynomial Regression

Example:
Consider a scenario where the relationship between the number of
hours of study (X) and the exam score (Y) is explored using
polynomial regression. The equation might take the form:
Exam Score=b0​+b1⋅Hours of Study+b2⋅(Hours of Study)2
Here, the second-degree polynomial allows the model to capture a
quadratic relationship between hours of study and exam scores.

30 Dr. D. Dutta BITS Pilani, Pilani Campus
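A short sketch of fitting the quadratic study-hours model with NumPy's polynomial fitting routine (np.polyfit). The hours and score numbers are made up purely for illustration.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
score = np.array([35, 48, 58, 66, 72, 76, 78, 79], dtype=float)   # a flattening curve

# degree-2 fit: score ≈ b2*hours^2 + b1*hours + b0
# (np.polyfit returns coefficients from the highest degree down)
b2, b1, b0 = np.polyfit(hours, score, deg=2)
print(b0, b1, b2)

predict = np.poly1d([b2, b1, b0])
print(predict(5.5))   # predicted exam score for 5.5 hours of study
```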


Regression

Apart from linear and polynomial regression there are many regression
algorithms
1. Ridge Regression
2. Lasso Regression
3. ElasticNet Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Gradient Boosting Regression (e.g., XGBoost, LightGBM,
CatBoost)
7. Support Vector Regression (SVR)
8. K-Nearest Neighbors (KNN) Regression
9. Bayesian Regression
10. Huber Regression

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

32 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 7

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents
As per syllabus
1. Classification and Prediction
1. Concepts of classification and prediction
2. Decision trees for classification
3. Rule based classification, Bayesian classification
4. Evaluation of classification techniques
5. Prediction Techniques
As per session plan
• Classification and Prediction
o Concepts of classification and prediction
o Decision trees for classification
o Rule based classification
o Bayesian classification,
o Evaluation of classification techniques

2
o Prediction Techniques Dr. D. Dutta BITS Pilani, Pilani Campus
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Supervised Learning

▪ Classification
▪ Predicts categorical class labels (discrete or nominal)
▪ Classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
▪ Numeric Prediction
▪ Models continuous-valued functions, i.e., predicts unknown or
missing values

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Classification – A two
stepped process
▪ Model construction: Describing a set of predetermined classes
▪ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
▪ The set of tuples used for model construction is training set
▪ The model is represented as classification rules, decision trees, or
mathematical formulae

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Classification – A two
stepped process
▪ Model usage: For classifying future or unknown objects
▪ Estimate accuracy of the model
▪ The known label of test sample is compared with the classified
result from the model
▪ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
▪ Test set is independent of training set. By having a separate
test dataset, we can ensure that the model is not overfitting
and is performing well on new, unseen data.
▪ If the accuracy is acceptable, use the model to classify new data
▪ Note: If the test set is used to select models, it is called validation
(test) set

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Step 1: Model Construction

Classification
Algorithms
Training
Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
7 Dr. D. Dutta BITS Pilani, Pilani Campus
Step 2: Using the Model for
Classification

Classifier

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (George, Professor, 5) → Tenured?
8 Dr. D. Dutta BITS Pilani, Pilani Campus
Prediction
The practice of using data to create predictions or foresee future events is known
as machine learning prediction. Building models that can recognize patterns
in data and utilize those patterns to create precise predictions about novel,
unforeseen data is the aim of machine learning prediction.

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Prediction

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Decision Tree Induction: An
Example
❑ Training data set: Buys_computer

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

❑ Resulting tree: root age? with branches <=30, 31…40, >40; the <=30 branch tests student? (no → no, yes → yes); the 31…40 branch predicts yes; the >40 branch tests credit rating? (excellent → no, fair → yes)
11 Dr. D. Dutta BITS Pilani, Pilani Campus
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Brief Review of Entropy

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Information needed (after using A to split D into v partitions) to classify D:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Attribute Selection:
Information Gain
 Class P: buys_computer = “yes”         Class N: buys_computer = “no”

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

$\frac{5}{14}I(2,3)$ means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly,
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$

(The 14-tuple buys_computer training data set is the one shown on the earlier “Decision Tree Induction: An Example” slide.)
15 Dr. D. Dutta BITS Pilani, Pilani Campus
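The Info(D), Info_age(D), and Gain values above can be checked with a short script. This is a minimal sketch that recomputes entropy-based information gain for the buys_computer data; the list-of-dictionaries layout is just one convenient encoding of the same 14 tuples.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target="buys_computer"):
    base = entropy([r[target] for r in rows])          # Info(D)
    split = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        split += len(subset) / len(rows) * entropy(subset)   # Info_A(D)
    return base - split                                 # Gain(A)

data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
cols = ["age", "income", "student", "credit_rating", "buys_computer"]
rows = [dict(zip(cols, r)) for r in data]

for a in ["age", "income", "student", "credit_rating"]:
    print(a, round(info_gain(rows, a), 3))   # age ≈ 0.246 is the largest gain
```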
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information requirement for
A is selected as the split-point for A. The minimum expected
information requirement refers to the point at which the entropy of
the dataset is minimized after the split.
Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the
set of tuples in D satisfying A > split-point
16 Dr. D. Dutta BITS Pilani, Pilani Campus
Underfitting and Overfitting

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or
outliers
– Poor accuracy for unseen samples
Two approaches to avoid overfitting
– Prepruning: Halt tree construction early ̵ do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree - get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Enhancements to Basic
Decision Tree Induction
Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
Attribute construction
– Create new attributes based on existing ones that are sparsely
represented
– This reduces fragmentation, repetition, and replication

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts
class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural
network classifiers
Incremental: Each training example can incrementally increase/decrease
the probability that a hypothesis is correct - prior knowledge can be
combined with observed data
Standard: Even when Bayesian methods are computationally intractable,
they can provide a standard of optimal decision making against which
other methods can be measured

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Sources of uncertainty
Uncertain inputs
– Missing data
– Noisy data
Uncertain knowledge
– Multiple causes lead to multiple effects
– Incomplete enumeration of conditions or effects
• Incomplete enumeration
– Selecting alternatives by only looking at parts of the
solution space by applying certain heuristics. This
provides approximate solutions, but not necessarily
optimal ones.
– Incomplete knowledge of causality in the domain
– Probabilistic/stochastic effects

21 Dr. D. Dutta BITS Pilani, Pilani Campus


Sources of uncertainty
Uncertain outputs
– Abduction and induction are inherently uncertain
• Abductive reasoning, or abduction, is making a probable conclusion from what
you know. If you see an abandoned bowl of hot soup on the table, you can use
abduction to conclude the owner of the soup is likely returning soon.
• Induction, is making an inference based on an observation, often of a sample.
You can induce that the soup is tasty if you observe all of your friends
consuming it.
– Default reasoning, even in deductive fashion, is uncertain
• Default reasoning is a form of nonmonotonic reasoning where plausible
conclusions are inferred based on general rules which may have exceptions
(defaults).
– Incomplete deductive inference may be uncertain
• Deductive reasoning starts with the assertion of a general rule and proceeds
from there to a guaranteed specific conclusion. Deductive reasoning moves
from the general rule to the specific application
Probabilistic reasoning only gives probabilistic results (summarizes
uncertainty from various sources)

22 Dr. D. Dutta BITS Pilani, Pilani Campus


The frequency interpretation
of probability
The frequency interpretation: The probability that some specific outcome
of a process will be obtained can be interpreted as the relative
frequency with which that outcome would be obtained if the process
were repeated a large number of times under similar conditions.

e.g. the probability of obtaining a head in a fair coin toss is ½ because


the relative frequency of heads should be ½ if I were to flip a coin
many times.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Joint Probability
Joint probability refers to the probability of two (or more) events
happening simultaneously.

Mathematically, the joint probability of two events A and B is written as:


P(A∩B) or P(A,B)

For independent events: When events do not affect each other, the
joint probability is the product of their individual probabilities:
P(A∩B)=P(A)×P(B)

For dependent events: If the occurrence of one event affects the other,
the joint probability is calculated using conditional probability:
P(A∩B)=P(A)×P(B∣A)

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Marginal Probability
Marginal probability refers to the probability of a single event occurring,
regardless of the outcome of other variables or events.

If we have two events A and B, the marginal probability of A is calculated


by summing over all possible values of B in the joint probability
P(A,B) :
P(A)=∑BP(A,B)

25 Dr. D. Dutta BITS Pilani, Pilani Campus


Probability Theory
Here’s a joint distribution over two binary value variables A and B

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Probability Theory
We get the marginal distribution over B by simply adding up the different
possible values of A for any value of B (and put the result in the
“margin”).

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Conditional Probability
Conditional probabilities allow us to understand how the probability
of an event A changes after it has been learned that some other
event B has occurred.
The key concept for thinking about conditional probabilities is that the
occurrence of B reshapes the sample space for subsequent events.
- That is, we begin with a sample space S
- A and B ⊆ S
- The conditional probability of A given that B has occurred looks just at the subset of the sample space for B.
The conditional probability of A given B is denoted Pr(A | B).
- Importantly, according to Bayesian orthodoxy, all probability distributions are implicitly or explicitly conditioned on the model.
[Figure: Venn diagram of the sample space S with events A and B]

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Conditional Probability
By definition: If A and B are two events such that Pr(B) > 0, then:
$Pr(A \mid B) = \frac{Pr(A \cap B)}{Pr(B)}$
[Figure: Venn diagram of the sample space S with events A and B]

Example: What is the Pr(Republican Vote | Republican Identifier)?
Pr(Rep. Vote ∩ Rep. Id) = .35 and Pr(Rep ID) = .4
Thus, Pr(Republican Vote | Republican Identifier) = .35 / .4 = .875

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Conditional Probability
P(A = true | B = true) = Out of all the outcomes in which B is true,
how many also have A equal to true
Read this as: “Probability of A conditioned on B” or “Probability of A
given B”

H = “Have a headache”
F = “Coming down with Flu”

P(H = true) = 1/10
P(F = true) = 1/40
P(H = true | F = true) = 1/2

“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”
[Figure: Venn diagram of the events H = true and F = true]

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Bayes’ Theorem (Rule, Law)
Bayes’ Theorem: Let events A1,…,Ak form a partition of the space S
such that Pr(Aj) > 0 for all j and let B be any event such that Pr(B)
> 0. Then for i = 1,..,k:
$Pr(A_i \mid B) = \frac{Pr(A_i)\,Pr(B \mid A_i)}{\sum_{k} Pr(A_k)\,Pr(B \mid A_k)}$
Proof:
$Pr(A_i \mid B) = \frac{Pr(A_i \cap B)}{Pr(B)} = \frac{Pr(A_i)\,Pr(B \mid A_i)}{\sum_{k} Pr(A_k)\,Pr(B \mid A_k)}$
Bayes’ Theorem is just a simple rule for computing the conditional
probability of events Ai given B from the conditional probability of B
given each event Ai and the unconditional probability of each Ai

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Interpretation of Bayes’
Theorem
$Pr(A_i \mid B) = \frac{Pr(A_i)\,Pr(B \mid A_i)}{\sum_{k} Pr(A_k)\,Pr(B \mid A_k)}$

Pr(A_i) = Prior distribution for the A_i. It summarizes your beliefs about the probability of event A_i before A_i or B are observed.
Pr(B | A_i) = The conditional probability of B given A_i. It summarizes the likelihood of event B given A_i.
Pr(A_i | B) = The posterior distribution of A_i given B. It represents the probability of event A_i after B has been observed.
∑_k Pr(A_k) Pr(B | A_k) = The normalizing constant. This is equal to the sum of the quantities in the numerator for all events A_k. Thus, Pr(A_i | B) represents the likelihood of event A_i relative to all other elements of the partition of the sample space.
32 Dr. D. Dutta BITS Pilani, Pilani Campus
Combining Data
When applying Bayes’ Theorem, the order in which you collect the data
doesn’t matter.

It also doesn’t matter whether you “peek” at the data halfway through an experiment.

33 Dr. D. Dutta BITS Pilani, Pilani Campus


From Bayes’ Theorem

$Pr(A \mid B) = \frac{Pr(A \cap B)}{Pr(B)}$

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
Application of Bayes' rule
Suppose you have been tested positive for a disease; what is the
probability that you actually have the disease?
It depends on the accuracy and sensitivity of the test, and on the
background (prior) probability of the disease.
Let P(Test=+ve | Disease=true) = 0.95, so the false negative rate,
P(Test=-ve | Disease=true), is 5%.
Let P(Test=+ve | Disease=false) = 0.05, so the false positive rate is also
5%.

                 Test=+ve   Test=-ve
Disease=true       0.95       0.05
Disease=false      0.05       0.95

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
Suppose the disease is rare: P(Disease=true) = 0.01 (1%).
Let D denote Disease and "T=+ve" denote the positive Test.

36 Dr. D. Dutta BITS Pilani, Pilani Campus


Example
So the probability of having the disease given that you tested positive is
just 16%. This seems too low, but here is an intuitive argument to
support it.
Of 100 people, we expect only 1 to have the disease, but we expect
about 5% of those (5 people) to test positive.
So of the 6 people who test positive, we only expect 1 of them to actually
have the disease; and indeed 1/6 is approximately 0.16.

37 Dr. D. Dutta BITS Pilani, Pilani Campus
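The 16% figure follows directly from Bayes' theorem; a few lines of arithmetic reproduce it (the variable names below are mine).

```python
# P(T=+ve | D=true) = 0.95, P(T=+ve | D=false) = 0.05, P(D=true) = 0.01
p_pos_given_d     = 0.95
p_pos_given_not_d = 0.05
p_d               = 0.01

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)   # normalizing constant P(T=+ve)
p_d_given_pos = p_pos_given_d * p_d / p_pos                   # Bayes' rule

print(round(p_d_given_pos, 3))   # 0.161 — about 16%
```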


The Math Behind Bayesian
Classifier
Dataset
Features: X= (X1,X2,…,Xn)
Label : Y

38 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier

For what value of y

P(Y= y|X= (x1,x2,…,xn))


is maximum?

39 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier

But the problem is


P(Y|X) is hard to find !!!

40 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier

$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$
(Posterior = Likelihood × Prior / Evidence)

41 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

42 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0

Estimate the value of Y given that X=(0,2) 0 1 1


1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

43 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier

Let’s compute

P(Y = 0 | X = (0,2))

and

P(Y = 1 | X = (0,2))

44 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

45 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

46 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

47 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

48 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

49 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

50 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

51 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

52 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
Let’s assume X1 and X2 are independent… X1 X2 Y
0 0 0
P(X = (0,2) | Y = 1)
0 1 1
1 2 1
0 0 1
P(X = (0,2) | Y = 0) 2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

53 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

54 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

55 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

56 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

57 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

58 Dr. D. Dutta BITS Pilani, Pilani Campus


The Math Behind Bayesian
Classifier
X1 X2 Y
0 0 0
0 1 1
1 2 1
0 0 1
2 2 0
1 1 0
0 2 1
2 0 0
2 1 0
1 0 0

Comparing the two posteriors, P(Y=1 | X=(0,2)) is larger than P(Y=0 | X=(0,2)).
So, the estimated value of Y is 1

59 Dr. D. Dutta BITS Pilani, Pilani Campus
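The whole worked example can be reproduced with a short naïve Bayes sketch over the 10-row (X1, X2, Y) table; the independence of X1 and X2 given Y is exactly the assumption stated on the earlier slide, and the variable names are mine.

```python
data = [(0, 0, 0), (0, 1, 1), (1, 2, 1), (0, 0, 1), (2, 2, 0),
        (1, 1, 0), (0, 2, 1), (2, 0, 0), (2, 1, 0), (1, 0, 0)]   # (X1, X2, Y)

def naive_bayes_score(x1, x2, y):
    """P(Y=y) * P(X1=x1 | Y=y) * P(X2=x2 | Y=y), assuming X1 and X2 independent given Y."""
    rows_y = [r for r in data if r[2] == y]
    prior = len(rows_y) / len(data)
    p_x1 = sum(1 for r in rows_y if r[0] == x1) / len(rows_y)
    p_x2 = sum(1 for r in rows_y if r[1] == x2) / len(rows_y)
    return prior * p_x1 * p_x2

for y in (0, 1):
    print(y, naive_bayes_score(0, 2, y))
# Y=1 gets the larger score (0.15 vs. about 0.017), so the estimated value of Y is 1
```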


Thanks

Any Question?

60 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 5

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Data Warehousing
1. Basic Concepts
2. DW Design
3. Data Cube and OLAP
4. DW implementation considerations
As per session plan
• Data Warehousing

o Basic Concepts
o DW Design
o Data Cube and OLAP
o DW implementation considerations
2 Dr. D. Dutta BITS Pilani, Pilani Campus
The “Compute Cube”
Operator
“How many cuboids are there in an n-dimensional data cube?” If there
were no hierarchies associated with each dimension, then the total
number of cuboids for an n -dimensional data cube, as we have seen,
is 2n. However, in practice, many dimensions do have hierarchies. For
example, time is usually explored not at only one conceptual level
(e.g., year), but rather at multiple conceptual levels such as in the
hierarchy “day < month < quarter < year.” For an n-dimensional data
cube, the total number of cuboids that can be generated (including
the cuboids generated by climbing up the hierarchies along each
dimension) is
$T = \prod_{i=1}^{n} (L_i + 1)$
where Li is the number of levels associated with dimension i. One is
added to Li in Equation to include the virtual top level, all. (Note that
generalizing to all is equivalent to the removal of the dimension.)

3 Dr. D. Dutta BITS Pilani, Pilani Campus
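A quick check of the cuboid-count formula; with the numbers used on the next slide (10 dimensions, 4 levels each plus the virtual level all) it gives T = 5^10. The second call uses made-up dimension sizes purely for illustration.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1); the +1 is the virtual top level 'all'."""
    return prod(L + 1 for L in levels_per_dimension)

print(total_cuboids([4] * 10))    # 5**10 = 9765625
print(total_cuboids([4, 3, 2]))   # e.g. hypothetical time(4), location(3), item(2) → 60
```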


The “Compute Cube”
Operator
This formula is based on the fact that, at most, one abstraction level in
each dimension will appear in a cuboid. For example, the time
dimension as specified before has four conceptual levels, or five if we
include the virtual level all. If the cube has 10 dimensions and each dimension has 4 levels (5 including the virtual level all), the total number of cuboids that can be generated is 5^10.
The size of each cuboid also depends on the cardinality (i.e., number of
distinct values) of each dimension. For example, if
the AllElectronics branch in each city sold every item, there would
be |city|×|item| tuples in the city_item group-by alone. As the number
of dimensions, number of conceptual hierarchies, or cardinality
increases, the storage space required for many of the group-by's will
grossly exceed the (fixed) size of the input relation.

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Efficient Data Cube
Computation
By now, you probably realize that it is unrealistic to pre-compute and
materialize all of the cuboids that can possibly be generated for a
data cube (i.e., from a base cuboid). If there are many cuboids, and
these cuboids are large in size, a more reasonable option is partial
materialization; that is, to materialize only some of the possible
cuboids that can be generated.
Materialization of data cube
– Materialize every (cuboid) (full materialization), none (no
materialization), or some (partial materialization)

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
Figure shows a 3-D data cube for the dimensions A, B, and C, and an
aggregate measure, M.
Commonly used measures include count, sum, min, max, and total
sales.
A data cube is a lattice of cuboids.
Each cuboid represents a group-by.
ABC is the base cuboid, containing all three of the dimensions.

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
Here, the aggregate measure, M, is computed for each possible
combination of the three dimensions.
The base cuboid is the least generalized of all of the cuboids in the data
cube.
The most generalized cuboid is the apex cuboid, commonly represented
as all.
It contains one value - it aggregates measure M for all of the tuples
stored in the base cuboid.
To drill down in the data cube, we move from the apex cuboid, downward
in the lattice.
To roll up, we move from the base cuboid, upward.

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
A cell in the base cuboid is a base cell.
A cell from a non-base cuboid is an aggregate cell.
An aggregate cell aggregates over one or more dimensions, where each
aggregated dimension is indicated by a “∗” in the cell notation.
Suppose we have an n-dimensional data cube. Let a = (a1, a2, . . . , an,
measures) be a cell from one of the cuboids making up the data cube.
We say that a is an m-dimensional cell (that is, from an m-dimensional
cuboid) if exactly m (m ≤ n) values among {a1, a2, . . . , an} are not
“∗”.
If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e.,
where m < n).

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
Base and aggregate cells. Consider a data cube with the dimensions
month, city, and customer group, and the measure sales.
(Jan, ∗ , ∗ , 2800) and (∗, Chicago, ∗ , 1200) are 1-D cells,
(Jan, ∗ , Business, 150) is a 2-D cell,
(Jan, Chicago, Business, 45) is a 3-D cell.
Here, all base cells are 3-D,
Whereas 1-D and 2-D cells are aggregate cells.

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
An ancestor-descendant relationship may exist between cells.
In an n-dimensional data cube, an i-D cell a = (a1, a2, . . . , an,
measuresa) is an ancestor of a j-D cell b = (b1, b2, . . . , bn,
measuresb), and b is a descendant of a, if and only if (1) i < j, and (2)
for 1 ≤ k ≤ n, ak = bk whenever ak ≠“∗”.
In particular, cell a is called a parent of cell b, and b is a child of a, if and
only if j = i + 1.

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
Ancestor and descendant cells.
Referring to our previous example, 1-D cell a = (Jan, ∗ , ∗ , 2800) and 2-
D cell b = (Jan, ∗ , Business, 150) are ancestors of 3-D cell c = (Jan,
Chicago, Business, 45);
c is a descendant of both a and b;
b is a parent of c, and c is a child of b.

11 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
In order to ensure fast on-line analytical processing (OLAP), it is
sometimes desirable to pre-compute the full cube (i.e., all the cells of
all of the cuboids for a given data cube).
Full cube computation, however, is exponential to the number of
dimensions.
That is, a data cube of n dimensions contains 2^n cuboids.
There are even more cuboids if we consider concept hierarchies for each
dimension. Concept hierarchies represent the hierarchical
relationships between different levels of granularity or abstraction
within a dimension.

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
In addition, the size of each cuboid depends on the cardinality of its
dimensions.
Thus, pre-computation of the full cube can require huge and often
excessive amounts of memory.
Nonetheless, full cube computation algorithms are important. Individual
cuboids may be stored on secondary storage and accessed when
necessary.
Partial materialization of data cubes offers an interesting trade-off
between storage space and response time for OLAP. Instead of
computing the full cube, we can compute only a subset of the data
cube’s cuboids, or sub-cubes consisting of subsets of cells from the
various cuboids.

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
Many cells in a cuboid may actually be of little or no interest to the data
analyst. Recall that each cell in a full cube records an aggregate
value, such as count or sum. For many cells in a cuboid, the measure
value will be zero.
When the product of the cardinalities for the dimensions in a cuboid is
large relative to the number of nonzero-valued tuples that are stored
in the cuboid, then we say that the cuboid is sparse. If a cube
contains many sparse cuboids, we say that the cube is sparse.

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
In many cases, a substantial amount of the cube’s space could be taken
up by a large number of cells with very low measure values.
This is because the cube cells are often quite sparsely distributed within
a multiple dimensional space.
For example, a customer may only buy a few items in a store at a time.
Such an event will generate only a few nonempty cells, leaving most
other cube cells empty.
In such situations, it is useful to materialize only those cells in a cuboid
(group-by) whose measure value is above some minimum threshold.

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
This not only saves processing time and disk space, but also leads to a
more focused analysis.
Such partially materialized cubes are known as iceberg cubes.
The minimum threshold is called the minimum support threshold, or
minimum support (min sup), for short.
Iceberg cube.
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from sales_Info
cube by month, city, customer_group
having count(*) >= min sup

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Iceberg Cube

Computing only the cuboid cells whose count or


other aggregates satisfying the condition like
HAVING COUNT(*) >= minsup

◼ Motivation
◼ Only a small portion of cube cells may be “above the water’’ in a

sparse cube
◼ Only calculate “interesting” cells—data above certain threshold

◼ Avoid explosive growth of the cube

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
The compute cube statement specifies the pre-computation of the
iceberg cube, sales_iceberg, with the dimensions month, city,
and customer_group, and the aggregate measure count().
The input tuples are in the salesInfo relation. The cube by clause
specifies that aggregates (group-by's) are to be formed for each of the
possible subsets of the given dimensions. If we were computing the
full cube, each group-by would correspond to a cuboid in the data
cube lattice.
The constraint specified in the having clause is known as the iceberg
condition. Here, the iceberg measure is count(). Note that the
iceberg cube computed here could be used to answer group-by
queries on any combination of the specified dimensions of the
form having count(*) >= v, where v ≥ (min_sup). Instead of count(),
the iceberg condition could specify more complex measures such
as average().

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
If we were to omit the having clause, we would end up with the full cube.
Let's call this cube sales_cube. The iceberg cube, sales_iceberg,
excludes all the cells of sales_cube with a count that is less
than min_sup. Obviously, if we were to set the minimum support to “0”
in sales_iceberg, the resulting cube would be the full
cube, sales_cube.
A naïve approach to computing an iceberg cube would be to first
compute the full cube and then prune the cells that do not satisfy the
iceberg condition. However, this is still prohibitively expensive. An
efficient approach is to compute only the iceberg cube directly without
computing the full cube.
Introducing iceberg cubes will lessen the burden of computing trivial
aggregate cells in a data cube. However, we could still end up with a
large number of uninteresting cells to compute.

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Cube Materialization
To systematically compress a data cube, we need to introduce the
concept of closed coverage. A cell, c, is a closed cell if there exists no
cell, d, such that d is a specialization (descendant) of cell c (i.e.,
where d is obtained by replacing ∗ in c with a non-∗ value), and d has
the same measure value as c. A closed cube is a data cube
consisting of only closed cells.
Now, if a cell 'c' is considered closed, it means that there is no more
specific cell 'd' that has the same measure value as 'c'. In other
words, 'c' is already at the most detailed level and no further
specialization is possible without changing the measure value.
Another strategy for partial materialization is to precompute only the
cuboids involving a small number of dimensions such as three to five.
These cuboids form a cube shell for the corresponding data cube.
This, however, can still result in a large number of cuboids to compute,
particularly when n is large. Alternatively, we can choose to pre-
compute only portions or fragments of the cube shell based on
cuboids of interest.
20 Dr. D. Dutta BITS Pilani, Pilani Campus
What is Data generalization?
Data generalization is a process that involves transforming detailed or
specific data into more abstract, summarized, or higher-level
representations.
By generalizing the data, we move from a lower conceptual level to
higher conceptual levels. This involves aggregating or grouping the
data based on certain attributes or dimensions, applying functions like
summarization, averaging, or grouping, and representing the data in a
more condensed form.
For example, let's consider a sales dataset with individual transaction
records containing information such as customer details, product
details, date, and sales amount. In the data generalization process,
we might aggregate the data by customer groups, product categories,
or monthly sales, thereby creating higher-level representations that
capture the overall patterns and trends rather than individual
transactions.

21 Dr. D. Dutta BITS Pilani, Pilani Campus


General Strategies for Data
Cube Computation
There are several methods for efficient data cube computation, based on
the various kinds of cubes described earlier.
In general, there are two basic data structures used for storing cuboids.
– The implementation of relational OLAP (ROLAP) uses relational
tables,
– The multidimensional arrays are used in multidimensional OLAP
(MOLAP).

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Optimization Technique 1:
Sorting, hashing, and grouping.
Sorting, hashing, and grouping operations should be applied to the
dimension attributes in order to reorder and cluster related tuples.
In cube computation, aggregation is performed on the tuples (or cells)
that share the same set of dimension values. Thus it is important to
explore sorting, hashing, and grouping operations to access and
group such data together to facilitate computation of such aggregates.
For example, to compute total sales by branch, day, and item, it can be
more efficient to sort tuples or cells by branch, and then by day, and
then group them according to the item name. Efficient
implementations of such operations in large data sets have been
extensively studied in the database research community. Such
implementations can be extended to data cube computation.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Optimization Technique 2: Simultaneous
aggregation and caching intermediate results
In cube computation, it is efficient to compute higher-level aggregates
from previously computed lower-level aggregates, rather than from
the base fact table. Moreover, simultaneous aggregation from cached
intermediate computation results may lead to the reduction of
expensive disk I/O operations.
For example, to compute sales by branch, we can use the intermediate
results derived from the computation of a lower-level cuboid, such as
sales by branch and day.

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Optimization Technique 3: Aggregation from the smallest child, when

there exist multiple child cuboids

When there exist multiple child cuboids, it is usually more efficient to


compute the desired parent (i.e., more generalized) cuboid from the
smallest, previously computed child cuboid.
For example, to compute a sales cuboid, Cbranch, when there exist two
previously computed cuboids, C{branch,year} and C{branch,item}, it
is obviously more efficient to compute Cbranch from the former than
from the latter if there are many more distinct items than distinct
years.

25 Dr. D. Dutta BITS Pilani, Pilani Campus


Optimization Technique 4: The Apriori pruning method can be explored

to compute iceberg cubes efficiently

In iceberg cube computation the following optimization technique plays a


particularly important role.
The Apriori property, in the context of data cubes, states as follows: If a
given cell does not satisfy minimum support, then no descendant of
the cell (i.e., more specialized cell) will satisfy minimum support
either. This property can be used to substantially reduce the
computation of iceberg cubes.

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Cube Computation
Methods
Data cube computation is an essential task in data warehouse
implementation.
The pre-computation of all or part of a data cube can greatly reduce the
response time and enhance the performance of on-line analytical
processing.
However, such computation is challenging because it may require
substantial computational time and storage space.
Methods are
– Multiway Array Aggregation
– BUC
– Star-Cubing
– Shell-fragment cubing

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
The Multiway Array Aggregation (or simply MultiWay) method computes
a full data cube by using a multidimensional array as its basic data
structure.
It is a typical MOLAP approach that uses direct array addressing, where
dimension values are accessed via the position or index of their
corresponding array locations.
Hence, MultiWay cannot perform any value-based reordering as an
optimization technique.
1. Partition the array into chunks:
– A chunk is a subcube that is small enough to fit into the memory
available for cube computation.
– Chunking is a method for dividing an n-dimensional array into
small n-dimensional chunks, where each chunk is stored as an
object on disk.

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
2. Compute aggregates by visiting (i.e., accessing the values at) cube
cells.
– The order in which cells are visited can be optimized so as to
minimize the number of times that each cell must be revisited,
thereby reducing memory access and storage costs.
– The trick is to exploit this ordering so that portions of the
aggregate cells in multiple cuboids can be computed
simultaneously, and any unnecessary revisiting of cells is avoided.
This chunking technique involves “overlapping” some of the aggregation
computations, therefore, it is referred to as multiway array
aggregation.
It performs simultaneous aggregation-that is, it computes aggregations
simultaneously on multiple dimensions.

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
Consider a 3-D data array containing the three dimensions A, B, and C.
The 3-D array is partitioned into small, memory-based chunks.
In this example, the array is partitioned into 64 chunks as shown in
Figure.
Dimension A is organized into four equal-sized partitions, a0, a1, a2, and
a3 .
Dimensions B and C are similarly organized into four partitions each.
Chunks 1, 2, . . . , 64 correspond to the subcubes a0b0c0, a1b0c0, . . . ,
a3b3c3, respectively.
Suppose that the cardinality of the dimensions A, B, and C is 40, 400,
and 4000, respectively.
Thus, the size of the array for each dimension, A, B, and C, is also 40,
400, and 4000, respectively.

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
The size of each partition in A, B, and C is therefore 10, 100, and 1000,
respectively.
Full materialization of the corresponding data cube involves the
computation of all of the cuboids defining this cube.
The resulting full cube consists of the following cuboids:
– The base cuboid, denoted by ABC (from which all of the other
cuboids are directly or indirectly computed). This cube is already
computed and corresponds to the given 3-D array.
– The 2-D cuboids, AB, AC, and BC, which respectively correspond
to the group-by’s AB, AC, and BC. These cuboids must be
computed.
– The 1-D cuboids, A, B, and C, which respectively correspond to
the group-by’s A, B, and C. These cuboids must be computed.
– The 0-D (apex) cuboid, denoted by all, which corresponds to the
group-by (); that is, there is no group-by here. This cuboid must be
computed. It consists of only one value.
32 Dr. D. Dutta BITS Pilani, Pilani Campus
Multiway Array Aggregation
There are many possible orderings with which chunks can be read into
memory for use in cube computation. Consider the ordering labeled
from 1 to 64, shown in Figure.
Suppose we would like to compute the b0c0 chunk of the BC cuboid. We
allocate space for this chunk in chunk memory.
By scanning chunks 1 to 4 of ABC, the b0c0 chunk is computed.
That is, the cells for b0c0 are aggregated over a0 to a3.
The chunk memory can then be assigned to the next chunk, b1c0, which
completes its aggregation after the scanning of the next four chunks
of ABC: 5 to 8.
Continuing in this way, the entire BC cuboid can be computed.
Therefore, only one chunk of BC needs to be in memory, at a time, for
the computation of all of the chunks of BC.

33 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
In computing the BC cuboid, we will have scanned each of the 64
chunks.
“Is there a way to avoid having to rescan all of these chunks for the
computation of other cuboids, such as AC and AB?”
The answer is, most definitely - yes.
This is where the “multiway computation” or “simultaneous aggregation”
idea comes in.
For example, when chunk 1 (i.e., a0b0c0) is being scanned (say, for the
computation of the 2-D chunk b0c0 of BC, as described above), all of
the other 2-D chunks relating to a0b0c0 can be simultaneously
computed.
That is, when a0b0c0 is being scanned, each of the three chunks, b0c0,
a0c0, and a0b0, on the three 2-D aggregation planes, BC, AC, and AB,
should be computed then as well.
In other words, multiway computation simultaneously aggregates to each
of the 2-D planes while a 3-D chunk is in memory.
34 Dr. D. Dutta BITS Pilani, Pilani Campus
Multiway Array Aggregation
Now let’s look at how different orderings of chunk scanning and of
cuboid computation can affect the overall data cube computation
efficiency.
Recall that the size of the dimensions A, B, and C is 40, 400, and 4000,
respectively.
Therefore, the largest 2-D plane is BC (of size 400 × 4000 = 1, 600,
000).
The second largest 2-D plane is AC (of size 40×4000 = 160, 000).
AB is the smallest 2-D plane (with a size of 40 × 400 = 16, 000).

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Multiway Array Aggregation
Suppose that the chunks are scanned in the order shown, from chunk 1
to 64.
As mentioned above, b0c0 is fully aggregated after scanning the row
containing chunks 1 to 4;
b1c0 is fully aggregated after scanning chunks 5 to 8, and so on.
Thus, we need to scan 4 chunks of the 3-D array in order to fully
compute one chunk of the BC cuboid (where BC is the largest of the
2-D planes).
In other words, by scanning in this order, one chunk of BC is fully
computed for each row scanned.
In comparison, the complete computation of one chunk of the second
largest 2-D plane, AC, requires scanning 13 chunks, given the
ordering from 1 to 64.
That is, a0c0 is fully aggregated only after the scanning of chunks 1, 5, 9,
and 13.
36 Dr. D. Dutta BITS Pilani, Pilani Campus
Multiway Array Aggregation
Finally, the complete computation of one chunk of the smallest 2-D
plane, AB, requires scanning 49 chunks.
For example, a0b0 is fully aggregated after scanning chunks 1, 17, 33,
and 49. Hence, AB requires the longest scan of chunks in order to
complete its computation.

37 Dr. D. Dutta BITS Pilani, Pilani Campus
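A minimal Python/NumPy sketch of this simultaneous aggregation idea (not taken from the text): it scans a dense 3-D measure array chunk by chunk and, while each chunk is in memory, adds its contribution to all three 2-D planes at once. The array size, chunk shape, and random values below are assumptions chosen only for the demonstration.

import numpy as np

# A small dense 3-D measure array indexed by (A, B, C); sizes are illustrative.
cube = np.random.randint(0, 10, size=(4, 8, 12))
chunk = (2, 2, 3)                                # chunk (subcube) shape along A, B, C

# Aggregate planes for the 2-D cuboids, filled in while the chunks are scanned.
AB = np.zeros(cube.shape[:2])
AC = np.zeros((cube.shape[0], cube.shape[2]))
BC = np.zeros(cube.shape[1:])

# Each chunk is read once and contributes to AB, AC, and BC simultaneously
# ("multiway" / simultaneous aggregation), so no chunk has to be revisited.
for a in range(0, cube.shape[0], chunk[0]):
    for b in range(0, cube.shape[1], chunk[1]):
        for c in range(0, cube.shape[2], chunk[2]):
            blk = cube[a:a+chunk[0], b:b+chunk[1], c:c+chunk[2]]
            AB[a:a+chunk[0], b:b+chunk[1]] += blk.sum(axis=2)
            AC[a:a+chunk[0], c:c+chunk[2]] += blk.sum(axis=1)
            BC[b:b+chunk[1], c:c+chunk[2]] += blk.sum(axis=0)

# The 1-D cuboids and the apex can then be aggregated from the smallest parents,
# e.g. cuboid A from plane AB; the check below confirms the aggregation is consistent.
assert np.allclose(AB.sum(axis=1), cube.sum(axis=(1, 2)))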


BUC (Bottom –Up
Construction)
BUC is an algorithm for the computation of sparse and iceberg cubes.
Unlike MultiWay, BUC constructs the cube from the apex cuboid toward
the base cuboid.
This allows BUC to share data partitioning costs.
This order of processing also allows BUC to prune during construction,
using the Apriori property.
◼ Figure shows a lattice of cuboids,
making up a 3-D data cube with the
dimensions A, B, and C.
◼ The apex (0-D) cuboid, representing
the concept all (that is, (∗, ∗ , ∗)), is
at the top of the lattice.
◼ This is the most aggregated or
generalized level.

38 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
The 3-D base cuboid, ABC, is at the bottom of the lattice.
It is the least aggregated (most detailed or specialized) level.
This representation of a lattice of cuboids, with the apex at the top and
the base at the bottom, is commonly accepted in data warehousing.
It consolidates the notions of drilldown (where we can move from a
highly aggregated cell to lower, more detailed cells) and roll-up
(where we can move from detailed, low-level cells to higher level,
more aggregated cells).

39 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
BUC stands for “Bottom-Up Construction.”
However, according to the lattice convention described above and used
throughout this ppt, the order of processing of BUC is actually top-
down!
The authors of BUC view a lattice of cuboids in the reverse order, with
the apex cuboid at the bottom and the base cuboid at the top. In that
view, BUC does bottom-up construction.
However, because we adopt the application worldview where drill-down
refers to drilling from the apex cuboid down toward the base cuboid,
the exploration process of BUC is regarded as top-down.
BUC’s exploration for the computation of a 3-D data cube is shown in
Figure.

40 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
Algorithm: BUC. Algorithm for the computation of sparse and iceberg
cubes.
Input:
input : the relation to aggregate;
dim: the starting dimension for this iteration.
Globals:
constant numDims: the total number of dimensions;
constant cardinality[numDims]: the cardinality of each dimension;
constant min sup: the minimum number of tuples in a partition in order
for it to be output;
outputRec: the current output record;
dataCount[numDims]: stores the size of each partition. dataCount[i] is a
list of integers of size cardinality[i].
Output: Recursively output the iceberg cube cells satisfying the minimum
support.
41 Dr. D. Dutta BITS Pilani, Pilani Campus
BUC (Bottom –Up
Construction)

42 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)

43 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
Let’s see how BUC constructs the iceberg cube for the dimensions A, B,
C, and D, where the minimum support count is 3.
Suppose that dimension A has four distinct values, a1, a2, a3, a4; B has
four distinct values, b1, b2, b3, b4; C has two distinct values, c1, c2; and
D has two distinct values, d1, d2. If we consider each group-by to be a
partition, then we must compute every combination of the grouping
attributes that satisfies minimum support (i.e., that has at least 3 tuples).
Figure illustrates how the input is partitioned first according to the
different attribute values of dimension A, and then B, C, and D.
To do so, BUC scans the input, aggregating the tuples to obtain a count
for all, corresponding to the cell (∗, ∗ , ∗ , ∗).
Dimension A is used to split the input into four partitions, one for each
distinct value of A. The number of tuples (counts) for each distinct
value of A is recorded in dataCount.

44 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
BUC uses the Apriori property to save time while searching for tuples
that satisfy the iceberg condition.
Starting with A dimension value, a1, the a1 partition is aggregated,
creating one tuple for the A group-by, corresponding to the cell (a1, ∗,
∗, ∗). Suppose (a1, ∗, ∗, ∗) satisfies the minimum support, in which
case a recursive call is made on the partition for a1. BUC partitions a1
on the dimension B.
It checks the count of (a1, b1, ∗ , ∗) to see if it satisfies the minimum
support. If it does, it outputs the aggregated tuple to the AB group-by
and recurses on (a1, b1, ∗ , ∗) to partition on C, starting with c1.
Suppose the cell count for (a1, b1, c1, ∗) is 2, which does not satisfy the
minimum support.
According to the Apriori property, if a cell does not satisfy minimum
support, then neither can any of its descendants. Therefore, BUC
prunes any further exploration of (a1, b1, c1, ∗).

45 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
That is, it avoids partitioning this cell on dimension D. It backtracks to the
a1, b1 partition and recurses on (a1, b1, c2, ∗), and so on.
By checking the iceberg condition each time before performing a
recursive call, BUC saves a great deal of processing time whenever a
cell’s count does not satisfy the minimum support.
The partition process is facilitated by a linear sorting method,
CountingSort.
The performance of BUC is sensitive to the order of the dimensions and
to skew in the data.
Ideally, the most discriminating dimensions should be processed first.
Dimensions should be processed in the order of decreasing cardinality.
The higher the cardinality is, the smaller the partitions are, and thus, the
more partitions there will be, thereby providing BUC with a greater
opportunity for pruning.

46 Dr. D. Dutta BITS Pilani, Pilani Campus


BUC (Bottom –Up
Construction)
Similarly, the more uniform a dimension is (i.e., having less skew), the
better it is for pruning.
BUC’s major contribution is the idea of sharing partitioning costs.
However, unlike MultiWay, it does not share the computation of
aggregates between parent and child group-by’s.
For example, the computation of cuboid AB does not help that of ABC.
The latter needs to be computed essentially from scratch.

47 Dr. D. Dutta BITS Pilani, Pilani Campus
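A minimal Python sketch of the BUC recursion described above, assuming the relation is a list of dictionaries and the measure is count(); the toy tuples are invented for illustration. Each aggregate cell that reaches min_sup is output, and any partition below min_sup is pruned together with all of its descendants (Apriori pruning).

from collections import defaultdict

def buc(tuples, dims, min_sup, prefix=None, out=None):
    # Simplified sketch of BUC: output the current aggregate cell, then
    # partition on each remaining dimension and recurse only on partitions
    # that still satisfy the minimum support.
    if prefix is None:
        prefix = {}
    if out is None:
        out = []
    if len(tuples) >= min_sup:
        out.append((dict(prefix), len(tuples)))            # output this cell
        for d in range(len(dims)):                         # expand on remaining dims
            partitions = defaultdict(list)
            for t in tuples:
                partitions[t[dims[d]]].append(t)
            for value, part in partitions.items():
                if len(part) >= min_sup:                   # Apriori pruning
                    buc(part, dims[d + 1:], min_sup,
                        {**prefix, dims[d]: value}, out)
    return out

# Toy relation with dimensions A, B, C, D and minimum support count 3.
rows = [{"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
        {"A": "a1", "B": "b1", "C": "c1", "D": "d2"},
        {"A": "a1", "B": "b1", "C": "c2", "D": "d1"},
        {"A": "a1", "B": "b2", "C": "c1", "D": "d1"},
        {"A": "a2", "B": "b1", "C": "c1", "D": "d1"}]
for cell, count in buc(rows, ["A", "B", "C", "D"], min_sup=3):
    print(cell, count)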


Star-Cubing
Star-Cubing algorithm is used for computing iceberg cubes.
Star-Cubing combines the strengths of the other methods we have
studied up to this point.
It integrates top-down and bottom-up cube computation and explores
both multidimensional aggregation (similar to MultiWay) and Apriori-
like pruning (similar to BUC).
It operates from a data structure called a star-tree, which performs
lossless data compression, thereby reducing the computation time
and memory requirements.

48 Dr. D. Dutta BITS Pilani, Pilani Campus


Star-Cubing

Star-Cubing: Bottom-up computation with top-down


expansion of shared dimensions.
49 Dr. D. Dutta BITS Pilani, Pilani Campus
Star-Cubing
Star-Cubing’s approach is illustrated in above Figure for the computation
of a 4-D data cube.
If we were to follow only the bottom-up model (similar to Multiway), then
the cuboids marked as pruned by Star-Cubing would still be explored.
Star-Cubing is able to prune the indicated cuboids because it considers
shared dimensions.
ACD/A means cuboid ACD has shared dimension A,
ABD/AB means cuboid ABD has shared dimension AB,
ABC/ABC means cuboid ABC has shared dimension ABC, and so on.
This comes from the generalization that all the cuboids in the subtree
rooted at ACD include dimension A, all those rooted at ABD include
dimensions AB, and all those rooted at ABC include dimensions ABC
(even though there is only one such cuboid).
We call these common dimensions the shared dimensions of those
particular subtrees.
50 Dr. D. Dutta BITS Pilani, Pilani Campus
Star-Cubing
The introduction of shared dimensions facilitates shared computation.
Because the shared dimensions are identified early on in the tree
expansion, we can avoid recomputing them later.
For example, cuboid AB extending from ABD in Figure would actually be
pruned because AB was already computed in ABD/AB.
Similarly, cuboid A extending from AD would also be pruned because it
was already computed in ACD/A.

51 Dr. D. Dutta BITS Pilani, Pilani Campus


Star-Cubing
Shared dimensions allow us to do Apriori-like pruning if the measure of
an iceberg cube, such as count, is antimonotonic; that is, if the
aggregate value on a shared dimension does not satisfy the iceberg
condition, then all of the cells descending from this shared dimension
cannot satisfy the iceberg condition either.
Such cells and all of their descendants can be pruned, because these
descendant cells are, by definition, more specialized (i.e., contain
more dimensions) than those in the shared dimension(s).
The number of tuples covered by the descendant cells will be less than
or equal to the number of tuples covered by the shared dimensions.
Therefore, if the aggregate value on a shared dimension fails the iceberg
condition, the descendant cells cannot satisfy it either.

52 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
Recall the reason that we are interested in precomputing data cubes:
Data cubes facilitate fast on-line analytical processing (OLAP) in a
multidimensional data space.
However, a full data cube of high dimensionality needs massive storage
space and unrealistic computation time.
Iceberg cubes provide a more feasible alternative, as we have seen,
wherein the iceberg condition is used to specify the computation of
only a subset of the full cube’s cells.
However, although an iceberg cube is smaller and requires less
computation time than its corresponding full cube, it is not an ultimate
solution.

53 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
For one, the computation and storage of the iceberg cube can still be
costly. For example, if the base cuboid cell, (a1, a2, . . . , a60), passes
minimum support (or the iceberg threshold), it will generate 2^60
iceberg cube cells.
Second, it is difficult to determine an appropriate iceberg threshold.
Setting the threshold too low will result in a huge cube, whereas setting
the threshold too high may invalidate many useful applications.
Third, an iceberg cube cannot be incrementally updated. Once an
aggregate cell falls below the iceberg threshold and is pruned, its
measure value is lost. Any incremental update would require re-
computing the cells from scratch. This is extremely undesirable for
large real-life applications where incremental appending of new data
is the norm.

54 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
One possible solution, which has been implemented in some commercial
data warehouse systems, is to compute a thin cube shell. For
example, we could compute all cuboids with three dimensions or less,
in a 60-dimensional data cube, resulting in cube shell of size 3.
The resulting set of cuboids would require much less computation and
storage than the full 60-dimensional data cube.
However, there are two disadvantages of this approach.
First, we would still need to compute C(60,3) + C(60,2) + 60 = 36050
cuboids, each with many cells.
Second, such a cube shell does not support high-dimensional OLAP
because
(1) it does not support OLAP on four or more dimensions, and
(2) it cannot even support drilling along three dimensions

55 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
Instead of computing a cube shell, we can compute only portions or
fragments of it.
Although a data cube may contain many dimensions, most OLAP
operations are performed on only a small number of dimensions at a
time. In other words, an OLAP query is likely to ignore many
dimensions (i.e., treating them as irrelevant), fix some dimensions
(e.g., using query constants as instantiations), and leave only a few to
be manipulated (for drilling, pivoting, etc.). This is because it is neither
realistic nor fruitful for anyone to comprehend the changes of
thousands of cells involving tens of dimensions simultaneously in a
high-dimensional space at the same time.
Instead, it is more natural to first locate some cuboids of interest and
then drill along one or two dimensions to examine the changes of a
few related dimensions.

56 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
Most analysts will only need to examine, at any one moment, the
combinations of a small number of dimensions.
This implies that if multidimensional aggregates can be computed
quickly on a small number of dimensions inside a high-dimensional
space, we may still achieve fast OLAP without materializing the
original high-dimensional data cube.
Computing the full cube (or, often, even an iceberg cube or cube shell)
can be excessive.
Instead, a semi-on-line computation model with certain preprocessing
may offer a more feasible solution.
Given a base cuboid, some quick preparation computation can be done
first (i.e., off-line).
After that, a query can then be computed on-line using the preprocessed
data.

57 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
The shell fragment approach follows such a semi-on-line computation
strategy.
It involves two algorithms: one for computing cube shell fragments and
the other for query processing with the cube fragments.
The shell fragment approach can handle databases of high
dimensionality and can quickly compute small local cubes on-line.
It explores the inverted index data structure, which is popular in
information retrieval and Web-based information systems.

58 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
The basic idea is as follows.
Given a high-dimensional data set, we partition the dimensions into a set
of disjoint dimension fragments,
convert each fragment into its corresponding inverted index
representation,
and then construct cube shell fragments while keeping the inverted
indices associated with the cube cells.
Using the pre-computed cubes shell fragments, we can dynamically
assemble and compute cuboid cells of the required data cube on-line.
This is made efficient by set intersection operations on the inverted
indices.

59 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
To illustrate the shell fragment approach, we use the tiny database of
Table 5.4 as a running example. Let the cube measure be count(). We
first look at how to construct the inverted index for the given database

60 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
Construct the inverted index: For each attribute value in each
dimension, list the tuple identifiers (TIDs) of all the tuples that have
that value. For example, attribute value a2 appears in tuples 4 and 5.
The TIDlist for a2 then contains exactly two items, namely 4 and 5.
The resulting inverted index table is shown in Table 5.5. It retains all
of the information of the original database. If each table entry takes
one unit of memory, Tables 5.4 and 5.5 each take 25 units, i.e., the
inverted index table uses the same amount of memory as the original
database.

61 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
“How do we compute shell fragments of a data cube?”
The algorithm, Frag-Shells, is summarized in Figure 5.14. We first partition all the
dimensions of the given data set into independent groups of dimensions,
called fragments (line 1). We scan the base cuboid and construct an
inverted index for each attribute (lines 2 to 6). Line 3 is for when the
measure is other than the tuple count(), which will be described later. For
each fragment, we compute the full local (i.e., fragment-based) data
cube while retaining the inverted indices (lines 7 to 8).
Consider a database of 60 dimensions, namely, A1, A2, . . . , A60. We can
first partition the 60 dimensions into 20 fragments of size 3: (A1, A2, A3),
(A4, A5, A6), . . ., (A58, A59, A60). For each fragment, we compute its full
data cube while recording the inverted indices. For example, in fragment
(A1, A2, A3), we would compute seven cuboids: A1, A2, A3, A1A2, A2A3,
A1A3, A1A2A3. Furthermore, an inverted index is retained for each cell in
the cuboids. That is, for each cell, its associated TIDlist is recorded.

62 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing

63 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
The benefit of computing local cubes of each shell fragment instead of
computing the complete cube shell can be seen by a simple
calculation. For a base cuboid of 60 dimensions, there are only 7 × 20
= 140 cuboids to be computed according to the above shell fragment
partitioning. This is in contrast to the 36, 050 cuboids (C(60,3) +
C(60,2) + 60 = 36050) computed for the cube shell of size 3
described earlier! Notice that the above fragment partitioning is based
simply on the grouping of consecutive dimensions. A more desirable
approach would be to partition based on popular dimension
groupings. Such information can be obtained from domain experts or
the past history of OLAP queries.

64 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
How shell fragments are computed?
Suppose we are to compute the shell fragments of size 3. We first divide
the five dimensions into two fragments, namely (A, B, C) and (D, E). For
each fragment, we compute the full local data cube by intersecting the
TIDlists in Table 5.5 in a top-down depth-first order in the cuboid lattice.
For example, to compute the cell (a1, b2, *), we intersect the tuple ID lists
of a1 and b2 to obtain a new list of {2, 3}. Cuboid AB is shown in Table
5.6.
After computing cuboid AB, we can then compute cuboid ABC by
intersecting all pairwise combinations between Table 5.6 and the row c1
in Table 5.5. Notice that because cell (a2, b2) is empty, it can be
effectively discarded in subsequent computations, based on the Apriori
property. The same process can be applied to compute fragment (D, E),
which is completely independent from computing (A, B, C). Cuboid DE is
shown in Table 5.7.

65 Dr. D. Dutta BITS Pilani, Pilani Campus
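The two basic operations, building the inverted index (value → TID list) and intersecting TID lists to obtain a cuboid cell, can be sketched in Python as follows; the five tuples are illustrative stand-ins for Table 5.4 (only the TID lists quoted in the text, such as a2 → {4, 5} and (a1, b2) → {2, 3}, are guaranteed to match).

from collections import defaultdict

# Tiny 5-tuple, 5-dimension relation in the spirit of Table 5.4 (values assumed).
rows = {1: ("a1", "b1", "c1", "d1", "e1"),
        2: ("a1", "b2", "c1", "d2", "e1"),
        3: ("a1", "b2", "c1", "d1", "e2"),
        4: ("a2", "b1", "c1", "d1", "e2"),
        5: ("a2", "b1", "c1", "d1", "e3")}

# Step 1: inverted index -- for each attribute value, the set of TIDs containing it.
inverted = defaultdict(set)
for tid, values in rows.items():
    for value in values:
        inverted[value].add(tid)

# Step 2: a cuboid cell is obtained by intersecting the TID lists of its values,
# e.g. the cell (a1, b2, *) of cuboid AB; count() is just the length of the result.
cell_a1_b2 = inverted["a1"] & inverted["b2"]
print(sorted(cell_a1_b2), "count =", len(cell_a1_b2))      # [2, 3] count = 2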


Shell-fragment cubing

66 Dr. D. Dutta BITS Pilani, Pilani Campus


Shell-fragment cubing
If the measure in the iceberg condition is count() (as in tuple counting),
there is no need to reference the original database for this because
the length of the TIDlist is equivalent to the tuple count.

67 Dr. D. Dutta BITS Pilani, Pilani Campus


Multifeature Cubes: Complex Aggregation at Multiple
Granularities

Data cubes facilitate the answering of analytical or mining-oriented


queries as they allow the computation of aggregate data at multiple
levels of granularity.
Traditional data cubes are typically constructed on commonly-used
dimensions (like time, location, and product) using simple measures
(like count, average, and sum).
In this section, you will learn a newer way to define data cubes called
multifeature cubes. Multifeature cubes enable more in-depth analysis.
They can compute more complex queries whose measures depend
on groupings of multiple aggregates at varying levels of granularity.
The queries posed can be much more elaborate and task-specific than
traditional queries, as we shall illustrate in the next examples.
Many complex data mining queries can be answered by multifeature
cubes without significant increase in computational cost, in
comparison to cube computation for simple queries with traditional
data cubes
68 Dr. D. Dutta BITS Pilani, Pilani Campus
Multifeature Cubes: Complex Aggregation at Multiple
Granularities

A simple data cube query.


Let the query be “find the total sales in 2010, broken down by item,
region, and month, with subtotals for each dimension”.
To answer this query, a traditional data cube is constructed that
aggregates the total sales at the following eight different levels of
granularity: {(item, region, month), (item, region), (item, month),
(month, region), (item), (month), (region), ()}, where () represents all.
This data cube is simple in that it does not involve any dependent
aggregates.
To illustrate what is meant by ‘dependent aggregates’, let’s examine a
more complex query, which can be computed with a multifeature
cube.

69 Dr. D. Dutta BITS Pilani, Pilani Campus


Multifeature Cubes: Complex Aggregation at Multiple
Granularities

A complex query involving dependent aggregates. Suppose the query is


“grouping by all subsets of {item, region, month}, find the maximum
price in 2010 for each group and the total sales among all maximum
price tuples”.
The specification of such a query using standard SQL can be long,
repetitive, and difficult to optimize and maintain. Alternatively, it can
be specified concisely using an extended SQL syntax as follows:
select item, region, month, max(price), sum(R.sales)
from Purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)

70 Dr. D. Dutta BITS Pilani, Pilani Campus
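For intuition only, the dependent aggregation in this query can be sketched in Python/pandas for a single grouping; the table contents and column names below are assumptions made for illustration, and repeating the function over every subset of {item, region, month} (plus the empty grouping) would give the full multifeature cube.

import pandas as pd

# Hypothetical purchases, already filtered to year = 2010.
purchases = pd.DataFrame({
    "item":   ["TV", "TV", "TV", "Phone", "Phone"],
    "region": ["east", "east", "west", "east", "west"],
    "month":  [1, 1, 2, 1, 2],
    "price":  [500, 700, 700, 300, 300],
    "sales":  [2, 1, 3, 5, 4],
})

def dependent_aggregate(df, group_cols):
    # For one group-by: max(price), then sum(sales) restricted to the
    # tuples whose price equals that maximum (the grouping variable R).
    def agg(g):
        mx = g["price"].max()
        return pd.Series({"max_price": mx,
                          "sales_at_max": g.loc[g["price"] == mx, "sales"].sum()})
    return df.groupby(list(group_cols)).apply(agg).reset_index()

print(dependent_aggregate(purchases, ["item", "region", "month"]))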


Multifeature Cubes: Complex Aggregation at Multiple
Granularities

The tuples representing purchases in 2010 are first selected.


The cube by clause computes aggregates (or group-by’s) for all possible
combinations of the attributes item, region, and month.
It is an n-dimensional generalization of the group by clause.
The attributes specified in the cube by clause are the grouping attributes.
Tuples with the same value on all grouping attributes form one group.
Let the groups be g1, . . . , gr.
For each group of tuples gi, the maximum price maxgi among the tuples
forming the group is computed.
The variable R is a grouping variable, ranging over all tuples in group gi
whose price is equal to maxgi (as specified in the such that clause).
The sum of sales of the tuples in gi that R ranges over is computed and
returned with the values of the grouping attributes of gi.

71 Dr. D. Dutta BITS Pilani, Pilani Campus


Multifeature Cubes: Complex Aggregation at Multiple
Granularities

The resulting cube is a multifeature cube in that it supports complex data


mining queries for which multiple dependent aggregates are
computed at a variety of granularities.
“How can multifeature cubes be computed efficiently?” The computation
of a multifeature cube depends on the types of aggregate functions
used in the cube.
Aggregate functions can be categorized as either distributive, algebraic,
or holistic.

72 Dr. D. Dutta BITS Pilani, Pilani Campus


Exception-Based Discovery-Driven Exploration of Cube
Space

Discovery-driven approach:
Pre-computed measures indicating data exceptions are used to guide
the user in the data analysis process, at all levels of aggregation.
We hereafter refer to these measures as exception indicators.
Intuitively, an exception is a data cube cell value that is significantly
different from the value anticipated, based on a statistical model.
The model considers variations and patterns in the measure value
across all of the dimensions to which a cell belongs.
For example, if the analysis of item-sales data reveals an increase in
sales in December in comparison to all other months, this may seem
like an exception in the time dimension.
However, it is not an exception if the item dimension is considered, since
there is a similar increase in sales for other items during December.

73 Dr. D. Dutta BITS Pilani, Pilani Campus


Exception-Based Discovery-Driven Exploration of Cube
Space

Three measures are used as exception indicators to help identify data


anomalies. These measures indicate the degree of surprise that the
quantity in a cell holds, with respect to its expected value. The measures
are computed and associated with every cell, for all levels of
aggregation. They are as follows:
SelfExp: This indicates the degree of surprise of the cell value, relative to
other cells at the same level of aggregation.
InExp: This indicates the degree of surprise somewhere beneath the cell,
if we were to drill down from it.
PathExp: This indicates the degree of surprise for each drill-down path
from the cell.

74 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
Suppose that you would like to analyze the monthly sales at
AllElectronics as a percentage difference from the previous month.
The dimensions involved are item, time, and region.
You begin by studying the data aggregated over all items and sales
regions for each month.
To view the exception indicators, you would click on a button marked
highlight exceptions on the screen. This translates the SelfExp and
InExp values into visual cues, displayed with each cell. The
background color of each cell is based on its SelfExp value. In
addition, a box is drawn around each cell, where the thickness and
color of the box are a function of its InExp value. Thick boxes indicate
high InExp values. In both cases, the darker the color, the greater the
degree of exception. For example, the dark, thick boxes for sales
during July, August, and September signal the user to explore the
lower-level aggregations of these cells by drilling down.

75 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube

76 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
Drill-downs can be executed along the aggregated item or region
dimensions. “Which path has more exceptions?” you wonder.
To find this out, you select a cell of interest and trigger a path exception
module that colors each dimension based on the PathExp value of
the cell.
This value reflects the degree of surprise of that path. Suppose that the
path along item contains more exceptions.

77 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
A drill-down along item results in the cube slice of Figure, showing the
sales over time for each item.
At this point, you are presented with many different sales values to
analyze.
By clicking on the highlight exceptions button, the visual cues are
displayed, bringing focus toward the exceptions.
Consider the sales difference of 41% for “Sony b/w printers” in
September. This cell has a dark background, indicating a high
SelfExp value, meaning that the cell is an exception.
Consider now the sales difference of −15% for “Sony b/w printers” in
November, and of −11% in December. The −11% value for December
is marked as an exception, while the −15% value is not, even though
−15% is a bigger deviation than −11%.
This is because the exception indicators consider all of the dimensions
that a cell is in.

78 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
Notice that the December sales of most of the other items have a large
positive value, while the November sales do not.
Therefore, by considering the position of the cell in the cube, the sales
difference for “Sony b/w printers” in December is exceptional, while
the November sales difference of this item is not.
The InExp values can be used to indicate exceptions at lower levels that
are not visible at the current level.
Consider the cells for “IBM home computers” in July and September.
These both have a dark, thick box around them, indicating high InExp
values.
You may decide to further explore the sales of “IBM home computers” by
drilling down along region.
The resulting sales difference by region is shown in Figure, where the
highlight exceptions option has been invoked.

79 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
“How are the exception values computed?”
The SelfExp, InExp, and PathExp measures are based on a statistical
method for table analysis.
They take into account all of the group-by’s (aggregations) in which a
given cell value participates.
A cell value is considered an exception based on how much it differs
from its expected value, where its expected value is determined with
a statistical model described below.
The difference between a given cell value and its expected value is
called a residual.
Intuitively, the larger the residual, the more the given cell value is an
exception.
The comparison of residual values requires us to scale the values based
on the expected standard deviation associated with the residuals.

80 Dr. D. Dutta BITS Pilani, Pilani Campus


Discovery-driven exploration
of a data cube
A cell value is therefore considered an exception if its scaled residual
value exceeds a prespecified threshold.
The SelfExp, InExp, and PathExp measures are based on this scaled
residual.

81 Dr. D. Dutta BITS Pilani, Pilani Campus
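The exact statistical model behind SelfExp, InExp, and PathExp is not reproduced on these slides; the Python snippet below only illustrates the generic idea on a toy 2-D slice, using a simple additive row/column-effect model as an assumed source of expected values and flagging cells whose scaled residual exceeds a threshold.

import numpy as np

# Toy slice of monthly % sales differences (rows: items, columns: months).
sales = np.array([[ 5., -3.,  4., 41.],
                  [ 6., -2.,  3.,  7.],
                  [ 4., -4.,  5.,  6.]])

# Expected value of each cell from the overall mean plus row and column effects.
row_eff = sales.mean(axis=1, keepdims=True) - sales.mean()
col_eff = sales.mean(axis=0, keepdims=True) - sales.mean()
expected = sales.mean() + row_eff + col_eff

residual = sales - expected                  # how far each cell is from expectation
scaled = residual / residual.std()           # scale by the estimated spread
print(np.abs(scaled) > 2.0)                  # only the 41% cell exceeds the threshold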


Thanks

Any Question?

82 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 3

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Data Preprocessing

1. Data Quality

2. Data preprocessing requirements

3. Data preprocessing techniques


As per session plan
• Data Preprocessing

o Data Quality

o Data preprocessing requirements

o Data preprocessing techniques


2 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Quality: Why
Preprocess the Data?
Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: Completeness measures the presence of all
expected data values and the absence of missing or null values.
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable the data are correct?
– Interpretability: how easily the data can be understood?

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Major Tasks in Data
Preprocessing
Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
– Integration of multiple databases, data cubes, or files
Data reduction
– Dimensionality reduction
– Numerosity reduction (It is a data reduction technique which
replaces the original data by smaller form of data representation)
– Data compression
Data transformation and data discretization
– Normalization
– Concept hierarchy generation
4 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data) - the person who enters
the data might not care about providing the correct value or does
not want to provide that information, which is common in e.g.,
survey forms
5 Dr. D. Dutta BITS Pilani, Pilani Campus
Incomplete (Missing) Data

Data is not always available


– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
Missing data may need to be inferred

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Labour Data with Missing
Values
1,5.0,?,?,?,40,?,?,2,?,11,average,?,?,yes,?,good
2,4.5,5.8,?,?,35,ret_allw,?,?,yes,11,below_average,?,full,?,full,good
?,?,?,?,?,38,empl_contr,?,5,?,11,generous,yes,half,yes,half,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,4.5,4.5,5.0,?,40,?,?,?,?,12,average,?,half,yes,half,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
3,6.9,4.8,2.3,?,40,?,?,3,?,12,below_average,?,?,?,?,good
2,3.0,7.0,?,?,38,?,12,25,yes,11,below_average,yes,half,yes,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,full,?,?,good
2,3.5,4.0,?,none,40,?,?,2,no,10,below_average,no,half,?,half,bad
3,3.5,4.0,5.1,tcf,37,?,?,4,?,13,generous,?,full,yes,full,good
1,3.0,?,?,none,36,?,?,10,no,11,generous,?,?,?,?,good
2,4.5,4.0,?,none,37,empl_contr,?,?,?,11,average,?,full,yes,?,good
1,2.8,?,?,?,35,?,?,2,?,12,below_average,?,?,?,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below_average,yes,half,?,none,bad…………
7 Dr. D. Dutta BITS Pilani, Pilani Campus
How to handle Missing Values

Ignore the tuple: usually done when class label is missing (when doing
classification) - not effective when the % of missing values per
attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian
formula or decision tree

8 Dr. D. Dutta BITS Pilani, Pilani Campus
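These automatic strategies are easy to illustrate with pandas; the tiny table and its column names are assumptions made for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, np.nan, 52_000, np.nan, 47_000],
                   "class":  ["low", "low", "high", "high", "high"]})

dropped  = df.dropna()                                       # ignore the tuple
constant = df["income"].fillna(-1)                           # fill with a global constant
mean_all = df["income"].fillna(df["income"].mean())          # fill with the attribute mean
by_class = df.groupby("class")["income"].transform(          # class-wise attribute mean
    lambda s: s.fillna(s.mean()))
print(by_class.tolist())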


Our Paper

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Noise Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

10 Dr. D. Dutta BITS Pilani, Pilani Campus


How to Handle Noise Data?
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
– smooth by fitting the data into regression functions
Clustering
– detect and remove outliers
Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)

11 Dr. D. Dutta BITS Pilani, Pilani Campus
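A short Python illustration of equal-frequency binning with smoothing by bin means and by bin boundaries; the sorted price list is a made-up example in the usual textbook style.

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])       # already sorted
bins = np.array_split(prices, 3)                             # equal-frequency bins of 3

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print([list(b) for b in by_means])              # bin means 9.0, 22.0 and 29.0
print([list(map(int, b)) for b in by_bounds])   # [4, 4, 15], [21, 21, 24], [25, 25, 34]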


Clustering

Clustering is the task of dividing the population or data points into a


number of groups such that data points in the same groups are more
similar to other data points in the same group and dissimilar to the
data points in other groups.

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering
to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheels)

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units

14 Dr. D. Dutta BITS Pilani, Pilani Campus


Handing Redundancy in Data
Integration
• Redundant data occur often when integration of multiple databases
– Object identification: The same attribute or object may have different
names in different databases
– Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Correlation Analysis (Nominal
Data)
• Χ2 (chi-square) test:
  χ² = Σ (Observed − Expected)² / Expected
• The chi-square test statistic measures the discrepancy between
observed and expected frequencies in categorical data.
• A low chi-square value suggests that there is little difference between
the observed and expected values, indicating a high level of
correlation or agreement.
• Conversely, a high chi-square value suggests a larger discrepancy
between observed and expected values, indicating a potential lack of
correlation or agreement

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Correlation Analysis (Nominal
Data)
• The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

17 Dr. D. Dutta BITS Pilani, Pilani Campus


Chi-Square Calculation: An
Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ2 (chi-square) calculation (numbers in parentheses are the expected counts,
calculated based on the data distribution in the two categories):

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
It shows that like_science_fiction and play_chess are correlated in the
group

18 Dr. D. Dutta BITS Pilani, Pilani Campus
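The same statistic can be recomputed directly in Python, with the expected counts derived from the row and column totals of the observed table.

# 2x2 contingency table from the slide: rows = science fiction preference,
# columns = playing chess or not.
observed = [[250, 200],
            [50, 1000]]
row_tot = [sum(r) for r in observed]                 # 450, 1050
col_tot = [sum(c) for c in zip(*observed)]           # 300, 1200
total = sum(row_tot)                                 # 1500

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / total   # 90, 360, 210, 840
        chi2 += (observed[i][j] - expected) ** 2 / expected
print(round(chi2, 2))   # 507.94 here; the slide reports 507.93 due to rounding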


Correlation Analysis (Numeric
Data)
Correlation coefficient (also called Pearson’s product moment coefficient)

rA,B = Σi=1..n (ai − Ā)(bi − B̄) / ((n − 1) σA σB) = (Σi=1..n (ai bi) − n·Ā·B̄) / ((n − 1) σA σB)

where n is the number of tuples, A and B are the respective means


of A and B, σA and σB are the respective standard deviation of A and
B, and Σ(aibi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
The higher the value, the stronger the correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated

19 Dr. D. Dutta BITS Pilani, Pilani Campus
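A quick numerical check of the formula in Python/NumPy, reusing for convenience the five (A, B) pairs from the covariance example a few slides later.

import numpy as np

a = np.array([2., 3., 5., 4., 6.])
b = np.array([5., 8., 10., 11., 14.])

# Pearson correlation from the definition, using sample standard deviations.
r = ((a - a.mean()) * (b - b.mean())).sum() / ((len(a) - 1) * a.std(ddof=1) * b.std(ddof=1))
print(round(r, 3), round(float(np.corrcoef(a, b)[0, 1]), 3))   # both ≈ 0.941 (> 0)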


Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Correlation (viewed as linear
relationship)
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, A and B, and then
take their dot product

a′k = (ak − mean(A)) / std(A)
b′k = (bk − mean(B)) / std(B)
correlation(A, B) = A′ • B′

21 Dr. D. Dutta BITS Pilani, Pilani Campus


Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σi=1..n (ai − Ā)(bi − B̄) / n, where E means expected value
Correlation coefficient: rA,B = Cov(A, B) / (σA σB)
where n is the number of tuples, A and B are the respective mean or
expected values of A and B, σA and σB are the respective standard
deviation of A and B.
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
Negative covariance: If CovA,B < 0 then if A is larger than its expected value,
B is likely to be smaller than its expected value.
Independence: CovA,B = 0 but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
22 Dr. D. Dutta BITS Pilani, Pilani Campus
Co-Variance: An Example

It can be simplified in computation as: Cov(A, B) = E[A·B] − Ā·B̄

Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
– A = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– B = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
23 Dr. D. Dutta BITS Pilani, Pilani Campus
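The simplified computation can be verified in a few lines of Python.

a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

mean_a = sum(a) / n                                   # 4.0
mean_b = sum(b) / n                                   # 9.6
cov = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
print(round(cov, 2))                                  # 4.0 > 0: A and B rise together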
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? - A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
24 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Reduction 1:
Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

25 Dr. D. Dutta BITS Pilani, Pilani Campus


Mapping Data to a New
Space
◼Fourier transform
◼Wavelet transform

[Figure: two sine waves; two sine waves + noise; the frequency domain]

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Fourier transform

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Fourier transform
• Fourier transform is a mathematical technique that decomposes a
complex waveform into its individual sine and cosine components. It
was developed by Joseph Fourier, a French mathematician and
physicist in the early 19th century. The Fourier transform is used to
analyze and represent functions in terms of their frequency
components.
• The Fourier transform takes a function of time (or space) and
converts it into a function of frequency. It is defined as follows:
• F(ω) = ∫ f(t) e^(-iωt) dt
• where F(ω) is the Fourier transform of the function f(t), ω is the
frequency variable, and i is the imaginary unit.
• But what is the Fourier Transform? A visual introduction.

28 Dr. D. Dutta BITS Pilani, Pilani Campus


Fourier transform
• The Fourier transform can be used to analyze various types of
signals, such as sound waves, images, and electromagnetic fields. It
is widely used in fields such as signal processing, communications,
image processing, and quantum mechanics.
• One of the most important properties of the Fourier transform is the
ability to represent a function in terms of its frequency content. This
allows us to filter out unwanted frequencies, isolate specific frequency
components, and manipulate signals in the frequency domain. The
inverse Fourier transform can be used to convert a function from the
frequency domain back to the time (or space) domain.

29 Dr. D. Dutta BITS Pilani, Pilani Campus
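A minimal NumPy illustration of recovering the frequency content of two sine waves buried in noise; the sampling rate and the 50 Hz / 120 Hz components are arbitrary choices for the demo.

import numpy as np

fs = 1000                                        # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
noisy = signal + 0.3 * np.random.randn(len(t))   # two sine waves + noise

spectrum = np.abs(np.fft.rfft(noisy))            # magnitude spectrum
freqs = np.fft.rfftfreq(len(t), 1 / fs)
print(freqs[spectrum.argsort()[-2:]])            # the two dominant frequencies (~120, ~50 Hz)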


Wavelet transform

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Wavelet transform
• Wavelet transform is a mathematical technique for analyzing signals
and images by decomposing them into a set of small wavelets. Unlike
the Fourier transform, which uses a fixed set of basis functions (sine
and cosine waves) to represent the signal in the frequency domain,
the wavelet transform uses a set of basis functions that are localized
in both time and frequency domains.
• The wavelet transform is useful for analyzing signals that have non-
stationary or time-varying properties, such as speech signals, music
signals, and biomedical signals. The wavelet transform allows us to
identify and analyze changes in frequency content over time, which is
important for many applications such as feature extraction, denoising,
compression, and pattern recognition.

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Wavelet transform
• There are two main types of wavelet transform: the continuous
wavelet transform (CWT) and the discrete wavelet transform (DWT).
The CWT is a continuous-time version of the wavelet transform,
where the wavelet basis functions are scaled and shifted continuously
over time. The DWT, on the other hand, is a discrete-time version of
the wavelet transform, where the wavelet basis functions are scaled
and shifted by powers of two.
• The wavelet transform has many applications in various fields such
as signal processing, image processing, computer vision, data
compression, and pattern recognition. It has also found applications
in the study of turbulence, financial time series analysis, and the
analysis of genomic data.

32 Dr. D. Dutta BITS Pilani, Pilani Campus
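As an illustration of the idea (independent of any particular library), one level of the simplest wavelet transform, the Haar DWT, can be written directly: pairwise averages give the approximation coefficients and pairwise differences the detail coefficients.

import numpy as np

def haar_dwt(x):
    # One level of the orthonormal Haar discrete wavelet transform (a sketch).
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)    # coarse, low-frequency content
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)    # local, high-frequency changes
    return approx, detail

a, d = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(a, d)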


Principal Component Analysis
(PCA)
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space.

[Figure: 2-D data points in the x1–x2 plane with the directions of largest variation]
33 Dr. D. Dutta BITS Pilani, Pilani Campus
Principal Component Analysis
(PCA) : Steps
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
34 Dr. D. Dutta BITS Pilani, Pilani Campus
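The steps above correspond closely to what a library implementation does; a small scikit-learn sketch (the random data stands in for a real numeric data set).

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                      # 100 objects, 5 numeric attributes

X_std = StandardScaler().fit_transform(X)       # normalize the input attributes
pca = PCA(n_components=2)                       # keep the 2 strongest components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                          # (100, 2) -- reduced dimensionality
print(pca.explained_variance_ratio_)            # variance captured per component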
Principal Component Analysis
(PCA)
• Principal Component Analysis (PCA) is a process of converting a set
of objects of correlated features to set of values of linearly
uncorrelated variables called Principal Components (PC).
• Over the years PC is used to discover or to reduce the number of
features of the data set and to identify new meaningful underlying
features.
• Many large data sets (many features and/or individuals) are available
now-a-days.
• If we want to derive some information from it then it will be very
difficult.
• PCA will be helpful here.

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA)
• PCA is a multivariate analysis based on true eigenvector.
• An eigenvector or characteristic vector of a linear transformation is a
non-zero vector whose direction does not change when that linear
transformation is applied to it.
• Multivariate analysis involves analysis of data set in which
observations are described by inter correlated dependent variables.
• PCA explains the variance in the data in the best possible way. It
represents the data into a set of new orthogonal variables called
Principal Components.

36 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA)
• When viewed from the most informative viewpoint of the objects, PCA
can provide a lower dimensional picture. Principal components were
introduced in the context of providing a small number of ‘more
fundamental’ variables which determine the values of the p original
variables.
• Researchers are using PCA for dimensionality reduction of high
dimensional data set.

37 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA): Example

38 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA) : Example

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA) : Example

40 Dr. D. Dutta BITS Pilani, Pilani Campus


Principal Component Analysis
(PCA) : Our Paper

41 Dr. D. Dutta BITS Pilani, Pilani Campus


Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
– Contain no information that is useful for the data mining task at
hand
– E.g., students' ID is often irrelevant to the task of predicting
students' GPA

42 Dr. D. Dutta BITS Pilani, Pilani Campus


Our Paper

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Heuristic Search in Attribute
Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence
assumption: choose by significance tests
– Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking

44 Dr. D. Dutta BITS Pilani, Pilani Campus
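A small sketch of best step-wise (greedy forward) feature selection, using cross-validated accuracy as the selection criterion; the classifier and data set are arbitrary illustrative choices, not part of the slide.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected, best_score = [], 0.0

# Repeatedly add the single attribute that most improves the score; stop when
# no remaining attribute improves it (step-wise forward selection).
while remaining:
    scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [f]], y, cv=5).mean(), f)
              for f in remaining]
    score, f = max(scores)
    if score <= best_score:
        break
    best_score = score
    selected.append(f)
    remaining.remove(f)

print(selected, round(best_score, 3))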


Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation, manifold
approaches (not covered)
– Attribute construction
• Combining features
• Data discretization

45 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Reduction 2:
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
– Ex.: Log-linear models - These are a class of statistical models
used to analyze and model categorical data. These models are
based on the principle that the logarithm of the expected
frequency of an event is a linear combination of the predictor
variables.
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …

46 Dr. D. Dutta BITS Pilani, Pilani Campus


Parametric Data Reduction:
Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear function of
multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability distributions.
In log-linear models, the logarithm of the expected frequency of an
event is modeled as a linear combination of the predictor
variables. The predictor variables may include the main effects of
one or more categorical variables, as well as interactions between
these variables. The log-linear model is typically fit using
maximum likelihood estimation, which involves finding the
parameter estimates that maximize the likelihood of the observed
data.
47 Dr. D. Dutta BITS Pilani, Pilani Campus
Regression Analysis
Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable
(also called response variable or measurement) and of one or more
independent variables (aka. explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
[Figure: data points (x, y) with a fitted line y = x + 1; the fitted value Y1′ is
the prediction for the observed point (X1, Y1)]

48 Dr. D. Dutta BITS Pilani, Pilani Campus


Regression Analysis and Log-
Linear Models
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to
be estimated by using the data at hand
– Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-
dimensional space for a set of discretized attributes, based on a
smaller subset of dimensional combinations
– Useful for dimensionality reduction and data smoothing
49 Dr. D. Dutta BITS Pilani, Pilani Campus
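The least-squares estimates of w and b in Y = wX + b can be computed directly; the small x/y sample below is made up for illustration.

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])          # roughly y = x + 1 with noise

# Closed-form least-squares estimates of the two regression coefficients.
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(round(w, 2), round(b, 2))                  # close to 1 and 1

print(np.polyfit(x, y, deg=1))                   # the same fit via NumPy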
Histogram Analysis
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth)
[Figure: histogram of prices with bucket counts (0–40) on the y-axis and price
buckets from 10,000 to 100,000 on the x-axis]
50 Dr. D. Dutta BITS Pilani, Pilani Campus
Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”:
It means that there is some kind of measurement error or variability in
the data that makes it less precise or less accurate.
• Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
• There are many choices of clustering definitions and clustering
algorithms
• Cluster analysis will be studied in depth later

51 Dr. D. Dutta BITS Pilani, Pilani Campus


Sampling
• Sampling: obtaining a small sample s to represent the whole data set
N
• Allow a mining algorithm to run in complexity that is potentially sub-
linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in the
presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling:
Stratified sampling is a sampling technique that involves dividing a
population into subgroups or strata based on some characteristic
or criteria, and then selecting a sample from each subgroup in
proportion to its size or importance.

52 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
– Used in conjunction with skewed data

53 Dr. D. Dutta BITS Pilani, Pilani Campus
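
The sketch below contrasts the sampling types above; it assumes NumPy and pandas are available, and the column names and segment proportions are purely illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "income": rng.normal(50000, 15000, size=1000),
    "segment": rng.choice(["retail", "corporate", "sme"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Simple random sampling without replacement (an object is drawn at most once)
srs_wor = df.sample(n=100, replace=False, random_state=1)

# Simple random sampling with replacement (the same object may be drawn again)
srs_wr = df.sample(n=100, replace=True, random_state=1)

# Stratified sampling: draw ~10% from each segment so skewed groups stay represented
stratified = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))
print(stratified["segment"].value_counts())
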


Sampling: With or without
Replacement

Raw Data
54 Dr. D. Dutta BITS Pilani, Pilani Campus
Sampling: Cluster or
Stratified Sampling

Raw Data Cluster/Stratified Sample

55 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Reduction 3: Data
Compression
• String compression
– The string should be compressed such that consecutive
duplicates of characters are replaced with the character and
followed by the number of consecutive duplicates. For
example, if the input string is “wwwwaaadexxxxxx”, then the
function should return “w4a3dex6”. This kind of compression is
called Run Length Encoding.
• Audio/video compression
– Typically lossy compression, with progressive refinement
– The lossy compression algorithms remove information that is not
perceptually important, such as high-frequency components that
are beyond the range of human hearing.
• Dimensionality and numerosity reduction may also be considered as
forms of data compression

56 Dr. D. Dutta BITS Pilani, Pilani Campus
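
A minimal Run Length Encoding sketch matching the string example above; plain Python, no extra libraries:

from itertools import groupby

def run_length_encode(text):
    # Replace each run of consecutive duplicate characters with the character
    # followed by the run length; single characters are kept as-is.
    parts = []
    for ch, run in groupby(text):
        count = sum(1 for _ in run)
        parts.append(ch if count == 1 else f"{ch}{count}")
    return "".join(parts)

print(run_length_encode("wwwwaaadexxxxxx"))   # -> w4a3dex6
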


Data Compression

[Figure: lossless compression maps the original data to compressed data and back
exactly, while lossy compression recovers only an approximation of the original data]
57 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing

58 Dr. D. Dutta BITS Pilani, Pilani Campus


Normalization
Min-max normalization: to [new_minA, new_maxA]
    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
  mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μA: mean, σA: standard deviation of attribute A):
    v' = (v − μA) / σA
– Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling
    v' = v / 10^j   where j is the smallest integer such that Max(|v'|) < 1
59 Dr. D. Dutta BITS Pilani, Pilani Campus
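
A short sketch of the three normalization methods above, reproducing the $73,600 example from the slide; plain Python, and the decimal-scaling helper is a simple illustration that assumes values of magnitude at least 1:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # j = number of digits of the largest absolute value, the smallest j with max(|v'|) < 1
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([73600, -12000, 98000]))  # all values scaled into (-1, 1)
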
Discretization
• Three types of attributes
– Nominal (categorical) - values from an unordered set, e.g., color,
profession
– Ordinal - values from an ordered set, e.g., military or academic
rank
– Numeric (continuous) - real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification

60 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or bottom-up
merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

61 Dr. D. Dutta BITS Pilani, Pilani Campus


Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– If A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well

• Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
62 Dr. D. Dutta BITS Pilani, Pilani Campus
Binning Methods for Data
Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

63 Dr. D. Dutta BITS Pilani, Pilani Campus
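
A sketch reproducing the equal-frequency binning and the two smoothing options above; plain Python, using the price list from the slide:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer bin boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
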


Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal interval width (binning), equal frequency
(binning), and K-means clustering; K-means clustering leads to better results]
64 Dr. D. Dutta BITS Pilani, Pilani Campus


Discretization by Classification &
Correlation Analysis
• Classification (e.g., decision tree analysis)
– Supervised: Given class labels, e.g., cancerous vs. benign
– Using entropy to determine split point (discretization point)
– Top-down, recursive split
Details to be covered later
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
– Supervised: use class information
– Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
– Merge performed recursively, until a predefined stopping condition

65 Dr. D. Dutta BITS Pilani, Pilani Campus


Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.

66 Dr. D. Dutta BITS Pilani, Pilani Campus


Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at the
schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}

67 Dr. D. Dutta BITS Pilani, Pilani Campus


Automatic Concept Hierarchy
Generation
Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the
data set
– The attribute with the most distinct values is placed at
the lowest level of the hierarchy
– Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values


68 Dr. D. Dutta BITS Pilani, Pilani Campus
Thanks

Any Question?

69 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 2

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Data Exploration and Description

1. Statistical descriptions of data

2. Measuring data similarity & dissimilarity

3. Data Visualization
As per session plan
• Data Exploration and Description

o Statistical descriptions of data

o Measuring data similarity & dissimilarity

o Data Visualization
2 Dr. D. Dutta BITS Pilani, Pilani Campus
Types of Data Set
Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector (is a representation of a
document in a vector space model, where each element of the vector corresponds to
a term (or word) in a vocabulary, and the value at each position represents the
frequency of that term in the document.)
– Transaction data

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
Spatial, image and multimedia:
– Spatial data: maps
– Image data
– Video data

Example term-frequency vectors (terms: team, coach, play, ball, score, game, win,
lost, timeout, season):
    Document 1: 3 0 5 0 2 6 0 2 0 2
    Document 2: 0 7 0 2 1 0 0 3 0 0
    Document 3: 0 1 0 0 1 2 2 0 3 0

3 Dr. D. Dutta
BITS Pilani, Pilani Campus
Labour Data with Missing
Values
1,5.0,?,?,?,40,?,?,2,?,11,average,?,?,yes,?,good
2,4.5,5.8,?,?,35,ret_allw,?,?,yes,11,below_average,?,full,?,full,good
?,?,?,?,?,38,empl_contr,?,5,?,11,generous,yes,half,yes,half,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,4.5,4.5,5.0,?,40,?,?,?,?,12,average,?,half,yes,half,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
3,6.9,4.8,2.3,?,40,?,?,3,?,12,below_average,?,?,?,?,good
2,3.0,7.0,?,?,38,?,12,25,yes,11,below_average,yes,half,yes,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,full,?,?,good
2,3.5,4.0,?,none,40,?,?,2,no,10,below_average,no,half,?,half,bad
3,3.5,4.0,5.1,tcf,37,?,?,4,?,13,generous,?,full,yes,full,good
1,3.0,?,?,none,36,?,?,10,no,11,generous,?,?,?,?,good
2,4.5,4.0,?,none,37,empl_contr,?,?,?,11,average,?,full,yes,?,good
1,2.8,?,?,?,35,?,?,2,?,12,below_average,?,?,?,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below_average,yes,half,?,none,bad…………
4 Dr. D. Dutta BITS Pilani, Pilani Campus
Important Characteristics of
Structured Data
Dimensionality - Dimensionality refers to the number of features or
attributes in a dataset.
– Curse of dimensionality: It is a phenomenon that occurs when
the number of dimensions (features) in a dataset is very high. As
the dimensionality increases, the volume of the space increases
exponentially, which causes several issues such as Sparsity,
Distance Metrics, Overfitting, Computational Complexity.

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Important Characteristics of
Structured Data
Sparsity - Sparsity refers to a condition in structured data where the
majority of the data points have a value of zero or are empty. This
characteristic is common in high-dimensional datasets, where most
of the values in the feature space do not contain information or are
absent.
Only presence counts: The absence or zero values, which are
predominant, do not contribute much to the analysis. This can be
seen in various scenarios such as:
• Document-Term Matrices: In text mining, a document-term matrix
often has many zero values, indicating the absence of a particular
word in a document. The few non-zero values indicate the presence
and frequency of words, which are critical for analysis.

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Important Characteristics of
Structured Data
Resolution: In the context of structured data, resolution refers to the
level of detail or granularity at which data is collected, stored, and
analyzed. Resolution can significantly impact the ability to detect
patterns and insights within the data.
– Patterns depend on the scale: Patterns or trends that can be
identified in the data are influenced by the resolution at which
the data is observed. Different scales can reveal different types
of information.

7 Dr. D. Dutta BITS Pilani, Pilani Campus


Important Characteristics of
Structured Data
Distribution: In the context of structured data, distribution refers to how
the values of a dataset are spread or dispersed across different
possible values. Understanding the distribution of data is fundamental
to statistical analysis and helps in summarizing and describing the
characteristics of the data.
– Centrality measures provide an indication of the central point
around which the data values are clustered.
– Dispersion measures provide an indication of how spread out
the data values are.

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Objects

Data sets are made up of data objects.


A data object represents an entity.
Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns ->attributes.

9 Dr. D. Dutta BITS Pilani, Pilani Campus


Attributes

Attribute (or dimensions, features, variables): a data field,


representing a characteristic or feature of a data object.
– E.g., customer _ID, name, address
Types:
– Nominal
– Binary
– Numeric:
• Quantitative
• Interval-scaled
• Ratio-scaled

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Attribute Types

Nominal: categories, states, or “names of things”


– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
– Values have a meaningful order (ranking) but magnitude
between successive values is not known.
– Size = {small, medium, large}, grades, army rankings
11 Dr. D. Dutta BITS Pilani, Pilani Campus
Numeric Attribute Types

Quantity (integer or real-valued)


Interval
• Measured on a scale of equal-sized units
• Values have order
– E.g., temperature in C˚or F˚, calendar dates
• No true zero-point
Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as high
as 5 K˚).
– e.g., temperature in Kelvin, length, counts, monetary
quantities

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Discrete vs continuous
attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented using
a finite number of digits
– Continuous attributes are typically represented as floating-point
variables

13 Dr. D. Dutta BITS Pilani, Pilani Campus


Basic Statistical Description
of Data
• Motivation
– To better understand the data: central tendency, variation and
spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion (Refers to the spread of data points in a dataset.
It shows how much the data varies): analyzed with multiple
granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions (refers to the
process of integrating various measures or metrics into a unified
numerical framework.)
– Boxplot or quantile analysis on the transformed cube
14 Dr. D. Dutta BITS Pilani, Pilani Campus
Measuring the Central
Tendency
Mean (algebraic measure) (sample vs. population):
    x̄ = (1/n) Σ xi        μ = (Σ x) / N
    Note: n is sample size and N is population size.
– Weighted arithmetic mean:  x̄ = (Σ wi xi) / (Σ wi)
– Trimmed mean: Chopping extreme values
Median:
– Middle value if odd number of values, or average of
the middle two values otherwise
– Estimated by interpolation (for grouped data)
Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: mean − mode = 3 × (mean − median)

15 Dr. D. Dutta
BITS Pilani, Pilani Campus
Measuring the Central
Tendency

For grouped data, the median is estimated by interpolation:
    Median = L + ((N/2 − F) / f) × h
Here
L is the lower boundary of the median class.
N is the total number of observations.
F is the cumulative frequency of the class preceding the median class.
f is the frequency of the median class.
h is the class width (the difference between the upper and lower
boundaries of the class).

16 Dr. D. Dutta 1
6
BITS Pilani, Pilani Campus
Measuring the Central
Tendency

17 Dr. D. Dutta 1
7
BITS Pilani, Pilani Campus
Measuring the Central
Tendency
Identify the Median Class:
• Total number of observations, N=50
• N/2=50/2=25
• The cumulative frequency just greater than 25 is 40, so the median
class is 30 - 40.
Apply the Formula:
• L=30 (lower boundary of the median class)
• N=50
• F=25 (cumulative frequency of the class before the median class, which
is 20 - 30)
• f=15 (frequency of the median class)
• h=10 (class width)
Median=30+(50/2−25)/15×10=30

18 Dr. D. Dutta 1
8
BITS Pilani, Pilani Campus
Symmetric vs Skewed
Data
Median, mean and mode of symmetric,
positively and negatively skewed data
[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data]

19 Dr. D. Dutta
BITS Pilani, Pilani Campus
Measuring the Dispersion of
Data
Quartiles, outliers and boxplots
– Quartiles: These are values that divide a set of data into four equal
parts. Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
– Outlier: usually, a value higher/lower than 1.5 x IQR
Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
    s² = (1/(n−1)) Σ (xi − x̄)²  =  (1/(n−1)) [ Σ xi² − (1/n)(Σ xi)² ]
    σ² = (1/N) Σ (xi − μ)²  =  (1/N) Σ xi² − μ²

– Standard deviation s (or σ) is the square root of variance s2 (or σ2)


20 Dr. D. Dutta 2
0
BITS Pilani, Pilani Campus
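
A sketch computing the five-number summary, the IQR, the 1.5 × IQR outlier rule, and the sample variance and standard deviation described above; NumPy assumed, data values illustrative:

import numpy as np

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 70])   # 70 is a deliberate outlier

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, median, q3, x.max())

# Usual boxplot rule: flag values beyond 1.5 x IQR from the quartiles
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

sample_var = x.var(ddof=1)    # divides by n - 1 (sample variance)
sample_std = x.std(ddof=1)
print(five_number, iqr, outliers, round(sample_std, 2))
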
Measuring the Dispersion of
Data

21 Dr. D. Dutta 2
1
BITS Pilani, Pilani Campus
Boxplot Analysis
Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually

22 Dr. D. Dutta 2
2
BITS Pilani, Pilani Campus
Properties of Normal
Distribution Curve
• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ:
mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Graphic Displays of Basic
Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis are values, y-axis represents frequencies
• Quantile plot: each value xi is paired with fi indicating that
approximately 100 fi % of data are ≤ xi
• Scatter plot: each pair of values is a pair of coordinates and plotted
as points in the plane

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
[Figure: histogram of values over intervals from 10,000 to 90,000]
25 Dr. D. Dutta BITS Pilani, Pilani Campus


Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For a data xi data sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value
xi

26 Dr. D. Dutta BITS Pilani, Pilani Campus


Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers,
etc
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Positively and Negatively
Correlated Data

The left half fragment is positively correlated
The right half is negatively correlated
28 Dr. D. Dutta BITS Pilani, Pilani Campus


Uncorrelated Data

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Visualization
• Why data visualization?
– Gain insight into an information space by mapping data onto graphical
primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure, irregularities, relationships among
data
– Help find interesting regions and suitable parameters for further
quantitative analysis
– Provide a visual proof of computer representations derived
• Categorization of visualization methods:
– Pixel-oriented visualization techniques
– Geometric projection visualization techniques
– Icon-based visualization techniques
– Hierarchical visualization techniques
– Visualizing complex data and relations
30 Dr. D. Dutta BITS Pilani, Pilani Campus
Pixel Oriented Visualization
Techniques
• Pixel-oriented visualization techniques are used to represent large
data sets by mapping each individual data value to a single pixel. This
approach leverages the high resolution of modern computer screens
to display vast amounts of data in a compact and comprehensible
manner.
• Each data value is mapped to a single pixel in the display.
• The color and intensity of the pixel can encode the data value, such
as using color gradients to represent different ranges of values.

31 Dr. D. Dutta BITS Pilani, Pilani Campus


Pixel Oriented Visualization
Techniques
• For a data set of m dimensions, create m windows on the screen, one
for each dimension
• The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
• The colors of the pixels reflect the corresponding values

(a) Income (b) Credit Limit (c) transaction volume (d) age
32 Dr. D. Dutta 32
BITS Pilani, Pilani Campus
Laying Out Pixels in Circle
Segment
In this technique, pixels representing data values are arranged within
segments of a circle rather than traditional row-column grids. This
method can be particularly effective in emphasizing periodic patterns or
cyclic data.
Data points are arranged in concentric circular segments, with each segment
representing a different subset or category of the data.
[Figure: representing a data record in circle segments]

33 Dr. D. Dutta BITS Pilani, Pilani Campus


Geometric Projection
Visualization Techniques
• Visualization of geometric transformations and projections of the data
• Methods
– Direct visualization
– Scatterplot and scatterplot matrices
– Landscapes
– Projection pursuit technique: Help users find meaningful
projections of multidimensional data
– Prosection views
– Hyperslice
– Parallel coordinates

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Direct Data Visualization
Ribbons with Twists Based on Vorticity

35 Dr. D. Dutta
BITS Pilani, Pilani Campus
Scatterplot Matrices

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k²−k)/2 distinct scatterplots]


36 Dr. D. Dutta BITS Pilani, Pilani Campus
Landscape

Visualization of the data as perspective landscape


The data needs to be transformed into a (possibly artificial) 2D spatial
representation which preserves the characteristics of the data
37 Dr. D. Dutta BITS Pilani, Pilani Campus
Parallel Coordinates
• n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
• The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
• Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute

• • •

Attr. 1 Attr. 2 Attr. 3 Attr. k


38 Dr. D. Dutta BITS Pilani, Pilani Campus
Parallel Coordinates of a
Data Set

39 Dr. D. Dutta BITS Pilani, Pilani Campus


Icon-Based Visualization
Techniques
• Visualization of the data values as features of icons
• Typical visualization methods
– Chernoff Faces
– Stick Figures
• General techniques
– Shape coding: Use shape to represent certain information
encoding
– Color icons: Use color icons to encode more information
– Tile bars: Use small icons to represent the relevant feature vectors
in document retrieval

40 Dr. D. Dutta BITS Pilani, Pilani Campus


Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be
eyebrow slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics--head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening): Each assigned one of 10 possible values, generated using
Mathematica (S. Dickson)

REFERENCE: Gonick, L. and Smith, W.


The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html

41 Dr. D. Dutta BITS Pilani, Pilani Campus


Stick Figure

A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs w. different angle/length)

Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at the texture pattern.
42 Dr. D. Dutta BITS Pilani, Pilani Campus
Hierarchical Visualization
Techniques
• Visualization of the data using a hierarchical partitioning into
subspaces
• Methods
– Dimensional Stacking
– Worlds-within-Worlds
– Tree-Map
– Cone Trees
– InfoCube

43 Dr. D. Dutta BITS Pilani, Pilani Campus


Dimensional Stacking

[Figure: four attributes (attribute 1 to attribute 4) stacked pairwise into a 2-D grid]
• Partitioning of the n-dimensional attribute space in 2-D subspaces,


which are ‘stacked’ into each other
• Partitioning of the attribute value ranges into classes. The important
attributes should be used on the outer levels.
• Adequate for data with ordinal attributes (Example: Size = {small,
medium, large}) of low cardinality
• But, difficult to display more than nine dimensions
• Important to map dimensions appropriately
44 Dr. D. Dutta BITS Pilani, Pilani Campus
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute

Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
45 Dr. D. Dutta BITS Pilani, Pilani Campus
Worlds Within Worlds
Assign the function and two most important parameters to innermost world
Fix all other parameters at constant values - draw other (1 or 2 or 3
dimensional worlds choosing these as the axes)
Software that uses this paradigm

• N - vision: Dynamic
interaction through data
glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
• Auto Visual: Static
interaction by means of
queries
46 Dr. D. Dutta BITS Pilani, Pilani Campus
Tree Map
• A treemap uses nested rectangles to represent the branches and
leaves of a tree structure. Each branch is given a rectangle, which is
then tiled with smaller rectangles representing sub-branches. The
size of each rectangle is proportional to a specific data dimension
(such as value or size), and colors are often used to add an additional
data dimension.

MSR Netscan Image

47 Dr. D. Dutta BITS Pilani, Pilani Campus


InfoCube
A 3-D visualization technique where hierarchical information is displayed as
nested semi-transparent cubes
The outermost cubes correspond to the top level data, while the subnodes
or the lower level data are represented as smaller cubes inside the
outermost cubes, and so on

48 Dr. D. Dutta BITS Pilani, Pilani Campus


Three-D Cone Trees
3D cone tree visualization technique works
well for up to a thousand nodes or so
First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
Cannot avoid overlaps when projected to 2D
G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next

Ack.: http://nadeausoftware.com/articles/visualization
49 Dr. D. Dutta 4
9
BITS Pilani, Pilani Campus
Visualizing Complex Data and
Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags

◼ The importance of tag is


represented by font
size/color
◼ Besides text data, there
are also methods to
visualize relationships,
such as visualizing social
networks

Newsmap: Google News Stories in 2005


50 Dr. D. Dutta BITS Pilani, Pilani Campus
Similarity or Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity

51 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Matrix and Dissimilarity
Matrix
Data matrix
– n data points with p dimensions
– Two modes

      [ x11  ...  x1f  ...  x1p ]
      [ ...  ...  ...  ...  ... ]
      [ xi1  ...  xif  ...  xip ]
      [ ...  ...  ...  ...  ... ]
      [ xn1  ...  xnf  ...  xnp ]

Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode

      [   0                         ]
      [ d(2,1)    0                 ]
      [ d(3,1)  d(3,2)   0          ]
      [   :       :      :          ]
      [ d(n,1)  d(n,2)  ...  ...  0 ]
52 Dr. D. Dutta BITS Pilani, Pilani Campus


Proximity Measure of
Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
– m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

• Method 2: Use a large number of binary attributes


– creating a new binary attribute for each of the M nominal states

53 Dr. D. Dutta BITS Pilani, Pilani Campus


Proximity Measure of Binary
Attributes
A contingency table for binary data (object i vs. object j):
    q = # of attributes where i = 1 and j = 1
    r = # of attributes where i = 1 and j = 0
    s = # of attributes where i = 0 and j = 1
    t = # of attributes where i = 0 and j = 0
Distance measure for symmetric binary
variables:   d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric
binary variables:   d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure
for asymmetric binary variables):   simJaccard(i, j) = q / (q + r + s)
◼ Note: Jaccard coefficient is the same as “coherence”

54 Dr. D. Dutta BITS Pilani, Pilani Campus


Dissimilarity Between Binary
Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

– Gender is a symmetric attribute


– The remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N 0
0+1
d ( jack , mary ) = = 0.33
2+ 0+1
1+1
d ( jack , jim ) = = 0.67
1+1+1
1+ 2
d ( jim , mary ) = = 0.75
1+1+ 2
55 Dr. D. Dutta BITS Pilani, Pilani Campus
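
A sketch reproducing the asymmetric binary dissimilarities above; plain Python, with Y/P mapped to 1 and N to 0 as on the slide, and the symmetric gender attribute excluded:

# Asymmetric binary attributes only: Fever, Cough, Test-1 .. Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

def asym_binary_dissim(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))   # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))   # 0-1 mismatches
    return (r + s) / (q + r + s)                        # 0-0 matches are ignored

print(round(asym_binary_dissim(jack, mary), 2))   # 0.33
print(round(asym_binary_dissim(jack, jim), 2))    # 0.67
print(round(asym_binary_dissim(jim, mary), 2))    # 0.75
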
Standardizing Numeric Data
Z-score:
    z = (x − μ) / σ
– X: raw score to be standardized, μ: mean of the population, σ:
standard deviation
– the distance between the raw score and the population mean in
units of the standard deviation
– negative when the raw score is below the mean, “+” when above
An alternative way: Calculate the mean absolute deviation
    sf = (1/n) (|x1f − mf| + |x2f − mf| + ... + |xnf − mf|)
    where mf = (1/n) (x1f + x2f + ... + xnf)
– standardized measure (z-score):
    zif = (xif − mf) / sf
Using mean absolute deviation is more robust than using standard
deviation
56 Dr. D. Dutta BITS Pilani, Pilani Campus
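
A sketch contrasting the two standardization options above; NumPy assumed, and the income values are illustrative:

import numpy as np

x = np.array([30000, 42000, 54000, 66000, 78000], dtype=float)

# Classical z-score: (x - mean) / standard deviation
z_std = (x - x.mean()) / x.std()

# More robust variant: divide by the mean absolute deviation instead
mad = np.mean(np.abs(x - x.mean()))
z_mad = (x - x.mean()) / mad

print(np.round(z_std, 2), np.round(z_mad, 2))
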
Example: Data Matrix and
Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5

Dissimilarity Matrix
(with Euclidean Distance)

x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

57 Dr. D. Dutta BITS Pilani, Pilani Campus


Distance on Numeric Data:
Minkowski Distance
Minkowski distance: A popular distance measure
    d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance so
defined is also called L-h norm)
Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric

58 Dr. D. Dutta BITS Pilani, Pilani Campus


Special Cases of Minkowski
Distance
h = 1: Manhattan (city block, L1 norm) distance
– E.g., the Hamming distance: the number of bits that are
different between two binary vectors
    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
h = 2: (L2 norm) Euclidean distance
    d(i, j) = sqrt(|xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|²)
h → ∞: “supremum” (Lmax norm, L∞ norm) distance
– This is the maximum difference between any component
(attribute) of the vectors

59 Dr. D. Dutta BITS Pilani, Pilani Campus


Example: Minkowski Distance
Manhattan (L1) Dissimilarity Matrices
point attribute 1 attribute 2
x1 1 2 L x1 x2 x3 x4
x1 0
x2 3 5
x2 5 0
x3 2 0 x3 3 6 0
x4 4 5 x4 6 1 7 0

Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

Supremum
L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
60 Dr. D. Dutta BITS Pilani, Pilani Campus
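
A sketch reproducing the Manhattan, Euclidean, and supremum dissimilarity matrices above; SciPy and NumPy assumed, and the four points are taken from the slide:

from scipy.spatial.distance import pdist, squareform
import numpy as np

points = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])   # x1 .. x4

manhattan = squareform(pdist(points, metric="cityblock"))   # h = 1
euclidean = squareform(pdist(points, metric="euclidean"))   # h = 2
supremum  = squareform(pdist(points, metric="chebyshev"))   # h -> infinity

print(np.round(manhattan, 2))
print(np.round(euclidean, 2))
print(np.round(supremum, 2))
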
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace xif by their rank rif ∈ {1, ..., Mf}
– map the range of each variable onto [0, 1] by replacing i-th object
in the f-th variable by
    zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using methods for interval-scaled
variables

61 Dr. D. Dutta BITS Pilani, Pilani Campus


Attributes of Mixed Type
A database may contain all attribute types
– Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
    d(i, j) = ( Σf=1..p δij(f) dij(f) ) / ( Σf=1..p δij(f) )
– f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
– f is numeric: use the normalized distance
– f is ordinal
• Compute ranks rif and zif = (rif − 1) / (Mf − 1)
• Treat zif as interval-scaled

62 Dr. D. Dutta BITS Pilani, Pilani Campus
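
A sketch of the weighted mixed-type formula above for two objects with one nominal, one numeric, and one ordinal attribute; plain Python, and the attribute values, income range, and number of ordinal levels are made up for illustration:

# Object = (colour: nominal, income: numeric, size: ordinal rank out of M levels)
obj_i = ("red", 73600, 2)
obj_j = ("blue", 54000, 3)
income_range = 98000 - 12000      # assumed max - min of income over the data set
M = 3                             # assumed number of ordinal levels (small/medium/large)

d_nominal = 0 if obj_i[0] == obj_j[0] else 1
d_numeric = abs(obj_i[1] - obj_j[1]) / income_range   # normalized distance
z_i = (obj_i[2] - 1) / (M - 1)                        # map ranks onto [0, 1]
z_j = (obj_j[2] - 1) / (M - 1)
d_ordinal = abs(z_i - z_j)

deltas = [1, 1, 1]                # all three attributes are present and usable
d_mixed = (d_nominal + d_numeric + d_ordinal) / sum(deltas)
print(round(d_mixed, 3))
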


Cosine Similarity
A document can be represented by thousands of attributes, each recording
the frequency of a particular word (such as keywords) or phrase in the
document.

Other vector objects: gene features in micro-arrays, …


Applications: information retrieval, biologic taxonomy, gene feature mapping,
...
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors),
then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

63 Dr. D. Dutta BITS Pilani, Pilani Campus


Example: Cosine Similarity
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94

64 Dr. D. Dutta BITS Pilani, Pilani Campus
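
A sketch reproducing the cosine-similarity computation above; NumPy assumed, with d1 and d2 being the term-frequency vectors from the slide:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
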


Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
– Basic statistical data description: central tendency, dispersion,
graphical displays
– Data visualization: map data onto graphical primitives
– Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of
research.

65 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

66 Dr. D. Dutta BITS Pilani, Pilani Campus


BUSINESS DATA MINING
Session 1

BITS Pilani Dr. D. Dutta


Pilani Campus
Contents

As per syllabus
1. Introduction to Data Mining (DM)

1. DM definitions & activities

2. DM processes & challenges


As per session plan
• Introduction to Data Mining

o Data Mining definitions

o Data Mining activities

o DM process

o DM challenges
2 Dr. D. Dutta BITS Pilani, Pilani Campus
What is Data Mining?

Data mining is the process of discovering patterns, relationships,


anomalies, and valuable information from large sets of data. It
involves using various techniques and algorithms to analyze and
extract meaningful insights from data, with the goal of making
informed decisions, predicting future trends, and identifying hidden
patterns.

3 Dr. D. Dutta BITS Pilani, Pilani Campus


Why Data Mining?

The tremendous expansion of data, scaling from terabytes (1012 bytes) to


petabytes (1015 bytes), reflects the explosive growth in data generation.
This surge is propelled by automated data collection tools, advanced database
systems, and the pervasive influence of the web in our computerized
society.
Various sectors contribute substantially to the abundance of data:
– Business: Enormous datasets are generated from web interactions, e-
commerce activities, financial transactions, and stock market
operations.
– Science: Fields like remote sensing, bioinformatics, and scientific
simulations produce substantial volumes of data, contributing to the
overall data deluge.
– Society at Large: Everyday sources, such as news outlets, digital
cameras, and platforms like YouTube, continually add to the vast pool of
data.

4 Dr. D. Dutta BITS Pilani, Pilani Campus


Why Data Mining?

Despite this data deluge, a paradox exists: we find ourselves drowning in data
but often starved for knowledge.
To address this challenge, the principle of "necessity is the mother of invention"
comes into play.
Data mining – an innovative solution involving the automated analysis of
massive datasets.
Data mining serves as a crucial tool for uncovering patterns, relationships, and
valuable insights within the overwhelming sea of data. By leveraging
advanced algorithms, it allows us to transform raw data into actionable
knowledge. In a world where data is abundant but meaningful insights are
scarce, data mining plays a pivotal role in extracting valuable knowledge
from the vast ocean of information.

5 Dr. D. Dutta BITS Pilani, Pilani Campus


Evolution of Database
Technology
1960s:
– Data collection, database creation, MIS (Management Information system) and
network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

6 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining: A Misnomer

The term "data mining" has been criticized by some as a misnomer


because it implies that the process is similar to mining for gold or
other valuable resources. However, the process of data mining does
not actually involve extracting or removing data from a source, but
rather involves analyzing and extracting insights from existing data
sets.
Instead of "mining," some experts prefer the term "knowledge discovery
in databases" (KDD) to describe the process of data mining. KDD
refers to the process of discovering patterns, trends, and insights
from large data sets using statistical and computational methods.
Regardless of the terminology used, the process of data mining or KDD
is an important tool for businesses and researchers to extract
valuable information from large data sets and make data-driven
decisions.

7 Dr. D. Dutta BITS Pilani, Pilani Campus


What is Data Mining?

The process of extracting information to identify patterns, trends, and


useful data that would allow the business to take the data-driven
decision from huge sets of data is called Data Mining.
In other words, we can say that Data Mining is the process of
investigating hidden patterns of information to various perspectives
for categorization into useful data, which is collected and assembled
in particular areas such as data warehouses, efficient analysis, data
mining algorithm, helping decision making and other data
requirement to eventually cost-cutting and generating revenue.
Data mining is the act of automatically searching for large stores of
information to find trends and patterns that go beyond simple
analysis procedures.
Data Mining is a process used by organizations to extract specific data
from huge databases to solve business problems. It primarily turns
raw data into useful information.

8 Dr. D. Dutta BITS Pilani, Pilani Campus


Knowledge Discovery from
Database (KDD)
This is a view from typical database
systems and data warehousing
Pattern Evaluation
communities
Data mining plays an essential role in
the knowledge discovery process
Data Mining

Task-relevant Data

Data Warehouse Selection

Data Cleaning

Data Integration

Databases
9 Dr. D. Dutta BITS Pilani, Pilani Campus
KDD: An Alternative View

Input Data → Data Pre-Processing → Data Mining → Post-Processing
– Data Pre-Processing: data integration, normalization, feature selection,
dimension reduction
– Data Mining: pattern discovery, association & correlation, classification,
clustering, outlier analysis
– Post-Processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization
This is a view from typical machine learning and statistics communities

10 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining and Business
Intelligence
Increasing potential to support business decisions (from the bottom layer to the top),
with the typical user of each layer:
– Decision Making (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
11 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Mining: Confluence of
Multiple Disciplines
Data mining sits at the confluence of: Machine Learning, Pattern Recognition,
Statistics, Applications, Visualization, Algorithms, Database Technology, and
High-Performance Computing

12 Dr. D. Dutta BITS Pilani, Pilani Campus


Why not Traditional Data
Analysis?
Tremendous amount of data
– Algorithms must be highly scalable to handle such as tera-bytes of
data
High-dimensionality of data
– Micro-array may have tens of thousands of dimensions
High complexity of data
– Data streams and sensor data
– Time-series data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
New and sophisticated applications
13 Dr. D. Dutta BITS Pilani, Pilani Campus
Multi-Dimensional View of
Data Mining
Data to be mined
– Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
14 Dr. D. Dutta BITS Pilani, Pilani Campus
Types of data mining based
on data
Data mining can be performed on the following types of data:
• Relational Database: A relational database is a collection of multiple
data sets formally organized by tables, records, and columns from
which data can be accessed in various ways without having to
recognize the database tables. Tables convey and share information,
which facilitates data searchability, reporting, and organization.
• Data warehouses: A Data Warehouse is the technology that collects
the data from various sources within the organization to provide
meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision making for a
business organization. The data warehouse is designed for the
analysis of data rather than transaction processing.

15 Dr. D. Dutta BITS Pilani, Pilani Campus


Types of data mining based
on data
Data Repositories: The Data Repository generally refers to a
destination for data storage. However, many IT professionals utilize
the term more clearly to refer to a specific kind of setup within an IT
structure. For example, a group of databases, where an organization
has kept various kinds of information.
Object-Relational Database: A combination of an object-oriented
database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc. One
of the primary objectives of the Object-relational data model is to
close the gap between the Relational database and the object-
oriented model practices frequently utilized in many programming
languages, for example, C++, Java, C#, and so on.
Transactional Database: A transactional database refers to a database
management system (DBMS) that has the potential to undo a
database transaction if it is not performed appropriately.

16 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining: Classification
Schemes
General functionality
– Descriptive data mining: As the name suggests, descriptive mining
"describe" the data. Once the data is captured, we convert it into
human interpretable form. Descriptive analytics focus on answering
"What has happened in the past?" Descriptive analytics is useful
because it enables us to learn from the past.
– Predictive data mining: The term 'Predictive' means to predict
something, so predictive data mining is the analysis done to predict
the future event or other data or trends. Predictive data mining can
enable business analysts to make decisions and add value to the
analytics team efforts. Predictive data mining supports predictive
analytics. As we know, predictive analytics is the use of information to
predict outcomes.
17 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Mining: Classification
Schemes
Different views lead to different classifications
– Data view: Kinds of data to be mined
– Knowledge view: Kinds of knowledge to be discovered
– Method view: Kinds of techniques utilized
– Application view: Kinds of applications adapted

18 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining: On What Kinds
of Data?
Database-oriented data sets and applications
– Relational database, data warehouse, transactional database (discussed
earlier)
Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web

19 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (1)
Generalization
Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional
data model
Data cube technology
– Scalable methods for computing (i.e., materializing)
multidimensional aggregates
– OLAP (online analytical processing)
Multidimensional concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry
vs. wet regions
– Wet regions in data mining refer to domains or applications where
large amounts of data are available, dry regions in data mining
refer to domains or applications where data is scarce or limited.

20 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (2) Association
and Correlation Analysis
Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Smart
Bazar?
Association, correlation vs. causality
– A typical association rule
• Diaper → Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?
• Association is a data mining function that discovers the
probability of the co-occurrence of items in a collection. The
relationships between co-occurring items are expressed
as Association Rules.
• A correlation coefficient measures the extent to which the
value of one variable changes with another.

21 Dr. D. Dutta BITS Pilani, Pilani Campus
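
A sketch of how support and confidence for a rule such as Diaper → Beer are computed from transactions; plain Python, and the toy transactions reuse the TID/Items example shown earlier in these slides, so the resulting numbers differ from the illustrative [0.5%, 75%] on this slide:

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # confidence of the rule antecedent -> consequent
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diaper", "Beer"}))        # support of {Diaper, Beer}: 0.4
print(confidence({"Diaper"}, {"Beer"}))   # confidence of Diaper -> Beer: about 0.67
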


Data Mining Functions: (2) Association
and Correlation Analysis
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other
applications?
Correlation, on the other hand, specifically measures the strength and
direction of a linear relationship between two continuous variables.
The correlation coefficient, often denoted as "r," ranges from -1 to 1. A
positive correlation indicates a direct relationship, a negative
correlation indicates an inverse relationship, and 0 indicates no linear
correlation.

22 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (2) Association
and Correlation Analysis
Relationship Between the Concepts:
While strongly associated items may exhibit some degree of correlation,
association rules do not explicitly measure the strength of the linear
relationship between variables as correlation does.
Association rules focus on co-occurrence or presence relationships
between items, whereas correlation measures the degree to which
changes in one variable correspond to changes in another, with a
specific emphasis on the linearity of the relationship.
In summary, strong association between items doesn't necessarily imply
strong correlation in the statistical sense, and vice versa.

23 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (3)
Classification and Prediction
Classification and prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars
based on (mileage)
– Predict some unknown or missing numerical values
– Classification is a supervised learning task where the goal is to
categorize input data into predefined classes or categories based
on the features or attributes of the data.
– Prediction, often referred to as regression, is a supervised
learning task where the goal is to predict a continuous numerical
outcome based on input features.

24 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (3)
Classification and Prediction
Typical classification methods
– Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression, …
Typical applications of classification:
– Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
Typical prediction methods
– Linear Regression, Decision Trees, Support Vector Machines (SVM), K-
Nearest Neighbors (KNN), Neural Networks, ARIMA (AutoRegressive
Integrated Moving Average), Prophet,…
Typical applications of prediction:
– Stock Price Forecasting, Disease Prediction, Customer Churn Prediction,
Demand Forecasting, Network Fault Prediction, Short-term and Long-
term Weather Prediction
25 Dr. D. Dutta BITS Pilani, Pilani Campus
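
A minimal sketch of the two task families listed above, one classifier and one numeric predictor; scikit-learn assumed, and the tiny data sets are made up for illustration only:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label from two numeric features
X_cls = [[25, 30000], [47, 82000], [35, 40000], [52, 110000]]
y_cls = ["no", "yes", "no", "yes"]          # e.g., responded to a campaign
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[40, 60000]]))

# Prediction (regression): predict a continuous numerical value
X_reg = [[1], [2], [3], [4]]
y_reg = [2.1, 2.9, 4.2, 5.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))
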
Data Mining Functions: (4)
Cluster and Outlier Analysis
Cluster analysis
– Unsupervised learning (i.e., Class label is unknown)
– Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
– Principle: Maximizing intra-class similarity & minimizing inter-class
similarity
– Many methods and applications
Outlier analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Noise or exception? - One person’s garbage could be another
person’s treasure. Methods: by product of clustering or regression
analysis, …
– Useful in fraud detection, rare events analysis
26 Dr. D. Dutta BITS Pilani, Pilani Campus
Data Mining Functions: (5)
Trend and Evolution Analysis
Sequence, trend and evolution analysis
– Trend and deviation analysis: e.g., regression
– Sequential pattern mining
• e.g., first buy digital camera, then large SD memory cards
– Periodicity analysis
– Motifs, time-series, and biological sequence analysis
• Approximate and consecutive motifs
– Similarity-based analysis
Mining data streams
– Ordered, time-varying, potentially infinite, data streams

27 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining Functions: (6)
Structure and Network Analysis
Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
• A person could be multiple information networks: friends,
family, classmates, …
– Links carry a lot of semantic information: Link mining
Web mining
– Web is a big information network: from PageRank to Google
– Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
28 Dr. D. Dutta BITS Pilani, Pilani Campus
Major Challenges in Data
Mining
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Handling high-dimensionality
Handling noise, uncertainty, and incompleteness of data
Incorporation of constraints, expert knowledge, and background
knowledge in data mining
Pattern evaluation and knowledge integration
Mining diverse and heterogeneous kinds of data: e.g., bioinformatics,
Web, software/system engineering, information networks
Application-oriented and domain-specific data mining
Invisible data mining (embedded in other functional modules)
Protection of security, integrity, and privacy in data mining

29 Dr. D. Dutta BITS Pilani, Pilani Campus


Data Mining: Potential
Applications
Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved guaranteeing,
quality control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis

30 Dr. D. Dutta BITS Pilani, Pilani Campus


Ex.1: Market Analysis and
Management
Where does the data come from? - Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
– Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
– Determine customer purchasing patterns over time
Cross-market analysis - Find associations/co-relations between product sales, &
predict based on such association
Customer profiling - What types of customers buy what products (clustering or
classification)
Customer requirement analysis
– Identify the best products for different groups of customers
– Predict what factors will attract new customers
Provision of summary information
– Multidimensional summary reports
– Statistical summary information (data central tendency and variation)
31 Dr. D. Dutta BITS Pilani, Pilani Campus
Ex.2: Corporate Analysis and
Risk Management
Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning
– summarize and compare the resources and spending
Competition
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market

32 Dr. D. Dutta BITS Pilani, Pilani Campus


Ex.3: Fraud Detection &
Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day
or week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest
employees
– Anti-terrorism
33 Dr. D. Dutta BITS Pilani, Pilani Campus
KDD: Key Steps
Learning the application domain
– relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
– summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

34 Dr. D. Dutta BITS Pilani, Pilani Campus


Find All or Only Interesting
Patterns?
Find all the interesting patterns: Completeness
– Can a data mining system find all the interesting patterns? Do we
need to find all of the interesting patterns?
– Heuristic vs. exhaustive search
– Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the
uninteresting ones
• Generate only the interesting patterns - mining query
optimization

35 Dr. D. Dutta BITS Pilani, Pilani Campus


Architecture: Typical Data
Mining System
Layers of a typical data mining system (top to bottom):
– Graphical User Interface
– Pattern Evaluation
– Data Mining Engine
– Database or Data Warehouse Server (data cleaning, integration, and selection)
– Knowledge Base
– Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
36 Dr. D. Dutta BITS Pilani, Pilani Campus


Thanks

Any Question?

37 Dr. D. Dutta BITS Pilani, Pilani Campus
