Mining Sessions
Session 14
[Figure: a single neuron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias b, net input Z, activation f(Z), and output y]
McCulloch-Pitts model
Warren McCulloch (a psychiatrist and neuroanatomist) and Walter Pitts (a mathematician) started the modern era of neural networks when, in 1943, they proposed the first formal model of an artificial neuron.
Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires.
The threshold is set such that the inhibition is absolute: any non-zero inhibitory input will prevent the neuron from firing.
[Figure: McCulloch-Pitts neuron with n excitatory inputs x1, ..., xn of weight w and m inhibitory inputs x_(n+1), ..., x_(n+m) of weight -p, net input Z, activation f(Z), and output y]
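A minimal Python sketch of the unit described above, with a unit excitatory weight, a threshold theta, and absolute inhibition; the threshold and the AND-gate usage below are illustrative assumptions, not taken from the slides:

def mcculloch_pitts(excitatory, inhibitory, theta):
    """Fire (return 1) only if no inhibitory input is active and the
    summed excitatory input reaches the threshold theta."""
    if any(inhibitory):          # absolute inhibition: any non-zero inhibitory input blocks firing
        return 0
    return 1 if sum(excitatory) >= theta else 0

# two-input AND gate: fires only when both excitatory inputs are 1
print(mcculloch_pitts([1, 1], [], theta=2))   # 1
print(mcculloch_pitts([1, 0], [], theta=2))   # 0
print(mcculloch_pitts([1, 1], [1], theta=2))  # 0 (inhibited)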
McCulloch-Pitts model
[Figure: example McCulloch-Pitts units, and a general neuron with inputs x1, ..., xn, weights w1, ..., wn, net input Z, activation f(Z), and output y]
Perceptron
[Figure: a perceptron with inputs x1 and x2 (weights w1, w2), bias input x0 = 1 with weight w0, net input Z, activation f(Z), and output y]
Perceptron
The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes.
If the total input is positive, the pattern is assigned to class +1; if the total input is negative, the sample is assigned to class -1.
The separation between the two classes is in this case a straight line, given by the equation
w1·x1 + w2·x2 + w0 = 0, i.e., x2 = -(w1/w2)·x1 - w0/w2
The single-layer network therefore represents a linear discriminant function: the weights determine the slope of the line and the bias w0 determines the offset.
The output of the hidden units is distributed over the next layer of Nh,2 hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of No output units.
The trick is to figure out what δk should be for each unit k in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these δ's which can be implemented by propagating error signals backward through the network.
To compute δk we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit and one reflecting the change in the output as a function of changes in the input.
Back Propagation ANN: Delta Rule
▪ By Equation 1
[Figure: a network with input layer i, hidden layer h (h1, h2), and output layer o (o1); the weights referenced below are shown on the connections]
yh1 = F(sh1) = 2/(1 + e^(-0.7)) - 1 = 0.336
yh2 = F(sh2) = 2/(1 + e^(-0.4)) - 1 = 0.1974
so1 = 0.1 × 0.336 + 0.1 × 0.1974 = 0.0536
yo1 = F(so1) = 2/(1 + e^(-0.0536)) - 1 = 0.02678
F′(so1) = 0.5 (1 + F(so1)) (1 - F(so1))
wi1h1(new) = wi1h1(old) + Δwi1h1 = 0.2 + 0.00216 = 0.20216
wi2h1(new) = wi2h1(old) + Δwi2h1 = 0.2 + 0.00216 = 0.20216
wi3h1(new) = wi3h1(old) + Δwi3h1 = 0.2 + 0.00216 = 0.20216
wi1h2(new) = wi1h2(old) + Δwi1h2 = 0.1 + 0.00216 = 0.10216
wi2h2(new) = wi2h2(old) + Δwi2h2 = 0.1 + 0.00216 = 0.10216
wi3h2(new) = wi3h2(old) + Δwi3h2 = 0.1 + 0.00216 = 0.10216
wh1o1(new) = wh1o1(old) + Δwh1o1 = 0.1 + 0.00216 = 0.10216
wh2o1(new) = wh2o1(old) + Δwh2o1 = 0.1 + 0.00216 = 0.10216
[Figure: the same network after the update, with weight 0.10216 shown on the hidden-to-output connections]
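A minimal sketch of this forward pass with the bipolar sigmoid used above; the target value and learning rate below are hypothetical, since the slide only reports the resulting Δw:

import math

def f(s):        # bipolar sigmoid used on the slide
    return 2.0 / (1.0 + math.exp(-s)) - 1.0

def f_prime(s):  # derivative expressed via f, as on the slide
    return 0.5 * (1.0 + f(s)) * (1.0 - f(s))

s_h1, s_h2 = 0.7, 0.4                   # net inputs to the two hidden units (from the slide)
y_h1, y_h2 = f(s_h1), f(s_h2)           # ~0.336, ~0.1974
w_h1o1 = w_h2o1 = 0.1                   # hidden-to-output weights (from the slide)
s_o1 = w_h1o1 * y_h1 + w_h2o1 * y_h2    # ~0.053
y_o1 = f(s_o1)                          # ~0.027

# delta-rule update for one hidden-to-output weight; target t and
# learning rate eta are hypothetical
t, eta = 1.0, 0.1
delta_o1 = (t - y_o1) * f_prime(s_o1)
dw_h1o1 = eta * delta_o1 * y_h1
print(y_h1, y_h2, y_o1, dw_h1o1)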
As per syllabus
1. Data Mining Applications
1. Recommendation Systems.
2. Sentiment Analysis
3. Fraud Detection
As per session plan
• Data Mining Applications
o Recommendation Systems
o Sentiment Analysis
o Fraud Detection
Recommender System
Makes recommendations!
E.g. music, books and movies
In eCommerce recommend items
In eLearning recommend content
In search and navigation recommend links
1. Data Collection
• Purpose: Gather the text data that will be analyzed for sentiment.
• Sources: Data can come from various sources like social media
posts, customer reviews, survey responses, emails, or any other text
where people express opinions.
• Methods: Data can be collected manually or through APIs (e.g.,
Twitter API, web scraping tools).
2. Text Preprocessing
• Purpose: Clean and prepare the text data for analysis. Raw text often
contains unnecessary elements that can affect analysis accuracy.
• Steps:
• Lowercasing: Convert text to lowercase to maintain consistency.
• Removing Punctuation and Special Characters: Clean up
punctuation, emojis, URLs, and other symbols that may not
contribute to sentiment.
• Removing Stop Words: Remove common words (like “the,” “is,”
“and”) that don’t carry sentiment.
• Tokenization: Split the text into individual words or tokens.
• Stemming and Lemmatization: Reduce words to their root forms
(e.g., “playing” to “play”) for better generalization.
• Outcome: A clean and structured dataset that can be used for analysis.
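A minimal sketch of the preprocessing steps above using only the Python standard library; the sample text, the tiny stop-word list, and the regular expressions are illustrative assumptions (real pipelines usually use a library such as NLTK or spaCy for stop words and lemmatization):

import re

text = "Loved the new phone!! Battery life is great :) https://example.com"
stop_words = {"the", "is", "and", "a", "an"}      # tiny illustrative stop-word list
t = text.lower()                                   # lowercasing
t = re.sub(r"https?://\S+", " ", t)                # remove URLs
t = re.sub(r"[^a-z\s]", " ", t)                    # remove punctuation, emojis, digits
tokens = [w for w in t.split() if w not in stop_words]   # tokenize + stop-word removal
print(tokens)   # ['loved', 'new', 'phone', 'battery', 'life', 'great']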
3. Feature Extraction
• Purpose: Transform text data into numerical features that can be
processed by machine learning or deep learning models.
• Common Techniques:
• Bag-of-Words (BoW): Converts text into a set of word counts or
frequencies without considering word order.
• TF-IDF (Term Frequency-Inverse Document Frequency):
Weighs words by their importance across multiple documents.
• Word Embeddings: Techniques like Word2Vec, GloVe, or BERT
that convert words into dense vectors capturing semantic
meaning.
• Outcome: Numerical representation of text that can be fed into a
model.
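A minimal sketch of the Bag-of-Words and TF-IDF representations above; it assumes a recent version of scikit-learn, and the toy corpus is hypothetical:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great phone, great battery", "terrible screen", "battery life is great"]
bow_vec = CountVectorizer()
bow = bow_vec.fit_transform(docs)              # Bag-of-Words: raw term counts per document
print(bow_vec.get_feature_names_out())         # the learned vocabulary
print(bow.toarray())
tfidf = TfidfVectorizer().fit_transform(docs)  # TF-IDF: counts reweighted by term rarity
print(tfidf.toarray().round(2))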
4. Sentiment Classification
• Purpose: Classify the sentiment of each text input as positive,
negative, or neutral (or sometimes with more specific emotions).
• Methods:
• Rule-Based Approach: Uses predefined lists of positive and
negative words (sentiment lexicons) to determine sentiment.
Simple but less flexible.
• Machine Learning Models: Trains classifiers like Naive Bayes,
Support Vector Machine (SVM), or logistic regression on labeled
sentiment data to learn patterns.
• Deep Learning Models: Uses advanced models like LSTMs,
RNNs, or transformer-based models (BERT, GPT) for better
accuracy, especially with context-dependent or complex
sentiment.
• Outcome: Predicted sentiment labels for each input text.
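A minimal sketch of the machine-learning approach above, training a Naive Bayes classifier on TF-IDF features with scikit-learn; the labeled examples are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["loved it", "great battery life", "awful screen",
         "terrible and slow", "pretty good overall", "worst purchase ever"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]          # hypothetical labeled data
model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
print(model.predict(["battery is great but screen is awful"]))  # predicted sentiment label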
Steps of Sentiment Analysis
8. Deployment
• Purpose: Integrate the sentiment analysis model into a production
environment so it can analyze new data in real time.
• Methods:
• API Integration: Deploy as a REST API so other applications
can send data for sentiment analysis.
• Automation: Automate data collection, analysis, and reporting
for continuous monitoring.
• Outcome: A fully functional sentiment analysis tool that provides
real-time insights.
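A minimal sketch of the API-integration idea above; Flask is an assumed choice (the slides do not name a framework), and the lexicon-based scorer stands in for a trained model:

from flask import Flask, jsonify, request

app = Flask(__name__)
POSITIVE = {"good", "great", "love", "excellent"}     # tiny illustrative lexicon
NEGATIVE = {"bad", "terrible", "awful", "worst"}

@app.route("/sentiment", methods=["POST"])
def sentiment():
    text = (request.get_json() or {}).get("text", "").lower()
    score = sum(w in POSITIVE for w in text.split()) - sum(w in NEGATIVE for w in text.split())
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return jsonify({"text": text, "label": label})

if __name__ == "__main__":
    app.run(port=5000)   # other applications can now POST {"text": "..."} for analysis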
As per syllabus
1. Outlier Detection
3. Statistical Approach
As per session plan
• Outliers
o Statistical Approach
What Are Outliers?
Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
– Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
– In a group where everyone is moving forward, one person is moving backward
Outliers are different from noise
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
Outliers are interesting: they violate the mechanism that generates the normal data
Novelty detection is different from outlier detection: it focuses on identifying new, previously unseen patterns or examples once the model is trained
Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Types of Outliers
Collective Outliers
▪ A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
▪ Applications: e.g., intrusion detection:
• When a number of computers keep sending denial-of-service packets to each other
◼ Detection of collective outliers
◼ Consider not only behavior of individual objects, but also that of groups
of objects
◼ Need to have background knowledge on the relationships among the objects
◼ The border between normal and outlier objects is often a gray area: noise may help hide outliers and reduce the effectiveness of outlier detection
◼ Understandability
◼ Understand why these are outliers: justification of the detection
• Recall measures the ability of the model to find all the relevant
(true) outliers. A higher recall means the model is successfully
identifying more outliers.
◼ Since there are many clustering methods, there are many clustering-
based outlier detection methods as well
◼ Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale up well for large data sets
Parametric Methods: Using a Mixture of Parametric Distributions
◼ The probability density of the mixture combines fθ1 and fθ2, the probability density functions of the two components θ1 and θ2
◼ Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
◼ An object o is an outlier if it does not belong to any cluster
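A minimal sketch of the mixture-of-Gaussians idea above using scikit-learn's EM-based GaussianMixture; the synthetic data and the 0.5% likelihood cutoff are illustrative assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 2, 200), [25.0]]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)     # EM learns mu1, sigma1, mu2, sigma2
print(gmm.means_.ravel(), np.sqrt(gmm.covariances_).ravel())
logp = gmm.score_samples(X)                  # log-likelihood of each object under the mixture
print(X[logp < np.quantile(logp, 0.005)])    # lowest-likelihood objects flagged as outliers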
Intuition: Objects that are far away from the others are outliers
Assumption of proximity-based approach: The proximity of an outlier
deviates significantly from that of most of the others in the data set
Two types of proximity-based outlier detection methods
– Distance-based outlier detection: An object o is an outlier if its
neighborhood does not have enough other points
– Density-based outlier detection: An object o is an outlier if its
density is relatively much lower than that of its neighbors
◼ Thus we only need to check the objects that cannot be pruned, and even for
such an object o, only need to compute the distance between o and the
objects in the level-2 cells (since beyond level-2, the distance from o is more
than r)
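A minimal sketch of distance-based outlier detection as described above (an object is an outlier if its r-neighborhood contains too few other points); the data, radius r, and threshold k are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one far-away point
r, k = 1.5, 5                                               # radius and neighbor threshold
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
neighbors = (D <= r).sum(axis=1) - 1                        # exclude the point itself
print(np.where(neighbors < k)[0])                           # objects with too few neighbors within r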
Density-Based Outlier Detection
Local outliers: outliers compared to their local neighborhoods, instead of the global data distribution
In the figure, O1 and O2 are local outliers relative to C1 and O3 is a global outlier, but O4 is not an outlier. However, proximity-based clustering cannot find that O1 and O2 are outliers (e.g., comparing with O4).
Clustering-Based Outlier Detection (1 & 2):
Not belonging to any cluster, or far from the closest one
An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
◼ Case 1: does not belong to any cluster
◼ Identify animals not part of a flock (group), using a density-based clustering method such as DBSCAN
◼ Case 2: far from its closest cluster
◼ Using k-means, partition the data points into clusters
◼ For each object o, assign an outlier score based on its distance from its closest center co
◼ If dist(o, co) / avg_dist(co) is large, o is likely an outlier
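A minimal sketch of Case 2 above with scikit-learn's KMeans; the synthetic data and the number of clusters are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2)), [[12.0, 12.0]]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[km.labels_]             # closest center c_o for each object o
dist = np.linalg.norm(X - centers, axis=1)            # dist(o, c_o)
avg = np.array([dist[km.labels_ == k].mean() for k in range(2)])[km.labels_]
score = dist / avg                                    # large ratio => likely an outlier
print(np.argsort(score)[-3:])                         # objects with the largest scores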
In some applications, one cannot clearly partition the data into contexts
– Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products browsed
earlier should be considered as the context
Model the “normal” behavior with respect to contexts
– Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute values
– An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
Using a prediction model that links the contexts and behavior, these methods
avoid the explicit identification of specific contexts
Methods: A number of classification and prediction techniques can be used to
build such models, such as regression, Markov Models, and Finite State
Automaton
2. Partitioning methods
5. Outlier analysis
As per session plan
• Clustering
o Partitioning methods
o Outlier analysis
[Figure (a), (b): plots of Red Blood Cell Hemoglobin Concentration (vertical axis, roughly 3.9 to 4.3) for two data sets]
The Kohonen Network
[Figure: the Kohonen network, an input layer with inputs x1, x2 fully connected to an output layer with outputs y1, y2, y3; input signals flow in and output signals flow out]
The Kohonen Network
[Figure: lateral connection strength as a function of distance, with an excitatory effect near zero distance and inhibitory effects at larger distances]
[Figures: the output-neuron weights plotted in the (W(1,j), W(2,j)) plane, both axes from -1 to 1, at successive training stages, including the network after 1000 iterations and after 10,000 iterations]
An Example
Input vector: X = (0.52, 0.12)
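A minimal sketch of one competitive-learning step for the network above (2 inputs, 3 output neurons) using the example input X = (0.52, 0.12); the random weight initialization and learning rate are hypothetical, and a full Kohonen update would also adjust the winner's neighbors:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([0.52, 0.12])                      # input vector from the slide
W = rng.uniform(-1, 1, size=(3, 2))             # weights of 3 output neurons (hypothetical init)
eta = 0.1                                       # hypothetical learning rate
j = np.argmin(np.linalg.norm(W - X, axis=1))    # winner: the closest weight vector
W[j] += eta * (X - W[j])                        # move the winning neuron toward the input
print(j, W[j])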
Curse of Dimensionality: Number of Samples
Suppose we want to use the nearest neighbor approach with k = 1 (1NN).
Suppose we start with only one feature. This feature is not discriminative, i.e., it does not separate the classes well.
We decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e., the samples have to be dense.
To maintain the same density as in 1D (9 samples per unit length), how many samples do we need? (In 2D this is 9 × 9 = 81 samples to cover the unit square at the same density.)
Curse of Dimensionality: Number of Samples
Of course, when we go from 1 feature to 2, no one gives us more
samples, we still have 9
Thus for each fixed sample size n, there is the optimal number of
features to use
Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)
CLIQUE (Clustering In QUEst)
Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping
rectangular units
– A unit is dense if the fraction of total data points contained in the
unit exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a
subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside
each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected
dense units for each cluster
– Determination of minimal cover for each cluster
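A minimal sketch of the first step above, partitioning a 2-D space into a grid and finding the dense units; the data, the number of intervals per dimension, and the density threshold are hypothetical:

import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 2))                      # toy 2-D data in [0, 1)^2
xi, tau = 8, 0.03                             # intervals per dimension, density threshold
counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=xi, range=[[0, 1], [0, 1]])
dense = counts / len(X) > tau                 # a unit is dense if its fraction of points exceeds tau
print(np.argwhere(dense))                     # grid indices of the dense units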
[Figure: CLIQUE example with grids over (age, salary) and (age, vacation in weeks) and density threshold τ = 3; the dense units intersect in the age range 30 to 50]
Strength and Weakness of CLIQUE
• Strength
– automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
– insensitive to the order of records in input and does not
presume some canonical data distribution
– scales linearly with the size of input and has good scalability as
the number of dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded at the
expense of simplicity of the method
Why Constraint-Based Cluster Analysis?
• Need user feedback: users know their applications best
• Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles and desired clusters
2. Partitioning methods
5. Outlier analysis
As per session plan
• Clustering
o Partitioning methods
• Outlier analysis
[Figures: a 2-D point set on 0 to 10 axes shown over several clustering iterations]
• For the following data set, we will get different clustering results with
the single-link and complete-link algorithms.
[Figure: the six points 1 to 6 clustered by the single-link and complete-link algorithms, with the result of the Complete-Link algorithm shown as a dendrogram]
Hierarchical Clustering: Comparison
[Figure: single-link vs. complete-link clusterings of the same points, and a comparison of the single-link and complete-link dendrograms]
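A minimal sketch of the comparison above using SciPy's hierarchical clustering; the six 2-D points are hypothetical stand-ins for the slide's data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # hypothetical points 1-6
Z_single = linkage(X, method="single")        # merge clusters by minimum pairwise distance
Z_complete = linkage(X, method="complete")    # merge clusters by maximum pairwise distance
print(fcluster(Z_single, t=2, criterion="maxclust"))     # 2-cluster cut of the single-link tree
print(fcluster(Z_complete, t=2, criterion="maxclust"))   # 2-cluster cut of the complete-link tree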
Effect of Bias towards Spherical Clusters
Remark: points that are not assigned to any cluster are outliers (MinPts = 4, Eps = 9.75).
[Figure: the original points and the clusterings obtained with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database w.r.t. its density-based
clustering structure
– This cluster-ordering contains info equivalent to the density-based
clustering corresponding to a broad range of parameter settings
– Good for both automatic and interactive cluster analysis, including
finding intrinsic clustering structure
– Can be represented graphically or using visualization techniques
Core distance and reachability distance
Reachability distance of p from o: Max(core-distance(o), d(o, p))
Example: MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
[Figure: the core distance of o and the reachability distances of p1 and p2 from o]
OPTICS: Some Extension from DBSCAN
[Figure: reachability plot, the reachability-distance of the objects in cluster order; the value is undefined for some objects]
Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence.
Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with area at least 100 units, in California.
2. Partitioning methods
5. Outlier analysis
As per session plan
• Clustering
o Partitioning methods
• Outlier analysis
[Figure: three points (x1, y1), (x2, y2), (x3, y3) plotted in the X-Y plane]
Clustering
Scalability
– The clustering algorithm should also be suitable for big data sets
Ability to deal with different types of attributes
– Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
Interpretability and usability
Others
– Discovery of clusters with arbitrary shape
– Ability to deal with noisy data
– Incremental clustering and insensitivity to input order
– High dimensionality
Data Structures
◼ Dissimilarity matrix (one mode):
    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  ...  0
• Standardize data
– Calculate the mean absolute deviation:
  s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),
  where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
• If q = 1, d is Manhattan distance:
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
• If q = 2, d is Euclidean distance:
  d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
– Properties
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
• Also, one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures
d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)
– f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled:
  • compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  • and treat z_if as interval-scaled
How to Define Inter-Cluster Distance?
Single-link, Complete-link, Average-link, Centroid distance
Average-link: d_avg(Ci, Cj) = avg_{p ∈ Ci, q ∈ Cj} d(p, q); the distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.
Centroid distance: d_mean(Ci, Cj) = d(m_i, m_j), where m_i and m_j are the means of Ci and Cj; the distance between two clusters is represented by the distance between the means of the clusters.
D_m = Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / (N (N − 1))
(using K = 2)
Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters; in this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and compute their new centroids.
Step 3:
Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
(using K = 3): Step 1, Step 2
[Figure: plot of the resulting clusters]
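A minimal sketch of the centroid-reassignment iteration above; the slide gives only the initial centroids and the resulting clusters, so the seven data points below are hypothetical:

import numpy as np

pts = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])    # hypothetical points 1-7
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])          # initial centroids from the slide
for _ in range(10):                                     # repeat assign / update until stable
    d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                           # assign each point to its nearest centroid
    centroids = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids)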
4. Lazy Learner
5. Multiclass Classification
As per session plan
• Classification: Advanced Methods
o Lazy Learner
o Multiclass Classification
Introduction to Bayesian Network
Suppose you are trying to determine if a
patient has Covid. You observe the
following symptoms:
• The patient has a cough
• The patient has a fever
• The patient has difficulty breathing
[Figure: the "Has Covid" Bayesian network with its conditional probability tables, e.g., a table P(C | B) over the true/false combinations of B and C]
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node
If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored)
P(X1 = x1, ..., Xn = xn) = Π_{i=1}^{n} P(Xi = xi | Parents(Xi))
Where Parents(Xi) means the values of the Parents of the node Xi with
respect to the graph
[Figure: the Covid Bayesian network with nodes labeled A, B, C, D; these numbers are from the conditional probability tables]
An example of a query would be:
P( Has Covid = true | HasFever = true, HasCough = true)
Note: Even though HasDifficultyBreathing and HasWideMediastinum
are in the Bayesian network, they are not given values in the query
(i.e. they do not appear either as query variables or evidence
variables)
They are treated as unobserved variables
We still haven’t said where we get the Bayesian network from. There
are two options:
1. Get an expert to design it
2. Learn it from data
[Figure: a Bayesian network with nodes Martin Late, Norman Late, and Boss Angry]
Example
[Figure: the Bayesian network with nodes Martin Oversleep, Train Strike, Norman Oversleep, Boss Failure-in-Love, Project Delay, Office Dirty, Martin Late, Norman Late, and Boss Angry, each with its probability table]
Example
Attach prior probabilities to non-root nodes; each column sums to 1.
[Figure: conditional probability tables with T/F columns for the parent combinations (Boss Failure-in-Love, Project Delay, Martin Oversleep, Train Strike, Office Dirty, Norman Oversleep)]
Definition of Bayesian Network
A Bayesian network is a directed acyclic graph with
the following properties:
Each node represents a random variable.
Each node representing a variable A with parent nodes representing
variables B1, B2,..., Bn is assigned a conditional probability table
(CPT):
P(A | B1, B2, ..., Bn)
How to inference?
How to learn the probabilities from data?
How to learn the structure from data?
Prior for Train Strike: P(Train Strike = T) = 0.1, P(Train Strike = F) = 0.9
Question: P("Martin Late", "Norman Late", "Train Strike") = ?  (the joint distribution)
P(A, B, C) = P(A | B, C) P(B | C) P(C) = P(A | C) P(B | C) P(C), where C is "Train Strike"

A    B    C    Probability
T    T    T    0.048
F    T    T    0.032
T    F    T    0.012
F    F    T    0.008
T    T    F    0.045
F    T    F    0.045
T    F    F    0.405
F    F    F    0.405
Example
Question: P("Martin Late", "Norman Late") = ?  (the joint distribution of A and B)
P(A, B) = Σ_C P(A, B, C)

A    B    Probability
T    T    0.093
F    T    0.077
T    F    0.417
F    F    0.413
Question: P(A) = ?
P(A) = Σ_{B,C} P(A, B, C) = Σ_B P(A, B)

A    Probability
T    0.51
F    0.49
P(A | B) = P(A, B) / P(B); e.g., P(A = T | B = T) = 0.093 / 0.17 = 0.5471
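A minimal sketch that reproduces the numbers above from the joint distribution table (variables named A, B, C as on the slides):

# joint distribution P(A, B, C) from the slide, keyed by (A, B, C) truth values
joint = {(True, True, True): 0.048, (False, True, True): 0.032,
         (True, False, True): 0.012, (False, False, True): 0.008,
         (True, True, False): 0.045, (False, True, False): 0.045,
         (True, False, False): 0.405, (False, False, False): 0.405}

def p_ab(a, b):                      # marginalize C out: P(A, B) = sum_C P(A, B, C)
    return sum(joint[(a, b, c)] for c in (True, False))

def p_b(b):                          # P(B) = sum_A P(A, B)
    return sum(p_ab(a, b) for a in (True, False))

print(p_ab(True, True))                              # 0.093
print(sum(p_ab(True, b) for b in (True, False)))     # P(A = T) = 0.51
print(p_ab(True, True) / p_b(True))                  # P(A = T | B = T) ~ 0.547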
Neural Networks
[Figure: a single neuron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias b, net input Z, activation f(Z), and output y]
McCulloch-Pitts model
Warren McCulloch (a psychiatrist and neuroanatomist) and Walter Pitts (a mathematician) started the modern era of neural networks when, in 1943, they proposed the first formal model of an artificial neuron.
Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires.
The threshold is set such that the inhibition is absolute: any non-zero inhibitory input will prevent the neuron from firing.
[Figure: McCulloch-Pitts neuron with n excitatory inputs x1, ..., xn of weight w and m inhibitory inputs x_(n+1), ..., x_(n+m) of weight -p, net input Z, activation f(Z), and output y]
McCulloch-Pitts model
[Figure: example McCulloch-Pitts units, and a general neuron with inputs x1, ..., xn, weights w1, ..., wn, net input Z, activation f(Z), and output y]
Perceptron
[Figure: a perceptron with inputs x1 and x2 (weights w1, w2), bias input x0 = 1 with weight w0, net input Z, activation f(Z), and output y]
Perceptron
The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes.
If the total input is positive, the pattern is assigned to class +1; if the total input is negative, the sample is assigned to class -1.
The separation between the two classes is in this case a straight line, given by the equation
w1·x1 + w2·x2 + w0 = 0, i.e., x2 = -(w1/w2)·x1 - w0/w2
The single-layer network therefore represents a linear discriminant function: the weights determine the slope of the line and the bias w0 determines the offset.
The output of the hidden units is distributed over the next layer of Nh,2 hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of No output units.
The trick is to figure out what δk should be for each unit k in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these δ's which can be implemented by propagating error signals backward through the network.
To compute δk we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit and one reflecting the change in the output as a function of changes in the input.
Back Propagation ANN: Delta Rule
▪ By Equation 1
[Figure: a network with input layer i, hidden layer h (h1, h2), and output layer o (o1); the weights referenced below are shown on the connections]
yh1 = F(sh1) = 2/(1 + e^(-0.7)) - 1 = 0.336
yh2 = F(sh2) = 2/(1 + e^(-0.4)) - 1 = 0.1974
so1 = 0.1 × 0.336 + 0.1 × 0.1974 = 0.0536
yo1 = F(so1) = 2/(1 + e^(-0.0536)) - 1 = 0.02678
F′(so1) = 0.5 (1 + F(so1)) (1 - F(so1))
wi1h1(new) = wi1h1(old) + Δwi1h1 = 0.2 + 0.00216 = 0.20216
wi2h1(new) = wi2h1(old) + Δwi2h1 = 0.2 + 0.00216 = 0.20216
wi3h1(new) = wi3h1(old) + Δwi3h1 = 0.2 + 0.00216 = 0.20216
wi1h2(new) = wi1h2(old) + Δwi1h2 = 0.1 + 0.00216 = 0.10216
wi2h2(new) = wi2h2(old) + Δwi2h2 = 0.1 + 0.00216 = 0.10216
wi3h2(new) = wi3h2(old) + Δwi3h2 = 0.1 + 0.00216 = 0.10216
wh1o1(new) = wh1o1(old) + Δwh1o1 = 0.1 + 0.00216 = 0.10216
wh2o1(new) = wh2o1(old) + Δwh2o1 = 0.1 + 0.00216 = 0.10216
[Figure: the same network after the update, with weight 0.10216 shown on the hidden-to-output connections]
[Figure: two-class training data; one marker denotes class +1 and the other denotes class -1]
Computing the margin width (M = margin width)
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
How do we compute M in terms of w and b?
• Claim: the vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane; what is w · (u - v)?
• Let x- be any point on the minus plane (any location in R^m, not necessarily a datapoint)
• Let x+ be the closest plus-plane point to x-
• Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in direction w
What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
It is now easy to get M in terms of w and b:
w · (x- + λw) + b = 1  ⇒  w · x- + b + λ w·w = 1  ⇒  -1 + λ w·w = 1  ⇒  λ = 2 / (w · w)
M = |x+ - x-| = |λw| = λ √(w · w) = 2 / √(w · w)
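A minimal check of the result M = 2/√(w·w) using a linear SVM from scikit-learn (assumed); the toy points are hypothetical, and a very large C approximates the hard-margin case:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])  # toy 2-D points
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
M = 2.0 / np.sqrt(w @ w)            # margin width from the derivation above
print(w, b, M)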
Quadratic Programming
Find arg max_u  c + dᵀu + (uᵀ R u) / 2   (quadratic criterion)
subject to e additional linear equality constraints:
a_(n+1)1 u1 + a_(n+1)2 u2 + ... + a_(n+1)m um = b_(n+1)
a_(n+2)1 u1 + a_(n+2)2 u2 + ... + a_(n+2)m um = b_(n+2)
:
a_(n+e)1 u1 + a_(n+e)2 u2 + ... + a_(n+e)m um = b_(n+e)
Lazy Learner: Instance-Based Methods
• Instance-based learning:
– Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
• Typical approaches
– k-nearest neighbor (KNN) approach
• Instances represented as points in a Euclidean space.
– Locally weighted regression
• Constructs local approximation
– Case-based reasoning
• Uses symbolic representations and knowledge-based
inference
[Figure: k-nearest-neighbor classification, a query point xq surrounded by training points labeled + and -]
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
– Returns the mean of the values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to its distance to the query xq
• Give greater weight to closer neighbors: w ≡ 1 / d(xq, xi)²
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
– To overcome it, stretch the axes or eliminate the least relevant attributes
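A minimal sketch of distance-weighted k-NN prediction as described above; the training data, query, and k are hypothetical:

import numpy as np

def weighted_knn_predict(X, y, xq, k=3):
    """Distance-weighted k-NN regression: closer neighbors get weight 1/d^2."""
    d = np.linalg.norm(X - xq, axis=1)
    idx = np.argsort(d)[:k]                 # the k nearest neighbors
    w = 1.0 / (d[idx] ** 2 + 1e-12)         # small epsilon guards against d = 0
    return np.sum(w * y[idx]) / np.sum(w)

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # toy training data
y = np.array([1.0, 2.0, 3.0, 10.0])
print(weighted_knn_predict(X, y, np.array([2.2])))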
o Prediction Techniques
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification
cost per class
– Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
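A minimal sketch of the coverage and accuracy measures above for rule R; the toy data set D is hypothetical:

# rule R: IF age = youth AND student = yes THEN buys_computer = yes
D = [  # (age, student, buys_computer)
    ("youth", "yes", "yes"), ("youth", "yes", "no"),
    ("youth", "no", "no"), ("senior", "yes", "yes"),
]
covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]   # tuples matching the antecedent
correct = [t for t in covered if t[2] == "yes"]                  # covered tuples matching the consequent
coverage = len(covered) / len(D)        # ncovers / |D|
accuracy = len(correct) / len(covered)  # ncorrect / ncovers
print(coverage, accuracy)               # 0.5, 0.5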
[Figure: positive examples and the examples covered by Rule 1, Rule 2, and Rule 3]
[Table fragment: a confusion matrix (¬C, FP, TN, N; P′, N′, All) from a class-imbalance example such as fraud or HIV-positive detection, where a significant majority of the tuples belong to one class]
But how the linear regression finds out which is the best fit line?
The goal of the linear regression algorithm is to get the best values for
B0 and B1 to find the best fit line. The best fit line is a line that has
the least error which means the error between predicted values and
actual values should be minimum.
Random Error (Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual:
εi = yi − ŷi, where ŷi = B0 + B1·Xi, i.e., yi = B0 + B1·Xi + εi
[Figure: the fitted regression line with intercept B0 and slope B1]
Using the MSE function, we’ll update the values of B0 and B1 such that
the MSE value settles at the minima. These parameters can be
determined using the gradient descent method such that the value
for the cost function is minimum.
Simple Linear Regression
4. Update Coefficients:
Update the coefficients using the gradients and a learning rate (α):
b0=b0−α*∂J/∂b0
b1=b1−α*∂J/∂b1
5. Repeat:
Repeat steps 2-4 until convergence (the cost function reaches a
minimum or changes very slowly).
Partial Derivatives:
The partial derivatives with respect to b0 and b1 are calculated as
follows:
∂J/∂b0 = −(1/m) · Σ_{i=1}^{m} (Yi − Ŷi)
∂J/∂b1 = −(1/m) · Σ_{i=1}^{m} (Yi − Ŷi) · Xi
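A minimal sketch of the gradient-descent loop above; the toy data, learning rate, and iteration count are hypothetical:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # toy data
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])            # roughly Y = 2X
b0, b1, alpha = 0.0, 0.0, 0.01                     # initial coefficients and learning rate
m = len(X)
for _ in range(5000):                              # repeat steps 2-4 until (near) convergence
    Y_hat = b0 + b1 * X                            # predictions with current coefficients
    db0 = -(1.0 / m) * np.sum(Y - Y_hat)           # dJ/db0
    db1 = -(1.0 / m) * np.sum((Y - Y_hat) * X)     # dJ/db1
    b0, b1 = b0 - alpha * db0, b1 - alpha * db1    # update step
print(b0, b1)                                      # b1 ends up close to 2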
Key Concepts:
1. Degree of the Polynomial (n):
The degree of the polynomial determines the complexity of the curve.
Higher degrees allow the model to capture more intricate patterns in the
data but may also lead to overfitting.
2. Overfitting:
Overfitting occurs when the model fits the training data too closely,
capturing noise rather than the underlying pattern. Regularization
techniques or model selection methods may be used to address overfitting.
3. Underfitting:
Underfitting occurs when the model is too simple to capture the underlying
pattern in the data. Increasing the degree of the polynomial may help
address underfitting.
4. Model Interpretation:
Polynomial regression can complicate the interpretation of individual
coefficients. The relationship between X and Y becomes more complex as
the degree of the polynomial increases.
Polynomial Regression
Example:
Consider a scenario where the relationship between the number of
hours of study (X) and the exam score (Y) is explored using
polynomial regression. The equation might take the form:
Exam Score = b0 + b1·(Hours of Study) + b2·(Hours of Study)²
Here, the second-degree polynomial allows the model to capture a
quadratic relationship between hours of study and exam scores.
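A minimal sketch of fitting such a second-degree model with NumPy; the study-hours and score values are hypothetical:

import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # hypothetical study hours
score = np.array([52, 60, 70, 78, 83, 85], dtype=float)   # hypothetical exam scores
b2, b1, b0 = np.polyfit(hours, score, deg=2)               # fit score = b0 + b1*h + b2*h^2
print(b0, b1, b2)
print(np.polyval([b2, b1, b0], 3.5))                       # predicted score for 3.5 hours of study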
Apart from linear and polynomial regression there are many regression
algorithms
1. Ridge Regression
2. Lasso Regression
3. ElasticNet Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Gradient Boosting Regression (e.g., XGBoost, LightGBM,
CatBoost)
7. Support Vector Regression (SVR)
8. K-Nearest Neighbors (KNN) Regression
9. Bayesian Regression
10. Huber Regression
o Prediction Techniques
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
▪ Classification
▪ Predicts categorical class labels (discrete or nominal)
▪ Classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
▪ Numeric Prediction
▪ Models continuous-valued functions, i.e., predicts unknown or
missing values
[Figure: model construction and usage, the training data is fed to a classification algorithm to build a classifier, which is then applied to testing data and unseen data such as the tuple (George, Professor, 5) to answer "Tenured?"]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Prediction
The practice of using data to create predictions or foresee future events is known
as machine learning prediction. Building models that can recognize patterns
in data and utilize those patterns to create precise predictions about novel,
unforeseen data is the aim of machine learning prediction.
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
[Table: the training tuples over age (<=30, 31…40, >40), income, student, and credit_rating with their yes/no class labels, used to compute these gains]
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information requirement for
A is selected as the split-point for A. The minimum expected
information requirement refers to the point at which the entropy of
the dataset is minimized after the split.
Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the
set of tuples in D satisfying A > split-point
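A minimal sketch of choosing the split point as described above (try each adjacent-value midpoint, keep the one with the minimum expected information requirement); the values and labels are hypothetical:

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Return (split-point, expected information) minimizing the weighted entropy."""
    order = np.argsort(values)
    v, y = np.array(values)[order], np.array(labels)[order]
    best = (None, float("inf"))
    for i in range(len(v) - 1):
        if v[i] == v[i + 1]:
            continue
        mid = (v[i] + v[i + 1]) / 2.0                 # midpoint of adjacent values
        left, right = y[v <= mid], y[v > mid]         # D1: A <= split-point, D2: A > split-point
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if info < best[1]:
            best = (mid, info)
    return best

print(best_split([25, 32, 41, 47, 52], ["no", "yes", "yes", "no", "no"]))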
Underfitting and Overfitting
For independent events: When events do not affect each other, the
joint probability is the product of their individual probabilities:
P(A∩B)=P(A)×P(B)
For dependent events: If the occurrence of one event affects the other,
the joint probability is calculated using conditional probability:
P(A∩B)=P(A)×P(B∣A)
Pr(A | B) = Pr(A ∩ B) / Pr(B)
[Figure: sample space S containing events A and B]
H = "Have a headache", F = "Coming down with Flu"
P(H = true) = 1/10, P(F = true) = 1/40, P(H = true | F = true) = 1/2
Bayes' rule: Pr(Ai | B) = Pr(Ai) Pr(B | Ai) / Σ_k Pr(Ak) Pr(B | Ak)
It also doesn't matter whether you "peek" at the data halfway through an experiment.
Pr(A | B) = Pr(A ∩ B) / Pr(B)
Let’s compute
P(Y = 0 | X = (0,2))
and
P(Y = 1 | X = (0,2))
As per syllabus
1. Data Warehousing
1. Basic Concepts
2. DW Design
3. Data Cube and OLAP
4. DW implementation considerations
As per session plan
• Data Warehousing
o Basic Concepts
o DW Design
o Data Cube and OLAP
o DW implementation considerations
The "Compute Cube" Operator
"How many cuboids are there in an n-dimensional data cube?" If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen, is 2^n. However, in practice, many dimensions do have hierarchies. For example, time is usually explored not at only one conceptual level (e.g., year), but rather at multiple conceptual levels such as in the hierarchy "day < month < quarter < year." For an n-dimensional data cube, the total number of cuboids that can be generated (including the cuboids generated by climbing up the hierarchies along each dimension) is
T = Π_{i=1}^{n} (Li + 1)
where Li is the number of levels associated with dimension i. One is added to Li in the equation to include the virtual top level, all. (Note that generalizing to all is equivalent to the removal of the dimension.)
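A minimal sketch of this count; the example level numbers are hypothetical:

from math import prod

# T = prod_i (L_i + 1) for a cube whose dimensions have the given numbers of hierarchy levels
levels = [4, 3, 2]                    # e.g., time with day < month < quarter < year has 4 levels
T = prod(L + 1 for L in levels)
print(T)                              # 5 * 4 * 3 = 60 cuboids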
◼ Motivation
◼ Only a small portion of cube cells may be “above the water’’ in a
sparse cube
◼ Only calculate “interesting” cells—data above certain threshold
Discovery-driven approach:
Pre-computed measures indicating data exceptions are used to guide
the user in the data analysis process, at all levels of aggregation.
We hereafter refer to these measures as exception indicators.
Intuitively, an exception is a data cube cell value that is significantly
different from the value anticipated, based on a statistical model.
The model considers variations and patterns in the measure value
across all of the dimensions to which a cell belongs.
For example, if the analysis of item-sales data reveals an increase in
sales in December in comparison to all other months, this may seem
like an exception in the time dimension.
However, it is not an exception if the item dimension is considered, since
there is a similar increase in sales for other items during December.
As per syllabus
1. Data Preprocessing
1. Data Quality
o Data Quality
Ignore the tuple: usually done when class label is missing (when doing
classification) - not effective when the % of missing values per
attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian
formula or decision tree
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
[Figure: scatter plots showing the similarity from -1 to 1]
Correlation coefficient (Pearson's product-moment coefficient):
r_{A,B} = Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
Positive covariance: if Cov(A, B) > 0, then A and B both tend to be larger than their expected values.
Negative covariance: if Cov(A, B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, Cov(A, B) = 0, but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Co-Variance: An Example
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
– A = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– B = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
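A quick check of this example with NumPy; note that np.cov uses the n−1 denominator by default, so bias=True matches the population formula used on the slide:

import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)
cov = np.mean(A * B) - A.mean() * B.mean()      # E[AB] - E[A]E[B], as on the slide
print(cov)                                       # 4.0
print(np.cov(A, B, bias=True)[0, 1])             # the same value via NumPy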
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? - A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
– Data compression
Data Reduction 1:
Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
[Figure: data plotted in the x1-x2 plane]
Principal Component Analysis
(PCA) : Steps
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
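A minimal sketch of the steps above using scikit-learn's PCA (assumed); the toy data and the choice of k = 2 components are hypothetical:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # toy correlated data
Xn = (X - X.mean(axis=0)) / X.std(axis=0)                  # normalize each attribute
pca = PCA(n_components=2).fit(Xn)                          # k = 2 orthonormal principal components
print(pca.explained_variance_ratio_)                       # components sorted by "significance"
X_reduced = pca.transform(Xn)                              # data projected onto the 2 components
print(X_reduced.shape)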
Principal Component Analysis
(PCA)
• Principal Component Analysis (PCA) is a process of converting a set of objects with correlated features into a set of values of linearly uncorrelated variables called Principal Components (PCs).
• Over the years, PCs have been used to reduce the number of features of a data set and to identify new, meaningful underlying features.
• Many large data sets (with many features and/or individuals) are available nowadays.
• Deriving information directly from them is very difficult.
• PCA is helpful here.
Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”:
It means that there is some kind of measurement error or variability in
the data that makes it less precise or less accurate.
• Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
• There are many choices of clustering definitions and clustering
algorithms
• Cluster analysis will be studied in depth later
Sampling: Cluster or
Stratified Sampling
[Figure: the original data and its cluster/stratified-sample approximation]
– Ex.: let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1
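A minimal sketch of z-score normalization and decimal scaling as described above; the attribute values are hypothetical apart from the slide's μ and σ:

import numpy as np

v = np.array([73600.0, 54000.0, 16000.0, 98000.0])    # hypothetical attribute values
z = (v - 54000.0) / 16000.0                           # z-score with the slide's mu and sigma
j = int(np.ceil(np.log10(np.abs(v).max())))           # smallest j with max(|v'|) < 1 (here j = 5)
v_dec = v / 10**j                                     # decimal scaling
print(z[0], j, v_dec)                                 # 1.225, 5, values scaled into (0, 1)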
Discretization
• Three types of attributes
– Nominal (categorical) - values from an unordered set, e.g., color,
profession
– Ordinal - values from an ordered set, e.g., military or academic
rank
– Numeric (continuous) - real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
As per syllabus
1. Data Exploration and Description
3. Data Visualization
As per session plan
• Data Exploration and Description
o Data Visualization
Types of Data Set
Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents represented as term-frequency vectors (a representation of a document in a vector-space model, where each element of the vector corresponds to a term (word) in a vocabulary and the value at each position is the frequency of that term in the document)
– Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph and network
– World Wide Web
– Social or information networks
– Molecular structures
Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data

Example term-frequency vector over terms such as team, coach, play, ball, score, game, win, lost, timeout, season:
Document 1: 3 0 5 0 2 6 0 2 0 2
Median: middle value if there is an odd number of values, and the average of the two middle values otherwise (for grouped data it is estimated by interpolation; see the formula below)
Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: mean − mode = 3 × (mean − median)
Measuring the Central Tendency
Median for grouped data: Median = L + ((N/2 − F) / f) × h
Here
L is the lower boundary of the median class.
N is the total number of observations.
F is the cumulative frequency of the class preceding the median class.
f is the frequency of the median class.
h is the class width (the difference between the upper and lower boundaries of the class).
Identify the Median Class:
• Total number of observations, N=50
• N/2=50/2=25
• The cumulative frequency just greater than 25 is 40, so the median
class is 30 - 40.
Apply the Formula:
• L=30 (lower boundary of the median class)
• N=50
• F=25 (cumulative frequency of the class before the median class, which
is 20 - 30)
• f=15 (frequency of the median class)
• h=10 (class width)
Median = 30 + ((50/2 − 25) / 15) × 10 = 30
Symmetric vs. Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
Measuring the Dispersion of
Data
Quartiles, outliers and boxplots
– Quartiles: These are values that divide a set of data into four equal
parts. Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
– Outlier: usually, a value higher/lower than 1.5 x IQR
Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
s² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)² = (1/(n−1)) [ Σ_{i=1}^{n} x_i² − (1/n) (Σ_{i=1}^{n} x_i)² ]
σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)² = (1/N) Σ_{i=1}^{N} x_i² − μ²
Boxplot Analysis
Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended
to Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
Properties of Normal
Distribution Curve
• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ:
mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
[Figure: pixel-oriented visualization of four attributes: (a) Income, (b) Credit Limit, (c) transaction volume, (d) age]
Laying Out Pixels in Circle
Segment
In this technique, pixels representing
data values are arranged within
segments of a circle rather than
traditional row-column grids. This
method can be particularly effective in
emphasizing periodic patterns or cyclic
data.
Data points are arranged in concentric circular segments, with each segment representing a different subset or category of the data.
[Figure: representing a data record in a circle segment]
Scatterplot Matrices
[Figure: a matrix of pairwise scatterplots]
A census data figure showing age, income, gender, education, etc., drawn with 5-piece stick figures (1 body and 4 limbs with different angle/length): two attributes are mapped to the axes and the remaining attributes to the angle or length of the limbs. Look at the texture pattern.
Hierarchical Visualization
Techniques
• Visualization of the data using a hierarchical partitioning into
subspaces
• Methods
– Dimensional Stacking
– Worlds-within-Worlds
– Tree-Map
– Cone Trees
– InfoCube
[Figure: dimensional stacking with attributes 1 to 4 nested, e.g., visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes]
Worlds Within Worlds
Assign the function and the two most important parameters to the innermost world
Fix all other parameters at constant values and draw the other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
Software that uses this paradigm:
• N-Vision: dynamic interaction through a data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
• AutoVisual: static interaction by means of queries
Tree Map
• A treemap uses nested rectangles to represent the branches and
leaves of a tree structure. Each branch is given a rectangle, which is
then tiled with smaller rectangles representing sub-branches. The
size of each rectangle is proportional to a specific data dimension
(such as value or size), and colors are often used to add an additional
data dimension.
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and
Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode
    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  ...  0
Z-score standardization: z_if = (x_if − m_f) / s_f
Minkowski distance: d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 5.1 5.1 0
x4 4.24 1 5.39 0
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance so
defined is also called L-h norm)
Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum
L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
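A minimal sketch that reproduces the L2 (Euclidean) and Supremum matrices above; the four points are not shown on the slide, so the coordinates below are an assumed choice that is consistent with those matrices:

import numpy as np

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)   # assumed x1..x4
diff = np.abs(X[:, None, :] - X[None, :, :])
L1 = diff.sum(axis=2)                       # Manhattan (h = 1)
L2 = np.sqrt((diff ** 2).sum(axis=2))       # Euclidean (h = 2)
Lsup = diff.max(axis=2)                     # Supremum (h -> infinity)
print(np.round(L2, 2))                      # matches the L2 (Euclidean) matrix above
print(Lsup)                                 # matches the Supremum matrix above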
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace x_if by its rank r_if ∈ {1, ..., M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  z_if = (r_if − 1) / (M_f − 1)
– compute the dissimilarity using methods for interval-scaled
variables
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 0.94
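A quick NumPy check of the cosine-similarity example above:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # dot product over the norms
print(round(cos, 2))    # 0.94, as on the slide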
As per syllabus
1. Introduction to Data Mining (DM)
o DM process
o DM challenges
What is Data Mining?
Despite this data deluge, a paradox exists: we find ourselves drowning in data
but often starved for knowledge.
To address this challenge, the principle of "necessity is the mother of invention"
comes into play.
Data mining – an innovative solution involving the automated analysis of
massive datasets.
Data mining serves as a crucial tool for uncovering patterns, relationships, and
valuable insights within the overwhelming sea of data. By leveraging
advanced algorithms, it allows us to transform raw data into actionable
knowledge. In a world where data is abundant but meaningful insights are
scarce, data mining plays a pivotal role in extracting valuable knowledge
from the vast ocean of information.
[Figure: the knowledge discovery process, in which databases undergo data cleaning and data integration, and task-relevant data is then selected for mining]
KDD: An Alternative View
Data Exploration: Statistical Summary, Querying, and Reporting
[Figure: typical system architecture with a Database or Data Warehouse Server, a Data Mining Engine, Pattern Evaluation, and a Knowledge Base]