Module 4 - Supervised and Unsupervised learning techniques

SUPERVISED LEARNING

ALGORITHMS

Presenter: Dr. Amit Kumar Das


Professor,
Dept. of Computer Science and Engg.,
Institute of Engineering & Management.
K-NEAREST NEIGHBOR
ALGORITHM
LET’S CONSIDER THE INPUT DATA …
DATA HOLDOUT

Training data

Test data
LET’S SEE HOW THE TRAINING DATA IS
GROUPED…
Name      Aptitude  Communication  Class
Karuna       2           5         Speaker
Bhuvna       2           6         Speaker
Gaurav       7           6         Leader
Parul        7           2.5       Intel
Dinesh       8           6         Leader
Jani         4           7         Speaker
Bobby        5           3         Intel
Parimal      3           5.5       Speaker
Govind       8           3         Intel
Susant       6           5.5       Leader
Gouri        6           4         Intel
Bharat       6           7         Leader
Ravi         6           2         Intel
Pradeep      9           7         Leader

[Scatter plot: Communication (y-axis) vs Aptitude (x-axis), training points grouped by class]
SAY WE DON’T KNOW WHICH CLASS THE
TEST DATA BELONGS TO …
Name      Aptitude  Communication  Class
Karuna       2           5         Speaker
Bhuvna       2           6         Speaker
Gaurav       7           6         Leader
Parul        7           2.5       Intel
Dinesh       8           6         Leader
Jani         4           7         Speaker
Bobby        5           3         Intel
Parimal      3           5.5       Speaker
Govind       8           3         Intel
Susant       6           5.5       Leader
Gouri        6           4         Intel
Bharat       6           7         Leader
Ravi         6           2         Intel
Pradeep      9           7         Leader
Josh         5           4.5       ???

[Scatter plot: Communication vs Aptitude, with the unlabelled test point (Josh) added]
LET’S TRY TO FIND SIMILARITY WITH THE
DIFFERENT TRAINING DATA INSTANCES…
We calculate the “Euclidean distance” from the test data point
to the different training data points using the formula:

distance = √((a2 − a1)² + (c2 − c1)²)

where (a1, c1) and (a2, c2) are the (Aptitude, Communication) values of the two data points.
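A minimal sketch of this idea in Python, using the training records from the table above; the choice of k = 3 is an assumption for illustration:

```python
import math

# Training data from the table above: (name, aptitude, communication, class)
training = [
    ("Karuna", 2, 5, "Speaker"), ("Bhuvna", 2, 6, "Speaker"),
    ("Gaurav", 7, 6, "Leader"),  ("Parul", 7, 2.5, "Intel"),
    ("Dinesh", 8, 6, "Leader"),  ("Jani", 4, 7, "Speaker"),
    ("Bobby", 5, 3, "Intel"),    ("Parimal", 3, 5.5, "Speaker"),
    ("Govind", 8, 3, "Intel"),   ("Susant", 6, 5.5, "Leader"),
    ("Gouri", 6, 4, "Intel"),    ("Bharat", 6, 7, "Leader"),
    ("Ravi", 6, 2, "Intel"),     ("Pradeep", 9, 7, "Leader"),
]

def knn_predict(aptitude, communication, k=3):
    # Euclidean distance from the test point to every training point
    dists = sorted(
        (math.hypot(apt - aptitude, comm - communication), cls)
        for _, apt, comm, cls in training
    )
    # Majority vote among the k nearest neighbours
    votes = [cls for _, cls in dists[:k]]
    return max(set(votes), key=votes.count)

print(knn_predict(5, 4.5))  # classify the test record (Josh)
```

With k = 3, the nearest neighbours of the test point (5, 4.5) are Gouri, Susant and Bobby, so the majority vote assigns the class Intel.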
DIFFERENT MEASURES OF SIMILARITY
Distance-based similarity measure –
The most common distance measure is the Euclidean distance, which, between two
features F1 and F2, is calculated as:

d(F1, F2) = √((f11 − f21)² + (f12 − f22)² + … + (f1n − f2n)²)

where F1 = (f11, f12, …, f1n) and F2 = (f21, f22, …, f2n) are features of an n-dimensional data set.


A more generalized form of the Euclidean distance is the Minkowski distance:

d(F1, F2) = (|f11 − f21|^r + |f12 − f22|^r + … + |f1n − f2n|^r)^(1/r)

It takes the form of the Euclidean distance (also called the L2 norm) when r = 2, and at r = 1 it takes the form of the Manhattan distance (also called the L1 norm):

d(F1, F2) = |f11 − f21| + |f12 − f22| + … + |f1n − f2n|
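A small Python sketch of the Minkowski family of distances; the two feature vectors are made-up values for illustration:

```python
def minkowski(f1, f2, r):
    # Minkowski distance; r = 2 gives Euclidean (L2), r = 1 gives Manhattan (L1)
    return sum(abs(a - b) ** r for a, b in zip(f1, f2)) ** (1 / r)

x, y = (2, 5), (5, 4.5)    # two points in a 2-D feature space
print(minkowski(x, y, 2))  # Euclidean distance
print(minkowski(x, y, 1))  # Manhattan distance
```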


DIFFERENT MEASURES OF SIMILARITY
Distance-based similarity measure –
To calculate the distance between binary vectors, the Hamming distance is used.

For example, the Hamming distance between the two vectors 01101011 and 11001001 is 3,
as they differ in three bit positions (the first, third and seventh).
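The same count in a short Python sketch, using the two vectors from the example:

```python
def hamming(v1, v2):
    # Number of positions at which the two binary strings differ
    return sum(b1 != b2 for b1, b2 in zip(v1, v2))

print(hamming("01101011", "11001001"))  # -> 3
```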

Other similarity measures –


The Jaccard distance, a measure of dissimilarity between two features, is the
complement of the Jaccard index. For two features having binary values, the Jaccard
index is measured as:

J = n11 / (n01 + n10 + n11)

where n11 = number of cases where both the features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0

Jaccard distance, dJ = 1 − J
DIFFERENT MEASURES OF SIMILARITY
Other similarity measures –
Jaccard distance
Let’s consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0). Here n11 = 2, n01 = 1 and n10 = 2, so J = 2 / (1 + 2 + 2) = 0.4 and the Jaccard distance is dJ = 1 − 0.4 = 0.6.
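The same computation in Python, following the n11/n01/n10 definitions above:

```python
def jaccard_index(f1, f2):
    # n11: both 1; n01: feature 1 is 0, feature 2 is 1; n10: the reverse.
    # Cases where both features are 0 are ignored.
    n11 = sum(a == 1 and b == 1 for a, b in zip(f1, f2))
    n01 = sum(a == 0 and b == 1 for a, b in zip(f1, f2))
    n10 = sum(a == 1 and b == 0 for a, b in zip(f1, f2))
    return n11 / (n01 + n10 + n11)

F1 = (0, 1, 1, 0, 1, 0, 1, 0)
F2 = (1, 1, 0, 0, 1, 0, 0, 0)
J = jaccard_index(F1, F2)
print(J, 1 - J)  # Jaccard index 0.4, Jaccard distance 0.6
```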

Cosine Similarity

It is calculated as:

cos θ = (x · y) / (‖x‖ ‖y‖)

where x · y is the dot product of the two vectors and ‖x‖, ‖y‖ are their magnitudes.
DIFFERENT MEASURES OF SIMILARITY
Cosine Similarity
Let’s calculate the cosine similarity of x and y, where x = (2, 4, 0, 0, 2, 1, 3, 0, 0) and y
= (2, 1, 0, 0, 3, 2, 1, 0, 1).
In this case, x·y = 2*2 + 4*1 + 0*0 + 0*0 + 2*3 + 1*2 + 3*1 + 0*0 + 0*1 = 19,
‖x‖ = √(2² + 4² + 2² + 1² + 3²) = √34 ≈ 5.83 and ‖y‖ = √(2² + 1² + 3² + 2² + 1² + 1²) = √20 ≈ 4.47,
so the cosine similarity is 19 / (5.83 × 4.47) ≈ 0.73.

Cosine similarity actually measures the cosine of the angle between the x and y vectors. In the above
example, the angle comes to about 43°, which indicates a good level of similarity. If the cosine
similarity has a value of 1, the angle between x and y is 0°, which means x and y are the same
except for their magnitude.

Cosine similarity is one of the most popular measures in text classification.
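A short Python sketch that reproduces the calculation above:

```python
import math

def cosine_similarity(x, y):
    # cos(theta) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (2, 4, 0, 0, 2, 1, 3, 0, 0)
y = (2, 1, 0, 0, 3, 2, 1, 0, 1)
sim = cosine_similarity(x, y)
print(sim)                           # ~0.73
print(math.degrees(math.acos(sim)))  # angle of ~43 degrees
```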


DECISION TREE
ALGORITHM
WHAT IS ENTROPY?
 In statistical mechanics, entropy is a property of a
thermodynamic system, closely related to its microscopic
configurations (known as microstates).

 Because it is determined by the number of possible
microstates, entropy is related to the amount of additional
information needed to specify the exact physical state of a
system, given its macroscopic specification. For this
reason, it is often said that entropy is an expression of the
disorder, or randomness, of a system, or of the lack of
information about it.

 The concept of entropy plays a central role in information
theory.
TRAINING DATA FOR GTS RECRUITMENT
ENTROPY AND INFORMATION GAIN
CALCULATION (LEVEL 1)

Entropy(S) = −Σ pi log2(pi)

where pi is the proportion of examples in S belonging to class i.

Information Gain (S, A) = Entropy(Sbs) − Entropy(Sas)

where Sbs refers to the data before the split and Sas to the data after the split on attribute A. The entropy after the split is the weighted average of the entropies of the partitions Sj created by A:

Entropy(Sas) = Σ (|Sj| / |S|) × Entropy(Sj)
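The following Python sketch implements these formulas. The class labels and the split shown are hypothetical, since the GTS recruitment table itself is not reproduced here:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, partitions):
    # Entropy before the split minus the weighted entropy after the split
    total = len(labels)
    entropy_after = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - entropy_after

# Hypothetical labels: 9 selected, 5 rejected; an attribute splits them in two
labels = ["Yes"] * 9 + ["No"] * 5
partitions = [["Yes"] * 6 + ["No"] * 1, ["Yes"] * 3 + ["No"] * 4]
print(entropy(labels))                       # ~0.94
print(information_gain(labels, partitions))  # gain from this split
```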
ENTROPY AND INFORMATION GAIN CALCULATION
(LEVEL 2)
ENTROPY AND INFORMATION GAIN
CALCULATION (LEVEL 3)
RANDOM FOREST CLASSIFIERS
SIMPLE LINEAR
REGRESSION ALGORITHM
MOTIVATING PROBLEM

How can he validate what he believes?

As we know, in simple linear regression, the line is drawn using the
regression formula:

Y = a + bX

If we know the values of ‘a’ and ‘b’, then it is easy to predict the value of Y
for any given X by using the above formula. But the question is how to
calculate the values of ‘a’ and ‘b’ for a given set of X and Y values?
MOTIVATING PROBLEM (CONTD.)
A scatter plot was drawn to explore the relationship between the
independent variable (internal marks) mapped to X-axis and dependent
variable (external marks) mapped to Y-axis as depicted in the figure below.
ORDINARY LEAST SQUARES (OLS) TECHNIQUE
A straight line is drawn as close as possible over the points on the scatter
plot. Ordinary Least Squares (OLS) is the technique used to estimate a
line that will minimize the error (ε), which is the difference between the
predicted and the actual values of Y. This means summing the errors of
each prediction or, more appropriately, the Sum of the Squares of the
Errors (SSE), i.e. SSE = Σ(yi − ŷi)².

It is observed that the SSE is least when ‘b’ takes the value

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

The corresponding value of ‘a’ calculated using the above value of ‘b’ is

a = ȳ − b·x̄
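A small Python sketch of this calculation; the internal/external marks below are made-up values, not the data set used in the slides:

```python
def ols_fit(x, y):
    # b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2);  a = y_bar - b*x_bar
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b

internal = [20, 25, 22, 30, 28]   # hypothetical internal marks (X)
external = [55, 68, 62, 76, 72]   # hypothetical external marks (Y)
a, b = ols_fit(internal, external)
print(f"MExt = {a:.2f} + {b:.2f} * MInt")
```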
OLS TECHNIQUE BASED CALCULATION
As we have already seen, the simple linear regression model built on the
data in the example is

MExt = 19.04 + 1.89 * MInt

The value of the intercept from the above equation is 19.04. However, none of
the internal marks is 0. So, intercept = 19.04 indicates that 19.04 is the portion of
the external examination marks not explained by the internal examination marks.

 Slope measures the estimated change in the average value of Y as a result of a
one-unit change in X. Here, slope = 1.89 tells us that the average value of the
external examination marks increases by 1.89 for each additional mark in the
internal examination.
MULTIPLE LINEAR REGRESSION
 Two or more independent variables, i.e. predictors are
involved in the model
 Say, while predicting Price of a Property as the dependent
variable, the possible predictors can be Area of the Property,
location, floor, number of years since purchase, amenities
available, etc.
 We can form a multiple regression equation as shown below:
PriceProperty = f (AreaProperty, location, floor, Ageing, Amenities)
 The following expression describes the equation involving the
relationship with two predictor variables, namely X1 and X2:

Ŷ = a + b1X1 + b2X2

The model describes a plane in the three-dimensional space of Ŷ,
X1, and X2. Parameter ‘a’ is the intercept of this plane. Parameters
‘b1’ and ‘b2’ are referred to as partial regression coefficients.
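A brief sketch of fitting such a plane using NumPy's least-squares solver; the property data below is hypothetical:

```python
import numpy as np

# Hypothetical data: area (sq. ft) and floor as the two predictors of price
X1 = np.array([800, 950, 1100, 1250, 1400])   # AreaProperty
X2 = np.array([2, 5, 3, 8, 10])               # floor
Y = np.array([40.0, 52.0, 55.0, 70.0, 80.0])  # PriceProperty (in lakhs)

# Design matrix with a column of ones for the intercept 'a'
A = np.column_stack([np.ones_like(X1, dtype=float), X1, X2])
(a, b1, b2), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"Y_hat = {a:.2f} + {b1:.4f}*X1 + {b2:.2f}*X2")
```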
UNSUPERVISED
LEARNING
UNSUPERVISED LEARNING - CLUSTERING

[Scatter plot: data points grouped into four clusters, labelled Cluster 1 to Cluster 4]
UNSUPERVISED LEARNING – ASSOCIATION
ANALYSIS
DIFFERENT CLUSTERING TECHNIQUES
 Partitioning techniques
 Hierarchical techniques
 Density-based techniques
DIFFERENT CLUSTERING TECHNIQUES (CONTD.)
Partitioning techniques
 Uses mean or medoid (etc.) to represent cluster
centre
 Adopts distance-based approach to refine
clusters
 Finds mutually exclusive clusters of spherical
or nearly spherical shape
 Effective for data sets of small to medium size
 Two of the most important algorithms for
partitioning-based clustering are k-means
and k-medoids
DIFFERENT CLUSTERING TECHNIQUES (CONTD.)
Hierarchical techniques
 Creates hierarchical or tree-like structure through
decomposition or merger
 For example, in a problem of organizing employees of
a university in different departments, first the
employees are grouped under the different
departments in the university. Then within each
department, the employees can be grouped according
to their roles such as professors, assistant professors,
supervisors, lab assistants, etc.
 Uses distance between the nearest or furthest points
in neighbouring clusters as a guideline for refinement
 Two main hierarchical clustering methods:
agglomerative clustering and divisive clustering
DIFFERENT CLUSTERING TECHNIQUES (CONTD.)
Hierarchical techniques
 Agglomerative clustering is a bottom-up technique
which starts with individual objects as clusters and
then iteratively merges them to form larger clusters.
 Divisive clustering starts with one cluster with all
given objects and then splits it iteratively to form
smaller clusters.
 In both these cases, it is important to select the split
and merger points carefully, because the subsequent
splits or mergers will use the result of the previous
ones, and there is no option to swap objects between
the clusters or rectify the decisions made in previous
steps. A minimal sketch of the agglomerative approach
is shown below.
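A minimal sketch of agglomerative clustering using SciPy; the points and the choice of single linkage (merging by nearest points) are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six made-up 2-D points forming two natural groups
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.5], [8.5, 9.5]])

# Bottom-up merging; 'single' linkage uses the nearest points of two clusters,
# 'complete' linkage would use the furthest points instead
Z = linkage(X, method="single")

# Cut the resulting tree to obtain two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 1 2]
```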
DIFFERENT CLUSTERING TECHNIQUES (CONTD.)
Density-based techniques
 Useful for identifying arbitrarily shaped clusters

 Guiding principle of cluster creation is the identification of
dense regions of objects in space, which are separated by low-
density regions. The key idea is that for each point of a
cluster, the neighbourhood of a given radius has to contain at
least a minimum number of points.
 DBSCAN is one of the most popular density-based algorithms
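A minimal sketch using scikit-learn's DBSCAN implementation; the points and the eps/min_samples settings are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions of made-up points plus one isolated point (noise)
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [8, 8], [8.1, 8.2], [7.9, 7.8],
              [4.5, 4.5]])

# eps: neighbourhood radius; min_samples: minimum points per dense region
labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)
print(labels)  # [0 0 0 1 1 1 -1]; noise points are labelled -1
```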
K-MEANS ALGORITHM
BASIC ALGORITHM OF K-MEANS
 Choose k points in the feature space to serve as
the cluster centres
 After choosing the initial cluster centres, the
other examples are assigned to the cluster centre
that is most similar or nearest according to the
distance function
 Once the initial assignment phase is complete, the
k-means algorithm proceeds to the update phase.
 The first step of updating the clusters involves
shifting the initial centres to a new location,
known as the centroid, which is calculated as
the mean value of the points currently assigned
to that cluster
BASIC ALGORITHM OF K-MEANS (CONTD.)
 As points are reassigned from one cluster to
another, the centroids change again, which
leads to another update stage.
 When no points are reassigned, the k-means
algorithm stops. The cluster assignments are
now final.
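The steps above can be condensed into a short Python sketch; the sample points are made up, and the initial centres are chosen at random:

```python
import math
import random

def kmeans(points, k, max_iters=100):
    # Choose k initial cluster centres at random from the data
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment phase: attach each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update phase: move each centre to the mean of its assigned points
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no reassignments: stop
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```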
Figure 9.6 Clustering dataset

[Figure: Clustering with initial centroids - clusters C1, C2, C3 and C4]

Iteration 1: 4 clusters and distance of points from the centroids

Iteration 2: Centroids recomputed and points redistributed among clusters based on the nearest centroid (some points move out of Cluster 1)

Iteration 3: Final cluster arrangement - centroids recomputed and points redistributed among clusters based on the nearest centroid (some points move out of Cluster 3)
HOW TO ARRIVE AT A VALUE OF K?
 In case there is some a priori knowledge, the
same can be used
 Sometimes it is dictated by business need or
the motivation for the analysis
 Without any a priori knowledge at all, one
rule of thumb suggests setting k equal to the
square root of (n / 2), where n is the number
of examples in the dataset
 A technique known as the elbow method
attempts to gauge how the homogeneity or
heterogeneity within the clusters changes for
various values of k
HOW TO ARRIVE AT A VALUE OF K?
- ELBOW METHOD
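A minimal sketch of the elbow method using scikit-learn; the data set is synthetic and the range of k values is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three natural groups
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0, 4, 8)])

# Within-cluster sum of squares (inertia) for k = 1..9;
# the 'elbow' where the curve flattens suggests a good k (here, 3)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 10)]
plt.plot(range(1, 10), wss, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```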
ASSOCIATION ANALYSIS
ASSOCIATION ANALYSIS
 Useful for identifying interesting relationships hidden
in large data sets
 A common application of this analysis is the Market
Basket Analysis that retailers use for cross-selling of
their products
 Itemset - One or more items are grouped together.
They are surrounded by brackets to indicate that they
form a set. E.g. {Bread, Milk, Egg} can be grouped
together to form an itemset as those are frequently
bought together
 Support count - Denotes the number of transactions
in which a particular itemset is present. This is a very
important property of an itemset as it denotes the
frequency of occurrence for the itemset.
ASSOCIATION ANALYSIS (CONTD.)
 Association rules - set of rules that specify patterns
of relationships among items. A typical rule might be
expressed as {Bread, Milk}→{Egg}, which denotes that
if Bread and Milk are purchased, then Egg is also
likely to be purchased.
 Association rules are learned from subsets of itemsets.
For example, the preceding rule was identified from
the set of {Bread, Milk, Egg}.
 It should be noted that an association rule is an
expression of the form X → Y, where X and Y are disjoint
itemsets, i.e. X ∩ Y = ∅.
 Support and confidence are the two concepts that
are used for measuring the strength of an association
rule.
ASSOCIATION ANALYSIS (CONTD.)
 Support denotes how often a rule is applicable to a
given data set.
 Confidence indicates how often the items in Y
appear in transactions that contain X. It denotes
the predictive power or accuracy of the rule.
 Besides support and confidence, a few other measures
are used to evaluate the strength of an association rule:

lift(X→Y) = confidence(X→Y) / support(Y)
leverage(X→Y) = support(X→Y) − support(X) × support(Y)
conviction(X→Y) = (1 − support(Y)) / (1 − confidence(X→Y))
ASSOCIATION ANALYSIS (CONTD.)
 In the data set presented below, if we consider the
association rule {Bread, Milk} → {Egg}, then from the
formula of support and confidence we can say:
ASSOCIATION ANALYSIS (CONTD.)
Role of support & confidence in association
analysis :
 A low support may indicate that the rule has occurred
by chance, i.e. in the context of retail, the items are
seldom bought together by customers
 Confidence provides the measurement for reliability of
the inference of a rule. Higher confidence of a rule X →
Y denotes more likelihood of Y to be present in
transactions that contain X.
 The confidence of X leading to Y is not the same as the
confidence of Y leading to X. In the given data set, the
confidence of {Bread, Milk} → {Egg} = 3/4 = 0.75, but the
confidence of {Egg} → {Bread, Milk} = 3/5 = 0.6. Here,
{Bread, Milk} → {Egg} is the stronger rule (see the sketch below).
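The following Python sketch computes support, confidence and lift from first principles. The transaction list is hypothetical, constructed so that the two rule confidences match the 0.75 and 0.6 values above:

```python
# Each transaction is the set of items bought together
transactions = [
    {"Bread", "Milk", "Egg"}, {"Bread", "Milk"}, {"Bread", "Milk", "Egg"},
    {"Milk", "Egg"}, {"Bread", "Milk", "Egg"}, {"Bread", "Egg"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

def lift(X, Y):
    return confidence(X, Y) / support(Y)

X, Y = {"Bread", "Milk"}, {"Egg"}
print(confidence(X, Y))  # 0.75
print(confidence(Y, X))  # 0.6
print(lift(X, Y))        # 0.9
```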
THANK YOU &
QUESTIONS PLEASE!
