Module 4 - Supervised and Unsupervised learning techniques
ALGORITHMS
Training data
Test data
LET’S SEE HOW THE TRAINING DATA IS GROUPED…
Name      Aptitude   Communication   Class
Karuna    2          5               Speaker
Bhuvna    2          6               Speaker
Gaurav    7          6               Leader
Parul     7          2.5             Intel
Dinesh    8          6               Leader
Jani      4          7               Speaker
Bobby     5          3               Intel
Parimal   3          5.5             Speaker
Govind    8          3               Intel
Susant    6          5.5             Leader
Gouri     6          4               Intel
Bharat    6          7               Leader
Ravi      6          2               Intel
Pradeep   9          7               Leader

[Scatter plot: training data plotted with Aptitude on the x-axis and Communication on the y-axis, grouped by class]
SAY WE DON’T KNOW WHICH CLASS THE TEST DATA BELONGS TO…
Name      Aptitude   Communication   Class
Karuna    2          5               Speaker
Bhuvna    2          6               Speaker
Gaurav    7          6               Leader
Parul     7          2.5             Intel
Dinesh    8          6               Leader
Jani      4          7               Speaker
Bobby     5          3               Intel
Parimal   3          5.5             Speaker
Govind    8          3               Intel
Susant    6          5.5             Leader
Gouri     6          4               Intel
Bharat    6          7               Leader
Ravi      6          2               Intel
Pradeep   9          7               Leader
Josh      5          4.5             ???

[Scatter plot: training data and the unlabelled test point (Josh) plotted with Aptitude on the x-axis and Communication on the y-axis]
LET’S TRY TO FIND SIMILARITY WITH THE DIFFERENT TRAINING DATA INSTANCES…
We calculate the “Euclidean distance” from the test data point to the different training data points using the formula:
distance = √[(Aptitude₁ − Aptitude₂)² + (Communication₁ − Communication₂)²],
i.e. the straight-line distance between the two points in the Aptitude–Communication plane.
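A minimal Python sketch of this calculation, using the Aptitude and Communication values from the table above (the variable and function names are our own, and the nearest-neighbour step at the end is just one way the distances could be used):

import math

# Training data: (name, aptitude, communication, class) taken from the table above
training = [
    ("Karuna", 2, 5, "Speaker"), ("Bhuvna", 2, 6, "Speaker"),
    ("Gaurav", 7, 6, "Leader"), ("Parul", 7, 2.5, "Intel"),
    ("Dinesh", 8, 6, "Leader"), ("Jani", 4, 7, "Speaker"),
    ("Bobby", 5, 3, "Intel"), ("Parimal", 3, 5.5, "Speaker"),
    ("Govind", 8, 3, "Intel"), ("Susant", 6, 5.5, "Leader"),
    ("Gouri", 6, 4, "Intel"), ("Bharat", 6, 7, "Leader"),
    ("Ravi", 6, 2, "Intel"), ("Pradeep", 9, 7, "Leader"),
]

# Test data point whose class is unknown (Josh)
josh = (5, 4.5)

def euclidean(p, q):
    """Euclidean distance between two (aptitude, communication) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Distance from Josh to every training instance
distances = [(name, euclidean((apt, comm), josh), cls)
             for name, apt, comm, cls in training]

# The nearest training instance suggests the most likely class for Josh
nearest = min(distances, key=lambda t: t[1])
print(nearest)   # ('Gouri', 1.118..., 'Intel') -> the closest neighbour is of class 'Intel'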
DIFFERENT MEASURES OF SIMILARITY
Distance-based similarity measure –
The most common distance measure is the Euclidean distance, which, between two features F1 and F2, is calculated as:
d(F1, F2) = √[ Σ (F1ᵢ − F2ᵢ)² ]
Another distance-based measure is the Hamming distance, which counts the positions at which two vectors differ. For example, the Hamming distance between the two vectors 01101011 and 11001001 is 3, as they differ in three bit positions.
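A one-line Python check of that Hamming distance:

a, b = "01101011", "11001001"
# Hamming distance: number of positions where the two bit strings differ
hamming = sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
print(hamming)   # 3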
DIFFERENT MEASURES OF SIMILARITY
Other similarity measures –
Jaccard distance
For two binary features, the Jaccard index is J = n11 / (n01 + n10 + n11),
where n11 = number of cases where both the features have value 1,
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1,
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0.
Jaccard distance, dJ = 1 − J.
Let’s consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0). Here n11 = 2, n01 = 1 and n10 = 2, so J = 2 / (1 + 2 + 2) = 0.4 and the Jaccard distance dJ = 1 − 0.4 = 0.6.
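A minimal Python sketch of the same Jaccard calculation, counting n11, n01 and n10 directly from the two vectors:

F1 = [0, 1, 1, 0, 1, 0, 1, 0]
F2 = [1, 1, 0, 0, 1, 0, 0, 0]

# Count the agreement/disagreement cases for the two binary features
n11 = sum(1 for a, b in zip(F1, F2) if a == 1 and b == 1)   # both 1        -> 2
n01 = sum(1 for a, b in zip(F1, F2) if a == 0 and b == 1)   # F1 = 0, F2 = 1 -> 1
n10 = sum(1 for a, b in zip(F1, F2) if a == 1 and b == 0)   # F1 = 1, F2 = 0 -> 2

J = n11 / (n01 + n10 + n11)   # Jaccard index = 2 / 5 = 0.4
dJ = 1 - J                    # Jaccard distance = 0.6
print(J, dJ)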
Cosine Similarity
It is calculated as: cos(x, y) = (x · y) / (‖x‖ × ‖y‖), where x · y is the dot product of the vectors x and y, and ‖x‖, ‖y‖ are their lengths.
DIFFERENT MEASURES OF SIMILARITY
Cosine Similarity
Let’s calculate the cosine similarity of x and y, where x = (2, 4, 0, 0, 2, 1, 3, 0, 0) and y = (2, 1, 0, 0, 3, 2, 1, 0, 1).
In this case, x · y = 2*2 + 4*1 + 0*0 + 0*0 + 2*3 + 1*2 + 3*1 + 0*0 + 0*1 = 19,
‖x‖ = √(4 + 16 + 0 + 0 + 4 + 1 + 9 + 0 + 0) = √34 ≈ 5.83 and ‖y‖ = √(4 + 1 + 0 + 0 + 9 + 4 + 1 + 0 + 1) = √20 ≈ 4.47,
so cos(x, y) = 19 / (5.83 × 4.47) ≈ 0.729.
Cosine similarity actually measures the angle between the x and y vectors. In the above example, the angle comes to about 43°, which is a good level of similarity. If the cosine similarity has a value of 1, the angle between x and y is 0°, which means x and y are the same except for the magnitude.
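A minimal Python sketch of the same cosine-similarity calculation:

import math

x = [2, 4, 0, 0, 2, 1, 3, 0, 0]
y = [2, 1, 0, 0, 3, 2, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))       # x . y = 19
norm_x = math.sqrt(sum(a * a for a in x))    # ||x|| = sqrt(34) ≈ 5.83
norm_y = math.sqrt(sum(b * b for b in y))    # ||y|| = sqrt(20) ≈ 4.47

cos_sim = dot / (norm_x * norm_y)            # ≈ 0.729
angle = math.degrees(math.acos(cos_sim))     # ≈ 43.2 degrees
print(cos_sim, angle)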
Entropy(S) = −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of instances in S belonging to class i; the same formula is applied to each subset of S obtained after a split on an attribute.
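A minimal Python sketch of how entropy and information gain could be computed; the class labels and the split used here are hypothetical, purely to show the formula in action:

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Information gain = Entropy(S) minus the weighted entropy of the split subsets."""
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - weighted

# Hypothetical example: 9 positive and 5 negative cases, split into two subsets
S = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(entropy(S), information_gain(S, split))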
ENTROPY AND INFORMATION GAIN CALCULATION (LEVEL 2)
ENTROPY AND INFORMATION GAIN CALCULATION (LEVEL 3)
RANDOM FOREST CLASSIFIERS
SIMPLE LINEAR REGRESSION ALGORITHM
MOTIVATING PROBLEM
A simple linear regression model has the form Y = a + b·X. If we know the values of ‘a’ and ‘b’, then it is easy to predict the value of Y for any given X by using this formula. But the question is how to calculate the values of ‘a’ and ‘b’ for a given set of X and Y values?
MOTIVATING PROBLEM (CONTD.)
A scatter plot was drawn to explore the relationship between the independent variable (internal marks), mapped to the X-axis, and the dependent variable (external marks), mapped to the Y-axis, as depicted in the figure below.
ORDINARY LEAST SQUARES (OLS) TECHNIQUE
A straight line is drawn as close as possible over the points on the scatter plot. Ordinary Least Squares (OLS) is the technique used to estimate a line that will minimize the error (ε), which is the difference between the predicted and the actual values of Y. This means summing the errors of each prediction or, more appropriately, the Sum of the Squares of the Errors (SSE), i.e. SSE = Σ (Yᵢ − Ŷᵢ)².
It is observed that the SSE is least when ‘b’ takes the value
b = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)², where X̄ and Ȳ are the means of X and Y.
The corresponding value of ‘a’, calculated using the above value of ‘b’, is a = Ȳ − b·X̄.
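A minimal Python sketch of the OLS formulas above; the internal (X) and external (Y) marks used here are hypothetical:

# Hypothetical internal (X) and external (Y) examination marks
X = [15, 18, 20, 22, 25, 28, 30]
Y = [49, 52, 57, 60, 66, 72, 75]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# b = sum((Xi - X_bar)(Yi - Y_bar)) / sum((Xi - X_bar)^2)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)

# a = Y_bar - b * X_bar
a = y_bar - b * x_bar

print(a, b)   # intercept and slope of the fitted line Y = a + b*X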
OLS TECHNIQUE BASED CALCULATION
As we have already seen, the simple linear regression model built on the data in the example is
MExt = 19.04 + 1.89 × MInt.
The value of the intercept in the above equation is 19.04. However, none of the internal marks is 0. So, the intercept of 19.04 indicates the portion of the external examination marks that is not explained by the internal examination marks.
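With the fitted equation above, prediction is then a single line (the internal mark of 20 is just an illustrative value):

m_int = 20                      # illustrative internal examination mark
m_ext = 19.04 + 1.89 * m_int    # fitted model: MExt = 19.04 + 1.89 * MInt
print(m_ext)                    # 56.84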
[Figure: data points grouped into four clusters (Cluster 1 to Cluster 4)]
UNSUPERVISED LEARNING – ASSOCIATION ANALYSIS
DIFFERENT CLUSTERING TECHNIQUES
Partitioning techniques
Hierarchical techniques
Density-based techniques
DIFFERENT CLUSTERING TECHNIQUES (CONTD.)
Partitioning techniques
Uses mean or medoid (etc.) to represent cluster centre
Adopts distance-based approach to refine clusters
Finds mutually exclusive clusters of spherical or nearly spherical shape
Effective for data sets of small to medium size
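As an illustration of the partitioning idea, a minimal k-means sketch in Python (the two-feature data points are hypothetical, and real use would normally rely on a library implementation):

import random

def kmeans(points, k, iterations=10):
    """Very small k-means: each cluster is represented by its mean (centre),
    and points are reassigned to the nearest centre on every pass."""
    centres = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centre to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres, clusters

# Hypothetical 2-D points forming two loose groups
data = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centres, clusters = kmeans(data, k=2)
print(centres)
print(clusters)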
[Figures: partitioning clusters C2, C3 and C4 shown with labelled distances a_i1, a_i2, …, a_in1 and b_14(1), b_14(2), …, b_14(n4); in the final panel a point is marked as moved from Cluster 3]