Machine Learning
In machine learning:
• There is a learning algorithm.
• Data, called the training data set, is fed to the learning algorithm.
• The learning algorithm draws inferences from the training data set.
• It generates a model, which is a function that maps input to output.
Supervised Learning
• The training data set is a labeled data set.
• In other words, the training data set contains both the input values (X) and the target values (Y).
• The learning algorithm generates a model.
• Then, a new data set consisting of only the input values is fed to the model.
• The model generates the target values based on its learning, as in the sketch below.
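A minimal sketch of this workflow (assuming scikit-learn is available; the toy data set and the choice of logistic regression are illustrative only, not part of the slides):

```python
# A minimal sketch of the supervised workflow above, assuming scikit-learn;
# the toy data and the model choice are illustrative only.
from sklearn.linear_model import LogisticRegression

# Labeled training data set: input values (X) and target values (Y).
X_train = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
Y_train = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()   # the learning algorithm
model.fit(X_train, Y_train)    # draws inferences from the training data set

# A new data set with only input values; the model generates target values.
X_new = [[2.5], [9.5]]
print(model.predict(X_new))    # expected: [0 1]
```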
Example
• Consider a sample database consisting of two columns, where:
• The first column specifies emails.
• The second column specifies whether those emails are spam or not.
• In this training data set, emails are categorized as spam or not using a supervisor's knowledge.
• So, it is a supervised learning algorithm.
Types of Supervised Learning
Regression:
• The target variable (Y) has continuous values.
• Example: house price prediction.
Classification:
• The target variable (Y) has discrete values, such as Yes or No, 0 or 1, and so on.
• Example: credit scoring, spam filtering.
Unsupervised Learning
• The training data set is an unlabeled data set.
• In other words, the training data set contains only the input values (X) and not the target values (Y).
• Based on the similarity between data points, the algorithm tries to draw inferences from the data, such as finding patterns or clusters.
Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters. I.e.,
• it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, unlike in supervised learning.
• Clustering is often considered synonymous with unsupervised learning.
• In fact, association rule mining is also unsupervised.
An illustration
• The data set has three natural groups of data points, i.e., three natural clusters.
What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
• Tailor-made for each person: too expensive.
• One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers according to their
similarities
• To do targeted marketing.
What is clustering for? (cont…)
• Example 3: given a collection of text documents, we want to organize them according to their content similarities,
• to produce a topic hierarchy.
• In fact, clustering is one of the most utilized data mining techniques.
• It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
• In recent years, due to the rapid increase of online documents, text clustering has become important.
Aspects of clustering
• A clustering algorithm
• Partitional clustering
• Hierarchical clustering
• …
• A distance (similarity, or dissimilarity) function
• Clustering quality
• Inter-cluster distance maximized
• Intra-cluster distance minimized
• The quality of a clustering result depends on the algorithm, the distance
function, and the application.
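As a rough sketch of how these two quality criteria can be measured (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def intra_cluster_distance(points, centroid):
    """Average distance of a cluster's points to its centroid (minimize)."""
    points = np.asarray(points, dtype=float)
    return np.linalg.norm(points - np.asarray(centroid), axis=1).mean()

def inter_cluster_distance(centroid_a, centroid_b):
    """Distance between two cluster centers (maximize)."""
    return np.linalg.norm(np.asarray(centroid_a) - np.asarray(centroid_b))
```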
K-Means Clustering
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centers.
• Select cluster centers in such a way that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
Step-04:
• Assign each data point to some cluster.
• A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the centers of the newly formed clusters.
• The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met:
• Centers of newly formed clusters do not change
• Data points remain in the same cluster
• Maximum number of iterations is reached
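A compact NumPy sketch of Steps 01-06 (Euclidean distance; the iteration cap and the assumption that no cluster ever becomes empty are simplifications, not part of the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means following Steps 01-06 above, with Euclidean distance.
    Assumes no cluster ever becomes empty (a simplification)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step-02: randomly select K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                    # Step-06: iteration cap
        # Step-03: distance between each data point and each cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the cluster whose center is nearest.
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its cluster's points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):     # Step-06: centers unchanged
            break
        centers = new_centers
    return centers, labels
```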
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) $D$ be $\{x_1, x_2, \ldots, x_n\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and $r$ is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
• Each cluster has a cluster center, called the centroid.
• k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
An example
[Figure: iterations of k-means on a sample 2-D data set; the '+' marks show the cluster centroids.]
Example
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as:
ρ(a, b) = |x2 – x1| + |y2 – y1|
• Use the k-means algorithm to find the three cluster centers after the second iteration.
Iteration-01:
Calculate the distance of each point from each of the centers of the three clusters. The distance is calculated using the given distance function.
Calculating the distance between A1(2, 10) and C1(2, 10):
• ρ(A1, C1) = |2 – 2| + |10 – 10| = 0
Calculating the distance between A1(2, 10) and C2(5, 8):
• ρ(A1, C2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
Calculating the distance between A1(2, 10) and C3(1, 2):
• ρ(A1, C3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
The distances of the remaining points from each center are computed in the same way, giving the table below.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2
Re-compute the new cluster centers. The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• We have only one point, A1(2, 10), in Cluster-01.
• So, the cluster center remains the same.
For Cluster-02:
• Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03:
• Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
Iteration-02:
Calculate the distance of each point from each of the new cluster centers C1(2, 10), C2(6, 6) and C3(1.5, 3.5).

Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (6, 6) of Cluster-02 | Distance from center (1.5, 3.5) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 8 | 7 | C1
A2(2, 5) | 5 | 5 | 2 | C3
A3(8, 4) | 12 | 4 | 7 | C2
A4(5, 8) | 5 | 3 | 8 | C2
A5(7, 5) | 10 | 2 | 7 | C2
A6(6, 4) | 10 | 2 | 5 | C2
A7(1, 2) | 9 | 9 | 2 | C3
A8(4, 9) | 3 | 5 | 8 | C1
Re-compute the new cluster centers. The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02:
• Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03:
• Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
These are the three cluster centers after the second iteration.
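The two iterations above can be replayed with a short NumPy sketch of the same procedure, using the given Manhattan distance function:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)  # A1..A8
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)        # A1, A4, A7

for _ in range(2):  # the two iterations asked for in the example
    # rho(a, b) = |x2 - x1| + |y2 - y1| (the given distance function)
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])

print(centers)  # [[3.   9.5 ] [6.5  5.25] [1.5  3.5 ]]
```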
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
• For categorical data, use k-modes, where the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with very different values.
Weaknesses of k-means: Problems with outliers
[Figure: the effect of outliers on the clusters found by k-means.]
Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the clustering process that are much further away from the centroids than other data points, as in the sketch below.
• To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
• Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
• Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
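A sketch of the first method (the 3x cutoff is an arbitrary illustrative threshold, not from the slides):

```python
import numpy as np

def flag_possible_outliers(X, labels, centers, factor=3.0):
    """Flag points much further from their centroid than their cluster's
    average distance; candidates to monitor and possibly remove."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    d = np.linalg.norm(X - centers[labels], axis=1)  # distance to own centroid
    flags = np.zeros(len(X), dtype=bool)
    for j in np.unique(labels):
        members = labels == j
        flags[members] = d[members] > factor * d[members].mean()
    return flags  # True = possible outlier
```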
Weaknesses of k-means (cont…)
• The algorithm is sensitive to initial seeds.
K-means summary
• Despite weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
• Other clustering algorithms have their own lists of weaknesses.
• There is no clear evidence that any other clustering algorithm performs better in general,
• although they may be more suitable for some specific types of data or applications.
• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
Common ways to represent clusters
• Use the centroid of each cluster to represent the cluster:
• compute the radius and
• the standard deviation of the cluster to determine its spread in each dimension (see the sketch below).
• The centroid representation alone works well if the clusters are of hyper-spherical shape.
• If clusters are elongated or of other shapes, centroids are not sufficient.
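A sketch of these summary statistics for a single cluster (the function name is illustrative):

```python
import numpy as np

def describe_cluster(points):
    """Centroid, radius (max distance to centroid), and per-dimension
    standard deviation describing the cluster's spread."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    radius = np.linalg.norm(points - centroid, axis=1).max()
    spread = points.std(axis=0)   # spread in each dimension
    return centroid, radius, spread
```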
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• It is higher when objects are more alike.
• Often falls in the range [0, 1].
• Examples: cosine, Jaccard, Tanimoto (two of these are sketched below).
• Dissimilarity
• Numerical measure of how different two data objects are.
• It is lower when objects are more alike.
• Minimum dissimilarity is often 0.
• The upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
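Two of the named measures, sketched directly from their definitions:

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; higher = more alike."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return p.dot(q) / (np.linalg.norm(p) * np.linalg.norm(q))

def jaccard_similarity(a, b):
    """|intersection| / |union| of two sets; falls in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```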
Euclidean Distance
• The Euclidean distance between two points p and q is

$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q.
point | x | y
p1 | 0 | 2
p2 | 2 | 0
p3 | 3 | 1
p4 | 5 | 1
 | p1 | p2 | p3 | p4
p1 | 0 | 2.828 | 3.162 | 5.099
p2 | 2.828 | 0 | 1.414 | 3.162
p3 | 3.162 | 1.414 | 0 | 2
p4 | 5.099 | 3.162 | 2 | 0
Distance Matrix (Euclidean)
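The distance matrix above can be reproduced with a few lines of NumPy:

```python
import numpy as np

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4
# dist(p, q) = sqrt(sum_k (p_k - q_k)^2) for every pair of points
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(np.round(D, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```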
Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance:

$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r → ∞. Supremum (L∞, max norm) distance: the maximum difference between any components of the two vectors.
Minkowski Distance

point | x | y
p1 | 0 | 2
p2 | 2 | 0
p3 | 3 | 1
p4 | 5 | 1

L1 | p1 | p2 | p3 | p4
p1 | 0 | 4 | 4 | 6
p2 | 4 | 0 | 2 | 4
p3 | 4 | 2 | 0 | 2
p4 | 6 | 4 | 2 | 0

L2 | p1 | p2 | p3 | p4
p1 | 0 | 2.828 | 3.162 | 5.099
p2 | 2.828 | 0 | 1.414 | 3.162
p3 | 3.162 | 1.414 | 0 | 2
p4 | 5.099 | 3.162 | 2 | 0

L∞ | p1 | p2 | p3 | p4
p1 | 0 | 2 | 3 | 5
p2 | 2 | 0 | 1 | 3
p3 | 3 | 1 | 0 | 2
p4 | 5 | 3 | 2 | 0

Distance Matrices
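All three matrices come from the same Minkowski formula by varying r, as a short sketch shows:

```python
import numpy as np

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4
diffs = np.abs(pts[:, None, :] - pts[None, :, :])  # |p_k - q_k| per pair

L1   = diffs.sum(axis=2)                    # r = 1: Manhattan distance
L2   = np.sqrt((diffs ** 2).sum(axis=2))    # r = 2: Euclidean distance
Linf = diffs.max(axis=2)                    # r -> infinity: max-norm distance
```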
Mahalanobis Distance

$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1}\, (p - q)^T$$

where Σ is the covariance matrix of the input data X:

$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
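A sketch following the formula above (note it returns the squared form, exactly as written on the slide, without a square root):

```python
import numpy as np

def mahalanobis(p, q, X):
    """mahalanobis(p, q) = (p - q) Sigma^-1 (p - q)^T, where Sigma is the
    covariance matrix of the input data X (one data point per row)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    sigma = np.cov(np.asarray(X, dtype=float), rowvar=False)  # 1/(n-1) form
    diff = p - q
    return float(diff @ np.linalg.inv(sigma) @ diff)
```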