
Module 4

Unsupervised Learning
Bayesian learning
Unsupervised Learning
 Unsupervised learning is a type of machine learning in which
models are trained on an unlabeled dataset and are allowed to
act on that data without any supervision.

 The goal of unsupervised learning is to find the underlying
structure of the dataset, group the data according to similarities,
and represent the dataset in a compressed format.
Why use Unsupervised Learning?
 Unsupervised learning is helpful for finding useful insights from the
data.

 Unsupervised learning is similar to how a human learns to think from
their own experiences, which makes it closer to true AI.

 Unsupervised learning works on unlabeled and uncategorized data,
which makes it all the more important.

 In the real world, we do not always have input data with corresponding
output labels; to handle such cases, we need unsupervised learning.
Working of Unsupervised Learning
Types of Unsupervised Learning Algorithm:

 Clustering: Clustering is a method of grouping objects into
clusters such that objects with the most similarities remain
in one group and have few or no similarities with the objects
of another group.

 Association: An association rule is an unsupervised learning
method which is used for finding relationships between variables
in a large database, as in the toy sketch below.
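The slide does not say how association rules are scored; a common approach uses support and confidence. A minimal, self-contained Python sketch on invented toy transaction data (the items and the rule {bread} -> {butter} are purely illustrative):

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Score the rule {bread} -> {butter}
print(support({"bread", "butter"}))       # 0.5
print(confidence({"bread"}, {"butter"}))  # about 0.67
```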
Different types of clustering techniques
 Partitioning methods

 Hierarchical methods

 Density-based methods
Unsupervised Learning algorithms:
 K-means clustering
 KNN (k-nearest neighbors)
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition
 Advantages of Unsupervised Learning
 Unsupervised learning can be used for more complex tasks than supervised
learning because it does not require labeled input data.
 Unsupervised learning is often preferable because unlabeled data is easier to
obtain than labeled data.

 Disadvantages of Unsupervised Learning


 Unsupervised learning is intrinsically more difficult than supervised learning
because there is no corresponding output to learn from.
 The result of an unsupervised learning algorithm might be less accurate, as the
input data is not labeled and the algorithm does not know the exact output in
advance.
Hierarchical Clustering
 Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group unlabeled data points into clusters;
it is also known as hierarchical cluster analysis (HCA).

 In this algorithm, we develop the hierarchy of clusters in the form of a
tree, and this tree-shaped structure is known as the dendrogram.
Hierarchical Clustering

 Agglomerative: Agglomerative clustering is a bottom-up approach, in which
the algorithm starts by treating each data point as its own cluster and keeps
merging the closest clusters until only one cluster is left.

 Divisive: The divisive algorithm is the reverse of the agglomerative
algorithm, as it is a top-down approach.
Agglomerative Hierarchical clustering
 The agglomerative hierarchical clustering algorithm is a popular example of
HCA.

 To group the data points into clusters, it follows a bottom-up approach.

 This means the algorithm treats each data point as a single cluster at the
beginning, and then starts combining the closest pairs of clusters.

 It does this until all the clusters are merged into a single cluster that contains
all the data points (a short SciPy sketch follows).
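A minimal sketch of agglomerative clustering using SciPy's hierarchical clustering routines. The data X is a placeholder invented for illustration, and "average" linkage is just one of the options discussed later:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Placeholder data: 10 two-dimensional points in two loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(3, 0.5, (5, 2))])

# Build the cluster hierarchy bottom-up; Z records every merge step
Z = linkage(X, method="average")   # also: "single", "complete", "ward"

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree-shaped structure described above
```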
Dendrogram
Dendrogram
Agglomerative Clustering Algorithm
Starting Situation
Intermediate Situation
Intermediate Situation
After Merging
How to Define Inter-Cluster Distance
MIN or Single Link
MAX or Complete Link
Group Average or Average Link
Cluster Distance Measures
Example

dist((x, y), (a, b)) = √[(x - a)² + (y - b)²]
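Building on the point distance above, a small pure-Python sketch of the three inter-cluster distance measures (MIN / single link, MAX / complete link, group average); the clusters a and b are toy 2-D point lists invented for illustration:

```python
from math import dist  # math.dist computes the Euclidean distance above

def single_link(c1, c2):
    """MIN: distance between the two closest points of the clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """MAX: distance between the two farthest points of the clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Group average: mean pairwise distance between the clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
```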


Hierarchical Clustering
 Hierarchical clustering can sometimes show patterns that are meaningless
or spurious.
 For example, in this clustering, the tight grouping of Australia, Anguilla,
St. Helena, etc. is meaningful, since all these countries are former UK
colonies.
 However, the tight grouping of Niger and India is completely spurious;
there is no connection between the two.

[Dendrogram of countries: Australia, Anguilla, St. Helena & Dependencies, South Georgia & South Sandwich Islands, U.K., Serbia & Montenegro (Yugoslavia), France, Niger, India, Ireland, Brazil]
Partitioning methods
 K-Means
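The K-means algorithm itself appears on a later slide only as a figure; below is a minimal NumPy sketch of the standard loop (assign each point to its nearest centroid, then recompute centroids). The data X and k = 2 are placeholders for illustration:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (for simplicity, empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Placeholder data: two blobs in 2-D
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels, centroids = kmeans(X, k=2)
print(centroids)
```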
Factors Affecting K-Means Results
 Choosing appropriate number of clusters

 Elbow method
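A common way to apply the elbow method is to compute the within-cluster sum of squares (inertia) for a range of k and look for the point where the decrease flattens out. A short scikit-learn sketch, assuming X is the placeholder data from the sketch above:

```python
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia always decreases as k grows; the "elbow" where the decrease
# levels off suggests an appropriate number of clusters.
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 1))
```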
Factors Affecting K-Means Results

 Choosing the initial centroids


K-means
 Disadvantages
 Dependent on initialization
 Sensitive to outliers
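Because the result depends on the random initialization, one common mitigation (an option, not the slide's prescription) is to run K-means several times and keep the best solution, and/or to use k-means++ seeding. In scikit-learn this is controlled by n_init and init (X as before):

```python
from sklearn.cluster import KMeans

# Several restarts with k-means++ seeding; the fit with the lowest
# inertia (within-cluster sum of squares) is kept automatically.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_, km.cluster_centers_)
```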
K-means Clustering

[Figure: left panel — input data forming two concentric circles; right panel — the 2-cluster K-means result on the same data]
K-means Clustering

[Figure: an example where K-means works well vs. an example where K-means fails]
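The concentric-circles figure above is a classic failure case: K-means assumes roughly spherical, compact clusters, so it cannot separate the rings. A short sketch reproducing the setup with scikit-learn (make_circles generates the two-ring data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric circles: the "true" clusters are the inner and outer ring
X_circles, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05,
                                 random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_circles)

# K-means splits the plane with a straight boundary instead of by ring,
# so agreement with the true ring labels stays near chance level.
print(sum(labels == y_true) / len(y_true))
```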


K-Means
BAYESIAN LEARNING
BAYES THEOREM
 Bayes theorem provides a way to calculate the probability of a hypothesis
based on its prior probability, the probabilities of observing various data given
the hypothesis, and the observed data itself.
Notations
 P(h): prior probability of h; reflects any background knowledge about the
chance that h is correct
 P(D): prior probability of D; the probability that D will be observed
 P(D|h): probability of observing D given a world in which h holds
 P(h|D): posterior probability of h; reflects confidence that h holds after D has
been observed
 Bayes theorem is the cornerstone of Bayesian learning methods because it
provides a way to calculate the posterior probability P(h|D), from the prior
probability P(h), together with P(D) and P(D|h).
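For reference, the theorem itself in the notation above (the standard statement, not reproduced on the slide):

P(h|D) = P(D|h) P(h) / P(D)

The maximum a posteriori (MAP) hypothesis referred to later is the h that maximizes P(h|D); since P(D) is the same for every hypothesis, h_MAP = argmax over h in H of P(D|h) P(h).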

 P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
 P(h|D) decreases as P(D) increases, because the more probable it is that D
will be observed independent of h, the less evidence D provides in support
of h.
Example
Consider a medical diagnosis problem in which there are two alternative
hypotheses
 The patient has a particular form of cancer (denoted by cancer)
 The patient does not (denoted by ¬ cancer)
The available data is from a particular laboratory with two possible outcomes: +
(positive) and - (negative)
 A patient takes a lab test and the result comes back positive. The test returns a
correct positive result in only 98% of the cases in which the disease is actually
present, and a correct negative result in only 97% of the cases in which the disease is
not present. Furthermore, 0.008 of the entire population have this cancer.
 Suppose a new patient is observed for whom the lab test returns a
positive (+) result.
 Should we diagnose the patient as having cancer or not?

 Applying Bayes theorem: P(+|cancer)P(cancer) = 0.98 × 0.008 ≈ 0.0078, and
P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 ≈ 0.0298, so the MAP hypothesis is ¬cancer.

 The exact posterior probabilities can be determined by normalizing these
quantities so that they sum to 1:

P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
P(¬cancer | +) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79
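A quick Python check of the arithmetic above (the probabilities come straight from the example):

```python
p_cancer = 0.008                  # prior: 0.8% of the population has the cancer
p_pos_given_cancer = 0.98         # test sensitivity
p_pos_given_no_cancer = 1 - 0.97  # false positive rate

num_cancer = p_pos_given_cancer * p_cancer               # about 0.0078
num_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)   # about 0.0298

evidence = num_cancer + num_no_cancer   # P(+)
print(num_cancer / evidence)     # P(cancer | +)  ≈ 0.21
print(num_no_cancer / evidence)  # P(¬cancer | +) ≈ 0.79
```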
BAYES THEOREM AND CONCEPT LEARNING
MAP Hypotheses and Consistent Learners
 A learning algorithm is a consistent learner if it outputs a hypothesis that
commits zero errors over the training examples.
 Every consistent learner outputs a MAP hypothesis, if we assume a
uniform prior probability distribution over H (P(hi) = P(hj) for all i, j), and
deterministic, noise free training data (P(D|h) =1 if D and h are consistent,
and 0 otherwise).
Example:
 Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis
under the probability distributions P(h) and P(D|h) defined above.
 Are there other probability distributions for P(h) and P(D|h) under which
FIND-S outputs MAP hypotheses? Yes.
 Because FIND-S outputs a maximally specific hypothesis from the version
space, its output hypothesis will be a MAP hypothesis relative to any prior
probability distribution that favours more specific hypotheses.
Naive Bayes Classifier
 Along with decision trees and neural networks, it is one of the most practical
learning methods.
 When to use
–Moderate or large training set available
–Attributes that describe instances are conditionally independent given
classification
 Successful applications:
–Diagnosis
–Classifying text documents
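A minimal sketch of the naive Bayes decision rule — pick the class v that maximizes P(v) · ∏ P(a_i | v), estimating all probabilities by counting — on a tiny categorical dataset invented for illustration (add-one smoothing avoids zero counts):

```python
from collections import Counter, defaultdict

# Tiny invented training set: (attribute values, class label)
data = [
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
    (("overcast", "hot"), "yes"), (("overcast", "cool"), "yes"),
]

n_attrs = 2
classes = Counter(label for _, label in data)
n_values = [len({attrs[i] for attrs, _ in data}) for i in range(n_attrs)]

# counts[(attribute index, value, class)] = number of training occurrences
counts = defaultdict(int)
for attrs, label in data:
    for i, value in enumerate(attrs):
        counts[(i, value, label)] += 1

def predict(attrs):
    """Naive Bayes rule: argmax over v of P(v) * prod_i P(a_i | v)."""
    best, best_score = None, -1.0
    for label, n in classes.items():
        score = n / len(data)  # prior P(v) from class frequencies
        for i, value in enumerate(attrs):
            # conditional P(a_i | v) with add-one (Laplace) smoothing
            score *= (counts[(i, value, label)] + 1) / (n + n_values[i])
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("rain", "hot")))  # classify an unseen instance
```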
Bayesian Belief Network
EM for Estimating k Means
