
Fundamentals of Machine Learning

Prepared by

Prince Thomas M.E., PhD


Associate Professor
Chapter 4: Unsupervised Learning and Graphical Models
• Clustering Methods
• K-Means Clustering: Centroid Calculation and Convergence
• Hierarchical Clustering: Agglomerative and Divisive Methods
• Frequent Pattern Mining
• Apriori Algorithm and Association Rules
• Applications in Market Basket Analysis
• Graphical Models
• Bayesian Networks
• Markov Networks
• Hidden Markov Models (HMMs)
• States, Transitions, and Observations
• Forward Algorithm and Viterbi Algorithm
• Applications in Speech Recognition and NLP
Unsupervised learning
• It involves analyzing and clustering data without labeled outputs.

• It tries to find hidden patterns, structures, or features within the data.

• It is primarily used for exploratory data analysis and tasks where the goal is to uncover insights rather than predict labels.

Key Characteristics
• No labeled data is provided; only input data is available.

• Focuses on understanding data distributions, relationships, and patterns.

• Examples include clustering, dimensionality reduction, and anomaly detection.

Major Techniques
• Clustering

• Dimensionality Reduction

• Association Rule Mining


Types of Unsupervised Learning Algorithms
• Unsupervised learning algorithms can be further categorized
into two types of problems: clustering and association.
Clustering
• It is an unsupervised learning technique.
• It is used to group data points into clusters based on their similarity.
• Data points in the same cluster are more similar to each other than to those in other clusters.
• Clustering is widely used for exploratory data analysis and pattern recognition.
Clustering Methods
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Mean-Shift Clustering
• Gaussian Mixture Models (GMM)
Applications of Clustering
• Customer segmentation
• Social network analysis
• Market segmentation
K-Means Clustering
• K-Means Clustering is a popular unsupervised machine learning algorithm.
• It is used to partition a dataset into distinct clusters.
• Each cluster is represented by a centroid, and the algorithm iteratively assigns data
points to clusters and updates these centroids until convergence.
Steps of the Algorithm
• Select K random points as the initial centroids.
• Assign each data point to its closest centroid, forming the K clusters.
• Recompute the centroid (mean) of each cluster.
• Reassign each data point to the closest of the new centroids.
• If any assignment changed, repeat the recompute-and-reassign steps; otherwise, the algorithm has converged and finishes.
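A minimal from-scratch sketch of these steps (Python with NumPy, using Euclidean distance; the toy data and K value are made up for illustration):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example with two obvious clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```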
Centroid in K-Means Clustering
• The centroid is the central point of a cluster in K-Means Clustering.

• It is a representative "average" position for all the data points in a cluster.

• The algorithm uses centroids to determine which data points belong to which cluster and
updates their positions iteratively for better clustering.

Definition
• A centroid is the geometric center of a cluster, calculated as the mean of all data points
within the cluster.

Convergence in K-Means occurs when the algorithm stops updating centroids or cluster
assignments. This happens when the clusters stabilize, meaning the centroids no longer change
significantly

• Manhattan or Euclidean distance is used to measure the distance between two data points.
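For reference, a tiny sketch of the two distance measures on made-up points:

```python
import numpy as np

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.linalg.norm(p - q)   # sqrt((1-4)^2 + (2-6)^2) = 5.0
manhattan = np.abs(p - q).sum()     # |1-4| + |2-6| = 7.0
print(euclidean, manhattan)
```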
Example:
Final clusters: {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}
Hierarchical Clustering
• Hierarchical clustering is a technique that builds a hierarchy of clusters for a dataset. Unlike flat
clustering methods like K-Means, hierarchical clustering organizes data into a tree-like structure
called a dendrogram, which illustrates how clusters are merged (or split) at different levels.

Types of Hierarchical Clustering

1. Agglomerative Clustering
• Bottom-Up Approach: Each data point starts as its own cluster, and clusters are iteratively merged
based on similarity until a single cluster (or a specified number of clusters) remains.

Steps:

1. Start with each data point as an individual cluster.

2. Compute the distance (similarity) between all pairs of clusters using a distance metric (e.g., Euclidean distance).

3. Merge the two closest clusters into a single cluster.

4. Repeat steps 2 and 3 until all points are in a single cluster or the desired number of clusters is reached.
Distance Metrics for Clustering

• Single Linkage: Measures the shortest distance between any two points in different clusters.

• Complete Linkage: Considers the farthest distance between points in different clusters; tends to produce compact, spherical clusters.

• Average Linkage: Calculates the mean of all pairwise distances between points in two clusters; balances between single and complete linkage.

• Centroid Linkage: Uses the Euclidean distance between the centroids (average positions) of two clusters; focuses on cluster centers rather than individual points.
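A minimal agglomerative-clustering sketch using SciPy (assuming SciPy is available); the method argument selects one of the linkage criteria above ('single', 'complete', 'average', or 'centroid'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: two loose groups of points
X = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.3]])

# Agglomerative clustering with single linkage (try 'complete', 'average', 'centroid')
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) can be plotted with matplotlib to visualize the merge hierarchy
```

Cutting the dendrogram at different levels yields different numbers of flat clusters.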
2. Divisive Clustering
• Top-Down Approach: Start with all data points in one cluster and recursively split
clusters into smaller clusters until each data point forms its own cluster (or a
desired number of clusters is reached).
Steps:
1. Begin with all data points in a single cluster.
2. Identify the cluster to split using a criterion (e.g., largest variance or dissimilarity).
3. Divide the selected cluster into two sub-clusters based on a distance metric or
other criteria.
4. Repeat step 2 and 3 until all points are in their own cluster or the desired number
of clusters is achieved.
Challenges:
• More computationally intensive than agglomerative clustering, since splitting a cluster requires evaluating many possible ways to partition it at each step.
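Library support for divisive clustering is limited; a common approximation is bisecting K-Means, which repeatedly splits one cluster into two. A rough sketch under that assumption (scikit-learn's KMeans used for each split, with "largest cluster" as a simple splitting criterion):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    # Start with all points in a single cluster (top-down approach)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Pick the largest cluster to split (a simple splitting criterion)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # Split the selected cluster into two sub-clusters with 2-means
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.default_rng(0).normal(size=(30, 2))
print([len(c) for c in divisive_clustering(X, 3)])
```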
Frequent Pattern Mining
• It is a data mining technique to discover patterns, correlations, or
associations in datasets.
• It identifies sets of items, sequences, or events that occur frequently
together in transactional databases.
Apriori Algorithm
• The Apriori Algorithm is an iterative method used to identify frequent
itemsets in a dataset and generate association rules.
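A minimal market-basket sketch using the mlxtend library (assumed to be installed; the transactions are made up for illustration):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactional data (each list is one shopping basket)
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer', 'eggs'],
    ['milk', 'diapers', 'beer', 'cola'],
    ['bread', 'milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers', 'cola'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Find itemsets that appear in at least 60% of the transactions
frequent = apriori(df, min_support=0.6, use_colnames=True)

# Generate association rules with at least 70% confidence
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])
```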
Graphical Models
• Graphical models are probabilistic models that represent dependencies among
variables using a graph structure. They are used for reasoning about
uncertainty and for making predictions based on observed data.
Key Features:
• Nodes: Represent random variables.
• Edges: Represent probabilistic dependencies.
• Types: Directed and undirected graphs.
Types of Graphical Models:
• Bayesian Networks (Directed Acyclic Graphs): Capture conditional dependencies via directed edges.
• Markov Networks (Undirected Graphs): Capture undirected (symmetric) dependencies between variables.
What is a Bayesian Network?

• A Bayesian network is also known as a Bayes network, belief network, or probabilistic directed acyclic graphical model.

• It is a graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).

• Each node in the DAG corresponds to a random variable, and the edges
represent direct causal or influential relationships between the variables.
Why Do We Need Bayesian Networks?
Bayesian networks are valuable tools for various reasons:
• Uncertainty Modeling: They allow us to model uncertain information and make probabilistic inferences.
• Causal Reasoning: They can represent causal relationships between variables, enabling us to understand how changes in one variable affect others.
• Decision Making: They can assist in making decisions under uncertainty by considering multiple factors and their probabilities.
• Learning from Data: They can be learned from data, allowing us to discover hidden relationships and patterns.
• Inference and Prediction: They enable us to make predictions about unobserved variables based on observed evidence.
Key Components of a Bayesian Network:
• Nodes: Represent random variables.
• Edges: Represent direct dependencies between variables.
• Conditional Probability Distributions (CPDs): Associated with each node, these
specify the probability distribution of a node's value given the values of its parent
nodes.
Formulas and Concepts:
• Joint Probability Distribution: The joint probability distribution of all variables in
a Bayesian network can be factorized as the product of the conditional probabilities
of each variable given its parents:
P(X1, X2, ..., Xn) = Π P(Xi | Parents(Xi))
• Bayesian Inference: Bayes' theorem is used to update beliefs about a variable given new evidence:
P(H | E) = P(E | H) · P(H) / P(E)
Example:
You have a new burglar alarm installed at home. It is fairly reliable at
detecting burglary, but also sometimes responds to minor earthquakes. You
have two neighbors, John and Merry, who promised to call you at work when
they hear the alarm. John always calls when he hears the alarm, but
sometimes confuses telephone ringing with the alarm and calls too. Merry
likes loud music and sometimes misses the alarm. Given the evidence of who
has or has not called, estimate the probability of a burglary.
What is the probability that the alarm has sounded but neither a
burglary nor an earthquake has occurred, and both John and Merry
call?

What is the probability that John calls?
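A small sketch of how the first query can be answered directly from the factorization P(X1, ..., Xn) = Π P(Xi | Parents(Xi)). The conditional probability values below are the ones commonly used with this textbook example and are assumptions here, since the slide's probability tables are not reproduced:

```python
# Assumed CPT values (standard textbook numbers for this example)
P_B = 0.001                       # P(Burglary)
P_E = 0.002                       # P(Earthquake)
P_A_given = {                     # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J_given_A = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M_given_A = {True: 0.70, False: 0.01}   # P(MerryCalls | Alarm)

# P(J, M, A, not B, not E) = P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)
p = (P_J_given_A[True] * P_M_given_A[True]
     * P_A_given[(False, False)] * (1 - P_B) * (1 - P_E))
print(p)   # ~0.00063 with these assumed numbers
```

The second query, P(John calls), would be obtained by summing the joint probability over all combinations of the remaining variables.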


Markov Network
• It is also known as a Markov Random Field (MRF).
• It is a type of undirected graphical model that represents probabilistic
relationships between random variables.
• Markov Networks use undirected edges to represent pairwise relationships
between variables.
Markov Network Parameters
• Nodes: Represent random variables (e.g., "sunny," "rainy").
• Edges: Represent pairwise relationships between variables (e.g., the influence
of today's weather on tomorrow's).
• Potential Functions: Measure the compatibility of joint assignments of values
to variables within a clique (a fully connected subgraph). These functions
quantify the strength of the relationship between variables.
Markov Network Parameters

• State Space:
The state space of a Markov Network is the set of all possible configurations of
the variables. In the given example, the state space consists of two states:
"sunny" and "rainy."
• Initial Probability:
The initial probability distribution specifies the probability of each state at the
initial time step. In the example: P(sunny) = 0.5 P(rainy) = 0.5
• Transition Matrix:
The transition matrix defines the probabilities of transitioning from one state to another at the next time step, e.g., P(sunny → rainy) and P(rainy → sunny).
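A small sketch of the sunny/rainy example with an assumed transition matrix (placeholder values, since the slide's matrix is not reproduced):

```python
import numpy as np

states = ['sunny', 'rainy']
initial = np.array([0.5, 0.5])     # P(sunny), P(rainy) at time 0

# Assumed transition matrix: row = today's state, column = tomorrow's state
T = np.array([[0.8, 0.2],    # P(sunny->sunny), P(sunny->rainy)
              [0.4, 0.6]])   # P(rainy->sunny), P(rainy->rainy)

# Distribution over tomorrow's weather
tomorrow = initial @ T
print(dict(zip(states, tomorrow)))   # {'sunny': 0.6, 'rainy': 0.4}
```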
What are HMMs?
• Hidden Markov Models (HMMs) are statistical models used to represent systems
that are assumed to be Markov processes with unobserved (hidden) states.
• HMMs are widely utilized in various fields, including speech recognition,
bioinformatics, and finance.
Why are HMMs Needed?
• Modeling Sequential Data: They can effectively capture temporal
dependencies in sequences.
• Dealing with Uncertainty: HMMs provide a framework for modeling systems
with hidden states and observations that can be noisy or incomplete.
• Probabilistic Inference: They allow for the computation of probabilities of
sequences of observations, making them suitable for tasks such as classification
and prediction.
Parameters of HMM
• States (S): A set of hidden states in the model. Each state
represents a possible condition of the system.
• Observations (O): A set of possible observations that can be
generated by the states.
• Transition Probabilities (A): A matrix where A[i][j] represents
the probability of transitioning from state i to state j.
• Emission Probabilities (B): A matrix where B[i][k] represents
the probability of emitting observation k from state i.
• Initial State Distribution (π): A vector where π[i] indicates the probability that the system starts in state i.
Properties of HMM
• Markov Property: The future state depends only on the current state and not on
the previous states.
• Stationary Transition Probabilities: The transition probabilities remain the
same over time.
• Memoryless: The model does not retain memory of past states beyond the
current one.

States, Transitions, and Observations


• States: These are not directly observable and represent the underlying processes.
• Transitions: These define how likely it is to move from one state to another.
• Observations: These are the outputs we can observe and are dependent on the
states.
Forward Algorithm
• The Forward Algorithm is used to calculate the probability of observing
a sequence of events (observations) given a Hidden Markov Model.

Forward Algorithm Steps


• Initialization: Compute the initial probabilities of the first observation.
• Recursion: For each subsequent observation, update the probabilities
based on the previous state probabilities and the transition/emission
probabilities.
• Termination: Sum the probabilities of ending in any state after the last
observation.
States: Start, Sunny, Rainy, End
Transition Probabilities:
• Start to Sunny: 0.6
• Start to Rainy: 0.4
• Sunny to Sunny: 0.4
• Sunny to Rainy: 0.6
• Rainy to Sunny: 0.2
• Rainy to Rainy: 0.3
• Rainy to End: 0.1
• Sunny to End: 0.2
Emission Probabilities (for illustration), assume:
• P(Play | Sunny) = 0.7
• P(Play | Rainy) = 0.4
• P(Shop | Sunny) = 0.2
• P(Shop | Rainy) = 0.5
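A minimal forward-algorithm sketch. Because the probabilities listed above are only a partial illustration, the code below uses a simplified two-state Sunny/Rainy model with assumed, normalized values; the structure (initialization, recursion, termination) follows the steps described earlier:

```python
import numpy as np

states = ['Sunny', 'Rainy']
obs_names = ['Play', 'Shop']

pi = np.array([0.6, 0.4])            # assumed initial distribution
A = np.array([[0.6, 0.4],            # assumed P(Sunny->Sunny), P(Sunny->Rainy)
              [0.3, 0.7]])           # assumed P(Rainy->Sunny), P(Rainy->Rainy)
B = np.array([[0.7, 0.3],            # assumed P(Play|Sunny), P(Shop|Sunny)
              [0.4, 0.6]])           # assumed P(Play|Rainy), P(Shop|Rainy)

def forward(observations):
    # Initialization: probability of the first observation from each state
    alpha = pi * B[:, observations[0]]
    # Recursion: propagate through transition and emission probabilities
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]
    # Termination: sum over all possible final states
    return alpha.sum()

obs = [obs_names.index('Play'), obs_names.index('Shop')]
print(forward(obs))   # P(Play, Shop) under this assumed model
```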
Viterbi Algorithm
• The Viterbi Algorithm finds the most likely sequence of hidden states given
the observed sequence. It uses dynamic programming similar to the
Forward Algorithm but tracks the best paths taken to reach each state.
Viterbi Algorithm Steps
• Initialization: Similar to the Forward Algorithm but also record the state
paths.
• Recursion: Update the probabilities and the paths for each observation.
• Termination: Trace back the most likely path from the last state to the first.
• The path from Sunny → Rainy → Shop → End yields a maximum probability of 0.021.
• The path from Rainy → Rainy → Shop → End yields a lower probability of 0.006.
• The better path, based on the maximum probability of reaching the End state after observing the sequence, is therefore Sunny → Rainy → Shop → End.
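A matching Viterbi sketch under the same assumed two-state model used in the forward-algorithm example; it records the best predecessor of each state and backtracks at the end:

```python
import numpy as np

states = ['Sunny', 'Rainy']
pi = np.array([0.6, 0.4])                       # assumed initial distribution
A = np.array([[0.6, 0.4], [0.3, 0.7]])          # assumed transition matrix
B = np.array([[0.7, 0.3], [0.4, 0.6]])          # assumed emission matrix (Play, Shop)

def viterbi(observations):
    # Initialization: best probability of each state for the first observation
    delta = pi * B[:, observations[0]]
    backpointers = []
    # Recursion: keep, for every state, the best path probability and its predecessor
    for o in observations[1:]:
        scores = delta[:, None] * A              # scores[i, j]: best path ending i -> j
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    # Termination: trace back the most likely state sequence
    path = [int(delta.argmax())]
    for bp in reversed(backpointers):
        path.insert(0, int(bp[path[0]]))
    return [states[s] for s in path], delta.max()

print(viterbi([0, 1]))   # most likely weather sequence for observations Play, Shop
```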
