Unit 3

Image segmentation is a computer vision technique that divides an image into meaningful regions for easier analysis, utilizing methods like thresholding, edge detection, and clustering. Clustering techniques, including K-means and hierarchical clustering, group similar pixels based on their attributes, while the Mean Shift algorithm identifies dense regions without pre-specifying the number of clusters. The Watershed algorithm models image segmentation as a landscape where water collects in valleys, creating boundaries between segments.


Computer vision (unit-3)

Image segmentation
Image segmentation is a computer vision technique that involves dividing an
image into meaningful regions or segments, typically to simplify or change the
representation of an image into something more meaningful and easier to analyze.

Image segmentation is the process of partitioning an image into multiple regions based on the characteristics of the pixels in the original image.

Each distinct region contains pixels with similar attributes.

Common Techniques:
Thresholding (e.g., Otsu's method)

Edge Detection (e.g., Canny, Sobel)

Region-Based Segmentation (e.g., Region Growing, Watershed)

Clustering (e.g., K-means)

Clustering
Clustering is a technique to group similar entities and label them.

Cluster similar pixels using a clustering algorithm and group each cluster of pixels into a single segment.

The goal is to change the representation of an image into something more meaningful and easier to analyze.

Typically used to locate objects and boundaries (lines, curves, etc.).

It is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
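As a concrete sketch of pixel clustering for segmentation (pure NumPy; a real pipeline would typically use OpenCV or scikit-learn, and the deterministic farthest-point initialization here is an illustrative choice for reproducibility):

```python
import numpy as np

def segment(image, k=2, iters=10):
    """Cluster pixel colors with Lloyd's k-means; each pixel's label is its segment id."""
    pixels = image.reshape(-1, 3).astype(float)
    # deterministic farthest-point initialization (instead of random seeding)
    centroids = [pixels[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(pixels - c, axis=1) for c in centroids], axis=0)
        centroids.append(pixels[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # assign each pixel to its nearest centroid, then recompute centroids
        labels = np.linalg.norm(pixels[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(image.shape[:2])

# toy "image": left half black, right half white -> two segments
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, 2:] = 255
seg = segment(img, k=2)
```

Every pixel with the same label in `seg` belongs to the same segment, exactly as described above.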

Hierarchical clustering
Divisive clustering is a type of hierarchical clustering where we start with one large cluster that contains all the data points, and then recursively split it into smaller clusters.
It’s the opposite of agglomerative clustering, which starts with each point as its own cluster and merges them.

🔍 Simple Explanation:
Think of it like organizing a messy drawer of mixed items:

At first, all items are in one drawer (one cluster).

You split the items into broad groups—say, electronics and stationery.

Then, you take the electronics and divide them into chargers and
headphones.

And maybe you split stationery into pens and notebooks.

You keep splitting until you reach useful, meaningful groups.

🧠 Example:
Suppose you have this data (2D points):

(1, 2), (2, 1), (1, 1), (8, 8), (9, 9), (8, 9)

Step-by-step divisive clustering:


1. Start: All 6 points are in one cluster.

2. First split: Use a method like K-means (k=2) to split into two groups:

Cluster A: (1,2), (2,1), (1,1)

Cluster B: (8,8), (9,9), (8,9)

3. Stop or split again: Now we decide if each of these smaller clusters can be
split further. Maybe Cluster A is fine, but we decide to split Cluster B more.

4. Next split (if needed): Cluster B → (8,8), (8,9) and (9,9)



This goes on until no cluster needs further division or meets a stopping condition
(like max clusters or distance threshold).
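A minimal NumPy sketch of this top-down splitting. The 2-means split (seeded at the two mutually farthest points) and the size-based stopping rule are illustrative assumptions, not the only choices:

```python
import numpy as np

def two_means(points):
    # split one cluster in two: seed with the two mutually farthest points,
    # then run a few Lloyd iterations (assumes both seeds keep some points)
    d = np.linalg.norm(points[:, None] - points[None], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    c = points[[i, j]].astype(float)
    for _ in range(20):
        labels = np.linalg.norm(points[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def divisive(points, max_size=3):
    # recursively split until every cluster is small enough (toy stopping rule)
    if len(points) <= max_size:
        return [points]
    labels = two_means(points)
    out = []
    for k in (0, 1):
        out.extend(divisive(points[labels == k], max_size))
    return out

data = np.array([(1, 2), (2, 1), (1, 1), (8, 8), (9, 9), (8, 9)], dtype=float)
clusters = divisive(data)
```

On the six points above, the first (and only) split recovers the two groups near (1, 1) and (8, 9).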

Agglomerative clustering is the most common type of hierarchical clustering. It follows a bottom-up approach:
Start with each data point as its own cluster, and repeatedly merge the closest pairs of clusters until all points are in a single cluster (or you reach a stopping criterion like a fixed number of clusters).

🔍 Step-by-Step Explanation:
Let’s say we have 4 data points:

A (1, 1), B (2, 1), C (5, 4), D (6, 5)

Step 1: Start with each point as its own cluster


Clusters: [A], [B], [C], [D]

Step 2: Find the two closest clusters


A and B are closest → merge them
Clusters: [AB], [C], [D]

Step 3: Repeat
C and D are closest → merge them

Clusters: [AB], [CD]

Step 4: Merge remaining clusters


Clusters: [ABCD]
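The merge sequence above can be reproduced with a small single-linkage implementation (a sketch; in practice scipy.cluster.hierarchy or scikit-learn would be used):

```python
import numpy as np

def agglomerative(points, n_clusters=1):
    # bottom-up: start with singletons; repeatedly merge the two closest
    # clusters (single linkage: distance between their closest members)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

pts = np.array([(1, 1), (2, 1), (5, 4), (6, 5)], dtype=float)  # A, B, C, D
final, merges = agglomerative(pts)
```

Indices 0–3 stand for A–D; the recorded merges match Steps 2–4: A+B, then C+D, then the two pairs.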



What is K-Means Clustering?
K-Means is an unsupervised learning algorithm used to partition a dataset into K
clusters. Each point is assigned to the cluster with the nearest mean (centroid),
and the centroids are updated until they stabilize.

🔢 Data Points:

Let’s take these 2D points:

P1 = (1, 1)
P2 = (2, 1)
P3 = (4, 3)
P4 = (5, 4)

We want to cluster them into K = 2 clusters.

✅ Step 1: Initialization
Randomly select two points as initial centroids:

Centroid A = (1, 1) ← P1
Centroid B = (5, 4) ← P4

🔁 Iteration 1: Assign points to nearest centroid


We'll use Euclidean distance:

| Point | Distance to A (1,1) | Distance to B (5,4) | Assigned to |
|---|---|---|---|
| P1 (1,1) | 0 | √[(5−1)² + (4−1)²] = √25 = 5.00 | A |
| P2 (2,1) | √[(2−1)²] = 1 | √[(5−2)² + (4−1)²] = √18 ≈ 4.24 | A |
| P3 (4,3) | √[(4−1)² + (3−1)²] = √13 ≈ 3.61 | √[(5−4)² + (4−3)²] = √2 ≈ 1.41 | B |
| P4 (5,4) | 5 | 0 | B |

Clusters after iteration 1:

Cluster A: P1, P2 → (1,1), (2,1)

Cluster B: P3, P4 → (4,3), (5,4)

🔄 Recalculate centroids
New centroid A = mean of (1,1) and (2,1) = (1.5, 1)
New centroid B = mean of (4,3) and (5,4) = (4.5, 3.5)

🔁 Iteration 2: Reassign points


| Point | Distance to A (1.5,1) | Distance to B (4.5,3.5) | Assigned to |
|---|---|---|---|
| P1 (1,1) | √[(1−1.5)²] = 0.5 | √[(4.5−1)² + (3.5−1)²] = √18.5 ≈ 4.30 | A |
| P2 (2,1) | √[(2−1.5)²] = 0.5 | √[(4.5−2)² + (3.5−1)²] = √12.5 ≈ 3.54 | A |
| P3 (4,3) | √[(4−1.5)² + (3−1)²] = √10.25 ≈ 3.20 | √[(4.5−4)² + (3.5−3)²] = √0.5 ≈ 0.71 | B |
| P4 (5,4) | √[(5−1.5)² + (4−1)²] = √21.25 ≈ 4.61 | √[(5−4.5)² + (4−3.5)²] = √0.5 ≈ 0.71 | B |

Clusters after iteration 2:

Cluster A: P1, P2

Cluster B: P3, P4

Centroids didn’t change → the algorithm has converged.

✅ Summary:
Initial centroids: P1 (1,1) and P4 (5,4)

After Iteration 1:

Cluster A: (1,1), (2,1) → Centroid: (1.5, 1)

Cluster B: (4,3), (5,4) → Centroid: (4.5, 3.5)

After Iteration 2: Clusters remain the same

Some common stopping conditions for k-means clustering are:


Centroids don’t change location anymore

Data points don’t change clusters anymore

Terminate training after a set number of iterations
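The worked example above can be reproduced with a short Lloyd-iteration sketch (NumPy; the stopping condition used here is "centroids don't move"):

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        # (assumes no cluster ever becomes empty, true for this example)
        new = np.array([points[labels == k].mean(axis=0)
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):  # stopping condition: centroids stable
            break
        centroids = new
    return centroids, labels

centroids, labels = kmeans([(1, 1), (2, 1), (4, 3), (5, 4)],  # P1..P4
                           [(1, 1), (5, 4)])                  # initial A and B
```

The result matches the table: centroids (1.5, 1) and (4.5, 3.5), with P1, P2 in cluster A and P3, P4 in cluster B.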

K-Means Clustering (Elbow Method)


✅ Why is the Elbow Method used?
The Elbow Method is a technique to find the optimal number of clusters (K) in K-
Means clustering.

📉 How it works:
Run K-Means with different values of K (e.g., from 1 to 10).



For each K, compute the Within-Cluster Sum of Squares (WCSS) — a
measure of how tight the clusters are.

Plot K vs WCSS.

Look for the "elbow point" — the point where the WCSS stops decreasing
significantly.

This is considered the optimal K.

Let's compute the WCSS (Within-Cluster Sum of Squares) step by step for K = 2
using the earlier example dataset:

📍Data Points:
Point Coordinates

P1 (1, 2)

P2 (1, 4)

P3 (1, 0)

P4 (10, 2)

P5 (10, 4)

P6 (10, 0)

✅ Step 1: Assign Clusters (K = 2)


We'll assign based on clear proximity:

Cluster 1: P1, P2, P3 → (1,2), (1,4), (1,0)

Cluster 2: P4, P5, P6 → (10,2), (10,4), (10,0)
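With those assignments, the WCSS for K = 2 can be computed directly — each cluster's centroid is the mean of its points, and WCSS sums the squared distances of the points to their own centroid:

```python
import numpy as np

points = np.array([(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])  # Cluster 1 -> P1..P3, Cluster 2 -> P4..P6

wcss = 0.0
for k in range(2):
    cluster = points[labels == k]
    centroid = cluster.mean(axis=0)            # (1, 2) and (10, 2)
    wcss += ((cluster - centroid) ** 2).sum()  # squared distances to centroid

print(wcss)  # → 16.0  (8.0 per cluster)
```

Repeating this for K = 1, 2, 3, … and plotting K vs. WCSS gives the elbow curve described above.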


Mean Shift Algorithm



Mean Shift is a clustering algorithm that moves each data point toward the region
where data is most dense (like climbing uphill), and groups together points that
end up at the same peak (mode).

Mean Shift does not require specifying the number of clusters in advance; the number of clusters is determined by the algorithm from the data.

Dense regions already exist in the data — areas where many points are close
together
A mode is the peak of a dense region — the point with highest data density.

Every dense region usually contains one local mode.

Think of a hill:

The whole hill = dense region

The top of the hill = mode (local maximum)

✅ What does Mean Shift do?


Mean Shift moves each data point toward the nearest local
mode (peak of density).

Each hill represents a cluster

Peak (mode) of the hill represents the center of cluster



Based on the feature value, each pixel climbs up the hill within its
neighbourhood

Analogy:

The 5-Step “Hiker’s Workflow”


1. Pick a starting spot

Choose any data point (or grid point) as your current location x.

In our story: you stand somewhere on the foggy ground.

2. Look around with your “flashlight” (kernel window)

Imagine a circular lamp of radius h around you.

Collect every data point that lies inside that circle.

3. Compute the “crowd’s average” (mean of neighbors)

Take all the points you saw and compute their average position: mean = (1/N) Σᵢ xᵢ, where the sum runs over the N neighbors inside the circle.

In the story: you ask everyone around, “Where are you standing on
average?”

4. Step toward that average

Move your position x to the computed mean.

In the story: you take a step uphill in the direction most of the crowd is
standing.

5. Repeat until you stop moving

Keep shining your flashlight, recalculating the mean, and stepping, until
your step-size is almost zero (you’ve reached a top).



That final position is one mode (hilltop).
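The hiker's workflow translates almost line-for-line into code. This sketch uses a flat kernel (every neighbor inside radius h gets equal weight); a Gaussian kernel would instead weight nearby points more:

```python
import numpy as np

def mean_shift_point(x, data, h, tol=1e-4, max_iter=100):
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # "flashlight": all points within radius h of the current position
        neighbors = data[np.linalg.norm(data - x, axis=1) <= h]
        new_x = neighbors.mean(axis=0)       # the crowd's average
        if np.linalg.norm(new_x - x) < tol:  # shift ~ 0 -> mode reached
            return new_x
        x = new_x
    return x

data = np.array([(1, 2), (2, 1), (1, 1), (8, 8), (9, 9), (8, 9)], dtype=float)
mode = mean_shift_point((1, 2), data, h=3.0)
```

Running this from every data point and grouping points that converge to the same mode yields the clusters; here, points near (1, 1) and points near (8, 9) climb to two different hilltops.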

Technical explanation

🎯 Goal of Mean Shift


We want to move x toward higher density — i.e., "shift" x in the direction of the local mode.
In Mean Shift, weight functions determine how much influence each
neighboring point has when we compute the new position (mean) of a data point
during each iteration.

📌 Mean Shift Vector


Mean Shift vector m(x), which tells you the direction to move:

In the Mean Shift algorithm, the value m(x) becomes the new position (mean) for
the data point during each iteration.
The bandwidth (also known as the window radius) directly controls the neighborhood size — the number of points used when computing the mean — in the Mean Shift algorithm.



Gradual and Smooth Movement: The Gaussian kernel ensures that the data
point moves smoothly towards the mode, with a gradual shift as it is
influenced mostly by nearby points.

Faster Convergence: Points move faster towards the mode because the
Gaussian kernel places less weight on distant points, allowing the algorithm to
converge more efficiently.



✅ How Do We Know a Mode Has Arrived?
We say that a mode has been reached when the Mean Shift vector becomes
very small, or in simpler terms:

When the point stops moving significantly during updates.


📌 Effect of Window Size (Bandwidth) in Mean Shift


1. Large Bandwidth Leads to Global Behavior:



When the window size (bandwidth h) is large, the local neighborhood includes many points, possibly covering the entire dataset. This causes the local mean (computed in m(x)) to get very close to the global mean of the data. As a result, all points might shift toward the same central location, losing the ability to detect distinct clusters.

2. Small Bandwidth Captures Local Detail (but with Noise):


If the bandwidth is too small, the algorithm considers only a tiny
neighborhood around each point. This can lead to the formation of many
small clusters, including noisy or spurious ones, because each point is
influenced only by a few close neighbors.

Pros:

Finds variable number of modes

Robust to outliers

Does not assume any prior shape (spherical, elliptical, etc.) for data clusters

Just a single parameter (the window size / bandwidth h) is required

Cons:

clustering depends on window size

Computationally expensive

Doesn’t scale well with dimension of feature space

Number of clusters cannot be controlled directly

Watershed algorithm
Imagine your image is a landscape made of hills (segment peaks) and valleys.
Now, picture rain falling on this landscape. Water naturally starts collecting in the
lowest points — the valleys.



As more rain falls, the water level rises. The water fills up each valley slowly.
When two growing pools of water from different valleys get close, a wall (like a
dam) is built to keep them from mixing.

These walls — built to keep waters from different valleys apart — act as boundaries separating different segments in the image.



Water cannot flow from one basin to another.

After some time, water in the second basin begins to merge into the third basin.

A dam is constructed to keep them apart.

The dam boundaries are the watershed lines.

Setup: Multiple Coins in an Image

Step 1: Preprocessing
Convert the image to binary using thresholding.

Coins = white (foreground), Background = black

Step 2: Distance Transform


Apply a distance transform to the binary image.

Inside each coin, pixels get values based on their distance to the edge.



The center of the coin has the highest value — like the lowest valley in
the topography analogy.

The farther a pixel is from the coin’s edge, the higher its value.

Step 3: Invert the Distance Transform


🟢 After computing the distance transform — where coin centers become peaks
— we then:

➤ Invert the image (multiply the values by -1)

This flips the landscape:

What was high (coin center) becomes low

What was low (coin edge) becomes high

Now the coin center — which was a hilltop — becomes a valley again.

Step 4: Gradient Image


Compute the gradient (edge strength) of the original image.

This serves as the elevation map.

Higher gradient = harder for water to cross (like hills).

Lower gradient = easy for water to spread (flat areas or valleys).

Step 5: Internal Markers


Use the local maxima(minima in the inverted image) of the distance map as
internal markers (starting points).

Each marker gets a unique label (e.g., 1, 2, 3 for 3 coins).

These will act as water sources during flooding.



These centers become internal markers — starting points for flooding.

You give them labels: e.g., Coin A = 1, Coin B = 2, Coin C = 3

Step 6: Start the flooding


This is the flooding:

Start with the labeled center pixel (internal marker) of Coin A.

It tries to expand to its neighboring pixels, labeling them with label 1 (the label given to that local minimum).

It only expands to pixels that are lower or slightly higher in intensity (just like water flows into valleys first).

It labels those pixels with “1” — same as Coin A.

Repeat the same for Coin B and Coin C.

Step 7: When two labels are about to touch


Let's say Coin A’s flooding reaches the edge of Coin B’s flooded area.

The algorithm does not allow the labels to mix.

Instead, it marks those boundary pixels as watershed lines.
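The flooding-with-dams idea can be shown on a 1-D "elevation profile" — a toy sketch with a priority queue, not a full 2-D implementation (OpenCV's cv2.watershed does the real image version):

```python
import heapq

def watershed_1d(elev, markers):
    # markers: 0 = unlabeled, positive ints = seed labels; -1 will mark dams
    n = len(elev)
    labels = list(markers)
    heap = [(elev[i], i) for i in range(n) if markers[i]]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)  # flood the lowest elevations first
        if labels[i] == -1:
            continue  # dam pixels do not flood further
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                if labels[j] == 0:
                    labels[j] = labels[i]          # water spreads its label
                    heapq.heappush(heap, (elev[j], j))
                elif labels[j] not in (labels[i], -1):
                    labels[j] = -1                 # two basins meet: build a dam
    return labels

# two valleys at the ends, a ridge in the middle; seeds in each valley
print(watershed_1d([0, 1, 2, 3, 2, 1, 0], [1, 0, 0, 0, 0, 0, 2]))  # → [1, 1, 1, -1, 2, 2, 2]
```

The two floods climb their valleys, meet at the ridge (index 3), and a dam (-1) — the watershed line — separates the two segments.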

Background subtraction
Background subtraction in real-time is a computer vision technique used to
detect and isolate moving objects (foreground) from a static or slowly changing
background in video streams. It’s commonly used in surveillance, gesture
recognition, and traffic monitoring.
An image can be digitally represented as a function of space, I = f(x, y),
where x and y represent the row and column indices of a point and I represents its intensity.



For video, the image varies both in space and time: V = f(x, y, t)
x and y represent the row and column indices
t represents the time of the frame
V represents the intensity of the pixel at location (x, y) in that frame



B(x, y) is updated using the mean of the previous frames; with every iteration it can be updated using the following method:



2. Running Average (Adaptive Temporal Filtering)
Updates background on-the-fly: B(x, y) = α·I_current(x, y) + (1−α)·B(x, y)

Does not need to store past frames.

Adapts to slow changes in the background.

Advantages:
Simple and fast.

Good for static backgrounds.

Limitations:
Doesn't handle dynamic backgrounds (e.g., trees moving in the wind).

Sudden changes in lighting can affect accuracy.
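A minimal NumPy sketch of the running-average update and a simple threshold-based foreground test (the threshold value is an illustrative assumption):

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    # B(x, y) = alpha * I_current(x, y) + (1 - alpha) * B(x, y)
    return alpha * frame + (1 - alpha) * background

def foreground_mask(background, frame, thresh=25.0):
    # pixels that differ strongly from the background model are foreground
    return np.abs(frame - background) > thresh

bg = np.zeros((2, 2))
frame = np.full((2, 2), 10.0)
bg = update_background(bg, frame, alpha=0.5)  # bg is now all 5.0
```

No past frames are stored — only the running background — which is why this adapts cheaply to slow changes.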

Background Subtraction (Temporal Median Filter)
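The temporal median filter models each background pixel as the median of its recent history: a moving object occupies any given pixel only briefly, so the median recovers the static background. A sketch:

```python
import numpy as np

def median_background(frames):
    # per-pixel median over the frame history; outlier values from briefly
    # passing objects are filtered out by the median
    return np.median(np.stack(frames), axis=0)

# five frames of a 1x2 scene; an "object" (value 200) crosses pixel 0 once
frames = [np.array([[10.0, 30.0]]) for _ in range(4)] + [np.array([[200.0, 30.0]])]
bg = median_background(frames)  # → [[10., 30.]]
```

Unlike the running average, this needs a buffer of past frames, but it is more robust to transient foreground objects.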



Gaussian Mixture Model
The Gaussian Mixture Model (GMM) for background subtraction is a
probabilistic, adaptive method used to handle complex and dynamic scenes —
like moving trees, flickering lights, or waves on water — where a simple average
or difference won’t work well.

Visual Example:
Imagine the color of a pixel changes over time:

Most of the time it's green (leaf),

Sometimes brown (branch),

Rarely black (bird flies by).

A single Gaussian would try to model all these colors with one mean and variance
— poor fit.
But a mixture of 3 Gaussians can:

One Gaussian for green,

One for brown,

One for black,


with different weights depending on how frequent they are.



P(x) is the probability distribution of the intensity (or color) of an individual
pixel over time.

You're combining multiple bell curves, each with its own shape and
importance, to approximate a complex distribution.

For example, if you had pixel values that were mostly near 50, sometimes near
100, and rarely near 200, a 3-component GMM could model that with:

A strong Gaussian centered at 50,

A weaker one at 100,

A tiny one at 200.

Note: check the Euclidean distance of I(x, y) against the mean value of each Gaussian in P(x).

If x matches at least one of the background Gaussians:

Label it as B (background)

If x does not match any of the background Gaussians:

Label it as F (foreground)
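The matching rule can be sketched for a single grayscale pixel: the Gaussians with the largest weights are taken as the background model, and a value "matches" a Gaussian if it lies within about 2.5 standard deviations of its mean (the 2.5σ threshold and k_bg = 2 are conventional illustrative choices, not fixed by the notes above):

```python
import numpy as np

def classify_pixel(x, means, sigmas, weights, k_bg=2, match=2.5):
    # the most frequent (highest-weight) Gaussians are treated as background
    order = np.argsort(weights)[::-1]
    for k in order[:k_bg]:
        if abs(x - means[k]) < match * sigmas[k]:
            return "B"  # matches a background Gaussian
    return "F"          # matches none -> foreground

# the "green / brown / black" pixel from the example, as intensities
means, sigmas, weights = [50, 100, 200], [5, 5, 5], [0.70, 0.25, 0.05]
print(classify_pixel(52, means, sigmas, weights))   # → B
print(classify_pixel(200, means, sigmas, weights))  # → F
```

The rare component at 200 carries too little weight to be background, so a bird-like value is flagged as foreground; OpenCV's createBackgroundSubtractorMOG2 implements the full adaptive version of this idea.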
