DBSCAN Algorithm
DBSCAN produces a varying number of clusters, depending on the input data, as it should. Density-based
clustering (DBSCAN) seems to match human intuitions about clustering better than clustering by distance
from a central point (K-Means).
Take a large group of people and have them all stand in a field. You're going to use DBSCAN to identify
separate crowds within the group.
You have to choose two numbers beforehand. First, how close do two people have to be to be
"close"? Let's take something intuitive: if I can reach out and put my hand on top of someone's head,
he's close by. About 3 feet. Second, how many people have to be close to you in order to form a
crowd? Let's say three. Two is company, but three's a crowd.
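The two numbers chosen above can be written down as parameters. Here is a minimal Python sketch, assuming each person is an (x, y) position measured in feet. The names EPS_FEET, MIN_NEIGHBORS and find_neighbors are made up for illustration and do not come from any library. It counts neighbors excluding the person themselves, as the analogy does; some libraries count the point itself toward the threshold, so check before reusing the number 3 elsewhere.

```python
import math

EPS_FEET = 3.0      # how close two people must be to count as "close"
MIN_NEIGHBORS = 3   # how many close people it takes to form a crowd

def find_neighbors(people, eps=EPS_FEET):
    """For each person, list the indices of everyone else within eps."""
    neighbors = []
    for i, p in enumerate(people):
        close = [j for j, q in enumerate(people)
                 if j != i and math.dist(p, q) <= eps]
        neighbors.append(close)
    return neighbors

# Four people standing in a tight square, plus one person far away.
people = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
neighbors = find_neighbors(people)
```

Here each of the four people in the square has the other three as neighbors, and the distant person has none.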
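Everyone counts their neighbors: the people standing close by. Anyone with at least 3 neighbors holds up
a green flag. These people are the core of a crowd.
Then we will have some people who are not yet holding up flags. They have fewer than 3 neighbors, but
at least one of those neighbors holds a green flag. Such people hold up yellow flags. They are the edge
of a crowd.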
Finally, we've got people who have fewer than 3 neighbors, none of whom holds a green flag. These
people hold up red flags. They are the outliers, not part of any crowd. (They're the radical
individualists.)
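The flag rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: people are (x, y) positions in feet, "close" means within 3 feet, a crowd needs 3 neighbors (not counting yourself), and assign_flags is a hypothetical helper name.

```python
import math

def assign_flags(people, eps=3.0, min_neighbors=3):
    """Return one flag per person: 'green' (core), 'yellow' (edge) or 'red' (outlier)."""
    n = len(people)
    # Everyone within eps of person i, not counting i themselves.
    neighbors = [[j for j in range(n)
                  if j != i and math.dist(people[i], people[j]) <= eps]
                 for i in range(n)]
    # Green: enough close neighbors to be the core of a crowd.
    flags = ['green' if len(nb) >= min_neighbors else None for nb in neighbors]
    for i in range(n):
        if flags[i] is None:
            # Yellow if at least one neighbor is green, otherwise red.
            near_core = any(flags[j] == 'green' for j in neighbors[i])
            flags[i] = 'yellow' if near_core else 'red'
    return flags

# A square of four people, one person hanging at its edge, one far away.
people = [(0, 0), (2, 0), (0, 2), (2, 2), (4.5, 2), (20, 20)]
flags = assign_flags(people)
# flags == ['green', 'green', 'green', 'green', 'yellow', 'red']
```

The person at (4.5, 2) can only reach the person at (2, 2), who holds a green flag, so they get yellow. The person at (20, 20) reaches nobody and gets red.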
Suppose that I am holding a green flag. I can identify all the other people in my crowd as follows: if I
can make a chain of green-flag people (core points) with hands on one another's heads from myself to
the target person, we are part of the same crowd. This chain cannot involve any yellow-flag people
(edge points). To keep the different crowds straight, put numbers on the different green
flags. Everyone with green flag #1 can reach everyone else with green flag #1 through a chain of hands
on heads. Everyone with green flag #2 can reach everyone else holding green flag #2 but not anyone
holding green flags #1 or #3. You get the idea.
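Numbering the green flags amounts to following chains of green-flag people. Below is one way to sketch it in Python, using a breadth-first walk. It assumes the neighbor lists and flags have already been worked out (here they are typed in by hand), and number_green_flags is a hypothetical name chosen for this example.

```python
from collections import deque

def number_green_flags(neighbors, flags):
    """Give every green-flag person a crowd number; chained green flags share a number."""
    crowd = {}          # person index -> crowd number
    next_number = 1
    for start, flag in enumerate(flags):
        if flag != 'green' or start in crowd:
            continue
        crowd[start] = next_number
        queue = deque([start])
        while queue:
            person = queue.popleft()
            for other in neighbors[person]:
                # Chains may only pass through green-flag people.
                if flags[other] == 'green' and other not in crowd:
                    crowd[other] = next_number
                    queue.append(other)
        next_number += 1
    return crowd

# Hand-written example: persons 0-3 form one crowd, persons 5-8 another.
# Person 4 (yellow) stands between them and is close to persons 3 and 5.
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2, 4], [3, 5],
             [4, 6, 7, 8], [5, 7, 8], [5, 6, 8], [5, 6, 7]]
flags = ['green'] * 4 + ['yellow'] + ['green'] * 4
crowd = number_green_flags(neighbors, flags)
```

Because the walk refuses to step through person 4, the two groups keep separate numbers even though a yellow-flag person is within reach of both.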
Or suppose that I hold a yellow flag. I get to choose which crowd to be part of. If someone with green
flag C (#1, #2, #3 or whatever) can put her hand on my head, I can say I'm part of crowd C. If green-flag
holders from more than one crowd can put their hands on my head, then it is up to the guy running the
game whether I'm part of just one cluster or whether I hold onto my multiple memberships.
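The yellow-flag choice can be sketched the same way. The Python below assumes neighbor lists, flags and green-flag crowd numbers are already known (typed in by hand here). Both function names are hypothetical, and the "lowest crowd number wins" tie-break is only one arbitrary option for the person running the game.

```python
def yellow_memberships(neighbors, flags, crowd):
    """Map each yellow-flag person to the set of crowds whose green flags can reach them."""
    memberships = {}
    for person, flag in enumerate(flags):
        if flag == 'yellow':
            memberships[person] = {crowd[j] for j in neighbors[person]
                                   if flags[j] == 'green'}
    return memberships

def pick_one(memberships):
    """One possible tie-break (an assumption): keep the lowest crowd number."""
    return {person: min(crowds) for person, crowds in memberships.items()}

# Person 4 holds a yellow flag and is within reach of person 3 (crowd #1)
# and person 5 (crowd #2).
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2, 4], [3, 5],
             [4, 6, 7, 8], [5, 7, 8], [5, 6, 8], [5, 6, 7]]
flags = ['green'] * 4 + ['yellow'] + ['green'] * 4
crowd = {0: 1, 1: 1, 2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 2}

memberships = yellow_memberships(neighbors, flags, crowd)  # keep every membership
single = pick_one(memberships)                             # or force a single crowd
```

Keeping the full set corresponds to holding onto multiple memberships; applying pick_one corresponds to being placed in just one cluster.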
You're done. Each distinct number that appears on a green flag (#1, #2, #3, ...) identifies a separate
cluster. People holding red flags -- the outlier nodes -- are not part of any cluster. The edge nodes
(people with yellow flags) can be considered part of whichever cluster is most convenient for you.