
Unsupervised Learning

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:

• K-means
• K-medoids
• Hierarchical clustering
• Density-based clustering
• Spectral clustering
• Outlier analysis
  o Introduction to isolation forest
  o Local outlier factor
• Evaluation metrics and score
  o Elbow method
  o Extrinsic and intrinsic methods

5.1 Clustering

• The dictionary meaning of cluster is "a number of similar things that occur together". For example, a cluster of stars in the galaxy.

• With respect to machine learning:

Definition: Clustering is a technique in which the data points are arranged in similar groups dynamically, without any pre-assignment of groups.

• Clustering is the task in which the data points are grouped in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). It is a data exploration technique commonly used to understand how the data should be interpreted and grouped to be best analysed. There is no prior knowledge of groupings required. Instead, groups are implicitly arranged (or created) based on the data attributes.

• How the data is grouped depends on the type of algorithm used. The same dataset can be used to form various clusters depending upon the data attributes, and then you can decide which clusters make sense to be used for further data analytics.

• For example, Fig. 5.1.1 is a simple plot of data points. As you can see, some sets of points are closer to each other when compared with others. These sets could possibly form a group (a cluster) for further data analytics.
Fig. 5.1.1: Plot of data points showing three possible groupings (Cluster 1, Cluster 2 and Cluster 3)

Let's take a simple example to understand the clustering technique.

Customer Name   Region      Monthly Income   Age   Phone Model   Laptop
A               Bangalore   50,000           29    iPhone X      MacBook Pro
B               Bangalore   35,000           34    Motorola X    Dell Precision
C               Mumbai      80,000           36    iPhone X      Dell Precision
D               Mumbai      40,000           26    iPhone X      Dell Precision
E               Mumbai      55,000           39    iPhone X      MacBook Pro
You could build several clusters (groups) for analysis. For example:

• Based on region:
  Bangalore (2 data points)
  Mumbai (3 data points)

• Based on monthly income:
  50,000 or more (3 data points)
  Less than 50,000 (2 data points)

• Based on age:
  Under 30 (2 data points)
  30 to 36 (2 data points)
  Above 36 (1 data point)

• Based on phone model preference:
  iPhone X (4 data points)
  Motorola X (1 data point)

• Based on laptop preference:
  MacBook Pro (2 data points)
  Dell Precision (3 data points)
Based on which clusters could you answer these questions?

1. Is it true that people prefer iPhone X irrespective of their income group or cluster?

2. Is it true that Dell Precision is a more popular product for a laptop?

3. On average, are younger people making more money than older people?

Note here that this might be a simplistic example that you feel you can do manually. But, for all practical purposes, a dataset contains millions of records with various attributes. Imagine how hard it could be to find the clusters that could be more useful to understand the data and run further analysis on it. This is not a trivial problem to solve.
t
Also note here that clustering is a data exploration method (or a task where clusters are the outcome) and not an algorithm or analytics technique in itself.

You can use various algorithms such as the following for clustering the data, or you could do it yourself without an algorithm (if you have the patience and time to kill, like for the simplistic example given earlier).
o Partition clustering (also called K-means clustering)
o Hierarchical clustering
o Fuzzy clustering
o Model based clustering
o Density-based clustering

Clustering is used in recommendation and search engines, marketing, economics, and other branches of science.

5.1.1 Properties of a Cluster

Based on the cluster plot and the example given in the previous section, it is evident that typically clusters have the following properties.

1. All the data points in a cluster should be similar to each other:

By definition, a cluster is a grouping of similar objects. So, a good cluster must have data points with some convincing similarities on the basis of which they are grouped. For example, the phone model preference mentioned in the example in the previous section could be a convincing enough basis for a cluster.

2. The data points from different clusters should be as different as possible:

To make clusters distinct enough, the data points from different clusters should be far enough apart to be clearly distinguishable. For example, a cluster containing all customers whose preferred phone is iPhone X is clearly distinguishable from another cluster containing customers preferring Motorola X.
5.1.2 Types of Clustering

At a high level, there are two types of clustering:

1. Hard Clustering: In this, each data point belongs to only one cluster at a time. For example, customer A (in the previously given example) could only be in the iPhone X cluster if you clustered based on phone model preference.

2. Soft Clustering: In this, instead of putting each data point into exactly one cluster, a probability or likelihood of that data point being in each cluster is assigned. For example, customer A would be in both the iPhone X and Motorola X clusters, with respective probabilities of 1 and 0.
5.1.3 Use Cases (Applications) of Clustering

There are several use cases or applications of the clustering technique. These applications are not specific to K-means or any other clustering algorithm. Depending on the dataset to be clustered and the desired clustering outcome, the algorithm could be chosen. Some of the common applications of clustering are as following.

1. Customer segmentation
2. Image processing
3. Healthcare
4. Recommendation engines

Fig. 5.1.2: Use cases (applications) of clustering

1. Customer Segmentation:

• One of the most common applications of clustering is customer segmentation. Various companies create customer segment profiles to understand the customer's shopping behaviour and accordingly use them for predicting demand and improving sales.

• As you saw in the example given in the previous section, various customers could be grouped based on their purchase behaviours, and appropriate, targeted marketing campaigns could be run for those segments. For example, customers who own Phone Model X could be shown other products from the same company.

• Another example could be a woman buying clothes for her 3-month old infant. The company could send her promotional offers for the next size of clothes she is likely to buy. If she bought a cloth size fitting a 3-month old infant, after 9 months she could be buying the cloth size for a 1-year old baby. The chances of her buying clothes for a 1-year old are high. The company could identify such clusters of mothers who would require 1-year old cloth sizes and send them promotional offers.



2. Image Processing:

• Using clustering and related techniques, you could identify objects in an image and then match them with known objects, to classify the image or identify what all objects are there in the image.

Fig. 5.1.3

• Additionally, you can use this technique to capture image frames from a video and then determine the objects in the video. Based on the objects, you can categorise the video (for example, a video having animals), find the probable location of the video (if you could identify a famous location such as the Gateway of India) or identify people in the video. Image and video based analytics have reached a very advanced level these days, where you could carry out various useful tasks.

3. Healthcare:

• Patient data, containing attributes such as blood pressure, height, weight, cholesterol level and glucose level, could be used to create clusters and detect any early signs of health problems, based on historical records of other patients who had similar characteristics and were then diagnosed with a particular health problem. For example, if someone has a high cholesterol level and weight, it might be possible to detect whether she is closer to getting a heart attack. There could be several other uses of clustering in the healthcare industry.

• For example, detecting cancer, its type and severity, and the possible treatment plan at an early stage could have a huge impact on the number of years the patient is expected to survive. The shape of the cancerous cells plays a key role in determining the severity of the cancer. Using clustering, you can determine the shape of the cancerous cells and accordingly create a treatment plan.

Fig. 5.1.4
4. Recommendation Engines:

• I am sure that you would have seen video or song recommendations on your favourite media apps, or product recommendations as you browse through an online portal. Some of the examples are as shown in Fig. 5.1.5.

Fig. 5.1.5: Product recommendations ("Related to items") on an online shopping portal

• How does recommendation work? How does it know what you may like? Basically, the organisations have data for you (such as what you bought, what items you looked at, where you clicked, etc.) and for several other visitors of the site or app. Based on how similar you are when compared with others (creating clusters), recommendations are made to you.

• The collected data is clustered and analysed to answer questions such as "how likely is it that someone who has bought product X will also buy product Y?". If the probability is quite high, then the recommendation is made. For example, if customer A has bought Book 1 and also Book 2 historically, and customer B has just bought Book 1 now, she is recommended that Book 2 might be of interest to her.

5.1.4 K-means

K-means is one of the popular clustering techniques (or algorithms). It helps to form clusters of similar data points for further data analytics. Each data point belongs to only one cluster. You pre-decide the number of clusters, k, that you want to group your data points into before executing the algorithm steps. Let's learn about it.

Overview of the Method

• K-means algorithm is designed to minimise the sum of distances between the data points and their respective cluster centroids. Each cluster is associated with a centroid, a term from coordinate geometry.

Definition: Centroid is a point whose coordinates are the averages of the corresponding coordinates of the data points in a cluster.
At a high level, the following steps are taken for clustering the data using K-means clustering:

1. Specify the number of clusters, k, that you desire to group your data points into.

2. Select k random data points as centroids.

3. Measure the distance from each data point to each centroid, and assign each data point to the cluster of the closest centroid. The distance d between any two points (x1, y1) and (x2, y2) is calculated as:

   d = √((x1 - x2)² + (y1 - y2)²)

4. Compute the centroids of the newly formed clusters. The centroid (X, Y) of the m data points in a cluster is:

   (X, Y) = ((1/m) Σ i=1..m xi, (1/m) Σ i=1..m yi)

   i.e., the simple arithmetic mean of all the x coordinates and y coordinates of the m data points in the cluster.

5. Repeat steps 3 and 4 until any of the following criteria is met:

   o Centroids of newly formed clusters do not change.
   o Points remain in the same cluster.
   o The maximum number of iterations, as desired, is reached.
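Before looking at worked examples, here is a minimal sketch of these steps in Python with NumPy. This is an illustrative implementation rather than anything from this text: the function name kmeans and its parameters are assumptions, the distance is the Euclidean distance from step 3, and it assumes no cluster becomes empty during the iterations.

    import numpy as np

    def kmeans(points, k, max_iterations=10, seed=0):
        # Steps 1-2: choose k random data points as the initial centroids.
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        assignment = np.zeros(len(points), dtype=int)
        for _ in range(max_iterations):
            # Step 3: Euclidean distance from every point to every centroid,
            # then assign each point to its closest centroid.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # Step 4: recompute each centroid as the mean of its assigned points
            # (assumes every cluster keeps at least one point).
            new_centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
            # Step 5: stop when the centroids no longer change.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, assignment

For example, calling kmeans(data, k=2) on a two-column array of points returns the two final centroids and the cluster index of each point.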

Let's see a few examples of the K-means clustering algorithm.

Ex. 5.1.1: Use the following data and group them using the K-means clustering algorithm. Show the calculation of centroids.

Height   Weight
185      72
170      56
168      60
179      68
182      72
188      77
180      71
180      70
183      84
180      88
180      67
177      76

Soln.: Let's assume height on the X-axis and weight on the Y-axis. Let's assume that we need two clusters; hence, k = 2. Let's decide that we would do the centroid calculation and data-point-to-cluster assignment twice (the number of iterations) in this algorithm. The first plot of the given data points is as shown in Fig. P. 5.1.1.
Fig. P. 5.1.1: Plot of the given data points
Iteration 1:

Let's randomly choose two centroids: Centroid 1 = (170, 56) and Centroid 2 = (182, 72).

Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.

A sample distance calculation is as following for the first data point.

Distance from Centroid 1 (170, 56) for (185, 72):

d = √((x1 - x2)² + (y1 - y2)²)
d = √((170 - 185)² + (56 - 72)²)
d = 21.93

Height   Weight   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
185      72       21.93                      3.00                       2
170      56       0.00                       20.00                      1
168      60       4.47                       18.44                      1
179      68       15.00                      5.00                       2
182      72       20.00                      0.00                       2
188      77       27.66                      7.81                       2
180      71       18.03                      2.24                       2
180      70       17.20                      2.83                       2
183      84       30.87                      12.04                      2
180      88       33.53                      16.12                      2
180      67       14.87                      5.39                       2
177      76       21.19                      6.40                       2

Now, re-calculate the centroids for the next iteration.

For Centroid 1 calculation, you have two data points that fall in cluster 1. They are (170, 56) and (168, 60).

Hence, Centroid 1 is the mean of the data points, which is ((170 + 168)/2, (56 + 60)/2) = (169, 58).

For Centroid 2 calculation, take the remaining 10 data points that fall in cluster 2.

Hence, Centroid 2 is the mean of these 10 data points, which is
((185 + 179 + 182 + 188 + 180 + 180 + 183 + 180 + 180 + 177)/10, (72 + 68 + 72 + 77 + 71 + 70 + 84 + 88 + 67 + 76)/10) = (181.4, 74.5)

Now, you have both the centroids ready for the next and final iteration.
Iteration 2:

Centroid 1 = (169, 58) and Centroid 2 = (181.4, 74.5).

Now, let's calculate the distance of the data points from the centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.
Height   Weight   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
185      72       21.26                      4.38                       2
170      56       2.24                       21.73                      1
168      60       2.24                       19.74                      1
179      68       14.14                      6.93                       2
182      72       19.10                      2.57                       2
188      77       26.87                      7.06                       2
180      71       17.03                      3.77                       2
180      70       16.28                      4.71                       2
183      84       29.53                      9.63                       2
180      88       31.95                      13.57                      2
180      67       14.21                      7.63                       2
177      76       19.70                      4.65                       2

You stop here because the number of iterations that you decided is completed, and you also see that the clusters assigned in iteration 1 for the data points did not change.

Hence, you got the two clusters as shown in Fig. P. 5.1.1(a).
Fig. P. 5.1.1(a): The two clusters formed (Cluster 1 and Cluster 2)
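If you want to cross-check a worked example like this programmatically, a sketch along the following lines (using scikit-learn, which is this note's assumption and not part of the text) should reproduce the final clusters when the two starting centroids from Iteration 1 are passed in explicitly.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                  [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

    # Start from the same two centroids chosen in Iteration 1 of Ex. 5.1.1.
    init_centroids = np.array([[170, 56], [182, 72]], dtype=float)
    km = KMeans(n_clusters=2, init=init_centroids, n_init=1, max_iter=2).fit(X)

    print(km.cluster_centers_)  # expected: approximately (169, 58) and (181.4, 74.5)
    print(km.labels_)           # 0/1 cluster assignment for each data point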
Ex. 5.1.2: In a survey, the following data was collected for TV viewership. Is there a group which watches more TV than the other?

Age   Number of TV watching hours
23    3
25    2
28    2
35    4
37    5
40    4
34    3
41    5
39    4
38    3

Soln.: Let us use K-means clustering for grouping the survey results. Let's choose the desired number of clusters k = 2, and the number of iterations for centroid calculation and cluster assignment to be 2 as well. Let's put age on the X-axis and watching hours on the Y-axis.

The first plot of data points is as shown in Fig. P. 5.1.2.
Fig. P. 5.1.2: Plot of the survey data points
Iteration 1:

Let's randomly choose two centroids: Centroid 1 = (25, 2) and Centroid 2 = (35, 4).

Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.

A sample distance calculation is as following for the first data point.

Distance from Centroid 1 (25, 2) for (23, 3):

d = √((x1 - x2)² + (y1 - y2)²)
d = √((25 - 23)² + (2 - 3)²)
d = 2.24
Age   Hours   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
23    3       2.24                       12.04                      1
25    2       0.00                       10.20                      1
28    2       3.00                       7.28                       1
35    4       10.20                      0.00                       2
37    5       12.37                      2.24                       2
40    4       15.13                      5.00                       2
34    3       9.06                       1.41                       2
41    5       16.28                      6.08                       2
39    4       14.14                      4.00                       2
38    3       13.04                      3.16                       2

Now, re-calculate the centroids for the next iteration.

For Centroid 1 calculation, you have three data points that fall in cluster 1. They are (23, 3), (25, 2) and (28, 2).

Hence, Centroid 1 is the mean of the data points, which is ((23 + 25 + 28)/3, (3 + 2 + 2)/3) = (25.33, 2.33).

For Centroid 2 calculation, take the remaining seven data points that fall in cluster 2.

Hence, Centroid 2 is the mean of these seven data points, which is
((35 + 37 + 40 + 34 + 41 + 39 + 38)/7, (4 + 5 + 4 + 3 + 5 + 4 + 3)/7) = (37.71, 4)

Now, you have both the centroids ready for the next and final iteration.

Iteration 2:

Centroid 1 = (25.33, 2.33) and Centroid 2 = (37.71, 4).

Now, let's calculate the distance of the data points from the centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.
Age   Hours   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
23    3       2.42                       14.74                      1
25    2       0.47                       12.87                      1
28    2       2.69                       9.91                       1
35    4       9.81                       2.71                       2
37    5       11.97                      1.23                       2
40    4       14.76                      2.29                       2
34    3       8.70                       3.84                       2
41    5       15.90                      3.44                       2
39    4       13.77                      1.29                       2
38    3       12.69                      1.04                       2
You stop here because the number of iterations that you decided is completed, and you also see that the clusters assigned in iteration 1 for the data points did not change.

Hence, you got the two clusters as shown in Fig. P. 5.1.2(a).

Fig. P. 5.1.2(a): The two clusters formed (Cluster 1 and Cluster 2)

From the clusters formed, it can be concluded that people in cluster 2 seem to be watching more TV than people in cluster 1.
Ex. 5.1.3: A bank has received the following loan applications. Which of the applications could be risky to approve?

Credit Score (out of 1000)   Amount in Lakhs
500                          10
726                          25
430                          5
678                          15
780                          30
380                          10
645                          15
890                          50
900                          65
450                          10

Soln.: Let us use K-means clustering for grouping the loan applications. Let's choose the desired number of clusters k = 2, and the number of iterations for centroid calculation and cluster assignment to be 2 as well. Let's put credit score on the X-axis and loan amount on the Y-axis.

The first plot of data points is as shown in Fig. P. 5.1.3.

Fig. P. 5.1.3: Plot of the loan application data points

Iteration 1:

Let's randomly choose two centroids: Centroid 1 = (430, 5) and Centroid 2 = (726, 25).

Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.

A sample distance calculation is as following for the first data point.

Distance from Centroid 1 (430, 5) for (500, 10):

d = √((x1 - x2)² + (y1 - y2)²)
d = √((430 - 500)² + (5 - 10)²)
d = 70.18

Credit Score   Amount   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
500            10       70.18                      226.50                     1
726            25       296.67                     0.00                       2
430            5        0.00                       296.67                     1
678            15       248.20                     49.03                      2
780            30       350.89                     54.23                      2
380            10       50.25                      346.32                     1
645            15       215.23                     81.61                      2
890            50       462.20                     165.89                     2
900            65       473.81                     178.54                     2
450            10       20.62                      276.41                     1
Now, re-calculate the centroids for the next iteration.

For Centroid 1 calculation, you have four data points that fall in cluster 1. They are (500, 10), (430, 5), (380, 10) and (450, 10).

Hence, Centroid 1 is the mean of the data points, which is ((500 + 430 + 380 + 450)/4, (10 + 5 + 10 + 10)/4) = (440, 8.75).

For Centroid 2 calculation, take the remaining six data points that fall in cluster 2.

Hence, Centroid 2 is the mean of these six data points, which is
((726 + 678 + 780 + 645 + 890 + 900)/6, (25 + 15 + 30 + 15 + 50 + 65)/6) = (769.83, 33.33)

Now, you have both the centroids ready for the next and final iteration.

Iteration 2:

Centroid 1 = (440, 8.75) and Centroid 2 = (769.83, 33.33).

Now, let's calculate the distance of the data points from the centroids and complete the data points table. Each data point is assigned to the cluster based on the closest centroid.
Credit Score   Amount   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
500            10       60.01                      270.84                     1
726            25       286.46                     44.61                      2
430            5        10.68                      341.01                     1
678            15       238.08                     93.64                      2
780            30       340.66                     10.70                      2
380            10       60.01                      390.53                     1
645            15       205.10                     126.17                     2
890            50       451.89                     121.32                     2
900            65       463.43                     133.97                     2
450            10       10.08                      320.68                     1
You stop here because the number of iterations that you decided is completed, and the clusters assigned in iteration 1 for the data points did not change.

Hence, you got the two clusters as shown in Fig. P. 5.1.3(a).
Fig. P. 5.1.3(a): The two clusters formed (Cluster 1 and Cluster 2)

From the clusters formed, it can be concluded that the applicants in cluster 2 (high credit scores) seem to be less risky for granting loans, while the applications in cluster 1 could be risky to approve.
5.1.5 Determining the Number of Clusters (Elbow Method)

• To use the K-means algorithm, you are required to choose the desired number of clusters, the value k. But how do you choose the right number of clusters for a given set of data points?

• Most of the time, you choose the value of k by seeing the data plot (a guess) or on the basis of your requirements (say, you need 3 clusters only).


But the number you choose in these ways may not be the optimum number to divide your data points r.:

clusters So. how could you choose the right number? Let's learn about it.
Do you remember the first property of a cluster? It wa s "All the data points in a cluster should be similar to eac."

other·.
So, what does this mean? It means that the data points within a good cluster sh

possible. In other words, the distance between the data points and the o
minimal as possible.

• This is called inertia of a cluster.

Definition: Inertia is the sum of distances of all the data points in a cluster from the centroid of that cluster.

• It is also called the intracluster distance. For example, if the distances of a cluster's five data points from its centroid are 1, 2, 1, 2 and 0.5, the inertia of that cluster would be 1 + 2 + 1 + 2 + 0.5 = 6.5.

• The total inertia of the clustering is the sum of the individual inertias of all the clusters formed. So, if you have three clusters, you would add up the three cluster inertias to get the total.


Fig. 5.1.6: Distances of the data points in a cluster from its centroid (intracluster distance)

• Your goal is to ensure that the final clusters created have minimal total inertia, to ensure that you get the optimum number of clusters.

• Steps to determine the right number of clusters:

1. You start with any value of k randomly and perform K-means analysis.

2. Once the analysis is done, determine the total inertia of the clusters that were created.

3. Increase the number k by 1 and carry out steps 1 and 2 until the change in total inertia is not significant.

• For example, suppose you plot the inertia value against the value of k, and it plots as shown in Fig. 5.1.7.

Fig. 5.1.7: Inertia plotted against the number of clusters

• You see that good values of k would be 6, 8 or 10, because after these values the inertia does not reduce significantly.

• This type of plot is also called an elbow plot or knee curve (because of its shape). The elbow serves as a cut-off point beyond which you are not getting significant additional benefit for the additional cost. In clustering, this means that adding another cluster doesn't give a much better model of the data.

• Note here that increasing the value of k to be too high results in clusters that are not meaningful, and it would defeat the purpose of finding distinct and substantial groupings of data points.
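A sketch of how such an elbow plot can be produced in practice is given below (assuming scikit-learn and matplotlib are available; note that scikit-learn's inertia_ attribute is the sum of squared distances to the closest centroid, a squared variant of the inertia defined in this section, but the elbow is read the same way).

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Example data: the height-weight points from Ex. 5.1.1.
    X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                  [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

    ks = range(1, 9)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(ks, inertias, marker='o')
    plt.xlabel('Clusters (k)')
    plt.ylabel('Inertia')
    plt.show()  # look for the elbow where the curve flattens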
'vr,s
• By the way, what is the maximum number of clusters that you could create? It is the same as the number of data points in the set, where each data point forms its own cluster. For example, if you have 1,000 data points in a set, you could create 1,000 clusters. But what would you do with such clusters? It is as good as treating each data point individually.

Let's see a few examples of cluster inertia calculation.
Practice Questions

Ex. 5.1.4: Calculate the inertia of the following cluster. Assume the 2nd data point, (33, 15), to be the centroid of the cluster.

Age   Income in Thousand
33    12
33    15
35    13
34    14
32    16

Soln.: To calculate the inertia of a cluster, you need to find the distance of all data points from the centroid of the cluster and add them.

Assume age on the X-axis and income on the Y-axis. Now, let's calculate the distance of the data points from the centroid and complete the data points table.

A sample calculation is as following for the first data point.

Distance from the cluster centroid (33, 15) for (33, 12):

d = √((x1 - x2)² + (y1 - y2)²)
d = √((33 - 33)² + (15 - 12)²)
d = 3

Age   Income   Distance from Cluster Centroid
33    12       3.00
33    15       0.00
35    13       2.83
34    14       1.41
32    16       1.41

Total inertia = 3.00 + 0.00 + 2.83 + 1.41 + 1.41 = 8.65
Ex. 5.1.5: Two clusters have the following data points and data point assignment. Calculate their individual inertia (total intracluster distance), and the total inertia of the clusters.

Height   Weight   Assigned Cluster
185      72       2
170      56       1
168      60       1
179      68       2
182      72       2
188      77       2
180      71       2
180      70       2
183      84       2
180      88       2
180      67       2
177      76       2

Soln.: To calculate the inertia (or intracluster distance) of a cluster, you need to find the distance of all data points from the centroid of the cluster and add them. In the question, the centroids for the respective clusters are not given. Hence, first you need to calculate the centroids for the respective clusters.

For Centroid 1 calculation, you have two data points that fall in cluster 1. They are (170, 56) and (168, 60).

Hence, Centroid 1 is the mean of the data points, which is ((170 + 168)/2, (56 + 60)/2) = (169, 58).

For Centroid 2 calculation, take the remaining 10 data points that fall in cluster 2.

Hence, Centroid 2 is the mean of these 10 data points, which is
((185 + 179 + 182 + 188 + 180 + 180 + 183 + 180 + 180 + 177)/10, (72 + 68 + 72 + 77 + 71 + 70 + 84 + 88 + 67 + 76)/10) = (181.4, 74.5)

Now, you have both the centroids ready for calculating the respective cluster inertia.

Centroid 1 = (169, 58) and Centroid 2 = (181.4, 74.5).

Now, let's calculate the respective distances of the data points from the centroids and complete the tables.

Height   Weight   Distance from Centroid 1
170      56       2.24
168      60       2.24

Total: 4.48

Height   Weight   Distance from Centroid 2
185      72       4.38
179      68       6.93
182      72       2.57
188      77       7.06
180      71       3.77
180      70       4.71
183      84       9.63
180      88       13.57
180      67       7.63
177      76       4.65

Total: 64.91

Hence,

• The inertia for cluster 1 = 4.48
• The inertia for cluster 2 = 64.91
• Total inertia is the sum of inertia of all clusters = 4.48 + 64.91 = 69.39
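The same computation can be sketched in a few lines of Python. The helper name cluster_inertia is illustrative, and inertia is taken as the plain (unsquared) sum of Euclidean distances, exactly as defined in this section; small differences from the hand-worked totals come only from rounding the per-point distances above.

    import numpy as np

    def cluster_inertia(points, centroid):
        # Inertia as defined here: sum of Euclidean distances from the centroid.
        return np.linalg.norm(points - centroid, axis=1).sum()

    cluster1 = np.array([[170, 56], [168, 60]])
    cluster2 = np.array([[185, 72], [179, 68], [182, 72], [188, 77], [180, 71],
                         [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

    i1 = cluster_inertia(cluster1, cluster1.mean(axis=0))  # ~4.47
    i2 = cluster_inertia(cluster2, cluster2.mean(axis=0))  # ~64.9
    print(i1, i2, i1 + i2)                                 # total ~69.4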

5.1.6 Diagnostics (Performance Measures)

K-means algorithm is usually run through computer programs, and the resultant outcome may not always be obvious. You may require a personal touch and understanding of the data points and the resultant clusters to make the outcome more meaningful. This might require some diagnosis (judgment) of any issues that may occur. When the number of attributes in a set of data points is very small, you should plot the data to ensure that:

1. Clusters are clearly distinguishable
2. No cluster has only a few data points
3. Centroids are not too close to each other

For example, the following clusters might be clearly distinguishable and adequately populated with the data points:
The follow ing cluster migh t be clearly distinguishable
.·-
70 -
'
,,
, -- .-...
--~
60 ' '\ \

C luster 2 • I
I \
'
,_ I
I
'
50 I
I
\
I
I ' \

40 I
I
I
I
I
.. •
30 I
I
I ''
' I • I '
20 1~ ~

-- I

\ I '
'

10 -
,--· - --
,'--. • ·- ' . --1;
' I

'' •
.·- ,,
, ,'
'
-

0
Clu&IOI 1- .. • _, '
·- - ---
I

·-- ---· --- - -


1000
0 200 400 600 800
Fig. 5.1.8
But the following data points on the plot may be too scattered:

Fig. 5.1.9: Scattered data points with no clearly distinguishable clusters

You need to apply judgement to see whether increasing the number of clusters or changing any other attribute would come up with better clusters.

5.1.7 Reasons to Choose and Cautions (Drawbacks / Challenges)

A few things to consider for K-means clustering are as following.

1. Your data points could have several attributes such as age, weight, height, income, etc. Once your clustering is complete, you need to ensure that the attributes would be available for new data points as and when you run the analysis. For example, if your existing clustering is based on customer ratings of a product, the rating may not be immediately available for a customer who has just completed the purchase. It might be a month before the customer is comfortable rating the product judiciously.

2. Units of measurement and scaling:

You need to be careful in choosing the units for your data point attributes. While it may not look like it matters a lot, it could actually shift a few data points here and there. For example, suppose your data has the attributes age and income. Age is expressed in double digits. If you choose income to be in thousands, so that someone earning 10,000 is represented as 10, then both age and income are in double digits and contribute comparably towards the distance calculation. But suppose the income was not in thousands: the income attribute would dominate the distance calculation, as it has many more digits than age. The distance would then shift more with respect to income than age, and hence skew (bias or distort) the cluster assignment. For example, look at the following distance calculations.

In the first calculation, Centroid 1 = (33, 15) and Centroid 2 = (34, 14).

In the second calculation, Centroid 1 = (33, 15000) and Centroid 2 = (34, 14000).

Age   Income in Thousand   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
33    12                   3.00                       2.24                       2
33    15                   0.00                       1.41                       1
35    13                   2.83                       1.41                       2
34    14                   1.41                       0.00                       2
32    16                   1.41                       2.83                       1
Age   Income   Distance from Centroid 1   Distance from Centroid 2   Assigned Cluster
33    12000    3000.00                    2000.00                    2
33    15000    0.00                       1000.00                    1
35    13000    2000.00                    1000.00                    2
34    14000    1000.00                    0.00                       2
32    16000    1000.00                    2000.00                    1
While you might argue that the cluster assignment has not changed, if you look closely, the shift in distance from the centroids is more noticeable due to the income attribute. This could have an impact on inertia and might lead you to believe that you need more clusters. (A standard scaling remedy is sketched after this list.)


3. Initial centroid position: K-means is sensitive to the initial centroid position. Hence, you should run the analysis several times to ensure that the clusters created are optimum.

4. Number of clusters: As discussed earlier, you must be careful while deciding on the number of clusters for optimum placement of data points. Cluster inertia is a key parameter to consider for ensuring that you choose the right number of clusters.
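As a side note on point 2 above: a standard remedy (common practice, not something this text prescribes) is to standardise each attribute before clustering, so that no attribute dominates the distance calculation. A sketch using scikit-learn's StandardScaler:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Raw income values dominate the Euclidean distance over age.
    X = np.array([[33, 12000], [33, 15000], [35, 13000], [34, 14000], [32, 16000]])

    # Rescale each column to zero mean and unit variance so that both
    # attributes contribute comparably to the distance calculation.
    X_scaled = StandardScaler().fit_transform(X)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
    print(km.labels_)  # cluster assignment computed on comparable scales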

5.2 k-Nearest Neighbours (kNN) Classification Algorithm

• One of the popular variations of the K-means clustering algorithm is the k-Nearest Neighbours (kNN) classification algorithm.

• It works on the same principle as the K-means clustering algorithm: a data point is similar to its neighbours and would possibly have the same classification. Note, however, that kNN is a supervised learning algorithm.

• The way it works is simple and straightforward:

1. Assume that you have already assigned labels to the existing data points in a given data set.

2. Then, you are given a new data point to classify.

3. You calculate the distance of the new data point with respect to the existing data points and select its k nearest neighbours. The number k depends on your data set and requirement, but in general, higher is better, to reduce the noise.

4. The new data point is classified based on the classification of the majority of the selected k neighbours.

Let's take an example to understand it.

The following is a dataset for the weight of teens and whether they like Pizza.

Weight (Kgs) like Plua?

78 Yes

54 No

69 Yes

73 Yes

59 No

48 No

82 No

65 Yes

Using the k-Nearest Neighbours (kNN) classification algorithm, determine whether a teen weighing 63 Kgs is likely to like Pizza.
Use k = 3.

Soln.: Calculate the distance of the new data point (63 Kgs) with respect to the other data points in the data set.

A sample distance calculation is as following for the first data point.

d = √((x1 - x2)²)
d = √((78 - 63)²)
d = 15

Weight (Kgs) Distance for 63 Kgs Like Pizza?

78 15 Yes

54 9 No

69 6 Yes

73 10 Yes

59 4 No

48 15 No

82 19 No

65 2 Yes
Sort the table based on the distance.

Weight (Kgs)   Distance for 63 Kgs   Like Pizza?
65             2                     Yes
59             4                     No
69             6                     Yes
54             9                     No
73             10                    Yes
78             15                    Yes
48             15                    No
82             19                    No
Now, given that k = 3, pick the top 3 rows of the sorted table.

Weight (Kgs)   Distance for 63 Kgs   Like Pizza?
65             2                     Yes
59             4                     No
69             6                     Yes

Now, you have got 2 Yes and 1 No. The majority classification is "Yes". Hence, a teen weighing 63 Kgs is likely to like Pizza.
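The same procedure can be sketched in a few lines of Python. The helper name knn_classify is an illustrative assumption, and the data is the Pizza example above; since the data is one-dimensional, the distance reduces to an absolute difference.

    import numpy as np

    def knn_classify(train_x, train_y, query, k=3):
        # Distance of the query point to every labelled point.
        distances = np.abs(train_x - query)
        # Labels of the k nearest neighbours.
        nearest = train_y[np.argsort(distances)[:k]]
        # Majority vote among those k labels.
        values, counts = np.unique(nearest, return_counts=True)
        return values[counts.argmax()]

    weights = np.array([78, 54, 69, 73, 59, 48, 82, 65])
    likes = np.array(["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes"])
    print(knn_classify(weights, likes, 63, k=3))  # prints "Yes"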

5.2.1 K-medoids

• The medoid of a cluster is defined as the object (or the data point) in the cluster whose average dissimilarity to all the other objects in the cluster is minimal, that is, it is the most centrally located point in the cluster. The k-medoids algorithm is another approach for clustering, similar to the k-means algorithm.

• k-medoids clustering is a partitioning method commonly used in domains that require robustness to outliers, arbitrary distance metrics, or ones for which the mean or median does not have a clear definition. It is similar to k-means, and the goal of both methods is to divide a set of measurements or observations into k subsets or clusters, so that the subsets minimise the sum of distances between a measurement and the center of the measurement's cluster. In the k-means algorithm, the center of the subset is the mean of the measurements in the subset, called a centroid. In the k-medoids algorithm, the center of the subset is a member of the subset, called a medoid.

• The k-medoids algorithm returns medoids, which are actual data points in the data set. This allows you to use the algorithm in situations where the mean of the data does not exist within the data set. This is the main difference between k-medoids and k-means, where the centroids returned by k-means may not be within the data set. Hence, k-medoids is useful for clustering categorical data, where a mean is impossible to define or interpret.
• Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and attempt to minimise the distance between points labelled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual input data points as centres (called medoids or exemplars), and thereby allows for greater interpretability of the cluster centres than in k-means, where the center of a cluster is not necessarily one of the input data points (it is the average of the points in the cluster).
• For example, Fig. 5.2.1 illustrates the difference between the k-means and k-medoids approaches.

Fig. 5.2.1: (a) Mean (b) Medoid

• The group of points on the left forms a cluster, while the rightmost point is an outlier. The mean is greatly influenced by the outlier and thus cannot represent the correct cluster center, while the medoid is robust to the outlier and correctly represents the cluster center.

• k-medoids minimises a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances as in the k-means algorithm. It is, thus, more robust to noise and outliers than k-means. k-medoids is a classical partitioning technique of clustering that splits the data set of n objects into k clusters. Similar to k-means, you can decide the number k as per your requirements.


• There are several approaches to implementing k-medoids. However, partitioning around medoids (PAM) is most popularly used. The steps in PAM are as following; a compact sketch follows the steps.

(1) Build-step: Each of the k clusters is associated with a potential medoid. You can randomly choose the medoids based on the number of clusters that you desire. For example, if you desire two clusters, then you choose two data points from the input data set as medoids to start with.

(2) Swap-step: Within each cluster, each point is tested as a potential medoid by checking if the sum of within-cluster distances gets smaller using that point as the medoid. If so, the point is defined as a new medoid. Every point is then assigned to the cluster with the closest medoid.

(3) The algorithm iterates the build and swap steps until the medoids do not change, or other termination criteria are met.
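To make the build and swap steps concrete, here is a compact sketch of a simplified PAM-style loop. It is an illustrative assumption rather than the exact PAM procedure of any particular library: it uses Euclidean distance, assumes no cluster becomes empty, and in each round assigns points to the nearest medoid and then lets each cluster adopt as its new medoid the member that minimises the sum of within-cluster distances.

    import numpy as np

    def pam(points, k, max_iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # Build-step: randomly choose k actual data points as initial medoids.
        medoids = points[rng.choice(len(points), size=k, replace=False)]
        labels = np.zeros(len(points), dtype=int)
        for _ in range(max_iterations):
            # Assign every point to its closest medoid.
            d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Swap-step: within each cluster, adopt as medoid the member point
            # that minimises the sum of distances to the other members.
            new_medoids = medoids.copy()
            for j in range(k):
                members = points[labels == j]
                costs = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
            # Terminate when the medoids no longer change.
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return medoids, labels

Note that the returned medoids are always actual input points, which is the interpretability property discussed above.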
• Note here that the k-medoids implementation is usually a greedy algorithm, a problem-solving heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. So, k-medoids may not always give you perfectly clustered data points, but it does a great job of quickly clustering the data points.

Let's see a solved example to learn how the k-medoids algorithm (PAM implementation) works.

