Unsupervised Learning

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
• K-means
• K-medoids
• Hierarchical Clustering
• Density-based Clustering
• Spectral Clustering
• Outlier Analysis
  o Introduction to Isolation Forest
  o Elbow Method

5.1 Clustering
• The dictionary meaning of cluster is "a number of similar things that occur together". For example, a cluster of stars in the galaxy.
• When you plot data points, you may notice sets of points that lie close to each other. These sets could possibly form a group or cluster, as shown in Fig. 5.1.1.
[Figure: scatter plot of data points grouped into Cluster 1, Cluster 2 and Cluster 3, plotted against the x-axis and y-axis]
Fig. 5.1.1
Let's take a simple example to understand the clustering technique.
•
e Phone Model Laptop . . .
Region Monthly Income Ag
I Customer Name MacBook Pro
50,000 29 iPhone X
l A Bangalore
34 Motorola X Dell Precision
35,000
I 8 Bangalore
36 iPhone X Dell Precision
Mumbai 80,000
I C
Dell Precision
D Mumbai 40,000 26 iPhone X
I I
MacBook Pro
L E Mumbai 55,000 39 iPhone X -
I
You could build several clusters (groups) for analysis. For example:
• Based on region : Bangalore (2 data points), Mumbai (3 data points)
• Based on phone model preference : iPhone X (4 data points), Motorola X (1 data point)
• Based on laptop : MacBook Pro (2 data points), Dell Precision (3 data points)
Based on which cluster could you answer these questions?
1. Is it true that people prefer iPhone X irrespective of their income group or region?
2. Is it true that Dell Precision is a more popular product for laptop?
3. On average, are younger people making more money than older people?
Note here that this might be a simplistic example that you feel you can handle manually. But, for all practical purposes, a dataset contains millions of records with various attributes. Imagine how hard it could be to find clusters that could be more useful to understand the data and run further analysis on it. This is not a trivial problem to solve.
Also note here that clustering is a data exploration method (or a task where clusters are the outcome) and not an algorithm or analytics technique in itself. Some of the common clustering approaches are:
o Hierarchical clustering
o Fuzzy clustering
o Density-based clustering
Clustering is used in recommendation and search engines, marketing, economics, and other branches of science.
Based on the cluster plot and the example given in the previous section, it is evident that clusters typically have the following properties:
1. All the data points in a cluster should be similar to each other :
By definition, a cluster is a grouping of similar objects. So, a good cluster must have data points with some convincing similarities on the basis of which they are grouped. For example, the phone model preference mentioned in the example in the previous section could be a convincing enough cluster.
2. The data points from different clusters should be as different as possible :
To make clusters distinct enough, the data points from different clusters should be far apart and should be clearly distinguishable. For example, a cluster containing all customers whose preferred phone is iPhone X should be clearly distinguishable from another cluster containing customers preferring Motorola X.
5.1.2 Types of Clustering
At a high level, there are two types of clustering:
1. Hard Clustering : Each data point belongs to only one cluster at a time. For example, in the previously given example, customer A could only be in the iPhone X cluster if you clustered based on phone model preference.
2. Soft Clustering : Instead of putting each data point into exactly one cluster, a probability or likelihood of that data point being in each cluster is assigned. For example, customer A would be in both the clusters (iPhone X and Motorola X) with respective probabilities of 1 and 0.
5.1.3 Use Cases (Applications) of Clustering
There are several use cases or applications of clustering techniques. These applications are not specific to K-means or any other clustering algorithm. Depending on the dataset to be clustered and the desired clustering outcome, a suitable algorithm could be chosen.
Some of the common applications of clustering are as following:
1. Customer Segmentation
2. Image Processing
3. Healthcare
4. Recommendation Engines
Fig. 5.1.2 : Use Cases (Applications) of Clustering
1. Customer Segmentation :
• One of the most common applications of clustering is customer segmentation. Various companies create customer segment profiles to understand the customers' shopping behaviour and accordingly use them for predicting demand and improving sales.
• As you saw in the example given in the previous section, various customers could be grouped based on their purchase behaviours, and appropriate and targeted marketing campaigns could be run for those segments. For example, customers who own Phone Model X could be shown other products from the same company.
Fig. 5.1.3
2. Image Processing :
• Additionally, you can use this technique to capture image frames from a video and then determine the objects in the video. Based on the objects, you can categorise the video (for example, a video having animals), infer the probable location of the video (if you could identify a famous location such as the Gateway of India) or identify people in the video. Image and video based analytics have reached a very advanced level these days.
3. Healthcare :
• Patient data, containing attributes such as blood pressure, height, weight, cholesterol level and glucose level, could be used to create clusters and detect early signs of health problems based on historical records of other patients who had similar characteristics and were then diagnosed with a particular health problem. For example, if someone has a high cholesterol level and weight, it might be possible to detect whether she is close to getting a heart attack. There could be several other uses of clustering in the healthcare industry.
• For example, detecting cancer, its type and severity, and the possible treatment plan at an early stage could have a huge impact on the number of years the patient is expected to survive. The shape of the cancerous cells plays a key role in determining the severity of the cancer. Using clustering, you can determine the shape of the cancerous cells.
Fig. 5.1.4
4. Recommendation Engines :
• I am sure that you would have seen video or song recommendations on your favourite media apps, or product recommendations as you browse through an e-commerce portal. Some of the examples are as shown in Fig. 5.1.5.
[Figure: screenshot of an e-commerce portal showing product recommendations related to items viewed]
Fig. 5.1.5
• How does recommendation work? How does it know what you may like? Basically, the organisations have data for you (such as what you bought, what items you looked at, where you clicked, etc.) and several other site or app visitors. Based on how similar you are when compared with others (creating clusters), the collected data is clustered and analysed to answer questions such as "how likely is it that someone who has bought product X will also buy product Y?". If the probability is quite high, then the recommendation is made. For example, if customer A has bought Book 1 and also Book 2 historically, and customer B has just bought Book 1 now, she is recommended that Book 2 might be of interest to her.
5.1.4 K-means
In general, the following steps are taken for clustering the data using K-means clustering:
1. Select the number of clusters, k, that you desire to group your data points into.
2. Select k random data points as the initial centroids.
3. Compute the distance from each data point to each centroid, and assign each data point to the closest centroid. The distance d between any two points (x1, y1) and (x2, y2) is calculated as
   d = √((x1 − x2)² + (y1 − y2)²)
4. Compute the centroids of the newly formed clusters. The centroid (X, Y) of the m data points in a cluster is the simple arithmetic mean of all x coordinates and y coordinates of those m data points:
   (X, Y) = ( (x1 + x2 + ... + xm)/m , (y1 + y2 + ... + ym)/m )
5. Repeat steps 3 and 4 until any of the following criteria is met:
   o The data points remain in the same clusters (i.e., the centroids do not change), or
   o The decided number of iterations is completed.
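The five steps above map almost line-for-line onto code. Below is a minimal illustrative sketch in Python/NumPy (the function name kmeans and its optional init parameter are our own choices for this sketch, not from the text or any library):

```python
import numpy as np

def kmeans(points, k, init=None, iterations=10, seed=0):
    # Step 1: k is supplied by the caller.
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 2: pick k random data points as the initial centroids,
    # or start from caller-supplied centroids.
    if init is None:
        centroids = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(iterations):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: each centroid becomes the arithmetic mean of its points
        # (keep the old centroid if a cluster happens to become empty).
        new = np.array([points[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        # Step 5: stop when the centroids no longer change,
        # or when the decided number of iterations is completed.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

Passing init lets you start from the same centroids as the worked examples that follow, so the results can be compared directly.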
Solved Questions
Ex. 5.1.1 : Use the following data and group them using the K-means clustering algorithm. Show calculation of centroids.
Height | Weight
185    | 72
170    | 56
168    | 60
179    | 68
182    | 72
188    | 77
180    | 71
180    | 70
183    | 84
180    | 88
180    | 67
177    | 76
Soln. :
Here, let us use K-means clustering with the desired number of clusters k = 2 and the number of iterations for centroid calculation and cluster assignment chosen to be 2. Let's put height on the X-axis and weight on the Y-axis. The first plot of data points is as shown in Fig. P. 5.1.1.
[Figure: scatter plot of the 12 height-weight data points]
Fig. P. 5.1.1
Iteration 1 :
• Let's randomly choose two centroids: Centroid 1 = (170, 56) and Centroid 2 = (182, 72).
• Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. The data point is assigned to the cluster based on the closest centroid.
• A sample distance calculation is as following for the first data point, (185, 72):
  Distance from Centroid 1 = √((185 − 170)² + (72 − 56)²) = 21.93
  Distance from Centroid 2 = √((185 − 182)² + (72 − 72)²) = 3.00

Height | Weight | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
185    | 72     | 21.93                    | 3.00                     | 2
170    | 56     | 0.00                     | 20.00                    | 1
168    | 60     | 4.47                     | 18.44                    | 1
179    | 68     | 15.00                    | 5.00                     | 2
182    | 72     | 20.00                    | 0.00                     | 2
188    | 77     | 27.66                    | 7.81                     | 2
180    | 71     | 18.03                    | 2.24                     | 2
180    | 70     | 17.20                    | 2.83                     | 2
183    | 84     | 30.87                    | 12.04                    | 2
180    | 88     | 33.53                    | 16.12                    | 2
180    | 67     | 14.87                    | 5.39                     | 2
177    | 76     | 21.19                    | 6.40                     | 2
Now, you re-calculate the centroids for the next iteration.
• For Centroid 1 calculation, you have two data points that fall in cluster 1. They are (170, 56) and (168, 60).
• Hence, Centroid 1 is the mean of these data points, which is ((170 + 168)/2, (56 + 60)/2) = (169, 58).
• For Centroid 2 calculation, take the remaining 10 data points that fall in cluster 2.
• Hence, Centroid 2 is the mean of these 10 data points, which is
  ((185 + 179 + 182 + 188 + 180 + 180 + 183 + 180 + 180 + 177)/10, (72 + 68 + 72 + 77 + 71 + 70 + 84 + 88 + 67 + 76)/10) = (181.4, 74.5)

Iteration 2 :
Now, let's calculate the distance of the data points from the new centroids and complete the data points table.

Height | Weight | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
185    | 72     | 21.26                    | 4.38                     | 2
170    | 56     | 2.24                     | 21.73                    | 1
168    | 60     | 2.24                     | 19.74                    | 1
179    | 68     | 14.14                    | 6.93                     | 2
182    | 72     | 19.10                    | 2.57                     | 2
188    | 77     | 26.87                    | 7.06                     | 2
180    | 71     | 17.03                    | 3.77                     | 2
180    | 70     | 16.28                    | 4.71                     | 2
183    | 84     | 29.53                    | 9.63                     | 2
180    | 88     | 31.95                    | 13.57                    | 2
180    | 67     | 14.21                    | 7.63                     | 2
177    | 76     | 19.70                    | 4.65                     | 2
You stop here because the number of iterations that you decided is completed, and also you see that the clusters assigned in iteration 1 for the data points did not change.
Hence, you got the two clusters as shown in Fig. P. 5.1.1(a).
[Figure: scatter plot showing Cluster 1 (two points, lower left) and Cluster 2 (ten points, upper right)]
Fig. P. 5.1.1(a)
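As a numerical cross-check of the worked example above, the following snippet (assuming NumPy, and seeding with the same two starting centroids used in Iteration 1) reproduces the distances and the re-computed centroids:

```python
import numpy as np

data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
                 [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])
c1, c2 = np.array([170, 56]), np.array([182, 72])   # chosen starting centroids

# Iteration 1: distance of every point from both centroids, assign to the closer one
d1 = np.linalg.norm(data - c1, axis=1)
d2 = np.linalg.norm(data - c2, axis=1)
cluster = np.where(d1 <= d2, 1, 2)
print(np.round(d1, 2))                   # 21.93, 0.00, 4.47, ... as in the table
print(cluster)                           # [2 1 1 2 2 2 2 2 2 2 2 2]

# Re-computed centroids used in iteration 2
print(data[cluster == 1].mean(axis=0))   # [169. 58.]
print(data[cluster == 2].mean(axis=0))   # [181.4 74.5]
```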
Ex. 5.1.2 : In a survey, the following data was collected for TV viewership. Is there a group which watches more TV than the other?
Age | Number of TV Watching Hours
23  | 3
25  | 2
28  | 2
35  | 4
37  | 5
40  | 4
34  | 3
41  | 5
39  | 4
38  | 3
Soln. :
Here, let us use K-means clustering for grouping the survey results. Let's choose the desired number of clusters k = 2 and the number of iterations for centroid calculation and cluster assignment to be 2 as well. Let's put age on the X-axis and watching hours on the Y-axis.
The first plot of data points is as shown in Fig. P. 5.1.2.
[Figure: scatter plot of the 10 age vs. watching-hours data points]
Fig. P. 5.1.2
Iteration 1 :
• Let's randomly choose two centroids: Centroid 1 = (25, 2) and Centroid 2 = (35, 4).
• Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. The data point is assigned to the cluster based on the closest centroid.
• A sample distance calculation is as following for the first data point, (23, 3):
  Distance from Centroid 1 = √((23 − 25)² + (3 − 2)²) = √5 = 2.24
  Distance from Centroid 2 = √((23 − 35)² + (3 − 4)²) = √145 = 12.04
Age | Hours | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
23  | 3     | 2.24                     | 12.04                    | 1
25  | 2     | 0.00                     | 10.20                    | 1
28  | 2     | 3.00                     | 7.28                     | 1
35  | 4     | 10.20                    | 0.00                     | 2
37  | 5     | 12.37                    | 2.24                     | 2
40  | 4     | 15.13                    | 5.00                     | 2
34  | 3     | 9.06                     | 1.41                     | 2
41  | 5     | 16.28                    | 6.08                     | 2
39  | 4     | 14.14                    | 4.00                     | 2
38  | 3     | 13.04                    | 3.16                     | 2
Now, re-calculate the centroids for the next iteration.
For Centroid 1 calculation, take the three data points that fall in cluster 1:
((23 + 25 + 28)/3, (3 + 2 + 2)/3) = (25.33, 2.33)
For Centroid 2 calculation, take the remaining seven data points that fall in cluster 2:
((35 + 37 + 40 + 34 + 41 + 39 + 38)/7, (4 + 5 + 4 + 3 + 5 + 4 + 3)/7) = (37.71, 4)
Now, you have both the centroids ready for the next and final iteration.
Iteration 2 :
Now, let's calculate the distance of the data points from the new centroids and complete the data points table. The data point is assigned to the cluster based on the closest centroid.

Age | Hours | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
23  | 3     | 2.42                     | 14.74                    | 1
25  | 2     | 0.47                     | 12.87                    | 1
28  | 2     | 2.69                     | 9.91                     | 1
35  | 4     | 9.81                     | 2.71                     | 2
37  | 5     | 11.97                    | 1.23                     | 2
40  | 4     | 14.76                    | 2.29                     | 2
34  | 3     | 8.70                     | 3.84                     | 2
41  | 5     | 15.90                    | 3.44                     | 2
39  | 4     | 13.77                    | 1.29                     | 2
38  | 3     | 12.69                    | 1.04                     | 2

The cluster assignments did not change, and the decided number of iterations is completed, so you stop here. The two clusters are as shown in Fig. P. 5.1.2(a).
[Figure: scatter plot showing Cluster 1 (three points, lower left) and Cluster 2 (seven points, upper right)]
Fig. P. 5.1.2(a)
From the clusters formed, it can be concluded that people in cluster 2 seem to be watching more TV than people in cluster 1.
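As a cross-check, the kmeans sketch from Section 5.1.4 (an assumption: that sketch must be defined first), seeded with the same starting centroids (25, 2) and (35, 4), converges to the same grouping:

```python
survey = [[23, 3], [25, 2], [28, 2], [35, 4], [37, 5],
          [40, 4], [34, 3], [41, 5], [39, 4], [38, 3]]
centroids, labels = kmeans(survey, k=2, init=[[25, 2], [35, 4]])
print(centroids)    # approx. [[25.33, 2.33], [37.71, 4.00]]
print(labels + 1)   # [1 1 1 2 2 2 2 2 2 2]
```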
Ex. 5.1.3 : A bank has received the following loan applications. Which of the applications could be risky to approve?
Credit Score (out of 1000) | Amount in Lakhs
500                        | 10
726                        | 25
430                        | 5
678                        | 15
780                        | 30
380                        | 10
645                        | 15
890                        | 50
900                        | 65
450                        | 10
Soln. :
Let us use K-means clustering for grouping the loan applications, with k = 2 and the number of iterations for centroid calculation and cluster assignment chosen to be 2. Let's put credit score on the X-axis and loan amount on the Y-axis.
The first plot of data points is as shown in Fig. P. 5.1.3.
[Figure: scatter plot of the 10 credit-score vs. loan-amount data points]
Fig. P. 5.1.3
Iteration 1 :
• Let's randomly choose two centroids: Centroid 1 = (500, 10) and Centroid 2 = (726, 25).
• Now, let's calculate the distance of the data points from the chosen centroids and complete the data points table. The data point is assigned to the cluster based on the closest centroid.

Credit Score | Amount | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
500          | 10     | 0.00                     | 226.50                   | 1
726          | 25     | 226.50                   | 0.00                     | 2
430          | 5      | 70.18                    | 296.67                   | 1
678          | 15     | 178.07                   | 49.03                    | 2
780          | 30     | 280.71                   | 54.23                    | 2
380          | 10     | 120.00                   | 346.33                   | 1
645          | 15     | 145.09                   | 81.61                    | 2
890          | 50     | 392.05                   | 165.89                   | 2
900          | 65     | 403.76                   | 178.54                   | 2
450          | 10     | 50.00                    | 276.41                   | 1
Now, re-calculate the centroids for the next iteration.
For Centroid 1 calculation, take the four data points that fall in cluster 1. Hence, Centroid 1 is the mean of these four data points, which is
((500 + 430 + 380 + 450)/4, (10 + 5 + 10 + 10)/4) = (440, 8.75)
For Centroid 2 calculation, take the remaining six data points that fall in cluster 2. Hence, Centroid 2 is the mean of these six data points, which is
((726 + 678 + 780 + 645 + 890 + 900)/6, (25 + 15 + 30 + 15 + 50 + 65)/6) = (769.83, 33.33)
Now, you have both the centroids ready for the next and final iteration.
Iteration 2 :
Centroid 1 = (440, 8.75) and Centroid 2 = (769.83, 33.33).
Now, let's calculate the distance of the data points from the centroids and complete the data points table. The data point is assigned to the cluster based on the closest centroid.

Credit Score | Amount | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
500          | 10     | 60.01                    | 270.84                   | 1
726          | 25     | 286.46                   | 44.61                    | 2
430          | 5      | 10.68                    | 341.01                   | 1
678          | 15     | 238.08                   | 93.64                    | 2
780          | 30     | 340.66                   | 10.70                    | 2
380          | 10     | 60.01                    | 390.53                   | 1
645          | 15     | 205.10                   | 126.17                   | 2
890          | 50     | 451.89                   | 121.32                   | 2
900          | 65     | 463.43                   | 133.97                   | 2
450          | 10     | 10.08                    | 320.68                   | 1
You stop here because the number of iterations that you decided is completed, and also you see that the clusters assigned in iteration 1 for the data points did not change.
Hence, you got the two clusters as shown in Fig. P. 5.1.3(a).
[Figure: scatter plot showing Cluster 1 (four low-credit-score points) and Cluster 2 (six high-credit-score points)]
Fig. P. 5.1.3(a)
From the clusters formed, it can be concluded that people in cluster 2 seem to be less risky for granting loans.

So, now suppose that you have three clusters. The clusters formed and the data points within them would be as shown in Fig. 5.1.6. The inertia (or intracluster distance) of a cluster is the sum of the distances of all its data points from the cluster centroid, and the total inertia is the sum of the individual cluster inertias.
I
,' y
I
I
I
I
I
I
'I
I
'I
I
,,
\ I
\
' ,I
' ',
' ',,
.... .... -
Fig. 5.1.6
Your goal is to ensure that the final clusters created have minimal total inertia, so that you arrive at the optimum number of clusters.
Steps to determine the right number of clusters:
1. You start with any value of k randomly and perform K-means analysis.
2. Once the analysis is done, determine the total inertia of the clusters that were created.
3. Increase the number k by 1 and carry out steps 1 and 2 until the change in total inertia is not significant.
For example, suppose you plot the inertia value against the value of k, and it plots as shown in Fig. 5.1.7.
[Figure: inertia plotted against the number of clusters k; the curve drops steeply at first and then flattens, forming an "elbow"]
Fig. 5.1.7
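In practice, this loop is usually not coded by hand; for instance, scikit-learn's KMeans exposes the fitted inertia directly. A minimal sketch follows (note one assumption: scikit-learn defines inertia_ as the sum of squared distances to the closest centroid, a slight variation on the sum-of-distances definition used in this text, but the elbow is read the same way):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

# Steps 1-3 of the elbow method: run K-means for increasing k,
# record total inertia, and look for the k where the drop flattens.
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 2))
```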
Let's see a few examples of cluster inertia calculation.
Practice Questions
Ex. 5.1.4 : Calculate the inertia for the following cluster. Assume the 2nd data point to be the centroid of the cluster.

Age | Income in Thousand
33  | 12
33  | 15
35  | 13
34  | 14
32  | 16
Soln. :
To calculate the inertia (or intracluster distance) of a cluster, you need to find the distance of all data points from the centroid of the cluster and add them. Assume age on the X-axis and income on the Y-axis. The centroid is the 2nd data point, (33, 15). Now, let's calculate the distance of each data point from the centroid using
d = √((x1 − x2)² + (y1 − y2)²)

Age | Income in Thousand | Distance from Centroid (33, 15)
33  | 12                 | 3.00
33  | 15                 | 0.00
35  | 13                 | 2.83
34  | 14                 | 1.41
32  | 16                 | 1.41
Total (inertia)          | 8.65
"'
t \ I
"Lc,irn1ng (Sl' Pll)
.,_·lh11i:tll:.:-~~--::~-:~~-.-c===:...-"'=..,,;:,!1~•l~'''.________________!!i!!liiiiii!i~:!!~~~~~~
Two clusters h,.we tho followinc1 (iHlll flc)lr,t•, C
11 113 data point assrgnmont
,iln1l11lrJ tluJlr lnlrtu h1t1lot rllol1111r o /\I " • 0 ,1, ul
,,, tlrt ur
,to 111 1 11 111 "
170 ',h
168 CJO
llCJ b8 ')_
182 72 ')_
188 71 i!
180 71 2
180 70 l.
183 84 2
180 88 2
180 67 2
177 76 2
Soln. :
To calculate the inertia (or intracluster distance) of a cluster, you need to find the distance of all data points from the centroid of the cluster and add them. In the question, centroids for the respective clusters are not given. Hence, first you need to calculate the centroids for the respective clusters.
For Centroid 1 calculation, you have two data points that fall in cluster 1. They are (170, 56) and (168, 60). Hence, Centroid 1 = (169, 58).
For Centroid 2 calculation, take the remaining 10 data points that fall in cluster 2. Their mean gives Centroid 2 = (181.4, 74.5).
Now, you have both the centroids ready for calculating the respective cluster inertia. Let's calculate the distance of each data point from its cluster centroid.

Cluster 1 (Centroid = (169, 58)):
Height | Weight | Distance from Centroid 1
170    | 56     | 2.24
168    | 60     | 2.24
Total           | 4.48
Cluster 2 (Centroid = (181.4, 74.5)):
Height | Weight | Distance from Centroid 2
185    | 72     | 4.38
179    | 68     | 6.93
182    | 72     | 2.57
188    | 77     | 7.06
180    | 71     | 3.77
180    | 70     | 4.71
183    | 84     | 9.63
180    | 88     | 13.57
180    | 67     | 7.63
177    | 76     | 4.65
Total           | 64.91
Hence,
• The inertia for cluster 1 = 4.48
• The inertia for cluster 2 = 64.91
• The total inertia is the sum of the inertia of all clusters = 4.48 + 64.91 = 69.39
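The same calculation, expressed as a short sketch (the helper name cluster_inertia is our own; it uses the sum-of-distances definition of inertia from this exercise):

```python
import numpy as np

def cluster_inertia(points, centroid):
    # Sum of distances of all data points from the cluster centroid.
    points = np.asarray(points, dtype=float)
    return np.linalg.norm(points - np.asarray(centroid), axis=1).sum()

cluster1 = [[170, 56], [168, 60]]
cluster2 = [[185, 72], [179, 68], [182, 72], [188, 77], [180, 71],
            [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]]

i1 = cluster_inertia(cluster1, np.mean(cluster1, axis=0))   # centroid (169, 58)
i2 = cluster_inertia(cluster2, np.mean(cluster2, axis=0))   # centroid (181.4, 74.5)
# Prints approx. 4.47, 64.91 and 69.38; the text rounds each distance
# to two decimals before summing, giving 4.48 and 69.39.
print(round(i1, 2), round(i2, 2), round(i1 + i2, 2))
```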
For example, the following clusters might be clearly distinguishable and adequately populated with the data:
[Figure: two well-separated, well-populated groups of points labelled Cluster 1 and Cluster 2]
Fig. 5.1.9
You need to apply judgement to see if increasing the number of clusters or changing any other attribute could come up with better clusters.
The data points could have several attributes such as age, weight, height, income, etc. Once your clustering is complete, you need to ensure that the attributes would be available for new data points as and when you run the analysis. For example, if your existing clustering is based on customer ratings of a product, the rating may not be immediately available for a customer who has just completed the purchase. It might even be a month before the customer is comfortable rating the product judiciously.
Consider the age and income data again, clustered with Centroid 1 = (33, 15) and Centroid 2 = (34, 14). With income expressed in thousands, the distances are:

Age | Income in Thousand | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
33  | 12                 | 3.00                     | 2.24                     | 2
33  | 15                 | 0.00                     | 1.41                     | 1
35  | 13                 | 2.83                     | 1.41                     | 2
34  | 14                 | 1.41                     | 0.00                     | 2
32  | 16                 | 1.41                     | 2.83                     | 1

With income expressed in actual units, the distances become:

Age | Income | Distance from Centroid 1 | Distance from Centroid 2 | Assigned Cluster
33  | 12000  | 3000.00                  | 2000.00                  | 2
33  | 15000  | 0.00                     | 1000.00                  | 1
35  | 13000  | 2000.00                  | 1000.00                  | 2
34  | 14000  | 1000.00                  | 0.00                     | 2
32  | 16000  | 1000.00                  | 2000.00                  | 1
While you might argue that the cluster assignment has not changed, if you look closely, the distances from the centroids are far larger and dominated by the income attribute. This could have an impact on inertia and might lead to misleading results when attributes are on very different scales.
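A common remedy is to bring all attributes to a comparable scale before computing distances. The sketch below uses z-score standardisation (one of several options; the text does not prescribe a specific scaling method):

```python
import numpy as np

X = np.array([[33, 12000], [33, 15000], [35, 13000],
              [34, 14000], [32, 16000]], dtype=float)

# Z-score standardisation: each attribute gets mean 0 and standard
# deviation 1, so income no longer dominates the Euclidean distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.round(X_scaled, 2))
```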
• One of the popular variations of the K-means clustering algorithm is the k-Nearest Neighbours (kNN) classification algorithm.
• It works on the same principle as the K-means clustering algorithm: a data point is similar to its neighbours and would possibly have the same classification. Note, however, that kNN is a supervised learning algorithm.
• The way it works is simple and straightforward:
1. Assume that you have already assigned labels to the existing data points in a given data set.
2. Then you are given a new data point to classify.
3. You calculate the distance of the new data point with respect to its k neighbours. The number k depends on your data set and requirement, but in general, the higher the better, to reduce the noise and avoid misclassification.
4. The new data point is classified based on the classification of the majority of the k nearest neighbours.
Weight (Kgs) | Like Pizza?
78           | Yes
54           | No
69           | Yes
73           | Yes
59           | No
48           | No
82           | No
65           | Yes

Using the k-Nearest Neighbours (kNN) classification algorithm, determine if a teen weighing 63 Kgs is likely to like pizza. Use k = 3.
Calculate the distance of the new data point (63 Kgs) with respect to the other data points in the data set. With a single attribute, the distance reduces to
d = √((x1 − x2)²) = |x1 − x2|

Weight (Kgs) | Distance from 63 Kgs | Like Pizza?
78           | 15                   | Yes
54           | 9                    | No
69           | 6                    | Yes
73           | 10                   | Yes
59           | 4                    | No
48           | 15                   | No
82           | 19                   | No
65           | 2                    | Yes
Sort the table based on the distance:

Weight (Kgs) | Distance from 63 Kgs | Like Pizza?
65           | 2                    | Yes
59           | 4                    | No
69           | 6                    | Yes
54           | 9                    | No
73           | 10                   | Yes
78           | 15                   | Yes
48           | 15                   | No
82           | 19                   | No
Now, given that k = 3, take the three nearest neighbours: 65 Kgs (Yes), 59 Kgs (No) and 69 Kgs (Yes). The majority classification among them is "Yes". Hence, the teen weighing 63 Kgs is likely to like pizza.
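The whole kNN procedure for this example fits in a few lines. A minimal sketch (the function name knn_classify is illustrative):

```python
import numpy as np

weights = np.array([78, 54, 69, 73, 59, 48, 82, 65])
likes   = np.array(["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes"])

def knn_classify(x, k=3):
    distances = np.abs(weights - x)         # 1-D Euclidean distance
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    labels, counts = np.unique(likes[nearest], return_counts=True)
    return labels[counts.argmax()]          # majority vote

print(knn_classify(63, k=3))   # "Yes": neighbours are 65 (Yes), 59 (No), 69 (Yes)
```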
5.2 K-medoids
• In contrast to the K-means algorithm, which finds a point designated as the centre of a cluster, k-medoids chooses actual input data points as centres (called medoids or exemplars), and thereby allows for greater interpretability of the cluster centres than in K-means, where the centre of a cluster is not necessarily one of the input data points (it is the average of the points in the cluster).
• For example, Fig. 5.2.1 illustrates the difference between the K-means and k-medoids approaches.
[Figure: the same group of points shown twice, with (a) its mean and (b) its medoid marked]
Fig. 5.2.1 : (a) Mean (b) Medoid
• The group of points on the left forms a cluster, while the rightmost point is an outlier. The mean is greatly influenced by the outlier and thus cannot represent the correct cluster centre, while the medoid is robust to the outlier and correctly represents the cluster centre.
• The algorithm works as follows:
o Build-step : Select k data points from the input data set as medoids to start with.
o Swap-step : Within each cluster, each point is tested as a potential medoid by checking if the sum of within-cluster distances gets smaller using that point as the medoid. If so, the point is defined as a new medoid. Every point is then assigned to the cluster with the closest medoid.
o The algorithm iterates the build and swap steps until the medoids do not change or other termination criteria are met.
• Note here that a k-medoids implementation is usually a greedy algorithm, following the heuristic of making the locally optimal choice at each stage. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. So, k-medoids may not always give you perfectly clustered data points, but it does a great job of quickly clustering the data points.
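A greedy, PAM-style sketch of the build and swap steps described above (illustrative only; production implementations of PAM include further refinements):

```python
import numpy as np

def k_medoids(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Build-step: select k actual input data points as the initial medoids.
    medoids = rng.choice(len(points), size=k, replace=False)
    for _ in range(iterations):
        # Assign every point to the cluster of its closest medoid.
        d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_medoids = medoids.copy()
        # Swap-step: within each cluster, make the member that minimises the
        # sum of within-cluster distances the new medoid.
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            costs = [np.linalg.norm(points[members] - points[m], axis=1).sum()
                     for m in members]
            new_medoids[c] = members[int(np.argmin(costs))]
        # Terminate when the medoids no longer change.
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    # Final assignment against the final medoids.
    d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
    return points[medoids], d.argmin(axis=1)
```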
Let's see a solved example to learn how the k-medoids algorithm (PAM implementation) works.