
Fundamentals of Data Science

Module 4:
Algorithms for Data Science:

4. Algorithms for Data Science: basic algorithms under supervised and unsupervised learning methods, Ensemble Learning and Time Series Modeling. (9 hours, CO5)
MACHINE LEARNING: overview (redrawn from the hand-drawn map)

1) SUPERVISED LEARNING
   - Regression (uses continuous / numerical data): Linear, Polynomial, Decision Tree, Random Forest.
   - Classification (uses categorical data): kNN, Classification Trees, Logistic Regression, Naive Bayes, SVM.
   - Real-time examples: fraud detection, face recognition, e-mail spam detection, image classification, stock prediction, risk assessment, diagnosis.
2) UNSUPERVISED LEARNING
   - Clustering: SVD, PCA, k-means.
   - Association Analysis: Apriori, FP-Growth.
   - Hidden Markov Model.
   - Real-time examples: text mining, big data visualisation, medical image recognition, targeted marketing.
3) REINFORCEMENT LEARNING
   - Dynamic Programming, Brute Force.
   - Real-time examples: gaming, finance sector, manufacturing, inventory management, robot navigation.

Top Machine Learning Algorithms used in Data Science

Linear Regression:
Linear regression is a linear modelling approach to find the relationship between one or more independent variables (predictors), denoted as X, and a dependent variable (target), denoted as Y.

Ex: Predict the sales of ice cream based on temperature.
    X = temperature, Y = sales.

A scatter plot (sales on the y-axis, temperature on the x-axis) shows how the data looks. A line has to be drawn such that the distance of each of these points from the line is minimum; such a line is known as the regression line. It is called linear regression because the equation of the best fit line is a straight line: it does not have any non-linear component, which means we do not have X raised to any power other than one.

Simple linear regression: when there is only one X, i.e., only one independent variable, it is called simple linear regression.
Linear regression is all about finding the best fit line, and it is done in a recursive manner. First a random line is drawn, and the distance of all the points from the line is calculated. To ensure that there are no negative values, the distance of each point from the line is squared; the squares of the distances are then added up. This total is known as the sum of squared errors, and it must be made the minimum: the best fit regression line is the one for which the sum of squared distances between the data points and the line is minimum. It is an iterative process.

Uses of regression analysis:

1) Determining the strength of predictors.
   Ex: What is the strength of the relationship between sales and marketing spending?
2) Forecasting an effect.
   Ex: How much does the dependent variable change with a change in one or more independent variables?
3) Trend forecasting.
   Ex: What will the price be after some months?
When to use the linear regression algorithm:

1) When the available data is continuous and purely numeric.
   Ex: to predict the temperature of a city, the sales of a company, stocks.
2) Data quality: use a dataset with few missing values.
3) It is not computationally expensive.
4) Linear regression uses a simple mathematical equation and is easily understandable.
Understanding the algorithm:
The independent variable is plotted on the x-axis and the dependent variable on the y-axis.

1) If the data points increase along the x-axis and so does the dependent variable on the y-axis, for this case we would get a +ve LR line.
2) If the data points increase along the x-axis while the dependent variable on the y-axis is decreasing, we get a -ve LR line.
The +ve LR line, which shows the relationship between X and Y, is called the linear regression line. The equation is

    y = mx + c,   where c = y-intercept of the line and m = slope of the line.

Understanding by adding data points: once the best fit line is drawn, every data point has an estimated (predicted) value on the line and an actual value, and the gap between them is the error. The main goal is to reduce the error, i.e., the distance between the estimated value and the actual value. The best fit line would be the one which has the least error.
Mathematical implementation:

    x:  1  2  3  4  5
    y:  3  4  2  4  5

    mean of x = (1+2+3+4+5)/5 = 3
    mean of y = (3+4+2+4+5)/5 = 3.6

So we found a point (x̄, ȳ) = (3, 3.6) that our regression line must pass through. According to the equation of the regression line, y = mx + c, we now have to find m and c, where

    m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   x̄ = mean = 3, ȳ = mean = 3.6.

The distances of all the points from the means:

    x   y   (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²
    1   3     −2       −0.6          1.2              4
    2   4     −1        0.4         −0.4              1
    3   2      0       −1.6          0                0
    4   4      1        0.4          0.4              1
    5   5      2        1.4          2.8              4
                                 Σ = 4.0          Σ = 10

    m = 4/10 = 0.4

Let us put the point (3, 3.6) in the equation to find c:

    3.6 = 0.4 × 3 + c
    c = 3.6 − 1.2 = 2.4

    y = 0.4x + 2.4  (this is the regression line)

For x = 1, 2, 3, 4, 5 we can now predict values of y with m = 0.4 and c = 2.4:

    0.4 × 1 + 2.4 = 2.8
    0.4 × 2 + 2.4 = 3.2
    0.4 × 3 + 2.4 = 3.6
    0.4 × 4 + 2.4 = 4.0
    0.4 × 5 + 2.4 = 4.4

Let us plot the graph of this line over the data points, calculate the distance between the actual values and the predicted values, and keep reducing the error between them: the line with the least error between the actual and predicted values will be the line of regression, i.e., the best fit line.
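The same arithmetic can be checked with a short Python sketch (the code and variable names are mine, not part of the notes):

    # Least-squares fit for the worked example (x = 1..5, y = 3, 4, 2, 4, 5).
    xs = [1, 2, 3, 4, 5]
    ys = [3, 4, 2, 4, 5]

    x_mean = sum(xs) / len(xs)                 # 3.0
    y_mean = sum(ys) / len(ys)                 # 3.6

    # m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
    c = y_mean - m * x_mean                    # line passes through the mean point

    print(m, c)                                # 0.4 2.4
    print([m * x + c for x in xs])             # [2.8, 3.2, 3.6, 4.0, 4.4]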
Now, having calculated the best fit line, we check how good our model is performing. To do that there is a method called R-squared.

R-squared method: it is a statistical measure of how close the data are to the fitted regression line; it is also known as the coefficient of determination. Let us see how R-squared is calculated for the predicted values above (ȳ = 3.6):

    x   y    yp    yp − ȳ   (yp − ȳ)²   y − ȳ   (y − ȳ)²
    1   3    2.8    −0.8      0.64      −0.6      0.36
    2   4    3.2    −0.4      0.16       0.4      0.16
    3   2    3.6     0        0         −1.6      2.56
    4   4    4.0     0.4      0.16       0.4      0.16
    5   5    4.4     0.8      0.64       1.4      1.96
                          Σ = 1.6               Σ = 5.2

Formula (distance predicted-to-mean over distance actual-to-mean):

    R² = Σ(yp − ȳ)² / Σ(y − ȳ)² = 1.6 / 5.2 ≈ 0.3

Is 30% bad? The data points are very far away from the regression line, so this is not a good value. However, a low R-square is not always bad: in some areas (e.g., psychology, where human behaviour is involved) a low R-square is expected, and the values will be low.
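Continuing the sketch above, R² comes out the same way in Python:

    # R-squared for the fitted line y = 0.4x + 2.4 over the same five points.
    xs = [1, 2, 3, 4, 5]
    ys = [3, 4, 2, 4, 5]
    y_mean = sum(ys) / len(ys)                          # 3.6
    y_pred = [0.4 * x + 2.4 for x in xs]

    ss_pred = sum((yp - y_mean) ** 2 for yp in y_pred)  # 1.6
    ss_actual = sum((y - y_mean) ** 2 for y in ys)      # 5.2
    print(round(ss_pred / ss_actual, 2))                # ~0.31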
Logistic Regression:
Logistic regression is one of the most popular machine learning algorithms for binary classification. It is a supervised ML algorithm, and it predicts categorical values, such as Yes/No or True/False.

[Graph: tumour size on the x-axis, probability on the y-axis; a best fit line with a threshold at 0.5 separating Malignant (y = 1) from Benign (y = 0).]

Equation of the line (or hypothesis equation), taken from linear regression:

    h(x) = β₀ + β₁x,   where x is the independent variable and β₁ the slope.

Condition for the prediction:

    if h(x) ≥ 0.5, y = 1 (Malignant)
    if h(x) < 0.5, y = 0 (Benign)
When the graph is drawn we can see the best fit line; a few data points are above and a few below the threshold line, so we can easily say malignant or benign. Looking at the previous graph, the 0.5 threshold point is projected onto the line, and we can easily categorise the patient based on the above condition.

But when the training data is increased, the best fit line changes its direction, and with it the projection of the threshold. So, because of one more tumour sample in the training dataset, the change in the projection becomes a problem: a straight line cannot be a solution here, and this is where logistic regression comes into the picture.
The 3 steps of logistic regression:

1) Calculate the logistic of the X and Y variables using the sigmoid function and plot the graph.
2) Learn the coefficients using stochastic gradient descent and make the model ready.
3) Then, using the model, make the prediction.
Q: Dataset: a training set of 500 tumour records with two features, X1 (length) and X2 (width), and a class label Y. The first few rows:

    X1      X2      Y
    2.7     2.5     0
    1.4     2.3     0
    3.3     4.4     0
    3.06    3.005   0
    5.3     2.75    1

Plot the graph. LoR can be applied only if the data is linearly separable.
Step 1: We need to transform the data points using the logit (sigmoid) function, which is given by

    σ(x) = 1 / (1 + e⁻ˣ).

The above equation helps to transform any input value into the range 0 to 1. So the line equation can be transformed to

    P(class = 1) = 1 / (1 + e^−(b₀ + b₁x₁ + b₂x₂)).

If P(class = 1) ≥ 0.5, then classify as Malignant, else classify as Benign.
Step 2: We need to find the model parameters b₀, b₁, b₂. They will be calculated using the training set input and SGD (explained in class). Initially we assume

    b₀ = 0.0,  b₁ = 0.0,  b₂ = 0.0,

and we calculate the prediction using the above values for each training example. For x₁ = 2.7, x₂ = 2.5, y = 0:

    prediction = 1 / (1 + e^−(0.0 + 0.0×2.7 + 0.0×2.5)) = 0.5

So ultimately the solution to the problem is that we have to find good coefficients. The formula for updating the values is

    b = b + α × (y − prediction) × prediction × (1 − prediction) × x,

where b can be b₀, b₁ or b₂ (x is the corresponding input value, taken as 1 for b₀), and α (alpha) is the learning rate that is fixed at the beginning of training; it controls how much the model parameters change on each training example. The α value typically ranges between 0.1 and 0.3.

With α = 0.3 and the first training example, the updated values are

    b₀ = −0.0375,  b₁ ≈ −0.101,  b₂ ≈ −0.094.

Plug these into the equation

    prediction = 1 / (1 + e^−(b₀ + b₁x₁ + b₂x₂)),

and using this predicted value calculate the model parameters again and update them. Calculating the predicted values for all the training examples using the model parameters is called an epoch; we can go for n number of epochs until we get a good accuracy result.
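A minimal Python sketch of these three steps, assuming the small rounded dataset shown above (the very first update reproduces b₀ = −0.0375; the final coefficients are illustrative only):

    import math

    # (x1, x2, y) rows from the table above.
    data = [(2.7, 2.5, 0), (1.4, 2.3, 0), (3.3, 4.4, 0),
            (3.06, 3.005, 0), (5.3, 2.75, 1)]

    b0 = b1 = b2 = 0.0          # step 2: coefficients start at zero
    alpha = 0.3                 # learning rate fixed up front

    def predict(x1, x2):
        # Step 1: sigmoid of the linear combination.
        return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

    for epoch in range(100):    # one epoch = one pass over the training set
        for x1, x2, y in data:
            p = predict(x1, x2)
            # b = b + alpha*(y - p)*p*(1 - p)*x, with x = 1 for the bias b0.
            b0 += alpha * (y - p) * p * (1 - p)
            b1 += alpha * (y - p) * p * (1 - p) * x1
            b2 += alpha * (y - p) * p * (1 - p) * x2

    print(b0, b1, b2)                                       # learned coefficients
    print([round(predict(x1, x2)) for x1, x2, _ in data])   # step 3: predict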

[The LoR graph looks like an S-shaped (sigmoid) curve.]
K-means Clustering Algorithm:
The k-means clustering algorithm is used to divide the data into k clusters.

Dataset: a1(1,1), a2(2,1), a3(2,3), a4(3,2), a5(4,3), a6(5,5);  k = 2.

Step 1: Selecting the cluster centres randomly: P1 = a2(2,1) and P2 = a3(2,3).

Step 2: Finding the distance between each data point and the cluster centres. The distance is found using the Euclidean distance

    d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)²).

For example, the distance of a1(1,1) from P1(2,1) is √((2−1)² + (1−1)²) = 1, and from P2(2,3) it is √((2−1)² + (3−1)²) = √5 = 2.24. Check the table:

    Data point   Distance from P1(2,1)   Distance from P2(2,3)   Assigned centre
    a1(1,1)             1.00                    2.24                  P1
    a2(2,1)             0                       2.00                  P1
    a3(2,3)             2.00                    0                     P2
    a4(3,2)             1.41                    1.41                  P1
    a5(4,3)             2.83                    2.00                  P2
    a6(5,5)             5.00                    3.61                  P2

The assigned centre is calculated by comparing the distances from P1 and P2 and picking whichever is the least (a tie, as for a4, goes to the first centre).

Step 3: Clusters have to be formed:

    Cluster P1 = {a1, a2, a4},  Cluster P2 = {a3, a5, a6}.

Step 4: Re-calculate the cluster centres by finding the average of the clusters:

    P1 = ((1,1) + (2,1) + (3,2)) / 3 = (6,4)/3 = (2, 1.33)
    P2 = ((2,3) + (4,3) + (5,5)) / 3 = (11,11)/3 = (3.67, 3.67)

Step 5: Repeat steps 2-4 until we get the same cluster elements as in the previous iteration. Distance table:
    Data point   Distance from P1(2,1.33)   Distance from P2(3.67,3.67)   Assigned centre
    a1(1,1)              1.05                        3.78                      P1
    a2(2,1)              0.33                        3.15                      P1
    a3(2,3)              1.67                        1.80                      P1
    a4(3,2)              1.20                        1.80                      P1
    a5(4,3)              2.61                        0.75                      P2
    a6(5,5)              4.74                        1.88                      P2

Clusters: P1 = {a1, a2, a3, a4},  P2 = {a5, a6}.

As per the rule, the previous and present cluster elements are different, so new cluster centres have to be formed. Re-calculating the cluster centres:

    P1 = ((1,1) + (2,1) + (2,3) + (3,2)) / 4 = (8,7)/4 = (2, 1.75)
    P2 = ((4,3) + (5,5)) / 2 = (9,8)/2 = (4.5, 4)

Even the centres are not the same as the previous ones, so the step has to be repeated.

    Data point   Distance from P1(2,1.75)   Distance from P2(4.5,4)   Assigned centre
    a1(1,1)              1.25                       4.61                   P1
    a2(2,1)              0.75                       3.91                   P1
    a3(2,3)              1.25                       2.69                   P1
    a4(3,2)              1.03                       2.50                   P1
    a5(4,3)              2.36                       1.12                   P2
    a6(5,5)              4.42                       1.12                   P2

Clusters: P1 = {a1, a2, a3, a4},  P2 = {a5, a6}.

The cluster elements are the same as in the previous iteration, so the iteration stops:

    Cluster 1 = {(1,1), (2,1), (2,3), (3,2)},  Cluster 2 = {(4,3), (5,5)}.
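A plain-Python sketch of the same iteration (assuming the two initial centres used above):

    # k-means on the six points from the worked example, k = 2.
    points = [(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)]
    centres = [(2.0, 1.0), (2.0, 3.0)]        # step 1: initial centres a2, a3

    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    while True:
        # step 2: assign each point to the nearest centre (ties to the first)
        clusters = [[], []]
        for p in points:
            i = 0 if dist(p, centres[0]) <= dist(p, centres[1]) else 1
            clusters[i].append(p)
        # step 3: recompute each centre as the average of its cluster
        new_centres = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) for c in clusters]
        if new_centres == centres:            # step 4: stop when nothing moves
            break
        centres = new_centres

    print(clusters)   # [[(1,1), (2,1), (2,3), (3,2)], [(4,3), (5,5)]]
    print(centres)    # [(2.0, 1.75), (4.5, 4.0)]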


KNN Algorithm (K-Nearest Neighbour Algorithm):
KNN is a simple algorithm that stores all the available cases and classifies the new data or case based on a similarity measure.

Ex: recommender systems. KNN in general is also used as a search algorithm when we are searching for similar items.

What is K in KNN? K is the number of nearest neighbours considered. When K = 1, the testing data is assigned the class of the single closest example in the training set. If K = 3, the classes of the 3 closest examples will be checked, and the most common label among the 3 closest is assigned to the testing data.
Solving an example: Rashmika is 5 years old and female. What kind of sport does Rashmika like, out of the following table?

    Name       Age   Gender   Sport
    Ram        32    M        Football
    Seetha     35    F        Neither
    John       16    F        Cricket
    Dhoni      14    M        Cricket
    Lakshmi    55    F        Neither
    Sachin     40    M        Cricket
    Virat      20    F        Neither
    Lakshman   15    M        Cricket
    Rahul      55    F        Football
    SRK        15    M        Football

Predicting the class of sport for Rashmika (age 5, female):

Step 1: Let us assume a value of K.

Step 2: Convert the data into numeric form. As the Age is already numeric, only Gender has to be converted. So we will consider M = 0, F = 1.

Step 3: What the KNN algorithm does now is find the distance between Rashmika and the other reference data in the table. To find the distance between her and each row, it uses the formula

    d = √((x₂ − x₁)² + (y₂ − y₁)²),

which is called the Euclidean distance.

Step 4: Calculate the distance between Rashmika and all the other members in the table.
For Rashmika, x = Age (5) and y = Gender female (1); for Ram, 32 and Male (0):

    Distance = √((5 − 32)² + (1 − 0)²) = √(729 + 1) = 27.02

The distance is calculated the same way for all the rows.
Prediction of KNN:

    Name       Age   Gender   Sport      Distance
    Ram        32    M        Football   27.02
    Seetha     35    F        Neither    30.00
    John       16    F        Cricket    11.00
    Dhoni      14    M        Cricket    9.06
    Lakshmi    55    F        Neither    50.00
    Sachin     40    M        Cricket    35.01
    Virat      20    F        Neither    15.00
    Lakshman   15    M        Cricket    10.05
    Rahul      55    F        Football   50.00
    SRK        15    M        Football   10.05
Step 5: We assume the value of K as 3, so here we will look into the 3 closest values to Rashmika. From the above table, the closest values to Rashmika are Dhoni (9.06), Lakshman (10.05) and SRK (10.05); these 3 are called the nearest neighbours.

Step 6: From the distances found in the above table, Dhoni likes cricket, Lakshman likes cricket and SRK likes football. Since the class cricket is most common among these 3 nearest neighbours, KNN will predict that Rashmika is in the class of people who like cricket.
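A short sketch of the same prediction (the table rows are as reconstructed above):

    from collections import Counter

    # (name, age, gender, sport); gender coded M = 0, F = 1.
    people = [
        ("Ram", 32, 0, "Football"),      ("Seetha", 35, 1, "Neither"),
        ("John", 16, 1, "Cricket"),      ("Dhoni", 14, 0, "Cricket"),
        ("Lakshmi", 55, 1, "Neither"),   ("Sachin", 40, 0, "Cricket"),
        ("Virat", 20, 1, "Neither"),     ("Lakshman", 15, 0, "Cricket"),
        ("Rahul", 55, 1, "Football"),    ("SRK", 15, 0, "Football"),
    ]
    age, gender = 5, 1                   # Rashmika

    # Sort by Euclidean distance on (age, gender) and keep the 3 nearest.
    nearest = sorted(people, key=lambda p: ((p[1] - age) ** 2 +
                                            (p[2] - gender) ** 2) ** 0.5)[:3]

    # Majority vote among the k = 3 nearest neighbours -> Cricket.
    print(Counter(p[3] for p in nearest).most_common(1)[0][0])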
Naive Bayes Classifier Algorithm:
It is a supervised machine learning algorithm with a classification technique, built on Bayes' theorem.

Bayes' theorem: given a hypothesis (H) and evidence (E), Bayes' theorem states that the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E), is

    P(H|E) = P(E|H) · P(H) / P(E).

Derivation:

    P(A|B) = P(A ∩ B) / P(B),   P(B|A) = P(B ∩ A) / P(A)
    P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)
    ⇒ P(A|B) = P(B|A) · P(A) / P(B)
Dataset:

    Day   Outlook    Temp   Humidity   Wind     Play Cricket
    1     Sunny      Hot    High       Weak     No
    2     Sunny      Hot    High       Strong   No
    3     Overcast   Hot    High       Weak     Yes
    4     Rain       Mild   High       Weak     Yes
    5     Rain       Cool   Normal     Weak     Yes
    6     Rain       Cool   Normal     Strong   No
    7     Overcast   Cool   Normal     Strong   Yes
    8     Sunny      Mild   High       Weak     No
    9     Sunny      Cool   Normal     Weak     Yes
    10    Rain       Mild   Normal     Weak     Yes
    11    Sunny      Mild   Normal     Strong   Yes
    12    Overcast   Mild   High       Strong   Yes
    13    Overcast   Hot    Normal     Weak     Yes
    14    Rain       Mild   High       Strong   No


Find the probability of playing cricket on the 15th day, where the conditions are:

    Outlook = Sunny, Temp = Cool, Humidity = High, Wind = Strong.

Total no. of rows = 14, out of which Yes = 9 and No = 5, so

    P(Yes) = 9/14,  P(No) = 5/14.

Wind (14): Weak (8): Yes 6, No 2;  Strong (6): Yes 3, No 3.
    P(Weak|Yes) = 6/9, P(Strong|Yes) = 3/9, P(Weak|No) = 2/5, P(Strong|No) = 3/5.

Humidity (14): High (7): Yes 3, No 4;  Normal (7): Yes 6, No 1.
    P(High|Yes) = 3/9, P(Normal|Yes) = 6/9, P(High|No) = 4/5, P(Normal|No) = 1/5.

Temp (14): Hot (4): Yes 2, No 2;  Mild (6): Yes 4, No 2;  Cool (4): Yes 3, No 1.
    P(Hot|Yes) = 2/9, P(Mild|Yes) = 4/9, P(Cool|Yes) = 3/9,
    P(Hot|No) = 2/5, P(Mild|No) = 2/5, P(Cool|No) = 1/5.

Outlook (14): Sunny (5): Yes 2, No 3;  Overcast (4): Yes 4, No 0;  Rain (5): Yes 3, No 2.
    P(Sunny|Yes) = 2/9, P(Overcast|Yes) = 4/9, P(Rain|Yes) = 3/9,
    P(Sunny|No) = 3/5, P(Overcast|No) = 0, P(Rain|No) = 2/5.
For X = (Sunny, Cool, High, Strong):

    P(X|Yes) · P(Yes) = P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) · P(Yes)
                      = (2/9) × (3/9) × (3/9) × (3/9) × (9/14) ≈ 0.0053

    P(X|No) · P(No) = P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No) · P(No)
                    = (3/5) × (1/5) × (4/5) × (3/5) × (5/14) ≈ 0.0206

Compare the values for Yes and No: since the value of No is greater than that of Yes, on the 15th day we should not go to play cricket. (Ans)
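The same comparison in a few lines of Python, using the probabilities read off the tables above:

    # Naive Bayes score for day 15: (Sunny, Cool, High, Strong).
    p_yes = (9 / 14) * (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)   # ~0.0053
    p_no  = (5 / 14) * (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)   # ~0.0206
    print("Play" if p_yes > p_no else "Don't play")            # Don't play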
Real-time examples: e-mail spam detection and filtering, weather forecasting, medical diagnosis, object detection, news categorisation.
Principal Component Analysis:
Principal Component Analysis (PCA) is a dimensionality-reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without loss of any important information.

    Original data  →  new data space with fewer features that retain most of the information.

It helps to:
1) remove inconsistencies,
2) remove redundant data,
3) remove highly correlated features.

PCA is used when ML algorithms cannot handle high-dimensional data.
Step-by-step PCA:

1) Standardisation of the data: it is all about scaling the data in such a way that all the variables and their values lie within a similar range.

Ex: movie ratings range between 0 and 5, while the number of downloads ranges between 1000 and 5000. The two features have different ranges, and if used as they are, the output will be biased towards the feature with the larger range; so both are scaled first.
2) Computing the covariance matrix: a covariance matrix expresses the correlation between the different variables in the data. It is essential to identify heavily dependent variables, because they contain biased and redundant information which reduces the overall performance of the model.
Data (two attributes):

    X:  2.5  0.5  2.2  1.9  3.1  2.3  2.0  1.0  1.5  1.1     mean x̄ = 1.81
    Y:  2.4  0.7  2.9  2.2  3.0  2.7  1.6  1.1  1.6  0.9     mean ȳ = 1.91

Finding the covariance for the attributes:

    C = | Cov(x,x)  Cov(x,y) |
        | Cov(y,x)  Cov(y,y) |

Formula:

    Cov(x,x) = Σ(xᵢ − x̄)(xᵢ − x̄) / (n − 1)

The summation of the (x − x̄)² entries over all 10 rows is 5.549, so

    Cov(x,x) = 5.549 / (10 − 1) = 0.6165.

Similarly Cov(y,y) = 0.7165 and Cov(x,y) = Cov(y,x) = 0.6154, giving

    C = | 0.6165  0.6154 |
        | 0.6154  0.7165 |
Find the eigenvalues and eigenvectors. Solve |C − λI| = 0:

    | 0.6165 − λ    0.6154     |
    | 0.6154        0.7165 − λ |  = 0

As per the rule this expands to

    λ² − 1.3330λ + 0.0630 = 0
    λ₁ = 0.0490,  λ₂ = 1.2840.

So now we have the eigenvalues. For every eigenvalue, an eigenvector has to be found from

    C · v = λ · v   (covariance matrix × eigenvector = eigenvalue × eigenvector).


Eigenvector for λ₁ = 0.0490:

    | 0.6165  0.6154 | |X|          |X|
    | 0.6154  0.7165 | |Y|  = 0.0490|Y|

    0.6165X + 0.6154Y = 0.0490X
    0.6154X + 0.7165Y = 0.0490Y

Simplifying the first equation:

    0.5675X = −0.6154Y
    X = −1.0844Y

The above simplification is done only to find the relationship between X and Y. The relation formed is X = −1.0844Y. Now the X and Y values have to be normalised. Let us compute:
The procedure: take the temporary vector (−1.0844, 1). Both entries of the vector are squared and added:

    (−1.0844)² + 1² = 1.1759 + 1 = 2.1759,  and the square root is 1.4751.

The computed value 1.4751 is used to divide the vector:

    e₁ = (−1.0844/1.4751, 1/1.4751) = (−0.735, 0.678).

Now compute the eigenvector for λ₂ = 1.2840. The same procedure arrives at the equation X = 0.9219Y, so the temporary vector is (0.9219, 1):

    (0.9219)² + 1² = 0.8499 + 1 = 1.8499,  √1.8499 = 1.3601

    e₂ = (0.9219/1.3601, 1/1.3601) = (0.678, 0.735).

Derive the new dataset: the eigenvector with the largest eigenvalue (here e₂, with λ₂ = 1.2840) is the first principal component. For each row,

    PCᵢ = e₂ᵀ · (xᵢ − x̄, yᵢ − ȳ),

i.e., take the value of x minus the mean of x and the value of y minus the mean of y, and multiply this mean-adjusted row by the transpose of the eigenvector to arrive at a single value. Similarly compute the value for every row with the above formula; we arrive at PC₁ … PC₁₀, a new dataset which is 1-dimensional.
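The full computation can be reproduced with NumPy (array names are mine):

    import numpy as np

    # The ten (x, y) samples used above.
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    centred = X - X.mean(axis=0)           # subtract the means (1.81, 1.91)
    C = np.cov(centred, rowvar=False)      # 2x2 covariance matrix, n - 1 denominator
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues ~0.049 and ~1.284

    pc1 = eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue
    scores = centred @ pc1                 # the ten 1-D principal-component values
    print(C, eigvals, scores, sep="\n")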
Planes and Hyperplanes:
Equation of a line in 2-dimensional space:

    ax + by + c = 0,  which can be written as  y = mx + c.

A line will always become a plane in 3D space. The equation of a plane is

    ax + by + cz + d = 0,

which can be written as

    z = (−ax − by − d) / c.

In the above equation, z is the third co-ordinate: the plane is the 3D equivalent of a line, and in higher dimensions the equivalent of the line/plane is called a hyperplane.
What is the requirement of hyperplanes in data science?
The use of high-dimensional data is very frequent.
Ex: the IRIS dataset (5 columns), the MNIST dataset (785 columns).
When we are using classification algorithms on an n-dimensional data set, at that time we need to find the hyperplane equation.

Finding the equation of a hyperplane:

Equation of a line: y = mx + c, i.e., ax + by + c = 0.

Rename x and y as x₁ and x₂ and the coefficients as weights; then it can be written as

    w₁x₁ + w₂x₂ + w₀ = 0.

When we are working in more than 2D, the equation eventually remains the same. So the equation can be written for 3D space as

    w₁x₁ + w₂x₂ + w₃x₃ + w₀ = 0,

and the general equation for n-dimensional space is

    w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ = 0.

This can be converted to vector form: the weights can be stored as the vector w = [w₁, w₂, …, wₙ] and the inputs as x = [x₁, x₂, …, xₙ], so it can be written as

    wᵀx + w₀ = 0,

using the dot product of vectors: a · b = aᵀb = Σᵢ aᵢbᵢ.


As per the above, the equation holds for all dimensions. What is w for a 2D line? Going back to y = mx + c, i.e., ax + by + c = 0, the weight vector is w = [a, b]. If the line is passing through the origin, then w₀ = 0, so for that case the equation reduces to wᵀx = 0.

The hyperplane that best separates the classes is found by the Support Vector Machine algorithm (explained in class):
- What is a Support Vector Machine?
- What is a support vector?
- What is a hyperplane?
- Marginal distance
- Linearly separable
- Non-linearly separable
Support Vector Machine:
It is a supervised ML algorithm used for classification and regression.

Ex: a dataset is given with circles and squares, and the two classes need to be separated. In this scenario it is labelled data: we know both the input and the output.

    Input data → Model (training) → Prediction

As per the above flow, once the model is built with the training data, the machine's task is to predict: when new data is fed to the system, it has to predict whether the new data is a square or a circle.
For better understanding, a graph is drawn with the circle and square data points.

[Graph: circle and square data points, a separating hyperplane (decision boundary), and two parallel margin lines touching the nearest points of each class.]

As per the graph, a line is drawn to separate the circle class and the square class; the line is called the hyperplane, or decision boundary (in general, wᵀx + w₀ = 0 is the hyperplane).

Next step: draw lines parallel to the hyperplane, each touching the nearest data point of one class; the distances of these lines from the hyperplane are called D+ and D−, and their summation

    D+ + D− = Margin.

The parallel lines drawn to the hyperplane touch a square data point and a circle data point; those data points are called the support vectors.

As per the graph, the width (margin) of the hyperplane must always be high, because when new data points keep getting added, the width must hold, and a wide margin helps to give good prediction and accuracy.

Let us look into a numerical on how the hyperplane is drawn.

SVM numerical for linearly separable data. Given the following positively labelled data points

    {(3,1), (3,−1), (6,1), (6,−1)}

and the negatively labelled data points

    {(1,0), (0,1), (0,−1), (−1,0)}.

Step 1: Let us draw a graph (red = −ve, black = +ve).

Step 2: Identify the data points which are closest to the separating boundary, depending on the +ve and −ve data, and label them as support vectors. These closest 3 are

    s₁ = (1,0),  s₂ = (3,1),  s₃ = (3,−1).
Step 3: Now write the hyperplane equation. Each identified support vector has to be augmented with a 1 as a bias input:

    s̃₁ = (1,0,1),  s̃₂ = (3,1,1),  s̃₃ = (3,−1,1).

Step 4: The weights of the vector have to be calculated w.r.t. s₁, s₂, s₃, which means the 3 unknowns α₁, α₂, α₃, as the 3 data points are support vectors. The equations look like:

    α₁ s̃₁·s̃₁ + α₂ s̃₂·s̃₁ + α₃ s̃₃·s̃₁ = −1
    α₁ s̃₁·s̃₂ + α₂ s̃₂·s̃₂ + α₃ s̃₃·s̃₂ = +1
    α₁ s̃₁·s̃₃ + α₂ s̃₂·s̃₃ + α₃ s̃₃·s̃₃ = +1

In the first equation the right-hand side is −1 because s̃₁ represents the negatively labelled data point; in the second and third it is +1 because s̃₂ and s̃₃ are both on the positively labelled side. Substituting the dot products:

    α₁(1+0+1) + α₂(3+0+1) + α₃(3+0+1) = −1   →   2α₁ + 4α₂ + 4α₃ = −1
    α₁(3+0+1) + α₂(9+1+1) + α₃(9−1+1) = +1   →   4α₁ + 11α₂ + 9α₃ = +1
    α₁(3+0+1) + α₂(9−1+1) + α₃(9+1+1) = +1   →   4α₁ + 9α₂ + 11α₃ = +1
Solve the above equations to get the values. Multiply the first equation by 2 and subtract it from the second and the third:

    (4α₁ + 11α₂ + 9α₃) − (4α₁ + 8α₂ + 8α₃) = 1 − (−2)   →   3α₂ + α₃ = 3
    (4α₁ + 9α₂ + 11α₃) − (4α₁ + 8α₂ + 8α₃) = 1 − (−2)   →   α₂ + 3α₃ = 3

Subtracting these two gives 2α₂ − 2α₃ = 0, i.e., α₂ = α₃. Substituting back:

    4α₂ = 3   →   α₂ = α₃ = 0.75
    2α₁ + 4(0.75) + 4(0.75) = −1   →   α₁ = −3.5
Once the equations are solved, we find the weight of the vector:

    w̃ = Σ αᵢ s̃ᵢ = −3.5(1,0,1) + 0.75(3,1,1) + 0.75(3,−1,1)
       = (−3.5 + 2.25 + 2.25,  0 + 0.75 − 0.75,  −3.5 + 0.75 + 0.75)
       = (1, 0, −2)

Looking at the above vector, our weight vector is augmented with the bias: when we equate it with (w₁, w₂, b), the last entry is the hyperplane offset b, and when we write them separately:

Hyperplane equation: y = wᵀx + b, with

    w = (1, 0)  and  b = −2.

From the above we can say the hyperplane is x₁ − 2 = 0, i.e., x₁ = 2. Since w = (1, 0), the line is parallel to the y-axis: it will be the line x₁ = 2, which separates the +ve and −ve data points.
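The 3×3 system and the recovered weights can be verified with NumPy:

    import numpy as np

    # Augmented support vectors from the worked example.
    S = np.array([[1, 0, 1], [3, 1, 1], [3, -1, 1]])

    G = S @ S.T                      # dot products: [[2,4,4],[4,11,9],[4,9,11]]
    t = np.array([-1, 1, 1])         # labels of s1, s2, s3

    alphas = np.linalg.solve(G, t)   # [-3.5, 0.75, 0.75]
    w_aug = alphas @ S               # [1, 0, -2]  ->  w = (1, 0), b = -2
    print(alphas, w_aug)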
Normal to a hyperplane (a plane from 3 points):

Find an equation of the plane passing through Q(1,−1,2), R(4,0,2) and S(−2,−1,5).

Solution ('O' is the origin):

    QR = OR − OQ = (4,0,2) − (1,−1,2) = (3,1,0)
    QS = OS − OQ = (−2,−1,5) − (1,−1,2) = (−3,0,3)

The normal to the plane is

        | i    j   k |
    n = | 3    1   0 | = i(1×3 − 0×0) − j(3×3 − 0×(−3)) + k(3×0 − 1×(−3))
        | −3   0   3 |
      = 3i − 9j + 3k = (3, −9, 3).

To check, take a point P(x, y, z) on the plane; QP = (x−1, y+1, z−2) must be perpendicular to the normal. Substituting the values into n · QP = 0:

    3(x − 1) − 9(y + 1) + 3(z − 2) = 0
    3x − 9y + 3z − 18 = 0
    x − 3y + z = 6
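A quick NumPy check of the cross product and the plane equation:

    import numpy as np

    Q, R, S = np.array([1, -1, 2]), np.array([4, 0, 2]), np.array([-2, -1, 5])
    n = np.cross(R - Q, S - Q)            # QR x QS = (3, -9, 3)
    d = n @ Q                             # plane: n . (x, y, z) = d, here 18
    print(n, d, n @ R == d, n @ S == d)   # all three points satisfy it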
Decision Tree:
It is one of the classification techniques. A decision tree is a graphical representation of all the possible solutions to a decision, where the decisions are based on some conditions.

Decision tree terminologies:
- Root node: it represents the entire population.
- Leaf node: a node that cannot be further segregated into further nodes.
- Splitting: dividing the root node or a sub-node into different parts on the basis of some condition.
- Branch / sub-tree: formed by splitting the tree or a node.
- Pruning: the opposite of splitting; basically removing unwanted branches from the tree.
- Parent/child node: the root node is the parent node, and all other nodes branched from it are known as child nodes.
Worked example: a dataset of 10 patients with the attributes Sore Throat, Fever, Swollen Glands, Congestion and Headache, and the target attribute Diagnosis. Our target attribute is Diagnosis, so let us find the sample space over all the diagnoses:

    Strep Throat + Allergy + Cold = 3 + 3 + 4 = 10.

Plug all of them into the entropy formula (taking the minus sign common outside):

    Entropy = −(0.3 log₂ 0.3 + 0.3 log₂ 0.3 + 0.4 log₂ 0.4)
            = 0.3 × 1.737 × 2 + 0.4 × 1.322
            = 1.04 + 0.53
            ≈ 1.57
Finding the splitting attribute: select the attribute with the highest gain.

1) Sore Throat:

                Strep Throat   Allergy   Cold
    Yes (5)          2            1        2
    No (5)           1            2        2

    E(Yes) = −(2/5 log₂ 2/5 + 1/5 log₂ 1/5 + 2/5 log₂ 2/5) = 1.52
    E(No)  = −(1/5 log₂ 1/5 + 2/5 log₂ 2/5 + 2/5 log₂ 2/5) = 1.52

    Entropy(Sore Throat) = P(Yes) × E(Yes) + P(No) × E(No)
                         = 0.5 × 1.52 + 0.5 × 1.52 = 1.52

    Gain(Sore Throat) = 1.57 − 1.52 = 0.05
Gain of each attribute:

    Attribute        Gain
    Sore Throat      0.05
    Fever            0.72
    Swollen Glands   0.88
    Congestion       0.45
    Headache         0.05

The attribute which has the highest gain value will be chosen as the splitting attribute. Constructing the decision tree: split first on Swollen Glands, which has the highest gain (0.88). For Swollen Glands = Yes the diagnosis is decided directly; for Swollen Glands = No, the attribute with the next-highest gain, Fever (0.72), will be chosen:

    Swollen Glands?
      Yes → Diagnosis = Strep Throat
      No  → Fever?
              Yes → Diagnosis = Cold
              No  → Diagnosis = Allergy

The leaves cannot be divided further, so this is the decision tree for the data.
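The entropy and gain arithmetic in Python (the counts are the ones tabulated above):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    # Whole sample: 3 strep throat, 3 allergy, 4 cold  ->  ~1.57 bits.
    e_total = entropy([3, 3, 4])

    # Sore Throat split: Yes -> (2, 1, 2) and No -> (1, 2, 2), 5 rows each.
    e_split = 0.5 * entropy([2, 1, 2]) + 0.5 * entropy([1, 2, 2])

    print(round(e_total, 2), round(e_total - e_split, 2))   # 1.57, gain ~0.05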
Time-Series Analysis:
In this analysis you just have one variable, i.e., TIME. The time-series data is analysed in order to extract meaningful statistics and other characteristics.

Ex: any dataset that we are evaluating in accordance with time: next year's sales, next month's sales, or stock market prediction w.r.t. time.

What is a time series?
A time series is a set of observations taken at specified times, usually at equal intervals. It should be equally distributed, i.e., it is a graph plotted with time on the x-axis. If the graph is plotted for weeks, the whole x-axis should be weeks; one cannot mix weeks with months or days.


Importance of time series:
1) Business forecasting: prediction of the stock market for the next day, or prediction of the sales of a company or of a product for the next year.
2) Understand past behaviour: analysing at what time the sales of a product were up and at what time they were down. Obviously there are reasons attached to the ups and downs of the sales of the business.
3) Plan the future: the future decisions of a company can be decided using time-series analysis.
4) Evaluate current accomplishment: whether the goal set for the current year is achieved or not, and if it is not achieved, what the reason is; these can be evaluated with time-series analysis.


Components of time-series analysis:
1) Trend: the movement relatively higher or lower over a long period of time. When the time-series analysis shows a general pattern that goes up w.r.t. time, it is called an up trend; if it goes down, it is called a down trend; if there is no trend, it can be called a horizontal (or stationary) trend.
2) Seasonality: same as a trend, but here a pattern repeats within a fixed time period.
   Ex: the sale of cakes and chocolates during Christmas.
3) Irregularity: it happens for a short time and is non-repeating.
   Ex: a natural disaster; during this time a few items in need will be sold much more. One cannot rely on such data.
4) Cyclic: a repeating up-and-down movement with no fixed pattern; it is much harder to predict.

When not to use time-series analysis:
1) When the values are constant. If the sale of a coffee is 100 in December and 100 again in January, time series cannot be applied to find the sales of the next month.
2) When we have the values in the form of functions, i.e., sin(x) and cos(x), so the value at any x is available. There is no point in applying time series when there is already a formula and the values are available.
Stationarity in the data:
The behaviour of the data does not change over time. When the data has a specific pattern, there is a very high probability that it will follow the same pattern in the future.

The criteria for stationary data:
1) The mean should be constant.
2) The variance should be constant.

ARIMA MODEL:
It is one of the best models to work with time-series data. It is basically a combination of the AR and MA models:

    AR = Auto Regressive
    MA = Moving Average

Both are separate models; what binds them together is the integration part:

    I = Integration.

Explanation: when analysing the data, if there are n time observations T₁, T₂, T₃, …, Tₙ and a relationship exists between, say, T₁₅ and T₉, then it is called auto-regression. If there is noise (or irregularity) in the data, the average could be taken and used for the analysis purpose; that is the moving average.

ARIMA = AR (Auto Regressive) + I (Integration, the order of differencing) + MA (Moving Average).

The model is known by the variables (p, d, q):
- To predict the value of p, the partial autocorrelation graph is used.
- To predict the value of q, the autocorrelation plot is used.
- d is the order of differencing used to make the data stationary.
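A minimal sketch with statsmodels, assuming `sales` is a pandas Series of equally spaced observations and that (p, d, q) have been read off the PACF, differencing and ACF plots:

    from statsmodels.tsa.arima.model import ARIMA

    model = ARIMA(sales, order=(1, 1, 1))   # order = (p, d, q)
    fitted = model.fit()
    print(fitted.forecast(steps=12))        # forecast the next 12 periods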
Ensemble Learning
Suppose you want to develop a machine learning model that predicts inventory stock orders for your company based on historical data you have gathered from previous years. You train four machine learning models using different algorithms: linear regression, support vector machine, a regression decision tree, and a basic artificial neural network. But even after much tweaking and configuration, none of them achieves your desired 95 percent prediction accuracy. These machine learning models are called "weak learners" because they fail to converge to the desired level.

But weak doesn’t mean useless. You can combine them into an ensemble. For each new
prediction, you run your input data through all four models, and then compute the average of
the results. When examining the new result, you see that the aggregate results provide 96
percent accuracy, which is more than acceptable.

The reason ensemble learning is efficient is that your machine learning models work differently. Each model might perform well on some data and less accurately on others. When you combine all of them, they cancel out each other's weaknesses.

You can apply ensemble methods to both prediction problems, like the inventory prediction example we just saw, and classification problems, such as determining whether a picture contains a certain object.
Ensemble methods

For a machine learning ensemble, you must make sure your models are independent of each
other (or as independent of each other as possible). One way to do this is to create your
ensemble from different algorithms, as in the above example.

Another ensemble method is to use instances of the same machine learning algorithms and train
them on different data sets. For instance, you can create an ensemble composed of 12 linear
regression models, each trained on a subset of your training data.

There are two key methods for sampling data from your training set. “Bootstrap aggregation,”
aka “bagging,” takes random samples from the training set “with replacement.” The other
method, “pasting,” draws samples “without replacement.”

To understand the difference between the sampling methods, here’s an example. Say you have
a training set with 10,000 samples and you want to train each machine learning model in your
ensemble with 9,000 samples. In case you’re using bagging, for each of your machine learning
models, you take the following steps:

1. Draw a random sample from the training set.


2. Add a copy of the sample to the model’s training set
3. Return the sample to the original training set
4. Repeat the process 8,999 times

When using pasting, you go through the same process, with the difference that samples are not
returned to the training set after being drawn. Consequently, the same sample might appear in
a model's training set several times when using bagging but only once when using pasting.
After training all your machine learning models, you'll have to choose an aggregation method.
If you're tackling a classification problem, the usual aggregation method is the "statistical
mode," i.e., the class that is predicted more often than the others. In regression problems,
ensembles usually use the average of the predictions made by the models.
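A sketch of such an ensemble with scikit-learn (X_train, y_train and X_new stand in for your own data; the base model's keyword name varies between scikit-learn versions, so it is passed positionally):

    from sklearn.ensemble import BaggingRegressor
    from sklearn.linear_model import LinearRegression

    # 12 linear-regression models, each trained on 9,000 sampled rows;
    # bootstrap=True samples with replacement (bagging),
    # bootstrap=False samples without replacement (pasting).
    ensemble = BaggingRegressor(LinearRegression(), n_estimators=12,
                                max_samples=9000, bootstrap=True)
    # ensemble.fit(X_train, y_train)
    # ensemble.predict(X_new)   # averages the 12 models' outputs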

Boosting methods

Another popular ensemble technique is “boosting.” In contrast to classic ensemble methods,


where machine learning models are trained in parallel, boosting methods train them
sequentially, with each new model building up on the previous one and solving its
inefficiencies.

AdaBoost (short for “adaptive boosting”), one of the more popular boosting methods, improves
the accuracy of ensemble models by adapting new models to the mistakes of previous ones.
After training your first machine learning model, you single out the training examples
misclassified or wrongly predicted by the model. When training the next model, you put more
emphasis on these examples. This results in a machine learning model that performs better
where the previous one failed. The process repeats itself for as many models as you want to add
to the ensemble. The final ensemble contains several machine learning models of different
accuracies, which together can provide better accuracy. In boosted ensembles, the output of
each model is given a weight that is proportionate to its accuracy.
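
In scikit-learn, a boosted ensemble of this kind can be sketched as follows (placeholder data names again):

    from sklearn.ensemble import AdaBoostClassifier

    # 50 sequentially trained weak learners; each round re-weights the
    # training examples the previous round misclassified, and the final
    # prediction is a weighted vote of all the models.
    booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
    # booster.fit(X_train, y_train); booster.predict(X_new)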

Random forests

One area where ensemble learning is very popular is decision trees, a machine learning
algorithm that is very useful because of its flexibility and interpretability. Decision trees can
make predictions on complex problems, and they can also trace back their outputs to a series
of very clear steps.
The problem with decision trees is that they don’t create smooth boundaries between different
classes unless you break them down into too many branches, in which case they become prone
to “overfitting,” a problem that occurs when a machine learning model performs very well on
training data but poorly on novel examples from the real world.

This is a problem that can be solved through ensemble learning. Random forests are machine
learning ensembles composed of multiple decision trees (hence the name “forest”). Using
random forests ensures that a machine learning model does not get caught up in the specific
confines of a single decision tree.

Random forests have their own independent implementation in Python machine learning
libraries such as scikit-learn.
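
As a rough sketch of that scikit-learn implementation (data names are placeholders):

    from sklearn.ensemble import RandomForestClassifier

    # 500 trees; each is trained on a bootstrap sample of the rows and
    # considers a random subset of the features at each split, so no single
    # tree's quirks dominate. Prediction is a majority vote across trees.
    forest = RandomForestClassifier(n_estimators=500)
    # forest.fit(X_train, y_train); forest.predict(X_new)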

Challenges of ensemble learning

While ensemble learning is a very powerful tool, it also has some tradeoffs.

Using ensembles means you must spend more time and resources on training your machine
learning models. For instance, a random forest with 500 trees provides much better results than
a single decision tree, but it also takes much more time to train. Running ensemble models can
also become problematic if the algorithms you use require a lot of memory.

Another problem with ensemble learning is explainability. While adding new models to an
ensemble can improve its overall accuracy, it makes it harder to investigate the decisions made
by the AI algorithm. A single machine learning model such as a decision tree is easy to trace,
but when you have hundreds of models contributing to an output, it is much more difficult to
make sense of the logic behind each decision.
